Google Gemini: DeepMind's Multimodal Answer to GPT-4
Google launches Gemini, its most capable AI model, built from the ground up for multimodal reasoning
Two days ago, Google launched Gemini, and the AI landscape just shifted again. Gemini is Google DeepMind's new family of multimodal models, and it represents the most significant challenge to GPT-4's position as the leading frontier model. The benchmarks are impressive, the demo is striking, and the strategic implications for the AI race are substantial.
What Gemini Is
Gemini is a family of models available in three sizes:
Gemini Ultra: The largest and most capable version, designed to compete directly with GPT-4 and exceed it on key benchmarks. Google claims Gemini Ultra is the first model to outperform human experts on the MMLU (Massive Multitask Language Understanding) benchmark, scoring 90.0% against GPT-4's reported 86.4%.
Gemini Pro: A mid-range model optimized for broad task performance. This is the version currently available through the Bard chatbot and the Gemini API. Google positions it as competitive with GPT-3.5 Turbo, offering strong capability at lower cost.
Gemini Nano: A smaller model designed to run on mobile devices. This is being deployed in the Pixel 8 Pro for on-device AI features like text summarization and smart reply.
The key technical differentiator is that Gemini was built from the ground up as a multimodal model. While GPT-4 gained image understanding as an addition to a primarily text-based model, Gemini was designed to natively process and reason across text, images, audio, video, and code. Google claims this native multimodality results in more seamless and capable cross-modal reasoning.
The Benchmark Story
Google is making aggressive claims about Gemini Ultra's benchmark performance. On MMLU, it scores 90.0%, the first model to reach 90% and the first to exceed estimated human expert performance (roughly 89.8%). On other benchmarks, including HumanEval for code generation, MATH for mathematical reasoning, and various multimodal understanding tests, Gemini Ultra matches or exceeds GPT-4.
These numbers are impressive, but they come with important caveats. The MMLU score used a chain-of-thought prompting approach with 32 samples per question, while GPT-4's reported score used 5-shot prompting. Different evaluation methodologies can produce different results, and the AI community will need time to conduct independent evaluations before the comparative claims can be fully validated.
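The methodological gap can be made concrete with a toy sketch: a single greedy answer (as in few-shot evaluation) versus majority-voting over many sampled chain-of-thought answers (as in CoT@32). The `toy_model` below is a deliberately simplistic stand-in, not either company's evaluation harness, and the 60% per-sample accuracy is an arbitrary illustration.

```python
import random
from collections import Counter

def answer_few_shot(model, question):
    """One greedy answer, roughly how a 5-shot score is produced."""
    return model(question, temperature=0.0)

def answer_cot_32(model, question, samples=32):
    """Sample 32 chain-of-thought answers, then majority-vote the final choice."""
    answers = [model(question, temperature=0.7) for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in: right ("A") 60% of the time per sample on this question,
# but the greedy decode happens to land on the wrong answer ("B").
def toy_model(question, temperature):
    if temperature == 0.0:
        return "B"
    return "A" if random.random() < 0.6 else "B"

random.seed(0)
print(answer_few_shot(toy_model, "q"))  # single greedy pick
print(answer_cot_32(toy_model, "q"))    # majority vote usually recovers "A"
```

The point is not that either method is wrong, only that sampling-and-voting can lift a score in ways a single-answer protocol does not, so the two numbers are not directly comparable.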
Additionally, benchmarks are imperfect proxies for real-world capability. A model that scores higher on MMLU is not necessarily better at every practical task. User experience, instruction following, creativity, safety, and reliability all matter in ways that standardized benchmarks do not fully capture.
That said, the benchmark results signal that Google has closed the capability gap with OpenAI. Even if the precise rankings are debatable, the fact that Gemini is competitive with GPT-4 on major benchmarks means that the era of a single dominant frontier model is over.
The Demo and Its Controversy
Google released a video demo showing Gemini interacting with a user through a live camera feed, responding to visual inputs in real time with voice commentary. The demo was visually striking: Gemini appeared to identify objects, track movements, understand visual context, and respond conversationally in real time.
However, subsequent reporting revealed that the demo was not a live interaction. The video was produced using still images rather than real-time video, and the responses were generated from text prompts rather than the multimodal interaction implied by the demo. Google acknowledged that the video was "illustrative" rather than literal.
This matters for two reasons. First, it undermines trust at a moment when Google is trying to establish Gemini's credibility. After the Bard demo error in February that cost Alphabet roughly $100 billion in market value, another misleading demo reinforces concerns about Google's communication around AI products. Second, it highlights the gap between what these models can do in controlled conditions and what they can do in real-time, real-world applications.
The Technical Architecture
While Google has not published a full technical report comparable to what OpenAI released for GPT-4, some details about Gemini's architecture are known.
Gemini was trained on Google's TPU v5e and TPU v4 pods, leveraging Google's custom silicon advantage. The training infrastructure is among the largest in the world, and Google's ability to provision compute at this scale without depending on NVIDIA GPUs is a significant structural advantage.
The native multimodal architecture means that vision, language, and other modalities are processed within a unified model rather than through separate components that are stitched together. This should, in theory, result in better cross-modal understanding and more coherent multimodal reasoning.
Gemini also supports a 32K token context window, which is competitive with GPT-4's extended context offerings though smaller than the 128K window offered by GPT-4 Turbo.
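As a rough illustration of what a 32K window means in practice, here is a back-of-the-envelope budget check. The ~4-characters-per-token heuristic is a crude assumption for English text; real budgeting should use the provider's tokenizer.

```python
def rough_token_count(text: str) -> int:
    # Crude heuristic: roughly 4 characters per token for English prose.
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_output_tokens: int, window: int = 32_000) -> bool:
    """Check that the prompt plus reserved output space fits the window."""
    return rough_token_count(prompt) + max_output_tokens <= window

doc = "word " * 20_000  # ~100K characters, ~25K estimated tokens
print(fits_context(doc, max_output_tokens=4_000))                 # True for a 32K window
print(fits_context(doc, max_output_tokens=4_000, window=8_000))   # False for an 8K window
```

A document that fits comfortably in a 128K window can blow past 32K, which is why the context-window gap matters for long-document workloads even when headline benchmark scores are close.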
Implications for the AI Race
Gemini's launch changes the competitive dynamics in several important ways:
Google is back in the game. After the Bard stumble in February and months of perceived lagging behind OpenAI, Gemini establishes Google as a credible competitor in the frontier model race. Google's research depth, compute resources, and distribution through products used by billions of people make it a formidable player.
The multimodal race intensifies. Both GPT-4 and Gemini now offer multimodal capabilities, but neither has fully delivered on the promise of seamless multimodal interaction. The question of which model handles real-world multimodal tasks better will play out over the coming months as developers and users test both.
Enterprise options expand. For organizations evaluating AI platforms, Gemini Pro through the Gemini API offers another strong option alongside OpenAI and Anthropic. Competition among providers benefits enterprise customers through better pricing, more features, and reduced vendor lock-in.
The model commoditization trend continues. With multiple providers offering frontier-class models, the value is increasingly shifting from the model itself to the applications built on top of it, the data that feeds into it, and the tools that make it useful in specific contexts.
What This Means for My Work
I have been building with OpenAI and Anthropic models, and Gemini's launch adds another provider to evaluate. From a practical standpoint, several aspects interest me:
The Gemini API is available through Google Cloud, which means organizations already using GCP have a natural integration path. For organizations running multi-cloud environments, as many do, having strong AI capabilities on both AWS (through Bedrock) and GCP (through Gemini) provides useful flexibility.
The native multimodal capabilities are relevant to several use cases I have been exploring. Image understanding for content analysis, document processing that involves visual elements, and applications that need to reason about mixed media all benefit from strong multimodal performance.
The pricing for Gemini Pro is competitive, and if the performance is strong enough for production use cases, it provides leverage in conversations about AI spend and vendor strategy.
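A simple way to exercise that leverage is to model monthly spend per provider under your own traffic assumptions. The prices below are placeholders for illustration only, not actual rates for Gemini Pro or any other model; check each provider's current pricing page.

```python
def monthly_cost(requests: int, in_tokens: int, out_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Estimated monthly spend; prices are USD per 1K tokens."""
    per_request = (in_tokens / 1000) * price_in + (out_tokens / 1000) * price_out
    return requests * per_request

# Placeholder rates -- illustrative only.
providers = {
    "provider_a": (0.0010, 0.0020),
    "provider_b": (0.0005, 0.0015),
}
for name, (p_in, p_out) in providers.items():
    cost = monthly_cost(1_000_000, in_tokens=800, out_tokens=200,
                        price_in=p_in, price_out=p_out)
    print(f"{name}: ${cost:,.0f}/month")
```

Even at these toy rates, a million requests a month separates the two providers by hundreds of dollars, which is the kind of number that moves a vendor conversation.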
The Larger Picture
Gemini's launch caps a year in which the AI landscape has been transformed. We started 2023 with ChatGPT as a consumer phenomenon and a handful of companies competing in the model space. We end it with multiple frontier-class models from multiple providers, a maturing framework ecosystem, emerging enterprise deployment patterns, and a growing understanding of both the capabilities and limitations of large language models.
Google bringing its full resources to bear on the frontier model competition is good for the industry. Competition drives innovation, improves pricing, and prevents any single company from controlling the direction of AI development. The next phase of the AI race will be fought not just on benchmark scores but on reliability, safety, developer experience, and enterprise readiness.
I will be testing Gemini Pro in the coming weeks alongside the models I already use. The best model for a given task depends on the specific requirements, and having another strong contender in the mix gives us more options and more leverage.
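One way I plan to structure that evaluation is a thin routing layer that picks a model by task requirements rather than hardcoding a single provider. The model names and thresholds here are illustrative assumptions, not recommendations.

```python
# Hypothetical routing table -- model names and cutoffs are illustrative.
ROUTES = {
    "multimodal": "gemini-pro-vision",
    "long_context": "gpt-4-turbo",
    "default": "gemini-pro",
}

def pick_model(needs_vision: bool, prompt_tokens: int) -> str:
    """Route a request to a model based on its requirements."""
    if needs_vision:
        return ROUTES["multimodal"]
    if prompt_tokens > 30_000:  # beyond a 32K window's practical budget
        return ROUTES["long_context"]
    return ROUTES["default"]

print(pick_model(needs_vision=True, prompt_tokens=500))      # gemini-pro-vision
print(pick_model(needs_vision=False, prompt_tokens=50_000))  # gpt-4-turbo
print(pick_model(needs_vision=False, prompt_tokens=500))     # gemini-pro
```

Keeping the routing table in one place makes it cheap to swap providers as pricing and performance shift, which is exactly the leverage that a multi-provider market creates.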
The race continues, and it is accelerating.