6 min read

GPT-4 Is Multimodal and It Changes Everything

GPT-4 launches with multimodal capabilities, passing professional exams and setting a new benchmark for AI

Two days ago, OpenAI released GPT-4, and the leap from GPT-3.5 to GPT-4 is not incremental. It is a generational jump. The model is multimodal, meaning it can process both text and images as input. It passes a simulated bar exam at around the 90th percentile. It scores in the top ranks on the SAT, GRE, and a battery of AP exams. The benchmark results alone would be remarkable, but having spent time with the model, the qualitative improvement in reasoning, nuance, and instruction-following is what truly stands out.

What GPT-4 Can Do

The headline feature is multimodal input. You can feed GPT-4 an image along with text and it can reason about the contents. Show it a photo of your refrigerator and ask what meals you could make. Give it a hand-drawn sketch of a website layout and it will generate the HTML and CSS. Hand it a chart from a scientific paper and it will interpret the data and explain the findings.
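To make the shape of this concrete, here is a sketch of what a multimodal request might look like. Image input was not generally available through the public API at launch, so the schema below is an assumption modeled on the chat-completions message format; the function name, image URL, and payload structure are illustrative, not a documented interface.

```python
# Hypothetical sketch: pairing a text question with an image in one request.
# The exact wire format is an assumption; no network call is made here.
import json

def build_vision_request(question: str, image_url: str, model: str = "gpt-4") -> dict:
    """Assemble a chat-style payload combining text and an image reference."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 300,
    }

payload = build_vision_request(
    "What meals could I make with the ingredients in this photo?",
    "https://example.com/fridge.jpg",  # placeholder URL
)
print(json.dumps(payload, indent=2))
```

The point of the sketch is the pairing itself: the image is just another content part alongside the text, so the model reasons over both in a single turn.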

This is not image recognition in the traditional computer vision sense. The model is not just labeling objects. It is understanding the context, the relationships, and the intent behind visual information. The implications for accessibility, education, and productivity tools are enormous.

Beyond multimodality, GPT-4 shows dramatic improvements in:

  • Reasoning: The model can handle multi-step logical problems that GPT-3.5 would fumble. Chain-of-thought reasoning is more robust and coherent.
  • Instruction following: It is significantly better at following complex, multi-part instructions and maintaining consistency throughout a long interaction.
  • Knowledge: While still subject to a training data cutoff, the breadth and accuracy of its knowledge across domains is notably improved.
  • Code generation: Not just writing code from descriptions, but understanding existing codebases, debugging issues, and explaining complex logic.

The Benchmark Story

The exam results are worth pausing on, not because passing a bar exam makes GPT-4 a lawyer, but because of what the scores reveal about the model's generalization capabilities.

GPT-3.5 scored around the 10th percentile on the bar exam. GPT-4 scores around the 90th percentile. That is not a modest improvement; it is a jump from clearly failing to clearly passing. The same pattern repeats across dozens of standardized tests. Biology, history, mathematics, law, medicine: GPT-4 demonstrates competence across an extraordinary range of domains.

This matters because these exams test the kind of reasoning, knowledge synthesis, and careful analysis that we have traditionally considered distinctly human capabilities. A model that can pass the bar exam does not necessarily understand law the way a lawyer does, but it can produce outputs that are functionally indistinguishable from a competent human response across many tasks.

The System Card and Safety

OpenAI released a detailed system card alongside GPT-4 that documents known limitations and safety measures. This is commendable transparency and worth reading in full. The document acknowledges that GPT-4 can still hallucinate, can still be manipulated through adversarial prompting, and can still produce biased or harmful output.

What is new is the investment in safety. OpenAI reports that they used reinforcement learning from human feedback (RLHF) more extensively with GPT-4, and that they engaged in extensive red-teaming before release. They also introduced a system message feature that allows developers to set behavioral guidelines for the model, providing more control over how it responds in different contexts.
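The system message feature is simple to picture: the first message in the conversation carries role "system" and sets guidelines the model applies to everything that follows. A minimal sketch, with a made-up deployment persona and the network call omitted so it stays self-contained:

```python
# Minimal sketch of the system-message feature. The "Acme Corp" persona is
# hypothetical; the point is that the system message constrains behavior
# for the whole conversation.
messages = [
    {
        "role": "system",
        "content": (
            "You are a support assistant for Acme Corp. "
            "Answer only questions about Acme products, and say "
            "'I don't know' rather than guessing."
        ),
    },
    {"role": "user", "content": "How do I reset my Acme router?"},
]

# With the API, this list would be sent as the `messages` argument of a
# chat completion request against the gpt-4 model (call omitted here).
print(messages[0]["role"])
```

Developers get a steering knob without retraining: change the system message and the same model behaves like a different product.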

The safety story is going to become increasingly important as these models get deployed in high-stakes environments. Healthcare, legal, financial, and educational applications all require a level of reliability that current models cannot consistently guarantee. But GPT-4 represents meaningful progress on this front.

The API and Developer Ecosystem

For developers, GPT-4 is available through an API with a waitlist. The pricing is significantly higher than GPT-3.5, reflecting the increased compute costs of the larger model. This creates an interesting tension. GPT-4 is clearly more capable, but GPT-3.5 is much cheaper and fast enough for many applications. The right model choice depends on the specific use case, the required quality, and the cost sensitivity.
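The tension is easy to quantify with a back-of-the-envelope calculation. Using the per-1K-token prices listed at launch (GPT-4 8K context: $0.03 prompt / $0.06 completion; gpt-3.5-turbo: $0.002 for both); prices change, so treat the numbers as a snapshot, not a reference:

```python
# Rough per-request cost comparison at launch-time prices.
# USD per 1,000 tokens: (prompt, completion)
PRICES = {
    "gpt-4": (0.03, 0.06),
    "gpt-3.5-turbo": (0.002, 0.002),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for a single request at the given token counts."""
    p, c = PRICES[model]
    return prompt_tokens / 1000 * p + completion_tokens / 1000 * c

# Example: a 1,500-token prompt with a 500-token answer.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 1500, 500):.4f}")
```

At these prices the same request costs roughly 19x more on GPT-4, which is why "use GPT-4 only where GPT-3.5 actually fails" is likely to become a standard cost-optimization pattern.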

The developer ecosystem around OpenAI's models continues to grow rapidly. LangChain, LlamaIndex, and other frameworks are making it easier to build applications that chain model calls together, integrate with external data sources, and create agents that can take actions. GPT-4's improved instruction following makes these frameworks significantly more reliable.
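The chaining pattern these frameworks automate reduces to plain Python: each step feeds the previous step's output into the next prompt. Here `call_llm` is a stub standing in for a real model call, so the sketch runs without an API key:

```python
# The chain pattern, stripped of framework machinery. `call_llm` is a stub;
# in a real application each step would send its prompt to the model.

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would return the model's completion.
    return f"<answer to: {prompt}>"

def chain(question: str) -> str:
    # Step 1: a first model call extracts search terms from the question.
    terms = call_llm(f"Extract key search terms from: {question}")
    # Step 2: a real agent would query an external data source here.
    # Step 3: a second model call synthesizes the final answer.
    return call_llm(f"Using these terms: {terms}, answer: {question}")

result = chain("What changed between GPT-3.5 and GPT-4?")
print(result)
```

The reliability point in the paragraph above is exactly about step 1 and step 3: if the model drifts from the instruction at any link, the whole chain degrades, which is why better instruction following compounds across multi-step pipelines.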

What This Means for Enterprise

At my company, we have been exploring GPT-3.5 for various internal use cases. GPT-4 changes the calculus significantly. Tasks that were on the edge of feasibility with GPT-3.5, where the model would get it right 70% of the time but fail in frustrating ways the other 30%, become viable with GPT-4's improved reliability.

Some specific areas where I see immediate enterprise potential:

  • Document analysis and summarization: GPT-4 can process and synthesize long documents with much better accuracy and coherence.
  • Code review and generation: The improvements in code understanding make it a genuinely useful development tool, not just a novelty.
  • Customer-facing applications: The reduced hallucination rate and better instruction following make it more suitable for applications where incorrect responses have real consequences.
  • Knowledge management: Combining GPT-4 with retrieval-augmented generation (RAG) patterns for internal knowledge bases becomes much more practical.
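The RAG pattern mentioned in the last bullet can be sketched in a few lines. Retrieval here is naive word overlap purely for illustration (a real system would use embeddings and a vector store), and the knowledge-base entries are made up:

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant documents, then build a grounded prompt for the model.

def score(query: str, doc: str) -> int:
    """Toy relevance score: shared lowercase words between query and doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    top = sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

kb = [  # stand-in internal knowledge base
    "Expense reports are due on the 5th of each month.",
    "The VPN requires two-factor authentication.",
    "Holiday schedules are published every December.",
]
prompt = build_rag_prompt("When are expense reports due?", kb)
print(prompt)
```

Grounding the prompt in retrieved documents is what makes this practical for enterprise: the model answers from the company's own sources rather than from its training data, which both reduces hallucination and keeps answers current.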

The Scaling Question

GPT-4 raises a fundamental question about where this technology is headed. The improvements from GPT-3.5 to GPT-4 were achieved largely through scaling: more parameters, more training data, more compute. If the scaling laws continue to hold, GPT-5 could be dramatically more capable still.
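For context on what "scaling laws" means here: the empirical finding (from work such as the Chinchilla analysis, not anything OpenAI has disclosed about GPT-4 itself) is that loss falls as a smooth power law in parameters and data, roughly:

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

where \(N\) is parameter count, \(D\) is training tokens, \(E\) is an irreducible loss floor, and \(A, B, \alpha, \beta\) are fitted constants. The bet behind "GPT-5 could be dramatically more capable" is simply that this curve keeps holding as \(N\) and \(D\) grow.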

But scaling has limits. The compute costs for training frontier models are already measured in hundreds of millions of dollars. The GPU clusters required are among the largest in the world. There are open questions about whether we will run out of high-quality training data, whether the returns to scale will diminish, and whether the economic model of spending more and more on training can sustain itself.

Alternatively, the next wave of improvements might come not from raw scaling but from architectural innovations, better training techniques, and smarter data curation. This is an active area of research across all the major labs.

My Takeaway

GPT-4 has solidified my conviction that large language models are not a hype cycle. They are a genuine platform shift. The trajectory from GPT-3 to GPT-3.5 to GPT-4 shows consistent, dramatic improvement in capability. The applications that are becoming possible are not theoretical; they are practical, near-term, and relevant to every industry.

I have been spending more and more of my time understanding this technology deeply. Not just how to use the API, but how the models work, how they are trained, what their limitations are, and how to build reliable systems around them. This is not a curiosity. It is rapidly becoming a core professional competency.

The pace of progress is breathtaking. Under four months between ChatGPT and GPT-4. Each iteration dramatically more capable than the last. Whatever comes next, the foundation for a new era of computing is being laid right now.
