Claude 3 Opus: The Best LLM I Have Ever Used
Anthropic's Claude 3 Opus is not just an incremental improvement; it changes what is possible with AI-assisted engineering
Anthropic released the Claude 3 model family two days ago, and I need to talk about Opus. I have been using it intensively since launch, and it is the best large language model I have ever worked with. Not by a small margin. By a significant one.
I say this as someone who has used GPT-4 daily for over a year, who has built production systems on top of multiple LLM providers, and who is deeply invested in the multi-model future. Claude 3 Opus is genuinely special.
The Claude 3 Family
Anthropic released three models: Haiku (fast and cheap), Sonnet (balanced), and Opus (maximum capability). This tiered approach mirrors OpenAI's split between GPT-3.5 and GPT-4, but with three deliberate price-performance points instead of two.
Haiku is remarkably capable for its speed and cost. Sonnet hits a sweet spot for production workloads where you need quality but also need to manage costs. And Opus is the model you reach for when the task is hard and getting it right matters more than speed or cost.
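The tier trade-off is easiest to see in dollars. Here is a back-of-the-envelope cost helper; the per-million-token prices are the launch figures as I understand them, so treat them as assumptions to verify against Anthropic's pricing page:

```python
# Rough cost comparison across the Claude 3 tiers.
# Prices are (input $/Mtok, output $/Mtok) at launch -- an assumption
# to check against Anthropic's current pricing page.
PRICES = {
    "claude-3-haiku-20240307":  (0.25, 1.25),
    "claude-3-sonnet-20240229": (3.00, 15.00),
    "claude-3-opus-20240229":   (15.00, 75.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the launch prices above."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out
```

At these rates, the same million-input-token workload costs roughly 60x more on Opus than on Haiku, which is why reserving Opus for the hard, high-value tasks matters.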
But let me focus on what makes Opus exceptional.
Reasoning That Actually Works
The single biggest improvement in Opus over previous models, including GPT-4, is reasoning depth. When I give Opus a complex engineering problem, it does not just pattern-match to a plausible answer. It works through the problem systematically.
I tested this extensively with architecture design questions, debugging scenarios, and code review tasks. In each case, Opus demonstrated an ability to hold multiple constraints in mind simultaneously, consider trade-offs explicitly, and arrive at recommendations that reflected genuine understanding of the problem space.
Here is a concrete example: I gave Opus a microservices architecture with a subtle data consistency issue that involved eventual consistency semantics across three services. The correct diagnosis required understanding distributed systems theory, the specific guarantees of the message broker involved, and the business logic implications of out-of-order event processing. Opus identified the issue, explained why it was happening, proposed two different solutions with their respective trade-offs, and recommended which one to use based on the constraints I had described. GPT-4 got the diagnosis right but missed one of the solutions and did not adequately address the trade-offs.
This is not a cherry-picked example. It is representative of what I am seeing across dozens of interactions.
Code Quality
The code that Opus generates is noticeably better than what I get from other models. And "better" here means several specific things:
More idiomatic. Opus writes code that looks like it was written by an experienced developer in that language. Python code uses Python conventions. Go code uses Go conventions. The code does not just work; it reads well.
Better error handling. This is a persistent weakness in LLM-generated code: the happy path works, but error cases are handled poorly or not at all. Opus is significantly better about considering failure modes and handling them appropriately.
Cleaner abstractions. When generating larger code structures, Opus makes better architectural decisions. Functions are the right size. Responsibilities are well-separated. The code is structured in a way that would pass a reasonable code review.
Accurate context use. When I provide existing code as context, Opus does a better job of matching the style, conventions, and patterns already present. It does not fight the codebase; it works with it.
The 200K Context Window
Claude 3 Opus supports a 200,000 token context window. To put that in perspective, that is roughly 150,000 words of English text, about the length of a 500-page book. You can feed it an entire codebase and ask questions about it.
I have been testing this with real projects, loading entire repositories into the context and asking Opus to analyze cross-cutting concerns, identify inconsistencies, and suggest improvements. The results are impressive. The model maintains coherence across the full context window in a way that previous models with large contexts did not quite achieve.
This matters enormously for the agent work I am doing. An agent that can hold an entire project in context, reason about it holistically, and make changes that consider the full picture is fundamentally more capable than one that operates on file-at-a-time snippets.
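The "load a repository into context" step above can be sketched simply: walk the tree, concatenate source files under path headers, and stop when a token budget is exhausted. The four-characters-per-token ratio below is a rough heuristic, not Anthropic's tokenizer, so use a real token counter for anything precise:

```python
# Minimal sketch: pack a repository into one long-context prompt.
# CHARS_PER_TOKEN is a crude estimate, not a real tokenizer.
from pathlib import Path

CHARS_PER_TOKEN = 4

def pack_repo(root: str, budget_tokens: int = 200_000,
              exts=(".py", ".go", ".md")) -> str:
    """Concatenate source files under `root`, each prefixed with its
    path, until the estimated token budget would be exceeded."""
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in exts:
            continue
        text = path.read_text(errors="ignore")
        cost = len(text) // CHARS_PER_TOKEN
        if used + cost > budget_tokens:
            break  # budget exhausted; stop packing
        parts.append(f"=== {path} ===\n{text}")
        used += cost
    return "\n\n".join(parts)
```

In practice you would also filter out vendored dependencies and binary assets, and prioritize the files most relevant to the question rather than packing alphabetically.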
Vision Capabilities
Claude 3 is multimodal, and the vision capabilities are strong. I have tested it with architecture diagrams, screenshots of error messages, whiteboard photos, and system monitoring dashboards.
The model can read text in images accurately, interpret diagrams, and reason about what it sees. For engineering workflows, this opens up interesting possibilities: an agent that can look at a monitoring dashboard, understand the metrics, and take action based on what it observes.
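Feeding a screenshot or dashboard image to the model comes down to building a multimodal message. The content-block shape below (an "image" block with a base64 source alongside a "text" block) follows the Messages API format as documented at launch; verify it against the current API reference before relying on it:

```python
# Sketch of a multimodal user turn for the Messages API.
# The block shape is the launch-era documented format -- an assumption
# to verify against Anthropic's current API reference.
import base64

def image_message(image_bytes: bytes, media_type: str, question: str) -> dict:
    """Pair an image with a text question in a single user turn."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,  # e.g. "image/png"
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }
```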
What This Means for Agent Development
I am going to be direct: Claude 3 Opus changes my plans for what I am building.
The agent infrastructure I have been developing assumes that the underlying models have certain capabilities and limitations. Opus shifts those assumptions. Tasks that required careful prompt engineering and multi-step decomposition with GPT-4 can be handled more directly with Opus. The reliability of tool use is higher. The reasoning about when and how to use tools is better.
This does not mean I am going all-in on a single provider. The multi-model, provider-agnostic approach is still the right architecture. But Opus is going to be the default model for high-value tasks where quality matters most.
Specifically, I am planning to use Opus for:
- Planning and reasoning phases in agent workflows, where the quality of the plan directly impacts everything downstream
- Code review and quality verification, where deep understanding of the code and its context is essential
- Complex debugging scenarios, where the model needs to synthesize information from multiple sources and reason about system behavior
- Architecture decisions, where trade-offs need to be explicitly considered and communicated
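In a provider-agnostic setup, that division of labor reduces to a small routing function. The task names and the fallback model here are illustrative assumptions, not a fixed API:

```python
# Sketch of task-based model routing in a provider-agnostic agent stack.
# Task labels and the default fallback model are hypothetical examples.
OPUS = "claude-3-opus-20240229"

# Tasks where plan/analysis quality dominates speed and cost.
HIGH_VALUE_TASKS = {"planning", "code_review", "debugging", "architecture"}

def model_for_task(task: str, default: str = "gpt-4") -> str:
    """Reserve Opus for high-value tasks; route the rest to a cheaper
    or provider-preferred default."""
    return OPUS if task in HIGH_VALUE_TASKS else default
```

Keeping the routing table in one place means swapping the default, or promoting a new model for a task class, is a one-line change rather than an architectural one.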
The Competitive Landscape Shifts
Claude 3 Opus puts Anthropic at or near the top of most public benchmarks and, more importantly, ahead in practical use. This matters for the industry.
Competition is driving rapid improvement. OpenAI dominated 2023. Anthropic is starting 2024 with a strong move. Google has Gemini Ultra. The result is that all of these labs are pushing each other to improve faster, and the beneficiaries are people like me who build on top of these models.
I expect OpenAI to respond. GPT-5 or whatever they call the next generation will likely close the gap or surpass Opus. Then Anthropic will respond to that. This cycle is going to produce extraordinary capabilities over the next 12 to 18 months.
My Recommendation
If you are building with LLMs, try Claude 3 Opus. Give it your hardest problems. Test it against whatever model you are currently using. I think you will see what I am seeing.
If you are not building with LLMs yet, this is the model that might change your mind. The gap between "interesting toy" and "useful engineering tool" has been closing for a while. With Opus, for many tasks, that gap is closed.
The future I have been predicting, where AI agents handle significant engineering work within structured systems, just got a lot closer. The models are ready. Now we need to build the infrastructure to use them well.