
Mixtral 8x7B and the Rise of Mixture of Experts

Mistral AI's Mixtral model demonstrates that mixture of experts architectures can deliver frontier-class performance efficiently

Earlier this month, Mistral AI quietly released Mixtral 8x7B, a mixture of experts (MoE) model that has sent ripples through the AI community. The model appeared with minimal fanfare, initially shared as a torrent link in a tweet, but its performance has captured serious attention. Mixtral demonstrates that architectural innovation can challenge the brute-force scaling approach that has dominated the frontier model conversation.

What Mixture of Experts Means

To understand why Mixtral matters, you need to understand the mixture of experts architecture.

In a standard dense transformer model like Llama 2, every input token is processed by every parameter in the network. A 70-billion-parameter model uses all 70 billion parameters for every prediction. This is computationally expensive, and it means that making the model more capable requires making it proportionally more expensive to run.

A mixture of experts model takes a different approach. The model contains multiple "expert" sub-networks, and a learned routing mechanism selects which experts to activate for each input. In Mixtral 8x7B, each feed-forward layer contains eight experts, and the router selects two of them for every token; the attention layers are shared across experts. That sharing is why the total parameter count is around 46 billion rather than the 56 billion the name might suggest, while the compute cost per token is equivalent to a roughly 13-billion-parameter dense model.

This is an elegant tradeoff. The model has access to the knowledge and capability of a much larger parameter space, but the inference cost is determined by the active parameters rather than the total. You get something closer to a 46B model's capability at something closer to a 13B model's cost.
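The routing idea can be sketched in a few lines. This is a toy illustration, not Mixtral's actual implementation: in the real model the experts are the feed-forward blocks inside every transformer layer, the router is a learned linear layer, and everything runs batched on GPU. The shapes and the `moe_layer` helper here are invented for illustration.

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    """Route one token through the top-k of n experts (toy sketch).

    x:        (d,) token hidden state
    router_w: (d, n_experts) router weight matrix
    experts:  list of callables, each mapping (d,) -> (d,)
    """
    logits = x @ router_w                       # one routing score per expert
    top = np.argsort(logits)[-top_k:]           # indices of the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                    # softmax over the selected experts only
    # Only the selected experts run; the remaining n - k are skipped entirely,
    # which is where the compute savings come from.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy setup: 8 experts, hidden size 4; each "expert" is just a linear map.
rng = np.random.default_rng(0)
d, n_experts = 4, 8
router_w = rng.normal(size=(d, n_experts))
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), router_w, experts)
```

The key property is visible in the last line of `moe_layer`: the output is a weighted combination of only two expert outputs, so the per-token compute scales with the active experts, not the total.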

The Performance

The benchmark results for Mixtral 8x7B are striking. It matches or exceeds Llama 2 70B on most benchmarks while being significantly faster and cheaper to run. On code generation, mathematical reasoning, and multilingual tasks, it performs particularly well.

Mistral also released Mixtral 8x7B Instruct, a version fine-tuned for instruction following, which approaches GPT-3.5 Turbo performance on many tasks. For an open source model that can run on affordable hardware, this is a remarkable level of capability.

The practical implications are significant. Organizations that have been evaluating Llama 2 70B for deployment have struggled with the hardware requirements: the model needs multiple high-end GPUs for inference. Mixtral offers comparable or better performance at a fraction of the compute cost, making self-hosted deployment much more accessible.

Mistral AI: The Company

Mistral AI deserves attention as a company, not just for this single model release. Founded in 2023 by former Meta and Google DeepMind researchers, the Paris-based startup has moved with remarkable speed. Their first model, Mistral 7B, was released in September and quickly became one of the most popular open source language models due to its impressive performance relative to its size.

Mistral's approach differs from both the closed-source providers (OpenAI, Anthropic, Google) and the large-company open source releases (Meta's Llama). They are a startup that is openly releasing capable models, betting that the value creation will come from enterprise services, fine-tuned versions, and hosted deployments rather than from keeping the model weights proprietary.

This is a significant strategic bet. If Mistral can demonstrate that a company can build a sustainable business while releasing its core technology as open source, it establishes an alternative model for the AI industry. The precedent from other areas of technology, where companies like Red Hat, HashiCorp, and MongoDB built substantial businesses around open source software, suggests this can work.

Why MoE Architecture Matters

The mixture of experts approach is not new. It has been explored in various forms for decades, and Google demonstrated it at the scale of modern language models in the Switch Transformer work. What Mixtral demonstrates is that MoE can produce practical, deployable models that offer a genuinely better efficiency-capability tradeoff than dense architectures.

This has implications for the future trajectory of AI development:

Efficiency matters as much as raw capability. The frontier model conversation has been dominated by scaling: bigger models, more data, more compute. MoE suggests that architectural innovation can deliver significant capability gains without proportional increases in compute cost. This is good news for the sustainability and accessibility of AI development.

Self-hosted deployment becomes more practical. If MoE models can deliver near-frontier performance at reduced compute cost, more organizations can afford to run capable models on their own infrastructure. This strengthens the case for self-hosted deployment in enterprises with data sensitivity or vendor dependency concerns.

The training-inference cost relationship changes. MoE models are more complex to train (the router must learn to spread tokens across experts, typically with auxiliary load-balancing losses), but they are cheaper to serve. For production deployments where inference volume far exceeds training runs, this tradeoff is favorable.

Specialization within models becomes possible. The expert mechanism means that different experts can specialize in different types of content or tasks. While current MoE models do not have explicitly specialized experts, future architectures could take advantage of this property for more targeted capability.

Infrastructure Considerations

From an infrastructure perspective, MoE models present both opportunities and challenges.

The reduced compute per token translates directly into higher throughput and lower latency per request. For serving infrastructure, this means you can handle more concurrent requests on the same hardware, or use less expensive hardware to achieve the same throughput.

However, the total model size (all expert parameters) still needs to be loaded into memory, even though only a subset is active at any time. For Mixtral 8x7B, this means roughly 90GB of memory in 16-bit precision, or significantly less with quantization. The memory-versus-compute tradeoff is therefore different from dense models and calls for different infrastructure optimization strategies.
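The arithmetic behind these figures is straightforward. The helper below is a back-of-the-envelope sketch that counts weights only; it ignores the KV cache, activations, and runtime overhead, and it uses the commonly cited ~46.7B total parameter count for Mixtral.

```python
# Back-of-the-envelope memory footprint for Mixtral's total parameters.
# ~46.7B is the commonly cited total; only ~12.9B are active per token,
# but ALL weights must be resident in memory for the router to pick from.
TOTAL_PARAMS = 46.7e9

def weight_memory_gb(n_params, bits_per_param):
    """GB needed just to hold the weights (no KV cache or activations)."""
    return n_params * bits_per_param / 8 / 1e9

fp16 = weight_memory_gb(TOTAL_PARAMS, 16)   # ~93 GB: multiple GPUs
int8 = weight_memory_gb(TOTAL_PARAMS, 8)    # ~47 GB
int4 = weight_memory_gb(TOTAL_PARAMS, 4)    # ~23 GB: near single-GPU territory
```

Notice that quantizing to 4 bits changes memory by 4x but does nothing to reduce the compute per token; for MoE, the two dimensions are decoupled, which is exactly why the serving tradeoffs differ from dense models.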

Model serving frameworks are adapting to support MoE architectures. vLLM and other serving systems are adding optimizations specific to MoE models, including efficient expert routing and memory management for inactive experts.

My Experiments

I have been running Mixtral on a development GPU server, and the results are impressive. The quantized version runs well on a single GPU with 24GB of memory, producing coherent and capable responses across a range of tasks.

For the internal tools I have been building, Mixtral offers an interesting middle ground. It is capable enough for many of our use cases, it runs on hardware we can provision within our existing cloud accounts, and it does not require sending any data to external APIs. The instruction-tuned version handles our documentation Q&A and code generation tasks with quality that is close to what we get from commercial APIs.
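As a concrete illustration of the self-hosted setup, here is a sketch of how an internal tool might talk to a locally served model through an OpenAI-compatible endpoint such as the one vLLM exposes. The base URL, port, and model identifier are placeholders for whatever your deployment actually uses.

```python
import json
import urllib.request

def build_chat_request(prompt,
                       model="mistralai/Mixtral-8x7B-Instruct-v0.1",
                       base_url="http://localhost:8000/v1"):
    """Build an OpenAI-style chat completion request for a self-hosted server.

    base_url and model are example values; point them at your own deployment.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Summarize our deployment runbook in three bullets.")
# urllib.request.urlopen(req) would send it; that requires a running server.
```

Because the wire format matches the commercial APIs, swapping a hosted model for a self-hosted one is mostly a matter of changing the base URL, which keeps comparative evaluations like the ones above cheap to run.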

I have been running comparative evaluations against GPT-3.5 Turbo and Claude on our specific use cases, and Mixtral holds up well. On tasks that involve our internal terminology and domain knowledge, the gap is smaller than the general benchmarks would suggest, particularly after we apply prompt engineering specific to our context.

The Open Source AI Trajectory

Mixtral 8x7B, coming on the heels of Llama 2 and Mistral 7B, reinforces a trend that I have been tracking all year: open source AI models are closing the gap with closed-source alternatives faster than most people expected.

The trajectory is clear. In early 2023, the best open source models were significantly behind GPT-3.5. By mid-year, Llama 2 70B was competitive with the earlier versions of ChatGPT. By year's end, Mixtral is approaching GPT-3.5 Turbo performance at a fraction of the cost. If this trajectory continues, 2024 could bring open source models that are competitive with GPT-4 on many practical tasks.

This convergence has profound implications for the economics of AI deployment, the balance of power between model providers and model consumers, and the accessibility of AI capabilities to organizations that cannot or will not depend on proprietary APIs.

Looking Ahead

Mixtral 8x7B is both a milestone and a signpost. It is a milestone because it demonstrates that open source models can achieve strong performance through architectural innovation rather than just parameter scaling. It is a signpost because it points toward a future where the model landscape is more diverse, more efficient, and more accessible.

For infrastructure engineers and architects, MoE models add a new dimension to the evaluation process. It is no longer just about parameter count or benchmark scores; it is about the efficiency of the architecture and the practical costs of deployment. Understanding these tradeoffs is becoming an important part of the AI engineering skill set.

I am going to continue experimenting with Mixtral and watching Mistral AI closely. They are building something important, and the architectural approach they represent may be as significant as the specific models they release.
