Mixture of Experts (MoE)

Architecture where only some model weights activate per token.

Mixture of Experts models (DeepSeek V3, Llama 4, Mixtral) have many parameters on paper but activate only a fraction per forward pass: a learned router sends each token to a small subset of expert feed-forward blocks. DeepSeek V3 has 671B total parameters but activates ~37B per token. This gives high capacity at low inference cost, so the models can be both big and fast. The tradeoff: MoE inference can be latency-spiky under load, because different tokens route to different experts, creating uneven memory and compute pressure. MoE is a major reason open-weights pricing has collapsed over the past year.
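
The sparse-activation idea can be sketched in a few lines. This is a toy illustration with hypothetical tiny dimensions, not any real model's architecture: a router scores experts per token, and only the top-k experts' weights are used for that token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; real MoE layers are vastly larger).
d_model, n_experts, top_k, n_tokens = 8, 4, 2, 5

# Each "expert" is a small weight matrix; the router scores experts per token.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_forward(x):
    """Route each token to its top_k experts; only those experts' weights run."""
    logits = x @ router_w                           # (n_tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]  # expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # Softmax over only the selected experts' scores to get gate weights.
        sel = logits[t, chosen[t]]
        gate = np.exp(sel - sel.max())
        gate /= gate.sum()
        for g, e in zip(gate, chosen[t]):
            # Only top_k of n_experts weight matrices are touched per token.
            out[t] += g * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((n_tokens, d_model))
y = moe_forward(tokens)
print(y.shape)  # one output vector per token, same shape as input
```

Because tokens in a batch can pick different experts, the set of weights that must be resident and busy shifts from batch to batch, which is the source of the latency spikiness noted above.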