MiniMax M3 Uses Sparse Attention for 9.7× Prefill Speedup
A teaser diagram for MiniMax's M3 model reports a 9.7× prefill speedup and a 15.6× decode speedup at 1M tokens, as the company moves from full attention to sparse attention.
On May 26, MiniMax R&D lead Skyler Miao shared a diagram on X outlining the company's approach to sparse attention in its M3 model. The diagram highlights two performance figures: a 9.7× prefill speedup and a 15.6× decode speedup at 1M tokens. The community widely read it as a teaser for M3, which departs from its predecessor, M2, a full-attention model.

M3 separates the choice of which key-value (KV) blocks to attend to from the attention computation itself, giving two distinct stages: an index branch and a sparse branch. The index branch identifies which blocks to consider, and the sparse branch computes attention over only the selected blocks. This split makes efficient use of hardware and lets the design reuse existing kernels, such as those in vLLM and FlashAttention. *Source: [huggingface](https://huggingface.co/blog/AtlasCloud-AI/minimax-goes-sparse)*
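The two-stage split can be illustrated with a small single-query sketch in plain Python. This is not MiniMax's implementation: the source does not say how the index branch scores blocks, so scoring each block by the dot product between the query and the block's mean key is an assumption made here for illustration, and the names `select_blocks` and `sparse_attention` are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(q, keys, block_size, top_k):
    """Index branch (assumed scoring): rank KV blocks by q . mean(K_block)."""
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(q, mean_key), b))
    # Keep the top_k highest-scoring blocks, returned in positional order.
    chosen = sorted(scores, reverse=True)[:top_k]
    return sorted(b for _, b in chosen)

def sparse_attention(q, keys, values, block_ids, block_size):
    """Sparse branch: exact softmax attention over the real K/V rows of the chosen blocks."""
    idx = [i for b in block_ids
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    dim_v = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(dim_v)]

# Toy sizes, chosen only for illustration.
random.seed(0)
dim, n_tokens, block_size = 8, 64, 16
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
q = [random.gauss(0, 1) for _ in range(dim)]

block_ids = select_blocks(q, keys, block_size, top_k=2)  # stage 1: which blocks
out = sparse_attention(q, keys, values, block_ids, block_size)  # stage 2: attend
```

Because the second stage is ordinary softmax attention restricted to a subset of rows, selecting every block reproduces dense attention exactly. In this sketch, that is why the sparse stage could in principle be served by an existing attention kernel.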
Key points
- MiniMax's M3 model achieves 9.7× prefill speedup and 15.6× decode speedup at 1M tokens.
- The diagram outlines a two-stage approach to sparse attention, splitting the selection of KV blocks from the attention computation.
- MiniMax's M3 uses GQA (Grouped Query Attention) as the substrate, avoiding MLA (Multi-head Latent Attention) for efficiency.
- M3 computes attention on real K/V blocks rather than compressed ones, maintaining the full expressive power of softmax attention.
- M3 drops two of the three parallel branches in DeepSeek's NSA (Native Sparse Attention) design, compression and sliding window, and keeps only the selection branch.
- The sparsity rate in M3 aligns with the NSA paper's reported range of 6–10%, optimizing performance at the 1M token scale.
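To give the 6–10% figure some scale, the arithmetic below assumes that "sparsity rate" means the fraction of KV tokens each query actually attends to. The source does not define the term, so this reading is an assumption. The helper names are hypothetical, and the reduction factor is an idealized bound on attention-side work that ignores the cost of the index branch. It is not a reproduction of the reported speedups.

```python
def attended_tokens(context_len, keep_fraction):
    """Tokens attended per query if only keep_fraction of the KV cache is selected."""
    return round(context_len * keep_fraction)

def ideal_attention_reduction(keep_fraction):
    """Idealized reduction in attention work, ignoring selection overhead."""
    return 1.0 / keep_fraction

context_len = 1_000_000
for keep in (0.06, 0.10):
    tokens = attended_tokens(context_len, keep)
    factor = round(ideal_attention_reduction(keep), 1)
    print(f"keep={keep:.0%}: {tokens} tokens attended, up to {factor}x less attention work")
```

Under that assumption, each query at 1M tokens would attend to roughly 60,000 to 100,000 tokens instead of the full context.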