
MiniMax M3 Uses Sparse Attention for 9.7× Prefill Speedup

MiniMax's M3 model achieves 9.7× prefill speedup and 15.6× decode speedup at 1M tokens, signaling a shift toward sparse attention mechanisms.


On May 26, MiniMax R&D lead Skyler Miao shared a diagram on X outlining the company's approach to sparse attention. The diagram highlights two performance figures: a 9.7× prefill speedup and a 15.6× decode speedup at 1M tokens. The community widely read it as a teaser for M3, a model that departs significantly from its predecessor, M2, which used full attention.

M3's design separates the selection of key-value (KV) blocks from the attention computation itself, giving two distinct stages: an index branch and a sparse branch. The index branch decides which blocks to attend to, and the sparse branch computes attention over only those selected blocks. This split makes efficient use of hardware and lets the design reuse existing kernels, such as those in vLLM and FlashAttention.
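The two-stage split can be illustrated with a small pure-Python sketch for a single query vector. This is not MiniMax's implementation: the function names, the block-scoring rule (dot product with each block's mean key), and the `block_size` and `top_k` parameters are all assumptions made for illustration, since the diagram does not specify how the index branch scores blocks.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def index_branch(q, keys, block_size, top_k):
    """Stage 1: score every KV block and return the ids of the top_k blocks.

    Assumption (not from the source): a block's score is the dot product
    of the query with the mean of the keys in that block.
    """
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(q, mean_key), start // block_size))
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(block_id for _, block_id in best)


def sparse_branch(q, keys, values, block_ids, block_size):
    """Stage 2: plain softmax attention over the real K/V rows of the chosen blocks."""
    rows = [i
            for b in block_ids
            for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) * scale for i in rows])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, rows)) for d in range(dim)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
            [-1.0, 0.0], [-0.9, -0.1], [0.0, -1.0], [0.1, -0.8]]
    values = [[float(i), float(-i)] for i in range(8)]
    q = [1.0, 0.0]
    chosen = index_branch(q, keys, block_size=2, top_k=1)
    print(chosen, sparse_branch(q, keys, values, chosen, block_size=2))
```

Because stage 2 is ordinary attention restricted to a subset of rows, selecting every block reproduces full attention exactly, which is one way to see why existing dense-attention kernels could plausibly be reused for the sparse branch.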

Source: Hugging Face

Key points

MiniMax's M3 model achieves 9.7× prefill speedup and 15.6× decode speedup at 1M tokens.
The diagram outlines a two-stage approach to sparse attention, splitting the selection of KV blocks from the attention computation.
MiniMax's M3 uses GQA (Grouped Query Attention) as the substrate, avoiding MLA (Multi-head Latent Attention) for efficiency.
M3 computes attention on real K/V blocks rather than compressed ones, maintaining the full expressive power of softmax attention.
The design of M3 eliminates two of the three parallel paths in DeepSeek's NSA (Native Sparse Attention) approach, the compression and sliding-window branches, retaining only the selection stage.
The sparsity rate in M3 aligns with the NSA paper's reported range of 6–10%, optimizing performance at the 1M token scale.
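To put the 6–10% figure in concrete terms, the arithmetic below estimates how many tokens a query would attend to at 1M tokens. It assumes that "sparsity rate" means the fraction of KV tokens actually attended to, which the source does not define, and the helper names are hypothetical. The ideal ratio it computes covers attention work only and ignores the cost of the index branch, so it is a rough bound rather than an explanation of the reported speedups.

```python
def attended_tokens(context_len, sparsity_rate):
    """Tokens attended per query, assuming sparsity_rate is the kept fraction."""
    return round(context_len * sparsity_rate)


def ideal_attention_ratio(sparsity_rate):
    """Idealized reduction in attention work; ignores index-branch overhead."""
    return 1.0 / sparsity_rate


if __name__ == "__main__":
    for rate in (0.06, 0.10):
        print(rate, attended_tokens(1_000_000, rate), ideal_attention_ratio(rate))
```

Under that assumption, each query at 1M tokens would touch roughly 60,000 to 100,000 tokens instead of the full context.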


WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.
