On May 26, MiniMax R&D lead Skyler Miao shared a diagram on X outlining the company's approach to sparse attention in its M3 model. The diagram highlights two performance figures at 1M tokens: a 9.7× prefill speedup and a 15.6× decode speedup. The community widely read the post as a teaser for M3, which marks a significant departure from its predecessor, M2, a full-attention model. M3's design decouples the selection of key-value (KV) blocks from the attention computation itself, splitting the work into two stages: an index branch and a sparse branch. The index branch decides which KV blocks to attend to, and the sparse branch computes attention over only those selected blocks. This separation makes efficient use of hardware and lets the design reuse existing kernels, such as those in vLLM and FlashAttention. *Source: [huggingface](https://huggingface.co/blog/AtlasCloud-AI/minimax-goes-sparse)*
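
The two-stage split described above can be illustrated with a minimal pure-Python sketch for a single decode-step query. This is not MiniMax's implementation: the function names (`index_branch`, `sparse_branch`, `two_stage_attention`), the block size, the `top_k` parameter, and above all the block-scoring rule (query dot mean-pooled key) are assumptions made for illustration, since the source does not specify how M3 scores blocks.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def index_branch(q, keys, block_size, top_k):
    """Stage 1: score every KV block and return the ids of the top_k blocks.

    ASSUMPTION: blocks are scored by the query's dot product with the
    block's mean-pooled key. The real M3 scoring rule is not public here.
    """
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(k[d] for k in block) / len(block) for d in range(len(q))]
        scored.append((dot(q, mean_key), start // block_size))
    # Highest score first; Python's sort is stable, so ties keep the earlier block.
    scored.sort(key=lambda s: s[0], reverse=True)
    return sorted(block_id for _, block_id in scored[:top_k])

def sparse_branch(q, keys, values, block_ids, block_size):
    """Stage 2: ordinary softmax attention, restricted to the selected blocks."""
    token_ids = [
        t
        for b in block_ids
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[t]) * scale for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]

def two_stage_attention(q, keys, values, block_size=2, top_k=1):
    """Run block selection first, then attention over the selected blocks only."""
    block_ids = index_branch(q, keys, block_size, top_k)
    return block_ids, sparse_branch(q, keys, values, block_ids, block_size)

# Toy example: 4 cached tokens in 2 blocks; the query aligns with block 1.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
query = [0.0, 1.0]
selected, output = two_stage_attention(query, keys, values, block_size=2, top_k=1)
print(selected, output)
```

In this sketch the sparse branch is plain dense attention applied to a gathered subset of the cache, which is one plausible reading of why a design like this could reuse existing attention kernels rather than require a new one.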