Model Release

MiniMax M3 Open-Weight Model Challenges Proprietary Leaders With One-Million-Token Context

MiniMax's M3 model, with a one-million-token context window, matches top proprietary models like Opus 4.7 and GPT-5.5 in benchmarks, according to the company.


Chinese AI company MiniMax has released its new open-weight model, M3, which combines strong coding performance, native multimodality, and a one-million-token context window. According to MiniMax, this combination was previously out of reach for open models and reserved for proprietary systems like Opus 4.7, GPT-5.5, or Gemini 3.1 Pro. A new attention mechanism makes the leap possible by stretching the context window to one million tokens without letting compute costs spiral out of control. In MiniMax's internal tests, M3 also planned, debugged, and corrected its own work over many hours without outside help.
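To see why a one-million-token window is costly with conventional attention, consider a rough count of the query-key pairs a model has to score. The Python sketch below is illustrative only: the source does not describe how MiniMax's mechanism works, so the per-token `budget` of 2,048 and the function names are hypothetical assumptions, not M3's actual design. It compares causal dense attention, where every token attends to all earlier tokens, with a generic sparse scheme that caps how many tokens each position attends to.

```python
def dense_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by causal dense attention.

    Token i attends to itself and all i-1 earlier tokens,
    so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n_tokens * (n_tokens + 1) // 2


def sparse_pairs(n_tokens: int, budget: int) -> int:
    """Pairs scored when each token attends to at most `budget` tokens.

    `budget` is a hypothetical cap chosen for illustration; it is not
    a published M3 parameter.
    """
    # The first `budget` tokens have fewer than `budget` predecessors,
    # so they behave like dense attention.
    head = min(n_tokens, budget)
    # Every later token scores exactly `budget` pairs.
    tail = max(n_tokens - budget, 0)
    return head * (head + 1) // 2 + tail * budget


if __name__ == "__main__":
    context = 1_000_000   # one-million-token window, as reported
    budget = 2_048        # assumption, for illustration only
    dense = dense_pairs(context)
    sparse = sparse_pairs(context, budget)
    print(f"dense:  {dense:,} pairs")
    print(f"sparse: {sparse:,} pairs")
    print(f"ratio:  {dense / sparse:.0f}x")
```

Under these assumptions, the dense count grows quadratically with context length while the capped variant grows roughly linearly, which is the general reason sparse attention schemes keep long-context costs manageable.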

On SWE-Bench Pro, an established software development benchmark, M3 scores 59 percent according to MiniMax. That puts it ahead of GPT-5.5 and Gemini 3.1 Pro, but just behind Opus 4.7. By the company's own figures, M3 also lands in proprietary-class territory on terminal tasks and tool use. On BrowseComp, which tests autonomous web search, M3 scores 83.5 points and pulls ahead of Opus 4.7 (79.3).

To get closer to real developer workflows, MiniMax built a simulator framework that mimics typical behavior patterns. These include refining requirements, discussing solution approaches, reacting to intermediate results, and carrying tasks across multiple contexts. This exposes the model to multi-turn collaboration during training, not just single, clearly defined prompts.

MiniMax describes three internal experiments designed to show how these capabilities work together. In the first, the team had M3 independently reproduce an ICLR 2025 paper on LLM fine-tuning. The model worked for nearly twelve hours without intervention, produced 18 commits and 23 figures, and confirmed the paper's key findings, achieving a score of 0.650.

In the second test, M3 was asked to optimize a compute kernel for matrix multiplications on Nvidia Hopper GPUs, one of the most compute-intensive building blocks in large-model inference. Experienced teams typically need one to two weeks for this, according to MiniMax. M3 got only a task description, a benchmark script, and a non-functional code skeleton, with no reference solution to copy from. After about 24 hours, the model had pushed Hopper hardware utilization from 7.6 to 71.3 percent. Most other tested models gave up after a few dozen attempts, while M3 worked through several plateaus and didn't reach its best solution until attempt 145.
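The utilization figures can be read two ways: as an absolute gain in percentage points or as a relative increase. A quick Python check, using only the two numbers reported by MiniMax (the function names are illustrative):

```python
def point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute change in percentage points."""
    return after_pct - before_pct


def relative_factor(before_pct: float, after_pct: float) -> float:
    """How many times higher the final utilization is than the initial one."""
    return after_pct / before_pct


if __name__ == "__main__":
    before, after = 7.6, 71.3  # Hopper utilization in percent, per MiniMax
    print(f"gain:   {point_gain(before, after):.1f} percentage points")
    print(f"factor: {relative_factor(before, after):.1f}x")
```

The jump is 63.7 percentage points, or roughly a 9.4-fold increase over the starting utilization.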

In the third test, PostTrainBench, M3 was tasked with independently training four base models, which meant synthesizing data, training, evaluating, and iterating without human input. The model landed just behind Opus 4.7 and GPT-5.5 but well ahead of the remaining tested models.

MiniMax says M3 was trained with mixed modalities from the start. So-called interleaved data, where text and images are woven together within a sequence, turned out to matter more than initially expected. After the data pipeline was reworked, training scales to the order of 100 trillion tokens.

Key points

MiniMax M3 is an open-weight model with a one-million-token context window.
MiniMax M3 matches the benchmark results of top proprietary models like Opus 4.7 and GPT-5.5, according to the company.
MiniMax M3 independently reproduced an ICLR 2025 paper over twelve hours, achieving a score of 0.650.
M3 optimized an FP8 kernel for Nvidia Hopper GPUs, reaching 71.3 percent of peak performance after 147 runs.
MiniMax M3 was trained with interleaved data where text and images are woven together within a sequence.
MiniMax M3's new attention mechanism, MiniMax Sparse Attention (MSA), reduces compute costs by one-twentieth compared to its predecessor.
M3 processes input prompts more than nine times faster and generates responses more than fifteen times faster than its predecessor.

Source: The Decoder

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.
