Chinese AI company MiniMax has released its new open-weight model, M3, which combines strong coding performance, native multimodality, and a one-million-token context window. According to MiniMax, this combination was previously out of reach for open models and reserved for proprietary systems like Opus 4.7, GPT-5.5, or Gemini 3.1 Pro. A new attention mechanism makes the leap possible by stretching the context window to one million tokens without letting compute costs spiral out of control. In internal tests, M3 also planned, debugged, and self-corrected on its own over many hours.

Source: [thedecoder](https://the-decoder.com/minimax-m3-open-weight-model-with-a-million-token-context-challenges-proprietary-leaders/)

On SWE-Bench Pro, an established software development benchmark, M3 scores 59 percent according to MiniMax. That puts it ahead of GPT-5.5 and Gemini 3.1 Pro, but just behind Opus 4.7. M3 also lands in proprietary-class territory on terminal tasks and tool use. On autonomous web search, it pulls ahead of Opus 4.7, scoring 83.5 points on BrowseComp against 79.3. Overall, MiniMax's own benchmarks place M3 close to Opus 4.7 and, in some cases, ahead of GPT-5.5 and Gemini 3.1 Pro.

To get closer to real developer workflows, MiniMax built a simulator framework that mimics typical behavior patterns. These include refining requirements, discussing solution approaches, reacting to intermediate results, and carrying tasks across multiple contexts. This exposes the model to multi-turn collaboration during training, not just single, clearly defined prompts.

Source: [thedecoder](https://the-decoder.com/minimax-m3-open-weight-model-with-a-million-token-context-challenges-proprietary-leaders/)

MiniMax describes three internal experiments designed to show how these capabilities work together. In the first, the team had M3 independently reproduce a paper on LLM fine-tuning.
The model worked for nearly twelve hours without intervention, produced 18 commits and 23 figures, and confirmed the key findings of the ICLR 2025 paper, achieving a score of 0.650.

In the second test, M3 was asked to optimize a compute kernel for matrix multiplications on Nvidia Hopper GPUs, one of the most compute-intensive building blocks in large-model inference. Experienced teams typically need one to two weeks for this, according to MiniMax. M3 got only a task description, a benchmark script, and a non-functional code skeleton, with no reference solution to copy from. After about 24 hours, the model had pushed Hopper hardware utilization from 7.6 to 71.3 percent. Most other tested models gave up after a few dozen attempts, while M3 worked through several plateaus and didn't reach its best solution until attempt 145.

Source: [thedecoder](https://the-decoder.com/minimax-m3-open-weight-model-with-a-million-token-context-challenges-proprietary-leaders/)

In the third test, PostTrainBench, M3 was tasked with independently training four base models: synthesizing data, training, evaluating, and iterating without human input. The model landed just behind Opus 4.7 and GPT-5.5 but well ahead of the remaining tested models.

MiniMax says M3 was trained with mixed modalities from the start. So-called interleaved data, where text and images are woven together within a sequence, turned out to matter more than initially expected. After MiniMax reworked the data pipeline, training now scales to the order of 100 trillion tokens.

Source: [thedecoder](https://the-decoder.com/minimax-m3-open-weight-model-with-a-million-token-context-challenges-proprietary-leaders/)
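
To make the idea of interleaved data more concrete, here is a minimal Python sketch of text and image segments woven into a single sequence. It is purely illustrative: the article does not describe MiniMax's actual data format, so the `TextSegment` and `ImageSegment` classes, the `<img>` markers, the whitespace tokenizer, and the patch naming are all hypothetical assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    image_id: str     # hypothetical reference to an image
    num_patches: int  # hypothetical number of patch tokens the image occupies


Segment = Union[TextSegment, ImageSegment]


def flatten(segments: List[Segment]) -> List[str]:
    """Flatten mixed segments into one token stream, preserving document order."""
    tokens: List[str] = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            # Toy whitespace tokenizer (assumption, not a real tokenizer).
            tokens.extend(seg.text.split())
        else:
            # Image patches sit inline between the surrounding text tokens.
            tokens.append("<img>")
            tokens.extend(f"{seg.image_id}:patch{i}" for i in range(seg.num_patches))
            tokens.append("</img>")
    return tokens


# A toy document: text, then an image, then more text, all in one sequence.
doc: List[Segment] = [
    TextSegment("The chart below shows"),
    ImageSegment("chart_01", 3),
    TextSegment("a clear upward trend"),
]
stream = flatten(doc)
print(stream)
```

The point of the sketch is only that the image tokens appear at the position where the image occurs in the document, rather than in a separate image-caption pair, so the model sees text and visuals in their original context.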