AMD announced its MLPerf Training v6.0 results on June 16, 2026, covering three benchmarks: Llama 2 70B LoRA fine-tuning, Llama 3.1 8B pretraining, and FLUX.1 Schnell text-to-image pretraining. The results were obtained on the MI325X, MI350X, and MI355X Instinct GPUs and reflect AMD’s continued efforts to deliver competitive AI training performance on its latest hardware. MLPerf Training is considered the most rigorous public benchmark suite for AI training workloads, covering a wide range of model architectures and use cases. Participating in each successive round lets AMD demonstrate generational progress, validate its software stack under competitive scrutiny, and give customers evaluating AI infrastructure options a transparent basis for comparison.
The submission includes three key milestones: the debut of a production-ready MXFP4 training recipe for the LLM benchmarks, the first use of AMD’s Primus training framework in an MLPerf Training submission, and AMD’s first multi-node submissions. Together, these milestones signal the growing maturity of AMD’s training software stack and the readiness of the MI355X’s native FP4 hardware for production-scale training. The multi-node results are particularly significant because AI training is inherently a cluster-scale problem, and demonstrating strong scaling beyond a single server is essential for credibility in this space. Details on reproducing the submission results are in the blog post Reproducing AMD MLPerf Training v6.0 Submission Result.
This submission round brings significant advances in AMD’s software stack. The two most important are the introduction of the Primus training framework and the MXFP4 training recipe.
Primus is AMD’s unified training framework for large-scale foundation model training on Instinct GPUs. It provides a single CLI and configuration system over multiple training backends, including Megatron-LM, Megatron-Bridge, and TorchTitan. The same workflow can therefore drive pretraining, supervised fine-tuning, and post-training without users having to manage each framework separately.
The MXFP4 training recipe uses the native FP4 hardware on MI355X GPUs to push training throughput beyond what FP8 alone can deliver. It is implemented as a first-class precision mode in the ROCm Transformer Engine and uses the OCP Microscaling format. This format stores each element as a 4-bit E2M1 value and groups 32 contiguous elements under a single E8M0 scaling factor, striking a balance between compression and representational fidelity.
The recipe also includes a quantization pipeline that configures quantization for all linear layers in the transformer stack. The pipeline produces packed FP4 data alongside E8M0 scales in both row-wise and column-wise layouts, which together serve all three training GEMMs: the forward pass, the data gradient, and the weight gradient. A deterministic 16-point Hadamard rotation proved essential for reducing quantization distortion without altering the underlying computation. The whole quantization pipeline is fused into a single HIP kernel launch, eliminating intermediate memory round-trips that could otherwise become a bottleneck. The resulting MXFP4 tensors are dispatched to AITER’s hand-tuned ASM FP4×FP4 GEMM kernels, with per-model configurations loaded from pre-computed CSV files so that every matrix shape encountered during training gets well-suited tile sizes and instruction scheduling.
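To make the block format described above concrete, the following is a minimal pure-Python sketch of MXFP4-style quantization: 32 elements share one power-of-two scale (the E8M0 factor), and each element is rounded to the nearest E2M1 magnitude. It is an illustration only, not AMD’s fused HIP kernel. The function names, the scale selection rule (floor of log2 of the block maximum, minus the E2M1 maximum exponent of 2), and the round-to-nearest tie behavior are assumptions made for this example. Bit packing, the Hadamard rotation, and the row-wise and column-wise layouts are left out.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2      # largest E2M1 value is 6.0 = 1.5 * 2**2
BLOCK_SIZE = 32    # elements that share one E8M0 scale


def quantize_block(block):
    """Return (shared_exp, codes) for one block of up to 32 floats.

    shared_exp is the power-of-two exponent an E8M0 scale would encode;
    codes are the signed E2M1 values before bit packing.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        # All-zero block: any scale works; 0 is an arbitrary choice here.
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) = e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))  # E8M0 exponent range
    scale = 2.0 ** shared_exp
    codes = []
    for v in block:
        # Scaled magnitudes above 6.0 saturate to the largest E2M1 value.
        mag = min(abs(v) / scale, E2M1_GRID[-1])
        # Nearest grid point; ties go to the smaller magnitude (assumption).
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))
        codes.append(math.copysign(q, v))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate floats from a quantized block."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


def quantize_mxfp4(values):
    """Split a flat list into 32-element blocks and quantize each one."""
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]
```

For example, a block of 32 ones gets the shared exponent -2 (scale 0.25), so every element is stored as the E2M1 value 4.0 and dequantizes back to exactly 1.0. A block whose largest entry is 3.0 gets scale 0.5, and a small entry such as 0.4 is rounded to 0.5 after dequantization, which shows how one large value coarsens the resolution of the rest of its block.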
Both the Llama 3.1 8B and Llama 2 70B LoRA benchmarks use the MXFP4 training recipe, but the two recipes differ significantly: Llama 3.1 8B trains end-to-end in MXFP4, whereas Llama 2 70B LoRA requires a mid-training transition to FP8 to recover convergence.
Llama 3.1 8B pretraining is the simpler of the two. It runs in a single MXFP4 regime with the deterministic Hadamard rotation enabled and no mid-training precision transitions. Before the measured training begins, a warmup phase in FP8 hybrid precision JIT-compiles all kernels, after which the model state is fully reset for a clean start. The transpose cache is enabled for this model: it keeps a precomputed weight transpose alongside the primary quantized data so that the backward GEMMs do not have to recompute it.
Llama 2 70B LoRA, by contrast, switches to FP8 partway through training to maintain convergence. The switch addresses weight gradient quantization, which was found to be the dominant source of training instability. A deterministic Hadamard rotation reduces quantization distortion without altering the underlying computation, because the orthogonality of the Hadamard transform ensures exact cancellation across the forward–backward cycle.
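The cancellation property of the Hadamard rotation can be checked numerically. The sketch below builds a normalized 16-point Hadamard matrix with the Sylvester construction, rotates two 16-element vectors, and compares their dot product before and after rotation. Because the matrix is orthogonal, the two agree up to floating-point rounding. The helper names are hypothetical, and how the rotation is applied inside the actual GEMM kernels is not described in the text, so this shows only the underlying identity. In a real recipe quantization happens after the rotation, so the final result is still approximate; the rotation itself adds no error.

```python
import math


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of 2)."""
    h = [[1.0]]
    while len(h) < n:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def rotate(vec, h):
    """Compute vec @ h for a vector and a square matrix given as nested lists."""
    n = len(h)
    return [sum(vec[i] * h[i][j] for i in range(n)) for j in range(n)]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


H16 = hadamard(16)

# One large outlier among small values, a pattern that is hard on block scaling.
x = [0.1] * 15 + [8.0]
w = [0.5 * ((-1) ** i) for i in range(16)]

x_rot = rotate(x, H16)
w_rot = rotate(w, H16)

# Orthogonality means the rotated dot product matches the original one.
original = dot(x, w)
rotated = dot(x_rot, w_rot)
```

In this example the outlier of 8.0 in `x` is spread across all 16 rotated entries, so the largest magnitude in `x_rot` is much smaller than 8.0. That is one intuition for why such a rotation can reduce distortion when a block shares a single scale, while `rotated` and `original` remain equal up to rounding.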
Source: amd