AMD's Gluon GEMM Tutorial Achieves Near-Peak Performance on MI355
AMD's Gluon GEMM tutorial on MI355 GPUs reaches 99% MFMA efficiency with FP16 kernels, achieving 1489 TFLOPS through iterative optimization.
AMD has released a detailed tutorial on optimizing GEMM (General Matrix Multiply) kernels with its Gluon programming model, showing how to reach near-peak performance on the MI355 GPU. The tutorial, part of the gfx950-gluon-tutorials repository, walks step by step from a naive baseline to a highly optimized kernel.

The starting point is a simple FP16 GEMM kernel that runs at 520 TFLOPS and 25% MFMA efficiency. Targeted optimizations then raise it to 1489 TFLOPS and 99% MFMA efficiency, a 2.9× speedup over the initial version (1489 / 520 ≈ 2.86). The tutorial also covers the BF8 and MXFP4 data types, where the optimized kernels reach 3257 TFLOPS and 5255 TFLOPS respectively.

The process highlights the importance of addressing bottlenecks such as register pressure and memory latency, with specific fixes like slicing and register budgeting playing key roles in approaching peak performance. The tutorial is aimed at kernel developers, ML compiler engineers, and performance specialists who want to understand how to build high-performance kernels on AMD GPUs.

*Source: [AMD](https://rocm.blogs.amd.com/software-tools-optimization/gluon-gemm-tutorial/README.html)*
Key points
- AMD's Gluon GEMM tutorial reaches 1489 TFLOPS and 99% MFMA efficiency with its optimized FP16 kernel on MI355 GPUs.
- The tutorial demonstrates a 2.9× speedup over the initial FP16 GEMM kernel baseline.
- In its naive version, the a16w16 (FP16) kernel runs at 520 TFLOPS and 25% MFMA efficiency.
- The a8w8 BF8 kernel achieves 3257 TFLOPS and 99.72% MFMA efficiency.
- The a4w4 MXFP4 kernel reaches 5255 TFLOPS and 92.41% MFMA efficiency.
- The tutorial addresses register pressure and memory latency through slicing and register budgeting.
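
The throughput figures above can be related with some simple arithmetic. The Python sketch below is illustrative only and is not taken from the tutorial: `gemm_tflops` and `speedup` are hypothetical helper names, the 8192-cubed problem size and 1 ms timing are made-up inputs, and the FLOP count uses the conventional 2·M·N·K for a GEMM (one multiply and one add per inner-product term). Only the 520 and 1489 TFLOPS values come from the article.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an (m x k) @ (k x n) GEMM.

    Uses the conventional count of 2*m*n*k floating-point operations.
    """
    return 2.0 * m * n * k / seconds / 1e12


def speedup(optimized_tflops: float, baseline_tflops: float) -> float:
    """Ratio of optimized to baseline throughput."""
    return optimized_tflops / baseline_tflops


# Figures reported in the article for the FP16 (a16w16) kernel.
BASELINE_TFLOPS = 520.0
OPTIMIZED_TFLOPS = 1489.0

if __name__ == "__main__":
    # 1489 / 520 is about 2.86, which the article rounds to 2.9x.
    print(f"FP16 speedup: {speedup(OPTIMIZED_TFLOPS, BASELINE_TFLOPS):.1f}x")

    # Hypothetical problem size and timing, NOT measurements from the tutorial.
    print(f"Example throughput: {gemm_tflops(8192, 8192, 8192, 1e-3):.1f} TFLOPS")
```

Because throughput for a fixed problem size is inversely proportional to runtime, the TFLOPS ratio and the runtime speedup are the same number.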