AMD has published a detailed tutorial on optimizing GEMM (General Matrix Multiply) kernels with its Gluon programming model, showing how to reach near-peak performance on the MI355 GPU. The tutorial, part of the gfx950-gluon-tutorials repository, walks step by step from a naive baseline to a highly optimized kernel. It starts with a simple FP16 GEMM kernel running at 520 TFLOPS and 25% MFMA efficiency, then applies targeted optimizations until the kernel reaches 1489 TFLOPS and 99% MFMA efficiency, a 2.9× speedup over the initial version. The same approach is applied to the BF8 and MXFP4 data types, which reach 3257 TFLOPS and 5255 TFLOPS respectively. Along the way, the tutorial stresses identifying and removing bottlenecks such as register pressure and memory latency, with fixes like slicing and register budgeting playing key roles in reaching peak performance. It is aimed at kernel developers, ML compiler engineers, and performance specialists who want to understand how to build high-performance kernels on AMD GPUs. *Source: [amd](https://rocm.blogs.amd.com/software-tools-optimization/gluon-gemm-tutorial/README.html)*
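
As a quick check on the FP16 figures quoted above, the minimal Python sketch below recomputes the speedup from the two reported throughput numbers. It also shows the conventional way a GEMM throughput figure is derived from a measured runtime, counting 2·M·N·K floating-point operations per matrix multiply. The function names, the 8192×8192×8192 problem size, and the 1 ms runtime are illustrative assumptions, not values taken from the tutorial.

```python
def speedup(baseline_tflops: float, optimized_tflops: float) -> float:
    """Ratio of optimized to baseline throughput."""
    return optimized_tflops / baseline_tflops


def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an M x N x K GEMM in TFLOPS.

    Uses the standard count of 2*M*N*K floating-point operations:
    one multiply and one add for each term of every dot product.
    """
    flops = 2.0 * m * n * k
    return flops / seconds / 1e12


# Throughput figures reported in the summary above (FP16 kernel).
fp16_speedup = speedup(520.0, 1489.0)
print(f"FP16 speedup: {fp16_speedup:.1f}x")

# Hypothetical problem size and runtime, chosen only to illustrate the formula.
example = gemm_tflops(8192, 8192, 8192, 1e-3)
print(f"Example throughput: {example:.1f} TFLOPS")
```

The first print confirms that 1489 / 520 rounds to the quoted 2.9×. The second only shows how a timing converts to TFLOPS and says nothing about what the MI355 actually achieves for that shape.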