AMD has introduced a new algorithm design for FP8 GEMM (General Matrix Multiply) kernels, called 4-wave interleave, which aims to outperform its 8-wave ping-pong implementation. The 4-wave approach places one wave on each SIMD unit, and that wave issues both MFMA (Matrix Fused Multiply-Add) and memory instructions in a carefully crafted order, whereas the waves of the 8-wave kernel alternate between memory and MFMA instructions. Because each wave has a SIMD to itself, it can use all 512 VGPRs, twice the 256 available to a wave in the 8-wave kernel. With this larger per-wave register budget, the wave can hold a complete 128×128 output tile in registers at once, compared to 64×128 in the 8-wave case. The two designs also differ in implementation complexity: the 8-wave kernel relies on conditional `__builtin_amdgcn_s_barrier()` calls to create the alternating wave behavior, whereas the 4-wave kernel uses a finer-grained software pipeline to overlap memory and MFMA instructions. The core of the kernel is the `interleaved_block` function, which issues 16 MFMA instructions to compute a 64×64 part of C, alongside 8 LDS → register loads and 4 global → LDS loads. The key mechanism is `__builtin_amdgcn_sched_barrier(x)`, which prevents the compiler from reordering instructions across the barrier, so that the memory instructions stay where they were placed between the MFMA operations. *Source: [amd](https://rocm.blogs.amd.com/software-tools-optimization/4wave-fp8gemm/README.html)*
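
The register budget and the instruction mix described above can be checked with a small host-side C++ model. This is a minimal sketch, not AMD's kernel: it assumes 64 lanes per wave and FP32 accumulators (one VGPR per element per lane), and the `interleave` helper, its even-spread policy, and the `Op` enum are hypothetical names introduced only for illustration. The actual issue order in `interleaved_block` is hand-tuned and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumptions (not stated in the summary): 64 lanes per wave, and FP32
// accumulators that occupy one 32-bit VGPR per element per lane.
constexpr int kLanesPerWave = 64;

// VGPRs each lane needs to hold a rows x cols accumulator tile.
constexpr int accumulator_vgprs_per_lane(int rows, int cols) {
    return rows * cols / kLanesPerWave;
}

// 4-wave: under these assumptions the 128x128 tile takes 256 of 512 VGPRs.
static_assert(accumulator_vgprs_per_lane(128, 128) == 256, "4-wave tile");
// 8-wave: under these assumptions the 64x128 tile takes 128 of 256 VGPRs.
static_assert(accumulator_vgprs_per_lane(64, 128) == 128, "8-wave tile");

enum class Op { Mfma, LdsLoad, GlobalLoad };

// Hypothetical scheduler: spreads the memory operations evenly across the
// MFMA stream. In a HIP kernel, each boundary in the resulting order could be
// pinned with __builtin_amdgcn_sched_barrier(0); the mask value 0 (nothing may
// cross) is an assumption here, the summary only writes sched_barrier(x).
std::vector<Op> interleave(int mfma, int lds, int global) {
    assert(mfma > 0);
    std::vector<Op> schedule;
    const int mem_total = lds + global;
    int mem_issued = 0;
    int global_issued = 0;
    for (int i = 0; i < mfma; ++i) {
        schedule.push_back(Op::Mfma);
        // Memory ops that should have been issued once MFMA i is in flight.
        const int target = (i + 1) * mem_total / mfma;
        while (mem_issued < target) {
            // Keep global loads at their proportional share of memory ops.
            const bool pick_global =
                global_issued < (mem_issued + 1) * global / mem_total;
            schedule.push_back(pick_global ? Op::GlobalLoad : Op::LdsLoad);
            global_issued += pick_global ? 1 : 0;
            ++mem_issued;
        }
    }
    return schedule;
}

// Number of occurrences of `op` in the schedule.
int count_ops(const std::vector<Op>& schedule, Op op) {
    return static_cast<int>(std::count(schedule.begin(), schedule.end(), op));
}

// Longest run of memory operations between two MFMA instructions.
int longest_memory_run(const std::vector<Op>& schedule) {
    int longest = 0;
    int run = 0;
    for (Op op : schedule) {
        run = (op == Op::Mfma) ? 0 : run + 1;
        longest = std::max(longest, run);
    }
    return longest;
}
```

Calling `interleave(16, 8, 4)` with the counts quoted for `interleaved_block` gives a 28-entry order in which every memory operation sits between two MFMA instructions or after the last one, which is the kind of placement the scheduling barriers are meant to preserve.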