Hardware
AMD Introduces 4-Wave Interleave FP8 GEMM for Enhanced Performance
AMD's 4-wave interleave FP8 GEMM design doubles the VGPRs available to each wave from 256 to 512, enough to hold a full 128×128 output tile in registers. It builds on the earlier 8-wave ping-pong implementation.
AMD has introduced a new algorithm design for FP8 GEMM (General Matrix Multiply) operations, called 4-wave interleave, which aims to outperform its earlier 8-wave ping-pong implementation. The 4-wave approach places one wave on each SIMD unit, and that wave issues both MFMA (Matrix Fused Multiply-Add) and memory instructions in a carefully crafted order. In the 8-wave kernel, by contrast, waves alternate between memory and MFMA instructions.

With only one wave resident per SIMD, that wave can use all 512 VGPRs (vector general-purpose registers), twice the 256 available to each wave in the 8-wave kernel. The register file itself is unchanged; the per-wave budget doubles because it is no longer shared between two waves. With the larger budget, a wave can hold a complete 128×128 output tile in registers at once, compared to 64×128 in the 8-wave case.

The two designs also differ in implementation complexity. The 8-wave kernel relies on conditional __builtin_amdgcn_s_barrier() calls to make the waves alternate, whereas the 4-wave kernel uses a finer-grained software pipeline to overlap memory and MFMA instructions within a single wave.

The core of the kernel is the interleaved_block function, which issues 16 MFMA instructions to compute a 64×64 part of C, alongside 8 LDS → register loads and 4 global → LDS loads. The key mechanism is __builtin_amdgcn_sched_barrier(x), which prevents the compiler from reordering instructions across the barrier, so the memory instructions stay where they were placed between MFMA operations.

*Source: [amd](https://rocm.blogs.amd.com/software-tools-optimization/4wave-fp8gemm/README.html)*
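
The link between the VGPR budget and the output tile size can be checked with a little arithmetic. The host-side sketch below is illustrative only and is not taken from AMD's kernel: it assumes a 64-lane wave, 32-bit accumulators that each occupy one VGPR in one lane, and a tile spread evenly across the lanes. The names `accumulator_vgprs` and `spare_vgprs` are hypothetical.

```cpp
#include <cassert>

// Assumptions for illustration (not stated in the source):
//  - a wave has 64 lanes,
//  - each output element is a 32-bit accumulator held in one VGPR of one lane,
//  - the tile is spread evenly across the lanes.
const int kLanesPerWave = 64;

// VGPRs per lane needed to hold a rows x cols accumulator tile.
int accumulator_vgprs(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    assert((rows * cols) % kLanesPerWave == 0);
    return (rows * cols) / kLanesPerWave;
}

// VGPRs per lane left over for A/B operands, addresses, and loop state
// once the accumulator tile has been allocated out of `budget`.
int spare_vgprs(int budget, int rows, int cols) {
    return budget - accumulator_vgprs(rows, cols);
}
```

Under these assumptions, a 128×128 tile needs 256 VGPRs per lane, which leaves 256 spare out of a 512-register budget. A 64×128 tile needs 128, leaving 128 spare out of 256. A 128×128 tile in a 256-register budget would leave nothing for operands, which is consistent with the 8-wave kernel settling for the smaller tile.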
Key points
- AMD's 4-wave interleave FP8 GEMM design doubles the VGPRs available to each wave (512 instead of 256), enabling full 128×128 output tile storage in registers.
- The 4-wave approach places one wave per SIMD unit, allowing it to issue both MFMA and memory instructions in a carefully crafted order.
- The 4-wave kernel uses a finer-grained software pipeline to overlap memory and MFMA instructions, whereas the 8-wave kernel relies on conditional __builtin_amdgcn_s_barrier() calls to make waves alternate.
- The interleaved_block function issues 16 MFMA instructions to compute a 64×64 part of C, alongside 8 LDS → register loads and 4 global → LDS loads.
- The key mechanism is __builtin_amdgcn_sched_barrier(x), which prevents the compiler from reordering instructions across the barrier, ensuring memory instructions are placed between MFMA operations.
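
To make the interleaving idea concrete, the sketch below models an issue order on the host: 16 MFMA slots with 8 LDS loads and 4 global loads spread through the gaps between them. It is a hypothetical model, not AMD's interleaved_block. The even-spacing policy, the choice to issue LDS loads before global loads, and the names `Op`, `build_schedule`, and `memory_ops_isolated` are all assumptions made for illustration. In device code, the boundaries between entries are where __builtin_amdgcn_sched_barrier(x) would keep the compiler from moving instructions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Mfma, LdsLoad, GlobalLoad };

// Build a hypothetical issue order that starts and ends on an MFMA and
// spreads the memory operations as evenly as possible over the gaps
// between consecutive MFMAs. LDS loads are issued first, then global loads
// (an arbitrary choice for this sketch).
std::vector<Op> build_schedule(int mfma, int lds_loads, int global_loads) {
    assert(mfma >= 2 && lds_loads >= 0 && global_loads >= 0);
    const int mem_total = lds_loads + global_loads;
    const int gaps = mfma - 1;  // slots between consecutive MFMAs
    std::vector<Op> order;
    int mem_issued = 0;
    for (int i = 0; i < mfma; ++i) {
        order.push_back(Op::Mfma);
        if (i == gaps) break;  // last MFMA: nothing follows it
        // Memory ops that should have been issued by the end of gap i:
        // ceil((i + 1) * mem_total / gaps).
        const int target = ((i + 1) * mem_total + gaps - 1) / gaps;
        while (mem_issued < target) {
            order.push_back(mem_issued < lds_loads ? Op::LdsLoad
                                                   : Op::GlobalLoad);
            ++mem_issued;
        }
    }
    return order;
}

// True if the order begins and ends with an MFMA and no two memory
// operations are adjacent, i.e. every memory op sits between two MFMAs.
bool memory_ops_isolated(const std::vector<Op>& order) {
    if (order.empty()) return false;
    for (std::size_t i = 1; i < order.size(); ++i) {
        if (order[i] != Op::Mfma && order[i - 1] != Op::Mfma) return false;
    }
    return order.front() == Op::Mfma && order.back() == Op::Mfma;
}
```

Calling `build_schedule(16, 8, 4)` yields 28 entries in which each of the 12 memory operations sits alone between two MFMAs. The real kernel's ordering is hand-tuned and may look quite different; the sketch only shows why 12 memory operations fit comfortably into the 15 gaps between 16 MFMAs.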