AMD Introduces 4-Wave Interleave FP8 GEMM for Enhanced Performance

AMD's 4-wave interleave FP8 GEMM design runs one wave per SIMD unit, doubling the number of registers available to each wave so that it can hold a full 128×128 output tile. The design builds on AMD's earlier 8-wave ping-pong implementation.

AMD has introduced a new algorithm design for FP8 GEMM (General Matrix Multiply) operations, called 4-wave interleave, which aims to improve performance over its earlier 8-wave ping-pong implementation. The 4-wave approach places one wave on each SIMD unit, and that wave issues both MFMA (Matrix Fused Multiply-Add) and memory instructions in a carefully crafted order. In the 8-wave kernel, by contrast, waves alternate between issuing memory instructions and issuing MFMA instructions. Because each SIMD hosts a single wave, that wave can access all 512 VGPRs, compared to 256 in the 8-wave kernel. With twice as many registers per wave, the wave can hold a complete 128×128 output register tile at once, compared to 64×128 in the 8-wave case.
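The register arithmetic behind those tile sizes can be checked with a short sketch. It assumes 64 lanes per wave and FP32 accumulators that occupy one VGPR per value per lane; neither figure is stated in this article, and the function names are hypothetical.

```cpp
#include <cassert>

// Assumptions (not stated in the article): a wave has 64 lanes, and each
// output element is an FP32 accumulator that occupies one VGPR in one lane.
constexpr int kLanesPerWave = 64;

// VGPRs each lane needs to hold a rows x cols accumulator tile owned by one wave.
int accumulator_vgprs(int rows, int cols) {
    const int elements = rows * cols;
    assert(elements % kLanesPerWave == 0 && "tile must divide evenly across lanes");
    return elements / kLanesPerWave;
}

// VGPRs left for input operands, addresses, and temporaries once the
// accumulator tile has been placed in a per-wave budget.
int remaining_vgprs(int budget, int rows, int cols) {
    return budget - accumulator_vgprs(rows, cols);
}
```

Under these assumptions, `accumulator_vgprs(128, 128)` is 256, so a 512-VGPR wave keeps half of its registers free for operands. With a 256-VGPR budget the same tile would leave nothing, which is consistent with the 8-wave kernel holding a 64×128 tile (128 VGPRs) instead.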

The 4-wave and 8-wave designs also differ in implementation complexity. The 8-wave kernel relies on conditional __builtin_amdgcn_s_barrier() calls to make the waves alternate, whereas the 4-wave kernel uses a finer-grained software pipeline to overlap memory and MFMA instructions within a single wave. The core of the kernel is the interleaved_block function, which issues 16 MFMA instructions to compute a 64×64 part of C, alongside 8 LDS → register loads and 4 global → LDS loads. The key mechanism is __builtin_amdgcn_sched_barrier(x), which prevents the compiler from reordering instructions across the barrier, so the memory instructions stay between the MFMA operations where the kernel author placed them.
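To make the instruction mix concrete, here is a host-side sketch that builds one possible issue order for 16 MFMA operations, 8 LDS loads, and 4 global loads. It is only a model: the article does not give the actual order used in interleaved_block, and the `Op` type and both functions are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction kinds, for illustration only.
enum class Op { Mfma, LdsLoad, GlobalLoad };

// Build one possible issue order that spreads the memory operations evenly
// between the MFMA operations. This is NOT AMD's actual schedule.
std::vector<Op> interleave_schedule(int mfma, int lds_loads, int global_loads) {
    assert(mfma > 0 && lds_loads >= 0 && global_loads >= 0);
    const int mem_total = lds_loads + global_loads;
    std::vector<Op> order;
    order.reserve(static_cast<std::size_t>(mfma + mem_total));
    int mem_issued = 0;
    for (int i = 0; i < mfma; ++i) {
        order.push_back(Op::Mfma);
        // In a real kernel, a scheduling barrier placed between groups like
        // this one is what stops the compiler from regrouping the operations.
        const int target = (i + 1) * mem_total / mfma;  // memory ops due so far
        for (; mem_issued < target; ++mem_issued) {
            // LDS loads are issued first, then global loads (arbitrary choice).
            order.push_back(mem_issued < lds_loads ? Op::LdsLoad : Op::GlobalLoad);
        }
    }
    return order;
}

// Longest run of memory operations with no MFMA in between.
int longest_memory_run(const std::vector<Op>& order) {
    int longest = 0;
    int current = 0;
    for (Op op : order) {
        current = (op == Op::Mfma) ? 0 : current + 1;
        longest = std::max(longest, current);
    }
    return longest;
}
```

Calling `interleave_schedule(16, 8, 4)` yields 28 operations in which no two memory operations are adjacent. That is the property a hand-ordered pipeline of this kind is after: each load is issued while matrix work is still in flight.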

Source: AMD

Key points

AMD's 4-wave interleave FP8 GEMM design doubles the registers available to each wave (512 VGPRs instead of 256), enabling full 128×128 output tile storage.
The 4-wave approach places one wave per SIMD unit, allowing it to issue both MFMA and memory instructions in a carefully crafted order.
The 4-wave kernel uses a finer-grained software pipeline to overlap memory and MFMA instructions, whereas the 8-wave kernel relies on conditional __builtin_amdgcn_s_barrier() calls to alternate waves.
The interleaved_block function issues 16 MFMA instructions to compute a 64×64 part of C, alongside 8 LDS → register loads and 4 global → LDS loads.
The key mechanism is __builtin_amdgcn_sched_barrier(x), which prevents the compiler from reordering instructions across the barrier, keeping memory instructions between MFMA operations.

WRITTEN BY

Sam Bergstrom

AI Infrastructure & Hardware

Sam specializes in AI chips, data centers, and training infrastructure.
