AMD has introduced a new optimization technique to speed up large language model (LLM) inference on its GPUs. The focus is decode-time latency, which is critical for interactive applications such as chatbots and real-time assistants. The optimization targets GEMM (general matrix multiply) operations, which dominate LLM inference, and specifically the small-M, large-K GEMMs that are common in the decoding phase. Speeding up these operations is intended to deliver faster response times and a smoother user experience.
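A rough back-of-the-envelope calculation helps show why small-M, large-K shapes are awkward for a GPU. The sketch below computes the arithmetic intensity (FLOPs per byte moved) of a BF16 GEMM and the number of output tiles available to spread across the device. K = 7168 matches the decode grid cited in the benchmarks; the values of M and N, the tile sizes, and the helper name `gemm_stats` are illustrative assumptions, not figures or code from AMD.

```python
import math

BYTES_PER_BF16 = 2  # BF16 stores each element in 2 bytes


def gemm_stats(m, n, k, tile_m=16, tile_n=64):
    """Return (arithmetic intensity, output tile count) for C[m,n] = A[m,k] @ B[k,n].

    tile_m and tile_n are hypothetical output tile sizes chosen for illustration.
    """
    flops = 2 * m * n * k  # one multiply and one add per (i, j, k) triple
    # Ideal traffic: read A and B once, write C once.
    bytes_moved = BYTES_PER_BF16 * (m * k + k * n + m * n)
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return flops / bytes_moved, tiles


# Assumed decode-like shape: only a few tokens in flight, long reduction.
decode = gemm_stats(m=4, n=2048, k=7168)
# Assumed prefill-like shape for contrast: thousands of tokens at once.
prefill = gemm_stats(m=4096, n=2048, k=7168)

print(f"decode : {decode[0]:.1f} FLOPs/byte, {decode[1]} output tiles")
print(f"prefill: {prefill[0]:.1f} FLOPs/byte, {prefill[1]} output tiles")
```

Under these assumptions the decode shape yields roughly 4 FLOPs per byte and only 32 output tiles, against more than 1,000 FLOPs per byte and 8,192 tiles for the prefill-like shape. With so few independent tiles, tiling the output alone cannot keep a large GPU busy, which is the motivation for also dividing the K dimension.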
The new approach, called LDS-Pipelined Split-K GEMM, splits the long K reduction across multiple cooperative thread arrays (CTAs, the GPU's workgroups) and across warp groups within each CTA. A multi-stage pipeline through LDS (Local Data Share, the on-chip shared memory of AMD GPUs) keeps the compute units fed while the next tiles are loaded. According to AMD, the optimization significantly reduces the overhead of repeated decode GEMMs, a key contributor to latency in LLM serving. The technique is implemented as an AITER FlyDSL kernel family, which allows shape-specialized variants tailored to the model dimensions encountered during decoding.
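The split-K idea itself can be sketched in a few lines of plain Python. This is a minimal sequential illustration, not AMD's kernel: it leaves out the LDS pipeline, the warp-group split, and any AITER FlyDSL code, and the function names `matmul_slice` and `split_k_gemm` are hypothetical. On a GPU, each K slice would be handled by a different CTA in parallel, and the partial results would then be reduced.

```python
def matmul_slice(a, b, k_start, k_end):
    """Partial product of A (M x K) and B (K x N) over k in [k_start, k_end)."""
    m, n = len(a), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(k_start, k_end))
    return out


def split_k_gemm(a, b, num_splits):
    """Compute A @ B by summing partial products over num_splits slices of K."""
    k_total = len(b)
    chunk = -(-k_total // num_splits)  # ceiling division
    m, n = len(a), len(b[0])
    result = [[0] * n for _ in range(m)]
    for s in range(num_splits):  # on a GPU, each iteration maps to its own CTA
        k_start = s * chunk
        k_end = min(k_start + chunk, k_total)
        partial = matmul_slice(a, b, k_start, k_end)
        # Reduction step: accumulate this slice's partial sums into the output.
        for i in range(m):
            for j in range(n):
                result[i][j] += partial[i][j]
    return result


# Small-M, large-K toy shape (M=2, K=28, N=3) with integer data for exact checks.
M, K, N = 2, 28, 3
a = [[(i + k) % 5 for k in range(K)] for i in range(M)]
b = [[(k * j + 1) % 7 for j in range(N)] for k in range(K)]

reference = matmul_slice(a, b, 0, K)  # ordinary GEMM over the full K range
split = split_k_gemm(a, b, num_splits=4)
print(split == reference)
```

Because addition over K is associative, the slices can be computed independently and summed in any order, which is what exposes the extra parallelism. The cost is the additional reduction step; with floating-point data such as BF16, the summation order can also change the result slightly.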
The optimization was benchmarked against existing solutions, namely hipBLASLt, AITER Triton, and AITER ASM, and achieved an average latency improvement of 1.64x on a K = 7168 decode grid. It also showed a 1.49x improvement on other BF16 model-shape tests. AMD's approach is based on real-world model shape traces rather than synthetic data, so it targets the shapes that actually occur in practical LLM serving.
Source: AMD