AMD has deployed an improved version of TurboQuant, a KV-cache compression algorithm, on its GPUs to enhance large language model (LLM) inference performance. The company's work focuses on translating the algorithm into a production-ready deployment, addressing practical challenges in accuracy, kernel performance, and serving behavior. TurboQuant is designed to reduce the memory footprint of KV caches, which become a bottleneck in agentic and long-context workloads. This optimization is crucial for improving latency and throughput in multi-turn conversations and other resource-intensive tasks.
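To give a sense of why the KV cache becomes the bottleneck, the standard size arithmetic for an attention cache can be sketched as follows. The model shape and the bit widths below are hypothetical values chosen for easy arithmetic, not figures from AMD's write-up, and the sketch ignores any per-block scale or metadata overhead a real compressed format would carry.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_element):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 accounts for storing both K and V in every layer.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len
    return elements * bits_per_element // 8


# Hypothetical model shape (an assumption, not taken from the source).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=131072)

for bits in (16, 8, 4):
    gib = kv_cache_bytes(bits_per_element=bits, **shape) / 2**30
    print(f"{bits:>2}-bit KV cache: {gib:.1f} GiB per sequence")
```

Under these assumptions a single 128K-token sequence needs 16 GiB at 16 bits per element, 8 GiB at 8 bits, and 4 GiB at 4 bits, so every halving of the element width doubles the number of contexts that fit in the same memory.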

Deployed on AMD GPUs via vLLM, TurboQuant improves both latency and throughput. In a multi-turn agentic workload test, it achieved a 17.1× faster time-to-first-token (TTFT) and 1.63× higher total throughput compared to FP8 baselines. The gains are attributed to better cache residency and hit rates: a smaller KV cache lets more contexts stay resident, so fewer evicted contexts have to be re-prefilled. The results show that TurboQuant is most effective where KV-cache capacity, rather than compute, limits performance.
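The link between cache capacity and hit rate can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions: it uses a plain LRU policy, treats each session's context as one equally sized cache entry, and visits sessions round-robin. It is not vLLM's actual eviction logic, and `simulate_hit_rate` is a hypothetical helper.

```python
from collections import OrderedDict


def simulate_hit_rate(num_sessions, turns_per_session, capacity):
    """Fraction of turns whose context is still cached (toy LRU model)."""
    cache = OrderedDict()  # session id -> placeholder, ordered by recency
    hits = 0
    total = 0
    for _turn in range(turns_per_session):
        for session in range(num_sessions):  # round-robin over sessions
            total += 1
            if session in cache:
                hits += 1
                cache.move_to_end(session)  # mark as most recently used
            else:
                # Miss: the context must be re-prefilled and cached again.
                cache[session] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total


# Same workload, two capacities (in units of "contexts that fit").
print(simulate_hit_rate(num_sessions=8, turns_per_session=10, capacity=4))
print(simulate_hit_rate(num_sessions=8, turns_per_session=10, capacity=8))
```

In this toy setup, a cache that holds 4 of the 8 contexts never hits, because each context is evicted before its session returns, while a cache that holds all 8 misses only on the first turn. Real workloads are less extreme, but the sketch shows why fitting more contexts in memory can change TTFT far more than raw compute does.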

AMD outlines key refinements to the original TurboQuant algorithm, including using a Walsh-Hadamard rotation in place of a random rotation and skipping boundary layers for full-attention models. These adjustments aim to balance compression, accuracy, and performance while maintaining system efficiency. AMD's implementation also includes custom kernels optimized for ROCm, which make the algorithm practical for real-world deployments.
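As a rough illustration of the rotation step, the following is a minimal pure-Python sketch of a normalized fast Walsh-Hadamard transform followed by a simple quantizer. It is not AMD's kernel: the symmetric uniform quantizer is a stand-in for whatever quantizer TurboQuant actually uses, no random sign flips are applied because the source does not say whether AMD's variant uses them, and the function names are hypothetical. The appeal of this transform over a dense random rotation is cost: it needs only additions and subtractions in O(d log d) steps and no stored rotation matrix.

```python
import math


def fwht(values):
    """Normalized fast Walsh-Hadamard transform (orthonormal, self-inverse)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly pass: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform norm-preserving
    return [v * scale for v in out]


def quantize_uniform(values, bits):
    """Symmetric uniform quantizer; a stand-in, not TurboQuant's quantizer."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values) or 1.0  # avoid a zero step size
    step = peak / levels
    return [round(v / step) * step for v in values]


def compress_roundtrip(values, bits):
    """Rotate, quantize in the rotated domain, then rotate back."""
    return fwht(quantize_uniform(fwht(values), bits))
```

For example, a 16-element vector whose only nonzero entry is 8.0 is rotated into 16 entries of magnitude 2.0: the outlier's energy is spread evenly across coordinates, which is what makes a low-bit per-coordinate quantizer workable.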

Source: AMD