
AMD Deploys TurboQuant on GPUs for Efficient LLM Inference

AMD demonstrates TurboQuant, a KV-cache compression algorithm, on its GPUs, reporting up to ~3.6× end-to-end speedup over the open-source vLLM TurboQuant baseline for long-context LLM workloads.


AMD has deployed a refined version of TurboQuant, a KV-cache compression algorithm, on its GPUs to improve large language model (LLM) inference performance. The work focuses on turning the algorithm into a production-ready deployment and addresses practical challenges in accuracy, kernel performance, and serving behavior. TurboQuant reduces the memory footprint of the KV cache, which becomes a bottleneck in agentic and long-context workloads. A smaller cache improves latency and throughput in multi-turn conversations and other memory-intensive tasks.
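To see why the KV cache becomes a bottleneck at long context, a rough footprint estimate helps. The sketch below is not taken from AMD's write-up: the model shape is hypothetical, the helper name `kv_cache_bytes` is made up for illustration, and the formula ignores quantization metadata such as scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Approximate KV-cache size: keys and values for every layer, head, and token."""
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens  # 2 = keys + values
    return values * bits_per_value / 8


# Hypothetical model shape and context length, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=128_000)

fp8_bytes = kv_cache_bytes(**shape, bits_per_value=8)
q4_bytes = kv_cache_bytes(**shape, bits_per_value=4)

print(f"FP8 cache  : {fp8_bytes / 2**30:.2f} GiB per sequence")
print(f"4-bit cache: {q4_bytes / 2**30:.2f} GiB per sequence")
```

Under these assumptions, halving the bits per value halves the cache size for a single long sequence, so roughly twice as much context fits in the same GPU memory.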

Deployed on AMD GPUs via vLLM, TurboQuant shows significant performance gains. In a multi-turn agentic workload test, it achieved 17.1× faster time-to-first-token (TTFT) and 1.63× higher total throughput than FP8 baselines. The gains are attributed to better cache residency and hit rates: more context stays in the cache, so fewer evicted contexts have to be prefilled again. The results show that TurboQuant is most effective where KV-cache capacity, rather than compute, limits performance.
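The link between cache capacity and hit rate can be illustrated with a toy model. This is a deliberately simplified sketch, not vLLM's prefix cache: it assumes whole sessions are cached and evicted as units under an LRU policy, and that sessions take turns in strict round-robin order, which is a worst case for LRU. The function name `hit_rate` and all numbers are hypothetical.

```python
from collections import OrderedDict


def hit_rate(num_sessions, turns_per_session, capacity_sessions):
    """Round-robin multi-turn traffic against an LRU cache that holds whole sessions."""
    cache = OrderedDict()  # session id -> True, ordered least to most recently used
    hits = total = 0
    for _ in range(turns_per_session):
        for session in range(num_sessions):
            total += 1
            if session in cache:
                hits += 1
                cache.move_to_end(session)  # mark as most recently used
            else:
                # Miss: this session's context would have to be prefilled again.
                cache[session] = True
                if len(cache) > capacity_sessions:
                    cache.popitem(last=False)  # evict the least recently used session
    return hits / total


# Eight concurrent sessions; compression is assumed to double how many fit in memory.
baseline = hit_rate(num_sessions=8, turns_per_session=10, capacity_sessions=4)
compressed = hit_rate(num_sessions=8, turns_per_session=10, capacity_sessions=8)
print(baseline, compressed)
```

In this toy setup the smaller cache evicts every session before its next turn arrives, while the larger cache misses only on each session's first turn. Real workloads are less stark, but the mechanism is the same one the reported TTFT gain is attributed to.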

The study outlines key refinements to the original TurboQuant algorithm, including using a Walsh-Hadamard rotation in place of a random rotation and skipping boundary layers for full-attention models. These adjustments aim to balance compression, accuracy, and performance. AMD's implementation also includes custom kernels optimized for ROCm, which make the algorithm practical in real-world deployments.
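A Walsh-Hadamard rotation needs only additions and subtractions in O(n log n) steps, whereas a dense random rotation is an O(n²) matrix multiply, which is one plausible reason it suits GPU kernels. The NumPy sketch below shows the general idea of rotating a vector before low-bit quantization. It is not AMD's kernel or the TurboQuant quantizer: the per-vector symmetric 4-bit scheme and the helper names `fwht`, `quantize_4bit`, and `dequantize` are simplifying assumptions.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along the last axis (power-of-two length)."""
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Split each block of 2*h entries into halves a and b, then write (a + b, a - b).
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(x.shape)
        h *= 2
    return x / np.sqrt(n)


def quantize_4bit(x):
    """Symmetric uniform 4-bit quantization with one scale per vector (a simplification)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero for all-zero input
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale


def dequantize(codes, scale):
    return codes * scale


rng = np.random.default_rng(0)
x = rng.standard_normal(128)
x[5] = 50.0  # a single outlier channel dominates the quantization scale

direct = dequantize(*quantize_4bit(x))
# The orthonormal transform is its own inverse, so applying it again undoes the rotation.
restored = fwht(dequantize(*quantize_4bit(fwht(x))))

err_direct = np.linalg.norm(x - direct)
err_rotated = np.linalg.norm(x - restored)
print(f"error without rotation: {err_direct:.3f}")
print(f"error with rotation   : {err_rotated:.3f}")
```

The rotation spreads the outlier's energy across all coordinates, so the quantization grid is no longer stretched by one large value and the reconstruction error drops.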

Source: AMD

Key points

AMD enables and optimizes TurboQuant for ROCm in vLLM on AMD GPUs.
TurboQuant achieves up to ~3.6× end-to-end speedup over the open-source vLLM TurboQuant baseline.
TurboQuant improves cache residency and hit rate, reducing TTFT and increasing throughput in multi-turn agentic workloads.
The algorithm uses Walsh-Hadamard rotation instead of random rotation for better kernel performance and accuracy.
Skipping boundary layers for full-attention models helps recover accuracy at modest compression cost.
TurboQuant4/4 is recommended for production use to balance compression, accuracy, and performance.
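The "modest compression cost" of skipping boundary layers can be estimated with simple arithmetic. The sketch below assumes that "boundary layers" means the first and last few layers and that skipped layers stay at an 8-bit baseline. Neither detail is confirmed by the source, and the layer counts and the helper name `layer_bits` are hypothetical.

```python
def layer_bits(num_layers, skip_each_end, compressed_bits, baseline_bits):
    """Per-layer KV bit widths when the first and last `skip_each_end` layers stay uncompressed."""
    return [
        baseline_bits
        if i < skip_each_end or i >= num_layers - skip_each_end
        else compressed_bits
        for i in range(num_layers)
    ]


# Hypothetical 32-layer model, keeping two layers at each end at 8 bits.
bits = layer_bits(num_layers=32, skip_each_end=2, compressed_bits=4, baseline_bits=8)
average_bits = sum(bits) / len(bits)
compression_vs_8bit = 8 / average_bits

print(f"average bits per value: {average_bits}")
print(f"compression vs 8-bit  : {compression_vs_8bit:.2f}x")
```

With these numbers the average rises from 4 to 4.5 bits per value, so the cache shrinks by about 1.78× instead of 2× relative to an 8-bit cache.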


WRITTEN BY

Theo Almeida

AI Software & Developer Tools

Theo covers AI software, developer tools, frameworks, and the platforms builders use every day.
