
AMD Introduces Eagle3 Speculative Decoding for AI Inference

AMD announces Eagle3 speculative decoding for AI inference on Instinct MI355X GPUs, boosting throughput for large models like Kimi-K2.5 and MiniMax-M2.5.


AMD has introduced Eagle3 speculative decoding to speed up large language model (LLM) inference on its Instinct MI355X GPUs. The technique targets higher throughput for models such as Kimi-K2.5 and MiniMax-M2.5 by reducing the number of expensive target-model decoding steps. In speculative decoding, a smaller draft model proposes several tokens, and the target model then verifies them in a single pass, so several tokens can be committed for roughly the cost of one target-model step. According to AMD, the approach preserves the exact output distribution of the target model while improving efficiency. The Eagle3 work is part of a broader effort to optimize AI serving and benchmarking on AMD hardware: the AMD Quark team enabled Eagle3 for these models, integrated it with the InferenceX benchmark, and added FP8 quantization support for better performance. Source: AMD
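The draft-then-verify loop can be illustrated with a toy example. The following is a minimal sketch, not AMD's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the two models, decoding is greedy, and the per-position checks inside one loop iteration stand in for what would be a single batched forward pass of the target model on real hardware.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token (assumption).
Model = Callable[[List[int]], int]


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the large target model (hypothetical rule)."""
    return (sum(prefix) * 7 + len(prefix)) % 11


def draft_next(prefix: List[int]) -> int:
    """Toy stand-in for the draft model: agrees with the target except
    when the prefix length is a multiple of 4."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 11


def plain_greedy(prompt: List[int], num_new: int, target: Model) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_greedy(
    prompt: List[int], num_new: int, k: int, target: Model, draft: Model
) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of verification passes."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target model checks every proposed position. This counts
        #    as one pass because a real model scores all k positions at once.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # All k drafts accepted: the same pass yields one extra token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[: len(prompt) + num_new], target_passes


if __name__ == "__main__":
    tokens, passes = speculative_greedy([1, 2, 3], 20, 4, target_next, draft_next)
    print("same as baseline:", tokens == plain_greedy([1, 2, 3], 20, target_next))
    print("verification passes for 20 tokens:", passes)
```

Because every committed token is either confirmed or supplied by the target model, the sequence matches target-only greedy decoding; the saving comes from needing fewer target passes than generated tokens.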

Autoregressive decoding is a bottleneck because the target model produces only one token per forward pass. Speculative decoding addresses this by letting a draft model propose several tokens, which the target model then validates together, reducing the number of target-model iterations and increasing throughput. The key metric is the acceptance rate: the share of draft tokens that the target model accepts. Draft tokens that are consistent with the target model's distribution are accepted in a single verification step; at the first rejected token, the target model's own choice is used instead and the remaining draft tokens are discarded, so the final output is unaffected. A higher acceptance rate therefore means fewer target-model passes per generated token. High-quality draft models raise that rate: Eagle3, for example, uses multi-layer features from the target model to improve accuracy and speed. Source: AMD
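For sampled (non-greedy) decoding, the standard way to keep the target distribution intact is a rejection-sampling rule from the speculative sampling literature; the article does not say which exact rule AMD's stack uses, so the sketch below is an assumption. Here `p` is the target model's next-token distribution, `q` is the draft model's, and all function names are hypothetical. The last helper uses the common simplification that each draft token is accepted independently with probability `alpha`.

```python
import random
from typing import List, Tuple


def acceptance_rate(p: List[float], q: List[float]) -> float:
    """Probability that a token drawn from q is accepted: sum of min(p, q)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def verify_token(
    p: List[float], q: List[float], drafted: int, rng: random.Random
) -> Tuple[int, bool]:
    """Accept the drafted token with probability min(1, p/q); otherwise
    resample from the residual distribution max(p - q, 0), renormalized."""
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    # A rejection implies p < q somewhere, so the residual has positive mass.
    token = rng.choices(range(len(p)), weights=residual)[0]
    return token, False


def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target pass with k draft tokens,
    assuming each is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def simulate(
    p: List[float], q: List[float], trials: int, seed: int = 0
) -> Tuple[List[float], float]:
    """Empirical output frequencies and acceptance rate over many trials."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(trials):
        drafted = rng.choices(range(len(q)), weights=q)[0]
        token, ok = verify_token(p, q, drafted, rng)
        counts[token] += 1
        accepted += ok
    return [c / trials for c in counts], accepted / trials


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy target distribution
    q = [0.2, 0.5, 0.3]  # toy draft distribution
    freqs, rate = simulate(p, q, trials=100_000)
    print("output frequencies:", freqs)
    print("empirical vs. analytic acceptance:", rate, acceptance_rate(p, q))
    print("expected tokens per pass (k=4):", expected_tokens_per_pass(rate, 4))
```

In this sketch the output frequencies track `p` rather than `q`, which is the sense in which the target model's distribution is preserved, and the closer `q` is to `p`, the higher the acceptance rate and the more tokens each verification pass commits.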

Eagle3 is the latest step in a line of speculative decoding methods: Eagle introduced feature-level decoding, Eagle2 improved draft quality and acceptance rates, and Eagle3 further improves accuracy and speed. It uses a training-time test technique and combines low-, mid-, and high-level semantic features from the target model to help the draft model propose tokens that the verifier is more likely to accept. The goal is for the draft model's predictions to align closely with the target model's output behavior, so the final result is preserved while inference throughput improves. That property matters in production environments, where output quality must not degrade. Source: AMD
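One simple way to picture the multi-layer feature idea is to concatenate hidden states taken from a low, a middle, and a high layer of the target model and project them back to the hidden size before they reach the draft model. The sketch below is only an illustration under that assumption: the function name, the hand-written weights, and the tiny dimensions are hypothetical, and in a trained system the projection would be learned.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def fuse_features(
    low: Vector, mid: Vector, high: Vector, weight: Matrix, bias: Vector
) -> Vector:
    """Concatenate three layer features (3 * hidden values) and apply a
    linear projection back down to `hidden` values."""
    concat = low + mid + high
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("each weight row must have 3 * hidden entries")
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]


if __name__ == "__main__":
    # Toy hidden size of 2; the weights simply average the three layers.
    low, mid, high = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
    third = 1.0 / 3.0
    weight = [
        [third, 0.0, third, 0.0, third, 0.0],
        [0.0, third, 0.0, third, 0.0, third],
    ]
    bias = [0.0, 0.0]
    print(fuse_features(low, mid, high, weight, bias))
```

The fused vector has the same size as a single layer's hidden state, so the draft model can consume it in place of a single top-layer feature.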

Key points

AMD introduces Eagle3 speculative decoding for AI inference on Instinct MI355X GPUs.
Speculative decoding allows a smaller draft model to propose multiple tokens verified by the target model in a single pass.
The key metric in speculative decoding is the acceptance rate: the share of draft tokens that the target model accepts.
Eagle3 improves accuracy and speed by leveraging multi-layer features from the target model.
The AMD Quark team enabled Eagle3 for Kimi-K2.5 and MiniMax-M2.5, with FP8 quantization and InferenceX benchmark integration.
According to AMD, Eagle3 preserves the exact output distribution of the target model while improving inference throughput.

Source: AMD

WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.