AMD has introduced Eagle3 speculative decoding to speed up large language model (LLM) inference on its Instinct MI355X GPUs. The technique targets higher throughput for models such as Kimi-K2.5 and MiniMax-M2.5 by reducing the number of expensive target-model decoding steps. In speculative decoding, a smaller draft model proposes several tokens, which the target model then verifies in a single pass. According to AMD, this preserves the exact output distribution of the target model while improving efficiency. The Eagle3 work is part of a broader effort to optimize AI serving and benchmarking on AMD hardware: the AMD Quark team enabled Eagle3 for these models, integrated it with the InferenceX benchmark, and added support for FP8 quantization for better performance. Source: amd
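The claim that verification preserves the target model's output distribution can be illustrated with the standard accept-or-resample rule from the speculative sampling literature. The source does not describe AMD's actual verification code, so the sketch below is an assumption for illustration only; the function names `verify_token` and `speculative_sample` and the dictionary-based distributions are hypothetical.

```python
import random


def verify_token(token, p_target, q_draft, rng):
    """Accept or replace one drafted token (illustrative sketch, not AMD's code).

    p_target and q_draft map token -> probability at the current position.
    Returns (final_token, accepted_flag).
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the draft sampled this token, so q > 0
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return token, True
    # Rejected: resample from the residual distribution max(0, p - q).
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    candidates = list(residual)
    weights = [residual[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0], False


def speculative_sample(p_target, q_draft, rng):
    """Draw one token from the draft, then verify it against the target."""
    candidates = list(q_draft)
    drafted = rng.choices(candidates, weights=[q_draft[t] for t in candidates], k=1)[0]
    return verify_token(drafted, p_target, q_draft, rng)
```

Under this rule the final token is distributed according to `p_target` regardless of how good `q_draft` is; a poor draft only lowers how often the drafted token is accepted.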
Autoregressive decoding is a bottleneck because the target model must run one forward pass for every generated token. Speculative decoding reduces the number of these passes: the draft model proposes several tokens, and the target model validates them together. The key metric is the acceptance rate, the share of draft tokens that the target model accepts. When the drafted tokens are consistent with the target model's distribution, several of them are accepted in a single verification step, so fewer target-model iterations are needed per generated token. The final output still follows the target model, while inference throughput improves. A higher-quality draft model raises the acceptance rate; Eagle3 achieves this by using multi-layer features from the target model. Source: amd
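The draft-and-verify loop and the acceptance rate can be sketched as follows. This is a minimal greedy version under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for real models that map a token list to the next token, and the per-position verification calls stand in for what a real serving stack would score in one batched forward pass. It is not the AMD or Eagle3 implementation.

```python
def speculative_generate(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy draft-and-verify loop (illustrative sketch).

    Returns (tokens, acceptance_rate, target_steps), where target_steps counts
    verification rounds, each standing in for one target-model pass.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    tokens = list(prompt)
    goal = len(tokens) + max_new_tokens
    drafted = accepted = target_steps = 0
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        budget = min(k, goal - len(tokens))
        proposal = []
        for _ in range(budget):
            proposal.append(draft_next(tokens + proposal))
        drafted += budget
        # 2. The target model checks the proposal in one verification round.
        target_steps += 1
        n_ok = 0
        correction = None
        for i, tok in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if tok != expected:
                correction = expected  # first mismatch: keep the target's token
                break
            n_ok += 1
        # 3. Keep the accepted prefix, plus the correction if there was one.
        tokens.extend(proposal[:n_ok])
        if correction is not None:
            tokens.append(correction)
        accepted += n_ok
    rate = accepted / drafted if drafted else 0.0
    return tokens, rate, target_steps
```

Because every kept token equals what the target would have produced greedily, the output matches target-only decoding; a higher acceptance rate simply means fewer verification rounds for the same number of tokens. The sketch omits the extra "bonus" token that a full implementation can take from the target pass when the whole proposal is accepted.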
Eagle3 is the third step in the Eagle line of speculative decoding methods: Eagle introduced feature-level decoding, Eagle2 improved draft quality and acceptance rates, and Eagle3 further improves accuracy and speed. It uses a training-time test technique and combines low-, mid-, and high-level semantic features from the target model so that the draft model proposes tokens the verifier is more likely to accept. The goal is for the draft model's predictions to track the target model's output behavior closely, preserving the final result while raising inference throughput. This matters in production environments, where output quality must not degrade. Source: amd
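One way to picture the combination of low-, mid-, and high-level features is to concatenate hidden-state vectors from three target-model layers and project them back to the hidden size with a linear layer. The source does not say how the features are merged, so the fusion scheme, the function name `fuse_features`, and the toy dimensions below are assumptions made purely for illustration.

```python
def fuse_features(low, mid, high, weight, bias):
    """Concatenate three feature vectors and apply a linear projection.

    low, mid, high: lists of floats of length d (hypothetical layer features).
    weight: d rows of length 3 * d; bias: list of length d.
    Returns a fused vector of length d for the draft model to consume.
    """
    concat = list(low) + list(mid) + list(high)
    if len(weight) != len(bias):
        raise ValueError("weight and bias must have the same number of rows")
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("each weight row must match the concatenated length")
    # Plain matrix-vector product plus bias, written out for clarity.
    return [sum(w * x for w, x in zip(row, concat)) + b
            for row, b in zip(weight, bias)]
```

In a trained system the projection weights would be learned together with the draft model; here they are just inputs so the data flow is visible.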