AMD Enables Speculative Speculative Decoding on MI300X GPUs
AMD has enabled speculative speculative decoding (SSD) on MI300X GPUs, improving throughput-latency trade-offs for large language models.
AMD has introduced a new method for accelerating large language model (LLM) inference by enabling speculative speculative decoding (SSD) on its MI300X GPUs. SSD builds on speculative decoding (SD), in which a smaller draft model proposes multiple tokens that a larger target model then verifies in parallel. In standard SD the two steps alternate: the draft model must wait for each verification result before it can draft again. SSD reduces this sequential dependency by letting the draft model precompute multiple speculative paths while verification is still in progress. The implementation, part of AMD's ROCm ecosystem, demonstrates how SSD can hide draft latency behind target-side verification, improving performance for latency-sensitive applications. The work highlights the importance of asynchronous multi-device scheduling and tree-style speculative decoding for future inference workloads. *Source: [AMD](https://rocm.blogs.amd.com/artificial-intelligence/ssd_mi300x/README.html)*
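To make the draft-then-verify loop of standard SD concrete, here is a minimal sketch using greedy verification. Everything in it is an illustrative assumption rather than AMD's code: `target_model` and `draft_model` are toy deterministic functions standing in for real LLMs, and the verification loop stands in for what would be a single batched forward pass on the target.

```python
from typing import Callable, List

# A "model" here is a toy function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large target model.
    return (sum(prefix) * 3 + len(prefix)) % 7

def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small draft model:
    # it agrees with the target except when len(prefix) is a multiple of 4.
    guess = target_model(prefix)
    return (guess + 1) % 7 if len(prefix) % 4 == 0 else guess

def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    # 1) Draft k tokens sequentially with the cheap model.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))
    # 2) Verify. A real system scores all k positions in one target pass;
    #    this loop computes the same per-position checks one by one.
    committed: List[int] = []
    for tok in proposal:
        expected = target(prefix + committed)
        committed.append(expected)
        if tok != expected:
            return committed  # first mismatch: keep the target's correction and stop
    # All k drafts accepted: the target also yields one bonus token.
    committed.append(target(prefix + committed))
    return committed

def speculative_generate(prefix: List[int], draft: Model, target: Model,
                         k: int, n_new: int) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prefix) + n_new]

def greedy_generate(prefix: List[int], target: Model, n_new: int) -> List[int]:
    # Baseline: one target call per token, no draft model.
    out = list(prefix)
    for _ in range(n_new):
        out.append(target(out))
    return out
```

With greedy verification, every committed token is exactly what the target would have chosen, so `speculative_generate` returns the same sequence as `greedy_generate`; the draft model only affects how many tokens are committed per verification round.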
Key points
- AMD enabled speculative speculative decoding (SSD) on MI300X GPUs to improve throughput-latency trade-offs for large language models.
- SSD removes the sequential dependency between drafting and verification by allowing the draft model to precompute multiple speculative paths.
- The implementation uses five MI300X GPUs for the 70B benchmark configuration, with four for the target model and one for the draft model.
- Two correctness bugs in the FlashInfer HIP path were identified and fixed to ensure proper SSD functionality.
- The attention stack was adapted for ROCm-compatible backends, removing NVIDIA-specific dependencies.
- A dual tree-decode backend was introduced to support both high-performance and correctness-oriented execution paths.
- A `setup_rocm.sh` script was added to automate environment bring-up for reproducing SSD on MI300X.
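The precomputation of multiple speculative paths mentioned in the key points can be sketched as a cache keyed by predicted verification outcomes. This is a simplified illustration under stated assumptions, not AMD's implementation: the toy models, the `draft_ranked` helper, and the `fanout` parameter are all hypothetical, and the two phases run one after the other here, whereas a real system would run them concurrently on separate devices.

```python
from typing import Callable, Dict, List, Tuple

Ranked = Callable[[List[int]], List[int]]  # prefix -> candidate next tokens, best first

def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large target model (greedy next token).
    return (sum(prefix) * 3 + len(prefix)) % 7

def draft_ranked(prefix: List[int]) -> List[int]:
    # Hypothetical draft model returning ranked candidates. Its top choice is
    # wrong when len(prefix) is a multiple of 4, but its second choice is right.
    t = target_model(prefix)
    if len(prefix) % 4 == 0:
        return [(t + 1) % 7, t, (t + 2) % 7]
    return [t, (t + 1) % 7, (t + 2) % 7]

def draft_tokens(prefix: List[int], ranked: Ranked, k: int) -> List[int]:
    out: List[int] = []
    for _ in range(k):
        out.append(ranked(prefix + out)[0])
    return out

def verify(prefix: List[int], proposal: List[int]) -> List[int]:
    # Greedy verification: accepted prefix plus one correction or bonus token.
    committed: List[int] = []
    for tok in proposal:
        expected = target_model(prefix + committed)
        committed.append(expected)
        if tok != expected:
            return committed
    committed.append(target_model(prefix + committed))
    return committed

def precompute(prefix: List[int], proposal: List[int], ranked: Ranked,
               k: int, fanout: int) -> Dict[Tuple[int, ...], List[int]]:
    # For each possible number of accepted tokens n, guess the token the target
    # will emit next and draft the following round for that predicted outcome.
    cache: Dict[Tuple[int, ...], List[int]] = {}
    for n in range(len(proposal) + 1):
        accepted = proposal[:n]
        cands = ranked(prefix + accepted)
        if n < len(proposal):
            # A rejection at position n means the target disagreed with the draft.
            cands = [c for c in cands if c != proposal[n]]
        for c in cands[:fanout]:
            outcome = tuple(accepted + [c])
            cache[outcome] = draft_tokens(prefix + list(outcome), ranked, k)
    return cache

def ssd_generate(prefix: List[int], ranked: Ranked, k: int, n_new: int,
                 fanout: int = 1) -> Tuple[List[int], int, int]:
    out = list(prefix)
    proposal = draft_tokens(out, ranked, k)
    hits = misses = 0
    while len(out) - len(prefix) < n_new:
        # Sequential here; conceptually these two calls overlap in time.
        cache = precompute(out, proposal, ranked, k, fanout)
        committed = verify(out, proposal)
        out.extend(committed)
        cached = cache.get(tuple(committed))
        if cached is not None:
            hits += 1            # next draft was already prepared
            proposal = cached
        else:
            misses += 1          # fall back to drafting after verification
            proposal = draft_tokens(out, ranked, k)
    return out[:len(prefix) + n_new], hits, misses
```

On a cache hit the next proposal is ready the moment verification finishes, which is how draft latency can be hidden behind target-side work. On a miss the sketch falls back to ordinary drafting, so the output still matches plain greedy decoding with the target either way.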