AMD Introduces QuickReduce FP4 on MI355 for Enhanced AI Performance
AMD's QuickReduce FP4 quantization on MI355 delivers up to 4.14x speedup over RCCL for 1 GB message sizes, improving AI inference efficiency.
AMD has added FP4 quantization to QuickReduce on its MI355 platform to speed up AI inference. QuickReduce is a high-performance all-reduce library for AMD ROCm that supports inline compression. AMD reports that it runs up to 2.25x faster than RCCL on 2×MI300X and 4×MI300X configurations and, when optimized, outperforms RCCL in all multi-GPU (single-node) configurations. The company's new blog post extends that evaluation to MI355, presenting performance and accuracy results for the newer hardware.
The OCP standard defines MXFP4 with an E8M0 scaling exponent. AMD diverges from the standard and computes the scale directly in FP16 to achieve higher precision. The MI355 architecture provides dedicated FP4 assembly instructions that accelerate both quantization and dequantization, and AMD's post details how FP16 values are converted to FP4 and back using these native instructions.
The MI355 benchmarks compare three all-reduce implementations across message sizes from 4 KB to 1 GB. Latency is reported in microseconds, and speedup over RCCL is plotted to show relative performance. At TP=2, QuickReduce FP4 achieves a 4.14x speedup over RCCL for 1 GB message sizes, while at TP=8 it delivers a 1.52x speedup. FP4 and INT4 deliver comparable performance, with FP4 marginally faster in most cases.
End-to-end performance and accuracy were evaluated with the Qwen3-30B-A3B-Instruct-2507 and DeepSeek-R1-0528 models running on vLLM. The results show that FP4 and INT4 reduce time to first token (TTFT) and time per output token (TPOT) while preserving GSM8K accuracy.
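
The FP16-to-FP4 round trip with an FP16 scale can be illustrated with a small NumPy emulation. This is only a sketch under stated assumptions, not AMD's kernel: it does not use the MI355 native instructions, the function names are hypothetical, and the block size of 32, the `amax / 6` scale rule, and round-to-nearest are assumptions chosen for illustration. The only detail taken from the article is that the per-block scale is kept in FP16 rather than as an E8M0 exponent.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_MAX = 6.0
BLOCK = 32  # assumed block size for illustration; not stated in the article


def quantize_fp4(x):
    """Quantize FP16 values to 4-bit codes plus one FP16 scale per block."""
    x = np.asarray(x, dtype=np.float16).astype(np.float32).reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Scale is stored in FP16 instead of being rounded to a power of two (E8M0).
    scale = (amax / FP4_MAX).astype(np.float16)
    # Avoid dividing by zero for all-zero (or underflowed) blocks.
    safe = np.where(scale == 0, np.float16(1), scale).astype(np.float32)
    scaled = x / safe
    # Round each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    sign = (scaled < 0).astype(np.uint8)
    codes = (sign << 3) | idx.astype(np.uint8)  # bit 3 = sign, bits 0-2 = grid index
    return codes, scale


def dequantize_fp4(codes, scale):
    """Rebuild FP16 values from 4-bit codes and per-block FP16 scales."""
    mag = FP4_E2M1_GRID[codes & 0x7]
    sign = np.where((codes & 0x8) != 0, -1.0, 1.0).astype(np.float32)
    return (sign * mag * scale.astype(np.float32)).astype(np.float16).reshape(-1)


if __name__ == "__main__":
    x = np.linspace(-3, 3, 64).astype(np.float16)
    codes, scale = quantize_fp4(x)
    y = dequantize_fp4(codes, scale)
    print("max abs error:", np.abs(x.astype(np.float32) - y.astype(np.float32)).max())
```

Because the FP4 grid is coarsest between 4 and 6, the worst-case rounding error in this sketch is one scale unit per block, which is why a more precise scale matters for accuracy.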
*Source: [amd](https://rocm.blogs.amd.com/artificial-intelligence/quick-reduce-2/README.html)*
Key points
- AMD's QuickReduce FP4 quantization on MI355 delivers up to 4.14x speedup over RCCL for 1 GB message sizes.
- QuickReduce achieves up to 2.25x faster performance than RCCL on 2×MI300X and 4×MI300X configurations.
- FP4 and INT4 deliver comparable performance, with FP4 being marginally faster in most cases.
- QuickReduce supports inline compression and is integrated into popular inference frameworks vLLM and SGLang.
- The MI355 architecture provides dedicated FP4 assembly instructions to accelerate quantization and dequantization operations.
- Quantization and dequantization processes use MI355 native instructions to convert FP16 values to FP4 and back.
- End-to-end performance and accuracy evaluations show that FP4 and INT4 reduce TTFT and TPOT while preserving GSM8K accuracy.
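
To see why inline compression shrinks the data an all-reduce has to move, the per-element payload can be worked out directly. The sketch below is illustrative arithmetic only: the block size of 32 is an assumption (the article does not state the block size QuickReduce uses), and the helper name is hypothetical. It compares a 4-bit value with an FP16 scale per block against the same value with an 8-bit E8M0 scale.

```python
def bits_per_element(value_bits, scale_bits, block):
    """Average bits sent per element when one scale is shared by a block."""
    return value_bits + scale_bits / block


BLOCK = 32  # assumed block size; not taken from the article

fp4_fp16_scale = bits_per_element(4, 16, BLOCK)  # FP4 values + FP16 scale
fp4_e8m0_scale = bits_per_element(4, 8, BLOCK)   # FP4 values + E8M0 scale

print(f"FP4 + FP16 scale: {fp4_fp16_scale} bits/element")
print(f"FP4 + E8M0 scale: {fp4_e8m0_scale} bits/element")
print(f"reduction vs. FP16 payload: {16 / fp4_fp16_scale:.2f}x")
```

Under these assumptions the FP16 scale costs a quarter of a bit more per element than E8M0, a small price for the extra scale precision. Payload reduction is not the same as end-to-end speedup, since quantization and dequantization also take time.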