AMD has introduced ATOM, a new framework designed to optimize DeepSeek-V4 inference performance on its MI355X GPUs. The framework addresses two major challenges: parallelizing mixture-of-experts (MoE) communication across GPUs, and hiding that communication behind useful compute. It combines data-parallel (DP) Attention with tensor-parallel (TP) MoE and adds two key optimizations, coordinated prefill scheduling and Two-Batch Overlap (TBO). The approach differs from traditional methods such as Expert Parallel with all2all backends, which require specialized kernels and careful expert placement. According to AMD, coordinated prefill scheduling keeps the DP ranks phase-aligned, reducing the overhead of mixed-phase steps that can otherwise degrade performance significantly. The framework also extends TBO beyond all2all to standard collectives such as all_gather and reduce_scatter, so that communication and compute can overlap without a dedicated all2all backend.
ATOM's DP Attention scheduling optimization, known as PrefillDelayer, addresses phase mismatch by coordinating prefill admission across DP ranks. This prevents a single rank from entering prefill while the others remain in decode, a situation that leads to padding waste and longer synchronization waits. By keeping all ranks in the same phase (either all prefill or all decode), ATOM avoids padding and eager decode fallbacks. The result is a more consistent execution environment that supports efficient CUDA Graph replay and minimizes the overhead of mixed-phase steps. This matters most in real-world workloads, where request lengths and arrival times vary.
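The coordination idea can be illustrated with a minimal sketch. This is not ATOM's PrefillDelayer code; the names `RankState`, `choose_phase`, and `max_wait_steps`, as well as the timeout rule itself, are assumptions made for illustration. The sketch only shows the core property described above: every DP rank receives the same phase decision, and a rank with pending prefill work is delayed until the group switches together.

```python
from dataclasses import dataclass


@dataclass
class RankState:
    """Hypothetical per-rank scheduler state (not an ATOM data structure)."""
    pending_prefill: bool   # this rank has a request waiting for prefill
    waited_steps: int = 0   # decode steps the waiting request has been delayed


def choose_phase(ranks, max_wait_steps=4):
    """Return one phase ("prefill" or "decode") shared by all DP ranks.

    Assumed policy: switch to prefill when every rank has prefill work, or
    when some rank has been delayed for at least max_wait_steps. In a real
    deployment the per-rank flags would be exchanged with a collective; here
    they are simply read from a list.
    """
    wants = [r.pending_prefill for r in ranks]
    timed_out = any(
        r.pending_prefill and r.waited_steps >= max_wait_steps for r in ranks
    )
    phase = "prefill" if wants and (all(wants) or timed_out) else "decode"

    for r in ranks:
        if phase == "prefill":
            r.waited_steps = 0        # the delayed request is admitted now
        elif r.pending_prefill:
            r.waited_steps += 1       # keep decoding, remember the delay
    return phase
```

Calling `choose_phase` once per scheduling step on the same list of `RankState` objects yields a single phase for the whole group, so no rank runs prefill while its peers run decode. The timeout is one possible way to bound how long a request waits; the actual admission rule used by ATOM is not described in the text above.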
The TBO optimization in ATOM improves communication efficiency by splitting the batch into two micro-batches (ubatches) and overlapping the compute of one with the communication of the other. This reduces idle waiting and widens the overlap window. A new token-level even splitting method balances token counts across the two ubatches so that they finish at about the same time. When the split point falls inside a request, the first part of that request is exposed to the second part as a cached prefix, which preserves the same attention semantics as prefilling the full request at once. Together with coordinated prefill scheduling, this is how ATOM improves DeepSeek-V4 inference on AMD's MI355X GPUs.
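A small sketch can make the token-level split concrete. It is an illustration under stated assumptions, not ATOM's implementation: `Chunk` and `split_even` are hypothetical names, requests are assumed to have no previously cached tokens, and the batch is described only by a list of per-request token counts.

```python
from typing import List, NamedTuple, Tuple


class Chunk(NamedTuple):
    """Hypothetical description of one request's share of a micro-batch."""
    req_id: int
    start: int          # index of the first token of this chunk in the request
    length: int         # number of tokens in this chunk
    cached_prefix: int  # earlier tokens of the request treated as cached context


def split_even(seq_lens: List[int]) -> Tuple[List[Chunk], List[Chunk]]:
    """Split a prefill batch into two micro-batches with balanced token counts.

    Requests are assigned in order. The request that straddles the midpoint is
    cut at token granularity: its head goes to the first micro-batch, and its
    tail goes to the second with the head recorded as a cached prefix.
    """
    half = (sum(seq_lens) + 1) // 2   # token budget of the first micro-batch
    ub0: List[Chunk] = []
    ub1: List[Chunk] = []
    used = 0
    for req_id, n in enumerate(seq_lens):
        if n == 0:
            continue
        room = half - used
        if room >= n:                 # whole request fits in the first half
            ub0.append(Chunk(req_id, 0, n, 0))
            used += n
        elif room > 0:                # request straddles the midpoint
            ub0.append(Chunk(req_id, 0, room, 0))
            ub1.append(Chunk(req_id, room, n - room, room))
            used += room
        else:                         # first half is already full
            ub1.append(Chunk(req_id, 0, n, 0))
    return ub0, ub1
```

For example, `split_even([10, 2])` places the first 6 tokens of request 0 in the first micro-batch, and the remaining 4 tokens of request 0 (with `cached_prefix=6`) plus all of request 1 in the second, giving 6 tokens on each side. Splitting at request boundaries instead would have produced a 10/2 imbalance, which is the case the token-level method is meant to avoid.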
Source: AMD