
AMD Introduces ATOM for DeepSeek-V4 Inference on MI355X

AMD's ATOM framework improves DeepSeek-V4 inference performance on MI355X GPUs by reducing padding waste (AMD cites a figure of 86%) and improving communication efficiency.


AMD has introduced ATOM, a new framework designed to optimize DeepSeek-V4 inference on its MI355X GPUs. The framework targets two challenges: parallelizing MoE communication across GPUs and hiding that communication behind useful compute. It combines DP Attention with TP MoE and adds two optimizations, coordinated prefill scheduling and Two-Batch Overlap (TBO). The approach differs from traditional methods such as Expert Parallel with all2all backends, which require specialized kernels and careful expert placement. According to AMD, coordinated scheduling keeps DP ranks phase-aligned, reducing the overhead of mixed-phase operation, which can otherwise degrade performance significantly. The framework also extends TBO beyond all2all to standard collectives such as all_gather and reduce_scatter, so that communication can overlap with compute.
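The data flow behind DP Attention with TP MoE can be illustrated with a toy, single-process simulation. This is a sketch under stated assumptions, not ATOM's code: the function names are hypothetical, ranks are modeled as plain Python lists, and the MoE layer is replaced by a stand-in that multiplies each token by the sum of a weight vector sharded across ranks.

```python
from typing import List


def all_gather(per_rank: List[List[float]]) -> List[float]:
    """Every rank ends up with the concatenation of all ranks' tokens."""
    return [x for tokens in per_rank for x in tokens]


def reduce_scatter(partials: List[List[float]], counts: List[int]) -> List[List[float]]:
    """Sum the per-rank partial outputs, then return each rank its own slice."""
    summed = [sum(vals) for vals in zip(*partials)]
    out, start = [], 0
    for n in counts:
        out.append(summed[start:start + n])
        start += n
    return out


def dp_attention_tp_moe(per_rank_tokens: List[List[float]],
                        weight_shards: List[List[float]]) -> List[List[float]]:
    """Toy layer: y = x * sum(all weights), with the weights sharded per rank."""
    counts = [len(t) for t in per_rank_tokens]
    # DP side: each rank owns different tokens; gather them so every
    # rank sees the full batch before the tensor-parallel layer.
    gathered = all_gather(per_rank_tokens)
    # TP side: each rank applies only its own weight shard to all tokens.
    partials = [[sum(shard) * x for x in gathered] for shard in weight_shards]
    # Combine the partial results and return to the per-rank token layout.
    return reduce_scatter(partials, counts)


print(dp_attention_tp_moe([[1, 2], [3]], [[1, 2], [3, 4]]))
```

The toy ignores padding. Real collectives typically expect equal-sized buffers on every rank, so uneven per-rank token counts have to be padded before the gather.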

ATOM's DP Attention scheduling optimization, PrefillDelayer, addresses phase mismatch by coordinating prefill admission across DP ranks. This prevents a single rank from entering prefill while others remain in decode, a situation that leads to padding waste and longer synchronization waits. By keeping all ranks in the same phase (either all prefill or all decode), ATOM avoids padding and eager decode fallbacks. The result is a more consistent execution pattern that supports efficient CUDA Graph replay and minimizes mixed-phase overhead. This matters for throughput in real-world workloads, where request lengths and arrival times vary.
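One way to picture coordinated prefill admission is the sketch below. It is an illustration only: the article does not describe PrefillDelayer's actual policy, so the `RankState` type, the `choose_phase` function, and the `max_delay` timeout are all hypothetical. The `padding_waste` helper further assumes that every rank pads to the largest per-rank token count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankState:
    has_prefill: bool      # this rank has a request waiting for prefill
    waited_steps: int = 0  # how many steps its prefill has been delayed


def choose_phase(ranks: List[RankState], max_delay: int = 4) -> str:
    """Pick one phase, "prefill" or "decode", for all DP ranks this step."""
    waiting = [r for r in ranks if r.has_prefill]
    if not waiting:
        return "decode"
    all_ready = len(waiting) == len(ranks)
    # Assumed safeguard: do not starve a rank whose prefill keeps waiting.
    timed_out = any(r.waited_steps >= max_delay for r in waiting)
    if all_ready or timed_out:
        for r in ranks:
            r.waited_steps = 0
        return "prefill"
    # Otherwise delay prefill so that no rank leaves the decode phase alone.
    for r in waiting:
        r.waited_steps += 1
    return "decode"


def padding_waste(tokens_per_rank: List[int]) -> float:
    """Fraction of padded slots when every rank pads to the largest rank."""
    padded = max(tokens_per_rank) * len(tokens_per_rank)
    return 1.0 - sum(tokens_per_rank) / padded


# Illustrative numbers only: one rank in prefill, three ranks in decode.
print(padding_waste([1000, 10, 10, 10]))
```

With these made-up token counts, most of the padded slots carry no useful work, which is the kind of mixed-phase cost that keeping ranks aligned is meant to avoid.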

TBO improves communication efficiency by splitting a batch into two micro-batches and overlapping the compute of one with the communication of the other, which reduces idle waiting. ATOM's new token-level even splitting balances token counts across the micro-batches so that both finish at about the same time, maximizing the overlap window. When the split falls inside a request, the first part of that request is exposed to the second part as a cached prefix, preserving the same attention semantics as prefilling the full request. Together, these optimizations improve DeepSeek-V4 inference performance on AMD's MI355X GPUs.
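Token-level even splitting can be sketched as follows. This is a minimal illustration, assuming a simple greedy walk over the requests; the `Chunk` type and the `split_even` and `tokens` helpers are hypothetical names, not ATOM's API. A request that straddles the midpoint is cut in two, and its second part records how many earlier tokens it should treat as a cached prefix.

```python
from typing import List, NamedTuple, Tuple


class Chunk(NamedTuple):
    req_id: int
    start: int          # first token of this chunk within its request
    end: int            # one past the last token of this chunk
    cached_prefix: int  # earlier tokens of the request treated as cached


def split_even(seq_lens: List[int]) -> Tuple[List[Chunk], List[Chunk]]:
    """Split prefill tokens into two micro-batches of near-equal size."""
    half = (sum(seq_lens) + 1) // 2  # first micro-batch takes the odd token
    first: List[Chunk] = []
    second: List[Chunk] = []
    used = 0
    for req_id, n in enumerate(seq_lens):
        if n == 0:
            continue
        room = half - used
        if room >= n:
            # The whole request fits in the first micro-batch.
            first.append(Chunk(req_id, 0, n, 0))
            used += n
        elif room > 0:
            # The request straddles the midpoint: its suffix goes to the
            # second micro-batch and sees the prefix as cached context.
            first.append(Chunk(req_id, 0, room, 0))
            second.append(Chunk(req_id, room, n, room))
            used += room
        else:
            second.append(Chunk(req_id, 0, n, 0))
    return first, second


def tokens(chunks: List[Chunk]) -> int:
    """Total number of tokens covered by a list of chunks."""
    return sum(c.end - c.start for c in chunks)


a, b = split_even([10, 2])
print(tokens(a), tokens(b))
```

For requests of 10 and 2 tokens, splitting at request boundaries would give micro-batches of 10 and 2 tokens, while the token-level split gives 6 and 6.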

Source: AMD

Key points

ATOM improves DeepSeek-V4 inference performance on AMD MI355X GPUs by reducing padding waste (AMD cites a figure of 86%).
ATOM combines DP Attention with TP MoE and adds two key optimizations, PrefillDelayer and Two-Batch Overlap (TBO), to improve communication efficiency.
PrefillDelayer coordinates prefill admission across DP ranks to reduce mixed-phase overhead.
Token-level even splitting balances token counts across micro-batches, improving TBO prefill throughput.
TBO extends beyond all2all to standard collectives such as all_gather and reduce_scatter.
The framework avoids padding and eager decode fallbacks by maintaining phase alignment across DP ranks.


WRITTEN BY

Theo Almeida

AI Software & Developer Tools

Theo covers AI software, developer tools, frameworks, and the platforms builders use every day.
