AMD has released ATOMesh, a distributed inference gateway designed to optimize large language model (LLM) serving on its Instinct GPUs. The tool targets three challenges of production serving: high concurrency, long-context prefill, and latency-sensitive decoding. It addresses them by integrating ROCm-native kernels, communication libraries, and inference engines into a scalable serving stack. ATOMesh and the ATOM inference engine together provide a unified platform for routing requests and managing execution on AMD GPU clusters. The combined system supports intelligent request routing and efficient reuse of the KV cache, both intended to improve performance in production deployments. Source: amd

The AMD LLM inference stack is structured into five layers: application, orchestration, engine/framework, kernel/collective, and hardware. ATOMesh operates in the orchestration layer, while ATOM operates in the engine/framework layer. Below these, components such as AITER, MORI, and RCCL provide the optimized execution foundations. ATOMesh coordinates distributed inference across the cluster and is responsible for prefill/decode routing, cache-aware scheduling, worker lifecycle management, reliability handling, and scaling. It offers a unified serving surface for AMD GPU inference deployments, allowing multiple backends such as ATOM, vLLM, and SGLang to coexist. Source: amd
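To make prefill/decode routing and cache-aware scheduling more concrete, here is a minimal Python sketch of one way such a policy could work. It is not ATOMesh code: the `Worker` record, the `pick_worker` function, and the scoring rule (longest cached token prefix first, then fewest active requests) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker record; not an ATOMesh type."""
    name: str
    role: str  # "prefill" or "decode"
    active_requests: int = 0
    # Token prefixes this worker is assumed to hold in its KV cache.
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)


def shared_prefix_len(a, b):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_worker(workers, role, prompt_tokens):
    """Pick the worker of the given role with the longest cached prefix,
    breaking ties by the lowest number of active requests."""
    candidates = [w for w in workers if w.role == role]
    if not candidates:
        raise ValueError(f"no workers with role {role!r}")

    def score(w):
        best = max(
            (shared_prefix_len(p, prompt_tokens) for p in w.cached_prefixes),
            default=0,
        )
        # min() prefers a longer cache hit (more negative), then lower load.
        return (-best, w.active_requests)

    return min(candidates, key=score)


# Illustrative cluster state (made-up values).
workers = [
    Worker("prefill-0", "prefill", active_requests=3, cached_prefixes=[(1, 2, 3, 4)]),
    Worker("prefill-1", "prefill", active_requests=1),
    Worker("decode-0", "decode", active_requests=2),
]

chosen = pick_worker(workers, "prefill", (1, 2, 3, 9))
```

In this sketch the busier `prefill-0` is still chosen for the prompt `(1, 2, 3, 9)` because it already holds three matching tokens, which illustrates the trade-off a cache-aware scheduler makes between load balance and KV cache reuse. A real gateway would likely weigh these factors more carefully than a strict lexicographic order.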

The design of ATOMesh keeps cluster-level serving policies independent of model execution internals, so that routing, scheduling, reliability, and scaling can evolve without requiring changes to each backend engine. The system uses a unified placement core for multi-connection routing, centralizing worker selection around a transport-neutral placement plan. It isolates backend-specific wire formats behind adapters, allowing each backend to evolve independently while the scheduling model stays consistent. ATOMesh also separates request preparation, backend dispatch, and response rendering, which makes the routing pipeline easier to reason about and test. Source: amd
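The separation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ATOMesh implementation: `PlacementPlan`, `BackendAdapter`, `JsonAdapter`, `LineAdapter`, and the three stage functions are hypothetical names, and both wire formats are invented for the example.

```python
import json
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class PlacementPlan:
    """Transport-neutral decision: which workers serve the request."""
    prefill_worker: str
    decode_worker: str


class BackendAdapter(Protocol):
    """Hypothetical interface; each backend hides its wire format behind it."""
    def encode(self, prompt: str, plan: PlacementPlan) -> bytes: ...


class JsonAdapter:
    """Invented JSON wire format for one imaginary backend."""
    def encode(self, prompt: str, plan: PlacementPlan) -> bytes:
        body = {"prompt": prompt, "prefill": plan.prefill_worker,
                "decode": plan.decode_worker}
        return json.dumps(body).encode()


class LineAdapter:
    """Invented pipe-delimited wire format for another imaginary backend."""
    def encode(self, prompt: str, plan: PlacementPlan) -> bytes:
        return f"{plan.prefill_worker}|{plan.decode_worker}|{prompt}".encode()


def prepare(raw_prompt: str) -> str:
    """Stage 1: request preparation (here, only whitespace normalization)."""
    return " ".join(raw_prompt.split())


def dispatch(prompt: str, plan: PlacementPlan, adapter: BackendAdapter,
             send: Callable[[bytes], str]) -> str:
    """Stage 2: backend dispatch. Only the adapter knows the wire format."""
    return send(adapter.encode(prompt, plan))


def render(backend_text: str) -> dict:
    """Stage 3: response rendering into a client-facing shape."""
    return {"text": backend_text}


def handle(raw_prompt, plan, adapter, send):
    """Run the three stages in order."""
    return render(dispatch(prepare(raw_prompt), plan, adapter, send))


sent = []  # payloads captured by the fake transport below


def fake_send(payload: bytes) -> str:
    """Stand-in for a network call; records the payload and returns a reply."""
    sent.append(payload)
    return "ok"


plan = PlacementPlan(prefill_worker="prefill-0", decode_worker="decode-0")
results = [handle("  hello   world ", plan, adapter, fake_send)
           for adapter in (JsonAdapter(), LineAdapter())]
```

Because the plan carries no transport details, the same `PlacementPlan` drives both adapters unchanged, and each stage can be tested in isolation by swapping in a fake such as `fake_send`.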