Software

AMD Introduces ATOMesh for Scalable LLM Inference on Instinct GPUs

AMD's ATOMesh enables distributed LLM serving on Instinct GPUs, supporting high concurrency and efficient KV cache reuse.

Image: AMD

AMD has released ATOMesh, a distributed inference gateway designed to optimize large language model (LLM) serving on its Instinct GPUs. The tool aims to address the challenges of high concurrency, long-context prefill, and latency-sensitive decoding by integrating ROCm-native kernels, communication libraries, and inference engines into a scalable serving stack. ATOMesh and ATOM work together to provide a unified platform for routing requests and managing execution on AMD GPU clusters. This system allows for intelligent request routing and efficient reuse of KV cache, enhancing performance for production deployments. Source: amd

The AMD LLM inference stack is structured into five layers: application, orchestration, engine/framework, kernel/collective, and hardware. ATOMesh operates in the orchestration layer, while ATOM functions in the engine layer. Below these, components like AITER, MORI, and RCCL provide optimized execution foundations. ATOMesh coordinates distributed inference across the cluster, managing responsibilities such as prefill/decode routing, cache-aware scheduling, worker lifecycle management, reliability handling, and scaling. It offers a unified serving surface for AMD GPU inference deployments, allowing multiple backends like ATOM, vLLM, and SGLang to coexist. Source: amd

The design of ATOMesh focuses on maintaining cluster-level serving policies independent from model execution internals, enabling routing, scheduling, reliability, and scaling to evolve without requiring changes to each backend engine. The system uses a unified placement core for multi-connection routing, centralizing worker selection around a transport-neutral placement plan. It isolates backend-specific wire formats behind adapters, allowing each backend to evolve independently while keeping scheduling models consistent. ATOMesh also separates request preparation, backend dispatch, and response rendering, making the routing pipeline easier to reason about and test. Source: amd

Key points

AMD's ATOMesh enables distributed LLM serving on Instinct GPUs.
ATOMesh coordinates distributed inference across the cluster, managing prefill/decode routing, cache-aware scheduling, worker lifecycle management, reliability handling, and scaling.
The AMD LLM inference stack includes five layers: application, orchestration, engine/framework, kernel/collective, and hardware.
ATOMesh operates in the orchestration layer, while ATOM functions in the engine layer.
Components like AITER, MORI, and RCCL provide optimized execution foundations below ATOMesh.
ATOMesh offers a unified serving surface for AMD GPU inference deployments, allowing multiple backends like ATOM, vLLM, and SGLang to coexist.
The system uses a unified placement core for multi-connection routing, centralizing worker selection around a transport-neutral placement plan.

Source: AMD Read the original →

WRITTEN BY

Theo Almeida

AI Software & Developer Tools

Theo covers AI software, developer tools, frameworks, and the platforms builders use every day.