AWS Introduces GPUDirect on Amazon FSx for Lustre to Speed LLM Inference

AWS announced GPUDirect on Amazon FSx for Lustre, which cuts cold start latency for large language models such as Llama 3.1 405B from 10–20 minutes to seconds.

Amazon Web Services (AWS) announced a method for speeding up large language model (LLM) loading and enlarging context windows using GPUDirect on Amazon FSx for Lustre. The approach uses NVIDIA GPUDirect Storage (GDS) to bypass the CPU and system memory, transferring data directly from storage to GPU High Bandwidth Memory (HBM). AWS says this cuts cold start latency for large models such as Llama 3.1 405B, which previously took 10–20 minutes to load.

With GPUDirect, the same model loads in seconds, which reduces GPU idle time during model initialization. The solution targets AWS GPU instances such as the P5en and P6e, which support GDS and high-speed networking via Elastic Fabric Adapter (EFA). According to AWS, integrating FSx for Lustre with GDS creates multiple direct data paths to GPU memory that avoid the hop through the CPU and system memory. This enables sharded parallel model loading, in which the weights are split across GPUs and each GPU loads its portion at the same time.
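The idea behind sharded parallel loading can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AWS's implementation: the function names (`shard_ranges`, `read_shard`, `load_sharded`) are hypothetical, ordinary host-memory buffers stand in for GPU memory, and plain file reads stand in for GDS transfers.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def shard_ranges(total_bytes: int, num_shards: int) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into contiguous (offset, length) ranges, one per shard."""
    base, extra = divmod(total_bytes, num_shards)
    ranges, offset = [], 0
    for i in range(num_shards):
        # The first `extra` shards take one additional byte so all bytes are covered.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def read_shard(path: str, offset: int, length: int) -> bytes:
    """Read one shard. A host buffer stands in for one GPU's memory (assumption)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def load_sharded(path: str, num_gpus: int) -> list[bytes]:
    """Issue one read per simulated GPU concurrently; return shards in GPU order."""
    ranges = shard_ranges(os.path.getsize(path), num_gpus)
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        futures = [pool.submit(read_shard, path, off, n) for off, n in ranges]
        return [f.result() for f in futures]
```

Calling `load_sharded(path, 8)` mirrors the eight-GPU case: each worker reads a disjoint byte range, so total load time is bounded by the slowest shard rather than by the sum of all reads, provided the file system can serve the reads in parallel.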

For example, a P5en instance, equipped with 8 H200 GPUs, can load a 400 GB Llama 3.1 405B model in seconds using this method. AWS attributes the speedup to the combination of GDS, EFA, and scalable file system throughput.
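A rough calculation shows what these figures imply for throughput. The model size, GPU count, and 10–20 minute baseline come from the figures above; the 30-second target is a hypothetical stand-in for "seconds", since no exact load time is given.

```python
def required_throughput_gb_per_s(model_size_gb: float, load_time_s: float) -> float:
    """Average throughput needed to move model_size_gb in load_time_s seconds."""
    return model_size_gb / load_time_s


MODEL_SIZE_GB = 400          # from the article
NUM_GPUS = 8                 # P5en, from the article
HYPOTHETICAL_TARGET_S = 30   # assumption: illustrative value for "seconds"

# Effective throughput implied by the 10-20 minute baseline.
baseline_fast = required_throughput_gb_per_s(MODEL_SIZE_GB, 10 * 60)
baseline_slow = required_throughput_gb_per_s(MODEL_SIZE_GB, 20 * 60)

# Aggregate and per-GPU throughput needed to hit the hypothetical target.
target_total = required_throughput_gb_per_s(MODEL_SIZE_GB, HYPOTHETICAL_TARGET_S)
target_per_gpu = target_total / NUM_GPUS

print(f"baseline: {baseline_slow:.2f} to {baseline_fast:.2f} GB/s")
print(f"target:   {target_total:.2f} GB/s total, {target_per_gpu:.2f} GB/s per GPU")
```

Under these assumptions the baseline corresponds to well under 1 GB/s of effective throughput, while a 30-second load needs roughly 13 GB/s in aggregate, which is why splitting the load across eight parallel data paths matters.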

Source: AWS Machine Learning

Key points

AWS announced GPUDirect on Amazon FSx for Lustre to reduce LLM load times from minutes to seconds.
The new method cuts cold start latency for Llama 3.1 405B from 10–20 minutes to seconds.
GPUDirect enables direct data transfers from storage to GPU HBM, bypassing the CPU and system memory.
The P5en instance, with 8 H200 GPUs, can load a 400 GB Llama 3.1 405B model in seconds using this method.
The integration of FSx for Lustre with GDS creates multiple direct data paths to GPU memory.
Sharded parallel model loading allows weights to be distributed across GPUs and loaded simultaneously.

WRITTEN BY

Sam Bergstrom

AI Infrastructure & Hardware

Sam specializes in AI chips, data centers, and training infrastructure.
