Hardware
AWS Introduces GPUDirect on Amazon FSx for Lustre to Speed LLM Inference
AWS announced GPUDirect on Amazon FSx for Lustre, which cuts cold start load times for large language models such as Llama 3.1 405B from 10–20 minutes to seconds.
Amazon Web Services (AWS) announced a new method to accelerate large language model (LLM) loading and increase context windows using GPUDirect on Amazon FSx for Lustre. The approach uses NVIDIA GPUDirect Storage (GDS) to bypass the CPU and system memory, transferring data directly from storage to GPU High Bandwidth Memory (HBM).
The main benefit is lower cold start latency. A model such as Llama 3.1 405B previously took 10–20 minutes to load; with GPUDirect, AWS says the same model loads in seconds, which reduces the time GPUs sit idle during model initialization. The solution targets AWS GPU instances such as the P5en and P6e, which support GDS and high-speed networking via Elastic Fabric Adapter (EFA).
According to AWS, integrating FSx for Lustre with GDS creates multiple direct data paths to GPU memory, avoiding the traditional bottleneck of staging data through the host. This enables sharded parallel model loading, in which the weights are split across GPUs and each GPU loads its share at the same time. For example, a P5en instance with 8 NVIDIA H200 GPUs can load a 400 GB Llama 3.1 405B model in seconds using this method. AWS attributes the performance gains to the combination of GDS, EFA, and scalable filesystem throughput.
*Source: [awsml](https://aws.amazon.com/blogs/machine-learning/accelerate-llm-model-loading-and-increase-context-windows-with-gpudirect-on-amazon-fsx-for-lustre-and-turboquant/)*
Key points
- AWS announced GPUDirect on Amazon FSx for Lustre to reduce LLM load times from minutes to seconds.
- The new method cuts cold start latency for Llama 3.1 405B from 10–20 minutes to seconds.
- NVIDIA GPUDirect Storage (GDS) enables direct data transfers from storage to GPU HBM, bypassing the CPU and system memory.
- The P5en instance, with 8 H200 GPUs, can load a 400 GB Llama 3.1 405B model in seconds using this method.
- The integration of FSx for Lustre with GDS creates multiple direct data paths to GPU memory.
- Sharded parallel model loading allows weights to be distributed across GPUs and loaded simultaneously.
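To make the idea of sharded parallel loading concrete, here is a minimal Python sketch. It is not AWS's implementation: the function names (`plan_shards`, `read_shard`, `load_sharded`) and the 4 KiB alignment are illustrative assumptions, and ordinary positional file reads on threads stand in for GDS transfers into GPU memory. It only shows how a weights file can be divided into one contiguous byte range per GPU and how those ranges can be fetched concurrently.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Assumption: align shard boundaries to 4 KiB, a common direct-I/O granularity.
ALIGN = 4096


def plan_shards(total_bytes: int, num_gpus: int, align: int = ALIGN) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into num_gpus contiguous (offset, length) ranges.

    The per-GPU size is rounded up to a multiple of `align`, so every shard
    starts on an aligned offset; trailing shards may be shorter or empty.
    """
    per_gpu = -(-total_bytes // num_gpus)      # ceiling division
    per_gpu = -(-per_gpu // align) * align     # round up to the alignment
    shards = []
    for rank in range(num_gpus):
        start = min(rank * per_gpu, total_bytes)
        end = min(start + per_gpu, total_bytes)
        shards.append((start, end - start))
    return shards


def read_shard(path: str, offset: int, length: int) -> bytes:
    """Read one byte range with positional reads.

    Stand-in only: a real GDS path would land the data in GPU memory
    instead of returning a host-side bytes object.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        remaining = length
        while remaining > 0:
            chunk = os.pread(fd, min(remaining, 1 << 20), offset)
            if not chunk:  # unexpected end of file
                break
            chunks.append(chunk)
            offset += len(chunk)
            remaining -= len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)


def load_sharded(path: str, num_gpus: int) -> list[bytes]:
    """Fetch every shard concurrently, one worker per (hypothetical) GPU."""
    shards = plan_shards(os.path.getsize(path), num_gpus)
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        futures = [pool.submit(read_shard, path, off, length) for off, length in shards]
        return [f.result() for f in futures]
```

With the article's figures, `plan_shards(400 * 10**9, 8)` assigns each of the 8 GPUs roughly 50 GB, so no single device has to pull the full 400 GB through one path.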