Amazon Web Services (AWS) announced a new method to accelerate large language model (LLM) loading and increase context windows using GPUDirect on Amazon FSx for Lustre. The approach uses NVIDIA GPUDirect Storage (GDS) to transfer data directly from storage into GPU High Bandwidth Memory (HBM), skipping the intermediate copy through CPU system memory. This cuts cold-start latency for large models such as Llama 3.1 405B, which previously took 10–20 minutes to load. With GPUDirect, AWS says the same model loads in seconds, reducing GPU idle time during model initialization. The solution targets AWS GPU instances such as the P5en and P6e, which support GDS and high-speed networking via Elastic Fabric Adapter (EFA). According to AWS, integrating FSx for Lustre with GDS creates multiple direct data paths to GPU memory, which enables sharded parallel model loading: the weights are split across GPUs and all shards are loaded simultaneously. For example, AWS cites a P5en instance with 8 H200 GPUs loading a 400 GB Llama 3.1 405B model in seconds. AWS attributes the gains to the combination of GDS, EFA, and scalable filesystem throughput. *Source: [awsml](https://aws.amazon.com/blogs/machine-learning/accelerate-llm-model-loading-and-increase-context-windows-with-gpudirect-on-amazon-fsx-for-lustre-and-turboquant/)*
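
To make the sharded parallel loading idea concrete, here is a minimal Python sketch, not AWS's implementation. It splits a weights file into one contiguous byte range per GPU (for the 400 GB, 8-GPU example, about 50 GB each) and reads all ranges concurrently. The helper names `shard_ranges`, `read_shard`, and `load_sharded` are hypothetical, and ordinary file reads into host memory stand in for the GDS transfers into HBM that the real pipeline would perform.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def shard_ranges(total_bytes: int, num_shards: int) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into num_shards contiguous (offset, length) ranges."""
    base, extra = divmod(total_bytes, num_shards)
    ranges = []
    offset = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional byte so sizes differ by at most 1.
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def read_shard(path: str, offset: int, length: int) -> bytes:
    """Read one shard from disk.

    Assumption: this plain read is a stand-in for a GDS transfer that would
    land the bytes directly in one GPU's memory.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def load_sharded(path: str, num_gpus: int) -> list[bytes]:
    """Load every shard concurrently, one worker per (hypothetical) GPU."""
    ranges = shard_ranges(os.path.getsize(path), num_gpus)
    with ThreadPoolExecutor(max_workers=num_gpus) as pool:
        futures = [pool.submit(read_shard, path, off, ln) for off, ln in ranges]
        # Collect results in shard order so shard i maps to GPU i.
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Tiny demo with a throwaway file in place of real model weights.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1024))
    try:
        shards = load_sharded(tmp.name, num_gpus=8)
        print([len(s) for s in shards])
    finally:
        os.remove(tmp.name)
```

The thread pool only illustrates the concurrency pattern. The speedups AWS describes would come from each reader having its own direct path from FSx for Lustre to GPU memory, which this host-side sketch does not reproduce.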