Amazon SageMaker AI now supports NVIDIA Blackwell GPUs for training large AI models. P6-B200 instances, each featuring 8 Blackwell GPUs, are available for booking through Flexible Training Plans. Booked this way, the instances come with predictable access, cost management, and automated resource management, allowing users to focus on data and algorithms rather than infrastructure operations.

This post outlines how to configure training jobs on Amazon SageMaker AI to make the most of Blackwell’s architecture on AWS. Users can learn how to select batch sizes and sequence lengths that take advantage of Blackwell’s expanded memory, choose the right precision format for their model size (1B to 64B parameters), and apply activation checkpointing strategically. Properly configured Blackwell training jobs can process larger batch sizes without aggressive sharding, reducing communication overhead and improving throughput. Longer sequence lengths become viable for long-range dependency tasks. With the right precision format, models that previously required multi-node setups can run on a single 8-GPU node, which means faster iteration cycles, less networking overhead, and lower infrastructure costs.

Blackwell’s dual-chip architecture and fifth-generation Tensor Cores deliver measurable gains for multi-GPU training out of the box. The NVLink 5 interconnect provides up to 1.8 TB/s of bidirectional GPU-to-GPU bandwidth, while B200’s larger HBM capacity and higher memory bandwidth help reduce memory pressure for large batches, long sequences, and distributed training workloads. The examples in this post use single-node 8-GPU training with transformer models ranging from 1B to 64B parameters. The training configuration uses PyTorch Fully Sharded Data Parallel (FSDP), a distributed training technique that shards model parameters, gradients, and optimizer states across GPUs so that models too large for a single GPU’s memory can still be trained. The results cover multiple configurations with varying batch sizes, sequence lengths, and precision formats to show which approach works best under which conditions.
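To make the sharding arithmetic concrete, the following is a rough sketch of how much per-GPU memory the sharded model state occupies on an 8-GPU node. The 16 bytes per parameter figure is a common rule of thumb for mixed-precision training with Adam and is an assumption of this sketch, not a number from the post; the function names are hypothetical. The estimate ignores activations, temporarily gathered parameters, and allocator fragmentation, so treat it as a lower bound rather than a sizing guarantee.

```python
# Assumed rule of thumb (not from the post): mixed-precision training with Adam
# keeps roughly 16 bytes of state per parameter:
#   2 (16-bit weights) + 2 (16-bit gradients)
#   + 4 (FP32 master weights) + 4 + 4 (FP32 Adam moments)
BYTES_PER_PARAM = 16
GB = 1e9


def sharded_state_gb(num_params, num_gpus=8, bytes_per_param=BYTES_PER_PARAM):
    """Per-GPU memory (GB) for parameters, gradients, and optimizer state
    when all three are sharded evenly across num_gpus."""
    return num_params * bytes_per_param / num_gpus / GB


def headroom_gb(num_params, gpu_memory_gb=180.0, num_gpus=8):
    """Memory (GB) left on each GPU for activations and temporary buffers."""
    return gpu_memory_gb - sharded_state_gb(num_params, num_gpus)


if __name__ == "__main__":
    for billions in (1, 8, 64):
        params = billions * 1e9
        print(
            f"{billions:>2}B params: "
            f"{sharded_state_gb(params):6.1f} GB state per GPU, "
            f"{headroom_gb(params):6.1f} GB headroom on a 180 GB GPU"
        )
```

Under these assumptions, even the 64B case leaves some headroom per GPU on B200, which is the space that batch size, sequence length, and activation checkpointing then compete for.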

Memory capacity is central to what Blackwell makes possible. Blackwell’s expanded memory (180 GB on B200, 268 GB on B300) gives users room to optimize in three areas: larger batch sizes, simplified model sharding, and longer sequence lengths. Larger batch sizes mean fewer training steps for the same amount of data, and therefore fewer gradient synchronization steps across GPUs, which improves overall throughput. Simplified model sharding becomes possible because more memory per GPU means users might be able to reduce the degree of model parallelism or eliminate it entirely for some models. Fewer shards mean less inter-GPU communication overhead. Longer sequence lengths allow models to process more context in a single pass, which is critical for long-range dependency tasks. If throughput is your primary goal, start with batch size tuning. If communication overhead is the bottleneck, simplify sharding first. If your task requires long-range context, prioritize sequence length. Because batch size and sequence length both increase memory consumption, raising one limits how far the other can go, so the two need to be balanced against each other.
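The batch size and sequence length trade-off can be sketched with simple token arithmetic. The helper names below are hypothetical, and the sketch makes two simplifying assumptions: each optimizer step involves one round of gradient synchronization (no gradient accumulation), and activation memory grows roughly in proportion to the number of tokens each GPU processes per step. The second assumption is only approximate; attention implementations that materialize the full attention matrix grow faster than linearly with sequence length.

```python
def optimizer_steps(total_tokens, per_gpu_batch, seq_len, num_gpus=8):
    """Number of optimizer steps (and gradient syncs) needed to consume
    total_tokens, rounding up for a final partial step."""
    tokens_per_step = per_gpu_batch * seq_len * num_gpus
    return -(-total_tokens // tokens_per_step)  # integer ceiling division


def batch_for_token_budget(tokens_per_gpu, seq_len):
    """Largest per-GPU batch size that stays within a fixed per-GPU token
    budget (a stand-in for a fixed activation-memory budget)."""
    return tokens_per_gpu // seq_len


if __name__ == "__main__":
    total = 1_000_000_000  # illustrative token count, not from the post

    # Larger per-GPU batches mean fewer synchronization rounds.
    for batch in (1, 4, 8):
        steps = optimizer_steps(total, batch, seq_len=4096)
        print(f"batch {batch}: {steps} optimizer steps")

    # At a fixed token budget, longer sequences leave room for fewer samples.
    budget = 32_768
    for seq_len in (2048, 8192, 32_768):
        print(f"seq_len {seq_len}: batch {batch_for_token_budget(budget, seq_len)}")
```

In practice the per-GPU token budget would come from profiling a real training job rather than from a fixed constant, but the proportionality shows why doubling the sequence length usually means halving the batch size.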

Source: awsml