Amazon SageMaker AI has introduced container caching to speed up model scaling. According to AWS, the feature reduces end-to-end latency for generative AI models during scale-out events by up to 51%. It builds on earlier optimizations that had already improved auto scaling responsiveness for inference components. Container caching targets the latency of downloading container images when new instances launch, a common bottleneck in scaling operations.
Container caching eliminates the need to download container images from Amazon ECR during scale-out events, cutting the time required to start new instances. For example, startup latency for the Qwen3-8B model on an ml.g6.2xlarge instance dropped from 525 seconds to 258 seconds with container caching. The improvement comes from pre-caching container images locally, which also keeps image pulls and model downloads from competing for network bandwidth. If the cached image is unavailable, SageMaker AI automatically falls back to pulling from Amazon ECR, so scaling is never blocked.
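As a quick check on the reported figures, the sketch below computes the relative reduction implied by the two startup latencies quoted above. The helper name `percent_reduction` is illustrative only and is not part of any AWS API.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return the relative reduction from before_s to after_s, in percent."""
    if before_s <= 0:
        raise ValueError("before_s must be positive")
    return (before_s - after_s) / before_s * 100.0


# Startup latencies reported for Qwen3-8B on ml.g6.2xlarge, in seconds.
baseline_s = 525
cached_s = 258

reduction = percent_reduction(baseline_s, cached_s)
print(f"Saved {baseline_s - cached_s} s ({reduction:.1f}% reduction)")
# prints: Saved 267 s (50.9% reduction)
```

Rounded to the nearest whole number, this is consistent with the "up to 51%" figure, assuming that headline number refers to this benchmark.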
AWS emphasized that container caching works alongside existing optimizations such as sub-minute metrics and inference component data caching. Combined, these features target the major sources of scale-out latency, so that generative AI applications can absorb traffic spikes while maintaining low latency and high availability. The feature is supported for accelerator instance types on SageMaker inference endpoints and works with any container image hosted in Amazon ECR, including custom images. It is available in all commercial AWS Regions where SageMaker AI inference is supported.
Source: awsml