Amazon SageMaker AI now integrates with NVIDIA Isaac Lab to streamline reinforcement learning training of robot policies. Robotics teams can train complex behaviors, such as humanoid locomotion, more efficiently on managed compute with automated fault recovery. The solution supports two compute options that cover both iterative development and long-horizon training jobs without manual infrastructure management.
Teams can run training on either Amazon SageMaker HyperPod or SageMaker Training Jobs, both of which are managed compute environments. SageMaker HyperPod offers cluster resiliency and direct access to nodes, while SageMaker Training Jobs provide ephemeral compute for short, iterative runs. Together, the two options cover the phases of robot policy development, from short experiments to production-grade training.
The solution includes a Docker image and a generator script that creates Kubernetes manifests and SageMaker launch scripts from a shared configuration file, so the same training code runs on both compute options without modification. The example training task is Isaac-Velocity-Rough-H1-v0, in which a Unitree H1 humanoid robot learns to track velocity commands on rough terrain using Proximal Policy Optimization (PPO).
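The generator idea can be illustrated with a minimal Python sketch. It is not the solution's actual script: the configuration keys, the function names, the `train.py` entry point and its flags, the instance type, and the use of a plain `batch/v1` Job for the Kubernetes target are all assumptions made for illustration. The sketch only builds dictionaries; the SageMaker one follows the request shape of the `CreateTrainingJob` API.

```python
import json

# Hypothetical shared configuration; every key and value is illustrative.
SHARED_CONFIG = {
    "job_name": "isaac-h1-rough",
    "image": "<account>.dkr.ecr.<region>.amazonaws.com/isaac-lab:latest",
    "task": "Isaac-Velocity-Rough-H1-v0",
    "instance_type": "ml.g5.2xlarge",  # assumed GPU instance type
    "gpus": 1,
    "role_arn": "arn:aws:iam::<account>:role/<role>",
    "output_s3": "s3://<bucket>/isaac-lab/output",
    "max_runtime_seconds": 3600,
}


def training_command(config):
    """Build the container command once so both targets run the same task."""
    # Placeholder entry point and flags; the real image may use a different path.
    return ["python", "train.py", "--task", config["task"], "--headless"]


def kubernetes_job(config):
    """Return a Kubernetes batch/v1 Job manifest as a dictionary."""
    container = {
        "name": "trainer",
        "image": config["image"],
        "command": training_command(config),
        "resources": {"limits": {"nvidia.com/gpu": config["gpus"]}},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": config["job_name"]},
        "spec": {
            "template": {
                "spec": {"containers": [container], "restartPolicy": "OnFailure"}
            }
        },
    }


def sagemaker_training_request(config):
    """Return a request dictionary shaped like SageMaker CreateTrainingJob."""
    command = training_command(config)
    return {
        "TrainingJobName": config["job_name"],
        "AlgorithmSpecification": {
            "TrainingImage": config["image"],
            "TrainingInputMode": "File",
            "ContainerEntrypoint": command[:2],
            "ContainerArguments": command[2:],
        },
        "RoleArn": config["role_arn"],
        "ResourceConfig": {
            "InstanceType": config["instance_type"],
            "InstanceCount": 1,
            "VolumeSizeInGB": 100,
        },
        "OutputDataConfig": {"S3OutputPath": config["output_s3"]},
        "StoppingCondition": {"MaxRuntimeInSeconds": config["max_runtime_seconds"]},
    }


if __name__ == "__main__":
    # Emit both artifacts from the single shared configuration.
    print(json.dumps(kubernetes_job(SHARED_CONFIG), indent=2))
    print(json.dumps(sagemaker_training_request(SHARED_CONFIG), indent=2))
```

Deriving the container command in one place is what keeps the two targets in step: a change to the task name in the shared configuration propagates to both the manifest and the launch request.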
Source: awsml