Amazon SageMaker AI has introduced a multi-turn reinforcement learning (RL) service to train agents that handle complex, sequential tasks. These agents must navigate multiple steps, such as reading instructions, making tool calls, and recovering from mistakes, to complete tasks like resolving support tickets or moderating content. The service provides a training loop for agentic tasks, allowing agents to run on Amazon Bedrock AgentCore, Amazon EKS, Amazon EC2, AWS Fargate, or custom infrastructure. A small adapter connects the agent to a rollout server, enabling low-code integration while offering full algorithmic control. The service supports custom rewards, tool loops, and multi-turn conversation shapes, along with serverless execution and per-token pricing without the need for GPU clusters. It also includes asynchronous rollout and trajectory collection to speed up training. The native algorithm library spans Proximal Policy Optimization (PPO), Clipped Importance Sampling Policy Optimization (CISPO), and importance-sampling (IS) losses, paired with multiple group-based advantage estimators. These tools help manage the complexity of multi-turn RL training. The service also provides observability through MLflow, enabling detailed tracking of agent behavior across training steps. Evaluation jobs report reward, pass, trajectory metrics, and more before deployment, ensuring reliable agent performance. The choices that determine the agent's reliability are left to the user, who must build the training environment, measure success outside the reward, design the reward, and decide when to iterate. The service aims to provide production-scale agentic RL while minimizing infrastructure concerns. Source: awsml
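As a rough illustration of how a group-based advantage estimator pairs with a clipped PPO-style loss, the sketch below normalizes rewards within one group of rollouts and applies the standard clipped surrogate term. It is a generic textbook-style sketch, not the service's implementation: the function names `group_advantages` and `ppo_clipped_term`, the mean/standard-deviation normalization, and the default clip range of 0.2 are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts in its group.

    Assumption: the group holds rollouts sampled for the same task, and the
    advantage is the reward minus the group mean, scaled by the group's
    standard deviation. Other group-based estimators exist.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ppo_clipped_term(ratio, advantage, clip=0.2):
    """Clipped surrogate objective for one token (a quantity to maximize).

    `ratio` is the new-policy probability divided by the old-policy
    probability for the sampled token.
    """
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
    print(advantages)
    print(ppo_clipped_term(1.5, advantages[0]))
```

With a group in which every rollout earns the same reward, all advantages come out as zero, so that group contributes no learning signal under this particular estimator.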
Building a reliable training environment is critical for multi-turn RL. Single-turn RL requires only a prompt and a reward function, but multi-turn RL adds an environment that the agent interacts with across turns: the tools the agent calls and the systems behind them. The recommended setup is a sandboxed or simulated environment that mirrors production but remains isolated from live traffic. Tool calls and responses should keep the same schemas and business logic as production, while using recorded responses or isolated state instead of live calls. A simulated environment is the starting point because a typical run can produce thousands of rollouts, each making several tool calls. For example, a batch size of 128 with a group size of 8 yields 128 × 8 = 1,024 rollouts per step. Pointing this traffic at live systems could affect customers: without a simulated environment, exploration can cause real side effects, such as issuing refunds or triggering unintended workflows. Live data also shifts over time, so the same trajectory can score differently across runs. Computing a reward also requires knowing the correct outcome, which means a fixed, labeled set of tasks or a trustworthy judge model. How the simulated environment is built depends on what each tool does, and three patterns are common: read-only tools that replay recorded responses, stateful tools that maintain episode-specific state, and verifiable outcomes that execute code or SQL in isolated environments.
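The three simulation patterns and the rollout arithmetic above can be sketched in a few lines of Python. Everything named here is hypothetical and chosen only for illustration: `ReplayTool`, `RefundLedger`, `run_sql_isolated`, the recording key format, and the refund example are assumptions, not part of the service or of any real tool schema.

```python
import json
import sqlite3


def rollouts_per_step(batch_size, group_size):
    """Each task in the batch is rolled out once per group member."""
    return batch_size * group_size


class ReplayTool:
    """Read-only pattern: answer tool calls from recorded responses."""

    def __init__(self, recordings):
        # recordings maps a canonical "name:args" key to a recorded response
        self._recordings = recordings

    @staticmethod
    def key(name, args):
        # sort_keys makes the key independent of argument order
        return name + ":" + json.dumps(args, sort_keys=True)

    def call(self, name, args):
        response = self._recordings.get(self.key(name, args))
        if response is None:
            return {"error": "no recorded response for this call"}
        return response


class RefundLedger:
    """Stateful pattern: one instance per episode, never shared."""

    def __init__(self):
        self.refunds = {}

    def issue_refund(self, order_id, amount):
        if order_id in self.refunds:
            return {"status": "rejected", "reason": "already refunded"}
        self.refunds[order_id] = amount
        return {"status": "ok", "order_id": order_id, "amount": amount}


def run_sql_isolated(setup_sql, query):
    """Verifiable-outcome pattern: run SQL against a throwaway database."""
    conn = sqlite3.connect(":memory:")  # exists only for this call
    try:
        conn.executescript(setup_sql)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(rollouts_per_step(128, 8))
    ledger = RefundLedger()
    print(ledger.issue_refund("A-1", 20.0))
    print(ledger.issue_refund("A-1", 20.0))
```

Creating a fresh `RefundLedger` for every episode keeps one rollout's refunds from leaking into another, and the in-memory SQLite database is discarded as soon as the query returns.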
Before training, verify that the environment is reproducible and representative. Reproducibility means that the same tool called with the same arguments returns the same result, which keeps reward computation stable. Representativeness means that the environment reflects real schemas and data distributions, so the model's behavior transfers to production. To confirm the environment is configured correctly, run the same instance twice and compare the rollout messages. Each rollout should have its own isolated state, with separate temp directories, IDs, and DB connections. The available tools and their request/response schemas must match the production environment. Setting up an external evaluation before training makes it possible to measure success directly: the evaluation should score the outcome that matters at deployment and be computed independently of the reward. For SOP-Bench, the evaluation is an exact match on the final JSON object inside , requiring every field in the agent's output to match the corresponding ground-truth field.
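The checks described above can be sketched as three small helpers. The names `new_rollout_context`, `is_reproducible`, and `exact_match` are hypothetical, as is the `run_rollout` callable, which stands in for whatever executes one episode and returns its messages. The comparison is only meaningful when the agent's actions are fixed across the two runs (for example, scripted actions), which is an assumption of this sketch. Treating the exact match as full equality of the parsed JSON object with the ground truth is also an assumption.

```python
import json
import tempfile
import uuid


def new_rollout_context():
    """Give each rollout its own ID and scratch directory."""
    return {
        "rollout_id": uuid.uuid4().hex,
        "workdir": tempfile.mkdtemp(prefix="rollout_"),
    }


def is_reproducible(run_rollout, instance):
    """Run the same instance twice and compare the resulting messages."""
    first = run_rollout(instance)
    second = run_rollout(instance)
    return first == second


def exact_match(predicted_json, ground_truth):
    """Return True only if the agent's final JSON equals the ground truth.

    Malformed JSON, a non-object value, or any differing, missing, or
    extra field counts as a failure.
    """
    try:
        predicted = json.loads(predicted_json)
    except json.JSONDecodeError:
        return False
    return isinstance(predicted, dict) and predicted == ground_truth


if __name__ == "__main__":
    truth = {"decision": "approve", "amount": 20}
    print(exact_match('{"amount": 20, "decision": "approve"}', truth))
    print(exact_match('{"decision": "approve"}', truth))
```

Because `exact_match` never looks at the training reward, it can serve as an independent check on whether a rising reward reflects better outcomes.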