Amazon SageMaker AI Introduces Multi-Turn Reinforcement Learning

Amazon SageMaker AI now offers multi-turn reinforcement learning tools to train agents for complex tasks like content moderation and support ticket resolution, using the SOP-Bench dataset for evaluation.

Amazon SageMaker AI has introduced a multi-turn reinforcement learning (RL) service for training agents that handle complex, sequential tasks. To resolve a support ticket or moderate content, such an agent must work through multiple steps: reading instructions, making tool calls, and recovering from mistakes. The service provides a training loop for agentic tasks, and the agent itself can run on Amazon Bedrock AgentCore, Amazon EKS, Amazon EC2, AWS Fargate, or custom infrastructure. A small adapter connects the agent to a rollout server, which keeps integration low-code while leaving full algorithmic control with the user.

The service supports custom rewards, tool loops, and multi-turn conversation shapes, and it runs serverlessly with per-token pricing, so no GPU cluster needs to be provisioned. Asynchronous rollout and trajectory collection speed up training. The native algorithm library spans Proximal Policy Optimization (PPO), Clipped Importance Sampling Policy Optimization (CISPO), and importance-sampling (IS) losses, paired with multiple group-based advantage estimators. Observability is provided through MLflow, which tracks agent behavior across training steps, and evaluation jobs report reward, pass, and trajectory metrics, among others, before deployment.

The choices that determine the agent's reliability are left to the user, who must build the training environment, measure success outside the reward, design the reward, and decide when to iterate. The service aims to provide production-scale agentic RL while minimizing infrastructure concerns.
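To make the idea of a group-based advantage estimator concrete, here is a minimal sketch. The article does not say which estimators the service ships, so this assumes one common form (mean and standard-deviation normalization within a group of rollouts for the same prompt, as popularized by GRPO). The function name and the `eps` parameter are illustrative, not part of any SageMaker API.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts sampled for the same prompt.

    Assumption: mean/std normalization within the group. Other group-based
    estimators exist and may differ in the baseline or the scaling.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout earned the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, eight rollouts, binary pass/fail rewards (made-up values).
group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(group_rewards)
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one, so no separately trained value model is needed to form the baseline. A group in which every rollout scores the same yields all-zero advantages and therefore contributes no learning signal.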

Building a reliable training environment is critical for multi-turn RL. Single-turn RL requires only a prompt and a reward function; multi-turn RL adds an environment that the agent interacts with across turns, made up of the tools the agent calls and the systems behind them. The recommended setup is a sandboxed or simulated environment that mirrors production but stays isolated from live traffic. Tool calls and responses should keep the same schemas and business logic as production while using recorded responses or isolated state instead of live calls.

A simulated environment is the starting point because a typical run can produce thousands of rollouts, each making several tool calls. For example, a batch size of 128 with a group size of 8 results in 1,024 rollouts per step. Pointing that traffic at live systems risks real side effects during exploration, such as issuing refunds or triggering workflows nobody intended, and can affect customers. Live data also shifts over time, so the same trajectory can score differently across runs. Finally, computing a reward requires knowing the correct outcome, which in turn requires a fixed, labeled set of tasks or a trustworthy judge model.

How the simulated environment is built depends on what each tool does. Three patterns are common: read-only tools that replay recorded responses, stateful tools that maintain episode-specific state, and verifiable outcomes that execute code or SQL in isolated environments.
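The rollout arithmetic and the three environment patterns can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the service's interface: every class and function name below (`ReplayTool`, `EpisodeTicketStore`, `run_sql_isolated`) is hypothetical, and the recorded data is made up.

```python
import copy
import json
import sqlite3


def rollouts_per_step(batch_size: int, group_size: int) -> int:
    """Each prompt in the batch is sampled group_size times."""
    return batch_size * group_size


class ReplayTool:
    """Read-only pattern: replay recorded responses, never call the live system."""

    def __init__(self, recordings: dict):
        # Keys are canonical JSON of the call arguments (hypothetical storage format).
        self._recordings = recordings

    def call(self, **args):
        key = json.dumps(args, sort_keys=True)
        # An unrecorded call returns a fixed error, so results stay deterministic.
        return self._recordings.get(key, {"error": "no recorded response"})


class EpisodeTicketStore:
    """Stateful pattern: every episode starts from its own copy of the seed state."""

    def __init__(self, seed_tickets: dict):
        self._tickets = copy.deepcopy(seed_tickets)

    def set_status(self, ticket_id: str, status: str) -> dict:
        if ticket_id not in self._tickets:
            return {"error": "unknown ticket"}
        self._tickets[ticket_id]["status"] = status
        return {"ok": True, "ticket": dict(self._tickets[ticket_id])}


def run_sql_isolated(setup_statements: list, query: str) -> list:
    """Verifiable-outcome pattern: execute SQL in a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    try:
        for stmt in setup_statements:
            conn.execute(stmt)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


print(rollouts_per_step(128, 8))  # 1024, matching the example in the text

order_lookup = ReplayTool({json.dumps({"order_id": "A17"}, sort_keys=True): {"state": "shipped"}})
seed = {"T-1": {"status": "open"}}
episode_a, episode_b = EpisodeTicketStore(seed), EpisodeTicketStore(seed)
episode_a.set_status("T-1", "closed")  # does not leak into episode_b or the seed
```

Because each episode deep-copies its seed state and each SQL check opens a fresh in-memory database, one rollout's side effects cannot change what another rollout observes.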

Before training, the environment must be verified for reproducibility and representativeness. Reproducibility means the same tool called with the same arguments returns the same result, which keeps reward computation stable. Representativeness means the environment reflects real schemas and data distributions, so the model's behavior transfers to production. To confirm the environment is configured correctly, run the same instance twice and compare the rollout messages. Each rollout should have its own isolated state, with separate temp directories, IDs, and DB connections. The available tools and their request/response schemas must match the production environment.

Setting up an external evaluation before training makes it possible to measure success directly. This evaluation should score the outcome that matters at deployment and be computed independently of the reward. For SOP-Bench, the evaluation is an exact match on the final JSON object in the agent's output, requiring every field to match the corresponding ground-truth field.
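Both checks described above are small enough to sketch. The code below is a hedged illustration, not the SOP-Bench scorer: `run_rollout` stands in for whatever function drives one episode in your setup, and the rule "take the last JSON object that appears in the output" is an assumption about where the final answer lives.

```python
import json


def is_reproducible(run_rollout, instance) -> bool:
    """Run the same instance twice and compare the rollout messages."""
    first = run_rollout(instance)
    second = run_rollout(instance)
    return first == second


def extract_final_json(text: str):
    """Return the last top-level JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    last = None
    i = 0
    while i < len(text):
        if text[i] == "{":
            try:
                obj, end = decoder.raw_decode(text, i)
            except json.JSONDecodeError:
                i += 1  # not valid JSON starting here, keep scanning
                continue
            last = obj
            i = end  # skip past the decoded object, including nested ones
        else:
            i += 1
    return last


def exact_match(agent_output: str, ground_truth: dict) -> bool:
    """Pass only if every field equals the ground truth and no field is missing or extra."""
    predicted = extract_final_json(agent_output)
    return predicted is not None and predicted == ground_truth


truth = {"decision": "approve", "amount": 10}
output = 'Draft: {"decision": "deny"} Final: {"decision": "approve", "amount": 10}'
print(exact_match(output, truth))  # True
```

Keeping this scorer separate from the reward function is what makes it an external evaluation: it can flag cases where the reward goes up while the outcome that matters at deployment does not. Note that Python's dictionary equality treats `1` and `1.0` as equal, so a stricter type-aware comparison may be needed if the ground truth distinguishes them.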

Key points

Amazon SageMaker AI provides a multi-turn reinforcement learning service for training agents that handle complex, sequential tasks.
The service supports custom rewards, tool loops, and multi-turn conversation shapes with serverless execution and per-token pricing.
A simulated environment is recommended to prevent unintended customer impacts during training.
Reproducibility and representativeness are critical to ensure the model's behavior transfers to production.
An external evaluation is set up to score the outcome at deployment, computed independently of the reward.
The SOP-Bench dataset is used to evaluate agents' ability to resolve tasks based on complex Standard Operating Procedures (SOP) across 12 business domains.

Source: AWS Machine Learning

WRITTEN BY

Theo Almeida

AI Software & Developer Tools

Theo covers AI software, developer tools, frameworks, and the platforms builders use every day.
