Training large language models with reinforcement learning (RL) becomes harder once tool calls enter the training loop. Single-turn RL is comparatively straightforward, but multi-turn rollouts can break the Token-In, Token-Out (TITO) invariant: gradients must be computed on the exact token IDs the model sampled, not on a re-encoded version of the same text. When a model calls a tool mid-rollout, its generated tokens are typically decoded to text, parsed, re-tokenized, and appended back into the conversation. Hugging Face emphasizes that the problem arises because tokenization does not round-trip: the same string can correspond to more than one token sequence, so decoding and re-encoding can yield different tokens from the ones the model produced. Such a mismatch corrupts the gradient signal silently, because the text still looks correct even though the token IDs have changed. The proposed solution is to keep a buffer of the model’s sampled tokens and to avoid re-encoding them, so that gradients are applied to the tokens that were actually sampled. *Source: [huggingface](https://huggingface.co/blog/huggingface/tito)*
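The round-trip failure and the buffer-based fix can be illustrated with a minimal sketch. Everything here is hypothetical and chosen for illustration: the toy vocabulary, the greedy `encode` function (a stand-in for a real tokenizer), and the `RolloutBuffer` class are not taken from the source or from any library. The `loss_mask` field is an added assumption, reflecting the common practice of excluding tool output from the loss.

```python
from dataclasses import dataclass, field

# Hypothetical toy vocabulary: "ab" exists as a merged token next to "a" and "b".
VOCAB = ["a", "b", "ab", "<tool>", "</tool>", "4", "2"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def decode(ids):
    """Concatenate the token strings for a list of token IDs."""
    return "".join(VOCAB[i] for i in ids)


def encode(text):
    """Greedy longest-match encoding, standing in for a real tokenizer."""
    by_length = sorted(VOCAB, key=len, reverse=True)
    ids, pos = [], 0
    while pos < len(text):
        for tok in by_length:
            if text.startswith(tok, pos):
                ids.append(TOKEN_TO_ID[tok])
                pos += len(tok)
                break
        else:
            raise ValueError(f"cannot encode text at position {pos}")
    return ids


@dataclass
class RolloutBuffer:
    """Stores sampled token IDs verbatim; only external text is tokenized."""
    token_ids: list = field(default_factory=list)
    loss_mask: list = field(default_factory=list)  # assumption: 1 = train on token

    def add_sampled(self, ids):
        # Keep exactly what the model sampled; never decode and re-encode.
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))

    def add_tool_output(self, text):
        # Tool output arrives as text, so it is encoded once, here.
        ids = encode(text)
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))


# The model samples "a" then "b" as two separate tokens.
sampled = [TOKEN_TO_ID["a"], TOKEN_TO_ID["b"]]

# Decoding and re-encoding gives the same text but a different token sequence.
reencoded = encode(decode(sampled))

# TITO-preserving rollout: sampled IDs go into the buffer untouched.
buffer = RolloutBuffer()
buffer.add_sampled(sampled)
buffer.add_tool_output("<tool>42</tool>")
buffer.add_sampled([TOKEN_TO_ID["ab"]])
```

In this sketch `sampled` is `[0, 1]` while `reencoded` is `[2]`: both decode to `"ab"`, yet a loss computed on `[2]` would score a token the model never sampled. The buffer avoids this by concatenating token IDs directly, so the training sequence matches the rollout token for token.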