Hugging Face Introduces Token-In, Token-Out for RL Training
Hugging Face outlines a way to keep token sequences intact during reinforcement learning, so that gradients are computed on the exact tokens the model generated. The approach avoids the token mismatches that arise in multi-turn, tool-calling rollouts.
Training large language models with reinforcement learning (RL) becomes harder once tool calls enter the training loop. Single-turn RL appears straightforward, but multi-turn rollouts can break the Token-In, Token-Out (TITO) invariant: gradients must be computed on the exact tokens the model sampled, not on a re-encoded version that may differ from the original sequence. When a model calls a tool mid-rollout, its generated tokens are parsed, re-tokenized, and appended back into the conversation, and the resulting sequence may no longer match what the model sampled, which corrupts the gradient signal. Hugging Face emphasizes that the root cause is that tokenization is not a reversible mapping: decoding tokens to text and encoding that text again can produce a different token sequence than the one the model produced. The proposed solution is to keep a buffer of the model’s sampled tokens and never re-encode them. Gradients are then applied to the correct tokens, which prevents silent errors that can undermine RL training. *Source: [huggingface](https://huggingface.co/blog/huggingface/tito)*
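To make the round-trip problem concrete, here is a minimal sketch using a toy tokenizer. The three-entry vocabulary and the `encode`/`decode` helpers are hypothetical stand-ins, not the tokenizer or code from the original post; they are chosen only so that two different ID sequences decode to the same string.

```python
# Toy vocabulary (hypothetical): "a" + "b" and the merged "ab" token
# both decode to the same text.
VOCAB = {0: "a", 1: "b", 2: "ab"}
TEXT_TO_ID = {text: tok_id for tok_id, text in VOCAB.items()}


def decode(ids):
    """Concatenate the text of each token ID."""
    return "".join(VOCAB[i] for i in ids)


def encode(text):
    """Greedy longest-match encoding, a simplified stand-in for a real tokenizer."""
    ids = []
    pos = 0
    max_len = max(len(t) for t in TEXT_TO_ID)
    while pos < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in TEXT_TO_ID:
                ids.append(TEXT_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot encode {text[pos:]!r}")
    return ids


sampled_ids = [0, 1]             # the model sampled "a" then "b"
text = decode(sampled_ids)       # "ab"
reencoded_ids = encode(text)     # greedy match picks the merged "ab" token

print(sampled_ids, reencoded_ids, sampled_ids == reencoded_ids)
# prints: [0, 1] [2] False
```

Both sequences render as the same text, yet a loss computed on `reencoded_ids` would score a token the model never sampled.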
Key points
- Hugging Face outlines a method for keeping token sequences intact during reinforcement learning, so that gradients are computed on the exact tokens the model generated.
- Tokenization is not a reversible mapping: decoding tokens to text and re-encoding that text can yield a different token sequence than the one the model produced.
- The TITO invariant ensures that gradients are computed on the exact tokens the model sampled, rather than re-encoded versions that may differ from the original sequence.
- Maintaining a buffer of the model’s sampled tokens and avoiding re-encoding them preserves the integrity of the training process.
- Re-tokenizing the conversation at the end can give a slightly different token sequence than the one the model sampled, leading to incorrect gradient computation.
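The buffer idea from the key points could look roughly like the sketch below. `RolloutBuffer`, its method names, and all token IDs are hypothetical. The loss mask is an added assumption (a common way to exclude prompt and tool-output tokens from the loss), not something the source describes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RolloutBuffer:
    """Accumulates token IDs for one multi-turn rollout (hypothetical helper)."""
    token_ids: List[int] = field(default_factory=list)
    loss_mask: List[int] = field(default_factory=list)  # 1 = sampled by the model

    def append_sampled(self, ids: List[int]) -> None:
        # Store the IDs exactly as the model emitted them; never decode and re-encode.
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))

    def append_external(self, ids: List[int]) -> None:
        # Prompt or tool-output tokens: tokenized once from text, excluded from the loss.
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))


buf = RolloutBuffer()
buf.append_external([10, 11])       # prompt tokens (made-up IDs)
buf.append_sampled([0, 1])          # model turn that ends in a tool call
buf.append_external([20, 21, 22])   # tool result, tokenized once and appended
buf.append_sampled([5])             # model's final answer

# Tokens that would receive gradient: exactly what the model sampled.
trainable = [t for t, m in zip(buf.token_ids, buf.loss_mask) if m == 1]
print(trainable)
```

The text of the model's turn can still be decoded for tool parsing; the point of the sketch is that the decoded text is never fed back through the tokenizer to rebuild the training sequence.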