
Hugging Face Introduces Token-In, Token-Out for RL Training

Hugging Face outlines a method for preserving token integrity during reinforcement learning, so that gradients are computed on the exact tokens the model generated. The approach addresses token mismatches that arise in multi-turn interactions.


Training large language models with reinforcement learning (RL) presents particular challenges once tool calls enter the training loop. Single-turn RL is comparatively straightforward, but multi-turn rollouts can break the Token-In, Token-Out (TITO) invariant: gradients must be computed on the exact tokens the model sampled, not on a re-encoded version that may differ from the original sequence. When a model calls a tool mid-rollout, its generated tokens are parsed, re-tokenized, and appended back into the conversation, and the resulting sequence may no longer match what was sampled, which corrupts the gradient signal.
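A toy illustration of the mismatch, not taken from the Hugging Face post: the vocabulary, the greedy longest-match encoder, and all names below are hypothetical stand-ins for a real tokenizer. The sketch only shows that two different token sequences can decode to the same text.

```python
# Toy vocabulary (hypothetical): token id -> text piece.
VOCAB = {0: "a", 1: "b", 2: "ab", 3: "c"}
PIECE_TO_ID = {piece: tid for tid, piece in VOCAB.items()}
MAX_PIECE_LEN = max(len(piece) for piece in PIECE_TO_ID)


def decode(token_ids):
    """Concatenate the text pieces for a list of token ids."""
    return "".join(VOCAB[t] for t in token_ids)


def encode(text):
    """Greedy longest-match encoding, standing in for a real tokenizer."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shorter ones.
        for length in range(min(MAX_PIECE_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in PIECE_TO_ID:
                ids.append(PIECE_TO_ID[piece])
                i += length
                break
        else:
            raise ValueError(f"cannot encode {text[i]!r}")
    return ids


sampled = [0, 1, 3]        # the model sampled "a", "b", "c" as separate tokens
text = decode(sampled)     # the text that a tool-call parser would see
reencoded = encode(text)   # the encoder merges "a" + "b" into the "ab" token

print(sampled, reencoded, sampled == reencoded)
```

Both sequences decode to the same string, yet they differ in ids and in length, so log-probabilities computed on the re-encoded sequence would not correspond to the tokens that were sampled.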

Hugging Face attributes the problem to the fact that tokenization does not round-trip: more than one token sequence can decode to the same text, so re-encoding that text can yield a different sequence than the one the model produced. The proposed solution is to keep a buffer of the model’s sampled tokens and never re-encode them. Gradients are then applied to the tokens that were actually sampled, which prevents silent errors that would otherwise undermine RL training.
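The buffer idea might look roughly like the following minimal sketch. It is an assumption about the shape of such a buffer, not the Hugging Face implementation: `RolloutBuffer`, `add_context`, `add_sampled`, and the loss mask are hypothetical names introduced here, and the token ids are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RolloutBuffer:
    """Hypothetical container that accumulates token ids across turns."""
    token_ids: List[int] = field(default_factory=list)
    # 1 where the model sampled the token, 0 for prompt or tool output.
    loss_mask: List[int] = field(default_factory=list)

    def add_context(self, ids: List[int]) -> None:
        """Append tokens the model did not sample (prompt, tool results)."""
        self.token_ids.extend(ids)
        self.loss_mask.extend([0] * len(ids))

    def add_sampled(self, ids: List[int]) -> None:
        """Append sampled ids as-is; they are never decoded and re-encoded."""
        self.token_ids.extend(ids)
        self.loss_mask.extend([1] * len(ids))


buffer = RolloutBuffer()
buffer.add_context([10, 11])      # prompt, tokenized once
buffer.add_sampled([0, 1, 3])     # turn 1: ids exactly as sampled
buffer.add_context([20, 21, 22])  # tool output, tokenized on its own
buffer.add_sampled([5, 6])        # turn 2: ids exactly as sampled

# Positions that would contribute to the RL loss in this sketch.
trained_ids = [t for t, m in zip(buffer.token_ids, buffer.loss_mask) if m == 1]
print(trained_ids)
```

Only new text, such as the tool output, is tokenized at each step. The earlier part of the sequence is extended in place and never rebuilt from a decoded string, so the ids used for training are the ones the model emitted.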

Source: Hugging Face

Key points

Hugging Face outlines a method for preserving token integrity during reinforcement learning in multi-turn, tool-calling rollouts.
The TITO invariant requires that gradients be computed on the exact tokens the model sampled, not on a re-encoded version of them.
Tokenization does not round-trip: decoding sampled tokens to text and re-encoding that text can yield a different token sequence.
Re-tokenizing the conversation at the end of a rollout can therefore produce a slightly different sequence than the one the model sampled, which leads to incorrect gradients.
Keeping a buffer of the model’s sampled tokens and never re-encoding them preserves the invariant.


WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.
