Larger OLMo Models Learn Rare Skills That Smaller Ones Miss

A new study shows larger OLMo models reliably learn rare tasks that smaller ones fail to grasp, even with extensive training.

Researchers at Anthropic, Stanford, and other institutions have identified why larger language models such as OLMo reliably learn rare tasks that smaller models often miss. The study challenges the common belief that bigger models simply learn faster: size helps because it lets a model focus on less frequent but important skills once it has mastered the basic tasks. The finding comes from experiments that tested models of varying sizes on tasks with different frequencies and complexities.

The research attributes the smaller models' difficulty with rare tasks to an 'update-and-forget' loop: they quickly learn from a rare example but then lose it when training shifts back to more frequent tasks. Larger models, in contrast, retain these signals between training steps, which lets them build on rare tasks over time. In the experiments, only models with enough parameters learned tasks that made up just 0.25 percent of the training data. The study also showed that larger OLMo models can grasp complex tasks such as modular addition, where they move from memorization to understanding the underlying principles.
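To make the memorization-versus-understanding distinction concrete, here is a minimal Python sketch of a modular addition task. It is an illustration only, not the study's setup: the modulus, the 50/50 split, and all function names are assumptions chosen for the example. It compares a lookup table that only memorizes training pairs with a rule that actually computes the sum.

```python
import random


def modular_addition_examples(modulus):
    """List every (a, b, (a + b) % modulus) triple for the given modulus."""
    return [(a, b, (a + b) % modulus)
            for a in range(modulus)
            for b in range(modulus)]


def split_examples(examples, train_fraction, seed=0):
    """Shuffle with a fixed seed and split into training and held-out sets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def exact_match_accuracy(predict, examples):
    """Fraction of examples where predict(a, b) equals the target."""
    correct = sum(1 for a, b, target in examples if predict(a, b) == target)
    return correct / len(examples)


MODULUS = 7  # hypothetical value; the article does not state the modulus used
all_examples = modular_addition_examples(MODULUS)
train, held_out = split_examples(all_examples, train_fraction=0.5)

# A pure memorizer: answers only pairs it has seen, otherwise returns -1.
lookup = {(a, b): target for a, b, target in train}


def memorizer(a, b):
    return lookup.get((a, b), -1)


# A learner that has grasped the underlying rule.
def rule(a, b):
    return (a + b) % MODULUS


print("memorizer, train:   ", exact_match_accuracy(memorizer, train))
print("memorizer, held-out:", exact_match_accuracy(memorizer, held_out))
print("rule, held-out:     ", exact_match_accuracy(rule, held_out))
```

The memorizer is perfect on pairs it has seen and fails on every held-out pair, while the rule gets both right. That gap between training and held-out accuracy is one common way to tell memorization apart from generalization.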

The researchers ran the experiments on OLMo models trained on up to 210 billion tokens from the Dolma corpus. They mixed two artificial tasks into the data, at frequencies ranging from 1,000 instances per batch down to one instance every ten batches. Smaller models, such as the 20M-parameter version, struggled to retain signals from rare tasks, while larger models, such as the 300M- and 1B-parameter versions, maintained the signal throughout training.
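The two ends of that frequency range can be pictured with a short Python sketch. It assumes, purely for illustration, that task instances are spread evenly across batches; the article does not describe how the researchers actually placed them, and the function name is hypothetical.

```python
import math
from fractions import Fraction


def rare_task_counts(instances_per_batch, num_batches):
    """Return how many synthetic-task instances to inject into each batch.

    Assumption for this sketch: instances are spread evenly, so a rate of
    1/10 yields one instance in every tenth batch. Exact fractions avoid
    floating-point drift over long schedules.
    """
    counts = []
    emitted = 0
    for step in range(1, num_batches + 1):
        due = math.floor(step * instances_per_batch)  # total owed so far
        counts.append(due - emitted)
        emitted = due
    return counts


# Most frequent setting mentioned in the article: 1,000 instances per batch.
print(rare_task_counts(1000, 3))

# Rarest setting mentioned: one instance every ten batches.
print(rare_task_counts(Fraction(1, 10), 20))
```

At the rarest setting, a model sees the task once and then trains on nine batches without it, which is the window in which a small model can lose what it just learned.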

Source: The Decoder

Key points

Larger OLMo models reliably learn rare tasks that smaller ones fail to grasp, even with extensive training.
Smaller models struggle with rare tasks due to an 'update-and-forget' loop, where they quickly learn a rare example but then lose it when training shifts to more frequent tasks.
Only models with enough parameters can learn tasks making up just 0.25 percent of the training data.
Larger models like OLMo can grasp complex tasks such as modular addition, where they move from memorization to understanding the underlying principles.
Researchers tested OLMo models trained on up to 210 billion tokens from the Dolma corpus, mixing two artificial tasks into the data with varying frequencies.
Smaller models like the 20M parameter version struggled to retain signals from rare tasks, while larger models like the 300M and 1B parameter versions maintained the signal throughout training.

WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.
