Researchers at Anthropic, Stanford, and other institutions have identified why larger language models like OLMo can reliably learn rare tasks that smaller models often miss. The study challenges the common belief that bigger models simply learn faster, revealing that size helps by allowing models to focus on less frequent but important skills once basic tasks are mastered. This finding comes from experiments that tested models of varying sizes on tasks with different frequencies and complexities. | Source: thedecoder

The research shows that smaller models struggle with rare tasks because of an 'update-and-forget' loop: they quickly pick up a rare example, then lose it when training returns to more frequent tasks. Larger models, in contrast, retain these signals between training steps, which lets them build on rare tasks over time. In the experiments, only models with enough parameters could learn tasks that made up just 0.25 percent of the training data. The study also found that larger models like OLMo can grasp complex tasks such as modular addition, where they move from memorization to understanding the underlying principles.
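
For readers unfamiliar with the modular addition task, the following is a minimal sketch of how such a dataset could be built. The function names, the small modulus of 7, and the 50/50 split are illustrative assumptions, not details from the study. Holding out part of the pairs is one common way to tell memorization (high accuracy only on seen pairs) from learning the underlying rule (high accuracy on unseen pairs).

```python
import random


def modular_addition_examples(modulus: int):
    """Return every ((a, b), (a + b) mod modulus) pair for 0 <= a, b < modulus."""
    return [((a, b), (a + b) % modulus)
            for a in range(modulus)
            for b in range(modulus)]


def split_examples(examples, train_fraction: float, seed: int = 0):
    """Shuffle a copy of the examples and split it into train and held-out parts."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical settings: modulus 7 and a 50/50 split, chosen only for illustration.
examples = modular_addition_examples(7)
train, held_out = split_examples(examples, 0.5)
```

A model that has only memorized `train` would be expected to fail on `held_out`, while one that has learned the addition rule should answer both.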

The researchers tested these findings with OLMo models trained on up to 210 billion tokens from the Dolma corpus. They mixed two artificial tasks into the data, at frequencies ranging from 1,000 instances per batch down to one instance every ten batches. Smaller models such as the 20M-parameter version struggled to retain signals from rare tasks, while larger models such as the 300M- and 1B-parameter versions maintained the signal throughout training.
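
To make the two ends of that frequency range concrete, here is a small sketch of a per-batch schedule. It assumes a fixed, deterministic schedule expressed as "instances per batch"; the function name `rare_task_count` and the schedule itself are hypothetical, and the study may have mixed the tasks in differently (for example, by random sampling).

```python
def rare_task_count(batch_index: int, instances_per_batch: float) -> int:
    """Number of synthetic-task instances to place in the given batch.

    Assumption: rates of 1 or more mean that many instances in every batch;
    rates below 1 mean a single instance every round(1 / rate) batches.
    """
    if instances_per_batch <= 0:
        raise ValueError("instances_per_batch must be positive")
    if instances_per_batch >= 1:
        return round(instances_per_batch)
    period = round(1 / instances_per_batch)  # e.g. 0.1 -> every 10th batch
    return 1 if batch_index % period == 0 else 0


# The two extremes mentioned in the article, over 20 consecutive batches.
counts_dense = [rare_task_count(i, 1000) for i in range(20)]
counts_sparse = [rare_task_count(i, 0.1) for i in range(20)]
```

Under this schedule the dense setting places 1,000 instances in each batch, while the sparse setting places a single instance in batches 0 and 10 only, which illustrates how little signal a model sees between updates at the rare end.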
