
NVIDIA Introduces Task-Seeded SDG for Nemotron Training

NVIDIA's task-seeded synthetic data generation improved several benchmarks by up to 11.1 points in a 100B-token experiment with Nemotron-3 Nano.


NVIDIA has introduced a task-seeded synthetic data generation (SDG) approach for training its Nemotron models. The method uses the training splits of public tasks as seeds and generates new examples that follow each task's structure, with the goal of improving model performance across domains. It aims to strengthen the structured learning signal in large-scale language model training by producing compact, task-aligned examples with clear information needs and constrained response spaces.

In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG produced gains of +1.8 on MMLU-Pro, +1.9 on average code, +1.6 on commonsense understanding, and +11.1 on GPQA, while average math stayed stable. The workflow has five steps: collecting training-split seeds, normalizing heterogeneous task records, generating new examples, enriching answers with reasoning and relevant knowledge, and filtering the results into curated synthetic datasets. Held-out evaluation and test data are excluded from generation. Downstream training recipes then decide how to mix the curated datasets with the broader corpus.
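The normalize, generate, enrich, and filter steps can be pictured with a minimal Python sketch. Everything named here is hypothetical and chosen for illustration: the `Seed` schema, the field aliases in `normalize`, and the toy `generate`, `enrich`, and `keep` callables, which stand in for model calls and quality filters that the article does not describe. The sketch also assumes its input already comes from training splits only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Seed:
    """Hypothetical common schema for one training-split record."""
    task: str
    question: str
    answer: str


def first_present(record: dict, keys: tuple) -> str:
    """Return the value of the first key that is present, as stripped text."""
    for key in keys:
        if record.get(key) is not None:
            return str(record[key]).strip()
    return ""


def normalize(task: str, record: dict) -> Optional[Seed]:
    """Map a heterogeneous record onto Seed; the field aliases are assumptions."""
    question = first_present(record, ("question", "prompt", "input"))
    answer = first_present(record, ("answer", "target", "output"))
    if not question or not answer:
        return None  # unusable record, skip it
    return Seed(task, question, answer)


def build_dataset(train_rows, generate, enrich, keep):
    """Run normalize -> generate -> enrich -> filter over training-split rows."""
    curated = []
    for task, record in train_rows:
        seed = normalize(task, record)
        if seed is None:
            continue
        for example in generate(seed):
            enriched = enrich(example)
            if keep(enriched):
                curated.append(enriched)
    return curated


# Toy stand-ins for the model-backed steps (assumptions, not NVIDIA's method).
def toy_generate(seed):
    return [{"task": seed.task,
             "question": f"Variant of: {seed.question}",
             "answer": seed.answer}]


def toy_enrich(example):
    return {**example, "reasoning": "<model-written reasoning goes here>"}


def toy_keep(example):
    return len(example["question"]) <= 200


train_rows = [
    ("toy_arithmetic", {"prompt": "2 + 2 = ?", "target": 4}),
    ("toy_arithmetic", {"question": "", "answer": "n/a"}),  # dropped: empty question
]
curated = build_dataset(train_rows, toy_generate, toy_enrich, toy_keep)
print(len(curated))
```

Passing the three model-backed steps in as callables keeps the sketch runnable without any model, and mirrors the idea that the seed supplies task structure while generation supplies new content.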

The approach draws on public task datasets from lm-eval-harness, covering about 700 subtasks across 70 tasks. Seed tasks are selected for having suitable training splits, and held-out test data is excluded. The generated data is used for late-stage Nemotron-family training, including the Ultra and Super workstreams. NVIDIA emphasized that the goal is to expose the model to reusable reasoning and knowledge-use patterns across task families without tying the dataset to the surface format of any one data source.
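As a rough illustration of seed-task selection, the sketch below keeps only tasks that expose a training split and never reads any other split. The catalog layout, the task names, and the `min_train_rows` threshold are assumptions made for the example. They are not taken from lm-eval-harness or from NVIDIA's pipeline.

```python
TRAIN_SPLIT = "train"


def select_seed_tasks(catalog: dict, min_train_rows: int = 1) -> dict:
    """Return {task: n_train_rows} for tasks with a usable training split.

    `catalog` maps a task name to {split_name: row_count}. Only the
    training split is consulted, so validation and test rows never
    become seeds.
    """
    selected = {}
    for task, splits in catalog.items():
        n_train = splits.get(TRAIN_SPLIT, 0)
        if n_train >= min_train_rows:
            selected[task] = n_train
    return selected


# Hypothetical catalog; names and sizes are made up for illustration.
catalog = {
    "task_a": {"train": 5000, "test": 500},
    "task_b": {"test": 1200},                 # test-only: not usable as a seed
    "task_c": {"train": 40, "validation": 10},
}
seeds = select_seed_tasks(catalog, min_train_rows=100)
print(seeds)
```

With the threshold set to 100, `task_b` is skipped because it has no training split and `task_c` is skipped because its training split is too small.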

Source: Hugging Face

Key points

NVIDIA's task-seeded synthetic data generation improved MMLU-Pro by +1.8 in a 100B-token experiment with Nemotron-3 Nano.
Task-seeded SDG improved average code by +1.9 and GPQA by +11.1 in the same experiment.
The workflow uses public task training splits as capability seeds, not as examples to memorize.
Generated data is filtered into curated synthetic datasets; held-out evaluation and test data are excluded from generation.
The approach leverages public task datasets from lm-eval-harness, covering about 700 subtasks across 70 tasks.
Seed tasks are selected for having suitable training splits, and held-out test data is excluded.
The generated data is used for late-stage Nemotron-family training, including Ultra and Super workstreams.


WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.
