NVIDIA has introduced a task-seeded synthetic data generation (SDG) approach for training its Nemotron models. The method uses public task training splits as seeds to generate new examples that align with task structures, enhancing model performance across various domains. The approach aims to improve structured learning signals in large-scale language model development by creating compact, task-aligned examples with clear information needs and constrained response spaces.
In a 100B-token continuation experiment on the Nemotron-3 Nano model, task-seeded SDG improved MMLU-Pro by +1.8, average code by +1.9, commonsense understanding by +1.6, and GPQA by +11.1, while keeping average math stable. The workflow involves collecting training-split seeds, normalizing heterogeneous task records, generating new examples, enriching answers with reasoning and relevant knowledge, and filtering the resulting data into curated synthetic datasets. Held-out evaluation and test data are excluded from generation. Downstream training recipes then decide how to mix the curated datasets with the broader corpus.
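The workflow described above can be illustrated with a minimal Python sketch. It is not NVIDIA's implementation: the field names (`question`, `prompt`, `input`, `answer`, `target`, `output`), the `Seed` schema, the `generate` callback standing in for the generation and enrichment model, and the filtering rules (non-empty fields, exact-duplicate removal) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Seed:
    """Common schema for a normalized seed record (hypothetical)."""
    task: str
    question: str
    answer: str


def first_present(record: dict, keys: List[str]) -> Optional[str]:
    """Return the first non-empty value among `keys`, as a stripped string."""
    for key in keys:
        value = record.get(key)
        if value is not None and str(value).strip():
            return str(value).strip()
    return None


def normalize(task: str, record: dict) -> Optional[Seed]:
    """Map a heterogeneous task record onto the common schema.

    The candidate field names are assumptions; real datasets vary.
    """
    question = first_present(record, ["question", "prompt", "input"])
    answer = first_present(record, ["answer", "target", "output"])
    if question is None or answer is None:
        return None
    return Seed(task, question, answer)


def collect_train_seeds(task: str, splits: Dict[str, List[dict]]) -> List[Seed]:
    """Read seeds from the training split only; other splits are never touched."""
    seeds = []
    for record in splits.get("train", []):
        seed = normalize(task, record)
        if seed is not None:
            seeds.append(seed)
    return seeds


def keep(example: dict, seen: set) -> bool:
    """Illustrative filter: drop empty or exactly duplicated questions."""
    question = str(example.get("question", "")).strip()
    answer = str(example.get("answer", "")).strip()
    key = question.lower()
    if not question or not answer or key in seen:
        return False
    seen.add(key)
    return True


def run_pipeline(
    task: str,
    splits: Dict[str, List[dict]],
    generate: Callable[[Seed], dict],
) -> List[dict]:
    """Collect seeds, generate one new example per seed, then filter."""
    seen: set = set()
    curated = []
    for seed in collect_train_seeds(task, splits):
        example = generate(seed)  # stand-in for generation plus enrichment
        if keep(example, seen):
            curated.append(example)
    return curated
```

In this sketch, `generate` would wrap whatever model call produces a new question, answer, and reasoning trace from a seed. Because `collect_train_seeds` only ever reads the `"train"` key, records in evaluation or test splits cannot reach the generator.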
The approach draws on public task datasets from lm-eval-harness, covering about 700 subtasks across 70 tasks. Seed tasks are selected based on whether they provide suitable training splits, and held-out test data is excluded. The generated data is used for late-stage Nemotron-family training, including the Ultra and Super workstreams. NVIDIA emphasized that the goal is to expose the model to reusable reasoning and knowledge-use patterns across task families without tying the dataset to the surface format of any one data source.
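The seed-selection step can be sketched as a simple eligibility check over task metadata. The `Subtask` record and the split names below are hypothetical and do not reflect the lm-eval-harness API; the sketch assumes that a subtask qualifies as a seed source only if it ships a training split.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, NamedTuple


class Subtask(NamedTuple):
    """Hypothetical metadata for one subtask within a task family."""
    family: str
    name: str
    splits: FrozenSet[str]


def eligible_seed_subtasks(subtasks: Iterable[Subtask]) -> Dict[str, List[str]]:
    """Group subtasks that have a training split by their task family.

    Subtasks that only provide validation or test data are skipped, so
    held-out data never becomes a seed source.
    """
    by_family = defaultdict(list)
    for subtask in subtasks:
        if "train" in subtask.splits:
            by_family[subtask.family].append(subtask.name)
    return {family: sorted(names) for family, names in by_family.items()}


catalog = [
    Subtask("science_qa", "physics", frozenset({"train", "test"})),
    Subtask("science_qa", "biology", frozenset({"train", "validation"})),
    Subtask("science_qa", "chemistry", frozenset({"test"})),
    Subtask("exam_only", "final", frozenset({"test"})),
]
selected = eligible_seed_subtasks(catalog)
```

With this toy catalog, only the `physics` and `biology` subtasks are selected, and the `exam_only` family drops out entirely because none of its subtasks has a training split.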
Source: huggingface