Jasper AI has released MONET, a large open image-text dataset intended to advance text-to-image research. The dataset contains 104.9 million high-quality images, curated from 2.9 billion source images through a multi-stage filtering process. MONET aims to give researchers what they need to train production-grade text-to-image models without the prohibitive costs of traditional approaches. The release also includes nano-t2i, a minimal codebase for training a competitive diffusion model on a single GPU in a few days. Together, the dataset and codebase are meant to lower the barrier to entry for academic researchers and smaller companies. The dataset is available under the Apache 2.0 license, which permits commercial use.

*Source: [huggingface](https://huggingface.co/blog/jasperai/monet)*

MONET was created in response to a core difficulty in training text-to-image models: they require large collections of high-quality image-text pairs. Existing datasets such as LAION-5B were too large and messy, containing duplicates, low-quality images, and harmful content. More curated alternatives were either too small for pre-training or kept proprietary. MONET addresses these issues as the first openly released, filtered, deduplicated, and multi-captioned dataset built specifically for pre-training large text-to-image models.

The curation process involved six stages, including aesthetic pre-filtering, safety filtering, deduplication, and domain filtering. It reduced the initial 2.9 billion images to 104.9 million high-quality samples, roughly 3.6% of the starting pool. MONET also includes AI-generated captions from four different vision-language models, giving each image diverse and detailed descriptions. The content spans a wide range of human visual culture, from street scenes and wildlife to digital art and food, with a balanced distribution across the relevant categories.
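To make the staged curation concrete, here is a minimal sketch of how a chain of filters of this kind could be composed. It is not Jasper AI's pipeline: the `Sample` fields, the thresholds, the blocked domain, and all function names are hypothetical, and only the four stages named above are represented (the source does not list the other two).

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Sample:
    """Hypothetical record; real pipelines would carry model scores and hashes."""
    url: str
    aesthetic: float    # assumed aesthetic score in [0, 1]
    safe: bool          # assumed verdict from a safety classifier
    content_hash: str   # assumed perceptual hash used for deduplication
    domain: str         # assumed content category


def aesthetic_filter(samples: Iterable[Sample], min_score: float = 0.5) -> Iterator[Sample]:
    # Keep only samples at or above the (hypothetical) aesthetic threshold.
    return (s for s in samples if s.aesthetic >= min_score)


def safety_filter(samples: Iterable[Sample]) -> Iterator[Sample]:
    # Drop anything the safety classifier flagged.
    return (s for s in samples if s.safe)


def deduplicate(samples: Iterable[Sample]) -> Iterator[Sample]:
    # Keep the first sample seen for each content hash.
    seen = set()
    for s in samples:
        if s.content_hash not in seen:
            seen.add(s.content_hash)
            yield s


def domain_filter(samples: Iterable[Sample],
                  blocked: frozenset = frozenset({"screenshot"})) -> Iterator[Sample]:
    # Drop samples from (hypothetical) unwanted domains.
    return (s for s in samples if s.domain not in blocked)


def run_pipeline(samples: Iterable[Sample]) -> List[Sample]:
    # Apply the stages in order; each stage consumes the previous stage's output.
    for stage in (aesthetic_filter, safety_filter, deduplicate, domain_filter):
        samples = stage(samples)
    return list(samples)


def retention_rate(kept: int, total: int) -> float:
    # Fraction of the starting pool that survives curation.
    return kept / total
```

With the figures reported above, `retention_rate(104_900_000, 2_900_000_000)` is about 0.036, so fewer than 4 in 100 source images survive curation. Running cheap filters before deduplication, as in this ordering, is one plausible design choice for keeping the expensive stages small; the source does not state the actual order of costs.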
MONET mixes real and AI-generated images, with synthetic images making up 13% of the dataset. The team ran experiments to find the best proportion of synthetic data and reports that a 50% mix gave the best FID score, while training on 100% synthetic data caused a significant drop in quality, the so-called 'AI eating itself' problem. According to the team, MONET's 13% synthetic ratio improves text-image alignment without the risks of synthetic data saturation.

To validate the dataset, the team compared a 4-billion-parameter model trained on MONET against existing commercial and research models. It outperformed larger models such as DALL-E 3 and FLUX.1 Dev on the GenEval benchmark, which indicates that training on open data can produce competitive results.
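The synthetic-ratio experiments described above amount to building training mixes with a fixed share of synthetic samples. The following is a minimal sketch of one way to do that, assuming two in-memory pools; the function name `mix_by_ratio` and its parameters are hypothetical and are not part of MONET or nano-t2i.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_by_ratio(real: Sequence[T], synthetic: Sequence[T],
                 synthetic_ratio: float, total: int, seed: int = 0) -> List[T]:
    """Draw `total` samples, with `synthetic_ratio` of them from the synthetic pool."""
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must be between 0 and 1")
    n_synthetic = round(total * synthetic_ratio)
    n_real = total - n_synthetic
    if n_synthetic > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples in one of the pools")

    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = rng.sample(list(real), n_real) + rng.sample(list(synthetic), n_synthetic)
    rng.shuffle(mixed)         # avoid ordering the two sources in blocks
    return mixed
```

Calling `mix_by_ratio(real, synthetic, 0.13, 100)` yields 13 synthetic and 87 real samples, matching the 13% share reported for MONET; sweeping `synthetic_ratio` over values such as 0.0, 0.5, and 1.0 would reproduce the shape of the comparison, though not its results.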