Jasper AI Unveils MONET Dataset for Text-to-Image Research
Jasper AI released MONET, a dataset of 104.9 million high-quality images, to address data gaps in text-to-image model training.
Jasper AI announced the release of MONET, an open image-text dataset designed to advance text-to-image research. The dataset contains 104.9 million high-quality images, curated from a pool of 2.9 billion images through a multi-stage filtering process. MONET aims to give researchers what they need to train production-grade text-to-image models without the prohibitive costs associated with traditional methods. The release also includes nano-t2i, a minimal codebase that enables training a competitive diffusion model on a single GPU in a few days. Together, the dataset and codebase are intended to lower barriers to entry for academic researchers and smaller companies. The dataset is available under the Apache 2.0 license, which allows commercial use. *Source: [huggingface](https://huggingface.co/blog/jasperai/monet)*
MONET was created in response to the challenges of training text-to-image models, which require large volumes of high-quality image-text pairs. Existing datasets such as LAION-5B are large but messy, containing duplicates, low-quality images, and harmful content. More curated alternatives are either too small for pre-training or kept proprietary. MONET addresses these issues as the first openly released, filtered, deduplicated, and multi-captioned dataset built specifically for pre-training large text-to-image models. Its curation process involved six stages, including aesthetic pre-filtering, safety filtering, deduplication, and domain filtering, and reduced the initial 2.9 billion images to 104.9 million high-quality samples. Each image also comes with AI-generated captions from four different vision-language models, providing diverse and detailed descriptions. The content spans a wide range of human visual culture, from street scenes and wildlife to digital art and food, with a balanced distribution across relevant categories. *Source: [huggingface](https://huggingface.co/blog/jasperai/monet)*
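The staged filtering described above can be illustrated with a minimal Python sketch. It is not Jasper AI's pipeline: it covers only the four stages the article names (the other two are not specified here), and the `Sample` fields, the `min_score` threshold, the `ALLOWED_DOMAINS` set, and the reading of "domain" as a content category are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Sample:
    # All fields are hypothetical stand-ins for real image metadata.
    url: str
    aesthetic_score: float  # assumed output of some aesthetic scorer
    is_safe: bool           # assumed output of some safety classifier
    content_hash: str       # stand-in for a perceptual or embedding hash
    domain: str             # assumed content category label


# Hypothetical allow-list; the article does not list the real categories.
ALLOWED_DOMAINS = frozenset({"street", "wildlife", "digital_art", "food"})


def aesthetic_filter(samples: Iterable[Sample], min_score: float = 5.0) -> Iterator[Sample]:
    """Keep samples at or above an (assumed) aesthetic threshold."""
    return (s for s in samples if s.aesthetic_score >= min_score)


def safety_filter(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Drop samples flagged as unsafe."""
    return (s for s in samples if s.is_safe)


def deduplicate(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Keep the first sample seen for each content hash."""
    seen = set()
    for s in samples:
        if s.content_hash not in seen:
            seen.add(s.content_hash)
            yield s


def domain_filter(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Keep samples whose category is in the allow-list."""
    return (s for s in samples if s.domain in ALLOWED_DOMAINS)


def run_pipeline(samples: Iterable[Sample]) -> List[Sample]:
    """Apply the stages in order; each stage sees only what survived the last."""
    for stage in (aesthetic_filter, safety_filter, deduplicate, domain_filter):
        samples = stage(samples)
    return list(samples)


def retention_rate(kept: int, total: int) -> float:
    """Fraction of the original pool that survives filtering."""
    if total <= 0:
        raise ValueError("total must be positive")
    return kept / total
```

With the figures reported in the article, `retention_rate(104_900_000, 2_900_000_000)` works out to roughly 3.6%, meaning about one image in 28 survived curation. Chaining the stages as generators keeps memory use flat, which matters when the input pool is in the billions.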
MONET includes a mix of real and AI-generated images, with a synthetic ratio of 13%. The team ran experiments to determine the optimal proportion of synthetic data and found that a 50% mix produced the best FID score. Training on 100% synthetic data led to a significant drop in quality, an instance of the 'AI eating itself' problem. MONET's lower 13% synthetic ratio improves text-image alignment while avoiding the risks of synthetic data saturation. To validate the dataset, the team compared a 4-billion-parameter model trained on MONET against existing commercial and research models; it outperformed larger models such as DALL-E 3 and FLUX.1 Dev on the GenEval benchmark. This demonstrates that training on open data can produce competitive results. *Source: [huggingface](https://huggingface.co/blog/jasperai/monet)*
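To make the synthetic ratio concrete, the sketch below shows one simple way to split a dataset or a training batch into real and synthetic portions at a target ratio. It is only an illustration under stated assumptions: the function names `split_counts` and `mixed_batch` are hypothetical, and the article does not describe how Jasper AI actually mixes or samples the two sources.

```python
import random
from typing import List, Sequence, Tuple


def split_counts(total: int, synthetic_ratio: float) -> Tuple[int, int]:
    """Return (n_real, n_synthetic) for a given total and synthetic ratio."""
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must be between 0 and 1")
    n_synthetic = round(total * synthetic_ratio)
    return total - n_synthetic, n_synthetic


def mixed_batch(
    real: Sequence[str],
    synthetic: Sequence[str],
    batch_size: int,
    synthetic_ratio: float,
    rng: random.Random,
) -> List[str]:
    """Draw a shuffled batch with a fixed share of synthetic items.

    Samples without replacement, so each pool must hold at least as many
    items as the batch needs from it.
    """
    n_real, n_synthetic = split_counts(batch_size, synthetic_ratio)
    batch = rng.sample(list(real), n_real) + rng.sample(list(synthetic), n_synthetic)
    rng.shuffle(batch)
    return batch


if __name__ == "__main__":
    # Applied to the article's figures: 13% of 104.9 million images.
    n_real, n_synthetic = split_counts(104_900_000, 0.13)
    print(f"real: {n_real:,}  synthetic: {n_synthetic:,}")
```

At a 13% ratio, about 13.6 million of the 104.9 million images would be synthetic. Passing `synthetic_ratio=0.5` or `1.0` to the same functions reproduces the other mixes mentioned in the experiments.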
Key points
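- Jasper AI released MONET, an open image-text dataset of 104.9 million high-quality images and the first openly released, filtered, deduplicated, and multi-captioned dataset built for pre-training large text-to-image models.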
- MONET was curated from 2.9 billion images through six stages of filtering, reducing the pool to 104.9 million high-quality samples.
- The dataset includes AI-generated captions from four vision-language models, providing diverse and detailed descriptions for each image.
- MONET spans a wide range of human visual culture, ensuring a balanced distribution across relevant categories.
- The synthetic data ratio in MONET is 13%, optimized to improve text-image alignment without synthetic data saturation risks.
- Jasper AI's 4-billion-parameter model trained exclusively on MONET outperformed larger models like DALL-E 3 and FLUX.1 Dev on the GenEval benchmark.