Microsoft Research's Lens Shows Detailed Captions Outperform Raw Scale

Microsoft Research's Lens model, requiring one-fifth the compute of Z-Image, outperforms larger rivals on benchmarks.

Microsoft Research has introduced Lens, a text-to-image model that outperforms much larger rivals while needing significantly less compute. According to the technical report, Lens requires roughly one-fifth of the pre-training compute that models like Z-Image need. On several benchmarks it beats much larger models, including Hunyuan-Image-3.0, which has about 80 billion parameters compared with 3.8 billion for Lens. Its small size also keeps inference times short.
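To put the size gap in perspective, the short sketch below works out the parameter ratio from the two approximate figures quoted above. The constant and function names are illustrative only, and the result says nothing about compute or quality, just about parameter count.

```python
# Approximate parameter counts as quoted in the article.
HUNYUAN_IMAGE_3_PARAMS = 80e9
LENS_PARAMS = 3.8e9


def size_ratio(larger: float, smaller: float) -> float:
    """Return how many times bigger `larger` is than `smaller`."""
    return larger / smaller


ratio = size_ratio(HUNYUAN_IMAGE_3_PARAMS, LENS_PARAMS)
print(f"Hunyuan-Image-3.0 has roughly {ratio:.0f}x the parameters of Lens")
```

On these figures, Hunyuan-Image-3.0 is roughly 21 times the size of Lens.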

The efficiency gains are attributed to a combination of a compact model, detailed captions, and a training process that converges with fewer passes. Central to this is the Lens-800M dataset, which consists of 800 million image-text pairs with captions generated by GPT-4.1. At around 100 words on average, these captions are more detailed than the standard alt-text scraped from the web, which is often vague or incorrect and therefore weakens the learning signal. An ablation study confirmed that training on the long descriptions produces better generation quality than training on short or mixed captions.

Microsoft's research team also experimented with different resolutions and aspect ratios in each training batch. Although Lens was trained on a fixed set of image sizes, it generalizes to unseen formats and to resolutions of up to about two megapixels, which reduces the need for high-resolution training data. For the architecture, the team tested several variational autoencoder (VAE) variants and ultimately selected the semantic VAE from FLUX.2, which performed best and accelerated convergence. The text encoder is GPT-OSS, an openly available language model from OpenAI. Stronger language encoders allow the model to learn faster and to handle inputs in languages it was never trained on: Lens was trained only on English image-text pairs but accepts prompts in Chinese, French, Japanese, or Spanish.
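The article does not say how Lens assigns images to its fixed set of training sizes. One common approach in image-model training is aspect-ratio bucketing, sketched below as an illustration only. The size list, the pixel limit constant, and all function names are assumptions made for this example and are not taken from the technical report.

```python
import math
from collections import defaultdict

# Hypothetical (width, height) training sizes; the article does not list the real set.
TRAIN_SIZES = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]

# "About two megapixels", per the article; the exact limit is an assumption.
MAX_PIXELS = 2_000_000


def nearest_bucket(width, height, buckets=TRAIN_SIZES):
    """Return the bucket whose aspect ratio is closest, in log space, to the image's."""
    target = math.log(width / height)
    return min(buckets, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


def group_by_bucket(image_sizes, buckets=TRAIN_SIZES):
    """Group raw image sizes by their nearest bucket so each group can be resized together."""
    groups = defaultdict(list)
    for width, height in image_sizes:
        groups[nearest_bucket(width, height, buckets)].append((width, height))
    return dict(groups)


def within_reported_range(width, height, max_pixels=MAX_PIXELS):
    """Check whether a requested output size stays under the roughly two-megapixel range."""
    return width * height <= max_pixels


groups = group_by_bucket([(4000, 3000), (1080, 1920), (2000, 2000)])
for bucket, members in groups.items():
    print(bucket, members)
```

With these hypothetical sizes, a 4000x3000 photo lands in the (1152, 896) bucket and a 1080x1920 portrait in the (768, 1344) bucket. Comparing aspect ratios in log space treats landscape and portrait deviations symmetrically, which is why it is used here instead of a plain difference of ratios.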

Source: The Decoder

Key points

Microsoft Research's Lens model requires roughly one-fifth the compute that comparable models like Z-Image need for pre-training.
Lens outperforms much larger models like Hunyuan-Image-3.0 across several benchmarks.
The Lens-800M dataset includes 800 million image-text pairs with captions generated by GPT-4.1.
Training with detailed captions produces higher generation quality than short or mixed captions.
The model generalizes to unseen formats and resolutions up to about two megapixels.
The semantic VAE from FLUX.2 performed best in architecture testing and accelerated convergence.
Lens accepts prompts in Chinese, French, Japanese, or Spanish despite being trained only on English image-text pairs.

WRITTEN BY

Maya Chen

AI Research & Breakthroughs

Maya breaks down the latest AI research papers, benchmarks, and technical breakthroughs into plain language.
