Microsoft Research has introduced Lens, a text-to-image model that achieves high performance with significantly less compute than comparable models. According to the technical report, Lens requires roughly one-fifth the compute that models like Z-Image need for pre-training. With just 3.8 billion parameters, it outperforms much larger models on several benchmarks, among them Hunyuan-Image-3.0, which has about 80 billion parameters. Its small size and short inference times also make it more efficient to run than larger alternatives.

The efficiency gains are attributed to a combination of a compact model, detailed captions, and a training process that converges with fewer passes. The Lens-800M dataset, which consists of 800 million image-text pairs with captions generated by GPT-4.1, plays a central role. At around 100 words on average, these captions are far more detailed than standard alt-text scraped from the web, which is often vague or incorrect and therefore weakens the learning signal. An ablation study confirmed that training on the long descriptions yields better generation quality than training on short or mixed captions.

Microsoft's research team also experimented with mixing different resolutions and aspect ratios in each training batch. Although it was trained on a fixed set of image sizes, the model generalizes to unseen formats and to resolutions of up to about two megapixels, which reduces the need for high-resolution training data. For the architecture, the team tested several variational autoencoder (VAE) variants and ultimately selected the semantic VAE from FLUX.2, which performed best and accelerated convergence. The text encoder is GPT-OSS, an openly available language model from OpenAI. According to the team, stronger language encoders allow the model to learn faster and to handle inputs in languages it was never trained on: Lens was trained only on English image-text pairs but accepts prompts in Chinese, French, Japanese, and Spanish.
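To illustrate what training on a fixed set of image sizes can look like in practice, here is a minimal Python sketch of aspect-ratio bucketing, a common technique in which each image is assigned to the predefined size whose aspect ratio is closest to its own. The article does not say that Lens uses this exact scheme; the bucket list and the function names `nearest_bucket` and `group_by_bucket` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical (width, height) buckets, each around one megapixel.
# The article does not list the sizes Lens was actually trained on.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest to the image's.

    Ratios are compared in log space so that landscape and portrait
    deviations of the same magnitude are treated symmetrically.
    """
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


def group_by_bucket(sizes, buckets=BUCKETS):
    """Group image indices by assigned bucket.

    Every image in a group can then be resized and cropped to the same shape.
    """
    groups = {}
    for index, (width, height) in enumerate(sizes):
        groups.setdefault(nearest_bucket(width, height, buckets), []).append(index)
    return groups
```

Calling `group_by_bucket` on a list of `(width, height)` tuples returns a dictionary that maps each bucket to the indices of the images assigned to it, so a data loader could draw same-shaped images from one group at a time. How Lens actually combines sizes within a batch is not described in the article.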

Source: thedecoder