Model Release

Microsoft's Mirage Model Enhances Video Generation with Spatial Memory

Microsoft Research's Mirage model generates videos up to 10.57x faster and uses up to 55x less memory than comparable systems, according to The Decoder.

Microsoft Research has introduced Mirage, a new video world model that improves spatial consistency during long camera movements. The model stores internal image features in a spatial memory within its latent space, avoiding the need for costly pixel-based 3D point clouds. This approach allows for faster generation and stable spatial structures, even when the camera moves extensively. Mirage was developed in collaboration with several universities and builds on Alibaba's open-source video model Wan2.2, with additional modules and fine-tuning using LoRA adapters.
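To make the idea of a spatial memory for latent features more concrete, here is a minimal sketch. It is not Mirage's implementation: the article does not describe the data structure, so the class `LatentSpatialMemory`, the grid-cell layout, and the averaging rule are all hypothetical choices for illustration. The sketch only shows the general pattern of storing compact feature vectors by 3D location instead of storing colored points.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[int, int, int]


class LatentSpatialMemory:
    """Toy memory (hypothetical, not Mirage's design).

    Feature vectors are stored per quantized 3D grid cell, and features
    written to the same cell are averaged.
    """

    def __init__(self, cell_size: float = 0.5) -> None:
        if cell_size <= 0:
            raise ValueError("cell_size must be positive")
        self.cell_size = cell_size
        self._sums: Dict[Cell, List[float]] = {}
        self._counts: Dict[Cell, int] = {}

    def _cell(self, point: Sequence[float]) -> Cell:
        # Quantize a 3D position to the index of the grid cell containing it.
        x, y, z = (int(math.floor(c / self.cell_size)) for c in point)
        return (x, y, z)

    def write(self, point: Sequence[float], feature: Sequence[float]) -> None:
        # Accumulate the feature vector in the cell that contains `point`.
        key = self._cell(point)
        if key not in self._sums:
            self._sums[key] = [0.0] * len(feature)
            self._counts[key] = 0
        if len(feature) != len(self._sums[key]):
            raise ValueError("feature length does not match stored features")
        for i, value in enumerate(feature):
            self._sums[key][i] += value
        self._counts[key] += 1

    def read(self, point: Sequence[float]) -> Optional[List[float]]:
        # Return the averaged feature for the cell, or None if never observed.
        key = self._cell(point)
        if key not in self._sums:
            return None
        count = self._counts[key]
        return [total / count for total in self._sums[key]]
```

In this sketch, a later frame that looks at a previously seen location would call `read` with that location and get back the stored feature, while a location that was never observed returns `None`. How Mirage actually indexes, updates, and reads its memory is not covered by the article.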

Mirage outperforms color-based rivals on the WorldScore benchmark: it surpasses Spatia and leaves general video generators such as Wan2.1 and CogVideoX behind. It is particularly good at maintaining a scene's spatial structure and keeping surfaces consistent across frames. In closed-loop tests on the RealEstate10K dataset, where the camera circles back to its starting point, Mirage leads on two of three metrics; such tests are a demanding check of accuracy over time. Efficiency is the model's strongest feature. Compute cost per frame remains stable after the first segment, which results in up to 10.57x faster generation and up to 55x less memory usage than color-based systems.
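A closed-loop test can be pictured as comparing the frame at the start of a camera path with the frame generated when the camera returns to the same pose. The sketch below illustrates that idea under stated assumptions: the article does not name the three metrics used, so PSNR is chosen here purely as a familiar example, and `closed_loop_psnr` is a hypothetical helper rather than part of any benchmark toolkit. Frames are represented as flat lists of pixel values in the range 0 to 1.

```python
import math
from typing import Sequence


def psnr(a: Sequence[float], b: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and of equal length")
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def closed_loop_psnr(frames: Sequence[Sequence[float]]) -> float:
    """Compare the first frame with the last one.

    Assumes the camera path ends at the pose where it started, so a
    higher value means less drift over the loop.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    return psnr(frames[0], frames[-1])
```

A model that drifts over a long camera path would produce a final frame that differs from the first one, which lowers the score in this toy setup.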

The researchers acknowledge limitations. Moving objects are excluded from long-term memory because their geometry is unreliable, and busy scenes benefit less from spatial memory than quiet interiors. The team names the storage of dynamic content as the next challenge to address.

Video world models are a growing area of AI research, with companies like Google exploring interactive environments and world models for real-time simulations.

Source: The Decoder

Key points

Microsoft Research's Mirage model generates videos up to 10.57x faster than comparable systems.
Mirage uses up to 55x less memory than color-based video generation models.
Mirage stores internal image features in a spatial memory within its latent space, avoiding pixel-based 3D point clouds.
Mirage outperforms color-based rivals on the WorldScore benchmark, surpassing Spatia and leaving general video generators like Wan2.1 and CogVideoX behind.
Mirage leads two of three metrics on the RealEstate10K dataset in closed-loop tests.
Mirage's compute cost per frame remains stable after the first segment.
Moving objects are excluded from long-term memory due to unreliable geometry.

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.