Model Release
Lucidml's Supercut Model Generates Playable Games from Images
Lucidml's Supercut model turns an image into a playable game on a consumer GPU, and none of the starting images shown appeared in its training data.
Image: Hugging Face
Lucidml's Supercut model generates playable games from images on consumer-grade hardware. The model takes an image and real-time keyboard input, then generates future frames autoregressively. Every clip in the video comes from a real play session streamed from an RTX 5090 machine. None of the initial images came from the training dataset; they were sourced from Google Image Search.

The model is built on the lucidml 420M image DiT model, with added temporal mixing modules trained on video and gameplay data to simulate dynamics over time. As a result, the core denoiser was trained from scratch rather than derived from an existing video model. Development ran on a compute budget significantly smaller than what frontier labs typically have access to.

The creator noted that the current model represents only about 1% of its potential: a newer 800M model that improves motion quality, diversity, and long-context behavior is finishing training, and quantisation has not yet been implemented.

The video is a supercut of the creator's favorite rollouts, with the red car rollout sped up 2×. The creator also mentioned using HandBrake to reduce the file size and hoped this did not affect video quality.

*Source: [huggingface](https://huggingface.co/blog/lucidml/516supercut)*
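The loop of image in, keypress in, next frame out can be sketched in a few lines. This is a minimal illustration of autoregressive rollout in general, not the creator's code: `rollout`, `predict_next`, `context_len`, and the toy predictor are hypothetical names, and a frame is reduced to a flat list of numbers so the example runs without a GPU or any model weights.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Sequence

# Hypothetical stand-in for an image tensor: a flat list of pixel values.
Frame = List[float]


def rollout(
    first_frame: Frame,
    actions: Iterable[str],
    predict_next: Callable[[Sequence[Frame], str], Frame],
    context_len: int = 8,  # assumed window size; the source gives no figure
) -> List[Frame]:
    """Generate one frame per action, feeding each output back in as context."""
    # A bounded deque drops the oldest frame once the window is full.
    context: Deque[Frame] = deque([first_frame], maxlen=context_len)
    frames = [first_frame]
    for action in actions:
        # The real system would run its denoiser here; we only call a function.
        next_frame = predict_next(list(context), action)
        context.append(next_frame)
        frames.append(next_frame)
    return frames


# Toy predictor (an assumption for illustration): each key shifts every value.
SHIFT = {"left": -1.0, "right": 1.0, "none": 0.0}


def toy_predict_next(context: Sequence[Frame], action: str) -> Frame:
    last = context[-1]
    return [value + SHIFT[action] for value in last]


frames = rollout([0.0, 0.0], ["right", "right", "left", "none"], toy_predict_next)
print(len(frames))  # 5: the starting image plus one frame per keypress
print(frames[-1])   # [1.0, 1.0]
```

The key property is that every generated frame becomes input for the next step, so errors can accumulate over a long session. That is one reason long-context behavior is a natural thing for a larger follow-up model to target.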
Key points
- Lucidml's Supercut model generates playable games from images using consumer GPUs.
- The model takes in an image and real-time keyboard input, then generates future frames autoregressively.
- Every clip in the video is from real play sessions streamed from an RTX 5090 machine.
- None of the initial images came from the training dataset; they were sourced from Google Image Search.
- The model is built upon the lucidml 420M image DiT model with added temporal mixing modules.
- The core denoiser is trained from scratch, not derived from an existing video model.
- The development was conducted under a compute budget significantly smaller than what frontier labs typically have access to.
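The source does not say what form the temporal mixing modules take. One common way to let an image model share information across frames is causal attention along the time axis, where each frame's features are blended with those of earlier frames only. The sketch below shows that idea under stated assumptions: `causal_temporal_mix` is a hypothetical name, there are no learned projections (queries, keys, and values are the raw features), and it handles a single spatial location.

```python
import math
from typing import List

Vector = List[float]


def causal_temporal_mix(tokens: List[Vector]) -> List[Vector]:
    """Blend one patch's features across time using causal self-attention.

    tokens[t] is the feature vector of the same image patch at frame t.
    Frame t may attend to frames 0..t but never to later frames.
    Assumption: identity query/key/value projections, purely for illustration.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed: List[Vector] = []
    for t, query in enumerate(tokens):
        past = tokens[: t + 1]  # causal mask: drop all future frames
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in past]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        mixed.append(
            [sum(w * v[i] for w, v in zip(weights, past)) / total for i in range(dim)]
        )
    return mixed


mixed = causal_temporal_mix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(mixed[0])  # [1.0, 0.0]: the first frame has nothing earlier to mix with
```

Causality is what makes real-time play possible in a design like this: a frame never depends on frames that have not been generated yet, so the model can emit output as each keypress arrives.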