ByteDance has introduced iLLaDA, an 8B-parameter diffusion language model whose base version performs comparably to Qwen2.5. The model, developed in collaboration with researchers from Renmin University, uses a diffusion approach instead of the autoregressive, token-by-token generation used by the models behind ChatGPT. According to the research team, iLLaDA matches Qwen2.5 on general tasks but falls behind in fine-tuned scenarios.

The diffusion method used by iLLaDA starts with a sequence of placeholders and refines it over multiple passes, similar to how image models generate images from noise. Because every position in the sequence can attend to all others at once, the process is bidirectional. iLLaDA was trained on 12 trillion tokens, up from the 2.3 trillion used for its predecessor, LLaDA. Performance improved sharply: the model scores 63.9 points on the reasoning benchmark BBH, edging out Qwen2.5's 63.3 points.
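To make the placeholder-and-refine idea concrete, here is a minimal toy sketch in Python. It is not iLLaDA's code: the `toy_predict` function is a hypothetical stand-in for a real network, the fixed target sentence is invented for illustration, and the rule of revealing the most confident positions first is an assumption about how such decoders are commonly scheduled, not something the article confirms.

```python
MASK = "<mask>"

# Invented target sentence; a real model would predict tokens from learned weights.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]


def toy_predict(seq):
    """Hypothetical stand-in for a model.

    Returns {position: (token, confidence)} for every masked position.
    Confidence rises when the left or right neighbor is already revealed,
    which mimics a model that uses context from both directions.
    """
    preds = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        left = i > 0 and seq[i - 1] != MASK
        right = i < len(seq) - 1 and seq[i + 1] != MASK
        # The small 0.01 * i term only breaks ties deterministically.
        conf = 0.5 + 0.2 * left + 0.2 * right + 0.01 * i
        preds[i] = (TARGET[i % len(TARGET)], conf)
    return preds


def denoise(length, steps, predict=toy_predict):
    """Start fully masked and reveal tokens over at most `steps` passes."""
    seq = [MASK] * length
    history = [list(seq)]
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        preds = predict(seq)
        # Spread the remaining masked positions evenly over the remaining passes.
        remaining = steps - step
        k = -(-len(masked) // remaining)  # ceiling division
        # Commit the k most confident predictions; the rest stay masked.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = preds[i][0]
        history.append(list(seq))
    return seq, history


if __name__ == "__main__":
    final, history = denoise(length=6, steps=3)
    for state in history:
        print(" ".join(state))
```

Each pass looks at the whole sequence, so a position can be filled in using tokens on either side of it. An autoregressive decoder, by contrast, would fill the positions strictly left to right.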

The research team notes that while iLLaDA shows promise, it still lags behind Qwen2.5 in some areas, particularly on coding benchmarks. The authors attribute the gap to iLLaDA's lack of reinforcement learning alignment, a step that Qwen2.5 includes. They also mention that the model can get stuck in reasoning loops on complex tasks. The study feeds into the ongoing debate over whether diffusion models can match the performance of autoregressive models.

Source: thedecoder