AMD Introduces Primus-Turbo for Dropless MoE Training in JAX

AMD's Primus-Turbo enables dropless MoE training in JAX by addressing memory and performance challenges in large-scale models.

AMD has introduced Primus-Turbo for JAX, a set of kernels that enables dropless Mixture-of-Experts (MoE) training on GPUs. In dropless training, every token reaches the experts its router selected instead of being discarded when an expert is over capacity, which removes one source of lost fidelity in large language model training. Primus-Turbo contributes two key components: a grouped GEMM for variable-length expert matmuls and a DeepEP dispatch/combine mechanism for token-aware expert-parallel routing. Both are integrated into JAX through the XLA FFI, which AMD says preserves autodiff, sharding, and numerical fidelity. According to AMD, the result is lower memory use and faster training, making dropless MoE practical for large-scale models.

The Mixture-of-Experts (MoE) approach replaces a transformer block's single feed-forward network (FFN) with multiple FFNs and a router that assigns each token to a subset of experts. This increases model capacity without a proportional increase in compute per token. Because routing is data-dependent, however, the number of tokens each expert receives changes from step to step, which makes it hard to keep array shapes uniform and memory use under control. The dense_matmul path, a common default in MaxText, uses a capacity factor to give each expert a fixed-size buffer and drops any tokens that overflow it. This is efficient but costs some fidelity. The sparse_matmul path is dropless: all tokens reach their assigned experts, at the price of significant memory and performance overhead from dynamic routing and variable-length groups.
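
To make the capacity-factor trade-off concrete, here is a minimal JAX sketch. It is not MaxText's implementation: the function name count_dropped_tokens, the top-1 routing, and the capacity rule (an even share of tokens scaled by the capacity factor) are all assumptions chosen for illustration.

```python
import math
import jax.numpy as jnp

def count_dropped_tokens(expert_ids, num_experts, capacity_factor):
    """Return (capacity, dropped) for top-1 routing with a per-expert capacity.

    Hypothetical helper for illustration; the capacity rule is an assumption.
    """
    num_tokens = expert_ids.shape[0]
    # Assumed rule: each expert gets an even share of tokens times the factor.
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    # One-hot routing matrix of shape (num_tokens, num_experts).
    one_hot = jnp.eye(num_experts, dtype=jnp.int32)[expert_ids]
    # 0-based position of each token in its expert's queue, in arrival order.
    position = (jnp.cumsum(one_hot, axis=0) * one_hot - one_hot).sum(axis=1)
    kept = position < capacity
    return capacity, int(num_tokens - kept.sum())

# Skewed routing: expert 0 receives five of the eight tokens.
expert_ids = jnp.array([0, 0, 0, 0, 0, 1, 2, 3])
capacity, dropped = count_dropped_tokens(expert_ids, num_experts=4, capacity_factor=1.0)
print(capacity, dropped)  # capacity is 2, so three of expert 0's tokens are dropped
```

Raising capacity_factor in this sketch reduces the number of dropped tokens but enlarges every expert's fixed buffer, which is the memory-versus-fidelity tension the dense path has to live with.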

Neither baseline comes for free. The fixed-shape per-expert buffers used by dense_matmul can inflate memory on top of the accuracy lost to dropped tokens. The dropless sparse_matmul path runs into memory limits of its own: dynamic routing and variable-length groups lead to higher memory usage and smaller batch sizes. Primus-Turbo targets the dropless path with a grouped GEMM kernel that reduces memory usage and a DeepEP mechanism that optimizes routing. AMD says these changes make dropless MoE training feasible and efficient for large-scale models.
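
As an illustration of what a variable-length expert matmul looks like in JAX, the sketch below uses jax.lax.ragged_dot, JAX's stock grouped matmul. It does not call Primus-Turbo, whose API is not shown in this article; the helper name dropless_expert_matmul and the top-1 routing are assumptions made for the example.

```python
import jax
import jax.numpy as jnp

def dropless_expert_matmul(tokens, expert_ids, expert_weights):
    """Apply each token's assigned expert matrix without dropping any token.

    Hypothetical helper: top-1 routing, one weight matrix per expert.
    """
    num_experts = expert_weights.shape[0]
    # Sort tokens so rows that belong to the same expert are contiguous.
    order = jnp.argsort(expert_ids)
    sorted_tokens = tokens[order]
    # Variable-length group sizes: how many tokens each expert received.
    group_sizes = jnp.bincount(expert_ids, length=num_experts).astype(jnp.int32)
    # Grouped matmul: rows in group g are multiplied by expert_weights[g].
    sorted_out = jax.lax.ragged_dot(sorted_tokens, expert_weights, group_sizes)
    # Undo the sort so outputs line up with the original token order.
    return jnp.zeros_like(sorted_out).at[order].set(sorted_out)

key_tokens, key_weights = jax.random.split(jax.random.PRNGKey(0))
tokens = jax.random.normal(key_tokens, (8, 4))              # 8 tokens, width 4
expert_weights = jax.random.normal(key_weights, (4, 4, 3))  # 4 experts
expert_ids = jnp.array([2, 0, 2, 1, 0, 2, 2, 0])            # expert 3 gets none

out = dropless_expert_matmul(tokens, expert_ids, expert_weights)
# Reference: gather each token's weight matrix and multiply row by row.
reference = jnp.einsum("td,tdf->tf", tokens, expert_weights[expert_ids])
```

No token is discarded here regardless of how uneven the routing is; the cost is that group sizes are only known at run time, which is where the memory and performance pressure on the dropless path comes from.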

Key points

AMD’s Primus-Turbo brings two Composable Kernel-backed primitives to JAX for dropless MoE training.
The dense_matmul path uses a capacity factor and drops tokens when an expert exceeds its capacity.
The sparse_matmul path ensures all tokens reach their assigned experts but faces memory constraints due to dynamic routing and variable-length groups.
Primus-Turbo introduces a grouped GEMM kernel to reduce memory usage and a DeepEP mechanism to optimize the routing process.
The dropless path hits memory walls in its ragged_dot expert matmul and ragged_all_to_all shuffle.
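
The dispatch/combine pattern mentioned above can be sketched on a single device. This is only a conceptual illustration: real expert-parallel dispatch moves tokens between devices, and nothing here reflects the DeepEP or Primus-Turbo interfaces. The function names dispatch and combine, the top-2 routing, and the gate values are assumptions.

```python
import jax.numpy as jnp

def dispatch(tokens, topk_ids, num_experts):
    """Replicate each token once per selected expert and group rows by expert."""
    num_tokens, k = topk_ids.shape
    flat_expert = topk_ids.reshape(-1)                   # expert of each copy
    flat_token = jnp.repeat(jnp.arange(num_tokens), k)   # source token of each copy
    order = jnp.argsort(flat_expert)                     # expert-major ordering
    group_sizes = jnp.bincount(flat_expert, length=num_experts)
    return tokens[flat_token[order]], group_sizes, order

def combine(expert_out, topk_gates, order):
    """Scatter expert outputs back to token order and mix them with gate weights."""
    num_tokens, k = topk_gates.shape
    unsorted = jnp.zeros_like(expert_out).at[order].set(expert_out)
    per_choice = unsorted.reshape(num_tokens, k, -1)     # (token, choice, width)
    return jnp.einsum("tk,tkd->td", topk_gates, per_choice)

tokens = jnp.arange(12, dtype=jnp.float32).reshape(4, 3)
topk_ids = jnp.array([[0, 2], [1, 2], [0, 1], [2, 3]])   # two experts per token
topk_gates = jnp.array([[0.75, 0.25], [0.5, 0.5], [0.25, 0.75], [0.5, 0.5]])

dispatched, group_sizes, order = dispatch(tokens, topk_ids, num_experts=4)
# Identity "experts" stand in for the expert FFNs, so combine should
# reconstruct the original tokens because each row of gates sums to 1.
combined = combine(dispatched, topk_gates, order)
```

In a real expert-parallel setup the dispatched rows would be exchanged across devices before the expert computation and exchanged back before combine; the group sizes computed at dispatch time are what make that exchange variable-length.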

Source: AMD

WRITTEN BY

Theo Almeida
