AMD has introduced Primus-Turbo, a new tool for JAX that enables dropless Mixture-of-Experts (MoE) training on AMD GPUs. In dropless training every token is processed by the experts its router selected, so large language models train without the fidelity loss caused by token drops. The tool has two key components: a grouped GEMM that handles variable-length expert matmuls, and a DeepEP dispatch/combine mechanism for token-aware expert-parallel routing. Both are integrated into JAX through the XLA FFI, which preserves autodiff, sharding, and numerical fidelity. According to AMD, the result is lower memory use and faster training, which makes dropless MoE training practical for large-scale models. Source: AMD
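To make "variable-length expert matmuls" concrete, here is a minimal reference sketch in plain JAX of what a grouped GEMM computes. It is not the Primus-Turbo kernel or its API: the function name `grouped_matmul_reference` and the example shapes are assumptions chosen for illustration. The sketch gathers one weight matrix per row, which is exactly the memory overhead an optimized kernel is meant to avoid.

```python
import jax.numpy as jnp


def grouped_matmul_reference(x, w, group_sizes):
    """Reference grouped GEMM (illustrative only, not an optimized kernel).

    x: (m, k) token rows, sorted so rows for the same expert are contiguous.
    w: (g, k, n) one weight matrix per expert.
    group_sizes: (g,) integer array of rows per expert; must sum to m.
    """
    m = x.shape[0]
    g = w.shape[0]
    # Expert index of every row, e.g. group_sizes [2, 0, 3] -> [0, 0, 2, 2, 2].
    expert_of_row = jnp.repeat(jnp.arange(g), group_sizes, total_repeat_length=m)
    # Gather one weight matrix per row, then multiply each row by its matrix.
    return jnp.einsum("mk,mkn->mn", x, w[expert_of_row])


# Hypothetical example: 5 tokens, 3 experts, expert 1 receives no tokens.
x = jnp.arange(10.0).reshape(5, 2)
w = jnp.stack([jnp.eye(2), 2.0 * jnp.eye(2), 3.0 * jnp.eye(2)])
group_sizes = jnp.array([2, 0, 3])
out = grouped_matmul_reference(x, w, group_sizes)
```

In this example the first two rows pass through expert 0 unchanged and the last three rows are scaled by 3 by expert 2, even though the groups have different lengths and one is empty.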

The Mixture-of-Experts (MoE) approach replaces a transformer block's single feed-forward network (FFN) with multiple FFNs, called experts, and a router that assigns each token to a subset of them. This increases model capacity without a proportional increase in compute per token. However, routing is dynamic: the number of tokens each expert receives changes from step to step, which makes it hard to keep tensor shapes uniform and to manage memory efficiently. The dense_matmul approach, a common default in MaxText, handles this with a capacity factor that caps how many tokens each expert accepts and drops any tokens beyond that cap. This is efficient but costs some fidelity. In contrast, the sparse_matmul approach ensures that all tokens reach their assigned experts, but dynamic routing and variable-length groups create significant memory and performance challenges. Source: AMD
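The capacity-based dropping described above can be sketched as follows. This is a simplified top-1 illustration, not MaxText's implementation: the helper name `capacity_keep_mask` and the capacity formula `ceil(capacity_factor * num_tokens / num_experts)` are assumptions based on a common convention.

```python
import math

import jax
import jax.numpy as jnp


def capacity_keep_mask(expert_ids, num_experts, capacity_factor):
    """Return (keep_mask, capacity) for top-1 routing with a per-expert cap."""
    num_tokens = expert_ids.shape[0]
    # Assumed convention: every expert gets the same fixed number of slots.
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    one_hot = jax.nn.one_hot(expert_ids, num_experts, dtype=jnp.int32)  # (T, E)
    # 0-based arrival position of each token within its expert's queue.
    position = jnp.sum((jnp.cumsum(one_hot, axis=0) - 1) * one_hot, axis=1)
    # Tokens that arrive after the expert is full are dropped.
    return position < capacity, capacity


# Hypothetical routing: expert 0 is oversubscribed, expert 3 is idle.
expert_ids = jnp.array([0, 0, 0, 1, 0, 2, 0, 1])
keep, capacity = capacity_keep_mask(expert_ids, num_experts=4, capacity_factor=1.0)
num_dropped = int((~keep).sum())
```

With 8 tokens, 4 experts, and a capacity factor of 1.0, each expert has 2 slots. Expert 0 receives 5 tokens, so 3 of them are dropped while expert 3's slots sit empty, which shows both the fidelity loss and the wasted buffer space.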

The trade-off between the two methods comes down to memory and fidelity. The dense_matmul method allocates a fixed-shape buffer for each expert: underloaded experts leave padded slots unused, which inflates memory, and overloaded experts drop tokens, which reduces accuracy. The dropless sparse_matmul method avoids both problems in principle, but its dynamic routing and variable-length groups lead to higher memory usage, which in turn forces smaller batch sizes. Primus-Turbo addresses these challenges with a grouped GEMM kernel that reduces memory usage and a DeepEP mechanism that optimizes the routing process. Together, these improvements make dropless MoE training more feasible and efficient for large-scale models. Source: AMD
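For contrast with fixed-shape buffers, the sketch below shows the basic idea of a dropless dispatch and combine on a single device: tokens are sorted by expert, per-expert group sizes are counted, and the sort is undone afterwards. It does not model DeepEP's cross-device communication or the Primus-Turbo API; the function names `dropless_dispatch` and `combine` are hypothetical.

```python
import jax.numpy as jnp


def dropless_dispatch(tokens, expert_ids, num_experts):
    """Sort tokens by expert so each expert's rows are contiguous."""
    order = jnp.argsort(expert_ids)  # permutation that groups rows by expert
    group_sizes = jnp.bincount(expert_ids, length=num_experts)
    return tokens[order], group_sizes, order


def combine(expert_out, order):
    """Undo the sort so row i corresponds to input token i again."""
    inverse = jnp.argsort(order)
    return expert_out[inverse]


# Hypothetical routing: expert 2 is heavily loaded, expert 3 receives nothing.
tokens = jnp.arange(12.0).reshape(6, 2)
expert_ids = jnp.array([2, 0, 2, 2, 1, 2])
sorted_tokens, group_sizes, order = dropless_dispatch(tokens, expert_ids, num_experts=4)
# An expert computation (for example a grouped GEMM) would run on sorted_tokens here.
restored = combine(sorted_tokens, order)
```

The group sizes always sum to the number of routed tokens, so no token is dropped and no padding is allocated. The cost is that `group_sizes` varies from step to step, which is why the expert computation needs a kernel that accepts variable-length groups.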