AMD has introduced workload pre-emption in its Resource Manager, a cluster management application designed to help AI development teams manage compute resources efficiently. The feature allows administrators to monitor GPU utilization across workloads and terminate idle ones to free up resources for other tasks. This helps ensure that prioritized workloads can access necessary capacity when needed, improving overall GPU utilization.
Workload pre-emption is a project-level feature that monitors GPU activity and terminates a workload once its activity has stayed below a configured threshold for a set duration. Administrators can choose between two pre-emption policies: 'During GPU pressure,' which terminates idle workloads only when other workloads are waiting for resources, and 'Always,' which reclaims idle GPUs regardless of cluster demand. Administrators also set the GPU activity threshold and an idle timer, which together control how long a workload can remain idle before it is terminated.
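The decision described above can be illustrated with a minimal Python sketch. It is not AMD Resource Manager code or its API: the names `Policy`, `PreemptionConfig`, and `should_preempt`, the percentage-based threshold, and the fixed sampling interval are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    # Policy names taken from the article; the enum itself is hypothetical.
    DURING_GPU_PRESSURE = "During GPU pressure"
    ALWAYS = "Always"


@dataclass
class PreemptionConfig:
    policy: Policy
    activity_threshold_pct: float  # activity below this counts as idle (assumed unit: percent)
    idle_timeout_s: float          # how long a workload may stay idle (assumed unit: seconds)


def should_preempt(config, activity_samples, sample_interval_s, workloads_waiting):
    """Return True if a workload should be terminated.

    activity_samples: GPU activity readings in percent, oldest first,
    assumed to be taken every sample_interval_s seconds.
    workloads_waiting: number of workloads queued for GPUs.
    """
    # Count the most recent consecutive samples that are below the threshold.
    idle_samples = 0
    for activity in reversed(activity_samples):
        if activity >= config.activity_threshold_pct:
            break
        idle_samples += 1

    # The workload must have been idle for at least the configured duration.
    if idle_samples * sample_interval_s < config.idle_timeout_s:
        return False

    # 'Always' reclaims idle GPUs regardless of demand;
    # 'During GPU pressure' does so only when other workloads are waiting.
    if config.policy is Policy.ALWAYS:
        return True
    return workloads_waiting > 0
```

Under these assumptions, a workload whose last four five-minute samples are all below a 5% threshold has been idle for 20 minutes. With a 10-minute idle timer it is terminated under 'Always', but under 'During GPU pressure' only if at least one other workload is waiting.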
The AMD Resource Manager was validated on a cluster powered by AMD Instinct™ MI300X GPUs with more than 1 TB of storage. The tool is part of AMD's broader software ecosystem aimed at optimizing AI development and GPU resource management.
Source: amd