Cohere has introduced a system called Serving Fairness to schedule inference requests fairly across tenants on its multi-tenant SaaS platform. It ensures that each tenant's share of inference capacity is decided by the scheduler, not by how aggressively that tenant submits requests. The system maintains priority and deadline ordering within each tenant while preserving batching efficiency. The approach combines four distinct mechanisms that are applied in a fixed order.

The system begins with a Rate Limiter that controls admission to the scheduling queue, capping the maximum number of inference requests a tenant can submit within a given timeframe. These limits are configured at the endpoint level and vary based on the resource consumption of different models. Cohere also employs real-time throttling checks to reject requests that could not be served within their latency targets, protecting the system from overload and ensuring predictable latencies. After admission, requests proceed through a series of selectors that determine their order of processing.
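The admission cap described above can be illustrated with a minimal sketch. It assumes a sliding-window counter keyed by tenant and endpoint. The class name `SlidingWindowLimiter`, the `limits` mapping, and the window length are hypothetical and are not taken from Cohere's implementation, which may use a different algorithm (for example, a token bucket).

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical per-tenant, per-endpoint admission cap (illustrative only)."""

    def __init__(self, limits, window_s):
        # limits: endpoint name -> max admitted requests per window (assumed shape)
        self.limits = limits
        self.window_s = window_s
        # (tenant, endpoint) -> timestamps of requests admitted in the current window
        self._admitted = defaultdict(deque)

    def admit(self, tenant, endpoint, now):
        """Return True if the request may enter the scheduling queue at time `now`."""
        stamps = self._admitted[(tenant, endpoint)]
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        if len(stamps) >= self.limits[endpoint]:
            return False  # tenant is over its cap for this endpoint
        stamps.append(now)
        return True


# Usage: a heavier model could be given a lower cap than a lighter one.
limiter = SlidingWindowLimiter({"chat": 2, "embed": 100}, window_s=1.0)
decisions = [limiter.admit("tenant-a", "chat", t) for t in (0.0, 0.1, 0.2)]
```

Passing the clock in as `now` keeps the sketch deterministic. The real-time throttling check on latency targets is a separate step and is not modeled here.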

The core of the system is the Deficit Round Robin (DRR) algorithm, which ensures equitable distribution of compute resources across the fleet within a tenant tier. Each tenant receives its own line in the scheduling queue, and the scheduler takes turns between tenants to ensure fair access. This system balances the need for efficient batching with the requirement to prevent any single tenant from monopolizing GPU resources. The fairness of the system depends on two key variables: the quantum, which represents the budget a tenant is granted per round, and the cost, which measures how much budget each request consumes.
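To make the roles of the quantum and the cost concrete, here is a minimal sketch of textbook Deficit Round Robin. It is not Cohere's scheduler: the function name `drr_schedule`, the representation of a request as a bare number equal to its cost, and the quantum value are all assumptions chosen for illustration, and priority ordering, deadlines, and batching are left out.

```python
from collections import deque


def drr_schedule(queues, quantum, cost):
    """Drain per-tenant queues in Deficit Round Robin order.

    queues:  tenant -> list of requests in that tenant's own order
    quantum: budget added to a tenant's deficit each time it gets a turn
    cost:    function mapping a request to the budget it consumes
    """
    assert quantum > 0, "a positive quantum is needed for progress"
    pending = {tenant: deque(reqs) for tenant, reqs in queues.items()}
    deficit = {tenant: 0 for tenant in pending}
    active = deque(tenant for tenant in pending if pending[tenant])
    order = []
    while active:
        tenant = active.popleft()
        deficit[tenant] += quantum
        queue = pending[tenant]
        # Serve requests while the head of the queue fits in the budget.
        while queue and cost(queue[0]) <= deficit[tenant]:
            request = queue.popleft()
            deficit[tenant] -= cost(request)
            order.append((tenant, request))
        if queue:
            active.append(tenant)  # unused budget carries over to the next round
        else:
            deficit[tenant] = 0    # an idle tenant does not bank budget
    return order


# Usage: each request is represented by its (hypothetical) cost.
order = drr_schedule({"a": [3, 3, 3, 3], "b": [5], "c": [2, 2]},
                     quantum=4, cost=lambda request: request)
```

In this run tenant `a` submits the most requests but is served only as fast as its budget allows, while tenant `b`, whose single request costs more than one quantum, accumulates deficit over two rounds before being served. A larger quantum lets a tenant drain more work per turn, which favors batching at the expense of finer-grained interleaving.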

Source: cohere