Cohere Introduces Serving Fairness for Multi-Tenant LLM Platforms

Cohere announced a new solution to fairly schedule inference requests across tenants, using a layered approach to capacity management with real-time throttling checks.

Cohere has introduced a system called Serving Fairness to schedule inference requests fairly across the tenants of its multi-tenant SaaS platform. Under the new design, each tenant's share of inference capacity is determined by the scheduler rather than by how aggressively that tenant submits requests. The system maintains priority and deadline ordering within each tenant while preserving batching efficiency. It combines four distinct mechanisms that are applied in a fixed order to manage workloads fairly across tenants.

The system begins with a Rate Limiter that controls admission to the scheduling queue, capping the number of inference requests a tenant can submit within a given timeframe. These limits are configured at the endpoint level and vary with the resource consumption of different models. Cohere also employs real-time throttling checks that reject requests that cannot be served within their latency targets, protecting the system from overload and keeping latencies predictable. After admission, requests proceed through a series of selectors that determine the order in which they are processed.
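The admission step described above can be illustrated with a minimal Python sketch. It is not Cohere's implementation: the sliding-window strategy, the class and function names, the endpoint names, and the wait-time estimate used for the throttling check are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Caps requests per (tenant, endpoint) within a rolling time window.

    Hypothetical sketch: the article does not say which limiting
    strategy Cohere uses, so a sliding window is assumed here.
    """

    def __init__(self, limits, window_seconds):
        # limits: assumed mapping of endpoint name -> max requests per window
        self.limits = limits
        self.window_seconds = window_seconds
        self._timestamps = defaultdict(deque)

    def admit(self, tenant, endpoint, now):
        """Return True if the request may enter the scheduling queue."""
        stamps = self._timestamps[(tenant, endpoint)]
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.limits[endpoint]:
            return False
        stamps.append(now)
        return True


def meets_latency_target(queued_ahead, est_seconds_per_request, latency_target_seconds):
    """Assumed throttling check: estimated wait plus service time must fit the target."""
    estimated_latency = (queued_ahead + 1) * est_seconds_per_request
    return estimated_latency <= latency_target_seconds


# A heavier endpoint gets a lower cap than a lighter one (illustrative values).
limiter = SlidingWindowRateLimiter({"small-model": 3, "large-model": 1}, window_seconds=60.0)
decisions = [limiter.admit("tenant-a", "small-model", now=t) for t in (0.0, 1.0, 2.0, 3.0)]
print(decisions)
```

In this sketch the fourth request inside the same window is refused, and a request is admitted again once older timestamps fall out of the window. Passing the current time in as an argument keeps the example deterministic; a real service would read a monotonic clock instead.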

The core of the system is the Deficit Round Robin (DRR) algorithm, which ensures equitable distribution of compute resources across the fleet within a tenant tier. Each tenant receives its own line in the scheduling queue, and the scheduler takes turns between tenants to ensure fair access. This system balances the need for efficient batching with the requirement to prevent any single tenant from monopolizing GPU resources. The fairness of the system depends on two key variables: the quantum, which represents the budget a tenant is granted per round, and the cost, which measures how much budget each request consumes.
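To show how quantum and cost interact, here is a small Python sketch of textbook Deficit Round Robin. It is a generic illustration, not Cohere's scheduler: the function name, the tuple layout of a request, and the numeric values are assumptions, and tiers, batching, and priority or deadline ordering are left out.

```python
from collections import deque


def drr_schedule(queues, quantum, cost):
    """Return requests in the order a basic Deficit Round Robin would serve them.

    queues:  dict of tenant -> list of requests, already in the tenant's own order
    quantum: budget added to a tenant's deficit counter on each round
    cost:    function mapping a request to the budget it consumes
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    pending = {tenant: deque(reqs) for tenant, reqs in queues.items() if reqs}
    deficit = {tenant: 0 for tenant in pending}
    order = []
    while pending:
        # Visit a snapshot of tenants so finished ones can be removed safely.
        for tenant in list(pending):
            deficit[tenant] += quantum
            queue = pending[tenant]
            # Serve from the head of the line while the budget covers the cost.
            while queue and cost(queue[0]) <= deficit[tenant]:
                request = queue.popleft()
                deficit[tenant] -= cost(request)
                order.append(request)
            if not queue:
                # An emptied queue forfeits its leftover budget.
                del pending[tenant]
                deficit[tenant] = 0
    return order


# Each request is a hypothetical (name, cost) pair; tenant "a" submits far more work.
queues = {
    "a": [("a1", 2), ("a2", 2), ("a3", 2), ("a4", 2), ("a5", 2)],
    "b": [("b1", 3), ("b2", 3)],
}
served = [name for name, _ in drr_schedule(queues, quantum=4, cost=lambda r: r[1])]
print(served)  # ['a1', 'a2', 'b1', 'a3', 'a4', 'b2', 'a5']
```

Tenant "a" has more requests waiting, yet tenant "b" is served in every round because each tenant spends only its own budget. A request that costs more than one quantum is not starved either, since unspent budget carries over to the next round while the tenant still has work queued.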

Key points

Cohere introduced a new system called Serving Fairness to fairly schedule inference requests across tenants.
The system uses a layered approach to capacity management with real-time throttling checks.
Cohere employs a Rate Limiter to control admission to the scheduling queue, capping the maximum number of inference requests per tenant.
The Deficit Round Robin (DRR) algorithm ensures equitable distribution of compute resources across the fleet within a tenant tier.
Each tenant receives its own line in the scheduling queue, and the scheduler takes turns between tenants to ensure fair access.
Fairness in the system depends on two key variables: the quantum, which is the budget a tenant is granted per round, and the cost, which is how much of that budget each request consumes.

Source: Cohere

WRITTEN BY

Theo Almeida

AI Software & Developer Tools

Theo covers AI software, developer tools, frameworks, and the platforms builders use every day.