Gemma-4 31B Model Serves 1.17k Tok/s on RTX 6000 Pro

Gemma-4 31B reaches a peak throughput of 1.17k tokens per second on an RTX 6000 Pro GPU, with a median time to first token (TTFT) of about 0.7 seconds.

A recent benchmark evaluates the performance of the Gemma-4 31B model running on the vLLM serving engine with an RTX 6000 Pro GPU. The test measured throughput, latency, and queue depth under progressive load, with concurrency stepped from 12 to 24. The results show a peak token throughput of 1.17k tok/s and a median time to first token (TTFT) of approximately 0.7 seconds. Queue depth stayed low, indicating that the server kept up with incoming requests even at the higher concurrency levels. The benchmark used the ShareGPT dataset, with 128 unique prompts per concurrency level, and the model was configured for a 4K-token context window to align with dataset requirements.
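To make the reported metrics concrete, here is a minimal Python sketch of how median TTFT, tail TTFT, and token throughput could be derived from per-request timing records. The `RequestRecord` type, its field names, and the nearest-rank percentile method are assumptions for illustration; the article does not describe the benchmark's actual tooling or how it computed percentiles.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request timing record (all times in seconds)."""
    start: float          # when the request was sent
    first_token: float    # when the first output token arrived
    end: float            # when the last output token arrived
    output_tokens: int    # number of tokens generated


def summarize(records):
    """Return median TTFT, p99 TTFT, and output token throughput."""
    ttfts = sorted(r.first_token - r.start for r in records)
    # Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
    p99_index = max(0, math.ceil(0.99 * len(ttfts)) - 1)
    # Throughput is measured over the wall-clock span of the whole run.
    wall_seconds = max(r.end for r in records) - min(r.start for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    return {
        "median_ttft_s": statistics.median(ttfts),
        "p99_ttft_s": ttfts[p99_index],
        "output_tok_per_s": total_tokens / wall_seconds,
    }
```

Because a median ignores the slowest requests, a run can report a low median TTFT while a handful of requests wait far longer, which is why tail percentiles are usually reported alongside it.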

The Gemma-4-31B-it-FP8 model, a 30.7B-parameter dense Transformer developed by Google DeepMind, supports a 256K-token context window and accepts text, image, and video inputs. For this deployment, however, the context window was limited to 4K tokens to match the ShareGPT dataset. The model's weights and activations were quantized to the FP8 data type using RedHatAI's LLM Compressor, while the vision tower, embedding, and output head layers were kept in their original precision. The server ran vLLM version 0.20 with performance-oriented settings, including prefix caching and chunked prefill.
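As a rough illustration of the serving setup described above, the following Python sketch assembles a `vllm serve` command line with a 4K context limit, prefix caching, and chunked prefill. The exact flags and values used in the benchmark are not given in the article, so this is an assumed configuration, and the model identifier is a placeholder rather than a confirmed repository name.

```python
import shlex


def build_vllm_serve_command(model_id: str, max_model_len: int = 4096) -> list[str]:
    """Build an argument list for `vllm serve` (assumed settings, not the benchmark's)."""
    return [
        "vllm", "serve", model_id,
        "--max-model-len", str(max_model_len),  # cap the context window at 4K tokens
        "--enable-prefix-caching",              # reuse KV cache for shared prompt prefixes
        "--enable-chunked-prefill",             # split long prefills into smaller chunks
    ]


# Placeholder model ID: replace with the actual repository name of the FP8 checkpoint.
MODEL_ID = "your-org/gemma-4-31B-it-FP8"

command = build_vllm_serve_command(MODEL_ID)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed directly to `subprocess.run(command)` without shell quoting issues.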

The benchmark results indicate that the server remained unsaturated throughout the test, with near-zero queueing and a peak throughput of 1.17k tok/s. High end-to-end latency, particularly at the 99th percentile, was attributed to long generation times rather than server inefficiency. The only notable stress signal was tail TTFT, which spiked to 19 seconds under peak load, showing that first-token wait time degraded for the slowest requests even though the median stayed near 0.7 seconds. Overall, the server appeared to have headroom left at the tested load, suggesting it could be pushed to higher concurrency.
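A back-of-the-envelope calculation shows why long generations alone can explain high end-to-end latency on an unsaturated server. The Python sketch below assumes that the 1.17k tok/s peak was reached at a concurrency of 24 and was shared evenly across streams, and it uses hypothetical output lengths; none of these assumptions are confirmed by the article.

```python
def estimate_e2e_seconds(ttft_s: float, output_tokens: int, per_stream_tok_per_s: float) -> float:
    """Estimate end-to-end latency as TTFT plus decode time for the remaining tokens."""
    return ttft_s + (output_tokens - 1) / per_stream_tok_per_s


AGGREGATE_TOK_PER_S = 1170.0  # peak throughput reported in the article
CONCURRENCY = 24              # assumption: the peak occurred at the highest tested concurrency
MEDIAN_TTFT_S = 0.7           # median TTFT reported in the article

# Assumption: throughput is split evenly across concurrent streams.
per_stream = AGGREGATE_TOK_PER_S / CONCURRENCY

# Hypothetical output lengths, chosen only to show how latency grows with generation length.
for output_tokens in (256, 1024, 2048):
    seconds = estimate_e2e_seconds(MEDIAN_TTFT_S, output_tokens, per_stream)
    print(f"{output_tokens} output tokens -> about {seconds:.1f} s end to end")
```

Under these assumptions each stream decodes at a few dozen tokens per second, so a response of a thousand tokens or more takes tens of seconds even when the request never waits in a queue.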

Source: Hugging Face

Key points

Gemma-4 31B reaches a peak throughput of 1.17k tok/s on an RTX 6000 Pro GPU.
Median time to first token (TTFT) for Gemma-4 31B is approximately 0.7 seconds.
The model was configured for a 4K-token context window to align with ShareGPT dataset requirements.
Gemma-4-31B-it-FP8 is a 30.7B parameter dense Transformer developed by Google DeepMind.
Weights and activations of Gemma-4 31B were quantized to the FP8 data type using RedHatAI's LLM Compressor.
The serving engine was vLLM version 0.20, configured with performance-oriented settings.
Server remained unsaturated throughout the test, with near-zero queueing and peak throughput of 1.17k tok/s.

WRITTEN BY

Theo Almeida

AI Software & Developer Tools

Theo covers AI software, developer tools, frameworks, and the platforms builders use every day.
