A recent benchmark evaluates the Gemma-4 31B model served by the vLLM engine on an RTX 6000 Pro GPU. The test measured throughput, latency, and queue depth under progressive load, with concurrency stepped from 12 to 24. Peak token throughput reached 1.17k tok/s, with a median time to first token (TTFT) of approximately 0.7 seconds. Queue depth stayed low, indicating that requests were handled efficiently even at the higher concurrency levels. The benchmark used the ShareGPT dataset, with 128 unique prompts per concurrency level, and the model was configured with a 4K-token context window to match the dataset.
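To make the latency figures concrete, the following is a minimal sketch of how a median and a tail percentile could be derived from per-request TTFT samples. It uses the nearest-rank definition of a percentile, which is an assumption: the benchmark does not say which method it used. The helper names and the sample values are illustrative and are not taken from the benchmark.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_ttft(ttft_seconds):
    """Return the median, 99th percentile, and maximum TTFT in seconds."""
    return {
        "p50": percentile(ttft_seconds, 50),
        "p99": percentile(ttft_seconds, 99),
        "max": max(ttft_seconds),
    }


# Illustrative samples only (seconds); NOT measurements from the benchmark.
example_ttft = [0.5, 0.6, 0.7, 0.7, 0.9, 4.0]
summary = summarize_ttft(example_ttft)
print(summary)
```

With only a handful of samples the 99th percentile collapses to the maximum, which is one reason tail figures from small runs should be read with care.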

The Gemma-4-31B-it-FP8 model, a 30.7B-parameter dense Transformer developed by Google DeepMind, supports a 256K-token context window and accepts text, image, and video inputs. For this deployment, however, the context window was limited to 4K tokens to match the ShareGPT dataset. Weights and activations were quantized to FP8 using RedHatAI's LLM Compressor, while the vision tower, embedding, and output head layers were kept in their original precision. The model was served with vLLM version 0.20 using performance-oriented settings, including prefix caching and chunked prefill.
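As a rough illustration of such a serving configuration, the sketch below assembles a `vllm serve` command line with a 4K context limit, prefix caching, and chunked prefill. The exact command used in the benchmark is not given, so this is an assumption: the model identifier is a placeholder, the helper name `build_serve_command` is hypothetical, and recent vLLM releases may already enable some of these options by default.

```python
import shlex


def build_serve_command(model_id, max_model_len=4096):
    """Assemble an argument list for `vllm serve` (hypothetical helper).

    The flags mirror the settings described in the text; the benchmark's
    actual command line is not known.
    """
    return [
        "vllm", "serve", model_id,
        "--max-model-len", str(max_model_len),  # cap context at 4K tokens
        "--enable-prefix-caching",              # reuse KV cache for shared prefixes
        "--enable-chunked-prefill",             # split long prefills into chunks
    ]


# Placeholder model identifier; substitute the real repository path.
command = build_serve_command("Gemma-4-31B-it-FP8")
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed directly to `subprocess.run` without shell quoting issues.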

The benchmark results indicate that the server remained unsaturated throughout the test, with near-zero queueing and a peak throughput of 1.17k tok/s. High end-to-end latency, particularly at the 99th percentile, was attributed to long generation times rather than server inefficiency. The only notable stress signal was tail TTFT, which spiked to 19 seconds under peak load, indicating that first-token wait time degraded only for the slowest requests. Overall, the system performed robustly and appears to have headroom for further scaling.
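A back-of-the-envelope calculation shows why long generations alone can account for high end-to-end latency. The sketch below assumes that the 1.17k tok/s peak is output-token throughput, that it was reached at 24 concurrent requests, and that it was shared evenly among them. None of this is stated in the results, and the 1,000-token response length is purely hypothetical.

```python
def per_request_rate(aggregate_tok_per_s, concurrency):
    """Tokens per second each request sees if throughput is shared evenly."""
    return aggregate_tok_per_s / concurrency


def generation_time_s(output_tokens, tok_per_s):
    """Seconds needed to stream output_tokens at a steady tok_per_s."""
    return output_tokens / tok_per_s


# Assumed inputs: reported peak throughput and the top concurrency level.
rate = per_request_rate(1170, 24)
# Hypothetical response length, chosen only for illustration.
duration = generation_time_s(1000, rate)
print(f"{rate:.2f} tok/s per request, {duration:.1f} s for 1000 tokens")
```

Under these assumptions each request decodes at roughly 49 tok/s, so a 1,000-token reply takes about 20 seconds even with no queueing at all.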

Source: Hugging Face