Amazon SageMaker has added observability features for generative AI inference endpoints that expose detailed metrics for monitoring and debugging performance issues. The metrics cover GPU health, token-level latency, and cold start diagnostics, which lets teams troubleshoot latency spikes and resource bottlenecks more precisely. The new capabilities are designed to support multi-model deployments on shared GPU infrastructure, with the aim of keeping them highly available and scaling efficiently.

The SageMaker Insights dashboard, integrated with Amazon CloudWatch, now supports both single-model and inference component (IC) endpoints. When it detects inference components, the dashboard automatically displays IC-specific panels, and it presents fleet health across three views: Performance, Capacity, and Reliability. Users can also connect these metrics to their own observability tools through a PromQL-compatible endpoint, which adds flexibility to existing monitoring workflows.

According to the source, SageMaker endpoints emit native OpenTelemetry (OTel) metrics to CloudWatch, where they are queried with PromQL and visualized at the fleet, endpoint, and inference-component levels. The new observability features are available for both new and existing endpoints, but existing endpoints require an explicit opt-in. The SageMaker console provides a guided wizard that helps users enable detailed observability and OTel enrichment for classic CloudWatch metrics.
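
To make the three query levels concrete, here is a minimal Python sketch of how PromQL expressions for the fleet, endpoint, and inference-component levels might be built and wrapped in a query URL. Everything specific in it is an assumption for illustration: the metric name, the label names, and the base URL are hypothetical and do not come from the source, and the `/api/v1/query` path follows the open-source Prometheus HTTP API, which a PromQL-compatible endpoint may or may not mirror exactly.

```python
from urllib.parse import urlencode

# Hypothetical metric name: the source does not list the real metric names.
METRIC = "sagemaker_inference_latency_seconds"

# Hypothetical label names used to group results at each level.
LEVEL_LABELS = {
    "fleet": [],
    "endpoint": ["endpoint_name"],
    "inference_component": ["endpoint_name", "inference_component_name"],
}


def build_promql(level: str, window: str = "5m") -> str:
    """Return a PromQL expression that averages METRIC at the given level."""
    labels = LEVEL_LABELS[level]
    inner = f"avg_over_time({METRIC}[{window}])"
    if not labels:
        # Fleet level: collapse every series into a single value.
        return f"avg({inner})"
    # Endpoint or inference-component level: keep one series per label set.
    return f"avg by ({', '.join(labels)}) ({inner})"


def build_query_url(base_url: str, level: str, window: str = "5m") -> str:
    """Build an instant-query URL in the style of the Prometheus HTTP API."""
    params = urlencode({"query": build_promql(level, window)})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


if __name__ == "__main__":
    for level in LEVEL_LABELS:
        print(f"{level}: {build_promql(level)}")
    # Placeholder base URL; a real endpoint would come from the AWS documentation.
    print(build_query_url("https://example.invalid", "endpoint"))
```

The sketch only constructs the query and URL and sends nothing over the network. An actual request to an AWS-hosted endpoint would most likely also need signed credentials, which is left out here.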

Source: awsml