
"The resulting system now ingests over 100 million samples per second in production, showcasing the scalability and efficiency of the new metrics stack."
"The primary challenge was bridging three coexisting instrumentation worlds: StatsD libraries, growing OTLP adoption, and a new Prometheus-compatible storage backend based on Grafana Mimir."
"The move to OTLP brought measurable gains: CPU time spent on metrics processing in JVM services dropped from 10% to under 1% of total CPU samples."
"The fix was switching those specific services to delta temporality via AggregationTemporalitySelector.deltaPreferred(), which avoids retaining full state of all metric-label combinations between exports."
Airbnb's observability engineering team migrated from StatsD and a proprietary aggregation pipeline to an open-source metrics stack built on the OpenTelemetry Protocol (OTLP) and VictoriaMetrics. The team prioritized getting all metrics into the new system before addressing user-facing tools, and an updated metrics library that supported dual emission of metrics smoothed the transition. Switching to OTLP cut the CPU time spent on metrics processing in JVM services from 10% to under 1% of total CPU samples and removed the risk of packet loss. High-cardinality services, however, ran into memory problems, which the team resolved by switching those services to delta temporality, accepting some trade-offs in metric retention.
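To illustrate why delta temporality relieves memory pressure in high-cardinality services, here is a minimal sketch in plain Java. It is a toy model, not the OpenTelemetry SDK and not Airbnb's code: `TemporalityDemo`, `CounterAggregator`, and `retainedAfter` are hypothetical names, and the label strings are invented for the simulation. A cumulative aggregator keeps every label combination it has ever seen, while a delta aggregator drops its state after each export.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of counter aggregation; all names here are hypothetical. */
final class TemporalityDemo {

    enum Temporality { CUMULATIVE, DELTA }

    /** In-process counter aggregator keyed by a label-set string. */
    static final class CounterAggregator {
        private final Temporality temporality;
        private final Map<String, Long> state = new HashMap<>();

        CounterAggregator(Temporality temporality) {
            this.temporality = temporality;
        }

        /** Adds a value to the counter for one label combination. */
        void add(String labels, long value) {
            state.merge(labels, value, Long::sum);
        }

        /** Returns the points to export, then applies the retention rule. */
        Map<String, Long> export() {
            Map<String, Long> snapshot = new HashMap<>(state);
            if (temporality == Temporality.DELTA) {
                // Delta: only the change since the last export is reported,
                // so nothing needs to be kept between exports.
                state.clear();
            }
            // Cumulative: running totals must be kept for every series.
            return snapshot;
        }

        int retainedSeries() {
            return state.size();
        }
    }

    /** Runs export cycles in which every interval sees brand-new label sets. */
    static int retainedAfter(Temporality t, int intervals, int seriesPerInterval) {
        CounterAggregator agg = new CounterAggregator(t);
        for (int i = 0; i < intervals; i++) {
            for (int s = 0; s < seriesPerInterval; s++) {
                agg.add("interval=" + i + ",series=" + s, 1);
            }
            agg.export();
        }
        return agg.retainedSeries();
    }

    public static void main(String[] args) {
        // prints "cumulative: 10000"
        System.out.println("cumulative: " + retainedAfter(Temporality.CUMULATIVE, 10, 1000));
        // prints "delta: 0"
        System.out.println("delta: " + retainedAfter(Temporality.DELTA, 10, 1000));
    }
}
```

In the real OpenTelemetry Java SDK, the equivalent switch is made on the exporter rather than by hand; for example, the OTLP metric exporter builders accept `setAggregationTemporalitySelector(AggregationTemporalitySelector.deltaPreferred())`. How Airbnb wired this into its services is not described in the summary above.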
Read at InfoQ