fix(metrics): extend datasource latency histogram buckets to 3 minutes #3568

Open
aryanmehrotra wants to merge 2 commits into development from fix/datasource-histogram-buckets

Conversation

@aryanmehrotra

Member

What

Datasource latency histograms capped their bucket boundaries at 10. For the datasources that record duration in microseconds (clickhouse, cassandra, scylladb, surrealdb, dgraph, redis, …), the largest finite bucket was therefore 10µs, so essentially every real query fell into +Inf and the histogram carried no usable distribution. The app_clickhouse_stats_bucket series, for example, was effectively useless.
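
To illustrate the failure mode, here is a minimal sketch of how a cumulative histogram picks a bucket. The boundary slices and the `bucketFor` helper are hypothetical and are not taken from the framework; only the top value of 10 and the 50µs to 3 minute range come from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// bucketFor returns the label of the first bucket whose upper bound is >= v,
// or "+Inf" when v exceeds every finite boundary.
// Hypothetical helper for illustration; not part of the framework.
func bucketFor(bounds []float64, v float64) string {
	i := sort.SearchFloat64s(bounds, v) // smallest i with bounds[i] >= v
	if i == len(bounds) {
		return "+Inf"
	}
	return fmt.Sprintf("le=%g", bounds[i])
}

func main() {
	// Assumed old-style boundaries: only the top value (10) is from the PR text.
	oldBounds := []float64{0.05, 0.1, 0.5, 1, 5, 10}
	// Assumed new-style boundaries in microseconds, 50µs up to 3 minutes.
	newBounds := []float64{50, 500, 5000, 50000, 500000, 5000000, 180000000}

	// A 2.5ms query recorded in microseconds is the value 2500.
	fmt.Println("old:", bucketFor(oldBounds, 2500))
	fmt.Println("new:", bucketFor(newBounds, 2500))
}
```

With boundaries that stop at 10, any microsecond value above 10 can only land in +Inf, which is why the distribution was lost.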

This standardizes all datasource latency histograms on one shared bucket shape spanning 50µs → 3 minutes, expressed in each datasource's native recorded unit.

Approach

Per the agreed direction: recorded values are the source of truth and are left unchanged (so nothing breaks on the data side); only bucket boundaries and descriptions/names change to match what is actually recorded.

| Unit recorded | Datasources | New bucket top |
| --- | --- | --- |
| milliseconds | sql, nats, dynamodb, influxdb, opentsdb | 180000 (3 min) |
| microseconds | redis, clickhouse, cassandra, scylladb, surrealdb, dgraph, dbresolver, badger, solr, arangodb, mongo, couchbase, elasticsearch, file | 180000000 (3 min) |

SQL keeps using the shared getDefaultDatasourceBuckets() (ms); a parallel getDefaultDatasourceMicrosecondBuckets() was added for Redis (which records µs).
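
One way to keep two unit-specific bucket sets on the same shape is to derive one from the other. The sketch below is an assumption about how that could look, not the code in this PR: the function names `microsecondBuckets` and `millisecondBuckets` and every intermediate boundary are illustrative, and only the 50µs and 3 minute endpoints come from the description.

```go
package main

import "fmt"

// microsecondBuckets returns illustrative boundaries from 50µs to 3 minutes,
// expressed in microseconds. The intermediate values are assumptions.
func microsecondBuckets() []float64 {
	return []float64{
		50, 100, 500, 1_000, 5_000, 10_000, 50_000, 100_000,
		500_000, 1_000_000, 5_000_000, 30_000_000, 60_000_000, 180_000_000,
	}
}

// millisecondBuckets expresses the same boundaries in milliseconds,
// so both sets cover the identical time range.
func millisecondBuckets() []float64 {
	us := microsecondBuckets()
	ms := make([]float64, len(us))
	for i, b := range us {
		ms[i] = b / 1000
	}
	return ms
}

func main() {
	us, ms := microsecondBuckets(), millisecondBuckets()
	fmt.Printf("µs: %.0f .. %.0f\n", us[0], us[len(us)-1])
	fmt.Printf("ms: %.2f .. %.0f\n", ms[0], ms[len(ms)-1])
}
```

Deriving one set from the other means a later change to the shape only has to be made in one place; the PR itself may simply declare the two sets independently.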

Metadata corrections (records win, desc/name updated)

These metrics recorded microseconds but advertised milliseconds — descriptions corrected:
app_redis_stats, app_badger_stats, app_solr_stats, app_arango_stats, app_mongo_stats, app_couchbase_stats.

Other fixes:

  • elasticsearch: metric renamed es_request_duration_ms → es_request_duration_us (it records µs).
  • file: app_file_stats description now states the unit (microseconds).

Compatibility

  • No recorded metric values change → existing rate/avg queries are unaffected.
  • Histogram _bucket boundaries change (the whole point), and the noted description/name changes update dashboard metadata. The elasticsearch metric rename is the only series-name change.

Tests

Updated affected unit tests (container snapshot/bucket assertions, plus per-module bucket/description/name assertions). All touched modules build; their tests pass except two pre-existing, unrelated failures verified against baseline (scylladb Test_Exec and arangodb document mock-type tests) that are not touched by this change.

Follow-up (not in this PR)

gRPC histograms (app_gRPC-Server_stats, app_gRPC-Stream_stats, app_gRPC-Client_stats) are not registered by the framework. They come from files carrying the header "// Code generated by gofr.dev/cli/gofr. DO NOT EDIT.", which are produced by the gofr-cli generator. Increasing their buckets to 3 minutes needs a change in the gofr.dev/cli/gofr repo's template, not here.

Copilot AI review requested due to automatic review settings June 9, 2026 20:39

Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

The datasource latency histograms capped their buckets at 10, which for
microsecond-recording datasources (clickhouse, cassandra, scylladb,
surrealdb, etc.) meant the top bucket was 10µs — so virtually every real
query landed in +Inf and the histograms carried no usable distribution.

Standardize all datasource histograms on a shared bucket shape spanning
50µs to 3 minutes, expressed in each datasource's native recorded unit:

- millisecond datasources (sql, nats, dynamodb, influxdb, opentsdb):
  buckets up to 180000 (3 min)
- microsecond datasources (redis, clickhouse, cassandra, scylladb,
  surrealdb, dgraph, dbresolver, badger, solr, arangodb, mongo,
  couchbase, elasticsearch, file): buckets up to 180000000 (3 min)

Recorded values are unchanged (non-breaking on the data); only bucket
boundaries and metadata move. Descriptions/names are corrected to match
what is actually recorded:

- redis, badger, solr, arangodb, mongo, couchbase: description said
  "milliseconds" but they record microseconds -> now "microseconds"
- elasticsearch: metric renamed es_request_duration_ms ->
  es_request_duration_us (it records microseconds)
- file: description now states the unit (microseconds)
@aryanmehrotra aryanmehrotra force-pushed the fix/datasource-histogram-buckets branch from b37c823 to 804a759 Compare June 10, 2026 04:45

3 participants