fix(metrics): extend datasource latency histogram buckets to 3 minutes#3568
Open
aryanmehrotra wants to merge 2 commits into
Open
fix(metrics): extend datasource latency histogram buckets to 3 minutes#3568aryanmehrotra wants to merge 2 commits into
aryanmehrotra wants to merge 2 commits into
Conversation
6853753 to
b37c823
Compare
The datasource latency histograms capped their buckets at 10, which for microsecond-recording datasources (clickhouse, cassandra, scylladb, surrealdb, etc.) meant the top bucket was 10µs — so virtually every real query landed in +Inf and the histograms carried no usable distribution. Standardize all datasource histograms on a shared bucket shape spanning 50µs to 3 minutes, expressed in each datasource's native recorded unit: - millisecond datasources (sql, nats, dynamodb, influxdb, opentsdb): buckets up to 180000 (3 min) - microsecond datasources (redis, clickhouse, cassandra, scylladb, surrealdb, dgraph, dbresolver, badger, solr, arangodb, mongo, couchbase, elasticsearch, file): buckets up to 180000000 (3 min) Recorded values are unchanged (non-breaking on the data); only bucket boundaries and metadata move. Descriptions/names are corrected to match what is actually recorded: - redis, badger, solr, arangodb, mongo, couchbase: description said "milliseconds" but they record microseconds -> now "microseconds" - elasticsearch: metric renamed es_request_duration_ms -> es_request_duration_us (it records microseconds) - file: description now states the unit (microseconds)
b37c823 to
804a759
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Datasource latency histograms capped their buckets at
10. For the datasources that record duration in microseconds (clickhouse, cassandra, scylladb, surrealdb, dgraph, redis, …) that meant the largest bucket was 10µs, so essentially every real query fell into+Infand the histogram carried no usable distribution. Theapp_clickhouse_stats_bucketexample was effectively useless.This standardizes all datasource latency histograms on one shared bucket shape spanning 50µs → 3 minutes, expressed in each datasource's native recorded unit.
Approach
Per the agreed direction: recorded values are the source of truth and are left unchanged (so nothing breaks on the data side); only bucket boundaries and descriptions/names change to match what is actually recorded.
180000(3 min)180000000(3 min)SQL keeps using the shared
getDefaultDatasourceBuckets()(ms); a parallelgetDefaultDatasourceMicrosecondBuckets()was added for Redis (which records µs).Metadata corrections (records win, desc/name updated)
These metrics recorded microseconds but advertised milliseconds — descriptions corrected:
app_redis_stats,app_badger_stats,app_solr_stats,app_arango_stats,app_mongo_stats,app_couchbase_stats.Other fixes:
es_request_duration_ms→es_request_duration_us(it records µs).app_file_statsdescription now states the unit (microseconds).Compatibility
_bucketboundaries change (the whole point), and the noted description/name changes update dashboard metadata. The elasticsearch metric rename is the only series-name change.Tests
Updated affected unit tests (container snapshot/bucket assertions, plus per-module bucket/description/name assertions). All touched modules build; their tests pass except two pre-existing, unrelated failures verified against baseline (
scylladb Test_Execandarangodbdocument mock-type tests) that are not touched by this change.Follow-up (not in this PR)
gRPC histograms (
app_gRPC-Server_stats,app_gRPC-Stream_stats,app_gRPC-Client_stats) are not registered by the framework — they come from// Code generated by gofr.dev/cli/gofr. DO NOT EDIT.files produced by the gofr-cli generator. Increasing their buckets to 3 minutes needs a change in thegofr.dev/cli/gofrrepo's template, not here.