Integration test for Prometheus receiver of metric types histogram and summary #909
Conversation
Force-pushed from bd16a1f to f0948e4.
Overall this looks pretty good to me, but I had a few questions. I'll do one more quick scan afterwards.
integration_test/ops_agent_test.go
Why was this retry needed again? I don't remember anymore - maybe we should comment on it here for future viewers of this test.
The retry was needed because the Python http server started with nohup would randomly fail. The Python http server also differs across distros and Python versions, which made this test unstable. I have changed it to use a Go http server (since we can already set up a Go env in the VM) with systemd (to make sure the server starts, and logs to journalctl_output.txt for better debugging). From my testing, this looks much more reliable than the previous approach.
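For illustration, a minimal sketch of that kind of file server, assuming a placeholder directory and port rather than the PR's actual values:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve every file in the directory, so the agent (or curl) can fetch
	// e.g. http://localhost:8000/data for the Prometheus text payload.
	// The directory and port are placeholders, not the PR's real values.
	http.Handle("/", http.FileServer(http.Dir("/opt/prometheus-data")))
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```

Running a static binary like this under systemd (instead of nohup) is what provides the reliable startup and the journalctl logging mentioned above.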
integration_test/ops_agent_test.go
Is this file really JSON? I thought it was a Prometheus metrics output file no?
Good catch! Renamed to just data.
integration_test/ops_agent_test.go
I don't completely understand why these are all 0 values. They all do have some values, right? (In your example above, they have a total of 10 points.) Shouldn't this have the first values? I see that in step 2 they get the delta between the second and the first set of values, but in this case the delta between the first value and 0 is dropped?
If this is expected, please let me know why. I may be missing something
Synced offline: the cumulative points (histogram, and the count & sum of summary) get normalized when exporting to GCM. The first point of a cumulative time series is not sent; instead, it is subtracted from all following points. That's why, even though the first step's points have values, the agent scrapes them multiple times and they are all received as 0 values. Added comments pointing to the normalization of the cumulative points.
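To make that concrete, here is a hedged Go sketch of the normalization; the type and function names are illustrative, not the exporter's actual API:

```go
package main

import "fmt"

// cumulativePoint is an illustrative stand-in for one scraped cumulative
// value (a counter, a histogram bucket count, or a summary's sum/count).
type cumulativePoint struct {
	value float64
}

// normalize models the behavior described above: the first point of a
// cumulative series is not sent; it is subtracted from all later points.
func normalize(series []cumulativePoint) []cumulativePoint {
	if len(series) == 0 {
		return nil
	}
	start := series[0].value
	out := make([]cumulativePoint, 0, len(series)-1)
	for _, p := range series[1:] {
		out = append(out, cumulativePoint{value: p.value - start})
	}
	return out
}

func main() {
	// Step one exposes 10; repeated scrapes see 10, 10, 10, so every
	// exported point is 0 until the underlying values change.
	fmt.Println(normalize([]cumulativePoint{{10}, {10}, {10}})) // [{0} {0}]
}
```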
This makes sense to me. I think we should at least manually verify that this is actually what PromQL users expect once they migrate off of Prometheus. We should be doing what GMP does on GKE, or what Prometheus does by default.
integration_test/ops_agent_test.go
Is this always zero? Should it not be set to what was found before step 4?
Also looks like we are already testing for Step 2 summary points below. Is this needed?
Together with WaitForMetricWithCondition, this check has been removed and replaced with a sleep(3m).
integration_test/ops_agent_test.go
Same question for these values as for the histogram. It feels like the first points are basically just dropped?
Summary's count and sum are cumulative and get normalized as well.
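For reference, a generic summary in the Prometheus text format (illustrative metric name and values, not the PR's actual data file); the quantile series are point-in-time values, while _sum and _count are the cumulative parts subject to the normalization above:

```
# TYPE request_duration_seconds summary
request_duration_seconds{quantile="0.5"} 0.05
request_duration_seconds{quantile="0.9"} 0.12
request_duration_seconds_sum 0.8
request_duration_seconds_count 10
```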
integration_test/ops_agent_test.go
[super ultra nit]: This is a pretty massive test. I wonder if we can pull some of this into a separate method for the setup and then have two tests: one for histogram and one for summary. That way, if one of them breaks, debugging and fixing may be easier.
Good suggestion! I have broken this into two separate tests, and it looks cleaner this way.
integration_test/ops_agent_test.go
Was it not possible to use the extraFilters param instead of adding a condition after we wait for the metric?
Asking to see if we can push the query for this condition to the server and not do it client side.
WaitForMetricWithCondition was used to make sure step-two points had been received before checking their values. I have since found that those step-two points normally get received pretty quickly (with the agent's 10s scrape interval, it usually takes another couple of seconds before they become available), so I have replaced this with a sleep(3m), and we don't have to keep retrying here if something is wrong with the step-two points.
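A sketch of what that replacement amounts to; the function name is hypothetical, and only the 3-minute wait and the 10s scrape interval come from this thread:

```go
package integration

import "time"

// waitForStepTwoPoints illustrates the fixed wait described above.
func waitForStepTwoPoints() {
	// The agent scrapes every 10s and step-two points usually land a few
	// seconds later, so a fixed 3-minute wait replaces the conditional
	// retry (WaitForMetricWithCondition) with a large safety margin.
	time.Sleep(3 * time.Minute)
}
```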
Force-pushed from e84f738 to 4a0d0ad.
Force-pushed from 3df2c20 to 527bd34.
Force-pushed from 527bd34 to 111f12f.
The new http server is very cool. Can we use it for TestPrometheusMetricsWithJSONExporter too?
Overall this looks pretty solid; I'm just not sure yet whether we want the normalization that the exporter enables by default, or whether we want to toggle that off.
Doesn't the test curl /data?
The http server makes all files in a folder available, in case we need to host/query multiple files. Currently the test curls the {dir}/data file to get the metrics.
integration_test/ops_agent_test.go
This is much nicer!
Will use this same method for TestPrometheusMetricsWithJSONExporter as well!
integration_test/ops_agent_test.go
Other than step_one and step_two, everything here looks the same as in the summary test. Maybe we can pull this setup out into a helper method and reuse it in both places. That will also simplify changes made to the http server and setup later on.
Done! Now the top-level test functions only specify step one and step two's files & checks against expected values, and the helper function has all the common steps.
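The shape of that refactor might look like the following sketch; all names and signatures are assumptions for illustration, not the PR's actual helpers:

```go
package integration_test

import "testing"

// runPrometheusStepsTest is a hypothetical helper holding the common steps:
// start the file server with the step-one file at /data, let the agent
// scrape it, swap in the step-two file, wait, then run per-test checks.
func runPrometheusStepsTest(t *testing.T, stepOneFile, stepTwoFile string,
	checkStepTwo func(t *testing.T)) {
	t.Helper()
	// ... common setup and file swapping elided in this sketch ...
	checkStepTwo(t)
}

func TestPrometheusHistogram(t *testing.T) {
	runPrometheusStepsTest(t, "histogram_step_one", "histogram_step_two",
		func(t *testing.T) {
			// assert histogram deltas against the expected values
		})
}

func TestPrometheusSummary(t *testing.T) {
	runPrometheusStepsTest(t, "summary_step_one", "summary_step_two",
		func(t *testing.T) {
			// assert summary count/sum deltas against the expected values
		})
}
```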
Commits:
…m and summary
Refactor the test to show steps
Add comments to the steps and expected values
Fix typo
Fix typo
Force-pushed from 5f21d30 to 22eaa73.
Description
Add an integration test case for the Prometheus receiver, covering the histogram and summary metric types.
Related issue
b/253060243
How has this been tested?
Ran the test manually with unexpected values to make sure the test catches them.
Ran the test in the PR.