Skip to content

Conversation

@LujieDuan
Copy link
Contributor

@LujieDuan LujieDuan commented Oct 17, 2022

Description

Add integration test case for the Prometheus receiver, with histogram and summary types of metrics.

Related issue

b/253060243

How has this been tested?

Run the test manually and use unexpected values to make sure the test can catch that.
Run the test in the PR.

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@LujieDuan LujieDuan changed the title Add integration test for Prometheus receiver of metric types histogra… Integration test for Prometheus receiver of metric types histogram and summary Oct 20, 2022
@LujieDuan LujieDuan marked this pull request as ready for review October 20, 2022 22:42
@LujieDuan LujieDuan force-pushed the lujieduan-prometheus-histogram-summary-test branch from bd16a1f to f0948e4 Compare November 15, 2022 17:49
@LujieDuan LujieDuan changed the base branch from private_preview/prometheus to master November 15, 2022 17:50
Copy link
Contributor

@ridwanmsharif ridwanmsharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks pretty good to me but I had a few questions. Will quick scan once more after

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this retry needed again? I don't remember anymore - maybe we should comment on it here for future viewers of this test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry was needed because the Python http server with nohup would randomly fail. The Python http server also different across distros and Python versions and make this test unstable. I have changed to use a go http server (since we can already set up go env in the vm) with systemd (to make sure the server starts, and logs to journalctl_output.txt for better debugging). From my testing it looks this is much more reliable than the previous approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this file really JSON? I thought it was a Prometheus metrics output file no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Renamed to just data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't completely understand why these are all 0 values? They all do have some values right (in your example above, they have a total of 10 points). Shouldn't this have the first values? I see in step 2 they get the delta between the second and the first set of values, but in this case the delta between the first and 0 is dropped?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is expected, please let me know why. I may be missing something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sync'ed offline, the cumulative points (histogram, count & sum of summary) get normalized when exporting to GCM: the first point of a cumulative time series is not sent and used to subtract from all following points. That's why even though the first step point has values, the agent scrape this multiple times and they are all received as 0 values. Added comments to point to the normalization of the cumulative points.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. I think we should atleast manually verify that this is actually what promql users expect once they migrate off of prometheus. We should be doing what GMP does on GKE, or what prometheus does by default

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this always zero? Should it not be set to what was found before step 4?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looks like we are already testing for Step 2 summary points below. Is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Together with WaitForMetricWithCondition, this check here has been removed and replaced with a sleep(3m).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question for these values like for the histogram. It feels like the first points are basically just dropped?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary's count and sum are cumulative and get normalized as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[super ultra nit]: This is a pretty massive test. I wonder if we can simplify some of this into a separate method for the set up and then have two tests: 1 for histogram and 1 for summary. That way if one of them breaks, debugging and fixing may be easier

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! I have broken this into two separate tests and look cleaner this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it not possible to use the extraFilters param instead of adding a condition after we wait for the metric?
Asking to see if we can push the query for this condition to the server and not do it client side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WaitForMetricWithCondition was used to make sure step two points have been received before checking their values. Now I have found that those step two points normally get received pretty quick (with 10s scrape interval for the agent, it usually takes another couple seconds before those become available) so I have replaced this with a sleep(3m) and we don't have to keep retrying here if something wrong with the step two points.

@LujieDuan LujieDuan force-pushed the lujieduan-prometheus-histogram-summary-test branch 2 times, most recently from e84f738 to 4a0d0ad Compare December 20, 2022 22:51
@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 21, 2022
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 21, 2022
@LujieDuan LujieDuan force-pushed the lujieduan-prometheus-histogram-summary-test branch from 3df2c20 to 527bd34 Compare December 21, 2022 16:51
@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 21, 2022
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 21, 2022
@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 22, 2022
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 22, 2022
@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 28, 2022
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 28, 2022
@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 28, 2022
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Dec 28, 2022
@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 3, 2023
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 3, 2023
@LujieDuan LujieDuan force-pushed the lujieduan-prometheus-histogram-summary-test branch from 527bd34 to 111f12f Compare January 3, 2023 22:09
Copy link
Contributor

@ridwanmsharif ridwanmsharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new http server is very cool. Can we use it for TestPrometheusMetricsWithJSONExporter too?

Overall looks pretty solid, just not sure yet about whether we want the normalization that the exporter enables by default or whether we want to toggle that off.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the test curl /data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The http server will make all files in a folder available, in case we need to host/query multiple files. Currently the test curls the {dir}/data file to get the metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much nicer!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will use this same method for the TestPrometheusMetricsWithJSONExporter as well!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the step_one and step_two, everything here looks like the same for the summary test too. Maybe we can pull out this setup into a helper method and reuse it in both places. Will also simplify changes made to the http server and setup later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Right now the top level test functions will only specify step one and two's files & checks against expected values, and the helper function will have all the common steps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me. I think we should atleast manually verify that this is actually what promql users expect once they migrate off of prometheus. We should be doing what GMP does on GKE, or what prometheus does by default

@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 5, 2023
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 5, 2023
…m and summary

Refactor the test to show steps

Add comments to the steps and expected values

Fix typo

Fix typo
@LujieDuan LujieDuan force-pushed the lujieduan-prometheus-histogram-summary-test branch from 5f21d30 to 22eaa73 Compare January 9, 2023 21:42
@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 9, 2023
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 9, 2023
@LujieDuan LujieDuan added the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 9, 2023
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 9, 2023
@LujieDuan LujieDuan merged commit e6cf401 into master Jan 10, 2023
@igorpeshansky igorpeshansky deleted the lujieduan-prometheus-histogram-summary-test branch July 10, 2023 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants