@sophieyfang sophieyfang commented Sep 13, 2022

Description

This PR adds a new binary in cmd/google_cloud_ops_agent_diagnostics that runs after the google_cloud_ops_agent_engine binary successfully generates and validates the required configurations for fluent-bit and otelopscol. The binary runs as a service, on both Windows and Linux, alongside fluent-bit and otelopscol.

The purpose of the google-cloud-ops-agent-diagnostics service is to provide a place to run diagnostic checks and actions while the Ops Agent is running. A non-exhaustive list of these:

  • Report Ops Agent Self Metrics related to the currently set configuration:
    • agent.googleapis.com/agent/ops_agent/enabled_receivers (Implemented in this PR)
    • agent.googleapis.com/agent/ops_agent/feature_tracking (Future Work)
  • Log Rotation for Fluent-bit Logs (Possible use case)
  • Report Ops Agent Health Metrics (Possible use case)
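As a rough illustration of what an enabled_receivers style metric could be derived from, here is a minimal sketch that counts, per receiver type, the receivers referenced by at least one pipeline. The `Config`, `Receiver`, and `Pipeline` types and the `CountEnabledReceivers` function are hypothetical simplifications for this example, not the Ops Agent's actual configuration types, and the per-type breakdown is an assumption about how the count is grouped.

```go
package main

import (
	"fmt"
	"sort"
)

// Receiver is a hypothetical, simplified stand-in for a configured receiver.
type Receiver struct {
	Type string // e.g. "files" or "hostmetrics"
}

// Pipeline is a hypothetical pipeline that refers to receivers by ID.
type Pipeline struct {
	ReceiverIDs []string
}

// Config is a hypothetical, flattened view of an agent configuration.
type Config struct {
	Receivers map[string]Receiver
	Pipelines map[string]Pipeline
}

// CountEnabledReceivers returns, per receiver type, how many distinct
// receivers are referenced by at least one pipeline. Receivers that are
// defined but unused, and IDs that do not resolve, are ignored.
func CountEnabledReceivers(c Config) map[string]int {
	seen := map[string]bool{}
	counts := map[string]int{}
	for _, p := range c.Pipelines {
		for _, id := range p.ReceiverIDs {
			r, ok := c.Receivers[id]
			if !ok || seen[id] {
				continue
			}
			seen[id] = true
			counts[r.Type]++
		}
	}
	return counts
}

func main() {
	cfg := Config{
		Receivers: map[string]Receiver{
			"syslog":      {Type: "files"},
			"hostmetrics": {Type: "hostmetrics"},
		},
		Pipelines: map[string]Pipeline{
			"default_logs":    {ReceiverIDs: []string{"syslog"}},
			"default_metrics": {ReceiverIDs: []string{"hostmetrics"}},
		},
	}

	counts := CountEnabledReceivers(cfg)

	// Sort the types so the output order is stable.
	types := make([]string, 0, len(counts))
	for t := range counts {
		types = append(types, t)
	}
	sort.Strings(types)
	for _, t := range types {
		fmt.Printf("receiver_type=%s count=%d\n", t, counts[t])
	}
}
```

A unit test in the spirit of TestEnabledReceiversDefaultConfig could then compare the returned map for the default configuration against the expected counts.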

Related issue

b/232815588 | b/245344014

How has this been tested?

  • Integration Tests
    • Implemented a new test, TestDiagnosticsCrashRestart, to verify that the diagnostics service restarts after the binary is stopped manually (pkill).
    • Added agent.googleapis.com/agent/ops_agent/enabled_receivers to the expected metrics verified by the TestDefaultMetricsNoProxy test.
    • In a follow-up PR we can add the diagnosticsLivenessChecker to the opsAgentLivenessChecker. It can't be done in this PR because the checker is used in TestUpgradeOpsAgent, which compares against a previous version that doesn't have the diagnostics service.
  • Unit Tests
    • Added the TestEnabledReceiversDefaultConfig unit test to verify that the count of receivers in the default configurations matches what we expect.
  • Manual Verification
    • Verified that the Diagnostics Service works correctly on a Windows VM (windows-2019) and a Linux VM (debian 10). The service ran correctly and the metrics were sent to the Monitoring API as expected.

Windows 2019

Screen Shot 2022-09-22 at 2 50 09 PM

Screen Shot 2022-09-22 at 2 52 02 PM

Debian 10

Screen Shot 2022-09-22 at 3 01 57 PM

Screen Shot 2022-09-22 at 2 51 52 PM

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@sophieyfang sophieyfang changed the title from "Draft" to "google-cloud-ops-agent-diagnostics service [Windows only]" Sep 14, 2022
@sophieyfang sophieyfang changed the title from "google-cloud-ops-agent-diagnostics service [Windows only]" to "google-cloud-ops-agent-diagnostics service [Windows and Linux]" Sep 15, 2022
@franciscovalentecastro franciscovalentecastro force-pushed the branch-off-149e99 branch 2 times, most recently from 4017d41 to d7326dc Compare September 20, 2022 16:58
@franciscovalentecastro franciscovalentecastro added the kokoro:force-run Forces kokoro to run integration tests on a CL label Sep 20, 2022
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Sep 20, 2022
@franciscovalentecastro franciscovalentecastro added the kokoro:force-run Forces kokoro to run integration tests on a CL label Sep 22, 2022
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Sep 22, 2022
@franciscovalentecastro franciscovalentecastro left a comment

Responding to comments.

@franciscovalentecastro franciscovalentecastro left a comment

Addressing current comments.

@franciscovalentecastro franciscovalentecastro removed the request for review from igorpeshansky September 30, 2022 18:49
@franciscovalentecastro franciscovalentecastro added the kokoro:force-run Forces kokoro to run integration tests on a CL label Oct 5, 2022
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Oct 5, 2022
@qingling128

Current status:

  • The upstream otel SDK feature request has been resolved. (@franciscovalentecastro Is there a link to that if it's in GitHub?)
  • Design-wise, the decision is to use a separate service and to ingest metrics through the otel SDK.
  • The PR is now unblocked for code review. Braydon will be the primary approver for this PR. Other reviewers are welcome to add feedback. Please resolve the threads once your feedback is addressed to help us keep track of any open conversations that still need attention.

@franciscovalentecastro

franciscovalentecastro commented Oct 7, 2022

@qingling128 Adding elements to the current status update:

@braydonk braydonk left a comment

Found one comment that I forgot to send before, but otherwise LGTM!

@franciscovalentecastro

Most of the integration tests in b1b3756 and 31e9b21 pass. Some details:

  • The Build (SLES 12) failure is due to a failure to fetch the OSS Update repository.
  • Recurrent failure b/240564518 on Windows.

@franciscovalentecastro franciscovalentecastro merged commit 3bbb1fc into master Oct 14, 2022
@franciscovalentecastro franciscovalentecastro deleted the branch-off-149e99 branch January 9, 2023 15:32