Integration Testing Ops Agent UAP Plugin #1870

XuechunHou · 2025-01-29T00:23:23Z

Description

This PR updated the existing Ops agent integration tests and third party tests, making them compatible with the Ops Agent plugin.
This PR also added a few integration tests for the Ops Agent Plugin, exclusively testing its GetStatus method.

Related issue

b/381246351

How has this been tested?

Ops Agent integration tests should run twice per OS distro and architecture, once for the regular ops agent, once for the plugin. Both runs should pass.

Checklist:

Unit tests
- Unit tests do not apply.
- Unit tests have been added/modified and passed for this PR.
Integration tests
- Integration tests do not apply.
- Integration tests have been added/modified and passed for this PR.
Documentation
- This PR introduces no user visible changes.
- This PR introduces user visible changes and the corresponding documentation change has been made.
Minor version bump
- This PR introduces no new features.
- This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
- This PR bumps the version.

…the flakes on Rocky linux 8 and 9

integration_test/third_party_apps_test/applications/aerospike/enable

integration_test/third_party_apps_test/applications/apache/enable

integration_test/agents/agents.go

integration_test/ops_agent_test/main_test.go

rafaelwestphal · 2025-02-12T19:34:00Z

integration_test/agents/agents.go

 	gce.RunRemotely(ctx, logger.ToFile("journalctl_output.txt"), vm, "sudo journalctl -xe")
+	if isUAPPlugin {
+		for _, log := range []string{
+			gce.SyslogLocation(vm.ImageSpec),


Can we refactor this code avoid the duplication of the files list? It will make it easier in the feature when we need to remove/add/replace anything.

Thank you for the comments. You're correct, they are the same set of files. However, they have different paths, and I haven't yet found a good way to generalize them. Even within the Ops Agent plugin itself, not all of these files are located in the same directories.

Genmini suggested something like the following:

filePaths := map[string]string{ "health-checks.log": "/var/log/google-cloud-ops-agent/health-checks.log", "config.yaml": "/etc/google-cloud-ops-agent/config.yaml", "subagents/logging-module.log": "/var/log/google-cloud-ops-agent/subagents/logging-module.log", "subagents/metrics-module.log": "/var/log/google-cloud-ops-agent/subagents/metrics-module.log", "nvidia-installer.log": "/var/log/nvidia-installer.log", "fluent_bit_main.conf": "/run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf", "fluent_bit_parser.conf": "/run/google-cloud-ops-agent-fluent-bit/fluent_bit_parser.conf", "otel.yaml": "/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml", } if isUAPPlugin { for file, filePath := range filePaths { filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin", filePath) // Adjust for run and log directories. if file == "fluent_bit_main.conf" || file == "fluent_bit_parser.conf" || file == "otel.yaml" { filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin/run", path.Base(filePath)) } else if file == "health-checks.log" || file == "subagents/logging-module.log" || file == "subagents/metrics-module.log" || file == "nvidia-installer.log" { filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin/log", path.Base(filePath)) } else if file == "config.yaml" { filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin", path.Base(filePath)) } } }

However, that adds too many if-else statements, and it also reduces the readability of the file paths for the plugin.

Can we extract the 2 lists of files into a separate function like getLogFileList, which would return one of the 2 lists based on isUAPPlugin. This way, the if statement gets moved to the new function, so the current function is easier to read. Also, since the new function's only responsibility is returning the list of files, it's easy to change for both if we need to modify them later (and harder to miss).

integration_test/third_party_apps_test/applications/aerospike/enable

integration_test/third_party_apps_test/applications/apache/enable

jinghan-ma · 2025-02-12T22:30:36Z

integration_test/agents/agents.go

-	gce.RunRemotely(ctx, logger.ToFile("systemctl_status_for_ops_agent.txt"), vm, "sudo systemctl status google-cloud-ops-agent*")
+	isUAPPlugin := IsOpsAgentUAPPlugin()
+	if isUAPPlugin {
+		gce.RunRemotely(ctx, logger.ToFile("systemctl_status_for_ops_agent.txt"), vm, fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/GetStatus", OpsAgentPluginServerPort))


Should this be a different file name? Since we're not using systemd calling it systemctl_status seems a bit weird.

jinghan-ma · 2025-02-12T22:37:53Z

integration_test/agents/agents.go

 	gce.RunRemotely(ctx, logger.ToFile("journalctl_output.txt"), vm, "sudo journalctl -xe")
+	if isUAPPlugin {
+		for _, log := range []string{
+			gce.SyslogLocation(vm.ImageSpec),


Can we extract the 2 lists of files into a separate function like getLogFileList, which would return one of the 2 lists based on isUAPPlugin. This way, the if statement gets moved to the new function, so the current function is easier to read. Also, since the new function's only responsibility is returning the list of files, it's easy to change for both if we need to modify them later (and harder to miss).

jinghan-ma · 2025-02-12T22:41:11Z

integration_test/agents/agents.go

+		if gce.IsWindows(imageSpec) {
+			return ""
+		}
+		return fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Stop && grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Start", OpsAgentPluginServerPort, OpsAgentPluginServerPort)


Would UAP ever send a Start signal to a plugin that has already received a Stop signal? Or will it start the plugin on an entirely different process?

The test is not interacting with UAP at all, but we've asked this question to the UAP team and the answer is no.

"Or will it start the plugin on an entirely different process?" process or port? I think both are possible.

I think it might be better to mimic UAP's operations more closely in our own tests by killing the wrapper and starting a new one?

Not a big issue either way.

re-starting the plugin and re-starting the Ops Agent through the UAP plugin is different. Here I am explicitly re-starting the Ops Agent itself. In the UAP world, this is done by sending a Stop and then a Start gRPC request to the plugin gRPC server.

jinghan-ma · 2025-02-12T22:56:20Z

integration_test/agents/agents.go

+	if _, err := gce.RunRemotely(ctx, logger, vm, "sudo gsutil cp -r "+gcsPath+"/* /tmp/agentUpload"); err != nil {
+		return fmt.Errorf("error copying down agent package from GCS: %v", err)
+	}
+	if _, err := gce.RunRemotely(ctx, logger, vm, "mv /tmp/agentUpload/*.tar.gz /tmp/agentPlugin"); err != nil {
+		return err
+	}


Is there a reason why we don't directly copy to /tmp/agentPlugin?

jinghan-ma · 2025-02-12T23:01:34Z

integration_test/agents/agents.go

 			return err
 		}
+	} else if IsOpsAgentUAPPlugin() {
+		return RestartOpsAgent(ctx, logger, vm)


Why do we need to restart the ops agent plugin here specifically?

Not like the regular ops agent, the agent plugin is not started automatically after installation. It needs to be explicitly started.

The point of callingRestartOpsAgent function is to trigger a start of the agent.

When len(config)> 0, the regular agent needs to be re-started, and the plugin agent needs to be started to apply the custom config.

When len(config) ==0, only the plugin agent needs to be started.

In that case, is it possible to be more explicit here and call StartOpsAgentPlugin instead?

StartOpsAgentPlugin and RestartOpsAgent do different things for the agent plugin; StartOpsAgentPlugin starts the plugin grpc server itself. However, RestartOpsAgent sends a gRPC request to the plugin's Start() endpoint, to start the Ops Agent.
In this case the plugin gRPC server is live already, I only need to start the Ops Agent.

jinghan-ma · 2025-02-12T23:04:04Z

integration_test/gce/gce_testing.go

+	if IsARM(vm.ImageSpec) {
+		arch = "arm64"
+	}
+	if arch != "x86_64" && arch != "arm64" {


Given how we define arch above, will we ever go into this if statement?

You are right, updated

jinghan-ma · 2025-02-12T23:14:31Z

integration_test/ops_agent_test/main_test.go

+	if agents.IsOpsAgentUAPPlugin() {
+		if gce.IsWindows(imageSpec) {
+			return ""
+		}
+		return "grpcurl -plaintext -d '{}' localhost:1234 plugin_comm.GuestAgentPlugin/GetStatus"
+	}


I wonder if it makes sense for us to decouple the UAP status call from ServiceOutput. They currently are somewhat similar, but we expect them to evolve in very different ways -- so the way we test for it will also start to deviate.

I feel like it makes it clearer if we remove the abstraction and explicitly call something like GetPluginStatus in many places, so that it is clear to the reader that only UAP has this behaviour. Examples of this includes the current check for an Unhealthy status when plugin is stopped, or future tests such as checking for error codes (which is exclusive to UAP GetStatus).

I think the root of the problem is that there's a layer of abstraction missing here.
Currently, getRecentServiceOutputForImage is actually overloaded to perform two jobs:

Get service status (i.e. check if ops agent service is running), and

Get health check results.

These two happen to come from the same place when the Ops Agent is running as a service. So we can just call getRecentServiceOutputForImage and check both.

But the plugin version of the Ops Agent separates these two (more specifically, health checks is directly logged to a file). So it doesn't make sense having such a function for the UAP version.

I think we can make a quick change to fix this: Introduce 2 new functions. getServiceStatusForImage and getHealthCheckResultsForImage (this one is also mentioned in the comment below).

For regular Ops Agent, these functions will still call getRecentServiceOutputForImage. For UAP, it'll call the appropriate UAP equivalent.

Then, for each of the existing calls to getRecentServiceOutputForImage, replace it with one of the new functions above based on what exactly we are checking.

6078a3e Update PTAL.

getRecentServiceOutputForImage does different things for each kind of the Ops Agent:

For the regular agent: it retrieves health check results from systemctl status
For the agent plugin: it retrieves the response of the GetStatus call.

In my commit 6078a3e, a new function getUAPPluginStatusForImage is added to run exclusively for the agent plugin to retrieve the gRPC response of the GetStatus call.

getRecentServiceOutputForImage is simplified to work only for the regular ops agent.

WDYT? Thanks

jinghan-ma · 2025-02-12T23:20:02Z

integration_test/ops_agent_test/main_test.go

+			cmdOut, err = gce.RunRemotely(ctx, logger, vm, getRecentServiceOutputForImage(vm.ImageSpec))
+			if err != nil {
+				t.Fatal(err)
+			}


This block seem to be repeated for each of the health check tests. Let's abstract this by introducing a new function getHealthCheckResultsForImage that calls one of getHealthCheckLogsForUAPPluginByImage or getRecentServiceOutputForImage.

jinghan-ma · 2025-02-12T23:32:53Z

integration_test/third_party_apps_test/main_test.go

+	if err := agents.RestartOpsAgent(ctx, logger.ToMainLog(), vm); err != nil {
+		return nonRetryable, fmt.Errorf("error restarting agent: %v", err)
+	}
+	time.Sleep(60 * time.Second)
+


Why do we need to restart the ops agent here? Also, do we need to sleep for a whole minute?

If you look at the if block right above this one:

if _, err = runScriptFromScriptsDir(ctx, logger.ToMainLog(), vm, path.Join("applications", app, "enable"), nil); err != nil { return nonRetryable, fmt.Errorf("error enabling %s: %v", app, err) }

that block executes what's inside the enable script for each third party app. The enable script populates config.yaml with some custom config. That's why Ops agent needs to be restarted in order to apply the config.

https://github.com/GoogleCloudPlatform/ops-agent/pull/1870/files#diff-db0ec4d7bd3cd0a566b078eb57aa07cee85ca6c3ee560bd68206d4d471db9211 if you look at the changes I made to the enable scripts, you will see that the sleep statements are removed from the script and added here in main_test.go. The sleep seconds range from 0 to 60 seconds, so I decided to sleep for 60 seconds in main_test.go universally.

rafaelwestphal · 2025-02-13T19:44:19Z

integration_test/ops_agent_test/main_test.go

+			t.Fatal(err)
+		}
+
+		logger.Printf("TestPluginGetStatusReturnsUnhealthyStatusOnSubAgentTermination: Found %s", processName)


We can remove this print.

integration_test/ops_agent_test/main_test.go

…Platform/ops-agent into ida-ops-agent-plugin-testing

jinghan-ma · 2025-02-13T21:04:54Z

integration_test/agents/agents.go

+		if gce.IsWindows(imageSpec) {
+			return ""
+		}
+		return fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Stop && grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Start", OpsAgentPluginServerPort, OpsAgentPluginServerPort)


I think it might be better to mimic UAP's operations more closely in our own tests by killing the wrapper and starting a new one?

Not a big issue either way.

jinghan-ma · 2025-02-13T21:07:39Z

integration_test/agents/agents.go

 			return err
 		}
+	} else if IsOpsAgentUAPPlugin() {
+		return RestartOpsAgent(ctx, logger, vm)


In that case, is it possible to be more explicit here and call StartOpsAgentPlugin instead?

jinghan-ma · 2025-02-13T21:23:41Z

integration_test/ops_agent_test/main_test.go

-			t.Fatal(err)
-		}
+		isUAPPlugin := agents.IsOpsAgentUAPPlugin()
+		if isUAPPlugin {


Would it make more sense to skip this test if we're in UAP world? Different from the systemd version, we don't actually expect the UAP plugin to be started automatically after a restart, so the behaviour this test checks for doesn't really apply. Maybe it's less confusing if we explicitly skip this test.

I am unsure what the RestartVM test is actually testing. If it's verifying that the agent installation always works even after a VM restart, then it makes sense to test for the plugin as well.

XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 29, 2025

stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 29, 2025

XuechunHou force-pushed the ida-ops-agent-plugin-testing branch 4 times, most recently from 7767508 to 2d29934 Compare January 31, 2025 18:59

XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 31, 2025

stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 31, 2025

XuechunHou force-pushed the ida-ops-agent-plugin-testing branch 8 times, most recently from 422f1ea to acd10dd Compare February 3, 2025 23:15

XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025

stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025

XuechunHou force-pushed the ida-ops-agent-plugin-testing branch from 30dfac5 to e91d662 Compare February 4, 2025 05:11

XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025

stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025

XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025

stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025

XuechunHou force-pushed the ida-ops-agent-plugin-testing branch 5 times, most recently from dadfb2d to 7b1911f Compare February 5, 2025 03:31

XuechunHou marked this pull request as ready for review February 5, 2025 16:30

XuechunHou self-assigned this Feb 5, 2025

XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 12, 2025

stackdriver-instrumentation-release removed kokoro:force-run Forces kokoro to run integration tests on a CL labels Feb 12, 2025

wait 10 seconds before querying plugin status, this is trying to fix …

2424a08

…the flakes on Rocky linux 8 and 9

XuechunHou requested review from jinghan-ma and rafaelwestphal February 12, 2025 18:45

rafaelwestphal reviewed Feb 12, 2025

View reviewed changes

XuechunHou added 3 commits February 12, 2025 20:17

resolved comments

8909d42

resolved comments

f25647d

resolved comments

e80ce8a

XuechunHou requested a review from rafaelwestphal February 12, 2025 22:10

jinghan-ma requested changes Feb 12, 2025

View reviewed changes

XuechunHou force-pushed the ida-ops-agent-plugin-testing branch 4 times, most recently from bd9d2d3 to e80ce8a Compare February 13, 2025 15:22

XuechunHou added 5 commits February 13, 2025 15:40

moved repeated blocks into a function

5376413

return a list of log files through a function call

01cefc3

directly download the plugin tarball to /tmp/agentPlugin

612accf

Create a seperate function for getting Plugin status

6078a3e

Merge branch 'master' into ida-ops-agent-plugin-testing

fa807e4

XuechunHou requested a review from jinghan-ma February 13, 2025 17:17

rafaelwestphal reviewed Feb 13, 2025

View reviewed changes

rafaelwestphal approved these changes Feb 13, 2025

View reviewed changes

XuechunHou added 2 commits February 13, 2025 19:56

resolved more comments

9ace67e

Merge branch 'ida-ops-agent-plugin-testing' of github.com:GoogleCloud…

c615666

…Platform/ops-agent into ida-ops-agent-plugin-testing

XuechunHou force-pushed the ida-ops-agent-plugin-testing branch from 2f8c5d4 to c615666 Compare February 13, 2025 20:57

jinghan-ma approved these changes Feb 13, 2025

View reviewed changes

XuechunHou merged commit 8d9e723 into master Feb 13, 2025
86 of 91 checks passed

XuechunHou deleted the ida-ops-agent-plugin-testing branch February 13, 2025 23:54

Integration Testing Ops Agent UAP Plugin #1870

Integration Testing Ops Agent UAP Plugin #1870

Uh oh!

Conversation

XuechunHou commented Jan 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related issue

How has this been tested?

Checklist:

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XuechunHou Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XuechunHou Feb 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

XuechunHou commented Jan 29, 2025 •

edited

Loading

XuechunHou Feb 13, 2025 •

edited

Loading

XuechunHou Feb 13, 2025 •

edited

Loading

XuechunHou Feb 13, 2025 •

edited

Loading