Skip to content

Conversation

@XuechunHou
Copy link
Contributor

@XuechunHou XuechunHou commented Jan 29, 2025

Description

  1. This PR updated the existing Ops agent integration tests and third party tests, making them compatible with the Ops Agent plugin.
  2. This PR also added a few integration tests for the Ops Agent Plugin, exclusively testing its GetStatus method.

Related issue

b/381246351

How has this been tested?

Ops Agent integration tests should run twice per OS distro and architecture, once for the regular ops agent, once for the plugin. Both runs should pass.

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@XuechunHou XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 29, 2025
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 29, 2025
@XuechunHou XuechunHou force-pushed the ida-ops-agent-plugin-testing branch 4 times, most recently from 7767508 to 2d29934 Compare January 31, 2025 18:59
@XuechunHou XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 31, 2025
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Jan 31, 2025
@XuechunHou XuechunHou force-pushed the ida-ops-agent-plugin-testing branch 8 times, most recently from 422f1ea to acd10dd Compare February 3, 2025 23:15
@XuechunHou XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025
@XuechunHou XuechunHou force-pushed the ida-ops-agent-plugin-testing branch from 30dfac5 to e91d662 Compare February 4, 2025 05:11
@XuechunHou XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025
@XuechunHou XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 4, 2025
@XuechunHou XuechunHou force-pushed the ida-ops-agent-plugin-testing branch 5 times, most recently from dadfb2d to 7b1911f Compare February 5, 2025 03:31
@XuechunHou XuechunHou marked this pull request as ready for review February 5, 2025 16:30
@XuechunHou XuechunHou self-assigned this Feb 5, 2025
@XuechunHou XuechunHou added the kokoro:force-run Forces kokoro to run integration tests on a CL label Feb 12, 2025
@stackdriver-instrumentation-release stackdriver-instrumentation-release removed kokoro:force-run Forces kokoro to run integration tests on a CL labels Feb 12, 2025
gce.RunRemotely(ctx, logger.ToFile("journalctl_output.txt"), vm, "sudo journalctl -xe")
if isUAPPlugin {
for _, log := range []string{
gce.SyslogLocation(vm.ImageSpec),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we refactor this code avoid the duplication of the files list? It will make it easier in the feature when we need to remove/add/replace anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comments. You're correct, they are the same set of files. However, they have different paths, and I haven't yet found a good way to generalize them. Even within the Ops Agent plugin itself, not all of these files are located in the same directories.

Genmini suggested something like the following:

filePaths := map[string]string{
                "health-checks.log":             "/var/log/google-cloud-ops-agent/health-checks.log",
                "config.yaml":                    "/etc/google-cloud-ops-agent/config.yaml",
                "subagents/logging-module.log":   "/var/log/google-cloud-ops-agent/subagents/logging-module.log",
                "subagents/metrics-module.log":   "/var/log/google-cloud-ops-agent/subagents/metrics-module.log",
                "nvidia-installer.log":           "/var/log/nvidia-installer.log",
                "fluent_bit_main.conf":          "/run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf",
                "fluent_bit_parser.conf":        "/run/google-cloud-ops-agent-fluent-bit/fluent_bit_parser.conf",
                "otel.yaml":                      "/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml",
        }

        if isUAPPlugin {
                for file, filePath := range filePaths {
                        filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin", filePath)
                        // Adjust for run and log directories.
                        if file == "fluent_bit_main.conf" || file == "fluent_bit_parser.conf" || file == "otel.yaml" {
                                filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin/run", path.Base(filePath))
                        } else if file == "health-checks.log" || file == "subagents/logging-module.log" || file == "subagents/metrics-module.log" || file == "nvidia-installer.log" {
                                filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin/log", path.Base(filePath))
                        } else if file == "config.yaml" {
                                filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin", path.Base(filePath))
                        }
                }
        }

However, that adds too many if-else statements, and it also reduces the readability of the file paths for the plugin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extract the 2 lists of files into a separate function like getLogFileList, which would return one of the 2 lists based on isUAPPlugin. This way, the if statement gets moved to the new function, so the current function is easier to read. Also, since the new function's only responsibility is returning the list of files, it's easy to change for both if we need to modify them later (and harder to miss).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

gce.RunRemotely(ctx, logger.ToFile("systemctl_status_for_ops_agent.txt"), vm, "sudo systemctl status google-cloud-ops-agent*")
isUAPPlugin := IsOpsAgentUAPPlugin()
if isUAPPlugin {
gce.RunRemotely(ctx, logger.ToFile("systemctl_status_for_ops_agent.txt"), vm, fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/GetStatus", OpsAgentPluginServerPort))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a different file name? Since we're not using systemd calling it systemctl_status seems a bit weird.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

gce.RunRemotely(ctx, logger.ToFile("journalctl_output.txt"), vm, "sudo journalctl -xe")
if isUAPPlugin {
for _, log := range []string{
gce.SyslogLocation(vm.ImageSpec),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we extract the 2 lists of files into a separate function like getLogFileList, which would return one of the 2 lists based on isUAPPlugin. This way, the if statement gets moved to the new function, so the current function is easier to read. Also, since the new function's only responsibility is returning the list of files, it's easy to change for both if we need to modify them later (and harder to miss).

if gce.IsWindows(imageSpec) {
return ""
}
return fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Stop && grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Start", OpsAgentPluginServerPort, OpsAgentPluginServerPort)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would UAP ever send a Start signal to a plugin that has already received a Stop signal? Or will it start the plugin on an entirely different process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is not interacting with UAP at all, but we've asked this question to the UAP team and the answer is no.

"Or will it start the plugin on an entirely different process?" process or port? I think both are possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to mimic UAP's operations more closely in our own tests by killing the wrapper and starting a new one?

Not a big issue either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re-starting the plugin and re-starting the Ops Agent through the UAP plugin is different. Here I am explicitly re-starting the Ops Agent itself. In the UAP world, this is done by sending a Stop and then a Start gRPC request to the plugin gRPC server.

Comment on lines 951 to 956
if _, err := gce.RunRemotely(ctx, logger, vm, "sudo gsutil cp -r "+gcsPath+"/* /tmp/agentUpload"); err != nil {
return fmt.Errorf("error copying down agent package from GCS: %v", err)
}
if _, err := gce.RunRemotely(ctx, logger, vm, "mv /tmp/agentUpload/*.tar.gz /tmp/agentPlugin"); err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why we don't directly copy to /tmp/agentPlugin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return err
}
} else if IsOpsAgentUAPPlugin() {
return RestartOpsAgent(ctx, logger, vm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to restart the ops agent plugin here specifically?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not like the regular ops agent, the agent plugin is not started automatically after installation. It needs to be explicitly started.

The point of callingRestartOpsAgent function is to trigger a start of the agent.

When len(config)> 0, the regular agent needs to be re-started, and the plugin agent needs to be started to apply the custom config.

When len(config) ==0, only the plugin agent needs to be started.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, is it possible to be more explicit here and call StartOpsAgentPlugin instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StartOpsAgentPlugin and RestartOpsAgent do different things for the agent plugin; StartOpsAgentPlugin starts the plugin grpc server itself. However, RestartOpsAgent sends a gRPC request to the plugin's Start() endpoint, to start the Ops Agent.
In this case the plugin gRPC server is live already, I only need to start the Ops Agent.

if IsARM(vm.ImageSpec) {
arch = "arm64"
}
if arch != "x86_64" && arch != "arm64" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how we define arch above, will we ever go into this if statement?

Copy link
Contributor Author

@XuechunHou XuechunHou Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, updated

Comment on lines 4556 to 4561
if agents.IsOpsAgentUAPPlugin() {
if gce.IsWindows(imageSpec) {
return ""
}
return "grpcurl -plaintext -d '{}' localhost:1234 plugin_comm.GuestAgentPlugin/GetStatus"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes sense for us to decouple the UAP status call from ServiceOutput. They currently are somewhat similar, but we expect them to evolve in very different ways -- so the way we test for it will also start to deviate.

I feel like it makes it clearer if we remove the abstraction and explicitly call something like GetPluginStatus in many places, so that it is clear to the reader that only UAP has this behaviour. Examples of this includes the current check for an Unhealthy status when plugin is stopped, or future tests such as checking for error codes (which is exclusive to UAP GetStatus).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the root of the problem is that there's a layer of abstraction missing here.
Currently, getRecentServiceOutputForImage is actually overloaded to perform two jobs:

  1. Get service status (i.e. check if ops agent service is running), and
  2. Get health check results.

These two happen to come from the same place when the Ops Agent is running as a service. So we can just call getRecentServiceOutputForImage and check both.

But the plugin version of the Ops Agent separates these two (more specifically, health checks is directly logged to a file). So it doesn't make sense having such a function for the UAP version.

I think we can make a quick change to fix this: Introduce 2 new functions. getServiceStatusForImage and getHealthCheckResultsForImage (this one is also mentioned in the comment below).

For regular Ops Agent, these functions will still call getRecentServiceOutputForImage. For UAP, it'll call the appropriate UAP equivalent.

Then, for each of the existing calls to getRecentServiceOutputForImage, replace it with one of the new functions above based on what exactly we are checking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6078a3e Update PTAL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getRecentServiceOutputForImage does different things for each kind of the Ops Agent:

For the regular agent: it retrieves health check results from systemctl status
For the agent plugin: it retrieves the response of the GetStatus call.

In my commit 6078a3e, a new function getUAPPluginStatusForImage is added to run exclusively for the agent plugin to retrieve the gRPC response of the GetStatus call.

getRecentServiceOutputForImage is simplified to work only for the regular ops agent.

WDYT? Thanks

cmdOut, err = gce.RunRemotely(ctx, logger, vm, getRecentServiceOutputForImage(vm.ImageSpec))
if err != nil {
t.Fatal(err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block seem to be repeated for each of the health check tests. Let's abstract this by introducing a new function getHealthCheckResultsForImage that calls one of getHealthCheckLogsForUAPPluginByImage or getRecentServiceOutputForImage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +598 to +602
if err := agents.RestartOpsAgent(ctx, logger.ToMainLog(), vm); err != nil {
return nonRetryable, fmt.Errorf("error restarting agent: %v", err)
}
time.Sleep(60 * time.Second)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to restart the ops agent here? Also, do we need to sleep for a whole minute?

Copy link
Contributor Author

@XuechunHou XuechunHou Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at the if block right above this one:

if _, err = runScriptFromScriptsDir(ctx, logger.ToMainLog(), vm, path.Join("applications", app, "enable"), nil); err != nil {
		return nonRetryable, fmt.Errorf("error enabling %s: %v", app, err)
	}

that block executes what's inside the enable script for each third party app. The enable script populates config.yaml with some custom config. That's why Ops agent needs to be restarted in order to apply the config.

https://github.com/GoogleCloudPlatform/ops-agent/pull/1870/files#diff-db0ec4d7bd3cd0a566b078eb57aa07cee85ca6c3ee560bd68206d4d471db9211 if you look at the changes I made to the enable scripts, you will see that the sleep statements are removed from the script and added here in main_test.go. The sleep seconds range from 0 to 60 seconds, so I decided to sleep for 60 seconds in main_test.go universally.

@XuechunHou XuechunHou force-pushed the ida-ops-agent-plugin-testing branch 4 times, most recently from bd9d2d3 to e80ce8a Compare February 13, 2025 15:22
t.Fatal(err)
}

logger.Printf("TestPluginGetStatusReturnsUnhealthyStatusOnSubAgentTermination: Found %s", processName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this print.

@XuechunHou XuechunHou force-pushed the ida-ops-agent-plugin-testing branch from 2f8c5d4 to c615666 Compare February 13, 2025 20:57
if gce.IsWindows(imageSpec) {
return ""
}
return fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Stop && grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Start", OpsAgentPluginServerPort, OpsAgentPluginServerPort)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to mimic UAP's operations more closely in our own tests by killing the wrapper and starting a new one?

Not a big issue either way.

return err
}
} else if IsOpsAgentUAPPlugin() {
return RestartOpsAgent(ctx, logger, vm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, is it possible to be more explicit here and call StartOpsAgentPlugin instead?

t.Fatal(err)
}
isUAPPlugin := agents.IsOpsAgentUAPPlugin()
if isUAPPlugin {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to skip this test if we're in UAP world? Different from the systemd version, we don't actually expect the UAP plugin to be started automatically after a restart, so the behaviour this test checks for doesn't really apply. Maybe it's less confusing if we explicitly skip this test.

Copy link
Contributor Author

@XuechunHou XuechunHou Feb 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure what the RestartVM test is actually testing. If it's verifying that the agent installation always works even after a VM restart, then it makes sense to test for the plugin as well.

@XuechunHou XuechunHou merged commit 8d9e723 into master Feb 13, 2025
86 of 91 checks passed
@XuechunHou XuechunHou deleted the ida-ops-agent-plugin-testing branch February 13, 2025 23:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants