-
Notifications
You must be signed in to change notification settings - Fork 77
Integration Testing Ops Agent UAP Plugin #1870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7767508 to
2d29934
Compare
422f1ea to
acd10dd
Compare
30dfac5 to
e91d662
Compare
dadfb2d to
7b1911f
Compare
…the flakes on Rocky linux 8 and 9
integration_test/third_party_apps_test/applications/aerospike/enable
Outdated
Show resolved
Hide resolved
integration_test/third_party_apps_test/applications/apache/enable
Outdated
Show resolved
Hide resolved
integration_test/agents/agents.go
Outdated
| gce.RunRemotely(ctx, logger.ToFile("journalctl_output.txt"), vm, "sudo journalctl -xe") | ||
| if isUAPPlugin { | ||
| for _, log := range []string{ | ||
| gce.SyslogLocation(vm.ImageSpec), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we refactor this code avoid the duplication of the files list? It will make it easier in the feature when we need to remove/add/replace anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comments. You're correct, they are the same set of files. However, they have different paths, and I haven't yet found a good way to generalize them. Even within the Ops Agent plugin itself, not all of these files are located in the same directories.
Genmini suggested something like the following:
filePaths := map[string]string{
"health-checks.log": "/var/log/google-cloud-ops-agent/health-checks.log",
"config.yaml": "/etc/google-cloud-ops-agent/config.yaml",
"subagents/logging-module.log": "/var/log/google-cloud-ops-agent/subagents/logging-module.log",
"subagents/metrics-module.log": "/var/log/google-cloud-ops-agent/subagents/metrics-module.log",
"nvidia-installer.log": "/var/log/nvidia-installer.log",
"fluent_bit_main.conf": "/run/google-cloud-ops-agent-fluent-bit/fluent_bit_main.conf",
"fluent_bit_parser.conf": "/run/google-cloud-ops-agent-fluent-bit/fluent_bit_parser.conf",
"otel.yaml": "/run/google-cloud-ops-agent-opentelemetry-collector/otel.yaml",
}
if isUAPPlugin {
for file, filePath := range filePaths {
filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin", filePath)
// Adjust for run and log directories.
if file == "fluent_bit_main.conf" || file == "fluent_bit_parser.conf" || file == "otel.yaml" {
filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin/run", path.Base(filePath))
} else if file == "health-checks.log" || file == "subagents/logging-module.log" || file == "subagents/metrics-module.log" || file == "nvidia-installer.log" {
filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin/log", path.Base(filePath))
} else if file == "config.yaml" {
filePaths[file] = path.Join("/var/lib/google-guest-agent/agent_state/plugins/ops-agent-plugin", path.Base(filePath))
}
}
}
However, that adds too many if-else statements, and it also reduces the readability of the file paths for the plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extract the 2 lists of files into a separate function like getLogFileList, which would return one of the 2 lists based on isUAPPlugin. This way, the if statement gets moved to the new function, so the current function is easier to read. Also, since the new function's only responsibility is returning the list of files, it's easy to change for both if we need to modify them later (and harder to miss).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
integration_test/third_party_apps_test/applications/aerospike/enable
Outdated
Show resolved
Hide resolved
integration_test/third_party_apps_test/applications/apache/enable
Outdated
Show resolved
Hide resolved
integration_test/agents/agents.go
Outdated
| gce.RunRemotely(ctx, logger.ToFile("systemctl_status_for_ops_agent.txt"), vm, "sudo systemctl status google-cloud-ops-agent*") | ||
| isUAPPlugin := IsOpsAgentUAPPlugin() | ||
| if isUAPPlugin { | ||
| gce.RunRemotely(ctx, logger.ToFile("systemctl_status_for_ops_agent.txt"), vm, fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/GetStatus", OpsAgentPluginServerPort)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a different file name? Since we're not using systemd calling it systemctl_status seems a bit weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
integration_test/agents/agents.go
Outdated
| gce.RunRemotely(ctx, logger.ToFile("journalctl_output.txt"), vm, "sudo journalctl -xe") | ||
| if isUAPPlugin { | ||
| for _, log := range []string{ | ||
| gce.SyslogLocation(vm.ImageSpec), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we extract the 2 lists of files into a separate function like getLogFileList, which would return one of the 2 lists based on isUAPPlugin. This way, the if statement gets moved to the new function, so the current function is easier to read. Also, since the new function's only responsibility is returning the list of files, it's easy to change for both if we need to modify them later (and harder to miss).
| if gce.IsWindows(imageSpec) { | ||
| return "" | ||
| } | ||
| return fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Stop && grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Start", OpsAgentPluginServerPort, OpsAgentPluginServerPort) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would UAP ever send a Start signal to a plugin that has already received a Stop signal? Or will it start the plugin on an entirely different process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is not interacting with UAP at all, but we've asked this question to the UAP team and the answer is no.
"Or will it start the plugin on an entirely different process?" process or port? I think both are possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be better to mimic UAP's operations more closely in our own tests by killing the wrapper and starting a new one?
Not a big issue either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-starting the plugin and re-starting the Ops Agent through the UAP plugin is different. Here I am explicitly re-starting the Ops Agent itself. In the UAP world, this is done by sending a Stop and then a Start gRPC request to the plugin gRPC server.
integration_test/agents/agents.go
Outdated
| if _, err := gce.RunRemotely(ctx, logger, vm, "sudo gsutil cp -r "+gcsPath+"/* /tmp/agentUpload"); err != nil { | ||
| return fmt.Errorf("error copying down agent package from GCS: %v", err) | ||
| } | ||
| if _, err := gce.RunRemotely(ctx, logger, vm, "mv /tmp/agentUpload/*.tar.gz /tmp/agentPlugin"); err != nil { | ||
| return err | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we don't directly copy to /tmp/agentPlugin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| return err | ||
| } | ||
| } else if IsOpsAgentUAPPlugin() { | ||
| return RestartOpsAgent(ctx, logger, vm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to restart the ops agent plugin here specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not like the regular ops agent, the agent plugin is not started automatically after installation. It needs to be explicitly started.
The point of callingRestartOpsAgent function is to trigger a start of the agent.
When len(config)> 0, the regular agent needs to be re-started, and the plugin agent needs to be started to apply the custom config.
When len(config) ==0, only the plugin agent needs to be started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, is it possible to be more explicit here and call StartOpsAgentPlugin instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StartOpsAgentPlugin and RestartOpsAgent do different things for the agent plugin; StartOpsAgentPlugin starts the plugin grpc server itself. However, RestartOpsAgent sends a gRPC request to the plugin's Start() endpoint, to start the Ops Agent.
In this case the plugin gRPC server is live already, I only need to start the Ops Agent.
integration_test/gce/gce_testing.go
Outdated
| if IsARM(vm.ImageSpec) { | ||
| arch = "arm64" | ||
| } | ||
| if arch != "x86_64" && arch != "arm64" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how we define arch above, will we ever go into this if statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, updated
| if agents.IsOpsAgentUAPPlugin() { | ||
| if gce.IsWindows(imageSpec) { | ||
| return "" | ||
| } | ||
| return "grpcurl -plaintext -d '{}' localhost:1234 plugin_comm.GuestAgentPlugin/GetStatus" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it makes sense for us to decouple the UAP status call from ServiceOutput. They currently are somewhat similar, but we expect them to evolve in very different ways -- so the way we test for it will also start to deviate.
I feel like it makes it clearer if we remove the abstraction and explicitly call something like GetPluginStatus in many places, so that it is clear to the reader that only UAP has this behaviour. Examples of this includes the current check for an Unhealthy status when plugin is stopped, or future tests such as checking for error codes (which is exclusive to UAP GetStatus).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the root of the problem is that there's a layer of abstraction missing here.
Currently, getRecentServiceOutputForImage is actually overloaded to perform two jobs:
- Get service status (i.e. check if ops agent service is running), and
- Get health check results.
These two happen to come from the same place when the Ops Agent is running as a service. So we can just call getRecentServiceOutputForImage and check both.
But the plugin version of the Ops Agent separates these two (more specifically, health checks is directly logged to a file). So it doesn't make sense having such a function for the UAP version.
I think we can make a quick change to fix this: Introduce 2 new functions. getServiceStatusForImage and getHealthCheckResultsForImage (this one is also mentioned in the comment below).
For regular Ops Agent, these functions will still call getRecentServiceOutputForImage. For UAP, it'll call the appropriate UAP equivalent.
Then, for each of the existing calls to getRecentServiceOutputForImage, replace it with one of the new functions above based on what exactly we are checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6078a3e Update PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getRecentServiceOutputForImage does different things for each kind of the Ops Agent:
For the regular agent: it retrieves health check results from systemctl status
For the agent plugin: it retrieves the response of the GetStatus call.
In my commit 6078a3e, a new function getUAPPluginStatusForImage is added to run exclusively for the agent plugin to retrieve the gRPC response of the GetStatus call.
getRecentServiceOutputForImage is simplified to work only for the regular ops agent.
WDYT? Thanks
| cmdOut, err = gce.RunRemotely(ctx, logger, vm, getRecentServiceOutputForImage(vm.ImageSpec)) | ||
| if err != nil { | ||
| t.Fatal(err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block seem to be repeated for each of the health check tests. Let's abstract this by introducing a new function getHealthCheckResultsForImage that calls one of getHealthCheckLogsForUAPPluginByImage or getRecentServiceOutputForImage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| if err := agents.RestartOpsAgent(ctx, logger.ToMainLog(), vm); err != nil { | ||
| return nonRetryable, fmt.Errorf("error restarting agent: %v", err) | ||
| } | ||
| time.Sleep(60 * time.Second) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to restart the ops agent here? Also, do we need to sleep for a whole minute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you look at the if block right above this one:
if _, err = runScriptFromScriptsDir(ctx, logger.ToMainLog(), vm, path.Join("applications", app, "enable"), nil); err != nil {
return nonRetryable, fmt.Errorf("error enabling %s: %v", app, err)
}
that block executes what's inside the enable script for each third party app. The enable script populates config.yaml with some custom config. That's why Ops agent needs to be restarted in order to apply the config.
https://github.com/GoogleCloudPlatform/ops-agent/pull/1870/files#diff-db0ec4d7bd3cd0a566b078eb57aa07cee85ca6c3ee560bd68206d4d471db9211 if you look at the changes I made to the enable scripts, you will see that the sleep statements are removed from the script and added here in main_test.go. The sleep seconds range from 0 to 60 seconds, so I decided to sleep for 60 seconds in main_test.go universally.
bd9d2d3 to
e80ce8a
Compare
| t.Fatal(err) | ||
| } | ||
|
|
||
| logger.Printf("TestPluginGetStatusReturnsUnhealthyStatusOnSubAgentTermination: Found %s", processName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this print.
…Platform/ops-agent into ida-ops-agent-plugin-testing
2f8c5d4 to
c615666
Compare
| if gce.IsWindows(imageSpec) { | ||
| return "" | ||
| } | ||
| return fmt.Sprintf("grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Stop && grpcurl -plaintext -d '{}' localhost:%s plugin_comm.GuestAgentPlugin/Start", OpsAgentPluginServerPort, OpsAgentPluginServerPort) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be better to mimic UAP's operations more closely in our own tests by killing the wrapper and starting a new one?
Not a big issue either way.
| return err | ||
| } | ||
| } else if IsOpsAgentUAPPlugin() { | ||
| return RestartOpsAgent(ctx, logger, vm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, is it possible to be more explicit here and call StartOpsAgentPlugin instead?
| t.Fatal(err) | ||
| } | ||
| isUAPPlugin := agents.IsOpsAgentUAPPlugin() | ||
| if isUAPPlugin { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense to skip this test if we're in UAP world? Different from the systemd version, we don't actually expect the UAP plugin to be started automatically after a restart, so the behaviour this test checks for doesn't really apply. Maybe it's less confusing if we explicitly skip this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure what the RestartVM test is actually testing. If it's verifying that the agent installation always works even after a VM restart, then it makes sense to test for the plugin as well.
Description
GetStatusmethod.Related issue
b/381246351
How has this been tested?
Ops Agent integration tests should run twice per OS distro and architecture, once for the regular ops agent, once for the plugin. Both runs should pass.
Checklist: