@XuechunHou XuechunHou commented Feb 3, 2025

Description

This PR identifies and fixes a race condition in the Ops Agent UAP plugin implementation. The Start() and Stop() methods check, set, and reset the cancel function; those accesses should be atomic, so this PR adds a mutex to protect them.
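
For readers unfamiliar with the pattern, the fix described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the pluginServer type, its field names, and the argument-free Start()/Stop() signatures are assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// pluginServer is a hypothetical stand-in for the plugin's server type.
type pluginServer struct {
	mu     sync.Mutex         // guards cancel
	cancel context.CancelFunc // non-nil while the plugin is running
}

// Start reports whether this call started the plugin. The nil check and the
// assignment happen under the same lock, so only one caller can win.
func (ps *pluginServer) Start() bool {
	ps.mu.Lock()
	if ps.cancel != nil {
		ps.mu.Unlock()
		return false // already started
	}
	// The context itself is unused in this sketch; real code would pass it on.
	_, cancel := context.WithCancel(context.Background())
	ps.cancel = cancel
	ps.mu.Unlock()
	return true
}

// Stop cancels the running plugin, if any, and resets cancel under the lock.
func (ps *pluginServer) Stop() {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	if ps.cancel != nil {
		ps.cancel()
		ps.cancel = nil
	}
}

func main() {
	ps := &pluginServer{}
	const n = 8
	results := make(chan bool, n)
	for i := 0; i < n; i++ {
		go func() { results <- ps.Start() }()
	}
	wins := 0
	for i := 0; i < n; i++ {
		if <-results {
			wins++
		}
	}
	fmt.Printf("%d of %d concurrent Start calls won\n", wins, n) // 1 of 8
	ps.Stop()
}
```

Without the mutex, the same program would be expected to trip the Go race detector (go test -race or go run -race), because several goroutines read and write the cancel field concurrently.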

Related issue

b/380277488

How has this been tested?

The presubmit runs go test -mod=mod -coverpkg="./..." -coverprofile=covprofile ./..., which flakes because of this race condition.

Checklist:

  • Unit tests
    • Unit tests do not apply.
    • Unit tests have been added/modified and passed for this PR.
  • Integration tests
    • Integration tests do not apply.
    • Integration tests have been added/modified and passed for this PR.
  • Documentation
    • This PR introduces no user visible changes.
    • This PR introduces user visible changes and the corresponding documentation change has been made.
  • Minor version bump
    • This PR introduces no new features.
    • This PR introduces new features, and there is a separate PR to bump the minor version since the last release already.
    • This PR bumps the version.

@rafaelwestphal rafaelwestphal left a comment

Thanks for looking into this!

  if foundConflictingInstallations || err != nil {
-     ps.cancel()
-     ps.cancel = nil
+     ps.Stop(ctx, &pb.StopRequest{Cleanup: false})

Why is ps.Stop() replacing ps.cancel() and ps.cancel = nil?

Contributor Author

Because ps.Stop() does the same thing. A lock must now be held before accessing the cancel function, so keeping the direct calls would mean adding extra lock and unlock statements at every site that calls ps.cancel() and sets ps.cancel = nil. Since ps.Stop() achieves the same result, I replaced those calls with a call to ps.Stop().
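
As a rough illustration of this reasoning (a sketch under assumptions, not the PR's code): if Stop() already takes the lock, cancels, and resets the field, a failure path in Start() can reuse it rather than repeating the locking. The pluginServer type, the preflight callback, the errConflict error, and the running() helper below are hypothetical names introduced for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// errConflict is a hypothetical error standing in for a failed startup check.
var errConflict = errors.New("conflicting installation found")

// pluginServer is a hypothetical stand-in for the plugin's server type.
type pluginServer struct {
	mu     sync.Mutex         // guards cancel
	cancel context.CancelFunc // non-nil while the plugin is running
}

// Stop is the single place that cancels and resets cancel under the lock.
func (ps *pluginServer) Stop() {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	if ps.cancel != nil {
		ps.cancel()
		ps.cancel = nil
	}
}

// running reports whether cancel is currently set.
func (ps *pluginServer) running() bool {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	return ps.cancel != nil
}

// Start marks the plugin as started, then runs the preflight check.
func (ps *pluginServer) Start(preflight func() error) error {
	ps.mu.Lock()
	if ps.cancel != nil {
		ps.mu.Unlock()
		return errors.New("already started")
	}
	// The context itself is unused in this sketch; real code would pass it on.
	_, cancel := context.WithCancel(context.Background())
	ps.cancel = cancel
	ps.mu.Unlock()

	if err := preflight(); err != nil {
		// Reuse Stop instead of locking here and repeating cancel/reset.
		// The lock must not be held at this point: sync.Mutex is not
		// reentrant, so calling Stop while holding mu would deadlock.
		ps.Stop()
		return err
	}
	return nil
}

func main() {
	ps := &pluginServer{}
	err := ps.Start(func() error { return errConflict })
	fmt.Println(err, ps.running())
	err = ps.Start(func() error { return nil })
	fmt.Println(err, ps.running())
	ps.Stop()
}
```

One consequence of this design is that the failure path only works if Start() has already released the mutex before calling Stop().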

ps.mu.Lock()
if ps.cancel != nil {
    log.Printf("The Ops Agent plugin is started already, skipping the current request")
    ps.mu.Unlock()

@franciscovalentecastro franciscovalentecastro Feb 3, 2025

Why not use defer ps.mu.Unlock(), the same as in Stop()?

Contributor Author

Because the mutex unlock on L78 should not wait until the Start() method is about to return. Start() is a long function, unlike Stop() and GetStatus().

The only thing Stop() and GetStatus() do is access the cancel field. That is not the case with Start(), which does additional work such as running the config generator and starting up the subagents.
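
The difference between an explicit unlock and a deferred one can be sketched as below. This is an illustration under assumptions, not the plugin's code: the pluginServer type, the boolean return values, and the longWork callback (standing in for the config generator and subagent startup) are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// pluginServer is a hypothetical stand-in for the plugin's server type.
type pluginServer struct {
	mu     sync.Mutex         // guards cancel
	cancel context.CancelFunc // non-nil while the plugin is running
}

// Start holds the lock only while checking and setting cancel. It unlocks
// explicitly before longWork; a deferred unlock would keep the lock held
// until Start returns and block Stop and GetStatus for that whole time.
func (ps *pluginServer) Start(longWork func(done <-chan struct{})) bool {
	ps.mu.Lock()
	if ps.cancel != nil {
		ps.mu.Unlock()
		return false // already started, skip this request
	}
	ctx, cancel := context.WithCancel(context.Background())
	ps.cancel = cancel
	ps.mu.Unlock()

	longWork(ctx.Done()) // long-running part, executed without the lock
	return true
}

// Stop only touches cancel, so a deferred unlock is fine here.
func (ps *pluginServer) Stop() {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	if ps.cancel != nil {
		ps.cancel()
		ps.cancel = nil
	}
}

// GetStatus only reads cancel, so a deferred unlock is fine here too.
func (ps *pluginServer) GetStatus() bool {
	ps.mu.Lock()
	defer ps.mu.Unlock()
	return ps.cancel != nil
}

func main() {
	ps := &pluginServer{}
	started := make(chan struct{})
	finished := make(chan struct{})
	go func() {
		ps.Start(func(done <-chan struct{}) {
			close(started)
			<-done // simulate long-running work until Stop cancels it
		})
		close(finished)
	}()
	<-started
	// Start has not returned yet, but GetStatus and Stop are not blocked.
	fmt.Println("running:", ps.GetStatus())
	ps.Stop()
	<-finished
	fmt.Println("running:", ps.GetStatus())
}
```

With defer ps.mu.Unlock() in Start(), the GetStatus() call in main would block until the simulated work finished, and since that work only ends when Stop() cancels it, this particular program would deadlock.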

@franciscovalentecastro franciscovalentecastro left a comment

LGTM! Thank you for all the explanations!

@XuechunHou XuechunHou merged commit 21078a0 into master Feb 3, 2025
46 of 66 checks passed
@XuechunHou XuechunHou deleted the fix-race-condition branch February 3, 2025 21:49