-
Notifications
You must be signed in to change notification settings - Fork 77
Fixed a Race Condition in the UAP Plugin Implementation #1874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this!
| if foundConflictingInstallations || err != nil { | ||
| ps.cancel() | ||
| ps.cancel = nil | ||
| ps.Stop(ctx, &pb.StopRequest{Cleanup: false}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is ps.Stop() substituting ps.cancel() and ps.cancel = nil ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because ps.Stop() literally does the same thing. Since a lock must be obtained before accessing the cancel function, extra lock and unlock statements would need to be added everywhere that calls ps.cancel() and ps.cancel = nil. Because ps.Stop() achieves the same result, I decided to replace the calls to ps.cancel() and ps.cancel = nil with a call to ps.Stop().
| ps.mu.Lock() | ||
| if ps.cancel != nil { | ||
| log.Printf("The Ops Agent plugin is started already, skipping the current request") | ||
| ps.mu.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use defer ps.mu.Unlock() the same as in Stop ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the mutex unlock on L78 should not wait until the Start() method is about to return. Start() is a long function, unlock Stop() and GetStatus()
The only thing Stop() and GetStatus() do is to access the field cancel, that's not the case with Start(). Start does additional stuff like: run config generator, start up subagents etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you for all the explanations!
Description
This PR identifies and fixes a race condition in the Ops Agent UAP plugin implementation. The setting (read) and resetting (write) of the
cancelfunction in theStart()andStop()methods should be atomic and protected by a mutex, which has been added in this PR.Related issue
b/380277488
How has this been tested?
go test -mod=mod -coverpkg="./..." -coverprofile=covprofile ./...command triggered in the presubmit flakes due to this race condition.Checklist: