-
Notifications
You must be signed in to change notification settings - Fork 34
Registry server: report AlreadyExists errors in updates as Abort to cause them to be retried #715
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This ensures that associated blobs are also stored correctly.
…nd project actions.
… up after the mutating transaction completes.
…ting them as retriable errors.
Codecov Report
@@ Coverage Diff @@
## main #715 +/- ##
==========================================
+ Coverage 56.81% 56.96% +0.15%
==========================================
Files 93 93
Lines 7993 8008 +15
==========================================
+ Hits 4541 4562 +21
+ Misses 3018 3014 -4
+ Partials 434 432 -2
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
| default: | ||
| t.Errorf("UpdateApi(%+v) returned status code %q", test.req, status.Code(err)) | ||
| } | ||
| wg.Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's a sample concurrency test that makes 10 UpdateApi calls concurrently. But instead of triggering the AlreadyExists or Aborted errors that we want, we get Unavailable because the database is locked.
Edit: this was occurring because the tests were running with SQLite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... just to be clear: this test failed before this change was made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When run with Postgres, yes.
31dff39 to
e0dd975
Compare
e0dd975 to
f0640c9
Compare
|
A possibility for skipping long-running tests would be to simply use the |
|
The build is reporting that none of the new code is is covered by tests. |
|
Code coverage measurements are made with SQLite (not Postgres): registry/.github/workflows/go.yml Line 86 in 0a574b1
When I run these tests with SQLite, I get When I run with Postgres (and without the added changes), I get |
b5334a6 to
2641ff7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. It doesn't seem to be an issue with this PR directly, then, but I am concerned about the SQLite Unavailable failures as a separate issue.
|
@theganyo The underlying SQLite errors are "database locked". I suspect that the transactions are locking the SQLite database for both reads and writes, so for the SQLite case, the concurrency test added here just verifies that we get "Unavailable" (retryable errors) when we try to make concurrent updates. Postgres seems to be giving us finer access, at least allowing reads while writes are taking place. https://www.tutorialspoint.com/postgresql/postgresql_locks.htm#:~:text=Locks%20or%20Exclusive%20Locks%20or,either%20committed%20or%20rolled%20back. |
|
@theganyo This might allow us to block reads during Postgres transactions - https://gorm.io/docs/advanced_query.html#Locking-FOR-UPDATE - that would eliminate the race condition being addressed here. (Ideally this would be a row lock and not a table lock, right now I'm not sure which it is.) |
|
The original issue of "when two updates are made to the same non-existent resource" seems like it would be a quite rare case. Is it worth trying to do more around that? |
@theganyo Good question. Let's say no for now and discuss with @seaneganx later |
As discussed here, #700 removed a mutex with this associated comment:
This commit addresses the underlying problem by detecting when it occurs and returning the gRPC
Abortedstatus code, which client libraries are configured to retry. This client retry avoids slowing the common case while still allowing for correct handling.The effect of this is observable in the bulk uploaders, e.g. here which explicitly ignores the
AlreadyExistserrors that can be returned when two callers concurrently try to create the same API usingUpdateApicalls withallow missingset to true.With the change, those errors are reported as Aborted and are retried by the client libraries (because we configured this). The modified bulk uploader code that explicitly ignores
AlreadyExistsis not reached.How can we test this? It is observable by running
registry upload bulk discoverywith a high--jobscount (e.g. 50). However this has an external service dependency (it calls the API discovery service), and it seems overly-slow. A more direct test might be to run a large number of goroutines (maybe 10?) that concurrently try to create the same API (or version, spec, deployment, or artifact) usingUpdatewithallow missing. These should returnAbortedor no error.Let's consider making concurrency tests a separate group of tests that are run outside of the core server suite (perhaps like benchmarks). This would reduce the load of functional verification (as done by the core) and would increase focus on the kinds of concurrency situations that we are checking.