Skip to content

Conversation

@timburks
Copy link
Contributor

@timburks timburks commented Aug 31, 2022

As discussed here, #700 removed a mutex with this associated comment:

// Prevent a race condition that can occur when two updates are made
// to the same non-existent resource. The db.Get...() call returns
// NotFound for both updates, and after one creates the resource,
// the other creation fails. The lock() prevents this by serializing
// the get and create operations. Future updates could improve this
// with improvements closer to the database level.

This commit addresses the underlying problem by detecting when it occurs and returning the gRPC Aborted status code, which client libraries are configured to retry. This client retry avoids slowing the common case while still allowing for correct handling.

The effect of this is observable in the bulk uploaders, e.g. here which explicitly ignores the AlreadyExists errors that can be returned when two callers concurrently try to create the same API using UpdateApi calls with allow missing set to true.

With the change, those errors are reported as Aborted and are retried by the client libraries (because we configured this). The modified bulk uploader code that explicitly ignores AlreadyExists is not reached.

How can we test this? It is observable by running registry upload bulk discovery with a high --jobs count (e.g. 50). However this has an external service dependency (it calls the API discovery service), and it seems overly-slow. A more direct test might be to run a large number of goroutines (maybe 10?) that concurrently try to create the same API (or version, spec, deployment, or artifact) using Update with allow missing. These should return Aborted or no error.

Let's consider making concurrency tests a separate group of tests that are run outside of the core server suite (perhaps like benchmarks). This would reduce the load of functional verification (as done by the core) and would increase focus on the kinds of concurrency situations that we are checking.

This ensures that associated blobs are also stored correctly.
… up after the mutating transaction completes.
@timburks timburks changed the title Report "already exists" errors in updates as with "Abort" to cause them to be retried Report "already exists" errors in updates as "Abort" to cause them to be retried Aug 31, 2022
@timburks timburks changed the title Report "already exists" errors in updates as "Abort" to cause them to be retried Registry server: report AlreadyExists errors in updates as Abort to cause them to be retried Aug 31, 2022
@codecov
Copy link

codecov bot commented Aug 31, 2022

Codecov Report

Merging #715 (2641ff7) into main (0a574b1) will increase coverage by 0.15%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #715      +/-   ##
==========================================
+ Coverage   56.81%   56.96%   +0.15%     
==========================================
  Files          93       93              
  Lines        7993     8008      +15     
==========================================
+ Hits         4541     4562      +21     
+ Misses       3018     3014       -4     
+ Partials      434      432       -2     
Impacted Files Coverage Δ
server/registry/actions_apis.go 75.17% <0.00%> (+0.52%) ⬆️
server/registry/actions_deployments.go 76.33% <0.00%> (+2.23%) ⬆️
server/registry/actions_projects.go 82.92% <0.00%> (+0.42%) ⬆️
server/registry/actions_specs.go 78.57% <0.00%> (+1.64%) ⬆️
server/registry/actions_versions.go 75.17% <0.00%> (+0.52%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

default:
t.Errorf("UpdateApi(%+v) returned status code %q", test.req, status.Code(err))
}
wg.Done()
Copy link
Contributor Author

@timburks timburks Aug 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a sample concurrency test that makes 10 UpdateApi calls concurrently. But instead of triggering the AlreadyExists or Aborted errors that we want, we get Unavailable because the database is locked.

Edit: this was occurring because the tests were running with SQLite.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... just to be clear: this test failed before this change was made?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When run with Postgres, yes.

@timburks timburks requested review from seaneganx and theganyo August 31, 2022 21:21
@timburks timburks force-pushed the retry-already-exists branch from 31dff39 to e0dd975 Compare August 31, 2022 21:22
@timburks timburks force-pushed the retry-already-exists branch from e0dd975 to f0640c9 Compare August 31, 2022 21:25
@theganyo
Copy link
Member

A possibility for skipping long-running tests would be to simply use the go test -short flag. See: https://pkg.go.dev/cmd/go#hdr-Testing_flags and https://pkg.go.dev/testing#hdr-Skipping.

@theganyo
Copy link
Member

The build is reporting that none of the new code is is covered by tests.

@timburks
Copy link
Contributor Author

timburks commented Sep 1, 2022

Code coverage measurements are made with SQLite (not Postgres):

go test -race -coverprofile=coverage.txt -covermode=atomic ./...

When I run these tests with SQLite, I get Unavailable errors and the new code isn't exercised.

When I run with Postgres (and without the added changes), I get AlreadyExists responses, but non-deterministically, because the test is relying on Go to schedule the concurrent accesses in goroutines. With the changes in the action handlers, the new tests pass. To run the concurrent tests with Postgres:

go test ./server/registry --run Concurrent -postgresql -v

@timburks timburks force-pushed the retry-already-exists branch from b5334a6 to 2641ff7 Compare September 1, 2022 02:00
Copy link
Member

@theganyo theganyo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It doesn't seem to be an issue with this PR directly, then, but I am concerned about the SQLite Unavailable failures as a separate issue.

@timburks
Copy link
Contributor Author

timburks commented Sep 1, 2022

@theganyo The underlying SQLite errors are "database locked". I suspect that the transactions are locking the SQLite database for both reads and writes, so for the SQLite case, the concurrency test added here just verifies that we get "Unavailable" (retryable errors) when we try to make concurrent updates.

Postgres seems to be giving us finer access, at least allowing reads while writes are taking place. https://www.tutorialspoint.com/postgresql/postgresql_locks.htm#:~:text=Locks%20or%20Exclusive%20Locks%20or,either%20committed%20or%20rolled%20back.

@timburks
Copy link
Contributor Author

timburks commented Sep 1, 2022

@theganyo This might allow us to block reads during Postgres transactions - https://gorm.io/docs/advanced_query.html#Locking-FOR-UPDATE - that would eliminate the race condition being addressed here. (Ideally this would be a row lock and not a table lock, right now I'm not sure which it is.)

@theganyo
Copy link
Member

theganyo commented Sep 1, 2022

The original issue of "when two updates are made to the same non-existent resource" seems like it would be a quite rare case. Is it worth trying to do more around that?

@timburks
Copy link
Contributor Author

timburks commented Sep 1, 2022

Is it worth trying to do more around that?

@theganyo Good question. Let's say no for now and discuss with @seaneganx later

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants