Registry server: report AlreadyExists errors in updates as Abort to cause them to be retried #715

timburks · 2022-08-31T20:22:43Z

As discussed here, #700 removed a mutex with this associated comment:

// Prevent a race condition that can occur when two updates are made
// to the same non-existent resource. The db.Get...() call returns
// NotFound for both updates, and after one creates the resource,
// the other creation fails. The lock() prevents this by serializing
// the get and create operations. Future updates could improve this
// with improvements closer to the database level.

This commit addresses the underlying problem by detecting when it occurs and returning the gRPC Aborted status code, which client libraries are configured to retry. This client retry avoids slowing the common case while still allowing for correct handling.

The effect is observable in the bulk uploaders, e.g. here, where the code explicitly ignores the AlreadyExists errors that can be returned when two callers concurrently try to create the same API using UpdateApi calls with allow missing set to true.

With the change, those errors are reported as Aborted and are retried by the client libraries (because we configured this). The modified bulk uploader code that explicitly ignores AlreadyExists is not reached.

How can we test this? It is observable by running registry upload bulk discovery with a high --jobs count (e.g. 50). However, this has an external service dependency (it calls the API discovery service), and it seems overly slow. A more direct test might be to run a large number of goroutines (maybe 10?) that concurrently try to create the same API (or version, spec, deployment, or artifact) using Update with allow missing. These should return Aborted or no error.

Let's consider making concurrency tests a separate group of tests that are run outside of the core server suite (perhaps like benchmarks). This would reduce the load of functional verification (as done by the core) and would increase focus on the kinds of concurrency situations that we are checking.


codecov · 2022-08-31T20:30:35Z

Codecov Report

Merging #715 (2641ff7) into main (0a574b1) will increase coverage by 0.15%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #715      +/-   ##
==========================================
+ Coverage   56.81%   56.96%   +0.15%     
==========================================
  Files          93       93              
  Lines        7993     8008      +15     
==========================================
+ Hits         4541     4562      +21     
+ Misses       3018     3014       -4     
+ Partials      434      432       -2

Impacted Files	Coverage Δ
server/registry/actions_apis.go	`75.17% <0.00%> (+0.52%)`	⬆️
server/registry/actions_deployments.go	`76.33% <0.00%> (+2.23%)`	⬆️
server/registry/actions_projects.go	`82.92% <0.00%> (+0.42%)`	⬆️
server/registry/actions_specs.go	`78.57% <0.00%> (+1.64%)`	⬆️
server/registry/actions_versions.go	`75.17% <0.00%> (+0.52%)`	⬆️


timburks · 2022-08-31T21:21:15Z

server/registry/concurrency_test.go

+				default:
+					t.Errorf("UpdateApi(%+v) returned status code %q", test.req, status.Code(err))
+				}
+				wg.Done()


Here's a sample concurrency test that makes 10 UpdateApi calls concurrently. But instead of triggering the AlreadyExists or Aborted errors that we want, we get Unavailable because the database is locked.

Edit: this was occurring because the tests were running with SQLite.

So... just to be clear: this test failed before this change was made?

When run with Postgres, yes.

theganyo · 2022-08-31T23:44:49Z

A possibility for skipping long-running tests would be to simply use the go test -short flag. See: https://pkg.go.dev/cmd/go#hdr-Testing_flags and https://pkg.go.dev/testing#hdr-Skipping.

theganyo · 2022-08-31T23:56:49Z

The build is reporting that none of the new code is covered by tests.

timburks · 2022-09-01T01:55:12Z

Code coverage measurements are made with SQLite (not Postgres):

registry/.github/workflows/go.yml

Line 86 in 0a574b1

go test -race -coverprofile=coverage.txt -covermode=atomic ./...

When I run these tests with SQLite, I get Unavailable errors and the new code isn't exercised.

When I run with Postgres (and without the added changes), I get AlreadyExists responses, but non-deterministically, because the test is relying on Go to schedule the concurrent accesses in goroutines. With the changes in the action handlers, the new tests pass. To run the concurrent tests with Postgres:

go test ./server/registry --run Concurrent -postgresql -v

theganyo

Ok. It doesn't seem to be an issue with this PR directly, then, but I am concerned about the SQLite Unavailable failures as a separate issue.

timburks · 2022-09-01T16:02:20Z

@theganyo The underlying SQLite errors are "database locked". I suspect that the transactions are locking the SQLite database for both reads and writes, so for the SQLite case, the concurrency test added here just verifies that we get "Unavailable" (retryable errors) when we try to make concurrent updates.

Postgres seems to be giving us finer access, at least allowing reads while writes are taking place. https://www.tutorialspoint.com/postgresql/postgresql_locks.htm#:~:text=Locks%20or%20Exclusive%20Locks%20or,either%20committed%20or%20rolled%20back.

timburks · 2022-09-01T16:11:51Z

@theganyo This might allow us to block reads during Postgres transactions - https://gorm.io/docs/advanced_query.html#Locking-FOR-UPDATE - that would eliminate the race condition being addressed here. (Ideally this would be a row lock and not a table lock, right now I'm not sure which it is.)

theganyo · 2022-09-01T16:32:23Z

The original issue of "when two updates are made to the same non-existent resource" seems like it would be a quite rare case. Is it worth trying to do more around that?

timburks · 2022-09-01T16:44:25Z

Is it worth trying to do more around that?

@theganyo Good question. Let's say no for now and discuss with @seaneganx later

timburks added 25 commits August 25, 2022 11:36

hyper-detailed error reporting and an update wrapped in a transaction.

e11b434

wrap api and version updates in transactions

30c402b

Merge branch 'main' of github.com:apigee/registry into tx

052930a

wrap artifact creation and replacement in transactions

dffb67b

This ensures that associated blobs are also stored correctly.

Add transactions to spec and deployment actions.

f825f4e

Merge branch 'main' of github.com:apigee/registry into tx

e5d4254

Fix golangci-lint errors

6325b71

Update mutating API actions to run in transactions.

aa94162

Update mutating project actions to use transactions aligned with api …

98df2e8

…actions.

Update mutating version actions to use transactions, align with api a…

c84a293

…nd project actions.

Update mutating deployment actions to use transactions.

7317585

Update mutating spec actions to use transactions.

d65fdfa

Update mutating artifact actions to use transactions.

114e716

Update mutating deployment revision actions to use transactions.

bf44d3a

Update mutating spec revision actions to use transactions.

6136d35

correctly get newest remaining spec/deployment revision by looking it…

9d30969

… up after the mutating transaction completes.

Remove labels from grpcErrorForDBError

d8496c0

remove extra logging statement from grpcErrorForDBError

9a6cd38

Update logs and error reporting to handle more situations.

6afc01c

Merge branch 'main' of github.com:apigee/registry into tx

c37ff9f

Add context to db sessions used for deletion.

8f5fa15

Rename runWithTransaction to runInTransaction

3870106

Restructure switch statement as suggested.

ce5ffe6

Merge branch 'main' of github.com:apigee/registry into tx

aa02e3e

Add checks for possible concurrency problems in update methods, repor…

a1ece2a

…ting them as retriable errors.

timburks changed the title ~~Report "already exists" errors in updates as with "Abort" to cause them to be retried~~ Report "already exists" errors in updates as "Abort" to cause them to be retried Aug 31, 2022

timburks changed the title ~~Report "already exists" errors in updates as "Abort" to cause them to be retried~~ Registry server: report AlreadyExists errors in updates as Abort to cause them to be retried Aug 31, 2022

timburks mentioned this pull request Aug 31, 2022

Registry server: wrap mutating actions in transactions #700

Merged

Merge branch 'main' of github.com:apigee/registry into retry-already-…

043c914

…exists

timburks requested review from seaneganx and theganyo August 31, 2022 21:21

timburks force-pushed the retry-already-exists branch from 31dff39 to e0dd975 Compare August 31, 2022 21:22

Add a sample concurrency test

f0640c9

timburks force-pushed the retry-already-exists branch from e0dd975 to f0640c9 Compare August 31, 2022 21:25

Add additional concurrency tests.

2641ff7

timburks force-pushed the retry-already-exists branch from b5334a6 to 2641ff7 Compare September 1, 2022 02:00

theganyo approved these changes Sep 1, 2022

timburks merged commit 952753d into apigee:main Sep 1, 2022

This was referenced Sep 1, 2022

Registry API: Race conditions in database modifications #382

Closed

Registry API: Transactional database operations #448

Closed

Lock the table of affected entities during update operations. #720

Merged

timburks deleted the retry-already-exists branch October 19, 2022 16:32
