Update Retry backoff to follow changes made to gRFC A6 #7514

justinfx · 2024-08-15T22:54:03Z

What version of gRPC are you using?

google.golang.org/grpc v1.65.0

What version of Go are you using (`go version`)?

1.23.0

What operating system (Linux, Windows, …) and version?

Linux

What did you do?

I have been trying to debug the effective behaviour of the client retryPolicy values in response to an "UNAVAILABLE" server. After using many different combinations of values for "InitialBackoff", "MaxBackoff", and "BackoffMultiplier", I have found the retry backoff times to be wildly unpredictable.

Example configs:

"retryPolicy": {
    "MaxAttempts": 5,
    "InitialBackoff": "0.5s",
    "MaxBackoff": "5s",
    "BackoffMultiplier": 4.0,
    "RetryableStatusCodes": [ "UNAVAILABLE" ]
}

"retryPolicy": {
    "MaxAttempts": 5,
    "InitialBackoff": "10s",
    "MaxBackoff": "20s",
    "BackoffMultiplier": 2.0,
    "RetryableStatusCodes": [ "UNAVAILABLE" ]
}

What did you expect to see?

As I increase the InitialBackoff and BackoffMultiplier I would expect to see a progressively longer backoff time between each retry. The goal is to choose an overall timeout, such as "20s", and try to get the retry process to attempt for something close to that window.

What did you see instead?

I observe wildly variable outcomes, where sometimes the total elapsed time for all retries could be 5s, or 10s, or 20s.

Here is an example of a pretty strange set of 5 attempts:

Policy:

"retryPolicy": {
    "MaxAttempts": 5,
    "InitialBackoff": "1.0s",
    "MaxBackoff": "8s",
    "BackoffMultiplier": 8.0,
    "RetryableStatusCodes": [ "UNAVAILABLE" ]
}

Connection logs:

2024/08/16 13:54:46 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "host:port" to connect
2024/08/16 13:54:46 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "host:port" to connect
2024/08/16 13:54:51 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "host:port" to connect
2024/08/16 13:54:51 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "host:port" to connect
2024/08/16 13:54:59 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "host:port" to connect

It retries with less than a second between attempts, two different times, and then a big 8 second one for the last one. But it is all entirely random.

Further investigations

MaxAttempts is capped to 5, so it is not possible to try and squeeze more retry time out of the entire process, when the retry backoff times are extremely variable in comparison to the configuration.

The Go implementation uses a randomization of the calculated backoff duration, which makes it land anywhere between 0-backoff:
https://github.com/grpc/grpc-go/blob/v1.65.0/stream.go#L702

The C++ implementation uses the actual calculated backoff duration, which makes it actually increase the backoff with each attempt. The randomization is only applied to the jitter:
https://github.com/grpc/grpc/blob/v1.65.4/src/core/lib/backoff/backoff.cc#L34

It seems entirely non-sensical to randomize the entire backoff value between 0-backoff. Am I reading the logic wrong? Does it not defeat the purpose of an exponential backoff if the next value could be smaller? Why is the Go client not doing a similar approach where it applies random jitter?

The end result is that the numbers I plug into the Go retryPolicy feel really arbitrary and magical. And I have to keep twiddling them and running a test to see what kind of range of retries I might get out of the results.

The text was updated successfully, but these errors were encountered:

justinfx · 2024-08-16T02:31:14Z

Because the Go and C core clients seem to have different implementations of the retryPolicy exponential backoff, I have asked for clarification on StackOverflow: https://stackoverflow.com/questions/78877444/grpc-client-core-c-retrypolicy-logic-vs-go-client

justinfx · 2024-08-16T03:53:29Z

I've also just discovered that my use of "waitForReady": true greatly influences the behavior of the Go client in being able to retry enough times to be useful, because of the way it actually wait for the connection before then beginning the retry attempts. If this feature is disable, the retries can burn through so quickly sometimes that they can't even catch an UNAVAILBLE endpoint for a few seconds.

arjan-bal · 2024-08-23T09:30:35Z

Gitter chat for the same issue: https://matrix.to/#/!LWDFtYAcIVyULYUnXs:gitter.im/$NTr_B4whf71VHcXiBnNB78hzqF46RrYSTTFIGXnaIXU?via=gitter.im

Based on the reply from @ejona86

seems like you found a C bug. I don't disagree that the randomization is weird, but that is how it is defined. https://github.com/grpc/proposal/blob/master/A6-client-retries.md#integration-with-service-config (See "Exponential backoff parameters." comment)

Checking with other language maintainers about the way forward.

arjan-bal · 2024-08-26T05:19:14Z

@justinfx, thanks for bringing up this issue. After discussions between the maintainers, it is decided that the gRFC should be updated to specify the behaviour similar to c-core. Java, Go and Node will update their behaviour to match the updated gRFC. For backward compatibility, language implementations can provide an environment variable or other knob to get back the old behaviour, for a release or two.

I will keep this issue open to track the changes in Go.

justinfx · 2024-08-26T07:26:37Z

@arjan-bal this is a fantastic result! I'm so happy to see that the C-core approach was deemed more appropriate. This will really improve the usefulness of the retry policy in the other clients. Thanks!

easwars · 2024-08-30T19:38:57Z

@arjan-bal : Is there an issue to update the gRFC? Can we link it here and mark this as blocked? Thanks.

dfawley · 2024-08-30T21:32:58Z

The gRFC was updated: grpc/proposal#452

arjan-bal · 2024-09-02T06:57:24Z

@easwars since the gRFC is updated in grpc/proposal#452, I believe this issue isn't blocked now.

easwars · 2024-09-03T17:24:26Z

@arjan-bal : Do we need code changes for this?

dfawley · 2024-09-03T18:08:45Z

Yes. We implemented the gRFC as documented before, and now the behavior is changed.

justinfx added the Type: Bug label Aug 15, 2024

justinfx mentioned this issue Aug 15, 2024

retryPolicy is not working grpc/grpc#35060

Closed

arjan-bal self-assigned this Aug 23, 2024

markdroth mentioned this issue Aug 23, 2024

A6 update: change specification of backoff behavior to make sense grpc/proposal#452

Merged

arjan-bal removed their assignment Aug 26, 2024

arjan-bal added Type: Behavior Change Behavior changes not categorized as bugs and removed Type: Bug labels Sep 2, 2024

arjan-bal added the Area: Client Includes Channel/Subchannel/Streams, Connectivity States, RPC Retries, Dial/Call Options and more. label Sep 4, 2024

dfawley changed the title ~~RetryPolicy backoff value being randomized makes no sense compared to C++ reference client~~ Update Retry backoff to follow changes made to gRFC A6 Sep 10, 2024

purnesh42H added Status: Help Wanted P2 labels Oct 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update Retry backoff to follow changes made to gRFC A6 #7514

Update Retry backoff to follow changes made to gRFC A6 #7514

justinfx commented Aug 15, 2024 •

edited

Loading

justinfx commented Aug 16, 2024

justinfx commented Aug 16, 2024

arjan-bal commented Aug 23, 2024

arjan-bal commented Aug 26, 2024

justinfx commented Aug 26, 2024

easwars commented Aug 30, 2024

dfawley commented Aug 30, 2024

arjan-bal commented Sep 2, 2024

easwars commented Sep 3, 2024

dfawley commented Sep 3, 2024

Update Retry backoff to follow changes made to gRFC A6 #7514

Update Retry backoff to follow changes made to gRFC A6 #7514

Comments

justinfx commented Aug 15, 2024 • edited Loading

What version of gRPC are you using?

What version of Go are you using (go version)?

What operating system (Linux, Windows, …) and version?

What did you do?

What did you expect to see?

What did you see instead?

Further investigations

justinfx commented Aug 16, 2024

justinfx commented Aug 16, 2024

arjan-bal commented Aug 23, 2024

arjan-bal commented Aug 26, 2024

justinfx commented Aug 26, 2024

easwars commented Aug 30, 2024

dfawley commented Aug 30, 2024

arjan-bal commented Sep 2, 2024

easwars commented Sep 3, 2024

dfawley commented Sep 3, 2024

justinfx commented Aug 15, 2024 •

edited

Loading

What version of Go are you using (`go version`)?