Add cancel cause when client sends RST_STREAM frame #7541
Comments
Hi @dastrobu, thanks for submitting this proposal. It seems useful to me; I'm discussing it with the other maintainers to see if we want to make this change.
@dastrobu are you trying to propagate the reason for context cancellation to service method handlers? If yes, can you describe some use cases in which the handlers would react differently to different cancellation causes?
@arjan-bal sure, let me try to detail the described use case a bit. The Grafana Tempo distributor exposes a gRPC API to consume data. In the handler, the error is reported so that operations teams can react to it.

The proposal would make it transparent whether the context cancellation was due to a client or a server error. To illustrate, I tried to sketch a simplified version of the code:

```go
func handle(ctx context.Context, d []string) {
	err := processData(ctx, d)
	if err != nil {
		fmt.Printf("Error processing data: %v\n", err) // Error processing data: context canceled
	}
}

func processData(ctx context.Context, d []string) error {
	ctx, cancel := context.WithCancelCause(ctx)
	defer cancel(nil)
	doneChan := make(chan struct{}, 1)
	errChan := make(chan error, 1)
	var wg sync.WaitGroup
	wg.Add(len(d))
	for _, di := range d {
		go func() {
			defer wg.Done()
			if err := doStuff(ctx, di); err != nil {
				cancel(err) // interrupt other goroutines
				errChan <- err
			}
		}()
	}
	go func() {
		wg.Wait()
		doneChan <- struct{}{}
	}()
	select {
	case err := <-errChan:
		return err
	case <-doneChan:
		return nil
	case <-ctx.Done():
		return context.Cause(ctx)
	}
}
```

In the real-world situation described in grafana/tempo#3957, it took us several weeks to find out that context cancellation was actually caused by a misconfigured client.
@dastrobu can you provide an example in gRPC that needs to convey a specific cause for context cancellation from the server, apart from cancellation by the client? In gRPC, many errors and cancellation scenarios are associated with specific status codes or error codes. Are you looking to log cancellation causes specifically for debugging your application service, in cases that are not covered by error codes?
@purnesh42H, thanks for your question. The primary use case I created this issue for is the client-side request cancellation described in #7541 (comment). There are certainly more scenarios where causes would come in handy. Consider, for example, timeout handling (grpc-go/internal/transport/http2_server.go, line 531 in 6a5a283).
Considering the case where there are more timeouts down the context stack, it would also be very helpful to see the cause of the timeout. So changing to

```go
if timeoutSet {
	s.ctx, s.cancel = context.WithTimeoutCause(ctx, timeout, grpcTimeoutErr)
}
```

could make this more transparent. I would not consider myself a gRPC expert, but I'd say status codes primarily communicate errors to the calling client, while context cancellation (with causes) communicates to the upstream handler. I see that introducing causes to all context cancellations might be a larger change. But you might consider introducing it step by step, as it is not a breaking change: application code checking just for `ctx.Err()` would continue to work.

Over time, one could introduce the cancellation cause in many places, for example:

```go
if t.state != reachable {
	t.mu.Unlock()
	s.cancel(stateUnreachableErr) // ⬅ cause
	return nil
}
if uint32(len(t.activeStreams)) >= t.maxStreams {
	t.mu.Unlock()
	t.controlBuf.put(&cleanupStream{
		streamID: streamID,
		rst:      true,
		rstCode:  http2.ErrCodeRefusedStream,
		onWrite:  func() {},
	})
	s.cancel(tooManyStreamsErr) // ⬅ cause
	return nil
}
```

@purnesh42H, I'd be interested to hear whether you see any downside to adding the cause (apart from the actual implementation work)?
@dastrobu I agree with the use case of providing a cause for context cancellation, but all the use cases you are describing apply to application logic. As far as the gRPC server library is concerned, the cause for context cancellation is either that the client canceled or that the context deadline was exceeded, which is inferred from the error code and can always be logged by the application handler itself if it needs to record more information about the timeout. We still don't see much use in changing the gRPC server library to add a cause for context cancellation. I agree that the application handler can utilize the cancellation cause to propagate more information applicable to its use case.
@purnesh42H I am not sure I can follow your suggestion on how to detect request cancellation from a handler that offloads work to another goroutine.

Taking your example code from https://github.com/grpc/grpc-go/tree/master/examples/features/cancellation and modifying it a bit may illustrate the problem:

```go
func workOnMessage(ctx context.Context, msg string) {
	i := 0
	for {
		if err := ctx.Err(); err != nil {
			fmt.Printf("context cancelled for message worker %q: %v\n", msg, err)
			return
		}
		fmt.Printf("working on message: %q %d\n", msg, i)
		i += 1
	}
}

func (s *server) BidirectionalStreamingEcho(stream pb.Echo_BidirectionalStreamingEchoServer) error {
	ctx := stream.Context()
	for {
		if err := ctx.Err(); err != nil {
			fmt.Printf("context closed: %v\n", err)
			return nil
		}
		recv, err := stream.Recv()
		if err != nil {
			fmt.Printf("server: error receiving from stream: %v\n", err)
			if err == io.EOF {
				return nil
			}
			return err
		}
		msg := recv.Message
		go workOnMessage(ctx, msg)
		stream.Send(&pb.EchoResponse{Message: msg})
	}
}
```

Next, you mentioned that a context gets canceled only in the case of the RST frame (and the context deadline). First, it is very implicit to always assume that any context cancellation is caused by an RST. Second, I do see a few other calls of `s.cancel` (see, e.g., grpc-go/internal/transport/http2_server.go, line 1324 in 6f50403).

From my perspective, I still don't see a good way to handle errors in asynchronous scenarios such as the one outlined above. I'd also still be interested in your opinion on my previous question.
@dastrobu let me know if I understand your ask correctly: you want to distinguish client-initiated RST_STREAM from other cancellation causes?

Let me know if I missed anything. Meanwhile, would it be possible for you to send a draft PR showing what and how much you would change in the gRPC library to achieve this? We can bring the proposal to the other maintainers if the change is contained, additive, and clean.
@purnesh42H Great summary, that's exactly what I had in mind. I can work on drafting a PR in about three weeks, as I likely won't have time before then. Thanks for the productive discussion so far!
I've put together a quick draft to help move the discussion forward. The tests and details still need further refinement, but it should give a good illustration of the idea I had in mind.
Thanks @dastrobu, I will take a look.
Use case(s) - what problem will this feature solve?
When a client resets a connection, the http2_server cancels the current context to interrupt all "server internal processing". The relevant code is `t.closeStream(s, false, 0, false)` and `s.cancel()`.

When there is a big context stack within the server's logic, it can be quite hard to find out why the context was canceled. Was it due to some internal error? A timeout? Or a client sending the RST frame?

A real-world problem analysis is described in grafana/tempo#3957.
Proposed Solution
I am proposing to replace `context.CancelFunc` with `context.CancelCauseFunc` and adjust all context cancellations to report the cause. For the example described above, it could be something like this:
Alternatives Considered
nil
Additional Context
nil