Add cancel cause when client sends RST_STREAM frame #7541
Comments
Hi @dastrobu, thanks for submitting this proposal. It seems useful to me; I'm discussing it with the other maintainers to see if we want to make this change.
@dastrobu are you trying to propagate the reason for context cancellation to service method handlers? If yes, can you describe some use cases in which the handlers would react differently to different cancellation causes?
@arjan-bal sure, let me try to detail the described use case a bit. The Grafana Tempo distributor exposes a gRPC API to consume data. In the handler, the error is reported so that operations teams can react to it.

The proposal would make it transparent whether the context cancellation was due to a client or a server error. To illustrate, I tried to sketch a simplified version of the code:

```go
func handle(ctx context.Context, d []string) {
	err := processData(ctx, d)
	if err != nil {
		fmt.Printf("Error processing data: %v\n", err) // Error processing data: context canceled
	}
}

func processData(ctx context.Context, d []string) error {
	ctx, cancel := context.WithCancelCause(ctx)
	defer cancel(nil)
	doneChan := make(chan struct{}, 1)
	errChan := make(chan error, 1)
	var wg sync.WaitGroup
	wg.Add(len(d))
	for _, di := range d {
		go func() {
			defer wg.Done()
			if err := doStuff(ctx, di); err != nil {
				cancel(err) // interrupt other goroutines
				errChan <- err
			}
		}()
	}
	go func() {
		wg.Wait()
		doneChan <- struct{}{}
	}()
	select {
	case err := <-errChan:
		return err
	case <-doneChan:
		return nil
	case <-ctx.Done():
		return context.Cause(ctx)
	}
}
```

In the real-world situation described in grafana/tempo#3957, it took us several weeks to find out that context cancellation was actually caused by a misconfigured client.
@dastrobu can you provide an example in gRPC that needs to convey a specific cause for context cancellation from the server, apart from cancellation by the client? In gRPC, many errors and cancellation scenarios are associated with specific status codes or error codes. Are you looking to log cancellation causes specifically for debugging your application service, in cases that are not covered by error codes?
@purnesh42H, thanks for your question. The primary use case I created this issue for is the client-side request cancellation described in #7541 (comment). There are certainly more scenarios where causes would come in handy. Consider, for example, timeout handling (grpc-go/internal/transport/http2_server.go, line 531 in 6a5a283).
Considering the case where there are more timeouts down the context stack, it would also be very helpful to see the cause of the timeout. So changing to

```go
if timeoutSet {
	s.ctx, s.cancel = context.WithTimeoutCause(ctx, timeout, grpcTimeoutErr)
}
```

could make this more transparent. I would not consider myself a gRPC expert, but I'd say status codes primarily communicate errors to the calling client, while context cancellation (with causes) communicates to the upstream handler. I see that introducing causes to all context cancellations might be a larger change. But you might consider introducing it step by step, as it is not a breaking change: application code checking just for `ctx.Err()` would continue to work.

Over time, one could introduce the cancellation cause in many places, for example:

```go
if t.state != reachable {
	t.mu.Unlock()
	s.cancel(stateUnreachableErr) // ⬅ cause
	return nil
}
if uint32(len(t.activeStreams)) >= t.maxStreams {
	t.mu.Unlock()
	t.controlBuf.put(&cleanupStream{
		streamID: streamID,
		rst:      true,
		rstCode:  http2.ErrCodeRefusedStream,
		onWrite:  func() {},
	})
	s.cancel(tooManyStreamsErr) // ⬅ cause
	return nil
}
```

@purnesh42H, I'd be interested to hear whether you see any downside to adding the cause (apart from the actual implementation work)?
@dastrobu I agree with the use case of providing a cause for context cancellation, but all the use cases you are describing apply to application logic. As far as the gRPC server library is concerned, the cause for context cancellation is either that the client canceled or that the context deadline was exceeded, which is inferred from the error code and can always be logged by the application handler itself if it needs to record more information about the timeout. We still don't see much use in changing the gRPC server library to add a cause for context cancellation. I agree that the application handler can utilize the cancellation cause to propagate more information applicable to its use case.
@purnesh42H I am not sure I can follow your suggestion on how to detect request cancellation from a handler that offloads work to another goroutine.

Taking your example code from https://github.com/grpc/grpc-go/tree/master/examples/features/cancellation and modifying it a bit may illustrate the problem:

```go
func workOnMessage(ctx context.Context, msg string) {
	i := 0
	for {
		if err := ctx.Err(); err != nil {
			fmt.Printf("context cancelled for message worker %q: %v\n", msg, err)
			return
		}
		fmt.Printf("working on message: %q %d\n", msg, i)
		i += 1
	}
}

func (s *server) BidirectionalStreamingEcho(stream pb.Echo_BidirectionalStreamingEchoServer) error {
	ctx := stream.Context()
	for {
		if err := ctx.Err(); err != nil {
			fmt.Printf("context closed: %v\n", err)
			return nil
		}
		recv, err := stream.Recv()
		if err != nil {
			fmt.Printf("server: error receiving from stream: %v\n", err)
			if err == io.EOF {
				return nil
			}
			return err
		}
		msg := recv.Message
		go workOnMessage(ctx, msg)
		stream.Send(&pb.EchoResponse{Message: msg})
	}
}
```

Next, you mentioned that a context gets canceled only in the case of the RST frame (and the context deadline). First, it is very implicit to always assume that any context cancellation is caused by an RST. Second, I do see a few other calls of `s.cancel` (see, e.g., grpc-go/internal/transport/http2_server.go, line 1324 in 6f50403).

From my perspective, I still don't see a good way to handle errors in asynchronous scenarios such as the one outlined above. I'd also still be interested in your opinion on my previous question.
@dastrobu let me know if I understand your ask correctly: you want to distinguish client-initiated RST_STREAM from other cancellation causes?

Let me know if I missed anything. Meanwhile, would it be possible for you to send a draft PR showing what and how much you would change in the gRPC library to achieve this? We can bring the proposal to the other maintainers if the change is contained, additive, and clean.
@purnesh42H Great summary, that's exactly what I had in mind. I can work on drafting a PR in about three weeks, as I likely won't have time before then. Thanks for the productive discussion so far!
I've put together a quick draft to help move the discussion forward. The tests and details still need further refinement, but it should give a good illustration of the idea I had in mind.
Thanks @dastrobu, I will take a look.
Use case(s) - what problem will this feature solve?
When a client resets a connection, the http2_server cancels the current context to interrupt all "server internal processing". The relevant code is `t.closeStream(s, false, 0, false)` and `s.cancel()`.

When there is a big context stack within the server's logic, it can be quite hard to find out why the context was canceled. Was it due to some internal error? A timeout? Or a client sending the RST frame?

A real-world problem analysis is described in grafana/tempo#3957.
Proposed Solution
I am proposing to replace `context.CancelFunc` with `context.CancelCauseFunc` and adjust all context cancellations to report the cause. For the example described above, it could be something like this:
Alternatives Considered
nil
Additional Context
nil