
Add cancel cause when client sends RST_STREAM frame #7541

Open
dastrobu opened this issue Aug 20, 2024 · 11 comments
Assignees
Labels
Area: RPC Features Includes Compression, Encoding, Attributes/Metadata, Interceptors. P2 Type: Feature New features or improvements in behavior

Comments

@dastrobu

Use case(s) - what problem will this feature solve?

When a client resets a stream, the http2_server cancels the stream's context to interrupt all "server internal processing". The relevant code is t.closeStream(s, false, 0, false) followed by s.cancel().

When there is a deep context stack within the server's logic, it can be quite hard to find out why the context was canceled.
Was it due to some internal error? A timeout? Or a client sending an RST_STREAM frame?

A real-world problem analysis is described in grafana/tempo#3957.

Proposed Solution

I am proposing to replace context.CancelFunc with context.CancelCauseFunc and adjust all context cancellations to report the cause.

For the example described above, it could look something like this:

func (t *http2Server) handleRSTStream(f *http2.RSTStreamFrame) {
	// If the stream is not deleted from the transport's active streams map, then do a regular close stream.
	if s, ok := t.getStream(f); ok {
		t.closeStream(s, false, 0, false, errors.New("RST_STREAM frame received"))
		return
	}

// ... 

func (t *http2Server) closeStream(s *Stream, rst bool, rstCode http2.ErrCode, eosReceived bool, cause error) {
	// In case stream sending and receiving are invoked in separate
	// goroutines (e.g., bi-directional streaming), cancel needs to be
	// called to interrupt the potential blocking on other goroutines.
	s.cancel(cause)

Alternatives Considered

nil

Additional Context

nil

@dastrobu dastrobu added the Type: Feature New features or improvements in behavior label Aug 20, 2024
@arjan-bal arjan-bal self-assigned this Aug 21, 2024
@arjan-bal
Contributor

arjan-bal commented Aug 21, 2024

Hi @dastrobu, thanks for submitting this proposal. It seems useful to me, I'm discussing this with other maintainers to see if we want to make this change.

@arjan-bal
Contributor

@dastrobu are you trying to propagate the reason for context cancellation to service method handlers? If yes, can you describe some use cases in which the handlers would react differently to different causes?

If the cause is not intended for method handlers, can you explain how it will be used?

@dastrobu
Author

@arjan-bal sure, let me try to detail the described use case a bit.

The Grafana Tempo distributor exposes a gRPC API to consume data.
In the handler, the context is propagated and some goroutines do some data processing in parallel.
When one of the goroutines fails (for whatever reason), it cancels the context, which in turn cancels all other goroutines.

In the handler, the error is reported so that operations teams can react to it.
However, the error message just says "context canceled". So what happened?

  1. The client sent a RST frame. So the server is behaving correctly. The client may have canceled intentionally or the client is somehow misbehaving.
  2. The server had an internal error and canceled the context. The ops team needs to investigate and fix it.

The proposal would make it transparent if the context cancellation was due to a client or a server error.

To illustrate, I tried to sketch a simplified version of the code:

func handle(ctx context.Context, d []string) {
	err := processData(ctx, d)
	if err != nil {
		fmt.Printf("Error processing data: %v\n", err) // Error processing data: context canceled
	}
}

func processData(ctx context.Context, d []string) error {
	ctx, cancel := context.WithCancelCause(ctx)
	defer cancel(nil)

	doneChan := make(chan struct{}, 1)
	errChan := make(chan error, len(d)) // buffered so failing goroutines never block

	var wg sync.WaitGroup
	wg.Add(len(d))
	for _, di := range d {
		go func(di string) { // pass di explicitly to avoid loop-variable capture before Go 1.22
			defer wg.Done()
			if err := doStuff(ctx, di); err != nil {
				cancel(err) // interrupt other goroutines
				errChan <- err
			}
		}(di)
	}

	go func() {
		wg.Wait()
		doneChan <- struct{}{}
	}()

	select {
	case err := <-errChan:
		return err
	case <-doneChan:
		return nil
	case <-ctx.Done():
		return context.Cause(ctx)
	}
}

In the real-world situation described in grafana/tempo#3957, it took us several weeks to find out that context cancellation was actually caused by a misconfigured client.
We did not find the reason until we patched the Go SDK to log stack traces on context cancellation.
I would like to avoid this hassle for my team and others in the future, and I think implementing this small feature could contribute a lot.

@arjan-bal arjan-bal assigned arjan-bal and unassigned dastrobu Aug 26, 2024
@purnesh42H purnesh42H assigned purnesh42H and unassigned arjan-bal Aug 27, 2024
@eshitachandwani eshitachandwani added the Area: Server Includes Server, Streams and Server Options. label Sep 4, 2024
@purnesh42H
Copy link
Contributor

purnesh42H commented Oct 8, 2024

@dastrobu can you provide any example in gRPC that needs to convey a specific cause for context cancellation from the server, apart from cancellation by the client? In gRPC, many errors and cancellation scenarios are associated with specific status codes or error codes. Are you looking to log cancellation causes specifically for debugging your application service, in cases not covered by error codes?

@dastrobu
Author

@purnesh42H, thanks for your question.

The primary use case I was creating this issue for is the client-side request cancellation described in #7541 (comment).

There are certainly more scenarios where causes would come in handy. Consider, for example, timeout handling.
The Go server may cancel with context deadline exceeded when a timeout is set:

s.ctx, s.cancel = context.WithTimeout(ctx, timeout)

Considering the case where there are more timeouts down the context stack, it would also be very helpful to see the cause of the timeout. So changing to

	if timeoutSet {
		s.ctx, s.cancel = context.WithTimeoutCause(ctx, timeout, grpcTimeoutErr)
	}

could make this more transparent.

I would not consider myself a gRPC expert, but I'd say status codes will primarily communicate errors to the calling client, while context cancellation (with causes) communicates to the upstream handler.

I see that introducing causes to all context cancellations might be a larger change. But you might consider introducing it step-by-step, as it is not a breaking change.

Application code checking just for ctx.Err() won't be affected by the added cause. However, application code will introduce checking for context.Cause(ctx) over time anyway (in my opinion). Note that context.Cause(ctx) falls back to ctx.Err() if no cause was provided.

Over time, one could introduce the cancellation cause to many places like, for example:

	if t.state != reachable {
		t.mu.Unlock()
		s.cancel(stateUnreachableErr) // ⬅ cause 
		return nil
	}
	if uint32(len(t.activeStreams)) >= t.maxStreams {
		t.mu.Unlock()
		t.controlBuf.put(&cleanupStream{
			streamID: streamID,
			rst:      true,
			rstCode:  http2.ErrCodeRefusedStream,
			onWrite:  func() {},
		})
		s.cancel(tooManyStreamsErr) // ⬅ cause 
		return nil
	}

@purnesh42H, I'd be interested if you see any downside to adding the cause (apart from the actual implementation work)?

@purnesh42H
Copy link
Contributor

purnesh42H commented Oct 17, 2024

@dastrobu I agree with the use case for providing the cause of a context cancellation, but all the use cases you are describing apply to application logic. As far as the grpc server library is concerned, the cause of a context cancellation is either that the client canceled the context or that the context deadline was exceeded, which can be inferred from the error code and can always be logged by the application handler itself if it needs to log more information about the timeout, etc.

We still don't see much use in changing the grpc server library to add a cause for context cancellation. I agree that the application handler can utilize cancellation causes to propagate more information applicable to its use case.

@dastrobu
Author

@purnesh42H I am not sure I can follow your suggestion to detect request cancellation from a handler that offloads work to another goroutine.

which is inferred from the error code and can always be logged by the application handler itself if they need to log more information about the timeout etc.

Taking your example code from https://github.com/grpc/grpc-go/tree/master/examples/features/cancellation and modifying it a bit may illustrate the problem:

func workOnMessage(ctx context.Context, msg string) {
	i := 0
	for {
		if err := ctx.Err(); err != nil {
			fmt.Printf("context cancelled for message worker %q: %v\n", msg, err)
			return
		}
		fmt.Printf("working on message: %q %d\n", msg, i)
		i += 1
	}
}

func (s *server) BidirectionalStreamingEcho(stream pb.Echo_BidirectionalStreamingEchoServer) error {
	ctx := stream.Context()
	for {
		if err := ctx.Err(); err != nil {
			fmt.Printf("context closed: %v\n", err)
			return nil
		}
		recv, err := stream.Recv()
		if err != nil {
			fmt.Printf("server: error receiving from stream: %v\n", err)
			if err == io.EOF {
				return nil
			}
			return err
		}
		msg := recv.Message
		go workOnMessage(ctx, msg)
		if err := stream.Send(&pb.EchoResponse{Message: msg}); err != nil {
			return err
		}
	}
}

Note that the err in fmt.Printf("server: error receiving from stream: %v\n", err) will just report: server: error receiving from stream: rpc error: code = Canceled desc = context canceled, so I am not sure how the handler would tell that the client canceled via an RST.

Next, you mentioned that a context gets canceled only in the case of an RST frame (or the context deadline). First, it is very implicit to always assume that any context cancellation is caused by an RST. Second, I do see a few other calls of

func (t *http2Server) closeStream(s *Stream, rst bool, rstCode http2.ErrCode, eosReceived bool) {

For example here and there.

Note that rstCode http2.ErrCode from closeStream(...) would be valuable information to pass to the context cancellation cause. Please also note that s.cancel() is called before updating the stream state.

From my perspective, I still don't see a good way for error handling in asynchronous scenarios such as outlined above.
I'd be happy to contribute an example along the outlined scenario for asynchronous message processing with proper error handling under your guidance, if I am missing something here.

I'd also still be interested in your opinion on my previous question:

@purnesh42H, I'd be interested if you see any downside to adding the cause (apart from the actual implementation work)?

@aranjans aranjans assigned purnesh42H and unassigned dastrobu Oct 22, 2024
@purnesh42H
Contributor

@dastrobu let me know if I understand your ask correctly. You want to distinguish a client-initiated RST_STREAM from other cancellation causes? context.Canceled doesn't tell you why the context was canceled, making it difficult to differentiate between:

  1. Client intentionally closing the connection (e.g., normal shutdown, user action).
  2. Client sending a RST_STREAM due to an error on their end.
  3. Server-side timeout or any internal error leading to stream termination.

Let me know if I missed anything.

Meanwhile, would it be possible for you to send a draft PR of what/how much you would change in the grpc library to achieve 1)? We can bring the proposal to other maintainers if the change is contained, additive, and clean.

@dastrobu
Author

@purnesh42H Great summary, that's exactly what I had in mind. I can work on drafting a PR in about three weeks, as I likely won't have time before then. Thanks for the productive discussion up to this point!

dastrobu added a commit to dastrobu/grpc-go that referenced this issue Oct 24, 2024
dastrobu added a commit to dastrobu/grpc-go that referenced this issue Oct 24, 2024
dastrobu added a commit to dastrobu/grpc-go that referenced this issue Oct 24, 2024
@dastrobu
Author

I've put together a quick draft to help move the discussion forward. The tests and details still need further refinement, but this should give a good illustration of the idea I had in mind.

dastrobu added a commit to dastrobu/grpc-go that referenced this issue Oct 24, 2024
dastrobu added a commit to dastrobu/grpc-go that referenced this issue Oct 24, 2024
@purnesh42H purnesh42H assigned dastrobu and unassigned purnesh42H Oct 25, 2024
@purnesh42H
Contributor

Thanks @dastrobu, I will take a look.

@purnesh42H purnesh42H added Area: RPC Features Includes Compression, Encoding, Attributes/Metadata, Interceptors. and removed Area: Server Includes Server, Streams and Server Options. labels Nov 4, 2024