
endpointsharding: Skip nil child pickers while updating ClientConn state #7584

Closed

Conversation

arjan-bal
Contributor

@arjan-bal arjan-bal commented Sep 2, 2024

The child pickers are initialized to nil when child LB policies are created

bal = &balancerWrapper{
	childState: ChildState{Endpoint: endpoint},
	ClientConn: es.cc,
	es:         es,
}

When pick is called on endpointsharding's picker, it tries to call Pick() on a nil picker here and panics

func (p *pickerWithChildStates) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	nextIndex := atomic.AddUint32(&p.next, 1)
	picker := p.pickers[nextIndex%uint32(len(p.pickers))]
	return picker.Pick(info)
}

This change ignores nil child pickers during aggregation. If all pickers are nil, the channel state is set to CONNECTING with error balancer.ErrNoSubConnAvailable.

Test logs demonstrating the panic: panic_logs.txt

RELEASE NOTES:

  • endpointsharding: Skip nil child pickers while updating ClientConn state. This avoids panics when an RPC is made before the child policy generates a picker.
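For illustration, the skip-nil aggregation described above can be sketched with simplified stand-in types (picker, aggregate, and the other names here are hypothetical, not the actual grpc-go types):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Hypothetical stand-ins for the gRPC balancer types, simplified
// so the sketch is self-contained.
type picker interface {
	Pick() (string, error)
}

type constPicker struct{ result string }

func (c constPicker) Pick() (string, error) { return c.result, nil }

var errNoSubConnAvailable = errors.New("no SubConn available")

// errPicker queues RPCs by returning an error from every Pick call.
type errPicker struct{ err error }

func (e errPicker) Pick() (string, error) { return "", e.err }

type roundRobinPicker struct {
	pickers []picker
	next    uint32
}

func (p *roundRobinPicker) Pick() (string, error) {
	i := atomic.AddUint32(&p.next, 1)
	return p.pickers[i%uint32(len(p.pickers))].Pick()
}

// aggregate builds the combined picker, skipping children whose
// picker is still nil. If every child picker is nil, it returns a
// CONNECTING-style queueing picker instead of panicking later.
func aggregate(children []picker) picker {
	var ready []picker
	for _, c := range children {
		if c != nil {
			ready = append(ready, c)
		}
	}
	if len(ready) == 0 {
		return errPicker{err: errNoSubConnAvailable}
	}
	return &roundRobinPicker{pickers: ready}
}

func main() {
	// One child has produced a picker, the other hasn't yet:
	// the nil child is never consulted.
	res, err := aggregate([]picker{nil, constPicker{result: "conn-1"}}).Pick()
	fmt.Println(res, err)

	// No child has produced a picker: RPCs are queued, no panic.
	_, err = aggregate([]picker{nil, nil}).Pick()
	fmt.Println(err)
}
```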

@arjan-bal arjan-bal added this to the 1.67 Release milestone Sep 2, 2024

codecov bot commented Sep 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.85%. Comparing base (005b092) to head (e6d5ac6).
Report is 37 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7584      +/-   ##
==========================================
- Coverage   81.91%   81.85%   -0.07%     
==========================================
  Files         361      361              
  Lines       27825    27832       +7     
==========================================
- Hits        22794    22782      -12     
- Misses       3841     3853      +12     
- Partials     1190     1197       +7     
Files with missing lines                        Coverage Δ
balancer/endpointsharding/endpointsharding.go   69.34% <100.00%> (+2.41%) ⬆️

... and 57 files with indirect coverage changes

ServiceConfig: sc,
})

cc, err := grpc.Dial(mr.Scheme()+":///", grpc.WithResolvers(mr), grpc.WithTransportCredentials(insecure.NewCredentials()))
Contributor

nit: can we use grpc.NewClient?

Contributor Author

Changed to NewClient.

defer cancel()
client := testgrpc.NewTestServiceClient(cc)

// Even though the child LB policy hasn't given any picker updates, it is
Contributor

// Even though the child LB policy hasn't given any picker updates, it is assumed that it's in connecting state

We should verify that there is no state change, i.e. the balancer was created in CONNECTING state and is still in CONNECTING state. Maybe check State.ConnectivityState?

Contributor Author

There isn't a way to get the endpoint sharding balancer's state with the test harness. I've added a check for the channel state instead. The channel state should remain CONNECTING until the endpoint sharding balancer changes it.

@purnesh42H purnesh42H assigned arjan-bal and unassigned purnesh42H Sep 12, 2024
@dfawley
Member

dfawley commented Sep 30, 2024

Is this just covering up for a child LB policy that isn't following the rules of the API? Should the child policy be required to set its state?

Note that we will want some children to start IDLE, specifically we'll probably want to create a wrapper LB policy that creates a single child in IDLE and only starts it when it gets an RPC or an ExitIdle call.

@dfawley
Member

dfawley commented Sep 30, 2024

The endpointsharding policy was not setting the initial state for child policies and would panic if the child hasn't provided a balancer state update and an RPC is made.

What is the panic, exactly? There's no explanation in the PR description and no linked issue here.

@arjan-bal
Contributor Author

arjan-bal commented Sep 30, 2024

What is the panic, exactly? There's no explanation in the PR description and no linked issue here.

Attached logs to the PR description. The problem is that the child pickers in the child state are not initialized here:

bal = &balancerWrapper{
	childState: ChildState{Endpoint: endpoint},
	ClientConn: es.cc,
	es:         es,
}

When pick is called on endpointsharding's picker, it tries to call Pick() on a nil picker here:

func (p *pickerWithChildStates) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	nextIndex := atomic.AddUint32(&p.next, 1)
	picker := p.pickers[nextIndex%uint32(len(p.pickers))]
	return picker.Pick(info)
}

@arjan-bal
Contributor Author

arjan-bal commented Sep 30, 2024

Is this just covering up for a child LB policy that isn't following the rules of the API? Should the child policy be required to set its state?

Are LB policies required to produce a picker when UpdateClientConnState is called?
I found this bug while working on pickfirst changes. From a discussion on the pickfirst PR, I learned that pickfirst can assume it starts in state CONNECTING, the reason being that the LB policy should not be built if it doesn't need to connect. I then copied an optimization from Java that skips sending duplicate CONNECTING pickers, since the channel state and picker error remain the same. That is how I found that endpointsharding panics if the child policy doesn't send a picker on the first call to UpdateClientConnState.

@arjan-bal arjan-bal assigned dfawley and unassigned purnesh42H Sep 30, 2024
@dfawley
Member

dfawley commented Sep 30, 2024

The problem is that the child pickers in the child state are not initialized here

But... there are no child pickers yet. It seems like the real problem is that the endpointsharding balancer shouldn't perform any UpdateState calls if no children have done so yet, either.

Are LB policies required to produce a picker when UpdateClientConnState is called?

No, there is no such requirement today.

From a discussion on the pickfirst PR, I found that pickfirst can assume that it starts in state CONNECTING.

I don't think that's quite correct. Pickfirst should START in state CONNECTING. It should do so by producing a picker that queues RPCs and sets the state to CONNECTING.

I then copied an optimization from Java, which ignored sending duplicate CONNECTING pickers as the channel state and picker error remains the same.

This seems OK as long as a picker has been sent that says CONNECTING. You shouldn't suppress the first update based on the assumption that you started CONNECTING.
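The behavior described here, sending an initial CONNECTING queueing picker and only suppressing duplicates after that first update, could be sketched as follows. All types below are hypothetical stand-ins, not the real grpc-go API:

```go
package main

import (
	"errors"
	"fmt"
)

type state struct {
	connectivity string
	pickErr      error
}

// clientConn is a hypothetical stand-in recording the state updates
// the child policy pushes to its parent.
type clientConn struct{ updates []state }

func (cc *clientConn) UpdateState(s state) { cc.updates = append(cc.updates, s) }

var errNoSubConnAvailable = errors.New("no SubConn available")

type pickFirst struct {
	cc        *clientConn
	lastState string
}

// newPickFirst starts in CONNECTING and, crucially, tells its parent
// so by sending a queueing picker, rather than assuming the parent
// already knows.
func newPickFirst(cc *clientConn) *pickFirst {
	pf := &pickFirst{cc: cc, lastState: "CONNECTING"}
	pf.cc.UpdateState(state{connectivity: "CONNECTING", pickErr: errNoSubConnAvailable})
	return pf
}

// updateState suppresses duplicate CONNECTING updates, but only
// after the initial one has been sent.
func (pf *pickFirst) updateState(s state) {
	if s.connectivity == "CONNECTING" && pf.lastState == "CONNECTING" {
		return // duplicate: parent already holds a CONNECTING picker
	}
	pf.lastState = s.connectivity
	pf.cc.UpdateState(s)
}

func main() {
	cc := &clientConn{}
	pf := newPickFirst(cc)
	// This duplicate CONNECTING update is safely suppressed because
	// the initial one was actually sent, not merely assumed.
	pf.updateState(state{connectivity: "CONNECTING", pickErr: errNoSubConnAvailable})
	pf.updateState(state{connectivity: "READY"})
	fmt.Println(len(cc.updates)) // initial CONNECTING plus READY
}
```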

@dfawley
Member

dfawley commented Sep 30, 2024

I think this change is actually OK, but I'm concerned.

I would be OK with the assumptions here if it's documented in the balancer API and if we audit all parent LB policies that we own to make sure they conform to this behavior.

This is not currently documented, however, so it's questionable to propagate this assumption throughout the tree.

Also I don't see anything in the channel that does this, so I think it's the gracefulswitch balancer that makes the channel behave this way. But I'm not exactly clear on where we would set the channel state to CONNECTING while we're still waiting for the initial name resolver update. Maybe we aren't, and that's a bug? Or maybe we're initializing more than I was expecting even while waiting for the initial update?

@dfawley
Member

dfawley commented Sep 30, 2024

I filed #7686 to make sure we have a test case for the behavior in the last paragraph I wrote above.

@arjan-bal arjan-bal assigned arjan-bal and unassigned dfawley Sep 30, 2024
@arjan-bal arjan-bal force-pushed the endpoint-sharding-prevent-panic branch from 368ae08 to e6d5ac6 Compare September 30, 2024 19:13
@arjan-bal arjan-bal changed the title endpointsharding: Assume child policies start in CONNECTING state endpointsharding: Skip nil child pickers while updating ClientConn state Sep 30, 2024
@arjan-bal
Contributor Author

@dfawley I've changed the approach to ignore nil child pickers instead of assuming child LB policies start in CONNECTING. The PR description and title are updated to reflect the same.

@arjan-bal arjan-bal assigned dfawley and unassigned arjan-bal Sep 30, 2024
Comment on lines +220 to +223
es.cc.UpdateState(balancer.State{
	ConnectivityState: connectivity.Connecting,
	Picker:            base.NewErrPicker(balancer.ErrNoSubConnAvailable),
})
Member

If no children have produced a picker, why are we doing so here?

I think the safest behavior here is to either do what you said before (and go through the exercise to make "initial state == connecting" an API guarantee), or to only have this endpointsharding LB policy produce pickers in sync with its children: i.e. whenever any child updates its picker, this policy updates its combined picker.
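The second alternative, publishing the combined picker only in sync with child updates, might look roughly like this. This is a hypothetical sketch with stand-in types, not the actual endpointsharding code:

```go
package main

import "fmt"

// parent models a policy that never invents a picker on behalf of
// its children: it stays silent until some child publishes one.
type parent struct {
	childPickers map[string]string // child name -> latest picker label
	published    []string          // combined pickers pushed upward
}

// updateClientConnState configures children but publishes nothing:
// no child has produced a picker yet, so the parent stays silent.
func (p *parent) updateClientConnState(children []string) {
	if p.childPickers == nil {
		p.childPickers = map[string]string{}
	}
	for _, c := range children {
		if _, ok := p.childPickers[c]; !ok {
			p.childPickers[c] = "" // no picker yet
		}
	}
}

// onChildUpdate is called whenever a child pushes a new picker; only
// then does the parent regenerate and publish its combined picker.
func (p *parent) onChildUpdate(child, picker string) {
	p.childPickers[child] = picker
	combined := ""
	for c, pk := range p.childPickers {
		if pk != "" {
			combined += c + ":" + pk + " "
		}
	}
	p.published = append(p.published, combined)
}

func main() {
	par := &parent{}
	par.updateClientConnState([]string{"a", "b"})
	fmt.Println(len(par.published)) // nothing published before any child updates
	par.onChildUpdate("a", "ready-picker")
	fmt.Println(len(par.published)) // one combined picker, in sync with the child
}
```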

Contributor Author

If we don't send a picker here, the unit test added in this PR shows that the channel remains in the IDLE state.

@arjan-bal
Contributor Author

Closing this PR, will open it once #7686 is fixed.

@arjan-bal arjan-bal closed this Oct 1, 2024