
endpointsharding: Skip nil child pickers while updating ClientConn state #7584

Closed

Conversation

arjan-bal
Contributor

@arjan-bal arjan-bal commented Sep 2, 2024

The child pickers are initialized to nil when child LB policies are created

bal = &balancerWrapper{
	childState: ChildState{Endpoint: endpoint},
	ClientConn: es.cc,
	es:         es,
}

When pick is called on endpointsharding's picker, it tries to call Pick() on a nil picker here and panics

func (p *pickerWithChildStates) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	nextIndex := atomic.AddUint32(&p.next, 1)
	picker := p.pickers[nextIndex%uint32(len(p.pickers))]
	return picker.Pick(info)
}

This change ignores nil child pickers during aggregation. If all pickers are nil, the channel state is set to CONNECTING with error balancer.ErrNoSubConnAvailable.

Test logs demonstrating the panic: panic_logs.txt

RELEASE NOTES:

  • endpointsharding: Skip nil child pickers while updating ClientConn state. This avoids panics when an RPC is made before the child policy generates a picker.
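For illustration, the skip-nil aggregation described above can be sketched with simplified stand-in types (picker, aggregate, and the other names here are hypothetical, not the actual grpc-go types):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Hypothetical stand-ins for the gRPC balancer types, simplified
// so the sketch is self-contained.
type picker interface {
	Pick() (string, error)
}

type constPicker struct{ result string }

func (c constPicker) Pick() (string, error) { return c.result, nil }

var errNoSubConnAvailable = errors.New("no SubConn available")

// errPicker queues RPCs by returning an error from every Pick call.
type errPicker struct{ err error }

func (e errPicker) Pick() (string, error) { return "", e.err }

type roundRobinPicker struct {
	pickers []picker
	next    uint32
}

func (p *roundRobinPicker) Pick() (string, error) {
	i := atomic.AddUint32(&p.next, 1)
	return p.pickers[i%uint32(len(p.pickers))].Pick()
}

// aggregate builds the combined picker, skipping children whose
// picker is still nil. If every child picker is nil, it returns a
// CONNECTING-style queueing picker instead of panicking later.
func aggregate(children []picker) picker {
	var ready []picker
	for _, c := range children {
		if c != nil {
			ready = append(ready, c)
		}
	}
	if len(ready) == 0 {
		return errPicker{err: errNoSubConnAvailable}
	}
	return &roundRobinPicker{pickers: ready}
}

func main() {
	// One child has produced a picker, the other hasn't yet:
	// the nil child is never consulted.
	res, err := aggregate([]picker{nil, constPicker{result: "conn-1"}}).Pick()
	fmt.Println(res, err)

	// No child has produced a picker: RPCs are queued, no panic.
	_, err = aggregate([]picker{nil, nil}).Pick()
	fmt.Println(err)
}
```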

@arjan-bal arjan-bal added this to the 1.67 Release milestone Sep 2, 2024

codecov bot commented Sep 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.85%. Comparing base (005b092) to head (e6d5ac6).
Report is 37 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7584      +/-   ##
==========================================
- Coverage   81.91%   81.85%   -0.07%     
==========================================
  Files         361      361              
  Lines       27825    27832       +7     
==========================================
- Hits        22794    22782      -12     
- Misses       3841     3853      +12     
- Partials     1190     1197       +7     
Files with missing lines                        Coverage Δ
balancer/endpointsharding/endpointsharding.go   69.34% <100.00%> (+2.41%) ⬆️

... and 57 files with indirect coverage changes

ServiceConfig: sc,
})

cc, err := grpc.Dial(mr.Scheme()+":///", grpc.WithResolvers(mr), grpc.WithTransportCredentials(insecure.NewCredentials()))
Contributor

nit: can we use grpc.NewClient?

Contributor Author

Changed to NewClient.

defer cancel()
client := testgrpc.NewTestServiceClient(cc)

// Even though the child LB policy hasn't given any picker updates, it is
Contributor

// Even though the child LB policy hasn't given any picker updates, it is assumed that it's in connecting state

We should verify that there is no state change, i.e. the balancer was created in CONNECTING state and is still in CONNECTING state. Maybe check State.ConnectivityState?

Contributor Author

There isn't a way to get the endpoint sharding balancer's state with the test harness. I've added a check for the channel state instead. The channel state should remain CONNECTING until the endpoint sharding balancer changes it.

@purnesh42H purnesh42H assigned arjan-bal and unassigned purnesh42H Sep 12, 2024
@dfawley
Member

dfawley commented Sep 30, 2024

Is this just covering up for a child LB policy that isn't following the rules of the API? Should the child policy be required to set its state?

Note that we will want some children to start IDLE, specifically we'll probably want to create a wrapper LB policy that creates a single child in IDLE and only starts it when it gets an RPC or an ExitIdle call.

@dfawley
Member

dfawley commented Sep 30, 2024

The endpointsharding policy was not setting the initial state for child policies and would panic if the child hasn't provided a balancer state update and an RPC is made.

What is the panic, exactly? There's no explanation in the PR description and no linked issue here.

@arjan-bal
Contributor Author

arjan-bal commented Sep 30, 2024

What is the panic, exactly? There's no explanation in the PR description and no linked issue here.

Attached logs to the PR description. The problem is that the child pickers in the child state are not initialized here:

bal = &balancerWrapper{
	childState: ChildState{Endpoint: endpoint},
	ClientConn: es.cc,
	es:         es,
}

When pick is called on endpointsharding's picker, it tries to call Pick() on a nil picker here:

func (p *pickerWithChildStates) Pick(info balancer.PickInfo) (balancer.PickResult, error) {
	nextIndex := atomic.AddUint32(&p.next, 1)
	picker := p.pickers[nextIndex%uint32(len(p.pickers))]
	return picker.Pick(info)
}

@arjan-bal
Contributor Author

arjan-bal commented Sep 30, 2024

Is this just covering up for a child LB policy that isn't following the rules of the API? Should the child policy be required to set its state?

Are LB policies required to produce a picker when UpdateClientConnState is called?
I found this bug while working on pickfirst changes. From a discussion on the pickfirst PR, I learned that pickfirst can assume it starts in state CONNECTING, the reason being that the LB policy should not be built if it doesn't need to connect. I then copied an optimization from Java that skips sending duplicate CONNECTING pickers, since the channel state and picker error remain the same. That is how I found that endpointsharding panics if the child policy doesn't send a picker on the first call to UpdateClientConnState.

@arjan-bal arjan-bal assigned dfawley and unassigned purnesh42H Sep 30, 2024
@dfawley
Member

dfawley commented Sep 30, 2024

The problem is that the child pickers in the child state are not initialized here

But... there are no child pickers yet. It seems like the real problem is that the endpointsharding balancer shouldn't perform any UpdateState calls if no children have done so yet, either.

Are LB policies required to produce a picker when UpdateClientConnState is called?

No, there is no such requirement today.

From a discussion on the pickfirst PR, I found that pickfirst can assume that it starts in state CONNECTING.

I don't think that's quite correct. Pickfirst should START in state CONNECTING. It should do so by producing a picker that queues RPCs and sets the state to CONNECTING.

I then copied an optimization from Java, which ignored sending duplicate CONNECTING pickers as the channel state and picker error remains the same.

This seems OK as long as a picker has been sent that says CONNECTING. You shouldn't suppress the first update based on the assumption that you started CONNECTING.
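The behavior described here, sending an initial CONNECTING queueing picker and only suppressing duplicates after that first update, could be sketched as follows. All types below are hypothetical stand-ins, not the real grpc-go API:

```go
package main

import (
	"errors"
	"fmt"
)

type state struct {
	connectivity string
	pickErr      error
}

// clientConn is a hypothetical stand-in recording the state updates
// the child policy pushes to its parent.
type clientConn struct{ updates []state }

func (cc *clientConn) UpdateState(s state) { cc.updates = append(cc.updates, s) }

var errNoSubConnAvailable = errors.New("no SubConn available")

type pickFirst struct {
	cc        *clientConn
	lastState string
}

// newPickFirst starts in CONNECTING and, crucially, tells its parent
// so by sending a queueing picker, rather than assuming the parent
// already knows.
func newPickFirst(cc *clientConn) *pickFirst {
	pf := &pickFirst{cc: cc, lastState: "CONNECTING"}
	pf.cc.UpdateState(state{connectivity: "CONNECTING", pickErr: errNoSubConnAvailable})
	return pf
}

// updateState suppresses duplicate CONNECTING updates, but only
// after the initial one has been sent.
func (pf *pickFirst) updateState(s state) {
	if s.connectivity == "CONNECTING" && pf.lastState == "CONNECTING" {
		return // duplicate: parent already holds a CONNECTING picker
	}
	pf.lastState = s.connectivity
	pf.cc.UpdateState(s)
}

func main() {
	cc := &clientConn{}
	pf := newPickFirst(cc)
	// This duplicate CONNECTING update is safely suppressed because
	// the initial one was actually sent, not merely assumed.
	pf.updateState(state{connectivity: "CONNECTING", pickErr: errNoSubConnAvailable})
	pf.updateState(state{connectivity: "READY"})
	fmt.Println(len(cc.updates)) // initial CONNECTING plus READY
}
```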

@dfawley
Member

dfawley commented Sep 30, 2024

I think this change is actually OK, but I'm concerned.

I would be OK with the assumptions here if it's documented in the balancer API and if we audit all parent LB policies that we own to make sure they conform to this behavior.

This is not currently documented, however, so it's questionable to propagate this assumption throughout the tree.

Also I don't see anything in the channel that does this, so I think it's the gracefulswitch balancer that makes the channel behave this way. But I'm not exactly clear on where we would set the channel state to CONNECTING while we're still waiting for the initial name resolver update. Maybe we aren't, and that's a bug? Or maybe we're initializing more than I was expecting even while waiting for the initial update?

@dfawley
Member

dfawley commented Sep 30, 2024

I filed #7686 to make sure we have a test case for the behavior in the last paragraph I wrote above.

@arjan-bal arjan-bal assigned arjan-bal and unassigned dfawley Sep 30, 2024
@arjan-bal arjan-bal force-pushed the endpoint-sharding-prevent-panic branch from 368ae08 to e6d5ac6 Compare September 30, 2024 19:13
@arjan-bal arjan-bal changed the title endpointsharding: Assume child policies start in CONNECTING state endpointsharding: Skip nil child pickers while updating ClientConn state Sep 30, 2024
@arjan-bal
Contributor Author

@dfawley I've changed the approach to ignore nil child pickers instead of assuming child LB policies start in CONNECTING. The PR description and title are updated to reflect the same.

@arjan-bal arjan-bal assigned dfawley and unassigned arjan-bal Sep 30, 2024
Comment on lines +220 to +223
es.cc.UpdateState(balancer.State{
	ConnectivityState: connectivity.Connecting,
	Picker:            base.NewErrPicker(balancer.ErrNoSubConnAvailable),
})
Member

If no children have produced a picker, why are we doing so here?

I think the safest behavior here is to either do what you said before (and go through the exercise to make "initial state == connecting" an API guarantee), or to only have this endpointsharding LB policy produce pickers in sync with its children: i.e. whenever any child updates its picker, this policy updates its combined picker.
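The second alternative, publishing the combined picker only in sync with child updates, might look roughly like this. This is a hypothetical sketch with stand-in types, not the actual endpointsharding code:

```go
package main

import "fmt"

// parent models a policy that never invents a picker on behalf of
// its children: it stays silent until some child publishes one.
type parent struct {
	childPickers map[string]string // child name -> latest picker label
	published    []string          // combined pickers pushed upward
}

// updateClientConnState configures children but publishes nothing:
// no child has produced a picker yet, so the parent stays silent.
func (p *parent) updateClientConnState(children []string) {
	if p.childPickers == nil {
		p.childPickers = map[string]string{}
	}
	for _, c := range children {
		if _, ok := p.childPickers[c]; !ok {
			p.childPickers[c] = "" // no picker yet
		}
	}
}

// onChildUpdate is called whenever a child pushes a new picker; only
// then does the parent regenerate and publish its combined picker.
func (p *parent) onChildUpdate(child, picker string) {
	p.childPickers[child] = picker
	combined := ""
	for c, pk := range p.childPickers {
		if pk != "" {
			combined += c + ":" + pk + " "
		}
	}
	p.published = append(p.published, combined)
}

func main() {
	par := &parent{}
	par.updateClientConnState([]string{"a", "b"})
	fmt.Println(len(par.published)) // nothing published before any child updates
	par.onChildUpdate("a", "ready-picker")
	fmt.Println(len(par.published)) // one combined picker, in sync with the child
}
```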

Contributor Author

If we don't send a picker here, the unit test added in this PR shows that the channel remains in the IDLE state.

@arjan-bal
Contributor Author

Closing this PR, will open it once #7686 is fixed.

@arjan-bal arjan-bal closed this Oct 1, 2024