introduce QueueingHint for wise-enqueueing #114297

@sanposhiho

What would you like to be added?

#87850 aims to almost entirely eliminate the flushing of unschedulable Pods into activeQ.
Once that is done, nearly every move to activeQ can be associated with a specific event.

Here, I propose two changes to PreEnqueue so that we can use it for wise-enqueueing.

1. Add parameters to PreEnqueue that describe what happened.

  • event: the event that causes the Pod to be moved back to schedQ/backoffQ. It is represented by framework.ClusterEvent.
  • involvedObj: the object involved in that event.
    • For example, if the event is "Node deleted", involvedObj is the deleted Node.
type EnqueuePlugin interface {
    Plugin
    PreEnqueue(ctx context.Context, state *CycleState, p *v1.Pod, event framework.ClusterEvent, involvedObj runtime.Object) *Status
}

In the following cases, event and involvedObj would be empty:

  • The enqueued Pod is newly created.
  • The Pod is moved back to schedQ/backoffQ by flushing.

2. Return Skip if the event is irrelevant to the plugin.

For example, NodeAffinity returns Success when a Node gets labels that match the Pod's NodeSelector, but returns Skip when a Node gets labels that don't match the Pod's NodeSelector.
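
To make the NodeAffinity example concrete, here is a minimal sketch of what such a PreEnqueue could look like. It is not the real plugin: `Code`, `ClusterEvent`, `Node`, `Pod`, `matchesSelector`, and `preEnqueue` are simplified stand-ins defined only for this illustration, and returning Success for an empty event is an assumption, since the proposal doesn't specify what plugins should do in that case.

```go
package main

import "fmt"

// Code is a simplified stand-in for the framework's status code.
type Code int

const (
	Success Code = iota
	Skip
)

// ClusterEvent is a simplified stand-in for framework.ClusterEvent.
type ClusterEvent struct {
	Resource   string // e.g. "Node"
	ActionType string // e.g. "UpdateNodeLabel"
}

// Node and Pod are minimal stand-ins for v1.Node and v1.Pod.
type Node struct{ Labels map[string]string }
type Pod struct{ NodeSelector map[string]string }

// matchesSelector reports whether labels satisfy every key/value in selector.
func matchesSelector(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// preEnqueue is a hypothetical NodeAffinity-style PreEnqueue that
// receives the event and the object involved in it.
func preEnqueue(p *Pod, event ClusterEvent, involvedObj interface{}) Code {
	// Assumption: an empty event (new Pod or flush) gives the plugin
	// nothing to decide on, so it does not hold the Pod back.
	if event == (ClusterEvent{}) {
		return Success
	}
	node, ok := involvedObj.(*Node)
	if !ok || event.Resource != "Node" {
		return Skip // the event is irrelevant to this plugin
	}
	if matchesSelector(p.NodeSelector, node.Labels) {
		return Success // the Node now matches the Pod's NodeSelector
	}
	return Skip // the Node's labels still don't match
}

func main() {
	pod := &Pod{NodeSelector: map[string]string{"disk": "ssd"}}
	ev := ClusterEvent{Resource: "Node", ActionType: "UpdateNodeLabel"}
	fmt.Println(preEnqueue(pod, ev, &Node{Labels: map[string]string{"disk": "ssd"}}) == Success)
	fmt.Println(preEnqueue(pod, ev, &Node{Labels: map[string]string{"disk": "hdd"}}) == Skip)
}
```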

Currently, PreEnqueue's return status is quite simple:

  • all PreEnqueue plugins return Success -> enqueue to activeQ.
  • any PreEnqueue plugin returns non-Success -> don't enqueue to activeQ.

By introducing Skip, we can change this to:

  • all PreEnqueue plugins which are in UnschedulablePlugins return Skip -> don't enqueue to activeQ.
    • This means that the event won't change the filtering results of any of the UnschedulablePlugins.
  • any PreEnqueue plugin returns a status other than Success or Skip -> don't enqueue to activeQ.
  • any other case -> enqueue to activeQ.
    • e.g., some UnschedulablePlugins return Success in PreEnqueue, or the Pod has no UnschedulablePlugins.
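
The rules above could be sketched roughly as follows. This is only an illustration under assumptions: `Code`, `Unschedulable`, and `shouldEnqueue` are hypothetical names, and a plugin listed in UnschedulablePlugins that has no PreEnqueue result is treated here as "not Skip" (so the Pod is enqueued), which the proposal doesn't state explicitly.

```go
package main

import "fmt"

// Code is a simplified stand-in for the framework's status code.
type Code int

const (
	Success Code = iota
	Skip
	Unschedulable
)

// shouldEnqueue applies the proposed rules.
// results maps a plugin name to the code its PreEnqueue returned.
// unschedulablePlugins lists the plugins that rejected the Pod in its
// last scheduling attempt.
func shouldEnqueue(results map[string]Code, unschedulablePlugins []string) bool {
	// Any status other than Success or Skip blocks enqueueing.
	for _, c := range results {
		if c != Success && c != Skip {
			return false
		}
	}
	// A Pod with no UnschedulablePlugins falls under "any other case".
	if len(unschedulablePlugins) == 0 {
		return true
	}
	// Enqueue as soon as one UnschedulablePlugin did not return Skip.
	// Assumption: a missing result counts as "not Skip".
	for _, name := range unschedulablePlugins {
		if c, ok := results[name]; !ok || c != Skip {
			return true
		}
	}
	// Every UnschedulablePlugin returned Skip: the event won't change
	// their filtering results, so the Pod stays where it is.
	return false
}

func main() {
	results := map[string]Code{"NodeAffinity": Skip, "NodeResourcesFit": Skip}
	fmt.Println(shouldEnqueue(results, []string{"NodeAffinity", "NodeResourcesFit"}))

	results["NodeAffinity"] = Success
	fmt.Println(shouldEnqueue(results, []string{"NodeAffinity", "NodeResourcesFit"}))
}
```

With this shape, the scheduling queue only needs the per-plugin results and the Pod's UnschedulablePlugins; it needs no knowledge of any specific plugin.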

Why is this needed?

It would contribute significantly to wise-enqueueing and thus to overall scheduler performance.
This proposal would allow plugins to bring their own logic into enqueueing through PreEnqueue. We can probably implement PreEnqueue in all Filter plugins.
We could then move a Pod only when it is likely to be schedulable in the next scheduling cycle, and move it back to the unschedulable Pod pool when the event won't change the filtering results of any of its UnschedulablePlugins.


Currently, the scheduler always moves unschedulable Pods to activeQ whenever a framework.ClusterEvent that is registered in EventsToRegister by any of the UnschedulablePlugins happens.
This is clearly inefficient, and in fact the scheduler already filters out some events before letting the scheduling queue know that the event happened:

// getUnschedulablePodsWithMatchingAffinityTerm returns unschedulable pods which have
// any affinity term that matches "pod".
// NOTE: this function assumes lock has been acquired in caller.
func (p *PriorityQueue) getUnschedulablePodsWithMatchingAffinityTerm(pod *v1.Pod) []*framework.QueuedPodInfo {
	var nsLabels labels.Set
	nsLabels = interpodaffinity.GetNamespaceLabelsSnapshot(pod.Namespace, p.nsLister)
	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		for _, term := range pInfo.RequiredAffinityTerms {
			if term.Matches(pod, nsLabels) {
				podsToMove = append(podsToMove, pInfo)
				break
			}
		}
	}
	return podsToMove
}

That results in issue #110175: Pods that were rejected by other plugins which have registered events are robbed of their chance to be moved to activeQ.

Having such logic for specific plugins in the scheduler core hurts the extensibility of out-of-tree plugins. The scheduler itself should be pure: it should treat all plugins equally and ideally shouldn't do anything special for any in-tree plugin implementation.

/sig scheduling
/assign
/kind feature
