Description
What would you like to be added?
#87850 aims to almost entirely remove the flushing of unschedulable Pods into activeQ.
Once that is done, almost every move to activeQ can be associated with a specific event.
Here, I'd propose two changes to PreEnqueue so that we can use it for wise enqueueing.
1. Add parameters to PreEnqueue that describe what happened.
- event: the event by which the Pod will be moved back to schedQ/backoffQ. It's represented by framework.ClusterEvent.
- involvedObj: the object involved in that event.
  - For example, if the given event is "Node deleted", involvedObj will be that deleted Node.
type EnqueuePlugin interface {
	Plugin
	PreEnqueue(ctx context.Context, state *CycleState, p *v1.Pod, event framework.ClusterEvent, involvedObj runtime.Object) *Status
}

Also, in the following cases, event and involvedObj would be empty:
- If the enqueued Pod is newly created.
- If the Pod is moved back to schedQ/backoffQ by flushing.
2. Return Skip if the event is irrelevant to the plugin.
e.g., NodeAffinity returns Success when a Node gets labels that match the Pod's NodeSelector, but it returns Skip when a Node gets labels that don't match the Pod's NodeSelector.
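The NodeAffinity example above could look roughly like the sketch below. It is only an illustration: `Code`, `ClusterEvent`, `Node`, `Pod`, `matches`, and `preEnqueue` are simplified, hypothetical stand-ins rather than the real framework types, and returning Success for an empty event (new Pod or flush) is an assumption, since the proposal doesn't specify it.

```go
package main

import "fmt"

// Simplified stand-ins for the scheduler framework types. These are
// assumptions for illustration, not the real Kubernetes definitions.
type Code int

const (
	Success Code = iota
	Skip
)

type ClusterEvent struct {
	Resource   string // e.g. "Node"
	ActionType string // e.g. "UpdateNodeLabel"
}

type Node struct{ Labels map[string]string }

type Pod struct{ NodeSelector map[string]string }

// matches reports whether every selector key/value pair is present in labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// preEnqueue mimics the proposed NodeAffinity behavior: Success when the
// event may make the Pod schedulable, Skip when the event is irrelevant.
func preEnqueue(p *Pod, event ClusterEvent, involvedObj interface{}) Code {
	// Assumption: an empty event (new Pod or flush) carries nothing to
	// evaluate, so the plugin does not hold the Pod back.
	if event == (ClusterEvent{}) {
		return Success
	}
	node, ok := involvedObj.(*Node)
	if !ok || event.Resource != "Node" {
		return Skip // not a Node event, nothing for this plugin to decide
	}
	if matches(p.NodeSelector, node.Labels) {
		return Success
	}
	return Skip
}

func main() {
	pod := &Pod{NodeSelector: map[string]string{"disk": "ssd"}}
	ev := ClusterEvent{Resource: "Node", ActionType: "UpdateNodeLabel"}
	fmt.Println(preEnqueue(pod, ev, &Node{Labels: map[string]string{"disk": "ssd"}}) == Success)
	fmt.Println(preEnqueue(pod, ev, &Node{Labels: map[string]string{"disk": "hdd"}}) == Skip)
}
```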
Currently, PreEnqueue's return status is quite simple:
- all PreEnqueue plugins return Success -> enqueue to activeQ.
- any PreEnqueue plugin returns non-Success -> don't enqueue to activeQ.
By introducing Skip, we can change this to:
- all PreEnqueue plugins which are in UnschedulablePlugins return Skip -> don't enqueue to activeQ.
  - it means that the event won't change the filtering results of any of the UnschedulablePlugins.
- any PreEnqueue plugin returns a status other than Success or Skip -> don't enqueue to activeQ.
- any other cases -> enqueue to activeQ.
  - e.g., some UnschedulablePlugins return Success in PreEnqueue, or the Pod has no UnschedulablePlugins.
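The rule above can be sketched as a small decision function. This is a minimal illustration under stated assumptions: `Code`, `Unschedulable`, and `shouldEnqueue` are hypothetical names, and treating an UnschedulablePlugin that has no PreEnqueue result as "not Skip" (so the Pod is enqueued) is a conservative guess that the proposal doesn't spell out.

```go
package main

import "fmt"

// Code is a simplified stand-in for the framework status code.
type Code int

const (
	Success Code = iota
	Skip
	Unschedulable
)

// shouldEnqueue applies the proposed rule. results maps a plugin name to the
// code its PreEnqueue returned; unschedulablePlugins lists the plugins that
// rejected the Pod in its last scheduling attempt.
func shouldEnqueue(results map[string]Code, unschedulablePlugins []string) bool {
	// Any status other than Success or Skip blocks enqueueing.
	for _, c := range results {
		if c != Success && c != Skip {
			return false
		}
	}
	// A Pod with no UnschedulablePlugins falls under "any other cases".
	if len(unschedulablePlugins) == 0 {
		return true
	}
	// Enqueue as soon as one UnschedulablePlugin did not return Skip.
	// Assumption: a plugin without a PreEnqueue result counts as "not Skip".
	for _, name := range unschedulablePlugins {
		if c, ok := results[name]; !ok || c != Skip {
			return true
		}
	}
	// Every UnschedulablePlugin returned Skip: the event cannot help this Pod.
	return false
}

func main() {
	results := map[string]Code{"NodeAffinity": Skip, "TaintToleration": Success}
	fmt.Println(shouldEnqueue(results, []string{"NodeAffinity"}))
	fmt.Println(shouldEnqueue(results, []string{"NodeAffinity", "TaintToleration"}))
}
```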
Why is this needed?
It would contribute significantly to wise enqueueing and thus to overall scheduler performance.
This proposal would allow plugins to bring their specific logic into enqueueing through PreEnqueue. We can probably implement PreEnqueue in all Filter plugins.
We could then move Pods to activeQ only when they are likely to be schedulable in the next scheduling cycle, and move them back to the unschedulable Pod pool when the event won't change the filtering results of any of the UnschedulablePlugins.
Currently, the scheduler always moves unschedulable Pods to activeQ when any framework.ClusterEvent that is registered in EventsToRegister of any of the UnschedulablePlugins happens.
That is clearly inefficient, and in practice the scheduler already filters out some events before letting the scheduling queue know that the event happened.
kubernetes/pkg/scheduler/internal/queue/scheduling_queue.go
Lines 660 to 678 in 31a1024
// getUnschedulablePodsWithMatchingAffinityTerm returns unschedulable pods which have
// any affinity term that matches "pod".
// NOTE: this function assumes lock has been acquired in caller.
func (p *PriorityQueue) getUnschedulablePodsWithMatchingAffinityTerm(pod *v1.Pod) []*framework.QueuedPodInfo {
	var nsLabels labels.Set
	nsLabels = interpodaffinity.GetNamespaceLabelsSnapshot(pod.Namespace, p.nsLister)

	var podsToMove []*framework.QueuedPodInfo
	for _, pInfo := range p.unschedulablePods.podInfoMap {
		for _, term := range pInfo.RequiredAffinityTerms {
			if term.Matches(pod, nsLabels) {
				podsToMove = append(podsToMove, pInfo)
				break
			}
		}
	}

	return podsToMove
}
That filtering causes issue #110175: Pods that were rejected by other plugins, which registered the same events, lose their chance of being moved to activeQ.
Having such plugin-specific logic in the scheduler core also limits the extensibility of out-of-tree plugins. The scheduler itself should be pure, should treat all plugins equally, and ideally shouldn't do anything special for any in-tree plugin implementation.
/sig scheduling
/assign
/kind feature