Description
What would you like to be added?
NodeLifeCycleController applies predefined NoExecute taints (e.g., Unreachable, NotReady) to mark nodes as unhealthy when it cannot obtain heartbeats from them. Once a node is tainted, the taint-manager starts evicting the pods running on it. The taint-manager acts on any NoExecute taint, including arbitrary ones that anyone can add, not only the predefined ones.
We propose to:
- Decouple the `taint-manager`, which performs taint-based pod eviction, from `NodeLifeCycleController`, and make them two separate controllers: `NodeLifeCycleManager` to add taints to unhealthy nodes and `TaintManager` to perform pod eviction on tainted nodes.
- Allow the default `TaintManager` to be turned off and replaced with a custom implementation.
Why is this needed?
- `NodeLifeCycleController` combines two independent functions: adding a predefined set of `NoExecute` taints to unhealthy nodes, and performing pod eviction on arbitrary `NoExecute` taints. Mixing them together is not ideal. Decoupling them would handle more general cases and let different and custom node taints be managed in a more flexible, consistent, and extensible manner, similar to other Kubernetes controllers such as the scheduler.
- The default `NoExecute` taint-manager works well in most cases, but it is not extensible enough to meet the in-house requirements of complex workloads, such as stateful workloads with local storage. For example, whether or not to evict a stateful pod with local storage on `NoExecute` can depend on the actual taint conditions, the workload status, and other conditions. Such workloads often require a custom `taint-manager` to control how and when a pod is evicted on `NoExecute`.
- Starting with Kubernetes 1.27, with the removal of the `enable-taint-manager` flag, extending taint-based eviction with a custom implementation becomes more challenging, or even impossible.
- The requirement for custom taint-based eviction is not unique to us. It exists even in vanilla Kubernetes, in what we call `FullyDisruptionMode`: when all nodes are unhealthy, `NodeLifeCycleController` chooses not to apply `NoExecute` taints as it normally would. In a real-world case, the integrator may choose to define the disruption rate, budget, and so on.
Considered Alternatives
While operators can opt out of taint-based eviction by injecting tolerations for NoExecute taints, this has a few major disadvantages compared with a custom implementation of TaintManager.
- It won’t allow users to leverage the existing tolerations API to interface with the component in charge of handling their termination.
- It will require injecting tolerations into all of the Pods via mutating webhooks, changing the manifests of all the running workloads, or both.
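To make the first disadvantage concrete, here is a small Go sketch of what the toleration-injection alternative amounts to. The `Toleration` and `Taint` structs are local stand-ins that mirror a subset of the fields of the Kubernetes core/v1 types, and `tolerates`, `toleratesAny`, and `injectCatchAll` are hypothetical helpers written for this example. The matching logic is a simplified approximation of how tolerations match taints, not the upstream code.

```go
package main

import "fmt"

// Toleration mirrors a subset of the core/v1 Toleration fields (local stand-in).
type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal"; empty is treated as "Equal"
	Value    string
	Effect   string
}

// Taint mirrors a subset of the core/v1 Taint fields (local stand-in).
type Taint struct {
	Key, Value, Effect string
}

// tolerates is a simplified matcher: an empty Effect or Key on the
// toleration matches any effect or key respectively.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

func toleratesAny(tols []Toleration, t Taint) bool {
	for _, tol := range tols {
		if tolerates(tol, t) {
			return true
		}
	}
	return false
}

// injectCatchAll does what a mutating webhook would have to do for every
// pod: append a toleration that matches all NoExecute taints.
func injectCatchAll(tols []Toleration) []Toleration {
	catchAll := Toleration{Operator: "Exists", Effect: "NoExecute"}
	for _, t := range tols {
		if t == catchAll {
			return tols // already present
		}
	}
	out := append([]Toleration{}, tols...)
	return append(out, catchAll)
}

func main() {
	unreachable := Taint{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"}
	var podTols []Toleration

	fmt.Println(toleratesAny(podTols, unreachable))                 // false
	fmt.Println(toleratesAny(injectCatchAll(podTols), unreachable)) // true
}
```

After injection, every pod tolerates every NoExecute taint, so the tolerations a user wrote no longer tell an external eviction component anything about how that pod wants to be handled.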
Initial Proposal
https://docs.google.com/document/d/1iL9rAHs5qUpH5VXTMrY17Lx0fkhT5pXJhzoQu2z4xDI/edit?usp=sharing
Reference
- `NodeLifecycleController` only adds a subset of node taints: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/nodelifecycle/node_lifecycle_controller.go#L666
- `taint-manager` acts on all taints: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/nodelifecycle/scheduler/taint_manager.go
- Taint-based pod eviction: https://github.com/kubernetes/enhancements/blob/74e610bb0f7e40862688e8a434c77bfafc53cb9e/keps/sig-scheduling/20200114-taint-based-evictions.md
- The `enable-taint-manager` flag is deprecated and will be removed in 1.27: `fs.MarkDeprecated("enable-taint-manager", "This flag is deprecated and it will be removed in 1.27. The taint-manager is enabled by default and will remain implicitly enabled once this flag is removed.")`