
Decouple Taint-based Pod Eviction from Node Lifecycle Controller #115779

@yuanchen8911

Description


What would you like to be added?

NodeLifecycleController applies predefined NoExecute taints (e.g., Unreachable, NotReady) to mark nodes as unhealthy when it stops receiving heartbeat acknowledgements from them. After the nodes are tainted, the taint manager starts evicting running pods on those nodes based on any NoExecute taint, including arbitrary taints that can be added by anyone.
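For reference, a minimal Go sketch of the two predefined taints is shown below. It only constructs the taint objects using the well-known constants from k8s.io/api/core/v1; it does not reproduce the controller's actual logic.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// The two node-condition taints that NodeLifecycleController applies with
	// the NoExecute effect when node heartbeats stop arriving.
	notReady := v1.Taint{
		Key:    v1.TaintNodeNotReady, // "node.kubernetes.io/not-ready"
		Effect: v1.TaintEffectNoExecute,
	}
	unreachable := v1.Taint{
		Key:    v1.TaintNodeUnreachable, // "node.kubernetes.io/unreachable"
		Effect: v1.TaintEffectNoExecute,
	}
	fmt.Println(notReady.Key, unreachable.Key)
}
```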

We propose to

  1. Decouple the taint manager that performs taint-based pod eviction from NodeLifecycleController and make them two separate controllers: NodeLifecycleController, which adds taints to unhealthy nodes, and TaintManager, which evicts pods from tainted nodes.
  2. Allow the default TaintManager to be turned off and replaced with a custom implementation (a sketch of such a pluggable interface follows below).
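A hypothetical shape of the pluggable extension point is sketched below. The interface, method names, and package layout are illustrative assumptions for this proposal, not an existing Kubernetes API.

```go
package taintmanager

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// EvictionPolicy is a hypothetical extension point: a custom TaintManager
// would implement it to decide whether a pod should be evicted from a node
// carrying NoExecute taints.
type EvictionPolicy interface {
	// ShouldEvict reports whether the pod must be evicted from the node,
	// given the node's current NoExecute taints.
	ShouldEvict(ctx context.Context, pod *v1.Pod, node *v1.Node, taints []v1.Taint) (bool, error)
}

// Manager sketches a standalone taint-eviction controller that watches
// tainted nodes and delegates the eviction decision to an EvictionPolicy.
type Manager struct {
	policy EvictionPolicy
}

// NewManager wires a custom policy into the standalone controller.
func NewManager(p EvictionPolicy) *Manager {
	return &Manager{policy: p}
}
```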

Why is this needed?

  • NodeLifecycleController combines two independent functions: adding a predefined set of NoExecute taints to unhealthy nodes and performing pod eviction based on arbitrary NoExecute taints. Mixing them together is not ideal. Decoupling them can handle more general cases and manage different and custom node taints in a more flexible, consistent, and extensible manner, similar to other Kubernetes controllers such as the scheduler.

  • The default NoExecute taint manager works well in most cases, but it is not extensible enough to meet the in-house requirements of complex workloads, such as stateful workloads with local storage. For example, whether or not to evict a stateful pod with local storage on a NoExecute taint can depend on the actual taint conditions, the workload's status, and other factors. Such workloads often require a custom taint manager to control how and when a pod is evicted on NoExecute (see the sketch after this list).

  • Starting with Kubernetes 1.27, with the removal of the --enable-taint-manager flag, extending taint-based eviction with a custom implementation becomes more challenging, or even impossible.

  • The requirement for custom taint-based eviction is not unique; it exists even in vanilla Kubernetes, in what is called the FullDisruption mode: when all nodes in a zone are unhealthy, NodeLifecycleController deliberately stops applying NoExecute taints. In a real-world case, the integrator may want to define the disruption rate, budget, etc.
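To make the stateful-workload case concrete, the sketch below shows what a custom policy might look like under the hypothetical EvictionPolicy interface above. The grace period and the way local storage is detected are assumptions for illustration only; a real policy would, for example, resolve PersistentVolumeClaims to check whether the bound PersistentVolume uses a Local volume source.

```go
package taintmanager

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
)

// localStoragePolicy is an illustrative custom policy: pods that appear to
// keep data on the node are evicted only after a NoExecute taint has been
// present for longer than a grace period; all other pods are evicted
// immediately, mirroring the default behavior.
type localStoragePolicy struct {
	gracePeriod time.Duration
}

func (p *localStoragePolicy) ShouldEvict(ctx context.Context, pod *v1.Pod, node *v1.Node, taints []v1.Taint) (bool, error) {
	usesLocalStorage := false
	for _, vol := range pod.Spec.Volumes {
		// Crude stand-in for "local storage"; a real policy would also
		// inspect the PersistentVolumes bound to the pod's claims.
		if vol.HostPath != nil || vol.EmptyDir != nil {
			usesLocalStorage = true
			break
		}
	}
	if !usesLocalStorage {
		return true, nil // default behavior: evict as soon as NoExecute taints appear
	}
	for _, t := range taints {
		if t.Effect != v1.TaintEffectNoExecute || t.TimeAdded == nil {
			continue
		}
		if time.Since(t.TimeAdded.Time) > p.gracePeriod {
			return true, nil // the node has been unhealthy long enough; give up on it
		}
	}
	return false, nil // keep the pod and its local data for now
}
```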

Considered Alternatives

While operators can opt out of taint-based eviction by injecting tolerations for NoExecute taints, this approach has a few major disadvantages compared to a custom implementation of TaintManager.

  • It won’t allow users to leverage the existing tolerations API to interface with the component in charge of handling their termination.
  • It requires either injecting tolerations into all of the Pods via mutating webhooks or changing the manifests of all the running workloads (an illustrative toleration is sketched below).
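For completeness, the toleration-injection alternative looks roughly like the snippet below (again using k8s.io/api/core/v1 types). Every workload would need such a toleration, which is exactly the maintenance burden described above; the one-hour value is an arbitrary example.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// A toleration that keeps a pod on a node carrying the unreachable
	// NoExecute taint for up to one hour before eviction. Opting out this way
	// means injecting it into every Pod via a mutating webhook or by editing
	// the workload manifests.
	seconds := int64(3600)
	tol := v1.Toleration{
		Key:               v1.TaintNodeUnreachable,
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
	fmt.Printf("%+v\n", tol)
}
```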

Initial Proposal

https://docs.google.com/document/d/1iL9rAHs5qUpH5VXTMrY17Lx0fkhT5pXJhzoQu2z4xDI/edit?usp=sharing
