Description
Problem:
At Box we have a number of large deployments (relative to our cluster capacity) with long drain times, which we would like to be able to roll back quickly when issues arise. The current rolling-update deployments can take 12+ hours due to drain time, pod count, and max unavailability. If something were to go wrong, it could take another 12+ hours to roll back.
One thing to note in this case: while the old pods need to stay around until the deployment is complete, their CPU utilization is inversely correlated with the new pods' CPU utilization, i.e. as we shift traffic to the new pods, traffic and utilization on the old pods drop correspondingly.
What would you like to be added:
We would like to implement a form of blue-green deployment (supports rapid rollbacks, lets the old version drain long requests), but the blocker we keep running into is that a naive blue-green implementation temporarily requires 2x the capacity, which is impossible in our bare-metal clusters and expensive in the public cloud.
We have discussed a "hack" internally where the blue and the green pods each request their resource limit divided by 2, and the scheduler then guarantees that the blue and green pods get scheduled together, either via pod affinities or a custom scheduler (a rough sketch is below). But we're nervous about diverging too much from the community on this, and worried about the scalability of pod affinities/anti-affinities.
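For concreteness, here is roughly what that hack looks like. This is only a sketch, not our actual manifests: the app name, image, replica count, and CPU numbers are illustrative, and it assumes each colour requests half of the app's normal CPU while the green pods use a required pod affinity to land on the same nodes as their blue counterparts.

```yaml
# Illustrative sketch of the half-request + pod-affinity hack (not a real manifest).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green            # hypothetical name
spec:
  replicas: 1000
  selector:
    matchLabels:
      app: myapp
      color: green
  template:
    metadata:
      labels:
        app: myapp
        color: green
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: myapp
                color: blue    # co-schedule each green pod with a blue pod
            topologyKey: kubernetes.io/hostname
      containers:
      - name: myapp
        image: registry.example.com/myapp:new   # hypothetical image
        resources:
          requests:
            cpu: "1"           # half of the 2-CPU request a standalone pod would make
          limits:
            cpu: "2"
```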
It feels like supporting low-overhead blue-green deployments in a non-hacky way is something worth enabling in core. I'm not proposing putting blue-green deployments into core as some other folks have, but rather adding the concept of a pod whose CPU utilization is inversely correlated with another pod's, so the two can be scheduled together without consuming additional capacity.
I'm also curious how other folks are solving these problems. Are there existing community solutions to this?
Some details:
- The deployments have 1000+ pods and a max unavailability of 10% (roughly sketched in the manifest after this list).
- Drains take up to an hour.
- New version should come up and be ready to take traffic in less than 10 minutes.
- New version should take traffic in 1%, 10%, 100% steps.
- Rollbacks should take less than 5 minutes.
- The pods are a significant fraction (~35%) of our cluster capacity and far exceed our reserved headroom, so we cannot spin up a full duplicate.
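To make the numbers above concrete, this is roughly the shape of the Deployment today (an illustrative manifest, not our real one), showing the 10% max unavailability and the hour-long termination grace period that drive the 10+ hour rollout:

```yaml
# Illustrative sketch of the current rolling-update configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                  # hypothetical name
spec:
  replicas: 1000
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 10%
      maxSurge: 0              # assumption: no spare capacity to surge extra pods
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      terminationGracePeriodSeconds: 3600   # drains take up to an hour
      containers:
      - name: myapp
        image: registry.example.com/myapp:current   # hypothetical image
```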
Trying to front-run some other suggestions people might steer us towards:
- A standard rolling update doesn't work because it would take ~10 hours just to roll out the new version (1000+ pods at 10% max unavailable means ~10 waves, each gated on an hour-long drain), and up to another 10 hours in the case of a rollback.
- Running pods with containers for both the current and previous version and using a standard rolling update solves the rollback problem, but the rollout still takes just as long.
- We could mutate the code in place within the container and achieve a number of our goals, but there are a ton of other issues with that and it becomes even more non-standard.
- We could use affinity rules to achieve this with the aforementioned hack around requests and limits (as sketched above), but we're worried about scheduler performance with 1000+ pods running.