
Pod failures #4557

@jonjohnsonjr

Description

This is a tracking issue for detecting and surfacing problems with a user's pods. There are a variety of failure modes, and so far we've been dealing with them in a very ad-hoc manner. Let's enumerate them here and start a discussion towards a more deliberate solution so we don't have to continue playing whack-a-mole.

Detection

We currently try to detect pod failures in the revision reconciler when reconciling a deployment. This logic will probably move to the autoscaler, but remains largely the same.

We look at a single pod to determine if:

  1. It could not be scheduled.
  2. The user container terminated.
  3. The user container is waiting for too long.

Since we only look at a single pod, we can only surface issues that always affect every pod in a deployment, e.g. the image cannot be pulled, the container crashes on start, or the cluster has no resources. We should fix this, likely by looking at every pod's status.

It's unclear to me if there's a way to generically detect all of these issues.
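As a rough illustration of what "looking at every pod's status" could mean, here's a minimal client-go sketch (not the actual reconciler code) that lists all pods behind a revision and collects the three failure signals above; the label selector and reason strings are assumptions for illustration:

```go
package detect

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podFailure records one failure signal observed on one pod.
type podFailure struct {
	Pod    string
	Reason string // e.g. Unschedulable, ImagePullBackOff, CrashLoopBackOff, OOMKilled
}

// collectPodFailures inspects every pod backing a revision (not just one)
// and returns any scheduling, termination, or waiting problems it finds.
func collectPodFailures(ctx context.Context, kc kubernetes.Interface, ns, revision string) ([]podFailure, error) {
	pods, err := kc.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		// Assumed selector; the revision label key is just an illustration.
		LabelSelector: fmt.Sprintf("serving.knative.dev/revision=%s", revision),
	})
	if err != nil {
		return nil, err
	}

	var failures []podFailure
	for _, p := range pods.Items {
		// 1. The pod could not be scheduled.
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse {
				failures = append(failures, podFailure{Pod: p.Name, Reason: c.Reason})
			}
		}
		for _, cs := range p.Status.ContainerStatuses {
			// 2. The user container terminated.
			if t := cs.State.Terminated; t != nil {
				failures = append(failures, podFailure{Pod: p.Name, Reason: t.Reason})
			}
			// 3. The user container is waiting (e.g. ImagePullBackOff, CrashLoopBackOff).
			if w := cs.State.Waiting; w != nil {
				failures = append(failures, podFailure{Pod: p.Name, Reason: w.Reason})
			}
		}
	}
	return failures, nil
}
```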

Categorization

Ideally we could distill these issues down to a small set of buckets so we can deal with the issues in a generic way. I don't have a good answer here, but a non-exhaustive list of things we've encountered thus far:

  1. We can't schedule pods because the cluster has insufficient resources: "Surface pod scheduling errors in service status" #4153, "A deployment with all unschedulable pods should show up in revision status" #3593
  2. We can't create the deployment because we are out of ResourceQuota: "Surface Revision quota problems" #496
  3. We can't scale up the deployment because we are out of ResourceQuota: "ResourceQuota error isn't reflected in kservice status" #4416
  4. We can't start the container because we can't pull the image: "Surface image pull errors in service status" #4192
  5. The container crashes upon starting: "Surface startup issues in Revision status." #499, "Stop scaling up if pods are crashlooping" #2145
  6. The container starts, but is eventually killed with OOMKilled: "Add logs for container in bad status" #4534

A: For 1, 2, 4, and 5, the revision may never be able to serve traffic, though the underlying problem may turn out to be temporary.

B: For 1 and 3, the revision may be serving traffic, but we are unable to continue scaling.

C: For 6, the revision can serve traffic, but will experience intermittent failures. This could be caused by a memory leak, a query of death, a bug in the code, or insufficient resource limits.

I invite suggestions for names/conditions for these categories. I suspect we'd want to surface these different kinds of failures in different ways...
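To make the buckets concrete, here is a purely illustrative Go sketch that maps the standard kubelet/scheduler reason strings onto the categories above. The bucket names are placeholders, and cases 2 and 3 (ResourceQuota) typically surface as events/conditions on the Deployment or ReplicaSet rather than on pods, so they'd need separate handling:

```go
package buckets

// category is a placeholder name for the buckets sketched above.
type category string

const (
	catUnknown      category = ""
	catFatal        category = "A" // the revision may never be able to serve traffic
	catScaleBlocked category = "B" // serving, but unable to continue scaling
	catIntermittent category = "C" // serving, with intermittent failures
)

// categorize maps a pod-level failure reason to a bucket. Unschedulable is
// listed under both A and B above; for simplicity this sketch treats it as A.
func categorize(reason string) category {
	switch reason {
	case "Unschedulable", "ErrImagePull", "ImagePullBackOff", "CrashLoopBackOff":
		return catFatal // cases 1, 4, and 5
	case "OOMKilled":
		return catIntermittent // case 6
	default:
		return catUnknown // e.g. quota failures never reach the pod level
	}
}
```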

Reporting

For category A, we definitely want to surface a fatal condition in the Revision status, which should get propagated up to the Service status, because the user needs to take some action in order to fix their Revision.

For category B, I suspect we want to do something similar, but as an informational rather than a fatal condition. The user should take action to unblock the autoscaler, perhaps by notifying the cluster operator. In the case where we can't scale up to min_scale, this should probably be fatal.
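As a sketch of how A and B might be expressed, assuming the knative.dev/pkg/apis condition types (the "ScalingBlocked" condition type and the reasons below are hypothetical, not existing API):

```go
package reporting

import (
	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"
)

// Category A: fatal -- flip the Ready condition with error severity so the
// failure propagates up from the Revision.
var fatalCondition = apis.Condition{
	Type:     apis.ConditionReady,
	Status:   corev1.ConditionFalse,
	Severity: apis.ConditionSeverityError,
	Reason:   "ImagePullBackOff",
	Message:  "Unable to pull the user-container image.",
}

// Category B: informational -- the revision keeps serving, but we record that
// scaling is blocked so the user (or the cluster operator) can act on it.
var scalingBlockedCondition = apis.Condition{
	Type:     "ScalingBlocked", // hypothetical condition type
	Status:   corev1.ConditionFalse,
	Severity: apis.ConditionSeverityInfo,
	Reason:   "InsufficientResourceQuota",
	Message:  "Cannot create additional pods: exceeded ResourceQuota.",
}
```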

For category C, the problem will be intermittent, and Kubernetes is designed to handle these failures. The best we could do here is to somehow help the user diagnose these issues by surfacing what happened -- possibly by injecting some information into their logs?
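One hedged idea for surfacing category C: have the reconciler read each container's last termination state and emit a Kubernetes Event on the Revision whenever a user container was OOMKilled, so the failure at least shows up in `kubectl describe` and the event stream. The event reason and wiring below are assumptions, not an existing mechanism:

```go
package reporting

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// surfaceOOMKills emits a warning Event on the revision (or any owning object)
// for each user container that was recently OOMKilled. The pod keeps running
// after a restart, so without this the failure is easy to miss.
func surfaceOOMKills(recorder record.EventRecorder, rev runtime.Object, pod *corev1.Pod) {
	for _, cs := range pod.Status.ContainerStatuses {
		lt := cs.LastTerminationState.Terminated
		if lt == nil || lt.Reason != "OOMKilled" {
			continue
		}
		recorder.Eventf(rev, corev1.EventTypeWarning, "UserContainerOOMKilled",
			"container %q in pod %q was OOMKilled at %s (exit code %d)",
			cs.Name, pod.Name, lt.FinishedAt.Time.Format(time.RFC3339), lt.ExitCode)
	}
}
```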


Labels

area/API, area/autoscale, area/monitoring, kind/feature, lifecycle/frozen, triage/accepted
