
Pod failures #4557

@jonjohnsonjr

Description

This is a tracking issue for detecting and surfacing problems with a user's pods. There are a variety of failure modes, and so far we've been dealing with them in a very ad-hoc manner. Let's enumerate them here and start a discussion towards a more deliberate solution so we don't have to continue playing whack-a-mole.

Detection

We currently try to detect pod failures in the revision reconciler when reconciling a deployment. This logic will probably move to the autoscaler, but remains largely the same.

We look at a single pod to determine if:

  1. It could not be scheduled.
  2. The user container terminated.
  3. The user container is waiting for too long.

Since we only look at a single pod, we can only surface issues that always affect every pod in a deployment, e.g. the image cannot be pulled, the container crashes on start, or the cluster has no resources. We should fix this, likely by looking at every pod's status.

It's unclear to me if there's a way to generically detect all of these issues.
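As a rough illustration of what "looking at every pod's status" could mean, here's a minimal client-go sketch (not the actual reconciler code) that lists all pods behind a revision and collects the three failure signals above; the label selector and reason strings are assumptions for illustration:

```go
package detect

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podFailure records one failure signal observed on one pod.
type podFailure struct {
	Pod    string
	Reason string // e.g. Unschedulable, ImagePullBackOff, CrashLoopBackOff, OOMKilled
}

// collectPodFailures inspects every pod backing a revision (not just one)
// and returns any scheduling, termination, or waiting problems it finds.
func collectPodFailures(ctx context.Context, kc kubernetes.Interface, ns, revision string) ([]podFailure, error) {
	pods, err := kc.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		// Assumed selector; the revision label key is just an illustration.
		LabelSelector: fmt.Sprintf("serving.knative.dev/revision=%s", revision),
	})
	if err != nil {
		return nil, err
	}

	var failures []podFailure
	for _, p := range pods.Items {
		// 1. The pod could not be scheduled.
		for _, c := range p.Status.Conditions {
			if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse {
				failures = append(failures, podFailure{Pod: p.Name, Reason: c.Reason})
			}
		}
		for _, cs := range p.Status.ContainerStatuses {
			// 2. The user container terminated.
			if t := cs.State.Terminated; t != nil {
				failures = append(failures, podFailure{Pod: p.Name, Reason: t.Reason})
			}
			// 3. The user container is waiting (e.g. ImagePullBackOff, CrashLoopBackOff).
			if w := cs.State.Waiting; w != nil {
				failures = append(failures, podFailure{Pod: p.Name, Reason: w.Reason})
			}
		}
	}
	return failures, nil
}
```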

Categorization

Ideally we could distill these issues down to a small set of buckets so we can deal with the issues in a generic way. I don't have a good answer here, but a non-exhaustive list of things we've encountered thus far:

  1. We can't schedule pods because the cluster has insufficient resources: "Surface pod scheduling errors in service status" #4153, "A deployment with all unschedulable pods should show up in revision status" #3593
  2. We can't create the deployment because we are out of ResourceQuota: "Surface Revision quota problems" #496
  3. We can't scale up the deployment because we are out of ResourceQuota: "ResourceQuota error isn't reflected in kservice status" #4416
  4. We can't start the container because we can't pull the image: "Surface image pull errors in service status" #4192
  5. The container crashes upon starting: "Surface startup issues in Revision status." #499, "Stop scaling up if pods are crashlooping" #2145
  6. The container starts, but is eventually killed with OOMKilled: "Add logs for container in bad status" #4534

A: For 1, 2, 4, and 5, the revision may never be able to serve traffic, though the underlying problem may turn out to be temporary.

B: For 1 and 3, the revision may be serving traffic, but we are unable to continue scaling.

C: For 6, the revision can serve traffic, but will experience intermittent failures. This could be caused by a memory leak, a query of death, a bug in the code, or insufficient resource limits.

I invite suggestions for names/conditions for these categories. I suspect we'd want to surface these different kinds of failures in different ways...
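To make the buckets concrete, here is a purely illustrative Go sketch that maps the standard kubelet/scheduler reason strings onto the categories above. The bucket names are placeholders, and cases 2 and 3 (ResourceQuota) typically surface as events/conditions on the Deployment or ReplicaSet rather than on pods, so they'd need separate handling:

```go
package buckets

// category is a placeholder name for the buckets sketched above.
type category string

const (
	catUnknown      category = ""
	catFatal        category = "A" // the revision may never be able to serve traffic
	catScaleBlocked category = "B" // serving, but unable to continue scaling
	catIntermittent category = "C" // serving, with intermittent failures
)

// categorize maps a pod-level failure reason to a bucket. Unschedulable is
// listed under both A and B above; for simplicity this sketch treats it as A.
func categorize(reason string) category {
	switch reason {
	case "Unschedulable", "ErrImagePull", "ImagePullBackOff", "CrashLoopBackOff":
		return catFatal // cases 1, 4, and 5
	case "OOMKilled":
		return catIntermittent // case 6
	default:
		return catUnknown // e.g. quota failures never reach the pod level
	}
}
```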

Reporting

For category A, we definitely want to surface a fatal condition in the Revision status, which should get propagated up to the Service status, because the user needs to take some action in order to fix their Revision.

For category B, I suspect we want to do something similar, but as an informational rather than a fatal condition. The user should take action to unblock the autoscaler, perhaps by notifying the cluster operator. In the case where we can't scale up to min_scale, this should probably be fatal.
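As a sketch of how A and B might be expressed, assuming the knative.dev/pkg/apis condition types (the "ScalingBlocked" condition type and the reasons below are hypothetical, not existing API):

```go
package reporting

import (
	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/apis"
)

// Category A: fatal -- flip the Ready condition with error severity so the
// failure propagates up from the Revision.
var fatalCondition = apis.Condition{
	Type:     apis.ConditionReady,
	Status:   corev1.ConditionFalse,
	Severity: apis.ConditionSeverityError,
	Reason:   "ImagePullBackOff",
	Message:  "Unable to pull the user-container image.",
}

// Category B: informational -- the revision keeps serving, but we record that
// scaling is blocked so the user (or the cluster operator) can act on it.
var scalingBlockedCondition = apis.Condition{
	Type:     "ScalingBlocked", // hypothetical condition type
	Status:   corev1.ConditionFalse,
	Severity: apis.ConditionSeverityInfo,
	Reason:   "InsufficientResourceQuota",
	Message:  "Cannot create additional pods: exceeded ResourceQuota.",
}
```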

For category C, the problem will be intermittent, and Kubernetes is designed to handle these failures. The best we could do here is to somehow help the user diagnose these issues by surfacing what happened -- possibly by injecting some information into their logs?
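One hedged idea for surfacing category C: have the reconciler read each container's last termination state and emit a Kubernetes Event on the Revision whenever a user container was OOMKilled, so the failure at least shows up in `kubectl describe` and the event stream. The event reason and wiring below are assumptions, not an existing mechanism:

```go
package reporting

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// surfaceOOMKills emits a warning Event on the revision (or any owning object)
// for each user container that was recently OOMKilled. The pod keeps running
// after a restart, so without this the failure is easy to miss.
func surfaceOOMKills(recorder record.EventRecorder, rev runtime.Object, pod *corev1.Pod) {
	for _, cs := range pod.Status.ContainerStatuses {
		lt := cs.LastTerminationState.Terminated
		if lt == nil || lt.Reason != "OOMKilled" {
			continue
		}
		recorder.Eventf(rev, corev1.EventTypeWarning, "UserContainerOOMKilled",
			"container %q in pod %q was OOMKilled at %s (exit code %d)",
			cs.Name, pod.Name, lt.FinishedAt.Time.Format(time.RFC3339), lt.ExitCode)
	}
}
```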


Labels

area/API, area/autoscale, area/monitoring, kind/feature, lifecycle/frozen, triage/accepted
