
Bug: Unschedulable pod due to lack of disk space #1813

@m-brando

Description


Current Behaviour

When the storage disk pool that Longhorn uses to provision persistent volumes becomes fully allocated, new pods cannot be started. However, the Kubernetes cluster autoscaler does not scale up the cluster. This likely happens because the affected pods remain stuck in the ContainerCreating state instead of being reported as unschedulable Pending pods, which prevents the autoscaler from recognizing the need for additional resources.

This results in a deadlock: pods are stuck in an unschedulable state, and the autoscaler does not react.
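
For context on why the autoscaler stays idle: the cluster autoscaler only scales up for pods that the scheduler marks as unschedulable, whereas a pod that is already bound to a node and only fails later at volume-attach time never gets that condition. A minimal sketch of the two pod status shapes (values are illustrative, not taken from the affected cluster):

# Status shape the cluster autoscaler reacts to: the scheduler could not place the pod.
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
---
# Status shape assumed in this report: the pod is already bound to a node, so the
# autoscaler ignores it; its containers never start and kubectl shows ContainerCreating.
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "True"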

Expected Behaviour

The autoscaler should detect that Longhorn cannot provision a persistent volume (PV) and should trigger a scale-up to accommodate the new workload.

Steps To Reproduce

Steps to reproduce the behaviour:

  1. Create a Claudie cluster with autoscaling enabled and attached storage disks:
...
      - name: computes
        providerSpec:
          name: hetzner
          region: fsn1
          zone: fsn1-dc14
        serverType: cpx21
        image: ubuntu-24.04
        autoscaler:
          min: 1
          max: 4
        storageDiskSize: 50
...
  2. Create a test workload to simulate load:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sleeper
  namespace: default
spec:
  serviceName: sleeper
  replicas: 20
  selector:
    matchLabels:
      app: sleeper
  template:
    metadata:
      labels:
        app: sleeper
    spec:
      containers:
      - name: stress
        image: progrium/stress
        command: ["sh", "-c"]
        args:
          - |
            echo "Filling volume with 1.7Gi. Does not support status=progress";
            dd if=/dev/urandom of=/data/fillfile bs=1M count=1700;
            echo "Starting CPU stress test";
            stress --cpu 1 --timeout 86400s
        resources:
          limits:
           cpu: "0.2"
        volumeMounts:
        - name: sleeper-volume
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: sleeper-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 2Gi
  3. Observe the status of the Kubernetes cluster: k top node, k get pod --watch, k events sts sleeper
  4. At some point, a new node will be created. Longhorn will allocate space on that node (sometimes consuming the majority of the available disk space for volume replicas, depending on the size of the replicas).
  5. Eventually, volume creation will fail with an error:
20m                   Normal    Scheduled                 Pod/sleeper-17                                    Successfully assigned default/sleeper-17 to computes-0e1x9ut-02
2m2s (x17 over 20m)   Warning   FailedAttachVolume        Pod/sleeper-17                                    AttachVolume.Attach failed for volume "pvc-2ca81235-d871-49c4-992d-595b02c09714" : rpc error: code = Aborted desc = volume pvc-2ca81235-d871-49c4-992d-595b02c09714 is not ready for workloads
  6. In the Longhorn UI, you'll see that all nodes are fully allocated. Longhorn cannot create or attach new PVs due to the reserved 30% disk space and the lack of remaining capacity (the relevant Longhorn settings are sketched after this list).
(Screenshots: Longhorn UI showing all nodes fully allocated.)
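
For reference, the two factors driving the allocation above (the per-volume replicas and the ~30% reservation) are Longhorn configuration rather than Kubernetes scheduling inputs. A rough sketch of where they live, assuming a default Longhorn install (resource names and defaults are taken from the Longhorn docs as I recall them; verify against your Longhorn version):

# Per-volume replica count: each PVC from this class consumes roughly
# numberOfReplicas x <requested size> of pool capacity across the nodes,
# so the 2Gi PVCs above each take ~6Gi with the default of 3 replicas.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
---
# Node-level reservation: the share of each disk Longhorn keeps unallocatable
# (the "reserved 30%" visible in the UI). Setting name and default are assumptions.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: storage-reserved-percentage-for-default-disk
  namespace: longhorn-system
value: "30"

Tuning these values only delays exhaustion; the report above is about the autoscaler not reacting once the pool is full.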

Expected results

The Kubernetes autoscaler should detect that pods are unschedulable due to storage constraints and should provision additional nodes to resolve the issue. This may require tuning the autoscaler to also consider disk pressure or persistent volume provisioning failures.
