
Bug: Unschedulable pod due to lack of disk space #1813

@m-brando

Description


Current Behaviour

When the storage disk pool that Longhorn uses to provision persistent volumes becomes fully allocated, new pods cannot be started. However, the Kubernetes cluster autoscaler does not scale up the cluster. This likely happens because the affected pods remain stuck in the ContainerCreating state instead of being reported as unschedulable Pending pods, which prevents the autoscaler from recognizing the need for additional resources.

This results in a deadlock: pods are stuck in an unschedulable state, and the autoscaler does not react.
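
For context on why the autoscaler stays idle: the cluster autoscaler only scales up for pods that the scheduler marks as unschedulable, whereas a pod that is already bound to a node and only fails later at volume-attach time never gets that condition. A minimal sketch of the two pod status shapes (values are illustrative, not taken from the affected cluster):

# Status shape the cluster autoscaler reacts to: the scheduler could not place the pod.
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable
---
# Status shape assumed in this report: the pod is already bound to a node, so the
# autoscaler ignores it; its containers never start and kubectl shows ContainerCreating.
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "True"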

Expected Behaviour

The autoscaler should detect that Longhorn cannot provision a persistent volume (PV) and should trigger a scale-up to accommodate the new workload.

Steps To Reproduce

Steps to reproduce the behaviour:

  1. Create a Claudie cluster with autoscaling enabled and attached storage disks:
...
      - name: computes
        providerSpec:
          name: hetzner
          region: fsn1
          zone: fsn1-dc14
        serverType: cpx21
        image: ubuntu-24.04
        autoscaler:
          min: 1
          max: 4
        storageDiskSize: 50
...
  2. Create a test workload to simulate load:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: sleeper
  namespace: default
spec:
  serviceName: sleeper
  replicas: 20
  selector:
    matchLabels:
      app: sleeper
  template:
    metadata:
      labels:
        app: sleeper
    spec:
      containers:
      - name: stress
        image: progrium/stress
        command: ["sh", "-c"]
        args:
          - |
            echo "Filling volume with 1.7Gi. Does not support status=progress";
            dd if=/dev/urandom of=/data/fillfile bs=1M count=1700;
            echo "Starting CPU stress test";
            stress --cpu 1 --timeout 86400s
        resources:
          limits:
           cpu: "0.2"
        volumeMounts:
        - name: sleeper-volume
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: sleeper-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: longhorn
      resources:
        requests:
          storage: 2Gi
  3. Observe the status of the Kubernetes cluster: k top node, k get pod --watch, k events sts sleeper
  4. At some point, a new node will be created. Longhorn will allocate space on that node (sometimes consuming the majority of the available disk space for volume replicas, depending on the size of the replicas).
  5. Eventually, volume creation will fail with an error:
20m                   Normal    Scheduled                 Pod/sleeper-17                                    Successfully assigned default/sleeper-17 to computes-0e1x9ut-02
2m2s (x17 over 20m)   Warning   FailedAttachVolume        Pod/sleeper-17                                    AttachVolume.Attach failed for volume "pvc-2ca81235-d871-49c4-992d-595b02c09714" : rpc error: code = Aborted desc = volume pvc-2ca81235-d871-49c4-992d-595b02c09714 is not ready for workloads
  6. In the Longhorn UI, you'll see that all nodes are fully allocated. Longhorn cannot create or attach new PVs due to the reserved 30% disk space and the lack of remaining capacity (the relevant Longhorn settings are sketched after this list).
(Screenshots: Longhorn UI showing all nodes fully allocated.)
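
For reference, the two factors driving the allocation above (the per-volume replicas and the ~30% reservation) are Longhorn configuration rather than Kubernetes scheduling inputs. A rough sketch of where they live, assuming a default Longhorn install (resource names and defaults are taken from the Longhorn docs as I recall them; verify against your Longhorn version):

# Per-volume replica count: each PVC from this class consumes roughly
# numberOfReplicas x <requested size> of pool capacity across the nodes,
# so the 2Gi PVCs above each take ~6Gi with the default of 3 replicas.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
---
# Node-level reservation: the share of each disk Longhorn keeps unallocatable
# (the "reserved 30%" visible in the UI). Setting name and default are assumptions.
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: storage-reserved-percentage-for-default-disk
  namespace: longhorn-system
value: "30"

Tuning these values only delays exhaustion; the report above is about the autoscaler not reacting once the pool is full.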

Expected results

The Kubernetes autoscaler should detect that pods are unschedulable due to storage constraints and should provision additional nodes to resolve the issue. This may require tuning the autoscaler to also consider disk pressure or persistent volume provisioning failures.
