Issue with dask-jobqueue workers getting stuck #24

@jpellman

For reasons I have not yet identified, the spillover directory can become corrupted, which prevents forward progress when processing images. Workers start up but then stall with this error:

dask-worker-4332718.err:2025-08-05 12:25:42,580 - distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.32.23:41009'. Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>
dask-worker-4332719.err:2025-08-05 12:25:42,580 - distributed.nanny - INFO - Closing Nanny at 'tcp://192.168.32.23:39265'. Reason: failure-to-start-<class 'asyncio.exceptions.TimeoutError'>

This can be remedied manually by removing all scratch files in the spillover directory and restarting the run.
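
The manual cleanup could be scripted along these lines. This is only a minimal sketch: `clear_spill_directory` is a hypothetical helper, not part of dask or dask-jobqueue, and the spillover path is site-specific and has to be supplied by the caller. It assumes no workers are running against the directory while it is emptied.

```python
import shutil
from pathlib import Path


def clear_spill_directory(spill_dir):
    """Delete everything inside spill_dir but keep the directory itself.

    Hypothetical helper: the caller must pass the spillover path used by
    their own cluster configuration. Returns the number of top-level
    entries that were removed.
    """
    root = Path(spill_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"{root} is not a directory")

    removed = 0
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Real subdirectory: remove it and everything below it.
            shutil.rmtree(entry)
        else:
            # Regular file, lock file, or symlink: remove just the entry.
            entry.unlink()
        removed += 1
    return removed
```

Calling this on the spillover directory before resubmitting the run would mirror the manual steps above. It does not address whatever corrupts the directory in the first place.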
