Thread safety issue with open_mfdataset, Zarr, and dask #8815
Comments
I suspect this is a lack of thread safety in dask. In particular, the tracebacks are internal dask data structures. Is there anything to suggest this is something we could fix from the xarray side?
The SystemError looks like a CPython error. I wonder if this goes away with Python 3.12. Dask is likely not to blame for this one: it happens when we execute the task, i.e. we're likely running some numpy code underneath. I'm a little skeptical about the thread-safety concern; the part shown in the traceback runs in a single thread, and the data being accessed here is only read from and written to by that same thread. I would also recommend upgrading dask and trying a more recent version. A lot has changed since January.
I've since restructured the code to go through Dask Futures submitted directly to a distributed (albeit threaded, rather than multiprocess) scheduler, rather than have separate execution threads independently call .compute(). That seems to have fixed, or at least suppressed, the problem. (The resulting logic was also a lot simpler, but it took a bunch of awkward experimentation to start thinking in event loops.) To my recollection, I never saw suggestive warnings like the one you quoted.
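For readers looking for the workaround, a minimal sketch of that restructuring might look like the following; the file pattern, the `t2m` variable name, and the slice sizes are placeholders, not details from the original report.

```python
# Hypothetical sketch of the Futures-based restructuring described above;
# the file pattern, variable name, and slice sizes are placeholders.
import xarray as xr
from dask.distributed import Client

client = Client(processes=False)  # threaded scheduler, as in the report
ds = xr.open_mfdataset("weather/*.zarr", engine="zarr")

# Instead of several Python threads each blocking on .compute(), hand all the
# lazy slices to the scheduler at once and collect them as Futures.
lazy_slices = [
    ds["t2m"].isel(time=slice(i, i + 8))
    for i in range(0, ds.sizes["time"], 8)
]
futures = client.compute(lazy_slices)   # one Future per lazy slice
in_memory = client.gather(futures)      # list of fully loaded DataArrays
```

The key difference is that concurrency is managed entirely by the distributed scheduler rather than by independent Python threads, each making its own blocking `.compute()` call.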
What happened?
I have a large-ish weather dataset (a mirrored subset of WeatherBench2 data), which is stored on disk as a collection of Zarr datasets, segregated by time. Because processing is intensive and GPU-related, I handle the data in parallel, using a `dask.distributed.Client` in threading-only mode (`processes=False`) to convert the lazy Dask arrays into in-memory arrays with a thread-blocking call to `compute`.

When doing so, I intermittently receive two categories of exceptions from dask. The first is a `SystemError`, while the second is a KeyNotFound error. The computation will often succeed on a subsequent attempt if I catch the exception raised by `.compute` and re-execute the call.

The second error seems to disappear² if I pass `cache=False` to `open_mfdataset`, but the first persists. Both errors seem to disappear if I use only a single worker thread³ or if I restrict the data being used to only one file's worth (that is, still open all the files from `open_mfdataset`, but only use entries that I know come from one of those source Zarr trees).
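To make the failure mode concrete, here is a rough sketch of the access pattern described above; the store paths, the `t2m` variable name, the slice sizes, and the thread count are illustrative placeholders rather than details from the actual code.

```python
# Rough sketch of the access pattern described above; paths, variable name,
# slice sizes, and thread count are placeholders, not from the actual code.
from concurrent.futures import ThreadPoolExecutor

import xarray as xr
from dask.distributed import Client

client = Client(processes=False)  # threading-only distributed scheduler
ds = xr.open_mfdataset("weather/*.zarr", engine="zarr")  # time-segregated Zarr stores

def load_slice(start):
    # Each worker thread independently forces a lazy slice into memory with a
    # blocking .compute() call -- the pattern that intermittently fails.
    return ds["t2m"].isel(time=slice(start, start + 8)).compute()

with ThreadPoolExecutor(max_workers=4) as pool:
    arrays = list(pool.map(load_slice, range(0, ds.sizes["time"], 8)))
```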
What did you expect to happen?
No response
Minimal Complete Verifiable Example
No response
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
Footnotes
1. I was previously using `.sel` to slice the data in a more semantic way, but I switched to the lower-level slicing to rule out any thread-unsafety inside the non-delayed-computing parts of Xarray when dealing with a shared dataset (see the sketch after these footnotes).
2. The errors are random in exactly the annoying way that threading errors are. One factor that makes them more likely is adding debugging printouts, which seem to disrupt thread scheduling enough to trigger them more often.
3. Unfortunately, this results in unacceptably slow performance. Loading the data in parallel really does improve throughput. I haven't yet tried using the asynchronous model for the dask client.
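As an illustration of the switch mentioned in the first footnote, the small self-contained example below contrasts label-based `.sel` with lower-level positional `.isel` slicing; the variable name and coordinate values are invented for the example.

```python
# Illustrative only: label-based .sel versus positional .isel slicing; the
# variable name and coordinate values are invented for this example.
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2020-01-01", periods=24, freq="6h")
ds = xr.Dataset({"t2m": ("time", np.random.rand(24))}, coords={"time": times})

# Semantic, label-based selection:
by_label = ds["t2m"].sel(time=slice("2020-01-02", "2020-01-03"))

# Equivalent lower-level, positional selection of the same span:
by_position = ds["t2m"].isel(time=slice(4, 12))

assert by_label.equals(by_position)
```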