
'File exists: "/00000_locals"' when integrated with deepspeed training scripts #717

Open
Clement25 opened this issue Jul 8, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Clement25

Environment

  • OS: Ubuntu 22.04.2 LTS
  • Hardware (GPU, or instance type): A800

To reproduce

Steps to reproduce the behavior:

  1. pip install deepspeed
  2. deepspeed train.py ... (training arguments are omitted)

Observed behavior

  File "/mnt/data/weihan/projects/cepe/data.py", line 226, in load_streams
    self.encoder_decoder_dataset = StreamingDataset(streams=streams, epoch_size=self.epoch_size, allow_unsafe_types=True)
  File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/dataset.py", line 513, in __init__
    self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
  File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
    shm = SharedMemory(name, True, len(data))
  File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
    shm = BuiltinSharedMemory(name, create, size)
  File "/opt/conda/envs/cepe/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
    self._fd = _posixshmem.shm_open(
FileExistsError: [Errno 17] File exists: '/000000_locals'
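For context, a minimal sketch of the dataset construction that hits this code path, pieced together from the traceback (the Stream paths and epoch_size below are placeholders, not the actual values in data.py):

from streaming import Stream, StreamingDataset

# Placeholder stream definition; the real remote/local paths live in data.py.
streams = [
    Stream(remote='s3://bucket/dataset', local='/tmp/streaming/dataset'),
]

# Each rank launched by deepspeed constructs the dataset; the ranks coordinate
# through POSIX shared memory (the '/000000_locals' block in the traceback),
# which is where the FileExistsError is raised.
dataset = StreamingDataset(streams=streams, epoch_size=None, allow_unsafe_types=True)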

Additional context

Clement25 added the bug label on Jul 8, 2024
@snarayan21
Collaborator

Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?
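For example, a minimal sketch of that suggestion (the exact placement only matters in that it runs before any StreamingDataset is constructed):

from streaming.base.util import clean_stale_shared_memory

# Clear out shared-memory blocks left behind by a previous run that crashed or
# was killed, before the StreamingDataset is built.
clean_stale_shared_memory()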

@sukritipaul5

sukritipaul5 commented Jul 9, 2024

Hey @snarayan21! :)
I've tried this to no avail.
I also downgraded the mosaicml and deepspeed versions.
Let me know if you have any other suggestions.
I'm using A100s.

@Clement25
Author

Clement25 commented Jul 10, 2024

Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?

I tried it, but it didn't work.

@Clement25
Author

Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?

I solved it by setting the environment variable LOCAL_WORLD_SIZE=$NUM_GPU.
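For anyone else who lands here, a sketch of that workaround from inside train.py; setting it on the launch line instead (LOCAL_WORLD_SIZE=$NUM_GPU deepspeed train.py ...) should be equivalent, and torch.cuda.device_count() is just one way to get the per-node GPU count:

import os
import torch

# The fix above: make sure LOCAL_WORLD_SIZE is set before any StreamingDataset
# is created. The deepspeed launcher may not export it (assumption), so derive
# it from the visible GPU count here, which should match $NUM_GPU on a single node.
os.environ.setdefault('LOCAL_WORLD_SIZE', str(torch.cuda.device_count()))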
