
'File exists: "/00000_locals"' when integrated with deepspeed training scripts #717

Open
Clement25 opened this issue Jul 8, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Clement25

Environment

  • OS: Ubuntu 22.04.2 LTS
  • Hardware (GPU, or instance type): A800

To reproduce

Steps to reproduce the behavior:

  1. pip install deepspeed
  2. deepspeed train.py ... (training arguments are omitted)

Observed behavior

  File "/mnt/data/weihan/projects/cepe/data.py", line 226, in load_streams
    self.encoder_decoder_dataset = StreamingDataset(streams=streams, epoch_size=self.epoch_size, allow_unsafe_types=True)
  File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/dataset.py", line 513, in __init__
    self._shm_prefix_int, self._locals_shm = get_shm_prefix(streams_local, streams_remote,
  File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/prefix.py", line 192, in get_shm_prefix
    shm = SharedMemory(name, True, len(data))
  File "/opt/conda/envs/cepe/lib/python3.10/site-packages/streaming/base/shared/memory.py", line 41, in __init__
    shm = BuiltinSharedMemory(name, create, size)
  File "/opt/conda/envs/cepe/lib/python3.10/multiprocessing/shared_memory.py", line 104, in __init__
    self._fd = _posixshmem.shm_open(
FileExistsError: [Errno 17] File exists: '/000000_locals'
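For context, a minimal sketch of the dataset construction that hits this code path, pieced together from the traceback (the Stream paths and epoch_size below are placeholders, not the actual values in data.py):

from streaming import Stream, StreamingDataset

# Placeholder stream definition; the real remote/local paths live in data.py.
streams = [
    Stream(remote='s3://bucket/dataset', local='/tmp/streaming/dataset'),
]

# Each rank launched by deepspeed constructs the dataset; the ranks coordinate
# through POSIX shared memory (the '/000000_locals' block in the traceback),
# which is where the FileExistsError is raised.
dataset = StreamingDataset(streams=streams, epoch_size=None, allow_unsafe_types=True)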

Additional context

Clement25 added the bug label on Jul 8, 2024
@snarayan21
Collaborator

Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?
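For example, a minimal sketch of that suggestion (the exact placement only matters in that it runs before any StreamingDataset is constructed):

from streaming.base.util import clean_stale_shared_memory

# Clear out shared-memory blocks left behind by a previous run that crashed or
# was killed, before the StreamingDataset is built.
clean_stale_shared_memory()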

@sukritipaul5

sukritipaul5 commented Jul 9, 2024

Hey @snarayan21! :)
I've tried this to no avail.
I also downgraded the mosaicml and deepspeed versions.
Let me know if you have any other suggestions.
I'm using A100s.

@Clement25
Author

Clement25 commented Jul 10, 2024

Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?

I tried it, but it didn't work.

@Clement25
Author

Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?

I solved it by setting the environment variable LOCAL_WORLD_SIZE=$NUM_GPU.
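For anyone else who lands here, a sketch of that workaround from inside train.py; setting it on the launch line instead (LOCAL_WORLD_SIZE=$NUM_GPU deepspeed train.py ...) should be equivalent, and torch.cuda.device_count() is just one way to get the per-node GPU count:

import os
import torch

# The fix above: make sure LOCAL_WORLD_SIZE is set before any StreamingDataset
# is created. The deepspeed launcher may not export it (assumption), so derive
# it from the visible GPU count here, which should match $NUM_GPU on a single node.
os.environ.setdefault('LOCAL_WORLD_SIZE', str(torch.cuda.device_count()))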
