You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?
Hey @snarayan21 ! :)
I've tried this to no avail.
I also downgraded mosaicml and deepspeed versions.
Let me know if you have any other suggestion(s).
I'm using A100s.
Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?
Hey, this seems like there's some stale shared memory. Just once, at the start of your training job, can you add a call to streaming.base.util.clean_stale_shared_memory() and see if that addresses the issue?
I solved by setting env variable "LOCAL_WORLD_SIZE=$NUM_GPU"
Environment
To reproduce
Steps to reproduce the behavior:
Expected behavior
Additional context
The text was updated successfully, but these errors were encountered: