write to S3 is very slow #812
Labels: bug

Comments
Hey @charliedream1, have you tried the parallel dataset conversion approach as detailed in our docs below? https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/parallel_dataset_conversion.html Please let us know if that works for you.
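For readers without the link open, the pattern described there is roughly the following (a minimal sketch, not the docs verbatim; the column spec, paths, partitioning, and sample data are illustrative):

```python
import os
from multiprocessing import Pool

import numpy as np
from streaming import MDSWriter
from streaming.base.util import merge_index

OUT_ROOT = 's3://my-bucket/my-dataset'  # placeholder; a local path also works
COLUMNS = {'tokens': 'ndarray:int32'}   # illustrative column spec

def write_partition(arg):
    """Each worker process writes its own shard sub-directory."""
    part_id, samples = arg
    out = os.path.join(OUT_ROOT, str(part_id))
    with MDSWriter(out=out, columns=COLUMNS, compression='zstd') as writer:
        for sample in samples:
            writer.write({'tokens': sample})

if __name__ == '__main__':
    # Stand-in partitions: a list of (id, sample-list) chunks of the dataset.
    partitions = [(i, [np.arange(512, dtype=np.int32)] * 100) for i in range(8)]
    with Pool(processes=8) as pool:
        pool.map(write_partition, partitions)
    # Stitch the per-partition index.json files into one dataset index.
    merge_index(OUT_ROOT, keep_local=True)
```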
I'm using exactly the parallel approach you just mentioned. Please check my code; either I'm using it wrongly, or the speed really is this slow.
By the way, copying 30 GB of files to S3 with a plain command takes less than five minutes, but transmitting samples one by one with this library, even in parallel, currently takes me half an hour. I have a huge amount of data; if the speed stays this slow, I can't use this.
Please give some help.
I also tried writing the output to local disk, but the speed is the same. However, saving to disk with the `datasets` library takes only a few seconds.
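For reference, the `datasets` comparison presumably looks something like this (names and paths are illustrative, not the reporter's actual code):

```python
import numpy as np
from datasets import Dataset

# Stand-in for the same pre-tokenized data (illustrative only).
tokenized_samples = [np.arange(512, dtype=np.int32) for _ in range(1000)]

ds = Dataset.from_dict({'tokens': [s.tolist() for s in tokenized_samples]})
ds.save_to_disk('/tmp/tokenized_hf')  # completes in seconds at this scale
```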
Environment
To reproduce
I have a 2 GB jsonl.gz text file, which I tokenized, storing the data as numpy arrays. The writer is defined as below:
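(The original snippet did not survive in this copy of the issue. Below is a minimal sketch of what the writer setup presumably looked like; the column name/encoding, output path, and sample data are assumptions, not the reporter's actual code.)

```python
import numpy as np
from streaming import MDSWriter

# Assumed: tokenized samples stored as int32 numpy arrays.
columns = {'tokens': 'ndarray:int32'}
remote = 's3://my-bucket/tokenized-mds'  # placeholder output location

# Stand-in for the reporter's pre-tokenized, in-memory data.
tokenized_samples = [np.arange(512, dtype=np.int32) for _ in range(1000)]

with MDSWriter(out=remote, columns=columns, compression='zstd') as writer:
    for sample in tokenized_samples:
        writer.write({'tokens': sample})
```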
The tokenized data is pre-processed and already loaded, so no time is spent on tokenization.
With the code above, writing to S3 takes 30 minutes with mds_num_workers set to 200. If I set it to 1, it takes an hour to finish. That is just too slow, and I have a huge amount of data to process. How can I accelerate this? Is it possible to write a block of data at once rather than one sample at a time? Please give some suggestions.
Expected behavior
Additional context