
write to S3 is very slow #812

Open
charliedream1 opened this issue Oct 25, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@charliedream1

charliedream1 commented Oct 25, 2024

Environment

  • OS: Ubuntu 20.04
  • Hardware (GPU, or instance type): H800

To reproduce

I have a 2 GB jsonl.gz text file that I tokenized, with the results stored as numpy arrays. The writer is defined as below:

from streaming import MDSWriter

out = MDSWriter(
    columns={"input_ids": f"ndarray:int32:{args.seq_len}",
             "token_type_ids": f"ndarray:int8:{args.seq_len}",
             "attention_mask": f"ndarray:int8:{args.seq_len}",
             "special_tokens_mask": f"ndarray:int8:{args.seq_len}"},
    out=out_path,
    compression=None
)
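(Editor's note, not part of the original report: MDSWriter also accepts a (local, remote) tuple for out, so shards are staged on local disk and uploaded to S3 as they are completed. A minimal sketch of that variant, keeping the same columns; the local staging path "/tmp/mds_staging" is illustrative.)

# Sketch: same writer, but with `out` given as a (local, remote) tuple so shards
# are written locally first and then uploaded to the S3 prefix in out_path.
out = MDSWriter(
    columns={"input_ids": f"ndarray:int32:{args.seq_len}",
             "token_type_ids": f"ndarray:int8:{args.seq_len}",
             "attention_mask": f"ndarray:int8:{args.seq_len}",
             "special_tokens_mask": f"ndarray:int8:{args.seq_len}"},
    out=("/tmp/mds_staging", out_path),   # (local staging dir, remote S3 prefix)
    compression=None
)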

The tokenized data is already pre-processed and loaded, so no time is spent on tokenization.

import numpy as np

def parse_data_2_mds_format(tokenized_dataset):
    # Cast each tokenized field to a fixed-width numpy array matching the MDSWriter columns.
    input_ids = np.array(tokenized_dataset['input_ids']).astype(np.int32)
    token_type_ids = np.array(tokenized_dataset['token_type_ids']).astype(np.int8)
    attention_mask = np.array(tokenized_dataset['attention_mask']).astype(np.int8)
    special_tokens_mask = np.array(tokenized_dataset['special_tokens_mask']).astype(np.int8)
    return {'input_ids': input_ids, 'token_type_ids': token_type_ids,
            'attention_mask': attention_mask, 'special_tokens_mask': special_tokens_mask}

from multiprocessing import Pool
from tqdm import tqdm

with Pool(processes=args.mds_num_workers) as inner_pool:
    with tqdm(total=len(tokenized_datasets), desc="Writing Out MDS File") as pbar:
        # Samples are converted in parallel but written out one at a time.
        for result in inner_pool.imap(parse_data_2_mds_format, tokenized_datasets):
            out.write(result)
            pbar.update()

With the code above, writing to S3 takes 30 minutes with mds_num_workers set to 200. If I set it to 1, it takes 1 hour to finish. It's just so slow, and I have a huge amount of data to process. How can I accelerate it? Is it possible to write a block of data at once rather than one sample at a time? Please give some suggestions for speeding this up.

Expected behavior

Additional context

@charliedream1 charliedream1 added the bug Something isn't working label Oct 25, 2024
@snarayan21
Collaborator

Hey @charliedream1, have you tried the parallel dataset conversion approach as detailed in our docs below? https://docs.mosaicml.com/projects/streaming/en/stable/preparing_datasets/parallel_dataset_conversion.html

Please let us know if that works for you.
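(Editor's note, not part of the original thread: below is a minimal sketch of the parallel-conversion pattern described in the linked doc. Each process writes its own group of samples to a sub-directory with its own MDSWriter, and the per-group index.json files are merged at the end with streaming.base.util.merge_index, called here on the root directory as in the doc. The bucket path, worker count, sequence length, and the synthetic samples are illustrative; the columns mirror the ones from the report above.)

import os
from multiprocessing import Pool

import numpy as np
from streaming import MDSWriter
from streaming.base.util import merge_index

SEQ_LEN = 2048                              # stands in for args.seq_len
OUT_ROOT = 's3://my-bucket/tokenized-mds'   # hypothetical remote root
COLUMNS = {
    'input_ids': f'ndarray:int32:{SEQ_LEN}',
    'token_type_ids': f'ndarray:int8:{SEQ_LEN}',
    'attention_mask': f'ndarray:int8:{SEQ_LEN}',
    'special_tokens_mask': f'ndarray:int8:{SEQ_LEN}',
}

def write_group(task):
    # Write one group of pre-tokenized samples to its own sub-directory.
    group_idx, samples = task
    sub_out = os.path.join(OUT_ROOT, str(group_idx))
    with MDSWriter(columns=COLUMNS, out=sub_out, compression=None) as writer:
        for sample in samples:
            writer.write(sample)
    return sub_out

if __name__ == '__main__':
    # `samples` stands in for the output of parse_data_2_mds_format over the
    # whole dataset; here a small synthetic list so the sketch is runnable.
    samples = [{
        'input_ids': np.zeros(SEQ_LEN, dtype=np.int32),
        'token_type_ids': np.zeros(SEQ_LEN, dtype=np.int8),
        'attention_mask': np.ones(SEQ_LEN, dtype=np.int8),
        'special_tokens_mask': np.zeros(SEQ_LEN, dtype=np.int8),
    } for _ in range(256)]

    # Split the samples into groups; each process writes one group in parallel.
    num_groups = 8
    groups = [(i, samples[i::num_groups]) for i in range(num_groups)]
    with Pool(processes=num_groups) as pool:
        pool.map(write_group, groups)

    # Combine the per-group index.json files into one dataset index.
    merge_index(OUT_ROOT, keep_local=True)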

@charliedream1
Author

charliedream1 commented Oct 25, 2024 via email

@charliedream1
Author

charliedream1 commented Oct 25, 2024 via email
