How to address the bottleneck issue of data loading speed during the pre-training process? #16

@Struggleyin

I saw the following description in your config/hrdt_pretrain.yaml file:

    dataset:
      # We will extract the data from raw dataset
      # and store them in the disk buffer by producer
      # When training, we will read the data
      # randomly from the buffer by consumer
      # The producer will replace the data which has been
      # read by the consumer with new data
      # The path to the buffer (at least 400GB)
      buf_path: /mnt/data/yinzhenhan/buffer
      # The number of chunks in the buffer

However, the code does not seem to use this buffer. As far as I can tell, it loads the data directly from the raw dataset in real time during training.
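For reference, here is a minimal sketch of the producer/consumer disk buffer as I understand it from those config comments. It is only my reading of the description, not the repository's actual code: the class `DiskBuffer`, the `num_chunks` parameter, and the `chunk_<i>.pkl` / `chunk_<i>.read` file layout are all hypothetical names chosen for illustration, and the sketch has no cross-process locking.

```python
import os
import pickle
import random
import tempfile


class DiskBuffer:
    """Hypothetical disk-backed chunk buffer (all names are illustrative)."""

    def __init__(self, buf_path, num_chunks):
        self.buf_path = buf_path
        self.num_chunks = num_chunks
        os.makedirs(buf_path, exist_ok=True)

    def _chunk_file(self, idx):
        return os.path.join(self.buf_path, f"chunk_{idx}.pkl")

    def _read_flag(self, idx):
        # Marker file that exists once the consumer has read this chunk.
        return os.path.join(self.buf_path, f"chunk_{idx}.read")

    def produce(self, sample_iter):
        """Producer: fill empty slots and replace chunks that were already read.

        Returns the number of chunks written in this pass.
        """
        written = 0
        for idx in range(self.num_chunks):
            empty = not os.path.exists(self._chunk_file(idx))
            consumed = os.path.exists(self._read_flag(idx))
            if not (empty or consumed):
                continue  # chunk is fresh and unread, leave it alone
            sample = next(sample_iter, None)
            if sample is None:
                break  # raw dataset exhausted
            # Write to a temp file and rename, so a reader never sees a partial chunk.
            tmp = self._chunk_file(idx) + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(sample, f)
            os.replace(tmp, self._chunk_file(idx))
            if consumed:
                os.remove(self._read_flag(idx))
            written += 1
        return written

    def consume(self):
        """Consumer: read one randomly chosen chunk and mark it as read."""
        ready = [i for i in range(self.num_chunks)
                 if os.path.exists(self._chunk_file(i))]
        if not ready:
            raise RuntimeError("buffer is empty; run the producer first")
        idx = random.choice(ready)
        with open(self._chunk_file(idx), "rb") as f:
            sample = pickle.load(f)
        open(self._read_flag(idx), "w").close()
        return sample


if __name__ == "__main__":
    raw = iter(range(100))  # stand-in for samples extracted from the raw dataset
    with tempfile.TemporaryDirectory() as d:
        buf = DiskBuffer(d, num_chunks=4)
        buf.produce(raw)      # fill the buffer
        print(buf.consume())  # training side reads a random chunk
        buf.produce(raw)      # producer replaces the chunk that was read
```

If the pipeline worked like this, the producer could run in a separate process ahead of training, so the expensive extraction from the raw dataset would not sit on the training loop's critical path. That is the behavior I expected from the config, and I could not find it in the code.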
