How to address the data loading speed bottleneck during pre-training? #16

I saw the following description in your config/hrdt_pretrain.yaml file:

dataset:
  # We will extract the data from raw dataset
  # and store them in the disk buffer by producer
  # When training, we will read the data
  # randomly from the buffer by consumer
  # The producer will replace the data which has been
  # read by the consumer with new data

  # The path to the buffer (at least 400GB)
  buf_path: /mnt/data/yinzhenhan/buffer
  # The number of chunks in the buffer

However, the code does not seem to use this buffer. Instead, it appears to load the data directly in real time.

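For reference, here is my reading of the producer/consumer scheme those config comments describe, as a minimal Python sketch. This is not the repository's implementation: `DiskBuffer`, `num_slots`, `produce`, and `consume` are hypothetical names, the samples are toy JSON dicts, and the "already read" flags live in memory rather than on disk.

```python
import json
import os
import random
import tempfile


class DiskBuffer:
    """Toy disk buffer: a fixed number of slot files plus a 'consumed' flag per slot."""

    def __init__(self, buf_path, num_slots):
        self.buf_path = buf_path
        self.num_slots = num_slots
        os.makedirs(buf_path, exist_ok=True)
        # True means the slot is empty or already read, so the producer may overwrite it.
        self.consumed = [True] * num_slots

    def _slot_file(self, idx):
        return os.path.join(self.buf_path, f"slot_{idx}.json")

    def produce(self, raw_samples):
        """Producer: write new samples into slots the consumer has already read.

        `raw_samples` must be an iterator. Returns the number of slots refilled.
        """
        filled = 0
        for idx in range(self.num_slots):
            if not self.consumed[idx]:
                continue  # slot still holds unread data, leave it alone
            sample = next(raw_samples, None)
            if sample is None:
                break  # raw dataset exhausted
            with open(self._slot_file(idx), "w") as f:
                json.dump(sample, f)
            self.consumed[idx] = False
            filled += 1
        return filled

    def consume(self, rng=random):
        """Consumer: read one random unread slot and mark it as consumed."""
        fresh = [i for i, done in enumerate(self.consumed) if not done]
        if not fresh:
            return None  # nothing unread; the producer has fallen behind
        idx = rng.choice(fresh)
        with open(self._slot_file(idx)) as f:
            sample = json.load(f)
        self.consumed[idx] = True
        return sample


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        buf = DiskBuffer(tmp, num_slots=4)
        raw = ({"episode": i} for i in range(10))  # stand-in for the raw dataset
        buf.produce(raw)                           # fills all 4 slots
        batch = [buf.consume() for _ in range(2)]  # reads 2 random slots
        buf.produce(raw)                           # refills only those 2 slots
```

In an actual training setup the producer would presumably run as a separate process, with the flags stored on disk behind a file lock, so that the expensive extraction from the raw dataset happens off the training loop's critical path.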