Training batches not fully processed for each epoch #15

@shayandesu

Description

I am using my own dataloader, which produces data on the fly. The problem is that after the first epoch, only about 1/8 of the dataloader is used in each epoch:

```
Epoch 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:46<00:00,  0.69it/s, v_num=0]Epoch 0, global step 32: 'train/nll' reached 9.21892 (best 9.21892), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
Epoch 1:  12%|███████████▌                                                                                | 4/32 [00:14<01:44,  0.27it/s, v_num=0]Epoch 1, global step 36: 'train/nll' reached 9.15495 (best 9.15495), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
Epoch 2:  12%|███████████▌                                                                                | 4/32 [00:12<01:29,  0.31it/s, v_num=0]Epoch 2, global step 40: 'train/nll' reached 9.13006 (best 9.13006), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
Epoch 3:  12%|███████████▌                                                                                | 4/32 [00:13<01:32,  0.30it/s, v_num=0]Epoch 3, global step 44: 'train/nll' reached 9.10235 (best 9.10235), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
Epoch 4:  12%|███████████▌                                                                                | 4/32 [00:12<01:30,  0.31it/s, v_num=0]Epoch 4, global step 48: 'train/nll' reached 9.07143 (best 9.07143), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
```

The only change I made to the config file was setting `limit_val_batches: 0`. I also tried other dataset sizes, but still only 1/8 of the data was used per epoch. I am using the AR parametrization, and I previously tested this dataloader on smiles-mdlm without running into this issue. Any idea why this happens?
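In case it helps with reproduction, here is a stripped-down sketch of what my setup looks like (`OnTheFlyDataset` and `ToyLM` are simplified stand-ins written for this report, not the actual repo code; the real run uses the repo's config and AR parametrization):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl


class OnTheFlyDataset(Dataset):
    """Map-style dataset that generates samples on the fly (no files)."""

    def __init__(self, num_samples=4096, seq_len=128, vocab_size=100):
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        # Constant size: 4096 samples / batch_size 128 = 32 batches per epoch.
        return self.num_samples

    def __getitem__(self, idx):
        # Generate a random token sequence instead of reading from disk.
        return torch.randint(0, self.vocab_size, (self.seq_len,))


class ToyLM(pl.LightningModule):
    """Tiny next-token model, just enough to drive the Trainer."""

    def __init__(self, vocab_size=100, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def training_step(self, batch, batch_idx):
        logits = self.head(self.emb(batch[:, :-1]))
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
        self.log('train/nll', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=3e-4)


if __name__ == '__main__':
    loader = DataLoader(OnTheFlyDataset(), batch_size=128, num_workers=2)
    # Disable validation; this is the only config change I made.
    trainer = pl.Trainer(max_epochs=5, limit_val_batches=0)
    trainer.fit(ToyLM(), train_dataloaders=loader)
```

With this dataset, `len(loader)` is 32 on every epoch, so I would expect all 32 batches to be processed each epoch rather than only 4 after epoch 0.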
