Training batches not fully processed for each epoch #15

@shayandesu

Description

I am using my own dataloader, which produces data on the fly. The problem is that after the first epoch, only about 1/8 of the dataloader is used in each epoch:

```
Epoch 0: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:46<00:00,  0.69it/s, v_num=0]Epoch 0, global step 32: 'train/nll' reached 9.21892 (best 9.21892), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
Epoch 1:  12%|███████████▌                                                                                | 4/32 [00:14<01:44,  0.27it/s, v_num=0]Epoch 1, global step 36: 'train/nll' reached 9.15495 (best 9.15495), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
Epoch 2:  12%|███████████▌                                                                                | 4/32 [00:12<01:29,  0.31it/s, v_num=0]Epoch 2, global step 40: 'train/nll' reached 9.13006 (best 9.13006), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
Epoch 3:  12%|███████████▌                                                                                | 4/32 [00:13<01:32,  0.30it/s, v_num=0]Epoch 3, global step 44: 'train/nll' reached 9.10235 (best 9.10235), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
Epoch 4:  12%|███████████▌                                                                                | 4/32 [00:12<01:30,  0.31it/s, v_num=0]Epoch 4, global step 48: 'train/nll' reached 9.07143 (best 9.07143), saving model to '/home/akasaei/shayan/other/duo/outputs/openwebtext/2025.09.15/150638/checkpoints/best.ckpt' as top 1
```

The only change I made to the config file was setting `limit_val_batches: 0`. I also tried other dataset sizes, but still only 1/8 of the data was used per epoch. I am using the AR parametrization, and I previously tested this dataloader on smiles-mdlm without running into this issue. Any idea why this happens?
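In case it helps with reproduction, here is a stripped-down sketch of what my setup looks like (`OnTheFlyDataset` and `ToyLM` are simplified stand-ins written for this report, not the actual repo code; the real run uses the repo's config and AR parametrization):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

import pytorch_lightning as pl


class OnTheFlyDataset(Dataset):
    """Map-style dataset that generates samples on the fly (no files)."""

    def __init__(self, num_samples=4096, seq_len=128, vocab_size=100):
        self.num_samples = num_samples
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __len__(self):
        # Constant size: 4096 samples / batch_size 128 = 32 batches per epoch.
        return self.num_samples

    def __getitem__(self, idx):
        # Generate a random token sequence instead of reading from disk.
        return torch.randint(0, self.vocab_size, (self.seq_len,))


class ToyLM(pl.LightningModule):
    """Tiny next-token model, just enough to drive the Trainer."""

    def __init__(self, vocab_size=100, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def training_step(self, batch, batch_idx):
        logits = self.head(self.emb(batch[:, :-1]))
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
        self.log('train/nll', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=3e-4)


if __name__ == '__main__':
    loader = DataLoader(OnTheFlyDataset(), batch_size=128, num_workers=2)
    # Disable validation; this is the only config change I made.
    trainer = pl.Trainer(max_epochs=5, limit_val_batches=0)
    trainer.fit(ToyLM(), train_dataloaders=loader)
```

With this dataset, `len(loader)` is 32 on every epoch, so I would expect all 32 batches to be processed each epoch rather than only 4 after epoch 0.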
