Conversation

@ParagEkbote

Refs #2310

As described in the open tasks for benchmarking the LoRA methods, I have added a dataloader to the data.py script. Before integrating it in run.py, I have a few queries:

  1. Is default_data_collator ideal for benchmarking, or is something more advanced like DataCollatorWithPadding preferred? (A rough sketch of the two options is included below this list.)

  2. Are any tests needed to check for improvements in throughput or memory usage?
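
For reference, here is a rough sketch of the two collator options I have in mind; the `ds` and `tokenizer` names below are placeholders for whatever data.py already sets up, not the actual variable names in the script:

```python
# Rough sketch (not part of the PR) of the two collator options, assuming a
# tokenized dataset `ds` and a `tokenizer` are already available in data.py.
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding, default_data_collator

# Option 1: default_data_collator simply stacks features, so every example
# must already be padded/truncated to the same length during preprocessing.
static_loader = DataLoader(ds, batch_size=8, collate_fn=default_data_collator)

# Option 2: DataCollatorWithPadding pads dynamically to the longest sequence
# in each batch, which usually wastes fewer tokens on variable-length data.
dynamic_collator = DataCollatorWithPadding(tokenizer=tokenizer)
dynamic_loader = DataLoader(ds, batch_size=8, collate_fn=dynamic_collator)
```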

Could you please review?

cc: @BenjaminBossan

@BenjaminBossan
Member

Thanks for the PR. Note that this may not be quite as easy to do anymore, since we added a BucketIterator for more efficient batching:

class BucketIterator:

Honestly, I'm not sure if adding the standard torch DataLoader is still very useful at this point. It does have some nice features like parallelization, but that's really not a huge deal for this dataset. And when it comes to padding, I remember it was a bit tricky to get completely right: the padding is currently done manually in multiple places. Possibly this could be helped with a better collator, but it would require thorough testing to ensure that the current logic is unchanged, and we don't have unit tests for this script.
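
For illustration only (this is not the actual BucketIterator implementation), the basic idea of bucketing is to group examples of similar length before batching, so that the padding needed per batch stays small:

```python
# Illustrative sketch only; the real BucketIterator in the benchmark script
# differs in detail. Grouping examples of similar length keeps the amount of
# padding per batch small.
def length_bucketed_batches(examples, batch_size):
    # examples: a list of dicts that each contain an "input_ids" sequence
    ordered = sorted(examples, key=lambda ex: len(ex["input_ids"]))
    for start in range(0, len(ordered), batch_size):
        yield ordered[start : start + batch_size]
```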

You could give it a try and see if you can simplify/improve the training code this way, but I think the expected value of working on this wouldn't be too big.

@ParagEkbote
Author

So, among the open tasks mentioned, which of the pending items are considered higher priority?

cc: @BenjaminBossan

@BenjaminBossan
Member

Hmm, good question. Perhaps it would be a good idea to remove the DataLoader item from the list and mention that if anyone wants to contribute, they should open an issue on PEFT first to ask for feedback.

Among the open tasks, I think that 1. ensuring that we get similar results with Trainer and 2. investigating AMP would be the most important. But these aren't really contributions to the code base; they're more about testing things out. Depending on the results, there might be follow-up work that leads to a code contribution, but there is no guarantee.

Another way to contribute would be by adding experiment settings, see the discussions at the end of #2310. This requires access to hardware that can run the experiments at a decent speed.

@ParagEkbote
Author

Can the vLLM addition and the clean-up of prints also be considered testing issues? Would you like me to open a separate PR specifying in the README which items are test issues and which are code issues?

cc: @BenjaminBossan

@BenjaminBossan
Member

Can the vLLM addition and the clean-up of prints also be considered testing issues?

Regarding vLLM, over time we have sped up generation quite a bit, so it's less of an issue now. As mentioned, for testing it's still a bit on the slow side. I think the main value of adding vLLM would be more of a learning experience; I'm not sure it's worth adding it as a dependency.

Regarding the prints, it's not super high value, and I'd have to check how best to stream them; it's more of a personal judgment call.

Would you like me to open a separate PR specifying in the README which items are test issues and which are code issues?

You could do a PR with that change, yes. Please also remove the DataLoader item and add a sentence about first opening an issue before starting the work. Thanks!

@ParagEkbote marked this pull request as a draft on December 17, 2025 at 10:52