GKDTrainer: Fix return_outputs in Liger kernel path and update tests#4688
Conversation
- Add missing vocab dimension reduction in generalized_jsd_loss (jsd.sum(dim=-1)) to properly compute KL divergence - Fix UnboundLocalError when return_outputs=True in Liger path by handling student_outputs before deletion
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@kashif |
|
@codex review |
|
I was thinking perhaps a simpler fix could be to store the logits before deleting them in the liger path, and then use them for returning the outputs instead of re-runing the model() forward? |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 64a1b72680
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Reviewtrl/tests/experimental/test_gkd_trainer.py Line 307 in 64a1b72 Using trl/trl/experimental/gkd/gkd_trainer.py Lines 359 to 363 in 64a1b72 This adds a special ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Thanks for the suggestion! Applied it in the latest commit. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 745245f. Configure here.
Refactor return statement to include ModelOutput when return_outputs is True.
What does this PR do?
Fixes
UnboundLocalErrorwhenreturn_outputs=Truein GKDTrainer's Liger kernel path, and updates tests for Liger kernel compatibility.Changes
Bug fix: Handle
return_outputs=Truein Liger kernel pathstudent_outputswas deleted before being returned incompute_lossTest update: Remove
xfailmarker fromtest_gkd_trainer_with_ligerNew test: Add
test_compute_loss_return_outputs_with_ligerevaluate()works correctly with Liger kernelTested with liger-kernel 0.6.4. All GKD tests pass.
Before submitting
Pull Request section?
to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Note
Medium Risk
Touches
GKDTrainer.compute_lossin the Liger fused-loss path, which is core training/eval logic; incorrect outputs could affect evaluation behavior, but the change is small and covered by targeted tests.Overview
Fixes
GKDTrainer.compute_losswhen using the Liger fused JSD loss soreturn_outputs=Trueno longer fails (keepsstudent_outputsalive, returns aModelOutput, and frees memory/empties cache after loss computation).Updates Liger-related tests by removing the previous
xfailon the Liger training test and adding a new regression test assertingevaluate()works on the Liger path (i.e.,return_outputs=Truedoes not raise and produceseval_loss).Reviewed by Cursor Bugbot for commit 36e2fc9. Bugbot is set up for automated code reviews on this repo. Configure here.