Align KTO with DPO: Align tokenization#5601
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 13db276. Configure here.
| ] | ||
| prompt_completion_ids = self._tokenize( | ||
| processing_class, example["prompt"] + example["completion"] | ||
| )["input_ids"] |
There was a problem hiding this comment.
Conversational branch in tokenize_fn is unreachable dead code
Medium Severity
The is_conversational(example) check in tokenize_fn will always return False because maybe_apply_chat_template is applied to the dataset earlier (line 587–591), which converts all conversational data (lists of message dicts) to plain strings. DPO's equivalent tokenize_fn works because DPO does not call maybe_apply_chat_template before tokenization — data is still in conversational format when DPO's is_conversational check runs. Here, the entire conversational branch (using add_generation_prompt=True and apply_chat_template for tokenization) is dead code that can never execute.
Additional Locations (1)
Triggered by project rule: ../.ai/AGENTS.md
Reviewed by Cursor Bugbot for commit 13db276. Configure here.
There was a problem hiding this comment.
Yes, this is intentional. I am addressing this in my follow-up PR: Remove maybe_apply_chat_template
| from transformers import AutoModelForCausalLM, AutoTokenizer | ||
|
|
||
| from trl.experimental.kto import KTOConfig, KTOTrainer | ||
| from trl.experimental.kto.kto_trainer import _get_kl_dataset, _process_tokens, _tokenize |
Align KTO with DPO: Align tokenization.
This PR refactors the tokenization logic in
KTOTrainerto improve clarity, modularity, and support for conversational data. The main changes include removing the old_tokenizefunction, introducing a new instance method_tokenizethat handles both conversational and non-conversational inputs, and updating dataset preparation to use this new logic. Additionally, the pull request simplifies and clarifies the token processing pipeline.Part of:
Changes
Tokenization Refactor and Conversational Data Support:
_tokenizefunction, which was previously responsible for tokenizing batches, and replaced it with a new instance method_tokenizethat can handle both plain text and conversational (chat) inputs, improving modularity and clarity._prepare_datasetto use the new_tokenizemethod and added support for conversational data via theis_conversationalutility, ensuring correct tokenization for both chat and non-chat datasets.Code Cleanup and Simplification:
numpyand the type import forPreTrainedTokenizer, streamlining dependencies._process_tokensfunction by removing redundant checks and batch construction, focusing it on label processing.Note
Medium Risk
Changes how training/eval datasets are tokenized (including chat-style inputs), which can subtly affect prompt/answer boundaries and thus loss computation; tests were updated to cover the new behavior but tokenizer edge cases may still surface.
Overview
Refactors
KTOTrainerdataset preprocessing to use a new instance_tokenizemethod and a per-exampletokenize_fn, adding explicit handling for conversational inputs viais_conversational/apply_chat_templateand emitting a warning when prompt vs prompt+completion tokenization mismatches.Removes the old batch
_tokenizeimplementation (and itsnumpydependency) and simplifies_process_tokensto operate purely on pre-tokenized fields while preserving thelabel.Updates
test_tokenize_and_process_tokensto validate tokenization throughtrainer.train_dataset(rather than calling_tokenizedirectly), adds KL corruption checks against the prepared dataset, and adds a syntheticDataset.from_dictcase to exercise_process_tokenswithnum_proc.Reviewed by Cursor Bugbot for commit 8780ca1. Bugbot is set up for automated code reviews on this repo. Configure here.