Use `PreTrainedTokenizerBase` for tokenizer type hints #5629
In TRL, type hints for tokenizer parameters should use `PreTrainedTokenizerBase`, not `PreTrainedTokenizer`.

In transformers, `PreTrainedTokenizer` is an alias for `PythonBackend` (slow tokenizer only). The fast tokenizer (`TokenizersBackend`) inherits directly from `PreTrainedTokenizerBase` and is NOT a `PreTrainedTokenizer`. Since `AutoTokenizer.from_pretrained(...)` returns a fast tokenizer by default for most models, `PreTrainedTokenizer` hints fail to cover the common case. `PreTrainedTokenizerBase` is the common ancestor and exposes `.apply_chat_template`, `.chat_template`, `.vocab`, `.eos_token`, etc.

How to apply: When adding or editing a function that accepts a tokenizer (not model-specific), use `PreTrainedTokenizerBase`. Only use `PreTrainedTokenizer` if the code truly requires slow-tokenizer-only behavior (rare).

Note
Low Risk
Low risk type-hint-only change; runtime behavior should be unchanged, but downstream type checkers may surface new/changed annotations.
Overview
Broadens tokenizer type annotations from `PreTrainedTokenizer` to `PreTrainedTokenizerBase` across chat-template utilities (e.g. `clone_chat_template`, `get_training_chat_template`, `parse_response`) and BCO dataset tokenization (`_tokenize`) so fast tokenizers and processors are covered.

Cleans up related typing imports (drops `TYPE_CHECKING`/`Optional` usage) and updates the `ProcessingClassT` TypeVar/docstrings to match the new base type.

Reviewed by Cursor Bugbot for commit 87c2dbf.
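
The annotation guideline in this PR can be illustrated with a minimal sketch. The classes below are simplified stand-ins that only mirror the inheritance relationship described in the PR; they are not the real transformers classes, and `FastTokenizer` and `has_chat_template` are hypothetical names introduced for illustration.

```python
from __future__ import annotations


# Simplified stand-ins mirroring the hierarchy described in this PR.
# These are NOT the real transformers classes.
class PreTrainedTokenizerBase:
    """Common ancestor holding attributes shared by slow and fast tokenizers."""

    def __init__(self, eos_token: str = "</s>", chat_template: str | None = None):
        self.eos_token = eos_token
        self.chat_template = chat_template


class PreTrainedTokenizer(PreTrainedTokenizerBase):
    """Stand-in for the slow (Python-backed) tokenizer."""


class FastTokenizer(PreTrainedTokenizerBase):
    """Stand-in for the fast tokenizer; it does not subclass PreTrainedTokenizer."""


def has_chat_template(tokenizer: PreTrainedTokenizerBase) -> bool:
    """Hypothetical helper, annotated with the base class so both kinds are accepted."""
    return tokenizer.chat_template is not None


slow = PreTrainedTokenizer(chat_template="{{ messages }}")
fast = FastTokenizer()

# A hint of PreTrainedTokenizer would exclude the fast tokenizer:
print(isinstance(fast, PreTrainedTokenizer))      # False
# The base class covers both:
print(isinstance(fast, PreTrainedTokenizerBase))  # True
```

With the stand-in hierarchy, a type checker would flag `has_chat_template(fast)` if the parameter were annotated as `PreTrainedTokenizer`, which is the mismatch the PR is meant to avoid.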