-
Notifications
You must be signed in to change notification settings - Fork 3k
Open
Labels
Description
Summary
NLTK still uses unguarded pickle.load in a few modules, and the complexity/depth of the objects being serialized now makes replacing pickle with a safer format (e.g., JSON or custom serialization) highly impractical. Instead, we propose to document the risk rather than rewrite these modules.
Details
The following NLTK modules still allow loading pickled files which, if malicious or tampered with, would allow arbitrary code execution:
nltk/app/chartparser_app.py(chart and grammar loading via GUI file dialogs, e.g.pickle.load(infile))nltk/parse/transitionparser.py(loads scikit-learn models viapickle.load(open(modelFile, "rb")))nltk/tbl/demo.py(demo code cache/reloads pickled taggers:pickle.load(print_rules))
Context
- The pickles in question often store deeply nested or complex Python objects (parsers, taggers, ML models), sometimes generated by 3rd-party libraries (e.g., scikit-learn), making format conversion extremely labor-intensive and fragile.
- Previous efforts in NLTK to make model loading safer (e.g.,
RestrictedUnpicklerfornltk.data.load) do not cover these code paths, which callpickle.loaddirectly. Moving these call sites to restricted unpicklers is non-trivial and may break usage. - See: Security issues with pickles (#2522), PR #3290 (restricted unpickler), other related security work.
Proposal
- Document these modules as not safe for loading untrusted files.
- In the module docstrings or README, add clear warnings that loading a pickle file can execute arbitrary code and must never be done with untrusted files.
- We do not propose to replace pickle with a safer format in these modules, since the cost/benefit ratio is prohibitive.
Acceptance Criteria
- A concise, visible warning is added to the relevant modules (docstring or top-of-file comment).
- The warning states that loading pickle files implements arbitrary code execution and should only be used with files produced in a trusted environment.
- Close this issue as "won't fix" with justification after the documentation is in place.
Further suggestions or discussion welcome.