Skip to content

Unsafe pickle loading in a few modules #3486

@ekaf

Description

@ekaf

Summary

NLTK still uses unguarded pickle.load in a few modules, and the complexity/depth of the objects being serialized now makes replacing pickle with a safer format (e.g., JSON or custom serialization) highly impractical. Instead, we propose to document the risk rather than rewrite these modules.

Details

The following NLTK modules still allow loading pickled files which, if malicious or tampered with, would allow arbitrary code execution:

  • nltk/app/chartparser_app.py (chart and grammar loading via GUI file dialogs, e.g. pickle.load(infile))
  • nltk/parse/transitionparser.py (loads scikit-learn models via pickle.load(open(modelFile, "rb")))
  • nltk/tbl/demo.py (demo code cache/reloads pickled taggers: pickle.load(print_rules))

Context

  • The pickles in question often store deeply nested or complex Python objects (parsers, taggers, ML models), sometimes generated by 3rd-party libraries (e.g., scikit-learn), making format conversion extremely labor-intensive and fragile.
  • Previous efforts in NLTK to make model loading safer (e.g., RestrictedUnpickler for nltk.data.load) do not cover these code paths, which call pickle.load directly. Moving these call sites to restricted unpicklers is non-trivial and may break usage.
  • See: Security issues with pickles (#2522), PR #3290 (restricted unpickler), other related security work.

Proposal

  • Document these modules as not safe for loading untrusted files.
  • In the module docstrings or README, add clear warnings that loading a pickle file can execute arbitrary code and must never be done with untrusted files.
  • We do not propose to replace pickle with a safer format in these modules, since the cost/benefit ratio is prohibitive.

Acceptance Criteria

  • A concise, visible warning is added to the relevant modules (docstring or top-of-file comment).
  • The warning states that loading pickle files implements arbitrary code execution and should only be used with files produced in a trusted environment.
  • Close this issue as "won't fix" with justification after the documentation is in place.

Further suggestions or discussion welcome.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions