UniversalDetector vs chardet.detect difference #296

@fomcl

Hi, thanks for creating chardet! Today I used chardet.detect to find the encoding of about 1600 files. Later, when I opened those files in mode='r' with the detected encoding, 3 of them raised UnicodeDecodeError. All three had been misidentified as Windows-1252. I did a quick check in the terminal, as shown below, and was surprised to see a different result: MacRoman. Is there an explanation for this difference? I thought chardet.detect was the recommended way. I now use UniversalDetector() to get the correct result.

I'm using Python 3.11 with a recent chardet version.

   # Interpreter
   >>> contents = open(FILENAME, "rb").read()
   >>> chardet.detect(contents)
   {'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}

   # Terminal
   $ python -m chardet FILENAME
   FILENAME: MacRoman with confidence 0.7167379080370483
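
For anyone reproducing this, here is a minimal sketch of the two steps described above: feeding a file to UniversalDetector in chunks, then checking that a detected encoding really decodes the bytes before trusting it. The helper names `detect_incrementally` and `first_clean_encoding`, the `detector` parameter, and the chunk size are hypothetical and not part of chardet. The sketch does not explain why the two results differ. It only guards against the UnicodeDecodeError: Python's cp1252 codec rejects a handful of undefined byte values (0x81, for example), while mac_roman maps every byte.

```python
def detect_incrementally(path, detector=None, chunk_size=65536):
    """Feed a file to a chardet-style detector chunk by chunk (hypothetical helper)."""
    if detector is None:
        # Imported lazily so the decode check below works without chardet installed.
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # the detector has already made up its mind
                break
    return detector.close()  # result dict, e.g. with an 'encoding' key


def first_clean_encoding(data, candidates):
    """Return the first candidate that strictly decodes data, or None (hypothetical helper)."""
    for name in candidates:
        if not name:  # a detector may report no encoding at all
            continue
        try:
            data.decode(name)  # strict error handling is the default
        except (UnicodeDecodeError, LookupError):
            continue
        return name
    return None
```

One possible use is to pass the detected encoding first and a fallback second, for example `first_clean_encoding(contents, [result["encoding"], "mac_roman"])`, and to open the file in text mode only with the name that comes back.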
