Hi, thanks for creating chardet! Today I used chardet.detect to find the encoding of about 1600 files. A bit later I got UnicodeDecodeErrors with 3 of those files when I tried opening them in mode='r' with the derived encoding. All three were mis-identified as windows-1252. I did a quick check in the terminal as shown below and I was surprised to see a different result: Mac Roman. Is there an explanation for this difference? I thought chardet.detect was the recommended way. I now use UniversalDetector() to get the correct result.
I'm using Python 3.11 with a recent chardet version.
# Interpreter
>>> contents = open(FILENAME, "rb").read()
>>> chardet.detect(contents)
{'encoding': 'Windows-1252', 'confidence': 0.7282676610947401, 'language': ''}
# Terminal
$ python -m chardet FILENAME
FILENAME: MacRoman with confidence 0.7167379080370483
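One way a windows-1252 guess can end in a UnicodeDecodeError: Python's cp1252 codec leaves five byte values unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), while its mac_roman codec accepts every byte. Whether that is what happened with these three files is an assumption, not something confirmed here. The sketch below uses only the standard library; `first_decodable` is a hypothetical helper, not part of chardet, and the sample bytes are made up for illustration.

```python
from typing import Optional, Sequence


def first_decodable(data: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate encoding that strictly decodes `data`.

    Hypothetical helper (not a chardet API): it only checks that decoding
    succeeds, not that the resulting text is meaningful.
    """
    for name in candidates:
        try:
            data.decode(name)  # strict mode raises on unmapped bytes
        except UnicodeDecodeError:
            continue
        return name
    return None


# Made-up sample: 0x8E is "é" in Mac Roman, and 0x81 is unassigned in cp1252.
sample = b"caf\x8e \x81"

# Try the detector's guess first, then a fallback.
chosen = first_decodable(sample, ["cp1252", "mac_roman"])
print(chosen)
```

A check like this can be run on the bytes before reopening a file in mode='r', so that a wrong guess shows up immediately instead of partway through reading.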