The dumped JSON file contains encoded strings, which can be error-prone to decode in scripting languages like Python.
These encoded strings, especially those in the name table, are dumped directly as bytes and are not always uniformly encoded, because fonts store them as encoded bytes when the corresponding platform is Windows.
Take JSON decoding in Python 3 as an example: json.load() and json.loads() expect str rather than raw bytes (newer versions also accept bytes, but only in UTF-8, UTF-16, or UTF-32), and decoding the bytes first fails because the JSON file can contain bytes in mixed encodings. The same problem arises when generating JSON in Python 3. Most third-party JSON parsing packages for Python face the same problem, and many C/C++ JSON parsing libraries also assume that JSON uses a Unicode encoding (RapidJSON, for example). Yet for many Chinese and Japanese fonts, the dumped bytes contain Shift JIS and GBK. The decision made by these libraries is reasonable, because according to RFC 7159, section 8.1:
"JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32)."
A similar provision appears in ECMA-404 as well.
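To make the failure concrete, here is a minimal sketch in Python 3. The two name strings and the JSON layout are made up for illustration and are not taken from a real dump; the point is only that a file mixing Shift JIS and GBK bytes cannot be decoded correctly with any single codec.

```python
import json

# Hypothetical name records: one stored as Shift JIS, one as GBK.
sjis_name = "ゴシック".encode("shift_jis")
gbk_name = "黑体".encode("gbk")

# A dump that writes the stored bytes verbatim mixes both encodings in one file.
raw = b'{"name": ["' + sjis_name + b'", "' + gbk_name + b'"]}'


def try_load(data, encoding):
    """Decode the whole file with one codec, then parse; return the result or the error."""
    try:
        return json.loads(data.decode(encoding))
    except ValueError as exc:
        # UnicodeDecodeError and json.JSONDecodeError are both ValueError subclasses.
        return exc


expected = {"name": ["ゴシック", "黑体"]}
results = {enc: try_load(raw, enc) for enc in ("utf-8", "shift_jis", "gbk")}
for enc, result in results.items():
    # Each codec either raises or silently produces the wrong text for one of the names.
    print(enc, "->", "ok" if result == expected else "failed or garbled")
```

None of the three candidate codecs recovers both names, which is why a parser that insists on one Unicode encoding for the whole file cannot handle such a dump.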
I can certainly write a JSON parsing package based on YAJL's tokenizer and its corresponding JSON parser, using bytes as its sole data exchange type even in Python 3, and I am actually working on one. However, I am hoping for a more elegant solution. For example, the dump could apply an additional base64 encoding to these strings, or decode them and re-encode them as Unicode before dumping. Providing a manipulation API for other languages would also be welcome.
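The two dump-side options could look roughly like the sketch below. It is written in Python only to illustrate the idea; the record layout, the key names (platformID, encodingID, raw) and the codec table are assumptions made for this example, not the dumper's actual output format.

```python
import base64
import json

# Hypothetical name record: Windows platform (3), Shift JIS encoding (2), raw stored bytes.
record = {"platformID": 3, "encodingID": 2, "raw": "ゴシック".encode("shift_jis")}

# Assumed mapping from Windows-platform encoding IDs to Python codecs (illustrative subset).
WINDOWS_CODECS = {1: "utf-16-be", 2: "shift_jis", 3: "gbk", 10: "utf-16-be"}


def dump_base64(rec):
    """Option 1: keep the bytes opaque by wrapping them in base64."""
    out = dict(rec, raw=base64.b64encode(rec["raw"]).decode("ascii"))
    return json.dumps(out)


def dump_unicode(rec):
    """Option 2: decode with the codec implied by the encoding ID, then emit Unicode."""
    codec = WINDOWS_CODECS[rec["encodingID"]]
    out = dict(rec, raw=rec["raw"].decode(codec))
    return json.dumps(out, ensure_ascii=False)


as_base64 = dump_base64(record)
as_unicode = dump_unicode(record)

# Both outputs are ordinary JSON text that any standard parser can read.
restored = base64.b64decode(json.loads(as_base64)["raw"])
print(restored == record["raw"])
print(json.loads(as_unicode)["raw"])
```

The base64 variant preserves the stored bytes exactly but leaves decoding to the consumer, while the Unicode variant gives readable strings at the cost of trusting the encoding ID in the font.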
I hope this information is self-contained and helps you understand the problem.