Text Encoding Documentation
There are a number of unique i18n challenges to overcome, due in part to the nature of the original game data. This document discusses these issues and the rationale for our chosen solutions.
TLK and Fonts

Not all data is encoded the same way. The game text comes from the TLK, which uses a localized encoding that unfortunately predates Unicode. Game text needs to be drawn, of course, so there are Fonts (BAMs) which are used to associate a "character" with a "glyph". This lookup only works with encodings that don't rely on surrogates, and it is limited to 16 bits due to the 8-bit frame and 8-bit cycle of the BAM format.

Therefore, it makes the most sense for GemRB to internally use UTF-16 (and intentionally ignore surrogates), since it can represent all required languages with 16-bit characters. There are other reasons to prefer Unicode as well: it makes interfacing with our Python scripts much easier, since we won't have to manually encode/decode strings in the Python scripts.

Using UTF-16 does introduce one additional challenge, however. The Fonts are specifically crafted to match the encoding of the TLK, so in order to work with UTF-16 the Fonts need to be aware of their encoding, so that when we pass a UTF-16 character it can be mapped to the correct frame in the BAM.
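For illustration, decoding a raw TLK string into the internal UTF-16 could be done with iconv, roughly as in the sketch below. This is not GemRB's actual code; the function name is made up, and the codepage would come from the game's language configuration rather than being hard-coded.

```cpp
// Illustrative sketch: convert bytes read from the TLK (in some legacy
// codepage) into UTF-16 so an encoding-aware Font can map characters to
// BAM frames. Error handling is minimal on purpose.
#include <iconv.h>
#include <string>
#include <vector>

std::u16string DecodeTLKString(const std::string& bytes, const char* tlkEncoding)
{
    // Request little-endian UTF-16 explicitly so no BOM ends up in the output.
    iconv_t cd = iconv_open("UTF-16LE", tlkEncoding);
    if (cd == (iconv_t)-1) return std::u16string();

    // For single- and multi-byte source encodings (no surrogates), the output
    // never needs more code units than there are input bytes.
    std::vector<char16_t> out(bytes.size() + 1);
    char* in = const_cast<char*>(bytes.data());
    size_t inLeft = bytes.size();
    char* outPtr = reinterpret_cast<char*>(out.data());
    size_t outLeft = out.size() * sizeof(char16_t);

    size_t rc = iconv(cd, &in, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (size_t)-1) return std::u16string();

    size_t produced = (out.size() * sizeof(char16_t) - outLeft) / sizeof(char16_t);
    return std::u16string(out.data(), produced);
}
```

The Font then only needs to know which frame/cycle a given 16-bit character maps to; the decoding itself stays in one place.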
Game Scripts (BCS)

Then there are, of course, the game scripts, which we assume to be ASCII in the C locale. We should continue to use that encoding (ASCII with the C locale).
GUIScripts (Python)

Python should handle any necessary encoding and decoding for us as long as we use the "new" Unicode API. We don't have to care what Python uses internally or what the scripts are encoded with.
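A minimal sketch of what that boundary could look like with the Python 3 C API, assuming the engine stores text as std::u16string (the helper names here are made up for illustration):

```cpp
// Illustrative only: wrap/unwrap engine strings at the Python boundary so the
// GUIScripts never deal with byte encodings themselves.
#include <Python.h>
#include <string>

// Engine UTF-16 -> Python str.
PyObject* ToPyString(const std::u16string& text)
{
    int byteorder = 0; // 0 = native order; a leading BOM, if present, is honored and stripped
    return PyUnicode_DecodeUTF16(reinterpret_cast<const char*>(text.data()),
                                 text.size() * sizeof(char16_t),
                                 "replace", &byteorder);
}

// Python str -> engine UTF-16.
std::u16string FromPyString(PyObject* pyStr)
{
    std::u16string result;
    PyObject* bytes = PyUnicode_AsUTF16String(pyStr); // native order, BOM prepended
    if (!bytes) return result;

    const char16_t* data = reinterpret_cast<const char16_t*>(PyBytes_AS_STRING(bytes));
    Py_ssize_t units = PyBytes_GET_SIZE(bytes) / sizeof(char16_t);
    if (units > 0) result.assign(data + 1, units - 1); // drop the BOM
    Py_DECREF(bytes);
    return result;
}
```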
Configuration File

I'm not sure what the most sensible thing to do here is. Adding a config option doesn't really work, since we would need to decode the config file before we could read such an option. We should probably just enforce something sensible like UTF-8.
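If UTF-8 is enforced, decoding config values into the internal representation is simple enough to do by hand. A minimal sketch that only handles BMP code points (consistent with ignoring surrogates) and substitutes U+FFFD for anything else:

```cpp
// Illustrative UTF-8 -> UTF-16 decoder; no exhaustive validation (overlong
// sequences etc. are not rejected), and 4-byte sequences are replaced since
// the engine ignores anything outside the BMP.
#include <string>

std::u16string UTF8ToUTF16(const std::string& in)
{
    std::u16string out;
    for (size_t i = 0; i < in.size();) {
        unsigned char b = in[i];
        char16_t cp = 0xFFFD; // replacement character by default
        size_t len = 1;

        if (b < 0x80) {                                        // ASCII
            cp = b;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < in.size()) {  // 2-byte sequence
            cp = ((b & 0x1F) << 6) | (in[i + 1] & 0x3F);
            len = 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < in.size()) {  // 3-byte sequence
            cp = ((b & 0x0F) << 12) | ((in[i + 1] & 0x3F) << 6) | (in[i + 2] & 0x3F);
            len = 3;
        } else if ((b & 0xF8) == 0xF0) {                       // outside the BMP
            len = 4;
        }

        out.push_back(cp);
        i += len;
    }
    return out;
}
```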
Plugins

Each plugin is responsible for knowing the encoding required for the file format it reads and writes, and will need to perform conversion to/from UTF-16 when required.
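What that responsibility could look like in practice, as a purely hypothetical sketch (the class and method names are not GemRB's actual plugin interface):

```cpp
// Hypothetical importer: it knows its format's encoding and converts to the
// internal UTF-16 on load, reusing a shared helper such as the DecodeTLKString
// sketch above.
#include <string>

std::u16string DecodeTLKString(const std::string& bytes, const char* encoding); // from the earlier sketch

class ExampleTextImporter {
public:
    explicit ExampleTextImporter(const char* encoding) : encoding(encoding) {}

    // Raw bytes from the file -> engine UTF-16.
    std::u16string ImportString(const std::string& rawBytes) const
    {
        return DecodeTLKString(rawBytes, encoding);
    }

private:
    const char* encoding; // e.g. "CP1251"; a placeholder supplied by whoever loads the plugin
};
```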
File Paths

Different filesystems/OSs expect paths in different encodings (at this level they really are just bytes). I think it's probably best to assume UTF-8, since we need to be compatible with ResRef and most file explorers use UTF-8. I don't know a better way.
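Going the other way, an internal UTF-16 name has to become UTF-8 bytes before it is used in a path. A minimal sketch, again BMP-only to match the decision above:

```cpp
// Illustrative UTF-16 -> UTF-8 encoder for building filesystem paths.
#include <string>

std::string UTF16ToUTF8(const std::u16string& in)
{
    std::string out;
    for (char16_t c : in) {
        if (c < 0x80) {                        // 1 byte
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {                // 2 bytes
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {                               // 3 bytes
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

The resulting std::string can then be concatenated with the game path and handed to the OS as-is.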
SDL Input

Does SDL always use UTF-8? The user's locale? I can't find documentation for how to decode text events.
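For reference, SDL 2 documents the text field of its SDL_TEXTINPUT event as a null-terminated UTF-8 string, so handling it could look like the sketch below; SDL 1.2 instead exposes a 16-bit keysym.unicode value after SDL_EnableUNICODE(1), so that path would differ. UTF8ToUTF16 is the helper sketched earlier.

```cpp
// Sketch assuming SDL 2: text input arrives as UTF-8 and is converted to the
// internal UTF-16 before being handed to the focused text control.
#include <SDL.h>
#include <string>

std::u16string UTF8ToUTF16(const std::string& in); // from the earlier sketch

void PollTextInput()
{
    SDL_Event event;
    while (SDL_PollEvent(&event)) {
        if (event.type == SDL_TEXTINPUT) {
            std::u16string typed = UTF8ToUTF16(event.text.text); // event.text.text is UTF-8
            // ... feed 'typed' to whatever control has keyboard focus ...
        }
    }
}
```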