
Text Encoding Documentation

Brad Allred edited this page Jan 27, 2022 · 2 revisions

There are a number of unique i18n challenges to overcome, due in part to the nature of the original game data. This document discusses these issues and the rationale for our chosen solutions.

Game Data

TLK and Fonts

Not all data is encoded the same way. The game text comes from the TLK, which is stored in a localized encoding that unfortunately predates Unicode. Game text needs to be drawn, of course, so there are Fonts (BAMs) that associate a "character" with a "glyph". This lookup only works with encodings that don't rely on "surrogates", and it is limited to 16 bits because of the 8-bit frame and 8-bit cycle of the BAM format.

Therefore, it makes the most sense for GemRB to use UTF-16 internally (and intentionally ignore surrogates), since it can represent all required languages and uses 16-bit characters. There are other reasons to prefer Unicode as well: it makes interfacing with our Python scripts much easier, since we won't have to manually encode/decode strings in the scripts.

Using UTF-16 does introduce one additional challenge, however. The Fonts are specifically crafted to match the encoding of the TLK, so to work with UTF-16 the Fonts need to be aware of their own encoding. That way, when we pass a UTF-16 character, it can be mapped to the correct frame in the BAM.
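As a rough illustration of an encoding-aware Font lookup, the sketch below (in Python, for brevity) re-encodes a character into the Font's own encoding and splits the result into a cycle and a frame. The function name `glyph_location`, the choice of the high byte as the cycle and the low byte as the frame, and the example encodings `cp1252` and `cp932` are all assumptions made for illustration, not GemRB's actual implementation.

```python
from typing import Tuple


def glyph_location(ch: str, font_encoding: str) -> Tuple[int, int]:
    """Return a hypothetical (cycle, frame) pair for one character.

    Assumption: the high byte of the encoded character selects the cycle
    and the low byte selects the frame. The real BAM layout may differ.
    """
    # Surrogates are intentionally ignored, so only one 16-bit unit is allowed.
    if len(ch) != 1 or ord(ch) > 0xFFFF:
        raise ValueError("expected a single character in the 16-bit range")
    # Re-encode into the encoding the Font was crafted for (same as the TLK).
    raw = ch.encode(font_encoding)  # raises UnicodeEncodeError if unmappable
    if len(raw) > 2:
        raise ValueError("encoded character does not fit in 16 bits")
    code = int.from_bytes(raw, "big")
    return code >> 8, code & 0xFF
```

With a single-byte encoding every glyph ends up in cycle 0 under this assumption, while a double-byte encoding spreads glyphs over several cycles.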

Game Scripts (BCS)

Then there are, of course, the game scripts, which we assume to be ASCII in the C locale.

ResRefs/Variables

We should continue to use the same encoding as BCS (ASCII in the C locale).
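A minimal sketch of what "ASCII in the C locale" could mean in practice: decode strictly as ASCII, and compare names with a case mapping that touches only A-Z so the result never depends on the user's locale. The helper names are hypothetical, and the case-insensitive comparison itself is an assumption made for the example.

```python
def decode_ascii(raw: bytes) -> str:
    """Decode script or ResRef bytes, rejecting anything outside ASCII."""
    return raw.decode("ascii")  # raises UnicodeDecodeError on bytes >= 0x80


def ascii_lower(text: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def same_resref(a: bytes, b: bytes) -> bool:
    """Hypothetical case-insensitive comparison that ignores the user's locale."""
    return ascii_lower(decode_ascii(a)) == ascii_lower(decode_ascii(b))
```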

GUIScripts

Python should handle any necessary encoding and decoding for us as long as we use the "new" Unicode API. We don't have to care what Python uses internally or what the scripts are encoded with.

Config files

I'm not sure what the most sensible thing to do here is. Adding a config option for the encoding doesn't really work, since we would already need to decode the file in order to read that option. We should probably just enforce something sensible like UTF-8.
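If config files were required to be UTF-8, reading one would need no encoding option at all. The sketch below assumes a simple `key = value` format with `#` comments; that format and the function name `parse_config` are assumptions for illustration only.

```python
from typing import Dict


def parse_config(raw: bytes) -> Dict[str, str]:
    """Parse hypothetical key=value config data that must be valid UTF-8."""
    text = raw.decode("utf-8")  # strict: invalid UTF-8 raises UnicodeDecodeError
    options: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            options[key.strip()] = value.strip()
    return options
```

Decoding strictly means a file saved in some other encoding fails loudly instead of silently producing garbled option values.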

Data Importer Plugins

Each plugin is responsible for knowing the encoding required for the file format it is reading/writing. They will need to perform conversion to/from UTF-16 when required.
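The conversion a plugin performs might look roughly like the following sketch, which turns bytes in a file format's encoding into 16-bit code units and back. The function names and the decision to reject surrogate code units with an error (rather than dropping them) are assumptions for illustration.

```python
import struct
from typing import List


def to_utf16_units(raw: bytes, source_encoding: str) -> List[int]:
    """Convert file bytes into a list of 16-bit UTF-16 code units."""
    text = raw.decode(source_encoding)
    encoded = text.encode("utf-16-le")
    units = list(struct.unpack("<%dH" % (len(encoded) // 2), encoded))
    # Surrogates are not supported, so anything outside 16 bits is rejected.
    if any(0xD800 <= u <= 0xDFFF for u in units):
        raise ValueError("characters outside the 16-bit range are not supported")
    return units


def from_utf16_units(units: List[int], target_encoding: str) -> bytes:
    """Convert 16-bit code units back into the file format's encoding."""
    encoded = struct.pack("<%dH" % len(units), *units)
    return encoded.decode("utf-16-le").encode(target_encoding)
```

A plugin would call the first function when reading and the second when writing, passing whatever encoding its file format requires.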

System Encoding Challenges

Filesystem

Different filesystems and OSs expect paths in different encodings (at this level, paths are actually just bytes). I think it's probably best to assume UTF-8, since we need to be compatible with ResRefs and most file explorers use UTF-8. I don't know a better way.
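The "just bytes" point and the ResRef compatibility argument can be illustrated with a short sketch. The helper name `path_bytes` and the file names are made up for the example.

```python
def path_bytes(name: str, encoding: str = "utf-8") -> bytes:
    """Return the byte sequence a path name would have in a given encoding."""
    return name.encode(encoding)


# An ASCII-only name (as ResRefs are assumed to be) has the same bytes in
# ASCII and UTF-8, so assuming UTF-8 does not disturb ResRef lookups.
resref_name = "AR0100.are"
assert path_bytes(resref_name, "utf-8") == path_bytes(resref_name, "ascii")

# A non-ASCII name differs between encodings, so one encoding must be chosen.
save_name = "caf\u00e9.sav"
assert path_bytes(save_name, "utf-8") != path_bytes(save_name, "latin-1")
```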

User locale and input from SDL

Does SDL always use UTF-8? The user's locale? I can't find documentation for how to decode text events.
