ICU Extension & Extension Support #594
…d StringVector::AddBlob that skip verification, and add Vector::UTFVerify that is only called on vectors with SQL type VARCHAR, and only for specific functions that require valid UTF8 input. Comparisons and ordering support invalid UTF8 strings. This functionality is necessary for collation using ICU sort keys, as the sort keys are blobs of bytes and not necessarily valid UTF8.
…tValue to not perform UTF8 checking either
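To illustrate why sort keys need a blob path that skips UTF-8 verification, here is a minimal Python sketch. It does not call ICU: `toy_sort_key` and `is_valid_utf8` are hypothetical helpers, and the weighting scheme is invented for illustration only. It shows the two properties the commit message relies on: strings can be ordered by plain byte comparison of their keys, and the key bytes are not necessarily valid UTF-8.

```python
def toy_sort_key(text: str) -> bytes:
    """Hypothetical stand-in for a collation sort key (NOT the ICU algorithm).

    Only ASCII letters are supported. Each letter contributes one primary
    weight byte that ignores case, followed by a separator and one case byte
    per letter so that "apple" sorts directly before "Apple".
    """
    primary = bytearray()
    case_level = bytearray()
    for ch in text:
        lowered = ch.lower()
        if not ("a" <= lowered <= "z"):
            raise ValueError(f"unsupported character: {ch!r}")
        # Weights start at 0x80, so they fall outside the ASCII range and
        # are UTF-8 continuation bytes when read as text.
        primary.append(0x80 + (ord(lowered) - ord("a")))
        case_level.append(0x01 if ch != lowered else 0x00)
    # 0x00 is lower than every primary weight, so a prefix sorts first.
    return bytes(primary) + b"\x00" + bytes(case_level)


def is_valid_utf8(blob: bytes) -> bool:
    """Return True if the bytes decode as UTF-8."""
    try:
        blob.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    words = ["banana", "Apple", "apple", "Banana"]
    # Ordering by raw code points puts all uppercase words first.
    print(sorted(words))
    # Ordering by sort key is a plain bytes comparison.
    print(sorted(words, key=toy_sort_key))
    # The key itself would fail a UTF-8 check, so it must be stored as a blob.
    print(is_valid_utf8(toy_sort_key("apple")))
```

Because the keys are compared byte by byte, the comparison and ordering code never needs to interpret them as text, which is why only functions that genuinely require valid UTF-8 input need to verify it.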
Quick question: Can ICU also do the NFC normalisation? If so, it might be desirable to not have two UTF libraries?
ICU can do everything that utf8proc can do and a billion other things, but for now I have opted to leave ICU as an optional extension because it is extremely large and everything is quite interconnected. Even my "stripped" version has more than 1MB of compiled code (+5MB of data). If we want to replace utf8proc and use ICU for Unicode lower/upper case, grapheme breaker detection and NFC normalization, we will need to always include ICU. We will also need to include other data packages from ICU to support these, specifically the Normalization and Break Iteration packages, which are by themselves already larger than utf8proc:
This PR adds extension support and implements the ICU extension. The ICU extension uses the minimal ICU collation library to add support for collations of various locales, rather than just the simple `NOCASE`/`NOACCENT` collations that are currently supported.
The extension is optional and is not built by default. It can be built by passing the flag `-DBUILD_ICU_EXTENSION=1` to CMake. The extension can be loaded into a database as follows:
Note that for now only statically linked extensions are supported, i.e. the extension must be compiled into the program. The current extension API does not (yet?) support loading extensions from shared libraries/DLLs.
Loading the extension allows a number of new collations to be used. The full list can be viewed using the `PRAGMA collations` command or queried using the `pragma_collations` function.
These are the supported collations:
// [af, am, ar, as, az, be, bg, bn, bo, bs, bs, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US, eo, es, et, fa, fa_AF, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, om, or, pa, pa, pa_IN, pl, ps, pt, ro, ru, se, si, sk, sl, smn, sq, sr, sr, sr_BA, sr_ME, sr_RS, sr, sr_BA, sr_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh, zh_CN, zh_SG, zh, zh_HK, zh_MO, zh_TW, zu]
These correspond to the collation rules found here.