@Mytherin
Collaborator

This PR adds extension support and implements the ICU extension. The ICU extension uses the minimal ICU collation library to add support for collations of various locales, rather than just the simple NOCASE/NOACCENT collations that are currently supported.
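To illustrate what a collation changes, here is a minimal, self-contained sketch that sorts the same strings with a plain byte-wise comparison and with a NOCASE-style comparison. It is not DuckDB's or ICU's implementation: the helper names (`FoldCase`, `SortBinary`, `SortNoCase`) are hypothetical, and the case folding is ASCII-only, which is exactly the kind of limitation that locale-aware ICU collations are meant to lift.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: ASCII-only lower-casing used as a sort key.
std::string FoldCase(const std::string &input) {
    std::string result = input;
    for (char &c : result) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return result;
}

// Byte-wise ordering: every upper-case ASCII letter sorts before every lower-case one.
std::vector<std::string> SortBinary(std::vector<std::string> values) {
    std::sort(values.begin(), values.end());
    return values;
}

// NOCASE-style ordering: compare case-folded keys instead of the raw bytes.
std::vector<std::string> SortNoCase(std::vector<std::string> values) {
    std::stable_sort(values.begin(), values.end(),
                     [](const std::string &a, const std::string &b) {
                         return FoldCase(a) < FoldCase(b);
                     });
    return values;
}
```

For the input `{"cherry", "apple", "Banana"}`, `SortBinary` puts `Banana` first, while `SortNoCase` returns `apple`, `Banana`, `cherry`.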

The extension is optional and is not built by default. It can be built by passing the flag -DBUILD_ICU_EXTENSION=1 to CMake. The extension can be loaded into a database as follows:

DuckDB db;
db.LoadExtension<ICUExtension>();

Note that for now only statically linked extensions are supported, i.e. the extension must be compiled into the program. The current extension API does not (yet?) support loading extensions from shared libraries/DLLs.

Loading the extension makes a number of new collations available. The full list can be viewed using the PRAGMA collations command or queried using the pragma_collations function.

These are the supported collations:

// [af, am, ar, as, az, be, bg, bn, bo, bs, bs, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US, eo, es, et, fa, fa_AF, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, om, or, pa, pa, pa_IN, pl, ps, pt, ro, ru, se, si, sk, sl, smn, sq, sr, sr, sr_BA, sr_ME, sr_RS, sr, sr_BA, sr_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh, zh_CN, zh_SG, zh, zh_HK, zh_MO, zh_TW, zu]

These correspond to the collation rules found here.

@hannes
Member

hannes commented Apr 27, 2020

Quick question: Can ICU also do the NFC normalisation? If so, it might be desirable to not have two UTF libraries?

@Mytherin
Collaborator Author

Mytherin commented Apr 27, 2020

ICU can do everything that utf8proc can do and a billion other things, but for now I have opted to leave ICU as an optional extension because it is extremely large and everything is quite interconnected. Even my "stripped" version has more than 1MB of compiled code (+5MB of data). If we want to replace utf8proc and use ICU for Unicode lower/upper casing, grapheme break detection, and NFC normalization, we will need to always include ICU.

We will also need to include other data packages from ICU to support these, specifically the Normalization and Break Iteration packages, which are by themselves already larger than utf8proc:

| Feature Category | ID(s) | Data Files (icu4c/source/data) | Resource Size (as of ICU 64) |
|---|---|---|---|
| Break Iteration | "brkitr_rules" | brkitr/rules/*.txt | 522 KiB |
| | "brkitr_dictionaries" | brkitr/dictionaries/*.txt | 2.8 MiB |
| | "brkitr_tree" | brkitr/*.txt | 14 KiB |
| Charset Conversion | "conversion_mappings" | mappings/*.ucm | 4.9 MiB |
| Collation | "coll_ucadata" | in/coll/ucadata-*.icu | 511 KiB |
| | "coll_tree" | coll/*.txt | 2.8 MiB |
| Confusables | "confusables" | unidata/confusables*.txt | 45 KiB |
| Currencies | "misc" | misc/currencyNumericCodes.txt | 3.1 KiB |
| | "curr_supplemental" | curr/supplementalData.txt | 27 KiB |
| | "curr_tree" | curr/*.txt | 2.5 MiB |
| Language Display Names | "lang_tree" | lang/*.txt | 2.1 MiB |
| Language Tags | "misc" | misc/keyTypeData.txt | 6.8 KiB |
| | "misc" | misc/langInfo.txt | 37 KiB |
| | "misc" | misc/likelySubtags.txt | 53 KiB |
| | "misc" | misc/metadata.txt | 33 KiB |
| Normalization | "normalization" | in/*.nrm except in/nfc.nrm | 160 KiB |
| Plural Rules | "misc" | misc/pluralRanges.txt | 3.3 KiB |
| | "misc" | misc/plurals.txt | 33 KiB |
| Region Display Names | "region_tree" | region/*.txt | 1.1 MiB |
| Rule-Based Number Formatting (Spellout, Ordinals) | "rbnf_tree" | rbnf/*.txt | 538 KiB |
| StringPrep | "stringprep" | sprep/*.txt | 193 KiB |
| Time Zones | "misc" | misc/metaZones.txt | 41 KiB |
| | "misc" | misc/timezoneTypes.txt | 20 KiB |
| | "misc" | misc/windowsZones.txt | 22 KiB |
| | "misc" | misc/zoneinfo64.txt | 151 KiB |
| | "zone_tree" | zone/*.txt | 2.7 MiB |
| | "zone_supplemental" | zone/tzdbNames.txt | 4.8 KiB |
| Transliteration | "translit" | translit/*.txt | 685 KiB |
| Unicode Character Names | "unames" | in/unames.icu | 269 KiB |
| Unicode Text Layout | "ulayout" | in/ulayout.icu | 14 KiB |
| Units | "unit_tree" | unit/*.txt | 1.7 MiB |
| OTHER | "cnvalias" | mappings/convrtrs.txt | 63 KiB |
| | "misc" | misc/dayPeriods.txt | 19 KiB |
| | "misc" | misc/genderList.txt | 0.5 KiB |
| | "misc" | misc/numberingSystems.txt | 5.6 KiB |
| | "misc" | misc/supplementalData.txt | 228 KiB |
| | "locales_tree" | locales/*.txt | 2.4 MiB |
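As a rough check of how much extra data that would be, the sketch below adds up the Break Iteration and Normalization sizes listed in the table above. The function names are made up for illustration, and the figures are the rounded ICU 64 values from the table, so the total is only approximate.

```cpp
// Sizes in KiB, copied from the ICU 64 table above; 1 MiB = 1024 KiB.
constexpr double kKiBPerMiB = 1024.0;

// Break Iteration: rules (522 KiB) + dictionaries (2.8 MiB) + tree (14 KiB).
constexpr double BreakIterationKiB() {
    return 522.0 + 2.8 * kKiBPerMiB + 14.0;
}

// Normalization: in/*.nrm except in/nfc.nrm (160 KiB).
constexpr double NormalizationKiB() {
    return 160.0;
}

// Extra data needed on top of the collation data if ICU were to replace utf8proc.
constexpr double ExtraDataKiB() {
    return BreakIterationKiB() + NormalizationKiB();
}
```

`ExtraDataKiB()` comes to about 3563 KiB, i.e. roughly 3.5 MiB, with the break iteration dictionaries accounting for most of it.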

@Mytherin Mytherin merged commit 8b912f6 into master Apr 27, 2020
@Mytherin Mytherin deleted the extension branch April 27, 2020 16:34