ICU Extension & Extension Support #594

Mytherin · 2020-04-26T20:00:05Z

This PR adds extension support and implements the ICU extension. The ICU extension uses the minimal ICU collation library to add support for collations of various locales, rather than just the simple NOCASE/NOACCENT collations that are currently supported.

The extension is optional and is not built by default. It can be built by passing the flag -DBUILD_ICU_EXTENSION=1 to CMake. The extension can be loaded into a database as follows:

DuckDB db;
db.LoadExtension<ICUExtension>();

Note that for now only statically linked extensions are supported, i.e. the extension must be compiled into the program. The current extension API does not (yet?) support loading extensions from shared libraries/DLLs.
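To illustrate why a template-based `LoadExtension<T>()` call implies static linking, here is a minimal, self-contained sketch. `ToyDatabase`, `ToyLocaleExtension`, `RegisterCollation`, and `HasCollation` are hypothetical names invented for this example and are not DuckDB's actual classes; the sketch only assumes that an extension exposes a `Load` method that registers its collations with the database.

```cpp
#include <cassert>
#include <initializer_list>
#include <set>
#include <string>

// Hypothetical stand-in for a database that keeps its collations in a catalog.
class ToyDatabase {
public:
    ToyDatabase() {
        // The two simple collations the PR description says already exist.
        collations.insert("nocase");
        collations.insert("noaccent");
    }

    // The extension type is a template argument, so its code has to be
    // compiled into the program: there is no runtime lookup of a shared library.
    template <class T>
    void LoadExtension() {
        T extension;
        extension.Load(*this);
    }

    void RegisterCollation(const std::string &name) {
        assert(!name.empty());
        collations.insert(name);
    }

    bool HasCollation(const std::string &name) const {
        return collations.count(name) > 0;
    }

    const std::set<std::string> &Collations() const {
        return collations;
    }

private:
    std::set<std::string> collations;
};

// Hypothetical extension that registers a few locale collations on load.
class ToyLocaleExtension {
public:
    void Load(ToyDatabase &db) {
        for (const char *locale : {"de", "fr", "ja"}) {
            db.RegisterCollation(locale);
        }
    }
};
```

In this sketch, calling `db.LoadExtension<ToyLocaleExtension>()` on a `ToyDatabase` makes the locale names visible through `Collations()`, loosely mirroring how the real collation list grows once the ICU extension is loaded.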

Loading the extension allows a number of new collations to be used. The total list can be viewed using the PRAGMA collations command or queried using the pragma_collations function.

These are the supported collations:

// [af, am, ar, as, az, be, bg, bn, bo, bs, bs, ca, ceb, chr, cs, cy, da, de, de_AT, dsb, dz, ee, el, en, en_US, en_US, eo, es, et, fa, fa_AF, fi, fil, fo, fr, fr_CA, ga, gl, gu, ha, haw, he, he_IL, hi, hr, hsb, hu, hy, id, id_ID, ig, is, it, ja, ka, kk, kl, km, kn, ko, kok, ku, ky, lb, lkt, ln, lo, lt, lv, mk, ml, mn, mr, ms, mt, my, nb, nb_NO, ne, nl, nn, om, or, pa, pa, pa_IN, pl, ps, pt, ro, ru, se, si, sk, sl, smn, sq, sr, sr, sr_BA, sr_ME, sr_RS, sr, sr_BA, sr_RS, sv, sw, ta, te, th, tk, to, tr, ug, uk, ur, uz, vi, wae, wo, xh, yi, yo, zh, zh, zh_CN, zh_SG, zh, zh_HK, zh_MO, zh_TW, zu]

These correspond to the collation rules found here.


…d StringVector::AddBlob that skip verification, and add Vector::UTFVerify that is only called on vectors with SQL type VARCHAR, and only for specific functions that require valid UTF8 input. Comparisons and ordering support invalid UTF8 strings. This functionality is necessary for collation using ICU sort keys, as the sort keys are blobs of bytes and not necessarily valid UTF8.
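The idea of collating through sort keys can be sketched without ICU: each string is mapped to a byte blob, and the blobs are compared bytewise. The key function below is a toy assumption for illustration (ASCII-only, case-insensitive) and is not ICU's algorithm; `Blob`, `ToySortKey`, and `SortByToyKey` are hypothetical names. It deliberately emits bytes in the range 0x80 to 0x99, which are not valid UTF-8 on their own, to show why the storage and comparison paths must accept such data.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

using Blob = std::vector<unsigned char>;

// Toy case-insensitive sort key: ASCII letters map to 0x80 + alphabet index,
// so "a" and "A" yield the same byte. Other ASCII bytes are copied unchanged.
Blob ToySortKey(const std::string &input) {
    Blob key;
    key.reserve(input.size());
    for (unsigned char c : input) {
        assert(c < 0x80); // this toy only handles ASCII input
        if (std::isalpha(c)) {
            key.push_back(static_cast<unsigned char>(0x80 + (std::tolower(c) - 'a')));
        } else {
            key.push_back(c);
        }
    }
    return key;
}

// Order strings by comparing their sort keys bytewise, not the strings themselves.
std::vector<std::string> SortByToyKey(const std::vector<std::string> &values) {
    typedef std::pair<Blob, std::string> Keyed;
    std::vector<Keyed> keyed;
    keyed.reserve(values.size());
    for (const std::string &value : values) {
        keyed.emplace_back(ToySortKey(value), value);
    }
    // Vectors of unsigned char compare lexicographically, byte by byte.
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const Keyed &a, const Keyed &b) { return a.first < b.first; });
    std::vector<std::string> result;
    result.reserve(keyed.size());
    for (const Keyed &entry : keyed) {
        result.push_back(entry.second);
    }
    return result;
}
```

With the input "banana", "Cherry", "apple", this ordering places "Cherry" last, whereas a plain bytewise sort of the original strings would place it first because 'C' (0x43) sorts before 'a' (0x61).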


hannes · 2020-04-27T08:38:01Z

Quick question: Can ICU also do the NFC normalisation? If so, it might be desirable to not have two UTF libraries?

Mytherin · 2020-04-27T08:51:34Z

ICU can do everything that utf8proc can do and a billion other things, but for now I have opted to leave ICU as an optional extension because it is extremely large and everything is quite interconnected. Even my "stripped" version has more than 1MB of compiled code (+5MB of data). If we want to replace utf8proc and use ICU for Unicode lower/upper case, grapheme break detection, and NFC normalization, we will need to always include ICU.

We will also need to include other data packages from ICU to support these, specifically the Normalization and Break Iteration packages, which are by themselves already larger than utf8proc:

| Feature | Category ID(s) | Data Files (icu4c/source/data) | Resource Size (as of ICU 64) |
|---|---|---|---|
| Break Iteration | `"brkitr_rules"`<br>`"brkitr_dictionaries"`<br>`"brkitr_tree"` | `brkitr/rules/*.txt`<br>`brkitr/dictionaries/*.txt`<br>`brkitr/*.txt` | 522 KiB<br>2.8 MiB<br>14 KiB |
| Charset Conversion | `"conversion_mappings"` | `mappings/*.ucm` | 4.9 MiB |
| Collation | `"coll_ucadata"`<br>`"coll_tree"` | `in/coll/ucadata-*.icu`<br>`coll/*.txt` | 511 KiB<br>2.8 MiB |
| Confusables | `"confusables"` | `unidata/confusables*.txt` | 45 KiB |
| Currencies | `"misc"`<br>`"curr_supplemental"`<br>`"curr_tree"` | `misc/currencyNumericCodes.txt`<br>`curr/supplementalData.txt`<br>`curr/*.txt` | 3.1 KiB<br>27 KiB<br>2.5 MiB |
| Language Display Names | `"lang_tree"` | `lang/*.txt` | 2.1 MiB |
| Language Tags | `"misc"` | `misc/keyTypeData.txt`<br>`misc/langInfo.txt`<br>`misc/likelySubtags.txt`<br>`misc/metadata.txt` | 6.8 KiB<br>37 KiB<br>53 KiB<br>33 KiB |
| Normalization | `"normalization"` | `in/*.nrm` except `in/nfc.nrm` | 160 KiB |
| Plural Rules | `"misc"` | `misc/pluralRanges.txt`<br>`misc/plurals.txt` | 3.3 KiB<br>33 KiB |
| Region Display Names | `"region_tree"` | `region/*.txt` | 1.1 MiB |
| Rule-Based Number Formatting (Spellout, Ordinals) | `"rbnf_tree"` | `rbnf/*.txt` | 538 KiB |
| StringPrep | `"stringprep"` | `sprep/*.txt` | 193 KiB |
| Time Zones | `"misc"`<br>`"zone_tree"`<br>`"zone_supplemental"` | `misc/metaZones.txt`<br>`misc/timezoneTypes.txt`<br>`misc/windowsZones.txt`<br>`misc/zoneinfo64.txt`<br>`zone/*.txt`<br>`zone/tzdbNames.txt` | 41 KiB<br>20 KiB<br>22 KiB<br>151 KiB<br>2.7 MiB<br>4.8 KiB |
| Transliteration | `"translit"` | `translit/*.txt` | 685 KiB |
| Unicode Character Names | `"unames"` | `in/unames.icu` | 269 KiB |
| Unicode Text Layout | `"ulayout"` | `in/ulayout.icu` | 14 KiB |
| Units | `"unit_tree"` | `unit/*.txt` | 1.7 MiB |
| OTHER | `"cnvalias"`<br>`"misc"`<br>`"locales_tree"` | `mappings/convrtrs.txt`<br>`misc/dayPeriods.txt`<br>`misc/genderList.txt`<br>`misc/numberingSystems.txt`<br>`misc/supplementalData.txt`<br>`locales/*.txt` | 63 KiB<br>19 KiB<br>0.5 KiB<br>5.6 KiB<br>228 KiB<br>2.4 MiB |

Mytherin added 16 commits April 24, 2020 14:13

Add extension API and setup for ICU extension

c58a5c6

Move collations into the catalog, rather than being hard coded

0222173

ICU extension now (mostly) working: need to remove UTF8 validation in strings

aafd01e

Add range filter test to ICU, and fix Vector::GetValue and Vector::SetValue to not perform UTF8 checking either

5315f69

Add PRAGMA collations command that lists all supported collations

6e5b494

Expand ICU collation tests

5372899

Make ICU extension building a flag and add ICU tests to travis

fb4ea6f

Format

e3507a3

Merge branch 'master' into extension

a65061b

Correct path for Travis Windows

1b8e9d9

Or is the windows path like this?

2211bcf

Rename statement types because windows.h likes to #define simple words

f05fd56

Also need to rename relation types for Windows

5691956

On Windows, link to static build of DuckDB

b705632

Remove printing from test

f404825

Add more tests for various collations taken from the unicode website

2bb31a5

Mytherin merged commit 8b912f6 into master Apr 27, 2020

Mytherin deleted the extension branch April 27, 2020 16:34
