rktio: support ICU as iconv alternative, e.g. for Windows#4844
rktio: support ICU as iconv alternative, e.g. for Windows#4844LiberalArtist wants to merge 27 commits into
Conversation
Just to be clear, it sounds from this that the minimum version of Windows required to run the official distributions of Racket available from |
That's my intention. We should definitely check that it works in practice, though! If we went this route with
Then building Racket in old environments would require a configuration change, but running it should not. |
|
I don't think we need to support compile-time configuration of ICU for Windows. It should just be always on in DLL mode and used as a fallback when iconv isn't available. (It looks like things are in place where we could avoid An |
That would certainly make things easier.
I initially thought this wouldn't work, but then I ended up getting halfway there anyway, and now I think it probably would work, so I will try filling in the rest and see how it goes. I think the biggest thing that had seemed problematic to me originally were comments pointing to https://unicode-org.atlassian.net/browse/ICU-12420 like the one deprecating I had also seen that there is some complicated logic allowing for alternate definitions of |
|
@LiberalArtist Did you decide that this wasn't worthwhile, or just lose interest, or just run out of time? I remain interested in this improvement for both Racket and Chez Scheme — though admittedly not enough to try anything myself, so far. |
|
I just ran out of time and haven't gotten back to it. I still think it's a viable approach. It is on my to-do stack to get back to, but that won't be for the next month or so, at least. If someone else wants to fill things in before then, so much the better! |
This clean-up was originally part of racket#4844
This clean-up was originally part of #4844
10a3ae7 to
82dae57
Compare
|
I started looking at this again and found I had actually done but not committed much of the work. Looking at the CI errors https://github.com/racket/racket/actions/runs/16229453902/job/45828973905?pr=4844#step:8:79: Are these cases that were expected to be errors but now work? I'm having trouble locating the tests. |
|
I found the file, I just haven't found the specific tests. For example, grepping that file for |
|
That in particular is from the expansion of err/rt-test. My guess is the relevant lines are around here: https://github.com/racket/racket/blob/master/pkgs/racket-test-core/tests/racket/portlib.rktl#L964 |
|
Some notes on my investigation so far: As @samth suggested, the failing tests are specifically: racket/pkgs/racket-test-core/tests/racket/portlib.rktl Lines 973 to 981 in 52e3304 The relevant parts boil down to this expected behavior, i.e. catching invalid encoding: > (bytes-convert (bytes-open-converter "UTF-8" "UTF-8") #"\xFF\xFF")
#""
0
'error(Note that One surprise is that this goes through the iconv/ICU path at all: Nonetheless, I think we do want to understand what's going on in these test failures before merging. I noticed that: > (bytes-convert (bytes-open-converter "UTF-8-permissive" "UTF-8") #"\xFF\xFF")
#"\357\277\275\357\277\275"
2
'completematching the unexpected result in the first test case. So it seems to me that we are currently calling ICU in permissive mode, and need to figure out how to not do that. |
|
Apparently ICU's default behavior is to emit a substitution character for an invalid sequence, but setting callbacks can override this. I've pushed a first attempt. |
(Also, rebased and resolved parallel-thread–related conflicts. This is a combination of 5 commits. This is the 1st commit message: [wip] rktio icu - progress before 2023-12-15 This is a combination of 20 commits. This is the 1st commit message: wip rktio icu This is the commit message #2: typo fix This is the commit message racket#3: hack for testing This is the commit message racket#4: fixup This is the commit message racket#5: fixup more This is the commit message racket#6: RKTIO_NO_ICU -> RKTIO_HAVE_ICU This is the commit message racket#7: fixup set_racket_error This is the commit message racket#8: autoconf This is the commit message racket#9: fixup iconv.m4 This is the commit message racket#10: attempt at winfig.bat for ICU This is the commit message racket#11: debug winfig.bat This is the commit message racket#12: another try at winfig.bat This is the commit message racket#13: tollerate "''" from top-level Makefile in build.bat This is the commit message racket#14: maybe it is "" This is the commit message racket#15: remove some obsolete files This is the commit message racket#16: winfig.bat: try ignoring all unknown args This is the commit message racket#17: debug This is the commit message racket#18: winfig.bat: fewer quotation marks This is the commit message racket#19: remove even more quotation marks This is the commit message racket#20: [wip] drop RKTIO_ICU_DLL; revamp initialization This is the commit message #2: [wip] fixup & uloc_getDefault() This is the commit message racket#3: fixup This is the commit message racket#4: fixup This is the commit message racket#5: set callbacks to stop on illegal/unmapped/invalid sequences Apparently, ICU's default behavior is to emit a substitution character.
5824db5 to
93f1be0
Compare
|
That took care of half of the test failures. What remains: Also, we should double-check that I resolved the merge conflicts from the new parallel threads correctly. |
Apparently the matching for the guaranteed combinations is case-sensitive, which is presumably why the tests use racket/racket/src/io/converter/main.rkt Lines 35 to 68 in 52e3304 |
|
While doing a build on Debian for debugging, I found that there are some uses of encodings called At some point in the process of building Racket, something tries to use In more detail, these are the related encodings recognized by ICU vs. GNU libiconv built with The Unicode Standard, Version 17.0.0, Appendix C§2, “Encoding Forms in ISO/IEC 10646” says:
Furthermore, neither the IANA Character Sets Registry nor the WHATWG Encoding Standard recognizes endianness-specific variants of UCS-4. (WHATWG doesn't recognize any variant of UCS-4 as a name.) In brief searching, I can see a few, perhaps older, pieces of software that mention the endianness-specific variants, but only a few, and (while I haven't read it) I'm guessing they may not actually have been defined by ISO 10646. The only reason I can guess for GNU libiconv to treat My conclusion is that we ought to just change to The existence of this code reinforces my thinking from LiberalArtist@bbd48c7 that we ought to support all of the Unicode variants without relying on an iconv implementation. But I'm not intending to look at that until after ICU support is done, because I don't want to mask any errors. |
|
These equivalent expressions: with ICU both evaluate to: but with iconv evaluate to: I used DrRacket's step debugger in an and found that it got to this code: racket/racket/collects/racket/port.rkt Lines 2014 to 2034 in 20b2596 with: so, at line 2022, The test reaches line 2022 twice, then gets to this block of code: racket/racket/collects/racket/port.rkt Lines 1758 to 1771 in 20b2596 with (= start end) and (buffer-flushed?) both returning #t, so it returns 0 at line 1770.
There doesn't seem to be any path in the code that leads to an exception when there persistently is I'm not sure why the ICU and iconv backends are returning different second results from Note that using capitalized
|
|
Re parallel threads, as long as this still holds: racket/racket/src/rktio/rktio.h Lines 72 to 76 in 20b2596 it looks like it should still be safe to initialize static variables in the first rktio_init().
|
|
The docs for
That is less than clear about whether "the last
But my understanding is that the byte |
|
For another case, consider (bytes-append #"abc"
(let ([bs (string->bytes/utf-8 "🎜")])
(subbytes bs 0 (/ (bytes-length bs) 2)))
#"\xFF")i.e., chopping off the second half of The iconv backend counts only the three bytes that were successfully converted, whereas the ICU backend counts the 5 bytes that could be part of valid UTF-8, but not, in this case, the bad It looks like the lower-level functions |
|
I've put together a little C program to verbosely explore the various input/output pointers and results: https://gist.github.com/LiberalArtist/3f32e8c0b4d9a7db2a23a0cafe444d12 |
|
I found some comments about the state during error callbacks and where the various pointers are pointing: |
|
We now have a different test failure: ((portlib) ((EXN #(struct:exn:fail:contract "user port peek: contract violation\n expected: \n (or/c exact-nonnegative-integer? eof-object? evt? pipe-input-port? #f procedure?)\n received: #<void>" #<continuation-mark-set>)) #f (#<procedure:regexp-match?> #px#"temple" #<input-port:string>)))The test in question is here: racket/pkgs/racket-test-core/tests/racket/portlib.rktl Lines 983 to 1004 in 20b2596 But the contract violation is from an internal function not directly related to byte conversion. I'm not sure how anything I changed could have affected that. Also, the previous test failures I've been able to reproduce in a build on Debian with --enable-icu, but this test passes on my machine.
|
Status UpdateI put some time into this last month, inspired by RacketCon, with the thought that 9.0 would have been an auspicious time for the switch, but there is more to do. Aside from tracking down the test failure in #4844 (comment), there is a bigger-picture issue I've been thinking about. ICU's converters are stateful. In particular, they use a pivot buffer containing an internal representation (usually UTF-16 for historical reasons, but IIUC there are fast paths for UTF-8) when converting from a source encoding to a target encoding. This part of the documentation is relevant:
Thus, from ICU's perspective, there may exist bytes that have been consumed from the source but not yet written to the target. Initially, I thought this would be trivial: after all, in the general case, correctly using the byte converter API requires using On reflection, though, the qualifier “in the general case” seems significant. The
By implication, a user who knows that the specific encodings in some application do not use shift sequences would seem justified in never calling On further investigation, the requirement seems to follow from the POSIX
This part of the POSIX spec explicitly requires that input be counted as consumed only when corresponding output has been produced, and it seems to ban any state other than shift sequences in conversion descriptors:
Reluctantly, I am coming to the conclusion that we shouldn't use It definitely is still possible to use ICU to implement the In the worst-case scenario, we could use preflighting mode, or possibly character mode. Ideally, though, we would want to stay within buffered or streamed mode. I'm still trying to figure out how to do that. This mailing list thread is one part I want to better understand. The documentation for A lingering concern, though, is that some of the suggestions seem to work like this: Imagine an identity conversion, The problem I see with that approach is that, in the second call, there is a requirement that the input must start with Another approach could use |
| static void init_icu() | ||
| { | ||
| /* Called from rktio_convert_init(rktio), which is called from rktio_init(). | ||
| The ordering requirement on the first call to rktio_init() guarantees | ||
| exclusive access to static variables here. | ||
| */ |
There was a problem hiding this comment.
@mflatt Is this still true after the changes for parallel threads?
There was a problem hiding this comment.
Yes. The first call to rktio_init() has to compete before any other call to rktio_init() per rktio's rules.
This PR aims to support ICU as an alternative to
iconvin rktio to implement byte converters.Windows does not have iconv, so Racket currently distributes
libiconv-2.dllin platform-specific packages likeracket-win32-x86_64-3that are part of even the minimal Racket distribution. Building and distributing third-party C libraries is no fun. However, while looking into #4277, I discovered that Windows has provided ICU as a supported system DLL since the Windows 10 Creators Update (1703), released in 2017. This appears to mean that all Windows versions in "mainstream support" from Microsoft include ICU.1My hope is that by using the OS-provided ICU, we can stop distributing
libiconv-2.dllfor minimal Racket on Windows.The functionality may be useful in other contexts, too: for example, it seems Android has either no iconv or a very rudimentary iconv (vid. e.g. 8a08704), but Android provides ICU as a system library.
I intend to contribute a similar patch to Chez Scheme, but I've found it easier to start with rktio.
Open Design Questions
I have not added any support to
configureyet. I think the behavior should be:AC_CHECK_HEADER([icu.h])finds it. To support building in older environments, don't build ICU support if the header is not found. To support running on older Windows versions, dynamically loadicu.dllwithLoadLibraryW/GetProcAddress. This means no linking flags are needed.AX_CHECK_ICUor by usingpkg-configdirectly.But native Windows builds do not use Autoconf, and I'm not sure how to do the equivalent of
AC_CHECK_HEADER([icu.h]). (I cannot overstate the extent to which I am not a Windows person—perhaps ironically, I have only tested this so far on GNU/Linux.) The simplest thing to do would probably be to omit the check and require an explicit--disable-icu(/disableICU?) to build in old Windows environments.If both iconv and ICU are enabled and found, currently ICU is used. That's certainly debatable. On non-Windows platforms, it seems to make sense because enabling ICU would be an explicit choice. On Windows, one consideration is that an iconv DLL might (sometimes?) end up being installed as a dependency of
racket/draw(see racket/libs@8ba7a87) or for some other reason, and having that change the behavior ofbytes-open-converterseems worth avoiding.Currently, this change is entirely contained within the C code of rktio, with no change to the interface exposed to Racket. I could imagine ways of exposing more. The possibility I'd most strongly consider is adding a function to report whether iconv or ICU is being used. If there were a reason to want such a thing, it would be possible to expose the new
rktio_iconv_converter_openandrktio_icu_converter_openso they could both be used if both are available. If there were a desire to supportdlopening ICU on non-Windows platforms, that might be a reason to drive more from the Racket side.There might be a better
ICU_BUF_SIZE.If we want to stop distributing
libiconv-2.dllfor minimal Racket, there would need to be changes to the native library packages, which I have not attempted. That can (and probably should!) be done as a second step, though.As a compatibility consideration, I looked into the encoding names recognized by ICU compared to GNU libiconv. (The documentation is already clear that the supported encodings are system-dependent, so this is an extra level of scrutiny.) Both ICU and GNU libiconv recognize multiple aliases for some encodings. For most of the encodings supported by GNU libiconv (with
--enable-extra-encodings), at least one alias is also listed by ICU. There are some encodings for which that is not the case, though. I suspect these encodings are also supported by ICU under different names, and a sufficiently motivated person could fill in a translation table. Some of them might even be recognized by the matching functions in the ICU API: I only checked for exact matches.Footnotes
As explained in a comment in the code, in this PR I'm actually targeting Windows 10 version 1903, which avoids some per-thread initialization needed in earlier versions. The only earlier version of Windows still in mainstream support appears to be version 1809 (Enterprise LTSC 2019), for which mainstream support ends in January. ↩