Skip to content

rktio: support ICU as iconv alternative, e.g. for Windows#4844

Draft
LiberalArtist wants to merge 27 commits into
racket:masterfrom
LiberalArtist:iconv-via-icu
Draft

rktio: support ICU as iconv alternative, e.g. for Windows#4844
LiberalArtist wants to merge 27 commits into
racket:masterfrom
LiberalArtist:iconv-via-icu

Conversation

@LiberalArtist

Copy link
Copy Markdown
Contributor

This PR aims to support ICU as an alternative to iconv in rktio to implement byte converters.

Windows does not have iconv, so Racket currently distributes libiconv-2.dll in platform-specific packages like racket-win32-x86_64-3 that are part of even the minimal Racket distribution. Building and distributing third-party C libraries is no fun. However, while looking into #4277, I discovered that Windows has provided ICU as a supported system DLL since the Windows 10 Creators Update (1703), released in 2017. This appears to mean that all Windows versions in "mainstream support" from Microsoft include ICU.1

My hope is that by using the OS-provided ICU, we can stop distributing libiconv-2.dll for minimal Racket on Windows.

The functionality may be useful in other contexts, too: for example, it seems Android has either no iconv or a very rudimentary iconv (vid. e.g. 8a08704), but Android provides ICU as a system library.

I intend to contribute a similar patch to Chez Scheme, but I've found it easier to start with rktio.

Open Design Questions

  1. I have not added any support to configure yet. I think the behavior should be:

    • On Windows, automatically enable ICU if AC_CHECK_HEADER([icu.h]) finds it. To support building in older environments, don't build ICU support if the header is not found. To support running on older Windows versions, dynamically load icu.dll with LoadLibraryW/GetProcAddress. This means no linking flags are needed.
    • On other platforms, enable ICU only if specifically requested. If it is requested, find the appropriate flags either with AX_CHECK_ICU or by using pkg-config directly.

    But native Windows builds do not use Autoconf, and I'm not sure how to do the equivalent of AC_CHECK_HEADER([icu.h]). (I cannot overstate the extent to which I am not a Windows person—perhaps ironically, I have only tested this so far on GNU/Linux.) The simplest thing to do would probably be to omit the check and require an explicit --disable-icu (/disableICU?) to build in old Windows environments.

  2. If both iconv and ICU are enabled and found, currently ICU is used. That's certainly debatable. On non-Windows platforms, it seems to make sense because enabling ICU would be an explicit choice. On Windows, one consideration is that an iconv DLL might (sometimes?) end up being installed as a dependency of racket/draw (see racket/libs@8ba7a87) or for some other reason, and having that change the behavior of bytes-open-converter seems worth avoiding.

  3. Currently, this change is entirely contained within the C code of rktio, with no change to the interface exposed to Racket. I could imagine ways of exposing more. The possibility I'd most strongly consider is adding a function to report whether iconv or ICU is being used. If there were a reason to want such a thing, it would be possible to expose the new rktio_iconv_converter_open and rktio_icu_converter_open so they could both be used if both are available. If there were a desire to support dlopening ICU on non-Windows platforms, that might be a reason to drive more from the Racket side.

  4. There might be a better ICU_BUF_SIZE.

  5. If we want to stop distributing libiconv-2.dll for minimal Racket, there would need to be changes to the native library packages, which I have not attempted. That can (and probably should!) be done as a second step, though.

    As a compatibility consideration, I looked into the encoding names recognized by ICU compared to GNU libiconv. (The documentation is already clear that the supported encodings are system-dependent, so this is an extra level of scrutiny.) Both ICU and GNU libiconv recognize multiple aliases for some encodings. For most of the encodings supported by GNU libiconv (with --enable-extra-encodings), at least one alias is also listed by ICU. There are some encodings for which that is not the case, though. I suspect these encodings are also supported by ICU under different names, and a sufficiently motivated person could fill in a translation table. Some of them might even be recognized by the matching functions in the ICU API: I only checked for exact matches.

Footnotes

  1. As explained in a comment in the code, in this PR I'm actually targeting Windows 10 version 1903, which avoids some per-thread initialization needed in earlier versions. The only earlier version of Windows still in mainstream support appears to be version 1809 (Enterprise LTSC 2019), for which mainstream support ends in January.

@otherjoel

Copy link
Copy Markdown
Contributor
  • On Windows, automatically enable ICU if AC_CHECK_HEADER([icu.h]) finds it. To support building in older environments, don't build ICU support if the header is not found. To support running on older Windows versions, dynamically load icu.dll with LoadLibraryW/GetProcAddress. This means no linking flags are needed.

Just to be clear, it sounds from this that the minimum version of Windows required to run the official distributions of Racket available from download.racket-lang.org would not increase? (Currently that minimum version is given as Windows 7, although for 8.8 and 8.9 it was given as Windows 10. There was some discussion in #4542 about what versions of Windows are actually supported.)

@LiberalArtist

Copy link
Copy Markdown
Contributor Author
  • On Windows, automatically enable ICU if AC_CHECK_HEADER([icu.h]) finds it. To support building in older environments, don't build ICU support if the header is not found. To support running on older Windows versions, dynamically load icu.dll with LoadLibraryW/GetProcAddress. This means no linking flags are needed.

Just to be clear, it sounds from this that the minimum version of Windows required to run the official distributions of Racket available from download.racket-lang.org would not increase? (Currently that minimum version is given as Windows 7, although for 8.8 and 8.9 it was given as Windows 10. There was some discussion in #4542 about what versions of Windows are actually supported.)

That's my intention. We should definitely check that it works in practice, though!

If we went this route with configure:

But native Windows builds do not use Autoconf, and I'm not sure how to do the equivalent of AC_CHECK_HEADER([icu.h]). (I cannot overstate the extent to which I am not a Windows person—perhaps ironically, I have only tested this so far on GNU/Linux.) The simplest thing to do would probably be to omit the check and require an explicit --disable-icu (/disableICU?) to build in old Windows environments.

Then building Racket in old environments would require a configuration change, but running it should not.

@mflatt

mflatt commented Dec 7, 2023

Copy link
Copy Markdown
Member

I don't think we need to support compile-time configuration of ICU for Windows. It should just be always on in DLL mode and used as a fallback when iconv isn't available. (It looks like things are in place where we could avoid #includeing icu.h on Windows even if it's available, but I'm not sure I'm reading things right.) So, --enable-icu-dll could go away and winfig.bat wouldn't need to change. Maybe one day we'll switch the default from iconv to ICU.

An --enable-icu for other platforms makes sense, and the various configure improvements there look good.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

I don't think we need to support compile-time configuration of ICU for Windows. … So, --enable-icu-dll could go away and winfig.bat wouldn't need to change. Maybe one day we'll switch the default from iconv to ICU.

That would certainly make things easier.

(It looks like things are in place where we could avoid #includeing icu.h on Windows even if it's available, but I'm not sure I'm reading things right.)

I initially thought this wouldn't work, but then I ended up getting halfway there anyway, and now I think it probably would work, so I will try filling in the rest and see how it goes.

I think the biggest thing that had seemed problematic to me originally were comments pointing to https://unicode-org.atlassian.net/browse/ICU-12420 like the one deprecating U_STANDARD_ERROR_LIMIT because The numeric value may change over time. Now, though, I think that issue is actually specific to constants that were meant to mark the end of open-ended lists. The numeric values of the actual error codes we need to test, including for U_SUCCESS and U_FAILURE, do seem to be covered by ICU's ABI stability.

I had also seen that there is some complicated logic allowing for alternate definitions of UChar. I haven't fully understood it, but it seems like it is for niche situations that wouldn't apply here.

@mflatt

mflatt commented Dec 6, 2024

Copy link
Copy Markdown
Member

@LiberalArtist Did you decide that this wasn't worthwhile, or just lose interest, or just run out of time?

I remain interested in this improvement for both Racket and Chez Scheme — though admittedly not enough to try anything myself, so far.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

I just ran out of time and haven't gotten back to it. I still think it's a viable approach. It is on my to-do stack to get back to, but that won't be for the next month or so, at least. If someone else wants to fill things in before then, so much the better!

mflatt added a commit to mflatt/racket that referenced this pull request Jun 26, 2025
This clean-up was originally part of racket#4844
mflatt added a commit that referenced this pull request Jun 28, 2025
This clean-up was originally part of #4844
@LiberalArtist

Copy link
Copy Markdown
Contributor Author

I started looking at this again and found I had actually done but not committed much of the work.

Looking at the CI errors https://github.com/racket/racket/actions/runs/16229453902/job/45828973905?pr=4844#step:8:79:

Errors were [4]:
(Section (got expected (call)))
((portlib) ((#"\357\277\275\357\277\275") Error (port->bytes (reencode-input-port (open-input-bytes #"\377\377") "utf-8"))))
((portlib) ((not-an-error) Error (let () (port->bytes (reencode-input-port (open-input-bytes #"\377\377") "utf-8")) (quote not-an-error))))
((portlib) ((#<void>) Error (let ((o (reencode-output-port (open-output-bytes) "utf-8"))) (display #"\377\377" o) (flush-output o))))
((portlib) ((not-an-error) Error (let () (let ((o (reencode-output-port (open-output-bytes) "utf-8"))) (display #"\377\377" o) (flush-output o)) (quote not-an-error))))

Are these cases that were expected to be errors but now work? I'm having trouble locating the tests.

@samth

samth commented Jul 11, 2025

Copy link
Copy Markdown
Member

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

I found the file, I just haven't found the specific tests. For example, grepping that file for not-an-error doesn't find any results.

@samth

samth commented Jul 11, 2025

Copy link
Copy Markdown
Member

That in particular is from the expansion of err/rt-test. My guess is the relevant lines are around here: https://github.com/racket/racket/blob/master/pkgs/racket-test-core/tests/racket/portlib.rktl#L964

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

Some notes on my investigation so far:

As @samth suggested, the failing tests are specifically:

(err/rt-test
(port->bytes (reencode-input-port (open-input-bytes #"\xFF\xFF") "utf-8"))
(lambda (exn)
(regexp-match? #rx"^reencode-input-port:" (exn-message exn))))
(err/rt-test
(let ([o (reencode-output-port (open-output-bytes) "utf-8")])
(display #"\xFF\xFF" o) (flush-output o))
(lambda (exn)
(regexp-match? #rx"^reencode-output-port:" (exn-message exn))))

The relevant parts boil down to this expected behavior, i.e. catching invalid encoding:

> (bytes-convert (bytes-open-converter "UTF-8" "UTF-8") #"\xFF\xFF")
#""
0
'error

(Note that #"\377\377" is the way the printer writes #"\xFF\xFF".)

One surprise is that this goes through the iconv/ICU path at all: (bytes-open-converter "UTF-8" "UTF-8") is one of the guaranteed combinations, even when no iconv/ICU library is available, so there is built-in support for this somewhere. Does the implementation switch to use always use the foreign library if present, even for guaranteed combinations? Something to investigate further.

Nonetheless, I think we do want to understand what's going on in these test failures before merging.

I noticed that:

> (bytes-convert (bytes-open-converter "UTF-8-permissive" "UTF-8") #"\xFF\xFF")
#"\357\277\275\357\277\275"
2
'complete

matching the unexpected result in the first test case.

So it seems to me that we are currently calling ICU in permissive mode, and need to figure out how to not do that.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

Apparently ICU's default behavior is to emit a substitution character for an invalid sequence, but setting callbacks can override this. I've pushed a first attempt.

(Also, rebased and resolved parallel-thread–related conflicts.

 This is a combination of 5 commits.
 This is the 1st commit message:

[wip] rktio icu - progress before 2023-12-15

 This is a combination of 20 commits.
 This is the 1st commit message:

wip rktio icu

 This is the commit message #2:

typo fix

 This is the commit message racket#3:

hack for testing

 This is the commit message racket#4:

fixup

 This is the commit message racket#5:

fixup more

 This is the commit message racket#6:

RKTIO_NO_ICU -> RKTIO_HAVE_ICU

 This is the commit message racket#7:

fixup set_racket_error

 This is the commit message racket#8:

autoconf

 This is the commit message racket#9:

fixup iconv.m4

 This is the commit message racket#10:

attempt at winfig.bat for ICU

 This is the commit message racket#11:

debug winfig.bat

 This is the commit message racket#12:

another try at winfig.bat

 This is the commit message racket#13:

tollerate "''" from top-level Makefile in build.bat

 This is the commit message racket#14:

maybe it is ""

 This is the commit message racket#15:

remove some obsolete files

 This is the commit message racket#16:

winfig.bat: try ignoring all unknown args

 This is the commit message racket#17:

debug

 This is the commit message racket#18:

winfig.bat: fewer quotation marks

 This is the commit message racket#19:

remove even more quotation marks

 This is the commit message racket#20:

[wip] drop RKTIO_ICU_DLL; revamp initialization

 This is the commit message #2:

[wip] fixup & uloc_getDefault()

 This is the commit message racket#3:

fixup

 This is the commit message racket#4:

fixup

 This is the commit message racket#5:

set callbacks to stop on illegal/unmapped/invalid sequences

Apparently, ICU's default behavior is to emit a substitution character.
@LiberalArtist

Copy link
Copy Markdown
Contributor Author

That took care of half of the test failures. What remains:

Errors were [2]:
(Section (got expected (call)))
((portlib) ((#<void>) Error (let ((o (reencode-output-port (open-output-bytes) "utf-8"))) (display #"\377\377" o) (flush-output o))))
((portlib) ((not-an-error) Error (let () (let ((o (reencode-output-port (open-output-bytes) "utf-8"))) (display #"\377\377" o) (flush-output o)) (quote not-an-error))))

Also, we should double-check that I resolved the merge conflicts from the new parallel threads correctly.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

One surprise is that this goes through the iconv/ICU path at all: (bytes-open-converter "UTF-8" "UTF-8") is one of the guaranteed combinations, even when no iconv/ICU library is available, so there is built-in support for this somewhere. Does the implementation switch to use always use the foreign library if present, even for guaranteed combinations? Something to investigate further.

Apparently the matching for the guaranteed combinations is case-sensitive, which is presumably why the tests use "utf-8":

[(and (string=? from-str "UTF-8") (string=? to-str "UTF-8"))
(bytes-converter (utf-8-converter 'utf-8 'utf-8)
#f)]
[(and (string=? from-str "UTF-8-permissive") (string=? to-str "UTF-8"))
(bytes-converter (utf-8-converter 'utf-8-permissive 'utf-8)
#f)]
[(and (string=? from-str "platform-UTF-8") (string=? to-str "platform-UTF-16"))
(bytes-converter (utf-8-converter platform-utf-8 platform-utf-16)
#f)]
[(and (string=? from-str "platform-UTF-8-permissive") (string=? to-str "platform-UTF-16"))
(bytes-converter (utf-8-converter platform-utf-8-permissive platform-utf-16)
#f)]
[(and (string=? from-str "platform-UTF-16") (string=? to-str "platform-UTF-8"))
(bytes-converter (utf-8-converter platform-utf-16 platform-utf-8)
#f)]
;; WTF-16 is similar to UTF-16, but allows unpaired surrogates --- which is still
;; different from UCS-2, since paired surrogates are decoded as in UTF-16.
;; WTF-8 is the analogous extension of UTF-8, where a surrogate pair encoded
;; as a sequence of unpaired surrogates is specifically disallowed.
[(and (string=? from-str "WTF-8") (string=? to-str "WTF-16"))
(bytes-converter (utf-8-converter 'wtf-8 'wtf-16)
#f)]
[(and (string=? from-str "WTF-8-permissive") (string=? to-str "WTF-16"))
(bytes-converter (utf-8-converter 'wtf-8-permissive 'wtf-16)
#f)]
[(and (string=? from-str "WTF-16") (string=? to-str "WTF-8"))
(bytes-converter (utf-8-converter 'wtf-16 'wtf-8)
#f)]
[(and (or (and (string=? from-str "UTF-8") (string=? to-str ""))
(and (string=? from-str "") (string=? to-str "UTF-8")))
(locale-encoding-is-utf-8?))
(bytes-converter (utf-8-converter 'utf-8 'utf-8)
#f)]
[else

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

While doing a build on Debian for debugging, I found that there are some uses of encodings called UCS-4LE or UCS-4BE. It looks like they are defined in the same place as the implementation of _string/ucs-4, but not actually used by it, at least on CS. The specific places are the ones changed in LiberalArtist@bbd48c7.

At some point in the process of building Racket, something tries to use bytes-open-converter to convert between UTF-8 and UCS-4{L,B}E (the one corresponding to the native endianness). This is a problem, because ICU doesn't recognize either of those as encoding names.

In more detail, these are the related encodings recognized by ICU vs. GNU libiconv built with --enable-extra-encodings. Names listed on the same line are aliases, and matching is case-insensitive (and flexible in other details I don't recall off-hand):

;; ICU
UTF-32 ISO-10646-UCS-4 ibm-1236 ibm-1237 csUCS4 ucs-4 
UTF-32BE UTF32_BigEndian ibm-1232 ibm-1233 ibm-9424 
UTF-32LE UTF32_LittleEndian ibm-1234 ibm-1235 
UTF32_PlatformEndian 
UTF32_OppositeEndian
;; GNU libiconv w/ --enable-extra-encodings
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-32
UTF-32BE
UTF-32LE
UCS-4-INTERNAL
UCS-4-SWAPPED

The Unicode Standard, Version 17.0.0, Appendix C§2, “Encoding Forms in ISO/IEC 10646” says:

ISO/IEC 10646:2011 has significantly revised its discussion of encoding forms, compared to earlier editions of that standard. The terminology for encoding forms (and encoding schemes) in 10646 now matches exactly the terminology used in the Unicode Standard. Furthermore, 10646 is now described in terms of a codespace U+0000..U+10FFFF, instead of a 31-bit codespace, as in earlier editions. This convergence in codespace description has eliminated any discrepancies in possible interpretation of the numeric values greater than 0x10FFFF. As a result, this section now merely notes a few items of mostly historic interest regarding encoding forms and terminology.

UCS-4. UCS-4 stands for “Universal Character Set coded in 4 octets.” It is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in 10646.

Furthermore, neither the IANA Character Sets Registry nor the WHATWG Encoding Standard recognizes endianness-specific variants of UCS-4. (WHATWG doesn't recognize any variant of UCS-4 as a name.) In brief searching, I can see a few, perhaps older, pieces of software that mention the endianness-specific variants, but only a few, and (while I haven't read it) I'm guessing they may not actually have been defined by ISO 10646.

The only reason I can guess for GNU libiconv to treat UTF-32 and UCS-4 as distinct encodings, rather than aliases, is that it might support, or once have supported, older versions of ISO 10646 in which only a subset of UTF-32 was valid UCS-4, and it could have raised errors if anything out of that subset was used.

My conclusion is that we ought to just change to UTF-32LE and UTF-32BE. That's what I've done in LiberalArtist@bbd48c7 (part of this PR, likely to be squashed eventually). I haven't done it yet, but I also think we should update the docs for _string/ucs-4 and _string/utf-16, neither of which say anything about endianness or the lack of a BOM.

The existence of this code reinforces my thinking from LiberalArtist@bbd48c7 that we ought to support all of the Unicode variants without relying on an iconv implementation. But I'm not intending to look at that until after ICU support is done, because I don't want to mask any errors.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

These equivalent expressions:

(bytes-convert (bytes-open-converter "utf-8" "UTF-8") #"\xFF\xFF")
(bytes-convert (bytes-open-converter "UTF-8" "utf-8") #"\xFF\xFF")

with ICU both evaluate to:

#""
1
'error

but with iconv evaluate to:

#""
0
'error

I used DrRacket's step debugger in an --enable-icu build on the failing test:

(let ([o (reencode-output-port (open-output-bytes) "utf-8")]) 
  (display #"\xFF\xFF" o)
  (flush-output o))

and found that it got to this code:

(let-values ([(got-c used-c status)
(bytes-convert c out-bytes out-start out-end
ready-bytes)])
(set! ready-end got-c)
(set! out-start (+ out-start used-c))
(when (and (eq? status 'continues) (zero? used-c))
;; Yikes! Size of ready-bytes isn't enough room for progress!?
(raise-insane-decoding-length))
(when (and (eq? status 'error) (zero? used-c))
;; No progress before an encoding error.
(if error-bytes
;; Write error bytes and drop an output byte:
(begin (set! out-start (add1 out-start))
(bytes-copy! ready-bytes 0 error-bytes)
(set! ready-end (bytes-length error-bytes)))
;; Raise an exception:
(begin
(set! out-start out-end) ;; flush buffer so close can work
(decode-error
"error decoding output to stream"
port))))))))

with:

got-c => 0
used-c => 1
status => error

so, at line 2022, (zero? used-c) is #f, and the block labeled No progress before an encoding error. never executes.

The test reaches line 2022 twice, then gets to this block of code:

;; The main writing entry point:
(define (write-it s start end no-buffer&block? enable-break?)
(cond
[(= start end)
;; This is a flush request; no-buffer&block? must be #f
;; Note: we could get stuck because only half an encoding
;; is available in out-bytes.
(flush-buffer-pipe #f enable-break?)
(flush-some #f enable-break?)
(if (buffer-flushed?)
(begin
(buffering! #f)
0)
(write-it s start end no-buffer&block? enable-break?))]

with (= start end) and (buffer-flushed?) both returning #t, so it returns 0 at line 1770.


There doesn't seem to be any path in the code that leads to an exception when there persistently is progress before an encoding error.

I'm not sure why the ICU and iconv backends are returning different second results from (bytes-convert (bytes-open-converter "utf-8" "UTF-8") #"\xFF\xFF"). It's very possible that there could be something wrong in my C code, but I can also imagine that the iconv API may not make a portable guarantee about this. I will investigate further, though I'd be glad for any suggestions.

Note that using capitalized "UTF-8" in both cases does in fact get Racket's built-in implementation, so this happens even with --enable-icu:
``

(bytes-convert (bytes-open-converter "UTF-8" "UTF-8") #"\xFF\xFF")
#""
0
'error

After ICU support is in place, this further suggests to me that we should support all of the Unicode variants without relying on a foreign library.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

Re parallel threads, as long as this still holds:

- A given `rktio_t` can be used from only one thread at a time,
except as specified for some functions. Otherwise, as long as the
initial call to `rktio_init` returns before a second call,
different `rktio_t` values can be used freely from different
threads.

it looks like it should still be safe to initialize static variables in the first rktio_init().

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

The docs for ucnv_convertEx() say that, on output:

*source points to the first unit after the last char read

That is less than clear about whether "the last char read" is a char read from the erroneous character, or if it's the last char successfully read. However, an article by JeanHeyd Meneide (the editor of the ISO C standard) says:

An even better version of this updates the pointer only if the operation was successful, rather than just doing it willy-nilly. This is the way ICU works, and it is incredibly helpful when *pErrorCode is not set to U_SUCCESS. It allows you to insert replacement characters and/or skip over the problematic input if needed, allowing for greater control over how errors are handled.

But my understanding is that the byte #"\xFF" cannot appear anywhere in UTF-8, so I don't see how it can possibly have been successfully read.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

For another case, consider (bytes-convert (bytes-open-converter "UTF-8" "utf-8") #"abc\xF0\x9F\xFF"). That byte string comes from:

(bytes-append #"abc"
              (let ([bs (string->bytes/utf-8 "🎜")])
                (subbytes bs 0 (/ (bytes-length bs) 2)))
              #"\xFF")

i.e., chopping off the second half of #\🎜 in UTF-8 and replacing it with the invalid #"\xFF".
The results we get are:

;; ICU
#"abc"
5
'error
;; iconv
#"abc"
3
'error

The iconv backend counts only the three bytes that were successfully converted, whereas the ICU backend counts the 5 bytes that could be part of valid UTF-8, but not, in this case, the bad #"\xFF" (why?).

It looks like the lower-level functions ucnv_fromUnicode() and ucnv_toUnicode() provide enough granular offset information that this should be implementable, but I really thought ucnv_convertEx was supposed to have this capability.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

I've put together a little C program to verbosely explore the various input/output pointers and results: https://gist.github.com/LiberalArtist/3f32e8c0b4d9a7db2a23a0cafe444d12

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

I found some comments about the state during error callbacks and where the various pointers are pointing:
https://github.com/unicode-org/icu/blob/4974e9b262366fe5b16eb10293c35172d55ce2a9/icu4c/source/common/unicode/ucnv_cb.h#L29-L59
I think our problem will be solvable by supplying custom callbacks.

@LiberalArtist

Copy link
Copy Markdown
Contributor Author

We now have a different test failure:

((portlib) ((EXN #(struct:exn:fail:contract "user port peek: contract violation\n  expected: \n   (or/c exact-nonnegative-integer? eof-object? evt? pipe-input-port? #f procedure?)\n  received: #<void>" #<continuation-mark-set>)) #f (#<procedure:regexp-match?> #px#"temple" #<input-port:string>)))

The test in question is here:

;; Check that slow input stream is handled correctly:
(let ()
(define poem (bytes-append #"\307\256\314\306\272\376\264\272\320\320\n\n\271\302\311\275\313\302\261\261"
#"\274\326\315\244\316\367\n\313\256\303\346\263\365\306\275\324\306\275\305\265"
#"\315\n\274\270\264\246\324\347\335\272\325\371\305\257\312\367\n\313\255\274"
#"\322\320\302\321\340\327\304\264\272\304\340\n\302\322\273\250\275\245\323\373"
#"\303\324\310\313\321\333\n\307\263\262\335\262\305\304\334\303\273\302\355\314"
#"\343\n\327\356\260\256\272\376\266\253\320\320\262\273\327\343\n\302\314\321\356"
#"\322\365\300\357\260\327\311\263\265\314\n"))
(define (slower p)
(make-input-port
(object-name p)
(lambda (bstr)
(if (zero? (random 2))
(wrap-evt (alarm-evt (+ (current-inexact-monotonic-milliseconds) 5) #t)
(lambda (v) 0))
(read-bytes-avail! bstr p)))
#f
(lambda () (void))))
(define i (reencode-input-port (slower (open-input-bytes poem)) "gbk" #"?"))
(test #f regexp-match? #px#"temple" i)
(close-input-port i))

But the contract violation is from an internal function not directly related to byte conversion. I'm not sure how anything I changed could have affected that. Also, the previous test failures I've been able to reproduce in a build on Debian with --enable-icu, but this test passes on my machine.

@LiberalArtist

LiberalArtist commented Nov 27, 2025

Copy link
Copy Markdown
Contributor Author

Status Update

I put some time into this last month, inspired by RacketCon, with the thought that 9.0 would have been an auspicious time for the switch, but there is more to do. Aside from tracking down the test failure in #4844 (comment), there is a bigger-picture issue I've been thinking about.

ICU's converters are stateful. In particular, they use a pivot buffer containing an internal representation (usually UTF-16 for historical reasons, but IIUC there are fast paths for UTF-8) when converting from a source encoding to a target encoding. This part of the documentation is relevant:

The converters always consume the source buffer as far as possible, and advance the source pointer.

The converters write to the target all converted output as far as possible, and then write any remaining output to the internal services buffer. When the conversion routines are called again, the internal buffer is flushed out and written to the target buffer before proceeding with any further conversion.

Thus, from ICU's perspective, there may exist bytes that have been consumed from the source but not yet written to the target.

Initially, I thought this would be trivial: after all, in the general case, correctly using the byte converter API requires using bytes-convert-end to potentially get more output after all input has been consumed.

On reflection, though, the qualifier “in the general case” seems significant. The bytes-convert-end docs say:

instead of converting bytes, this procedure generates an ending sequence for the conversion (sometimes called a “shift sequence”), if any. Few encodings use shift sequences, so this function will succeed with no output for most encodings. In any case, successful output of a (possibly empty) shift sequence resets the converter to its initial state.

By implication, a user who knows that the specific encodings in some application do not use shift sequences would seem justified in never calling bytes-convert-end. In other words, bytes-convert-end is not allowed to emit anything other than a (possibly empty) shift sequence.

On further investigation, the requirement seems to follow from the POSIX iconv specification, which says:

For state-dependent encodings, the conversion descriptor cd is placed into its initial shift state by a call for which inbuf is a null pointer, or for which inbuf points to a null pointer. When iconv() is called in this way, and if outbuf is not a null pointer or a pointer to a null pointer, and outbytesleft points to a positive value, iconv() shall place, into the output buffer, the byte sequence to change the output buffer to its initial shift state.

This part of the POSIX spec explicitly requires that input be counted as consumed only when corresponding output has been produced, and it seems to ban any state other than shift sequences in conversion descriptors:

If the output buffer is not large enough to hold the entire converted input, conversion shall stop just prior to the input bytes that would cause the output buffer to overflow. The variable pointed to by inbuf shall be updated to point to the byte following the last byte successfully used in the conversion. The value pointed to by inbytesleft shall be decremented to reflect the number of bytes still not converted in the input buffer. … For state-dependent encodings, the conversion descriptor shall be updated to reflect the shift state in effect at the end of the last successfully converted byte sequence.

Reluctantly, I am coming to the conclusion that we shouldn't use bytes-convert-end to flush an ICU converter's pivot buffer. While hypothetically we could change the specification of the Racket API, and existing uses that correctly handle the general case would continue to be correct, the whole point of this exercise in supporting niche non-Unicode encodings is backward compatibility.

It definitely is still possible to use ICU to implement the iconv API.

In the worst-case scenario, we could use preflighting mode, or possibly character mode.

Ideally, though, we would want to stay within buffered or streamed mode. I'm still trying to figure out how to do that.

This mailing list thread is one part I want to better understand. The documentation for ucnv_fromUCountPending() and ucnv_toUCountPending() says they are “useful for mapping semantics of ICU's converter interface to those of iconv.”

A lingering concern, though, is that some of the suggestions seem to work like this: Imagine an identity conversion, UTF-8 to UTF-8, with abcdefg as the initial input, an empty two-byte target buffer, and an empty two-byte pivot buffer. After the first call, ab will be in the target buffer, cd will be in the pivot buffer, and, from ICU's perspective, efg will not have been consumed. An adapter layer could report instead, per iconv semantics, that only ab were consumed. Then, the client would make a second call with cdefg as the input and, let's say, another empty two-byte target buffer. The adapter layer would offset the input by the content of the pivot buffer, calling ICU with just efg as input and cd in the pivot. The result would have cd in the output buffer, ef in the pivot buffer, and, from ICU's perspective, just g not having been consumed.

The problem I see with that approach is that, in the second call, there is a requirement that the input must start with cd to match the content of the pivot buffer. Most of the time that will obviously be true, especially in reasonable, practical use-cases. But I don't think the iconv semantics require that to be true. Instead, a client could supply vwxyz for the second call and end up with cd wrongfully emitted, vw discarded completely, and xy in the pivot buffer.

Another approach could use ucnv_getMaxCharSize() and ucnv_getMinCharSize() to calculate a safe buffer size for batch processing, perhaps in combination with ucnv_safeClone() to do limited preflighting or character-by-character conversion only when needed. This might have better performance than using preflighting mode or character mode all the time. The downside would be a potentially more complex implementation in rktio.

Comment on lines +327 to +332
static void init_icu()
{
/* Called from rktio_convert_init(rktio), which is called from rktio_init().
The ordering requirement on the first call to rktio_init() guarantees
exclusive access to static variables here.
*/

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mflatt Is this still true after the changes for parallel threads?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The first call to rktio_init() has to compete before any other call to rktio_init() per rktio's rules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants