UnicodeUtil: skip Hangul decompose/compose when no Hangul syllables present#390

Open
Copilot wants to merge 3 commits into master from copilot/optimize-normalize-hangul-processing

Copilot AI commented Apr 1, 2026

normalize() unconditionally called decomposeHangul() and composeHangul(), each an O(n) pass, even for strings containing no Hangul syllables (0xAC00–0xD7A3). This penalized all non-ASCII normalization (Greek, Arabic, CJK, etc.).

Changes

  • Single-pass Hangul detection: the existing scan loop now also checks for Hangul syllables. It sets hasHangul = true and breaks on the first Hangul match; until then it keeps scanning, so a Hangul character appearing later in the string isn't missed.
  • Conditional Hangul processing: decomposeHangul() and composeHangul() are now gated on hasHangul.
// Before
if (needsNormalizing) {
    s = decomposeHangul(s);           // always called
    UnicodeString ustring = new UnicodeString(s);
    String result = ustring.decompose().compose().toString();
    result = composeHangul(result);   // always called
    return result;
}

// After
if (needsNormalizing) {
    if (hasHangul) s = decomposeHangul(s);
    UnicodeString ustring = new UnicodeString(s);
    String result = ustring.decompose().compose().toString();
    if (hasHangul) result = composeHangul(result);
    return result;
}
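
The snippet above shows only the gating. As a rough illustration of the single-pass detection described in the first bullet, here is a minimal sketch. The class name `SinglePassScan`, the `scan` helper, and the `boolean[]` return shape are hypothetical and not taken from UnicodeUtil; the range bounds are the ones quoted in the description.

```java
// Hypothetical sketch, not the actual UnicodeUtil code.
final class SinglePassScan {

    // Bounds of the Hangul syllable range quoted in the PR description.
    static final int FIRST_HANGUL_SYLLABLE = 0xAC00;
    static final int LAST_HANGUL_SYLLABLE = 0xD7A3;

    /** Returns { needsNormalizing, hasHangul } for the given string. */
    static boolean[] scan(String s) {
        boolean needsNormalizing = false;
        boolean hasHangul = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 255) {
                needsNormalizing = true;
                if (c >= FIRST_HANGUL_SYLLABLE && c <= LAST_HANGUL_SYLLABLE) {
                    hasHangul = true;
                    break; // nothing more to learn once Hangul is seen
                }
                // Not Hangul: keep scanning in case Hangul appears later.
            }
        }
        return new boolean[] { needsNormalizing, hasHangul };
    }
}
```

With this shape, a Greek-only string reports needsNormalizing without hasHangul, so both Hangul passes would be skipped.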

…se/compose calls

Agent-Logs-Url: https://github.com/elharo/xom/sessions/0936fd40-179a-4ab0-a19a-ff83ce25893a

Co-authored-by: elharo <1005544+elharo@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Optimize normalize() to conditionally call Hangul methods" to "UnicodeUtil: skip Hangul decompose/compose when no Hangul syllables present" Apr 1, 2026
Copilot AI requested a review from elharo April 1, 2026 11:17
elharo marked this pull request as ready for review April 1, 2026 11:18

elharo left a comment

  if (c > 255) {
      needsNormalizing = true;
-     break;
+     if (c >= FIRST_HANGUL_SYLLABLE && c <= LAST_HANGUL_SYLLABLE) {

Does it make sense to instead scan forward from i at this point, looking specifically for Hangul characters? We already know normalizing is needed, so that would avoid the c > 255 and needsNormalizing checks.

Contributor Author

Done in 5e9fe27. The first loop now breaks immediately on finding any c > 255 (setting needsNormalizing). If that character is already Hangul we're done; otherwise a second, focused loop scans from position i+1 checking only the Hangul range, avoiding the redundant c > 255 test.
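
For readers without the diff at hand, the two-loop structure described in this reply might look roughly like the sketch below. It is an illustration under assumptions, not the code from 5e9fe27: the class `TwoPhaseScan`, the helper `isHangulSyllable`, and the `boolean[]` return value are hypothetical names introduced here.

```java
// Hypothetical sketch of the two-loop scan; not the committed code.
final class TwoPhaseScan {

    static final int FIRST_HANGUL_SYLLABLE = 0xAC00;
    static final int LAST_HANGUL_SYLLABLE = 0xD7A3;

    static boolean isHangulSyllable(char c) {
        return c >= FIRST_HANGUL_SYLLABLE && c <= LAST_HANGUL_SYLLABLE;
    }

    /** Returns { needsNormalizing, hasHangul } for the given string. */
    static boolean[] scan(String s) {
        boolean needsNormalizing = false;
        boolean hasHangul = false;
        int length = s.length();
        int i = 0;

        // Loop 1: stop at the first character above 255.
        for (; i < length; i++) {
            char c = s.charAt(i);
            if (c > 255) {
                needsNormalizing = true;
                hasHangul = isHangulSyllable(c);
                break;
            }
        }

        // Loop 2: only if normalizing is needed and the trigger was not
        // Hangul; checks the Hangul range alone from position i + 1.
        if (needsNormalizing && !hasHangul) {
            for (int j = i + 1; j < length; j++) {
                if (isHangulSyllable(s.charAt(j))) {
                    hasHangul = true;
                    break;
                }
            }
        }
        return new boolean[] { needsNormalizing, hasHangul };
    }
}
```

The second loop does one range comparison per character and never re-tests c > 255 or rewrites needsNormalizing, which is the saving the review comment asked about.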

…er suggestion

Agent-Logs-Url: https://github.com/elharo/xom/sessions/da3d8e96-a25c-42c6-9863-6a4fadd4ec66

Co-authored-by: elharo <1005544+elharo@users.noreply.github.com>
Copilot AI requested a review from elharo April 1, 2026 11:24

Development

Successfully merging this pull request may close these issues.

UnicodeUtil: optimize normalize() to avoid unnecessary Hangul decompose/compose calls
