Fix Japanese IME emoji composition #1485

Open
cromz22 wants to merge 2 commits into overleaf:main from cromz22:fix/japanese-ime-emoji-composition

@cromz22 cromz22 commented Jun 1, 2026

Fix Japanese IME breaking on emoji conversion candidates in the source editor

Background: Japanese input and IMEs (context for reviewers)

If you don't use a CJK input method, this bug is hard to picture, so here's the necessary background.

Japanese can't be typed one keystroke per character.
A writer types the pronunciation (in romaji or kana) and the operating system's IME (Input Method Editor) converts it into the intended characters.
Crucially, the same pronunciation maps to many possible words/characters (homophones): for example the sound "kao" (かお) can become 顔 ("face"), カオ, かお, and others, so the IME shows a candidate list and the user picks one.

Composition.
While the user is choosing, the text is uncommitted and shown inline with an underline as a live preview (the "composition").
The user presses Space to cycle through candidates and Enter to commit the chosen one.
Until commit, this text is provisional and the application is expected to leave it alone and let the browser/IME manage it.
Rewriting the composing text out from under the IME cancels or corrupts the composition.

Where emoji come in.
Most IMEs offer emojis as conversion candidates.
For example, "kao" can convert to a 😄 face emoji.
Emoji are non-BMP (Basic Multilingual Plane) characters (UTF-16 surrogate pairs).
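
To make the surrogate-pair point concrete for reviewers, here is a small TypeScript sketch (not part of the PR; the variable names are only illustrative). It shows that the 😄 emoji occupies two UTF-16 code units while counting as a single code point, whereas 顔 fits in one code unit.

```typescript
// U+1F604 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
const emoji = "\u{1F604}";

// String.prototype.length counts UTF-16 code units.
const codeUnits: number = emoji.length;

// Array.from iterates by code point, so a surrogate pair counts once.
const codePoints: number = Array.from(emoji).length;

// The two halves of the surrogate pair, in hex.
const high: string = emoji.charCodeAt(0).toString(16);
const low: string = emoji.charCodeAt(1).toString(16);

// A BMP character such as 顔 (U+9854) needs only one code unit.
const kanjiUnits: number = "顔".length;

console.log({ codeUnits, codePoints, high, low, kanjiUnits });
```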

Normal, correct behaviour:
type kao -> underlined preview かお -> press Space to cycle 顔 / 😄 / カオ / ... -> press Enter to commit.
Space keeps cycling, nothing disturbs the preview, and after commit ordinary editing (Backspace, etc.) works.

The bug

In the Overleaf source (Code) editor, the moment an emoji candidate appears in the list (while still composing, before commit), the IME breaks:

  • a garbled, doubled U+FFFD (? inside a diamond) replacement glyph appears in place of the preview;
  • Space stops cycling candidates;
  • Backspace sometimes stops working;
  • after committing, the caret jumps backwards by several characters instead of staying after the inserted character.

So Japanese users can enter neither emoji nor the ordinary candidates that appear after an emoji in the list.
This is reproducible across browsers, and only inside the editor.
(I cannot give exact reproduction steps with a specific string, though, because IMEs "learn" from the user's input and reorder the candidates over time.)

I understand that disallowing non-BMP characters is a design choice (Inserting emojis in LaTeX documents on Overleaf). However, that should not mean that IMEs cannot work with Overleaf.

Root cause

Overleaf's OT (operational-transform) layer cannot represent non-BMP characters, so the editor scrubs them to U+FFFD.
The problem is that this scrub runs while the IME is still composing:

  • filter-characters.ts installs an EditorState.transactionFilter that replaces surrogate code units with U+FFFD and appends a corrective transaction rewriting the composed range.
    Because that rewrite happens while the browser is still composing, the browser IME and CodeMirror's composition tracking are left desynced and stuck active. Space then stops cycling candidates and, after commit, Backspace does nothing (both are swallowed by the dangling composition) until a click elsewhere or a clean non-emoji commit resets the state.
    Its regex matches per UTF-16 code unit, so one emoji becomes two U+FFFD.
    That corrective transaction sets no new selection, so the caret is left wherever the disrupted composition leaves it (before the committed text, sometimes several characters back) rather than after it.
  • The editor cannot hold a non-BMP character even transiently: history-OT (overleaf-editor-core) throws on non-BMP in TextOperation/InsertOp, and checkConsistency compares the editor against the OT snapshot on every change. (sharejs-text-ot doesn't reject, but the server sanitises surrogates per code unit.)
  • Two independent local -> OT submission paths forward composition changes with no composition awareness: the realtime ViewPlugin (sharejs op submission + consistency check + auto-compile trigger) and history-OT's updateSender transactionExtender.
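
The "one emoji becomes two U+FFFD" effect can be reproduced in isolation. The sketch below is not the regex from filter-characters.ts (which is not quoted in this description); both patterns and both function names are assumptions chosen to contrast per-code-unit matching with code-point-aware matching.

```typescript
const REPLACEMENT = "\uFFFD";

// Per-code-unit matching: each half of a surrogate pair is replaced on its own.
function scrubPerCodeUnit(text: string): string {
  return text.replace(/[\uD800-\uDFFF]/g, REPLACEMENT);
}

// Code-point-aware matching: a well-formed pair is tried first and consumed as
// one match; any remaining lone surrogate is caught by the second alternative.
function scrubPerCodePoint(text: string): string {
  return text.replace(
    /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g,
    REPLACEMENT
  );
}

const sample = "a\u{1F604}b";
const perUnit = scrubPerCodeUnit(sample); // two replacement characters
const perPoint = scrubPerCodePoint(sample); // one replacement character
console.log(perUnit.length, perPoint.length);
```

Both variants leave ordinary Japanese text untouched, since kana and most kanji are BMP characters.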

Approach: composition-aware sync

I first tried containing the fix to the character filter alone, deferring the scrub until after composition so the IME isn't disturbed.
But because the editor can't hold the emoji even transiently (above), deferring only traded the IME breakage for an "Out of sync" error.
The fix has to coordinate across the forwarding paths, not live in one filter.

While an IME composition is active, let the editor hold the real composed text (including a previewed emoji) and pause every local -> OT path.
A compositionField flag is read by filter-characters, the realtime ViewPlugin, and history-OT's updateSender, so they become no-ops during composition.
The IME is therefore never disrupted while cycling candidates.
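
As a rough illustration of the pause, here is a framework-free model in TypeScript. It is not the PR's code: FakeOtLayer, CompositionAwareBridge, and scrubNonBmp are hypothetical names, the real implementation uses a CodeMirror state field rather than a boolean, and the commit step is reduced here to a single scrubbed submit.

```typescript
// Hypothetical stand-in for the OT layer: records submissions and rejects
// non-BMP text, mirroring the constraint described under "Root cause".
class FakeOtLayer {
  readonly submitted: string[] = [];

  submit(text: string): void {
    if (/[\uD800-\uDFFF]/.test(text)) {
      throw new Error("OT layer cannot represent non-BMP characters");
    }
    this.submitted.push(text);
  }
}

// Collapse each surrogate pair (or lone surrogate) to one U+FFFD.
function scrubNonBmp(text: string): string {
  return text.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g, "\uFFFD");
}

class CompositionAwareBridge {
  private composing = false;
  private readonly ot: FakeOtLayer;

  constructor(ot: FakeOtLayer) {
    this.ot = ot;
  }

  compositionStart(): void {
    this.composing = true;
  }

  // Called for every local editor change, including IME preview updates.
  onLocalChange(insertedText: string): void {
    if (this.composing) return; // paused: the preview never reaches the OT layer
    this.ot.submit(scrubNonBmp(insertedText));
  }

  // On commit, resume forwarding and send the scrubbed result once.
  compositionEnd(committedText: string): void {
    this.composing = false;
    this.onLocalChange(committedText);
  }
}

const ot = new FakeOtLayer();
const bridge = new CompositionAwareBridge(ot);
bridge.compositionStart();
bridge.onLocalChange("かお"); // preview, not forwarded
bridge.onLocalChange("\u{1F604}"); // emoji candidate, not forwarded
bridge.compositionEnd("\u{1F604}"); // forwarded once, already scrubbed
console.log(ot.submitted.length);
```

The point of the model is only that the OT layer never sees a preview: without the early return, the second onLocalChange call would need scrubbing mid-composition, which is what disturbs the IME.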

On compositionend:

  1. Revert the composed text out of the editor. This revert is not forwarded to the OT layer (which never saw it) and not added to the undo history.
  2. Re-apply the scrubbed text as a single normal edit. The scrub is code-point-aware, so one emoji collapses to a single U+FFFD. Because it is an ordinary local edit, it:
    • syncs to the OT layer through the usual path, so it is correct for both OT modes and attributed as a tracked change in review mode, and
    • is a single, clean undo step back to the pre-composition state.
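
The "single normal edit" in step 2 can be derived by diffing the pre-composition text against the scrubbed text. The following is a minimal sketch of such a diff helper, assuming a common-prefix/common-suffix strategy; diffToChange, applyChange, and the SimpleChange shape (modelled on CodeMirror's from/to/insert change objects) are hypothetical and may differ from the helper in composition.ts.

```typescript
// Hypothetical change shape, modelled on CodeMirror's { from, to, insert }.
interface SimpleChange {
  from: number;
  to: number;
  insert: string;
}

// Describe the difference between two strings as one replaced range by
// trimming their common prefix and common suffix.
function diffToChange(before: string, after: string): SimpleChange | null {
  if (before === after) return null;

  const maxPrefix = Math.min(before.length, after.length);
  let prefix = 0;
  while (prefix < maxPrefix && before[prefix] === after[prefix]) prefix++;

  // The suffix must not overlap the prefix in either string.
  const maxSuffix = maxPrefix - prefix;
  let suffix = 0;
  while (
    suffix < maxSuffix &&
    before[before.length - 1 - suffix] === after[after.length - 1 - suffix]
  ) {
    suffix++;
  }

  return {
    from: prefix,
    to: before.length - suffix,
    insert: after.slice(prefix, after.length - suffix),
  };
}

// Apply a change to a plain string, to check the diff round-trips.
function applyChange(doc: string, change: SimpleChange): string {
  return doc.slice(0, change.from) + change.insert + doc.slice(change.to);
}

const preComposition = "abc";
const scrubbed = "ab\uFFFDc"; // an emoji was committed between "b" and "c"
const change = diffToChange(preComposition, scrubbed);
console.log(change);
```

Because the result is one contiguous range, it maps naturally onto one OT operation and one undo step.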

If a remote edit arrives mid-composition we can't safely merge it against the in-progress composition, so we drop the composition, resync the editor to the snapshot, and warn the user with a toast.
The remote echo is buffered during composition so the IME isn't visually disrupted before that point.
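
The drop-and-resync rule can be modelled without any editor at all. This sketch assumes whole-document snapshots instead of real OT operations, and CompositionSession, onRemoteEdit, and finish are hypothetical names used only for illustration.

```typescript
interface ReconcileResult {
  editorText: string;
  warned: boolean; // true when the "Input not applied" toast would be shown
}

class CompositionSession {
  private snapshot: string;
  private remoteArrived = false;

  constructor(snapshot: string) {
    this.snapshot = snapshot;
  }

  // A remote edit updates the snapshot but is not shown in the editor yet,
  // so the IME preview is left undisturbed.
  onRemoteEdit(newSnapshot: string): void {
    this.snapshot = newSnapshot;
    this.remoteArrived = true;
  }

  // At the end of composition: keep the local result only if nothing remote
  // arrived; otherwise drop it and fall back to the snapshot.
  finish(scrubbedLocalText: string): ReconcileResult {
    if (this.remoteArrived) {
      return { editorText: this.snapshot, warned: true };
    }
    return { editorText: scrubbedLocalText, warned: false };
  }
}

const quiet = new CompositionSession("abc");
const quietResult = quiet.finish("abc顔");

const contested = new CompositionSession("abc");
contested.onRemoteEdit("abXc");
const contestedResult = contested.finish("abc顔");
console.log(quietResult, contestedResult);
```

The trade-off is deliberate: the composing user loses one commit in the rare contested case, but the document never diverges from the snapshot.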

A fallback reconciles even if no compositionend event fires, so the editor <-> OT bridge can never get stuck paused.

What this covers

  • Both OT modes (history-OT and sharejs-text-ot): sync goes through the normal local-edit path.
  • Track changes (review mode): composed text is attributed correctly and positions are mapped through tracked deletes.
  • Caret position: the re-applied edit restores the caret to its committed position (mapped through the scrub), so committing keeps the caret after the inserted character instead of jumping it backwards.
  • Undo: one clean step; can neither strand the replacement char nor resurrect the emoji.
  • Auto-compile: unaffected; forwarding pauses during composition and fires once after the commit.
  • Concurrent remote edits during composition: safe (drop + resync + warn); never desyncs or corrupts the document.
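
For the caret bullet above, "mapped through the scrub" means that every surrogate pair before the caret shrinks the offset by one, because two code units become a single U+FFFD. A minimal sketch, assuming offsets are UTF-16 code-unit indices (as in CodeMirror) and using a hypothetical function name:

```typescript
// Map a caret offset in the raw composed text to the matching offset in the
// scrubbed text, where each surrogate pair has become one U+FFFD.
function mapCaretThroughScrub(raw: string, caret: number): number {
  let mapped = 0;
  let i = 0;
  while (i < caret && i < raw.length) {
    const codePoint = raw.codePointAt(i)!; // defined because i < raw.length
    i += codePoint > 0xffff ? 2 : 1; // a surrogate pair spans two code units
    mapped += 1; // but yields exactly one unit after scrubbing
  }
  return mapped;
}

const rawText = "か\u{1F604}"; // 3 code units: one kana plus one surrogate pair
const caretAfterEmoji = mapCaretThroughScrub(rawText, 3);
const caretBeforeEmoji = mapCaretThroughScrub(rawText, 1);
console.log(caretAfterEmoji, caretBeforeEmoji);
```

A caret that falls inside a surrogate pair is placed after the replacement character, which matches the expectation that the caret ends up after the inserted text.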

Testing

Automated (services/web/test/frontend/features/source-editor/extensions/):

  • filter-characters.test.ts: code-point-aware scrubbing (one emoji -> one U+FFFD, lone surrogates, NUL, remote-skip, idempotency).
  • composition.test.ts: the scrub helper and the diff -> change-spec helper.
  • composition-sync.test.ts: integration tests driving synthetic composition events: emoji commit -> single U+FFFD with the caret preserved; normal Japanese commit; and remote-edit-during-composition -> discard + resync + warning toast.

Manual (history-OT dev environment), verified on macOS (macOS Japanese IME) and Linux (Ubuntu, fcitx5-mozc), in Chrome and Firefox:

Typing a Japanese reading, cycling to an emoji candidate, then committing:

  • the IME is no longer disrupted (Space cycles, native preview shows);
  • the commit yields a single U+FFFD;
  • Backspace works;
  • the caret stays in place;
  • undo is clean;
  • there's no "Out of sync";
  • ordinary Japanese round-trips unchanged.

Not verified / out of scope

The following were not verified:

  • PDF compilation / rendering: the dev environment has no TeX Live, so I could not confirm that documents still compile or how a committed U+FFFD renders. (The change only touches editor<->OT sync, not compilation, and U+FFFD is a normal BMP character.)
  • Concurrent / multi-user editing during composition: can't be triggered by a single user (switching focus commits the composition), so it is covered only by the synthetic-event integration test, not a real two-client session. To verify with real clients, have two users open the same document, one composing Japanese on a line while the other edits that line. Expected result: the composing user keeps a stable IME and, on commit, sees the collaborator's edit plus an "Input not applied" toast.
  • sharejs-text-ot mode at runtime: the dev environment runs history-OT. The sharejs path now goes through the same normal sync path, but was not exercised live.
  • Track changes (review mode) at runtime: covered by reuse of the normal edit path and by reasoning, but not exercised by hand.
  • The visual (rich text) editor: this change is for the source editor only. The visual editor was not touched, and whether it has the same bug is unknown.
  • Other IMEs: Windows IME, Google Japanese Input, Chinese, Korean, and mobile IMEs were not tested, and composition-event timing can differ between them.
  • Pasting an emoji outside composition: the filter now collapses it to a single U+FFFD (unit-tested), but this was not exercised by hand in the app.
  • Full CI: I ran eslint, prettier, and the new frontend unit tests on the changed files, but not the full project test suite or a full type-check.

Files changed

  • services/web/frontend/js/features/source-editor/extensions/composition.ts (new): the composition flag, scrub/diff helpers, and the compositionend reconcile.
  • services/web/frontend/js/features/source-editor/extensions/filter-characters.ts: code-point-aware regex; skip while composing.
  • services/web/frontend/js/features/source-editor/extensions/realtime.ts: pause OT forwarding while composing; buffer remote echoes.
  • services/web/frontend/js/features/source-editor/extensions/history-ot.ts: pause updateSender while composing.
  • services/web/frontend/js/features/source-editor/extensions/index.ts: register the extension.
  • services/web/frontend/js/features/source-editor/components/composition-toasts.tsx (new) + .../ide-react/components/global-toasts.tsx: the "Input not applied" warning toast.
  • services/web/locales/en.json: toast strings.
  • services/web/test/frontend/features/source-editor/extensions/{filter-characters,composition,composition-sync}.test.ts (new).

Related issues

  • #1234 "Enable to display 4-byte characters in the editor" (open):
    the user-side report of this bug (Windows, Google/Microsoft IME). This PR fixes the input breakage, not the request to store/display 4-byte characters, so a committed emoji still becomes U+FFFD (I understand that to be a design choice).

Additional notes

I used Claude Code for the development because, although I have general knowledge of web technologies, I am not an expert in web development or in Overleaf.
However, I believe I am a good engineer in general, and I did my best to check that the implementation looks reasonable and to verify the behaviour manually.

Additionally, I understand that this is the community version and differs from the officially hosted version.
However, the problem occurs in both versions, and I hope that once this PR is accepted for the community version, the fix is applied to the hosted version as well.

Review

  • I have signed the Contributor License Agreement.

cromz22 and others added 2 commits May 31, 2026 22:14
In the source editor, cycling Japanese IME candidates broke as soon as an emoji (non-BMP surrogate pair) appeared: the character filter rewrote the composed range mid-composition, stopping candidate cycling and Backspace, and the OT core rejected the non-BMP char ("Out of sync").

Make the bridge composition-aware: while an IME composition is active, let the editor hold the real composed text (including an emoji preview) and pause every local->OT path (filter-characters, the realtime ViewPlugin, and history-ot's updateSender). On compositionend, scrub non-BMP characters code-point-aware to a single U+FFFD and reconcile the OT snapshot once by diffing it against the scrubbed editor text. Also makes the filter regex code-point-aware so one emoji collapses to a single replacement char.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Builds on the composition-aware sync fix for the Japanese IME / emoji bug:

- Reconcile at compositionend by reverting the composition out of the editor and re-applying the scrubbed text as a single normal edit. It syncs to the OT layer through the usual local-edit path (both OT modes, with track-changes attribution) and forms one clean undo step back to the pre-composition state.

- Keep the raw composition out of the undo history so undo can neither strand the replacement character nor resurrect the emoji.

- Buffer remote edits that arrive during composition so the IME isn't disrupted; if one lands, drop the in-progress composition, resync to the snapshot, and warn the user with a toast.

- Add a fallback so the editor<->OT bridge can't get stuck paused if compositionend never fires.

- Add frontend unit tests for the character scrub and diff helpers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>