Fix Japanese IME emoji composition #1485
Open
cromz22 wants to merge 2 commits into
In the source editor, cycling Japanese IME candidates broke as soon as an emoji (non-BMP surrogate pair) appeared: the character filter rewrote the composed range mid-composition, stopping candidate cycling and Backspace, and the OT core rejected the non-BMP char ("Out of sync").
Make the bridge composition-aware: while an IME composition is active, let the editor hold the real composed text (including an emoji preview) and pause every local->OT path (filter-characters, the realtime ViewPlugin, and history-ot's updateSender). On compositionend, scrub non-BMP characters code-point-aware to a single U+FFFD and reconcile the OT snapshot once by diffing it against the scrubbed editor text. Also makes the filter regex code-point-aware so one emoji collapses to a single replacement char.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Builds on the composition-aware sync fix for the Japanese IME / emoji bug:
- Reconcile at compositionend by reverting the composition out of the editor and re-applying the scrubbed text as a single normal edit. It syncs to the OT layer through the usual local-edit path (both OT modes, with track-changes attribution) and forms one clean undo step back to the pre-composition state.
- Keep the raw composition out of the undo history so undo can neither strand the replacement character nor resurrect the emoji.
- Buffer remote edits that arrive during composition so the IME isn't disrupted; if one lands, drop the in-progress composition, resync to the snapshot, and warn the user with a toast.
- Add a fallback so the editor<->OT bridge can't get stuck paused if compositionend never fires.
- Add frontend unit tests for the character scrub and diff helpers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fix Japanese IME breaking on emoji conversion candidates in the source editor
Background: Japanese input and IMEs (context for reviewers)
If you don't use a CJK input method, this bug is hard to picture, so here's the necessary background.
Japanese can't be typed one keystroke per character.
A writer types the pronunciation (in romaji or kana) and the operating system's IME (Input Method Editor) converts it into the intended characters.
Crucially, the same pronunciation maps to many possible words/characters (homophones): for example the sound "kao" (かお) can become 顔 ("face"), カオ, かお, and others, so the IME shows a candidate list and the user picks one.
Composition.
While the user is choosing, the text is uncommitted and shown inline with an underline as a live preview (the "composition").
The user presses Space to cycle through candidates and Enter to commit the chosen one.
Until commit, this text is provisional and the application is expected to leave it alone and let the browser/IME manage it.
Rewriting the composing text out from under the IME cancels or corrupts the composition.
Where emoji come in.
Most IMEs offer emojis as conversion candidates.
For example, "kao" can convert to a 😄 face emoji.
Most emoji are non-BMP characters (outside the Basic Multilingual Plane), which UTF-16 encodes as surrogate pairs.
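To make the surrogate-pair point concrete, here is a small standalone TypeScript sketch (not taken from the PR) showing how a JavaScript string stores 😄 (U+1F604) compared with a BMP character such as 顔. The variable names are illustrative only.

```typescript
// 😄 is U+1F604, written here as its two UTF-16 code units.
const emoji = "\uD83D\uDE04";
// 顔 is U+9854, which fits in a single UTF-16 code unit (BMP).
const kanji = "\u9854";

// String.length counts UTF-16 code units, not characters.
const emojiCodeUnits = emoji.length;
const kanjiCodeUnits = kanji.length;

// Array.from iterates by Unicode code point.
const emojiCodePoints = Array.from(emoji).length;

// The two halves of the surrogate pair, and the code point they encode.
const highSurrogate = emoji.charCodeAt(0).toString(16);
const lowSurrogate = emoji.charCodeAt(1).toString(16);
const codePoint = (emoji.codePointAt(0) ?? 0).toString(16);

console.log(emojiCodeUnits, kanjiCodeUnits, emojiCodePoints);
console.log(highSurrogate, lowSurrogate, codePoint);
```

Any code that processes text one UTF-16 code unit at a time therefore sees an emoji as two separate units, which matters for the filter described later.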
Normal, correct behaviour:
type kao -> underlined preview かお -> press Space to cycle 顔 / 😄 / カオ / ... -> press Enter to commit.
Space keeps cycling, nothing disturbs the preview, and after commit ordinary editing (Backspace, etc.) works.
The bug
In the Overleaf source (Code) editor, the moment an emoji candidate appears in the list (while still composing, before commit), the IME breaks:
U+FFFD (? inside a diamond) replacement glyph appears in place of the preview;
So Japanese users can enter neither emoji nor the ordinary candidates that appear after an emoji.
This is reproducible across browsers, and only inside the editor.
(Although it is reproducible, I cannot give exact reproduction steps with a specific string, because IMEs "learn" from the user's input and reorder the candidates over time.)
I understand that not allowing non-BMP characters is a design choice (Inserting emojis in LaTeX documents on Overleaf); however, that should not mean that IMEs cannot work with Overleaf.
Root cause
Overleaf's OT (operational-transform) layer cannot represent non-BMP characters, so the editor scrubs them to U+FFFD.
The problem is that this scrub runs while the IME is still composing:
filter-characters.ts installs an EditorState.transactionFilter that replaces surrogate code units with U+FFFD and appends a corrective transaction rewriting the composed range, which desyncs the browser IME.
Because that rewrite happens while the browser is still composing, the browser IME and CodeMirror's composition tracking are left desynced and stuck active, so Space stops cycling candidates and, after commit, Backspace does nothing (both swallowed by the dangling composition) until a click elsewhere or a clean non-emoji commit resets it.
Its regex matches per UTF-16 code unit, so one emoji becomes two U+FFFD.
That corrective transaction sets no new selection, so the caret is left wherever the disrupted composition leaves it (before the committed text, sometimes several characters back) rather than after it.
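The difference between per-code-unit and code-point-aware matching can be sketched as follows. The two regexes here are assumptions written for illustration; they are not copied from filter-characters.ts, and the function names are hypothetical.

```typescript
const REPLACEMENT = "\uFFFD";

// Assumed shape of a per-code-unit filter: without the `u` flag, each half
// of a surrogate pair is matched (and replaced) separately.
function scrubPerCodeUnit(text: string): string {
  return text.replace(/[\uD800-\uDFFF]/g, REPLACEMENT);
}

// Assumed shape of a code-point-aware filter: with the `u` flag the regex
// walks code points, so a valid surrogate pair matches once as a single
// astral character, and lone surrogates are still caught by the second branch.
function scrubPerCodePoint(text: string): string {
  return text.replace(/[\u{10000}-\u{10FFFF}]|[\uD800-\uDFFF]/gu, REPLACEMENT);
}

const input = "a\uD83D\uDE04b"; // "a😄b"
const perUnit = scrubPerCodeUnit(input);   // two replacement characters
const perPoint = scrubPerCodePoint(input); // one replacement character

console.log(perUnit.length, perPoint.length);
```

Scrubbing is idempotent in both variants, because U+FFFD is itself a BMP character outside the surrogate range.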
overleaf-editor-core) throws on non-BMP in TextOperation/InsertOp, and checkConsistency compares the editor against the OT snapshot on every change. (sharejs-text-ot doesn't reject, but the server sanitises surrogates per code unit.)
ViewPlugin (sharejs op submission + consistency check + auto-compile trigger) and history-OT's updateSender transactionExtender.
Approach: composition-aware sync
I first tried containing the fix to the character filter alone, deferring the scrub until after composition so the IME isn't disturbed.
But because the editor can't hold the emoji even transiently (above), deferring only traded the IME breakage for an "Out of sync" error.
The fix has to coordinate across the forwarding paths, not live in one filter.
While an IME composition is active, let the editor hold the real composed text (including a previewed emoji) and pause every local -> OT path.
A compositionField flag is read by filter-characters, the realtime ViewPlugin, and history-OT's updateSender, so they become no-ops during composition.
The IME is therefore never disrupted while cycling candidates.
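The idea of one shared flag that several forwarding paths consult can be modelled in plain TypeScript. This is only a sketch of the pattern: the class and its methods are hypothetical, and the actual PR uses a CodeMirror state field (compositionField) rather than anything shown here.

```typescript
// Hypothetical stand-in for the composition flag; not the PR's implementation.
class CompositionFlag {
  private active = false;

  start(): void {
    this.active = true;
  }

  end(): void {
    this.active = false;
  }

  // Wraps a local -> OT path so that it becomes a no-op while composing.
  guard<T>(path: (arg: T) => void): (arg: T) => void {
    return (arg: T) => {
      if (!this.active) path(arg);
    };
  }
}

// Three illustrative paths, standing in for the filter, the realtime
// plugin, and the history-OT sender. Each one just records what it saw.
const flag = new CompositionFlag();
const log: string[] = [];
const filterPath = flag.guard((t: string) => { log.push("filter:" + t); });
const realtimePath = flag.guard((t: string) => { log.push("realtime:" + t); });
const historyPath = flag.guard((t: string) => { log.push("history:" + t); });

flag.start();
filterPath("preview");   // ignored: composition active
realtimePath("preview"); // ignored
historyPath("preview");  // ignored
flag.end();
realtimePath("committed"); // forwarded once composition has ended

console.log(log);
```

Keeping the check in one place means a new forwarding path only needs to be wrapped once to respect composition.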
On compositionend:
U+FFFD. Because it is an ordinary local edit, it:
If a remote edit arrives mid-composition we can't safely merge it against the in-progress composition, so we drop the composition, resync the editor to the snapshot, and warn the user with a toast.
The remote echo is buffered during composition so the IME isn't visually disrupted before that point.
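The remote-edit handling described above can be sketched with a small model. Everything in it is hypothetical (the class, its fields, and the choice to resolve the conflict at compositionend are assumptions made for illustration); it only shows the intended outcomes: a clean commit is kept, while a composition interrupted by a remote edit is discarded in favour of the snapshot, with a warning.

```typescript
// Hypothetical model; names and structure are illustrative, not Overleaf's code.
interface CompositionResult {
  text: string;    // editor text after compositionend
  warned: boolean; // whether the warning toast would be shown
}

class ComposingEditorModel {
  text: string;
  private snapshot: string;
  private composing = false;
  private remoteArrived = false;

  constructor(initial: string) {
    this.text = initial;
    this.snapshot = initial;
  }

  compositionStart(): void {
    this.composing = true;
    this.remoteArrived = false;
  }

  // A remote edit always updates the OT snapshot, but the visible editor
  // text is only touched when no composition is in progress.
  receiveRemote(newSnapshot: string): void {
    this.snapshot = newSnapshot;
    if (this.composing) {
      this.remoteArrived = true; // buffered: the IME preview is left alone
    } else {
      this.text = newSnapshot;
    }
  }

  compositionEnd(committedText: string): CompositionResult {
    this.composing = false;
    if (this.remoteArrived) {
      // Drop the composition and resync to the snapshot.
      this.remoteArrived = false;
      this.text = this.snapshot;
      return { text: this.text, warned: true };
    }
    this.text = committedText;
    this.snapshot = committedText;
    return { text: this.text, warned: false };
  }
}

const clean = new ComposingEditorModel("abc");
clean.compositionStart();
const cleanResult = clean.compositionEnd("abc\u9854");

const interrupted = new ComposingEditorModel("abc");
interrupted.compositionStart();
interrupted.receiveRemote("Xabc");
const interruptedResult = interrupted.compositionEnd("abc\u9854");

console.log(cleanResult, interruptedResult);
```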
A fallback reconciles even if no compositionend event fires, so the editor <-> OT bridge can never get stuck paused.
What this covers
Testing
Automated (services/web/test/frontend/features/source-editor/extensions/):
- filter-characters.test.ts: code-point-aware scrubbing (one emoji -> one U+FFFD, lone surrogates, NUL, remote-skip, idempotency).
- composition.test.ts: the scrub helper and the diff -> change-spec helper.
- composition-sync.test.ts: integration tests driving synthetic composition events: emoji commit -> single U+FFFD with the caret preserved; normal Japanese commit; and remote-edit-during-composition -> discard + resync + warning toast.
Manual (history-OT dev environment), verified on macOS (macOS Japanese IME) and Linux (Ubuntu, fcitx5-mozc), in Chrome and Firefox:
Typing a Japanese reading, cycling to an emoji candidate, then committing:
U+FFFD;
Not verified / out of scope
The following were not verified:
U+FFFD renders. (The change only touches editor <-> OT sync, not compilation, and U+FFFD is a normal BMP character.)
U+FFFD (unit-tested), but this was not exercised by hand in the app.
Files changed
- services/web/frontend/js/features/source-editor/extensions/composition.ts (new): the composition flag, scrub/diff helpers, and the compositionend reconcile.
- services/web/frontend/js/features/source-editor/extensions/filter-characters.ts: code-point-aware regex; skip while composing.
- services/web/frontend/js/features/source-editor/extensions/realtime.ts: pause OT forwarding while composing; buffer remote echoes.
- services/web/frontend/js/features/source-editor/extensions/history-ot.ts: pause updateSender while composing.
- services/web/frontend/js/features/source-editor/extensions/index.ts: register the extension.
- services/web/frontend/js/features/source-editor/components/composition-toasts.tsx (new) + .../ide-react/components/global-toasts.tsx: the "Input not applied" warning toast.
- services/web/locales/en.json: toast strings.
- services/web/test/frontend/features/source-editor/extensions/{filter-characters,composition,composition-sync}.test.ts (new).
Related issues
the user-side report of this bug (Windows, Google/Microsoft IME). This PR fixes the input breakage, not the request to store/display 4-byte characters, so a committed emoji still becomes U+FFFD (because I understood this to be a design choice).
Additional notes
I used Claude Code for the development because, although I have general knowledge of web technologies, I am an expert in neither web development nor Overleaf.
However, I believe I am a good engineer in general, and I did my best to check that the implementation looks reasonable and to verify the behaviour manually.
Additionally, I understand that this is a community version and is different from the officially hosted version.
However, this problem exists in both versions, and I hope that once this PR is accepted for the community version, it will be applied to the hosted version as well.