
Update regex.rs to not split words on combining marks and diacritics #4

Open
ajaykg wants to merge 1 commit into gnp:master from ajaykg:patch-1

Conversation

ajaykg commented May 6, 2024

More than half of the world's population appears to speak languages whose scripts place Unicode combining marks, such as accents and matras, inside words. The GPT / tiktoken regular expressions appear to split such words at these marks, which prevents merges of characters that should merge. This change edits the regular expression so that it no longer splits on such combining characters.
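
To make the effect concrete, here is a minimal, standard-library-only Rust sketch. It is not the code from this PR and not the actual GPT-4 / tiktoken pattern: it only models a single "run of letters" rule (roughly `\p{L}+`) against a variant that also accepts marks (roughly `[\p{L}\p{M}]+`). The helpers `is_mark`, `is_letter`, and `word_pieces` are hypothetical names, and `is_mark` covers only a small hand-picked set of code points rather than the full Unicode `M` category.

```rust
/// Hypothetical helper: true for a small, hand-picked set of combining marks.
/// Assumption: real code would rely on the Unicode `M` general category
/// (for example `\p{M}` in a regex) instead of these hard-coded ranges.
fn is_mark(c: char) -> bool {
    matches!(
        c,
        '\u{0300}'..='\u{036F}'     // Combining Diacritical Marks block
        | '\u{0900}'..='\u{0903}'   // Devanagari candrabindu, anusvara, visarga
        | '\u{093C}'                // Devanagari nukta
        | '\u{093E}'..='\u{094F}'   // Devanagari vowel signs (matras) and virama
    )
}

/// Rough stand-in for `\p{L}`: alphabetic, but not one of the marks above.
fn is_letter(c: char) -> bool {
    c.is_alphabetic() && !is_mark(c)
}

/// Collects maximal runs of "word" characters.
/// With `keep_marks == false`, a combining mark ends the current run
/// (and is dropped here for brevity; a real pattern would match it elsewhere).
/// With `keep_marks == true`, marks stay inside the run.
fn word_pieces(text: &str, keep_marks: bool) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        if is_letter(c) || (keep_marks && is_mark(c)) {
            current.push(c);
        } else if !current.is_empty() {
            // Run ended: move it into the output and start a fresh one.
            pieces.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        pieces.push(current);
    }
    pieces
}
```

Under these assumptions, calling `word_pieces` on the Hindi word "\u{0939}\u{093F}\u{0928}\u{094D}\u{0926}\u{0940}" with `keep_marks` set to `false` yields three single-consonant pieces, while `true` keeps the word as one piece. The real pattern has more alternatives, so its exact pieces will differ; the sketch only shows that a letters-only rule breaks the word at every mark.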

gnp (Owner) commented May 17, 2024

Previously, the regex tokenizer defaulted to behavior that was (hopefully) identical to GPT-4. Does your proposed change fix cases where we were not compatible with GPT-4 and simply lacked the test cases to prove it?
