
Conversation

@Bloeckchengrafik

:shipit: :shipit: :shipit:

@face-hh
Member

face-hh commented Apr 22, 2025

I would unironically merge this, if it also integrated autocorrection for more words.

This can be done just like the lexicon, by inserting the entries from the Microsoft Research Spelling-Correction Data.

It contains an en_*.txt file with:

yung	young
yung	young
yung	young
yuork	york
yuou	you
yuou	you
yuour	your
yup.	up.
yur	our
yur	your
yur	your
yur	your
...more
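The repeated lines in the dump look like frequency votes for each correction. A small loader could collapse them into a single correction table; this is a hypothetical sketch (the `buildCorrectionMap` name and the majority-vote interpretation are assumptions, not part of the dataset's spec), assuming tab-separated wrong/right pairs:

```typescript
// Hypothetical loader: collapse repeated "wrong<TAB>right" lines into a
// single correction table, treating repetitions as frequency votes.
function buildCorrectionMap(lines: string[]): Map<string, string> {
  const votes = new Map<string, Map<string, number>>();
  for (const line of lines) {
    const [wrong, right] = line.split("\t");
    if (!wrong || !right) continue;
    const inner = votes.get(wrong) ?? new Map<string, number>();
    inner.set(right, (inner.get(right) ?? 0) + 1);
    votes.set(wrong, inner);
  }
  // For each misspelling, keep the most frequent correction.
  const table = new Map<string, string>();
  votes.forEach((inner, wrong) => {
    let best = "";
    let bestCount = 0;
    inner.forEach((count, right) => {
      if (count > bestCount) {
        best = right;
        bestCount = count;
      }
    });
    table.set(wrong, best);
  });
  return table;
}
```

With the sample above, `yur` maps to `your` (two votes) rather than `our` (one vote).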

@face-hh face-hh changed the title I love spelling things wrong Spellcheck Apr 22, 2025
@Bloeckchengrafik
Author

I'll work on it

@Bloeckchengrafik Bloeckchengrafik marked this pull request as draft April 22, 2025 16:39
@Bloeckchengrafik
Author

[screenshots]

This is where I landed for now. If you have any other ideas, please let me know.

Please note that the spelling-correction list may need to be trimmed a bit, since it is very aggressive:

[screenshot]

One could also build a (more expensive) string-distance-based spellchecking engine. Please let me know if you'd be interested in that.

@Bloeckchengrafik Bloeckchengrafik marked this pull request as ready for review April 22, 2025 18:31
@face-hh
Member

face-hh commented Apr 22, 2025

LGTM, will check it out tomorrow afternoon. Thank you!

@face-hh
Member

face-hh commented Apr 23, 2025

I've done some testing and it partially works.

There are a ton of edge cases where it gets stuff wrong:

  • whp os 1ne -> who as one (should be "who is one")
  • whre do i find john pork -> were am , find john park (should be "where do i find john pork")
  …and many more.

The code inside lexicon seems to work perfectly; the problem is spellingCorrection.ts.

I've pushed a change to cache database results, as it was querying everything on each function call.

I suppose we could attempt to create a more resource-intensive checker for spellingCorrection.ts.
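The caching fix mentioned above boils down to a standard memoization pattern. A minimal sketch, assuming hypothetical names (`getEntries`, `loadFromDb`) and a synchronous loader for simplicity — the real change lives in the PR's diff:

```typescript
// Load the word list once and reuse it, instead of querying the
// database on every call to the spellcheck function.
let cachedEntries: string[] | null = null;

function getEntries(loadFromDb: () => string[]): string[] {
  if (cachedEntries === null) {
    cachedEntries = loadFromDb(); // only the first call hits the DB
  }
  return cachedEntries;
}
```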

@Bloeckchengrafik
Author

I'll build a small dictionary-based one later

@Bloeckchengrafik
Author

[screenshot]

Built this one using Levenshtein distance. It still overcorrects and chooses rarely used words over more sensible ones. I've got some embedding-based pre-filtering and ranking brewing, but that'll take a couple of hours.
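For reference, this is the textbook dynamic-programming formulation of Levenshtein distance — a generic implementation, not the PR's actual code:

```typescript
// Edit distance between two strings: minimum number of single-character
// insertions, deletions, and substitutions to turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    dp.push(new Array<number>(b.length + 1).fill(0));
    dp[i][0] = i; // delete all of a[0..i)
  }
  for (let j = 0; j <= b.length; j++) dp[0][j] = j; // insert all of b[0..j)
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1; // substitution cost
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost  // match or substitution
      );
    }
  }
  return dp[a.length][b.length];
}
```

For example, `levenshtein("whre", "where")` is 1 (one inserted character), which is why both "where" and "were" end up as near-equal candidates and a tiebreaker is needed.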

@face-hh
Member

face-hh commented Apr 23, 2025

I believe the best approach to this would be to use a large language model and store the autocorrections, similar to how we currently handle autocomplete.

However, I've attempted this with the "ai summaries" feature, and it was pretty costly.

@Bloeckchengrafik
Author

That's what I'm trying to circumvent: my current idea is to embed the old and new tokens, as well as the query, and compare the vectors to capture the relevant meaning.

Real-time AI is too expensive for this.
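The vector comparison in that idea would typically be cosine similarity between precomputed embeddings; the embedding model itself is the expensive, undecided part. Illustrative only:

```typescript
// Cosine similarity between two embedding vectors: 1 means same direction,
// 0 means orthogonal (unrelated), assuming non-zero vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A correction candidate whose embedding is closer to the query's embedding would then be preferred over one that is merely close in edit distance.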

@Bloeckchengrafik
Author

Bloeckchengrafik commented Apr 24, 2025

[screenshots]

That's my last attempt at an algorithm for this, since I sadly have other things to do as well, and this was originally meant to be a quick and simple PR for laughs and giggles. But anyway, here we go!

The current approach works as follows:

  • The query is tokenized by splitting on spaces (other tokenizers don't play well with the token matching later)
  • Special misspellings are corrected directly ("guthib" -> "github")
  • If the word set (the 10k most-used English words) contains the token, it is left unchanged
  • Otherwise, find the 5 most similar words in the word set:
    • First, filter by length - this is faster than computing Levenshtein distance
    • Then filter by first character, or a common misspelling of it
    • Use Levenshtein distance to filter the rest
  • Assign each candidate a score based on its usage frequency in English and direct character similarity (to favor "where" over "were" when an "h" is present)
  • Return the word with the highest score
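The steps above can be sketched roughly like this. This is a minimal, illustrative version under stated assumptions — the names (`correctQuery`, `wordFreq`), the thresholds (length difference ≤ 1, edit distance ≤ 2), the exact-first-character simplification, and the lexicographic scoring are all mine, not the PR's actual spellingCorrection.ts:

```typescript
// Minimum edit distance (insertions, deletions, substitutions).
function levenshtein(a: string, b: string): number {
  const dp: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    dp.push(new Array<number>(b.length + 1).fill(0));
    dp[i][0] = i;
  }
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
  return dp[a.length][b.length];
}

// Count distinct characters shared by both words (favors "where" over
// "were" for the token "whre", since the "h" is preserved).
function charOverlap(a: string, b: string): number {
  const setB = new Set(b.split(""));
  return Array.from(new Set(a.split(""))).filter((ch) => setB.has(ch)).length;
}

function correctQuery(
  query: string,
  wordFreq: Map<string, number>, // known word -> usage frequency
  special: Map<string, string>   // direct misspelling table
): string {
  return query
    .split(" ")                                // space-splitting tokenizer
    .map((token) => {
      const direct = special.get(token);
      if (direct) return direct;               // special misspellings first
      if (wordFreq.has(token)) return token;   // known word: leave it alone
      const candidates = Array.from(wordFreq.keys())
        .filter((w) => Math.abs(w.length - token.length) <= 1) // cheap length filter
        .filter((w) => w[0] === token[0])      // first-char filter (simplified)
        .map((w) => ({ w, d: levenshtein(token, w) }))
        .filter((c) => c.d <= 2)
        .sort(
          (x, y) =>
            x.d - y.d ||                                         // closer edit distance
            charOverlap(token, y.w) - charOverlap(token, x.w) || // then char similarity
            wordFreq.get(y.w)! - wordFreq.get(x.w)!              // then usage frequency
        )
        .slice(0, 5);
      return candidates.length ? candidates[0].w : token; // best candidate or give up
    })
    .join(" ");
}
```

With a toy word set where "were" is more frequent than "where", the character-similarity tiebreaker still corrects "whre" to "where", matching the behavior described in the scoring bullet.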

Doing grammar correction (where os john pork -> where is john pork) is out of scope for this PR!
