
Conversation

@Bloeckchengrafik

:shipit: :shipit: :shipit:

@face-hh
Member

face-hh commented Apr 22, 2025

I would unironically merge this, if it also integrated autocorrection for more words.

This can be done just like the lexicon, by inserting the entries from the Microsoft Research Spelling-Correction Data.

It contains an en_*.txt file with:

yung	young
yung	young
yung	young
yuork	york
yuou	you
yuou	you
yuour	your
yup.	up.
yur	our
yur	your
yur	your
yur	your
...more
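The repeated lines in the dump look like frequency votes for each correction. A small loader could collapse them into a single correction table; this is a hypothetical sketch (the `buildCorrectionMap` name and the majority-vote interpretation are assumptions, not part of the dataset's spec), assuming tab-separated wrong/right pairs:

```typescript
// Hypothetical loader: collapse repeated "wrong<TAB>right" lines into a
// single correction table, treating repetitions as frequency votes.
function buildCorrectionMap(lines: string[]): Map<string, string> {
  const votes = new Map<string, Map<string, number>>();
  for (const line of lines) {
    const [wrong, right] = line.split("\t");
    if (!wrong || !right) continue;
    const inner = votes.get(wrong) ?? new Map<string, number>();
    inner.set(right, (inner.get(right) ?? 0) + 1);
    votes.set(wrong, inner);
  }
  // For each misspelling, keep the most frequent correction.
  const table = new Map<string, string>();
  votes.forEach((inner, wrong) => {
    let best = "";
    let bestCount = 0;
    inner.forEach((count, right) => {
      if (count > bestCount) {
        best = right;
        bestCount = count;
      }
    });
    table.set(wrong, best);
  });
  return table;
}
```

With the sample above, `yur` maps to `your` (two votes) rather than `our` (one vote).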

@face-hh face-hh changed the title I love spelling things wrong Spellcheck Apr 22, 2025
@Bloeckchengrafik
Author

I'll work on it

@Bloeckchengrafik Bloeckchengrafik marked this pull request as draft April 22, 2025 16:39
@Bloeckchengrafik
Author

[screenshots]

This is where I landed for now. If you have any other ideas, please let me know.

Please note that the spelling-correction list may need to be trimmed a bit, since it is very aggressive:

[screenshot]

One could also build a (more expensive) string-distance-based spellchecking engine. Please let me know if you'd be interested in that.

@Bloeckchengrafik Bloeckchengrafik marked this pull request as ready for review April 22, 2025 18:31
@face-hh
Member

face-hh commented Apr 22, 2025

LGTM, will check it out tomorrow afternoon. Thank you!

@face-hh
Member

face-hh commented Apr 23, 2025

I've done some testing and it partially works.

There are a ton of edge cases where it gets stuff wrong:

  • whp os 1ne -> who as one (should be "who is one")
  • whre do i find john pork -> were am , find john park (should be "where do i find john pork")
  …and many more.

The code inside lexicon seems to work perfectly; the problem is spellingCorrection.ts.

I've pushed a change to cache database results, as it was querying everything on each function call.

I suppose we could attempt to create a more resource-intensive checker for spellingCorrection.ts.
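The caching fix mentioned above boils down to a standard memoization pattern. A minimal sketch, assuming hypothetical names (`getEntries`, `loadFromDb`) and a synchronous loader for simplicity — the real change lives in the PR's diff:

```typescript
// Load the word list once and reuse it, instead of querying the
// database on every call to the spellcheck function.
let cachedEntries: string[] | null = null;

function getEntries(loadFromDb: () => string[]): string[] {
  if (cachedEntries === null) {
    cachedEntries = loadFromDb(); // only the first call hits the DB
  }
  return cachedEntries;
}
```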

@Bloeckchengrafik
Author

I'll build a small dictionary-based one later

@Bloeckchengrafik
Author

[screenshot]

Built this one using Levenshtein distance. It still overcorrects and chooses rarely used words over more sensible ones. I've got some embedding-based pre-filtering and ranking brewing, but that'll take a couple of hours.
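For reference, this is the textbook dynamic-programming formulation of Levenshtein distance — a generic implementation, not the PR's actual code:

```typescript
// Edit distance between two strings: minimum number of single-character
// insertions, deletions, and substitutions to turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    dp.push(new Array<number>(b.length + 1).fill(0));
    dp[i][0] = i; // delete all of a[0..i)
  }
  for (let j = 0; j <= b.length; j++) dp[0][j] = j; // insert all of b[0..j)
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1; // substitution cost
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost  // match or substitution
      );
    }
  }
  return dp[a.length][b.length];
}
```

For example, `levenshtein("whre", "where")` is 1 (one inserted character), which is why both "where" and "were" end up as near-equal candidates and a tiebreaker is needed.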

@face-hh
Member

face-hh commented Apr 23, 2025

I believe the best approach to this would be to use a large language model and store the autocorrections, similar to how we currently handle autocomplete.

However, I've attempted this with the "ai summaries" feature, and it was pretty costly.

@Bloeckchengrafik
Author

That's what I'm trying to circumvent: my current idea is to embed the old and new tokens, as well as the query, and compare the vectors to capture the relevant meaning.

Real-time AI is too expensive for this.
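The vector comparison in that idea would typically be cosine similarity between precomputed embeddings; the embedding model itself is the expensive, undecided part. Illustrative only:

```typescript
// Cosine similarity between two embedding vectors: 1 means same direction,
// 0 means orthogonal (unrelated), assuming non-zero vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A correction candidate whose embedding is closer to the query's embedding would then be preferred over one that is merely close in edit distance.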

@Bloeckchengrafik
Author

Bloeckchengrafik commented Apr 24, 2025

[screenshots]

That's my last attempt at an algorithm for this, since I sadly have other things to do as well, and this was originally meant to be a quick and simple PR for laughs and giggles. But anyway, here we go!

The current approach works as follows:

  • The query is tokenized by splitting on spaces (other tokenizers don't play well with the token matching later)
  • Special misspellings are corrected directly ("guthib" -> "github")
  • If the word set (the 10k most-used English words) contains the token, it is left unchanged
  • Otherwise, find the 5 most similar words in the word set:
    • First, filter by length - this is faster than computing Levenshtein distance
    • Then filter by first character, or a common misspelling of it
    • Use Levenshtein distance to filter the rest
  • Assign each candidate a score based on its usage frequency in English and direct character similarity (to favor "where" over "were" when an "h" is present)
  • Return the word with the highest score
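The steps above can be sketched roughly like this. This is a minimal, illustrative version under stated assumptions — the names (`correctQuery`, `wordFreq`), the thresholds (length difference ≤ 1, edit distance ≤ 2), the exact-first-character simplification, and the lexicographic scoring are all mine, not the PR's actual spellingCorrection.ts:

```typescript
// Minimum edit distance (insertions, deletions, substitutions).
function levenshtein(a: string, b: string): number {
  const dp: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    dp.push(new Array<number>(b.length + 1).fill(0));
    dp[i][0] = i;
  }
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,
        dp[i][j - 1] + 1,
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
      );
  return dp[a.length][b.length];
}

// Count distinct characters shared by both words (favors "where" over
// "were" for the token "whre", since the "h" is preserved).
function charOverlap(a: string, b: string): number {
  const setB = new Set(b.split(""));
  return Array.from(new Set(a.split(""))).filter((ch) => setB.has(ch)).length;
}

function correctQuery(
  query: string,
  wordFreq: Map<string, number>, // known word -> usage frequency
  special: Map<string, string>   // direct misspelling table
): string {
  return query
    .split(" ")                                // space-splitting tokenizer
    .map((token) => {
      const direct = special.get(token);
      if (direct) return direct;               // special misspellings first
      if (wordFreq.has(token)) return token;   // known word: leave it alone
      const candidates = Array.from(wordFreq.keys())
        .filter((w) => Math.abs(w.length - token.length) <= 1) // cheap length filter
        .filter((w) => w[0] === token[0])      // first-char filter (simplified)
        .map((w) => ({ w, d: levenshtein(token, w) }))
        .filter((c) => c.d <= 2)
        .sort(
          (x, y) =>
            x.d - y.d ||                                         // closer edit distance
            charOverlap(token, y.w) - charOverlap(token, x.w) || // then char similarity
            wordFreq.get(y.w)! - wordFreq.get(x.w)!              // then usage frequency
        )
        .slice(0, 5);
      return candidates.length ? candidates[0].w : token; // best candidate or give up
    })
    .join(" ");
}
```

With a toy word set where "were" is more frequent than "where", the character-similarity tiebreaker still corrects "whre" to "where", matching the behavior described in the scoring bullet.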

Doing grammar correction (where os john pork -> where is john pork) is out of scope for this PR!
