text-normalization

Digestion Algorithm in Hierarchical Symbolic Forests: A Fast Text Normalization Algorithm and Semantic Parsing Framework for Specific Scenarios and Lightweight Deployment

lightweight-framework incremental-learning semantic-parsing text-normalization data-structures-and-algorithms fast-algorithm symbolism sequence-parsing lightweight-deployment scenario-specific

Updated Dec 23, 2025
Python

A modern, actively maintained contractions library. Expands English contractions (you're → you are) with improved performance, type safety, and features like bulk dictionary imports and JSON loading. Includes 100% test coverage, full type hints, and works with both pip and uv.

python nlp natural-language python-library ml text-analysis python3 text-processing preprocessing type-hints contractions text-normalization pythonlibrary

Updated Dec 15, 2025
Python

esentis / string_extensions

Useful String extensions to save you time in production.

string strings string-manipulation flutter string-formatter dartlang string-matching flutter-plugin text-normalization string-validation string-extensions string-normalization dart-string-manipulation

Updated Dec 10, 2025
Dart

HMByteSensei / WhisperAI-Evaluation

Comprehensive benchmark of OpenAI Whisper models for Bosnian, Croatian, and Serbian languages. Includes pipelines for audio transcription, rigorous text normalization, Levenshtein distance evaluation, and LLM-based post-processing.

python nlp benchmarking machine-learning natural-language-processing serbian levenshtein-distance speech-to-text asr bosnian croatian graph-representation wer text-normalization accuracy-evaluation openai-whisper

Updated Dec 7, 2025
HTML

ikegami-yukino / neologdn

Japanese text normalizer for mecab-neologd

nlp japanese-language preprocessing mecab-ipadic-neologd text-normalization

Updated Dec 2, 2025
Cython

pszemraj / rehuman

Unicode-safe text cleaning & typographic normalization for Rust

unicode text-processing rust-crate text-normalization no-emoji-in-code

Updated Oct 27, 2025
Rust

Agash / TTSTextNormalization

Modern .NET 9 / C# 13 library to normalize text (emojis, currency, numbers, abbreviations, chat slang) for consistent and natural Text-to-Speech (TTS) synthesis, ideal for stream chat/donations.

text-to-speech youtube twitch streaming dotnet tts speech-synthesis text-normalization

Updated Oct 24, 2025
C#

curegit / unicodecheck

Simple tool to check if Unicode text files are Unicode-normalized

unicode character-encoding text-normalization

Updated Oct 18, 2025
Python

vinhdq842 / soe-vinorm

Soe Vinorm: An Effective Text Normalization Toolkit for converting Vietnamese text to its spoken form.

nlp-library vietnamese-nlp text-normalization

Updated Oct 17, 2025
Python

MohamedNassih / NLP-for-Darija-Enrichissement-de-traduction_darija.json

Python pipeline for enriching an Arabic (MSA) → Darija (Moroccan Arabic) dataset from PDF books and YouTube transcripts: normalization, token-based segmentation, generation (OpenAI or rule-based), and JSON export. Applied internship project at YaneCode Digital.

nlp machine-learning ocr translation corpus openai segmentation arabic tokenization darija text-normalization youtube-transcript

Updated Oct 13, 2025
Python

ItsAJ1005 / AI-powered-amount-extraction-in-medi-docs

A robust service that extracts and classifies financial amounts from medical bills and receipts using Tesseract.js OCR, normalization, and AI-powered context classification with Gemini API fallback.

text-classification text-extraction tesseract-ocr text-normalization gemini-api medical-documents

Updated Sep 27, 2025
JavaScript

xga0 / nlp-preprocessing-showcase

Demonstrates specialized NLP preprocessing packages (lightlemma, emoticon-fix, contraction-fix) through Twitter sentiment analysis. Achieves 97% accuracy with a PyTorch LSTM + attention model.

nlp benchmark machine-learning natural-language-processing social-media twitter deep-learning sentiment-analysis pytorch lstm classification text-processing preprocessing attention-mechanism lemmatization text-normalization lightlemma emoticon-fix contraction-fix

Updated Aug 9, 2025
Jupyter Notebook

bgnatowski / XSNTS-Analyzer

X.com Social Network Topic-Sentiment Analyzer: an application that scrapes Twitter/X data, performs Polish-language text normalization and lemmatization, builds topic models with MALLET, and runs sentiment analysis for downstream analytics. Created as a master's thesis project.

sentiment-analysis mallet twitter-scraper lemmatization text-normalization lda-topic-modeling spring-boot-3 polish-nlp