Lib.rs

#tokenize

tokenizers

today's most used tokenizers, with a focus on performance and versatility

v0.23.1 2.0M #bpe #nlp #hugging-face #tokenize #tokenizer
xmlparser

Pull-based, zero-allocation XML parser

v0.13.6 5.9M #tokenize #xml #tokenizer
markup5ever

Common code for xml5ever and html5ever

v0.39.0 6.3M #html5ever #whatwg #xml-parser #serialization #tokenize #xml5ever #html-parser #forms #html5 #tree-builder
markdown

CommonMark compliant markdown parser in Rust with ASTs and extensions

v1.0.0 461K #markdown-parser #render-markdown #common-mark #tokenize
svgtypes

SVG types parser

v0.16.1 1.9M #svg-parser #tokenize #svg
charabia

detect the language, tokenize the text and normalize the tokens

v0.9.9 69K #tokenize #normalize #tokenizer
sqlite3-parser

SQL parser (as understood by SQLite)

v0.16.0 157K #tokenize #sql-parser #sql #tokenizer #parser
sentencepiece

Binding for the sentencepiece tokenizer

v0.13.1 53K #tokenize #text #bindings #text-tokenizer #binds #unsupervised
htmlparser

Pull-based, zero-allocation HTML parser

v0.2.1 140K #tokenize #tokenizer
html5gum

A WHATWG-compliant HTML5 tokenizer and tag soup parser

v0.8.3 34K #html-parser #tokenize #whatwg #html5 #html #tokenizer
azul-simplecss

A very simple CSS 2.1 tokenizer with CSS nesting support

v0.2.0 21K #tokenize #css-parser #css #nested #tokenizer
momoa

A JSON parsing library suitable for static analysis

v3.2.5 500 #ast #json-parser #static-analysis #tokenize
cargo-context-mcp

Model Context Protocol server exposing cargo-context as tools/resources/prompts

v0.4.1 #cargo-context #model-context-protocol #mcp #cargo-workspace #tokenize #mcp-server #scrub #prompt-engineering #context-engineering #token-budget
textprep

Text preprocessing primitives: normalization, tokenization, and fast keyword matching

v0.1.6 1.6K #tokenize #aho-corasick #normalization
cargo-context-cli

CLI front-end for cargo-context — the cargo context subcommand

v0.4.1 #cargo-subcommand #tokenize #scrub #pack #profile #cargo-cli #token-budget #cargo-workspace #cargo-metadata #mcp
vaporetto

pointwise prediction based tokenizer

v0.6.5 13K #japanese #tokenize #analyzer #morphological #tokenizer
formualizer-parse

High-performance Excel/OpenFormula tokenizer + parser with a stable AST surface

v2.0.0 280 #spreadsheet #tokenize #excel-formula #formula-parser #excel
lindera-tantivy

Lindera Tokenizer for Tantivy

v2.0.0 7.8K #tokenize #lindera #tantivy #tokenizer
vibrato-rkyv

Vibrato: viterbi-based accelerated tokenizer with rkyv support for fast dictionary loading

v0.7.8 #tokenize #japanese #morphological #analyzer #tokenizer
harn-lexer

Tokenizer with span tracking for the Harn programming language

v0.8.22 1.0K #artificial-intelligence #harn #tokenize #spans #acp #programming-language #llm
splintr

Fast Rust tokenizer (BPE + SentencePiece + WordPiece) with Python bindings

v0.9.1 300 #tokenize #sentence-piece #llm #word-piece #bpe #tokenizer
octofhir-fhirpath-parser

Parser and tokenizer for FHIRPath expressions

v0.4.20 1.0K #tokenize #parser #fhir #tokenizer #fhirpath
styx-tokenizer

Tokenizer for the Styx configuration language

v3.0.1 280 #tokenize #styx #configuration-language
chaptr

Filename tokenizer for manga, manhwa, manhua, and light novels

v1.5.0 #manga #tokenize #filename #lightnovel #tokenizer
kiwi-rs

Ergonomic Rust bindings for the Kiwi Korean morphological analyzer C API

v0.1.4 130 #tokenize #nlp #korean #morphology #api-bindings
tokstream-cli

CLI token stream simulator using Hugging Face tokenizers

v0.1.2 #tokenize #streaming #cli #tokenizer
alyze

High-performance text analysis for full-text search

v0.1.3 #tokenize #nlp #analysis #unicode-segmentation #unicode #tokenizer
vibrato

viterbi-based accelerated tokenizer

v0.5.2 4.3K #tokenize #japanese #tokenizer
huggingface/tokenizers-python

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

GitHub 0.23.2-dev.0 #tokenize #bpe #python #language-model #byte-level #bert #nlp #stub #pad
fsqlite-ext-fts5

FTS5 full-text search extension

v0.1.3 2.3K #full-text-search #fts5 #extension #inverted-index #tokenize #virtual-table #ascii #bm25 #snippets #trigram
kiri-engine

Core Rust engine for Kiri Japanese morphological analyzer

v0.2.0 #dictionary #japanese #morphological #kiri #tokenize #sudachi #part-of-speech #forms #tokenize-text #mmap
unscanny

Painless string scanning

v0.1.0 1.1M #tokenize #scanning #tokenizing
sentencepiece-rs

Rust runtime reimplementation of SentencePiece model loading, normalization, encoding, and decoding

v0.2.1 #bpe #nlp #unigram #tokenize #tokenizer
gaze-mcp-core

Transport-free MCP-shaped chokepoint runtime for Gaze. Enforces redact→manifest→return ordering at the type level.

v0.9.0 #mcp #redaction #pii #tokenize #agent
naturallanguage

Safe Rust bindings for Apple's NaturalLanguage framework — language detection, tokenization, tagging, embeddings, gazetteers, and custom models on macOS

v0.3.0 #nlp #tokenize #macos #api-bindings
toktrie_hf_tokenizers

HuggingFace tokenizers library support for toktrie and llguidance

v1.7.5 37K #tokenize #structured-output #hugging-face #llguidance #toktrie #byte-level #json-schema #context-free-grammar #llama-cpp
tokie

Blazingly fast tokenizer - 50x faster tokenization, 10x smaller model files, 100% accurate drop-in replacement for HuggingFace

v0.0.9 #bpe #nlp #transformer #tokenize #tokenizer #word-piece
segtok

Sentence segmentation and word tokenization tools

v0.1.5 154K #tokenize #split #tokenizer #word
pinyinchch

A utility library for converting pinyin to Chinese characters (hanzi)

v0.3.0 #chinese #pinyin #tokenize #hanzi
rwkv-tokenizer

A fast RWKV Tokenizer

v0.9.1 900 #rwkv #tokenize #model #testing #world
pred-recdec

Predicated Recursive Descent Parsing with BNF and impure hooks

v0.2.1 #ast #recursion-descent-parser #grammar #bnf #tokenize #recursive-descent #regex #ll-parser #token-stream #pred
miktik

A unified, multi-backend tokenizer library for LLMs

v0.2.0 460 #tiktoken #hugging-face #llm #rust #tokenize #tokenizer
agent-shell-parser

Shared parsing substrate for agent hook binaries — JSON input, shell tokenization

v0.4.2 #shell-parser #hook #git #tokenize #agent #json-input #substrate #productivity #git-push
fsqlite-ext-icu

ICU collation extension

v0.1.3 2.2K #collation #tokenize #icu #extension #locale-aware #cjk #turkish #sigma #dotted #edge-cases
toak-ocr

OCR engine with Apple Vision framework support for macOS

v6.0.3 #macos-framework #ocr-engine #tokenize #toak #prompt #markdown #vision-framework #git #sensitive-information #nodejs
jayce

tokenizer 🌌

v12.1.0 1.1K #tokenize #duo #sync #generic-simd #default #once-lock
language-tokenizer

Text tokenizer for linguistic purposes, such as text matching. Supports more than 40 languages, including English, French, Russian, Japanese, Thai etc.

v0.3.0 350 #tokenize #language #text-tokenizer #tokenizer
tergo-tokenizer

R language tokenizer

v0.2.5 300 #tokenize #formatting #tergo #language #line #latin #aqua #code-formatter
paltoquet

rule-based general-purpose tokenizers

v0.12.0 800 #tokenize #rule-based
token-count

Count tokens for LLM models using exact tokenization

v0.4.0 #tokenize #llm #gpt #cli #tokenizer
unobtanium-segmenter

A text segmentation toolbox for search applications inspired by charabia and tantivy

v0.5.2 #tokenize #language #tokenizer
cang-jie

A Chinese tokenizer for tantivy

v0.19.0 220 #tokenize #tantivy #chinese #search
sqlite-simple-tokenizer

A run-time loadable extension for SQLite FTS5 that supports Chinese and pinyin word segmentation and search

v0.6.0 #sqlite-extension #tokenize #pinyin #chinese #sqlite
lexers

Tools for tokenizing and scanning

v0.1.4 1.4K #lexer-tokenizer #ebnf #lexer #tokenize #tokenizer
tentoku

Japanese text tokenizer with deinflection support

v0.1.2 #sqlite #tokenize #text-tokenizer #japanese #word-search #deinflection #database #forms #text-tokenization #past
libsql-sqlite3-parser

SQL parser (as understood by SQLite) (libsql fork)

v0.13.0 170K #tokenize #sql #sql-parser #parser #tokenizer
europa

A lightweight AI utilities library for Rust

v0.0.3 #artificial-intelligence #vector-embedding #vector-search #vector-similarity #tokenize #text-embedding #euclidean-distance #semantic-search #cosine-similarity #vector-database
rag_engine

Core Rust RAG engine with optional Flutter Rust Bridge integration

v0.8.1 #vector-search #rag #bridge #tokenize #hnsw #flutter #document-parser #hybrid-search #embedding #bm25
tokmat

Standalone high-performance Canadian address parsing engine core

v0.2.0 #tokenize #address #address-parser #regex #canada #tokenizer
rust_tokenizers

High performance tokenizers for Rust

v8.1.1 5.2K #tokenize #machine-learning #tokenizer
mecab-ko

Korean morphological analyzer - a pure Rust implementation of MeCab-Ko

v0.7.2 #tokenize #nlp #korean #mecab #morphology #tokenizer
scanlex

lexical scanner for parsing text into tokens

v0.1.4 505K #tokenize #input #scan #tokenize-text
kiri-native

Native Rust accelerator for Kiri Japanese morphological analyzer

v0.2.0 #dictionary #japanese #tokenize #kiri #morphological #forms #accelerator #mmap #text-normalization #elixir
tantivy-tokenizer-api

Tokenizer API of tantivy

v0.7.0 906K #tantivy #tokenize #api #tokenizer-in-charge #text
kiri-yaiba

kiri-刃: Standalone Rust Japanese morphological tokenizer

v0.1.0 #dictionary #tokenize #japanese #kiri #morphological #forms #morpheme #json-output #kanji #katakana
neurorvq

biosignal tokenizer — inference in Rust with Burn ML

v0.1.0 180 #tokenize #eeg #ecg #bci #emg
text-tokenizer

Custom text tokenizer

v0.6.16 #tokenize #text
ah-ah-ah

VUN token! TWO tokens! Count all the beautiful tokens ... offline! Ah-ah-ah!

v0.1.0 #openai #token-counting #tokenize #claude #llm
ailed-soulsteal

Fast game data extractor — drains PGN into tokenized training data for ML models. 60K+ games/sec.

v26.3.0 #chess #tokenize #pgn #game-ai #machine-learning #tokenizer
nlpo3

Thai natural language processing library, with Python and Node bindings

v1.4.0 950 #tokenize #thai #word-segmentation #tokenizer
latex-math

Command-line interface (CLI) tool and library for parsing and representing mathematical expressions

v0.1.2 #math #equations #tokenize #latex #tokenizer
bleuscore

A fast BLEU score calculator

v0.1.6 #tokenize #bleu #deep-learning #tokenizer
tokc

Count tokens from stdin or files, with colorized tokenization display

v0.1.2 #tokenize #stdin #colorized #display #model #gpt-4o #input-file #breakdown #anthropic-claude
rust-forth-tokenizer

A Forth tokenizer written in Rust

v0.2.1 480 #tokenize #forth #tokenizer
panko

A small, zero-copy text tokenizer that crumbles strings into Words, Symbols, and Newlines

v0.1.0 #tokenize #text-tokenizer #unicode #split #nlp #unicode-text #tokenizer
syn_derive

Derive macros for syn::Parse and quote::ToTokens

v0.2.0 495K #macro-derive #to-tokens #quote #parser #tokenize #macro-parser
crossandra

A fast and simple lexical tokenization library

v1.0.1 #tokenize #regex #pattern #lexical #position #lexer
makepad-live-tokenizer

Makepad platform live DSL tokenizer

v1.0.0 440 #tokenize #dsl #makepad #live #wasm #cargo-makepad #3d-rendering
toktrie_hf_downloader

HuggingFace Hub download library support for toktrie and llguidance

v1.7.5 600 #structured-output #hugging-face #tokenize #llguidance #toktrie #context-free-grammar #json-schema #llama-cpp #llm
limbo_sqlite3_parser

SQL parser (as understood by SQLite)

v0.0.22 550 #sql-parser #tokenize #sql
rune-tokenize

Approximate token counting, budget checking, text truncation, and overlapping chunk splitting for LLM context windows

v0.1.0 #llm #tokenize #chunking
fast_html5ever

High-performance browser-grade HTML5 parser

v0.26.6 9.1K #html5ever #whatwg #html-parser #html5 #tokenize #tree-builder #serialization #browser-grade #utf-8 #forms
fsqlite-ext-fts3

FTS3/FTS4 full-text search extension

v0.1.3 #full-text-search #extension #fts3 #fts4 #tokenize
token-dict

basic dictionary based tokenization

v1.0.2 #tokenize #dictionary #text-tokenization #split
tokstream-core

Core tokenizer streaming engine for tokstream

v0.1.2 #tokenize #simulation #streaming #tokenizer
axonml-text

Text processing utilities for the Axonml ML framework

v0.6.2 #tokenize #vocab #axonml #dataset #utilities #ngrams #synthetic #language-modeling #pad #vocabulary
edifact-parser

Streaming EDIFACT tokenizer and SAX-style parser — standalone, no BO4E dependency

v0.1.60 110 #streaming-parser #edifact #tokenize #bo4e #handler
roketok

A simple way to set up a tokenizer and use it. Not recommended for simple tokenizers, as this crate adds a lot of machinery to support many, if not all, kinds of tokenizers

v0.3.1 490 #tokenize #setup #focused #kinds #not-recommended #warnings
tantivy-vibrato

A Tantivy tokenizer using Vibrato

v0.4.0 #tokenize #tantivy #vibrato
turso_sqlite3_parser

SQL parser (as understood by SQLite)

v0.2.0-pre.7 2.1K #tokenize #sql-parser #sql #parser #tokenizer
parse_light

customizable, and lightweight JSON parser with minimal dependencies

v0.1.1 #json-parser #tokenize #string #customizable #performance-optimization #developer-experience #understanding #evolving #real-world #hobby
tokenx-rs

Fast token count estimation for LLMs at 96% accuracy without a full tokenizer

v0.1.0 #llm #claude #tokenize #tokenizer
ferrum-tokenizer

Tokenization wrapper for Ferrum inference engine

v0.7.3 #llama #inference-engine #ferrum #apple-silicon #tokenize #open-ai-compatible #metal #llm #moe #production-grade
divvunspell

Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support

v1.0.0-beta.3 #spell-check #tokenize #suggestions #zhfst #archive #bhfst #memory-map #hfst-ospell #alphabet #morphological-analysis
lindera-tokenizer

A morphological analysis library

v0.32.3 9.1K #morphological-analysis #tokenize #tokenizer #analysis #morphological
punkt

sentence tokenizer

v1.0.5 #tokenize #sentence #tokenizer
fuzzy-pickles

A low-level parser of Rust source code with high-level visitor implementations

v0.1.1 #tokenize #rust #tokenizer
llama-tokenizer

Tokenizer crate for llama.rs — deterministic text-to-token conversion

v0.1.1 #tokenize #llama #deterministic #utf-8 #white-space
trustformers-tokenizers

Tokenizers for TrustformeRS

v0.1.1 #bpe #tokenize #tokenizer #nlp-processing
rusqlite-ext

Rusqlite extension for building the FTS5 tokenizer

v0.39.0 #sqlite-extension #tokenize #rusqlite #sqlite
cuttle

A large language model inference engine in Rust

v0.1.1 #inference-engine #language-model #model-inference #tokenize #qwen3 #model-download #text-generation #performance-monitoring
burn_dragon_tokenizer

Tokenizer primitives for burn_dragon

v0.21.0 #tokenize #burn-dragon #gpt-4 #python-bindings #training #byte-pair #pyo3 #experiment
flat-cli

Flatten codebases into AI-friendly format

v0.4.0 #tokenize #compression #flatten
oxidoc-text

Shared tokenization pipeline for oxidoc — used by both build-time and query-time search

v0.1.10 #oxidoc #pipeline #search #documentation #tokenize #build-time #wasm #islands #code-block #rdx
mipl

Minimal Imperative Parsing Library

v0.2.1 #tokenize #token-stream #tokenizer #parser
kanpyo-dict

Dictionary Library for Kanpyo

v0.2.0 #kanpyo #japanese #dictionary #tokenize #analyzer
indent_tokenizer

Generate tokens based on indentation

v0.4.0 #indentation #tokenize #tokenizer
toktrie_tiktoken

tiktoken (OpenAI BPE) library support for toktrie and llguidance

v1.7.4 #structured-output #tiktoken #openai #tokenize #llguidance #bpe #toktrie #json-schema #context-free-grammar #llama
lang_pt

A parser tool for generating recursive-descent top-down parsers

v0.1.2 #top-down-parser #tokenize #recursive-descent #parser
oaken

Search tree based Lexical Analysis tokenizer

v0.1.0 #lexical-analysis #search-tree #tokenize #fallback #string
tokenizer-lib

Tokenization utilities for building parsers in Rust

v1.6.0 230 #tokenize #parser #tokenization
kohaku

tokenizer

v0.1.5 #tokenize #abc #ok
vn-nlp

Vietnamese NLP library — tokenization, normalization, segmentation

v0.1.3 #nlp #tokenize #vietnamese #linguistics
philharmonic-connector-impl-embed

Embedding connector implementation for Philharmonic (embed capability: text in, vector out)

v0.1.1 #onnx #philharmonic #connector #embed #tokenize #embedding
rs-jsontxt2token

Converts the jsonl to tokens

v0.1.0 #jsonl #tokenize #japanese #json #helper
pinyinchch-type

A utility library for converting pinyin to Chinese characters (hanzi)

v0.3.0 #chinese #pinyin #tokenize #hanzi
tekken-rs

Mistral Tekken tokenizer with audio support

v0.1.1 1.1K #tokenize #mistral #nlp #artificial-intelligence #audio #tokenizer
rten-text

Text tokenization and other ML pre/post-processing functions

v0.24.0 170 #tokenize #token-id #text-tokenization #text-tokenizer #hugging-face #post-processing #bert #transformer-models #eg #canonical
voice-stt

Speech-to-text library backed by MLX, starting with Moonshine

v0.1.0 #text-to-speech #mlx #moonshine #audio-samples #tokenize #hugging-face #cache #transcribe #conv #filesystem-access
syntaqlite-buildtools

Internal codegen and build tools for syntaqlite — not intended for direct use

v0.5.9 #syntaqlite #codegen #build-tool #sqlite #tokenize #grammar #bootstrap #parser-generator #sql
wordchipper-cli-util

Wordchipper CLI

v0.9.1 #tokenize #wordchipper #lexer #cli #python-bindings #bpe #tiktoken #openai #gpt-2
ragit-korean

korean tokenizer for ragit

v0.4.5 #korean #tokenize #ragit #document #convert
neco-syntax-textmate

TextMate-style syntax loading and tokenization on top of syntect

v0.2.0 #tokenize #textmate #editor #highlight #syntax
mecab-ko-dict-sync

Korean National Institute dictionary API client for MeCab-Ko

v0.7.2 #tokenize #korean #nlp #mecab #morphology
scivex-nlp

Scivex — Tokenization, embeddings, and text processing

v0.1.1 #tokenize #scientific-computing #embedding
unitoken

Fast BPE tokenizer/trainer with a Rust core and Python bindings

v0.1.1 #bpe #nlp #tokenize #tokenizer
sqlite-jieba-tokenizer

A run-time loadable extension for SQLite FTS5 that supports Chinese and English word segmentation and search

v0.6.0 #sqlite-extension #tokenize #sqlite #chinese #english
vaporetto_rules

Rule-base filters for Vaporetto

v0.6.5 900 #japanese #tokenize #analyzer #morphological #tokenizer
mt_mtc

Tokenizer and parser for the Minot language

v0.6.0 #robotics #tokenize #minot
chinese_segmenter

Tokenize Chinese sentences using a dictionary-driven, largest-first matching approach

v1.0.1 #chinese #tokenize #hanzi #segment #localization
sentencepiece-sys

Binding for the sentencepiece tokenizer

v0.13.1 25K #bindings #tokenize #text-tokenizer #dynamic-linking #version #pkg-config #build-script
derive-finite-automaton

Procedural macro for generating finite automata

v0.3.0 100 #finite-automata #tokenize #tokenization
go-brrr

Token-efficient code analysis for LLMs - Rust implementation

v0.1.0 #tree-sitter #code-analysis #tokenize #ast #llm #ast-analysis
rkllm-sys-rs

Raw FFI bindings for librkllmrt

v0.1.0 #librkllmrt #bindings #rk3588 #artificial-intelligence #tokenize
instant-clip-tokenizer

Fast text tokenizer for the CLIP neural network

v0.1.0 6.0K #neural-network-clip #tokenize #text-tokenizer #model #instant #python-bindings
skimmer

streams reader

v0.0.3 #stream-reader #byte-stream #tokenize
use-token

Composable tokenization primitives for RustUse

v0.1.0 #tokenize #string-parser #text #string #token-parser
syntaqlite-syntax

Internal parser and tokenizer for syntaqlite — not intended for direct use

v0.5.9 #tokenize #syntaqlite #sql #sqlite #validation #sqlite-parser #language-server #re-exports
tokenizers-enfer

today's most used tokenizers, with a focus on performance and versatility

v0.21.1 #tokenize #hugging-face #word-piece #bpe #tokenizer
vn-nlp-tokenize

Vietnamese tokenization algorithms for vn-nlp

v0.1.3 #tokenize #nlp #vietnamese #linguistics
vtext

NLP with Rust

v0.2.0 #tf-idf #levenshtein #tokenize #text-processing
bpe-tokenizer

A BPE Tokenizer library

v0.1.4 150 #byte-pair #tokenize #bpe #encoding #byte
marukov

markov chain text generator

v0.0.2 #markov-chain #generator #text-generator #text-generation #tokenize #generations
rtf-grimoire

A Rich Text File (RTF) document tokenizer. Useful for writing RTF parsers.

v0.2.1 130 #rtf #rich-text #tokenize
vn-nlp-segment

Vietnamese sentence segmentation for vn-nlp

v0.1.3 #tokenize #nlp #vietnamese #linguistics
vn-nlp-normalize

Vietnamese text normalization — diacritics, unicode NFC/NFD

v0.1.3 #nlp #tokenize #vietnamese #linguistics
alith-models

Load and Download LLM Models, Metadata, and Tokenizers

v0.4.3 #gguf #model #tokenize #hugging-face #metadata #llm #embedding #artificial-intelligence
vil_tokenizer

VIL Tokenizer Engine — native Rust BPE tokenization for LLM token counting and text splitting

v0.4.0 #tokenize #vil #engine #llm #split #bpe #distributed-systems #byte-pair #truncation #language-framework
vaporetto_tantivy

Vaporetto Tokenizer for Tantivy

v0.24.0 950 #tokenize #tantivy #japanese
tokenizer

Thai text tokenizer

v0.1.2 #tokenize #localization #thai #text-tokenizer #tokeniser
palate_polyglot_tokenizer

A generic programming language tokenizer

v0.2.1 #tokenize #generics #polyglot #programming-language #line-comment #block-comment

ld-lucivy-tokenizer-api

Tokenizer API of lucivy

v0.27.0 #tokenize #lucivy #api #tokenizer-in-charge #indexing
punkt_n

Punkt sentence tokenizer

v1.0.5 #tokenize #sentence #punkt #tokenizer
mako

main Sidekick AI data processing library

v0.3.0 #artificial-intelligence #data-processing #tokenize #data-loader #machine-learning #dataflow #sidekick #tokenized
s-expression

parser

v0.2.0 #s-expr #expression-parser #zero-copy-parser #tokenize #borrowing #preallocated #numbers-parser #parser-compiler #performance-optimization #interpreter
wordpieces

Split tokens into word pieces

v0.6.1 110 #word-piece #tokenize #piece #wordpiece #word
sixel-tokenizer

A tokenizer for serialized Sixel bytes

v0.1.0 11K #sixel #tokenize #byte #serialization #events #coordinate-system
thfst-tools

Support tools for DivvunSpell - convert ZHFST files to BHFST

v1.0.0-beta.1 #spell-check #morphological-analysis #zhfst #tokenize #bhfst #divvunspell #hfst-ospell #fst #suggestions
sentencepiece-model

SentencePiece model parser generated from the SentencePiece protobuf definition

v0.1.4 16K #sentence-piece #tokenize #machine-learning
ellie_tokenizer

Tokenizer for ellie language

v0.7.3 340 #tokenize #ellie #embedded #item #position
alpino-tokenizer

Wrapper around the Alpino tokenizer for Dutch

v0.4.0 #tokenize #finite-state-transducer #dutch #alpino #principles #text-tokenizer
biors-core

Core biological sequence validation and protein tokenization contracts for bio-rs

v0.37.3 #bioinformatics #tokenize #bioinformatics-sequence
simple-tokenizer

A tiny no_std tokenizer with line & column tracking

v0.4.2 250 #tokenize #no-std #column
syntaxdot-tokenizers

Subword tokenizers

v0.5.0 #syntax-dot #tokenize #sentence-piece #labeling #bert #subword #lemmatization #biaffine-parser #word-piece #morphology
procedural-masquarade

Incorrect spelling for procedural-masquerade

v0.2.0 #css #css-parser #tokenize #level #detect #spelling #incorrect #character-encoding #syntax-tree
gtars-tokenizers

Genomic region tokenizers for machine learning in Rust

v0.5.2 #tokenize #machine-learning #genomics #gtars #region #overlap #genomic-region #genomic-data #transformer-models
tele_tokenizer

A CSS tokenizer

v0.2.0 #tokenize #css #telecss #tokenizer
divvunspell-bin

Spellchecker for ZHFST/BHFST spellers, with case handling and tokenization support

v1.0.0 #spell-check #tokenize #zhfst #bhfst #case
tokengeex

efficient tokenizer for code based on UnigramLM and TokenMonster

v1.1.0 900 #tokenize #llm #codegeex #tokenizer
data_vault

Data Vault is a modular, pragmatic, credit card vault for Rust

v0.3.4 #credit-card #encryption #credits #vault #tokenize #aes-gcm-siv #postgresql #blake3 #redis #encryption-key
tiniestsegmenter

Compact Japanese segmenter

v0.3.0 390 #japanese #tokenize #ngrams
uscan

A universal source code scanner

v0.1.3 #tokenize #compiler #tokenizer
sentence

tokenizes English language sentences for use in TTS applications

v0.0.2 #tokenize #text-to-speech #english
toresy

term rewriting system based on tokenization

v0.5.0 #tokenize #system #rewriting #rewriting-rules #input #text-output
izihawa-tantivy-tokenizer-api

Tokenizer API of tantivy

v0.25.0 500 #tokenize #tantivy #full-text-search #tokenizer-in-charge #text-indexing #search-engine
earl-lang-syntax

tokenizer and parser for the language Earl

v1.0.0 #tokenize #syntax #s-expr #earl #language-syntax #multi-line #syntax-parser
sqlite-charabia-tokenizer

A run-time loadable extension for SQLite FTS5 that supports Chinese and English word segmentation and search

v0.5.0 #sqlite-extension #tokenize #sqlite #charabia
tinytoken

tokenizing text into words, numbers, symbols, and more, with customizable parsing options

v0.1.4 130 #tokenize #numbers #text-input #tokenizer
caddyfile

working with Caddy's Caddyfile format

v0.1.1 750 #caddy #tokenize #format #testing
aleph-alpha-tokenizer

A fast implementation of a wordpiece-inspired tokenizer

v0.3.1 #tokenize #aleph-alpha #hugging-face #tokenizer
cssparser-macros

Procedural macros for cssparser

v0.7.0 3.6M #css-parser #proc-macro #tokenize #byte #level
strizer

minimal and fast library for text tokenization

v0.1.0 #text-tokenization #tokenize #string-tokenizer
specmc-base

common code for parsing Minecraft specification

v0.1.11 490 #minecraft #specification #parser #tokenize #identifier
colorblast

Syntax highlighting library for various programming languages, markup languages and various other formats

v0.0.3 #syntax-highlighting #tokenize #highlighter
castle_tokenizer

Castle Tokenizer: tokenizer

v0.20.2 180 #tokenize #castle
json-parser

JSON parser

v1.0.2 #tokenize #json #tokenizer
blingfire

Wrapper for the BlingFire tokenization library

v1.0.0 2.4K #tokenize #machine-learning #tokenizer
indentation_flattener

From indented input, generate plain output with indentation PUSH and POP codes

v0.1.0 #indentation #tokenize #parser
nipah_tokenizer

A powerful yet simple text tokenizer for your everyday needs!

v0.1.0 #tokenize #text-tokenizer #nlp #tokenizer
gtokenizers

tokenizing genomic data with an emphasis on region set data

v0.0.18 #genomics #genomic-data #tokenize #region #machine-learning #emphasis
regex-bnf

A deterministic parser for a BNF inspired syntax with regular expressions

v0.1.2 200 #regex #bnf #syntax #tokenize #syntax-parser #grammar #grammar-parser #csv #csv-parser
basic_lexer

Basic lexical analyzer for parsing and compiling

v0.2.1 #tokenize #lexical-analysis #white-space #tokenizer
neca-cmd

command tokenizer used by my Twitch chat bot

v0.3.0 250 #tokenize #chat-bot #twitch
xtoken

Iterator based no_std XML Tokenizer using memchr

v0.1.1 #tokenize #iterator #xml #memchr #byte-slice
mecab-ko-core

Core engine for Korean morphological analysis - Lattice, Viterbi, tokenizer

v0.7.2 #tokenize #viterbi #nlp #korean #morphology #tokenizer
boost_tokenizer

Boost C++ library boost_tokenizer packaged using Zanbil

v0.1.0 #tokenize #boost #zanbil #packaged #io-stream #badge
tokenmonster

Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k-base approximator)

v0.1.0 #tokenize #tiktoken #nlp #tokenizer
morsels_lang_ascii

Basic ASCII tokenizer for morsels

v0.7.3 #ascii #language #morsels #tokenize #tokenizer-for-morsels
tokeneer

tokenizer crate

v0.1.0 340 #tokenize #bpe #tokenizer
regex-tokenizer

A regex tokenizer

v0.1.1 #tokenize #regex #tokenizer
tinysegmenter

Compact Japanese tokenizer

v0.1.1 2.5K #tokenize #japanese #compact
pretok

A string pre-tokenizer for C-like syntaxes

v0.1.0 #lexer-tokenizer #lexer #text #tokenize #tokenizer
alpino-tokenize

Wrapper around the Alpino tokenizer for Dutch

v0.4.0 #tokenize #finite-state-transducer #alpino-tokenizer #dutch #command-line-tool
morsels_lang_chinese

Chinese tokenizer for morsels

v0.7.3 100 #chinese #morsels #language #tokenize #tokenizer-for-morsels
text-scanner

A UTF-8 char-oriented, zero-copy, text and code scanning library

v0.0.3 270 #lexer #streaming-parser #tokenize
quote-data

A tokenization library for Rust

v1.0.0 #proc-macro #tokenize #macro-derive #struct #quote
tokenate

do some grunt work of writing a tokenizer

v0.1.0 #tokenize #inner #parse
ccgen

generate manually maintained C (and C++) headers

v0.2.0 #header #generate #tok #generator #tokenize

Search powered by tantivy. The index is a combination of multiple data sources and heuristics, not just pure crate metadata.
