#tokenize

  1. tokenizers

    An implementation of today's most used tokenizers, with a focus on performance and versatility

    v0.23.1 2.0M #bpe #nlp #hugging-face #tokenize #tokenizer
  2. xmlparser

    Pull-based, zero-allocation XML parser

    v0.13.6 5.9M #tokenize #xml #tokenizer
  3. markup5ever

    Common code for xml5ever and html5ever

    v0.39.0 6.3M #html5ever #whatwg #xml-parser #serialization #tokenize #xml5ever #html-parser #forms #html5 #tree-builder
  4. markdown

    CommonMark compliant markdown parser in Rust with ASTs and extensions

    v1.0.0 461K #markdown-parser #render-markdown #common-mark #tokenize
  5. svgtypes

    SVG types parser

    v0.16.1 1.9M #svg-parser #tokenize #svg
  6. charabia

    detect the language, tokenize the text and normalize the tokens

    v0.9.9 69K #tokenize #normalize #tokenizer
  7. sqlite3-parser

    SQL parser (as understood by SQLite)

    v0.16.0 157K #tokenize #sql-parser #sql #tokenizer #parser
  8. sentencepiece

    Binding for the sentencepiece tokenizer

    v0.13.1 53K #tokenize #text #bindings #text-tokenizer #binds #unsupervised
  9. htmlparser

    Pull-based, zero-allocation HTML parser

    v0.2.1 140K #tokenize #tokenizer
  10. html5gum

    A WHATWG-compliant HTML5 tokenizer and tag soup parser

    v0.8.3 34K #html-parser #tokenize #whatwg #html5 #html #tokenizer
  11. azul-simplecss

    A very simple CSS 2.1 tokenizer with CSS nesting support

    v0.2.0 21K #tokenize #css-parser #css #nested #tokenizer
  12. momoa

    A JSON parsing library suitable for static analysis

    v3.2.5 500 #ast #json-parser #static-analysis #tokenize
  13. cargo-context-mcp

    Model Context Protocol server exposing cargo-context as tools/resources/prompts

    v0.4.1 #cargo-context #model-context-protocol #mcp #cargo-workspace #tokenize #mcp-server #scrub #prompt-engineering #context-engineering #token-budget
  14. textprep

    Text preprocessing primitives: normalization, tokenization, and fast keyword matching

    v0.1.6 1.6K #tokenize #aho-corasick #normalization
  15. cargo-context-cli

    CLI front-end for cargo-context — the cargo context subcommand

    v0.4.1 #cargo-subcommand #tokenize #scrub #pack #profile #cargo-cli #token-budget #cargo-workspace #cargo-metadata #mcp
  16. vaporetto

    pointwise prediction based tokenizer

    v0.6.5 13K #japanese #tokenize #analyzer #morphological #tokenizer
  17. formualizer-parse

    High-performance Excel/OpenFormula tokenizer + parser with a stable AST surface

    v2.0.0 280 #spreadsheet #tokenize #excel-formula #formula-parser #excel
  18. lindera-tantivy

    Lindera Tokenizer for Tantivy

    v2.0.0 7.8K #tokenize #lindera #tantivy #tokenizer
  19. vibrato-rkyv

    Vibrato: viterbi-based accelerated tokenizer with rkyv support for fast dictionary loading

    v0.7.8 #tokenize #japanese #morphological #analyzer #tokenizer
  20. harn-lexer

    Tokenizer with span tracking for the Harn programming language

    v0.8.22 1.0K #artificial-intelligence #harn #tokenize #spans #acp #programming-language #llm
  21. splintr

    Fast Rust tokenizer (BPE + SentencePiece + WordPiece) with Python bindings

    v0.9.1 300 #tokenize #sentence-piece #llm #word-piece #bpe #tokenizer
  22. octofhir-fhirpath-parser

    Parser and tokenizer for FHIRPath expressions

    v0.4.20 1.0K #tokenize #parser #fhir #tokenizer #fhirpath
  23. styx-tokenizer

    Tokenizer for the Styx configuration language

    v3.0.1 280 #tokenize #styx #configuration-language
  24. chaptr

    Filename tokenizer for manga, manhwa, manhua, and light novels

    v1.5.0 #manga #tokenize #filename #lightnovel #tokenizer
  25. kiwi-rs

    Ergonomic Rust bindings for the Kiwi Korean morphological analyzer C API

    v0.1.4 130 #tokenize #nlp #korean #morphology #api-bindings
  26. tokstream-cli

    CLI token stream simulator using Hugging Face tokenizers

    v0.1.2 #tokenize #streaming #cli #tokenizer
  27. alyze

    High-performance text analysis for full-text search

    v0.1.3 #tokenize #nlp #analysis #unicode-segmentation #unicode #tokenizer
  28. vibrato

    viterbi-based accelerated tokenizer

    v0.5.2 4.3K #tokenize #japanese #tokenizer
  29. huggingface/tokenizers-python

    💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

    GitHub 0.23.2-dev.0 #tokenize #bpe #python #language-model #byte-level #bert #nlp #stub #pad
  30. fsqlite-ext-fts5

    FTS5 full-text search extension

    v0.1.3 2.3K #full-text-search #fts5 #extension #inverted-index #tokenize #virtual-table #ascii #bm25 #snippets #trigram
  31. kiri-engine

    Core Rust engine for Kiri Japanese morphological analyzer

    v0.2.0 #dictionary #japanese #morphological #kiri #tokenize #sudachi #part-of-speech #forms #tokenize-text #mmap
  32. unscanny

    Painless string scanning

    v0.1.0 1.1M #tokenize #scanning #tokenizing
  33. sentencepiece-rs

    Rust runtime reimplementation of SentencePiece model loading, normalization, encoding, and decoding

    v0.2.1 #bpe #nlp #unigram #tokenize #tokenizer
  34. gaze-mcp-core

    Transport-free MCP-shaped chokepoint runtime for Gaze. Enforces redact→manifest→return ordering at the type level.

    v0.9.0 #mcp #redaction #pii #tokenize #agent
  35. naturallanguage

    Safe Rust bindings for Apple's NaturalLanguage framework — language detection, tokenization, tagging, embeddings, gazetteers, and custom models on macOS

    v0.3.0 #nlp #tokenize #macos #api-bindings
  36. toktrie_hf_tokenizers

    HuggingFace tokenizers library support for toktrie and llguidance

    v1.7.5 37K #tokenize #structured-output #hugging-face #llguidance #toktrie #byte-level #json-schema #context-free-grammar #llama-cpp
  37. tokie

    Blazingly fast tokenizer - 50x faster tokenization, 10x smaller model files, 100% accurate drop-in replacement for HuggingFace

    v0.0.9 #bpe #nlp #transformer #tokenize #tokenizer #word-piece
  38. segtok

    Sentence segmentation and word tokenization tools

    v0.1.5 154K #tokenize #split #tokenizer #word
  39. pinyinchch

    A utility library for converting pinyin to Chinese characters

    v0.3.0 #chinese #pinyin #tokenize #hanzi
  40. rwkv-tokenizer

    A fast RWKV Tokenizer

    v0.9.1 900 #rwkv #tokenize #model #testing #world
  41. pred-recdec

    Predicated Recursive Descent Parsing with BNF and impure hooks

    v0.2.1 #ast #recursion-descent-parser #grammar #bnf #tokenize #recursive-descent #regex #ll-parser #token-stream #pred
  42. miktik

    A unified, multi-backend tokenizer library for LLMs

    v0.2.0 460 #tiktoken #hugging-face #llm #rust #tokenize #tokenizer
  43. agent-shell-parser

    Shared parsing substrate for agent hook binaries — JSON input, shell tokenization

    v0.4.2 #shell-parser #hook #git #tokenize #agent #json-input #substrate #productivity #git-push
  44. fsqlite-ext-icu

    ICU collation extension

    v0.1.3 2.2K #collation #tokenize #icu #extension #locale-aware #cjk #turkish #sigma #dotted #edge-cases
  45. toak-ocr

    OCR engine with Apple Vision framework support for macOS

    v6.0.3 #macos-framework #ocr-engine #tokenize #toak #prompt #markdown #vision-framework #git #sensitive-information #nodejs
  46. jayce

    tokenizer 🌌

    v12.1.0 1.1K #tokenize #duo #sync #generic-simd #default #once-lock
  47. language-tokenizer

    Text tokenizer for linguistic purposes, such as text matching. Supports more than 40 languages, including English, French, Russian, Japanese, Thai etc.

    v0.3.0 350 #tokenize #language #text-tokenizer #tokenizer
  48. tergo-tokenizer

    R language tokenizer

    v0.2.5 300 #tokenize #formatting #tergo #language #line #latin #aqua #code-formatter
  49. paltoquet

    rule-based general-purpose tokenizers

    v0.12.0 800 #tokenize #rule-based
  50. token-count

    Count tokens for LLM models using exact tokenization

    v0.4.0 #tokenize #llm #gpt #cli #tokenizer
  51. unobtanium-segmenter

    A text segmentation toolbox for search applications inspired by charabia and tantivy

    v0.5.2 #tokenize #language #tokenizer
  52. cang-jie

    A Chinese tokenizer for tantivy

    v0.19.0 220 #tokenize #tantivy #chinese #search
  53. sqlite-simple-tokenizer

    A run-time loadable extension for SQLite FTS5 that supports Chinese and pinyin word segmentation and search

    v0.6.0 #sqlite-extension #tokenize #pinyin #chinese #sqlite
  54. lexers

    Tools for tokenizing and scanning

    v0.1.4 1.4K #lexer-tokenizer #ebnf #lexer #tokenize #tokenizer
  55. tentoku

    Japanese text tokenizer with deinflection support

    v0.1.2 #sqlite #tokenize #text-tokenizer #japanese #word-search #deinflection #database #forms #text-tokenization #past
  56. libsql-sqlite3-parser

    SQL parser (as understood by SQLite) (libsql fork)

    v0.13.0 170K #tokenize #sql #sql-parser #parser #tokenizer
  57. europa

    A lightweight AI utilities library for Rust

    v0.0.3 #artificial-intelligence #vector-embedding #vector-search #vector-similarity #tokenize #text-embedding #euclidean-distance #semantic-search #cosine-similarity #vector-database
  58. rag_engine

    Core Rust RAG engine with optional Flutter Rust Bridge integration

    v0.8.1 #vector-search #rag #bridge #tokenize #hnsw #flutter #document-parser #hybrid-search #embedding #bm25
  59. tokmat

    Standalone high-performance Canadian address parsing engine core

    v0.2.0 #tokenize #address #address-parser #regex #canada #tokenizer
  60. rust_tokenizers

    High performance tokenizers for Rust

    v8.1.1 5.2K #tokenize #machine-learning #tokenizer
  61. mecab-ko

    Korean morphological analyzer - a pure Rust implementation of MeCab-Ko

    v0.7.2 #tokenize #nlp #korean #mecab #morphology #tokenizer
  62. scanlex

    lexical scanner for parsing text into tokens

    v0.1.4 505K #tokenize #input #scan #tokenize-text
  63. kiri-native

    Native Rust accelerator for Kiri Japanese morphological analyzer

    v0.2.0 #dictionary #japanese #tokenize #kiri #morphological #forms #accelerator #mmap #text-normalization #elixir
  64. tantivy-tokenizer-api

    Tokenizer API of tantivy

    v0.7.0 906K #tantivy #tokenize #api #tokenizer-in-charge #text
  65. kiri-yaiba

    kiri-刃: Standalone Rust Japanese morphological tokenizer

    v0.1.0 #dictionary #tokenize #japanese #kiri #morphological #forms #morpheme #json-output #kanji #katakana
  66. neurorvq

    biosignal tokenizer — inference in Rust with Burn ML

    v0.1.0 180 #tokenize #eeg #ecg #bci #emg
  67. text-tokenizer

    Custom text tokenizer

    v0.6.16 #tokenize #text
  68. ah-ah-ah

    VUN token! TWO tokens! Count all the beautiful tokens ... offline! Ah-ah-ah!

    v0.1.0 #openai #token-counting #tokenize #claude #llm
  69. ailed-soulsteal

    Fast game data extractor — drains PGN into tokenized training data for ML models. 60K+ games/sec.

    v26.3.0 #chess #tokenize #pgn #game-ai #machine-learning #tokenizer
  70. nlpo3

    Thai natural language processing library, with Python and Node bindings

    v1.4.0 950 #tokenize #thai #word-segmentation #tokenizer
  71. latex-math

    Command-line interface (CLI) tool and library for parsing and representing mathematical expressions

    v0.1.2 #math #equations #tokenize #latex #tokenizer
  72. bleuscore

    A fast BLEU score calculator

    v0.1.6 #tokenize #bleu #deep-learning #tokenizer
  73. tokc

    Count tokens from stdin or files, with colorized tokenization display

    v0.1.2 #tokenize #stdin #colorized #display #model #gpt-4o #input-file #breakdown #anthropic-claude
  74. rust-forth-tokenizer

    A Forth tokenizer written in Rust

    v0.2.1 480 #tokenize #forth #tokenizer
  75. panko

    A small, zero-copy text tokenizer that crumbles strings into Words, Symbols, and Newlines

    v0.1.0 #tokenize #text-tokenizer #unicode #split #nlp #unicode-text #tokenizer
  76. syn_derive

    Derive macros for syn::Parse and quote::ToTokens

    v0.2.0 495K #macro-derive #to-tokens #quote #parser #tokenize #macro-parser
  77. crossandra

    A fast and simple lexical tokenization library

    v1.0.1 #tokenize #regex #pattern #lexical #position #lexer
  78. makepad-live-tokenizer

    Makepad platform live DSL tokenizer

    v1.0.0 440 #tokenize #dsl #makepad #live #wasm #cargo-makepad #3d-rendering
  79. toktrie_hf_downloader

    HuggingFace Hub download library support for toktrie and llguidance

    v1.7.5 600 #structured-output #hugging-face #tokenize #llguidance #toktrie #context-free-grammar #json-schema #llama-cpp #llm
  80. limbo_sqlite3_parser

    SQL parser (as understood by SQLite)

    v0.0.22 550 #sql-parser #tokenize #sql
  81. rune-tokenize

    Approximate token counting, budget checking, text truncation, and overlapping chunk splitting for LLM context windows

    v0.1.0 #llm #tokenize #chunking
  82. fast_html5ever

    High-performance browser-grade HTML5 parser

    v0.26.6 9.1K #html5ever #whatwg #html-parser #html5 #tokenize #tree-builder #serialization #browser-grade #utf-8 #forms
  83. fsqlite-ext-fts3

    FTS3/FTS4 full-text search extension

    v0.1.3 #full-text-search #extension #fts3 #fts4 #tokenize
  84. token-dict

    basic dictionary based tokenization

    v1.0.2 #tokenize #dictionary #text-tokenization #split
  85. tokstream-core

    Core tokenizer streaming engine for tokstream

    v0.1.2 #tokenize #simulation #streaming #tokenizer
  86. axonml-text

    Text processing utilities for the Axonml ML framework

    v0.6.2 #tokenize #vocab #axonml #dataset #utilities #ngrams #synthetic #language-modeling #pad #vocabulary
  87. edifact-parser

    Streaming EDIFACT tokenizer and SAX-style parser — standalone, no BO4E dependency

    v0.1.60 110 #streaming-parser #edifact #tokenize #bo4e #handler
  88. roketok

    A simple way to set up and use a tokenizer. Not recommended for simple tokenizers, as this crate adds a lot of machinery to support many, if not all, kinds of tokenizers

    v0.3.1 490 #tokenize #setup #focused #kinds #not-recommended #warnings
  89. tantivy-vibrato

    A Tantivy tokenizer using Vibrato

    v0.4.0 #tokenize #tantivy #vibrato
  90. turso_sqlite3_parser

    SQL parser (as understood by SQLite)

    v0.2.0-pre.7 2.1K #tokenize #sql-parser #sql #parser #tokenizer
  91. parse_light

    Customizable, lightweight JSON parser with minimal dependencies

    v0.1.1 #json-parser #tokenize #string #customizable #performance-optimization #developer-experience #understanding #evolving #real-world #hobby
  92. tokenx-rs

    Fast token count estimation for LLMs at 96% accuracy without a full tokenizer

    v0.1.0 #llm #claude #tokenize #tokenizer
  93. ferrum-tokenizer

    Tokenization wrapper for Ferrum inference engine

    v0.7.3 #llama #inference-engine #ferrum #apple-silicon #tokenize #open-ai-compatible #metal #llm #moe #production-grade
  94. divvunspell

    Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support

    v1.0.0-beta.3 #spell-check #tokenize #suggestions #zhfst #archive #bhfst #memory-map #hfst-ospell #alphabet #morphological-analysis
  95. lindera-tokenizer

    A morphological analysis library

    v0.32.3 9.1K #morphological-analysis #tokenize #tokenizer #analysis #morphological
  96. punkt

    sentence tokenizer

    v1.0.5 #tokenize #sentence #tokenizer
  97. fuzzy-pickles

    A low-level parser of Rust source code with high-level visitor implementations

    v0.1.1 #tokenize #rust #tokenizer
  98. llama-tokenizer

    Tokenizer crate for llama.rs — deterministic text-to-token conversion

    v0.1.1 #tokenize #llama #deterministic #utf-8 #white-space
  99. trustformers-tokenizers

    Tokenizers for TrustformeRS

    v0.1.1 #bpe #tokenize #tokenizer #nlp-processing
  100. rusqlite-ext

    Rusqlite extension for building the FTS5 tokenizer

    v0.39.0 #sqlite-extension #tokenize #rusqlite #sqlite
  101. cuttle

    A large language model inference engine in Rust

    v0.1.1 #inference-engine #language-model #model-inference #tokenize #qwen3 #model-download #text-generation #performance-monitoring
  102. burn_dragon_tokenizer

    Tokenizer primitives for burn_dragon

    v0.21.0 #tokenize #burn-dragon #gpt-4 #python-bindings #training #byte-pair #pyo3 #experiment
  103. flat-cli

    Flatten codebases into AI-friendly format

    v0.4.0 #tokenize #compression #flatten
  104. oxidoc-text

    Shared tokenization pipeline for oxidoc — used by both build-time and query-time search

    v0.1.10 #oxidoc #pipeline #search #documentation #tokenize #build-time #wasm #islands #code-block #rdx
  105. mipl

    Minimal Imperative Parsing Library

    v0.2.1 #tokenize #token-stream #tokenizer #parser
  106. kanpyo-dict

    Dictionary Library for Kanpyo

    v0.2.0 #kanpyo #japanese #dictionary #tokenize #analyzer
  107. indent_tokenizer

    Generate tokens based on indentation

    v0.4.0 #indentation #tokenize #tokenizer
  108. toktrie_tiktoken

    tiktoken (OpenAI BPE) library support for toktrie and llguidance

    v1.7.4 #structured-output #tiktoken #openai #tokenize #llguidance #bpe #toktrie #json-schema #context-free-grammar #llama
  109. lang_pt

    A parser tool for generating recursive-descent, top-down parsers

    v0.1.2 #top-down-parser #tokenize #recursive-descent #parser
  110. oaken

    Search tree based Lexical Analysis tokenizer

    v0.1.0 #lexical-analysis #search-tree #tokenize #fallback #string
  111. tokenizer-lib

    Tokenization utilities for building parsers in Rust

    v1.6.0 230 #tokenize #parser #tokenization
  112. kohaku

    tokenizer

    v0.1.5 #tokenize #abc #ok
  113. vn-nlp

    Vietnamese NLP library — tokenization, normalization, segmentation

    v0.1.3 #nlp #tokenize #vietnamese #linguistics
  114. philharmonic-connector-impl-embed

    Embedding connector implementation for Philharmonic (embed capability: text in, vector out)

    v0.1.1 #onnx #philharmonic #connector #embed #tokenize #embedding
  115. rs-jsontxt2token

    Converts JSONL to tokens

    v0.1.0 #jsonl #tokenize #japanese #json #helper
  116. pinyinchch-type

    A utility library for converting pinyin to Chinese characters

    v0.3.0 #chinese #pinyin #tokenize #hanzi
  117. tekken-rs

    Mistral Tekken tokenizer with audio support

    v0.1.1 1.1K #tokenize #mistral #nlp #artificial-intelligence #audio #tokenizer
  118. rten-text

    Text tokenization and other ML pre/post-processing functions

    v0.24.0 170 #tokenize #token-id #text-tokenization #text-tokenizer #hugging-face #post-processing #bert #transformer-models #eg #canonical
  119. voice-stt

    Speech-to-text library backed by MLX, starting with Moonshine

    v0.1.0 #text-to-speech #mlx #moonshine #audio-samples #tokenize #hugging-face #cache #transcribe #conv #filesystem-access
  120. syntaqlite-buildtools

    Internal codegen and build tools for syntaqlite — not intended for direct use

    v0.5.9 #syntaqlite #codegen #build-tool #sqlite #tokenize #grammar #bootstrap #parser-generator #sql
  121. wordchipper-cli-util

    Wordchipper CLI

    v0.9.1 #tokenize #wordchipper #lexer #cli #python-bindings #bpe #tiktoken #openai #gpt-2
  122. ragit-korean

    korean tokenizer for ragit

    v0.4.5 #korean #tokenize #ragit #document #convert
  123. neco-syntax-textmate

    TextMate-style syntax loading and tokenization on top of syntect

    v0.2.0 #tokenize #textmate #editor #highlight #syntax
  124. mecab-ko-dict-sync

    Korean National Institute dictionary API client for MeCab-Ko

    v0.7.2 #tokenize #korean #nlp #mecab #morphology
  125. scivex-nlp

    Scivex — Tokenization, embeddings, and text processing

    v0.1.1 #tokenize #scientific-computing #embedding
  126. unitoken

    Fast BPE tokenizer/trainer with a Rust core and Python bindings

    v0.1.1 #bpe #nlp #tokenize #tokenizer
  127. sqlite-jieba-tokenizer

    A run-time loadable extension for SQLite FTS5 that supports Chinese and English word segmentation and search

    v0.6.0 #sqlite-extension #tokenize #sqlite #chinese #english
  128. vaporetto_rules

    Rule-base filters for Vaporetto

    v0.6.5 900 #japanese #tokenize #analyzer #morphological #tokenizer
  129. mt_mtc

    Tokenizer and parser for the Minot language

    v0.6.0 #robotics #tokenize #minot
  130. chinese_segmenter

    Tokenize Chinese sentences using a dictionary-driven, largest-first matching approach

    v1.0.1 #chinese #tokenize #hanzi #segment #localization
  131. sentencepiece-sys

    Binding for the sentencepiece tokenizer

    v0.13.1 25K #bindings #tokenize #text-tokenizer #dynamic-linking #version #pkg-config #build-script
  132. derive-finite-automaton

    Procedural macro for generating finite automata

    v0.3.0 100 #finite-automata #tokenize #tokenization
  133. go-brrr

    Token-efficient code analysis for LLMs - Rust implementation

    v0.1.0 #tree-sitter #code-analysis #tokenize #ast #llm #ast-analysis
  134. rkllm-sys-rs

    Raw FFI bindings for librkllmrt

    v0.1.0 #librkllmrt #bindings #rk3588 #artificial-intelligence #tokenize
  135. instant-clip-tokenizer

    Fast text tokenizer for the CLIP neural network

    v0.1.0 6.0K #neural-network-clip #tokenize #text-tokenizer #model #instant #python-bindings
  136. skimmer

    streams reader

    v0.0.3 #stream-reader #byte-stream #tokenize
  137. use-token

    Composable tokenization primitives for RustUse

    v0.1.0 #tokenize #string-parser #text #string #token-parser
  138. syntaqlite-syntax

    Internal parser and tokenizer for syntaqlite — not intended for direct use

    v0.5.9 #tokenize #syntaqlite #sql #sqlite #validation #sqlite-parser #language-server #re-exports
  139. tokenizers-enfer

    An implementation of today's most used tokenizers, with a focus on performance and versatility

    v0.21.1 #tokenize #hugging-face #word-piece #bpe #tokenizer
  140. vn-nlp-tokenize

    Vietnamese tokenization algorithms for vn-nlp

    v0.1.3 #tokenize #nlp #vietnamese #linguistics
  141. vtext

    NLP with Rust

    v0.2.0 #tf-idf #levenshtein #tokenize #text-processing
  142. bpe-tokenizer

    A BPE Tokenizer library

    v0.1.4 150 #byte-pair #tokenize #bpe #encoding #byte
  143. marukov

    markov chain text generator

    v0.0.2 #markov-chain #generator #text-generator #text-generation #tokenize #generations
  144. rtf-grimoire

    A Rich Text File (RTF) document tokenizer. Useful for writing RTF parsers.

    v0.2.1 130 #rtf #rich-text #tokenize
  145. vn-nlp-segment

    Vietnamese sentence segmentation for vn-nlp

    v0.1.3 #tokenize #nlp #vietnamese #linguistics
  146. vn-nlp-normalize

    Vietnamese text normalization — diacritics, unicode NFC/NFD

    v0.1.3 #nlp #tokenize #vietnamese #linguistics
  147. alith-models

    Load and Download LLM Models, Metadata, and Tokenizers

    v0.4.3 #gguf #model #tokenize #hugging-face #metadata #llm #embedding #artificial-intelligence
  148. vil_tokenizer

    VIL Tokenizer Engine — native Rust BPE tokenization for LLM token counting and text splitting

    v0.4.0 #tokenize #vil #engine #llm #split #bpe #distributed-systems #byte-pair #truncation #language-framework
  149. vaporetto_tantivy

    Vaporetto Tokenizer for Tantivy

    v0.24.0 950 #tokenize #tantivy #japanese
  150. tokenizer

    Thai text tokenizer

    v0.1.2 #tokenize #localization #thai #text-tokenizer #tokeniser
  151. palate_polyglot_tokenizer

    A generic programming language tokenizer

    v0.2.1 #tokenize #generics #polyglot #programming-language #line-comment #block-comment
  153. ld-lucivy-tokenizer-api

    Tokenizer API of lucivy

    v0.27.0 #tokenize #lucivy #api #tokenizer-in-charge #indexing
  154. punkt_n

    Punkt sentence tokenizer

    v1.0.5 #tokenize #sentence #punkt #tokenizer
  155. mako

    main Sidekick AI data processing library

    v0.3.0 #artificial-intelligence #data-processing #tokenize #data-loader #machine-learning #dataflow #sidekick #tokenized
  156. s-expression

    parser

    v0.2.0 #s-expr #expression-parser #zero-copy-parser #tokenize #borrowing #preallocated #numbers-parser #parser-compiler #performance-optimization #interpreter
  157. wordpieces

    Split tokens into word pieces

    v0.6.1 110 #word-piece #tokenize #piece #wordpiece #word
  158. sixel-tokenizer

    A tokenizer for serialized Sixel bytes

    v0.1.0 11K #sixel #tokenize #byte #serialization #events #coordinate-system
  159. thfst-tools

    Support tools for DivvunSpell - convert ZHFST files to BHFST

    v1.0.0-beta.1 #spell-check #morphological-analysis #zhfst #tokenize #bhfst #divvunspell #hfst-ospell #fst #suggestions
  160. sentencepiece-model

    SentencePiece model parser generated from the SentencePiece protobuf definition

    v0.1.4 16K #sentence-piece #tokenize #machine-learning
  161. ellie_tokenizer

    Tokenizer for ellie language

    v0.7.3 340 #tokenize #ellie #embedded #item #position
  162. alpino-tokenizer

    Wrapper around the Alpino tokenizer for Dutch

    v0.4.0 #tokenize #finite-state-transducer #dutch #alpino #principles #text-tokenizer
  163. biors-core

    Core biological sequence validation and protein tokenization contracts for bio-rs

    v0.37.3 #bioinformatics #tokenize #bioinformatics-sequence
  164. simple-tokenizer

    A tiny no_std tokenizer with line & column tracking

    v0.4.2 250 #tokenize #no-std #column
  165. syntaxdot-tokenizers

    Subword tokenizers

    v0.5.0 #syntax-dot #tokenize #sentence-piece #labeling #bert #subword #lemmatization #biaffine-parser #word-piece #morphology
  166. procedural-masquarade

    Incorrect spelling for procedural-masquerade

    v0.2.0 #css #css-parser #tokenize #level #detect #spelling #incorrect #character-encoding #syntax-tree
  167. gtars-tokenizers

    Genomic region tokenizers for machine learning in Rust

    v0.5.2 #tokenize #machine-learning #genomics #gtars #region #overlap #genomic-region #genomic-data #transformer-models
  168. tele_tokenizer

    A CSS tokenizer

    v0.2.0 #tokenize #css #telecss #tokenizer
  169. divvunspell-bin

    Spellchecker for ZHFST/BHFST spellers, with case handling and tokenization support

    v1.0.0 #spell-check #tokenize #zhfst #bhfst #case
  170. tokengeex

    efficient tokenizer for code based on UnigramLM and TokenMonster

    v1.1.0 900 #tokenize #llm #codegeex #tokenizer
  171. data_vault

    Data Vault is a modular, pragmatic, credit card vault for Rust

    v0.3.4 #credit-card #encryption #credits #vault #tokenize #aes-gcm-siv #postgresql #blake3 #redis #encryption-key
  172. tiniestsegmenter

    Compact Japanese segmenter

    v0.3.0 390 #japanese #tokenize #ngrams
  173. uscan

    A universal source code scanner

    v0.1.3 #tokenize #compiler #tokenizer
  174. sentence

    tokenizes English language sentences for use in TTS applications

    v0.0.2 #tokenize #text-to-speech #english
  175. toresy

    term rewriting system based on tokenization

    v0.5.0 #tokenize #system #rewriting #rewriting-rules #input #text-output
  176. izihawa-tantivy-tokenizer-api

    Tokenizer API of tantivy

    v0.25.0 500 #tokenize #tantivy #full-text-search #tokenizer-in-charge #text-indexing #search-engine
  177. earl-lang-syntax

    tokenizer and parser for the language Earl

    v1.0.0 #tokenize #syntax #s-expr #earl #language-syntax #multi-line #syntax-parser
  178. sqlite-charabia-tokenizer

    A run-time loadable extension for SQLite FTS5 that supports Chinese and English word segmentation and search

    v0.5.0 #sqlite-extension #tokenize #sqlite #charabia
  179. tinytoken

    tokenizing text into words, numbers, symbols, and more, with customizable parsing options

    v0.1.4 130 #tokenize #numbers #text-input #tokenizer
  180. caddyfile

    working with Caddy's Caddyfile format

    v0.1.1 750 #caddy #tokenize #format #testing
  181. aleph-alpha-tokenizer

    A fast implementation of a wordpiece-inspired tokenizer

    v0.3.1 #tokenize #aleph-alpha #hugging-face #tokenizer
  182. cssparser-macros

    Procedural macros for cssparser

    v0.7.0 3.6M #css-parser #proc-macro #tokenize #byte #level
  183. strizer

    minimal and fast library for text tokenization

    v0.1.0 #text-tokenization #tokenize #string-tokenizer
  184. specmc-base

    common code for parsing Minecraft specification

    v0.1.11 490 #minecraft #specification #parser #tokenize #identifier
  185. colorblast

    Syntax highlighting library for various programming languages, markup languages and various other formats

    v0.0.3 #syntax-highlighting #tokenize #highlighter
  186. castle_tokenizer

    Castle Tokenizer: tokenizer

    v0.20.2 180 #tokenize #castle
  187. json-parser

    JSON parser

    v1.0.2 #tokenize #json #tokenizer
  188. blingfire

    Wrapper for the BlingFire tokenization library

    v1.0.0 2.4K #tokenize #machine-learning #tokenizer
  189. indentation_flattener

    From indented input, generate plain output with indentation PUSH and POP codes

    v0.1.0 #indentation #tokenize #parser
  190. nipah_tokenizer

    A powerful yet simple text tokenizer for your everyday needs!

    v0.1.0 #tokenize #text-tokenizer #nlp #tokenizer
  191. gtokenizers

    tokenizing genomic data with an emphasis on region set data

    v0.0.18 #genomics #genomic-data #tokenize #region #machine-learning #emphasis
  192. regex-bnf

    A deterministic parser for a BNF inspired syntax with regular expressions

    v0.1.2 200 #regex #bnf #syntax #tokenize #syntax-parser #grammar #grammar-parser #csv #csv-parser
  193. basic_lexer

    Basic lexical analyzer for parsing and compiling

    v0.2.1 #tokenize #lexical-analysis #white-space #tokenizer
  194. neca-cmd

    command tokenizer used by my Twitch chat bot

    v0.3.0 250 #tokenize #chat-bot #twitch
  195. xtoken

    Iterator based no_std XML Tokenizer using memchr

    v0.1.1 #tokenize #iterator #xml #memchr #byte-slice
  196. mecab-ko-core

    Core engine for Korean morphological analysis - lattice, Viterbi, tokenizer

    v0.7.2 #tokenize #viterbi #nlp #korean #morphology #tokenizer
  197. boost_tokenizer

    Boost C++ library boost_tokenizer packaged using Zanbil

    v0.1.0 #tokenize #boost #zanbil #packaged #io-stream #badge
  198. tokenmonster

    Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k-base approximator)

    v0.1.0 #tokenize #tiktoken #nlp #tokenizer
  199. morsels_lang_ascii

    Basic ascii tokenizer for morsels

    v0.7.3 #ascii #language #morsels #tokenize #tokenizer-for-morsels
  200. tokeneer

    tokenizer crate

    v0.1.0 340 #tokenize #bpe #tokenizer
  201. regex-tokenizer

    A regex tokenizer

    v0.1.1 #tokenize #regex #tokenizer
  202. tinysegmenter

    Compact Japanese tokenizer

    v0.1.1 2.5K #tokenize #japanese #compact
  203. pretok

    A string pre-tokenizer for C-like syntaxes

    v0.1.0 #lexer-tokenizer #lexer #text #tokenize #tokenizer
  204. alpino-tokenize

    Wrapper around the Alpino tokenizer for Dutch

    v0.4.0 #tokenize #finite-state-transducer #alpino-tokenizer #dutch #command-line-tool
  205. morsels_lang_chinese

    Chinese tokenizer for morsels

    v0.7.3 100 #chinese #morsels #language #tokenize #tokenizer-for-morsels
  206. text-scanner

    A UTF-8 char-oriented, zero-copy, text and code scanning library

    v0.0.3 270 #lexer #streaming-parser #tokenize
  207. quote-data

    A tokenization Library for Rust

    v1.0.0 #proc-macro #tokenize #macro-derive #struct #quote
  208. tokenate

    do some grunt work of writing a tokenizer

    v0.1.0 #tokenize #inner #parse
  209. ccgen

    generate manually maintained C (and C++) headers

    v0.2.0 #header #generate #tok #generator #tokenize