tokenizer
A grammar describes the syntax of a programming language and may be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes those tokens and builds a data structure such as an abstract syntax tree (AST). The parser is concerned with context: does the sequence of tokens fit the grammar? A compiler's front end typically combines a lexer and a parser built for a specific grammar.
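To make the lexer/parser split concrete, here is a minimal sketch for a toy arithmetic grammar. The grammar and every name in it (`TokenKind`, `tokenize`, `Parser`, `evaluate_text`) are assumptions chosen for illustration and are not taken from any repository listed under this topic.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy grammar (an assumption for this sketch), in BNF-like notation:
//   expr   ::= term   { ("+" | "-") term }
//   term   ::= factor { ("*" | "/") factor }
//   factor ::= NUMBER | "(" expr ")"
enum class TokenKind { Number, Plus, Minus, Star, Slash, LParen, RParen, End };

struct Token { TokenKind kind; long value; };  // value is used only by Number

// Lexer: turns text into tokens and knows nothing about the grammar.
std::vector<Token> tokenize(const std::string& text) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isdigit(c)) {
            long value = 0;
            while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i])))
                value = value * 10 + (text[i++] - '0');
            tokens.push_back({TokenKind::Number, value});
            continue;
        }
        switch (c) {
            case '+': tokens.push_back({TokenKind::Plus, 0}); break;
            case '-': tokens.push_back({TokenKind::Minus, 0}); break;
            case '*': tokens.push_back({TokenKind::Star, 0}); break;
            case '/': tokens.push_back({TokenKind::Slash, 0}); break;
            case '(': tokens.push_back({TokenKind::LParen, 0}); break;
            case ')': tokens.push_back({TokenKind::RParen, 0}); break;
            default: throw std::runtime_error("unexpected character");
        }
        ++i;
    }
    tokens.push_back({TokenKind::End, 0});  // sentinel so the parser can always peek
    return tokens;
}

// AST node: a Number leaf, or an operator with two children.
struct Node { TokenKind op; long value; std::unique_ptr<Node> lhs, rhs; };

std::unique_ptr<Node> make_node(TokenKind op, long value,
                                std::unique_ptr<Node> lhs, std::unique_ptr<Node> rhs) {
    return std::unique_ptr<Node>(new Node{op, value, std::move(lhs), std::move(rhs)});
}

// Parser: checks that the token sequence fits the grammar and builds the AST.
class Parser {
public:
    explicit Parser(std::vector<Token> tokens) : tokens_(std::move(tokens)) {
        assert(!tokens_.empty());  // tokenize() always appends End
    }
    std::unique_ptr<Node> parse() {
        auto tree = expr();
        if (peek().kind != TokenKind::End) throw std::runtime_error("trailing tokens");
        return tree;
    }
private:
    const Token& peek() const { return tokens_[pos_]; }
    Token next() { return tokens_[pos_++]; }
    std::unique_ptr<Node> expr() {
        auto lhs = term();
        while (peek().kind == TokenKind::Plus || peek().kind == TokenKind::Minus) {
            TokenKind op = next().kind;
            auto rhs = term();
            lhs = make_node(op, 0, std::move(lhs), std::move(rhs));
        }
        return lhs;
    }
    std::unique_ptr<Node> term() {
        auto lhs = factor();
        while (peek().kind == TokenKind::Star || peek().kind == TokenKind::Slash) {
            TokenKind op = next().kind;
            auto rhs = factor();
            lhs = make_node(op, 0, std::move(lhs), std::move(rhs));
        }
        return lhs;
    }
    std::unique_ptr<Node> factor() {
        if (peek().kind == TokenKind::Number)
            return make_node(TokenKind::Number, next().value, nullptr, nullptr);
        if (peek().kind == TokenKind::LParen) {
            next();
            auto inner = expr();
            if (peek().kind != TokenKind::RParen) throw std::runtime_error("expected ')'");
            next();
            return inner;
        }
        throw std::runtime_error("expected number or '('");
    }
    std::vector<Token> tokens_;
    std::size_t pos_ = 0;
};

// Walks the AST; included only to show the tree has the intended shape.
long evaluate(const Node& n) {
    switch (n.op) {
        case TokenKind::Number: return n.value;
        case TokenKind::Plus:   return evaluate(*n.lhs) + evaluate(*n.rhs);
        case TokenKind::Minus:  return evaluate(*n.lhs) - evaluate(*n.rhs);
        case TokenKind::Star:   return evaluate(*n.lhs) * evaluate(*n.rhs);
        case TokenKind::Slash: {
            long d = evaluate(*n.rhs);
            if (d == 0) throw std::runtime_error("division by zero");
            return evaluate(*n.lhs) / d;
        }
        default: throw std::logic_error("not an expression node");
    }
}

long evaluate_text(const std::string& text) {
    return evaluate(*Parser(tokenize(text)).parse());
}
```

In this sketch, `evaluate_text("1 + 2 * 3")` returns 7 rather than 9: the lexer only produces a flat token list, and it is the parser's `term` rule that groups `2 * 3` before the addition.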
Here are 88 public repositories matching this topic...
Compiler for yield, set, and for, with LFM testing.
Updated Feb 3, 2025 - C++
small project
Updated Nov 20, 2025 - C++
Modern UTF-8 aware C++ tokenizer with vocabulary support, ideal for NLP and transformer models. Header-only and zero-dependency.
Updated Aug 7, 2025 - C++
HRTDS, pronounced "Hearts", is an acronym for "Human-Readable Typed Data Serialization".
Updated Jul 17, 2025 - C++
An implementation of a C++ lexical analyzer that demonstrates how lexing works as part of a compiler.
Updated Dec 13, 2021 - C++
Custom tokenizer loosely based on Byte-Pair Encoding
Updated Nov 1, 2025 - C++
A C++ Parser Project
Updated Oct 1, 2023 - C++
Fast WordPiece and SentencePiece tokenizer built on a trie, OpenMP, SIMD, and a memory pool
Updated May 11, 2025 - C++
Fast and efficient BPE tokenizer written in C and Python for LLM training
Updated Oct 25, 2025 - C++
A very fast, low-memory C++ automaton tokenizer that breaks an input string into a list of tokens by looking at tabs, spaces, and newlines, and detects special tokens such as numbers, prices, personal names, emails, and lexemes. It lets you specify delimiters and detect special cases.
Updated Apr 4, 2023 - C++
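Uses a Gradient Boosting Decision Tree (LightGBM) with supervised learning to learn and predict word segmentation and morpheme estimation for natural language. The intended name is sango (珊瑚, "coral").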
Updated Oct 28, 2017 - C++
- Followers: 11k
- Website: github.com/topics/parsing