Vietnamese tokenizer (Maximum Matching and CRF)
Updated Mar 1, 2017 · Python
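Maximum matching is the simpler of the two approaches named above: greedily take the longest dictionary word starting at the current position. A minimal sketch, assuming a toy dictionary of multi-syllable words and a small maximum word length (both illustrative, not the project's actual data):

```python
def max_match(text, dictionary, max_len=4):
    """Greedy longest-match-first segmentation over whitespace-split syllables."""
    syllables = text.split()
    tokens = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate window first, shrinking until a match
        # is found; a single syllable always matches as a fallback.
        for j in range(min(len(syllables), i + max_len), i, -1):
            candidate = " ".join(syllables[i:j])
            if j == i + 1 or candidate in dictionary:
                tokens.append(candidate)
                i = j
                break
    return tokens

# Toy dictionary of Vietnamese compound words (illustrative only).
words = {"học sinh", "sinh học"}
print(max_match("học sinh học sinh học", words))
# → ['học sinh', 'học sinh', 'học']
```

Greedy matching is fast but can segment ambiguous sequences wrongly, which is why the project pairs it with a CRF model.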
A grammar describes the syntax of a programming language and might be defined in Backus-Naur form (BNF). A lexer performs lexical analysis, turning text into tokens. A parser takes tokens and builds a data structure such as an abstract syntax tree (AST); the parser is concerned with context: does the sequence of tokens fit the grammar? A compiler goes further, translating source code into another language; its front end typically combines a lexer and a parser built for a specific grammar.
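The lexer-then-parser pipeline can be sketched for a toy grammar `expr := NUM ('+' NUM)*` (the token names and grammar are illustrative):

```python
import re

# Lexer: turn text into (kind, value) tokens.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")

def lex(text):
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("NUM", int(number)))
        elif op.strip():
            tokens.append(("OP", op))
    return tokens

# Parser: recursive descent for  expr := NUM ('+' NUM)*,
# building a left-leaning AST of nested tuples.
def parse(tokens):
    def num(i):
        kind, value = tokens[i]
        assert kind == "NUM", f"expected a number at token {i}"
        return value, i + 1
    node, i = num(0)
    while i < len(tokens) and tokens[i] == ("OP", "+"):
        rhs, i = num(i + 1)
        node = ("+", node, rhs)
    return node

print(lex("1 + 2 + 3"))         # → [('NUM', 1), ('OP', '+'), ('NUM', 2), ('OP', '+'), ('NUM', 3)]
print(parse(lex("1 + 2 + 3")))  # → ('+', ('+', 1, 2), 3)
```

The lexer knows nothing about structure; the parser knows nothing about characters. That separation is what lets the same token stream feed different grammars.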
Benchmark for tokenizers: a utility to compare the performance of different tokenizers across different datasets.
This Python project for a data structures and algorithms class converts infix to postfix expressions, evaluates postfix expressions, and computes infix expressions without using `eval`. It supports complex tokenization, custom exception handling, and avoids built-in stack classes.
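One standard way to do this conversion (not necessarily this project's exact method) is Dijkstra's shunting-yard algorithm; postfix evaluation then needs only a plain list as a stack:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def infix_to_postfix(tokens):
    """Shunting-yard: operators wait on a stack until precedence lets them out."""
    output, stack = [], []
    for tok in tokens:
        if tok.isdigit():
            output.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()  # drop the matching "("
        else:
            # Pop operators of higher or equal precedence (left-associative).
            while stack and stack[-1] != "(" and PREC[stack[-1]] >= PREC[tok]:
                output.append(stack.pop())
            stack.append(tok)
    while stack:
        output.append(stack.pop())
    return output

def eval_postfix(tokens):
    stack = []
    for tok in tokens:
        if tok.isdigit():
            stack.append(int(tok))
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[tok](a, b))
    return stack[0]

print(infix_to_postfix("3 + 4 * 2".split()))    # → ['3', '4', '2', '*', '+']
print(eval_postfix(["3", "4", "2", "*", "+"]))  # → 11
```

Chaining the two functions evaluates an infix expression without `eval`, which is the combination the project describes.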
Byte Pair Encoding tokenizer supporting Arabic text with full diacritical marks (تشكيل). Train, save, and deploy custom tokenizers.
Tokenizer for Bodo language
Chatbot is an application with a graphical user interface that uses various natural language processing (NLP) techniques to tokenize, stem, find stop words, and apply regular expressions to user-input text. The interface is built using Tkinter.
Data-driven text adventure game in Python using small generative text models
Impact of morphologically aware tokenization on Polish LLM performance (Master's thesis)
Tokker is a fast local-first CLI tool for tokenizing text with all the best models in one place.
In this project I used TensorFlow 2 for natural language processing, specifically predicting labels from tweets. I used Kaggle's free GPUs and datasets in this competition, and a tokenizer to prepare the text data for the TensorFlow model.
🍺 Python implementation of vgram tokenization
A Mediocre JSON parser
TF-IDF Calculation
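The computation behind that name can be sketched in a few lines of pure Python, here using relative term frequency and a log(N/df) inverse document frequency (one of several common weighting variants; the corpus is illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document: tf(t, d) * log(N / df(t))."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    # Relative term frequency times IDF, per document.
    return [{term: count / len(doc) * idf[term]
             for term, count in Counter(doc).items()}
            for doc in docs]

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
scores = tf_idf(docs)
# "the" occurs in every document, so its IDF (and hence its score) is 0.
```

Terms that appear everywhere are down-weighted to zero, while terms unique to one document get the highest weight; that is the whole point of the IDF factor.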
A personal project where I'm experimenting with building a basic Transformer-based language model from scratch.
A pure Python implementation of Byte Pair Encoding (BPE) tokenizer. Train on any text, encode/decode with saved models, and explore BPE tokenization fundamentals.
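The training loop at the heart of BPE repeatedly merges the corpus's most frequent adjacent symbol pair. A minimal sketch, assuming words arrive with precomputed frequencies (the tiny corpus here is illustrative, not this repo's API):

```python
from collections import Counter

def merge_pair(symbols, pair, merged):
    """Replace every adjacent occurrence of `pair` in `symbols` with `merged`."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(word_freqs, num_merges):
    """Learn BPE merge rules; each word is a tuple of symbols (initially chars)."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        vocab = {tuple(merge_pair(symbols, best, merged)): freq
                 for symbols, freq in vocab.items()}
    return merges

merges = train_bpe({"lower": 2, "lowest": 3, "newer": 1}, 3)
print(merges[0])  # → ('w', 'e') — the most frequent pair in this toy corpus
```

Encoding new text then replays the learned merges in order; saving a model amounts to persisting that merge list.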
Applying NLP techniques to analyze the corpus acquired from Wikitext.