Qtok is a Python-based tool designed for quality control and analysis of tokenizers used in natural language processing (NLP) tasks.
Features:
- Analyze multiple tokenizer vocabularies simultaneously
- Generate statistics on token distribution
- Produce visualizations of token characteristics
- Compare multiple tokenizers
- Analyze Unicode coverage
- Assess language-specific token distributions (Latin and Cyrillic scripts)
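The script-specific statistics mentioned above depend on deciding which script a token belongs to. Qtok's own classification logic is not described in this README, so the following is only a minimal sketch of one possible approach, using Unicode character names from the standard library. The function names `token_script` and `script_counts` and the sample vocabulary are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Iterable


def token_script(token: str) -> str:
    """Label a token 'latin', 'cyrillic', 'mixed', or 'other' (hypothetical helper).

    Only alphabetic characters are considered; digits, punctuation, and
    whitespace markers are ignored. This is an assumption for illustration,
    not Qtok's documented behavior.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            scripts.add("latin")
        elif name.startswith("CYRILLIC"):
            scripts.add("cyrillic")
        else:
            scripts.add("other")
    if not scripts:
        return "other"  # no letters at all
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed"


def script_counts(vocab: Iterable[str]) -> Counter:
    """Count how many tokens fall into each script category."""
    return Counter(token_script(token) for token in vocab)


# Made-up sample vocabulary, not taken from any real tokenizer.
counts = script_counts(["hello", "привет", "héllo", "123", "aб", "日本"])
print(counts)
```

Matching on the prefix of the Unicode character name keeps the sketch dependency-free; a production tool would more likely rely on the Unicode Script property.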
You can install Qtok using pip:
pip install qtok
Or clone the repository and install:
git clone https://github.com/nup-csai/Qtok.git
cd Qtok
pip install .
Qtok can be used as a command-line tool:
qtok -i /path/to/tokenizer1.json /path/to/tokenizer2.json ... -l label1 label2 ... -o /path/to/output/folder [--latex]
Arguments:
- -i: Paths to the tokenizer JSON files or URLs (required, multiple inputs accepted)
- -l: Labels for the tokenizers (required, must match the number of input files)
- -o: Output folder for results (required)
- --latex: Optional flag to generate LaTeX and PDF reports (default: False)
Example:
qtok -i /path/to/tokenizer1.json /path/to/tokenizer2.json -l label1 label2 -o /path/to/output/folder --latex
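When Qtok is driven from a script, it can help to check the argument rules above before launching the tool. The following is a minimal sketch that assembles the command line from Python and enforces the documented constraint that the number of labels matches the number of inputs. The helper `build_qtok_command` is hypothetical and not part of Qtok; only the flags `-i`, `-l`, `-o`, and `--latex` come from this README.

```python
import shlex
from typing import List, Sequence


def build_qtok_command(
    inputs: Sequence[str],
    labels: Sequence[str],
    output_dir: str,
    latex: bool = False,
) -> List[str]:
    """Assemble an argument list for the qtok CLI (hypothetical helper)."""
    if not inputs:
        raise ValueError("at least one tokenizer path or URL is required")
    if len(inputs) != len(labels):
        raise ValueError(
            "got {} inputs but {} labels; the counts must match".format(
                len(inputs), len(labels)
            )
        )
    cmd = ["qtok", "-i", *inputs, "-l", *labels, "-o", output_dir]
    if latex:
        cmd.append("--latex")
    return cmd


cmd = build_qtok_command(
    ["/path/to/tokenizer1.json", "/path/to/tokenizer2.json"],
    ["label1", "label2"],
    "/path/to/output/folder",
    latex=True,
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually run it (requires qtok to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.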
Qtok generates several output files:
- basic_stats.tsv and basic_stats.png: Basic statistics of the tokenizers
- unicode_stats.tsv and unicode_stats.png: Unicode coverage statistics
- latin_stats.tsv and latin_stats.png: Statistics for Latin script tokens
- cyrillic_stats.tsv and cyrillic_stats.png: Statistics for Cyrillic script tokens
- report.html: An HTML report summarizing all analyses
- report.tex and report.pdf: LaTeX and PDF versions of the report (if the --latex flag is used and pdflatex is installed)
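The TSV files can be inspected programmatically after a run. The sketch below reads whichever of the four TSV files exist in an output folder, using only the standard library. The column layout of these files is not documented here, so the code assumes only that each file is tab-separated with a header row; the helper `load_qtok_tables` is hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List

# File names taken from the output list in this README.
TSV_NAMES = [
    "basic_stats.tsv",
    "unicode_stats.tsv",
    "latin_stats.tsv",
    "cyrillic_stats.tsv",
]


def load_qtok_tables(output_dir: str) -> Dict[str, List[Dict[str, str]]]:
    """Read every expected TSV found in output_dir (hypothetical helper).

    Assumes tab-separated files with a header row; each row is returned as a
    dict keyed by whatever column names the header contains.
    """
    tables = {}
    for name in TSV_NAMES:
        path = Path(output_dir) / name
        if not path.is_file():
            continue  # skip files that were not produced
        with path.open(newline="", encoding="utf-8") as handle:
            tables[name] = list(csv.DictReader(handle, delimiter="\t"))
    return tables


tables = load_qtok_tables("/path/to/output/folder")
for name, rows in tables.items():
    print(name, "rows:", len(rows))
```

Since pandas is already among Qtok's dependencies, `pandas.read_csv(path, sep="\t")` is an equally reasonable way to load the same files.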
Requirements:
- Python 3.6+
- matplotlib
- numpy
- pandas
- requests
- tqdm
For full tables and data, please refer to the Jupyter notebook available at:
Contributions to Qtok are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Authors:
- Aleksey Komissarov
- Iaroslav Chelombitko
- Egor Safronov
For any queries, please contact ad3002@gmail.com.
- Thanks to all contributors and users of Qtok
- Special thanks to the NLP community for inspiration and support