pretok

Universal pre-token language adaptation layer for text-based LLMs.

pretok lets a Large Language Model accept input in any human language by automatically translating the input text into a language the model supports. Translation happens before tokenization and requires no changes to the model or tokenizer.

✨ Features

Model-Agnostic: Works with any text-based LLM (local, remote, open-source, proprietary)
Pre-Token Boundary: All transformations occur on raw text before tokenization
Prompt Structure Preservation: Role markers, delimiters, code blocks, and control tokens are preserved
Flexible Translation: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
Pluggable Backends: Support for multiple detection and translation engines
Explicit Capability Contracts: Models declare their supported languages

🚀 Installation

pip install pretok

Or with uv:

uv add pretok

Optional Dependencies

# Language detection
pip install pretok[fasttext]      # FastText (high accuracy)
pip install pretok[langdetect]    # langdetect (pure Python)

# Translation backends
pip install pretok[nllb]          # Meta's NLLB model (local)
pip install pretok[openai]        # OpenAI API

# All features
pip install pretok[all]
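
Because these backends are optional, it can help to check which ones are importable before configuring a pipeline. The sketch below uses only the standard library. The mapping from extras to top-level module names (for example `nllb` to `transformers`) is an assumption about what each extra installs, not something this README states, and `available_extras` is a hypothetical helper rather than part of pretok.

```python
from importlib.util import find_spec

# Assumed mapping from pretok extras to the top-level module each one installs.
EXTRA_MODULES = {
    "fasttext": "fasttext",
    "langdetect": "langdetect",
    "nllb": "transformers",
    "openai": "openai",
}


def available_extras(modules=EXTRA_MODULES):
    """Return the extras whose assumed module can be imported, sorted by name."""
    return sorted(
        name for name, module in modules.items() if find_spec(module) is not None
    )


print(available_extras())
```

An empty list means none of the assumed modules is installed in the current environment.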

📖 Quick Start

from pretok import Pretok, create_pretok

# Create with default settings
pretok = Pretok(target_language="en")

# Process text
result = pretok.process("Bonjour, comment ça va ?")

print(result.processed_text)  # "Hello, how are you?"
print(result.was_modified)    # True

With Model-Specific Optimization

from pretok import create_pretok

# Auto-detect the optimal target language from the model's declared capabilities
pretok = create_pretok(model_id="gpt-4")     # Uses English
pretok = create_pretok(model_id="qwen-7b")   # Uses Chinese
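
One way to picture how a capability contract could drive this choice is sketched below. It is not pretok's implementation: `ModelCapability`, `REGISTRY`, and `choose_target` are hypothetical names, the language lists are illustrative, and keeping the input language when the model already supports it is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapability:
    """Hypothetical capability contract: languages a model handles, best first."""

    model_id: str
    supported_languages: tuple[str, ...]

    @property
    def primary_language(self) -> str:
        return self.supported_languages[0]


# Illustrative registry; the language lists are assumptions, not pretok's data.
REGISTRY = {
    "gpt-4": ModelCapability("gpt-4", ("en", "fr", "de")),
    "qwen-7b": ModelCapability("qwen-7b", ("zh", "en")),
}


def choose_target(model_id: str, detected: str, default: str = "en") -> str:
    """Keep a supported input language, else fall back to the model's primary one."""
    capability = REGISTRY.get(model_id)
    if capability is None:
        return default  # unknown model: use the caller's default
    if detected in capability.supported_languages:
        return detected  # no translation needed
    return capability.primary_language


print(choose_target("gpt-4", "ja"))    # en
print(choose_target("qwen-7b", "fr"))  # zh
```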

With Custom Translation Backend

from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator

# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
    base_url="https://api.openai.com/v1",  # Or OpenRouter, Ollama, vLLM
    model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)
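
The idea of a pluggable backend can be sketched with a structural interface. The `Translator` protocol and its `translate(text, source, target)` signature are assumptions made for illustration; pretok's actual base class and method names may differ. `GlossaryTranslator` and `localize` are likewise hypothetical.

```python
from __future__ import annotations

from typing import Protocol


class Translator(Protocol):
    """Hypothetical backend interface; pretok's real base class may differ."""

    def translate(self, text: str, source: str, target: str) -> str: ...


class GlossaryTranslator:
    """Toy backend that translates through a fixed phrase table."""

    def __init__(self, table: dict[tuple[str, str], dict[str, str]]):
        self.table = table

    def translate(self, text: str, source: str, target: str) -> str:
        # Unknown language pairs or phrases are returned unchanged.
        return self.table.get((source, target), {}).get(text, text)


def localize(text: str, backend: Translator, source: str, target: str) -> str:
    """Depend only on the interface, so any backend can be swapped in."""
    return backend.translate(text, source, target)


backend = GlossaryTranslator({("fr", "en"): {"Bonjour": "Hello"}})
print(localize("Bonjour", backend, "fr", "en"))  # Hello
```

Because callers rely only on the method signature, an API-backed or local-model backend could replace the toy table without changes to `localize`.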

Preserving Prompt Structure

prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of Japan?
<|im_end|>"""

result = pretok.process(prompt)
# Role markers preserved, only content translated
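
As a rough illustration of what preserving structure involves, the sketch below splits a ChatML-style prompt on its control tokens and applies a transformation to the remaining text only. The regular expression and the `translate_content` helper are assumptions made for this example and are not pretok's parser; `str.upper` stands in for a real translator.

```python
import re

# ChatML-style control tokens with an optional role name, e.g. "<|im_start|>user".
CONTROL = re.compile(r"(<\|im_start\|>\w*\n?|<\|im_end\|>\n?)")


def translate_content(prompt, translate):
    """Apply `translate` to text segments only; control tokens pass through."""
    out = []
    # Splitting on a capturing group keeps the control tokens in the result.
    for part in CONTROL.split(prompt):
        if not part or CONTROL.fullmatch(part):
            out.append(part)  # empty piece or control token: leave untouched
        else:
            out.append(translate(part))
    return "".join(out)


prompt = "<|im_start|>user\nhola\n<|im_end|>"
print(translate_content(prompt, str.upper))
```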

Configuration

Create a pretok.yaml:

version: "1.0"

pipeline:
  default_detector: langdetect
  cache_enabled: true

translation:
  llm:
    base_url: "https://api.openai.com/v1"
    model: "gpt-4o-mini"

cache:
  memory:
    max_size: 1000
    ttl: 3600

from pretok import Pretok
from pretok.config import load_config

# Load the YAML file and build the pipeline from it
config = load_config("pretok.yaml")
pretok = Pretok(config=config)
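
The `cache.memory` settings suggest a bounded in-memory cache with expiring entries. The sketch below shows one common way such a cache behaves. It assumes `ttl` is in seconds and that eviction is least-recently-used; neither is confirmed by this README, and `TTLCache` is a hypothetical class, not pretok's cache.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Illustrative LRU cache with per-entry expiry; not pretok's implementation."""

    def __init__(self, max_size=1000, ttl=3600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl
        self.clock = clock
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]  # entry is stale
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used


cache = TTLCache(max_size=1000, ttl=3600)
cache.put(("fr", "en", "Bonjour"), "Hello")
print(cache.get(("fr", "en", "Bonjour")))  # Hello
```

Keying on source language, target language, and text keeps translations for different language pairs from colliding.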

🏗️ Architecture

Input Text (any language)
        ↓
Segment Parsing (roles, code, text)
        ↓
Language Detection
        ↓
Translation Decision
        ↓
Translation (if needed)
        ↓
Prompt Reconstruction
        ↓
Tokenizer (unchanged)
        ↓
LLM Inference
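
The stages above can be sketched end to end with toy components. Everything here is a stand-in chosen for illustration: the line-based segment parser, the kana-based detector, and the lookup-table translator are assumptions, not pretok's detection or translation backends.

```python
def parse_segments(prompt):
    """Split a prompt into (kind, text) pairs; lines starting with '<|' are control."""
    return [
        ("control" if line.startswith("<|") else "text", line)
        for line in prompt.split("\n")
    ]


def detect(text):
    """Toy detector: kana characters mean Japanese, anything else counts as English."""
    return "ja" if any("\u3040" <= ch <= "\u30ff" for ch in text) else "en"


# Toy translator backed by a lookup table; a real backend would call a model or API.
PHRASES = {"こんにちは": "Hello"}


def translate(text, source, target):
    return PHRASES.get(text, text)


def run_pipeline(prompt, supported=("en",), target="en"):
    out = []
    for kind, text in parse_segments(prompt):
        if kind == "text" and text.strip():
            language = detect(text)               # language detection
            if language not in supported:         # translation decision
                text = translate(text, language, target)
        out.append(text)
    return "\n".join(out)                         # prompt reconstruction


print(run_pipeline("<|im_start|>user\nこんにちは\n<|im_end|>"))
```

The result is still raw text, so it can be passed to an unmodified tokenizer.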

📚 Documentation

🛠️ Development

# Clone the repository
git clone https://github.com/yen0304/pretok.git
cd pretok

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src/ tests/

# Run type checking
uv run mypy src/

📄 License

MIT License - see LICENSE for details.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
