Universal pre-token language adaptation layer for text-based LLMs.
pretok enables any Large Language Model to receive input in any human language by automatically translating input text into a language the model supports, all before tokenization, without modifying the model or tokenizer.
- Model-Agnostic: Works with any text-based LLM (local, remote, open-source, proprietary)
- Pre-Token Boundary: All transformations occur on raw text before tokenization
- Prompt Structure Preservation: Role markers, delimiters, code blocks, and control tokens are preserved
- Flexible Translation: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
- Pluggable Backends: Support for multiple detection and translation engines
- Explicit Capability Contracts: Models declare their supported languages
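The capability-contract idea from the list above can be illustrated with a minimal sketch. The names used here (`ModelCapability`, `needs_translation`, `target_language`) and the example values are hypothetical and are not pretok's actual API; they only show how a declared language set might drive the decision to translate.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ModelCapability:
    """Hypothetical contract: the languages a model declares it supports."""
    model_id: str
    supported_languages: FrozenSet[str]
    default_language: str


def needs_translation(detected: str, capability: ModelCapability) -> bool:
    """Translate only when the detected language is outside the declared set."""
    return detected not in capability.supported_languages


def target_language(detected: str, capability: ModelCapability) -> str:
    """Keep supported input as-is; otherwise fall back to the model's default."""
    if needs_translation(detected, capability):
        return capability.default_language
    return detected


# Illustrative values only, not taken from any real model registry.
cap = ModelCapability("example-model", frozenset({"en", "zh"}), "en")
```

With this contract, French input would be routed to English, while Chinese input would pass through untouched because the model already declares support for it.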
pip install pretok

Or with uv:

uv add pretok

# Language detection
pip install pretok[fasttext] # FastText (high accuracy)
pip install pretok[langdetect] # langdetect (pure Python)
# Translation backends
pip install pretok[nllb] # Meta's NLLB model (local)
pip install pretok[openai] # OpenAI API
# All features
pip install pretok[all]

from pretok import Pretok, create_pretok
# Create with default settings
pretok = Pretok(target_language="en")
# Process text
result = pretok.process("Bonjour, comment ça va?")
print(result.processed_text) # "Hello, how are you?"
print(result.was_modified) # True

# Auto-detect optimal language from model capabilities
pretok = create_pretok(model_id="gpt-4") # Uses English
pretok = create_pretok(model_id="qwen-7b") # Uses Chinese

from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator
# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
    base_url="https://api.openai.com/v1", # Or OpenRouter, Ollama, vLLM
    model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)

prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of Japan?
<|im_end|>"""
result = pretok.process(prompt)
# Role markers preserved, only content translated

Create a pretok.yaml:
version: "1.0"
pipeline:
  default_detector: langdetect
  cache_enabled: true
translation:
  llm:
    base_url: "https://api.openai.com/v1"
    model: "gpt-4o-mini"
cache:
  memory:
    max_size: 1000
    ttl: 3600

from pretok import Pretok
from pretok.config import load_config
config = load_config("pretok.yaml")
pretok = Pretok(config=config)

Input Text (any language)
↓
Segment Parsing (roles, code, text)
↓
Language Detection
↓
Translation Decision
↓
Translation (if needed)
↓
Prompt Reconstruction
↓
Tokenizer (unchanged)
↓
LLM Inference
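The stages above, up to prompt reconstruction, can be sketched in a few lines. This is an illustration under stated assumptions, not pretok's implementation: `parse_segments` and `process` are hypothetical names, the control-token pattern assumes ChatML-style markers such as `<|im_start|>`, and the `detect` and `translate` callables stand in for whichever backends are configured. Code blocks are not handled in this sketch.

```python
import re
from typing import Callable, List, Tuple

# Assumed ChatML-style control tokens, e.g. <|im_start|> and <|im_end|>.
CONTROL_TOKEN = re.compile(r"(<\|[^|>]+\|>)")

Segment = Tuple[str, str]  # (kind, content); kind is "keep" or "text"


def parse_segments(prompt: str) -> List[Segment]:
    """Split a prompt into segments to keep verbatim and translatable text."""
    segments: List[Segment] = []
    after_start = False
    for part in CONTROL_TOKEN.split(prompt):
        if not part:
            continue
        if CONTROL_TOKEN.fullmatch(part):
            segments.append(("keep", part))
            after_start = part == "<|im_start|>"
            continue
        if after_start:
            # The first line after <|im_start|> is the role name; keep it as-is.
            role, newline, rest = part.partition("\n")
            segments.append(("keep", role + newline))
            part = rest
            after_start = False
        if part:
            segments.append(("text", part))
    return segments


def process(
    prompt: str,
    detect: Callable[[str], str],
    translate: Callable[[str, str], str],
    target: str = "en",
) -> str:
    """Translate only text segments whose detected language differs from target."""
    out = []
    for kind, content in parse_segments(prompt):
        if kind == "text" and content.strip() and detect(content) != target:
            content = translate(content, target)
        out.append(content)
    # Reconstruction is plain concatenation, so kept segments are unchanged.
    return "".join(out)
```

Because every segment is emitted in its original order and kept segments are never modified, the reconstructed prompt differs from the input only inside translated text, which is what lets the tokenizer stay unchanged.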
# Clone the repository
git clone https://github.com/yen0304/pretok.git
cd pretok
# Install dependencies
uv sync --dev
# Run tests
uv run pytest
# Run linting
uv run ruff check src/ tests/
# Run type checking
uv run mypy src/

MIT License - see LICENSE for details.
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.