pretok

Python 3.11+ · License: MIT

Universal pre-token language adaptation layer for text-based LLMs.

pretok lets any Large Language Model accept input in any human language by automatically translating the input text into a language the model supports. All of this happens before tokenization, without modifying the model or tokenizer.

✨ Features

  • Model-Agnostic: Works with any text-based LLM (local, remote, open-source, proprietary)
  • Pre-Token Boundary: All transformations occur on raw text before tokenization
  • Prompt Structure Preservation: Role markers, delimiters, code blocks, and control tokens are preserved
  • Flexible Translation: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
  • Pluggable Backends: Support for multiple detection and translation engines
  • Explicit Capability Contracts: Models declare their supported languages
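The capability-contract idea from the last bullet can be illustrated with a small sketch. This is not pretok's actual data model: `ModelCapability`, `needs_translation`, and the example values are hypothetical names invented here to show how a declared language set could drive the decision to translate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCapability:
    """Hypothetical contract: the languages a model declares it supports."""

    model_id: str
    supported_languages: frozenset[str]
    preferred_language: str


def needs_translation(detected: str, capability: ModelCapability) -> bool:
    """Translate only when the detected language is outside the declared set."""
    return detected not in capability.supported_languages


# Illustrative values only, not taken from pretok's model registry.
example = ModelCapability(
    model_id="example-model",
    supported_languages=frozenset({"en", "zh"}),
    preferred_language="en",
)

print(needs_translation("fr", example))  # True
print(needs_translation("zh", example))  # False
```

Keeping the contract explicit means the translation decision is a plain set lookup rather than a guess about what a model can handle.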

🚀 Installation

pip install pretok

Or with uv:

uv add pretok

Optional Dependencies

# Language detection
pip install pretok[fasttext]      # FastText (high accuracy)
pip install pretok[langdetect]    # langdetect (pure Python)

# Translation backends
pip install pretok[nllb]          # Meta's NLLB model (local)
pip install pretok[openai]        # OpenAI API

# All features
pip install pretok[all]

📖 Quick Start

from pretok import Pretok, create_pretok

# Create with default settings
pretok = Pretok(target_language="en")

# Process text
result = pretok.process("Bonjour, comment ça va?")

print(result.processed_text)  # "Hello, how are you?"
print(result.was_modified)    # True

With Model-Specific Optimization

# Auto-detect optimal language from model capabilities
pretok = create_pretok(model_id="gpt-4")     # Uses English
pretok = create_pretok(model_id="qwen-7b")   # Uses Chinese

With Custom Translation Backend

from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator

# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
    base_url="https://api.openai.com/v1",  # Or OpenRouter, Ollama, vLLM
    model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)
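For readers unfamiliar with what "OpenAI-compatible" means in practice, the sketch below builds (but does not send) a chat-completions request using only the standard library. It is an assumption about what a translation call could look like on the wire, not a description of how `LLMTranslator` is implemented; `build_translation_request`, the system prompt wording, the server URL, and the model name are all placeholders.

```python
import json
import urllib.request


def build_translation_request(base_url, model, text, target_language, api_key=None):
    """Build (but do not send) a chat-completions request asking for a translation."""
    payload = {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": f"Translate the user's text into {target_language}. "
                "Return only the translation.",
            },
            {"role": "user", "content": text},
        ],
        "temperature": 0,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder server URL and model name; substitute your own provider's values.
req = build_translation_request("http://localhost:8000/v1", "example-model", "Bonjour", "en")
print(req.full_url)      # http://localhost:8000/v1/chat/completions
print(req.get_method())  # POST
```

Because only the base URL and model name change between providers, the same request shape works against a hosted API or a local server.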

Preserving Prompt Structure

prompt = """<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
What is the capital of Japan?
<|im_end|>"""

result = pretok.process(prompt)
# Role markers preserved, only content translated
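One way to picture structure preservation is to split the prompt on its role markers and translate only what lies between them. The sketch below assumes ChatML-style markers like those in the prompt above; `ROLE_MARKER` and `translate_preserving_markers` are hypothetical names, and `str.upper` stands in for a real translator. pretok's own parser may work differently.

```python
import re

# Assumption: markers are "<|im_start|>role\n" and "<|im_end|>", as in ChatML-style prompts.
ROLE_MARKER = re.compile(r"(<\|im_start\|>[a-z]+\n|<\|im_end\|>)")


def translate_preserving_markers(prompt, translate):
    """Apply `translate` to content between markers; pass markers through verbatim."""
    rebuilt = []
    for part in ROLE_MARKER.split(prompt):
        if not part.strip() or ROLE_MARKER.fullmatch(part):
            rebuilt.append(part)  # marker or whitespace: keep byte-for-byte
        else:
            rebuilt.append(translate(part))
    return "".join(rebuilt)


chat = (
    "<|im_start|>system\nYou are a helpful assistant.\n<|im_end|>\n"
    "<|im_start|>user\nWhat is the capital of Japan?\n<|im_end|>"
)

# str.upper is a visible stand-in for a translation function.
print(translate_preserving_markers(chat, str.upper))
```

The capturing group in the pattern makes `re.split` return the markers alongside the content, so joining the parts reproduces the original layout exactly.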

Configuration

Create a pretok.yaml:

version: "1.0"

pipeline:
  default_detector: langdetect
  cache_enabled: true

translation:
  llm:
    base_url: "https://api.openai.com/v1"
    model: "gpt-4o-mini"

cache:
  memory:
    max_size: 1000
    ttl: 3600

Then load it in Python:

from pretok import Pretok
from pretok.config import load_config

config = load_config("pretok.yaml")
pretok = Pretok(config=config)
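The `cache.memory` settings in the YAML above suggest a size-capped cache with entry expiry. The sketch below shows one common way to build such a cache. It assumes `ttl` is measured in seconds and that eviction is least-recently-used, neither of which this README states; `TTLCache` is a hypothetical class, not pretok's cache implementation.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small in-memory cache with a size cap and per-entry expiry (illustrative only)."""

    def __init__(self, max_size=1000, ttl=3600.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl  # assumed to be seconds
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: drop it and report a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._items[key] = (self._clock() + self.ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry


cache = TTLCache(max_size=1000, ttl=3600)
cache.set(("fr", "en", "Bonjour"), "Hello")
print(cache.get(("fr", "en", "Bonjour")))  # Hello
```

Keying on source language, target language, and text means a repeated input can skip the translation backend entirely until the entry expires.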

πŸ—οΈ Architecture

Input Text (any language)
        ↓
Segment Parsing (roles, code, text)
        ↓
Language Detection
        ↓
Translation Decision
        ↓
Translation (if needed)
        ↓
Prompt Reconstruction
        ↓
Tokenizer (unchanged)
        ↓
LLM Inference
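The stages above can be sketched end to end with the detector and translator passed in as plain functions. This is a toy illustration under stated assumptions, not pretok's pipeline: `run_pipeline`, `toy_detect`, and `toy_translate` are hypothetical, and segment parsing is reduced to separating fenced code blocks from ordinary text.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block
CODE_BLOCK = re.compile(f"({FENCE}.*?{FENCE})", re.DOTALL)


def run_pipeline(text, target, supported, detect, translate):
    """Parse, detect, decide, translate, and reconstruct, all on raw text."""
    rebuilt = []
    for segment in CODE_BLOCK.split(text):  # segment parsing
        if not segment.strip() or CODE_BLOCK.fullmatch(segment):
            rebuilt.append(segment)  # code and whitespace pass through untouched
            continue
        source = detect(segment)  # language detection
        if source in supported:  # translation decision
            rebuilt.append(segment)
        else:
            rebuilt.append(translate(segment, source, target))  # translation
    return "".join(rebuilt)  # prompt reconstruction


def toy_detect(segment):
    """Stand-in detector: treats any text containing 'Bonjour' as French."""
    return "fr" if "Bonjour" in segment else "en"


def toy_translate(segment, source, target):
    """Stand-in translator: handles exactly one word."""
    return segment.replace("Bonjour", "Hello")


prompt = f"Bonjour\n{FENCE}\nprint('Bonjour')\n{FENCE}\n"
result = run_pipeline(prompt, "en", {"en"}, toy_detect, toy_translate)
print(result)
```

The text segment is translated while the string inside the code block is left alone, and the output is still raw text ready for an unmodified tokenizer.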

📚 Documentation

πŸ› οΈ Development

# Clone the repository
git clone https://github.com/yen0304/pretok.git
cd pretok

# Install dependencies
uv sync --dev

# Run tests
uv run pytest

# Run linting
uv run ruff check src/ tests/

# Run type checking
uv run mypy src/

📄 License

MIT License - see LICENSE for details.

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

PyPI: https://pypi.org/project/pretok/