This project provides a tool for translating Markdown documents from one language to another using OpenAI's API. It tokenizes the input document, splits it into chunks, translates each chunk, and stitches the output back together to retain the original formatting.
- Accepts a plain text or Markdown file as input
- Tokenizes input text using tiktoken
- Splits input into chunks at multiple newlines
- Sends each chunk to OpenAI for translation
- Reconstructs the translated output with the original formatting (see the sketch after this list)
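As a rough sketch of how these steps fit together, a minimal pipeline might look like the code below. The function names (`translate_chunk`, `translate_document`), the `gpt-3.5-turbo` model choice, the prompt wording, and the `cl100k_base` encoding are illustrative assumptions, not the notebook's actual code:

```python
import tiktoken
from openai import OpenAI  # openai>=1.0-style client; the notebook may use an older interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def translate_chunk(chunk: str, input_language: str, output_language: str,
                    model: str = "gpt-3.5-turbo") -> str:
    """Send a single chunk to the chat completions endpoint for translation."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": (f"Translate the following {input_language} text into "
                         f"{output_language}. Preserve all Markdown formatting.")},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content


def translate_document(text: str, input_language: str, output_language: str,
                       split_string: str = "\n\n") -> str:
    """Tokenize, split into chunks, translate each chunk, then stitch the output together."""
    enc = tiktoken.get_encoding("cl100k_base")
    translated = []
    for chunk in text.split(split_string):
        # Counting tokens per chunk helps keep each request under the model's context limit.
        print(f"Translating a chunk of {len(enc.encode(chunk))} tokens")
        translated.append(translate_chunk(chunk, input_language, output_language))
    # Rejoining with the same separator is what preserves the original block structure.
    return split_string.join(translated)
```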
To use this translation workflow:
- Clone this repository
- Install requirements: `pip install -r requirements.txt`
- Set your OpenAI API key (see the sketch after this list)
- Run the Jupyter notebook
- Pass the file path to the `input_path` variable
- Set `input_language` and `output_language`
- Execute the notebook cells
- Translated file will be printed in the final cell
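For the API key step, one common approach is to expose the key through the `OPENAI_API_KEY` environment variable; whether the notebook reads it this way or assigns the key directly is an assumption here.

```python
import os

# Assumption: the notebook reads the key from the OPENAI_API_KEY environment variable.
# Either export it in your shell before launching Jupyter, or set it in a cell:
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, replace with your own key
```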
The main configuration options (illustrated in the example cell after this list) are:
- `input_path` - Path to the input file
- `input_language` - Source language code
- `output_language` - Target language code
- `split_string` - String used to split the input into chunks
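A configuration cell using these options might look like the following; the values are placeholders, and whether the language options expect full names or ISO codes depends on the notebook.

```python
input_path = "docs/README.md"    # example path to the plain text/Markdown file to translate
input_language = "English"       # source language (exact expected format depends on the notebook)
output_language = "Japanese"     # target language
split_string = "\n\n"            # separator used to break the input into chunks
```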
This can be used to translate plain text or Markdown documents such as:
- READMEs
- Wikis/documentation
- Articles/blog posts
- Books
Known limitations:
- Only tested with Markdown and plain text formatting
- Accuracy depends on OpenAI's translation model
- Currently only caters to OpenAI's GPT models
- Only one file can be translated at a time; there is no queueing of multiple files
- Chunks are translated one after another; segments of the translation are not processed in parallel
Key dependencies:
- tiktoken for fast encoding/tokenization
- OpenAI API for translation
License: MIT