This repository provides OCR models for the Sorbian languages, covering both Latin and Fraktur script. Training data will be added in the near future.
- Pretrained models for different OCR frameworks (currently Tesseract only)
- Coming soon: training data for creating new OCR models
- ✅ Models for Tesseract are available
- ⏳ Models for Calamari and Kraken will be added soon
This repository provides Tesseract OCR models for Upper Sorbian, trained and fine-tuned at the Sorbian Institute.
Both Latin and Fraktur script variants are available.
- hsb2.traineddata: Upper Sorbian (Latin script), v2. Fine-tuned from the official Latin.traineddata (tessdata). Recommended for modern Sorbian texts printed in Latin script.
- hsb_frak2.traineddata: Upper Sorbian (Fraktur script), v2. Fine-tuned from the official Fraktur.traineddata (tessdata). Recommended for historical Sorbian Fraktur prints.
Copy the models into your Tesseract tessdata directory.
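Linux (Debian/Ubuntu; the versioned directory below matches the Tesseract 4 packages, so adjust it if your installation uses a different path):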
sudo cp traineddata/*.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
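
If you cannot write to the system tessdata directory, or are unsure where it is, the models can also live in a directory of your own that you pass to Tesseract with `--tessdata-dir`. The following is a minimal sketch, not part of this repository: the helper name `install_models`, the target directory `$HOME/.local/share/tessdata`, and the file names `page.png` and `output` are assumptions chosen for illustration.

```shell
# Hypothetical helper: copy every *.traineddata file from a source
# directory into a target directory, creating the target if needed.
install_models() {
  src_dir="$1"
  dst_dir="$2"
  mkdir -p "$dst_dir" || return 1
  for model in "$src_dir"/*.traineddata; do
    # Skip the unexpanded pattern when the source directory has no models.
    [ -e "$model" ] || continue
    cp "$model" "$dst_dir"/ || return 1
  done
}

# Example invocation (all paths are assumptions):
#   install_models traineddata "$HOME/.local/share/tessdata"
#
# Check that Tesseract sees the models in that directory:
#   tesseract --tessdata-dir "$HOME/.local/share/tessdata" --list-langs
#
# Recognize a modern Latin-script page, then a Fraktur page:
#   tesseract page.png output --tessdata-dir "$HOME/.local/share/tessdata" -l hsb2
#   tesseract page.png output --tessdata-dir "$HOME/.local/share/tessdata" -l hsb_frak2
```

The value given to `-l` is the model file name without the `.traineddata` extension. After a system-wide copy as shown above, `--tessdata-dir` can be omitted.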
hsb2 + hsb_frak2 OCR Model © 2025 [Sorbian Institute] – licensed under CC BY 4.0.
Based on the official Tesseract tessdata models Latin and Fraktur,
© Google et al., licensed under the Apache License 2.0.