Sorbisches-Institut/OCR

Sorbische OCR-Trainingsdaten und Modelle

Sorbian OCR Training Data and Models

Serbske OCR-treningowe daty a modele

License: CC BY 4.0

This repository provides OCR recognition models for the Sorbian languages, covering both Latin and Fraktur script. Training data will be added in the near future.

Contents

  • Pretrained models for different OCR frameworks (currently Tesseract only)
  • Coming soon: training data for creating new OCR models

Current Status

  • ✅ Models for Tesseract are available
  • ⏳ Models for Calamari and Kraken will be added soon

Upper Sorbian Tesseract Models

This repository provides Tesseract OCR models for Upper Sorbian, trained and fine-tuned at the Sorbian Institute.
Both Latin and Fraktur script variants are available.


📦 Models

  • hsb2.traineddata
    Upper Sorbian (Latin script), v2
    Fine-tuned from the official Latin.traineddata (tessdata).
    Recommended for modern Sorbian texts (printed in Latin script).

  • hsb_frak2.traineddata
    Upper Sorbian (Fraktur script), v2
    Fine-tuned from the official Fraktur.traineddata (tessdata).
    Recommended for historical Sorbian Fraktur prints.
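
The choice between the two models can be sketched as a small shell helper. This is only an illustration, assuming the models are already in Tesseract's tessdata directory; the function names `model_for_script` and `ocr_with_script` are hypothetical and not part of this repository. Tesseract addresses a model by its file name without the `.traineddata` suffix.

```shell
# Map a script name to the matching model from this repository.
# (Hypothetical helper, not shipped with the models.)
model_for_script() {
    case "$1" in
        latin)   echo "hsb2" ;;        # modern prints in Latin script
        fraktur) echo "hsb_frak2" ;;   # historical Fraktur prints
        *)       echo "unknown script: $1" >&2; return 1 ;;
    esac
}

# Run Tesseract on one image with the model chosen for the given script.
# $1 = input image, $2 = output base name (Tesseract appends .txt),
# $3 = "latin" or "fraktur"
ocr_with_script() {
    model="$(model_for_script "$3")" || return 1
    tesseract "$1" "$2" -l "$model"
}
```

For example, `ocr_with_script scan.png scan fraktur` would write the recognized text to `scan.txt` using the Fraktur model.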


🔧 Installation

System tessdata folder

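Copy the models into your Tesseract tessdata directory. The exact path depends on the installed Tesseract version: the example below matches the Tesseract 4.x packages, while Tesseract 5 packages typically use /usr/share/tesseract-ocr/5/tessdata/ instead.

Linux (Debian/Ubuntu):

sudo cp traineddata/*.traineddata /usr/share/tesseract-ocr/4.00/tessdata/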
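
If writing to the system folder is not an option, the models can also live in a per-user directory that is passed to Tesseract with `--tessdata-dir`. The following is a minimal sketch of that alternative, not an official installation method: the target path `$HOME/.local/share/tessdata` and the function names are assumptions chosen for illustration.

```shell
# Source folder with the .traineddata files (the repository's traineddata/ folder).
MODEL_SRC="${MODEL_SRC:-traineddata}"
# Assumed per-user target directory; any writable path works.
TESSDATA_DIR="${TESSDATA_DIR:-$HOME/.local/share/tessdata}"

# Copy all models into the per-user directory (no sudo needed).
install_models() {
    mkdir -p "$TESSDATA_DIR"
    cp "$MODEL_SRC"/*.traineddata "$TESSDATA_DIR"/
}

# List the models Tesseract finds in that directory;
# hsb2 and hsb_frak2 should appear after a successful copy.
list_models() {
    tesseract --tessdata-dir "$TESSDATA_DIR" --list-langs
}
```

After `install_models`, a page could then be recognized with `tesseract --tessdata-dir "$TESSDATA_DIR" page.png page -l hsb2`.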

License & Origin

hsb2 and hsb_frak2 OCR models © 2025 Sorbian Institute, licensed under CC BY 4.0.
Based on the official Tesseract tessdata models Latin and Fraktur,
© Google et al., licensed under the Apache License 2.0.

