OCRopus4

This is a very preliminary version of OCRopus 4. Components are:

ocrobin -- binarization (hardcoded binarizer is also available)
ocrorot -- orientation
ocroskew -- small page rotations ("skew")
ocroscale -- scale determination
ocroseg -- word, line, and block segmentation of pages
ocrorec -- text recognition from words and lines

Status is:

DL training and recognition for text recognition and page layout analysis are working
training fully works on UW3 and G1000
we have large datasets extracted from the Tobacco corpus and Google 1000 Books

Warnings:

The models are very preliminary and have only been trained on UW3.
The code needs to be refactored and cleaned up.

Getting Started

Look at g1000test/quicktest.ipynb to see all the components in action.

Other Comments

OCRopus4 provides both command line interfaces and Python APIs.
- command line APIs are provided by Typer
- command invocation generally looks like this:
  - python3 -m ocropus.ocroseg train ...
  - that is, commands are simply part of the library files
  - the toplevel ocropus4 script provides a nicer interface
All the computations are carried out in PyTorch.
OCRopus4 uses some extra external and internal libraries:
- typer for command line interfaces
- webdatset for large scale I/O
- torchmore for easy model building (shape inference and extra layers)
- tarp for fast processing of large datasets
- ocropus/slog.py for logging (into sqlite3 databases)

Directory Structure

REF - various old notebooks and code snippets kept for reference (they don't necessarily run)
TESTS - various manual tests
ocropus - most Python code (will be renamed to ocropus)
testdata - datasets used by PyTest tests
models - trained models (must download with ./run download)
prep - notebooks and scripts for data preparation (mostly UW3 examples)
*.ipynb - various scripts
ocropus4.yaml - configuration file for ocropus4 command

Commands:

./ocropus4 -- the main command for OCRopus 4
./run -- various utility commands for building and working with the repo
- clean -- remove temporary files
- cleanlogs -- clean up log files
- download -- download ./models from Google Cloud
- upload -- upload ./models to Google Cloud
- info -- get info about models from all logfiles in the current directory
- getbest -- get the best model out of a logfile
- val2model -- fix for validation data into model
- venv -- set up a virtualenv
- testline -- test line recognition
- testseg -- test segmentation
- uw3dewline -- train on pre-dewarped lines (uw3-dew)
- uw3rawline -- train on raw lines (uw3-rawlines)
- uw3words -- train on extracted words (uw3-lwords)
- uw3wseg -- train a word segmenter (uw3-wordseg-markers-masked-patches)
- uw3wwseg -- train a word segmenter (uw3-wordseg-markers-masked-patches)
- uw3lseg -- train a line segmenter (uw3-lineseg-markers-patches)
- uw3wlseg -- train a line segmenter (uw3-lineseg-markers-patches)
- uw3zones -- train a block segmenter (uw3-zoneseg)
- uw3wzones -- train a block segmenter (uw3-zoneseg)
- lab -- run jupyter lab in the environment
- build --
- test --
- help -- help message
./train -- miscellaneous training scripts

Name		Name	Last commit message	Last commit date
Latest commit History 291 Commits
OLD		OLD
REF		REF
TESTS		TESTS
docker		docker
g1000test		g1000test
models		models
ocropus		ocropus
prep		prep
testdata		testdata
training		training
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
TODO.md		TODO.md
demo.ipynb		demo.ipynb
ocropus4		ocropus4
ocropus4.yaml		ocropus4.yaml
pyproject.toml		pyproject.toml
raygoogle.yaml		raygoogle.yaml
requirements.dev.txt		requirements.dev.txt
requirements.txt		requirements.txt
run		run
setup.py		setup.py
tasks.py		tasks.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OCRopus4

Getting Started

Other Comments

Directory Structure

Commands:

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

OCRopus4

Getting Started

Other Comments

Directory Structure

Commands:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages