MICR Line Training for Tesseract OCR

This repository contains methods and tools for training Tesseract OCR to recognize MICR (Magnetic Ink Character Recognition) lines on bank checks. While the focus is on MICR, the general approach can be applied to other specialized fonts and media.

Overview

The project explores two main kinds of training data:

Real check images
Artificially generated MICR lines (covered by both the generated/ and synthetic/ directories)

Each method has its own advantages and challenges, which are detailed in their respective directories.

Approaches

1. Real Check Training

Located in the real/ directory, this method uses actual check images for training. Currently, this approach has yielded the best results in terms of accuracy.
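One common way to put a number on "accuracy" when comparing approaches is the character error rate (CER): the edit distance between the OCR output and the ground-truth MICR text, divided by the ground-truth length. The sketch below is an illustration only and is not taken from the real/ directory; the function names are hypothetical and the project may measure accuracy differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(pairs) -> float:
    """pairs: iterable of (ground_truth, ocr_output) strings."""
    errors = 0
    total = 0
    for truth, predicted in pairs:
        errors += edit_distance(truth, predicted)
        total += len(truth)
    # An empty evaluation set is reported as zero error rather than dividing by zero.
    return errors / total if total else 0.0
```

A lower CER on the same held-out set of check images would indicate a better model, whichever training approach produced it.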

Learn more about real check training

2. Generated Data Training

Found in the generated/ directory, this method generates artificial MICR lines for training. It is a good introduction to building training data when no real check images are available.
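As a rough illustration of what generating an artificial MICR line can involve, the sketch below builds random ground-truth text with a routing number that passes the standard ABA checksum (weights 3, 7, 1). It is not the code in generated/. The letters standing in for the four E-13B symbols, the field lengths, and the field order are all assumptions made for this example, and the routing numbers are only checksum-valid, not real bank identifiers.

```python
import random

# Assumed placeholder letters for the four E-13B symbols; a real training
# setup may map these symbols to different characters.
TRANSIT, AMOUNT, ON_US, DASH = "A", "B", "C", "D"

ABA_WEIGHTS = [3, 7, 1, 3, 7, 1, 3, 7, 1]


def routing_number(rng: random.Random) -> str:
    """Return 9 digits whose ABA weighted sum is a multiple of 10."""
    digits = [rng.randint(0, 9) for _ in range(8)]
    partial = sum(w * d for w, d in zip(ABA_WEIGHTS, digits))
    check = (10 - partial % 10) % 10  # the ninth digit has weight 1
    return "".join(str(d) for d in digits) + str(check)


def is_valid_routing(number: str) -> bool:
    """Check the length, digits, and ABA checksum of a routing number."""
    if len(number) != 9 or not number.isdigit():
        return False
    return sum(w * int(c) for w, c in zip(ABA_WEIGHTS, number)) % 10 == 0


def micr_line(rng: random.Random) -> str:
    """Build one illustrative line: routing, account, then check number."""
    routing = routing_number(rng)
    account = "".join(str(rng.randint(0, 9)) for _ in range(rng.randint(8, 12)))
    check_no = f"{rng.randint(1, 9999):04d}"
    return f"{TRANSIT}{routing}{TRANSIT} {account}{ON_US} {check_no}"


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a data set can be regenerated
    for _ in range(3):
        print(micr_line(rng))
```

Each generated string would then be rendered in a MICR font to produce the image that pairs with it as ground truth.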

Learn more about generated data training

3. Synthetic Data Training

Found in the synthetic/ directory, this method also generates artificial MICR lines for training, but with more advanced techniques, such as intentionally obscuring the generated MICR line to simulate the problems that arise when processing real-world checks. In our tests, it has been less accurate than training on real checks.
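To make the idea of obscuring a generated MICR line concrete, here is a minimal sketch of two degradations applied to a grayscale image stored as a list of pixel rows (0 is black, 255 is white). The function names and the choice of degradations are assumptions for illustration, not the techniques used in synthetic/; a real pipeline would more likely work through an imaging library.

```python
import random


def add_speckle(image, rate, rng):
    """Return a copy in which roughly `rate` of the pixels are replaced
    by pure black or pure white, imitating scanner noise."""
    noisy = [row[:] for row in image]
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < rate:
                row[x] = rng.choice((0, 255))
    return noisy


def draw_stroke(image, y, thickness=1, value=0):
    """Return a copy with a horizontal stroke starting at row `y`,
    imitating a pen mark or signature tail crossing the MICR band."""
    marked = [row[:] for row in image]
    for yy in range(max(0, y), min(len(marked), y + thickness)):
        marked[yy] = [value] * len(marked[yy])
    return marked


if __name__ == "__main__":
    rng = random.Random(7)
    clean = [[255] * 40 for _ in range(12)]  # stand-in for a rendered line
    degraded = draw_stroke(add_speckle(clean, 0.05, rng), y=5, thickness=2)
    print(sum(px != 255 for row in degraded for px in row), "pixels changed")
```

Both functions return copies, so the clean image can be reused to produce several differently degraded variants of the same ground-truth line.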

Learn more about synthetic data training

Additional Tools

X9 Extract

The x9-extract/ directory contains a tool for extracting check details from X9 files, which can be used to prepare data for training the OCR system.
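For readers unfamiliar with X9 files, the sketch below shows one way to walk the records of such a file. It assumes the common X9.37 layout in which every record is preceded by a 4-byte big-endian length and text is EBCDIC (code page 037), with the first two characters giving the record type (for example 25 for a check detail record). This is not the x9-extract tool's implementation, the function name is hypothetical, and files from some sources use different framing or encodings.

```python
import struct


def iter_x9_records(data: bytes, encoding: str = "cp037"):
    """Yield (record_type, raw_record) for each length-prefixed record.

    Only the two-character record type is decoded, because some records
    (such as image data) carry binary payloads that are not text.
    """
    offset = 0
    while offset + 4 <= len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        record = data[offset:offset + length]
        if len(record) < length:
            raise ValueError("truncated record at byte %d" % offset)
        offset += length
        yield record[:2].decode(encoding), record


def check_detail_records(data: bytes, encoding: str = "cp037"):
    """Return the decoded text of every type 25 (check detail) record."""
    return [
        raw.decode(encoding)
        for record_type, raw in iter_x9_records(data, encoding)
        if record_type == "25"
    ]
```

Pairing the fields of each check detail record with the check image that follows it is what would turn an X9 file into labeled training samples.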

Learn more about X9 Extract

Roadmap

Improve the usability and reliability of synthetic training so that it simulates real-life scenarios more closely and produces a better-trained model.

Contributing

The contribution guide linked below explains how you can collaborate with the project community to improve this technology.

FIN-OCR Contribution

License

Distributed under the Apache License, Version 2.0.

SPDX-License-Identifier: Apache-2.0

finos/fin-ocr-train
