This repository contains methods and tools for training Tesseract OCR to recognize MICR (Magnetic Ink Character Recognition) lines on bank checks. While the focus is on MICR, the general approach can be applied to other specialized fonts and media.
The project explores two main approaches for training Tesseract:
- Using real check images
- Using synthetically generated MICR lines
Each method has its own advantages and challenges, which are detailed in their respective directories.
Located in the real/ directory, this method uses actual check images for training. Currently, this approach has yielded the best results in terms of accuracy.
Learn more about real check training
Found in the generated/ directory, this method generates artificial MICR lines for training. It is a great introduction into training data without real check images.
Learn more about generated data training
Found in the synthetic/ directory, this method also generates artificial MICR lines for training. The approaches in this section are more advanced, including how to intentionally obscure the generated MICR line to simulate the types of issues that might arise when processing checks in the real world. It has shown lower accuracy compared to using real checks in our tests.
Learn more about synthetic data training
The x9-extract/ directory contains a tool for extracting check details from X9 files, which can be used to prepare data for training the OCR system.
- Improve usability and reliability of Synthetic training to better simulate real life scenarios in order to better train the model
This document provides guidance for how YOU can collaborate with our project community to improve this technology.
Copyright 2024 Capital One
Distributed under the Apache License, Version 2.0.
SPDX-License-Identifier: Apache-2.0