This is a Flask-based web application that allows users to upload images for text extraction using Tesseract OCR. The application supports image preview, text extraction, download, clipboard copy, and text revision through an integrated Large Language Model (LLM) to correct OCR-induced errors and refine the extracted text.
- Single image processing for OCR.
- Image preview before processing.
- Download extracted text as a .txt file.
- Copy extracted text to the clipboard.
- LLM integration for text revision (grammar and spelling corrections).
- SymSpell integration for correcting common OCR-induced spelling mistakes (e.g., correcting "cukure" to "culture").
- File size validation (max 5MB).
- Batch processing for multiple images.
- Azure deployment for cloud-based OCR processing.
- Multi-language OCR support (future integration).
To run this application, you will need to have the following installed on your system:
Tesseract is the engine used for Optical Character Recognition (OCR). You need to install Tesseract and ensure it is accessible via the system PATH.
- Windows: Download and install Tesseract from here.
- Linux (Ubuntu):
sudo apt update
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
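Before starting the app, it can help to confirm that the tesseract binary is actually reachable on the PATH. The following is a minimal sketch using only the Python standard library; the helper name `find_executable` is illustrative and not part of this repository.

```python
import shutil
from typing import Optional


def find_executable(name: str = "tesseract") -> Optional[str]:
    """Return the full path of `name` if it is on the PATH, else None."""
    # shutil.which performs the same lookup the shell would do.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("tesseract was not found on PATH; revisit the installation steps.")
    else:
        print(f"tesseract found at: {path}")
```

If the lookup returns None on Windows, the Tesseract install directory probably still needs to be added to the PATH variable.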
Ensure Python 3.x is installed. You can download Python from the official website: Python Downloads.
For Windows users, you need to install the C++ Build Tools required by symspellpy. To install them, run the following command in PowerShell (as Administrator):
npm install --global --production windows-build-tools
Alternatively, download the build tools directly from here and install the Desktop development with C++ workload.
For multi-language support (currently disabled), download and place the necessary .traineddata files in your tessdata folder.
In addition to the basic dependencies like Flask, Pillow, and Pytesseract, the following libraries have been added to support text revision and spelling correction:
- SymSpell: A fast spelling correction library for fixing OCR errors.
- Transformers: Hugging Face library for the Large Language Model (LLM) used to improve text accuracy.
- PyTorch: Backend for running the T5 LLM model.
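To illustrate the kind of dictionary-based correction that SymSpell provides, here is a small sketch that uses `difflib` from the Python standard library as a stand-in. It is not SymSpell and not the code used by this application: the vocabulary, the function names, and the similarity cutoff are all assumptions chosen for the example.

```python
import difflib

# Hypothetical mini-vocabulary; a real setup would load a large
# frequency dictionary instead.
VOCABULARY = ["culture", "image", "text", "extract"]


def correct_word(word: str, vocabulary=VOCABULARY, cutoff: float = 0.6) -> str:
    """Return the closest vocabulary word, or the input unchanged if none is close."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text: str) -> str:
    """Correct each whitespace-separated token independently."""
    return " ".join(correct_word(token) for token in text.split())


if __name__ == "__main__":
    print(correct_text("cukure text"))
```

SymSpell itself relies on a precomputed symmetric-delete index rather than pairwise similarity scoring, which is what makes it fast on large dictionaries; the sketch only shows the input and output shape of such a correction step.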
Clone the repository and install the required dependencies:
git clone https://github.com/LEnc95/OCR.git
cd <repository-folder>
pip install -r requirements.txt
Alternatively, if requirements.txt isn't present, manually install the dependencies:
pip install flask pillow pytesseract opencv-python numpy symspellpy transformers torch sentencepiece
To generate the requirements.txt file for easier dependency management, run:
pip freeze > requirements.txt
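After installing, a quick way to see whether every dependency can be imported is to look each one up without actually importing it. This is a sketch only; `missing_modules` is an illustrative helper, and the list uses import names, which differ from the pip names for some packages (Pillow is imported as `PIL`, opencv-python as `cv2`).

```python
import importlib.util

# Import names for the packages in the pip command above.
REQUIRED_MODULES = [
    "flask", "PIL", "pytesseract", "cv2", "numpy",
    "symspellpy", "transformers", "torch", "sentencepiece",
]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```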
- Windows:
  - Go to System Properties -> Advanced -> Environment Variables.
  - Add a new system variable:
    - Variable Name: TESSDATA_PREFIX
    - Variable Value: C:\Program Files\Tesseract-OCR\ (or the path where Tesseract is installed).
- Linux: Add this to your ~/.bashrc or run it before starting the app:
  export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
- Ensure that the required language files (e.g., eng.traineddata for English) are present in the tessdata folder.
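The environment setup can be sanity-checked from Python before launching the app. The sketch below is an assumption-laden helper, not part of the repository: `has_language_data` is a made-up name, and because the Windows and Linux values above point at different directory levels, it looks for the language file both directly in the configured directory and in a tessdata subfolder.

```python
import os
from pathlib import Path
from typing import Optional


def has_language_data(lang: str = "eng", tessdata_dir: Optional[str] = None) -> bool:
    """Return True if <lang>.traineddata can be found.

    If tessdata_dir is None, the TESSDATA_PREFIX environment variable is used.
    """
    directory = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not directory:
        return False
    filename = f"{lang}.traineddata"
    base = Path(directory)
    # Accept either .../tessdata/eng.traineddata or .../<prefix>/tessdata/eng.traineddata.
    return (base / filename).is_file() or (base / "tessdata" / filename).is_file()


if __name__ == "__main__":
    if has_language_data("eng"):
        print("eng.traineddata found.")
    else:
        print("eng.traineddata not found; check TESSDATA_PREFIX.")
```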
After setting up the environment, you can run the application locally:
python app.py
The app will be available at: http://127.0.0.1:5000/
- Drag and drop an image or click to upload a PNG, JPG, or JPEG file.
- Make sure the file size is below 5MB.
- Once the image is uploaded, the OCR will extract the text from the image and display it in the interface.
- Click the Revise Text button to apply spelling corrections using SymSpell and grammatical improvements using the T5 LLM.
- The revised text will then be displayed, and you can download or copy the revised text.
- After extracting or revising the text, you can download the text as a .txt file or copy it directly to your clipboard.
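The upload constraints in the steps above (PNG, JPG, or JPEG, and no more than 5MB) could be checked along the following lines. This is a sketch under assumptions rather than the application's actual code: `validate_upload` is a hypothetical helper, and "5MB" is interpreted here as 5 * 1024 * 1024 bytes.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}
MAX_FILE_SIZE_BYTES = 5 * 1024 * 1024  # assumption: "5MB" means 5 MiB


def validate_upload(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the upload is acceptable."""
    problems = []
    # Compare extensions case-insensitively so "SCAN.PNG" is accepted.
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file type")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file larger than 5MB")
    return problems


if __name__ == "__main__":
    print(validate_upload("scan.PNG", 1024))
    print(validate_upload("notes.pdf", 6 * 1024 * 1024))
```

In a Flask app, a similar size cap can also be enforced at the framework level through the `MAX_CONTENT_LENGTH` configuration key, which rejects oversized requests before the view function runs.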
For production deployment, consider using Gunicorn:
gunicorn --bind 0.0.0.0:5000 app:app
The application can be deployed to Azure for scalable cloud-based OCR processing. Follow Azure’s Python Flask deployment guide for more information.
- File Format: The application supports .png, .jpg, and .jpeg image formats.
- Max File Size: The maximum file size is 5MB.
- Multi-language OCR support is currently disabled due to language pack availability. Ensure eng.traineddata exists in the tessdata folder for English OCR.
- LLM processing may take longer for large text outputs, and grammar corrections may vary in accuracy.
This project is licensed under the MIT License.