GOKRa PDF (Google-OCR-a)

This is a file processor for PDFs. It is intended to be run as a cron job in order to batch process PDF files through Google's Document AI and overlay the OCR text ontop of the original. The PDF page images are re-encoded. Some image quality loss may occur compared to the original.

This uses Google's Document AI's synchronous API and is limited to PDFs of only 15 pages.

Potential Improvements

There is not attempt to match font style. Google's API does not return font information.
Text alignment is not perfect. It is good enough for my usage, but could be improved with some investment
If the input PDF already has text on a page, copy the source page directly.
Google seems to rotate the PDF image. This may have side effects. There is no handling for this one way or another
Create a watchdog to ingest once file is added to a directory

This was a weekend project. I did not think about long-term, but if you want to submit a pull request to make this tool better, I will consider them on a case by case basis.

Installation

cd whereever you want this
git clone <this repo>
add ./bin folder to your path

Create new VENV

pip install -r requirements.txt
python -m venv venv
source venv/bin/activate
touch ~/.gocr.env
sudo vi ~/.gocr.env

Paste the contents and fill in with meaningful values for your cloud setup:

GOOGLE_APPLICATION_CREDENTIALS="/path/to/my/key.json"
PROJECT_ID = "MY_GOOGLE_PROJECT_ID"
PROCESSOR_ID = "MY_DOCUMENTAI_PROCESSOR_ID"  # Create processor in Cloud Console
LOCATION = "us"  # Format is 'us' or 'eu'

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
bin		bin
samples		samples
.gitignore		.gitignore
README.md		README.md
gocr.env		gocr.env
install.sh		install.sh
requirements.txt		requirements.txt
wget-install.sh		wget-install.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GOKRa PDF (Google-OCR-a)

Potential Improvements

Installation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GOKRa PDF (Google-OCR-a)

Potential Improvements

Installation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages