vlm4ocr

Vision Language Models (VLMs) for Optical Character Recognition (OCR).

Feature        Support
File Types     ✅ PDF, TIFF, PNG, JPG/JPEG, BMP, GIF, WEBP
VLM Engines    ✅ Ollama, OpenAI Compatible, OpenAI, Azure OpenAI
Output Modes   ✅ Markdown, HTML, plain text
Batch OCR      ✅ Processes many files concurrently via Python, CLI, and web app

🆕Recent Updates

  • v0.2.0 (Jun 1, 2025):
    • 📐User Guide is now available on the Documentation Page
    • Image processing features: added rotate_correction and max_dimension_pixels to handle misaligned scans and large images.
    • Optimized file staging: added a max_file_load parameter to the concurrent_ocr method.
  • v0.3.0 (Jul 5, 2025):
    • Key information extraction with JSON: instead of OCRing the entire document, extract only the key information and output it as JSON.
    • Batch processing in the web app: the web app now supports concurrent batch processing of multiple images, using the OCREngine.concurrent_ocr method as the backend.
  • v0.3.1 (Oct 11, 2025):
    • Added reasoning VLM support: new configs for reasoning VLMs (e.g., Qwen-VL-30B-A3B-Thinking, o4-mini).
    • Added OpenAICompatible VLM engines: separated the OpenAI-compatible VLM engines under a new parent class, OpenAICompatibleVLMEngine.
  • v0.4.0 (Dec 15, 2025):
    • Few-shot examples: added support for few-shot examples to improve OCR accuracy.

✨Overview

vlm4ocr provides a simple way to perform OCR using the power of modern Vision Language Models (VLMs). A drag-and-drop web application is included for easy access. The Python package supports concurrent batch processing for large numbers of documents. The CLI provides lightweight access to most OCR features without writing any code. Below are screenshots from our Web Application. Note that all content shown in this README is synthesized; it contains no real personal information.

Stream OCR results in real-time

A scanned lab report with tables and highlights is converted into markdown text by our OCR engine.

Batch processing many files

Many scanned documents are batch processed and converted into markdown text by our OCR engine.

⭐Supported Models

Open-weights (ALL Supported!!)

All open-weights VLMs are supported via our Ollama and OpenAI-compatible engines, including the Qwen-VL and Llama Vision families.

Proprietary

Proprietary models such as gpt-4o are supported via our OpenAI and Azure OpenAI engines.

🚦Prerequisites

  • Python 3.x
  • For PDF processing: the poppler library.
  • At least one VLM inference engine set up (Ollama, OpenAI/Azure API keys, or an OpenAI-compatible API endpoint).
pip install ollama # For Ollama
pip install openai # For OpenAI (compatible) and Azure OpenAI
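
If poppler is not already installed, it is available from common package managers (the commands below assume a Debian/Ubuntu or macOS host):

sudo apt-get install poppler-utils  # Debian/Ubuntu
brew install poppler                # macOS (Homebrew)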

🌎Web Application

A ready-to-use Flask web application is included. We support input preview, real-time streaming, and output export.

[Demo video: vlm4ocr.demo.mp4]

Installation

🐳Running with Docker

The easiest way to run the VLM4OCR web application is through Docker. The image is available on Docker Hub.

docker pull daviden1013/vlm4ocr-app:latest
docker run -p 5000:5000 daviden1013/vlm4ocr-app:latest

Open your web browser and navigate to: http://localhost:5000

If port 5000 is already in use on your machine, you can map it to a different local port. For example, to map it to local port 8080:

docker run -p 8080:5000 daviden1013/vlm4ocr-app:latest

Then visit http://localhost:8080

Using Ollama with the Dockerized App

If you are running Ollama on your host machine (outside the Docker container) and want to connect to it from the VLM4OCR web app running inside Docker:

  • Docker Desktop (Windows/Mac): In the VLM4OCR web UI, set the Ollama Host to http://host.docker.internal:11434.
  • Linux: Run the Docker container with host networking:

docker run --network="host" daviden1013/vlm4ocr-app:latest

With --network="host", you don't need the -p flag for port mapping; the container uses the host's network directly, so the app is available at http://localhost:5000. In the VLM4OCR web app, you can then keep the default http://localhost:11434 as the Ollama Host.

Install from source

Alternatively, you can clone this repo and run the VLM4OCR web application from source:

# Install python package
pip install vlm4ocr 

# Clone source code
git clone https://github.com/daviden1013/vlm4ocr.git

# Run Web App
cd vlm4ocr/services/web_app
python run.py

🐍Python package

Installation

The Python package is available on PyPI:

pip install vlm4ocr

Quick start

In this demo, we use a locally deployed vLLM OpenAI-compatible server to run the Qwen3-VL-30B-A3B-Instruct model.

from vlm4ocr import VLLMVLMEngine

vlm_engine = VLLMVLMEngine(model="Qwen/Qwen3-VL-30B-A3B-Instruct")
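
This assumes a vLLM server is already running and serving the model. A typical way to start one (assuming a standard vLLM installation) is:

vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct

By default, vLLM exposes an OpenAI-compatible API at http://localhost:8000/v1, which is the base URL used in the CLI example later in this README.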

To use other VLM inference engines:

OpenAI Compatible
from vlm4ocr import OpenAICompatibleVLMEngine

vlm_engine = OpenAICompatibleVLMEngine(model="<model_name>", base_url="<base_url>", api_key="<api_key>")
Ollama
from vlm4ocr import OllamaVLMEngine

vlm_engine = OllamaVLMEngine(model_name="llama3.2-vision:11b-instruct-fp16")
OpenAI API

Follow the Best Practices for API Key Safety to set up your API key.

export OPENAI_API_KEY=<your_API_key>
from vlm4ocr import OpenAIVLMEngine

vlm_engine = OpenAIVLMEngine(model="gpt-4o-mini")
Azure OpenAI API

Follow the Azure AI Services Quickstart to set up your endpoint and API key.

export AZURE_OPENAI_API_KEY="<your_API_key>"
export AZURE_OPENAI_ENDPOINT="<your_endpoint>"
from vlm4ocr import AzureOpenAIVLMEngine

vlm_engine = AzureOpenAIVLMEngine(model="gpt-4o-mini", 
                                  api_version="<your api version>")

We then define the OCR engine and specify the output format.

from vlm4ocr import OCREngine

# Image/PDF paths
image_path = "/examples/synthesized_data/GPT-4o_synthesized_note_1_page_1.jpg"
pdf_path = "/examples/synthesized_data/GPT-4o_synthesized_note_1.pdf"

# Define OCR engine
ocr = OCREngine(vlm_engine, output_mode="markdown")

Run OCR sequentially (process one image at a time) for single or multiple files:

# OCR for a single image
ocr_results = ocr.sequential_ocr(image_path, verbose=True)

# OCR for a single pdf (multiple pages)
ocr_results = ocr.sequential_ocr(pdf_path, verbose=True)

# OCR for multiple image and pdf files
ocr_results = ocr.sequential_ocr([pdf_path, image_path], verbose=True)

# Inspect OCR results
len(ocr_results) # 2 files
ocr_results[0].input_dir
ocr_results[0].filename
len(ocr_results[0]) # number of pages in the PDF file
ocr_text = ocr_results[0].to_string() # OCR text (all pages concatenated)
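
Sequential results come back as a list, so saving them is a simple loop. A minimal sketch using only the attributes shown above:

# Write each OCR result to a markdown file
for result in ocr_results:
    with open(f"{result.filename}.md", "w", encoding="utf-8") as f:
        f.write(result.to_string())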

Run OCR concurrently and write to file:

import asyncio

async def run_ocr():
    response = ocr.concurrent_ocr([image_path, pdf_path], concurrent_batch_size=4)
    async for result in response:
        if result.status == "success":
            filename = result.filename
            ocr_text = result.to_string()
            with open(f"{filename}.md", "w", encoding="utf-8") as f:
                f.write(ocr_text)

asyncio.run(run_ocr())
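
concurrent_batch_size caps how many images/pages are in flight at once, so it can be tuned to fit your GPU memory or API rate limits. Results whose status is not "success" can be logged for retry instead of written out.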

Supply few-shot examples to improve OCR accuracy:

from PIL import Image
import asyncio
from vlm4ocr import FewShotExample, VLLMVLMEngine, OCREngine

# Load few-shot examples
# Note that the example text should be the expected OCR output for the example image. Do not include any extra instructions.
example_1_image = Image.open("example_1.JPG")
with open("example_1.txt", "r") as f:
    example_1_text = f.read()

example_2_image = Image.open("example_2.JPG")
with open("example_2.txt", "r") as f:
    example_2_text = f.read()

few_shot_examples = [
    FewShotExample(image=example_1_image, text=example_1_text, max_dimension_pixels=512),
    FewShotExample(image=example_2_image, text=example_2_text, max_dimension_pixels=512)
]

vlm_engine = VLLMVLMEngine(model="Qwen/Qwen3-VL-30B-A3B-Instruct")
ocr = OCREngine(vlm_engine, output_mode="markdown")

async def run_ocr():
    response = ocr.concurrent_ocr([image_path, pdf_path], concurrent_batch_size=4, few_shot_examples=few_shot_examples)
    async for result in response:
        if result.status == "success":
            filename = result.filename
            ocr_text = result.to_string()
            with open(f"{filename}.md", "w", encoding="utf-8") as f:
                f.write(ocr_text)

asyncio.run(run_ocr())
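
Because the few-shot example images are included with every OCR request, down-scaling them via max_dimension_pixels keeps prompts compact and reduces per-page cost while still demonstrating the expected output format.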

💻CLI

The command line interface (CLI) provides an easy way to batch process many images, PDFs, and TIFFs in a directory.

Installation

Install the Python package from PyPI and the CLI tool will be installed automatically.

pip install vlm4ocr

Usage

Run OCR for all supported file types in the /examples/synthesized_data/ folder with a locally deployed Qwen2.5-VL-7B-Instruct model and generate results as markdown. OCR results and a log file (enabled by --log) will be written to the output_path. --concurrent_batch_size determines the number of images/pages that can be processed at a time, which is useful for managing compute resources.

# OpenAI compatible API
vlm4ocr --input_path /examples/synthesized_data/ \
        --output_path /examples/ocr_output/ \
        --output_mode markdown \
        --log \
        --vlm_engine openai_compatible \
        --model Qwen/Qwen2.5-VL-7B-Instruct \
        --api_key EMPTY \
        --base_url http://localhost:8000/v1 \
        --concurrent_batch_size 4

Use gpt-4.1-mini to process a PDF with many pages. Since --output_path is not specified, outputs and logs will be written to the current working directory.

# OpenAI API
export OPENAI_API_KEY=<api key>
vlm4ocr --input_path /examples/synthesized_data/GPT-4o_synthesized_note_1.pdf \
        --output_mode HTML \
        --log \
        --vlm_engine openai \
        --model gpt-4.1-mini \
        --concurrent_batch_size 4