Vision Language Models (VLMs) for Optical Character Recognition (OCR).
| Feature | Support |
|---|---|
| File Types | ✅ PDF, TIFF, PNG, JPG/JPEG, BMP, GIF, WEBP |
| VLM Engines | ✅ Ollama, OpenAI Compatible, OpenAI, Azure OpenAI |
| Output Modes | ✅ Markdown, HTML, plain text |
| Batch OCR | ✅ Processes many files concurrently with Python, CLI, and web app |
- v0.2.0 (Jun 1, 2025):
- 📐 User Guide is now available on the Documentation Page.
- Image processing features: added `rotate_correction` and `max_dimension_pixels` to handle misaligned scans and large images.
- Optimized file staging: added `max_file_load` parameter to the `concurrent_ocr` method.
- v0.3.0 (Jul 5, 2025):
- Key information extraction with JSON: instead of running OCR on the entire document, extract only the key information and output it as JSON.
- Batch processing in the web app: The web app now supports concurrent batch processing of multiple images, using the `OCREngine.concurrent_ocr` method as the backend.
- v0.3.1 (Oct 11, 2025):
- Added reasoning VLM support: added new configs to support reasoning VLMs (e.g., Qwen-VL-30B-A3B-Thinking, o4-mini).
- Added OpenAICompatible VLM engines: separated the OpenAI-compatible VLM engines into a parent class, `OpenAICompatibleVLMEngine`.
- v0.4.0 (Dec 15, 2025):
- Few-shot examples: Added support for few-shot examples to improve OCR accuracy.
vlm4ocr provides a simple way to perform OCR using the power of modern Vision Language Models (VLMs). A drag-and-drop web application is included for easy access. The Python package supports concurrent batch processing for large numbers of documents. The CLI provides lightweight access to most OCR features without writing any code.
Below are screenshots from our Web Application. Note that all content shown in this README is synthesized; it contains no real personal information.
A scanned lab report with tables and highlights is converted into markdown text by our OCR engine.
Many scanned documents are batch processed and converted into markdown text by our OCR engine.
All open-weight VLMs are supported via our Ollama and OpenAI-compatible engines, including the Qwen-VL and Llama 3.2 Vision families.
Proprietary models such as gpt-4o are supported via our OpenAI and Azure engines.
- Python 3.x
- For PDF processing: the poppler library (install commands below).
- At least one VLM inference engine setup (Ollama, OpenAI/Azure API keys, or an OpenAI-compatible API endpoint).
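If you need PDF support, poppler is usually available from the system package manager. The package names below are common defaults and an assumption about your platform, not a vlm4ocr-specific requirement:

sudo apt-get install poppler-utils  # Debian/Ubuntu
brew install poppler                # macOS (Homebrew)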
pip install ollama # For Ollama
pip install openai # For OpenAI (compatible) and Azure OpenAI

A ready-to-use Flask web application is included. We support input preview, real-time streaming, and output export.
Demo video: vlm4ocr.demo.mp4
The easiest way to run the VLM4OCR web application is through Docker. The image is available on Docker Hub.
docker pull daviden1013/vlm4ocr-app:latest
docker run -p 5000:5000 daviden1013/vlm4ocr-app:latest

Open your web browser and navigate to: http://localhost:5000
If port 5000 is already in use on your machine, you can map it to a different local port. For example, to map it to local port 8080:
docker run -p 8080:5000 daviden1013/vlm4ocr-app:latest

Then visit http://localhost:8080
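To check whether another process is already listening on port 5000, a generic system command (not part of vlm4ocr) can be used:

lsof -i :5000  # lists processes using port 5000 (Linux/macOS)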
If you are running Ollama on your host machine (outside the Docker container) and want to connect to it from the VLM4OCR web app running inside Docker:
- Docker Desktop (Windows/Mac): In the VLM4OCR web UI, set the Ollama Host to http://host.docker.internal:11434.
- Linux: Run the Docker container with host networking:
docker run --network="host" daviden1013/vlm4ocr-app:latest

With --network="host", you don't need the -p flag for port mapping (the container will use the host's network directly, so the app will be available at http://localhost:5000). Then, in the VLM4OCR web app, you can use the default http://localhost:11434 for the Ollama Host.
Alternatively, you can clone this repo and run the VLM4OCR web application from source:
# Install python package
pip install vlm4ocr
# Clone source code
git clone https://github.com/daviden1013/vlm4ocr.git
# Run Web App
cd vlm4ocr/services/web_app
python run.py

The Python package is available on PyPI:
pip install vlm4ocr

In this demo, we use a locally deployed vLLM OpenAI-compatible server to run the Qwen3-VL-30B-A3B-Instruct model.
from vlm4ocr import VLLMVLMEngine
vlm_engine = VLLMVLMEngine(model="Qwen/Qwen3-VL-30B-A3B-Instruct")

To use other VLM inference engines:
OpenAI Compatible
from vlm4ocr import OpenAICompatibleVLMEngine
vlm_engine = OpenAICompatibleVLMEngine(model="<model_name>", base_url="<base_url>", api_key="<api_key>")
Ollama
from vlm4ocr import OllamaVLMEngine
vlm_engine = OllamaVLMEngine(model_name="llama3.2-vision:11b-instruct-fp16")
OpenAI API
Follow the Best Practices for API Key Safety to set up your API key.

export OPENAI_API_KEY=<your_API_key>

from vlm4ocr import OpenAIVLMEngine
vlm_engine = OpenAIVLMEngine(model="gpt-4o-mini")
Azure OpenAI API
Follow the Azure AI Services Quickstart to set up Endpoint and API key.
export AZURE_OPENAI_API_KEY="<your_API_key>"
export AZURE_OPENAI_ENDPOINT="<your_endpoint>"

from vlm4ocr import AzureOpenAIVLMEngine
vlm_engine = AzureOpenAIVLMEngine(model="gpt-4o-mini",
                                  api_version="<your api version>")

We define the OCR engine and specify the output format.
from vlm4ocr import OCREngine
# Image/PDF paths
image_path = "/examples/synthesized_data/GPT-4o_synthesized_note_1_page_1.jpg"
pdf_path = "/examples/synthesized_data/GPT-4o_synthesized_note_1.pdf"
# Define OCR engine
ocr = OCREngine(vlm_engine, output_mode="markdown")

Run OCR sequentially (processing one image at a time) on a single file or multiple files:
# OCR for a single image
ocr_results = ocr.sequential_ocr(image_path, verbose=True)
# OCR for a single pdf (multiple pages)
ocr_results = ocr.sequential_ocr(pdf_path, verbose=True)
# OCR for multiple image and pdf files
ocr_results = ocr.sequential_ocr([pdf_path, image_path], verbose=True)
# Inspect OCR results
len(ocr_results) # 2 files
ocr_results[0].input_dir
ocr_results[0].filename
len(ocr_results[0]) # number of pages in the PDF file
ocr_text = ocr_results[0].to_string() # OCR text (all pages concatenated)

Run OCR concurrently and write the results to files:
import asyncio

async def run_ocr():
    # Note: concurrent_ocr also accepts a max_file_load parameter for file staging (added in v0.2.0)
    response = ocr.concurrent_ocr([image_path, pdf_path], concurrent_batch_size=4)
    async for result in response:
        if result.status == "success":
            filename = result.filename
            ocr_text = result.to_string()
            with open(f"{filename}.md", "w", encoding="utf-8") as f:
                f.write(ocr_text)

asyncio.run(run_ocr())

Supply few-shot examples to improve OCR accuracy:
from PIL import Image
import asyncio
from vlm4ocr import FewShotExample, VLLMVLMEngine, OCREngine

# Load few-shot examples
# Note that the example text should be the expected OCR output for the example image. Do not include any extra instructions.
example_1_image = Image.open("example_1.JPG")
with open("example_1.txt", "r") as f:
    example_1_text = f.read()

example_2_image = Image.open("example_2.JPG")
with open("example_2.txt", "r") as f:
    example_2_text = f.read()

few_shot_examples = [
    FewShotExample(image=example_1_image, text=example_1_text, max_dimension_pixels=512),
    FewShotExample(image=example_2_image, text=example_2_text, max_dimension_pixels=512)
]

async def run_ocr():
    response = ocr.concurrent_ocr([image_path, pdf_path], concurrent_batch_size=4, few_shot_examples=few_shot_examples)
    async for result in response:
        if result.status == "success":
            filename = result.filename
            ocr_text = result.to_string()
            with open(f"{filename}.md", "w", encoding="utf-8") as f:
                f.write(ocr_text)

asyncio.run(run_ocr())

The command line interface (CLI) provides an easy way to batch process many images, PDFs, and TIFFs in a directory.
Install the Python package from PyPI and the CLI tool will be installed automatically.
pip install vlm4ocr

Run OCR for all supported file types in the /examples/synthesized_data/ folder with a locally deployed Qwen2.5-VL-7B-Instruct and generate results as markdown. OCR results and a log file (enabled by --log) will be written to the --output_path. --concurrent_batch_size determines the number of images/pages that can be processed at a time, which helps manage resources.
# OpenAI compatible API
vlm4ocr --input_path /examples/synthesized_data/ \
--output_path /examples/ocr_output/ \
--output_mode markdown \
--log \
--vlm_engine openai_compatible \
--model Qwen/Qwen2.5-VL-7B-Instruct \
--api_key EMPTY \
--base_url http://localhost:8000/v1 \
--concurrent_batch_size 4

Use gpt-4.1-mini to process a PDF with many pages. Since --output_path is not specified, outputs and logs will be written to the current working directory.
# OpenAI API
export OPENAI_API_KEY=<api key>
vlm4ocr --input_path /examples/synthesized_data/GPT-4o_synthesized_note_1.pdf \
--output_mode HTML \
--log \
--vlm_engine openai \
--model gpt-4.1-mini \
--concurrent_batch_size 4