Vision Text Extractor

Extract text from images and documents using multiple AI providers. Choose between local models (SmolVLM, LLaVA) and a cloud-based model (OpenAI GPT-4o) for maximum flexibility.

✨ Features

  • 🤖 3 AI Providers: Local SmolVLM/LLaVA or cloud OpenAI GPT-4o
  • 🔒 Privacy-First: Local processing keeps your data private
  • 🌐 Flexible Input: Local files or web URLs
  • 💬 Custom Prompts: Extract specific information
  • ⚡ Easy Setup: One-command installation with Pixi

🚀 Quick Start

```shell
# Clone and install
git clone https://github.com/udit-asopa/vision-text-extractor.git
cd vision-text-extractor
pixi install

# Quick demo
pixi run demo-ocr-huggingface

# Use with your images
python main.py path/to/your/image.jpg
python main.py "https://example.com/image.png"
```

📖 Documentation

For detailed guides and tutorials, see the project Wiki.

πŸ› οΈ Installation

Prerequisites

  • Pixi package manager
  • Python 3.10+ (managed by Pixi)

Setup

```shell
git clone https://github.com/udit-asopa/vision-text-extractor.git
cd vision-text-extractor
pixi install
pixi run setup
```

Choose Your AI Provider

🟢 Local & Free (Recommended)

```shell
pixi run setup-smolvlm    # Hugging Face SmolVLM (~2GB)
pixi run setup-ollama     # Ollama LLaVA (~4GB)
```

🟡 Cloud & Paid (Highest Accuracy)

```shell
# Add your OpenAI key to a .env file
echo "OPENAI_API_KEY=your_key_here" >> .env
```

💡 Basic Usage

```shell
# Extract text from any image
python main.py path/to/your/image.jpg

# Process web images
python main.py "https://example.com/document.png"

# Custom extraction prompt
python main.py receipt.jpg --prompt "Extract total amount and date"

# Try different providers
python main.py image.png --provider ollama --model llava:7b
python main.py image.png --provider openai --model gpt-4o
```
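The CLI can also be driven from your own Python scripts via `subprocess`. A minimal sketch that assembles the argument list from the flags documented above (`--prompt`, `--provider`, `--model`); the helper names here are hypothetical, not part of the project:

```python
import subprocess
import sys

def build_command(image, prompt=None, provider=None, model=None):
    """Assemble the argv list for one main.py invocation."""
    cmd = [sys.executable, "main.py", image]
    if prompt:
        cmd += ["--prompt", prompt]
    if provider:
        cmd += ["--provider", provider]
    if model:
        cmd += ["--model", model]
    return cmd

def run_extraction(image, **kwargs):
    """Run the CLI and return its stdout (raises on a non-zero exit)."""
    result = subprocess.run(build_command(image, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout
```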

🎯 Common Use Cases

  • 📄 Business Documents: Invoices, contracts, forms, receipts
  • 🍽️ Food & Restaurants: Recipes, menus, nutrition labels
  • 💰 Finance: Bank statements, tax documents, expense reports
  • 📚 Education: Homework, research papers, lecture notes
  • 🏥 Healthcare: Prescriptions, lab results, medical forms

See our Document Processing Tutorial for detailed examples.

🔧 Quick Commands

```shell
# Demo with sample images
pixi run demo-ocr-huggingface  # SmolVLM demo
pixi run demo-ocr-ollama       # LLaVA demo
pixi run demo-ocr-openai       # OpenAI demo

# Test your setup
pixi run test-setup            # Validate installation
pixi run check-env             # Check API keys

# Process your files
pixi run ocr_llm "my-image.jpg"
pixi run ocr_ollama "document.pdf"
```

🧪 Handwriting OCR Test

Run the handwriting sample test to verify the SmolVLM transcription output.

  • Using Pixi (recommended; ensures the model is set up):

```shell
pixi run test-handwriting
```

  • Directly with Python:

```shell
python tests/test_handwriting_ocr.py
```

The test runs the SmolVLM pipeline against images/handwriting_sample.webp and checks the extracted text against the expected transcription. Use the Pixi command if you haven't run pixi run setup-smolvlm yet.
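Because a VLM's output rarely matches a reference transcription byte-for-byte, such a check usually normalizes case and whitespace before comparing. A sketch of that comparison step, with hypothetical helpers (`normalize`, `check_transcription`) that are illustrative rather than the project's actual test code:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())

def check_transcription(extracted: str, expected: str) -> bool:
    """Compare OCR output against a reference transcription,
    ignoring case and whitespace differences."""
    return normalize(extracted) == normalize(expected)
```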

📂 Project Structure

```
vision-text-extractor/
├── main.py              # Main CLI application
├── agent/tools.py       # OCR extraction tools
├── tests/               # Test scripts
├── images/              # Sample images
├── wiki_content/        # Documentation source
├── LICENSE              # MIT License
└── pixi.toml            # Dependencies & tasks
```

πŸ—ΊοΈ Roadmap & Future Updates

We're actively working on exciting new features! Here's what's planned:

🚀 Next Release (v0.2.0)

  • 📊 Batch Processing: Process multiple files in one command
  • 🎯 Output Formats: JSON, CSV, XML structured output options
  • 🔄 Result Caching: Skip reprocessing of identical images
  • 📈 Progress Bars: Visual feedback for long operations

🌟 Upcoming Features

  • 🧠 More AI Providers:
    • Google Gemini Vision
    • Anthropic Claude Vision
    • Local Qwen2-VL support
  • 🎨 Image Preprocessing:
    • Auto-rotate, denoise, enhance quality
    • OCR confidence scoring
  • 🔧 Advanced Tools:
    • Table structure extraction
    • Form field detection
    • Handwriting analysis mode

🏢 Enterprise Features

  • πŸ” Enhanced Security: SOC2 compliance, audit logs
  • ⚑ Performance: GPU optimization, model quantization
  • 🌐 API Server: REST API for integration
  • πŸ“Š Analytics: Usage metrics and accuracy reporting

🎯 Long-term Vision

  • 🤖 AI Agents: Multi-step document analysis workflows
  • 🌍 Multi-language: Better support for non-English text
  • 📱 Mobile App: Companion mobile application
  • 🔌 Integrations: Direct cloud storage, CRM, ERP connections

Want to contribute? Check our Issues or suggest new features!

🤝 Contributing

We welcome contributions! Please see our Wiki for development guides and check out existing Issues.

Ways to contribute:

  • πŸ› Bug Reports: Found an issue? Let us know!
  • πŸ’‘ Feature Requests: Suggest improvements
  • πŸ“ Documentation: Help improve our wiki
  • πŸ§ͺ Testing: Try new features and providers
  • πŸ’» Code: Submit pull requests

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary:

  • ✅ Commercial use - Use in commercial projects
  • ✅ Modification - Change and adapt the code
  • ✅ Distribution - Share with others
  • ✅ Private use - Use for personal projects
  • ❓ Warranty - No warranty provided

⚠️ Privacy Notice

  • Local providers (SmolVLM, LLaVA): Your data never leaves your machine
  • OpenAI provider: Data is sent to OpenAI's servers
  • API keys: Never commit .env files to version control

Need help? Check our Wiki or create an Issue 🚀
