Vision Text Extractor

Extract text from images and documents using multiple AI providers. Choose between local models (SmolVLM, LLaVA) and a cloud-based model (OpenAI GPT-4o) for maximum flexibility.

✨ Features

  • 🤖 3 AI Providers: Local SmolVLM/LLaVA or cloud OpenAI GPT-4o
  • 🔒 Privacy-First: Local processing keeps your data private
  • 🌐 Flexible Input: Local files or web URLs
  • 💬 Custom Prompts: Extract specific information
  • ⚡ Easy Setup: One-command installation with Pixi

🚀 Quick Start

```shell
# Clone and install
git clone https://github.com/udit-asopa/vision-text-extractor.git
cd vision-text-extractor
pixi install

# Quick demo
pixi run demo-ocr-huggingface

# Use with your images
python main.py path/to/your/image.jpg
python main.py "https://example.com/image.png"
```

📖 Documentation

For detailed guides and tutorials, see the project Wiki.

πŸ› οΈ Installation

Prerequisites

  • Pixi package manager
  • Python 3.10+ (managed by Pixi)

Setup

```shell
git clone https://github.com/udit-asopa/vision-text-extractor.git
cd vision-text-extractor
pixi install
pixi run setup
```

Choose Your AI Provider

🟢 Local & Free (Recommended)

```shell
pixi run setup-smolvlm    # Hugging Face SmolVLM (~2GB)
pixi run setup-ollama     # Ollama LLaVA (~4GB)
```

🟡 Cloud & Paid (Highest Accuracy)

```shell
# Add your OpenAI key to a .env file
echo "OPENAI_API_KEY=your_key_here" >> .env
```

💡 Basic Usage

```shell
# Extract text from any image
python main.py path/to/your/image.jpg

# Process web images
python main.py "https://example.com/document.png"

# Custom extraction prompt
python main.py receipt.jpg --prompt "Extract total amount and date"

# Try different providers
python main.py image.png --provider ollama --model llava:7b
python main.py image.png --provider openai --model gpt-4o
```
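The CLI can also be driven from your own Python scripts via `subprocess`. A minimal sketch that assembles the argument list from the flags documented above (`--prompt`, `--provider`, `--model`); the helper names here are hypothetical, not part of the project:

```python
import subprocess
import sys

def build_command(image, prompt=None, provider=None, model=None):
    """Assemble the argv list for one main.py invocation."""
    cmd = [sys.executable, "main.py", image]
    if prompt:
        cmd += ["--prompt", prompt]
    if provider:
        cmd += ["--provider", provider]
    if model:
        cmd += ["--model", model]
    return cmd

def run_extraction(image, **kwargs):
    """Run the CLI and return its stdout (raises on a non-zero exit)."""
    result = subprocess.run(build_command(image, **kwargs),
                            capture_output=True, text=True, check=True)
    return result.stdout
```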

🎯 Common Use Cases

  • 📄 Business Documents: Invoices, contracts, forms, receipts
  • 🍽️ Food & Restaurants: Recipes, menus, nutrition labels
  • 💰 Finance: Bank statements, tax documents, expense reports
  • 📚 Education: Homework, research papers, lecture notes
  • 🏥 Healthcare: Prescriptions, lab results, medical forms

See our Document Processing Tutorial for detailed examples.

🔧 Quick Commands

```shell
# Demo with sample images
pixi run demo-ocr-huggingface  # SmolVLM demo
pixi run demo-ocr-ollama       # LLaVA demo
pixi run demo-ocr-openai       # OpenAI demo

# Test your setup
pixi run test-setup            # Validate installation
pixi run check-env             # Check API keys

# Process your files
pixi run ocr_llm "my-image.jpg"
pixi run ocr_ollama "document.pdf"
```

🧪 Handwriting OCR Test

Run the handwriting sample test to verify the SmolVLM transcription output.

  • Using Pixi (recommended; ensures the model is set up):

```shell
pixi run test-handwriting
```

  • Directly with Python:

```shell
python tests/test_handwriting_ocr.py
```

The test runs the SmolVLM pipeline against images/handwriting_sample.webp and checks the extracted text against the expected transcription. Use the Pixi command if you haven't run pixi run setup-smolvlm yet.
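Because a VLM's output rarely matches a reference transcription byte-for-byte, such a check usually normalizes case and whitespace before comparing. A sketch of that comparison step, with hypothetical helpers (`normalize`, `check_transcription`) that are illustrative rather than the project's actual test code:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())

def check_transcription(extracted: str, expected: str) -> bool:
    """Compare OCR output against a reference transcription,
    ignoring case and whitespace differences."""
    return normalize(extracted) == normalize(expected)
```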

📂 Project Structure

```
vision-text-extractor/
├── main.py              # Main CLI application
├── agent/tools.py       # OCR extraction tools
├── tests/               # Test scripts
├── images/              # Sample images
├── wiki_content/        # Documentation source
├── LICENSE              # MIT License
└── pixi.toml            # Dependencies & tasks
```

πŸ—ΊοΈ Roadmap & Future Updates

We're actively working on exciting new features! Here's what's planned:

🚀 Next Release (v0.2.0)

  • 📊 Batch Processing: Process multiple files in one command
  • 🎯 Output Formats: JSON, CSV, XML structured output options
  • 🔄 Result Caching: Skip reprocessing of identical images
  • 📈 Progress Bars: Visual feedback for long operations

🌟 Upcoming Features

  • 🧠 More AI Providers:
    • Google Gemini Vision
    • Anthropic Claude Vision
    • Local Qwen2-VL support
  • 🎨 Image Preprocessing:
    • Auto-rotate, denoise, enhance quality
    • OCR confidence scoring
  • 🔧 Advanced Tools:
    • Table structure extraction
    • Form field detection
    • Handwriting analysis mode

🏢 Enterprise Features

  • πŸ” Enhanced Security: SOC2 compliance, audit logs
  • ⚑ Performance: GPU optimization, model quantization
  • 🌐 API Server: REST API for integration
  • πŸ“Š Analytics: Usage metrics and accuracy reporting

🎯 Long-term Vision

  • 🤖 AI Agents: Multi-step document analysis workflows
  • 🌍 Multi-language: Better support for non-English text
  • 📱 Mobile App: Companion mobile application
  • 🔌 Integrations: Direct cloud storage, CRM, ERP connections

Want to contribute? Check our Issues or suggest new features!

🤝 Contributing

We welcome contributions! Please see our Wiki for development guides and check out existing Issues.

Ways to contribute:

  • πŸ› Bug Reports: Found an issue? Let us know!
  • πŸ’‘ Feature Requests: Suggest improvements
  • πŸ“ Documentation: Help improve our wiki
  • πŸ§ͺ Testing: Try new features and providers
  • πŸ’» Code: Submit pull requests

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary:

  • ✅ Commercial use - Use in commercial projects
  • ✅ Modification - Change and adapt the code
  • ✅ Distribution - Share with others
  • ✅ Private use - Use for personal projects
  • ❓ Warranty - No warranty provided

⚠️ Privacy Notice

  • Local providers (SmolVLM, LLaVA): Your data never leaves your machine
  • OpenAI provider: Data is sent to OpenAI's servers
  • API keys: Never commit .env files to version control

Need help? Check our Wiki or create an Issue 🚀
