
Optical Compression for Qwen3-VL-2B

Adapter-based optical compression for long documents using DeepSeek's DeepEncoder with Qwen3-VL-2B.

Built on DeepSeek-OCR by DeepSeek-AI. This repo includes deepencoder.py from DeepSeek-OCR (Apache-2.0 License).

🚀 Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Download Model Weights

# Download DeepEncoder weights (401M params, auto-downloads on first use)
# Will be cached in: ~/.cache/huggingface/hub/models--Volkopat--DeepSeek-DeepEncoder

# Download pre-trained adapter (41MB)
mkdir -p adapters
wget https://huggingface.co/Volkopat/Qwen-VLM-Optical-Encoder/resolve/main/qwen3_vl_2b.pth -O adapters/qwen3_vl_2b.pth

Model weights:

  • DeepEncoder: Volkopat/DeepSeek-DeepEncoder (auto-downloaded and cached on first use)
  • Adapter: Volkopat/Qwen-VLM-Optical-Encoder (qwen3_vl_2b.pth, 41MB)
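
If you prefer Python over wget, here is a minimal sketch using huggingface_hub; the repo id and filename come from the URL above, and local_dir mirrors the adapters/ layout that test.py expects.

# Optional alternative to wget: fetch the adapter with huggingface_hub
# (pip install huggingface_hub). Repo id and filename match the URL above.
from huggingface_hub import hf_hub_download

adapter_path = hf_hub_download(
    repo_id="Volkopat/Qwen-VLM-Optical-Encoder",
    filename="qwen3_vl_2b.pth",
    local_dir="adapters",
)
print(adapter_path)  # adapters/qwen3_vl_2b.pth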

3. Test with Pre-trained Adapter

# Quick test (10 samples)
python test.py \
    --vlm_model_path Qwen/Qwen3-VL-2B-Instruct \
    --adapter_checkpoint adapters/qwen3_vl_2b.pth \
    --benchmark longbench \
    --num_samples 10

# Full benchmark (50 samples)
python test.py \
    --vlm_model_path Qwen/Qwen3-VL-2B-Instruct \
    --adapter_checkpoint adapters/qwen3_vl_2b.pth \
    --benchmark longbench \
    --num_samples 50

This runs the benchmark and reports results comparing optical compression against native text processing.

4. Train Your Own Adapter (Optional)

python train.py \
    --vlm_model_path Qwen/Qwen3-VL-2B-Instruct \
    --target_dim 2048 \
    --num_samples 1000 \
    --num_epochs 10

Training time: ~2-3 hours on RTX 5070 12GB

📊 Results (LongBench v2, 50 samples)

Tested on: Qwen3-VL-2B-Instruct, RTX 5070 12GB

| Metric               | Native Text  | Optical Compression    |
|----------------------|--------------|------------------------|
| Overall Score        | 12% (6/50)   | 18% (9/50) ✅          |
| Success Rate         | 22% (11/50)  | 90% (45/50) ✅         |
| Accuracy (completed) | 54.5% (6/11) | 20% (9/45)             |
| Avg Tokens           | 38K          | 17K (2.2× compression) |
| Avg Time             | 6s           | 24s (4× slower)        |

Key Finding: When failures (OOM/context exceeded) are counted as wrong answers, optical compression scores 6 percentage points higher overall (18% vs 12%) because it completes 90% of samples versus native's 22%. Native is more accurate on the samples it completes but fails on 78% of long documents.

🔧 Methodology

DeepSeek-OCR (Original)

DeepSeek-OCR uses DeepEncoder (SAM-ViT-B + CLIP-L + Projector) for optical character recognition:

┌───────────────────────────────────────────────────────────────┐
│                    DeepSeek-OCR Pipeline                      │
└───────────────────────────────────────────────────────────────┘

Text Document
     ↓
Render to Images (1024×1024)
     ↓
┌────────────────────────────────────┐
│   DeepEncoder (401M params)        │
│   ├── SAM-ViT-B (95M)              │
│   ├── CLIP-L (303M)                │
│   └── Projector (2.6M)             │
│   Output: [N, 1280]                │
└────────────────────────────────────┘
     ↓
DeepSeek VLM (proprietary)
     ↓
Response

Limitation: DeepEncoder outputs are designed for DeepSeek's proprietary VLM.

My Approach: Adapter for Qwen3-VL-2B

A lightweight adapter bridges DeepEncoder to Qwen3-VL-2B:

┌───────────────────────────────────────────────────────────────┐
│            Optical Compression for Qwen3-VL-2B                │
└───────────────────────────────────────────────────────────────┘

Text Document
     ↓
Render to Images (1024×1024, 10pt font)
     ↓
┌────────────────────────────────────┐
│   DeepEncoder (401M) [FROZEN]      │
│   Auto-downloads from HuggingFace  │
│   Output: [N, 1280]                │
└────────────────────────────────────┘
     ↓
┌────────────────────────────────────┐
│   Adapter (10.6M) [TRAINABLE]      │  ← Our contribution
│   ├── MLP: 1280 → 3072 → 2048      │
│   ├── Page Embeddings (200 pages)  │
│   └── Layer Norm                   │
└────────────────────────────────────┘
     ↓
Qwen3-VL-2B (2048 dims)
     ↓
Response

Key idea: Freeze DeepEncoder and train only a lightweight adapter (10.6M params) that aligns its outputs with Qwen3-VL-2B's 2048-dimensional embedding space.
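
For illustration, a minimal PyTorch sketch of an adapter with this shape. Layer names and the exact ordering are assumptions, not the actual contents of optical_encoder.py; the parameter count roughly matches the 10.6M reported above.

# Illustrative adapter sketch (not the repo's exact optical_encoder.py):
# maps DeepEncoder tokens (1280-dim) into Qwen3-VL-2B's 2048-dim embedding space.
import torch
import torch.nn as nn

class OpticalAdapter(nn.Module):
    def __init__(self, in_dim=1280, hidden_dim=3072, out_dim=2048, max_pages=200):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )
        self.page_embed = nn.Embedding(max_pages, out_dim)  # which page a token came from
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, vision_tokens, page_ids):
        # vision_tokens: [batch, n_tokens, 1280] from the frozen DeepEncoder
        # page_ids:      [batch, n_tokens] integer page index per token
        x = self.mlp(vision_tokens) + self.page_embed(page_ids)
        return self.norm(x)  # [batch, n_tokens, 2048], fed to Qwen3-VL-2B

adapter = OpticalAdapter()
print(sum(p.numel() for p in adapter.parameters()) / 1e6, "M params")  # ~10.6M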

💡 Training Details

Training Dataset

  • Source: Wikipedia (20220301.en)
  • Samples: 1000 documents
  • Length: 5K-100K characters per document (1-6 pages)
  • Rendering: Black text on white background, 10pt monospace font (see the sketch below)
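
A minimal Pillow sketch of this rendering step. The font path, font size conversion, and margins are assumptions; the repo's actual renderer may differ.

# Sketch: render a text chunk as black monospace text on a white 1024x1024 page.
# DejaVuSansMono.ttf is an assumption; substitute any monospace .ttf on your system.
from PIL import Image, ImageDraw, ImageFont

def render_page(text: str, size: int = 1024) -> Image.Image:
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("DejaVuSansMono.ttf", 13)  # ~10pt at 96 DPI
    draw.multiline_text((16, 16), text, fill="black", font=font)
    return img

render_page("Example Wikipedia paragraph...\n" * 40).save("page_000.png")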

Training Process

  1. Load Qwen3-VL-2B (frozen)
  2. Load DeepEncoder from HuggingFace (frozen, 401M params)
  3. Generate 1000 Wikipedia documents, render to images
  4. Train adapter to align DeepEncoder outputs (1280-dim) with Qwen3-VL text embeddings (2048-dim)
  5. Loss: MSE between optical vision tokens and native text embeddings (sketched below)
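
A minimal sketch of that alignment step, assuming mean-pooled sequence summaries and an already-constructed optimizer; train.py may align at the token level instead.

# Sketch of the MSE alignment objective between adapted optical tokens and the
# frozen VLM's native text embeddings. The pooling choice is an assumption.
import torch.nn.functional as F

def training_step(adapter, optimizer, optical_tokens, page_ids, text_embeds):
    # optical_tokens: [B, N, 1280] from the frozen DeepEncoder
    # text_embeds:    [B, M, 2048] from Qwen3-VL-2B's frozen embedding layer
    adapted = adapter(optical_tokens, page_ids)  # [B, N, 2048]
    loss = F.mse_loss(adapted.mean(dim=1), text_embeds.mean(dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()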

Training Stats

  • Time: 2-3 hours on RTX 5070 12GB
  • Trainable params: 10.6M (adapter only)
  • Loss reduction: 87% (1.17 → 0.14)
  • GPU memory: ~5GB (2B model + DeepEncoder + adapter)

📈 Performance Analysis

Token Compression

  • Average: 2.2× token savings
  • Per page: ~256 tokens (optical) vs ~4096 tokens (native)
  • Benefit: Enables processing of 100+ page documents that exceed context limits (see the sketch below)
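
A back-of-the-envelope sketch with the per-page figures above; the observed LongBench averages were 38K vs 17K tokens (2.2×).

# Rough token budget per document length, using the per-page figures above.
def token_budget(pages: int, optical_per_page: int = 256, native_per_page: int = 4096):
    return {"optical": pages * optical_per_page, "native": pages * native_per_page}

print(token_budget(100))  # {'optical': 25600, 'native': 409600} -> only optical fits a 32K window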

Speed vs Context Tradeoff

  • Optical: 24s per document (slower) but 90% success rate
  • Native: 6s per document (faster) but 22% success rate on long documents
  • Tradeoff: 4× slower processing in exchange for handling documents far beyond the native context window

Why Optical Wins Overall

Native text has higher accuracy (54.5%) on the few samples it completes, but fails on 78% of long documents due to OOM/context limits. Optical has lower per-sample accuracy (20%) but successfully processes 90% of samples, which yields the better overall score (18% vs 12%).
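
The overall scores follow directly from accuracy on completed samples times success rate, with failures counted as wrong:

# Worked check: overall score = accuracy on completed samples x success rate.
native_overall  = 0.545 * 0.22   # ~0.12 -> 12% overall (6/50)
optical_overall = 0.20 * 0.90    #  0.18 -> 18% overall (9/50)
print(native_overall, optical_overall)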

πŸ” When to Use Optical Compression

✅ Use Optical When:

  • Documents exceed 32K token context window
  • Processing 50+ page documents
  • Context window exhaustion is the bottleneck
  • 4× slower processing is acceptable

❌ Use Native Text When:

  • Short documents (< 10 pages)
  • Real-time processing required
  • Speed is critical
  • Documents fit comfortably in the context window (a small routing sketch follows)
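
A hypothetical routing heuristic capturing these rules; the function and thresholds are illustrative and not part of the repo.

# Hypothetical routing heuristic based on the guidance above (illustrative only).
def choose_encoding(token_count: int, page_count: int,
                    latency_critical: bool = False, context_limit: int = 32_000) -> str:
    if latency_critical:
        return "native"
    if token_count > context_limit or page_count >= 50:
        return "optical"
    return "native"

print(choose_encoding(token_count=38_000, page_count=60))  # optical
print(choose_encoding(token_count=8_000, page_count=5))    # native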

πŸ“ Repository Structure

VLM-Optical-Encoder/
├── deepencoder.py             # DeepEncoder architecture (from DeepSeek-OCR)
├── optical_encoder.py         # Adapter + integration code
├── train.py                   # Training script
├── test.py                    # Testing script (LongBench support)
├── requirements.txt           # Dependencies (CUDA 12.8)
├── configs/
│   └── qwen3_vl.yaml          # Training configuration
├── adapters/                  # Download from HuggingFace
│   └── qwen3_vl_2b.pth        # Pre-trained adapter (41MB)
├── .gitignore
├── LICENSE                    # MIT License
└── README.md

Download pre-trained adapter: Volkopat/Qwen-VLM-Optical-Encoder

⚠️ Disclaimer

This is experimental research code. Scalability has not been tested beyond the reported benchmarks, and performance on diverse tasks and production workloads is untested. Use at your own risk. I lack the hardware to scale this to 4B, 8B, 30B, and 235B models, so I would appreciate it if someone could test it beyond 2B.

πŸ™ Credits

This work builds on:

  • DeepSeek-OCR (2025) by DeepSeek-AI for DeepEncoder architecture
  • Qwen3-VL-2B-Instruct by Alibaba for the vision-language model
  • LongBench v2 benchmark for evaluation
  • Claude Code by Anthropic for development assistance

Model Weights:

  • Volkopat/DeepSeek-DeepEncoder (DeepEncoder weights)
  • Volkopat/Qwen-VLM-Optical-Encoder (pre-trained adapter)

📊 Citation

@software{optical_compression_qwen,
  title = {Optical Compression for Qwen3-VL via Universal Adapter},
  year = {2025},
  note = {Built on DeepSeek-OCR by DeepSeek-AI}
}

@misc{deepseek_ocr,
  title = {DeepSeek-OCR},
  author = {DeepSeek-AI},
  year = {2025},
  url = {https://huggingface.co/deepseek-ai/DeepSeek-OCR}
}

@misc{qwen3vl,
  title = {Qwen3-VL-2B-Instruct},
  author = {Qwen Team, Alibaba},
  year = {2024},
  url = {https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct}
}

πŸ“ License

  • DeepEncoder: Apache-2.0 (from DeepSeek-AI/DeepSeek-OCR)
  • Adapter code: MIT License

Last Updated: 2025-10-23
Tested On: RTX 5070 12GB, CUDA 12.8, Qwen3-VL-2B-Instruct
