Adapter-based optical compression for long documents using DeepSeek's DeepEncoder with Qwen3-VL-2B.
Built on DeepSeek-OCR by DeepSeek-AI. This repo includes deepencoder.py from DeepSeek-OCR (Apache-2.0 License).
```bash
pip install -r requirements.txt

# DeepEncoder weights (401M params) auto-download on first use
# and are cached in: ~/.cache/huggingface/hub/models--Volkopat--DeepSeek-DeepEncoder

# Download the pre-trained adapter (41MB)
mkdir -p adapters
wget https://huggingface.co/Volkopat/Qwen-VLM-Optical-Encoder/resolve/main/qwen3_vl_2b.pth -O adapters/qwen3_vl_2b.pth
```

Model weights:
- DeepEncoder: Volkopat/DeepSeek-DeepEncoder
- Adapter: Volkopat/Qwen-VLM-Optical-Encoder
```bash
# Quick test (10 samples)
python test.py \
    --vlm_model_path Qwen/Qwen3-VL-2B-Instruct \
    --adapter_checkpoint adapters/qwen3_vl_2b.pth \
    --benchmark longbench \
    --num_samples 10

# Full benchmark (50 samples)
python test.py \
    --vlm_model_path Qwen/Qwen3-VL-2B-Instruct \
    --adapter_checkpoint adapters/qwen3_vl_2b.pth \
    --benchmark longbench \
    --num_samples 50
```

This runs the benchmark and shows the results comparing optical compression against native text processing.
```bash
python train.py \
    --vlm_model_path Qwen/Qwen3-VL-2B-Instruct \
    --target_dim 2048 \
    --num_samples 1000 \
    --num_epochs 10
```

Training time: ~2-3 hours on an RTX 5070 12GB.
Tested on: Qwen3-VL-2B-Instruct, RTX 5070 12GB.
| Metric | Native Text | Optical Compression |
|---|---|---|
| Overall Score | 12% (6/50) | 18% (9/50) ✓ |
| Success Rate | 22% (11/50) | 90% (45/50) ✓ |
| Accuracy (completed) | 54.5% (6/11) | 20% (9/45) |
| Avg Tokens | 38K | 17K (2.2× compression) |
| Avg Time | 6s | 24s (4× slower) |
Key Finding: When failures (OOM/context exceeded) are counted as wrong answers, optical compression scores 6 points higher overall (18% vs 12%) because it completes 90% of samples versus native's 22%. Native text is more accurate on the samples it completes, but it fails on 78% of long documents.
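A minimal sketch of how those two headline numbers relate (the helper name `scores` is ours for illustration, not part of this repo):

```python
# A failed sample (OOM / context exceeded) scores 0, so:
#   overall score = correct / total,  accuracy = correct / completed
def scores(correct, completed, total):
    return correct / total, correct / completed

native_overall, native_acc = scores(correct=6, completed=11, total=50)
optical_overall, optical_acc = scores(correct=9, completed=45, total=50)

print(f"native : overall {native_overall:.0%}, accuracy {native_acc:.1%}")   # 12%, 54.5%
print(f"optical: overall {optical_overall:.0%}, accuracy {optical_acc:.1%}")  # 18%, 20.0%
```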
DeepSeek-OCR uses DeepEncoder (SAM-ViT-B + CLIP-L + Projector) for optical character recognition:
```
┌────────────────────────────────────────────────────────────────┐
│                      DeepSeek-OCR Pipeline                     │
└────────────────────────────────────────────────────────────────┘

Text Document
      ↓
Render to Images (1024×1024)
      ↓
┌──────────────────────────────────────┐
│  DeepEncoder (401M params)           │
│   ├── SAM-ViT-B (95M)                │
│   ├── CLIP-L (303M)                  │
│   └── Projector (2.6M)               │
│  Output: [N, 1280]                   │
└──────────────────────────────────────┘
      ↓
DeepSeek VLM (proprietary)
      ↓
Response
```
Limitation: DeepEncoder outputs are designed for DeepSeek's proprietary VLM.
We add a lightweight adapter to bridge DeepEncoder to Qwen3-VL-2B:
```
┌────────────────────────────────────────────────────────────────┐
│              Optical Compression for Qwen3-VL-2B               │
└────────────────────────────────────────────────────────────────┘

Text Document
      ↓
Render to Images (1024×1024, 10pt font)
      ↓
┌──────────────────────────────────────┐
│  DeepEncoder (401M)      [FROZEN]    │
│  Auto-downloads from HuggingFace     │
│  Output: [N, 1280]                   │
└──────────────────────────────────────┘
      ↓
┌──────────────────────────────────────┐
│  Adapter (10.6M)      [TRAINABLE]    │  ← Our contribution
│   ├── MLP: 1280 → 3072 → 2048        │
│   ├── Page Embeddings (200 pages)    │
│   └── Layer Norm                     │
└──────────────────────────────────────┘
      ↓
Qwen3-VL-2B (2048 dims)
      ↓
Response
```
Key idea: Freeze DeepEncoder and train only a lightweight adapter (10.6M params) to align its outputs with Qwen3-VL-2B's 2048-dimensional embedding space.
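Reading off the diagram, a minimal PyTorch sketch of such an adapter could look like the following (the class name, GELU activation, and the way page indices are passed are assumptions; the actual module lives in optical_encoder.py):

```python
import torch
import torch.nn as nn

class OpticalAdapter(nn.Module):
    """Illustrative sketch: 2-layer MLP (1280 -> 3072 -> 2048), learned page
    embeddings for up to 200 pages, and a final LayerNorm (~10.6M params)."""
    def __init__(self, in_dim=1280, hidden_dim=3072, out_dim=2048, max_pages=200):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),                      # activation choice is an assumption
            nn.Linear(hidden_dim, out_dim),
        )
        self.page_embed = nn.Embedding(max_pages, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, vision_tokens, page_ids):
        # vision_tokens: [B, N, 1280] from the frozen DeepEncoder
        # page_ids:      [B, N] index of the page each token came from
        x = self.mlp(vision_tokens) + self.page_embed(page_ids)
        return self.norm(x)                 # [B, N, 2048] in Qwen3-VL-2B space

adapter = OpticalAdapter()
print(sum(p.numel() for p in adapter.parameters()) / 1e6, "M params")  # ~10.6M
```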
- Source: Wikipedia (20220301.en)
- Samples: 1000 documents
- Length: 5K-100K characters per document (1-6 pages)
- Rendering: Black text on white background, 10pt monospace font
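The rendering step above could look roughly like this (a sketch only: the font path, wrapping logic, and margins are assumptions, not the repo's exact renderer):

```python
from PIL import Image, ImageDraw, ImageFont

def render_pages(text, page_size=(1024, 1024), font_size=10, margin=20):
    """Render a document as fixed-size pages: black 10pt monospace on white."""
    font = ImageFont.truetype("DejaVuSansMono.ttf", font_size)  # assumed monospace font
    line_height = font_size + 2
    chars_per_line = int((page_size[0] - 2 * margin) / (font_size * 0.6))  # rough char width
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)]
    lines_per_page = (page_size[1] - 2 * margin) // line_height

    pages = []
    for start in range(0, len(lines), lines_per_page):
        page = Image.new("RGB", page_size, "white")
        draw = ImageDraw.Draw(page)
        for row, line in enumerate(lines[start:start + lines_per_page]):
            draw.text((margin, margin + row * line_height), line, fill="black", font=font)
        pages.append(page)
    return pages
```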
- Load Qwen3-VL-2B (frozen)
- Load DeepEncoder from HuggingFace (frozen, 401M params)
- Generate 1000 Wikipedia documents, render to images
- Train adapter to align DeepEncoder outputs (1280-dim) with Qwen3-VL text embeddings (2048-dim)
- Loss: MSE between optical vision tokens and native text embeddings (sketched after this list)
- Time: 2-3 hours on RTX 5070 12GB
- Trainable params: 10.6M (adapter only)
- Loss reduction: 87% (1.17 → 0.14)
- GPU memory: ~5GB (2B model + DeepEncoder + adapter)
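The alignment objective above can be sketched roughly as follows (function and argument names are placeholders rather than the actual train.py API, and the question of how optical and text token counts are matched is glossed over here):

```python
import torch
import torch.nn.functional as F

def train_step(adapter, optimizer, deepencoder, text_embed, pages, page_ids, token_ids):
    """Illustrative training step: frozen DeepEncoder and Qwen3-VL text
    embeddings, trainable adapter, MSE alignment loss. Assumes the optical
    and text token sequences have already been aligned to the same length."""
    with torch.no_grad():
        vision_tokens = deepencoder(pages)     # [B, N, 1280]  frozen DeepEncoder
        target = text_embed(token_ids)         # [B, N, 2048]  frozen Qwen3-VL embeddings

    pred = adapter(vision_tokens, page_ids)    # [B, N, 2048]  adapter is the only trainable part
    loss = F.mse_loss(pred, target)            # align optical tokens with text embeddings

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```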
- Average: 2.2× token savings
- Formula: ~256 tokens/page (optical) vs ~4096 tokens/page (native); see the back-of-the-envelope sketch after this list
- Benefit: Enables processing of 100+ page documents that exceed context limits
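A quick back-of-the-envelope using the theoretical per-page figures above (the measured average on the benchmark was a more modest 2.2×):

```python
OPTICAL_TOKENS_PER_PAGE = 256
NATIVE_TOKENS_PER_PAGE = 4096
CONTEXT_WINDOW = 32_768  # the 32K context window mentioned below

for pages in (10, 50, 100):
    native = pages * NATIVE_TOKENS_PER_PAGE
    optical = pages * OPTICAL_TOKENS_PER_PAGE
    print(f"{pages:>3} pages: native ~{native:,} tokens "
          f"({'exceeds' if native > CONTEXT_WINDOW else 'fits in'} 32K context), "
          f"optical ~{optical:,} tokens")
```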
- Optical: 24s per document (slower) but 90% success rate
- Native: 6s per document (faster) but 22% success rate on long documents
- Tradeoff: 4× slower processing in exchange for handling documents that would otherwise exceed the context window
Native text has higher accuracy (54.5%) on the few samples it completes, but fails on 78% of long documents due to OOM/context limits. Optical has lower accuracy (20%) but successfully processes 90% of samples, resulting in a better overall score (18% vs 12%).
- Documents exceed 32K token context window
- Processing 50+ page documents
- Context window exhaustion is the bottleneck
- 4× slower processing is acceptable
- Short documents (< 10 pages)
- Real-time processing required
- Speed is critical
- Documents fit comfortably in context window
```
VLM-Optical-Encoder/
├── deepencoder.py       # DeepEncoder architecture (from DeepSeek-OCR)
├── optical_encoder.py   # Adapter + integration code
├── train.py             # Training script
├── test.py              # Testing script (LongBench support)
├── requirements.txt     # Dependencies (CUDA 12.8)
├── configs/
│   └── qwen3_vl.yaml    # Training configuration
├── adapters/            # Download from HuggingFace
│   └── qwen3_vl_2b.pth  # Pre-trained adapter (41MB)
├── .gitignore
├── LICENSE              # MIT License
└── README.md
```
Download pre-trained adapter: Volkopat/Qwen-VLM-Optical-Encoder
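As an alternative to the wget command in the setup section, the checkpoint can also be fetched programmatically (a sketch using huggingface_hub; the repo id and filename come from the link above):

```python
from huggingface_hub import hf_hub_download

# Download and cache the pre-trained adapter checkpoint (41MB)
adapter_path = hf_hub_download(
    repo_id="Volkopat/Qwen-VLM-Optical-Encoder",
    filename="qwen3_vl_2b.pth",
)
print(adapter_path)  # local cached path, e.g. under ~/.cache/huggingface/hub
```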
This is experimental research code. Scalability has not been tested beyond the reported benchmarks, and performance on diverse tasks and production workloads is untested. Use at your own risk. I lack the hardware to scale this up to the 4B, 8B, 30B, and 235B models, so I would appreciate it if someone could test beyond 2B.
This work builds on:
- DeepSeek-OCR (2025) by DeepSeek-AI for DeepEncoder architecture
- Qwen3-VL-2B-Instruct by Alibaba for the vision-language model
- LongBench v2 benchmark for evaluation
- Claude Code by Anthropic for development assistance
Model Weights:
- DeepEncoder (401M params): Volkopat/DeepSeek-DeepEncoder
- Adapter (10.6M params): Volkopat/Qwen-VLM-Optical-Encoder
```bibtex
@software{optical_compression_qwen,
  title = {Optical Compression for Qwen3-VL via Universal Adapter},
  year = {2025},
  note = {Built on DeepSeek-OCR by DeepSeek-AI}
}

@misc{deepseek_ocr,
  title = {DeepSeek-OCR},
  author = {DeepSeek-AI},
  year = {2025},
  url = {https://huggingface.co/deepseek-ai/DeepSeek-OCR}
}

@misc{qwen3vl,
  title = {Qwen3-VL-2B-Instruct},
  author = {Qwen Team, Alibaba},
  year = {2024},
  url = {https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct}
}
```

- DeepEncoder: Apache-2.0 (from DeepSeek-AI/DeepSeek-OCR)
- Adapter code: MIT License
- Adapter code: MIT License
Last Updated: 2025-10-23
Tested On: RTX 5070 12GB, CUDA 12.8, Qwen3-VL-2B-Instruct