A tiny, hackable Vision-Language Model implementation. No Trainer, no PEFT,
no mystery wrappers — just ~1.5k lines of PyTorch so you can actually read
the forward pass.
Built for teaching myself how VLMs fit together; if it's useful to you, great.

    CLIP / SigLIP ─► projector ─► decoder LM (Qwen2.5-0.5B)
      (frozen)    (MLP | QFormer)     (full-FT or LoRA)
- Two vision encoders: CLIP and SigLIP (docs/swapping_encoders.md).
- Two projectors: a 2-layer MLP (LLaVA-style) and a learnable-queries QueryPooler (baby QFormer).
- Plain-torch training loop with `accelerate` for bf16 + multi-GPU.
- A minimal LoRA implementation (~100 lines). Swap in `peft` if you need quantisation or nested adapters.
- Chat, captioning, and a small VQA benchmark script.
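
To make the two projector options concrete, here is a minimal sketch of what an LLaVA-style MLP and a learnable-queries pooler can look like. This is not the repo's code: the class names, the dimensions (768 in, 896 out), the GELU activation, the residual + LayerNorm, and the head count are all assumptions picked for illustration.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """2-layer MLP mapping each vision patch embedding to the LM hidden size."""

    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.net(patches)


class QueryPooler(nn.Module):
    """Learnable queries cross-attend to the patches, giving a fixed token count."""

    def __init__(self, vision_dim: int, lm_dim: int,
                 num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vision_dim)
        self.out = nn.Linear(vision_dim, lm_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # Share the same queries across the batch: (batch, num_queries, vision_dim)
        q = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        pooled, _ = self.attn(q, patches, patches)  # queries attend over patches
        return self.out(self.norm(pooled + q))      # (batch, num_queries, lm_dim)


patches = torch.randn(2, 196, 768)  # illustrative: 196 patches of width 768
mlp_tokens = MLPProjector(768, 896)(patches)
pooled_tokens = QueryPooler(768, 896)(patches)
print(mlp_tokens.shape, pooled_tokens.shape)
```

The practical difference is the sequence length handed to the decoder: the MLP keeps one token per patch (196 here), while the pooler always emits `num_queries` tokens (32 here) regardless of image resolution.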
pip install -e .
# full-FT
accelerate launch scripts/train.py configs/default.yaml
# LoRA
accelerate launch scripts/train.py configs/lora.yaml
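
For readers who have not seen LoRA without `peft`, the idea fits in a few dozen lines. The sketch below is a generic version, not the repo's implementation: `LoRALinear`, `add_lora`, the `q_proj` / `v_proj` target names (Hugging Face style attention naming), and `alpha=16` are assumptions.

```python
import math
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapter trains
        self.scale = alpha / r
        self.lora_a = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + b + scale * B (A x). B starts at zero, so the wrapped
        # layer initially reproduces the base layer exactly.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


def add_lora(model: nn.Module, targets=("q_proj", "v_proj"), r: int = 8) -> nn.Module:
    """Freeze the model, then wrap every nn.Linear whose name is in `targets`."""
    for p in model.parameters():
        p.requires_grad_(False)
    for module in list(model.modules()):  # snapshot, since we mutate children
        for name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and name in targets:
                setattr(module, name, LoRALinear(child, r=r))
    return model


class ToyAttention(nn.Module):
    """Stand-in for one attention block, only used to exercise add_lora."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)


toy = ToyAttention()
x = torch.randn(4, 64)
before = toy.q_proj(x)
add_lora(toy)
after = toy.q_proj(x)
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
print(torch.allclose(before, after), trainable)
```

On the toy block, two wrapped layers at `r=8` give 2 × (8×64 + 64×8) = 2048 trainable parameters, and the output is unchanged at initialisation because `lora_b` is zero.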
# caption a single image
python scripts/caption.py cat.jpg --checkpoint outputs/run0/final/model.safetensors
# VQA-style single-question
python scripts/vqa.py cat.jpg "What is the cat doing?" \
    --lora outputs/lora0/adapter.pt

See results/README.md. TL;DR on my 3090:

| config | trainable params | VQAv2 EM | VRAM |
|---|---|---|---|
| MLP full-FT | 0.6B | 0.412 | 22 GB |
| QueryPooler 32 | 0.6B | 0.398 | 22 GB |
| MLP + LoRA r=8 | 8.4M | 0.377 | 13 GB |
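
A note on reading the EM column: a plain exact-match score can be computed as below. This is a sketch under assumptions, not the benchmark script itself. It assumes one reference answer per question and a simple normalization (lowercase, strip punctuation, collapse whitespace); the official VQAv2 accuracy is a softer score over ten human answers, so the two are not directly comparable.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# "two" vs "2" counts as a miss: this normalization does not map number words.
score = exact_match(["Sleeping.", "two", "a dog"], ["sleeping", "2", "A  dog"])
print(f"EM = {score:.3f}")
```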
Reading LLaVA / BLIP-2 / InternVL code is an "expand 30 layers of wrappers" exercise. I wanted something I could actually step through in a debugger, including the image-token splicing. Pedagogical first, SOTA... nowhere near.
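
Since the image-token splicing is the step that tends to get buried, here is a stripped-down sketch of the idea: the prompt carries one placeholder token, and its embedding is swapped for the projector's output before the decoder runs. The function name, the placeholder id, and the one-image-per-sequence restriction are assumptions made for the example; the repo's version may differ.

```python
import torch
import torch.nn as nn

IMAGE_TOKEN_ID = 0    # hypothetical placeholder id; a real tokenizer would reserve one
IGNORE_INDEX = -100   # default ignore_index of torch.nn.functional.cross_entropy


def splice_image_tokens(input_ids, labels, text_embed, image_embeds):
    """Replace the single image placeholder in each row with projected image tokens.

    input_ids:    (batch, seq_len) token ids, exactly one IMAGE_TOKEN_ID per row
    labels:       (batch, seq_len) next-token targets
    text_embed:   the decoder's nn.Embedding
    image_embeds: (batch, num_image_tokens, hidden) projector output
    """
    rows_embeds, rows_labels = [], []
    for ids, lab, img in zip(input_ids, labels, image_embeds):
        pos = (ids == IMAGE_TOKEN_ID).nonzero(as_tuple=True)[0]
        assert pos.numel() == 1, "this sketch handles one image per sequence"
        p = pos.item()
        before, after = text_embed(ids[:p]), text_embed(ids[p + 1:])
        rows_embeds.append(torch.cat([before, img, after], dim=0))
        # Image positions have no text target, so they are masked out of the loss.
        img_labels = torch.full((img.size(0),), IGNORE_INDEX,
                                dtype=lab.dtype, device=lab.device)
        rows_labels.append(torch.cat([lab[:p], img_labels, lab[p + 1:]], dim=0))
    # Rows share a length because each swaps exactly one token for the same
    # number of image tokens.
    return torch.stack(rows_embeds), torch.stack(rows_labels)


embed = nn.Embedding(100, 16)
input_ids = torch.tensor([[5, 0, 7, 8], [0, 3, 4, 9]])
labels = input_ids.clone()
image_embeds = torch.randn(2, 32, 16)
inputs_embeds, new_labels = splice_image_tokens(input_ids, labels, embed, image_embeds)
print(inputs_embeds.shape, new_labels.shape)
```

Each 4-token row grows to 4 - 1 + 32 = 35 positions. With a Hugging Face style decoder, the result would typically be passed as `inputs_embeds=` instead of `input_ids=`, along with a matching attention mask.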
MIT.