HIDUSB/LightVLM

LightVLM 🔦

A tiny, hackable Vision-Language Model implementation. No Trainer, no PEFT, no mystery wrappers — just ~1.5k lines of PyTorch so you can actually read the forward pass.

Built for teaching myself how VLMs fit together; if it's useful to you, great.

What's in the box

  CLIP / SigLIP  ─►  projector  ─►  decoder LM (Qwen2.5-0.5B)
    (frozen)       (MLP | QFormer)      (full-FT or LoRA)
  • Two vision encoders: CLIP and SigLIP (docs/swapping_encoders.md).
  • Two projectors: a 2-layer MLP (LLaVA-style) and a learnable-queries QueryPooler (baby QFormer).
  • Plain-torch training loop with accelerate for bf16 + multi-GPU.
  • A minimal LoRA implementation (~100 lines). Swap in peft if you need quantisation or nested adapters.
  • Chat, captioning, and a small VQA benchmark script.
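To make the full-FT versus LoRA trade-off in the list above concrete, here is a small arithmetic sketch of how many trainable parameters a low-rank adapter adds to one frozen linear layer. The layer dimensions are illustrative placeholders, not values read from this repo's configs, and the helper names are hypothetical.

```python
def linear_params(d_in: int, d_out: int, bias: bool = False) -> int:
    """Trainable parameters in a dense linear layer."""
    return d_in * d_out + (d_out if bias else 0)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one frozen linear layer.

    LoRA learns two small matrices, A (rank x d_in) and B (d_out x rank),
    while the base weight stays frozen.
    """
    return rank * d_in + d_out * rank


# Illustrative dimensions only (assumption, not taken from the repo).
d_in, d_out, rank = 1024, 1024, 8
full = linear_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Summing `lora_params` over every adapted layer gives the adapter's total trainable count; which layers this repo adapts is set by its own config and is not assumed here.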

Quickstart

pip install -e .

# full-FT
accelerate launch scripts/train.py configs/default.yaml

# LoRA
accelerate launch scripts/train.py configs/lora.yaml

# caption a single image
python scripts/caption.py cat.jpg --checkpoint outputs/run0/final/model.safetensors

# VQA-style single-question
python scripts/vqa.py cat.jpg "What is the cat doing?" \
    --lora outputs/lora0/adapter.pt

Results

See results/README.md. TL;DR on my 3090:

projector          trainable   VQAv2 EM   VRAM
MLP full-FT        0.6B        0.412      22 GB
QueryPooler 32     0.6B        0.398      22 GB
MLP + LoRA r=8     8.4M        0.377      13 GB
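For readers unfamiliar with the EM column, the sketch below shows one common way to compute an exact-match score. It is an assumption about the general technique, not the benchmark script in this repo: the normalisation rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the single-reference setup are illustrative choices, and `normalize` / `exact_match` are hypothetical names.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalisation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Toy data, made up for illustration.
preds = ["Sleeping.", "a dog", "two"]
refs = ["sleeping", "cat", "Two"]
print(exact_match(preds, refs))
```

VQAv2 provides several human answers per question, so a real evaluation has to decide how to pick or aggregate references; this sketch sidesteps that by assuming one reference per question.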

Why this exists

Reading LLaVA / BLIP-2 / InternVL code is an "expand 30 layers of wrappers" exercise. I wanted something I could actually step through in a debugger, including the image-token splicing. Pedagogical first, SOTA... nowhere near.
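As a rough picture of what image-token splicing means, here is a minimal pure-Python sketch, assuming a LLaVA-style scheme in which one placeholder token in the prompt is replaced by the whole run of projected image embeddings. The placeholder id, the function name, and the list-of-floats representation are all hypothetical; the real code works on tensors and also has to keep labels and attention masks aligned.

```python
from typing import List, Sequence

Vector = List[float]


def splice_image_tokens(
    token_ids: Sequence[int],
    text_embeds: Sequence[Vector],
    image_embeds: Sequence[Vector],
    image_token_id: int,
) -> List[Vector]:
    """Replace each placeholder position with the full run of image embeddings."""
    if len(token_ids) != len(text_embeds):
        raise ValueError("token_ids and text_embeds must have equal length")
    spliced: List[Vector] = []
    for tok, emb in zip(token_ids, text_embeds):
        if tok == image_token_id:
            # One placeholder expands into len(image_embeds) positions.
            spliced.extend(image_embeds)
        else:
            spliced.append(emb)
    return spliced


# Hypothetical placeholder id and toy 2-dim embeddings, for illustration only.
IMAGE_TOKEN_ID = -1
token_ids = [101, IMAGE_TOKEN_ID, 102, 103]
text_embeds = [[float(i)] * 2 for i in range(len(token_ids))]
image_embeds = [[9.0, 9.0] for _ in range(3)]

spliced = splice_image_tokens(token_ids, text_embeds, image_embeds, IMAGE_TOKEN_ID)
print(len(spliced))
```

The sequence grows by the number of image embeddings minus one per placeholder, which is why position ids and loss masks need to be rebuilt after splicing rather than reused from the text-only prompt.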

License

MIT.
