LightVLM 🔦

A tiny, hackable Vision-Language Model implementation. No Trainer, no PEFT, no mystery wrappers — just ~1.5k lines of PyTorch so you can actually read the forward pass.

Built for teaching myself how VLMs fit together; if it's useful to you, great.

What's in the box

  CLIP / SigLIP  ─►  projector  ─►  decoder LM (Qwen2.5-0.5B)
    (frozen)       (MLP | QFormer)      (full-FT or LoRA)
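
To make the projector stage of the diagram concrete, here is a minimal sketch of the two options. It is not the repo's actual code: the class signatures, the head count, and the dimensions (768-d patches, 896-d LM embeddings, 196 patches) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Two-layer MLP applied to every patch token independently (LLaVA-style)."""

    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.net(patches)


class QueryPooler(nn.Module):
    """Learnable queries cross-attend to the patches, giving a fixed token count."""

    def __init__(self, vision_dim: int, lm_dim: int,
                 num_queries: int = 32, num_heads: int = 4):
        super().__init__()
        # One learnable vector per output token (hypothetical init scale).
        self.queries = nn.Parameter(torch.randn(num_queries, lm_dim) * 0.02)
        self.in_proj = nn.Linear(vision_dim, lm_dim)
        self.attn = nn.MultiheadAttention(lm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(lm_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        kv = self.in_proj(patches)                                      # (B, P, lm_dim)
        q = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)   # (B, Q, lm_dim)
        pooled, _ = self.attn(q, kv, kv)                                # (B, Q, lm_dim)
        return self.norm(pooled + q)


patches = torch.randn(2, 196, 768)  # stand-in for the frozen encoder's output
mlp_out = MLPProjector(768, 896)(patches)
pool_out = QueryPooler(768, 896, num_queries=32)(patches)
print(mlp_out.shape, pool_out.shape)
```

The MLP keeps one LM token per patch, while the pooler always emits `num_queries` tokens regardless of how many patches the encoder produces, which shortens the sequence the decoder has to attend over.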

Two vision encoders: CLIP and SigLIP (docs/swapping_encoders.md).
Two projectors: a 2-layer MLP (LLaVA-style) and a learnable-queries QueryPooler (baby QFormer).
Plain-torch training loop with accelerate for bf16 + multi-GPU.
A minimal LoRA implementation (~100 lines). Swap in peft if you need quantisation or nested adapters.
Chat, captioning, and a small VQA benchmark script.
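
For a feel of what a ~100-line LoRA can look like, here is a minimal sketch of a LoRA-wrapped linear layer. The class name, the `alpha` default, and the init scheme are assumptions for illustration, not the repo's implementation.

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # only the adapter matrices are trained
        self.lora_a = nn.Parameter(torch.empty(r, base.in_features))
        # B starts at zero, so the wrapped layer initially matches the base layer.
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_a, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return self.base(x) + update * self.scale


layer = LoRALinear(nn.Linear(896, 896), r=8)
x = torch.randn(4, 896)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 8 * (896 + 896) = 14336
```

Only the two small matrices receive gradients, which is where the memory saving over full fine-tuning comes from.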

Quickstart

pip install -e .

# full-FT
accelerate launch scripts/train.py configs/default.yaml

# LoRA
accelerate launch scripts/train.py configs/lora.yaml

# caption a single image
python scripts/caption.py cat.jpg --checkpoint outputs/run0/final/model.safetensors

# VQA-style single-question
python scripts/vqa.py cat.jpg "What is the cat doing?" \
    --lora outputs/lora0/adapter.pt

Results

See results/README.md for the full numbers. TL;DR on my RTX 3090 (24 GB):

setup	trainable params	VQAv2 EM	VRAM
MLP full-FT	0.6B	0.412	22 GB
QueryPooler 32	0.6B	0.398	22 GB
MLP + LoRA r=8	8.4M	0.377	13 GB
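
EM here means exact match. As a rough illustration of how such a score can be computed, here is a small sketch. The normalization rules and function names are my assumptions; the benchmark script may normalize differently, and the official VQAv2 metric is a soft accuracy over ten human answers rather than strict exact match.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def exact_match(predictions, references):
    """Fraction of predictions equal to any acceptable reference answer.

    references[i] is a list of acceptable answers for predictions[i].
    """
    if not predictions:
        return 0.0
    hits = 0
    for pred, refs in zip(predictions, references):
        if normalize(pred) in {normalize(r) for r in refs}:
            hits += 1
    return hits / len(predictions)


preds = ["Sleeping.", "two", "a red ball"]
refs = [["sleeping"], ["2", "two"], ["blue ball"]]
score = exact_match(preds, refs)
print(round(score, 3))  # 0.667
```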

Why this exists

Reading LLaVA / BLIP-2 / InternVL code is an "expand 30 layers of wrappers" exercise. I wanted something I could actually step through in a debugger, including the image-token splicing. Pedagogical first, SOTA... nowhere near.
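
Image-token splicing is the step that tends to get buried, so here is a minimal sketch of the idea for a single sequence with one image. The placeholder id, function name, and shapes are hypothetical, not taken from the repo.

```python
import torch

IMAGE_TOKEN_ID = 999  # hypothetical placeholder id marking where the image goes


def splice_image_tokens(input_ids: torch.Tensor,
                        text_embeds: torch.Tensor,
                        image_embeds: torch.Tensor,
                        image_token_id: int = IMAGE_TOKEN_ID) -> torch.Tensor:
    """Replace the single placeholder position with the projected image tokens.

    input_ids: (T,), text_embeds: (T, D), image_embeds: (N, D) -> (T - 1 + N, D)
    """
    positions = (input_ids == image_token_id).nonzero(as_tuple=True)[0]
    if positions.numel() != 1:
        raise ValueError("expected exactly one image placeholder")
    pos = int(positions.item())
    return torch.cat(
        [text_embeds[:pos], image_embeds, text_embeds[pos + 1:]], dim=0
    )


input_ids = torch.tensor([11, 12, IMAGE_TOKEN_ID, 13])
text_embeds = torch.zeros(4, 8)    # stand-in for the LM's embedding lookup
image_embeds = torch.ones(3, 8)    # stand-in for the projector output
spliced = splice_image_tokens(input_ids, text_embeds, image_embeds)
print(spliced.shape)  # torch.Size([6, 8])
```

For training, the labels and attention mask would need to grow by the same number of positions; a common choice is to fill the image positions in the labels with the loss ignore index (-100 by default in PyTorch's cross-entropy).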

License

MIT.
