Skip to content

wpferrell/Bigsmall

Repository files navigation

PyPI version DOI License Python Downloads

BigSmall — Lossless AI Model Compression

Make any AI model ~34% smaller with bit-identical weights — and let it decide how to run on your machine. Your machine runs bigger models than you think, and BigSmall never lies to you about it.

pip install bigsmall          # CLI + compression/decompression
pip install bigsmall[torch]   # add this for model loading (from_pretrained)

A 14 GB Mistral-7B becomes 9.3 GB. A fine-tuned model becomes a 5 GB patch on top of its 14 GB base. A 14.2 GB model runs in ~2.5 GB of working memory through the streaming executor (it reads weights from disk layer by layer instead of holding the whole model). The decompressed model is every weight bit-for-bit identical to the original — each tensor's md5 (a checksum) is verified on decompress.


60-second quickstart

Three commands. No settings to learn.

bigsmall profile          # once: ~10s hardware probe (saved, never asked again)
bigsmall plan model.bs    # one sentence: what would run, where, how faithfully
bigsmall run model.bs     # do it

Real output from the reference machine (8-core CPU, RTX A4500 busy with another job):

$ bigsmall plan qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.

$ bigsmall run qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.
loaded 290 tensors into host RAM in 27.0s [mode=perfect] (bit-exact receipt honoured)

The planner picks the highest fidelity that runs at usable speed on your hardware, and one rule is enforced by the test suite itself: anything below bit-exact is announced before it happens, never silently. Full walkthrough: docs/quickstart.md.


The improvement, in two pictures

Bar chart: loading the bit-exact original took 150.9 s (Qwen2.5-0.5B) and 411.2 s (Llama-3.2-1B) in the first 4.0 build; 4.0.0 ships 25.3 s and 56.2 s — 6.0x and 7.3x faster

Bar chart: one download, three sizes — the original is 100%, the lossless .bs is about 66%, and the Ferrell Duo fast member reads about 21.5%; the Duo file holds both for about 3% more than lossless alone


What's in 4.0

Feature What you get Command
Autopilot You don't pick settings. It checks your computer once, tells you in one plain sentence what it's about to do, and proves it did it. bigsmall profile / plan / rundocs
Ferrell Duo (.bsd) One file with two models inside: a small fast one for everyday work, and the exact original for when it has to be perfect. Carrying both costs only ~3% more than the perfect one alone. bigsmall dualdocs
Streaming executor Runs a model too big to fit in memory by reading it from disk a piece at a time - and tells you up front how much memory it'll use. bigsmall run --stream, serve-streamdocs
FP8-native lossless Newer models that ship in the compact FP8 form also shrink - about 17% smaller - and come back perfectly identical. bigsmall compress (automatic) — docs
Capacity math Tells you honestly whether a model will run on your card and how. A 20 GB card holds a ~13B model perfectly (vs ~9B normally). bigsmall plandocs
Integrity tooling Every file carries its own proof the model came back identical. Tools can inspect a model and even spot one secretly damaged or never properly trained. bigsmall xray, verifydocs

Everything from 3.x is still here: delta patches for fine-tunes, KV-cache compression (entry v3 + FP8 + 32k chunked + GPU codec), xray, reshard, resume, ECC. Full changelog →


What BigSmall does

1. Make any model smaller

bigsmall compress mistral-7b/ -o mistral-7b.bs
bigsmall decompress mistral-7b.bs -o mistral-7b-restored/

Before: 14.2 GB of safetensors. After: 9.3 GB .bs file. Saved: 4.9 GB (34%).

Every weight is bit-for-bit identical. Works on any safetensors model — LLMs, diffusion, audio, vision. BF16, F16, F32, and FP8 weights are all native.

2. Store fine-tunes as patches

bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
bigsmall apply qwen-base/ instruct.bs -o qwen-instruct-restored/

Before: 14.2 GB Qwen2.5-7B-Instruct. After: ~5 GB patch. If your users already have the base model, they only download what changed. Delta size is pair-dependent — measured from under 1% of full size (best ≥7B SFT pairs) to ~61% (small-model full tunes); the ≥7B official-instruct class measures 34–50%. The engine measures both codings per tensor and never ships a delta larger than standalone. Full table: docs/delta-compression.md.

3. One file. Two models: the fast one and the real one — the Ferrell Duo

bigsmall dual qwen2.5-0.5b/model.safetensors -o qwen2.5-0.5b.bsd
bigsmall run qwen2.5-0.5b.bsd                # perfect: bit-exact, verified
bigsmall run qwen2.5-0.5b.bsd --mode fast    # fast: INT4, reads 21.4% of raw

A .bsd is a Ferrell Duo: use the fast one while you work (a lossy-INT4 member that loads by reading a fifth of the bytes); the real one is the exact original — every bit, proven. The same file holds the lossless residual that reconstructs the original bits, receipt included. Measured cost: a .bsd is ~2.9 pp of raw larger than the lossless-only .bs (details).

4. Run models bigger than your RAM

bigsmall run model.bsd --stream

The layer-streaming executor keeps the resident set bounded and tells you the bound before it starts. Measured receipt, Qwen2.5-7B (14.2 GB raw): streamed forward pass bit-exact against the fully-loaded model (identical logits sha256) in ~2.5 GB of hot state. Slow is stated plainly where it is slow — see docs/features/streaming.md for the measured rates.

5. Download smaller, use instantly

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall"
)

Works exactly like a normal HuggingFace model — BigSmall decompresses transparently on load. 25+ pre-compressed models ready to use (browse them all).


Compression numbers (every published model)

Every row is a real measurement. Click a model to download it.

Model Original BigSmall Saved
Qwen2.5-14B-Instruct 29.5 GB 19.5 GB 34%
Gemma-3-12B-it 22.7 GB 14.8 GB 35%
Gemma-2-9B-it 17.2 GB 11.3 GB 34%
Llama-3.1-8B-Instruct 15.0 GB 9.7 GB 35%
Llama-3-8B-Instruct 15.0 GB 9.8 GB 34%
Qwen3-8B 15.3 GB 10.1 GB 34%
Mistral-7B-Instruct v0.3 14.2 GB 8.9 GB 37%
Mistral-7B-Instruct v0.2 14.2 GB 8.9 GB 37%
Qwen2.5-7B-Instruct 14.2 GB 9.4 GB 34%
Phi-3.5-mini-instruct 7.1 GB 4.7 GB 34%
Gemma-3-4B-it 8.0 GB 5.2 GB 35%
Qwen3-4B-Instruct 7.5 GB 5.0 GB 34%
Llama-3.2-3B-Instruct 6.4 GB 3.9 GB 39%
Gemma-2-2B-it 4.9 GB 3.2 GB 34%
Qwen2.5-3B-Instruct 5.7 GB 3.8 GB 34%
Qwen2.5-1.5B-Instruct 2.9 GB 1.9 GB 34%
Llama-3.2-1B-Instruct 2.3 GB 1.5 GB 34%
Gemma-3-1B-it 1.9 GB 1.2 GB 35%
Qwen2.5-0.5B-Instruct 920 MB 610 MB 34%
GPT-2 (117M) 548 MB 414 MB 24%
Gemma-3-270M-it 500 MB 330 MB 34%
Gemma-3-270M 500 MB 330 MB 34%
Gemma-2-2B 9.7 GB 8.1 GB 17%

Browse all 25+ models on HuggingFace →

v4-line measurements on the reference machine (2026-06 campaign): Qwen2.5-7B compressed to 65.95% of raw and ran bit-exact through the streaming executor; a real fp8 release (Qwen3-0.6B-FP8) compressed to 0.829 of its fp8 weight bytes, 507/507 tensors bit-exact.


What "lossless" actually means

Every weight in the model is mathematically identical to the original — same bit pattern, same floating-point value, same gradient, same output.

  • Not quantization. Quantization rounds weights to fewer bits and the model's behaviour changes. (The .bsd fast member is quantization — and is labelled lossy every time it is picked, which is the point.)
  • Not pruning. Pruning deletes weights.
  • Not approximation. No tricks, no calibration data, no quality drop.

BigSmall finds redundancy in the bit pattern of neural weights and stores it more compactly — the same idea as ZIP for text, but tuned for BF16 floating-point distributions. md5 is verified on every tensor at decompression. If a single bit differs, verify fails.

The honesty rules are code, not policy: any pick below a file's best fidelity carries a mandatory announcement, the announcement is printed before the load, and a test sweep across every profile × file × override combination fails the suite if a downgrade could ever happen silently. Details: docs/features/integrity.md.


CLI

bigsmall profile                            one-time hardware probe (~10s)
bigsmall plan SRC [--mode M]                the decision, one sentence, no execution
bigsmall run SRC [--mode M] [--stream]      pick, announce, load (or stream)
bigsmall dual SRC [-o OUT.bsd]              create/inspect a dual-fidelity .bsd
bigsmall compress SRC [-o OUT] [--delta-from BASE] [--resume] [--ecc]
bigsmall decompress SRC [-o OUT] [--base BASE]
bigsmall transcode SRC DST.bsr [--mode M]   re-encode for decode speed
bigsmall serve-stream SRC [--prompt ...]    tiered weight-streaming inference
bigsmall xray SRC                           checkpoint forensics
bigsmall info | scan | stat | verify | diff | apply | repair | benchmark
bigsmall migrate | status | pipeline run | reshard

Every command has --help. See docs/cli-reference.md for full examples with real output.


Python API

import bigsmall

# Round-trip a model
bigsmall.compress("model/", "model.bs")
bigsmall.decompress("model.bs", "model_back/")

# Fine-tune as a delta patch
bigsmall.compress("finetune/", "patch.bs", delta_from="base/")
bigsmall.apply("base/", "patch.bs", "finetune_back/")

# Inspect before compressing
bigsmall.detect_bf16_native("model/")
bigsmall.scan_model("model/")

# Low-VRAM streaming inference (~12x less VRAM than from_pretrained)
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,
)

# Stream-decompress straight from the HF CDN — no .bs written to disk
state_dict = bigsmall.stream_from_hub("wpferrell/gpt2-bigsmall", device="cpu")

# Reshard .bs files along layer boundaries, no re-encoding
bigsmall.reshard(["model.bs"], "resharded/", target_shard_size_gb=2.0)

Research

The lossless compression ceiling for BF16 neural weights has been measured. It is ~62% of raw BF16 for any model, ~34% for ≥7B instruct fine-tunes with delta compression. We ran 300+ experiments across every known mathematical approach — entropy coding, cross-tensor prediction, learned translators, persistent homology, optimal transport, quantum-inspired methods, and more — and proved that there is no further compression available within the strict bit-identity contract.

There's a hard limit to how small a trained model can be squeezed without losing anything - a point past which the data is as random as TV static and won't compress further. We measured that wall across 4,143 weight matrices in 8 model families, and BigSmall compresses right up against it. (The technical version - why trained weights hit this floor, the flat per-tensor entropy, and how we beat DFloat11's bound on every layer type measured - is in docs/dfloat11.md.)

Curious how BigSmall relates to DFloat11, ZipNN, GGUF quants, and the rest of the landscape? One honest reference page: docs/comparison.md.

The Ferrell Duo format has its own paper — how carrying the bit-exact original next to an INT4 fast member went from +13.4 to +2.9 points of file size, every number receipted: 10.5281/zenodo.20673133 (markdown, plain-English page).

Full findings, all experiments, all dead-ends: 10.5281/zenodo.20279247. Plain-English summary: docs/research.md.


Install

pip install bigsmall                  # core
pip install "bigsmall[hf]"            # + HuggingFace integration
pip install "bigsmall[ecc]"           # + Reed-Solomon error recovery
pip install "bigsmall[all]"           # everything

Requires Python 3.9+. Works on Linux, macOS, and Windows. CPU, NVIDIA, AMD, and Apple Silicon.

New here? Start with docs/quickstart.md, or find your situation in docs/how-it-helps.md.


License

Code: Elastic License 2.0. Free for personal, research, and commercial use. SaaS providers should see LICENSING.md.

Model weights distributed in .bs format keep the license of the original model.


Links


Feedback & Community

Did BigSmall work for your model? We'd love to know.

  • Open a Discussion — share your compression results, ask questions, or suggest improvements
  • File an Issue — if something didn't work, tell us exactly what happened
  • HuggingFace — all compressed models are at huggingface.co/wpferrell

We especially want to hear:

  • Which model you compressed and what ratio you got
  • Any errors or unexpected behaviour
  • Use cases we haven't thought of

About

Lossless AI model compression - ~34% smaller with bit-identical weights; the autopilot profiles your machine, picks the highest fidelity that runs, and streams models bigger than your RAM.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages