README.txt - Notepad

Nils Matteson

I build systems for LLM inference: GPU and CUDA, distributed systems, applied ML. I like problems where the deliverable is a number someone else can re-run.

The thing I keep pulling on: a transformer mid-generation is a multi-gigabyte data structure that the entire ecosystem treats as disposable. Treating it as a first-class artifact instead got me thaw (below), an open PR against vLLM core, and a sole-author research preprint measuring what the standard replay shortcut actually does to token-level credit estimates (also below).

B.S. Data Science, CS minor, UW-Madison, May 2026. M.S. CS, Northeastern Silicon Valley, San Jose, starting September 2026. The longer plan is research: measurement problems in ML systems, then a PhD.

Open to SWE/MLE internship Summer 2027, full-time 2028. GPU inference, distributed systems, ML infrastructure.

mail nils@thaw.sh · code github.com/matteso1 · resume.pdf

thaw v0.6.0

git for live LLM agent sessions. checkpoints, branches, diffs, and restores live vLLM/SGLang inference state (weights, KV cache, prefix-hash table, scheduler). a session forks in 0.88s median on an H100 instead of a ~340s cold boot, about 400x amortized. 16 releases on PyPI as thaw-vllm, currently 0.6.0, Apache-2.0. opened PR #44074 in vLLM (pluggable sleep-mode backend), following participation in RFC #34303. sole-author preprint: "Re-feeding Is Not Replaying" (June 2026).

0.88s fork	vs ~340s cold boot (H100)
14.3 GB/s restore	weight restore, disk to GPU
0.29s / 55 GB/s hot-swap	8B model reload after one-time pin
388 tests in CI	155 Rust + 233 Python, no GPU required
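
The headline figures can be sanity-checked with a few lines of arithmetic. This sketch uses only the numbers in the table above; the 2 bytes per parameter (16-bit weights) is an assumption for illustration, not something stated here.

```python
# Back-of-envelope check of the stats table. All inputs are copied from the
# table except BYTES_PER_PARAM, which is an assumed 16-bit weight format.
COLD_BOOT_S = 340.0      # approximate cold boot on an H100
FORK_S = 0.88            # median fork time
HOT_SWAP_S = 0.29        # 8B model reload after one-time pin
HOT_SWAP_GBPS = 55.0     # reported hot-swap bandwidth
PARAMS_BILLIONS = 8.0
BYTES_PER_PARAM = 2.0    # assumption: fp16/bf16 weights

# Fork speedup over a cold boot.
speedup = COLD_BOOT_S / FORK_S

# Data implied by the hot-swap time and bandwidth, versus the model's size.
moved_gb = HOT_SWAP_S * HOT_SWAP_GBPS
model_gb = PARAMS_BILLIONS * BYTES_PER_PARAM

# Test counts should add up to the CI total.
total_tests = 155 + 233

print(f"speedup ~{speedup:.0f}x")
print(f"hot-swap moves ~{moved_gb:.2f} GB vs ~{model_gb:.0f} GB of weights")
print(f"tests: {total_tests}")
```

Under that assumption the hot-swap row is self-consistent: 0.29s at 55 GB/s is about 16 GB, which is roughly what 8B parameters occupy at 2 bytes each.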

Read the write-up · PyPI · Source

Re-feeding Is Not Replaying.pdf - Acrobat Reader

Re-feeding Is Not Replaying: Measuring Replay Noise in Counterfactual Token-Credit Estimation
sole author · June 2026 · 10 pages · total compute under $10

Every published method that asks "which token caused the model's answer" rebuilds the model's state by re-feeding the transcript as a fresh prompt, and assumes that is the same state. I measured the assumption on stock vLLM with a three-pass design: continuations resumed from the exact decode-time KV state, an identical second exact pass as a replica noise floor, and the re-feed. At low-margin decision tokens, re-feeding changes the credit estimate at rates 14 to 28 percentage points above the floor. The perturbation is consistent with zero mean, so averaged quantities mostly survive; threshold-based critical-token selection does not. Rerunning under vLLM's batch-invariant kernels makes every pass bit-identical, which closes the causal attribution and validates the instrument in one move.
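
A minimal sketch of the three-pass comparison, not the analysis script from the repo. It assumes a "change" means the credit estimate landing on the other side of a selection threshold; the `Pivot` record, the field names, the threshold, and the data are all hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Pivot:
    """Hypothetical per-pivot record: one credit estimate per pass."""
    exact_a: float  # resumed from the exact decode-time KV state
    exact_b: float  # identical second exact pass (replica noise floor)
    refeed: float   # transcript re-fed as a fresh prompt


def flip_rate(pairs, threshold):
    """Fraction of (x, y) pairs that fall on opposite sides of the threshold."""
    pairs = list(pairs)
    flips = sum((x >= threshold) != (y >= threshold) for x, y in pairs)
    return flips / len(pairs)


def compare(pivots, threshold):
    """Return (floor rate, re-feed rate, excess in percentage points)."""
    floor = flip_rate(((p.exact_a, p.exact_b) for p in pivots), threshold)
    refeed = flip_rate(((p.exact_a, p.refeed) for p in pivots), threshold)
    return floor, refeed, (refeed - floor) * 100


# Made-up records: two low-margin pivots near 0.5, two high-margin ones.
pivots = [
    Pivot(exact_a=0.52, exact_b=0.51, refeed=0.47),
    Pivot(exact_a=0.48, exact_b=0.49, refeed=0.55),
    Pivot(exact_a=0.90, exact_b=0.91, refeed=0.88),
    Pivot(exact_a=0.10, exact_b=0.12, refeed=0.10),
]

floor, refeed, excess = compare(pivots, threshold=0.5)
mean_shift = mean(p.refeed - p.exact_a for p in pivots)

print(f"floor={floor:.2f} refeed={refeed:.2f} excess={excess:.0f} points")
print(f"mean shift={mean_shift:+.4f}")
```

In this toy data the re-feed shifts average out to roughly zero, yet both low-margin pivots cross the threshold while the replica pass crosses none. That is the pattern described above: averages survive, threshold-based selection does not.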

Headed to arXiv (cs.LG). Every per-pivot record, run log, and the analysis script that emits each number are public in the repo.

Read the paper (PDF) · Data + harness

Exploring - C:\nils

Name	Type	Notes
work\thaw	File Folder	0.88s fork vs ~340s cold boot
work\matteson-systems	File Folder	10,500 businesses scored at ~$0.03
work\sentinel	File Folder	Go log engine, LSM + Raft
work\madison-metro-ml	File Folder	conformal ETAs, calibrated 90% coverage
writing\project-gorgon.md	Markdown	speculative decoding at 0.66x of baseline, negative result
writing\madison-bus-eta.md	Markdown	the transit-ETA writeup
writing\wattbot-rag.md	Markdown	RAG benchmarks, 282 questions
refeed-drift.pdf	Adobe Acrobat Document	the preprint, 10 pages, reproduces for under $10
agents.txt	Text Document	facts for LLMs

9 object(s)

System Properties

System: B.S. Data Science, UW-Madison (May 2026)

System: M.S. CS, Northeastern Silicon Valley (Sep 2026 - May 2028)

Registered to: Nils Matteson, Madison WI -> San Jose CA

Status: seeking Summer 2027 SWE/MLE internship

Scheduled upgrade: Ph.D., ML systems (pending)


New Message - Outlook Express

To:

Subject: shoot you an electronic message on the net

"Every program attempts to expand until it can read mail. Those programs which cannot so expand are replaced by ones which can."

-- Zawinski's Law, 1995

(you are reading this inside exactly such a program)

you are visitor 016,384

(counter is static. no js, no tracking.)

RUST INSIDE · CUDA 12+ · vLLM · APACHE-2.0 · Y2K COMPLIANT · BEST VIEWED IN NETSCAPE 4.0 · Tor .onion · I2P eepsite