Chris Santiago chris-santiago

Chris Santiago

Data Scientist · Working on representation learning at Capital One

About

I came to data science after a decade in finance and graduate degrees from Johns Hopkins and Georgia Tech. I've been at Capital One since 2018, first in financial analytics, then in fraud detection from late 2021 with traditional tabular ML. In 2024 I piloted an early-LLM feasibility study on customer-agent call summarization, and the findings informed a production system later built by others. I lead the bank's first representation and sequence-learning effort for transaction fraud. The work has progressed from a proof of concept to validation at large scale, then to a study of where the approach fails (classical tabular models still fit some transaction segments better), and on to a feasibility study for real-time deployment. Throughout, I've kept refining the base architecture and training regime.

Projects

Independent ML research & tooling

These projects span agentic frameworks, empirical studies, reusable libraries, and paper reproductions. Some are built with those frameworks; some evaluate the frameworks themselves. All are independent of my day job.

Ferrum

A Python visualization library with a Rust core that applies a unified grammar of graphics to statistical and ML diagnostics (from scatter plots to SHAP summaries) without switching abstractions or hitting row limits.

Repo · Live

Details

Overview

Ferrum is a statistical visualization library for Python built on a Rust core. It covers exploratory charts, statistical graphics, and ML model diagnostics under a single grammar of graphics. A scatter plot, a faceted histogram, a confusion matrix, and a SHAP beeswarm are all built the same way.

The problem

Python visualization fragments one activity into too many mental models. Statistical graphics, interactive charts, convenience plotting, and model diagnostics are treated as separate domains (matplotlib, seaborn, Altair, Plotly, Yellowbrick), and each switch introduces a new object model, new defaults, and new limitations. Ferrum exists to reject that fragmentation.

One chart model

Every visualization follows the same grammar: data, encodings, marks, scales, coordinate systems, transforms, and views. The library is grammar-first but not grammar-only. Convenience helpers like displot and rocchart exist for speed, but they are sugar over the grammar rather than parallel APIs with their own rules, so a chart from a helper stays as themeable and composable as one written from first principles.

Statistical operations live inside the rendering pipeline. KDEs, LOESS fits, bootstrap confidence intervals, and binning are declarative chart operations, not manual preprocessing, which keeps plotting code short and makes statistical assumptions visible in the spec itself. Interactivity is a render mode, not a rewrite: .interactive() changes the rendering path, not the user's conceptual model.

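import ferrum as fm

# `iris` is assumed to be an already-loaded dataframe (pandas, Polars, ...)
# with sepal_length, petal_length, and species columns.
chart = (
    fm.Chart(iris)
    .mark_point()
    .encode(x="sepal_length", y="petal_length", color="species:N")
)
chart.save("plot.svg")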

Charts compose with operators: + layers, | places side by side, & stacks.

What you can build

Beyond general charts (scatter, line, histogram, KDE, box, violin, heatmap, bar), Ferrum treats model output as just another dataset. ROC and PR curves, confusion matrices, calibration curves, residuals, partial dependence plots, learning and validation curves, and SHAP beeswarm/bar/waterfall plots all feed the same chart system, so diagnostics are as composable and themeable as any other plot.

Python declares, Rust computes

Python is the declaration layer; Rust is the computation layer. The boundary is crossed once via the Arrow C Data Interface, avoiding row-level copying. Rust handles layout, statistical transforms, and rendering; Python constructs the specification and manages the API. The same chart grammar works at 100 rows and at production scale.

Scale and operational simplicity

Rendering is pure Rust (SVG and PNG output with no Cairo, X11, or display server), so it runs cleanly in notebooks, containers, and CI. Auto-rasterization and GPU-backed rendering scale from hundreds to millions of rows behind the same spec, including full-sample SHAP plots that would be impractical elsewhere. Dataframes are met where they are: pandas, Polars, cuDF, Dask, and others are normalized to Arrow once, so the Rust core only ever sees one shape.

How it was built

Ferrum is also a record of how it was made. Roughly 104,000 lines of Python and Rust (across twelve phases, 975 commits, and nearly 3,900 tests) were written in ten days by one human directing an agentic Claude Code framework.

The velocity came from process, not typing speed. Every phase followed the same order: brainstorm, write a design spec, write an implementation plan, then execute. No phase began until both documents were approved. That front-loaded the hard architectural decisions (the Rust/Python boundary, the no-matplotlib constraint, the serialization format) into early design sessions, so later work executed against settled architecture instead of re-litigating it.

Review was enforced structurally rather than periodically. A layered automation pipeline placed a read-only review gate on every staged commit, ran full subsystem audits at phase boundaries, and dispatched repeatable quality campaigns: combinatorial test sweeps, parallel bug hunts, and gallery audits that render every chart type against seaborn and Yellowbrick. Coding agents never commit; an orchestrator handles staging and gate dispatch, which makes the review pipeline impossible to skip.

The split that made it scale: a high-capability model orchestrates (reading specs, decomposing work, making cross-cutting decisions) while specialist agents do the line-by-line Python and Rust. Architectural judgment is expensive and rare; execution is cheap and frequent, and the system matches work to cost.

ML Lab

A Claude Code system for structured ML hypothesis investigation. Sharpens a vague claim into a falsifiable one, then puts it through adversarial debate or ensemble critique, empirical testing, peer review, and a coherence audit. Designed for rigor over speed.

Repo · Live

Details

Overview

ML Lab runs a structured investigation when you have an ML idea, signal, or model claim that needs validating rather than ad-hoc experimentation. It converts a vague hypothesis into a falsifiable claim, locks metrics and pass criteria before any code is written, then puts the proof-of-concept through adversarial review, empirical testing, and peer review.

The problem

Manual code review of ML experiments tends to focus on obvious bugs and miss implicit assumptions: metric sufficiency, baseline meaningfulness, sample adequacy. Single-pass model review misses the cases that matter most: framing errors that need an independent question raised rather than an answer processed, work that sounds questionable but is actually sound, and cases where the honest verdict is "this is empirically open, run this test first" rather than a binary pass or fail.

Origin

ML Lab wasn't designed speculatively. It emerged from a concrete technical investigation (whether FastText-encoded device attributes could serve as ML features for account-takeover detection), and the framework crystallized as that investigation forced workflow choices. Eight iterative versions refined the protocol through self-evaluation experiments. Everything in the workflow exists because it solved a real problem encountered along the way, then earned its place through calibration.

The workflow

The investigation is a fixed sequence of steps. Early steps sharpen the hypothesis and lock metrics. A review step subjects the proof-of-concept to structured critique. Only the tests both sides agree on proceed to empirical execution, and the hypothesis is gate-locked during that run to prevent drift. If results contradict the review's assumptions, the whole review cycle re-opens with evidence in hand. Optional later steps add deep peer review, a production re-evaluation, and a cross-document coherence audit over every artifact produced.

Two review modes

Debate (default). A critic identifies untested claims, a defender responds through a structured rebuttal taxonomy, and a convergence loop runs until verdicts stabilize. The final verdict is computed by a deterministic Python function, so there is no LLM variance in the outcome. After v8 calibration, this is the recommended mode for standard investigations.
Ensemble (opt-in). Three independent critics run with no cross-visibility, and their findings are union-pooled and tagged by how many critics agreed. Reserved for exploratory audits where the risk surface is unknown and missing a real issue is costlier than the manual triage required to filter false positives.
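
The ensemble pooling step can be sketched in a few lines. This is an illustration only, not ML Lab's code: `pool_findings`, the finding strings, and the output fields are hypothetical, and it assumes each critic's findings have already been normalized to comparable labels.

```python
from collections import defaultdict

def pool_findings(critic_findings):
    """Union-pool findings from independent critics and tag each with its agreement count."""
    votes = defaultdict(set)
    for critic_id, findings in enumerate(critic_findings):
        for finding in findings:
            votes[finding].add(critic_id)
    pooled = [{"finding": f, "agreement": len(c)} for f, c in votes.items()]
    # Most-agreed findings first; ties broken alphabetically for a stable order.
    return sorted(pooled, key=lambda r: (-r["agreement"], r["finding"]))

# Three critics with no visibility into each other's output (hypothetical findings).
critics = [
    {"no baseline comparison", "metric ignores class imbalance"},
    {"metric ignores class imbalance"},
    {"metric ignores class imbalance", "sample too small"},
]
pooled = pool_findings(critics)
```

Because the pool is a union, single-critic findings survive with `agreement == 1`, which is where the manual triage cost comes from.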

Architecture

ML Lab is a Claude Code plugin built from focused subagents (an orchestrator plus dedicated critic, defender, reviewer, and report-writer agents) with deterministic verdict logic and append-only JSONL investigation logging for post-hoc audit. Each subagent does one job; the orchestrator runs the workflow.
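
A minimal sketch of the two deterministic pieces named above, assuming a simple rule set. The status labels, verdict names, field names, and file name are all hypothetical; the plugin's actual rules and log schema are not reproduced here.

```python
import json
import os
import tempfile

def compute_verdict(claim_statuses):
    """Pure function of its input: the same statuses always give the same verdict."""
    statuses = set(claim_statuses.values())
    if "refuted" in statuses:
        return "fail"
    if "open" in statuses:
        return "needs_test"
    return "pass"

def append_event(path, event):
    """Append one JSON object per line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, sort_keys=True) + "\n")

log_path = os.path.join(tempfile.mkdtemp(), "investigation.jsonl")
statuses = {"claim_1": "defended", "claim_2": "open"}
verdict = compute_verdict(statuses)
append_event(log_path, {"step": "review", "statuses": statuses})
append_event(log_path, {"step": "verdict", "verdict": verdict})

# Post-hoc audit: replay the log line by line.
with open(log_path, encoding="utf-8") as f:
    events = [json.loads(line) for line in f]
```

Keeping the verdict in ordinary code means LLM output only ever feeds the inputs, never the decision rule itself.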

Research finding

Eight compute-matched calibration experiments underwrite the protocol. The working paper When Does Debate Help? Divergent Detection and Convergent Judgment in Multi-Agent LLM Evaluation reports the empirical case: independent ensembles win on detecting issues, while multi-round debate wins on judging ambiguous cases. The headline guidance ("ensemble for detection, multi-round for judgment") is why the system ships both modes rather than picking one. The latest calibration shifted the default to debate for standard investigations; ensemble remains the right tool for exploratory audits in new domains.

Documentation

See the Diátaxis-structured documentation for Tutorials, How-to guides, Reference, Explanation, and more.

LLM Preamble

Two investigations into whether coding-agent system prompts change LLM code quality. They do, through attention allocation rather than generic lift: preambles improve whichever rubric dimensions they enumerate, so the 'best' preamble is whichever one matches your downstream evaluator.

Repo · Live

Details

Overview

LLM Preamble asks whether the system-prompt content of a coding agent materially changes the code its model produces. Two investigations sit behind a one-sentence answer: yes, but the direction is governed by overlap between the preamble's enumerated dimensions and the dimensions the downstream evaluator scores, not by tone or expertise.

The question

Coding-agent vendors and researchers ship long, opinionated system prompts ("you are a senior engineer", "prefer composition over inheritance", "validate inputs and handle edge cases"). Whether that preamble content actually moves output quality, in which directions, and through what mechanism, was largely unmeasured. v1 established a measurable effect on alignment-tunable craft dimensions. v2 refined the mechanism with independent confound probes.

Method

Both investigations lock their analysis plans before any data collection. The generation pool spans 10 frontier models via OpenRouter, a fixed task suite, and multiple preamble conditions per task. Evaluation uses LLM-as-judge with a cross-judge panel: every generation is rated by judges blind to which preamble produced it, and mixed-effects models account for model, task, and judge random effects. Independent confound probes then test the proposed mechanism rather than only the headline effect.

What it found

Preambles are potent enough to push performance below the no-prompt baseline. A "junior developer" framing and a high-effort expert directive enumerating non-rubric dimensions both lower code quality more than the strongest aligned preamble raises it (β = −0.060 and −0.155 respectively, both p < 10⁻⁴ vs no preamble). The lever pulls both directions, and harder downward than upward. The model genuinely listens to whatever you put in the preamble.
The mechanism is rubric overlap, not engineering virtue. A bare list of the evaluator's dimensions, with no expert tone, captures ~70% of the maximum positive lift. Imperative phrasing and per-dimension explanations add the remaining 30%. What's enumerated moves; what isn't, doesn't.
Implication: there is no universal best preamble. Preamble winners flip under a different evaluator. The "best" preamble for a benchmark that scores compactness and performance would lose under a benchmark that scores documentation and edge-case handling.
Static-analysis tools and metrics (radon, pylint, cyclomatic complexity, Halstead) can't see preamble effects at all: results are flat across every condition. The signal lives only in evaluators that score the dimensions preambles actually tune.

Built with ML Lab

The investigation was run end-to-end inside ML Lab, with hypothesis sharpening, pre-registration with locked metrics and pass criteria, adversarial review of the experimental design, and an append-only audit log of every step. The pre-registration, falsifiable claims, and explicit confound probes aren't decoration; they're what the framework requires. This is the second public investigation completed through that workflow, after ATO Device Embeddings.

ATO Device Embeddings

A controlled FastText-embedding study for account-takeover anomaly detection. Four pre-specified findings reproduced on 5/5 seeds isolate one positive design (mean-pool per-feature beats concatenated strings by +0.131 AUC on spoof attacks) and three implementation traps that standard aggregate evaluation conceals.

Repo

Details

Overview

A controlled empirical study of FastText-based device-attribute embeddings for account-takeover (ATO) anomaly detection. Four pre-specified comparisons run across five independent seeds isolate one positive design choice and three implementation traps that standard aggregate evaluation conceals.

The question

ATO detection works under a hard asymmetry: attackers are rare (roughly 1:100), but each one is high-value. Device fingerprints are an appealing signal because they're observed passively at login and add no user friction. Embedding-based scoring is doubly appealing because it needs no labels; the signal is anomaly proximity to a per-account behavioral centroid.

But how to embed device attributes for label-free cosine-distance scoring is underspecified. No published work compares per-feature versus concatenated-string strategies on this task, characterizes corpus-construction failure modes, or quantifies how per-user score normalization interacts with realistic class imbalance.

Method

Each comparison is a pre-specified, five-seed experiment on synthetic data with controlled ground truth: which accounts have fleet injection, which features were spoofed, at what imbalance ratio. Mechanism isolation requires that controlled ground truth; production data doesn't expose it. The direction of the first comparison (C1, per-feature versus concatenated-string encoding) is then replicated on the public RBA dataset for distributional realism, with the caveat that the RBA ATO population at the chosen temporal split is small (n=9), so the public-dataset evidence is directional rather than strict.

The architecture under test: FastText skip-gram (vector_size=64, window=6, char n-grams 3–6) trained on per-account corpora, scoring each login by cosine distance from the per-account centroid.
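
The scoring rule can be sketched without the trained model. The example below is a toy under stated assumptions: the 2-d vectors are made-up stand-ins for trained FastText token vectors, and `embed_login`, `mean_vector`, and the token names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# Toy token embeddings (values are invented for illustration).
emb = {
    "os=mac": [1.0, 0.1], "os=win": [0.1, 1.0],
    "browser=safari": [0.9, 0.2], "browser=edge": [0.2, 0.9],
}

def embed_login(tokens):
    """Mean-pool the per-feature token vectors into one login vector."""
    return mean_vector([emb[t] for t in tokens])

# Per-account centroid: mean of the account's historical login vectors.
history = [["os=mac", "browser=safari"]] * 3
centroid = mean_vector([embed_login(login) for login in history])

usual = cosine_distance(embed_login(["os=mac", "browser=safari"]), centroid)
unusual = cosine_distance(embed_login(["os=win", "browser=edge"]), centroid)
```

No labels enter the computation: the anomaly score is just the distance between a login and the account's own history.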

The four findings

Mean-pooled per-feature embeddings beat concatenated-string encoding on spoof attacks. +0.131 ROC-AUC (0.868 vs 0.737), with the PR-AUC gap nearly 2× larger (+0.248). Concatenated-string spoof PR-AUC (0.542) is barely above the trivial baseline (0.500), exposing it as essentially non-functional at the task, a severity that ROC-AUC masks. The mechanism: cross-boundary character n-grams in the concatenated string create subword features spanning multiple device attributes, so when one attribute changes, its signal is diluted across positional n-grams shared with unchanged neighbors. Mean-pooling avoids this; one feature changes, 1/6 of the total distance shifts directly. Both encoders converge on novel (all features differ) and fleet (known-device proximity); the gap is concentrated on the case where exactly one feature differs and its signal must not be diluted. Direction holds on the RBA public dataset on 5/5 seeds.
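
The dilution mechanism is visible from the character n-grams alone. This sketch is simplified and rests on assumptions: the attribute values and the `|` separator are hypothetical, and FastText's word-boundary markers are omitted.

```python
def char_ngrams(token, n_min=3, n_max=6):
    """All character n-grams of one token, for n in [n_min, n_max]."""
    return {
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    }

# Hypothetical attribute values in the order os, browser, tz, lang, network, screen.
before = ["macos", "safari", "utc-5", "en-us", "wifi", "1440x900"]
after = list(before)
after[1] = "chrome"  # spoof exactly one attribute

# Concatenated encoding: one token, so n-grams can straddle attribute boundaries.
concat_before = char_ngrams("|".join(before))
concat_after = char_ngrams("|".join(after))
cross_boundary = {g for g in concat_before if "|" in g}
unchanged_fraction = len(concat_before & concat_after) / len(concat_before)

# Per-feature encoding: each attribute is its own token, so exactly one of six inputs changes.
changed_features = sum(a != b for a, b in zip(before, after))
```

In the concatenated token most subword features survive the spoof untouched, while in the per-feature encoding one of the six pooled vectors is replaced outright.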

Embedding collapse comes from corpus construction, not training objective. A 2×2 factorial (skip-gram or CBOW, per-account or per-event corpus) shows that per-event corpora collapse within-feature cosine similarity to ~0.99 on every seed and every training objective, while per-account corpora never collapse on either. The mechanism is structural: in a per-event corpus, every sentence is [os, browser, tz, lang, network, screen] in fixed positional order, so tokens of the same feature always occupy the same position with the same neighbors. Their context distributions become structurally identical, and gradient descent converges to identical embeddings. The downstream consequence is invisible to aggregate AUC: under the collapsed configuration, novel and fleet AUC remain ~0.99 while spoof AUC degrades to ~0.38. The within/cross-feature cosine similarity ratio is proposed as an operational pre-deployment health check; a ratio approaching 1.0 signals impending collapse before model retraining commits the damage.

The standard known-device gate produces zero true positives on the fleet population it's designed to catch. Zero top-1% true positives on fleet residuals (cold-start fleet accounts) on all 5 seeds. Raw centroid-cosine scoring recovers detection at 0.491 top-1% precision on the same events. The mechanism is structural: fleet devices appear in training by construction (simulating unconfirmed prior attack events), so the gate, which fires on any training appearance, suppresses exactly the population it's meant to detect. The two-stage gate's ROC-AUC is not merely close to the trivial baseline; it's identical (delta = 0.000 on all 5 seeds), a logical consequence rather than a noisy near-miss. The cosine signal itself isn't degraded by training contamination; the centroid averages over ~40 training events, so a single contaminating event shifts it minimally and the anomaly score still reflects bulk-of-history fit. In any deployment where device reuse across accounts is possible (botnets, credential stuffing), a binary known-device gate systematically suppresses the highest-value alerts.
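
A toy version of the gate makes the suppression concrete. The device names, scores, and `gated_score` helper are hypothetical; the only assumption carried over from the study is that the gate fires on any training appearance.

```python
def gated_score(device_id, cosine_score, training_devices):
    """Two-stage scoring: a known-device gate zeroes the score before cosine distance is used."""
    return 0.0 if device_id in training_devices else cosine_score

# The fleet device appears in training by construction.
training_devices = {"dev_home", "dev_fleet"}

events = [
    {"device": "dev_home", "cosine": 0.05, "attack": False},
    {"device": "dev_fleet", "cosine": 0.80, "attack": True},
]
gated = [gated_score(e["device"], e["cosine"], training_devices) for e in events]
raw = [e["cosine"] for e in events]
```

After the gate both events score 0.0, so no ranking can separate them, while the raw cosine score still puts the attack on top.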

Per-user CDF rank-normalization collapses PR-AUC under realistic imbalance. PR-AUC drops from 0.888 to 0.224 at 1:100 imbalance (a 4× collapse), while ROC-AUC declines only 0.021. The 31× ratio between the two declines means a practitioner monitoring ROC-AUC would not detect the problem. The mechanism is score-margin compression: rank-normalization maps each user's scores into [0,1] by their own distribution, so ~5% of every user's benign events receive rank > 0.95 regardless of absolute anomaly score, a structural floor. At 1:100 imbalance, this injects thousands of high-rank benign events into the operating region where attacks should dominate, destroying precision at any useful recall point. ROC marginalizes over the majority class and barely registers the collapse; PR does not. The finding generalizes beyond ATO to any per-entity score normalization under class imbalance, and is dangerous precisely because ROC-AUC is the metric most commonly monitored in production fraud systems.
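
The structural floor can be reproduced with two synthetic benign users. This is a sketch on made-up uniform scores, not the study's data; `rank_normalize` and `high_rank_fraction` are hypothetical helper names.

```python
import random

def rank_normalize(scores):
    """Map each score to its empirical CDF rank within one user's own scores."""
    n = len(scores)
    return [sum(other <= s for other in scores) / n for s in scores]

def high_rank_fraction(scores, cutoff=0.95):
    """Share of a user's events whose normalized rank exceeds the cutoff."""
    ranks = rank_normalize(scores)
    return sum(r > cutoff for r in ranks) / len(ranks)

rng = random.Random(0)
# Two benign users: one with tiny raw anomaly scores, one with much larger ones.
quiet_user = [rng.uniform(0.00, 0.05) for _ in range(200)]
noisy_user = [rng.uniform(0.20, 0.60) for _ in range(200)]

quiet_high = high_rank_fraction(quiet_user)
noisy_high = high_rank_fraction(noisy_user)
```

Both users end up with the same share of events above rank 0.95 even though every raw score of the quiet user is below every raw score of the noisy one, which is exactly the margin that normalization throws away.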

Common thread

The four findings share a structural property: standard aggregate evaluation conceals each failure mode.

Spoof gap is invisible without attack-subtype-stratified AUC.
Corpus collapse preserves novel and fleet AUC while destroying spoof.
Gate failure is invisible without fleet-population-stratified top-k analysis.
PR-AUC collapse is concealed by ROC-AUC.

A team monitoring overall held-out AUROC would ship a system containing all four. The contributions are concrete design guidance for the lightweight tier of ATO defense: a specific architecture that outperforms the baseline on the hardest attack subtype, three specific choices that break it, and the evaluation strategy needed to tell the difference.

Scope

A controlled PoC study. The primary evaluation is on synthetic data with a closed 30-token vocabulary and controlled ground truth; the RBA replication confirms C1's direction on independently-structured data but is underpowered for strict encoder ordering. The work does not benchmark against transformer or GNN frontier methods (they target a different deployment tier of offline batch scoring with labeled data), and does not study adversarial drift.

Built with ML Lab

The investigation was run end-to-end inside ML Lab, with hypothesis sharpening, pre-registered verdicts evaluated per seed, adversarial review of the experimental design, and an append-only audit log of each step. The per-seed boolean verdicts, the locked-before-data analysis plan, and the structured comparison protocol aren't decoration; they're what the framework requires.

Imbalanced Losses

A PyTorch library of training losses for class-imbalanced classification (Focal Loss, Smooth-AP, Recall-at-Quantile, and Partial-AUC-at-Budget) with DDP all-gather support for globally correct rank estimation under distributed training.

Repo · Live

Details

Overview

Imbalanced Losses is a PyTorch library of training losses for class-imbalanced classification, the regime where the positive class is rare: fraud, anomaly detection, object detection. It provides focal and ranking-based objectives and, importantly, makes the ranking losses behave correctly under distributed training.

The problem

Cross-entropy and BCE minimize log-loss over the training distribution. When 99% of samples are negative, a model that predicts negative everywhere already sits close to the minimum: high accuracy on a metric that no longer means anything. As the project puts it, "the loss on the 99% majority overwhelms any signal from the 1% minority." The fix is either to down-weight the easy majority or to optimize a ranking metric directly.

The losses

Sigmoid / Softmax Focal Loss. Focal objectives that down-weight easy examples so rare-class signal isn't drowned out; drop-in replacements for BCE / cross-entropy.
Smooth-AP. A differentiable approximation of Average Precision, for when optimizing AUCPR directly is the actual goal.
Recall-at-Quantile. Optimizes recall above a score threshold set at a chosen quantile, matching fixed operating points like an alert system that can only review the top fraction of cases.
Partial-AUC-at-Budget. Optimizes partial AUC over a false-positive-rate band around a target operating point, for when the constraint is a fixed false-alarm budget rather than a single threshold (fraud at 50 bps, say). Where Recall-at-Quantile pins one decision boundary, this optimizes across the whole band inside the budget.
Loss Warmup Wrapper. Trains on plain BCE/CE during warmup, blends into the target loss, and decays temperature on a schedule, resetting the rank queue at the phase switch so warmup-era logits don't poison it.
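
To make the down-weighting concrete, here is a scalar, pure-Python sketch of the standard sigmoid focal loss formula. It is not the library's PyTorch implementation; `gamma=2.0` and `alpha=0.25` are commonly used values and are assumptions here, not the library's documented defaults.

```python
import math

def sigmoid_focal_loss(logit, target, gamma=2.0, alpha=0.25):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def bce(logit, target):
    """Plain binary cross-entropy, recovered with gamma=0 and balanced alpha."""
    return 2.0 * sigmoid_focal_loss(logit, target, gamma=0.0, alpha=0.5)

# An easy negative (confidently correct) and a hard positive (confidently wrong).
easy_ratio = sigmoid_focal_loss(-4.0, 0) / bce(-4.0, 0)
hard_ratio = sigmoid_focal_loss(-2.0, 1) / bce(-2.0, 1)
```

The easy negative keeps only a tiny fraction of its cross-entropy weight, while the hard positive keeps a much larger share, so the rare class contributes relatively more to the gradient.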

Globally correct ranking under DDP

This is the part most implementations get wrong. Standard losses decompose across samples, so sharding data across GPUs is harmless. Ranking losses do not decompose: a soft rank is a sum over the whole pool, and a quantile threshold is a property of the whole score distribution. Estimated on one GPU's shard, both are (in the project's words) "qualitatively wrong" when positives are rare and unevenly spread.

The library's all-gather helpers collect logits and targets across every worker, preserve gradient flow for the local shard so autograd still works, and handle variable batch sizes, so each worker computes the same globally correct rank and threshold. A memory queue accumulates past batches to stabilize estimates at very low positive rates (at 0.5% positives with batch size 32, most batches would otherwise contain no positives at all), and temperature soft-ranking replaces the non-differentiable hard rank with a smooth sigmoid that approaches the true rank as temperature drops.
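
Two of the claims above are easy to check numerically. The sketch below is pure Python rather than the library's code; `soft_rank` is a hypothetical helper that counts pool scores above a query score, which is one common convention for a sigmoid soft rank.

```python
import math

# Probability that a batch of 32 contains no positives at a 0.5% positive rate.
p_empty = (1 - 0.005) ** 32

def soft_rank(score, pool, temperature):
    """Smooth count of pool scores above `score`, via sigmoid((s - score) / temperature)."""
    return sum(1.0 / (1.0 + math.exp(-(s - score) / temperature)) for s in pool)

pool = [0.9, 0.7, 0.6, 0.4, 0.1]
hard = sum(s > 0.5 for s in pool)               # non-differentiable hard count
warm = soft_rank(0.5, pool, temperature=1.0)    # smooth but biased
cold = soft_rank(0.5, pool, temperature=0.01)   # close to the hard count
```

Roughly 85% of such batches carry no positive at all, which is the case the memory queue is there to cover, and the soft rank only matches the hard rank once the temperature is small relative to the score gaps.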

Engineering notes

The losses are instantiated and called like any PyTorch loss: loss = loss_fn(logits, targets); loss.backward(). What sets the library apart is the documentation discipline. An explicit "Assumptions and Failure Modes" guide gives per-loss breakdowns (focal loss stops working below ~0.01% positives; Smooth-AP fails with pools too small or temperatures too low at init; Partial-AUC-at-Budget needs the negative pool to comfortably exceed the inverse of the budget, about 200 negatives for a 50 bps band, or its tail estimate biases toward the largest negative score) and a diagnostic table mapping symptoms (loss stuck near 0.5, rare class never improving, threshold instability) to root causes. The scope is deliberately narrow, and the correctness claims are stated, tested, and bounded.
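
The "about 200 negatives" figure follows from simple arithmetic: a false-positive budget β covers roughly β·N of N negatives, so the band holds at least one negative only when N is at least 1/β. A small check, with `min_negatives_for_budget` as a hypothetical helper name:

```python
def min_negatives_for_budget(fpr_budget):
    """Pool size at which the budget band contains about one negative."""
    return round(1.0 / fpr_budget)

BPS = 1e-4  # one basis point
needed = min_negatives_for_budget(50 * BPS)
```

This is a lower bound for the band to be non-empty; the guide's advice is that the pool should comfortably exceed it.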

ML Wiki

A personal research wiki that ingests papers, Zotero libraries, PDFs, and experiment logs into a structured directory of interlinked markdown pages, with semantic fragment search, contradiction detection, and staleness tracking, all orchestrated through Claude Code.

Repo

Details

Overview

ML Wiki turns a reading list into a maintained body of knowledge. It ingests papers, Zotero libraries, PDFs, markdown notes, and ML experiment logs, then produces a directory of interlinked markdown pages: paper summaries, topic Maps of Content, syntheses, and idea pages. The current wiki holds ~1,070 paper pages, ~121 topic pages, and ~4,570 searchable knowledge fragments drawn from four Zotero libraries.

Why a wiki, not RAG

Most LLM-and-document workflows are retrieval-based: upload files, retrieve relevant chunks at query time, generate an answer from scratch. Nothing accumulates. A subtle question that spans five papers re-derives the same fragments every time.

ML Wiki takes the opposite approach. When a new paper is added, the LLM reads it, extracts the key knowledge, and integrates it into the existing wiki. Cross-references are already in place. Contradictions have already been flagged. The synthesis already reflects everything read so far.

The maintenance problem

Personal knowledge bases die from bookkeeping, not from a lack of reading: updating cross-references, keeping summaries current, noting when new evidence contradicts old claims. The maintenance burden grows faster than the value. LLMs flip that: they don't get bored, don't forget a cross-reference, and can touch every file in one pass. Maintenance cost drops to near zero, so the wiki stays alive.

Architecture

Three layers, each with a single responsibility:

CLI layer. A Python Click CLI that orchestrates everything: resolves config, calls scripts, dispatches LLM calls, validates output.
LLM layer. OpenRouter calls with JSON output, validated against JSON schemas and domain-specific post-validators.
Script layer. Python scripts for all file I/O, index CRUD, source extraction, and page assembly. No script calls another; they communicate via stdin/stdout/files/exit codes.

A JSONL index is the single source of truth. Fragments (self-contained one- or two-sentence claims, methods, and findings) are the atomic units: they power meaning-based search and let the linter detect contradictions across papers that share a tag.

Staleness tracking runs at build time. A check flags synthesis and idea pages whose last-improved timestamp predates newer entries sharing two or more tags (along with idea pages never improved at all), and a batch pass then inserts or clears a staleness banner at the top of each affected page.
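
The staleness rule reduces to a small predicate. This is a sketch under assumed field names (`kind`, `tags`, `last_improved`, `added`), which are hypothetical and not the wiki's actual index schema.

```python
from datetime import date

def is_stale(page, entries, min_shared_tags=2):
    """Stale if a newer entry shares enough tags, or if an idea page was never improved."""
    if page["last_improved"] is None:
        return page["kind"] == "idea"
    return any(
        e["added"] > page["last_improved"]
        and len(set(e["tags"]) & set(page["tags"])) >= min_shared_tags
        for e in entries
    )

entries = [{"added": date(2025, 3, 1), "tags": ["ssl", "tabular", "contrastive"]}]

old_synthesis = {"kind": "synthesis", "tags": ["ssl", "tabular"], "last_improved": date(2025, 1, 1)}
fresh_synthesis = {"kind": "synthesis", "tags": ["ssl", "tabular"], "last_improved": date(2025, 4, 1)}
one_tag_overlap = {"kind": "synthesis", "tags": ["ssl", "vision"], "last_improved": date(2025, 1, 1)}
untouched_idea = {"kind": "idea", "tags": ["ssl"], "last_improved": None}

flags = [is_stale(p, entries) for p in (old_synthesis, fresh_synthesis, one_tag_overlap, untouched_idea)]
```

A batch pass can then insert or clear the banner based on each flag without any LLM call.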

Design notes

The system enforces a strict separation of concerns: Zotero is the library, ml-journal is the experiment log, the wiki is the synthesis layer, Obsidian is the viewer. Data flows one direction (sources to index to wiki to viewer), and nothing writes back upstream.

It also separates by cost. Index operations (ingest, sync, diff) are fast and free; LLM operations (render, query, synthesis) cost time and tokens. The two are always decoupled, so ingesting an entire Zotero library takes seconds and rendering happens on demand.

Representation-learning reproductions

Reproductions of unsupervised representation-learning methods, all evaluated under fixed linear-probing protocols. Three tabular SSL papers (MET, SCARF, VIME) as separate repos under one shared protocol, plus a sweep of autoencoder architectures compared on MNIST.

Repo

Details

Four reproductions of representation-learning methods, organized into two sweeps. The shared discipline is the same across both: pretrain unsupervised, then evaluate the learned representations with a fixed linear-probing protocol, with no architecture-level hyperparameter tuning. A difference in the result is a difference in the architecture, not in the tuning effort.

Tabular SSL: MET, SCARF, VIME

Three foundational tabular self-supervised methods reproduced as separate PyTorch repos and evaluated under one shared linear-probing protocol. Tabular SSL is its own subfield (vision-style augmentations don't apply), and reproducing the three end to end clarifies what actually transfers.

MET: Masked Encoding for Tabular data

A transformer-based masked autoencoder for tabular features. Reconstructs masked entries from visible context and includes the paper's bounded adversarial training component.

Repository

SCARF: Self-supervised contrastive learning

Contrastive pretraining for tabular data using random feature corruption to generate augmented views. Sidesteps the absence of vision-style augmentations entirely.
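
The corruption step can be sketched in pure Python. This is an illustration, not the reproduction's PyTorch code: `scarf_corrupt` is a hypothetical name, and `corruption_rate=0.6` is an illustrative default.

```python
import random

def scarf_corrupt(row, dataset, corruption_rate=0.6, rng=random):
    """Replace a random subset of features with values drawn from each feature's empirical marginal."""
    n_features = len(row)
    n_corrupt = round(corruption_rate * n_features)
    corrupt_idx = set(rng.sample(range(n_features), n_corrupt))
    return [
        # Sampling a random row and taking column j draws from feature j's marginal.
        rng.choice(dataset)[j] if j in corrupt_idx else value
        for j, value in enumerate(row)
    ]

rng = random.Random(0)
# Toy table: the value in row i, column j is 10 * i + j, so columns are easy to tell apart.
dataset = [[10 * i + j for j in range(5)] for i in range(6)]
anchor = dataset[0]
view = scarf_corrupt(anchor, dataset, rng=rng)
```

The anchor and its corrupted view form the positive pair for the contrastive loss; every replaced value is still a plausible value for its own column.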

Repository

VIME: Value Imputation and Mask Estimation

Self- and semi-supervised learning via a masked feature-imputation pretext task. Demonstrates that pretext-task pretraining transfers cleanly to tabular downstream tasks.

Repository

Autoencoders: architecture comparison on MNIST

A comparative study of autoencoder architectures collected in a single repo: deep, convolutional, denoising, variational, and self-supervised (SimSiam and a SimSiam-denoising hybrid), most following their original papers. All trained unsupervised on the 60,000-image MNIST training set, then evaluated with a linear classifier across label budgets from 10 to 8,000 examples. No hyperparameter tuning on the encoders or the probes, so the architecture is the only variable. Configuration runs through Hydra, with tracking in Weights & Biases.

Repository

Small Python libraries

Four narrowly-scoped Python libraries that each solve one repetitive data-prep problem cleanly (classical feature selection, fuzzy string deduplication, time-series feature engineering, and reusable function composition), three of them slotting into the standard scikit-learn Pipeline.

Repo

Details

Four small Python libraries written in the same spirit: each solves one narrow data-prep or feature-engineering problem, three of them inside the scikit-learn Pipeline API, rather than introducing a framework around it.

Steps: classical feature selection

Best-subsets and forward stepwise regression, scored by AIC or BIC, wrapped as scikit-learn-compatible selectors. Brings interpretable classical selection (well-established in statistics but rarely accessible in modern ML pipelines) into the standard fit / transform workflow. Adapts automatically: linear regression for continuous targets, logistic regression for classification.

Repository · Docs

StringCluster: fuzzy string deduplication

Identifies near-duplicate strings (the messy-data problem of "Acme Inc.", "Acme, Inc" and "ACME INCORPORATED" all being the same entity) via TF-IDF character n-grams and a cosine-similarity threshold. Works against itself or against a master reference list, with regex stop tokens for domain-specific noise. Standard fit / transform; slots into a Pipeline like any other step.
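
The core similarity idea fits in a few lines. This sketch is deliberately simplified and is not StringCluster's implementation: it uses raw trigram counts with no IDF weighting, strips punctuation and case up front, and the helper names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(text, n=3):
    """Character n-gram counts over the lowercased alphanumeric characters."""
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

acme = ngram_counts("Acme Inc.")
same = cosine_similarity(acme, ngram_counts("Acme, Inc"))
longer = cosine_similarity(acme, ngram_counts("ACME INCORPORATED"))
other = cosine_similarity(acme, ngram_counts("Globex Corp"))
```

Punctuation variants score near 1, the spelled-out form scores in between, and an unrelated name scores 0, so a similarity threshold decides which strings collapse into one entity.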

Repository

TSFeast: time-series feature engineering

Lag, rolling, EWMA, differencing, datetime, and polynomial-feature transformers for time series, plus a TimeSeriesFeatures meta-transformer that combines several at once and an ARMARegressor that wraps an sklearn regressor with ARMA residual modeling. Keeps time-series prep inside the Pipeline where the fit / transform boundary is respected, instead of leaking into ad-hoc pandas code outside it.

Repository

DPipes: reusable function composition

PipeProcessor for method-chaining APIs (pandas, Polars, any object with a pipe method) and Pipeline for general function composition over arbitrary Python functions. Defines a transformation sequence once and applies it to any compatible input (train, test, new batch) without rewriting the chain or sacrificing readability for nested calls.
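
The underlying idea, defining a chain once and reusing it, can be shown with plain functions. This is a generic sketch, not the DPipes API: `compose` and the two step functions are hypothetical.

```python
from functools import reduce

def compose(*steps):
    """Return a function that applies `steps` left to right."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

def drop_missing(rows):
    return [r for r in rows if r["amount"] is not None]

def add_cents(rows):
    return [{**r, "cents": round(r["amount"] * 100)} for r in rows]

# Define the sequence once, then apply it to any compatible input.
prepare = compose(drop_missing, add_cents)
train = prepare([{"amount": 1.5}, {"amount": None}])
holdout = prepare([{"amount": 0.25}])
```

The chain reads top to bottom in application order, which is the readability gain over nesting the calls as `add_cents(drop_missing(rows))`.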

Repository · Docs

On the shared scope

None of these tries to be a framework. Each adds the one transformer or composition primitive that sklearn (or pandas, in DPipes' case) doesn't ship with, and stops there. That narrowness is the point: they slot into existing workflows without asking anyone to adopt a new abstraction layer.

DeepSets PyTorch

A PyTorch implementation of Deep Sets (Zaheer et al.): permutation-invariant and equivariant neural networks for learning functions on sets, with a test suite that verifies the paper's theorems.

Repo · Live

Details

Overview

Deep Sets (Zaheer et al., NeurIPS 2017) is a neural architecture for learning functions on sets (inputs with no inherent order). Its core result: under the paper's assumptions (for example, elements drawn from a countable universe), any permutation-invariant function on sets can be written as ρ(Σ φ(x)). Transform each element with one network, pool the results, then apply a second network. The sum pooling guarantees that shuffling the input never changes the output.
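
The decomposition is short enough to demonstrate with toy functions. In this sketch `phi` and `rho` are fixed, hand-written stand-ins for the learned networks, and nothing here is taken from the repository's code.

```python
import math
import random

def phi(x):
    """Per-element transform (a toy stand-in for a learned network)."""
    return [x, x * x, math.tanh(x)]

def rho(pooled):
    """Post-pooling transform (again a toy stand-in)."""
    return sum(w * v for w, v in zip([0.5, -1.0, 2.0], pooled))

def deep_set(xs):
    """rho(sum(phi(x))): sum-pool the element features, then transform the pooled vector."""
    pooled = [math.fsum(col) for col in zip(*(phi(x) for x in xs))]
    return rho(pooled)

xs = [0.3, -1.2, 2.5, 0.0]
shuffled = xs[:]
random.Random(0).shuffle(shuffled)

out_original = deep_set(xs)
out_shuffled = deep_set(shuffled)
```

Only the pooled sums reach `rho`, and a sum does not depend on element order, so the two outputs agree.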

What it implements

The repository covers every architectural variant from the paper: invariant models (set to vector), equivariant models and layers (set to set, preserving permutation symmetry), and context-conditioned models with multiple fusion strategies. Variable-size sets are supported throughout via masking, with sum, max, and mean pooling, each with mathematically correct mask handling.

Notes

What makes this more than a toy port is the verification: a dedicated test suite checks structural compliance with the paper's theorems, and the README numerically demonstrates that shuffling inputs leaves outputs unchanged. The documentation explains the non-obvious design choices, including why masked max-pooling fills with negative infinity rather than zero, and how the architecture's linear complexity compares to quadratic attention-based alternatives.
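
The negative-infinity fill is easiest to see on a set whose real elements are all negative. A pure-Python sketch, with `masked_max` as a hypothetical helper rather than the repository's function:

```python
def masked_max(values, mask, fill):
    """Max over a padded set, with masked-out slots replaced by `fill`."""
    return max(v if keep else fill for v, keep in zip(values, mask))

values = [-3.0, -1.5, 0.0]    # the last slot is padding, stored as 0.0
mask = [True, True, False]

with_zero_fill = masked_max(values, mask, fill=0.0)
with_neg_inf_fill = masked_max(values, mask, fill=float("-inf"))
```

With a zero fill the padding wins and the pooled value is 0.0; with a negative-infinity fill the result is -1.5, the true maximum of the real elements.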
