Training-data quality is the silent killer of model quality. The bugs are mundane: train/eval leakage from a sloppy split, label drift after a schema migration, prompt-template version skew across rows of the same dataset, malformed tool-call traces, conversation-turn role inversions, hidden duplicates from a normalization mismatch. Most of this is mechanically detectable, and yet the standard stack (pandas + ad-hoc scripts + a notebook) catches almost none of it pre-train.
datalint is a Krit-shaped static analyzer for training-data pipelines: it lints the code that produces the data, lints the schemas the data declares, and lints the files themselves (JSONL and Parquet today; MDS, WebDataset on the roadmap). Read ~/kaeawc/krit/CLAUDE.md first; you're reusing the architecture pattern.
```bash
make build   # builds ./datalint, ./datalint-lsp, ./datalint-mcp

# CLI
./datalint tests/fixtures/jsonl-malformed-line/positive.jsonl   # JSON output by default
./datalint --format=html ... > report.html                      # self-contained HTML
./datalint --format=sarif ...                                   # SARIF 2.1.0 for code-scanning
./datalint --format=drops ... | sort -u                         # row-removal manifest: path<TAB>row<TAB>rules
./datalint --train train.jsonl --eval eval.jsonl                # corpus-scope leakage rules
./datalint --config datalint.yml ...                            # custom thresholds & filters
./datalint --fix path/to/pipeline.py                            # apply auto-fixes in place
./datalint --fail-on=error --min-severity=warning ...           # CI exit codes + display filter
./datalint --diff-old old.jsonl --diff-new new.jsonl            # row count + field set + distribution + length stats
./datalint --diff-old ... --diff-new ... --diff-format=json     # same diff in JSON for scripted consumers
./datalint --dataset train=t.jsonl --dataset eval=e.jsonl ...   # N-way cross-dataset overlap

# IDE / agent integrations (JSON-RPC over stdio)
./datalint-lsp   # Language Server Protocol
./datalint-mcp   # Model Context Protocol
```

A representative datalint.yml:
```yaml
enable:
  - jsonl-malformed-line
  - role-inversion
  - train-eval-overlap
disable:
  - dedup-key-misses-normalization   # known-noisy on this codebase
rules:
  enum-drift:
    lock_in_rows: 50
    max_distinct: 20
  optional-field-required-by-downstream:
    min_presence_ratio: 0.95
    min_rows: 50
    required_fields:                 # explicit schema; overrides ratio for these fields
      - input
      - output
  field-type-mismatch-with-schema:
    field_types:                     # path → JSON type (string|number|boolean|array|object|null)
      input: string
      output: string
      score: number
      tags: array
      meta.author: string            # nested object access
      messages[].role: string        # array-each-element
      messages[].content: string
  train-eval-overlap:
    prompt_field: input
    near_dup_threshold: 0.85
  parquet-row-group-too-large-for-streaming:
    max_rows_per_group: 500000
  system-prompt-leaks-eval-instructions:
    extra_patterns:
      - "(?i)reply with one of"
      - "MMLU"
  privacy-pii-detected:
    extra_patterns:
      - "internal-id=INT-\\d{6,}"
  cross-dataset-overlap:
    prompt_field: input
    near_dup_threshold: 0.85
    anchor: later                    # later (default) | earlier: which side of each pair hosts findings
```

In-source / in-data suppression:
```python
random.shuffle(data)  # datalint:disable=random-seed-not-set
```

```json
{"messages": [...], "_datalint_disable": ["role-inversion"]}
```

Seventeen rules ship across all five README categories, with configurable thresholds, enable/disable lists, four output formats (JSON / SARIF / HTML / drops), MinHash + LSH near-duplicate detection, suppression markers, auto-fix for `random-seed-not-set`, N-way cross-dataset overlap, and live-linting LSP / MCP servers. Diff mode reports per-field distribution shifts, character-length percentiles, and Unicode-script mix, as text or JSON.
| ID | Category | Severity | Confidence | Source | Auto-fix |
|---|---|---|---|---|---|
| `jsonl-malformed-line` | file | error | high | per-file (JSONL) | none |
| `parquet-row-group-too-large-for-streaming` | file | warning | medium | per-file (Parquet) | none |
| `field-type-mixed-across-rows` | schema | warning | high | per-file (JSONL) | none |
| `enum-drift` | schema | warning | medium | per-file (JSONL) | none |
| `optional-field-required-by-downstream` | schema | warning | medium | per-file (JSONL) | none |
| `field-type-mismatch-with-schema` | schema | warning | high | per-file (JSONL) | none |
| `role-inversion` | conversation | error | high | per-file (JSONL) | none |
| `system-message-mid-conversation` | conversation | error | high | per-file (JSONL) | none |
| `unbalanced-tool-call-id` | conversation | error | high | per-file (JSONL) | none |
| `tool-result-without-tool-call` | conversation | error | high | per-file (JSONL) | none |
| `random-seed-not-set` | pipeline | warning | medium | per-file (Python AST) | idiomatic |
| `shuffle-after-split` | pipeline | error | medium | per-file (Python AST) | none |
| `dedup-key-misses-normalization` | pipeline | warning | low | per-file (Python AST) | none |
| `train-eval-overlap` | leakage | error | high | corpus-scope (`--train`/`--eval`) | none |
| `cross-dataset-overlap` | leakage | error | high | corpus-scope (`--dataset`) | none |
| `system-prompt-leaks-eval-instructions` | leakage | warning | medium | per-file (JSONL) | none |
| `privacy-pii-detected` | file | error | medium | per-file (JSONL) | none |
- Outputs: JSON (default), SARIF 2.1.0, self-contained HTML, and drops (row-removal manifest: `path<TAB>row<TAB>rules`).
- Configuration: per-rule and global enable/disable via `datalint.yml`.
- Corpus-scope dispatch: `--train`/`--eval` (2-way) or `--dataset NAME=PATH[,PATH...]` (N-way pairwise).
- CI: `--fail-on={none,info,warning,error}` for exit codes; `--min-severity={...}` for output filtering.
- Diff mode: `--diff-old` / `--diff-new` reports row-count delta, field-set delta, per-field top-value distribution shifts, linearly-interpolated character-length percentiles (count / mean / min / p50 / p90 / p99 / max), and per-field Unicode-script mix (Latin / Han / Cyrillic / Hiragana / Katakana / Hangul / Arabic / Hebrew / Devanagari / Greek / Thai / Other); `--diff-format=text|json` (text default).
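A drops manifest can be applied to a JSONL file with a few lines of scripting. The sketch below is illustrative only: the `path<TAB>row<TAB>rules` layout comes from the description above, but it assumes that `row` is a 1-based line number and that several rules on one row arrive in a single third column. Neither is confirmed here, and `parse_drops` and `filter_rows` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set


def parse_drops(lines: Iterable[str]) -> Dict[str, Set[int]]:
    """Group the row numbers to drop by file path.

    Each manifest line is expected to look like: path<TAB>row<TAB>rules.
    Duplicate lines collapse naturally because rows are stored in a set.
    """
    drops: Dict[str, Set[int]] = defaultdict(set)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        path, row, _rules = line.split("\t", 2)
        drops[path].add(int(row))
    return dict(drops)


def filter_rows(rows: Iterable[str], to_drop: Set[int], first_row: int = 1) -> Iterator[str]:
    """Yield the rows whose number is not listed in the manifest.

    first_row=1 encodes the assumption that row numbers are 1-based;
    pass first_row=0 if the manifest turns out to be 0-based.
    """
    for number, row in enumerate(rows, start=first_row):
        if number not in to_drop:
            yield row
```

Keeping the row-number base as a parameter makes the script easy to adjust once the actual convention is checked against a real manifest.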
- `datalint-lsp`: Language Server speaking JSON-RPC 2.0 over stdio. Capabilities: `textDocumentSync` (Incremental), `diagnosticProvider`, `codeActionProvider`. Lints on `didOpen`/`didChange` (live, against the in-memory buffer for Python; range-bearing changes splice into the buffer using UTF-16 character offsets per the LSP default) and `didSave`. Auto-fixes surface as `quickfix` code actions, the same edits the CLI's `--fix` would apply.
- `datalint-mcp`: Model Context Protocol server with newline-delimited JSON-RPC 2.0 over stdio. Surface:
  - Tools: `lint` (returns findings as a JSON text block) and `fix` (lints, applies fixes via `internal/fixer`, returns a summary plus the pre-fix findings).
  - Resources: `datalint:rules/index` (Markdown table of every registered rule) and `datalint:config/example` (annotated `datalint.yml` covering every config knob).
  - Prompts: three templates that return system+user message pairs for the agent to specialize on.
    - `explain-rule(rule_id)`: explain a rule's bug class, why it matters, and what to do when it fires.
    - `draft-fix(rule_id, path, optional line/row/message)`: draft a unified-diff patch for code findings or a row replacement/removal for data findings.
    - `review-corpus(paths, optional dataset_names/goals)`: suggest a starting datalint configuration plus the CLI commands to run on the listed corpus.
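The UTF-16 detail matters because LSP `character` offsets count UTF-16 code units, so a character outside the Basic Multilingual Plane (an emoji, for instance) occupies two units. The following is a minimal Python sketch of splicing a range-bearing change into an in-memory buffer. It is not datalint's Go implementation; the function names are hypothetical and it assumes `\n` line endings.

```python
from typing import Tuple


def utf16_to_index(line_text: str, utf16_offset: int) -> int:
    """Convert a UTF-16 code-unit offset within one line to a Python str index."""
    units = 0
    for index, ch in enumerate(line_text):
        if units >= utf16_offset:
            return index
        # Code points above U+FFFF need a surrogate pair, i.e. two UTF-16 units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(line_text)


def position_to_index(text: str, line: int, character: int) -> int:
    """Map an LSP (line, character) position to an index into `text`."""
    lines = text.split("\n")
    line = min(line, len(lines) - 1)
    # +1 per preceding line accounts for the "\n" removed by split().
    start = sum(len(prev) + 1 for prev in lines[:line])
    return start + utf16_to_index(lines[line], character)


def splice(text: str, start: Tuple[int, int], end: Tuple[int, int], new_text: str) -> str:
    """Replace the range [start, end) with new_text, both given as (line, character)."""
    a = position_to_index(text, *start)
    b = position_to_index(text, *end)
    return text[:a] + new_text + text[b:]
```

Treating offsets as plain string indices would shift every edit that follows an astral character on the same line by one position.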
Schema discipline
- `field-type-mixed-across-rows`: `score` is float in 99% of rows and string in 1%.
- `field-type-mismatch-with-schema`: declare per-path JSON types in `field_types`; the rule fires when an actual value doesn't match the declared type (null counts as a mismatch). Paths support nested objects (`meta.author`) and array-each-element (`messages[].role`).
- `optional-field-required-by-downstream`: fields almost always present (presence-ratio heuristic), plus an optional `required_fields` list for strict schema-vs-data checking.
- `enum-drift`: a new label appears mid-file with no schema update.
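To make the path syntax concrete, here is a small Python sketch of how `field_types` paths such as `meta.author` and `messages[].role` could be resolved against a row and compared with declared JSON types. It illustrates the semantics described above and is not datalint's implementation; `resolve`, `json_type`, and `check_row` are hypothetical names, and the choice to skip absent fields is an assumption.

```python
from typing import Any, Dict, Iterator, List, Tuple


def json_type(value: Any) -> str:
    """Name the JSON type of a decoded Python value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must precede the number check: bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def resolve(value: Any, path: str) -> Iterator[Any]:
    """Yield every value a dotted path points at; a `[]` suffix fans out over array elements."""
    if not path:
        yield value
        return
    head, _, rest = path.partition(".")
    each = head.endswith("[]")
    key = head[:-2] if each else head
    if not isinstance(value, dict) or key not in value:
        return  # assumption: absent fields are left to other rules
    child = value[key]
    if each and isinstance(child, list):
        for item in child:
            yield from resolve(item, rest)
    else:
        yield from resolve(child, rest)


def check_row(row: Dict[str, Any], field_types: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (path, expected, actual) for every value whose type differs from the declaration."""
    findings = []
    for path, expected in field_types.items():
        for value in resolve(row, path):
            actual = json_type(value)
            if actual != expected:
                findings.append((path, expected, actual))
    return findings
```

Because `null` is reported as its own type, a declared `string` field holding `null` shows up as a mismatch, matching the rule description.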
Conversation/tool-call hygiene
- `role-inversion`: `assistant` follows `assistant` with no `user` in between.
- `tool-result-without-tool-call`: `tool` role with no preceding `tool_use`.
- `system-message-mid-conversation`.
- `unbalanced-tool-call-id`: `tool_use_id` referenced but never opened.
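The checks in this category amount to one pass over a conversation while tracking a little state. The sketch below shows that idea in Python under a loudly assumed row shape: each message is a dict with a `role`, assistant messages may carry a `tool_use` list of `{"id": ...}` entries, and tool messages carry a `tool_use_id`. The real field layout datalint expects is not stated here, and the sketch reads "no `user` in between" as two adjacent assistant turns.

```python
from typing import Any, Dict, List, Set, Tuple

Finding = Tuple[int, str]  # (message index, rule id)


def check_conversation(messages: List[Dict[str, Any]]) -> List[Finding]:
    """Single pass over one conversation; the message field names are assumptions."""
    findings: List[Finding] = []
    opened: Set[str] = set()  # tool-call ids opened by earlier assistant turns
    prev_role = None
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "assistant" and prev_role == "assistant":
            findings.append((i, "role-inversion"))
        if role == "system" and i > 0:
            findings.append((i, "system-message-mid-conversation"))
        if role == "assistant":
            for call in msg.get("tool_use", []):
                opened.add(call["id"])
        if role == "tool":
            if not opened:
                # No tool call has been made at all before this result.
                findings.append((i, "tool-result-without-tool-call"))
            elif msg.get("tool_use_id") not in opened:
                # Calls exist, but this result references an id nobody opened.
                findings.append((i, "unbalanced-tool-call-id"))
        prev_role = role
    return findings
```

Returning the message index alongside the rule id mirrors the row-pointer style of finding, so a reviewer can jump straight to the offending turn.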
Leakage
- `train-eval-overlap`: exact or near-duplicate prompts appear in both splits (MinHash + 32×4 LSH bands).
- `cross-dataset-overlap`: N-way pairwise generalization of `train-eval-overlap` for projects with more than two splits (`--dataset NAME=PATH[,PATH...]`).
- `eval-prompt-in-pretrain-corpus`: given an eval set, detect contamination in a training shard. (Covered by `train-eval-overlap` / `cross-dataset-overlap` with renamed flags; a dedicated rule is a follow-up.)
- `system-prompt-leaks-eval-instructions`.
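For readers unfamiliar with the technique, the following is a generic Python sketch of MinHash signatures with LSH banding. It reads "32×4" as 32 bands of 4 rows each (128 hash functions), which is an assumption, and the character 5-gram shingling, the normalization, and the BLAKE2b-based hash family are illustrative choices rather than datalint's own.

```python
import hashlib
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

NUM_BANDS = 32       # assumed reading of "32x4": 32 bands ...
ROWS_PER_BAND = 4    # ... of 4 signature rows each
NUM_HASHES = NUM_BANDS * ROWS_PER_BAND


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-grams over lowercased, whitespace-collapsed text."""
    norm = " ".join(text.lower().split())
    if len(norm) <= k:
        return {norm}
    return {norm[i:i + k] for i in range(len(norm) - k + 1)}


def _hash(shingle: str, seed: int) -> int:
    """One member of the hash family, selected by salting BLAKE2b with the seed."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> List[int]:
    """Signature: the minimum hash of the shingle set under each hash function."""
    grams = shingles(text)
    return [min(_hash(g, seed) for g in grams) for seed in range(NUM_HASHES)]


def candidate_pairs(signatures: Dict[str, List[int]]) -> Set[Tuple[str, str]]:
    """Keys that agree on every row of at least one band become candidate pairs."""
    buckets: Dict[Tuple[int, Tuple[int, ...]], Set[str]] = defaultdict(set)
    for key, sig in signatures.items():
        for band in range(NUM_BANDS):
            chunk = tuple(sig[band * ROWS_PER_BAND:(band + 1) * ROWS_PER_BAND])
            buckets[(band, chunk)].add(key)
    pairs: Set[Tuple[str, str]] = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """Fraction of signature positions that agree; estimates shingle-set Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

In a setup like this, banding only proposes candidates cheaply; each candidate pair would still be compared against a threshold such as `near_dup_threshold` before being reported.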
Pipeline code
- `random-seed-not-set`: split function uses unseeded RNG. Auto-fix inserts a seed call after the last import.
- `shuffle-after-split`: order corruption; breaks reproducibility.
- `dedup-key-misses-normalization`: dedup runs before unicode/whitespace normalization, so it undercounts duplicates.
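As a point of reference, this is roughly what a pipeline that avoids all three bug classes might look like. It is a hedged Python sketch: the `input` key, the NFKC-plus-casefold normalization, and the helper names are illustrative assumptions, and whether datalint recognizes `random.Random(seed)` as a seeded RNG is not confirmed here.

```python
import random
import unicodedata
from typing import Dict, List, Tuple

Row = Dict[str, str]


def normalize(text: str) -> str:
    """Unicode-normalize, collapse whitespace, and casefold before building a dedup key."""
    return " ".join(unicodedata.normalize("NFKC", text).split()).casefold()


def dedup(rows: List[Row], key: str = "input") -> List[Row]:
    """Keep the first row for each normalized key, so near-identical spellings collapse."""
    seen = set()
    kept = []
    for row in rows:
        k = normalize(row[key])
        if k in seen:
            continue
        seen.add(k)
        kept.append(row)
    return kept


def split(rows: List[Row], eval_fraction: float = 0.1, seed: int = 1234) -> Tuple[List[Row], List[Row]]:
    """Shuffle once with an explicit seed, then cut; nothing is shuffled after the split."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_fraction))
    return shuffled[:cut], shuffled[cut:]
```

Running `dedup` before `split` also keeps normalized duplicates from landing on both sides of the cut, which is the leakage case the overlap rules look for.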
File-level
- `jsonl-malformed-line`: non-JSON line, pinpointed.
- `parquet-row-group-too-large-for-streaming`: a row group's `NumRows` exceeds the streaming-friendly threshold.
- `privacy-pii-detected`: string fields match email / US SSN / phone / credit-card patterns; project-specific patterns are added via `extra_patterns`.
- `mds-shard-size-imbalanced`: not yet implemented; needs an MDS reader.
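"Pinpointed" here means a finding carries enough position information to jump to the bad line. The idea can be sketched with Python's standard `json` module as below. This is an illustration of streaming line-by-line validation, not datalint's Go reader; skipping blank lines is an assumption, and `malformed_lines` is a hypothetical name.

```python
import json
from typing import Iterable, Iterator, Tuple


def malformed_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (line number, column, parser message) for each line that is not valid JSON.

    Lines are consumed one at a time, so a file object can be passed directly
    without loading the whole dataset into memory.
    """
    for line_no, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text.strip():
            continue  # assumption: blank lines are ignored rather than flagged
        try:
            json.loads(text)
        except json.JSONDecodeError as err:
            yield line_no, err.colno, err.msg
```

A typical call would be `for finding in malformed_lines(open(path, encoding="utf-8")): ...`, with each tuple turned into a row-pointer finding.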
- Go, tree-sitter Python (for pipeline code) + JSONL streaming + Parquet metadata read in Go. MDS / WebDataset on the roadmap.
- Two layers:
  - Code rules: same shape as Krit; walk the Python AST to flag pipeline mistakes.
  - Data rules: stream the dataset, compute row-level + corpus-level stats, emit findings with line/row pointers.
- Capability gates: `NeedsCorpusScan`, `NeedsLSH`, `NeedsExternalEvalSet`, `NeedsPythonAST`, `NeedsJSONL`, `NeedsParquet`. Declared on each rule; the dispatcher routes per-file vs corpus-scope accordingly.
- Outputs: JSON, SARIF 2.1.0, HTML report, drops (per-row removal manifest).
- Autofix tiers: `cosmetic`, `idiomatic`, `semantic`. `random-seed-not-set` emits an `idiomatic` fix; the `--fix` flag applies dedup'd edits in reverse-line order. The same fix surfaces through LSP `textDocument/codeAction` and the MCP `fix` tool.
- LSP server: incremental `didOpen`/`didChange`/`didSave`/`didClose` lifecycle, in-memory buffer store for live linting Python, `quickfix` code actions for fixes in the editor's selected range.
- MCP server: `tools/list` + `tools/call` for `lint` and `fix`; `resources/list` + `resources/read` for the rules-index Markdown and the annotated config example; `prompts/list` + `prompts/get` for `explain-rule`, `draft-fix`, and `review-corpus`. Same rule pipeline as the CLI.
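Applying edits in reverse-line order is what keeps earlier line numbers valid while later parts of the file change. A minimal Python sketch of that idea follows. The `Edit` shape and `apply_edits` are hypothetical, the sketch does not handle overlapping edits, and the `random.seed(0)` value in the usage note is purely illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Edit:
    start: int                  # 0-based index of the first line to replace
    end: int                    # one past the last line to replace; start == end is a pure insert
    new_lines: Tuple[str, ...]  # replacement lines


def apply_edits(lines: List[str], edits: Iterable[Edit]) -> List[str]:
    """Apply deduplicated edits from the bottom of the file upward.

    Working bottom-up means an edit never shifts the line numbers that an
    edit higher in the file still has to use.
    """
    out = list(lines)
    # set() drops identical edits reported more than once (frozen dataclasses are hashable).
    unique = sorted(set(edits), key=lambda e: (e.start, e.end), reverse=True)
    for edit in unique:
        out[edit.start:edit.end] = edit.new_lines
    return out
```

For example, an insert such as `Edit(1, 1, ("random.seed(0)",))` placed after the last import can be applied together with an edit further down the file without recomputing any offsets.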
- Skeleton + tree-sitter Python. ✓
- JSONL streaming reader with row-pointer findings. ✓
- Five rules (mix of code + data + leakage). ✓ (seventeen)
- HTML report. ✓
- CI on a public RLHF corpus (e.g. HH-RLHF, UltraFeedback) — hand-label to compare. (internal smoke corpus covers regression at small scale; full public-corpus run is a follow-up)
- MDS, WebDataset support: Parquet landed, MDS is the remaining file format.
- Auto-fix on more rules: currently only `random-seed-not-set` emits one.
- Schemas across files: `field_types` covers single-file paths today; pinning a single schema at the dataset level so every shard validates against it is the next axis.
- Per-rowgroup byte heuristic for the parquet rule (waits for an upstream API surface).
- Per-language mix beyond script: the diff reports Unicode-script mix; distinguishing English from German (both Latin) would need a model or wordlist.
- More MCP prompts: `explain-rule`, `draft-fix`, and `review-corpus` ship today; further templates (e.g. a per-finding triage walkthrough) are an open follow-up.
Training-data bugs are almost universally caught after the fact, by an eval regression or — worse — a release. The cost asymmetry is enormous: catching a leakage issue at lint time saves a training run. Krit's incremental + capability-gated architecture is exactly right because data-rule passes can be expensive (corpus-wide MinHash) and you don't want to run them on every CI commit.
- Training framework integration.
- Quality scoring of individual examples (that's reward-model territory).
- Replacing existing data validation libs (Great Expectations, Pandera) — datalint complements them by focusing on LLM-data-specific failure modes.