This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Behavioral guidelines (after Andrej Karpathy) to reduce common LLM coding mistakes; they apply to all work in this repo. The same content is also installed as the on-demand skill .claude/skills/karpathy-guidelines/.
Tradeoff: These guidelines bias toward caution over speed. For trivial tasks, use judgment.
Don't assume. Don't hide confusion. Surface tradeoffs.
Before implementing:
- State your assumptions explicitly. If uncertain, ask.
- If multiple interpretations exist, present them - don't pick silently.
- If a simpler approach exists, say so. Push back when warranted.
- If something is unclear, stop. Name what's confusing. Ask.
Minimum code that solves the problem. Nothing speculative.
- No features beyond what was asked.
- No abstractions for single-use code.
- No "flexibility" or "configurability" that wasn't requested.
- No error handling for impossible scenarios.
- If you write 200 lines and it could be 50, rewrite it.
Ask yourself: "Would a senior engineer say this is overcomplicated?" If yes, simplify.
Touch only what you must. Clean up only your own mess.
When editing existing code:
- Don't "improve" adjacent code, comments, or formatting.
- Don't refactor things that aren't broken.
- Match existing style, even if you'd do it differently.
- If you notice unrelated dead code, mention it - don't delete it.
When your changes create orphans:
- Remove imports/variables/functions that YOUR changes made unused.
- Don't remove pre-existing dead code unless asked.
The test: Every changed line should trace directly to the user's request.
Define success criteria. Loop until verified.
Transform tasks into verifiable goals:
- "Add validation" → "Write tests for invalid inputs, then make them pass"
- "Fix the bug" → "Write a test that reproduces it, then make it pass"
- "Refactor X" → "Ensure tests pass before and after"
For multi-step tasks, state a brief plan:
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
Strong success criteria let you loop independently. Weak criteria ("make it work") require constant clarification.
These guidelines are working if: fewer unnecessary changes in diffs, fewer rewrites due to overcomplication, and clarifying questions come before implementation rather than after mistakes.
minisv is a lightweight mosaic/somatic structural-variation (SV) caller for long
genomic reads (PacBio HiFi / ONT). Its defining idea: align reads against multiple
reference genomes / pangenome graphs and keep an SV on a read only if it appears in
the alignments against all references. This filters alignment errors and germline/
population SVs without a matched normal.
This is a Python port of minisv.js plus extra features.
Output parity with the JS implementation is a design goal — many functions carry
NOTE: comments documenting JS quirks that are deliberately preserved. The shell tests
in tests/ diff Python output against minisv.js run via k8.
make # == poetry install (the only build step)
poetry run minisv --help # CLI entry point is minisv.cli:cli (rich-click)
poetry run pytest tests/ # Python unit tests (only test_constant.py is meaningful;
# test_cigar_ds.py is mostly unimplemented stubs)
pre-commit run --all-files # black (py3.10) + isort (--profile black) + ruff lint/format

There is no lint/test target in the Makefile and no CI; quality is enforced only by the pre-commit hooks above.
The tests/*.sh scripts are integration checks, not runnable here: they hardcode
lab-cluster paths (/hlilab/..., /homes6/...) and require the k8/minisv.js
reference binary and large alignment files. Read them to understand expected CLI usage
and the diff-against-JS validation pattern, not to execute.
build.py cythonizes minisv/*.pyx (only cyminisv.pyx) with -march=native -O3, but
the [tool.poetry.build] hook is commented out in pyproject.toml, so the package
installs as pure Python. The Cython path (cy_GafParser) is experimental and unused by
the active CLI.
The standalone caller is three streaming steps (see README for the germline / somatic / mosaic / trio variants — they differ only in inputs and which calls are kept):
extract → isec → (sort -k1,1 -k2,2n) → merge → genvcf
- extract (read_parser.load_reads → breakpoint.get_breakpoint + indel.get_indel): streams one read group at a time from PAF/GAF, emits raw SVs in a BED-like minisv format (.rsv/.gsv).
- isec (cli.py + type.get_type): keeps only SVs on a read that appear in every input file (the multi-reference filter).
- merge (merge.merge_sv): clusters overlapping records into calls, counts supporting reads per sample/strand, applies count/strand/centromere-distance filters. Requires input pre-sorted with sort -k1,1 -k2,2n.
- genvcf (io.write_vcf): minisv .msv → VCF. minisv never infers genotypes.
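The multi-reference filter in `isec` can be pictured as a per-read set intersection. The sketch below is a simplification, not the code in `cli.py`. It assumes each input has already been reduced to hypothetical `(read_name, sv_flag)` pairs and that matching is exact equality, whereas the real command compares records through `type.get_type`.

```python
def isec_reads(inputs):
    """Keep records from the first input that also occur in every other input.

    `inputs` is a list of lists of (read_name, sv_flag) tuples, one list per
    reference. The tuple shape and the exact-match rule are assumptions made
    for illustration only.
    """
    if not inputs:
        return []
    others = [set(records) for records in inputs[1:]]
    return [rec for rec in inputs[0] if all(rec in s for s in others)]


if __name__ == "__main__":
    vs_linear = [("read1", 2), ("read2", 1)]
    vs_graph = [("read1", 2)]
    # read2's insertion is seen against only one reference, so it is dropped.
    print(isec_reads([vs_linear, vs_graph]))
```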
Alignments must carry the ds:Z tag — minigraph ≥0.21 (-cxlr --ds) or minimap2
≥2.28 (-cx map-hifi -s50 --ds for HiFi, -cxlr:hq for ONT). In load_reads, records
without a cg:Z tag or that are not primary (tp:A:P) are silently skipped.
- .rsv/.gsv: raw per-read SVs. Two record shapes distinguished by column 3: breakpoint records have an orientation token (>>, <<, ><, <>) and use col_info=8 (graph) or 6; indel records use col_info=4. This col_info offset logic recurs across isec, merge.parse_sv, and type.get_type.
- .msv: merged calls (merge output).
- Info field is a ;-delimited string: SVTYPE=, SVLEN=, qoff_l=, qoff_r=, source=, count=, reads=, ....
- source= is the sample name (set via extract -n); somatic calling = grep the wanted sample out of merged output (e.g. grep TUMOR | grep -v NORMAL).
- SV type encoding (type.py): INS/DUP→flag 1, DEL→2, INV→4, cross-contig BND→8.
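A minimal sketch of reading the info field and mapping `SVTYPE` to the flag values listed above. The names `parse_info` and `SV_FLAGS` are hypothetical, and the example info string is made up; the real parsing lives in `merge.parse_sv` and `type.get_type`.

```python
# Flag values as listed above; 8 applies to cross-contig BND records.
SV_FLAGS = {"INS": 1, "DUP": 1, "DEL": 2, "INV": 4, "BND": 8}


def parse_info(info):
    """Split a ';'-delimited 'key=value' string into a dict of strings."""
    out = {}
    for field in info.split(";"):
        if not field:
            continue
        key, sep, value = field.partition("=")
        out[key] = value if sep else ""
    return out


if __name__ == "__main__":
    info = parse_info("SVTYPE=DEL;SVLEN=-120;source=TUMOR;count=3")
    print(info["source"], SV_FLAGS[info["SVTYPE"]])
```

Because the flags are distinct bits, INS and DUP sharing flag 1 means an overlap test done with bitwise AND treats them as compatible types.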
cli.py is the hub: every subcommand lives here and builds one of the option dataclasses
(opt for extract, mergeopt, unionopt, EvalOpt, viewopt) that thread through the
pipeline.
| Module | Role |
|---|---|
| read_parser.py | load_reads: group reads, dispatch to indel + breakpoint extractors |
| breakpoint.py | split-read breakends; infer_svtype classifies DEL/INS/DUP/INV/BND from orientation + gaps; get_end_coor maps path coords |
| indel.py | cigar/ds:Z parsing for long INDELs; TSD/polyA (retrotransposon) detection |
| merge.py | merge_sv/same_sv/write_sv: cluster records, count support, filter |
| type.py | SV-type flags for isec overlap tests |
| eval.py | eval command (callset comparison); shared VCF/minisv parsing (gc_parse_sv), interval tree (iit_*), bed reader (gc_read_bed) |
| io.py | write_vcf (genvcf), gc_cmd_view (view), parseNum ("500k"→int) |
| union.py | union_sv/advunion_sv/union_sv_with_tr: ensemble callsets into binary-membership truth sets |
| ensemble.py | insilico_truth (ensembleunion / collapse), double_strand_break |
| filtercaller.py | heavyweight filtering orchestration (see below) |
| annot.py | annot command |
| phase.py | extracthp / annotatehp haplotype-tag handling |
| annotation.py, graph_genome_coor.py, regex.py | centromere distance/overlap, GAF-path→contig coords, shared compiled regexes |
| minisv.py | legacy GafParser behind getindel/getsv (older interface) |
| identify_breaks_v5.py, merge_break_pts_v3.py, paired_merge_v2.py | superseded versioned implementations, not on the active code path |
This drives sv-cross-ref-filter (somatic), sv-trio-filter (de novo in a trio), and
test-sv-filter (cutoff sweep on an existing workdir) via three near-parallel classes:
MinisvReads, MinisvReadsTrio, MSVTestCutOff. They share a method sequence:
extract_read_ids (from caller VCF + read-id TSV)
→ extract_reads (samtools/seqtk from BAM/CRAM)
→ align_reads_to_self / _to_graph (mappy + external minimap2/minigraph via --mm2/--mg)
→ parse_raw_sv_* (reuse the extract pipeline)
→ isec_* → othercaller_filterasm → union_filtered_vcf
The goal is to re-derive SV evidence for a third-party caller's calls (Severus, SAVANA,
nanomonsv, Sniffles2, etc.) by realigning only the supporting reads to a de novo assembly
and pangenome, then keep only calls still supported. The three classes are largely
copy-pasted with per-mode tweaks — changes to one usually need mirroring in the others.
External tool paths come from --mm2/--mg; intermediate .gsv.gz/.paf.gz/.gaf.gz
and per-caller *_filtered.{vcf,stat} files land in the output workdir.
data/*.bed and minisv/*.bed are centromere masks (*.cen-mask.bed) and confident
regions (*.reg.bed) for hs38 and chm13v2 — passed via -b/--maskb to suppress
spurious calls in centromeric/satellite repeats. minisv/paired.chm13.*.vcf is a packaged
breakpoint reference.
Three improvements agreed with the user. Full plan file:
~/.claude/plans/now-let-s-plan-first-partitioned-orbit.md. Design decisions were made via
explicit user choices — do not re-litigate them without asking.
Status (synced 2026-05-27): 3 of 7 code pieces landed in the working tree (uncommitted).
Task 1 complete (function + CLI command + unit tests in tests/test_gnomad_filter.py),
Task 2 design locked, ready to execute next session (sub-features 2a + 2b below, all
decisions made — do not re-litigate), Task 3 partial. Docs/README not yet updated.
Drop a caller's calls that fall in common population SV regions. Decision: BED-based (type-agnostic breakpoint overlap, no allele-frequency parsing); standalone command (not wired into the orchestrators), caller-agnostic because VCF parsing is shared.
- [DONE] New gnomad_filter(vcf_file, gnomad_bed, opt, both_ends=False, pad=0, out=None) in minisv/filtercaller.py:173: loads BED with gc_read_bed (eval.py:881), parses the caller VCF permissively with gc_parse_sv(vcf_file, 0, 0, opt.ignore_flt, False) (eval.py:184, which sets svid=t[2] and fills ctg/pos/ctg2/pos2), builds a drop-set of parse_svid(t.svid) for SVs whose breakpoint overlaps the BED via iit_overlap (eval.py:773, guarded by t.ctg in bed). By default a call is dropped if either end overlaps; both_ends requires both. It then re-emits the original VCF minus the drop-set.
- [DONE] New CLI command gnomadfilter in cli.py (after filterasm; args gnomad_bed, vcffile; options -F/--ignoreflt, --both, --pad; filtered VCF to stdout). Builds an EvalOpt(ignore_flt=...) and calls gnomad_filter.
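The either-end versus both-ends drop rule can be sketched as follows. This is an illustration only: it uses a linear scan over half-open `(start, end)` intervals in place of `iit_overlap`, treats breakpoints as 0-based positions, and the helper names `overlaps` and `should_drop` are hypothetical rather than functions in `filtercaller.py`.

```python
def overlaps(intervals, pos, pad=0):
    """True if the window [pos - pad, pos + pad] touches any [start, end) interval."""
    return any(st < pos + pad + 1 and pos - pad < en for st, en in intervals)


def should_drop(bed, ctg, pos, ctg2, pos2, both_ends=False, pad=0):
    """Decide whether a call is dropped, given `bed` as {contig: [(start, end), ...]}."""
    # Contigs absent from the BED can never overlap (the "ctg in bed" guard).
    hit1 = ctg in bed and overlaps(bed[ctg], pos, pad)
    hit2 = ctg2 in bed and overlaps(bed[ctg2], pos2, pad)
    return (hit1 and hit2) if both_ends else (hit1 or hit2)
```

With the default rule a call is removed as soon as one breakpoint lands in a common-SV region; `both_ends=True` is the more conservative choice and keeps calls that only partially overlap.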
Export SVs called by all of severus/savana/nanomonsv from the raw VCFs, regardless
of asm/pangenome filtering, and the subset of that consensus the de-novo-assembly filter
dropped. Decision: first-3 callers, strict (self.som_vcfs[:3], present in every one).
Both products are emitted by the single --output_consensus flag. Runs inside
MinisvReads only (not trio / cutoff classes), and after union_filtered_vcf because
it consumes that step's l+s_union.msv.
2a. Consensus set (raw):
- Add keyword min_file_count=None to union_sv in minisv/union.py; in the per-group loop (after the in_bed/length/count skips ~union.py:97-99) add if min_file_count is not None and bin(x).count("1") < min_file_count: continue. Default None keeps all existing callers unchanged.
- New MinisvReads.output_consensus(self, read_min_len, opt) (next to union_filtered_vcf ~filtercaller.py:947): union_sv(self.som_vcfs[:3], ..., min_file_count=3) → consensus_union.msv, then insilico_truth (ensemble.py:6) → consensus_3caller_dedup.msv.
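The `min_file_count` check counts set bits in a membership value. The sketch below isolates that test, assuming `x` is an integer bitmask with one bit per input file (the binary-membership encoding that `union.py` is described as producing); the function name is hypothetical.

```python
def passes_min_file_count(membership, min_file_count=None):
    """Return True if the group should be kept.

    `membership` is assumed to be a bitmask with bit i set when input file i
    contributed a call to the group. None disables the filter entirely.
    """
    if min_file_count is None:
        return True
    return bin(membership).count("1") >= min_file_count


if __name__ == "__main__":
    for mask in (0b111, 0b101, 0b001):
        print(format(mask, "03b"), passes_min_file_count(mask, 3))
```

Only `0b111` survives a threshold of 3, which is the strict "present in every one of the first three callers" rule.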
2b. Consensus-lost-by-assembly subset (the requested feature): emit consensus calls the
de-novo assembly filtered out → consensus_lost_by_l+s.msv. Locked decisions:
- Scope = l+s only (reads re-aligned to the self de-novo assembly; not l+g/l+g+s). "Lost" is measured against l+s_union.msv (union of the 3 callers' l+s-filtered survivors, already produced by union_filtered_vcf ~filtercaller.py:963).
- Drop rule = lost everywhere: a consensus call qualifies iff it overlaps no call in l+s_union.msv. That is, all 3 callers agreed in raw, but after re-aligning supporting reads to the assembly no caller's filtered set still carries it.
- Output = .msv, folded into --output_consensus (no separate flag).
- Set difference via a small local helper that reuses the existing gc_cmp_same_sv1(opt.win_size, opt.min_len_ratio, …) coordinate test and the same sorted-scan/win_size window already used in union_sv; invent no new matching logic. Test consensus reps (consensus_3caller_dedup.msv) against l+s_union.msv; emit reps with no overlap.
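The shape of the subtraction helper might look like the sketch below. It is not the planned implementation: `same_sv` is a stand-in for `gc_cmp_same_sv1` (whose full signature is not shown here), records are reduced to hypothetical `(ctg, pos, svlen)` tuples, and the window lookup uses `bisect` rather than the scan in `union_sv`.

```python
from bisect import bisect_left


def same_sv(a, b, win_size, min_len_ratio):
    """Stand-in coordinate test (NOT gc_cmp_same_sv1): same contig, close
    positions, and comparable lengths."""
    if a[0] != b[0] or abs(a[1] - b[1]) > win_size:
        return False
    la, lb = abs(a[2]), abs(b[2])
    return min(la, lb) >= max(la, lb) * min_len_ratio


def lost_calls(consensus, survivors, win_size, min_len_ratio):
    """Return consensus calls with no matching survivor within win_size."""
    survivors = sorted(survivors)
    keys = [(s[0], s[1]) for s in survivors]
    lost = []
    for c in consensus:
        # Jump to the first survivor that could fall inside the window.
        lo = bisect_left(keys, (c[0], c[1] - win_size))
        hit = False
        for s in survivors[lo:]:
            if s[0] != c[0] or s[1] > c[1] + win_size:
                break  # past the window on this contig
            if same_sv(c, s, win_size, min_len_ratio):
                hit = True
                break
        if not hit:
            lost.append(c)
    return lost
```

In the real helper the comparison should be delegated to `gc_cmp_same_sv1` so that "lost" uses exactly the same notion of sameness as the union step.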
Wiring: add --output_consensus flag to the sv_cross_ref_filter command; after the
union_filtered_vcf(...) call (~cli.py:807) do if output_consensus: reads.output_consensus(...).
Tests (style of tests/test_gnomad_filter.py, synthetic .msv inputs): (i) min_file_count=3
keeps only all-three-caller groups, None unchanged; (ii) subtraction helper excludes a
consensus call with an overlapping l+s survivor, emits one with none. No JS-parity check (new
functionality, no minisv.js counterpart).
Pre-existing inconsistency to leave alone (flagged, do not "fix" silently): filtered VCFs are
written with opt.min_count in the name (filtercaller.py:932) but read with
opt.read_min_count (filtercaller.py:960). This design consumes the l+s_union.msv product,
so it is insulated; do not touch unless asked.
- [DONE] Replaced __version__ = "0.1.2" (cli.py:23) with importlib.metadata.version("minisv") + a PackageNotFoundError fallback to "0.1.3" so it tracks pyproject.toml and can't drift again.
- [DONE] Deleted the duplicate getsv command (commit 68dc9a1). The commented-out #def getsv( stub and the unrelated # getsv options comments in cli.py were left.
The Module-map note at the top has been updated (the obsolete "version disagrees / getsv is a
duplicate" sentence is removed). gnomadfilter + --gnomadaf are now in the README
(commit 2e9e2d8); adding --output_consensus to the README is still TODO.
Port the entire extract pipeline step — read_parser.load_reads + indel.get_indel +
breakpoint.get_breakpoint — to Rust, exposed as a PyO3 extension that the extract CLI
command calls in place of the Python load_reads. This is the successor to the abandoned Cython
path (cyminisv.pyx / cy_GafParser, wired in build.py). Decisions locked via AskUserQuestion
(2026-05-27) — do not re-litigate without asking:
- Scope = the whole extract step. Rust reads gzipped PAF/GAF, does tokenize + mapq < min_mapq / primary-only (tp:A:P) / cg:Z-required filtering + group-by-qname (read_parser.py:42-108), runs the indel + breakend SV inference, and emits the same .rsv/.gsv records. SAM input stays out of scope (already stubbed/commented in load_reads).
- Integration = PyO3 extension (maturin/PyO3 native module, e.g. minisv_rs), imported into Python; cli.py's extract calls minisv_rs.extract(...). Adds a cargo+maturin build step, a departure from the current pure-Python install; the commented-out [tool.poetry.build] Cython hook in pyproject.toml is the template for wiring an optional native build.
HARD CONSTRAINT — output parity. Rust output must be byte-identical to the current Python
load_reads→get_indel/get_breakpoint, which is itself diffed against minisv.js. The Rust port
must faithfully reproduce every JS-quirk NOTE: behavior in indel.py / breakpoint.py (incl.
the col_info 8-vs-6 graph-vs-linear offset, infer_svtype orientation/gap classification,
get_end_coor path→contig coord mapping, and TSD/polyA retrotransposon detection). Keep Python as
the reference oracle: add a locally-runnable test that runs Rust and Python on the same
PAF/GAF and diffs the .gsv (extends the tests/*.sh diff-against-JS pattern, but without needing
the k8/minisv.js binary).
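A byte-level parity check can be small. The helper below is a hypothetical sketch of what the locally-runnable test could use to report the first differing line; it assumes both implementations' outputs have already been captured as bytes, and leaves out how each side is invoked.

```python
def first_mismatch(py_out: bytes, rs_out: bytes):
    """Return (line_no, python_line, rust_line) for the first difference, or None.

    A missing line on either side is reported as None in that position.
    """
    a = py_out.split(b"\n")
    b = rs_out.split(b"\n")
    for i in range(max(len(a), len(b))):
        la = a[i] if i < len(a) else None
        lb = b[i] if i < len(b) else None
        if la != lb:
            return (i + 1, la, lb)
    return None
```

Comparing bytes rather than decoded text keeps the check strict about whitespace and number formatting, which is where ports tend to diverge.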
Suggested sequencing (de-risk parity first):
1. Rust crate + maturin/PyO3 skeleton; no-op extract round-tripping a PAF/GAF line. Verify: import minisv_rs works after maturin develop.
2. Port load_reads parse/group/filter; diff read grouping against Python on a sample PAF/GAF.
3. Port get_indel (cigar/ds:Z parse, TSD/polyA) → diff .gsv indel records byte-for-byte.
4. Port get_breakpoint (infer_svtype, get_end_coor) → diff .gsv breakpoint records.
5. Wire cli.py extract to call Rust when minisv_rs is importable, else fall back to Python. Verify: extract output unchanged on test inputs.
Recommendation: make the extension optional with a Python fallback — preserves pure-Python installability and keeps Python as a permanent parity oracle. Decide the build backend (maturin vs poetry + separate maturin build) at step 1.
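One way to wire the optional extension is an import-time probe, sketched below. `minisv_rs` and its `extract` attribute do not exist yet; they are the planned names from the decisions above, and `pick_extract` is a hypothetical helper.

```python
import importlib


def pick_extract(python_impl, module_name="minisv_rs"):
    """Return the native extract function if the extension imports, else the Python one."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return python_impl  # pure-Python install: extension was never built
    # Fall back as well if the module lacks the expected entry point.
    return getattr(mod, "extract", python_impl)
```

Keeping the selection in one place means the parity test can call both implementations explicitly while the CLI uses whichever one this returns.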
Bonus: filtercaller.py's realtime filter reuses this same extract pipeline via its
parse_raw_sv_* methods, so they inherit the speedup for free once extract is ported.
This project has a CodeGraph MCP server (codegraph_* tools) configured. CodeGraph is a tree-sitter-parsed knowledge graph of every symbol, edge, and file. Reads are sub-millisecond and return structural information grep cannot.
Use codegraph for structural questions — what calls what, what would break, where is X defined, what is X's signature. Use native grep/read only for literal text queries (string contents, comments, log messages) or after you already have a specific file open.
| Question | Tool |
|---|---|
| "Where is X defined?" / "Find symbol named X" | codegraph_search |
| "What calls function Y?" | codegraph_callers |
| "What does Y call?" | codegraph_callees |
| "How does X reach/become Y? / trace the flow from X to Y" | codegraph_trace (one call = the whole path, incl. callback/React/JSX dynamic hops) |
| "What would break if I changed Z?" | codegraph_impact |
| "Show me Y's signature / source / docstring" | codegraph_node |
| "Give me focused context for a task/area" | codegraph_context |
| "See several related symbols' source at once" | codegraph_explore |
| "What files exist under path/" | codegraph_files |
| "Is the index healthy?" | codegraph_status |
- Answer directly, don't delegate exploration. For "how does X work" / architecture questions, answer with 2-3 codegraph calls: codegraph_context first, then ONE codegraph_explore for the source of the symbols it surfaces. For a specific flow ("how does X reach Y") start with codegraph_trace from→to (one call returns the whole path with dynamic hops bridged), then ONE codegraph_explore for the bodies; don't rebuild the path with codegraph_search + codegraph_callers. Codegraph IS the pre-built index, so spawning a separate file-reading sub-task/agent, or running a grep + read loop, repeats work codegraph already did and costs more for the same answer.
- Trust codegraph results. They come from a full AST parse. Do NOT re-verify them with grep; that's slower, less accurate, and wastes context.
- Don't grep first when looking up a symbol by name. codegraph_search is faster and returns kind + location + signature in one call.
- Don't chain codegraph_search + codegraph_node when you just want context; codegraph_context is one call.
- Don't loop codegraph_node over many symbols. One codegraph_explore call returns several symbols' source grouped in a single capped call, while each separate node/Read call re-reads the whole context and costs far more.
- Index lag: check the staleness banner, don't guess a wait. When a codegraph response starts with "⚠️ Some files referenced below were edited since the last index sync…", the listed files are pending re-index; Read those specific files for accurate content. Files NOT in that banner are fresh and codegraph is authoritative for them. codegraph_status also lists pending files under "Pending sync".
If the MCP server returns "not initialized", ask the user: "I notice this project doesn't have CodeGraph initialized. Want me to run codegraph init -i to build the index?"