seqcheck validates sequencing reads against a seqspec.
This matters because a seqspec can be internally valid and still encode the wrong read geometry. A barcode, UMI, primer anchor, or feature can be placed in the wrong read, shifted to the wrong coordinates, or backed by the wrong whitelist. seqcheck reads the FASTQs directly and reports those mismatches.
seqspec check validates the seqspec file itself.
seqcheck validates the reads that the seqspec claims to describe.
The current Rust implementation is exact-only:
- no fuzzy barcode matching
- no approximate fixed-sequence matching
- no FastQC-style quality modules
Each run works on one modality and one or more explicit FASTQ paths.
For local development:
cargo testFor a standalone install:
cargo install --path .All read-check commands use the same core interface:
seqcheck <command> -s SPEC -m MODALITY [-n N] [--format text|json] FASTQ...Arguments:
-s, --spec: path to theseqspecYAML file-m, --modality: modality to inspect-n, --n-reads: number of reads to inspect per FASTQ.0means all reads--format text|json: human-readable text or structured JSON output--auth-profile: optional auth profile for remote resources declared in the seqspecFASTQ...: one or more FASTQ files to inspect
FASTQ paths are resolved against the spec in this order:
file_idfilenameurlbasenameread_id
Run the core checks in one report:
seqcheck check \
--format json \
-o report.json \
-s tests/fixtures/synthetic/spec.yaml \
-m rna \
-n 0 \
tests/fixtures/synthetic/fastqs/synthetic_R1.fastqRender that JSON as a self-contained HTML report:
seqcheck report -i report.json -o report.htmlseqcheck has these subcommands:
check: run the core checks in one reportlength: compare observed read lengths to the read length range in the speccoverage: report how often each expected region is fully covered by the readfixed: check exact matches for fixed-sequence regionsonlist: check exact onlist membership for onlist regionsprimer: classify the primer anchor and scan reads for exact primer or reverse-complement hitsrandom: compute whole-sequence entropy forsequence_type=randomregionscut: extract an exact region slice from each covered readhist: count exact region slices across readsreport: render a self-contained HTML QC report fromseqcheckJSONversion: print theseqcheckversion and the seqspec file version
seqcheck check runs:
input_checklengthcoverageprimerfixedonlistrandom
It does not run cut or hist. Those remain drill-down tools.
seqcheck report takes an existing JSON report and writes one offline HTML file:
seqcheck report -i report.json -o report.htmlArguments:
-i, --input: path to aseqcheckJSON report-o, --output: path to the output HTML report--spec: optional seqspec YAML path to use for the library diagram. By defaultseqcheck reportusesmeta.specfrom the JSON report
The HTML report is built from the same atomic result objects used in the JSON output. It includes:
- a header with report metadata
- a run section with the
seqcheck checkandseqcheck reportcommands - the seqspec library structure diagram when the spec is available
- severity counts
- a flat result list sorted by severity, with expandable expected and observed detail
seqcheck works on a small set of objects. Each check attaches to the narrowest object that makes sense. These scopes also appear in the JSON report through the files, reads, and regions id lists on each result.
| Object | Meaning in seqcheck |
Checks on that object |
|---|---|---|
| input set | The supplied FASTQs resolved against the seqspec for one modality | input_check |
| resolved read | One matched seqspec file/read pair and its read geometry | length, file-level coverage, primer |
| projected region | One seqspec region projected onto one resolved read | region-level coverage, cut, hist |
| fixed region | A projected region with sequence_type=fixed |
fixed |
| onlist region | A projected region with sequence_type=onlist |
onlist |
| random region | A projected region with sequence_type=random |
random |
| primer anchor | The region referenced by primer_id for a resolved read |
primer |
| shared region across reads | The same logical region, or the same curated biological region type, visible in more than one read | cross-file coverage overlap results |
The table below summarizes the read-based checks.
| Command | Inputs | Check | Reports |
|---|---|---|---|
length |
spec, modality, one or more FASTQs, optional n_reads |
Compare each observed read length to the matched read's min_len and max_len in the seqspec. |
Sampled count, expected min/max length, observed min/max length, out-of-range count, out-of-range fraction |
coverage |
spec, modality, one or more FASTQs, optional n_reads |
Project expected regions onto each read and ask whether the observed read is long enough to cover each region. Also emit overlap interpretations when the same logical region appears in multiple reads. | Expected region coordinates, per-region covered count/fraction, full-read coverage count/fraction, overlap assessments |
fixed |
spec, modality, one or more FASTQs, optional n_reads |
Extract each sequence_type=fixed slice and compare it exactly to the expected sequence. Matching is strand-aware by default. The command also scans the full read for shifted exact hits of the same motif. |
Sampled count, covered count/fraction, short-read count, exact-match count/fraction, orientation counts, offset histogram, absent count, multi-hit count, top non-matching sequences |
onlist |
spec, modality, one or more FASTQs, optional n_reads |
Extract each sequence_type=onlist slice and test exact membership in the loaded whitelist. |
Sampled count, covered count/fraction, short-read count, exact-onlist count/fraction, offlist count, top offlist sequences, onlist source, onlist size |
primer |
spec, modality, one or more FASTQs, optional n_reads |
Classify the primer_id anchor for each read, then scan the full read for exact primer-sequence and reverse-complement hits. |
Primer classification, primer sequence/length, start-hit and internal-hit counts/fractions for forward and reverse-complement matches, absent count/fraction, top hit positions |
random |
spec, modality, one or more FASTQs, optional n_reads |
Extract each sequence_type=random slice, group exact observed sequences, and compute Shannon entropy over that sequence distribution. |
Sampled count, covered count/fraction, short-read count, unique-sequence count, entropy in bits, maximum possible entropy, entropy fraction, top observed sequences |
cut |
spec, modality, one or more FASTQs, required region_id, optional n_reads |
Extract the exact sequence slice for one named region from each covered read. | Sampled count, covered count, short-read count, extracted sequence per covered read |
hist |
spec, modality, one or more FASTQs, required region_id, optional n_reads |
Extract the exact sequence slice for one named region and count exact sequence occurrences across reads. | Sampled count, covered count, short-read count, exact histogram of observed region sequences |
version is a utility command, not a read check. It reports the seqcheck version, the seqspec file version, and the assay id.
Every read-check command emits the same top-level JSON object:
report_schema_versionmetaresults
meta contains:
commandspecmodalityrequested_reads
results is a flat list of atomic result objects. A single command can emit many results. Each result has:
checkfilesreadsregionsexpectedobservedassessment
files, reads, and regions are seqspec ids only. They point back to entities already defined in the spec.
expected and observed use the same metric-item shape:
idnamedescriptiondata
data is shallow and typed:
kind:scalar,series, orrecordsvalueunitwhen it matters
assessment carries the interpretation:
type:pass,warning,error, orinterpretationcodedescriptionexpected_idsobserved_ids
This makes the JSON the source of truth. Stdout is rendered from the same result objects, so the human-readable report and the machine-readable report describe the same checks.
Each successful report emits an input_check result first. It records:
- the expected FASTQ inventory from the seqspec for that modality
- the supplied inputs
- the resolved matches
- any expected files that were not supplied
Missing expected files are warnings, not hard failures, so subset runs remain valid.
coverage emits three kinds of results:
- one
input_checkresult per invocation - one file-level
coverageresult per matched file/read - one region-level
coverageresult per matched file/read/region - cross-file
coverageresults when the same logical region or curated biological region type appears in multiple reads
Abbreviated JSON example:
{
"report_schema_version": "0.1.0",
"meta": {
"command": "coverage",
"spec": "../seqspec/tests/fixtures/spec.yaml",
"modality": "rna",
"requested_reads": 100
},
"results": [
{
"check": "input_check",
"files": ["rna_R1_SRR18677638.fastq.gz", "rna_R2_SRR18677638.fastq.gz"],
"reads": ["rna_R1", "rna_R2"],
"regions": [],
"expected": [...],
"observed": [...],
"assessment": [...]
},
{
"check": "coverage",
"files": ["rna_R1_SRR18677638.fastq.gz"],
"reads": ["rna_R1"],
"regions": ["rna_umi"],
"expected": [...],
"observed": [...],
"assessment": [...]
}
]
}Abbreviated stdout for the same schema:
seqcheck coverage
spec: ../seqspec/tests/fixtures/spec.yaml
modality: rna
requested_reads: 100
check: input_check
files: rna_R1_SRR18677638.fastq.gz, rna_R2_SRR18677638.fastq.gz
reads: rna_R1, rna_R2
regions: (none)
assessment:
- [pass] all_expected_files_matched: All expected modality files were supplied and matched uniquely.
check: coverage
files: rna_R1_SRR18677638.fastq.gz
reads: rna_R1
regions: rna_umi
assessment:
- [pass] region_coverage_complete: All sampled reads cover region 'rna_umi'.
expected:
- expected_region:
- name=UMI region_id=rna_umi region_type=umi sequence_type=random start=16 stop=28
observed:
- sampled_count: 100 count
- covered_count: 100 count
- covered_fraction: 1 fraction
Some seqspec files point at remote whitelist or reference resources that require HTTP auth. seqcheck resolves remote auth by host and applies it through one shared fetch path.
Auth profiles live in auth.toml and point to env vars, not raw secrets on the command line. seqcheck looks for the config in this order:
SEQCHECK_AUTH_CONFIG$XDG_CONFIG_HOME/seqcheck/auth.toml$HOME/.config/seqcheck/auth.toml
Example config:
[profiles.igvf]
hosts = ["api.data.igvf.org", "data.igvf.org"]
kind = "basic"
username_env = "IGVF_ACCESS_KEY_ID"
password_env = "IGVF_ACCESS_KEY_SECRET"You can also initialize the same profile directly:
seqcheck auth init \
--profile igvf \
--host api.data.igvf.org \
--host data.igvf.org \
--kind basic \
--username-env IGVF_ACCESS_KEY_ID \
--password-env IGVF_ACCESS_KEY_SECRETExample usage:
export IGVF_ACCESS_KEY_ID=...
export IGVF_ACCESS_KEY_SECRET=...
seqcheck onlist \
--auth-profile igvf \
-s spec.yaml \
-m crispr \
sample_R1.fastq.gzIf --auth-profile is omitted, seqcheck matches profiles by URL host. If a named profile is supplied, the host must still match that profile.
Hidden auth helpers for local debugging:
seqcheck auth path
seqcheck auth list
seqcheck auth resolve https://api.data.igvf.org/reference-files/...onlist reads remote whitelist files directly from their URL. It does not require a local staged copy. If a whitelist cannot be loaded, seqcheck onlist emits a structured error result instead of failing the whole report.
Check read lengths with the DOGMA fixture in the sibling seqspec repo:
seqcheck length \
--format json \
-s ../seqspec/tests/fixtures/spec.yaml \
-m rna \
-n 100 \
../seqspec/tests/fixtures/fastqs/rna_R1_SRR18677638.fastq.gz \
../seqspec/tests/fixtures/fastqs/rna_R2_SRR18677638.fastq.gzRun the core checks in one pass:
seqcheck check \
--format json \
-s tests/fixtures/synthetic/spec.yaml \
-m rna \
-n 0 \
tests/fixtures/synthetic/fastqs/synthetic_R1.fastqCheck exact onlist membership for the local index read in the upgraded 10xv3 example:
seqcheck onlist \
--format json \
-s examples/10xv3/spec.yaml \
-m rna \
-n 100 \
examples/10xv3/fastqs/I1.fastq.gzInspect a synthetic barcode slice and its histogram:
seqcheck cut \
-s tests/fixtures/synthetic/spec.yaml \
-m rna \
-r barcode \
-n 0 \
tests/fixtures/synthetic/fastqs/synthetic_R1.fastq
seqcheck hist \
-s tests/fixtures/synthetic/spec.yaml \
-m rna \
-r barcode \
-n 0 \
tests/fixtures/synthetic/fastqs/synthetic_R1.fastq
seqcheck random \
--format json \
-s tests/fixtures/synthetic/spec.yaml \
-m rna \
-n 0 \
tests/fixtures/synthetic/fastqs/synthetic_R1.fastqexamples/10xv3/spec.yamlis a current-format local example used to test index-read behaviortests/fixtures/synthetic/contains a small local assay used for golden CLI teststests/fixtures/primer_cases/covers fixed, ghost, and non-scannable primer definitionstests/fixtures/neg_fixed/covers reverse-complement fixed matching on a negative-strand readexamples/dogmaseq-lll/spec.yamlis legacy prototype material and is not the source-of-truth fixture for the current Rust implementation
The presentation in docs/Example of seqspec Failing in the CRISPR-pipeline.pptx shows the motivating failure mode for this tool: a spec can be internally valid and still miss the effective read geometry that the data reveals.
The coverage command surfaces one common case. A logical region can appear in multiple reads. That can reflect intended paired-end overlap, or it can reflect a read-geometry mismatch when the observed reads extend farther than the spec implies.
The primer command surfaces another case. Sequencing usually starts at the primer, so the primer sequence itself should usually not appear inside the read. If primer hits are common, or if the primer definition is not a concrete fixed sequence, that is strong evidence that the seqspec is anchoring the read incorrectly.
Codex was used to help build out parts of the Rust seqspec library and the seqcheck CLI, including implementation, testing, and documentation.