cami is a command-line companion for working with CAMI taxonomic profiling tables. It helps you inspect samples, clean and reformat abundances, and prepare subsets for downstream analysis without leaving the terminal.
- Summarize CAMI files to see which samples, ranks, and taxa are present.
- Preview the top entries of each sample before loading the file into another tool.
- Filter taxa with expressive boolean predicates that reference rank (`r`), sample (`s`), abundance (`a`), taxonomy (`t`/`tax`), and cumulative sums (`c`).
- Fill in missing higher ranks by pulling lineage information from the NCBI taxdump and round abundances to five decimal places.
- Renormalize abundances so that every rank in every sample sums to 100.
- Reorder taxa within each rank, either by abundance (dropping zeroes) or by lineage, to make tables easier to scan.
- Benchmark predicted profiles against ground-truth tables with precision/recall, abundance error, correlation, diversity, UniFrac, and abundance-rank metrics (ARE and mARE).
The repository includes a small demo table at examples/test.cami that you can use with the examples below.
- Install Rust if it is not already available.
- Clone this repository and build the binary:
git clone https://github.com/dawnmy/cami.git
cd cami
cargo install --path .
- Run `cami --help` to confirm the command is available.
You can also invoke subcommands directly with cargo run -- <command> while developing.
| Command | Description |
|---|---|
| `cami list` | Summarize ranks, taxon counts, and total abundance per sample. |
| `cami preview` | Display the first rows of every sample to spot-check formatting. |
| `cami filter` | Apply boolean filters, fill missing ranks, and renormalize abundances. |
| `cami convert` | Turn TSV-style taxon abundances into a single-sample CAMI profile. |
| `cami fillup` | Populate missing higher ranks using the NCBI taxdump. |
| `cami renorm` | Rescale abundances so every rank sums to 100 per sample. |
| `cami sort` | Reorder taxa within ranks by abundance or lineage strings. |
| `cami benchmark` | Compare predicted profiles against ground truth with rich metrics. |
Print a per-sample summary showing how many taxa and how much total abundance are assigned to each declared rank.
$ cami list examples/test.cami
Sample: s1
Ranks: superkingdom, phylum, class, order, family, genus, species, strain
Total taxa: 18
superkingdom: taxa=1 total=100.000
phylum: taxa=2 total=100.000
class: taxa=3 total=100.000
order: taxa=3 total=100.000
family: taxa=3 total=100.000
genus: taxa=3 total=100.000
species: taxa=3 total=100.000
strain: taxa=0 total=0.000
...
Use this command to ensure the file has the ranks and coverage you expect before diving into more complex operations.
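To make the summary concrete, here is a minimal Python sketch of the kind of tally described above. It is not the tool's implementation: the function name `summarize` is hypothetical, the rows are assumed to be tab-separated in the column order shown by the `@@TAXID` header, every data row is counted regardless of its abundance, and the second phylum row in the demo data is made up for illustration.

```python
from collections import defaultdict

def summarize(lines):
    """Return {sample_id: {rank: (taxa_count, total_abundance)}}."""
    summary = {}
    sample = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("@SampleID:"):
            # A new sample block starts here.
            sample = line.split(":", 1)[1].strip()
            summary[sample] = defaultdict(lambda: [0, 0.0])
        elif line.startswith("@"):
            continue  # other headers, including the @@TAXID column line
        else:
            # Assumed column order: TAXID, RANK, TAXPATH, TAXPATHSN, PERCENTAGE
            fields = line.split("\t")
            rank, percentage = fields[1], float(fields[4])
            summary[sample][rank][0] += 1
            summary[sample][rank][1] += percentage
    return {s: {r: tuple(v) for r, v in ranks.items()} for s, ranks in summary.items()}

demo = [
    "@SampleID:s1",
    "@Version:0.10.0",
    "@@TAXID\tRANK\tTAXPATH\tTAXPATHSN\tPERCENTAGE",
    "2\tsuperkingdom\t2\tBacteria\t100",
    "201174\tphylum\t2|201174\tBacteria|Actinobacteria\t65.67585",
    "1224\tphylum\t2|1224\tBacteria|Proteobacteria\t34.32415",  # hypothetical row
]
result = summarize(demo)
for rank, (taxa, total) in result["s1"].items():
    print(f"{rank}: taxa={taxa} total={total:.3f}")
```

The sketch only mirrors the shape of the report; the real command may treat zero-abundance rows or undeclared ranks differently.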
Show the first N entries per sample (default 5). This is handy for spot-checking formatting and verifying that the taxonomy paths look correct.
$ cami preview -n 2 examples/test.cami
@SampleID:s1
@Version:0.10.0
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
2 superkingdom 2 Bacteria 100
201174 phylum 2|201174 Bacteria|Actinobacteria 65.67585
...
Filter taxa with boolean expressions while optionally filling missing ranks and renormalizing abundances. Results are emitted as a valid CAMI table, so you can chain additional commands or redirect to a file. Wrap the whole expression in single quotes rather than double quotes; a sample ID or pattern can then be enclosed in double quotes inside the single-quoted expression. Single quotes are required when the expression uses `!c` (in shells such as bash, `!` inside double quotes can trigger history expansion).
Common workflow:
cami filter --fill-up --renorm 's==s1 & r==species & a>=5' examples/test.cami > enriched.cami

This keeps species-level entries from sample s1 that are at least 5% abundant, fills in any missing higher ranks using the NCBI taxonomy, renormalizes each rank to 100%, and writes the output to enriched.cami.
Write expressions with & (and), | (or), and parentheses. Each atom targets one aspect of the data:
| Atom | Purpose | Operators | Notes |
|---|---|---|---|
| `r` or `rank` | Match entry ranks | `==`, `!=`, `<=`, `<`, `>=`, `>` | Uses the order declared by `@Ranks`. `r<=class` keeps class and more specific ranks, while `r>class` keeps more general ranks. Comma-separated lists are allowed with `==`/`!=`. |
| `s` or `sample` | Select samples | `==`, `!=`, `~` | `==` accepts sample IDs, 1-based indices, comma-separated lists, and inclusive ranges (`s==1:3`). `.` or `:` matches all samples. Use `s~"regex"` to match IDs with a regular expression. |
| `a` or `abundance` | Compare abundances | `==`, `!=`, `>=`, `>`, `<=`, `<` | Values are interpreted as percentages (0–100). |
| `t` or `tax` | Test lineage membership | `==`, `!=`, `<=`, `<` | Compares against TAXID values. With `--fill-up` or when taxonomy data is available, ancestors are resolved through the NCBI taxdump; otherwise the command inspects TAXPATH. Prefix with `!` to negate the result. |
| `c` or `cumsum` | Filter by cumulative abundance | `<=` | Keeps the least-abundant taxa within each rank whose cumulative sum is at most the threshold (again using percentage units). Prefix with `!` to discard those instead. |
Examples:
- `r==species & a>=1` keeps species entries that are at least 1% abundant.
- `s==1,3-5 | s~"^gut"` keeps explicit samples plus any whose IDs start with `gut`.
- `t<=562` keeps entries that fall under Escherichia coli (taxid 562) or match the taxid exactly.
- `!c<=2` removes the lowest-abundance taxa per rank whose cumulative total is at most 2%.
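The cumulative-sum atom is the least obvious of the filters, so here is a small Python sketch of one way to read it. It is an illustration under stated assumptions, not the tool's code: the function names are hypothetical, the abundances are made-up values for a single sample and rank, and the handling of ties or zero abundances is not specified by this document.

```python
def low_abundance_tail(abundances, threshold):
    """Indices of the least-abundant entries whose running total stays <= threshold."""
    # Walk entries from smallest to largest abundance.
    order = sorted(range(len(abundances)), key=lambda i: abundances[i])
    tail, running = set(), 0.0
    for i in order:
        running += abundances[i]
        if running > threshold:
            break  # adding this entry would exceed the threshold
        tail.add(i)
    return tail

def apply_cumsum(abundances, threshold, negate=False):
    """c<=threshold keeps the tail; !c<=threshold (negate=True) drops it."""
    tail = low_abundance_tail(abundances, threshold)
    return [a for i, a in enumerate(abundances) if (i in tail) != negate]

# Hypothetical abundances (percent) for one sample and rank.
species = [60.0, 30.0, 8.5, 1.0, 0.5]
kept_tail = apply_cumsum(species, 2)                  # analogous to 'c<=2'
without_tail = apply_cumsum(species, 2, negate=True)  # analogous to '!c<=2'
print(kept_tail, without_tail)
```

With these values the two smallest entries (0.5 and 1.0) sum to 1.5, which is within the 2% threshold, so they form the tail; adding 8.5 would exceed it.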
When `--fill-up` is supplied, the command downloads the NCBI taxdump (stored under `~/.cami`) if necessary. Use `--from <rank>` to specify which rank to aggregate from when filling and `--to <rank>` to control how far up the lineage to build. Combine it with `--renorm` to ensure each rank sums to 100 after filtering and filling.
Turn a simple TSV of taxonomic abundances into a fully fledged CAMI profile. The command reads the taxid and abundance columns you specify, looks up missing lineage information from the local NCBI taxdump, and emits a single-sample CAMI table with rounded percentages and populated ranks. It is especially useful when migrating results from tools that output tab-separated taxid and abundance pairs.
If the first line of the TSV contains headers, the command automatically skips it as long as the taxid and abundance fields cannot be parsed as numbers. Abundance values are written to the CAMI file without rescaling, so ensure your TSV reports percentages (multiply fractions by 100 before converting).
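The header rule and the percentage requirement can be sketched in a few lines of Python. This is a hedged illustration rather than the command's actual parser: `parse_rows` is a hypothetical helper, the sketch treats the first line as a header when either field fails to parse (the document does not say whether one or both must fail), and the input values are invented.

```python
def parse_rows(lines, taxid_col=1, abundance_col=2):
    """Parse (taxid, abundance) pairs from TSV lines using 1-based column indices."""
    rows = []
    for n, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        taxid, abundance = fields[taxid_col - 1], fields[abundance_col - 1]
        try:
            rows.append((int(taxid), float(abundance)))
        except ValueError:
            if n == 0:
                continue  # assumed rule: a non-numeric first line is a header
            raise  # a non-numeric line elsewhere is a real error
    return rows

# Hypothetical profiler output that reports fractions instead of percentages.
tsv = ["taxid\tabundance", "562\t0.25", "1280\t0.75"]
rows = parse_rows(tsv)
# Convert fractions to percentages before handing the table to `cami convert`.
percent_rows = [(taxid, fraction * 100) for taxid, fraction in rows]
print(percent_rows)
```

In practice the same conversion can be done with any tool that rewrites the abundance column; the point is only that the multiplication has to happen before conversion.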
Key options:
- `-i, --taxid-column <INDEX>` – 1-based column holding NCBI taxids. Defaults to `1`. Use this when the taxid column is not the first field (e.g., `-i 3` to read from the third column).
- `-a, --abundance-column <INDEX>` – 1-based column holding abundances. Defaults to `2`. Adjust if abundances appear in another column.
- `-s, --sample-id <ID>` – Sample identifier written to the `@SampleID` header. Defaults to `sample`; supply a more descriptive label for clarity.
- `-T, --taxonomy-tag <TAG>` – Optional `@TaxonomyID` value to describe the NCBI taxonomy snapshot used (for example, `2025-06-19`).
- `--dmp-dir <DIR>` – Directory containing `nodes.dmp` and `names.dmp`. When omitted the command uses (and, if needed, downloads) the taxdump under `~/.cami`.
- `-o, --output <FILE>` – Write the CAMI profile to a file instead of stdout.
Examples:
# Convert a TSV whose taxids live in column 2 and abundances in column 5.
cami convert -i 2 -a 5 -s gut_sample results.tsv > gut_sample.cami
# Stream results from another program and tag the taxonomy snapshot.
other-profiler | cami convert --taxonomy-tag 2025-06-19 -s mock1 > mock1.cami
# Write the converted profile directly to a file and use a custom taxdump location.
cami convert --dmp-dir /data/taxdump -o sample.cami results.tsv

The generated CAMI table includes one sample populated with the modern or legacy CAMI rank set depending on the installed taxdump. Missing intermediate ranks are filled in automatically, and abundances are rounded to five decimal places to match the format expected by other cami subcommands.
Populate missing higher ranks for every sample using the NCBI taxdump. Abundances retain their full precision after the hierarchy is filled, and the command adapts to either the legacy or modern CAMI rank sets present in your taxonomy snapshot.
cami fillup --to family examples/test.cami > with_family.cami

If `--to` is omitted, the command fills to the highest rank declared in each sample. Use `--from <rank>` to choose the source rank used for aggregation (defaults to species when available). When an entry references a taxid that has been merged or deleted, `cami fillup` prints a warning to stderr but keeps processing the rest of the table.
Renormalize abundances so that the percentages at each rank sum to 100 for every sample. Entries with zero or negative abundances are ignored during scaling, and positive values keep their full double-precision values.
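As a rough illustration of the scaling rule, the following Python sketch renormalizes the abundances of a single sample and rank. It assumes that "ignored during scaling" means non-positive entries are excluded from the total and passed through unchanged; the function name `renormalize` and the input values are hypothetical.

```python
def renormalize(abundances):
    """Scale positive abundances of one sample/rank group so they sum to 100."""
    total = sum(a for a in abundances if a > 0)
    if total == 0:
        return list(abundances)  # nothing positive to scale
    # Assumption: zero or negative entries are left as they are.
    return [a * 100.0 / total if a > 0 else a for a in abundances]

# Hypothetical genus-level abundances that only sum to 40.
scaled = renormalize([30.0, 10.0, 0.0])
print(scaled)  # [75.0, 25.0, 0.0]
```

Applying the same function to every (sample, rank) group would reproduce the per-rank behaviour described above, with no rounding applied to the scaled values.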
cami renorm examples/test.cami > renormalized.cami

Reorder taxa within each rank for every sample.

- `-a/--abundance` sorts taxa by descending abundance and removes entries whose abundance is exactly zero.
- `-t/--taxpath [taxpath|taxpathsn]` sorts by the lineage strings so related taxa stay together. The default field is `taxpath` when `-t` is passed without a value.
cami sort --abundance examples/test.cami > sorted.cami
cami sort --taxpath examples/test.cami > lineage_sorted.cami

Evaluate one or more predicted profiles against a ground-truth CAMI table. For every sample and rank the command computes detection metrics (TP/FP/FN, precision/purity, recall/completeness, F1, Jaccard), abundance distances (L1 error, Bray–Curtis), diversity summaries (Shannon index and equitability), Pearson/Spearman correlations, weighted/unweighted UniFrac differences, the Abundance Rank Error (ARE), and the mass-weighted Abundance Rank Error (mARE). Results are written to TSV files so you can load them into spreadsheets or plotting notebooks.
cami benchmark -g truth.cami predictions/profiler1.cami predictions/profiler2.cami \
-l "profiler1,profiler2" --af 'a>=0.01' -n --by-domain \
-o benchmark-results -r "phylum,class,order,family,genus,species"

- `-g, --ground-truth` selects the reference CAMI table.
- Positional arguments list predicted profiles to score; provide as many as you like.
- `-l, --labels` (optional) supplies comma-separated names used in the output. When omitted the command derives labels from the file names.
- `--af` applies the same expression filter to both the ground truth and predicted profiles before scoring (e.g., `--af 'a>=0.01'`).
- `--gf` filters the ground-truth profile before scoring using the same expression language as `cami filter`.
- `--pf` applies an expression filter to every predicted profile before metrics are computed.
- `-n, --normalize` rescales each sample/rank in every profile so positive abundances sum to 100 prior to computing metrics.
- `--update-taxonomy` resolves every taxid through the NCBI merged and deleted node tables so profiles recorded against different taxonomy snapshots still align before scoring.
- `--by-domain` produces additional TSV files restricted to Bacteria, Archaea, Eukarya, and Viruses alongside the overall report.
- `-o, --output` points to the directory where reports such as `benchmark.tsv` and `benchmark_bacteria.tsv` are written.
- `-r, --ranks` restricts the evaluation to specific ranks; mix short forms (`p`, `c`, `g`) and full names (`phylum`, `class`, `genus`).
Each TSV contains one row per profile/sample/rank combination:
profile sample rank tp fp fn precision recall f1 jaccard l1_error bray_curtis shannon_pred shannon_truth evenness_pred evenness_truth pearson spearman weighted_unifrac unweighted_unifrac abundance_rank_error mass_weighted_abundance_rank_error
profiler1 s1 species 42 5 3 0.893617 0.933333 0.913043 0.777778 4.210000 0.021053 2.271111 2.318765 0.932842 0.950112 0.981000 0.975000 0.042000 0.018519 0.052632 0.041875
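For readers who want to sanity-check the detection columns, the sketch below recomputes precision, recall, and F1 from the `tp`, `fp`, and `fn` counts in the example row. It uses the standard textbook definitions, which is an assumption about how the tool derives these columns; `detection_metrics` is a hypothetical helper name.

```python
def detection_metrics(tp, fp, fn):
    """Standard precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts taken from the example row above (profiler1, s1, species).
precision, recall, f1 = detection_metrics(42, 5, 3)
print(f"{precision:.6f} {recall:.6f} {f1:.6f}")
# prints: 0.893617 0.933333 0.913043
```

These three values agree with the `precision`, `recall`, and `f1` columns of the example row.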
The cami benchmark command reports weighted and unweighted UniFrac scores that are always between 0 and 1. Internally the tool builds a taxonomic tree from the lineages present in the ground-truth and predicted profiles, normalizes the mass present at each lineage tip, and then computes branch-wise discrepancies between the two distributions. To avoid ambiguous superkingdom/domain assignments, the UniFrac implementation only considers the canonical ranks phylum, class, order, family, genus, species, and strain when building the comparison tree.
For a given evaluation rank the weighted variant sums the absolute differences in relative mass along every branch down to that rank and divides by the maximum possible distance (placing all mass on mismatching leaves whose lowest common ancestor is the root for the depth being evaluated). The unweighted variant measures how much branch length is unique to either profile by counting the number of phylum-to-rank edges that appear exclusively in the ground truth or the prediction and dividing by the maximum number of such edges given the observed support in each profile. Because every branch is treated as having length one, the reported values can be interpreted as the proportion of disagreement in the shared taxonomy. Missing intermediate ranks do not penalize a tool as long as both profiles share the same descendants—the implementation trims and right-aligns the lineages to a common depth before constructing the tree so that absent ancestors do not inflate the distance.
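The description above can be turned into a compact Python sketch. It is one possible reading, not the tool's implementation, and rests on several assumptions: lineages are already trimmed to the same depth, every branch has length one, the weighted maximum is taken as 2 × depth (all mass on lineages that diverge at the first rank), and the unweighted score is normalized by the number of edges in the truth plus the number in the prediction. All function names and the placeholder taxa are hypothetical.

```python
from collections import defaultdict

def edge_mass(profile):
    """Map every lineage prefix (one unit-length tree edge) to its share of total mass."""
    total = sum(profile.values())
    mass = defaultdict(float)
    for lineage, abundance in profile.items():
        for depth in range(1, len(lineage) + 1):
            mass[lineage[:depth]] += abundance / total
    return mass

def weighted_unifrac(truth, pred):
    """Sum of per-edge mass differences, divided by the assumed maximum of 2 * depth."""
    p, q = edge_mass(truth), edge_mass(pred)
    depth = max(len(lineage) for lineage in list(truth) + list(pred))
    diff = sum(abs(p.get(e, 0.0) - q.get(e, 0.0)) for e in set(p) | set(q))
    return diff / (2 * depth)

def unweighted_unifrac(truth, pred):
    """Edges unique to one profile, divided by the assumed maximum (edges in both profiles)."""
    p, q = set(edge_mass(truth)), set(edge_mass(pred))
    return len(p ^ q) / (len(p) + len(q))

# Placeholder lineages (phylum, class) with made-up abundances.
truth = {("phylumA", "classA1"): 60.0, ("phylumB", "classB1"): 40.0}
pred = {("phylumA", "classA1"): 60.0, ("phylumB", "classB2"): 40.0}
w = weighted_unifrac(truth, pred)
u = unweighted_unifrac(truth, pred)
print(round(w, 6), round(u, 6))
```

In this toy case the profiles agree on both phyla and on one class, so only the two class-level edges under `phylumB` differ. Under the assumed normalizations, identical profiles score 0 and profiles that share no phylum score 1 for both variants.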
Expressions can be combined freely, allowing complex workflows:
- Focus on a cohort: `cami filter 's~"^trial_" & r<=genus' table.cami`
- Drop rare tails: `cami filter '!c<=2' table.cami`
- Isolate a lineage: `cami filter 't<=1224 & r>=phylum' table.cami`
- Chain post-processing: `cami filter --fill-up --renorm 'r==species & a>=1' table.cami | cami sort --abundance`
Remember that each command reads from stdin when no input path is supplied and writes to stdout by default, making it easy to compose multiple steps.
Commands that require lineage information (filter --fill-up, fillup, convert, benchmark with taxonomy updates, etc.) automatically download the NCBI taxdump the first time you run cami and cache the extracted files under ~/.cami. Later runs reuse that directory as the default taxdump source. To refresh the taxonomy, download the latest taxdump from NCBI and replace the files in ~/.cami (or remove the directory to trigger a fresh download). When cached taxonomy data indicates that a taxid has been merged or deleted, the affected commands emit warnings on stderr but continue producing complete output.