This code has been made available for testing and reproducibility purposes in conjunction with the manuscript "Missing microbial eukaryotes and misleading meta-omic conclusions". In the future, tax-aliquots will be incorporated into the EUKulele taxonomic annotation tool (https://github.com/AlexanderLabWHOI/EUKulele).
-i: input file from DeepClust (https://github.com/bbuchfink/diamond)-c: FASTA file that contains all sequences present in theDeepClustfile, both reference and query--output_file: prefix for output file names--output_dir: directory in which to save output-t: taxonomy table, in the style ofEUKulele--prot_map: protein name mapping, in the style ofEUKulele--kmer_len: amino acid k-mer length to use-s: sample delimiter in sequence file--level_of_interest: taxonomic level of interest (e.g. phylum, family; should match EUKulele--list_of_interest: underscore-separated list of taxonomic labels to pull from the DeepClust file
tax-aliquots can be invoked using the process_clusters.py file in src/tax-aliquots-scripts. tax-aliquots uses faSomeRecords (https://github.com/santiagosnchez/faSomeRecords) to read in records from fasta files
python process_clusters.py \
-i deepclust_contigname.mad.50.out \
-c combined_seqs.fasta \
--output_file="test" \
--output_dir="test_dir" \
-t tax-table.txt \
--prot_map=prot-map.json \
--kmer_len=3 -s test --level_of_interest family \
--list_of_interest Hemiaulaceae_Rhizosoleniaceae_Thalassiosiraceae_Skeletonemataceae
- Buchfink, Benjamin, et al. "Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust." bioRxiv (2023): 2023-01.
- Sanchez-Ramirez, Santiago, et al. "faSomeRecords." https://github.com/santiagosnchez/faSomeRecords