Convert a fasta/fastq file into embeddings via Evo 2.
The pipeline accepts a single fasta or fastq file, which can also be gzipped. Passing multiple files is not supported for now.
# With Nextflow
nextflow run artorias111/fasta2embeddings --sequence /path/to/sequence.fasta -c /custom/config
# Run with an alias on the Illinois Campus Cluster
embeddings --sequence /path/to/sequence.fa
nextflow.config contains sensible defaults, but you need to provide a custom config for your specific HPC or local computer. See taiga_evo7b.config for an example. Once you create a custom config file, pass it with -c in your nextflow run command; Nextflow then combines it with the default nextflow.config.
The pipeline expects the following in your environment (can be conda or venv):
- Evo 2 (CUDA 12.8) : https://github.com/ArcInstitute/evo2
- EasyEvo2 : https://github.com/ylab-hi/EasyEvo2
Embeddings are in the safetensors format (https://github.com/huggingface/safetensors).
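To check what a run produced without loading any tensors, you can read just the header of a safetensors file. The format starts with an 8-byte little-endian header length, followed by a JSON header that lists each tensor's dtype and shape. The sketch below uses only the Python standard library; the function names are hypothetical, and the tensor names you see depend on how the pipeline and EasyEvo2 label their outputs, which is not assumed here.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a safetensors file as a dict."""
    with open(path, "rb") as fh:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", fh.read(8))
        # The next header_len bytes are a UTF-8 JSON document.
        return json.loads(fh.read(header_len).decode("utf-8"))


def tensor_shapes(path):
    """Map each tensor name to (dtype, shape), skipping optional metadata."""
    header = read_safetensors_header(path)
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"  # free-form metadata entry, not a tensor
    }
```

Calling `tensor_shapes("/path/to/output.safetensors")` (a placeholder path) returns a dictionary you can scan for unexpected shapes before starting downstream analysis. To load the actual values, use the safetensors library linked above.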
A word of caution: each embedding has the same length as its input sequence (the exact dimensions depend on the Evo 2 model you're using; see EasyEvo2's documentation). If your sequences have different lengths, your embeddings will too, which is not ideal for most downstream analyses unless you filter first. On the flip side, a case that works well without filtering is generating embeddings for k-mers, since every k-mer has the same length.
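For the k-mer case, one way to prepare input is to cut a sequence into fixed-length windows and write them to a single fasta file, so every record has the same length. This is a minimal sketch and not part of the pipeline; the function names, the `step` parameter, and the `kmer_<i>` record naming are all assumptions made for illustration.

```python
def kmers(sequence, k, step=1):
    """Yield every full-length k-mer of `sequence`, moving `step` bases at a time."""
    if k <= 0 or step <= 0:
        raise ValueError("k and step must be positive")
    # Stop early enough that the final window is still k bases long.
    for start in range(0, len(sequence) - k + 1, step):
        yield sequence[start:start + k]


def write_kmer_fasta(sequence, k, out_path, step=1, prefix="kmer"):
    """Write the k-mers to one fasta file and return how many were written."""
    count = 0
    with open(out_path, "w") as out:
        for i, kmer in enumerate(kmers(sequence, k, step)):
            # Hypothetical record naming: kmer_0, kmer_1, ...
            out.write(f">{prefix}_{i}\n{kmer}\n")
            count += 1
    return count
```

Because the pipeline takes one file at a time, writing all k-mers into a single fasta keeps the run to one invocation. Sequences shorter than `k` yield no records.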
A run produces two directories: work and *.safetensors. The *.safetensors directory contains a cleaned-up collection of the output files. work contains all the intermediate files and also a copy of the final output. You can safely remove work once you have all your safetensors files.
See https://www.nextflow.io/docs/latest/workflow.html#outputs for more information.