Phage-Host Interaction Large Language Model Embedding Extraction

This repo contains code to feed genome sequences into genomic language models (gLMs) and extract their embeddings as numpy arrays. Input genome sequences are automatically split into lengths that fit each model's context window. This repo is a work in progress and is part of a larger project at the Arkin Lab, Lawrence Berkeley National Laboratory, on using deep learning and large language models to predict phage-host interactions.
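
The splitting step can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the repo's implementation: split_into_windows and the max_len value are hypothetical, and the limit is counted in characters here, whereas a real model's context window is measured in tokens produced by its tokenizer.

```python
def split_into_windows(sequence: str, max_len: int) -> list[str]:
    """Split a sequence into consecutive, non-overlapping chunks of at most max_len characters.

    Hypothetical helper for illustration; not part of the phllm API.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


genome = "ATGCGTACGTTAGC"  # toy 14-base sequence
chunks = split_into_windows(genome, max_len=5)
print(chunks)  # ['ATGCG', 'TACGT', 'TAGC']
```

Each chunk could then be embedded separately; how the per-chunk embeddings are combined afterwards is a separate design choice not covered by this sketch.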

Setup (Cloning this repo)

Run git clone --recurse-submodules https://github.com/J-Ngaiii/phllm.git
Then, to install the local phllm package, move to the root of the repo and run pip install .

Setup (Environment - No Evo2)

Create a conda environment with Python 3.11
Run pip install -r requirements.txt from the root of the repository
If you hit errors while installing the requirements, try:
- installing core packages first: conda install numpy pandas scikit-learn pyarrow then running pip install -r requirements.txt
- installing pyarrow in particular via conda, which may help on a local machine because Apple Silicon (M1/M2/M3) Macs run into issues building pyarrow via pip
- ensuring that numpy is older than 2.0 (i.e. numpy<2.0); this may conflict with spacy, which uses thinc and blis, modules that require numpy>=2.0
- numba needs numpy<2.3
- megaDNA needs numpy>2.0
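
To see which of the constraints listed above a given numpy version satisfies, a small check like the following may help. It is a sketch only: parse_version and numpy_constraint_report are hypothetical helpers, the constraints are copied as written in this README rather than verified against each package, and pre-release suffixes such as rc1 are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, int, int]:
    """Parse 'major.minor[.patch]' into a 3-tuple; a missing patch counts as 0."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def numpy_constraint_report(numpy_version: str) -> dict[str, bool]:
    """Report which constraints from this README the given numpy version meets."""
    v = parse_version(numpy_version)
    return {
        "numpy<2.0": v < (2, 0, 0),
        "numpy>=2.0 (thinc/blis via spacy)": v >= (2, 0, 0),
        "numpy<2.3 (numba)": v < (2, 3, 0),
        "numpy>2.0 (megaDNA)": v > (2, 0, 0),
    }


try:
    installed = version("numpy")
except PackageNotFoundError:
    print("numpy is not installed in this environment")
else:
    for constraint, ok in numpy_constraint_report(installed).items():
        print(f"{constraint}: {'ok' if ok else 'NOT satisfied'}")
```

As written, the list has no single numpy version that satisfies every line at once, so which constraints matter depends on the models you plan to run.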

Setup (Environment - Yes Evo2)

This section assumes you have already followed the instructions above and created an environment that can run this repo without Evo2. The steps below extend that environment.

Evo2 Requirements

Prerequisites

Transformer Engine >= 2.0.0
Flash Attention for optimized attention operations (strongly recommended)

System requirements

[OS] Linux (official) or WSL2 (limited support)
[GPU] Requires Compute Capability 8.9+ (Ada/Hopper/Blackwell) due to FP8 being required
[Software]
- CUDA: 12.1+ (12.8+ for Blackwell) with compatible NVIDIA drivers
- cuDNN: 9.3+
- Compiler: GCC 9+ or Clang 10+ with C++17 support
- Python 3.12 required
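
A quick way to check the GPU and Python requirements above is sketched below. The live GPU check assumes PyTorch is installed (torch.cuda.is_available and torch.cuda.get_device_capability are existing PyTorch calls); meets_fp8_requirement, python_matches, MIN_CAPABILITY, and REQUIRED_PYTHON are hypothetical names introduced here for illustration.

```python
import sys

MIN_CAPABILITY = (8, 9)    # compute capability floor from the requirements above
REQUIRED_PYTHON = (3, 12)  # Python version stated in the requirements above


def meets_fp8_requirement(capability: tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is at least 8.9."""
    return tuple(capability) >= MIN_CAPABILITY


def python_matches(version_info=sys.version_info) -> bool:
    """Return True if the given version info is Python 3.12.x."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


try:
    import torch  # optional: only needed to query a real device
except ImportError:
    torch = None

print("Python 3.12:", python_matches())
if torch is not None and torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)
    print(f"GPU 0 compute capability {cap[0]}.{cap[1]}:", meets_fp8_requirement(cap))
else:
    print("No CUDA device visible; skipping the GPU check.")
```

On a cluster, run this on a GPU node rather than a login node, since login nodes often have no visible CUDA device.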

Check the respective GitHub repositories for more details about Transformer Engine and Flash Attention and how to install them. We recommend using conda to install Transformer Engine. Here is an example of how to install the prerequisites:

Step 1:

pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu126

Step 2:

conda install -c nvidia cuda-nvcc cuda-cudart-dev cuda-nvrtc-dev
# pip install flash-attn==2.8.0.post2 --no-build-isolation (may not run on all supercomputing clusters, such as Lawrencium)

While on a GPU node:

pip3 install --no-build-isolation transformer_engine[pytorch]
pip install evo2
pip install .

How test mode works

Constrains rt_dicts to only return 3 strains/phages
Constrains extract_embeddings to only look at the first 3 divisions for all strains/phages in a batch (which will just be 3 strains/phages if rt_dicts is put into test mode)
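
The two limits above amount to truncating at two levels: which genomes are loaded, and how many divisions of each genome are embedded. The sketch below shows that idea only. limit_genomes, limit_divisions, and TEST_LIMIT are hypothetical names, and the real rt_dicts and extract_embeddings functions may be structured differently.

```python
from itertools import islice

TEST_LIMIT = 3  # number of genomes and divisions kept in test mode


def limit_genomes(genomes: dict, test_mode: bool) -> dict:
    """Keep only the first TEST_LIMIT entries (in insertion order) in test mode."""
    if not test_mode:
        return dict(genomes)
    return dict(islice(genomes.items(), TEST_LIMIT))


def limit_divisions(divisions: list, test_mode: bool) -> list:
    """Keep only the first TEST_LIMIT divisions of one genome in test mode."""
    return list(divisions[:TEST_LIMIT]) if test_mode else list(divisions)


# Toy data: five genomes, each already split into five divisions.
genomes = {f"phage_{i}": [f"chunk_{j}" for j in range(5)] for i in range(5)}

subset = limit_genomes(genomes, test_mode=True)
trimmed = {name: limit_divisions(divs, test_mode=True) for name, divs in subset.items()}
print(list(trimmed))            # ['phage_0', 'phage_1', 'phage_2']
print(len(trimmed["phage_0"]))  # 3
```

With both limits applied, a test run touches at most 3 x 3 = 9 divisions per batch, which keeps a smoke test short even on large inputs.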

Build History

phllm-0.1.0: first working version, with an initial workflow for feeding .fna files into ProkBERT and extracting the embeddings
phllm-0.1.2: working test mode for ProkBERT and initial architecture for Evo2
phllm-0.1.4: moving to Python 3.11.8 and megaDNA
