CULLALGO is a script designed to "cull" large datasets of protein sequences in .fasta format down to a manageable number of sequences with desired traits, writing the results to a .fasta file. The measurements are based on up-to-date literature.
The script uses the following parameters:
- Molecular Weight
- Average Surface Accessibility
- Isoelectric Point
- Cost
- Solubility
- Thermostability
- Alpha Helical Propensity
- Shannon's Entropy (Redundancy Measure)
- DNA Complexity
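As an illustration of how one of the parameters above could drive a cull, here is a minimal sketch that parses FASTA text, computes Shannon entropy per sequence, and keeps only sequences above a threshold. This is not CULLALGO's actual implementation: the function names (`parse_fasta`, `shannon_entropy`, `cull_by_entropy`), the example sequences, and the threshold of 2.0 bits are all assumptions made for the example.

```python
import math
from collections import Counter


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def shannon_entropy(sequence):
    """Return the Shannon entropy of a sequence in bits per residue."""
    if not sequence:
        return 0.0
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cull_by_entropy(records, min_entropy):
    """Keep only records whose entropy is at least min_entropy (hypothetical threshold)."""
    return [(h, s) for h, s in records if shannon_entropy(s) >= min_entropy]


# Hypothetical input: one highly redundant sequence and one mixed sequence.
example = """>low_complexity
AAAAAAAAAAAAAAAA
>mixed
MKTAYIAKQRQISFVKSHFSRQ
"""

kept = cull_by_entropy(parse_fasta(example), min_entropy=2.0)
for header, seq in kept:
    # Write the survivors back out in FASTA format.
    print(f">{header}\n{seq}")
```

With this input only the `mixed` record survives, because a sequence made of a single repeated residue has zero entropy. The other parameters in the list could presumably be applied in the same filter-by-threshold pattern.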
CULLALGO utilizes both NETSOLP and TemStaPro for determining solubility and thermostability thresholds, respectively. Both tools must be installed and configured to run the script.
- NETSOLP: GitHub - NetSolP
- TemStaPro: GitHub - TemStaPro
- Permissions: Running the script may require administrative privileges, especially for the parts that install system-wide packages or modify system paths.
- Compatibility: This script assumes a Unix-like operating system because of its dependency on bash commands. For Windows, adjustments might be necessary, particularly in how environments are activated and paths are handled.
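The compatibility and tooling notes above can be checked before starting the installation. The following is a minimal sketch, not part of CULLALGO itself: the `preflight` function and its default tool list (taken from the commands used in this README) are assumptions for the example.

```python
import platform
import shutil


def preflight(required_tools=("conda", "git", "wget", "tar", "pip")):
    """Report whether the host looks ready for the setup steps.

    The tool list is an assumption based on the commands in this README.
    """
    system = platform.system()
    return {
        "system": system,
        # Linux and macOS (Darwin) are treated as Unix-like here.
        "unix_like": system in ("Linux", "Darwin"),
        # shutil.which returns None when a command is not on PATH.
        "missing_tools": [t for t in required_tools if shutil.which(t) is None],
    }


if __name__ == "__main__":
    report = preflight()
    print(f"Operating system: {report['system']}")
    if not report["unix_like"]:
        print("Warning: a Unix-like system is assumed; adjustments may be needed.")
    for tool in report["missing_tools"]:
        print(f"Missing from PATH: {tool}")
```

An empty `missing_tools` list only means the commands were found on `PATH`; it does not verify their versions or that NetSolP and TemStaPro are configured.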
python3 setup.py

conda create -n CULLALGO python=3.11
conda activate CULLALGO
git clone https://github.com/gusgrazelis/CULLALGO.git
cd CULLALGO
pip install -r cull_requirements.txt

This should be installed within the cullalgo directory.

wget https://services.healthtech.dtu.dk/services/NetSolP-1.0/netsolp-1.0.ALL.tar.gz
tar -xzvf netsolp-1.0.ALL.tar.gz
pip install -r requirements.txt
python3 CULLALGO1.1.py --config config.yaml

This should be its own directory and can be where you like.

git clone https://github.com/ievapudz/TemStaPro.git
cd TemStaPro

Before starting up, Anaconda or Miniconda should be installed on the system. Follow the instructions given in Conda's documentation.
Setting up the environment can be done in one of the following ways.
In this repository two YML files can be found: one has the
prerequisites for an environment that uses only the CPU
(environment_CPU.yml), the other for an environment that uses
both CPU and GPU (environment_GPU.yml).
This approach was tested with Conda 4.10.3 and 4.12.0 versions.
Run the following command to create the environment from a YML file:
conda env create -f environment_CPU.yml
Activate the environment:
conda activate temstapro_env_CPU
To set up the environment to exploit GPU for the program, run the following commands:
conda create -n temstapro_env python=3.7
conda activate temstapro_env
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c conda-forge transformers
conda install -c conda-forge sentencepiece
conda install -c conda-forge matplotlib

To test whether the PyTorch package is installed with CUDA support, start the python3 command interpreter and run the following lines:

python3
import torch
torch.cuda.is_available()

If the output is 'True', then the installation was successful; otherwise, try setting the path to the installed packages:
export PATH=/usr/local/cuda-11.7/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.7/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
If CUDA for PyTorch is still not available, check out the forum.
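The interactive CUDA check described above can also be run as a small script, which is convenient when repeating it after changing `PATH` or `LD_LIBRARY_PATH`. This is a minimal sketch; the `cuda_status` function name and its return strings are assumptions for the example, while `torch.cuda.is_available()` is the same call used in the steps above.

```python
import importlib.util


def cuda_status():
    """Return a short status string describing PyTorch's CUDA support."""
    # Avoid an ImportError if PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    import torch

    if torch.cuda.is_available():
        return "cuda-available"
    return "cuda-unavailable"


print(cuda_status())
```

Run it inside the activated `temstapro_env` environment so that it checks the PyTorch installation TemStaPro will actually use.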
For the systems without GPU, run the following commands for CPU setup:
conda create -n temstapro_env python=3.7
conda activate temstapro_env
conda install -c conda-forge transformers
conda install pytorch -c pytorch
conda install -c conda-forge sentencepiece
conda install -c conda-forge matplotlib

./temstapro -f ./tests/data/long_sequence.fasta -d ./ProtTrans/ \
    -e tests/outputs/ --mean-output ./long_sequence_predictions.tsv