Authors: ROUAUD Lucas

Formation: Master 2 Bioinformatics at Université de Paris

This program was written during an internship at the Institut de Minéralogie, de Physique et de Cosmochimie, Sorbonne Université, UMR 7590, CNRS, Muséum national d'Histoire naturelle, in the "bioinformatique et biophysique" team.

This work was supported by the French National Research Agency (ANR-21-CE12-0021).
A script called `INSTALL.sh` is provided to facilitate the installation. To use it, run:

```bash
bash INSTALL.sh
conda activate fierleniuz
```

All the commands it uses are described in the next parts (Cloning the repository; Installing the conda environment; Data decompression)! This script is available on the release page: https://github.com/FilouPlains/FIERLENIUZ/releases/tag/v1.2.3

To clone the repository on your computer, use the next commands:
```bash
git clone git@github.com:FilouPlains/FIERLENIUZ.git
cd FIERLENIUZ/
```

This repository uses Python. Packages are installed with conda; you can refer to its website to install it: https://docs.conda.io/projects/conda/en/stable/user-guide/install/download.html
Once conda is installed (if it was not already the case), simply use the next commands (from the root project directory `./`):
```bash
conda env create -n fierleniuz -f env/fierleniuz.yml
conda activate fierleniuz
```

Some data files were too heavy to be stored as-is in the repository, so they were compressed. The next commands have to be used to decompress them (from the root project directory `./`):
```bash
tar -xf data/peitsch2vec/default_domain.tar.gz -C data/peitsch2vec/
tar -xf data/peitsch2vec/redundancy/30_percent_redundancy.tar.gz -C data/peitsch2vec/redundancy/
tar -xf data/peitsch2vec/redundancy/70_percent_redundancy.tar.gz -C data/peitsch2vec/redundancy/
tar -xf data/peitsch2vec/redundancy/90_percent_redundancy.tar.gz -C data/peitsch2vec/redundancy/
```

To have a description of the parameters and an example of command, use this next one:
```bash
python src/embeddings/peitsch2vec.py -h
```

This script is used to transform a corpus of hydrophobic clusters into vectors.
- A script used to transform a `.fasta` file into a `.out` file. To have a description of the parameters and an example of command, use this next one:

  ```bash
  python3 src/hca_extraction/hca_extraction.py -h
  ```

- A script used to compute the context diversity and output it in `.csv` format. To have a description of the parameters and an example of command, use this next one:

  ```bash
  python3 src/scope_tree/context_extraction.py -h
  ```

- A script used to compute a network linked to SCOPe, with context diversity coloration and context diversity distribution through Plotly. To have a description of the parameters and an example of command, use this next one:

  ```bash
  python3 src/scope_tree/scope_tree.py -h
  ```

cd-hit is a software used to treat sequence redundancy. It is available at this next webpage (GitHub): https://github.com/weizhongli/cdhit/releases. To use it, the following command is used:
```bash
cd-hit -i {scope_database}.fa -o cd-hit_{i}.fasta -c 1
```

With:

- `{scope_database}.fa`: the original SCOPe database, with the sequences in `.fa` format. You can download the dataset here: https://scop.berkeley.edu/astral/subsets/ver=2.08. In *Percentage identity-filtered Astral SCOPe genetic domain sequence subsets, based on PDB SEQRES records*, use `sequences` and the `less than 30`, `less than 70` and `less than 90` parameters.
- `cd-hit_{i}.fasta`: how to name the output file. For this repository, the outputs are named `cd-hit_30.fasta`, `cd-hit_70.fasta` and `cd-hit_90.fasta`.
For some scripts, the cluster PCIA - Plateforme de calcul intensif du Muséum national d'Histoire naturelle - has been used. The next command was used to launch the job (from this cluster path `STAGE_M2/`):

```bash
sbatch LAUNCH_SCRIPT.sh
```

- Outputs are written to the next directory: `/mnt/beegfs/abruley/CONTEXT/`.
- The script HAS TO BE MANUALLY EDITED if you want to change the input database.

The used script is available at `src/cluster/launch_script_90.sh`.
- `src/cluster/launch_script_90.sh`: script used on the cluster to compute the context diversity.
- `src/embeddings/arg_parser.py`: parses the given arguments for the main script.
- `src/embeddings/context_analyzer.py`: computes ordered and unordered context diversities. There is also a function to extract and center words for a given window.
- `src/embeddings/domain.py`: UNUSED, deprecated.
- `src/embeddings/genetic_deep_learning/correlation_matrix.py`: computes the correlation between two matrices.
- `src/embeddings/genetic_deep_learning/genetic_algorithm.py`: genetic algorithms to select the best Word2Vec model.
- `src/embeddings/genetic_deep_learning/hca_out_format_reader.py`: transforms a whole `.out` file into a corpus usable by Word2Vec.
- `src/embeddings/genetic_deep_learning/running_model.py`: runs a Word2Vec model.
- `src/embeddings/hca_reader.py`: parses a `.out` file to extract information from it.
- `src/embeddings/hcdb_parser.py`: parses the hydrophobic cluster database.
- `src/embeddings/notebook/comparing_distribution.ipynb`: plots the distribution of some characteristics using Plotly.
- `src/embeddings/notebook/data_meaning.ipynb`: plots information, mostly related to the norm, using Plotly.
- `src/embeddings/notebook/matplotlib_for_report.ipynb`: uses matplotlib to produce `plot.pdf` files to use in the report.
- `src/embeddings/notebook/matrix.ipynb`: computes the cosine similarity matrix.
- `src/embeddings/notebook/projection.ipynb`: tests a lot of projections for the vectors, with a lot of descriptors.
- `src/embeddings/notebook/sammon.py`: computes a Sammon map using this next GitHub repository: https://github.com/tompollard/sammon.
- `src/embeddings/peitsch2vec.py`: the main program, used to compute Word2Vec vectors and other characteristics.
- `src/embeddings/peitsch.py`: object to manipulate the hydrophobic clusters.
- `src/embeddings/write_csv.py`: writes a `.csv` file with some hydrophobic cluster characteristics.
- `src/hca_extraction/arg_parser.py`: parses the given arguments for the `src/hca_extraction/hca_extraction.py` script.
- `src/hca_extraction/domain_comparison.py`: script used to compare multiple domains with each other by computing the context diversity, and to output the best results given a user-defined threshold.
- `src/hca_extraction/hca_extraction.py`: goes from `.fasta` files to a `.out` file.
- `src/scope_tree/arg_parser.py`: parses the given arguments for the `src/scope_tree/context_extraction.py` and `src/scope_tree/scope_score.py` scripts.
- `src/scope_tree/context_extraction.py`: extracts the context information, also taking the SCOPe levels into consideration, and outputs a `.csv` file.
- `src/scope_tree/scope_score.py`: computes a score between two or multiple domains to see how far they are from each other in the SCOPe tree.
- `src/scope_tree/scope_tree.py`: computes a network for one given hydrophobic cluster. The network is linked to the SCOPe tree, with indications of the context diversity on each node.
- `data/HCDB_2018_summary_rss.csv`: hydrophobic cluster database with the summary of the regular secondary structures. Made in 2018.
- `pyHCA_SCOPe_30identity_globular.out`: pyHCA output. It was applied on the SCOPe `2.07` database with a redundancy level of 30 %, downloaded through Astral. Original dataset available here: https://raw.githubusercontent.com/DarkVador-HCA/Order-Disorder-continuum/main/data/SCOPe/hca.out.
- `SCOPe_2.08_classification.txt`: a file that permits going from the domain ID to the precise SCOPe class (for instance, from `d1ux8a_` to `a.1.1.1`). File available here: https://scop.berkeley.edu/downloads/parse/dir.des.scope.2.08-stable.txt.
- `output_plot/`: all plots produced by the notebook `src/embeddings/notebook/matplotlib_for_report.ipynb`, all in `.pdf` format.
- `data/REDUNDANCY_DATASET/cd-hit_30.fasta`; `data/REDUNDANCY_DATASET/cd-hit_70.fasta`; `data/REDUNDANCY_DATASET/cd-hit_90.fasta`: amino acid sequences from SCOPe `2.08` with different redundancy levels (30 %, 70 %, 90 %). Redundancy was treated through Astral and cd-hit. Original datasets are available here: https://scop.berkeley.edu/astral/subsets/ver=2.08.
- `data/REDUNDANCY_DATASET/cd-hit_30.out`; `data/REDUNDANCY_DATASET/cd-hit_70.out`; `data/REDUNDANCY_DATASET/cd-hit_90.out`: hydrophobic cluster sequences from SCOPe `2.08` with different redundancy levels (30 %, 70 %, 90 %). Redundancy was treated through Astral and cd-hit. Not treated by pyHCA.
- `data/REDUNDANCY_DATASET/redundancy_30_context_conservation_2023-05-09_14-38-42.csv`; `data/REDUNDANCY_DATASET/redundancy_70_context_conservation_2023-05-11_10-39-29.csv`; `data/REDUNDANCY_DATASET/redundancy_90_context_conservation_2023-05-11_10-41-19.csv`: all context diversities computed for the different redundancy levels (30 %, 70 %, 90 %). Redundancy was treated through Astral and cd-hit. Little things to know: `100.0` = context computed with a full diversity; `100` = context could not be computed, so a full diversity has been attributed. This has been corrected in the program by putting `NA` instead of `int(100)`.
- `data/peitsch2vec/default_domain/`: data for the dataset with a redundancy level of 30 %, treated by pyHCA, not treated by cd-hit.
- `data/peitsch2vec/redundancy/30_percent_redundancy/`; `data/peitsch2vec/redundancy/70_percent_redundancy/`; `data/peitsch2vec/redundancy/90_percent_redundancy/`: data for the datasets with redundancy levels of 30 %, 70 % and 90 %, not treated by pyHCA, treated by cd-hit.
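As a sketch of the `NA` correction described above, the snippet below post-processes such a `.csv`: it keeps the genuine full-diversity value `100.0` but replaces the bare sentinel `100` with `NA`. The miniature data and the column names are hypothetical, not taken from the real `redundancy_*_context_conservation_*.csv` files.

```python
import csv
import io

# Hypothetical miniature of a context-diversity CSV (real column names
# may differ in the repository's files).
RAW = """peitsch_code,context_diversity
105,42.5
37,100.0
197,100
"""

def clean_diversity(text: str) -> list:
    """Replace the bare sentinel '100' (context could not be computed)
    with 'NA', while keeping the genuine full-diversity value '100.0'."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["context_diversity"] == "100":
            row["context_diversity"] = "NA"
        rows.append(row)
    return rows

rows = clean_diversity(RAW)
```

The comparison is done on the raw text on purpose: once parsed as a float, `100` and `100.0` become indistinguishable.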
For the `data/peitsch2vec/` paths given in the two previous items:
- `characteristics_{date}.npy`: hydrophobic cluster characteristics for a given redundancy level, like the size or the regular secondary structure. The characteristics are listed here, in the same order as in the file:
  - Peitsch code.
  - Hydrophobic cluster (binary code).
  - Hydrophobic score.
  - Cluster size.
  - Regular secondary structure.
  - Occurrences.
  - Number of clusters inside the domain where cluster[i] is found.
  - Size of the domain where cluster[i] is found.
  - HCA score, defined by pyHCA, of the domain where cluster[i] is found.
  - P-value, defined by pyHCA, of the domain where cluster[i] is found.
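Assuming the file stores one row per hydrophobic cluster with the ten fields above in that order (the column indices below are an assumption based on this list, not taken from the scripts), the array could be loaded like this:

```python
import numpy as np

# Assumed column layout, mirroring the ordered list above
# (hypothetical: verify against src/embeddings/peitsch2vec.py).
FIELD = {
    "peitsch_code": 0,
    "binary_code": 1,
    "hydrophobic_score": 2,
    "cluster_size": 3,
    "secondary_structure": 4,
    "occurrences": 5,
    "clusters_in_domain": 6,
    "domain_size": 7,
    "hca_score": 8,
    "p_value": 9,
}

def load_characteristics(path: str) -> np.ndarray:
    """Load a characteristics_{date}.npy array.

    allow_pickle=True because the rows mix strings (e.g. the binary
    code, the secondary structure) with numbers."""
    return np.load(path, allow_pickle=True)

# Example: sizes = load_characteristics(path)[:, FIELD["cluster_size"]]
```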
- `corpus_{date}.npy`: corpus given to Word2Vec, after applying the filters.
- `embedding_{date}.npy`: vector embeddings generated by Word2Vec.
- `matrix_cosine_{date}.npy`: cosine similarity matrix generated from the vector embeddings produced by Word2Vec.
- `model_{date}.w2v`: the trained Word2Vec model.
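A cosine similarity matrix of this kind can be recomputed from the embedding file alone. A minimal NumPy sketch (the function below is an illustration, not code from the repository, and the path in the comment keeps the `{date}` placeholder):

```python
import numpy as np

def cosine_matrix(embedding: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between the rows of an embedding."""
    # Normalise every vector to unit length, then a single matrix
    # product yields all pairwise dot products, i.e. the cosines.
    unit = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    return unit @ unit.T

# vectors = np.load("data/peitsch2vec/default_domain/embedding_{date}.npy")
# similarity = cosine_matrix(vectors)
```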
```bash
$ tree -lF -h
[6.8G]
.
├── [4.0K] "data/"
│   ├── [4.0K] "output_plot/"
│   ├── [4.0K] "peitsch2vec/"
│   │   ├── [4.0K] "default_domain/"
│   │   └── [4.0K] "redundancy/"
│   │       ├── [4.0K] "30_percent_redundancy/"
│   │       ├── [4.0K] "70_percent_redundancy/"
│   │       └── [4.0K] "90_percent_redundancy/"
│   └── [4.0K] "REDUNDANCY_DATASET/"
├── [4.0K] "env/"
│   ├── [ 905] "fierleniuz.yml"
│   └── [ 885] "README.md"
├── [ 895] "INSTALL.sh"
├── [ 20K] "LICENSE"
├── [ 13K] "README.md"
└── [4.0K] "src/"
    ├── [4.0K] "cluster/"
    ├── [4.0K] "embeddings/"
    │   ├── [4.0K] "genetic_deep_learning/"
    │   └── [4.0K] "notebook/"
    ├── [4.0K] "hca_extraction/"
    └── [4.0K] "scope_tree/"

18 directories, 88 files
```

This work is licensed under a Creative Commons Attribution 4.0 International License.