GReNIMJA: Gene Regulatory Network Inference by Mixing and Jointing features of Amino acid and nucleotide sequences
This is the code for Mixing features of transcription factors and genes enables accurate prediction of gene regulation relationships for unknown transcription factors. This project is carried out in Funahashi Lab. at Keio University.
A key point of this study is that GReNIMJA was designed not to predict specific TFs from genes, but to predict whether a regulatory relationship exists, given both the amino acid sequence of a TF and the nucleotide sequence of a gene.
Our model extracted features from both the amino acid sequences of TFs and the nucleotide sequences of target genes, mixed these features using a 2D LSTM architecture, and performed binary classification to predict the presence or absence of regulatory relationships.
Details of this code are described in our paper, Mixing features of transcription factors and genes enables accurate prediction of gene regulation relationships for unknown transcription factors.
We have confirmed that our code works correctly on Ubuntu 22.04.
- Python 3.8.5+
- PyTorch 2.4.1+
- NumPy 1.22.3+
- SciPy 1.8.0+
- Pandas 1.4.1+
- Gensim 4.1.2+
- Matplotlib 3.5.1+
- Seaborn 0.11.2+
- scikit-learn 1.0.2+
- Optuna 2.10.0+
- tqdm 4.64.0+
See requirements.txt for details.
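The minimum versions listed above can be checked against the active environment before running anything. The following is a minimal sketch, not part of the repository: the helper names (`parse_version`, `meets_minimum`, `check_requirements`) are hypothetical, and the keys are assumed to be the PyPI distribution names (for example `torch` for PyTorch). requirements.txt remains the authoritative list.

```python
from importlib import metadata

# Minimum versions copied from the list above. The keys are assumed to be
# the PyPI distribution names (e.g. "torch" for PyTorch).
MINIMUM_VERSIONS = {
    "torch": "2.4.1",
    "numpy": "1.22.3",
    "scipy": "1.8.0",
    "pandas": "1.4.1",
    "gensim": "4.1.2",
    "matplotlib": "3.5.1",
    "seaborn": "0.11.2",
    "scikit-learn": "1.0.2",
    "optuna": "2.10.0",
    "tqdm": "4.64.0",
}


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    # Drop a local version suffix such as "+cu121" before splitting.
    for piece in text.split("+")[0].split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings component by component, padding with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Map each package name to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

The comparison only looks at the leading numeric components, so pre-release tags are ignored; this is a deliberate simplification to avoid an extra dependency.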
% git clone git@github.com:funalab/GReNIMJA.git
% cd GReNIMJA
% python -m venv venv
% source ./venv/bin/activate
% pip install --upgrade pip
% pip install -r requirements.txt
[NOTE]
Before downloading the related files, please check the available storage space.
Downloading and extracting the tar.gz files
requires approximately 9.8 GB of storage space.
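The storage check in the note can be done from Python as well. This is a minimal sketch using only the standard library; the function names are hypothetical, and it assumes that "GB" means 10**9 bytes and that the files are extracted on the filesystem containing the current directory.

```python
import shutil

# Approximate requirement stated in the note above.
# Assumption: "GB" is interpreted as 10**9 bytes.
REQUIRED_GB = 9.8


def free_space_gb(path="."):
    """Return the free space, in GB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if the filesystem containing path has enough free space."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    available = free_space_gb(".")
    verdict = "enough" if has_enough_space(".") else "NOT enough"
    print(f"{available:.1f} GB free, {REQUIRED_GB} GB needed: {verdict}")
```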
- On Linux:
% bash downloads/download_linux.sh
- On macOS:
% bash downloads/download_mac.sh
All datasets constructed in this study can be obtained by the above command.
If you want to know how the data was constructed, please see scripts/ for details.
These scripts are mainly written in Shell and R.
4. Run the model (Evaluation of model performance for each unknown TF in the prediction of regulatory relationships)
- On GPU (specify the GPU ID):
% python src/test.py --gpu_id 0
- On CPU (a negative GPU ID selects the CPU):
% python src/test.py --gpu_id -1
The example above takes about 2 hours on a GPU (NVIDIA A100 40GB PCIe).
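For scripted runs, the command line can be assembled in Python. This is a minimal sketch under stated assumptions: `build_test_command` is a hypothetical helper, not part of the repository, and it only knows the options that appear in this README (`--gpu_id`, `--model_path`, `--embeddings_path`, `--dataset_path`, `--save_path`). It builds the argument list and does not run anything.

```python
# Path options of src/test.py that are documented in this README.
PATH_OPTIONS = ("model_path", "embeddings_path", "dataset_path", "save_path")


def build_test_command(gpu_id=-1, **paths):
    """Build the argument list for src/test.py.

    gpu_id: GPU ID to use; a negative value selects the CPU.
    paths:  optional overrides, restricted to the names in PATH_OPTIONS.
    """
    command = ["python", "src/test.py", "--gpu_id", str(gpu_id)]
    for name, value in paths.items():
        if name not in PATH_OPTIONS:
            raise ValueError(f"unknown option: {name}")
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    # Print the CPU command; nothing is executed here.
    print(" ".join(build_test_command(gpu_id=-1)))
```

To launch the run, the resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.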
You can also specify the paths explicitly.
To run inference for known transcription factors:
% python src/test.py --gpu_id 0 --model_path ./models/known_best_model.pt --embeddings_path ./embeddings --dataset_path ./datasets/known_TFs/test_dataset.pickle --save_path ./results/known_test
To run inference for unknown transcription factors:
% python src/test.py --gpu_id 0 --model_path ./models/unknown_best_model.pt --embeddings_path ./embeddings --dataset_path ./datasets/unknown_TFs/test_dataset.pickle --save_path ./results/unknown_test
This research was funded by JST CREST, Japan, Grant Number JPMJCR2011, to Tetsuya J. Kobayashi and Akira Funahashi.