We aim to establish a novel cancer driver gene mining method based on heterogeneous network metapaths. First, we construct a heterogeneous network from several types of multi-omics data that are biologically linked to genes. We then form nine metapaths that use genes as both start and end nodes. The representation vectors obtained by aggregating information within each metapath and across metapaths serve as new gene features for subsequent classification and prediction tasks. In addition, we hope to improve the biological interpretability of the predictions by analysing the contribution of the different metapaths.
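To make the metapath idea concrete, here is a minimal sketch in plain Python of how gene-to-gene metapath neighbors can be collected in a small heterogeneous network. The node types `pathway` and `mirna`, the toy edges, and the helper names are hypothetical and chosen only for illustration; this is not the MCDHGN implementation, which uses DGL and learned aggregation.

```python
from collections import defaultdict

# Hypothetical toy edges: (source_type, source_id, target_type, target_id).
EDGES = [
    ("gene", "TP53", "pathway", "P1"),
    ("gene", "MDM2", "pathway", "P1"),
    ("gene", "TP53", "mirna", "M1"),
    ("gene", "KRAS", "mirna", "M1"),
    ("gene", "KRAS", "pathway", "P2"),
]


def build_adjacency(edges):
    """Index undirected edges as (node_type, node_id) -> neighbor_type -> set of ids."""
    adj = defaultdict(lambda: defaultdict(set))
    for s_type, s_id, t_type, t_id in edges:
        adj[(s_type, s_id)][t_type].add(t_id)
        adj[(t_type, t_id)][s_type].add(s_id)
    return adj


def metapath_neighbors(adj, start_gene, metapath):
    """Return the genes reached from start_gene by following the node types in metapath.

    metapath is a list of node types that starts and ends with "gene",
    for example ["gene", "pathway", "gene"].
    """
    frontier = {start_gene}
    for prev_type, next_type in zip(metapath, metapath[1:]):
        next_frontier = set()
        for node in frontier:
            next_frontier |= adj[(prev_type, node)][next_type]
        frontier = next_frontier
    # The start gene trivially reaches itself, so drop it.
    return frontier - {start_gene}
```

Calling `metapath_neighbors(build_adjacency(EDGES), "TP53", ["gene", "pathway", "gene"])` collects the genes that share a pathway with TP53 in the toy data. In a metapath-based model, the features of such neighbors would then be aggregated per metapath before the per-metapath results are combined.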
All the nodes we use to build the network and the connections between them can be found in the './data/' folder:
'./data/biological_features.csv': Multi-omics gene features extracted with the EMOGI method.
In the initial feature preprocessing for gene nodes, we calculated the aberrant expression values for each gene across 16 cancer types in tumor tissue samples. The mutation rates were determined using the probability of single nucleotide variations (SNVs). The probabilities of gene methylation and gene expression products were represented by the logarithmic difference values between tumor samples and normal gene segments. All preprocessing procedures can be found in the './preprocess_data' folder.
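As a rough illustration of the two kinds of quantities described above, the sketch below computes a per-gene mutation rate and a logarithmic tumor-versus-normal difference. The function names, the use of per-group means, the base-2 logarithm, and the pseudocount are all assumptions made for this example; the actual preprocessing scripts may differ.

```python
import math


def mutation_rate(sample_has_snv):
    """Fraction of tumor samples in which the gene carries at least one SNV.

    sample_has_snv is a list of booleans, one per tumor sample.
    """
    if not sample_has_snv:
        return 0.0
    return sum(1 for flag in sample_has_snv if flag) / len(sample_has_snv)


def log_difference(tumor_values, normal_values, pseudocount=1.0):
    """log2 of the mean tumor signal minus log2 of the mean normal signal.

    The pseudocount (an assumption here) avoids taking the log of zero.
    """
    tumor_mean = sum(tumor_values) / len(tumor_values)
    normal_mean = sum(normal_values) / len(normal_values)
    return math.log2(tumor_mean + pseudocount) - math.log2(normal_mean + pseudocount)
```

A positive `log_difference` means the signal (methylation or expression) is higher in tumor samples than in normal samples under these assumptions, and a negative value means it is lower.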
'./data/ppi/': Protein-Protein interaction (PPI) data obtained from the CPDB database. https://toxnet.nlm.nih.gov/cpdb/
'./data/msigdb/': Various multi-omics data related to cancer from the MSigDB. https://www.gsea-msigdb.org/gsea/msigdb
The file './preprocess/gene_protein.ipynb' processes the gene-gene relationships obtained from the PPI network, while the file './preprocess/Msig_preprocess.ipynb' processes the correspondences between multi-omics biological nodes in the MSigDB database.
We use './preprocess/generate_network.ipynb' to build graphs in DGL input format. The graphs used for training and performance comparison are stored in './data/network/hetero/new_9nodes_graph.bin'.
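If you want to inspect the stored graph outside the notebooks, a sketch like the following may help. It assumes the '.bin' file was written with `dgl.save_graphs` and that the heterogeneous graph is the first entry in the file; neither assumption is confirmed here, and the helper names `load_hetero_graph` and `summarize` are hypothetical.

```python
from pathlib import Path

GRAPH_PATH = Path("./data/network/hetero/new_9nodes_graph.bin")


def load_hetero_graph(path=GRAPH_PATH):
    """Load the first graph stored in a DGL binary file (assumed layout)."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Graph file not found: {path}")
    import dgl  # imported here so the path check works even without DGL installed

    graphs, _labels = dgl.load_graphs(str(path))
    return graphs[0]


def summarize(graph):
    """Return node counts per node type and the list of canonical edge types."""
    node_counts = {ntype: graph.num_nodes(ntype) for ntype in graph.ntypes}
    return node_counts, list(graph.canonical_etypes)
```

Running `summarize(load_hetero_graph())` from the repository root should list the node types and the (source type, relation, target type) triples, which is a quick way to check that the graph matches what the training notebooks expect.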
The MCDHGN code is based on Python, PyTorch, and the DGL library, so you will need the following packages to run it:
- Python==3.9.16
- torch==1.12.0
- jupyter notebook==6.5.4
- ipykernel==6.19.2
- ipython==8.12.0
- dgl==1.1.1+cu102
- torch-geometric==2.3.1
- torchvision==0.13.1a0
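To check whether an existing environment matches these pins, a small standard-library script such as the one below can be used. The distribution names in `PINNED` are assumptions (for example, `notebook` for the Jupyter Notebook entry), and the helper names are hypothetical; adjust them if your package manager uses different names.

```python
from importlib import metadata

# Pins copied from the list above; the keys are assumed distribution names.
PINNED = {
    "torch": "1.12.0",
    "notebook": "6.5.4",
    "ipykernel": "6.19.2",
    "ipython": "8.12.0",
    "dgl": "1.1.1+cu102",
    "torch-geometric": "2.3.1",
    "torchvision": "0.13.1a0",
}


def installed_version(name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_environment(pinned=PINNED):
    """Map each package name to a (pinned, installed, matches) tuple."""
    report = {}
    for name, wanted in pinned.items():
        found = installed_version(name)
        report[name] = (wanted, found, found == wanted)
    return report
```

Iterating over `check_environment().items()` and printing each tuple shows which packages are missing or differ from the pinned versions. The comparison is an exact string match, so a build suffix such as `+cu102` counts as part of the version.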
First, clone the repository or download the source code and data files:
git clone https://github.com/1160300611/MCDHGN.git
To improve running speed, we save the message flow subgraphs (MFGs) generated by random walks on the heterogeneous graph in './Intermediate/blocks/'. You can use these results directly to run './5fold_verification.ipynb' and view the five-fold cross-validation results, or regenerate the message flow subgraphs by following the code comments.
Use './test_and_pridict.ipynb' to view the training and prediction results of the model on the entire MCDHGN label set.
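For readers unfamiliar with five-fold cross-validation, the following sketch shows one common way to split labeled genes into five folds while keeping the ratio of driver to non-driver labels roughly constant. It is a plain-Python illustration with a hypothetical function name; the notebook may use a different splitter or a different random seed.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions roughly preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx
```

Each gene index appears in exactly one test fold, so averaging a metric over the five folds uses every labeled gene once for evaluation.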