Introduction to networks and pathways
Introduction to multi-omics data integration and visualisation 2024
          Satyadhyan Chickerur, PhD
         KLE Technological University
          Ref: Marton Olbei & Dezso Modos & Tamas Korcsmaros
A brief introduction to networks
Figure: @randomgraphs
Behind every complex system (one made up of many interconnected parts in a complicated
arrangement) lies a network:
    Networks encoding the interactions between genes, proteins, etc.: the cellular network
    The wiring diagram capturing the interactions between neurons: the neural network
    The interactions of family, friends and colleagues: a social network
    Communication networks, describing how communication devices interact with each other
    Power grids, trade networks and many, many more.
Figure: @randomgraphs
       This is a node
In network biology nodes can
represent various components:
          Genes
         Proteins
          RNAs
and other biological entities
This is an edge
Figure: node 1 and node 2 joined by an edge
Edges represent the interactions occurring between our nodes
They can be:
    Directed or undirected
    Weighted or unweighted
    Stimulatory or inhibitory
      Networks are collections of nodes and edges
In network biology we use them to:
                 Visualize our biological entities
       See overrepresented functions and their interactions
          Make predictions about potential interactions
Characteristics of a network
Figure: @randomgraphs
Degree
The degree of a node represents the number of links it has to other nodes,
e.g. the number of citations of a paper in a research graph, or how many friends you have.
Figure (undirected chain a - b - c): Ka = 1, Kb = 2, Kc = 1
As networks can be directed or undirected, for the former we additionally
differentiate between in-degree and out-degree.
Figure (directed, a → b ← c):
    Ka_out = 1, Kb_out = 0, Kc_out = 1
    Ka_in = 0, Kb_in = 2, Kc_in = 0
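The degree counts in the figure can be reproduced from an edge list. This is a minimal sketch in plain Python; the function names `degree` and `in_out_degree` are illustrative choices, and the directed example assumes the edges a → b and c → b, which is the only arrangement consistent with the in- and out-degrees shown.

```python
from collections import Counter

# Undirected chain a - b - c, matching the degrees quoted in the figure.
undirected_edges = [("a", "b"), ("b", "c")]

def degree(edges):
    """Count how many edges touch each node of an undirected network."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return dict(counts)

# Directed version (source, target): a -> b and c -> b (assumed layout).
directed_edges = [("a", "b"), ("c", "b")]

def in_out_degree(edges, nodes):
    """Return (in_degree, out_degree) dictionaries for a directed network."""
    k_in = {n: 0 for n in nodes}
    k_out = {n: 0 for n in nodes}
    for source, target in edges:
        k_out[source] += 1
        k_in[target] += 1
    return k_in, k_out

k = degree(undirected_edges)
k_in, k_out = in_out_degree(directed_edges, ["a", "b", "c"])
print(k)      # {'a': 1, 'b': 2, 'c': 1}
print(k_out)  # {'a': 1, 'b': 0, 'c': 1}
print(k_in)   # {'a': 0, 'b': 2, 'c': 0}
```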
Hubs
Hubs are the highest-degree nodes in a network.
They are often very important biologically and their deletion can be lethal.
Can anyone think of an example?
Figure (star network): e has degree 4; a, b, c and d each have degree 1
Paths
A path is a sequence of nodes in which each node is adjacent to the next one.
The distance between two nodes is defined as the number of edges along the
shortest path connecting them.
Figure (nodes a to e): example paths a→b→c→d and a→b→c→e
Paths
The shortest path is the path with the shortest length (distance) between two nodes.
Diameter: the longest of all the calculated shortest paths in a network
Characteristic path length: the average length of the shortest paths in a network
Figure (nodes a to f): paths from a to e of 3 steps and 4 steps
Figure: network science book
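Distance, diameter and characteristic path length can all be computed with a breadth-first search. This is a sketch for unweighted, undirected networks; the example graph (a chain a-b-c with two leaves d and e attached to c) and the helper names are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations

# Undirected example network: the chain a-b-c with two leaves d and e on c.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")]

def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set of neighbours mapping."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def distances_from(adjacency, source):
    """Breadth-first search: edges on the shortest path to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

adjacency = build_adjacency(edges)
all_distances = {node: distances_from(adjacency, node) for node in adjacency}

# Shortest-path length of every connected pair of nodes.
pair_lengths = [
    all_distances[u][v]
    for u, v in combinations(sorted(adjacency), 2)
    if v in all_distances[u]
]

diameter = max(pair_lengths)
characteristic_path_length = sum(pair_lengths) / len(pair_lengths)

print(all_distances["a"]["e"])     # 3
print(diameter)                    # 3
print(characteristic_path_length)  # 1.8
```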
Local network structures
 Graphlet: A local unique (non-isomorphic) structure of a
 network
 Network motif: An overrepresented local structure of a
 network (a common graphlet)
Example network motifs
Figure: three motifs on nodes a, b and c: feed forward, feedback loop and three chain
Milo, et al (2002). Science
 Connectedness
  A graph is connected if
  any two nodes can be
  joined by a path.
  We refer to the largest
  connected component as
  the giant component and
  the rest as isolates.
  Bridges connect the
  various components - if
  you erase them the graph
  becomes disconnected
Figure: network science book
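Connected components and bridges can be found by graph traversal. The sketch below uses a made-up network (two triangles joined by the edge c-d, plus an isolated node g) and tests each edge by removing it, which is simple but slow on large networks; the function names are illustrative.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components, largest first."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

def bridges(nodes, edges):
    """Edges whose removal increases the number of connected components."""
    baseline = len(connected_components(nodes, edges))
    found = []
    for edge in edges:
        remaining = [e for e in edges if e != edge]
        if len(connected_components(nodes, remaining)) > baseline:
            found.append(edge)
    return found

# Hypothetical network: two triangles joined by c-d, and an isolate g.
nodes = ["a", "b", "c", "d", "e", "f", "g"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]

components = connected_components(nodes, edges)
giant = components[0]
bridge_edges = bridges(nodes, edges)

print(len(components))  # 2
print(sorted(giant))    # ['a', 'b', 'c', 'd', 'e', 'f']
print(bridge_edges)     # [('c', 'd')]
```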
Clustering
Clustering coefficient: what fraction of your neighbors are connected?
Watts & Strogatz, Nature 1998.
Figure (two example neighbourhoods of node i):
    10 edges total, 6 between neighbours: Ci = 1
    7 edges total, 3 between neighbours: Ci = 0.5
Figure: network science book
Clustering: calculating the clustering coefficient
Clustering coefficient: what fraction of your neighbors are connected
(value between 0 and 1)? Watts & Strogatz, Nature 1998.
    ei = edges between the neighbours of node i
    ki = degree of node i
$$C_i = \frac{2 e_i}{k_i (k_i - 1)}$$
Figure (node i, 7 edges total, 3 between neighbours, k_i = 4):
$$C_i = \frac{2 \cdot 3}{4 \cdot (4 - 1)} = \frac{6}{12} = 0.5$$
Figure: network science book
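The two clustering examples (Ci = 0.5 and Ci = 1) can be checked with a few lines of code. This is a sketch: the exact wiring of the neighbours in the figure is not recoverable, so the edge lists below are assumed layouts that only match the stated edge counts, and returning 0 for nodes with fewer than two neighbours is a convention chosen here.

```python
from itertools import combinations

def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set of neighbours mapping."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def clustering_coefficient(adjacency, node):
    """C_i = 2 * e_i / (k_i * (k_i - 1)) for an undirected network."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: undefined cases are reported as 0
    # e_i: number of edges that run between two neighbours of the node
    e = sum(1 for u, v in combinations(neighbours, 2) if v in adjacency[u])
    return 2 * e / (k * (k - 1))

neighbour_names = ("n1", "n2", "n3", "n4")
spokes = [("i", n) for n in neighbour_names]

# 7 edges total, 3 of them between the neighbours of i.
sparse = build_adjacency(spokes + [("n1", "n2"), ("n2", "n3"), ("n3", "n4")])
# 10 edges total, all 6 possible edges between the neighbours of i.
dense = build_adjacency(spokes + list(combinations(neighbour_names, 2)))

c_sparse = clustering_coefficient(sparse, "i")
c_dense = clustering_coefficient(dense, "i")
print(c_sparse)  # 0.5
print(c_dense)   # 1.0
```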
Betweenness centrality
How much influence does my node have over the information flow in the network?
Betweenness: the fraction of shortest paths that go through a given node
Figure: network science book
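The "fraction of shortest paths" definition can be computed directly for small networks. The sketch below averages that fraction over all connected pairs of other nodes, which is one of several normalisations in use (library implementations may scale the value differently); the star network and function names are illustrative assumptions.

```python
from collections import deque
from itertools import combinations

def shortest_path_counts(adjacency, source):
    """BFS returning distances and the number of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adjacency[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                sigma[nb] = 0
                queue.append(nb)
            if dist[nb] == dist[node] + 1:
                sigma[nb] += sigma[node]  # every shortest path to node extends to nb
    return dist, sigma

def betweenness(adjacency):
    """Average fraction of s-t shortest paths passing through each node."""
    info = {n: shortest_path_counts(adjacency, n) for n in adjacency}
    scores = {}
    for v in adjacency:
        dist_v, sigma_v = info[v]
        total, pairs = 0.0, 0
        for s, t in combinations(adjacency, 2):
            if v in (s, t):
                continue
            dist_s, sigma_s = info[s]
            if t not in dist_s:
                continue  # s and t are not connected
            pairs += 1
            # v lies on a shortest s-t path only if the distances add up
            if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
        scores[v] = total / pairs if pairs else 0.0
    return scores

# Star network: e in the centre, connected to a, b, c and d.
star = {"e": {"a", "b", "c", "d"}, "a": {"e"}, "b": {"e"}, "c": {"e"}, "d": {"e"}}
scores = betweenness(star)
print(scores["e"])  # 1.0
print(scores["a"])  # 0.0
```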
Networks vs Pathways
Bipartite networks
    Bipartite networks are
    networks whose nodes can
    be divided into two distinct
    sets U and V in such a way
    that every link connects a
    node in U to one in V
https://www.nature.com/articles/srep00196
Multilayered networks
  Multilayered networks
  contain multiple types
  of edges, changing the
  context of the
  interaction.
 Representing networks
     One of the most commonly used
     representations, due to its ease of
     use and the ability to use it
     directly for measurements, is the
     adjacency matrix.
     For an undirected network, if the
     position A12 equals one, there
     is an interaction
     between nodes 1 and 2.
     For directed networks the
     direction of the interaction
     carries information on its own: 1
     → 3 is not the same as 1 ← 3,
     and their adjacency matrices
     behave accordingly.
Figure: network science book
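A small sketch of building adjacency matrices for the same edges, once undirected and once directed. The convention used here (row = source, column = target) is an assumption; some texts use the transposed convention, so check which one your software expects.

```python
def adjacency_matrix(n_nodes, edges, directed=False):
    """Build an adjacency matrix from edges given as 1-based (source, target) pairs."""
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for source, target in edges:
        matrix[source - 1][target - 1] = 1
        if not directed:
            matrix[target - 1][source - 1] = 1  # undirected matrices are symmetric
    return matrix

edges = [(1, 2), (1, 3)]
undirected = adjacency_matrix(3, edges)
directed = adjacency_matrix(3, edges, directed=True)

print(undirected)  # [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
print(directed)    # [[0, 1, 1], [0, 0, 0], [0, 0, 0]]

# Measurements can be read straight off the matrix, e.g. degrees as row sums.
degrees = [sum(row) for row in undirected]
print(degrees)     # [2, 1, 1]
```

In the directed matrix the entry for 1 → 3 is set while the entry for 3 → 1 stays 0, which is how the direction is preserved.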
Representing networks
 Another very commonly used format is the edge list.
 The edge list is more human readable, and is easy to write up on the fly.
 The example uses a .csv format, but different software can of course
 accept different delimiters.
 network.csv:
     a,e,
     b,e,
     c,e,
     d,e
 Figure: star network with e in the centre, connected to a, b, c and d
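Reading an edge list takes only the standard `csv` module. In this sketch the file content is held in a string so the example is self-contained; the function name `read_edge_list` is illustrative, and dropping empty fields (so trailing delimiters are tolerated) is a choice made here.

```python
import csv
import io

# Content of network.csv, including the trailing commas shown on the slide.
network_csv = "a,e,\nb,e,\nc,e,\nd,e\n"

def read_edge_list(handle, delimiter=","):
    """Parse (source, target) pairs, ignoring empty fields and short rows."""
    edges = []
    for row in csv.reader(handle, delimiter=delimiter):
        fields = [field.strip() for field in row if field.strip()]
        if len(fields) >= 2:
            edges.append((fields[0], fields[1]))
    return edges

edges = read_edge_list(io.StringIO(network_csv))
print(edges)  # [('a', 'e'), ('b', 'e'), ('c', 'e'), ('d', 'e')]

# A tab-delimited file only needs a different delimiter argument.
tsv_edges = read_edge_list(io.StringIO("a\te\nb\te\n"), delimiter="\t")
print(tsv_edges)  # [('a', 'e'), ('b', 'e')]
```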
Representing networks
   Standards are very important to
   every field, and network biology
   is no different.
   One attempt to standardize the
   efforts of the molecular
   interaction field is the
   PSI-MI TAB format, which uses a
   controlled vocabulary to
   describe interactions.
https://academic.oup.com/bioinformatics/article/35/19/3779/5355053
Where do networks come from?
Figure: @randomgraphs
Network databases
Correlation/similarity networks:
You can generate them from any omics
experiment (transcriptomics, proteomics
etc.) and build them using various
metrics, e.g. correlation.
Pros:
 ● Condition specific network
Cons:
 ● Undirected
Tools: WGCNA
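As a rough illustration of the idea, the sketch below links genes whose expression profiles have an absolute Pearson correlation above a hard threshold. This is not the WGCNA procedure (which uses soft thresholding and further steps); the expression values, gene names and the 0.9 threshold are all hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical expression values of four genes across five samples.
expression = {
    "gene1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "gene3": [5.0, 4.1, 2.9, 2.2, 1.0],
    "gene4": [3.0, 1.0, 4.0, 1.0, 5.0],
}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_network(expression, threshold=0.9):
    """Undirected weighted edges between genes with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges

edges = correlation_network(expression)
for g1, g2, r in edges:
    print(g1, g2, r)
```

The edges carry a weight and a sign but no direction, which is the "undirected" limitation listed under Cons.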
Physical interactions
  ● “Strict” definition: captures the physical
    interactions of proteins
  ● “Loose” definition: captures the functionally
    interacting proteins (e.g. includes members of a
    complex)
  ● Main experimental methods: Y2H, AP-MS,
    fragment complementation, BioID
Large scale protein interaction networks - Y2H
                        Pros:
                         ● High throughput
                         ● Well established technique
                        Cons:
                         ● Cannot detect interactions of proteins that do
                           not enter the cell nucleus
                         ● Bait-prey interactions have to work
                         ● Context dependent
                        Example databases: HuRI
Common methods for creating large scale
protein interaction networks - manual curation
                        Pros:
                         ● Usually more accurate
                        Cons:
                         ● Labour intensive
                         ● Publication/curator bias
                        Example datasets: SignaLink, Reactome, WikiPathways
Figure (from https://phdcomics.com/): save to database
Common methods for creating large scale
protein interaction networks - text mining
                        Pros:
                         ● Searches the whole literature
                        Cons:
                         ● Potential bias from the algorithm
                        Examples: STRING
Common methods for creating large scale
protein interaction networks - secondary databases
                       Pros:
                       ● Database of databases
                       ● Many interactions
                       Cons:
                       ● Bias of all the various datasets
                       ● Need to know what the aim
                          of your research is
                       E.g. OmniPath, STRING, ConsensusPathDB
Türei et al 2021
                        Available tools
Figure: @randomgraphs
Tools
Cytoscape
Easy to use, user friendly
Visualization, analysis and data sharing (NDEx)
Has its own ecosystem of apps, greatly extending its usability
https://apps.cytoscape.org/
Automation: RCy3, py4cytoscape
igraph
R / Python (and other languages) library for network
manipulation and visualization
Powerful tool, intermediate difficulty
Great introduction: https://kateto.net/network-visualization
ggraph / tidygraph
Two R libraries made to help users manipulate
and visualize network data using the tidyverse
conventions (i.e. ggplot, dplyr)
Tidygraph provides manipulations and
analysis grammar for network data, while
ggraph offers visualization grammar for
network data
A bit easier to use than igraph, lacks some
functions, but has other important ones, e.g.
easy faceting based on attributes
               Network analysis and discovery
Figure: @randomgraphs
● Let’s say you constructed, queried, curated the network you are
  interested in
● Let’s say it is the differentially expressed proteins in a disease
  using proteomics and their interaction partners in the SignaLink 3
  network resource
● What now?
Network clustering
● One common way of adding
  value to your network is to
  cluster (modularise, partition)
  it
● Module: a part of the network
  where the nodes are more
  connected to each other
  than to nodes outside of the
  module
● There are many approaches to
  do so, including in Cytoscape
  (MCODE, clusterMaker)
Figure: https://mr.schochastics.net/
   How to measure modularity?
Module: a part of the network where the nodes are more connected to each other
than to nodes outside of the module
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
   A_ij: edge weight between nodes i and j
   k_i, k_j: weighted degree of nodes i and j
   m: sum of all edge weights
   c_i, c_j: community of nodes i and j
   δ(c_i, c_j): Kronecker delta, which is 1 if c_i = c_j and 0 otherwise
Values of Q:
   -½ - not modular network
   0 - random network
   1 - completely modular network
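A sketch of computing modularity for a given partition, assuming the standard Newman formulation (edge weights, weighted degrees, total edge weight and a Kronecker delta on community labels) for an undirected network without self-loops. The example network (two triangles joined by one edge) and the function name are illustrative.

```python
def modularity(edges, community):
    """Modularity Q of a partition.

    edges: list of (u, v, weight) for an undirected network without self-loops.
    community: mapping node -> community label.
    """
    m = sum(w for _, _, w in edges)  # sum of all edge weights
    strength = {}                    # weighted degree of each node
    for u, v, w in edges:
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
    # Sum of A_ij over ordered pairs in the same community:
    # every intra-community edge is counted twice (i,j and j,i).
    observed = sum(2 * w for u, v, w in edges if community[u] == community[v])
    # Sum of k_i * k_j / 2m over ordered pairs in the same community
    # equals (total strength of the community)^2 / 2m.
    totals = {}
    for node, k in strength.items():
        totals[community[node]] = totals.get(community[node], 0.0) + k
    expected = sum(t * t for t in totals.values()) / (2 * m)
    return (observed - expected) / (2 * m)

# Two triangles (a, b, c) and (d, e, f) joined by the edge c-d, all weights 1.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1), ("c", "d", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1)]

split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {node: 0 for node in split}

q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(round(q_split, 3))   # 0.357
print(round(q_single, 3))  # 0.0
```

Putting every node in one community gives Q = 0, while splitting the network at the connecting edge gives a clearly positive Q.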
   How to find communities, modules?
Girvan-Newman clustering (top-down):
       1. Calculate the betweenness centrality of each edge.
       2. Remove the edge(s) with the highest betweenness centrality.
       3. Recalculate the betweenness.
       4. Repeat until no edges remain.
The best partition is the one where the modularity is the highest.
Louvain clustering (bottom-up):
      1.   Assign each node to its own community.
      2.   For each node, calculate the modularity change of moving it into each neighbour's community.
      3.   Assign each node to the neighbouring community where the modularity change is the largest.
      4.   Collapse each community into a single node and repeat from step one.
The modularity plateaus - there can be multiple equally good
partitions.
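The Girvan-Newman steps can be sketched in plain Python for small unweighted networks. This is an illustrative implementation, not a tuned one: edge betweenness uses a Brandes-style accumulation, tied edges are removed together, and modularity is always measured on the original edges. Louvain is not sketched here. All function names and the example network are assumptions.

```python
from collections import deque

def components(adjacency):
    """Connected components of the current network, as a list of node sets."""
    seen, parts = set(), []
    for start in adjacency:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in part:
                part.add(node)
                stack.extend(adjacency[node] - part)
        seen |= part
        parts.append(part)
    return parts

def modularity(edges, parts):
    """Unweighted modularity of a partition, measured on the original edges."""
    m = len(edges)
    q = 0.0
    for part in parts:
        inside = sum(1 for u, v in edges if u in part and v in part)
        degree_sum = sum(1 for u, v in edges for n in (u, v) if n in part)
        q += inside / m - (degree_sum / (2 * m)) ** 2
    return q

def edge_betweenness(adjacency):
    """Shortest-path load on each undirected edge (Brandes-style accumulation)."""
    scores = {}
    for source in adjacency:
        dist, sigma, preds, order = {source: 0}, {source: 1.0}, {source: []}, []
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in adjacency[node]:
                if nb not in dist:
                    dist[nb], sigma[nb], preds[nb] = dist[node] + 1, 0.0, []
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    sigma[nb] += sigma[node]
                    preds[nb].append(node)
        delta = {n: 0.0 for n in order}
        for node in reversed(order):  # farthest nodes first
            for p in preds[node]:
                share = sigma[p] / sigma[node] * (1.0 + delta[node])
                edge = frozenset((p, node))
                scores[edge] = scores.get(edge, 0.0) + share
                delta[p] += share
    return scores

def girvan_newman(edges):
    """Return the highest-modularity partition seen while removing edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    best_parts = components(adjacency)
    best_q = modularity(edges, best_parts)
    while any(adjacency.values()):               # step 4: until no edges remain
        scores = edge_betweenness(adjacency)     # steps 1 and 3
        top = max(scores.values())
        for edge, score in scores.items():       # step 2: remove the top edge(s)
            if top - score < 1e-9:
                u, v = tuple(edge)
                adjacency[u].discard(v)
                adjacency[v].discard(u)
        parts = components(adjacency)
        q = modularity(edges, parts)
        if q > best_q:
            best_parts, best_q = parts, q
    return best_parts, best_q

# Two triangles joined by the edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
best_parts, best_q = girvan_newman(edges)
print(sorted(sorted(p) for p in best_parts))  # [['a', 'b', 'c'], ['d', 'e', 'f']]
print(round(best_q, 3))                       # 0.357
```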
Gene ontology enrichment
● H0: Genes annotated with a given function are as common in our
  network module as in the background
● H1: Genes annotated with that function are more common in the
  network module than in the background
● What should be the background?
● How to know what is more common?
 What should be the background?
● Whole human genome?
   ○ It is not in the network!
   ○ We do not measure it in proteomics
● The whole network?
   ○ Not all IDs are mapped to
● The intersection of the whole network and the omics data
   ○ We do not find anything
Figure labels: Signalling Process, Intracellular communication, kinase, Enzyme
 How to know what is more common?
● Statistical test
   ○ Fisher exact test
   ○ Hypergeometric test
   ○ Chi square test
● You run a statistical test for each Gene Ontology function!
  There can be more GO functions than genes!
● Use false discovery rate correction!
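One way to put the two bullets together is a one-sided hypergeometric test per function, followed by Benjamini-Hochberg correction across all tested functions. This is a sketch with hypothetical counts and term names; Benjamini-Hochberg is one common false discovery rate procedure, chosen here as an assumption.

```python
from math import comb

def hypergeometric_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from a background of N, K of them annotated.

    k: annotated genes in the module, n: module size,
    K: annotated genes in the background, N: background size.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical counts: background of 1000 genes, module of 20 genes.
background_size, module_size = 1000, 20
terms = {  # term -> (annotated in module, annotated in background)
    "cell cycle": (8, 50),
    "kinase": (3, 100),
    "enzyme": (6, 300),
}

names = list(terms)
pvalues = [hypergeometric_pvalue(terms[t][0], module_size, terms[t][1], background_size)
           for t in names]
adjusted = benjamini_hochberg(pvalues)
for name, p, q in zip(names, pvalues, adjusted):
    print(f"{name}: p = {p:.3g}, adjusted p = {q:.3g}")
```

The choice of `background_size` is exactly the "What should be the background?" question: changing it changes every p-value.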
Network clustering and enrichment
 ● As we’ve discussed
   before, closely connected
   nodes often work together
   in shared tasks
 ● To find out what these are
   in an unsupervised way we
   can run enrichment
   analyses on them (BiNGO)
Figure (https://mr.schochastics.net/): network module labelled Cell cycle
   Visualising enrichment results
Dotplot: top n enriched biological functions in annotation database
Treemap plot: all significant biological process mapped by the GO tree.
Network propagation
 ● Network propagation
   methods can highlight
   nodes of interest in
   addition to already known
   ones
 ● Example algorithms:
   PageRank, HotNet2
                               Hristov et al 2020
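A random walk with restart (a personalised-PageRank-style propagation) is one simple way to see how scores spread from known nodes to their neighbourhood. This sketch is a generic illustration and not the specific algorithm of HotNet2 or any cited paper; the chain network, the seed node and the restart probability of 0.3 are assumptions.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=100):
    """Propagate score from seed nodes over an undirected network."""
    nodes = list(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(start)
    for _ in range(iterations):
        # With probability `restart` the walker jumps back to a seed node.
        new = {n: restart * start[n] for n in nodes}
        for n in nodes:
            if adjacency[n]:
                # Otherwise it moves to a uniformly chosen neighbour.
                share = (1 - restart) * score[n] / len(adjacency[n])
                for nb in adjacency[n]:
                    new[nb] += share
            else:
                new[n] += (1 - restart) * score[n]  # isolated nodes keep their score
        score = new
    return score

# Chain a - b - c - d - e with a single known node of interest, a.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}
scores = random_walk_with_restart(chain, seeds={"a"})

for node in sorted(scores, key=scores.get, reverse=True):
    print(node, round(scores[node], 3))
```

Nodes close to the seed end up with higher scores than distant ones, which is how propagation nominates candidates next to already known nodes.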
Footprint based methods
 ● Footprint methods can
   estimate activity of kinases
   or transcription factors
   using prior knowledge, eg.
   highlighting the most
   active TFs in a given RNA-
   Seq readout
 ● Examples: VIPER,
   decoupleR                      Vandereyken et al 2018
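The footprint idea can be illustrated with a signed mean over a transcription factor's targets: targets it activates should go up and targets it represses should go down when the TF is active. This is a deliberately simplified sketch, not the algorithm implemented by VIPER or decoupleR; the regulons, gene names and fold changes are hypothetical.

```python
# Hypothetical prior knowledge: TF -> {target gene: +1 activates, -1 represses}.
regulons = {
    "TF_A": {"gene1": +1, "gene2": +1, "gene3": -1},
    "TF_B": {"gene3": +1, "gene4": +1},
}

# Hypothetical log fold changes from an RNA-Seq comparison.
log_fold_change = {"gene1": 2.0, "gene2": 1.0, "gene3": -3.0, "gene4": 0.5}

def tf_activity(regulons, changes):
    """Signed mean of target changes; targets without a measurement are skipped."""
    scores = {}
    for tf, targets in regulons.items():
        measured = [(sign, changes[gene]) for gene, sign in targets.items()
                    if gene in changes]
        if measured:
            scores[tf] = sum(sign * value for sign, value in measured) / len(measured)
    return scores

activity = tf_activity(regulons, log_fold_change)
print(activity)  # {'TF_A': 2.0, 'TF_B': -1.25}
```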
                        Dangers and pitfalls
Figure: @randomgraphs
Study bias
   •   Biological networks almost always have to deal with a huge
       study bias
   •   Some proteins / interactions will be more studied, for example
       because they are more relevant for health
   •   As such, they will show up more frequently in our network
       resources
Trivial predictions
    •   Trivial predictions can dominate your results, especially if
        you are familiar with the biological domain (this is also a
        result of study bias)
    •   As such, you will probably make a long list of predictions
        (that will be correct) that you have to go through to find novel
        information
Path fragility
    •   Biological networks can simultaneously contain many false
        negative and false positive hits
    •   Depending on your experimental method, studied organism
        etc. the coverage of your interaction networks can vary quite
        a lot
    •   Therefore using path based metrics such as shortest paths
        or betweenness centrality can be misleading, as even the
        addition / loss of one edge can shift these results a lot
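A toy example of this fragility: in a ring of six nodes, losing a single edge (for instance a false negative in the interaction data) changes the distance between its two endpoints from 1 to 5. The ring network and the helper name are assumptions made for illustration.

```python
from collections import deque

def distance(edges, source, target):
    """Shortest-path length between two nodes, or None if they are disconnected."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nb in adjacency.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return None

# Ring a - b - c - d - e - f - a.
ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "a")]

before = distance(ring, "a", "f")
after = distance([edge for edge in ring if edge != ("f", "a")], "a", "f")
print(before)  # 1
print(after)   # 5
```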
ID mapping in Cytoscape (and in general)
        Dezso Modos
General Biological Genome Databases
Name of the database | Website | Based on what? | Species | Example IDs
Uniprot | www.uniprot.org/ | Proteins; you can find the protein isoforms | Multiple (from bacteria to human) | P00533, P00533-1
Ensembl | https://www.ensembl.org/ | Genome based: genes / transcripts / proteins | Vertebrates | ENSG00000146648, ENSP00000275493, ENST00000455089.5
NCBI Gene | https://www.ncbi.nlm.nih.gov/gene/ | Multiple genomes | Multiple | 1956
NCBI RefSeq | www.ncbi.nlm.nih.gov/refseq/ | Reference sequences | Multiple | NM_001346897.2, NP_001333826.1
Interaction/pathway databases
Database name | Website | Content | Species | Example IDs
OmniPath | omnipath.org | A collection of multiple protein-protein interaction databases | Human and mouse mostly, but continuously updated | P00533
STRING | https://string-db.org/ | Integrated protein-protein interaction database | Many different species, mostly based on orthology | 9606.ENSP00000275493
Reactome | https://reactome.org/ | Reaction based pathway database | Multiple species | R-HSA-179837
IntAct | https://www.ebi.ac.uk/intact/ | Curated protein-protein and other molecule interactions | Multiple species including viruses | P00533
SignaLink | http://signalink.org/ | Multilayer signalling database containing proteins, miRNAs, TF-TG interactions | H. sapiens, D. rerio, C. elegans, D. melanogaster | P00533
Many to many mapping - the biological rationale
Figure: a gene gives rise to Transcript 1 and Transcript 2, which give Protein 1 and Protein 2
    Gene level: genomic analysis (GWAS)
    Transcript level: probes from microarrays, RNA-seq read mapping
    Protein level: proteomic measurements
    Interaction measurements: yeast two hybrid, affinity purification
Many to many mapping - mapping table
Data:
    Gene/transcript ID | Value
    ID1 | 1
    ID2 | 1.5
    ID3 | 2.5
    ID4 | 1
Mapping table:
    Gene/Transcript ID | Protein ID
    ID1 | ID_A_1
    ID2 | ID_A_2
    ID3 | ID_A_3
    ID4 | ID_A_4
    ID4 | ID_A_5
Network:
    Protein 1 | Protein 2
    ID_A_1 | ID_A_2
    ID_A_2 | ID_A_3
    ID_A_1 | ID_A_4
    ID_A_3 | ID_A_4
    ID_A_3 | ID_A_5
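The three tables on this slide can be joined in a few lines. This sketch copies the measured values onto every protein ID that a gene/transcript ID maps to; averaging when several measured IDs map to the same protein is an assumption made here, and other choices (maximum, sum) may suit your data better.

```python
# Values per gene/transcript ID, the mapping table and the network from the slide.
values = {"ID1": 1.0, "ID2": 1.5, "ID3": 2.5, "ID4": 1.0}
mapping = [("ID1", "ID_A_1"), ("ID2", "ID_A_2"), ("ID3", "ID_A_3"),
           ("ID4", "ID_A_4"), ("ID4", "ID_A_5")]
network = [("ID_A_1", "ID_A_2"), ("ID_A_2", "ID_A_3"), ("ID_A_1", "ID_A_4"),
           ("ID_A_3", "ID_A_4"), ("ID_A_3", "ID_A_5")]

# Collect every measured value that maps to each protein ID.
collected = {}
for gene_id, protein_id in mapping:
    if gene_id in values:
        collected.setdefault(protein_id, []).append(values[gene_id])

# Assumed aggregation: mean of all values mapped to the same protein.
node_values = {protein: sum(v) / len(v) for protein, v in collected.items()}

# Annotate each interaction with the values of both partners (None if unmapped).
annotated_network = [
    (p1, p2, node_values.get(p1), node_values.get(p2)) for p1, p2 in network
]

print(node_values["ID_A_4"], node_values["ID_A_5"])  # 1.0 1.0
print(annotated_network[-1])  # ('ID_A_3', 'ID_A_5', 2.5, 1.0)
```

Note that the single measurement for ID4 ends up on two network nodes, ID_A_4 and ID_A_5, which is the one-to-many case shown in the mapping table.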
Source of mapping tables
Ensembl BioMart - note you can access it programmatically through R -
https://www.ensembl.org/biomart/martview/
https://bioconductor.org/packages/release/bioc/html/biomaRt.html
Uniprot - https://www.uniprot.org/uploadlists/
ID mapping: what should I do during my project?
Know how you map, and document what you have done (e.g. how many IDs were
mapped to how many)
Try to map as many IDs as possible - check the unmapped IDs
Use only the “sure” IDs - Uniprot Swiss-Prot
Do not use gene names - the septins and androgen receptors will haunt you
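Documenting a mapping can be as simple as counting what was mapped, what was lost and what mapped to several targets. The function name `mapping_report` and the query ID "ID9" are hypothetical; the mapping pairs reuse the placeholder IDs from the mapping table slide.

```python
def mapping_report(query_ids, mapping_pairs):
    """Summarise how a list of IDs fares against a (source, target) mapping table."""
    targets = {}
    for source_id, target_id in mapping_pairs:
        targets.setdefault(source_id, set()).add(target_id)
    mapped = [q for q in query_ids if q in targets]
    unmapped = [q for q in query_ids if q not in targets]
    one_to_many = {q: sorted(targets[q]) for q in mapped if len(targets[q]) > 1}
    return {
        "n_query": len(query_ids),
        "n_mapped": len(mapped),
        "unmapped": unmapped,        # check these by hand
        "one_to_many": one_to_many,  # decide explicitly how to handle these
    }

mapping = [("ID1", "ID_A_1"), ("ID2", "ID_A_2"), ("ID3", "ID_A_3"),
           ("ID4", "ID_A_4"), ("ID4", "ID_A_5")]
report = mapping_report(["ID1", "ID2", "ID3", "ID4", "ID9"], mapping)
print(report["n_mapped"], "of", report["n_query"], "IDs mapped")  # 4 of 5 IDs mapped
print(report["unmapped"])     # ['ID9']
print(report["one_to_many"])  # {'ID4': ['ID_A_4', 'ID_A_5']}
```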
How can we show this in Cytoscape?
Step 1 - load in the network
Step 2 - the mapping table
Step 3 - the data
Real world example
You downloaded the interactions related to the EGFR protein from Reactome
through Cytoscape
You got the RNA-seq data from your bioinformatician, using HGNC gene names
Downloading data through PSICQUIC
   Oh wait… we lost around 10% of our IDs!
Why do the identifiers not map?
Common reasons:
 ● Different versions of the databases
 ● It is a pseudogene and is not in the
   protein/transcript based database
 ● It is not in the database, or there is a typo in the ID
 ● Wrong species: the interaction is in rats
   while we use humans
Let’s see a few examples: Example 1 - wrong ID in the Reactome
database
 One of the missing ones is KRAS - with a strange Uniprot ID
 If we search for that ID we get back the normal Uniprot ID of KRAS: P01116
 In such a case I suggest making a separate column for our Uniprot IDs (maybe we will
 need them for further work) and adding KRAS to the HGNC column manually
Example 2: not surely expressed protein - TrEMBL ID
This gene does not have a name in
Reactome - the Uniprot ID is in the
Human Readable Label column
Uniprot says it is an unreviewed protein
and only exists at transcript level
We can exclude it from our analysis
Example 3 - wrong Uniprot ID
You need to check the genes “manually”, basically building up your own mapping table
But this one does not have any Uniprot entry for this ID
Search for the gene name and use Uniprot mapping to map to the HGNC gene name
Hands on the computer
1. Download the interactome of colorectal cancer associated proteins
   (FAM123B, ARID1A, ATM, CTNNB1, GNAS, PTEN, DNMT3A, NF1, MLL2,
   PIK3R1)
2. Download a mapping table
3. Map transcriptomic data of p53 mutant and wild type adenocarcinoma
4. See the difference