Introduction To Networks and Pathways

This document provides an introduction to networks and pathways in the context of multi-omics data integration and visualization. It covers the fundamental concepts of network biology, including nodes, edges, network characteristics, and various types of networks, as well as methods for constructing and analyzing these networks. Additionally, it discusses tools for visualization and analysis, the importance of standardization, and potential biases in biological networks.

Introduction to networks and pathways

Introduction to multi-omics data integration and visualisation 2024

Satyadhyan Chickerur, PhD

KLE Technological University

Ref: Marton Olbei & Dezso Modos & Tamas Korcsmaros


A brief introduction to networks
Figure: @randomgraphs
Behind every complex system (composed of many interconnected parts, with a complicated arrangement) lies a network:

● Networks encoding the interactions between genes, proteins, etc.: the cellular network
● The wiring diagram capturing the interactions between neurons: the neural network
● The interactions of family, friends and colleagues: a social network
● Communication networks, describing how communication devices interact with each other
● Power grids, trade networks and many, many more.

Figure: @randomgraphs
This is a node

In network biology nodes can represent various components:
Genes
Proteins
RNAs
and other biological entities
This is an edge

Edges represent the interactions occurring between our nodes (e.g. between node 1 and node 2)

They can be:
Directed or undirected
Weighted or unweighted
Stimulatory or inhibitory
Networks are collections of nodes and edges

In network biology we use them to:

Visualize our biological entities


See overrepresented functions and their interactions
Make predictions about potential interactions
Characteristics of a network

Figure: @randomgraphs
Degree
The degree of a node represents the number of links it has to other nodes, e.g. number of citations on a paper in a research graph, how many friends you have.

Example (undirected, nodes a, b, c): Ka = 1, Kb = 2, Kc = 1

As networks can be directed or undirected, for the former we additionally differentiate between in-degree and out-degree.

Example (directed, nodes a, b, c):
Ka(out) = 1, Kb(out) = 0, Kc(out) = 1
Ka(in) = 0, Kb(in) = 2, Kc(in) = 0
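The degree counts above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are made up for illustration, and the edge lists assume the chain a - b - c for the undirected case and a → b ← c for the directed case, which is one reading of the slide figure.

```python
from collections import Counter

def degrees(edges):
    """Undirected degree: count how many edges touch each node."""
    k = Counter()
    for u, v in edges:
        k[u] += 1
        k[v] += 1
    return dict(k)

def in_out_degrees(edges):
    """Directed degrees: each edge (u, v) points from u to v."""
    k_in, k_out = Counter(), Counter()
    for u, v in edges:
        k_out[u] += 1
        k_in[v] += 1
    nodes = {n for edge in edges for n in edge}
    return ({n: k_in[n] for n in nodes}, {n: k_out[n] for n in nodes})

# Assumed example networks (hypothetical reading of the figure)
undirected = [("a", "b"), ("b", "c")]
directed = [("a", "b"), ("c", "b")]

k = degrees(undirected)
k_in, k_out = in_out_degrees(directed)
```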
Hubs
Hubs are the highest degree nodes in a network.
They are often very important biologically and their deletion can be lethal.
Can anyone think of an example?
Figure: node e (degree 4) linked to nodes a, b, c and d (degree 1 each)
Paths
A path is a sequence of nodes in which each node is adjacent to the next one
Example paths (nodes a to e): a→b→c→d and a→b→c→e
The distance between two nodes is defined as the number of edges along the shortest path connecting them
The shortest path is the path with the shortest length (distance) between two nodes.
Diameter: the longest of all the calculated shortest paths in a network
Characteristic path length: the average length of the shortest paths in a network
Figure (network science book): two paths from a to e in a network of nodes a to f, one of 3 steps and one of 4 steps
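Distance, diameter and characteristic path length can all be computed with a breadth-first search from every node. The sketch below is illustrative only: the helper names are hypothetical, the network is the small a-b-c-d/e example from the path slides, and it assumes an unweighted, undirected, connected network.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_distances(adj, source):
    """Distance (number of edges on a shortest path) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")]
adj = build_adjacency(edges)
all_dist = {n: bfs_distances(adj, n) for n in adj}

# One entry per unordered node pair (assumes the network is connected)
pair_lengths = [all_dist[u][v] for u in adj for v in adj if u < v]
diameter = max(pair_lengths)
char_path_length = sum(pair_lengths) / len(pair_lengths)
```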
Local network structures

Graphlet: A local unique (non-isomorphic) structure of a network

Network motif: An overrepresented local structure of a network (a common graphlet)
Example network motifs
Feed forward, Feedback loop, Three chain (each drawn on three nodes a, b, c)

Milo, et al (2002). Science


Connectedness
A graph is connected if any two nodes can be joined by a path.
We refer to the largest connected component as the giant component and the rest as isolates.

Bridges connect the various components - if you erase them the graph becomes disconnected

Figure: network science book
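Connected components and bridges can be found with a simple graph traversal. The following is a rough sketch with made-up node names and hypothetical helper functions; the bridge search simply removes each edge in turn and checks whether the number of components grows, which is fine for small networks but slow for large ones.

```python
def components(nodes, edges):
    """Return the connected components as a list of node sets."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def bridges(nodes, edges):
    """Edges whose removal increases the number of components (brute force)."""
    base = len(components(nodes, edges))
    return [e for e in edges
            if len(components(nodes, [x for x in edges if x != e])) > base]

# Made-up example: a triangle a-b-c with a tail c-d-e, plus the isolate f
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]

comps = components(nodes, edges)
giant = max(comps, key=len)
bridge_edges = bridges(nodes, edges)
```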


Clustering

Clustering coefficient: what fraction of your neighbors are connected?

Watts & Strogatz, Nature 1998.

Examples for node i:
● 10 edges total, 6 between neighbours: Ci = 1
● 7 edges total, 3 between neighbours: Ci = 0.5

Figure: network science book


Calculating the clustering coefficient
Clustering coefficient: what fraction of your neighbors are connected (value between 0 and 1)?
Watts & Strogatz, Nature 1998.

ei = edges between the neighbours of node i
ki = degree of node i

C_i = \frac{2 e_i}{k_i (k_i - 1)}

Example (7 edges total, 3 between neighbours, k_i = 4):
C_i = \frac{2 \cdot 3}{4 \cdot (4 - 1)} = \frac{6}{12} = 0.5
Figure: network science book
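The same calculation can be written as a short function. This is a sketch under assumptions: the function name is hypothetical and the edge list is one possible network with 7 edges in total, 4 of them attached to node i and 3 between its neighbours, chosen to match the worked example.

```python
from itertools import combinations

def clustering_coefficient(adj, i):
    """C_i = 2 * e_i / (k_i * (k_i - 1)); defined as 0 when node i has fewer than 2 neighbours."""
    neighbours = adj[i]
    k_i = len(neighbours)
    if k_i < 2:
        return 0.0
    # e_i: number of edges between pairs of neighbours of i
    e_i = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2 * e_i / (k_i * (k_i - 1))

# Assumed example: node i with 4 neighbours, 3 edges among them
edges = [("i", "a"), ("i", "b"), ("i", "c"), ("i", "d"),
         ("a", "b"), ("b", "c"), ("c", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

c_i = clustering_coefficient(adj, "i")
c_a = clustering_coefficient(adj, "a")
```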
Betweenness centrality

How much influence does my node have over the information flow in the network?

Betweenness: the fraction of shortest paths which go through a given node

Figure: network science book
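One way to turn this definition into code is to count, for every pair of other nodes, which share of their shortest paths passes through the node, and then average over all pairs. The sketch below is a simple illustration for small undirected, unweighted networks (function names are hypothetical); established libraries use the faster Brandes algorithm instead.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, source):
    """Distances from source and the number of shortest paths to each reachable node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj):
    """Average fraction of shortest paths between other node pairs that pass through each node."""
    n = len(adj)
    info = {node: bfs_counts(adj, node) for node in adj}
    score = {node: 0.0 for node in adj}
    for s, t in combinations(sorted(adj), 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue  # s and t are in different components
        for v in adj:
            if v == s or v == t or v not in dist_s:
                continue
            dist_v, sigma_v = info[v]
            # v lies on a shortest s-t path if the two legs add up to the s-t distance
            if dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    pairs = (n - 1) * (n - 2) / 2
    return {node: (total / pairs if pairs else 0.0) for node, total in score.items()}

# Star network: every shortest path between two leaves passes through the hub e
adj = {"e": {"a", "b", "c", "d"}, "a": {"e"}, "b": {"e"}, "c": {"e"}, "d": {"e"}}
bc = betweenness(adj)
```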


Networks vs Pathways
Bipartite networks

Bipartite networks are networks whose nodes can be divided into two distinct sets U and V in such a way that every link connects a node in U to one in V

https://www.nature.com/articles/srep00196
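Whether a network is bipartite can be checked by trying to colour it with two colours so that no edge joins two nodes of the same colour. This is a minimal sketch; the function name and the gene-disease example are invented for illustration.

```python
from collections import deque

def bipartite_sets(adj):
    """Return the two node sets (U, V) if the network is bipartite, otherwise None."""
    colour = {}
    for start in adj:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return None  # an edge inside one set: not bipartite
    U = {n for n, c in colour.items() if c == 0}
    V = {n for n, c in colour.items() if c == 1}
    return U, V

def to_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

# Hypothetical gene-disease network: links only run between genes and diseases
gene_disease = to_adjacency([("gene1", "diseaseA"), ("gene2", "diseaseA"), ("gene2", "diseaseB")])
triangle = to_adjacency([("a", "b"), ("b", "c"), ("a", "c")])

sets = bipartite_sets(gene_disease)
not_bipartite = bipartite_sets(triangle)
```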
Multilayered networks

Multilayered networks
contain multiple types
of edges, changing the
context of the
interaction.
Representing networks
One of the most commonly used representations, due to its ease of use and the ability to use it directly for measurements, is the adjacency matrix.

For an undirected network, if the position A12 equals one, that means there exists an interaction between nodes 1 and 2.

For directed networks the direction of the interaction carries information on its own: 1 → 3 is not the same as 1 ← 3, and their adjacency matrices behave accordingly.

Figure: network science book
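As a small illustration, an adjacency matrix can be built from a list of edges as follows. The function name is hypothetical, and the sketch assumes the convention that row = source and column = target for directed edges; some textbooks use the transposed convention, so check which one your tool expects.

```python
def adjacency_matrix(nodes, edges, directed=False):
    """Build a 0/1 adjacency matrix; A[i][j] = 1 means an edge from nodes[i] to nodes[j]."""
    index = {n: i for i, n in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        A[index[u]][index[v]] = 1
        if not directed:
            A[index[v]][index[u]] = 1  # undirected matrices are symmetric
    return A

nodes = [1, 2, 3]
edges = [(1, 2), (1, 3)]

A_undirected = adjacency_matrix(nodes, edges)
A_directed = adjacency_matrix(nodes, edges, directed=True)

# In the undirected case the row sums are the node degrees
degree = [sum(row) for row in A_undirected]
```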


Representing networks

Another very commonly used format is the edge list.

The edge list is more human readable, and is easy to write up on the fly.

network.csv:
a,e
b,e
c,e
d,e

The example uses a .csv format, but different software can of course accept different delimiters.
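Reading such an edge list takes only a few lines with Python's standard csv module. This is a sketch: the function name is made up, the file content is inlined as a string instead of reading network.csv from disk, and empty trailing fields (as in "a,e,") are simply ignored.

```python
import csv
import io

def read_edge_list(handle, delimiter=","):
    """Read (source, target) pairs from a delimited edge list, skipping empty fields."""
    edges = []
    for row in csv.reader(handle, delimiter=delimiter):
        fields = [field.strip() for field in row if field.strip()]
        if len(fields) >= 2:
            edges.append((fields[0], fields[1]))
    return edges

# Inline stand-in for the network.csv example
text = "a,e,\nb,e,\nc,e,\nd,e\n"
edges = read_edge_list(io.StringIO(text))

# Other delimiters work the same way, e.g. tab-separated files
tsv_edges = read_edge_list(io.StringIO("a\te\nb\te\n"), delimiter="\t")
```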
Representing networks

Standards are very important to every field, and network biology is no different.

One attempt to standardize the efforts of the molecular interaction field is the PSI-MI TAB format, which uses a controlled vocabulary to describe interactions.

https://academic.oup.com/bioinformatics/article/35/19/3779/5355053
Where do networks come from?

Figure: @randomgraphs
Network databases
Correlation/similarity networks:

You can generate them from any omics experiment (transcriptomics, proteomics etc.) and build them through various metrics, e.g. correlation.

Pros:
● Condition specific network
Cons:
● Undirected
Tools: WGCNA
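The basic idea of a correlation network can be sketched as follows: compute a correlation for every pair of genes and keep the pairs above a cut-off as edges. This is a simplified illustration and not the WGCNA method; the expression values, gene names, function names and the 0.9 threshold are all made up.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_network(profiles, threshold=0.9):
    """Undirected edges (gene1, gene2, r) for pairs with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges

# Made-up expression profiles across four samples
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.0, 6.0, 8.1],
    "gene3": [4.0, 1.0, 3.0, 2.0],
}
edges = correlation_network(profiles)
```

Note that the resulting edges have no direction, which matches the "Undirected" con listed above.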
Physical interactions

● “Strict” definition: captures the physical interactions of proteins
● “Loose” definition: captures the functionally interacting proteins (e.g. includes members of a complex)
● Main experimental methods: Y2H, AP-MS, fragment complementation, BioID
Large scale protein interaction networks - Y2H
Pros:

● High throughput
● Well established technique

Cons:

● Cannot detect interactions of proteins that do not go into the cell nucleus
● Bait-prey interactions have to work
● Context dependent

Example databases: HuRI


Common methods for creating large scale protein interaction networks - manual curation
Pros:

● Usually more accurate

Cons:

● Labour intensive
● Publication/curator bias

Example datasets: SignaLink, Reactome, WikiPathways
Figure from: https://phdcomics.com/
Common methods for creating large scale protein interaction networks - text mining
Pros:

● Searches the whole literature

Cons:

● Potential bias from the algorithm

Examples: STRING
Common methods for creating large scale protein interaction networks - secondary databases
Pros:

● Database of databases
● Many interactions

Cons:

● Bias of all the various datasets
● Need to know what the aim of your research is

E.g. OmniPath, STRING, ConsensusPathDB
Figure: Türei et al 2021
Available tools

Figure: @randomgraphs
Tools

Cytoscape
Easy to use, user friendly
Visualization, analysis and data sharing (NDEx)
Has its own ecosystem of apps, greatly extending its usability: https://apps.cytoscape.org/
Automation: RCy3, py4cytoscape

igraph
R / Python (and other languages) library for network manipulation and visualization
Powerful tool, intermediate difficulty
Great introduction: https://kateto.net/network-visualization
ggraph / tidygraph
Two R libraries made to help users manipulate
and visualize network data using the tidyverse
conventions (i.e. ggplot, dplyr)

Tidygraph provides a manipulation and analysis grammar for network data, while ggraph offers a visualization grammar for network data

A bit easier to use than igraph, lacks some functions, but has other important ones, e.g. easy faceting based on attributes
Network analysis and discovery

Figure: @randomgraphs
● Let’s say you constructed, queried, curated the network you are
interested in
● Let’s say it is the differentially expressed proteins in a disease
using proteomics and their interaction partners in the SignaLink 3
network resource
● What now?
Network clustering

● One common way of adding value to your network is to cluster / modularise / partition it
● Module: A part of the network where the nodes are more connected with the nodes within the module than outside of the module
● There are many approaches to do so, including in Cytoscape (MCODE, clusterMaker)
Figure: https://mr.schochastics.net/
How to measure modularity?
Module: A part of the network where the nodes are more connected with the nodes within the module than outside of the module

Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

A_{ij}: edge weight between nodes i and j
k_i, k_j: weighted degree of node i and j
m: sum of all edge weights
c_i, c_j: community of node i and j
\delta(c_i, c_j): Kronecker delta which is 1 if c_i = c_j and 0 otherwise

Values of Q:
-½ - not modular network
0 - random network
1 - completely modular network
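The modularity of a given partition can be computed directly from a weighted edge list. The sketch below assumes the standard Newman modularity for an undirected network without self-loops (edge weight A_ij, weighted degrees k_i, total edge weight m, community labels c_i); the function name and the two-triangle example network are made up.

```python
def modularity(edges, community):
    """Modularity Q of a partition; edges are (u, v, weight), community maps node -> label."""
    m = sum(w for _, _, w in edges)
    k = {}
    for u, v, w in edges:
        k[u] = k.get(u, 0) + w
        k[v] = k.get(v, 0) + w
    # Sum of A_ij over ordered pairs inside a community: each such edge is counted twice
    within = sum(2 * w for u, v, w in edges if community[u] == community[v])
    # Sum of k_i * k_j over pairs inside a community = squared total degree per community
    total_degree = {}
    for node, deg in k.items():
        total_degree[community[node]] = total_degree.get(community[node], 0) + deg
    expected = sum(t * t for t in total_degree.values()) / (2 * m)
    return (within - expected) / (2 * m)

# Two triangles joined by a single edge, all weights 1
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
         ("c", "d", 1)]
two_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_module = {n: 0 for n in two_modules}

q_two = modularity(edges, two_modules)
q_one = modularity(edges, one_module)
```

Splitting the network into its two triangles gives a clearly positive Q, while putting every node into one community gives Q = 0.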
How to find communities, modules?
Girvan-Newman clustering (top-down):

1. Calculate the betweenness centrality of each edge.
2. Remove the edge(s) with the highest betweenness centrality.
3. Recalculate the betweenness.
4. Repeat until no edges remain.
The best partition is the one where the modularity is the highest.
Louvain clustering (bottom-up):
1. Assign each node to its own community.
2. For each node, calculate the modularity change of moving it into each neighbour's community.
3. Assign each node to the neighbour's community where the modularity change is the largest.
4. Merge each community into a single node and repeat from step 1.
When the modularity plateaus there can be multiple equally good partitions.
Gene ontology enrichment
● H0: Genes with a given function are as common in our network module as in the background
● H1: Genes with a given function are more common in our network module than in the background

● What should be the background?

● How to know what is more common?

What should be the background?
● Whole human genome?
○ It is not in the network!
○ We do not measure it in proteomics
● The whole network?
○ Not all IDs are mapped to
● The intersection of the whole network and the omics data
○ We do not find anything
Figure labels: Signalling Process, Intracellular communication, kinase, Enzyme
How to know what is more common?
● Statistical test
○ Fisher exact test
○ Hypergeometric test
○ Chi square test
● You run a statistical test for each Gene Ontology function! There can be more GO functions than genes!
● Use false discovery rate correction!
Figure labels: Signalling Process, Intracellular communication, kinase, Enzyme
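As an illustration of the two steps above, the sketch below computes a one-sided hypergeometric p-value for one GO function and applies the Benjamini-Hochberg procedure as one common false discovery rate correction. Function names, argument names and all numbers are invented for the example; real analyses normally rely on an established enrichment tool.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): module of n genes, k of them annotated; background of N genes, K annotated."""
    upper = min(n, K)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return hits / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up example: background of 10 genes, 3 annotated; module of 2 genes, 1 annotated
p_single = hypergeom_pvalue(k=1, n=2, K=3, N=10)

# Made-up p-values for four GO functions tested on the same module
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

The choice of N and K is exactly the "what should be the background?" question: changing the background changes every p-value.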
Network clustering and enrichment

● As we’ve discussed before, closely connected nodes often work together in shared tasks

● To find out what these are in an unsupervised way we can run enrichment analyses on them (BiNGO)
Figure (https://mr.schochastics.net/): example module labelled "Cell cycle"
Visualising enrichment results
Dotplot: Top n enriched biological functions
Treemap plot: All significant biological processes mapped by the GO annotation database tree.
Network propagation

● Network propagation
methods can highlight
nodes of interest in
addition to already known
ones
● Example algorithms:
PageRank, HotNet2

Hristov et al 2020
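A common form of network propagation is a random walk with restart (closely related to personalised PageRank): a score starts on the known nodes and spreads along the edges, while part of it keeps returning to the starting nodes. The sketch below is a plain illustration of that idea, not the HotNet2 algorithm; the function name, the restart probability of 0.3 and the example chain are assumptions, and every node is assumed to have at least one neighbour.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, iterations=100):
    """Propagate scores from the seed nodes over an undirected network."""
    nodes = sorted(adj)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(start)
    for _ in range(iterations):
        # A fraction of the score always returns to the seeds
        nxt = {n: restart * start[n] for n in nodes}
        # The rest is shared equally among the neighbours of each node
        for u in nodes:
            share = (1 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p

# Chain a - b - c - d with a as the only known (seed) node
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
scores = random_walk_with_restart(adj, seeds={"a"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

Nodes close to the seed end up with higher scores, which is how propagation highlights candidates next to already known nodes.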
Footprint based methods

● Footprint methods can estimate the activity of kinases or transcription factors using prior knowledge, e.g. highlighting the most active TFs in a given RNA-Seq readout

● Examples: VIPER, decoupleR
Vandereyken et al 2018
Dangers and pitfalls

Figure: @randomgraphs
Study bias

• Biological networks almost always have to deal with a huge study bias

• some proteins / interactions will be more studied, as they are more relevant for health for example

• as such they will show up more frequently in our network resources
Trivial predictions

• Trivial predictions can dominate your results, especially if you are familiar with the biological domain (this is also a result of study bias)

• As such, you will probably make a long list of predictions (that will be correct) that you have to go through to find novel information
Path fragility

• Biological networks can simultaneously have many false negative and false positive hits

• Depending on your experimental method, studied organism etc., the coverage of your interaction networks can vary quite a lot

• Therefore using path based metrics such as shortest paths or betweenness centrality can be misleading, as even the addition / loss of one edge can shift these results a lot
ID mapping in Cytoscape (and in general)
Dezso Modos
General Biological Genome Databases

Name of the database | Website | Based on what? | Species | Example IDs
Uniprot | www.uniprot.org/ | Proteins. You can find the protein isoforms | Multiple (from bacteria to human) | P00533, P00533-1
Ensembl | https://www.ensembl.org/ | Genome based: genes / transcripts / proteins | Vertebrates | ENSG00000146648, ENSP00000275493, ENST00000455089.5
NCBI Gene | https://www.ncbi.nlm.nih.gov/gene/ | Multiple genomes | Multiple | 1956
NCBI RefSeq | www.ncbi.nlm.nih.gov/refseq/ | Reference sequences | Multiple | NM_001346897.2, NP_001333826.1
Interaction/pathway databases

Database name | Website | Content | Species | Example IDs
OmniPath | omnipath.org | A collection of multiple protein-protein interaction databases | Human and mouse mostly, but continuously updated | P00533
STRING | https://string-db.org/ | Integrated protein-protein interaction database | Many different species, mostly based on orthology | 9606.ENSP00000275493
Reactome | https://reactome.org/ | Reaction based pathway database | Multiple species | R-HSA-179837
IntAct | https://www.ebi.ac.uk/intact/ | Curated protein-protein and other molecule interactions | Multiple species including viruses | P00533
SignaLink | http://signalink.org/ | Multilayer signalling database containing proteins, miRNAs, TF-TG interactions | H. sapiens, D. rerio, C. elegans, D. melanogaster | P00533
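Many to many mapping - the biological rationale

Figure: one gene can give rise to several transcripts (Transcript 1, Transcript 2) and proteins (Protein 1, Protein 2), and each level is measured differently:
● Gene: genomic analysis (GWAS)
● Transcript: probe from a microarray, RNA-seq read mapping
● Protein: proteomic measurements; interaction measurements (yeast two hybrid, affinity purification)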
Many to many mapping - mapping table

Data table:
Gene/transcript ID | Value
ID1 | 1
ID2 | 1.5
ID3 | 2.5
ID4 | 1

Mapping table:
Gene/Transcript ID | Protein ID
ID1 | ID_A_1
ID2 | ID_A_2
ID3 | ID_A_3
ID4 | ID_A_4
ID4 | ID_A_5

Interaction table:
Protein 1 | Protein 2
ID_A_1 | ID_A_2
ID_A_2 | ID_A_3
ID_A_1 | ID_A_4
ID_A_3 | ID_A_4
ID_A_3 | ID_A_5
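The mapping step on this slide can be sketched in plain Python: measured values are carried from gene/transcript IDs to protein IDs through the mapping table, and the result is compared with the nodes of the interaction table. The tables below follow one reading of the slide, the function name is hypothetical, and averaging the values when several IDs map to the same protein is an assumption, not a rule from the slides.

```python
from collections import defaultdict

# Data table: gene/transcript ID -> measured value
values = {"ID1": 1.0, "ID2": 1.5, "ID3": 2.5, "ID4": 1.0}

# Mapping table: one gene/transcript ID can map to several protein IDs (ID4 here)
mapping = [("ID1", "ID_A_1"), ("ID2", "ID_A_2"), ("ID3", "ID_A_3"),
           ("ID4", "ID_A_4"), ("ID4", "ID_A_5")]

# Interaction table: network edges between protein IDs
interactions = [("ID_A_1", "ID_A_2"), ("ID_A_2", "ID_A_3"), ("ID_A_1", "ID_A_4"),
                ("ID_A_3", "ID_A_4"), ("ID_A_3", "ID_A_5")]

def map_values(values, mapping):
    """Transfer values to protein IDs; average when several source IDs hit one protein."""
    by_protein = defaultdict(list)
    for gene_id, protein_id in mapping:
        if gene_id in values:
            by_protein[protein_id].append(values[gene_id])
    return {protein: sum(v) / len(v) for protein, v in by_protein.items()}

protein_values = map_values(values, mapping)

# Network nodes that did not receive any value after mapping
network_nodes = {node for edge in interactions for node in edge}
unmapped_nodes = sorted(network_nodes - set(protein_values))
```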


Source of mapping tables
Ensembl BioMart - note you can access it programmatically through R -
https://www.ensembl.org/biomart/martview/

https://bioconductor.org/packages/release/bioc/html/biomaRt.html

Uniprot - https://www.uniprot.org/uploadlists/
ID mapping: what should I do during my project?
Know how you map, and document what you have done (e.g. how many IDs were mapped to how many)

Try to map as many IDs as possible - check the unmapped IDs

Use only the “sure” IDs - Uniprot Swiss-Prot

Do not use gene names - the septins and androgen receptors will haunt you
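Documenting a mapping can be as simple as counting how many IDs were mapped, which ones map to more than one target, and which ones were lost. The sketch below shows one possible report; the function name, the report fields and the IDs are invented for illustration.

```python
from collections import defaultdict

def mapping_report(query_ids, mapping_pairs):
    """Summarise an ID mapping: how many IDs mapped, which are one-to-many, which are unmapped."""
    targets = defaultdict(set)
    for source_id, target_id in mapping_pairs:
        targets[source_id].add(target_id)
    queries = sorted(set(query_ids))
    mapped = [q for q in queries if q in targets]
    return {
        "queried": len(queries),
        "mapped": len(mapped),
        "one_to_many": [q for q in mapped if len(targets[q]) > 1],
        "unmapped": [q for q in queries if q not in targets],
    }

# Hypothetical mapping table and query IDs
pairs = [("ID1", "ID_A_1"), ("ID2", "ID_A_2"), ("ID4", "ID_A_4"), ("ID4", "ID_A_5")]
report = mapping_report(["ID1", "ID2", "ID4", "ID9"], pairs)
```

Keeping such a report next to the results makes it easy to go back and check the unmapped IDs later.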
How can we show this in Cytoscape?
Step 1 load in the network
Step 2 - the mapping table
Step 3 - the data
Real world example
You downloaded the interactions related to the EGFR protein from Reactome through Cytoscape

You got RNA-seq data using HGNC gene names from your bioinformatician
Downloading data through PSICQUIC
Oh wait… we lost around 10% of our IDs!
Why do the identifiers not map?

Common reasons:

● Different versions between the databases

● It is a pseudogene and it is not in the protein/transcript based database

● It is not in the database, or there is a typo in the ID

● Wrong species: the interaction is in rats while we use humans
Let’s see a few examples: Example 1 - wrong ID in the Reactome database

One of the missing ones is KRAS - with a strange Uniprot ID

If we search for that ID we get back the normal Uniprot ID of KRAS: P01116

In such a case I suggest making a separate column for our Uniprot IDs (maybe we will need them for further work) and adding KRAS to the HGNC column manually
Example 2: Not surely expressed protein - TrEMBL ID

This gene does not have a name in Reactome - the Uniprot ID is in the Human Readable Label column

Uniprot says it is an unreviewed protein and only exists at transcript level

We can exclude it from our analysis


Example 3 - wrong Uniprot ID

You need to check the genes “manually”, basically building up your own mapping table

But this one does not have any Uniprot entry for this ID

Search for the gene name and use Uniprot mapping to map to the HGNC gene name
Hands on the computer
1. Download the interactome of colorectal cancer associated proteins (FAM123B, ARID1A, ATM, CTNNB1, GNAS, PTEN, DNMT3A, NF1, MLL2, PIK3R1)

2. Download a mapping table

3. Map transcriptomic data of p53 mutant and wild type adenocarcinoma

4. See the difference
