AI in Drug Discovery & Rare Diseases
INTRODUCTION
Drug discovery is the process through which potential new therapeutic entities are identified,
using a combination of computational, experimental, translational, and clinical models. Drug
design is the inventive process of finding new medications based on the knowledge of a
biological target.
AI has emerged as a powerful tool that harnesses human-like knowledge and provides
expedited solutions to complex challenges. Remarkable advancements in AI technology and
machine learning present a transformative opportunity in the discovery, formulation, and
testing of pharmaceutical dosage forms. By utilizing AI algorithms that analyse extensive
biological data, including genomics and proteomics, researchers can identify disease-
associated targets and predict their interactions with potential drug candidates. This enables a
more efficient and targeted approach to drug discovery, thereby increasing the likelihood of
successful drug approvals. The pharmaceutical industry is a critical field that plays a vital role
in saving lives. It operates based on continuous innovation and the adoption of new
technologies to address global healthcare challenges and respond to medical emergencies.
Table 1. List of AI tools used in the drug discovery and development process
RARE DISEASES: A rare disease is a disease that affects a small percentage of the
population. The term orphan disease describes a rare disease whose rarity results in little or no
funding or research for treatments, without financial incentives from governments or other
agencies. Orphan drugs are medications targeting orphan diseases.
Gene network techniques hold the promise of providing a conceptual framework for analysing
the profusion of biological data being generated on potential drug targets and of providing
insights into the biological regulatory mechanisms of diseases; they have been playing an
increasingly important role in the search for novel drug targets. A gene
regulatory network (GRN) describes biological interactions among genes and provides a
systematic understanding of cellular signaling and regulatory processes. It depicts how a set of
genes interact with each other to form a functional module and how different gene modules are
related. A typical GRN approximates a scale-free network topology with a few highly
connected genes (i.e. hub genes) and many poorly connected nodes. These hub genes are master
regulators in a gene network, and usually play essential roles in a biological system.
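The hub-gene idea described above can be illustrated with a minimal sketch. The edge list, the gene names (G1 to G8) and the degree cut-off are hypothetical and chosen only for illustration; a real GRN would be inferred from expression or interaction data, and hub detection would normally use a network library rather than this hand-rolled count.

```python
from collections import Counter

# Hypothetical regulatory edges (regulator, target); gene names are illustrative only.
edges = [
    ("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G1", "G5"),
    ("G2", "G6"), ("G3", "G7"), ("G5", "G8"),
]

def node_degrees(edge_list):
    """Count how many edges touch each gene (in-degree plus out-degree)."""
    degree = Counter()
    for regulator, target in edge_list:
        degree[regulator] += 1
        degree[target] += 1
    return degree

def hub_genes(edge_list, min_degree):
    """Return genes whose degree is at least min_degree, sorted by name."""
    degree = node_degrees(edge_list)
    return sorted(gene for gene, d in degree.items() if d >= min_degree)

degrees = node_degrees(edges)
hubs = hub_genes(edges, min_degree=4)  # the cut-off of 4 is an arbitrary assumption
print(hubs)  # ['G1']
```

In this toy network a single gene carries most of the connections while the remaining genes have one or two links each, which is the pattern a scale-free topology is expected to show.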
Protein-protein interaction is the basis of drug target identification. Protein interaction maps
can reveal novel pathways and functional complexes, allowing ‘guilt by association’
annotation of uncharacterized proteins. Once the pathways are mapped, these need to be
analyzed and validated functionally in a biological model. It is possible that other proteins
operating in the same pathway as a known drug target could also represent appropriate
drug targets. Recent analyses of network properties of protein-protein interactions and of
metabolic maps have provided some insights into the structure of these networks. Identifying
protein-protein interactions can therefore provide insights into the function of important
genes, elucidate relevant pathways, and facilitate the identification of potential drug
targets.
Networks of protein-protein interactions are significant tools for studying cellular
processes, disease mechanisms, and the development of medications. Interpreting a
protein-protein interaction map is challenging because of the network's complexity.
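The 'guilt by association' reasoning above, in which proteins operating near a known drug target become candidate targets themselves, can be sketched as a neighbourhood search on an interaction map. The protein names and interactions below are placeholders, and the step limit is an assumed parameter; this is an illustration of the idea, not a validated prioritization method.

```python
from collections import defaultdict

# Hypothetical undirected protein-protein interactions; names are placeholders.
interactions = [
    ("P_target", "P_a"), ("P_target", "P_b"),
    ("P_a", "P_c"), ("P_d", "P_e"),
]

def build_graph(pairs):
    """Store each interaction in both directions."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def candidates_near(graph, known_target, max_steps):
    """Breadth-first search: proteins within max_steps interactions of known_target."""
    seen = {known_target}
    frontier = {known_target}
    for _ in range(max_steps):
        frontier = {n for p in frontier for n in graph[p]} - seen
        seen |= frontier
    return sorted(seen - {known_target})

graph = build_graph(interactions)
direct = candidates_near(graph, "P_target", max_steps=1)
extended = candidates_near(graph, "P_target", max_steps=2)
print(direct, extended)
```

Proteins that are not connected to the known target (here P_d and P_e) are never returned, which reflects the need for functional validation of the mapped pathway before any candidate is pursued.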
Monoclonal antibodies
06. Timothy syndrome | Heart, fingers & toes | CACNA1C gene | Beta blockers, insertion of pacemaker
07. CANDLE syndrome | Skin | PSMB8 gene | Oral corticosteroids, NSAIDs
In view of the importance of AI tools in the drug development process and the significance of
rare diseases, it was planned to identify the genes and proteins involved in two selected rare
diseases (Cystic fibrosis & Bolo disease) using freely available AI tools.
II. REVIEW OF LITERATURE
1. Target selection
2. Lead discovery
3. Structure optimization
4. In vitro studies
5. In vivo studies
Target selection in drug discovery is defined as the decision to focus on finding an agent with
a particular biological action that is anticipated to have therapeutic utility. This decision is
influenced by a complex balance of scientific, medical and strategic considerations.
• G-Protein coupled receptors -45%
• Enzymes -28%
• Hormones and factors -11%
• Ion channels -5%
• Nuclear receptors -2%
The molecular types of therapeutic targets include proteins, nucleic acids, and other molecules.
1. Cellular and genetic target: Involves the identification of the function of a potential
therapeutic drug target and its role in the disease process. For small-molecule drugs, this step
in the process involves identification of the target receptors or enzymes, whereas for some
biologic approaches the focus is at the gene or transcription level. Drugs usually act on either
cellular or genetic chemicals in the body, known as targets, which are believed to be associated
with disease. Scientists use a variety of techniques to identify and isolate individual targets to
learn more about their functions and how they influence disease. Compounds are then
identified that have various interactions with the drug targets that might be helpful in treatment
of a specific disease.[16]
2. Genomics: The study of genes and their function. Genomics aims to understand the
structure of the genome, including mapping genes and sequencing the DNA. It seeks to
exploit the findings from the sequencing of the human and other genomes to find new drug
targets. The human genome consists of a sequence of around 3 billion nucleotides (the A, C, G
and T bases), which was initially thought to encode 35,000-50,000 genes (later annotations put
the number of protein-coding genes closer to 20,000). Drews estimates that the number of
genes implicated in disease, both those due to defects in single genes and those arising from
combinations of genes, is about 1,000. Based on 5 or 10 linked proteins per gene, he proposes
that the number of potential drug targets may lie between 5,000 and 10,000.
Single nucleotide polymorphism (SNP) libraries are used to compare the genomes of
healthy and sick people and to identify where their genomes vary.
3. Proteomics: It is the study of the proteome, the complete set of proteins produced by a
species, using the technologies of large-scale protein separation and identification. It is
becoming increasingly evident that the complexity of biological systems lies at the level of the
proteins, and that genomics alone will not suffice to understand these systems. It is also at the
protein level that disease processes become manifest, and at which most (91%) drugs act.
Therefore, the analysis of proteins (including protein-protein, protein-nucleic acid, and
protein-ligand interactions) will be of utmost importance to target discovery. Proteomics is the
systematic high-throughput separation and characterization of proteins within biological
systems. Target identification with proteomics is performed by comparing the protein expression
levels in normal and diseased tissues. 2D PAGE is used to separate the proteins, which are
subsequently identified and fully characterized with LC-MS/MS.
4. Bioinformatics: Bioinformatics is a
branch of molecular biology that involves extensive analysis of biological data using
computers, for the purpose of enhancing biological research. It plays a key role in various
stages of the drug discovery process, including:
• Target identification
• Pharmacogenomics.
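The proteomic comparison of normal and diseased tissue described under Proteomics above is commonly summarized as a fold change per protein. The sketch below assumes that abundance values have already been quantified; the protein names, the values and the two-fold threshold are hypothetical, and a real analysis would add replicate statistics and multiple-testing correction.

```python
import math
from statistics import mean

# Hypothetical protein abundance values (arbitrary units) per sample.
normal = {"ProtA": [10.0, 12.0, 11.0], "ProtB": [5.0, 5.0, 5.0]}
diseased = {"ProtA": [44.0, 40.0, 48.0], "ProtB": [5.0, 6.0, 4.0]}

def log2_fold_change(normal_values, diseased_values):
    """log2 of the ratio of mean abundance in diseased versus normal tissue."""
    return math.log2(mean(diseased_values) / mean(normal_values))

def differential_proteins(normal_data, diseased_data, threshold=1.0):
    """Proteins whose absolute log2 fold change meets the threshold (1.0 = two-fold)."""
    result = {}
    for protein in normal_data:
        lfc = log2_fold_change(normal_data[protein], diseased_data[protein])
        if abs(lfc) >= threshold:
            result[protein] = lfc
    return result

changed = differential_proteins(normal, diseased)
print(changed)
```

Here ProtA is four times more abundant in the diseased samples and is flagged as a candidate, whereas ProtB is unchanged and is filtered out.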
2.3 LEAD DISCOVERY
A. Lead Identification: Between 5 and 50,000 compounds are examined in the laboratory, of
which only 100 to 200 are perfected in order to be tested on systems in vitro and in vivo. Once
the therapeutic target has been identified, scientists must then find one or more leads (e.g.,
chemical compounds or molecules) that interact with the therapeutic target so as to induce the
desired therapeutic effects, e.g., through antiviral or antibacterial activity. In order to discover
the compounds whose pharmacological properties are likely to have the required therapeutic
effects, researchers must test a large variety of them on one or more targets. First of all,
biologists ensure that the chosen compounds have the desired therapeutic or antiviral effects
on the target. Then, they test the compounds' relative toxicity or, in the case of vaccines, their
viral activity using in vitro cellular and/or tissue systems. Finally, they check their
bioavailability in vivo in animals.[17,18]
B. Lead Optimization: Duration: From 4 to 6 months. The 100 to 200 chosen compounds are
examined in the laboratory in order to perfect their physicochemical properties, their
pharmacokinetic behaviour and their therapeutic effectiveness. Around twenty (20) will be
selected for further testing. The purpose of this stage is to optimize the molecules or compounds
that demonstrate the potential to be transformed into drugs, retaining only a small number of
them for the next stages. To optimize these molecules, scientists use very advanced techniques.
For example, using X-ray crystallography and in silico (computer) modelling, they study how
the selected molecules bind to the therapeutic target, for example, a protein or an
enzyme. These data allow the medicinal chemists to modify the structure of the selected
molecules or compounds, if necessary, by screening, thereby creating structural analogues.
Medicinal chemists prepare and/or select appropriate compounds for biological evaluation that,
if found to be active, could serve as lead compounds. They then evaluate the structure–activity
relationships (SARs) of analogous compounds with regard to their in vitro and in vivo
efficacy and safety. Today, medicinal chemists who are engaged in drug discovery are part of
interdisciplinary teams, and must therefore understand not only the field of organic chemistry,
but also a range of other disciplines to anticipate problems and interpret developments to help
move the project forward. As highlighted in this article, the role of the medicinal chemist has
changed significantly in the past 25 years. In the early era (‘then’) of drug discovery (1950 to
about 1980), medicinal chemists relied primarily on data from in vivo testing. In the more
recent (‘now’) period (about 1980 to the present), the development of new technologies, such
as high-throughput in vitro screening, large compound libraries, combinatorial technology,
defined molecular targets and structure-based drug design, has changed that earlier and
relatively simple landscape. Although these new technologies present many opportunities to
the medicinal chemist, the multitude of new safety requirements that have arisen has also
brought unanticipated hurdles for the task of translating in vitro activity to in vivo activity.
Simultaneously, the knowledge base that supports drug research has expanded considerably,
increasing the challenge for chemists to understand their fields of expertise. The demonstration
of adequate clinical safety and efficacy in humans has also become more complex, and ever-
increasing amounts of data are now required by regulatory agencies. In fact, despite the use of
many new technologies, and the growing resources and funding for drug research, the number
of launches of new medicines in the form of new molecular entities (NMEs) has been generally
decreasing for more than a decade. Clearly, the difficulty and complexity of drug research has
increased in the past two decades. It is our aim with this article to discuss how these changes
have influenced the role of medicinal chemists and to suggest ways to help them to contribute
more effectively to the drug discovery process.
In vitro studies are an essential part of research geared toward the discovery of drug candidates.
Driven by the need for predictive information, in vitro techniques have been developed to study
many aspects of drug disposition. These include absorption, metabolic stability, elucidation of
elimination pathways, potential for inhibition of CYP450 enzymes, potential for induction of
CYP450 enzymes and metabolite profiling in various model species and humans. These studies
are not necessarily conducted following regulatory statute. On the other hand, many of these
in vitro experiments become essential support for IND submissions and post-IND filings from
both the preclinical and clinical arena. The FDA Good Laboratory Practice regulations state:
“This part prescribes good laboratory practices for
conducting nonclinical laboratory studies that support or are intended to support applications
for research or marketing permits for products regulated by the Food and Drug Administration.”
“Nonclinical laboratory study means in vivo or in vitro experiments in which test articles are
studied prospectively in test systems under laboratory conditions to determine their safety…
The term does not include basic exploratory studies carried out to determine whether a test
article has any potential utility or to determine physical or chemical characteristics of a test
article.”
2.6 IN VIVO STUDIES
In vivo studies, in contrast to in vitro studies, take place within a living organism. In preclinical
trials, this happens within animal subjects. In clinical trials, in vivo studies can use either
humans or animals as subjects. In vivo studies are able to address the major limitation of in
vitro studies: they can demonstrate the impact of a pharmaceutical on the body as a
whole, rather than on isolated cells. This allows in vivo studies to better capture
potential interactions, which can improve predictions of safety, toxicity, and efficacy. This
helps scientists predict the impact of candidate drugs on human disease. However, while in
vivo studies address the drawback of in vitro studies, they have their own major drawback. In
vivo studies face significant ethical concerns, particularly for preclinical studies where only
animal models are permitted. The debate over the ethics of animal testing has raged for decades.
Currently, the regulations and laws governing animal testing are tightening, and scientists
wishing to conduct preclinical in vivo studies with animals are required to
demonstrate that no alternative methodology can be used to conduct the experiment. They
are also required to demonstrate a sense of balance, that is, that the benefits of the study (gain
in knowledge) outweigh the drawbacks (suffering caused to the animals). Like in vitro studies, in
vivo studies are also undergoing a technological transformation. Emerging technologies such
as CRISPR are expected to make complex animal models simpler, cheaper, and
faster to produce. While in vivo studies face significant ethical considerations, it is likely that
they will remain a fundamental part of preclinical studies. The future is predicted to bring significant
advances in preclinical technologies, both in vitro and in vivo, that should facilitate the
gathering of more accurate data, as well as faster and simpler methodologies. It is expected that
these advances will improve the quality of preclinical data, as well as reduce the reliance on
traditional animal models.[19]
or evaluate clinical laboratory tests (e.g., imaging or molecular diagnostic tests) might be
considered to be a clinical trial if the test will be used for medical decision-making for the
subject or the test itself imposes more than minimal risk for subjects.
Phase I: clinical trials test a new biomedical intervention in a small group of people (e.g., 20-
80) for the first time to evaluate safety (e.g., to determine a safe dosage range, and to identify
side effects).
Phase II: clinical trials study the biomedical or behavioural intervention in a larger group of
people (several hundred) to determine efficacy and to further evaluate its safety.
Phase III: studies investigate the efficacy of the biomedical or behavioural intervention in
large groups of human subjects (from several hundred to several thousand) by comparing the
intervention to other standard or experimental interventions as well as to monitor adverse
effects, and to collect information that will allow the intervention to be used safely.
Phase IV: studies are conducted after the intervention has been marketed. These studies are
designed to monitor effectiveness of the approved intervention in the general population and
to collect information about any adverse effects associated with widespread use.[2]
➢ Disease modelling and target discovery are crucial initial steps in the drug discovery
process and significantly affect the success of drug development.
➢ An increasing number of AI-identified targets are being validated through experiments
and several AI-derived drugs are entering clinical trials.
➢ The drug discovery pipeline is widely recognized to be a time-consuming, expensive,
and risk-laden process that typically requires around 10 years and $2 billion to bring a
novel drug to market.
➢ Target identification, the process of identifying the right biological molecules or
cellular pathways that can be modulated by drugs to achieve therapeutic benefits, is
increasingly important in modern drug discovery.
➢ By 2022 fewer than 500 successful drug targets had been identified, representing a tiny
fraction of the estimated druggable targets in humans.
➢ Although numerous drug candidates undergo extensive optimization during preclinical
stages, the average failure rate in clinical trials from 2009 to 2018 reached 84.6%.
➢ The lack of clinical efficacy remains the key factor contributing to the failure of both
Phase 2 and 3 trials, leading to substantial financial losses and resource wastage.
➢ Identifying the right drug targets is crucial for increasing the likelihood of developing
clinically effective therapies.
➢ Recent advances in AI-driven biological analysis have identified novel targets, and
AI-designed drugs are now entering clinical trials.
Figure (lower panel): AI applications in the early stages of drug discovery.
➢ Target identification can be classified into three distinct strategies: experimental,
multiomic, and computational approaches. Using these methods together can
generate novel therapeutic hypotheses in exploratory target identification, thus
significantly enhancing our understanding of complex diseases.
2.9 EXPERIMENTAL APPROACHES
Multiomic data provide researchers with interconnected molecular information from different
perspectives, including static genomic data and spatiotemporally dynamic expression and
metabolic profiles. As the first established and most mature omics discipline, genomics focuses
on genetic variants in the DNA sequence.
➢ Large-scale genome-wide association study (GWAS) analysis powered by next-
generation sequencing has yielded hundreds of thousands of associations between
genetic variants and complex diseases or traits.
➢ Meta-analyses of published GWAS data have revealed novel genetic loci attributable
to different diseases, thus opening up drug repurposing opportunities.
➢ Transcriptomic and proteomic data can be used to identify causal genetic loci that
regulate gene and protein levels and facilitate the discovery of genes and pathways
underlying disease pathogenesis.[20] Likewise, epigenomic and metabolomic data can
also serve as functional evidence for GWAS-identified variants to support their disease
associations and clinical applications. As compared to single omic approaches,
integrated multiomic analysis can provide a more comprehensive view of disease
mechanisms and is therefore increasingly used to facilitate biomarker and therapeutic
target discoveries, treatment response, and patient prognosis predictions.
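The variant-disease associations reported by GWAS, mentioned in the points above, rest on per-variant statistical tests. As a minimal sketch, assuming a simple allelic case-control design, a single SNP can be tested with a 2x2 chi-square statistic (one degree of freedom). The allele counts are hypothetical; real studies test very many variants, apply a much stricter genome-wide significance threshold, and correct for population structure.

```python
import math

def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns the statistic and its p-value. For one degree of freedom the
    upper-tail probability equals erfc(sqrt(chi2 / 2)).
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical allele counts for one SNP in cases and controls.
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(round(chi2, 3), p_value)
```

With identical allele frequencies in cases and controls the statistic is zero and the p-value is one, so only variants whose frequencies differ between the groups stand out.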
Depending on the availability of protein structure and the chemical structure of the compound
of interest, pharmacophore screening, reverse docking, and structure similarity assessment
have been used to predict novel biological targets for small molecules. On the other hand, AI
is a growing discipline in computational science for target discovery. Machine learning is an
indispensable component of AI that can be applied either with or without supervision.
Supervised learning utilizes labelled datasets to train models for data classification and reliable
outcome prediction. By contrast, unsupervised learning explores the hidden structure of
unlabelled data without human intervention. The application of machine learning is not limited
to predicting biological targets of existing drugs or compounds; it can also identify novel
therapeutic targets for any disease of interest.
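The distinction between supervised and unsupervised learning drawn above can be made concrete with a toy example. The sketch assumes a single hypothetical numeric feature per sample (for instance a normalized expression score); the nearest-centroid classifier and the two-cluster k-means routine are stand-ins chosen for brevity, not the methods used in the cited studies.

```python
from statistics import mean

# Hypothetical 1-D feature per sample, with labels for the supervised case.
values = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]
labels = ["healthy", "healthy", "healthy", "disease", "disease", "disease"]

def fit_centroids(xs, ys):
    """Supervised: use the labels to compute one centroid per class."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}

def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

def two_means(xs, iterations=10):
    """Unsupervised: split unlabelled values into two clusters (k-means, k=2).

    Assumes the values are not all identical, so neither cluster is empty.
    """
    low, high = min(xs), max(xs)
    for _ in range(iterations):
        group_low = [x for x in xs if abs(x - low) <= abs(x - high)]
        group_high = [x for x in xs if abs(x - low) > abs(x - high)]
        low, high = mean(group_low), mean(group_high)
    return sorted(group_low), sorted(group_high)

centroids = fit_centroids(values, labels)
clusters = two_means(values)  # the labels are never used here
print(predict(centroids, 4.2), clusters)
```

The supervised model needs the labels to learn what "healthy" and "disease" look like, whereas the clustering routine recovers the same two groups from the values alone but cannot name them.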
information about altered signaling pathways, molecular interactions, and protein–
protein interactions that can serve as additional inputs for target prioritization.
➢ Large language models also aid therapeutic target discovery via rapid biomedical text
mining. Pretrained on a vast amount of text data extracted from millions of publications,
large language model-based chat functionalities, such as BioGPT from Microsoft and
ChatPandaGPT from Insilico Medicine, can connect diseases, genes, and biological
processes to allow rapid identification of the biological mechanisms involved in disease
development and progression, as well as the identification of potential drug targets and
biomarkers.
➢ The ability of large language models to understand natural language and interpret
complex scientific concepts could make them valuable tools in accelerating disease
hypothesis generation.[3]
'Synthetic data' refers to artificially generated data that mimic real-world patterns and
characteristics. By leveraging AI algorithms, synthetic data can be created to simulate various
biological scenarios, thus enabling researchers to explore and analyze a broader range of
possibilities.
➢ This approach can be particularly valuable in therapeutic areas where experimental data
are scarce or difficult to obtain. For example, in rare diseases or conditions where
patient data are limited, AI can generate synthetic data based on existing knowledge
and patterns.[22]
➢ These synthetic data can then be used to train AI models and identify potential
therapeutic targets that may have been overlooked. Synthetic data can also be used to
validate predictions made by AI algorithms, thus providing an additional layer of
confidence in the target discovery process.
➢ To responsibly validate and control the quality of synthetic omic data, several options
can be considered. Comparative analyses can be performed to assess the similarity between
the synthetic data and real-world data. This can involve statistical measures,
such as comparisons of distributional characteristics, correlation patterns, or
individual features.[4]
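The statistical comparisons listed above can be sketched as a small report that contrasts a real and a synthetic dataset feature by feature. The feature names and values are hypothetical, and the three measures used here (gap in mean, gap in standard deviation, gap in pairwise Pearson correlation) are only one assumed choice among many possible quality checks.

```python
import math
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / spread

def compare_datasets(real, synthetic, feature_a, feature_b):
    """Report how far the synthetic data drift from the real data."""
    report = {"mean_gap": {}, "stdev_gap": {}}
    for feature in real:
        report["mean_gap"][feature] = abs(mean(real[feature]) - mean(synthetic[feature]))
        report["stdev_gap"][feature] = abs(stdev(real[feature]) - stdev(synthetic[feature]))
    report["correlation_gap"] = abs(
        pearson(real[feature_a], real[feature_b])
        - pearson(synthetic[feature_a], synthetic[feature_b])
    )
    return report

# Hypothetical expression values for two features in each dataset.
real = {"gene_x": [1.0, 2.0, 3.0, 4.0, 5.0], "gene_y": [2.0, 4.0, 6.0, 8.0, 10.0]}
synthetic = {"gene_x": [1.5, 2.5, 3.5, 4.5, 5.5], "gene_y": [3.0, 5.0, 7.0, 9.0, 11.0]}

report = compare_datasets(real, synthetic, "gene_x", "gene_y")
print(report)
```

In this toy case the synthetic data preserve the spread and the correlation structure of the real data but are shifted in their means, which is the kind of discrepancy such a report is meant to expose.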
2.14 AI APPROACHES IN DRUG DISCOVERY
The cost and time required to develop novel therapeutic agents have been
another setback in the drug design and development process. To minimize these challenges and
hurdles, researchers around the globe moved toward computational approaches such as virtual
screening (VS) and molecular docking, which are also known as traditional approaches.
However, these techniques also impose challenges such as inaccuracy and inefficiency. Thus,
there is a surge in the implementation of novel techniques that can address
the challenges encountered in traditional computational approaches. Artificial intelligence
(AI), including deep learning (DL) and machine learning (ML) algorithms, has emerged as a
possible solution, which can help overcome problems and hurdles in the drug design and
discovery process.
AI, which is also referred to as machine intelligence, means the ability of computer systems to
learn from input or past data. The term AI is commonly used when a machine mimics cognitive
behaviour associated with the human brain during learning and problem solving. Nowadays,
biological and chemical scientists extensively incorporate AI algorithms in the drug design and
discovery process. Computational modelling based on AI and ML principles provides a great
avenue for identification and validation of chemical compounds, target identification, peptide
synthesis, evaluation of drug toxicity and physicochemical properties, drug monitoring, drug
efficacy and effectiveness, and drug repositioning. AI models can also help reduce the toxicity
problems that arise from off-target interactions.
➢ AI and machine learning technology play a crucial role in drug discovery and
development. In particular, artificial neural networks and deep learning algorithms
have modernized the area. Machine learning and deep learning algorithms have been
implemented in several drug discovery processes such as peptide synthesis, structure-
based virtual screening, ligand-based virtual screening, toxicity prediction, drug
monitoring and release, pharmacophore modelling, quantitative structure–activity
relationship, drug repositioning, polypharmacology, and physicochemical activity.
➢ DL is a subset of ML, which itself is a subset of AI, so the hierarchy is
AI > ML > DL. ML uses either supervised learning, where the model is trained on
labelled data, meaning that each input has been tagged with its corresponding preferred
output label, or unsupervised learning, where the model is trained on unlabelled
data and looks for recurring patterns in the input data.[23]
➢ The three main characteristic features of big data are volume, velocity, and variety,
where volume represents the huge amount and mass of data generated, velocity
represents the rate at which these data are being produced, and variety represents the
heterogeneity present in the data sets. With the advent of microarray, RNA-seq, and
high-throughput sequencing (HTS) technologies, a plethora of biomedical data is being
generated every day, due to which contemporary drug discovery has made a transition
into the big data era. For target identification, features such as gene expression are widely
used to understand disease mechanisms and find genes responsible for the disease.
➢ For example, using an ML approach and gene expression data, researchers identified
novel biomarkers and potential drug targets for rare soft tissue sarcoma. AI has emerged
as a possible solution to the problems raised by classical chemistry or chemical
space, which hamper drug discovery and development. With the advancements in
technologies and the development of high-performance computers, AI algorithms ranging
from ML to DL are increasingly used in computer-aided drug design (CADD).
➢ This will encourage chemists to identify the potential of AI techniques for answering
two crucial questions of medicinal chemistry: "What should be the next
compound?" and "What is the process of making a compound?".
➢ Application of big data for drug designing and discovery: with the increase in
biological and chemical data from the literature, in vitro, in vivo, clinical studies,
genomics studies, proteomics studies, metabolomics studies, gene ontology studies, and
molecular pathway data, different data repositories have been developed. For instance,
ChemSpider, ChEMBL, ZINC, BindingDB, and PubChem are the essential databases
for compound synthesis and screening in the drug designing and discovery process.
➢ With the advent of ML-based tools, it has become relatively easier to determine the
three-dimensional structure of a target protein, which is a critical step in drug discovery,
as novel drugs are designed based on the three-dimensional ligand-binding environment
of a protein.
➢ With advancements in automated drug discovery methods involving AI and ML, it is
relatively simple to distinguish between existing drugs and novel chemical structures.
➢ Quantum mechanics is used to determine the properties of molecules at a subatomic
level, which is used to estimate protein–ligand interactions during drug development.
➢ With these data, we can determine the electronic properties of molecules, the
arrangement of chemical bonds around a molecule, and the location of reactive sites.
➢ Various text mining-based tools have also been developed, which can aid the process of
traditional drug discovery. Text mining uses methods like natural language processing
(NLP) to transform unstructured texts in various literature and databases into structured
data, which can be analysed appropriately to gain new insights. NLP is a branch of AI,
which allows computers to process and analyse human languages like speech and text
through AI-based algorithms.[5]
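The text-mining step described in the last point, turning unstructured text into structured data, can be sketched with a simple co-occurrence count. The gene and disease vocabularies and the example sentences are assumptions made for illustration; production systems rely on curated ontologies and trained NLP models rather than on the regular expressions used here.

```python
import re
from collections import Counter

# Hypothetical vocabularies; a real pipeline would use curated ontologies.
GENES = {"CFTR", "CACNA1C", "PSMB8"}
DISEASES = {"cystic fibrosis", "timothy syndrome"}

def extract_pairs(text):
    """Count gene-disease pairs that are mentioned in the same sentence."""
    pairs = Counter()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        genes = GENES & tokens
        lowered = sentence.lower()
        diseases = {d for d in DISEASES if d in lowered}
        for gene in genes:
            for disease in diseases:
                pairs[(gene, disease)] += 1
    return pairs

text = (
    "Mutations in CFTR cause cystic fibrosis. "
    "Timothy syndrome is linked to CACNA1C. "
    "CFTR modulators are used in cystic fibrosis."
)
pairs = extract_pairs(text)
print(pairs.most_common())
```

The resulting counts form a small structured table of gene-disease mentions that could be ranked or loaded into a database for further analysis.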
Primary drug screening includes the classification and sorting of cells by image analysis
through AI technology. Many ML models using different algorithms recognize images with
great accuracy but perform less well when analysing very large datasets. To classify the target
cell, the ML model first needs to be trained so that it can identify the cell and its features, which
is basically done by contrasting the image of the targeted cells to separate them from the
background.
The secondary drug screening includes analysing the physical properties, bioactivity, and
toxicity of the compound. Melting point and partition coefficient are some of the physical
properties that govern the compound’s bioavailability and are also essential to design new
compounds.
In drug designing and drug discovery, VS is one of the crucial methods of CADD. VS refers
to the identification of a small chemical compound that binds to a drug target. VS is an efficient
method to screen out the promising therapeutic compound from a pool of compounds [158].
Thus, it has become an important complement to high-throughput screening, which suffers from
high cost and a low accuracy rate. In general, there are two important types of VS:
structure-based VS (SBVS) and ligand-based VS (LBVS). LBVS depends on the chemical
structure and empirical data of both active and inactive ligands, and uses the chemical and
physicochemical similarities of active ligands to predict other active ligands with high
bioactivity from a pool of compounds. However, LBVS does not depend on the 3-D structure
of the target protein, and thus, this method is implemented where target structure or information
is missing, and the obtained structural accuracy is low.
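The similarity principle behind LBVS can be illustrated by ranking library compounds against a known active ligand. The Tanimoto coefficient used below is an assumed choice of similarity measure (the text above does not name one), and the fingerprints are hypothetical sets of feature identifiers rather than fingerprints computed from real structures.

```python
def tanimoto(fingerprint_a, fingerprint_b):
    """Tanimoto coefficient: shared features divided by all features present."""
    a, b = set(fingerprint_a), set(fingerprint_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_by_similarity(active_fingerprint, library):
    """Sort library compounds from most to least similar to the active ligand."""
    scores = {name: tanimoto(active_fingerprint, fp) for name, fp in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical fingerprints: each integer stands for one structural feature.
active = {1, 2, 3, 4}
library = {
    "cmpd_A": {1, 2, 3, 5},
    "cmpd_B": {7, 8},
    "cmpd_C": {1, 2, 3, 4},
}

ranking = rank_by_similarity(active, library)
print(ranking)
```

Compounds at the top of the ranking would be prioritized for testing; no information about the 3-D structure of the target protein is needed at any point.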
➢ In comparison with LBVS, SBVS possesses high accuracy and precision. However, SBVS
is associated with the problem of an increasing number of disease-causing proteins and
their complicated conformations.[24]
➢ Hence, ML has the power to speed up VS, make it more robust, and even reduce false
positives in VS. Docking is the main principle applied in SBVS, for which several AI- and
ML-based scoring algorithms have been developed, such as NNScore, CScore, SVR-Score, and
ID-Score.
2.15 AI TOOLS USED IN DRUG DISCOVERY AND DEVELOPMENT PROCESS
AI algorithms that analyse data from large populations can be used to identify
trends and patterns that help predict the effectiveness of potential drug candidates for
specific patient populations, which can help tailor treatments to the needs of individual patients.
AI-based techniques can assist in selecting potential patients for pre-clinical trials by
identifying relevant human-disease bio-markers and anticipating potential toxic or unnecessary
side effects and by filtering a high dimensional set of clinical variables to select a cohort of
patients. AI can also help in predicting the outcome of clinical trials well ahead of the actual
trial minimizing the chance of any harmful effect on patients. AI has the potential to
revolutionize the drug discovery process, offering improved efficiency and accuracy,
accelerated drug development, and the capacity for the development of more effective and
personalized treatments (Figure 1). However, the successful application of AI in drug discovery
is dependent on the availability of high-quality data, the addressing of ethical concerns, and the
recognition of the limitations of AI-based approaches.
❖ CHATPANDAGPT:
ChatPandaGPT uses a state-of-the-art large language model to provide users with an interactive
and personalized way to aggregate information and answer questions related to molecular
biology, therapeutic target discovery, and pharmaceutical development. ChatPandaGPT
provides a comprehensive description of a gene, disease or gene-disease association from the
drug discovery angle, including information about major signalling pathways and biological
processes, associated compounds, potential toxic effects, clinical trials, etc. The user can
interact with ChatPandaGPT either by selecting one of the predefined questions, such as "What
are the potential risks and benefits of targeting gene X", or by formulating their own
questions and receiving relevant answers.
Applications:
• Effortless retrieval of a clear and concise summary about the disease, gene or
gene-disease association from the drug discovery perspective
• Identification of the most important genes and signalling pathways associated
with a disease of interest
• Generating supporting textual description of the PandaOmics
knowledge graph
• Publication-ready interpretation of gene-disease
associations
❖ BIOGPT:
BioGPT is a domain-specific generative language model developed by Microsoft Research. It follows
the same generative pre-trained Transformer principles as the GPT family of models, which use deep
learning to generate human-like text, and it is pre-trained on large-scale biomedical literature
rather than on general-purpose text.[35] After pre-training, the model can generate and analyse
biomedical text, which can be used for a variety of purposes, such as supporting drug discovery or
genetic research.
One of the key benefits of BioGPT is that it can generate highly accurate and detailed descriptions
of biological processes and structures. This can be especially useful in the field of biotechnology,
where researchers need to understand complex biological processes in order to develop new
treatments or therapies.
Applications of BIOGPT:
1.Drug Discovery:
BioGPT can be used to generate highly accurate and detailed descriptions of biological processes,
which can be used to develop new drugs and treatments.
By analysing gene sequences and other biological data, BioGPT can help researchers identify new
drug targets and design more effective treatments.[6]
2. Genetic Research:
BioGPT can be used to analyse large datasets of genetic information, such as gene sequences and
protein structures. This can help researchers understand the underlying genetic mechanisms behind
diseases and genetic disorders, and develop new therapies to treat them.
3. Precision Medicine:
BioGPT can be used to generate personalized treatment plans based on an individual's genetic
makeup. By analysing gene sequences and other biological data, BioGPT can help doctors and
researchers identify the best treatment options for individual patients.
4. Bioengineering:
BioGPT can be used to design new biological systems and structures, such as synthetic proteins or
microbial communities. By generating highly accurate and detailed descriptions of biological
processes, BioGPT can help researchers design more effective and efficient biological systems.
5. Environmental Monitoring:
BioGPT can be used to monitor and analyse environmental data, such as water quality and air
pollution. By analysing biological data from environmental samples, BioGPT can help researchers
identify the underlying causes of environmental problems and develop more effective solutions.
BioGPT has the potential to advance language technology for biotechnology and biomedicine. Because
it is trained on real biomedical data, BioGPT can generate detailed descriptions of biological
processes and structures.
❖ CHEMISTRY42:
Chemistry42 is a software platform for de novo small molecule design and optimization that
integrates Artificial Intelligence (AI) techniques with computational and medicinal chemistry
methodologies. Chemistry42 efficiently generates novel molecular structures with optimized
properties validated in both in vitro and in vivo studies and is available through licensing or
collaboration. Chemistry42 is the core component of Insilico Medicine’s Pharma.ai drug
discovery suite. Pharma.ai also includes PandaOmics for target discovery and multiomics data
analysis, and inClinico, a data-driven multimodal forecast of a clinical trial’s probability of
success (PoS). The developers of the platform have demonstrated how it can be used to
efficiently find novel molecular structures against DDR1 and CDK20.
programs, and over 30 internal programs. The main objective of this platform is to accelerate
the design of novel molecules with user-defined properties.[7]
❖ DEEPCHEM
The DeepChem library is a TensorFlow wrapper that streamlines the analysis of chemical
datasets. It has been used for algorithmic research into one-shot deep-learning algorithms
for drug discovery and for application projects such as modeling inhibitors of BACE-1.
DeepChem can be used to analyse protein structures, predict the solubility of small-molecule
drugs and their binding affinity to targets, and count the number of cells in a microscopic
image. MoleculeNet, which contains the properties of 700,000 compounds, has been integrated
into the DeepChem package.
❖ ChEMBL
❖ ALPHA FOLD
❖ ODDT
The Open Drug Discovery Toolkit is an open-source tool for computer aided drug discovery
(CADD). ODDT uses machine learning scoring functions (RF-Score and NNScore) to develop
CADD pipelines. It is provided as a Python library.
ODDT is built to support different formats by extending the use of Cinfony, a common API
that unites molecular toolkits, such as RDKit and OpenBabel, and makes interacting with them
more Python-like. All atom information collected from the underlying toolkits is stored as
NumPy arrays, which provide both speed and flexibility.
❖ AMPL
AMPL extends the functionality of DeepChem and supports an array of machine learning and
molecular featurization tools. It is an end-to-end data-driven modeling pipeline to generate
machine learning models that can predict key safety and pharmacokinetic-relevant parameters.
AMPL is benchmarked on a huge pool of pharmaceutical datasets and against a wide range of
parameters.
❖ DeeperBind
DeeperBind is a long short-term recurrent convolutional network that predicts protein binding
specificity in relation to DNA probes, which can model the interaction between transcription
factors (TF) and their corresponding (DNA/RNA) binding sites. DeeperBind can effectively
predict the dynamics of probe sequences. It can also be trained and tested on datasets with
sequences of variable lengths.[8]
2.16 INTRODUCTION TO RARE DISEASES
Rare (often called “orphan”) diseases are conventionally defined as those affecting a very low
number of individuals, but which can be associated with inappropriate management, chronic
debilitation and adverse health outcomes, up to death. This problematic definition is, quite
understandably, a major drawback that many scientists, clinicians and laboratory professionals
recognize when facing very uncommon pathologies. A disease is conventionally defined as “rare”
when it affects fewer than 1 in 2,000 people (i.e. <0.05%) in the European Union or fewer than
200,000 people in the US, which makes the list of these conditions quite large,
encompassing up to 8,000 pathologies, for some of which the underlying molecular or biochemical
abnormalities have not been completely unraveled so far.[26]
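The prevalence arithmetic behind the European threshold can be made explicit with a short sketch. It only encodes the 1-in-2,000 (0.05%) cut-off quoted above. The function names and the case counts in the example are hypothetical and are not taken from any registry.

```python
# 1 in 2,000 people expressed as a percentage: 100 / 2000 = 0.05 %
EU_RARE_THRESHOLD_PERCENT = 100.0 / 2000


def prevalence_percent(cases: int, population: int) -> float:
    """Share of the population affected, as a percentage."""
    return 100.0 * cases / population


def is_rare_eu(cases: int, population: int) -> bool:
    """True when prevalence is below the EU cut-off of 1 in 2,000."""
    return prevalence_percent(cases, population) < EU_RARE_THRESHOLD_PERCENT


# Hypothetical region of 10 million inhabitants.
print(is_rare_eu(3_000, 10_000_000))   # prevalence 0.03 %
print(is_rare_eu(8_000, 10_000_000))   # prevalence 0.08 %
```

The same disease can fall on either side of the cut-off depending on the population considered, which is the point made about regional prevalence in the following paragraph.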
The worldwide epidemiology and immigration are other important drawbacks, because a given
disease can be defined as certainly rare in one geographical area, whilst its prevalence may be
much higher in another, due to the impact of specific demographic, genetic and environmental
factors.
Example: Thalassemias are a paradigmatic example, since the prevalence of these haemoglobin
disorders is higher in some Mediterranean and Asian regions, whilst their burden remains
limited in other areas of the world. [9,10]
Rare diseases include, among others: biliary tract cancers, cystic fibrosis, sclerosing
mesenteritis, insulin autoimmune syndrome, chronic inflammatory demyelinating
polyneuropathies, neurodegenerative dementia, rare neonatal disorders, sleep apnea in
childhood, acute intestinal ischemia, cardiac arrhythmias in sarcoidosis, Kounis syndrome,
along with rare thrombophilic conditions, computer-related thrombosis, rare bleeding
disorders, rare forms of von Willebrand disease (VWD) and hereditary spherocytosis.
➢ In order to address the unmet needs and create opportunities that benefit patients with rare
disease in India, a group of volunteers created a not-for-profit organization named
Organization for Rare Diseases India.
➢ The ORDI team members come from diverse backgrounds such as genetics, molecular
diagnostics, drug development, bioinformatics, communications, information technology,
patient advocacy and public service.
History of rare diseases in India: A directory of accredited genetic testing service centers in
India compiled in 2007 (Singh et al., 2010) showed that there were 47 such centers offering
genetic services, including cytogenetic (40 centers), biochemical (26 centers) and molecular
diagnosis (26 centers), along with genetic counselling.
➢ This is supported by the Indian Council of Medical Research (ICMR). It has listed 649
disorders, 66 genetic centers and 35 prenatal diagnostic centers.
➢ Most of the genetic centers offer targeted tests that involve screening for common
mutations, although in recent years, many centers provide sequencing of entire disease-
associated genes
➢ Some Indian national laboratories such as SRL Labs and Lal Path Labs also offer advanced
medical and genetic tests.
➢ Education and genetic counseling (GC) are critical necessities to help patients and
physicians deal with rare diseases.
➢ GC is needed at various levels such as prior to genetic testing, post-testing, prenatal
diagnosis and family planning particularly in consanguineous marriages. GC needs to be
made an integral part of all genetic testing centers in India[27]
Treatments for rare disease: The newborn screening program in the USA covers about
31 metabolic disorders, which, when detected in the neonatal period, can be treated to
prevent disability.
➢ A drug called ‘Kuvan’ was launched for the BH4 responsive version of PKU
(phenylketonuria)
➢ In India, enzyme therapies are provided either by the Pharma companies under their
charitable programs, or by employers in India who are committed to giving ‘free’ health
care to their employees and their dependents.
➢ Success stories in India include the provision of Factor VIII to patients with
hemophilia A and of chelating agents to patients with thalassemia major.[11]
Indian organizations devoted to rare diseases:
2.18 CYSTIC FIBROSIS
Cystic fibrosis (CF) is a genetic condition that affects a protein in the body. People who have
cystic fibrosis have a faulty protein that affects the body’s cells, its tissues, and the glands that
make mucus and sweat.
Normal mucus is slippery and protects the airways, digestive tract, and other organs and tissues.
Cystic fibrosis causes mucus to become thick and sticky. As mucus builds up, it can cause
blockages, damage, or infections in affected organs.
ETIOLOGY:
According to the Cystic Fibrosis Foundation, about 30,000 Americans have CF. The disease
occurs mostly in whites whose ancestors came from northern Europe, although it cuts across
all races and ethnic groups. About 3,500 babies are born with the disease each year in the
United States. Moreover, about one in every 30 Americans is an unaffected carrier of an
abnormal CF gene.
There are over 2000 different mutations in the CFTR gene that can cause disease. These
mutations are divided into five classes:
3. Disordered regulation
PATHOPHYSIOLOGY:
Class 1 dysfunction is the result of a nonsense, frameshift, or splice-site mutation, which
leads to premature termination of the mRNA sequence. The genetic information therefore fails
to be translated into a protein product, with a subsequent total absence of CFTR protein.
This class accounts for approximately 2% to 5% of cystic fibrosis cases.[28]
Class 2 dysfunction results in abnormal post-translational processing of the CFTR protein. This
step in protein processing is essential for the proper intracellular transit of the protein. As a
result, CFTR is unable to be moved to the correct cellular location.
Class 4 dysfunction is when the protein is produced and correctly localized to the cell surface.
However, the rate of chloride ion flow and the duration of channel activation after stimulation
is decreased from normal.
Class 5 dysfunction is the net decreased concentration of CFTR channels in the cellular
membrane as a result of rapid degradation by cellular processes. It includes mutations that alter
the stability of mRNA and others that alter the stability of the mature CFTR protein.
The result of all mutations is decreased secretion of chloride and consequently increased
resorption of sodium into the cellular space. The increased sodium reabsorption leads to
increased water resorption and manifests as thicker mucus secretions on epithelial linings and
more viscous secretions from exocrine tissues. Thickened mucus secretions in nearly every
organ system involved result in mucous plugging with obstructive pathologies. The most
commonly affected organs include the sinuses, lungs, pancreas, biliary and hepatic systems,
intestines, and sweat glands.[12]
TREATMENT:
CF patients suffer from frequent lung infections caused by obstructed breathing. So, the
mainstays of treatment are physical therapy, exercise, and medications for reducing the mucus
blocking the lung's airways.
DRUG THERAPIES:
Medications are often inhaled and include the following:
▪ Bronchodilators, which widen the breathing tubes.
▪ Mucolytics, which thin the mucus.
▪ Decongestants, which reduce swelling of the membranes of the breathing tubes.
▪ An enzyme that thins the mucus by digesting the cellular material trapped in it.
▪ Antibiotics, to fight lung infections.
▪ Nonsteroidal anti-inflammatory drugs, to improve weight gain and reduce
declines in lung function.
▪ Corticosteroids, to reduce inflammation in the airway.
Herbs: Herbs are a way to strengthen and tone the body's systems. As with any therapy, you
should speak with your provider before starting any treatment. You may use herbs as dried
extracts (capsules, powders, or teas), glycerites (glycerine extracts), or tinctures (alcohol
extracts). Unless otherwise indicated, make teas with 1 tsp. (5 g) herb per cup of hot water.
Steep covered 5 to 10 minutes for leaf or flowers, and 10 to 20 minutes for roots. Drink 2 to 4
cups per day. You may use tinctures alone or in combination as noted.[29]
➢ Green tea (Camellia sinensis). Standardized extract, for antioxidant and immune
effects. You may also prepare teas from the leaf of this herb.
➢ Cat's claw (Uncaria tomentosa). Standardized extract, for inflammation, immune and
antibacterial or antifungal activity. Cat's claw may interact with certain medications,
including blood pressure medications. Cat's claw may worsen autoimmune disorders
and leukemia.
➢ Milk thistle (Silybum marianum). Seed standardized extract, for detoxification
support. Milk thistle may have an estrogen-like effect, so people who have a history of
hormone-related cancers should use milk thistle with caution. Milk thistle is in the same
family as ragweed and may cause allergic reactions in people who are sensitive to
ragweed. Since milk thistle works on the liver, it may interact with medications. Speak
with your physician.
➢ Bromelain (Ananas comosus). Standardized extract, for pain and inflammation.
Bromelain may increase bleeding in sensitive individuals, such as those taking
blood-thinning medications, including aspirin. Bromelain may also impact how your body
metabolizes antibiotics. Your physician may already be prescribing something like
bromelain, so check with your provider before taking a supplement.
➢ Ground Ivy (Hedera helix). Standardized extract, to reduce mucous production and
to loosen phlegm. Ground ivy can be particularly toxic to the liver and kidneys. People
who have a history of seizure disorders should avoid ground ivy. You should only take
ground ivy under the supervision of a trained herbalist who is working with your
physician. [13,14]
Homeopathy
Although few studies have examined the effectiveness of specific homeopathic therapies,
professional homeopaths may consider the following treatments to alleviate respiratory
symptoms (such as those experienced from CF) based on their knowledge and experience.
Before prescribing a remedy, homeopaths take into account a person's constitutional type,
which includes the person's physical, emotional, and psychological makeup. An experienced
homeopath assesses all of these factors when determining the most appropriate treatment for
each individual.
Use the following treatments under the guidance of a licensed, certified homeopath, in addition
to standard medical care provided by a medical doctor:
▪ Antimonium tartaricum - For wet, rattling cough (although the cough is usually too
weak to bring up mucus material from the lungs) that is accompanied by extreme
fatigue and breathing problems. Symptoms usually worsen when the person is lying
down.[30]
▪ Carbo vegetabilis - For shortness of breath with anxiety, chills, and bluish skin
discoloration.
Acupuncture
Acupuncture may alleviate symptoms of cystic fibrosis. Acupuncture may help enhance
immune function, normalize digestion, and strengthen respiratory function.
Massage
Therapeutic massage can help drain mucus from the lungs.
CHEMICAL CONSTITUENTS PRESENT IN THE HERBS
Fig-2.8 Chemical constituents present in Uncaria tomentosa
Balo's disease, also known as concentric sclerosis or leukoencephalitis periaxialis
concentrica, is an uncommon demyelinating disease of the central nervous system and is
regarded as a variant form of multiple sclerosis. The term concentric sclerosis refers to the
alternating rings of myelin loss and bands of intact myelin present in several parts of the
brain and brainstem. This concentric pattern is observed on magnetic resonance imaging (MRI).
The disease is distinguished by the gradual appearance of symptoms that are also found in the
most common types of multiple sclerosis, which include muscle spasms, headache, seizures, and
paralysis. Further neurological manifestations depend mainly on the brain areas affected and
may cause physiological abnormalities.[15]
Fig -2.10 Balo disease representation
Balo's disease exists in three forms: acute and self-limiting, relapsing-remitting variant, and
rapidly progressive primary disease.
ETIOLOGY:
The etiology of multiple sclerosis (MS) is probably polygenic and multifactorial involving
genetic, exogenous and immunological factors. Evidence for an environmental influence
comes from epidemiological studies (Acheson, 1985). While the disease appears to have a
predilection for Caucasians, the frequency may vary by a factor of 10 according to their place
of residence. In addition, migrants from high to low-risk areas appear further to reduce their
chances of getting the disease if they move early in life, and an increased risk may be present
for those migrating early from low to high-risk regions.
PATHOPHYSIOLOGY:
Fibrous gliosis develops in plaques that are disseminated throughout the central nervous
system (CNS), primarily in white matter, particularly in the lateral and posterior columns
(especially in the cervical regions), optic nerves, and periventricular areas. Tracts in the
midbrain, pons, and cerebellum are also affected. Gray matter in the cerebrum and spinal
cord can be affected but to a much lesser degree.
• Headaches
• Seizures
• Muscle pain and spasms
• Muscle weakness
• Paralysis over time
• Trouble speaking
• Trouble thinking or understanding others
• Changes in behaviour
TREATMENT:
Managing Balo disease usually involves corticosteroids to reduce inflammation and suppress
the immune response. [31,32] Other treatments may help some individuals, although strong
evidence of their benefits is lacking. They include:
• plasma exchange
• IV immunoglobulin
• cyclophosphamide
• azathioprine or mitoxantrone
Doctors can also prescribe treatments to relieve symptoms such as pain, weakness, or muscle
issues.
HOMEOPATHY
Homeopathic medicines have only a supportive role to play in multiple sclerosis. Homeopathic
medicines for multiple sclerosis basically help to provide symptomatic relief.
1. Physostigma & Gelsemium – For Multiple Sclerosis with Prominent Eye Symptoms
Physostigma and Gelsemium are the most reliable medicines for multiple sclerosis with eye
complaints. Both the medicines are known for their comprehensive approach in dealing with
most eye-related complaints in this disease all at once. The prominent symptoms that will have
the practitioner select Physostigma are a dim vision, blurred vision, partial loss of vision and
pain in the eyes. On the other hand, Gelsemium is the best Homeopathic medicine for multiple
sclerosis cases with optic neuritis with blurred/foggy vision, double vision, pain in the eyes and
varying degree of vision loss. These medicines help to treat these eye symptoms from multiple
sclerosis even though the results may vary from person to person.[33]
2. Oxalic Acid & Picric Acid – For Numbness, Tingling, Pin/Needle-like Sensations
Oxalic Acid is one of the most prescribed medicines for multiple sclerosis cases which show
symptoms of numbness and tingling in the limbs. Weakness and coldness in the lower limbs
may also be felt. Oxalic Acid is also a good choice of medicine for trembling hands and feet as
a result of multiple sclerosis. Picric Acid is another of the most effective Homeopathic
medicines for multiple sclerosis that has successfully treated several cases where patients feel
pin and needle pricks in their limbs. The limbs often feel tired and heavy in such cases. Marked
prostration, muscle weakness and spasms from multiple sclerosis can also be treated well with
Picric Acid. The weakness may worsen with exertion. Patients also complain of a burning
sensation along the spine.
3. Conium & Argentum Nitricum – For Weakness in Lower Limbs
Conium and Argentum Nitricum are the most useful medicines for weak lower limbs due to
multiple sclerosis. The guiding symptoms for use of Conium are weak and weary legs, sudden
loss of strength while walking, difficult gait and stiffness in legs. However, in case of weakness
in calf muscles, rigidity in calves, an unsteady walk, trembling legs, and a heaviness of the
limbs, as if they were made of wood, Argentum Nitricum is the Homeopathic medicine of choice
for multiple sclerosis.
4. Gelsemium and Alumina – When there is Difficulty in Balance and Coordination (Ataxia)
The most effective medicines for multiple sclerosis with balance and coordination difficulties
are Gelsemium and Alumina. Gelsemium is prescribed to treat a lack of muscle coordination
in persons dealing with multiple sclerosis. The gait is slow and unsteady. Loss of balance while
walking is also marked. Intense dizziness may be experienced along with these symptoms. The
symptoms pointing towards prescription of Alumina as the most suitable medicines for
multiple sclerosis are a sluggish and staggering gait, tottering and falling on closing the eyes,
numbness and a bandaged feeling in the legs.[34]
CHEMICAL CONSTITUENTS PRESENT IN HERBS
III AIM AND OBJECTIVES
AIM:
AI is playing a crucial role in all aspects of the drug development process. Gene regulatory
networks (GRNs) describe the regulatory relationships between genes and thereby help to
relate genes to their biological functions. Given the importance of rare diseases, the aim
was to perform gene network analysis and protein-protein interaction analysis for two
selected rare diseases.
OBJECTIVES:
1. To review the literature regarding the role of artificial intelligence in drug discovery
and prevalence of rare diseases
2. To select two rare diseases for gene networking analysis
3. To perform gene networking using:
• STRING TOOL
• GENEPLEXUS
4. To select gene list of two rare diseases from DisGeNET
5. To analyze and interpret the results obtained from STRING and GENEPLEXUS
6. To perform molecular docking (1-click docking) using selected targets with few
chemical constituents useful in treating the diseases
7. To select structural hits based on the docking scores
IV MATERIALS AND METHODS
Methodology
01. STRING TOOL: The protein-protein interactions were investigated using STRING tool
(https://string-db.org/)
➢ For cystic fibrosis CFTR gene was submitted in the STRING database
➢ For Balo disease CLDN11 gene was submitted in the STRING database
02. DisGeNET: Gene-disease association data for cystic fibrosis and Balo disease were obtained
from DisGeNET (https://www.disgenet.org/dbinfo).
➢ The gene-disease association list for cystic fibrosis was submitted to GenePlexus to
obtain a gene network graph.
➢ The gene-disease association list for Balo disease was submitted to GenePlexus to
obtain a gene network graph.
Chemical structures of the compounds were submitted in the IUPAC format and PDB ID
2PZE was selected for the CFTR protein. Binding score and binding conformations were
obtained directly from the site of docking
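The STRING queries described in step 01 were run through the STRING website. For reproducibility, the same lookups could also be expressed as calls to STRING's public REST interface. The sketch below only builds the request URLs and sends nothing. The helper name is hypothetical, and the choice of the `network` method, the `tsv` output format and the human taxon identifier 9606 are assumptions about how the query would be set up, not settings reported in this work.

```python
from urllib.parse import urlencode

STRING_API_BASE = "https://string-db.org/api"


def build_string_network_url(gene: str, species: int = 9606,
                             output_format: str = "tsv") -> str:
    """Build a STRING 'network' request URL for one gene (hypothetical helper)."""
    params = urlencode({"identifiers": gene, "species": species})
    return f"{STRING_API_BASE}/{output_format}/network?{params}"


# Query genes used in this work: CFTR (cystic fibrosis) and CLDN11 (Balo disease).
cftr_url = build_string_network_url("CFTR")
cldn11_url = build_string_network_url("CLDN11")

print(cftr_url)
print(cldn11_url)
```

Fetching these URLs with any HTTP client would return the interaction table as tab-separated text, which can then be compared with the network shown on the website.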
V RESULTS AND DISCUSSION
5.1 PROTEIN-PROTEIN NETWORK ANALYSIS USING STRING TOOL
The network view summarizes the network of predicted associations for the CFTR protein. The
network nodes are proteins. The edges represent the predicted functional associations. The
edges are drawn according to the view settings. In evidence mode, an edge may be drawn with
up to 7 differently coloured lines; these lines represent the existence of the seven types of
evidence used in predicting the associations.
Red line - indicates the presence of fusion evidence, Green line - neighbourhood evidence,
Blue line - cooccurrence evidence, Purple line - experimental evidence, Yellow line - text
mining evidence, Light blue line - database evidence, Black line - co expression evidence.
In confidence mode the thickness of the line indicates the degree of confidence in the
predicted interaction. Action mode shows additional information about the prediction, such as
binding, activation, etc.
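The colour legend for evidence mode can be written down as a simple lookup table, which makes it easy to state which lines would be drawn for a given edge. This is only an illustrative sketch: the channel keys and the example scores are hypothetical, and the colours are those listed in the legend above.

```python
# Evidence channel -> line colour, as listed in the legend above.
EVIDENCE_COLOURS = {
    "fusion": "red",
    "neighbourhood": "green",
    "cooccurrence": "blue",
    "experimental": "purple",
    "textmining": "yellow",
    "database": "light blue",
    "coexpression": "black",
}


def lines_for_edge(evidence_scores: dict) -> list:
    """Colours drawn in evidence mode: one line per channel with a non-zero score."""
    return [
        colour
        for channel, colour in EVIDENCE_COLOURS.items()
        if evidence_scores.get(channel, 0.0) > 0.0
    ]


# Hypothetical edge supported by experiments and text mining only.
example_edge = {"experimental": 0.4, "textmining": 0.7}
print(lines_for_edge(example_edge))
```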
CFTR (Cystic fibrosis transmembrane conductance regulator) - Epithelial ion channel that
plays an important role in the regulation of epithelial ion and water transport and fluid
homeostasis. Mediates the transport of chloride ions across the cell membrane. Channel activity
is coupled to ATP hydrolysis. The ion channel is also permeable to HCO3(-); selectivity
depends on the extracellular chloride concentration. Exerts its function also by modulating the
activity of other ion channels and transporters. Plays an important role in airway fluid
homeostasis. Contributes to the regulation of pH. (1480 aa)
CXCL8 (Interleukin-8) - IL-8 is a chemotactic factor that attracts neutrophils, basophils, and
T-cells, but not monocytes. It is also involved in neutrophil activation. It is released from
several cell types in response to an inflammatory stimulus. IL-8(6-77) has a 5-10-fold higher
activity on neutrophil activation, IL-8(5-77) has increased activity on neutrophil activation and
IL-8(7-77) has a higher affinity to receptors CXCR1 and CXCR2 as compared to IL-8(1-77),
respectively. (99 aa)
SAA1 (Serum amyloid A-1 protein) - Major acute phase protein; belongs to the SAA family. (122 aa)
SAA2 (Serum amyloid A-2 protein) - Major acute phase reactant. Apolipoprotein of the HDL
complex; belongs to the SAA family. (122 aa)
ELANE (Neutrophil elastase) - Modifies the functions of natural killer cells, monocytes and
granulocytes. Inhibits C5a-dependent neutrophil enzyme release and chemotaxis. Capable of
killing E. coli but not S. aureus in vitro; digests outer membrane protein A (ompA) in E. coli
and K. pneumoniae; belongs to the peptidase S1 family, elastase subfamily. (267 aa)
MUC5AC (Mucin-5AC) - Gel-forming glycoprotein of gastric and respiratory tract epithelia that protects
the mucosa from infection and chemical damage by binding to inhaled microorganisms and
particles that are subsequently removed by the mucociliary system. Interacts with H.pylori in
the gastric epithelium, Barrett's esophagus as well as in gastric metaplasia of the duodenum
(GMD) (5654 aa)
➢ Cooccurrence evidence (blue line): this evidence reflects the presence or absence of
the linked proteins across species. (SAA1-CXCL8, SAA1-SAA2)
➢ Experimental evidence (purple line): this shows that the interaction between the proteins
has been demonstrated experimentally. (SAA1-SAA2, CFTR-MUC5AC, MUC5AC-ELANE)
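The protein pairs named above can be collected into a small graph to see which proteins have the most partners. The sketch below uses only the pairs listed in the two bullet points, so it is not the full STRING network for CFTR. The variable and function names are illustrative.

```python
from collections import defaultdict

# (protein A, protein B, evidence type) for the pairs named in the text above.
EDGES = [
    ("SAA1", "CXCL8", "cooccurrence"),
    ("SAA1", "SAA2", "cooccurrence"),
    ("SAA1", "SAA2", "experimental"),
    ("CFTR", "MUC5AC", "experimental"),
    ("MUC5AC", "ELANE", "experimental"),
]


def build_adjacency(edges):
    """Undirected adjacency map; repeated pairs are merged by the set."""
    adjacency = defaultdict(set)
    for protein_a, protein_b, _evidence in edges:
        adjacency[protein_a].add(protein_b)
        adjacency[protein_b].add(protein_a)
    return adjacency


adjacency = build_adjacency(EDGES)
degree = {protein: len(partners) for protein, partners in adjacency.items()}

for protein, count in sorted(degree.items(), key=lambda item: (-item[1], item[0])):
    print(protein, count)
```

Among these pairs, SAA1 and MUC5AC each connect to two partners, while CFTR appears only through its link to MUC5AC.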
Significant role of SAA1, SAA2, CXCL8, CFTR in Cystic fibrosis
➢ SAA1 has been found to play an important role in lipid metabolism and contributes to
bacterial clearance, the regulation of inflammation and tumour pathogenesis.
➢ SAA2 displays antimicrobial activity against S. aureus and E. coli.
➢ CXCL8 is the most potent human neutrophil-attracting chemokine and plays crucial
roles in the response to infection and tissue injury.
➢ The cystic fibrosis transmembrane conductance regulator (CFTR) protein helps to
maintain the balance of salt and water on many surfaces in the body, such as the surface
of the lung.
➢ The ELANE gene provides instructions for making a protein called neutrophil elastase.
This protein is found in neutrophils, a type of white blood cell that plays a role in
inflammation and in fighting infection.
➢ MUC-5AC is a large gel-forming glycoprotein. In the respiratory tract it protects
against infection by binding to inhaled pathogens that are subsequently removed by
mucociliary clearance.
➢ Protein-protein interaction inhibitors for CFTR & MUC-5AC: HDL2, HDL3, IL-6,
CFTRinh-172
➢ Protein-protein interactions in cystic fibrosis thus form a network with several proteins
involved in the inflammatory process.
5.1.2. BALO DISEASE
The network view summarizes the network of predicted associations for CLDN11 protein. The
network nodes are proteins. The edges represent the predicted functional associations. The
edges are drawn according to the view settings. In evidence mode, an edge may be drawn with
up to 7 differently coloured lines - these lines represent the existence of the seven types of
evidence used in predicting the associations.
Red line - indicates the presence of fusion evidence, Green line - neighbourhood evidence,
Blue line - cooccurrence evidence, Purple line - experimental evidence, Yellow line - text
mining evidence, Light blue line - database evidence, Black line - co expression evidence.
In confidence mode the thickness of the line indicates the degree of confidence in the
predicted interaction. Action mode shows additional information about the prediction, such as
binding, activation, etc.
CLDN11 (Claudin 11) - Plays a major role in tight junction-specific obliteration of the
intercellular space, through calcium-independent cell-adhesion activity (207 aa)
GJA1 (Gap junction alpha-1 protein) - Gap junction protein that acts as a regulator of bladder
capacity. A gap junction consists of a cluster of closely packed pairs of transmembrane
channels, the connexons, through which materials of low MW diffuse from one cell to a
neighbouring cell. May play a critical role in the physiology of hearing by participating in the
recycling of potassium to the cochlear endolymph. Negative regulator of bladder functional
capacity: acts by enhancing intercellular electrical and chemical transmission, thus sensitizing
bladder muscles to cholinergic neural stimuli. (382 aa)
GJB1 (Gap junction beta-1 protein) - One gap junction consists of a cluster of closely packed
pairs of transmembrane channels, the connexons, through which materials of low MW diffuse
from one cell to a neighbouring cell. (283 aa)
GJC2 (Gap junction gamma-2 protein) - One gap junction consists of a cluster of closely
packed pairs of transmembrane channels, the connexons, through which materials of low MW
diffuse from one cell to a neighbouring cell. May play a role in myelination in central and
peripheral nervous systems. (439 aa)
GJB6 (Gap junction beta-6 protein) - One gap junction consists of a cluster of closely packed
pairs of transmembrane channels, the connexons, through which materials of low MW diffuse
from one cell to a neighbouring cell; Belongs to the connexin family. Beta-type (group I)
subfamily. (261 aa)
Significant role of CLDN11, MAG, MOG, CD68 in Balo disease
5.2 NETWORK ANALYSIS USING GENEPLEXUS
Gene network analysis explores the relationships between genes in biological systems. It helps
uncover how genes interact, regulate each other's expression, and influence biological
processes. Techniques like gene co-expression analysis, regulatory network inference, and
pathway analysis are used to understand gene interactions and their functional implications. It's
a powerful approach for studying complex biological systems and identifying key genes
involved in diseases or other biological phenomena.
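As an illustration of the co-expression analysis mentioned above, the following minimal sketch links two genes whenever the absolute Pearson correlation of their expression profiles reaches a threshold. The gene names, expression values, and the 0.8 threshold are hypothetical and chosen only for demonstration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(profiles, threshold=0.8):
    """Return (gene1, gene2, r) for every pair whose |r| meets the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Hypothetical expression values across five samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = coexpression_edges(profiles)
print(edges)
```

Only geneA and geneB are linked here, because their profiles rise together while geneC varies independently of both.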
GenePlexus was used to predict novel genes associated with CF from a gene interaction
network, starting with a set of 100 known CF genes obtained from the DisGeNET database.
Table-5.2.1 Gene details of Cystic Fibrosis
Symbol   Name                                      Probability   Known/Novel   Training-label
INS      Insulin                                   1.00          Novel         U
BEST1    Bestrophin 1                              1.00          Known         P
IL1B     Interleukin 1 beta                        1.00          Known         P
CFTR     CF transmembrane conductance regulator    1.00          Known         P
IL6      Interleukin 6                             1.00          Known         P
TP53     Tumor protein p53                         1.00          Novel         U
CXCL8    C-X-C motif chemokine ligand 8            1.00          Known         P
CRP      C-reactive protein                        1.00          Novel         U
GCG      Glucagon                                  1.00          Known         P
ACE      Angiotensin I converting enzyme           1.00          Novel         U
• Probability: Indicates the gene's network-based similarity to the input genes.
• Rank: Rank of the gene when sorted by prediction probability (up to several
decimals).
• Class-Labels: P – gene was considered in positive class during training, N – gene
was considered in negative class during training, and U – gene was not considered at
all during training.
• Known/Novel: Indicates whether the gene was part of the input gene list
(therefore Known) or not (therefore Novel).
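The column definitions above can be expressed as a short routine that derives the rank, the Known/Novel flag, and the class label from a set of predicted probabilities. This is a simplified sketch: the function name, the gene names, and the probabilities are hypothetical, and it assumes that every input gene was used as a positive example during training.

```python
def annotate_predictions(probabilities, input_genes, negative_genes=()):
    """Rank genes by probability and attach Known/Novel and P/N/U labels."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    rows = []
    for rank, (gene, prob) in enumerate(ranked, start=1):
        if gene in input_genes:
            known_novel, label = "Known", "P"   # part of the input list
        elif gene in negative_genes:
            known_novel, label = "Novel", "N"   # used as a negative example
        else:
            known_novel, label = "Novel", "U"   # not used during training
        rows.append(
            {
                "rank": rank,
                "gene": gene,
                "probability": prob,
                "known_novel": known_novel,
                "class_label": label,
            }
        )
    return rows


# Hypothetical predictions: ranking uses the unrounded probabilities,
# so genes that all display as 1.00 can still be ordered.
probabilities = {"geneA": 0.9991, "geneB": 0.9987, "geneC": 0.4200, "geneD": 0.9995}
rows = annotate_predictions(
    probabilities, input_genes={"geneA", "geneD"}, negative_genes={"geneC"}
)
for row in rows:
    print(row)
```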
➢ The top gene predictions are also visualized in the context of the original network that
was used to train the model.
➢ When a user provides a set of genes to GenePlexus, it trains a custom machine learning
(ML) model that captures the patterns of network connectivity of the user’s genes in
contrast to other genes in the network.
➢ Based on this ML model, GenePlexus predicts other genes in the network that are
similar to the input genes in terms of their network connectivity.
➢ Throughout this work, we demonstrate the utility and features of GenePlexus by
applying it to discover genes associated with cystic fibrosis (CF).
➢ GenePlexus predicts that CFTR and BEST1 are functionally similar to this input set and
are highly connected to known positive genes in the network.
➢ The connection between CFTR (Cystic Fibrosis Transmembrane Conductance
Regulator) and INS (Insulin) lies in their roles in different bodily systems. CFTR is
primarily associated with cystic fibrosis, a genetic disorder affecting the lungs and
digestive system, while insulin is crucial for regulating blood sugar levels. There isn't
a direct biological connection between CFTR and insulin, but both play important roles
in maintaining overall health.
➢ CFTR (Cystic Fibrosis Transmembrane Conductance Regulator) and Bestrophin-1 are
both membrane proteins involved in ion transport, but they serve different functions in
the body. CFTR is primarily associated with chloride ion transport, particularly in the
epithelial cells lining the respiratory, digestive, and reproductive systems. Mutations in
the CFTR gene can lead to cystic fibrosis. Although their functions differ, both proteins
are important in maintaining cellular homeostasis and proper physiological function.
➢ CFTR (Cystic Fibrosis Transmembrane Conductance Regulator) and Interleukin 6 (IL-
6) are both implicated in inflammatory processes, particularly in the context of cystic
fibrosis (CF). In CF, mutations in the CFTR gene lead to defective chloride ion
transport and mucus buildup in various organs, particularly the lungs. So, while CFTR
and IL-6 have distinct molecular functions, they are interconnected in the context of
inflammation and the pathophysiology of cystic fibrosis.
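The network-based learning step described in the bullets above (train a model on the connectivity of the input genes, then score every other gene in the network) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not GenePlexus's actual implementation: it assumes each gene is represented by its row of the adjacency matrix and that the classifier is an L2-regularised logistic regression. The eight placeholder genes, the edges, and the choice of negative genes are hypothetical.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def score(weights, bias, features):
    """Predicted probability that a gene resembles the input (positive) genes."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train_logistic(rows, labels, l2=0.1, lr=0.5, epochs=500):
    """L2-regularised logistic regression fitted by batch gradient descent."""
    weights, bias, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(epochs):
        # Errors are computed once per epoch, before any weight is updated.
        errors = [score(weights, bias, x) - y for x, y in zip(rows, labels)]
        for k in range(len(weights)):
            grad = sum(e * x[k] for e, x in zip(errors, rows)) / n + l2 * weights[k]
            weights[k] -= lr * grad
        bias -= lr * sum(errors) / n
    return weights, bias


# Toy network: two dense clusters, {g0..g3} and {g4..g7}, joined by one edge (g3-g4).
genes = [f"g{i}" for i in range(8)]
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (3, 4)]
adjacency = [[0.0] * len(genes) for _ in genes]
for i, j in edges:
    adjacency[i][j] = adjacency[j][i] = 1.0

positives = [0, 1, 2]   # hypothetical input ("known") genes
negatives = [5, 6, 7]   # hypothetical negative examples
train_rows = [adjacency[i] for i in positives + negatives]
train_labels = [1.0] * len(positives) + [0.0] * len(negatives)

weights, bias = train_logistic(train_rows, train_labels)
probs = {g: score(weights, bias, adjacency[i]) for i, g in enumerate(genes)}
for gene in sorted(probs, key=probs.get, reverse=True):
    print(gene, round(probs[gene], 3))
```

In this toy network the unlabelled gene g3 scores above 0.5 because most of its neighbours are input genes, whereas g4, which is mostly connected to the negative genes, scores below 0.5. This mirrors how a gene outside the input list can be reported as a novel prediction.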
5.2.2. BALO DISEASE
GenePlexus was used to predict novel genes associated with Balo disease from a gene
interaction network, starting with a set of 100 known Balo genes obtained from the DisGeNET
database.
• Probability: Indicates the gene's network-based similarity to the input genes.
• Rank: Rank of the gene when sorted by prediction probability (up to several
decimals).
• Class-Labels: P – gene was considered in positive class during training, N – gene
was considered in negative class during training, and U – gene was not considered at
all during training.
• Known/Novel: Indicates whether the gene was part of the input gene list
(therefore Known) or not (therefore Novel).
➢ The top gene predictions are also visualized in the context of the original network that
was used to train the model.
➢ When a user provides a set of genes to GenePlexus, it trains a custom machine learning
(ML) model that captures the patterns of network connectivity of the user’s genes in
contrast to other genes in the network.
➢ Based on this ML model, GenePlexus predicts other genes in the network that are
similar to the input genes in terms of their network connectivity.
➢ Throughout this work, we demonstrate the utility and features of GenePlexus by
applying it to discover genes associated with Balo disease.
➢ GenePlexus predicts that CLDN11 and MOG are functionally similar to this input set
and are highly connected to known positive genes in the network.
➢ CLDN11 (Claudin-11) and MOG (Myelin oligodendrocyte glycoprotein) are both
proteins found in the myelin sheath of neurons. While there isn't direct evidence linking
CLDN11 and MOG in Balo's disease specifically, both play crucial roles in myelin
structure and function. Dysfunction in either could potentially contribute to the
pathology of Balo's disease.
➢ PRAMEF14 (preferentially expressed antigen in melanoma family member 14) is a
gene that has been implicated in some neurological disorders, including multiple
sclerosis, which shares some similarities with CLDN11 in Balo disease.
➢ Claudin-11 is a tight junction protein crucial for the integrity of the myelin sheath, while
PGBD1 is associated with transposable elements and is involved in genome
rearrangements.
5.3 Molecular docking
Molecular docking was performed using important chemical constituents present in the herbs
used to treat cystic fibrosis, and the results are presented in Table-5.3.1.
Fig-5.5 Binding pose of Epicatechin in the active site of CFTR protein (PDB ID: 2PZE)
Fig-5.6 Binding pose of Epicatechin in the active site of CFTR protein (PDB ID: 2PZE), protein surface presentation
Fig-5.7 Binding pose of Catechin in the active site of CFTR protein (PDB ID: 2PZE)
Fig-5.8 Binding pose of Catechin in the active site of CFTR protein (PDB ID: 2PZE), protein surface presentation
Fig-5.9 Binding pose of Gallic acid in the active site of CFTR protein (PDB ID: 2PZE)
Fig-5.10 Binding pose of Gallic acid in the active site of CFTR protein (PDB ID: 2PZE), protein surface presentation
Fig-5.11 Binding pose of Silybin in the active site of CFTR protein (PDB ID: 2PZE)
Fig-5.12 Binding pose of Silybin in the active site of CFTR protein (PDB ID: 2PZE), protein surface presentation
VI CONCLUSION
➢ Protein-protein interactions in rare disease complications were predicted for cystic
fibrosis and Balo disease. In cystic fibrosis, ELANE, CXCL8, SAA1, SAA2, and
MUC5AC interact with each other, and the interacting genes are INS, IL18, and IL6.
➢ In Balo disease, the protein-protein interactions involved CLDN11, MOG, MAG, GJA1,
GJB1, GJC2, CD68, RTN4, and GJB6, while the crucial genes involved were CLDN11,
MOG, PRAMEF14, and PGBD1.
➢ The findings of the present study might be useful for designing and developing novel
therapeutic agents to treat cystic fibrosis and Balo disease.
➢ Molecular docking with the CFTR protein highlighted silybin A as a promising
molecule that could be developed as a therapeutic agent to treat cystic fibrosis.
63
VII. REFERENCES
1. Pun, F. W., Ozerov, I. V., & Zhavoronkov, A. (2023). AI-powered therapeutic target
discovery. Trends in Pharmacological Sciences.
2. Chen, R., Liu, X., Jin, S., Lin, J., & Liu, J. (2018). Machine learning for drug-target
interaction prediction. Molecules, 23(9), 2208.
3. Ivanenkov, Y. A., Polykovskiy, D., Bezrukov, D., Zagribelnyy, B., Aladinskiy, V.,
Kamya, P., ... & Zhavoronkov, A. (2023). Chemistry42: an AI-driven platform for
molecular design and optimization. Journal of Chemical Information and
Modeling, 63(3), 695-701.
4. Luo, R., Sun, L., Xia, Y., Qin, T., Zhang, S., Poon, H., & Liu, T. Y. (2022). BioGPT:
generative pre-trained transformer for biomedical text generation and mining. Briefings
in bioinformatics, 23(6), bbac409.
5. Varadi, M., Anyango, S., Deshpande, M., Nair, S., Natassia, C., Yordanova, G., ... &
Velankar, S. (2022). AlphaFold Protein Structure Database: massively expanding the
structural coverage of protein-sequence space with high-accuracy models. Nucleic
acids research, 50(D1), D439-D444.
6. Liu, Z., Roberts, R. A., Lal-Nag, M., Chen, X., Huang, R., & Tong, W. (2021). AI-
based language models powering drug discovery and development. Drug Discovery
Today, 26(11), 2593-2607.
7. Vamathevan, J., Clark, D., Czodrowski, P., Dunham, I., Ferran, E., Lee, G., ... & Zhao,
S. (2019). Applications of machine learning in drug discovery and development. Nature
reviews Drug discovery, 18(6), 463-477.
8. Schenone, M., Dančík, V., Wagner, B. K., & Clemons, P. A. (2013). Target
identification and mechanism of action in chemical biology and drug discovery. Nature
chemical biology, 9(4), 232-240.
9. Harrer, S., Shah, P., Antony, B., & Hu, J. (2019). Artificial intelligence for clinical trial
design. Trends in pharmacological sciences, 40(8), 577-591.
10. Delavan, B., Roberts, R., Huang, R., Bao, W., Tong, W., & Liu, Z. (2018).
Computational drug repositioning for rare diseases in the era of precision
medicine. Drug discovery today, 23(2), 382-394.
11. Drews, J. (2000). Drug discovery: a historical perspective. science, 287(5460), 1960-
1964.
64
12. Paul, D., Sanap, G., Shenoy, S., Kalyane, D., Kalia, K., & Tekade, R. K. (2021).
Artificial intelligence in drug discovery and development. Drug discovery today, 26(1),
80.
13. Keiser, M. J., Setola, V., Irwin, J. J., Laggner, C., Abbas, A. I., Hufeisen, S. J., ... &
Roth, B. L. (2009). Predicting new molecular targets for known
drugs. Nature, 462(7270), 175-181.
14. Buchan, N. S., Rajpal, D. K., Webster, Y., Alatorre, C., Gudivada, R. C., Zheng, C., ...
& Koehler, J. (2011). The role of translational bioinformatics in drug discovery. Drug
discovery today, 16(9-10), 426-434.
15. Barrangou, R., Birmingham, A., Wiemann, S., Beijersbergen, R. L., Hornung, V., &
Smith, A. V. B. (2015). Advances in CRISPR-Cas9 genome engineering: lessons
learned from RNA interference. Nucleic acids research, 43(7), 3407-3419.
16. You, Y., Lai, X., Pan, Y., Zheng, H., Vera, J., Liu, S., ... & Zhang, L. (2022). Artificial
intelligence in cancer target identification and drug discovery. Signal Transduction and
Targeted Therapy, 7(1), 156.
17. Paananen, J., & Fortino, V. (2020). An omics perspective on drug target discovery
platforms. Briefings in bioinformatics, 21(6), 1937-1953.
18. Zhao, J., Cao, Y., & Zhang, L. (2020). Exploring the computational methods for
protein-ligand binding site prediction. Computational and structural biotechnology
journal, 18, 417-426.
19. Hasin, Y., Seldin, M., & Lusis, A. (2017). Multi-omics approaches to disease. Genome
biology, 18, 1-15.
20. Rajasimha, H. K., Shirol, P. B., Ramamoorthy, P., Hegde, M., Barde, S., Chandru, V.,
... & Verma, I. C. (2014). Organization for rare diseases India (ORDI)–Addressing the
challenges and opportunities for the Indian rare diseases' community. Genetics
research, 96, e009.
21. Aronson, J. K. (2006). Rare diseases and orphan drugs. British journal of clinical
pharmacology, 61(3), 243.
22. Rosenbloom, B. E., & Weinreb, N. J. (2013). Gaucher disease: a comprehensive
review. Critical Reviews™ in Oncogenesis, 18(3).
23. Song, P., Gao, J., Inagaki, Y., Kokudo, N., & Tang, W. (2012). Rare diseases, orphan
drugs, and their regulation in Asia: Current status and future perspectives. Intractable
& rare diseases research, 1(1), 3-9.
65
24. Zhu, F., Shi, Z., Qin, C., Tao, L., Liu, X., Xu, F., ... & Chen, Y. (2012). Therapeutic
target database update 2012: a resource for facilitating target-oriented drug
discovery. Nucleic acids research, 40(D1), D1128-D1136.
25. Davis, P. B. (2006). Cystic fibrosis since 1938. American journal of respiratory and
critical care medicine, 173(5), 475-482.
26. Ashrafi, M. R., Tavasoli, A. R., Alizadeh, H., Noghabi, J. Z., & Parvaneh, N. (2017).
Tumefactive multiple sclerosis variants: report of two cases of Schilder and Balo
diseases. Iranian Journal of Child Neurology, 11(2), 69.
27. Torrelo, A., Patel, S., Colmenero, I., Gurbindo, D., Lendínez, F., Hernández, A., ... &
Paller, A. S. (2010). Chronic atypical neutrophilic dermatosis with lipodystrophy and
elevated temperature (CANDLE) syndrome. Journal of the American Academy of
Dermatology, 62(3), 489-495.
28. Hurst, J. A., & Baraitser, M. I. C. H. A. E. L. (1989). Johanson-Blizzard
syndrome. Journal of medical genetics, 26(1), 45.
29. Beutler, E. (1991). Gaucher's disease. New England Journal of Medicine, 325(19),
1354-1360.
30. McDermott, D. A., Fong, J. C., & Basson, C. T. (2019). Holt-Oram Syndrome.
31. Arkin, M. R., & Wells, J. A. (2004). Small-molecule inhibitors of protein–protein
interactions: progressing towards the dream. Nature reviews Drug discovery, 3(4), 301-
317.
32. Wójcik, P., & Berlicki, Ł. (2016). Peptide-based inhibitors of protein–protein
interactions. Bioorganic & medicinal chemistry letters, 26(3), 707-713.
33. Voter, A. F., & Keck, J. L. (2018). Development of protein–protein interaction
inhibitors for the treatment of infectious diseases. Advances in protein chemistry and
structural biology, 111, 197-222.
34. Arkin, M. R., & Wells, J. A. (2004). Small-molecule inhibitors of protein–protein
interactions: progressing towards the dream. Nature reviews Drug discovery, 3(4), 301-
317.
35.Dey, I., Shah, K., & Bradbury, N. A. (2016). Natural compounds as therapeutic agents
in the treatment cystic fibrosis. Journal of genetic syndromes & gene therapy, 7(1).
66