3.1. Introduction to Machine Learning Concepts
Learning Objectives
By the end of this lecture, candidates will be able to:
• Understand the basic concepts of machine learning, including features, classification,
regression, and model training.
• Recognize different ML models like KNN, Decision Trees, SVM, and Linear/Logistic
Regression, and their uses in drug discovery.
• Learn the working principle of ML models and their use in drug discovery and
development.
Why Machine Learning for Drug Discovery?
• Traditional drug discovery is slow and costly:
• Average timeline: 10-15 years
• Average cost: $2.6 billion per drug
• High failure rates — 90% of clinical trials fail
• ML Helps in:
• Hit Identification: Screening millions of compounds virtually (e.g., DeepDock,
VirtualFlow)
• Lead Optimization: Predicting molecule activity, ADMET (Absorption,
Distribution, Metabolism, Excretion, Toxicity) profiles
• Drug Repurposing: Finding new uses for existing drugs (e.g., Remdesivir for
COVID-19)
• Personalized Medicine: Predicting patient-specific drug responses
What is Machine Learning (ML)?
• Machine learning is a subset of artificial intelligence (AI) where computers learn from data
without being explicitly programmed.
• The model learns patterns from data to make predictions or decisions.
• Three Types of Machine Learning:
• Supervised Learning – Model learns from labeled data (e.g., predicting if a molecule is
active/inactive)
• Unsupervised Learning – Model finds hidden patterns in unlabeled data (e.g., clustering compounds
by chemical similarity)
• Reinforcement Learning – Model learns through trial and error (e.g., optimizing synthesis pathways)
Features
• A feature is an individual measurable property or characteristic of the data — like a descriptor for a
molecule.
• In drug discovery, features describe molecular properties. Examples:
• Molecular weight — How "heavy" the molecule is
• LogP (Partition coefficient) — Lipophilicity (fat vs. water solubility)
• Hydrogen bond donors/acceptors — Essential for binding interactions
• Topological polar surface area (TPSA) — Affects cell permeability
• Molecular fingerprints — Encoded bit strings representing chemical structure
• The better the features, the smarter the model.
• Example: predicting blood-brain barrier (BBB) permeability — features like LogP, TPSA, and rotatable bond count help determine whether a compound crosses the BBB (see the sketch below).
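A minimal sketch of computing such descriptors in Python with RDKit (assuming RDKit is installed); the aspirin SMILES string and the chosen descriptors are just illustrative, not a full featurization pipeline:

```python
# Minimal sketch: compute common molecular features with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used here only as an example
mol = Chem.MolFromSmiles(smiles)

features = {
    "MolWt": Descriptors.MolWt(mol),                  # molecular weight (g/mol)
    "LogP": Descriptors.MolLogP(mol),                 # lipophilicity
    "HBD": Descriptors.NumHDonors(mol),               # hydrogen bond donors
    "HBA": Descriptors.NumHAcceptors(mol),            # hydrogen bond acceptors
    "TPSA": Descriptors.TPSA(mol),                    # topological polar surface area (Å²)
    "RotatableBonds": Descriptors.NumRotatableBonds(mol),
}
print(features)
```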
Discrete Features (Categorical)
• These are features that take on distinct, separate values — usually categories or
counts.
• Characteristics:
• Can’t be broken down into finer values
• Often encoded as integers (e.g., 0, 1, 2) or one-hot encoded for ML models (see the sketch after this list)
• Examples:
• Drug Class: Antibiotic (0), Antiviral (1), Anticancer (2)
• Chemical Substructures: Presence of benzene ring (Yes/No → 1/0)
• Toxicity Class: Non-toxic (0), Low toxicity (1), High toxicity (2)
• Amino Acid Type: Hydrophobic, Polar, Charged
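A minimal sketch of one-hot encoding a categorical feature with scikit-learn; the drug-class labels are toy data:

```python
# Minimal sketch: one-hot encode a discrete feature.
# Note: sparse_output requires scikit-learn >= 1.2; older versions use sparse=False.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

drug_class = np.array([["Antibiotic"], ["Antiviral"], ["Anticancer"], ["Antibiotic"]])

encoder = OneHotEncoder(sparse_output=False)
onehot = encoder.fit_transform(drug_class)  # one 0/1 column per category

print(encoder.categories_)
print(onehot)
# A binary feature (e.g., benzene ring present) can stay as a single 0/1 column.
```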
Binary Features
• A feature that has only two possible values — typically 0/1, Yes/No, or True/False.
• Characteristics:
• Encodes presence/absence or positive/negative states
• Often represents qualitative data in a simplified form
• Helps models make quick binary decisions
• Examples:
• Lipinski’s Rule of 5 Compliance: (Yes = 1, No = 0)
• Toxicity Flag: (Toxic = 1, Non-toxic = 0)
• Hydrogen Bond Donor Presence: (Yes = 1, No = 0)
• Molecular Scaffold Presence: (Aromatic ring present = 1, Absent = 0)
• Activity Classification: (Active = 1, Inactive = 0)
Continuous Features (Numerical)
• These are features that can take any value within a range — they’re measured
on a continuous scale.
• Characteristics:
• Can be infinitely divided into smaller values (e.g., 5.3, 5.31, 5.314)
• Often require scaling/normalization (e.g., Min-Max, StandardScaler); see the sketch after this list
• Examples:
• Molecular Weight: e.g., 342.3 g/mol
• LogP (Lipophilicity): e.g., 2.5 (hydrophobicity measure)
• Topological Polar Surface Area (TPSA): e.g., 78.9 Å² (predicts membrane permeability)
• IC₅₀ (Half Maximal Inhibitory Concentration): e.g., 12.7 nM (measure of drug
potency)
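A minimal sketch of the two scalers named above, applied to toy MW/LogP/TPSA values:

```python
# Minimal sketch: scale continuous features before training.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([
    [342.3, 2.5, 78.9],   # [MolWt, LogP, TPSA] for compound 1 (toy values)
    [180.2, 1.2, 63.6],
    [451.6, 4.1, 110.8],
])

print(StandardScaler().fit_transform(X))  # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
```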
Derived Features
• Features that are calculated or engineered from existing data to capture
more insight.
• Characteristics:
• Can be continuous or categorical
• Helps enhance predictive power
• Involves domain knowledge for meaningful transformations
• Examples in Drug Discovery:
• Ligand Efficiency: (pIC₅₀ / Molecular Weight) → Measures binding efficiency (see the sketch after this list)
• LogD (Distribution Coefficient): Derived from LogP and pKa for drug
permeability prediction
• Hydrophobic Surface Area: Calculated from molecular structure
• Drug-likeness Score: Composite of multiple descriptors (MW, LogP, H-bond
donors, etc.)
• Polar to Nonpolar Ratio: Ratio of polar atoms to non-polar atoms — useful for
predicting solubility
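A minimal sketch computing two of these derived features from toy values, following the definitions above (the atom counts are hypothetical):

```python
# Minimal sketch: derive features from raw descriptors.
import numpy as np

ic50_nM = 12.7          # measured IC50 in nanomolar (toy value)
mol_weight = 342.3      # g/mol (toy value)

pIC50 = -np.log10(ic50_nM * 1e-9)        # convert nM -> M, then take -log10
ligand_efficiency = pIC50 / mol_weight   # pIC50 per unit molecular weight

polar_atoms, nonpolar_atoms = 6, 18      # hypothetical atom counts
polar_ratio = polar_atoms / nonpolar_atoms

print(f"pIC50 = {pIC50:.2f}, LE = {ligand_efficiency:.4f}, polar ratio = {polar_ratio:.2f}")
```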
Machine Learning: Types of Predictions
[Figure: several inputs (Input 1–Input 8, the features) feed into the Model, which produces a single Output (the Prediction).]
Supervised Learning
• Learn from labeled data to predict outcomes.
• Types:
• Classification (e.g., active vs. inactive compound, soluble vs. insoluble) — SVM, Logistic Regression
• Regression (e.g., predicting solubility, permeability, or IC₅₀ values) — Linear Regression, Random Forest
Classification vs. Regression
| Category | Type | Description | Example | Algorithms |
| --- | --- | --- | --- | --- |
| Classification | Binary Classification | Two possible outcomes | Active vs. Inactive compound | Logistic Regression, SVM, Random Forest |
| Classification | Multiclass Classification | More than two classes | Agonist vs. Antagonist vs. Neutral ligand | Decision Trees, KNN, Neural Networks |
| Classification | Multilabel Classification | Each sample can belong to multiple classes | Antimicrobial + Anti-inflammatory properties | One-vs-Rest, Neural Networks, Adapted Random Forest |
| Regression | Simple Regression | One feature → one output | Molecular weight predicting IC₅₀ | Linear Regression, SVR |
| Regression | Multiple Regression | Multiple features → one output | MW, LogP, H-bond donors predicting bioavailability | Ridge Regression, Lasso Regression |
| Regression | Polynomial Regression | Captures curved relationships | Dose vs. response curve | Polynomial Regression models |
| Regression | Log/Exponential Regression | Handles non-linear, exponential, or decay relationships | Drug concentration decay in plasma over time | Nonlinear regression models |
Supervised Learning Dataset
• Each row = a different sample in the dataset.
• Each column (apart from the label) = a different feature.
• The label is the value the model learns to predict.
• Together, the labels form the label matrix (Y) and the features form the feature matrix (X).
Data from J. Chem. Inf. Comput. Sci. 2004, 44, 1000–1005 by John S. Delaney.
Model Training, Validation, and Testing
• The dataset is split into training, validation, and test sets; common ratios are 60:20:20 or 80:10:10 (see the sketch below).
• Training: the model makes a prediction (e.g., −1.8 mol/L), the loss (Loss = Actual − Predicted) is computed against the actual value, and the model makes adjustments to reduce it.
• Validation: a reality check during/after training to ensure the model can handle unseen data. Candidate models are compared by validation loss (e.g., Model 1: 1.3, Model 2: 1.0, Model 3: 0.5, Model 4: 0.8), and the lowest-loss model (Model 3) is chosen as the best model.
• Testing: the test set checks how generalizable the final chosen model is; its performance on the test set is the reported performance.
Data from J. Chem. Inf. Comput. Sci. 2004, 44, 1000–1005 by John S. Delaney.
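A minimal sketch of a 60:20:20 split with scikit-learn's train_test_split; the synthetic arrays stand in for a real feature matrix and labels:

```python
# Minimal sketch: 60:20:20 train/validation/test split.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # hypothetical feature matrix (100 samples, 4 descriptors)
y = rng.normal(size=100)        # hypothetical labels (e.g., log solubility)

# First carve out 20% for testing, then split the rest 75:25 -> 60:20 overall.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```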
Loss Functions
• L1 loss minimizes the sum of the absolute differences between the true and predicted values: L1 = Σ|yᵢ − ŷᵢ|.
• L2 loss minimizes the sum of the squared differences between the true and predicted values: L2 = Σ(yᵢ − ŷᵢ)². Because errors are squared, predictions close to the true value incur a small penalty while large errors are penalized heavily (see the sketch below).
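A minimal sketch computing both losses with NumPy on toy values:

```python
# Minimal sketch: L1 (absolute) and L2 (squared) losses.
import numpy as np

y_true = np.array([-1.8, -0.5, -3.2])   # actual values (e.g., log solubility, toy data)
y_pred = np.array([-1.5, -0.9, -3.0])   # model predictions

l1 = np.sum(np.abs(y_true - y_pred))    # sum of absolute differences
l2 = np.sum((y_true - y_pred) ** 2)     # sum of squared differences

print(f"L1 = {l1:.2f}, L2 = {l2:.2f}")
```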
Supervised Learning Algorithms
• Classification: Predicts categories (e.g., "soluble" vs "insoluble").
• Regression: Predicts continuous values (e.g., LogP, IC₅₀).
| Algorithm | Classification | Regression |
| --- | --- | --- |
| k-Nearest Neighbors (KNN) | ✅ | ✅ |
| Support Vector Machine (SVM) | ✅ | ✅ |
| Decision Tree | ✅ | ✅ |
| Random Forest | ✅ | ✅ |
| Logistic Regression | ✅ | ❌ |
| Linear Regression | ❌ | ✅ |
| Naive Bayes | ✅ | ❌ |
Unsupervised Learning
• Clustering: Groups data (e.g., compound libraries clustering by chemical similarity).
• Dimensionality Reduction: Reduces features (e.g., PCA on molecular descriptors).
| Algorithm | Clustering | Dimensionality Reduction |
| --- | --- | --- |
| k-Means | ✅ | ❌ |
| Hierarchical Clustering | ✅ | ❌ |
| DBSCAN | ✅ | ❌ |
| Principal Component Analysis (PCA) | ❌ | ✅ |
| t-SNE | ❌ | ✅ |
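A minimal sketch of k-Means clustering a (synthetic) compound library by descriptor similarity:

```python
# Minimal sketch: group compounds into 3 clusters with k-Means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = rng.normal(size=(120, 8))            # hypothetical compound descriptors

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)           # cluster index for each compound
print(np.bincount(labels))               # number of compounds per cluster
```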
k-Nearest Neighbors (KNN)
• Works for Classification & Regression
• How it works:
• Finds the "k" closest data points to a new sample (based on
distance, like Euclidean distance).
• For classification, it picks the majority class among the
neighbors (e.g., most are "soluble" → predicts "soluble").
• For regression, it averages the values of the neighbors (e.g.,
averages logP values).
• Use case in drug discovery:
• Predicting whether a compound is an inhibitor (yes/no).
• Estimating a molecule’s binding affinity by averaging nearby
known compounds.
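A minimal sketch of KNN classification with scikit-learn; the random bit vectors are stand-ins for real molecular fingerprints:

```python
# Minimal sketch: KNN classification (majority vote among 5 neighbors).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 16))   # hypothetical 16-bit fingerprints
y = rng.integers(0, 2, size=50)         # 1 = inhibitor, 0 = non-inhibitor (toy labels)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 nearest neighbors (Euclidean by default)
knn.fit(X, y)
print(knn.predict(X[:3]))                  # predicted classes for the first 3 compounds
```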
Support Vector Machine (SVM)
• Supports Classification (SVC) and Regression (SVR)
• How it works:
• Finds a hyperplane that best separates data into classes (for
classification).
• For regression, it tries to fit data within a margin of error
while keeping the model simple (less overfitting).
• Can handle non-linear data using kernels (e.g., RBF kernel).
• Use case in drug discovery:
• Classifying compounds as active/inactive based on molecular
fingerprints.
• Predicting biological properties like solubility or toxicity.
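A minimal sketch of an RBF-kernel SVM classifier on toy descriptors (SVR would be the regression counterpart):

```python
# Minimal sketch: SVM classification with a non-linear RBF kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))                  # hypothetical molecular descriptors
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # synthetic active/inactive labels

svm = SVC(kernel="rbf", C=1.0)                # RBF kernel handles non-linear boundaries
svm.fit(X, y)
print(svm.predict(X[:3]))                     # predicted classes for 3 compounds
```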
Decision Tree
• Works for Classification & Regression
• How it works:
• Splits data into "yes/no" decisions based on features (e.g., "Does
logP > 2?").
• Grows branches until data is pure (all samples in a leaf belong to
the same class or close in value).
• Prone to overfitting, but Random Forest (ensemble of trees) fixes
that.
• Use case in drug discovery:
• Classifying molecules based on structural alerts for toxicity.
• Predicting ADMET properties (Absorption, Distribution,
Metabolism, Excretion, Toxicity).
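A minimal sketch of a decision tree on toy data; capping max_depth is one simple guard against the overfitting noted above:

```python
# Minimal sketch: decision tree learning a "Does LogP > 2?" style split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
X = rng.normal(loc=2.0, size=(80, 2))     # hypothetical [LogP, TPSA] values
y = (X[:, 0] > 2).astype(int)             # synthetic rule the tree should recover

tree = DecisionTreeClassifier(max_depth=3)  # shallow tree to limit overfitting
tree.fit(X, y)
print(export_text(tree, feature_names=["LogP", "TPSA"]))  # human-readable splits
```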
Random Forest (RF)
• Supports Classification & Regression
• How it works:
• Builds multiple decision trees on random
subsets of data.
• Takes a majority vote (classification) or averages
predictions (regression).
• Reduces overfitting compared to a single
decision tree.
• Use case in drug discovery:
• Predicting IC50 values of kinase inhibitors.
• Identifying active compounds from high-
throughput screening data.
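A minimal sketch of Random Forest regression on synthetic descriptors, averaging the predictions of many trees:

```python
# Minimal sketch: Random Forest regression (e.g., IC50-like target).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))                                  # hypothetical descriptors
y = X[:, 0] * 2 - X[:, 1] + rng.normal(scale=0.1, size=100)    # synthetic target

rf = RandomForestRegressor(n_estimators=100, random_state=0)   # 100 trees on random subsets
rf.fit(X, y)
print(rf.predict(X[:3]))                                       # averaged tree predictions
```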
Linear Regression
• Regression only
• How it works:
• Finds the best-fit line through data by minimizing the
difference between predicted and actual values.
• Assumes a linear relationship between features and
target (e.g., molecular weight vs solubility).
• Use case in drug discovery:
• Predicting LogP, LogD, or other physicochemical
properties.
• Modeling dose-response curves.
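A minimal sketch of fitting a best-fit line; the molecular weight/solubility values are hypothetical:

```python
# Minimal sketch: linear regression of solubility on molecular weight.
import numpy as np
from sklearn.linear_model import LinearRegression

mw = np.array([[180.2], [251.1], [342.3], [451.6]])   # molecular weights (toy values)
logS = np.array([-1.2, -2.0, -2.9, -4.1])             # hypothetical log solubility

lin = LinearRegression().fit(mw, logS)
print(lin.coef_, lin.intercept_)          # slope and intercept of the best-fit line
print(lin.predict([[300.0]]))             # predicted logS for a 300 g/mol compound
```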
Logistic Regression
• Classification only
• How it works:
• Despite the name, it's for classification (binary/multiclass).
• Uses a sigmoid function to squash predictions between 0 and 1
(probabilities).
• Predicts the likelihood of a compound being "active/inactive"
based on molecular descriptors.
• Use case in drug discovery:
• Classifying compounds as hits/non-hits in virtual screening.
• Predicting whether a molecule crosses the blood-brain barrier
(yes/no).
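A minimal sketch of logistic regression returning sigmoid probabilities on toy data (e.g., BBB-permeable yes/no):

```python
# Minimal sketch: logistic regression for binary classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 3))              # hypothetical scaled descriptors
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # synthetic permeable/not-permeable labels

logreg = LogisticRegression().fit(X, y)
print(logreg.predict_proba(X[:3]))        # sigmoid probabilities between 0 and 1
print(logreg.predict(X[:3]))              # thresholded class predictions
```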
Principal Component Analysis (PCA)
• Dimensionality Reduction (unsupervised)
• How it works:
• Reduces a high-dimensional dataset to fewer components while preserving most of the information.
• Helps visualize data and improves model performance by removing noise.
• Use case in drug discovery:
• Reducing thousands of molecular descriptors to 2D/3D for visualization.
• Preprocessing large compound datasets for faster training.
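A minimal sketch of PCA reducing 50 synthetic descriptors to 2 components for visualization:

```python
# Minimal sketch: PCA for dimensionality reduction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 50))            # 200 compounds x 50 descriptors (toy data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)               # project onto the top 2 components
print(X_2d.shape)                         # (200, 2)
print(pca.explained_variance_ratio_)      # fraction of variance each PC preserves
```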
Summary
• Machine learning models learn from data to recognize patterns, make predictions, or classify
new information.
• Features and labels are essential parts of the dataset, where features describe data points and
labels define the outcome.
• Supervised learning focuses on labeled data (e.g., classification and regression), while
unsupervised learning finds hidden patterns in unlabeled data (e.g., clustering).
• Model training, validation, and testing ensure the model
generalizes well to new, unseen data.
Further Reading
• Bzdok, D., Krzywinski, M. & Altman, N. Machine learning: a primer. Nat Methods 14, 1119–1120
(2017).
• https://medium.com/acing-ai/machine-learning-techniques-primer-60edd9d14863
• Badrulhisham F, Pogatzki-Zahn E, Segelcke D, Spisak T, Vollert J. Machine learning and artificial intelligence in neuroscience: A primer for researchers. Brain Behav Immun. 2024 Jan;115:470–479.
Think about it
Suppose you built a model that predicts a molecule as "active" with 95% accuracy, yet in the lab most of the predicted actives still fail. Why might this happen?
Thank You!