Data Analytics for Materials Science
27-737
                 A.D. (Tony) Rollett, R.A. LeSar (Iowa State Univ.)
                  Dept. Materials Sci. Eng., Carnegie Mellon University
                                    Random Forest
                                       Lecture 6
Revised: 21st Apr., 2021
Do not re-distribute these slides without instructor permission                                            1
To date, we have discussed:
 • linear algebra
 • linear regression: prediction
 • multiple linear regression: prediction
  Recap                                     2
Useful sources of information (both in Canvas):
 • The algorithm for random forests is presented on Page
   588 of Hastie et al. Elements of Statistical Learning.
 • Another useful resource for learning about random
   forests is: Leo Breiman, Random forests, Machine
   learning, 45, 5–32 (2001).
  Resources                                                 3
A decision tree is a tool for making decisions that uses a tree-like model of decisions
and their possible consequences.
A formal decision tree consists of three types of nodes:
 • Decision nodes
 • Chance nodes
 • End nodes
Decision trees are all about information and how to use it in a structured way.
We mention them here because they are the building blocks of the random forest
model and useful in their own right.
     Decision trees                                                                       4
“What feature will split the observations in a way that the resulting groups are as different from each other as possible (and the members of each resulting subgroup are as similar to each other as possible)?”

Splitting stops when the data cannot be split further.

[Figure: an example decision tree for “To play tennis or not to play tennis?”]

In a decision tree model, splits are chosen to maximize information gain. For a regression problem, the residual sum of squares (RSS) can be used; for a classification problem, the Gini index or entropy would apply (see the sketch after this slide, and the talk at https://www.slideshare.net/marinasantini1/lecture-4-decision-trees-2-entropy-information-gain-gain-ratio-55241087).

Pruning of decision trees is discussed at https://en.wikipedia.org/wiki/Decision_tree_pruning

https://towardsdatascience.com/understanding-random-forest-58381e0602d2
           Decision trees                                                                                                        5
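A minimal sketch of these splitting criteria (NumPy is assumed, and the "play tennis" labels below are made up): it computes the Gini index, the entropy, and the information gain of one candidate split.

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right, impurity=entropy):
    # impurity of the parent node minus the weighted impurity of the two children
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# Hypothetical labels: 1 = play tennis, 0 = do not play
parent = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])
left, right = parent[:5], parent[5:]   # e.g. split on Outlook == Sunny
print(gini(parent), entropy(parent), information_gain(parent, left, right))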
High entropy alloy dataset (we have seen this in the discussion of regular expressions) with compositions drawn from 24 elements and spanning five phases.
Can we predict Vickers hardness based on composition and rule-of-mixtures (ROM) density?
     Decision trees in materials research   6
“The greedy approach is based on the concept of heuristic problem solving: make an optimal local choice at each node. By making these locally optimal choices, we reach an approximately optimal solution globally.”
The algorithm can be summarized as:
1. At each stage (node), pick out the best feature as the test condition.
2. Split the node into the possible outcomes (internal nodes).
3. Repeat the above steps until all test conditions have been exhausted and every branch ends in a leaf node.
(A small worked sketch follows this slide.)
see: https://www.slideshare.net/marinasantini1/lecture-4-decision-trees-2-entropy-information-gain-gain-ratio-55241087
Courtesy of Tony Rollett.
           Decision trees in materials research                                                                                      7
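To make the greedy splitting concrete, here is a minimal sketch (assuming Python with scikit-learn, and an entirely hypothetical composition/hardness table rather than the HEA dataset from the slides) that fits a shallow regression tree and prints the splits it chose; with the default squared-error criterion, each split greedily minimizes the RSS.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical features: Al atomic fraction and rule-of-mixtures density (g/cm^3)
X = np.array([[0.00, 8.3], [0.05, 7.9], [0.08, 7.8], [0.10, 7.6],
              [0.15, 7.3], [0.18, 7.2], [0.20, 7.1], [0.25, 6.9]])
y = np.array([150., 180., 190., 210., 380., 400., 420., 510.])  # hypothetical Vickers hardness

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x_Al", "ROM_density"]))  # shows the greedy splits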
“Random forests are bagged decision tree models that split on a subset of features
on each split.” https://towardsdatascience.com/why-random-forest-is-my-favorite-machine-learning-model-b97651fa3706
     “Random forest, like its name implies,
     consists of a large number of individual
     decision trees that operate as an
     ensemble. Each individual tree in the
     random forest spits out a class prediction
     and the class with the most votes
     becomes our model’s prediction (see
     figure).”
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
      Random Forest model: basic idea                                                                                 8
The basic concept behind random forest is
based on the wisdom of crowds.
Random forest takes a large number of
uncorrelated trees (models) that operate as a
committee, which will outperform any of the
individual models.
A key feature is that the models must have
low correlation between them.
The low correlation between trees protects
each of them from their individual errors.
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
   Random Forest model: uncorrelated trees                                                      9
Decision trees are very sensitive to the data they are trained on: small changes in the training set can produce very different tree structures.
Random forest therefore lets each individual tree randomly sample the dataset with replacement.
For example, suppose we have a training dataset with N = 6 points: {1,2,3,4,5,6}. Randomly sampling the dataset with replacement might lead to something like {1,2,2,5,5,6}, which still contains N = 6 points (a small sketch follows this slide).
Note that bagging can also be used by taking subsets of the data, as we see on the
next slide.
     Random Forest model: bootstrap aggregating (bagging)                                10
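A minimal sketch of such a bootstrap sample (NumPy assumed; the actual draw depends on the random seed):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 2, 3, 4, 5, 6])                 # the N = 6 training points above
idx = rng.integers(0, len(data), size=len(data))    # N draws with replacement
bootstrap = data[idx]                                # e.g. something like [1, 2, 2, 5, 5, 6]
oob = np.setdiff1d(data, bootstrap)                  # points never drawn: the out-of-bag sample
print(bootstrap, oob)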
“Instead of building a single smoother from the complete data set, 100 bootstrap samples of the data were drawn. Each sample is different from the original data set, yet resembles it in distribution and variability. For each bootstrap sample, a LOESS smoother was fit. Predictions from these 100 smoothers were then made across the range of the data. The first 10 predicted smooth fits appear as grey lines in the figure below. The lines are clearly very wiggly and they overfit the data - a result of the bandwidth being too small.”
https://en.wikipedia.org/wiki/Bootstrap_aggregating

By taking the average of the 100 smoothers, we arrive at one bagged predictor (red line). Clearly, the mean is more stable and there is less overfit.
    Bootstrap aggregating (bagging)                                                       11
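A minimal sketch of the same experiment on synthetic data (assuming NumPy and statsmodels for the LOESS smoother; neither is specified in the slides):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)    # noisy synthetic data
grid = np.linspace(0, 10, 200)                        # points at which to predict

fits = []
for _ in range(100):                                  # 100 bootstrap samples
    idx = rng.integers(0, x.size, size=x.size)        # sample with replacement
    sm = lowess(y[idx], x[idx], frac=0.25)            # LOESS fit on this sample
    fits.append(np.interp(grid, sm[:, 0], sm[:, 1]))  # evaluate across the data range

bagged = np.mean(fits, axis=0)                        # the average = the bagged predictor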
Reducing variance
 • A natural way to reduce the variance, and hence increase the prediction accuracy, of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.
Best practice:
  • each bagged tree makes use of around 2/3 of the observations
  • the remaining 1/3 of the observations are referred to as the out-of-bag (OOB) observations (a sketch of their use follows this slide)
  • each individual tree has high variance but low bias; averaging these trees reduces the variance
  • this reduces overfitting and lowers variance without adding bias, helping to break the bias-variance trade-off
  • see later comments on the use of OOB data for testing accuracy and feature importance
     Bagging: advantages                                                                 12
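A minimal sketch of using the OOB observations as a built-in test set (scikit-learn assumed; the data are synthetic):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                    # hypothetical features
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# oob_score=True scores each observation using only the trees that did not see it
rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
print("OOB R^2:", rf.oob_score_)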
“Random forests are bagged decision tree
models that split on a subset of features
on each split.”
In addition to bagging, each tree in a random forest bases each of its splits on a random subset of the features.
In the example, while a single decision tree would consider all 4 features at every split, each tree in a random forest would base its splits on a subset of the features (see the sketch after this slide).
    Random Forest model: basic idea             13
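In scikit-learn (an assumption; other packages expose an equivalent option), the size of the random feature subset considered at each split is set by max_features. With the 4-feature iris data, max_features=2 means each split looks at only 2 randomly chosen features, whereas a single decision tree would consider all 4:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)                 # 4 features, as in the slide's example
rf = RandomForestClassifier(n_estimators=500, max_features=2, random_state=0).fit(X, y)
print(rf.score(X, y))                             # training accuracy only; see the OOB slide for a fairer check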
“The random forest is a classification algorithm consisting of many decision trees.
It uses bagging and feature randomness when building each individual tree to try
to create an uncorrelated forest of trees whose prediction by committee is more
accurate than that of any individual tree.”
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
     Random Forest model: summary                                                                 15
Decision trees:
 • trees give insight into the decision rules
 • rather fast computationally
 • predictions of single trees tend to have high variance

Random Forest:
 • "black box": rather hard to gain insight into the decision rules
 • rather slow computationally
 • smaller prediction variance, and thus usually better performance
     Decision trees versus Random Forest                                       16
•   No statistical assumptions
•   Works with any kind of data (continuous / categorical) and is intrinsically multiclass
•   Can express any function: regression or classification
•   Works well with small to medium datasets, unlike neural networks, which require large datasets
•   Can handle thousands of input variables without variable selection, and provides feature importance
•   Has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
      Random Forest: attributes                                                         17
1. How much each feature decreases the variance in a tree
  • For a forest, the variance decrease from each feature can be averaged, and the features are ranked according to this measure
  • Biased towards preferring variables with more categories
    (Bias in random forest variable importance measures: Illustrations, sources and a solution - on Canvas)
  • When the dataset has two (or more) correlated features, one may show up as highly important while the other appears unimportant (this applies to other methods too)
    - this effect is somewhat reduced by the random selection of features at each node
2. Random shuffling of the variables
  • Permute the values of each feature and measure how much the permutation decreases the accuracy of the model
  • The OOB data is passed through each tree to determine the "test error" (since the OOB observations were not used for training). See section 15.3.1 in Hastie et al.
  • For each variable, the values are permuted in the OOB data to evaluate the sensitivity to that variable (from the increase in the test error). A small sketch of both measures follows this slide.
         Random Forest model: interpretation                                                                   18
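A minimal sketch of both importance measures (scikit-learn assumed; the data are synthetic, and the permutation importance here uses a held-out split rather than the OOB data):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                                   # hypothetical features
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)     # only the first two matter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("impurity-based importances:", rf.feature_importances_)   # measure 1 above

# Measure 2: shuffle one feature at a time and record how much the score drops
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importances:", perm.importances_mean)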
R: randomForest package (available on CRAN)
Matlab: TreeBagger selects a random subset of predictors to use at each decision
split as in the random forest algorithm. (see documentation)
Mathematica: use Predict[] with Method -> "RandomForest"
There are also implementations in Python, …
Pick your favorite program and search for random forest in the documentation.
     Random Forest model: availability                                             19
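For Python, one widely used implementation (an assumption about which package you prefer) is scikit-learn:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# See the scikit-learn documentation for the tuning options
# (n_estimators, max_features, max_depth, oob_score, ...).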
QUESTIONS?
             20
“Despite the recent fast progress in materials informatics and data science, data-driven
molecular design of organic photovoltaic (OPV) materials remains challenging. We report
a screening of conjugated molecules for polymer−fullerene OPV applications by
supervised learning methods (artificial neural network (ANN) and random forest (RF)).
Approximately 1000 experimental parameters including power conversion efficiency
(PCE), molecular weight, and electronic properties are manually collected from the
literature and subjected to machine learning with digitized chemical structures. Contrary
to the low correlation coefficient in ANN, RF yields an acceptable accuracy, which is twice
that of random classification.”
Results based on 1200 points from 500 papers.
Computer-Aided Screening of Conjugated Polymers for Organic Solar Cell: Classification by Random Forest, S. Nagasawa et al., J. Phys. Chem. Lett. 9, 2639 (2018)
Random Forest model: examples from materials research                                                             21
Artificial Neural Nets (ANN) led to a relation with r = 0.37, which is not acceptable.
They represented PCE in 4 groups (their panel e) and used the RF classification shown in their panel d.
Based in part on the RF results, they demonstrated an alternative approach to the design of polymers for OPVs.
Random Forest model: examples from materials research                                                             22
Lecture 17: RF models part II
                                24