Data Analytics for Materials Science
27-737
                 A.D. (Tony) Rollett, R.A. LeSar (Iowa State Univ.)
                  Dept. Materials Sci. Eng., Carnegie Mellon University
                                    Random Forest
                                       Lecture 6
Revised: 21st Apr., 2021
Do not re-distribute these slides without instructor permission                                            1
To date, we have discussed:
 • linear algebra
 • linear regression: prediction
 • multiple linear regression: prediction
  Recap                                     2
Useful sources of information (both in Canvas):
 • The algorithm for random forests is presented on Page
   588 of Hastie et al. Elements of Statistical Learning.
 • Another useful resource for learning about random
   forests is: Leo Breiman, Random forests, Machine
   learning, 45, 5–32 (2001).
  Resources                                                 3
A decision tree is a tool for making decisions that uses a tree-like model of decisions
and their possible consequences.
A formal decision tree consists of three types of nodes:
 • Decision nodes
 • Chance nodes
 • End nodes
Decision trees are all about information and how to use it in a structured way.
We mention them here because they are the building blocks of the random forest
model and useful in their own right.
     Decision trees                                                                       4
“What feature will split the observations in a way that the resulting groups are as different from each other as possible (and the members of each resulting subgroup are as similar to each other as possible)?”

Splitting stops when the data cannot be split further.

[Figure: an example decision tree for “To play tennis or not to play tennis?”]

In a decision tree model, splits are chosen to maximize information gain. For a regression problem, the residual sum of squares (RSS) can be used; for a classification problem, the Gini index or entropy would apply (see the sketch after this slide, and the talk at https://www.slideshare.net/marinasantini1/lecture-4-decision-trees-2-entropy-information-gain-gain-ratio-55241087).

Pruning of decision trees is discussed at https://en.wikipedia.org/wiki/Decision_tree_pruning

https://towardsdatascience.com/understanding-random-forest-58381e0602d2
           Decision trees                                                                                                        5
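A minimal sketch of these splitting criteria (NumPy is assumed, and the "play tennis" labels below are made up): it computes the Gini index, the entropy, and the information gain of one candidate split.

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right, impurity=entropy):
    # impurity of the parent node minus the weighted impurity of the two children
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# Hypothetical labels: 1 = play tennis, 0 = do not play
parent = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])
left, right = parent[:5], parent[5:]   # e.g. split on Outlook == Sunny
print(gini(parent), entropy(parent), information_gain(parent, left, right))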
High entropy alloy dataset (we have seen this in the discussion of regular expressions) with compositions drawn from 24 elements and spanning five phases.
Can we predict Vickers hardness based on composition and rule-of-mixtures (ROM) density?
     Decision trees in materials research   6
“The greedy approach is based on the concept of heuristic problem solving: make an optimal local choice at each node. By making these locally optimal choices, we reach an approximately optimal solution globally.”
The algorithm can be summarized as:
1. At each stage (node), pick out the best feature as the test condition.
2. Split the node into the possible outcomes (internal nodes).
3. Repeat the above steps until all test conditions have been exhausted and every branch ends in a leaf node.
(A small worked sketch follows this slide.)
see: https://www.slideshare.net/marinasantini1/lecture-4-decision-trees-2-entropy-information-gain-gain-ratio-55241087
Courtesy of Tony Rollett.
           Decision trees in materials research                                                                                      7
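To make the greedy splitting concrete, here is a minimal sketch (assuming Python with scikit-learn, and an entirely hypothetical composition/hardness table rather than the HEA dataset from the slides) that fits a shallow regression tree and prints the splits it chose; with the default squared-error criterion, each split greedily minimizes the RSS.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical features: Al atomic fraction and rule-of-mixtures density (g/cm^3)
X = np.array([[0.00, 8.3], [0.05, 7.9], [0.08, 7.8], [0.10, 7.6],
              [0.15, 7.3], [0.18, 7.2], [0.20, 7.1], [0.25, 6.9]])
y = np.array([150., 180., 190., 210., 380., 400., 420., 510.])  # hypothetical Vickers hardness

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x_Al", "ROM_density"]))  # shows the greedy splits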
“Random forests are bagged decision tree models that split on a subset of features
on each split.” https://towardsdatascience.com/why-random-forest-is-my-favorite-machine-learning-model-b97651fa3706
     “Random forest, like its name implies,
     consists of a large number of individual
     decision trees that operate as an
     ensemble. Each individual tree in the
     random forest spits out a class prediction
     and the class with the most votes
     becomes our model’s prediction (see
     figure).”
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
      Random Forest model: basic idea                                                                                 8
The basic concept behind random forest is
based on the wisdom of crowds.
Random forest takes a large number of
uncorrelated trees (models) that operate as a
committee, which will outperform any of the
individual models.
A key feature is that the models must have
low correlation between them.
The low correlation between trees protects
each of them from their individual errors.
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
   Random Forest model: uncorrelated trees                                                      9
Decision trees are very sensitive to the data they are trained on: small changes in the training set can produce very different tree structures.
Random forest therefore lets each individual tree randomly sample the dataset with replacement.
For example, suppose we have a training dataset with N = 6 points: {1,2,3,4,5,6}. Randomly sampling the dataset with replacement might lead to something like {1,2,2,5,5,6}, which still contains N = 6 points (a small sketch follows this slide).
Note that bagging can also be used by taking subsets of the data, as we see on the
next slide.
     Random Forest model: bootstrap aggregating (bagging)                                10
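A minimal sketch of such a bootstrap sample (NumPy assumed; the actual draw depends on the random seed):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 2, 3, 4, 5, 6])                 # the N = 6 training points above
idx = rng.integers(0, len(data), size=len(data))    # N draws with replacement
bootstrap = data[idx]                                # e.g. something like [1, 2, 2, 5, 5, 6]
oob = np.setdiff1d(data, bootstrap)                  # points never drawn: the out-of-bag sample
print(bootstrap, oob)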
“Instead of building a single smoother from the complete data set, 100 bootstrap samples of the data were drawn. Each sample is different from the original data set, yet resembles it in distribution and variability. For each bootstrap sample, a LOESS smoother was fit. Predictions from these 100 smoothers were then made across the range of the data. The first 10 predicted smooth fits appear as grey lines in the figure below. The lines are clearly very wiggly and they overfit the data - a result of the bandwidth being too small.”
https://en.wikipedia.org/wiki/Bootstrap_aggregating

By taking the average of the 100 smoothers, we arrive at one bagged predictor (red line). Clearly, the mean is more stable and there is less overfit.
    Bootstrap aggregating (bagging)                                                       11
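A minimal sketch of the same experiment on synthetic data (assuming NumPy and statsmodels for the LOESS smoother; neither is specified in the slides):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.5, size=x.size)    # noisy synthetic data
grid = np.linspace(0, 10, 200)                        # points at which to predict

fits = []
for _ in range(100):                                  # 100 bootstrap samples
    idx = rng.integers(0, x.size, size=x.size)        # sample with replacement
    sm = lowess(y[idx], x[idx], frac=0.25)            # LOESS fit on this sample
    fits.append(np.interp(grid, sm[:, 0], sm[:, 1]))  # evaluate across the data range

bagged = np.mean(fits, axis=0)                        # the average = the bagged predictor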
Reducing variance
 • A natural way to reduce the variance, and hence increase the prediction accuracy, of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.
Best practice:
  • each bagged tree makes use of around 2/3 of the observations
  • the remaining 1/3 of the observations are referred to as the out-of-bag (OOB) observations (a sketch of their use follows this slide)
  • each individual tree has high variance but low bias; averaging these trees reduces the variance
  • this reduces overfitting and lowers variance without adding bias, helping to break the bias-variance trade-off
  • see later comments on the use of OOB data for testing accuracy and feature importance
     Bagging: advantages                                                                 12
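A minimal sketch of using the OOB observations as a built-in test set (scikit-learn assumed; the data are synthetic):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                                    # hypothetical features
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# oob_score=True scores each observation using only the trees that did not see it
rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0).fit(X, y)
print("OOB R^2:", rf.oob_score_)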
“Random forests are bagged decision tree
models that split on a subset of features
on each split.”
In addition to bagging, each tree in a random forest bases each of its splits on a random subset of the features.
In the example, while a single decision tree would consider all 4 features at every split, each tree in a random forest would base its splits on a subset of the features (see the sketch after this slide).
    Random Forest model: basic idea             13
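In scikit-learn (an assumption; other packages expose an equivalent option), the size of the random feature subset considered at each split is set by max_features. With the 4-feature iris data, max_features=2 means each split looks at only 2 randomly chosen features, whereas a single decision tree would consider all 4:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)                 # 4 features, as in the slide's example
rf = RandomForestClassifier(n_estimators=500, max_features=2, random_state=0).fit(X, y)
print(rf.score(X, y))                             # training accuracy only; see the OOB slide for a fairer check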
“The random forest is a classification algorithm consisting of many decision trees.
It uses bagging and feature randomness when building each individual tree to try
to create an uncorrelated forest of trees whose prediction by committee is more
accurate than that of any individual tree.”
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
     Random Forest model: summary                                                                 15
Decision trees:
 • trees give insight into the decision rules
 • rather fast computationally
 • predictions of single trees tend to have high variance

Random Forest:
 • "black box": rather hard to gain insight into the decision rules
 • rather slow computationally
 • smaller prediction variance, and thus usually better performance
     Decision trees versus Random Forest                                       16
•   No statistical assumptions
•   Works with any kind of data (continuous / categorical) and is intrinsically multiclass
•   Can express any function: regression or classification
•   Works well with small to medium datasets, unlike neural networks, which require large datasets
•   Can handle thousands of input variables without variable selection, and provides feature importance
•   Has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
      Random Forest: attributes                                                         17
1. How much each feature decreases the variance in a tree
  • For a forest, the variance decrease from each feature can be averaged, and the features are ranked according to this measure
  • Biased towards preferring variables with more categories
    (Bias in random forest variable importance measures: Illustrations, sources and a solution - on Canvas)
  • When the dataset has two (or more) correlated features, one may show up as highly important while the other appears unimportant (this applies to other methods too)
    - this effect is somewhat reduced by the random selection of features at each node
2. Random shuffling of the variables
  • Permute the values of each feature and measure how much the permutation decreases the accuracy of the model
  • The OOB data is passed through each tree to determine the "test error" (since the OOB observations were not used for training). See section 15.3.1 in Hastie et al.
  • For each variable, the values are permuted in the OOB data to evaluate the sensitivity to that variable (from the increase in the test error). A small sketch of both measures follows this slide.
         Random Forest model: interpretation                                                                   18
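A minimal sketch of both importance measures (scikit-learn assumed; the data are synthetic, and the permutation importance here uses a held-out split rather than the OOB data):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                                   # hypothetical features
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)     # only the first two matter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

print("impurity-based importances:", rf.feature_importances_)   # measure 1 above

# Measure 2: shuffle one feature at a time and record how much the score drops
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print("permutation importances:", perm.importances_mean)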
R: randomForest package (available on CRAN)
Matlab: TreeBagger selects a random subset of predictors to use at each decision
split as in the random forest algorithm. (see documentation)
Mathematica: use Predict[] with Method -> "RandomForest"
There are also implementations in Python, …
Pick your favorite program and search for random forest in the documentation.
     Random Forest model: availability                                             19
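For Python, one widely used implementation (an assumption about which package you prefer) is scikit-learn:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# See the scikit-learn documentation for the tuning options
# (n_estimators, max_features, max_depth, oob_score, ...).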
QUESTIONS?
             20
“Despite the recent fast progress in materials informatics and data science, data-driven
molecular design of organic photovoltaic (OPV) materials remains challenging. We report
a screening of conjugated molecules for polymer−fullerene OPV applications by
supervised learning methods (artificial neural network (ANN) and random forest (RF)).
Approximately 1000 experimental parameters including power conversion efficiency
(PCE), molecular weight, and electronic properties are manually collected from the
literature and subjected to machine learning with digitized chemical structures. Contrary
to the low correlation coefficient in ANN, RF yields an acceptable accuracy, which is twice
that of random classification.”
Results based on 1200 points from 500 papers.
Computer-Aided Screening of Conjugated Polymers for Organic Solar Cell: Classification by Random Forest, S. Nagasawa et al., J. Phys. Chem. Lett. 9, 2639 (2018)
Random Forest model: examples from materials research                                                             21
Artificial Neural Nets (ANN) led to a relation with r = 0.37, which is not acceptable.
They represented PCE in 4 groups (their panel e) and used the RF classification shown in their panel d.
Based in part on the RF results, they demonstrated an alternative approach to the design of polymers for OPVs.
Random Forest model: examples from materials research                                                             22
Lecture 17: RF models part II
                                24