Subject CS2
Revision Notes
For the 2019 exams
Machine learning
Booklet 12
Covering
Chapter 21 Machine learning
The Actuarial Education Company
CONTENTS
Links to the Course Notes and Syllabus
Overview
Core Reading
Past Exam Questions
Solutions to Past Exam Questions
Factsheet
Copyright agreement
All of this material is copyright. The copyright belongs to Institute and
Faculty Education Ltd, a subsidiary of the Institute and Faculty of Actuaries.
The material is sold to you for your own exclusive use. You may not hire
out, lend, give, sell, transmit electronically, store electronically or photocopy
any part of it. You must take care of your material to ensure it is not used or
copied by anyone at any time.
Legal action will be taken if these terms are infringed. In addition, we may
seek to take disciplinary action through the profession or through your
employer.
These conditions remain in force after you have finished using the course.
LINKS TO THE COURSE NOTES AND SYLLABUS
Material covered in this booklet
Chapter 21 Machine learning
These chapter numbers refer to the 2019 edition of the ActEd course notes.
Syllabus objectives covered in this booklet
The numbering of the syllabus items is the same as that used by the Institute
and Faculty of Actuaries.
5.1 Explain and apply elementary principles of machine learning.
5.1.1 Explain the main branches of machine learning and
describe examples of the types of problems typically
addressed by machine learning.
5.1.2 Explain and apply high-level concepts relevant to learning
from data.
5.1.3 Describe and give examples of key supervised and
unsupervised machine learning techniques, explaining the
difference between regression and classification and
between generative and discriminative models.
5.1.4 Explain in detail and use appropriate software to apply
machine learning techniques (eg penalised regression and
decision trees) to simple problems.
5.1.5 Demonstrate an understanding of the perspective of
statisticians, data scientists, and other quantitative
researchers from non-actuarial backgrounds.
OVERVIEW
This booklet covers Syllabus objective 5.1, which relates to machine
learning.
Machine learning involves using artificial intelligence techniques to analyse
data and make predictions. This is an important field of study at the moment
in many areas, including a number of actuarial applications. We are familiar
with software that automatically identifies spam emails and removes them
from the user’s inbox, and similar techniques can be used to identify
fraudulent insurance claims, for example.
There are two main types of machine learning algorithms that can be used.
With supervised learning we work with an existing dataset where we already
know the right answer, eg a set of insurance claims whose validity has
already been investigated. The machine learning algorithm is trained to give
the right answers for this dataset, after which it can be applied to new claims
to identify any suspicious ones.
With unsupervised learning we work with a dataset where we don’t know the
right answer. It is the job of the machine learning algorithm to look for
patterns in the data and to suggest a possible solution. For example, a
motor insurer with policies sold throughout the country may wish to divide
the country into 20 geographical areas (based on postcode) so that all the
policies sold in a particular area have similar claims experience. The
algorithm will suggest possible groupings of postcodes that might be suitable
to treat as homogeneous groups. The resulting 20 groups could then be
used for calculating premiums and for calculating the reserves the insurer
needs to hold.
Some machine learning algorithms are discussed elsewhere in the actuarial
courses (although they might not be described in this way), eg linear
regression models, where we try to find a line of best fit for predicting an
output value based on a set of input variables.
The main algorithms we will look at here are the naïve Bayes method,
decision trees and the k-means algorithm.
As the name suggests, the naïve Bayes method is based on Bayes’ theorem
from probability theory. It classifies items by comparing the likelihood of
each observation. This method requires very little data and can produce
surprisingly good results.
Decision trees (which are also called CART methods – Classification and
Regression Trees) classify items by working sequentially through a series of
questions in the form of a flow chart. This leads to a final node in the tree,
which is the predicted classification for that item. We also look at a number
of measures, including the Gini index, which can be used to measure how
effective a particular decision tree is at making correct classifications.
The k-means algorithm is based on the idea that items that would be close
together when considered as points in multi-dimensional space can be
considered to be similar. The algorithm starts by allocating the data items
randomly to k groups (or clusters), then uses an iterative process to improve
the groupings until each group forms a cluster of items that are all close
together. New items can then easily be classified by finding which of the k
groups they fall within. We can vary the number of groups so that there are
enough to separate out the different types of item effectively but not so many
as to make the groupings artificial or unmanageable.
We also consider the steps involved in a machine learning exercise and
issues that we need to consider in relation to making the best use of the
available data.
CORE READING
All of the Core Reading for the topics covered in this booklet is contained in
this section.
We have inserted paragraph numbers in some places, such as 1, 2, 3 …, to
help break up the text. These numbers do not form part of the Core
Reading.
The text given in Arial Bold font is Core Reading.
The text given in Arial Bold Italic font is additional Core Reading that is not
directly related to the topic being discussed.
____________
Chapter 21 – Machine learning
Introduction
The aim of this chapter is to provide an insight into the topic of
machine learning. Machine learning is a vast topic and the Core
Reading in this booklet will only provide a high-level introduction.
Specifically, the Core Reading has the following aims:
To provide a high-level knowledge of the various branches of
machine learning and examples of their applications, both within
general industry and within the specific sectors that actuarial work
involves. The level of knowledge targeted is such as will allow you
to identify whether any branch of machine learning would be
useful in addressing any problem you face.
To provide you with sufficient background information that you
can participate in high-level conversations related to projects
involving machine learning analyses and their results.
To describe some of the most common machine learning
techniques.
To discuss the relationship between machine learning and other
branches of data science and statistical analysis, so that you are
able to communicate effectively with other quantitative
researchers, and to understand the similarities and differences
between machine learning and other approaches.
There are many resources available to students to gain an insight into
the key elements of machine learning. One excellent resource is a
series of lectures given at Caltech by Yaser Abu-Mostafa which is
freely available online at https://work.caltech.edu/telecourse.html.
Another is A. Chalk and C. McMurtrie ‘A practical introduction to
Machine Learning concepts for actuaries’, Casualty Actuarial Society
E-forum, Spring 2016.
____________
What is machine learning?
1 Machine learning describes a set of methods by which computer
algorithms are developed and applied to data to generate information.
This information can consist simply of hidden patterns in the data, but
often the information is applied to solve a specific problem.
Examples of problems which are commonly solved in this way include:
targeting of advertising at consumers using web sites
location of stock within supermarkets to maximise turnover
forecasting of election results
prediction of which borrowers are most likely to default on a loan.
____________
Machine learning methods have become popular in recent years with
the advent of increasing quantities of data and the concomitant rapid
increase in computing power.
____________
2 In order for machine learning to be useful in tackling a problem we
need the following to apply:
A pattern should exist. If there is no pattern, there is no
information to be had, and machine learning will not help (indeed,
it might be counterproductive by ‘discovering’ patterns that do not
exist).
The pattern cannot be practically pinned down mathematically by
classical methods. If it could be pinned down, we could proceed to
describe it mathematically.
We have data relevant to the pattern.
____________
An overview of machine learning
3 The diagram below (due to Yaser Abu-Mostafa) provides an overview of
the machine learning process.
  Target function
      y = f(x_1, x_2, ...)
          ↓
  Data
      (x_{11}, x_{21}, ..., y_1)
      (x_{12}, x_{22}, ..., y_2)
      ...
      (x_{1N}, x_{2N}, ..., y_N)
          ↓
  Hypotheses                      Learning algorithm      Hypothesis
      y = h_1(x_1, x_2, ...)   →                       →      y = g(x_1, x_2, ...)
      y = h_2(x_1, x_2, ...)
      ...
      y = h_M(x_1, x_2, ...)
First, there is some target function f, which maps a set of variables, or
features, that we can measure, onto some output y. (What we term
‘variables’ or ‘covariates’ in statistical modelling, machine learning
terms ‘features’.)
Let the variables, or features, be x_1, x_2, ..., x_j, ..., x_J. Then we have:
  y = f(x_1, x_2, ..., x_J)
The target function is unknown and it is this which we are trying to
approximate. The target function might, for example, map life
insurance data such as smoking behaviour, lifestyle factors and
parental survival to life expectancy.
Second, we have data on y and x_1, x_2, ..., x_J for a sample of N
individuals.
We use the data to develop a hypothesis which relates the data to the
output. Let the hypothesis be:
  y = g(x_1, x_2, ..., x_J)
The idea is that g(x_1, x_2, ...) should be close to the unknown function
f(x_1, x_2, ...).
The way the hypothesis y = g(x_1, x_2, ...) is chosen is by trying out a
large number, say M, of hypotheses y = h_1(x_1, x_2, ...),
y = h_2(x_1, x_2, ...), ..., y = h_M(x_1, x_2, ...) on the data and using a
learning algorithm to choose among them. The hypotheses are usually
drawn from a hypothesis set, which has a general form.
So, for example, in classical linear modelling the hypothesis set might
be the set of linear relationships:
  y = w_{10} + w_{11} x_1 + w_{12} x_2 + ... + w_{1j} x_j + ... + w_{1J} x_J
  y = w_{20} + w_{21} x_1 + w_{22} x_2 + ... + w_{2j} x_j + ... + w_{2J} x_J
  ...
  y = w_{m0} + w_{m1} x_1 + w_{m2} x_2 + ... + w_{mj} x_j + ... + w_{mJ} x_J
  ...
  y = w_{M0} + w_{M1} x_1 + w_{M2} x_2 + ... + w_{Mj} x_j + ... + w_{MJ} x_J
where the w_{mj} are weights to be applied to the features. There are M
hypotheses, each with a different set of weights.
____________
Linear regression (which is covered in Subject CS1) can be viewed
within this framework. The weights are equivalent to regression
coefficients and the final hypothesis y = g ( x1, x 2 ,) is the set of
weights which ‘best fits’ the data according to some criterion, such as
minimising the squared distance between the values of y predicted by
the model and the values of y observed in reality. Of course, the
linear regression problem is typically solved in ‘one step’, whereas
many machine learning problems are solved iteratively, or in many
steps.
____________
Concepts in machine learning
4 An important difference between machine learning and many statistical
applications is that the goal of machine learning is to find an algorithm
that can predict the outcome y in previously unseen cases.
____________
In the example studied by Chalk and McMurtrie, the task was to predict
the cause codes of aviation accidents from the words in brief
narratives of the accidents.
The cause codes in their example were ‘aircraft’, ‘personnel issues’,
‘environmental issues’ and ‘organisational issues’. The idea of
classifying insurance claims in this way has wide actuarial applications
– for example, in the construction of different pricing models for
different types of claim. But if an insurer uses such cause codes, a
change in the IT system or in the staff that handle claims could result
in claims not being coded or being coded inaccurately.
____________
5 It would be useful to develop a way of using narrative descriptions of
claims to add cause codes to those for which codes are not available,
so that continuity of coding could be maintained. We might do this by
creating an algorithm which uses the claims narratives from data that
were cause-coded to work out the cause codes that were given, and
then apply this algorithm to claims that were not cause-coded.
A key element of this scenario is that we are going to apply the results
of the exercise to data that were not used to develop the algorithm.
This means that we are interested in the performance of the algorithm
not just in the sample of N cases in our data, but ‘out of sample’.
This is not always the case in statistical modelling (where we are often
content with the model which ‘fits’ our data best).
____________
The loss function
6 One way to evaluate a hypothesis is to calculate the predictions it
makes and to penalise each incorrect prediction by some loss. For
example, if the prediction involves the classification of something into
categories, we could say that each incorrectly classified case incurs a
loss of one. We then choose the hypothesis y = g(x_1, x_2, ...) by
minimising the loss function.
It can be shown that for some common algorithms (such as logistic
regression) maximising the likelihood is equivalent to minimising the
loss function.
____________
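As a rough illustration, the 0-1 loss described above is easy to compute in
R. The two vectors below are invented purely for illustration:

  actual    <- c(1, 0, 1, 1, 0)       # known classifications
  predicted <- c(1, 0, 0, 1, 1)       # a hypothesis's predictions
  loss <- sum(actual != predicted)    # each misclassified case incurs a loss of one
  loss                                # total loss for this hypothesis: 2
____________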
Model evaluation
7 When we fit statistical models to data, we have a range of criteria to
allow us to choose the ‘best’ model from among a set of models (as we
saw in Subject CS1).
But evaluating a predictive model involves more than this. Even the
‘best’ model may not be a very good predictive model. And even if it is
good, it might take a very long time to find the correct parameters, or it
might be very difficult to interpret (and explain to clients).
Model evaluation therefore involves more than just applying some
statistical criteria of ‘fit’.
____________
We illustrate some possible measures using a model designed for
classification. Consider a diagnostic test for a medical condition. The
patients who take the test either have the condition or they do not. The
test will classify (predict) patients as having the condition or not
according to whether the outcome of the test fulfils certain criteria.
____________
8 Accuracy. This is the proportion of predictions that the model gets
right. Usually we compare this proportion with the proportion
predicted by a naïve classifier (eg a classifier that puts every case into
the same category).
____________
9 The table below is known as a confusion matrix. There are four
possibilities:

                                    Test result classifies / predicts
                                    patient as having condition
                                    YES                     NO
  Patient actually    YES           True positive (TP)      False negative (FN)
  has condition       NO            False positive (FP)     True negative (TN)
____________
10 Precision is the percentage of cases classified as positive that are, in
fact, positive. Using the abbreviations in the table this is:

  Precision = TP / (TP + FP)
____________
11 Recall is the percentage of positives that we managed to identify
correctly:

  Recall = TP / (TP + FN)
____________
12 These can be combined in a single measure known as the F1 score:

  F1 score = (2 × Precision × Recall) / (Precision + Recall)
____________
13 The false positive rate is:

  False positive rate = FP / (TN + FP)
There is a trade-off between the recall (the true positive rate) and the
false positive rate (the percentage of cases which are not positives, but
which are classified as such).
____________
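To make these measures concrete, here is a short R sketch that computes each
of them from a confusion matrix. The counts TP, FP, FN and TN are invented
for illustration:

  TP <- 80; FP <- 20; FN <- 10; TN <- 90   # assumed confusion-matrix counts

  accuracy  <- (TP + TN) / (TP + FP + FN + TN)
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)              # also called the true positive rate
  f1        <- 2 * precision * recall / (precision + recall)
  fpr       <- FP / (TN + FP)              # the false positive rate
  round(c(accuracy = accuracy, precision = precision,
          recall = recall, F1 = f1, FPR = fpr), 3)
____________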
14 The trade-off between recall and the false positive rate can be
illustrated using a receiver operating characteristic (ROC) curve.
The area under the ROC provides another single-figure measure of the
efficacy of the model. The further away from the diagonal is the ROC,
the greater the area under the curve and the better the model is at
correctly classifying the cases.
____________
An example is shown below, taken from Alan Chalk and Conan
McMurtrie ‘A practical introduction to Machine Learning concepts for
actuaries’ Casualty Actuarial Society E-forum, Spring 2016.
This figure compares the ROC curves for a logistic regression model
fitted to the cause codes for aircraft accidents with a naïve model
based on random guesswork.
____________
The methods described above allow the assessment of model
performance on existing data.
But how can we assess the likely predictive performance of the model?
Can we be sure that we can use machine learning to test numerous
hypotheses and eventually pick one which will generalise acceptably to
new data?
The answer is that we can in theory (see Lectures 4-6 of Yaser
Abu-Mostafa’s course for a demonstration and proof of this).
____________
Generalisation error and model validation
The Vapnik-Chervonenkis inequality
15 Specifically, we can show that if the in-sample error is E_in(g) and the
out-of-sample error is E_out(g), then:

  P[ |E_in(g) − E_out(g)| > ε ] ≤ 4 H(2N) e^(−(1/8) ε² N)

where:
  N is the sample size
  ε is some specified tolerance
  H(N) is a polynomial in N which depends on the hypothesis set.
This equation, called the Vapnik-Chervonenkis inequality, shows that,
for large enough N, it will always be possible to use learning to
choose a hypothesis g which will make the tolerance as small as we
like.
____________
This may be true in theory, but how do we test the performance of our
model out-of-sample?
____________
Train-validation-test
16 The conventional approach in machine learning is to divide the data
into two. One part of the data (usually the majority) is used to train the
algorithm to choose the ‘best’ hypothesis from among the M
competing ones. The other is used to test the chosen hypothesis g
on data that the algorithm has not seen before.
In practice, the ‘training’ data is often split into a part used to estimate
the parameters of the model, and a part used to validate the model.
This approach is often called the train-validation-test approach. It
involves three data sets:
a training data set: the sample of data used to fit the model
a validation data set: the sample of data used to provide an
unbiased evaluation of model fit on the training dataset while
tuning model hyper-parameters (see below)
a test data set: the sample of data used to provide an unbiased
evaluation of the final model fit on the training data set.
____________
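A minimal sketch of such a split in R, assuming a data frame dat and the
60% / 20% / 20% proportions discussed later in this booklet:

  set.seed(123)                      # so that the random split is reproducible
  n   <- nrow(dat)
  idx <- sample(seq_len(n))          # a random permutation of the row indices
  train <- dat[idx[1:floor(0.6 * n)], ]
  valid <- dat[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
  test  <- dat[idx[(floor(0.8 * n) + 1):n], ]
____________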
Parameters
17 In statistical analysis, we often fit models to data, for example
regression models such as:

  y_i = β_0 + β_1 x_{1i} + ... + β_J x_{Ji} + ε_i   where ε_i ~ Normal(0, σ²)

Here the β's and σ² are the parameters of the model. Most
supervised machine learning algorithms involve models with similar
parameters. The ‘best’ values for these parameters are estimated from
the data.
Parameters are required by the model when making predictions. They
define the skill of the model when applied to your problem and they are
estimated or learned from the data. They form an integral part of the
learned model.
____________
Hyper-parameters
18 Machine learning algorithms, both supervised and unsupervised,
however, also have higher-level attributes which must also be
estimated or (in some sense) optimised. These might include:
the number of covariates J to include in a regression model
the number of categories in a classification exercise
the rate at which the model should learn from the data.
These attributes are called hyper-parameters. They cannot be
estimated from the data – indeed they must often be defined before an
algorithm can be implemented. Hyper-parameters are external to the
model and their values cannot be estimated from the data. They are
typically specified by the practitioner and may be set using heuristic
guidelines. Nevertheless, they are critical to the predictive success of
a model.
____________
In a normal linear regression model, as we include more variables, the
proportion of the variance in the dependent variable that is explained
cannot decrease. A model with more variables will, in that sense, ‘fit’
the data better than one with fewer variables. The same is true with
machine learning models, but the number of parameters in machine
learning models can be very large.
____________
Over-fitting
19 There is a risk that, if the number of parameters / features is large, the
estimates of the parameters in the model g that is chosen will reflect
idiosyncratic characteristics of the specific data set we have used to
‘train’ the model, rather than the underlying relationships between the
output, y, and the features x_1, x_2, ..., x_J. This is known as over-fitting
and is one of the biggest dangers faced by machine learning.
Over-fitting leads to the identification of patterns that are not really
there. More precisely, it leads to the identification of patterns that are
specific to the training data and do not generalise to other data sets.
____________
20 On the other hand, if the number of parameters / features is small, we
might miss important underlying relationships.
____________
21 So there is a trade-off here, between bias – the lack of fit of the model
to the training data – and variance – the tendency for the estimated
parameters to reflect the specific data we use for training the model.
____________
Validation
22 One way to assess how the predictive ability of the model changes as
the number of parameters / features increases is to withhold a portion
of the ‘training’ data and use it to validate models with different
numbers of parameters / features J . One approach is to divide the
training data into, say, s slices, and to ‘train’ the model s times, using
a different slice for validation each time. This is called s-fold
cross-validation.
____________
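A minimal sketch of s-fold cross-validation in R, assuming a data frame
train with a numeric response y and using a simple linear model as the
hypothesis set:

  set.seed(1)
  s     <- 5                                           # number of slices
  folds <- sample(rep(1:s, length.out = nrow(train)))  # allocate each case to a slice
  cv_mse <- sapply(1:s, function(k) {
    fit  <- lm(y ~ ., data = train[folds != k, ])      # train on the other s - 1 slices
    pred <- predict(fit, newdata = train[folds == k, ])
    mean((train$y[folds == k] - pred)^2)               # error on the held-out slice
  })
  mean(cv_mse)                                         # cross-validation error estimate
____________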
23 Typically, the error on the training data used to estimate the
parameters decreases as J increases. But the prediction error on the
validation data often decreases as J increases for small J , and
reaches a minimum before increasing again as J gets larger. This
suggests that models with a number of parameters / features close to
the minimum might be most suitable and perform best out-of-sample.
____________
How can we achieve a good balance between bias and variance? Put
another way, is there a method that can use all the features to choose
the final hypothesis g , but will prevent it becoming too complex so
that generalisation is poor? There is, and it is called regularisation or
penalisation.
____________
Regularisation
This approach exacts a penalty for having too many parameters.
Recall that finding the ‘best’ values of the parameters, or feature
weights, w_j in a machine learning problem involves minimising a loss
function. Let the loss function be L*(w_1, w_2, ..., w_J). Then the
hypothesis g will be chosen to be the hypothesis with a set of weights
which minimises L*(w_1, w_2, ..., w_J).
____________
25 The idea of regularisation, or penalisation, is to add to L* a cost for
model complexity. One possibility is to add a term proportional to
Σ_{j=1}^J w_j², so that we now minimise the expression:

  L*(w_1, w_2, ..., w_J) + λ Σ_{j=1}^J w_j²
____________
26 As noted earlier, since minimising the loss function is, in some models,
equivalent to maximising the likelihood, minimising this expression is
equivalent to maximising a penalised likelihood.
____________
Branches of machine learning
27 Machine learning techniques can be divided into several branches,
which we can refer to as:
supervised learning
unsupervised learning
semi-supervised learning
reinforcement learning.
28 The difference between these lies not (as one might think) in the level
of involvement of the human researcher in the development of the
algorithm, or in the supervision of the machine. Instead, it lies in the
extent to which the machine is given an instruction as to the end-point
(or target) of the analysis.
____________
Paragraphs 29 to 35 below all refer to supervised learning.
____________
Supervised learning
29 Supervised learning is associated with predictive models in which the
output is specified. Here the machine is given a specific aim (eg to use
the variables in the data to develop a model to predict whether a
person will default on a loan), and the algorithm will try to converge on
the parameters of the model which provide the ‘best’ prediction.
____________
30 Examples relevant to the actuarial profession might be:
the prediction of future lifetime at age x, T_x, or survival
probabilities from age x, P(T_x > t)
the prediction of the risk of claims being made on certain classes
of insurance.
____________
Regression vs classification
31 A distinction can be made between supervised learning that involves
the prediction of a numerical value (such as future lifetime) and
prediction of which category a case falls into (will a person default on a
loan – yes or no?). For predicting numerical values, regression models
are the normal approach, whereas predicting which category a case
falls into is essentially a classification problem, and different
algorithms, such as decision trees, are used. However, this distinction
between regression and classification is somewhat fuzzy, as there are
regression models, such as logistic regression or probit models, where
the dependent variable is categorical.
____________
(These are examples of generalised linear models, which were covered
in Subject CS1.)
____________
Generating classifications
32 Within classification algorithms, a distinction can be made between
models that generate classifications and those that discriminate
between classes.
____________
Consider the case where we have a categorical output value y and
data (covariates), x_1, x_2, .... The aim is to predict into which category
of y case i will fall given the values of the covariates for case i,
x_{1i}, x_{2i}, ....
____________
33 One approach is to model the joint probabilities P(x_1, x_2, ..., y). This
generates a classification scheme. It is then possible to evaluate the
conditional probability of being in category y, given x_1, x_2, ..., as:

  P(y | x_1, x_2, ...) = P(x_1, x_2, ..., y) / P(x_1, x_2, ...)

One problem with this approach is that the number of separate
probabilities P(x_1, x_2, ..., y) to be computed increases exponentially
with the number of covariates x_j.
This, however, can be overcome by assuming that, given the
classes y, the covariates x_j (j = 1, ..., J) are independent.
With this assumption, we have:

  P(x_1, x_2, ..., y) = P(y) ∏_{j=1}^J P(x_j | y)

This is called the naïve Bayes classifier.
____________
Discriminating between classes
34 An alternative method is to model the conditional probability
P(y | x_1, x_2, ...) directly, and to find, say, a linear combination of the x_k
that best discriminates between the categories of y. This is the aim of
a method known as discriminant analysis, which is effectively the same
as logistic regression.
____________
35 Other supervised learning techniques described in machine learning
textbooks include the perceptron, neural networks and support vector
machines.
____________
Unsupervised learning
36 Unsupervised machine learning techniques operate without a target for
the algorithm to aim at. We might, for example, set the machine the
task of identifying clusters within the data.
Given a set of covariates, the idea is that the machine should try to find
groups of cases which are similar to one another but different from
cases in other groups. In the language we used in the exposed to risk
chapter, we try to divide the data into homogeneous classes. However,
we may not tell the machine in advance what the characteristics of
each of these classes should be, or even how many such classes there
should be. We allow the machine to determine these given a set of
rules which form part of the algorithm. Machine learning where the
output is not specified in advance is called unsupervised learning.
____________
37 Examples of unsupervised learning techniques include cluster
analysis, and the use of association rules such as the apriori algorithm.
____________
38 Apart from their use to divide data into homogeneous classes,
unsupervised learning techniques are commonly used with very large
data sets. Examples would be market basket analysis, which uses data
generated from retail transactions to identify items which are
commonly purchased together, and text analysis.
____________
Semi-supervised learning
39 It is possible to perform machine learning analysis by using a mixture
of supervised and unsupervised learning. For example, cluster
analysis could be used to identify clusters. These clusters could then
be labelled using a variable y , and a supervised classification
algorithm such as naïve Bayes or logistic regression used to develop
predictions of the class into which each case would fall.
This makes obvious sense if the clusters identified by the
unsupervised learner make substantive sense for the problem at hand.
But even if your clusters do not make sense to you (a human), you will
have constructed a machine called an autoencoder – which can
considerably speed up any future modelling analysis.
____________
Reinforcement learning
40 In reinforcement learning the learner is not given a target output in the
same way as with supervised learning. The learner uses the input data
to choose some output, and is then told how well it is doing, or how
close the chosen output is to the desired output. The learner can then
use this information as well as the input data to choose another
hypothesis.
____________
41 Example
Imagine a world that can be modelled as a finite-state discrete-time
stochastic process with state space S. An agent in this world who is
in state u at time t can take many possible actions, A_l, and each of
these actions will result in a probability that the agent is in state v at
time t + 1. We can define two functions:
  the state transition function, P(X_{t+1} = v | X_t = u, A_l), and
  the observation, or output, function P(Y | X_t = u, A_l).
Some values of Y are more desirable than others, and we want the
agent to take the actions which will lead to desirable outcomes of Y.
How do we achieve this? The agent does not know the future, and
cannot necessarily see how the actions taken at time t will enhance or
reduce the probabilities of Y.
One possibility is to define a reward function E(R_t | X_t = u, A_l), in which
the reward, R_t, depends on the probability that the action A_l will lead
to desirable values of Y. The agent then tries to maximise its overall
rewards (discounted as appropriate). Clearly, if the agent had full
information about the model, we could treat this as a standard
maximisation problem. But the agent does not know this: all the agent
knows is the rewards it received for particular actions at specific time
points up to the present.
____________
42 Reinforcement learning is the process by which the agent updates the
probabilities of taking particular actions on the basis of past rewards
received.
____________
Machine learning tasks can be broken down into a series of steps.
These are discussed in Paragraphs 43 to 55 below.
____________
Collecting data
43 The data must be assembled in a form suitable for analysis using
computers. Several different tools are useful for achieving this: a
spreadsheet may be used, or a database such as Microsoft Access.
____________
44 Data may come from a variety of sources, including sample surveys,
population censuses, company administration systems, databases
constructed for specific purposes (such as the Human Mortality
Database, www.mortality.org).
____________
45 During the last 20-30 years the size of datasets available for analysis by
actuaries and other researchers has increased enormously. Datasets,
such as those on purchasing behaviour collected by supermarkets,
relate to millions of transactions.
____________
Exploring and preparing the data
46 This stage can be divided into several elements:
The data need to be prepared in such a way that a computer is able
to access the information and apply a range of algorithms. If the
data are already in a spreadsheet, this may be a simple matter of
importing the data into whatever computer package is being used
to develop the algorithms. If the data are stored in complex file
formats, it will be useful to convert the data to rectangular format,
with one line per case and one column per variable. It is also
important here to recognise the nature of the variables being
analysed: are they nominal, ordinal or continuous?
Cleaning the data, replacing missing values, and checking the data
for obvious errors is an important stage of any analysis, including
machine learning.
Exploratory data analysis (EDA). In machine learning applications
it is probably not a good idea to do extensive EDA, as the outcome
might influence your choice of model and hypothesis set.
____________
Feature scaling
47 Some machine learning techniques will only work effectively if the
variables are of similar scale. We can see this by recalling that, in a
linear regression model (which we covered in Subject CS1) the
parameter, β_j, associated with covariate x_j, measures the impact on
y of a one-unit change in x_j. If x_j is measured in, say, metres, the
value of β_j will be 100 times larger than it would be with the same data
if x_j were measured in centimetres.
____________
48 In machine learning the weights w_j play the role of the β's in the
linear regression model. Consider the expression in Paragraph 25:

  L*(w_1, w_2, ..., w_J) + λ Σ_{j=1}^J w_j²

The penalty imposed for model complexity is λ Σ_{j=1}^J w_j², which clearly
depends on the weights and hence on the scale at which the features
are measured.
____________
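As a brief illustration, here are two common scalings in R, where X is an
assumed numeric matrix of features:

  X_std <- scale(X)          # z-scores: subtract each column's mean, divide by its sd

  # min-max scaling of each feature onto [0, 1]
  minmax <- function(x) (x - min(x)) / (max(x) - min(x))
  X_mm   <- apply(X, 2, minmax)
____________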
49 Descriptive statistics such as frequency distributions, measures of
central tendency and of dispersion might be useful here to establish an
appropriate scale for each feature, as will cross tabulations of nominal
or ordinal data, or correlation coefficients between continuous
variables. Pictorial representations, such as histograms and boxplots,
are invaluable.
____________
Splitting the data into the training, validation and test data sets
50 A typical split might be to use 60% of the data for training, 20% for
validation and 20% for testing. However, it depends on the problem and
not on the data. A guide might be to select enough data for the
validation data set and the testing data set so that the validation and
testing processes can function, and to allocate the rest of the data to
the training data set. In practice, this often leads to around a 60% /
20% / 20% split.
____________
Training a model on the data
51 This involves choosing a suitable machine learning algorithm using a
subset of the data. The algorithm will typically represent the data as a
model and the model will have parameters which need to be estimated
from the data.
____________
This stage is analogous to the process of fitting a model to data as
described in the chapters on regression and generalised linear models
in Subject CS1.
____________
Validation and testing
52 The model should then be validated using the 20% of the data set aside
for this purpose. This should indicate, for example, whether we are at
risk of over-fitting our data. The results of the validation exercise may
mean that further training is required.
____________
53 Once the model has been trained on a set of data, its performance
should be evaluated. How this is done may depend on the purpose of
the analysis. If the aim is prediction, then one obvious approach is to
test the model on a set of data different from the one used for
development. If the aim is to identify hidden patterns in the data, other
measures of performance may be needed.
____________
Improving model performance
54 We can measure the performance of the model by testing it on the 20%
of the data we have reserved for this purpose. The hope is that the
performance of the final hypothesis g on the ‘test’ data set is similar
to that achieved by the same hypothesis on the training data set. This
amounts to stating that the difference between the in-sample error and
the out-of-sample error, |E_in(g) − E_out(g)|, will generally be small, or that:

  P[ |E_in(g) − E_out(g)| > ε ] ≤ Z

where Z is some threshold which may depend on the precise task to
hand (the greater the value at risk, the smaller Z).
____________
55 If the performance of the model is not sufficient for the task at hand, it
may be possible to improve its performance. Sometimes the
combination of several different algorithms applied to the same data
set will produce a performance which is substantially better than any
individual model. In other cases, the use of more data might provide a
boost to performance. However, except when considering very simple
combinations of models, care should be taken not to overfit the
evaluation set.
____________
The reproducibility of research
56 It is important that data analysis be reproducible. This means that
someone else can take the same data, analyse it in the same way, and
obtain the same results. In order that an analysis be reproducible the
following criteria are necessary:
The data used should be fully described and available to other
researchers.
Any modifications to the data (eg recoding or transformation of
variables, or computation of new variables) should be clearly
described, ideally with the computer code used. In machine
learning this is often called ‘feature engineering’, whereby
combinations of features are used to create something more
meaningful.
The selection of the algorithm and the development of the model
should be described, again with computer code being made
available. This should include the parameters of the model and
how and why they were chosen.
____________
There is an inherent problem with reproducing stochastic models
(which are studied in Subject CM1), in that those of necessity have a
random element. Of course, details of the random number generator
seeds chosen, and the precise command and package used to
generate any randomness, could be presented. However, since
stochastic models are typically run many times to produce a
distribution of results, it normally suffices that the distribution of the
results is reproducible.
To ensure reproducibility in stochastic models in R, use the same
numerical seed in the function set.seed().
____________
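For example, this trivial sketch prints the same three pseudo-random numbers
twice:

  set.seed(2019)
  runif(3)        # three pseudo-random numbers
  set.seed(2019)
  runif(3)        # resetting the seed reproduces exactly the same three numbers
____________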
In Paragraphs 57 to 77 we describe some commonly-used supervised
machine learning techniques. Some of these methods are direct
extensions of methods covered elsewhere in the syllabus, notably the
linear regression and generalised linear models in Subject CS1 and the
proportional hazards models in Booklet 4.
____________
Penalised generalised linear models
57 Suppose we have a set of data with J covariates (x_1, ..., x_J) and N
cases. To fit a generalised linear model, we normally specify the link
function and the parameters of the model, β_0, ..., β_J (where β_0 is the
intercept) and maximise the likelihood L(β_0, ..., β_J | x_1, ..., x_J).
However, this might not work well in certain situations.
For example, if there are many covariates, including all of them might
make the model unstable (the estimators of the parameters β_1, ..., β_J
will have large variances) because of correlations among the x_1, ..., x_J.
If we wish to use the model for prediction on new data, this is very
undesirable. We want to be able to trust that the estimated values of
the parameters linking the covariates to the outcome variable are
solidly grounded and unlikely to shift greatly when the model is applied
to a new data set. Another way of saying this is that we only want to
include features which really do have a general effect on output.
____________
58 One way to solve this is to choose a subset of the J covariates to
include in the model. But how do we choose this? We could look at all
possible subsets of the J covariates and use criteria such as the Akaike
Information Criterion or the Bayesian Information Criterion.
These both exact a penalty for additional parameters. If the number of
parameters is J and the sample size in the data is N:

  AIC = deviance + 2J
  BIC = deviance + [log_e(N)] J

However, as J increases, the number of possible subsets rises rapidly
(with J covariates there are 2^J subsets to consider). In many machine
learning applications, J is large, and the number of cases is also large,
so that comparing all possible subsets is computationally infeasible.
____________
59 Penalised regression involves exacting a penalty for unrealistic or
extreme values of the parameters, or just having too many parameters.
The penalty may be written λ P(β_1, ..., β_J), so that we maximise:

  log_e L(β_0, ..., β_J | x_1, ..., x_J) − λ P(β_1, ..., β_J)

Two common examples of penalties are:
  ridge regression, where P(β_1, ..., β_J) = Σ_{j=1}^J β_j²
  the LASSO (Least Absolute Shrinkage and Selection Operator),
  where P(β_1, ..., β_J) = Σ_{j=1}^J |β_j|.
____________
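In R, penalised GLMs of this kind are commonly fitted with the glmnet
package. Below is a hedged sketch, assuming a data frame train with
response y; cv.glmnet also selects the regularisation parameter discussed in
the next paragraph by cross-validation:

  library(glmnet)                             # assumed to be installed
  X <- model.matrix(y ~ . - 1, data = train)  # numeric feature matrix
  ridge <- cv.glmnet(X, train$y, alpha = 0)   # penalty: lambda * sum(beta_j^2)
  lasso <- cv.glmnet(X, train$y, alpha = 1)   # penalty: lambda * sum(|beta_j|)
  coef(lasso, s = "lambda.min")               # coefficients at the CV-chosen lambda
____________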
60 The parameter λ is called the regularisation parameter, and its choice
is important. Too small a value of λ leads to over-fitting the data and
the problems associated with using just the likelihood. Too large a
value of λ means only gross effects of the covariates will be included
in the final model, and we may miss many important effects of the
covariates on the outcome.
____________
Recall from Subject CS1 that if B_1, B_2, ..., B_R constitute a partition of a
sample space S and P(B_i) ≠ 0 for i = 1, 2, ..., R, then for any event A
in S such that P(A) ≠ 0:

  P(B_r | A) = P(A | B_r) P(B_r) / P(A)     for r = 1, 2, ..., R

where:

  P(A) = Σ_{i=1}^R P(A | B_i) P(B_i)
____________
Naïve Bayes classification
61 Naïve Bayes classification uses this formula to classify cases into
mutually exclusive categories on some outcome y, on the basis of a
set of covariates x_1, ..., x_J. The events A are equivalent to the
covariates taking some set of values, and the partition B_1, B_2, ..., B_R is
the set of values that the outcome can take.
Suppose the outcome is whether or not a person will survive for 10
years. Let y_i = 1 denote the outcome that person i survives, and
y_i = 0 denote the outcome that person i dies. Then, if we have J
covariates, we can write:

  P(y_i = 1 | x_{1i}, ..., x_{Ji}) = P(x_{1i}, ..., x_{Ji} | y_i = 1) P(y_i = 1) / P(x_{1i}, ..., x_{Ji})
This is difficult to estimate because all possible combinations of the
x_1, ..., x_J need to be estimated, and all combinations are unlikely to be
in your data set.
____________
62 The naïve Bayes algorithm assumes that the values of the x_{ji} are
independent, conditional on the value of y_i.
This allows the formula to be re-written:

  P(y_i = 1 | x_{1i}, ..., x_{Ji})
    = P(x_{1i} | y_i = 1) P(x_{2i} | y_i = 1) ... P(x_{Ji} | y_i = 1) P(y_i = 1) / P(x_{1i}, ..., x_{Ji})

so that:

  P(y_i = 1 | x_{1i}, ..., x_{Ji}) ∝ P(y_i = 1) ∏_{j=1}^J P(x_{ji} | y_i = 1)
____________
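A sketch of this classifier in R, using the naiveBayes function from the
e1071 package. The data frames train and test and the binary factor
survive10 are assumptions matching the survival example above:

  library(e1071)                                    # assumed to be installed
  fit  <- naiveBayes(survive10 ~ ., data = train)   # estimates P(y) and each P(x_j | y)
  pred <- predict(fit, newdata = test)              # most probable class for each case
  table(predicted = pred, actual = test$survive10)  # confusion matrix on the test set
____________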
Decision trees are discussed in Paragraphs 63 to 77 below.
____________
Decision trees (classification and regression trees algorithm)
63 Classification and regression trees (CART) is a term introduced by Leo
Breiman to refer to decision tree algorithms that can be used for
classification or regression in predictive modelling problems.
Classically, this algorithm is referred to as ‘decision trees’ but on some
platforms like R they are referred to by the more modern term CART.
____________
64 The CART algorithm provides a foundation for important algorithms
like:
bagged decision trees
random forest
boosted decision trees.
____________
65 In bagged decision trees, we create random sub-samples of our data
with replacement, train a CART model on each sample, and (given new
data) calculate the average prediction from each model.
The representation for the CART model is a binary tree.
Each root node on a tree represents a single input variable x and a
split point on that variable (assuming the variable is numeric).
The leaf nodes of the tree contain an output variable y which is used
to make a prediction.
____________
66 Example
Given a dataset with two inputs of height in centimetres and weight in
kilograms, and the output of gender as male or female, below is a crude
example of a binary decision tree:

  Height > 180cm?
    YES → Male
    NO  → Weight > 80kg?
            YES → Male
            NO  → Female

Given an input of [height = 160cm, weight = 65kg] the above tree would
be traversed as follows:
  Node 1: Height > 180cm? No
  Node 2: Weight > 80kg? No
  Therefore, the result is: Female
With the binary tree representation of the CART model described
above, making predictions is relatively straightforward.
Given a new input, the tree is traversed by evaluating the specific input
at the root node of the tree.
____________
67 A learned binary tree is a partitioning of the input space. You can think
of each input variable as a dimension in a p-dimensional space. The
decision tree splits this up into rectangles (when there are p = 2 input
variables) or hyper-rectangles with more inputs.
New data is filtered through the tree and lands in one of the rectangles
and the output value for that rectangle is the prediction made by the
model. This gives an intuition for the type of decisions that a CART
model can make, eg boxy decision boundaries.
____________
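A minimal sketch of fitting and using such a tree in R with the rpart
package (an implementation of CART). The data frame people is an assumption
matching the height / weight example above:

  library(rpart)
  fit <- rpart(gender ~ height + weight, data = people,
               method = "class")                    # grow a classification tree
  predict(fit, newdata = data.frame(height = 160, weight = 65),
          type = "class")                           # traverse the tree to a leaf
____________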
Greedy splitting
68 Creating a binary decision tree is a process of dividing up the input
space. A ‘greedy’ approach is used to divide the space, called
recursive binary splitting.
This is a numerical procedure where all the values are lined up and
different split points are tried and tested using a cost function. The
split with the best cost (lowest cost because we minimize cost) is
selected.
All input variables and all possible split points are evaluated and
chosen in a greedy manner (ie the very best split point is chosen each
time).
____________
69 For regression problems, the cost function that is minimized to choose
split points is the sum squared error across all training samples that
fall within the rectangle:

  Σ_{i=1}^N (y_i − ŷ_i)²

where y_i is the output for the training sample and ŷ_i is the predicted
output for the rectangle.
____________
The Gini index
70 For classification, the Gini index function is used, which provides an
indication of how ‘pure’ the leaf nodes are (ie how mixed the training
data assigned to each node is):

  G = Σ_k p_k (1 − p_k)

Here p_k is the proportion of training instances with class k in the
rectangle of interest. A node that has all classes of the same type
(perfect class purity) will have G = 0, whereas a node that has a 50-50
split of classes for a binary classification problem (worst purity) will
have G = 0.5.
For a binary classification problem, this can be re-written as:

  G = 2 p_1 p_2

or:

  G = 1 − (p_1² + p_2²)
The Gini index calculation for each node is weighted by the total
number of instances in the parent node. The Gini score for a chosen
split point in a binary classification problem is therefore calculated as
follows:

  G = [1 − (g_{1,1}² + g_{1,2}²)] × (n_{g1} / n) + [1 − (g_{2,1}² + g_{2,2}²)] × (n_{g2} / n)

Here:
  g_{1,1} is the proportion of instances in group 1 for class 1, g_{1,2} for
  group 1 and class 2
  g_{2,1} is the proportion of instances in group 2 for class 1, g_{2,2} for
  group 2 and class 2
  n_{g1} and n_{g2} are the total number of instances in groups 1 and 2
  n is the total number of instances we are trying to group from the
  parent node.
____________
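These calculations are easy to reproduce. A short R sketch; the class counts
passed to gini_split are invented for illustration:

  # Gini index of a single node: G = sum over k of p_k * (1 - p_k)
  gini <- function(p) sum(p * (1 - p))
  gini(c(0.5, 0.5))   # 0.5: worst purity for a binary problem
  gini(c(1, 0))       # 0: perfect class purity

  # Weighted Gini score of a binary split; n1 and n2 are class counts in each group
  gini_split <- function(n1, n2) {
    n <- sum(n1) + sum(n2)
    (1 - sum((n1 / sum(n1))^2)) * sum(n1) / n +
      (1 - sum((n2 / sum(n2))^2)) * sum(n2) / n
  }
  gini_split(c(45, 5), c(10, 40))   # 0.25: a split separating the classes fairly well
____________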
Stopping criterion
71 The recursive binary splitting procedure described above needs to
know when to stop splitting as it works its way down the tree with the
training data.
____________
72 The stopping criterion is important as it strongly influences the
performance of the tree.
____________
73 The most common stopping procedure is to use a minimum count on
the number of training instances assigned to each leaf node. If the
count is less than some minimum then the split is not accepted and the
node is taken as a final leaf node.
The count of training members is tuned to the dataset, eg 5 or 10. It
defines how specific to the training data the tree will be. Too specific
(eg a count of 1) and the tree will overfit the training data and likely
have poor performance on the test set.
____________
Pruning
74 Pruning may be used after learning to further enhance the tree’s
performance.
____________
75 The complexity of a decision tree is defined as the number of splits in
the tree. Simpler trees are preferred. They are easy to understand (you
can print them out and show them to subject matter experts), and they
are less likely to overfit your data.
____________
76 The fastest and simplest pruning method is to work through each leaf
node in the tree and evaluate the effect of removing it using a hold-out
test set. Leaf nodes are removed only if it results in a drop in the
overall cost function on the entire test set. You stop removing nodes
when no further improvements can be made.
____________
77 More sophisticated pruning methods can be used such as cost
complexity pruning (also called ‘weakest link pruning’) where a
learning parameter (alpha) is used to weigh whether nodes can be
removed based on the size of the sub-tree.
____________
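In R, rpart supports cost complexity pruning directly. A hedged sketch,
reusing the tree fit from the earlier rpart example; the cp value shown is an
arbitrary illustration:

  printcp(fit)                     # sub-trees indexed by the complexity parameter cp
  pruned <- prune(fit, cp = 0.05)  # remove splits not worth cp in cost-complexity terms
____________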
In Paragraphs 78 to 81 we describe unsupervised learning techniques.
____________
Applications: unsupervised learning
K-means clustering
78 Suppose we have a set of data consisting of several variables (or
features) measured for a group of individuals. These might relate to
demographic characteristics, such as age, occupation, gender.
Alternatively, they might relate to life insurance policies for which we
have information such as sales channel, policy size, postcode, level of
underwriting, etc.
We might ask whether we can identify groups (clusters) of policies
which have similar characteristics. We may not know in advance what
these clusters are likely to be, or even how many there are in our data.
There are a range of clustering algorithms available, but many are
based on the K-means algorithm. This is an iterative algorithm which
starts with an initial division of the data into K clusters, and adjusts
that division in a series of steps designed to increase the homogeneity
within each cluster and to increase the heterogeneity between clusters.
____________
79 The K-means algorithm proceeds as follows. Let us suppose we have
data on J variables.
1. Choose a number of clusters, K, into which the data are to be
divided. This could be done on the basis of prior knowledge of the
problem. Alternatively, the algorithm could be run several times
with different numbers of clusters to see which produces the most
satisfactory and interpretable result. There are various measures
of within- and between-group heterogeneity, often based on within-
groups sums of squares. Comparing within-groups sums of
squares for different numbers of clusters might identify a value of
K beyond which no great increase in within-group homogeneity is
obtained.
2. Identify (perhaps arbitrarily) cluster centres in the J-dimensional
space occupied by the data. This initial location of the centres
could be done on the basis of prior knowledge of the problem to
hand, or by random assignment of cases.
3. Assign cases to the cluster centre which is nearest to them, using
some measure of distance. One common measure is Euclidean
distance:

  dist(x, k) = √( Σ_{j=1}^J (x_j − k_j)² )

Here x_j is the standardised value of covariate j for case x, and
k_j is the value of covariate j at the centre of cluster k
(k = 1, ..., K). Note that it is often necessary to standardise the
data before calculating any distance measure, for example by
assuming a normal distribution using z-scores or by assuming a
uniform distribution on (x_a, x_b), where x_a and x_b are the lowest
and highest observed values of covariate x.
4. Calculate the centroid of each cluster, using the mean values of
the data points assigned to that cluster. This centroid becomes
the new centre of each cluster.
5. Re-assign cases to the nearest cluster centre using the new cluster
centres.
Iterate steps 4 and 5 until no re-assignment of cases takes place.
____________
80 The table below shows the strengths and weaknesses of the K-means
algorithm.

Strengths:
  Uses simple principles for identifying clusters which can be
  explained in non-statistical terms
  Highly flexible and can be adapted to address nearly all its
  shortcomings with simple adjustments
  Fairly efficient and performs well

Weaknesses:
  Less sophisticated than more recent clustering algorithms
  Not guaranteed to find the optimal set of clusters because it
  incorporates a random element
  Requires a reasonable guess as to how many clusters naturally
  exist in the data

Source: B. Lantz, Machine Learning with R (Birmingham, Packt
Publishing, 2013), p. 271
The interpretation and evaluation of the results of K-means clustering
can be somewhat subjective. If the K-means exercise has been
useful, the characteristics of the clusters will be interpretable within
the context of the problem being studied, and will either confirm that
the pre-existing opinion about the existence of homogeneous groups
has an evidential base in the data, or provide new insights into the
existence of groups that were not seen before. One objective criterion
that can be examined is the size of each of the clusters. If one cluster
contains the vast majority of the cases, or there are clusters with only a
few cases, this may indicate that meaningful groups do not exist.
____________
R has several machine learning packages that will achieve K-means
clustering. One simple option is the kmeans function in R’s built-in
stats package.
____________
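A minimal sketch follows; the data frame policies and its columns are
assumptions for illustration. Setting nstart runs the algorithm from several
random starting allocations, mitigating the random element noted in the
table above:

  X  <- scale(policies[, c("age", "policy_size")])  # standardise the features first
  km <- kmeans(X, centers = 3, nstart = 25)         # K = 3 clusters, 25 random starts
  km$centers                                        # the final cluster centroids
  table(km$cluster)                                 # sizes: beware one dominant cluster
____________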
81 Another example of an unsupervised learning algorithm is principal
components analysis.
This is covered in Subject CS1.
____________
Perspectives of statisticians, data scientists and other quantitative
researchers
82 Machine learning involves the application to data of a range of
methods aimed at using data to solve real-world problems. However,
many other quantitative researchers would claim to be doing the same
thing.
It is certainly true that practitioners from other backgrounds often do
work that overlaps with machine learning. Statisticians, for example,
do data mining, data reduction using principal components analysis,
and routinely estimate logistic regression models. Nevertheless, there are
differences between the perspective of many statisticians and that
normally adopted by the users of machine learning techniques.
____________
83 Some of the challenges of communicating with other quantitative
researchers are straightforward differences of terminology. In machine
learning we talk of ‘training’ a model, or ‘training’ hyper-parameters,
whereas statisticians might talk of ‘fitting’ a model or ‘choosing’
higher-level parameters. These are really different words being used
for the same activity.
____________
84 Some of the differences in perspectives of different groups of
researchers are related to the aims of their analyses. This results in
interest focusing on different aspects of the models.
____________
This may be illustrated using logistic regression, or discriminant
analysis. The logistic regression model may be written:

$$\log\left(\frac{P(y_i = 1)}{P(y_i = 0)}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_J x_{Ji}$$

where $y$ is a binary variable dividing the data into two categories,
coded 1 and 0, $x_{1i}, \dots, x_{Ji}$ are the values of the $J$ covariates
for case $i$, and the $\beta$'s are parameters to be estimated from the
data.
Statisticians will tend to be most interested in the values and
significance of the $\beta$'s, that is, in the effect of the covariates on
the probability that a case is in either group. They will often present
these in tables of odds ratios, showing the effect of a difference in the
value of a covariate on the odds $P(y_i = 1) / P(y_i = 0)$. Often the
purpose of their analyses is to test hypotheses about the effect of a
covariate on the odds of $y_i$ being 1.
85 For example, in a clinical trial, $y_i = 1$ might denote recovery from an
illness and $y_i = 0$ denote death, $x_1$ might be a new drug treatment
and $x_2, \dots, x_J$ might be controls.

The statistician's interest is mainly in the size and significance of the
parameter $\beta_1$, and especially whether or not $\beta_1$ suggests that
the new treatment leads to an increase in the odds of recovery. How good
the model is at predicting who will recover and who will die is less of an
issue.
____________
86 In machine learning applications, however, the actual values of the
$\beta$'s are less important than the success of the model in predicting
who will recover and who will die, or at discriminating between the two
groups (those who recover and those who die). A useful model will be
one that makes successful predictions of recovery / death when tested
on new data.
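The contrast can be made concrete with a short R sketch. The data frame
trial, its columns y, x1, x2, x3, and the data frame new_cases are all
hypothetical names introduced here for illustration:

# Hypothetical data: 'trial' has a binary outcome y (1 = recovery, 0 = death),
# a treatment indicator x1 and control covariates x2, x3
fit <- glm(y ~ x1 + x2 + x3, data = trial, family = binomial)

# The statistician's focus: size and significance of the beta's,
# often reported as odds ratios
summary(fit)$coefficients
exp(coef(fit))                 # odds ratios

# The machine learning focus: predictive success on out-of-sample data
p_hat <- predict(fit, newdata = new_cases, type = "response")
pred  <- ifelse(p_hat > 0.5, 1, 0)
mean(pred == new_cases$y)      # proportion of correct predictions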
____________
87 Other criteria for assessing the usefulness of a model include
explicability, and the ability to persuade regulators and other
supervisory bodies that the model does not introduce a classification or
discrimination which is perceived as undesirable (for example, one based
on gender).
____________
In R, there is a wide range of packages that will perform machine
learning techniques. This range changes over time. See for example
https://www.r-bloggers.com/what-are-the-best-machine-learning-packages-in-r/
for an overview.
____________
PAST EXAM QUESTIONS
There are no past exam questions related to the topics covered in this
booklet.
SOLUTIONS TO PAST EXAM QUESTIONS
There are no past exam questions related to the topics covered in this
booklet.
FACTSHEET
This factsheet summarises the main methods, formulae and information
required for tackling questions on the topics in this booklet.
Machine learning
Machine learning is a set of methods by which computer algorithms can be
used to generate information.
If a data set has a collection of features $x_1, x_2, \dots, x_J$, each
associated with a corresponding output $y$, then the aim is to find a
hypothesis $y \approx g(x_1, x_2, \dots, x_J)$ that provides a good
approximation to the underlying function $y = f(x_1, x_2, \dots, x_J)$ and
hence minimises the chosen loss function.
The goal is to find an algorithm that can predict the outcome y for
previously unseen cases.
Machine learning is useful if:
– a pattern exists
– the pattern cannot be expressed using traditional mathematics
– relevant data are available.
Methods for evaluating a model
The model accuracy is the proportion of predictions the model gets right.
A confusion matrix for a two-state model looks like this:

                                    Classification
                            Positive               Negative
True state   Positive       True Positive (TP)     False Negative (FN)
             Negative       False Positive (FP)    True Negative (TN)
Useful ratios include:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall (sensitivity)} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad \text{False positive rate} = \frac{FP}{TN + FP}$$
A receiver operating characteristic (ROC) curve plots the recall (ie true
positive rate) against the false positive rate. The greater the area under the
graph, the better the model is at classifying data.
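As a quick illustration, the ratios above can be computed with a small
R helper (a sketch written for this booklet, not a library function):

# Compute evaluation ratios from the four confusion-matrix counts
confusion_metrics <- function(TP, FN, FP, TN) {
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)     # also called sensitivity / true positive rate
  f1        <- 2 * precision * recall / (precision + recall)
  fpr       <- FP / (TN + FP)     # false positive rate
  accuracy  <- (TP + TN) / (TP + FN + FP + TN)
  c(precision = precision, recall = recall, F1 = f1,
    false_positive_rate = fpr, accuracy = accuracy)
}

confusion_metrics(TP = 40, FN = 10, FP = 5, TN = 45)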
Assessing predictiveness
In-sample data refers to the data used to fit and validate the model.
Out-of-sample data refers to data that are not used to fit the model, ie data
that the algorithm has not seen before.
The Vapnik-Chervonenkis inequality states that, given a large enough
sample, it is always possible to use machine learning to choose a hypothesis
that will predict the outcomes for out-of-sample data to as high a degree of
accuracy as required.
The train-validation-test approach can be used to test the predictive
accuracy of a model on out-of-sample data. It splits the data into:
– a training data set – the in-sample data used to fit the model
– a validation data set – the in-sample data used to fine-tune the
parameters and test how well the model fits the training data
– a test data set – the out-of-sample data used to test whether the final
model is a good fit for data it has not seen before (ie it tests the
predictive accuracy of the model).
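A minimal R sketch of such a split (the data frame mydata and the
60% / 20% / 20% proportions are illustrative):

# Randomly split the rows of 'mydata' into training / validation / test sets
set.seed(1)
n   <- nrow(mydata)
idx <- sample(n)                 # random permutation of the row indices
train_set      <- mydata[idx[1:floor(0.6 * n)], ]
validation_set <- mydata[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test_set       <- mydata[idx[(floor(0.8 * n) + 1):n], ]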
Alternatively, s-fold cross-validation divides the in-sample data into
s slices. The model is then fitted s times, using a different slice of
data to validate the model each time.
Parameters
Parameters are estimated from the data and used to make predictions.
Hyper-parameters are higher level attributes of the model that can’t be
estimated from the data. They are typically specified by the practitioner. An
example of a hyper-parameter is the number of parameters J included in
the model.
Overfitting occurs when too many parameters are used in the model. It can
lead to the model identifying patterns that aren’t really there.
As the number of parameters $J$ increases:
– the model will be an increasingly good fit to the training data
– the predictive accuracy of the model will:
  – increase (for small $J$) up to a maximum
  – decrease as $J$ gets larger.
Hence, a model should achieve a balance between bias (ie fit to the training
data) and variance (ie predictive accuracy).
Regularisation
Regularisation means that a penalty is incurred if a model has too many
parameters.
One possibility is to add a term $\lambda \sum_{j=1}^{J} w_j^2$ to the
loss function, so that we now minimise:

$$L^*(w_1, w_2, \dots, w_J) = L(w_1, w_2, \dots, w_J) + \lambda \sum_{j=1}^{J} w_j^2$$
In some models, this is equivalent to maximising the penalised likelihood.
Two examples of models that use a regularisation approach are penalised
regression and penalised GLMs (both discussed below).
Categories of machine learning
Naïve Bayes algorithm
The naïve Bayes algorithm classifies a set of outcomes $y$ based on a set
of covariates $x_1, \dots, x_J$. It assumes that the values of the
covariates are independent, conditional on the value of $y_i$, so that:

$$P(y_i = 1 \mid x_{1i}, \dots, x_{Ji}) \propto P(y_i = 1) \prod_{j=1}^{J} P(x_{ji} \mid y_i = 1)$$
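One possible sketch, using the naiveBayes function from the e1071
package (the data frames claims and new_claims are hypothetical):

library(e1071)   # provides naiveBayes()

# Hypothetical data: 'claims' has a factor outcome y and covariates x1, x2
fit  <- naiveBayes(y ~ x1 + x2, data = claims)
pred <- predict(fit, newdata = new_claims)       # predicted class for each case
table(predicted = pred, actual = new_claims$y)   # confusion matrix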
Decision trees
Decision trees, also known as classification and regression techniques
(CART), ask a series of questions to classify each item. The simplest
method of construction is to use greedy splitting (which chooses the split
points that minimise the chosen cost function). Overfitting can be avoided
by applying a stopping criterion or by pruning the decision tree.
For regression problems, the split points are chosen so as to minimise the
squared error cost function:

$$\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where $y_i$ is the output for the training sample and $\hat{y}_i$ is the
predicted output for the rectangle.
For classification problems, the split points are chosen so as to minimise the
Gini index. The Gini index is a measure of ‘purity’.
In a binary classification problem, the Gini score for a leaf node (ie a
final node in the tree) is:

$$1 - p_1^2 - p_2^2$$

where $p_k$ is the proportion of sample items of class $k$ present at that
node.

The Gini index for a chosen split point (or for the whole tree) is the
weighted average of the Gini score for each node involved, weighted by the
number of items at each node. For a binary classification problem, this is:

$$G = \sum_{\text{nodes}} \frac{n_{\text{node}}}{n} \left(1 - p_1^2 - p_2^2\right)$$

In this case, $G$ must take a value between 0, which means that all the
items at each node are of the same type, and 0.5, which means that the
items at each node are of mixed types.
The Gini index makes no judgement about whether the prediction is right or
wrong. It only provides a measure of how effective the tree is at sorting
the data into homogeneous groups.
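A possible sketch uses the rpart package, which implements CART and by
default uses the Gini index for classification splits (data names
hypothetical):

library(rpart)   # recursive partitioning (CART)

# Hypothetical data: factor outcome y, covariates x1, x2
tree <- rpart(y ~ x1 + x2, data = mydata, method = "class")  # Gini splits
printcp(tree)                       # complexity table, useful for pruning
pruned <- prune(tree, cp = 0.02)    # prune to avoid overfitting (cp illustrative)
predict(pruned, newdata = new_cases, type = "class")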
Discriminant analysis
Discriminant analysis is another name for logistic regression. It estimates
the probability of a case falling into a particular category by identifying which
covariates have the greatest effect on the outcome.
Penalised regression
Penalised regression can impose a penalty for unrealistic / extreme
parameter values (as well as simply too many parameters). The penalty is
often written as $\lambda P(\beta_1, \dots, \beta_J)$, where $\lambda$ is
called the regularisation parameter. The regression then involves
maximising the penalised log-likelihood function:

$$\log_e L(\beta_0, \dots, \beta_J \mid x_1, \dots, x_J) - \lambda P(\beta_1, \dots, \beta_J)$$

A small value of $\lambda$ can lead to over-fitting. A high value of
$\lambda$ can lead to important parameters being excluded from the model.

Common penalties, illustrated in the sketch after this list, are:
– ridge regression, where $P(\beta_1, \dots, \beta_J) = \sum_{j=1}^{J} \beta_j^2$
– the Least Absolute Shrinkage and Selection Operator (LASSO), where
$P(\beta_1, \dots, \beta_J) = \sum_{j=1}^{J} |\beta_j|$.
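A possible sketch of both penalties uses the glmnet package (not
prescribed by the course; here x is a numeric matrix of covariates and y
a response vector, both hypothetical):

library(glmnet)  # penalised regression: ridge, LASSO and elastic net

# alpha = 0 gives the ridge penalty, alpha = 1 the LASSO penalty
ridge <- glmnet(x, y, alpha = 0)
lasso <- glmnet(x, y, alpha = 1)

# choose the regularisation parameter lambda by cross-validation
cv <- cv.glmnet(x, y, alpha = 1)
coef(cv, s = "lambda.min")   # fitted coefficients at the chosen lambda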
Penalised generalised linear models
Penalties can be used to decide which parameters to include in a
generalised linear model.
The Akaike Information Criterion (AIC) and Bayesian Information Criterion
(BIC) both impose a penalty for additional parameters. If the sample size
is $N$ and the number of parameters is $J$:

$$\text{AIC} = \text{deviance} + 2J \qquad \text{BIC} = \text{deviance} + J \ln N$$
where the deviance is a measure of the model’s goodness of fit. By
minimising the AIC or the BIC, we can achieve a trade-off between obtaining
a good fit to the data and minimising the number of parameters in the model.
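In R, both criteria can be extracted from a fitted GLM directly, for
example (model formulas and data hypothetical). Note that R's AIC and BIC
functions are based on $-2 \log L$ rather than the deviance; for a given
data set the two differ only by a constant, so model comparisons are
unaffected:

# Compare two hypothetical GLMs by information criterion
fit1 <- glm(y ~ x1,      data = mydata, family = poisson)
fit2 <- glm(y ~ x1 + x2, data = mydata, family = poisson)
AIC(fit1, fit2)   # smaller is better
BIC(fit1, fit2)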
Unsupervised learning
Examples of unsupervised learning include:
– cluster analysis (such as K-means clustering)
– the apriori algorithm (used for market basket analysis and text analysis)
– principal components analysis.
K-means clustering
The K-means clustering algorithm involves modelling the data values as
points in space. Starting from an initial cluster allocation (usually random),
the method finds the centroid of the data points that have been allocated to
each cluster and then reallocates each point to the cluster whose centroid it
is nearest to. This process is repeated until no further changes can be
made.
Advantages and disadvantages of K-means clustering are:
+ it uses a simple principle that can easily be explained
+ it is highly flexible and can easily be adapted to address any
shortcomings
+ it is efficient and performs well
– it is less sophisticated than more recent clustering algorithms
– it is not guaranteed to find the optimal set of clusters (because of the
random element)
– it requires a reasonable guess as to how many clusters naturally exist in
the data
– results are sensitive to units of measurement used
– clusters may have no natural interpretation
– it can’t be used unless the data have a natural numerical order.
Semi-supervised learning
It is possible to perform machine learning analysis by using a mixture of
supervised and unsupervised learning. This type of approach is known as
semi-supervised learning.
An example of semi-supervised learning is to identify a set of clusters and
then predict which cluster each case will fall into.
Reinforcement learning
In reinforcement learning, the target output is not specified in advance.
Instead, a reward function is specified, and the machine uses trial and
error to find the course of action that maximises the total reward.
Stages of machine learning
The steps required to carry out a machine learning exercise are:
– collect data
– explore and prepare the data
– scale the data
– split the data into training, validation and testing datasets – a
60% / 20% / 20% split is common
– train a model on the data
– validate and test the model – retrain the model if required
– evaluate performance by testing the model on the test dataset
– improve performance – a combination of algorithms may be used.

The model should also be reproducible, ie:
– the data used should be fully available
– modifications to the data (called feature engineering) should be
clearly described
– there should be a full description of the model, including the choice
of algorithm and key decisions made.
Perspectives of statisticians and other researchers
In machine learning, the term ‘train a model’ is used, whereas other fields
refer to ‘fitting a model’ or ‘choosing parameters’.
Machine learning is more concerned with how accurate the model is at
predicting outcomes for new data, whereas (arguably) other fields are more
concerned with the size of the fitted parameters, ie how much of an effect
the parameters have on the final outcome.
In general, models are more useful if they are easy to explain and are
acceptable to regulators.