Data analysis
‘This chapter provides an introduction to the underlying principles of data analysis, in particular
within an actuarial context.
Data analysis is the process by which data is gathered in its raw state and analysed or
processed into information which can be used for specific purposes. This chapter will
describe some of the different forms of data analysis, the steps involved in the process and
consider some of the practical problems encountered in data analytics.
Although this chapter looks at the general principles involved in data analysis, it does not deal
with the statistical techniques required to perform a data analysis. These are covered elsewhere,
in CS1 and CS2.
Three key forms of data analysis will be covered in this section:
* descriptive;
* inferential; and
* predictive.
1.1 Descriptive analysis
Data presented in its raw state can be difficult to manage and draw meaningful conclusions
from, particularly where there is a large volume of data to work with. A descriptive analysis
solves this problem by presenting the data in a simpler format, more easily understood and
interpreted by the user.
Simply put, this might involve summarising the data or presenting it in a format which
highlights any patterns or trends. A descriptive analysis is not intended to enable the user
to draw any specific conclusions. Rather, it describes the data actually presented.
For example, it’s likely to be easier to understand the trend and variation in the sterling/euro
exchange rate over the past year by looking at a graph of the daily exchange rate rather than a list
of values. The graph is likely to make the information easier to absorb.
Two key measures, or parameters, used in a descriptive analysis are the measure of central
tendency and the dispersion. The most common measurements of central tendency are the
mean, the median and the mode. Typical measurements of the dispersion are the standard
deviation and ranges such as the interquartile range. These measurements are described in
CS1.
Measures of central tendency tell us about the ‘average’ value of a data set, whereas measures of
dispersion tell us about the ‘spread’ of the values.
It can also be important to describe other aspects of the shape of the (empirical) distribution
of the data, for example by calculating measures of skewness and kurtosis.
Empirical means ‘based on observation’. So an empirical distribution relates to the distribution of
the actual data points collected, rather than any assumed underlying theoretical distribution.
Skewness is a measure of how symmetrical a data set is, and kurtosis is a measure of how likely
extreme values are to appear (ie those in the tails of the distribution). Detailed knowledge of
these measures is not required for this course.
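As a simple illustration, the summary measures above can be produced directly in R. The sketch below assumes a hypothetical numeric vector exchange_rates holding daily sterling/euro exchange rates; the values shown are invented.

    # Hypothetical data: daily sterling/euro exchange rates
    exchange_rates <- c(1.12, 1.14, 1.13, 1.16, 1.15, 1.17, 1.13, 1.18)

    mean(exchange_rates)               # central tendency: mean
    median(exchange_rates)             # central tendency: median
    sd(exchange_rates)                 # dispersion: standard deviation
    IQR(exchange_rates)                # dispersion: interquartile range
    plot(exchange_rates, type = "l")   # line graph of the daily values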
1.2 Inferential analysis
Often it is not feasible or practical to collect data in respect of the whole population,
particularly when that population is very large. For example, when conducting an opinion
poll in a large country, it may not be cost effective to survey every citizen. A practical
solution to this problem might be to gather data in respect of a sample, which is used to
represent the wider population. The analysis of the data from this sample is called
inferential analysis.
The sample analysis involves estimating the parameters as described in Section 1.1 above
and testing hypotheses. It is generally accepted that if the sample is large and taken at
random (selected without prejudice), then it quite accurately represents the statistics of the
population, such as distribution, probability, mean and standard deviation. However, this is
also contingent upon the user making reasonably correct hypotheses about the population
in order to perform the inferential analysis.
Care may need to be taken to ensure that the sample selected is likely to be representative of the
whole population. For example, an opinion poll on a national issue conducted in urban locations
on weekday afternoons between 2pm and 4pm may not accurately reflect the views of the whole
population. This is because those living in rural areas and those who regularly work during that
period are unlikely to have been surveyed, and these people might tend to have a different
viewpoint to those who have been surveyed.
Sampling, inferential analysis and parameter estimation are covered in more detail in CS1.
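As a minimal illustrative sketch, the R code below estimates the population mean from a hypothetical random sample and tests a hypothesis about it; the vector name and data values are invented.

    # Hypothetical responses from a random sample of the population
    sample_values <- c(52, 47, 55, 49, 51, 46, 53, 50, 48, 54)

    mean(sample_values)      # estimate of the population mean
    sd(sample_values)        # estimate of the population standard deviation

    # Test the hypothesis that the population mean is 50 and obtain
    # a 95% confidence interval for it
    t.test(sample_values, mu = 50)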
1.3 Predictive analysis
Predictive analysis extends the principles of inferential analysis in order for the user to
analyse past data and make predictions about future events.
It achieves this by using an existing set of data with known attributes (also known as
features), known as the training set in order to discover potentially predictive relationships.
Those relationships are tested using a different set of data, known as the test set, to assess
the strength of those relationships.
A typical example of a predictive analysis is regression analysis, which is covered in more
detail in CS1 and CS2. The simplest form of this is linear regression, where the relationship
between a scalar dependent variable and an explanatory or independent variable is
assumed to be linear and the training set is used to determine the slope and intercept of the
line. A practical example might be the relationship between a car's braking distance and its
speed.
In this example, the car's speed is the explanatory (or independent) variable and the braking
distance is the dependent variable.
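A minimal sketch of this example in R is given below. The speed and braking distance values are invented for illustration: lm() fits the line to the training set and predict() applies the fitted relationship to a separate test set.

    # Hypothetical training set: speed (mph) and braking distance (metres)
    training <- data.frame(speed = c(20, 30, 40, 50, 60, 70),
                           dist  = c(6, 14, 24, 38, 55, 75))

    # Fit a simple linear regression of braking distance on speed
    model <- lm(dist ~ speed, data = training)
    coef(model)                        # estimated intercept and slope

    # Hypothetical test set used to assess the fitted relationship
    test <- data.frame(speed = c(25, 45, 65))
    predict(model, newdata = test)     # predicted braking distances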
Question
Based on data gathered at a particular weather station on the monthly rainfall in mm (r) and the
average number of hours of sunshine per day (s), a researcher has determined the following
explanatory relationship:

s = 9 - 0.1r
Using this model:
(i) Estimate the average number of hours of sunshine per day, if the monthly rainfall is 50mm.
(ii) State the impact on the average number of hours of sunshine per day of each extra
millimetre of rainfall in a month.
Solution
(i) When r = 50, the model gives s = 9 - 0.1 × 50 = 4, ie there are 4 hours of sunshine per day on average.
(ii) For each extra millimetre of rainfall in a month, the average number of hours of sunshine
per day falls by 0.1 hours, or 6 minutes.
While the process to analyse data does not follow a set pattern of steps, it is helpful to
consider the key stages which might be used by actuaries when collecting and analysing
data.
The key steps in a data analysis process can be described as follows:
1. Develop a well-defined set of objectives which need to be met by the results of the
data analysis.
The objective may be to summarise the claims from a sickness insurance product by age,
gender and cause of claim, or to predict the outcome of the next national parliamentary
election.
2. Identify the data items required for the analysis
3. Collection of the data from appropriate sources.
The relevant data may be available internally (eg from an insurance company’s
administration department) or may need to be gathered from external sources (eg from a
local council office or government statistical service).
4. Processing and formatting data for analysis, eg inputting into a spreadsheet,
database or other model.
5. Cleaning data, eg addressing unusual, missing or inconsistent values.
6. Exploratory data analysis (see the sketch after this list), which may include:
(a) Descriptive analysis; producing summary statistics on central tendency and
spread of the data.
(b) Inferential analysis; estimating summary parameters of the wider population
of data, testing hypotheses.
(c) Predictive analysis; analysing data to make predictions about future events
or other data sets.
7. Modelling the data.
8. Communicating the results.
It will be important when communicating the results to make it clear what data was used,
what analyses were performed, what assumptions were made, the conclusion of the
analysis, and any limitations of the analysis.
9. Monitoring the process; updating the data and repeating the process if required.
A data analysis is not necessarily just a one-off exercise. An insurance company analysing
the claims from its sickness policies may wish to do this every few years to allow for the
new data gathered and to look for trends. An opinion poll company attempting to predict
an election result is likely to repeat the poll a number of times in the weeks before the
election to monitor any changes in views during the campaign period.
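As a minimal sketch of Steps 4 to 6 (referred to in the list above), the R code below reads, cleans and summarises a hypothetical file claims.csv containing sickness claim records; the file name and the columns age, gender and amount are assumptions for illustration only.

    # Step 4: read the raw data into a data frame
    claims <- read.csv("claims.csv")    # hypothetical file

    # Step 5: clean the data, eg remove records with missing or negative amounts
    claims <- claims[!is.na(claims$amount) & claims$amount >= 0, ]

    # Step 6(a): descriptive analysis - mean and spread of claim amounts by gender
    tapply(claims$amount, claims$gender, mean)
    tapply(claims$amount, claims$gender, sd)

    # Step 6(b): inferential analysis - test whether mean claim amounts differ by gender
    t.test(amount ~ gender, data = claims)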
Throughout the process, the modelling team needs to ensure that any relevant professional
guidance has been complied with. For example, the Financial Reporting Council has issued
a Technical Actuarial Standard (TAS) on the principles for Technical Actuarial Work
(TAS100) which includes principles for the use of data in technical actuarial work.
Knowledge of the detail of this TAS is not required for CM1.
Further, the modelling team should also remain aware of any legal requirements to be
complied with. Such legal requirements may include aspects around consumer/customer
data protection and gender discrimination.
Step 3 of the process described in Section 2 above refers to collection of the data needed to
meet the objectives of the analysis from appropriate sources. As consideration of Steps 3,
4, and 5 makes clear, getting data into a form ready for analysis is a process, not a single
event. Consequently, what is seen as the source of data can depend on your viewpoint.
Suppose you are conducting an analysis which involves collecting survey data from a
sample of people in the hope of drawing inferences about a wider population. If you are in
charge of the whole process, including collecting the primary data from your selected
sample, you would probably view the ‘source’ of the data as being the people in your
sample. Having collected, cleaned and possibly summarised the data you might make it
available to other investigators in JavaScript object notation (JSON) format via a web
application programming interface (API). You will then have created a secondary ‘source’
for others to use.
In this section we discuss how the characteristics of the data are determined both by the
primary source and the steps carried out to prepare it for analysis - which may include the
steps on the journey from primary to secondary source. Details of particular data formats
(such as JSON), or of the mechanisms for getting data from an external source into a local
data structure suitable for analysis, are not covered in CM1.
Primary data can be gathered as the outcome of a designed experiment or from an
observational study (which could include a survey of responses to specific questions). In
all cases, knowledge of the details of the collection process is important for a complete
understanding of the data, including possible sources of bias or inaccuracy. Issues that the
analyst should be aware of include:
+ whether the process was manual or automated;
+ limitations on the precision of the data recorded;
+ whether there was any validation at source; and
+ if data wasn’t collected automatically, how it was converted to an electronic form.
These factors can affect the accuracy and reliability of the data collected. For example:
* In a survey, an individual's salary may be specified as falling into given bands, eg £20,000 -
£29,999, £30,000 - £39,999 etc, rather than the precise value being recorded.
* If responses were collected on handwritten forms, and then manually input into a
database, there is greater scope for errors to appear.
Where randomisation has been used to reduce the effect of bias or confounding variables, it
is important to know the sampling scheme used:
. simple random sampling;
. stratified sampling; or
. another sampling method.
Question
A researcher wishes to survey 10% of a company’s workforce.
Describe how the sample could be selected using:
(a) simple random sampling
(b) stratified sampling.
Solution
(a) Simple random sampling
Using simple random sampling, each employee would have an equal chance of being selected.
This could be achieved by taking a list of the employees, allocating each a number, and then
selecting 10% of the numbers at random (either manually, or using a computer-generated
process).
(b) Stratified sampling
Using stratified sampling, the workforce would first be split into groups (or strata) defined by
specific criteria, eg level of seniority. Then 10% of each group would be selected using simple
random sampling. In this way, the resulting sample would reflect the structure of the company by
seniority.
This aims to overcome one of the issues with simple random sampling, ie that the sample
obtained does not fully reflect the characteristics of the population. With a simple random
sample, it would be possible for all those selected to be at the same level of seniority, and so be
unrepresentative of the workforce as a whole.
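Both approaches can be sketched in R. The code below uses an invented data frame employees with one row per employee and a seniority column; the names and group labels are assumptions for illustration.

    # Hypothetical workforce of 1,000 employees
    employees <- data.frame(id = 1:1000,
                            seniority = rep(c("junior", "senior", "manager"),
                                            c(600, 300, 100)))

    # (a) Simple random sampling: select 10% of employees at random
    srs <- employees[sample(nrow(employees), size = 100), ]

    # (b) Stratified sampling: select 10% at random within each seniority group
    groups <- split(employees, employees$seniority)
    strat  <- do.call(rbind,
                      lapply(groups,
                             function(g) g[sample(nrow(g), size = round(0.1 * nrow(g))), ]))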
Data may have undergone some form of pre-processing. A common example is grouping
(eg by geographical area or age band). In the past, this was often done to reduce the
amount of storage required and to make the number of calculations manageable. The scale
of computing power available now means that this is less often an issue, but data may still
be grouped: perhaps to anonymise it, or to remove the possibility of extracting sensitive (or
perhaps commercially sensitive) details.
Other aspects of the data which are determined by the collection process, and which affect
the way it is analysed include the following:
. Cross-sectional data involves recording values of the variables of interest for each
case in the sample at a single moment in time.
For example, recording the amount spent in a supermarket by each member of a loyalty
card scheme this week.
* Longitudinal data involves recording values at intervals over time.
For example, recording the amount spent in a supermarket by a particular member of a
loyalty card scheme each week for a year.
* Censored data occurs when the value of a variable is only partially known, for
example, if a subject in a survival study withdraws, or survives beyond the end of
the study: here a lower bound for the survival period is known but the exact value
isn’t.
Censoring is dealt with in detail in CS2.
. Truncated data occurs when measurements on some variables are not recorded so
are completely unknown.
For example, if we were collecting data on the periods of time for which a user's internet
connection was disrupted, but only recorded the duration of periods of disruption that
lasted 5 minutes or longer, we would have a truncated data set.
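The difference between censoring and truncation can be illustrated with a short R sketch; the disruption durations below (in minutes) are invented.

    # Hypothetical durations (in minutes) of internet connection disruptions
    durations <- c(1.5, 7.2, 0.8, 12.4, 3.1, 6.0, 25.3, 4.4)

    # Truncated data: disruptions shorter than 5 minutes are never recorded,
    # so they are entirely absent from the data set
    truncated <- durations[durations >= 5]

    # Censored data: suppose observation stops after 10 minutes, so for longer
    # disruptions only a lower bound (10 minutes) is known
    observed    <- pmin(durations, 10)
    is_censored <- durations > 10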
3.1 Big data
The term big data is not well defined but has come to be used to describe data with
characteristics that make it impossible to apply traditional methods of analysis (for
example, those which rely on a single, well-structured data set which can be manipulated
and analysed on a single computer). Typically, this means automatically collected data with
characteristics that have to be inferred from the data itself rather than known in advance
from the design of an experiment.
Given the description above, the properties that can lead data to be classified as ‘big’
include:
. size, not only does big data include a very large number of individual cases, but
each might include very many variables, a high proportion of which might have
empty (or null) values — leading to sparse data;
. speed, the data to be analysed might be arriving in real time at a very fast rate - for
example, from an array of sensors taking measurements thousands of times every
second;
. variety, big data is often composed of elements from many different sources which
could have very different structures — or is often largely unstructured;
* reliability, given the above three characteristics we can see that the reliability of
individual data elements might be difficult to ascertain and could vary over time (for
example, an internet connected sensor could go offline for a period).
Examples of ‘big data’ are:
* the information held by large online retailers on items viewed, purchased and
recommended by each of its customers
* measurements of atmospheric pressure from sensors monitored by a national
meteorological organisation
* the data held by an insurance company received from the personal activity trackers (that
monitor daily exercise, food intake and sleep, for example) of its policyholders.
Although the four points above (size, speed, variety, reliability) have been presented in the
context of big data, they are characteristics that should be considered for any data source.
For example, an actuary may need to decide if it is advisable to increase the volume of data
available for a given investigation by combining an internal data set with data available
externally. In this case, the extra processing complexity required to handle a variety of
data, plus any issues of reliability of the external data, will need to be considered.
3.2 Data security, privacy and regulation
In the design of any investigation, consideration of issues related to data security, privacy
and complying with relevant regulations should be paramount. It is especially important to
be aware that combining different data from different ‘anonymised’ sources can mean that
individual cases become identifiable.
Another point to be aware of is that just because data has been made available on the
internet, doesn’t mean that others are free to use it as they wish. This is a very
complex area and laws vary between jurisdictions.
4.1 The meaning of reproducible research
Reproducibility refers to the idea that when the results of a statistical analysis are reported,
sufficient information is provided so that an independent third party can repeat the analysis
and arrive at the same results.
In science, reproducibility is linked to the concept of replication, which refers to someone
repeating an experiment and obtaining the same (or at least consistent) results. Replication
can be hard, or expensive, or impossible, for example if:
+ the study is big;
+ the study relies on data collected at great expense or over many years; or
+ the study is of a unique occurrence (the standards of healthcare in the aftermath of a
particular event).
Due to the possible difficulties of replication, reproducibility of the statistical analysis is
often a reasonable alternative standard.
So, rather than the results of the analysis being validated by an independent third party
completely replicating the study from scratch (including gathering a new data set), the validation
is achieved by an independent third party reproducing the same results based on the same data
set.
4.2 Elements required for reproducibility
Typically, reproducibility requires the original data and the computer code to be made
available (or fully specified) so that other people can repeat the analysis and verify the
results. In all but the most trivial cases, it will be necessary to include full documentation
(eg description of each data variable, an audit trail describing the decisions made when
cleaning and processing the data, and fully documented code). Documentation of models is
covered in Subject CP2.
Fully documented code can be achieved through literate statistical programming (as defined
by Knuth, 1992) where the program includes an explanation of the program in plain
language, interspersed with code snippets. Within the R environment, a tool which allows
this is R-markdown.
A detailed knowledge of the statistical package R is not required for CM1; R is covered in CS1 and
CS2. R-markdown enables documents to be produced that include the code used, an explanation
of that code, and, if desired, the output from that code.
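For illustration only, a fragment of an R-markdown file might look like the following. The surrounding text is written in plain language, and the chunk delimited by backticks contains R code that is run when the document is produced (cars is a small data set built into R).

    A summary of the speeds and stopping distances in the cars data set:

    ```{r}
    summary(cars)
    ```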
As a simpler example, it may be possible to document the work carried out in a spreadsheet by
adding comments or annotations to explain the operations performed in particular cells, rows or
columns.
Although not strictly required to meet the definition of reproducibility, a good version
control process can ensure evolving drafts of code, documentation and reports are kept in
alignment between the various stages of development and review, and changes are
reversible if necessary. There are many tools used for version control; a popular one is git.
A detailed knowledge of the version control tool ‘git’ is not required in CM1.
In addition to version control, documenting the software environment, the computing
architecture, the operating system, the software toolchain, external dependencies and
version numbers can all be important in ensuring reproducibility.
As an example, in the R programming language, the command:

    sessionInfo()

provides information about the operating system, the version of R and the versions of the
packages being used.
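One possible way (a sketch, not a requirement) of keeping a record of this information alongside the analysis is to write it to a text file at the end of the script; the file name below is illustrative only.

    # Save details of the R version, operating system and loaded packages
    writeLines(capture.output(sessionInfo()), "session_info.txt")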
Question
Give a reason why documenting the version number of the software used can be important for
reproducibility of a data analysis.
Solution
Some functions might be available in one version of a package that are not available in another
(older) version. This could prevent someone being able to reproduce the analysis.
Where there is randomness in the statistical or machine learning techniques being used (for
example random forests or neural networks) or where simulation is used, replication will
require the random seed to be set.
Machine learning is covered in Subject CS2.
Simulation will be dealt with in more detail in the next chapter. At this point, it is sufficient to
know that each simulation that is run will be based on a series of pseudo-random numbers. So,
for example, one simulation will be based on one particular series of pseudo-random numbers,
but unless explicitly coded otherwise, a different simulation will be based on a different series of
pseudo-random numbers. The second simulation will then produce different results, rather than
replicating the original results (which is the desired outcome here).
To ensure the two simulations give the same results, they would both need to be based on the
same series of pseudo-random numbers. This is known as ‘setting the random seed’.
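A minimal sketch in R: set.seed() fixes the series of pseudo-random numbers, so re-running the same code reproduces the same simulated values.

    set.seed(123)    # fix the random seed
    rnorm(5)         # five simulated values from a standard normal distribution

    set.seed(123)    # reset the same seed ...
    rnorm(5)         # ... and exactly the same five values are produced again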
Doing things ‘by hand’ is very likely to create problems in reproducing the work. Examples
of doing things by hand are:
+ manually editing spreadsheets (rather than reading the raw data into a programming
environment and making the changes there);
. editing tables and figures (rather than ensuring that the programming environment
creates them exactly as needed);
+ downloading data manually from a website (rather than doing it programmatically; see
the sketch after this list); and
* pointing and clicking (unless the software used creates an audit trail of what has
been clicked).
‘Pointing and clicking’ relates to choosing a particular operation from an on-screen menu, for
example. This action would not ordinarily be recorded electronically.
The main thing to note here is that the more of the analysis that is performed in an automated
way, the easier it will be to reproduce by another individual. Manual interventions may be
forgotten altogether, and even if they are remembered, can be difficult to document clearly.
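As a sketch of the programmatic alternative to manual downloading, the R code below fetches a CSV file from a web address and reads it into a data frame, so the step is recorded in the code itself. The URL and file name are purely illustrative.

    # Download a (hypothetical) CSV file and read it in, rather than
    # saving it manually via a web browser
    url <- "https://example.com/survey_data.csv"    # illustrative URL only
    download.file(url, destfile = "survey_data.csv", mode = "wb")
    survey <- read.csv("survey_data.csv")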
4.3 The value of reproducibility
Many actuarial analyses are undertaken for commercial, not scientific, reasons and are not
published, but reproducibility is still valuable:
. reproducibility is necessary for a complete technical work review (which in many
cases will be a professional requirement) to ensure the analysis has been correctly
carried out and the conclusions are justified by the data and analysis;
. reproducibility may be required by external regulators and auditors;
. reproducible research is more easily extended to investigate the effect of changes to
the analysis, or to incorporate new data;
. it is often desirable to compare the results of an investigation with a similar one
carried out in the past; if the earlier investigation was reported reproducibly an
analysis of the differences between the two can be carried out with confidence;
* the discipline of reproducible research, with its emphasis on good documentation of
processes and data storage, can lead to fewer errors that need correcting in the
original work and, hence, greater efficiency.
There are some issues that reproducibility does not address:
. Reproducibility does not mean that the analysis is correct. For example, if an
incorrect distribution is assumed, the results may be wrong — even though they can
be reproduced by making the same incorrect assumption about the distribution.
However, by making clear how the results are achieved, it does allow transparency
so that incorrect analysis can be appropriately challenged.
+ If activities involved in reproducibility happen only at the end of an analysis, this
may be too late for resulting challenges to be dealt with. For example, resources
may have been moved on to other projects.
4.4 References
Further information on the material in this section is given in the references:
* Knuth, Donald E. (1992). Literate Programming. California: Stanford University
Center for the Study of Language and Information. ISBN 978-0-937073-80-3.
* Peng, R. D. (2016). Report Writing for Data Science in R.
www.Leanpub.com/reportwriting
The chapter summary starts on the next page so that you can
keep all the chapter summaries together for revision purposes.
Chapter 1 Summary
The three key forms of data analysis are:
* descriptive analysis: producing summary statistics (eg measures of central tendency
and dispersion) and presenting the data in a simpler format
* inferential analysis: using a data sample to estimate summary parameters for the
wider population from which the sample was taken, and testing hypotheses
* predictive analysis: extends the principles of inferential analysis to analyse past data
and make predictions about future events.
The key steps in the data analysis process are:
1. Develop a well-defined set of objectives which need to be met by the results of the
data analysis
2. Identify the data items required for the analysis.
3. Collection of the data from appropriate sources.
4. Processing and formatting data for analysis, eg inputting into a spreadsheet,
database or other model.
5. Cleaning data, eg addressing unusual, missing or inconsistent values.
6. Exploratory data analysis, which may include descriptive analysis, inferential analysis
or predictive analysis.
7. Modelling the data.
8. Communicating the results.
9. Monitoring the process; updating the data and repeating the process if required.
In the data collection process, the primary source of the data is the population (or
population sample) from which the ‘raw’ data is obtained. If, once the information is
collected, cleaned and possibly summarised, it is made available for others to use via a web
interface, this is then a secondary source of data.
Other aspects of the data determined by the collection process that may affect the analysis
are:
* Cross-sectional data involves recording values of the variables of interest for each
case in the sample at a single moment in time.
* Longitudinal data involves recording values at intervals over time.
* Censored data occurs when the value of a variable is only partially known.
* Truncated data occurs when measurements on some variables are not recorded so
are completely unknown.
The term ‘big data’ can be used to describe data with characteristics that make it impossible
to apply traditional methods of analysis. Typically, this means automatically collected data
with characteristics that have to be inferred from the data itself rather than known in
advance from the design of the experiment.
Properties that can lead to data being classified as ‘big’ include:
* size of the data set
* speed of arrival of the data
* variety of different sources from which the data is drawn
* reliability of the data elements might be difficult to ascertain.
Replication refers to an independent third party repeating an experiment and obtaining the
same (or at least consistent) results. Replication of a data analysis can be difficult, expensive
or impossible, so reproducibility is often used as a reasonable alternative standard.
Reproducibility refers to reporting the results of a statistical analysis in sufficient detail that
an independent third party can repeat the analysis on the same data set and arrive at the
same results.
Elements required for reproducibility:
* the original data and fully documented computer code need to be made available
* good version control
* documentation of the software used, computing architecture, operating system,
external dependencies and version numbers
* where randomness is involved in the process, replication will require the random
seed to be set
* limiting the amount of work done ‘by hand’.
1.1 The data analysis department of a mobile phone messaging app provider has gathered data on
the number of messages sent by each user of the app on each day over the past 5 years. The
geographical location of each user (by country) is also known.
(i) Describe each of the following terms as it relates to a data set, and give an example of
each as it relates to the app provider's data:
(a) cross-sectional
(b) longitudinal
(ii) Give an example of each of the following types of data analysis that could be carried out
using the app provider's data:
(a) descriptive
(b) inferential
(c) predictive.
1.2 Explain the regulatory and legal requirements that should be observed when conducting a data
analysis exercise.
1.3 A car insurer wishes to investigate whether young drivers (aged 17-25) are more likely to have an
accident in a given year than older drivers.

Describe the steps that would be followed in the analysis of data for this investigation. [7]
1.4 (i) In the context of data analysis, define the terms ‘replication’ and ‘reproducibility’. [2]
(ii) Give three reasons why replication of a data analysis can be difficult to achieve in practice. [3]
[Total 5]