BEAM078 Applied Empirical

Accounting and Finance

BEFM022 Quantitative
Research Methods
Module Leader: Dr Anthony Wood Email: a.p.wood@exeter.ac.uk
Workshop Tutor: Dr Wanling Rudkin Email: w.rudkin@exeter.ac.uk
Lecture 1
Introduction and Basic Statistical Concepts
Part 1
Introduction

Objectives
• Introduction to various research methodologies and tools
• Statistical methods
• Databases
• STATA software
• Introduction to academic research
• Preparation for dissertation or equivalent
Introduction
Structure
• 15 credits
• 10 x 2-hour lectures
• 10 x Workshops or Help Hours
• Office hours: Fridays 10-12
• Assessment: 100% Assignment
• Re-assessment: 100% resubmission of Assignment (capped at 50%)
Introduction

Assignment
• Topic: Bankruptcy Prediction
• Length: 4000 Words
• Type: Individual Assessment
• Data: Part Provided, Part Self-Collection
• Statistical Analysis: STATA
• Writing Style: Academic
• Referencing Style: APA
• The Assignment Brief can be found in full HERE
Introduction

Expectations
• Attendance
• Completion of tasks
• Ask questions
• Planning and preparation
• Completion of assignment by deadline
• Let me know if you are having difficulties (in-person/email/office hours)
Introduction
• Primary Resource – ELE

• ELE is your #1 destination for all aspects of this module


• Lecture Slides
• Workshop Questions
• Academic Papers
• Instructional Videos
• Data and Code
• Quizzes
• Assignment Details
Introduction
Recommended Text (not compulsory)
• Wooldridge – Introduction to Econometrics
• Gujarati – Basic Econometrics
• Baum – An Introduction to Modern Econometrics Using Stata
• Academic papers (introduced throughout the course)
Introduction

• STATA will be used extensively throughout this module.


• It is available for free via the software hub
• Your assignment will be conducted in STATA
• A quick start guide can be found here
• Serial numbers etc will be provided to you via the download.
Part 2
Basic Statistical Concepts
Basic Statistical Concepts
What is/are statistics?
• Statistics – science dealing with the collection, analysis, interpretation, and
presentation of numerical data.

• The practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

• Statistics is not mathematics! It is a science!

Population
the whole: a collection of persons, objects, companies, or items under consideration in a
statistical study.

Sample
part of the population from which information is collected and analysed.
Basic Statistical Concepts
Usually in statistics we do not know the true population parameters.

e.g. The ONS states the average man in England is 5ft 9in (175.3cm) tall and
weighs 13.16 stone (83.6kg)

• How do they know this?


• Did they measure every man in England?

By collecting and analysing real data samples we can make our “best guess” as to the true
“answer”.

What is an empirical/statistical study?


1) Using observation-based data to gain knowledge, prove a concept, or answer research questions.
2) Capable of being verified or disproved by observation or experiment (e.g. empirical laws).

This module concerns observation/sample-based data and how it can be used within an
empirical/statistical study.
Measures of Centre

Arithmetic Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where n is the size of the sample and $x_i$ is the value of the i-th observation within the sample.

Commonly just termed the mean or average. The arithmetic mean uses the sum of all observations within the sample. The further an observation is from the mean, the more unusual the observation is. These unusual/extreme values heavily impact the mean value.

Geometric Mean: $\left(\prod_{i=1}^{n} x_i\right)^{1/n} = \sqrt[n]{x_1 x_2 x_3 \cdots x_n}$, where n is the size of the sample.

The geometric mean is the nth root of the product of all observations within the sample. It is frequently used to average rates of change over time or to compute the growth rate of a variable. Typically negative observations are removed from the calculation.
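The two means above can be compared with a short sketch. This is Python rather than the module's STATA, and the growth factors are hypothetical values chosen only for illustration.

```python
import math


def arithmetic_mean(values):
    """Sum of the observations divided by the sample size n."""
    return sum(values) / len(values)


def geometric_mean(values):
    """nth root of the product of the observations (positive values only)."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.prod(values) ** (1 / len(values))


# Hypothetical yearly growth factors: +10%, -10%, +20%
growth_factors = [1.10, 0.90, 1.20]

am = arithmetic_mean(growth_factors)
gm = geometric_mean(growth_factors)

print(f"arithmetic mean of growth factors: {am:.4f}")
print(f"geometric mean of growth factors:  {gm:.4f}")
```

Compounding the geometric mean over the three periods reproduces the total growth (1.10 x 0.90 x 1.20), which is why it is the natural average for rates of change.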
Measures of Centre

Sample Mode: Value which occurs most frequently within the sample.

e.g. given the data:

1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,5,6,6,6,6,6,6,7,7,7,8,8,8,9,9,9,9

The mode value would be 6

A dataset can have more than one mode (bimodal, trimodal etc), or it might not have any mode (all
observations are unique).
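As a quick check of the mode example above, here is a minimal Python sketch (not the module's STATA) using the standard-library `statistics.multimode`. Note one behaviour of that function: when every observation is unique it returns all of the values, rather than reporting "no mode".

```python
from statistics import multimode

# The data from the slide above
data = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4,
        5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8,
        9, 9, 9, 9]

modes = multimode(data)          # list of the most frequent value(s)
print("mode(s):", modes)

# A bimodal example: 1 and 2 both occur twice
bimodal = multimode([1, 1, 2, 2, 3])
print("bimodal example:", bimodal)
```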
Measures of Centre

Sample Median: Numeric value separating the “higher” half of the data from the “lower” half, i.e. the value midway in the distribution of values (in the middle).

When the sample size n is odd, the median is equal to the $\left(\frac{n+1}{2}\right)$th observation.

e.g. given the data 2,5,7,11,14, the median equals the $\frac{5+1}{2}$ = 3rd observation = 7

When the sample size n is even, the median is equal to the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations.

e.g. given the data 3,9,10,20, the median is equal to the average of the 2nd and 3rd observations = $\frac{9+10}{2}$ = 9.5

Note: Equal numbers of observations lie above and below the median. It is not affected by extreme values.
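The odd/even rule for the median can be written out directly. The following is a minimal Python sketch (not the module's STATA); the function name `sample_median` is just an illustrative choice.

```python
def sample_median(values):
    """Median via the odd/even rule on the ranked observations."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd n: the ((n+1)/2)th observation (index mid when counting from 0)
        return ordered[mid]
    # even n: average of the (n/2)th and (n/2 + 1)th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


print(sample_median([2, 5, 7, 11, 14]))    # odd sample from the slide
print(sample_median([3, 9, 10, 20]))       # even sample from the slide
print(sample_median([2, 5, 7, 11, 1400]))  # an extreme value leaves it unchanged
```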
Measures of Location

2-Quantile (median): Divides the observed data into two halves and gives the “half-way
point”.

Quartiles: Dividing the observed values into quarters, or 4 equal parts (Q1, Q2, Q3)

Quintiles: Dividing the observed values into fifths, or 5 equal parts (QU1, QU2, QU3, QU4)

Deciles: Dividing the observed values into tenths, or 10 equal parts (D1, D2,…,D9)

Percentiles: Dividing the observed values of the variable into hundredths, or 100 equal parts (P1, P2,…,P99).

Note that the median is also the 50th percentile, 2nd Quartile, and 5th Decile!!
Measures of Location (Quantiles)
General rule to find the value of the (interpolated) percentile

1. Locate the position of the Yth percentile within the ranked set of observations
2. Determine the value associated with that position

Location of Yth percentile: $L_Y = (n+1)\frac{Y}{100}$

Ordered data (n = 24):

Location  Value      Location  Value
1         -18.11%    13        0.61%
2         -17.13%    14        1.06%
3         -11.65%    15        1.77%
4         -9.97%     16        2.82%
5         -8.95%     17        2.93%
6         -8.61%     18        3.14%
7         -5.10%     19        6.72%
8         -4.75%     20        6.78%
9         -2.09%     21        7.18%
10        -1.93%     22        7.87%
11        -0.87%     23        9.84%
12        -0.64%     24        10.13%

Given the ordered data above, the location of the 25th percentile = $(24+1)\frac{25}{100}$ = 6.25

In other words the interpolated percentile is a quarter of the way between the observations at locations 6 and 7 (L6 and L7).

Using linear interpolation we can determine its value, namely the value at L6 plus the fractional portion of the difference between the values at L6 and L7:

-8.61 + 0.25(-5.10 - (-8.61)) = -7.7325
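The two-step percentile rule can be sketched in a few lines of Python (not the module's STATA). The function name is illustrative, and the handling of locations that fall outside the data (clamping to the smallest or largest observation) is an assumption the slide does not cover.

```python
def interpolated_percentile(values, y):
    """Yth percentile using location L_Y = (n + 1) * Y / 100 and linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * y / 100      # 1-based position, may be fractional
    lower = int(location)             # whole part of the position
    fraction = location - lower       # fractional part of the position
    if lower < 1:                     # assumption: clamp below the first observation
        return ordered[0]
    if lower >= n:                    # assumption: clamp above the last observation
        return ordered[-1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


# The 24 ordered returns (in %) from the slide
returns = [-18.11, -17.13, -11.65, -9.97, -8.95, -8.61, -5.10, -4.75,
           -2.09, -1.93, -0.87, -0.64, 0.61, 1.06, 1.77, 2.82,
           2.93, 3.14, 6.72, 6.78, 7.18, 7.87, 9.84, 10.13]

p25 = interpolated_percentile(returns, 25)
p50 = interpolated_percentile(returns, 50)
print(f"25th percentile: {p25:.4f}")
print(f"50th percentile (median): {p50:.4f}")
```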
Measures of Variation

Range: The sample range of a variable is the difference between its maximum
and minimum values in the data set:

Range = Max −Min.

The range can never decrease, but can increase, when additional observations are included in the data.

The sample interquartile range of the variable, denoted IQR, is the difference between the first and
third quartiles of the variable, that is,

IQR = Q3 − Q1.

The IQR gives the range of the middle 50% of values.


Measures of Variation

Five Number Summary and Box Plot

Minimum, maximum and quartiles together provide information on centre and variation of the variable in a nice compact way.

The five-number summary of the variable consists of minimum, maximum, and quartiles written in increasing order: Min, Q1, Q2, Q3, Max.

A boxplot is based on the five-number summary and can be used to provide a graphical display of the centre and variation of the observed values of a variable in a data set.

Example: Price of Beef (£/100g), 54 observations.

0.11,0.17,0.11,0.15,0.10,0.11,0.21,0.20,0.14,0.14,0.23,0.25,0.07,0.09,0.10,0.10,0.19,0.11,0.19,0.17,0.12,0.12,
0.12,0.10,0.11,0.13,0.10,0.09,0.11,0.15,0.13,0.10,0.18,0.09,0.07,0.08,0.06,0.08,0.05,0.07,0.08,0.08,0.07,0.09,
0.06,0.07,0.08,0.07,0.07,0.07,0.08,0.06,0.07,0.06
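For the beef price data, the five-number summary and IQR can be computed with the same (n+1)Y/100 location rule used for percentiles. This is a Python sketch, not the module's STATA; statistical packages use several different quartile conventions, so values reported elsewhere may differ slightly from these.

```python
def interpolated_percentile(values, y):
    """Yth percentile using location (n + 1) * Y / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * y / 100
    lower = int(location)
    fraction = location - lower
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Price of beef (GBP per 100g), 54 observations from the slide
beef = [0.11, 0.17, 0.11, 0.15, 0.10, 0.11, 0.21, 0.20, 0.14, 0.14,
        0.23, 0.25, 0.07, 0.09, 0.10, 0.10, 0.19, 0.11, 0.19, 0.17,
        0.12, 0.12, 0.12, 0.10, 0.11, 0.13, 0.10, 0.09, 0.11, 0.15,
        0.13, 0.10, 0.18, 0.09, 0.07, 0.08, 0.06, 0.08, 0.05, 0.07,
        0.08, 0.08, 0.07, 0.09, 0.06, 0.07, 0.08, 0.07, 0.07, 0.07,
        0.08, 0.06, 0.07, 0.06]

five_number = {
    "Min": min(beef),
    "Q1": interpolated_percentile(beef, 25),
    "Q2": interpolated_percentile(beef, 50),
    "Q3": interpolated_percentile(beef, 75),
    "Max": max(beef),
}
value_range = five_number["Max"] - five_number["Min"]
iqr = five_number["Q3"] - five_number["Q1"]

for name, value in five_number.items():
    print(f"{name}: {value:.4f}")
print(f"Range: {value_range:.4f}  IQR: {iqr:.4f}")
```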
Measures of Variation
Box Plot of Price of Beef (£/100g), 54 observations

DATA LINK
STATA code:

graph hbox Price_of_beef, nooutside


Measures of Variation
Box Plot Height and Weight of male and female students

DATA LINK
STATA code:
graph box height, nooutside over(gender)
graph box weight, nooutside over(gender)
Measures of Variation
Sample Variance

Variance is another measure of the spread of numbers within a data set.

To calculate the variance of the data we first square, then sum, the deviation from the mean of each observation.

This is called the sum of squared deviations, which provides a measure of the total deviation from the mean for all the observed values of the variable.

VARIANCE is the average of the squared deviations. For a sample, the variance is calculated as follows:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Measures of Variation
Standard Deviation

The sample standard deviation is the most frequently used measure of variability
and is simply the square root of the variance.

For a variable x, the sample standard deviation, denoted by $s_x$ (or when no confusion arises, simply by s), is:

$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$

As a general rule of thumb, for roughly bell-shaped data about 95% of the observations will lie between the mean ± 2 standard deviations, but more on this later.
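The sample variance and standard deviation (dividing by n-1) can be worked through by hand and checked against Python's standard library. This is a Python sketch rather than the module's STATA, and it reuses the small data set 2,5,7,11,14 from the median example purely for illustration.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)
    return sum_sq_dev / (n - 1)


def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [2, 5, 7, 11, 14]           # mean = 7.8, sum of squared deviations = 90.8

var = sample_variance(data)        # 90.8 / 4
sd = sample_std_dev(data)

print(f"variance: {var:.4f}")
print(f"std dev:  {sd:.4f}")
print(f"library check: {statistics.variance(data):.4f}, {statistics.stdev(data):.4f}")
```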
Measures of Variation
Why divide by n-1 and not n?

If you ask this question to the internet or ChatGPT there is much discussion and debate.

The simplest answer however is thus:

Dividing by n-1 reduces bias in the standard deviation estimator. Why?

The sample standard deviation estimator (dividing by n-1) is an estimate of the true population standard deviation from
which the sample was drawn. Because the observed values, on average, fall closer to the sample mean than the population
mean, the sample standard deviation will most likely under-estimate the actual population standard deviation. Dividing by
n-1 attempts to correct this.

e.g. Heights.

I want to know the standard deviation of all heights in the UK (the population). If I go about the country recording people’s heights, it would be unlikely that I meet and record the very tall, or the very short. The sample would therefore have a smaller standard deviation than the true population parameter if I simply divided by n. Dividing by n-1 corrects this.
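The effect of the divisor can be seen in a small simulation. The sketch below is Python (not the module's STATA) and rests on stated assumptions: the population is standard normal (true variance 1), the sample size is 5, and the seed and number of trials are arbitrary. It compares the variance estimators, for which the n-1 correction is exact on average.

```python
import random

random.seed(1)          # arbitrary seed so the run is repeatable

n = 5                   # assumed small sample size
trials = 20000          # assumed number of repeated samples
total_div_n = 0.0
total_div_n_minus_1 = 0.0

for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]   # population variance = 1
    mean = sum(sample) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in sample)
    total_div_n += sum_sq_dev / n
    total_div_n_minus_1 += sum_sq_dev / (n - 1)

avg_div_n = total_div_n / trials
avg_div_n_minus_1 = total_div_n_minus_1 / trials

print(f"average variance estimate dividing by n:   {avg_div_n:.3f}")
print(f"average variance estimate dividing by n-1: {avg_div_n_minus_1:.3f}")
```

Dividing by n gives estimates that average close to (n-1)/n of the true variance, while dividing by n-1 averages close to the true value of 1.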
Measures of Variation
Skewness is a measure of how symmetrical the data is about the mean.

If the dispersion of data around the mean is symmetrical, then there is no skewness. For a symmetrical, single-peaked distribution the Mean, Median and Mode are identical.

A positive skewness occurs when there are more extreme values in the right-hand tail of the distribution of data. Typically:
Mean > Median > Mode

A negative skewness occurs when there are more extreme values in the left-hand tail of the distribution of data. Typically:
Mode > Median > Mean
Measures of Variation
Skewness can be calculated using the following formulae:

MS Excel calculates skewness as: $\frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^3$

STATA calculates skewness as: $\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{3/2}}$

There are also alternate calculations!!

See the excel example spreadsheet HERE which demonstrates these calculations.
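The two skewness formulae can be compared side by side. This is a Python sketch of the formulae as written on the slide, not a call into Excel or STATA; the function names and the two small data sets are illustrative assumptions.

```python
import math


def central_moment(values, r):
    """(1/n) * sum of (x_i - mean)^r."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def skewness_stata_style(values):
    """Third central moment divided by the second central moment to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def skewness_excel_style(values):
    """n / ((n-1)(n-2)) * sum of ((x_i - mean) / s)^3, with s using n - 1."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric sample
right_skewed = [1, 2, 3, 4, 20]    # hypothetical sample with a long right tail

for name, data in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    print(name,
          f"STATA-style: {skewness_stata_style(data):.4f}",
          f"Excel-style: {skewness_excel_style(data):.4f}")
```

Both versions agree on the sign; they differ only by a factor that depends on n, which shrinks towards 1 as the sample grows.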
Measures of Variation
Kurtosis

Kurtosis is an indicator of the size or spread of the data within the tails of its distribution.

If the data follows a normal distribution then the Kurtosis should be 3 (mesokurtic).

A positive/high kurtosis (>3) indicates that you have lots of data in the tails of your distribution (thinner/taller peak, heavier tails).

A negative/low kurtosis (<3)indicates that you have small amounts of data in the tails of your distribution (wider/lower peak, lighter/no tails).
Measures of Variation

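Kurtosis can be calculated using the following formulae:

MS Excel calculates kurtosis as: $\frac{n(n+1)}{(n-1)(n-2)(n-3)}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$

STATA calculates kurtosis as: $\frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]^{2}}$

Note that the Excel formula subtracts a correction term, so it reports excess kurtosis (a normal distribution gives roughly 0), whereas the STATA formula gives 3 for a normal distribution.

See the excel example spreadsheet HERE which demonstrates these calculations.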
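The two kurtosis formulae can be sketched in the same way. This is Python written from the formulae on the slide, not a call into Excel or STATA; the function names and the sample 1,2,3,4,5 are illustrative assumptions.

```python
import math


def kurtosis_stata_style(values):
    """Fourth central moment divided by the squared second central moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2


def kurtosis_excel_style(values):
    """Bias-adjusted excess kurtosis, with s computed using n - 1."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    scaled_sum = sum(((x - mean) / s) ** 4 for x in values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * scaled_sum - correction


data = [1, 2, 3, 4, 5]             # hypothetical flat (light-tailed) sample

k_stata = kurtosis_stata_style(data)
k_excel = kurtosis_excel_style(data)
print(f"STATA-style kurtosis: {k_stata:.4f}  (compare with 3)")
print(f"Excel-style kurtosis: {k_excel:.4f}  (compare with 0)")
```

Both results point the same way for this flat sample: below 3 on the STATA scale and below 0 on the Excel scale.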
Summary Statistics

An important part of any empirical research paper is providing key information (summary statistics) for the data used within the study.

This allows the reader to instantly visualise the data and be satisfied that there are no errors or extreme values that may impact the results of any analysis (more on this later).

Example taken from:

Horton, Tsipouridou, & Wood (2017). European Market Reaction to Audit Reforms, European Accounting Review 27(5).
Summary Statistics

Summary Statistics in STATA


DATA LINK

Simple statistics code:

summarize weight height


Summary Statistics

Summary Statistics in STATA


DATA LINK

More detailed statistics:

summarize weight, detail


Summary Statistics
Summary Statistics in STATA
DATA LINK

Statistics can also be generated via tabstat which gives you more control over
the output than the “summarize” command. For example:

tabstat height, by(gender) stats(n mean sd min p25 p50 p75 max)
Distribution
Distributions are a fundamental aspect of any statistical or
empirical analysis.

This part of the lecture will look at:

Discrete probability distributions


• The distribution of a discrete (categorical) variable, where variable x has a countable number of possible values.

Lecture 2 will focus on:

Continuous probability distributions


• The distribution of a continuous variable x that has a set of possible values
which is infinite and uncountable.
Discrete Probability Distributions

Imagine a hypothetical experiment consisting of a very long sequence of repeated observations on some random phenomenon.

e.g. flipping a coin or rolling a dice.

Each observation may or may not result in some particular outcome.

The probability of that outcome is defined to be the relative frequency of its occurrence, in the long run.

The probability of a particular outcome is the proportion of times that outcome would occur in a long run of repeated observations.
Discrete Probability Distributions
The probability distribution of a discrete random variable x assigns a
probability to each possible value of the variable.

Variable x has a countable number of possible values.


The probability distribution of x lists the values and their probabilities.

The probabilities $P(x_i)$ must satisfy two requirements, namely:

$0 \le P(x_i) \le 1$ and $\sum_{i} P(x_i) = 1$

(probabilities are bounded between 0 and 1, and the sum of all probabilities is equal to 1)
Discrete Probability Distributions
Example: Dice rolling simulation.

Imagine that I have 2 dice; each dice has 6 sides numbered 1 through 6.
If I roll the dice and add the two numbers, I will obtain an integer ranging between 2 and 12 (let's call this S).

We can calculate the probability of each possible outcome S as follows (there are 6 x 6 = 36 equally likely combinations):

S = 7        (6 ways, 6/36)
S = 6 or 8   (5 ways each, 5/36)
S = 5 or 9   (4 ways each, 4/36)
S = 4 or 10  (3 ways each, 3/36)
S = 3 or 11  (2 ways each, 2/36)
S = 2 or 12  (1 way each, 1/36)

This is the theoretical distribution: what I would expect to see should I roll the dice enough times.
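The theoretical distribution of S can be obtained by simply listing all 36 combinations of two dice and counting how many give each sum. This is a short Python sketch (not the module's STATA) using exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count the number of ways each sum S can occur across all 36 combinations
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Probability of each sum as an exact fraction (Fraction reduces 6/36 to 1/6)
distribution = {s: Fraction(count, 36) for s, count in sorted(ways.items())}

for s, p in distribution.items():
    print(f"S = {s:2d}: {ways[s]} way(s), P = {p}")

total_probability = sum(distribution.values())
print("sum of probabilities:", total_probability)
```

The output also confirms the two requirements of a discrete probability distribution: each probability lies between 0 and 1, and together they sum to 1.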
Discrete Probability Distributions
Example: Dice rolling simulation.
Let’s see what happens when I roll the dice 10 times.

[Figure: relative frequency of each outcome S after 10 rolls]

NOTE: Actual frequency of rolling a 7 = 3. The relative frequency is therefore 3/10 = 0.30

Because of the relatively small sample size (n=10), the relative frequencies obtained do not represent the theoretical probability distribution seen on the previous slide, so let's increase the sample size.
Discrete Probability Distributions
Example: Dice rolling simulation.
[Figure: relative frequency of each outcome S for increasing sample sizes]

As the sample size increases, the closer we get to the theoretical discrete probability distribution.
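The convergence of relative frequencies to the theoretical probabilities can be reproduced with a small simulation. This is a Python sketch (not the module's STATA); the seed and the sample sizes are arbitrary assumptions and are not the ones used to produce the slide's figures.

```python
import random
from collections import Counter


def relative_frequencies(n_rolls, rng):
    """Roll two dice n_rolls times and return the relative frequency of each sum."""
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))
    return {s: counts[s] / n_rolls for s in range(2, 13)}


rng = random.Random(42)            # arbitrary seed so the run is repeatable

for n_rolls in (10, 100, 10000, 100000):
    freqs = relative_frequencies(n_rolls, rng)
    print(f"n = {n_rolls:6d}: relative frequency of S = 7 is {freqs[7]:.4f} "
          f"(theoretical {6 / 36:.4f})")
```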
Histograms
Histograms plot the frequency or counts of discrete variables.
(as seen on the previous slide)

A histogram shows an approximate representation of the distribution of that data.

Let us plot a histogram in STATA using the ages of 102 people.

The data can be found here

and in raw format below

34,67,40,72,37,33,42,62,49,32,52,40,31,19,68,55,57,54,37,32,54,38,20,50,56,48,35,52,29,56,68,65,45,44,54,
39,29,56,43,42,22,30,26,20,48,29,34,27,40,28,45,21,42,38,29,26,62,35,28,24,44,46,39,29,27,40,22,38,42,39,
26,48,39,25,34,56,31,60,32,24,51,69,28,27,38,56,36,25,46,50,36,58,39,57,55,42,49,38,49,36,48,44
Histograms
Firstly we can tabulate the data in order to view how many people are of a particular age.

Data Link
STATA Code:

tabulate age
Histograms
We can also easily plot the histogram using the histogram command

Data Link
STATA Code:

histogram age

Each bar represents a range (bin) of ages. Here STATA automatically divides the data into 10 bins.
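To make the idea of bins concrete, the sketch below counts how many of the 102 ages fall into each of 10 equal-width bins. It is Python rather than the module's STATA, and the bin edges (equal widths from the minimum to the maximum age, with the maximum placed in the last bin) are an assumption; STATA's own bin boundaries may differ.

```python
ages = [34, 67, 40, 72, 37, 33, 42, 62, 49, 32, 52, 40, 31, 19, 68, 55, 57,
        54, 37, 32, 54, 38, 20, 50, 56, 48, 35, 52, 29, 56, 68, 65, 45, 44,
        54, 39, 29, 56, 43, 42, 22, 30, 26, 20, 48, 29, 34, 27, 40, 28, 45,
        21, 42, 38, 29, 26, 62, 35, 28, 24, 44, 46, 39, 29, 27, 40, 22, 38,
        42, 39, 26, 48, 39, 25, 34, 56, 31, 60, 32, 24, 51, 69, 28, 27, 38,
        56, 36, 25, 46, 50, 36, 58, 39, 57, 55, 42, 49, 38, 49, 36, 48, 44]


def bin_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width bins and count the values in each."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # the maximum value would land one past the end, so keep it in the last bin
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return lo, width, counts


lo, width, counts = bin_counts(ages, 10)

for i, count in enumerate(counts):
    start = lo + i * width
    print(f"{start:5.1f} to {start + width:5.1f}: {'#' * count} ({count})")
```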
Histograms
We can change the number of bins if we wish.

STATA Code: histogram age, bin(20)

STATA Code: histogram age, discrete
(because age is a discrete variable, we can use the option “discrete” which uses all possible integer values within the data range)
Histograms
We can also use histograms to visualise the distributions of continuous variables.

Continuous variables are generally unbounded (can take any value) and can be recorded to any number of decimal places,

i.e. the data does not conform to a pre-determined set of possible values like a discrete variable does.

In order to plot continuous data as a histogram, the data must be placed within a bin.

Similarly to discrete variables, you can tell STATA the number of bins which you require,
and the frequency counts are calculated for you.

Let us look at some share price return data which can be downloaded here

The data consists of 10,000 artificially generated stock returns.


Histograms

STATA Code: histogram stock_return, bin(10)
STATA Code: histogram stock_return, bin(40)
STATA Code: histogram stock_return, bin(100)
STATA Code: histogram stock_return, bin(1000)
STATA Code: histogram stock_return, bin(100) kdensity
Continuous Probability Distributions

The Dice example demonstrated a discrete probability distribution with 11 possible outcomes (integers from 2-12).

We have also seen that a continuous variable can be divided into bins and we can visualise its probability distribution at various bin sizes.

If the number of bins or class intervals increases, and given enough data, the shape of the histogram will approach a smooth curve.

An infinite number of class intervals (bins) changes a discrete probability distribution into a continuous probability distribution.

This will be the focus of Lecture 2.


Tasks

• Watch my video on how to get started with STATA.

• Use the examples contained within these slides to familiarise yourself with generating summary statistics and histograms.

• Complete the “Test Your Knowledge” Mini Quizzes

• Complete Workshop_01 Questions for next week.


Building Block

Assignment Structure
