FDS Unit 3
UNIT-3
Statistics Essentials for Data Science
Syllabus:
Statistics Essentials for Data Science: Sample or Population Data? The
Fundamentals of Descriptive Statistics, Measures of Central Tendency,
Asymmetry, and Variability, Practical Example: Descriptive Statistics,
Distributions, Estimators and Estimates, Normal distributions – z scores – normal
curve problems
Populations and Samples
In statistics, a population refers to any complete collection of observations or
potential observations, whereas a sample refers to any smaller collection of
actual observations drawn from a population.
In everyday life, populations often are viewed as collections of real objects (e.g.,
people, whales, automobiles), whereas in statistics, populations may be viewed
more abstractly as collections of properties or measurements (e.g., the ethnic
backgrounds of people, life spans of whales, gas mileage of automobiles).
Depending on our perspective, a given set of observations can be either a
population or a sample.
Ordinarily, populations are quite large and exist only as potential observations
(e.g., the potential scores of all U.S. college students on a test that measures
anxiety). On the other hand, samples are relatively small and exist as actual
observations (the actual scores of 100 college students on the test for anxiety).
When using a sample (100 actual scores) to generalize to a population (millions
of potential scores), it is important that the sample represent the population;
otherwise, any generalization might be erroneous.
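The idea of drawing a sample of actual observations from a much larger population can be sketched in Python. This is only an illustration: the population below is made-up data (10,000 simulated scores), and the names `population` and `sample` are assumptions chosen for the example, not part of the original notes.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated test scores (made-up data)
population = [rng.gauss(50, 10) for _ in range(10_000)]

# Simple random sample of 100 observations, drawn without replacement
sample = rng.sample(population, 100)

print(f"population mean: {mean(population):.2f}")
print(f"sample mean:     {mean(sample):.2f}")
```

Because every score has the same chance of being selected, a random sample like this tends to represent the population, so the two means are usually close.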
Population vs. Sample:
Population | Sample
Any complete collection of observations or potential observations. | Any smaller collection of actual observations drawn from a population.
Populations are quite large. | Samples are relatively small.
Exist only as potential observations. | Exist as actual observations.
The population is the complete set. | The sample is a subset of the population.
Example: all the students in the class are the population. | Example: the top 10 students in the class are a sample.
The Fundamentals of Descriptive Statistics
Statistics exists because of the prevalence of variability in the real world.
It consists of two main subdivisions: descriptive statistics, which is concerned
with organizing and summarizing information for sets of actual observations,
and inferential statistics, which is concerned with generalizing beyond sets of
actual observations—that is, generalizing from a sample to a population.
Prepared By: MD SHAKEEL AHMED, Associate Professor, Dept. Of IT, VVIT, Guntur Page 1
4-1 B. Tech CIVIL Regulation: R20 FDS: UNIT-3
Descriptive statistics will provide tools, such as tables, graphs, and averages
that help us describe and organize the inevitable variability among observations.
Examples are:
A tabular listing, ranked from most to least, of the total number of
romantic affairs during college reported anonymously by each member of
our stat class
A graph showing the annual change in global temperature during the last
30 years
A report that describes the average difference in grade point average
(GPA) between college students who regularly drink alcoholic beverages
and those who don’t.
Descriptive Statistics vs inferential Statistics
Median
The median reflects the middle value when observations are ordered from least
to most.
The median splits a set of ordered observations into two equal parts, the upper
and lower halves.
In other words, the median has a percentile rank of 50, since observations with
equal or smaller values constitute 50 percent of the entire distribution.
To find the median, scores always must be ordered from least to most (or vice
versa). This task is straightforward with small sets of data but becomes
increasingly cumbersome with larger sets of data that must be ordered manually.
When the total number of scores is odd, there is a single middle-ranked score, and
the value of the median equals the value of this score. When the total number of
scores is even, the value of the median equals a value midway between the values
of the two middlemost scores.
In either case, the value of the median always reflects the value of middle-ranked
scores, not the position of these scores among the set of ordered scores.
Example 1: Find the median for the following retirement ages: 60, 63, 45, 63,65,
70, 55, 63, 60, 65, 63.
Solution: Median = 63
Example2: Find the median for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
Solution: Median = 27.15 (halfway between 26.9 and 27.4)
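The two median examples can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen for illustration only.

```python
from statistics import median

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

# median() orders the data internally, so the lists need not be pre-sorted.
# Odd number of scores: the single middle-ranked score is returned.
median_age = median(retirement_ages)

# Even number of scores: the value midway between the two middle scores.
median_mileage = median(gas_mileage)

print(median_age, median_mileage)
```

The results should agree with the hand-worked answers of 63 and 27.15 (up to floating-point rounding in the second case).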
Mean
The mean is the most common average.
The mean is found by adding all scores and then dividing by the number of
scores.
That is,
$$\text{mean} = \frac{\text{sum of all scores}}{\text{number of scores}}$$
Even when large sets of unorganized data are involved, the calculation of the
mean is usually straightforward, particularly with the aid of a calculator or
computer.
The mean serves as the balance point for its frequency distribution.
Mean cannot be used with qualitative data.
Example 1: Find the mean for the following retirement ages: 60, 63, 45, 63, 65,
70, 55, 63, 60, 65, 63.
Solution: Mean = 672/11 = 61.09 (rounded to two decimal places)
Example 2: Find the mean for the following gas mileage tests: 26.3, 28.7, 27.4,
26.6, 27.4, 26.9.
Solution: Mean = 163.3/6 = 27.22 (rounded to two decimal places)
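The rule "add all scores, then divide by the number of scores" translates directly into code. The helper name `mean_of` below is hypothetical and used only for this sketch.

```python
def mean_of(scores):
    """Add all scores, then divide by the number of scores."""
    return sum(scores) / len(scores)

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

mean_age = mean_of(retirement_ages)    # 672 / 11
mean_mileage = mean_of(gas_mileage)    # about 163.3 / 6

print(round(mean_age, 2), round(mean_mileage, 2))
```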
Which Average?
When a distribution of scores is not too skewed, the values of the mode, median,
and mean are similar, and any of them can be used to describe the central
tendency of the distribution.
When extreme scores cause a distribution to be skewed, the values of the three
averages can differ appreciably.
Unlike the mode and median, the mean is very sensitive to extreme scores, or
outliers.
In the long run, however, the mean is the single most preferred average for
quantitative data.
Ideally, when a distribution is skewed, report both the mean and the median.
Appreciable differences between the values of the mean and median signal the
presence of a skewed distribution.
If the mean exceeds the median, the underlying distribution is positively skewed
because of one or more scores with relatively large values.
On the other hand, if the median exceeds the mean, the underlying distribution
is negatively skewed because of one or more scores with relatively small values.
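The comparison of mean and median described above can be sketched as a small helper. This is only the rule of thumb from the text, not a formal skewness statistic; the function name `skew_direction` and its `tolerance` parameter are assumptions made for illustration.

```python
from statistics import mean, median

def skew_direction(scores, tolerance=0.0):
    """Suggest the direction of skew by comparing the mean with the median."""
    m, md = mean(scores), median(scores)
    if m - md > tolerance:
        return "positive"       # mean pulled up by relatively large scores
    if md - m > tolerance:
        return "negative"       # mean pulled down by relatively small scores
    return "none apparent"

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
print(skew_direction(retirement_ages))
```

For the retirement ages, the mean (about 61.09) is below the median (63), which points to negative skew caused by the relatively small score of 45.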
The following figure summarizes the relationship between the various averages
and the two types of skewed distributions (shown as smoothed curves).
If the left tail is larger than the right tail, the function is said to have negative
skewness. If the right tail is larger, it has a positive skew. If the two are equal, it
has zero skewness.
Asymmetrical distribution is a situation in which the values of variables occur
at irregular frequencies and the mean, median, and mode occur at different
points. An asymmetric distribution exhibits skewness.
Data in most real applications are not symmetric. They may instead be either
positively skewed, where the mode occurs at a value that is smaller than the
median, or negatively skewed, where the mode occurs at a value greater than the
median.
Types of Estimators
Estimators can be described in several ways:
Biased: a statistic that is either an overestimate or an underestimate.
Efficient: a statistic with small variance (the one with the smallest
possible variance is also called the “best”). Inefficient estimators can give
good results as well, but they usually require much larger samples.
Invariant: statistics that are not easily changed by transformations, like
simple data shifts.
Shrinkage: a raw estimate that’s improved by combining it with other
information.
Sufficient: a statistic that captures all the information in the sample
about the parameter being estimated.
Unbiased: an accurate statistic that neither underestimates nor
overestimates.
Distributions
Over many years, eminent statisticians noticed that data from samples and
populations often formed very similar patterns. For example, a lot of data were
grouped around the ‘middle’ values, with fewer observations at the outside
edges of the distribution (very high or very low values). These patterns are
known as ‘distributions’, because they describe how the data are ‘distributed’
across the range of possible values.
A statistical distribution, or probability distribution, describes how values are
distributed for a field. In other words, the statistical distribution shows which
values are common and uncommon.
There are many kinds of statistical distributions, including the bell-shaped
normal distribution. We use a statistical distribution to determine how likely a
particular value is.
Types of Distributions
Bernoulli Distribution
Uniform Distribution
Binomial Distribution
Poisson Distribution
Normal or Gaussian Distribution
Exponential Distribution
Bernoulli distribution
The Bernoulli distribution is one of the easiest distributions to understand.
It can be used as a starting point to derive more complex distributions.
Any event with a single trial and only two outcomes follows a Bernoulli
distribution.
Flipping a coin or choosing between True and False in a quiz are examples of a
Bernoulli distribution.
Uniform distribution
In statistics, uniform distribution refers to a statistical distribution in which all
outcomes are equally likely.
Consider rolling a six-sided die. Each of the six numbers 1, 2, 3, 4, 5, and 6
is equally likely to come up on the next roll, with a probability of 1/6, which
makes this an example of a discrete uniform distribution.
Binomial Distribution
The Binomial Distribution can be thought of as the distribution of the sum of
outcomes of repeated, independent trials that each follow the same Bernoulli
distribution. It is therefore used for binary-outcome events in which the
probability of success (and hence of failure) is the same in all successive trials.
An example of a binomial event would be flipping a coin multiple times and
counting the number of heads.
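As a hedged illustration of the coin-flipping example, the sketch below computes the probability of exactly k heads in n flips using the standard binomial formula, C(n, k) p^k (1 - p)^(n - k). The formula is not stated in these notes, and the function name `binomial_pmf` is an assumption made for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 2 heads in 4 flips of a fair coin
p_two_heads = binomial_pmf(2, 4, 0.5)
print(p_two_heads)
```

With n = 1 the same function reduces to the Bernoulli case of a single trial with two outcomes.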
Poisson distribution
Poisson distribution deals with the frequency with which an event occurs
within a specific interval. Instead of the probability of an event, Poisson
distribution requires knowing how often it happens in a particular period or
distance.
The Poisson distribution describes the probability of k events happening in a
unit of time.
For example, a cricket chirps two times in 7 seconds on average. We can use the
Poisson distribution to determine the likelihood of it chirping five times in 15
seconds.
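The cricket example can be worked through with the standard Poisson formula, P(k) = e^(-λ) λ^k / k!, where λ is the expected number of events in the interval of interest. The formula is not given in these notes, and the names below are illustrative assumptions.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return exp(-lam) * lam**k / factorial(k)

rate_per_second = 2 / 7            # two chirps in 7 seconds on average
lam_15s = rate_per_second * 15     # expected chirps in 15 seconds (30/7)
p_five = poisson_pmf(5, lam_15s)   # probability of exactly five chirps

print(f"expected chirps in 15 s: {lam_15s:.2f}")
print(f"P(exactly 5 chirps):     {p_five:.3f}")
```

The key step is rescaling the average rate to the new interval (15 seconds) before applying the formula.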
Normal distribution
Normal distribution is the most used distribution in data science. In a normal
distribution graph, data is symmetrically distributed with no skew. When
plotted, the data follows a bell shape, with most values clustering around a
central region and tapering off as they go further away from the center.
The normal distribution frequently appears in nature and life in various forms.
For example, the scores on a quiz often approximate a normal distribution. Many
of the students scored between 60 and 80, as illustrated in the graph below. Of
course, scores that fall outside this range deviate farther from the center.
Exponential distribution
Exponential distribution is one of the widely used continuous distributions.
It is used to model the time taken between different events.
For example, in physics, it is often used to measure radioactive decay; in
engineering, to measure the time associated with receiving a defective part on
an assembly line; and in finance, to measure the likelihood of the next default
for a portfolio of financial assets.
Another common application of exponential distributions is in survival analysis
(e.g., the expected life of a device or machine).
Bernoulli Distribution: Single-trial with two possible outcomes
Uniform Distribution: All outcomes are equally likely
Binomial Distribution: A sequence of Bernoulli events
Poisson Distribution: How often an event occurs within a specific interval
Normal Distribution: Symmetric distribution of values around the mean
Exponential Distribution: Model elapsed time between two events
The Normal Distributions
A normal distribution (or Gaussian distribution) is a continuous probability
distribution that is symmetrical about the mean, so that the right side of
the center is a mirror image of the left side.
The normal distribution is important because it accurately describes the
distribution of values for many natural phenomena.
Many observed frequency distributions approximate the well-documented
normal curve, an important theoretical curve noted for its symmetrical bell-
shaped form.
Characteristics that are the sum of many independent processes frequently
follow normal distributions. For example, heights, blood pressure, measurement
error, and IQ scores approximately follow the normal distribution.
The normal curve is defined in terms of standard deviation and mean.
The normal curve can be used to obtain answers to a wide variety of questions.
Z Scores
A z score is a unit-free, standardized score that indicates how many standard
deviations a score is above or below the mean of its distribution.
To obtain a z score, express any original score, whether measured in inches,
milliseconds, dollars, IQ points, etc., as a deviation from its mean (by
subtracting its mean) and then split this deviation into standard deviation units
(by dividing by its standard deviation), that is,
$$z = \frac{X - \mu}{\sigma}$$
where X is the original score and μ and σ are the mean and the standard
deviation, respectively, for the normal distribution of the original scores.
A z score consists of two parts:
a positive or negative sign indicating whether it’s above or below the
mean; and
a number indicating the size of its deviation from the mean in standard
deviation units.
Example:
A z score of 2.00 always signifies that the original score is exactly two
standard deviations above its mean.
Similarly, a z score of –1.27 signifies that the original score is exactly
1.27 standard deviations below its mean.
A z score of 0 signifies that the original score coincides with the mean.
Problem: Express each of the following scores as a z score:
(a) Margaret’s IQ of 135, given a mean of 100 and a standard deviation
of 15
(b) a score of 470 on the SAT math test, given a mean of 500 and a standard
deviation of 100
(c) a daily production of 2100 loaves of bread by a bakery, given a mean of
2180 and a standard deviation of 50
(d) Sam’s height of 69 inches, given a mean of 69 and a standard deviation of 3
(e) a thermometer-reading error of –3 degrees, given a mean of 0 degrees and a
standard deviation of 2 degrees
Answers:
(a) z = (135-100)/15= 2.33
(b) z = (470-500)/100= -0.30
(c) z = (2100-2180)/50= -1.60
(d) z = (69-69)/3= 0.00
(e) z = (-3-0)/2= -1.50
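The answers above can be reproduced with a short helper that subtracts the mean and divides by the standard deviation. The function name `z_score` and the `problems` dictionary are illustrative choices, not part of the original notes.

```python
def z_score(x, mean, sd):
    """Deviation from the mean, expressed in standard deviation units."""
    return (x - mean) / sd

# (score, mean, standard deviation) for problems (a) to (e)
problems = {
    "a": (135, 100, 15),
    "b": (470, 500, 100),
    "c": (2100, 2180, 50),
    "d": (69, 69, 3),
    "e": (-3, 0, 2),
}

answers = {label: round(z_score(*args), 2) for label, args in problems.items()}
for label, z in answers.items():
    print(f"({label}) z = {z:.2f}")
```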
Rough graphs of normal curves can be used as an aid to visualizing the solution.
Only after thinking through to a solution, do any calculations and consult the
normal tables.
Finding Proportions
In these normal curve problems, the standard normal table (Table A) must be
consulted to find the unknown proportion (of area) associated with some known
score or pair of known scores.
Finding Proportions for One Score
Step-by-step procedure:
1. Sketch a normal curve and shade in the target area
2. Plan solution according to the normal table.
3. Convert X to z using the formula $z = (X - \mu)/\sigma$
4. Find the target area by consulting standard normal table
Example: to find the proportion of all applicants who are shorter than exactly
66 inches, given that the distribution of heights approximates a normal curve
with a mean of 69 inches and a standard deviation of 3 inches.
z = (66-69)/3 = -3/3 = -1.00
Look up column A’ to 1.00 (representing a z score of –1.00), and note the
corresponding proportion of .1587 in column C’: This is the answer.
It can be concluded that only .1587 (or .16 or 16%) of all of the applicants will
be shorter than 66 inches.
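As a cross-check on the table lookup, the same proportion can be computed with `statistics.NormalDist` from the Python standard library, whose `cdf` method gives the area below a score. This is a sketch alongside the table method, not a replacement for it; the variable names are illustrative.

```python
from statistics import NormalDist

# Heights approximate a normal curve with mean 69 and standard deviation 3
heights = NormalDist(mu=69, sigma=3)

z = (66 - 69) / 3                 # step 3: convert X to z
p_shorter = heights.cdf(66)       # step 4: area below 66 inches

print(f"z = {z:.2f}, proportion shorter than 66 inches = {p_shorter:.4f}")
```

The computed proportion should agree with the table value of .1587 to four decimal places.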
Finding Proportions between Two Scores
Step-by-step procedure:
1. Sketch a normal curve and shade in the target area
2. Plan solution according to the normal table.
3. Convert X to z using the formula $z = (X - \mu)/\sigma$
4. Find the target area.
Example: Assume that, when not interrupted artificially, the gestation periods
for human foetuses approximate a normal curve with a mean of 270 days (9
months) and a standard deviation of 15 days. What proportion of gestation
periods will be between 245 and 255 days?
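One way to sketch this problem in code is to subtract the area below the smaller score from the area below the larger score. The example uses `statistics.NormalDist` from the standard library instead of the printed table, and the variable names are assumptions made for illustration.

```python
from statistics import NormalDist

# Gestation periods: mean 270 days, standard deviation 15 days
gestation = NormalDist(mu=270, sigma=15)

z_low = (245 - 270) / 15     # about -1.67
z_high = (255 - 270) / 15    # -1.00

# Area between the two scores = area below 255 minus area below 245
p_between = gestation.cdf(255) - gestation.cdf(245)

print(f"z scores: {z_low:.2f} and {z_high:.2f}")
print(f"proportion between 245 and 255 days: {p_between:.4f}")
```

A table lookup with z rounded to -1.67 may differ slightly from this result in the third or fourth decimal place.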
Finding Scores
In this type of normal curve problem, the standard normal table (Table A) must
be consulted to find the unknown score or scores associated with some known
proportion.
Essentially, this type of problem requires using Table A by entering
proportions in columns B, C, B′, or C′ and finding z scores listed in columns A
or A′.
Finding One Score
Step-by-step procedure:
1. Sketch a normal curve and, on the correct side of the mean, draw a line
representing the target score
2. Plan solution according to the normal table.
3. Find z by consulting standard normal table
4. Convert z to the target score using the formula $X = \mu + (z)(\sigma)$
Problem: Exam scores for a large psychology class approximate a normal curve
with a mean of 230 and a standard deviation of 50. Furthermore, students are
graded “on a curve,” with only the upper 20 percent being awarded grades of A.
What is the lowest score on the exam that receives an A?
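The procedure for finding one score can be sketched with `statistics.NormalDist`, whose `inv_cdf` method plays the role of reading the table "backwards" from a proportion to a z score. The variable names are illustrative assumptions.

```python
from statistics import NormalDist

mean, sd = 230, 50

# Upper 20 percent means 80 percent of the area lies below the target score
z = NormalDist().inv_cdf(0.80)    # z for the standard normal curve

# Convert z to the target score: X = mean + z * sd
lowest_a = mean + z * sd

print(f"z = {z:.2f}, lowest score receiving an A is about {lowest_a:.0f}")
```

With the table value z = 0.84, the same formula gives X = 230 + (0.84)(50) = 272, which matches the computed result after rounding.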
of 4 inches. What are the rainfalls for the more atypical years, defined as the
driest 2.5 percent of all years and the wettest 2.5 percent of all years?
For example, the table above shows Sharon’s scores on college achievement tests
in three different subjects. The evaluation of her test performance is greatly
facilitated by converting her raw scores into the z scores listed in the final
column of the table. A glance at the z scores suggests that although she did
relatively well on the math test, her performance on the English test was only
slightly above average, as indicated by a z score of 0.50, and her performance
on the psychology test was slightly below average, as indicated by a z score of
–0.67.
Standard Score
Any unit-free score expressed relative to a known mean and a known standard
deviation is called a standard score.
Although z scores qualify as standard scores because they are unit-free and
expressed relative to a known mean of 0 and a known standard deviation of 1,
other scores also qualify as standard scores.
Transformed Standard Scores
z scores can be changed to transformed standard scores, other types of
unit-free standard scores that lack negative signs and decimal points.
These transformations change neither the shape of the original distribution nor
the relative standing of any test score within the distribution.
For example, a test score located one standard deviation below the mean might
be reported not as a z score of –1.00 but as a T score of 40 in a distribution of T
scores with a mean of 50 and a standard deviation of 10.
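The T score example follows from multiplying the z score by the new standard deviation and adding the new mean. A minimal sketch, assuming the usual conversion z′ = new mean + z × new standard deviation (the function name `transformed_score` is hypothetical):

```python
def transformed_score(z, new_mean, new_sd):
    """Convert a z score to a standard score with the given mean and sd."""
    return new_mean + z * new_sd

# One standard deviation below the mean, reported as a T score
t_score = transformed_score(-1.00, new_mean=50, new_sd=10)
print(t_score)
```

The same helper covers other scales by changing `new_mean` and `new_sd`, for example 100 and 15, or 500 and 100.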
The following figure shows the values of some of the more common types of
transformed standard scores relative to the various portions of the area under the
normal curve.
Answers:
Tutorial Questions:
Assignment Questions:
1. During their first swim through a water maze, 15 laboratory rats made the
following number of errors (blind alleyway entrances): 2, 17, 5, 3, 28, 7, 5,
8, 5, 6, 2, 12, 10, 4, 3. Find the mode, median, and mean for these data.
2. To the question “During your lifetime, how often have you changed your
permanent residence?” a group of 18 college students replied as follows:
1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 3. Find the mode, median,
and mean.
4. Find the proportion of the total area identified with the following statements:
(a) above a z score of 1.80
(b) between the mean and a z score of –0.43
(c) below a z score of –3.00
(d) between the mean and a z score of 1.65
(e) between z scores of 0 and –1.96
5. Assume that GRE scores approximate a normal curve with a mean of 500 and
a standard deviation of 100. Find the proportions that correspond to the target
area described by each of the following statements:
(a) less than 400
(b) more than 650
(c) less than 700
6. Assume that SAT math scores approximate a normal curve with a mean of 500
and a standard deviation of 100. Find the target area(s) described by each of the
following statements:
(a) more than 570
(b) less than 515
(c) between 520 and 540
(d) between 470 and 520
(e) more than 50 points above the mean
(f) more than 100 points either above or below the mean
7. For the normal distribution of burning times of electric light bulbs, with a
mean equal to 1200 hours and a standard deviation equal to 120 hours, what
burning time is identified with the
(a) upper 50 percent?
(b) lower 75 percent?
(c) lower 1 percent?
(d) middle 90 percent?
8. Assume that each of the raw scores listed originates from a distribution with
the specified mean and standard deviation. After converting each raw score into
a z score, transform each z score into a series of new standard scores with
means and standard deviations of 50 and 10, 100 and 15, and 500 and 100,
respectively