UNIT 1 SSMDA
Mean, median, mode, variance & standard deviation.
MEAN – average value of the data set
MEDIAN – “the middle number”, a number that splits the data set in half.
How to find the median?
Step 1: Arrange data in increasing order
Step 2: Determine how many numbers are in the data set = n
Step 3: If n is odd: Median is the middle number
If n is even: Median is the average of two middle numbers.
Notes: Mean and median don’t have to be numbers from the data set!
Mean and median can only take one value each.
Mean is influenced by extreme values, while median is resistant.
MODE – The most frequent number in the data set.
Example 4: Find the mode for the data set: 19, 19, 34, 3, 10, 22, 10, 15, 25, 10, 6.
The number that occurs the most is number 10.
mode = 10.
Example 5: Find the mode for the data set: 19, 19, 34, 3, 10, 22, 10, 15, 25, 10, 6, 19.
Number 10 occurs 3 times, and number 19 also occurs 3 times. Since no number occurs
4 times, both 10 and 19 are modes: mode = {10, 19}.
Notes: Mode is always a number from the data set.
Mode can take zero, one, or more than one value. (There can be zero modes, one
mode, two modes, …)
The difference between a value 𝑥 and the population mean 𝜇, 𝑥 – 𝜇, is called a deviation.
VARIANCE measures how far the values of the data set are from the mean, on
average. The average of the squared deviations is the population variance.
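As a quick illustration of these definitions, here is a minimal Python sketch (standard library only) that computes the mean, median, mode(s), population variance, and standard deviation; the data set is the one from Example 4 above.

```python
from collections import Counter
import math

data = [19, 19, 34, 3, 10, 22, 10, 15, 25, 10, 6]  # data set from Example 4

# Mean: average value of the data set
mean = sum(data) / len(data)

# Median: arrange in increasing order, then take the middle value
# (or the average of the two middle values when n is even)
s = sorted(data)
n = len(s)
median = s[n // 2] if n % 2 == 1 else (s[n // 2 - 1] + s[n // 2]) / 2

# Mode(s): the most frequent value(s); there may be zero, one, or several
counts = Counter(data)
top = max(counts.values())
modes = [x for x, c in counts.items() if c == top]

# Population variance: average of the squared deviations from the mean
variance = sum((x - mean) ** 2 for x in data) / n
std_dev = math.sqrt(variance)

print(mean, median, modes, variance, std_dev)
```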
PROBABILITY DISTRIBUTION
Random Variable
A random variable is a variable that takes on numerical values as a result
of a random experiment or measurement; associates a numerical value
with each possible outcome.
The differences between a variable and a random variable are:
• A random variable always takes numerical values.
• There is a probability associated with each possible value.
A random variable is denoted by capital letters such as X, Y, Z, and its possible
values are denoted by small letters such as x, y, z.
Example 1:
A coin is tossed. It has two possible outcomes- Head and Tail.
Consider a variable, X = outcome of a coin toss = H if Head appears, T if Tail appears.
Here, S = {H, T}.
But these are not numerical values.
Consider a variable, X = number of heads obtained in a trial.
Then, X = 1 if Head appears, 0 if Tail appears.
So, X is a random variable.
For a fair coin, we can write P(X=1) = ½ and P(X=0) = ½.
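The mapping from outcomes to numbers can be written as a small Python sketch; the simulation below is only an illustration, assuming a fair coin.

```python
import random

def coin_toss_rv():
    """X = number of heads in one toss: 1 if Head appears, 0 if Tail appears."""
    outcome = random.choice(["H", "T"])  # sample space S = {H, T}
    return 1 if outcome == "H" else 0

# For a fair coin, P(X = 1) = P(X = 0) = 1/2; the empirical frequency
# over many trials should be close to 0.5.
trials = [coin_toss_rv() for _ in range(10_000)]
print(sum(trials) / len(trials))
```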
Types of random variable:
Discrete Random Variable
A random variable defined over a discrete sample space.
Continuous Random Variable
A random variable defined over a continuous sample space.
Examples:
Discrete Random Variable:
1. X= Number of correct answers in a 100-MCQ test= 0, 1, 2, …, 100
2. X= Number of cars passing a toll booth in a day= 0, 1, 2, …, ∞
3. X= Number of balls required to take the first wicket = 1, 2, 3, …, ∞
Continuous Random Variable:
1. X= Weight of a person. 0<X<∞
2. X= Monthly Profit. -∞<X<∞
Probability Distributions
Distribution of the probabilities among the different values of a random
variable.
Discrete probability distribution- probability distribution of a discrete
random variable.
Continuous probability distribution- probability distribution of a
continuous random variable.
Different types of probability distributions:
Discrete probability distribution-
1. Bernoulli Distribution
2. Binomial Distribution
3. Poisson Distribution etc.
Continuous probability distribution-
1. Uniform Distribution
2. Normal Distribution
3. Exponential Distribution
4. t-distribution etc.
PMF and PDF
Probability Mass Function (pmf)- the probability distribution function of
a discrete random variable X is called a pmf and is denoted by p(x).
Probability Density Function (pdf)- the probability distribution function
of a continuous random variable X is called a pdf and is denoted by f(x).
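To make the distinction concrete, the sketch below (assuming SciPy is available) evaluates the pmf of a discrete distribution and the pdf of a continuous one; the parameter values are arbitrary.

```python
from scipy.stats import binom, norm

# pmf of a discrete random variable: P(X = x) for X ~ Binomial(n=3, p=0.25)
print(binom.pmf(2, n=3, p=0.25))      # probability mass at x = 2

# pdf of a continuous random variable: density f(x) for X ~ Normal(mean=0, sd=1)
print(norm.pdf(0.5, loc=0, scale=1))  # a density value, not a probability

# For a continuous variable, probabilities come from areas under the pdf:
print(norm.cdf(1) - norm.cdf(0))      # P(0 < X < 1)
```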
EXAMPLE:
A company estimates the net profit on a new product it is launching to be
Rs. 3 million during the first year if it is ‘successful’, Rs. 1 million if it is
‘moderately successful’, and a loss of Rs. 1 million if it is ‘unsuccessful’.
The company assigns the following probabilities to first year prospects for
the product-
Successful: 0.25, Moderately successful: 0.40, and Unsuccessful: 0.35.
What are the expected value and standard deviation of the first year net
profit for the product? Also, find the expected value of net profit if there is
a fixed cost of Rs. 0.2 million, whatever the success status is.
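The worked solution is not reproduced above; a minimal sketch of the computation under the stated probabilities (profits in Rs. million) is:

```python
import math

profits = [3, 1, -1]            # successful, moderately successful, unsuccessful
probs   = [0.25, 0.40, 0.35]

# Expected value E(X) = sum of x * P(x)
expected = sum(x * p for x, p in zip(profits, probs))

# Variance = E(X^2) - [E(X)]^2, standard deviation = its square root
variance = sum(x**2 * p for x, p in zip(profits, probs)) - expected**2
std_dev = math.sqrt(variance)

# With a fixed cost of Rs. 0.2 million, E(X - 0.2) = E(X) - 0.2
expected_after_cost = expected - 0.2

print(expected, std_dev, expected_after_cost)  # 0.80, about 1.54, 0.60
```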
Binomial Distribution
Bernoulli trial:
A trial that has only two possible outcomes (often called ‘Success’ and
‘Failure’)
Example:
There are 3 multiple choice questions in an MCQ test. Each MCQ has four possible
choices, only one of which is correct. If an examinee answers those MCQs randomly
(without knowing the correct answers):
a. What is the probability that exactly two of the answers will be correct?
b. What is the probability that at least two of the answers will be correct?
c. What is the probability that at most two of the answers will be correct?
d. What will be the average or expected number of correct answers?
e. Also, find the standard deviation of the number of correct answers.
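Here each question is a Bernoulli trial with success probability p = 1/4, so the number of correct answers X follows a Binomial(n = 3, p = 0.25) distribution. A sketch of the calculations, assuming SciPy is available:

```python
from scipy.stats import binom

n, p = 3, 0.25                  # 3 questions, P(correct guess) = 1/4
X = binom(n, p)

print(X.pmf(2))                 # a. P(exactly 2 correct)
print(X.pmf(2) + X.pmf(3))      # b. P(at least 2 correct) = P(2) + P(3)
print(X.cdf(2))                 # c. P(at most 2 correct)
print(X.mean())                 # d. expected number correct = n*p = 0.75
print(X.std())                  # e. standard deviation = sqrt(n*p*(1-p)) = 0.75
```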
POISSON DISTRIBUTION
Example :
The average number of errors on a page of a certain magazine is 0.2.
What is the probability that the next page (or a randomly selected page)
you read contains
i. 0 (zero) errors?
ii. 2 or more errors?
iii. What is the average number of errors per page?
iv. Also, find the standard deviation of the number of errors.
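The number of errors per page can be modelled with a Poisson distribution with mean λ = 0.2. A sketch, assuming SciPy is available:

```python
from scipy.stats import poisson
import math

lam = 0.2                      # average number of errors per page
X = poisson(lam)

print(X.pmf(0))                # i.  P(0 errors) = e^(-0.2)
print(1 - X.cdf(1))            # ii. P(2 or more errors) = 1 - P(0) - P(1)
print(X.mean())                # iii. average errors per page = lambda = 0.2
print(X.std())                 # iv. standard deviation = sqrt(lambda)
print(math.exp(-0.2))          # closed-form check for P(0 errors)
```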
UNIFORM DISTRIBUTION
Example :
The waiting time (in minutes) for a train is Uniform(10, 50).
Find-
a. The probability that you have to wait at least 20 minutes.
b. Average waiting time.
c. Standard deviation of waiting time.
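A sketch of the calculation, assuming SciPy is available (for a Uniform(a, b) distribution, SciPy's parameterisation is loc = a, scale = b - a):

```python
from scipy.stats import uniform

a, b = 10, 50
X = uniform(loc=a, scale=b - a)   # waiting time ~ Uniform(10, 50)

print(X.sf(20))     # a. P(wait >= 20) = (50 - 20) / (50 - 10) = 0.75
print(X.mean())     # b. average waiting time = (a + b) / 2 = 30 minutes
print(X.std())      # c. standard deviation = (b - a) / sqrt(12), about 11.5 minutes
```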
EXPONENTIAL DISTRIBUTION
Example 6:
Average time required to repair a machine is 0.5 hours. What is the
probability that the next repair will take more than 2 hours?
SOLUTION:
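The written solution does not appear above. As a sketch: if the repair time is exponential with mean 0.5 hours, then P(X > 2) = e^(−2/0.5) = e^(−4) ≈ 0.018. Assuming SciPy is available, the same calculation is:

```python
from scipy.stats import expon
import math

mean_repair = 0.5                 # hours; for an exponential, scale = mean = 1/lambda
X = expon(scale=mean_repair)

print(X.sf(2))                    # P(repair takes more than 2 hours)
print(math.exp(-2 / mean_repair)) # closed form e^(-4), same value
```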
Characteristics of a Normal Distribution
Standard Normal Distribution Problem and Solution
Problem 1: For some computers, the time period between charges of the
battery is normally distributed with a mean of 50 hours and a standard
deviation of 15 hours. Rohan has one of these computers and needs to know
the probability that the time period will be between 50 and 70 hours.
Solution: Let x be the random variable that represents the time period.
Given Mean, μ= 50
and standard deviation, σ = 15
To find: the probability that x is between 50 and 70, i.e., P(50 < x < 70)
By using the transformation equation, we know;
z = (X – μ) / σ
For x = 50 , z = (50 – 50) / 15 = 0
For x = 70 , z = (70 – 50) / 15 = 1.33
P( 50< x < 70) = P( 0< z < 1.33) = [area to the left of z = 1.33] – [area to the left of z
= 0]
From the table, we get:
P( 0< z < 1.33) = 0.9082 – 0.5 = 0.4082
The probability that Rohan’s computer has a time period between 50 and 70 hours is
equal to 0.4082.
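The table lookup can be cross-checked numerically, assuming SciPy is available:

```python
from scipy.stats import norm

mu, sigma = 50, 15
# P(50 < X < 70) = Phi((70 - 50)/15) - Phi((50 - 50)/15)
prob = norm.cdf(70, loc=mu, scale=sigma) - norm.cdf(50, loc=mu, scale=sigma)
print(prob)   # about 0.409; matches the table-based 0.4082 up to rounding z to 1.33
```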
Problem 2: The speeds of cars are measured using a radar unit, on a
motorway. The speeds are normally distributed with a mean of 90 km/hr and a
standard deviation of 10 km/hr. What is the probability that a car selected at
chance is moving at more than 100 km/hr?
Solution: Let the speed of cars is represented by a random variable ‘x’.
Now, given mean, μ = 90 and standard deviation, σ = 10.
To find: Probability that x is higher than 100 or P(x > 100)
By using the transformation equation, we know;
z = (X – μ) / σ
Hence,
For x = 100 , z = (100 – 90) / 10 = 1
P(x > 100) = P(z > 1) = [total area] – [area to the left of z = 1]
P(z > 1) = 1 – 0.8413 = 0.1587.
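Again this can be verified directly, assuming SciPy is available:

```python
from scipy.stats import norm

mu, sigma = 90, 10
print(norm.sf(100, loc=mu, scale=sigma))   # P(X > 100) = P(Z > 1), about 0.1587
```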
Probability Theory Example
We can study the concept of probability with the help of the example discussed
below,
Example: Let’s take two dice and roll them; we calculate the probability of getting a
total of 10.
Solution:
The sample space of all possible outcomes is {(1,1), (1,2), ..., (1,6), ..., (6,6)}.
The total number of outcomes is 36.
The required events, those that add up to 10, are {(4,6), (5,5), (6,4)}.
So the probability of getting a total of 10 is 3/36 = 1/12.
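A sketch that enumerates the sample space and counts the favourable outcomes:

```python
from itertools import product

sample_space = list(product(range(1, 7), repeat=2))      # all 36 ordered pairs
favourable = [pair for pair in sample_space if sum(pair) == 10]

print(favourable)                              # [(4, 6), (5, 5), (6, 4)]
print(len(favourable) / len(sample_space))     # 3/36 = 1/12, about 0.083
```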
Basics of Probability Theory
Various terms used in probability theory are discussed below,
Random Experiment
In probability theory, any experiment that can be repeated multiple times and whose
outcome is not affected by its repetition is called a Random Experiment. Tossing a
coin, rolling a die, etc. are random experiments.
Sample Space
The set of all possible outcomes for any random experiment is called sample space.
For example, throwing a die results in six outcomes, which are 1, 2, 3, 4, 5, and 6.
Thus, its sample space is {1, 2, 3, 4, 5, 6}.
Event
The outcome of any experiment is called an event. Various types of events used in
probability theory are:
Independent Events: Events whose outcomes are not affected by the outcomes of
other past or future events are called independent events. For example, the outcome
of tossing a coin repeatedly is not affected by its previous outcomes.
Dependent Events: The events whose outcomes are affected by the outcome of
other events are called dependent events. For example, picking oranges from a bag
that contains 100 oranges without replacement.
Mutually Exclusive Events: The events that can not occur simultaneously are
called mutually exclusive events. For example, obtaining a head or a tail in tossing a
coin, because both (head and tail) can not be obtained together.
Equally likely Events: The events that have an equal chance or probability of
happening are known as equally likely events. For example, each face when rolling a
die has an equal probability of 1/6.
Random Variable
A variable that can assume the value of all possible outcomes of an experiment is
called a random variable in Probability Theory. Random variables in probability
theory are of two types which are discussed below,
Discrete Random Variable: Variables that can take countable values such as 0, 1,
2,... are called discrete random variables.
Continuous Random Variable: Variables that can take any value within a given range
(uncountably many values) are called continuous random variables.
Probability Theory Formulas
There are various formulas that are used in probability theory and some of them are
discussed below:
Theoretical Probability Formula:
(Number of Favourable Outcomes) / (Number of Total Outcomes)
Empirical Probability Formula:
(Number of times event A happened) / (Total number of trials)
Addition Rule of Probability: P(A ∪ B) = P(A) + P(B) – P(A∩B)
Complementary Rule of Probability: P(A’) = 1 – P(A)
Independent Events: P(A∩B) = P(A) ⋅ P(B)
Conditional Probability: P(A | B) = P(A∩B) / P(B)
Bayes’ Theorem: P(A | B) = P(B | A) ⋅ P(A) / P(B)
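A small numerical check of several of these rules on the two-dice experiment above; the events A (first die shows 6) and B (total is 10) are chosen purely for illustration:

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))   # two dice

def P(event):
    return Fraction(sum(1 for w in space if event(w)), len(space))

A = lambda w: w[0] == 6                        # event A: first die shows 6
B = lambda w: sum(w) == 10                     # event B: total is 10

p_a, p_b = P(A), P(B)
p_a_and_b = P(lambda w: A(w) and B(w))
p_a_or_b = P(lambda w: A(w) or B(w))

print(p_a_or_b == p_a + p_b - p_a_and_b)       # addition rule holds -> True
print(P(lambda w: not A(w)) == 1 - p_a)        # complement rule holds -> True

p_a_given_b = p_a_and_b / p_b                  # conditional probability P(A | B)
p_b_given_a = p_a_and_b / p_a
print(p_a_given_b == p_b_given_a * p_a / p_b)  # Bayes' theorem holds -> True
print(p_a_and_b == p_a * p_b)                  # False: A and B are not independent
```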
Applications of Probability Theory
Probability theory is widely used in our life, it is used to find answers to various types
of questions, such as will it rain tomorrow? What is the chance of landing on the
Moon? What is the chance of the evolution of humans? and others.
Some of the important uses of probability theory are,
Probability theory is used to predict the performance of stocks and bonds.
Probability theory is used in casinos and gambling.
Probability theory is used in weather forecasting.
Probability theory is used in Risk mitigation.
Probability theory is used in consumer industries to mitigate the risk of product
failure.
Solved Examples on Probability
Example 1: Consider a jar with 7 red marbles, 3 green marbles, and 4 blue marbles.
What is the probability of randomly selecting a non-blue marble from the jar?
Solution:
Given,
Number of Red Marbles = 7, Number of Green Marbles = 3, Number of Blue Marbles
=4
So, Total number of possible outcomes in this case: 7 + 3 + 4 = 14
Now, Number of non-blue marbles are: 7 + 3 = 10
According to the formula of theoretical Probability we can find, P(Non-Blue) = 10/14
= 5/7
Hence, the theoretical probability of selecting a non-blue marble is 5/7.
Example 2: Consider two players, Naveena and Isha, playing a table tennis match.
The probability of Naveena winning the match is 0.62. What is the probability of Isha
winning the match?
Solution:
Let N and I represent the events that Naveena wins the match and Isha wins
the match, respectively.
The probability of Naveena’s winning P(N) = 0.62 (given)
The probability of Isha’s winning P(I) = ?
The two winning events are mutually exclusive and exhaustive, since exactly one of
the two players wins the match.
Therefore,
P(N) + P(I) =1
P(I) = 1 – P(N)
P(I) = 1 – 0.62 = 0.38
Thus, the Probability of Isha winning the match is 0.38.
Example 3: If someone takes out one card from a 52-card deck, what is the
probability of the card being a heart? What is the probability of obtaining a 7-number
card?
Solution:
Total number of cards in a deck = 52
Total Number of heart cards in a deck = 13
So, the probability of obtaining a heart,
P(heart) = 13/52 = 1/4
Total number of 7-number cards in a deck = 4
So, the probability of obtaining a 7-number card,
P(7-number) = 4/52 = 1/13
Example 4: Find the probability of rolling an even number when you roll a die
containing the numbers 1-6. Express the probability as a fraction, decimal, ratio, or
percent.
Solution:
Out of 1 to 6 numbers, even numbers are 2, 4, and 6.
So, Number of favorable outcomes = 3.
Total number of outcomes = 6.
Probability of obtaining an even number P(Even)= 1/2 = 0.5 = 1 : 2 = 50%
What is Hypothesis Testing?
Hypothesis is usually considered as the principal instrument in research. The main
goal in many research studies is to check whether the data collected support certain
statements or predictions. A statistical hypothesis is an assertion or conjecture
concerning one or more populations. Test of hypothesis is a process of testing the
significance regarding the parameters of the population on the basis of a sample
drawn from it. Thus, it is also termed a “Test of Significance”.
In short, hypothesis testing enables us to make probability statements about
population parameters. The hypothesis may not be proved absolutely, but in practice
it is accepted if it has withstood critical testing.
Points to be considered while formulating Hypothesis
Hypothesis should be clear and precise.
Hypothesis should be capable of being tested.
Hypothesis should state the relationship between variables.
Hypothesis should be limited in scope and must be specific.
Hypothesis should be stated as far as possible in most simple terms so that the
same is easily understandable by all concerned.
Hypothesis should be amenable to testing within a reasonable time.
Hypothesis should have an empirical reference, i.e., it should relate to observable facts.
Types of Hypothesis:
There are two types of hypothesis, i.e., Research Hypothesis and Statistical
Hypothesis
1. Research Hypothesis: A research hypothesis is a tentative solution for the
problem being investigated. It is the supposition that motivates the researcher to
accomplish a future course of action. In research, the researcher determines
whether or not their supposition can be supported through scientific investigation.
2. Statistical Hypothesis: Statistical hypothesis is a statement about the population
which we want to verify on the basis of a sample taken from the population.
Statistical hypothesis is stated in such a way that they may be evaluated by
appropriate statistical techniques.
Types of Statistical Hypotheses
There are two types of statistical hypotheses:
1. Null Hypothesis (H0) – A statistical hypothesis that states that there is no
difference between a parameter and a specific value, or that there is no difference
between two parameters.
2. Alternative Hypothesis (H1 or Ha) – A statistical hypothesis that states the
existence of a difference between a parameter and a specific value, or states that
there is a difference between two parameters. The alternative hypothesis is stated as
the negation of the null hypothesis.
Suppose we want to test the hypothesis that the population mean (μ) is equal to the
hypothesised mean (μH0) = 100. Then we would say that the null hypothesis is that
the population mean is equal to the hypothesised mean 100 and symbolically we can
express it as: H0: μ = μH0 = 100
If our sample results do not support this null hypothesis, we should conclude that
something else is true. What we conclude rejecting the null hypothesis is known as
alternative hypothesis. In other words, the set of alternatives to the null hypothesis is
referred to as the alternative hypothesis. If we accept H0, then we are rejecting H1
and if we reject H0, then we are accepting H1. For H0: μ = μH0 = 100, we may
consider three possible alternative hypotheses: Ha: μ ≠ 100, Ha: μ > 100, or Ha: μ < 100.
The null hypothesis and the alternative hypothesis are chosen before the sample is
drawn (the researcher must avoid the error of deriving hypotheses from the data that
he/she collects and then testing the hypotheses from the same data). In the choice
of null hypothesis, the following considerations are usually kept in view:
1. Alternative hypothesis is usually the one which one wishes to prove and the null
hypothesis is the one which one wishes to disprove. Thus, a null hypothesis
represents the hypothesis we are trying to reject, and alternative hypothesis
represents all other possibilities.
2. The null hypothesis should always be a specific hypothesis, i.e., it should state an
exact value of the parameter rather than an approximate or vague one.
3. In testing hypothesis, there are two possible outcomes: Reject H0 and accept H1
because of sufficient evidence in the sample in favour of H1;
Do not reject H0 because of insufficient evidence to support H1.
BASIC CONCEPTS CONCERNING TESTING OF HYPOTHESES
1. The level of significance: This is a very important concept in the context of
hypothesis testing. It is always some percentage (usually 5%) which should be
chosen with great care, thought and reason. In case we take the significance level at
5 per cent, then this implies that H0 will be rejected when the sampling result (i.e.,
observed evidence) has a less than 0.05 probability of occurring if H0 is true. In other
words, the 5 percent level of significance means that a researcher is willing to take
as much as a 5 percent risk of rejecting the null hypothesis when it (H0) happens to
be true. Thus, the significance level is the maximum value of the probability of
rejecting H0 when it is true and is usually determined in advance before testing the
hypothesis.
2. Decision rule or Test of Hypothesis: A decision rule is a procedure that the
researcher uses to decide whether to accept or reject the null hypothesis. The
decision rule is a statement that tells under what circumstances to reject the null
hypothesis. The decision rule is based on specific values of the test statistic (e.g.,
reject H0 if the calculated value > the table value at the chosen level of significance).
3. Types of Error: In the context of testing of hypotheses, there are basically two
types of errors we can make.
a. Type 1 error: To reject the null hypothesis when it is true is to make what is known
as a type I error. The level at which a result is declared significant is known as the
type I error rate, often denoted by α.
b. Type II error: If we do not reject the null hypothesis when in fact there is a
difference between the groups, we make what is known as a type II error. The type II
error rate is often denoted as β.
4. One- tailed and Two-tailed Tests: A test of statistical hypothesis, where the region
of rejection is on only one side of the sampling distribution, is called a one tailed test.
For example, suppose the null hypothesis states that the mean is less than or equal
to 10. The alternative hypothesis would be that the mean is greater than 10. The
region of rejection would consist of a range of numbers located on the right side of
sampling distribution i.e., a set of numbers greater than 10.
A test of statistical hypothesis, where the region of rejection is on both sides of the
sampling distribution, is called a two-tailed test. For example, suppose the null
hypothesis states that the mean is equal to 10. The alternative hypothesis would be
that the mean is less than 10 or greater than 10. The region of rejection would
consist of a range of numbers located on both sides of sampling distribution; i.e., the
region of rejection would consist partly of numbers that were less than 10 and partly
of numbers that were greater than 10.
Procedure of Hypothesis Testing
Procedure for hypothesis testing refers to all those steps that we undertake for
making a choice between the two actions i.e., rejection and acceptance of a null
hypothesis. The various steps involved in hypothesis testing are stated below:
1. Making a formal statement: The step consists in making a formal statement of the
null hypothesis (H0) and also of the alternative hypothesis (Ha or H1). This means
that hypotheses should be clearly stated, considering the nature of the research
problem.
2. Selecting a significance level: The hypotheses are tested on a pre-determined
level of significance and as such the same should be specified. Generally, in
practice, either 5% level or 1% level is adopted for the purpose.
3. Deciding the distribution to use: After deciding the level of significance, the next
step in hypothesis testing is to determine the appropriate sampling distribution. The
choice generally remains between normal distribution and the t-distribution.
4. Selecting a random sample and computing an appropriate value: Another step is
to select a random sample(s) and compute an appropriate value from the sample
data concerning the test statistic utilizing the relevant distribution. In other words,
draw a sample to furnish empirical data.
5. Calculation of the probability: One has then to calculate the probability that the
sample result would diverge as widely as it has from expectations, if the null
hypothesis were in fact true.
6. Comparing the probability and Decision making: Yet another step consists in
comparing the probability thus calculated with the specified value for α, the
significance level. If the calculated probability is equal to or smaller than the α value
in case of one-tailed test (and α /2 in case of two-tailed test), then reject the null
hypothesis (i.e., accept the alternative hypothesis), but if the calculated probability is
greater, then accept the null hypothesis.
Tests of Hypotheses
Hypothesis testing determines the validity of the assumption (technically described
as null
hypothesis) with a view to choose between two conflicting hypotheses about the
value of a
population parameter. Hypothesis testing helps to decide on the basis of a sample
data, whether
a hypothesis about the population is likely to be true or false. Statisticians have
developed
several tests of hypotheses (also known as the tests of significance) for the purpose
of testing
of hypotheses which can be classified as:
a) Parametric tests or standard tests of hypotheses; and
b) Non-parametric tests or distribution-free test of hypotheses.
Parametric tests usually assume certain properties of the parent population from
which we draw samples. Assumptions like observations come from a normal
population, sample size is large, assumptions about the population parameters like
mean, variance, etc., must hold good before parametric tests can be used. But there
are situations when the researcher cannot or does not want to make such
assumptions. In such situations we use statistical methods for testing hypotheses
which are called non-parametric tests because such tests do not depend on any
assumption about the parameters of the parent population. Besides, most
non-parametric tests assume only nominal or ordinal data, whereas parametric tests
require measurement equivalent to at least an interval scale. As a result,
non-parametric tests need more observations than parametric tests to achieve the
same size of Type I and Type II errors.
IMPORTANT PARAMETRIC TESTS
The important parametric tests are: (1) z-test; (2) t-test; and (3) F-test. All these tests
are based on the assumption of normality i.e., the source of data is considered to be
normally distributed.
1. z- test: It is based on the normal probability distribution and is used for judging the
significance of several statistical measures, particularly the mean. This is a most
frequently used test in research studies. This test is used even when binomial
distribution or t-distribution is applicable on the presumption that such a distribution
tends to approximate normal distribution as ‘n’ becomes larger. z-test is generally
used for comparing the mean of a sample to some hypothesised mean for the
population in case of large sample, or when population variance is known. z-test is
also used for judging the significance of difference between means of two
independent samples in case of large samples, or when population variance is
known. z-test is also used for comparing the sample proportion to a theoretical value
of population proportion or for judging the difference in proportions of two
independent samples when n happens to be large. Besides, this test may be used
for judging the significance of median, mode, coefficient of correlation and several
other measures.
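As a minimal sketch (not taken from the text), a one-sample two-tailed z-test for a hypothetical large sample with known population standard deviation could look like this in Python, assuming SciPy is available:

```python
import math
from scipy.stats import norm

# Hypothetical example: H0: mu = 100 against H1: mu != 100 (two-tailed)
sample_mean, n = 103.2, 64
mu_0, sigma = 100, 12              # hypothesised mean and known population SD

z = (sample_mean - mu_0) / (sigma / math.sqrt(n))   # test statistic
p_value = 2 * norm.sf(abs(z))                       # two-tailed p-value

alpha = 0.05
print(z, p_value)
print("reject H0" if p_value <= alpha else "do not reject H0")
```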
2. t- test: It is based on t-distribution and is considered an appropriate test for
judging the significance of a sample mean or for judging the significance of
difference between the means of two samples in case of small sample(s) when
population variance is not known (in which case we use variance of the sample as
an estimate of the population variance). In case two samples are related, we use
paired t-test (or what is known as difference test) for judging the significance of the
mean of difference between the two related samples. It can also be used for judging
the significance of the coefficients of simple and partial correlations.
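A sketch of the one-sample and paired (difference) t-tests with SciPy; the small data sets below are purely hypothetical:

```python
from scipy import stats

# One-sample t-test: is the mean of a small sample equal to a hypothesised mean?
sample = [12.1, 11.8, 12.6, 12.0, 11.5, 12.4]
print(stats.ttest_1samp(sample, popmean=12.0))   # t statistic and p-value

# Paired t-test for two related samples, e.g. before/after measurements
before = [78, 82, 75, 90, 85]
after  = [80, 85, 74, 93, 88]
print(stats.ttest_rel(before, after))
```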
3. F-test: It is based on the F-distribution and is used to compare the variances of two
independent samples. This test is also used in the context of analysis of variance
(ANOVA) for judging the significance of more than two sample means at one and the
same time. It is also used for judging the significance of multiple correlation
coefficients.
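A sketch of the ANOVA use of the F-test with SciPy; the three groups below are hypothetical:

```python
from scipy import stats

# One-way ANOVA: test whether more than two sample means differ significantly
group_a = [23, 25, 21, 22, 24]
group_b = [28, 27, 30, 26, 29]
group_c = [22, 24, 23, 25, 21]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```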
Limitations of the Test of Hypotheses
Tests do not explain the reasons why a difference exists, say between the means of
the two samples. They simply indicate whether the difference is due to fluctuations of
sampling or to other reasons, but they do not tell us which other reason(s) cause the
difference.
Results of significance tests are based on probabilities and as such cannot be
expressed with full certainty.
Statistical inferences based on the significance tests cannot be said to be entirely
correct evidence concerning the truth of the hypotheses.
Linear Algebra
Linear algebra is a branch of mathematics that deals with vector spaces and linear
mappings between them. It provides a framework for representing and solving
systems of linear equations, as well as analyzing geometric transformations and
structures. Linear algebra has applications in various fields including engineering,
computer science, physics, economics, and data analysis.
1. Vectors and Scalars:
A vector is a quantity characterized by magnitude and direction, represented
geometrically as an arrow.
Scalars are quantities that only have magnitude, such as real numbers.
2. Vector Operations:
Addition: Two vectors can be added together by adding their corresponding
components.
Scalar Multiplication: A vector can be multiplied by a scalar (real number), resulting
in a vector with magnitudes scaled by that scalar.
Dot Product: Also known as the scalar product, it yields a scalar quantity by
multiplying corresponding components of two vectors and summing the results.
Cross Product: In three-dimensional space, it yields a vector perpendicular to the
plane containing the two input vectors.
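A short NumPy sketch of these vector operations (the vectors u and v are arbitrary examples):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, -1.0, 2.0])

print(u + v)            # addition: component-wise sum
print(2.5 * u)          # scalar multiplication
print(np.dot(u, v))     # dot (scalar) product: 1*4 + 2*(-1) + 3*2 = 8
print(np.cross(u, v))   # cross product: a vector perpendicular to both u and v
```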
3. Matrices and Matrix Operations:
A matrix is a rectangular array of numbers arranged in rows and columns.
Matrix Addition: Matrices of the same dimensions can be added by adding
corresponding elements.
Scalar Multiplication: A matrix can be multiplied by a scalar, resulting in each
element of the matrix being multiplied by that scalar.
Matrix Multiplication: The product of two matrices is calculated by taking the dot
product of rows and columns.
Transpose: The transpose of a matrix is obtained by swapping its rows and columns.
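The corresponding matrix operations in NumPy, again with arbitrary example matrices:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [5, 2]])

print(A + B)        # matrix addition: element-wise
print(3 * A)        # scalar multiplication
print(A @ B)        # matrix multiplication: dot products of rows of A with columns of B
print(A.T)          # transpose: rows and columns swapped
```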
4. Systems of Linear Equations:
Linear equations are equations involving linear combinations of variables, where
each term is either a constant or a constant multiplied by a single variable.
A system of linear equations consists of multiple linear equations with the same
variables.
Solutions to a system of linear equations correspond to the points where the graphs of
the equations (lines, planes, etc.) intersect.
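As an illustration, the hypothetical 2×2 system x + 2y = 5, 3x + 4y = 6 can be solved with NumPy as follows:

```python
import numpy as np

# Solve the system  x + 2y = 5,  3x + 4y = 6  written as A @ [x, y] = b
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([5.0, 6.0])

solution = np.linalg.solve(A, b)
print(solution)                       # the point where the two lines intersect
print(np.allclose(A @ solution, b))   # True: the solution satisfies both equations
```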
5. Eigenvalues and Eigenvectors:
Eigenvalues are scalar values that represent how a linear transformation scales a
corresponding eigenvector.
Eigenvectors are nonzero vectors that remain in the same direction after a linear
transformation.
Find all eigenvalues and corresponding eigenvectors for the matrix A if
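The matrix A for this exercise is not reproduced above; purely as an illustration, the sketch below computes eigenvalues and eigenvectors of a hypothetical 2×2 matrix with NumPy:

```python
import numpy as np

# Hypothetical matrix (the matrix A from the exercise is not given here)
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)          # scalars lambda with A v = lambda v
print(eigenvectors)         # columns are the corresponding eigenvectors

# Check the defining property for the first eigenpair: A v = lambda v
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))   # True
```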