Advanced Methods of Data Analysis
Session 1 – 2
Program: PT MBA
Trim: V
Instructor: Dr. Abhinav Sharma
Two Different Approaches to Data Analysis
Statistical/data models – analysis where a specific model is proposed (e.g., dependent and independent variables to be analyzed by the general linear model), the model is then estimated and a statistical inference is made as to its generalizability to the population through statistical tests.
Data mining/algorithmic models – models based on algorithms (e.g., neural networks, decision trees, support vector machines). Their emphasis is on predictive accuracy rather than statistical inference and explanation.
Advanced Methods of Data Analysis, Dr. A.K. Sharma, NMIMS Mumbai 2
Two Different Approaches to Data Analysis
• There is no “best” approach; each has strengths and weaknesses.
• Analysts today must assess each research situation and identify the best modeling
approach for that specific situation (i.e., objective, data, etc.).
Is data analysis everything?
• American Express - Default Prediction: Predict if a customer will default in the
future!
What is Multivariate Analysis?
• All statistical techniques that simultaneously analyze multiple measurements
on individuals or objects under investigation. Thus, any simultaneous analysis of
more than two variables can be loosely considered multivariate analysis.
• Many multivariate techniques are extensions of univariate procedures
• ANOVA → MANOVA
• Many other techniques are uniquely multivariate
• Factor analysis, cluster analysis, discriminant analysis
Types of Data and Measurement Scales
Data
• Nonmetric or Qualitative: Nominal Scale, Ordinal Scale
• Metric or Quantitative: Interval Scale, Ratio Scale
NOTE: The level of measurement is critical in determining the appropriate
multivariate technique to use!
A multivariate data problem
1. How to distinguish between a forged and a genuine bill?
2. Attributes?
3. Compare the bills!
A multivariate data problem
x1: length of bill
x2: width of bill, measured on left
x3: width of bill, measured on right
x4: width of margin at the bottom
x5: width of margin at top
x6: length of image diagonal
Mahalanobis Distance
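The Mahalanobis distance between the centroid $\bar{x}$ and data point $x_i$ is given as:
$$MD_i = \sqrt{(x_i - \bar{x})\, S_x^{-1}\, (x_i - \bar{x})^T}$$
where $S_x$ is the sample covariance matrix of the data.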
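The distance defined above can be computed for every observation at once. The following is a minimal sketch, assuming each row of the data matrix is one observation and that S_x is the sample covariance matrix. The function name `mahalanobis_distances` and the measurements are hypothetical and are not the course data.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the centroid of X."""
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)                 # x-bar, one mean per attribute
    S = np.cov(X, rowvar=False)               # sample covariance matrix S_x
    S_inv = np.linalg.inv(S)                  # assumes S is non-singular
    diff = X - centroid                       # (x_i - x-bar) for every row
    # Quadratic form diff_i * S^-1 * diff_i^T, evaluated row by row
    squared = np.einsum("ij,jk,ik->i", diff, S_inv, diff)
    return np.sqrt(squared)

# Hypothetical measurements (NOT the course data): 5 bills, 2 attributes
bills = np.array([
    [214.8, 141.0],
    [214.6, 141.7],
    [215.0, 142.2],
    [215.2, 139.8],
    [214.9, 140.5],
])
distances = mahalanobis_distances(bills)
print(distances)
```

Observations with a large distance lie far from the centroid once the correlation between attributes is taken into account, which makes the measure a common first check for unusual cases.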
Hypothesis Testing
People who did their MBA during 2019-2021 (online mode) are the most intelligent, as
the average grade was 3.76/4 with a standard deviation of 0.04.
Really?
Hypothesis Testing
If you drink Horlicks, you can
grow taller, stronger and sharper (3
in 1).
Wearing deodorant makes you
attractive to the opposite gender
(known as the Axe effect).
Hypothesis Testing
Women take more selfies
compared to men
Smokers are better salespeople.
Hypothesis Testing
Setting up hypothesis:
Step 1: Describe the hypothesis in words. A few examples of hypotheses are:
(a) The average time spent by women using social media is greater than that spent by
men.
(b) On average, women upload more photos on social media than men.
(c) Customers of mobile phone service providers with more than one mobile handset
are more likely to churn.
(d) The average mortality rate due to coronavirus is higher for males than for
females.
Step 2: Based on the claim made in Step 1, define the null and alternative
hypotheses. Initially, we believe that the null hypothesis is true.
Hypothesis Testing
Setting up hypothesis:
What will be the null and alternate hypotheses for the claim ‘women use social media
more than men’?
Null: There is no relationship between gender and average time spent on social
media
Alternate: There is a relationship between gender and average time spent on
social media
Remember: Null and alternative hypotheses are defined using a population
parameter.
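As a sketch of what "defined using a population parameter" can look like for this example, let $\mu_W$ and $\mu_M$ (notation assumed here, not taken from the slides) denote the population mean time spent on social media by women and by men. The wording above corresponds to a two-sided pair of hypotheses. If the directional claim itself is to be tested, a one-sided pair is the usual choice:

```latex
% Two-sided form (matches "no relationship" vs. "a relationship")
H_0: \mu_W = \mu_M \qquad H_1: \mu_W \neq \mu_M

% One-sided form (matches the claim "women use social media more than men")
H_0: \mu_W \le \mu_M \qquad H_1: \mu_W > \mu_M
```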
Hypothesis Testing
Setting up hypothesis:
Step 3: Identify the test to be used.
Step 4: Determine the p-value and compare it with the level of significance.
Step 5: Take the decision to reject or retain the null hypothesis.
Remember:
1. p-value is low, null must go.
2. The alternate hypothesis is the one of interest to us; it is the claim we believe to be true.
3. The null hypothesis is the claim that is assumed to be true initially. That is, at the
beginning we assume that the null hypothesis is true and retain it unless there is
strong evidence against it.
1- Sample Z Test
When can it be used?
1. When the sample size is large.
2. When the population standard deviation is known.
A passport office claims that passport applications are processed within 30 days
of submitting the application form along with the necessary documents. The
processing times of a sample of 40 applications are provided in the Excel file
‘Passport time.xlsx’. The population standard deviation of the processing time is
12.5 days. Verify the claim made by the passport office.
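The passport example can be worked through along the following lines. This is a minimal sketch using only the Python standard library, assuming the claim is placed in the alternative hypothesis (H0: mean ≥ 30 days, H1: mean < 30 days) and a 5% level of significance, which the slide does not state. The function name `one_sample_z_test` and the sample values are placeholders, not the contents of the Excel file.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma, alternative="less"):
    """Return (z, p_value) for a one-sample Z test with known population sigma."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf(z)          # P(Z <= z) under the standard normal
    if alternative == "less":          # H1: mu < mu0
        p_value = cdf
    elif alternative == "greater":     # H1: mu > mu0
        p_value = 1 - cdf
    else:                              # two-sided, H1: mu != mu0
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value

# Placeholder values, NOT the contents of 'Passport time.xlsx'
processing_days = [16, 24, 25, 28, 29, 31, 27, 22, 35, 19]
alpha = 0.05  # assumed level of significance

z, p_value = one_sample_z_test(processing_days, mu0=30, sigma=12.5)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "retain H0")
```

With the real data, the 40 processing times from the Excel file would replace the placeholder list; the rest of the calculation stays the same.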
1- Sample t-Test
When can it be used?
• When the population standard deviation is unknown
Aravind Productions (AP) is a newly formed movie production house based out
of Mumbai, India. AP was interested in understanding the production cost of
producing a Bollywood movie. The industry believes the production house will
require at least Rs.500 million (50 crore) on average. It is assumed that
Bollywood movie production costs follow a normal distribution. The production
costs of 40 Bollywood movies, in millions of rupees, are given in the Excel file ‘Movie
budget.xlsx’. Conduct an appropriate hypothesis test at alpha = 0.05 to check
whether the belief about the average production cost is correct.
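A one-sample t-test for this kind of problem might look as follows. This is a sketch using `scipy.stats.ttest_1samp`, assuming one reading of the belief (H0: mean ≤ 500, H1: mean > 500); pass `alternative="less"` if the belief is set up as the null hypothesis instead. The wrapper `one_sample_t_test` and the cost figures are placeholders, not the contents of the Excel file.

```python
from scipy import stats

def one_sample_t_test(sample, mu0, alpha=0.05, alternative="greater"):
    """Return (t, p_value, reject_null) for a one-sample t-test."""
    result = stats.ttest_1samp(sample, popmean=mu0, alternative=alternative)
    t_stat = float(result.statistic)
    p_value = float(result.pvalue)
    return t_stat, p_value, bool(p_value < alpha)

# Placeholder costs in millions of rupees, NOT the contents of 'Movie budget.xlsx'
costs = [601, 627, 330, 364, 562, 353, 583, 254, 528, 470]

t_stat, p_value, reject_null = one_sample_t_test(costs, mu0=500)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
print("reject H0" if reject_null else "retain H0")
```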
2- Sample t-Test
When can it be used?
• When comparing the means of two independent samples
Millions of investors buy mutual funds, choosing from thousands of possibilities.
Some funds can be purchased directly from banks or other financial institutions
whereas others must be purchased through brokers, who charge a fee for this
service. This raises the question: can investors do better by buying mutual funds
directly than by purchasing mutual funds through brokers? To help answer this
question, a group of researchers randomly sampled the annual returns from
mutual funds that can be acquired directly and mutual funds that are bought
through brokers and recorded the net annual returns, which are the returns on
investment after deducting all relevant fees. These are given in the Excel file ‘Mutual
fund 2 sample t.xlsx’.
Remember to test for equality of variances.
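The two steps (check equality of variances, then run the matching t-test) can be sketched as below. Levene's test is used here as one possible variance check; the slides do not say which test is intended. The alternative hypothesis is assumed to be that direct purchases have the higher mean return. The function `two_sample_t_test` and the return figures are placeholders, not the contents of the Excel file.

```python
from scipy import stats

def two_sample_t_test(a, b, alpha=0.05, alternative="greater"):
    """Variance check followed by an independent two-sample t-test."""
    # Levene's test: H0 = both groups have the same variance
    _, p_var = stats.levene(a, b)
    equal_var = bool(p_var >= alpha)   # retain equal variances unless rejected
    # equal_var=False switches to Welch's t-test (unequal variances)
    t_stat, p_value = stats.ttest_ind(
        a, b, equal_var=equal_var, alternative=alternative
    )
    return {
        "equal_var": equal_var,
        "t": float(t_stat),
        "p_value": float(p_value),
        "reject_null": bool(p_value < alpha),
    }

# Placeholder net annual returns (%), NOT the contents of the Excel file
direct = [9.3, 6.9, 4.8, 11.2, 7.1, 3.5, 8.8, 10.4]
broker = [3.2, 7.5, 1.8, 6.1, 4.4, 9.0, 2.7, 5.3]

outcome = two_sample_t_test(direct, broker)
print(outcome)
```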
Paired t-Test
When can it be used?
• When data on the same variable is captured twice from the same subject (e.g., before and after an event).
Data related to alcohol consumption before and after a break-up is provided in the
Excel file ‘Alcohol consumption.xlsx’. Conduct an appropriate test to check whether
alcohol consumption is higher after the break-up.
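A paired t-test for the before/after setting can be sketched with `scipy.stats.ttest_rel`. The sketch assumes H0: mean difference (after minus before) ≤ 0 against H1: mean difference > 0, and a 5% level of significance, which the slide does not state. The wrapper `paired_t_test` and the consumption figures are placeholders, not the contents of the Excel file.

```python
from scipy import stats

def paired_t_test(before, after, alpha=0.05):
    """One-sided paired t-test of H1: mean(after - before) > 0."""
    # Each position in the two lists must refer to the same subject
    result = stats.ttest_rel(after, before, alternative="greater")
    t_stat = float(result.statistic)
    p_value = float(result.pvalue)
    return t_stat, p_value, bool(p_value < alpha)

# Placeholder consumption values, NOT the contents of 'Alcohol consumption.xlsx'
before = [470, 354, 496, 351, 349, 449, 378, 359]
after = [408, 439, 321, 437, 335, 344, 318, 492]

t_stat, p_value, reject_null = paired_t_test(before, after)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
print("reject H0" if reject_null else "retain H0")
```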