RMunit 3
Research Methodology
(Shaheed Rajguru College of Applied Sciences for Women)
Syllabus (Unit-3): Data Collection and Analysis: Observation and collection of data-
Methods of data collection. Modeling: Mathematical models for research, Sampling
Method, Data processing, and analysis strategies. Data analysis with statistical packages-
Hypothesis testing, Sampling, Sampling Error, Statistical methods/Tools -Measure of
central tendency and Variation, Test of Hypothesis- z test, t-test, F test, ANOVA, Chi-
square, correlation, and regression analysis, Error estimation.
Chapter Topic
5 Measure of Central Tendency and Variation
5.1 Measures of Central Tendency
5.1.1 Mean
5.1.2 Median
5.1.3 Mode
5.2 Why is Central Tendency important in research
5.3 Measures of variation
5.3.1 Range
5.3.2 Variance
5.3.3 Standard Deviation
5.4 Why is Variation important in research
6 Hypothesis
6.1 What is Hypothesis
6.2 Independent variable and Dependent variable
6.3 Characteristics of Good Hypothesis
6.4 Types of Hypothesis
6.4.1 Simple hypothesis
6.4.2 Complex hypothesis
6.4.3 Null hypothesis
6.4.4 Alternative hypothesis
6.4.5 Directional hypothesis
6.4.6 Non-directional hypothesis
6.5 More Facts about Null-Hypotheses and Alternate-Hypothesis
7 Hypothesis Testing
7.1 T-Test
7.2 Z-Test
7.3 Comparison between T-Test and Z-Test
7.4 F-Test
7.5 ANOVA Test
7.6 Chi-square Test
7.7 When to use ANOVA
7.8 When to use Chi-square
8 Correlation and Regression
8.1 Correlation Analysis
8.2 Regression Analysis
8.3 Key Differences
9 Error Estimation
9.1 Types of Errors
9.2 Error Estimation Techniques
9.3 Importance of Error Estimation
Summary
Chapter 1
Data Collection and Analysis
Nominal Data:
Nominal data is used to categorize things into mutually exclusive groups without any rank or
order.
Example: marital status (single, divorced, married), gender (male, female), mode of
transportation (car, bus, plane, train).
Ordinal Data:
Ordinal data categorizes the data and ranks it in a certain order.
Example: Grades in the exam (A, B, C, D), position in a competition (first, second, third),
financial status (upper class, middle class, lower class), level of education (university level,
college level, school level).
Discrete Data:
Data that can only take specific/distinct values. It is in the form of whole numbers or integers.
Example: Number of students in a college, Number of family members, Number of states.
Continuous Data:
Data that is measured rather than counted and can take any numerical value within a range, including fractions and decimals.
Example: Income, Temperature data, Price of any product.
Figure 1.1 Methods of Primary Data Collection
1.1.3 Considerations (Special care) while collecting the Secondary Data for Research.
While collecting secondary data, the following points must be considered.
General considerations:
(i) Relevance:
The researcher must ensure that the data is relevant and fit for the research purpose. The data
should address/answer the specific questions/problems and fulfill the objectives of the
research.
(ii) Adequacy:
Evaluate if the data is sufficient to answer the research question.
Ethical Considerations:
(viii) Obtain necessary permissions: If the data is not publicly available, ensure that you have
the necessary permissions to use it.
(ix) Protect data confidentiality: If the data contains sensitive information, ensure that it is
protected and handled ethically.
(x) Acknowledge sources: Properly cite/reference the source of the secondary data in your
research.
1.2 Methods and Tools of data collection
1.2.1. Primary Data Collection Methods
1.2.1.a Interview
The interview is a primary data collection method.
An interview is a communication process between two persons.
One person asks the questions and the other responds. The one who asks the questions
is called the interviewer. The one who responds to the questions is called the interviewee.
The response of the interviewee is recorded by the interviewer (researcher).
Interviews can be classified as
Structured Interview
Unstructured Interview
Structured Interview:
A structured interview has pre-planned, pre-determined questions.
The interviewer asks only that fixed number and type of questions.
Unstructured Interview:
The questionnaire can be classified as
Structured Questionnaire
Unstructured Questionnaire
Structured Questionnaire
An unstructured questionnaire does not have any specific structure (it has only a
basic structure).
Qualitative data is collected through an unstructured questionnaire.
It uses open-ended questions, allowing respondents to provide detailed, free-form
(descriptive) answers without predefined choices.
This method provides a lot of flexibility and rich insights.
For example: What was your experience at a certain place/restaurant? (Answered as a description.)
Tools for questionnaire data collection: Google Forms, SurveyMonkey, Typeform, and
Jotform.
1.2.1.c Observation:
The researcher observes the subject for a particular behavior or event.
Observation is a method of collecting primary data.
Observation is defined as a systematic, selective, and purposeful way of watching,
examining, or listening to what is happening in its natural setting and
documenting it.
In observation, the researcher does not interfere in the event.
For example: A cricket team coach observes his/her team while they are performing.
Observation is also useful when full and accurate information cannot be obtained
through a questionnaire, because the respondent either is unaware of the answers or
is not cooperative.
Types of observation:
Naturalistic observation:
As the name suggests, in naturalistic observation the researcher observes how the
participants respond to their environment in real life, that is, in a natural setting.
The researcher does not influence their behavior.
Example: Observing animals in national parks.
Participant observation:
In participant observation, the researcher takes part in the activities of the group being
observed as a member of the group, with or without the members knowing that they are being observed.
Example:
Spending a few months in jail with prisoners to learn their perception of the judicial system in
the country.
Non-Participant Observation:
The researcher observes from a distance, without actively participating in the activity.
Structured Observation:
Structured observation involves systematically recording observations of specific,
pre-defined behaviors or events in a setting, often using checklists or coding
systems.
Researchers using structured observation aim to gather quantitative data by
quantifying the frequency, duration, or intensity of specific behaviors.
Unstructured Observation:
Unstructured observation is a qualitative research method where researchers observe
behaviors or events without a predetermined checklist or structured recording
system.
It allows for a more flexible, open-ended approach to data collection
For example, a researcher spends time in a public
park, taking detailed notes on social interactions and behaviors without a pre-defined
checklist or categories.
Overt Observation:
The participants know they are being observed.
Covert Observation:
The participants are unaware that they are being observed.
Tools for Observation data collection: Field notes, Audio and Video recordings, Rating
scales, Checklists, and Anecdotal Records.
1.3 Possible Questions
Explain the classification of data with suitable examples.
Q 1.1 Evaluate the different methods of collecting secondary data.
Q 1.2 Distinguish between primary data and secondary data.
Q 1.3 Explain the meaning of primary data and secondary data. Discuss the special care
that is to be taken while collecting secondary data for research.
Q 1.4 Discuss the different methods of collecting data and their merits and demerits, and
comment briefly on the ethical issues in collecting data.
Q 1.5 Explain the merits and demerits of the interview data collection method.
Answer:
Merits:
More in-depth information obtained
Personal Information
Greater Flexibility
Adaptation as per the respondent
Demerits:
Bias of Interviewer
Expensive/Time Consuming
Need expertise
Q 1.6 What are the characteristics of a good questionnaire?
The following are the characteristics of a good questionnaire.
Chapter 2
Mathematical Modelling
Researchers start by clearly defining the problem or phenomenon they want to model.
They identify key variables, relationships, and assumptions, and then translate them into
mathematical equations or other models.
The model is then used to analyze the system's behavior, simulate different scenarios, and
make predictions.
The model's accuracy and usefulness are evaluated against real-world data (using test data),
and it may be refined based on the results.
2.3 Types of mathematical models:
Deterministic models assume that the system's behavior (the system or process output) is
completely determined by its initial conditions and input variables.
Differential equation models are often used to describe the dynamic behavior of systems over
time, such as the transient behavior of second-order systems (mechanical: the spring-mass-damper
system; electrical: RLC circuits).
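As a rough illustration of a differential equation model, the sketch below simulates a spring-mass-damper system, m*x'' + c*x' + k*x = 0, step by step. The parameter values (m, c, k, the time step, and the number of steps) are assumptions chosen only for illustration, and the semi-implicit Euler method used here is just one simple way to solve such an equation numerically.

```python
def simulate_spring_mass_damper(m=1.0, c=0.5, k=4.0, x0=1.0, v0=0.0,
                                dt=0.001, steps=10_000):
    """Integrate m*x'' + c*x' + k*x = 0 with the semi-implicit Euler method.

    All default values are illustrative assumptions, not measured data.
    Returns the list of positions x at every time step.
    """
    x, v = x0, v0
    positions = [x]
    for _ in range(steps):
        a = -(c * v + k * x) / m  # acceleration given by the model equation
        v += a * dt               # update the velocity first
        x += v * dt               # then the position, using the new velocity
        positions.append(x)
    return positions


positions = simulate_spring_mass_damper()
print(f"start: {positions[0]:.3f}  after 10 s: {positions[-1]:.3f}")
```

The displacement starts at 1.0 and dies away over time, which is the transient behavior the model is meant to describe.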
A statistical model is a mathematical model that uses statistical assumptions for the
generation of sample data.
2.4 Benefits of Using Mathematical Models:
They can be used to identify optimal strategies and make informed decisions.
They can be used to test the validity of different hypotheses and theories.
Chapter 3
Sampling Methods
Sampling is the careful selection of a subset (a sample) from the population for
research/study.
In practice, it is rarely possible to study the entire population. Studying the entire
population needs more resources, time, and money, and it is difficult
to handle, manage, and analyze such large amounts of data.
To draw valid conclusions from research, the sample should be carefully selected so that it
represents the whole population.
The researcher collects a sample from the population, analyzes the data
using different statistical tools, and draws conclusions. He/she then generalizes
the results to the entire population.
Sampling methods are of two types:
Probability Sampling
Non-Probability Sampling
3.3 Probability-Sampling
Probability sampling methods are those in which every subject in the target population has a known
chance (in the simplest case, an equal chance) of being selected for the sample.
3.3.1 Simple Random Sampling:
In this technique, every member of the sample is selected purely at random, and every member of
the population has an equal chance of being selected.
For example: Picking chits from a bowl and a lottery system are methods of random
sampling.
First, you define the population. Suppose the population size is 1000 and you choose a
sample size of 100. You must then select 100 respondents randomly.
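The selection of 100 respondents out of 1000 can be sketched in Python as follows. The respondent IDs and the fixed seed are assumptions made only so that the example is repeatable.

```python
import random

rng = random.Random(42)                 # fixed seed so the draw can be repeated
population = list(range(1, 1001))       # hypothetical respondent IDs 1..1000
sample = rng.sample(population, k=100)  # 100 IDs drawn without replacement

print(len(sample), "respondents selected, for example:", sample[:5])
```

Because `random.sample` draws without replacement, no respondent can be picked twice, just as a chit taken out of the bowl cannot be drawn again.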
3.3.2 Stratified Sampling
In this technique, the population is divided into mutually exclusive groups (strata), and every
member of a group has an equal chance of being selected for research.
Grouping may be based on age, gender, employment, etc. From every group, the researcher
selects the respondents purely on a random basis.
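A minimal sketch of stratified sampling is given below. The population of 1000 people, the gender split (600 F, 400 M), the 10% sampling fraction, and the helper name `stratified_sample` are all hypothetical and used only for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_of, fraction, rng):
    """Draw the same fraction at random from every stratum (group)."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)  # group records by stratum

    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))       # random pick inside the stratum
    return sample


rng = random.Random(0)
# Hypothetical population: 600 women and 400 men.
people = [{"id": i, "gender": "F" if i % 5 < 3 else "M"} for i in range(1000)]
sample = stratified_sample(people, lambda p: p["gender"], 0.10, rng)

women = sum(1 for p in sample if p["gender"] == "F")
men = sum(1 for p in sample if p["gender"] == "M")
print(women, men)  # 60 40
```

The sample keeps the same 60:40 proportion as the population, which is the point of stratifying before drawing at random.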
3.3.3 Systematic Sampling
Systematic sampling is a probability sampling method where researchers select members
of the population at regular intervals.
For example, the researcher selects every 3rd person on a list of the population. If the list is in
random order, this gives a sample similar to a simple random sample.
Sampling interval = total population / sample size
Suppose the total population is p and the sample size is n. Then
Sampling interval = p/n
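The sampling interval formula can be tried out with a short sketch. Here p = 1000 and n = 100 are assumed values, so the interval is 10; the random starting point inside the first interval is a common convention and is an assumption of this example.

```python
import random


def systematic_sample(population, sample_size, rng):
    """Pick every k-th member, where k = population size // sample size."""
    interval = len(population) // sample_size  # sampling interval = p / n
    start = rng.randrange(interval)            # random start within the first interval
    return [population[start + i * interval] for i in range(sample_size)]


rng = random.Random(7)
population = list(range(1, 1001))              # hypothetical list of 1000 people
sample = systematic_sample(population, 100, rng)
print(sample[:5])
```

Every selected ID is exactly 10 positions after the previous one, so only the starting point is left to chance.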
For example,
A company wants to study the performance of its product in the country. The country
is divided into clusters (cities, towns, metropolitan areas, etc.).
Figure 3.4 Cluster Sampling
The key difference between stratified and cluster sampling lies in their approach to
grouping and sampling.
In stratified sampling, samples are chosen randomly from within each distinct,
homogeneous group (stratum).
Stratified sampling aims for homogeneous subgroups (strata), while cluster sampling
involves heterogeneous groups (clusters).
In contrast, cluster sampling involves randomly selecting clusters from the
population and then sampling all members within the chosen clusters. This
method proves particularly efficient for populations spread across various geographical
locations.
Advantages of Stratified Sampling
Increases precision and reduces sampling error, especially when there is high
variation between subgroups.
Allows for separate analysis of subgroups.
Disadvantages:
Not easy to implement.
Advantages of Cluster Sampling
Disadvantages:
This may lead to less precision and higher sampling error compared to stratified
sampling, especially if clusters are not representative of the population.
Requires clear information about cluster boundaries.
Non-probability sampling is a method in which not all population members have an equal
chance of participating in the study, unlike probability sampling.
3.5.4 Snowball sampling
Just as a snowball gets bigger and bigger as it rolls downhill, the sample grows as the study proceeds.
It is a sampling technique in which the researcher selects one or two respondents
first. These respondents refer to or identify other respondents.
The researcher continuously selects respondents based on referral until the required
sample size is achieved.
Snowball sampling is also called referral sampling, chain sampling, network
sampling, and friend-to-friend sampling.
Assume you are a market researcher of a company looking to introduce a new product to the
market. You must collect data from a sample of potential customers as part of your research to
determine their preferences and purchasing behavior. But how can you be sure that the
information you get from your sample is accurate for all the people who might buy your
product? The idea of sampling error comes into play here.
3.6.1 Definition:
Sampling error is the difference between a value measured in the sample and the corresponding
true value in the entire population. It can
significantly affect how accurate and reliable market research data is.
Or
A sampling error occurs when the sample used in the study does not represent the entire
population.
Increase sample size
A larger sample is more accurate because the sample size gets closer to the actual
population size.
Sample groups in proportion to their size in the population instead of taking one purely random sample.
Study your population and understand its demographic mix (various characteristics of
a population, such as age, gender, ethnicity, income, education level, and other
socioeconomic factors).
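The effect of sample size on sampling error can be seen in a small simulation. The population below (10,000 randomly generated ages between 18 and 70) is entirely made up for illustration; the sampling error is taken as the sample mean minus the population mean.

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical population: ages of 10,000 potential customers.
population = [rng.randint(18, 70) for _ in range(10_000)]
population_mean = statistics.mean(population)


def sampling_error(n):
    """Sample mean minus population mean for one random sample of size n."""
    sample = rng.sample(population, n)
    return statistics.mean(sample) - population_mean


for n in (10, 100, 1000, len(population)):
    print(f"n = {n:>6}: sampling error = {sampling_error(n):+.3f}")
```

The individual errors change from run to run, but they tend to shrink as n grows, and the error is exactly zero when the whole population is sampled.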
Chapter 4
Data Processing and Data Analysis
Data Processing
Data processing means the steps we follow while performing data analysis. Data analysis is a
critical task and needs full attention.
Data processing includes the following steps:
4.1 Data Collection
4.2 Data Cleaning
4.3 Data Analysis
4.4 Data Interpretation
4.5 Data Visualisation
4.4 Data Interpretation:
After data analysis, the next step is data interpretation. Data interpretation is drawing
conclusions and inferences based on data analysis and generalizing research findings.
4.5 Data visualization:
Graphically represent the research findings using bar charts, graphs, line charts, plots, tables,
heat maps, etc. Visualization helps to understand key research findings and observe
relationships.
Chapter 5
Measure of Central Tendency and Variation
One of the most useful statistics for researchers is the center point of the data.
Knowing the center point answers questions such as, “What is the average value?”
The central tendency in statistics describes the central or typical value of a dataset.
It is a measure used to summarize a dataset with a single value that represents the
middle or center of the data distribution. The three main measures of central tendency
are the
Mean
Median
Mode
All three provide insights into “the center” of a distribution of data points. These measures
of central tendency are defined differently because they each describe the data differently
and will often reflect a different number. Each of these statistics can be a good measure
of central tendency in certain situations.
Important point:
The choice of central tendency measure (mean, median, or mode) depends on the shape of
the data distribution.
The mean is appropriate for normally distributed data. (The normal distribution is a
statistical distribution in which data points are spread symmetrically around the mean.)
5.1.1 Mean: The mean is the average of all the values in a dataset, calculated by summing
the values and dividing by the total number of values.
5.1.2 Median: The median is the middle value in a sorted (in a certain order) dataset. In
other words, half of the values are above the median, and half are below.
5.1.3. Mode: The mode is the value that appears most frequently in a dataset.
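The three measures can be computed with Python's built-in `statistics` module. The two small datasets below are made up for illustration; the second one contains one extreme value to show how the mean and the median react differently.

```python
import statistics

scores = [2, 3, 3, 5, 7, 10]             # hypothetical data

mean = statistics.mean(scores)           # (2+3+3+5+7+10) / 6 = 5
median = statistics.median(scores)       # middle of the sorted data: (3+5)/2 = 4.0
mode = statistics.mode(scores)           # most frequent value: 3
print(mean, median, mode)

# The same data with one extreme value (a skewed dataset).
skewed = [2, 3, 3, 5, 7, 100]
skewed_mean = statistics.mean(skewed)      # 120 / 6 = 20
skewed_median = statistics.median(skewed)  # still 4.0
print(skewed_mean, skewed_median)
```

A single extreme value pulls the mean from 5 to 20 while the median stays at 4, which is why the median is usually preferred for skewed data.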
5.2 Why is Central Tendency important in research?
Central tendency measures provide a concise way to describe the typical value of a dataset,
making it easier to interpret and communicate research findings.
Researchers can use these measures to compare the central tendencies of different groups in
their study, for example, comparing the average income of two different populations.
These measures are often used as inputs in more complex statistical analyses, such as t-tests
and ANOVA.
The choice of central tendency measure (mean, median, or mode) depends on the shape of the
data distribution. The mean is appropriate for normally distributed data, while the median is
better for skewed distributions, and the mode is useful for categorical data.
Variation describes the spread or dispersion of data points around the central value. The three
main measures of variation are:
Range
Variance
Standard deviation
5.3.1 Range:
The difference between the highest and lowest values in a dataset.
5.3.2 Variance:
A measure of how much data values deviate from the mean, calculated as the average of the
squared deviations from the mean.
5.3.3 Standard Deviation:
The square root of the variance. It also measures the spread of data points around the mean,
in the same units as the data.
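A short sketch of the measures of variation, again using the `statistics` module and made-up data. Note that the module offers both a population variance (dividing by n) and a sample variance (dividing by n - 1); which one applies depends on whether the data is the whole population or a sample.

```python
import statistics

data = [4, 8, 6, 5, 7]                            # hypothetical data, mean = 6

data_range = max(data) - min(data)                # 8 - 4 = 4
population_variance = statistics.pvariance(data)  # 10 / 5 = 2
sample_variance = statistics.variance(data)       # 10 / 4 = 2.5
sample_std_dev = statistics.stdev(data)           # square root of 2.5

print(data_range, population_variance, sample_variance, round(sample_std_dev, 3))
```

The squared deviations from the mean are 4, 4, 0, 1, and 1, which add up to 10; dividing by 5 or by 4 gives the two variances shown.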
5.4 Why is Variation important in research?
Variance provides a numerical measure of how much the data points deviate from
the average.
A higher variance indicates greater variability, meaning the data is more dispersed,
while a lower variance suggests data points are clustered closer to the mean.
Variance is the foundation of many statistical tests, including the Analysis of Variance
(ANOVA).
Chapter 6
Hypothesis
6.2.2 The dependent variable (the effect) is the outcome that is measured and expected to
be influenced by the independent variable.
In the above example, the dependent variable is lung cancer.
Generally, the dependent variable is the research problem under study (in this case, lung
cancer).
If our research supports this hypothesis, we accept it; otherwise, we reject it
(first example).
Second Example:
Regular exercise (independent variable) boosts our immunity and reduces the risk of
heart-related disease (dependent variable).
6.3 Characteristics of Good Hypothesis
It should be testable, that is, it can be verified without much difficulty.
It should be logical (sensible, reasonable, and practical)
It should be specific and clear
It should be simple and understandable.
It should be economical.
It should be relevant (should be related to our Research problem)
A simple hypothesis predicts a relationship between a single dependent variable
and a single independent variable.
For example: If you meditate (cause), you feel happy (effect).
6.4.4 Alternative hypothesis
It is just the opposite of the null hypothesis. It states that there is a significant relationship
between the two variables.
This hypothesis contradicts the null hypothesis.
For example: When I asked my friends and relatives (while testing the hypothesis), people told me
that their health improved after taking green tea.
For example, there is a positive relationship between advertisements and the sale of
products. This hypothesis predicts the relationship in the positive direction.
6.5 More Facts about Null-Hypotheses and Alternate-Hypothesis
Null-Hypotheses
It states that nothing unusual is happening: things are as they normally are.
It is the "equal to" case:
Two things are assumed to be equal until we prove otherwise.
Examples of the "equal to" case:
The means of the two groups are equal.
The intelligence levels of the two people are equal.
*Imp point: The researcher always (in most cases) tests the null hypothesis.
Alternate-Hypothesis
The alternate hypothesis is whatever is not the null, that is, it goes against the null.
For example, a researcher wants to know whether the intelligence of two people is equal or
not.
The null says they are the same; the alternate says they are not the same.
Most researchers want to disprove (reject) the null hypothesis
in order to show that the outcome is significant and effective. The hypotheses you see
in research papers are alternate hypotheses. This is the hypothesis the researcher actually
wants to claim.
Chapter 7
Hypothesis Testing
7.1 T-Test
A T-Test can be used:
(a) To compare the sample mean with that of the population mean
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
where
$\bar{x}$ is the sample mean
$\mu$ is the population mean
$s$ is the sample standard deviation
$n$ is the sample size
(b) To compare the means of two different samples
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where
$\bar{x}_1$ is the sample mean of the first group
$\bar{x}_2$ is the sample mean of the second group
$s_1$ is the sample/group 1 standard deviation
$s_2$ is the sample/group 2 standard deviation
$n_1$ is the sample size of group 1
$n_2$ is the sample size of group 2
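Both t statistics can be computed by hand in a few lines of Python. This is a minimal sketch with made-up data; it assumes the two-sample form that divides each sample variance by its own sample size, and it only computes the t value, not the p-value or the decision.

```python
import math
import statistics


def one_sample_t(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n))"""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    return (statistics.mean(sample) - mu) / (s / math.sqrt(n))


def two_sample_t(group1, group2):
    """t = (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)"""
    v1 = statistics.variance(group1)  # s1 squared
    v2 = statistics.variance(group2)  # s2 squared
    standard_error = math.sqrt(v1 / len(group1) + v2 / len(group2))
    return (statistics.mean(group1) - statistics.mean(group2)) / standard_error


group_a = [4, 8, 6, 5, 7]   # hypothetical scores, mean 6, variance 2.5
group_b = [2, 6, 4, 3, 5]   # hypothetical scores, mean 4, variance 2.5

t_one = one_sample_t(group_a, mu=5)     # 1 / sqrt(0.5), about 1.414
t_two = two_sample_t(group_a, group_b)  # 2 / sqrt(0.5 + 0.5) = 2.0
print(round(t_one, 3), round(t_two, 3))
```

The computed t value is then compared with the critical value from a t table for the chosen significance level and degrees of freedom.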
7.2 Z-Test
A Z-Test can be used:
(a) To compare the sample mean with that of the population mean
$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$
where
$\bar{x}$ is the sample mean
$\mu$ is the population mean
$\sigma$ is the population standard deviation
$n$ is the sample size
(b) To compare the means of two different samples
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
where
$\bar{x}_1$ is the sample mean of the first group
$\bar{x}_2$ is the sample mean of the second group
$\sigma_1$ is the population/group 1 standard deviation
$\sigma_2$ is the population/group 2 standard deviation
$n_1$ is the sample size of group 1
$n_2$ is the sample size of group 2
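A one-sample z statistic and its two-sided p-value can be sketched with the standard library as follows. The numbers (sample mean 52, population mean 50, population standard deviation 10, sample size 100) are assumed values for illustration only.

```python
import math
from statistics import NormalDist


def one_sample_z(sample_mean, mu, sigma, n):
    """z = (sample mean - mu) / (sigma / sqrt(n))"""
    return (sample_mean - mu) / (sigma / math.sqrt(n))


def two_sided_p_value(z):
    """Probability of a value at least this extreme under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


z = one_sample_z(sample_mean=52.0, mu=50.0, sigma=10.0, n=100)  # 2 / 1 = 2.0
p = two_sided_p_value(z)                                        # about 0.0455
print(z, round(p, 4))
```

With a 5% significance level, a p-value of about 0.0455 would lead to rejecting the null hypothesis that the sample comes from a population with mean 50.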
7.3 Comparison between Z-test and T-test:
7.4 F-Test
It is calculated as:
$$F = \frac{s_1^2}{s_2^2}$$
Where
$$s_i^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad (\text{for } i = 1 \text{ and } 2)$$
It can be used:
To test the overall significance of the regression model
To compare the fits of different models
To test the equality of means
It assumes:
Population distribution is normal
Samples are drawn randomly and independently
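The F ratio of two sample variances can be computed directly. The data below is made up; the sketch only calculates F and leaves the comparison with the F table to the reader.

```python
import statistics


def f_statistic(sample1, sample2):
    """F = s1^2 / s2^2, the ratio of the two sample variances."""
    return statistics.variance(sample1) / statistics.variance(sample2)


sample1 = [4, 8, 6, 5, 7]   # hypothetical data, sample variance 2.5
sample2 = [5, 6, 7, 6, 6]   # hypothetical data, sample variance 0.5

f_value = f_statistic(sample1, sample2)  # 2.5 / 0.5 = 5.0
print(f_value)
```

A common convention, assumed here, is to place the larger variance in the numerator so that F is at least 1.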
7.5 ANOVA
7.6 Chi-Square Test
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
Where
$O_i$ is the observed frequency
$E_i$ is the expected frequency
Conditions of Chi-square test
Observations are collected and recorded on a random basis.
All the items in the sample must be independent.
No group should contain very few items (i.e., fewer than 10).
The overall number of items should be considerably large (at least 50). The
number of groups may be small.
7.7 When to use ANOVA?
You have a continuous dependent variable (e.g., height, weight, test scores) and one or more
categorical independent variables (e.g., treatment groups, age groups, education levels).
For example, you might use ANOVA to analyze the effects of different teaching methods
(independent variable) on student test scores (dependent variable) across multiple classes.
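For the teaching-methods example, a one-way ANOVA F value can be sketched as below. The three groups of test scores are invented for illustration; the code compares the variation between group means with the variation inside the groups.

```python
import statistics


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)  # degrees of freedom: k - 1
    ms_within = ss_within / (n - k)    # degrees of freedom: n - k
    return ms_between / ms_within


# Hypothetical test scores under three teaching methods.
method_a = [1, 2, 3]
method_b = [2, 3, 4]
method_c = [5, 6, 7]

f_value = one_way_anova_f([method_a, method_b, method_c])  # 13 / 1 = 13.0
print(round(f_value, 3))
```

A large F value suggests that at least one group mean differs from the others; the value is compared with the F table for (k - 1, n - k) degrees of freedom.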
7.8 When to use Chi-square?
1. You have categorical variables (e.g., gender, occupation, preferences) and want to
test for associations or independence between them.
2. You want to determine if the observed frequencies of categories differ significantly
from the expected frequencies.
3. You are interested in analyzing the relationship between two or more categorical
variables.
For example, you might use Chi-Square to examine whether there is an association between
smoking habits (variable 1: smoker, non-smoker) and the incidence of lung cancer (variable 2:
diagnosed with lung cancer, not diagnosed) in a population.
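The smoking example can be worked through as a chi-square test of independence. The counts in the 2 x 2 table below are invented for illustration; expected frequencies are computed as (row total x column total) / grand total.

```python
def chi_square_independence(table):
    """Sum of (observed - expected)^2 / expected over all cells of the table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical counts: rows = smoker / non-smoker,
# columns = diagnosed with lung cancer / not diagnosed.
observed = [
    [30, 70],
    [10, 90],
]

chi2 = chi_square_independence(observed)  # 5 + 1.25 + 5 + 1.25 = 12.5
print(chi2)
```

For a 2 x 2 table there is 1 degree of freedom, and the critical value at the 5% level is 3.841, so a value of 12.5 would indicate an association in this made-up data.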
Chapter 8
Correlation and Regression
Correlation and regression analysis are statistical techniques used in research to examine the
relationships among variables. Correlation measures the strength and direction of a
relationship between the variables, while regression analyzes how one variable changes
in relation to another in order to make predictions. (Regression is a kind of curve fitting, or modeling.)
For example: Measuring the correlation between hours studied and exam scores.
There is a correlation between height and weight.
Correlation Types:
8.1.1 Positive Correlation: When one variable increases, the other tends to increase as well,
or vice versa.
8.1.2 Negative Correlation: When one variable increases, the other tends to decrease, and vice
versa.
For example: If a car decreases its speed, the time it takes to reach a destination increases.
For example (no correlation): There is no relationship between the number of people with SAMSUNG mobiles and global warming.
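The strength and direction of a correlation are usually summarized by the Pearson correlation coefficient r, which lies between -1 and +1. The sketch below computes it by hand for two made-up datasets that mirror the examples above; the data values are assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: hours studied vs. exam scores (positive correlation).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
r_positive = pearson_r(hours, scores)   # about 0.99

# Hypothetical data: speed vs. travel time for a 120 km trip (negative correlation).
speeds = [40, 50, 60, 80]
times = [120 / s for s in speeds]
r_negative = pearson_r(speeds, times)

print(round(r_positive, 2), round(r_negative, 2))
```

A value near +1 indicates a strong positive correlation, a value near -1 a strong negative correlation, and a value near 0 little or no linear relationship.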
Regression Types:
8.2.1 Simple Linear Regression: Uses one independent variable to predict a dependent
variable.
8.2.2 Multiple Regression: Uses two or more independent variables to predict a dependent
variable.
For Example:
Predicting sales revenue based on advertising expenditure. (Simple Linear Regression)
Predicting Rain using humidity and temperature. (Multiple Regression)
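Simple linear regression can be illustrated with the sales example. The sketch below fits a straight line, revenue = intercept + slope x advertising spend, by the least-squares method; the five data points are invented and the units are arbitrary.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales revenue.
ad_spend = [1, 2, 3, 4, 5]
revenue = [12, 15, 15, 18, 20]

slope, intercept = fit_line(ad_spend, revenue)  # slope 1.9, intercept 10.3
predicted = intercept + slope * 6               # prediction for a spend of 6: 21.7
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

Multiple regression follows the same idea with two or more independent variables, but it is normally done with a statistical package rather than by hand.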
In Summary:
Chapter 9
Error Estimation
Random Error: Occurs due to chance fluctuations and can be minimized by increasing
sample size and improving measurement techniques.
Systematic Error: Results from a consistent bias in the measurement process and can be
addressed by calibrating instruments and refining experimental techniques.
Measurement Error: Includes all errors associated with the process of obtaining data,
such as errors in instrument calibration, human oversight, and data recording.
Sampling Error: This occurs when a sample is used to estimate a population parameter, and
the sample may not perfectly represent the population.
Non-sampling Error: Includes errors that occur outside of the sampling process, such as
response bias, non-response bias, and data processing errors.
Confidence Intervals: A range of values within which the true population parameter is likely
to fall, based on a specific level of confidence.
Statistical Tests: Methods like t-tests and ANOVA can be used to assess the significance of
differences between groups or to test hypotheses.
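A confidence interval for a mean can be sketched as: sample mean plus or minus (critical value x standard error). The example below uses made-up data and, as a simplifying assumption, the normal (z) critical value; for small samples like this one, a t critical value would normally be used and would give a wider interval.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(sample)
    mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * standard_error, mean + z * standard_error


sample = [4, 8, 6, 5, 7]   # hypothetical measurements, mean 6
low, high = mean_confidence_interval(sample)
print(round(low, 2), round(high, 2))
```

The interval is centered on the sample mean, and it becomes narrower as the sample size grows or as the data becomes less variable.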
Decision-Making: Helps researchers make informed decisions about the reliability and
validity of their findings, allowing for more confident interpretations.
Reproducibility: Allows other researchers to assess the reliability of the findings and replicate
the study with greater confidence.
In summary:
Error estimation is a crucial part of research methodology that helps researchers understand the
potential limitations of their findings and make more informed decisions about the
interpretation and validity of their results.