UNIT 3
Statistics Essentials in Machine Learning
Inferential Statistics: Learn Probability Distribution Functions,
Random Variable
A random variable is a fundamental concept in statistics that bridges the gap
between theoretical probability and real-world data. Formally, a random variable
is a function that assigns a real value to each outcome in the sample space
of a random experiment. For example, if you roll a die, you can assign a number
to each possible outcome.
 There are two basic types of random variables:
Discrete Random Variables (which take on specific, countable values).
Continuous Random Variables (which can assume any value within a given range).
 We define a random variable as a function that maps from the sample space of
 an experiment to the real numbers. Mathematically, Random Variable is
 expressed as,
                                    X: S →R
 where,
 X is Random Variable (It is usually denoted using capital letter)
 S is Sample Space
 R is Set of Real Numbers
 Types of Random Variables
 Random variables are of two types that are,
 Discrete Random Variable
 Continuous Random Variable
Discrete Random Variable
A Discrete Random Variable takes on a finite or countably infinite number of values. The
probability function associated with it is called the probability mass function (PMF).
PMF (Probability Mass Function)
If X is a discrete random variable with PMF P(X = xi) = pi, then
   0 ≤ pi ≤ 1 for every xi
   ∑ pi = 1, where the sum is taken over all possible values of x
Example: Let X take values in S = {0, 1, 2} with the following probabilities:

   xi           0      1      2
   P(X = xi)    p1     0.3    0.5

Find the value of P(X = 0).
Solution:
We know that the sum of all probabilities is equal to 1. Let P(X = 0) = p1. Then
p1 + 0.3 + 0.5 = 1
p1 = 0.2
Therefore, P(X = 0) = 0.2.
Continuous Random Variable
A Continuous Random Variable takes on an uncountably infinite number of values (any
value within a given range). The probability function associated with it is called the
PDF (Probability Density Function).
PDF (Probability Density Function)
If X is a continuous random variable and P(x < X < x + dx) = f(x) dx, then
 f(x) ≥ 0 for all x
 ∫ f(x) dx = 1 over all values of x
and f(x) is said to be the PDF of the distribution.
Example
Find the value of P(1 < X < 2) for the density function
f(x) = kx³ for 0 ≤ x ≤ 3, and f(x) = 0 otherwise.
Solution: For f(x) to be a density function, ∫ f(x) dx from 0 to 3 must equal 1, so k(3⁴/4) = 1 and k = 4/81.
Then P(1 < X < 2) = ∫ from 1 to 2 of (4/81)x³ dx = (4/81)(2⁴ − 1⁴)/4 = 15/81 = 5/27 ≈ 0.185.
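This result can be checked numerically; the following is a minimal sketch, assuming the scipy library is available.

from scipy.integrate import quad

k = 4 / 81                        # makes the density integrate to 1 over [0, 3]
f = lambda x: k * x**3            # f(x) = kx³ on 0 ≤ x ≤ 3, zero otherwise

total, _ = quad(f, 0, 3)          # should be 1.0
prob, _ = quad(f, 1, 2)           # P(1 < X < 2)
print(round(total, 4), round(prob, 4))   # 1.0 0.1852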
Sampling Methods
Probability Sampling Methods
1. Simple random sampling
In this case each individual is chosen entirely by chance and each member of the
population has an equal chance, or probability, of being selected. One way of
obtaining a random sample is to give each individual in a population a number, and
then use a table of random numbers to decide which individuals to include. For
example, if you have a sampling frame of 1000 individuals, labelled 0 to 999, use
groups of three digits from the random number table to pick your sample. So, if the
first three numbers from the random number table were 094, select the individual
labelled “94”, and so on.
As with all probability sampling methods, simple random sampling allows the
sampling error to be calculated and reduces selection bias. A specific advantage is
that it is the most straightforward method of probability sampling. A disadvantage
of simple random sampling is that you may not select enough individuals with your
characteristic of interest, especially if that characteristic is uncommon.
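A minimal sketch of simple random sampling with NumPy (the frame of 1000 labelled individuals and the sample size of 50 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)                  # seeded only to make the sketch reproducible
sampling_frame = np.arange(1000)                # individuals labelled 0 to 999
sample = rng.choice(sampling_frame, size=50, replace=False)   # every individual equally likely
print(sample[:10])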
2. Systematic sampling
Individuals are selected at regular intervals from the sampling frame. The intervals
are chosen to ensure an adequate sample size. If you need a sample size n from a
population of size x, you should select every x/nth individual for the sample. For
example, if you wanted a sample size of 100 from a population of 1000, select
every 1000/100 = 10th member of the sampling frame.
Systematic sampling is often more convenient than simple random sampling, and it
is easy to administer. However, it may also lead to bias, for example if there are
underlying patterns in the order of the individuals in the sampling frame, such that
the sampling technique coincides with the periodicity of the underlying pattern. As
a hypothetical example, if a group of students were being sampled to gain their
opinions on college facilities, but the Student Record Department’s central list of
all students was arranged such that the sex of students alternated between male and
female, choosing an even interval (e.g. every 20th student) would result in a sample
of all males or all females. Whilst in this example the bias is obvious and should be
easily corrected, this may not always be the case.
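A minimal sketch of systematic sampling (the frame size and sample size follow the example above; the random starting point is an assumption):

import numpy as np

population = np.arange(1000)          # sampling frame of 1000 individuals
n = 100                               # desired sample size
k = len(population) // n              # sampling interval: every 10th member
start = np.random.randint(0, k)       # random starting point within the first interval
sample = population[start::k]
print(len(sample), sample[:5])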
3. Stratified sampling
In this method, the population is first divided into subgroups (or strata) who all
share a similar characteristic. It is used when we might reasonably expect the
measurement of interest to vary between the different subgroups, and we want to
ensure representation from all the subgroups. For example, in a study of stroke
outcomes, we may stratify the population by sex, to ensure equal representation of
men and women. The study sample is then obtained by taking equal sample sizes
from each stratum. In stratified sampling, it may also be appropriate to choose non-
equal sample sizes from each stratum. For example, in a study of the health
outcomes of nursing staff in a county, if there are three hospitals each with
different numbers of nursing staff (hospital A has 500 nurses, hospital B has 1000
and hospital C has 2000), then it would be appropriate to choose the sample
numbers from each hospital proportionally (e.g. 10 from hospital A, 20 from
hospital B and 40 from hospital C). This ensures a more realistic and accurate
estimation of the health outcomes of nurses across the county, whereas simple
random sampling would over-represent nurses from hospitals A and B. The fact
that the sample was stratified should be taken into account at the analysis stage.
Stratified sampling improves the accuracy and representativeness of the results by
reducing sampling bias. However, it requires knowledge of the appropriate
characteristics of the sampling frame (the details of which are not always
available), and it can be difficult to decide which characteristic(s) to stratify by.
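A minimal sketch of proportional stratified sampling with pandas (the hospital frame and the 2% sampling fraction mirror the example above; the column names are assumptions, and groupby(...).sample requires pandas 1.1 or later):

import pandas as pd

frame = pd.DataFrame({
    "hospital": ["A"] * 500 + ["B"] * 1000 + ["C"] * 2000,
    "nurse_id": range(3500),
})

# Draw 2% from each stratum, giving roughly 10, 20 and 40 nurses respectively.
sample = frame.groupby("hospital").sample(frac=0.02, random_state=0)
print(sample["hospital"].value_counts())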
4. Clustered sampling
In a clustered sample, subgroups of the population are used as the sampling unit,
rather than individuals. The population is divided into subgroups, known as
clusters, which are randomly selected to be included in the study. Clusters are
usually already defined, for example individual GP practices or towns could be
identified as clusters. In single-stage cluster sampling, all members of the chosen
clusters are then included in the study. In two-stage cluster sampling, a selection of
individuals from each cluster is then randomly selected for inclusion. Clustering
should be taken into account in the analysis. The General Household survey, which
is undertaken annually in England, is a good example of a (one-stage) cluster
sample. All members of the selected households (clusters) are included in the
survey.
Cluster sampling can be more efficient than simple random sampling, especially
where a study takes place over a wide geographical region. For instance, it is easier
to contact lots of individuals in a few GP practices than a few individuals in many
different GP practices. Disadvantages include an increased risk of bias, if the
chosen clusters are not representative of the population, resulting in an increased
sampling error.
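A minimal sketch of single-stage cluster sampling (the practices and the number of selected clusters are hypothetical):

import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "practice": np.repeat([f"practice_{i}" for i in range(20)], 50),   # 20 clusters of 50 people
    "person_id": range(1000),
})

rng = np.random.default_rng(0)
chosen = rng.choice(frame["practice"].unique(), size=4, replace=False)   # randomly select clusters
cluster_sample = frame[frame["practice"].isin(chosen)]                   # include everyone in them
print(cluster_sample["practice"].value_counts())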
Non-Probability Sampling Methods
1. Convenience sampling
Convenience sampling is perhaps the easiest method of sampling, because
participants are selected based on availability and willingness to take part. Useful
results can be obtained, but the results are prone to significant bias, because those
who volunteer to take part may be different from those who choose not to
(volunteer bias), and the sample may not be representative of other characteristics,
such as age or sex. Note: volunteer bias is a risk of all non-probability sampling
methods.
2. Quota sampling
This method of sampling is often used by market researchers. Interviewers are
given a quota of subjects of a specified type to attempt to recruit. For example, an
interviewer might be told to go out and select 20 adult men, 20 adult women, 10
teenage girls and 10 teenage boys so that they could interview them about their
television viewing. Ideally the quotas chosen would proportionally represent the
characteristics of the underlying population.
Whilst this has the advantage of being relatively straightforward and potentially
representative, the chosen sample may not be representative of other characteristics
that weren’t considered (a consequence of the non-random nature of sampling).
3. Judgement (or Purposive) Sampling
Also known as selective, or subjective, sampling, this technique relies on the
judgement of the researcher when choosing who to ask to participate. Researchers
may thus implicitly choose a “representative” sample to suit their needs, or
specifically approach individuals with certain characteristics. This approach is
often used by the media when canvassing the public for opinions and in qualitative
research.
Judgement sampling has the advantage of being time- and cost-effective to perform
whilst resulting in a range of responses (particularly useful in qualitative research).
However, in addition to volunteer bias, it is also prone to errors of judgement by
the researcher and the findings, whilst being potentially broad, will not necessarily
be representative.
4. Snowball sampling
This method is commonly used in social sciences when investigating hard-to-reach
groups. Existing subjects are asked to nominate further subjects known to them, so
the sample increases in size like a rolling snowball. For example, when carrying
out a survey of risk behaviours amongst intravenous drug users, participants may
be asked to nominate other users to be interviewed.
Snowball sampling can be effective when a sampling frame is difficult to identify.
Central Limit Theorem and Draw Inferences
Central Limit Theorem
Central Limit Theorem is one of the important concepts in Inferential Statistics.
Inferential Statistics means drawing inferences about the population from the
sample.
When we draw a random sample from the population and calculate the mean of the
sample, it will likely differ from the population mean due to sampling fluctuation.
The variation between a sample statistic and population parameter is known
as sampling error.
Due to this sampling error, it may be difficult to draw inferences about a population
parameter from sample statistics. The Central Limit Theorem helps us draw such
inferences about the population parameter from a sample statistic despite this error.
Let us learn about the central limit theorem in detail in this section.
Statistic → A value that describes a characteristic of the sample.
Parameter → A value that describes a characteristic of the population (the quantity we
infer from the statistic).
Sampling distribution
Sampling → It means drawing representative samples from the population.
Sampling Distribution → A sampling distribution is the distribution of all
possible values of a sample statistic for a given sample drawn from a population.
The sampling distribution of the mean is the distribution of sample means for samples
of a given size selected from the population.
Steps in Sampling Distribution:
      We will draw N random samples (s1, s2, …, sN) from the population.
      We will calculate the mean of each sample (ms1, ms2, …, msN).
      Then we will calculate the mean of these sample means (ms):
ms = (ms1 + ms2 + … + msN) / N
where N is the number of samples drawn (n, by contrast, denotes the sample size).
[Now we have calculated the mean of the sample means. Next, we have to
calculate the standard deviation of the sample means.]
Standard Error
Variability of sample means in the sampling distribution is the Standard Error. The
standard deviation of the sampling distribution is known as the Standard Error of
the mean.
Standard Error of mean = Standard deviation of population/sqrt(n)
n- sample size
[Standard error decreases when sample size increases. So large samples help in
reducing standard error]
Sampling Distribution Properties
   1. The mean of the sample means is equal to the population mean.
[When we draw many random samples from the population, the sampling variations
cancel out, so the mean of the sample means equals the population mean.]
   2. The standard deviation of the sampling distribution is equal to the standard
      deviation of the population divided by the square root of the sample size.
Central Limit Theorem
Central Limit Theorem states that even if the population distribution is not normal,
the sampling distribution will be normally distributed if we take sufficiently large
samples from the population.[ For most distributions, n>30 will give a sampling
distribution which is nearly normal]
Sampling distribution properties also hold good for the central limit theorem.
Confidence Interval
Using a confidence interval, we can say that the population mean is expected to lie
within a certain range.
A confidence interval is the range of values that the population parameter can take.
Confidence Interval for the Population Mean = Sample Mean ± Z × Standard Error of the mean
Z → Z score associated with the chosen confidence level.
Commonly used confidence levels:
99% Confidence Level → Z score = 2.58
95% Confidence Level → Z score = 1.96
90% Confidence Level → Z score = 1.65
Sampling distribution using Python and Seaborn
Example:
   1. Let’s say we have to calculate the mean of marks of all students in a school.
No of students = 1000.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

population1 = np.random.randint(0, 100, 1000)
2. Checking the Population distribution
sns.distplot(population1, hist=False)   # in newer seaborn versions use sns.kdeplot(population1)
(Figure: population distribution)
The population is not normally distributed.
3. We will draw random samples of size less than 30 from the population.
sample_means1=[]
for i in range(0,25):
 sample=np.random.choice(population1,size=20)
 sample_means1.append(np.mean(sample))
sample_m1=np.array(sample_means1)
4. Sampling distribution
sns.distplot(sample_means1,hist=False)
plt.title("Sampling distribution of sample mean")
plt.axvline(sample_m1.mean(), color='green', linestyle='--')
plt.xlabel("Sample Mean")
The sampling distribution is close to a normal distribution
5. Let’s check the sampling mean and standard error.
print ("Sampling mean: ",round(sample_m1.mean(),2))
print ("Standard Error: ",round(sample_m1.std(),2))
#Output:
Sampling mean: 47.96
Standard Error: 6.39
Standard Error = 6.39. Let’s increase the sample size and check whether the
standard error decreases.
6. Take sample size greater than 30 and calculate sampling mean
sample_means2=[]
for i in range(0,100):
 sample=np.random.choice(population1,size=50)
 sample_means2.append(np.mean(sample))
sample_m2=np.array(sample_means2)
7. Sampling distribution
sns.distplot(sample_means2,hist=False)
plt.title("Sampling distribution of sample mean")
plt.axvline(sample_m2.mean(), color='green', linestyle='--')
plt.xlabel("Sample Mean")
(Figure: sampling distribution of the sample mean)
The sampling distribution is normal now.
8. Calculate sampling mean and standard error
print ("Sampling mean: ",round(sample_m2.mean(),2))
print ("Standard Error: ",round(sample_m2.std(),2))
# Output:
Sampling mean: 48.17
Standard Error: 3.89
After increasing the sample size, the standard error decreases. Now the Standard
Error is 3.89.
9. Let’s verify our population mean
print ("Population Mean: ",round(population1.mean(),2))
#Output:
Population Mean: 48.03
We have calculated the sampling mean as 48.17 which is approximately equal to
the population mean 48.03
10. Calculating Confidence Interval at 99% confidence level.
Lower_limit=sample_m2.mean()- (2.58 * (sample_m2.std()))
print (round(Lower_limit,2))
#Output: 38.14
Upper_limit=sample_m2.mean()+ (2.58 * (sample_m2.std()))
print (round(Upper_limit, 2))
#Output: 58.19
Confidence Interval = 38.14 – 58.19
Hypothesis Testing
 Hypothesis testing compares two opposite ideas about a group of people or things
 and uses data from a small part of that group (a sample) to decide which idea is
 more likely true. We collect and study the sample data to check if the claim is
 correct.
 For example, if a company says its website gets 50 visitors each day on average,
 we use hypothesis testing to look at past visitor data and see if this claim is true
 or if the actual number is different.
 Defining Hypotheses
 Null Hypothesis (H₀): The starting assumption. For example, "The average
     visits are 50."
 Alternative Hypothesis (H₁): The opposite, saying there is a difference. For
     example, "The average visits are not 50."
 Key Terms of Hypothesis Testing
To understand hypothesis testing, we first need to understand the key terms, which
are given below:
 Significance Level (α): How sure we want to be before saying the claim is
     false. Usually, we choose 0.05 (5%).
 p-value: The chance of seeing the data if the null hypothesis is true. If this is
     less than α, we say the claim is probably false.
 Test Statistic: A number that helps us decide if the data supports or rejects the
     claim.
 Critical Value: The cutoff point to compare with the test statistic.
 Degrees of freedom: A number that depends on the data size and helps find
     the critical value.
 Types of Hypothesis Testing
Hypothesis testing basically involves two types of tests:
 1. One-Tailed Test
 Used when we expect a change in only one direction either up or down, but not
 both. For example, if testing whether a new algorithm improves accuracy, we
 only check if accuracy increases.
 There are two types of one-tailed test:
 Left-Tailed (Left-Sided) Test: Checks if the value is less than expected.
    Example: H0: μ ≥ 50 and H1: μ < 50
 Right-Tailed (Right-Sided) Test: Checks if the value is greater than expected.
    Example: H0: μ ≤ 50 and H1: μ > 50
 2. Two-Tailed Test
 Used when we want to see if there is a difference in either direction, higher or
 lower. For example, testing whether a marketing strategy affects sales, whether they
 go up or down.
 Example: H0: μ = 50 and H1: μ ≠ 50
 What are Type 1 and Type 2 errors in Hypothesis Testing?
 In hypothesis testing, Type I and Type II errors are two possible errors that can
 happen when we draw conclusions about a population based on a sample of
 data. These errors are associated with the decisions we make regarding the null
 hypothesis and the alternative hypothesis.
 Type I error: When we reject the null hypothesis although it was true. A Type I
    error is denoted by alpha (α).
 Type II error: When we accept the null hypothesis although it is false. A Type II
    error is denoted by beta (β).
                                Null Hypothesis is True            Null Hypothesis is False
 Accept Null Hypothesis         Correct Decision                   Type II Error (False Negative)
 Reject Null Hypothesis         Type I Error (False Positive)      Correct Decision
How does Hypothesis Testing work?
The working of hypothesis testing involves the following steps:
 Step 1: Define Hypotheses:
 Null hypothesis (H₀): Assumes no effect or difference.
 Alternative hypothesis (H₁): Assumes there is an effect or difference.
 Example: Test if a new algorithm improves user engagement.
 Note: In this we assume that our data is normally distributed.
 Step 2: Choose significance level
 We select a significance level (usually 0.05). This is the maximum chance we
 accept of wrongly rejecting the null hypothesis (Type I error). It also sets the
 confidence needed to accept results.
 Step 3: Collect and Analyze data.
 Now we gather data; this could come from user observations or an experiment.
   Once collected, we analyze the data using appropriate statistical methods to
   calculate the test statistic.
 Example: We collect data on user engagement before and after implementing
   the algorithm. We can also find the mean engagement scores for each group.
 Step 4: Calculate Test Statistic
 The test statistic measures how much the sample data deviates from what we would
 expect if the null hypothesis were true. Different tests use different statistics:
 Z-test: Used when population variance is known and sample size is large.
 T-test: Used when sample size is small or population variance unknown.
 Chi-square test: Used for categorical data to compare observed vs. expected
   counts.
 Step 5: Make a Decision
 We compare the test statistic to a critical value from a statistical table or use
 the p-value:
 Using Critical Value:
          o If test statistic > critical value → reject H0.
          o If test statistic ≤ critical value → fail to reject H0.
 Using P-value:
          o If p-value ≤ α → reject H0.
          o If p-value > α → fail to reject H0.
 Example: If p-value is 0.03 and α is 0.05, we reject the null hypothesis because
 0.03 < 0.05.
 Step 6: Interpret the Results
 Based on the decision, we conclude whether there is enough evidence to support
 the alternative hypothesis or if we should keep the null hypothesis.
Real life Examples of Hypothesis Testing
 A pharmaceutical company tests a new drug to see if it lowers blood pressure in
 patients.
 Data:
 Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
 After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114
 Step 1: Define the Hypothesis
 Null Hypothesis (H0): The new drug has no effect on blood pressure.
 Alternate Hypothesis (H1): The new drug has an effect on blood pressure.
 Step 2: Define the Significance level
 Usually 0.05, meaning we reject the claim only if there is less than a 5% chance of
 seeing such results purely by random chance.
 Step 3: Compute the test statistic
 Using a paired t-test, we analyze the data to obtain a test statistic and a p-value. The
 test statistic is calculated based on the differences between blood pressure
 measurements before and after treatment.
 t = m / (s / √n)
 Where:
 m = mean of the differences, where each difference is di = Xafter,i − Xbefore,i
 s = standard deviation of the differences di
 n = sample size
 Here m = −3.9, s ≈ 1.37 and n = 10, which gives a t-statistic of about −9 from the
 paired t-test formula.
Step 4: Find the p-value
With degrees of freedom = 9, p-value ≈ 0.0000085 (very small).
Step 5: Result
Since the p-value (8.538051223166285e-06) is less than the significance level
(0.05) the researchers reject the null hypothesis. There is statistically significant
evidence that the average blood pressure before and after treatment with the new
drug is different.
The T-statistic of about -9 and a very small p-value provide strong evidence to
reject the null hypothesis at the 0.05 level. This means the new drug significantly
lowers blood pressure. The negative T-statistic shows the average blood pressure
after treatment is lower than before.
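The same paired comparison can be reproduced with scipy; the following is a minimal sketch that assumes scipy is installed and reuses the data above.

from scipy import stats

before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

t_stat, p_value = stats.ttest_rel(after, before)   # paired (related-samples) t-test
print(round(t_stat, 2), p_value)                   # about -9.0 and 8.5e-06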
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an important step in data science: it involves
visualizing data to understand its main features, find patterns and discover how
different parts of the data are connected. In this section, we will look at EDA in
more detail.
Why Exploratory Data Analysis is Important?
Exploratory Data Analysis (EDA) is important for several reasons in the context
of data science and statistical modeling. Here are some of the key reasons:
1. It helps to understand the dataset by showing how many features it has, what
   type of data each feature contains and how the data is distributed.
2. It helps to identify hidden patterns and relationships between different data
   points, which helps us in model building.
3. Allows to identify errors or unusual data points (outliers) that could affect our
   results.
4. The insights gained from EDA help us identify the most important features for
   building models and guide us on how to prepare them for better performance.
5. Understanding the data helps us choose the best modeling techniques and adjust
   them for better results.
Types of Exploratory Data Analysis
There are various types of EDA based on the nature of the data. Depending on the
number of variables we are analyzing, we can divide EDA into three types:
 1. Univariate Analysis
 Univariate analysis focuses on studying one variable to understand its
 characteristics. It helps to describe data and find patterns within a single feature.
 Various common methods are used: histograms to show the data distribution, box
 plots to detect outliers and understand the spread, and bar charts for categorical
 data. Summary statistics like the mean, median, mode, variance and standard
 deviation help in describing the central tendency and spread of the data.
 2. Bivariate Analysis
 Bivariate Analysis focuses on identifying relationship between two variables to
 find connections, correlations and dependencies. It helps to understand how two
 variables interact with each other. Some key techniques include:
 Scatter plots which visualize the relationship between two continuous
    variables.
 The correlation coefficient measures how strongly two variables are related;
    Pearson's correlation is commonly used for linear relationships.
 Cross-tabulation or contingency tables show the frequency distribution of two
    categorical variables and help us understand their relationship.
 Line graphs are useful for comparing two variables over time in time series
    data to identify trends or patterns.
 Covariance measures how two variables change together; it is usually paired with
    the correlation coefficient for a clearer, more standardized understanding of
    the relationship.
 3. Multivariate Analysis
 Multivariate analysis identifies relationships among more than two variables in the
 dataset and aims to understand how the variables interact with one another, which is
 important for statistical modeling. It includes techniques like the following (a short
 sketch follows this list):
 Pair plots which shows the relationships between multiple variables at once
    and helps in understanding how they interact.
 Another technique is Principal Component Analysis (PCA) which reduces
    the complexity of large datasets by simplifying them while keeping the most
    important information.
 Spatial Analysis is used for geographical data by using maps and spatial
    plotting to understand the geographical distribution of variables.
 Time Series Analysis is used for datasets that involve time-based data and it
    involves understanding and modeling patterns and trends over time. Common
    techniques include line plots, autocorrelation analysis, moving averages
    and ARIMA models.
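A brief sketch of two of these multivariate techniques on the Iris data used later in this unit (it assumes the same Iris.csv file, with its Species and Id columns, plus seaborn and scikit-learn):

import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA

df = pd.read_csv("Iris.csv")
sns.pairplot(df.drop(columns=["Id"], errors="ignore"), hue="Species")   # pairwise relationships

features = df.select_dtypes("number").drop(columns=["Id"], errors="ignore")
pca = PCA(n_components=2)
components = pca.fit_transform(features)
print(pca.explained_variance_ratio_)    # share of variance kept by each component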
 Steps for Performing Exploratory Data Analysis
 It involves a series of steps to help us understand the data, uncover patterns,
 identify anomalies, test hypotheses and ensure the data is clean and ready for
 further analysis. It can be done using different tools like:
 In Python, Pandas is used to clean, filter and manipulate data. Matplotlib helps
     to create basic visualizations while Seaborn makes more attractive plots. For
     interactive visualizations Plotly is a good choice.
 In R, ggplot2 is used for creating complex plots, dplyr helps with data
     manipulation and tidyr makes sure our data is organized and easy to work
     with.
 The main steps of EDA include:
Step 1: Understanding the Problem and the Data
The first step in any data analysis project is to fully understand the problem we're
solving and the data we have. This includes asking key questions like:
1. What is the business goal or research question?
2. What are the variables in the data and what do they represent?
3. What types of data (numerical, categorical, text, etc.) do you have?
4. Are there any known data quality issues or limitations?
5. Are there any domain-specific concerns or restrictions?
By understanding the problem and the data, we can plan our analysis more
effectively, avoid incorrect assumptions and ensure accurate conclusions.
Step 2: Importing and Inspecting the Data
After understanding the problem and the data, the next step is to import the data into
our analysis environment, such as Python, R or a spreadsheet tool. It’s important
to inspect the data to gain a basic understanding of its structure, variable types and any
potential issues. Here’s what we can do:
1. Load the data into our environment carefully to avoid errors or truncations.
2. Check the size of the data like number of rows and columns to understand its
   complexity.
3. Check for missing values and see how they are distributed across variables
   since missing data can impact the quality of your analysis.
4. Identify data types for each variable like numerical, categorical, etc which will
   help in the next steps of data manipulation and analysis.
5. Look for errors or inconsistencies such as invalid values, mismatched units or
   outliers which could show major issues with the data.
By completing these tasks we'll be prepared to clean and analyze the data more
effectively.
Step 3: Handling Missing Data
Missing data is common in many datasets and can affect the quality of our
analysis. During EDA it's important to identify and handle missing data properly
to avoid biased or misleading results. Here’s how to handle it:
1. Understand the patterns and possible causes of missing data. Is it missing
   completely at random (MCAR), missing at random (MAR) or missing not at
   random (MNAR)? Identifying this helps us find the best way to handle the
   missing data.
2. Decide whether to remove missing data or impute (fill in) the missing values.
   Removing data can lead to biased outcomes if the missing data isn’t MCAR.
   Filling values helps to preserve data but should be done carefully.
3. Use appropriate imputation methods like mean or median
   imputation, regression imputation or machine learning techniques
   like KNN or decision trees based on the data’s characteristics.
4. Consider the impact of missing data. Even after imputing, missing data can
   cause uncertainty and bias, so interpret the results with caution.
Properly handling missing data improves the accuracy of our analysis and
prevents misleading conclusions.
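A minimal sketch of these ideas with pandas (the small DataFrame and its column names are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["Pune", "Delhi", None, "Delhi"]})

print(df.isnull().sum())                              # count missing values per column
df["age"] = df["age"].fillna(df["age"].median())      # impute a numeric column with its median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # impute a categorical column with its mode
df = df.dropna()                                      # or drop any rows that remain incomplete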
Step 4: Exploring Data Characteristics
After addressing missing data, we explore the characteristics of our data by checking
the distribution, central tendency and variability of our variables and identifying
outliers or anomalies. This helps in selecting appropriate analysis methods and
finding major data issues. We should calculate summary statistics like the mean,
median, mode, standard deviation, skewness and kurtosis for numerical variables.
These provide an overview of the data’s distribution and help us identify any
irregular patterns or issues.
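A minimal sketch of these summary statistics with pandas (using the Iris.csv file that appears later in this unit):

import pandas as pd

df = pd.read_csv("Iris.csv")
numeric = df.select_dtypes("number")

print(numeric.describe())      # count, mean, std, min, quartiles, max
print(numeric.skew())          # skewness of each numeric column
print(numeric.kurtosis())      # kurtosis of each numeric column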
Step 5: Performing Data Transformation
Data transformation is an important step in EDA as it prepares our data for
accurate analysis and modeling. Depending on our data's characteristics and
analysis needs, we may need to transform it to ensure it's in the right format.
Common transformation techniques include the following (a brief sketch follows the list):
1. Scaling or normalizing numerical variables like min-max
   scaling or standardization.
2. Encoding categorical variables for machine learning like one-hot
   encoding or label encoding.
3. Applying mathematical transformations like logarithmic or square-root
   transformations to correct skewness or non-linearity.
4. Creating new variables from existing ones like calculating ratios or combining
   variables.
5. Aggregating or grouping data based on specific variables or conditions.
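A minimal sketch of a few of these transformations (the toy DataFrame, its column names and the scaler choice are assumptions; scikit-learn is used for scaling):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"income": [30000, 52000, 75000, 1200000],
                   "city": ["Pune", "Delhi", "Pune", "Mumbai"]})

df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()  # min-max scaling
df["income_log"] = np.log1p(df["income"])                                   # log transform to reduce skew
df = pd.get_dummies(df, columns=["city"])                                   # one-hot encoding
print(df.head())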
Step 6: Visualizing Relationship of Data
Visualization helps to find relationships between variables and identify patterns
or trends that may not be seen from summary statistics alone.
1. For categorical variables, create frequency tables, bar plots and pie charts to
   understand the distribution of categories and identify imbalances or unusual
   patterns.
2. For numerical variables generate histograms, box plots, violin plots and
   density plots to visualize distribution, shape, spread and potential outliers.
3. To find relationships between variables use scatter plots, correlation matrices
   or statistical tests like Pearson’s correlation coefficient or Spearman’s rank
   correlation.
Step 7: Handling Outliers
Outliers are data points that differ markedly from the rest of the data and may be
caused by errors in measurement or data entry. Detecting and handling outliers is
important because they can skew our analysis and affect model performance. We can
identify outliers using methods like the interquartile range (IQR),
Z-scores or domain-specific rules. Once identified, they can be removed or adjusted
depending on the context. Properly managing outliers keeps our analysis
accurate and reliable.
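A minimal sketch of the IQR rule with pandas (the series values are purely illustrative):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])           # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]           # flag points outside the fences
print(outliers)                                   # 95 is flagged
cleaned = s[(s >= lower) & (s <= upper)]          # one option: drop the flagged points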
Step 8: Communicate Findings and Insights
The final step in EDA is to communicate our findings clearly. This involves
summarizing the analysis, pointing out key discoveries and presenting our results
in a clear way.
Data Analysis Using Python
Data Analysis is the technique of collecting, transforming and organizing data to
make future predictions and informed data-driven decisions. It also helps to find
possible solutions for a business problem. In this section, we will discuss how to
do data analysis with Python, i.e. analyzing numerical data with NumPy, tabular
data with Pandas and visualizing data with Matplotlib.
Analyzing Numerical Data with NumPy
NumPy is an array processing package in Python and provides a high-
performance multidimensional array object and tools for working with these
arrays. It is the fundamental package for scientific computing with Python.
Arrays in NumPy
A NumPy array is a table of elements (usually numbers), all of the same type,
indexed by a tuple of non-negative integers. In NumPy the number of dimensions of
the array is called the rank of the array. A tuple of integers giving the size of the
array along each dimension is known as the shape of the array.
Creating NumPy Array
NumPy arrays can be created in multiple ways with various ranks. It can also be
created with the use of different data types like lists, tuples, etc. The type of the
resultant array is deduced from the type of elements in the sequences. NumPy
offers several functions to create arrays with initial placeholder content. This
minimizes the necessity of growing arrays.
import numpy as np

# np.empty allocates memory without initializing it, so the values
# printed below are arbitrary and will differ from run to run.
a = np.empty([2, 2], dtype=int)
print("\nMatrix a : \n", a)

b = np.empty(2, dtype=int)
print("Matrix b : \n", b)
Output
Matrix a :
[[   94655291709206                 0]
[3543826506195694713 34181816989462323]]
Matrix b :
[-4611686018427387904            206158462975]
Arithmetic Operations
1. Addition:
import numpy as np

a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])

# Element-wise addition with the + operator and with np.add
add_ans = a + b
print(add_ans)
add_ans = np.add(a, b)
print(add_ans)

# Adding more than two arrays
c = np.array([1, 2, 3, 4])
add_ans = a + b + c
print(add_ans)
# Note: in np.add(a, b, c) the third argument is the "out" array,
# so this still computes a + b and stores the result in c.
add_ans = np.add(a, b, c)
print(add_ans)
Output
[ 7 77 23 130]
[ 7 77 23 130]
[ 8 79 26 134]
[ 7 77 23 130]
NumPy Array Indexing
Indexing can be done in NumPy by using an array as an index. In the case of the
slice, a view or shallow copy of the array is returned but in the index array, a
copy of the original array is returned. Numpy arrays can be indexed with other
arrays or any other sequence with the exception of tuples. The last element is
indexed by -1 second last by -2 and so on.
import numpy as np
a = np.arange(10, 1, -2)
print("\n A sequential array with a negative step: \n",a)
newarr = a[np.array([3, 1, 2 ])]
print("\n Elements at these indices are:\n",newarr)
Output
 A sequential array with a negative step:
 [10 8 6 4 2]
 Elements at these indices are:
 [4 8 6]
 NumPy Array Slicing
 Consider the syntax x[obj] where x is the array and obj is the index. The slice
 object is the index in the case of basic slicing. Basic slicing occurs when obj is:
 a slice object that is of the form start: stop: step
 an integer
 or a tuple of slice objects and integers
 All arrays generated by basic slicing are always views of the original array.
import numpy as np
a = np.arange(20)
print("\n Array is:\n ",a)
print("\n a[-8:17:1] = ",a[-8:17:1])
print("\n a[10:] = ",a[10:])
Output
Array is:
 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
a[-8:17:1] = [12 13 14 15 16]
a[10:] = [10 11 12 13 14 15 16 17 18 19]
Ellipsis can also be used along with basic slicing. Ellipsis (…) expands to the number of :
objects needed to make a selection tuple of the same length as the number of dimensions of
the array.
import numpy as np
b = np.array([[[1, 2, 3],[4, 5, 6]],
       [[7, 8, 9],[10, 11, 12]]])
print(b[...,1])
Output
[[ 2 5]
[ 8 11]]
NumPy Array Broadcasting
The term broadcasting refers to how NumPy treats arrays with different
dimensions during arithmetic operations, which leads to certain constraints: the
smaller array is broadcast across the larger array so that they have compatible
shapes.
Let’s assume that we have a large data set where each datum is a list of parameters. In
NumPy we have a 2-D array where each row is a datum and the number of rows
is the size of the data set. Suppose we want to apply some sort of scaling to all
these data: every parameter gets its own scaling factor, or in other words, every
parameter is multiplied by some factor.
Just to have a clear understanding, let’s count calories in foods using a macro-
nutrient breakdown. Roughly put, the caloric parts of food are made of fats (9
calories per gram), protein (4 CPG) and carbs (4 CPG). So if we list some foods
(our data) and for each food list its macro-nutrient breakdown (parameters), we
can then multiply each nutrient by its caloric value (apply scaling) to compute the
caloric breakdown of every food item.
With this transformation, we can now compute all kinds of useful information.
For example what is the total number of calories present in some food or, given a
breakdown of my dinner know how many calories did I get from protein and so
on.
Let’s see a naive way of producing this computation with Numpy:
import numpy as np

# Each row is a food; the columns are grams of fat, protein and carbs.
macros = np.array([
   [0.8, 2.9, 3.9],
   [52.4, 23.6, 36.5],
   [55.2, 31.7, 23.9],
   [14.4, 11, 4.9]
])

# Calories per gram of fat, protein and carbs (9, 4 and 4, as above).
cal_per_macro = np.array([9, 4, 4])

# Broadcasting stretches the 1-D factor array across every row of macros.
result = macros * cal_per_macro
print(result)
Output
[[  7.2  11.6  15.6]
 [471.6  94.4 146. ]
 [496.8 126.8  95.6]
 [129.6  44.   19.6]]
 Analyzing Data Using Pandas
Python Pandas is used for relational or labeled data and provides various data
structures for manipulating such data and time series. This library is built on top of
the NumPy library. It is generally imported as:
 import pandas as pd
Here, pd is an alias for Pandas. It is not necessary to import the library under this
alias; it simply saves typing every time a method or property is called. Pandas
generally provides two data structures for manipulating data. They are:
 Series
 Data frame
 Series
 Pandas Series is a one-dimensional labeled array capable of holding data of any
 type like integer, string, float, python objects, etc. The axis labels are collectively
called indexes. A Pandas Series is essentially like a column in an Excel sheet. Labels
need not be unique but must be a hashable type. The object supports both integer
and label-based indexing and provides a host of methods for performing operations
involving the index.
 It can be created using the Series() function by loading the dataset from the
 existing storage like SQL, Database, CSV Files, Excel Files, etc or from data
 structures like lists, dictionaries, etc.
 Python Pandas Creating Series
import pandas as pd
import numpy as np
ser = pd.Series(dtype="object")
print(ser)
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
Output
Series([], dtype: object)
0   g
1   e
2   e
3   k
4   s
dtype: object
Dataframe
Pandas DataFrame is a two-dimensional, size-mutable structure with labeled axes (rows
and columns). In it, data is aligned in a tabular format in rows and columns.
A Pandas DataFrame consists of three principal components: the data, the rows and the
columns.
It can be created using the DataFrame() constructor and, just like a Series, it can also
be created from different file types and data structures.
Python Pandas Creating Dataframe
import pandas as pd
df = pd.DataFrame()
print(df)
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
df = pd.DataFrame(lst, columns=['Words'])
print(df)
Output
Empty DataFrame
Columns: []
Index: []
    Words
0 Geeks
1    For
2 Geeks
3    is
4 portal
5    for
6 Geeks
Creating Dataframe from CSV
We can create a dataframe from the CSV files using the read_csv() function.
import pandas as pd
df = pd.read_csv("Iris.csv")
df.head()
Output:
(First five rows of the dataframe)
Filtering DataFrame
Pandas dataframe.filter() function is used to Subset rows or columns of dataframe
according to labels in the specified index. Note that this routine does not filter a
dataframe on its contents. The filter is applied to the labels of the index.
Python Pandas Filter Dataframe
import pandas as pd
df = pd.read_csv("Iris.csv")
df.filter(["Species", "SepalLengthCm", "SepalWidthCm"]).head()
Output:
(Filtered columns of the dataset)
Sorting DataFrame
In order to sort the data frame in pandas, the function sort_values() is used. Pandas
sort_values() can sort the data frame in Ascending or Descending order. Python
Pandas Sorting Dataframe in Ascending Order
import pandas as pd

# Iris.csv already contains a header row with these column names,
# so we read it directly instead of passing header=None.
df = pd.read_csv("Iris.csv")
df_sorted = df.sort_values(by='SepalLengthCm', ascending=True)
print(df_sorted.head())
Output:
(Dataset sorted by the SepalLengthCm column)
 Pandas GroupBy
 Groupby is a pretty simple concept. We can create a grouping of categories and
 apply a function to the categories. In real data science projects, you’ll be dealing
 with large amounts of data and trying things over and over, so for efficiency we
 use the Groupby concept. Groupby mainly refers to a process involving one or
 more of the following steps:
 Splitting: We split the data into groups by applying some condition on the dataset.
 Applying: We apply a function to each group independently.
 Combining: We combine the per-group results into a single data structure.
 For example, we can group the rows of a DataFrame by the unique values in a column
 (say, a 'Team' column), as sketched below.
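A minimal sketch of this split-apply-combine idea (the Team column and its values are hypothetical):

import pandas as pd

df = pd.DataFrame({"Team": ["A", "B", "A", "B", "A"],
                   "Points": [10, 8, 12, 15, 9]})

grouped = df.groupby("Team")            # split the rows into groups by the unique Team values
print(grouped.groups)                   # mapping of each team to its row labels
print(grouped["Points"].mean())         # apply a function per group and combine the results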
    Pandas Aggregation
Aggregation is a process in which we compute a summary statistic about each
group. The aggregated function returns a single aggregated value for each group.
After splitting data into groups using groupby function, several aggregation
operations can be performed on the grouped data.
import pandas as pd
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
      'Age': [27, 24, 22, 32,
           33, 36, 27, 32],
      'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
              'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
      'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                  'B.Tech', 'B.com', 'Msc', 'MA']}
df = pd.DataFrame(data1)
grp1 = df.groupby('Name')
result = grp1['Age'].aggregate('sum')
print(result)
Output:
Name
Abhi      32
Anuj      60
Gaurav    33
Jai       49
Princi    59
Name: Age, dtype: int64
Merging DataFrame
When we need to combine very large DataFrames, joins serve as a powerful way
to perform these operations swiftly. Joins can only be done on two DataFrames at
a time, denoted as left and right tables. The key is the common column that the
two DataFrames will be joined on. It’s a good practice to use keys that have
unique values throughout the column to avoid unintended duplication of row
values. Pandas provide a single function, merge(), as the entry point for all
standard database join operations between DataFrame objects.
There are four basic ways to handle the join (inner, left, right and outer),
depending on which rows must retain their data.
import pandas as pd
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
      'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
     'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
      'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
     'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df)
print(df1)   # display(df, df1) works only inside Jupyter/IPython notebooks
res = pd.merge(df, df1, on='key')
print(res)
Output:
Joining DataFrame
In order to join dataframes, we use the .join() function. This function combines the
columns of two potentially differently indexed DataFrames into a single result
DataFrame.
import pandas as pd
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
     'Age':[27, 24, 22, 32]}
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
     'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3'])
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])
res = df.join(df1)
print(res)
Output:
Bar chart
A bar plot or bar chart is a graph that represents categories of data with
rectangular bars whose lengths or heights are proportional to the values they
represent. Bar plots can be plotted horizontally or vertically. A bar chart
describes comparisons between discrete categories. It can be created using
the bar() method.
Here we will use the iris dataset only.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.bar(df['Species'], df['SepalLengthCm'])
plt.title("Iris Dataset")
plt.legend(["bar"])
plt.show()
Output:
(Bar chart using the matplotlib library)
Histograms
A histogram is basically used to represent data in the form of some groups. It is a
type of bar plot where the X-axis represents the bin ranges while the Y-axis gives
information about frequency. To create a histogram the first step is to create a bin
of the ranges, then distribute the whole range of the values into a series of
intervals and count the values which fall into each of the intervals. Bins are
clearly identified as consecutive, non-overlapping intervals of variables.
The hist() function is used to compute and create a histogram of x.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.hist(df["SepalLengthCm"])
plt.title("Histogram")
plt.legend(["SepalLengthCm"])
plt.show()
Output:
(Histogram using the matplotlib library)
Scatter Plot
Scatter plots are used to observe relationships between variables and use dots to
represent the relationship between them. The scatter() method in the matplotlib
library is used to draw a scatter plot.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.scatter(df["Species"], df["SepalLengthCm"])
plt.title("Scatter Plot")
plt.legend(["SepalLengthCm"])
plt.show()
Output:
(Scatter plot using the matplotlib library)
 Box Plot
A boxplot is also known as a box and whisker plot. It is a very good visual
representation when it comes to examining the data distribution. It clearly plots the
median, the quartiles and the outliers. Understanding the data distribution is
another important factor which leads to better model building. If the data has outliers,
a box plot is a recommended way to identify them and take necessary action. The
box and whiskers chart shows how the data is spread out. Five pieces of information
are generally included in the chart:
 The minimum is shown at the far left of the chart, at the end of the left whisker.
 The first quartile, Q1, is the left edge of the box.
 The median is shown as a line in the center of the box.
 The third quartile, Q3, is the right edge of the box.
 The maximum is shown at the far right of the chart, at the end of the right whisker.
(Figure: representation of a box plot, illustrating the interquartile range)
Python Matplotlib Box Plot
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.boxplot(df["SepalWidthCm"])
plt.title("Box Plot")
plt.legend(["SepalWidthCm"])
plt.show()
Output:
(Boxplot using the matplotlib library)
Correlation Heatmaps
A 2-D heatmap is a data visualization tool that represents the magnitude of a
phenomenon with colors. A correlation heatmap is a heatmap that shows a 2D
correlation matrix between two discrete dimensions, using colored cells, typically on
a monochromatic scale, to represent the values. The values of the first dimension
appear as the rows of the table while the values of the second dimension appear as
the columns. The color of each cell is proportional to the correlation it represents.
This makes correlation heatmaps ideal for data analysis since they make patterns
easily readable and highlight differences and variation in the data. A correlation
heatmap, like a regular heatmap, is usually accompanied by a colorbar, making the
data easy to read and interpret.
Note: The data has to be passed through the corr() method to generate the correlation
matrix. Older pandas versions silently dropped non-numeric columns here; in recent
versions they must be excluded explicitly (for example with select_dtypes) before
calling corr().
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.imshow(df.select_dtypes("number").corr(), cmap='autumn', interpolation='nearest')
plt.title("Heat Map")
plt.show()
Output: