UNIT 3
Statistics Essentials in Machine Learning
Inferential Statistics: Learn Probability Distribution Functions,
Random Variable
A random variable is a fundamental concept in statistics that bridges the gap
between theoretical probability and real-world data. Formally, a random variable
is a function that assigns a real value to each outcome in the sample space
of a random experiment. For example, if you roll a die, you can assign a number
to each possible outcome.
 There are two basic types of random variables:
Discrete Random Variables (which take on specific, countable values).
Continuous Random Variables (which can assume any value within a given range).
 We define a random variable as a function that maps from the sample space of
 an experiment to the real numbers. Mathematically, Random Variable is
 expressed as,
                                    X: S →R
 where,
 X is Random Variable (It is usually denoted using capital letter)
 S is Sample Space
 R is Set of Real Numbers
 Types of Random Variables
 Random variables are of two types that are,
 Discrete Random Variable
 Continuous Random Variable
Discrete Random Variable
A Discrete Random Variable takes on a finite or countably infinite number of values. The
probability function associated with it is called the probability mass function (PMF).
PMF (Probability Mass Function)
If X is a discrete random variable with PMF P(X = xi) = pi, then
   0 ≤ pi ≤ 1 for every xi
   ∑ pi = 1, where the sum is taken over all possible values of x
Example: Let X take values in S = {0, 1, 2} with the following probabilities:

   xi           0      1      2
   P(X = xi)    p1     0.3    0.5

Find the value of P(X = 0).
Solution:
We know that the sum of all probabilities is equal to 1. Let P(X = 0) = p1. Then
p1 + 0.3 + 0.5 = 1
p1 = 0.2
Therefore, P(X = 0) = 0.2.
Continuous Random Variable
A Continuous Random Variable takes on an uncountably infinite number of values (any
value within a given range). The probability function associated with it is called the
PDF (Probability Density Function).
PDF (Probability Density Function)
If X is a continuous random variable and P(x < X < x + dx) = f(x) dx, then
 f(x) ≥ 0 for all x
 ∫ f(x) dx = 1 over all values of x
and f(x) is said to be the PDF of the distribution.
Example
Find the value of P(1 < X < 2) for the density function
f(x) = kx³ for 0 ≤ x ≤ 3, and f(x) = 0 otherwise.
Solution: For f(x) to be a density function, ∫ f(x) dx from 0 to 3 must equal 1, so k(3⁴/4) = 1 and k = 4/81.
Then P(1 < X < 2) = ∫ from 1 to 2 of (4/81)x³ dx = (4/81)(2⁴ − 1⁴)/4 = 15/81 = 5/27 ≈ 0.185.
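This result can be checked numerically; the following is a minimal sketch, assuming the scipy library is available.

from scipy.integrate import quad

k = 4 / 81                        # makes the density integrate to 1 over [0, 3]
f = lambda x: k * x**3            # f(x) = kx³ on 0 ≤ x ≤ 3, zero otherwise

total, _ = quad(f, 0, 3)          # should be 1.0
prob, _ = quad(f, 1, 2)           # P(1 < X < 2)
print(round(total, 4), round(prob, 4))   # 1.0 0.1852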
Sampling Methods
Probability Sampling Methods
1. Simple random sampling
In this case each individual is chosen entirely by chance and each member of the
population has an equal chance, or probability, of being selected. One way of
obtaining a random sample is to give each individual in a population a number, and
then use a table of random numbers to decide which individuals to include. For
example, if you have a sampling frame of 1000 individuals, labelled 0 to 999, use
groups of three digits from the random number table to pick your sample. So, if the
first three numbers from the random number table were 094, select the individual
labelled “94”, and so on.
As with all probability sampling methods, simple random sampling allows the
sampling error to be calculated and reduces selection bias. A specific advantage is
that it is the most straightforward method of probability sampling. A disadvantage
of simple random sampling is that you may not select enough individuals with your
characteristic of interest, especially if that characteristic is uncommon.
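A minimal sketch of simple random sampling with NumPy (the frame of 1000 labelled individuals and the sample size of 50 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)                  # seeded only to make the sketch reproducible
sampling_frame = np.arange(1000)                # individuals labelled 0 to 999
sample = rng.choice(sampling_frame, size=50, replace=False)   # every individual equally likely
print(sample[:10])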
2. Systematic sampling
Individuals are selected at regular intervals from the sampling frame. The intervals
are chosen to ensure an adequate sample size. If you need a sample size n from a
population of size x, you should select every x/nth individual for the sample. For
example, if you wanted a sample size of 100 from a population of 1000, select
every 1000/100 = 10th member of the sampling frame.
Systematic sampling is often more convenient than simple random sampling, and it
is easy to administer. However, it may also lead to bias, for example if there are
underlying patterns in the order of the individuals in the sampling frame, such that
the sampling technique coincides with the periodicity of the underlying pattern. As
a hypothetical example, if a group of students were being sampled to gain their
opinions on college facilities, but the Student Record Department’s central list of
all students was arranged such that the sex of students alternated between male and
female, choosing an even interval (e.g. every 20th student) would result in a sample
of all males or all females. Whilst in this example the bias is obvious and should be
easily corrected, this may not always be the case.
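A minimal sketch of systematic sampling (the frame size and sample size follow the example above; the random starting point is an assumption):

import numpy as np

population = np.arange(1000)          # sampling frame of 1000 individuals
n = 100                               # desired sample size
k = len(population) // n              # sampling interval: every 10th member
start = np.random.randint(0, k)       # random starting point within the first interval
sample = population[start::k]
print(len(sample), sample[:5])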
3. Stratified sampling
In this method, the population is first divided into subgroups (or strata) who all
share a similar characteristic. It is used when we might reasonably expect the
measurement of interest to vary between the different subgroups, and we want to
ensure representation from all the subgroups. For example, in a study of stroke
outcomes, we may stratify the population by sex, to ensure equal representation of
men and women. The study sample is then obtained by taking equal sample sizes
from each stratum. In stratified sampling, it may also be appropriate to choose non-
equal sample sizes from each stratum. For example, in a study of the health
outcomes of nursing staff in a county, if there are three hospitals each with
different numbers of nursing staff (hospital A has 500 nurses, hospital B has 1000
and hospital C has 2000), then it would be appropriate to choose the sample
numbers from each hospital proportionally (e.g. 10 from hospital A, 20 from
hospital B and 40 from hospital C). This ensures a more realistic and accurate
estimation of the health outcomes of nurses across the county, whereas simple
random sampling would over-represent nurses from hospitals A and B. The fact
that the sample was stratified should be taken into account at the analysis stage.
Stratified sampling improves the accuracy and representativeness of the results by
reducing sampling bias. However, it requires knowledge of the appropriate
characteristics of the sampling frame (the details of which are not always
available), and it can be difficult to decide which characteristic(s) to stratify by.
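A minimal sketch of proportional stratified sampling with pandas (the hospital frame and the 2% sampling fraction mirror the example above; the column names are assumptions, and groupby(...).sample requires pandas 1.1 or later):

import pandas as pd

frame = pd.DataFrame({
    "hospital": ["A"] * 500 + ["B"] * 1000 + ["C"] * 2000,
    "nurse_id": range(3500),
})

# Draw 2% from each stratum, giving roughly 10, 20 and 40 nurses respectively.
sample = frame.groupby("hospital").sample(frac=0.02, random_state=0)
print(sample["hospital"].value_counts())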
4. Clustered sampling
In a clustered sample, subgroups of the population are used as the sampling unit,
rather than individuals. The population is divided into subgroups, known as
clusters, which are randomly selected to be included in the study. Clusters are
usually already defined, for example individual GP practices or towns could be
identified as clusters. In single-stage cluster sampling, all members of the chosen
clusters are then included in the study. In two-stage cluster sampling, a selection of
individuals from each cluster is then randomly selected for inclusion. Clustering
should be taken into account in the analysis. The General Household survey, which
is undertaken annually in England, is a good example of a (one-stage) cluster
sample. All members of the selected households (clusters) are included in the
survey.
Cluster sampling can be more efficient than simple random sampling, especially
where a study takes place over a wide geographical region. For instance, it is easier
to contact lots of individuals in a few GP practices than a few individuals in many
different GP practices. Disadvantages include an increased risk of bias, if the
chosen clusters are not representative of the population, resulting in an increased
sampling error.
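A minimal sketch of single-stage cluster sampling (the practices and the number of selected clusters are hypothetical):

import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "practice": np.repeat([f"practice_{i}" for i in range(20)], 50),   # 20 clusters of 50 people
    "person_id": range(1000),
})

rng = np.random.default_rng(0)
chosen = rng.choice(frame["practice"].unique(), size=4, replace=False)   # randomly select clusters
cluster_sample = frame[frame["practice"].isin(chosen)]                   # include everyone in them
print(cluster_sample["practice"].value_counts())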
Non-Probability Sampling Methods
1. Convenience sampling
Convenience sampling is perhaps the easiest method of sampling, because
participants are selected based on availability and willingness to take part. Useful
results can be obtained, but the results are prone to significant bias, because those
who volunteer to take part may be different from those who choose not to
(volunteer bias), and the sample may not be representative of other characteristics,
such as age or sex. Note: volunteer bias is a risk of all non-probability sampling
methods.
2. Quota sampling
This method of sampling is often used by market researchers. Interviewers are
given a quota of subjects of a specified type to attempt to recruit. For example, an
interviewer might be told to go out and select 20 adult men, 20 adult women, 10
teenage girls and 10 teenage boys so that they could interview them about their
television viewing. Ideally the quotas chosen would proportionally represent the
characteristics of the underlying population.
Whilst this has the advantage of being relatively straightforward and potentially
representative, the chosen sample may not be representative of other characteristics
that weren’t considered (a consequence of the non-random nature of sampling).
3. Judgement (or Purposive) Sampling
Also known as selective, or subjective, sampling, this technique relies on the
judgement of the researcher when choosing who to ask to participate. Researchers
may thus implicitly choose a “representative” sample to suit their needs, or
specifically approach individuals with certain characteristics. This approach is
often used by the media when canvassing the public for opinions and in qualitative
research.
Judgement sampling has the advantage of being time- and cost-effective to perform
whilst resulting in a range of responses (particularly useful in qualitative research).
However, in addition to volunteer bias, it is also prone to errors of judgement by
the researcher and the findings, whilst being potentially broad, will not necessarily
be representative.
4. Snowball sampling
This method is commonly used in social sciences when investigating hard-to-reach
groups. Existing subjects are asked to nominate further subjects known to them, so
the sample increases in size like a rolling snowball. For example, when carrying
out a survey of risk behaviours amongst intravenous drug users, participants may
be asked to nominate other users to be interviewed.
Snowball sampling can be effective when a sampling frame is difficult to identify.
Central Limit Theorem and Draw Inferences
Central Limit Theorem
Central Limit Theorem is one of the important concepts in Inferential Statistics.
Inferential Statistics means drawing inferences about the population from the
sample.
When we draw a random sample from the population and calculate the mean of the
sample, it will likely differ from the population mean due to sampling fluctuation.
The variation between a sample statistic and population parameter is known
as sampling error.
Due to this sampling error, it may be difficult to draw inferences about a population
parameter from sample statistics. The Central Limit Theorem helps us draw such
inferences about the population parameter from a sample statistic despite this error.
Let us learn about the central limit theorem in detail in this section.
Statistic → A value that describes a characteristic of the sample.
Parameter → A value that describes a characteristic of the population (the quantity we
infer from the statistic).
Sampling distribution
Sampling → It means drawing representative samples from the population.
Sampling Distribution → A sampling distribution is the distribution of all
possible values of a sample statistic for a given sample drawn from a population.
The sampling distribution of the mean is the distribution of sample means for samples
of a given size selected from the population.
Steps in Sampling Distribution:
      We will draw N random samples (s1, s2, …, sN) from the population.
      We will calculate the mean of each sample (ms1, ms2, …, msN).
      Then we will calculate the mean of these sample means (ms):
ms = (ms1 + ms2 + … + msN) / N
where N is the number of samples drawn (n, by contrast, denotes the sample size).
[Now we have calculated the mean of the sample means. Next, we have to
calculate the standard deviation of the sample means.]
Standard Error
Variability of sample means in the sampling distribution is the Standard Error. The
standard deviation of the sampling distribution is known as the Standard Error of
the mean.
Standard Error of mean = Standard deviation of population/sqrt(n)
n- sample size
[Standard error decreases when sample size increases. So large samples help in
reducing standard error]
Sampling Distribution Properties
   1. The mean of the sample means is equal to the population mean.
[When we draw many random samples from the population, the sampling variations
cancel out, so the mean of the sample means equals the population mean.]
   2. The standard deviation of the sampling distribution is equal to the standard
      deviation of the population divided by the square root of the sample size.
Central Limit Theorem
Central Limit Theorem states that even if the population distribution is not normal,
the sampling distribution will be normally distributed if we take sufficiently large
samples from the population.[ For most distributions, n>30 will give a sampling
distribution which is nearly normal]
Sampling distribution properties also hold good for the central limit theorem.
Confidence Interval
Using a confidence interval, we can say that the population mean is expected to lie
within a certain range.
A confidence interval is the range of values that the population parameter can take.
Confidence Interval for the Population Mean = Sample Mean ± Z × Standard Error of the mean
Z → Z score associated with the chosen confidence level.
Commonly used confidence levels:
99% Confidence Level → Z score = 2.58
95% Confidence Level → Z score = 1.96
90% Confidence Level → Z score = 1.65
Sampling distribution using Python and Seaborn
Example:
   1. Let’s say we have to calculate the mean of marks of all students in a school.
No of students = 1000.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

population1 = np.random.randint(0, 100, 1000)
2. Checking the Population distribution
sns.distplot(population1, hist=False)   # in newer seaborn versions use sns.kdeplot(population1)
(Figure: population distribution)
The population is not normally distributed.
3. We will draw random samples of size less than 30 from the population.
sample_means1=[]
for i in range(0,25):
 sample=np.random.choice(population1,size=20)
 sample_means1.append(np.mean(sample))
sample_m1=np.array(sample_means1)
4. Sampling distribution
sns.distplot(sample_means1,hist=False)
plt.title("Sampling distribution of sample mean")
plt.axvline(sample_m1.mean(), color='green', linestyle='--')
plt.xlabel("Sample Mean")
The sampling distribution is close to a normal distribution
5. Let’s check the sampling mean and standard error.
print ("Sampling mean: ",round(sample_m1.mean(),2))
print ("Standard Error: ",round(sample_m1.std(),2))
#Output:
Sampling mean: 47.96
Standard Error: 6.39
Standard Error = 6.39. Let’s increase the sample size and check whether the
standard error decreases.
6. Take sample size greater than 30 and calculate sampling mean
sample_means2=[]
for i in range(0,100):
 sample=np.random.choice(population1,size=50)
 sample_means2.append(np.mean(sample))
sample_m2=np.array(sample_means2)
7. Sampling distribution
sns.distplot(sample_means2,hist=False)
plt.title("Sampling distribution of sample mean")
plt.axvline(sample_m2.mean(), color='green', linestyle='--')
plt.xlabel("Sample Mean")
(Figure: sampling distribution of the sample mean)
The sampling distribution is normal now.
8. Calculate sampling mean and standard error
print ("Sampling mean: ",round(sample_m2.mean(),2))
print ("Standard Error: ",round(sample_m2.std(),2))
# Output:
Sampling mean: 48.17
Standard Error: 3.89
After increasing the sample size, the standard error decreases. Now the Standard
Error is 3.89.
9. Let’s verify our population mean
print ("Population Mean: ",round(population1.mean(),2))
#Output:
Population Mean: 48.03
We have calculated the sampling mean as 48.17 which is approximately equal to
the population mean 48.03
10. Calculating Confidence Interval at 99% confidence level.
Lower_limit=sample_m2.mean()- (2.58 * (sample_m2.std()))
print (round(Lower_limit,2))
#Output: 38.14
Upper_limit=sample_m2.mean()+ (2.58 * (sample_m2.std()))
print (round(Upper_limit, 2))
#Output: 58.19
Confidence Interval = 38.14 – 58.19
Hypothesis Testing
 Hypothesis testing compares two opposite ideas about a group of people or things
 and uses data from a small part of that group (a sample) to decide which idea is
 more likely true. We collect and study the sample data to check if the claim is
 correct.
 For example, if a company says its website gets 50 visitors each day on average,
 we use hypothesis testing to look at past visitor data and see if this claim is true
 or if the actual number is different.
 Defining Hypotheses
 Null Hypothesis (H₀): The starting assumption. For example, "The average
     visits are 50."
 Alternative Hypothesis (H₁): The opposite, saying there is a difference. For
     example, "The average visits are not 50."
 Key Terms of Hypothesis Testing
To understand hypothesis testing, we first need to understand the key terms, which
are given below:
 Significance Level (α): How sure we want to be before saying the claim is
     false. Usually, we choose 0.05 (5%).
 p-value: The chance of seeing the data if the null hypothesis is true. If this is
     less than α, we say the claim is probably false.
 Test Statistic: A number that helps us decide if the data supports or rejects the
     claim.
 Critical Value: The cutoff point to compare with the test statistic.
 Degrees of freedom: A number that depends on the data size and helps find
     the critical value.
 Types of Hypothesis Testing
Hypothesis testing basically involves two types of tests:
 1. One-Tailed Test
 Used when we expect a change in only one direction either up or down, but not
 both. For example, if testing whether a new algorithm improves accuracy, we
 only check if accuracy increases.
 There are two types of one-tailed test:
 Left-Tailed (Left-Sided) Test: Checks if the value is less than expected.
    Example: H0: μ ≥ 50 and H1: μ < 50
 Right-Tailed (Right-Sided) Test: Checks if the value is greater than expected.
    Example: H0: μ ≤ 50 and H1: μ > 50
 2. Two-Tailed Test
 Used when we want to see if there is a difference in either direction, higher or
 lower. For example, testing whether a marketing strategy affects sales, whether they
 go up or down.
 Example: H0: μ = 50 and H1: μ ≠ 50
 What are Type 1 and Type 2 errors in Hypothesis Testing?
 In hypothesis testing, Type I and Type II errors are two possible errors that can
 happen when we draw conclusions about a population based on a sample of
 data. These errors are associated with the decisions we make regarding the null
 hypothesis and the alternative hypothesis.
 Type I error: When we reject the null hypothesis although it was true. A Type I
    error is denoted by alpha (α).
 Type II error: When we accept the null hypothesis although it is false. A Type II
    error is denoted by beta (β).
                                Null Hypothesis is True            Null Hypothesis is False
 Accept Null Hypothesis         Correct Decision                   Type II Error (False Negative)
 Reject Null Hypothesis         Type I Error (False Positive)      Correct Decision
How does Hypothesis Testing work?
The working of hypothesis testing involves the following steps:
 Step 1: Define Hypotheses:
 Null hypothesis (H₀): Assumes no effect or difference.
 Alternative hypothesis (H₁): Assumes there is an effect or difference.
 Example: Test if a new algorithm improves user engagement.
 Note: In this we assume that our data is normally distributed.
 Step 2: Choose significance level
 We select a significance level (usually 0.05). This is the maximum chance we
 accept of wrongly rejecting the null hypothesis (Type I error). It also sets the
 confidence needed to accept results.
 Step 3: Collect and Analyze data.
 Now we gather data; this could come from user observations or an experiment.
   Once collected, we analyze the data using appropriate statistical methods to
   calculate the test statistic.
 Example: We collect data on user engagement before and after implementing
   the algorithm. We can also find the mean engagement scores for each group.
 Step 4: Calculate Test Statistic
 The test statistic measures how much the sample data deviates from what we would
 expect if the null hypothesis were true. Different tests use different statistics:
 Z-test: Used when population variance is known and sample size is large.
 T-test: Used when sample size is small or population variance unknown.
 Chi-square test: Used for categorical data to compare observed vs. expected
   counts.
 Step 5: Make a Decision
 We compare the test statistic to a critical value from a statistical table or use
 the p-value:
 Using Critical Value:
          o If test statistic > critical value → reject H0.
          o If test statistic ≤ critical value → fail to reject H0.
 Using P-value:
          o If p-value ≤ α → reject H0.
          o If p-value > α → fail to reject H0.
 Example: If p-value is 0.03 and α is 0.05, we reject the null hypothesis because
 0.03 < 0.05.
 Step 6: Interpret the Results
 Based on the decision, we conclude whether there is enough evidence to support
 the alternative hypothesis or if we should keep the null hypothesis.
Real life Examples of Hypothesis Testing
 A pharmaceutical company tests a new drug to see if it lowers blood pressure in
 patients.
 Data:
 Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
 After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114
 Step 1: Define the Hypothesis
 Null Hypothesis (H0): The new drug has no effect on blood pressure.
 Alternate Hypothesis (H1): The new drug has an effect on blood pressure.
 Step 2: Define the Significance level
 Usually 0.05, meaning we reject the claim only if there is less than a 5% chance of
 seeing such results purely by random chance.
 Step 3: Compute the test statistic
 Using a paired t-test, we analyze the data to obtain a test statistic and a p-value. The
 test statistic is calculated based on the differences between blood pressure
 measurements before and after treatment.
 t = m / (s / √n)
 Where:
 m = mean of the differences, where each difference is di = Xafter,i − Xbefore,i
 s = standard deviation of the differences di
 n = sample size
 Here m = −3.9, s ≈ 1.37 and n = 10, which gives a t-statistic of about −9 from the
 paired t-test formula.
Step 4: Find the p-value
With degrees of freedom = 9, p-value ≈ 0.0000085 (very small).
Step 5: Result
Since the p-value (8.538051223166285e-06) is less than the significance level
(0.05) the researchers reject the null hypothesis. There is statistically significant
evidence that the average blood pressure before and after treatment with the new
drug is different.
The T-statistic of about -9 and a very small p-value provide strong evidence to
reject the null hypothesis at the 0.05 level. This means the new drug significantly
lowers blood pressure. The negative T-statistic shows the average blood pressure
after treatment is lower than before.
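The same paired comparison can be reproduced with scipy; the following is a minimal sketch that assumes scipy is installed and reuses the data above.

from scipy import stats

before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

t_stat, p_value = stats.ttest_rel(after, before)   # paired (related-samples) t-test
print(round(t_stat, 2), p_value)                   # about -9.0 and 8.5e-06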
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an important step in data science: it involves
visualizing data to understand its main features, find patterns and discover how
different parts of the data are connected. In this section, we will look at EDA in
more detail.
Why Exploratory Data Analysis is Important?
Exploratory Data Analysis (EDA) is important for several reasons in the context
of data science and statistical modeling. Here are some of the key reasons:
1. It helps to understand the dataset by showing how many features it has, what
   type of data each feature contains and how the data is distributed.
2. It helps to identify hidden patterns and relationships between different data
   points, which helps us in model building.
3. Allows to identify errors or unusual data points (outliers) that could affect our
   results.
4. The insights gained from EDA help us identify the most important features for
   building models and guide us on how to prepare them for better performance.
5. Understanding the data helps us choose the best modeling techniques and adjust
   them for better results.
Types of Exploratory Data Analysis
There are various types of EDA based on the nature of the data. Depending on the
number of variables we are analyzing, we can divide EDA into three types:
 1. Univariate Analysis
 Univariate analysis focuses on studying one variable to understand its
 characteristics. It helps to describe data and find patterns within a single feature.
 Various common methods are used: histograms to show the data distribution, box
 plots to detect outliers and understand the spread, and bar charts for categorical
 data. Summary statistics like the mean, median, mode, variance and standard
 deviation help in describing the central tendency and spread of the data.
 2. Bivariate Analysis
 Bivariate Analysis focuses on identifying relationship between two variables to
 find connections, correlations and dependencies. It helps to understand how two
 variables interact with each other. Some key techniques include:
 Scatter plots which visualize the relationship between two continuous
    variables.
 The correlation coefficient measures how strongly two variables are related;
    Pearson's correlation is commonly used for linear relationships.
 Cross-tabulation or contingency tables show the frequency distribution of two
    categorical variables and help us understand their relationship.
 Line graphs are useful for comparing two variables over time in time series
    data to identify trends or patterns.
 Covariance measures how two variables change together; it is usually paired with
    the correlation coefficient for a clearer, more standardized understanding of
    the relationship.
 3. Multivariate Analysis
 Multivariate analysis identifies relationships among more than two variables in the
 dataset and aims to understand how the variables interact with one another, which is
 important for statistical modeling. It includes techniques like the following (a short
 sketch follows this list):
 Pair plots which shows the relationships between multiple variables at once
    and helps in understanding how they interact.
 Another technique is Principal Component Analysis (PCA) which reduces
    the complexity of large datasets by simplifying them while keeping the most
    important information.
 Spatial Analysis is used for geographical data by using maps and spatial
    plotting to understand the geographical distribution of variables.
 Time Series Analysis is used for datasets that involve time-based data and it
    involves understanding and modeling patterns and trends over time. Common
    techniques include line plots, autocorrelation analysis, moving averages
    and ARIMA models.
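A brief sketch of two of these multivariate techniques on the Iris data used later in this unit (it assumes the same Iris.csv file, with its Species and Id columns, plus seaborn and scikit-learn):

import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA

df = pd.read_csv("Iris.csv")
sns.pairplot(df.drop(columns=["Id"], errors="ignore"), hue="Species")   # pairwise relationships

features = df.select_dtypes("number").drop(columns=["Id"], errors="ignore")
pca = PCA(n_components=2)
components = pca.fit_transform(features)
print(pca.explained_variance_ratio_)    # share of variance kept by each component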
 Steps for Performing Exploratory Data Analysis
 It involves a series of steps to help us understand the data, uncover patterns,
 identify anomalies, test hypotheses and ensure the data is clean and ready for
 further analysis. It can be done using different tools like:
 In Python, Pandas is used to clean, filter and manipulate data. Matplotlib helps
     to create basic visualizations while Seaborn makes more attractive plots. For
     interactive visualizations Plotly is a good choice.
 In R, ggplot2 is used for creating complex plots, dplyr helps with data
     manipulation and tidyr makes sure our data is organized and easy to work
     with.
 The main steps of EDA include:
Step 1: Understanding the Problem and the Data
The first step in any data analysis project is to fully understand the problem we're
solving and the data we have. This includes asking key questions like:
1. What is the business goal or research question?
2. What are the variables in the data and what do they represent?
3. What types of data (numerical, categorical, text, etc.) do you have?
4. Are there any known data quality issues or limitations?
5. Are there any domain-specific concerns or restrictions?
By understanding the problem and the data, we can plan our analysis more
effectively, avoid incorrect assumptions and ensure accurate conclusions.
Step 2: Importing and Inspecting the Data
After understanding the problem and the data, the next step is to import the data into
our analysis environment, such as Python, R or a spreadsheet tool. It’s important
to inspect the data to gain a basic understanding of its structure, variable types and any
potential issues. Here’s what we can do:
1. Load the data into our environment carefully to avoid errors or truncations.
2. Check the size of the data like number of rows and columns to understand its
   complexity.
3. Check for missing values and see how they are distributed across variables
   since missing data can impact the quality of your analysis.
4. Identify data types for each variable like numerical, categorical, etc which will
   help in the next steps of data manipulation and analysis.
5. Look for errors or inconsistencies such as invalid values, mismatched units or
   outliers which could show major issues with the data.
By completing these tasks we'll be prepared to clean and analyze the data more
effectively.
Step 3: Handling Missing Data
Missing data is common in many datasets and can affect the quality of our
analysis. During EDA it's important to identify and handle missing data properly
to avoid biased or misleading results. Here’s how to handle it:
1. Understand the patterns and possible causes of missing data. Is it missing
   completely at random (MCAR), missing at random (MAR) or missing not at
   random (MNAR)? Identifying this helps us find the best way to handle the
   missing data.
2. Decide whether to remove missing data or impute (fill in) the missing values.
   Removing data can lead to biased outcomes if the missing data isn’t MCAR.
   Filling values helps to preserve data but should be done carefully.
3. Use appropriate imputation methods like mean or median
   imputation, regression imputation or machine learning techniques
   like KNN or decision trees based on the data’s characteristics.
4. Consider the impact of missing data. Even after imputing, missing data can
   cause uncertainty and bias, so interpret the results with caution.
Properly handling missing data improves the accuracy of our analysis and
prevents misleading conclusions.
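A minimal sketch of these ideas with pandas (the small DataFrame and its column names are placeholders):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["Pune", "Delhi", None, "Delhi"]})

print(df.isnull().sum())                              # count missing values per column
df["age"] = df["age"].fillna(df["age"].median())      # impute a numeric column with its median
df["city"] = df["city"].fillna(df["city"].mode()[0])  # impute a categorical column with its mode
df = df.dropna()                                      # or drop any rows that remain incomplete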
Step 4: Exploring Data Characteristics
After addressing missing data, we explore the characteristics of our data by checking
the distribution, central tendency and variability of our variables and identifying
outliers or anomalies. This helps in selecting appropriate analysis methods and
finding major data issues. We should calculate summary statistics like the mean,
median, mode, standard deviation, skewness and kurtosis for numerical variables.
These provide an overview of the data’s distribution and help us identify any
irregular patterns or issues.
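A minimal sketch of these summary statistics with pandas (using the Iris.csv file that appears later in this unit):

import pandas as pd

df = pd.read_csv("Iris.csv")
numeric = df.select_dtypes("number")

print(numeric.describe())      # count, mean, std, min, quartiles, max
print(numeric.skew())          # skewness of each numeric column
print(numeric.kurtosis())      # kurtosis of each numeric column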
Step 5: Performing Data Transformation
Data transformation is an important step in EDA as it prepares our data for
accurate analysis and modeling. Depending on our data's characteristics and
analysis needs, we may need to transform it to ensure it's in the right format.
Common transformation techniques include the following (a brief sketch follows the list):
1. Scaling or normalizing numerical variables like min-max
   scaling or standardization.
2. Encoding categorical variables for machine learning like one-hot
   encoding or label encoding.
3. Applying mathematical transformations like logarithmic or square-root
   transformations to correct skewness or non-linearity.
4. Creating new variables from existing ones like calculating ratios or combining
   variables.
5. Aggregating or grouping data based on specific variables or conditions.
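A minimal sketch of a few of these transformations (the toy DataFrame, its column names and the scaler choice are assumptions; scikit-learn is used for scaling):

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"income": [30000, 52000, 75000, 1200000],
                   "city": ["Pune", "Delhi", "Pune", "Mumbai"]})

df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()  # min-max scaling
df["income_log"] = np.log1p(df["income"])                                   # log transform to reduce skew
df = pd.get_dummies(df, columns=["city"])                                   # one-hot encoding
print(df.head())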
Step 6: Visualizing Relationship of Data
Visualization helps to find relationships between variables and identify patterns
or trends that may not be seen from summary statistics alone.
1. For categorical variables, create frequency tables, bar plots and pie charts to
   understand the distribution of categories and identify imbalances or unusual
   patterns.
2. For numerical variables generate histograms, box plots, violin plots and
   density plots to visualize distribution, shape, spread and potential outliers.
3. To find relationships between variables use scatter plots, correlation matrices
   or statistical tests like Pearson’s correlation coefficient or Spearman’s rank
   correlation.
Step 7: Handling Outliers
Outliers are data points that differ markedly from the rest of the data and may be
caused by errors in measurement or data entry. Detecting and handling outliers is
important because they can skew our analysis and affect model performance. We can
identify outliers using methods like the interquartile range (IQR),
Z-scores or domain-specific rules. Once identified, they can be removed or adjusted
depending on the context. Properly managing outliers keeps our analysis
accurate and reliable.
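A minimal sketch of the IQR rule with pandas (the series values are purely illustrative):

import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])           # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]           # flag points outside the fences
print(outliers)                                   # 95 is flagged
cleaned = s[(s >= lower) & (s <= upper)]          # one option: drop the flagged points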
Step 8: Communicate Findings and Insights
The final step in EDA is to communicate our findings clearly. This involves
summarizing the analysis, pointing out key discoveries and presenting our results
in a clear way.
Data Analysis Using Python
Data Analysis is the technique of collecting, transforming and organizing data to
make future predictions and informed data-driven decisions. It also helps to find
possible solutions for a business problem. In this section, we will discuss how to
do data analysis with Python, i.e. analyzing numerical data with NumPy, tabular
data with Pandas and visualizing data with Matplotlib.
Analyzing Numerical Data with NumPy
NumPy is an array processing package in Python and provides a high-
performance multidimensional array object and tools for working with these
arrays. It is the fundamental package for scientific computing with Python.
Arrays in NumPy
A NumPy array is a table of elements (usually numbers), all of the same type,
indexed by a tuple of non-negative integers. In NumPy the number of dimensions of
the array is called the rank of the array. A tuple of integers giving the size of the
array along each dimension is known as the shape of the array.
Creating NumPy Array
NumPy arrays can be created in multiple ways with various ranks. It can also be
created with the use of different data types like lists, tuples, etc. The type of the
resultant array is deduced from the type of elements in the sequences. NumPy
offers several functions to create arrays with initial placeholder content. This
minimizes the necessity of growing arrays.
import numpy as np

# np.empty allocates memory without initializing it, so the values
# printed below are arbitrary and will differ from run to run.
a = np.empty([2, 2], dtype=int)
print("\nMatrix a : \n", a)

b = np.empty(2, dtype=int)
print("Matrix b : \n", b)
Output
Matrix a :
[[   94655291709206                 0]
[3543826506195694713 34181816989462323]]
Matrix b :
[-4611686018427387904            206158462975]
Arithmetic Operations
1. Addition:
import numpy as np

a = np.array([5, 72, 13, 100])
b = np.array([2, 5, 10, 30])

# Element-wise addition with the + operator and with np.add
add_ans = a + b
print(add_ans)
add_ans = np.add(a, b)
print(add_ans)

# Adding more than two arrays
c = np.array([1, 2, 3, 4])
add_ans = a + b + c
print(add_ans)
# Note: in np.add(a, b, c) the third argument is the "out" array,
# so this still computes a + b and stores the result in c.
add_ans = np.add(a, b, c)
print(add_ans)
Output
[ 7 77 23 130]
[ 7 77 23 130]
[ 8 79 26 134]
[ 7 77 23 130]
NumPy Array Indexing
Indexing can be done in NumPy by using an array as an index. In the case of the
slice, a view or shallow copy of the array is returned but in the index array, a
copy of the original array is returned. Numpy arrays can be indexed with other
arrays or any other sequence with the exception of tuples. The last element is
indexed by -1 second last by -2 and so on.
import numpy as np
a = np.arange(10, 1, -2)
print("\n A sequential array with a negative step: \n",a)
newarr = a[np.array([3, 1, 2 ])]
print("\n Elements at these indices are:\n",newarr)
Output
 A sequential array with a negative step:
 [10 8 6 4 2]
 Elements at these indices are:
 [4 8 6]
 NumPy Array Slicing
 Consider the syntax x[obj] where x is the array and obj is the index. The slice
 object is the index in the case of basic slicing. Basic slicing occurs when obj is:
 a slice object that is of the form start: stop: step
 an integer
 or a tuple of slice objects and integers
 All arrays generated by basic slicing are always views of the original array.
import numpy as np
a = np.arange(20)
print("\n Array is:\n ",a)
print("\n a[-8:17:1] = ",a[-8:17:1])
print("\n a[10:] = ",a[10:])
Output
Array is:
 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
a[-8:17:1] = [12 13 14 15 16]
a[10:] = [10 11 12 13 14 15 16 17 18 19]
Ellipsis can also be used along with basic slicing. Ellipsis (…) expands to the number of :
objects needed to make a selection tuple of the same length as the number of dimensions of
the array.
import numpy as np
b = np.array([[[1, 2, 3],[4, 5, 6]],
       [[7, 8, 9],[10, 11, 12]]])
print(b[...,1])
Output
[[ 2 5]
[ 8 11]]
NumPy Array Broadcasting
The term broadcasting refers to how NumPy treats arrays with different
dimensions during arithmetic operations, which leads to certain constraints: the
smaller array is broadcast across the larger array so that they have compatible
shapes.
Let’s assume that we have a large data set where each datum is a list of parameters. In
NumPy we have a 2-D array where each row is a datum and the number of rows
is the size of the data set. Suppose we want to apply some sort of scaling to all
these data: every parameter gets its own scaling factor, or in other words, every
parameter is multiplied by some factor.
Just to have a clear understanding, let’s count calories in foods using a macro-
nutrient breakdown. Roughly put, the caloric parts of food are made of fats (9
calories per gram), protein (4 CPG) and carbs (4 CPG). So if we list some foods
(our data) and for each food list its macro-nutrient breakdown (parameters), we
can then multiply each nutrient by its caloric value (apply scaling) to compute the
caloric breakdown of every food item.
With this transformation, we can now compute all kinds of useful information.
For example what is the total number of calories present in some food or, given a
breakdown of my dinner know how many calories did I get from protein and so
on.
Let’s see a naive way of producing this computation with Numpy:
import numpy as np

# Each row is a food; the columns are grams of fat, protein and carbs.
macros = np.array([
   [0.8, 2.9, 3.9],
   [52.4, 23.6, 36.5],
   [55.2, 31.7, 23.9],
   [14.4, 11, 4.9]
])

# Calories per gram of fat, protein and carbs (9, 4 and 4, as above).
cal_per_macro = np.array([9, 4, 4])

# Broadcasting stretches the 1-D factor array across every row of macros.
result = macros * cal_per_macro
print(result)
Output
[[  7.2  11.6  15.6]
 [471.6  94.4 146. ]
 [496.8 126.8  95.6]
 [129.6  44.   19.6]]
 Analyzing Data Using Pandas
Python Pandas is used for relational or labeled data and provides various data
structures for manipulating such data and time series. This library is built on top of
the NumPy library. It is generally imported as:
 import pandas as pd
Here, pd is an alias for Pandas. It is not necessary to import the library under this
alias; it simply saves typing every time a method or property is called. Pandas
generally provides two data structures for manipulating data. They are:
 Series
 Data frame
 Series
 Pandas Series is a one-dimensional labeled array capable of holding data of any
 type like integer, string, float, python objects, etc. The axis labels are collectively
called indexes. A Pandas Series is essentially like a column in an Excel sheet. Labels
need not be unique but must be a hashable type. The object supports both integer
and label-based indexing and provides a host of methods for performing operations
involving the index.
 It can be created using the Series() function by loading the dataset from the
 existing storage like SQL, Database, CSV Files, Excel Files, etc or from data
 structures like lists, dictionaries, etc.
 Python Pandas Creating Series
import pandas as pd
import numpy as np
ser = pd.Series(dtype="object")
print(ser)
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
Output
Series([], dtype: object)
0   g
1   e
2   e
3   k
4   s
dtype: object
Dataframe
Pandas DataFrame is a two-dimensional, size-mutable structure with labeled axes (rows
and columns). In it, data is aligned in a tabular format in rows and columns.
A Pandas DataFrame consists of three principal components: the data, the rows and the
columns.
It can be created using the DataFrame() constructor and, just like a Series, it can also
be created from different file types and data structures.
Python Pandas Creating Dataframe
import pandas as pd
df = pd.DataFrame()
print(df)
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
df = pd.DataFrame(lst, columns=['Words'])
print(df)
Output
Empty DataFrame
Columns: []
Index: []
    Words
0 Geeks
1    For
2 Geeks
3    is
4 portal
5    for
6 Geeks
Creating Dataframe from CSV
We can create a dataframe from the CSV files using the read_csv() function.
import pandas as pd
df = pd.read_csv("Iris.csv")
df.head()
Output:
(First five rows of the dataframe)
Filtering DataFrame
Pandas dataframe.filter() function is used to Subset rows or columns of dataframe
according to labels in the specified index. Note that this routine does not filter a
dataframe on its contents. The filter is applied to the labels of the index.
Python Pandas Filter Dataframe
import pandas as pd
df = pd.read_csv("Iris.csv")
df.filter(["Species", "SepalLengthCm", "SepalWidthCm"]).head()
Output:
(Filtered columns of the dataset)
Sorting DataFrame
In order to sort the data frame in pandas, the function sort_values() is used. Pandas
sort_values() can sort the data frame in Ascending or Descending order. Python
Pandas Sorting Dataframe in Ascending Order
import pandas as pd

# Iris.csv already contains a header row with these column names,
# so we read it directly instead of passing header=None.
df = pd.read_csv("Iris.csv")
df_sorted = df.sort_values(by='SepalLengthCm', ascending=True)
print(df_sorted.head())
Output:
(Dataset sorted by the SepalLengthCm column)
 Pandas GroupBy
 Groupby is a pretty simple concept. We can create a grouping of categories and
 apply a function to the categories. In real data science projects, you’ll be dealing
 with large amounts of data and trying things over and over, so for efficiency we
 use the Groupby concept. Groupby mainly refers to a process involving one or
 more of the following steps:
 Splitting: We split the data into groups by applying some condition on the dataset.
 Applying: We apply a function to each group independently.
 Combining: We combine the per-group results into a single data structure.
 For example, we can group the rows of a DataFrame by the unique values in a column
 (say, a 'Team' column), as sketched below.
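A minimal sketch of this split-apply-combine idea (the Team column and its values are hypothetical):

import pandas as pd

df = pd.DataFrame({"Team": ["A", "B", "A", "B", "A"],
                   "Points": [10, 8, 12, 15, 9]})

grouped = df.groupby("Team")            # split the rows into groups by the unique Team values
print(grouped.groups)                   # mapping of each team to its row labels
print(grouped["Points"].mean())         # apply a function per group and combine the results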
    Pandas Aggregation
Aggregation is a process in which we compute a summary statistic about each
group. The aggregated function returns a single aggregated value for each group.
After splitting data into groups using groupby function, several aggregation
operations can be performed on the grouped data.
import pandas as pd
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
            'Gaurav', 'Anuj', 'Princi', 'Abhi'],
      'Age': [27, 24, 22, 32,
           33, 36, 27, 32],
      'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
              'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
      'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
                  'B.Tech', 'B.com', 'Msc', 'MA']}
df = pd.DataFrame(data1)
grp1 = df.groupby('Name')
result = grp1['Age'].aggregate('sum')
print(result)
Output:
Name
Abhi      32
Anuj      60
Gaurav    33
Jai       49
Princi    59
Name: Age, dtype: int64
Merging DataFrame
When we need to combine very large DataFrames, joins serve as a powerful way
to perform these operations swiftly. Joins can only be done on two DataFrames at
a time, denoted as left and right tables. The key is the common column that the
two DataFrames will be joined on. It’s a good practice to use keys that have
unique values throughout the column to avoid unintended duplication of row
values. Pandas provide a single function, merge(), as the entry point for all
standard database join operations between DataFrame objects.
There are four basic ways to handle the join (inner, left, right and outer),
depending on which rows must retain their data.
import pandas as pd
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
      'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
     'Age':[27, 24, 22, 32],}
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
      'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
     'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1)
df1 = pd.DataFrame(data2)
print(df)
print(df1)   # display(df, df1) works only inside Jupyter/IPython notebooks
res = pd.merge(df, df1, on='key')
print(res)
Output:
Joining DataFrame
In order to join dataframes, we use the .join() function. This function combines the
columns of two potentially differently indexed DataFrames into a single result
DataFrame.
import pandas as pd
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
     'Age':[27, 24, 22, 32]}
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
     'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3'])
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])
res = df.join(df1)
print(res)
Output:
Bar chart
A bar plot or bar chart is a graph that represents categories of data with
rectangular bars whose lengths or heights are proportional to the values they
represent. Bar plots can be plotted horizontally or vertically. A bar chart
describes comparisons between discrete categories. It can be created using
the bar() method.
Here we will use the iris dataset only.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.bar(df['Species'], df['SepalLengthCm'])
plt.title("Iris Dataset")
plt.legend(["bar"])
plt.show()
Output:
(Bar chart using the matplotlib library)
Histograms
A histogram is basically used to represent data in the form of some groups. It is a
type of bar plot where the X-axis represents the bin ranges while the Y-axis gives
information about frequency. To create a histogram the first step is to create a bin
of the ranges, then distribute the whole range of the values into a series of
intervals and count the values which fall into each of the intervals. Bins are
clearly identified as consecutive, non-overlapping intervals of variables.
The hist() function is used to compute and create a histogram of x.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.hist(df["SepalLengthCm"])
plt.title("Histogram")
plt.legend(["SepalLengthCm"])
plt.show()
Output:
(Histogram using the matplotlib library)
Scatter Plot
Scatter plots are used to observe relationships between variables and use dots to
represent the relationship between them. The scatter() method in the matplotlib
library is used to draw a scatter plot.
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.scatter(df["Species"], df["SepalLengthCm"])
plt.title("Scatter Plot")
plt.legend(["SepalLengthCm"])
plt.show()
Output:
(Scatter plot using the matplotlib library)
 Box Plot
A boxplot is also known as a box and whisker plot. It is a very good visual
representation when it comes to examining the data distribution. It clearly plots the
median, the quartiles and the outliers. Understanding the data distribution is
another important factor which leads to better model building. If the data has outliers,
a box plot is a recommended way to identify them and take necessary action. The
box and whiskers chart shows how the data is spread out. Five pieces of information
are generally included in the chart:
 The minimum is shown at the far left of the chart, at the end of the left whisker.
 The first quartile, Q1, is the left edge of the box.
 The median is shown as a line in the center of the box.
 The third quartile, Q3, is the right edge of the box.
 The maximum is shown at the far right of the chart, at the end of the right whisker.
(Figure: representation of a box plot, illustrating the interquartile range)
Python Matplotlib Box Plot
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.boxplot(df["SepalWidthCm"])
plt.title("Box Plot")
plt.legend(["SepalWidthCm"])
plt.show()
Output:
(Boxplot using the matplotlib library)
Correlation Heatmaps
A 2-D heatmap is a data visualization tool that represents the magnitude of a
phenomenon with colors. A correlation heatmap is a heatmap that shows a 2D
correlation matrix between two discrete dimensions, using colored cells, typically on
a monochromatic scale, to represent the values. The values of the first dimension
appear as the rows of the table while the values of the second dimension appear as
the columns. The color of each cell is proportional to the correlation it represents.
This makes correlation heatmaps ideal for data analysis since they make patterns
easily readable and highlight differences and variation in the data. A correlation
heatmap, like a regular heatmap, is usually accompanied by a colorbar, making the
data easy to read and interpret.
Note: The data has to be passed through the corr() method to generate the correlation
matrix. Older pandas versions silently dropped non-numeric columns here; in recent
versions they must be excluded explicitly (for example with select_dtypes) before
calling corr().
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("Iris.csv")
plt.imshow(df.select_dtypes("number").corr(), cmap='autumn', interpolation='nearest')
plt.title("Heat Map")
plt.show()
Output: