Math, Probability, and Statistical
Modeling
Department of Computer Sci. & Eng.
Babaria Institute of Technology
Statistic
• A statistic is a result that’s derived from
performing a mathematical operation on
numerical data
• Two types
1. Descriptive
2. Inferential
Descriptive
• It describes the important characteristics/properties
of the data using measures of central tendency
(mean, median, mode) and measures of
dispersion (range, standard deviation, variance,
etc.).
• Data can be summarized and represented in an
accurate way using charts, tables and graphs
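A minimal sketch of the descriptive measures listed above, using Python's standard `statistics` module. The exam scores are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical sample: exam scores for nine students.
scores = [62, 70, 70, 75, 78, 81, 85, 90, 100]

# Measures of central tendency
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_score = statistics.mode(scores)

# Measures of dispersion
value_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(scores)      # sample standard deviation

print("mean:", mean_score, "median:", median_score, "mode:", mode_score)
print("range:", value_range, "variance:", variance, "std dev:", round(std_dev, 2))
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `pvariance` and `pstdev` are the population versions.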
Inferential Statistics
• It is about using data from a sample to
make inferences about the larger population
from which the sample is drawn.
• Examples: hypothesis tests, analysis of variance (ANOVA), etc.
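As a rough illustration of a hypothesis test, the sketch below computes a one-sample t statistic by hand. The package weights and the hypothesized mean of 500 g are invented for the example. To reach a decision, the statistic would still have to be compared against a t distribution with n - 1 degrees of freedom, which is not done here.

```python
import math
import statistics

# Hypothetical sample of 8 package weights (grams) drawn from a larger population.
sample = [498, 502, 497, 505, 499, 501, 503, 503]
hypothesized_mean = 500  # null hypothesis: the population mean is 500 g

n = len(sample)
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)      # sample standard deviation
standard_error = sample_sd / math.sqrt(n)

# How many standard errors the sample mean lies from the hypothesized mean.
t_statistic = (sample_mean - hypothesized_mean) / standard_error

print("sample mean:", sample_mean)
print("t statistic:", round(t_statistic, 3), "with", n - 1, "degrees of freedom")
```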
Descriptive vs Inferential
Probability (likelihood)
• The probability of any single event never goes
below 0.0 or exceeds 1.0.
• The probabilities of all possible events always sum to
exactly 1.0.
• The possible events are mutually
exclusive: only one can occur at a time.
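These three rules can be checked on a small example. The sketch below assumes a fair six-sided die, which is an illustrative choice and not something from the slides; it uses exact fractions to avoid rounding.

```python
from fractions import Fraction

# One roll of a fair six-sided die: the six faces are mutually exclusive outcomes.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Rule 1: every probability lies between 0 and 1.
each_in_range = all(0 <= p <= 1 for p in die.values())

# Rule 2: the probabilities of all possible outcomes sum to exactly 1.
total = sum(die.values())

# Rule 3: because outcomes are mutually exclusive, probabilities simply add.
p_even = die[2] + die[4] + die[6]

print(each_in_range, total, p_even)
```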
Types of Random Variable
1. Discrete variable
• Only specific values are possible.
• Examples: # of defectives, # of votes, # of coronavirus
infected cases.
• Example of discrete probability distribution: Binomial
Distribution.
2. Continuous variable
• Any value within a range is possible (infinitely many values).
• Examples: infection rate, height, income.
• Example of continuous probability distribution:
Normal/Gaussian Distribution.
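A short sketch contrasting the two kinds of distribution. The defect rate of 0.1 and the height parameters (mean 170, standard deviation 10) are assumed values for illustration only.

```python
import math
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Discrete: number of defectives among 10 items, assumed defect rate 0.1.
pmf = [binomial_pmf(k, 10, 0.1) for k in range(11)]
p_no_defectives = pmf[0]

# Continuous: heights assumed to follow Normal(mean=170, sd=10).
heights = NormalDist(mu=170, sigma=10)
# For a continuous variable we ask about intervals, not single values.
p_between = heights.cdf(180) - heights.cdf(160)

print("P(0 defectives):", round(p_no_defectives, 4))
print("P(160 <= height <= 180):", round(p_between, 4))
```

The discrete probabilities can be listed one by one and sum to 1, while the continuous case only assigns probability to ranges of values.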
Probability distribution
Three Types
1. Normal distributions (numeric continuous)
2. Binomial distributions (numeric discrete)
3. Categorical distributions (non-numeric)
Conditional probability
“to predict the likelihood that an event will
occur, given evidence defined in your data
features”
• Naïve Bayes is a classification method based on
conditional probability; it is often used to classify text data.
• Example: a model predicts whether an email is
spam (the event) based on features
gathered from its content (the evidence).
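The spam example can be worked through with Bayes' theorem. All probabilities below are hypothetical, and the single piece of evidence is assumed to be the presence of the word "free".

```python
# All numbers are hypothetical, chosen only to illustrate the calculation.
p_spam = 0.2              # prior: P(spam)
p_word_given_spam = 0.6   # P("free" appears | spam)
p_word_given_ham = 0.05   # P("free" appears | not spam)

# Total probability of seeing the word in any email.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print("P(spam | 'free'):", round(p_spam_given_word, 2))
```

Seeing the evidence raises the probability of spam from the prior of 0.2 to the posterior computed above.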
Naïve Bayes
Three Types
1. MultinomialNB
2. BernoulliNB
3. GaussianNB
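The three names above match classes in scikit-learn's `naive_bayes` module. The sketch below does not use that library; it is a from-scratch, multinomial-style classifier on word counts with add-one smoothing, intended only to show the idea. The training texts and the `train` and `predict` helpers are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set: (text, label)
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting schedule for monday", "ham"),
    ("project report due monday", "ham"),
]

def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(text, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior score."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training)
print(predict("free money", *model))
print(predict("monday meeting", *model))
```

The "naïve" part is the assumption that words are conditionally independent given the class, which is why the per-word log probabilities are simply added.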
Correlation
• Correlation is a numerical measure of the strength of
the linear relationship between two variables.
• Correlation coefficient r
– ranges between –1 and 1.
– The closer the r-value is to 1 or –1, the stronger
the correlation between the two variables.
– If two variables have an r-value that’s close
to 0, there is little or no linear relationship between
them; they may be independent, although r ≈ 0 alone
does not prove independence.
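A minimal sketch of the correlation coefficient r computed from its definition. The data sets are hypothetical. The last one is a symmetric parabola, included to show that r can be 0 even when one variable is completely determined by the other.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(dev_x * dev_y)

hours = [1, 2, 3, 4, 5]
r_up = pearson_r(hours, [52, 58, 61, 67, 72])  # strong positive relationship
r_down = pearson_r(hours, [10, 8, 6, 4, 2])    # perfectly linear, decreasing

xs = [-2, -1, 0, 1, 2]
r_parabola = pearson_r(xs, [x ** 2 for x in xs])  # nonlinear dependence

print(round(r_up, 3), round(r_down, 3), round(r_parabola, 3))
```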
Pearson’s r
• Your data is normally distributed.
• You have continuous, numeric variables.
• Your variables are linearly related.
Spearman’s rank correlation
• Your variables are ordinal.
• Your variables are related non-linearly.
• Your data is non-normally distributed
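Spearman's rank correlation can be sketched as Pearson's r applied to the ranks of the values. The helper names and the cubic data below are hypothetical. The example shows a monotonic but non-linear relationship, where the rank correlation is 1 while Pearson's r is lower.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(dev_x * dev_y)

def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    return pearson_r(average_ranks(xs), average_ranks(ys))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but not linear

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print("Spearman:", round(rho, 3), "Pearson:", round(r, 3))
```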
Reducing Data Dimensionality
• Singular value decomposition (SVD)
• Factor analysis
• Principal component analysis (PCA)
PCA
• Principal component analysis (PCA) is a dimensionality reduction
technique that enables you to identify
correlations and patterns in a data set so that
it can be transformed into a data set of
significantly lower dimension while retaining
most of the important information.
PCA
• Step-by-step computation of PCA
The following steps perform dimensionality reduction using PCA:
1. Standardization of the data
2. Computing the covariance matrix
3. Calculating the eigenvectors and eigenvalues
4. Computing the principal components
5. Reducing the dimensions of the data set
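The PCA steps can be sketched on a data set with only two features, where the eigenvalues of the 2x2 covariance matrix have a closed form. The height and weight values are hypothetical, and the sketch assumes the two features are correlated (non-zero covariance). Real data sets with more features would normally use a linear algebra library instead.

```python
import math
import statistics

# Hypothetical data: (height in cm, weight in kg) for six people.
data = [(150, 50), (160, 58), (165, 63), (170, 68), (180, 80), (175, 71)]

# Standardization: zero mean and unit standard deviation per feature.
cols = list(zip(*data))
means = [statistics.mean(c) for c in cols]
sds = [statistics.stdev(c) for c in cols]
z = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in data]

# Covariance matrix [[a, b], [b, d]] of the standardized data.
n = len(z)
def cov(i, j):
    return sum(row[i] * row[j] for row in z) / (n - 1)
a, b, d = cov(0, 0), cov(0, 1), cov(1, 1)

# Eigenvalues of a symmetric 2x2 matrix (closed form), largest first.
half_trace = (a + d) / 2
radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
eigenvalues = [half_trace + radius, half_trace - radius]

def unit_eigenvector(lam):
    # (b, lam - a) solves (A - lam*I) v = 0, assuming b != 0.
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

# Principal components: eigenvectors ordered by eigenvalue.
components = [unit_eigenvector(lam) for lam in eigenvalues]
explained = [lam / sum(eigenvalues) for lam in eigenvalues]

# Dimension reduction: keep only the projection onto the first component.
pc1 = components[0]
reduced = [row[0] * pc1[0] + row[1] * pc1[1] for row in z]

print("explained variance ratio:", [round(e, 3) for e in explained])
print("1-D representation:", [round(v, 2) for v in reduced])
```

Each eigenvalue is the variance captured along its component, so the explained variance ratio indicates how much information is kept when the second dimension is dropped.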
Regression Methods
• Linear regression
• Logistic regression
• Ordinary least squares (OLS) regression
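A minimal sketch of ordinary least squares for a single predictor, using the closed-form slope and intercept. The data points are hypothetical and lie close to a straight line.

```python
# Hypothetical observations that roughly follow y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 5.9, 8.1, 9.9]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# OLS picks the line that minimizes the sum of squared residuals.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("prediction for x = 6:", round(intercept + slope * 6, 2))
```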
Outliers
“Outliers are data points with values that are
significantly different than the majority of data
points comprising a variable”
Outliers
Three categories
1. Point outliers
2. Contextual outliers
3. Collective outliers
Point outliers
• Point outliers are single data points that lie far
from the rest of the distribution.
Contextual outliers
• Contextual outliers can be noise in data, such
as punctuation symbols when performing text
analysis or a background noise signal when
doing speech recognition.
Collective outliers
• Collective outliers can be subsets of novelties
in data such as a signal that may indicate the
discovery of new phenomena (As in figure B).
Most common causes of outliers
• Data entry errors
• Measurement errors
• Experimental errors
• Intentional
• Data processing errors
• Sampling errors
• Natural
Detecting Outliers
1. Univariate
Univariate outliers can be found by looking at the
distribution of values in a single feature space.
2. Multivariate
Multivariate outliers can be found in an n-dimensional
space (of n features). Looking at distributions in
n-dimensional spaces can be very difficult for the human
brain, which is why we need to train a model to do it for us.
Detecting outliers with univariate
analysis
Univariate outlier detection is where you look at
the features in your dataset and inspect them
individually for anomalous values.
1. Tukey outlier labeling
When you look at a variable, consider its spread, its Q1 / Q3
values, and its minimum and maximum values to decide
whether the variable is suspect for outliers.
2. Tukey boxplot
A boxplot is an easy way to spot outliers: any values that lie beyond
the whiskers are outliers.
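Tukey's labeling is commonly implemented with the 1.5 × IQR rule, which is also where boxplot whiskers are usually drawn. The slides do not state the multiplier, so the factor 1.5 is the conventional choice assumed here, and the data values are hypothetical.

```python
import statistics

# Hypothetical single feature with one suspicious value.
values = [12, 13, 14, 15, 15, 16, 17, 18, 19, 45]

# Quartiles; the exact values depend slightly on the interpolation method used.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

# Tukey fences: points beyond them are labeled as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```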
Tukey boxplot
Detecting outliers with multivariate
analysis
• A multivariate approach to outlier detection
involves considering two or more variables at
a time and inspecting them together for
outliers.
• Scatter-plot matrix
• Boxplot
• Density-based spatial clustering of applications with
noise (DBSCAN)
• Principal component analysis
Introducing Time Series Analysis
• A time series is just a collection of data on
attribute values over time. Time series
analysis is performed to predict future
instances of the measure based on the past
observational data.
Identifying patterns in time series
• Constant time series
• Trended time series
• Trended Seasonal time series
• Untrended Seasonal time series
Modeling univariate time series data
• Univariate analysis is the quantitative analysis
of only one variable at a time.
• With univariate time series, you model
changes in a single variable
over time.
• Autoregressive moving average (ARMA)
• Autoregression techniques
• Moving average techniques
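A rough sketch of two building blocks on a hypothetical series: moving-average smoothing and a first-order autoregression, AR(1), fitted by least squares. Note that window smoothing is not the same thing as the moving-average term of an ARMA model, which is defined on past forecast errors; it is shown here only as the simplest related technique. The function names are hypothetical.

```python
# Hypothetical measurements of a single variable over eight time steps.
series = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 13.0, 15.0]

def moving_average(values, window):
    """Average of each run of `window` consecutive observations."""
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

def fit_ar1(values):
    """Least-squares fit of value[t] = c + phi * value[t - 1]."""
    prev, curr = values[:-1], values[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    s_xy = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    s_xx = sum((p - mean_prev) ** 2 for p in prev)
    phi = s_xy / s_xx
    c = mean_curr - phi * mean_prev
    return c, phi

smoothed = moving_average(series, 3)
c, phi = fit_ar1(series)
next_value = c + phi * series[-1]  # one-step-ahead forecast

print("smoothed:", smoothed)
print("AR(1) coefficients:", round(c, 3), round(phi, 3))
print("forecast for next step:", round(next_value, 2))
```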