EDA (EXPLORATORY DATA ANALYSIS)
1- variable identification
2- univariate analysis
3- bivariate analysis -> correlation
4- missing value treatment (see the sketch after this list) ->
Deletion
Mean/median/mode Imputation
Prediction model
KNN Imputation
5- outlier treatment
6- variable transformation
7- variable creation
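A minimal sketch of steps 4 and 5 above (missing value & outlier treatment), assuming pandas and scikit-learn are available; the column names and values are made up for illustration only.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"age": [22, 25, np.nan, 40, 120],      # 120 acts as an outlier
                   "salary": [30, 35, 32, np.nan, 38]})

# mean imputation (median/mode imputation via strategy="median"/"most_frequent")
df[["salary"]] = SimpleImputer(strategy="mean").fit_transform(df[["salary"]])

# KNN imputation: fills a missing value from its k nearest rows
df[["age", "salary"]] = KNNImputer(n_neighbors=2).fit_transform(df[["age", "salary"]])

# outlier treatment with the IQR rule: cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df)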
RAW DATA & CLEAN DATA
- REGRESSION & CLASSIFICATION (ML)
regression --> if the dependent variable is continuous in nature.
regression algorithms or regression models (a minimal example follows this list) -->
1- simple linear regression
2- multiple linear regression
3- gradient descent || sgd || bgd
4- polynomial regression
5- support vector regression
6- decision tree regression
7- random forest regression -> regularization techniques - L1 (Lasso) & L2 (Ridge) regression
8- time series
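A minimal sketch of simple linear regression (model 1 above), assuming scikit-learn is available; the data here is synthetic and used for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                # one independent variable
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)    # continuous dependent variable

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("R^2:", model.score(X, y))   # explanatory power, defined later in these notes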
classification --> if the dependent variable is categorical (e.g., binary) in nature.
classification algorithms (a minimal example follows this list) -
1- logistic regression
2- support vector machine
3- knn
4- naive bayes
5- decision tree
6- random forest
7- ada boost | catboost
8- xgboost
9- lgbm
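A minimal classification sketch using logistic regression (algorithm 1 above), assuming scikit-learn is available; it uses scikit-learn's built-in breast cancer dataset, which has a binary dependent variable.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)          # binary target (0/1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))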
1- MEASURE OF CENTRAL TENDENCY
Mean-> the sum of all values in the given data/population divided by the
total number of values in the given data/population; it is preferred for numerical data
Median-> the middle value after putting the observations in
ascending order
Mode-> the most commonly observed value in a set of data; it is preferred for categorical data
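A quick illustration of the three measures using Python's built-in statistics module (no external libraries assumed):

import statistics

numeric = [2, 3, 3, 5, 7, 10]
print("mean:", statistics.mean(numeric))      # (2+3+3+5+7+10)/6 = 5
print("median:", statistics.median(numeric))  # middle of the sorted values -> 4
categorical = ["red", "blue", "red", "green"]
print("mode:", statistics.mode(categorical))  # most common category -> "red"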
2- MEASURE OF ASYMMETRY
Skewness-> measure of symmetry, or more precisely, the lack of symmetry;
every time we need to consider only 0 skewness (AKA normal distribution || gaussian || symmetrical)
+ve skewness --> (mean > median & mode) == data stays at the left and outliers are at the right
0 skewness --> (mean = median = mode) == data stays at the center & no outliers
-ve skewness --> (mode > median & mean) == data stays at the right & outliers are at the left
Kurtosis-> measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution;
every time we need to consider only mesokurtic
leptokurtic == +ve kurtosis == heavy tails / sharp peak
platykurtic == -ve kurtosis == light tails / flat peak
mesokurtic == normal distribution == mean = median = mode
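A small sketch of both measures of asymmetry, assuming scipy is available; scipy reports excess kurtosis by default (0 for a normal distribution).

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
normal_data = rng.normal(size=10_000)          # ~0 skewness, mesokurtic
right_skewed = rng.exponential(size=10_000)    # +ve skewness (tail/outliers on the right)

print("normal -> skew:", skew(normal_data), "kurtosis:", kurtosis(normal_data))
print("skewed -> skew:", skew(right_skewed), "kurtosis:", kurtosis(right_skewed))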
3- MEASURE OF VARIABILITY
Sample equations are considered every time
4- MEASURE OF RELATIONSHIP
Covariance
Correlation
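A minimal covariance vs. correlation sketch with numpy (assumed available); the numbers are made up.

import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

print("covariance:", np.cov(x, y)[0, 1])        # sign shows direction, scale-dependent
print("correlation:", np.corrcoef(x, y)[0, 1])  # standardized to the range [-1, 1]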
from normal distribution to standard normal distribution
z-score ==> converts the mean to 0 & the standard deviation to 1; z = (x - mean) / standard deviation (the z statistic for a sample mean divides by the standard error instead)
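A short z-score sketch with numpy (scipy.stats.zscore gives the same result); the data is made up.

import numpy as np

data = np.array([4.0, 8.0, 6.0, 5.0, 7.0])
z = (data - data.mean()) / data.std()     # z = (x - mean) / standard deviation
print(z, z.mean().round(10), z.std())     # standardized values, mean ~0, std = 1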
Type-I Error-> Reject a true null hypothesis
Type-II Error-> Accept a false null hypothesis
α = 1 - confidence level (e.g., 95% confidence --> α = 0.05)
p-value - determined from the standard normal distribution table using the z-value
Rule: you should reject the null hypothesis if p-value < α
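A sketch of this decision rule, assuming scipy is available; the z value (2.1) and α are made-up example numbers.

from scipy.stats import norm

z_value = 2.1
alpha = 0.05                           # 95% confidence level -> alpha = 1 - 0.95
p_value = 2 * norm.sf(abs(z_value))    # two-tailed p-value from the z statistic
print("p-value:", p_value, "-> reject H0" if p_value < alpha else "-> fail to reject H0")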
Z-Test -> is for population; used when the population variance is known
t-Test -> is for sample; used when the population variance is unknown
Confidence interval -> margin of error -> value obtained from the Z-Test/t-Test
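A minimal one-sample t-test sketch (population variance unknown), assuming scipy is available; the sample values and the hypothesised mean of 50 are made up.

from scipy.stats import ttest_1samp

sample = [52, 49, 55, 51, 53, 48, 54, 56]
t_stat, p_value = ttest_1samp(sample, popmean=50)
print("t:", t_stat, "p:", p_value)     # reject H0 (mean = 50) if p < alpha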
Central Limit Theorem-> no matter what the distribution of the entire dataset is, the mean of the samples you take approximates a normal distribution; when the sample size increases the standard error decreases & a bigger sample gives a better approximation
Standard Error-> standard deviation of the distribution formed by sample means
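A quick simulation of the Central Limit Theorem with numpy (assumed available): sample means are drawn from a clearly non-normal population, and their spread (the standard error) shrinks as the sample size grows.

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(size=100_000)     # clearly non-normal population

for n in (5, 50, 500):
    sample_means = rng.choice(population, size=(2_000, n)).mean(axis=1)
    print(f"n={n:4d}  std of sample means (standard error) = {sample_means.std():.4f}")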
REGRESSION ANALYSIS
A strong correlation with the dependent variable is recommended when selecting relevant variables.
Sum of Squares Total (SST)
Sum of Squares Regression (SSR)
Sum of Squares Error (SSE)
SST = SSR + SSE
R Square-> R^2 = SSR / SST, range (0, 1)
least squares method (OLS)-> minimises SSE -> lower error means better explanatory power; this method aims to find the line which minimises the sum of squared errors
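A sketch tying OLS, SST/SSR/SSE and R^2 together with numpy (assumed available); the data is synthetic and np.polyfit is used here as the least-squares solver.

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 2, 50)

slope, intercept = np.polyfit(x, y, deg=1)   # OLS: minimises the sum of squared errors
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)            # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)        # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)               # unexplained (error) variation
print("SST ~ SSR + SSE:", sst, ssr + sse)
print("R^2 = SSR/SST =", ssr / sst)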