Statistics

EDA (EXPLORATORY DATA ANALYSIS)

1- variable identification
2- univariate analysis
3- bivariate analysis -> correlation
4- missing value treatment -> (see the sketch after this list)
- Deletion
- Mean/median/mode imputation
- Prediction model
- KNN imputation
5- outlier treatment
6- variable transformation
7- variable creation
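A minimal sketch of the missing value treatments listed above, assuming pandas and scikit-learn are available; the toy DataFrame and its column names are made up for illustration.

import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with missing values
df = pd.DataFrame({"age": [25, None, 40, 35, None],
                   "salary": [30000, 45000, None, 52000, 61000]})

# 1) Deletion: drop rows that contain any missing value
dropped = df.dropna()

# 2) Mean / median / mode imputation (mean shown here)
mean_imputed = df.fillna(df.mean(numeric_only=True))

# 3) KNN imputation: fill a missing value from its k nearest rows
knn_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                           columns=df.columns)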
RAW DATA & CLEAN DATA
- REGRESSION & CLASSIFICATION (ML)
regression --> if dependent variable is continuous in nature (see the sketch after this list).
regression algorithm or regression model -->
1- simple linear regression
2- multiple linear regression
3- gradient descent || SGD || BGD
4- polynomial regression
5- support vector regression
6- decision tree regression
7- random forest regression
8- time series
(regularization techniques: L1 (Lasso) & L2 (Ridge) regression)
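A minimal sketch of simple linear regression on a continuous target, assuming scikit-learn; the data below is synthetic.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic continuous target: y is roughly 3x + 5 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 5 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # slope and intercept close to 3 and 5
print(model.score(X, y))               # R-squared on the training data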
classification --> if dependent variable is categorical (e.g. binary) in nature (see the sketch after this list).
classification algorithm -
1- logistic regression
2- support vector machine
3- knn
4- naive bayes
5- decision tree
6- random forest
7- ada boost | catboost
8- xgboost
9- lgbm
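A minimal sketch of logistic regression on a binary target, assuming scikit-learn; the data is synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary target: class 1 when the two features sum to a positive value
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))       # classification accuracy on held-out data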
1- MEASURE OF CENTRAL TENDENCY
Mean -> the sum of all values in the given data/population divided by the total number of values; preferred for numerical data
Median -> the middle value after putting the observations in ascending order
Mode -> the most commonly observed value in a set of data; preferred for categorical data
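A minimal sketch of the three measures using Python's standard library; the data is made up.

import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))    # sum of values / number of values
print(statistics.median(data))  # middle value after sorting
print(statistics.mode(data))    # most frequently observed value -> 3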
2- MEASURE OF ASYMMETRY
Skewness -> measure of symmetry, or more precisely, the lack of symmetry
(ideally we want 0 skewness, AKA the normal distribution || Gaussian || symmetrical)
+ve skewness --> (mean > median & mode) == data concentrated at the left & outliers at the right
0 skewness --> (mean = median = mode) == data concentrated at the center & no outliers
-ve skewness --> (mode > mean & median) --> data concentrated at the right & outliers at the left
Kurtosis -> measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution
(ideally we want mesokurtic, i.e. close to a normal distribution)
leptokurtic == +ve kurtosis == heavier tails than normal
platykurtic == -ve kurtosis == lighter tails than normal
mesokurtic == normal distribution
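A minimal sketch for checking skewness and kurtosis, assuming scipy is available; the two datasets are synthetic.

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
normal_data = rng.normal(size=10_000)          # symmetric, mesokurtic
right_skewed = rng.exponential(size=10_000)    # long right tail

print(skew(normal_data), kurtosis(normal_data))      # both close to 0
print(skew(right_skewed), kurtosis(right_skewed))    # both clearly positive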
3- MEASURE OF VARIABILITY
(sample equations, i.e. with n-1 in the denominator, are considered every time)
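A minimal sketch of the sample versions of variance and standard deviation (the n-1 denominator mentioned above), assuming numpy; the numbers are made up.

import numpy as np

sample = np.array([4, 8, 6, 5, 3, 7])

# ddof=1 gives the sample formulas (divide by n-1), matching the note above
print(sample.var(ddof=1))   # sample variance
print(sample.std(ddof=1))   # sample standard deviation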
4- MEASURE OF RELATIONSHIP
Covariance
Correlation
From normal distribution to standard normal distribution:
z-score (standardization) ==> converts the mean to 0 & the standard deviation to 1
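A minimal sketch of covariance, correlation, and z-score standardization, assuming numpy; the two series are synthetic.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=500)
y = 2 * x + rng.normal(0, 5, size=500)   # y moves with x

print(np.cov(x, y)[0, 1])       # covariance (scale-dependent)
print(np.corrcoef(x, y)[0, 1])  # correlation, bounded in [-1, 1]

# z-score: convert to mean 0 and standard deviation 1
z = (x - x.mean()) / x.std()
print(round(z.mean(), 6), round(z.std(), 6))  # ~0 and 1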

Type-I Error -> rejecting a true null hypothesis
Type-II Error -> accepting (failing to reject) a false null hypothesis
Confidence level = 1 - α
α (significance level) can be determined from the standard normal distribution table using the z-value and the confidence percentage
Rule: reject the null hypothesis if p-value < α

Z-Test -> used when the population variance is known (Z-Test is for the population)
t-Test -> used when the population variance is unknown (t-Test is for samples)
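A minimal sketch of a one-sample t-test (population variance unknown) and the p-value < α decision rule, assuming scipy; the sample and the hypothesised mean of 50 are made up.

import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=8, size=30)   # hypothetical sample
alpha = 0.05

# H0: population mean = 50; population variance unknown -> t-test
t_stat, p_value = ttest_1samp(sample, popmean=50)

if p_value < alpha:
    print("reject the null hypothesis")          # rule: reject H0 if p-value < alpha
else:
    print("fail to reject the null hypothesis")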

Confidence interval -> Margin of error: a value obtained from the Z-Test/t-Test
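A minimal sketch of a confidence interval built as sample mean ± margin of error, assuming scipy for the critical t-value; the data is made up.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=40)

confidence = 0.95
se = sample.std(ddof=1) / np.sqrt(len(sample))              # standard error
t_crit = stats.t.ppf((1 + confidence) / 2, df=len(sample) - 1)
margin_of_error = t_crit * se

print(sample.mean() - margin_of_error, sample.mean() + margin_of_error)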
Central Limit Theorem -> no matter what the distribution of the entire dataset is, the distribution of sample means approximates a normal distribution; as the sample size increases the standard error decreases, and a bigger sample gives a better approximation
Standard Error -> standard deviation of the distribution formed by sample means
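A minimal sketch demonstrating the note above: means of samples drawn from a clearly non-normal distribution form a roughly normal distribution, and the standard error shrinks as the sample size grows. Assumes numpy; the exponential population is made up.

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # clearly non-normal

for n in (10, 100, 1000):
    sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    # standard error = std of the sampling distribution of the mean
    print(n, np.std(sample_means))   # decreases as n increases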
REGRESSION ANALYSIS
A strong positive correlation with the target is recommended when selecting relevant variables.
Sum of Squares Total (SST)
Sum of Squares Regression (SSR)
Sum of Squares Error (SSE)
SST = SSR + SSE
R Square -> R² = SSR/SST, range (0, 1)
Least squares method (OLS) -> minimizes SSE -> lower error -> better explanatory power; this method aims to find the line which minimizes the sum of squared errors
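A minimal sketch computing SST, SSR, SSE, and R² for an OLS line fitted with numpy; the data is synthetic.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + 4 + rng.normal(0, 2, size=50)

# OLS fit: slope and intercept that minimise the sum of squared errors
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

sst = np.sum((y - y.mean()) ** 2)       # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)          # unexplained (error) variation

print(np.isclose(sst, ssr + sse))       # SST = SSR + SSE
print(ssr / sst)                        # R squared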
