Data Types and Measures of Data Distribution
Prepared by: Deeman Yousif Mahmood
PhD Student
Data types
Type: Example
I. Numerical (double): Income (e.g. 650.34)
II. Numerical (int): # of children (e.g. 4)
III. Boolean: Gender (e.g. male)
IV. Categorical: Colors (e.g. green)
V. Ordinal: Satisfaction (e.g. pleased)
VI. Others: Comments (free text)
Data types – Discrete and continuous
Continuous:
  I. Numerical (double)
Discrete:
  II. Numerical (int)
  III. Boolean
  IV. Categorical
  V. Ordinal
  VI. Others
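As a rough illustration of how these types show up in practice, here is a sketch with pandas (assumed; the column names and values are made up) that inspects the inferred dtypes and marks the ordinal column explicitly:

    import pandas as pd

    df = pd.DataFrame({
        "income": [650.34, 1200.50, 980.00],        # numerical (double) -> continuous
        "n_children": [4, 1, 2],                    # numerical (int)    -> discrete
        "gender": ["male", "female", "male"],       # Boolean-like (two categories)
        "color": ["green", "blue", "red"],          # categorical
        "satisfaction": ["pleased", "neutral", "pleased"],  # ordinal
    })

    # Preserve the ordering of the ordinal variable explicitly
    df["satisfaction"] = pd.Categorical(
        df["satisfaction"], categories=["displeased", "neutral", "pleased"], ordered=True
    )

    print(df.dtypes)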
Categorical vs Boolean
• Categorical is essentially several Booleans that are
grouped by some logic
• Example
– Feature (color): Green, Blue, Red
vs
– Feature (isGreen): Yes/No
– Feature (isBlue): Yes/No
– Feature (isRed): Yes/No
Sometimes we convert categorical features into Booleans for machine learning (one-hot encoding); a sketch follows below.
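As noted above, a minimal sketch of one-hot encoding with pandas (assumed; the color values are made up):

    import pandas as pd

    colors = pd.DataFrame({"color": ["green", "blue", "red", "green"]})

    # One Boolean indicator column per category (is_blue, is_green, is_red)
    one_hot = pd.get_dummies(colors["color"], prefix="is", dtype=bool)
    print(one_hot)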
Why is knowledge of data type important?
• Model results are based on this input
– Distance measures
• Some models and techniques only use certain
data types
• Memory considerations
– Categorical vs Boolean (Male/Female or 0/1)
– Boolean data can be stored sparsely
Data Distribution Measures
Distribution measures 1: Mean, Median, Mode
• Mode
  – Good for nominal variables
  – Quick and easy
• Median
  – Robust central-tendency statistic
    • Less sensitive to outliers and extreme values
  – Good for “bad” distributions
• Mean
  – Most commonly used statistic for central tendency
    • Generally preferred, except for “bad” distributions
  – Based on all data in the distribution
  – Used for inference as well as description
    • Best estimator of the corresponding population parameter
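As a quick illustration of the three measures, a minimal sketch using Python's standard statistics module (the sample values are made up):

    import statistics

    # Made-up data with one extreme value (250)
    x = [25, 30, 30, 32, 35, 38, 40, 250]

    print("mean:  ", statistics.mean(x))    # pulled upward by the outlier
    print("median:", statistics.median(x))  # robust to the outlier
    print("mode:  ", statistics.mode(x))    # most frequent value (30)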
Distribution measures 2: Skewness & kurtosis
• Skewness (tails)
  – Skewness is a measure of the asymmetry of the probability distribution
  – Right skew: longer tail on the right (positive skewness)
  – Left skew: longer tail on the left (negative skewness)
  – Symmetric: skewness near zero
• Kurtosis (shoulders, heavy tails)
  – Kurtosis is the degree of peakedness of a distribution relative to a normal distribution; excess kurtosis is the kurtosis minus that of the normal distribution
  – A normal distribution is a mesokurtic distribution
  – A pure leptokurtic distribution has a higher peak than the normal distribution and heavier tails
  – A pure platykurtic distribution has a lower peak than the normal distribution and lighter tails
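A sketch of computing both statistics with scipy (assumed); the right-skewed sample is generated from a log-normal distribution purely for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.lognormal(mean=0.0, sigma=0.7, size=10_000)  # right-skewed sample

    print("skewness:       ", stats.skew(x))      # positive for a right skew
    print("excess kurtosis:", stats.kurtosis(x))  # 0 for a normal distribution (Fisher definition)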
Common continuous distributions
• Normal (Gaussian) Distribution
  – Z-score: the distance of a value from the mean, measured in standard deviations: z = (x - μ) / σ
• Log-normal Distribution
  – Used to model a variable which is a product of positive i.i.d. variables, e.g.
    • A compound return from a sequence of many trades
    • Measures of size of living tissue
• Student’s t-Distribution (Gosset, 1908)
  – Sampling distribution of the mean of i.i.d. measurements when the variance is estimated from the sample
  – Approaches the Gaussian distribution as the degrees of freedom grow large
  – Used for
    • Testing the difference between two sample means
    • Inference when the population variance is unknown
• The χ² Distribution with k degrees of freedom
  – Heavily used in statistics
    • Estimating variance
    • Goodness-of-fit tests
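A brief sketch with scipy.stats (assumed), showing z-scores and how the t distribution approaches the normal:

    import numpy as np
    from scipy import stats

    # Z-scores: distance from the sample mean in standard deviations
    x = np.array([4.0, 5.5, 7.0, 9.0])
    z = (x - x.mean()) / x.std(ddof=1)
    print("z-scores:", z)

    # The t distribution approaches the normal as the degrees of freedom grow
    for df in (2, 10, 100):
        print(f"P(T > 2), df={df}: {stats.t.sf(2, df):.4f}  (normal: {stats.norm.sf(2):.4f})")

    # Chi-square with k degrees of freedom, e.g. the 95th percentile used in goodness-of-fit tests
    print("chi2(k=3) 95th percentile:", stats.chi2.ppf(0.95, df=3))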
Common discrete distributions
• Bernoulli Distribution
  – Bernoulli trial
    • A trial with only two possible outcomes
  – Bernoulli Distribution
    • Represents success/failure (e.g. accuracy of a prediction): P(X = 1) = p, P(X = 0) = 1 - p
• Binomial Distribution
  – Number of successes in n independent trials: P(X = k) = C(n, k) p^k (1 - p)^(n-k)
  – If n is large, then the normal distribution N(np, np(1 - p)) is a good approximation
• Multinomial Distribution
  – Categorical Distribution
    • A trial with k possible outcomes, with probabilities p_1, …, p_k, where p_i ≥ 0 and p_1 + … + p_k = 1
  – Multinomial Distribution
    • Number of occurrences of each of the k categories in n independent trials
• Poisson Distribution
  – Number of events occurring within a fixed time interval (or space)
  – λ, the shape parameter, indicates the average number of events in the given time interval: P(X = k) = λ^k e^(-λ) / k!
  – If λ is large, then the normal distribution N(λ, λ) is a good approximation
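A sketch of sampling from these distributions with numpy (assumed); the parameters are arbitrary:

    import numpy as np

    rng = np.random.default_rng(42)

    # Bernoulli: a single success/failure trial with success probability p
    bernoulli = rng.binomial(n=1, p=0.3, size=10)

    # Binomial: number of successes in n independent trials
    binomial = rng.binomial(n=20, p=0.3, size=5)

    # Multinomial: counts of k categories in n independent trials
    multinomial = rng.multinomial(n=20, pvals=[0.2, 0.3, 0.5], size=3)

    # Poisson: number of events in a fixed interval with average rate lam
    poisson = rng.poisson(lam=4.0, size=5)

    print(bernoulli, binomial, multinomial, poisson, sep="\n")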
Comparing distributions
Examples of commonly used distribution tests:
• Q-Q plot
  – Compare distributions based on quantiles
• Kolmogorov–Smirnov (KS) test
  – Compare distributions based on the cumulative distribution function
• Shapiro–Wilk test for normality
  – Check whether data are normally distributed
• Two variants of the KS test that also compare two distributions
  – Cramér–von Mises criterion
  – Anderson–Darling test
Q-Q plot
• A plot of the quantiles of the first data set against the quantiles of the second data set
• The two data sets do not have to be the same size
• The greater the departure from the 45° reference line, the stronger the evidence that the two data sets come from populations with different distributions
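A sketch of a Q-Q plot against the normal distribution with scipy and matplotlib (both assumed); the sample is deliberately non-normal:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    sample = rng.lognormal(size=500)  # deliberately non-normal data

    # Quantiles of the sample vs. quantiles of a fitted normal distribution
    stats.probplot(sample, dist="norm", plot=plt)
    plt.show()  # points bending away from the reference line suggest non-normality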
Kolmogorov–Smirnov test
• A non-parametric test for the equality of continuous, one-dimensional probability distributions
• Can be applied to test a dataset’s distribution against a known distribution (one-sample) or against another dataset’s distribution (two-sample)
  – H0: The data follow the specified distribution
  – H1: The data do not follow the specified distribution
• The K-S statistic is the largest distance between the empirical CDF of the data, Fn(x), and the reference CDF, F(x):
  Dn = sup_x | Fn(x) - F(x) |
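A sketch of the one-sample and two-sample versions with scipy.stats (assumed); the samples are made up:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    a = rng.normal(loc=0.0, scale=1.0, size=300)
    b = rng.normal(loc=0.5, scale=1.0, size=300)  # shifted sample

    # One-sample KS: test `a` against a standard normal distribution
    print(stats.kstest(a, "norm"))

    # Two-sample KS: compare the empirical CDFs of `a` and `b`
    print(stats.ks_2samp(a, b))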
When to use which statistical test?
Using the correct statistical test, and correcting for multiple hypotheses, are recurrent issues in data science.
Each comparison below lists the test for normally distributed data, then for data that are not normally distributed (or are ranks or scores), then for binomial data (two possible values):
• Compare one set of data to a hypothetical value: one-sample t-test / Wilcoxon test / χ² test
• Compare two sets of independently collected (unpaired) data: unpaired t-test / Mann–Whitney test / χ² test or Fisher’s exact test
• Compare two sets of data from the same subjects under different circumstances (paired): paired t-test / Wilcoxon test / McNemar’s test
• Compare three or more sets of data: one-way ANOVA / Kruskal–Wallis test / χ² test
• Look for a relationship between two variables: Pearson correlation coefficient / Spearman correlation coefficient / contingency coefficients
• Look for a linear relationship between two variables: linear regression / nonparametric linear regression / simple logistic regression
• Look for a non-linear relationship between two variables: non-linear regression / nonparametric non-linear regression
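A small sketch of a few of these tests with scipy.stats (assumed); the two groups are made up:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    group_a = rng.normal(loc=10.0, scale=2.0, size=40)
    group_b = rng.normal(loc=11.0, scale=2.0, size=40)

    # Two independent samples, assuming normality: unpaired t-test
    print("unpaired t-test:  ", stats.ttest_ind(group_a, group_b))

    # Two independent samples, no normality assumption: Mann-Whitney U test
    print("Mann-Whitney test:", stats.mannwhitneyu(group_a, group_b))

    # Relationship between two variables: Pearson vs. Spearman correlation
    print("Pearson:  ", stats.pearsonr(group_a, group_b))
    print("Spearman: ", stats.spearmanr(group_a, group_b))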