Data Types and Measures of Data Distribution
Prepared by: Deeman Yousif Mahmood
PhD Student
Data types
Type: Example
I. Numerical (double): Income (e.g. 650.34)
II. Numerical (int): # of children (e.g. 4)
III. Boolean: Gender (e.g. male)
IV. Categorical: Colors (e.g. green)
V. Ordinal: Satisfaction (e.g. pleased)
VI. Others: Comments (free text)
Data types – Discrete and continuous
Continuous:
  I. Numerical (double)
Discrete:
  II. Numerical (int)
  III. Boolean
  IV. Categorical
  V. Ordinal
  VI. Others
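As a rough illustration of how these types show up in practice, here is a sketch with pandas (assumed; the column names and values are made up) that inspects the inferred dtypes and marks the ordinal column explicitly:

    import pandas as pd

    df = pd.DataFrame({
        "income": [650.34, 1200.50, 980.00],        # numerical (double) -> continuous
        "n_children": [4, 1, 2],                    # numerical (int)    -> discrete
        "gender": ["male", "female", "male"],       # Boolean-like (two categories)
        "color": ["green", "blue", "red"],          # categorical
        "satisfaction": ["pleased", "neutral", "pleased"],  # ordinal
    })

    # Preserve the ordering of the ordinal variable explicitly
    df["satisfaction"] = pd.Categorical(
        df["satisfaction"], categories=["displeased", "neutral", "pleased"], ordered=True
    )

    print(df.dtypes)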
Categorical vs Boolean
• Categorical is essentially several Booleans that are
grouped by some logic
• Example
– Feature (color): Green, Blue, Red
vs
– Feature (isGreen): Yes/No
– Feature (isBlue): Yes/No
– Feature (isRed): Yes/No
Sometimes we convert categorical features into Booleans for machine learning (one-hot encoding); a sketch follows below.
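As noted above, a minimal sketch of one-hot encoding with pandas (assumed; the color values are made up):

    import pandas as pd

    colors = pd.DataFrame({"color": ["green", "blue", "red", "green"]})

    # One Boolean indicator column per category (is_blue, is_green, is_red)
    one_hot = pd.get_dummies(colors["color"], prefix="is", dtype=bool)
    print(one_hot)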
Why is knowledge of data type important?
• Model results are based on this input
– Distance measures
• Some models and techniques only use certain
data types
• Memory considerations
– Categorical vs Boolean (Male/Female or 0/1)
– Boolean data can be stored sparsely
Data Distribution Measures
Distribution measures 1: Mean, Median, Mode
• Mode
  – Good for nominal variables
  – Quick and easy
• Median
  – Robust central-tendency statistic
    • Less sensitive to outliers and extreme values
  – Good for “bad” distributions
• Mean
  – Most commonly used statistic for central tendency
    • Generally preferred, except for “bad” distributions
  – Based on all data in the distribution
  – Used for inference as well as description
    • Best estimator of the corresponding population parameter
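As a quick illustration of the three measures, a minimal sketch using Python's standard statistics module (the sample values are made up):

    import statistics

    # Made-up data with one extreme value (250)
    x = [25, 30, 30, 32, 35, 38, 40, 250]

    print("mean:  ", statistics.mean(x))    # pulled upward by the outlier
    print("median:", statistics.median(x))  # robust to the outlier
    print("mode:  ", statistics.mode(x))    # most frequent value (30)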
Distribution measures 2: Skewness & kurtosis
• Skewness (tails)
  – Skewness is a measure of the asymmetry of the probability distribution
  – Right skew: longer tail on the right (positive skewness)
  – Left skew: longer tail on the left (negative skewness)
  – Symmetric: skewness near zero
• Kurtosis (shoulders, heavy tails)
  – Kurtosis is the degree of peakedness of a distribution relative to a normal distribution; excess kurtosis is the kurtosis minus that of the normal distribution
  – A normal distribution is a mesokurtic distribution
  – A pure leptokurtic distribution has a higher peak than the normal distribution and heavier tails
  – A pure platykurtic distribution has a lower peak than the normal distribution and lighter tails
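A sketch of computing both statistics with scipy (assumed); the right-skewed sample is generated from a log-normal distribution purely for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.lognormal(mean=0.0, sigma=0.7, size=10_000)  # right-skewed sample

    print("skewness:       ", stats.skew(x))      # positive for a right skew
    print("excess kurtosis:", stats.kurtosis(x))  # 0 for a normal distribution (Fisher definition)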
Common continuous distributions
• Normal (Gaussian) Distribution
  – Z-score: the distance of a value from the mean, measured in standard deviations: z = (x - μ) / σ
• Log-normal Distribution
  – Used to model a variable which is a product of positive i.i.d. variables, e.g.
    • A compound return from a sequence of many trades
    • Measures of size of living tissue
• Student’s t-Distribution (Gosset, 1908)
  – Sampling distribution of the mean of i.i.d. measurements when the variance is estimated from the sample
  – Approaches the Gaussian distribution as the degrees of freedom grow large
  – Used for
    • Testing the difference between two sample means
    • Inference when the population variance is unknown
• The χ² Distribution with k degrees of freedom
  – Heavily used in statistics
    • Estimating variance
    • Goodness-of-fit tests
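A brief sketch with scipy.stats (assumed), showing z-scores and how the t distribution approaches the normal:

    import numpy as np
    from scipy import stats

    # Z-scores: distance from the sample mean in standard deviations
    x = np.array([4.0, 5.5, 7.0, 9.0])
    z = (x - x.mean()) / x.std(ddof=1)
    print("z-scores:", z)

    # The t distribution approaches the normal as the degrees of freedom grow
    for df in (2, 10, 100):
        print(f"P(T > 2), df={df}: {stats.t.sf(2, df):.4f}  (normal: {stats.norm.sf(2):.4f})")

    # Chi-square with k degrees of freedom, e.g. the 95th percentile used in goodness-of-fit tests
    print("chi2(k=3) 95th percentile:", stats.chi2.ppf(0.95, df=3))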
Common discrete distributions
• Bernoulli Distribution
  – Bernoulli trial
    • A trial with only two possible outcomes
  – Bernoulli Distribution
    • Represents success/failure (e.g. accuracy of a prediction): P(X = 1) = p, P(X = 0) = 1 - p
• Binomial Distribution
  – Number of successes in n independent trials: P(X = k) = C(n, k) p^k (1 - p)^(n-k)
  – If n is large, then the normal distribution N(np, np(1 - p)) is a good approximation
• Multinomial Distribution
  – Categorical Distribution
    • A trial with k possible outcomes, with probabilities p_1, …, p_k, where p_i ≥ 0 and p_1 + … + p_k = 1
  – Multinomial Distribution
    • Number of occurrences of each of the k categories in n independent trials
• Poisson Distribution
  – Number of events occurring within a fixed time interval (or space)
  – λ, the shape parameter, indicates the average number of events in the given time interval: P(X = k) = λ^k e^(-λ) / k!
  – If λ is large, then the normal distribution N(λ, λ) is a good approximation
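A sketch of sampling from these distributions with numpy (assumed); the parameters are arbitrary:

    import numpy as np

    rng = np.random.default_rng(42)

    # Bernoulli: a single success/failure trial with success probability p
    bernoulli = rng.binomial(n=1, p=0.3, size=10)

    # Binomial: number of successes in n independent trials
    binomial = rng.binomial(n=20, p=0.3, size=5)

    # Multinomial: counts of k categories in n independent trials
    multinomial = rng.multinomial(n=20, pvals=[0.2, 0.3, 0.5], size=3)

    # Poisson: number of events in a fixed interval with average rate lam
    poisson = rng.poisson(lam=4.0, size=5)

    print(bernoulli, binomial, multinomial, poisson, sep="\n")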
Comparing distributions
Examples of commonly used distribution tests:
• Q-Q plot
  – Compare distributions based on quantiles
• Kolmogorov–Smirnov (KS) test
  – Compare distributions based on the cumulative distribution function
• Shapiro–Wilk test for normality
  – Check whether data are normally distributed
• Two variants of the KS test that also compare two distributions
  – Cramér–von Mises criterion
  – Anderson–Darling test
Q-Q plot
• A plot of the quantiles of the first data set against the quantiles of the second data set
• The two data sets do not have to be the same size
• The greater the departure from the 45° reference line, the stronger the evidence that the two data sets come from populations with different distributions
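A sketch of a Q-Q plot against the normal distribution with scipy and matplotlib (both assumed); the sample is deliberately non-normal:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    sample = rng.lognormal(size=500)  # deliberately non-normal data

    # Quantiles of the sample vs. quantiles of a fitted normal distribution
    stats.probplot(sample, dist="norm", plot=plt)
    plt.show()  # points bending away from the reference line suggest non-normality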
Kolmogorov–Smirnov test
• A non-parametric test for the equality of continuous, one-dimensional probability distributions
• Can be applied to test a dataset’s distribution against a known distribution (one-sample) or against another dataset’s distribution (two-sample)
  – H0: The data follow the specified distribution
  – H1: The data do not follow the specified distribution
• The K-S statistic is the largest distance between the empirical CDF of the data, Fn(x), and the reference CDF, F(x):
  Dn = sup_x | Fn(x) - F(x) |
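A sketch of the one-sample and two-sample versions with scipy.stats (assumed); the samples are made up:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    a = rng.normal(loc=0.0, scale=1.0, size=300)
    b = rng.normal(loc=0.5, scale=1.0, size=300)  # shifted sample

    # One-sample KS: test `a` against a standard normal distribution
    print(stats.kstest(a, "norm"))

    # Two-sample KS: compare the empirical CDFs of `a` and `b`
    print(stats.ks_2samp(a, b))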
When to use which statistical test?
Using the correct statistical test, and correcting for multiple hypotheses, are recurrent issues in data science.
Each comparison below lists the test for normally distributed data, then for data that are not normally distributed (or are ranks or scores), then for binomial data (two possible values):
• Compare one set of data to a hypothetical value: one-sample t-test / Wilcoxon test / χ² test
• Compare two sets of independently collected (unpaired) data: unpaired t-test / Mann–Whitney test / χ² test or Fisher’s exact test
• Compare two sets of data from the same subjects under different circumstances (paired): paired t-test / Wilcoxon test / McNemar’s test
• Compare three or more sets of data: one-way ANOVA / Kruskal–Wallis test / χ² test
• Look for a relationship between two variables: Pearson correlation coefficient / Spearman correlation coefficient / contingency coefficients
• Look for a linear relationship between two variables: linear regression / nonparametric linear regression / simple logistic regression
• Look for a non-linear relationship between two variables: non-linear regression / nonparametric non-linear regression
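A small sketch of a few of these tests with scipy.stats (assumed); the two groups are made up:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    group_a = rng.normal(loc=10.0, scale=2.0, size=40)
    group_b = rng.normal(loc=11.0, scale=2.0, size=40)

    # Two independent samples, assuming normality: unpaired t-test
    print("unpaired t-test:  ", stats.ttest_ind(group_a, group_b))

    # Two independent samples, no normality assumption: Mann-Whitney U test
    print("Mann-Whitney test:", stats.mannwhitneyu(group_a, group_b))

    # Relationship between two variables: Pearson vs. Spearman correlation
    print("Pearson:  ", stats.pearsonr(group_a, group_b))
    print("Spearman: ", stats.spearmanr(group_a, group_b))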