Data Mining:
Concepts and Techniques
— Chapter 2 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2013 Han, Kamber, and Pei. All rights reserved.
1
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
2
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix,
timeout
season
coach
game
score
team
ball
lost
pla
crosstabs
wi
n
y
Document data: text documents: term-
frequency vector
Document 1 3 0 5 0 2 6 0 2 0 2
Transaction data
Graph and network Document 2 0 7 0 2 1 0 0 3 0 0
World Wide Web Document 3 0 1 0 0 1 2 2 0 3 0
Social or information networks
Molecular Structures
Ordered TID Items
Video data: sequence of images 1 Bread, Coke, Milk
Temporal data: time-series
2 Beer, Bread
Sequential Data: transaction sequences
3 Beer, Coke, Diaper, Milk
Genetic sequence data
Spatial, image and multimedia: 4 Beer, Bread, Diaper, Milk
Spatial data: maps 5 Coke, Diaper, Milk
Image data:
Video data:
3
Important Characteristics of Structured Data
Dimensionality
Curse of dimensionality
Sparsity
Only presence counts
Resolution
Patterns depend on the scale
Distribution
Centrality and dispersion
4
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples , examples, instances, data points, objects,
tuples.
Data objects are described by attributes.
Database rows -> data objects; columns ->attributes.
5
Attributes
Attribute (or dimensions, features, variables): a data
field, representing a characteristic or feature of a data
object.
E.g., customer _ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
6
Attribute Types
Nominal:
Nominal means “relating to names.”
The values of a nominal attribute are symbols or “names of things”.
Each value represents some kind of category, code, or state.
So nominal attributes are also referred to as categorical.
The values do not have any meaningful order.
Hair_color = { black, brown, grey, red, white}
Occupation = {teacher, dentist, programmer, farmer }
It is possible to represent the values of as symbols with numbers.
With hair color, we can assign a code of 0 for black, 1 for brown, and so on.
Another example is customor ID, with possible values that are all numeric.
In such cases, the numbers are not intended to be used quantitatively.
Mathematical operations on values of nominal attributes are not meaningful.
A nominal attribute may have integers as values, it is not considered as a
numeric attribute because the integers are not meant to be used
quantitatively.
7
Attribute Types
Binary
Nominal attribute with only 2 states (0 and 1)
Binary attributes are referred to as Boolean if the two states
correspond to true and false.
Symmetric binary:
its states are equally valuable and carry the same weight
There is no preference on which outcome should be coded as 0 or 1.
e.g., gender
Asymmetric binary:
The outcomes of the states are not equally important,
We code the most important outcome, which is usually the rarest one,
by 1 and the other by 0.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
8
Attribute Types
Ordinal
An attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is
not known.
Size = {small, medium, large}
Grade = (e.g., A+, A, A-, B+, and so on)
Ordinal attributes are useful for registering subjective assessments of
qualities.
Cannot be measured objectively.
Ordinal attributes are often used in surveys for ratings.
Nominal , binary, and ordinal attributes are qualitative.
They describe a feature of an object without giving an actual size or
quantity.
The values of such qualitative attributes are typically words representing
categories.
9
Numeric Attribute Types
A numeric attribute is quantitative.
It is a measurable quantity, represented in integer or real values.
Numeric attributes can be interval-scaled or ratio-scaled.
Interval-scaled
Measured on a scale of equal-sized units.
The values have order and can be positive, 0, or negative.
provides a ranking of values, Compare and quantify the
difference between values.
The outdoor temperature value for a number of different days.
By ordering the values, we obtain a ranking of the objects with
respect to temperature.
We can quantify the difference between values.
For example, a temperature of 20˚ C is five degrees higher than a
temperature of 15˚C.
10
Numeric Attribute Types
Calendar dates are another example. For instance, the
years 2002 and 2010 are eight years apart.
Temperatures in Celsius and Fahrenheit do not have a
true zero-point, that is, neither 0˚C nor 0˚ indicates “no
temperature.”
Ratio-scaled
Inherent zero-point
We can speak of values as being an order of magnitude
larger than the unit of measurement (10 K˚ is twice as
high as 5 K˚).
e.g., temperature in Kelvin, length, counts,
monetary quantities
11
Discrete vs. Continuous Attributes
Classification algorithms developed often talk of attributes as
being either discrete or continuous.
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a
collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented
using a finite number of digits
Continuous attributes are typically represented as floating-
point variables 12
Chapter 2: Getting to Know Your Data
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity
Summary
13
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency, variation
and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of
precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
14
Measuring the Central Tendency
Various ways to measure the central tendency of data.
We have some attribute X, like salary, which has been
recorded for a set of objects.
Let x1,x2, : : : ,xN be the set of N observed values or
observations for X.
These values may also be referred to as the data set.
If we were to plot the observations for salary, where would
most of the values fall?
This gives us an idea of the central tendency of the data.
Measures of central tendency include the mean, median,
mode, and midrange.
15
MEAN
The most common and effective numeric measure of
the “center” of a set of data is the (arithmetic) mean.
Let x1,x2, : : : ,xN be a set of N values or observations,
such as for some numeric attribute X, like salary.
The mean of this set of values is
1 x1 x 2 ... xN
n
x xi
N i 1 N
16
MEAN
Sometimes, each value xi in a set may be associated with a
weight wi for i = 1, … ,N.
The weights reflect the significance, importance, or occurrence
frequency attached to their respective values.
In this case, we can compute
n
w x i i
w 1x1 w 2 x 2 ... w N x N
x i 1
n
w 1 w 2 ... w N
w
i 1
i
This is called the weighted arithmetic mean or the weighted
average.
17
MEAN
A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values.
Even a small number of extreme values can corrupt the mean.
For example, the mean salary at a company may be substantially
pushed up by that of a few highly paid managers.
Similarly, the mean score of a class in an exam could be pulled down
quite a bit by a few very low scores.
To offset the effect caused by a small number of extreme values, we can
instead use the trimmed mean.
which is the mean obtained after chopping off values at the high and
low extremes.
For example, we can sort the values observed for salary and remove
the top and bottom 2% before computing the mean.
We should avoid trimming too large a portion (such as 20%) at both
ends, as this can result in the loss of valuable information.
18
MEDIAN
The data are already sorted in increasing order.
If there is an even number of observations (i.e., 12); the median is not
unique.
Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
It can be any value within the two middlemost values of 52 and 56.
By convention, we assign the average of the two middlemost values as the
median; that is, (52+56) / 2 = 54.
The median is $54,000.
Suppose that we had only the first 11 values in the list. Given an odd
number of values, the median is the middlemost value. This is the sixth
value in this list, which has a value of $52,000.
The median is expensive to compute when we have a large number of
observations.
For numeric attributes, however, we can easily approximate the value.
19
MEDIAN
If that data are grouped in intervals according to their xi data values and that the
frequency of each interval is known.
For example, employees may be grouped according to their annual salary in
intervals such as $10–20,000, $20–30,000, and so on.
Let the interval that contains the median frequency be the median interval.
We can approximate the median of the entire data set (e.g., the median salary)
by interpolation using the formula
n / 2 ( freq )l
median L1 ( ) width
freq m edian
where L1 is the lower boundary of the median interval.
N is the number of values in the entire data set.
(∑ freq )l is the sum of the frequencies of all of the intervals that are lower than
the median interval.
freqmedian is the frequency of the median interval.
width is the width of the median interval. 20
MODE
The mode is another measure of central tendency.
The mode for a set of data is the value that occurs most frequently in
the set.
Therefore, it can be determined for qualitative and quantitative
attributes.
It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode.
Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
A data set with two or more modes is multimodal.
If each data value occurs only once, then there is no mode.
Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63,
70, 70, 110.
The two modes are $52,000 and $70,000.
21
MIDRANGE
The midrange can also be used to assess the central tendency of a
numeric data set.
It is the average of the largest and smallest values in the set.
This measure is easy to compute using the SQL aggregate functions,
max() and min().
The midrange of the data of Example is ( 30,000 + 110,000 ) / 2 =
$70,000.
In a unimodal frequency curve with perfect symmetric data
distribution, the mean, median, and mode are all at the same center
value.
Data in most real applications are not symmetric.
They may instead be either positively skewed, where the mode
occurs at a value that is smaller than the median or negatively
skewed, where the mode occurs at a value greater than the median.
22
Symmetric vs. Skewed Data
Median, mean and mode of symmetric
symmetric, positively and negatively
skewed data
positively skewed negatively skewed
February 23, 2015 Data Mining: Concepts and Techniques 23