Data Mining:
Exploring Data
Created by: Tan, Steinbach, Kumar
Modified by: Thanrat Sintanakul
Data Mining Thanrat Sintanakul Department of Computer Education KMUTNB 1
What is Data Exploration?
A preliminary exploration of the data to better
understand its characteristics.
• Key motivations of data exploration include
– Helping to select the right tool for preprocessing
or analysis
– Making use of humans’ abilities to recognize
patterns
• People can recognize patterns not
captured by data analysis tools
What is Data Exploration?
• 3 major topics in exploring data:
– Summary Statistics: mean (x̄), standard
deviation (S.D.)
– Visualization Techniques: histograms,
scatter plots
– On-Line Analytical Processing (OLAP): a
set of techniques for exploring
multidimensional arrays of values
What is Data Exploration?
• OLAP-related analysis functions focus on
various ways to create summary data tables
from a multidimensional data array, by
aggregating data either across various
dimensions or across various attribute values
• For instance, consider sales information
reported according to product, location, and
date
– OLAP techniques can be used to create a
summary that describes the sales activity at a
particular location by month and product category
Techniques Used In Data Exploration
• In EDA (Exploratory Data Analysis), as
originally defined by John Tukey:
– The focus was on visualization
– Clustering and anomaly detection were
viewed as exploratory techniques
• In data mining, clustering and anomaly
detection are major areas of interest,
and are not thought of as merely
exploratory
Iris Sample Data Set
• Many of the exploratory data techniques
are illustrated with the Iris Plant data set.
– 5 Attributes;
• Three flower types (classes):
– Setosa
– Virginica
– Versicolour
• Four (non-class) attributes
– Sepal width and length
– Petal width and length
Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
Summary Statistics
• Summary statistics are numbers that
summarize properties of the data
– Summarized properties include frequency,
location and spread
• Examples: location - mean
spread - standard deviation
– Most summary statistics can be calculated
in a single pass through the data
Summary Statistics
• For example;
– The average household income
– The fraction(percentage) of college
students who complete an undergraduate
degree in 4 years
Frequency and Mode
• The frequency of an attribute value is
the percentage of time the value
occurs in the data set
– For example, given the attribute ‘gender’
and a representative population of people,
the gender ‘female’ occurs about 50% of
the time.
• The mode of an attribute is the most
frequent attribute value
Frequency and Mode
No.  Gender  Department
1    M       Electronics
2    F       Information Technology
3    F       Information Technology
4    F       Business Computer
5    M       Electronics
6    M       Business Computer
7    F       Business Computer
8    F       Information Technology
9    F       Business Computer
10   F       Business Computer
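The frequencies and modes for the table above can be checked with a short sketch. The helper names `frequencies` and `mode` are illustrative, not part of the original slides, and ties for the mode are broken by first occurrence, which is an assumption.

```python
from collections import Counter

# The ten (gender, department) rows from the table above.
rows = [
    ("M", "Electronics"),
    ("F", "Information Technology"),
    ("F", "Information Technology"),
    ("F", "Business Computer"),
    ("M", "Electronics"),
    ("M", "Business Computer"),
    ("F", "Business Computer"),
    ("F", "Information Technology"),
    ("F", "Business Computer"),
    ("F", "Business Computer"),
]


def frequencies(values):
    """Return {value: fraction of records that have this value}."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def mode(values):
    """Return the most frequent value (ties go to the first one seen)."""
    return Counter(values).most_common(1)[0][0]


genders = [gender for gender, _ in rows]
departments = [department for _, department in rows]

gender_freq = frequencies(genders)
department_freq = frequencies(departments)

print(gender_freq)        # fraction of M and F
print(mode(departments))  # most frequent department
```

For this table the mode of Gender is F and the mode of Department is Business Computer.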
Frequency and Mode
• The notions of frequency and mode are
typically used with Categorical Data
• For Categorical Attributes
Frequency & Mode can be interesting &
useful
• For continuous data, the mode is often
not useful because a single value may
not occur more than once
Percentiles
• For continuous data, the notion of a
percentile is more useful.
• Given an ordinal or continuous attribute
x and a number p between 0 and 100,
the pth percentile is a value x_p of x such
that p% of the observed values of x
are less than x_p.
• For instance, the 50th percentile is
the value x_50% such that 50% of all
values of x are less than x_50%.
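A minimal sketch of one way to compute a percentile. It assumes the nearest-rank convention (the smallest observed value with at least p% of the data at or below it). Other conventions interpolate between neighbouring values, and the slides do not say which one is intended. The function name `percentile` is illustrative.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of numbers.

    Returns the smallest observed value such that at least p% of the
    observations are less than or equal to it.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    # Rank is 1-based; p = 0 maps to the smallest value.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


data = list(range(1, 11))  # 1, 2, ..., 10

print(percentile(data, 25))
print(percentile(data, 50))
print(percentile(data, 100))
```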
Percentiles
• See example in the sheet
Measures of Location: Mean and Median
• The mean is the most common measure of
the location of a set of points.
• However, the mean is very sensitive to
outliers.
• Thus, the median or a trimmed mean is also
commonly used.
Measures of Location: Mean and Median
• The mean and median are 2 of the most
widely used measures of location for
continuous data
• Example: consider the heights of 15
students, as follows:
• {130,121,123,124,123,127,131,125,
122,120,127,128,125,124,124}
• Mean = Σx_i/n = 1874/15 = 124.93
• Median: sort the values and take the middle
(8th of 15) value, which is 124
– 120,121,122,123,123,124,124,124,125,
125,127,127,128,130,131
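The arithmetic in this example can be reproduced with Python's standard `statistics` module. This is only a check of the numbers on the slide; the variable names are illustrative.

```python
from statistics import mean, median

# Heights of the 15 students from the example.
heights = [130, 121, 123, 124, 123, 127, 131, 125,
           122, 120, 127, 128, 125, 124, 124]

height_sum = sum(heights)        # numerator of the mean
height_mean = mean(heights)      # sum divided by the number of values
height_median = median(heights)  # middle value of the sorted list

print(height_sum, round(height_mean, 2), height_median)
```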
Measures of Location: Mean and Median
• The mean can be interpreted as the middle
of a set of values when:
– The values are distributed in a symmetric
manner (e.g. a normal curve)
– There are no outliers in the data set
• Conversely, the median is the better
choice when:
– The distribution of values is skewed
– The data contain outliers
Measures of Location: Mean and Median
• To overcome problems with the
traditional definition of a mean, the
notion of a trimmed mean is
sometimes used
• A percentage p between 0 and 100 is
specified, the top and bottom (p/2)% of
the data are thrown out, and the mean
of the remaining values is computed
Measures of Location: Mean and Median
• Example: consider the set of values
{1,2,3,4,5,90}
• Traditional mean = 17.5 (does not represent
the true central value)
• Median = 3.5
• Trimmed mean with p = 40%:
– Remove the smallest 20% of the values
(1.2 values, i.e. 1 value) and the largest 20%
(1.2 values, i.e. 1 value), which leaves {2,3,4,5}
– Then compute the mean as usual, which gives 3.5
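The trimmed-mean steps can be sketched as follows. The function name `trimmed_mean` is illustrative. The number of values dropped from each end is rounded down, which matches the example (1.2 values becomes 1 value), but applying that rounding rule in general is an assumption.

```python
def trimmed_mean(values, p):
    """Mean after discarding the lowest and highest (p/2)% of the values.

    The count dropped from each end is rounded down, so 20% of
    6 values (1.2) removes 1 value per end.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p < 100:
        raise ValueError("p must satisfy 0 <= p < 100")
    ordered = sorted(values)
    k = int(len(ordered) * (p / 2) / 100)  # values to drop from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 5, 90]

print(trimmed_mean(data, 0))   # ordinary mean
print(trimmed_mean(data, 40))  # mean of {2, 3, 4, 5}
```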
Measures of Spread: Range and Variance
• Commonly used for Continuous Data
• Measure the dispersion or spread of a
set of values, by indicating if the
attribute values are widely spread out
or if they are relatively concentrated
around a single point, e.g. the Mean
• Range is the difference between the
maximum and minimum values:
Range(x) = max(x) - min(x) = x_(m) - x_(1),
where x_(1) and x_(m) are the smallest and
largest of the m values
Measures of Spread: Range and Variance
• The range can be misleading if most of
the values are concentrated in a
narrow band, but there are also a
relatively small number of more
extreme values (outliers)
• The variance or standard deviation is
the most common measure of the
spread of a set of points.
Measures of Spread: Range and Variance
• The standard deviation (S.D.), written
s_x, is the square root of the variance
• The mean can be distorted by
outliers, and since the variance is
computed using the mean, it is also
sensitive to outliers
Measures of Spread: Range and Variance
• For this reason, more robust measures of
spread are often used:
– AAD: Absolute Average Deviation
– MAD: Median Absolute Deviation
– IQR: Interquartile Range
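A sketch comparing these measures of spread on the outlier example {1,2,3,4,5,90}. The slides give only the names, so the definitions below are assumptions: AAD is taken as the mean of |x - mean|, MAD as the median of |x - median| (some texts centre it on the mean instead), and IQR as the 75th percentile minus the 25th, using the "inclusive" quartile method of `statistics.quantiles`. The function names are illustrative.

```python
from statistics import mean, median, quantiles, stdev


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def aad(values):
    """Absolute average deviation: mean of |x - mean|."""
    m = mean(values)
    return mean([abs(x - m) for x in values])


def mad(values):
    """Median absolute deviation, centred on the median (assumption)."""
    med = median(values)
    return median([abs(x - med) for x in values])


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data = [1, 2, 3, 4, 5, 90]

print("range:", value_range(data))
print("S.D. :", stdev(data))  # sample standard deviation
print("AAD  :", aad(data))
print("MAD  :", mad(data))
print("IQR  :", iqr(data))
```

The single outlier (90) inflates the range, the standard deviation, and the AAD, while the MAD and the IQR stay small.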
Visualization
• Visualization is the conversion of data
into a visual or tabular format so that
the characteristics of the data and the
relationships among data items or
attributes can be analyzed or reported.
• Sometimes the use of visualization
techniques in data mining is referred to
as Visual Data Mining
Visualization
• Visualization of data is one of the
most powerful and appealing techniques
for data exploration.
– Humans have a well developed ability to
analyze large amounts of information that
is presented visually
– Can detect general patterns and trends
– Can detect outliers and unusual patterns
Example: Sea Surface Temperature
• The following shows the Sea Surface
Temperature (SST) for July 1982
– Tens of thousands of data points are
summarized in a single figure
Representation
• Is the mapping of information to a visual
format
• Data objects, their attributes, and the
relationships among data objects are
translated into graphical elements such as
points, lines, shapes, and colors.
• Example:
– Objects are often represented as points
– Their attribute values can be represented as the
position of the points or the characteristics of
the points, e.g., color, size, and shape
– If position is used, then the relationships of
points, i.e., whether they form groups or a point
is an outlier, are easily perceived.
Arrangement
• Is the placement of visual elements
within a display
• Can make a large difference in how
easy it is to understand the data
• Example:
Selection
• Is the elimination or the de-emphasis
of certain objects and attributes
• Selection may involve choosing a
subset of attributes
– Dimensionality reduction is often used to
reduce the number of dimensions to two or
three
– Alternatively, pairs of attributes can be
considered
Selection
• Selection may also involve choosing a
subset of objects
– A region of the screen can only show so
many points
– The data can be sampled, but points
in sparse areas should be preserved
Visualization Techniques
• 3 categories of visualization
techniques:
– Visualization of data with a small number
of attributes
– Visualization of data with spatial and/or
temporal attributes
– Visualization of data with many
attributes
Visualization Techniques
• Visualization of data with a small number
of attributes:
• Stem & Leaf Plots
• Histograms
• Box Plots
• Pie Chart
• Scatter Plots
Visualization Techniques: Stem & Leaf Plots
• Can be used to provide insight into
the distribution of one-dimensional
integer or continuous data
• example: consider the set of integers
of the sepal length (from the Iris
Data Set) in cm.
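A minimal sketch of how a stem-and-leaf plot is built for integer data: the tens digits form the stem and the units digit is the leaf. The sample values are hypothetical, not the actual Iris measurements, and `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (tens) with sorted leaves (units)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


# Hypothetical integer values, for illustration only.
sample = [43, 44, 44, 45, 50, 51, 51, 57, 62, 63]
plot = stem_and_leaf(sample)

for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} : {leaves}")
```

Each output row shows one stem followed by all its leaves, so the row lengths give a rough picture of the distribution.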
Visualization Techniques: Histograms
• Histogram
– Usually shows the distribution of values of a single variable
– Divide the values into bins and show a bar plot of the
number of objects in each bin.
– The height of each bar indicates the number of objects
– Shape of histogram depends on the number of bins
• Example: Petal Width (10 and 20 bins, respectively)
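The binning step can be sketched as below, assuming equal-width bins that span the minimum to the maximum value, with the maximum counted in the last bin. The data are hypothetical and `histogram` is an illustrative name. Changing `num_bins` changes the shape of the result, as noted above.

```python
def histogram(values, num_bins):
    """Count values in num_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    counts = [0] * num_bins
    if hi == lo:
        # All values are identical: put everything in the first bin.
        counts[0] = len(values)
        return counts
    width = (hi - lo) / num_bins
    for value in values:
        # The maximum value falls into the last bin, not a new one.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical values, for illustration only.
values = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]

print(histogram(values, 3))
print(histogram(values, 9))
```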
Two-Dimensional Histograms
• Show the joint distribution of the
values of two attributes
• Each attribute is divided into intervals
and the 2 sets of intervals define
2-dimensional rectangles of values
• Can be used to discover interesting
facts about how the values of 2
attributes co-occur
Two-Dimensional Histograms
• Example: petal width and petal length
– What does this tell us?
Two-Dimensional Histograms
• Example(cont.):
– Shows a 2-dimensional histogram of petal
length & petal width (from Iris Data Set)
– Because each attribute is split into 3 bins,
there are 9 rectangular 2-dimensional bins
– The height of each rectangular bar indicates
the number of objects (flowers in this case) that fall
into each bin
– Most of the flowers fall into only 3 of the
bins – along the diagonal
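A sketch of the counting behind a two-dimensional histogram with 3 bins per attribute, giving a 3 x 3 grid of counts. The (length, width) pairs are hypothetical and chosen so that the counts fall on the diagonal, as in the Iris example. Equal-width bins and the function names are assumptions.

```python
def bin_index(value, lo, hi, num_bins):
    """Index of the equal-width bin in [lo, hi] that contains value."""
    if hi == lo:
        return 0
    width = (hi - lo) / num_bins
    return min(int((value - lo) / width), num_bins - 1)


def histogram_2d(xs, ys, num_bins):
    """Return counts[i][j]: objects in x-bin i and y-bin j."""
    counts = [[0] * num_bins for _ in range(num_bins)]
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    for x, y in zip(xs, ys):
        i = bin_index(x, x_lo, x_hi, num_bins)
        j = bin_index(y, y_lo, y_hi, num_bins)
        counts[i][j] += 1
    return counts


# Hypothetical paired measurements, for illustration only.
lengths = [1, 2, 2, 4, 5, 5, 7, 8, 10]
widths = [1, 1, 2, 4, 5, 6, 8, 9, 10]

counts = histogram_2d(lengths, widths, 3)
for row in counts:
    print(row)
```

Non-zero counts only on the diagonal indicate that the two attributes tend to be low together, medium together, or high together.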
Visualization Techniques: Box Plots
• Box Plots
– Another way of displaying the distribution
of the values of a single numerical attribute
– The following figure shows the basic parts
of a box plot
[Figure: box plot labeled, from top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]
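The numbers a box plot of this kind displays can be computed as in the sketch below. It assumes the whiskers sit at the 10th and 90th percentiles and that anything outside them is drawn as an outlier, as the figure labels suggest. Percentiles use the "inclusive" interpolation of `statistics.quantiles`, which is one convention among several. `box_plot_stats` is an illustrative name and the data are hypothetical.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Return ({percentile: value}, outliers) for a 10/25/50/75/90 box plot."""
    # cuts[p - 1] is the p-th percentile for p = 1..99.
    cuts = quantiles(values, n=100, method="inclusive")
    stats = {p: cuts[p - 1] for p in (10, 25, 50, 75, 90)}
    outliers = [v for v in values if v < stats[10] or v > stats[90]]
    return stats, outliers


# Hypothetical data: the integers 0..100.
data = list(range(101))
stats, outliers = box_plot_stats(data)

print(stats)
print(len(outliers), "values fall outside the whiskers")
```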
Example of Box Plots
• Box plots can be used to compare attributes
Example of Box Plots
• Box Plots can also be used to compare
how attributes vary between different
classes of objects
Visualization Techniques: Pie Chart
• Typically used with categorical
attributes that have a relatively
small number of values
• Uses the relative area of a circle to
indicate relative frequency
Visualization Techniques: Scatter Plots
• Scatter plots
– Attribute values determine the position
– Two-dimensional scatter plots are most
common, but three-dimensional scatter
plots are also possible
– Often, additional attributes can be displayed
by using the size, shape, and color of the
markers that represent the objects
– Arrays of scatter plots are useful
because they can compactly summarize the
relationships of several pairs of attributes
• See example on the next slide
Scatter Plot Array of Iris Attributes
3-Dimensional Scatter Plot
Visualization Techniques
• Visualization of spatio-temporal
data:
• Contour Plots
• Surface Plots
• Vector Field Plots
• Lower-Dimensional Slices
• Animation
Visualization Techniques: Contour Plots
• Contour plots
– Useful when a continuous attribute is measured
on a spatial grid
– They partition the plane into regions of similar
values
– The contour lines that form the boundaries of
these regions connect points with equal values
– The most common example is contour maps of
elevation
– Can also display temperature, rainfall, air
pressure, etc.
• An example for Sea Surface Temperature (SST) is
provided on the next slide
Contour Plot Example: SST Dec, 1998
[Figure: contour plot of sea surface temperature; color scale in degrees Celsius]
Visualization Techniques: Surface Plots
Visualization Techniques: Vector Field Plots
Visualization Techniques: Lower-Dimensional Slices
Visualization Techniques
• Visualization of Higher Dimensional
Data;
• Matrix Plots
• Parallel Coordinates
• Star Coordinates & Chernoff Faces
Visualization Techniques: Matrix Plots
• Data Matrix :
– is a rectangular array of values
– Can be visualized as an image by associating
each entry of the data matrix with a pixel
in the image
– The brightness or color of the pixel is
determined by the value of the
corresponding entry of the matrix
Visualization Techniques: Matrix Plots
• Matrix plots
– Can plot the data matrix
– This can be useful when objects are
sorted according to class
– Typically, the attributes are normalized
to prevent one attribute from dominating
the plot
– Plots of similarity or distance matrices
can also be useful for visualizing the
relationships between objects
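The normalization step mentioned above can be sketched as column-wise standardization (z-scores), so that every attribute has mean 0 and standard deviation 1 before it is mapped to pixel brightness. Using the population standard deviation is an assumption, as are the function name and the small hypothetical matrix.

```python
from statistics import mean, pstdev


def standardize_columns(matrix):
    """Rescale each column of a row-major matrix to mean 0 and S.D. 1."""
    columns = list(zip(*matrix))
    means = [mean(column) for column in columns]
    sds = [pstdev(column) for column in columns]
    return [
        # A constant column (S.D. of 0) is mapped to 0.0 everywhere.
        [(x - m) / s if s else 0.0 for x, m, s in zip(row, means, sds)]
        for row in matrix
    ]


# Hypothetical 3 x 2 data matrix whose second column has a much larger scale.
matrix = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
z = standardize_columns(matrix)

for row in z:
    print(row)
```

After standardization both columns cover the same range, so neither dominates the plot.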
Visualization Techniques: Matrix Plots
• Example: consider the standardized
data matrix for the Iris Data Set
• 3 species;
– 1st 50 rows – Setosa
– 2nd 50 rows – Versicolour
– 3rd 50 rows – Virginica
• Petal width & length;
– Setosa < average
– Versicolour ≈ average
– Virginica > average
Visualization of the Iris Data Matrix
[Figure: image plot of the standardized Iris data matrix; color scale in standard deviations]
Visualization Techniques: Matrix Plots
• Example: consider the correlation
matrix for the Iris Data Set
• The flowers in each group are most
similar to each other
• But Versicolour & Virginica are more
similar to one another than to Setosa
Visualization of the Iris Correlation Matrix
[Figure: image plot of the Iris correlation matrix; color scale shows correlation]
Visualization Techniques: Parallel Coordinates
• Parallel Coordinates
– Used to plot the attribute values of high-
dimensional data
– Instead of using perpendicular axes, use a set of
parallel axes
– The attribute values of each object are plotted
as a point on each corresponding coordinate axis
and the points are connected by a line
– Thus, each object is represented as a line
– Often, the lines representing a distinct class of
objects group together, at least for some
attributes
– Ordering of attributes is important in seeing such
groupings
Parallel Coordinates Plots for Iris Data
• Example: consider a parallel coordinates
plot of the 4 numerical attributes
of the Iris Data Set
• The plot shows that the classes are
reasonably well separated for
petal width & length, but less well
separated for sepal width & length
Parallel Coordinates Plots for Iris Data
• Another parallel coordinates plot,
with a different ordering of the
axes, of the 4 numerical
attributes of the Iris Data Set
Visualization Techniques : Star Plot
• Star Plots
– Similar approach to parallel coordinates,
but axes radiate from a central point
– The line connecting the values of an
object is a polygon
– The size & shape of
the polygon give a
visual description of
the attribute values
of the object
Star Plots for Iris Data
[Figure: star plots, with panels labeled Setosa, Versicolour, and Virginica]
Visualization Techniques : Chernoff Faces
• Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute
with a characteristic of a face
– The values of each attribute determine
the appearance of the corresponding facial
characteristic
– Each object becomes a
separate face
– Relies on humans’ ability to
distinguish faces
Visualization Techniques : Chernoff Faces
• Chernoff Faces(cont.)
Data Feature    Facial Feature
Sepal Length    Size of face
Sepal Width     Forehead/jaw relative arc length
Petal Length    Shape of forehead
Petal Width     Shape of jaw
• Other features of the face, e.g. width
between the eyes & length of the
mouth, are given default values
Chernoff Faces for Iris Data
[Figure: Chernoff faces, with panels labeled Setosa, Versicolour, and Virginica]
On-Line Analytical Processing - OLAP
• On-Line Analytical Processing (OLAP)
was proposed by E. F. Codd, the
father of the relational database.
• Relational databases put data into
tables, while OLAP uses a
multidimensional array representation.
– Such representations of data previously
existed in statistics and other fields
• There are a number of data analysis
and data exploration operations that
are easier with such a data
representation.
Creating a Multidimensional Array
• Two key steps in converting tabular
data into a multidimensional array.
– 1st, identify which attributes are to be
the dimensions and which attribute is to
be the target attribute whose values
appear as entries in the multidimensional
array.
• The attributes used as dimensions must have
discrete values
• The target value is typically a count or
continuous value, e.g., the cost of an item
• Can have no target variable at all except the
count of objects that have the same
set of attribute values
Creating a Multidimensional Array
• Two key steps in converting tabular
data into a multidimensional
array(cont.)
– 2nd, find the value of each entry in the
multidimensional array by summing the
values (of the target attribute) or count
of all objects that have the attribute
values corresponding to that entry.
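The two steps can be sketched with a dictionary keyed by tuples of dimension values, where the target is simply the count of objects. The records below are hypothetical and already discretized. A `Counter` stands in for the multidimensional array, and cells that never occur read as 0.

```python
from collections import Counter

# Hypothetical records: (petal_length, petal_width, species).
records = [
    ("low", "low", "Setosa"),
    ("low", "low", "Setosa"),
    ("medium", "medium", "Versicolour"),
    ("high", "medium", "Versicolour"),
    ("high", "high", "Virginica"),
    ("high", "high", "Virginica"),
    ("medium", "high", "Virginica"),
]

# Step 1: all three attributes act as dimensions; the target is the count.
# Step 2: each entry is the number of records with that combination of values.
cube = Counter(records)

print(cube[("low", "low", "Setosa")])       # a populated cell
print(cube[("low", "high", "Setosa")])      # an unspecified cell reads as 0
print(cube[("high", "high", "Virginica")])
```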
Example: Iris data
• Show how the attributes, petal length,
petal width, and species type can be
converted to a multidimensional array
– First, discretize the petal width and
length to have categorical values: low,
medium, and high
– Then get the following table - note the
count attribute
Example: Iris data
• Each unique tuple of petal width, petal
length, and species type identifies one
element of the array.
• This element is
assigned the
corresponding count
value.
• The figure illustrates
the result.
• All non-specified
tuples are 0.
Example: Iris data
• Slices of the multidimensional array
are shown by the following cross-tabulations
[Figure: three cross-tabulations, one each for Setosa, Versicolour, and Virginica]
OLAP Operations
• 4 types of OLAP Operations;
– Aggregation
– Pivoting
– Slicing & Dicing
– Roll-Up & Drill-Down
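A minimal sketch of several of these operations on a small cube stored as a dictionary. The sales figures, product names, and locations are hypothetical, and the function names are illustrative. Each function reflects the usual meaning of the operation: aggregation sums over a dimension, slicing fixes one dimension to a single value, dicing keeps a subset of values on several dimensions, and roll-up re-sums at a coarser level. Drill-down is the reverse of roll-up and needs the finer-grained data, so it is not shown. The helpers assume the three dimensions listed in `DIMS`.

```python
from collections import Counter

# Hypothetical sales cube: (product, location, month) -> units sold.
DIMS = ("product", "location", "month")
cube = {
    ("pen", "Bangkok", "Jan"): 10,
    ("pen", "Bangkok", "Feb"): 12,
    ("pen", "Chiang Mai", "Jan"): 4,
    ("ink", "Bangkok", "Jan"): 3,
    ("ink", "Chiang Mai", "Feb"): 5,
}


def aggregate(cube, drop):
    """Sum over one dimension and remove it from the keys."""
    axis = DIMS.index(drop)
    out = Counter()
    for key, value in cube.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)


def slice_cube(cube, dim, value):
    """Keep only the cells where dimension dim equals value."""
    axis = DIMS.index(dim)
    return {key: v for key, v in cube.items() if key[axis] == value}


def dice(cube, allowed):
    """Keep cells whose value on every listed dimension is in its allowed set."""
    return {
        key: v for key, v in cube.items()
        if all(key[DIMS.index(dim)] in values for dim, values in allowed.items())
    }


def roll_up(cube, dim, mapping):
    """Replace values of one dimension by a coarser level and re-sum."""
    axis = DIMS.index(dim)
    out = Counter()
    for key, value in cube.items():
        out[key[:axis] + (mapping[key[axis]],) + key[axis + 1:]] += value
    return dict(out)


by_product_location = aggregate(cube, "month")
bangkok_only = slice_cube(cube, "location", "Bangkok")
pens_in_jan = dice(cube, {"product": {"pen"}, "month": {"Jan"}})
by_quarter = roll_up(cube, "month", {"Jan": "Q1", "Feb": "Q1"})

print(by_product_location)
print(bangkok_only)
print(pens_in_jan)
print(by_quarter)
```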
OLAP Operations : Aggregation
OLAP Operations : Pivoting
OLAP Operations : Slicing
OLAP Operations : Dicing
OLAP Operations : Roll-Up & Drill-Down
OLAP Operations
Reference :
• Example of OLAP Operations -
pangsida.sakaeo.buu.ac.th/~52410017
/.../Ch06OLAP_Cubes.pptx