Lecture 2
Data
Summary – last week
• Last week:
– Course Motivation
– Data Mining basics
• This week:
– Data
Acknowledgment - Thanks to Tan, Steinbach, Karpatne, Kumar for the slides
Agenda
– Attributes and Objects
– Types of Data
– Data Quality
– Similarity and Distance
– Data Preprocessing
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, dimension, or feature
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
• Attribute values are numbers or symbols assigned to
an attribute for a particular object
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
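– However, the properties of an attribute can differ from the
properties of the values used to represent it
• Example: ID and age are both stored as integers, but averaging IDs is meaningless while averaging ages is not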
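A minimal sketch of the distinction between an attribute and its values. The conversion helper and the toy ID and age lists are hypothetical and only serve as illustration: the same height can be expressed in meters or feet, and two integer-valued attributes can support very different operations.

```python
# Hypothetical toy values; none of these names come from the slides.
FEET_PER_METER = 3.28084


def meters_to_feet(meters: float) -> float:
    """Map the same attribute (height) onto a different set of values."""
    return meters * FEET_PER_METER


ids = [101, 102, 103]   # integers used purely as labels
ages = [23, 35, 41]     # integers used as quantities

mean_age = sum(ages) / len(ages)   # meaningful: ages can be averaged
mean_id = sum(ids) / len(ids)      # computable, but it says nothing useful
distinct_ids = len(set(ids))       # a sensible operation on IDs: distinctness

print(round(meters_to_feet(1.80), 2), mean_age, mean_id, distinct_ids)
```

The interpreter does not prevent averaging the IDs; knowing which operations make sense is part of understanding the attribute, not the value type.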
Discrete and Continuous Attributes
• Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a
collection of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
• Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and
represented using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
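The following sketch, using made-up values, shows how the two kinds of attribute typically appear in code: word counts as integers, a binary flag, and a real-valued measurement stored as a float. It also illustrates the point about finite digits, since a float comparison may need a tolerance.

```python
import math
from collections import Counter

# Discrete: word counts in a (hypothetical) document are integers.
words = "the cat sat on the mat".split()
word_counts = Counter(words)

# Binary: a discrete attribute with exactly two possible values.
is_weekend = True

# Continuous: stored as a float, i.e. with a finite number of digits.
measured = 0.1 + 0.2
exactly_equal = measured == 0.3             # False because of rounding
close_enough = math.isclose(measured, 0.3)  # True within a tolerance

print(word_counts["the"], exactly_equal, close_enough)
```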
Important Characteristics of Data
– Dimensionality (number of attributes)
• High-dimensional data brings a number of
challenges
– Sparsity
• Only presence counts: most values are zero, so only the non-zero entries matter
– Resolution
• Patterns depend on the scale
– Size
• Type of analysis may depend on size of data
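As a rough illustration of dimensionality and sparsity, the sketch below stores only the non-zero entries of a vector. The helper names and the example vector are assumptions made for this example, not part of the slides.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparsity(dense):
    """Fraction of entries that are zero."""
    return dense.count(0) / len(dense)


dense = [0, 0, 3, 0, 0, 0, 1, 0]   # hypothetical 8-dimensional object
sparse = to_sparse(dense)

print(len(dense), sparse, sparsity(dense))
```

With thousands of attributes and only a handful of non-zero values per object, a representation like `sparse` can save both memory and computation.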
Types of data sets
• Record
– Data Matrix
– Document Data
– Transaction Data
• Graph
– World Wide Web
– Molecular Structures
• Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi-dimensional space, where each
dimension represents a distinct attribute
• Such a data set can be represented by an m by n
matrix, where there are m rows, one for each object,
and n columns, one for each attribute
Projection of x load  Projection of y load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
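A small sketch of the m by n view, using the two rows shown on the slide. The short column names and the `column` helper are assumptions introduced here for readability.

```python
# Column names are shortened, assumed labels for the slide's five attributes.
columns = ["proj_x_load", "proj_y_load", "distance", "load", "thickness"]

# One row per object, one column per attribute.
matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m, n = len(matrix), len(matrix[0])   # m objects, n attributes


def column(matrix, name):
    """Return the values of one attribute across all objects."""
    j = columns.index(name)
    return [row[j] for row in matrix]


print(m, n, column(matrix, "load"))
```

Each row can be read as a point in a 5-dimensional space, which is what makes distance-based methods applicable to this kind of data.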
Document Data
• Each document becomes a ‘term’ vector
– Each term is a component (attribute) of the vector
– The value of each component is the number of times the
corresponding term occurs in the document.
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
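One way to build such a term vector is sketched below. The vocabulary, the sample sentence, and the whitespace tokenization are simplifying assumptions; real pipelines usually also handle punctuation, stemming, and stop words.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocabulary]


vocabulary = ["team", "coach", "play", "ball", "score"]   # hypothetical
doc = "the team must play and the team must score"        # hypothetical

vector = term_vector(doc, vocabulary)
print(vector)
```

Words outside the vocabulary (here "the", "must", "and") are simply ignored, so every document maps to a vector of the same length.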
Transaction Data
• A special type of data, where
– Each transaction involves a set of items.
– For example, consider a grocery store. The set of products
purchased by a customer during one shopping trip constitutes a
transaction, while the individual products that were purchased
are the items.
– Transaction data can be represented as record data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
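The conversion to record data can be sketched as follows, using the five transactions from the table. Encoding each item as a 0/1 attribute is one common choice, assumed here for illustration.

```python
# Transactions from the table: TID -> set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Every distinct item becomes one binary attribute (sorted for a stable order).
items = sorted(set().union(*transactions.values()))

# One fixed-length record per transaction: 1 if the item was bought, else 0.
records = {
    tid: [1 if item in basket else 0 for item in items]
    for tid, basket in transactions.items()
}

print(items)
for tid, row in records.items():
    print(tid, row)
```

The resulting records all have the same fixed set of attributes, so they fit the definition of record data even though the original baskets vary in size.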
Graph Data
• Examples: Generic graph, a molecule, and webpages
[Figure: a generic graph and a benzene molecule, C6H6]
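A graph is often stored as an adjacency structure. The sketch below encodes benzene as an undirected graph with atoms as nodes and bonds as edges. The node labels (C0..C5, H0..H5) are made up for this example, and bond order is deliberately ignored.

```python
def add_edge(graph, u, v):
    """Add an undirected edge between nodes u and v."""
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


benzene = {}
for i in range(6):
    add_edge(benzene, f"C{i}", f"C{(i + 1) % 6}")   # six-carbon ring
    add_edge(benzene, f"C{i}", f"H{i}")             # one hydrogen per carbon

# Each undirected edge is counted twice when summing the degrees.
num_edges = sum(len(neighbors) for neighbors in benzene.values()) // 2

print(len(benzene), num_edges, sorted(benzene["C0"]))
```

The same structure works for web pages (nodes) and hyperlinks (edges), except that links are directed, so only one direction would be stored.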
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Data Quality
• What kinds of data quality problems arise?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Noise and outliers
– Wrong data
– Fake data
– Missing values
– Duplicate data
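Two of these problems, missing values and duplicate data, are straightforward to detect programmatically. The sketch below uses made-up records and hypothetical helper names. Noise, outliers, and wrong or fake data generally need domain knowledge or statistical checks beyond this example.

```python
# Hypothetical records: one has a missing value, one is an exact duplicate.
records = [
    {"tid": 1, "refund": "Yes", "income": 125},
    {"tid": 2, "refund": "No", "income": None},
    {"tid": 3, "refund": "No", "income": 70},
    {"tid": 3, "refund": "No", "income": 70},
]


def missing_fields(record):
    """Names of the attributes whose value is missing (None)."""
    return [name for name, value in record.items() if value is None]


def duplicate_records(records):
    """Records that repeat an earlier record exactly."""
    seen, duplicates = set(), []
    for record in records:
        key = tuple(sorted(record.items()))   # hashable fingerprint
        if key in seen:
            duplicates.append(record)
        seen.add(key)
    return duplicates


tids_with_missing = [r["tid"] for r in records if missing_fields(r)]
duplicate_tids = [r["tid"] for r in duplicate_records(records)]

print(tids_with_missing, duplicate_tids)
```

What to do after detection (drop the record, impute the value, merge duplicates) depends on the analysis and is a separate decision.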