Datalec 1

Chapter 2 of 'Data Mining: Concepts and Techniques' discusses the fundamental aspects of data, including data objects, attribute types, and basic statistical descriptions. It covers various types of data sets, important characteristics of structured data, and methods for measuring central tendency such as mean, median, mode, and midrange. The chapter emphasizes understanding data through visualization and similarity measurements.

Data Mining:

Concepts and Techniques

— Chapter 2 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign
Simon Fraser University
©2013 Han, Kamber, and Pei. All rights reserved.
Chapter 2: Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors
   Transaction data
 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time-series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Example term-frequency vectors (document data):

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
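A term-frequency vector such as the ones on this slide can be sketched in a few lines of Python. The vocabulary, the sample text, and the helper name `term_frequency_vector` below are hypothetical and chosen only for illustration; real pipelines usually add tokenization and normalization steps that are omitted here.

```python
from collections import Counter

# Hypothetical vocabulary and document text, used only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "team play ball play score team team"


def term_frequency_vector(text, vocab):
    """Count how often each vocabulary term occurs in a whitespace-split text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur, so absent terms become 0.
    return [counts[term] for term in vocab]


tf_vector = term_frequency_vector(document, vocabulary)
print(tf_vector)
```

Each position in the resulting list corresponds to one vocabulary term, which is the same layout as one row of the document-term matrix.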
Important Characteristics of Structured Data

 Dimensionality
 Curse of dimensionality
 Sparsity
 Only presence counts
 Resolution
 Patterns depend on the scale
 Distribution
 Centrality and dispersion

Data Objects

 Data sets are made up of data objects.

 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
Attributes
 Attribute (or dimensions, features, variables): a data
field, representing a characteristic or feature of a data
object.
 E.g., customer_ID, name, address

 Types:
 Nominal

 Binary

 Ordinal

 Numeric: quantitative

 Interval-scaled

 Ratio-scaled
Attribute Types
 Nominal:
 Nominal means “relating to names.”
 The values of a nominal attribute are symbols or “names of things”.
 Each value represents some kind of category, code, or state.
 So nominal attributes are also referred to as categorical.
 The values do not have any meaningful order.
 Hair_color = { black, brown, grey, red, white}
 Occupation = {teacher, dentist, programmer, farmer }
 It is possible to represent such symbols or “names” with numbers.
 With hair color, we can assign a code of 0 for black, 1 for brown, and so on.
 Another example is customer ID, with possible values that are all numeric.
 In such cases, the numbers are not intended to be used quantitatively.
 Mathematical operations on values of nominal attributes are not meaningful.
 Even though a nominal attribute may have integers as values, it is not considered a numeric attribute.
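The point about integer codes can be illustrated with a short Python sketch. The slide only fixes 0 for black and 1 for brown; the remaining codes and the sample of observations below are assumptions made for this example.

```python
from statistics import mean, mode

# Codes 0 and 1 follow the slide; the remaining codes are assumed ("and so on").
hair_color_codes = {"black": 0, "brown": 1, "grey": 2, "red": 3, "white": 4}

# Hypothetical observations, used only for illustration.
observed = ["brown", "black", "brown", "red", "white"]
encoded = [hair_color_codes[color] for color in observed]

# The mode is meaningful for nominal data: it is the most frequent category.
code_to_color = {code: color for color, code in hair_color_codes.items()}
most_common_color = code_to_color[mode(encoded)]

# The mean can be computed, but the result has no interpretation as a hair color.
meaningless_average = mean(encoded)

print(most_common_color, meaningless_average)
```

The average of the codes changes if the categories are numbered differently, while the most frequent category does not, which is why only the latter is a valid summary here.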
Attribute Types
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Binary attributes are referred to as Boolean if the two states
correspond to true and false.
 Symmetric binary:
 Both states are equally valuable and carry the same weight.
 There is no preference on which outcome should be coded as 0 or 1.
 e.g., gender
 Asymmetric binary:
 The outcomes of the states are not equally important.
 By convention, we code the most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other by 0.
 e.g., medical test (positive vs. negative)
Attribute Types
 Ordinal
 An attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is
not known.
 Size = {small, medium, large}
 Grade = {A+, A, A-, B+, and so on}
 Ordinal attributes are useful for registering subjective assessments of qualities that cannot be measured objectively.
 Ordinal attributes are often used in surveys for ratings.
 Nominal, binary, and ordinal attributes are qualitative.
 They describe a feature of an object without giving an actual size or quantity.
 The values of such qualitative attributes are typically words representing categories.
Numeric Attribute Types
 A numeric attribute is quantitative.
 It is a measurable quantity, represented in integer or real values.
 Numeric attributes can be interval-scaled or ratio-scaled.
 Interval-scaled
 Measured on a scale of equal-sized units.

 The values have order and can be positive, 0, or negative.

 Provides a ranking of values and lets us compare and quantify the difference between values.
 Example: the outdoor temperature recorded on a number of different days.
 By ordering the values, we obtain a ranking of the objects with respect to temperature.
 We can quantify the difference between values.
 For example, a temperature of 20°C is five degrees higher than a temperature of 15°C.
Numeric Attribute Types
 Calendar dates are another example of an interval-scaled attribute. For instance, the years 2002 and 2010 are eight years apart.
 Temperatures in Celsius and Fahrenheit do not have a true zero-point; that is, neither 0°C nor 0°F indicates “no temperature.”
 Ratio-scaled
 Inherent zero-point
 We can speak of a value as being a multiple (or ratio) of another value (e.g., 10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts, monetary quantities
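The difference between interval-scaled and ratio-scaled values can be checked numerically. The sketch below is only an illustration with two arbitrarily chosen temperatures; the helper `celsius_to_kelvin` is a name introduced here.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (shift by 273.15)."""
    return celsius + 273.15


# Two arbitrary temperatures, chosen only for illustration.
c_low, c_high = 5.0, 10.0

# Differences are meaningful on an interval scale and survive the unit change.
diff_celsius = c_high - c_low
diff_kelvin = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# Ratios are only meaningful on a ratio scale (Kelvin has a true zero-point).
ratio_celsius = c_high / c_low  # 2.0, but 10°C is not "twice as hot" as 5°C
ratio_kelvin = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(diff_celsius, diff_kelvin, ratio_celsius, ratio_kelvin)
```

The two differences agree, whereas the Celsius ratio of 2.0 shrinks to roughly 1.02 in Kelvin, which shows that the Celsius ratio depends on where the scale places its zero.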
Discrete vs. Continuous Attributes
 Classification algorithms developed in machine learning often describe attributes as being either discrete or continuous.
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a collection of documents
 Sometimes represented as integer variables
 Note: Binary attributes are a special case of discrete attributes
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Practically, real values can only be measured and represented using a finite number of digits
 Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency, variation
and spread
 Data dispersion characteristics
 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of
precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
 Various ways to measure the central tendency of data.
 We have some attribute X, like salary, which has been
recorded for a set of objects.
 Let x1, x2, ..., xN be the set of N observed values or observations for X.
 These values may also be referred to as the data set.
 If we were to plot the observations for salary, where would most of the values fall?
 This gives us an idea of the central tendency of the data.
 Measures of central tendency include the mean, median, mode, and midrange.
MEAN
 The most common and effective numeric measure of
the “center” of a set of data is the (arithmetic) mean.
 Let x1, x2, ..., xN be a set of N values or observations for some numeric attribute X, like salary.
 The mean of this set of values is

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
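As a quick numeric check of the formula, the sketch below computes the arithmetic mean of the twelve salary values (in thousands of dollars) that this deck uses in its median and mode examples. The function name `arithmetic_mean` is introduced here for illustration.

```python
# Salary values in thousands of dollars, taken from the deck's median example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]


def arithmetic_mean(values):
    """Sum the N values and divide by N."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


mean_salary = arithmetic_mean(salaries)
print(mean_salary)  # 696 / 12
```

The values sum to 696, so the mean salary is $58,000.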
MEAN
 Sometimes, each value xi in a set may be associated with a weight wi for i = 1, ..., N.
 The weights reflect the significance, importance, or occurrence frequency attached to their respective values.
 In this case, we can compute

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$$

 This is called the weighted arithmetic mean or the weighted average.
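The weighted mean can be sketched as follows. The salary levels and the head counts used as weights are hypothetical; they stand in for the "occurrence frequency" interpretation of the weights.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: three salary levels (in thousands) and how many
# employees earn each one.
salary_levels = [30, 50, 110]
head_counts = [5, 3, 1]

average_salary = weighted_mean(salary_levels, head_counts)
print(average_salary)  # (150 + 150 + 110) / 9
```

With all weights equal, the function reduces to the ordinary arithmetic mean.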
MEAN
 A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
 Even a small number of extreme values can corrupt the mean.
 For example, the mean salary at a company may be substantially pushed up by that of a few highly paid managers.
 Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very low scores.
 To offset the effect caused by a small number of extreme values, we can instead use the trimmed mean, which is the mean obtained after chopping off values at the high and low extremes.
 For example, we can sort the values observed for salary and remove the top and bottom 2% before computing the mean.
 We should avoid trimming too large a portion (such as 20%) at both ends, as this can result in the loss of valuable information.
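A minimal sketch of the trimmed mean follows. The function name and the rounding rule (drop `int(N * proportion)` values from each end) are assumptions of this example; other conventions exist. Because the deck's salary list has only 12 values, trimming 2% would remove nothing, so the example trims 10% from each end purely for illustration.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping int(N * proportion) values from each sorted end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Salary values in thousands of dollars, taken from the deck's median example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Trimming 10% per end drops one value from each end: 30 and 110.
trimmed = trimmed_mean(salaries, 0.10)
print(trimmed)
```

Dropping 30 and 110 lowers the mean from 58.0 to 55.6, which shows how much the single high salary pulls the untrimmed mean upward.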
MEDIAN
 The median is the middle value in a set of ordered data values; it separates the higher half of the data from the lower half.
 Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
 There is an even number of observations (12), so the median is not unique: it can be any value between the two middlemost values of 52 and 56.
 By convention, we assign the average of the two middlemost values as the median; that is, (52 + 56) / 2 = 54.
 The median is $54,000.
 Suppose that we had only the first 11 values in the list. Given an odd number of values, the median is the middlemost value. This is the sixth value in this list, which has a value of $52,000.
 The median is expensive to compute when we have a large number of observations.
 For numeric attributes, however, we can easily approximate the value.
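Both cases from this slide can be reproduced with Python's standard `statistics.median`, which averages the two middle values when the number of observations is even. This is a small check of the slide's arithmetic, not a statement about how large data sets should be handled.

```python
from statistics import median

# Salary values in thousands of dollars, as listed on the slide.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Even number of observations (12): average of the two middle values.
median_even = median(salaries)

# Odd number of observations (first 11 values): the sixth value.
median_odd = median(salaries[:11])

print(median_even, median_odd)
```

The results correspond to $54,000 for the full list and $52,000 for the first 11 values.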
MEDIAN
 Assume that the data are grouped in intervals according to their xi data values and that the frequency (i.e., number of data values) of each interval is known.
 For example, employees may be grouped according to their annual salary in intervals such as $10,000–20,000, $20,000–30,000, and so on.
 Let the interval that contains the median frequency be the median interval.
 We can approximate the median of the entire data set (e.g., the median salary) by interpolation using the formula

$$\text{median} = L_1 + \left(\frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

 where L1 is the lower boundary of the median interval.
 N is the number of values in the entire data set.
 (∑ freq)l is the sum of the frequencies of all of the intervals that are lower than the median interval.
 freqmedian is the frequency of the median interval.
 width is the width of the median interval.
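The interpolation formula can be sketched as a short function. The interval boundaries and frequencies below are hypothetical, and the function name `grouped_median` and the `(lower, upper, frequency)` tuple layout are choices made for this example.

```python
def grouped_median(intervals):
    """Approximate the median from (lower, upper, frequency) intervals.

    Intervals are assumed to be sorted by lower boundary and non-overlapping.
    """
    n = sum(freq for _, _, freq in intervals)
    half = n / 2
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= half:
            # This is the median interval: interpolate inside it.
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no data in the given intervals")


# Hypothetical salary groups in thousands of dollars with employee counts.
salary_groups = [(10, 20, 200), (20, 30, 450), (30, 40, 300), (40, 50, 50)]

approx_median = grouped_median(salary_groups)
print(approx_median)
```

Here N = 1000, so N/2 = 500 falls in the 20 to 30 interval; the 200 values below it give 20 + (500 - 200) / 450 × 10, or about 26.67 (roughly $26,670).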
MODE
 The mode is another measure of central tendency.
 The mode for a set of data is the value that occurs most frequently in
the set.
 Therefore, it can be determined for qualitative and quantitative
attributes.
 It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode.
 Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal.
 A data set with two or more modes is multimodal.
 If each data value occurs only once, then there is no mode.
 Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63,
70, 70, 110.
 The two modes are $52,000 and $70,000.
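The two modes of the salary data can be confirmed with `statistics.multimode` from the Python standard library (available since Python 3.8), which returns every value that attains the highest frequency.

```python
from statistics import multimode

# Salary values in thousands of dollars, as listed on the slide.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# 52 and 70 each occur twice; every other value occurs once.
modes = multimode(salaries)
print(modes)
```

Note that when every value occurs exactly once, `multimode` returns all of the values, whereas the convention on this slide is to say that such a data set has no mode.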
MIDRANGE
 The midrange can also be used to assess the central tendency of a
numeric data set.
 It is the average of the largest and smallest values in the set.
 This measure is easy to compute using the SQL aggregate functions,
max() and min().
 The midrange of the salary data from the previous example is ($30,000 + $110,000) / 2 = $70,000.
 In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all at the same center value.
 Data in most real applications are not symmetric.
 They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median, or negatively skewed, where the mode occurs at a value greater than the median.
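The midrange calculation can be reproduced directly, and placing it next to the mean and median of the same salary values shows how differently the measures react to the one large value. This is only a sketch using the deck's twelve salary values; comparing the three numbers is an informal illustration, not a formal test of skewness.

```python
from statistics import mean, median

# Salary values in thousands of dollars, from the deck's earlier examples.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Midrange: average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2

center_measures = {
    "midrange": midrange,
    "mean": mean(salaries),
    "median": median(salaries),
}
print(center_measures)
```

The midrange (70) depends only on the two extremes, so the single salary of 110 moves it well above both the mean (58) and the median (54).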
Symmetric vs. Skewed Data

 Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
[Figure: three frequency curves, labeled symmetric, positively skewed, and negatively skewed]

February 23, 2015 Data Mining: Concepts and Techniques
