R20-DMT Unit-I

A data warehouse is a structured collection of historical data optimized for querying and analysis, enabling consistent reporting across various business functions. It includes concepts like data marts, decision support systems, and various schemas such as star and snowflake for organizing data. Data mining is employed to extract valuable insights from large datasets, facilitating better decision-making in business through various methods and applications.

UNIT I

Data Warehouse Introduction


A data warehouse is a collection of data marts representing historical data from different
operations in the company. This data is stored in a structure optimized for querying and
data analysis. Table design, dimensions, and organization should be consistent throughout
a data warehouse so that reports and queries across the data warehouse are consistent. A
data warehouse can also be viewed as a database for historical data from different
functions within a company.

The term data warehouse was coined by Bill Inmon in 1990, who defined it in the
following way: "A data warehouse is a subject-oriented, integrated, time-variant and non-
volatile collection of data in support of management's decision making process". He
defined the terms in the sentence as follows:

Subject Oriented: Data that gives information about a particular subject instead of
about a company's ongoing operations.

Integrated: Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.

Time-variant: All data in the data warehouse is identified with a particular time period.

Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.

This enables management to gain a consistent picture of the business. A data warehouse is a
single, complete and consistent store of data obtained from a variety of different sources, made
available to end users in a form they can understand and use in a business context. It can be:
Used for decision support
Used to manage and control the business
Used by managers and end users to understand the business and make judgments
Data warehousing is an architectural construct of information systems that provides users
with current and historical decision support information that is hard to access or present in
traditional operational data stores.
Other important terminology

Enterprise Data Warehouse: It collects all information about subjects (customers,
products, sales, assets, personnel) that span the entire organization.

Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment
of a data warehouse that can provide data for reporting and analysis on a section, unit,
department or operation in the company, e.g. sales, payroll, production. Data marts are
sometimes complete individual data warehouses which are usually smaller than the
corporate data warehouse.

Decision Support System (DSS): Information technology that helps the knowledge
worker (executive, manager, analyst) make faster and better decisions.

Drill-down: Traversing the summarization levels from highly summarized data to the
underlying current or old detail

Metadata: Data about data, containing the location and description of warehouse
system components: names, definitions, structure, etc.

Benefits of data warehousing

 Data warehouses are designed to perform well with aggregate queries running on
large amounts of data.
 The structure of data warehouses is easier for end users to navigate, understand
and query against than that of relational databases, which are primarily designed
to handle large volumes of transactions.
 Data warehouses enable queries that cut across different segments of a company's
operation. E.g. production data could be compared against inventory data even if
they were originally stored in different databases with different structures.
 Queries that would be complex in very normalized databases could be easier to
build and maintain in data warehouses, decreasing the workload on transaction
systems.
 Data warehousing is an efficient way to manage and report on data that comes
from a variety of sources and is non-uniform and scattered throughout a company.
 Data warehousing is an efficient way to manage demand for lots of information
from lots of users.
 Data warehousing provides the capability to analyze large amounts of historical data.

Differences between Operational Database Systems and Data Warehouses

Features of OLTP and OLAP


The major distinguishing features between OLTP and OLAP are summarized as follows.

1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology professionals. An
OLAP system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.

2. Data contents: An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier to use for informed decision
making.

3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and
an application oriented database design. An OLAP system typically adopts either a star or
snowflake model and a subject-oriented database design.

4. View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP
system often spans multiple versions of a database schema. OLAP systems also deal with
information that originates from different organizations, integrating information from many data
stores. Because of their huge volume, OLAP data are stored on multiple storage media.

5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms. However,
accesses to OLAP systems are mostly read-only operations although many could be complex
queries.
Multidimensional Data Model.

The most popular data model for data warehouses is a multidimensional model. This model
can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let's
have a look at each of these schema types.

From Tables and Spreadsheets to Data Cubes

Stars, Snowflakes, and Fact Constellations:


Schemas for Multidimensional Databases

Star schema: The star schema is a modeling paradigm in which the data warehouse contains (1)
a large central table (fact table), and (2) a set of smaller attendant tables (dimension tables), one
for each dimension. The schema graph resembles a starburst, with the dimension tables displayed
in a radial pattern around the central fact table.

Figure Star schema of a data warehouse for sales.
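To make this concrete, here is a minimal sketch of a sales star schema created through Python's built-in sqlite3 module; the table and column names are illustrative assumptions, not taken from the figure.

```python
import sqlite3

# A minimal star schema sketch for a sales warehouse (illustrative names).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT,
                           quarter TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT,
                           city TEXT, state TEXT, country TEXT);

-- The central fact table references every dimension and holds the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")
```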

Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake.

The major difference between the snowflake and star schema models is that the dimension tables of the
snowflake model may be kept in normalized form. Such tables are easy to maintain and save storage
space, because a dimension table can grow extremely large when the dimensional structure is
included as columns.

Figure Snowflake schema of a data warehouse for sales.
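Continuing the hypothetical sketch above, a snowflake version normalizes a dimension into additional tables; here the location dimension is split so that city and country details live in their own tables (again, illustrative names only).

```python
# Normalizing the location dimension turns the star into a snowflake:
# city and country details move out into their own tables.
conn.executescript("""
CREATE TABLE dim_country (country_key INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_city    (city_key INTEGER PRIMARY KEY, city TEXT,
                          country_key INTEGER REFERENCES dim_country(country_key));
CREATE TABLE dim_location_sf (location_key INTEGER PRIMARY KEY, street TEXT,
                              city_key INTEGER REFERENCES dim_city(city_key));
""")
```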

 Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called
a galaxy schema or a fact constellation.

Figure Fact constellation schema of a data warehouse for sales and shipping.

Example for Defining Star, Snowflake, and Fact Constellation Schemas


OLAP operations on multidimensional data.

Roll-up: The roll-up operation performs aggregation on a data cube, either by climbing up a concept
hierarchy for a dimension or by dimension reduction. The figure shows the result of a roll-up operation
performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was
defined as the total order street < city < province or state < country.

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing
additional dimensions. The figure shows the result of a drill-down operation performed on the central cube
by stepping down a concept hierarchy for time defined as day < month < quarter < year. Drill-down occurs
by descending the time hierarchy from the level of quarter to the more detailed level of month.

Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a
subcube. The figure shows a slice operation where the sales data are selected from the central cube for the
dimension time using the criterion time = "Q2". The dice operation defines a subcube by performing a
selection on two or more dimensions.

Figure : Examples of typical OLAP operations on multidimensional data.

Pivot (rotate): Pivot is a visualization operation that rotates the data axes in view in order to provide an
alternative presentation of the data. The figure shows a pivot operation where the item and location axes in a
2-D slice are rotated.
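To make these operations concrete, here is a minimal pandas sketch; the toy sales table and its values are invented, not taken from the figures above.

```python
import pandas as pd

# Toy sales data at city/quarter granularity (illustrative values).
sales = pd.DataFrame({
    "city":    ["Chicago", "Chicago", "Toronto", "Toronto"],
    "country": ["USA", "USA", "Canada", "Canada"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["phone", "phone", "laptop", "laptop"],
    "dollars": [440, 395, 860, 940],
})

# Roll-up: climb the location hierarchy from city to country by re-aggregating.
rollup = sales.groupby(["country", "quarter"])["dollars"].sum()
# Drill-down is the reverse: navigate from these country totals back to the
# more detailed city-level rows.

# Slice: select a single value on one dimension (time = "Q2").
slice_q2 = sales[sales["quarter"] == "Q2"]

# Dice: select on two or more dimensions.
dice = sales[sales["quarter"].isin(["Q1", "Q2"]) & (sales["country"] == "USA")]

# Pivot: rotate the axes to obtain an alternative 2-D presentation.
pivoted = sales.pivot_table(index="item", columns="quarter",
                            values="dollars", aggfunc="sum")
print(rollup, pivoted, sep="\n\n")
```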

Concept Hierarchy
A concept hierarchy represents a series of mappings from a set of low-level concepts to higher-level, more
general concepts. A concept hierarchy organizes information or concepts in a hierarchical structure or a specific
partial order, which is used for expressing knowledge in brief, high-level terms and for mining
knowledge at several levels of abstraction.
A concept hierarchy includes a set of nodes organized in a tree, where the nodes define values of an attribute
known as concepts. A special node, "ANY", is reserved for the root of the tree. A number is assigned to the
level of each node in a concept hierarchy. The level of the root node is one. The level of a non-root node
is one more than the level of its parent.
Because values are defined by nodes, the levels of nodes can also be used to describe the levels of values.
A concept hierarchy enables raw data to be handled at a higher, more generalized level of
abstraction. There are several types of concept hierarchies, which are as follows −
Schema Hierarchy − Schema hierarchy represents the total or partial order between attributes in the database.
It can define existing semantic relationships between attributes. In a database, more than one schema hierarchy
can be generated by using multiple sequences and grouping of attributes.

Suppose a user selects a set of location-oriented attributes — street, country, province or state, and city
— from the AllElectronics database, but does not specify the hierarchical ordering among the attributes.

First, sort the attributes in ascending order based on the number of distinct values in each attribute. This results
in the following (with the number of distinct values per attribute shown in parentheses):
country (15), province or state (365), city (3,567), and street (674,339).
Second, generate the hierarchy from the top down according to the sorted order, with the first attribute at the
top level and the last attribute at the bottom level. Finally, the user can examine the generated hierarchy and,
where necessary, modify it to reflect desired semantic relationships among the attributes. In this example, there
is no need to modify the generated hierarchy.
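A minimal sketch of this automatic generation step, using the distinct-value counts from the example (the attribute names are taken from the text above):

```python
# Attributes with fewer distinct values are placed at higher hierarchy levels.
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674_339,
}

# Sort ascending by number of distinct values: top of the hierarchy first.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)

# Print as a total order from the most to the least detailed level.
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```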

Why Do We Need Data Mining?


The volume of information we must handle is increasing every day, coming from business transactions,
scientific data, sensor data, pictures, videos, etc. So we need a system capable of extracting the essence of
the available information and automatically generating reports, views, or summaries of data for better
decision-making.

Why is Data Mining Used in Business?


Data mining is used in business to make better managerial decisions by:
 Automatic summarization of data
 Extracting the essence of the information stored
 Discovering patterns in raw data

What is Data Mining?


Data mining is defined as extracting information from huge sets of data. In other words, we can say
that data mining is the procedure of mining knowledge from data. The information or knowledge
extracted in this way can be used for any of the following applications −
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration

Data mining is often treated as a synonym for knowledge discovery from data (KDD), while others
view data mining as an essential step in the process of knowledge discovery. Here is the list of steps
involved in the knowledge discovery process −

 Data Cleaning − In this step, noise and inconsistent data are removed.
 Data Integration − In this step, multiple data sources are combined.
 Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
 Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
 Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.

Data mining steps for knowledge discovery process
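To make these steps concrete, here is a minimal pandas sketch; the tiny tables and column names are invented for illustration, and the later steps are only indicated.

```python
import pandas as pd

# Two invented sources with some noisy values.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "amount": [120.0, None, 80.0, -5.0, 400.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["north", "south", "north"],
})

# 1. Data cleaning: remove noise and inconsistent records.
orders = orders.dropna(subset=["amount"])
orders = orders[orders["amount"] > 0]

# 2. Data integration: combine the two sources.
data = orders.merge(customers, on="customer_id")

# 3. Data selection: keep only attributes relevant to the analysis task.
data = data[["region", "amount"]]

# 4. Data transformation: consolidate via summary/aggregation operations.
summary = data.groupby("region")["amount"].agg(["count", "sum", "mean"])

# 5-7. Data mining, pattern evaluation, and knowledge presentation would
# follow; here we simply present the aggregated view.
print(summary)
```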


What kinds of data can be mined?
1. Flat Files
2. Relational Databases
3. Data Warehouses
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time-Series Databases
8. World Wide Web (WWW)

1. Flat Files
 Flat files are defined as data files in text form or binary form with a structure that can be easily
extracted by data mining algorithms.
 Data stored in flat files have no relationships or paths among themselves; if a relational
database is stored in flat files, there will be no relations between the tables.
 Flat files are described by a data dictionary. E.g.: CSV files.
 Application: used in data warehousing to store data, used for carrying data to and from
servers, etc.
2. Relational Databases
 A relational database is defined as a collection of data organized in tables with rows and
columns.
 The physical schema of a relational database defines the structure of its tables.
 The logical schema of a relational database defines the relationships among its tables.
 The standard query language for relational databases is SQL.
 Application: data mining, the ROLAP model, etc.

3. Data Warehouse
 A data warehouse is defined as a collection of data integrated from multiple
sources that supports querying and decision making.
 There are three types of data warehouse: enterprise data warehouse,
data mart, and virtual warehouse.
 Two approaches can be used to update data in a data warehouse: the query-
driven approach and the update-driven approach.
 Application: business decision making, data mining, etc.
4. Transactional Databases
 Transactional databases are collections of data organized by time stamps, dates,
etc., to represent transactions in databases.
 This type of database has the capability to roll back or undo an operation when a
transaction is not completed or committed.
 They are highly flexible systems where users can modify information without
changing any sensitive information.
 They follow the ACID properties of a DBMS.
 Application: banking, distributed systems, object databases, etc.
5. Multimedia Databases
 Multimedia databases consist of audio, video, image, and text media.
 They can be stored on object-oriented databases.
 They are used to store complex information in pre-specified formats.
 Application: digital libraries, video-on-demand, news-on-demand, music
databases, etc.
6. Spatial Databases
 Spatial databases store geographical information.
 They store data in the form of coordinates, topology, lines, polygons, etc.
 Application: maps, global positioning, etc.
7. Time-Series Databases
 Time-series databases contain data such as stock exchange data and user-logged activities.
 They handle arrays of numbers indexed by time, date, etc.
 They often require real-time analysis.
 Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
 The WWW (World Wide Web) is a collection of documents and resources such as
audio, video, and text, identified by Uniform Resource Locators (URLs), linked by
HTML pages, and accessed through web browsers over the Internet.
 It is the most heterogeneous repository, as it collects data from multiple sources.
 It is dynamic in nature, as the volume of data is continuously increasing and changing.
 Application: online shopping, job search, research, studying, etc.

What kinds of Patterns can be mined?


On the basis of the kind of data to be mined, there are two categories of functions involved
in Data Mining −
a) Descriptive
b) Classification and Prediction
a) Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters

1. Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For
example, in a company, the classes of items for sale include computers and printers, and concepts
of customers include big spenders and budget spenders. Such descriptions of a class or a concept
are called class/concept descriptions. These descriptions can be derived in the following two
ways −
 Data Characterization − This refers to summarizing the data of the class under
study, which is called the target class.
 Data Discrimination − This refers to the mapping or classification of a class against
some predefined group or class.

2. Mining of Frequent Patterns


Frequent patterns are those patterns that occur frequently in transactional data. The kinds
of frequent patterns are −
 Frequent Item Set − A set of items that frequently appear together, for
example, milk and bread.
 Frequent Subsequence − A sequence of patterns that occurs frequently, such as
purchasing a camera followed by a memory card.
 Frequent Substructure − Substructure refers to different structural forms, such as
graphs, trees, or lattices, which may be combined with itemsets or subsequences.
3. Mining of Associations
Associations are used in retail sales to identify items that are frequently purchased
together. Association mining is the process of uncovering relationships among data and
determining association rules.
For example, a retailer might generate an association rule showing that 70% of the time milk
is sold with bread, while only 30% of the time biscuits are sold with bread.
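A minimal sketch of how the support and confidence behind such a rule can be computed; the transactions are invented for illustration.

```python
# Support and confidence for the rule bread -> milk over toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "biscuits"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

n = len(transactions)
bread = sum("bread" in t for t in transactions)
bread_and_milk = sum({"bread", "milk"} <= t for t in transactions)

support = bread_and_milk / n          # fraction of all transactions
confidence = bread_and_milk / bread   # fraction of bread-buyers who buy milk
print(f"support={support:.2f}, confidence={confidence:.2f}")
```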

4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two itemsets, to analyze whether they have
a positive, negative, or no effect on each other.

5. Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming
groups of objects that are very similar to each other but highly different from the objects
in other clusters.
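As a small illustration, here is a clustering sketch using scikit-learn's KMeans (assuming scikit-learn is available); the points are invented.

```python
from sklearn.cluster import KMeans

# Toy 2-D points forming two visually obvious groups.
points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]

# Group the points into two clusters of mutually similar objects.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # cluster membership for each point
print(km.cluster_centers_)  # the center of each cluster
```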

b) Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label
is unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −

1. Classification (IF-THEN) Rules
2. Prediction
3. Decision Trees
4. Mathematical Formulae
5. Neural Networks
6. Outlier Analysis
7. Evolution Analysis

The list of functions involved in these processes is as follows −


1. Classification − It predicts the class of objects whose class label is unknown. Its
objective is to find a derived model that describes and distinguishes data classes or
concepts. The derived model is based on the analysis of a set of training data, i.e. data
objects whose class labels are known.

2. Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be
used for identification of distribution trends based on available data.

3. Decision Trees − A decision tree is a structure that includes a root node, branches, and
leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label (a small sketch follows this list).

4. Mathematical Formulae – Data can be mined by using some mathematical formulas.

5. Neural Networks − Neural networks represent a brain metaphor for information
processing. These models are biologically inspired rather than an exact replica of how the
brain actually functions. Neural networks have been shown to be very promising systems
in many forecasting applications and business classification applications due to their
ability to "learn" from the data.

6. Outlier Analysis − Outliers may be defined as the data objects that do not comply with
the general behavior or model of the data available.

7. Evolution Analysis − Evolution analysis refers to describing and modeling regularities
or trends for objects whose behavior changes over time.
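As a small illustration of item 3 above, here is a sketch using scikit-learn's DecisionTreeClassifier (assuming scikit-learn is installed); the tiny training set is invented for demonstration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training data: [age, income] -> buys_computer (0 = no, 1 = yes).
X = [[25, 30_000], [32, 60_000], [47, 25_000], [51, 90_000], [23, 20_000]]
y = [0, 1, 0, 1, 0]

# Fit a tree: internal nodes test an attribute, branches are test outcomes,
# and leaves hold class labels, as described above.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))

# Predict the class of an object whose class label is unknown.
print(tree.predict([[40, 70_000]]))
```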

Data Mining Task Primitives
 We can specify a data mining task in the form of a data mining query.
 This query is input to the system.
 A data mining query is defined in terms of data mining task primitives.

Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
 Set of task relevant data to be mined.
 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.

Which Technologies are used in data mining?

1. Statistics:
 It uses mathematical analysis to express representations, models, and summaries of
empirical data or real-world observations.
 Statistical analysis involves a collection of methods, applicable to large amounts of
data, used to draw conclusions and report trends.
2. Machine learning
 Arthur Samuel defined machine learning as a field of study that gives computers the
ability to learn without being explicitly programmed.
 When new data is entered into the computer, machine learning algorithms allow the
learned model to grow or change accordingly.
 In machine learning, an algorithm is constructed to predict outcomes from the available
database (predictive analysis).
 It is closely related to computational statistics.

The four types of machine learning are:
a. Supervised learning
 It is based on classification.
 It is also called inductive learning. In this method, the desired outputs are included
in the training dataset.
b. Unsupervised learning
 Unsupervised learning is based on clustering. Clusters are formed on the basis of
similarity measures, and desired outputs are not included in the training dataset.
c. Semi-supervised learning
 Semi-supervised learning includes some desired outputs in the training dataset to
generate the appropriate functions. This method reduces the need for a large number of
labeled examples (i.e. desired outputs).
d. Active learning
 Active learning is a powerful approach for analyzing data efficiently.
 The algorithm is designed so that it itself chooses the examples for which desired
outputs are requested (the user plays an important role by supplying them).
3. Information retrieval
 Information retrieval deals with uncertain representations of the semantics of objects
(text, images). For example: finding relevant information in a large collection of documents.

4. Database systems and data warehouses

 Databases are used for the purpose of recording data, as are data warehouses.
 Online transaction processing (OLTP) uses databases for day-to-day transaction
purposes.
 Data warehouses are used to store historical data, which helps in making strategic
business decisions.
 A data warehouse is used for online analytical processing (OLAP), which helps to
analyze the data.
5. Pattern Recognition:
Pattern recognition is the automated recognition of patterns and regularities in data.
Pattern recognition is closely related to artificial intelligence and machine learning, together with
applications such as data mining and knowledge discovery in databases (KDD), and the terms
are often used interchangeably.
6. Visualization:
It is the process of extracting and presenting data in a very clear and understandable
way, displaying the results in the form of pie charts, bar graphs, statistical representations,
and other graphical forms.
7. Algorithms:
To perform data mining techniques, we have to design efficient algorithms.
8. High Performance Computing:
High Performance Computing most generally refers to the practice of aggregating
computing power in a way that delivers much higher performance than one could get out of a
typical desktop computer or workstation in order to solve large problems in science, engineering,
or business.

Data Mining Applications
Here is the list of areas where data mining is widely used −
 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection

Major Issues in data mining:


Data mining is a dynamic and fast-expanding field with great strengths. The major issues
can be divided into five groups:
a) Mining Methodology
b) User Interaction
c) Efficiency and scalability
d) Diverse Data Types Issues
e) Data mining society

Data Objects and Attribute Types:


Data Object:
A data object represents a real-world entity.
Attribute:
An attribute can be seen as a data field that represents characteristics or features of a data object.
For a customer object, attributes can be customer ID, address, etc. The attribute types can be
represented as follows:
1. Nominal Attributes – related to names: The values of a nominal attribute are names of
things or some kind of symbols. Values of nominal attributes represent some category or
state, which is why nominal attributes are also referred to as categorical attributes.
Example:
Attribute: Color; Values: Black, Green, Brown, Red

2. Binary Attributes: A binary attribute has only 2 values/states, for example yes or no,
affected or unaffected, true or false.
i) Symmetric: both values are equally important (e.g., gender).
ii) Asymmetric: both values are not equally important (e.g., result).

3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful
sequence or ranking (order) between them, but the magnitude between values is not actually
known; the order shows what is important but does not indicate how important it is.
Example:
Attribute: Grade; Values: O, S, A, B, C, D, F

4. Numeric: A numeric attribute is quantitative: it is a measurable quantity,
represented in integer or real values. Numeric attributes are of 2 types:
i. An interval-scaled attribute has values whose differences are interpretable, but the
attribute does not have a true reference point, or zero point. Data can be added and
subtracted on an interval scale but cannot be multiplied or divided. Consider temperature
in degrees Centigrade: if one day's temperature reading is twice another day's, we cannot
say that the first day is twice as hot as the second.
ii. A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is
ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The
values are ordered, and we can also compute the difference between values; the
mean, median, mode, quantile range, and five-number summary can be given.

5. Discrete: Discrete attributes have a finite or countably infinite set of values, which
may be numerical or categorical.
Example:
Attribute: Profession; Values: Teacher, Businessman, Peon
Attribute: ZIP Code; Values: 521157, 521301

6. Continuous: Continuous attributes have an infinite number of possible values and are
typically of float type; there can be many values between 2 and 3.
Example:
Attribute: Height; Values: 5.4, 5.7, 6.2, etc.
Attribute: Weight; Values: 50, 65, 70, 73, etc.

Data Visualization:
Visualization is the use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data.
 Categorization of visualization methods:
a) Pixel-oriented visualization techniques
b) Geometric projection visualization techniques
c) Icon-based visualization techniques
d) Hierarchical visualization techniques
e) Visualizing complex data and relations
a) Pixel-oriented visualization techniques
 For a data set of m dimensions, create m windows on the screen, one for each
dimension
 The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows
 The colors of the pixels reflect the corresponding values

 To save space and show the connections among multiple dimensions, space filling is
often done in a circle segment

b) Geometric projection visualization techniques
Visualization of geometric transformations and projections of the data
Methods
 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: helps users find meaningful projections of
multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates

c) Icon-based visualization techniques
Visualization of the data values as features of icons
Typical visualization methods
 Chernoff Faces
 Stick Figures
General techniques
 Shape coding: Use shape to represent certain information encoding
 Color icons: Use color icons to encode more information
 Tile bars: Use small icons to represent the relevant feature vectors in
document retrieval

d) Hierarchical visualization techniques
Visualization of the data using a hierarchical partitioning into subspaces
Methods
 Dimensional Stacking
 Worlds-within-Worlds
 Tree-Map
 Cone Trees
 InfoCube

Figures: InfoCube and Worlds-within-Worlds.

e) Visualizing complex data and relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
 The importance of a tag is represented by font size/color
Besides text data, there are also methods to visualize relationships, such as visualizing
social networks.

Similarity and Dissimilarity


Distance and similarity measures are essential for solving many pattern recognition
problems, such as classification and clustering. Various distance/similarity measures are
available in the literature for comparing two data distributions. As the name suggests, a similarity
measure quantifies how close two distributions are. For multivariate data, complex summary
methods have been developed to answer this question.
Similarity Measure
 Numerical measure of how alike two data objects are.
 Often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
 Numerical measure of how different two data objects are.
 Ranges from 0 (objects are alike) to ∞ (objects are different).
Proximity refers to a similarity or dissimilarity.
Similarity/Dissimilarity for Simple Attributes
Here, p and q are the attribute values for two data objects.

Common Properties of Dissimilarity Measures
Distance, such as the Euclidean distance, is a dissimilarity measure and has some well known
properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q,p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity)
between points (data objects), p and q.
A distance that satisfies these properties is called a metric. Following is a list of several common
distance measures for comparing multivariate data. We will assume that the attributes are all
continuous.

a) Euclidean Distance
Assume that we have measurements $x_{ik}$, $i = 1, \ldots, N$, on variables $k = 1, \ldots, p$ (also
called attributes).
The Euclidean distance between the $i$th and $j$th objects is

$$d_E(i, j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}$$

for every pair $(i, j)$ of observations.

The weighted Euclidean distance, with attribute weights $W_k$, is

$$d_{WE}(i, j) = \left( \sum_{k=1}^{p} W_k (x_{ik} - x_{jk})^2 \right)^{1/2}$$

If the scales of the attributes differ substantially, standardization is necessary.

b) Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance.
With the measurements $x_{ik}$, $i = 1, \ldots, N$, $k = 1, \ldots, p$, the Minkowski distance is

$$d_M(i, j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}$$

where $\lambda \ge 1$. It is also called the $L_\lambda$ metric.

$\lambda = 1$: $L_1$ metric, Manhattan or city-block distance.
$\lambda = 2$: $L_2$ metric, Euclidean distance.
$\lambda \to \infty$: $L_\infty$ metric, supremum distance.

Note that $\lambda$ and $p$ are two different parameters: $\lambda$ is the order of the metric,
while $p$ is the dimension of the data matrix, which remains finite.


c) Mahalanobis Distance
Let $X$ be an $N \times p$ matrix. Then the $i$th row of $X$ is

$$x_i^T = (x_{i1}, \ldots, x_{ip})$$

The Mahalanobis distance between the $i$th and $j$th objects is

$$d_{MH}(i, j) = \left( (x_i - x_j)^T \Sigma^{-1} (x_i - x_j) \right)^{1/2}$$

where $\Sigma$ is the $p \times p$ sample covariance matrix.
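A minimal NumPy sketch of the three distances; the data values are invented for illustration.

```python
import numpy as np

# Toy data matrix X: N = 4 objects measured on p = 2 attributes.
X = np.array([[2.0, 4.0], [4.0, 8.0], [6.0, 2.0], [8.0, 6.0]])
xi, xj = X[0], X[1]

# Euclidean (L2) distance.
d_euclid = np.sqrt(np.sum((xi - xj) ** 2))

# Minkowski distance with lambda = 3.
lam = 3
d_minkowski = np.sum(np.abs(xi - xj) ** lam) ** (1 / lam)

# Mahalanobis distance using the sample covariance of X.
cov = np.cov(X, rowvar=False)          # p x p sample covariance matrix
diff = xi - xj
d_mahalanobis = np.sqrt(diff @ np.linalg.inv(cov) @ diff)

print(d_euclid, d_minkowski, d_mahalanobis)
```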

Common Properties of Similarity Measures


Similarities have some well known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q,


2. s(p, q) = s(q, p) for all p and q, where s(p, q) is the similarity between data
objects, p and q.
Similarity Between Two Binary Variables
The above similarity or distance measures are appropriate for continuous variables. However,
for binary variables a different approach is necessary.

Simple Matching and Jaccard Coefficients

Let $n_{1,1}$ be the number of attributes for which both objects have value 1, $n_{0,0}$ the number
for which both have value 0, and $n_{1,0}$ and $n_{0,1}$ the numbers for which the values disagree.
Then:

Simple matching coefficient = $(n_{1,1} + n_{0,0}) / (n_{1,1} + n_{1,0} + n_{0,1} + n_{0,0})$

Jaccard coefficient = $n_{1,1} / (n_{1,1} + n_{1,0} + n_{0,1})$
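A minimal sketch computing both coefficients for two invented binary vectors:

```python
# Two binary attribute vectors for data objects p and q (invented values).
p = [1, 0, 0, 1, 1, 0]
q = [1, 1, 0, 1, 0, 0]

n11 = sum(a == 1 and b == 1 for a, b in zip(p, q))  # both 1
n00 = sum(a == 0 and b == 0 for a, b in zip(p, q))  # both 0
n10 = sum(a == 1 and b == 0 for a, b in zip(p, q))  # p=1, q=0
n01 = sum(a == 0 and b == 1 for a, b in zip(p, q))  # p=0, q=1

smc = (n11 + n00) / (n11 + n10 + n01 + n00)
jaccard = n11 / (n11 + n10 + n01)
print(f"SMC={smc:.2f}, Jaccard={jaccard:.2f}")
```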
