R20-DMT Unit-I
The term Data Warehouse was coined by Bill Inmon in 1990, who defined it in the following way: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process". He defined the terms in the sentence as follows:
Subject Oriented: Data that gives information about a particular subject instead of
about a company's ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed.
This enables management to gain a consistent picture of the business. A data warehouse is a single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a form they can understand and use in a business context. It can be:
Used for decision support
Used to manage and control the business
Used by managers and end users to understand the business and make judgments
Data Warehousing is an architectural construct of information systems that provides users
with current and historical decision support information that is hard to access or present in
traditional operational data stores.
Other important terminology
Data Mart: Departmental subsets that focus on selected subjects. A data mart is a segment
of a data warehouse that can provide data for reporting and analysis on a section, unit,
department or operation in the company, e.g. sales, payroll, production. Data marts are
sometimes complete individual data warehouses which are usually smaller than the
corporate data warehouse.
Drill-down: Traversing the summarization levels from highly summarized data to the
underlying current or old detail
Data warehouses are designed to perform well with aggregate queries running on
large amounts of data.
The structure of data warehouses is easier for end users to navigate, understand
and query against, unlike relational databases, which are primarily designed to
handle large volumes of transactions.
Data warehouses enable queries that cut across different segments of a company's
operations. E.g. production data could be compared against inventory data even if
they were originally stored in different databases with different structures.
Queries that would be complex in very normalized databases could be easier to
build and maintain in data warehouses, decreasing the workload on transaction
systems.
Data warehousing is an efficient way to manage and report on data that comes
from a variety of sources and is non-uniform and scattered throughout a company.
Data warehousing is an efficient way to manage demand for lots of information
from lots of users.
Data warehousing provides the capability to analyze large amounts of historical data.
Differences between Operational Database Systems and Data Warehouses
1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology professionals. An
OLAP system is market-oriented and is used for data analysis by knowledge workers, including
managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier for use in informed decision
making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historical data or data in different organizations. In contrast, an OLAP
system often spans multiple versions of a database schema. OLAP systems also deal with
information that originates from different organizations, integrating information from many data
stores. Because of their huge volume, OLAP data are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations, although many could be complex queries.
Multidimensional Data Model.
The most popular data model for data warehouses is a multidimensional model. This model
can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let's
have a look at each of these schema types.
Star schema: The star schema is a modeling paradigm in which the data warehouse contains (1) a large central table (fact table), and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
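A minimal pandas sketch of the idea: a central fact table joined directly to each dimension table. All table and column names here are hypothetical, chosen only for illustration.

import pandas as pd

# Illustrative fact table: foreign keys to dimensions plus numeric measures.
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 11, 10],
    "units_sold": [5, 3, 7],
    "dollars_sold": [500.0, 240.0, 700.0],
})
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"], "year": [2024, 2024]})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["laptop", "printer"]})

# In a star schema, every dimension table joins directly to the central fact table.
star = sales_fact.merge(time_dim, on="time_key").merge(item_dim, on="item_key")
print(star[["quarter", "item_name", "units_sold", "dollars_sold"]])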
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form. Such a table is easy to maintain and also saves storage space, because a dimension table can become very large when the dimensional structure is included as columns.
Figure: Snowflake schema of a data warehouse for sales.
Fact constellation: Sophisticated applications may require multiple fact tables to share
dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called
a galaxy schema or a fact constellation.
Figure: Fact constellation schema of a data warehouse for sales and shipping.
Roll-up: The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. The figure shows the result of a roll-up operation performed on the central cube by climbing up the concept hierarchy for location. This hierarchy was defined as the total order street < city < province or state < country.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions. The figure shows the result of a drill-down operation performed on the central cube by stepping down a concept hierarchy for time defined as day < month < quarter < year. Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level of month.
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a sub-cube. The figure shows a slice operation where the sales data are selected from the central cube for the dimension time using the criterion time = "Q2". The dice operation defines a sub-cube by performing a selection on two or more dimensions.
Figure: Examples of typical OLAP operations on multidimensional data.
Pivot (rotate): Pivot is a visualization operation which rotates the data axes in view in order to provide an alternative presentation of the data. The figure shows a pivot operation where the item and location axes in a 2-D slice are rotated.
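These operations can be approximated on a flat table with pandas. A minimal sketch on a hypothetical sales cube; drill-down is simply the reverse of the roll-up shown (grouping by the more detailed city level again).

import pandas as pd

# A flat representation of a small sales cube (all values are made up).
cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "country": ["Canada", "Canada", "Canada", "Canada"],
    "item":    ["laptop", "laptop", "printer", "laptop"],
    "sales":   [400, 300, 250, 500],
})

# Roll-up: climb the location hierarchy from city to country by aggregating.
rollup = cube.groupby(["quarter", "country"], as_index=False)["sales"].sum()

# Slice: select a single value on one dimension (time = "Q2").
slice_q2 = cube[cube["quarter"] == "Q2"]

# Dice: select on two or more dimensions.
dice = cube[(cube["quarter"] == "Q1") & (cube["city"] == "Toronto")]

# Pivot (rotate): swap the axes shown in a 2-D view.
pivoted = cube.pivot_table(index="item", columns="city", values="sales", aggfunc="sum")
print(rollup, slice_q2, dice, pivoted, sep="\n\n")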
Concept Hierarchy
A concept hierarchy represents a series of mappings from a set of low-level concepts to higher-level, more general concepts. A concept hierarchy organizes information or concepts in a hierarchical structure or a specific partial order, which is used for expressing knowledge in brief, high-level terms and for mining knowledge at several levels of abstraction.
A concept hierarchy consists of a set of nodes organized in a tree, where the nodes represent values of an attribute, known as concepts. A specific node, "ANY", is reserved for the root of the tree. A level number is assigned to each node in the hierarchy: the level of the root node is one, and the level of a non-root node is one more than the level of its parent.
Because values are represented by nodes, the levels of nodes can also be used to describe the levels of values.
Concept hierarchies enable raw information to be managed at a higher, more generalized level of abstraction. There are several types of concept hierarchies, which are as follows −
Schema Hierarchy − Schema hierarchy represents the total or partial order between attributes in the database.
It can define existing semantic relationships between attributes. In a database, more than one schema hierarchy
can be generated by using multiple sequences and grouping of attributes.
Suppose a user selects a set of location-oriented attributes — street, country, province or state, and city
— from the AllElectronics database, but does not specify the hierarchical ordering among the attributes.
First, sort the attributes in ascending order based on the number of distinct values in each attribute. This results
in the following (where the number of distinct values per attribute is shown in parentheses):
country (15),
province or state (365),
city (3567),
and street (674,339).
Second, generate the hierarchy from the top down according to the sorted order, with the first attribute at the
top level and the last attribute at the bottom level. Finally, the user can examine the generated hierarchy, and
when necessary, modify it to reflect desired semantic relationships among the attributes. In this example, it is
obvious that there is no need to modify the generated hierarchy.
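A minimal Python sketch of this automatic generation, assuming a small hypothetical table with the four location attributes: attributes with fewer distinct values are placed at higher (more general) levels.

import pandas as pd

# Hypothetical location columns from an AllElectronics-style table.
df = pd.DataFrame({
    "street":  ["1 Main St", "2 Oak Ave", "3 Pine Rd", "1 Main St"],
    "city":    ["Guntur", "Guntur", "Chennai", "Vijayawada"],
    "state":   ["AP", "AP", "TN", "AP"],
    "country": ["India", "India", "India", "India"],
})

# Fewer distinct values -> higher level in the generated hierarchy.
counts = df.nunique().sort_values()
hierarchy = list(counts.index)            # most general level first
print(counts)
print(" < ".join(reversed(hierarchy)))    # most specific < ... < most general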
Some people treat data mining as a synonym for Knowledge Discovery from Data (KDD), while others view data mining as merely an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −
Data Cleaning − In this step, the noise and inconsistent data is removed.
Data Integration − In this step, multiple data sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
Pattern Evaluation − In this step, data patterns are evaluated.
Knowledge Presentation − In this step, knowledge is represented.
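A minimal sketch mapping each of these steps onto a line of pandas code; the source tables, columns and the evaluation threshold are all hypothetical.

import pandas as pd

# Hypothetical raw sources.
src_a = pd.DataFrame({"id": [1, 2, 3], "age": [25, None, 40], "spend": [200, 150, 900]})
src_b = pd.DataFrame({"id": [1, 2, 3], "segment": ["retail", "retail", "wholesale"]})

data = src_a.dropna()                     # data cleaning: drop noisy/incomplete rows
data = data.merge(src_b, on="id")         # data integration: combine sources
data = data[["age", "spend", "segment"]]  # data selection: keep task-relevant columns
data["spend_k"] = data["spend"] / 1000    # data transformation: rescale for mining
# Data mining: a trivial "pattern" -- mean spend per customer segment.
pattern = data.groupby("segment")["spend_k"].mean()
# Pattern evaluation and knowledge presentation: keep and report interesting patterns.
print(pattern[pattern > 0.1])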
Kinds of Data That Can Be Mined
1. Flat Files
Flat files are defined as data files in text form or binary form with a structure that can be easily
extracted by data mining algorithms.
Data stored in flat files have no relationships or paths among themselves; for example, if a relational database is exported to flat files, the relations between the tables are not preserved.
Flat files are described by a data dictionary. Eg: CSV file.
Application: Used in data warehousing to store data, used for carrying data to and from servers, etc.
2. Relational Databases
A relational database is defined as a collection of data organized in tables with rows and columns.
The physical schema of a relational database defines the structure of its tables.
The logical schema of a relational database defines the relationships among its tables.
The standard API of relational databases is SQL.
Application: Data Mining, ROLAP model, etc.
3. Data Warehouse
A data warehouse is defined as a collection of data integrated from multiple sources that supports querying and decision making.
There are three types of data warehouse: enterprise data warehouse, data mart and virtual warehouse.
Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach.
Application: Business decision making, Data mining, etc.
4. Transactional Databases
Transactional databases are collections of data organized by timestamps, dates, etc., to represent transactions in databases.
This type of database has the capability to roll back or undo an operation when a transaction is not completed or committed.
They are highly flexible systems where users can modify information without changing any sensitive information.
They follow the ACID properties of DBMS.
Application: Banking, Distributed systems, Object databases, etc.
5. Multimedia Databases
Multimedia databases consist of audio, video, image and text media.
They can be stored in object-oriented databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on-demand, news-on-demand, musical databases, etc.
6. Spatial Database
Spatial databases store geographical information.
They store data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, Global positioning, etc.
7. Time-series Databases
Time-series databases contain stock exchange data and user-logged activities.
They handle arrays of numbers indexed by time, date, etc.
They often require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
The WWW (World Wide Web) is a collection of documents and resources like audio, video, text, etc., which are identified by Uniform Resource Locators (URLs) through web browsers, linked by HTML pages, and accessible via the Internet.
It is the most heterogeneous repository, as it collects data from multiple sources.
It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc.
Data Mining Functionalities
1. Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in a company, the classes of items for sale include computers and printers, and concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways −
Data Characterization − This refers to summarizing data of class under study. This
class under study is called as Target Class.
Data Discrimination − It refers to the mapping or classification of a class with some predefined group or class.
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated-attribute-value pairs or between two item sets, to analyze whether they have a positive, negative or no effect on each other.
5. Mining of Clusters
A cluster refers to a group of similar kinds of objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
b) Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms −
2. Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be
used for identification of distribution trends based on available data.
3. Decision Trees − A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label (see the sketch after this list).
6. Outlier Analysis − Outliers may be defined as the data objects that do not comply with
the general behavior or model of the data available.
7. Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or trends for objects whose behavior changes over time.
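A minimal scikit-learn sketch of classification with a decision tree, using the bundled iris data; the depth limit and train/test split are illustrative choices, not part of the original material.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Learn a model from labeled training data, then predict unknown class labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(model))         # root node, attribute tests, branches, leaf labels
print(model.predict(X_test[:5]))  # predicted class labels for unseen objects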
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
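As a rough illustration, these five primitives could be collected into a single query specification handed to a mining system. A minimal sketch; every field name and value below is hypothetical.

# The five data mining task primitives, as a hypothetical query specification.
mining_query = {
    "task_relevant_data": {"table": "sales", "attributes": ["item", "region", "amount"]},
    "kind_of_knowledge": "association rules",
    "background_knowledge": {"concept_hierarchy": ["street", "city", "state", "country"]},
    "interestingness": {"min_support": 0.05, "min_confidence": 0.7},
    "presentation": ["rules", "tables", "charts"],
}
for primitive, value in mining_query.items():
    print(f"{primitive}: {value}")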
1. Statistics:
It uses mathematical analysis to express representations, model and summarize empirical data or real-world observations.
Statistical analysis involves a collection of methods applicable to large amounts of data, used to conclude and report trends.
2. Machine learning
Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
When new data is entered into the computer, machine learning algorithms allow the learned model to grow or change accordingly.
In machine learning, an algorithm is constructed to predict new data from the available database (predictive analysis).
It is related to computational statistics.
The four types of machine learning are:
a. Supervised learning
It is based on classification.
It is also called inductive learning. In this method, the desired outputs are included in the training dataset.
b. Unsupervised learning
Unsupervised learning is based on clustering. Clusters are formed on the basis of
similarity measures and desired outputs are not included in the training dataset.
c. Semi-supervised learning
Semi-supervised learning includes some desired outputs in the training dataset to generate the appropriate functions. This method generally avoids the need for a large number of labeled examples (i.e. desired outputs).
d. Active learning
Active learning is a powerful approach for analyzing data efficiently.
The algorithm is designed in such a way that it decides which examples it most needs labels for, and queries the user for those desired outputs (so the user plays an important role in this type).
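A minimal scikit-learn sketch contrasting the first two types: supervised learning trains on desired outputs (labels), while unsupervised learning forms clusters from similarity alone. The synthetic data and model choices are illustrative.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Supervised: the desired outputs (labels y) are part of the training data.
clf = LogisticRegression().fit(X, y)

# Unsupervised: clusters are formed from similarity alone; no labels are used.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:5]), km.labels_[:5])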
3. Information retrieval
Information retrieval deals with uncertain representations of the semantics of objects (text, images).
For example: Finding relevant information from a large document.
Data Mining Applications
Here is the list of areas where data mining is widely used −
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
2. Binary Attributes: Binary data has only 2 values/states. For example: yes or no, affected or unaffected, true or false.
i) Symmetric: Both values are equally important (e.g. gender).
ii) Asymmetric: Both values are not equally important (e.g. test result).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known; the order shows what is important but does not indicate how important it is.
Example:
Attribute   Values
Grade       O, S, A, B, C, D, F
5. Discrete: Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or countably infinite set of values.
Example:
Attribute    Values
Profession   Teacher, Businessman, Peon
ZIP Code     521157, 521301
6. Continuous: Continuous data have an infinite number of possible values and are typically of float type. There can be many values between 2 and 3.
Example:
Attribute Values
Height 5.4, 5.7, 6.2, etc.,
Weight 50, 65, 70, 73, etc.,
Data Visualization:
Visualization is the use of computer graphics to create visual images which aid in the
understanding of complex, often massive representations of data.
Categorization of visualization methods:
a) Pixel-oriented visualization techniques
b) Geometric projection visualization techniques
c) Icon-based visualization techniques
d) Hierarchical visualization techniques
e) Visualizing complex data and relations
a) Pixel-oriented visualization techniques
For a data set of m dimensions, create m windows on the screen, one for each
dimension
The m dimension values of a record are mapped to m pixels at the corresponding
positions in the windows
The colors of the pixels reflect the corresponding values
Page 19
To save space and show the connections among multiple dimensions, space filling is
often done in a circle segment
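A minimal matplotlib sketch of the pixel-oriented idea: one window per dimension, one pixel per record, with the pixel color reflecting the record's value on that dimension. The random data and the 20 × 20 window layout are assumptions for illustration.

import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data set: 400 records, m = 3 dimensions.
rng = np.random.default_rng(0)
data = rng.random((400, 3))
order = np.argsort(data[:, 0])  # sort records by one dimension (a common choice)

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
for k, ax in enumerate(axes):
    # One window per dimension; each record becomes one pixel in a 20x20 grid,
    # and the pixel color reflects that record's value on dimension k.
    ax.imshow(data[order, k].reshape(20, 20), cmap="viridis")
    ax.set_title(f"dimension {k}")
    ax.axis("off")
plt.show()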
c) Icon-based visualization techniques
Visualization of the data values as features of icons
Typical visualization methods
Chernoff Faces
Stick Figures
General techniques
Shape coding: Use shape to represent certain information encoding
Color icons: Use color icons to encode more information
Tile bars: Use small icons to represent the relevant feature vectors in
document retrieval
e) Visualizing complex data and relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags, where the importance of a tag is represented by font size or color.
Besides text data, there are also methods to visualize relationships, such as visualizing social networks.
Measuring Data Similarity and Dissimilarity
Various similarity and dissimilarity measures are available in the literature to compare two data distributions. As the names suggest, a similarity measure quantifies how close two distributions are. For multivariate data, complex summary methods have been developed to answer this question.
Similarity Measure
Numerical measure of how alike two data objects are.
Often falls between 0 (no similarity) and 1 (complete similarity).
Dissimilarity Measure
Numerical measure of how different two data objects are.
Ranges from 0 (objects are alike) to ∞ (objects are different).
Proximity refers to a similarity or dissimilarity.
Similarity/Dissimilarity for Simple Attributes
Here, p and q are the attribute values for two data objects.
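A minimal sketch of common per-attribute conventions for simple attribute types (nominal, ordinal, interval/ratio); the 1/(1 + d) mapping is just one of several ways to turn a dissimilarity into a similarity.

def proximity(p, q, attr_type, n_levels=None):
    """Per-attribute dissimilarity d and similarity s (one common convention)."""
    if attr_type == "nominal":
        d = 0 if p == q else 1           # match or mismatch
        s = 1 - d
    elif attr_type == "ordinal":         # values mapped to ranks 0..n_levels-1
        d = abs(p - q) / (n_levels - 1)
        s = 1 - d
    else:                                # interval or ratio
        d = abs(p - q)
        s = 1 / (1 + d)                  # one way to map d into [0, 1]
    return d, s

print(proximity("red", "blue", "nominal"))
print(proximity(1, 3, "ordinal", n_levels=5))
print(proximity(2.0, 3.5, "interval"))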
Common Properties of Dissimilarity Measures
Distance, such as the Euclidean distance, is a dissimilarity measure and has some well known
properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
2. d(p, q) = d(q,p) for all p and q,
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r, where d(p, q) is the distance (dissimilarity)
between points (data objects), p and q.
A distance that satisfies these properties is called a metric. Following is a list of several common distance measures for comparing multivariate data. We will assume that the attributes are all continuous.
a) Euclidean Distance
Assume that we have measurements xik, i = 1, … , N, on variables k = 1, … , p (also
called attributes).
The Euclidean distance between the ith and jth objects is
d_E(i, j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}
b) Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance:
d_M(i, j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}, \quad \lambda \ge 1
where λ = 1 gives the Manhattan distance and λ = 2 gives the Euclidean distance. Note that λ and p are two different parameters: λ is the order of the distance, while p is the dimension of the data matrix, which remains finite.
c) Mahalanobis Distance
Let X be an N × p matrix. Then the ith row of X is x_i^T = (x_{i1}, ..., x_{ip}). The Mahalanobis distance between the ith and jth objects is
d_{MH}(i, j) = \left( (x_i - x_j)^T S^{-1} (x_i - x_j) \right)^{1/2}
where S is the sample covariance matrix of the data.
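A minimal NumPy/SciPy sketch computing the three distances above on a small hypothetical data matrix.

import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis, minkowski

X = np.array([[2.0, 1.0], [4.0, 3.0], [5.0, 1.0], [3.0, 2.0]])
xi, xj = X[0], X[1]

print(euclidean(xi, xj))        # square root of the sum of squared differences
print(minkowski(xi, xj, p=1))   # order 1 gives the Manhattan distance
# Mahalanobis distance needs the inverse covariance matrix of the data.
VI = np.linalg.inv(np.cov(X.T))
print(mahalanobis(xi, xj, VI))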