Data Warehousing for Analysts

This document discusses the characteristics and components of data warehousing and online analytical processing (OLAP). It describes the four key characteristics of a data warehouse as subject-oriented, integrated, non-volatile, and time-variant. It also explains the differences between operational databases and data warehouses, outlines the tiers of a typical data warehouse architecture, and discusses the importance of online transaction processing (OLTP) systems for data warehousing. Finally, it covers the requirements and benefits of OLAP for flexible and effective decision making.

CSA4003 – Data Mining and Data Warehousing

1. Characteristics of Data Warehousing


W. H. Inmon, the originator of the data warehouse concept, defined it as a collection of data with four key characteristics:
1. Subject-oriented
2. Integrated
3. Non-volatile
4. Time-variant

2. Differentiate the Data Warehouse database server from the Operational database server.


3. Demonstrate the functionalities of each tier of Data Warehouse Architecture.
4. Describe the importance and advantages of an OLTP system in Data warehousing.
5. Demonstrate the OLAP functional areas required to make a business data model support flexible and effective decision making.
Relational databases are used in the area of operations and control, with transaction processing.
(1) Data warehouses are built on these relational databases and store tactical information to answer “Who” and “What” types of questions.
(2) OLAP uses multi-dimensional views of aggregate data to provide access to strategic information. It enables users (e.g., analysts, managers, and executives) to gain insight into a wide variety of possible views of information and transforms raw data to reflect the enterprise as understood by the user.
OLAP can answer “What if” and “Why” types of questions, and thus enables decision making.
OLAP Applications and their Functions
a. Finance and Accounting
- Budgeting
- Activity based costing
- Financial Performance Analysis
- Financial Modelling
b. Sales and Marketing
- Sales Analysis and Forecasting
- Market Research Analysis
- Promotion Analysis
- Customer Analysis
- Market and Customer Segmentation
c. Production
- Production Planning
- Defect Analysis
Thus, OLAP must provide managers with the information needed for effective decision making.
OLAP applications use Key Performance Indicators (KPIs) to provide Just-in-Time (JIT) information for effective decision making.
JIT information reflects complex data relationships and is calculated on the fly.
OLAP Application Requirements to achieve flexible and effective decision making
An OLAP application requires the following functional areas:
1. Multi-Dimensional Views of Data
2. Complex calculation capabilities
3. Time Intelligence
Benefits of OLAP
 Increases productivity.
 Makes users self-sufficient.
 Allows managers to model problems, which increases flexibility.
 Provides decision making with accuracy, consistency, and timeliness.
 Speeds delivery of user applications and reduces the backlog.
 Controls the integrity of the data.
 Reduces the query load and network traffic on OLTP systems and the data warehouse.
 Enables the business to respond more quickly to market demands.

6. Demonstrate the functionality differences between OLTP and OLAP.

7. Data Warehouse Models


Introduction
Data warehousing is the process of extracting and transforming operational data into informational data and loading it into a central data store or warehouse. The warehouse is accessible to decision makers via desktop query and analysis tools.
Models
1. Enterprise Warehouse
2. Data Mart
3. Virtual Warehouse
Enterprise Warehouse
- Collects all of the information about subjects spanning the entire organization.
- Provides corporate-wide data integration from one or more operational systems or external information providers, and is cross-functional in scope.
- Contains detailed data as well as summarized data, ranging in size from a few gigabytes to hundreds of gigabytes, terabytes, or more.
- Implemented on traditional mainframes, supercomputers, or parallel architecture platforms.
- Requires extensive business modeling and may take years to design and build.

Data Mart
- Contains a subset of corporate-wide data for a specific group of users.
- Oriented toward specific, selected subjects.
For example, a marketing data mart may confine its subjects to “customer, item, and sales”. The data in a data mart tend to be summarized.
- Implemented on low-cost departmental servers (Windows/Linux/Unix based) over a short period.
- Design and planning are departmental rather than enterprise-wide.
Types of Data Mart
1. Dependent: data is sourced directly from the enterprise data warehouse.
2. Independent: data is sourced from one or more operational systems or external information providers, or is generated locally within a particular department or geographic area.
Virtual Warehouse
- A set of views over operational databases.
- Only some of the possible summary views may be materialized for efficient query processing.
- Easy to build, but requires excess capacity on operational database servers.

8. Functions of Data Warehouse Back-end Tools and Utilities


The functions of data warehouse back-end tools and utilities are:
Extraction, Transformation, and Loading (ETL)
Extraction
 Data extraction – gathers the data from multiple, heterogeneous, and external sources.
 Data cleaning – detects errors in the data and rectifies them when possible.
Transformation
Converts the data from its legacy or host format to the data warehouse format.
Loading
Sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and partitions.
Refresh
Propagates the updates from the data sources to the warehouse.
Data cleaning and data transformation functions are used to improve data quality.
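The sequence above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a hypothetical sales.csv source file and a local SQLite file standing in for the warehouse; all column names are made up for the example.

# Minimal ETL sketch: extract from a flat file, clean, transform, and load.
import pandas as pd
import sqlite3

# Extraction: gather data from an external source.
raw = pd.read_csv("sales.csv")                         # e.g. columns: item, branch, amount_sold

# Cleaning: detect errors and rectify them when possible.
raw = raw.dropna(subset=["item"])                      # drop rows missing the key field
raw["amount_sold"] = raw["amount_sold"].clip(lower=0)  # negative sales treated as noise

# Transformation: convert to the warehouse format.
raw["item"] = raw["item"].str.strip().str.title()

# Loading: summarize, consolidate, and build the warehouse table.
summary = raw.groupby(["branch", "item"], as_index=False)["amount_sold"].sum()
with sqlite3.connect("warehouse.db") as conn:
    summary.to_sql("sales_summary", conn, if_exists="replace", index=False)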

9. Metadata Definition Tools


Metadata are data about data. In a data warehouse, metadata define the warehouse objects and are stored in the bottom tier of the data warehousing architecture.
Metadata are created for the data names and definitions of the given warehouse.
Additional metadata are created and captured for timestamping any extracted data, the
source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.

Metadata Repository
The metadata repository contains
1. Description of warehouse structure
Warehouse Schema, View, Dimensions, Hierarchies, Data Definitions, Data Mart
Locations, and Contents.

2. Operational metadata
Data Lineage (History of Migrated Data and the Sequence of Transformations applied
to it). Currency of Data (Active, Archived or Purged).
Monitoring Information (Warehouse usage statistics, Error Reports, and Audit Trails).
3. Algorithms used for summarization
Measure and Dimension Definition Algorithms, Data on Granularity, Partitions,
Subject Areas, Aggregation, Summarization, and Predefined Queries and Reports.

4. Mapping from the operational environment to the data warehouse


Source Databases and their Contents, Gateway Descriptions, Data Partitions.
Data Extraction, Cleaning, Transformation Rules and Defaults, Data Refresh.
Purging Rules, and Security (Authorization and Access Control).

5. Data related to system performance


Indices and Profiles that improve Data Access and Retrieval Performance.
Rules for Timing and Scheduling of Refresh, Update, Replication Cycles.

6. Business metadata.
Business Terms and Definitions
Data Ownership Information
Charging Policies
Other types of data stored in the warehouse include:
Current detailed data – reflects the most recent happenings, which are usually the most interesting. It is always kept on disk storage for fast access, but is expensive and complex to manage.
Older detailed data – infrequently accessed and stored at a level of detail consistent with the current detailed data, usually on mass storage.
Lightly summarized data – distilled from the low level of detail found at the current detailed level. Designing it requires deciding the unit of time over which summarization is done and which contents or attributes the summarized data will contain. It is stored on disk storage.
Highly summarized data – compact and easily accessible, and can even be found outside the warehouse; it may or may not be physically housed in the warehouse.
Metadata play a very different role from other data warehouse data and are important for many reasons.
For example:
Metadata used as a directory
Helps the decision support system analyst locate the contents of the data warehouse.
Metadata used as a guide
Guides the mapping of data as they are transformed from the operational environment to the data warehouse environment.
Guides the algorithms used for summarization between the lightly and highly summarized data.
10. Data Modeling for Data Warehouses
Data warehouse data are organized as multidimensional models.
Data cube: the inherent relationships in the data are used to populate multidimensional matrices called data cubes. A data cube allows data to be modeled and viewed in multiple dimensions.
Cubes may be 2-D, 3-D, or higher-dimensional (hypercubes).
- Query performance in multidimensional matrices can be much better than in the relational model.
Dimensions are the perspectives or entities with respect to which an organization wants to keep records.
- For example, Tata may create a sales data warehouse in order to keep records of the store’s sales with respect to:
Tata_Sales(time, item, branch, location)
These dimensions allow the store to keep track of things like the monthly sales of items and the branches and locations at which the items were sold.

Each dimension may have a table associated with it, called a dimension table.
A dimension table for item may contain the attributes
Item(item_name, brand, type)
Dimension tables can be specified by users or experts, or automatically generated and adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such as sales.
This theme is represented by a fact table. Facts are numeric measures.
Example: the facts of sales include amount_sold, units_sold, and amount_budgeted.
Two-Dimensional Model (2-D)
- A table or spreadsheet of sales data from Tata.
- Example: sales data for items sold per quarter in the city of Delhi.

2-D view of sales data for Tata according to time and item, for location = “Delhi”:

Time (Quarter) | Desktop | Laptop | Mobile Phone | CCTV
1              |     320 |    653 |         1276 | 2565
2              |     250 |    432 |         2501 | 1978
3              |     175 |    821 |         1854 | 2034

The values shown are items sold, in thousands of rupees.
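The same 2-D view can be reproduced with a small pivot. The sketch below is illustrative only; the DataFrame and column names are assumptions, not part of the original example.

# Build the 2-D view (rows = quarter, columns = item type) with pandas.
import pandas as pd

sales = pd.DataFrame({
    "quarter":  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "item":     ["Desktop", "Laptop", "Mobile Phone", "CCTV"] * 3,
    "location": ["Delhi"] * 12,
    "amount":   [320, 653, 1276, 2565, 250, 432, 2501, 1978, 175, 821, 1854, 2034],
})

# Restrict to Delhi and pivot: each cell holds the amount sold for that quarter and item.
view_2d = (sales[sales["location"] == "Delhi"]
           .pivot(index="quarter", columns="item", values="amount"))
print(view_2d)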

Three-Dimensional Model (3-D)

A 3-D data cube has dimensions A, B, and C and an aggregate measure M.
Commonly used measures include count(), sum(), min(), max(), and total_sales().
Base Cuboid
A data cube is a lattice of cuboids; each cuboid represents a group-by.
A cell in the base cuboid is a base cell.
The cuboid on dimensions ABC is the base cuboid; it contains all three dimensions.
The aggregate measure M is computed for each possible combination of the three dimensions.
The base cuboid is the least generalized of all the cuboids in the data cube.
Apex Cuboid
The most generalized cuboid, summarizing over all dimensions.
A cell from a non-base cuboid is an aggregate cell.
The apex cuboid contains a single value, computed by aggregating over the A, B, and C dimensions; that is, it holds the aggregate measure M over all the tuples stored in the base cuboid.
Apex cuboid < base cuboid (in the lattice of generalization).
For example:
Suppose we have an n-dimensional data cube. Let a = (a1, a2, ..., an, measures) be a cell from one of the cuboids making up the data cube. We say that a is an m-dimensional cell if exactly m (m <= n) values among {a1, a2, ..., an} are not aggregated. If m = n, then a is a base cell; otherwise (m < n) it is an aggregate cell.
3-D view of the sales data according to time, item, and location:

Location = “Delhi”
Time (Quarter) | Desktop | Laptop | Mobile Phone | CCTV
1              |     320 |    653 |         1276 | 2565
2              |     250 |    432 |         2501 | 1978
3              |     175 |    821 |         1854 | 2034

Location = “Bombay”
Time (Quarter) | Desktop | Laptop | Mobile Phone | CCTV
1              |     653 |    320 |         2565 | 1276
2              |     432 |    250 |         1978 | 2501
3              |     821 |    175 |         2034 | 1854

[Figure: the same data arranged as a 3-D data cube with axes Time, Item Type, and Location.]
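To make the lattice of cuboids concrete, the following sketch enumerates every group-by of a small three-dimensional fact table, from the base cuboid (time, item, location) down to the apex cuboid. The toy data and column names are illustrative assumptions.

# Enumerate all 2^3 = 8 cuboids (group-bys) of a 3-D cube.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "time":     [1, 1, 2, 2],
    "item":     ["Desktop", "Laptop", "Desktop", "Laptop"],
    "location": ["Delhi", "Delhi", "Bombay", "Bombay"],
    "amount":   [320, 653, 432, 250],
})

dims = ["time", "item", "location"]
for k in range(len(dims), -1, -1):
    for combo in combinations(dims, k):
        if combo:                               # base cuboid when k == 3
            cuboid = sales.groupby(list(combo), as_index=False)["amount"].sum()
        else:                                   # apex cuboid: aggregate over everything
            cuboid = sales["amount"].sum()
        print("cuboid", combo or "apex")
        print(cuboid, end="\n\n")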

11. Data Cube / OLAP Operations


Data cube operations are used to manipulate data to meet the needs of users. These operations help to select particular data for analysis. A data cube supports five operations.
1. Roll-up: aggregates data along a dimension, combining similar data attributes and moving to a coarser level of detail.
For example, if the data cube displays daily sales, roll-up can be used to find the monthly sales.
2. Drill-down: the reverse of the roll-up operation. It takes a particular piece of information and subdivides it further for analysis.
For example, if India is shown at the country level and we wish to see cities, the drill-down operation splits India into states, districts, and cities, and the city-level information is displayed.
3. Slicing: filters out the unnecessary portions and displays the information for one particular attribute value.
For example, City = “Bombay” displays the information for that city only.
4. Dicing: performs a multidimensional cut, selecting a certain range on several dimensions to create a sub-cube.
For example, the user wants to see the annual sales of a state.
5. Pivot: transforms the view of the data cube without changing the data present in the data cube.
For example, if the user is comparing Year vs. Branch, the pivot operation lets the user change the viewpoint and compare Branch vs. Item Type instead.
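A rough sketch of these five operations on a small pandas DataFrame follows; the toy data and column names are illustrative assumptions, not part of the original examples.

# Toy illustration of roll-up, drill-down, slice, dice, and pivot with pandas.
import pandas as pd

sales = pd.DataFrame({
    "date":   pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]),
    "city":   ["Delhi", "Bombay", "Delhi", "Bombay"],
    "item":   ["Desktop", "Laptop", "Desktop", "Laptop"],
    "amount": [320, 653, 250, 432],
})

# Roll-up: daily facts aggregated to monthly totals.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

# Drill-down: break the totals back down to a finer level (month and city).
by_month_city = sales.groupby([sales["date"].dt.to_period("M"), "city"])["amount"].sum()

# Slice: fix one dimension to a single value.
bombay_only = sales[sales["city"] == "Bombay"]

# Dice: select a sub-cube over values of several dimensions.
sub_cube = sales[sales["city"].isin(["Delhi"]) & sales["item"].isin(["Desktop"])]

# Pivot: rotate the view (rows = city, columns = item) without changing the data.
pivoted = sales.pivot_table(index="city", columns="item", values="amount", aggfunc="sum")
print(monthly, by_month_city, bombay_only, sub_cube, pivoted, sep="\n\n")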

Advantages of Data Cubes


Multidimensional Analysis
Data cubes enable multidimensional analysis of Business data, allowing users to view
data from different perspectives and levels of detail.
Interactivity
Data cubes provide interactive access to large amounts of data, allowing users to easily navigate and manipulate the data to support their analysis.
Speed and Efficiency
Data cubes are optimized for OLAP analysis, enabling fast and efficient querying and
aggregation of data.
Data Aggregation
Data cubes support complex calculations and data aggregation, enabling users to quickly and easily summarize large amounts of data.
Improved decision-making
Data cubes provide a clear and comprehensive view of business data, enabling improved
decision-making and business intelligence.
Accessibility
Data cubes can be accessed from a variety of devices and platforms, making it easy for
users to access and analyze business data from anywhere.
 Help in giving a summarized view of the data.
 Store large volumes of data in a simple way.
 Provide quicker and better analysis.
 Improve query performance.

Disadvantages of data cube:

Complexity: OLAP systems can be complex to set up and maintain, requiring specialized
technical expertise.

Data size limitations: OLAP systems can struggle with very large data sets and may require
extensive data aggregation or summarization.

Performance issues: OLAP systems can be slow when dealing with large amounts of data,
especially when running complex queries or calculations.

Data integrity: Inconsistent data definitions and data quality issues can affect the accuracy of
OLAP analysis.

Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due to the
need for specialized hardware and software.

Inflexibility: OLAP systems may not easily accommodate changing business needs and may
require significant effort to modify or extend.

12. Elaborate Schemas for Multidimensional data models.


Data warehouses require a concise, subject-oriented schema that facilitates online data analysis.
The most common schemas for the multidimensional data model are the star schema, the snowflake schema, and the fact constellation schema.
(a) Star Schema
A data warehouse star schema contains:
1. A large central table, called the fact table, containing the bulk of the data with no redundancy.
2. A set of smaller attendant tables, called dimension tables, one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
For example, sales are considered along four dimensions: time, item, branch, and location.
The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures, amount_sold and units_sold.
To minimize the size of the fact table, the dimension identifiers (such as time_key and item_key) are system-generated identifiers.

Each dimension is represented by only one table, and each table contains a set of attributes.
The attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
The star schema is the most popular schema and is efficient.
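A toy star schema can be sketched as a central fact table keyed into small dimension tables. The tables, keys, and values below are illustrative assumptions, not the example data from the text.

# Toy star schema: a sales fact table joined to time and item dimension tables.
import pandas as pd

time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": [1, 2], "year": [2024, 2024]})
item_dim = pd.DataFrame({"item_key": [10, 11],
                         "item_name": ["Desktop", "Laptop"],
                         "brand": ["Acme", "Acme"],
                         "type": ["PC", "PC"]})

# Fact table: only system-generated keys plus the numeric measures.
sales_fact = pd.DataFrame({
    "time_key":    [1, 1, 2],
    "item_key":    [10, 11, 10],
    "amount_sold": [320, 653, 250],
    "units_sold":  [4, 5, 3],
})

# Analysis joins the fact table back to the dimension tables as needed.
report = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(item_dim, on="item_key")
          .groupby(["quarter", "item_name"], as_index=False)["amount_sold"].sum())
print(report)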

(b) Snowflake Schema


It is a variant of the star schema model.
It is used to reduce redundancy.
Some dimension tables are normalized, splitting the data further into additional tables, so the schema graph resembles a snowflake.
The normalized tables are easy to maintain and save storage space.
However, the additional joins can reduce query performance, so the snowflake schema is less popular than the star schema in data warehouse design.
For example, the location table in the star schema holds a “state” attribute as well as “city”; since a state contains more than one city, this redundancy can be removed by splitting the data into a separate table.
The definition of the dimension tables is what differentiates the snowflake schema from the star schema.
The single dimension table “item” in the star schema is normalized in the snowflake schema by creating new “item” and “supplier” tables.
The single dimension table “location” is normalized by creating new “location” and “city” tables.

(c) Fact Constellation


Sophisticated data warehouse applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, called a galaxy schema or a fact constellation.

For example:
Consider two fact tables, “sales” and “shipping”.
The sales table definition is identical to that in the star schema.
The shipping table has five dimensions, or keys (time_key, item_key, shipper_key, from_location, to_location), and two measures (dollars_cost, units_shipped).
A fact constellation schema allows dimension tables to be shared between fact tables.
The dimension tables for time, item, and location are shared between the sales and shipping fact tables.
A data warehouse collects information about subjects that span the entire organization.
Because the fact constellation schema can model multiple, interrelated subjects, it is commonly used in data warehouses.

13. Data Cleaning techniques in data mining.


Data cleaning is the process of cleaning dirty data. Real-world data are mostly not clean: they can be incorrect for a large number of reasons, such as hardware errors or failures, network errors, or human error. So it is essential to clean the data before mining.
Benefits of Data Cleaning
 Data cleaning removes major errors.
 Data cleaning ensures happier customers, more sales, and more accurate decisions.
 Data cleaning removes inconsistencies that are most likely to occur when multiple sources of data are merged into one data set.
 Data cleaning makes the data set more efficient, more reliable, and more accurate.

Sources of missing values


The user forgot to fill in a field.
It can be a programming error.
Data can be lost when transferring data manually from a legacy database.
For example:
Incomplete data: Salary = ” ”
Inconsistent data: Age = ”5 years”
Noisy data: Salary = ”-5000”, Name = ”1234”
Internal (default-value) error: Gender = ”male”, where “male” is the application’s default value.

How to handle the Incomplete / Missing Data?


Ignore the tuple.
Fill in the missing value manually.
Fill in the value automatically by
- using the attribute mean,
- using a constant value, if an appropriate constant exists, or
- using the most probable value, obtained by a Bayesian formula or a decision tree (a sketch of the first two options follows).
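A minimal pandas sketch of the automatic filling strategies; the DataFrame and column names are illustrative assumptions.

# Filling missing values automatically with pandas.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35],
                   "city": ["Delhi", "Bombay", None, "Delhi"]})

# Fill a numeric attribute with its mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Fill a categorical attribute with a constant value.
df["city"] = df["city"].fillna("Unknown")
print(df)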

How to handle Noisy Data?


The following methods are used to smooth out noisy data:
1. Binning
2. Regression
3. Clustering
4. Combined Computer and Human Inspection.

Data Cleaning Methods


There are six major steps in data cleaning:
1. Monitor the errors: track where errors come from and identify which sources are responsible for most of the errors.
2. Standardize the mining process: standardize the point of entry and check its importance. Standardizing the data process leads to a good point of entry and reduces the risk of duplication.
3. Validate data accuracy: validate the accuracy of the data after the database has been cleaned. Real-time tools are available to help clean the data.
4. Scrub for duplicate data: identify duplicates, because doing so saves time when performing data analysis.
5. Analyze: once the data have been standardized, validated, and scrubbed for duplicates, third-party sources can capture information directly from our databases and help clean and compile the data to ensure completeness, accuracy, and reliability for business decision making.
6. Communicate with the team: share the new standardized cleaning process with the team.
Data Cleaning Tools
OpenRefine
Trifacta Wrangler
Drake
Data Ladder
Data Cleaner
Cloudingo
Reifier
IBM Infosphere Quality Stage
TIBCO Clarity
Winpure

14. Solve the following problem using the binning method for data smoothing.
Sales_Prices: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
Data smoothing is a data pre-processing technique that uses an algorithm to remove noise from the data set. This allows important patterns to stand out.
The binning method smooths a sorted data value by consulting its “neighborhood”, that is, the values around it.
The sorted values are distributed into a number of “buckets”, or bins.
The binning method performs local smoothing because it consults only the neighborhood of each value.

The Sales_Prices data are unsorted, so we first sort them using any sorting algorithm.
Sorted Sales_Prices: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
(a) Partitioning the data into equal-frequency bins
There are 12 values in total. If we use a bin depth of 4, then 3 bins will be used.
Each bin holds an equal number (4) of values, which is why this is called equal-frequency partitioning.
Bin-1: 8, 9, 15, 16
Bin-2: 21, 21, 24, 26
Bin-3: 27, 30, 30, 34
(b) Smoothing by Bin Means
Mean = sum of the values in the bin / number of values in the bin
For Bin-1:
(8 + 9 + 15 + 16) / 4 = 48 / 4 = 12
Therefore
Bin-1: 12, 12, 12, 12

For Bin-2: (21 + 21 + 24 + 26) / 4 = 92 / 4 = 23
Bin-2: 23, 23, 23, 23
For Bin-3: (27 + 30 + 30 + 34) / 4 = 121 / 4 ≈ 30
Bin-3: 30, 30, 30, 30

(c) Smoothing by Bin Boundaries

8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34

Bin-1: 8, 9, 15, 16
The minimum (boundary) value is 8, and the maximum (boundary) value is 16.
The first value, 8, is a boundary, so it stays 8.
The second value, 9, is nearer to 8, so 9 is replaced by 8.
The third value, 15, is nearer to 16 and far from 8, so 15 is replaced by 16.
Therefore
After boundary smoothing, Bin-1: 8, 8, 16, 16
Likewise
Before boundary smoothing, Bin-2: 21, 21, 24, 26
After boundary smoothing, Bin-2: 21, 21, 26, 26
Before boundary smoothing, Bin-3: 27, 30, 30, 34
After boundary smoothing, Bin-3: 27, 27, 27, 34
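The three smoothing schemes above can be sketched in a few lines of Python. This is a minimal illustration using NumPy, with the bin depth of 4 hard-coded to match the worked example.

# Equal-frequency binning, then smoothing by bin means and by bin boundaries.
import numpy as np

prices = sorted([8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34])
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# (b) Smoothing by bin means: replace every value with its bin's mean.
by_means = [[round(np.mean(b))] * len(b) for b in bins]

# (c) Smoothing by bin boundaries: replace each value with the nearer boundary.
by_boundaries = [[b[0] if (v - b[0]) <= (b[-1] - v) else b[-1] for v in b] for b in bins]

print("bins:          ", bins)
print("bin means:     ", by_means)
print("bin boundaries:", by_boundaries)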
Advantages of data smoothing
Data smoothing makes important hidden patterns in the data set easier to understand.
Data smoothing can be used to help predict trends. Prediction is very helpful for making the right decisions at the right time.
Data smoothing helps in getting accurate results from the data.

Disadvantages of data smoothing


Data smoothing does not always provide a clear explanation of the patterns in the data.
It is possible for certain data points to be ignored while other data points are emphasized.

Exponential Smoothing
Exponential smoothing is a technique for smoothing time series data. It smooths the data using an exponential window function.
Advantages
Easy to learn and apply.
It gives more significance to recent observations.
It leads to accurate predictions.
Disadvantages
Its predictions tend to lag behind the actual data trend.
In its simple form, it cannot handle data trends very well.
Types of Exponential Smoothing
1. Double Exponential Smoothing
2. Triple Exponential Smoothing
3. Holt’s Linear Exponential Smoothing
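A minimal sketch of simple exponential smoothing with pandas follows; the series is the sorted sales data from the binning example, and the smoothing factor alpha = 0.5 is an arbitrary illustrative choice.

# Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.
import pandas as pd

series = pd.Series([8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34])
alpha = 0.5                                   # larger alpha weights recent values more
smoothed = series.ewm(alpha=alpha, adjust=False).mean()
print(pd.DataFrame({"original": series, "smoothed": smoothed.round(2)}))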
Other techniques of Data Smoothing
Moving Medians
Moving Statistics
Linear Filter
Moving Averages
Weighted Moving Averages
How do we find outliers in a data set?
8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Outliers are extreme values that differ from most other data points in a data set. They can have a big impact on statistical analyses and skew the results of any hypothesis tests.

What are outliers?


Outliers are values at the extreme ends of a dataset.
Some outliers represent true values from natural variation in the population; other outliers may result from incorrect data entry, equipment malfunctions, or other measurement errors.
An outlier is not always a form of dirty or incorrect data, so you have to be careful with them in
data cleansing.
True Outliers:
True Outliers should always be retained in your dataset because these just represent natural
variations in your sample.
There are four ways to identify outliers.
1. Sorting method
For example:
Data set: 180, 156, 9, 176, 163, 1827, 166, 171
Sorted data set: 9, 156, 163, 166, 171, 176, 180, 1827
The lowest value is 9.
When we scan the values from low to high, we find that the extreme value is 1827.
Therefore, the outlier is 1827.

2. Data Visualization method


Use the Visualization software to visualize the data set with a box plot or a box-and-whisker
plot. It shows the data distribution at a glance. This type of chart highlights minimum and
maximum values (the range), the median, and the interquartile range for the data set.
Many computer programs highlight an outlier on a chart with an asterisk, and these will lie
outside the bounds of the graph.
3. Statistical test (Z scores)
Statistical outlier detection involves applying statistical tests or procedures to identify extreme
values.
We can convert the extreme data points into Z scores that tell us how many standard deviations
away they are from the mean.
If a value has a high enough or low enough Z score, it can be considered an outlier.
As a rule of thumb, values with a Z score greater than 3 or less than -3 are often determined to
be outliers.
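A short sketch of the Z-score rule on the sorting-method data above (one caveat: with only n = 8 points, the largest attainable |z| computed with the sample standard deviation is (n - 1)/sqrt(n) ≈ 2.47, so the ±3 rule of thumb flags nothing here; it is intended for larger samples).

# Z-score outlier check with the threshold of 3 from the text.
import numpy as np

data = np.array([180, 156, 9, 176, 163, 1827, 166, 171])
z = (data - data.mean()) / data.std(ddof=1)     # sample standard deviation
print(dict(zip(data.tolist(), z.round(2))))
print("flagged as outliers:", data[np.abs(z) > 3])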
4. Interquartile Range Method
The interquartile range (IQR) tells you the range of the middle half of your dataset.
Use the IQR to create “fences” around the data, and then define outliers as any values that fall outside those fences.

Note: this method is helpful if we have a few values at the extreme ends of our dataset but are not sure whether any of them count as outliers.
Interquartile Range Procedure
1. Sort the data from Low to High
2. Identify the first quartile (Q1), the median, and the third quartile (Q3).
3. Calculate your IQR = Q3 – Q1
4. Calculate the Upper fence = Q3 + (1.5 * IQR)
5. Calculate the Lower fence = Q1 – (1.5 * IQR)
6. Use the fences to highlight any outliers, all values that fall outside the fences.

The outliers are any values greater than the Upper fence or Less than the Lower fence.
For example:
Data set: 25, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29
Step-1: Sort the data lower to high
22, 24, 25, 28, 29, 31, 35, 37, 41, 53, 64
Step-2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of the dataset when all values are ordered from low to high.
The median of this dataset is 31: there are 11 values in total, so the median is the 6th value.
22, 24, 25, 28, 29, 31, 35, 37, 41, 53, 64
We use the exclusive method for identifying Q1 and Q3, which means we exclude the median from the calculations.
Q1 is the value in the middle of the first half of the dataset, excluding the median. The first quartile value is 25.
22, 24, 25, 28, 29 [ Q1 = 25 ]
Q3 is the value in the middle of the second half of the dataset, excluding the median. The third quartile value is 41.
35, 37, 41, 53, 64 [ Q3 = 41 ]

Step-3: Calculate IQR


The IQR is the range of the middle half of the dataset, so subtract Q1 from Q3:
IQR = Q3 – Q1 = 41 – 25 = 16
Step-4: Calculate Upper Fence
The upper fence is the boundary around the third quartile. Any values exceeding the upper fence are outliers.
Upper Fence = Q3 + (1.5*IQR)
= 41 + (1.5 * 16)
= 41 + 24
= 65
Step-5: Calculate Lower Fence
The Lower Fence is the boundary around the first quartile. Any values less than the lower
fence are outliers.
Lower Fence = Q1 – (1.5 * IQR)
= 25 – (1.5 * 16)
= 25 – 24
=1
Step-6: Using the Fences to highlight any Outliers
Go back to the dataset from Step-1 and highlight any values that are greater than the
upper fence or less than the lower fence. These are the outliers.
Lower Fence = 1 Upper Fence = 65
The dataset is: 22, 24, 25, 28, 29, 31, 35, 37, 41, 53, 64
No outliers.
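The fence calculation in Steps 2–6 can be scripted directly. The sketch below uses Python's statistics module and follows the exclusive-median quartile convention used above (library quantile functions may use slightly different conventions and return slightly different Q1/Q3 values).

# IQR fences using the exclusive-median quartile method from the worked example.
import statistics

data = sorted([25, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29])
n = len(data)

# Split out the halves, excluding the median (this split assumes an odd number of values).
lower_half = data[: n // 2]
upper_half = data[(n + 1) // 2 :]
q1 = statistics.median(lower_half)          # 25
q3 = statistics.median(upper_half)          # 41
iqr = q3 - q1                               # 16

upper_fence = q3 + 1.5 * iqr                # 65
lower_fence = q1 - 1.5 * iqr                # 1
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower_fence}, {upper_fence}), outliers={outliers}")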

15. Integration of Data Mining system with a Database or Data Warehouse.


There are four kinds of schemes for coupling a data mining (DM) system with a database (DB) or data warehouse (DW) system.
1. No coupling: the DM system does not utilize any function of a DB or DW system. It may fetch data from a particular source (such as a file system), process the data using some data mining algorithms, and then store the mining results in another file.
2. Loose coupling: the DM system uses some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of the data stored in databases or data warehouses by using query processing, indexing, and other system facilities.
Many loosely coupled mining systems are main-memory based. Because such mining does not exploit the data structures and query optimization methods provided by DB or DW systems, it is difficult for loose coupling to achieve high scalability and good performance with large data sets.
3. Semi-tight coupling: besides linking a DM system to a DB/DW system, efficient implementations of a few essential data mining primitives are provided in the DB/DW system. These primitives can include sorting, indexing, aggregation, histogram analysis, multi-way join, and the precomputation of some essential statistical measures such as sum, count, max, min, and standard deviation.
4. Tight coupling: the DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of the information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of the DB/DW system.

DMQL:
The Data Mining Query Language (DMQL) is based on the Structured Query Language (SQL).
DMQL is designed to support ad hoc and interactive data mining.
It provides commands for specifying primitives.
It can work with databases and data warehouses.
DMQL can be used to define data mining tasks.

Syntax for Task-Relevant Data Specification


For Database: use database database_name
OR
For Data Warehouse: use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s) / cube(s) [where condition]
order by order_list
group by group_list

Syntax for Specifying the Kind of Knowledge


The syntax for Characterization, Discrimination, Association, Classification, and
Prediction.
Characterization: Syntax is
mine characteristics [as pattern_name]
analyze { measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example, a description of customer purchasing habits:
mine characteristics as customerPurchasing
analyze count%

Discrimination: Syntax is
mine comparison [as {pattern_name} ]
for {target_class} where {target_condition} {versus { contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}
For example:
A user may define big spenders as customers who purchase items that cost $100 or more on average, and budget spenders as customers who purchase items costing less than $100 on average.
The mining of discriminant descriptions for customers from each of these categories can be specified in DMQL as:
mine comparison as purchaseGroups
for bigSpenders where avg(I.price) >= $100
versus budgetSpenders where avg(I.price) < $100
analyze count

Association: The Syntax is


mine associations [ as {pattern_name} ]
{matching {metapattern} }

For example:
mine associations as buyingHabits
matching P(X: customer, W) ^ Q(X, Y) => buys(X, Y)

where X is a key of the customer relation, P and Q are predicate variables, and W, X, and Y are object variables.
Classification: The Syntax is
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example:
To mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating, the task can be specified as:
mine classification as classifyCustomerCreditRating
analyze credit_rating

Prediction: The Syntax is


mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set { attribute_or_dimension_i = value_i } }

Syntax for Concept Hierarchy Specification


use hierarchy <hierarchy> for <attribute_or_dimension>

Different syntaxes for different types of hierarchies


Schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]

Set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior

Operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
Rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
    if (price - cost) < $50
level_1: medium_profit_margin < level_0: all
    if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
    if (price - cost) > $250

Syntax for Interestingness Measures Specification


Interestingness measures and thresholds can be specified by the user with the statement:
with <interest_measure_name> threshold = threshold_value

For example:
with support threshold = 0.05
with confidence threshold = 0.7

Syntax for Pattern Presentation and Visualization Specification


A syntax allows users to specify the display of discovered patterns in one or more
forms.
display as <result_form>

For example: display as table

Full Specification of DMQL


use database AllElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 100
with noise threshold = 5%
display as table
