
21AD1901 – DATA WAREHOUSING AND DATA MINING (DWDM)

2 Marks Answers:

1. Characteristics of Data Warehousing with examples

• Subject-Oriented: Organized around subjects like sales, customers.


• Integrated: Data from multiple sources is unified.
• Time-Variant: Historical data is stored (e.g., sales trends across years).
• Non-Volatile: Data is read-only (not updated).

2. Purpose of ETL in Data Warehousing


ETL (Extract, Transform, Load) extracts data from sources, cleans & formats it, and loads it into
the warehouse.
Example: Extract sales from ERP, standardize currency, load into warehouse.
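
A minimal ETL sketch in Python using pandas (a hedged illustration only: the file name sales_erp.csv, the column names, the conversion rate, and the SQLite warehouse are assumptions, not part of the course example):

import sqlite3

import pandas as pd

# Extract: read raw sales records from a hypothetical ERP export
sales = pd.read_csv("sales_erp.csv")            # assumed file name

# Transform: clean and standardize (e.g., convert EUR amounts to USD)
sales = sales.dropna(subset=["amount"])         # drop rows with missing amounts
eur_to_usd = 1.08                               # assumed fixed conversion rate
mask = sales["currency"] == "EUR"
sales.loc[mask, "amount"] = sales.loc[mask, "amount"] * eur_to_usd
sales.loc[mask, "currency"] = "USD"

# Load: append the cleaned data into the warehouse fact table
conn = sqlite3.connect("warehouse.db")          # assumed warehouse location
sales.to_sql("sales_fact", conn, if_exists="append", index=False)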

3. Data Warehouse difference from Database

• Database: Stores current transactional data, optimized for daily operations (OLTP).
• Data Warehouse: Stores historical, integrated data, optimized for analysis (OLAP).

4. Facts and dimensions in Data Warehouse

• Facts: Numerical measures (e.g., sales amount, profit).


• Dimensions: Descriptive context (e.g., time, product, region).

5. KDD steps / KDD Process steps involved


Steps:

1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Presentation

6. Primary goals of data preprocessing

• Improve data quality (remove noise, fill missing values).


• Ensure consistency & accuracy.
• Prepare data for mining (transformation, reduction).

7. Cosine Similarity of X & Y

X = (5,0,3,0,2,0,0,2,0,0)
Y = (3,0,2,0,1,1,0,1,0,1)

Formula:

Cosine Similarity = (X · Y) / (||X|| × ||Y||)

Dot Product:

X . Y = (5 × 3) + (3 × 2) + (2 × 1) + (2 × 1) = 15 + 6 + 2 + 2 = 25

||X|| = √(5²+3²+2²+2²) = √42 ≈ 6.48


||Y|| = √(3²+2²+1²+1²+1²+1²) = √17 ≈ 4.12

So, cosine similarity = 25 / (6.48 × 4.12) = 25/26.70 ≈ 0.94 (approx.)
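
The same calculation can be checked with a short Python snippet (standard library only, vectors taken from the question):

import math

X = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
Y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(x * y for x, y in zip(X, Y))       # 25
norm_x = math.sqrt(sum(x * x for x in X))    # sqrt(42) ≈ 6.48
norm_y = math.sqrt(sum(y * y for y in Y))    # sqrt(17) ≈ 4.12
print(round(dot / (norm_x * norm_y), 2))     # ≈ 0.94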

8. Techniques and tools available in visualization

• Techniques: Pie charts, Bar charts, Histograms, Scatter plots, Heatmaps.


• Tools: Tableau, Power BI, Matplotlib, D3.js.
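
A small Matplotlib sketch showing two of these techniques (the data values below are made up purely for illustration):

import matplotlib.pyplot as plt

ages = [22, 25, 31, 35, 41, 43, 48, 52, 57, 61]
income = [18, 22, 30, 33, 41, 45, 50, 55, 58, 62]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(ages, bins=5)        # histogram: frequency distribution of one attribute
ax2.scatter(ages, income)     # scatter plot: relationship between two attributes
plt.show()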
9. Frequent patterns in the context of data mining
Frequent patterns = recurring relationships.
Examples:

• Item sets: milk & bread often bought together.


• Sequences: buy PC → then buy printer.
• Substructures: repeating graph/tree patterns.

10. Support and confidence used in frequent pattern analysis

• Support (s): Frequency of occurrence of an itemset in all transactions.


• Confidence (c): Probability that if X occurs, Y also occurs.
Example: buys(diaper) → buys(beer) with s=50%, c=60%.
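
A short Python sketch of how support and confidence are computed from raw transactions (the toy transactions below are assumptions for illustration, so the numbers differ from the s=50%, c=60% quoted above):

transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: buys(diaper) -> buys(beer)
s = support({"diaper", "beer"})                        # 2/4 = 0.50
c = support({"diaper", "beer"}) / support({"diaper"})  # 0.50 / 0.75 ≈ 0.67
print(f"support={s:.2f}, confidence={c:.2f}")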

13 Marks Answers:

11 a). Data Warehousing Multilayer Architecture

Introduction:
A data warehouse is not just a single database, but a complete system for storing historical data
and analyzing it for decision making. To make it efficient, it is built in multiple layers, each
performing a clear function. The standard model is the three-tier architecture.

Layers:

1. Bottom Tier – Data Sources and ETL


o Includes operational databases (like sales DB, HR DB, ERP) and external
data (market data, CSV files).
o ETL process extracts data, cleans it, transforms it, and loads it into a staging
area.
o Example: Sales data from an ERP and marketing data from social media
combined together.
2. Middle Tier – Data Warehouse Server (OLAP Engine)
o Contains the integrated database storing historical and current data.
o Supports schemas like star or snowflake.
o OLAP engine allows multidimensional queries (roll-up, drill-down).
3. Top Tier – Front-End Tools (User Layer)
o Business users interact using tools such as Power BI, Tableau, SQL-based
reports.
o Provides dashboards, charts, and decision support.

Diagram (exam-ready):

Data Sources (ERP, CRM, Files) → ETL → Data Warehouse (schemas, OLAP) → Users
(Reports, BI Tools)

Conclusion:
This architecture separates storage, processing, and presentation, making data warehouses
scalable, flexible, and reliable for business intelligence.

11 b). Data Warehouse Schemas for Decision Support

Introduction:
A schema is the logical design of how data is stored. In data warehousing, schema design helps
in decision support queries. The main schemas are Star, Snowflake, and Galaxy (Fact
Constellation).

Types:

1. Star Schema:
o Has one fact table (numeric measures) at the center and several dimension
tables around it.
o Example: Sales Fact Table → Time, Product, Customer, Region.
o Advantage: Simple, fast query performance, widely supported.
o Disadvantage: Data redundancy since dimensions are denormalized.
2. Snowflake Schema:
o A normalized version of star schema where dimensions are broken into sub-
dimensions.
o Example: Product → Category → Brand.
o Advantage: Saves storage, avoids redundancy.
o Disadvantage: Requires more joins, slower queries.
3. Galaxy / Fact Constellation Schema:
o Contains multiple fact tables sharing common dimensions.
o Example: Sales Fact + Shipping Fact sharing Time & Location dimensions.
o Advantage: Handles complex business models, multiple processes.
o Disadvantage: Complex design, hard to maintain.
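
A minimal sketch of how a star-schema query looks in Python with pandas, assuming a hypothetical sales fact table plus product and time dimension tables (all table, column, and value names are illustrative):

import pandas as pd

# Dimension tables (denormalized, as in a star schema)
product_dim = pd.DataFrame({"product_id": [1, 2],
                            "product": ["Mobile", "Laptop"],
                            "category": ["Electronics", "Electronics"]})
time_dim = pd.DataFrame({"time_id": [10, 11], "year": [2023, 2024]})

# Fact table holds foreign keys plus numeric measures
sales_fact = pd.DataFrame({"product_id": [1, 2, 1],
                           "time_id": [10, 10, 11],
                           "amount": [100, 250, 120]})

# Decision-support query: total sales per category per year
result = (sales_fact
          .merge(product_dim, on="product_id")
          .merge(time_dim, on="time_id")
          .groupby(["category", "year"])["amount"].sum())
print(result)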

Diagram:

• Star = simple star shape with one fact in center.


• Snowflake = fact with branched dimensions.
• Galaxy = two facts connected to shared dimensions.

Conclusion:

• Star → simple & fast.


• Snowflake → storage-efficient.
• Galaxy → supports complex businesses.

12 a) OLAP Operations

Introduction:
OLAP (Online Analytical Processing) supports fast analysis of multidimensional data in a data
warehouse. It allows users to view data from different angles.

Main Operations:

1. Roll-up: Aggregating data to a higher level.


o Example: Sales by month → quarter → year.
2. Drill-down: Moving to more detailed data.
o Example: Sales by year → quarter → month.
3. Slice: Fixing one dimension to see a single layer.
o Example: Sales in 2024 only.
4. Dice: Selecting multiple dimensions.
o Example: Sales in 2024, in Asia, for Mobiles.
5. Pivot (Rotate): Reorienting data view.
o Example: Swap rows and columns to view sales by product vs region.
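
A small pandas sketch of these five operations on a toy sales table (column names and figures are assumptions for illustration only):

import pandas as pd

df = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q1", "Q1", "Q2", "Q2"],
    "region":  ["Asia", "Asia", "Asia", "Europe", "Asia", "Europe"],
    "product": ["Mobile", "Laptop", "Mobile", "Mobile", "Laptop", "Mobile"],
    "sales":   [100, 150, 120, 90, 200, 110],
})

rollup = df.groupby("year")["sales"].sum()                 # Roll-up: quarter -> year
drilldown = df.groupby(["year", "quarter"])["sales"].sum() # Drill-down: year -> quarter
slice_2024 = df[df["year"] == 2024]                        # Slice: fix year = 2024
dice = df[(df["year"] == 2024) &                           # Dice: select on several dimensions
          (df["region"] == "Asia") &
          (df["product"] == "Mobile")]
pivot = df.pivot_table(values="sales", index="product",    # Pivot: swap rows and columns
                       columns="region", aggfunc="sum")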

Diagram:
Draw a cube: show slicing one layer, drilling down into detail.

Conclusion:
OLAP operations provide flexibility and allow managers to explore data interactively,
improving decisions.

12 b). OLAP Server Architecture & OLAP vs OLTP

OLAP Server Types:

1. MOLAP (Multidimensional OLAP): Uses pre-built cubes. Fast for queries, but large
storage.
2. ROLAP (Relational OLAP): Uses relational DB with star/snowflake schemas. Good
for large data, slower queries.
3. HOLAP (Hybrid OLAP): Combines both—summary stored in MOLAP, details in
ROLAP.

OLAP vs OLTP:

Feature | OLAP (Analysis) | OLTP (Transactions)
Purpose | Analysis, decision support | Routine transactions
Data | Historical, summarized | Current, operational
Queries | Complex, read-heavy | Simple, write-heavy
Example | Sales trends report | Banking transaction

Conclusion:
OLAP is for decision support; OLTP is for day-to-day operations. Both complement each
other.

13 a). Data Mining Steps & Architecture

Steps (KDD Process):

1. Data Cleaning – remove noise/missing values.


2. Data Integration – merge from multiple sources.
3. Data Selection – choose relevant attributes.
4. Data Transformation – convert formats.
5. Data Mining – apply algorithms (clustering, classification).
6. Pattern Evaluation – filter interesting patterns.
7. Knowledge Presentation – use visualization (rules, graphs).

Architecture Components:

• Database/Data Warehouse Server – provides input data.


• Knowledge Base – stores background knowledge.
• Data Mining Engine – core algorithms.
• Pattern Evaluation Module – checks interestingness.
• User Interface – communicates results to users.
Diagram:
Show flow: Database → Mining Engine → Evaluation → User Interface.

Conclusion:
The KDD process ensures only useful and valid knowledge is discovered.

13 b). Role of Data Cleaning & Data Integration

Data Cleaning:

• Fixes missing, noisy, inconsistent data.


• Methods: mean substitution, binning, clustering, regression.
• Example: Replacing missing age with average value.

Data Integration:

• Combines data from multiple sources.


• Removes redundancy & conflicts.
• Example: Same customer appearing in ERP & CRM merged.
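
A brief pandas sketch of both steps on made-up ERP and CRM tables (all names and values are illustrative):

import pandas as pd

# Data cleaning: fill a missing age using mean substitution
erp = pd.DataFrame({"cust_id": [1, 2, 3], "age": [25, None, 35]})
erp["age"] = erp["age"].fillna(erp["age"].mean())       # missing age -> 30

# Data integration: the same customers appear in ERP and CRM; merge on the key
crm = pd.DataFrame({"cust_id": [1, 2, 3], "segment": ["Gold", "Silver", "Gold"]})
customers = erp.merge(crm, on="cust_id", how="inner")   # one unified record per customer
print(customers)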

Conclusion:
Both cleaning and integration are crucial preprocessing steps, improving quality before mining.

14 a). Data Transformation Methods

Definition: Changing data into suitable form for mining.

Methods:

1. Smoothing – remove noise (binning, regression).


2. Aggregation – summary data (daily → monthly sales).
3. Normalization – scale to range [0,1] (min-max, z-score).
4. Attribute Construction – create new attributes (BMI from weight & height).
5. Discretization – convert continuous → categorical (age → {child, adult}).
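
A short pandas sketch of normalization and discretization (the age values are illustrative):

import pandas as pd

ages = pd.Series([12, 25, 40, 63, 80])

# Normalization: min-max scaling to [0, 1]
min_max = (ages - ages.min()) / (ages.max() - ages.min())

# Normalization: z-score scaling
z_score = (ages - ages.mean()) / ages.std()

# Discretization: continuous age -> categorical {child, adult}
categories = pd.cut(ages, bins=[0, 18, 120], labels=["child", "adult"])
print(min_max.tolist(), categories.tolist())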

Conclusion:
Transformation ensures data is comparable, standardized, and ready for analysis.
14 b). Attribute Types & Choice of Mining Algorithms

• Nominal (Categories):
o Examples: eye color, gender.
o Algorithms: decision trees, clustering (k-modes).
• Ordinal (Ranked):
o Examples: grades (A, B, C), height {short, medium, tall}.
o Algorithms: regression trees, ranking models.
• Interval (Equal spacing, no true zero):
o Examples: temperature in °C, calendar dates.
o Algorithms: linear regression, correlation analysis.
• Ratio (True zero):
o Examples: weight, time, counts.
o Algorithms: statistical analysis, clustering, classification.

Conclusion:
Choosing correct algorithm based on attribute type ensures valid and accurate results.

15 a). Vertical Data Format – Apriori

Apriori Algorithm:

• Mines frequent itemsets using candidate generation + pruning.


• Rule: If a set is frequent, all its subsets must also be frequent.

Vertical Format (TID-lists):

• Instead of scanning full DB, store each item with list of transaction IDs.
• Example:
o A: {T1, T2, T5}
o B: {T2, T4}
o Intersection gives support for {A,B}.
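
A tiny Python sketch of the TID-list idea using the example above:

# Vertical (TID-list) format: each item maps to the set of transactions containing it
tid_lists = {
    "A": {"T1", "T2", "T5"},
    "B": {"T2", "T4"},
}

# Support count of {A, B} is the size of the intersection of the two TID-lists
support_AB = len(tid_lists["A"] & tid_lists["B"])   # {"T2"} -> count 1
print(support_AB)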

Conclusion:
Using the vertical data format speeds up Apriori by avoiding repeated database scans.

15 b). Interesting Patterns & Pattern Evaluation

Interesting Patterns:
• Not all patterns are useful.
• A pattern is interesting if:
o Valid on test data.
o Useful and actionable.
o Novel or unexpected.
o Understandable to users.

Measures of Interestingness:

• Support: frequency.
• Confidence: reliability.
• Lift/Correlation: measures dependency.
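
A quick worked example with assumed numbers: if support(X∪Y) = 0.4, support(X) = 0.5 and support(Y) = 0.6, then lift = 0.4 / (0.5 × 0.6) ≈ 1.33. A lift above 1 means X and Y occur together more often than expected by chance; a lift below 1 indicates negative correlation.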

Pattern Evaluation:

• Done after mining to filter unimportant patterns.


• Uses thresholds (min_support, min_confidence).

Conclusion:
Pattern evaluation ensures results are relevant and valuable for decision making.

15 Marks Answers:

16 a). Statistical Description of Data in Data Mining

Introduction:
In data mining, statistical description is the process of summarizing and presenting the main
features of a dataset using numerical and graphical measures. It helps to understand the
distribution, spread, and central tendency of data before applying mining techniques.

Types of Statistical Measures

1. Measures of Central Tendency

o Mean (Average): Mean = ΣX / N
o Median: Middle value when data is sorted. Robust to outliers.
o Mode: Most frequent value.

2. Measures of Dispersion (Spread)


o Range: Difference between max and min values.
o Variance & Standard Deviation: Shows spread around the mean.

Variance = Sumof (X – X’)^2 / N

3. Measures of Distribution Shape


o Skewness: Asymmetry in data distribution.
▪ Positive skew → long right tail (income data).
▪ Negative skew → long left tail.
o Kurtosis: Whether data has heavy tails or is flat.
4. Graphical Descriptions
o Histograms: Frequency distribution.
o Boxplots: Median, quartiles, outliers.
o Scatter plots: Relation between two variables.

Why is Statistical Description Important in Data Mining?

• Provides initial insight into the dataset.


• Detects outliers, skewness, missing values.
• Guides algorithm selection (e.g., skewed data may need normalization).
• Helps compare datasets (customer age vs income distribution).

Example:

Consider 5 exam scores: 50, 60, 70, 90, 100

• Mean = 74
• Median = 70
• Mode = None (no repetition)
• Range = 100 - 50 = 50
• Standard Deviation ≈ 18.5 (using the population formula above)

This shows the class has wide performance spread.
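
The same summary can be reproduced with Python's standard statistics module (a verification sketch for the example above):

import statistics

scores = [50, 60, 70, 90, 100]

mean = statistics.mean(scores)             # 74
median = statistics.median(scores)         # 70
value_range = max(scores) - min(scores)    # 50
std_dev = statistics.pstdev(scores)        # population standard deviation ≈ 18.55
print(mean, median, value_range, round(std_dev, 2))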


Conclusion:
Statistical description is a fundamental preprocessing step in data mining. It provides a
compact summary, helps understand patterns, and ensures that later mining results are accurate
and meaningful.

16 b) FP-Growth Algorithm (with Given Dataset)

Introduction:
The FP-Growth algorithm is used for frequent pattern mining without generating candidate
sets like Apriori. It uses an FP-tree (Frequent Pattern Tree) and mines frequent itemsets
directly, making it faster for large datasets.

Dataset Given:
Transaction Items Bought

T100 M, O, N, K, E, Y

T200 D, O, N, K, E, Y

T300 M, A, K, E

T400 M, U, C, K, Y

T500 C, O, O, K, I, E

Minimum Support = 60% = 3 transactions


(3 out of 5 transactions must contain the itemset).

Step 1: Count Item Frequencies

• M→3
• O→3
• N→2
• K→5
• E→4
• Y→3
• D→1
• A→1
• U→1
• C→2
• I→1

Items with support ≥ 3 (Frequent):


{K:5, E:4, M:3, O:3, Y:3}

Discard {N, D, A, U, C, I}.

Step 2: Order Items by Frequency in Each Transaction

Reorder transactions by descending frequency order (K > E > M > O > Y):

• T100: K, E, M, O, Y
• T200: K, E, O, Y
• T300: K, E, M
• T400: K, M, Y
• T500: K, E, O

Step 3: Construct FP-Tree

1. Root (null) → K (all 5 transactions share the prefix K, so K has count 5).
2. K branches (paths with a common prefix are merged and their counts accumulated):
o K→E→M→O→Y
o K→E→O→Y
o K→E→M
o K→M→Y
o K→E→O

Final FP-Tree Paths:

• K-E-M-O-Y
• K-E-O-Y
• K-E-M
• K-M-Y
• K-E-O
Step 4: Extract Frequent Patterns

Start from the lowest-frequency item and mine each item's conditional pattern base, keeping only combinations that still reach the minimum support count of 3:

• From Y (prefix paths K-E-M-O:1, K-E-O:1, K-M:1): only K reaches count 3 → {K,Y}:3
• From O (prefix paths K-E-M:1, K-E:2): both K and E reach count 3 → {K,O}:3, {E,O}:3, {K,E,O}:3
• From M (prefix paths K-E:2, K:1): only K reaches count 3 → {K,M}:3
• From E (prefix path K:4): {K,E}:4
• From K: {K}:5

Final Frequent Itemsets (Support ≥ 3):

• {K} → 5
• {E} → 4
• {M} → 3
• {O} → 3
• {Y} → 3
• {K,E} → 4
• {K,M} → 3
• {K,O} → 3
• {K,Y} → 3
• {E,O} → 3
• {K,E,O} → 3
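
These itemsets can be verified with a brute-force Python check (this is only a verification sketch, not the FP-Growth algorithm itself):

from itertools import combinations

transactions = [
    {"M", "O", "N", "K", "E", "Y"},
    {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"},
    {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},
]
min_count = 3
items = sorted(set().union(*transactions))

for size in range(1, 4):
    for combo in combinations(items, size):
        count = sum(set(combo) <= t for t in transactions)
        if count >= min_count:
            print(set(combo), count)   # prints exactly the itemsets listed above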

Conclusion:
The FP-Growth algorithm efficiently finds frequent patterns without generating candidates.
Frequent itemsets include {K,E}, {K,M}, {K,O}, {K,Y}, {E,O}, {K,E,O}, which can later be used
for association rules like:
“If a customer buys K and E, they are likely to buy O.”
