DWDM
2 Marks Answers:
Database vs Data Warehouse:
• Database: Stores current transactional data, optimized for daily operations (OLTP).
• Data Warehouse: Stores historical, integrated data, optimized for analysis (OLAP).
Steps of the KDD Process:
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Presentation
Cosine Similarity between two vectors:
X = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
Y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
Formula: cos(X, Y) = (X · Y) / (||X|| × ||Y||)
Dot Product:
X · Y = (5 × 3) + (3 × 2) + (2 × 1) + (2 × 1) = 15 + 6 + 2 + 2 = 25
Norms:
||X|| = √(5² + 3² + 2² + 2²) = √42 ≈ 6.48
||Y|| = √(3² + 2² + 1² + 1² + 1² + 1²) = √17 ≈ 4.12
Result:
cos(X, Y) = 25 / (6.48 × 4.12) ≈ 0.94, so X and Y are highly similar.
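The same calculation can be checked in a few lines of Python:

```python
import math

# Term-frequency vectors from the example above.
x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(x, y))      # 25
norm_x = math.sqrt(sum(a * a for a in x))   # √42 ≈ 6.48
norm_y = math.sqrt(sum(b * b for b in y))   # √17 ≈ 4.12

print(round(dot / (norm_x * norm_y), 2))    # 0.94
```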
13 Marks Answers:
Introduction:
A data warehouse is not just a single database, but a complete system for storing historical data
and analyzing it for decision making. To make it efficient, it is built in multiple layers, each
performing a clear function. The standard model is the three-tier architecture.
Layers:
1. Bottom Tier: The warehouse database server. Data from operational sources is extracted, cleaned, transformed, and loaded (ETL) into it.
2. Middle Tier: An OLAP server (ROLAP or MOLAP) that maps multidimensional operations onto the stored data.
3. Top Tier: Front-end query, reporting, and data mining tools used by analysts and managers.
Diagram (exam-ready):
Data Sources (ERP, CRM, Files) → ETL → Data Warehouse (schemas, OLAP) → Users (Reports, BI Tools)
Conclusion:
This architecture separates storage, processing, and presentation, making data warehouses
scalable, flexible, and reliable for business intelligence.
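The flow above can be made concrete with a toy Python sketch of the ETL stage feeding the warehouse tier; the records and field names here are invented for illustration, not part of any standard.

```python
# Toy ETL sketch: operational source rows -> transformed rows -> warehouse store.
# All record and field names are invented for illustration.
source_rows = [
    {"order_id": 1, "amount": "120.50", "region": "south"},
    {"order_id": 2, "amount": "80.00",  "region": "NORTH"},
]

def transform(row):
    # The "T" in ETL: standardize types and codes before loading.
    return {"order_id": row["order_id"],
            "amount": float(row["amount"]),
            "region": row["region"].upper()}

warehouse = [transform(r) for r in source_rows]  # load into the warehouse tier
print(warehouse)
```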
Introduction:
A schema is the logical design of how data is stored. In data warehousing, the schema determines how efficiently decision-support queries run. The main schemas are Star, Snowflake, and Galaxy (Fact Constellation).
Types:
1. Star Schema:
o Has one fact table (numeric measures) at the center and several dimension tables around it.
o Example: Sales Fact Table → Time, Product, Customer, Region.
o Advantage: Simple, fast query performance, widely supported.
o Disadvantage: Data redundancy since dimensions are denormalized.
2. Snowflake Schema:
o A normalized version of the star schema where dimensions are broken into sub-dimensions.
o Example: Product → Category → Brand.
o Advantage: Saves storage, avoids redundancy.
o Disadvantage: Requires more joins, slower queries.
3. Galaxy / Fact Constellation Schema:
o Contains multiple fact tables sharing common dimensions.
o Example: Sales Fact + Shipping Fact sharing Time & Location dimensions.
o Advantage: Handles complex business models, multiple processes.
o Disadvantage: Complex design, hard to maintain.
Diagram:
Draw the fact table at the center with dimension tables around it (star); split one dimension into sub-dimensions (snowflake); show two fact tables sharing dimensions (galaxy).
Conclusion:
Star schema gives the simplest design and fastest queries, snowflake saves storage, and galaxy models multiple business processes; the choice depends on query patterns and data complexity.
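To make the star schema concrete, here is a minimal pandas sketch of a typical decision-support query: the fact table is joined to one dimension table and aggregated. Table and column names are invented for illustration.

```python
import pandas as pd

# Invented star-schema tables: one fact table, one dimension table.
sales_fact = pd.DataFrame({
    "product_id": [1, 1, 2],
    "time_id":    [10, 11, 10],
    "units_sold": [5, 3, 7],
})
product_dim = pd.DataFrame({
    "product_id": [1, 2],
    "product":    ["pen", "book"],
})

# Typical decision-support query: total units sold per product.
report = (sales_fact.merge(product_dim, on="product_id")
                    .groupby("product")["units_sold"].sum())
print(report)
```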
12 a) OLAP Operations
Introduction:
OLAP (Online Analytical Processing) supports fast analysis of multidimensional data in a data
warehouse. It allows users to view data from different angles.
Main Operations:
1. Roll-up: Aggregates data by climbing a concept hierarchy (e.g., city → country).
2. Drill-down: Moves to more detailed data (e.g., year → month).
3. Slice: Fixes one dimension value to get a sub-cube (e.g., sales for 2024 only).
4. Dice: Selects values on two or more dimensions.
5. Pivot (Rotate): Reorients the cube axes for a different view of the data.
Diagram:
Draw a cube: show slicing one layer, drilling down into detail.
Conclusion:
OLAP operations provide flexibility and allow managers to explore data interactively,
improving decisions.
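Roll-up and slice can be imitated on a flat table with pandas; this is only an illustrative sketch with invented data, not a real OLAP engine.

```python
import pandas as pd

# Invented sales cube: dimensions (year, region, product), measure (sales).
cube = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["East", "West", "East", "West"],
    "product": ["pen", "pen", "book", "book"],
    "sales":   [100, 150, 200, 120],
})

# Roll-up: aggregate away the product dimension.
print(cube.groupby(["year", "region"])["sales"].sum())

# Slice: fix one dimension value (year = 2024) to get a sub-cube.
print(cube[cube["year"] == 2024])
```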
Types of OLAP Servers:
1. MOLAP (Multidimensional OLAP): Uses pre-built multidimensional cubes. Fast queries, but high storage cost.
2. ROLAP (Relational OLAP): Uses a relational database with star/snowflake schemas. Scales to large data, but queries are slower.
3. HOLAP (Hybrid OLAP): Combines both: summaries are stored in MOLAP, detail data in ROLAP.
OLAP vs OLTP:
• Purpose: OLAP supports analysis and decision making; OLTP supports routine transactions.
• Data: OLAP holds historical, summarized data; OLTP holds current, detailed data.
• Workload: OLAP runs complex, read-heavy queries; OLTP runs many short reads and writes.
• Schema: OLAP uses denormalized star/snowflake schemas; OLTP uses normalized tables.
Conclusion:
OLAP is for decision support; OLTP is for day-to-day operations. Both complement each
other.
Architecture Components:
• Database, data warehouse, or other information repository: the sources of data to be mined.
• Database/warehouse server: fetches the relevant data on the user's request.
• Knowledge base: domain knowledge used to guide the search and judge interestingness.
• Data mining engine: modules for characterization, association, classification, clustering, etc.
• Pattern evaluation module: applies interestingness measures to filter discovered patterns.
• User interface: lets the user define tasks, provide parameters, and browse results.
Conclusion:
The KDD process ensures only useful and valid knowledge is discovered.
Data Cleaning:
• Handles missing values (ignore the tuple, fill manually, or use the mean/most probable value).
• Smooths noisy data by binning, regression, or outlier analysis.
• Resolves inconsistencies such as conflicting codes and formats.
Data Integration:
• Combines data from multiple sources (databases, files, cubes) into one coherent store.
• Handles schema integration and entity identification (e.g., cust_id vs. customer_no).
• Detects and removes redundancy, for example by correlation analysis.
Conclusion:
Both cleaning and integration are crucial preprocessing steps, improving quality before mining.
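A minimal pandas sketch of both steps on invented records: cleaning fills a missing value and removes a duplicate, then integration merges a second source on a shared key.

```python
import pandas as pd

# Invented data with typical problems: missing ages and a duplicate row.
df = pd.DataFrame({
    "customer": ["Ann", "Bob", "Bob", "Eve"],
    "age":      [34, None, None, 29],
})

# Cleaning: fill missing values with the mean, then drop duplicate rows.
df["age"] = df["age"].fillna(df["age"].mean())
df = df.drop_duplicates()

# Integration: merge with a second (invented) source on the shared key.
cities = pd.DataFrame({"customer": ["Ann", "Eve"], "city": ["Pune", "Delhi"]})
print(df.merge(cities, on="customer", how="left"))
```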
Methods:
• Smoothing: removes noise using binning or regression.
• Aggregation: summarizes data (e.g., daily sales → monthly sales).
• Generalization: replaces low-level values with higher-level concepts (e.g., age → youth/senior).
• Normalization: scales values into a small range, e.g., min-max or z-score.
• Attribute construction: derives new attributes from existing ones.
• Discretization: converts continuous values into intervals.
Conclusion:
Transformation ensures data is comparable, standardized, and ready for analysis.
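As a sketch of the normalization methods above, this snippet applies min-max and z-score scaling to a small invented list of values.

```python
# Min-max and z-score normalization on invented values.
values = [50, 60, 70, 90, 100]

lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]   # scaled into [0, 1]

mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
z_scores = [(v - mean) / std for v in values]      # mean 0, std 1

print(min_max)
print(z_scores)
```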
14 b) Attribute Types & Choice of Mining Algorithms
• Nominal (Categories):
o Examples: eye color, gender.
o Algorithms: decision trees, clustering (k-modes).
• Ordinal (Ranked):
o Examples: grades (A, B, C), height {short, medium, tall}.
o Algorithms: regression trees, ranking models.
• Interval (Equal spacing, no true zero):
o Examples: temperature in °C, calendar dates.
o Algorithms: linear regression, correlation analysis.
• Ratio (True zero):
o Examples: weight, time, counts.
o Algorithms: statistical analysis, clustering, classification.
Conclusion:
Choosing the correct algorithm for the attribute type ensures valid and accurate results.
Improving Apriori with the Vertical Data Format:
• Instead of scanning the full database on every pass, store each item with the list of transaction IDs (TID-list) in which it appears.
• Example:
o A: {T1, T2, T5}
o B: {T2, T4}
o The intersection {T2} gives the support of {A, B}, i.e., support = 1.
Conclusion:
Using the vertical data format speeds up Apriori by replacing repeated database scans with fast TID-list intersections.
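The TID-list idea is easy to express in code; in this minimal Python sketch of the example above, a set intersection replaces a database scan.

```python
# Vertical data format: each item maps to the set of transactions containing it.
tid_lists = {
    "A": {"T1", "T2", "T5"},
    "B": {"T2", "T4"},
}

# Support of {A, B} is the size of the TID-list intersection -- no rescan needed.
support_ab = len(tid_lists["A"] & tid_lists["B"])
print(support_ab)  # 1, since only T2 contains both A and B
```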
Interesting Patterns:
• Not all patterns are useful.
• A pattern is interesting if:
o Valid on test data.
o Useful and actionable.
o Novel or unexpected.
o Understandable to users.
Measures of Interestingness:
• Support: frequency.
• Confidence: reliability.
• Lift/Correlation: measures dependency.
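All three measures follow directly from transaction counts; a small Python sketch with invented counts for a rule A → B:

```python
# Invented counts for a rule A -> B over n transactions.
n = 100         # total transactions
count_a = 40    # transactions containing A
count_b = 30    # transactions containing B
count_ab = 20   # transactions containing both A and B

support = count_ab / n               # frequency of {A, B}: 0.20
confidence = count_ab / count_a      # reliability of A -> B: 0.50
lift = confidence / (count_b / n)    # ≈ 1.67 (> 1: positive correlation)

print(support, confidence, lift)
```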
Pattern Evaluation:
Candidate patterns are ranked with objective measures (support, confidence, lift) and subjective measures (novelty, actionability), and patterns below user-specified thresholds are discarded.
Conclusion:
Pattern evaluation ensures results are relevant and valuable for decision making.
15 Marks Answers:
Introduction:
In data mining, statistical description is the process of summarizing and presenting the main
features of a dataset using numerical and graphical measures. It helps to understand the
distribution, spread, and central tendency of data before applying mining techniques.
• Mean: Arithmetic average, Mean = ΣX / N.
• Median: Middle value when data is sorted. Robust to outliers.
• Mode: Most frequent value.
• Range: Maximum - Minimum; a simple measure of spread.
• Standard Deviation: Square root of the average squared deviation from the mean.
Example: Data = {50, 60, 70, 90, 100}
• Mean = 370 / 5 = 74
• Median = 70
• Mode = None (no repetition)
• Range = 100 - 50 = 50
• Standard Deviation ≈ 20
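These values can be checked with Python's standard statistics module (using the dataset assumed in the example above):

```python
import statistics

data = [50, 60, 70, 90, 100]  # dataset assumed in the example above

print(statistics.mean(data))    # 74
print(statistics.median(data))  # 70
print(max(data) - min(data))    # range = 50
print(statistics.stdev(data))   # sample standard deviation ≈ 20.7
```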
Introduction:
The FP-Growth algorithm is used for frequent pattern mining without generating candidate
sets like Apriori. It uses an FP-tree (Frequent Pattern Tree) and mines frequent itemsets
directly, making it faster for large datasets.
Dataset Given:
Transaction   Items Bought
T100          M, O, N, K, E, Y
T200          D, O, N, K, E, Y
T300          M, A, K, E
T400          M, U, C, K, Y
T500          C, O, O, K, I, E
Step 1: Scan the database and count each item's support (minimum support count = 3):
• M→3
• O→3
• N→2
• K→5
• E→4
• Y→3
• D→1
• A→1
• U→1
• C→2
• I→1
Step 2: Remove infrequent items (support < 3) and reorder each transaction in descending support order (K > E > M > O > Y):
• T100: K, E, M, O, Y
• T200: K, E, O, Y
• T300: K, E, M
• T400: K, M, Y
• T500: K, E, O
Step 3: Build the FP-tree by inserting the reordered transactions:
1. Root → K
2. K branches:
o K→E→M→O→Y
o K→E→O→Y
o K→E→M
o K→M→Y
o K→E→O
Step 4: Extract Frequent Patterns
• {K} → 5
• {E} → 4
• {M} → 3
• {O} → 3
• {Y} → 3
• {K,E} → 4
• {K,M} → 3
• {K,O} → 3
• {K,Y} → 3
• {E,O} → 3
• {K,E,O} → 3
Conclusion:
The FP-Growth algorithm efficiently finds frequent patterns without generating candidates.
Frequent itemsets include {K,E}, {K,M}, {K,O}, {K,Y}, {K,E,O}, which can later be used
for association rules like:
“If a customer buys K and E, they are likely to buy O.”
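As a cross-check, a brute-force Python sketch (not FP-Growth itself) can recount supports directly from the five transactions; it reproduces the frequent itemsets listed above.

```python
from itertools import combinations

# The five transactions; sets ignore the duplicate O in T500, since support
# counts transactions, not occurrences.
transactions = [set("MONKEY"), set("DONKEY"), set("MAKE"),
                set("MUCKY"), set("COOKIE")]
min_support = 3

items = set().union(*transactions)
for size in (1, 2, 3):
    for itemset in combinations(sorted(items), size):
        count = sum(set(itemset) <= t for t in transactions)
        if count >= min_support:
            print(set(itemset), count)
```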