CS 614

Data warehousing involves collecting and managing large data sets for efficient analysis and decision-making. These notes cover the relationship between data, information, knowledge, and intelligence, the evolution of data processing systems, the types of data warehouses, normalization, denormalization, and the differences between OLTP and OLAP systems.

Lecture No: 1

Data warehousing:
Data Warehousing is the process of collecting, storing, and managing large amounts of data from different
sources in one place. This data is organized in a way that makes it easy to analyse and make better decisions.
Need for data warehousing:
A data warehouse is important because it helps businesses and organizations store and manage large
amounts of data efficiently. Here are some key reasons why we need it:
 Better decision making
 Fast access to data
 Improved data quality
 Historical data analysis
 Combining data from multiple sources
 Competitive Advantage
Relationship between Data, Information, Knowledge and Intelligence:
Data:
Data refers to raw facts and figures that are recorded or stored in a way that computers can process. It often
comes from transactions or operations, such as sales records or customer details, but by itself, data has no
meaning until processed.
Information:
Information is processed data that is structured and meaningful. Information helps answer basic questions
like "who," "what," "where," and "when."
Knowledge:
Knowledge is the application of information to understand patterns, make predictions, or solve problems. It
helps answer "how" questions by providing deeper insights based on experience and learning. For example,
understanding customer buying trends from sales data is knowledge.
Intelligence:
Intelligence is the ability to interpret knowledge and understand the reasons behind patterns and behaviors. It
answers the "why" questions and allows individuals or organizations to make strategic decisions.
Intelligence helps in problem-solving and innovation.
Note:
Data is raw facts, information is meaningful data, knowledge is applying information for insights, and
intelligence is understanding why things happen.
Historical overview:
Early days of data processing (1960s):
Computers used master files stored on magnetic tapes. Data access was slow and unreliable, requiring
sequential scanning. Redundant data became a major issue as master files grew.
Introduction of Direct Access Storage Device (DASD) & Databases (1970s):
DASD (Disk Storage) allowed faster, direct access to data, unlike magnetic tapes. DBMS (Database
Management Systems) were introduced, making data storage and retrieval more efficient. A database
became a "single source of truth", solving redundancy problems.
Rise of PCs & 4th Generation Languages (1980s):
PCs and 4GLs (4th Generation Programming Languages) gave users direct control over data. Management
Information Systems (MIS) were developed to help managers make decisions. This marked the early stages
of Decision Support Systems (DSS).
Data Warehouse:
A complete repository of historical corporate data extracted from transaction systems that is available for ad-hoc access by knowledge workers.
Types of data warehouse:
Following are some industry-specific data warehouses:
 Financial - A financial data warehouse stores and organizes all the financial data of a business —
like income, expenses, budgets, and profits.
 Telecommunication - A telecom data warehouse stores call records, customer usage data, billing
info, and network performance details.
 Insurance - An insurance data warehouse keeps track of policyholder details, claim history,
premium payments, and risk analysis.
 Human resources - An HR data warehouse stores employee-related data like salaries, attendance,
performance, and hiring trends.
Ad-hoc:
An ad-hoc query has no fixed, repeatable pattern and is not known in advance.
Ad-Hoc Query Example:
Task:
Your manager wants to quickly know:
“What were the total sales on April 1, 2025, broken down by store?”
This is a one-time, unplanned request — no scheduled report exists for it.
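A minimal SQL sketch of this request, assuming a hypothetical sales table with sale_date, store_id, and amount columns:
SELECT store_id, SUM(amount) AS total_sales
FROM sales
WHERE sale_date = '2025-04-01'
GROUP BY store_id;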
OLTP:
OLTP stands for online transaction processing, which refers to a type of database system designed to handle a large volume of short transactions in real time. These transactions typically involve inserting, updating, and deleting records.
DSS:
A Decision Support System (DSS) is a computer-based tool or system used to help business users make
informed decisions by analysing large volumes of data — often pulled from a data warehouse.

System | Description | Data
OLTP (online transaction processing) | A system used to handle day-to-day operations like sales, bookings, payments | Captures and processes real-time data.
DSS (decision support system) | A system used to help users analyse data and make decisions | Uses historical data (often from a data warehouse) for analysis & reporting.
Data warehouse | A centralized repository of historical, cleaned, and structured data | Supports reporting, analysis, DSS, and BI tools.
Typical Queries:
OLTP = SELECT tx_data, balance FROM tx_table WHERE account_id = 23785;
DSS = SELECT balance, age, sal, gender FROM customer_table, tx_table WHERE age BETWEEN 30 AND 40 AND education = 'graduate' AND customer_table.CusID = tx_table.Customer_ID;
Comparison of OLTP and DSS:
OLTP | DSS
Primary key used | Primary key not used
No concept of primary index | Primary index used
May use a single table | Uses multiple tables
Few rows returned | Many rows returned
High selectivity of query | Low selectivity of query
Indexing on primary key (unique) | Indexing on primary index (non-unique)

Primary indexing:
Primary indexing is a way to quickly find data in a database table using the primary key.
Example:
A table of contents in a book, based on page numbers.
Or
An Index is like a starting point for reading data in a database.
Instead of the database scanning every row one by one (which is slow), the index tells it exactly where to
start — just like a bookmark or table of contents in a book.
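A minimal sketch of the idea, using hypothetical table and column names; the lookup on the primary key can use the index instead of scanning every row:
CREATE TABLE customer_table (
    cust_id INT PRIMARY KEY,   -- the index is built on this key
    name    VARCHAR(50),
    balance DECIMAL(10,2)
);
SELECT name, balance FROM customer_table WHERE cust_id = 23785;   -- fast point lookup via the index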
OLAP:
OLAP (online analytical processing) is a tool that helps you analyse large amounts of data quickly so you can make smart business decisions. In OLAP, we analyse data step by step.
Example:
You check total sales → then zoom into sales by region → then by product → then by month.
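A hedged SQL sketch of this drill-down, assuming a hypothetical sales table with region, product, sale_month, and amount columns:
SELECT SUM(amount) AS total_sales FROM sales;                                      -- total sales
SELECT region, SUM(amount) FROM sales GROUP BY region;                             -- by region
SELECT region, product, SUM(amount) FROM sales GROUP BY region, product;           -- by region and product
SELECT region, product, sale_month, SUM(amount) FROM sales GROUP BY region, product, sale_month;   -- by month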
Anomalies:
Anomalies are problems or errors that happen when a database is not properly organized (not normalized).
Types of anomalies:
 Insert anomalies – you can’t add new data because other data is missing.
 Update anomalies – you have to update the same data in many places.
 Deletion anomalies – deleting one piece of data removes other important data.
Solution:
These anomalies can be avoided using normalization.
Normalization:
Normalization is the process of efficiently organizing the data in a database by decomposing (splitting) a relational table into smaller tables by projection.
Goals of normalization:
There are basically two goals of normalization:
 Eliminate redundant data (for example, storing the same data in more than one table).
 Ensure that data dependencies make sense (only storing related data in a table).
Need for Normalization:
 To remove duplicate data
 To save storage space
 To make data easier to update and manage
Example:
Imagine a student table like this:
StudentID Name Course1 Course2
1 Ali Math Science
2 JK Math English

Instead of storing courses like this, we can normalize into two separate tables:
Student Table:
StudentID Name
1 Ali
2 JK

Course Table:
StudentID Course
1 Math
1 Science
2 Math
2 English
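A minimal SQL sketch of the two normalized tables; the column types and snake_case names are illustrative assumptions:
CREATE TABLE student (
    student_id INT PRIMARY KEY,
    name       VARCHAR(50)
);
CREATE TABLE student_course (
    student_id INT REFERENCES student(student_id),
    course     VARCHAR(50),
    PRIMARY KEY (student_id, course)   -- each student/course pair is stored once
);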

Types of Normalization:
There are six commonly listed normal forms:
 1NF (First Normal Form) – Removes repeating groups or multiple values in a single column
 2NF (Second Normal Form) – Removes partial dependencies (a non-key column depends on only part of the key)
 3NF (Third Normal Form) – Removes transitive dependencies (a non-key column depends on another non-key column)
 BCNF (Boyce-Codd Normal Form) – A stronger version of 3NF that handles some special cases
 4NF (Fourth Normal Form) – Removes multi-valued dependencies (independent multi-valued facts are stored separately)
 5NF (Fifth Normal Form) – Breaks down complex relationships into simpler ones
1NF (First Normal form):
A table is in 1NF if each column has only one value (no lists or multiple items in a cell).
Example (Before 1NF):
Student Courses
Raj Math, Science
Priya Math, English
After 1NF:
Student Courses
Raj Math
Raj Science
Priya Math
Priya English

2NF (Second Normal form):


A table is in 2NF if it's already in 1NF and all non-key columns depend fully on the primary key (no partial
dependency).
Example:
Before 2NF:
StudentID Courses StudentName
1 Math Raj
1 Science Raj
2 Math Priya
2 English Priya

Here, StudentName depends only on StudentID, not on both StudentID + Course.


After 2NF:
Student Table:
StudentID StudentName
1 Raj
2 Priya

Course Table:
StudentID Courses
1 Math
1 Science
2 Math
2 English

3NF (Third Normal Form):


A table is in 3NF if it's in 2NF and no column depends on another non-key column (no transitive
dependency).
Example:
Before 3NF
StudentID StudentName Department DeptLocation
1 Raj CS Block A
2 Priya IT Block B

DeptLocation depends on Department, not directly on StudentID.


After 3NF:
Student Table:
StudentID StudentName Department
1 Raj CS
2 Priya IT

Department Table:
Department DeptLocation
CS Block A
IT Block B

Denormalization:
Denormalization means combining tables back together to make data easier and faster to read, even if it
causes some repetition (redundancy). In denormalization, we join tables again to make data access faster.
Denormalization in DSS:
In DSS, users need fast reports and quick answers. If the data is too split (too many small tables), it will take
more time to join them when running a report. Denormalized tables allow users to get results faster because:
 Less joining of tables
 Faster query performance
 Easier to run big analytical queries
Denormalization technique:
 Collapsing
 Pre-joining tables
 Splitting tables
 Adding redundant columns (Reference data)
 Derived attributes
Collapsing:
Collapsing Tables is the process of merging two or more related tables into a single table. This is done when
there is a clear relationship between the tables, and it makes sense to combine them, typically for
performance optimization purposes, especially when those tables are frequently queried together.
Types of collapsing:
 Two entities with a one-to-one relationship
 Two entities with many-to-many relationship
Two entities with a one-to-one relationship:
In this type of relationship, each row in Table A is linked to exactly one row in Table B. When the
relationship is one-to-one, collapsing the tables simplifies access to data since there is no need to join them
repeatedly.
Example:
Customer table:
CustomerID Name Email
C001 Ali ali@gmail.com
C002 Sara sara@gmail.com

Customer Address table:


CustomerID Address
C001 123 main St
C002 456 main St

After collapsing (Combining the tables):


CustomerID Name Email Address
C001 Ali ali@gmail.com 123 main St
C002 Sara sara@gmail.com 456 main St
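A hedged sketch of collapsing the two tables, assuming hypothetical table names customer and customer_address and a database that supports CREATE TABLE ... AS SELECT:
CREATE TABLE customer_collapsed AS
SELECT c.CustomerID, c.Name, c.Email, a.Address
FROM customer c
JOIN customer_address a ON a.CustomerID = c.CustomerID;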

Two entities with many-to-many relationship


In a many-to-many relationship, a row in Table A can be related to many rows in Table B and vice versa.
Typically, in a relational database, a junction table is used to manage the many-to-many relationship.
However, if the data is queried frequently, it may be beneficial to pre-join and collapse the data into a single
table to improve query performance.
Example:
Student table:
StudentID Name
S001 Ali
S002 Sara

Course table:
CourseID CourseName
C001 Math
C002 History

Student course table:


StudentID CourseID
S001 C001
S001 C002
S002 C001

After collapsing (Combining the Tables into One Table)

StudentID Name CourseID CourseName

S001 Ali C001 Math

S001 Ali C002 History

S002 Sara C001 Math
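Similarly, a sketch of collapsing the many-to-many tables into one pre-joined table (same assumptions as above, with hypothetical table names student, course, and student_course):
CREATE TABLE student_course_collapsed AS
SELECT s.StudentID, s.Name, c.CourseID, c.CourseName
FROM student s
JOIN student_course sc ON sc.StudentID = s.StudentID
JOIN course c ON c.CourseID = sc.CourseID;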


Pre-joining tables:
Joining tables in advance (before queries) and saving the joined result as a new table.
Example:
Customer Table:
CustomerID Name
C001 Ali

Order Table:
OrderID CustomerID ProductName
O001 C001 Laptop

After pre-joining:

OrderID CustomerID Name ProductName


O001 C001 Ali Laptop
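A hedged sketch of building the pre-joined table, assuming hypothetical table names orders and customer and CREATE TABLE ... AS SELECT support:
CREATE TABLE order_prejoined AS
SELECT o.OrderID, o.CustomerID, c.Name, o.ProductName
FROM orders o
JOIN customer c ON c.CustomerID = o.CustomerID;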

Splitting Tables:
Splitting tables means breaking a large table into smaller tables to make data easier to manage, faster to
search, or more secure.
Types:
There are two main types:
 Horizontal Splitting
 Vertical Splitting
Horizontal Splitting:
We split a table into two or more tables where each new table has the same columns but contains different
rows based on some condition (like date, region, etc.).
Example:

OrderID CustomerName OrderDate Amount

1 Ali 01-01-2025 100

2 Sara 05-01-2025 200

3 Ahmed 10-02-2025 150

4 Hina 20-02-2025 250

After horizontal splitting:


Table 1: Order January
OrderID CustomerName OrderDate Amount

1 Ali 01-01-2025 100

2 Sara 05-01-2025 200

Table 2: Order _February

OrderID CustomerName OrderDate Amount

3 Ahmed 10-02-2025 150

4 Hina 20-02-2025 250
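A hedged sketch of the horizontal split by month, assuming a hypothetical orders table and ISO-format (YYYY-MM-DD) date literals:
CREATE TABLE orders_january AS
SELECT * FROM orders WHERE OrderDate BETWEEN '2025-01-01' AND '2025-01-31';
CREATE TABLE orders_february AS
SELECT * FROM orders WHERE OrderDate BETWEEN '2025-02-01' AND '2025-02-28';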

Vertical Splitting:
We split a table into two or more tables where each new table has fewer columns. Rows are connected by a
common key.
Example:

CustomerID Name Email CardNumber BillingAddress

C001 Ali ali@email.com 1234-5678 123 Main St

C002 Sara sara@email.com 9876-5432 456 Elm St

After vertical split:


Table 1: Customer Personal

CustomerID Name Email

C001 Ali ali@email.com

C002 Sara sara@email.com

Table 2: Customer Payment

CustomerID CardNumber BillingAddress

C001 1234-5678 123 Main St

C002 9876-5432 456 Elm St
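A hedged sketch of the vertical split, assuming a hypothetical customer table and CREATE TABLE ... AS SELECT support:
CREATE TABLE customer_personal AS
SELECT CustomerID, Name, Email FROM customer;
CREATE TABLE customer_payment AS
SELECT CustomerID, CardNumber, BillingAddress FROM customer;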

Adding redundant columns (Reference data):


We add extra columns to a table that repeat important information, instead of joining with another table every time. This makes the system faster because it avoids extra joins when running queries. In other words, rather than moving all the columns from the master table to the detail table, a subset of columns from the master table is kept (duplicated) in the detail table.
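A hedged sketch, assuming hypothetical product (master) and sales (detail) tables and standard ALTER TABLE ... ADD COLUMN syntax:
ALTER TABLE sales ADD COLUMN product_name VARCHAR(50);   -- redundant copy of the master column
UPDATE sales s
SET product_name = (SELECT p.product_name FROM product p WHERE p.product_id = s.product_id);
-- Reports can now read product_name straight from the sales table without a join.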
OLAP:
OLAP is used for analysing large amounts of historical data to support decision-making. It helps businesses
answer complex questions like: What were the sales trends last year? or Which product performed best in
the last quarter? OLAP systems are designed for reading data, not writing or updating it frequently. They
support features like multi-dimensional analysis, charts, reports, and dashboards. These systems are slower
than OLTP for day-to-day operations but provide deeper insights.
Example: A company’s sales manager uses an OLAP system to compare sales across different regions and
time periods.
OLTP:
OLTP is used for managing daily transactions like order placement, inventory updates, and customer data
management. These systems are designed to handle many short, fast transactions such as insert, update, and
delete operations. OLTP systems must be fast, reliable, and always available since they support everyday
business functions.
Example: When a customer buys something online and the system processes their payment and updates the
inventory, that’s OLTP in action.
OLAP vs OLTP:

Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing)
Purpose | Handles day-to-day transactions | Performs analysis and reporting
Data Type | Current, real-time transactional data | Historical, aggregated data
Operations | Insert, update, delete (write-heavy) | Read, analyse, drill down (read-heavy)
Speed | Very fast for small transactions | Slower, optimized for complex queries
Users | Front-line staff, clerks, customers | Managers, analysts, decision-makers
Data Size | Smaller data per transaction | Large volumes of data for analysis
Examples | Online banking, order processing | Sales trends, market analysis
System Design | Highly normalized database | Denormalized or star/snowflake schema

Relationship between OLAP and DWH:


A Data Warehouse (DWH) is a central storage system that holds large volumes of historical data from
different sources (like sales, customer data, etc.).
OLAP (Online Analytical Processing) is a tool or technique used on top of a data warehouse to analyse this
stored data.
OLAP implementation:
MOLAP: OLAP implementation with a multidimensional data structure.
ROLAP: OLAP implementation with a relational database.
HOLAP: OLAP implementation as a hybrid of MOLAP and ROLAP.
DOLAP: OLAP implementation for a desktop decision support environment.
MOLAP stands for Multidimensional Online Analytical Processing. It is a type of OLAP (Online Analytical
Processing) that stores data in a multidimensional cube format, rather than in relational databases like in
ROLAP.
