Lecture No:1
Data warehousing:
Data Warehousing is the process of collecting, storing, and managing large amounts of data from different
sources in one place. This data is organized in a way that makes it easy to analyse and make better decisions.
Need for data warehousing:
A data warehouse is important because it helps businesses and organizations store and manage large
amounts of data efficiently. Here are some key reasons why we need it:
       Better decision making
       Fast access to data
       Improved data quality
       Historical data analysis
       Combing data from multiple resources
       Competitive Advantage
Relationship between Data, Information, Knowledge and Intelligence:
Data:
Data refers to raw facts and figures that are recorded or stored in a way that computers can process. It often
comes from transactions or operations, such as sales records or customer details, but by itself, data has no
meaning until processed.
Information:
Information is processed data that is structured and meaningful. Information helps answer basic questions
like "who," "what," "where," and "when."
Knowledge:
Knowledge is the application of information to understand patterns, make predictions, or solve problems. It
helps answer "how" questions by providing deeper insights based on experience and learning. For example,
understanding customer buying trends from sales data is knowledge.
Intelligence:
Intelligence is the ability to interpret knowledge and understand the reasons behind patterns and behaviors. It
answers the "why" questions and allows individuals or organizations to make strategic decisions.
Intelligence helps in problem-solving and innovation.
Note:
Data is raw facts, information is meaningful data, knowledge is applying information for insights, and
intelligence is understanding why things happen.
Historical overview:
Early days of data processing (1960s):
Computers used master files stored on magnetic tapes. Data access was slow and unreliable, requiring
sequential scanning. Redundant data became a major issue as master files grew.
Introduction of Direct Access Storage Device (DASD) & Databases (1970s):
DASD (Disk Storage) allowed faster, direct access to data, unlike magnetic tapes. DBMS (Database
Management Systems) were introduced, making data storage and retrieval more efficient. A database
became a "single source of truth", solving redundancy problems.
Rise of PCs & 4th Generation Languages (1980s):
PCs and 4GLs (4th Generation Programming Languages) gave users direct control over data. Management
Information Systems (MIS) were developed to help managers make decisions. This marked the early stages
of Decision Support Systems (DSS).
Data Ware House:
A complete repository of historical corporation data extracted from transaction systems that is available for
ad-hoc access by knowledge workers.
Types of data ware house:
Following are some industry-specific data warehouses:
       Financial - A financial data warehouse stores and organizes all the financial data of a business —
        like income, expenses, budgets, and profits.
       Telecommunication - A telecom data warehouse stores call records, customer usage data, billing
        info, and network performance details.
       Insurance - An insurance data warehouse keeps track of policyholder details, claim history,
        premium payments, and risk analysis.
       Human resources - An HR data warehouse stores employee-related data like salaries, attendance,
        performance, and hiring trends.
Ad-hoc:
It doesn’t have a certain repeatable pattern and it’s not in advance.
Ad-Hoc Query Example:
Task:
Your manager wants to quickly know:
“What were the total sales on April 1, 2025, broken down by store?”
This is a one-time, unplanned request — no scheduled report exists for it.
OLTP:
OLTP stands for online transection processing, which refers to the type of database system designed to
handle a large volume of short interaction in real time. These transactions often involve inserting, deleting,
updating etc.
DSS:
A Decision Support System (DSS) is a computer-based tool or system used to help business users make
informed decisions by analysing large volumes of data — often pulled from a data warehouse.
    OLTP (online transaction            A system used to handle day-to-      Captures and processes real-time
          processing)                      day operations like sales,                     data.
                                              bookings, payments
  DSS (decision support system)        A system used to help users           Uses historical data (often from a
                                       analyse data and make decisions       data warehouse) for analysis &
                                                                             reporting.
          Data warehouse               A centralized repository of           Supports reporting, analysis,
                                       historical, cleaned, and structured   DSS, and BI tools.
                                       data
Typical Queries:
OLTP = Select tx_data, balance from tx_table where account_id=23785;
DSS =select balance, age, sal, gender from customer_table and tx_table where age between (30 and 40) and
eduction = ‘graduate ‘and CusID.customer_table and Customer_ID.tx_table;
Comparison of OLTP and DSS:
                         OLTP                                                   DSS
 Primary key need                                         Primary key not used
 No concept of primary key index                          Primary index used
 May use single table                                     Uses multiple tables
 Few rows returned                                        Many rows returned
 High selectivity of query                                Low selectivity of query
 Indexing on primary key (unique)                         Indexing on primary index (non-unique)
Primary indexing:
Primary indexing is a way to quickly find data in a database table using the primary key.
Example:
A table of contents in a book, based on page numbers.
Or
An Index is like a starting point for reading data in a database.
Instead of the database scanning every row one by one (which is slow), the index tells it exactly where to
start — just like a bookmark or table of contents in a book.
OLAP:
OLAP (online analytical processing) is a tool that helps you analyse large amounts of data quickly so you
can make smart business decisions. In OLAP we are analysing data step-by -step.
Example:
You check total sales → then zoom into sales by region → then by product → then by month.
Anomalies:
Anomalies are problems or errors that happen when a database is not properly organized (not normalized).
Types of anomalies:
        Insert anomalies – you can’t add new data because other data is missing.
        Update anomalies – you have to update the same data in many places.
        Deletion anomalies – deleting one piece of data removes other important data.
Solution:
These anomalies can be avoided using normalization.
Normalization:
Normalization is the process of efficiently organizing the data in a database by decomposing (splitting) a
relational table into smaller table by projection.
Goals of normalization:
There are basically two goals for normalization:
      Eliminate redundant data (for example, storing the same data in more than one table).
      Ensuring data dependencies make sense (only storing related data in a table).
Need for Normalization:
      To remove duplicate data
      To save storage space
      To make data easier to update and manage
Example:
Imagine a student table like this:
        StudentID                      Name                       Course1                     Course2
            1                           Ali                        Math                       Science
            2                           JK                         Math                       English
Instead of storing courses like this, we can normalize into two separate tables:
Student Table:
       StudentID                     Name
           1                          Ali
           2                          JK
Course Table:
              StudentID                                 Course
                  1                                      Math
                  1                                     Science
                  2                                      Math
                  2                                     English
Type of Normalization:
Ther are five main types of normalization:
      1NF (First Normal form) – Removes repeating groups or multiple values in a single column
      2NF (Second Normal form) – Removes partial dependencies (depends on part of the key)
      3NF (Third Normal form) – Removes transitive dependencies (depends on non-key column)
      BNNF (Boyce- Codd Normal form) – A stronger version of 3NF — handles some special cases
      4NF (Fourth Normal form) – Removes multi-valued dependencies (no column should repeat rows)
      5NF (Fifth Normal form) – Breaks down complex relationships into simpler ones
1NF (First Normal form):
A table is in 1NF if each column has only one value (no lists or multiple items in a cell).
Example (Before 1NF):
                        Student                                               Courses
                          Raj                                               Math, Science
                         Priya                                              Math , English
After 1NF:
                         Student                                             Courses
                           Raj                                                Math
                           Raj                                               Science
                          Priya                                               Math
                          Priya                                              English
2NF (Second Normal form):
A table is in 2NF if it's already in 1NF and all non-key columns depend fully on the primary key (no partial
dependency).
Example:
Before 2NF:
             StudentID                         Courses                            StudentName
                 1                              Math                                   Raj
                 1                             Science                                 Raj
                 2                              Math                                  Priya
                 2                             English                                Priya
Here, StudentName depends only on StudentID, not on both StudentID + Course.
After 2NF:
Student Table:
             StudentID                       StudentName
                 1                                Raj
                 1                                Raj
                 2                               Priya
                 2                               Priya
Course Table:
             StudentID                         Courses
                 1                              Math
                 1                             Science
                 2                              Math
                 2                             English
3NF (Third Normal Form):
A table is in 3NF if it's in 2NF and no column depends on another non-key column (no transitive
dependency).
Example:
Before 3NF
        StudentID                  StudentName                Department                DeptLocation
            1                           Raj                      CS                       Block A
            2                          Piya                       IT                      Block B
DeptLocation depends on Department, not directly on StudentID.
After 3NF:
Student Table:
        StudentID                  StudentName                  Department
            1                           Raj                        CS
            2                          Piya                         IT
Department Table:
       Department                  DeptLocation
          CS                         Block A
           IT                        Block B
Denormalization:
Denormalization means combining tables back together to make data easier and faster to read, even if it
causes some repetition (redundancy). In denormalization, we join tables again to make data access faster.
Denormalization in DSS:
In DSS, users need fast reports and quick answers. If the data is too split (too many small tables), it will take
more time to join them when running a report. Denormalized tables allow users to get results faster because:
      Less joining of tables
      Faster query performance
      Easier to run big analytical queries
Denormalization technique:
      Collapsing
      Pre-joining tables
      Splitting tables
      Adding redundant columns (Reference data)
      Derived attributes
Collapsing:
Collapsing Tables is the process of merging two or more related tables into a single table. This is done when
there is a clear relationship between the tables, and it makes sense to combine them, typically for
performance optimization purposes, especially when those tables are frequently queried together.
Types of collapsing:
      Two entities with a one-to-one relationship
      Two entities with many-to-many relationship
Two entities with a one-to-one relationship:
In this type of relationship, each row in Table A is linked to exactly one row in Table B. When the
relationship is one-to-one, collapsing the tables simplifies access to data since there is no need to join them
repeatedly.
Example:
Customer table:
             CustomerID                              Name                                 Email
                C001                                  Ali                             ali@gmail.com
                  C002                                Sara                           sara@gmail.com
Customer Address table:
                     CustomerID                                                Address
                        C001                                                  123 main St
                        C002                                                  456 main St
After collapsing (Combining the tables):
       CustomerID                      Name                       Email                      Address
          C001                          Ali                   ali@gmail.com                 123 main St
          C002                          Sara                 sara@gmail.com                 456 main St
Two entities with many-to-many relationship
In a many-to-many relationship, a row in Table A can be related to many rows in Table B and vice versa.
Typically, in a relational database, a junction table is used to manage the many-to-many relationship.
However, if the data is queried frequently, it may be beneficial to pre-join and collapse the data into a single
table to improve query performance.
Example:
Student table:
                          StudentID                                              Name
                           S001                                                   Ali
                           S002                                                   Sara
Course table:
                         CourseID                                            CourseName
                           C001                                                 Math
                           C002                                                History
Student course table:
                         StudentID                                             CourseID
                            S001                                                 C001
                            S001                                                 C002
                            S002                                                 C001
After collapsing (Combining the Tables into One Table)
StudentID Name CourseID CourseName
S001        Ali     C001        Math
S001        Ali     C002        History
S002        Sara C001           Math
Pre-joining tables:
Joining tables in advance (before queries) and saving the joined result as a new table.
Example:
Customer Table:
                      CutomerID                                                 Name
                        C001                                                     Ali
Order Table:
               OrderID                            CustomerID                         ProductName
                O001                                 C001                               Laptop
After pre joining:
          OrderID                    CustomerID                   Name                    ProductName
           O001                         C001                       Ali                       Laptop
Splitting Tables:
Splitting tables means breaking a large table into smaller tables to make data easier to manage, faster to
search, or more secure.
Types:
There are two main types:
        Horizontal Splitting
        Vertical Splitting
Horizontal Splitting:
We split a table into two or more tables where each new table has the same columns but contains different
rows based on some condition (like date, region, etc.).
Example:
OrderID CustomerName OrderDate Amount
1           Ali                 01-01-2025 100
2           Sara                05-01-2025 200
3           Ahmed               10-02-2025 150
4           Hina                20-02-2025 250
After horizontal splitting:
Table 1: Order January
OrderID CustomerName OrderDate Amount
1          Ali               01-01-2025 100
2          Sara              05-01-2025 200
Table 2: Order _February
OrderID CustomerName OrderDate Amount
3          Ahmed             10-02-2025 150
4          Hina              20-02-2025 250
Vertical Splitting:
We split a table into two or more tables where each new table has fewer columns. Rows are connected by a
common key.
Example:
CustomerID Name Email                     CardNumber BillingAddress
C001             Ali    ali@email.com 1234-5678        123 Main St
C002             Sara   sara@email.com 9876-5432       456 Elm St
After vertical split:
Table 1: Customer Personal
CustomerID Name Email
C001             Ali    ali@email.com
C002             Sara   sara@email.com
Table 2: Customer Payment
CustomerID Name BillingAddress
C001             Ali    123 Main St
                           456        m
C002             Sara
                                 St
Adding redundant columns (Reference data):
We add extra columns inside a table that repeat important information instead of joining with another table
every time. This makes the system faster because it avoids extra joins when you run queries. In other words,
moving all the column form the master table to the detail table, a sub set off column form the master table is
kept in the detail table.
OLAP:
OLAP is used for analysing large amounts of historical data to support decision-making. It helps businesses
answer complex questions like: What were the sales trends last year? or Which product performed best in
the last quarter? OLAP systems are designed for reading data, not writing or updating it frequently. They
support features like multi-dimensional analysis, charts, reports, and dashboards. These systems are slower
than OLTP for day-to-day operations but provide deeper insights.
Example: A company’s sales manager uses an OLAP system to compare sales across different regions and
time periods.
OLTP:
OLTP is used for managing daily transactions like order placement, inventory updates, and customer data
management. These systems are designed to handle many short, fast transactions such as insert, update, and
delete operations. OLTP systems must be fast, reliable, and always available since they support everyday
business functions.
Example: When a customer buys something online and the system processes their payment and updates the
inventory, that’s OLTP in action.
OLAP vs OLTP:
Feature         OLTP (Online Transaction Processing) OLAP (Online Analytical Processing)
Purpose         Handles day-to-day transactions         Performs analysis and reporting
Data Type       Current, real-time transactional data   Historical, aggregated data
Operations      Insert, Update, delete (write-heavy)    Read, Analyse, Drill-down (read-heavy)
Speed           Very fast for small transactions        Slower, optimized for complex queries
Users           Front-line staff, clerks, customers     Managers, analysts, decision-makers
Data Size       Smaller data per transaction            Large volumes of data for analysis
Examples        Online banking, order processing        Sales trends, market analysis
System Design Highly normalized database                Denormalized or star/snowflake schema
Relationship between OLAP and DWH:
A Data Warehouse (DWH) is a central storage system that holds large volumes of historical data from
different sources (like sales, customer data, etc.).
OLAP (Online Analytical Processing) is a tool or technique used on top of a data warehouse to analyse this
stored data.
OLAP implementation:
MOLAP: OLAP implementation with a multi dimensional data structure.
ROLAP: OLAP implementation with a relation database.
HOLAP: OLAP implementation as a hybrid of MOLAP and ROLAP.
DOLAP: OLAP implementation for desktop decision support environment.
MOLAP stands for Multidimensional Online Analytical Processing. It is a type of OLAP (Online Analytical
Processing) that stores data in a multidimensional cube format, rather than in relational databases like in
ROLAP.