1) Explain the features of a Data Warehouse
1. Subject-Oriented
   Focus on Subjects: Data warehouses are organized around major subjects
    like sales, customers, or products. This helps in analyzing data based on
    specific themes rather than just transactions.
   Subject-Specific Data: Data is categorized and stored by subjects, making
    it easier to generate reports and conduct analysis.
2. Integrated
   Data Integration: Data from different sources is consolidated, cleaned, and
    stored in a consistent format in the data warehouse. This ensures that data
    from various operational systems is uniform and comparable.
   Consistency: Different data sources might use different conventions,
    formats, and units. The data warehouse standardizes these, ensuring that
    all data is consistent.
3. Time-Variant
   Historical Data: Data warehouses store historical data, which means they
    can store snapshots of data over time. This is crucial for trend analysis,
    forecasting, and reporting.
   Time Stamps: Data in a warehouse is associated with time periods,
    enabling the analysis of trends over various periods.
4. Non-Volatile
   Stable Data: Once data is entered into the data warehouse, it is not
    changed or deleted. This ensures data consistency over time.
   Read-Only Access: Data warehouses are primarily read-only, meaning data
    is rarely altered or deleted after it is loaded.
5. Data Granularity
   Detailed and Summary Data: A data warehouse contains data at various
    levels of granularity. It stores both highly detailed data for deep analysis
    and summarized data for quick insights.
   Flexible Queries: Users can query data at different levels of detail,
    depending on their needs.
6. Optimized for Query Performance
   Efficient Query Processing: Data warehouses are optimized to handle
    complex queries quickly, enabling fast retrieval of large datasets.
   Indexing and Partitioning: Data is often indexed and partitioned to improve
    query performance, allowing faster access to specific data segments.
7. Support for Business Intelligence Tools
   Integration with BI Tools: Data warehouses are designed to work
    seamlessly with business intelligence (BI) tools like reporting software,
    OLAP (Online Analytical Processing) tools, and data mining applications.
   Decision Support: They provide the data infrastructure required for
    decision-making processes in an organization.
8. Scalability
   Handling Large Volumes of Data: Data warehouses are scalable, meaning
    they can grow in storage capacity and processing power as the amount of
    data increases.
   Distributed Architecture: Many data warehouses use a distributed
    architecture, allowing them to handle large volumes of data efficiently.
9. Data Mart Integration
   Subsets of Data Warehouses: Data marts are smaller, more focused
    versions of data warehouses that cater to specific business lines or
    departments. They can be integrated into a data warehouse to provide
    specialized data access.
   Departmental Focus: Data marts allow departments to access relevant
    data quickly without querying the entire data warehouse.
10. Metadata Management
   Data about Data: Metadata in a data warehouse describes the data's
    source, transformation processes, structure, and usage. It helps users
    understand the data and how to use it effectively.
   User-Friendly Access: Metadata management tools in a data warehouse
    make it easier for users to find and understand the data they need.
2) Top-down vs bottom-up approach for building a data warehouse
Top-Down Approach
   Initiated by: Enterprise-wide data warehouse design.
   Development Sequence: First, create a centralized data warehouse, then
    develop data marts.
   Data Flow: Data is extracted from operational systems, transformed, and
    loaded into the centralized data warehouse, then segmented into data
    marts.
   Complexity: More complex and time-consuming due to the need for
    comprehensive design upfront.
   Advantages: Ensures consistency and uniformity across the organization;
    ideal for large enterprises.
   Disadvantages: High initial costs, longer implementation time.
Bottom-Up Approach
   Initiated by: Individual data mart development.
   Development Sequence: First, build data marts for specific departments,
    then integrate them into a centralized data warehouse.
   Data Flow: Data is extracted and loaded into individual data marts, which
    are then combined into a data warehouse.
   Complexity: Simpler and faster to implement, with a focus on specific
    business needs first.
   Advantages: Quick implementation, lower initial costs; allows for gradual
    expansion.
   Disadvantages: Risk of data inconsistency; potential challenges in
    integrating data marts later into a cohesive data warehouse.
 Top-Down Design Approach                Bottom-Up Design Approach
 Breaks the vast problem into smaller    Solves the essential low-level
 subproblems.                            problems and integrates them into a
                                         higher one.
 Inherently architected; not a union     Inherently incremental; essential
 of several data marts.                  data marts can be scheduled first.
 Single, central storage of              Information is stored per
 information about the content.          department.
 Centralized rules and control.          Departmental rules and control.
 It includes redundant information.      Redundancy can be removed.
 It may see quick results if             Less risk of failure, favorable
 implemented with iterations.            return on investment, and proof of
                                         techniques.
3)Define metadata and explain the types of metadata
Metadata is "data about data." It provides information that describes other
data, helping to understand, manage, and use the data effectively. Metadata
serves as a blueprint that helps in organizing, finding, and managing data
within a system.
Roles of Metadata
   Data Discovery and Search
    o   Helps users find relevant data quickly through descriptions, tags, and
        classifications.
   Data Understanding
    o   Provides context and meaning to data, helping users interpret and
        understand the information.
   Data Management
    o   Facilitates data organization, storage, and retrieval by describing data
        structures, relationships, and formats.
   Data Quality and Consistency
    o   Ensures accuracy by documenting data sources, transformations, and
        validation rules.
   Data Integration
    o   Enables the combination of data from different sources by providing a
        common reference and mapping of data fields.
   Data Governance
    o   Supports compliance and security by detailing access rights, usage
        restrictions, and data lineage.
   Data Analysis and Reporting
    o   Assists in generating reports and analyses by providing templates,
        business rules, and examples.
   Data Operations Monitoring
    o   Helps track and optimize data processing workflows through
        operational metadata on load times, errors, and performance.
   Data Preservation
    o   Aids in long-term data storage and retrieval by documenting version
        history, formats, and preservation strategies.
Types of Metadata
1. Operational Metadata
   Definition: Describes the processes and operations involved in managing
    data within a system.
   Details: Includes information on data load times, data source refresh rates,
    error logs, and processing times.
   Usage: Helps in monitoring and optimizing data processing workflows,
    ensuring smooth and efficient data operations.
2. Extractional and Transformational Metadata
   Definition: Describes the processes of extracting and transforming data
    from various sources before loading it into a data warehouse.
   Extractional Metadata:
    o   Details: Information about the sources of data, extraction methods,
        data formats, and timestamps of data extraction.
    o   Usage: Helps track where data comes from and how it's extracted,
        aiding in troubleshooting and ensuring data accuracy.
   Transformational Metadata:
    o   Details: Describes the rules and algorithms used to cleanse, format,
        and transform data into the desired structure.
    o   Usage: Critical for understanding how raw data is converted into the
        standardized format used in the data warehouse, ensuring data
        consistency and quality.
3. End-User Metadata
   Definition: Metadata that provides context and understanding for end
    users to effectively interact with and analyze data.
   Details: Includes data definitions, business rules, report templates, and
    query examples tailored for user interpretation.
   Usage: Facilitates easier access to data for non-technical users, helping
    them understand what the data represents and how to use it effectively in
    reports or analyses.
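The three metadata types above can be illustrated with a toy catalog entry. This is a minimal sketch; the table name, fields, and values are all invented for illustration.

```python
# Toy metadata catalog: one entry describing a warehouse table. Each field
# corresponds to one metadata type from the text; all values are invented.
catalog = {
    "sales_fact": {
        "source": "orders database (extracted nightly)",        # operational
        "transformations": ["dates normalized to YYYY-MM-DD"],  # transformational
        "columns": {"amount": "total sale value in USD"},       # end-user
        "last_load": "2024-01-05T02:00:00",
    }
}

def describe(table):
    """End-user lookup: what does this table contain and where is it from?"""
    meta = catalog[table]
    return f"{table}: from {meta['source']}; columns: {', '.join(meta['columns'])}"
```

A BI tool or a data dictionary front end would answer user questions by querying a catalog like this rather than the data itself.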
4) Difference between OLAP and OLTP
 Criteria               OLAP                              OLTP
 Purpose                OLAP helps you analyze large      OLTP helps you manage and
                        volumes of data to support        process real-time
                        decision-making.                  transactions.
 Data source            OLAP uses historical and          OLTP uses real-time and
                        aggregated data from multiple     transactional data from a
                        sources.                          single source.
 Data structure         OLAP uses multidimensional        OLTP uses relational
                        (cube) or relational databases.   databases.
 Data model             OLAP uses star schema,            OLTP uses normalized or
                        snowflake schema, or other        denormalized models.
                        analytical models.
 Volume of data         OLAP has large storage            OLTP has comparatively
                        requirements. Think terabytes     smaller storage requirements.
                        (TB) and petabytes (PB).          Think gigabytes (GB).
 Response time          OLAP has longer response          OLTP has shorter response
                        times, typically in seconds or    times, typically in
                        minutes.                          milliseconds.
 Example applications   OLAP is good for analyzing        OLTP is good for processing
                        trends, predicting customer       payments, customer data
                        behavior, and identifying         management, and order
                        profitability.                    processing.
5) Data Warehouse vs Data Mart
 Data Warehouse                          Data Mart
 A data warehouse is a vast repository   A data mart is a subset of a data
 of information collected from various   warehouse, architected to meet the
 organizations or departments within     requirements of a specific user
 a corporation.                          group.
 It may hold multiple subject areas.     It holds only one subject area,
                                         for example Finance or Sales.
 It holds very detailed information.     It may hold more summarized data.
 It works to integrate all data          It concentrates on integrating data
 sources.                                from a given subject area or set of
                                         source systems.
 Fact constellation schemas are          Star and snowflake schemas are
 commonly used.                          commonly used.
 It is a centralized system.             It is a decentralized system.
 A data warehouse is data-oriented.      A data mart is project-oriented.
6) Operations and examples of an OLAP cube
OLAP stands for Online Analytical Processing. It is a software
technology that allows users to analyze information from multiple
database systems at the same time. It is based on the multidimensional
data model and allows the user to query multidimensional data
(e.g. Delhi -> 2018 -> Sales data). An OLAP database is divided into
one or more cubes; a cube with more than three dimensions is often
called a hypercube.
OLAP operations:
There are five basic analytical operations that can be performed on
an OLAP cube:
Drill down: In the drill-down operation, less detailed data is
converted into more detailed data. It can be done by:
   o   Moving down in the concept hierarchy
   o   Adding a new dimension
In the example cube, the drill-down operation is performed by moving
down in the concept hierarchy of the Time dimension (Quarter -> Month).
Roll up: It is the opposite of the drill-down operation. It performs
aggregation on the OLAP cube. It can be done by:
   o   Climbing up in the concept hierarchy
   o   Reducing the number of dimensions
In the example cube, the roll-up operation is performed by climbing up
in the concept hierarchy of the Location dimension (City -> Country).
Dice: It selects a sub-cube from the OLAP cube by choosing values on
two or more dimensions. In the example cube, a sub-cube is selected
with the following criteria:
   o   Location = “Delhi” or “Kolkata”
   o   Time = “Q1” or “Q2”
   o   Item = “Car” or “Bus”
Slice: It selects a single value along one dimension of the OLAP cube,
which results in a new sub-cube with one fewer dimension. In the
example cube, a slice is performed on the dimension Time = “Q1”.
Pivot: It is also known as the rotation operation, as it rotates the
current view to get a new view of the data. Performing a pivot on the
sub-cube obtained after the slice operation gives a new view of it.
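Slice, dice, and roll-up (by dimension reduction) can be sketched on a tiny cube in Python. The cube below stores (Location, Time, Item) -> Sales cells as a dict; all sales figures are invented, and drill-down and pivot are omitted since they need a concept hierarchy and a display layer, respectively.

```python
# Toy OLAP cube: (Location, Time, Item) -> Sales. Figures are invented.
cube = {
    ("Delhi",   "Q1", "Car"): 100, ("Delhi",   "Q1", "Bus"): 40,
    ("Delhi",   "Q2", "Car"): 120, ("Kolkata", "Q1", "Car"): 80,
    ("Kolkata", "Q2", "Bus"): 60,  ("Mumbai",  "Q1", "Car"): 90,
}

def slice_cube(cube, dim_index, value):
    """Slice: fix one dimension to a single value (e.g. Time = 'Q1')."""
    return {k: v for k, v in cube.items() if k[dim_index] == value}

def dice_cube(cube, criteria):
    """Dice: keep cells whose coordinates fall in the given value sets."""
    return {k: v for k, v in cube.items()
            if all(k[i] in allowed for i, allowed in criteria.items())}

def roll_up(cube, drop_index):
    """Roll-up by dimension reduction: sum Sales over one dimension."""
    out = {}
    for k, v in cube.items():
        reduced = k[:drop_index] + k[drop_index + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out

q1 = slice_cube(cube, 1, "Q1")                     # Time = "Q1"
sub = dice_cube(cube, {0: {"Delhi", "Kolkata"},    # the text's dice criteria
                       1: {"Q1", "Q2"},
                       2: {"Car", "Bus"}})
by_loc_item = roll_up(cube, 1)                     # aggregate out Time
```

Real OLAP servers do the same selections on indexed multidimensional storage instead of a Python dict, but the semantics of the operations are identical.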
7) ETL process
The ETL (Extract, Transform, Load) process is a fundamental part of data
warehousing. It involves extracting data from various source systems,
transforming it into a format suitable for analysis, and loading it into a data
warehouse. Here’s a breakdown of each stage:
1. Extract
   Purpose: The extraction phase involves pulling data from various source
    systems, which can include databases, ERP systems, CRM systems, flat
    files, cloud services, or other data repositories.
   Process:
    o   Identify Data Sources: Determine the source systems from which data
        will be extracted.
    o   Data Collection: Use connectors or APIs to collect data from these
        sources. This might involve querying databases, extracting files, or
        streaming data from real-time sources.
    o   Data Handling: During extraction, it's essential to handle large volumes
        of data efficiently and manage different data formats.
   Example: Extracting customer transaction data from an e-commerce
    platform’s database and sales data from a CRM system.
2. Transform
   Purpose: The transformation phase cleanses, standardizes, and organizes
    the extracted data to make it suitable for analysis in the data warehouse.
   Process:
    o   Data Cleansing: Remove duplicates, handle missing data, correct
        errors, and filter out irrelevant information.
    o   Data Standardization: Convert data into a consistent format (e.g., date
        formats, units of measure) to ensure uniformity across the dataset.
    o   Data Integration: Combine data from multiple sources, resolve data
        conflicts, and consolidate related data into a single unified view.
    o   Data Enrichment: Add derived or calculated data, such as aggregating
        sales data to calculate monthly totals or adding geographic data based
        on customer addresses.
    o   Business Rules Application: Apply business logic to transform data
        into a form that aligns with the organization's requirements.
   Example: Converting all date fields into a standard format (e.g.,
    YYYY-MM-DD), integrating customer data from different regions, and
    calculating total sales per region.
3. Load
   Purpose: The final stage involves loading the transformed data into the
    data warehouse, where it can be stored, indexed, and made available for
    querying and analysis.
   Process:
    o   Data Loading: Insert the transformed data into the data warehouse,
        which might involve bulk loading large datasets or incremental loading
        (e.g., updating only changed records).
     o   Data Indexing: Index the data for faster querying and retrieval.
     o   Data Validation: Perform checks to ensure that the data has been
         loaded correctly and that it meets the necessary quality standards.
     o   Data Partitioning: Optionally, partition the data in the warehouse to
         optimize performance.
   Example: Loading the cleansed and standardized customer transaction
    data into a sales fact table in the data warehouse, where it can be analyzed
    by business intelligence tools.
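The three stages can be sketched end to end in Python. This is a minimal illustration, not a production pipeline: the source rows, field names, and date formats are all invented, and the "warehouse" is just a list standing in for a fact table.

```python
import datetime

def extract():
    """Extract: pull raw rows from two hypothetical source systems."""
    crm  = [{"date": "03/01/2024", "region": "north", "amount": "100.0"}]
    shop = [{"date": "2024-01-05", "region": "North", "amount": "50.5"},
            {"date": "2024-01-05", "region": "North", "amount": "50.5"}]  # duplicate
    return crm + shop

def transform(rows):
    """Transform: standardize dates to YYYY-MM-DD, normalize region names,
    cast amounts to float, and drop exact duplicates."""
    seen, clean = set(), []
    for r in rows:
        try:                                      # source 1 uses DD/MM/YYYY
            d = datetime.datetime.strptime(r["date"], "%d/%m/%Y")
        except ValueError:                        # source 2 already uses ISO
            d = datetime.datetime.strptime(r["date"], "%Y-%m-%d")
        row = (d.strftime("%Y-%m-%d"), r["region"].title(), float(r["amount"]))
        if row not in seen:                       # data cleansing: dedupe
            seen.add(row)
            clean.append(row)
    return clean

def load(rows, warehouse):
    """Load: append the transformed rows into the fact table."""
    warehouse.extend(rows)

fact_sales = []
load(transform(extract()), fact_sales)
```

In practice each stage would also log operational metadata (row counts, load times, errors) so the pipeline can be monitored.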
8) Central tendency numerical
    (Find mean, median, mode, variance and midrange value)
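A worked example on a small invented sample; Python's standard statistics module computes everything except the midrange, which is simply the average of the smallest and largest values.

```python
import statistics

data = [10, 12, 12, 13, 15, 18, 20]   # invented sample

mean     = statistics.mean(data)       # sum of values / count
median   = statistics.median(data)     # middle value of the sorted data
mode     = statistics.mode(data)       # most frequent value
variance = statistics.pvariance(data)  # population variance
midrange = (min(data) + max(data)) / 2 # (smallest + largest) / 2
```

For this sample: mean = 100/7 ≈ 14.29, median = 13, mode = 12, midrange = 15; use statistics.variance instead of pvariance if the question asks for the sample variance (n − 1 denominator).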
9) Hierarchical clustering
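A minimal sketch of the agglomerative (bottom-up) variant with single linkage on invented 1-D points: each point starts as its own cluster, and the two closest clusters are repeatedly merged until k clusters remain.

```python
def single_link_dist(a, b):
    """Single linkage: distance between the closest pair of points."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, k):
    clusters = [[p] for p in points]   # start: every point is its own cluster
    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link_dist(clusters[ij[0]],
                                                   clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge the pair
        del clusters[j]
    return clusters

clusters = agglomerative([1, 2, 8, 9, 25], 2)
```

Recording the distance at each merge yields the dendrogram; complete or average linkage is obtained by swapping min for max or a mean in single_link_dist.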
10)K-means clustering algorithm + numerical
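A minimal sketch of Lloyd's k-means algorithm for the numerical part, on invented 1-D points with invented initial centroids: each iteration assigns every point to its nearest centroid, then moves each centroid to the mean of its cluster, stopping when the centroids no longer change.

```python
def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:    # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, clusters

# Invented exam-style data: k = 2, initial centroids 2 and 16
centroids, clusters = kmeans([2, 4, 10, 12, 3, 20, 30, 11, 25], [2, 16])
```

Tracing by hand, the centroids move 2, 16 -> 3, 18 -> 4.75, 19.6 -> 7, 25 and then stabilize, giving the clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}.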