ASSIGNMENT-IST
OF
DATA MINING AND DATA
WAREHOUSING
SUBMITTED BY: RAKESH KUMAR
ROLL NO: 2100083
COURSE: B. TECH 7TH SEM/CSE
SUBMITTED TO: ER. HARKAMAL MAM
INDEX
SR. TITLE PAGE
NO. NO.
Q1. What is data ware house? Elaborate data ware house 3-5
using multi-tier architecture with an elaborative
diagram?
Q2. What are the various types of OLAP servers? Explain. 6-9
Q3. How much does a data warehouse cost? Write their 10-12
applications and uses?
Q1. What is data ware house? Elaborate data ware house using multi-tier
architecture with an elaborative diagram?
Ans. A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction data
from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular
group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."
Multi-tier Architecture
There are four layers in multi-tier architecture. These are : Data Source Layer, ETL Layer, Data
Storage Layer, and Data Access Layer. These are explained below in brief.
Data Source Layer:
It is the first layer of a multi-tier architecture. It includes all sources of data that need to be
integrated into the data warehouse. These sources can be databases, flat files or external
sources such as social media platforms. The data source layer is responsible for collecting,
validating and organizing the data before passing it on to the next layer.
ETL Layer
This is the second layer of the multi-tier architecture. It is responsible for extracting data from
data sources. It transforms it into a format suitable for a data warehouse. It also loads it into
the data storage layer. This layer ensures the quality and consistency of the data loaded into
the data warehouse.
Data Storage Layer
This is the third layer of the multi-tier architecture. It is responsible for storing the data that
has been transformed and loaded by the ETL Layer. This Layer can be divided into two sub-
layers: the staging area and the data warehouse. The staging area is used to store the data
temporarily before it is loaded into the data warehouse. The data warehouse is the final
destination for the data and is used for reporting and analysis.
Data Access Layer
It is the fourth layer of the multi-tier architecture. It is responsible for providing users with
accessibility to the data. This layer can be divided into two sub-layers −
• Presentation layer provides a user-friendly interface for users to access and
analyze data.
• Application layer is responsible for managing the business logic and ensuring the
security and integrity of the data.
•
Multi-Tier Data Warehouse Architecture Components:
Multi-Tier Data Warehouse Architecture has the following components: Data Sources, Data
Integration Layer, Staging Area, Data Warehouse Database, Data Mart, OLAP Cube, Front-End
Tools, Metadata Repository.
Multi-Tier Data Warehouse Architecture can be divided into three main parts. These are: Bottom,
Middle and Top tier. These are explained as follows below in brief.
Bottom Tier (Data Sources and Data Storage)
This layer consists of Data Sources and Data Storage. It is usually implemented using a warehouse
database server, such as RDBMS. Gateways, such as ODBC, OLE-DB, and JDBC, are used to extract
data from operational and external sources.
Middle Tier
This layer is an OLAP server. OLAP server can be implemented using either Relational OLAP
(ROLAP) model or Multidimensional OLAP (MOLAP) model. ROLAP is an extended relational
DBMS. That maps operations from standard data to standard data. While MOLAP is a special-
purpose server that directly implements multidimensional data and operations.
Top Tier
This layer is a front-end client layer. It has query and reporting tools, analysis tools, and data mining
tools, such as trend analysis and prediction.
Data Warehouse Models
There are three types of Data Warehouse Models from an architectural perspective −
Enterprise Warehouse
• It collects data from all parts of the organization and integrates it to provide a complete
picture of the organization.
• It can be very large and complex, and requires extensive modeling and design to
implement.
Data Mart
• It contains a subset of data from the Enterprise Warehouse that is specific to a
particular group of users.
• It is smaller and easier to implement than an Enterprise Warehouse, and is typically
used for department-level reporting and analysis.
Virtual Warehouse
• It is a set of views on top of an operational database, rather than a separate database.
• It is easy to create, but requires additional capacity on operational database servers to
support efficient querying.
Advantages of Multi-tier Architecture
These are main advantages of Multi-Tier Architecture of Data Warehouse −
Scalability
Components can be added, deleted or updated according to the data warehouse’s needs.
Better Performance
Several layers enable parallel and efficient processing for improved performance and reaction times.
Modularity
Modular design allows the creation, testing, and deployment of separate components.
Security
Applying security measures to various layers enhances the data warehouse’s overall security.
Improved Resource Management
Different tiers can be tuned to use proper hardware resources, reducing expenses and increasing
effectiveness.
Easier Maintenance
Individual components can be updated or maintained without affecting the entire data warehouse.
Q2: What are the various types of OLAP servers? Explain.
Ans. There are three main types of OLAP servers are as following:
ROLAP stands for Relational OLAP, an application based on relational DBMSs.
MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional
techniques.
Relational OLAP (ROLAP) Server
These are intermediate servers which stand in between a relational back-end server and user
frontend tools.
They use a relational or extended-relational DBMS to save and handle warehouse data, and
OLAP middleware to provide missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.
ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to give the
presence of traditional OLAP's slicing and dicing functionality. In essence, each method of
slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
Relational OLAP Architecture
ROLAP Architecture includes the following components
o Database server.
o ROLAP server.
o Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the
market. This method allows multiple multidimensional views of two-dimensional relational
tables to be created, avoiding structuring record around the desired view.
Some products in this segment have supported reliable SQL engines to help the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle user
requests, being 'RDBMS' aware and also being capable of generating the SQL statements based
on the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology is
depends on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the data
amount.
<="" strong="">RDBMS already comes with a lot of features. So ROLAP technologies, (works
on top of the RDBMS) can control these functionalities.
Disadvantages
Performance can be slow: Each ROLAP report is a SQL query (or multiple SQL queries) in the
relational database, the query time can be prolonged if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies on upon developing SQL statements
to query the relational database, and SQL statements do not suit all needs.
Multidimensional OLAP (MOLAP) Server
A MOLAP system is based on a native logical model that directly supports multidimensional
data and operations. Data are stored physically into multidimensional arrays, and positional
techniques are used to access them.
One of the significant distinctions of MOLAP against a ROLAP is that data are summarized and
are stored in an optimized format in a multidimensional cube, instead of in a relational
database. In MOLAP model, data are structured into proprietary formats by client's reporting
requirements with the calculations pre-generated on the cubes.
MOLAP Architecture
MOLAP Architecture includes the following components
o Database server.
o MOLAP server.
o Front-end tool.
MOLAP structure primarily reads the precompiled data. MOLAP structure has limited
capabilities to dynamically create aggregations or to evaluate results which have not been pre-
calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well
suited for MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase. Oracle's Express Server, Pilot Software's Lightship
Server, Sniper's TM/1. Planning Science's Gentium and Kenan Technology's Multiway.
Some of the problems faced by clients are related to maintaining support to multiple subject
areas in an RDBMS. Some vendors can solve these problems by continuing access from MOLAP
tools to detailed data in and RDBMS.
This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse
architecture that contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product
and sales region) to be stored and maintained in a persistent structure. This structure would
be provided to reduce the application overhead of performing calculations and building
aggregation during initialization. These structures can be automatically refreshed at
predetermined intervals established by an administrator.
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for
slicing and dicing operations.
Can perform complex calculations: All evaluation have been pre-generated when the cube is
created. Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed
when the cube is built, it is not possible to contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary and does not
already exist in the organization. Therefore, to adopt MOLAP technology, chances are other
investments in human and capital resources are needed.
Hybrid OLAP (HOLAP) Server
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP
systems save more substantial quantities of detailed data in the relational tables while the
aggregations are stored in the pre-calculated cubes. HOLAP also can drill through from the
cube down to the relational tables for delineated data. The Microsoft SQL Server
2000 provides a hybrid OLAP server.
Advantages of HOLAP
1. HOLAP provide benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information
on the OLAP server and the detail record remains in the relational database. So no
duplicate copy of the detail record is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP
servers.
Q3: How much does a data warehouse cost? Write their applications and uses?
Ans. The cost of a data warehouse can vary widely based on factors such as the organization's
size, data volume, complexity of the architecture, technology choices, and deployment
options. There are several components that contribute to the overall cost of a data warehouse:
1. Hardware and Infrastructure: This includes the cost of servers, storage devices,
networking equipment, and other hardware needed to build and maintain the data
warehouse.
2. Software Licensing: Costs associated with purchasing or licensing the data warehouse
software, including the database management system (DBMS), ETL tools, data
integration software, reporting and analytics tools, and any other software required for
the data warehouse's operation.
3. Development and Implementation: Expenses related to designing, developing, and
implementing the data warehouse architecture, including data modeling, ETL processes,
data transformation, and integration.
4. Maintenance and Support: Ongoing costs for maintaining and supporting the data
warehouse, which can include software updates, performance monitoring,
troubleshooting, and addressing issues that arise.
5. Data Storage and Processing: The cost of storing and processing data within the data
warehouse. This can be influenced by the amount of data stored, the type of storage
used (e.g., traditional databases, cloud-based storage), and the frequency of data
processing.
6. Scalability: If the data warehouse needs to scale to accommodate growing data volumes
and user demands, the cost of scaling up or out (adding more resources or nodes) should
be considered.
7. Training and Expertise: Costs associated with training staff to work with the data
warehouse tools, technologies, and best practices, as well as potentially hiring or
outsourcing experts to manage and optimize the data warehouse.
8. Security and Compliance: Investing in security measures, encryption, access controls,
and compliance requirements can contribute to the overall cost.
Example 1: On-Premises Data Warehouse
Let's assume a medium-sized retail company is planning to build an on-premises data
warehouse to manage and analyze sales data, customer information, and inventory data.
Here's a breakdown of potential costs:
1. Hardware and Infrastructure:
• Server Hardware: $20,000 (including server hardware and storage devices)
• Networking Equipment: $5,000 (switches, routers, etc.)
2. Software Licensing:
• Database Management System (DBMS) License: $10,000 (for a commercial
DBMS like Oracle or SQL Server)
• ETL Tool License: $5,000 (for data extraction, transformation, and loading)
3. Development and Implementation:
• Data Modeling and Design: $8,000 (consulting fees for designing the data
model)
• ETL Development: $12,000 (developing ETL processes and data integration)
4. Maintenance and Support:
• Ongoing Maintenance: $6,000/year (software updates, troubleshooting)
• Support: $3,000/year (support contracts with vendors)
5. Data Storage and Processing:
• Storage Costs: $4,000/year (for storage devices and maintenance)
• Processing Costs: $3,000/year (for server resources)
6. Training and Expertise:
• Training Costs: $2,000 (training staff to use the DBMS, ETL tool, and data
modeling)
• Expertise: $15,000/year (hiring or outsourcing a database administrator)
7. Security and Compliance:
• Security Measures: $2,500 (encryption, access controls, security software)
• Total Initial Investment: $65,000
8. Total Annual Ongoing Costs: $30,500/year
Example 2: Cloud-Based Data Warehouse
Now, let's consider the same retail company opting for a cloud-based data warehouse solution
using a popular cloud provider like Amazon Web Services (AWS) or Microsoft Azure:
1. Cloud Services:
• Data Warehouse Service: $20,000/year (cost varies based on usage and data volume)
• ETL Service: $6,000/year (using cloud-based ETL services)
2. Data Storage and Processing:
• Storage Costs: $5,000/year (cloud storage costs)
• Processing Costs: $8,000/year (computational resources for querying and processing
data)
3. Support and Maintenance:
• Managed Services: $3,000/year (cloud provider manages updates and maintenance)
• Support: Included in the service cost
4. Training and Expertise:
• Training Costs: $1,000 (training staff to use cloud services)
• Expertise: $5,000/year (cloud expertise or consulting services)
5. Total Annual Costs: $48,000/year
Applications and Uses of Data Warehouses:
1. Data warehouses have a wide range of applications across various industries and
sectors. They provide a structured environment for storing, managing, and analyzing
data, enabling organizations to derive meaningful insights and make informed decisions.
Some of the common applications and uses of data warehouses include:
2. Business Intelligence (BI): Data warehouses serve as the foundation for BI initiatives.
Organizations use data warehouses to create dashboards, reports, and visualizations
that provide insights into key performance indicators (KPIs), sales trends, customer
behavior, and more.
3. Decision Making: Data warehouses enable data-driven decision-making by providing
historical and real-time data analysis. This helps executives and managers make
informed choices that drive business growth and optimization.
4. Market Analysis: Organizations can analyze market trends, customer preferences, and
competitive landscapes using data warehouses. This information helps them identify
opportunities and adapt their strategies.
5. Customer Analytics: Data warehouses allow businesses to segment their customer base,
understand customer behaviors, preferences, and purchasing patterns. This knowledge
aids in improving customer experiences and targeted marketing campaigns.
6. Supply Chain Management: Data warehouses help optimize supply chain processes by
tracking inventory levels, demand patterns, and supplier performance, leading to
improved efficiency and reduced costs.
7. Financial Analysis: Finance departments use data warehouses to consolidate financial
data from various sources, perform budgeting, forecasting, and variance analysis, and
ensure compliance with regulations.
8. Healthcare Analytics: Data warehouses are used to manage and analyze healthcare data,
including patient records, medical history, and treatment outcomes. This supports
clinical decision-making and research.
9. Retail Analytics: Retailers analyze data from point-of-sale systems, inventory
management, and customer interactions to enhance inventory planning, pricing
strategies, and customer experiences.
10.Human Resources Management: Data warehouses assist HR departments in managing
employee data, analyzing workforce trends, and making strategic decisions related to
recruitment, retention, and training.