UNIT-II
Data Warehouse Process and Technology
Contents:
Warehousing Strategy
Warehouse Management and Support Processes
Warehouse Planning and Implementation
Hardware and Operating Systems for Data Warehousing
Client/Server Computing Model & Data Warehousing.
Parallel Processors & Cluster Systems
Distributed DBMS implementations
Warehousing Software
Warehouse Schema Design
Data Extraction, Cleanup & Transformation Tools
Warehouse Metadata
Warehouse Strategy involves the following key activities
1. Establish Sponsorship
2. Identify Enterprise Needs
3. Determine Measurement Cycles
4. Validate Measures
5. Design Data Warehouse Architecture
6. Apply the Correct Technology Solution
7. Implement the Data Warehouse
Establish Sponsorship
• The first step is to establish sponsorship for the data warehouse if it
does not already exist.
• An important aspect of establishing sponsorship is ensuring that
everyone in the enterprise understands the purpose of the data
warehouse.
• The sponsorship chain includes a data warehouse manager and
two other key individuals:
– An executive sponsor with the resources to invest in the information
infrastructure, and a project driver between the executive sponsor and the
warehouse manager who keeps the project moving and on schedule.
Identify Enterprise Needs
• Identifying enterprise needs is a major component in the
engineering life cycle for any information system, and it is
crucial when engineering a Data Warehouse.
• When developing operational systems, there is often one
single enterprise sponsor or one group of users with a clear
view of what they need, what the system should look like, and
how it should function. When developing a Data Warehouse, by
contrast, requirements must be gathered from many sponsors and
user groups across the enterprise, whose needs are rarely so
clearly defined.
Determine Measurement Cycles
• Completely defining an enterprise measure includes
describing the cycles or time periods used for the measure.
Are quarters, months, or hours appropriate for capturing
useful measurement data?
• How much historical data will be needed? These vary greatly
by enterprise.
• The United States Federal Reserve Bank views enterprise
measures in monthly, quarterly and annual increments and
uses years of historical data to determine trends in the
economy.
Validate Measures
• After identifying and defining enterprise needs, it is absolutely
necessary to communicate them throughout the enterprise.
• One of the best justifications for undertaking a Data
Warehouse project is the synergy achieved through the
process of defining and then communicating its critical
success factors and measures.
Design Data Warehouse Architecture
• Only after defining and thoroughly documenting enterprise
needs (measures and critical success factors) should an
organization begin the actual Data Warehouse architecture design.
Apply the Correct Technology Solution
• Only after fully defining enterprise requirements and
designing the Data Warehouse architecture should an
enterprise begin to select the technology for the Data
Warehouse.
• Key technology issues, in addition to determining the
hardware/software platform for the Data Warehouse, include
developing programs for loading information into the Data
Warehouse, implementing access control (security)
mechanisms and selecting one or more user interface tool
sets.
Implement the Data Warehouse
• Implementation begins with integration and transformation
programs for ensuring data quality.
• Effective loading and maintaining of data in the data
warehouse and associated data marts is critical for decision
support activities and also providing information to users and
stakeholders.
Data Warehouse Benefits
More cost-effective decision making.
• A Data Warehouse allows reduction of staff and computer
resources required to support queries and reports against
operational and production databases.
• This typically offers significant savings. Having a Data
Warehouse also eliminates the resource drain on production
systems when executing long-running, complex queries and
reports.
Data Warehouse Benefits
Better enterprise intelligence
• Increased quality and flexibility of enterprise analysis arises
from the multi-tiered data structures of a Data Warehouse
that support data ranging from detailed transactional level to
high-level summary information.
• Guaranteed data accuracy and reliability result from ensuring
that a Data Warehouse contains only "trusted" data.
Data Warehouse Benefits
Enhanced customer service
• An enterprise can maintain better customer relationships by
correlating all customer data via a single Data Warehouse
architecture.
• Business reengineering. Allowing unlimited analysis of
enterprise information often provides insights into enterprise
processes that may yield breakthrough ideas for
reengineering those processes.
• Just defining the requirements for a Data Warehouse results
in better enterprise goals and measures. Knowing what
information is important to an enterprise will provide
direction and priority for reengineering efforts.
Data Warehouse Benefits
Information system reengineering
• A Data Warehouse that is based upon enterprise-wide data
requirements provides a cost-effective means of establishing
both data standardization and operational system
interoperability.
• Data Warehouse development can be an effective first step in
reengineering the enterprise’s legacy systems.
Warehouse Management System
and
Support Processes
Warehouse Management System (WMS)
• A data warehouse management system is a software application
that supports the day-to-day operation of a warehouse.
• The management task includes ensuring its availability and
effectiveness, backing up its components, and maintaining security.
• A data warehouse management system typically has three
parts.
1. Load Management
2. Warehouse Management
3. Query management
Warehouse Management System (WMS)
(Diagram: any source (operational databases, multimedia files, documents, applications, external data and relational files) flows through the load manager into the warehouse, which holds raw data, summary data, metadata and pointers to other sources under the warehouse manager, and is accessed via the query manager by relational and OLAP tools.)
Load Management
• Load manager performs the operations required to extract and
load the data into the database.
• The size and complexity of a load manager varies between
specific solutions from one data warehouse to another.
• The load manager performs the following functions −
• Extract data from the source system.
• Fast load the extracted data into a temporary data store.
• Perform simple transformations into a structure similar to
the one in the data warehouse.
Load Management
Extract Data from Source
• The data is extracted from the operational databases or the
external information providers.
• Gateways are the application programs that are used to extract
data.
• A gateway is supported by the underlying DBMS and allows the client
program to generate SQL to be executed at the server.
• Open Database Connectivity (ODBC) and Java Database
Connectivity (JDBC) are examples of gateways.
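• A minimal Python sketch of extraction through an ODBC gateway is shown below; the DSN name operational_db, the sales table and its last_updated column are assumptions used only for illustration.

    # Sketch: extract data from an operational source through an ODBC gateway.
    # The DSN "operational_db", the "sales" table and its "last_updated" column
    # are hypothetical names used only for illustration.
    import pyodbc

    def extract_sales(since):
        conn = pyodbc.connect("DSN=operational_db")   # gateway connection via ODBC
        cursor = conn.cursor()
        # the client generates SQL that is executed at the server via the gateway
        cursor.execute("SELECT * FROM sales WHERE last_updated >= ?", since)
        rows = cursor.fetchall()
        conn.close()
        return rows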
Load Management
Fast Load
• In order to minimize the total load window, the data needs to
be loaded into the warehouse in the fastest possible time.
• Transformations affect the speed of data processing.
• It is more effective to load the data into a relational database
prior to applying transformations and checks.
• Gateway technology is not suitable for fast loading, since gateways
are inefficient when large data volumes are involved.
Load Management
Simple Transformations
• While loading, it may be required to perform simple
transformations.
• After completing simple transformations, we can do complex
checks.
• Suppose we are loading EPOS sales transactions; we need to
perform the following checks −
• Strip out all the columns that are not required within the
warehouse.
• Convert all the values to required data types.
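• A minimal Python sketch of these two checks on a single EPOS record follows; the column names are hypothetical.

    # Sketch of the two simple load-time checks on one EPOS sales record.
    # The column names (txn_id, store_id, amount, txn_time) are hypothetical.
    from datetime import datetime

    REQUIRED_COLUMNS = {"txn_id", "store_id", "amount", "txn_time"}

    def simple_transform(record):
        # strip out all columns that are not required within the warehouse
        record = {k: v for k, v in record.items() if k in REQUIRED_COLUMNS}
        # convert all values to the required data types
        record["txn_id"] = int(record["txn_id"])
        record["amount"] = float(record["amount"])
        record["txn_time"] = datetime.fromisoformat(record["txn_time"])
        return record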
Warehouse Manager
• The warehouse manager is responsible for the warehouse
management process.
• It consists of third-party system software, C programs, and
shell scripts.
• The size and complexity of a warehouse manager varies
between specific solutions.
• A warehouse manager includes the following −
• The controlling process
• Stored procedures or C with SQL
• Backup/Recovery tool
• SQL scripts
Functions of Warehouse Manager
A warehouse manager performs the following functions −
• Analyzes the data to perform consistency and referential
integrity checks.
• Creates indexes, business views, partition views against the base
data.
• Generates new aggregations and updates the existing
aggregations.
• Generates normalizations.
• Transforms and merges the source data of the temporary store
into the published data warehouse.
• Backs up the data in the data warehouse.
• Archives the data that has reached the end of its captured life.
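• As an illustration of two of these tasks, the Python sketch below generates an aggregation from the base data and indexes it, using the built-in sqlite3 module to stand in for the warehouse RDBMS; the table and column names are hypothetical.

    # Sketch: generate an aggregation from the base data and index it.
    # sqlite3 stands in for the warehouse RDBMS; all names are hypothetical.
    import sqlite3

    def refresh_monthly_aggregate(conn):
        cur = conn.cursor()
        cur.execute("DROP TABLE IF EXISTS sales_by_store_month")
        # generate a new aggregation from the published fact data
        cur.execute("""
            CREATE TABLE sales_by_store_month AS
            SELECT store_id,
                   strftime('%Y-%m', sale_date) AS sale_month,
                   SUM(amount)                  AS total_amount
            FROM sales_fact
            GROUP BY store_id, sale_month
        """)
        # create an index against the new aggregate
        cur.execute("CREATE INDEX ix_agg_store_month "
                    "ON sales_by_store_month (store_id, sale_month)")
        conn.commit()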
Query Manager
• The query manager is responsible for directing the queries to
suitable tables.
• By directing the queries to appropriate tables, it speeds up the
query request and response process.
• In addition, the query manager is responsible for scheduling the
execution of the queries posted by the user.
• A query manager includes the following components −
• Query redirection via C tool or RDBMS
• Stored procedures
• Query management tool
• Query scheduling via C tool or RDBMS
• Query scheduling via third-party software
Functions of Query Manager
• It presents the data to the user in a form they understand.
• It schedules the execution of the queries posted by the end-user.
• It stores query profiles to allow the warehouse manager to
determine which indexes and aggregations are appropriate.
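• A toy Python sketch of query redirection and query-profile capture is given below; the summary-table mapping and the profile store are hypothetical.

    # Sketch: redirect a query to a summary table when one can answer it,
    # otherwise to the detailed fact table, and record a query profile.
    # The table names and the in-memory profile list are hypothetical.
    SUMMARY_TABLES = {frozenset({"store_id", "sale_month"}): "sales_by_store_month"}
    query_profiles = []   # read later to decide which aggregations are worth keeping

    def redirect(requested_columns):
        # direct the query to a suitable table
        target = SUMMARY_TABLES.get(frozenset(requested_columns), "sales_fact")
        # store a query profile for the warehouse manager
        query_profiles.append({"columns": sorted(requested_columns), "table": target})
        return target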
Warehouse Planning
and
Implementation
The key steps in developing a data warehouse
can be summarised as follows:
1. Project initiation
2. Requirements analysis
3. Design (architecture, databases and applications)
4. Construction (selecting and installing tools, developing
data feeds and building reports)
5. Deployment (release & training)
6. Maintenance
Project initiation
• No data warehousing project should commence without:
– a clear statement of business objectives and scope;
– a sound business case, including measurable benefits;
– an outline project plan, including estimated costs,
timescales and resource requirements;
– high level executive backing, including a commitment
to provide the necessary resources;
• A small team is usually set up to prepare and present a
suitable project initiation document.
• This is normally a joint effort between business and IT
managers.
Requirements analysis
Collecting requirements typically involves 4 principal activities:
1. Interviewing a number of potential users to find out what
they do, the information they need and how they analyse it
in order to make decisions.
2. Interviewing information systems specialists to find out what
data are available in potential source systems, and how they
are organised.
3. Analysing the requirements to establish those that are
feasible given available data.
4. Running facilitated workshops that bring representative users
and IT staff together to build consensus about what is
needed, what is feasible and where to start.
Design
The goal of the design process is to define the warehouse components
that will need to be built.
• Architecture design:
• how the components will work together;
• where they are located (geographically and on what platform);
• who uses them; who will build and maintain them.
• Data design:
• This step determines the structure of the primary data stores used
in the warehouse environment, based on the outcome of
the requirements analysis.
• Application design
• The application design describes the reports and analyses required
by a particular group of users, and usually specifies:
» a number of template report layouts;
» how and when these reports will be delivered to users;
» the functional requirements for the user interface.
Construction
• Warehouse components are usually developed iteratively and
in parallel. That said, the most efficient sequence to begin
construction is probably as follows:
1. Tool selection & installation:
• ETL tool
• Database(s) for the warehouse (usually relational) and
marts (often multi-dimensional)
• Reporting and analysis tools
Construction
2. Data staging system:
• Create target tables in the central warehouse database;
• Request initial and regular extracts from source systems;
• Write procedures to transform extract data ready for loading (optionally
creating interim tables in a data staging area);
• Write procedures to load initial data into the warehouse (using a bulk
loader);
• Create and populate any data marts;
• Write procedure to load regular updates into the warehouse;
• Develop special procedures for a once-off bulk load of historic data;
• Write validation/exception handling procedures;
• Write archiving & backup procedures;
• Create a provisional set of aggregates;
• Automate all regular procedures;
• Document the whole process.
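• For illustration, the Python sketch below creates one target table in the central warehouse database and performs an initial load from a staged extract file; the table layout and file path are assumptions.

    # Sketch: create a target table and perform the initial load from a staged
    # extract file. The schema and the file path are hypothetical.
    import csv, sqlite3

    conn = sqlite3.connect("warehouse.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sales_fact (
            txn_id    INTEGER PRIMARY KEY,
            store_id  INTEGER,
            sale_date TEXT,
            amount    REAL
        )
    """)
    with open("staging/sales_extract.csv", newline="") as f:
        rows = [(r["txn_id"], r["store_id"], r["sale_date"], r["amount"])
                for r in csv.DictReader(f)]
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()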
Construction
3. Application development:
• This step can begin once a sample or initial extract has been
loaded.
• It is a good idea to involve users in the development of
reports and analytic applications, preferably through
prototyping, but at least by asking them to carry out
acceptance testing.
• Most modern business intelligence tools do not require
programming, so it is possible for non-IT staff to build some of
their own reports as well.
Deployment
• It is too often assumed that the first version of a data
warehouse can be rolled out in a matter of weeks, simply by
showing all the users how to use the new reporting tools.
In practice, deployment also involves:
– Installing and configuring desktop PCs - any hardware
upgrades or amendments to the ‘standard build’ need to
be organised well in advance;
– Implementing appropriate security measures - to control
access to applications and data;
– Setting up a support organisation to deal with questions
about the tools, the applications and the data.
– Providing more advanced tool training later, when users
are ready, and assisting potential power users to develop
their first few reports.
Maintenance
• The most important activities are:
– Monitoring the realisation of expected benefits;
– Providing ongoing support to users ;
– Training new staff;
– Assisting with the identification and cleansing of dirty data;
– Maintaining both feeds & meta-data as source systems
change over time;
– Tuning the warehouse for maximum performance (this
includes managing indexes and aggregates according to
actual usage);
– Purging dormant data (freeing space);
– Recording successes and using these to continuously
market the warehouse.
Hardware and Operating Systems
for Data Warehousing
HARDWARE AND OPERATING SYSTEMS
• Hardware and operating systems make up the computing
environment for data warehouse.
• All the data extraction, transformation, integration, and
staging jobs run on the selected hardware under the chosen
operating system.
• Here are some general guidelines for hardware selection:
– Scalability: When the data warehouse grows in terms of the number of
users, the number of queries, and the complexity of the queries,
ensure that the selected hardware can be scaled up.
– Support: Vendor support is crucial for hardware maintenance. Make
sure that the support from the hardware vendor is at the highest
possible level.
– Vendor Reference: It is important to check vendor references with
other sites using hardware from this vendor.
– Vendor Stability: Check on the stability and staying power of the
vendor.
HARDWARE AND OPERATING SYSTEMS
• The hardware architecture of the data warehouse is defined within the
technical blueprint stage of the process.
• The hardware and software architecture is determined once the broad
technical architecture requirements have been established.
• There are two key aspects of the hardware requirements:
1. Hardware architecture and
2. Hardware management
• The required hardware is as given below:
1. Server Hardware
Symmetric Multiprocessing
Cluster Technology
Massive Parallel Processing
New and Emerging Technology
2. Network Hardware
3. Client Hardware
Server Hardware and Operating System
• Symmetric multiprocessing (SMP): a set of CPUs with shared memory and disks.
• This is a shared-everything environment.
• Vendors provide facilities that allow processes to have CPU affinity.
• The length of the communication bus connecting the CPUs imposes a natural
limit on scaling.
• Multiple CPUs allow operations to be processed in parallel.
Cluster Technology
(Diagram: two SMP nodes, Cluster 1 and Cluster 2, linked by a high-speed
interconnection and sharing disks.)
A computer cluster is a set of loosely or tightly connected SMP nodes
that work together so that, in many respects, they can be viewed as a
single system.
Each machine has its own CPUs and memory, but they can share disks.
Massive Parallel processing
• MPP (massively parallel processing) is the coordinated processing of a program
by multiple processors that work on different parts of the program, with each
processor using its own operating system and memory.
• Typically, MPP processors communicate using a messaging interface. In some
implementations, up to 200 or more processors can work on the same
application.
• An "interconnect" arrangement of data paths allows messages to be sent
between processors.
New and Emerging Technology
• New and emerging technology includes Non-Uniform Memory
Architecture (NUMA) and high-speed memory interconnects.
• It is basically a tightly coupled cluster of SMP nodes with a high-speed
interconnect.
• The interconnect enables the OS to run across the whole machine as a
single distributed instance.
Network Hardware and Operating System
• The main aspects of data warehouse design that may be
affected by the network architecture are given below:
• User Access
• Source System Data Transfer
• Data Extraction
• We need to ensure that the network architecture and
bandwidth are capable of supporting data transfer and
extraction in acceptable time.
Client Hardware and Operating System
• Client management and client tool aspects need to be
considered during the warehouse architecture design phase.
• At the design stage it is necessary to consider what user-side tools
will be used.
• If these tools have special requirements, such as data being
summarized in certain ways, these requirements must be taken
into account in the warehouse design.
Client Server Computing Model
and Data Warehouse
Introduction
• Client/server originally applied to a software architecture that
describes processing between applications and supporting
services.
• The client/server computing model represents a specific instance of
distributed cooperative processing, where the relationship between
clients and servers is a relationship between both hardware and
software components.
• Common Client server computing models are:
• Host Based Processing
• Master Slave Processing
• First Generation Client Server Processing
• Second Generation Client server Processing
Host Based Processing
(Diagram: a host CPU running the data warehouse and application, with
attached terminals.)
• In this model, host-based application processing is performed on
one computer system with attached unintelligent (dumb) terminals.
• A single standalone PC with attached character-based display
terminals is an example of this computing model.
Master Slave Processing
(Diagram: a master host CPU holding the data warehouse and applications,
with attached slave CPUs.)
• In this model, slave computers are attached to a master computer and
perform functions only as directed by their master.
• Application processing is unidirectionally distributed from master to
slave.
• Slave computers are capable of limited local application processing.
First Generation Client Server Processing
• This generation uses two models:
• Shared Device LAN processing environment
• Client Server LAN processing environment
• In the shared-device LAN processing environment, PCs are
attached to a shared device that allows them to share a common
resource, called a server, for example a printer or file server.
• The shared device receives requests from the PCs, usually for limited
services such as file sharing and printing.
• The client/server LAN processing environment is an extension of the
shared-device processing environment.
• An advantage of this model can be illustrated by a comparison
between file servers and database servers.
First Generation Client Server Processing
(Diagram: Model 1 shows a shared-device LAN with a server, data and printer;
Model 2 shows a client/server LAN where the server also runs applications.)
Second Generation Client Server Processing
(Diagram: clients on Windows, UNIX, NT and NetWare connect to application
servers, which in turn connect to data servers holding files, mainframe data
and warehouse data.)
• This model uses a multiprocessor architecture.
• The model deals with dedicating servers to applications, transaction
management and systems management.
• Data structures supported by this model range from relational to
multidimensional data models.
• Clients are now mobile, and remote access is implemented.
PARALLEL AND CLUSTER SYSTEM
IN
DATA WAREHOUSE
PARALLEL ARCHITECTURES FOR DATA
WAREHOUSING
• Due to their high-demand on storage and performance, large
DWs frequently reside within some sort of parallel system.
• There is a whole range of architectures for parallelization,
from shared-nothing to shared-disk and hybrid ones, as
current state-of-the-art servers come with multiple
processors.
• There are different nomenclatures for the basic models by
which a parallel system can be designed, and the details of
each model vary as well.
• Consider three basic elements in a parallel system:
1. The processing unit (PU).
2. The storage device (S).
3. The memory (M).
Shared Memory (SM)
• The shared memory, or shared-everything, architecture is a system
where all existing processors share a global memory address space as
well as peripheral devices.
• Only one DBMS is present, which can be executed in multiple
processes or threads in order to utilize all processors.
Distributed DBMS
implementations
Distributed DBMS
• A distributed database management system
(DDBMS) is a centralized software system that
manages a distributed database as if it were all
stored in a single location.
Features
• It is used to create, retrieve, update and delete distributed
databases.
• It synchronizes the database periodically and provides access
mechanisms by virtue of which the distribution becomes
transparent to the users.
• It ensures that the data modified at any site is universally
updated.
• It is used in application areas where large volumes of data are
processed and accessed by numerous users simultaneously.
• It is designed for heterogeneous database platforms.
• It maintains confidentiality and data integrity of the
databases.
Factors Encouraging DDBMS
• Distributed Nature of Organizational Units −
– Most organizations in the current times are subdivided into
multiple units that are physically distributed over the globe.
– Each unit requires its own set of local data.
– Thus, the overall database of the organization becomes
distributed.
• Need for Sharing of Data −
– The multiple organizational units often need to communicate
with each other and share their data and resources.
– This demands common databases or replicated databases that
should be used in a synchronized manner.
Factors Encouraging DDBMS
• Support for Both OLTP and OLAP −
– Online Transaction Processing (OLTP) and Online Analytical
Processing (OLAP) work upon diversified systems which
may have common data.
– Distributed database systems aid both these processing by
providing synchronized data.
• Support for Multiple Application Software −
– Most organizations use a variety of application software
each with its specific database support.
– DDBMS provides a uniform functionality for using the
same data among different platforms.
Factors Encouraging DDBMS
• Database Recovery −
– One of the common techniques used in DDBMS is
replication of data across different sites.
– Replication of data automatically helps in data recovery if
database in any site is damaged.
– Users can access data from other sites while the damaged
site is being reconstructed.
– Thus, database failure may become almost inconspicuous
to users.
Advantages of Distributed Databases
• Modular Development
• More Reliable
• Better Response
• Lower Communication Cost
Types of Distributed Databases
Homogeneous Distributed Databases
• In a homogeneous distributed database, all the sites use
identical DBMS and operating systems.
• Its properties are −
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same
vendor.
• Each site is aware of all other sites and cooperates with
other sites to process user requests.
• The database is accessed through a single interface as if
it is a single database.
Types of Homogeneous Distributed
Database
There are two types of homogeneous distributed database −
• Autonomous − Each database is independent and functions
on its own. The databases are integrated by a controlling
application and use message passing to share data updates.
• Non-autonomous − Data is distributed across the
homogeneous nodes and a central or master DBMS co-
ordinates data updates across the sites.
Heterogeneous Distributed Databases
• In a heterogeneous distributed database, different sites have
different operating systems, DBMS products and data models.
• Its properties are −
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs
like relational, network, hierarchical or object oriented.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar
software.
• A site may not be aware of other sites and so there is
limited co-operation in processing user requests.
Types of Heterogeneous Distributed
Databases
• Federated − The heterogeneous database systems are
independent in nature and integrated together so that they
function as a single database system.
• Un-federated − The database systems employ a central
coordinating module through which the databases are
accessed.
Distributed DBMS Architectures
• DDBMS architectures are generally developed depending on
three parameters: Distribution, Autonomy and Heterogeneity
• Some of the common architectural models are −
• Client - Server Architecture for DDBMS
• Peer - to - Peer Architecture for DDBMS
• Multi - DBMS Architecture
Client - Server Architecture for DDBMS
• This is a two-level architecture where the functionality is
divided into servers and clients.
• The server functions primarily encompass data management,
query processing, optimization and transaction management.
Client functions include mainly user interface.
• However, clients may also have some functions, such as consistency
checking and transaction management.
• The two different client - server architecture are −
– Single Server Multiple Client
– Multiple Server Multiple Client (shown in the following diagram)
(Diagram: the Multiple Server Multiple Client architecture for DDBMS.)
Peer- to-Peer Architecture for DDBMS
• This architecture generally has four levels of schemas −
• Global Conceptual Schema −
– Depicts the global logical view of data.
• Local Conceptual Schema −
– Depicts logical data organization at each site.
• Local Internal Schema −
– Depicts physical data organization at each site.
• External Schema −
– Depicts user view of data.
Multi - DBMS Architectures
• This is an integrated database system formed by a collection of
two or more autonomous database systems.
• Multi-DBMS can be expressed through six levels of schemas −
1. Multi-database View Level
2. Multi-database Conceptual Level
3. Multi-database Internal Level
4. Local database View Level
5. Local database Conceptual Level
6. Local database Internal Level
ETL Tools
Source: http://www.dataintegration.info
ETL (Extract-Transform-Load)
• ETL comes from Data Warehousing and stands for
Extract-Transform-Load.
• ETL covers a process of how the data are loaded from
the source system to the data warehouse.
• Currently, the ETL process encompasses a cleaning step as a
separate step.
• The sequence is then Extract-Clean-Transform-Load
Extract
• The Extract step covers the data extraction from the
source system and makes it accessible for further
processing.
• The main objective of the extract step is to retrieve
all the required data from the source system with as
few resources as possible.
• The extract step should be designed in a way that it
does not negatively affect the source system in terms
of performance, response time or any kind of
locking.
There are several ways to perform the extract:
• Update notification - if the source system is able to provide a
notification that a record has been changed and describe the
change, this is the easiest way to get the data.
• Incremental extract - some systems may not be able to
provide notification that an update has occurred, but they are
able to identify which records have been modified and
provide an extract of such records.
• Full extract - some systems are not able to identify which data
has been changed at all, so a full extract is the only way one
can get the data out of the system. The full extract requires
keeping a copy of the last extract in the same format in order
to be able to identify changes.
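• A small Python sketch of the full-extract case follows: today's extract is compared with the saved copy of the previous one to identify new or changed records; the key column and file format are assumptions.

    # Sketch: identify changes from a full extract by comparing it with the
    # saved copy of the previous extract. The key column "txn_id" is hypothetical.
    import csv

    def load_extract(path):
        with open(path, newline="") as f:
            return {row["txn_id"]: row for row in csv.DictReader(f)}

    def detect_changes(previous_path, current_path):
        previous = load_extract(previous_path)
        current = load_extract(current_path)
        # a record is of interest if it is new or differs from the previous extract
        return [row for key, row in current.items()
                if key not in previous or previous[key] != row]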
Clean
• The cleaning step is one of the most important as it ensures the
quality of the data in the data warehouse.
• Cleaning should perform basic data unification rules, such as:
– Making identifiers unique (sex categories
Male/Female/Unknown, M/F/null, Man/Woman/Not Available
are translated to standard Male/Female/Unknown)
– Convert null values into standardized Not Available/Not
Provided value
– Convert phone numbers, ZIP codes to a standardized form
– Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
– Validate address fields against each other (State/Country,
City/State, City/ZIP code, City/Street).
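• The Python sketch below applies a few of these unification rules to one record; the field names and incoming codes are made-up examples.

    # Sketch of basic unification rules applied to a single record.
    # The field names and incoming codes are made-up examples.
    import re

    SEX_CODES = {"m": "Male", "man": "Male", "f": "Female", "woman": "Female"}

    def clean(record):
        # unify sex identifiers to the standard Male/Female/Unknown
        record["sex"] = SEX_CODES.get(str(record.get("sex", "")).strip().lower(), "Unknown")
        # normalize phone numbers to digits only
        record["phone"] = re.sub(r"\D", "", str(record.get("phone", "")))
        # convert nulls and empty values to a standardized value
        for field, value in record.items():
            if value in (None, "", "null"):
                record[field] = "Not Available"
        return record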
Transform
• The transform step applies a set of rules to transform the data
from the source to the target.
• This includes converting any measured data to the same
dimension using the same units so that they can later be
joined.
• The transformation step also requires:
• joining data from several sources,
• generating aggregates,
• generating surrogate keys,
• sorting,
• deriving new calculated values,
• and applying advanced validation rules.
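• A minimal Python sketch of a few of these transformations (unit conversion, a derived value and a surrogate key) is shown below; the field names and conversion rate are hypothetical.

    # Sketch: convert amounts to a common currency, derive a calculated value,
    # and assign a surrogate key. Field names and the rate are hypothetical.
    import itertools

    surrogate_keys = itertools.count(1)

    def transform(record, fx_rate_to_usd=1.0):
        record["sale_key"] = next(surrogate_keys)                  # surrogate key
        record["amount_usd"] = record["amount"] * fx_rate_to_usd   # common unit
        record["unit_price"] = record["amount_usd"] / record["quantity"]  # derived value
        return record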
Load
• During the load step, it is necessary to ensure that the load is
performed correctly and with as few resources as possible.
• The target of the Load process is often a database.
• In order to make the load process efficient, it is helpful to
disable any constraints and indexes before the load and
enable them back only after the load completes.
• Referential integrity needs to be maintained by the ETL tool to
ensure consistency.
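• The Python sketch below, again using sqlite3 as a stand-in, drops an index before a bulk load and recreates it afterwards; SQLite has no explicit disable/enable for indexes, so drop and recreate plays that role here, and all names are hypothetical.

    # Sketch: drop the index before a bulk load and recreate it afterwards
    # (drop/recreate stands in for disable/enable). Names are hypothetical.
    import sqlite3

    def bulk_load(conn, rows):
        cur = conn.cursor()
        cur.execute("DROP INDEX IF EXISTS ix_sales_store")           # "disable" index
        cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
        cur.execute("CREATE INDEX ix_sales_store ON sales_fact (store_id)")  # "enable"
        conn.commit()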
Managing ETL Process
• The ETL process seems quite straightforward.
• As with every application, there is a possibility that the ETL
process fails.
– This can be caused by missing extracts from one of the
systems,
– missing values in one of the reference tables, or
– simply a connection or power outage.
• Therefore, it is necessary to design the ETL process keeping
fail-recovery in mind.
Staging
• Staging means that the data is simply dumped to the location
(called the Staging Area) so that it can then be read by the
next processing phase.
• The staging area is also used during the ETL process to store
intermediate results of processing.
• However, the staging area should be accessed by the load ETL
process only.
• It should never be available to anyone else, particularly not to
end users, as it is not intended for data presentation to the
end-user and may contain incomplete or in-the-middle-of-processing data.
ETL Tool Implementation
• When you are about to use an ETL tool, there is a
fundamental decision to be made:
• Will the company build its own data transformation tool or
will it use an existing tool?
• Building your own data transformation tool (usually a set of
shell scripts) is the preferred approach for a small number of
data sources which reside in storage of the same type.
• The main benefit of using off-the-shelf ETL tools is the fact
that they are optimized for the ETL process by providing
connectors to common data sources like databases, flat files,
mainframe systems, xml, etc.
List of the most popular ETL tools:
• Informatica - Power Center
• IBM - Websphere DataStage (Formerly known as Ascential DataStage)
• SAP - BusinessObjects Data Integrator
• IBM - Cognos Data Manager (Formerly known as Cognos DecisionStream)
• Microsoft - SQL Server Integration Services
• Oracle - Data Integrator (Formerly known as Sunopsis Data Conductor)
• SAS - Data Integration Studio
• Oracle - Warehouse Builder
• AB Initio
• Information Builders - Data Migrator
• Pentaho - Pentaho Data Integration
• Embarcadero Technologies - DT/Studio
• IKAN - ETL4ALL
• IBM - DB2 Warehouse Edition
• Pervasive - Data Integrator
• ETL Solutions Ltd. - Transformation Manager
• Group 1 Software (Sagent) - DataFlow
• Sybase - Data Integrated Suite ETL
• Talend - Talend Open Studio
• Expressor Software - Expressor Semantic Data Integration System
• Elixir - Elixir Repertoire
• OpenSys - CloverETL
Warehouse Metadata
What is Metadata?
• Metadata is simply defined as data about data.
• The data that is used to represent other data is known as
metadata.
• For example, the index of a book serves as metadata for the
contents of the book. In other words, we can say that
metadata is the summarized data that leads us to the detailed
data.
• In terms of data warehouse, we can define metadata as
follows.
– Metadata is the road-map to a data warehouse.
– Metadata in a data warehouse defines the warehouse objects.
– Metadata acts as a directory.
– This directory helps the decision support system to locate the
contents of a data warehouse.
Categories of Metadata
• Business Metadata − It has the data ownership
information, business definition, and changing policies.
• Technical Metadata − It includes database system names,
table and column names and sizes, data types and
allowed values. Technical metadata also includes
structural information such as primary and foreign key
attributes and indices.
• Operational Metadata − It includes currency of data and
data lineage. Currency of data means whether the data is
active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on
it.
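• As an illustration, the Python dictionary below records the three categories of metadata for a single, made-up warehouse column.

    # Illustrative metadata for one warehouse column, grouped into the three
    # categories above. All content is a made-up example.
    column_metadata = {
        "business": {
            "owner": "Sales Operations",
            "definition": "Net transaction value after discounts",
        },
        "technical": {
            "table": "sales_fact", "column": "amount",
            "data_type": "DECIMAL(10,2)", "nullable": False,
        },
        "operational": {
            "currency": "active",   # active, archived, or purged
            "lineage": "EPOS extract -> clean -> currency conversion -> load",
        },
    }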
Role of Metadata
• Metadata acts as a directory.
• This directory helps the decision support system to locate the
contents of the data warehouse.
• Metadata helps in mapping of data when data is transformed
from operational environment to data warehouse
environment.
• Metadata helps in summarization between current detailed
data and highly summarized data.
• Metadata also helps in summarization between lightly
detailed data and highly summarized data.
• Metadata is used for query tools.
• Metadata is used in extraction and cleansing tools.
• Metadata is used in reporting tools.
• Metadata is used in transformation tools.
• Metadata plays an important role in loading functions.
Metadata Repository
• Metadata repository is an integral part of a data warehouse system.
It has the following metadata −
• Definition of data warehouse − It includes the description of
structure of data warehouse. The description is defined by schema,
view, hierarchies, derived data definitions, and data mart locations
and contents.
• Business metadata − It contains the data ownership
information, business definition, and changing policies.
• Operational Metadata − It includes currency of data and data
lineage. Currency of data means whether the data is active,
archived, or purged. Lineage of data means the history of data
migrated and transformation applied on it.
• Data for mapping from operational environment to data
warehouse − It includes the source databases and their contents,
data extraction, data partition cleaning, transformation rules, data
refresh and purging rules.
• Algorithms for summarization − It includes dimension algorithms,
data on granularity, aggregation, summarizing, etc.
Challenges for Metadata Management
• The importance of metadata cannot be overstated. Metadata
helps in driving the accuracy of reports, validates data
transformation, and ensures the accuracy of calculations.
Metadata also enforces the definition of business terms to
business end-users.
• Metadata in a big organization is scattered across the
organization. This metadata is spread in spreadsheets,
databases, and applications.
• Metadata could be present in text files or multimedia files. To
use this data for information management solutions, it has to
be correctly defined.
• There are no industry-wide accepted standards. Data
management solution vendors have a narrow focus.
• There are no easy and accepted methods of passing
metadata.