International Journal of Computer Applications (0975 – 8887)
Volume 81 – No2, November 2013
From Data Warehouses to Streaming Warehouses: A
Survey on the Challenges for Real-Time Data
Warehousing and Available Solutions
Revathy S Saravana Balaji.B N.K.Karthikeyan, Ph.D
PG Scholar Assistant Professor Professor & Head - IT
Sri Ramakrishna Engg College Sri Ramakrishna Engg College Sri Krishna Coll of Engg & Tech
Coimbatore, Tamil Nadu, India Coimbatore, Tamil Nadu, India Coimbatore, Tamil Nadu, India
ABSTRACT queried by users to perform analysis on the data.
Data Warehouses usually work on history data. In most cases,
the Data Warehouse is loaded with data from operational or
transactional systems on a weekly or nightly basis. As today's
decisions in the business world are becoming real-time, it is
only natural that Data Warehouse, Business Intelligence,
Decision Support and OLAP systems must quickly begin
incorporating real-time data. When shifting from a traditional
offline and time-consuming data warehousing system to a
real-time system, two important considerations are speeding
up the ETL and the OLAP process. This survey looks into the
various challenges involved in building a real-time Data
Warehouse and some of the solutions available to overcome
them.
Keywords
Real-Time Data Warehousing, Real-Time ETL, Data Stream
Management Systems
1. INTRODUCTION Fig:1: A General Data Warehouse Architecture
The advent of the decision support systems came in the early
1970’s. A decision support system (DSS) is a computer The business markets around the globe are weighed down
based information system that supports organizational with rapid changes. To cope up with such a business market,
decision making activities. Decision-making has become one business level decision making must also be quick. Decision
of the criteria’s for a successful and competitive business in makers must adjust operational processes, corporate strategies
today’s world. Sound business decisions are based on data and business models at lightning speed and must be able to
that is analyzed according to pre-defined business criteria. leverage business intelligence instantly and take immediate
Such data, used for performing statistical and analytical action [3].
processing efficiently, resides within a Data Warehouse. The A data stream is a real-time, continuous and ordered sequence
Warehouse stores data for a Data-driven DSS. of items. The growing need for real-time data has caused a
Beginning in about 1990, Data Warehousing and on-line change in the data driven decision making process. Data
analytical processing (OLAP) began expanding the horizon of Driven DSS is shifting from Database Management Systems
DSS. According to Bill Inmon, who is often called the Father to Data Stream Management Systems (DSMSs). A Data
of Data Warehousing, "A Data Warehouse is a subject- Stream Management System is a program to manage a
oriented, integrated, time-variant, nonvolatile collection of continuous stream of data. It is similar to a Database
data". Another definition of a Data Warehouse is "A Data Management System, which is, however, designed for static
Warehouse is a copy of transaction data specifically structured data . Traditional Data Warehouses employ a store-and-then-
for query and analysis"[1]. query data processing model, where data is stored fully in the
database and results are provided to queries based on the data
A data warehouse is a database, most often a relational currently available within the database. In contrast, in
database. It is a central repository of data created by DSMSs, monitoring applications register Continuous Queries
integrating data from one or more different sources. It usually which continuously process unbounded data streams looking
contains historical data that is derived from transactional data. for data items that represent events of interest to the end-user
It enables an organization to separate the analytical workload [4].
from the transactional workload. Fig 1 gives the various
architectural components of a general Data Warehouse system
[1].The data available within the data warehouse can be
15
International Journal of Computer Applications (0975 – 8887)
Volume 81 – No2, November 2013
Table 1: Difference between a DBMS and a DSMS by the ETL process. Almost all the ETL tools and systems,
available in the market today, operate in batch mode. For
batch mode operation, the data should be available in an
Parameters Database Data stream extract file in a specific format. This data will be taken up by
management system management system the ETL process and transformed according to the specified
business rules and loaded into the Warehouse database. This
operation usually requires the Data Warehouse to be offline.
In a real time system, even a small amount of down time is
Nature of Data Non-Volatile Volatile data streams
not acceptable. The more the amount of incoming data, the
more will be the usage of the warehouse data.
Data Access Random Sequential 2.1 Enabling Real-Time ETL
Existing ETL systems can be modified to perform real-time or
near real-time warehouse loading. Some of the techniques
Query Nature One-time queries Continuous queries described in [5] are :
2.1.1 Near Real-time ETL
Storage unlimited secondary limited main memory The simplest way to solve the real-time ETL problem is to not
perform ETL. The system should be analyzed thoroughly to
Capacity storage check whether the cost of performing ETL for the system can
be justified. Some systems will only require that the data
available in the Warehouse be the most recent data. For such
Rate of Change Low update rate High / Continuous systems, the Warehouse refresh period can be changed from a
weekly basis to a daily or hourly basis. This will provide users
update rate. access to fresher data without having to bear the brunt of the
ETL or the reporting processes
Time Little or no time Real-time 2.1.2 Real Time Facts
The Data Warehouse maintains two types of tables, Facts and
Requirements requirements requirements Dimensions. In most of the cases, the actual querying happens
on the fact tables. The second approach is to create a separate
partition for the fact tables corresponding to real-time data.
The fact tables in the real-time partition can then be directly
A Stream Warehouse is a Data Stream Management System inserted or updated with data from the source system.
(DSMS) that stores a very long history, e.g. years or decades; Real-time data loading packages that are specifically designed
or equivalently a data warehouse that is continuously loaded for this approach are available. Packages are available from
[6]. It can also be called as a data warehouse which is updated DataMirror and MetaMatrix. Tibco provide solutions for real-
as and when data arrives. Table 1 gives the difference between time data transport. For systems based on the latest Java
a DBMS and a DSMS. Many systems generate data in technologies, Java Messaging Service (JMS) can be used to
streams. A few examples of such streaming systems are stock transmit each new data element from the source system to a
trading companies, network monitoring systems and traffic lightweight listener application that in turn inserts the new
monitoring systems. These systems are hugely dependent on data into the warehouse tables. For data that is received over
the results obtained from processing the data streams. The the Internet, the data can be transmitted in XML via HTTP
faster the result is obtained, the more efficient is the system using the SOAP standard.
performance.
2.1.3 Trickle & Flip
Data Stream Management Systems provide real-time analysis
In the Trickle and Flip approach staging tables that are an
by processing the events currently in-memory over a short
exact replica of the warehouse tables are created. Data is
time frame. Users would also require long-term analysis of
continuously loaded into these staging tables. On a periodic
data over large time frames. A Stream Warehouse bridges the
basis, the actual warehouse tables are swapped with these
short-term vs. long-term gap by loading data continuously in a
staging tables thus bringing the Data Warehouse instantly up-
streaming fashion and warehousing them over a long time
to-date
period [8].
2.1.4 External Real-time Data Cache
2. REAL – TIME DATA Another approach is to store the real time data outside of the
WAREHOUSING CHALLENGES Data Warehouse in an external real-time data cache (RTDC).
A typical data warehouse system consists of a staging area, This approach will avoid any performance problems to the
where the data from the different sources are brought together. existing warehouse. The RTDC can simply be another
The staging area is often contained in a different database, dedicated database server or a separate instance of a database
most commonly referred to as the staging database. The data system. Data Warehouse systems dealing with large volumes
from the staging area is integrated, cleansed and transformed. of real-time data or those that require extremely fast query
The transformed data is then moved to another database, often performance can benefit from using an in-memory database
called the warehouse database, where the data is modeled into (IMDB). Such IMDBs are provided by companies such as
tables called facts and dimensions. Angara, Cacheflow, Kx, TimesTen, and InfoCruiser.
The data from the staging area is processed using the
Extraction, Transformation and Load (ETL) process. The
most recent data is pushed into the data warehouse database
16
International Journal of Computer Applications (0975 – 8887)
Volume 81 – No2, November 2013
2.2 Enabling Real-Time Business
Intelligence
The analysis of the data that has been extracted, transformed
and loaded into the warehouse tables can be done by using
Data Mining or Online Analytical Processing (OLAP) tools.
OLAP can be defined as the general activity of querying and
presenting text and number data from data warehouses [2].
OLAP consists of a set of tools and techniques that allows
users to query databases and analyze the data. They aid in
preparing reports based on the data in the Data Warehouse.
These reports are used by managers and business analysts for
making business decisions.
Fig 3: Complex Event Processing Server Architecture
3. DATA STREAM MANAGEMENT
SYSTEMS
Customary Data Warehouses are refreshed with the latest
available data only on a periodic basis. The period of refresh
ranges from weekly to daily. Analysis happens on terabytes of
data; but the most recently arrived data will not be available
within the Warehouse. Data Stream Management Systems,
however, support analysis on the latest available data in real-
time. But the amount of data on which the analysis can be
done is very limited.
A Streaming Warehouse combines the historical analysis of
the Data Warehouse and the real-time analysis of the Data
Fig 2: Architecture for a Real Time Business Intelligence
Stream Management Systems. The goal of a streaming
System.
warehouse is to propagate new data across all the relevant
tables and views as quickly as possible [6]. If the most recent
data is available in the Warehouse immediately, then analysis
One of the methods proposed for real-time business can be done on the latest data and business decisions can be
intelligence to stream data in real-time from the source taken accordingly in real-time.
systems to the data warehouse is [9] where a component
referred to as the stream analysis engine. The goal of the 4. CONCLUSION
stream analysis process is to extract crucial information in As enterprises, such as E-commerce sites, are increasingly
real-time and then have it delivered to appropriate action facing the need for real-time Business Intelligence and
points which could be either tactical or strategic .Fig 2 gives Predictive Analytics, the need of the hour is to build ETL
an overview of the Stream Analysis Engine[7]. This stream tools, which will provide real-time data into Data
analysis engine performs in-depth analysis on the incoming Warehouses.
data to identify interesting patterns.
The trade-off between the cost of real-time Data Warehousing
2.2.1 Stream Analysis Engine and the actual requirement for such an analysis calls for
In this approach, query processing is done on continuously serious research and consideration. Otherwise the resulting
arriving data streams or event streams. The arrival of streams system may have prohibited costs associated with it [5]
triggers the query processing. Fig 3 gives an overview of a
complex event processing server architecture [9]. The The underlying technology components and custom solutions
drawback in this architecture is again the performance of the for real-time data warehousing are excessively expensive. The
Data Warehouse. Continuous running queries may reference importance, complexity and criticality of such an environment
data in the database and have an impact on the near real-time make real-time BI and DW a significant topic of research and
requirements practice. Therefore, these issues need to be addressed in the
future by both the industry and the academia [9].
17
International Journal of Computer Applications (0975 – 8887)
Volume 81 – No2, November 2013
5. REFERENCES [5] Langseth, J., "Real-Time Data Warehousing: Challenges
[1] Oracle Data Warehousing Guide – Oracle and Solutions", http://dssresources.com/papers/features/
Documentation,docs.oracle.com/cd/B28359_01/server.11 langseth/langseth02082004.html
1/b28313.pdf by P Lane. [6] Lucas Golab, Theodere Johnson , Vladislav Shkapenyuk
[2] The Data Warehouse Lifecycle Toolkit ,Ralph Kimball, Scalable, Scheduling of Updates in Streaming Data
Margy Ross ,Warren Thornthwaite ,Joy Mundy , Bob Warehouse, IEEE Transactions on knowledge and data
Becker , John Wiley & Sons; 2nd Edition. engineering ,Vol. 24, N0. 6, JUNE 2012.
[3] Dr. Kamal Kakish Dr.Theresa A.kraft , ETL Evolution [7] Agrawal , D., The Reality of Real-Time Business
for Real-Time Data Warehousing,2012, Proceedings of Intelligence, Proceedings of the 2nd International
the Conference on Information Systems Applied Workshop on Business Intelligence For the Real Time
Research ,New Orleans Louisiana, USA , ISSN:2167- Enterprise (BIRTE 2008), Springer , LNBIP 27 , 75-88.
1508 , v5 n2214. [8] Lukasz Golab and Theodore Johnson, Consistency in a
[4] Mohamed A. Sharaf, Alexandros Labrinidis, Panos K. Stream Warehouse, 5th Biennial Conference on
Chrysanthis , ETL Scheduling Continuous Queries in Innovative Data Systems Research (CIDR ‘11) January
Data Stream Management Systems, ACM 978-1-60558- 9-12, 2011, Asilomar, California, USA.
306-8/08/08. [9] Chaudhuri, S., Dayal, U., Narasayya, V., (2011) An
overview of Business Intelligence Technology,
Communications of the ACM, 54(8), 88-98.
IJCATM : www.ijcaonline.org 18