
Informatica Economică vol. 27, no. 2/2023

An Overview of Data Vault Methodology and Its Benefits

Andreea VINEȘ, Radu-Eleonor SAMOILĂ


Bucharest University of Economic Studies
andreea.vines@csie.ase.ro, radusamoila2@gmail.com

Business Intelligence plays a vital role in helping organizations to extract meaningful insights
from their data and make informed decisions. However, choosing the right data modeling
approach can be challenging, as different methodologies have unique advantages and
limitations. The focus of the current paper is to provide an insight into the Data Vault
architecture, which is an emerging approach to data modeling, compared to the established
methods of Kimball and Inmon. The paper conducts a comparative analysis of these
methodologies, with a particular emphasis on the benefits of Data Vault. This study highlights
the advantages of the Data Vault model in managing data from multiple sources and its
compatibility with agile implementation practices. Overall, this paper sheds light on the
relevance of Data Vault in contemporary data management practices.
Keywords: Data Vault, Data Warehouse, Architecture, Data modelling, ETL
DOI: 10.24818/issn14531305/27.2.2023.02

1 Introduction
In the era of digitalization, most functions of business and society depend on data that comes in many forms. To gain a competitive advantage and extract insights from the data, it is necessary to apply data analytics on top of it. [1] The volume, velocity and variety of data have increased rapidly over the last decades, so an advanced method that overcomes the shortcomings of traditional data warehouse solutions became necessary. For example, as the volume keeps increasing, traditional data modelling approaches can become difficult to maintain when storing all the information. Another example is that data sources change constantly, while the classical data warehouse model is quite rigid and cannot handle all business requirements. Moreover, multiple issues affect the classical systems, such as missing source keys, data issues, data quality, performance gaps and a lack of database tuning or partitioning. [2][3] As traditional data warehouses were designed to handle structured data, not the semi-structured or unstructured data that comes from social media, mobile devices or other sources, vendors started building a new generation of data warehouses with analytics and forecasting capabilities that support both structured and unstructured data. Thus, enterprises become capable of making better decisions for their organizations. [4]

To address the challenges posed by big data, new data warehousing techniques have emerged that are better suited to handling large and varied data sources. One such technique is the Data Vault methodology, developed by Dan Linstedt in the early 2000s. Its main aim is to address the challenges of building and maintaining a data warehouse in the context of big data, providing a flexible, scalable and auditable approach to data modelling and integration. Data Vault is designed to accommodate a wide variety of data sources, including both structured and semi-structured data, making it well suited to the challenges of big data. In this context, it is important to explore the principles and practices of Data Vault, and to understand how it can be used to create effective data warehousing solutions in the era of big data.

The purpose of this paper is to conduct a comprehensive analysis of the evolution of data warehouse models, with a specific focus on the Data Vault methodology proposed by Dan Linstedt and its relevance in the context of Big Data. The paper includes an overview of this area, together with a use case that demonstrates how data from multiple sources can be integrated using the Data Vault architecture. To achieve this objective, the current paper is structured as follows: the first section provides an introduction; the second section describes the three data warehouse methodologies developed by Kimball, Inmon and Linstedt; the third section focuses specifically on Data Vault, presenting its architecture and core components; the fourth section presents a use case that illustrates how data can be modelled using Data Vault 2.0; finally, the last section presents the conclusions of the study.

2 Data Warehouse models – from Kimball to Linstedt
Data Vault represents a conceptual and logical data model used to build data warehouses for enterprise-scale analytics. It comes as an alternative to the architectures proposed by Bill Inmon and Ralph Kimball. The concept of Data Vault appeared in 1990, when Dan Linstedt started publishing papers on the method, and a final version of the architecture was proposed in 2000. In the following 10 years, the author adjusted the method and adapted it to current necessities and design patterns, resulting in an adaptive method called Data Vault 2.0.
Kimball's architecture [5], proposed in 1996, consists in a two-layer model. Raw data is copied into a staging area, and all transformations happen before the data is ingested into the second layer, the data warehouse. With this approach, end users connect to the same data warehouse model and all the dimensions are common. The main advantage is that it is easy to implement, but it can also become quite challenging if a second model has to be built from the same landing data, as everything needs to be loaded again. [5]

Fig. 1. Kimball approach architecture


Source - [7]

Inmon's architecture [6] has three layers to overcome the challenges presented earlier, where the new layer consists of data marts: mini data warehouses from which users can extract the information they require. This also requires more data processing in order to create all the data marts. The advantage of separating the data warehouse from the data marts is beneficial in the context of models with multiple use cases. It also offers scalability when integrating new requirements or defining new data marts. [6]

Fig. 2. Inmon approach architecture


Source - [7]

Data Vault 2.0 was designed in 2010 by Linstedt out of the necessity for a model that is scalable in the context of big data, able to handle unstructured data and to integrate with more systems. One of the main differences is the use of hash keys as primary keys instead of sequential numbers. This change allows all the hubs and satellites (which will be presented in detail in Section 3 – Data Vault architecture) to be processed in parallel and removes the dependencies on the data processing side. The hash key is calculated from the combination of columns that constitute the business key of the table, using methods such as SHA-1 (Secure Hash Algorithm 1) or MD5 (Message-Digest Algorithm). Several of the differences between Data Vault 1.0 and Data Vault 2.0 are synthesized in Table 1. [7]

Table 1. Data Vault 1.0 & 2.0 Comparison

Data Vault 1.0 | Data Vault 2.0
Sequence numbers for surrogate keys. | Hash keys for surrogate keys.
DV 1.0 also used MPP, but only to a certain extent. | DV 2.0 takes advantage of MPP-style platforms and is designed with MPP in mind.
Limited support for real-time load. | Real-time ready, cloud ready, NoSQL ready and big data friendly.
Not very flexible towards automation and virtualization. | DV 2.0 has a very strong focus on both automation and virtualization.
DV 1.0 had a major focus on modelling, and many of the modelling concepts are similar. | DV 2.0 is a complete system of Business Intelligence, covering everything from concept to delivery.
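As an illustration of the hash-key mechanism above, the following minimal Python sketch (not taken from the paper; the helper name and delimiter are illustrative) derives a Data Vault 2.0-style surrogate key from business-key columns:

    import hashlib

    def hash_key(*parts: str) -> str:
        """Derive a deterministic surrogate key by hashing the
        concatenated business-key columns (Data Vault 2.0 style)."""
        # Join with a delimiter so ("AB", "C") and ("A", "BC") differ.
        normalized = "||".join(p.strip().upper() for p in parts)
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    # The same business key always yields the same key, so independent
    # loaders can compute it in parallel without a shared sequence.
    print(hash_key("P-1001"))        # stable digest for this key
    print(hash_key("P-1001", "S2"))  # an extra column changes the digest

Unlike a shared sequence generator, such a key can be computed independently by any process, which is what enables the parallel, dependency-free loading mentioned above.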

The model proposed by Dan Linstedt represents a third approach to data modelling, and it comes with a series of benefits, such as: [8]
• It contributes to the organization's agility and improves the speed at which the business can learn and exploit more opportunities.
• It delivers a modernized data service.
• It enables new business capabilities such as data-driven decision-making and data science, and it can be considered the key to new business models.
It represents an agile, structured solution with flexibility for refactoring.
According to the paper of Cernjeka et al. [11], who developed a metamodel, Data Vault can be used to integrate semi-structured data as well. The authors translated NoSQL document stores that use JSON files into a Data Vault model using the key-value structure of the files. The following translation rules were applied (a code sketch of the rules follows the list):
• TR0: a MongoDB collection is translated into a set of hubs, links and satellites.
• TR1: each document id is translated into a business key, and a hash key is calculated and added to the hub.
• TR2: the non-id fields of the document are translated as attributes of the satellite entity, where the value of the satellite attribute is the value part of the document.
• TR3: a document reference to another document is translated into a Data Vault link entity connecting the current hub and a different hub (the referenced document).
• TR4: an embedded document is translated into a Data Vault entity connected to the current hub; in this case, TR1-TR3 are applied accordingly.
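A minimal sketch of rules TR0-TR3 applied to one hypothetical MongoDB-style document (the field names and the hkey helper are illustrative assumptions, not taken from [11]):

    import hashlib

    def hkey(*parts: str) -> str:
        # Hash key over the concatenated business-key parts (see Section 2).
        return hashlib.md5("||".join(parts).encode("utf-8")).hexdigest()

    # Hypothetical document from a "products" collection (TR0: the
    # collection maps to a set of hubs, links and satellites).
    doc = {"_id": "p-1", "name": "Laptop", "price": 999.0,
           "category_ref": "c-7"}

    # TR1: the document id becomes the business key of a hub.
    hub_product = {"hash_key": hkey(doc["_id"]), "business_key": doc["_id"]}

    # TR2: non-id fields become attributes of the hub's satellite.
    sat_product = {"parent_hash_key": hub_product["hash_key"],
                   "attributes": {k: v for k, v in doc.items()
                                  if k not in ("_id", "category_ref")}}

    # TR3: the reference to another document becomes a link entity
    # connecting the product hub with the referenced category hub.
    link_row = {"hash_key": hkey(doc["_id"], doc["category_ref"]),
                "product_hash_key": hkey(doc["_id"]),
                "category_hash_key": hkey(doc["category_ref"])}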
However, this metamodel does not provide details about nested attributes in JSON files, or about how JSON files with different structures or attributes can be integrated into the Data Vault model, which will be the subject of future work.
On the other hand, the authors of paper [12] provided a methodology for mapping a Data Vault schema to a standard XML schema, which offers flexibility and also helps when the data from the source systems comes in various formats that all need to be stored in a centralized data warehouse. The authors also mentioned that using an XML schema might have some limitations, such as identifying the hub key or handling data type variations.
As Data Vault is a quite new approach, most of the studies found in the literature focus on comparing the methodologies of Inmon and Kimball. However, some papers also include Data Vault in the comparison [7][8]. The conclusion of these works is that all three methods have advantages and disadvantages, depending on the aim of the implementation. In the two mentioned studies, it was highlighted that a Data Vault implementation requires simple and standard ETL rules to load data, compared to Kimball's approach, where the transformations that create the dimension and fact tables can be complex. Another aspect worth mentioning is historization, where all three approaches can handle data versioning. Regarding the lifecycle of the model, it was already mentioned that Data Vault provides an easy-to-change model, where the existing tables are not affected by changes. To conclude, for all the modelling approaches presented, developers and data architects are required to understand the requirements and choose the option that best fits.

3 Data Vault architecture
The main objective of Data Vault is to ensure quick adaptation to changes when a new business object is defined or new sources are ingested. The architecture is defined on three layers: [9]
• Staging area – ensures ingesting data from different sources and keeps it in its raw format; this layer does not keep history, so it gets truncated every time a new batch is processed. It allows applying only hard business rules that do not impact the meaning of the data (for example, applying date-time transformations, defining hash keys, defining Unicode formats etc.; see the sketch after Fig. 3).
• Enterprise data warehouse area – represents the second layer, in which data is modelled using the Data Vault 2.0 architecture while also keeping the history of the data. This component is designed using hub, satellite and link tables. This layer allows applying soft business rules and calculating different KPIs. This area can also be divided into a Raw Vault, where raw data is transposed into a data vault, and a Business Vault, where all the business rules are applied and data is modelled using Data Vault.
• Information delivery area – allows end users to access the data through data marts and use it to get different insights. Data is aggregated and prepared in order to be used for different use cases, such as decision-making, prediction, artificial intelligence or machine learning.

Fig. 3. Data Vault architecture


Source - [7]
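To make the staging layer's hard business rules concrete, here is a minimal sketch, assuming a raw record with an id column (all names are hypothetical): only technical standardization and metadata are added, and the layer is truncated on every batch.

    import hashlib
    from datetime import datetime, timezone

    def stage_record(raw: dict, record_source: str) -> dict:
        """Apply hard rules only: add technical metadata without
        changing the business meaning of the data."""
        staged = dict(raw)
        staged["record_source"] = record_source
        staged["load_date"] = datetime.now(timezone.utc).isoformat()
        staged["hash_key"] = hashlib.md5(
            f"{staged['id']}||{record_source}".encode("utf-8")).hexdigest()
        return staged

    staging: list = []

    def load_batch(rows: list, record_source: str) -> None:
        staging.clear()  # the staging area keeps no history
        staging.extend(stage_record(r, record_source) for r in rows)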

Compared with the dimensional approach, where the main components of a data warehouse are fact and dimension tables, Data Vault defines three core components: hub, link and satellite. [9]
A hub represents a core business concept (customer, product, sale, vendor); it is created as a table that contains the business key and some metadata about when the key was first loaded and where it came from. With the Data Vault 2.0 model, the primary key of the hub table was replaced with a hash key of the business column (in the Data Vault 1.0 version, the primary key was created using a sequence number).
A link represents relationships between multiple business objects. It contains a primary key, also defined as a hash key, and foreign keys to the hubs that are being interconnected, like a bridge table. Links make the model more flexible: for every new functionality added, it is only required to create a new hub table and connect it to a link table.
A satellite stores information about the hubs and the relationships between them; it contains all the attributes, together with information about when the data was loaded and its source, and it keeps the full history needed to track changes. Its structure consists of a primary key created from the parent hash key and the load date, plus further columns for the source, the load date and the descriptive attributes. Each hub can have multiple satellites, which correspond to the different data sources integrated into the system that share the same business key. Another alternative is to define a new satellite when new attributes are ingested into the system, instead of updating the current tables. The typical record layouts of the three components are sketched below.
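A minimal sketch, with column names following the descriptions above rather than any specific implementation:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class HubRow:
        hash_key: str          # PK: hash of the business key
        business_key: str
        load_date: str
        record_source: str

    @dataclass
    class LinkRow:
        hash_key: str            # PK: hash of the combined hub keys
        product_hash_key: str    # FK to one hub
        category_hash_key: str   # FK to the other hub
        load_date: str
        record_source: str

    @dataclass
    class SatelliteRow:
        parent_hash_key: str           # PK together with load_date
        load_date: str
        load_end_date: Optional[str]   # None while the row is active
        record_source: str
        attributes: dict = field(default_factory=dict)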
The popularity of Data Vault is based on the following characteristics [13]:
i. Flexibility, which allows new sources to be easily integrated into the current model with no, or only small, updates to the existing tables. Once a new entity is defined, it is only required to create a hub table and update one of the link tables to connect it with the new hub.
ii. Loading efficiency, as all the entities can be loaded in parallel without requiring many data dependencies. The only dependency is that the entities need to be processed in the following order: hubs, links and satellites; within each group, everything can be loaded in parallel (for example, all hub tables can be loaded in parallel, followed by the link tables and then the satellites; see the sketch after this list).
iii. Auditability, which keeps track of all the changes made to a source system: every change is tracked through the satellites, meaning that each change to the current data is stored as a new record, with a timestamp indicating the expiration date of the previous one, similar to slowly changing dimensions type 2.
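A minimal sketch of the loading order from characteristic ii, with hypothetical table names and a stubbed loader:

    from concurrent.futures import ThreadPoolExecutor

    def load_table(name: str) -> None:
        print(f"loading {name}")  # stand-in for the real loader

    waves = [
        ["hub_products", "hub_categories"],      # wave 1: hubs in parallel
        ["link_product_category"],               # wave 2: links
        ["sat_products_s1", "sat_products_s2",
         "sat_categories_s1"],                   # wave 3: satellites
    ]

    with ThreadPoolExecutor() as pool:
        for wave in waves:
            # Tables within a wave run in parallel; consuming the map
            # result forces the wave to finish before the next starts.
            list(pool.map(load_table, wave))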
4 A simplified implementation of Data Vault
This section presents one alternative for implementing a Data Vault model and the benefits it brings when integrating new sources into the current model.
The initial data model contains two objects: Products, which initially gets data from two sources (S1 and S2), and Categories, which ingests data from only one source (S1). All data comes in a structured format from the client's systems, and it needs to be ingested into the final data warehouse in order to be used in the presentation layer, where users consume the data through reports and dashboards. The table structure of the input data, with the column names, descriptions and data types, is presented in Tables 2 and 3:
Table 2. Products table structure (S1 & S2 sources)

Column name | Description | Data type
id | Unique identifier | String
Name | Name of the product | String
Description | Description of the product | String
Price | Current price | Float
Category_id | Category of the product | String

Table 3. Categories table structure (S1 source)

Column name | Description | Data type
Category_id | Unique identifier | String
Name | Name of the category | String
Description | Description of the category | String
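Anticipating the model described next, a worked sketch of how one hypothetical S1 row could be mapped into hub and link rows (the hash construction follows the description below; all values are invented):

    import hashlib

    def hkey(*parts: str) -> str:
        return hashlib.md5("||".join(parts).encode("utf-8")).hexdigest()

    row = {"id": "1001", "Name": "Laptop", "Description": "14-inch",
           "Price": 999.0, "Category_id": "C-7"}

    # Hub key = business key + record source, so identical ids coming
    # from S1 and S2 still produce distinct hub entries.
    product_hk = hkey(row["id"], "S1")
    category_hk = hkey(row["Category_id"], "S1")

    link_row = {"hash_key": hkey(product_hk, category_hk),  # PK
                "product_hash_key": product_hk,             # FK
                "category_hash_key": category_hk}           # FK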

Considering the two data sources presented above, the proposed Data Vault model is represented in the figure below. It contains two hubs (one for Categories and one for Products), and the two hubs are connected using a link table. The Products hub is connected to two satellites, one for each source. The Categories hub has only one satellite, whose attributes hold the category name and description.
The primary key of the hub is created as a hash key of the business key of the Products source (which is the id) and the record source. In this case, if data coming from different sources has the same id, the records will still be distinguished in the hub, because the record source differs and hence the generated hash keys will be different.
The link table contains the hash key of each hub as a foreign key, and a new hash key (created as a combination of the two mentioned) as its primary key.
One of the main principles of Data Vault is that data is never deleted. [14] Considering that, the information about when the data was ingested is kept for every source using 'Load Date' columns. As the satellites keep the history of the data, the 'Load End Date' column is used to track when a record becomes inactive: for example, when a new record for the same key is added to the table (the record was modified in the source), the old record is marked as inactive by populating its 'Load End Date' column, as sketched after Fig. 4.

Fig. 4. Initial Data Vault model
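A minimal in-memory sketch of the 'Load End Date' mechanism described above (the function and column names are illustrative):

    from datetime import datetime, timezone

    def upsert_satellite(sat_rows: list, parent_hk: str, attrs: dict) -> None:
        """Append-only historization: close the active version of the
        key, then insert the new one (similar to SCD type 2)."""
        now = datetime.now(timezone.utc).isoformat()
        for r in sat_rows:
            if r["parent_hash_key"] == parent_hk and r["load_end_date"] is None:
                r["load_end_date"] = now  # mark the old version inactive
        sat_rows.append({"parent_hash_key": parent_hk, "load_date": now,
                         "load_end_date": None, "attributes": attrs})

Nothing is deleted: the full history of every record remains queryable through its load and load-end dates.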

If a new source of products (S3) is ingested, the approach is to create a new satellite for it. Considering that this data source has a new attribute, size, the other sources that do not have this attribute are not affected, and neither is the current ETL process that ingests data and loads it into the existing satellites (a small sketch of this isolation follows Table 4). To distinguish between data sources, the hash key of the hub is created using the business key (which is the id) and the record source. In this case, the hash key will be unique, and it will not be affected by the same id value appearing in multiple sources. The structure of the new source is presented in Table 4:

Table 4. Products table structure (S3 source)

Column name | Description | Data type
id | Unique identifier | String
Name | Name of the product | String
Description | Description of the product | String
Price | Current price | Float
Size | Current size of the product | String
Category_id | Category of the product | String
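A sketch of why the change is so contained, assuming a simple source-to-satellite registry (all names are hypothetical): integrating S3 means adding one entry and creating its table, while the S1/S2 loaders stay untouched.

    # Each source writes only to its own satellite, so schema drift in
    # one source (such as the new 'Size' column) never touches the rest.
    SATELLITES = {
        "S1": "Products_satellite_S1",
        "S2": "Products_satellite_S2",
        "S3": "Products_satellite_S3",  # the only addition for S3
    }

    def target_satellite(record_source: str) -> str:
        return SATELLITES[record_source]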

From a data integration standpoint, when comparing all three methodologies, it becomes apparent that Data Vault simplifies the process of adding new sources by keeping the satellites separate. In contrast, with the traditional Inmon and Kimball approaches, if a new data source is added to the system, the ETL process must be adjusted to incorporate it.
The new data model can be observed in Fig. 5 below, where the only change compared to the previous version was the creation of a new satellite table (Products_satellite_S3).

Fig. 5. Final Data Vault model

5 Conclusions
This paper presented a comprehensive overview of the integration of the Data Vault architecture within data warehouse solutions. The study starts by comparing this approach with the methodologies introduced by Kimball and Inmon. The traditional approaches, Kimball and Inmon, are prone to challenges generated by changes in the current data sources, and they require considerable data engineering work to adjust the current ETL flow.
Data Vault architecture, on the other hand, provides the flexibility and scalability needed to overcome these issues. Although this modelling architecture is relatively new, its agile methodology is increasingly being used by more and more companies to develop data solutions.
It should also be highlighted that all three approaches have their respective advantages and disadvantages, depending on the defined purpose. The traditional approaches are an excellent fit if the system is stable and the requirements are well known. In contrast, the Data Vault architecture is a powerful approach if the data warehouse to be built has to be flexible and scalable, due to an architecture that allows easy integration of new sources.
Therefore, the selection of an appropriate approach should be based on the specific needs and objectives of the data warehouse solution to be implemented. While Kimball and Inmon are conventional methodologies that have been in use for a long time and have proved their efficiency in certain contexts, Data Vault provides a modern and flexible solution that is better suited for rapidly evolving environments. Moreover, Data Vault provides support for semi-structured data, which is more and more used to develop modern data warehouse solutions. The Data Vault architecture enables the construction of incremental models at a low cost, also making it easy to modify the business rules. One of its main benefits is that it helps in the context of Big Data when it comes to data variety, making it possible to integrate unstructured data or to perform near-real-time data loading. [15] Another advantage for Big Data processing refers to the volume of data that can be ingested into the system, allowing petabyte-scale management. [9]
Some papers have explored the development of metamodels for semi-structured data, but there is a need for a more in-depth analysis in this area. As future work, the paper can be extended to include the implementation of semi-structured data, considering nested JSON or XML files and how they can be modelled using Data Vault. Another aspect that could be considered when developing solutions for semi-structured data is how to handle cases where the attributes differ from one file to another, and how they can be modelled effectively within the Data Vault architecture. As semi-structured data becomes more prevalent in the modern data landscape, exploring the integration of this data type within the Data Vault architecture is critical to ensure that data warehousing solutions remain flexible and scalable. By addressing these issues, the Data Vault architecture can continue to provide a comprehensive and robust solution for data warehousing in the future.

References
[1] J. C. Margulies, "Data as competitive advantage," Winterberry Group, Tech. Rep. 2015-02, Oct. 2015, pp. 1-28.
[2] D. Linstedt, K. Graziano, and H. Hultgreen, "The business of data vault modeling," 2nd ed., in Proceedings of the 2009 International Conference on Information and Knowledge Engineering (IKE), Las Vegas, NV, USA, July 2009, pp. 232-238.
[3] D. Linstedt and K. Graziano, "Super Charge your Data Warehouse," 1st ed., CreateSpace Independent Pub, Scotts Valley, CA, USA, 2011.
[4] Deloitte CIO Journal Editor, "The Future of Data Warehouses in the Age of Big Data." [Online]. Available: http://deloitte.wsj.com/cio/2013/07/17/the-future-of-data-warehouses-in-the-age-of-big-data/. [Accessed: Apr. 14, 2023].
[5] R. Kimball and M. Ross, "The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling," 3rd ed., John Wiley & Sons, Indianapolis, IN, USA, 2013.
[6] W. H. Inmon and B. D. Eggleston, "Building the Data Warehouse," 4th ed., John Wiley & Sons, Indianapolis, IN, USA, 2005.
[7] L. Yessad and A. Labiod, "Comparative Study of Data Warehouses Modeling Approaches: Inmon, Kimball and Data Vault," in 2016 International Conference on System Reliability and Science (ICSRS), Algiers, Algeria, 2016, pp. 77-82.
[8] D. Schneider, A. Martino, and M. Eschermann, "Comparison of Data Modeling Methods for a Core Data Warehouse," Trivadis, 2014, pp. 20-21.
[9] D. Linstedt, "Building a Scalable Data Warehouse with Data Vault 2.0," 1st ed., Morgan Kaufmann, Boston, MA, USA, 2016.
[10] "What is Data Vault? - Data Vault," Data Vault, 2019. [Online]. Available: https://www.data-vault.co.uk/what-is-data-vault/. [Accessed: Apr. 14, 2023].
[11] K. Cernjeka, D. Jaksic, and V. Jovanovic, "NoSQL Document Store Translation to Data Vault Based EDW," in Proceedings of the 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, May 2018, pp. 1105-1110.
[12] C. Knowles and V. Jovanovic, "Extensible Markup Language (XML) Schemas for Data Vault Models," Journal of Computer Information Systems, vol. 53, no. 4, pp. 12-21, 2013.
[13] C. Giebler, C. Gröger, E. Hoos, H. Schwarz, and B. Mitschang, "Modeling Data Lakes with Data Vault: Practical Experiences, Assessment, and Lessons Learned," in Proceedings of the 38th Conference on Conceptual Modeling (ER 2019), 2019, pp. 90-99.
[14] D. Linstedt, "Super Charge Your Data Warehouse: Invaluable Data Modeling Rules to Implement Your Data Vault," 1st ed., CreateSpace Independent Publishing Platform, 2011.
[15] I. Nogueira, M. Romdhane, and J. Darmont, "Modeling Data Lake Metadata with a Data Vault," in Proceedings of the 22nd International Database Engineering & Applications Symposium (IDEAS '18), New York, USA, 2018, pp. 253-261.

Andreea VINEȘ graduated from the Faculty of Economic Statistics, Cybernetics and Informatics in 2019. Following the bachelor's degree, she also pursued a Master's program at the Bucharest University of Economic Studies in Information Systems for the Management of Processes and Economic Resources. She is currently working as a Data Engineer and brings a wealth of experience in designing, developing and maintaining data solutions; her expertise includes building and managing data pipelines, data warehousing and ETL processes, with a focus on cloud technologies. Her main fields of interest are cloud technologies, big data, data warehousing and BI.

Radu SAMOILA graduated from the Faculty of Accounting, Audit and Information Technology in 2009. He also holds a master's degree in Economy, Audit, and Information Technology, obtained in 2011 at the Bucharest University of Economy. He has held different managerial roles in multinational companies operating in the energy industry, business and financial consultancy, and FMCG sectors. His main roles covered internal and external audit, business advisory and corporate governance activities. He has been a PhD student at the Bucharest University of Economy since 2019. His main fields of interest are business process optimization and automation, as well as the continuous improvement concept.
