Distributed Database Management Systems and The Data Grid
BaBar (an experiment at the Stanford University where currently about 150 TB of data are available in Objectivity/DB [4]), as well as for the CERN experiments, CMS and Atlas. For the CERN experiments the final decision about the DBMS (object-oriented or relational, which vendor, etc.) has not been made yet, but the current software development process uses an object-oriented approach and Objectivity for storing data persistently. Consequently, we base our work on object-oriented database management systems and in particular Objectivity/DB.

Recently, Grid research as well as distributed database research has tackled the problem of data replication, but from different points of view. The aim of this paper is to outline their approaches and to find commonalities in order to have a most efficient Data Grid that can manage Petabytes of data stored in object-oriented databases. We will provide a first basis for such an effort. Since Data Grids are very new in the research community, we see a clear need for identifying the characteristics and requirements of Data Grids and how they can be met in a most efficient way. Special attention will be given to data consistency and communication issues.

Optimising data replication and access to data over the WAN is not addressed sufficiently in database research. In a DBMS there is normally only one method of accessing data. For instance, a data server serves pages to a client. For the Data Grid, such a single access method may not be optimal. Using an ODBMS also has some restrictions, which are pointed out here together with some possible solutions.

We elaborate on different data consistency models and how global transactions can contribute to this. Global transactions are built on top of transactions provided by a database management system at a local site. As opposed to database research, a separation of data communication is required. In particular, global control messages are exchanged by a message passing library whereas the actual data files are transported via high-speed file transfer protocols. In a commercial ODBMS, a single communication protocol is usually used for exchanging small amounts of data within database transactions. This communication mechanism is optimised for relatively small transactions but may not be optimised for transferring large files over the WAN with high performance, with control information exchanged between distributed sites. We will propose solutions for efficient, asynchronous replication and policies with different levels of data consistency in a Data Grid.

The paper is organised as follows. Section 2 will give an introduction to related work in the database as well as the Grid community. The issues raised there will be further analysed in later sections of the paper. Section 3 deals with replica catalogues and directory services that are used in Grid research and points out how these techniques can be mapped to the database approach. Section 4 discusses Objectivity, its replication option and some general ODBMS aspects. Since we assume that the usage of Grid applications will not be fully transparent to the end user, we dedicate Section 5 to possible implications. Section 6 discusses data consistency and replication methods known in database research and practice. Possible update synchronisation solutions are given in Section 7, which is followed by concluding remarks.

2 Related work on data replication
We first identify some selected related work in the database community as well as in the Grid community. From this we derive commonalities and discuss briefly what the contribution to an efficient Data Grid can be. The common aspects will be dealt with throughout this paper.

2.1 Distributed DBMS research
Distributed database research basically addresses the following issues.
• Replica synchronisation is based on relatively small transactions, where a transaction consists of several read and/or write operations. In contrast, in HEP a typical data production job can write a whole file and thus a transaction should be relatively “large”.
• Synchronous and asynchronous replication (more details can be found in Section 6): most of the evaluation techniques are based on the number of communication messages needed.
• Cost functions for network or server loads are rarely integrated into the replica selection process of a DBMS.
• Rather small amounts of data compared to the Petabyte scale of HEP.

Data access over the WAN and optimisation of replicas is rarely addressed in database research. In a DBMS there is normally only one internal method of accessing data. For instance, a data server serves pages to a client. In detail, a client sends a request for data to the DBMS, and the DBMS (in particular the data server) uses the implemented method for serving the requested data. For read and write access the same method is used, independent of the amount of data accessed. For the Data Grid, a single access method is not optimal depending on the number of objects to be accessed from a file.
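This trade-off can be made concrete with a rough cost comparison between streaming single objects over the WAN and transferring the whole file once. The sketch below is illustrative only: the functions, parameters and numbers are assumptions, not measurements from any Grid system.

```python
# Illustrative cost model: stream n_objects one by one over the WAN
# versus transferring the whole file and reading it locally.
# All parameters (bandwidths, round-trip time) are hypothetical.

def streaming_cost(n_objects, object_size, bandwidth, round_trip):
    """Time to fetch n_objects individually, paying a round trip each."""
    return n_objects * (round_trip + object_size / bandwidth)

def file_transfer_cost(file_size, bulk_bandwidth, local_read=0.0):
    """Time to transfer the entire file plus the local read afterwards."""
    return file_size / bulk_bandwidth + local_read

def prefer_file_transfer(n_objects, object_size, file_size,
                         bandwidth, bulk_bandwidth, round_trip):
    """True if copying the whole file is expected to be cheaper."""
    return (file_transfer_cost(file_size, bulk_bandwidth)
            < streaming_cost(n_objects, object_size, bandwidth, round_trip))

# Example: 10 kB objects, 100 ms round trip, 1 MB/s object streaming,
# 5 MB/s bulk file transfer, 1 GB file.
many = prefer_file_transfer(50_000, 10_000, 1_000_000_000,
                            1_000_000, 5_000_000, 0.1)
few = prefer_file_transfer(10, 10_000, 1_000_000_000,
                           1_000_000, 5_000_000, 0.1)
```

With many requested objects the per-object round trips dominate and bulk file transfer wins; for a handful of objects, streaming them individually remains cheaper.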
For accessing large amounts of data of a remote file, the usage of a file transfer mechanism to transfer the entire file and access it locally can be more efficient in terms of response time than remotely accessing (and thus streaming) many single objects of a file. More details will be given in Section 5. Based on cost functions it can be even more efficient to send the application program to the remote data [5].

2.2 Grid research
The Globus project provides tools for Grid computing like job scheduling, resource discovery, security, etc. Recently, Globus has also been working on a Data Grid effort that enables fast and efficient file transfer, a replica catalogue for managing files and some more replica management functionality based on files. Replica update synchronisation is not addressed. In the Grid community there is a general tendency to deal with replication at the file level, i.e. a single file is the lowest granularity of replication. This has the advantage that the structure of the file does not need to be known by the replication manager that is responsible for replicating files from one site to other sites over the WAN. This is also a valid simplification for most of the HEP replication use cases and, thus, is also the main focus of this paper. Related work in the HEP community can be found in the two Data Grid projects PPDG [6] and GriPhyN [7]. There is still the possibility to deal with replication at the object level, which requires more sophisticated techniques than file level replication. Object handling is addressed in [8]. However, as soon as data items in a replicated file are changed, these changes have to be propagated to all other replicas. This requires knowledge about the data structure of the file. Possible solutions to this problem are given in Section 6.

2.3 Commonalities – the usage of a replication middle-ware
Research results can be combined from both communities by introducing a replication middle-ware layer that manages replication of files (and possibly some objects) by taking into account that each site in a Data Grid will manage data locally with a database management system. Thus, the middle-ware is responsible for site to site (or inter-site) replication and synchronisation whereas the local DBMS takes care of transactions on local data.

In Grids there are many tools for monitoring applications and network parameters. These can be used for filling the gap. Hence, a hybrid solution between database and Grid research can be identified.

One of the main questions to be answered is to what level the replication middle-ware will be able to replace a pure distributed DBMS approach where also inter-site replication issues are dealt with. Clearly, with a replication middle-ware there are many more restrictions for update synchronisation and transparent access to data. There will also be a performance penalty for replica synchronisation. However, an aim of the replication middle-ware is to provide several relaxations of the concept of transparent data access and data consistency. For the remainder of the paper we assume the use of such a replication middle-ware for a Data Grid.

3 Replica catalogues and directory services
Access to data is most important for data analysis in HEP as well as in any other scientific community. Access to replicated data requires specific data and meta data structures and can be addressed in different ways. A DBMS has an internal access method whereas in a Grid environment these data structures are provided explicitly as a service to the end user. In this section, emphasis is put on how to combine techniques from a Data Grid and a database environment.

The management of replicas requires certain meta data structures that store information about the distribution of and access to data. Whereas an object location table [9] can be the data structure for managing objects, for files we can use a directory service holding a replica catalogue. There are many ways to deal with replica catalogues and many different protocols and implementations are available. We want to limit ourselves here to the approach used by Globus. Since in the Globus Data Grid effort the replica catalogue is implemented by an LDAP directory service, we dedicate this section to analysing how a directory service can be used for file replication where files are managed by a local DBMS.

In principle, a DBMS provides, by definition, a catalogue of the files that are available. As for Objectivity/DB, this is an explicit file catalogue. An LDAP directory service basically does the same for flat files, which may or may not have connections (associations between objects in two or more files) to each other. Now the question arises: what is the use of LDAP when data are stored in a DBMS? In order to answer this question, we identify the key problems that need to be addressed in a
heterogeneous Data Grid environment with many sources of data: managing replicas of files and dealing with heterogeneous data stores.
• Managing replicas: As for Objectivity, a file catalogue exists. The granularity for replication is now assumed to be on the file level. When multiple copies of a file exist, they have to be introduced to a global replica catalogue which is an extended version of the native file catalogue of the DBMS. Furthermore, a name space has to be provided that takes into account multiple replicas. Both features are enhancements to a DBMS that only considers single sources of information. Note that we assume here that the underlying DBMS does not support file replication. Using the LDAP protocol and a database backend, the required replica catalogue information can be stored. Another approach is to simply build the catalogue and the access methods with the native schema of the DBMS and store the replica information in a particular file of the DBMS.
• Heterogeneous data stores: The diversity of the HEP community requires an approach which supports heterogeneous data stores, since different kinds of data may be stored in different formats and also in data stores of different vendors. Even the two different database paradigms, relational and object-oriented, should be supported. In this case, a standard protocol for directory information like LDAP seems to be a good choice since replica location information can be accessed in a uniform way without knowledge of the underlying data store.

Combining a DBMS with a directory service means exposing the replica catalogue to a larger user community, where it is not necessary to have a database specific front end to access the directory information. However, the LDAP protocol still needs a database backend where the file information is stored, and concurrency mechanisms have to be established. The database backend can either be a vendor specific and LDAP specialised database, or any database of choice like Oracle or Objectivity/DB. The usage of a directory service allows an open environment and hence allows the integration of data sources of different kinds. Note that the problem of accessing heterogeneous data sets is not addressed here and may require mediators for data integration. However, introducing LDAP is at least a starting point for extensibility.

Another important point in replica management is the synchronisation of single sites and keeping replica catalogues up to date. Using the LDAP directory service also implies that each site that stores replica catalogue information runs an LDAP server locally. LDAP commands can be used for synchronisation. For a homogeneous database approach, where only a single DBMS is used to store the entire data of the Data Grid, it may not be necessary to have an LDAP server for synchronisation; any communication mechanism for exchanging synchronisation information between sites in the Data Grid is sufficient. However, since each data site using a directory service will have its own LDAP server, this server can also be used as a synchronisation point for distributed sites. Whenever a file is introduced to a site, this information has to be made public within a global name space spanning all the other sites in the Data Grid. The replica synchronisation protocol to use depends on the data consistency requirements imposed by the application.

We conclude that the usage of LDAP in combination with a DBMS seems to be useful in a heterogeneous environment and for the synchronisation of sites holding replica catalogue information.

4 Objectivity/DB
Since Objectivity is the main ODBMS we refer to in this paper, we dedicate this section to explaining more details of this product and problems that we were confronted with when using Objectivity for wide area data replication. Some of the problems mentioned are common to all object-oriented data stores while others are Objectivity specific. However, the usage of an ODBMS allows us some simplifications like a global namespace and global schema information.

Objectivity/DB is a distributed, object-oriented DBMS which has a Data Replication Option called DRO. This option is optimised for synchronous replication over a local area network. In this section, we briefly describe DRO and its drawbacks, and state complications that one comes across when persistent objects have to be replicated.

4.1 Objectivity’s data replication option
DRO provides a synchronous replication model on the file level, i.e. entire Objectivity databases, which are mapped to physical files, can be replicated. Note that in the following section we use the term replica to refer to a physical instance (a copy) of an Objectivity file. There are basically two ways to use DRO: data can be written first, and then synchronised and replicated (populate - replicate). Alternatively, multiple replicas can be created and synchronised first, and the data can then be written into all the replicas at the same time (replicate - populate).
Once a replica is synchronised with another one, all the database transactions on a single replica are synchronised with the other replicas.

Objectivity/DRO performs a dynamic quorum calculation when an application accesses a replica. The quorum which is required to read or write replicas can be changed, which provides some flexibility concerning data consistency. The usage of a quorum method is well established in database research and goes back to an early paper [10].

There is one possibility to overcome the immediate synchronisation: a replica can be set to be off-line. However, data can only be written to the off-line replica if a quorum still exists. There is no clean way in DRO to perform real asynchronous or batch replication where a synchronisation point in time can be specified. An asynchronous or batch replication method allows replicas to be out of sync for a certain amount of time. Reconciliation of updates is only done at certain synchronisation points, e.g. every hour. The lack of such an explicit asynchronous replication method is one of the reasons why DRO is not considered a good option for WAN replication.

Like many commercial databases, Objectivity does not provide any optimisation for accessing replicas. The replica catalogue only contains the number of copies and the location of each single copy of a file. Once an object of a replicated file has to be accessed, Objectivity tries to get the first copy of the file which appears in the catalogue. It neither examines the bandwidth to the site nor takes into account any data server load. Some operations such as database creation require all replicas to be available. When we talk about Objectivity in any other section or subsection of this paper, we mostly ignore the fact that DRO exists and thus see Objectivity only as a distributed DBMS with single copies of a file. However, these single copies of a file can exist at distributed sites in the Data Grid and thus remote access to Objectivity data is possible.

4.2 Partial replication and associations
In this paper, as well as in many Grid environments, replication is regarded to be done on the file rather than on the object level, and hence a file is regarded as the smallest granularity of replication. In Objectivity, single objects are stored in containers. Several containers are stored together in an Objectivity database, which corresponds to a file. Each object can have associations (also called links or pointers) to other objects, which may reside in any database or container. Let us now assume that two objects in two different files are connected via one association. When an Objectivity database (a file) is replicated, this association gets lost if only one of the two files is replicated. Note that there is still the possibility of remotely accessing objects. Hence, the replication decision has to be made carefully and possible associations between objects have to be considered. This may result in replicating a set of files in order to keep all the associations between files. Furthermore, this also imposes severe restrictions on partial replication where only particular objects of a file are replicated. Again, when a certain set of objects is selected which has no associations to objects outside the set, the replication set is association safe. This is a particular restriction of object-oriented data stores and does not only hold for Objectivity.

This is also an important difference and complication compared to a relational DBMS. Whereas in an ODBMS an object can be stored in any file, in a relational database data items are stored in tables that have well defined references to other tables.

4.3 The Objectivity file catalogue
The integration of files into the native Objectivity file catalogue is done with a tool called ooattachdb, which adds a logical file name for the physical location into the catalogue. Furthermore, it guarantees that links to existing Objectivity databases are correct. The schema of the file is not checked. This feature is used when a file created at one site has to be integrated into the file catalogue of another site.

Since the native catalogue only has a one-to-one mapping from one logical to one physical file, replicas are not visible to the local site (not taking into account DRO). Furthermore, it is possible to have single links from one site to another one. For instance, site 1 has a file called X and this file shall be shared (not replicated) between several sites. The file name and the location can be integrated into the local file catalogue and a remote access to the file can be established. Note that this requires an Objectivity Advanced Multi-threaded Server (AMS) running, or the usage of a shared file system like AFS that connects both sites. The AMS is responsible for transferring (streaming) objects from one machine to another and thus establishes a remote data access functionality. We want to address the more general solution where no shared file system is available.

In the HEP community it is generally agreed and proven that DRO is not optimised for use in a WAN. Hence, all our discussions here neglect the DRO of Objectivity/DB and we conclude that the current implementation of DRO is not a feasible solution for the HEP community.
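The step from the native one-to-one file catalogue to a Grid-wide replica catalogue can be sketched with a minimal data structure. The following is an illustrative sketch, not part of Objectivity or Globus: all class, file and site names are hypothetical, and a real catalogue would live in an LDAP directory or a DBMS file as discussed in Section 3.

```python
# Minimal sketch of a global replica catalogue: the native catalogue
# maps one logical file name to exactly one physical file, whereas the
# replica catalogue maps a logical name to several physical copies and
# can select among them by a cost function instead of simply taking
# the first entry. All names below are illustrative.

class ReplicaCatalogue:
    def __init__(self):
        self._replicas = {}  # logical name -> list of physical locations

    def register(self, logical_name, physical_location):
        """Introduce a new replica of a logical file."""
        self._replicas.setdefault(logical_name, []).append(physical_location)

    def locations(self, logical_name):
        """All known physical copies of a logical file."""
        return list(self._replicas.get(logical_name, []))

    def select(self, logical_name, cost):
        """Pick the cheapest replica according to a cost function,
        e.g. one fed by bandwidth or server-load monitoring."""
        return min(self.locations(logical_name), key=cost)

catalogue = ReplicaCatalogue()
catalogue.register("lfn:/hep/run42.db", "site1:/data/run42.db")
catalogue.register("lfn:/hep/run42.db", "site2:/data/run42.db")

# Hypothetical cost function preferring site2 (e.g. better bandwidth).
best = catalogue.select("lfn:/hep/run42.db",
                        cost=lambda loc: 0.2 if loc.startswith("site2") else 1.0)
```

The point of the sketch is the one-to-many mapping and the pluggable replica selection, both of which the native Objectivity catalogue lacks.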
5 Implications for Grid applications
We claim that the usage of files that are replicated in a Data Grid will have implications for Grid applications that are new to the HEP user community. We want to address explicitly some possibilities of how a replicated file instance can be accessed and give some implications of read-ahead assumptions.

5.1 Accessing replicated files
A standard way to access a database in Objectivity is to issue an “open” command on the physical file. This concept is specific to the object-oriented database approach and very much resembles the UNIX way of accessing files. However, if a single object is accessed, the file which it belongs to is transparent to the user. Once an object handle or reference is available, the object ID of this object contains information about the file to which it belongs. Since the native Objectivity “open” works on the Objectivity catalogue rather than on the global replica catalogue, the user does not have access to the whole name space in the Data Grid. The “open” has to be adapted to do the lookup and the data access via the directory service. Plug-ins are required to transfer the requested data to the user application. This is still an open issue and needs further research concerning:
• caching files locally
• adding the file to the local catalogue and creating a replica: this may only be useful when data are accessed frequently
• transferring the whole file versus only subsets of a file

Objectivity provides functionality for transparently accessing data stored on tapes. This is done by using an extension of the AMS backend that checks if a file is locally stored on disk and fetches the file from tape if it is not found on disk. This functionality can be extended to go to a global replica catalogue and fetch files from remote sites. This would provide a transparent way of accessing replicas of files. We will further investigate this option.

10 kB. It is obvious that we want to transfer the requested objects rather than the whole file, and thus avoid transferring unnecessary information over the network. There is a clear need for a mapping instance that maps requested objects to files in order to determine in which files the requested objects reside. Based on the type (class definition in the data definition language of the ODBMS) of the objects, it can be roughly determined how much data needs to be transferred over the network (depending on how much dynamically allocated information is stored in the object). This problem cannot be solved when only the “open” file operator is used. Only at runtime can the application determine which data are needed, and a pre-fetching of requested objects can tackle this problem. We can summarise this as a query optimisation problem where access to the data shall be optimised by providing only the necessary objects that are required. Related work can be found in [11,12].

Pre-fetching can also be addressed at the file level. We assume that a user is aware that files are distributed to different sites in the Grid and that the time to access data very much depends on the location of the requested file. We further assume that an object in a file can only be accessed when the whole file is available locally at the client site. This is assumed for simplicity and we neglect the possibility of accessing some parts of a file remotely. In the worst case, all the requested files are available remotely and need to be transferred to the local site. If the size of a file is large and the number of requested files is high, the I/O part of an application takes a long time before the actual computation on the data can be done. The application can give a hint to the system to tell it how long it would need to serve the required files and when the computation can be started. Thus, data can be pre-fetched before the application is using the data. In the Grid environment, the necessary information for pre-fetching files can be provided by diverse monitoring tools and information services.

There is also a sociological aspect to Grid applications in the sense that a user of a data-intensive application requests its data in advance. One can also reserve the network bandwidth and start the application at a certain point in the future.
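File-level pre-fetching driven by an application hint can be sketched as follows. The function and file names are illustrative assumptions; in a real Data Grid the transfer step would be a high-speed file transfer and the decision would be informed by monitoring services.

```python
# Sketch of file-level pre-fetching: the application declares up front
# which files it will read, and the middleware transfers the missing
# ones before the computation starts. The transfer callable is a
# placeholder for a real high-speed file transfer; all names are
# illustrative.

def prefetch(requested_files, local_files, transfer):
    """Ensure all requested files are locally available.

    requested_files: logical names the application will need
    local_files:     set of names already present at the client site
    transfer:        callable that fetches one remote file
    Returns the list of files that had to be transferred.
    """
    fetched = []
    for name in requested_files:
        if name not in local_files:
            transfer(name)        # bulk transfer over the data channel
            local_files.add(name)
            fetched.append(name)
    return fetched

local = {"run1.db"}
moved = prefetch(["run1.db", "run2.db", "run3.db"], local,
                 transfer=lambda name: None)  # placeholder transfer
```

Only the files missing at the client site are moved, so the hint lets the I/O overlap with, or precede, the actual computation.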
simple copy of an original but still has a logical connection to the original copy, we have to address the data consistency problem. The easiest way to tackle the consistency problem is in the field of read-only data. Since no updates of any of the replicas are done, the data is always consistent. We can state that the consistency can reach its highest degree. Once updates are possible on replicas, the degree of data consistency normally has to be decreased (provided we have read and write access to data) in order to have a reasonably good response time for accessing data replicated over the WAN. Clearly, consistency also depends on the frequency of updates and the amount of data items covered by the update. Thus, we can state that the degree of data consistency depends on the update frequency, the amount of data items covered by the update and the expected response time of the replicated system, i.e. the Data Grid.

Let us now identify in more detail the different consistency options which are possible and which are reasonable for HEP and within the DataGrid project.

In this section we also introduce the term global transaction, which has to be distinguished from a local transaction. A local transaction is done by the DBMS at the local site whereas a global transaction is an inter-site transaction which spans multiple sites and thus multiple database management systems. Furthermore, local consistency is maintained by the DBMS whereas global consistency spans multiple sites in the Data Grid.

6.1 Synchronous replication
The highest degree of consistency can be established by having fully synchronous replicas. This means that each local database transaction needs to get acknowledgements from other replicas (or at least a majority of replicas). In practice, as well as in the database research community, this is achieved by global transactions which span multiple sites. Objectivity/DRO supports such a replication model. Each time a single local write transaction takes place, the 2-phase-commit protocol and normally also the 2-phase-locking protocol are used to guarantee serialisability and global consistency. This comes at the cost of relatively poor performance for global writes compared to local writes with no replication at all. Consequently, one has to decide carefully if the application requires such a high degree of consistency.

We derive from this statement as well as from the current database literature that the type of replication protocol, and hence the data consistency model, has to be well adapted to the application. Data in the DataGrid project will have different types and not all of them require the same consistency level. In this paper, we try to point out briefly where different replication policies are required. Furthermore, we claim that a replication system of a Data Grid should not offer only a single policy but several ones which satisfy the needs of having different data types and degrees of consistency.

For a middle-ware replication system it is rather difficult to provide this high degree of consistency since global, synchronous transactions are difficult to establish. Within a DBMS a global, synchronous transaction can be an extension of a conventional, local transaction, i.e. the DBMS specific locking mechanism is extended to the replicas. Since a distributed DBMS like Objectivity has built-in global transactions, no additional communication mechanism like sockets or a message passing library is required. Hence, the performance of an integrated distributed DBMS is superior to middle-ware replication systems that have to use external communication mechanisms.

Now the following question arises: why not use a distributed DBMS like Objectivity to handle replication? Several points are already covered in Section 4, but there is another major point: Objectivity does not provide flexible consistency levels for different kinds of data. Hence, we aim for a hybrid solution where a local site stores data in a DBMS which also handles consistency locally by managing all database transactions locally. A Grid middle-ware is required to provide communication and co-ordination between the local sites. The degree of independence of a single site needs to be flexibly managed. A form of global transaction system is necessary. Let us illustrate this by an example: some data are allowed to be out of sync (low data consistency) whereas other types of data always need to be synchronised immediately (high consistency). Thus, global transactions have to be flexible and do not necessarily always have to provide the highest degree of consistency. Furthermore, a site may even want to be independent of others once data are available.

6.2 Asynchronous replication
Due to the relatively slow performance of write operations in a synchronously replicated environment, the database research community is searching for efficient protocols for asynchronous replication at the cost of lower consistency. Currently, there is no standard for replication available, but a few commonly agreed solutions:
• Primary-copy approach: This is also known as the “master-slave” approach. The basic idea is
that for each data item or file a primary copy exists and all the other replicas are secondary copies [13]. The updates can only be done by the primary copy, which is the owner of the file. If a write request is sent to a secondary copy, the request is passed on to the primary copy, which does the updates and propagates the changes to all secondary copies. This policy is implemented in the object data stores Versant [14] and ObjectStore [15]; Oracle also provides such a feature. The primary-copy approach provides a high degree of data consistency and has improved write performance compared to synchronous replication because the lock on a file does not have to be agreed among all replicas but only by the primary copy.
• Epidemic approach: User operations are performed on any single replica and a separate activity compares version information (e.g. time-stamps) of different replicas and propagates updates to older replicas in a lazy manner (as opposed to an eager, synchronous approach) [16]. Thus, update operations are always executed locally first and then the sites communicate to exchange up-to-date information. The degree of consistency can be very low here and this solution does not exclude dirty reads or other possible database anomalies. Such a system can only be applied for non time-critical data since the propagation of updates can also result in conflicts which have to be resolved manually.
• Subscription and relatively independent sites: Similar to the epidemic approach, another policy is that a site only wants to have data and does not care about consistency at all. When any other site does updates on a particular file, only certain sites are notified of the updates. This follows a subscription model where a site subscribes explicitly to a data producing site. A site that has not subscribed is itself responsible for getting the latest information from other sites. This allows a site to do local changes without the agreement of other sites. However, such a site has to be aware that the local data is not always up-to-date. A valid solution to this problem is to provide an export

local import buffer and requests the files if necessary. This approach allows more flexibility concerning data consistency and independence for a local site. A site can decide itself which data to import and which information to filter. Furthermore, a data production site may not export all the locally available information and can filter the export buffer accordingly. A valid implementation of this approach can be found in the Grid Data Management Pilot (GDMP) [17].

6.3 Communication and transactions
As outlined above, there is a clear need for global transactions. Such a transaction does not necessarily need to create locks at each site, but at least a notification system is required to automate and trigger the replication and data transfer process. In general, there is a clear separation between exchanging control messages, which organise locks and update notifications, and the actual data transfer. This is an important difference to current database management systems. Replication protocols are often compared by the number of messages sent in order to evaluate their performance. The type of a message is dependent on the DBMS. A message of the same type is then sent to do the actual update using the same communication protocol. In Data Grids, where most of the data are read-only, we can divide the required communication into the following two parts. This concept is also realised in GDMP [17].
• control messages: These are all messages that are used to synchronise distributed sites. For instance, one site notifies another site that new data are available, update information is propagated, etc. To sum up, replication protocols use these control messages.
• data transfer: This includes the actual transfer of data and meta data files from one site to another.

This separation is similar to the FTP protocol, where we also have this clear separation between the two tasks [18]. The main point of this separation is to use the most appropriate protocol for a specific communication need. Simple message passing is appropriate for exchanging control messages whereas a fast file transfer protocol is required to
buffer, where the newest information of a site transfer large amounts of (large) files. In a Grid
is stored, and an import buffer, where a local environment both protocols are provided. To be
site stores all the information that needs to be more specific, in Globus the GlobusIO library can
imported from a remote site. For instance, a be used for control messages and Grid-FTP
local site has finished writing 10 different files implementation [19], which is based on the WU-
and puts the file names into the export buffer. FTP server and the NC-FTP client, serves as the
A remote site can than be notified and transfers data transport mechanism. A single communication
the information from the export buffer to its protocol used in an ODBMS like Objectivity may
8
not be optimal for transferring large files over the has be called before the actual file creation. The
wide area network. transaction checks in each local Objectivity file
6.4 Append Transactions
Based on the HEP requirements, we see a need to extend the traditional DBMS transaction system with a new transaction type, called the append transaction. In a conventional DBMS there exist only two different kinds of transactions: read and write transactions. A write transaction itself can either write new data (append) or update existing data. In terms of data management, these two operations require different tasks. Whereas an update transaction has to prevent concurrent users from reading old data values, an append transaction only has to make sure that the data item to be added satisfies a uniqueness condition, i.e. a condition which guarantees that a data item appears only once and can be identified uniquely. In HEP this condition can be satisfied by the following feature: sites often write data independently and do not change any previously written data items. Since different sites will write different data by definition (thus we can derive an inherent parallelism in the data production phase), this uniqueness condition will hold. This is true for HEP-specific reconstruction software as well as for simulated data created via Monte Carlo processes.
Objectivity provides object IDs (OIDs) and unique database IDs for single files. Hence, it has to be guaranteed that newly created objects do not have the same OID (the same is true for database IDs). Since an append transaction does not have to lock the whole file but only the creation of new objects, it can allow multiple readers at the same time while the append transaction creates new objects. This again allows for different consistency and response-time levels.

6.5 File replication with Objectivity using a replication middle-ware
Driven by real-world replication requirements in High Energy Physics, we want to give a possible approach for the replication of read-only Objectivity database files. In principle, the replication of Objectivity database files does not pose a big problem concerning data consistency. Each site has to have its own Objectivity federation that takes care of managing files and the Objectivity file catalogue. There are several ways to deal with the replication of read-only files; it has to be guaranteed that a unique Objectivity naming scheme is applied by all sites in the Data Grid. We outline two possible approaches.
Each site can create a database file. In order to provide unique database names, a global transaction has to be called before the actual file creation. The transaction checks in each local Objectivity file catalogue whether the file already exists. If not, the site creates the database locally and initiates a database creation at the remote sites as well. All remote replicas of a specific file have to be locked, since only one site can write into a file at a time. Once the local site has completed the writing process, the lock on the remote replicas is released. For such a system we require a replica catalogue at each site in addition to the local federation catalogue. In the replica catalogue a flag is needed for initiating a lock on a file; the actual file locking has to be implemented with a call to the Objectivity database. Furthermore, each transaction on a file in the Objectivity file catalogue has to be done via the replica catalogue. Consequently, a database user is not allowed to use the conventional Objectivity “open” to create a new file but has to contact the replica catalogue in order to write a new database file. All database transactions have to go through a high-level API that always contacts the replica catalogue first.
An easier approach, which is currently used in the CMS experiment, is the allocation of Objectivity database IDs to different remote sites. This guarantees only a unique database-ID allocation but not a unique name for a database. Consequently, in order to have a fully automatic system, a unique name space has to be provided. This can only be guaranteed if, on creation of a database file, the name is communicated to the other sites and agreed on. However, another solution is to provide a naming convention, such as adding the host and domain name of each local site to the database name. This also guarantees unique file names. This populate-replicate approach also does not require the locking of remote replicas, which allows for a faster local write. This is a common approach for asynchronous replication.
7 Possible Update Synchronisation for a Data Grid
In the previous sections we have mainly addressed the replication of read-only files. Although most of the data in High Energy Physics are read-only, there will be some replicated data that change and thus need update synchronisation. In this section we provide some possible update synchronisation strategies which can be implemented using a replication middle-ware for the Data Grid.
We have already described that it is rather difficult for a middle-ware system to do replica update synchronisation on the object rather than on the file level, since a middle-ware system cannot access DBMS internals like pages or object tables. A common solution in database research is to communicate only the changes of a file to remote sites. Since an Objectivity database file can only be interpreted correctly by a native Objectivity process, the file itself appears like any binary file, and a conventional process does not see any structure in it. We call this the binary difference approach. As a second possibility to update data stored in an ODBMS, we use an object-oriented approach, which requires knowledge about the schema and data format of the files to be updated. Both approaches are outlined in this section.

7.1 Binary Difference Approach
One possibility is to use a tool called XDelta [20], which produces the difference between any two binary files. This difference can then be sent to the remote site, which can update the out-of-date file by merging the diff file with the original data file. XDelta is a library interface and application program designed to compute changes between files. These changes (deltas) are similar to the output of the "diff" program in that they may be used to store and transmit only the changes between files. However, unlike diff, the output of XDelta is not expressed in a human-readable format. XDelta can also apply these deltas to a copy of the original file(s).

7.2 Object-oriented approach
Another approach is to create objects that are aware of replicas. In principle, an object can be created at any site, and the creation method of this object has to take care of, or delegate, the distribution of this object. The class definition has to be designed in such a way that there is some information on the number and the sites of replicas. For instance, an object should be created at site 1 and replicated to the sites X and Y. A typical creation method can look as follows:

object.create (site1, siteX, siteY);

The advantage of this approach is that all the necessary information is available to create the object at any site. When an update on one of the objects is done, the update function has to be aware of all the replicas and the sites that need to be updated. This can be compared to the stored procedure approach known in the relational database world. In principle, the model presented here follows similar ideas. A local site may update immediately and can store the updates in a log file. Based on the consistency requirements of the remote sites, the log information is sent to the remote sites, which apply the same update function as the original local site.
The update synchronisation problem is then passed to a ReplicatorObject that is aware of the replicas and the replication policy. The ReplicatorObject in turn can provide different consistency levels, such as updating remote sites immediately, every hour, every day, etc. When large amounts of data are involved, there will most likely be a scalability problem in managing all the logging information. However, since each object is identified by a single OID, only the parameters of an update method together with the OID have to be stored.

object.update_parameter_x (200);
// OID = 38-23-222-442

The log file stores the triple (x/38-23-222-442/200), where the first argument is the parameter of the object, the second the OID and the third the new value of the parameter.
The modus operandi for communicating the changes is as follows. A local site gains an exclusive, global lock on the file and updates the required objects. In parallel, the log file is written. The file itself is transferred to the remote sites with an efficient file transfer protocol, whereas the remote sites are notified via control messages.
Since such a replication policy is rather cost-intensive in terms of exchanging control messages and sending data, it should only be applied to a relatively small amount of data. In the HEP environment there exist many meta data sources, like replica catalogues, indices, etc., which require high data consistency. For such data this approach is useful.

8 Conclusion
The data management efforts of the two research communities, distributed databases and Grid, both deal with the problem of data replication, where the Grid community specifically deals with large amounts of data in wide area networks. In the HEP community, data are often stored in database management systems, and it is appropriate to try to understand the research issues of both communities, analyse differences and commonalities, and combine common ideas to form an efficient Data Grid. We have presented research issues and possible solutions. Thus, we provide a first basis for the effort of combining both research communities.
Acknowledgement
We want to thank colleagues from the following
groups for fruitful discussions: DataGrid work
package “Data Management” (including CERN,
Caltech, LBL and INFN), CMS Computing group
(CERN, Princeton and Caltech), Globus project in
Argonne and ISI, BaBar experiment at SLAC and
colleagues taking part in the Data Grid discussions
in the Grid Forum 5 in Boston.
Contact Sheet
Heinz Stockinger
CERN
CMS Experiment, Computing Group
Bat. 40-3B-15
CH-1211 Geneva 23
Switzerland
Heinz.Stockinger@cern.ch
tel +41 22 767 16 08
fax +41 22 767 89 40