Habilitation

Cumulative Habilitation Thesis
Information Systems

at the Faculty of Informatics
of the Technische Universität Wien

submitted by
Axel Polleres
XSPARQL: Traveling between the XML and RDF worlds – and avoiding the
XSLT pilgrimage 161
Waseem Akhtar, Jacek Kopecký, Thomas Krennwallner, and Axel Polleres
Published in Proceedings of the 5th European Semantic Web Conference (ESWC2008), pp.
432–447, June 2008, Springer LNCS vol. 5021, extended version published as tech.
report, cf. http://www.deri.ie/fileadmin/documents/TRs/DERI-TR-2007-12-14.pdf
and as W3C member submission, cf. http://www.w3.org/Submission/2009/01/
Dedicated to Inga & Aivi
Preface
“The truth is rarely pure and never simple.” (Oscar Wilde) . . . particularly on the Web.
The Semantic Web is about to grow up. Over the last few years, technologies and standards to
build up the architecture of this next generation of the Web have matured and are being deployed
at large scale on many live Web sites. The underlying technology stack of the Semantic Web
consists of several standards endorsed by the World Wide Web Consortium (W3C) that provide
the formal underpinnings of a machine-readable “Web of Data” [94]:
• A Uniform Exchange Syntax: the eXtensible Markup Language (XML)
• A Uniform Data Exchange Format: the Resource Description Framework (RDF)
• Ontologies: RDF Schema and the Web Ontology Language (OWL)
• Rules: the Rule Interchange Format (RIF)
• Query and Transformation Languages: XQuery, SPARQL
such as RDFa [1], a format that allows embedding RDF within (X)HTML, or non-XML repre-
sentations such as the more readable Turtle [12] syntax; likewise, RDF stores (e.g. YARS2 [54])
normally use their own proprietary internal representations of triples, which do not relate to XML
at all.
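For illustration (a toy example of ours, not taken from the cited works), a single RDF statement in the Turtle syntax reads:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://example.org/axel> foaf:name "Axel Polleres" .

which is considerably terser than the equivalent RDF/XML [13].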
which plays the same role for the Semantic Web as SQL does for relational data. SPARQL’s
syntax is roughly inspired by Turtle [12] and SQL [109], providing basic means to query RDF
such as unions of conjunctive queries, value filtering, optional query parts, as well as slicing and
sorting results. The recently re-chartered SPARQL 1.1 W3C working group aims to extend
the original SPARQL language with commonly requested features such as aggregates, sub-queries,
negation, and path expressions.
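The following toy query (ours) exercises several of these basic constructs – a graph pattern with an optional part, value filtering, and sorting/slicing of results:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?mbox
    WHERE {
      ?p foaf:name ?name .
      OPTIONAL { ?p foaf:mbox ?mbox }
      FILTER (?name != "")
    }
    ORDER BY ?name
    LIMIT 10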
The work in the respective standardisation groups is partly still ongoing or has finished only
very recently. In parallel, there has been plenty of work in the scientific community to define the
formal underpinnings for these standards:
• The logical foundations and properties of RDF and RDF Schema have been investi-
gated in detail [83, 52, 89]. Correspondences of the formal semantics of RDF and RDF
Schema [55] with Datalog and first-order logic have been studied in the literature [21,
22, 66].
• The semantics of standard fragments of OWL have been defined in terms of expressive
Description Logics such as SHOIN (D) (OWL DL) [61] or SROIQ(D) (OWL2DL) [60],
and the research on OWL has significantly influenced the Description Logics community
over the past years: for example, in defining tractable fragments like the EL [8, 9] family
of Description Logics, or fragments that allow reducing basic reasoning tasks to query
answering in SQL, such as the DL-Lite family of Description Logics [26]. Other frag-
ments of OWL and OWL2 have been defined in terms of Horn rules such as DLP [51],
OWL− [34], pD* [110], or Horn-SHIQ [72]. In fact, the new OWL2 specification defines
tractable fragments of OWL based on these results: namely, OWL2EL, OWL2QL, and
OWL2RL [79].
• The semantics of RIF builds on foundations such as Frame Logic [70] and Datalog. RIF
borrows, e.g., notions of Datalog safety from the scientific literature to define fragments
with finite minimal models despite the presence of built-ins: the strongly-safe fragment
of RIF Core [17, Section 6.2] is inspired by a similar safety condition defined by Eiter,
Schindlauer, et al. [39, 103]. In fact, the closely related area of decidable subsets of
Datalog and answer set programs with function symbols is a very active field of re-
search [10, 42, 25].
• The formal semantics of SPARQL is also very much inspired by academic results, such as
by the seminal papers of Pérez et al. [85, 86]. Their work further led to refined results on
equivalences within SPARQL [104] and on the relation of SPARQL to Datalog [91, 90].
Angles and Gutierrez [7] later showed that SPARQL has exactly the expressive power of
non-recursive safe Datalog with negation.
Likewise, the scientific community has identified and addressed gaps between the Semantic
Web standards and the formal paradigms they are based on, which we turn to next.
“in the wild”, i.e., to be applied on real Web data. Particularly, the following significant gaps
have been identified in various works over the past years by the author and other researchers:
Gap 1: XML vs. RDF The jump from XML, which is a mere syntax format, to RDF, which
is more declarative in nature, is not trivial but needs to be addressed by appropriate – yet
missing – transformation languages for exchanging information between RDF-based and
XML-based applications.
Gap 2: RDF vs. OWL The clean conceptual model of Description Logics underlying the OWL
semantics is not necessarily applicable directly to all RDF data, particularly to messy,
potentially inconsistent data as found on the Web.
Gap 3: RDF/OWL vs. Rules/RIF There are several theoretical and practical concerns in com-
bining ontologies and rules, such as decidability issues or how to merge classical open
world reasoning with non-monotonic closed world inference. The current RIF specifica-
tion leaves many of these questions open, subject to ongoing research.
Gap 4: SPARQL vs. RDF Schema/RIF/OWL Query answering over ontologies and rules,
with subtopics such as the semantics of SPARQL queries over RDF Schema and OWL ontolo-
gies, or querying over combinations of ontologies with RIF rulesets, is still neglected by
the current standards.
In the following, we will discuss these gaps in more depth, point out how they have been ad-
dressed in scientific works so far, and particularly how the work of the author has contributed.
semantic specifications for OWL: OWL2’s RDF-based semantics [105], which directly builds
upon RDF’s model-theoretic semantics [55], and OWL2’s direct semantics [80], which builds
upon the Description Logic SROIQ but is not defined for all RDF graphs. Both of them ad-
dress different use cases; however, analyses of Web data have shown [11, 58] that pure
OWL(2) under its Description Logics based semantics is not practically applicable: (i) published
Web data contain a lot of non-DL ontologies [11], for which only the RDF-based semantics
remains applicable; (ii) data and ontologies found on the Web, spread across different sources,
contain many inconsistencies, which – if one still aims to make sense of this data – prohibit com-
plete reasoning using Description Logics [58]; (iii) finally, current DL reasoners cannot deal with
the amounts of instance data found on the Web, which are in the order of billions of statements.
The approach included in the selected papers for the present thesis, SAOR (Scalable Authorita-
tive OWL Reasoner) [59], aims at addressing these problems. SAOR provides incomplete, but
arguably meaningful inferences over huge data sets crawled from the Web, based on rule-based
OWL reasoning inspired by earlier approaches such as pD*[110], with further cautious modifica-
tions. Hogan and Decker [57] later compared this approach to the new standard rule-based
OWL2RL [79] profile, coming to the conclusion that OWL2RL, as a maximal fragment of OWL2
that can be formalised purely with Horn rules, runs into similar problems as Description Logics
reasoning when taken as a basis for reasoning over Web data without the further modifications
proposed in SAOR. An orthogonal approach to reason with real Web data [36] – also proposed
by the author of this work together with Delbru, Tummarello and Decker – is likewise based on
pD*, but applies inference in a modular fashion per dataset rather than over entire Web crawls.
idea of such decidable combinations to rules with non-monotonic negation [98, 99, 101, 81, 77].
Another decidable approach was to define the semantic interplay between ontologies and rules
via a narrow, query-like interface within rule bodies [40]. Aside from considerations about
decidability, there have been several proposals for what would be the right logical framework
to embed combinations of classical logical theories (which DL ontologies fall into) and non-
monotonic rule languages. These include approaches based on MKNF [81], FO-AEL [32], or
Quantified Equilibrium Logics (QEL) [33], the latter of which is included in the collection of
papers selected for the present thesis. For an overview of issues concerned with combining
ontologies and rules, we also refer to surveys of existing approaches in [38, 37, 100], some of
which the author contributed to.
As a side note, it should be mentioned that rule-based/resolution-based reasoning has been
very successfully applied in implementing Description Logics or OWL reasoners in approaches
such as KAON2 [63] and DLog [76], which significantly outperform tableaux-based DL reason-
ers on certain problems (particularly instance reasoning).
entailment regime for SPARQL [49], which, although worth mentioning, will not necessarily
encompass full conjunctive queries with non-distinguished variables.
Selected Papers
The present habilitation thesis comprises a collection of articles reflecting the author’s contribu-
tion to addressing a number of relevant research problems to close the above-mentioned gaps in
the Semantic Web architecture.
The first paper, “A Semantical Framework for Hybrid Knowledge Bases” [33], co-authored
with Jos de Bruijn, David Pearce and Agustín Valverde, contributes to the fundamental discus-
sion of a logical framework for combining classical theories (such as DL ontologies) with logic
programs involving non-monotonic negation, in this case under the (open) answer set semantics.
Based on initial discussions between the author and David Pearce, the founder of Equilibrium
Logics (a non-classical logic which can be viewed as the base logic of answer set programming),
we came to the conclusion that Quantified Equilibrium Logics (QEL), a first-order variant of EL,
is a promising candidate for the unifying logical framework in question. In the framework of QEL,
one can either enforce or relax the unique names assumption, or – by adding axiomatisations
that enforce the law of the excluded middle for certain predicates – make particular predicates
behave in a “classical” manner whereas others are treated non-monotonically in the spirit of
(open) answer set programming. We showed that the defined embedding of hybrid knowledge
bases in the logical framework of QEL encompasses previously defined operational semantics by
Rosati [98, 99, 101]. This correspondence naturally provides decidable fragments of QEL. At the
same time, concepts such as strong equivalence, which are well-investigated in the framework of
answer set programming, carry over to hybrid knowledge bases embedded in the framework of
QEL. This work particularly addresses theoretical aspects of Gap 3: RDF/OWL vs. Rules/RIF.
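To illustrate the axiomatisation mentioned above (notation ours, for a unary predicate): a predicate p can be made to behave classically in QEL by adding the excluded-middle axiom

    ∀x (p(x) ∨ ¬p(x))

so that p receives a two-valued, classical reading, while predicates without this axiom retain the nonmonotonic answer-set reading.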
Another line of research addressing Gap 4: SPARQL vs. RDF Schema/RIF/OWL is presented
in the following three works.
The second paper, “From SPARQL to Rules (and back)” [91] clarifies the relationship of
SPARQL to Datalog. Besides providing a translation from SPARQL to non-recursive Datalog
with negation, several alternative join semantics for SPARQL are discussed, which – at the time
of publication of this paper and prior to SPARQL becoming a W3C recommendation – were
not yet entirely fixed. Additionally, the paper sketches several useful extensions of SPARQL by
adding rules or defining some extra operators such as MINUS.
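To give the flavour of this translation (an illustrative sketch only; the actual encoding in [91] is more involved, using dedicated answer predicates and a graph argument), a UNION pattern such as

    SELECT ?N WHERE { { ?X foaf:name ?N } UNION { ?X foaf:nick ?N } }

maps to two rules over a triple predicate holding the queried RDF graph:

    answer(N) ← triple(X, foaf:name, N)
    answer(N) ← triple(X, foaf:nick, N)

OPTIONAL patterns, in turn, require non-monotonic negation, which is why non-recursive Datalog with negation is the natural target language.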
The third paper, “SPARQL++ for Mapping between RDF Vocabularies” [96], co-authored
with Roman Schindlauer and François Scharffe, continues this line of research. Based on the
idea of reductions to answer set programming as a superset of Datalog, we elaborate on several
SPARQL extensions such as aggregate functions, value construction, or what we call Extended
Graphs: i.e., RDF graphs that include implicit knowledge in the form of SPARQL queries which
are interpreted as “views”. We demonstrate that the proposed extensions can be used to model
ontology mappings not expressible in OWL. It is worth mentioning that at least the first two
new features (aggregates and value construction) – which were not present in SPARQL’s original
specification but easy to add in our answer set programming based framework – are very likely
to be added in a similar form to SPARQL 1.1 (cf. http://www.w3.org/2009/05/sparql-phase-II-charter.html, where value construction is subsumed under the term “project expressions”).
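For instance, a query with aggregation of the kind we proposed can now be written in SPARQL 1.1 roughly as follows (a toy example of ours, not taken from the paper):

    PREFIX ex: <http://example.org/>
    SELECT ?group (COUNT(?member) AS ?cnt)
    WHERE { ?member ex:memberOf ?group }
    GROUP BY ?group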
The fourth paper, “Dynamic Querying of Mass-Storage RDF Data with Rule-Based Entail-
ment Regimes”[64], co-authored with Giovambattista Ianni, Thomas Krennwallner, and Alessan-
dra Martello, expands on the results of the second paper in a different direction, towards provid-
ing an efficient implementation of the approach. In particular, we deploy a combination of the
DLV-DB system [111] and DLVHEX [39], and exploit magic sets optimisations inherent in DLV
to improve on the basic translation from [91] on RDF data stored in a persistent repository.
Moreover, the paper defines a generic extension of SPARQL by rule-based entailment regimes,
which the implemented system can load dynamically for each SPARQL query: the system
allows users to query data dynamically with different ontologies under different (rule-based) en-
tailment regimes. Most existing RDF stores, to the best of our knowledge, only provide fixed
pre-materialised inference, whereas we could show in this paper that – by careful design –
dynamic inferencing can still be relatively efficient.
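For illustration (a sketch of the kind of query such a system supports; the USING RULESET directive follows our reading of the paper, and the ruleset name and data URL are illustrative), a query in the extended language may look like:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?who
    FROM <http://example.org/data.rdf>
    USING RULESET rdfs
    WHERE { ?who a foaf:Agent }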
The fifth paper, “Scalable Authoritative OWL Reasoning for the Web” [59] co-authored with
Aidan Hogan and Andreas Harth, addresses Gap 2: RDF vs. OWL by defining practically viable
OWL inference on Web data: similar to the previous paper, this work also goes in the direction
of implementing efficient inference support, this time though following a pre-materialisation
approach by forward-chaining. The reason for this approach is a different use case than before:
the goal here is to provide indexed pre-computed inferences on an extremely large dataset in the
order of billions of RDF triples to be used in search results for the Semantic Web Search Engine
(SWSE) project. The paper defines a special RDF rule set that is inspired by ter Horst’s pD*,
but tailored for reasoning via sorting and file scans, i.e., (i) by extracting the ontological part
(T-Box) of the dataset which is relatively small and can be kept in memory, and (ii) avoiding
expensive joins on the instance data (A-Box) where possible. We conclude that all of the RDFS
rules and most of the OWL rules fall under a category of rules that does not require A-Box joins.
As a second optimisation, the application of rules is triggered only if the rule is authoritatively
applicable, avoiding a phenomenon which we call “ontology hijacking”: i.e., uncontrolled re-
definition of ontologies by third parties that can lead to potentially harmful inferences. In order
to achieve this, we introduce the notion of a so-called T-Box split-rule – a rule which has a body
divided into an A-Box and T-Box part – along with an intuitive definition of authoritative rule
application for split rules. Similar approaches to do scalable reasoning on large sets of RDF data
have been independently presented since, demonstrating that our core approach can be naturally
applied in a distributed fashion [115, 113]. None of these other approaches goes beyond RDFS
reasoning, and neither applies the concept of authoritativeness, which proves very helpful on the
Web to filter out bogus inferences from noisy Web data; in fact, both approaches [115, 113]
have only been evaluated on synthetic data.
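To illustrate the notion of a split rule (an illustrative sketch in the spirit of the paper, using the well-known RDFS subclass rule; the exact rule set is defined in [59]): writing triples as atoms, subclass inheritance becomes

    (?x, rdf:type, ?c2) ← (?c1, rdfs:subClassOf, ?c2), (?x, rdf:type, ?c1)

where the first body atom is the T-Box part (small, kept in memory) and the second is the A-Box part (scanned from disk); authoritative application additionally requires that the source providing the T-Box triple be authoritative for the term ?c1 it refines, so that a third-party document cannot, e.g., redeclare superclasses of foaf:Person and thereby “hijack” inferences over FOAF data.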
Finally, the sixth paper “XSPARQL: Traveling between the XML and RDF worlds – and
avoiding the XSLT pilgrimage”[3], co-authored by Waseem Akhtar, Jacek Kopecký and Thomas
Krennwallner, intends to close Gap 1: XML vs. RDF by defining a novel query language which
merges XQuery and SPARQL. We demonstrate that XSPARQL provides concise and intu-
itive solutions for mapping between XML and RDF in either direction. The paper also describes
an initial implementation of an XSPARQL engine, available for user evaluation (cf.
http://xsparql.deri.org/). This paper was among the nominees for the best paper
award at the 5th European Semantic Web Conference (ESWC2008). The approach has received
considerable attention and was later extended to a W3C member submission under the direction
of the thesis’ author [95, 71, 75, 84].
The version of the paper included here also contains the formal semantics of XSPARQL, which
was not originally published in [3], but as part of the W3C member submission [71].
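As a flavour of the language (an illustrative sketch with simplified syntax; the exact grammar is given in the W3C member submission [95], and the file and element names here are hypothetical), a “lifting” query turning XML person records into RDF might look as follows:

    prefix foaf: <http://xmlns.com/foaf/0.1/>
    for $p in doc("persons.xml")//person
    construct { [] foaf:name {data($p/name)} }

Here the XQuery-style for clause iterates over XML nodes, while the SPARQL-style construct clause produces RDF triples.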
Additionally, the author has conducted work on foundations as well as practical applications
of Semantic Web technologies that is not contained in this collection, some of which is worth
mentioning in order to emphasise the breadth of the author’s work in this challenging
application field for information systems.
In [93], different semantics for rules with “scoped negation” were proposed, based on re-
ductions to the stable model semantics and the well-founded semantics. The author conducted
the major part of this work, based on discussions with two students, Andreas Harth and Cristina
Feier.
In [87], we have presented a conceptual model and implementation for a simple workflow
language on top of SPARQL, along with a visual editor, which allows users to compose reusable
data mashups for RDF data published on the Web. This work was a collaboration with several
researchers at the National University of Ireland, Galway (Danh Le Phuoc, Manfred Hauswirth,
Giovanni Tummarello) where the author’s contribution was in defining the underlying conceptual
model and base operators of the presented workflow language.
In [30], we have developed extensions to the widely used open-source content management
system Drupal (http://drupal.org/), in order to promote the uptake of Semantic Web
technologies. This work won the best paper award in the in-use track of last year’s International
Semantic Web Conference (ISWC2009). Furthermore, the effort has eventually led to the
integration of RDF technologies into Drupal 7 Core, potentially affecting over 200,000 Web sites
currently using Drupal 6. The work was conducted in collaboration with Stéphane Corlosquet,
a master’s student under the author’s supervision, and based on input from and discussions with
Stefan Decker, Renaud Delbru and Tim Clark, where the author provided the core mapping of
Drupal’s content model to OWL, co-developed a model to import data from external SPARQL
endpoints and provided
overall direction of the project.
In [53], the author, together with colleagues from the National University of Ireland, Galway
(Jürgen Umbrich, Marcel Karnstedt), the University of Ilmenau (Kai-Uwe Sattler), the University
of Karlsruhe (Andreas Harth), and the Max-Planck Institute in Saarbrücken (Katja Hose), has
presented a novel approach to perform live queries on RDF data published across multiple sources
on the Web. In order to define a reasonable middle ground between storing crawled data
in a large centralised index – as most current search engines do – and directly looking up
known sources on demand, we propose to use lightweight data summaries based on QTrees for
source selection, with promising initial results. The author’s contribution to this work was in
providing conceptual guidance, relating the characteristics of linked RDF data to the common
database scenarios in which QTrees had been applied earlier; the ideas of the paper were then
developed jointly in a very interactive fashion among all contributors.
Acknowledgements
I would like to express my deepest gratitude to all people who helped me get this far, beginning
with the co-supervisors of my diploma and doctoral theses, Nicola Leone, Wolfgang Faber, and
finally Thomas Eiter, who also encouraged and mentored me in the process of submitting the
present habilitation thesis.
I further have to thank Dieter Fensel for initially bringing me in touch with the exciting world
of the Semantic Web, and all my work colleagues over the past years – many of whom became
close friends – from Vienna University of Technology, University of Innsbruck, and Universidad
Rey Juan Carlos, who made research over all these years so pleasant and interesting in hours
of discussions about research topics and beyond. Of course, I also want to especially thank my
present colleagues in the Digital Enterprise Research Institute (DERI) at the National University
of Ireland, Galway, foremost Stefan Decker and Manfred Hauswirth, whose enthusiasm and
vision have been most inspiring over the three years since I joined DERI.
I want to thank all my co-authors, especially those involved in the articles which were se-
lected for the present thesis, namely, Jos, Andreas, Thomas, GB, Alessandra, Aidan, David,
Agustín, Roman, François, Waseem, and Jacek, as well as my students Nuno, Jürgen, Stéphane,
Lin, Philipp, and finally Antoine Zimmermann, who works with me as a postdoctoral researcher.
Last, but not least, I want to thank my parents Mechthild and Herbert and my sister Julia for
all their love and support over the years, and finally my wife Inga and my little daughter Aivi for
making every day worth it all.
The work presented in this thesis has been supported in part by (i) Science Foundation
Ireland – under the Líon (SFI/02/CE1/I131) and Líon-2 (SFI/08/CE/I1380) projects, (ii) by the
Spanish Ministry of Education – under the projects TIC-2003-9001 and URJC-CM-2006-
CET-0300, as well as a “Juan de la Cierva” postdoctoral fellowship, and (iii) by the EU under
the FP6 projects Knowledge Web (IST-2004-507482) and inContext (IST-034718).
References
[1] Ben Adida, Mark Birbeck, Shane McCarron, and Steven Pemberton. RDFa in XHTML:
Syntax and Processing. W3C recommendation, W3C, October 2008. Available at http:
//www.w3.org/TR/rdfa-syntax/.
[2] Waseem Akhtar, Jacek Kopecký, Thomas Krennwallner, and Axel Polleres. XSPARQL:
Traveling between the XML and RDF worlds – and avoiding the XSLT pilgrimage. Tech-
nical Report DERI-TR-2007-12-14, DERI Galway, 2007. Available at http://www.
deri.ie/fileadmin/documents/TRs/DERI-TR-2007-12-14.pdf.
[3] Waseem Akhtar, Jacek Kopecky, Thomas Krennwallner, and Axel Polleres. XSPARQL:
Traveling between the XML and RDF worlds – and avoiding the XSLT pilgrimage. In
Proceedings of the 5th European Semantic Web Conference (ESWC2008), pages 432–447,
Tenerife, Spain, June 2008. Springer.
[4] Anastasia Analyti, Grigoris Antoniou, and Carlos Viegas Damásio. A principled frame-
work for modular web rule bases and its semantics. In Proceedings of the 11th Interna-
tional Conference on Principles of Knowledge Representation and Reasoning (KR’08),
pages 390–400, 2008.
[5] Anastasia Analyti, Grigoris Antoniou, Carlos Viegas Damasio, and Gerd Wagner. Ex-
tended RDF as a semantic foundation of rule markup languages. Journal of Artificial
Intelligence Research, 32:37–94, 2008.
[6] Jürgen Angele, Harold Boley, Jos de Bruijn, Dieter Fensel, Pascal Hitzler, Michael Kifer,
Reto Krummenacher, Holger Lausen, Axel Polleres, and Rudi Studer. Web Rule Lan-
guage (WRL), September 2005. W3C member submission.
[7] Renzo Angles and Claudio Gutierrez. The expressive power of SPARQL. In International
Semantic Web Conference (ISWC 2008), volume 5318 of Lecture Notes in Computer Sci-
ence, pages 114–129, Karlsruhe, Germany, 2008. Springer.
[8] Franz Baader. Terminological cycles in a description logic with existential restrictions.
In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence
(IJCAI2003), pages 325–330, Acapulco, Mexico, August 2003.
[9] Franz Baader, Sebastian Brandt, and Carsten Lutz. Pushing the EL envelope. In Pro-
ceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJ-
CAI2005), pages 364–369, Edinburgh, Scotland, UK, July 2005. Professional Book Cen-
ter.
[10] Sabrina Baselice, Piero A. Bonatti, and Giovanni Criscuolo. On finitely recursive pro-
grams. TPLP, 9(2):213–238, 2009.
[11] Sean Bechhofer and Raphael Volz. Patching syntax in OWL ontologies. In International
Semantic Web Conference (ISWC 2004), pages 668–682, Hiroshima, Japan, November
2004.
[12] Dave Beckett and Tim Berners-Lee. Turtle – Terse RDF Triple Language. W3C
team submission, W3C, January 2008. Available at http://www.w3.org/
TeamSubmission/turtle/.
[13] Dave Beckett and Brian McBride. RDF/XML Syntax Specification (Revised). W3C
recommendation, W3C, February 2004. Available at http://www.w3.org/TR/
REC-rdf-syntax/.
[14] Tim Berners-Lee and Dan Connolly. Notation3 (N3): A readable RDF syntax.
W3C team submission, W3C, January 2008. Available at http://www.w3.org/
TeamSubmission/n3/.
[15] Tim Berners-Lee, Dan Connolly, Lalana Kagal, Yosi Scharf, and Jim Hendler. N3logic: A
logical framework for the world wide web. Theory and Practice of Logic Programming,
8(3):249–269, 2008.
[16] Diego Berrueta, Jose E. Labra, and Ivan Herman. XSLT+SPARQL : Scripting the Se-
mantic Web with SPARQL embedded into XSLT stylesheets. In Chris Bizer, Sören Auer,
Gunnar Aastrand Grimmes, and Tom Heath, editors, 4th Workshop on Scripting for the
Semantic Web, Tenerife, June 2008.
[17] Harold Boley, Gary Hallmark, Michael Kifer, Adrian Paschke, Axel Polleres, and Dave
Reynolds. RIF Core Dialect. W3C proposed recommendation, W3C, May 2010. Avail-
able at http://www.w3.org/TR/2010/PR-rif-core-20100511/.
[18] Harold Boley and Michael Kifer. RIF Basic Logic Dialect. W3C proposed rec-
ommendation, W3C, May 2010. Available at http://www.w3.org/TR/2010/
PR-rif-bld-20100511/.
[19] Tim Bray, Jean Paoli, and C.M. Sperberg-McQueen. Extensible Markup Language
(XML) 1.0. W3C Recommendation, W3C, February 1998. Available at http://www.
w3.org/TR/1998/REC-xml-19980210.
[20] Dan Brickley, R. Guha, and Brian McBride (eds.). RDF Vocabulary Description Language
1.0: RDF Schema. Technical report, W3C, February 2004. W3C Recommendation.
[21] Jos de Bruijn, Enrico Franconi, and Sergio Tessaris. Logical reconstruction of normative
RDF. In OWL: Experiences and Directions Workshop (OWLED-2005), Galway, Ireland,
November 2005.
[22] Jos de Bruijn and Stijn Heymans. Logical foundations of (e)RDF(S): Complexity and rea-
soning. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian
Semantic Web Conference (ISWC2007+ASWC2007), number 4825 in Lecture Notes in
Computer Science, pages 86–99, Busan, Korea, November 2007. Springer.
[23] François Bry, Tim Furche, Clemens Ley, Benedikt Linse, and Bruno Marnette. RDFLog:
It’s like datalog for RDF. In Proceedings of 22nd Workshop on (Constraint) Logic Pro-
gramming, Dresden (30th September–1st October 2008), 2008.
[24] Andrea Calì, Georg Gottlob, and Thomas Lukasiewicz. Tractable query answering over
ontologies with Datalog±. In Proceedings of the 22nd International Workshop on De-
scription Logics (DL 2009), Oxford, UK, July 2009.
[25] Francesco Calimeri, Susanna Cozza, Giovambattista Ianni, and Nicola Leone. Magic sets
for the bottom-up evaluation of finitely recursive programs. In Esra Erdem, Fangzhen Lin,
and Torsten Schaub, editors, Logic Programming and Nonmonotonic Reasoning, 10th
International Conference (LPNMR 2009), volume 5753 of Lecture Notes in Computer
Science, pages 71–86, Potsdam, Germany, September 2009. Springer.
[26] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and
Riccardo Rosati. Tractable reasoning and efficient query answering in description logics:
The DL-Lite family. Journal of Automated Reasoning, 39(3):385–429, 2007.
[27] Don Chamberlin, Jonathan Robie, Scott Boag, Mary F. Fernández, Jérôme Siméon, and
Daniela Florescu. XQuery 1.0: An XML Query Language. W3C recommendation,
W3C, January 2007. W3C Recommendation, available at http://www.w3.org/TR/
xquery/.
[28] Kendall Grant Clark, Lee Feigenbaum, and Elias Torres. SPARQL Protocol for RDF.
W3C recommendation, W3C, January 2008. Available at http://www.w3.org/TR/
rdf-sparql-protocol/.
[29] Dan Connolly. Gleaning Resource Descriptions from Dialects of Languages (GRDDL).
W3C Recommendation, W3C, September 2007. Available at http://www.w3.org/
TR/grddl/.
[30] Stéphane Corlosquet, Renaud Delbru, Tim Clark, Axel Polleres, and Stefan Decker. Pro-
duce and consume linked data with drupal! In Abraham Bernstein, David R. Karger, Tom
Heath, Lee Feigenbaum, Diana Maynard, Enrico Motta, and Krishnaprasad Thirunarayan,
editors, Proceedings of the 8th International Semantic Web Conference (ISWC 2009), vol-
ume 5823 of Lecture Notes in Computer Science, pages 763–778, Washington DC, USA,
October 2009. Springer.
[31] Jos de Bruijn. RIF RDF and OWL Compatibility. W3C proposed recom-
mendation, W3C, May 2010. Available at http://www.w3.org/TR/2010/
PR-rif-rdf-owl-20100511/.
[32] Jos de Bruijn, Thomas Eiter, Axel Polleres, and Hans Tompits. Embedding non-ground
logic programs into autoepistemic logic for knowledge-base combination. In Twentieth
International Joint Conference on Artificial Intelligence (IJCAI’07), pages 304–309, Hy-
derabad, India, January 2007. AAAI.
[33] Jos de Bruijn, David Pearce, Axel Polleres, and Agustín Valverde. A semantical frame-
work for hybrid knowledge bases. Knowledge and Information Systems, Special Issue:
RR 2007, 2010. Accepted for publication.
[34] Jos de Bruijn, Axel Polleres, Rubén Lara, and Dieter Fensel. OWL− . Final draft
d20.1v0.2, WSML, 2005.
[35] Christian de Sainte Marie, Gary Hallmark, and Adrian Paschke. RIF Production Rule
Dialect. W3C proposed recommendation, W3C, May 2010. Available at http://www.
w3.org/TR/2010/PR-rif-prd-20100511/.
[36] Renaud Delbru, Axel Polleres, Giovanni Tummarello, and Stefan Decker. Context depen-
dent reasoning for semantic documents in sindice. In Proceedings of the 4th International
Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2008), Karlsruhe,
Germany, October 2008.
[37] Thomas Eiter, Giovambattista Ianni, Thomas Krennwallner, and Axel Polleres. Rules and
ontologies for the semantic web. In Cristina Baroglio, Piero A. Bonatti, Jan Maluszynski,
Massimo Marchiori, Axel Polleres, and Sebastian Schaffert, editors, Reasoning Web 2008,
volume 5224 of Lecture Notes in Computer Science, pages 1–53. Springer, San Servolo
Island, Venice, Italy, September 2008.
[38] Thomas Eiter, Giovambattista Ianni, Axel Polleres, Roman Schindlauer, and Hans Tom-
pits. Reasoning with rules and ontologies. In P. Barahona et al., editor, Reasoning
Web 2006, volume 4126 of Lecture Notes in Computer Science, pages 93–127. Springer,
September 2006.
[39] Thomas Eiter, Giovambattista Ianni, Roman Schindlauer, and Hans Tompits. Effective
integration of declarative rules with external evaluations for semantic-web reasoning. In
Proceedings of the 3rd European Semantic Web Conference (ESWC2006), volume 4011
of LNCS, pages 273–287, Budva, Montenegro, June 2006. Springer.
[40] Thomas Eiter, Thomas Lukasiewicz, Roman Schindlauer, and Hans Tompits. Combining
answer set programming with description logics for the semantic web. In Proceedings
of the Ninth International Conference on Principles of Knowledge Representation and
Reasoning (KR’04), Whistler, Canada, 2004. AAAI Press.
[41] Thomas Eiter, Carsten Lutz, Magdalena Ortiz, and Mantas Simkus. Query answering
in description logics with transitive roles. In Proceedings of the 21st International Joint
Conference on Artificial Intelligence (IJCAI 2009), pages 759–764, Pasadena, California,
USA, July 2009.
[42] Thomas Eiter and Mantas Simkus. FDNC: Decidable nonmonotonic disjunctive logic pro-
grams with function symbols. ACM Trans. Comput. Log., 11(2), 2010.
[43] Oren Etzioni, Keith Golden, and Daniel Weld. Tractable closed world reasoning with
updates. In KR’94: Principles of Knowledge Representation and Reasoning, pages 178–
189, San Francisco, California, 1994. Morgan Kaufmann.
[44] Joel Farrell and Holger Lausen. Semantic Annotations for WSDL and XML Schema.
W3C Recommendation, W3C, August 2007. Available at http://www.w3.org/TR/
sawsdl/.
[45] Dieter Fensel, Holger Lausen, Axel Polleres, Jos de Bruijn, Michael Stollberg, Dumitru
Roman, and John Domingue. Enabling Semantic Web Services : The Web Service Mod-
eling Ontology. Springer, 2006.
[46] Birte Glimm, Ian Horrocks, and Ulrike Sattler. Unions of conjunctive queries in SHOQ.
In Principles of Knowledge Representation and Reasoning: Proceedings of the Eleventh
International Conference, KR 2008, pages 252–262, Sydney, Australia, September 2008.
AAAI Press.
[47] Birte Glimm, Carsten Lutz, Ian Horrocks, and Ulrike Sattler. Conjunctive query answer-
ing for the description logic SHIQ. J. Artif. Intell. Res. (JAIR), 31:157–204, 2008.
[48] Birte Glimm and Sebastian Rudolph. Status QIO: Conjunctive Query Entailment is De-
cidable. In Principles of Knowledge Representation and Reasoning: Proceedings of the
Twelfth International Conference, KR 2010, pages 225–235, Toronto, Canada, May 2010.
AAAI Press.
[49] Birte Glimm, Chimezie Ogbuji, Sandro Hawke, Ivan Herman, Bijan Parsia, Axel Polleres,
and Andy Seaborne. SPARQL 1.1 Entailment Regimes. W3C working draft, W3C, May
2010. Available at http://www.w3.org/TR/sparql11-entailment/.
[50] Sven Groppe, Jinghua Groppe, Volker Linnemann, Dirk Kukulenz, Nils Hoeller, and
Christoph Reinke. Embedding SPARQL into XQuery/XSLT. In Proceedings of the
2008 ACM Symposium on Applied Computing (SAC), pages 2271–2278, Fortaleza, Ceara,
Brazil, March 2008. ACM.
[51] Benjamin N. Grosof, Ian Horrocks, Raphael Volz, and Stefan Decker. Description logic
programs: Combining logic programs with description logic. In 12th International Con-
ference on World Wide Web (WWW’03), pages 48–57, Budapest, Hungary, 2003. ACM.
[52] Claudio Gutiérrez, Carlos A. Hurtado, and Alberto O. Mendelzon. Foundations of Se-
mantic Web Databases. In Proceedings of the Twenty-third ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems (PODS 2004), pages 95–106,
Paris, France, 2004. ACM.
[53] Andreas Harth, Katja Hose, Marcel Karnstedt, Axel Polleres, Kai-Uwe Sattler, and Jürgen
Umbrich. Data summaries for on-demand queries over linked data. In Proceedings of the
19th World Wide Web Conference (WWW2010), Raleigh, NC, USA, April 2010. ACM
Press. Technical report version available at http://www.deri.ie/fileadmin/
documents/DERI-TR-2009-11-17.pdf.
[54] Andreas Harth, Jürgen Umbrich, Aidan Hogan, and Stefan Decker. YARS2: A federated
repository for querying graph structured data from the web. In 6th International Semantic
Web Conference, 2nd Asian Semantic Web Conference, pages 211–224, 2007.
[55] Patrick Hayes. RDF semantics. Technical report, W3C, February 2004. W3C Recom-
mendation.
[56] Pascal Hitzler, Markus Krötzsch, Bijan Parsia, Peter F. Patel-Schneider, and Sebastian
Rudolph. OWL 2 Web Ontology Language Primer. W3C recommendation, W3C, October
2009. Available at http://www.w3.org/TR/owl2-primer/.
[57] Aidan Hogan and Stefan Decker. On the ostensibly silent ’W’ in OWL 2 RL. In Web
Reasoning and Rule Systems – Third International Conference, RR 2009, pages 118–134,
2009.
[58] Aidan Hogan, Andreas Harth, Alexandre Passant, Stefan Decker, and Axel Polleres.
Weaving the pedantic web. In 3rd International Workshop on Linked Data on the Web
(LDOW2010) at WWW2010, Raleigh, USA, April 2010.
[59] Aidan Hogan, Andreas Harth, and Axel Polleres. Scalable authoritative OWL reasoning
for the Web. International Journal on Semantic Web and Information Systems, 5(2):49–
90, 2009.
[60] Ian Horrocks, Oliver Kutz, and Ulrike Sattler. The even more irresistible SROIQ. In
Proceedings of the Tenth International Conference on Principles of Knowledge Represen-
tation and Reasoning (KR’06), pages 57–67. AAAI Press, 2006.
[61] Ian Horrocks and Peter F. Patel-Schneider. Reducing OWL entailment to description logic
satisfiability. Journal of Web Semantics, 1(4):345–357, 2004.
[62] Ian Horrocks, Peter F. Patel-Schneider, Harold Boley, Said Tabet, Benjamin Grosof, and
Mike Dean. SWRL: A Semantic Web Rule Language Combining OWL and RuleML,
May 2004. W3C member submission.
[63] Ullrich Hustadt, Boris Motik, and Ulrike Sattler. Reducing SHIQ description logic to dis-
junctive Datalog programs. In Proceedings of the Ninth International Conference on Prin-
ciples of Knowledge Representation and Reasoning (KR’04), pages 152–162, Whistler,
Canada, 2004. AAAI Press.
[64] Giovambattista Ianni, Thomas Krennwallner, Alessandra Martello, and Axel Polleres.
Dynamic querying of mass-storage RDF data with rule-based entailment regimes. In
Abraham Bernstein, David R. Karger, Tom Heath, Lee Feigenbaum, Diana Maynard, En-
rico Motta, and Krishnaprasad Thirunarayan, editors, International Semantic Web Confer-
ence (ISWC 2009), volume 5823 of Lecture Notes in Computer Science, pages 310–327,
Washington DC, USA, October 2009. Springer.
[65] Giovambattista Ianni, Thomas Krennwallner, Alessandra Martello, and Axel Polleres. A
rule system for querying persistent RDFS data. In Proceedings of the 6th European Se-
mantic Web Conference (ESWC2009), Heraklion, Greece, May 2009. Springer. Demo
Paper.
[66] Giovambattista Ianni, Alessandra Martello, Claudio Panetta, and Giorgio Terracina. Ef-
ficiently querying RDF(S) ontologies with Answer Set Programming. Journal of Logic
and Computation (Special issue), 19(4):671–695, August 2009.
[67] Yixin Jing, Dongwon Jeong, and Doo-Kwon Baik. SPARQL graph pattern rewriting for
OWL-DL inference queries. Knowl. Inf. Syst., 20(2):243–262, 2009.
[68] Michael Kay. XSL Transformations (XSLT) Version 2.0 . W3C Recommendation, W3C,
January 2007. Available at http://www.w3.org/TR/xslt20.
[69] Michael Kifer. Nonmonotonic reasoning in FLORA-2. In 8th Int’l Conf. on Logic Pro-
gramming and Nonmonotonic Reasoning (LPNMR’05), Diamante, Italy, 2005. Invited
Paper.
[70] Michael Kifer, Georg Lausen, and James Wu. Logical foundations of object-oriented and
frame-based languages. Journal of the ACM, 42(4):741–843, 1995.
[71] Thomas Krennwallner, Nuno Lopes, and Axel Polleres. XSPARQL: Semantics, January
2009. W3C member submission.
[72] Markus Krötzsch, Sebastian Rudolph, and Pascal Hitzler. Complexity boundaries for Horn
description logics. In Proceedings of the Twenty-Second AAAI Conference on Artificial
Intelligence (AAAI), pages 452–457, Vancouver, British Columbia, Canada, July 2007.
[73] Markus Krötzsch, Sebastian Rudolph, and Pascal Hitzler. Conjunctive queries for a
tractable fragment of OWL 1.1. In Proceedings of the 6th International Semantic Web
Conference and 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages
310–323, Busan, Korea, November 2007.
[74] Alon Y. Levy and Marie-Christine Rousset. Combining horn rules and description logics
in CARIN. Artificial Intelligence, 104:165–209, 1998.
[75] Nuno Lopes, Thomas Krennwallner, Axel Polleres, Waseem Akhtar, and Stéphane Cor-
losquet. XSPARQL: Implementation and Test-cases, January 2009. W3C member sub-
mission.
[76] Gergely Lukácsy and Péter Szeredi. Efficient description logic reasoning in Prolog: the
DLog system. Theory and Practice of Logic Programming, 9(3):343–414, 2009.
[77] Thomas Lukasiewicz. A novel combination of answer set programming with description
logics for the semantic web. IEEE Transactions on Knowledge and Data Engineering
(TKDE), 2010. In press.
[78] Michael Meier. Towards Rule-Based Minimization of RDF Graphs under Constraints. In
Proc. RR’08, volume 5341 of LNCS, pages 89–103. Springer, 2008.
[79] Boris Motik, Bernardo Cuenca Grau, Ian Horrocks, Zhe Wu, Achille Fokoue, Carsten
Lutz, Diego Calvanese, Jeremy Carroll, Giuseppe De Giacomo, Jim Hendler, Ivan Her-
man, Bijan Parsia, Peter F. Patel-Schneider, Alan Ruttenberg, Uli Sattler, and Michael
Schneider. OWL 2 Web Ontology Language Profiles. W3C recommendation, W3C, Oc-
tober 2009. Available at http://www.w3.org/TR/owl2-profiles/.
[80] Boris Motik, Peter F. Patel-Schneider, Bernardo Cuenca Grau, Ian Horrocks, Bijan
Parsia, and Uli Sattler. OWL 2 Web Ontology Language Direct Semantics. W3C
recommendation, W3C, October 2009. Available at http://www.w3.org/TR/
owl2-direct-semantics/.
[81] Boris Motik and Riccardo Rosati. A faithful integration of description logics with logic
programming. In Proceedings of the Twentieth International Joint Conference on Ar-
tificial Intelligence (IJCAI-07), pages 477–482, Hyderabad, India, January 6–12 2007.
AAAI.
[82] Boris Motik, Ulrike Sattler, and Rudi Studer. Query answering for OWL-DL with rules.
Journal of Web Semantics, 3(1):41–60, 2005.
[83] Sergio Muñoz, Jorge Pérez, and Claudio Gutiérrez. Minimal deductive systems for RDF.
In Enrico Franconi, Michael Kifer, and Wolfgang May, editors, Proceedings of the 4th Eu-
ropean Semantic Web Conference (ESWC2007), volume 4519 of Lecture Notes in Com-
puter Science, pages 53–67, Innsbruck, Austria, June 2007. Springer.
[84] Alexandre Passant, Jacek Kopecký, Stéphane Corlosquet, Diego Berrueta, Davide
Palmisano, and Axel Polleres. XSPARQL: Use cases, January 2009. W3C member
submission.
[85] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and complexity of SPARQL.
In International Semantic Web Conference (ISWC 2006), pages 30–43, 2006.
[86] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and complexity of SPARQL.
ACM Transactions on Database Systems, 34(3):Article 16 (45 pages), 2009.
[87] Danh Le Phuoc, Axel Polleres, Giovanni Tummarello, Christian Morbidoni, and Man-
fred Hauswirth. Rapid semantic web mashup development through semantic web pipes.
In Proceedings of the 18th World Wide Web Conference (WWW2009), pages 581–590,
Madrid, Spain, April 2009. ACM Press.
[88] Reinhard Pichler, Axel Polleres, Sebastian Skritek, and Stefan Woltran. Minimis-
ing RDF graphs under rules and constraints revisited. In 4th Alberto Mendelzon
Workshop on Foundations of Data Management, May 2010. To appear, techni-
cal report version available at http://www.deri.ie/fileadmin/documents/
DERI-TR-2010-04-23.pdf.
[89] Reinhard Pichler, Axel Polleres, Fang Wei, and Stefan Woltran. Entailment for
domain-restricted RDF. In Proceedings of the 5th European Semantic Web Conference
(ESWC2008), pages 200–214, Tenerife, Spain, June 2008. Springer.
[90] Axel Polleres. SPARQL Rules! Technical Report GIA-TR-2006-11-28, Universidad Rey
Juan Carlos, Móstoles, Spain, 2006. Available at http://www.polleres.net/
TRs/GIA-TR-2006-11-28.pdf.
[91] Axel Polleres. From SPARQL to rules (and back). In Proceedings of the 16th World
Wide Web Conference (WWW2007), pages 787–796, Banff, Canada, May 2007. ACM
Press. Extended technical report version available at http://www.polleres.net/
TRs/GIA-TR-2006-11-28.pdf, slides available at http://www.polleres.
net/publications/poll-2007www-slides.pdf.
[92] Axel Polleres, Harold Boley, and Michael Kifer. RIF Datatypes and Built-Ins 1.0. W3C
proposed recommendation, W3C, May 2010. Available at http://www.w3.org/TR/
2010/PR-rif-dtb-20100511/.
[93] Axel Polleres, Cristina Feier, and Andreas Harth. Rules with contextually scoped nega-
tion. In Proceedings of the 3rd European Semantic Web Conference (ESWC2006), volume
4011 of Lecture Notes in Computer Science, Budva, Montenegro, June 2006. Springer.
[94] Axel Polleres and David Huynh, editors. Journal of Web Semantics, Special Issue: The
Web of Data, volume 7(3). Elsevier, 2009.
[95] Axel Polleres, Thomas Krennwallner, Nuno Lopes, Jacek Kopecký, and Stefan Decker.
XSPARQL Language Specification, January 2009. W3C member submission.
[96] Axel Polleres, François Scharffe, and Roman Schindlauer. SPARQL++ for mapping be-
tween RDF vocabularies. In OTM 2007, Part I : Proceedings of the 6th International
Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 2007),
volume 4803 of Lecture Notes in Computer Science, pages 878–896, Vilamoura, Algarve,
Portugal, November 2007. Springer.
[97] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF.
W3C recommendation, W3C, January 2008. Available at http://www.w3.org/TR/
rdf-sparql-query/.
[98] Riccardo Rosati. On the decidability and complexity of integrating ontologies and rules.
Journal of Web Semantics, 3(1):61–73, 2005.
[99] Riccardo Rosati. Semantic and computational advantages of the safe integration of on-
tologies and rules. In Proceedings of the Third International Workshop on Principles and
Practice of Semantic Web Reasoning (PPSWR 2005), volume 3703 of Lecture Notes in
Computer Science, pages 50–64. Springer, 2005.
[100] Riccardo Rosati. Integrating Ontologies and Rules: Semantic and Computational Issues.
In Pedro Barahona, François Bry, Enrico Franconi, Ulrike Sattler, and Nicola Henze,
editors, Reasoning Web, Second International Summer School 2006, Lissabon, Portu-
gal, September 25-29, 2006, Tutorial Lectures, volume 4126 of LNCS, pages 128–151.
Springer, September 2006.
[101] Riccardo Rosati. DL+log: Tight integration of description logics and disjunctive dat-
alog. In Proceedings of the Tenth International Conference on Principles of Knowledge
Representation and Reasoning (KR’06), pages 68–78, 2006.
[102] Simon Schenk and Steffen Staab. Networked graphs: A declarative mechanism for
SPARQL rules, SPARQL views and RDF data integration on the web. In Proceedings
WWW-2008, pages 585–594, Beijing, China, 2008. ACM Press.
[103] Roman Schindlauer. Answer-Set Programming for the Semantic Web. PhD thesis, Vienna
University of Technology, December 2006.
[104] Michael Schmidt, Michael Meier, and Georg Lausen. Foundations of SPARQL query opti-
mization. In 13th International Conference on Database Theory (ICDT2010), Lausanne,
Switzerland, March 2010.
[105] Michael Schneider, Jeremy Carroll, Ivan Herman, and Peter F. Patel-Schneider.
W3C OWL 2 Web Ontology Language RDF-Based Semantics. W3C rec-
ommendation, W3C, October 2009. Available at http://www.w3.org/TR/
owl2-rdf-based-semantics/.
[106] Michael Sintek and Stefan Decker. TRIPLE - A Query, Inference, and Transformation
Language for the Semantic Web. In 1st International Semantic Web Conference, pages
364–378, 2002.
[107] Evren Sirin and Bijan Parsia. SPARQL-DL: SPARQL query for OWL-DL. In Proceedings
of the OWLED 2007 Workshop on OWL: Experiences and Directions, Innsbruck, Austria,
June 2007. CEUR-WS.org.
[108] Michael K. Smith, Chris Welty, and Deborah L. McGuinness. OWL Web Ontology
Language Guide. W3C recommendation, W3C, February 2004. Available at http:
//www.w3.org/TR/owl-guide/.
[109] SQL-99. Information Technology - Database Language SQL- Part 3: Call Level Interface
(SQL/CLI). Technical Report INCITS/ISO/IEC 9075-3, INCITS/ISO/IEC, October 1999.
Standard specification.
[110] Herman J. ter Horst. Completeness, decidability and complexity of entailment for RDF
Schema and a semantic extension involving the OWL vocabulary. Journal of Web Seman-
tics, 3:79–115, 2005.
[111] Giorgio Terracina, Nicola Leone, Vincenzino Lio, and Claudio Panetta. Experimenting
with recursive queries in database and logic programming systems. Theory and Practice
of Logic Programming, 8(2):129–165, March 2008.
[112] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML Schema Part 1:
Structures, 2nd Edition. W3C Recommendation, W3C, October 2004. Available at
http://www.w3.org/TR/xmlschema-1/.
[113] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, and Frank van Harmelen. Scalable dis-
tributed reasoning using mapreduce. In International Semantic Web Conference, pages
634–649, 2009.
[114] Norman Walsh. RDF Twig: Accessing RDF Graphs in XSLT. Presented at Extreme
Markup Languages (XML) 2003, Montreal, Canada. Available at http://rdftwig.
sourceforge.net/.
[115] Jesse Weaver and James A. Hendler. Parallel materialization of the finite RDFS closure for
hundreds of millions of triples. In International Semantic Web Conference (ISWC2009),
pages 682–697, 2009.
A Semantical Framework for Hybrid Knowledge Bases
Jos de Bruijn, David Pearce, Axel Polleres, and Agustín Valverde
Accepted for publication in Knowledge and Information Systems (KAIS), Springer, 2010,
ISSN: 0219-1377 (print version), 0219-3116 (electronic version); cf.
http://www.springerlink.com/content/y3q6657333137683/
1 Introduction
In the current discussions on the Semantic Web architecture a recurring issue is how
to combine a first-order classical theory formalising an ontology with a (possibly non-
monotonic) rule base. Nonmonotonic rule languages have received considerable at-
tention and achieved maturity over the last few years especially due to the success of
Answer Set Programming (ASP), a nonmonotonic, purely declarative logic program-
ming and knowledge representation paradigm with many useful features such as aggre-
gates, weak constraints and priorities, supported by efficient implementations (for an
overview see [1]).
As a logical foundation for the answer set semantics and a tool for logical analysis
in ASP, the system of Equilibrium Logic was presented in [24] and further developed in
subsequent works (see [25] for an overview and references). The aim of this paper is to
show how Equilibrium Logic can be used as a logical foundation for the combination
of ASP and Ontologies.
In the quest to provide a formal underpinning for a nonmonotonic rules layer for
the Semantic Web which can coexist in a semantically well-defined manner with the
Ontology layer, various proposals for combining classical first-order logic with differ-
ent variants of ASP have been presented in the literature (most of these approaches focus on
the Description Logics fragments of first-order logic underlying the Web Ontology Language
OWL). We distinguish three kinds
of approaches: At one end of the spectrum there are approaches which provide an
entailment-based query interface to the Ontology in the bodies of ASP rules, result-
ing in a loose integration (e.g. [10, 9]). At the other end there are approaches which
use a unifying nonmonotonic formalism to embed both the Ontology and the rule base
(e.g. [4, 23]), resulting in a tight coupling. Hybrid approaches (e.g. [29, 30, 31, 16]) fall
between these extremes. Common to hybrid approaches is the definition of a modular
semantics based on classical first-order models, on the one hand, and stable models –
often, more generally, referred to as answer sets – on the other hand. Additionally, they
require several syntactical restrictions on the use of classical predicates within rules,
typically driven by considerations of retaining decidability of reasoning tasks such
as knowledge base satisfiability and predicate subsumption. With further restrictions
of the classical part to decidable Description Logics (DLs), these semantics support
straightforward implementation using existing DL reasoners and ASP engines, in a
modular fashion. In this paper, we focus on such hybrid approaches, but from a more
general point of view.
Example 1 Consider a hybrid knowledge base consisting of a classical theory T:

    ∀x (PERSON(x) → AGENT(x) ∧ ∃y hasMother(x, y))
    ∀x∀y (hasMother(x, y) → ANIMAL(x))

which says that every PERSON is an AGENT and has some (unknown) mother, and
everyone who has a mother is an ANIMAL, and a nonmonotonic logic program P:

    PERSON(x) ← AGENT(x), ¬machine(x)
    AGENT(DaveBowman)

which says that AGENTs are by default PERSONs, unless known to be machines,
and DaveBowman is an AGENT.
Using such a hybrid knowledge base consisting of T and P, we intuitively would
conclude that PERSON(DaveBowman) holds since he is not known to be a machine,
and furthermore we would conclude that DaveBowman has some (unknown) mother,
and thus ANIMAL(DaveBowman).
We see two important shortcomings in current hybrid approaches:
(1) Current approaches to hybrid knowledge bases differ not only in terms of syntac-
tic restrictions, motivated by decidability considerations, but also in the way they deal
with more fundamental issues which arise when classical logic meets ASP, such as the
domain closure and unique names assumptions (see [3] for a more in-depth discussion
of these issues). In particular, current proposals im-
plicitly deal with these issues by either restricting the allowed models of the classical
theory, or by using variants of the traditional answer set semantics which cater for open
domains and non-unique names. So far, little effort has been spent on comparing the
approaches from a more general perspective. In this paper we aim to provide a generic
semantic framework for hybrid knowledge bases that neither restricts models (e.g. to
unique names) nor imposes syntactical restrictions driven by decidability concerns.
(2) The semantics of current hybrid knowledge bases is defined in a modular fashion. This
has the important advantage that algorithms for reasoning with this combination can be
based on existing algorithms for DL and ASP satisfiability. A single underlying logic
for hybrid knowledge bases which, for example, allows to capture notions of equiva-
lence between combined knowledge bases in a standard way, is lacking though.
Our main contribution with this paper is twofold. First, we survey and compare
different (extensions of the) answer set semantics, as well as the existing approaches
to hybrid knowledge bases, all of which define nonmonotonic models in a modular
fashion. Second, we propose to use Quantified Equilibrium Logic (QEL) as a unified
logical foundation for hybrid knowledge bases: As it turns out, the equilibrium models
of the combined knowledge base coincide exactly with the modular nonmonotonic
models for all approaches we are aware of [29, 30, 31, 16].
The remainder of this paper is structured as follows: Section 2 recalls some basics
of classical first-order logic. Section 3 reformulates different variants of the answer
set semantics introduced in the literature using a common notation and points out cor-
respondences and discrepancies between these variants. Next, definitions of hybrid
knowledge bases from the literature are compared and generalised in Section 4. QEL
and its relation to the different variants of ASP are clarified in Section 5. Section 6
describes an embedding of hybrid knowledge bases into QEL and establishes the cor-
respondence between equilibrium models and nonmonotonic models of hybrid KBs.
We discuss some immediate implications of our results in Section 7. In Section 8 we
show how for finite knowledge bases an equivalent semantical characterisation can be
given via a second-order operator NM. This behaves analogously to the operator SM
used by Ferraris, Lee and Lifschitz [12] to define the stable models of a first-order
sentence, except that its minimisation condition applies only to the non-classical pred-
icates. In Section 9 we discuss an application of the previous results: we propose a
definition of strong equivalence for knowledge bases sharing a common structural lan-
guage and show how this notion can be captured by deduction in the (monotonic) logic
of here-and-there. These two sections (8 and 9) in particular contain mostly new mate-
rial which has not yet been presented in the conference version [5] of this article. We
conclude with a discussion of further related approaches and an outlook to future work
in Section 10.
25
2 First-Order Logic (FOL)
A function-free first-order language L = ⟨C, P⟩ with equality consists of disjoint sets
of constant and predicate symbols C and P. Moreover, we assume a fixed countably
infinite set of variables, the symbols ‘→’, ‘∨’, ‘∧’, ‘¬’, ‘∃’, ‘∀’, and auxiliary paren-
theses ‘(’, ‘)’. Each predicate symbol p ∈ P has an assigned arity ar(p). Atoms and
formulas are constructed as usual. Closed formulas, or sentences, are those where each
variable is bound by some quantifier. A theory T is a set of sentences. Variable-free
atoms, formulas, or theories are also called ground. If D is a non-empty set, we denote
by At_D(C, P) the set of ground atoms constructible from L′ = ⟨C ∪ D, P⟩.
Given a first-order language L, an L-structure consists of a pair ℐ = ⟨U, I⟩, where
the universe U = (D, σ) (sometimes called pre-interpretation) consists of a non-empty
domain D and a function σ : C ∪ D → D which assigns a domain value to each constant
such that σ(d) = d for every d ∈ D. For tuples we write σ(⃗d) = (σ(d1), . . . , σ(dn)).
We call d ∈ D an unnamed individual if there is no c ∈ C such that σ(c) = d. The
function I assigns a relation p^I ⊆ D^n to each n-ary predicate symbol p ∈ P and
is called the L-interpretation over D. The designated binary predicate symbol eq,
occasionally written ‘=’ in infix notation, is assumed to be associated with the fixed
interpretation function eq^I = {(d, d) : d ∈ D}. If ℐ is an L′-structure we denote by
ℐ|L the restriction of ℐ to a sublanguage L ⊆ L′.
An L-structure I = hU, Ii satisfies an atom p(d1 , . . . , dn ) of AtD (C, P ), written
I |= p(d1 , . . . , dn ), iff (σ(d1 ), . . . , σ(dn )) ∈ pI . This is extended as usual to sentences
and theories. I is a model of an atom (sentence, theory, respectively) ϕ, written I |= ϕ,
if it satisfies ϕ. A theory T entails a sentence ϕ, written T |= ϕ, if every model of T
is also a model of ϕ. A theory is consistent if it has a model.
In the context of logic programs, the following assumptions often play a role: We
say that the parameter names assumption (PNA) applies in case σ|C is surjective, i.e.,
there are no unnamed individuals in D; the unique names assumption (UNA) applies
in case σ|C is injective; in case both the PNA and UNA apply, the standard names
assumption (SNA) applies, i.e. σ|C is a bijection. In the following, we will speak about
PNA-, UNA-, or SNA-structures, (or PNA-, UNA-, or SNA-models, respectively), de-
pending on σ.
An L-interpretation I over D can be seen as a subset of AtD (C, P ). So, we can
define a subset relation for L-structures I1 = h(D, σ1 ), I1 i and I2 = h(D, σ2 ), I2 i over
the same domain by setting I1 ⊆ I2 if I1 ⊆ I2 .4 Whenever we speak about subset
minimality of models/structures in the following, we thus mean minimality among all
models/structures over the same domain.
4 Note that this is not the substructure or submodel relation in classical model theory, which holds between structures over different domains.
3 Answer Set Semantics
In this paper we assume non-ground disjunctive logic programs with negation allowed
in rule heads and bodies, interpreted under the answer set semantics [21].5 A program
P consists of a set of rules of the form
a1 ∨ . . . ∨ ak ∨ ¬ak+1 ∨ . . . ∨ ¬al ← b1 , . . . , bm , ¬bm+1 , . . . , ¬bn (1)
where ai (i ∈ {1, . . . , l}) and bj (j ∈ {1, . . . , n}) are atoms, called head (body, respectively) atoms of the rule, in a function-free first-order language L = hC, P i without
equality. By CP ⊆ C we denote the set of constants which appear in P. A rule with
k = l and m = n is called positive. Rules where each variable appears in one of the
positive body atoms b1 , . . . , bm are called safe. A program is positive (safe) if all its rules are positive (safe).
For the purposes of this paper, we give a slightly generalised definition of the com-
mon notion of the grounding of a program: The grounding grU (P) of P wrt. a universe
U = (D, σ) denotes the set of all rules obtained as follows: For r ∈ P, replace (i) each
constant c appearing in r with σ(c) and (ii) each variable with some element in D.
Observe that thus grU (P) is a ground program over the atoms in AtD (C, P ).
For a ground program P and first-order structure I the reduct P I consists of rules
a1 ∨ a2 ∨ . . . ∨ ak ← b1 , . . . , bm
obtained from all rules of the form (1) in P for which it holds that I |= ai for all
k < i ≤ l and I 6|= bj for all m < j ≤ n.
Answer set semantics is usually defined in terms of Herbrand structures over L =
hC, P i. Herbrand structures have a fixed universe, the Herbrand universe H = (C, id),
where id is the identity function. For a Herbrand structure I = hH, Ii, I can be
viewed as a subset of the Herbrand base, B, which consists of the ground atoms of
L. Note that by definition of H, Herbrand structures are SNA-structures. A Herbrand
structure I is an answer set [21] of P if I is subset minimal among the structures
satisfying grH (P)I . Two variations of this semantics, the open [15] and generalised
open answer set [16] semantics, consider open domains, thereby relaxing the PNA. An
extended Herbrand structure is a first-order structure based on a universe U = (D, id),
where D ⊇ C.
Definition 1 A first-order L-structure I = hU, Ii is called a generalised open answer
set of P if I is subset minimal among the structures satisfying all rules in grU (P)I . If,
additionally, I is an extended Herbrand structure, then I is an open answer set of P.
In the open answer set semantics the UNA applies. We have the following correspon-
dence with the answer set semantics. First, as a straightforward consequence from the
definitions, we can observe:
Proposition 1 If M is an answer set of P then M is also an open answer set of P.
The converse does not hold in general:
5 By ¬ we mean negation as failure and not classical, or strong negation, which is also sometimes consid-
ered in ASP.
27
Example 2 Consider P = {p(a); ok ← ¬p(x); ← ¬ok} over L = h{a}, {p, ok}i.
We leave it as an exercise to the reader to show that P is inconsistent under the answer
set semantics, but M = h({a, c1 }, id), {p(a), ok}i is an open answer set of P. 3
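To make the interplay of grounding, reduct, and subset minimality concrete, the following self-contained Python sketch brute-forces the answer sets of the program from Example 2 over a given universe. The encoding of rules as (head, positive body, negative body) triples and all function names are ours, purely for illustration:

    from itertools import chain, combinations

    def ground_example2(universe):
        # P = { p(a);  ok <- not p(X);  <- not ok }, grounded over `universe`
        rules = [({("p", "a")}, set(), set())]                    # fact p(a)
        rules += [({("ok",)}, set(), {("p", d)}) for d in universe]
        rules.append((set(), set(), {("ok",)}))                   # constraint <- not ok
        return rules

    def satisfies(I, positive_rules):
        # a positive rule is satisfied if, whenever its body holds, some head atom does
        return all((head & I) or not (pos <= I) for head, pos, _ in positive_rules)

    def answer_sets(universe):
        atoms = [("p", d) for d in universe] + [("ok",)]
        subsets = [set(s) for s in chain.from_iterable(
            combinations(atoms, r) for r in range(len(atoms) + 1))]
        rules = ground_example2(universe)
        for I in subsets:
            # reduct: drop rules blocked by I, strip the negative body
            reduct = [(h, p, set()) for h, p, n in rules if not (n & I)]
            if satisfies(I, reduct) and \
               not any(satisfies(J, reduct) for J in subsets if J < I):
                yield I

    print(list(answer_sets({"a"})))        # []: inconsistent over the Herbrand universe
    print(list(answer_sets({"a", "c1"})))  # [{('p', 'a'), ('ok',)}]: an open answer set

Over the Herbrand universe {a} the constraint eliminates every candidate, while one additional unnamed individual c1 yields exactly the open answer set M from Example 2.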
Open answer set programs allow the use of the equality predicate ‘=’ in the body
of rules. However, since this definition of open answer sets adheres to the UNA, one
could argue that equality in open answer set programming is purely syntactical. Posi-
tive equality predicates in rule bodies can thus be eliminated by simple preprocessing,
applying unification. This is not the case for negative occurrences of equality, but, since
the interpretation of equality is fixed, these can be eliminated during grounding.
An alternative approach to relax the UNA has been presented by Rosati in [30]:
Instead of grounding with respect to U , programs are grounded with respect to the
Herbrand universe H = (C, id), and minimality of the models of grH (P)I wrt. U is
redefined: IH = {p(σ(c1 ), . . . , σ(cn )) : p(c1 , . . . , cn ) ∈ B, I |= p(c1 , . . . , cn )}, i.e.,
IH is the restriction of I to ground atoms of B. Given L-structures I1 = (U1 , I1 ) and
I2 = (U2 , I2 ),6 the relation I1 ⊆H I2 holds if (I1 )H ⊆ (I2 )H .
Definition 2 An L-structure I is called a generalised answer set of P if I is ⊆H -
minimal among the structures satisfying all rules in grH (P)I .
The following lemma (implicit in [14]) establishes that all atoms of AtD (C, P ) satisfied in a (generalised) open answer set of a safe program are images of ground atoms over CP :
Lemma 2 Let P be a safe program over L = hC, P i with M = hU, Ii a (generalised)
open answer set over universe U = (D, σ). Then, for any atom p(d1 , . . . , dn ) from AtD (C, P ) such
that M |= p(d1 , . . . , dn ), there exist ci ∈ CP such that σ(ci ) = di for each 1 ≤ i ≤ n.
Proof: First, we observe that any atom M |= p(d1 , . . . , dn ) must be derivable from
a sequence of rules (r0 ; . . . ; rl ) in grU (P)M . We prove the lemma by induction over
the length l of this sequence. l = 0: Assume M |= p(d1 , . . . , dn ), then r0 must
be (by safety) a ground fact in P such that p(σ(c1 ), . . . , σ(cn )) = p(d1 , . . . , dn ) and
c1 , . . . , cn ∈ CP . As for the induction step, let p(d1 , . . . , dn ) be inferred by application
of rule rl ∈ grU (P)M . By safety, again each dj either stems from a constant cj ∈ CP
such that σ(cj ) = dj which appears in some true head atom of rl or dj also appears in
a positive body atom q(. . . , dj , . . .) of rl such that M |= q(. . . , dj , . . .), derivable by
(r0 ; . . . ; rl−1 ), which, by the induction hypothesis, proves the existence of a cj ∈ CP
with σ(cj ) = dj . 2
As a direct consequence of Lemma 2, the answer sets and open answer sets of safe programs coincide:
Proposition 3 M is an answer set of a safe program P if and only if M is an open
answer set of P.
Similarly, on unsafe programs, generalised answer sets and generalised open answer
sets do not necessarily coincide, as demonstrated by Example 2. However, the follow-
ing correspondence follows straightforwardly from Lemma 2:
6 Not necessarily over the same domain.
Proposition 4 Given a safe program P, M is a generalised open answer set of P if
and only if M is a generalised answer set of P.
Proof:
(⇒) Assume M is a generalised open answer set of P. By Lemma 2 we know that
rules in grU (P)M involving unnamed individuals do not contribute to answer
sets, since their body is always false. It follows that M = MH which in turn is
a ⊆H -minimal model of grH (P)M . This follows from the observation that each
rule in grH (P)M and its corresponding rules in grU (P)M are satisfied under
the same models.
(⇐) Analogously.
2
By similar arguments, generalised answer sets and generalised open answer sets
coincide in case the parameter names assumption applies:
Proposition 5 Let M be a PNA-structure. Then M is a generalised answer set of P
if and only if M is a generalised open answer set of P.
If the SNA applies, consistency with respect to all semantics introduced so far boils
down to consistency under the original definition of answer sets:
Proposition 6 A program P has an answer set if and only if P has a generalised open
answer set under the SNA.
Answer sets under SNA may differ from the original answer sets since also non-
Herbrand structures are allowed. Further, we observe that there are programs which
have generalised (open) answer sets but do not have (open) answer sets, even for safe
programs, as shown by the following simple example:
Example 3 Consider P = {p(a); ← ¬p(b)} over L = h{a, b}, {p}i. P is ground,
thus obviously safe. However, although P has a generalised (open) answer set – the
reader may verify this by, for instance, considering the one-element universe U =
({d}, σ), where σ(a) = σ(b) = d – it is inconsistent under the open answer set seman-
tics, i.e. the program does not have any open (non-generalised) answer set. 3
4 Hybrid Knowledge Bases
A hybrid knowledge base K = (T , P) consists of a classical theory T (also called the structural part of K)
over the language LT = hC, PT i and a program P (also called rules part of K) over
the language L, where PT ∩ PP = ∅, i.e. T and P share a single set of constants, and
the predicate symbols allowed to be used in P are a superset of the predicate symbols
in LT . Intuitively, the predicates in LT are interpreted classically, whereas the predicates in LP are interpreted nonmonotonically under the (generalised open) answer set
semantics. With LP = hC, PP i we denote the restriction of the language of P to only the
distinct predicates PP which are not supposed to occur in T .
We do not consider the alternative classical semantics defined in [29, 30, 31], as
these are straightforward.
We define the projection of a ground program P with respect to an L-structure
I = hU, Ii, denoted Π(P, I), as follows: for each rule r ∈ P, rΠ is defined as:
1. rΠ = ∅ if there is a literal over AtD (C, PT ) in the head of r of form p(~t) such
that p(σ(~t)) ∈ I or of form ¬p(~t) with p(σ(~t)) 6∈ I;
2. rΠ = ∅ if there is a literal over AtD (C, PT ) in the body of r of form p(~t) such
that p(σ(~t)) 6∈ I or of form ¬p(~t) such that p(σ(~t)) ∈ I;
3. otherwise, rΠ is the rule obtained from r by deleting all remaining literals over
AtD (C, PT );
and Π(P, I) consists of all rules rΠ 6= ∅. An L-structure M = hU, Ii is called an
NM-model of K = (T , P) if M|LT is a model of T and M|LP is a generalised open
answer set of Π(grU (P), M).
Example 4 Consider the hybrid knowledge base K = (T , P), with T and P as in Example 1, with the capitalised predicates being predicates in PT . Now consider the interpretation I = hU, Ii (with U = (D, σ)) with D = {DaveBowman, k}, σ the identity
function, and I = {AGENT(DaveBowman), HAS-MOTHER(DaveBowman, k),
ANIMAL(DaveBowman), machine(DaveBowman)}. Clearly, I|LT is a model
of T . The projection Π(grU (P), I) is
← ¬machine(DaveBowman),
which does not have a stable model, and thus I is not an NM-model of K. In fact,
the logic program P ensures that an interpretation cannot be an NM-model of K if
there is an AGENT which is neither a PERSON nor known (by conclusions from
P) to be a machine. It is easy to verify that, for any NM-model of K, the atoms
AGENT(DaveBowman), PERSON(DaveBowman), and ANIMAL(DaveBowman)
must be true, and are thus entailed by K. The latter can be derived from
neither T nor P individually. 3
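The following schematic Python sketch implements the projection operator just defined for ground rules in the same (head, positive body, negative body) encoding; the sample rule is our guess at the shape of the relevant rule of Example 1, which is not reproduced here, and negated classical head literals (the second half of case 1) are omitted for brevity:

    def project(rules, I, classical):
        # Project ground `rules` wrt. the interpretation I of the `classical` predicates.
        result = []
        for head, pos, neg in rules:
            if any(a[0] in classical and a in I for a in head):
                continue  # case 1: a classical head atom is true in I -> rule satisfied
            if any(a[0] in classical and a not in I for a in pos) or \
               any(a[0] in classical and a in I for a in neg):
                continue  # case 2: a classical body literal is false in I -> never fires
            result.append(  # case 3: delete the remaining classical literals
                (frozenset(a for a in head if a[0] not in classical),
                 frozenset(a for a in pos if a[0] not in classical),
                 frozenset(a for a in neg if a[0] not in classical)))
        return result

    classical = {"AGENT", "PERSON", "ANIMAL", "HAS-MOTHER"}
    I = {("AGENT", "DaveBowman")}            # PERSON(DaveBowman) is not in I
    r = (frozenset(),                        # <- AGENT(x), not PERSON(x), not machine(x)
         frozenset({("AGENT", "DaveBowman")}),
         frozenset({("PERSON", "DaveBowman"), ("machine", "DaveBowman")}))
    print(project([r], I, classical))
    # -> the constraint  <- not machine(DaveBowman)  of Example 4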
4.1 r-hybrid KBs
We now proceed to compare our definition of NM-models with the various definitions
in the literature. The first kind of hybrid knowledge base we consider was introduced
by Rosati in [29] (and extended in [31] under the name DL+log), and was labeled r-
hybrid knowledge base. Syntactically, r-hybrid KBs do not allow negated atoms in rule
heads, i.e. for rules of the form (1) l = k, and do not allow atoms from LT to occur
negatively in the rule body.7 Moreover, in [29], Rosati deploys a restriction which is
stronger than standard safety: each variable must appear in at least one positive body
atom with a predicate from LP . We call this condition LP -safe in the remainder.
In [31] this condition is relaxed to weak LP -safety: there is no special safety restriction
for variables which occur only in body atoms from PT .
Semantically, Rosati assumes (an infinite number of) standard names, i.e. C is
countably infinite, and normal answer sets in his version of NM-models. The following
table summarises the restrictions on the variants of hybrid KBs discussed in this section:
             SNA   variables       disjunctive     negated
                                   rule heads      LT atoms
 r-hybrid    yes   LP -safe        pos. only       no
 r+ -hybrid  no    LP -safe        pos. only       no
 rw -hybrid  yes   weak LP -safe   pos. only       no
 g-hybrid    no    guarded         neg. allowed∗   yes
 ∗ g-hybrid allows negation in the head but at most one positive head atom
5 Quantified Equilibrium Logic (QEL)
Quantified equilibrium logic is based on the quantified logic of here-and-there. As in
Section 2, we start from a function-free first-order language L = hC, P i; L-formulas,
L-sentences and atomic L-sentences are defined in the usual way. Again,
we only work with sentences, and, as in Section 2, by an L-interpretation I over a
set D we mean a subset I of AtD (C, P ). A here-and-there L-structure with static
domains, or QHTs (L)-structure, is a tuple M = h(D, σ), Ih , It i where h(D, σ), Ih i
and h(D, σ), It i are L-structures such that Ih ⊆ It .
We can think of M as a structure similar to a first-order classical model, but having
two parts, or components, h and t that correspond to two different points or “worlds”,
‘here’ and ‘there’, in the sense of Kripke semantics for intuitionistic logic [32], where
the worlds are ordered by h ≤ t. At each world w ∈ {h, t} one verifies a set of atoms
Iw in the expanded language for the domain D. We call the model static, since, in
contrast to say intuitionistic logic, the same domain serves each of the worlds.8 Since
h ≤ t, whatever is verified at h remains true at t. The satisfaction relation for M is
defined so as to reflect the two different components, so we write M, w |= ϕ to denote
that ϕ is true in M with respect to the w component. Evidently we should require that
an atomic sentence is true at w just in case it belongs to the w-interpretation. Formally,
if p(t1 , . . . , tn ) ∈ AtD then
• M, w |= p(t1 , . . . , tn ) iff p(σ(t1 ), . . . , σ(tn )) ∈ Iw .
The clauses for ∧, ∨ and ∃ are as usual in Kripke semantics over a single domain;9 the remaining connectives are interpreted as follows:
• M, t |= ϕ → ψ iff M, t 6|= ϕ or M, t |= ψ.
• M, h |= ϕ → ψ iff M, t |= ϕ → ψ and (M, h 6|= ϕ or M, h |= ψ).
• M, w |= ¬ϕ iff M, t 6|= ϕ.
• M, t |= ∀xϕ(x) iff M, t |= ϕ(d) for all d ∈ D.
• M, h |= ∀xϕ(x) iff M, t |= ∀xϕ(x) and M, h |= ϕ(d) for all d ∈ D.
8 We prefer “static domain” over the more common term “constant domains”, which would be
ambiguous since it might suggest that the domain is composed only of constants, which is not intended here.
9 The reader may easily check that the following correspond exactly to the usual Kripke semantics for
intuitionistic logic given our assumptions about the two worlds h and t and the single domain D, see e.g. [32].
The resulting logic of quantified here-and-there with static domains, QHTs , is the logic introduced before in [26]. By QHTs= we denote the version of QEL with
equality. The equality predicate in QHTs= is interpreted as the actual equality in both
worlds, i.e. M, w |= t1 = t2 iff σ(t1 ) = σ(t2 ).
The logic QHTs= can be axiomatised as follows. Let INT= denote first-order
intuitionistic logic [32] with the usual axioms for equality:
x = x,
x = y → (F (x) → F (y)),
for every formula F (x) such that y is substitutable for x in F (x). To this we add the
axiom of Hosoi
α ∨ (¬β ∨ (α → β)),
which determines 2-element here-and-there models in the propositional case, the
axiom SQHT (static quantified here-and-there)
∃x(F (x) → ∀xF (x)),
and the axiom of decidable equality
x = y ∨ x 6= y.
Analogous to the case of classical models we can define special kinds of QHTs
(resp. QHTs= ) models. Let M = h(D, σ), H, T i be an L-structure that is a model of
a universal theory T . Then, we call M a PNA-, UNA-, or SNA-model if the restriction
of σ to constants in C is surjective, injective or bijective, respectively.
Let Γ be a theory and M = h(D, σ), H, T i a QHTs= model of Γ.
1. M is said to be total if H = T .
2. M is said to be an equilibrium model of Γ (for short, we say: “M is in equilibrium”) if it is total and minimal among the models of Γ with the same universe
and the same ‘there’ component, i.e. there is no model h(D, σ), H 0 , T i of Γ with H 0 ⊂ T .
Notice that a total QHTs= model hU, T, T i of a theory Γ corresponds to a classical first-order
model hU, T i of Γ.
Proposition 9 Let Γ be a theory in L and M an equilibrium model of Γ in QHTs= (L0 )
with L0 ⊃ L. Then M|L is an equilibrium model of Γ in QHTs= (L).
Given a hybrid knowledge base K = (T , P), we denote by st(T ) the set of stability
axioms ∀~x(p(~x) ∨ ¬p(~x)) for each predicate p ∈ PT ; the theory T ∪ st(T ) ∪ P is
called the stable closure of K.
Lemma 12 Let M = hU, H, T i be a QHTs= -model of T ∪ st(T ). Then M |= P iff
M|LP |= Π(grU (P), M).
Proof: Consider a rule r ∈ grU (P) and distinguish cases according to the literals over
AtD (C, PT ) occurring in r. Case (i): r contains no such literal; then rΠ = r and
M |= r ⇔ M |= rΠ ⇔ M|LP |= rΠ (3)
by the semantics for QHTs= and Theorem 7. Case (ii): r has the form α → β ∨ ¬p(t), where
p(σ(t)) ∈ T ; so p(σ(t)) ∈ H and M |= p(t). Again it is easy to see that (3) holds.
Case (iii): r has the form α ∧ p(t) → β and p(σ(t)) ∈ H, T , so M |= p(t). Case (iv):
r has the form α ∧ ¬p(t) → β and M |= ¬p(t). Clearly for these two cases (3) holds
as well. It follows that if M |= P then M|LP |= Π(grU (P), M).
To check the converse condition we need now only examine the cases where rΠ =
∅. Suppose this arises because p(σ(t)) ∈ H, T , so M |= p(t). Now, if p(t) is in the
head of r, clearly M |= r. Similarly if ¬p(t) is in the body of r, by the semantics M |=
r. The cases where p(σ(t)) 6∈ T are analogous and left to the reader. Consequently if
M|LP |= Π(grU (P), M), then M |= P. 2
We now state the relation between equilibrium models and NM-models.
Theorem 13 Let K = (T , P) be a hybrid knowledge base and let M = hU, T, T i be a
total QHTs= structure. Then M is an equilibrium model of the stable closure
T ∪ st(T ) ∪ P of K if and only if hU, T i is an NM-model of K.
Proof: Assume the hypothesis and suppose that M is in equilibrium. Since T contains
only predicates from LT and M |= T ∪ st(T ), evidently
M|LT |= T ∪ st(T ). (4)
We claim (i) that M|LP is an equilibrium model of Π(grU (P), M). If not, there is
a model M0 = hH 0 , T 0 i with H 0 ⊂ T 0 = T |LP and M0 |= Π(grU (P), M). Lift
(U, M0 ) to a (first order) L-structure N by interpreting each p ∈ LT according to M.
So N |LT = M|LT and by (4) clearly N |= T ∪ st(T ). Moreover, by Lemma 12 N |=
P, and N is strictly below M (its ‘here’ component is properly contained in T ), contradicting the assumption that M is an equilibrium
model of T ∪ st(T ) ∪ P. This establishes (i). Lastly, we note that since hT |LP , T |LP i
is an equilibrium model of Π(grU (P), M), M|LP is a generalised open answer set of
Π(grU (P), M) by Corollary 11, so that M = hU, T, T i is an NM-model of K.
For the converse direction, assume the hypothesis but suppose that M is not in
equilibrium. Then there is a model M0 = hU, H, T i of T ∪ st(T ) ∪ P, with H ⊂ T .
Since M0 |= P we can apply Lemma 12 to conclude that M0 |LP |= Π(grU (P), M0 ).
But clearly
Π(grU (P), M0 ) = Π(grU (P), M).
However, since evidently M0 |LT = M|LT , the strict decrease H ⊂ T must occur on
the LP -predicates, i.e. M0 |LP is strictly below M|LP ; so this shows that
M|LP is not an equilibrium model of Π(grU (P), M), and therefore T |LP is not an
answer set of Π(grU (P), M) and M is not an NM-model of K. 2
This establishes the main theorem, which covers the various special types of hybrid
KBs discussed earlier.
Example 5 Consider again the hybrid knowledge base K = (T , P), with T and P as
in Example 1. The stable closure of K, st(K) = T ∪ st(T ) ∪ P, is obtained by adding
to T ∪ P the stability axioms ∀x(AGENT(x) ∨ ¬AGENT(x)), ∀x∀y(HAS-MOTHER(x, y) ∨
¬HAS-MOTHER(x, y)), ∀x(PERSON(x) ∨ ¬PERSON(x)), and ∀x(ANIMAL(x) ∨ ¬ANIMAL(x)). 3
7 Discussion
We have seen that quantified equilibrium logic captures three of the main approaches
to integrating classical, first-order or DL knowledge bases with nonmonotonic rules
under the answer set semantics, in a modular, hybrid approach. However, QEL has a
quite distinct flavor from those of r-hybrid, r+ -hybrid and g-hybrid KBs. Each of these
hybrid approaches has a semantics composed of two different components: a classical
model on the one hand and an answer set on the other. Integration is achieved by the
fact that the classical model serves as a pre-processing tool for the rule base. The style
37
of QEL is different. There is one semantics and one kind of model that covers both
types of knowledge. There is no need for any pre-processing of the rule base. In this
sense, the integration is more far-reaching. The only distinction we make is that for that
part of the knowledge base considered to be classical and monotonic we add a stability
condition to obtain the intended interpretation.
There are other features of the approach using QEL that are worth highlighting.
First, it is based on a simple minimal model semantics in a known non-classical logic,
actually a quantified version of Gödel’s 3-valued logic. No reducts are involved and,
consequently, the equilibrium construction applies directly to arbitrary first-order the-
ories. The rule part P of a knowledge base might therefore comprise, say, a nested
logic program, where the heads and bodies of rules may be arbitrary boolean formu-
las, or perhaps rules permitting nestings of the implication connective. While answer
sets have recently been defined for such general formulas, more work would be needed
to provide integration in a hybrid KB setting.11 Evidently QEL in the general case is
undecidable, so for extensions of the rule language syntax for practical applications
one may wish to study restrictions analogous to safety or guardedness. Second, the
logic QHTs= can be applied to characterise properties such as the strong equivalence
of programs and theories [20, 27]. While strong equivalence and related concepts have
been much studied recently in ASP, their characterisation in the case of hybrid KBs
remains uncharted territory. The fact that QEL provides a single semantics for hybrid
KBs means that a simple concept of strong equivalence is applicable to such KBs and
characterisable using the underlying logic, QHTs= . In Section 9 below we describe
how QHTs= can be applied in this context.
As announced in Section 1, for finite knowledge bases the semantics can equivalently be characterised via a second-order operator NM, analogous to the operator SM of Ferraris, Lee and Lifschitz [12], which we describe in Section 8.
8 The Operator NM
If p and q are predicate constants of the same arity, p = q stands for the formula
∀x(p(x) ↔ q(x)),
and p ≤ q for ∀x(p(x) → q(x)), where x is a tuple of distinct object variables. If p and q are tuples p1 , . . . , pn and
q1 , . . . , qn of predicate constants then p = q stands for the conjunction
p1 = q1 ∧ · · · ∧ pn = qn ,
and p ≤ q for
p1 ≤ q1 ∧ · · · ∧ pn ≤ qn .
Finally, p < q is an abbreviation for p ≤ q ∧ ¬(p = q). Since the operator NM|P
introduced next yields second-order formulas, the previous notation will also be applied to tuples
of predicate variables. Let p = p1 , . . . , pn be the tuple of the (non-classical) predicates of LP
occurring in P and u = u1 , . . . , un a tuple of distinct predicate variables of corresponding
arities. By analogy with the operator SM of [12], we set
NM|P [P] = P ∧ ¬∃u((u < p) ∧ P ∗ (u)),
where P ∗ (u) is defined recursively as follows:
• pi (t1 , . . . , tm )∗ = ui (t1 , . . . , tm ) if pi 6∈ LT ;
• pi (t1 , . . . , tm )∗ = pi (t1 , . . . , tm ) if pi ∈ LT ;
• (t1 = t2 )∗ = (t1 = t2 );
• ⊥∗ = ⊥;
• (F ⊗ G)∗ = F ∗ ⊗ G∗ , where ⊗ ∈ {∧, ∨};
• (F → G)∗ = (F ∗ → G∗ ) ∧ (F → G);
• (QxF )∗ = QxF ∗ , where Q ∈ {∀, ∃}.
(There is no clause for negation here, because ¬F is treated as shorthand for F → ⊥.)
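For illustration (the one-rule program below is our own example, not taken from the article), let P = {∀x(¬p(x) → q(x))} with p ∈ LT and q ∈ LP , and let u be a predicate variable for q. Since p is classical and ¬p(x) is shorthand for p(x) → ⊥, the clause for implication yields (¬p(x))∗ = (p(x) → ⊥) ∧ (p(x) → ⊥), i.e. just ¬p(x), and we obtain

P ∗ (u) = ∀x((¬p(x) → u(x)) ∧ (¬p(x) → q(x))),

so that

NM|P [P] = P ∧ ¬∃u((u < q) ∧ ∀x((¬p(x) → u(x)) ∧ (¬p(x) → q(x)))),

expressing that q is interpreted as a minimal relation containing every element for which p fails, while p itself is not subject to minimisation.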
Theorem 15 M = hU, T i is an NM-model of K = (T , P) if and only if it satisfies T
and NM|P [P].
We assume here that both T and P are finite, so that the operator NM|P is well-defined.
Proof:
(⇒) If hU, T i, U = (D, σ), is an NM-model of K = (T , P), then hU, T i |= T , and
hU, T i |= P, and hU, T, T i is an equilibrium model of T ∪ st(T ) ∪ P. So we
only need to prove that hU, T i |= ¬∃u((u < p) ∧ P ∗ (u)). To derive a contradiction,
let us assume that
hU, T i |= ∃u((u < p) ∧ P ∗ (u))
This means that:
Fact 1: For every pi 6∈ LT , there exists a relation p̄i ⊆ Dar(pi ) such that (u < p) ∧ P ∗ (u) is
valid in the structure hU, T i where each ui is interpreted as p̄i .
If we consider the set
H = {pi (d1 , . . . , dk ) : (d1 , . . . , dk ) ∈ p̄i }∪
∪{pi (d1 , . . . , dk ) : pi ∈ LT , pi (d1 , . . . , dk ) ∈ T },
then hU, H, T i is a here-and-there structure with H ⊂ T , and satisfaction of implications under the ∗ -translation corresponds to satisfaction at the world h:
hU, H, T i, h |= ψ1 → ψ2
⇔ hU, H, T i, t |= ψ1 → ψ2 and (either hU, H, T i, h 6|= ψ1 or hU, H, T i, h |= ψ2 )
⇔ hU, T i |= ψ1 → ψ2 and (either hU, Hi 6|= ψ1∗ (p̄) or hU, Hi |= ψ2∗ (p̄))
⇔ hU, Hi |= (ψ1 → ψ2 )∗
(⇐) If hU, T i, U = (D, σ), satisfies T and NM|P [P], then trivially hU, T, T i is a
here-and-there model of the closure of K; we only need to prove that this model
is in equilibrium. By contradiction, let us assume that hU, H, T i is a here-and-
there model of the closure of K with H ⊂ T . For every pi 6∈ LT , we define
p̄i = {(d1 , . . . , dk ) : pi (d1 , . . . , dk ) ∈ H}.
Fact 3: (u < p) ∧ P ∗ (u) is valid in the structure hU, T i if the variables ui are
interpreted as p̄i .
As a consequence of Fact 3, we have that ∃u((u < p) ∧ P ∗ (u)) is satisfied by
hU, T i which contradicts that NM|P [P] is satisfied by the structure.
As in the previous item, Fact 3 is equivalent to hU, H, T i being a here-and-there
model of the closure of K, which holds by assumption; this completes the proof. 2
9 Strong Equivalence
Recall that two theories Π1 and Π2 are called strongly equivalent if for any theory Σ,
Π1 ∪ Σ and Π2 ∪ Σ have the same equilibrium models.
Theorem 16 ([20, 27]) Theories Π1 and Π2 are strongly equivalent if and only if they
are equivalent in QHTs= .
Different proofs of Theorem 16 are given in [20] and [27]. For present purposes,
the proof contained in [27] is more useful. It shows that if theories are not strongly
equivalent, the set of formulas Σ such that Π1 ∪ Σ and Π2 ∪ Σ do not have the same
equilibrium models can be chosen to have the form of implications (A → B) where A
and B are atomic. So if we are interested in the case where Π1 and Π2 are sets of rules,
Σ can also be regarded as a set of rules. We shall make use of this property below.
In the case of hybrid knowledge bases K = (T , P), various kinds of equivalence
can be specified, according to whether one or other or both of the components T and
P are allowed to vary. The following form is rather general.
Definition 8 Let K1 = (T1 , P1 ) and K2 = (T2 , P2 ) be two hybrid KBs sharing the
same structural language, i.e. LT1 = LT2 . K1 and K2 are said to be strongly equiv-
alent if for any theory T and set of rules P, (T1 ∪ T , P1 ∪ P) and (T2 ∪ T , P2 ∪ P)
have the same NM-models.
Until further notice, let us suppose that K1 = (T1 , P1 ) and K2 = (T2 , P2 ) are hybrid
KBs sharing a common structural language L, and let S = st(T1 ) = st(T2 ).
Proposition 17 K1 and K2 are strongly equivalent if and only if T1 ∪ S ∪ P1 and
T2 ∪ S ∪ P2 are equivalent in QHTs= .
Corollary 18 (a) K1 and K2 are strongly equivalent if T1 and T2 are classically equiv-
alent and P1 and P2 are equivalent in QHTs= .
(b) K1 and K2 are not strongly equivalent if T1 ∪ P1 and T2 ∪ P2 are not equivalent in
classical logic.
Proof:
(a) Assume the hypothesis. Since K1 = (T1 , P1 ) and K2 = (T2 , P2 ) share a common
structural language L, it follows that st(T1 ) = st(T2 ) = S, say. Since T1 and T2 are
classically equivalent, T1 ∪ S and T2 ∪ S have the same (total) QHTs= -models and
so for any T also T1 ∪ T ∪ S ∪ st(T ) and T2 ∪ T ∪ S ∪ st(T ) have the same (total)
QHTs= -models. Since P1 and P2 are equivalent in QHTs= it follows also that for any
P, (T1 ∪ T ∪ S ∪ st(T ) ∪ P1 ∪ P) and (T2 ∪ T ∪ S ∪ st(T ) ∪ P2 ∪ P) have the same
QHTs= -models and hence the same equilibrium models. The conclusion follows by
Theorem 13.
(b) Suppose that T1 ∪ P1 and T2 ∪ P2 are not equivalent in classical logic. Assume
again that st(T1 ) = st(T2 ) = S, say. Then clearly T1 ∪ S ∪ P1 and T2 ∪ S ∪ P2 are
not classically equivalent and hence they cannot be QHTs= -equivalent. Applying the
second part of the proof of Proposition 17 completes the argument. 2
Special cases of strong equivalence arise when hybrid KBs are based on the same
classical theory, say, or share the same rule base. That is, (T , P1 ) and (T , P2 ) are
strongly equivalent if P1 and P2 are QHTs= -equivalent.12 Analogously:
(T1 , P) and (T2 , P) are strongly equivalent if T1 and T2 are classically equivalent.
(8)
Let us briefly comment on a restriction that we imposed on strong equivalence,
namely that the KBs in question share a common structural language. Intuitively the
reason for this is that the structural language LT associated with a hybrid knowledge
base K = (T , P) is part of its identity or ‘meaning’. Precisely the predicates in LT are
the ones treated classically. In fact, another KB, K0 = (T 0 , P), where T 0 is completely
equivalent to T in classical logic, may have a different semantics if LT 0 is different
from LT . To see this, let us consider a simple example in propositional logic. Let
K1 = (T1 , P1 ) and K2 = (T2 , P2 ), be two hybrid KBs where P1 = P2 = {(p →
q)}, T1 = {(r ∧ (r ∨ p))}, T2 = {r}. Clearly, T1 and T2 are classically and even
QHTs= -equivalent. However, K1 and K2 are not even in a weak sense semantically
equivalent. st(T1 ) = {r ∨ ¬r; p ∨ ¬p}, while st(T2 ) = {r ∨ ¬r}. It is easy to check
that T1 ∪ st(T1 ) ∪ P1 and T2 ∪ st(T2 ) ∪ P2 have different QHTs= -models, different
equilibrium models and (hence) K1 and K2 have different NM-models. So we see
that without the assumption of a common structural language, the natural properties
expressed in Corollary 18 (a) and (8) would no longer hold.
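Such claims can be checked mechanically. The following minimal propositional here-and-there model checker is entirely our own illustration (formulas are encoded as nested tuples); it confirms that the two knowledge bases above have different QHTs= models:

    from itertools import chain, combinations

    def ht_sat(H, T, w, f):
        # satisfaction in a here-and-there model (H, T), H <= T, at world w in {'h','t'}
        I = H if w == 'h' else T
        op = f[0]
        if op == 'atom': return f[1] in I
        if op == 'and':  return ht_sat(H, T, w, f[1]) and ht_sat(H, T, w, f[2])
        if op == 'or':   return ht_sat(H, T, w, f[1]) or ht_sat(H, T, w, f[2])
        if op == 'not':  return not ht_sat(H, T, 't', f[1])   # M,w |= ~A iff M,t |/= A
        if op == 'imp':                                       # 'here' also checks 'there'
            here = (not ht_sat(H, T, w, f[1])) or ht_sat(H, T, w, f[2])
            return here if w == 't' else (here and ht_sat(H, T, 't', f))
        raise ValueError(op)

    def ht_models(atoms, theory):
        subs = [frozenset(s) for s in chain.from_iterable(
            combinations(sorted(atoms), r) for r in range(len(atoms) + 1))]
        return {(H, T) for T in subs for H in subs
                if H <= T and all(ht_sat(H, T, 'h', f) for f in theory)}

    A = lambda x: ('atom', x)
    lem = lambda x: ('or', A(x), ('not', A(x)))     # stability axiom p v ~p
    P = [('imp', A('p'), A('q'))]
    K1 = [('and', A('r'), ('or', A('r'), A('p'))), lem('r'), lem('p')] + P
    K2 = [A('r'), lem('r')] + P
    print(ht_models({'p', 'q', 'r'}, K1) == ht_models({'p', 'q', 'r'}, K2))  # False

For instance, (H, T ) = ({r}, {r, p, q}) is a model of T2 ∪ st(T2 ) ∪ P2 but violates the stability axiom p ∨ ¬p contained in st(T1 ).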
It is interesting to note here that meaning-preserving relations among ontologies
have recently become a topic of interest in the description logic community where
logical concepts such as that of conservative extension are currently being studied and
applied [13]. A unified, logical approach to hybrid KBs such as that developed here
should lend itself well to the application of such concepts.
10 Conclusions
In this article we have shown a correspondence between the modular
nonmonotonic models of K and the equilibrium models of what we call the stable
closure of K. This yields a way to capture in QEL the semantics of the g-hybrid KBs
closure of K. This yields a way to capture in QEL the semantics of the g-hybrid KBs
of Heymans et al. [16] and the r-hybrid KBs of Rosati [30], where the latter is defined
without the UNA but for safe programs. Similarly, the version of QEL with UNA
captures the semantics of r-hybrid KBs as defined in [29, 31]. It is important to note
that the aim of this paper was not that of providing new kinds of safety conditions
or decidability results; these issues are ably dealt with in the literature reviewed here.
Rather our objective has been to show how classical and nonmonotonic theories might
be unified under a single semantical model. In part, as [16] show with their reduction of
DL knowledge bases to open answer set programs, this can also be achieved (at some
cost of translation) in other approaches. What distinguishes QEL is the fact that it is
based on a standard, nonclassical logic, QHTs= , which can therefore provide a unified
logical foundation for such extensions of (open) ASP. To illustrate the usefulness of our
framework we showed how the logic QHTs= also captures a natural concept of strong
equivalence between hybrid knowledge bases.
There are several other approaches to combining languages for Ontologies with
nonmonotonic rules, which can be divided into two main streams [3]: approaches which
define the integration of rules and ontologies (a) by entailment, i.e. querying classical
knowledge bases through special predicates in the rule bodies, and (b) on the basis of
single models, i.e. defining a common notion of combined model.
The most prominent of the former kind of approaches are dl-programs [10] and
their generalization, HEX-programs [9]. Although both approaches are based
on Answer Set Programming, like ours, the orthogonal view of integration by
entailment can probably not be captured by a simple embedding into QEL. Another such
approach which allows querying classical KBs from a nonmonotonic rules language is
based on Defeasible Logic [33].
As for the second stream, variants of Autoepistemic Logic [4], and the logic of
minimal knowledge and negation as failure (MKNF) [23] have been recently proposed
in the literature. Similar to our approach, both these approaches embed a combined
knowledge base in a unifying logic. However, both use modal logics which
syntactically and semantically extend first-order logic. Thus, in these approaches,
embedding of the classical part of the theory is trivial, whereas the nonmonotonic rules
part needs to be rewritten in terms of modal formulas. Our approach is orthogonal,
as we use a non-classical logic where the nonmonotonic rules are trivially embedded,
but the stable closure guarantees classical behavior of certain predicates. In addition,
the fact that we include the stable closure ensures that the predicates from the classical
parts of the theory behave classically, also when used in rules with negation. In con-
trast, in both modal approaches occurrences of classical predicates are not interpreted
classically, as illustrated in the following example.
entailed from T ∪ τHP (P) under any stable expansion, and so LA(b) is false, and
thus r is necessarily true in every model. We thus have that r is a consequence of
T ∪ τHP (P).
Similar considerations apply to the hybrid MKNF knowledge bases of [23].
Acknowledgements
Part of the results in this paper are contained, in preliminary form, in the proceedings
of the 1st International Conference on Web Reasoning and Rule Systems (RR2007),
and in the informal proceedings of the RuleML-06 Workshop on Ontology and Rule
Integration. The authors thank the anonymous reviewers of those preliminary versions
of the article for their helpful comments. This research has been partially supported
13 One of the authors of the present paper is in fact chairing the W3C SPARQL working group, in which
this topic has been actively discussed at the time of writing.
by the Spanish MEC (now MCI) under the projects TIC-2003-9001, TIN2006-15455-
CO3, and CSD2007-00022, also by the project URJC-CM-2006-CET-0300 and by
the European Commission under the projects Knowledge Web (IST-2004-507482) and
inContext (IST-034718), as well as by Science Foundation Ireland under Grant No.
SFI/08/CE/I1380 (Lion-2).
References
[1] Baral, C. (2002), Knowledge Representation, Reasoning and Declarative Prob-
lem Solving, Cambridge University Press.
[2] Cabalar, P., Odintsov, S. P., Pearce, D., and Valverde, A. (2006), Analysing and
extending well-founded and partial stable semantics using partial equilibrium
logic, in ‘Proceedings of the 22nd International Conference on Logic Program-
ming (ICLP 2006)’, Vol. 4079 of Lecture Notes in Computer Science, Springer,
Seattle, WA, USA, pp. 346–360.
[3] de Bruijn, J., Eiter, T., Polleres, A., and Tompits, H. (2006), On representational
issues about combinations of classical theories with nonmonotonic rules, in ‘Pro-
ceedings of the First International Conference on Knowledge Science, Engineer-
ing and Management (KSEM’06)’, number 4092 in ‘Lecture Notes in Computer
Science’, Springer-Verlag, Guilin, China.
[4] de Bruijn, J., Eiter, T., Polleres, A., and Tompits, H. (2007), Embedding non-
ground logic programs into autoepistemic logic for knowledge-base combination,
in ‘Proceedings of the Twentieth International Joint Conference on Artificial In-
telligence (IJCAI-07)’, AAAI, Hyderabad, India, pp. 304–309.
[5] de Bruijn, J., Pearce, D., Polleres, A., and Valverde, A. (2007), Quantified equilib-
rium logic and hybrid rules, in ‘First International Conference on Web Reasoning
and Rule Systems (RR2007)’, Vol. 4524 of Lecture Notes in Computer Science,
Springer, Innsbruck, Austria, pp. 58–72.
[6] de Bruijn, J., Eiter, T., and Tompits, H. (2008), Embedding approaches to com-
bining rules and ontologies into autoepistemic logic, in ‘Proceedings of the 11th
International Conference on Principles of Knowledge Representation and Rea-
soning (KR2008)’, AAAI, Sydney, Australia, pp. 485–495.
[7] Drabent, W., Henriksson, J., and Maluszynski, J. (2007), Hybrid reasoning with
rules and constraints under well-founded semantics, in ‘First International Con-
ference on Web Reasoning and Rule Systems (RR2007)’, Vol. 4524 of Lecture
Notes in Computer Science, Springer, Innsbruck, Austria, pp. 348–357.
[8] Eiter, T., Fink, M., Tompits, H., and Woltran, S. (2005), Strong and uniform
equivalence in answer-set programming: Characterizations and complexity re-
sults for the non-ground case, in ‘Proceedings of the Twentieth National Con-
ference on Artificial Intelligence and the Seventeenth Innovative Applications of
Artificial Intelligence Conference’, pp. 695–700.
46
[9] Eiter, T., Ianni, G., Schindlauer, R., and Tompits, H. (2005), A uniform integra-
tion of higher-order reasoning and external evaluations in answer-set program-
ming, in ‘IJCAI 2005’, pp. 90–96.
[10] Eiter, T., Lukasiewicz, T., Schindlauer, R., and Tompits, H. (2004), Combining
answer set programming with description logics for the semantic Web, in ‘Pro-
ceedings of the Ninth International Conference on Principles of Knowledge Rep-
resentation and Reasoning (KR’04)’.
[11] Ensan, F. and Du, W. (2009), A knowledge encapsulation approach to ontology
modularization. Knowledge and Information Systems Online First, to appear.
[12] Ferraris, P., Lee, J., and Lifschitz, V. (2007), A new perspective on stable mod-
els, in ‘Proceedings of the Twentieth International Joint Conference on Artificial
Intelligence (IJCAI-07)’, AAAI, Hyderabad, India, pp. 372–379.
[13] Ghilardi, S., Lutz, C., and Wolter, F. (2006), Did I damage my ontology? A case
for conservative extensions in description logics, in ‘Proceedings of the Tenth
International Conference on Principles of Knowledge Representation and Rea-
soning (KR’06)’, pp. 187–197.
[14] Heymans, S. (2006), Decidable Open Answer Set Programming, PhD thesis, The-
oretical Computer Science Lab (TINF), Department of Computer Science, Vrije
Universiteit Brussel, Brussels, Belgium.
[15] Heymans, S., Nieuwenborgh, D. V., and Vermeir, D. (2005), Guarded Open
Answer Set Programming, in ‘8th International Conference on Logic Program-
ming and Non Monotonic Reasoning (LPNMR 2005)’, volume 3662 in ‘LNAI’,
Springer, pp. 92–104.
[16] Heymans, S., Predoiu, L., Feier, C., de Bruijn, J., and van Nieuwenborgh, D.
(2006), G-hybrid knowledge bases, in ‘Workshop on Applications of Logic Pro-
gramming in the Semantic Web and Semantic Web Services (ALPSWS 2006)’.
[17] Jing, Y., Jeong, D., and Baik, D.-K. (2009), SPARQL graph pattern rewriting for
OWL-DL inference queries. Knowledge and Information Systems 20, pp. 243–
262.
[18] Knorr, M., Alferes, J., and Hitzler, P. (2008), A coherent well-founded model
for hybrid MKNF knowledge bases, in ‘18th European Conference on Artificial
Intelligence (ECAI2008)’, Vol. 178 of Frontiers in Artificial Intelligence and
Applications, IOS Press, pp. 99–103.
[19] Lifschitz, V., Pearce, D., and Valverde, A. (2001), ‘Strongly equivalent logic pro-
grams’, ACM Transactions on Computational Logic 2(4), 526–541.
[20] Lifschitz, V., Pearce, D., and Valverde, A. (2007), A characterization of strong
equivalence for logic programs with variables, in ‘9th International Conference
on Logic Programming and Nonmonotonic Reasoning (LPNMR))’, Vol. 4483 of
Lecture Notes in Computer Science, Springer, Tempe, AZ, USA, pp. 188–200.
47
[21] Lifschitz, V. and Woo, T. (1992), Answer sets in general nonmonotonic reasoning
(preliminary report), in B. Nebel, C. Rich and W. Swartout, eds, ‘KR’92. Prin-
ciples of Knowledge Representation and Reasoning: Proceedings of the Third
International Conference’, Morgan Kaufmann, San Mateo, California, pp. 603–
614.
[22] Lin, F. (2002), Reducing strong equivalence of logic programs to entailment in
classical propositional logic, in ‘Proceedings of the Eighth International Con-
ference on Principles of Knowledge Representation and Reasoning (KR’02)’,
pp. 170–176.
[23] Motik, B. and Rosati, R. (2007), A faithful integration of description logics with
logic programming, in ‘Proceedings of the Twentieth International Joint Confer-
ence on Artificial Intelligence (IJCAI-07)’, AAAI, Hyderabad, India, pp. 477–
482.
[24] Pearce, D. (1997), A new logical characterization of stable models and answer
sets, in ‘Proceedings of NMELP 96’, Vol. 1216 of Lecture Notes in Computer
Science, Springer, pp. 57–70.
[25] Pearce, D. (2006), ‘Equilibrium logic’, Annals of Mathematics and Artificial In-
telligence 47, 3–41.
[26] Pearce, D. and Valverde, A. (2005), ‘A first-order nonmonotonic extension of
constructive logic’, Studia Logica 80, 321–346.
[27] Pearce, D. and Valverde, A. (2006), Quantified equilibrium logic, Technical report,
Universidad Rey Juan Carlos. in press.
[28] Polleres, A. (2007), From SPARQL to Rules (and back), in ‘WWW 2007’,
pp. 787–796.
[29] Rosati, R. (2005a), ‘On the decidability and complexity of integrating ontologies
and rules’, Journal of Web Semantics 3(1), 61–73.
[30] Rosati, R. (2005b), Semantic and computational advantages of the safe integra-
tion of ontologies and rules, in ‘Proceedings of the Third International Workshop
on Principles and Practice of Semantic Web Reasoning (PPSWR 2005)’, Vol.
3703 of Lecture Notes in Computer Science, Springer, pp. 50–64.
[31] Rosati, R. (2006), DL + log: Tight integration of description logics and disjunc-
tive datalog, in ‘Proceedings of the Tenth International Conference on Principles
of Knowledge Representation and Reasoning (KR’06)’, pp. 68–78.
[32] van Dalen, D. (1983), Logic and Structure, Springer.
[33] Wang, K., Billington, D., Blee, J., and Antoniou, G. (2004), Combining descrip-
tion logic and defeasible logic for the semantic Web, in ‘Proceedings of the Third
International Workshop Rules and Rule Markup Languages for the Semantic Web
(RuleML 2004)’, Vol. 3323 of Lecture Notes in Computer Science, Springer,
pp. 170–181.
From SPARQL to Rules (and back)∗
Published in Proceedings of the 16th World Wide Web Conference (WWW2007), pp. 787–796,
May 2007, ACM Press
Axel Polleres
Digital Enterprise Research Institute, National University of Ireland, Galway
axel@polleres.net
Abstract
As the data and ontology layers of the Semantic Web stack have achieved a
certain level of maturity in standard recommendations such as RDF and OWL,
the current focus lies on two related aspects. On the one hand, the definition of
a suitable query language for RDF, SPARQL, is close to recommendation status
within the W3C. The establishment of the rules layer on top of the existing stack
on the other hand marks the next step to be taken, where languages with their roots
in Logic Programming and Deductive Databases are receiving considerable atten-
tion. The purpose of this paper is threefold. First, we discuss the formal semantics
of SPARQL extending recent results in several ways. Second, we provide transla-
tions from SPARQL to Datalog with negation as failure. Third, we propose some
useful and easy to implement extensions of SPARQL, based on this translation.
As it turns out, the combination serves for direct implementations of SPARQL on
top of existing rules engines as well as a basis for more general rules and query
languages on top of RDF.
1 Introduction
After the data and ontology layers of the Semantic Web stack have achieved a certain
level of maturity in standard recommendations such as RDF and OWL, the query and
the rules layers seem to be the next building-blocks to be finalized. For the first part,
SPARQL [18], W3C’s proposed query language, seems to be close to recommenda-
tion, though the Data Access working group is still struggling with defining aspects
such as a formal semantics or layering on top of OWL and RDFS. As for the second
∗ An extended technical report version of this article is available at
http://www.polleres.net/TRs/GIA-TR-2006-11-28.pdf. This work was
mainly conducted under a Spanish MEC grant at Universidad Rey Juan Carlos, Móstoles, Spain.
part, the RIF working group1 , which is responsible for the rules layer, is just producing
its first concrete results. Besides aspects like business rules exchange or reactive rules,
deductive rules languages on top of RDF and OWL are of special interest to the RIF
group. One such deductive rules language is Datalog, which has been successfully ap-
plied in areas such as deductive databases and thus might be viewed as a query language
itself. Let us briefly recap our starting points:
Datalog and SQL. Analogies between Datalog and relational query languages such as
SQL are well-known and -studied. Both formalisms cover UCQ (unions of conjunctive
queries), where Datalog adds recursion, particularly unrestricted recursion involving
nonmonotonic negation (aka unstratified negation as failure). Still, SQL is often viewed
as more powerful in several respects. On the one hand, the lack of recursion in SQL has
been partly addressed in the standard’s 1999 version [20]. On the other hand, aggregates and
external function calls are missing in pure Datalog. However, the Datalog side is
evolving as well: recent extensions of Datalog towards Answer Set Programming (ASP)
– a logic programming paradigm extending and building on top of
Datalog – have solved many of these issues, for instance by defining a declarative
semantics for aggregates [9] and external predicates [8].
The Semantic Web rules layer. Remarkably, logic programming dialects such as
Datalog with nonmonotonic negation which are covered by Answer Set Programming
are often viewed as a natural basis for the Semantic Web rules layer [7]. Current ASP
systems offer extensions for retrieving RDF data and querying OWL knowledge bases
from the Web [8]. Particular concerns in the Semantic Web community exist with
respect to adding rules including nonmonotonic negation [3] which involve a form of
closed world reasoning on top of RDF and OWL which both adopt an open world
assumption. Recent proposals for solving this issue suggest a “safe” use of negation as
failure over finite contexts only for the Web, also called scoped negation [17].
The Semantic Web query layer – SPARQL. Since we base our considerations in this
paper on the assumption that similar correspondences as between SQL and Datalog
can be established for SPARQL, we have to observe that SPARQL inherits a lot from
SQL, but there also remain substantial differences: On the one hand, SPARQL does
not deal with nested queries or recursion, a detail which is indeed surprising given
that SPARQL is a graph query language on RDF, where typical recursive queries,
such as the transitive closure of a property, might seem very useful. Likewise, aggregation
(such as count, average, etc.) of object values in RDF triples, which might appear useful,
has not yet been included in the current standard. On the other hand, subtleties like
blank nodes (aka bNodes), or optional graph patterns, which are similar but (as we
will see) not identical to outer joins in SQL or relational algebra, are not straightforwardly
translatable to Datalog.
The goal of this paper is to shed light on the actual relation between declarative
rules languages such as Datalog and SPARQL, and by this also provide valuable input
for the currently ongoing discussions on the Semantic Web rules layer, in particular its
integration with SPARQL, taking the likely direction into account that LP style rules
languages will play a significant role in this context.
1 http://www.w3.org/2005/rules/wg
Although the SPARQL specification does not seem 100% stable at the current
point, just having taken a step back from candidate recommendation to working draft,
we think that it is not too early for this exercise, as we will gain valuable insights
and positive side effects by our investigation. More precisely, the contributions of the
present work are:
• We define three alternative semantics for SPARQL, based on different ways of
joining possibly unbound variables, which we call bravely joining (b-joining),
cautiously joining (c-joining), and strictly joining (s-joining) semantics, extending
the formal treatment of [16].
• Based on the three semantic variants, we provide translations from a large fragment of SPARQL queries to Datalog, which give rise to implementations of
SPARQL on top of existing engines.
• We provide some straightforward extensions of SPARQL such as a set differ-
ence operator MINUS, and nesting of ASK queries in FILTER expressions.
IRIs.
3 Following SPARQL, we are slightly more general than the original RDF specification in that we allow
# Graph: ex.org/bob
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix bob: <ex.org/bob#> .
<ex.org/bob> foaf:maker _:a .
_:a a foaf:Person ; foaf:name "Bob" ;
    foaf:knows _:b .

# Graph: alice.org
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix alice: <alice.org#> .
alice:me a foaf:Person ; foaf:name "Alice" ;
    foaf:knows _:c .
_:c foaf:name "Bob" .

Figure 1: Two RDF graphs in TURTLE notation and a simple SPARQL query.
For the remainder of this paper, we will mostly ignore solution modifiers; thus we will usually write queries as
triples Q = (V, P, DS), and will use the syntax for graph patterns introduced below.
Result Forms Since we will, to a large extent, restrict ourselves to SELECT queries,
it is sufficient for our purposes to describe result forms by sets of variables. Other result
forms will be discussed in Sec. 5. For instance, let Q = (V, P, DS) denote the query
from Fig. 1, then V = {?X, ?Y }. Query results in SPARQL are given by partial,
i.e. possibly incomplete, substitutions of variables in V by RDF terms. In traditional
relational query languages, such incompleteness is usually expressed using null values.
Using such null values we will write solutions as tuples where the order of columns is
determined by lexicographically ordering the variables in V . Given a set of variables
V , let V also denote the tuple obtained from lexicographically ordering V .
The query from Fig. 1 with result form V = (?X, ?Y ) then has solution tuples
(”Bob”, _:a), (”Alice”, alice.org#me), (”Bob”, _:c). We write substitutions in
square brackets, so these tuples correspond to the substitutions [?X → ”Bob”, ?Y →
_:a], [?X → ”Alice”, ?Y → alice.org#me], and [?X → ”Bob”, ?Y → _:c],
respectively.
Graph Patterns We follow the recursive definition of graph patterns P from [16]:
of readability and in order to keep with the operator style definition of [16]. MINUS is syntactically not
present at all, but we will suggest a syntax extension for this particular keyword in Sec. 5.
• a triple pattern t is a graph pattern.
• if P1 and P2 are graph patterns, then (P1 AND P2 ), (P1 UNION P2 ), (P1 OPT P2 ),
and (P1 MINUS P2 ) are graph patterns.
• if P is a graph pattern and i ∈ I ∪ V ar, then (GRAPH i P ) is a graph pattern.
• if P is a graph pattern and R is a filter expression then (P FILTER R) is a graph
pattern.
For any pattern P , we denote by vars(P ) the set of all variables occurring in P . As
atomic filter expression, SPARQL allows the unary predicates BOUND, isBLANK,
isIRI, isLITERAL, binary equality predicates ’=’ for literals, and other features such
as comparison operators, data type conversion and string functions which we omit
here, see [18, Sec. 11.3] for details. Complex filter expressions can be built using the
connectives ’¬’,’∧’,’∨’.
Example 7 The restriction of GRAPH patterns to the named graphs in Gn becomes obvious in the following query with dataset DS =
({ex.org/bob}, ∅), which has an empty solution set.
SELECT ?N WHERE {?G foaf:maker ?M .
GRAPH ?G { ?X foaf:name ?N } }
We will sometimes find the following assumption convenient to avoid such arguably
unintuitive effects:
Definition 9 (Dataset closedness assumption) Given a dataset DS = (G, Gn ), Gn
implicitly contains (i) all graphs mentioned in G and (ii) all IRIs mentioned explicitly
in the graphs corresponding to G.
Under this assumption, the previous query has both (”Alice”) and (”Bob”) in its so-
lution set.
Some more remarks are in order concerning FILTER expressions. According to
the SPARQL specification “Graph pattern matching creates bindings of variables
[where] it is possible to further restrict solutions by constraining the allowable bind-
ings of variables to RDF Terms [with FILTER expressions].” However, it is not clearly
specified how to deal with filter constraints referring to variables which do not appear in
simple graph patterns. In this paper, for graph patterns of the form (P FILTER R) we
tacitly assume safe filter expressions, i.e. that all variables used in a filter expression R
also appear in the corresponding pattern P . This corresponds to the notion of safety
in Datalog (see Sec. 3), where the built-in predicates (which obviously correspond to
filter predicates) do not suffice to make unbound variables safe.
Moreover, the specification defines errors to avoid mistyped comparisons, or evalu-
ation of built-in functions over unbound values, i.e. “any potential solution that causes
an error condition in a constraint will not form part of the final results, but does not
cause the query to fail.” These errors propagate over the whole FILTER expression,
also over negation, as shown by the following example.
Example 8 Assuming the dataset does not contain triples for the foaf:dummy property, the example query
SELECT ?X
WHERE { {?X a foaf:Person .
OPTIONAL { ?X foaf:dummy ?Y . } }
FILTER ( ¬(isLITERAL (?Y)) ) }
would discard any solution for ?X, since the unbound value for ?Y causes an error in
the isLITERAL expression and thus the whole FILTER expression returns an error.
We will take special care for these errors, when defining the semantics of FILTER
expressions later on.
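The three-valued behaviour sketched here can be mimicked in a few lines of Python (a toy rendering under our own naming; literals are modelled as plain strings and unbound values as None):

    TRUE, FALSE, ERR = "true", "false", "err"

    def is_literal(value):
        if value is None:                  # unbound variable -> error
            return ERR
        return TRUE if isinstance(value, str) else FALSE

    def neg(v):                            # errors propagate over negation
        return {TRUE: FALSE, FALSE: TRUE, ERR: ERR}[v]

    # FILTER ( not isLITERAL(?Y) ) with ?Y unbound, as in Example 8:
    print(neg(is_literal(None)))           # 'err' -> the solution is discarded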
Our definitions of the semantics of graph patterns slightly differ from [16] in the way we define joining of unbound variables. Moreover, we will refine
their notion of FILTER satisfaction in order to deal with error propagation properly.
We denote by Tnull the union I ∪ B ∪ L ∪ {null}, where null is a dedicated constant
denoting the unknown value not appearing in any of I, B, or L, as commonly
introduced when defining outer joins in relational algebra.
A substitution θ from V ar to Tnull is a partial function θ : V ar → Tnull . We write
substitutions in postfix notation: For a triple pattern t = (s, p, o) we denote by tθ
the triple (sθ, pθ, oθ) obtained by applying the substitution to all variables in t. The
domain of θ, dom(θ), is the subset of V ar where θ is defined. For a substitution θ and
a set of variables D ⊆ V ar we define the substitution θD with domain D as follows:
xθD = xθ if x ∈ dom(θ) ∩ D, and xθD = null if x ∈ D \ dom(θ).
Analogously to [16] we define join, union, difference, and outer join between two
sets of substitutions Ω1 and Ω2 over domains D1 and D2 , respectively, all except union
parameterized by x ∈ {b,c,s}:
Ω1 ./x Ω2 = {θ1 ∪ θ2 | θ1 ∈ Ω1 , θ2 ∈ Ω2 , are x-compatible}
Ω1 ∪ Ω2 = {θ | ∃θ1 ∈ Ω1 with θ = θ1D1 ∪D2 or
∃θ2 ∈ Ω2 with θ = θ2D1 ∪D2 }
Ω1 −x Ω2 = {θ ∈ Ω1 | ∀θ2 ∈ Ω2 , θ and θ2 not x-compatible}
Ω1 A./x Ω2 = (Ω1 ./x Ω2 ) ∪ (Ω1 −x Ω2 )
Here, two substitutions θ1 over D1 and θ2 over D2 are said to be s-compatible if
xθ1 = xθ2 6= null for all x ∈ D1 ∩ D2 ; c-compatible if xθ1 = xθ2 for all x ∈ D1 ∩ D2 ,
i.e. null is treated like an ordinary value, unifiable only with null; and b-compatible if
for all x ∈ D1 ∩ D2 either xθ1 = xθ2 , or xθ1 = null, or xθ2 = null, where in θ1 ∪ θ2
each variable is bound to its non-null value whenever one exists.
Definition 10 (Evaluation, extends [16, Def. 2]) Let t = (s, p, o) be a triple pattern,
P, P1 , P2 graph patterns, DS = (G, Gn ) a dataset, and i ∈ Gn , and v ∈ V ar, then
the x-joining evaluation [[·]]xDS is defined as follows:
[[t]]xDS = {θ | dom(θ) = vars(t) and tθ ∈ G}
[[P1 AND P2 ]]xDS = [[P1 ]]xDS ./x [[P2 ]]xDS
[[P1 UNION P2 ]]xDS = [[P1 ]]xDS ∪ [[P2 ]]xDS
[[P1 OPT P2 ]]xDS = [[P1 ]]xDS A./x [[P2 ]]xDS
[[P1 MINUS P2 ]]xDS = [[P1 ]]xDS −x [[P2 ]]xDS
[[P FILTER R]]xDS = {θ ∈ [[P ]]xDS | Rθ = >}
where the valuation Rθ ∈ {>, ⊥, ε} of a filter expression R under θ follows [18,
Sec. 11.3]:6 Rθ = ε if an error occurs during evaluation (e.g. a built-in predicate is
applied to an unbound value), and Rθ = ⊥ otherwise.
We will now exemplify the three different semantics defined above, namely bravely
joining (b-joining), cautiously joining (c-joining), and strictly-joining (s-joining) se-
mantics. When taking a closer look at the AND and MINUS operators, one will realize
that all three semantics take a slightly differing view only when joining null. Indeed,
6 > stands for “true”, ⊥ stands for “false” and ε stands for errors, see [18, Sec. 11.3] and Example 8 for
details.
the AND operator behaves as the traditional natural join operator ./ in relational alge-
bra, when no null values are involved.
Take for instance, DS = ({ex.org/bob, alice.org}, ∅) and P = ((?X, name, ?N ame)
AND (?X, knows, ?F riend)). When viewing each solution set as a relational table
with variables denoting attribute names, we can write:
  ?X            ?Name              ?X            ?Friend
  _:a           ”Bob”              _:a           _:b
  alice.org#me  ”Alice”      ./    alice.org#me  _:c
  _:c           ”Bob”

     ?X            ?Name     ?Friend
  =  _:a           ”Bob”     _:b
     alice.org#me  ”Alice”   _:c
Differences between the three semantics appear when joining over null-bound vari-
ables, as shown in the next example.
Example 9 Let DS be as before and assume the following query which might be con-
sidered a naive attempt to ask for pairs of persons ?X1, ?X2 who share the same name
and nickname where both, name and nickname are optional:
P = ( ((?X1, a, Person) OPT (?X1, name, ?N )) AND
((?X2, a, Person) OPT (?X2, nick, ?N )) )
Again, we consider the tabular view of the resulting join:
  ?X1           ?N                 ?X2           ?N
  _:a           ”Bob”              _:a           null
  _:b           null         ./x   _:b           ”Alice”
  _:c           ”Bob”              _:c           ”Bobby”
  alice.org#me  ”Alice”            alice.org#me  null
Now, let us see what happens when we evaluate the join ./x with respect to the
different semantics. The following result table lists in the last column which tuples
belong to the result of b-, c- and s-join, respectively.
     ?X1           ?N        ?X2
     _:a           ”Bob”     _:a            b
     _:a           ”Bob”     alice.org#me   b
     _:b           null      _:a            b,c
     _:b           ”Alice”   _:b            b
     _:b           ”Bobby”   _:c            b
  =  _:b           null      alice.org#me   b,c
     _:c           ”Bob”     _:a            b
     _:c           ”Bob”     alice.org#me   b
     alice.org#me  ”Alice”   _:a            b
     alice.org#me  ”Alice”   _:b            b,c,s
     alice.org#me  ”Alice”   alice.org#me   b
Leaving aside the question whether the query formulation was intuitively broken, we
remark that only the s-join would have the expected result. At the very least we might
argue, that the liberal behavior of b-joins might be considered surprising in some cases.
The c-joining semantics acts a bit more cautiously, in between the two, treating null
values as normal values, only unifiable with other null values.
Compared to how joins over incomplete relations are treated in common relational
database systems, the s-joining semantics might be considered the intuitive behavior.
Another interesting divergence (which would rather suggest to adopt the c-joining se-
mantics) shows up when we consider a simple idempotent join.
Example 10 Let us consider the following single triple dataset
DS = ({(alice.org#me, a, Person)}, ∅) and the following simple query pattern:
P = ((?X, a, Person) UNION (?Y, a, Person))
Clearly, this pattern has the solution set
[[P ]]xDS = {(alice.org#me, null), (null, alice.org#me)}
under all three semantics. Surprisingly, P 0 = (P AND P ) has different solution sets
for the different semantics. First, [[P 0 ]]cDS = [[P ]]xDS , but [[P 0 ]]sDS = ∅, since null
values are not compatible under the s-joining semantics. Finally,
[[P 0 ]]bDS = {(alice.org#me, null), (null, alice.org#me),
(alice.org#me, alice.org#me)}
As shown by this example, under the reasonable assumption, that the join operator
is idempotent, i.e., (P ./ P ) ≡ P , only the c-joining semantics behaves correctly.
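The differences are easy to replay programmatically. The compact Python sketch below (our own rendering: NULL, the dict encoding of substitutions, and all names are illustrative assumptions, not part of the formal development) implements the three compatibility notions and the join operator, and reproduces the solution sets of Example 10:

    NULL = None  # the dedicated null constant

    def compatible(t1, t2, x):
        # x-compatibility of two substitutions (dicts), for x in {'b', 'c', 's'}
        for v in t1.keys() & t2.keys():
            a, b = t1[v], t2[v]
            if x == "b" and not (a == b or a is NULL or b is NULL):
                return False
            if x == "c" and a != b:                  # null only joins null
                return False
            if x == "s" and (a != b or a is NULL):   # null joins nothing at all
                return False
        return True

    def join(O1, O2, x):
        out = []
        for t1 in O1:
            for t2 in O2:
                if compatible(t1, t2, x):
                    out.append({v: t1.get(v) if t1.get(v) is not NULL else t2.get(v)
                                for v in t1.keys() | t2.keys()})
        return out

    # Example 10: [[P]] = {(alice.org#me, null), (null, alice.org#me)}
    P = [{"X": "alice.org#me", "Y": NULL}, {"X": NULL, "Y": "alice.org#me"}]
    for x in "bcs":
        print(x, join(P, P, x))
    # b: adds {X: alice.org#me, Y: alice.org#me} (modulo duplicates)
    # c: reproduces P itself (modulo duplicates); s: empty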
However, the brave b-joining behavior is advocated by the current SPARQL docu-
ment, and we might also think of examples where this obviously makes a lot of sense.
Especially, when considering no explicit joins, but the implicit joins within the OPT
operator:
Example 11 Let DS = ({ex.org/bob, alice.org}, ∅) and assume a slight variant
of a query from [5] which asks for persons and some names for these persons, where
preferably the foaf:name is taken and, if not specified, foaf:nick:
P = (((?X, a, Person) OPT (?X, name, ?XNAME)) OPT (?X, nick, ?XNAME))
Only [[P]]^b_DS contains the expected solution (_:b, "Alice") for the bNode _:b.
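In concrete SPARQL syntax (a sketch, assuming the foaf: vocabulary for the Person class and the name/nick properties), this pattern would read roughly as:

    SELECT ?X ?XNAME
    WHERE { ?X rdf:type foaf:Person .
            OPTIONAL { ?X foaf:name ?XNAME }
            OPTIONAL { ?X foaf:nick ?XNAME } }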
All three semantics may be considered variations of the original definitions
in [16], for which the authors proved complexity results and various desirable features,
such as semantics-preserving normal form transformations and compositionality. The
following proposition shows that all these results carry over to the normative b-joining
semantics:

Proposition 19 Given a dataset DS and a pattern P which does not contain GRAPH
patterns, the solutions of [[P]]_DS as in [16] and [[P]]^b_DS are in 1-to-1 correspondence.

Proof. Given DS and P, each substitution θ obtained by evaluating [[P]]^b_DS can be
reduced to a substitution θ′ obtained from the evaluation of [[P]]_DS in [16] by dropping
all mappings of the form v → null from θ. Likewise, each substitution θ′ obtained
from [[P]]_DS can be extended to a substitution θ = θ′^vars(P) for [[P]]^b_DS. □
Following the definitions from the SPARQL specification and [16], the b-joining
semantics is the only admissible definition. Still, there are advantages to gradually
defining alternatives closer to the traditional treatment of joins involving nulls: on the
one hand, as we have seen in the examples above, the brave view on joining unbound
variables can have partly surprising results; on the other hand, as we will see, the c-
and s-joining semantics allow for a more efficient implementation in terms of Datalog
rules.
Let us now take a closer look at some properties of the three defined semantics.
Compositionality and Equivalences As shown in [16], some implementations have
a non-compositional semantics, leading to undesired effects such as non-commutativity
of the join operator. A semantics is called compositional if for each sub-pattern P′
of P the result of evaluating P′ can be used to evaluate P. Obviously, all three semantics
defined here (c-, s-, and b-joining) retain this property, since all three are defined
recursively and independently of the evaluation order of the sub-patterns.
The following proposition summarizes equivalences which hold for all three semantics,
showing some interesting additions to the results of Pérez et al.
Proposition 20 (extends [16, Prop. 1]) The following equivalences hold or do not hold
in the different semantics as indicated after each law:
(1) AND, UNION are associative and commutative. (b,c,s)
(2) (P1 AND (P2 UNION P3 ))
≡ ((P1 AND P2 ) UNION (P1 AND P3 )). (b)
(3) (P1 OPT (P2 UNION P3 ))
≡ ((P1 OPT P2 ) UNION (P1 OPT P3 )). (b)
(4) ((P1 UNION P2 ) OPT P3 )
≡ ((P1 OPT P3 ) UNION (P2 OPT P3 )). (b)
(5) ((P1 UNION P2 ) FILTER R)
≡ ((P1 FILTER R) UNION (P2 FILTER R)). (b,c,s)
(6) AND is idempotent, i.e. (P AND P ) ≡ P . (c)
Proof.[Sketch.] (1)-(5) for the b-joining semantics are proven in [16]. (1): for
c-joining and s-joining, this follows straight from the definitions. (2)-(4): the substitution
sets [[P1]]^{c,s} = {[?X → a, ?Y → b]}, [[P2]]^{c,s} = {[?X → a, ?Z → c]},
[[P3]]^{c,s} = {[?Y → b, ?Z → c]} provide counterexamples for the c-joining and s-joining
semantics for all three equivalences (2)-(4). (5): the semantics of FILTER expressions
and UNION is exactly the same under all three semantics; thus, the result for the b-joining
semantics carries over to the other two. (6): follows from the observations in Example 10. □
Ideally, we would like to identify a subclass of patterns on which the three semantics
coincide. Obviously, this is the case for any query involving neither UNION nor
OPT operators. Pérez et al. [16] define a bigger class, including “well-behaving”
optional patterns:
Definition 11 ([16, Def. 4]) A UNION-free graph pattern P is well-designed if for
every occurrence of a sub-pattern P′ = (P1 OPT P2) of P and for every variable v
occurring in P, the following condition holds: if v occurs both in P2 and outside P′,
then it also occurs in P1.
As may easily be verified by the reader, neither Example 9 nor Example 11, which are
both UNION-free, satisfies the well-designedness condition. Since in the general case
the equivalences of Prop. 20 do not hold, we also need to consider nested UNION patterns
as a potential source of null bindings which might affect join results. We extend
the notion of well-designedness, which directly leads us to another correspondence in
the subsequent proposition.
Definition 12 A graph pattern P is well-designed if the condition from Def. 11 holds
and if for every occurrence of a sub-pattern P′ = (P1 UNION P2) of P and for every
variable v occurring in P′, the following condition holds: if v occurs outside P′, then
it occurs in both P1 and P2.
Proposition 21 On well-designed graph patterns the c-, s-, and b-joining semantics
coincide.
Proof.[Sketch.] Follows directly from the observation that all variables which are
reused outside P′ must be bound to a value unequal to null in P′ due to well-designedness,
and thus cannot generate null bindings which might carry over to joins. □
Likewise, we can identify “dangerous” variables in graph patterns which might
cause semantic differences:
Definition 13 Let P′ be a sub-pattern of P of either the form P′ = (P1 OPT P2) or
P′ = (P1 UNION P2). Any variable v in P′ which violates the well-designedness
condition is called possibly-null-binding in P.
Note that so far we have only defined the semantics in terms of a pattern P and a
dataset DS, but not yet taken the result form V of a query Q = (V, P, DS) into account.
We now define the solution tuples that were informally introduced in Sec. 2. Recall that
by V we denote the tuple obtained from lexicographically ordering a set of variables V.
The notation V[V′ → null] means that, after ordering V, all variables from a subset
V′ ⊆ V are replaced by null.

Definition 14 (Solution Tuples) Let Q = (V, P, DS) be a SPARQL query and θ a
substitution in [[P]]^x_DS; then we call the tuple V[(V \ vars(P)) → null]θ a solution
tuple of Q with respect to the x-joining semantics.
Let us remark at this point that, as for the discussion of the intuitiveness of the
different join semantics in Examples 9–11, we did not yet consider combinations of
different join semantics, e.g., using b-joins for OPT and c-joins for AND patterns. We
leave this for future work.
and Const (which may overlap) to be disjoint. In accordance with common notation
in LP and the notation for external predicates from [7], we will in the following
assume that Const and Pred comprise sets of numeric constants, string constants
beginning with a lower case letter, or ’"’-quoted strings, and strings of the form
⟨quoted-string⟩^^⟨IRI⟩ or ⟨quoted-string⟩@⟨valid-lang-tag⟩; Var is the set of string
constants beginning with an upper case letter. Given p ∈ Pred, an atom is defined as
p(t1, . . . , tn), where n is called the arity of p and t1, . . . , tn ∈ Const ∪ Var.
Moreover, we define a fixed set of external predicates exPr = {rdf, isBLANK,
isIRI, isLITERAL, =, !=}. All external predicates have a fixed semantics and
fixed arities, distinguishing input and output terms. The atoms isBLANK[c](val),
isIRI[c](val), isLITERAL[c](val) test the input term c ∈ Const ∪ Var (in square
brackets) for being a valid string representation of a blank node, IRI reference, or RDF
literal, returning an output value val ∈ {t, f, e}, representing truth, falsity, or an error,
following the semantics defined in [18, Sec. 11.3]. For the rdf predicate we write
atoms as rdf[i](s, p, o) to denote that i ∈ Const ∪ Var is an input term, whereas
s, p, o ∈ Const ∪ Var are output terms which may be bound by the external predicate.
The external atom rdf[i](s, p, o) is true if (s, p, o) is an RDF triple entailed by
the RDF graph which is accessible at IRI i. For the moment, we consider simple RDF
entailment [13] only. Finally, we write comparison atoms ’t1 = t2’ and ’t1 != t2’ in
infix notation with t1, t2 ∈ Const ∪ Var and the obvious semantics of (lexicographic
or numeric) (in)equality. Here, for = either t1 or t2 is an output term, but at least one
is an input term, and for != both t1 and t2 are input terms.
We use H(r) to denote the head atom h and B(r) to denote the set of all body literals
B + (r) ∪ B − (r) of r, where B + (r) = {b1 , . . . , bm } and B − (r) = {bm+1 , . . . , bn }.
The notion of input and output terms in external atoms described above denotes the
binding pattern. More precisely, we assume the following condition, which extends the
standard notion of safety (cf. [21]) in Datalog with negation: each variable appearing
in a rule must appear in an atom in B+(r) or as an output term of an external atom.
Definition 16 A (logic) program Π is defined as a set of safe rules r of the form (1).
The Herbrand base of a program Π, denoted HB_Π, is the set of all possible ground
versions of atoms and external atoms occurring in Π obtained by replacing variables
with constants from Const, where we define for our purposes Const as the union
of the set of all constants appearing in Π as well as the literals, IRIs, and distinct
constants for each blank node occurring in each RDF graph identified8 by one of the
IRIs in the (recursively defined) set I, where I is defined by the recursive closure of all
IRIs appearing in Π and all RDF graphs identified by IRIs in I.9 As long as we assume
that the Web is finite, the grounding of a rule r, ground(r), is defined by replacing
each variable with the possible elements of HB_Π, and the grounding of program Π is
ground(Π) = ⋃_{r∈Π} ground(r).
8 By “identified” we mean here that IRIs denote network-accessible resources which correspond to RDF
graphs.
An interpretation relative to Π is any subset I ⊆ HB_Π containing only atoms. We
say that I is a model of an atom a ∈ HB_Π, denoted I |= a, if a ∈ I. With every external
predicate name g ∈ exPr with arity n we associate an (n+1)-ary Boolean function
f_g (called oracle function) assigning each tuple (I, t1, . . . , tn) either 0 or 1.10 We say
that I ⊆ HB_Π is a model of a ground external atom a = g[t1, . . . , tm](tm+1, . . . , tn),
denoted I |= a, iff f_g(I, t1, . . . , tn) = 1.
The semantics we use here generalizes the answer-set semantics [11],11 and is
defined using the FLP-reduct [9], which is more elegant than the traditional GL-reduct [11]
of stable model semantics and ensures minimality of answer sets also in the presence of
external atoms.
Let r be a ground rule. We define (i) I |= B(r) iff I |= a for all a ∈ B+(r) and
I ⊭ a for all a ∈ B−(r), and (ii) I |= r iff I |= H(r) whenever I |= B(r). We say
that I is a model of a program Π, denoted I |= Π, iff I |= r for all r ∈ ground(Π).
The FLP-reduct [9] of Π with respect to I ⊆ HB_Π, denoted Π^I, is the set of all
r ∈ ground(Π) such that I |= B(r). I is an answer set of Π iff I is a minimal model
of Π^I.
We did not consider here further extensions common to many ASP dialects, namely
disjunctive rule heads and strong negation [11]. We note that for non-recursive programs,
i.e., where the predicate dependency graph is acyclic, the answer set is unique. For the
pure translation which we will give in Sec. 4, where we produce such non-recursive
programs from SPARQL queries, we could equally take other semantics such as the
well-founded semantics [10] into account, which coincides with ASP on non-recursive
programs.
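To illustrate these notions, here is a minimal sketch of a safe, non-recursive program using the external atoms just defined; the graph IRI and the predicate names knows and anonFriend are hypothetical, chosen purely for illustration:

    % import all foaf:knows triples from the graph at the given IRI;
    % X and Y are safe since they are output terms of the external rdf atom
    knows(X, Y) :- rdf["http://alice.org"](X, foaf:knows, Y).
    % flag those objects which are (string representations of) blank nodes
    anonFriend(Y) :- knows(X, Y), isBLANK[Y](t).

Since its predicate dependency graph is acyclic, this program has a unique answer set, as remarked above.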
τ(V, (s, p, o), D, i) = { answer_i(V, D) :- triple(s, p, o, D). }                          (1)

τ(V, (P′ AND P″), D, i) = τ(vars(P′), P′, D, 2i) ∪ τ(vars(P″), P″, D, 2i+1) ∪
  { answer_i(V, D) :- answer_{2i}(vars(P′), D), answer_{2i+1}(vars(P″), D). }              (2)

τ(V, (P′ UNION P″), D, i) = τ(vars(P′), P′, D, 2i) ∪ τ(vars(P″), P″, D, 2i+1) ∪
  { answer_i(V[(V \ vars(P′)) → null], D) :- answer_{2i}(vars(P′), D).                     (3)
    answer_i(V[(V \ vars(P″)) → null], D) :- answer_{2i+1}(vars(P″), D). }                 (4)

τ(V, (P′ MINUS P″), D, i) = τ(vars(P′), P′, D, 2i) ∪ τ(vars(P″), P″, D, 2i+1) ∪
  { answer_i(V[(V \ vars(P′)) → null], D) :- answer_{2i}(vars(P′), D),                     (5)
        not answer′_{2i}(vars(P′) ∩ vars(P″), D).
    answer′_{2i}(vars(P′) ∩ vars(P″), D) :- answer_{2i+1}(vars(P″), D). }                  (6)

τ(V, (P′ OPT P″), D, i) = τ(V, (P′ AND P″), D, i) ∪ τ(V, (P′ MINUS P″), D, i)

τ(V, (P FILTER R), D, i) = τ(vars(P), P, D, 2i) ∪
  LT( { answer_i(V, D) :- answer_{2i}(vars(P), D), R. } )                                  (7)

τ(V, (GRAPH g P), D, i) = τ(V, P, g, i) ∪                             (for g ∈ V ∪ I)
  { answer_i(V, D) :- answer_i(V, g), isIRI(g), not g = default. }                         (8)

Figure 2: Recursive definition of the translation τ.
Here, the predicate HU stands for “Herbrand universe”, where we use this name a bit
sloppily, with the intention to cover all the relevant parts of C, recursively importing
all possible IRIs in order to emulate the dataset-closedness assumption. HU can be
computed recursively over the input triples, i.e.,

HU(X) :- triple(X, P, O, D).    HU(X) :- triple(S, X, O, D).
HU(X) :- triple(S, P, X, D).    HU(X) :- triple(S, P, O, X).
The remaining program τ (V, P, default, 1) represents the actual query transla-
tion, where τ is defined recursively as shown in Fig. 2.
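For illustration, instantiating τ(V, P, default, 1) for the pattern P = ((?X, name, ?Name) AND (?X, knows, ?Friend)) from the join example above yields the following non-recursive rules (a sketch; variable tuples are written in lexicographic order):

    answer2(Name, X, default) :- triple(X, name, Name, default).
    answer3(Friend, X, default) :- triple(X, knows, Friend, default).
    answer1(Friend, Name, X, default) :- answer2(Name, X, default),
                                         answer3(Friend, X, default).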
By LT(·) we mean the set of rules resulting from disassembling complex FILTER
expressions (involving ’¬’, ’∧’, ’∨’) according to the rewriting defined by Lloyd and
Topor [15], where we have to obey the semantics for errors, following Definition 10. In
a nutshell, the rewriting LT-rewrite(·) proceeds as follows: complex filters involving
¬ are transformed into negation normal form. Conjunctions of filter expressions
are simply disassembled into conjunctions of body literals; disjunctions are handled by
splitting the respective rule for both alternatives in the standard way. The resulting rules
involve possibly negated atomic filter expressions in the bodies. Here, BOUND(v)
is translated to v != null and ¬BOUND(v) to v = null. isBLANK(v), isIRI(v),
isLITERAL(v) and their negated forms are replaced by their corresponding external
atoms (see Sec. 3) isBLANK[v](t) or isBLANK[v](f), etc., respectively.
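For instance, a rule of the form (7) carrying the complex filter (isIRI(?X) ∨ ¬BOUND(?X)) would be split by this rewriting into two rules along the following lines (a sketch, with illustrative rule indices):

    answer1(X, D) :- answer2(X, D), isIRI[X](t).
    answer1(X, D) :- answer2(X, D), X = null.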
The resulting program Π^c_Q implements the c-joining semantics in the following
sense:

Proposition 22 (Soundness and completeness of Π^c_Q) For each atom of the form
answer1(~s, default) in the unique answer set M of Π^c_Q, ~s is a solution tuple of Q
with respect to the c-joining semantics, and all solution tuples of Q are represented by
the extension of the predicate answer1 in M.
Without giving a proof, we remark that the result follows if we convince ourselves that
τ(V, P, D, i) emulates exactly the recursive definition of [[P]]^c_DS. Moreover, together
with Proposition 21, we obtain soundness and completeness of Π_Q for the b-joining and
s-joining semantics as well, for well-designed query patterns.
Translation Π^s_Q   The s-joining behavior can be achieved by adding the FILTER
expression

    R_s = ⋀_{v ∈ V_N} BOUND(v)

to the rule bodies of (2) and (6′). The resulting rules are again subject to the LT-rewriting
as discussed above for the rules of the form (7). This is sufficient to filter out
any joins involving null values, thus achieving the s-joining semantics; we denote the
program rewritten in this way by Π^s_Q.
Translation Π^b_Q   Obviously, the b-joining semantics is trickier to achieve, since we
now have to relax the allowed joins in order to allow null bindings to join with any
other value. We again achieve this by modifying rules (2) and (6′), where we first do
some variable renaming and then add respective FILTER expressions to these rules.

Step 1. We rename each variable v ∈ V_N in the respective rule bodies to v′ or v″,
respectively, in order to disambiguate the occurrences originating from sub-pattern P′
or P″, respectively. That is, for each rule (2) or (6′), we rewrite the body to:

    answer_{2i}(vars(P′)[V_N → V_N′], D),
    answer_{2i+1}(vars(P″)[V_N → V_N″], D).

Step 2. We now add the following FILTER expressions R^b_(2) and R^b_(6′), respectively,
to the resulting rule bodies, which “emulate” the relaxed b-compatibility:

    R^b_(2)  = ⋀_{v ∈ V_N} ( ((v = v′) ∧ (v′ = v″)) ∨
                             ((v = v′) ∧ ¬BOUND(v″)) ∨
                             ((v = v″) ∧ ¬BOUND(v′)) )

    R^b_(6′) = ⋀_{v ∈ V_N} ( ((v = v′) ∧ (v′ = v″)) ∨
                             ((v = v′) ∧ ¬BOUND(v″)) ∨
                             ((v = v′) ∧ ¬BOUND(v′)) )
The rewritten rules are again subject to the LT rewriting. Note that, strictly speaking,
the filter expressions introduced here do not fulfill the assumption of safe filter
expressions, since they create new bindings for the variable v. However, these can safely
be allowed here, since the translation only creates valid input/output term bindings for
the external Datalog predicate ’=’. The subtle difference between R^b_(2) and R^b_(6′) lies
in the fact that R^b_(2) preferably “carries over” bound values from v′ or v″ to v, whereas
R^b_(6′) always takes the value of v′. The effect of this becomes obvious in the translation
of Example 11, which we leave as an exercise to the reader. We note that the potential
exponential (with respect to |V_N|) blowup of the program size by unfolding the filter
expressions into negation normal form during the LT rewriting12 is not surprising,
given the negative complexity results in [16].
In total, we obtain a program Π^b_Q which reflects the normative b-joining
semantics. Consequently, we get sound and complete query translations for all three
semantics:

Corollary 24 (Soundness and completeness of Π^x_Q) Given an arbitrary graph pattern
P, the extension of the predicate answer1 in the unique answer set M of Π^x_Q represents
all and only the solution tuples for Q = (V, P, DS) with respect to the x-joining
semantics, for x ∈ {b, c, s}.

In the following, we will drop the superscript x in Π_Q and implicitly refer to the
normative b-joining translation/semantics.
12 Lloyd and Topor can avoid this potential exponential blowup by introducing new auxiliary predicates.
However, we cannot do the same trick, mainly for reasons of preserving safety of external predicates as
defined in Sec. 3.
5 Possible Extensions
As it turns out, the embedding of SPARQL in the rules world opens a wide range of
possibilities for combinations. In this section, we first discuss some straightforward
extensions of SPARQL which come practically for free with the translation to
Datalog provided before. We then discuss the use of SPARQL itself as a simple
RDF rules language13 which allows combining RDF fact bases with implicitly specified
further facts, and briefly discuss the semantics thereof. We conclude this section
by revisiting the open issue of entailment regimes covering RDFS or OWL semantics
in SPARQL.
Nested queries Nested queries are a distinct feature of SQL not present in SPARQL.
We suggest a simple but useful form of nested queries to be added: Boolean queries
Q_ASK = (∅, P_ASK, DS_ASK) with an empty result form (denoted by the keyword ASK)
can safely be allowed within FILTER expressions as an easy extension fully compatible
with our translation. Given a query Q = (V, P, DS) with sub-pattern (P1 FILTER (ASK
Q_ASK)), we can modularly translate such subqueries by extending Π_Q with Π_{Q′} where
Q′ = (vars(P1) ∩ vars(P_ASK), P_ASK, DS_ASK). Moreover, we have to rename the
predicates answer_i to answer^{Q′}_i in Π_{Q′}. Some additional considerations are
necessary in order to combine this with arbitrarily complex filter expressions, and we
probably need to impose well-designedness for variables shared between P and P_ASK,
similar to Def. 12. We leave further details as future work.
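For instance, under the suggested extension, a query selecting all persons who know at least one other person could be written as follows (a sketch; the FILTER (ASK ...) construct is the proposed extension and not part of the official SPARQL syntax):

    SELECT ?X
    WHERE { ?X rdf:type foaf:Person .
            FILTER ( ASK { ?X foaf:knows ?Y . } ) }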
Q_C = (CONSTRUCT P_C, P, DS)
to Π_Q for each triple (s, p, o) in P_C. The result graph is then naturally represented
in the answer set of the program extended in this way, in the extension of the predicate
triple.
for each g ∈ G, in order not to omit any of the implicit triples defined by such
“CONSTRUCT rules”. Analogously to the considerations for nested ASK queries,
we need to rename the answer_i predicates and the default constants in every subprogram
Π_{Q_C} defined this way.
Naturally, the resulting programs possibly involve recursion and, even worse,
recursion over negation as failure. Fortunately, the general answer-set semantics which
we use can cope with this. For some important aspects of the semantics of such
distributed rule and fact bases we refer to [17], where we also outline an alternative
semantics based on the well-founded semantics. A more in-depth investigation of the
complexity and other semantic features of such a combination is on our agenda.
with our translations, implementing what one might call RDFS− or OWL− entailment
at least. It remains to be seen whether the SPARQL working group will define such
reduced entailment regimes.
More complex issues arise when combining a nonmonotonic query language like
SPARQL with ontologies in OWL. An embedding of SPARQL into a nonmonotonic
rules language might provide valuable insights here, since it opens up a whole body of
work done on combinations of such languages with ontologies [7, 19].
7 Acknowledgments
Special thanks go to Jos de Bruijn and Reto Krummenacher for discussions on earlier
versions of this document, to Bijan Parsia, Jorge Pérez, and Andy Seaborne for valuable
email discussions, to Roman Schindlauer for his help on the prototype implementation on
top of dlvhex, and to the anonymous reviewers for various useful comments. This work
is partially supported by the Spanish MEC under the project TIC-2003-9001 and by the
EC funded projects TripCom (FP6-027324) and KnowledgeWeb (IST 507482).
16 http://www.dlvsystem.com/
References
[1] C. Baral. Knowledge Representation, Reasoning and Declarative Problem Solv-
ing. Cambridge University Press, 2003.
[2] D. Beckett. Turtle - Terse RDF Triple Language. Tech. Report, 4 Apr. 2006.
[3] J. de Bruijn, A. Polleres, R. Lara, D. Fensel. OWL DL vs. OWL Flight: Concep-
tual modeling and reasoning for the semantic web. In Proc. WWW-2005, 2005.
[4] J. Carroll, C. Bizer, P. Hayes, P. Stickler. Named graphs. Journal of Web Seman-
tics, 3(4), 2005.
[5] R. Cyganiak. A relational algebra for SPARQL. Tech. Report HPL-2005-170, HP
Labs, Sept. 2005.
[6] J. de Bruijn, E. Franconi, S. Tessaris. Logical reconstruction of normative RDF.
OWL: Experiences and Directions Workshop (OWLED-2005), 2005.
[7] T. Eiter, G. Ianni, A. Polleres, R. Schindlauer, H. Tompits. Reasoning with rules
and ontologies. Reasoning Web 2006, 2006. Springer
[8] T. Eiter, G. Ianni, R. Schindlauer, H. Tompits. A Uniform Integration of Higher-
Order Reasoning and External Evaluations in Answer Set Programming. Int.l
Joint Conf. on Art. Intelligence (IJCAI), 2005.
[9] W. Faber, N. Leone, G. Pfeifer. Recursive aggregates in disjunctive logic pro-
grams: Semantics and complexity. In Proc. of the 9th European Conference on
Logics in Artificial Intelligence (JELIA 2004), 2004. Springer.
[10] A. Van Gelder, K. Ross, J. Schlipf. Unfounded sets and well-founded semantics
for general logic programs. 7th ACM Symp. on Principles of Database Systems,
1988.
[11] M. Gelfond, V. Lifschitz. Classical Negation in Logic Programs and Disjunctive
Databases. New Generation Computing, 9:365–385, 1991.
[12] B. N. Grosof, I. Horrocks, R. Volz, S. Decker. Description logic programs: Com-
bining logic programs with description logics. Proc. WWW-2003, 2003.
[13] P. Hayes. RDF semantics. W3C Recommendation, 10 Feb. 2004. http://www.
w3.org/TR/rdf-mt/
[14] H. J. ter Horst. Completeness, decidability and complexity of entailment for RDF
Schema and a semantic extension involving the OWL vocabulary. Journal of Web
Semantics, 3(2), July 2005.
[15] J. W. Lloyd, R. W. Topor. Making Prolog more expressive. Journal of Logic
Programming, 1(3):225–240, 1984.
[16] J. Pérez, M. Arenas, C. Gutierrez. Semantics and complexity of SPARQL. The
Semantic Web – ISWC 2006, 2006. Springer.
[17] A. Polleres, C. Feier, A. Harth. Rules with contextually scoped negation. Proc.
3rd European Semantic Web Conf. (ESWC2006), 2006. Springer.
[18] E. Prud’hommeaux, A. Seaborne (eds.). SPARQL Query Language for RDF. W3C Work-
ing Draft, 4 Oct. 2006. http://www.w3.org/TR/rdf-sparql-query/
[19] R. Rosati. Reasoning with Rules and Ontologies. Reasoning Web 2006, 2006.
Springer.
[20] SQL-99. Information Technology - Database Language SQL- Part 3: Call
Level Interface (SQL/CLI). Technical Report INCITS/ISO/IEC 9075-3, IN-
CITS/ISO/IEC, Oct. 1999. Standard specification.
Published in Proceedings of the 6th International Conference on Ontologies, DataBases, and
Applications of Semantics (ODBASE 2007), pp. 878–896, Nov. 2007, Springer LNCS vol.
4803
Abstract
Lightweight ontologies in the form of RDF vocabularies such as SIOC, FOAF,
vCard, etc. are increasingly being used and exported by “serious” applications.
Such vocabularies, together with query languages like SPARQL, also allow
syndicating resulting RDF data from arbitrary Web sources and open the path to
finally bringing the Semantic Web to operation mode. Considering, however, that
many of the promoted lightweight ontologies overlap, the lack of suitable standards
to describe these overlaps in a declarative fashion becomes evident. In this
paper we argue that one does not necessarily need to delve into the huge body of
research on ontology mapping for a solution; rather, SPARQL itself might, with
extensions such as external functions and aggregates, serve as a basis for declaratively
describing ontology mappings. We provide the semantic foundations and a
path towards implementation for such a mapping language by means of a translation
to Datalog with external predicates.
1 Introduction
As RDF vocabularies like SIOC,1 FOAF,2 vCard,3 etc. are increasingly being used
and exported by “serious” applications we are getting closer to bringing the Semantic
Web to operation mode.
∗ This research has been partially supported by the European Commission under the FP6 projects inContext
(IST-034718), REWERSE (IST 506779), and Knowledge Web (FP6-507482), by the Austrian Science
Fund (FWF) under project P17212-N04, as well as by Science Foundation Ireland under the Lion project
(SFI/02/CE1/I131).
1 http://sioc-project.org/
2 http://xmlns.com/foaf/0.1/
3 http://www.w3.org/TR/vcard-rdf
The standardization of languages like RDF, RDF Schema
and OWL has set the path for such vocabularies to emerge, and the recent advent of
an operable query language, SPARQL, gave a final kick for wider adoption. These
ingredients allow not only publishing, but also syndicating and reusing metadata from
arbitrary distributed Web resources in flexible, novel ways.
When we take a closer look at emerging vocabularies we realize that many of them
overlap, but despite the long record of research on ontology mapping and alignment, a
standard language for defining mapping rules between RDF vocabularies is still miss-
ing. As it turns out, the RDF query language SPARQL [27] itself is a promising
candidate for filling this gap: Its CONSTRUCT queries may themselves be viewed
as rules over RDF. The use of SPARQL as a rules language has several advantages:
(i) the community is already familiar with SPARQL’s syntax as a query language,
(ii) SPARQL already supports a basic set of built-in predicates to filter results, and
(iii) SPARQL provides a very powerful tool, including even non-monotonic constructs
such as OPTIONAL queries.
When proposing the use of SPARQL’s CONSTRUCT statement as a rules language
to define mappings, we should first have a look at existing proposals for rules language
syntaxes on top of RDF(S) and OWL. For instance, we can observe that
SPARQL may be viewed as a syntactic extension of SWRL [19]: a SWRL rule is of the
form ant ⇒ cons, where both antecedent and consequent are conjunctions of atoms
a1 ∧ . . . ∧ an. Reading these conjunctions as basic graph patterns in SPARQL,
we might thus equally express such a rule by a CONSTRUCT statement:
CONSTRUCT { cons } WHERE { ant }
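For instance, a familiar SWRL-style rule hasParent(?x, ?y) ∧ hasBrother(?y, ?z) ⇒ hasUncle(?x, ?z), with ex: standing in for an arbitrary example namespace, would read:

    CONSTRUCT { ?x ex:hasUncle ?z }
    WHERE { ?x ex:hasParent ?y . ?y ex:hasBrother ?z . }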
In a sense, such SPARQL “rules” are more general than SWRL, since they may be
evaluated on top of arbitrary RDF data and — unlike SWRL — not only on top of valid
OWL DL. Other rules language proposals, like WRL [8] or TRIPLE [9], which are
based on F-Logic [22] Programming, may likewise be viewed as layerable on top of
RDF by applying recent results of de Bruijn et al. [6, 7]. Given that (i) expressive
features such as negation as failure, which are present in some of these languages, are
also available in SPARQL,4 and (ii) F-Logic molecules in rule heads may be serialized
in RDF again, we conjecture that rules in these languages can similarly be expressed
as syntactic variants of SPARQL CONSTRUCT statements.5
On the downside, it is well known that even a simple rules language such as SWRL
already leads to termination/undecidability problems when mixed with ontology
vocabulary in OWL without care. Moreover, it is not possible to express even very
simple mappings between common vocabularies such as FOAF [5] and VCard [20] in
SPARQL alone. In order to remedy this situation, we propose the following approach
to enable complex mappings over ontologies: first, we keep the expressivity of the
underlying ontology language low, restricting ourselves to RDFS or, more strictly
speaking, to ρdf− [24] ontologies; second, we extend SPARQL’s CONSTRUCT by
features which are almost essential to express various mappings, namely a set of useful
built-in functions (such as string concatenation and arithmetic functions on numeric
literal values) and aggregate functions (min, max, avg). Third, we show that evaluating
SPARQL queries on top of ρdf− ontologies plus mapping rules is decidable by
translating the problem to query answering over HEX-programs, i.e., logic programs with
external built-ins using the answer-set semantics, which gives rise to implementations
on top of existing rules engines such as dlvhex. A prototype of a SPARQL engine for
evaluating queries over combined datasets consisting of ρdf− and SPARQL mappings
has been implemented and is available for testing online.6
4 see [27, Section 11.4.1]
5 with the exception of predicates with arbitrary arities
6 http://kr.tuwien.ac.at/research/dlvhex/
The remainder of this paper is structured as follows. We start with some motivating
examples of mappings which can and cannot be expressed with SPARQL CONSTRUCT
queries in Section 2 and suggest syntactic extensions of SPARQL, which
we call SPARQL++, in order to deal with the mappings that go beyond. In Section 3
we introduce HEX-programs, whereafter in Section 4 we show how SPARQL++ CONSTRUCT
queries can be translated to HEX-programs, thereby bridging the gap to
implementations of SPARQL++. Next, we show how the additional ontological inferences
of ρdf− ontologies can themselves be viewed as a set of SPARQL++ CONSTRUCT
“mappings” to HEX-programs and thus be embedded in our overall framework,
evaluating mappings and ontological inferences at the same level while retaining decidability.
After a brief discussion of our current prototype and a discussion of related approaches, we
conclude in Section 6 with an outlook to future work.
The filter expression here reduces the mapping by a kind of additional “type checking”,
where only those names are mapped which are not fully specified by a substructure
but are merely given as a single literal.
Example 13 The situation quickly becomes more tricky for other terms, as for instance
the mapping between VCard:n (name) and foaf:name, because VCard:n consists of
a substructure consisting of Family name, Given name, Other names, honorific Pre-
fixes, and honorific Suffixes. One possibility is to concatenate all these to constitute a
foaf:name of the respective person or entity:
CONSTRUCT { ?X foaf:name ?Name . }
WHERE { ?X VCard:N ?N .
OPTIONAL {?N VCard:Family ?Fam } OPTIONAL {?N VCard:Given ?Giv }
OPTIONAL {?N VCard:Other ?Oth } OPTIONAL {?N VCard:Prefix ?Prefix }
OPTIONAL {?N VCard:Suffix ?Suffix }
FILTER (?Name = fn:concat(?Prefix," ",?Giv, " ",?Fam," ",?Oth," ",?Suffix))
}
We observe the following problems here: first, we use a filter for constructing a new
binding, which is not covered by the current SPARQL specification, since filter
expressions are not meant to create new bindings of variables (in this case the variable
?Name) but only to filter existing bindings. Second, if we wanted to model the case where,
e.g., several other names were provided, we would need built-in functions beyond what
the current SPARQL spec provides, in this case a string manipulation function such
as fn:concat. SPARQL provides a subset of the functions and operators defined by
XPath/XQuery, but these cover only Boolean functions, like arithmetic comparison
operators or regular expression tests, and basic arithmetic functions. String manipulation
routines are beyond the current spec. Even if we had the full range of XPath/XQuery
functions available, we would admittedly also have to slightly “extend” fn:concat
here, assuming that unbound variables are handled properly, i.e., replaced by an
empty string in case one of the optional parts of the name structure is not defined.
Apart from built-in functions like string operations, aggregate functions such as
count, minimum, maximum, or sum are another helpful construct for many mappings
that is currently not available in SPARQL.
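As an illustration, using the aggregate syntax agg(V : P) introduced in Section 4 below together with aggregates in the CONSTRUCT clause, a mapping computing the latest revision of a DOAP project (a hypothetical sketch, in the spirit of the translation shown later in Example 19, from which the os:latestRelease property is taken) might be written as:

    CONSTRUCT { ?P os:latestRelease
                  MAX(?V : { ?P doap:release ?R . ?R doap:revision ?V }) }
    WHERE { ?P rdf:type doap:Project . }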
Finally, although we can query and create new RDF graphs by SPARQL CON-
STRUCT statements mapping one vocabulary to another, there is no well-defined way
to combine such mappings with arbitrary data, especially when we assume that (1) map-
pings are not restricted to be unidirectional from one vocabulary to another, but bidirec-
tional, and (2) additional ontological inferences such as subclass/subproperty relations
defined in the mutually mapped vocabularies should be taken into account when query-
ing over syndicated RDF data and mappings. We propose the following extensions of
SPARQL:
• We introduce an extensible set of useful built-in and aggregate functions.
• We permit function calls and aggregates in the CONSTRUCT clause, e.g.:
CONSTRUCT {?X foaf:name fn:concat(?Prefix," ",?Giv," ",?Fam," ",?Oth," ",?Suffix).}
WHERE { ?X VCard:N ?N .
OPTIONAL {?N VCard:Family ?Fam } OPTIONAL {?N VCard:Given ?Giv }
OPTIONAL {?N VCard:Other ?Oth } OPTIONAL {?N VCard:Prefix ?Prefix }
OPTIONAL {?N VCard:Suffix ?Suffix } }
Another example of a non-trivial mapping is the different treatment of telephone
numbers in FOAF and VCard.
Here, the WHERE clause singles out all projects, while the aggregate selects the
highest (i.e., latest) revision date of any available version for that project.
Example 16 Imagine you want to map/infer from an ontology having co-author
relationships declared using dc:creator properties from the Dublin Core metadata
vocabulary to foaf:knows, i.e., you want to specify:
If ?a and ?b have co-authored the same paper, then ?a knows ?b.
The problem here is that a mapping using CONSTRUCT clauses needs to introduce new
blank nodes for both ?a and ?b (since dc:creator is a datatype property usually just
giving the name string of the author) and then to infer the knows relation, so what
we really want to express is the mapping
If ?a and ?b are dc:creators of the same paper, then someone named
with foaf:name ?a foaf:knows someone with foaf:name ?b.
A first-shot solution could be:
CONSTRUCT { _:a foaf:knows _:b . _:a foaf:name ?n1 . _:b foaf:name ?n2 . }
FROM <g> WHERE { ?p dc:creator ?n1 . ?p dc:creator ?n2 . FILTER ( ?n1 != ?n2 ) }
Obviously, we lost some information in this mapping, namely the correlations that
the “Axel” knowing “Francois” is the same “Axel” that knows “Roman”, etc. We
could remedy this situation by allowing CONSTRUCT queries to be nested in the FROM
clause of SPARQL queries as follows:
CONSTRUCT { ?a foaf:knows ?b . ?a foaf:name ?aname . ?b foaf:name ?bname . }
FROM { CONSTRUCT { _:auth foaf:name ?n . ?p aux:hasAuthor _:auth . }
FROM <g> WHERE { ?p dc:creator ?n . } }
WHERE { ?p aux:hasAuthor ?a . ?a foaf:name ?aname .
?p aux:hasAuthor ?b . ?b foaf:name ?bname . FILTER ( ?a != ?b ) }
Here, the “inner” CONSTRUCT creates a graph with unique blank nodes for each
author per paper, whereas the outer CONSTRUCT then aggregates a more appropriate
answer graph, say:
_:auth1 foaf:name "Axel". _:auth2 foaf:name "Roman". _:auth3 foaf:name "Francois".
_:auth1 foaf:knows _:auth2. _:auth1 foaf:knows _:auth3.
_:auth2 foaf:knows _:auth1. _:auth2 foaf:knows _:auth3.
_:auth3 foaf:knows _:auth1. _:auth3 foaf:knows _:auth2.
3 Preliminaries – HEX-Programs
To evaluate SPARQL++ queries, we will translate them to so-called HEX-programs [12],
an extension of logic programs under the answer-set semantics.
Let Pred , Const, Var , exPr be mutually disjoint sets of predicate, constant, vari-
able symbols, and external predicate names, respectively. In accordance with common
notation in LP and the notation for external predicates from [11], we will in the follow-
ing assume that Const comprises the set of numeric constants, string constants begin-
ning with a lower case letter, double-quoted string literals, and IRIs.7 Var is the set
of string constants beginning with an uppercase letter. Elements from Const ∪ Var are
called terms. Given p ∈ Pred an atom is defined as p(t1 , . . . , tn ), where n is called the
arity of p and t1 , . . . , tn are terms. An external atom is of the form
g[Y1 , . . . , Yn ](X1 , . . . , Xm ),
We use H(r) to denote the head atom h and B(r) to denote the set of all body literals
B + (r) ∪ B − (r) of r, where B + (r) = {b1 , . . . , bm } and B − (r) = {bm+1 , . . . , bn }.
The notion of input and output terms in external atoms described above denotes the
binding pattern. More precisely, we assume the following condition which extends the
standard notion of safety (cf. [31]) in Datalog with negation.
Definition 18 (Safety) Each variable appearing in a rule must appear in a non-negated
body atom or as an output term of an external atom.
Finally, we define HEX-programs.
Definition 19 A HEX-program P is defined as a set of safe rules r of the form (1).
The Herbrand base of a HEX-program P, denoted HB_P, is the set of all possible ground
versions of atoms and external atoms occurring in P obtained by replacing variables
with constants from Const. The grounding of a rule r, ground(r), is defined accordingly,
and the grounding of program P is ground(P) = ⋃_{r∈P} ground(r).
An interpretation relative to P is any subset I ⊆ HBP containing only atoms. We
say that I is a model of atom a ∈ HBP , denoted I |= a, if a ∈ I. With every external
predicate name e ∈ exPr we associate an (n+m+1)-ary Boolean function f_e (called
oracle function) assigning each tuple (I, y1, . . . , yn, x1, . . . , xm) either 0 or 1, where
n/m are the input/output arities of e, I ⊆ HB_P, xi ∈ Const, and yj ∈ Pred ∪ Const.
7 For the purpose of this paper, we will disregard language-tagged and datatyped literals in the translation
to HEX-programs.
We say that I ⊆ HB_P is a model of a ground external atom a = e[y1, . . . , yn](x1, . . . , xm),
denoted I |= a, iff f_e(I, y1, . . . , yn, x1, . . . , xm) = 1.
Let r be a ground rule. We define (i) I |= H(r) iff there is some a ∈ H(r) such
that I |= a, (ii) I |= B(r) iff I |= a for all a ∈ B + (r) and I 6|= a for all a ∈ B − (r),
and (iii) I |= r iff I |= H(r) whenever I |= B(r). We say that I is a model of a
HEX -program P , denoted I |= P , iff I |= r for all r ∈ ground (P ).
The semantics we use here generalizes the answer-set semantics [16] and is defined
using the FLP-reduct [15], which is more elegant than the traditional Gelfond-Lifschitz
reduct of stable model semantics and ensures minimality of answer sets also in the presence
of external atoms: the FLP-reduct of P with respect to I ⊆ HB_P, denoted P^I, is the
set of all r ∈ ground(P) such that I |= B(r). I ⊆ HB_P is an answer set of P iff I is
a minimal model of P^I.
By the cautious extension of a predicate p we denote the set of atoms with predicate
symbol p in the intersection of all answer sets of P .
For our purposes, we define a fixed set of external predicates exPr = {rdf, isBLANK,
isIRI, isLITERAL, =, !=, REGEX, CONCAT, COUNT, MAX, MIN, SK} with
a fixed semantics as follows. We take these external predicates as examples which
demonstrate that HEX-programs are expressive enough to model all the necessary
ingredients for evaluating SPARQL filters (isBLANK, isIRI, isLITERAL, =, !=, REGEX)
and also more expressive built-in functions and aggregates (CONCAT, SK, COUNT,
MAX, MIN). Here, we take CONCAT just as an example built-in, assuming that
more XPath/XQuery functions could be added similarly.
For the rdf predicate we write atoms as rdf [i](s, p, o) to denote that i ∈ Const ∪
Var is an input term, whereas s, p, o ∈ Const are output terms which may be bound
by the external predicate. The external atom rdf [i](s, p, o) is true if (s, p, o) is an RDF
triple entailed by the RDF graph which is accessible at IRI i. For the moment, we
consider simple RDF entailment [18] only.
The atoms isBLANK[c](val), isIRI[c](val), isLITERAL[c](val) test the input term
c ∈ Const ∪ Var (in square brackets) for being a valid string representation of a
blank node, IRI reference, or RDF literal. The atom REGEX[c1, c2](val) tests whether
c1 matches the regular expression given in c2. All these external predicates return
an output value val ∈ {t, f, e}, representing truth, falsity, or an error, following the
semantics defined in [27, Sec. 11.3].
We write comparison atoms ‘t1 = t2 ’ and ‘t1 != t2 ’ in shorthand infix notation
with t1 , t2 ∈ Const ∪ Var and the obvious semantics of (lexicographic or numeric)
(in)equality. Here, for = either t1 or t2 is an output term, but at least one is an input
term, and for != both t1 and t2 are input terms.
Apart from these truth-valued external atoms we add further external predicates
which mimic built-in functions and aggregates. As an example predicate for a built-in,
we choose the predicate CONCAT[c1, . . . , cn](cn+1) with variable input arity, which
concatenates the string constants c1, . . . , cn into cn+1 and thus implements the semantics
of fn:concat in XPath/XQuery [23].
Next, we define external predicates which mimic aggregate functions over a certain
predicate. Let p ∈ Pred with arity n, and x1, . . . , xn ∈ Const ∪ {mask}, where mask
is a special constant not allowed to appear anywhere except in input lists of aggregate
predicates.
Then COUNT [p, x1 , . . . , xn ](c) is true if c equals the number of distinct tuples
(t1 , . . . , tn ), such that I |= p(t1 , . . . , tn ) and for all xi different from the constant
mask it holds that ti = xi .
MAX [p, x1 , . . . , xn ](c) (and MIN [p, x1 , . . . , xn ](c), resp.) is true if among all
tuples (t1 , . . . , tn ), such that I |= p(t1 , . . . , tn ), c is the lexicographically greatest
(smallest, resp.) value among all the ti such that xi = mask .8
We will illustrate the use of these external predicates to express aggregations in
Section 4.4 below when discussing the actual translation from SPARQL++ to HEX-
programs.
Finally, the external predicate SK[id, v1, . . . , vn](sk_{n+1}) computes a unique, new
“Skolem”-like term id(v1, . . . , vn) from its input parameters. We will use this built-in
function in our translation of SPARQL queries with blank nodes in the CONSTRUCT
part. Similar to the aggregate functions mentioned before, when using SK we will
need to take special care in our translation in order to retain strong safety.
As is widely known, for programs without external predicates, safety guarantees that
the number of entailed ground atoms is finite. However, through external atoms in rule
bodies, new, possibly infinitely many, ground atoms could be generated, even if all atoms
themselves are safe. In order to avoid this, a stronger notion of safety for HEX-programs is
defined in [30]: informally, this notion says that a HEX-program is strongly safe if no
external predicate recursively depends on itself, thus defining a notion of stratification
over external predicates. Strong safety guarantees finiteness of models as well as finite
computability of external atoms.
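As a sketch (with hypothetical predicate names), the following program is not strongly safe: the external rdf atom recursively depends on itself via the url predicate and could thus import ever new IRIs from the Web.

    % collect triples from all known graphs ...
    triple(S, P, O) :- url(U), rdf[U](S, P, O).
    % ... and treat every IRI found in object position as a further graph to fetch
    url(O) :- triple(S, P, O), isIRI[O](t).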
• each element of Bc is a blank node identifier, i.e., Bc ⊆ B.
• for b ∈ B and t1 , . . . , tn in I ∪ B ∪ L, b(t1 , . . . , tn ) ∈ B.
Now, we extend the SPARQL syntax by allowing built-in functions and aggregates
in place of basic RDF terms in graph patterns (and thus also in CONSTRUCT clauses)
as well as in filter expressions. We define the set Blt of built-in terms as follows:
• All basic terms are built-in terms.
• If blt is a built-in predicate (e.g., fn:concat from above or other XPath/XQuery
functions) and c1, . . . , cn are built-in terms, then blt(c1, . . . , cn) is a built-in
term.
Graph patterns are then defined as follows:
• A triple pattern (s, p, o) is a graph pattern, where s, o ∈ Blt and p ∈ I ∪ Var.11
Triple patterns which only contain basic terms are called basic triple patterns,
and value-generating triple patterns otherwise.
• If P, P1 and P2 are graph patterns, i ∈ I ∪ Var, and R a filter expression, then
(P1 AND P2), (P1 OPT P2), (P1 UNION P2), (GRAPH i P), and (P FILTER R)
are graph patterns.12
For any pattern P, we denote by vars(P) the set of all variables occurring in P and
by vars(P) the tuple obtained by the lexicographic ordering of all variables in P.
As atomic filter expressions, we allow here the unary predicates BOUND (possibly with
variables as arguments), isBLANK, isIRI, isLITERAL, and the binary equality predicate ‘=’
with arbitrary safe built-in terms as arguments. Complex filter expressions can be built
using the connectives ‘¬’, ‘∧’, and ‘∨’.
Similar to aggregates in logic programming, we use a notion of safety. First, given
a query Q = (R, P, DS), we allow only basic triple patterns in P, i.e., built-ins and
aggregates may appear only in FILTERs or in the result pattern R. Second, a built-in
term blt occurring in the result form or in P of a query Q = (R, P, DS) is safe if all
variables recursively appearing in blt also appear in a basic triple pattern within P.
10 This aggregate syntax is adapted from the resp. definition for aggregates in LP from [15].
11 We do not consider blank nodes here as these can be equivalently replaced by variables [6].
12 We use AND to keep with the operator style of [25], although it is not explicit in SPARQL.
4.2 Extended Datasets
In order to allow the definition of RDF data side by side with implicit data defined
by mappings between different vocabularies or, more generally, views within RDF, we
define an extended RDF graph as a set of RDF triples from (I ∪ L ∪ B) × I × (I ∪ L ∪ B)
and CONSTRUCT queries. An RDF graph (or dataset, resp.) without CONSTRUCT
queries is called a basic graph (or dataset, resp.).
The dataset DS = (G, {(g1 , G1 ), . . . (gk , Gk )}) of a SPARQL query is defined by
(i) a default graph G, i.e., the RDF merge [18, Section 0.3] of a set of extended RDF
graphs, plus (ii) a set of named graphs, i.e., pairs of IRIs and corresponding extended
graphs.
Without loss of generality (there are other ways to define the dataset, such as in
a SPARQL protocol query), we assume DS to be defined by the IRIs given in a set of
FROM and FROM NAMED clauses. As an exception, we assume that any CONSTRUCT
query which is part of an extended graph G by default (i.e., in the absence of FROM
and FROM NAMED clauses) has the dataset DS = (G, ∅). For convenience, we allow
extended graphs consisting of a single CONSTRUCT statement to be written directly in
the FROM clause of a SPARQL++ query, as in Example 16.
We will now define syntactic restrictions on the CONSTRUCT queries allowed in
extended datasets which retain finite termination of queries over such datasets. Let G
be an extended graph. First, for any CONSTRUCT query Q = (R, P, DS_Q) in G,
we allow only triple patterns tr = (s, p, o) in P or R where p ∈ I, i.e., neither blank
nodes nor variables are allowed in predicate positions in extended graphs, and, addi-
tionally, o ∈ I for all triples such that p = rdf:type. Second, we define a predicate-
class-dependency graph over an extended dataset DS = (G, {(g1 , G1 ), . . . (gk , Gk )})
as follows. The predicate-class-dependency graph for DS has an edge p → r with
p, r ∈ I for any CONSTRUCT query Q = (R, P, DS ) in G with r (or p, resp.) ei-
ther (i) a predicate different from rdf:type in a triple in R (or P , resp.), or (ii)
an object in an rdf:type triple in R (or P , resp.). All edges such that r occurs in
a value-generating triple are marked with ‘∗’. We now say that DS is strongly safe
if its predicate-class-dependency graph does not contain any cycles involving marked
edges. As it turns out, in our translation in Section 4.4 below, this condition is suffi-
cient (but not necessary) to guarantee that any query can be translated to a strongly safe
HEX-program.
As in [29], we assume that blank node identifiers in each query Q = (R, P, DS)
have been standardized apart, i.e., that no blank nodes with the same identifiers appear
in different scopes. The scope of a blank node identifier is defined as the graph or
graph pattern it appears in, where each WHERE or CONSTRUCT clause opens a “fresh”
scope. For instance, take the extended graph dataset in Fig. 1(a); its standardized-apart
version is shown in Fig. 1(b). Obviously, extended datasets can always be standardized
apart in linear time in a preprocessing step.
4.3 Semantics
The semantics of SPARQL++ is based on the formal semantics for SPARQL queries
by Pérez et al. in [25] and its translation into HEX-programs in [26].
(a)
g1: :paper2 foaf:maker _:a.
    _:a foaf:name "Jean Deau".
g2: :paper1 dc:creator "John Doe".
    :paper1 dc:creator "Joan Dough".
    CONSTRUCT { _:a foaf:knows _:b .
                _:a foaf:name ?N1 .
                _:b foaf:name ?N2 . }
    WHERE { ?X dc:creator ?N1, ?N2 .
            FILTER( ?N1 != ?N2 ) }

(b)
g1: :paper2 foaf:maker _:b1.
    _:b1 foaf:name "Jean Deau".
g2: :paper1 dc:creator "John Doe".
    :paper1 dc:creator "Joan Dough".
    CONSTRUCT { _:b2 foaf:knows _:b3 .
                _:b2 foaf:name ?N1 .
                _:b3 foaf:name ?N2 . }
    WHERE { ?X dc:creator ?N1, ?N2 .
            FILTER( ?N1 != ?N2 ) }

Figure 1: Standardizing apart blank node identifiers in extended datasets.
Thus, in the union of two substitutions, defined values in one substitution take precedence
over null values in the other. Two substitutions θ1 and θ2 are compatible when for all
x ∈ dom(θ1) ∩ dom(θ2) either xθ1 = null or xθ2 = null or xθ1 = xθ2 holds, i.e.,
when θ1 ∪ θ2 is a substitution over dom(θ1) ∪ dom(θ2). Analogously to Pérez et al.
we define join, union, difference, and outer join between two sets of substitutions Ω1
and Ω2 over domains D1 and D2, respectively:

Ω1 ⋈ Ω2 = {θ1 ∪ θ2 | θ1 ∈ Ω1, θ2 ∈ Ω2, θ1 and θ2 are compatible}
Ω1 ∪ Ω2 = {θ | ∃θ1 ∈ Ω1 with θ = θ1^{D1∪D2} or ∃θ2 ∈ Ω2 with θ = θ2^{D1∪D2}}
Ω1 − Ω2 = {θ ∈ Ω1 | for all θ2 ∈ Ω2, θ and θ2 are not compatible}
Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 − Ω2)
Next, we define the application of substitutions to built-in terms and triples: For a
built-in term t, by tθ we denote the value obtained by applying the substitution to all
variables in t. By evalθ (t) we denote the value obtained by (i) recursively evaluating
all built-in and aggregate functions, and (ii) replacing all bNode identifiers by complex
bNode identifiers according to θ, as follows:
eval_θ(fn:concat(c1, c2, . . . , cn)): returns the xs:string that is the concatenation of the
    values of c1θ, . . . , cnθ after conversion. If any of the arguments is the empty
    sequence or null, that argument is treated as the zero-length string.
eval_θ(COUNT(V : P)): returns the number of distinct13 answer substitutions for the query
    Q = (V, Pθ, DS), where DS is the dataset of the encapsulating query.
eval_θ(MAX(V : P)): returns the maximum (numerically or lexicographically) of the
    distinct answer substitutions for the query Q = (V, Pθ, DS).
eval_θ(MIN(V : P)): analogous to MAX, but returns the minimum.
eval_θ(t): returns tθ for all t ∈ I ∪ L ∪ Var, and t(dom(θ)θ) for t ∈ B.14

Finally, for a triple pattern tr = (s, p, o) we denote by trθ the triple (sθ, pθ, oθ), and
by eval_θ(tr) the triple (eval_θ(s), eval_θ(p), eval_θ(o)).
The evaluation of a graph pattern P over a basic dataset DS = (G, Gn) can now be
defined recursively by sets of substitutions, extending the definitions in [25, 26].
Definition 20 Let tr = (s, p, o) be a basic triple pattern, P, P1, P2 graph patterns,
and DS = (G, Gn) a basic dataset; then the evaluation [[·]]DS is defined as follows:

[[tr]]DS = {θ | dom(θ) = vars(tr) and trθ ∈ G}
[[P1 AND P2]]DS = [[P1]]DS ⋈ [[P2]]DS
[[P1 UNION P2]]DS = [[P1]]DS ∪ [[P2]]DS
[[P1 OPT P2]]DS = [[P1]]DS ⟕ [[P2]]DS
[[GRAPH i P]]DS = [[P]](i,∅), for i ∈ Gn
[[GRAPH v P]]DS = {θ ∪ [v/g] | g ∈ Gn and θ ∈ [[P[v/g]]](g,∅)}, for v ∈ Var
[[P FILTER R]]DS = {θ ∈ [[P]]DS | Rθ = ⊤}
Let R be a filter expression and u, v ∈ Blt. The valuation of R on a substitution θ,
written Rθ, takes one of the three values {⊤, ⊥, ε}15 and is defined as follows.

Rθ = ⊤, if:
  (1) R = BOUND(v) with v ∈ dom(θ) ∧ eval_θ(v) ≠ null;
  (2) R = isBLANK(v) with eval_θ(v) ∈ B;
  (3) R = isIRI(v) with eval_θ(v) ∈ I;
  (4) R = isLITERAL(v) with eval_θ(v) ∈ L;
  (5) R = (u = v) with eval_θ(u) = eval_θ(v) ∧ eval_θ(u) ≠ null;
  (6) R = (¬R1) with R1θ = ⊥;
  (7) R = (R1 ∨ R2) with R1θ = ⊤ ∨ R2θ = ⊤;
  (8) R = (R1 ∧ R2) with R1θ = ⊤ ∧ R2θ = ⊤.

Rθ = ε, if:
  (1) R = isBLANK(v), R = isIRI(v), R = isLITERAL(v), or R = (u = v) with
      (v ∈ Var ∧ v ∉ dom(θ)) ∨ eval_θ(v) = null ∨
      (u ∈ Var ∧ u ∉ dom(θ)) ∨ eval_θ(u) = null;
  (2) R = (¬R1) and R1θ = ε;
  (3) R = (R1 ∨ R2) and R1θ ≠ ⊤ ∧ R2θ ≠ ⊤ ∧ (R1θ = ε ∨ R2θ = ε);
  (4) R = (R1 ∧ R2) and R1θ = ε ∨ R2θ = ε.

Rθ = ⊥ otherwise.
In [26] we have shown that the semantics defined this way coincides with the
original semantics for SPARQL defined in [25] in the absence of complex built-in and
aggregate terms and on basic datasets.16
Note that, so far we have only defined the semantics in terms of a pattern P and ba-
sic dataset DS , but neither taken the result form R nor extended datasets into account.
13 Note that we give a set-based semantics to the counting built-in; we do not take into account duplicate
solutions which can arise from the multi-set semantics in [27] when counting.
14 For blank nodes, eval_θ constructs a new blank node identifier, similar to Skolemization.
16 Our definition here only differs in the application of eval_θ on built-in terms in filter expressions, which
does not make a difference if only basic terms appear in FILTERs.
As for the former, we proceed by formally defining solutions for SELECT and CON-
STRUCT queries, respectively. The semantics of a SELECT query Q = (V, P, DS) is
fully determined by its solution tuples [26].
Definition 21 Let Q = (R, P, DS ) be a SPARQL++ query, and θ a substitution in
[[P ]]DS , then we call the tuple vars(P )θ a solution tuple of Q.
As for CONSTRUCT queries, we define solution graphs as follows.
Definition 22 Let Q = (R, P, DS ) be a SPARQL CONSTRUCT query where blank
node identifiers in DS and R have been standardized apart and R = {t1 , . . . , tn } is
the result graph pattern. Further, for any θ ∈ [[P]]DS, let θ′ = θ^{vars(R)∪vars(P)}. The
solution graph for Q is then defined as the set of triples obtained from

    ⋃_{θ ∈ [[P]]DS} { eval_θ′(t1), . . . , eval_θ′(tn) }
Example 17 Let query q select all persons who do not know anybody called “John
Doe” from the extended dataset DS = (g1 ∪ g2, ∅), i.e., the merge of the graphs in
Fig. 1(b).
SELECT ?P FROM <g1> FROM <g2>
WHERE { ?P rdf:type foaf:Agent . FILTER ( !BOUND(?P1) )
        OPTIONAL { ?P foaf:knows ?P1 . ?P1 foaf:name "John Doe" . } }
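Following the translation scheme from [26] extended here (a sketch; the rule indices follow the recursive numbering of the translation and, like the predicate names, are purely illustrative), the OPTIONAL part and the FILTER of this query roughly amount to:

    answer1(P,P1,def) :- answer2(P,P1,def), P1 = null.           % FILTER !BOUND(?P1)
    answer2(P,P1,def) :- answer4(P,def), answer5(P,P1,def).      % OPTIONAL: join part
    answer2(P,null,def) :- answer4(P,def), not answer4'(P,def).  % OPTIONAL: minus part
    answer4'(P,def) :- answer5(P,P1,def).
    answer4(P,def) :- triple_q(P,rdf:type,foaf:Agent,def).
    answer5(P,P1,def) :- triple_q(P,foaf:knows,P1,def),
                         triple_q(P1,foaf:name,"John Doe",def).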
More complex queries with nested patterns can be translated likewise by introducing
more auxiliary predicates. The program part defining the triple_q predicate fixes
the triples of the dataset by importing all explicit triples in the dataset as well as
recursively translating all CONSTRUCT clauses and subqueries in the extended dataset.
Example 18 The program to generate the dataset triples for the extended dataset
DS = (g1 ∪ g2, ∅) looks as follows:
triple_q(S,P,O,def) :- rdf["g1"](S,P,O).
triple_q(S,P,O,def) :- rdf["g2"](S,P,O).
triple_q(B2,foaf:knows,B3,def) :- SK[b2(X,N1,N2)](B2), SK[b3(X,N1,N2)](B3),
                                  answer_{C1,g2}(X,N1,N2).
triple_q(B2,foaf:name,N1,def) :- SK[b2(X,N1,N2)](B2), answer_{C1,g2}(X,N1,N2).
triple_q(B3,foaf:name,N2,def) :- SK[b3(X,N1,N2)](B3), answer_{C1,g2}(X,N1,N2).
answer_{C1,g2}(X,N1,N2) :- triple_q(X,dc:creator,N1,def),
                           triple_q(X,dc:creator,N2,def), N1 != N2.
The first two rules import all triples given explicitly in graphs g1, g2 by means of the
“standard” RDF import HEX predicate. The next three rules create the triples from the
CONSTRUCT in graph g2, where the query pattern is translated by its own subprogram
defining the predicate answer_{C1,g2}, which in this case consists of a single rule only.
The example shows the use of the external function SK to create blank node ids
for each solution tuple as mentioned before, which we need to emulate the semantics
of blank nodes in CONSTRUCT statements.
Next, we turn to the use of HEX aggregate predicates in order to translate aggregate
terms. Let Q = (R, P, DS) and let a = agg(V : Pa) – here, V ⊆ vars(Pa) is the tuple
of variables we want to aggregate over – be an aggregate term appearing either in R
or in a filter expression in P. Then the idea is that a can be translated by an external
atom agg[aux_a, vars(Pa)′[V/mask]](v_a) where
(i) vars(Pa)′ is obtained from vars(Pa) by removing all the variables which appear only in Pa
    but not elsewhere in P,
(ii) the variable v_a takes the place of a,
(iii) aux_a is a new predicate defined by the rule aux_a(vars(Pa)′) ← answer_a(vars(Pa)), and
(iv) answer_a is the predicate defining the solution set of the query Qa = (vars(Pa), Pa, DS).
Example 19 The following rules mimic the CONSTRUCT query of Example 15:
triple(P,os:latestRelease,Va) :- MAX[aux_a,P,mask](Va), triple(P,rdf:type,doap:Project,gr).
aux_a(P,V) :- answer_a(P,R,V).
answer_a(P,R,V) :- triple(P,doap:release,R,def), triple(R,doap:revision,V,def).
With the extensions to the translation in [26] outlined here for extended datasets, aggregate and built-in terms, we can now define the solution tuples of a SPARQL++ query Q = (R, P, DS) over an extended dataset as precisely the set of tuples corresponding to the cautious extension of the predicate answer_q.
There is one more entailment rule for reflexive-relaxed ρdf, stating that blank node renaming preserves ρdf entailment. However, it is neither straightforwardly possible nor desirable to encode this rule by CONSTRUCTs like the other rules: blank node renaming might have unintuitive effects on aggregations and in connection with OPTIONAL queries. In fact, keeping blank node identifiers in recursive CONSTRUCTs after standardizing apart is what keeps our semantics finite, so we skip this rule and call the resulting ρdf fragment encoded by the above CONSTRUCTs ρdf−. Some care is in order concerning strong safety of the resulting dataset when adding ρdf−. To still ensure strong safety of the translation, we complete the predicate-class-dependency graph by additional edges between all pairs of resources connected by subClassOf, subPropertyOf, domain, or range relations, and check the same safety condition as before on the graph extended in this manner.
4.6 Implementation
We implemented a prototype of a SPARQL++ engine based on the HEX-program solver dlvhex.18 The prototype exploits the rewriting mechanism of the dlvhex framework, taking care of the translation of a SPARQL++ query into the appropriate HEX-program, as laid out in Section 4.4. The system implements the external atoms used in the translation, namely (i) the RDF atom for data import, (ii) the aggregate atoms, and (iii) a string concatenation atom implementing both the CONCAT function and the SK atom for bNode handling. The engine can directly be fed with a SPARQL++ query. The default output of dlvhex corresponds to the usual answer format of logic programming engines, i.e., sets of facts, from which we generate an XML representation that can subsequently be transformed to valid RDF syntax by an XSLT in order to export solution graphs.
5 Related work
The idea of using SPARQL CONSTRUCT queries as rules is in fact not new; some implemented systems, such as TopBraid Composer, already seem to offer this feature,19 however without a formally defined, layered semantics and lacking aggregates or built-ins, and are thus insufficient to express mappings such as the ones studied in this article.
Our notion of extended graphs and datasets generalizes the so-called networked graphs of Schenk and Staab [29], who also use SPARQL CONSTRUCT statements as rules, with a slightly different motivation: dynamically generating views over graphs. The authors only permit bNode- and built-in-free CONSTRUCTs, whereas we additionally allow bNodes, built-ins, and aggregates, as long as strong safety holds, which only restricts recursion over value-generating triples. Another difference is that their semantics is based on the well-founded semantics instead of the stable model semantics.
PSPARQL [1], a recent extension of SPARQL, allows querying RDF graphs using regular path expressions over predicates. This extension is certainly useful for representing mappings and queries over graphs. We conjecture that we can partly emulate such path expressions by recursive CONSTRUCTs in extended datasets.
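As a sketch of this conjecture (our own illustration, reusing the FOAF vocabulary of the running examples; the auxiliary property ex:knowsTransitively is hypothetical): a path pattern asking for the transitive closure of foaf:knows might be emulated by adding to the extended dataset a graph containing the recursive

    CONSTRUCT { ?X ex:knowsTransitively ?Z }
    WHERE { { ?X foaf:knows ?Z }
            UNION { ?X foaf:knows ?Y . ?Y ex:knowsTransitively ?Z } }

and then querying for ex:knowsTransitively; strong safety is preserved here since the recursion involves no value-generating built-ins.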
As an interesting orthogonal approach, we mention iSPARQL [21], which proposes an alternative way to add external function calls to SPARQL by introducing so-called
18 Available with dlvhex at http://www.kr.tuwien.ac.at/research/dlvhex/.
19 http://composing-the-semantic-web.blogspot.com/2006/09/ontology-mapping-with-sparql-construct.html
virtual triple patterns which query a “virtual” dataset that could be an arbitrary service.
This approach does not need syntactic extensions of the language. However, an im-
plementation of this extension makes it necessary to know upfront which predicates
denote virtual triples. The authors use their framework to call a library of similarity
measure functions but do not focus on mappings or CONSTRUCT queries.
As already mentioned in the introduction, other approaches often allow mappings only at the ontology level, or deploy their own rule languages such as SWRL [19] or WRL [8]. A language more specifically tailored to ontology mapping is C-OWL [3],
which extends OWL with bridge rules to relate ontological entities. C-OWL is a for-
malism close to distributed description logics [2]. These approaches partially cover
aspects which we cannot handle, e.g., equating instances using owl:sameAs in SWRL
or relating ontologies based on a local model semantics [17] in C-OWL. None of these
approaches though offers aggregations which are often useful in practical applications
of RDF data syndication, the main application we target in the present work. The On-
tology Alignment Format [13] and the Ontology Mapping Language [28] are ongoing
efforts to express ontology mappings. In a recent work [14], these two languages were
merged and given a model-theoretic semantics which can be grounded to a particu-
lar logical formalism in order to be actually used to perform a mediation task. Our approach combines rule and mapping specification languages in a more practical manner than those mentioned above, exploiting the standard languages ρdf and SPARQL. We keep the ontology language expressivity low on purpose in order to retain decidability, thus providing an executable mapping specification format.
from our dlvhex as far as possible into more efficient SPARQL engines or possibly dis-
tributed SPARQL endpoints that cannot deal with extended datasets natively. Further,
we will investigate the feasibility of supporting larger fragments of RDFS and OWL.
Here, caution is in order, as arbitrary combinations of OWL and SPARQL++ involve the same problems as combining rules with ontologies in the general case (see [11]). We believe that starting with a small fragment is the right strategy to allow queries over networks of lightweight RDFS ontologies, connectable via expressive mappings, which we will gradually extend.
References
[1] F. Alkhateeb, J.-F. Baget, J. Euzenat. Extending SPARQL with Regular Expres-
sion Patterns. Tech. Report 6191, Inst. National de Recherche en Informatique et
Automatique, May 2007.
[2] A. Borgida, L. Serafini. Distributed Description Logics: Assimilating Informa-
tion from Peer Sources. Journal of Data Semantics, 1:153–184, 2003.
[3] P. Bouquet, F. Giunchiglia, F. van Harmelen, L. Serafini, H. Stuckenschmidt. C-
OWL: Contextualizing Ontologies. In The Semantic Web - ISWC 2003, Florida,
USA, 2003.
[4] W. Chen, M. Kifer, D. Warren. HiLog: A Foundation for Higher-Order Logic
Programming. Journal of Logic Programming, 15(3):187–230, February 1993.
[5] FOAF Vocabulary Specification, July 2005. http://xmlns.com/foaf/0.1/.
[6] J. de Bruijn, E. Franconi, S. Tessaris. Logical Reconstruction of Normative RDF.
In OWL: Experiences and Directions Workshop (OWLED-2005), Galway, Ireland,
2005.
[7] J. de Bruijn, S. Heymans. A Semantic Framework for Language Layering in
WSML. In First Int’l Conf. on Web Reasoning and Rule Systems (RR2007), Inns-
bruck, Austria, 2007.
[8] J. de Bruijn (ed.). Web Rule Language (WRL), 2005. W3C Member Submission.
[9] S. Decker et al. TRIPLE - an RDF Rule Language with Context and Use Cases. In
W3C Workshop on Rule Languages for Interoperability, Washington D.C., USA,
April 2005.
[10] E. Dumbill. DOAP: Description of a Project. http://usefulinc.com/doap/.
[11] T. Eiter, G. Ianni, A. Polleres, R. Schindlauer, H. Tompits. Reasoning with Rules
and Ontologies. In Reasoning Web 2006, pp. 93–127. Springer, Sept. 2006.
[12] T. Eiter, G. Ianni, R. Schindlauer, H. Tompits. A Uniform Integration of Higher-
Order Reasoning and External Evaluations in Answer Set Programming. In In-
ternational Joint Conference on Artificial Intelligence (IJCAI) 2005, pp. 90–96,
Edinburgh, UK, Aug. 2005.
[13] J. Euzenat. An API for Ontology Alignment. In Proc. 3rd International Semantic
Web Conference, Hiroshima, Japan, pp. 698–712, 2004.
[14] J. Euzenat, F. Scharffe, A. Zimmerman. Expressive Alignment Language and
Implementation. Project Deliverable D2.2.10, Knowledge Web NoE (EU-IST-
2004-507482), 2007.
[23] A. Malhotra, J. Melton, N. Walsh (eds.). XQuery 1.0 and XPath 2.0 Functions and Operators, Jan. 2007. W3C Recommendation.
[24] S. Muñoz, J. Pérez, C. Gutierrez. Minimal Deductive Systems for RDF. 4th
European Semantic Web Conference (ESWC’07), Innsbruck, Austria, 2007.
[25] J. Pérez, M. Arenas, C. Gutierrez. Semantics and Complexity of SPARQL. In
International Semantic Web Conference (ISWC 2006), pp. 30–43, 2006.
[26] A. Polleres. From SPARQL to Rules (and back). 16th World Wide Web Confer-
ence (WWW2007), Banff, Canada, May 2007.
[27] E. Prud’hommeaux, A. Seaborne (eds.). SPARQL Query Language for RDF,
June 2007. W3C Candidate Recommendation.
[28] F. Scharffe, J. de Bruijn. A Language to specify Mappings between Ontologies.
In First Int. Conf. on Signal-Image Technology and Internet-Based Systems (IEEE
SITIS’05), 2005.
[29] S. Schenk, S. Staab. Networked RDF Graphs. Tech. Report, Univ. Koblenz, 2007. http://www.uni-koblenz.de/~sschenk/publications/2006/ngtr.pdf.
[30] R. Schindlauer. Answer-Set Programming for the Semantic Web. PhD thesis,
Vienna University of Technology, Dec. 2006.
[31] J. Ullman. Principles of Database & Knowledge Base Systems. Comp. Science
Press, 1989.
[32] M. Völkel. RDF (Open Source) Software Vocabulary. http://xam.de/ns/os/.
Published in Proceedings of the 8th International Semantic Web Conference (ISWC 2009), pp.
310–327, Oct. 2009, Springer LNCS vol. 5823
Abstract
RDF Schema (RDFS) as a lightweight ontology language is gaining popularity
and, consequently, tools for scalable RDFS inference and querying are needed.
SPARQL has recently become a W3C standard for querying RDF data, but it
mostly provides means for querying simple RDF graphs only, whereas querying
with respect to RDFS or other entailment regimes is left outside the current speci-
fication. In this paper, we show that SPARQL faces certain unwanted ramifications
when querying ontologies in conjunction with RDF datasets that comprise multi-
ple named graphs, and we provide an extension for SPARQL that remedies these
effects. Moreover, since RDFS inference has a close relationship with logic rules,
we generalize our approach to select a custom ruleset for specifying inferences to
be taken into account in a SPARQL query. We show that our extensions are tech-
nically feasible by providing benchmark results for RDFS querying in our proto-
type system GiaBATA, which uses Datalog coupled with a persistent Relational
Database as a back-end for implementing SPARQL with dynamic rule-based in-
ference. By employing different optimization techniques like magic set rewriting
our system remains competitive with state-of-the-art RDFS querying systems.
∗ This work has been partially supported by the Italian Research Ministry (MIUR) project Interlink II04CG8AGG, the Austrian Science Fund (FWF) project P20841, and by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
1 Introduction
Thanks to initiatives such as DBpedia or the Linked Open Data project,1 a huge amount of machine-readable RDF [1] data is available, accompanied by pervasive ontologies describing this data, such as FOAF [2], SIOC [3], or YAGO [4].
A vast amount of Semantic Web data uses rather small and lightweight ontologies that can be dealt with by rule-based RDFS and OWL reasoning [5, 6, 7], in contrast to the full power of expressive description logic reasoning. However, even if many practical use cases do not require the complete reasoning on the terminological level provided by DL reasoners, the following tasks become of utmost importance. First, a Semantic
Web system should be able to handle and evaluate (possibly complex) queries on large
amounts of RDF instance data. Second, it should be able to take into account implicit
knowledge found by ontological inferences as well as by additional custom rules in-
volving built-ins or even nonmonotonicity. The latter features are necessary, e.g., for
modeling complex mappings [8] between different RDF vocabularies. As a third point,
joining the first and the second task, if we want the Semantic Web to be a solution to –
as Ora Lassila formulated it – those problems and situations that we are yet to define,2
we need triple stores that allow dynamic querying of different data graphs, ontologies,
and (mapping) rules harvested from the Web. The notion of dynamic querying is in
opposition to static querying, meaning that the same dataset, depending on context,
reference ontology and entailment regime, might give different answers to the same
query. Indeed, there are many situations in which the dataset at hand and its supporting
class hierarchy cannot be assumed to be known upfront: think of distributed querying
of remotely exported RDF data.
Concerning the first point, traditional RDF processors like Jena (using the default configuration) are designed for handling large RDF graphs in memory, and thus reach their limits very early when dealing with large graphs retrieved from the Web. Current RDF stores, such as YARS [9], Sesame, Jena TDB, ThreeStore, AllegroGraph, or OpenLink Virtuoso,3 provide roughly the same functionality as traditional relational database systems do for relational data. They offer query facilities and allow importing large amounts of RDF data into their persistent storage, and they typically support SPARQL [10], the W3C standard RDF query language. SPARQL has the same expressive power as non-recursive Datalog [11, 12] and includes a set of built-in predicates in so-called filter expressions.
However, as for the second and third point, current RDF stores offer only limited support. OWL or RDF(S) inference, let alone custom rules, is typically fixed in combination with SPARQL querying (cf. Section 2). Usually, dynamically assigning different ontologies or rulesets to data for querying is supported neither by the SPARQL specification nor by existing systems. Use cases for such dynamic querying involve, e.g., querying data with different versions of ontologies, or queries over data expressed in related ontologies adding custom mappings (using rules or "bridging" ontologies).
To this end, we propose an extension to SPARQL which caters for knowledge-
intensive applications on top of Semantic Web data, combining SPARQL querying with
1 http://dbpedia.org/ and http://linkeddata.org/
2 http://www.lassila.org/publications/2006/SCAI-2006-keynote.pdf
3 See http://openrdf.org/, http://jena.hpl.hp.com/wiki/TDB/, http://threestore.sf.net/, http://agraph.franz.com/allegrograph/, http://openlinksw.com/virtuoso/, respectively.
dynamic, rule-based inference. In this framework, we overcome some of the above
mentioned limitations of SPARQL and existing RDF stores. Moreover, our approach
is easily extensible by allowing features such as aggregates and arbitrary built-in predi-
cates to SPARQL (see [8, 14]) as well as the addition of custom inference and mapping
rules. The contributions of our paper are summarized as follows:
• We introduce two additional language constructs for the normative SPARQL language: first, the directives using ontology and using ruleset for dynamically coupling a dataset with an arbitrary RDFS ontology and rulesets, and second, extended dataset clauses, which allow specifying datasets with named graphs in a flexible way. The using ruleset directive can be exploited for adding proper rulesets to the query at hand, which might be used for a variety of applications, such as encoding mappings between entities or encoding custom entailment rules, e.g., RDFS or different rule-based OWL fragments.
• We present the GiaBATA system [15], which demonstrates how the above exten-
sions can be implemented on a middle-ware layer translating SPARQL to Datalog and
SQL. Namely, the system is based on known translations of SPARQL to Datalog rules.
Arbitrary, possibly recursive rules can be added flexibly to model arbitrary ontological
inference regimes, vocabulary mappings, or alike. The resulting program is compiled
to SQL where possible, such that only the recursive parts are evaluated by a native Datalog implementation. This hybrid approach allows us to benefit from the efficient algorithms of deductive database systems for custom rule evaluation, and from native features of common database systems, such as query plan optimization techniques or rich built-in functions (which are, for instance, needed to implement complex filter expressions in SPARQL).
• We compare our GiaBATA prototype to well-known RDF(S) systems and provide
experimental results for the LUBM [16] benchmark. Our approach proves to be com-
petitive on both RDF and dynamic RDFS querying without the need to pre-materialize
inferences.
In the remainder of this paper, we first introduce SPARQL along with RDF(S) and partial OWL inference by means of some motivating example queries, which existing systems partially cannot deal with in a reasonable manner, in Section 2. Section 3 sketches how the SPARQL language can be enhanced with custom ruleset specifications and arbitrary graph merging specifications. We then briefly introduce our approach to translate SPARQL rules to Datalog in Section 4, and show how this is applied to a persistent storage system. We evaluate our approach with respect to existing RDF stores in Section 5, and draw conclusions in Section 6.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix rel: <http://purl.org/vocab/relationship/>.
...
rel:friendOf rdfs:subPropertyOf foaf:knows.
foaf:knows rdfs:domain foaf:Person.
foaf:knows rdfs:range foaf:Person.
foaf:homepage rdf:type owl:InverseFunctionalProperty.
...
(a) Graph GM (<http://example.org/myOnt.rdfs>), a combination of the FOAF & Relationship ontologies.
Both graphs refer to terms in a combined ontology defining the FOAF and Relation-
ship4 vocabularies, see Fig. 1(a) for an excerpt.
On this data, the SPARQL query (1) intends to extract names of persons mentioned in those graphs that belong to friends of Bob. We assume that, by means of rdfs:seeAlso statements, Bob provides links to the graphs associated with the persons he is friends with.
select ?N from <http://example.org/myOnt.rdfs>
from <http://bob.org>
from named <http://alice.org> (1)
where { <http://bob.org#me> foaf:knows ?X . ?X rdfs:seeAlso ?G .
graph ?G { ?P rdf:type foaf:Person; foaf:name ?N } }
Here, the from and from named clauses specify an RDF dataset. In general, the
dataset DS = (G, N ) of a SPARQL query is defined by (i) a default graph G obtained
by the RDF merge [20] of all graphs mentioned in from clauses, and (ii) a set N =
{(u1 , G1 ), . . . , (uk , Gk )} of named graphs, where each pair (ui , Gi ) consists of an
IRI ui, given in a from named clause, paired with its corresponding graph Gi. For instance, the dataset of query (1) would be DS1 = (GM ⊎ GB, {(<http://alice.org>, GA)}), where ⊎ denotes merging of graphs according to the normative specifications.
Now, let us have a look at the answers to query (1). Answers to SPARQL select queries are defined in terms of multisets of partial variable substitutions. In fact, the answer to query (1) is empty when, as typical for current SPARQL engines, only simple RDF entailment is taken into account, and query answering then boils down to simple graph matching. Since neither of the graphs in the default graph contains any triple matching the pattern <http://bob.org#me> foaf:knows ?X in the where clause, the result of (1) is empty. When taking subproperty inference by the statements
4 http://vocab.org/relationship/
of the ontology in GM into account, however, one would expect to obtain three sub-
stitutions for the variable ?N: {?N/"Alice", ?N/"Bob", ?N/"Charles"}. We will
explain in the following why this is not the case in standard SPARQL.
In order to obtain the expected answer, firstly, SPARQL's basic graph pattern matching needs to be extended, see [10, Section 12.6]. In theory, this means that the graph patterns in the where clause need to be matched against an enlarged version of the original graphs in the dataset (which we will call the deductive closure Cl(·)) under a given entailment regime. Generic extensions of SPARQL to entailment regimes other than simple RDF entailment are still an open research problem,5 due to various issues:
(i) for (non-simple) RDF entailment regimes, such as full RDFS entailment, Cl(G) is
infinite, and thus SPARQL queries over an empty graph G might already have infinite
answers, and (ii) it is not yet clear which should be the intuitive answers to queries over
inconsistent graphs, e.g. in OWL entailment, etc. In fact, SPARQL restricts extensions
of basic graph pattern matching to retain finite answers. Not surprisingly, many ex-
isting implementations implement finite approximations of higher entailment regimes
such as RDFS and OWL [6, 5, 21]. E.g., the RDF Semantics document [20] contains
an informative set of entailment rules, a subset of which (such as the one presented in
Section 3.2 below) is implemented by most available RDF stores. These rule-based
approximations, which we focus on in this paper, are typically expressible by means
of Datalog-style rules. These latter model how to infer a finite closure of a given RDF
graph that covers sound but not necessarily complete RDF(S) and OWL inferences. It is worth noting that rule-based entailment can be implemented in different ways: rules could either be evaluated dynamically at query time, or the closure ClR(G) wrt. a ruleset R could be materialized when graph G is loaded into a store. Materialization of
inferred triples at loading time allows faster query responses, yet it has drawbacks: it is
time and space expensive and it has to be performed once and statically. In this setting,
it must be decided upfront
(a) which ontology should be taken into account for which data graph, and
(b) to which graph(s) the inferred triples “belong”, which particularly complicates the
querying of named graphs.
To exemplify (a), assume that a user agent wants to issue another query on graph GB with only the FOAF ontology in mind, since she does not trust the Relationship ontology. In the realm of FOAF alone, rel:friendOf has nothing to do with foaf:knows. However, when materializing all inferences upon loading GM and GB into the store, bob:me foaf:knows _:a would be inferred from GM ⊎ GB and would contribute to such a different query. Current RDF stores do not allow dynamically parameterizing inference with an ontology of choice at query time, since typically all inferences are computed once and for all at loading time.
As for (b), queries upon datasets including named graphs are even more problem-
atic. Query (1) uses GB in order to find the IRI identifiers for persons that Bob knows
by following rdfs:seeAlso links and looks for persons and their names in the named
RDF graphs found at these links. Even if rule-based inference were supported, the answer to query (1) over dataset DS1 is just {?N/"Alice"}, as "Alice" is the only
(explicitly) asserted foaf:Person in GA . Subproperty, domain and range inferences
over the GM ontology do not propagate to GA , since GM is normatively prescribed to
5 For details, cf. http://www.polleres.net/sparqltutorial/, Unit 5b.
be merged into the default graph, but not into the named graphs. Thus, there is no way to infer that "Bob" and "Charles" are actually names of foaf:Persons within the named graph GA. Indeed, SPARQL does not allow merging, on demand, graphs into the named graphs; thus, there is no way of combining GM with the named graph GA.
To remedy these deficiencies, we suggest an extension of the SPARQL syntax that allows specifying datasets more flexibly: it is possible to group graphs to be merged in parentheses in from and from named clauses. The modified query, obtaining a dataset DS2 = (GM ⊎ GB, {(<http://alice.org>, GM ⊎ GA)}), looks as follows:
select ?N
from (<http://example.org/myOnt.rdfs> <http://bob.org/>)
from named <http://alice.org>
(<http://example.org/myOnt.rdfs> <http://alice.org/>) (2)
where { bob:me foaf:knows ?X . ?X rdfs:seeAlso ?G .
graph ?G { ?X foaf:name ?N . ?X a foaf:Person . } }
For ontologies which should apply to the whole query, i.e., graphs to be merged into the default graph as well as into any specified named graph, we suggest a more convenient shortcut notation, adding the keyword using ontology to the SPARQL syntax:
select ?N
using ontology <http://example.org/myOnt.rdfs>
from <http://bob.org/>
from named <http://alice.org/> (3)
where { bob:me foaf:knows ?X . ?X rdfs:seeAlso ?G .
graph ?G { ?X foaf:name ?N . ?X a foaf:Person. } }
Hence, the using ontology construct allows for coupling the entire given dataset with the terminological knowledge in the myOnt data schema. As our investigation of currently available RDF stores (see Section 5) shows, none of these systems easily allows merging ontologies into named graphs or dynamically specifying the dataset of choice.
In addition to parameterizing queries with ontologies in the dataset clauses, we also allow parameterizing the ruleset which models the entailment regime at hand. By default, our framework supports a standard ruleset that "emulates" (a finite subset of) the RDFS semantics; this standard ruleset is outlined in Section 3 below. Alternatively, different rule-based entailment regimes, e.g., rulesets covering parts of the OWL semantics à la ter Horst [5], de Bruijn [22, Section 9.3], OWL2 RL [17], or other custom rulesets, can be referenced with the using ruleset keyword. For
instance, the following query returns the solution {?X/<http://alice.org#me>,
?Y/<http://bob.org#me>}, by doing equality reasoning over inverse functional
properties such as foaf:homepage when the FOAF ontology is being considered:
select ?X ?Y
using ontology <http://example.org/myOnt.rdfs>
using ruleset rdfs
using ruleset <http://www.example.com/owl-horst> (4)
from <http://bob.org/>
from <http://alice.org/>
where { ?X foaf:knows ?Y }
Query (4) uses the built-in RDFS rules for the usual subproperty inference, plus a ruleset implementing ter Horst's inference rules, which might be available at the URL http://www.example.com/owl-horst. This ruleset contains the following additional rules (where owl:iFP abbreviates owl:InverseFunctionalProperty), which will "equate" the blank node used in GA for "Bob" with <http://bob.org#me>:
?P rdf:type owl:iFP . ?S1 ?P ?O . ?S2 ?P ?O . → ?S1 owl:sameAs ?S2.
?X owl:sameAs ?Y → ?Y owl:sameAs ?X.
?X ?P ?O . ?X owl:sameAs ?Y → ?Y ?P ?O. (5)
?S ?X ?O . ?X owl:sameAs ?Y → ?S ?Y ?O.
?S ?P ?X . ?X owl:sameAs ?Y → ?S ?P ?Y.
Solutions of BGP matching consist of multisets of bindings for the variables mentioned in the pattern to terms in the active graph. Partial solutions of each subpattern are joined according to an algebra defining the optional, union and filter operators, cf. [10, 18, 12]. For our concerns here, the most interesting operator is the graph operator, since it changes the active graph. That is, the active graph is the default graph G0 for any basic graph pattern not occurring within a graph subpattern. However, in a subpattern graph g P1, the pattern P1 is matched against the named graph identified by g if g ∈ I, and against any named graph ui if g ∈ Var, where the binding ui is returned for variable g. Following [12], for an RDF dataset D and active graph G, we define [[P]]_G^D as the multiset of tuples constituting the answer to the graph pattern P. The solutions of a query Q = (V, D, P) are the projection of [[P]]_G^D to the variables in V only.
We can now define the semantics of extended and ontological dataset clauses as
follows. Let F be a set of ordinary and extended dataset clauses, and O be a set of
ontological dataset clauses. Let graph(g) be the graph associated with the IRI g: the extended RDF dataset obtained from F, denoted edataset(F), is composed of:
(1) G0 = {graph(g) | "from g" ∈ F}. If there is no from clause, then G0 = ∅.
(2) A named graph collection ⟨u, {graph(u)}⟩ for each "from named u" in F.
(3) A named graph collection ⟨i, {graph(i1), . . . , graph(im)}⟩ for each "from named i (i1 . . . im)" in F.
The graph collection obtained from O, denoted ocollection(O), is the set {graph(o) | "using ontology o" ∈ O}. The ordinary dataset of O and F, denoted dataset(F, O), is the set D(edataset(F), ocollection(O)).
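As a worked instance of these definitions (assuming that a grouped from clause, symmetrically to case (3), contributes each listed graph to G0): the dataset clauses of query (2) yield edataset(F) with G0 = {GM, GB} and the single named graph collection ⟨<http://alice.org>, {GM, GA}⟩, while ocollection(O) = ∅; merging the graphs of each collection then corresponds to the ordinary dataset DS2 given earlier.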
Let D and O be as above. The evaluation of a graph pattern P over D and O having active graph collection G, denoted [[P]]_G^{D,O}, is the evaluation of P over D(D, O) having active graph G = ⋃_{g∈G} g, that is, [[P]]_G^{D,O} = [[P]]_G^{D(D,O)}.
Note that the semantics of extended datasets is defined in terms of ordinary RDF datasets. This allows defining the semantics of SPARQL with extended and ontological dataset clauses by means of the standard SPARQL semantics. Also note that our extension is conservative, i.e., the semantics coincides with the standard SPARQL semantics whenever no ontological clauses and extended dataset clauses are specified.
We call sets of inference rules RDF inference rulesets, or rulesets for short.
Rule Application and Closure. We define RDF rule application in terms of the immediate consequences of a rule r or a ruleset R on a graph G. Given a BGP P, we denote by µ(P) a pattern obtained by substituting variables in P with elements of I ∪ B ∪ L. Let r be a rule of the form (6) and G be a set of RDF triples; then

    Tr(G) = {µ(Con) | ∃µ such that µ(Ante) ⊆ G}.

Accordingly, let TR(G) = ⋃_{r∈R} Tr(G). Also, let G0 = G and Gi+1 = Gi ∪ TR(Gi) for i ≥ 0. It can easily be shown that there exists a smallest n such that Gn+1 = Gn; we then call ClR(G) = Gn the closure of G with respect to ruleset R.
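As a small worked example (our own, using the subproperty rule of the ruleset RRDFS listed below, and a hypothetical namespace ex:): let G = { rel:friendOf rdfs:subPropertyOf foaf:knows . ex:a rel:friendOf ex:b . }. The rule "?P rdfs:subPropertyOf ?Q . ?S ?P ?O . → ?S ?Q ?O." matches with µ = {?P/rel:friendOf, ?Q/foaf:knows, ?S/ex:a, ?O/ex:b}, so TR(G) = { ex:a foaf:knows ex:b . } and G1 = G ∪ TR(G). A second application yields nothing new, hence G2 = G1 and ClR(G) = G1.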
We can now further define R-entailment between two graphs G1 and G2 , written
G1 |=R G2 , as ClR (G1 ) |= G2 . Obviously for any finite graph G, ClR (G) is finite.
8 Unlike some other rule languages for RDF, the most prominent of which being CONSTRUCT statements in SPARQL itself, we forbid blank nodes, i.e., existential variables, in rule consequents, which would require the "invention" of new blank nodes, typically causing termination issues.
In order to define the semantics of a SPARQL query wrt. R-entailment, we now extend graph pattern matching in [[P]]_G^D towards respecting R.
Analogously, one might use R-entailment as the basis for RDFS entailment as follows.
We consider here the ρDF fragment of RDFS entailment [6]. Let RRDFS denote the
ruleset corresponding to the minimal set of entailment rules (2)–(4) from [6]:
?P rdfs:subPropertyOf ?Q . ?Q rdfs:subPropertyOf ?R . → ?P rdfs:subPropertyOf ?R.
?P rdfs:subPropertyOf ?Q . ?S ?P ?O . → ?S ?Q ?O.
?C rdfs:subClassOf ?D . ?D rdfs:subClassOf ?E . → ?C rdfs:subClassOf ?E.
?C rdfs:subClassOf ?D . ?S rdf:type ?C . → ?S rdf:type ?D.
?P rdfs:domain ?C . ?S ?P ?O . → ?S rdf:type ?C.
?P rdfs:range ?C . ?S ?P ?O . → ?O rdf:type ?C.
Since obviously G |=RDFS ClRRDFS(G), and hence ClRRDFS(G) may be viewed as a finite approximation of RDFS-entailment, we obtain a reasonable definition of a BGP matching extension for RDFS by simply setting [[P]]_G^{D,RDFS} = [[P]]_G^{D,RRDFS}. We allow the special ruleset clause using ruleset rdfs to conveniently refer to this particular ruleset. Other rulesets may be published under a Web
dereferenceable URI, e.g., using an appropriate RIF [23] syntax.
Note, eventually, that our rulesets consist of positive rules, and as such enjoy a
natural monotonicity property.
Proposition 26 For rulesets R and R′ such that R ⊆ R′, and graphs G1 and G2: if G1 |=R G2 then G1 |=R′ G2.
Entailment regimes modeled using rulesets can thus be enlarged without retracting former inferences. This, for instance, allows introducing tighter RDFS-entailment approximations by extending RRDFS with further axioms, yet preserving inferred triples.
where the first rule (r1) computes the predicate "triple" taking values from the built-in predicate &rdf. The latter is generally used to import RDF statements from the specified URI. The following rules (r2) and (r3) compute the solutions for the filtered basic graph patterns { ?X a foaf:Person. ?X foaf:name ?N. filter (?N != "Alice") } and { ?X foaf:mbox ?M }. In particular, note here that the evaluation of filter expressions is "outsourced" to the built-in predicate &eval, which takes a filter expression and an encoding of variable bindings as arguments, and returns the evaluation value (true, false or error, following the SPARQL semantics). In order to emulate SPARQL's optional patterns, a combination of join and set difference operations is used, established by rules (r4)–(r6); a sketch follows below. Set difference is simulated by using both null values and negation as failure. According to the semantics of SPARQL, in the course of this translation one particularly has to take care of variables which are joined and possibly unbound (i.e., set to the null value) in the general case. Finally, the dedicated predicate answer in rule (r7) collects the answer substitutions for Q. DQ might then be merged with additional rulesets whenever Q contains using ruleset clauses.
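A minimal sketch of this join/difference encoding for optional patterns (predicate and variable names are ours; the generated program uses auxiliary names such as answer_b_join_1, cf. the SQL statement below):

    % join part: the optional pattern { ?X foaf:mbox ?M } matches
    answer_b_join_1(X,N,M) :- answer1(X,N), answer2(X,M).
    % difference part: no matching optional binding exists; M is set to null
    answer_b_join_1(X,N,null) :- answer1(X,N), not answer2_dom(X).
    % projection of the optional part to the shared join variable
    answer2_dom(X) :- answer2(X,M).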
From Datalog to SQL. For this step we rely on the system DLVDB [25], which implements Datalog under the stable model semantics on top of a DBMS of choice. DLVDB is able to translate Datalog programs into corresponding SQL query plans to be issued to the underlying DBMS. RDF datasets are simply stored in a database D, but the native dlvhex &rdf and &eval predicates in DQ cannot be processed by DLVDB directly over D. So DQ needs to be post-processed before it can be converted into suitable SQL statements.
Rule (r1) corresponds to loading persistent data into D instead of loading triples via the &rdf built-in predicate. In practice, the predicate "triple" occurring in program DA is directly associated with a database table TRIPLE in D. This operation is done off-line by a loader module which populates the TRIPLE table accordingly, while (r1) is removed from the program. The &eval predicate calls are recursively broken down into WHERE conditions of SQL statements, as sketched below when we discuss the implementation of filter statements.
After post-processing, we obtain a program D′Q, which DLVDB allows to be executed on a DBMS by translating it to corresponding SQL statements. D′Q is coupled with a mapping file which defines the correspondences between predicate names appearing in D′Q and the corresponding table and view names stored in the DBMS D.
For instance, rule (r4) of DA results in the following SQL statement issued to the RDBMS by DLVDB:
INSERT INTO answer_b_join_1
SELECT DISTINCT answer2_p2.a1, answer1_p1.a1, answer1_p1.a2, ’default’
FROM answer1 answer1_p1, answer2 answer2_p2
WHERE (answer1_p1.a2=answer2_p2.a2)
AND (answer1_p1.a3=’default’)
AND (answer2_p2.a3=’default’)
EXCEPT (SELECT * FROM answer_b_join_1)
Whenever possible, the predicates for computing intermediate results, such as answer1, answer2, answer_b_join_1, . . . , are mapped to SQL views rather than materialized tables, enabling dynamic evaluation of predicate contents on the DBMS side.9
9 For instance, recursive predicates require to be associated with permanent tables, while the remaining predicates are normally associated with views.
Schema rewriting. Our system allows for customizing the scheme in which triples are stored. It is known and debated [26] that in choosing the data scheme of D several aspects have to be considered, which affect performance and scalability when handling large-scale RDF data. A widely adopted solution is to exploit a single table storing quadruples of the form (s, p, o, c), where s, p, o and c are, respectively, the triple subject, predicate, object, and the context the triple belongs to. This straightforward representation is easily improved [27] by avoiding to store explicitly the string values referring to URIs and literals; instead, such values are replaced with a corresponding hash value.
Other approaches suggest alternative data structures, e.g., property tables [27, 26]. These aim at denormalizing RDF graphs by storing them in a flattened representation, trying to encode triples according to the hidden "schema" of the RDF data. Similarly to a traditional relational schema, in this approach D contains a table for each known property name (and often also one per class, splitting up the rdf:type table).
Our system provides sufficient flexibility to program different storage schemes: while on higher levels of abstraction data are accessible via the 4-ary triple predicate, a schema rewriter module is introduced in order to match D′Q to the current database scheme. This module currently adapts D′Q by replacing constant IRIs and literals with their corresponding hash values, and by introducing further rules which translate answers, converting hash values back to their original string representation.
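A hedged sketch of this rewriting (rule, predicate names, and hash values are our own assumptions, not the module's actual output): a rule over the 4-ary triple predicate such as

    answer1(X,N,def) :- triple(X,foaf:name,N,def).

might be adapted to the hashed scheme, with a dictionary relation mapping hashes back to lexical forms, as

    answer1_h(X,N,def) :- triple(X,1487634,N,def).       % 1487634: assumed hash of foaf:name
    answer1(X,S,def)   :- answer1_h(X,N,def), dict(N,S). % translate hash back to its string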
Magic sets. Notably, DLVDB can post-process D′Q using the magic sets technique, an optimization method well-known in the database field [28]. The optimized program mD′Q tailors the data to be queried to an extent significantly smaller than the original D′Q. The application of magic sets allows, e.g., applying the entailment rules RRDFS only to triples which might affect the answer to Q, thus preventing the full computation and/or materialization of inferred data.
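As a rough illustration of the effect (a simplified sketch of ours, not the actual rewriting produced by DLVDB): for a query asking only for the types of a single constant ex:c, the recursive subclass-entailment rule can be guarded by a "magic" predicate seeded from the query binding,

    magic_type(ex:c).                                      % seed: the constant bound in the query
    type(S,D) :- magic_type(S), subClassOf(C,D), type(S,C).

so that type-triples are only ever derived for ex:c instead of for all individuals in the dataset.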
Implementation of filter statements. Evaluation of SPARQL filter statements is pushed down to the underlying database D by translating filter expressions into appropriate SQL views. This allows for dynamically evaluating filter expressions on the DBMS side. For instance, given a rule r ∈ DQ of the form
h(X,Y) :- b(X,Y), &eval[f_Y](bool).
where the &eval atom encodes the filter statement (f_Y representing the filter expression), r is translated to
h(X,Y) :- b’(X,Y).
where b’ is a fresh predicate associated via the mapping file to a database view. Such
a view defines the SQL code to be used for the computation of f_Y, like
CREATE VIEW B’ AS ( SELECT X,Y FROM B WHERE F_Y )
5 Experiments
In order to illustrate that our approach is practically feasible, we present a quantitative performance comparison between our prototype system GiaBATA, which implements the approach outlined before, and some state-of-the-art triple stores. The tests were done on an Intel P4 3GHz machine with 1.5GB RAM under Linux 2.6.24. Let us briefly outline the main features and versions of the triple stores used in our comparison.
AllegroGraph works as a database and application framework for building Seman-
tic Web applications. The system assures persistent storage and RDFS++ reasoning, a
semantic extension including the RDF and RDFS constructs and some OWL constructs
(owl:sameAs, owl:inverseOf, owl:TransitiveProperty, owl:hasValue). We
tested the free Java edition of AllegroGraph 3.2 with its native persistence mecha-
nism.11
ARQ is a query engine implementing SPARQL under the Jena framework.12 It can
be deployed on several persistent storage layers, like filesystem or RDBMS, and it
includes a rule-based inference engine. Being based on the Jena library, it provides
inferencing models and enables (incomplete) OWL reasoning. Also, the system comes
with support for custom rules. We used ARQ 2.6 with an RDBMS backend connected to PostgreSQL 8.3.
GiaBATA [15] is our prototype system implementing the SPARQL extensions de-
scribed above. GiaBATA is based on a combination of the DLVDB [25] and dlvhex [24]
systems, and caters for persistent storage of both data and ontology graphs. The former
system is a variant of DLV [13] with built-in database support. The latter is a solver
for HEX-programs [24], which features an extensible plugin system which we used
for developing a rewriter-plugin able to translate SPARQL queries to HEX-programs.
The tests were done using development versions of the above systems connected to
PostgreSQL 8.3.
Sesame is an open source RDF database with support for querying and reasoning.13 In
addition to its in-memory database engine it can be coupled with relational databases or
deployed on top of file systems. Sesame supports RDFS inference and other entailment
regimes such as OWL-Horst [5] by coupling with external reasoners. Sesame provides
an infrastructure for defining custom inference rules. Our tests have been done using
Sesame 2.3 with persistence support given by the native store.
First of all, it is worth noting that all systems allow persistent storage on an RDBMS. All systems, with the exception of ours, also implement direct filesystem storage. All cover RDFS (actually, disregarding axiomatic triples) and partial or non-standard OWL fragments. Although all systems feature some form of persistence, both reasoning and query evaluation are usually performed in main memory. All systems, except AllegroGraph and ours, adopt a persistent materialization approach for inferring data. All systems, along with basic inference, support named graph querying, but, with the exception of GiaBATA, combining both features results in incomplete behavior as described in Section 2: inference is properly handled as long as the query ranges over the whole dataset, whereas it fails in the case of querying explicit default or named graphs.
11 System available at http://agraph.franz.com/allegrograph/.
12 Distributed at https://jena.svn.sourceforge.net/svnroot/jena/ARQ/.
13 System available at http://www.openrdf.org/.
Figure 2: Evaluation: log-scale evaluation times in seconds for queries Q1–Q7 over LUBM1, LUBM5, LUBM10, and LUBM30, comparing Allegro 3.2 (native/ordered), ARQ 2.6, GiaBATA (native/ordered), and Sesame 2.3; timed-out runs are marked as such. (Plots omitted.)
That makes querying of named graphs involving inference impossible with standard
systems.
For performance comparison we rely on the LUBM benchmark suite [16]. Our tests involve the test datasets LUBMn for n ∈ {1, 5, 10, 30}, with LUBM30 having roughly four million triples (exact numbers are reported in [16]). In order to test the additional performance cost of our extensions, we opted for showing how the performance figures change when queries which require RDFS entailment rules (LUBM Q4–Q7) are considered, compared to queries on which rules do not have an impact (LUBM Q1–Q3; see
Appendix of [16] for the SPARQL encodings of Q1–Q7). These experiments suffice for comparing performance trends, so we did not consider larger instances of LUBM at this stage. Note that the evaluation times include the data loading times. While former performance benchmarks do not take this aspect into account, from the semantic point of view pre-materialization-at-loading computes the inferences needed for complete query answering under the entailment of choice; dynamic querying of RDFS, in turn, moves inference from this materialization step to the query step, so disregarding loading times would give an apparent advantage to systems that rely on pre-materialized RDFS data. Also, the setting of this paper assumes that materialization cannot be performed once and for all, since the inferred information depends on the entailment regime of choice and on the dataset at hand, on a per-query basis. We set a 120min query timeout limit for all test runs.
Our test runs include the following system setups: (i) "Allegro (native)" and "Allegro (ordered)"; (ii) "ARQ"; (iii) "GiaBATA (native)" and "GiaBATA (ordered)"; and (iv) "Sesame". For (i) and (iii), which apply dynamic inference mechanisms, we use "(native)" and "(ordered)" to distinguish between executions of the queries in LUBM's native ordering and in an optimized, reordered version, respectively. The GiaBATA test runs both use magic sets optimization. To appreciate the cost of RDFS reasoning for queries Q4–Q7, the test runs for (i)–(iv) also include the loading times of the datasets, i.e., the time needed to perform RDFS data materialization or to simply store the raw RDF data.
The detailed outcomes of the tests are summarized in Fig. 2. For the RDF test queries Q1–Q3, GiaBATA is able to compete for Q1 and Q3. The systems ARQ and Sesame turned out to be competitive for Q2, having the best query response times, while Allegro (native) scored worst. For queries involving inference (Q4–Q7), Allegro shows better results. Interestingly, for the systems applying dynamic inference, namely Allegro and GiaBATA, query pattern reordering plays a crucial role in preserving performance and in assuring scalability; without reordering, the queries simply time out. In particular, Allegro is well-suited for queries ranging over several properties of a single class, whereas if the number of classes and properties increases (Q7), GiaBATA exhibits better scalability. Finally, we did not further distinguish between systems relying on DBMS support and systems using native structures; since the figures (in logarithmic scale) depict overall loading and querying time, this penalizes, in specific cases, those systems that use a DBMS.
spirit of [31], our framework allows for hypotheses (also called "premises") on a per-query basis rather than a per-atom basis.
References
[1] Klyne, G., Carroll, J.J. (eds.): Resource Description Framework (RDF): Concepts
and Abstract Syntax. W3C Rec. (February 2004)
[2] Brickley, D., Miller, L.: FOAF Vocabulary Specification 0.91 (2007) http://xmlns.com/foaf/spec/.
[3] Bojārs, U., Breslin, J.G., Berrueta, D., Brickley, D., Decker, S., Fernández, S.,
Görn, C., Harth, A., Heath, T., Idehen, K., Kjernsmo, K., Miles, A., Passant, A.,
Polleres, A., Polo, L., Sintek, M.: SIOC Core Ontology Specification (June 2007)
W3C member submission.
[4] Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A Core of Semantic Knowl-
edge. In: WWW2007, ACM (2007)
[5] ter Horst, H.J.: Completeness, decidability and complexity of entailment for RDF
Schema and a semantic extension involving the OWL vocabulary. J. Web Semant.
3(2–3) (2005) 79–115
[6] Muñoz, S., Pérez, J., Gutiérrez, C.: Minimal deductive systems for RDF. In:
ESWC’07. Springer (2007) 53–67
[7] Hogan, A., Harth, A., Polleres, A.: Scalable Authoritative OWL Reasoning for the Web. Int. J. Semant. Web Inf. Syst. 5(2) (2009)
[8] Polleres, A., Scharffe, F., Schindlauer, R.: SPARQL++ for mapping between
RDF vocabularies. In: ODBASE’07. Springer (2007) 878–896
[9] Harth, A., Umbrich, J., Hogan, A., Decker, S.: YARS2: A Federated Repository
for Querying Graph Structured Data from the Web. In: ISWC’07. Springer (2007)
211–224
[10] Prud’hommeaux, E., Seaborne, A. (eds.): SPARQL Query Language for RDF.
W3C Rec. (January 2008)
[11] Polleres, A.: From SPARQL to rules (and back). In: WWW2007. ACM (2007)
787–796
[12] Angles, R., Gutierrez, C.: The expressive power of SPARQL. In: ISWC’08.
Springer (2008) 114–129
[13] Leone, N., Pfeifer, G., Faber, W., Eiter, T., Gottlob, G., Perri, S., Scarcello, F.:
The DLV system for knowledge representation and reasoning. ACM Trans. Com-
put. Log. 7(3) (2006) 499–562
[14] Euzenat, J., Polleres, A., Scharffe, F.: Processing ontology alignments with
SPARQL. In: OnAV’08 Workshop, CISIS’08, IEEE Computer Society (2008)
913–917
[15] Ianni, G., Krennwallner, T., Martello, A., Polleres, A.: A Rule System for Query-
ing Persistent RDFS Data. In : ESWC’09. Springer (2009) 857–862
[16] Guo, Y., Pan, Z., Heflin, J.: LUBM: A Benchmark for OWL Knowledge Base
Systems. J. Web Semant. 3(2–3) (2005) 158–182
[17] Motik, B., Grau, B.C., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C. (eds.): OWL 2 Web Ontology Language Profiles. W3C Cand. Rec. (June 2009)
[18] Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. In:
ISWC’06. Springer (2006) 30–43
[19] Polleres, A., Schindlauer, R.: dlvhex-sparql: A SPARQL-compliant query engine
based on dlvhex. In: ALPSWS2007. CEUR-WS (2007) 3–12
[20] Hayes, P.: RDF semantics. W3C Rec. (February 2004).
[21] Ianni, G., Martello, A., Panetta, C., Terracina, G.: Efficiently querying RDF(S)
ontologies with answer set programming. J. Logic Comput. 19(4) (2009) 671–
695
[22] de Bruijn, J.: Semantic Web Language Layering with Ontologies, Rules, and
Meta-Modeling. PhD thesis, University of Innsbruck (2008)
[23] Boley, H., Kifer, M.: RIF Basic Logic Dialect. W3C Working Draft (July 2009)
[24] Eiter, T., Ianni, G., Schindlauer, R., Tompits, H.: Effective integration of declar-
ative rules with external evaluations for semantic web reasoning. In: ESWC’06.
Springer (2006) 273–287
[25] Terracina, G., Leone, N., Lio, V., Panetta, C.: Experimenting with recursive
queries in database and logic programming systems. Theory Pract. Log. Program.
8(2) (2008) 129–165
[26] Theoharis, Y., Christophides, V., Karvounarakis, G.: Benchmarking Database
Representations of RDF/S Stores. In: ISWC’05. Springer (2005) 685–701
[27] Abadi, D.J., Marcus, A., Madden, S., Hollenbach, K.J.: Scalable Semantic Web
Data Management Using Vertical Partitioning. In: VLDB. ACM (2007) 411–422
[28] Beeri, C., Ramakrishnan, R.: On the power of magic. J. Log. Program. 10(3-4)
(1991) 255–299
[29] Lu, J., Cao, F., Ma, L., Yu, Y., Pan, Y.: An Effective SPARQL Support over
Relational Databases. In: SWDB-ODBIS. (2007) 57–76
[30] Bonner, A.J.: Hypothetical datalog: complexity and expressibility. Theor. Comp.
Sci. 76(1) (1990) 3–51
[31] Gutiérrez, C., Hurtado, C.A., Mendelzon, A.O.: Foundations of semantic web
databases. In: PODS 2004, ACM (2004) 95–106
Scalable Authoritative OWL Reasoning for the Web
Aidan Hogan, Andreas Harth, and Axel Polleres
Published in International Journal on Semantic Web and Information Systems (IJSWIS),
Volume 5, Number 2, pp. 49–90, May 2009, IGI Global, ISSN 1552-6283
Abstract
In this paper we discuss the challenges of performing reasoning on large scale
RDF datasets from the Web. Using ter-Horst’s pD* fragment of OWL as a base, we
compose a rule-based framework for application to web data: we argue our deci-
sions using observations of undesirable examples taken directly from the Web. We
further temper our OWL fragment through consideration of “authoritative sources”
which counter-acts an observed behaviour which we term “ontology hijacking”:
new ontologies published on the Web re-defining the semantics of existing enti-
ties resident in other ontologies. We then present our system for performing rule-
based forward-chaining reasoning which we call SAOR: Scalable Authoritative
OWL Reasoner. Based upon observed characteristics of web data and reasoning
in general, we design our system to scale: our system is based upon a separation
of terminological data from assertional data and comprises of a lightweight in-
memory index, on-disk sorts and file-scans. We evaluate our methods on a dataset
in the order of a hundred million statements collected from real-world web sources
and present scale-up experiments on a dataset in the order of a billion statements
collected from the Web.
1 Introduction
Information attainable through the Web is unique in terms of scale and diversity. The Semantic Web movement aims to bring order to this information by providing a stack of technologies, the core of which is the Resource Description Framework (RDF) for publishing data in a machine-readable format: there now exist millions of RDF data sources on the Web contributing billions of statements.
ogy stack includes means to supplement instance data being published in RDF with
∗ A preliminary version of this article has been accepted at ASWC 2008 [24]. Compared to that version,
we have added significant material. The added contributions in this version include (i) a better formalisation
of authoritative reasoning, (ii) improvements in the algorithms, and (iii) respectively updated experimental
results with additional metrics on a larger dataset. We thank the anonymous reviewers of this and related
papers for their valuable feedback. This work has been supported by Science Foundation Ireland project
Lion (SFI/02/CE1/I131), European FP6 project inContext (IST-034718), COST Action “Agreement Tech-
nologies” (IC0801) and an IRCSET Postgraduate Research Scholarship.
ontologies described in RDF Schema (RDFS) [4] and the Web Ontology Language
(OWL) [2, 41], allowing people to formally specify a domain of discourse, and provid-
ing machines a more sapient understanding of the data. In particular, the enhancement
of assertional data (i.e., instance data) with terminological data (i.e., structural data)
published in ontologies allows for deductive reasoning: i.e., inferring implicit knowl-
edge.
In particular, our work on reasoning is motivated by the requirements of the Seman-
tic Web Search Engine (SWSE) project: http://swse.deri.org/, within which
we strive to offer search, querying and browsing over data taken from the Semantic
Web. Reasoning over aggregated web data is useful, for example: to infer new asser-
tions using terminological knowledge from ontologies and therefore provide a more
complete dataset; to unite fractured knowledge (as is common on the Web in the ab-
sence of restrictive formal agreement on identifiers) about individuals collected from
disparate sources; and to execute mappings between domain descriptions and thereby
provide translations from one conceptual model to another. The ultimate goal here is
to provide a “global knowledge-base”, indexed by machines, providing querying over
both the explicit knowledge published on the Web and the implicit knowledge infer-
able by machine. However, as we will show, complete inferencing on the Web is an
infeasible goal, due firstly to the complexity of such a task and secondly to noisy web
data; we aim instead to strike a comprise between the above goals for reasoning and
what is indeed feasible for the Web.
Current systems have had limited success in exploiting ontology descriptions for
reasoning over RDF web data. While there exists a large body of work in the area of
reasoning algorithms and systems that work and scale well in confined environments,
the distributed and loosely coordinated creation of a world-wide knowledge-base cre-
ates new challenges for reasoning:
• the system has to perform on web scale, with implications on the completeness
of the reasoning procedure, algorithms and optimisations;
SAOR adopts a standard rule-based approach to reasoning whereby each rule con-
sists of (i) an ‘antecedent’: a clause which identifies a graph pattern that, when matched
by the data, allows for the rule to be executed and (ii) a ‘consequent’: the statement(s)
that can be inferred given data that match the antecedent. Within SAOR, we view rea-
soning as a once-off rule-processing task over a given set of statements. Since the rules
are all known a-priori, and all require simultaneous execution, we can design a task-
specific system that offers much greater optimisations over more general rule engines.
Firstly, we categorise the known rules according to the composition of their antecedents
(e.g., with respect to arity, proportion of terminological and assertional patterns, etc.)
and optimise each group according to the observed characteristics. Secondly, we do
not use an underlying database or native RDF store and opt for implementation using
fundamental data-structures and primitive operations; our system is built from scratch
specifically (and only) for the purpose of performing pre-runtime forward-chaining
reasoning, which gives us greater freedom in implementing appropriate task-specific
optimisations.
This paper is an extended version of [24], in which we presented an initial modus-
operandi of SAOR; we provided some evaluation of a set of rules which exhibited
linear scale and concluded that using dynamic index structures, in SAOR, for more
complex rulesets, was not a viable solution for a large-scale reasoner. In this paper,
we provide extended discussion of our fragment of OWL reasoning and additional mo-
tivation for our deliberate incompleteness in terms of computational complexity and
impediments posed by web data considerations. We also describe an implementation
of SAOR which abandons dynamic index structures in favour of batch processing tech-
niques known to scale: namely sorts and file-scans. We present new evaluation of the
adapted system over a dataset of 147m triples collected from 665k web sources and
also provide scale-up evaluation of our most optimised ruleset on a dataset of 1.1b
statements collected from 6.5m web sources.
Specifically, we make the following contributions in this paper:
ments by applying our most optimised ruleset on a dataset of 1.1b statements
collected from 6.5m sources. We also reveal that the most computationally
efficient segment of our reasoning is the most productive with regard to inferred
output statements (Section 5).
We discuss related work in Section 6 and conclude with Section 7.
2 Preliminaries
Before we continue, we briefly introduce some concepts prevalent throughout the pa-
per. We use notation and nomenclature as is popular in the literature, particularly from
[22].
RDF Term Given a set of URI references U, a set of blank nodes B, and a set of
literals L, the set of RDF terms is denoted by RDFTerm = U ∪ B ∪ L. The set of
blank nodes B is a set of existentially quantified variables. The set of literals is given
as L = L_p ∪ L_t, where L_p is the set of plain literals and L_t is the set of typed literals.
A typed literal is the pair l = (s, t), where s is the lexical form of the literal and t ∈ U
is a datatype URI. The sets U, B, L_p and L_t are pairwise disjoint.
RDF Triple in Context/RDF Quadruple A pair (t, c) with a triple t = (s, p, o) and c
∈ U is called a triple in context c [16, 20]. We may also refer to (s, p, o, c) as the RDF
quadruple or quad q with context c.
We use the term 'RDF statement' to refer generically to a triple or a quadruple where
the differentiation is not pertinent.
RDF Graph/Web Graph An RDF graph G is a set of RDF triples; that is, a subset
of (U ∪ B) × U × (U ∪ B ∪ L).
We refer to a web graph W as a graph derived from a given web location (i.e., a
given document). We call the pair (W, c) a web graph W in context c, where c is the
web location from which W is retrieved. Informally, (W, c) is represented as the set
of quadruples (tw , c) for all tw ∈ W.
Merge The merge M(S) of a set of graphs S is the union of the set of all graphs G′
for G ∈ S, where each G′ is derived from G such that G′ contains a unique set of blank
nodes for S.
Web Knowledge-base Given a set SW of RDF web graphs, our view of a web
knowledge-base KB is taken as a set of pairs (W′, c) for each W ∈ SW, where W′
contains a unique set of blank nodes for SW and c denotes the URL location of W.
Informally, KB is a set of quadruples retrieved from the Web wherein the set of
blank nodes are unique for a given document and triples are enhanced by means of
context which tracks the web location from which each triple is retrieved. We use the
abbreviated notation W ∈ KB or W′ ∈ KB where we mean W ∈ SW for the SW from
which KB is derived, or (W′, c) ∈ KB for some c.
Throughout this paper, we abbreviate the OWL namespace with prefix ":", i.e. we write e.g. just ":Class", ":disjointWith", etc. instead of us-
ing the commonly used owl: prefix. Other prefixes such as rdf:, rdfs:, foaf: are used as in other
common documents. Moreover, we often use the common abbreviation 'a' as a convenient shortcut for
rdf:type.
Terminological Triple We define a terminological triple as one of the following:
1. a membership assertion of a meta-class;
2. a membership assertion of a meta-property;
3. a triple in a non-branching, non-cyclic path tr_0, ..., tr_n where tr_0 = (s_0, p_0, o_0) for
p_0 ∈ { :intersectionOf, :oneOf, :unionOf }; tr_k = (o_{k-1}, rdf:rest, o_k)
for 1 ≤ k ≤ n, o_{k-1} ∈ B and o_n = rdf:nil; or a triple tf_k = (o_k, rdf:first,
e_k) with o_k for 0 ≤ k < n as before.
We refer to triples tr_1, ..., tr_n and all triples tf_k as terminological collection triples, whereby
RDF collections are used in a union, intersection or enumeration class description.
Instance A triple t = (s, p, o) (or, resp., a set of triples, i.e., a graph G) is an instance
of a triple pattern tp = (s_v, p_v, o_v) (or, resp., of a basic graph pattern GP) if there
exists a mapping µ : V ∪ RDFTerm → RDFTerm, which maps every element
of RDFTerm to itself, such that t = µ(tp) = (µ(s_v), µ(p_v), µ(o_v)) (or, resp., and
slightly simplifying notation, G = µ(GP)).
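This notion of instance corresponds to a simple matching procedure; the following minimal Python sketch (with illustrative names, not part of SAOR) demonstrates the idea:

    # A minimal sketch of instance checking: RDF terms are plain strings,
    # variables are marked with a leading '?'.
    def instance_mapping(pattern, triple, mu=None):
        """Try to extend mapping mu so that mu(pattern) == triple;
        return the extended mapping, or None if impossible."""
        mu = dict(mu or {})
        for p_term, t_term in zip(pattern, triple):
            if p_term.startswith("?"):
                if mu.setdefault(p_term, t_term) != t_term:
                    return None      # variable already bound differently
            elif p_term != t_term:
                return None          # constant terms must map to themselves
        return mu

    # e.g. instance_mapping(("?x", "rdf:type", "?C"),
    #                       ("ex:me", "rdf:type", "foaf:Person"))
    # yields {"?x": "ex:me", "?C": "foaf:Person"}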
Inference Rule We define an inference rule r as the pair (Ante, Con), where the
antecedent Ante and the consequent Con are basic graph patterns such that V(Con)
and V(Ante) are non-empty, V(Con) ⊆ V(Ante) and Con does not contain blank
nodes³. In this paper, we will typically write inference rules as:

Ante ⇒ Con    (1)

³ Unlike, e.g., in CONSTRUCT templates in SPARQL, we forbid blank nodes; i.e., we forbid existential variables in rule consequents which would
require the "invention" of blank nodes.
Rule Application and Closure We define a rule application in terms of the imme-
diate consequences of a rule r or a set of rules R on a graph G (here slightly abusing
the notion of the immediate consequence operator in Logic Programming: cf. for ex-
ample [30]). That is, if r is a rule of the form (1), and G is a set of RDF triples, then:

T_r(G) = { µ(Con) | µ(Ante) ⊆ G for some mapping µ }, and, for a set of rules R, T_R(G) = ⋃_{r ∈ R} T_r(G).
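A minimal Python sketch of the immediate-consequence operator and the resulting closure fixpoint follows, assuming rules as (antecedent, consequent) lists of triple patterns and variables marked with a leading '?'; it is a naive in-memory illustration, not the SAOR implementation:

    from itertools import product

    def _match(pattern, triple, mu):
        mu = dict(mu)
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if mu.setdefault(p, t) != t:
                    return None
            elif p != t:
                return None
        return mu

    def immediate_consequences(rule, graph):
        ante, con = rule
        out = set()
        for triples in product(graph, repeat=len(ante)):   # naive join
            mu = {}
            for pattern, triple in zip(ante, triples):
                mu = _match(pattern, triple, mu)
                if mu is None:
                    break
            if mu is not None:
                out.update(tuple(mu.get(t, t) for t in pat) for pat in con)
        return out

    def closure(rules, graph):
        graph = set(graph)
        while True:   # fixpoint: apply all rules until nothing new
            new = set().union(
                *(immediate_consequences(r, graph) for r in rules)) - graph
            if not new:
                return graph
            graph |= new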
from unwanted, independent third-party contributions. In particular, we adhere to the
following high-level restrictions:
OWL Lite). In [3], the authors identified and categorised OWL DL restrictions vio-
lated by a sample group of 201 OWL ontologies (all of which were found to be in
OWL Full); these include incorrect or missing typing of classes and properties, com-
plex object-properties (e.g., functional properties) declared to be transitive, inverse-
functional datatype properties, etc. In [46], a more extensive survey with nearly 1,300
ontologies was conducted: 924 were identified as being in OWL Full.
Taking into account that most web ontologies are in OWL Full, and also the unde-
cidability/computational infeasibility of OWL Full, one could conclude that complete
reasoning on the Web is impractical. However, for most web documents only cat-
egorisable as OWL Full, the infringements are mainly syntactic and rather innocuous,
with no real effect on decidability ([46] showed that the majority of web documents
surveyed were in the base expressivity for Description Logics after patching infringe-
ments).
The main justification for the infeasibility of complete reasoning on the Web is
inconsistency.
Consistency cannot be expected on the Web; for instance, a past web crawl of ours
revealed the following:
w3:timbl a foaf:Person ; foaf:homepage <http://w3.org/> .
w3:w3c a foaf:Organization ; foaf:homepage <http://w3.org/> .
foaf:homepage a :InverseFunctionalProperty .
foaf:Organization :disjointWith foaf:Person .
These triples together infer that Tim Berners-Lee is the same as the W3C and thus
cause an inconsistency.⁵ Aside from such examples which arise from misunderstand-
ing of the FOAF vocabulary, there might be cases where different parties deliberately
make contradictory statements; resolution of such contradictions could involve "choos-
ing sides". In any case, the explosive nature of contradiction in classical logics suggests
that it is not desirable within our web reasoning scenario.
obvious advantages in our web reasoning scenario; thus SAOR’s approach to reasoning
is inspired by the pD* fragment to cover large parts of OWL by positive inference rules
which can be implemented in a forward-chaining engine.
Table 1 summarises the pD* ruleset. The rules are divided into D*-entailment rules
and P-entailment rules. D*-entailment is essentially RDFS entailment [22] combined
with some datatype reasoning. P-entailment is introduced in [43] as a set of rules which
applies to a property-related subset of OWL.
Given pD*, we make some amendments so as to align the ruleset with our require-
ments. Table 2 provides a full listing of our own modified ruleset, which we compare
against pD* in this section. Note that this table highlights characteristics of the rules
which we will discuss in Section 3.3 and Section 3.4; for the moment we point out that
rule′ is used to indicate an amendment to the respective pD* rule. Please also note that
we use the notation rulex* to refer to all rules with the prefix rulex.
pD* rule (antecedent ⇒ consequent) where
D*-entailment rules
lg      : ?x ?P ?l . ⇒ ?x ?P :b_l .                                   where ?l ∈ L
gl      : ?x ?P :b_l . ⇒ ?x ?P ?l .                                   where ?l ∈ L
rdf1    : ?x ?P ?y . ⇒ ?P a rdf:Property .
rdf2-D  : ?x ?P ?l . ⇒ :b_l a ?t .                                    where ?l = (s, t) ∈ L_t
rdfs1   : ?x ?P ?l . ⇒ :b_l a rdfs:Literal .                          where ?l ∈ L_p
rdfs2   : ?P rdfs:domain ?C . ?x ?P ?y . ⇒ ?x a ?C .
rdfs3   : ?P rdfs:range ?C . ?x ?P ?y . ⇒ ?y a ?C .                   where ?y ∈ U ∪ B
rdfs4a  : ?x ?P ?y . ⇒ ?x a rdfs:Resource .
rdfs4b  : ?x ?P ?y . ⇒ ?y a rdfs:Resource .                           where ?y ∈ U ∪ B
rdfs5   : ?P rdfs:subPropertyOf ?Q . ?Q rdfs:subPropertyOf ?R . ⇒ ?P rdfs:subPropertyOf ?R .
rdfs6   : ?P a rdf:Property . ⇒ ?P rdfs:subPropertyOf ?P .
rdfs7   : ?P rdfs:subPropertyOf ?Q . ?x ?P ?y . ⇒ ?x ?Q ?y .          where ?Q ∈ U ∪ B
rdfs8   : ?C a rdfs:Class . ⇒ ?C rdfs:subClassOf rdfs:Resource .
rdfs9   : ?C rdfs:subClassOf ?D . ?x a ?C . ⇒ ?x a ?D .
rdfs10  : ?C a rdfs:Class . ⇒ ?C rdfs:subClassOf ?C .
rdfs11  : ?C rdfs:subClassOf ?D . ?D rdfs:subClassOf ?E . ⇒ ?C rdfs:subClassOf ?E .
rdfs12  : ?P a rdfs:ContainerMembershipProperty . ⇒ ?P rdfs:subPropertyOf rdfs:member .
rdfs13  : ?D a rdfs:Datatype . ⇒ ?D rdfs:subClassOf rdfs:Literal .
P-entailment rules
rdfp1   : ?P a :FunctionalProperty . ?x ?P ?y , ?z . ⇒ ?y :sameAs ?z .    where ?y ∈ U ∪ B
rdfp2   : ?P a :InverseFunctionalProperty . ?x ?P ?z . ?y ?P ?z . ⇒ ?x :sameAs ?y .
rdfp3   : ?P a :SymmetricProperty . ?x ?P ?y . ⇒ ?y ?P ?x .           where ?y ∈ U ∪ B
rdfp4   : ?P a :TransitiveProperty . ?x ?P ?y . ?y ?P ?z . ⇒ ?x ?P ?z .
rdfp5a  : ?x ?P ?y . ⇒ ?x :sameAs ?x .
rdfp5b  : ?x ?P ?y . ⇒ ?y :sameAs ?y .                                where ?y ∈ U ∪ B
rdfp6   : ?x :sameAs ?y . ⇒ ?y :sameAs ?x .                           where ?y ∈ U ∪ B
rdfp7   : ?x :sameAs ?y . ?y :sameAs ?z . ⇒ ?x :sameAs ?z .
rdfp8a  : ?P :inverseOf ?Q . ?x ?P ?y . ⇒ ?y ?Q ?x .                  where ?y, ?Q ∈ U ∪ B
rdfp8b  : ?P :inverseOf ?Q . ?x ?Q ?y . ⇒ ?y ?P ?x .                  where ?y ∈ U ∪ B
rdfp9   : ?C a :Class ; :sameAs ?D . ⇒ ?C rdfs:subClassOf ?D .
rdfp10  : ?P a :Property ; :sameAs ?Q . ⇒ ?P rdfs:subPropertyOf ?Q .
rdfp11  : ?x :sameAs ?x′ . ?y :sameAs ?y′ . ?x ?P ?y . ⇒ ?x′ ?P ?y′ . where ?x′ ∈ U ∪ B
rdfp12a : ?C :equivalentClass ?D . ⇒ ?C rdfs:subClassOf ?D .
rdfp12b : ?C :equivalentClass ?D . ⇒ ?D rdfs:subClassOf ?C .          where ?D ∈ U ∪ B
rdfp12c : ?C rdfs:subClassOf ?D . ?D rdfs:subClassOf ?C . ⇒ ?C :equivalentClass ?D .
rdfp13a : ?P :equivalentProperty ?Q . ⇒ ?P rdfs:subPropertyOf ?Q .
rdfp13b : ?P :equivalentProperty ?Q . ⇒ ?Q rdfs:subPropertyOf ?P .    where ?Q ∈ U ∪ B
rdfp13c : ?P rdfs:subPropertyOf ?Q . ?Q rdfs:subPropertyOf ?P . ⇒ ?P :equivalentProperty ?Q .
rdfp14a : ?C :hasValue ?y ; :onProperty ?P . ?x ?P ?y . ⇒ ?x a ?C .
rdfp14b : ?C :hasValue ?y ; :onProperty ?P . ?x a ?C . ⇒ ?x ?P ?y .   where ?P ∈ U ∪ B
rdfp15  : ?C :someValuesFrom ?D ; :onProperty ?P . ?x ?P ?y . ?y a ?D . ⇒ ?x a ?C .
rdfp16  : ?C :allValuesFrom ?D ; :onProperty ?P . ?x a ?C ; ?P ?y . ⇒ ?y a ?D .   where ?y ∈ U ∪ B
SAOR rule (antecedent ⇒ consequent) where
R0: only terminological patterns in antecedent
rdfc0 : ?C :oneOf (?x_1 ... ?x_n) . ⇒ ?x_1 ... ?x_n a ?C .   where ?C ∈ B
pD* Rules Directly Supported From the set of pD* rules, we directly support rules
rdfs2, rdfs9, rdfp2, rdfp4, rdfp7, and rdfp17.
listed in [43, Table 3] and [43, Table 6] for RDF(S) and OWL respectively; according
to pD*, these are inferred for the empty graph. Secondly, we do not materialise mem-
bership assertions for rdfs:Resource which would hold for every URI and blank
node in a graph. Thirdly, we do not materialise reflexive :sameAs membership as-
sertions, which again hold for every URI and blank node in a graph. We see such
statements as inflationary and orthogonal to our aim of reduced output.
pD* Omissions: Terminological Inferences From pD*, we also omit rules which
infer only terminological statements: namely rdf1, rdfs5, rdfs6, rdfs8, rdfs10, rdfs11,
rdfs12, rdfs13, rdfp9, rdfp10, rdfp12* and rdfp13*. Our use-case is query-
answering over assertional data; we therefore focus in this paper on materialising as-
sertional data.
We have already motivated omission of inference through :sameAs rules rdfp9
and rdfp10. Rules rdf1, rdfs8, rdfs12 and rdfs13 infer memberships of, or sub-
class/subproperty relations to, RDF(S) classes and properties; we are not interested
in these primarily syntactic statements which are not directly used in our inference
rules. Rules rdfs6 and rdfs10 infer reflexive memberships of rdfs:subPropertyOf
and rdfs:subClassOf meta-properties which are used in our inference rules; clearly
however, these reflexive statements will not lead to unique assertional inferences through
the related rules rdfs7′ or rdfs9 respectively. Rules rdfs5 and rdfs11 infer transitive mem-
berships again of rdfs:subPropertyOf and rdfs:subClassOf; again however,
exhaustive application of rules rdfs7′ or rdfs9 respectively ensures that all possible
assertional inferences are materialised without the need for the transitive rules. Rules
rdfp12c and rdfp13c infer additional :equivalentClass/:equivalentProperty
statements from rdfs:subClassOf/rdfs:subPropertyOf statements where asser-
tional inferences can instead be conducted through two applications each of rules rdfs9
and rdfs7′ respectively.
pD* Amendments: Direct Assertional Inferences The observant reader may have
noticed that we did not dismiss inferencing for rules rdfp12a,rdfp12b/rdfp13a,rdfp13b
which translate :equivalentClass/:equivalentProperty to rdfs:subClassOf/
rdfs:subPropertyOf. In pD*, these rules are required to support indirect asser-
tional inferences through rules rdfs9 and rdfs7 respectively; we instead support as-
sertional inferences directly from the :equivalentProperty/:equivalentClass
statements using symmetric rules rdfp12a′,rdfp12b′/rdfp13a′,rdfp13b′.
Additions to pD* In addition to pD*, we also include some “class based entailment”
from OWL, which we call C-entailment. We name such rules using the rdfc* stem,
following the convention from P-entailment. We provide limited support for enumer-
ated classes (rdfc0), union class descriptions (rdfc1), intersection class descriptions
(rdfc3*)6 , as well as limited cardinality constraints (rdfc2, rdfc4*).
Example 20:
# FROM SOURCE <ex:>
ex:Person :onProperty ex:parent ; :someValuesFrom ex:Person .
# FROM SOURCE <ex2:>
ex:Person :allValuesFrom ex2:Human .
According to the abstract syntax mapping, neither of the restrictions should be iden-
tified by a URI (if blank nodes were used instead of ex:Person as mandated by the
abstract syntax, such a problem could not occur as each web-graph is given a unique set
of blank nodes). If we consider the RDF-merge of the two graphs, we will be unable to
distinguish which restriction the :onProperty value should be applied to. As above,
allowing URIs in these positions would enable “syntactic interference” between data
sources. Thus, in our ruleset, we always enforce blank-nodes as mandated by the OWL
abstract syntax; this specifically applies to pD* rules rdfp14*, rdfp15 and rdfp16 and
to all of our C-entailment rules rdfc*. We denote the restrictions in the where col-
umn of Table 2. Indeed, in our treatment of terminological collection statements, we
enforced blank nodes in the subject position of rdf:first/rdf:rest membership
assertions, as well as blank nodes in the object position of non-terminating rdf:rest
statements; these are analogously part of the OWL abstract syntax restrictions.
as they have variable antecedent-body length and, thus, can affect complexity considerations. It was in-
formally stated that :intersectionOf and :unionOf could be supported under pD* through re-
duction into subclass relations; however, no rules were explicitly defined, and our rule rdfc3b could not be
supported in this fashion. We support such rules here since we are not so concerned for the moment with
theoretical worst-case complexity, but are more concerned with the practicalities of web reasoning.
SAOR separates terminological data from assertional data according to their use of the RDF(S) and OWL vocabulary; these
are commonly known as the "T-Box" and "A-Box" respectively (loosely borrowing
Description Logics terminology). In particular, we require a separation of T-Box data
as part of a core optimisation of our approach; we wish to perform a once-off load of
T-Box data from our input knowledge-base into main memory.
Let P_SAOR and C_SAOR, resp., be the exact sets of RDF(S)/OWL meta-properties
and meta-classes used in our inference rules; viz. P_SAOR = { rdfs:domain, rdfs:range,
rdfs:subClassOf, rdfs:subPropertyOf, :allValuesFrom, :cardinality,
:equivalentClass, :equivalentProperty, :hasValue, :intersectionOf,
:inverseOf, :maxCardinality, :minCardinality, :oneOf, :onProperty,
:sameAs, :someValuesFrom, :unionOf } and, resp., C_SAOR =
{ :FunctionalProperty, :InverseFunctionalProperty, :SymmetricProperty,
:TransitiveProperty }; our T-Box is a set of terminological triples re-
stricted to only include membership assertions for P_SAOR and C_SAOR and the set of
terminological collection statements. Table 2 identifies T-Box patterns by underlining.
Statements from the input knowledge-base that match these patterns are all of the T-
Box statements we consider in our reasoning process: inferred statements, or statements
that do not match one of these patterns, are not considered part of the T-Box, but
are treated purely as assertional. We now define our T-Box:
Definition 24 (T-Box) Let T_G be the union of all graph pattern instances from a graph
G for a terminological (underlined) graph pattern in Table 2; i.e., T_G is itself a graph.
We call T_G the T-Box of G.
Also, let P^domP_SAOR = { rdfs:domain, rdfs:range, rdfs:subPropertyOf,
:equivalentProperty, :inverseOf } and P^ranP_SAOR = { rdfs:subPropertyOf,
:equivalentProperty, :inverseOf, :onProperty }. We call φ a property
in T-Box T if there exists a triple t = (s, p, o) ∈ T where
• s = φ and p ∈ P^domP_SAOR; or
• p ∈ P^ranP_SAOR and o = φ; or
• s = φ, p = rdf:type and o ∈ C_SAOR.
Similarly, let P^domC_SAOR = { rdfs:subClassOf, :allValuesFrom, :cardinality,
:equivalentClass, :hasValue, :intersectionOf, :maxCardinality,
:minCardinality, :oneOf, :onProperty, :someValuesFrom, :unionOf } and
P^ranC_SAOR = { rdf:first, rdfs:domain, rdfs:range, rdfs:subClassOf,
:allValuesFrom, :equivalentClass, :someValuesFrom }. We
call χ a class in T-Box T if there exists a triple t = (s, p, o) ∈ T where
• p ∈ P^domC_SAOR and s = χ; or
• p ∈ P^ranC_SAOR and o = χ.
We define the signature of a T-Box T to be the set of all properties and classes in
T as above, which we denote by sig(T).
For our knowledge-base KB, we define our T-Box T as the set of all pairs (T_W′, c)
where (W′, c) ∈ KB and T_W′ ≠ ∅. Again, we may use the intuitive notation T_W′ ∈
T. We define our A-Box A as containing all of the statements in KB, including T
and the set of class and property membership assertions possibly using identifiers in
P_SAOR ∪ C_SAOR; i.e., unlike description logics, our A is synonymous with our KB.
We use the term A-Box to distinguish data that are stored on-disk (which includes
the T-Box data also stored in memory).
We now define our notion of a T-split inference rule, whereby part of the an-
tecedent is a basic graph pattern strictly instantiated by a static T-Box T.
We generally write (Ante_T, Ante_G, Con) as Ante_T Ante_G ⇒ Con, where Table 2
follows this convention. We call Ante_T the terminological or T-Box antecedent pattern
and Ante_G the assertional or A-Box antecedent pattern.
We now define three disjoint sets of T-split rules, which consist of (i) only a T-Box
graph pattern; (ii) both a T-Box and an A-Box graph pattern; and (iii) only an A-Box graph pattern:
that T_R(T, KB) = T_R(⋃_{T_W′ ∈ T} T_W′, KB) = ⋃_{T_W′ ∈ T} T_R(T_W′, KB). In other words,
one web-graph cannot re-use structural statements in another web-graph to instantiate
a T-Box pattern in our rules; this has a bearing on our notion of authoritative reasoning,
which will be highlighted at the end of Section 3.4.
Further, a separate static T-Box within which inferences are not reflected has impli-
cations for the completeness of reasoning w.r.t. the presented ruleset. Although, as
presented in Section 3.2, we do not infer terminological statements and thus can sup-
port most inferences directly from our static T-Box, SAOR still does not fully support
meta-modelling [33]: by separating the T-Box segment of the knowledge-base, we do
not support all possible entailments from the simultaneous description of both a class
(or property) and an individual. In other words, we do not fully support inferencing for
meta-classes or meta-properties defined outside of the RDF(S)/OWL specification.
However, we do provide limited reasoning support for meta-modelling in the spirit
of “punning” by conceptually separating the individual-, class- or property-meanings
of a resource (c.f. [14]). More precisely, during reasoning we not only store the T-Box
data in memory, but also store the data on-disk in the A-Box. Thus, we perform pun-
ning in one direction: viewing class and property descriptions which form our T-Box
also as individuals. Interestingly, although we do not support terminological reasoning
directly, we can through our limited punning perform reasoning for terminological data
based on the RDFS descriptions provided for the RDFS and OWL specifications. For
example, we would infer the following by storing the three input statements in both the
T-Box and the A-Box:
rdfs:subClassOf rdfs:domain rdfs:Class ; rdfs:range rdfs:Class .
ex:Class1 rdfs:subClassOf ex:Class2 .
⇒ ex:Class1 a rdfs:Class . ex:Class2 a rdfs:Class .
However, again our support for meta-modelling is limited; SAOR does not fully
support so-called “non-standard usage” of RDF(S) and OWL: the use of properties
and classes which make up the RDF(S) and OWL vocabularies in locations where
they have not been intended, cf. [6, 34]. We adapt and refine the definition of non-
standard vocabulary use for our purposes according to the parts of the RDF(S) and
OWL vocabularies relevant for our inference ruleset:
Continuing, we now introduce the following example, wherein the first input state-
ment is a case of non-standard usage with rdfs:subClassOf ∈ P_SAOR in the object
position:⁷
ex:subClassOf rdfs:subPropertyOf rdfs:subClassOf .
ex:Class1 ex:subClassOf ex:Class2 .
⇒ ex:Class1 rdfs:subClassOf ex:Class2 .
⁷ A similar example from the Web can be found at http://thesauri.cs.vu.nl/wordnet/rdfs/wordnet2b.owl.
We can see that SAOR provides inference through rdfs:subPropertyOf as
usual; however, the inferred triple will not be reflected in the T-Box; thus we are in-
complete and will not translate members of ex:Class1 into ex:Class2. As such,
non-standard usage may result in T-Box statements being produced which, according
to our limited form of punning, will not be reflected in the T-Box and will thus lead to
incomplete inference.
Indeed, there may be good reason for not fully supporting non-standard usage of
the ontology vocabulary: non-standard use could have unpredictable results even under
our simple rule-based entailment if we were to fully support meta-modelling. One
may consider a combination of only four non-standard triples that, upon naive
reasoning, would explode the set R of all collected web resources by inferring |R|³ triples, namely:
rdfs:subClassOf rdfs:subPropertyOf rdfs:Resource.
rdfs:subClassOf rdfs:subPropertyOf rdfs:subPropertyOf.
rdf:type rdfs:subPropertyOf rdfs:subClassOf.
rdfs:subClassOf rdf:type :SymmetricProperty.
The exhaustive application of the standard RDFS inference rules, plus the inference rules
for property symmetry, together with the inference of class membership in rdfs:
Resource for all collected resources in typical rulesets such as pD*, leads to the inference
of every possible triple (r_1 r_2 r_3) for arbitrary r_1, r_2, r_3 ∈ R.
Thus, although by maintaining a separate static T-Box we are incomplete w.r.t. non-
standard usage, we show that complete support of such usage of the RDFS/OWL vo-
cabularies is undesirable for the Web.⁸
A web-graph W retrieved from source c speaks authoritatively about an RDF term n iff:
1. n ∈ B; or
2. n ∈ U and c coincides with, or is redirected to by, the namespace⁹ of n.
⁸ In any case, as we will see in Section 3.4, our application of authoritative analysis would not allow such inferences.
Firstly, all graphs are authoritative for blank nodes defined in that graph (remember
that according to the definition of our knowledge-base, all blank nodes are unique to
a given graph). Secondly, we support namespace redirects so as to conform to best
practices as currently adopted by web ontology publishers.10
For example, as taken from the Web:
We consider the authority of sources speaking about classes and properties in our
T-Box to counter-act ontology hijacking; ontology hijacking is the assertion of a set of
non-authoritative T-Box statements which could satisfy the T-Box antecedent pattern
of a rule in R_TG (i.e., those rules with at least one terminological and at least one
assertional triple pattern in the antecedent). Such third-party sources can then cause
arbitrary inferences on membership assertions of classes or properties (contained in
the A-Box) for which they speak non-authoritatively. We can say that only rules in
R_TG are relevant to ontology hijacking since: (i) inferencing on R_G, which does not
contain any T-Box patterns, cannot be affected by non-authoritative T-Box statements;
and (ii) the R_T ruleset does not contain any A-Box antecedent patterns and therefore
cannot directly hijack assertional data (i.e., in our scenario, the :oneOf construct can
be viewed as directly asserting memberships, and is unable, according to our limited
support, to directly redefine sets of individuals). We now define ontology hijacking:
Ontology hijacking is problematic in that it vastly increases the number of state-
ments that are materialised and can potentially harm inferencing on data contributed
by other parties. With respect to materialisation, the former issue becomes promi-
nent: members of classes/properties from popular/core ontologies get translated into a
plethora of conceptual models described in obscure ontologies; we quantify the prob-
lem in Section 5. Moreover, taking precautions against harmful ontology hijacking is
growing more and more important as the Semantic Web attracts more and more atten-
tion; incentives for spamming and other malicious activity propagate amongst certain
parties, with ontology hijacking being a prospective avenue. With this in mind, we as-
sign sole responsibility for classes and properties, and for reasoning upon their members,
to those who maintain the authoritative specification.
Related to the idea of ontology hijacking is the idea of “non-conservative exten-
sion” described in the Description Logics literature: cf. [13, 31, 27]. However, the
notion of a “conservative extension” was defined with a slightly different objective in
mind: according to the notion of deductively conservative extensions, a graph Ga is
only considered malicious towards Gb if it causes additional inferences with respect to
the intersection of the signature of the original Gb with the newly inferred statements.
Returning to the former my:name example from above, defining a super-property of
foaf:name would still constitute a conservative extension: the closure without the
non-authoritative foaf:name rdfs:subPropertyOf my:name . statement is the
same as the closure with the statement after all of the my:name membership assertions
are removed. However, further stating that my:name a :InverseFunctionalProperty .
would not constitute a model-conservative extension, since members of my:name
might then cause equalities in other remote ontologies as side-effects, independent of
the newly defined signature. Summarising, we can state that every non-conservative ex-
tension (with respect to our notion of deductive closure) constitutes a case of ontology
hijacking, but not vice versa; non-conservative extension can be considered "harmful"
hijacking, whereas the remainder of ontology hijacking cases can be considered "infla-
tionary".
To negate ontology hijacking, we only allow inferences through authoritative rule
applications, which we now define:
Definition 31 (Authoritative Rule Application) Again let ŝig(W) be the set of classes
and properties for which W speaks authoritatively and let T_W be the T-Box of W. We
define an authoritative rule application for a graph G w.r.t. the T-Box T_W to be a
T-split rule application T_r(T_W, G) where additionally, if both Ante_T and Ante_G are
non-empty (r ∈ R_TG), then for the mapping µ of T_r(T_W, G) there must exist a variable
v ∈ (V(Ante_T) ∩ V(Ante_G)) such that µ(v) ∈ ŝig(W). We denote an authoritative
rule application by T̂_r(T_W, G).
In other words, an authoritative rule application will only occur if the rule consists
of only assertional patterns (R_G); or the rule consists of only terminological patterns
(R_T); or if, in application of the rule, the terminological pattern instance is from a web-
graph authoritative for at least one class or property in the assertional pattern instance.
The T̂_R operator follows naturally as before for a set of authoritative rules R, as does
the notion of authoritative closure, which we denote by Ĉl_R(T_W, G). We may also refer
to, e.g., T̂_R(T, KB) and Ĉl_R(T, KB) as before for a T-split rule application.
Table 2 identifies the authoritative restrictions we place on our rules, wherein the
underlined T-Box pattern is matched by a set of triples from a web-graph W iff W
speaks authoritatively for at least one element matching a boldface variable in Table 2;
i.e., again, for each rule, at least one of the classes or properties matched by the A-Box
pattern of the antecedent must be authoritatively spoken for by an instance of the T-Box
pattern. These restrictions only apply to R1 and R2 (which are both a subset of R_TG).
Please note that, for example in rule rdfp14b′ where there are no boldface variables,
the variables enforced to be instantiated by blank nodes will always be authoritatively
spoken for: a web-graph is always authoritative for its blank nodes.
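The following minimal Python sketch illustrates the authoritative check of Definition 31; the helper name and the representation of ŝig(W) as a plain set are illustrative assumptions, not the SAOR implementation:

    # mu: mapping from variables to RDF terms for one T-split rule
    # application; sig_hat: the set of terms W speaks authoritatively for.
    def authoritative_application(mu, ante_t_vars, ante_g_vars, sig_hat):
        """True iff some variable shared between the T-Box and A-Box
        antecedent patterns is bound to an authoritatively-spoken-for term."""
        shared = ante_t_vars & ante_g_vars
        return any(mu[v] in sig_hat for v in shared)

    # e.g. for rdfs9 (?C rdfs:subClassOf ?D . ?x a ?C . => ?x a ?D .),
    # the shared variable is ?C, so the graph asserting the subclass axiom
    # must speak authoritatively for the class bound to ?C:
    mu = {"?C": "foaf:Person", "?D": "foaf:Agent", "?x": "ex:me"}
    print(authoritative_application(mu, {"?C", "?D"}, {"?C", "?x"},
                                    sig_hat={"foaf:Person"}))  # True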
We now make the following proposition relating to the prevention of ontology-
hijacking through authoritative rule application:
Proposition 27 Given a T-Box T_W extracted from a web-graph W and any graph G
not mentioning any element of ŝig(W), then Ĉl_{R_TG}(T_W, G) = G.

Proof: Informally, our proposition is that the authoritative closure of a graph G w.r.t.
some T-Box T_W will not contain any inferences which constitute ontology hijacking,
defined in terms of ruleset R_TG. Firstly, from Definition 26, for each rule r ∈ R_TG,
Ante_T ≠ ∅ and Ante_G ≠ ∅. Therefore, from Definitions 27 & 31, for an authori-
tative rule application to occur for any such r, there must exist (i) a mapping µ such
that µ(Ante_T) ⊆ T_W and µ(Ante_G) ⊆ G; and (ii) a variable v ∈ (V(Ante_T) ∩
V(Ante_G)) such that µ(v) ∈ ŝig(W). However, since G does not mention any element
of ŝig(W), there is no such mapping µ where µ(v) ∈ ŝig(W) for v ∈ V(Ante_G)
and µ(Ante_G) ⊆ G. Hence, for r ∈ R_TG, no such application T̂_r(T_W, G) will occur;
it then follows that T̂_{R_TG}(T_W, G) = ∅ and Ĉl_{R_TG}(T_W, G) = G. □
The above proposition and proof hold for a given web-graph W; however, given
a set of web-graphs where an instance of Ante_T can consist of triples from more than
one graph, it is possible for ontology hijacking to occur whereby some triples in the
instance come from a non-authoritative graph and some from an authoritative graph.
To illustrate we refer to Example 20, wherein (without enforcing abstract-syntax
blank nodes) the second source could cause ontology hijacking by interfering with the
authoritative definition of the class restriction in the first source as follows:
Example 21:
⇒ :Jill a ex2:Human .
Here, the above inference is authoritative according to our definition, since the instance
of Ante_T (specifically the first statement, from source <ex:>) speaks authorita-
tively for a class/property in the assertional data; however, the statement from source
<ex2:> is causing inferences on assertional data not containing a class or property for
which source <ex2:> is authoritative.
As previously discussed, for our ruleset we enforce the OWL abstract syntax and
thus we enforce that µ(Ante_T) ⊆ T_W′ where T_W′ ∈ T. However, where this condi-
tion does not hold (i.e., where an instance of Ante_T can comprise data from more than one
graph), an authoritative rule application should only occur if each web-graph con-
tributing to an instance of Ante_T speaks authoritatively for at least one class/property
in the Ante_G instance.
4 Reasoning Algorithm
In the following we first present observations on web data that influenced the design of
the SAOR algorithm, then give an overview of the algorithm, and next discuss details
of how we handle T-Box information, perform statement-wise reasoning, and deal with
equality for individuals.
1. Reasoning accesses a large slice of data in the index: we found that approxi-
mately 61% of statements in the 147m dataset and 90% in the 1.1b dataset pro-
duced inferred statements through authoritative reasoning.
2. Relative to assertional data, the volume of terminological data on the Web is
small: <0.9% of the statements in the 1.1b dataset and <1.7% of statements in
the 147m dataset were classifiable as SAOR T-Box statements.¹¹
3. The T-Box is the most frequently accessed segment of the knowledge-base for
reasoning: although relatively small, all but the rules in R3 require access to
T-Box information.
structures. In our previous work, a disk-based updateable random-access data structure
(a B+-tree) proved to be the bottleneck for reasoning due to a high volume of inserts,
leading to frequent index reorganisations and hence inadequate performance. As a
result, our algorithms are now built upon two disk-based primitives known to scale:
file scanning and sorting.
[Figure: high-level architecture of the SAOR reasoning process: the T-Box T is created from KB (Section 4.3); rulesets R0 and R1 are run and the R2/R3 on-disk indices updated (Section 4.4); ruleset R2/R3 is run over these indices (Section 4.5); and the initial output is consolidated (Section 4.6).]
• Execute rules with only a single A-Box triple pattern in the antecedent
(R1): join A-Box pattern with in-memory T-Box; recursively execute steps
over inferred statements; write inferred RDF statements to output file.
• Write on-disk files for computation of rules with multiple A-Box triple
patterns in the antecedent (R2); when a statement matches one of the A-
Box triple patterns for these rules and the necessary T-Box join exists, the
statement is written to the on-disk file for later rule computation.
• Write on-disk equality file for rules which involve equality reasoning (R3);
:sameAs statements found during the scan are written to an on-disk file
for later computation.
3. Execute ruleset R2 ∪ R3: on-disk files containing partial A-Box antecedent
matches for rules in R2 and R3 are sequentially analysed producing further
inferred statements. Newly inferred statements are again subject to step 2 above;
fresh statements can still be written to on-disk files and so the process is iterative
until no new statements are found (Section 4.5).
4. Finally, consolidate source data along with inferred statements according to :sameAs
computation (R3) and write to final output (Section 4.6).
In the following sections, we discuss the individual components and processes in
the architecture as highlighted, whereafter, in Section 4.7 we show how these elements
are combined to achieve closure.
• scan the T-Box data and store contexts of statements where the property ∈ {
:unionOf, :intersectionOf, :oneOf }.
• scan the T-Box data again and remove statements for which both hold:
– property ∈ { rdf:first, rdf:rest }
– the context does not appear in those stored from the previous scan.
These scans quickly remove irrelevant collection fragments where a :unionOf,
:intersectionOf, :oneOf statement does not appear in the same source as the frag-
ment (i.e., collections which cannot contribute to the T-Box pattern of one of our rules).
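The following is a minimal Python sketch of these two scans, assuming quads as (s, p, o, c) tuples; the names are illustrative, not part of SAOR:

    SET_PROPS = {":unionOf", ":intersectionOf", ":oneOf"}
    LIST_PROPS = {"rdf:first", "rdf:rest"}

    def prune_collections(tbox_quads):
        # scan 1: remember contexts containing a set-based class description
        relevant = {c for (s, p, o, c) in tbox_quads if p in SET_PROPS}
        # scan 2: drop rdf:first/rdf:rest statements from all other contexts
        return [(s, p, o, c) for (s, p, o, c) in tbox_quads
                if p not in LIST_PROPS or c in relevant]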
R0
rdfc0    : ?C :oneOf (?x_1 ... ?x_n) .                  (?x_1 ... ?x_n) →[rdfc0] ?C
R1
rdfs2    : ?P rdfs:domain ?C .                          ?P →[rdfs2] ?C
rdfs3′   : ?P rdfs:range ?C .                           ?P →[rdfs3′] ?C
rdfs7′   : ?P rdfs:subPropertyOf ?Q .                   ?P →[rdfs7′] ?Q
rdfs9    : ?C rdfs:subClassOf ?D .                      ?C →[rdfs9] ?D
rdfp3′   : ?P a :SymmetricProperty .                    ?P →[rdfp3′] TRUE
rdfp8a′  : ?P :inverseOf ?Q .                           ?P →[rdfp8a′] ?Q
rdfp8b′  : ?P :inverseOf ?Q .                           ?Q →[rdfp8b′] ?P
rdfp12a′ : ?C :equivalentClass ?D .                     ?C →[rdfp12a′] ?D
rdfp12b′ : ?C :equivalentClass ?D .                     ?D →[rdfp12b′] ?C
rdfp13a′ : ?P :equivalentProperty ?Q .                  ?P →[rdfp13a′] ?Q
rdfp13b′ : ?P :equivalentProperty ?Q .                  ?Q →[rdfp13b′] ?P
rdfp14a′ : ?C :hasValue ?y ; :onProperty ?P .           ?P →[rdfp14a′] { ?C, ?y }
rdfp14b′ : ?C :hasValue ?y ; :onProperty ?P .           ?C →[rdfp14b′] { ?P, ?y }
rdfc1    : ?C :unionOf (?C_1 ... ?C_i ... ?C_n) .       ?C_i →[rdfc1] ?C
rdfc2    : ?C :minCardinality 1 ; :onProperty ?P .      ?P →[rdfc2] ?C
rdfc3a   : ?C :intersectionOf (?C_1 ... ?C_n) .         ?C →[rdfc3a] { ?C_1, ..., ?C_n }
rdfc3b   : ?C :intersectionOf (?C_1) .                  ?C_1 →[rdfc3b] ?C
R2
rdfp1′   : ?P a :FunctionalProperty .                   ?P →[rdfp1′] TRUE
rdfp2    : ?P a :InverseFunctionalProperty .            ?P →[rdfp2] TRUE
rdfp4    : ?P a :TransitiveProperty .                   ?P →[rdfp4] TRUE
rdfp15′  : ?C :someValuesFrom ?D ; :onProperty ?P .     ?P ↔[rdfp15′] ?D →[rdfp15′] ?C
rdfp16′  : ?C :allValuesFrom ?D ; :onProperty ?P .      ?P ↔[rdfp16′] ?C →[rdfp16′] ?D
rdfc3c   : ?C :intersectionOf (?C_1 ... ?C_n) .         { ?C_1, ..., ?C_n } →[rdfc3c] ?C
rdfc4a   : ?C :cardinality 1 ; :onProperty ?P .         ?C ↔[rdfc4a] ?P
rdfc4b   : ?C :maxCardinality 1 ; :onProperty ?P .      ?C ↔[rdfc4b] ?P
Table 3: T-Box statements and how they are used to wire the concepts contained in the
in-memory T-Box; a →[rule] b denotes a directed link from a to b labelled with the
given rule, and ↔ denotes a bidirectional link.
Property and class objects are designed to contain all of the information required for reasoning on a
membership assertion of that property or class: that is, classes/properties satisfying the
A-Box antecedent pattern of a rule are linked to the classes/properties appearing in the
consequent of that rule, with the link labelled according to that rule. During reasoning,
the class/property identifier used in the membership assertion is sent to the correspond-
ing hashtable, and the returned internal object is used for reasoning on that assertion. The
objects contain the following:
• Property objects contain the property URI and references to objects representing
domain classes (rdfs2), range classes (rdfs3′), super-properties (rdfs7′), inverse
properties (rdfp8*′) and equivalent properties (rdfp13*). References are kept to
restrictions where the property in question is the object of an :onProperty
statement (rdfp14a′, rdfp16′, rdfc2, rdfc4*). Where applicable, if the prop-
erty is part of a some-values-from restriction, a pointer is kept to the some-
values-from class (rdfp15′). Boolean values are stored to indicate whether the
property is functional (rdfp1′), inverse-functional (rdfp2), symmetric (rdfp3′)
and/or transitive (rdfp4).
• Class objects contain the class URI and references to objects representing super-
classes (rdfs9), equivalent classes (rdfp12*) and classes for which this class
is a component of a union (rdfc1) or intersection (rdfc3b/c). On top of these
core elements, different references are maintained for different types of class
description:
– intersection classes store references to their constituent class objects (rdfc3a);
– restriction classes store a reference to the property the restriction applies
to (rdfp14b′, rdfp15′, rdfc2, rdfc4*) and also, if applicable to the type of
restriction:
∗ the values which the restriction property must have (rdfp14b′);
∗ the class for which this class is a some-values-from restriction value
(rdfp15′).
[Figure: the in-memory T-Box data structure: Property and Class objects, wired by the rule-labelled links of Table 3; Property objects additionally carry the boolean flags isFunct (rdfp1′), isInvFunct (rdfp2), isSym (rdfp3′) and isTrans (rdfp4).]
The algorithm must also perform in-memory joining of collection segments ac-
cording to rdf:first and rdf:rest statements found during the scan, for the pur-
poses of building union, intersection and enumeration class descriptions. Again, any
remaining collections not relevant to the T-Box segment of the knowledge-base (i.e.,
not terminological collection statements) are discarded at the end of loading the input
data; we also discard cyclic and branching lists, as well as any lists not found to end
with the rdf:nil construct.
We have now loaded the final T-Box for reasoning into memory; this T-Box will
remain fixed throughout the whole reasoning process.
4.4 Initial Input Scan
Having loaded the terminological data, SAOR is now prepared for reasoning by statement-
wise scan of the assertional data.
Figure 3: ReasonStatement(s)
We provide the high-level flow for reasoning over an input statement s in Func-
tion ReasonStatement(s), cf. Figure 3. The reasoning scan process can be described
as recursive depth-first reasoning whereby each unique statement produced is again
input immediately for reasoning. Statements produced thus far for the original input
statement are kept in a set to provide uniqueness testing and avoid cycles; a uniquing
function is also maintained for a common subject group in the data, ensuring that state-
ments are only produced once for that statement group. Once all of the statements
produced by a rule have been themselves recursively analysed, the reasoner moves on
to analysing the next rule, and loops until no unique statements are inferred. The
reasoner then processes the next input statement.
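The following minimal Python sketch illustrates this recursive, depth-first reasoning loop; rules_r1 is a hypothetical stand-in for the R1-style rule applications over the in-memory T-Box:

    def reason_statement(stmt, tbox, rules_r1):
        seen = {stmt}              # uniqueness set: avoids cycles/duplicates
        def recurse(s):
            for inferred in rules_r1(s, tbox):
                if inferred not in seen:
                    seen.add(inferred)
                    yield inferred
                    yield from recurse(inferred)   # depth-first recursion
        return list(recurse(stmt))

    # e.g. a toy R1-style rule: subclass inference over an in-memory T-Box
    def toy_r1(stmt, tbox):
        s, p, o = stmt
        if p == "rdf:type":
            for parent in tbox.get(o, ()):     # rdfs9: ?x a ?C => ?x a ?D
                yield (s, "rdf:type", parent)

    print(reason_statement(("ex:me", "rdf:type", "ex:Student"),
                           {"ex:Student": ["ex:Person"],
                            "ex:Person": ["ex:Agent"]},
                           toy_r1))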
There are three disjoint categories of statements which require different handling:
namely (i) rdf:type statements, (ii) :sameAs statements, (iii) all other statements.
We assume disjointness between the statement categories: we do not allow any ex-
ternal extension of the core rdf:type/:sameAs semantics (non-standard use / non-
authoritative extension). Further, the assertions about rdf:type in the RDFS specifi-
cation define the rdfs:domain and rdfs:range of rdf:type as being rdfs:Resource
and rdfs:Class; since we are not interested in inferring membership of such RDFS
classes we do not subject rdf:type statements to property-based entailments. The
only assertions about :sameAs from the OWL specification define domain and range
as :Thing which we ignore by the same justification.
The rdf:type statements are subject to class-based entailment reasoning and re-
quire joins with class descriptions in the T-Box. The :sameAs statements are handled
by ruleset R3, which we discuss in Section 4.6. All other statements are subject to
property-based entailments and thus require joins with T-Box property descriptions.
Ruleset R2 ∪ R3 cannot be computed solely on a statement-wise basis. Instead,
for each rule, we assign an on-disk file (blocked and compressed to save disk space).
Each file contains statements which may contribute to satisfying the antecedent of its
pertinent rule. During the scan, if an A-Box statement satisfies the necessary T-Box
join for a rule, it is written to the index for that rule. For example, when the statement
ex:me foaf:isPrimaryTopicOf ex:myHomepage .
is scanned and the T-Box declares foaf:isPrimaryTopicOf to be an
:InverseFunctionalProperty, the statement is written to the on-disk file for rule rdfp2.
During the initial scan and inferencing, all files for ruleset R2 ∪ R3 are filled with
pertinent statements analogously to the example above. After the initial input state-
ments have been exhausted, these files are analysed to infer, for example, :sameAs
statements between resources sharing the value of an inverse-functional property.
R2
rdfp1′   : ?x ?P ?y , ?z .                    SPOC
rdfp2    : ?x ?P ?z . ?y ?P ?z .              OPSC
rdfp4    : ?x ?P ?y . ?y ?P ?z .              SPOC & OPSC
rdfp15′  : ?x ?P ?y . ?y a ?D .               SPOC / OPSC
rdfp16′  : ?x a ?C ; ?P ?y .                  SPOC
rdfc3c   : ?x a ?C_1 , ..., ?C_n .            SPOC
rdfc4a   : ?x a ?C ; ?P ?y , ?z .             SPOC
rdfc4b   : ?x a ?C ; ?P ?y , ?z .             SPOC
R3
rdfp7    : ?x :sameAs ?y . ?y :sameAs ?z .    SPOC & OPSC
rdfp11′  : ?x :sameAs ?x′ ; ?P ?y .           SPOC
rdfp11″  : ?y :sameAs ?y′ . ?x ?P ?y .        SPOC / OPSC
Table 4: A-Box joins to be computed using the on-disk files, together with the sorting
order required for statements to compute each join.
Table 4 presents the joins to be executed via the on-disk files for each rule: the key
join variables, used for computing the join, are shown in boldface. In this table we
refer to SPOC and OPSC sorting order: these can be intuitively interpreted as quads
sorted according to subject, predicate, object, context (natural sorting order) and object,
predicate, subject, context (inverse sorting order) respectively. For the internal index
files, we use context to encode the sorting order of a statement and the iteration in
which it was added; only joins with at least one new statement from the last iteration
will infer novel output.
Again, an on-disk file is dedicated to each rule/join required. The joins to be
computed are a simple "star-shaped" join pattern or a "one-hop" join pattern (which we
reduce to a simple star-shaped join computation by inverting one or more patterns
to inverse order). The statements in each file are initially sorted according to the key
join variable. Thus, common bindings for the key join variable are grouped together
and joins can be executed by means of a sequential scan over common key-join-variable
binding groups.
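The following minimal Python sketch illustrates such a join by sequential scan, assuming the quads are already sorted on the key join position; groupby stands in for the scan over common binding groups:

    from itertools import groupby, combinations

    def star_join(sorted_quads, key_pos=0):
        """Yield candidate pairs of quads agreeing on the key join position."""
        for _, group in groupby(sorted_quads, key=lambda q: q[key_pos]):
            group = list(group)
            # within a key group, every pair is a candidate join
            yield from combinations(group, 2)

For example, for rule rdfp2 the file is sorted in OPSC order, so quads sharing an object are adjacent, and each joined pair (?x, ?P, ?z), (?y, ?P, ?z) yields ?x :sameAs ?y.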
We now continue with a more detailed description of the process for each rule
beginning with the more straightforward rules.
?C position), we wish to infer that objects of the :onProperty value (as given by
all the non-rdf:type statements according to the T-Box join with ?P, where ?P is
linked from ?C with →[rdfp16′]) are of the all-values-from class. Therefore, for each re-
striction membership assertion, the objects of the corresponding :onProperty-value
membership assertions are inferred to be members of the all-values-from object class
(?D).
ex:c ex:comesBefore ex:d :spoc1 .
ex:d ex:comesBefore ex:c :opsc1 .
The data, as above, can then be scanned and for each common join-binding/predicate
group (e.g., ex:b ex:comesBefore), the subjects of statements in inverse order (e.g.,
ex:a) can be linked to the object of naturally ordered statements (e.g., ex:c) by the
transitive property. However, such a scan will only compute a single one-hop join.
From above, we only produce:
# OUTPUT - ITERATION 1 / INPUT - ITERATION 2
ex:a ex:comesBefore ex:c .
ex:b ex:comesBefore ex:d .
We still have not computed the valid statement ex:a ex:comesBefore ex:d .
which requires a two-hop join. Thus, we must iteratively feed back the results from
one scan as input for the next scan. The output from the first iteration, as above, is also
reordered and sorted as before and merge-sorted into the main SORTED FILE.
# SORTED FILE - ITERATION 2:
ex:a ex:comesBefore ex:b :spoc1 .
ex:a ex:comesBefore ex:c :spoc2 .
ex:b ex:comesBefore ex:a :opsc1 .
ex:b ex:comesBefore ex:c :spoc1 .
ex:b ex:comesBefore ex:d :spoc2 .
ex:c ex:comesBefore ex:a :opsc2 .
ex:c ex:comesBefore ex:b :opsc1 .
ex:c ex:comesBefore ex:d :spoc1 .
ex:d ex:comesBefore ex:b :opsc2 .
ex:d ex:comesBefore ex:c :opsc1 .
The observant reader may already have noticed from above that we also mark the
context with the iteration in which the statement was added. In every iteration, we only
compute inferences which involve the delta from the last iteration; thus the process is
comparable to semi-naïve evaluation. Only joins containing at least one newly added
statement are used to infer new statements for output. Thus, from above, we avoid
repeat inferences from ITERATION 1 and instead infer:
# OUTPUT - ITERATION 2:
ex:a ex:comesBefore ex:d .
A fixpoint is reached when no new statements are inferred. Thus we would require
another iteration for the above example to ensure that no new statements are inferable.
The number of iterations required is in O(log n) according to the longest unclosed
transitive path in the input data. Since the algorithm requires scanning of not only the
delta, but also the entire data, performance using on-disk file scans alone would be
sub-optimal. For example, if one considers that most of the statements constitute paths
of, say ≤8 vertices, one path containing 128 vertices would require four more scans
after the bulk of the paths have been closed.
With this in mind, we accelerate transitive closure by means of an in-memory tran-
sitivity index. For each transitive property found, we store sets of linked lists which
represent the graph extracted for that property. From the example INPUT above, we
would store:
ex:comesBefore -- ex:a -> ex:b -> ex:c -> ex:d
From this in-memory linked list, we would then collapse all paths of length ≥2 (all
paths of length 1 are input statements) and infer closure at once:
# OUTPUT - ITERATION 1 / INPUT - ITERATION 2
ex:a ex:comesBefore ex:c .
ex:a ex:comesBefore ex:d .
ex:b ex:comesBefore ex:d .
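The following minimal Python sketch illustrates collapsing such an in-memory graph, assuming the index as a dictionary from node to its set of direct successors:

    def transitive_closure_pairs(adj):
        """Collapse all paths of length >= 2: return the inferred (x, y)
        pairs, i.e. reachable pairs that are not already input edges."""
        inferred = set()
        for start in adj:
            seen, stack = set(), list(adj.get(start, ()))
            while stack:                      # DFS over successors of start
                node = stack.pop()
                if node not in seen:
                    seen.add(node)
                    stack.extend(adj.get(node, ()))
            inferred |= {(start, y) for y in seen
                         if y not in adj.get(start, set())}
        return inferred

    # e.g. adj = {"ex:a": {"ex:b"}, "ex:b": {"ex:c"}, "ex:c": {"ex:d"}}
    # yields {("ex:a","ex:c"), ("ex:a","ex:d"), ("ex:b","ex:d")}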
Obviously, for scalability requirements, we do not expect the entire transitive body
of statements to fit in-memory. Thus, before each iteration we calculate the in-memory
capacity and only store a pre-determined number of properties and vertices. Once the
in-memory transitive index is full, we infer the appropriate statements and continue
by file-scan. The in-memory index is only used to store the delta for a given iteration
(everything for the first iteration). Thus, we avoid excess iterations to compute closure
of a small percentage of statements which form a long chain and greatly accelerate the
fixpoint calculation.
Obviously, we must avoid the above scenarios, so we break from complete infer-
ence with respect to the rules in R3. Instead, for each set of equivalent individuals, we
choose a pivot identifier to use in rewriting the data. The pivot identifier is used to keep
a consistent identifier for the set of equivalent individuals: the alphabetically first
pivot is chosen for convenience of computation. For alternative choices of pivot iden-
tifiers on web data see [23]. We use the pivot identifier to consolidate data by rewriting
all occurrences of equivalent identifiers to the pivot identifier (effectively merging the
equivalent set into one individual).
Thus, we do not derive the entire closure of :sameAs statements as indicated in
rules rdfp6′ & rdfp7, but instead only derive an equivalence list which points from
equivalent identifiers to their pivots. As highlighted, use of a pivot identifier is nec-
essary to reduce the amount of output statements, effectively compressing equivalent
resource descriptions: we hint here that a fully expanded view of the descriptions could
instead be supported through backward-chaining over the semi-materialised data.
To achieve the pivot-compressed inferences, we use an on-disk file containing :sameAs
statements. Take for example the following statements:
# INPUT
ex:a :sameAs ex:b .
ex:b :sameAs ex:c .
ex:c :sameAs ex:d .
We only wish to infer the following output for the pivot identifier ex:a:
# OUTPUT PIVOT EQUIVALENCES
ex:b :sameAs ex:a .
ex:c :sameAs ex:a .
ex:d :sameAs ex:a .
The process is the same as that for symmetric-transitive reasoning as described
before; however, we only close transitive paths to the node that comes first in alphabetical
order. So, for example, if we have already materialised a path from ex:d to ex:a, we
ignore inferring a path from ex:d to ex:b as ex:b > ex:a.
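The following minimal Python sketch computes such pivot equivalences using an in-memory union-find structure, choosing the alphabetically first identifier of each set as pivot; note that SAOR itself computes this on-disk via the symmetric-transitive closure scans described above:

    def pivot_equivalences(sameas_pairs):
        parent = {}
        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x
        for a, b in sameas_pairs:
            ra, rb = find(a), find(b)
            # union by alphabetical order: the lesser identifier is the root
            if ra < rb:
                parent[rb] = ra
            elif rb < ra:
                parent[ra] = rb
        return {x: find(x) for x in parent if find(x) != x}

    # e.g. pivot_equivalences([("ex:a","ex:b"), ("ex:b","ex:c"),
    #                          ("ex:c","ex:d")])
    # yields {"ex:b": "ex:a", "ex:c": "ex:a", "ex:d": "ex:a"}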
To execute rules rdfp11′ & rdfp11″ and perform "consolidation" (the rewriting of
equivalent identifiers to their pivotal form), we perform a zig-zag join: we sequentially
scan both the :sameAs inference output as above and an appropriately sorted file of data,
rewriting the latter data according to the :sameAs statements. For example, take the
following statements to be consolidated:
# UNCONSOLIDATED DATA
ex:a foaf:mbox <mail@example.org> .
...
ex:b foaf:mbox <mail@example.org> .
ex:b foaf:name "Joe Bloggs" .
...
ex:d :sameAs ex:b .
...
ex:e foaf:knows ex:d .
The above statements are scanned sequentially with the closed :sameAs pivot out-
put from above. For example, when the statement ex:b foaf:mbox
<mail@example.org> . is first read from the unconsolidated data, the :sameAs index is
scanned until ex:b :sameAs ex:a . is found (if ex:b is not found in the :sameAs
file, the scan is paused when an element above the sorting order of ex:b is found).
Then, ex:b is rewritten to ex:a.
# PARTIALLY CONSOLIDATED DATA
ex:a foaf:mbox <mail@example.org> .
...
ex:a foaf:mbox <mail@example.org> .
ex:a foaf:name "Joe Bloggs" .
...
ex:a :sameAs ex:b .
...
ex:e foaf:knows ex:d .
We have now executed rule rdfp11′ and have the data partially consolidated as
shown. However, the observant reader will notice that we have not consolidated the
object of the last two statements. We must sort the data again according to inverse
OPSC order and again sequentially scan both the partially consolidated data and the
:sameAs pivot equivalences, this time rewriting ex:b and ex:d in the object position
to ex:a and producing the final consolidated data. This equates to executing rule
rdfp11″.
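The following minimal Python sketch illustrates one such consolidation pass, assuming both the pivot pairs and the data are sorted on the term being rewritten:

    def consolidate(sorted_quads, sorted_pivots, pos=0):
        """Rewrite the term at `pos` of each quad to its pivot, in one
        sequential merge-style pass over both sorted inputs."""
        pivots = iter(sorted_pivots)
        cur = next(pivots, None)            # current (term, pivot) pair
        for quad in sorted_quads:
            term = quad[pos]
            while cur is not None and cur[0] < term:
                cur = next(pivots, None)    # advance the pivot scan
            if cur is not None and cur[0] == term:
                quad = quad[:pos] + (cur[1],) + quad[pos + 1:]
            yield quad

Run once over SPOC-sorted data with pos=0, this corresponds to rule rdfp11′; run again over OPSC-sorted data with pos=2, it corresponds to rule rdfp11″.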
For the purposes of the on-disk files for computing rules requiring A-Box joins,
we must consolidate the key join variable bindings according to the :sameAs state-
ments found during reasoning. For example, consider the following statements in the
functional-property reasoning file:
ex:a ex:mother ex:m1 .
ex:b ex:mother ex:m2 .
Evidently, rewriting the key join position according to our example pivot file will
lead to inference of:
ex:m1 :sameAs ex:m2 .
which we would otherwise miss. Thus, whenever the index of :sameAs statements
changes, it is necessary for the purposes of closure to attempt to rewrite all join index
files according to the new :sameAs statements. Since we are, for the moment, only
concerned with consolidating on the join position, we need only apply one consolidation
scan.
The final step in the SAOR reasoning process is to finalise consolidation of the
initial input data and the newly inferred output statements produced by all rules from
scanning and on-disk file analysis. Although we have provided exhaustive application
of all inferencing rules, and we have the complete set of :sameAs statements, elements
in the input and output files may not be in their equivalent pivotal form. Therefore, in
order to ensure proper consolidation of all of the data according to the final set of
:sameAs statements, we must first sort both the input and the inferred sets of data in SPOC
order and consolidate subjects according to the pivot file as above; then sort according to OPSC
order and consolidate objects.
However, one may notice that :sameAs statements in the data become consoli-
dated into reflexive statements: i.e., from the above example ex:a :sameAs ex:a.
Thus, for the final output, we remove any :sameAs statements in the data and in-
stead merge the statements contained in our final pivot :sameAs equivalence index,
and their inverses, with the consolidated data. These statements retain the list of all
possible identifiers for a consolidated entity in the final output.
4.7 Achieving Closure
We conclude this section by summarising the approach, detailing the overall fixpoint
calculations (as such, putting the jigsaw together) and detailing how closure is achieved
using the individual components. Along these lines, in Figure 4 we provide a summary
of the algorithmic steps seen so far and, in particular, show the fixpoint calculations
involved for exhaustive application of ruleset R2 ∪ R3; we compute one main fixpoint
over all of the operations required, within which we also compute two local fixpoints.
Firstly, since all rules in R2 are dependent on :sameAs equality, we perform
:sameAs inferences first. Thus, we begin closure on R2 ∪ R3 with a local equality fix-
point which (i) executes all rules which produce :sameAs inferences (rdfp1', rdfp2, rdfc4*);
(ii) performs symmetric-transitive closure using pivots on all :sameAs inferences; (iii)
rewrites the rdfp1', rdfp2 and rdfc4* indexes according to the :sameAs pivot equivalences;
and (iv) repeats until no new :sameAs statements are produced.
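The symmetric-transitive closure with pivot election can be pictured as maintaining
disjoint equivalence classes; the following minimal in-memory union-find sketch is
illustrative only (SAOR's actual computation runs over sorted on-disk files, and electing
the lexically least identifier as pivot is merely one possible choice):

    def close_sameas(pairs):
        """Symmetric-transitive closure of :sameAs pairs via union-find,
        electing the lexically least identifier of each class as pivot."""
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        for a, b in pairs:
            ra, rb = find(a), find(b)
            if ra != rb:
                if rb < ra:                     # keep the least root as pivot
                    ra, rb = rb, ra
                parent[rb] = ra

        classes = {}
        for x in parent:
            classes.setdefault(find(x), set()).add(x)
        return classes                          # pivot -> equivalence class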
Next, we have a local transitive fixpoint for recursively computing transitive prop-
erty reasoning: (i) the transitive index is rewritten according to the equivalences found
through the above local fixpoint; (ii) a transitive closure iteration is run, output in-
ferences are recursively fed back as input; (iii) ruleset R1 is also recursively applied
over the output of the previous step, whereby the output from ruleset R1 may also write
new statements to any R2 index. The local fixpoint is reached when no new transitive
inferences are computed.
Finally, we conclude the main fixpoint by running the remaining rules: rdfp15',
rdfp16' and rdfc3c. For each rule, we rewrite the corresponding index according to
the equivalences found from the first local fixpoint, run the inferencing over the index
and send output for reasoning through ruleset R1. Statements inferred directly from the
rule index, or through subsequent application of ruleset R1, may write new statements
for R2 indexes. This concludes one iteration of the main fixpoint, which is run until
no new statements are inferred.
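Schematically, the nesting of these fixpoints can be captured by the following runnable
skeleton (Python); the three callables abstract the actual rule applications and index
rewrites and are placeholders rather than the SAOR API, each returning the number of
new statements it produced:

    def fixpoint(step):
        """Run `step` repeatedly until it produces no new statements."""
        total = 0
        while True:
            n = step()
            if n == 0:
                return total
            total += n

    def saor_closure(equality_step, transitive_step, remaining_steps):
        def one_round():
            new = fixpoint(equality_step)       # local :sameAs fixpoint
            new += fixpoint(transitive_step)    # local transitive fixpoint
            for step in remaining_steps:        # rdfp15', rdfp16', rdfc3c
                new += step()
            return new
        return fixpoint(one_round)              # main fixpoint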
For each of the rulesets R0–R3, we now justify our algorithm in terms of our definition of
closure with respect to our static T-Box. Firstly, closure is achieved immediately upon
ruleset R0, which requires only T-Box knowledge, from our static T-Box. Secondly,
with respect to the given T-Box, every input statement is subject to reasoning according
to ruleset R1, as is every statement inferred from ruleset R0, those recursively inferred
from ruleset R1 itself, and those recursively inferred from on-disk analysis for ruleset
R1 ∪ R2. Next, every input statement is subject to reasoning according to ruleset R2
with respect to our T-Box; these again include all inferences from R0, all statements
inferred through R1 alone, and all inferences from recursive application of ruleset
R1 ∪ R2.
Therefore, we can see that our algorithm applies exhaustive application of ruleset
R0 ∪ R1 ∪ R2 with respect to our T-Box, leaving only consideration of equality rea-
soning in ruleset R3. Indeed, our algorithm is not complete with respect to ruleset R3
since we choose pivot identifiers for representing equivalent individuals as justified in
Section 4.6. However, we still provide a form of “pivotal closure” whereby backward-
chaining support of rules rdfp11' and rdfp11'' over the output of our algorithm would
provide a view of closure as defined; i.e., our output contains all of the possible infer-
ences according to our notion of closure, but with equivalent individuals compressed
in pivotal form.
Figure 4: SAOR reasoning algorithm

Firstly, for rules rdfp6' and rdfp7, all statements where p = :sameAs from the
original input or as produced by R0 ∪ R1 ∪ R2 undergo on-disk symmetric-transitive
closure in pivotal form. Since both rules only produce more :sameAs statements, and
according to the standard usage restriction of our closure, they are not applicable to
reasoning under R0 ∪ R1 ∪ R2. Secondly, we loosely apply rules rdfp11' and rdfp11''
such as to provide closure with respect to joins in ruleset R2; i.e., all possible joins are
computed with respect to the given :sameAs statements. Equivalence is clearly not
important to R0 since we strictly do not allow :sameAs statements to affect our T-
Box; R1 inferences do not require joins and, although the statements produced will
not be in pivotal form, they will be output and rewritten later; inferences from R2 will
be produced as discussed, also possibly in non-pivotal form. In the final consolidation
step, we then rewrite all statements to their pivotal form and provide incoming and
outgoing :sameAs relations between pivot identifiers and their non-pivot equivalent
identifiers. This constitutes our output, which we call pivotal authoritative closure.
147m Dataset

C                |ClR1(T̂,{m(C)})|  |ClR1(T,{m(C)})|  n            n·|ClR1(T̂,{m(C)})|  n·|ClR1(T,{m(C)})|
rss:item         0                 356               3,558,055    0                    1,266,667,580
foaf:Person      6                 388               3,252,404    19,514,424           1,261,932,752
rdf:Seq          2                 243               1,934,852    3,869,704            470,169,036
foaf:Document    1                 354               1,750,365    1,750,365            619,629,210
wordnet:Person   0                 236               1,475,378    0                    348,189,208
TOTAL            9                 1,577             11,971,054   25,134,493           3,966,587,786

P                |ClR1(T̂,{m(P)})|  |ClR1(T,{m(P)})|  n            n·|ClR1(T̂,{m(P)})|  n·|ClR1(T,{m(P)})|
dc:title*        0                 14                5,503,170    0                    77,044,380
dc:date*         0                 377               5,172,458    0                    1,950,016,666
foaf:name*       3                 418               4,631,614    13,894,842           1,936,014,652
foaf:nick*       0                 390               4,416,760    0                    1,722,536,400
rss:link*        1                 377               4,073,739    4,073,739            1,535,799,603
TOTAL            4                 1,576             23,797,741   17,968,581           7,221,411,701

1.1b Dataset

C                |ClR1(T̂,{m(C)})|  |ClR1(T,{m(C)})|  n            n·|ClR1(T̂,{m(C)})|  n·|ClR1(T,{m(C)})|
foaf:Person      6                 4,631             63,271,689   379,630,134          293,011,191,759
foaf:Document    1                 4,523             6,092,322    6,092,322            27,555,572,406
rss:item         0                 4,528             5,745,216    0                    26,014,338,048
oboInOwl:DbXref  0                 0                 2,911,976    0                    0
rdf:Seq          2                 4,285             2,781,994    5,563,988            11,920,844,290
TOTAL            9                 17,967            80,803,197   391,286,444          358,501,946,503

P                |ClR1(T̂,{m(P)})|  |ClR1(T,{m(P)})|  n            n·|ClR1(T̂,{m(P)})|  n·|ClR1(T,{m(P)})|
rdfs:seeAlso     2                 8,647             113,760,738  227,521,476          983,689,101,486
foaf:knows       14                9,269             77,335,237   1,082,693,318        716,820,311,753
dc:title*        0                 4,621             71,321,437   0                    329,576,360,377
foaf:nick*       0                 4,635             65,855,264   0                    305,239,148,640
foaf:weblog      7                 9,286             55,079,875   385,559,125          511,471,719,250
TOTAL            23                36,458            383,352,551  1,695,773,919        2,846,796,641,506

Table 5: Comparison of authoritative and non-authoritative reasoning for the number of unique
inferred RDF statements produced (w.r.t. ruleset R1) over the five most frequently occurring
classes and properties in both input datasets. '*' indicates a datatype property where the object
of m(P) is a literal. The number of statements produced by authoritative reasoning for a single
membership assertion of the class or property is denoted by |ClR1(T̂,{m(C)})| and
|ClR1(T̂,{m(P)})| respectively. Non-authoritative counts are given by |ClR1(T,{m(C)})|
and |ClR1(T,{m(P)})|. n is the number of membership assertions for the class C or property
P in the given dataset.
class descriptions with :Thing as a member;16 for the 1.1b dataset, many inferences
on the top-level classes stem from, for example, the OWL W3C Test Repository17. Of
course, we do not see such documents as being malicious in any way, but clearly they
would cause inflationary inferences when naïvely considered as part of a web knowledge
base.
Next, we present some metrics regarding the first step of reasoning: the separation
and in-memory construction of the T-Box. For the 1.1b dataset, the initial scan of all
data found 9,683,009 T-Box statements (0.9%). Reducing the T-Box by removing col-
lection statements as described in Section 4.3.1 dropped a further 1,091,698 (11% of
total) collection statements leaving 733,734 such statements in the T-Box (67% collec-
tion statements dropped) and 8,591,311 (89%) total. Table 6 shows, for membership
assertions of each class and property in CSAOR and PSAOR , the result of applying
authoritative analysis. Of the 33,157 unique namespaces probed, 769 (2.3%) had a
redirect, 4068 (12.3%) connected but had no redirect and 28,320 (85.4%) did not con-
nect at all. In total, 14,227,116 authority checks were performed. Of these, 6,690,704
(47%) were negative and 7,536,412 (53%) were positive. Of the positive, 4,236,393
(56%) were blank-nodes, 2,327,945 (31%) were a direct match between namespace
and source and 972,074 (13%) had a redirect from the namespace to the source. In
total, 2,585,708 (30%) statements were dropped as they could not contribute to a valid
authoritative inference. The entire process of separating, analysing and loading the T-
Box into memory took 6.47 hours: the most costly operation here is the large number
of HTTP lookups required for authoritative analysis, with many connections unsuc-
cessful after our five-second timeout. The process required ∼3.5GB of Java heap space
and ∼10MB of stack space.
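A hedged sketch of the authority test underlying these checks (Python; the namespace
extraction is simplified here to URI defragmentation, and `redirects` stands in for the
results of the HTTP lookups just described; this is not the literal SAOR code):

    from urllib.parse import urldefrag

    def authoritative(source_uri, term, redirects):
        """A source speaks authoritatively for the blank nodes it contains
        and for URIs whose namespace equals, or redirects to, the source."""
        if term.startswith("_:"):               # blank-node case
            return True
        namespace = urldefrag(term)[0]          # simplified namespace
        return namespace == source_uri or redirects.get(namespace) == source_uri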
For the 147m dataset, 2,649,532 (1.7%) T-Box statements were separated from the
data, which was reduced to 1,609,958 (61%) after reducing the amount of irrelevant
collection statements; a further 536,564 (33%) statements were dropped as they could
not contribute to a valid authoritative inference leaving 1,073,394 T-Box statements
(41% of original). Loading the T-Box into memory took approximately 1.04 hours.
We proceed by evaluating the application of reasoning over all rules on the 147m
dataset with respect to the throughput of statements written and read.
Figure 5 shows performance for reaching an overall fixpoint for application of all
rules. Clearly, the performance plateaus after 79 mins. At this point the input state-
ments have been exhausted, with rules in R0 and R1 having been applied to the input
data and statements written to the on-disk files for R2 and R3. SAOR now switches
over to calculating a fixpoint over the on-disk computed R2 and R3 rules, the results
of which become the new input for R1 and further recursive input to the R2 and R3
files.
Figure 6 shows performance specifically for achieving closure on the on-disk R2
and R3 rules. There are three pronounced steps in the output of statements. The
first one shown at (a) is due to inferencing of :sameAs statements from rule rdfp2
(:InverseFunctionalProperty - 2.1m inferences). Also part of the first step are
:sameAs inferences from rule rdfp1' (:FunctionalProperty - 31k inferences)
and rules rdfc4* (:cardinality/:maxCardinality - 449 inferences). For the first
16 Fifty-five such :unionOf class descriptions can be found in http://lsdis.cs.uga.edu/
[Plot: statements read and written (up to ∼1.6×10^8) against minutes elapsed (0–1800).]
Figure 5: Performance of applying entire ruleset on the 147m statements dataset (with-
out final consolidation step)
[Plot: statements written (up to ∼3.5×10^6) against minutes elapsed (0–1800), with labelled steps (a)–(h) discussed in the following.]
Figure 6: Performance of inferencing over R2 and R3 on-disk indexes for the 147m
statements dataset (without final consolidation)
Property AuthSub AuthObj AuthBoth AuthNone Total Drop
rdfs:subClassOf 25,076 583,399 1,595,850 1,762,414 3,966,739 2,345,813
:onProperty 1,041,873 - 97,921 - 1,139,843 -
:someValuesFrom 681,968 - 217,478 - 899,446 -
rdf:first 273,805 - 392,707 - 666,512 -
rdf:rest 249,541 - 416,946 - 666,487 -
:equivalentClass 574 189,912 162,886 3,198 356,570 3,198
:intersectionOf - - 216,035 - 216,035 -
rdfs:domain 5,693 7,788 66,338 79,748 159,567 87,536
rdfs:range 32,338 4,340 37,529 75,338 149,545 79,678
:hasValue 9,903 0 82,853 0 92,756 -
:allValuesFrom 51,988 - 22,145 - 74,133 -
rdfs:subPropertyOf 3,365 147 22,481 26,742 52,734 26,888
:maxCardinality 26,963 - - - 26,963 -
:inverseOf 75 52 6,397 18,363 24,887 18,363
:cardinality 20,006 - - - 20,006 -
:unionOf - - 21,671 - 21,671 -
:minCardinality 15,187 - - - 15,187 -
:oneOf - - 6,171 - 6,171 -
:equivalentProperty 105 24 187 696 1,012 696
Class
:FunctionalProperty 9,616 - - 18,111 27,727 18,111
:InverseFunctionalProperty 872 - - 3,080 3,952 3,080
:TransitiveProperty 807 - - 1,994 2,801 1,994
:SymmetricProperty 265 - - 351 616 351
OVERALL 2,450,020 785,661 3,365,595 1,990,035 8,591,311 2,585,708
Table 6: Authoritative analysis of T-Box statements in the 1.1b dataset for each primitive;
the final column gives the number of statements dropped
plateau shown at (b), the :sameAs equality file is closed for the first time and a local
fixpoint is being calculated to derive the initial :sameAs statements for future rules;
also during the plateau at (b), the second iteration for the :sameAs fixpoint (which, for
the first time, consolidates the key join variables in files for rules rdfp2, rdfp1', rdfc4a,
rdfc4b according to all :sameAs statements produced thus far) produces 1,018 new
such statements, with subsequent iterations producing 145, 2, and 0 new statements
respectively.
The second pronounced step at (c) is attributable to 265k transitive inferences, fol-
lowed by 1.7k symmetric-transitive inferences. The following slope at (d) is caused
by inferences on rdfc3c (:intersectionOf - 265 inferences) and rdfp15' (:someVa-
luesFrom - 36k inferences), with rule rdfp16' (:allValuesFrom - 678k inferences)
producing the final significant step at (e). The first complete iteration of the overall
fixpoint calculation is now complete.
Since the first local :sameAs fixpoint, 22k mostly rdf:type statements have been
written back to the cardinality rule files, 4 statements to the :InverseFunctional-
Property file and 14 to the :FunctionalProperty file. Thus, the :sameAs fix-
point is re-executed at (f), with no new statements found. The final, minor, staggered
step at (g) occurs after the second :sameAs fixpoint when, most notably, rule rdfp4
(:TransitiveProperty) produces 24k inferences, rule rdfc3c (:intersectionOf)
produces 6.7k inferences, and rule rdfp16' (:allValuesFrom) produces 7.3k new
statements.
The final, extended plateau at (h) is caused by rules which produce/consume rdf:type
statements. In particular, the fixpoint encounters :allValuesFrom inferencing pro-
ducing a minor contribution of statements (≤ 2) which lead to an update and re-
execution of :allValuesFrom inferencing and :intersectionOf reasoning. In
particular, :allValuesFrom required 66 recursive iterations to reach a fixpoint. We
identified the problematic data as follows:
@prefix veml: <http://www.icsi.berkeley.edu/~snarayan/VEML.owl#>
@prefix verl: <http://www.icsi.berkeley.edu/~snarayan/VERL.owl#>
@prefix data: <http://www.icsi.berkeley.edu/~snarayan/meeting01.owl#>
...
From the above data, each iteration of :allValuesFrom reasoning and subsequent
subclass reasoning produced:
# INPUT TO ALL-VALUES-FROM, ITERATION 0
# FROM INPUT
( :c1... :c65) rdf:first (data:1 ... data:65) .
# FROM RANGE
:c1 a veml:EventList .
[Plot: statements read and written (up to ∼2×10^9) against minutes elapsed (0–600).]
as described above. In particular, we note the linear trend; upon inspection,
one can see that the minor slow-down in the rate of statements read is attributable to an
increased throughput in terms of output statements (disk-write operations).
Finally, Table 7 lists the number of times each rule was fired for reasoning on the
1.1b dataset, reasoning using only R0 ∪ R1 on the 147m dataset and also of applying
all rules to the 147m dataset. Again, from both Figure 5 and Table 7 we can deduce that
the bulk of current web reasoning is covered by those rules (R0 ∪ R1) which exhibit
linear scale.
Table 7: Number of statements inferred when applying the given ruleset to the given
dataset.
6 Related Work
OWL reasoning, specifically query answering over OWL Full, is not tackled by typical
DL reasoners such as FaCT++ [45], RACER [19] or Pellet [40], which focus on
complex reasoning tasks such as subsumption checking and provable completeness of
reasoning. Likewise, KAON2 [32], which reports better results on query answering, is
limited to OWL DL expressivity due to completeness requirements. Despite being able
to deal with complex ontologies in a complete manner, these systems are not tailored
for the particular challenges of processing large amounts of RDF data, in particular
large A-Boxes.
Systems such as TRIPLE [39], JESS18, or Jena19 support rule-representable RDFS
or OWL fragments as we do, but work only in memory, whereas our framework is
focused on conducting scalable reasoning using persistent storage.
The OWLIM [28] family of systems allows reasoning over a version of pD* using
the TRREE (Triple Reasoning and Rule Entailment Engine). Besides the in-memory
version SwiftOWLIM, there is also a version offering query processing over a persistent
image of the repository, BigOWLIM, which comes closest technically to our approach.
In an evaluation on 2 x dual-core 2GHz machines with 16GB of RAM, BigOWLIM is
claimed to index over 1bn triples from the LUBM benchmark [17] in just under 70
hours [1]; however, this figure includes indexing of the data for query answering and is
thus not directly comparable with our results; in any case, our reasoning approach
strictly focuses on sensible reasoning for web data.
Some existing systems already implement a separation of T-Box and A-Box for
scalable reasoning, where in particular, assertional data are stored in some RDBMS;
e.g. DLDB [35], Minerva [48] and OntoDB [25]. Similar to our approach of reasoning
over web data, [36] demonstrates reasoning over 166m triples using the DLDB system.
Also, like us (and as we had previously introduced in [23]), they internally choose pivot
identifiers to represent equivalent sets of individuals. However, they use the notion of
perspectives to support inferencing based on T-Box data; in their experiment they man-
ually selected nine T-Box perspectives, unlike our approach that deals with arbitrary
T-Box data from the Web. Their evaluation was performed on a workstation with dual
64-bit CPUs and 10GB main memory on which they loaded 760k documents / 166m
triples (14% larger than our 147m statement dataset) in about 350 hrs; however, unlike
our evaluation, the total time taken includes indexing for query-answering.
In a similar approach to our authoritative analysis, [8] introduced restrictions for
accepting sub-class and equivalent-class statements from third-party sources; they fol-
low arguments similar to those made in this paper. However, their notion of what we call
authoritativeness is based on hostnames and does not consider redirects; we argue that,
as a consequence, the use of PURL services20, e.g., is not properly supported in two
respects: (i) all documents using the same service (and thus having the same namespace
hostname) would be ‘authoritative’ for each other, and (ii) a document cannot be served
directly by the namespace location, but only through a redirect. Indeed, further work
presented in [7] introduced
the notion of an authoritative description which is very similar to ours. In any case, we
provide much more extensive treatment of the issue, supporting a much more varied
range of RDF(S)/OWL constructs.
One promising alternative to authoritative reasoning for the Web is the notion of
“context-dependant” or “quarantined reasoning” introduced in [11], whereby inference
results are only considered valid within the given context of a document. As opposed to
our approach whereby we construct one authoritative model for all web data, their ap-
proach uses a unique model for each document, based on implicit and explicit imports
18 http://herzberg.ca.sandia.gov/
19 http://jena.sourceforge.net/
20 http://purl.org/
of the document; thus, they would infer statements within the local context which we
would consider to be non-authoritative. However, they would miss inferences which
can only be conducted by considering a merge of documents, such as transitive closure
or equality inferences based on inverse-functional properties over multiple documents.
Their evaluation was completed on three machines with quad-core 2.33GHz and 8GB
main memory; they claimed to be able to load, on average, 40 documents per second.
We show in our evaluation that naïve inferencing over web data leads to an ex-
plosion of materialised statements and show how to prevent this explosion through
analysis of the authority of data sources. We also present metrics relating to the most
productive rules with regard to inferencing on the Web.
Although SAOR is currently not optimised for reaching full closure, we show that
our system is suitable for optimised computation of the approximate closure of a web
knowledge-base w.r.t. the most commonly used RDF(S) and OWL constructs. In our
evaluation, we showed that the bulk of inferencing on web data can be completed with
two scans of an unsorted web-crawl.
Future work includes investigating possible distribution methods: indeed, by lim-
iting our tool-box to file scans and sorts, our system can be implemented on multiple
machines, as-is, according to known distribution methods for our foundational opera-
tions.
References
[1] Bigowlim: System doc., Oct. 2006. http://www.ontotext.com/owlim/big/
BigOWLIMSysDoc.pdf.
[2] S. Bechhofer, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-
Schneider, and L. A. Stein. OWL Web Ontology Language Reference. W3C Recommen-
dation, Feb. 2004. http://www.w3.org/TR/owl-ref/.
[3] S. Bechhofer and R. Volz. Patching syntax in OWL ontologies. In International Semantic
Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 668–682.
Springer, November 2004.
[4] D. Brickley and R. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. W3C
Recommendation, Feb. 2004. http://www.w3.org/TR/rdf-schema/.
[5] D. Brickley and L. Miller. FOAF Vocabulary Specification 0.91, Nov. 2007. http:
//xmlns.com/foaf/spec/.
[6] J. d. Bruijn and S. Heymans. Logical foundations of (e)RDF(S): Complexity and reason-
ing. In 6th International Semantic Web Conference, number 4825 in LNCS, pages 86–99,
Busan, Korea, Nov 2007.
[7] G. Cheng, W. Ge, H. Wu, and Y. Qu. Searching semantic web objects based on class
hierarchies. In Proceedings of Linked Data on the Web Workshop, 2008.
[8] G. Cheng and Y. Qu. Term dependence on the semantic web. In International Semantic
Web Conference, pages 665–680, Oct. 2008.
[9] J. de Bruijn. Semantic Web Language Layering with Ontologies, Rules, and Meta-
Modeling. PhD thesis, University of Innsbruck, 2008.
[10] J. de Bruijn, A. Polleres, R. Lara, and D. Fensel. OWL− . Final draft d20.1v0.2, WSML,
2005.
[11] R. Delbru, A. Polleres, G. Tummarello, and S. Decker. Context dependent reasoning for
semantic documents in Sindice. In Proceedings of the 4th International Workshop on Scal-
able Semantic Web Knowledge Base Systems (SSWS 2008), October 2008.
[12] D. Fensel and F. van Harmelen. Unifying reasoning and search to web scale. IEEE Internet
Computing, 11(2):96, 94–95, 2007.
[13] S. Ghilardi, C. Lutz, and F. Wolter. Did I damage my ontology? A case for conservative
extensions in description logics. In Proceedings of the Tenth International Conference on
Principles of Knowledge Representation and Reasoning, pages 187–197, June 2006.
[14] B. C. Grau, I. Horrocks, B. Parsia, P. Patel-Schneider, and U. Sattler. Next steps for OWL.
In OWL: Experiences and Directions Workshop, Nov. 2006.
[15] B. Grosof, I. Horrocks, R. Volz, and S. Decker. Description logic programs: Combining
logic programs with description logic. In 13th International Conference on World Wide
Web, 2004.
[16] R. V. Guha, R. McCool, and R. Fikes. Contexts for the semantic web. In Third International
Semantic Web Conference, pages 32–46, November 2004.
[17] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems.
Journal of Web Semantics, 3(2-3):158–182, 2005.
[18] C. Gutiérrez, C. Hurtado, and A. O. Mendelzon. Foundations of Semantic Web Databases.
In 23rd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems,
Paris, June 2004.
[19] V. Haarslev and R. Möller. Racer: A core inference engine for the semantic web. In
International Workshop on Evaluation of Ontology-based Tools, 2003.
[20] A. Harth and S. Decker. Optimized index structures for querying RDF from the web. In 3rd
Latin American Web Congress, pages 71–80. IEEE Press, 2005.
[21] A. Harth, J. Umbrich, and S. Decker. Multicrawler: A pipelined architecture for crawling
and indexing semantic web data. In 5th International Semantic Web Conference, pages
258–271, 2006.
[22] P. Hayes. RDF Semantics. W3C Recommendation, Feb. 2004. http://www.w3.org/
TR/rdf-mt/.
[23] A. Hogan, A. Harth, and S. Decker. Performing object consolidation on the semantic web
data graph. In 1st I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.
[24] A. Hogan, A. Harth, and A. Polleres. SAOR: Authoritative Reasoning for the Web. In
Proceedings of the 3rd Asian Semantic Web Conference (ASWC 2008), Bangkok, Thailand,
Dec. 2008.
[25] D. Hondjack, G. Pierra, and L. Bellatreche. Ontodb: An ontology-based database for data
intensive applications. In Proceedings of the 12th International Conference on Database
Systems for Advanced Applications, pages 497–508, April 2007.
[26] I. Horrocks and P. F. Patel-Schneider. Reducing OWL entailment to description logic satisfi-
ability. Journal of Web Semantics, 1(4):345–357, 2004.
[27] E. Jiménez-Ruiz, B. C. Grau, U. Sattler, T. Schneider, and R. B. Llavori. Safe and economic
re-use of ontologies: A logic-based methodology and tool support. In Proceedings of the
21st International Workshop on Description Logics (DL2008), May 2008.
[28] A. Kiryakov, D. Ognyanov, and D. Manov. OWLIM – a pragmatic semantic repository for
OWL. In Web Information Systems Engineering Workshops, LNCS, pages 182–192, New
York, USA, Nov. 2005.
[29] D. Kunkle and G. Cooperman. Solving Rubik's Cube: disk is the new RAM. Communications
of the ACM, 51(4):31–33, 2008.
[30] J. W. Lloyd. Foundations of Logic Programming (2nd edition). Springer-Verlag, 1987.
[31] C. Lutz, D. Walther, and F. Wolter. Conservative extensions in expressive description
logics. In IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial
Intelligence, pages 453–458, January 2007.
[32] B. Motik. Reasoning in Description Logics using Resolution and Deductive Databases.
PhD thesis, Forschungszentrum Informatik, Karlsruhe, Germany, 2006.
[33] B. Motik. On the properties of metamodeling in OWL. Journal of Logic and Computation,
17(4):617–637, 2007.
[34] S. Muñoz, J. Pérez, and C. Gutiérrez. Minimal deductive systems for RDF. In ESWC,
pages 53–67, 2007.
[35] Z. Pan and J. Heflin. DLDB: Extending relational databases to support Semantic Web queries.
In PSSS1 - Practical and Scalable Semantic Systems, Proceedings of the First International
Workshop on Practical and Scalable Semantic Systems, October 2003.
[36] Z. Pan, A. Qasem, S. Kanitkar, F. Prabhakar, and J. Heflin. Hawkeye: A practical large
scale demonstration of semantic web integration. In OTM Workshops (2), volume 4806 of
Lecture Notes in Computer Science, pages 1115–1124. Springer, November 2007.
[37] P. F. Patel-Schneider and I. Horrocks. OWL Web Ontology Language Semantics and Abstract
Syntax, Section 4: Mapping to RDF Graphs. W3C Recommendation, Feb. 2004. http:
//www.w3.org/TR/owl-semantics/mapping.html.
[38] E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF. W3C Recom-
mendation, Jan. 2008. http://www.w3.org/TR/rdf-sparql-query/.
[39] M. Sintek and S. Decker. TRIPLE – a query, inference, and transformation language for the
Semantic Web. In 1st International Semantic Web Conference, pages 364–378, 2002.
[40] E. Sirin, B. Parsia, B. C. Grau, A. Kalyanpur, and Y. Katz. Pellet: A practical OWL-DL
reasoner. Journal of Web Semantics, 5(2):51–53, 2007.
[41] M. K. Smith, C. Welty, and D. L. McGuinness. OWL Web Ontology Language Guide.
W3C Recommendation, Feb. 2004. http://www.w3.org/TR/owl-guide/.
[42] H. J. ter Horst. Combining RDF and part of OWL with rules: Semantics, decidability, com-
plexity. In 4th International Semantic Web Conference, pages 668–684, 2005.
[43] H. J. ter Horst. Completeness, decidability and complexity of entailment for RDF Schema and
a semantic extension involving the OWL vocabulary. Journal of Web Semantics, 3:79–115,
2005.
[44] Y. Theoharis, V. Christophides, and G. Karvounarakis. Benchmarking database representa-
tions of RDF/S stores. In Proceedings of the Fourth International Semantic Web Conference,
pages 685–701, November 2005.
[45] D. Tsarkov and I. Horrocks. FaCT++ description logic reasoner: System description. In
International Joint Conf. on Automated Reasoning, pages 292–297, 2006.
[46] T. D. Wang, B. Parsia, and J. A. Hendler. A survey of the web ontology landscape. In
Proceedings of the 5th International Semantic Web Conference (ISWC 2006), pages 682–
694, Athens, GA, USA, Nov. 2006.
[47] Z. Wu, G. Eadon, S. Das, E. I. Chong, V. Kolovski, M. Annamalai, and J. Srinivasan.
Implementing an Inference Engine for RDFS/OWL Constructs and User-Defined Rules in
Oracle. In 24th International Conference on Data Engineering. IEEE, 2008. To appear.
[48] J. Zhou, L. Ma, Q. Liu, L. Zhang, Y. Yu, and Y. Pan. Minerva: A scalable OWL ontology
storage and inference system. In Proceedings of The First Asian Semantic Web Conference
(ASWC), pages 429–443, September 2006.
Published in Proceedings of the 5th European Semantic Web Conference (ESWC2008), pp. 432–447, Nov.
2007, Springer LNCS vol. 3803, ext. version published as tech. report, cf.
http://www.deri.ie/fileadmin/documents/TRs/DERI-TR-2007-12-14.pdf and as W3C
member submission, cf. http://www.w3.org/Submission/2009/01/
Abstract
With currently available tools and languages, translating between an existing
XML format and RDF is a tedious and error-prone task. The importance of this
problem is acknowledged by the W3C GRDDL working group, which faces the is-
sue of extracting RDF data out of existing HTML or XML files, as well as by
the Web service community around SAWSDL, which needs to perform lowering and
lifting between RDF data from a semantic client and XML messages for a Web
service. However, at the moment, both these groups rely solely on XSLT trans-
formations between RDF/XML and the respective other XML format at hand. In
this paper, we propose a more natural approach for such transformations based on
merging XQuery and SPARQL into the novel language XSPARQL. We demon-
strate that XSPARQL provides concise and intuitive solutions for mapping be-
tween XML and RDF in either direction, addressing both the use cases of GRDDL
and SAWSDL. We also provide and describe an initial implementation of an XS-
PARQL engine, available for user evaluation.
1 Introduction
There is a gap within the Web of data: on one side, XML provides a popular format for
data exchange with a rapidly increasing amount of semi-structured data available. On
the other side, the Semantic Web builds on data represented in RDF, which is optimized
∗ This material is based upon works supported by the European FP6 projects inContext (IST-034718) and
TripCom (IST-4-027324-STP), and by Science Foundation Ireland under Grant No. SFI/02/CE1/I131.
(a) Turtle:

@prefix alice: <alice/> .
@prefix foaf: <...foaf/0.1/> .

alice:me a foaf:Person.
alice:me foaf:knows _:c.
_:c a foaf:Person.
_:c foaf:name "Charles".

(b) RDF/XML, concise:

<rdf:RDF xmlns:foaf="...foaf/0.1/"
         xmlns:rdf="...rdf-syntax-ns#">
  <foaf:Person rdf:about="alice/me">
    <foaf:knows>
      <foaf:Person foaf:name="Charles"/>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>

(c) RDF/XML, statements grouped by subject:

<rdf:RDF xmlns:foaf="...foaf/0.1/"
         xmlns:rdf="...rdf-syntax-ns#">
  <rdf:Description rdf:nodeID="x">
    <rdf:type rdf:resource=".../Person"/>
    <foaf:name>Charles</foaf:name>
  </rdf:Description>
  <rdf:Description rdf:about="alice/me">
    <rdf:type rdf:resource=".../Person"/>
    <foaf:knows rdf:nodeID="x"/>
  </rdf:Description>
</rdf:RDF>

(d) RDF/XML, one description per statement:

<rdf:RDF xmlns:foaf="...foaf/0.1/"
         xmlns:rdf="...rdf-syntax-ns#">
  <rdf:Description rdf:about="alice/me">
    <foaf:knows rdf:nodeID="x"/>
  </rdf:Description>
  <rdf:Description rdf:about="alice/me">
    <rdf:type rdf:resource=".../Person"/>
  </rdf:Description>
  <rdf:Description rdf:nodeID="x">
    <foaf:name>Charles</foaf:name>
  </rdf:Description>
  <rdf:Description rdf:nodeID="x">
    <rdf:type rdf:resource=".../Person"/>
  </rdf:Description>
</rdf:RDF>
Figure 1: Different representations of the same RDF graph
for data interlinking and merging; the amount of RDF data published on the Web is also
increasing, but not yet at the same pace. It would clearly be useful to enable reuse of
XML data in the RDF world and vice versa. However, with currently available tools
and languages, translating between XML and RDF is not a simple task.
The importance of this issue is currently being acknowledged within the W3C in
several efforts. The Gleaning Resource Descriptions from Dialects of Languages [11]
(GRDDL) working group faces the issue of extracting RDF data out of existing (X)HTML
Web pages. In the Semantic Web Services community, RDF-based client software
needs to communicate with XML-based Web services, thus it needs to perform trans-
formations between its RDF data and the XML messages that are exchanged with
the Web services. The Semantic Annotations for WSDL (SAWSDL) working group
calls these transformations lifting and lowering (see [14, 16]). However, both these
groups propose solutions which rely solely on XSL transformations (XSLT) [12] be-
tween RDF/XML [2] and the respective other XML format at hand. Using XSLT for
handling RDF data is greatly complicated by the flexibility of the RDF/XML format.
XSLT (and XPath) were optimized to handle XML data with a simple and known hier-
archical structure, whereas RDF is conceptually different, abstracting away from fixed,
tree-like structures. In fact, RDF/XML provides a lot of flexibility in how RDF graphs
can be serialized. Thus, processors that handle RDF/XML as XML data (not as a set
of triples) need to take different possible representations into account when looking
for pieces of data. This is best illustrated by a concrete example: Fig. 1 shows four
versions of the same FOAF (cf. http://www.foaf-project.org) data.1 The first
version uses Turtle [3], a simple and readable textual format for RDF, inaccessible to
pure XML processing tools though; the other three versions are all RDF/XML, ranging
from concise (b) to verbose (d).
The three RDF/XML variants look very different to XML tools, yet exactly the
same to RDF tools. For any variant we could create simple XPath expressions that
1 In listings and figures we often abbreviate well-known namespace URIs (http://www.w3.org/1999/
extract for instance the names of the persons known to Alice, but a single expression
that would correctly work in all the possible variants would become more involved.
Here is a list of particular features of the RDF data model and RDF/XML syntax that
complicate XPath+XSLT processing:
• Elements denoting properties can directly contain value(s) as nested XML, or
reference other descriptions via the rdf:resource or rdf:nodeID attributes.
• References to resources can be relative or absolute URIs.
• Container membership may be expressed as rdf:li or rdf:_1, rdf:_2, etc.
• Statements about the same subject do not need to be grouped in a single element.
• String-valued property values such as foaf:name in our example (and also val-
ues of rdf:type) may be represented by XML element content or as attribute
values.
• The type of a resource can be represented directly as an XML element name,
with an explicit rdf:type XML element, or even with an rdf:type attribute.
This is not even a complete list of the issues that complicate the formulation of adequate
XPath expressions that cater for every possible alternative in how one and the same
RDF data might be structured in its concrete RDF/XML representation.
Apart from that, simple reasoning (e.g., RDFS materialization) can improve the results
of queries over RDF data. For instance, in FOAF, every Person (and Group and Orga-
nization etc.) is also an Agent, therefore we should be able to select all the instances of
foaf:Agent. If we wanted to write such a query in XPath+XSLT, we literally would
need to implement an RDFS inference engine within XSLT. Given the availability of
RDF tools and engines, this seems to be a dispensable exercise.
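To make this concrete, the following is a purely illustrative sketch (using the rdflib
Python library, which is not part of the paper's setup) of a single RDFS subclass
materialization pass; the file name is a placeholder, and the graph is assumed to also
contain the vocabulary's rdfs:subClassOf axioms (a full closure would iterate this to a
fixpoint):

    from rdflib import Graph
    from rdflib.namespace import RDF, RDFS, FOAF

    g = Graph().parse("data.rdf")                   # illustrative file name
    # one rdfs9-style pass: ?s a ?c and ?c rdfs:subClassOf ?d entail ?s a ?d
    for c, d in g.subject_objects(RDFS.subClassOf):
        for s in g.subjects(RDF.type, c):
            g.add((s, RDF.type, d))

    agents = set(g.subjects(RDF.type, FOAF.Agent))  # now includes the Persons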
Recently, two new languages have entered the stage for processing XML and RDF
data: XQuery [7] is a W3C Recommendation since early last year and SPARQL [22]
has finally received W3C’s Recommendation stamp in January 2008. While both lan-
guages operate in their own worlds – SPARQL in the RDF- and XQuery in the XML-
world – we show in this paper that the merge of both in the novel language XSPARQL
has the potential to finally bring XML and RDF closer together. XSPARQL provides
concise and intuitive solutions for mapping between XML and RDF in either direction,
addressing both the use cases of GRDDL and SAWSDL. As a side effect, XSPARQL
may also be used for RDF to RDF transformations beyond the capabilities of “pure”
SPARQL. We also describe an implementation of XSPARQL, available for user evalu-
ation.
In the following, we elaborate a bit more in depth on the use cases of lifting and
lowering in the contexts of both GRDDL and SAWSDL in Section 2 and discuss how
they can be addressed by XSLT alone. Next, in Section 3 we describe the two starting
points for an improved lifting and lowering language – XQuery and SPARQL – before
we announce their happy marriage to XSPARQL in Section 4. Particularly, we extend
XQuery’s FLWOR expressions with a way of iterating over SPARQL results. We
define the semantics of XSPARQL based on the XQuery semantics in [9], and describe
a rewriting algorithm that translates XSPARQL to XQuery. By this we can show that
XSPARQL is a conservative extension of both XQuery and SPARQL. We wrap up
the paper with an outlook to related and future work and conclusions to be drawn in
Sections 5 and 6.

relations.xml:

<relations>
  <person name="Alice">
    <knows>Bob</knows>
    <knows>Charles</knows>
  </person>
  <person name="Bob">
    <knows>Charles</knows>
  </person>
  <person name="Charles"/>
</relations>

relations.rdf:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
_:b1 a foaf:Person;
  foaf:name "Alice";
  foaf:knows _:b2;
  foaf:knows _:b3.
_:b2 a foaf:Person; foaf:name "Bob";
  foaf:knows _:b3.
_:b3 a foaf:Person; foaf:name "Charles".

Figure 2: The running example: lifting translates relations.xml into relations.rdf; lowering translates relations.rdf back into XML
(a) mygrddl.xsl:

<xsl:stylesheet
    xmlns:xsl="...XSL/Transform"
    xmlns:foaf="...foaf/0.1/"
    xmlns:rdf="...rdf-syntax-ns#"
    version="2.0">
  <xsl:template match="/relations">
    <rdf:RDF>
      <xsl:apply-templates />
    </rdf:RDF>
  </xsl:template>

  <xsl:template match="person">
    <foaf:Person>
      <foaf:name>
        <xsl:value-of select="./@name"/>
      </foaf:name>
      <xsl:apply-templates/>
    </foaf:Person>
  </xsl:template>

  <xsl:template match="knows">
    <foaf:knows><foaf:Person>
      <foaf:name>
        <xsl:apply-templates/>
      </foaf:name>
    </foaf:Person></foaf:knows>
  </xsl:template>
</xsl:stylesheet>

(b) Result of the GRDDL transform in RDF/XML (top) and Turtle (bottom):

<rdf:RDF xmlns:rdf="...rdf-syntax-ns#"
         xmlns:foaf="...foaf/0.1/">
  <foaf:Person>
    <foaf:name>Alice</foaf:name>
    <foaf:knows><foaf:Person>
      <foaf:name>Bob</foaf:name>
    </foaf:Person></foaf:knows>
    <foaf:knows><foaf:Person>
      <foaf:name>Charles</foaf:name>
    </foaf:Person></foaf:knows>
  </foaf:Person>
  <foaf:Person>
    <foaf:name>Bob</foaf:name>
    <foaf:knows><foaf:Person>
      <foaf:name>Charles</foaf:name>
    </foaf:Person></foaf:knows>
  </foaf:Person>
  <foaf:Person>
    <foaf:name>Charles</foaf:name>
  </foaf:Person>
</rdf:RDF>

@prefix foaf: <http://xmlns.com/foaf/0.1/>.
_:b1 a foaf:Person; foaf:name "Alice";
  foaf:knows _:b2; foaf:knows _:b3.
_:b2 a foaf:Person; foaf:name "Bob".
_:b3 a foaf:Person; foaf:name "Charles".
_:b4 a foaf:Person; foaf:name "Bob";
  foaf:knows _:b5 .
_:b5 a foaf:Person; foaf:name "Charles" .
_:b6 a foaf:Person; foaf:name "Charles".
Figure 3: Lifting attempt by XSLT
<relations xmlns:grddl="http://www.w3.org/2003/g/data-view#"
grddl:transformation="mygrddl.xsl"> ...
The RDF/XML result of the GRDDL transformation is shown in the upper part of
Fig. 3(b). However, if we take a look at the Turtle version of this result in the lower
part of Fig. 3(b) and compare it with the relations.rdf file in Fig. 2 we see that this
transformation creates too many blank nodes, since this simple XSLT does not merge
equal names into the same blank nodes.
XSLT is a Turing-complete language, and theoretically any conceivable transfor-
mation can be programmed in XSLT; so, we could come up with a more involved
stylesheet that creates unique blank node identifiers per name to solve the lifting task
as intended. However, instead of attempting to repair the stylesheet from Fig. 3(a),
let us rather ask ourselves whether XSLT is the right tool for such transformations.
We claim that specially tailored languages for RDF-XML transformations, such as
XSPARQL presented in this paper, are a more suitable alternative that alleviates
the drawbacks of XSLT for the task that GRDDL addresses.
Lifting/Lowering in SAWSDL.
While GRDDL is mainly concerned with lifting, in SAWSDL (Semantic Annotations
for WSDL and XML Schema) there is a strong need for translations in the other direc-
tion as well, i.e., from RDF to arbitrary XML.
SAWSDL is the first standardized specification for semantic description of Web
services. Semantic Web Services (SWS) research aims to automate tasks involved in
[Figure 4 (schematic): a client holding RDF data communicates with an XML-based Web service over SOAP; lowering produces the XML messages from RDF, lifting extracts RDF from them.]
the use of Web services, such as service discovery, composition and invocation. How-
ever, SAWSDL is only a first step, offering hooks for attaching semantics to WSDL
components such as operations, inputs and outputs, etc. Eventually, SWS shall enable
client software agents or services to automatically communicate with other services by
means of semantic mediation on the RDF level. The communication requires both low-
ering and lifting transformations, as illustrated in Fig. 4. Lowering is used to create the
request XML messages from the RDF data available to the client, and lifting extracts
RDF from the incoming response messages.
As opposed to GRDDL, which provides hooks to link XSLT transformations on
the level of whole XML or namespace documents, SAWSDL provides a more fine-
grained mechanism for “semantic adornments” of XML Schemas. In WSDL, schemata
are used to describe the input and output messages of Web service operations, and
SAWSDL can annotate messages or parts of them with pointers to relevant semantic
concepts plus links to lifting and lowering transformations. These links are created us-
ing the sawsdl:liftingSchemaMapping and sawsdl:loweringSchemaMapping
attributes which reference the transformations within XSL elements (xsl:element,
xsl:attribute, etc.) describing the respective message parts.
SAWSDL’s schema annotations for lifting and lowering are not only useful for
communication with web services from an RDF-aware client, but for service media-
tion in general. This means that the output of a service S1 uses a different message
format than service S2 expects as input, but it could still be used if services S1 and
S2 provide lifting and lowering schema mappings, respectively, which map from/to the
same ontology, or, respectively, ontologies that can be aligned via ontology mediation
techniques (see [13]).
Lifting is analogous to the GRDDL situation – the client or an intermediate media-
tion service receives XML and needs to extract RDF from it –, but let us focus on RDF
data lowering now. To stay within the boundaries of our running example, we assume
a social network site with a Web service for querying and updating the list of a user’s
friends. The service accepts an XML format à la relations.xml (Fig. 2) as the message
format for updating a user’s (client) list of friends.
Assuming the client stores his FOAF data (relations.rdf in Fig. 2) in RDF/XML in
the style of Fig. 1(b), the simple XSLT stylesheet mylowering.xsl in Fig. 5 would per-
form the lowering task. The service could advertise this transformation in its SAWSDL
by linking mylowering.xsl in the sawsdl:loweringSchemaMapping attribute of the
XML Schema definition of the relations element that conveys the message payload.
However, this XSLT will break if the input RDF is in any other variant shown in Fig. 1.
We could create a specific stylesheet for each of the presented variants, but creating
one that handles all the possible RDF/XML forms would be much more complicated.
In recognition of this problem, SAWSDL contains a non-normative example which
performs a lowering transformation as a sequence of a SPARQL query followed by
<xsl:stylesheet version="1.0" xmlns:rdf="...rdf-syntax-ns#"
xmlns:foaf="...foaf/0.1/" xmlns:xsl="...XSL/Transform">
<xsl:template match="/rdf:RDF">
<relations><xsl:apply-templates select=".//foaf:Person"/></relations>
</xsl:template>
<xsl:template match="foaf:Person"><person name="./@foaf:name">
<xsl:apply-templates select="./foaf:knows"/>
</person></xsl:template>
<xsl:template match="foaf:knows[@rdf:nodeID]"><knows>
<xsl:value-of select="//foaf:Person[@rdf:nodeID=./@rdf:nodeID]/@foaf:name"/>
</knows></xsl:template>
<xsl:template match="foaf:knows[foaf:Person]">
<knows><xsl:value-of select="./foaf:Person/@foaf:name"/></knows>
</xsl:template>
</xsl:stylesheet>
an XSLT transformation on SPARQL’s query results XML format [8]. Unlike XSLT
or XPath, SPARQL treats all the RDF input data from Fig. 1 as equal. This approach
makes a step in the right direction, combining SPARQL with XML technologies. The
detour through SPARQL’s XML query results format however seems to be an unnec-
essary burden. The XSPARQL language proposed in this paper solves this problem:
it uses SPARQL pattern matching for selecting data as necessary, while allowing the
construction of arbitrary XML (by using XQuery) for forming the resulting XML struc-
tures.
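To make the detour concrete, the following hedged sketch (Python with the rdflib and
lxml libraries, purely illustrative and not taken from the paper) chains the two steps:
a SPARQL select query whose XML results are then transformed by an XSLT stylesheet;
the stylesheet name is hypothetical:

    from rdflib import Graph
    from lxml import etree

    g = Graph().parse("relations.rdf")
    res = g.query("""
        PREFIX foaf: <http://xmlns.com/foaf/0.1/>
        SELECT ?name ?friend WHERE {
          ?p foaf:name ?name .
          OPTIONAL { ?p foaf:knows [ foaf:name ?friend ] } }""")
    # step 1: SPARQL's XML results format; step 2: XSLT over that format
    results = etree.fromstring(res.serialize(format="xml"))
    transform = etree.XSLT(etree.parse("results-to-relations.xsl"))  # hypothetical
    print(str(transform(results)))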
As more RDF data is becoming available on the Web which we want to integrate
with existing XML-aware applications, SAWSDL will obviously not remain the only
use case for lowering.
3.1 XQuery
As shown in Fig. 6(a), an XQuery starts with a (possibly empty) prolog (P) for names-
pace, library, function, and variable declarations, followed by so-called FLWOR ex-
pressions, denoting the body (FLWO) and head (R) of the query. We only show namespace
declarations in Fig. 6 for brevity.
As for the body, for clauses (F) can be used to declare variables looping over
the XML nodeset returned by an XPath expression. Alternatively, to bind the entire
(a) Schematic view on XQuery:

Prolog: P  declare namespace prefix="namespace-URI"
Body:   F  for var in XPath-expression
        L  let var := XPath-expression
        W  where XPath-expression
        O  order by XPath-expression
Head:   R  return XML + nested XQuery

(b) Schematic view on SPARQL:

Prolog: P  prefix prefix: <namespace-URI>
Head:   C  construct { template }
Body:   D  from / from named <dataset-URI>
        W  where { pattern }
        M  order by expression
           limit integer > 0  offset integer > 0

(c) Schematic view on XSPARQL:

Prolog: P  declare namespace prefix="namespace-URI"
           or prefix prefix: <namespace-URI>
Body:   F  for var in XPath-expression
        L  let var := XPath-expression
        W  where XPath-expression
        O  order by expression
     or F' for varlist
        D  from / from named <dataset-URI>
        W  where { pattern }
        M  order by expression
           limit integer > 0  offset integer > 0
Head:   C  construct { template (with nested XSPARQL) } or
        R  return XML + nested XSPARQL

Figure 6: Schematic views on (a) XQuery, (b) SPARQL, and (c) XSPARQL
result of an XPath query to a variable, let assignments can be used. The where part
(W) defines an XPath condition over the current variable bindings. Processing order of
results of a for can be specified via a condition in the order by clause (O).
In the head (R) arbitrary well-formed XML is allowed following the return key-
word, where variables scoped in an enclosing for or let as well as nested XQuery
FLWOR expressions are allowed. Any XPath expression in FLWOR expressions
can again possibly involve variables defined in an enclosing for or let, or even
nested XQuery FLWOR expressions. Together with a large catalogue of built-in func-
tions [17], XQuery thus offers a flexible instrument for arbitrary transformations. For
more details, we refer the reader to [7, 9].
The lifting task of Fig. 2 can be solved with XQuery as shown in Fig. 7(a). The
resulting query is quite involved, but completely addresses the lifting task, including
unique blank node generation for each person: We first select all nodes containing per-
son names from the original file for which a blank node needs to be created in variable
$p (line 3). Looping over these nodes, we extract the actual names from either the
value of the name attribute or from the knows element in variable $n. Finally, we com-
pute the position in the original XML tree as blank node identifier in variable $id. The
where clause (lines 12–14) filters out only the last name for duplicate occurrences of
the same name. The nested for (lines 19–31), which creates the nested foaf:knows
elements, again loops over persons, with the only difference that only those nodes are
selected (line 25) which are known by the person with the name from the outer for loop.
While this is a valid solution for lifting, we still observe the following drawbacks:
(1) We still have to build RDF/XML “manually” and cannot make use of the more
readable and concise Turtle syntax; and
(2) if we had to apply XQuery for the lowering task, we still would need to cater for
all kinds of different RDF/XML representations. As we will see, both these drawbacks
are alleviated by adding some SPARQL to XQuery.
3.2 SPARQL
Fig. 6(b) shows a schematic overview of the building blocks that SPARQL queries
consist of. Again, we do not go into details of SPARQL here (see [22, 19, 20] for
(a) XQuery:

 1  declare namespace foaf="...foaf/0.1/";
 2  declare namespace rdf="...-syntax-ns#";
 3  let $persons := //*[@name or ../knows]
 4  return
 5  <rdf:RDF>
 6   {
 7   for $p in $persons
 8   let $n := if( $p[@name] )
 9             then $p/@name else $p
10   let $id := count($p/preceding::*)
11             +count($p/ancestor::*)
12   where
13     not(exists($p/following::*[
14       @name=$n or data(.)=$n]))
15   return
16   <foaf:Person rdf:nodeID="b{$id}">
17    <foaf:name>{data($n)}</foaf:name>
18    {
19    for $k in $persons
20    let $kn := if( $k[@name] )
21               then $k/@name else $k
22    let $kid := count($k/preceding::*)
23               +count($k/ancestor::*)
24    where
25      $kn = data(//*[@name=$n]/knows) and
26      not(exists($kn/../following::*[
27        @name=$kn or data(.)=$kn]))
28    return
29    <foaf:knows>
30     <foaf:Person rdf:nodeID="b{$kid}"/>
31    </foaf:knows>
32    }
33   </foaf:Person>
34   }
35  </rdf:RDF>

(b) XSPARQL:

declare namespace foaf="...foaf/0.1/";
declare namespace rdf="...-syntax-ns#";
let $persons := //*[@name or ../knows]
return
 for $p in $persons
 let $n := if( $p[@name] )
           then $p/@name else $p
 let $id := count($p/preceding::*)
           +count($p/ancestor::*)
 where
   not(exists($p/following::*[
     @name=$n or data(.)=$n]))
 construct {
   _:b{$id} a foaf:Person;
     foaf:name {data($n)}.
   {
   for $k in $persons
   let $kn := if( $k[@name] )
              then $k/@name else $k
   let $kid := count($k/preceding::*)
              +count($k/ancestor::*)
   where
     $kn = data(//*[@name=$n]/knows) and
     not(exists($kn/../following::*[
       @name=$kn or data(.)=$kn]))
   construct {
     _:b{$id} foaf:knows _:b{$kid}.
     _:b{$kid} a foaf:Person.
   }
   }
 }

Figure 7: The lifting task of Fig. 2 solved in (a) XQuery and (b) XSPARQL
formal details), since we do not aim at modifying the language, but concentrate on the
overall semantics of the parts we want to reuse. Like in XQuery, namespace prefixes
can be specified in the Prolog (P). In analogy to FLWOR in XQuery, let us define
so-called DWMC expressions for SPARQL.
The body (DWM) offers the following features. A dataset (D), i.e., the set of source
RDF graphs, is specified in from or from named clauses. The where part (W) –
unlike XQuery – allows matching parts of the dataset by specifying a graph pattern
possibly involving variables, which we denote vars(pattern). This pattern is given in
a Turtle-based syntax, in the simplest case by a set of triple patterns, i.e., triples with
variables. More involved patterns allow unions of graph patterns, optional matching of
parts of a graph, matching of named graphs, etc. Matching patterns on the conceptual
level of RDF graphs rather than on a concrete XML syntax alleviates the pain of having
to deal with different RDF/XML representations; SPARQL is agnostic to the actual
XML representation of the underlying source graphs. Also the RDF merge of several
source graphs specified in consecutive from clauses, which would involve renaming
of blank nodes at the pure XML level, comes for free in SPARQL. Finally, variable
bindings matching the where pattern in the source graphs can again be ordered, but
also other solution modifiers (M) such as limit and offset are allowed to restrict
the number of solutions considered in the result.
In the head, SPARQL’s construct clause (C) offers convenient and XML-inde-
pendent means to create an output RDF graph. Since we focus here on RDF construc-
(a)
prefix vc: <...vcard-rdf/3.0#>
prefix foaf: <...foaf/0.1/>
construct {$X foaf:name $FN.}
from <vc.rdf>
where { $X vc:FN $FN .}

(b)
prefix vc: <...vcard-rdf/3.0#>
prefix foaf: <...foaf/0.1/>
construct { _:b foaf:name
  {fn:concat($N," ",$F)}.}
from <vc.rdf>
where { $P vc:Given $N. $P vc:Family $F.}

Figure 8: Mapping vCard/RDF to foaf:name: (a) a plain SPARQL query, (b) a query generating new values, which is beyond SPARQL's capabilities
<relations>
 {
 for $Person $Name from <relations.rdf>
 where { $Person foaf:name $Name }
 order by $Name
 return
  <person name="{$Name}">
   {
   for $FName from <relations.rdf>
   where { $Person foaf:knows $Friend.
           $Person foaf:name $Name.
           $Friend foaf:name $FName }
   return <knows>{$FName}</knows>
   }
  </person>
 }
</relations>

Figure 9: The lowering task of Fig. 2 solved in XSPARQL
tion, we omit the ask and select SPARQL query forms in Fig. 6(b) for brevity. A
construct template consists of a list of triple patterns in Turtle syntax possibly in-
volving variables, denoted by vars(template), that carry over bindings from the where
part. SPARQL can be used as transformation language between different RDF formats,
just like XSLT and XQuery for transforming between XML formats. A simple example
for mapping full names from vCard/RDF (http://www.w3.org/TR/vcard-rdf) to
foaf:name is given by the SPARQL query in Fig. 8(a).
Let us remark that SPARQL does not offer the generation of new values in the
head, which, on the contrary, comes for free in XQuery, offering the full range of
XPath/XQuery built-in functions. For instance, the simple query in Fig. 8(b), which
attempts to merge family names and given names into a single foaf:name, is be-
yond SPARQL's capabilities. As we will see, XSPARQL will not only make reuse of
SPARQL for transformations from and to RDF, but also aims at enhancing SPARQL
itself for RDF-to-RDF transformations, enabling queries like the one in Fig. 8(b).
4 XSPARQL
Conceptually, XSPARQL is a simple merge of SPARQL components into XQuery. In
order to benefit from the more intuitive facilities of SPARQL in terms of RDF graph
matching for retrieval of RDF data and the use of Turtle-like syntax for result con-
struction, we syntactically add these facilities to XQuery. Fig. 6(c) shows the result of
this “marriage.” First of all, every native XQuery query is also an XSPARQL query.
However, we also allow the following modifications, extending XQuery's FLWOR ex-
pressions to what we call (slightly abusing nomenclature) FLWOR’ expressions: (i) In
the body we allow SPARQL-style F’DWM blocks alternatively to XQuery’s FLWO
blocks. The new F’ for clause is very similar to XQuery’s native for clause, but
instead of assigning a single variable to the results of an XPath expression it allows the
assignment of a whitespace separated list of variables (varlist) to the bindings for these
variables obtained by evaluating the graph pattern of a SPARQL query of the form:
select varlist DWM. (ii) In the head we allow to create RDF/Turtle directly using
construct statements (C) alternatively to XQuery’s native return (R).
These modifications allow us to reformulate the lifting query of Fig. 7(a) into its
slightly more concise XSPARQL version of Fig. 7(b). The real power of XSPARQL in
our example becomes apparent on the lowering part, where all of the other languages
struggle. Fig. 9 shows the lowering query for our running example.
As a shortcut notation, we also allow writing “for *” in place of “for [list of
all variables appearing in the where clause]”; this is also the default value for the F’
clause whenever a SPARQL-style where clause is found and a for clause is miss-
ing. By this treatment, XSPARQL is also a syntactic superset of native SPARQL
construct queries, since we additionally allow the following: (1) XQuery and
SPARQL namespace declarations (P) may be used interchangeably; and
(2) SPARQL-style construct result forms (R) may appear before the retrieval
part; note that we allow this syntactic sugar only for queries consisting of a single
FLWOR’ expression, with a single construct appearing right after the query pro-
log, as otherwise, syntactic ambiguities may arise. This feature is mainly added in
order to encompass SPARQL style queries, but in principle, we expect the (R) part to
appear in the end of a FLWOR’ expression. This way, the queries of Fig. 8 are also
syntactically valid for XSPARQL.2
2 In our implementation, we also allow select and ask queries, making SPARQL a real syntactic
subset of XSPARQL.
ConstructTemplate’ is defined in the same way as the production Construct-
Template in SPARQL [22], but we additionally allow XSPARQL nested FLWORExpr’
in subject, verb, and object place. These expressions need to evaluate to a valid RDF
term, i.e.:
To define this we use the SPARQL grammar rules as a starting point and replace
the following productions:
[42] VarOrTerm’ ::= Var’ | GraphTerm’ | literalConstruct
[43] VarOrIRIref’ ::= Var’ | IRIref | iriConstruct
[44] Var’ ::= VAR2
[45] GraphTerm’ ::= IRIref | RDFLiteral | NumericLiteral
| BooleanLiteral | BlankNode | NIL
| bnodeConstruct | iriConstruct
Likewise, in the XQuery grammar we do not allow underscores at the beginning
of NCNames (defined in [5]), i.e., we modify:
[6] NCNameStartChar’ ::= Letter
Here, ⟦·⟧_SparqlQuery and ⟦·⟧_SparqlResult are auxiliary mapping rules for expanding the
respective expressions:

⟦$VarName⟧_SparqlResult
==
let $_VarName_Node := $_aux_result/_sparql_result:binding[@name="VarName"]
let $VarName := data($_VarName_Node/*)
let $_VarName_NodeType := name($_VarName_Node/*)
let $_VarName_RDFTerm := rdf_term($_VarName_Node)
and

⟦for $VarName_1 · · · $VarName_n DatasetClause
  where GroupGraphPattern SolutionModifier⟧_SparqlQuery
==
⟦fs:sparql(
   fn:concat("SELECT $VarName_1 · · · $VarName_n DatasetClause where { ",
             ⟦GroupGraphPattern⟧_Expr, " } SolutionModifier"))⟧_Expr'
The rdf_term($_VarName_Node) function assembles the Turtle representation of the RDF term contained in a result binding element.
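A minimal sketch of such a function – assuming, as in the SPARQL result XML format, that the binding element's child is named uri, bnode, or literal – could read as follows (datatype and language annotations omitted):

  declare function local:rdf_term($Node as element()) as xs:string {
    let $Type := name($Node/*)    (: uri, bnode, or literal :)
    let $Value := data($Node/*)
    return
      if ($Type = "uri") then fn:concat("<", $Value, ">")
      else if ($Type = "bnode") then fn:concat("_:", $Value)
      else if ($Type = "literal") then fn:concat('"', $Value, '"')
      else ""
  };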
We now define the meaning of fs:sparql. It is, following the style of [9], an abstract
function which returns a SPARQL query result XML document [8] when applied to a
proper SPARQL select query, i.e., fs:sparql conforms to the XML Schema defini-
tion http://www.w3.org/2007/SPARQL/result.xsd:

fs:sparql($query as xs:string) as document-node(schema-element(_sparql_result:sparql))

Static typing applies here according to the rules given in the XQuery semantics.
Since this function must be evaluated according to the SPARQL semantics, we need
to obtain the value of fs:sparql in the dynamic evaluation semantics of XSPARQL.
In case of an error (for instance, if the query string is not syntactically correct, or the
DatasetClause cannot be accessed), fs:sparql raises a dynamic error.
Variables bound in an enclosing F'DWM block are substituted by their RDF terms using the ⟦·⟧_VarSubst rule:

statEnv ⊢ $_VarName_RDFTerm bound
statEnv ⊢ ⟦$VarName⟧_VarSubst = ⟦$_VarName_RDFTerm⟧_Expr
Now we can apply our decoration of the core for-loops (without position vari-
ables) recursively:

⟦for $VarName_i OptTypeDeclaration_i in Expr_i ReturnClause⟧_Expr'
==
⟦for $VarName_i OptTypeDeclaration_i at $_VarName_i_Pos
   in ⟦Expr_i⟧_Expr' ⟦ReturnClause⟧_Expr'⟧_Expr
We do not specify where and order by clauses here, as they can be handled
similarly to the let and for expressions above.
4.2.2 CONSTRUCT Expressions
We now define the semantics of the ReturnClause. Expressions of the form return
Expr are evaluated as defined in the XQuery semantics. Stand-alone construct
clauses are normalized as follows:

⟦construct ConstructTemplate'⟧
==
⟦return ( ⟦ConstructTemplate'⟧_SubjPredObjList )⟧_Expr
The auxiliary mapping rule ⟦·⟧_SubjPredObjList rewrites variables and blank nodes
inside ConstructTemplate's using the normalization mapping rules ⟦·⟧_Subject,
⟦·⟧_PredObjList, and ⟦·⟧_ObjList. These use the judgements expr valid subject, valid
predicate, and valid object, which hold if the expression expr is, according to the
SPARQL construct syntax, a valid subject, predicate, or object, respectively: subjects
must be bound and must not be literals; predicates must be bound and must be neither
literals nor blank nodes; objects must be bound. If any of these criteria fails, the triple
containing the ill-formed expression is removed from the output. Free variables in the
construct are unbound, hence triples containing such variables must be removed
too. The boundedness condition can be checked at runtime by wrapping each variable
and FLWOR' expression into a fn:empty() assertion, which removes the corresponding
triple from the ConstructTemplate output.
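For a hypothetical template triple $x foaf:name $n, such a guard could look as follows (using the auxiliary variables introduced by the rewriting, cf. Section 4.3):

  if (fn:empty($_x_Node) or fn:empty($_n_Node))
  then ""   (: either term is unbound: suppress the triple :)
  else fn:concat($_x_RDFTerm, " foaf:name ", $_n_RDFTerm, " .")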
Next, we define the semantics of validSubject, validPredicate, and validObject:

statEnv ⊢ $_VarName_Node bound
statEnv ⊢ ⟦validSubject($_VarName_Node)⟧_Expr =
  ⟦if (validBnode($_VarName_Node) or validUri($_VarName_Node))
    then fn:true() else fn:false()⟧_Expr

We follow the URI definition of the N3 syntax (according to the regular expression
available at http://www.w3.org/2000/10/swap/grammar/n3-report.html#explicituri)
as opposed to the more extensive definition in [4].
statEnv ⊢ $_VarName_NodeType bound
statEnv ⊢ ⟦validLiteral($VarName)⟧_Expr =
  ⟦if ($_VarName_NodeType = "literal") then fn:true()
    else if (fn:starts-with($VarName, '"') and fn:ends-with($VarName, '"'))
      then fn:true() else fn:false()⟧_Expr
Finally, some of the normalization rules are presented; the missing rules should be
clear from the context:

statEnv ⊢ validSubject(VarOrTerm)
statEnv ⊢ ⟦VarOrTerm PropertyListNotEmpty⟧_SubjPredObjList =
  ⟦fn:concat( ⟦VarOrTerm⟧_Subject, ⟦PropertyListNotEmpty⟧_PredObjList )⟧_Expr

⟦[ PropertyListNotEmpty ]⟧_SubjPredObjList
==
⟦fn:concat( "[ ", ⟦PropertyListNotEmpty⟧_PredObjList, " ]" )⟧_Expr
Otherwise, if one of the premises does not hold, we suppress the generation of the
triple; the corresponding negated rules simply map the respective expression to the
empty string.
The normalization of subjects, verbs, and objects according to ⟦·⟧_Expr' is similar to
that of GroupGraphPattern: all variables in them are replaced using ⟦·⟧_VarSubst.
Blank nodes inside construct templates must be treated carefully, by adding
position variables from the surrounding for expressions. To this end, we use ⟦·⟧_BNodeSubst.
Since we normalize every for-loop by attaching position variables, we just need
to retrieve the available position variables from the static environment. We assume
a new static environment component statEnv.posVars which holds – similar to the
statEnv.varType component – all in-context positional variables in the given static en-
vironment, that is, the variables defined in the at clause of any enclosing for loop.
The SPARQL built-in functions are provided as XQuery functions with the following signatures:

BOUND($A as xs:string) as xs:boolean
isIRI($A as xs:string) as xs:boolean
isBLANK($A as xs:string) as xs:boolean
isLITERAL($A as xs:string) as xs:boolean
LANG($A as xs:string) as xs:string
DATATYPE($A as xs:string) as xs:anyURI
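For illustration, a minimal sketch of isLITERAL over the Turtle-style term strings produced by rdf_term (ignoring typed and language-tagged literals) could read:

  declare function local:isLITERAL($A as xs:string) as xs:boolean {
    (: in Turtle notation, a plain literal is enclosed in double quotes :)
    fn:starts-with($A, '"') and fn:ends-with($A, '"')
  };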
4.3 Implementation
As we have seen above, XSPARQL syntactically subsumes both XQuery and SPARQL.
Concerning semantics, XSPARQL equally builds on top of its constituent languages.
We have extended the formal semantics of XQuery [9] by additional rules which reduce
each XSPARQL query to XQuery expressions; the resulting FLWORs operate on the
answers of SPARQL queries in the SPARQL XML result format [8]. Since we add
only new reduction rules for SPARQL-like heads and bodies, it is easy to see that each
native XQuery is treated in a semantically equivalent way in XSPARQL.
In order to convince the reader that the same holds for native SPARQL queries,
we illustrate our reduction in the following. We restrict ourselves here to a more
abstract presentation of the rewriting algorithm as implemented in our prototype.4
The main idea behind our implementation is translating XSPARQL queries to cor-
responding XQueries which possibly use interleaved calls to a SPARQL endpoint. The
architecture of our prototype shown in Fig. 10 consists of three main components:
(1) a query rewriter, which turns an XSPARQL query into an XQuery;
(2) a SPARQL endpoint, for querying RDF from within the rewritten XQuery; and
(3) an XQuery engine for computing the result document.
4 http://xsparql.deri.org/
Figure 10: Architecture of our XSPARQL implementation
The rewriter algorithm (Fig. 12) takes as input a full XSPARQL QueryBody [9]
q (i.e., a sequence of FLWOR’ expressions), a set of bound variables b and a set of
position variables p, which we explain below. For a FL (or F’, resp.) clause s, we
denote by vars(s) the list of all newly declared variables (or the varlist, resp.) of
s. For the sake of brevity, we only sketch the core rewriting function rewrite() here;
additional machinery handling the prolog including function, variable, module, and
namespace declarations is needed in the full implementation. The rewriting is initiated
by invoking rewrite(q, ∅, ∅) with empty bound and position variables, whose result is
an XQuery. Fig. 11 shows the output of our translation for the construct query
in Fig. 8(b) which illustrates both the lowering and lifting parts.5 Let us explain the
algorithm using this sample output.
After generating the prolog (lines 1–9 of the output), the rewriting of the Query-
Body is performed recursively following the syntax of XSPARQL. During the traversal
of the nested FLWOR’ expressions, SPARQL-like heads or bodies will be replaced by
XQuery expressions, which handle our two tasks. The lowering part is processed first:
5 We provide an online interface where other example queries can be found and tested, along with a
downloadable version of our prototype, at http://www.polleres.net/xsparql/.

Lowering The lowering part of XSPARQL, i.e., SPARQL-like F'DWM blocks, is "en-
coded" in XQuery with interleaved calls to an external SPARQL endpoint. To this end,
we translate F’DWM blocks into equivalent XQuery FLWO expressions which re-
trieve SPARQL result XML documents [1] from a SPARQL engine; i.e., we “push”
each F’DWM body to the SPARQL side, by translating it to a native select query
string. The auxiliary function sparql() in line 6 of our rewriter provides this func-
tionality, transforming the where {pattern} part of F’DWM clauses to XQuery ex-
pressions which have all bound variables in vars(pattern) replaced by the values of
the variables; “free” XSPARQL variables serve as binding variables for the SPARQL
query result. The outcome of the sparql() function is a list of expressions, which is
concatenated and URI-encoded using XQuery’s XPath functions, and wrapped into a
URI with http scheme pointing to the SPARQL query service (lines 10–12 of the out-
put), cf. [8]. Then we create a new XQuery for loop over variable $aux result to
iterate over the query answers extracted from the SPARQL XML result returned by
the SPARQL query processor (line 13). For each variable $xi ∈ vars(s) (i.e., in the
(F’) for clause of the original F’DWM body), new auxiliary variables are defined in
separate let-expressions extracting its node, content, type (i.e., literal, uri, or blank),
and RDF-Term ($xi Node, $xi , $xi NodeType, and $xi RDFTerm, resp.) by appro-
priate XPath expressions (lines 14–22 of Fig. 11); the auxvars() helper in line 6 of the
rewriter algorithm (Fig. 12) is responsible for this.
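Schematically – with an illustrative endpoint URI, the rdf_term helper sketched above, and prefix declarations inside the query string omitted – the rewriting of a single F'DWM block for $Person $Name from <relations.rdf> where { $Person foaf:name $Name } could thus yield XQuery code along these lines:

  declare namespace _sparql_result = "http://www.w3.org/2005/sparql-results#";

  let $_aux_query :=
    fn:concat("http://localhost:2020/sparql?query=",
      fn:encode-for-uri(
        "SELECT $Person $Name FROM <relations.rdf> WHERE { $Person foaf:name $Name }"))
  for $_aux_result in doc($_aux_query)//_sparql_result:result
  let $_Person_Node := $_aux_result/_sparql_result:binding[@name="Person"]
  let $Person := data($_Person_Node/*)
  let $_Person_NodeType := name($_Person_Node/*)
  let $_Person_RDFTerm := local:rdf_term($_Person_Node)
  let $_Name_Node := $_aux_result/_sparql_result:binding[@name="Name"]
  let $Name := data($_Name_Node/*)
  return ($Person, $Name)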
Lifting For the lifting part, i.e., SPARQL-like constructs in the R part, the trans-
formation process is straightforward. Before we rewrite the QueryBody q, we process
the prolog (P) of the XSPARQL query and output every namespace declaration as Tur-
tle string literals “@prefix ns: <URI>.” (line 10 of the output). Then, the rewriter
algorithm (Fig. 12) is called on q and recursively decorates every for $Var expression
by fresh position variables (line 13 of our example output); ultimately, construct
templates are rewritten to an assembled string of the pattern’s constituents, filling in
variable bindings and evaluated subexpressions (lines 23–24 of the output).
Blank nodes in constructs need special care, since, according to SPARQL’s se-
mantics, these must create new blank node identifiers for each solution binding. This
is solved by “adorning” each blank node identifier in the construct part with the
above-mentioned position variables from any enclosing for-loops, thus creating a
new, unique blank node identifier in each loop (line 23 in the output). The auxiliary
function rewrite-template() in line 8 of the algorithm provides this functionality by sim-
ply adding the list of all position variables p as expressions to each blank node string; if
there are nested expressions in the supplied construct {template}, it returns a
sequence of nested FLWORs, each having rewrite() applied to these expressions
with the in-scope bound and position variables.
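For instance, under this scheme a blank node _:b in a template nested in a loop with position variable $_Person_Pos would be emitted roughly as (variable names are again illustrative):

  fn:concat("_:b", $_Person_Pos, " foaf:nick ", $_Nick_RDFTerm, " .")

so that the first iteration yields _:b1, the second _:b2, and so on.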
Expressions involving constructs create Turtle output. Generating RDF/XML
output from this Turtle is optionally done as a simple postprocessing step using
standard RDF processing tools (e.g., a Turtle-to-RDF/XML converter such as
Raptor's rapper).
Figure 12: Algorithm to Rewrite XSPARQL q to an XQuery
Recall that we only add new reduction rules for SPARQL-like heads and bodies. The
following lemma states that the rewriting always yields native XQuery:
Lemma 28 Let q be an XSPARQL query. Then the result of applying the algorithm
in Fig. 12 to q, i.e., rewrite(q, ∅, ∅), is an XQuery.

Moreover, each native XQuery is treated in a semantically equivalent way:

Lemma 29 Let q be an XSPARQL query that falls within the native XQuery fragment.
Then q and rewrite(q, ∅, ∅) are semantically equivalent XQueries.

Proof.[Sketch] From Lemma 28, we know that the output, given an XSPARQL query
falling in the XQuery fragment, is again an XQuery. Note, however, that even for this
fragment our additional rewriting rules change the original query in some cases. More
concretely, by our "decoration" rule given above, each position-variable-free for loop
(i.e., one without an at clause) is decorated with a new position variable. As these
new position variables begin with an underscore, they cannot occur in the original
query, so this rewriting does not interfere with the semantics of the original query.
The only rewriting rules which use the newly created position variables are those for
rewriting blank nodes in construct parts, i.e., the ⟦·⟧_BNodeSubst rule. However,
this rule only applies to XSPARQL queries which fall outside the native XQuery
fragment, obviously. 2
In order to convince the reader that a similar correspondence holds for native
SPARQL queries, let us now sketch the proof of the equivalence between XSPARQL's
semantics and the evaluation of SPARQL queries rewritten into native XQuery. Intu-
itively, we "inherit" the SPARQL semantics from the fs:sparql "oracle."
Let Ω denote a solution sequence of an abstract SPARQL query q = (E, DS, R),
where E is a SPARQL algebra expression, DS is an RDF dataset, and R is a set of
variables called the query form (cf. [22]). Then, by SPARQLResult(Ω) we denote
the SPARQL result XML format representation of Ω.
We are now ready to state some important properties about our transformations.
The following proposition states that any SPARQL select query can be equivalently
viewed as an XSPARQL F’DWMR query.
Proposition 30 Let q = (E_WM, DS, {$x1, . . . , $xn}) be a SPARQL query of the form
select $x1, . . . , $xn DWM, where we denote by DS the RDF dataset (cf. [22])
corresponding to the DatasetClause (D), by G the respective default graph of DS,
and by E_WM the SPARQL algebra expression corresponding to WM, and let P be the
pattern defined in the where part (W). If eval(DS(G), q) = Ω1 and

statEnv; dynEnv ⊢ for $x1 · · · $xn from D(G) where P return ($x1, . . . , $xn) ⇒ Ω2,

then Ω1 ≡ Ω2 modulo representation.6

Proof.[Sketch] The F'DWMR expression is normalized as follows:

⟦for $x1 · · · $xn from D(G) where P return ($x1, . . . , $xn)⟧_Expr'
==
⟦let $_aux_queryresult := ⟦·⟧_SparqlQuery · · · for · · · ⟦·⟧_SparqlResult · · · return ($x1, . . . , $xn)⟧_Expr'

⟦·⟧_SparqlQuery builds q as a string without replacing any variable, since all variables in
vars(P) are free. The resulting string is then applied to fs:sparql, which – since q was
unchanged – by definition returns exactly SPARQLResult(Ω1); thus the return
part return ($x1, . . . , $xn), which extracts Ω2, is obviously just a representational
variant of Ω1.
2
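As a concrete instance of this correspondence (prefix declarations omitted, names illustrative), the SPARQL query and the XSPARQL F'DWMR expression below extract the same bindings modulo representation:

  select $Name from <relations.rdf>
  where { $X foaf:name $Name }

  for $Name from <relations.rdf>
  where { $X foaf:name $Name }
  return $Name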
By similar arguments, we can see that SPARQL's construct queries are treated
semantically equivalently in XSPARQL and in SPARQL. The idea here is that the
rewriting rules for constructs from Section 4.2.2 extract exactly the triples from
the solution sequence of the body, as defined in the SPARQL semantics [22].
6 Here, by equivalence (≡) modulo representation we mean that both Ω1 and Ω2 represent the same
sequences of (partial) variable bindings.
5 Related Work
Although both XML and RDF are nearly a decade old, there has been no serious effort
toward a language for convenient transformations between the two data models.
There are, however, a number of apparently abandoned projects that aim at making it
easier to transform RDF data using XSLT. RDF Twig [23] suggests XSLT extension
functions that provide various useful views on the “sub-trees” of an RDF graph. The
main idea of RDF Twig is that while RDF/XML is hard to navigate using XPath, a
subtree of an RDF graph can be serialized into a more useful form of RDF/XML.
TreeHugger makes it possible to navigate the graph structure of RDF both in XSLT and
XQuery using XPath-like expressions, abstracting from the actual RDF/XML structure.
rdf2r3x uses an RDF processor and XSLT to transform RDF data into a predictable
form of RDF/XML, also catering for RSS. Carroll and Stickler take the approach of
simplifying RDF/XML one step further, putting RDF into a simple TriX [6] format,
using XSLT as an extensibility mechanism to provide human-friendly macros for this
syntax.
These approaches rely on non-standard extensions or tools, providing implementa-
tions in some particular programming language, tied to specific versions of XPath or
XSLT processors. In contrast, RDFXSLT provides an XSLT preprocessing stylesheet
and a set of helper functions, similar to RDF Twig and TreeHugger, yet implemented
in pure XSLT 2.0, readily available for many platforms.
All these proposals focus on XPath or XSLT, adding RDF-friendly extensions or
preprocessing the RDF data to ease access with stock XPath expressions. It seems
that XQuery and SPARQL were previously disregarded because XQuery was not
standardized until 2007 and SPARQL – which we suggest for selecting relevant parts
of RDF data instead of XPath – has only very recently received W3C's recommendation
stamp.
As for the use of SPARQL, Droop et al. [10] suggest, orthogonally to our approach,
compiling XPath queries into SPARQL. Similarly, encoding SPARQL completely into
XSLT or XQuery [15] seems an interesting idea that would enable compiling XSPARQL
down to pure XQuery without the use of a separate SPARQL engine. However, the
scalability results in [15] do not yet suggest that such an approach would scale better
than the interleaved approach taken in our current implementation.
Finally, related to our discussion in Section 2, the SPARQL Annotations for WSDL
(SPDL) project (http://www.w3.org/2005/11/SPDL/) suggests a direct integra-
tion of SPARQL queries into XML Schema, but is still work in progress. We expect
SPDL to be subsumed by SAWSDL, with XSPARQL as the language of choice for
lifting and lowering schema mappings.
6 Conclusions
XQuery and SPARQL, each in its own world, provide solutions for the problems we
encountered, and we presented XSPARQL as a natural combination of the two and a
proper solution for the lifting and lowering tasks. Moreover, we have seen that
XSPARQL offers more than a handy tool for transformations between XML and RDF.
Indeed, by accessing the full library of XPath/XQuery functions, XSPARQL opens up
extensions such as value-generating built-ins or even aggregates in the construct part,
which have previously been pointed out as missing in SPARQL [21].
As we have seen, XSPARQL is a conservative extension of both of its constituent
languages, SPARQL and XQuery. The semantics of XSPARQL was defined as an
extension of XQuery’s formal semantics adding a few normalization mapping rules.
We provide an implementation of this transformation which is based on reducing XS-
PARQL queries to XQuery with interleaved calls to a SPARQL engine via the SPARQL
protocol. There are good reasons to abstract away from RDF/XML and rely on native
SPARQL engines in our implementation. Although one could try to compile SPARQL
entirely into an XQuery that caters for all the different RDF/XML representations, this
would not address the usage we expect to become most common in the near future:
many online RDF sources will most likely not be accessible as RDF/XML files, but
rather via RDF stores that provide a standard SPARQL interface.
Our resulting XSPARQL preprocessor can be used with any available XQuery and
SPARQL implementation, and is available for user evaluation along with all examples
of this paper at http://xsparql.deri.org/.
As mentioned briefly in the introduction, simple reasoning – which we have not yet
incorporated – would significantly improve queries involving RDF data. SPARQL en-
gines that provide (partial) RDFS support could immediately address this point and
be plugged into our implementation. But we plan to go a step further: integrat-
ing XSPARQL with Semantic Web Pipes [18] or other SPARQL extensions such as
SPARQL++ [21] shall allow more complex intermediate RDF processing than RDFS
materialization.
We also plan to apply our results for retrieving metadata from context-aware ser-
vices and for Semantic Web service communication, respectively, in the EU projects in-
Context (http://www.in-context.eu/) and TripCom (http://www.tripcom.
org/).
References
[1] Dave Beckett and Jeen Broekstra. SPARQL Query Results XML Format, Novem-
ber 2007. W3C Proposed Recommendation, available at
http://www.w3.org/TR/2007/PR-rdf-sparql-XMLres-20071112/.
[2] Dave Beckett and Brian McBride (eds.). RDF/XML Syntax Specification (Re-
vised). Technical report, W3C, February 2004. W3C Recommendation.
[3] Dave Beckett. Turtle – Terse RDF Triple Language, November 2007. Available
at http://www.dajobe.org/2004/01/turtle/.
[4] Tim Berners-Lee, R. Fielding, and L. Masinter. Uniform resource identifier
(URI): Generic syntax. Internet Engineering Task Force RFC 3986, Internet
Society (ISOC), January 2005. Published online in January 2005 at
http://tools.ietf.org/html/rfc3986.
[5] Tim Bray, Dave Hollander, and Andrew Layman. Namespaces in XML. W3C
recommendation, World Wide Web Consortium, August 2006. Published online
in August 2006 at http://www.w3.org/TR/REC-xml-names.
[6] Jeremy Carroll and Patrick Stickler. TriX: RDF Triples in XML. Available at
http://www.hpl.hp.com/techreports/2004/HPL-2004-56.html.
[7] Don Chamberlin, Jonathan Robie, Scott Boag, Mary F. Fernández, Jérôme
Siméon, and Daniela Florescu. XQuery 1.0: An XML Query Language. W3C
recommendation, W3C, January 2007. W3C Recommendation, available at
http://www.w3.org/TR/xquery/.
[8] Kendall Grant Clark, Lee Feigenbaum, and Elias Torres. SPARQL Protocol for
RDF, November 2007. W3C Proposed Recommendation, available at
http://www.w3.org/TR/2007/PR-rdf-sparql-protocol-20071112/.
[9] Denise Draper, Peter Fankhauser, Mary Fernández, Ashok Malhotra, Kristof-
fer Rose, Michael Rys, Jérôme Siméon, and Philip Wadler. XQuery 1.0
and XPath 2.0 Formal Semantics. W3C Recommendation, W3C, January 2007.
Available at http://www.w3.org/TR/xquery-semantics/.
[10] Matthias Droop, Markus Flarer, Jinghua Groppe, Sven Groppe, Volker Linne-
mann, Jakob Pinggera, Florian Santner, Michael Schier, Felix Schöpf, Hannes
Staffler, and Stefan Zugal. Translating XPath Queries into SPARQL Queries. In 6th
International Conference on Ontologies, DataBases, and Applications of Seman-
tics (ODBASE 2007), 2007.
[11] Dan Connolly (ed.). Gleaning Resource Descriptions from Dialects of Languages
(GRDDL). W3C recommendation, W3C, September 2007.
[12] Michael Kay (ed.). XSL Transformations (XSLT) Version 2.0 , January 2007.
W3C Recommendation, available at http://www.w3.org/TR/xslt20.
[13] Jérôme Euzenat and Pavel Shvaiko. Ontology matching. Springer, 2007.
[14] Joel Farrell and Holger Lausen. Semantic Annotations for WSDL and XML
Schema. W3C Recommendation, W3C, August 2007. Available at http://
www.w3.org/TR/sawsdl/.
[15] Sven Groppe, Jinghua Groppe, Volker Linnemann, Dirk Kukulenz, Nils Hoeller,
and Christoph Reinke. Embedding SPARQL into XQuery/XSLT. In Proceedings
of the 23rd ACM Symposium on Applied Computing (SAC2008), March 2008. To
appear.
[16] Jacek Kopecký, Tomas Vitvar, Carine Bournez, and Joel Farrell. SAWSDL: Se-
mantic Annotations for WSDL and XML Schema. IEEE Internet Computing,
11(6):60–67, 2007.
[17] Ashok Malhotra, Jim Melton, and Norman Walsh (eds.). XQuery 1.0 and XPath
2.0 Functions and Operators, January 2007. W3C Recommendation, available at
http://www.w3.org/TR/xpath-functions/.
[18] Christian Morbidoni, Axel Polleres, Giovanni Tummarello, and Danh Le Phuoc.
Semantic Web Pipes. Technical Report DERI-TR-2007-11-07, DERI Galway,
November 2007.
[19] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and Complexity
of SPARQL. In International Semantic Web Conference (ISWC 2006), pages 30–43,
2006.
[20] Axel Polleres. From SPARQL to rules (and back). In Proceedings of the 16th
World Wide Web Conference (WWW2007), Banff, Canada, May 2007.
[21] Axel Polleres, François Scharffe, and Roman Schindlauer. SPARQL++ for map-
ping between RDF vocabularies. In 6th International Conference on Ontologies,
DataBases, and Applications of Semantics (ODBASE 2007), volume 4803 of Lec-
ture Notes in Computer Science, pages 878–896, Vilamoura, Algarve, Portugal,
November 2007. Springer.
[22] Eric Prud’hommeaux and Andy Seaborne (eds.). SPARQL Query Language for
RDF, January 2008. W3C Recommendation, available at
http://www.w3.org/TR/rdf-sparql-query/.
[23] Norman Walsh. RDF Twig: Accessing RDF Graphs in XSLT. Presented
at Extreme Markup Languages (XML) 2003, Montreal, Canada. Available at
http://rdftwig.sourceforge.net/.