Technology Advances in Large-Scale RDF Data Management
Abstract. One of the prime goals of the LOD2 project is improving the performance and scalability of RDF storage solutions so that the increasing amount of Linked Open Data (LOD) can be eﬃciently managed. Virtuoso has been chosen as the basic RDF store for the LOD2 project, and during the project it has been signiﬁcantly improved by incorporating advanced relational database techniques from MonetDB and Vectorwise, turning it into a compressed column store with vectored execution. This has reduced the performance gap (“RDF tax”) between Virtuoso's SQL and SPARQL query performance in a way that still respects the “schema-last” nature of RDF. However, by lacking schema information, RDF database systems such as Virtuoso still cannot use advanced relational storage optimizations such as table partitioning or clustered indexes and have to execute SPARQL queries with many selfjoins to a triple table, which leads to more join eﬀort than needed in SQL systems. In this chapter, we ﬁrst discuss the new column store techniques applied to Virtuoso, the enhancements in its cluster parallel version, and show its performance using the popular BSBM benchmark at the unsurpassed scale of 150 billion triples. We ﬁnally describe ongoing work in deriving an “emergent” relational schema from RDF data, which can help to close the performance gap between relational-based and RDF-based storage solutions.
One of the objectives of the LOD2 EU project is to boost the performance and the scalability of RDF storage solutions so that it can, eﬃciently manage huge datasets of Linked Open Data (LOD). However, it has been noted that given similar data management tasks, relational database technology signiﬁcantly outperformed RDF data stores. One controlled scenario in which the two technologies can be compared is the BSBM benchmark , which exists equivalent relational and RDF variants. As illustrated in Fig. 1, while the SQL systems can process by up to 40–175K QMpH, the Triple stores can only reach 1–10K QMpH, showing a factor of 15–40 of performances diﬀerence.
In the LOD2 project we investigated the causes of this large diﬀerence (the “RDF tax”, i.e. the performance cost of choosing RDF instead of relational database technology). Here we identify three causes:
Fig. 1. Triple stores vs. SQL: a heavy “RDF Tax” (2009)
• the particular case of BSBM is to the disadvantage of RDF as BSBM by its nature is very relational: its schema has just few classes, and all its properties occur exactly once for each subject, such that the data structure is very tabular. As such, the ease of use of SPARQL to formulate queries on irregularly structured data does not come into play, and the complications to which such irregular data leads in relational solutions (many more tables, and many more joins) are avoided.
• relational systems are quite mature in their implementations. For instance, the explore workload is an OLTP workload which relational systems target with key index structures and pre-compiled PL/SQL procedures. The BI workload of BSBM beneﬁts from analytical execution techniques like columnar storage and vectorized execution and hash joins with bloom ﬁlters (to name just a few). While relational products over the years have implemented many such optimizations, RDF stores typically have not. Another area where analytical relational database engines have made important progress is the use of cluster technology. Whereas in 1990s only the very high end of RDBMS solutions was cluster-enabled (i.e. Teradata), many other systems have been added such as Greenplum, Paraccel, Vertica, SAP HANA and SQLserver Parallel data Warehouse (which without exceptional also leverage columnar storage and vectorized or JIT-compiled execution).
• RDF stores do not require a schema but also do not exploit it, even though the structure of the data in fact is highly regular. This hurts in particular in the very common SPARQL star-patterns, which need to be executed using multiple self-joins, where relational systems do not need joins at all. The structure that is heavily present in RDF triples further leads to the co-occurrence of properties to be heavily (anti-) correlated. The complexity of query optimization is not only exponential with respect to the amount of joins (and SPARQL needs many more than SQL) but also relies on cost models, yet cost models typically become very unreliable in the face of correlations. Unreliable cost models lead to bad query plans and this very strongly aﬀects performance and scalability of RDF stores. In all, query optimization for SPARQL is both more costly and unreliable than for SQL.
Virtuoso6, a high-performance RDF Quad Store, was chosen as the main RDF store for LOD2 knowledge base at the start of the project. In order to reduce the “RDF tax”, we ﬁrst revised architectural ideas from the state-of-theart of relational database systems, particularly, advanced column stores such as MonetDB  and Vectorwise . Then, we brought some of the unique technologies and architectural principles from these column stores into Virtuoso7, making it work more eﬃciently on modern hardware. These techniques include tuning the access patterns of database queries to be CPU-cache conscious, and also making query processing amendable to deeply pipelined CPUs with SIMD instructions by introducing concepts like vector processing. We note that the insights gained in improving Virtuoso will also be useful for other RDF store providers to enhance their respective technologies as well.
Further, the cluster capabilities of Virtuoso were signiﬁcantly improved. Note that by lack of table structures, RDF systems must distribute data by the triple (not tuple), which leads to more network communication during query execution. Network communication cost tends to be the limiting factor for parallel database systems, hence Virtuoso7 Cluster Edition introduced an innovative control ﬂow framework that is able to hide network cost as much as possible behind CPU computation.
By the end of the LOD2 project, these improvements in Virtuoso7 on the BSBM BI workload strongly improved performance. A quantiﬁed comparison is hard as Virtuoso6 would not even complete the workload (“inﬁnitely better” would be an exaggeration). Still, when comparing the SQL with the SPARQL implementation of BSBM-BI on Virtuoso7, we still see an “RDF tax” of a factor 2.5. This performance diﬀerence comes from the schema-last approach of RDF model: SPARQL plans need more joins than SQL and often the query plan is not optimal. To address this issue, CWI performed research in the line of the second bullet point above: the goal would be to give the RDF store more insight in the actual structure of RDF, such that SPARQL query plans need less self-joins and query optimization becomes more reliable. This goal should be achieved without losing the schema-last feature of RDF: there should be no need for an explicit user-articulated schema.
The idea of recovering automatically an “emergent” schema of actual RDF data is that RDF data in practice is quite regular and structured. This was observed in the proposal to make SPARQL query optimization more reliable by recognizing “characteristics sets” . A characteristic set is a combination of properties that typically co-occur with the same subject. The work in  found that this number is limited to a few thousand on even the most complex LOD datasets (like DBpedia), and the CWI research on emergent schema detection that started in the LOD2 project  aims to further reduce the amount of characteristic sets to the point that characteristics sets become tables in a table of limited size (less than 100), i.e. further reducing the size. To that, the additional challenge of ﬁnding human-understandable labels (names) for tables, columns, and relationships was added. The goal of emergent schemata thus became twofold: (1) to inform SPARQL systems of the schema of the data such that they need less self-joins and query optimization becomes more reliable, (2) to oﬀer a fully relational view of an RDF dataset to end-users so that existing SQL applications can be leveraged on any RDF dataset. The latter goal could help to increase RDF adoption and further help make relational systems more semantic (because all tables, columns and relationships in an emergent schema are identiﬁed by URIs).
In all, this chapter shows tangible progress in reducing the “RDF tax” and a promising avenue to further reduce the performance gap between SPARQL and SQL systems and even some hope of making them converge.