A Customized Linked Data Stack for Statistics

Fig. 2. Distribution of the LOD2 stack components w.r.t. Linked Data Publishing cycle

The work on the LOD2 Statistical Workbench was motivated by the need to support the process of publishing statistical data in the RDF format using common vocabularies such as the RDF Data Cube[1]. The aim here was to provide support for performing different operations such as

efficient transformation/conversion of traditional data stores (e.g. CSV, XML, relational databases) into linked, machine readable formats;

building and querying triple stores containing RDF Data Cubes;

validating RDF Data Cubes;

interlinking and adding meaning to data;

visualization and exploration of multi-dimensional RDF Data Cubes;

publishing statistical data using a LOD publication strategy and respective metadata about the RDF data cube within a selected portal (i.e. a CKAN instance).

The potential benefits of converting statistical data into Linked Data format were studied through several scenarios for the National Statistical Office use case (cf. Table 2) [1].

Application Architecture and Scenarios

Table 2. Potential goals and benefits of converting statistical data into Linked Data.

Goal: Metadata management

Code lists creating and maintaining

Benefits/expected added value: Standardization on the metadata level (a) will allow harmonization of specific concepts and terminology, (b) will improve interoperability and (c) will support multilinguality in statistical information systems across Europe

Goal: Export

Export to different formats

Benefits/expected added value: Data exchange with other semantic tools, as well as with commonly used spreadsheet tools, e.g. Microsoft Excel

Goal: RDF Data Cube Extraction, Validation and Initial Exploration

Goal: RDF Data Cube Quality Assessment (validation and analysis of integrity constraints)

Goal: RDF Data Cube Transformation, Exploratory Analysis and Visualization

(a) Merging RDF Data Cubes

(b) Slicing RDF Data Cubes

(c) Visualization of RDF Data Cubes

Benefits/expected added value:

Data fusion, i.e. creation of a single dataset and different graphical charts that support the exploratory analysis (e.g. indicator comparison)

Facilitate creation of intersections in multidimensional data

Efficient analysis and search for trends in statistical data

Goal: Interlinking

(a) Code lists Interlinking

(b) CSV Data Extraction and Reconciliation with DBpedia

Benefits/expected added value:

Assigning meaning, improved interoperability of data with similar governmental agencies

Assigning meaning

Goal: Publishing

Publishing to CKAN

Benefits/expected added value: Increased transparency, improved accessibility of statistical data

The LOD2 Statistical Workbench[2] implements the Linked Data application architecture sketched in Sect. 2. The workbench introduces a number of new components dedicated to the statistical domain, such as the RDF Data Cube Validation tool, the RDF Data Cube Slicing tool and the RDF Data Cube Merging tool. The workbench has also been augmented with extensions to explore other aspects: the LOD2 authentication component, the LOD2 provenance component and the CKAN Publisher.

In order to support the end user, a new graphical user interface has been created in which the LOD2 components are organized more intuitively for the statistical domain. They are grouped into five topics: Manage Graph, Find more Data Online, Edit & Transform, Enrich Datacube, and Present & Publish.

Import features. The LOD2 Statistical Workbench is a framework for managing Linked Data stored in the RDF Data Cube format. Because statistical data is often provided in tabular format, it supports importing data from CSV. The CSV2RDF component allows end users to transform tabular data from a CSV file into a multidimensional RDF Data Cube. Alternatively, LODRefine can be used. LODRefine can import all kinds of structured formats, including CSV, ODS and XLS(X), and transform them into RDF graphs based on arbitrary vocabularies.
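The CSV-to-cube idea can be illustrated with a minimal sketch. It is not the CSV2RDF implementation: the `http://example.org/stat/` namespace, the column names and the sample values are made up for illustration, and literals are emitted without escaping. Each CSV row becomes one `qb:Observation` attached to a dataset, with one triple per dimension and one for the measure, written as N-Triples lines.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"
EX = "http://example.org/stat/"  # hypothetical namespace, not from the source


def csv_to_observations(csv_text, dimensions, measure, dataset="dataset1"):
    """Return N-Triples lines describing one qb:Observation per CSV row.

    Values are inserted as-is; a real converter would escape literals
    and map dimension values to code list URIs.
    """
    lines = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=1):
        obs = f"<{EX}obs/{i}>"
        lines.append(f"{obs} <{RDF_TYPE}> <{QB}Observation> .")
        lines.append(f"{obs} <{QB}dataSet> <{EX}{dataset}> .")
        for dim in dimensions:
            lines.append(f'{obs} <{EX}dimension/{dim}> "{row[dim]}" .')
        lines.append(
            f'{obs} <{EX}measure/{measure}> "{row[measure]}"^^<{XSD_DECIMAL}> .'
        )
    return lines


# Made-up sample data: two rows, two dimensions, one measure.
SAMPLE_CSV = "area,year,population\nA,2011,100\nA,2012,110\n"
sample_triples = csv_to_observations(SAMPLE_CSV, ["area", "year"], "population")
```

Every row yields five lines here (type, dataset, two dimensions, one measure), so the cube structure stays visible in the output.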

Import from XML files is also supported. The main international standard for exchanging statistical data is SDMX[3]. Users can pass XML data as input to the XSLT processor and transform it into RDF. The workbench provides ready-to-use XSLT scripts to deal with SDMX-formatted data.

Additionally, using the Find more Data Online submenu, the user is able to find and import more data into the local RDF store using the respective tool of the Statistical Workbench.

Semantic integration and storage. Linked Data applications are based on server platforms that enable RDF triple storage, semantic data integration and management, and semantic interoperability based on W3C standards (XML, RDF, OWL, SOA, WSDL, etc.). The Virtuoso Universal Server is used for this purpose in the LOD2 Statistical Workbench.

RDF Data Cube transformation features. Specialized components have been developed to support the most common operations for manipulating statistical data such as merging datasets, creating slices and data subsetting (Edit & Transform submenu). As each dataset defines components (e.g. dimensions used to describe the observations), the merging algorithm checks the adequacy of the input datasets for merging and compiles a new RDF Data Cube to be used for further exploration and analysis. Additionally, the slicing component can be used to group subsets of observations where one or more dimensions are fixed. This way, slices are given an identity (URI) so that they can be annotated or externally referenced, verbosity of the data set can be reduced because fixed dimensions need only be stated once, and consuming applications can be guided in how to present the data.
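The merge check and the slicing operation described above can be sketched on plain Python dictionaries. This is an illustration under stated assumptions, not the workbench's algorithm: the source does not spell out the adequacy criteria, so the sketch assumes two cubes are mergeable when they declare the same dimensions and the same measure. The cube layout and the sample values are made up.

```python
def can_merge(cube_a, cube_b):
    """Assumed adequacy check: same set of dimensions and same measure."""
    return (
        set(cube_a["dimensions"]) == set(cube_b["dimensions"])
        and cube_a["measure"] == cube_b["measure"]
    )


def merge(cube_a, cube_b):
    """Compile a new cube holding the observations of both inputs."""
    if not can_merge(cube_a, cube_b):
        raise ValueError("cubes declare different components")
    return {
        "dimensions": list(cube_a["dimensions"]),
        "measure": cube_a["measure"],
        "observations": cube_a["observations"] + cube_b["observations"],
    }


def make_slice(cube, fixed):
    """Group the observations that match the fixed dimension values.

    The fixed values are stated once on the slice and dropped from
    each member observation, which reduces verbosity.
    """
    free = [d for d in cube["dimensions"] if d not in fixed]
    members = [
        {k: v for k, v in obs.items() if k not in fixed}
        for obs in cube["observations"]
        if all(obs[d] == v for d, v in fixed.items())
    ]
    return {"fixed": dict(fixed), "free_dimensions": free, "observations": members}


# Made-up cubes: one per year, same dimensions and measure.
cube_2011 = {
    "dimensions": ["area", "year"],
    "measure": "population",
    "observations": [
        {"area": "A", "year": "2011", "population": 100},
        {"area": "B", "year": "2011", "population": 200},
    ],
}
cube_2012 = {
    "dimensions": ["area", "year"],
    "measure": "population",
    "observations": [
        {"area": "A", "year": "2012", "population": 110},
        {"area": "B", "year": "2012", "population": 210},
    ],
}
merged = merge(cube_2011, cube_2012)
slice_area_a = make_slice(merged, {"area": "A"})
```

In RDF terms, the dictionary returned by `make_slice` corresponds to a slice resource with its own URI; here it is only a grouping structure.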

RDF Data Cube validation. The RDF Data Cube Validation tool [8] supports the identification of parts of an RDF Data Cube that may not be well-formed. Its analysis process consists mostly of integrity constraint rules represented as SPARQL queries, as defined in the RDF Data Cube standard. Validation is applicable at several steps in the Linked Data publishing process, e.g. on import/extraction/transformation from different sources, or after the fusion and creation of new RDF Data Cubes.
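To make the idea of an integrity constraint concrete, the sketch below checks one rule: every observation must belong to exactly one dataset. The `ASK` query is written in the style of the constraints in the RDF Data Cube recommendation and is not its normative text. The Python function re-implements the same check over an in-memory list of triples so that the example runs without a triple store; the prefixed names and sample triples are made up.

```python
# An ASK query in the style of the standard's integrity constraints
# (illustrative wording; consult the recommendation for the normative queries).
UNIQUE_DATASET_ASK = """
PREFIX qb: <http://purl.org/linked-data/cube#>
ASK {
  { ?obs a qb:Observation .
    FILTER NOT EXISTS { ?obs qb:dataSet ?ds . } }
  UNION
  { ?obs a qb:Observation ; qb:dataSet ?ds1, ?ds2 .
    FILTER (?ds1 != ?ds2) }
}
"""


def observations_violating_unique_dataset(triples):
    """Return observations linked to zero or to more than one dataset.

    `triples` is an iterable of (subject, predicate, object) strings
    that use prefixed names such as "qb:Observation".
    """
    triples = list(triples)
    observations = {
        s for s, p, o in triples if p == "rdf:type" and o == "qb:Observation"
    }
    datasets = {}
    for s, p, o in triples:
        if p == "qb:dataSet":
            datasets.setdefault(s, set()).add(o)
    return sorted(obs for obs in observations if len(datasets.get(obs, ())) != 1)


# Made-up sample: obs1 is well-formed, obs2 has no dataset, obs3 has two.
sample_graph = [
    ("ex:obs1", "rdf:type", "qb:Observation"),
    ("ex:obs1", "qb:dataSet", "ex:ds1"),
    ("ex:obs2", "rdf:type", "qb:Observation"),
    ("ex:obs3", "rdf:type", "qb:Observation"),
    ("ex:obs3", "qb:dataSet", "ex:ds1"),
    ("ex:obs3", "qb:dataSet", "ex:ds2"),
]
violations = observations_violating_unique_dataset(sample_graph)
```

An `ASK` query only reports whether a violation exists; returning the offending observations, as the function does, is closer to what a user needs in order to repair the cube.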

Authoring, querying and visualization. The OntoWiki authoring tool facilitates the authoring of rich semantic knowledge bases by leveraging Semantic Wiki technology, the WYSIWYM paradigm (What You See Is What You Mean [3]) and distributed social, semantic collaboration and networking techniques. CubeViz, an extension of OntoWiki, is a faceted browser and visualization tool for statistical RDF data. It facilitates the discovery and exploration of RDF Data Cubes while hiding their complexity from users. In addition to using the browsing and authoring functionality of OntoWiki, advanced users are able to query the data directly with SPARQL using one of the following editors: the OntoWiki query editor, Sindice's SparQLed component and the OpenLink Virtuoso SPARQL editor.
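As an illustration of direct querying, the sketch below builds a SPARQL query that lists the properties and values of the observations in one dataset, and encodes it as a GET request URL following the SPARQL protocol's `query` parameter. The endpoint address and the dataset URI are assumptions for the example, not values taken from the workbench.

```python
from urllib.parse import urlencode

# Assumed endpoint of a local Virtuoso installation; adjust to the actual setup.
ENDPOINT = "http://localhost:8890/sparql"


def observations_query(dataset_uri, limit=10):
    """Build a SELECT query over the observations of one dataset."""
    return (
        "PREFIX qb: <http://purl.org/linked-data/cube#>\n"
        "SELECT ?obs ?property ?value WHERE {\n"
        f"  ?obs a qb:Observation ; qb:dataSet <{dataset_uri}> ; ?property ?value .\n"
        f"}} LIMIT {int(limit)}"
    )


def sparql_get_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical dataset URI, used only for illustration.
sample_query = observations_query("http://example.org/stat/dataset1", limit=5)
sample_url = sparql_get_url(ENDPOINT, sample_query)
```

The same query text can be pasted into any of the SPARQL editors listed above; the URL form is useful when a script needs to fetch the results.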

Enrichment and interlinking. Linked Data publishing is not just about putting data on the web, but also about creating links, so that a person or machine can explore the web of data. Therefore, the enrichment and interlinking features are very important as a pre-processing step in the integration and analysis of statistical data from multiple sources. LOD2 tools such as SILK and Limes facilitate mapping between knowledge bases, while LOD Open Refine can be used to enrich the data with descriptions from DBpedia or to reconcile it with other information in the LOD cloud. PoolParty allows users to create their own high-quality code lists and link the concepts therein to external sources as well. Once the code lists have been established, they can be reused as dimension values in Data Cubes or linked to Cubes that have been created separately.

Export and Linked Data publishing. The LOD2 Statistical Workbench export features are reachable via the Manage Graph and Present & Publish submenus. The Manage Graph option allows a graph to be exported with all its content in RDF/XML, RDF/JSON, Turtle or Notation 3. CubeViz supports subsetting of the data and extraction of a portion that is interesting for further analysis in CSV and RDF/XML format. The CKAN Publisher component aims at automating the upload and registration of new data with existing CKAN instances.
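Registering a dataset with a CKAN instance can be sketched with CKAN's action API. The example prepares, but does not send, a `package_create` request. It is not the CKAN Publisher's implementation: the portal address, API key, dataset name and resource URL are placeholders chosen for illustration.

```python
import json
from urllib.request import Request


def build_ckan_package_request(ckan_url, api_key, name, title, resource_url, fmt="RDF"):
    """Prepare a POST request for CKAN's package_create action.

    The request is only constructed here; pass it to
    urllib.request.urlopen to actually register the dataset.
    """
    payload = {
        "name": name,  # URL-friendly dataset identifier
        "title": title,
        "resources": [{"url": resource_url, "format": fmt}],
    }
    return Request(
        ckan_url.rstrip("/") + "/api/3/action/package_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


# Placeholder values, not real endpoints or credentials.
sample_request = build_ckan_package_request(
    "https://ckan.example.org/",
    "hypothetical-api-key",
    "population-cube",
    "Population RDF Data Cube",
    "https://example.org/stat/dataset1.rdf",
)
```

Keeping request construction separate from sending makes the payload easy to inspect before anything is published.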

The use of the LOD2 Statistical Workbench for different data management operations is illustrated with online tutorials[4] for the scenarios summarized in Table 2.

Fig. 3. RDF Data Cube graphical representation

  • [1] w3.org/TR/vocab-data-cube
  • [2] demo.lod2.eu/lod2statworkbench
  • [3] sdmx.org
  • [4] wiki.lod2.eu/display/LOD2DOC/LOD2+Statistical+Workbench