Processing Data

The core challenge in this use case was to develop the (legal) data ecosystem using the tools from the LOD2 Stack. Since the whole Semantic Web paradigm was new to WKD, we chose an iterative approach to learn and to streamline the workflows and processes that come with it [4].

To focus on the highlights, we do not report on this iterative process here, but rather on the results of each task. First, we built a knowledge framework based on the information we already stored in the XML documents. This led to an initial version of the knowledge graph describing our domain. We then applied LOD2 Stack tools [1] to this graph in order to enrich the information with data extraction technologies, data curation tools for cleansing, and linking tools for ingesting knowledge from external sources. Finally, we added a visualization layer (i) to support the editorial team in metadata management and (ii) to help our customers with visualizations supporting data analytics capabilities (see also [8]).

Transformation from XML to RDF

One major goal of the “Media and publishing” use case was to develop a stable transformation process for the WKD XML data. The mapping schema from XML to RDF was based on the provided WKD DTD, which guided the choice of ontologies used to express the WKD data. The transformation schema was developed in the following steps:

• Define vocabularies used for the WKD RDF schema (see Table 1)

• Define the URI pattern used for the WKD RDF schema

• Mapping definition

• Develop the XSLT style sheet based on the vocabularies and the URI patterns

In addition, a WKD schema description (schema.wolterskluwer.de) was developed that extends the used vocabularies with specific classes and properties. For the transformation of the WKD XML data to RDF, various URI patterns had to be developed to cover the various types of data/information created:

• Resources (the transformed documents and document parts themselves), e.g. the labor protection law: resource.wolterskluwer.de/legislation/bd_arbschg

• Vocabularies (used to harmonize parts of the metadata, e.g. taxonomies, authors, organizations), e.g. the labor law thesaurus: vocabulary.wolterskluwer.de/kwd/Arbeitsschutz

• WKD schema vocabulary (specific properties defined for the mapping schema), e.g. keyword: schema.wolterskluwer.de/Keyword
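The three URI patterns above can be sketched as small builder functions. The base hosts and the two example slugs (legislation/bd_arbschg, kwd/Arbeitsschutz, Keyword) are taken from the text; any slug rules beyond these examples, and the http:// scheme, are illustrative assumptions.

```python
# Sketch of the three URI patterns described above. Base hosts come
# from the text; slug conventions beyond the given examples are assumed.

RESOURCE_BASE = "http://resource.wolterskluwer.de"
VOCAB_BASE = "http://vocabulary.wolterskluwer.de"
SCHEMA_BASE = "http://schema.wolterskluwer.de"


def resource_uri(doc_type: str, doc_id: str) -> str:
    """URI for a transformed document or document part."""
    return f"{RESOURCE_BASE}/{doc_type}/{doc_id}"


def vocabulary_uri(scheme: str, concept: str) -> str:
    """URI for a concept in a controlled vocabulary (e.g. a thesaurus)."""
    return f"{VOCAB_BASE}/{scheme}/{concept}"


def schema_uri(term: str) -> str:
    """URI for a class or property in the WKD schema vocabulary."""
    return f"{SCHEMA_BASE}/{term}"


# Examples from the text:
print(resource_uri("legislation", "bd_arbschg"))
print(vocabulary_uri("kwd", "Arbeitsschutz"))
print(schema_uri("Keyword"))
```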

The mappings between the WKD DTD and the WKD schema were implemented as XSLT functions. The WKD XML data was then transformed into RDF triples by applying the functions associated with the relevant XML elements. Note that the output of the transformation used the RDF/XML serialization.
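The element-to-property mapping idea can be illustrated with a minimal sketch. The production pipeline used XSLT 2.0 functions emitting RDF/XML; this Python version emits N-Triples for readability, and the element names and predicate choices below are invented for the example, not taken from the WKD DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical element-to-predicate mapping; the real mapping was derived
# from the WKD DTD and implemented as XSLT 2.0 functions.
MAPPING = {
    "titel": "http://purl.org/dc/terms/title",
    "schlagwort": "http://schema.wolterskluwer.de/Keyword",
}


def xml_to_ntriples(xml_text, subject_uri):
    """Walk the XML tree and emit one N-Triples line per mapped element
    (the production output used the RDF/XML serialization instead)."""
    root = ET.fromstring(xml_text)
    triples = []
    for elem in root.iter():
        predicate = MAPPING.get(elem.tag)
        if predicate and elem.text:
            obj = elem.text.strip().replace('"', '\\"')
            triples.append(f'<{subject_uri}> <{predicate}> "{obj}" .')
    return triples


doc = (
    "<dokument><titel>Arbeitsschutzgesetz</titel>"
    "<schlagwort>Arbeitsschutz</schlagwort></dokument>"
)
for line in xml_to_ntriples(doc, "http://resource.wolterskluwer.de/legislation/bd_arbschg"):
    print(line)
```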

Table 1. Schemas that have been evaluated and are used in the WKD RDF schema (applied vocabularies)

Vocabulary          Prefix    Namespace
BIBO                bibo      purl.org/ontology/bibo/
Dublin Core         dc        purl.org/dc/elements/1.1/
Dublin Core terms   dcterms   purl.org/dc/terms/
FOAF                foaf      xmlns.com/foaf/0.1/
MetaLex             metalex   metalex.eu/metalex/2008-05-02#
OWL                 owl       w3.org/2002/07/owl#
RDF                 rdf       w3.org/1999/02/22-rdf-syntax-ns#
RDF Schema          rdfs      w3.org/2000/01/rdf-schema#
SKOS                skos      w3.org/2004/02/skos/core#
XHTML vocabulary    xhtml     w3.org/1999/xhtml/vocab#
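The prefix bindings of Table 1 can be kept in a single map and used to expand compact names (CURIEs) during transformation or serialization. The namespaces below restore the http:// schemes and standard hosts (www.w3.org etc.) that the table omits.

```python
# Prefix-to-namespace bindings from Table 1, with http:// schemes restored.
PREFIXES = {
    "bibo": "http://purl.org/ontology/bibo/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "metalex": "http://www.metalex.eu/metalex/2008-05-02#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xhtml": "http://www.w3.org/1999/xhtml/vocab#",
}


def expand(curie):
    """Expand a CURIE such as 'dcterms:title' into a full URI."""
    prefix, _, local = curie.partition(":")
    return PREFIXES[prefix] + local


print(expand("dcterms:title"))
```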

Fig. 1. RDF graph for a document URI

The transformation resulted in a number of triples, stored in a named graph per document (see Fig. 1). In this way, a provenance relationship between the triples, the XSLT template, and the original XML document was created. If either the XSLT template or the XML document was updated, the set of triples to be updated was uniquely identified by the graph name.
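The one-graph-per-document layout can be sketched as follows. The graph URI convention below is an illustrative assumption; the point is only that a re-run of the transformation replaces exactly the triples derived from the changed document.

```python
# Minimal sketch of the named-graph layout: one graph per source document,
# so an update replaces exactly the triples derived from that document.
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


class QuadStore:
    def __init__(self) -> None:
        self.graphs: Dict[str, List[Triple]] = {}

    def replace_graph(self, graph_uri: str, triples: List[Triple]) -> None:
        """Drop and rewrite one graph when its XML source (or the XSLT
        template) changes; the graph name identifies the affected set."""
        self.graphs[graph_uri] = list(triples)


store = QuadStore()
g = "http://resource.wolterskluwer.de/graph/bd_arbschg"  # assumed naming
store.replace_graph(g, [("s", "p", "old")])
store.replace_graph(g, [("s", "p", "new")])  # re-run replaces the whole graph
```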

Valiant[1], a command-line processing tool written in Java that supports XSLT 2.0, was developed for the transformation process within the work package. As a first step, a Virtuoso Sponger cartridge was explored, since Virtuoso[2] was part of the LOD2 Stack, but this track was abandoned due to its lack of support for XSLT 2.0. PoolParty[3] was used for the management of the taxonomies and vocabularies. Additionally, Venrich was developed to support the batch alignment of the document metadata with the vocabularies and taxonomies. All data was stored in Virtuoso.
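The kind of batch alignment Venrich supported can be sketched as matching free-text document keywords against controlled vocabulary labels. This is a simplified illustration, not Venrich's actual logic; the lookup table and normalization rule are assumptions, with the concept URI taken from the thesaurus example in the text.

```python
# Illustrative keyword-to-concept alignment (not Venrich's actual logic).
# The concept URI comes from the labor law thesaurus example in the text.
SKOS_LABELS = {
    "arbeitsschutz": "http://vocabulary.wolterskluwer.de/kwd/Arbeitsschutz",
}


def align_keywords(keywords):
    """Map each keyword to a concept URI by normalized label match;
    unmatched keywords are returned for editorial review."""
    matched, unmatched = {}, []
    for kw in keywords:
        uri = SKOS_LABELS.get(kw.strip().lower())
        if uri:
            matched[kw] = uri
        else:
            unmatched.append(kw)
    return matched, unmatched


matched, unmatched = align_keywords(["Arbeitsschutz", "Kurzarbeit"])
print(matched)
print(unmatched)
```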

The initial transformation resulted in:

• 785,959 documents transformed to RDF graphs with a total of 46,651,884 triples

• several taxonomies and vocabularies created from the data

Additionally, two of the developed vocabularies have been released by WKD as linked open data under an open license[4][5].

  • [1] https://github.com/bertvannuffelen/valiant, accessed June 10, 2014
  • [2] virtuoso.openlinksw.com/, accessed June 10, 2014
  • [3] poolparty.biz/, accessed June 10, 2014
  • [4] See vocabulary.wolterskluwer.de/, accessed June 10, 2014
  • [5] See further information about this in Sect. 3, Licensing Semantic Metadata, and Deliverable