Pebbles, a Metadata Editor

The user is welcomed in Pebbles with a dashboard overview showing the most recently updated documents and the documents with the most outstanding issues.

This dashboard view helps the user focus on the most important items, and it also reflects the user's current state of work. After the user selects a document, Pebbles shows the document view, in which the textual content of the document is shown together with its metadata. It is important that the metadata editor sees the document content in order to be able to validate the correctness of the associated metadata. In this view the metadata can be updated, and new metadata properties can be added according to the WKD schema. New properties can be added by hand or can result from the suggestions that are associated with the document (Fig. 3).

A suggestion is an association of some value with the document that has been added via an external process. This external process is controlled by the enrichment manager, who uses a linking environment (e.g. Silk) or an annotation engine (e.g. DBpedia Spotlight) to create these associations. At this point the enrichment manager has two options: either she adds the resulting associations directly to the metadata store, or the associations are reviewed through the quality assurance process. The quality assurance process is performed by the metadata editor, who accepts or rejects suggestions in the Pebbles environment. Since the metadata editor owns the document's metadata, she is the right person to make that decision. When a suggested concept is accepted, the default metadata property associated with it can also be changed. This creates flexibility in the process: the enrichment manager can suggest new associations without deciding on the property upfront, which is handy when an annotation engine is used. Such annotation engines often return related concepts belonging to a wide variety of domains (persons, countries, laws, …). For a smooth process it is nevertheless advisable to make the properties as concrete as possible.
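The review step described above can be sketched as a small data model. This is only an illustration, not the Pebbles implementation: the class names, the `review` function, and the example identifiers and properties are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Suggestion:
    """An association proposed by an external enrichment process (hypothetical model)."""
    document: str          # identifier of the document
    value: str             # suggested concept or value
    default_property: str  # property proposed by the enrichment manager


@dataclass
class MetadataStore:
    """Stand-in for the metadata store: a set of (document, property, value) triples."""
    triples: set = field(default_factory=set)

    def add(self, document, prop, value):
        self.triples.add((document, prop, value))


def review(store, suggestion, accept, property_override=None):
    """Apply the metadata editor's decision; return True if metadata was added."""
    if not accept:
        return False  # rejected suggestions never reach the store
    # The editor may replace the default property with a more concrete one.
    prop = property_override or suggestion.default_property
    store.add(suggestion.document, prop, suggestion.value)
    return True


# Usage: accept one suggestion with a more specific property, reject another.
store = MetadataStore()
country = Suggestion("doc:42", "dbpedia:Germany", "dcterms:subject")
person = Suggestion("doc:42", "dbpedia:Some_Person", "dcterms:subject")
review(store, country, accept=True, property_override="dcterms:spatial")
review(store, person, accept=False)
```

Keeping the property as an overridable default mirrors the flexibility mentioned in the text: the enrichment process can propose a generic property, and the editor refines it at acceptance time.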

Fig. 3. Pebbles document metadata view

The collection of documents provided by Wolters Kluwer forms an interconnected network: journal articles refer to laws, court cases, and so on. In a document-centric environment, these links are typically stored inside the document container. Given a document, it is easy to follow the outgoing references, whereas the reverse search (i.e. finding all documents that refer to the current document) is much harder. Applying this search to data in an RDF store simply requires inverting patterns in the SPARQL query. Because triple patterns in the RDF graph model can be matched in either direction, any desired exploration path can be created quickly. Exploration paths are often quite generic: for instance, showing the list of documents that belong to a particular document category is very similar to showing the list of documents for an author. By configuring the tree navigation widget with the values of a taxonomy, Pebbles offers a faceted approach to exploring documents. The usability of the faceted browsing is determined by the quality of the taxonomies being used and the quality of the metadata with which the documents are tagged.
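To illustrate what inverting a pattern means, the following minimal sketch uses a toy in-memory list of triples instead of a real RDF store. The identifiers and the choice of `dcterms:references` as the linking property are assumptions made for the example; the source does not state which property the WKD schema uses.

```python
# Toy triple store: (subject, predicate, object). All identifiers are made up.
TRIPLES = [
    ("ex:article1", "dcterms:references", "ex:law7"),
    ("ex:article2", "dcterms:references", "ex:law7"),
    ("ex:article1", "dcterms:references", "ex:case3"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None plays the role of a variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def outgoing(doc):
    """Documents that `doc` refers to: the subject is fixed, the object is the variable."""
    return sorted(t[2] for t in match(TRIPLES, s=doc, p="dcterms:references"))


def incoming(doc):
    """Documents that refer to `doc`: same pattern with subject and object swapped."""
    return sorted(t[0] for t in match(TRIPLES, p="dcterms:references", o=doc))


# The equivalent SPARQL queries differ only in the position of the variable.
PREFIX = "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
FORWARD_QUERY = PREFIX + "SELECT ?target WHERE { <%s> dcterms:references ?target }"
REVERSE_QUERY = PREFIX + "SELECT ?source WHERE { ?source dcterms:references <%s> }"

forward = outgoing("ex:article1")
reverse = incoming("ex:law7")
```

Both directions use the same `match` function, so the reverse search costs no more effort to express than the forward one.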

Issues Identified in Existing Metadata, Based on Developed Vocabularies

During the transformation of WKD data to RDF, several metadata properties were defined as having skos:Concept as their range. The rationale was that this data could be organized and managed in taxonomies or thesauri in a next step. In a second iteration, after all the data had been processed, missing concepts were detected and added to the vocabularies.

During the review of the generated data, the following issues were found in the existing metadata transformed to RDF, in addition to missing mappings to taxonomies:

• Transformation errors (e.g. concepts generated with empty “” labels): To avoid this, the schema transformation has to be adapted to ignore empty metadata entries.

• Wrong metadata (e.g. job titles or page numbers instead of organization names in the organizations taxonomy): This needs to be cleaned up manually. Rules can be provided to detect this kind of data during transformation, and the same rules could be applied to exclude this data from display in the metadata editor (Pebbles). Since this data can also be managed (changed, edited, deleted) in Pebbles, no additional effort was put into rule-based cleaning.

• Same concepts with different labels: We decided that metadata representing the same concept (e.g. different spellings of a person's name; see Table 2 for the different causes) could not be mapped automatically during schema transformation, because no quality assurance could be provided that way. Instead, an interface for disambiguating concepts based on label similarity was developed to provide a semi-automatic way of cleaning up those concepts.
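As an illustration of label-similarity matching, the sketch below proposes candidate pairs of labels for a human reviewer. It is not the disambiguation interface described above: the normalization steps, the use of Python's `difflib.SequenceMatcher`, and the threshold of 0.85 are assumptions chosen for the example.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def normalize(label):
    """Fold case, strip combining diacritics, and treat hyphens and dots as spaces."""
    decomposed = unicodedata.normalize("NFKD", label.casefold())
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_marks.replace("-", " ").replace(".", " ").split())


def candidate_pairs(labels, threshold=0.85):
    """Return (label_a, label_b, score) for pairs whose normalized forms are similar."""
    pairs = []
    for a, b in combinations(labels, 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    # Most similar pairs first, so the reviewer sees the likeliest duplicates on top.
    return sorted(pairs, key=lambda pair: -pair[2])


# Usage with names taken from Table 2.
labels = ["Hans-Dieter Becker", "Hans Dieter Becker", "Bernd Schneider"]
pairs = candidate_pairs(labels)
```

A purely string-based score like this can surface punctuation variants and typos, but it will not connect names that changed entirely (for example after marriage), which is one reason the final decision stays with a human.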

Notification Service

We developed a scenario in which several vocabularies were developed and partly published as Linked Open Data (labor law thesaurus and courts thesaurus) with PoolParty. Furthermore, Pebbles was developed as an environment for managing RDF metadata for the WKD documents. To stay up to date with the latest changes in these datasets, the resource subscription and notification service (rsine[1], published under an open-source license on GitHub) was developed. It allows dataset curators to subscribe to the specific changes they are interested in and to receive a notification as soon as such changes occur.

Table 2. Possible issues for different author names

| Confusions | First version | Second version | Third version |
|---|---|---|---|
| Family name change after marriage | Gritt Diercks | Gritt Dierks-Oppler | |
| Family name change after marriage | Andrea Banse | Andrea Schnellbacher geb. Banse | Andrea Schnellbacher |
| Second forename | Bernd Schneider | Bernd Peter Schneider | |
| Initials | Detlev Müllerhoff | D.Müllerhoff | |
| Typos | Cornelius Prittwitz | Cornelins Prittwitz | |
| Punctuation | Hans-Dieter Becker | Hans Dieter Becker | |
| Different writings | Detlev Burhoff | Detlef Burhoff | |
| Different characters | Østerborg | Österborg | Osterborg |

Rsine is a service that tracks RDF triple changes in a triple store and creates a history of changes in a standardized format by using the Changeset ontology[2]. Users who want to receive notifications express the kind of changes they are interested in as SPARQL queries. These queries are sent to rsine, encapsulated in a subscription document that can also contain further information, such as how the notification message should be formatted. Notifications were sent by e-mail.
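The general mechanism (record each change as a change set, then check it against registered subscriptions) can be sketched as follows. This is a simplified illustration and not rsine's actual API or subscription format: all class and function names are hypothetical, a Python predicate stands in for the SPARQL query, and a list stands in for the mail notifier. The `ChangeSet` fields only loosely mirror the additions, removals, and creation date of the Changeset vocabulary.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class ChangeSet:
    """One recorded change: triples added and removed, plus a timestamp."""
    additions: List[Triple]
    removals: List[Triple]
    created: datetime


@dataclass
class Subscription:
    description: str
    matches: Callable[[ChangeSet], bool]  # stands in for the subscriber's SPARQL query
    notify: Callable[[str], None]         # stands in for sending an e-mail


class ChangeTracker:
    """Keeps a history of change sets and notifies matching subscriptions."""

    def __init__(self):
        self.history: List[ChangeSet] = []
        self.subscriptions: List[Subscription] = []

    def subscribe(self, subscription):
        self.subscriptions.append(subscription)

    def record(self, additions=(), removals=()):
        change = ChangeSet(list(additions), list(removals), datetime.now(timezone.utc))
        self.history.append(change)
        for sub in self.subscriptions:
            if sub.matches(change):
                sub.notify(f"{sub.description}: +{len(change.additions)} / -{len(change.removals)} triples")
        return change


def pref_label_changed(change):
    """True if any added or removed triple uses skos:prefLabel."""
    return any(p == "skos:prefLabel" for _, p, _ in change.additions + change.removals)


# Usage: a curator subscribes to preferred-label changes in a vocabulary.
inbox = []
tracker = ChangeTracker()
tracker.subscribe(Subscription("Preferred label changed", pref_label_changed, inbox.append))
tracker.record(additions=[("ex:c1", "skos:prefLabel", "Labour law")],
               removals=[("ex:c1", "skos:prefLabel", "Labor law")])
tracker.record(additions=[("ex:c1", "skos:note", "reviewed")])
```

Only the first change touches a preferred label, so only that change produces a notification, while both changes are kept in the history.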

The implemented scenarios focused on the following three main use cases:

• vocabulary management

• vocabulary quality

• metadata management

For each main use case, several scenarios[3] were implemented.

  • [1] https://github.com/rsine/rsine, accessed June 10, 2014
  • [2] vocab.org/changeset/schema.html, accessed June 10, 2014
  • [3] Scenarios are listed in Deliverable 5.3.2
 