The Silk Link Discovery Framework

A central problem of the Web of Linked Data, as well as of data integration in general, is to identify entities in different data sources that describe the same real-world object. While the amount of Linked Open Data has grown significantly in recent years, most data sources are still not sufficiently interlinked. Of the more than 31 billion RDF statements published as Linked Data, fewer than 500 million represent RDF links between data sources; analysis confirms that the LOD cloud is a weakly connected graph in which most publishers link to only one other data source [2].

This section presents the Silk Link Discovery Framework, which generates RDF links between data items based on user-provided or automatically learned linkage rules. Silk can be used by data providers to generate RDF links pointing at existing Web datasets, and then publish them together with the primary datasets. Furthermore, applications that consume Linked Data can use Silk as an identity resolution component to augment the data with additional RDF links that have not been discovered and/or published.

In Silk, linkage rules are expressed in a declarative language and define the conditions that data items must satisfy in order to be interlinked. For instance, a linkage rule defines which properties should be compared (e.g. movieTitle in one dataset and label in another), which similarity measures should be used for the comparison, and how the resulting scores are to be combined.

Writing good linkage rules by hand is a non-trivial problem that requires considerable effort and expertise. To address this, Silk implements the ActiveGenLink algorithm, which combines genetic programming and active learning techniques to generate high-quality, expressive linkage rules interactively while minimizing the involvement of a human expert. In this section, we briefly introduce the tool and the underlying algorithms. For further details, readers are referred to [8, 9].

Silk: Functionality and Main Concepts

The Silk Link Discovery Framework can be downloaded from its official homepage[1], which is also the source for the documentation, examples and updates.

Fig. 2. Silk Workbench: linkage rule editor

Silk is an open source tool whose source code and detailed developer documentation are available online [2]. It can be used through the Silk Workbench graphical user interface or from the command line.

The Silk Workbench, developed in the course of the LOD2 project, aims at guiding the user through the process of interlinking different data sources. It is a web application offering the following functionality:

Possibility to manage different data sources and linking tasks (with RDF dump files as well as local and remote SPARQL endpoints as input).

Graphical editor to create and edit linkage rules (see Fig. 2).

Possibility to evaluate links generated by the current linkage rule.

User interface for learning linkage rules from existing reference links.

Active learning interface, which learns a linkage rule by interactively asking the user to confirm or decline a number of candidate links.

Possibility to create and edit a set of reference links used to evaluate the current link specification.
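To make the role of reference links more concrete, the following is a minimal sketch of how generated links could be scored against a set of confirmed (positive) and declined (negative) reference links. Precision, recall and F-measure are used here as an assumption about a reasonable evaluation scheme; the function name `evaluate` and the example URIs are hypothetical and do not reflect Silk's actual API.

```python
from typing import Set, Tuple

Link = Tuple[str, str]  # (source entity URI, target entity URI)


def evaluate(generated: Set[Link],
             positive: Set[Link],
             negative: Set[Link]) -> Tuple[float, float, float]:
    """Score generated links against reference links (illustrative only)."""
    tp = len(generated & positive)   # generated links confirmed as correct
    fp = len(generated & negative)   # generated links declined as wrong
    fn = len(positive - generated)   # confirmed links the rule missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical reference links and rule output.
positive = {("a:1", "b:1"), ("a:2", "b:2"), ("a:4", "b:4")}
negative = {("a:3", "b:9")}
generated = {("a:1", "b:1"), ("a:2", "b:2"), ("a:3", "b:9")}

precision, recall, f_measure = evaluate(generated, positive, negative)
```

Generated links that appear in neither reference set are ignored in this sketch, since nothing is known about their correctness.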

Additionally, Silk provides three command line applications. Silk Single Machine generates RDF links on a single machine, with the input datasets either residing on the same machine or accessed via the SPARQL protocol. Silk MapReduce generates RDF links between datasets using a cluster of multiple machines; it is based on Hadoop and thus enables Silk to scale out to very large datasets. Finally, Silk Server [10] can be used as an identity resolution component within applications that consume Linked Data from the Web.

The basic concept in Silk is that of a linkage rule, which specifies the conditions under which two entities are considered to refer to the same real-world entity. A linkage rule assigns a similarity value to a pair of entities. We represent a linkage rule as a tree built from four types of operators.

Property Operator: Retrieves all values of a specific property of each entity, such as the values of the label property.

Transformation Operator: Transforms the values according to a specific data transformation function, e.g. case normalization, tokenization, concatenation. Multiple transformation operators can be nested in order to apply a sequence of transformations.

Comparison Operator: Evaluates the similarity between the values of two input operators according to a specific distance measure, such as Levenshtein, Jaccard, or geographic distance. A user-specified threshold sets the maximum distance and is used to normalize the measure.

Aggregation Operator: Because the similarity of two entities often cannot be determined by evaluating a single comparison, an aggregation operator combines the scores from multiple comparison or aggregation operators according to an aggregation function, e.g. weighted average or minimum.

The resulting linkage rule forms a tree in which the terminal nodes are property operators and the internal nodes are transformation, comparison and aggregation operators; see Fig. 2 for an example.
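The operator tree described above can be illustrated with a small Python sketch. It is not Silk's implementation or rule language: the class names, the dictionary-based entity representation, and the normalization `1 - distance / threshold` (clipped at 0) are assumptions made for illustration, as are the example entities and the year and released properties.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Entity = Dict[str, List[str]]  # property name -> list of values


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


@dataclass
class Property:            # terminal node: retrieves property values
    name: str

    def values(self, entity: Entity) -> List[str]:
        return entity.get(self.name, [])


@dataclass
class Transform:           # applies a function to every input value
    fn: Callable[[str], str]
    source: Any            # a Property or another Transform

    def values(self, entity: Entity) -> List[str]:
        return [self.fn(v) for v in self.source.values(entity)]


@dataclass
class Compare:             # turns a distance into a similarity score
    distance: Callable[[str, str], float]
    threshold: float       # maximum accepted distance, must be > 0
    left: Any
    right: Any

    def score(self, a: Entity, b: Entity) -> float:
        pairs = [(x, y) for x in self.left.values(a)
                 for y in self.right.values(b)]
        if not pairs:
            return 0.0
        best = min(self.distance(x, y) for x, y in pairs)
        # Assumed normalization: 1 at distance 0, 0 at or beyond the threshold.
        return max(0.0, 1.0 - best / self.threshold)


@dataclass
class Aggregate:           # combines the scores of its child operators
    fn: Callable[[List[float]], float]
    children: List[Any]

    def score(self, a: Entity, b: Entity) -> float:
        return self.fn([c.score(a, b) for c in self.children])


# Hypothetical entities from two datasets.
movie = {"movieTitle": ["The Matrix"], "year": ["1999"]}
film = {"label": ["the matrix"], "released": ["1999"]}
other = {"label": ["The Matrix Reloaded"], "released": ["2003"]}

rule = Aggregate(min, [
    Compare(levenshtein, 2.0,
            Transform(str.lower, Property("movieTitle")),
            Transform(str.lower, Property("label"))),
    Compare(levenshtein, 1.0, Property("year"), Property("released")),
])

similarity = rule.score(movie, film)
```

Using the minimum as the aggregation function means every comparison must succeed for the pair to score well; a weighted average would instead let a strong match on one property compensate for a weak match on another.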

  • [1]
  • [2]