The Silk Link Discovery Framework
A central problem of the Web of Linked Data, as well as of data integration in general, is to identify entities in different data sources that describe the same real-world object. While the amount of Linked Open Data has grown significantly in recent years, most data sources are still not sufficiently interlinked. Out of the over 31 billion RDF statements published as Linked Data, less than 500 million represent RDF links between data sources; analysis confirms that the LOD cloud forms a weakly connected graph, with most publishers linking to only one other data source.
This section presents the Silk Link Discovery Framework, which generates RDF links between data items based on user-provided or automatically learned linkage rules. Data providers can use Silk to generate RDF links pointing at existing Web datasets and publish them together with their primary datasets. Furthermore, applications that consume Linked Data can use Silk as an identity resolution component to augment the data with additional RDF links that have not yet been discovered and/or published.
In Silk, linkage rules are expressed using a declarative language, and define the conditions that data items must conform to in order to be interlinked. For instance, a linkage rule defines which properties should be compared (e.g. movieTitle in one dataset and label in another), which similarity measures should be used for comparison, and how they are to be combined.
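To make the idea concrete, the condition just described (comparing movieTitle in one dataset against label in another with a string similarity measure) can be sketched as a plain predicate. This is only an illustration: the acceptance threshold of 0.9, the case normalization, and the normalized Levenshtein similarity are assumptions for this example, not Silk's actual declarative syntax.

```python
# Illustrative sketch of a linkage rule as a Python predicate.
# The property names (movieTitle, label) come from the text; the
# 0.9 threshold and the normalization scheme are assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def title_similarity(movie_title: str, label: str) -> float:
    """Similarity in [0, 1]; 1.0 for identical strings after case folding."""
    a, b = movie_title.lower(), label.lower()
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

def rule(entity_a: dict, entity_b: dict, threshold: float = 0.9) -> bool:
    """Interlink the two entities if the titles are similar enough."""
    return title_similarity(entity_a["movieTitle"], entity_b["label"]) >= threshold
```

For example, `rule({"movieTitle": "Alien"}, {"label": "alien"})` accepts the pair because the titles match after case folding.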
Writing good linkage rules by hand is a non-trivial problem, which requires considerable effort and expertise. To address this, Silk implements the ActiveGenLink algorithm which combines genetic programming and active learning techniques to generate high-quality expressive linkage rules interactively, minimizing the involvement of a human expert. In this section, we will briefly introduce the tool and the underlying algorithms. For further details readers are referred to [8, 9].
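The active learning loop can be illustrated at a very high level: the evolved population of candidate rules acts as a committee, and the candidate link on which the committee disagrees most is the one presented to the user next. The sketch below shows only this generic committee-based query selection (using vote entropy as the disagreement measure); it is a simplified stand-in, not the actual ActiveGenLink query strategy, for which readers should consult [8, 9].

```python
# Simplified sketch of committee-based query selection in active
# learning: each candidate rule votes yes/no on each candidate link,
# and the link with the highest vote entropy (most disagreement)
# is shown to the human expert next. Illustrative only.
import math

def vote_entropy(votes: list) -> float:
    """Shannon entropy of the committee's yes/no votes on one link."""
    yes = sum(votes) / len(votes)
    no = 1.0 - yes
    return -sum(p * math.log2(p) for p in (yes, no) if p > 0)

def select_query(candidate_links, committee):
    """Pick the candidate link the rule committee is most uncertain about."""
    return max(candidate_links,
               key=lambda link: vote_entropy([rule(link) for rule in committee]))
```

Labeling the most contested candidate link yields the most information per question, which is how the involvement of the human expert is kept small.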
Silk: Functionality and Main Concepts
The Silk Link Discovery Framework can be downloaded from its official homepage, which is also the source for the documentation, examples and updates. It is an open source tool, with the source code and the detailed developer documentation available online. Silk can be used through the Silk Workbench graphical user interface or from the command line.

Fig. 2. Silk Workbench: linkage rule editor
The Silk Workbench, developed in the course of the LOD2 project, aims at guiding the user through the process of interlinking different data sources. It is a web application offering the following functionality:
• Possibility to manage different data sources and linking tasks (with RDF dump files as well as local and remote SPARQL endpoints as input).
• Graphical editor to create and edit linkage rules (see Fig. 2).
• Possibility to evaluate links generated by the current linkage rule.
• User interface for learning linkage rules from existing reference links.
• Active learning interface, which learns a linkage rule by interactively asking the user to confirm or decline a number of candidate links.
• Possibility to create and edit a set of reference links used to evaluate the current link specification.
Additionally, Silk provides three command line applications: Silk Single Machine generates RDF links on a single machine, with input datasets either residing on the same machine or accessed via the SPARQL protocol; Silk MapReduce generates RDF links between datasets using a cluster of multiple machines; it is based on Hadoop and thus enables Silk to scale out to very large datasets. Finally, Silk Server can be used as an identity resolution component within applications that consume Linked Data from the Web.
The basic concept in Silk is that of a linkage rule, which specifies the conditions under which two entities are considered to refer to the same real-world entity. A linkage rule assigns a similarity value to a pair of entities. We represent a linkage rule as a tree built from four types of operators.
Property Operator: Retrieves all values of a specific property of each entity, such as the values of the label property.
Transformation Operator: Transforms the values according to a specific data transformation function, e.g. case normalization, tokenization, concatenation. Multiple transformation operators can be nested in order to apply a sequence of transformations.
Comparison Operator: Evaluates the similarity between the values of two input operators according to a specific distance measure, such as Levenshtein, Jaccard, or geographic distance. A user-specified threshold defines the maximum accepted distance and is used to normalize the measure into a similarity score.
Aggregation Operator: As the similarity of two entities often cannot be determined by evaluating a single comparison, an aggregation operator combines the scores from multiple comparison or aggregation operators according to an aggregation function, e.g. weighted average or minimum.
The resulting linkage rule forms a tree whose terminal nodes are property operators and whose internal nodes are transformation, comparison and aggregation operators; see Fig. 2 for an example.
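The four operator types can be sketched as a small class hierarchy. All class and function names below are invented for this illustration and do not mirror Silk's internal API; entities are modeled simply as dictionaries mapping a property name to a list of values.

```python
# Illustrative sketch of a linkage rule as an operator tree.
# Names are hypothetical and do not mirror Silk's internal API.

def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance, used as the distance measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class Property:
    """Property operator: retrieves all values of one property."""
    def __init__(self, name): self.name = name
    def values(self, entity): return entity.get(self.name, [])

class Transform:
    """Transformation operator: applies a function to each input value."""
    def __init__(self, fn, child): self.fn, self.child = fn, child
    def values(self, entity): return [self.fn(v) for v in self.child.values(entity)]

class Compare:
    """Comparison operator: the user-specified threshold is the maximum
    accepted distance and normalizes the distance into a score in [0, 1]."""
    def __init__(self, left, right, distance, threshold):
        self.left, self.right = left, right
        self.distance, self.threshold = distance, threshold
    def score(self, a, b):
        d = min(self.distance(va, vb)
                for va in self.left.values(a) for vb in self.right.values(b))
        return max(0.0, 1.0 - d / self.threshold)

class Aggregate:
    """Aggregation operator: combines the scores of its children."""
    def __init__(self, fn, children): self.fn, self.children = fn, children
    def score(self, a, b): return self.fn(c.score(a, b) for c in self.children)

# Example tree: lower-case both inputs (transformation), then compare
# movieTitle against label with a threshold-normalized Levenshtein distance.
rule = Compare(Transform(str.lower, Property("movieTitle")),
               Transform(str.lower, Property("label")),
               levenshtein, threshold=3)
```

Evaluating `rule.score({"movieTitle": ["Alien"]}, {"label": ["alien"]})` yields 1.0, since the edit distance after case normalization is zero; wrapping several such comparisons in an `Aggregate` with, say, `min` or a weighted average mirrors the aggregation operator described above.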