Interlinking and Knowledge Fusion
Abstract. The central assumption of Linked Data is that data providers ease the integration of Web data by setting RDF links between data sources. In addition to linking entities, Web data integration also requires aligning the different vocabularies that are used to describe entities and resolving data conflicts between data sources. In this chapter, we present the methods and open source tools that have been developed in the LOD2 project to support data publishers in setting RDF links between data sources. We also introduce the tools that have been developed for translating data between different vocabularies, for assessing the quality of Web data, and for resolving data conflicts by fusing data from multiple data sources.
The amount of Linked Open Data (LOD) already available on the Web of Data, or extracted using, e.g., the methods presented in Chap. 3, is huge, and so is its potential for applications. However, the quality of LOD sources varies greatly across domains and individual datasets, which makes efficient use of the data problematic. An important quality-related problem is the lack of data consistency: the same real-world entities are described in different datasets using different vocabularies and data formats, and the descriptions often contain conflicting values.
According to the architecture of a Linked Data application illustrated in Fig. 1, four steps are necessary before the input coming from the Web of Data can be consumed by an application: vocabulary mapping, identity resolution, data quality assessment and data fusion.
This chapter presents methods and open source tools developed within the LOD2 project that cover these four steps of integrating and cleansing Linked Data from the Web.
Fig. 1. Schematic architecture of a Linked Data application
The vocabulary mapping, or schema alignment, step is unavoidable because different LOD providers may use different vocabularies to represent the same type of information. For example, the population property of a country or city can appear under different names such as population, populationTotal, numberOfInhabitants, hasPopulation, etc. Therefore, tools that translate terms from different vocabularies into a single target schema are needed. Section 2 presents the R2R Framework, which enables Linked Data applications to discover and apply vocabulary mappings to translate Web data to the application's target vocabulary.
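The idea of translating differently named properties into a single target term can be sketched in a few lines of Python. This is only an illustration under stated assumptions and does not use the R2R mapping language: the dictionary PROPERTY_MAPPING, the function translate, the ex: identifiers, the population figures, and the choice of populationTotal as the target term are all hypothetical.

```python
# Hypothetical mapping from source property names to one target term.
# The choice of "populationTotal" as target is an assumption for this sketch.
PROPERTY_MAPPING = {
    "population": "populationTotal",
    "populationTotal": "populationTotal",
    "numberOfInhabitants": "populationTotal",
    "hasPopulation": "populationTotal",
}


def translate(triples, mapping=PROPERTY_MAPPING):
    """Rewrite the predicate of each (subject, predicate, object) triple.

    Predicates without a mapping are passed through unchanged.
    """
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


# Illustrative input; identifiers and values are made up.
source = [
    ("ex:Berlin", "numberOfInhabitants", 3500000),
    ("ex:Paris", "hasPopulation", 2100000),
    ("ex:Paris", "label", "Paris"),
]
translated = translate(source)
```

After translation, both population statements use the same predicate, while the unmapped label statement is left as it was. A real mapping framework additionally has to handle value transformations such as unit or datatype conversion, which this sketch leaves out.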
Identity resolution aims at interlinking URIs that are used by different Linked Data sources to identify the same entity, for instance, a person or a place. Data sources may provide owl:sameAs links connecting data about the same real-world entity, but in many cases methods and tools for discovering these links are needed. In Sect. 3 we present the Silk Link Discovery Framework, which supports identity resolution and data interlinking in the LOD context. Section 4 presents the LOD-enabled version of OpenRefine for data cleansing and reconciliation, which is also enhanced with crowdsourcing capabilities.
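To make the notion of link discovery concrete, the following Python sketch compares the labels of entities from two sources and emits an owl:sameAs link when they are similar enough. It is a simplified illustration, not the Silk link specification language: the functions label_similarity and discover_links, the threshold of 0.9, and the example URIs are assumptions, and the similarity measure is the one provided by difflib in the Python standard library.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source_a, source_b, threshold=0.9):
    """Compare every label pair and link URIs whose labels are similar.

    source_a and source_b map URIs to labels. The threshold value is an
    assumption chosen for this sketch.
    """
    links = []
    for uri_a, label_a in source_a.items():
        for uri_b, label_b in source_b.items():
            if label_similarity(label_a, label_b) >= threshold:
                links.append((uri_a, "owl:sameAs", uri_b))
    return links


# Illustrative data; the URIs are made up.
source_a = {
    "http://a.example/Berlin": "Berlin",
    "http://a.example/Paris": "Paris",
}
source_b = {
    "http://b.example/city/berlin": "berlin",
    "http://b.example/city/rome": "Rome",
}
links = discover_links(source_a, source_b)
```

Comparing all pairs is quadratic in the number of entities, so this approach only suits small inputs. Label comparison alone also cannot distinguish different entities that share a name, so practical link discovery usually combines several properties.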
The data quality assessment and data fusion steps ensure the quality and consistency of data coming from the Web. Depending on the application, different data quality aspects may become relevant: trustworthiness, precision, recency, etc. Section 5 presents Sieve, the Linked Data Quality Assessment and Fusion tool, which allows filtering and then fusing Web data according to user-defined data quality assessment and conflict resolution policies. One of the crowdsourcing use cases in Sect. 4 is related to improving data quality via data enrichment. In addition, Sect. 6 addresses the specific challenges of identity resolution and data fusion for some of the most widely used Asian languages: Korean, Chinese and Japanese.
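As a rough illustration of how a quality score can drive conflict resolution, the Python sketch below scores conflicting values by recency, filters out values below a minimum score, and keeps the value from the most recently updated source. It is not Sieve's configuration format or API: the functions recency_score and fuse, the linear scoring over a ten-year window, and the example values and dates are all assumptions made for this example.

```python
from datetime import date


def recency_score(last_updated, today, max_age_days=3650):
    """Map the age of a value to a score in [0, 1]; newer is higher.

    The linear decay over max_age_days is an assumption of this sketch.
    """
    age = (today - last_updated).days
    return max(0.0, 1.0 - age / max_age_days)


def fuse(candidates, today, min_score=0.0):
    """Filter candidates by score, then keep the best-scoring value.

    Each candidate is a dict with "value" and "last_updated" keys.
    Returns None if no candidate passes the filter.
    """
    scored = [(recency_score(c["last_updated"], today), c) for c in candidates]
    kept = [(score, c) for score, c in scored if score > min_score]
    if not kept:
        return None
    return max(kept, key=lambda pair: pair[0])[1]["value"]


# Two sources disagree on a population value; data is made up.
candidates = [
    {"value": 3400000, "last_updated": date(2006, 1, 1)},
    {"value": 3500000, "last_updated": date(2013, 1, 1)},
]
fused = fuse(candidates, today=date(2014, 1, 1))
```

Here the value from the source updated in 2013 is kept. Other policies are equally plausible, for example preferring a trusted source or averaging numeric values, and which one fits depends on the application.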