Interlinking and Knowledge Fusion

Abstract. The central assumption of Linked Data is that data providers ease the integration of Web data by setting RDF links between data sources. In addition to linking entities, Web data integration also requires the alignment of the different vocabularies that are used to describe entities as well as the resolution of data conflicts between data sources. In this chapter, we present the methods and open source tools that have been developed in the LOD2 project for supporting data publishers to set RDF links between data sources. We also introduce the tools that have been developed for translating data between different vocabularies, for assessing the quality of Web data as well as for resolving data conflicts by fusing data from multiple data sources.


The amount of Linked Open Data (LOD) already available on the Web of Data, or extracted using e.g. the methods presented in Chap. 3, is huge, as well as its potential for applications. However, the quality of the LOD sources varies greatly across domains and single datasets [1], making the efficient use of data problematic. An important quality-related problem is the lack of data consistency : same real world entities are described in different datasets using different vocabularies and data formats, and the descriptions often contain conflicting values.

According to the architecture of a Linked Data application illustrated in Fig. 1, four steps are necessary before the input coming from the Web of Data can be consumed by an application: vocabulary mapping, identity resolution, data quality assessment and data fusion.

This chapter presents methods and open source tools developed within the LOD2 project, which cover the above four steps of the process of integrating and cleansing the Linked Data from the Web.

Vocabulary mapping, or schema alignment step is inevitable as different LOD providers may use different vocabularies to represent the same type of information. E.g. population property of a country or city can come under different names

Fig. 1. Schematic architecture of a Linked Data application [7]

such as population, populationTotal, numberOfInhabitants, hasPopulation, etc. Therefore, tools that translate terms from different vocabularies into a single target schema are needed. Section 2 presents the R2R Framework, which enables Linked Data applications to discover and apply vocabulary mappings to translate the Web data to the application's target vocabulary.

Identity resolution aims at interlinking URIs that are used by different Linked Data sources to identify the same entity, for instance, a person or a place. Data sources may provide owl:sameAs links connecting data about the same realworld entity, but in many cases methods and tools for discovering these links are needed. In Sect. 3 we present the Silk Link Discovery Framework that supports identity resolution and data interlinking in the LOD context. Section 4 presents the LOD-enabled version of OpenRefine for data cleansing and reconciliation, which is also enhanced with crowdsourcing capabilities.

Data quality assessment and data fusion steps ensure the quality and consistency of data coming from the web. Depending on the application, different data quality aspects may become relevant: trustworthiness, precision, recency, etc. Section 5 presents Sieve – Linked Data Quality Assessment and Fusion tool, which allows filtering and then fusing the Web data according to user-defined data quality assessment and conflict resolution policies. One of the crowdsourcing use cases in Sect. 4 is related to improving the data quality via data enrichment. In addition, Sect. 6 addresses the specific challenges of identity resolution and data fusion for some of the most wide-spread Asian languages: Korean, Chinese and Japanese.

< Prev   CONTENTS   Next >