Data Quality Assessment and Fusion
The vocabulary alignment and interlinking steps presented in Sects. 2 and 3 result in interlinked entity descriptions originating from a number of heterogeneous data sources. The quality of data from these sources varies widely, as values may be out of date, incomplete or incorrect, either because of data extraction errors or because of errors made by human data editors. As a result, sources often provide conflicting values for a property of the same real-world object. For Linked Data applications to consume such data efficiently, the data should be assessed and integrated based on its quality.
Quality is a subjective matter, often defined as “fitness for use”, meaning that the interpretation of the quality of a data item depends on who will use it and for what task. Data quality has many dimensions, such as accuracy, timeliness, completeness, relevancy, objectivity, believability, understandability, consistency, conciseness, availability and verifiability. These dimensions are not independent of each other, and typically only a subset of them is relevant in a specific situation. To support user applications in dealing with data quality and conflict resolution issues, we created Sieve, a Linked Data Quality Assessment and Fusion framework, which we summarize in this section.
Sieve consists of two components, Data Quality Assessment and Data Fusion. It takes as input the data to be fused and an XML file containing both the quality assessment and the data fusion configurations. This XML-based specification language allows a user to manually define, for each data property, which quality assessment metrics and which conflict resolution strategies to use.
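The division of labour between the two components can be illustrated with a minimal Python sketch. It is not Sieve's actual configuration language or API: the names `PropertyRule`, `keep_highest_score`, `keep_all` and `fuse`, the metric name `recency`, and the prefixed property names are all hypothetical and chosen only for illustration. The sketch assumes that quality assessment has already produced one score per source graph and metric, and shows how a per-property rule could then select a conflict resolution strategy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a candidate pairs a value with the quality score
# of the graph it came from.
Candidate = Tuple[str, float]
FusionFunction = Callable[[List[Candidate]], List[str]]


def keep_highest_score(candidates: List[Candidate]) -> List[str]:
    """Resolve a conflict by keeping the value from the best-scored source."""
    best_value, _ = max(candidates, key=lambda c: c[1])
    return [best_value]


def keep_all(candidates: List[Candidate]) -> List[str]:
    """Resolve nothing: pass every distinct value through, in input order."""
    seen: List[str] = []
    for value, _ in candidates:
        if value not in seen:
            seen.append(value)
    return seen


@dataclass
class PropertyRule:
    metric: str             # which quality metric to consult
    fusion: FusionFunction  # which conflict resolution strategy to apply


# Illustrative per-property configuration (assumed names, not Sieve's XML).
config: Dict[str, PropertyRule] = {
    "dbo:populationTotal": PropertyRule("recency", keep_highest_score),
    "rdfs:label": PropertyRule("recency", keep_all),
}


def fuse(prop: str,
         values_by_graph: Dict[str, str],
         scores: Dict[str, Dict[str, float]]) -> List[str]:
    """Apply the rule configured for `prop` to the values of one subject."""
    rule = config[prop]
    candidates = [(value, scores[rule.metric][graph])
                  for graph, value in values_by_graph.items()]
    return rule.fusion(candidates)
```

Keeping the scores separate from the fusion functions mirrors the two-component design described above: a metric can be computed once per graph and reused by any number of property rules.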
Sieve takes as input two or more RDF data sources, along with their provenance information. It is assumed that schema and object identifiers have been normalized: if two descriptions refer to the same real-world object, then they have the same identifier (URI), and if two properties refer to the same real-world attribute, then their values appear under the same property URI, so that a given subject URI may carry two values for that property URI. Each property value in the input is expressed as a quad (subject, property, object, graph), where the graph is a named graph used to attach provenance information to a fact or a set of facts. For an example, see Listing 2, which gives the input data for the population of Amsterdam coming from three different DBpedia editions, along with the last edit date of each. Note that the fourth quad component, the provenance graph for the lastedit property, is the same for all three triples and is omitted for reasons of space.
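To make the quad-based input model concrete, the following Python sketch represents quads as named tuples and groups them by subject and property to find conflicting values. It is an illustration under stated assumptions, not Sieve's implementation: the `Quad` type, the `find_conflicts` and `last_edit` helpers, the prefixed names, and the graph names are hypothetical, and the population figures and dates are placeholders rather than the values of Listing 2.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Set, Tuple


class Quad(NamedTuple):
    subject: str
    prop: str
    obj: str
    graph: str  # named graph, used to attach provenance to the fact


# Placeholder data: three sources describing the same subject with the
# same property URI (identifiers are assumed to be normalized already).
quads: List[Quad] = [
    Quad("dbr:Amsterdam", "dbo:populationTotal", "1000", "graph:en"),
    Quad("dbr:Amsterdam", "dbo:populationTotal", "2000", "graph:pt"),
    Quad("dbr:Amsterdam", "dbo:populationTotal", "1000", "graph:nl"),
    # Provenance facts describe the named graphs themselves and all live
    # in one shared provenance graph.
    Quad("graph:en", "ex:lastedit", "2001-01-01", "graph:provenance"),
    Quad("graph:pt", "ex:lastedit", "2002-02-02", "graph:provenance"),
    Quad("graph:nl", "ex:lastedit", "2003-03-03", "graph:provenance"),
]


def find_conflicts(data: List[Quad]) -> Dict[Tuple[str, str], Set[str]]:
    """Return (subject, property) pairs that carry more than one distinct value."""
    values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for q in data:
        values[(q.subject, q.prop)].add(q.obj)
    return {key: vals for key, vals in values.items() if len(vals) > 1}


def last_edit(data: List[Quad], graph: str) -> Optional[str]:
    """Look up the last edit date recorded for a named graph, if any."""
    for q in data:
        if q.subject == graph and q.prop == "ex:lastedit":
            return q.obj
    return None


conflicts = find_conflicts(quads)
```

Because identifiers are normalized beforehand, conflict detection reduces to grouping on the (subject, property) pair, and the named graph of each conflicting value is the key for looking up its provenance.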
Listing 2. Data Fusion with Sieve: input data
-  sieve.wbsg.de