Data Quality Assessment and Fusion

The vocabulary alignment and interlinking steps presented in Sects. 2 and 3 result in interlinked entity descriptions originating from a number of heterogeneous data sources. The quality of data from these sources varies widely [1]: values may be out of date, incomplete, or incorrect, owing either to data extraction errors or to errors made by human data editors. Conflicting values for a property of a real-world object are therefore common. For Linked Data applications to consume such data efficiently, the data should be assessed and integrated according to its quality.

Quality is a subjective matter, often defined as "fitness for use": the quality of a data item depends on who will use it and for what task. Data quality has many dimensions, such as accuracy, timeliness, completeness, relevancy, objectivity, believability, understandability, consistency, conciseness, availability, and verifiability. These dimensions are not independent of each other, and typically only a subset of them is relevant in a specific situation. To support user applications in dealing with data quality and conflict resolution, we created Sieve, the Linked Data Quality Assessment and Fusion framework [11], which we summarize in this section.

Sieve consists of two components, Data Quality Assessment and Data Fusion. It takes as input the data to be fused and an XML file containing both the quality assessment and the data fusion configuration. This XML-based specification language allows a user to define manually which quality assessment metrics and conflict resolution strategies to use for each data property.
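The two-stage design described above can be sketched in a few lines of Python. This is a minimal illustration and not Sieve's actual API or XML syntax: the names `recency_score`, `keep_best`, `keep_all`, and `FUSION_CONFIG`, the `ex:` identifiers, the dates, and the linear recency metric are all assumptions made for the example.

```python
from datetime import date

# Stage 1: quality assessment. Score each named graph in [0, 1] from its
# provenance. The linear recency decay used here is an assumed metric.
def recency_score(last_edit: date, today: date, max_age_days: int = 365) -> float:
    age = (today - last_edit).days
    return min(1.0, max(0.0, 1.0 - age / max_age_days))

# Stage 2: data fusion. Each strategy receives (value, graph) candidates
# for one property of one subject, plus the per-graph quality scores.
def keep_best(candidates, scores):
    """Return the value coming from the highest-scoring graph."""
    return max(candidates, key=lambda vg: scores[vg[1]])[0]

def keep_all(candidates, scores):
    """Return every distinct value, ignoring the scores."""
    return sorted({value for value, _ in candidates})

# Per-property choice of strategy, playing the role of the XML
# specification (hypothetical property names).
FUSION_CONFIG = {"ex:population": keep_best, "ex:label": keep_all}

def fuse(prop, candidates, scores):
    strategy = FUSION_CONFIG.get(prop, keep_all)
    return strategy(candidates, scores)

# Hypothetical provenance: last edit date per named graph.
today = date(2024, 1, 1)
last_edits = {"ex:g1": date(2023, 12, 1), "ex:g2": date(2023, 3, 1)}
scores = {graph: recency_score(d, today) for graph, d in last_edits.items()}

fused = fuse("ex:population", [("100", "ex:g2"), ("120", "ex:g1")], scores)
```

Keeping assessment and fusion separate means the same quality scores can feed different fusion strategies for different properties, which is the role the per-property configuration plays in the text.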

Sieve takes as input two or more RDF data sources, along with their provenance information. It is assumed that schema and object identifiers have been normalized: if two descriptions refer to the same real-world object, they share the same identifier (URI), and if two properties refer to the same real-world attribute, they share the same property URI, so that differing sources yield multiple values for the same property URI of a given subject URI. Each property value in the input is expressed as a quad (subject, property, object, graph), where graph is a named graph used to attach provenance information to a fact or a set of facts. Listing 2 gives an example: the population of Amsterdam as reported by three different DBpedia editions, together with the last edit date of each. Note that the fourth quad component, the provenance graph for the lastedit property, is the same for all three triples and is omitted for reasons of space.
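As a rough illustration of the quad model described above, the following Python sketch groups quads by (subject, property) and reports the groups that carry more than one distinct value. The `Quad` type, the `find_conflicts` helper, and all `ex:` identifiers and values are hypothetical placeholders; they are not the data of Listing 2 and not part of Sieve.

```python
from collections import defaultdict
from typing import NamedTuple

class Quad(NamedTuple):
    subject: str
    prop: str
    obj: str
    graph: str  # named graph, used to look up provenance

# Placeholder values for illustration only.
quads = [
    Quad("ex:Amsterdam", "ex:population", "1000", "ex:graphA"),
    Quad("ex:Amsterdam", "ex:population", "1100", "ex:graphB"),
    Quad("ex:Amsterdam", "ex:population", "1100", "ex:graphC"),
    Quad("ex:Amsterdam", "ex:name", "Amsterdam", "ex:graphA"),
]

def find_conflicts(quads):
    """Map (subject, property) to {value: [graphs]} where values disagree."""
    values = defaultdict(dict)
    for q in quads:
        values[(q.subject, q.prop)].setdefault(q.obj, []).append(q.graph)
    return {key: by_value for key, by_value in values.items() if len(by_value) > 1}

conflicts = find_conflicts(quads)
```

Because identifiers are assumed to be normalized, a plain equality test on the subject and property URIs is enough to find the values that need fusing; the graph component then tells the fusion step where each candidate value came from.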

Listing 2. Data Fusion with Sieve: input data
