Data Cleansing and Reconciliation with LODRefine

Data cleansing and linking are very important processes in the life cycle of linked data, especially when creating new linked data. Nowadays data comes from different sources and it is published in many formats, e.g. XML, CSV, HTML, as dumps from relational databases or different services.

Unfortunately, cleansing and linking are rarely trivial, and can be very tedious, especially with the lack of good tools. A good cleansing tool should be able to assist users in detecting non-consistent data, removing duplicates, quickly performing transformations on a relatively large amount of data at once, and exporting cleansed data into different formats. It should be relatively simple to use and available on different operating systems. Fortunately, there is a open source (BSD licensed) solution available meeting all the above criteria. OpenRefine, previously Google Refine, was specifically created for dealing with messy data, is extendable, works on all three major operating systems and provides functionalities to reconcile data against Freebase. For needs of the LOD2 project we developed a LOD-

enabled version of OpenRefineLODRefine


Even with tools available, data cleansing and linking in the LOD cycle still require a lot of manual work, and the current algorithms and fully automatated tools cannot completely replace human intuition and knowledge: it can be too complicated or costly to encode all our knowledge and experience into rules and procedures computers can understand. Although we already have good cleansing and linking tools at our disposal requiring only minimal human intervention, with huge amounts of data even simple tasks add up and time resources become an issue.

However, such tasks are often simple and include evaluation of automatically obtained results, finding missing information or, in rare cases, demand special domain knowledge. Crowdsourcing seems to be a promising solution for such situations and offers hiring affordable working power for certain repetitive but relatively simple tasks, especially when automated processing does not give good enough results, e.g. when classifying spam, categorizing images, and disambiguating data. To bring crowdsourcing closer to the needs of the LOD2 project we added support for using CrowdFlower [2], a popular and versatile crowdsourcing platform, directly from the LODRefine environment. In the rest of this section we introduce LODRefine and shortly describe three use cases of using crowdsourcing for different tasks, namely, data cleansing and disambiguation of reconciliation results.

  • [1] Available for download at or as a source

    code at

  • [2]
< Prev   CONTENTS   Next >