Tabular Data in PublicData.eu

At the time of writing (May 2014), PublicData.eu comprised 20,396 datasets. Each dataset can comprise several data resources; overall, more than 60,000 data resources are available on PublicData.eu. Datasets are described by metadata such as categories, groups, license, geographical coverage and format. Comprehensive statistics gathered from PublicData.eu are presented in [3].

A large portion of the datasets on PublicData.eu (approx. 37 %) are in tabular formats such as CSV, TSV, XLS and XLSX. These formats preserve little of the domain semantics and structure. Moreover, tabular data represented in these formats can be syntactically quite heterogeneous and leave many semantic ambiguities open, which makes interpreting, integrating and visualizing the data difficult. In order to support the exploitation of tabular data, it is necessary to transform the data into standardized formats that facilitate semantic description, linking and integration, such as RDF.

The other formats represented on the PublicData.eu portal break down as follows: 42 % of the datasets have no format specified, 15 % are human-readable representations (i.e. HTML, PDF, TXT, DOC), and the remaining 6 % comprise geographical data, XML documents, archives as well as various proprietary formats. Thus, for a large fraction (i.e. 42 %) of the datasets a manual annotation effort is required, and at the time of writing they cannot be converted automatically due to the absence of format descriptions. The conversion of human-readable datasets (i.e. 15 %) to RDF is out of the scope of this book; such conversion has proven to be time-consuming and error-prone. The remaining 6 % of the datasets are partially tackled in other projects: the GeoKnow project[1] aims at converting geographical data to RDF, whereas statistical data from XML documents are converted within the Linked SDMX project[2].

User-Driven Conversion Framework

Fully automatic RDF transformation, as well as automatic detection and correction of tabular data problems, is not feasible. In [3] we therefore devised an approach where the effort is shared between machines and human users. Our mapping authoring environment is based on the popular MediaWiki system. The resulting mapping wiki, located at wiki.publicdata.eu, operates together with PublicData.eu and helps users to map and convert tabular data to RDF in a meaningful way. To leverage the wisdom of the crowd, mappings are first created automatically and can then be revised by human users. Users thus improve mappings by correcting errors of the automatic conversion, and the cumbersome process of creating mappings from scratch is avoided in most cases. An overview of the entire application is depicted in Fig. 1.

Fig. 1. Architecture of our CSV2RDF extension for PublicData.eu.

Our application continuously crawls CSV resources from PublicData.eu and validates them. Around 20 % of the CSV resources are filtered out, mostly because of response timeouts, server errors or missing files. After validation, default mappings are created and the resources are converted to RDF. In order to obtain an RDF graph from a table T, we essentially use the table-as-class approach [1], which generates triples as follows: subjects are generated by prefixing each row's ID (for CSV files, by default the line number) with the corresponding CSV resource URL; the headings become properties in the ontology namespace; the cell values then become the objects. Note that we avoid inferring classes from the CSV file names, as the file names often turn out to be simply labels rather than meaningful type names.
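
As an illustration of this default conversion, the following minimal Python sketch generates N-Triples-style output from a CSV string. The resource URL and ontology namespace are hypothetical placeholders, not the URIs actually used by PublicData.eu.

```python
# Minimal sketch of the table-as-class conversion described above.
import csv
import io

RESOURCE_URL = "http://example.org/resource/some.csv"   # hypothetical
ONTOLOGY_NS = "http://example.org/ontology/"            # hypothetical

def csv_to_ntriples(csv_text: str) -> list[str]:
    """Emit one triple per cell: row URI -- heading property --> cell value."""
    rows = csv.reader(io.StringIO(csv_text))
    headings = next(rows)
    # Headings become properties in the ontology namespace.
    props = [ONTOLOGY_NS + h.strip().replace(" ", "_") for h in headings]
    triples = []
    for line_no, row in enumerate(rows, start=2):  # data starts at line 2
        # Row ID (the line number) is prefixed with the resource URL.
        subject = f"<{RESOURCE_URL}#row-{line_no}>"
        for prop, value in zip(props, row):
            if value:  # skip empty cells
                escaped = value.replace('"', '\\"')
                triples.append(f'{subject} <{prop}> "{escaped}" .')
    return triples

for t in csv_to_ntriples("city,population\nLeipzig,520838\nBonn,309869"):
    print(t)
```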

Conversion to RDF is performed by Sparqlify-CSV. Although the Sparqlify-ML syntax should not pose any problems to users familiar with SPARQL, it is too complicated for novice users and therefore less suitable for crowd-sourcing. To lower the barrier, we define a simplified mapping format, which relieves users from dealing with the Sparqlify-ML syntax. Our format is based on MediaWiki templates and thus integrates seamlessly with MediaWiki. To define mappings we created a template called RelCSV2RDF; its complete description is available on the mapping wiki.
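
Conceptually, such a simplified mapping boils down to a handful of key-value parameters. The sketch below uses hypothetical parameter names for illustration only; the actual RelCSV2RDF template fields are documented on the mapping wiki.

```python
# Hypothetical key-value representation of a simplified mapping; the real
# RelCSV2RDF template defines its own parameter names on the mapping wiki.
mapping = {
    "delimiter": ",",    # CSV field separator
    "header_row": 1,     # row containing the column headings
    # User-revised property URIs per column; unmapped columns fall back
    # to the automatically generated ontology-namespace properties.
    "properties": {
        "city": "http://dbpedia.org/ontology/city",
        "population": "http://dbpedia.org/ontology/populationTotal",
    },
}

def property_for(column: str,
                 default_ns: str = "http://example.org/ontology/") -> str:
    """Pick the user-assigned property, or fall back to the default mapping."""
    return mapping["properties"].get(column, default_ns + column.replace(" ", "_"))

print(property_for("city"))     # user-revised property
print(property_for("country"))  # falls back to the default namespace
```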

At the end of the transformation, a page is created for each resource on the mapping wiki at wiki.publicdata.eu. The resource page comprises links to the corresponding resource and dataset on PublicData.eu, as well as one or several mappings and visualization links. Each mapping is rendered via the RelCSV2RDF template into a human-readable description of its parameters, including links for re-running the transformation and downloading the resulting RDF.

Sharing the effort between human users and machines is never a simple task. The trade-off between human involvement and automatic machine processing should be balanced such that the highest precision is achieved with the least time expenditure on the user side. After automatic mapping generation and resource conversion, the user is expected to find a relevant RDF schema for the given CSV table with third-party tools such as the LOV search engine. This task requires background knowledge in the field of the Semantic Web, in particular awareness of the existence of specific RDF processing tools. To eliminate this requirement, we developed a dedicated interface for finding relevant properties that link the table schema to existing RDF terms.
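
A minimal sketch of such a property lookup against LOV's public term-search API follows; the endpoint, parameters and response structure shown here are assumptions based on LOV's v2 API documentation and should be verified against the current version.

```python
# Sketch of looking up candidate RDF properties for a column heading via the
# LOV term-search API. Endpoint and response fields are assumptions based on
# LOV's public v2 API; verify against the current documentation.
import requests

LOV_SEARCH = "https://lov.linkeddata.es/dataset/lov/api/v2/term/search"

def suggest_properties(heading: str, limit: int = 5) -> list[str]:
    """Return candidate property URIs for a CSV column heading."""
    resp = requests.get(
        LOV_SEARCH,
        params={"q": heading, "type": "property", "page_size": limit},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    # Each result carries the term's URI (wrapped in a list in LOV responses).
    return [r["uri"][0] for r in results]

print(suggest_properties("population"))
```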

Additionally, the mapping wiki uses the Semantic MediaWiki [4] (SMW) extension, which enables semantic annotations and the embedding of search queries over these annotations within wiki pages. The RelCSV2RDF template utilizes SMW and automatically attaches semantic links (using has property) from mappings to the respective property pages. This allows users to navigate between dataset resources that use the same properties; dataset resources are thus connected through the properties used in their mappings.
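
For illustration, such SMW links are plain wikitext annotations of the form [[has property::...]]. The sketch below renders a mapping's property URIs into annotations of this kind; the property page naming scheme is a hypothetical simplification.

```python
# Render SMW annotations linking a mapping page to property pages.
# The [[has property::...]] syntax is standard SMW; the property page
# naming scheme used here is a hypothetical simplification.
def smw_property_links(properties: list[str]) -> str:
    links = []
    for prop in properties:
        local_name = prop.rstrip("/").rsplit("/", 1)[-1]  # last URI segment
        links.append(f"[[has property::Property:{local_name}]]")
    return "\n".join(links)

print(smw_property_links([
    "http://dbpedia.org/ontology/populationTotal",
    "http://example.org/ontology/city",
]))
```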

Conversion Results

We downloaded and cleaned 15,551 CSV files, which consume in total 62 GB of disk space. The vast majority (i.e. 85 %) of the published datasets are smaller than 100 kB. A small share of the resources on PublicData.eu (i.e. 14.5 %) are between 100 kB and 50 MB. Only 44 resources (i.e. 0.5 %) are large and very large files above 50 MB, with the largest file comprising 3.3 GB. As a result, the largest 41 of the 9,370 converted RDF resources account for 7.2 billion (i.e. 98.5 %) of the overall 7.3 billion triples.

The results of the transformation process are summarized in Table 1. Our efficient Sparqlify RDB2RDF transformation engine is capable of processing CSV files and generating approx. 4,000 triples per second on a quad-core 2.2 GHz machine. At this rate, we can process CSV files of up to 50 MB within a minute. This enables us to re-transform the vast majority of CSV files on demand, once a user has revised a mapping. Files larger than 50 MB are currently queued and processed in batch mode.

Table 1. Transformation results summary

CSV res. converted                 9,370
Avg. no. properties per entity     47
CSV res. volume                    33 GB
Generated default mappings         9,370
No. generated triples              7.3 billion
Overall properties                 80,676
No. entity descriptions            154 million
Distinct properties                13,490

[1] geoknow.eu
[2] csarven.ca/linked-sdmx-data
 