Publishing Pattern for Excel/CSV Data

When the original data reside in Excel or CSV format, describing them in RDF is the first step of the publishing pattern; hosting and serving them on the Web follows. LODRefine is a LOD2 stack component well suited to automating and easing this “RDFizing” procedure. Its use brings direct business value:

powerful cleaning capabilities on the original business data.

reconciliation capabilities, where needed, to find similar data in the LOD cloud and make the original business data compatible with well-known Linked Data sources.

augmenting capabilities, where columns from DBpedia or other sources can be added to the original data set based on the previously mentioned reconciliation services.

extraction facilities when entities reside inside the text of the cells.
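
To make the “RDFizing” step concrete, the following is a minimal Python sketch of the underlying idea, not of LODRefine itself: each row becomes a resource and each non-empty cell becomes one triple, serialized as N-Triples. The `http://example.org/` namespace, the function names and the sample data are assumptions made purely for illustration.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not from the source


def escape_literal(value):
    """Escape the characters that may not appear raw in an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, key_column, base=BASE):
    """Map every row to one subject and every non-empty cell to one triple."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row_id = quote(row[key_column].strip(), safe="")
        subject = f"<{base}resource/{row_id}>"
        for column, cell in row.items():
            # Skip the key itself, surplus fields (column is None) and empty cells.
            if column is None or column == key_column or not cell or not cell.strip():
                continue
            column_id = quote(column.strip(), safe="")
            predicate = f"<{base}property/{column_id}>"
            lines.append(f'{subject} {predicate} "{escape_literal(cell.strip())}" .')
    return "\n".join(lines)


sample = "id,name,city\n1,Acme Ltd,Athens\n2,Globex,\n"
print(csv_to_ntriples(sample, "id"))
```

A real conversion would normally reuse predicates from existing vocabularies rather than minting one per column header; the per-column predicates here only keep the sketch short.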

Publishing Pattern for XML Data

When the original data are in XML format, an XSLT transformation that turns the XML document into a set of RDF triples is the appropriate solution. The original files do not change; rather, a new document is created based on the content of the existing one. The basic idea is that specific structures are recognized and transformed into triples with a certain resource, predicate and value. The LOD2 stack supports XML to RDF/XML XSLT transformations. The resulting triples are saved as an RDF/XML graph or file that can follow the same hosting and serving procedures explained in the previous section.
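
The mapping idea (recognize a structure, then emit a resource, a predicate and a value) can be sketched in a few lines. The example below is a Python stand-in that uses the standard library's `xml.etree.ElementTree` instead of XSLT, so it illustrates the principle rather than the LOD2 tooling. The element names, the `id` attribute and the `http://example.org/` namespace are assumptions.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/"  # hypothetical namespace, not from the source


def xml_to_triples(xml_text, record_tag, id_attr, base=BASE):
    """Return (subject, predicate, value) tuples for each recognized record."""
    root = ET.fromstring(xml_text)
    triples = []
    for record in root.iter(record_tag):
        # The record element becomes the resource, identified by its id attribute.
        subject = f"{base}resource/{record.get(id_attr)}"
        for child in record:
            # Each child element with text becomes one predicate/value pair.
            text = (child.text or "").strip()
            if text:
                triples.append((subject, f"{base}property/{child.tag}", text))
    return triples


sample = """<catalog>
  <product id="p1"><name>Widget</name><price>9.50</price></product>
  <product id="p2"><name>Gadget</name><note/></product>
</catalog>"""

for subject, predicate, value in xml_to_triples(sample, "product", "id"):
    print(subject, predicate, value)
```

The input document is only read, never modified, which mirrors the point above that the original files stay unchanged and a new document is produced.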

Publishing Pattern for Unstructured Data

Despite the evolution of complex storage facilities, the enterprise environment remains a major repository of unstructured and semi-structured content. Basic corporate information and knowledge is stored in a variety of formats such as PDF, text files, e-mails, and classic or semantically annotated websites; it may come from Web 2.0 applications such as social networks, or may need to be acquired from specific Web APIs such as Geonames[1] and Freebase[2]. Linked Data extraction and instance data generation tools map the extracted data to appropriate ontologies in order to produce RDF data and facilitate the consolidation of enterprise information. A prominent example of a LOD2 stack tool that facilitates the transformation of such data into RDF graphs is Virtuoso Sponger.

Virtuoso Sponger[3] is Linked Data middleware that generates Linked Data from a wide variety of non-structured formats. Its basic functionality is based on Cartridges, each of which provides data extraction from a particular kind of data source together with mapping capabilities to existing ontologies. The data sources can be in RDFa[4], GRDDL[5], Microsoft document formats, or Microformats[6], or can be vendor-specific data sources and others provided through APIs. The Cartridges are highly customizable, so as to enable the generation of structured Linked Data from virtually any resource type, rather than limiting users to the resource types supported by the default Cartridge collection bundled with Virtuoso Sponger.

The PoolParty Thesaurus Server[7] is used to create thesauri and other controlled vocabularies. It offers the possibility to publish them instantly and display their concepts as HTML, while additionally providing machine-readable RDF versions via content negotiation. This means that anyone using PoolParty can become a W3C standards compliant Linked Data publisher without having to know anything about Semantic Web technicalities. The design of all pages on the Linked Data front-end can be controlled by the developer, who can use their own style sheets and create views on the data with Velocity templates.
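
Content negotiation means that the same URI returns HTML to a browser and RDF to a machine client, depending on the HTTP `Accept` header. The following Python sketch shows that selection logic in isolation. It is not PoolParty's implementation; the list of supported media types and the function names are assumptions, and partial wildcards such as `text/*` are deliberately left out to keep it short.

```python
SUPPORTED = ["text/html", "application/rdf+xml", "text/turtle"]  # assumed offer


def parse_accept(header):
    """Return (media_type, q) pairs sorted by descending q value."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so equally weighted types keep the client's order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_representation(accept_header, supported=SUPPORTED):
    """Pick the best supported media type, or None if nothing is acceptable."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
    return None


print(choose_representation("text/html,application/xhtml+xml;q=0.9"))
print(choose_representation("text/turtle, application/rdf+xml;q=0.8"))
```

A server would then serialize the concept in the chosen format, or answer with status 406 (Not Acceptable) when the function returns `None`.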

DBpedia Spotlight[8] is a tool for automatically annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia. DBpedia Spotlight recognizes when names of concepts or entities are mentioned. Besides common entity classes such as people, locations and organisations, DBpedia Spotlight also spots concepts from any of the 320 classes in the DBpedia Ontology. The tool currently specializes in English; support for other languages is being tested. It is provided as an open source web service.

Stanbol[9] is another tool, which extracts information from a CMS or other web applications through a RESTful API and represents it as RDF. Both DBpedia Spotlight and Stanbol support the NLP Interchange Format (NIF; at the time of writing expected to become a W3C recommendation) to standardise the output RDF, aiming at interoperability between Natural Language Processing (NLP) tools, language resources and annotations.

Hosting and Serving

The publishing phase usually involves the following steps:

1. storing the data in a triple store,

2. making them available from a SPARQL endpoint,

3. making their URIs dereferenceable so that people and machines can look them up through the Web, and

4. providing them as an RDF dump so that the data can easily be re-used.

The first three steps can be fully addressed with a LOD2 stack component called Virtuoso, while the fourth is addressed by uploading the RDF file to CKAN[10], which makes the RDF dump public.
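
Once the data sit behind a SPARQL endpoint (step 2), any HTTP client can query them. The sketch below builds such a request with Python's standard library, following the SPARQL Protocol convention of passing the query in a `query` parameter of a GET request. The endpoint URL is an assumption (a locally running server); replace it with the address of the actual installation. The request is only constructed and printed, and the commented lines show how it could be sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://localhost:8890/sparql"  # assumed endpoint URL, adjust as needed


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks the endpoint for JSON-formatted results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
request = build_sparql_request(query)
print(request.full_url)

# To run the query against a live endpoint:
# import json
# from urllib.request import urlopen
# with urlopen(request) as response:
#     results = json.load(response)
```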

OpenLink Virtuoso Universal Server has a hybrid architecture that can serve as storage for multiple data models, such as relational data, RDF, XML, and text documents. Virtuoso supports a repository management interface and faceted browsing of the data. It can run as a Web Document server, Linked Data server and Web Application server. The open source version of Virtuoso is included in the LOD2 stack and is widely used for uploading data into its Quad store; it offers a SPARQL endpoint and a mechanism called URL-Rewriter to make URIs dereferenceable.

Fig. 9. Publishing pattern for registering data sets

For the fourth step, sharing the data in a well-known open data hub such as CKAN facilitates their discovery by other businesses and data publishers. The functionality of CKAN is based on packages to which data sets can be uploaded. CKAN also supports updates and keeps track of changes, versions and author information. It is good practice to accompany the data sets with information files (e.g. a VoID file) that contain relevant metadata (Figs. 9, 10).
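
As an illustration of such an information file, the following Python sketch writes a minimal VoID description in Turtle. The `void:` and `dcterms:` terms belong to the published VoID and Dublin Core vocabularies; the dataset URI, title, endpoint, dump location and triple count are placeholder values assumed for the example, and the function names are hypothetical.

```python
def turtle_string(value):
    """Quote a Python string as a Turtle literal, escaping special characters."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def void_description(dataset_uri, title, sparql_endpoint, data_dump, triples):
    """Return a minimal VoID description of one data set as Turtle text."""
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f"    dcterms:title {turtle_string(title)} ;",
        f"    void:sparqlEndpoint <{sparql_endpoint}> ;",
        f"    void:dataDump <{data_dump}> ;",
        f"    void:triples {int(triples)} .",
    ])


# All values below are placeholders, not real resources.
print(void_description(
    dataset_uri="http://example.org/dataset/products",
    title="Product catalogue",
    sparql_endpoint="http://example.org/sparql",
    data_dump="http://example.org/dumps/products.rdf",
    triples=1200,
))
```

The resulting file can be uploaded next to the RDF dump so that consumers learn where the endpoint and the dump are before downloading anything.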

  • [1] geonames.org/
  • [2] freebase.com/
  • [3] virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtSponger
  • [4] rdfa.info
  • [5] w3.org/TR/grddl/
  • [6] microformats.org/
  • [7] poolparty.biz
  • [8] github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
  • [9] stanbol.apache.org/
  • [10] ckan.org/
 