Lifting Open Data Portals to the Data Web
Abstract. Recently, a large number of open data repositories, catalogs and portals have been emerging in the scientific and government realms. In this chapter, we characterise this newly emerging class of information systems. We describe the key functionality of open data portals, present a conceptual model and showcase the pan-European data portal PublicData.eu as a prominent example. Using examples from Serbia and Poland, we present an approach for lifting the often semantically shallow datasets registered at such data portals to Linked Data in order to make data portals the backbone of a distributed global data warehouse for our information society on the Web.
Public Data and Data Portals
Although there are many different sources of data, government data is particularly important because of its scale, breadth, and status as the canonical source of information on a wide range of subjects. Governments gather data in many areas: demographics, elections, government budgets and spending, various types of geospatial data, environmental data, transport and planning, etc. While the data is gathered to support the functions of government, it is increasingly recognised that by publishing government data under permissive open licences (with due precautions to avoid publishing sensitive or personal data), huge amounts of value can be unlocked.
The growth of open government data has been particularly striking in Europe. The EU recognised its advantages very early, and issued the PSI (Public Sector Information) Directive in December 2003. This encouraged governments to make their data available without restrictions on its use. However, though forward-looking at the time, the directive did allow for charging for use of data, provided that the charges did not exceed those calculated on a cost-recovery basis. It therefore did not require what would now be considered 'open' data. The Directive was revised in 2013, bringing more public bodies within scope and encouraging free or marginal-cost, rather than recovery-cost, pricing, reflecting what was by then already practice in many EU states. A study in 2011 for the EU estimated the economic value of releasing public sector data throughout the EU at between 30 and 140 billion EUR.
Apart from its economic importance, public data raises several further issues, which we briefly characterise below: discoverability, harvesting, interoperability, and community engagement.
Discoverability. One of the first problems to be solved when working with any data is where to find it. In using data, one needs exactly the right dataset (with the right variables, for the right year, the right area, etc.). Web search engines, while excellent at finding documents relevant to a given term, do not have enough metadata to find datasets like this, particularly since their main use case is finding web pages rather than data. There is little point in publishing data if no one can find it, so how are governments to make their data 'discoverable'? One possibility would be to link it from an on-line tree-structured directory, but such structures are hard to design and maintain, difficult to extend, and do not really solve the problem of making nodes findable when there is a very large number of them (governments typically have at least tens of thousands of datasets).
To solve this problem of discoverability, an increasing number of governments have in the last few years set up data portals: specialised sites where a publishing interface allows datasets to be uploaded and equipped with high-quality metadata. Using this metadata, users can then quickly find the data they need with searching and filtering features. One good example is the European Open Data portal, which is developed by LOD2 partners using LOD2 stack tools.
Numerous countries, including a good number of EU member states, have followed, along with some local (e.g. city) governments.
Harvesting. Many of these portals use CKAN, a free, open-source data portal platform developed and maintained by Open Knowledge. As a result they have a standard, powerful API, which raises the possibility of combining their catalogues to create a single Europe-wide entry point for finding and using public data. This has been done as part of the LOD2 project: the result is PublicData.eu, a data portal also powered by CKAN, which uses the platform's 'harvesting' mechanism to copy metadata records of many thousands of datasets from government data portals in over a dozen countries, with new ones added when they become available. Some other portals are also harvested (e.g. city-level portals or community-run catalogues of available government data). Sites are regularly polled for changes, ensuring that the aggregate catalogue at PublicData.eu stays roughly in sync with the original catalogues. The PublicData.eu portal is described in more detail in Sect. 2.
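The harvesting step described above can be illustrated with a minimal Python sketch. It is not the code of CKAN's own harvesting mechanism; it only assumes that the source portal exposes the `package_search` call of the CKAN Action API, with its `rows` and `start` paging parameters. The function names `fetch_json`, `harvest` and `sync`, and the use of a plain dictionary as the aggregate catalogue, are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_json(url):
    """Default fetcher: GET a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def harvest(base_url, fetch=fetch_json, page_size=100):
    """Yield every dataset metadata record from one CKAN portal.

    Pages through the Action API call `package_search`
    using its `rows` and `start` parameters.
    """
    start = 0
    while True:
        query = urlencode({"rows": page_size, "start": start})
        payload = fetch(f"{base_url}/api/3/action/package_search?{query}")
        results = payload["result"]["results"]
        if not results:
            break
        yield from results
        start += len(results)
        if start >= payload["result"]["count"]:
            break


def sync(aggregate, source_name, records):
    """Copy new or changed records into the aggregate catalogue (a dict).

    A record counts as changed when its `metadata_modified` timestamp
    differs from the stored copy. Returns the number of records written.
    """
    changed = 0
    for record in records:
        key = (source_name, record["name"])
        known = aggregate.get(key)
        if known is None or known.get("metadata_modified") != record.get("metadata_modified"):
            aggregate[key] = record
            changed += 1
    return changed
```

Calling `sync(catalogue, "portal-a", harvest("https://portal-a.example"))` at regular intervals (the portal name and URL are placeholders) would correspond to the polling described above. Deletions at the source are not handled in this sketch.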
Interoperability. Non-CKAN portals can also be harvested if they provide a sufficiently powerful API, but for each different platform, some custom code must be written to link the platform's API to CKAN's harvesting mechanism. A few such sites are included in those harvested by PublicData.eu, but rather than writing endless pieces of code for custom harvesting, effort has instead been directed to working towards defining a standard simple interface which different data catalogues can use to expose their metadata for harvesting. This work in progress can be seen at spec.datacatalogs.org.
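The per-platform custom code mentioned above can be pictured as a set of small adapters that map each platform's native record onto one common form. The following Python sketch is an illustration only: the `CatalogueRecord` fields are chosen for the example and are not taken from the interface under discussion at spec.datacatalogs.org, and the second platform and its field names (`id`, `label`, `link`) are invented. Only `name`, `title` and `url` are actual fields of a CKAN dataset dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CatalogueRecord:
    """Minimal common metadata record (fields chosen for illustration)."""
    identifier: str
    title: str
    landing_page: str


def from_ckan(raw: dict) -> CatalogueRecord:
    # `name`, `title` and `url` are fields of a CKAN dataset dictionary.
    return CatalogueRecord(raw["name"], raw.get("title", ""), raw.get("url") or "")


def from_example_platform(raw: dict) -> CatalogueRecord:
    # Hypothetical non-CKAN platform: these field names are invented.
    return CatalogueRecord(str(raw["id"]), raw["label"], raw["link"])


# One adapter per platform; supporting a new platform means adding an entry.
ADAPTERS: Dict[str, Callable[[dict], CatalogueRecord]] = {
    "ckan": from_ckan,
    "example": from_example_platform,
}


def normalise(platform: str, raw: dict) -> CatalogueRecord:
    """Convert a platform-specific record into the common form."""
    try:
        adapter = ADAPTERS[platform]
    except KeyError:
        raise ValueError(f"no adapter registered for platform {platform!r}") from None
    return adapter(raw)
```

With a shared interface of the kind being defined, each catalogue would expose its metadata in one agreed form, and the harvester would no longer need an adapter table of this sort.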
Community Engagement. If governments want to realise the potential benefits of open data, it is not enough just to publish data and make it discoverable. Even the most discoverable data will not actually be discovered if no one knows that it exists. It is therefore recognised that best practice in data publishing includes an element of 'community engagement': not just waiting for potential users to find data, but identifying possible re-users, raising awareness, and encouraging re-use.