Procurement Data Extraction and Pre-processing

Although procurement data are published in some form in most countries of the world, we focused on three groups of sources:

1. The European TED (Tenders Electronic Daily) portal,[1] which contains data from a number of countries (thus allowing for cross-country comparisons, as shown in [9]), although only a subset of their notices, typically those for contracts above a certain price threshold.

2. The Czech and Polish procurement data portals; the lead partners in the procurement linked data activity of the LOD2 project are based in these two countries and therefore have good contacts with the national publishing agencies, knowledge of the local regulations, and fluency in the languages in which the unstructured parts of the data are written.

3. US and UK procurement data portals, as these are the countries where the open government data publishing campaign started first, so their procurement data sources are likely to be sufficiently rich and well curated.

Regarding the source format of the data, the TED and Czech data were initially available only as HTML and were published in XML only at a later phase. In contrast, the Polish, US and UK data have been available in XML from the beginning. Data extraction (and RDFization) methods for both formats have therefore been investigated.

Data Extraction from HTML

TED and the Czech national portal ISVZUS (later renamed the Public Procurement Bulletin) were the prime targets in the initial phase of the project. At that time, only HTML pages were available for these resources. Fig. 2 shows two HTML fragments with information about one lot, on which we can demonstrate different flavors of HTML-based data extraction. Both contain red-labelled sections numbered 1 to 4 (the related properties are listed in Table 1).

Table 1. PCO property mapping to HTML fragments

  #   PCO property
  1   dc:title + adms:identifier
  2   pc:numberOfTenders
  3   pc:supplier
  4   pc:offeredPrice

The left side of Fig. 2 depicts a fragment of a TED HTML document. The data is stored in div elements interleaved with additional textual information. Section 1 of the document combines the lot ID and the lot name in a single string, so these two properties have to be split apart. Section 2 contains only one property, with a textual label that has to be removed. In sections 3 and 4, the fields are separated by br tags and combined with additional labels. A sketch of the splitting step for section 1 is given below.
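The splitting of a combined field can be done with a regular expression over the element text. The following is a minimal JSoup-based sketch; the div class and the exact wording of the combined string ("Lot No 3 - ...") are hypothetical, as the real TED markup varies across notice forms.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TedLotSplitter {
        // Hypothetical format of the combined field; the actual TED
        // markup and wording differ across notice forms.
        private static final Pattern LOT =
                Pattern.compile("Lot No\\s*(\\d+)\\s*-\\s*(.+)");

        public static void main(String[] args) {
            String html = "<div class=\"lot\">Lot No 3 - Road maintenance services</div>";
            Document doc = Jsoup.parse(html);
            String combined = doc.select("div.lot").first().text();

            // Split the single string into the two PCO properties.
            Matcher m = LOT.matcher(combined);
            if (m.matches()) {
                System.out.println("adms:identifier = " + m.group(1));
                System.out.println("dc:title        = " + m.group(2));
            }
        }
    }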

In contrast, the data in ISVZUS is strictly structured using input elements with unique id attributes (see the right side of Fig. 2), which allows the data fields to be accessed without any additional transformation.
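Accessing such a field then amounts to a single id lookup. The sketch below assumes hypothetical id values and input markup; the real ISVZUS form uses different identifiers.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class IsvzusExtractor {
        public static void main(String[] args) {
            // Hypothetical ISVZUS-like fragment; the real id values differ.
            String html = "<input id=\"lot_pocetNabidek\" value=\"4\"/>"
                        + "<input id=\"lot_cenaCelkem\" value=\"1250000\"/>";
            Document doc = Jsoup.parse(html);

            // val() reads the value attribute of a form element, so no
            // further transformation of the field content is needed.
            System.out.println("pc:numberOfTenders = "
                    + doc.getElementById("lot_pocetNabidek").val());
            System.out.println("pc:offeredPrice    = "
                    + doc.getElementById("lot_cenaCelkem").val());
        }
    }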

Technologically, the extraction was based on CSS selectors and, where the CSS selectors did not suffice, pseudo-selectors[2] that allow searching for elements containing a given substring. In some cases the element content had to be modified or shortened, for which we applied regular expressions.
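As an illustration of combining selectors, pseudo-selectors and regular expressions, the following sketch locates a labelled field via JSoup's :contains pseudo-selector and then strips the label; the fragment and label text are hypothetical.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class LabelStripper {
        public static void main(String[] args) {
            // Hypothetical TED-like fragment with the label glued to the value.
            String html = "<div>Number of tenders received: 4</div>"
                        + "<div>Award criteria: lowest price</div>";
            Document doc = Jsoup.parse(html);

            // :contains(...) matches elements whose text contains the given
            // substring, which helps when no class or id distinguishes them.
            String raw = doc.select("div:contains(Number of tenders)")
                            .first().text();

            // Shorten the element content with a regular expression,
            // keeping only the value (here, for pc:numberOfTenders).
            String value = raw.replaceAll("^Number of tenders received:\\s*", "");
            System.out.println(value);  // prints "4"
        }
    }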

The HTML-based ETL activities for both resources were later suspended when the future availability of full data in XML (rather than mere HTML) was announced. The processing was resumed in spring 2014 based on XML dumps (and, for the Czech data, also an XML-based SOAP API), which are more reliable than data obtained via information extraction from semi-structured text embedded in HTML.

Fig. 2. TED fragment (left) and ISVZUS fragment (right)

  • [1] ted.europa.eu/TED/
  • [2] Provided by the JSoup library, jsoup.org/
 