Polish Data

Public procurement data is published by the Public Procurement Office (Urzad Zamowien Publicznych[1]) in the Public Procurement Bulletin (Biuletyn Zamowien Publicznych – BZP[2]).

There are several ways to access the data: browsing the BZP portal, a subscription mechanism with a restricted number of criteria, and the download of XML files, which we used for the RDFization. The structure of the XML is essentially flat: even attributes that could be grouped are placed on the same level. This has implications for the parsing and conversion mechanisms. On the one hand, no subset of the XML data can be selected for further processing. On the other hand, the extraction expressions and XML paths are shorter. The XML files containing notices about public contracts were converted by means of Tripliser.[3] The RDFization had to overcome some issues in the XML structure, such as the use of consecutive numbers in the elements describing the individual suppliers (in Polish 'wykonawca') awarded the different lots of a contract: wykonawca 0, wykonawca 1, wykonawca 2 and so on. We also had to write our own extension functions for Tripliser, which allowed us to generate new identifiers for addresses, treated as data structures, from their parts: locality, postal code and street.
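The two workarounds described above can be illustrated with a minimal Python sketch. It is not the Tripliser mapping or the extension functions actually used. The element names (`wykonawca_0_nazwa` and so on), the sample notice, the base URI and the hashing scheme are all assumptions made for illustration only.

```python
import hashlib
import re
import xml.etree.ElementTree as ET

# Hypothetical flat notice: element names are illustrative, not the real BZP schema.
SAMPLE = """<notice>
  <wykonawca_0_nazwa>Firma A</wykonawca_0_nazwa>
  <wykonawca_0_miejscowosc>Poznan</wykonawca_0_miejscowosc>
  <wykonawca_0_kod_pocztowy>61-001</wykonawca_0_kod_pocztowy>
  <wykonawca_0_ulica>Glowna 1</wykonawca_0_ulica>
  <wykonawca_1_nazwa>Firma B</wykonawca_1_nazwa>
  <wykonawca_1_miejscowosc>Warszawa</wykonawca_1_miejscowosc>
  <wykonawca_1_kod_pocztowy>00-001</wykonawca_1_kod_pocztowy>
  <wykonawca_1_ulica>Polna 5</wykonawca_1_ulica>
</notice>"""

# Assumed naming pattern: wykonawca_<index>_<field>
NUMBERED = re.compile(r"^wykonawca_(\d+)_(\w+)$")


def group_suppliers(root):
    """Regroup numbered sibling elements into one dict per supplier."""
    suppliers = {}
    for child in root:
        match = NUMBERED.match(child.tag)
        if match:
            index, field = int(match.group(1)), match.group(2)
            suppliers.setdefault(index, {})[field] = (child.text or "").strip()
    return [suppliers[i] for i in sorted(suppliers)]


def address_id(locality, postal_code, street,
               base="http://example.com/address/"):  # hypothetical base URI
    """Derive a stable identifier for an address from its parts."""
    key = "|".join(part.strip().lower()
                   for part in (locality, postal_code, street))
    return base + hashlib.sha1(key.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    for s in group_suppliers(ET.fromstring(SAMPLE)):
        uri = address_id(s["miejscowosc"], s["kod_pocztowy"], s["ulica"])
        print(s["nazwa"], uri)
```

Hashing the normalized parts gives the same identifier whenever the same address recurs, so repeated addresses collapse into one resource. Whether the original extension functions worked this way is not stated in the text.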

Automatic linking was carried out with Silk,[4] one of the LOD2 stack[5] tools, to map the contact information of a given contracting authority or supplier to TERYT,[6] a classification of Polish territorial units.
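As a rough illustration of this kind of linking, the sketch below matches a locality name against a small list of territorial units by normalized string similarity. It uses Python's standard `difflib` rather than Silk. The unit identifiers, the sample names and the 0.85 threshold are placeholders chosen for the example, not TERYT data or the linkage rule actually used.

```python
from difflib import SequenceMatcher

# Placeholder identifiers and names; these are NOT real TERYT codes.
TERRITORIAL_UNITS = {
    "unit-1": "Warszawa",
    "unit-2": "Krakow",
    "unit-3": "Poznan",
}


def normalize(name):
    """Lower-case the name and collapse whitespace before comparison."""
    return " ".join(name.lower().split())


def best_match(locality, units, threshold=0.85):
    """Return (unit_id, score) for the most similar unit name.

    unit_id is None when no candidate reaches the threshold.
    """
    best_id, best_score = None, 0.0
    for unit_id, unit_name in units.items():
        score = SequenceMatcher(
            None, normalize(locality), normalize(unit_name)
        ).ratio()
        if score > best_score:
            best_id, best_score = unit_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


if __name__ == "__main__":
    for name in ("  WARSZAWA ", "Warszwa", "Gdansk"):
        print(repr(name), best_match(name, TERRITORIAL_UNITS))
```

A real linkage rule would typically combine several properties, for example locality and postal code, rather than rely on a single name comparison.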

U.S. Data

The dataset was created by combining data from two complementary sources: USASpending.gov[7] and Federal Business Opportunities (FBO).[8] USASpending.gov offers a database of government expenditures, including awarded public contracts, for which it records, e.g., the number of bidders. FBO, on the other hand, publishes public notices for ongoing calls for tenders. USASpending.gov provides data downloads in several structured data formats. We used the CSV dumps, which we converted to RDF using a SPARQL mapping[9] executed by tarql.[10] The data dump from FBO is available in XML as part of the Data.gov initiative.[11] To convert this data to RDF, we created an XSLT stylesheet that outputs RDF/XML.[12] As an additional dataset used by both USASpending.gov and FBO, we converted the FAR Product and Service Codes[13] to RDF using LODRefine,[14] an extraction tool from the LOD2 Stack.
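The row-by-row nature of a CSV-to-RDF mapping can be sketched in plain Python. This is not the tarql mapping referenced above. The column names, the sample rows, the base URI and the vocabulary terms are all hypothetical and serve only to show how each CSV row becomes a handful of triples.

```python
import csv
import io

# Hypothetical CSV extract; the real USASpending.gov dumps use other columns.
SAMPLE_CSV = "contract_id,number_of_offers\nC-001,3\nC-002,\n"

BASE = "http://example.com/contract/"                      # hypothetical
CONTRACT = "http://example.com/vocab#Contract"             # hypothetical
NUM_TENDERS = "http://example.com/vocab#numberOfTenders"   # hypothetical
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def rows_to_ntriples(csv_text):
    """Map each CSV row to N-Triples lines, skipping empty cells."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row['contract_id']}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{CONTRACT}> .")
        offers = row["number_of_offers"].strip()
        if offers:  # an empty cell yields no triple
            triples.append(
                f'{subject} <{NUM_TENDERS}> "{int(offers)}"^^<{XSD_INTEGER}> .'
            )
    return triples


if __name__ == "__main__":
    print("\n".join(rows_to_ntriples(SAMPLE_CSV)))
```

A declarative mapping expresses the same logic as a query pattern over column variables, which keeps the conversion rules separate from the code that executes them.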

The data resulting from the transformation to RDF was interlinked both internally and with external datasets. Internal linking was done in order to fuse equivalent instances of public contracts and business entities. Deduplication was performed using the data processing unit for UnifiedViews that wraps the Silk link discovery framework.[15] The output links were merged using the data fusion component of UnifiedViews.[16] Links to external resources were created either by using code-based URI templates in the transformation to RDF or by instance matching based on the converted data. The use of codes as strong identifiers enabled automatic generation of links to FAR codes and the North American Industry Classification System 2012,[17] two controlled vocabularies used to express objects and kinds of public contracts. Instance matching was applied to discover links to DBpedia[18] and OpenCorporates.[19] Links to DBpedia were created for populated places referred to from postal addresses in the U.S. procurement dataset. Furthermore, OpenCorporates was used as the target for linking the bidding companies. This task was carried out using the batch reconciliation API of OpenCorporates via the interface in LODRefine.
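Link generation from code-based URI templates can be sketched as follows. The template URIs and the sample codes are hypothetical placeholders, not the URIs used in the published datasets. The point is only that a code serving as a strong identifier can be turned into a link deterministically, without any matching step.

```python
from urllib.parse import quote

# Hypothetical URI templates; the actual dataset URIs are not given in the text.
TEMPLATES = {
    "naics": "http://example.com/naics/2012/{code}",
    "psc": "http://example.com/far-psc/{code}",
}


def code_uri(scheme, code):
    """Build a link target by substituting a cleaned code into a template."""
    cleaned = code.strip().upper()
    if not cleaned:
        raise ValueError("empty code")
    # Percent-encode the code so the result is always a valid URI path segment.
    return TEMPLATES[scheme].format(code=quote(cleaned, safe=""))


if __name__ == "__main__":
    print(code_uri("naics", " 541511 "))
    print(code_uri("psc", "d302"))
```

Because both the source record and the controlled vocabulary derive their URIs from the same code, the link holds by construction, whereas instance matching has to infer links from descriptive properties.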

  • [1] uzp.gov.pl
  • [2] uzp.gov.pl/BZP/
  • [3] A Java library and command-line tool for creating triple graphs from XML, https://github.com/daverog/Tripliser
  • [4] See Chap. 1 of this book
  • [5] stack.linkeddata.org
  • [6] teryt.stat.gov.pl/
  • [7] usaspending.gov/
  • [8] https://fbo.gov/
  • [9] https://github.com/opendatacz/USASpending2RDF
  • [10] https://github.com/cygri/tarql
  • [11] ftp://ftp.fbo.gov/datagov/
  • [12] https://github.com/opendatacz/FBO2RDF
  • [13] acquisition.gov/
  • [14] code.zemanta.com/sparkica/
  • [15] wifo5-03.informatik.uni-mannheim.de/bizer/silk/
  • [16] Developed previously for ODCleanStore, the predecessor of UnifiedViews [6]
  • [17] census.gov/eos/www/naics/index.html
  • [18] dbpedia.org
  • [19] https://opencorporates.com/