Slice Generation

After converting the source data to a data cube, we still have only detailed data. In order to show useful statistics, and in particular to visualise the data in CubeViz, we need to provide slices and aggregations. These artefacts need to be modelled and included along with the RDF data.

In order to generate slices, we provided a Python script for the materialisation of datasets. It first queries for the list of dimensions enumerated in the data structure definition (DSD). Then, all elements (members) of the dimensions are retrieved. We assume that a slice contains two dimensions (for use in two-dimensional charts); therefore, pairwise combinations of all dimensions are generated. Finally, the respective links between observations, datasets and slices are generated.[1]
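
The script itself is not reproduced here; the following is a minimal sketch of the procedure, assuming a SPARQLWrapper client, a hypothetical Virtuoso endpoint and example URIs (the actual slice-URI pattern and the output step are omitted):

# Minimal sketch of the slice materialisation described above. The endpoint,
# the DSD URI and the dimension/member queries are illustrative assumptions.
from itertools import combinations, product
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:8890/sparql"       # assumed Virtuoso endpoint
DSD = "http://example.org/insigos/dsd"          # hypothetical DSD URI
QB = 'PREFIX qb: <http://purl.org/linked-data/cube#>'

client = SPARQLWrapper(ENDPOINT)
client.setReturnFormat(JSON)

def rows(query):
    client.setQuery(query)
    return client.query().convert()["results"]["bindings"]

# 1. Dimensions enumerated in the data structure definition (DSD)
dimensions = [r["dim"]["value"] for r in rows(f"""{QB}
    SELECT ?dim WHERE {{ <{DSD}> qb:component/qb:dimension ?dim . }}""")]

# 2. All elements (members) of every dimension, read from the observations
members = {d: [r["m"]["value"] for r in rows(f"""{QB}
    SELECT DISTINCT ?m WHERE {{ ?obs a qb:Observation ; <{d}> ?m . }}""")]
           for d in dimensions}

# 3. Pairwise combinations: every slice keeps two dimensions free for charting,
#    so the corresponding slice key fixes the remaining dimensions
for free_pair in combinations(dimensions, 2):
    fixed = [d for d in dimensions if d not in free_pair]
    # 4. One slice per combination of values of the fixed dimensions; here the
    #    links between dataset, slice and observations would be generated
    #    (e.g. written out as Turtle or sent as SPARQL INSERT queries)
    for values in product(*(members[d] for d in fixed)):
        ...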

Listing 3 presents relevant parts of our generated data. In this example, a slice key with fixed dimensions for indicator and unit is defined – slicekey-indicator-unit. Using this structure, several slices are generated, one for each combination of values of the fixed dimensions. One of them is the slice with values fixed to 'export' (indicator) and 'EUR' (unit) – export-EUR. The last line, with qb:observation, is shown with an ellipsis because there are in fact 1330 observations attached.

Listing 3. Data cube slices in INSIGOS data
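
As a rough illustration of the structure just described, a slice key fixing indicator and unit, together with one of the generated slices, could be expressed along the following lines; the ex: namespace and the observation URIs are hypothetical, and this fragment is not the original listing:

# Illustrative shape of the generated slice data, parsed with rdflib to check
# that the fragment is well-formed. All ex: URIs are hypothetical.
from rdflib import Graph

SLICE_EXAMPLE = """
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/insigos/> .

ex:slicekey-indicator-unit a qb:SliceKey ;
    qb:componentProperty ex:indicator, ex:unit .

ex:slice-export-EUR a qb:Slice ;
    qb:sliceStructure ex:slicekey-indicator-unit ;
    ex:indicator ex:export ;
    ex:unit ex:EUR ;
    qb:observation ex:obs-00001, ex:obs-00002 .  # 1330 observations in the real data
"""

g = Graph()
g.parse(data=SLICE_EXAMPLE, format="turtle")
print(len(g), "triples parsed")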

Aggregations

Aggregations are commonly used when reporting on multidimensional data. The most prominent example is a data warehouse with OLAP cubes, where aggregations are selected and precomputed in such a way that reporting is sped up. By analogy, aggregations can also be considered useful for cubes defined as linked data.

In our case, aggregation is necessary for drill-down operations. For example, daily data can be aggregated on a monthly basis to better observe a phenomenon. We can also display data as yearly sums and allow drill-down to more detailed levels. SPARQL is capable of calculating sums on the fly, but this takes time and sometimes a time-out is reached.[2] Materialisation is therefore necessary for quicker operation.
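
For instance, an on-the-fly aggregation of this kind boils down to a SPARQL SELECT with SUM and GROUP BY, along the lines of the sketch below; the endpoint and property URIs are hypothetical, and the public procurement query from the footnote is not reproduced here:

# Hedged illustration of an on-the-fly SPARQL aggregation (totals per
# indicator). On large datasets such queries can hit the endpoint time-out.
from SPARQLWrapper import SPARQLWrapper, JSON

client = SPARQLWrapper("http://localhost:8890/sparql")   # assumed endpoint
client.setReturnFormat(JSON)
client.setQuery("""
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ex: <http://example.org/insigos/>

SELECT ?indicator (SUM(?value) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       ex:indicator ?indicator ;
       ex:value     ?value .
}
GROUP BY ?indicator
""")
totals = client.query().convert()["results"]["bindings"]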

Our first idea was to prepare aggregations with a Python script, similarly to the slicing. However, that would have required too much querying and would have been inefficient. In the end, we found a way to implement the aggregation method as a set of SPARQL queries.

One of the issues was the generation of URIs for new observations, as an aggregate is in fact a new observation – it has the same dimensions, but the values are at a higher level, e.g. a year instead of a month. For INSIGOS/POLGOS observations we have defined a pattern for identifiers, and we used the capabilities of Virtuoso to generate such identifiers directly in SPARQL.

Before aggregation is performed, a correct hierarchy has to be prepared. The prerequisite for the script is that dimensions are represented as SKOS concept schemes and that the elements of a dimension are organised in a hierarchy with the skos:narrower property.
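
Under that assumption, one of the aggregation queries could be sketched as the SPARQL 1.1 INSERT below, sent to the update endpoint from Python. The graph layout, property URIs and measure are illustrative, and the IRI(CONCAT(..., MD5(...))) expression is merely a portable way of minting deterministic observation URIs, not the exact Virtuoso identifier pattern used in the project:

# Hedged sketch: materialising yearly aggregates as new observations. All ex:
# URIs, the target dataset and the URI-minting expression are assumptions.
from SPARQLWrapper import SPARQLWrapper, POST

AGGREGATE_YEARLY = """
PREFIX qb:   <http://purl.org/linked-data/cube#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ex:   <http://example.org/insigos/>

INSERT {
  ?newObs a qb:Observation ;
          qb:dataSet   ex:dataset-yearly ;
          ex:indicator ?indicator ;
          ex:refPeriod ?year ;
          ex:value     ?total .
}
WHERE {
  {
    # Roll monthly observations up to their parent year via the SKOS hierarchy;
    # in practice every remaining dimension (unit, partner, ...) would also be
    # carried through the GROUP BY.
    SELECT ?indicator ?year (SUM(?value) AS ?total)
    WHERE {
      ?year skos:narrower ?month .
      ?obs  a qb:Observation ;
            ex:indicator ?indicator ;
            ex:refPeriod ?month ;
            ex:value     ?value .
    }
    GROUP BY ?indicator ?year
  }
  # Mint a deterministic URI for the new, aggregated observation (illustrative)
  BIND(IRI(CONCAT("http://example.org/insigos/obs/agg-",
                  MD5(CONCAT(STR(?indicator), STR(?year))))) AS ?newObs)
}
"""

client = SPARQLWrapper("http://localhost:8890/sparql")   # assumed endpoint
client.setMethod(POST)                                    # updates are sent via POST
client.setQuery(AGGREGATE_YEARLY)
client.query()                                            # authentication may be required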

  • [1] A mechanism should be introduced to allow querying for observations instead of explicit assignments. Currently, such assignments require the materialisation of a large number of additional triples, which makes the solution questionable in enterprise data warehouse settings, considering the volume of data.
  • [2] For example, a query calculating the value of public procurement contracts by voivodeship takes 100 seconds, which is outside of acceptable response times.
 