The ever-increasing amounts of geospatial data are problematic for many organizations; without the right tools, they cannot extract value from these data. Recent advances in data capture and computation methods have transformed the way organizations handle and process data. The rate at which geospatial data are generated exceeds the ability to organize and analyze them to extract patterns critical for understanding a constantly changing world. For example, Google processes about 25 PB of data per day, a significant portion of which is geospatial. Although computational and analytical methods have not kept pace with the growth of geospatial data, considerable progress has been made in this area. To analyze these data efficiently, the management and retrieval processes must be organized and centralized into accessible storage. Recent innovations have produced a range of new data management solutions, for example, Globus Online (GO), the rsync algorithm, YouSendIt, Dropbox, BitTorrent, content distribution networks, and the PhEDEx data service (Allen et al. 2012). Figure 9.3 illustrates the elements of data management, from the first stage of combining data from multiple sources through their presentation. The centralization of data management and retrieval is referred to as data warehousing, whereas the actual analysis of the data is referred to as data mining. In this chapter, the details of these terms are discussed.
Kimball and Ross (2013) describe the data warehouse as a complete ecosystem for extracting, cleaning, integrating, and delivering data to decision makers; it therefore includes the extract-transform-load (ETL) and business intelligence (BI), or analysis, functions.
Figure 9.3: Elements of the data management workflow, showing different platforms, software infrastructure, tools, and methods.
Data Sources, Processing Tools, and the Extract-Transform-Load Process
The first component of the ETL process is the extraction of data from each of the individual sources (these can include historical data in the form of flat files or operational databases) into a temporary staging area where data integration takes place. Data extraction methods can be divided into two categories:
• Logical extraction: Can be a full extraction of the complete dataset from the source or an incremental extraction (change data capture) of only the data that changed in a specified time period
• Physical extraction: Can be done online, directly from the source system, or offline, from a copy staged explicitly outside the original source system
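The two logical extraction modes can be illustrated with a short sketch. This is a minimal example, assuming a hypothetical source table `parcels` with an `updated_at` column to support change data capture; the table name, schema, and dates are illustrative assumptions, not from the text.

```python
import sqlite3

# Hypothetical operational source table (names and schema are assumptions).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE parcels (id INTEGER, geom TEXT, updated_at TEXT)")
src.executemany("INSERT INTO parcels VALUES (?, ?, ?)",
                [(1, "POINT(0 0)", "2024-01-01"),
                 (2, "POINT(1 1)", "2024-03-15")])

def full_extract(conn):
    """Full extraction: pull the complete dataset from the source."""
    return conn.execute("SELECT * FROM parcels").fetchall()

def incremental_extract(conn, since):
    """Change data capture: only rows modified after the last run."""
    return conn.execute(
        "SELECT * FROM parcels WHERE updated_at > ?", (since,)).fetchall()

print(len(full_extract(src)))                       # 2 rows
print(len(incremental_extract(src, "2024-02-01")))  # 1 changed row
```

An incremental run only moves the rows changed since the previous extraction, which is what makes frequent refreshes of a large warehouse affordable.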
Data transformations are usually the most complex and time-consuming part of the ETL process. They range from simple data conversions to extremely complex data scrubbing techniques. Data can be transformed in two ways:
• Multistage transformation: Data are transformed and validated in multiple stages outside the database before being inserted into the warehouse tables.
• Pipelined data transformation: The database's capabilities are utilized, and data are transformed while being loaded into the database.
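The difference between the two approaches can be sketched in a few lines. In this hypothetical example, the cleaning steps (normalizing a city name and casting a population count) stand in for real transformation rules; a pipelined flow streams each row through the transforms as it is loaded, while a multistage flow materializes every intermediate stage in a staging area.

```python
# Illustrative raw rows; field names and rules are assumptions.
raw = [{"city": " denver ", "pop": "715522"},
       {"city": "BOULDER", "pop": "108250"}]

def clean_city(rows):
    # Transformation stage 1: normalize the city name.
    for r in rows:
        yield {**r, "city": r["city"].strip().title()}

def cast_pop(rows):
    # Transformation stage 2: cast the population to an integer.
    for r in rows:
        yield {**r, "pop": int(r["pop"])}

# Pipelined: rows stream through both transforms as they are loaded.
warehouse = list(cast_pop(clean_city(raw)))

# Multistage: each intermediate result is materialized before the next stage.
stage1 = list(clean_city(raw))
stage2 = list(cast_pop(stage1))

assert warehouse == stage2
print(warehouse[0])  # {'city': 'Denver', 'pop': 715522}
```

Both routes produce identical warehouse rows; the pipelined version simply avoids writing the intermediate stage out, which is why it tends to scale better when the database itself performs the transformations.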
Using data quality tools, one can ensure that the correct data and format are loaded into the warehouse. This process can be done manually, using code created by programmers, or automated with one of the ETL tools available on the market; popular tools include Oracle Warehouse Builder, SAP's Data Integrator and Data Services, and IBM Information Server. The result of this process is metadata and standardized data, which are then loaded into a data warehouse. Metadata is "data about the data," which may include mapping rules, ETL rules, descriptions of source data, and precalculated field rules. Some of the benefits of this ETL process include:
• One source of truth: All the data are stored in the same format, ensuring their consistency and accuracy.
• Reduction of resources: The reduction of interface programs used to access the consolidated data results in a reduction of resources.
• Improved planning and decision-making: Strategic planning and organization-wide decision-making are greatly improved.
• More timely data: Having data in one location speeds up access and processing time and reduces problems related to timing discrepancies.
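The metadata produced by this process ("data about the data") can be pictured as a simple record attached to each warehouse field. The structure, field names, and rules below are illustrative assumptions, not a standard format:

```python
# Hypothetical metadata record describing one warehouse field.
metadata = {
    "target_field": "population",
    "source": "census source file, POP_TOTAL column",  # description of source data
    "mapping_rule": "cast to integer",                 # mapping rule
    "etl_rule": "reject rows where value < 0",         # ETL rule
    "precalculated": False,                            # precalculated field rule
}

def validate(record, meta):
    """Apply the ETL rule encoded in the metadata before loading a row."""
    value = int(record[meta["target_field"]])  # the mapping rule
    if value < 0:                              # the ETL rule
        raise ValueError("rejected by ETL rule: " + meta["etl_rule"])
    return value

print(validate({"population": "715522"}, metadata))  # 715522
```

Keeping such rules in metadata, rather than buried in interface programs, is what makes the "one source of truth" and "reduction of resources" benefits above achievable.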
Data integration is the process of combining data from multiple sources into one common representation, with the goal of providing users with a single version of the truth. This is a very important step in data warehousing, since the quality of the data fed into the system determines the accuracy and reliability of the resulting business decisions.
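A toy sketch of the integration step: two hypothetical sources describe the same kind of entity under different schemas, and each is mapped into one common representation (all names here are assumptions for illustration):

```python
# Two sources with incompatible schemas for the same entity type.
source_a = [{"parcel_id": 1, "owner": "ACME"}]
source_b = [{"PID": 2, "OWNER_NAME": "Globex"}]

def from_a(r):
    # Map source A's schema to the common representation.
    return {"id": r["parcel_id"], "owner": r["owner"]}

def from_b(r):
    # Map source B's schema to the same common representation.
    return {"id": r["PID"], "owner": r["OWNER_NAME"]}

integrated = [from_a(r) for r in source_a] + [from_b(r) for r in source_b]
print(integrated)
# [{'id': 1, 'owner': 'ACME'}, {'id': 2, 'owner': 'Globex'}]
```

After this mapping, every downstream query sees one schema regardless of where a row originated, which is what "one version of the truth" means in practice.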
Data Integration and Storage
In a data warehouse, data are subject oriented, integrated, nonvolatile, and time variant. Spatial data warehouses host data for analysis, separating them from the transaction workload and thus enabling organizations to consolidate data from multiple sources. The primary purpose of a spatial data warehouse is to organize these data according to the organization's business model to support management decision-making. Many decisions consider a broader view of the business and require foresight beyond the details of day-to-day operations. Spatial data warehouses are built to view businesses over time and spot trends, which is why they require large amounts of data from multiple sources. The analysis capability of a data warehouse enables users to view data across multiple dimensions. The use of a single repository for an organization's data promotes interdepartmental coordination and greatly improves data quality. A spatial data warehouse may contain metadata, summary data, and the raw data of a traditional transactional system. Summaries are very valuable because they precompute long operations in advance, which improves query performance. In cases where organizations need to separate their data by business function, data marts can be included for this purpose.
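The value of precomputed summaries can be shown with a small sketch, assuming a hypothetical `sales` detail table: the expensive aggregation runs once, during a batch load, and subsequent queries read the much smaller summary table.

```python
import sqlite3

# Hypothetical detail table (names and values are assumptions).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("west", 10.0), ("west", 5.0), ("east", 7.5)])

# Batch-window step: materialize the summary in advance.
db.execute("CREATE TABLE sales_by_region AS "
           "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

# Analyst queries now read the precomputed rows instead of scanning details.
rows = db.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
print(rows)  # [('east', 7.5), ('west', 15.0)]
```

On a warehouse with billions of detail rows, the same pattern turns a long full-table scan into a lookup over a few precomputed rows.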
Spatial data warehouses read trillions of bytes of data and therefore require specialized databases that can support this processing. Most data warehouses are bimodal, with batch windows (usually in the evenings) during which new data are loaded, indexed, and summarized. To accommodate these shifts in processing, the server must be able to support parallel, large-table, full-table scans for data aggregation and provide on-demand central processing unit (CPU) and random-access memory (RAM) resources, and the database management system must be able to dynamically reconfigure its resources. Overall, data warehouses provide many advantages to the end user, including, but not limited to, improved data access and analysis, increased data consistency, and reduced costs for accessing historical data.