Architecture of Business Intelligence Systems
The functionality of BI systems is greatly influenced by their architecture. It usually consists of four main elements (Figure 2.3):
■ Data warehouse — a place of storing thematically aggregated and analyzed data.
■ Data extraction and transmission tools - mainly responsible for the transfer of data from various sources, from transactional systems and the Internet to the data warehouse in particular.
Figure 2.3 Architecture of Business Intelligence system.
■ Data analysis and exploration tools — enabling users to access data as well as their comprehensive analysis and mining.
■ Data visualization - tools and applications designed to provide information in a transparent and accessible form.
The problem of modern organizations is not lack of information but its excess and dispersion. To make effective decisions, various sources of information need to be reached and then integrated and comprehensively analyzed. Different data formats, lack of consistency and integration between data mean that a lot of information is lost or unused. A data warehouse is considered a technology that integrates data from various dispersed sources.
A data warehouse is a collection of thematically integrated, non-volatile and historical data to support organizational decisions (Inmon, 1996). It aids an enterprise in gaining a coherent, multidimensional view of the business (impossible to obtain by using transactional systems), which can be analyzed in various cross sections and perspectives, also in a historical context. In contrast to the classic database, a data warehouse provides the opportunity to track the most important indicators from the perspective of an organization, such as product profitability, credit risk, customer satisfaction level, customer turnover, and quality of the production process that affect strategic decision-making and ongoing tasks.
There are usually two approaches that dominate building a data warehouse, namely top-down and bottom-up. In the top-down approach, the construction of a central data warehouse system is assumed, on which thematic data warehouses are created. The bottom-up approach begins with the creation of thematic data warehouses, which only later are attempted to be integrated.
The data warehouse project should be created in such a way as to allow later creation of reporting mechanisms, queries, multidimensional analyses, and data mining in the easiest way. Experience shows that “star,” “snowflake,” or constellation models are usually aimed at. In the first case, the factual data are stored in a central table surrounded by reference tables, which contain data on individual dimensions needed to make decisions. In the snowflake pattern, each dimension can have several dimensions of its own. This means that reference tables are not denormalized. A constellation-type scheme is an intermediate solution. Some tables are denormalized and others are normalized.
Today, there are many different methodologies for building a data warehouse. These include, but are not limited to, the methodology of Oracle; Warehousing Institute; Rapid Warehouse Methodology, proposed by SAS Institute; and the methodology for implementing the SAFE/DW data warehouse by Sybase and its partners.
The usefulness of a data warehouse, to a large extent, depends on the quality of data collected. Data must be obtained and integrated in advance before being entered into a repository which is a data warehouse. This involves implementing efficient data extraction mechanisms, standardization of data (coding and naming), cleaning, aggregating, and transforming data into a consistent form. The Extraction- Transformation-Load (ETL) tools are responsible for the implementation of these activities. ETL operation usually amounts to three stages: extraction, transformation, and data loading (Meyer, 2001).
Data extraction involves accessing data from various sources (databases, text files, HTML documents, XML, e-mail). This usually leads to data being saved in a relational database, which significantly facilitates their further processing at the transformation stage. The most important problems related to the data extraction stage include the need to access many sources that are stored in various formats, the temporary unavailability of some data sources, or the fact that they are outdated.
Data transformation is considered to be the most complex stage of the ETL process. It involves transforming the data into a common format, calculating the necessary aggregates, identifying missing or repeating data. The transformation process is usually carried out using traditional programming languages, scripting languages or SQL (Structured Query Language). When designing transformation procedures, skillful selection of appropriate data processing techniques is very important, which can have a significant impact on the efficiency of the entire ETL process.
In turn, loading data involves supplying a data warehouse with integrated and “cleaned” data. Due to the fact that this process is often associated with switching the system to an off-line mode, it becomes important to minimize the time needed for data transfer at this point.
An analysis of the software market indicates that the ETL tools are functionally quite different. They can be divided into the following categories (Meyer, 2001):
■ EtL tools which focus on the processes of data extraction and loading
■ eTL or ET1 tools which prefer specific types of source data or target data
■ eTL tools that perform the data transformation process relatively well, but they are not able to support many typical data formats
■ ETL tools with an integrated environment equipped with many solutions supporting the user during the construction of an ETL system.
The most well-known ETL solutions include: SAS Q, DataStage EX, Oracle Warehouse Builder, DataMirror Transformation Server and Microsoft Data Transformation Services, ’[hey differ among themselves, inter alia, by data access, data processing algorithms, as well as data cleaning and searching components. The choice of a specific ETL tool is an individual matter, largely dependent on the needs of an organization, size, and amount of resources it uses.