Selection of Certain Columns To Be Loaded
It's necessary to choose the columns that should be loaded. Here are the conditions under which columns need to be loaded:
- ? Translating coded values. For example, the source system is storing "M" for man and "W" for woman, but the data warehouse wants to store the value 1 for man and 2 for woman.
- ? Mapping of values. For example, mapping of the values "Man," "M" and "Mr." into the new value 1.
- ? Calculating a new calculated value. For example, sales = number x unit price.
- ? Joining from different sources. For example, to look-up or merge.
- ? Summing up of several rows of data. For example, total sales for all regions.
- ? Generating a surrogate key. This is a unique value attributed to a row or an object in the database. The surrogate key is not in the source system;it is attributed by the ETL tool.
- ? Transposing. Changing multiple columns to multiple rows or vice versa.
In the load phase of the ETL process, data is entered in the data warehouse or moved from one area of the data warehouse to another. There is always a target table filled with the results of the transformation in the load procedure. Depending on the organization's requirements, this process can vary greatly. For example, in some data warehouses, old data is overwritten by new data. Systems of a certain complexity are able to create data history simply by making "notes" in the data warehouse if a change occurs in the source data (e.g., if a customer has moved to a new address).
Exhibit 5.2 shows a simple ETL job, where data is extracted from the source table (Staging). Then the selected fields are transferred to the temporary table (Temp), which, through the load object, is sent on
Exhibit 5.2 Example of a Simple ETL Job
to the table (Staging) in the staging area. The transformation of the job is simple, since it's simply a case of selecting a subset of the columns or fields of the source table. The load procedure of the ETL job may overwrite the old rows in the target table or insert new rows.
A more complex part of an ETL job is shown in Exhibit 5.3. Here data is extracted from three staging tables. Note that only selected columns and rows are extracted with a filter function; an example of this could be rows that are valid for only a certain period. These three temporary tables in the center of Exhibit 5.3 are joined using Structured Query Language (SQL). SQL is a programming language used when manipulating data in a database or a data warehouse. The SQL join may link information about position (unemployed, employee, self-employed, etc.) to information about property evaluations and lending information. There may also be conditions (business rules) that filter out all noncorporate customers. The procedure is a transformation and joining of data, which ends up in the temporary table (Temp Table 4). The table with the joined information about loan applicants (again, Temp Table 4) then flows on in the ETL job with further transformations based on business rules, until it is finally
Exhibit 5.3 Part of ETL Job with SQL Join loaded to a target table in the staging area, the actual data warehouse, or for reporting and analytics in a data mart.
When initiating ETL processes and choosing tools, there are certain things to bear in mind. ETL processes can be very complex, and significant operational problems may arise if the ETL tools are not in order. Further complexity may be a consequence of many source systems with many different updating cycles. Some are updated every minute, and others on a weekly basis. A good ETL tool must be able to withhold certain data until all sources are synchronized.
The degree of scalability in the performance of the ETL tool in its lifetime and use should also be taken into consideration in the analysis phase. This includes an understanding of the volume of data to be processed. The ETL tool may need to be scalable in order to process terabytes of data, if such data volumes are included.
Even though ETL processes can be performed in any programming language, it's fairly complicated to do so from scratch. To an increasing extent, organizations buy ETL tools to create ETL processes. A good tool must be able to communicate with many different relational databases and read the different file formats that are used in the organization. Many vendors' ETL tools also offer data profiling, data quality, and metadata handling (we'll describe these processes in the following section). That is, a broader spectrum than extracting, transforming, and loading data is now necessary in a good tool.
The scope of data values or the data quality in a data source may be reduced compared to the expectations held by designers when the transformation rules were specified. Data profiling of a source system is recommended to identify the usability of the transformations on all imaginable future data values.