Causes and Effects of Poor Data Quality
Data quality is a result of how complete the data is, whether there are duplicates, and the level of accuracy and consistency across the overall organization. Most data quality projects have been linked to individual BA or CRM projects. Organizations know that correct data (e.g., complete and accurate customer contact data for CRM) is essential to achieve a positive return on these investments. Therefore, they are beginning to understand the significant advantage that is associated with focusing on data quality at a strategic level.
Data quality is central in all data integration initiatives, too. Data from a data warehouse can't be used in an efficient way until it has been analyzed and cleansed. In terms of data warehouses, it's becoming more and more common to install an actual storage facility or a firewall, which ensures quality when data is loaded from the staging area to the actual data warehouse. To ensure that poor data quality from external sources does not destroy or reduce the quality of internal processes and applications, organizations should establish this data quality firewall in their data warehouse. Analogous to a network firewall, whose objective is to keep hackers, viruses, and other undesirables out of the organization's network, the data quality firewall must keep data of poor quality out of internal processes and applications. The firewall can analyze incoming data as well as cleanse data by means of known patterns of problems, so that data will be of a certain quality, before it arrives in the data warehouse. Poor data that cannot be cleansed will be rejected by the firewall. The proactive way to improve the data quality is to subsequently identify poor data and add new patterns in the cleansing procedures of the firewall or track them back to the perpetrators and communicate the quality problems to the data source owners.
Poor data quality is very costly and can cause breakdowns in the organization's value chains (e.g., no items in stock) and lead to impaired decision-making at management and operational levels. Equally, it may lead to substandard customer service, which will cause dissatisfaction and cancellation of business. Lack of trust in reporting is another problem that will delay budgeting processes. In other words, poor data quality affects the organization's competiveness negatively.
The first step toward improved data quality in the data warehouse will typically be the deployment of tools for data profiling. By means of advanced software, basic statistical analyses are performed to search for frequencies and column widths on the data in the tables. Based on the statistics, we can see, for example, frequencies on nonexistent or missing postal codes as well as the number of rows without a customer name. Incorrect values of sales figures in transaction tables can be identified by means of analyses of the numeric widths of the columns. Algorithms searching for different ways of spelling the same content are carried out with the purpose of finding customers who appear under several names. For example, "Mr. Thomas D. Marchand" could be the same customer as "Thomas D. Marchand." Is it the same customer twice? Software packages can disclose whether data fits valid patterns and formats. Phone numbers, for instance, must have the format 311-555-1212 and not 3115551212 or 31 15 121 2. Data profiling can also identify superfluous data and whether business rules are observed (e.g., whether two fields contain the same data and whether sales and distributions are calculated correctly in the source system). Some programs offer functionality for calculating indicators or KPIs for data quality, which enable the business to follow the development in data quality over time.
Poor data quality may also be a result of the BA function introducing new requirements. If a source system is registering only the date of a business transaction (e.g., 12 April 2010), the BA initiative cannot analyze the sales distribution over the hours of the working day. That initiative will not be possible unless the source system is reprogrammed to register business transactions with a timestamp such as "12APR2010:12:40:31." Data will now show that the transaction took place 40 minutes and 31 seconds past 12, on 12 April 2010. The data quality is now secured, and the BA initiative can be carried out.
Data profiling is thus an analysis of the problems we are facing. In the next phase, the improvement of data quality, the process starts with the development of better data. In otherwords, this means correcting errors, securing accuracy, and validating and standardizing data with a view to increase their reliability. Based on data profiling, tools introduce intelligent algorithms to cleanse and improve data. Fuzzy merge technology is frequently used here. Using this technology means that duplicate rows can often be removed, so that customers appear only once in the system. Rows without customer names can be removed. Data with incorrect postal codes can be corrected, or removed. Phone numbers are adjusted to the desired format, such as XXX-XXX-XXXX.
Data cleansing is a process that identifies and corrects (or removes) ruined or incorrect rows in a table. After the cleansing, the data set will be consistent with other data sets elsewhere in the system. Ruined data can be a result of user entries or transmission errors. The actual data cleansing process may involve a comparison between entered values and a known list of possible values. The validation may be hard, so that all rows without valid postal codes are rejected or deleted, or it can be soft, which means that values are adjusted if they partly resemble the listed values. As mentioned previously, data quality tools are usually implemented when data is removed from the staging area to the data warehouse. Simply put, data moves through a kind of firewall of cleansing tools. Not all errors, however, can be corrected by the data quality tools. Entry error by users can be difficult to identify, and some of them will come through in the data profiling as very high or low values. Missing data caused by fields that have not been filled in should be corrected by means of validation procedures in the source system (for details, see Chapter 6). It should not be optional, for instance, whether the business user in sales selects one individual customer or not.