# SIX. Advanced Data Analytical Tests

WHILE CORRELATION, TREND ANALYSIS, and times series analyses are considered advanced statistical methods, they are easy to apply within - - IDEA. These tests are explained and demonstrated in IDEA. Tests to establish relationships between two fields have their procedures shown step by step in this chapter. These advanced data analytical tests are grouped together as they all establish some form of relationship testing.

## CORRELATION

A correlation is a relationship between two things or mathematical variables that tend to vary or move together. The data is represented by the letters x and y where x is the independent variable and y is the dependent variable. The independent variable x is usually described first. There is usually some logical connection between the two variables.

Many studies have been done on income levels of high school graduates, college, and university graduates and those with post-graduate degrees. The question in those studies is whether there is a connection or correlation between educational levels and income levels. Educational level would be x, the independent variable that would influence the income level variable of y, the dependent variable.

How much does the independent variable have influence over the dependent variable can be determined by calculating the correlation coefficient and is designated as r. A perfect linear correlation would have r equaling to 1 and a perfect negative correlation would have r equaling -1. The closer r is to 1 or -1, the stronger the correlation. Where there is no correlation, r would equal to 0.

There are three basic formulas for calculating r. The correlation coefficient can be calculated for the population, the sample, or the product moment.

The most common one is the product-moment correlation coefficient as shown.

E denotes the summation symbol, so the formula for the top part or numerator is to multiply the x variables by the y variables and then take the sum of those numbers. For the denominator, you square the x variables and take the sum of them. Do the same for the y variables and then multiple the resulting two sums. Apply the square root to the results and then divide the final results of the numerator by the final results of the denominator to obtain the correlation coefficient.

Since we already know how to calculate the Z-scores (or have IDEA calculate it for us), a simpler method of obtaining r would be to multiply the Z-scores of the x variables by the Z-scores of the y variables and then total up all the results. Take the total of the results and divide by the number of records less one.

The formula would be r = X(Zx - Zy)/(n - 1) where Zx is the Z-score for the x variable and Zy is the Z-score for the y variable. The letter n in the formula represents the number of records.

In general, the correlation coefficient, whether negative or positive, can be interpreted as:

.0 to .2 No correlation .2 to .4 Weak correlation .4 to .6 Moderate correlation .6 to .8 Strong correlation .8 to 1.0 Very strong correlation

The calculations of the correlation coefficients were shown for a better understanding of correlation. The auditor does not need to perform the calculations using the formulas, as IDEA has a built-in correlation feature that provides the correlation coefficient and you merely have to interpret the results.

Using a summarized monthly sales file from a POS system, we select Correlation from the Statistics area of IDEA. For demonstration purposes, the fields we correlate will be the HD_NETAMOUNT_SUM field, which is the sales amount before taxes, and the PAYAMOUNT_SUM_SUM_SUM field, which are amounts paid that include taxes. We will also include payments of cash only and payments of debit cards only for the correlation calculation as shown in Figure 6.1. In addition to the results being displayed, which can be exported to various file formats, you may optionally create an IDEA database of these results.

As expected, there is a perfect positive correlation of 1.000 between sales before taxes and payments. As sales go up or down, the sales tax moves accordingly, so the total payment by customers correlates to sales net of taxes.

There is a strong correlation between both the cash tender and the debit cards tender to the payment amounts of 0.710 and 0.792, respectively. There is no correlation between cash and debit cards payments as the correlation coefficient is 0.189. See Figure 6.2 .

FIGURE 6.1 Applying the Advanced Statistical Analysis Correlation Feature of IDEA

FIGURE 6.2 Correlation Coefficient Results

While IDEA does not calculate the coefficient of determination, it can be done simply by squaring the correlation coefficient that is already calculated for you. The coefficient of determination, or r2, tells us how much of the variation in one variable can be attributed to the variation of the other variable. This calculation when multiplied by 100 will be expressed in a percentage.

In our example of the correlation between cash tender and the payment amount of 0.710, 50.41 percent (0.710 x 0.710 x 100) of the variation can be attributed to the other variable.

One has to be mindful that even if the variables are calculated to have a strong correlation, there may not be a cause-and-effect relationship.

Are there direct cause-and-effect relations? That is, does x cause y or, in our previous example, does the level of education cause the income level? Maybe it is a reverse cause and effect where y causes x. That is, income levels determine your education level.

Possibly there is a third variable or a combination of several other variables that caused the relationship, such as networking relationships while in school that resulted in higher paying jobs. Maybe the whole relationship between the two variables was just a coincidence?