A Test on Correlation Based on Gini's Mean Difference

Wolfgang Schmid and Ivan Semeniuk


Gini's Mean Difference (GMD) and the Gini index are widely applied tools in statistics. The two are closely related: the Gini index is obtained by dividing the GMD by a multiple of the sample mean.

The GMD is mostly used as an alternative measure for the process variability. While the variance is superior to the GMD for distributions that are nearly normal, the GMD can be more informative for distributions that depart from normality (cf. Yitzhaki, 2003). There are many further applications of the GMD. A correlation measure relying on the GMD was proposed in Schechtman and Yitzhaki (1999). It was reported to be superior to the existing Pearson's correlation and Spearman's correlation for some distributions. The Gini index can be used as a statistic in nonparametric tests, such as goodness-of-fit tests (Rao Jammalamadaka and Goria, 2004), a two-sample test (Niewiadomska-Bugaj, 2003), and a test for independence of two samples (Kowalczyk et al., 2006). Ouda (2006) considered a test on univariate symmetry based on the GMD.

The GMD has also been applied in survival analysis (Bonetti et al., 2009), while the Gini concentration index has been used to describe concentration in levels of mortality and length of life among different socioeconomic groups, and to evaluate inequality in health and life expectancy. The Gini concentration index is one of the most common statistical indices employed in the social sciences for measuring concentration in the distribution of a positive random variable. The Gini index is mainly used in economics as a measure of income or wealth inequality among individuals or households.

Recently, the GMD has been applied to monitoring the variance of an independent random sample (e.g., Riaz and Does, 2009; Ghute and Rajmanya, 2014; Zhang, 2014; Sindhumol et al., 2016, 2018). Schoonhoven et al. (2011) compare the performance of control charts for the standard deviation with estimated parameters; a control chart based on the GMD was among the charts considered. The authors concluded that, in general, none of the charts outperformed the others. Mangold and Konopik (2017) also discuss the role of the GMD in statistical process control. They present a new class of Shewhart control charts, the general class of entropy-based control charts (the φ-charts), in which charts based on the GMD form one subclass.

There are also several more analytical papers discussing properties of the GMD. Mukhopadhyay and Chattopadhyay (2012) gave an asymptotic expansion of the percentiles of a sample mean standardized by the GMD in the normal case. Since the GMD is a special case of a U-statistic, statements about its asymptotic behavior can be found in, e.g., Dehling and Wendler (2010), Beutner and Zähle (2012), and Garg and Dewan (2015) for various types of processes.

In most of these papers, the underlying sample consists of independent random variables. In this paper, we describe how the GMD can be used to test data for positive correlation. To our knowledge, this has not been done before. It is shown that the proposed test statistics based on the GMD are invariant with respect to the family of multivariate elliptically contoured distributions. Thus, the critical values of the test need to be tabulated only once and can be used for a large family of distributions. Comparisons with the test of Box and Ljung show that the new approach seems to work quite well, especially for small sample sizes. Further, we describe how control charts based on the GMD can be constructed to monitor the correlation structure of a process. Two control charts based on exponentially weighted moving averages are introduced and compared with each other. The average run length and the conditional expected delay are used as performance measures.

Testing on Correlation

In this section, we analyze the behavior of the GMD for correlated random variables and show how the GMD can be used to test for positive correlation.

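Given the random variables $X_1, \ldots, X_n$, the GMD is defined as

$$G_n = \frac{1}{n(n-1)} \sum_{i \neq j} |X_i - X_j| = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} |X_i - X_j|.$$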
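As a minimal sketch, the GMD of a sample can be computed directly as the average absolute difference over all pairs of distinct observations. The normalization by n(n-1) is the usual U-statistic form and is an assumption here, as is the function name `gini_mean_difference`, which is purely illustrative.

```python
from itertools import combinations


def gini_mean_difference(sample):
    """Average of |x_i - x_j| over all pairs of distinct observations.

    Assumes the U-statistic normalization 2 / (n (n - 1)) over pairs i < j.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    # Sum over unordered pairs i < j, then scale by the number of such pairs.
    total = sum(abs(a - b) for a, b in combinations(sample, 2))
    return 2.0 * total / (n * (n - 1))


# Pairwise differences of 1, 2, 3 are 1, 2, 1, so the GMD is 4/3.
print(gini_mean_difference([1.0, 2.0, 3.0]))
```

The direct double loop costs O(n^2) operations; for long series, an equivalent formula based on the sorted sample would be cheaper.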

Analysis of the GMD for Correlated Variables

Using the Cauchy-Schwarz inequality, it is possible to obtain an upper bound for $E(G_n)$.

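Theorem 9.1. Suppose that $X_1, \ldots, X_n$ are random variables with existing second moments. Let $\mu = E(X_i)$, $\gamma^2 = \mathrm{Var}(X_i)$, $i = 1, \ldots, n$, and $\rho_{ij} = \mathrm{Corr}(X_i, X_j)$ for $i, j = 1, \ldots, n$. Then

$$E(G_n) \le \frac{\sqrt{2}\,\gamma}{n(n-1)} \sum_{i \neq j} \sqrt{1 - \rho_{ij}}.$$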
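A bound of this type follows in a few lines from the Cauchy-Schwarz inequality. The sketch below assumes that the GMD is normalized as the average of $|X_i - X_j|$ over all ordered pairs $i \neq j$, and that all variables share the mean $\mu$ and the variance $\gamma^2$; it is a reconstruction under these assumptions, not necessarily the argument used by the authors.

```latex
% Assumption: G_n = \frac{1}{n(n-1)} \sum_{i \neq j} |X_i - X_j|
\begin{align*}
E|X_i - X_j|
  &\le \sqrt{E\,(X_i - X_j)^2}
     && \text{(Cauchy-Schwarz)} \\
  &= \sqrt{\mathrm{Var}(X_i) + \mathrm{Var}(X_j) - 2\,\mathrm{Cov}(X_i, X_j)}
     && \text{(equal means, so } E(X_i - X_j) = 0\text{)} \\
  &= \sqrt{2}\,\gamma\,\sqrt{1 - \rho_{ij}}, \\
E(G_n)
  &\le \frac{\sqrt{2}\,\gamma}{n(n-1)} \sum_{i \neq j} \sqrt{1 - \rho_{ij}} .
\end{align*}
```

The last line averages the pairwise bound over all $n(n-1)$ ordered pairs and uses the linearity of the expectation.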

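The upper bound of Theorem 9.1 holds for all distributions with finite second moments. But how good is it? For example, consider the case of a multivariate normal distribution. If $\gamma_{ij} = \mathrm{Cov}(X_i, X_j)$, then $X_i - X_j \sim N(0, 2(\gamma^2 - \gamma_{ij}))$, $E(|X_i - X_j|) = \frac{2}{\sqrt{\pi}} \sqrt{\gamma^2 - \gamma_{ij}}$, and

$$E(G_n) = \frac{2}{\sqrt{\pi}\, n(n-1)} \sum_{i \neq j} \sqrt{\gamma^2 - \gamma_{ij}}.$$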

This shows how much information we may lose if the distribution of the $X_i$ is unknown. Here, the upper bound is $\sqrt{\pi/2} \approx 1.2533$ times the value for the normal case, i.e., it is around 25% larger.
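The factor of about 1.2533 can be checked numerically. The sketch below evaluates a Cauchy-Schwarz bound of the form $\sqrt{2}\,\gamma$ times the average of $\sqrt{1 - \rho_{ij}}$ over pairs $i \neq j$, and the normal-case expectation $\frac{2\gamma}{\sqrt{\pi}}$ times the same average. Both expressions assume the GMD is the average of $|X_i - X_j|$ over ordered pairs. The AR(1)-type correlation matrix and all function names are hypothetical and serve only as an example.

```python
import math


def mean_over_pairs(corr, f):
    """Average f(rho_ij) over all ordered pairs i != j of a correlation matrix."""
    n = len(corr)
    values = [f(corr[i][j]) for i in range(n) for j in range(n) if i != j]
    return sum(values) / len(values)


def cauchy_schwarz_bound(gamma, corr):
    """Distribution-free upper bound sqrt(2) * gamma * mean sqrt(1 - rho_ij)."""
    return math.sqrt(2.0) * gamma * mean_over_pairs(corr, lambda r: math.sqrt(1.0 - r))


def normal_expected_gmd(gamma, corr):
    """E(G_n) under multivariate normality: 2 gamma / sqrt(pi) * mean sqrt(1 - rho_ij)."""
    return 2.0 * gamma / math.sqrt(math.pi) * mean_over_pairs(corr, lambda r: math.sqrt(1.0 - r))


def ar1_correlation(n, phi):
    """Hypothetical example structure: rho_ij = phi ** |i - j|."""
    return [[phi ** abs(i - j) for j in range(n)] for i in range(n)]


corr = ar1_correlation(10, 0.5)
bound = cauchy_schwarz_bound(1.0, corr)
exact = normal_expected_gmd(1.0, corr)
print(bound / exact)  # sqrt(pi / 2), about 1.2533, for any correlation matrix
```

Because both expressions share the same average over pairs, the ratio does not depend on the correlation structure or on the standard deviation.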

In many applications, prior information on how the random variables may be correlated with each other is available. For instance, air pressure is not negatively correlated with wind speed, and the concentration of particulate matter is not positively correlated with wind speed or with humidity. Thus the sign of the correlation is frequently known in advance: it can only be non-negative in the example of air pressure and wind speed, and only non-positive for particulate matter with wind speed and with humidity. In finance, the returns of a stock and of a stock market index are in most cases not negatively correlated either.

Assuming $\rho_{ij} \ge 0$, the largest value of the upper bound in Theorem 9.1 is obtained if the variables are uncorrelated, i.e., $E(G_n) \le \sqrt{2}\,\gamma$. This is an interesting property which we will use in the following.

Next, a test on correlation is introduced. The main idea of the test is that the hypothesis of no correlation is rejected if $G_n/\hat{\gamma}$ or $G_n^*/\hat{\gamma}$ is sufficiently small. Here $\hat{\gamma}$ denotes an estimator of $\gamma$, and $G_n^*$ a possible estimator of $E(G_n)$.
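The decision rule can be illustrated with a simplified sketch. It makes several assumptions that are not taken from the text: the standard deviation $\gamma$ is treated as known instead of being estimated, the critical value is simulated under independent standard normal data, and the GMD is normalized over pairs $i \neq j$. All function names are hypothetical, and the estimators $\hat{\gamma}$ and $G_n^*$ of the actual test are not reproduced here.

```python
import math
import random


def gini_mean_difference(sample):
    """Average of |x_i - x_j| over all pairs i != j, computed from the sorted sample."""
    n = len(sample)
    ordered = sorted(sample)
    # sum_{i<j} (x_(j) - x_(i)) equals sum_k (2k - n - 1) x_(k) for k = 1..n.
    weighted = sum((2 * k - n - 1) * x for k, x in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * (n - 1))


def critical_value(n, alpha, reps=2000, seed=0):
    """Simulated lower alpha-quantile of G_n / gamma for i.i.d. N(0, 1) data (gamma = 1)."""
    rng = random.Random(seed)
    stats = sorted(
        gini_mean_difference([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    )
    return stats[max(0, math.floor(alpha * reps) - 1)]


def reject_no_correlation(sample, gamma, crit):
    """Reject the hypothesis of no correlation when G_n / gamma falls below crit.

    Assumption: gamma is known; the text uses an estimator instead.
    """
    return gini_mean_difference(sample) / gamma < crit


# Example: equicorrelated normal data with correlation 0.9 and unit variance.
rng = random.Random(1)
rho = 0.9
common = rng.gauss(0.0, 1.0)
sample = [math.sqrt(rho) * common + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
          for _ in range(20)]
crit = critical_value(n=20, alpha=0.05)
print(reject_no_correlation(sample, gamma=1.0, crit=crit))
```

In the equicorrelated example the common component cancels in every difference $X_i - X_j$, so the GMD shrinks by the factor $\sqrt{1 - \rho}$ relative to the independent case, which is what the one-sided rule exploits.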
