# Sequential Confidence Set and Point Estimation of the Population Gini Index by Controlling Accuracies Relative to the Population Mean

## Introduction

Economic inequality exists in almost all aspects of life. It is very important to periodically measure the inequality for a region or a country of interest to gauge the effect of current economic policies in order to critically consider possible paths for future policies. Among many available measures of inequality, the Gini index or Gini coefficient is a leading measure for income or wealth distribution in a region or country.

Suppose $X$ is a nonnegative, nondegenerate random variable with distribution function (d.f.) $F$ that represents the income or wealth of a person or a household. If $X_1$ and $X_2$ are two i.i.d. copies of $X$, then the population Gini index, as given in Arnold (2005), is defined as:

$$G = \frac{E_F|X_1 - X_2|}{2\mu}, \quad \text{where } \mu = E_F(X). \tag{8.1}$$

A population Gini index (Gini 1914, 1921) for a country is typically calculated based on census data every ten years. However, for continuous monitoring of economic policies it is often crucial to estimate the Gini index by observing samples from a population during some intermediate time periods. Typically, to estimate the inequality index for larger regions, stratified sampling or some other appropriate complex sampling scheme is adopted.

For practical purposes, however, in a smaller region, simple random sampling may be used to produce nearly independent and identically distributed (i.i.d.) observations. For elaborate discussions on simple random sampling schemes in the context of household data, one may refer to miscellaneous sources including Beach and Davidson (1983), Chattopadhyay and De (2016), Davidson (2009), Davidson and Duclos (2000), De and Chattopadhyay (2017), Gastwirth (1972), and Xu (2007).

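Let us consider $n$ i.i.d. observations $X_1, \ldots, X_n$ from the common distribution function $F$ with support $(0, \infty)$. An estimator of the Gini index $G$ is given by:

$$G_n = \frac{\hat{\Delta}_n}{2\bar{X}_n}, \tag{8.2}$$

where $\bar{X}_n$ is the sample mean and $\hat{\Delta}_n$ is the sample Gini's mean difference (GMD), defined as follows:

$$\hat{\Delta}_n = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} |X_i - X_j|. \tag{8.3}$$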

Note that $\bar{X}_n$ and $\hat{\Delta}_n$ are both U-statistics (Hoeffding, 1948; Lee, 1990), of degree 1 and 2 respectively. Exploiting various crucial properties of U-statistics, Chattopadhyay and De (2016) and De and Chattopadhyay (2017) recently developed purely sequential methodologies for estimating the Gini index.
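The estimator described above can be illustrated with a minimal sketch. It assumes the usual U-statistic form of the sample GMD (the average of $|X_i - X_j|$ over all pairs $i < j$) and the plug-in ratio of the sample GMD to twice the sample mean. The function names `sample_gmd` and `sample_gini` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations


def sample_gmd(xs):
    """Sample Gini's mean difference: average |x_i - x_j| over all pairs i < j."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    n_pairs = n * (n - 1) // 2  # number of unordered pairs, i.e. n choose 2
    return sum(abs(a - b) for a, b in combinations(xs, 2)) / n_pairs


def sample_gini(xs):
    """Plug-in Gini estimator: sample GMD divided by twice the sample mean."""
    mean = sum(xs) / len(xs)
    if mean <= 0:
        raise ValueError("sample mean must be positive")
    return sample_gmd(xs) / (2 * mean)
```

For instance, `sample_gini([1, 2, 3, 4, 5])` evaluates the ratio of a sample GMD of 2 to twice a sample mean of 3. The pairwise sum costs $O(n^2)$ operations, which is adequate for a sketch but not for large samples.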

With some prefixed half-width $d > 0$, Chattopadhyay and De (2016) constructed a fixed-width ($= 2d$) confidence interval of the form $[G_N \pm d]$ based on the finally accrued data $\{N, X_1, \ldots, X_N\}$ such that asymptotically we have:

$$P_F\{G_N - d \le G \le G_N + d\} \approx 1 - \alpha, \tag{8.4}$$

for some pre-specified level $\alpha \in (0,1)$ in the spirit of Chow and Robbins (1965). The associated stopping variable $N$, that is, the final sample size of the sequential procedure, was shown to enjoy a number of desirable asymptotic properties, including the first-order asymptotic efficiency property defined by Ghosh and Mukhopadhyay (1981) and the asymptotic consistency property (Chow and Robbins, 1965).

In a sequel, De and Chattopadhyay (2017) developed a purely sequential methodology for minimum risk point estimation of the Gini index $G$ in the spirit of Robbins (1959) and Ghosh and Mukhopadhyay (1979). The loss function under that methodology was a combination of the squared error loss due to wrong estimation plus a linear cost, and it had the following form:

$$L_n = A(G_n - G)^2 + cn, \tag{8.5}$$

where $A$ is a known positive constant representing the weight assigned by the researcher to the probable cost per unit squared error loss due to estimation, and $c$ is a known positive constant representing the cost of sampling one observation. De and Chattopadhyay (2017) constructed a minimum risk point estimator (MRPE) of the form $G_N$ based on the finally accrued data $\{N, X_1, \ldots, X_N\}$. The associated sequential risk became asymptotically close to the minimum risk. The stopping variable $N$ of the purely sequential procedure enjoyed the first-order asymptotic efficiency and first-order asymptotic risk efficiency properties in the spirit of Ghosh and Mukhopadhyay (1981).

Both Chattopadhyay and De (2016) and De and Chattopadhyay (2017) developed ways to handle a number of crucial technical difficulties in assessing some of the desirable asymptotic optimality properties for their proposed sequential estimation methodologies, especially in the verification of the first-order asymptotic efficiency property. In these cited papers, the authors assumed that the support of $F$ was $(t, \infty)$ with some $t > 0$ in order to keep the required moment conditions within reason and practicality. We attempt to overcome those difficulties in this article by proposing some competing notions for measuring estimation error.

Note that Chattopadhyay and De (2016) and De and Chattopadhyay (2017) focused on their loss due to estimation as a function of $|G_N - G|$. If $F$ is assumed continuous, then $G \in (0,1)$ and $G_N \in (0,1)$ with probability (w.p.) 1. On the other hand, however, for any reasonable methodology, one should expect $|G_N - G|$ to be rather small to begin with, especially since it is the absolute difference between two fractions. That is, $|G_N - G|$ may not be a truly great quantification of the discrepancy between $G_N$ and $G$. For the sake of argument, consider two artificial datasets A and B, both of size $n = 5$, showing weekly income per family from an extremely poor suburb (with distribution function $F(x)$) and a not-so-poor suburb (with distribution function $F(x/10)$) of a city:

| Dataset | Distribution function | Weekly income per family |
|---|---|---|
| A | $F(x)$ | \$1, \$2, \$3, \$4, \$5 |
| B | $F(x/10)$ | \$10, \$20, \$30, \$40, \$50 |

In the case of either dataset, $G_n = 1/3$. Using the Gini index alone, one may be tempted to postulate that these two suburbs have the same level of income inequality based on observed data. However, such a conclusion may misrepresent the true situation on hand as explained next.

It is clear that the families from the extremely poor suburb have practically no weekly purchasing power to acquire anything of importance (e.g., a loaf of bread) because the sample mean from this dataset happens to be \$3 per family per week. However, the families from the not-so-poor suburb have more weekly purchasing power to acquire a loaf of bread because the sample mean from this dataset happens to be \$30 per family per week.

From this economic perspective, we may not feel very comfortable declaring quickly that these two suburbs have the same level of income inequality simply because they have equal estimated $G$ from observed data. One may refer to Mukhopadhyay and Chattopadhyay (2018), which put forward significant economic arguments in conjunction with alternatives to GMD or Gini's inequality index in order to address a number of crucial issues.

From the aforementioned discussions, we suggest that one may alternatively consider the loss due to estimation, namely $|G_N - G|$, but look at its magnitude relative to the magnitude of an appropriate power of the population mean $\mu$. To be specific, with some $a > 0$, we should compare the values of

$$\mu^a |G_N - G|$$

for the two suburbs under consideration.

The true Gini index from (8.1) will obviously be the same in the two suburbs, say $G$, since one distribution is a scale transformation of the other. Hence, it may be reasonable to compare the values of $\mu^a G_n$ obtained from the two suburbs. In order to fix ideas, we may suppose that $a = 2$ and estimate $\mu$ by the sample mean $\bar{X}_n$. In the case of dataset A, we have $\bar{X}_n^2 G_n = 3^2 \times \frac{1}{3} = 3$, whereas in the context of dataset B, we have $\bar{X}_n^2 G_n = 30^2 \times \frac{1}{3} = 300$. This places the weighted inequality measure associated with dataset B at a much higher level than that from dataset A. This prompts us to consider a revised loss function in the contexts of the fixed-width interval estimation problem as well as the minimum risk point estimation problem.
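The arithmetic in this comparison can be checked with a short sketch. It assumes the plug-in estimator (sample GMD over twice the sample mean) and uses the sample mean in place of the unknown population mean. The helper names `sample_gini` and `mean_weighted_gini` are hypothetical.

```python
from itertools import combinations


def sample_gini(xs):
    """Plug-in Gini estimator: sample GMD divided by twice the sample mean."""
    n = len(xs)
    n_pairs = n * (n - 1) // 2
    gmd = sum(abs(a - b) for a, b in combinations(xs, 2)) / n_pairs
    return gmd / (2 * (sum(xs) / n))


def mean_weighted_gini(xs, a):
    """Sample analogue of mu**a * G: (sample mean)**a times the Gini estimate."""
    mean = sum(xs) / len(xs)
    return mean ** a * sample_gini(xs)


dataset_a = [1, 2, 3, 4, 5]               # extremely poor suburb
dataset_b = [10 * x for x in dataset_a]   # not-so-poor suburb (scaled by 10)

# The Gini estimates agree because the estimator is scale invariant,
# while the mean-weighted versions differ by a factor of 10**a.
print(sample_gini(dataset_a), sample_gini(dataset_b))
print(mean_weighted_gini(dataset_a, 2), mean_weighted_gini(dataset_b, 2))
```

Up to floating-point rounding, the two Gini estimates are both one third, and the weighted values are 3 and 300 for $a = 2$.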

### Revised Loss Functions

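In the context of confidence set estimation of $G$, with fixed $n$ and predefined $d > 0$, we suppose that the 0-1 estimation loss is given by:

$$L_n = \begin{cases} 0 & \text{if } \mu^a |G_n - G| \le d, \\ 1 & \text{otherwise,} \end{cases} \tag{8.6}$$

with the associated risk function $P_F(\mu^a |G_n - G| > d)$. That is, instead of (8.4), we should require:

$$P_F\{\mu^a |G_N - G| \le d\} \approx 1 - \alpha, \tag{8.7}$$

based on the finally accrued data $\{N, X_1, \ldots, X_N\}$. In (8.7), $N$ would be an observable and appropriate stopping time to be defined shortly.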
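To make the fixed-sample-size risk under the 0-1 loss concrete, the following sketch estimates $P_F(\mu^a |G_n - G| > d)$ by Monte Carlo. It is an illustration under stated assumptions, not the sequential procedure of this chapter: the data are taken to be exponential with mean $\mu$ (for which the Gini index equals $1/2$), $\mu$ is treated as known inside the simulation, and the names `zero_one_loss` and `empirical_risk` are hypothetical.

```python
import random
from itertools import combinations


def sample_gini(xs):
    """Plug-in Gini estimator: sample GMD divided by twice the sample mean."""
    n = len(xs)
    n_pairs = n * (n - 1) // 2
    gmd = sum(abs(a - b) for a, b in combinations(xs, 2)) / n_pairs
    return gmd / (2 * (sum(xs) / n))


def zero_one_loss(gini_hat, gini, mu, a, d):
    """Return 0 if mu**a * |gini_hat - gini| <= d, and 1 otherwise."""
    return 0 if mu ** a * abs(gini_hat - gini) <= d else 1


def empirical_risk(n, mu, a, d, reps=2000, seed=0):
    """Monte Carlo estimate of P(mu**a * |G_n - G| > d) for exponential data."""
    rng = random.Random(seed)
    true_gini = 0.5  # assumption: exponential data, whose Gini index is 1/2
    total = 0
    for _ in range(reps):
        # expovariate takes the rate, which is 1 / mean
        xs = [rng.expovariate(1.0 / mu) for _ in range(n)]
        total += zero_one_loss(sample_gini(xs), true_gini, mu, a, d)
    return total / reps
```

A call such as `empirical_risk(n=50, mu=3.0, a=2, d=0.5)` can be compared with `empirical_risk(n=50, mu=30.0, a=2, d=0.5)` to see how the same $d$ becomes a far more demanding requirement when the mean is larger.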

Analogously, with a fixed sample of size $n$, in the context of minimum risk point estimation of $G$ we would work under the weighted squared error loss plus linear cost of sampling defined as follows:

$$L_n = A\mu^a (G_n - G)^2 + cn, \tag{8.8}$$

with the associated risk function $A\mu^a E_F[(G_n - G)^2] + cn$. That is, instead of minimizing the fixed-sample-size risk function associated with (8.5), we should minimize the fixed-sample-size risk function associated with (8.8), requiring that we handle the following expression based on the finally accrued data $\{N, X_1, \ldots, X_N\}$:

$$A\mu^a E_F[(G_N - G)^2] + cE_F[N]. \tag{8.9}$$

Here, N is an observable stopping time or the final sample size of a sequential procedure to be defined shortly. We assume in (8.8) and (8.9) that a > 0, A > 0, and c > 0 are all pre-specified.

### An Overview and the Layout of the Paper

The literature on sequential and multi-stage sampling methodologies in the construction of fixed-width confidence intervals and minimum risk point estimators of various parameters of interest is as broad ranging as one may imagine and spans the fields of both parametric and nonparametric inference. Many relevant concepts, desirable properties, and techniques of proof are far reaching. Initially, one may review the original sources, including Stein (1945, 1949), Anscombe (1952, 1953), Ray (1957), Robbins (1959), and Chow and Robbins (1965). In those methodologies, a population standard deviation was traditionally replaced by suitable consistent estimators in the form of a multiple of the sample standard deviation.

Mukhopadhyay and Chattopadhyay (2013a, b) and Mukhopadhyay and Hu (2017, 2018) have proposed alternative sampling strategies that replace a population standard deviation by consistent estimators in the form of a multiple of GMD as well as the mean absolute deviation (MAD) for normally distributed data with possible outliers. For a broader review, one may consult the texts of Sen (1981), Mukhopadhyay and Solanky (1994), Ghosh et al. (1997), and Mukhopadhyay and de Silva (2009).

In Section 8.2, we formally state the relative-accuracy confidence set estimation problem based on loss function (8.6), develop a purely sequential procedure, and study its asymptotic first-order optimality properties (Theorems 8.1 and 8.2). In Section 8.3, we formally discuss the minimum relative risk point estimation problem based on loss function (8.8), propose a purely sequential procedure, and study its asymptotic optimality properties (Theorems 8.3 and 8.4). Section 8.4 presents summaries from simulation studies validating all theoretical results under the proposed methodologies. Section 8.5 provides a number of important subsidiary lemmas and shows brief outlines of proofs of Theorems 8.1-8.2 and Theorems 8.3-8.4.