
Income distribution is a major concern in economic theory. In regional economics, it is often of interest to compare income distributions across regions. Traditional methods compare regional income inequality either by assuming parametric forms for the income distributions or by using summary statistics such as the Gini coefficient. In this paper, we propose a nonparametric procedure to test for heterogeneity in income distributions among regions, and a K-means procedure for clustering income distributions based on the energy distance. Simulation studies show that the energy distance based test is competitive with other common hypothesis tests, and that the energy distance based clustering method performs well. The proposed approaches are applied to data from the 2011 China Health and Nutrition Survey. The results indicate significant differences among the income distributions of the 12 provinces in the dataset. Applying a 4-means clustering algorithm, we obtain a clustering of the income distributions of the 12 provinces.
The income distribution in economic theory describes how a region’s total wealth is distributed amongst its population (Sullivan, 2003). Since the time of the classical economists, income distribution has been a key concern of economic theory, as it plays an important role in measuring the health of a region’s economy. Modern economics pays more attention to inequality as captured by certain measurements of the income distribution. One example is the Gini coefficient (Yitzhaki, 1979), now widely used by international organizations such as the United Nations and the World Bank. The Gini coefficient, however, summarizes a distribution in a single number and loses information such as the intrinsic structure of the distribution. Other measurements, such as the Lorenz curve (Lorenz, 1905), share this disadvantage.
The probability density of the income distribution can often be estimated when the sample size is large enough. Since the 1890s, many parametric distributions have been introduced to model income distribution. Pareto (1964; originally published in 1895) first proposed the Pareto density function to model the income distribution. Gibrat (1931) suggested using a two-parameter log-normal distribution. Salem and Mount (1974) proposed a two-parameter Gamma density to approximate the distribution of personal income. Bartels and Van Metelen (1975) suggested another two-parameter density, the Weibull distribution, to model personal income. McDonald (1984) considered two generalized Beta distributions as models for the distribution of income. Furthermore, McDonald and Xu (1995) introduced a five-parameter Beta distribution, which nests the generalized Beta and Gamma distributions, to model income distribution. In summary, these methods use heavy-tailed distributions to model personal income. To test the equality of two income distributions, researchers have often compared the parameters of the fitted parametric models. This approach relies on the assumption that the parametric models describe both income distributions sufficiently well, and even then some information is lost in the comparison. In addition, it is often unreasonable to assume the same parametric form for different regions, and such a comparison is impossible when the regional models are parameterized differently.
The energy distance (Székely and Rizzo, 2004) is a measure that is often used to test for equality of distributions. Rizzo and Székely (2010) provided a nonparametric version of the analysis of variance (ANOVA) called distance components (DISCO), which partitions the total dispersion in the dataset into components that are analogous to ANOVA’s variance components. In addition, the energy distance can be used in feature selection and in generalizations of clustering algorithms (Rizzo and Székely, 2015). Li and Rizzo (2017) extended the use of the energy distance and proposed k-groups, a generalization of the K-means clustering algorithm. K-groups aims to put similar samples in the same cluster so that the dispersion between the
To address the disadvantages of the Gini coefficient and of parametric models, we use the energy distance between regions to test the equality of income distributions, which is a nonparametric approach. Based on the energy distance, we also propose a clustering method that clusters not individual observations but the income distributions of different regions. The main contribution of this work is that all proposed methods are nonparametric: we do not require any parametric assumptions on the income distribution.
The remainder of this paper is organized as follows. In Section 2, we give a brief introduction to the energy distance and propose our testing and clustering procedures. In Section 3, we illustrate the performance of the proposed procedures through extensive simulation studies. In Section 4, we apply our methods to the China Health and Nutrition Survey data. We conclude with a brief discussion in Section 5.
The energy distance is a statistical distance defined for two probability distributions. Following Székely and Rizzo (2004), we consider two distributions denoted by
where
Now consider two independent samples from
where || · || is the Euclidean distance. Based on (
For the two-sample problem, suppose we have two samples
where
Let
where
It is easy to extend the two-sample testing problem to the k-sample problem. Suppose
and the alternative is
For
As in the two-sample case, we can obtain the critical value from random permutations from
According to the energy distance calculated by (
The purpose of the MDS is to provide a visual representation of the pattern of proximities among a set of objects when only a table of the distances between them is given. When the objects are income distributions, the MDS reduces the matrix of distances to two dimensions, which can be plotted on a two-dimensional coordinate system. Similar distributions are placed near each other on the map, while very different distributions are placed far from each other. See Algorithm 1 for a brief description of the MDS. Based on the MDS, and following the steps of K-means clustering in Algorithm 2, we are able to cluster income distributions from different regions using their energy distance matrix.
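The pipeline described above (energy distance matrix, then MDS, then K-means) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: it uses the standard sample form of the energy distance for univariate data, it swaps in classical (Torgerson) MDS via an eigendecomposition in place of the iterative stress minimization of Algorithm 1, it treats the energy distance matrix directly as the dissimilarity matrix, and it uses a deterministic farthest-point initialization for K-means. All function names and the simulated Gamma data are hypothetical.

```python
import numpy as np


def energy_distance(x, y):
    """Sample energy distance between two univariate samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    between = np.abs(x[:, None] - y[None, :]).mean()
    within_x = np.abs(x[:, None] - x[None, :]).mean()
    within_y = np.abs(y[:, None] - y[None, :]).mean()
    return 2.0 * between - within_x - within_y


def energy_distance_matrix(samples):
    """Symmetric matrix of pairwise energy distances between regions."""
    k = len(samples)
    d = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d[i, j] = d[j, i] = energy_distance(samples[i], samples[j])
    return d


def classical_mds(d, n_dims=2):
    """Classical MDS (an assumption; Algorithm 1 minimizes stress iteratively)."""
    k = d.shape[0]
    centering = np.eye(k) - np.ones((k, k)) / k
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_dims]    # keep the largest n_dims
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))


def k_means(points, k, n_iter=100):
    """Plain K-means with farthest-point initialization (an assumption)."""
    chosen = [0]
    while len(chosen) < k:
        gaps = np.linalg.norm(
            points[:, None, :] - points[chosen][None, :, :], axis=2
        ).min(axis=1)
        chosen.append(int(gaps.argmax()))
    centers = points[chosen].copy()
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments no longer change
        labels = new_labels
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels


# Hypothetical data: three "regions" from Gamma(shape 2, rate 0.2) and three
# from Gamma(shape 2, rate 0.45); NumPy parameterizes by scale = 1 / rate.
rng = np.random.default_rng(0)
samples = [rng.gamma(2.0, 1 / 0.2, size=500) for _ in range(3)]
samples += [rng.gamma(2.0, 1 / 0.45, size=500) for _ in range(3)]

d = energy_distance_matrix(samples)   # 6 x 6 distance matrix
coords = classical_mds(d)             # 6 x 2 map of the regions
labels = k_means(coords, k=2)         # one cluster label per region
print(labels)
```

Each row of `coords` represents a whole region rather than a single household, so the clustering operates on distributions, as in the procedure described above.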
In this section, we set up different scenarios to assess the accuracy of hypothesis testing and K-means clustering using the methods described in Section 2. For hypothesis testing, we consider two-sample and multi-sample cases under both the null and the alternative hypotheses. Different sample sizes are considered in order to compare the error rates under different circumstances. For K-means clustering, we consider a 2-means clustering procedure with different sample sizes.
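As a concrete illustration of the kind of test evaluated in these simulations, the following is a minimal sketch of a two-sample permutation test based on the energy statistic. It assumes the standard weighted form of the statistic, nm/(n+m) times the sample energy distance, and a permutation p-value with the usual plus-one correction; the function names, the number of permutations, and the Gamma parameters in the usage lines are hypothetical choices, not the settings of this paper.

```python
import random


def energy_statistic(dist, idx_x, idx_y):
    """Weighted two-sample energy statistic from a pairwise distance matrix."""
    n, m = len(idx_x), len(idx_y)
    between = sum(dist[i][j] for i in idx_x for j in idx_y) / (n * m)
    within_x = sum(dist[i][j] for i in idx_x for j in idx_x) / (n * n)
    within_y = sum(dist[i][j] for i in idx_y for j in idx_y) / (m * m)
    return n * m / (n + m) * (2.0 * between - within_x - within_y)


def energy_permutation_test(x, y, n_perm=199, seed=0):
    """Return the observed statistic and its permutation p-value."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    n = len(x)
    # Pairwise distances are computed once and reused for every permutation.
    dist = [[abs(a - b) for b in pooled] for a in pooled]
    indices = list(range(len(pooled)))
    observed = energy_statistic(dist, indices[:n], indices[n:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(indices)  # random relabeling of the pooled sample
        if energy_statistic(dist, indices[:n], indices[n:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


# Hypothetical usage: two clearly different Gamma samples (shape 2, with
# scale 5.0 and scale 0.5); random.gammavariate takes (shape, scale).
rng = random.Random(1)
x = [rng.gammavariate(2.0, 5.0) for _ in range(40)]
y = [rng.gammavariate(2.0, 0.5) for _ in range(40)]
stat, p_value = energy_permutation_test(x, y)
print(stat, p_value)
```

Repeating such a test over many simulated datasets and recording how often the decision is wrong gives error rates of the kind reported in the tables.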
We first consider the two-sample tests under two different circumstances. One case is that two groups of data
For the multi-sample scenario, the same simulation setting is used as the two-sample test in Section 3.1.1 except for the number of groups and the values of
In this section, we consider two-sample tests under the alternative hypothesis. We generated two groups of data,
Under the first case, the Energy test performs best among the testing procedures, since it has the greatest power across different values of
For multiple-sample testing, we generated five groups of data
In this case, we test the accuracy of K-means clustering using energy distance through simulation. We considered the case where in each replicate, 60% of the data are generated from Gamma(2, 0.2), and the other 40% are generated from another Gamma distribution with different rates as displayed in Table 5. For each sample size
We apply the proposed methods to the analysis of the China Health and Nutrition Survey (CHNS) dataset for the year 2011. The dataset contains information on 4,346 households in 12 provinces. Based on our simulation results, this sample size is large enough for both the test and the clustering procedure to perform well. A brief description of the dataset by province is given in Table 6.
According to the Gini coefficients reported in Table 6, Beijing has the lowest Gini coefficient, which means that Beijing has the most balanced income distribution. In contrast, Henan has the highest Gini coefficient, which means that its income distribution is the least balanced. Most provinces have similar Gini coefficients, and it is difficult to set a threshold for deciding whether there are statistically significant differences among these provinces.
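For reference, a sample Gini coefficient can be computed from sorted incomes as in the sketch below. It uses one common estimator, G = 2 Σ i·x_(i) / (n Σ x_(i)) − (n + 1)/n with incomes sorted in increasing order; whether the values in Table 6 were computed with exactly this estimator is an assumption, and the function name is hypothetical.

```python
def gini(incomes):
    """Sample Gini coefficient of a list of non-negative incomes."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0:
        raise ValueError("need at least one income and a positive total")
    # Rank-weighted sum of the sorted incomes, with ranks 1, ..., n.
    weighted = sum((rank + 1) * v for rank, v in enumerate(values))
    return 2.0 * weighted / (n * total) - (n + 1) / n


equal_incomes = gini([1, 1, 1, 1])   # perfectly balanced incomes
one_earner = gini([0, 0, 0, 1])      # all income held by one household
print(equal_incomes, one_earner)
```

Because the coefficient collapses a whole distribution into one number, two provinces with quite different income distributions can still have nearly identical values.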
We calculated the pairwise test statistic in (
Also, we do the multi-sample test for all the provinces using (
The preceding test results show that the income distributions of the 12 provinces differ, but that smaller subsets of provinces have similar poverty patterns. It is therefore of interest to further cluster the 12 provinces. Some exploratory data analysis is performed before clustering. We used hierarchical clustering methods (Bibby
Figure 2 shows four major clusters based on the energy distance. Therefore, we chose
Figure 3 shows that Beijing and Shanghai are in cluster 1. Jiangsu is the only province in cluster 2. Hunan, Hubei, Chongqing, Guizhou, Heilongjiang, and Liaoning are in cluster 3, and Henan, Guangxi, and Shandong are in cluster 4. These clustering results are consistent with the exploratory analysis based on the Gini coefficients and with the testing results. The clustering results also reflect the actual economic development of these provinces: Beijing, Shanghai, and Jiangsu have relatively better economic development, while Henan, Guangxi, and Shandong have relatively weaker economic development.
We developed a nonparametric test and a clustering procedure for income distributions in different geographical regions. The test is based on the energy distance between distributions (Székely and Rizzo, 2004), and the clustering procedure uses the MDS (Kruskal, 1964) to reduce the dimension of the data based on a distance matrix so that an appropriate clustering algorithm can be implemented. In simulation studies, the nonparametric test has size close to the nominal 0.05 level. When the samples are generated from different distributions, or from the same distribution with different parameters, the proposed test has strong power to detect violations of the null hypothesis, and in some settings its power exceeds that of other frequently used testing procedures. In the application to the CHNS data, the proposed methods rejected the null hypothesis and identified four clusters of income distributions, which are robust to the choice of initial centroids and consistent with the benchmark hierarchical clustering results. In this work, there are no significant differences between the results of hierarchical clustering and K-means clustering because of the relatively small number of provinces we analyzed. With county-level data, however, the dendrograms would be very difficult to interpret. The K-means clustering method is more robust than hierarchical clustering (HC).
In this work, we use only household incomes to calculate the energy distance between regions. Other sources of information, such as the value of property, the amount of mortgage, and consumption, could be used to describe the distribution of income more accurately, as the energy distance provides a good measure for comparing high-dimensional cumulative distribution functions. In addition, geographical information could be taken into account when comparing income distributions.
Dr. Hu’s research was supported by the Dean’s Office of the College of Liberal Arts and Sciences at the University of Connecticut.
Algorithm 1: Multidimensional scaling
1. Assign points to arbitrary coordinates in
2. Compute the Euclidean distances among all pairs of points, to form the
3. Compare the
4. Adjust the coordinates of each point in the direction that best minimizes the stress
5. Repeat steps 2 through 4 until the stress cannot be lowered any further
Algorithm 2: K-means clustering
1. Set K-means {
2. Each data point
   An alternative, equivalent representation of this assignment of points to cluster
3. The model parameters, the means, are adjusted to match the sample means of the data points that they are responsible for
   where
4. Repeat step 2 and step 3 until the assignments do not change
Table 1: Error rates under null hypothesis: two-sample (
Case | Test / sample size | 20 | 40 | 60 | 80 | 100 | 150 | 200 | 250 |
---|---|---|---|---|---|---|---|---|---|
1 | Energy test | 0.0524 | 0.0532 | 0.0518 | 0.0516 | 0.0494 | 0.0490 | 0.0486 | 0.0466 |
1 | Wilcoxon rank sum test | 0.0528 | 0.0520 | 0.0518 | 0.0499 | 0.0516 | 0.0484 | 0.0502 | 0.0486 |
1 | Kolmogorov-Smirnov test | 0.0464 | 0.0371 | 0.0354 | 0.0330 | 0.0244 | 0.0436 | 0.0402 | 0.0394 |
1 | Anderson-Darling test | 0.0589 | 0.0531 | 0.0513 | 0.0502 | 0.0514 | 0.0494 | 0.0494 | 0.0462 |
1 | Jonckheere-Terpstra test | 0.0522 | 0.0504 | 0.0488 | 0.0499 | 0.0536 | 0.0542 | 0.0518 | 0.0484 |
1 | Rank Score test | 0.0560 | 0.0545 | 0.0494 | 0.0502 | 0.0482 | 0.0494 | 0.0470 | 0.0484 |
2 | Energy test | 0.0502 | 0.0488 | 0.0510 | 0.0506 | 0.0504 | 0.0492 | 0.0472 | 0.0474 |
2 | Wilcoxon rank sum test | 0.0554 | 0.0472 | 0.0508 | 0.0500 | 0.0520 | 0.0482 | 0.0502 | 0.0478 |
2 | Kolmogorov-Smirnov test | 0.0380 | 0.0334 | 0.0490 | 0.0330 | 0.0400 | 0.0422 | 0.0410 | 0.0386 |
2 | Anderson-Darling test | 0.0546 | 0.0498 | 0.0530 | 0.0494 | 0.0550 | 0.0506 | 0.0504 | 0.0458 |
2 | Jonckheere-Terpstra test | 0.0516 | 0.0430 | 0.0498 | 0.0454 | 0.0516 | 0.0514 | 0.0466 | 0.0452 |
2 | Rank Score test | 0.0532 | 0.0444 | 0.0490 | 0.0510 | 0.0526 | 0.0482 | 0.0482 | 0.0492 |
Table 2: Error rates under null hypothesis: multi-sample
Case | Test / sample size | 10 | 20 | 30 | 40 | 45 | 50 | 55 |
---|---|---|---|---|---|---|---|---|
1 | Energy test | 0.0502 | 0.0497 | 0.0510 | 0.0476 | 0.0478 | 0.0488 | 0.0484 |
1 | Wilcoxon rank sum test | 0.0466 | 0.0523 | 0.0492 | 0.0570 | 0.0468 | 0.0480 | 0.0464 |
1 | Anderson-Darling test | 0.0510 | 0.0486 | 0.0504 | 0.0460 | 0.0476 | 0.0440 | 0.0440 |
1 | Jonckheere-Terpstra test | 0.0492 | 0.0542 | 0.0502 | 0.0514 | 0.0466 | 0.0460 | 0.0476 |
1 | Rank Score test | 0.0498 | 0.0484 | 0.0514 | 0.0470 | 0.0464 | 0.0424 | 0.0448 |
2 | Energy test | 0.0488 | 0.0548 | 0.0462 | 0.0482 | 0.0492 | 0.0494 | 0.0484 |
2 | Wilcoxon rank sum test | 0.0468 | 0.0540 | 0.0436 | 0.0506 | 0.0500 | 0.0560 | 0.0430 |
2 | Anderson-Darling test | 0.0488 | 0.0566 | 0.0440 | 0.0524 | 0.0500 | 0.0536 | 0.0476 |
2 | Jonckheere-Terpstra test | 0.0538 | 0.0494 | 0.0462 | 0.0498 | 0.0494 | 0.0534 | 0.0452 |
2 | Rank Score test | 0.0508 | 0.0546 | 0.0436 | 0.0536 | 0.0514 | 0.0532 | 0.0504 |
Table 3: Error rates under alternative hypothesis: 2 groups
Case | Test / sample size | 20 | 40 | 60 | 80 | 100 | 150 | 200 | 250 |
---|---|---|---|---|---|---|---|---|---|
1 | Energy test | 0.6308 | 0.3428 | 0.1746 | 0.0798 | 0.0354 | 0.0054 | 0.0004 | 0.0000 |
1 | Wilcoxon rank sum test | 0.6460 | 0.3778 | 0.2038 | 0.0994 | 0.0474 | 0.0064 | 0.0010 | 0.0000 |
1 | Kolmogorov-Smirnov test | 0.7612 | 0.5598 | 0.2944 | 0.2060 | 0.1212 | 0.0202 | 0.0046 | 0.0004 |
1 | Anderson-Darling test | 0.6528 | 0.3786 | 0.1940 | 0.0936 | 0.0438 | 0.0070 | 0.0008 | 0.0001 |
1 | Jonckheere-Terpstra test | 0.9992 | 0.9942 | 0.9814 | 0.9781 | 0.9583 | 0.9016 | 0.8875 | 0.8666 |
1 | Rank Score test | 0.6360 | 0.3638 | 0.1860 | 0.0886 | 0.0384 | 0.0048 | 0.0006 | 0.0000 |
2 | Energy test | 0.7088 | 0.4920 | 0.3040 | 0.1870 | 0.1086 | 0.0240 | 0.0048 | 0.0010 |
2 | Wilcoxon rank sum test | 0.6800 | 0.4476 | 0.2552 | 0.1590 | 0.0802 | 0.0184 | 0.0038 | 0.0006 |
2 | Kolmogorov-Smirnov test | 0.7868 | 0.6398 | 0.3750 | 0.3004 | 0.1944 | 0.0482 | 0.0146 | 0.0036 |
2 | Anderson-Darling test | 0.6836 | 0.4622 | 0.2654 | 0.1700 | 0.0876 | 0.0174 | 0.0040 | 0.0010 |
2 | Jonckheere-Terpstra test | 0.5650 | 0.3214 | 0.1670 | 0.0906 | 0.0428 | 0.0060 | 0.0014 | 0.0000 |
2 | Rank Score test | 0.6744 | 0.4324 | 0.2374 | 0.1418 | 0.0700 | 0.0126 | 0.0028 | 0.0004 |
Table 4: Error rates for the proposed nonparametric test under the alternative hypothesis: multi-sample
Case | Test / sample size | 10 | 20 | 30 | 40 | 45 | 50 | 55 |
---|---|---|---|---|---|---|---|---|
1 | Energy test | 0.6144 | 0.2646 | 0.0946 | 0.0220 | 0.0154 | 0.0074 | 0.0026 |
1 | Wilcoxon rank sum test | 0.9968 | 0.9978 | 0.9992 | 0.9994 | 0.9998 | 0.9998 | 0.9998 |
1 | Anderson-Darling test | 0.6420 | 0.3054 | 0.1194 | 0.0306 | 0.0202 | 0.0116 | 0.0036 |
1 | Jonckheere-Terpstra test | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
1 | Rank Score test | 0.6216 | 0.2896 | 0.1048 | 0.0282 | 0.0176 | 0.0086 | 0.0038 |
2 | Energy test | 0.8118 | 0.6004 | 0.3972 | 0.2692 | 0.1940 | 0.1534 | 0.1138 |
2 | Wilcoxon rank sum test | 0.6994 | 0.4516 | 0.2612 | 0.1564 | 0.1122 | 0.0794 | 0.0622 |
2 | Anderson-Darling test | 0.7888 | 0.5540 | 0.3430 | 0.2034 | 0.1548 | 0.1076 | 0.0790 |
2 | Jonckheere-Terpstra test | 0.4784 | 0.2128 | 0.0818 | 0.0332 | 0.0208 | 0.0118 | 0.0060 |
2 | Rank Score test | 0.7780 | 0.5254 | 0.3068 | 0.1734 | 0.1272 | 0.0826 | 0.0616 |
Table 5: Error rate for the proposed nonparametric clustering procedure under the alternative hypothesis: multi-sample
Sample size | Density 1: Gamma(2, 0.2), Density 2: Gamma(2, 0.25) | Density 1: Gamma(2, 0.2), Density 2: Gamma(2, 0.3) | Density 1: Gamma(2, 0.2), Density 2: Gamma(2, 0.35) | Density 1: Gamma(2, 0.2), Density 2: Gamma(2, 0.4) | Density 1: Gamma(2, 0.2), Density 2: Gamma(2, 0.45) |
---|---|---|---|---|---|
 | 0.854 | 0.361 | 0.090 | 0.035 | 0.008 |
 | 0.823 | 0.303 | 0.073 | 0.012 | 0.001 |
 | 0.767 | 0.159 | 0.020 | 0.003 | 0.001 |
 | 0.721 | 0.133 | 0.015 | 0.001 | 0.001 |
 | 0.635 | 0.066 | 0.011 | 0.001 | 0.000 |
 | 0.624 | 0.064 | 0.006 | 0.001 | 0.000 |
 | 0.408 | 0.015 | 0.000 | 0.000 | 0.000 |
Table 6: Description of the China Health and Nutrition Survey data
Province | Sample size | Average household income in thousands (SD) | Gini coefficient |
---|---|---|---|
Beijing | 416 | 74.98 (49.9) | 0.3236 |
Liaoning | 395 | 49.43 (47.9) | 0.3953 |
Heilongjiang | 398 | 46.11 (44.4) | 0.4307 |
Shanghai | 424 | 87.46 (68.7) | 0.3700 |
Jiangsu | 414 | 60.93 (43.5) | 0.3712 |
Shandong | 399 | 41.00 (40.9) | 0.4031 |
Henan | 299 | 36.57 (42.7) | 0.5027 |
Hubei | 339 | 50.04 (57.3) | 0.4142 |
Hunan | 245 | 47.95 (43.5) | 0.3986 |
Guangxi | 362 | 36.81 (33.4) | 0.3800 |
Guizhou | 339 | 45.39 (52.7) | 0.4172 |
Chongqing | 329 | 40.98 (39.9) | 0.4361 |