Income distribution is a major concern in economic theory. In regional economics, it is often of interest to compare income distributions in different regions. Traditional methods often compare the income inequality of different regions by assuming parametric forms of the income distributions, or using summary statistics like the Gini coefficient. In this paper, we propose a nonparametric procedure to test for heterogeneity in income distributions among different regions, and a Kmeans clustering procedure for clustering income distributions based on energy distance. In simulation studies, it is shown that the energy distance based method has competitive results with other common methods in hypothesis testing, and the energy distance based clustering method performs well in the clustering problem. The proposed approaches are applied in analyzing data from China Health and Nutrition Survey 2011. The results indicate that there are significant differences among income distributions of the 12 provinces in the dataset. After applying a 4means clustering algorithm, we obtained the clustering results of the income distributions in the 12 provinces.
The income distribution in economic theory describes how a region’s total wealth is distributed amongst its population (Sullivan, 2003). Back to classical economists time, income distribution was a key concern of economic theory, as it plays an important role in measuring the health of a region’s economy. For modern economics, more attention is paid to inequalities in certain measurements of income distribution, an example of which is the Gini coefficient (Yitzhaki, 1979), now universally used by many international organizations such as the United Nations and the World Bank. The Gini coefficient, however, summarizes information about a distribution in one number, and loses some information such as the intrinsic structure of the distribution. Other measurements such as the Lorenz curve (Lorenz, 1905), also have such kind of disadvantage.
The probability density of the income distribution can often be estimated when the sample size is large enough. From the 1890s, many parametric distributions were introduced to model income distribution. In 1895, Pareto (1964) (originally published in 1895) first proposed the Pareto density function to model the income distribution. Gibrat (1931) suggested the usage of a twoparameter lognormal distribution. Salem and Mount (1974) proposed a twoparameter Gamma density to approximate the distribution of personal income. Bartels and Van Metelen (1975) suggested another twoparameter density, the Weibull distribution, to model personal income. McDonald (1984) considered two generalized Beta distributions as models for the distribution of income. Furthermore, McDonald and Xu (1995) introduced a fiveparameter beta distribution which nests the generalized Beta and Gamma distributions to model income distribution. In summary, these methods used heavy tail distributions to model personal income. When testing the equality of income distributions, people often compared parameters in the parametric models. This method leans on the assumption that the parametric models sufficiently describe both income distributions. Still, some information loss would occur in such comparisons. In addition, it is often not reasonable to assume the same parametric model form for different regions, and such comparison would be impossible when the regional models are parameterized differently.
The energy distance (Székely and Rizzo, 2004) is a measure that is often used to test for equality of distributions. Rizzo and Székely (2010) provided a nonparametric version of the analysis of variance (ANOVA) called the distance components (DISCO), which partitions the total dispersion in the dataset into components that are analogies of ANOVA’s variance components. In addition, the energy distance can also be used in feature selection and generalizations of clustering algorithms (Rizzo and Székely, 2015). Li and Rizzo (2017) extended the usage of the energy distance and proposed kgroups, a generalization of the Kmeans clustering algorithm. Kgroups aims to put similar samples in the same cluster so that the dispersion between the
In order to address the disadvantages of the Gini coefficient and parametric models, we use the energy distance between different regions to test the equality of income distributions, which is a parametricfree method. Based on the energy distance, we propose a clustering method which clusters not individual observations, but income distributions, in different regions. The main contribution of this work is that all proposed methods are nonparametric  we do not require any parametric assumptions for income distribution.
The remainder of this paper is organized as follows. In Section 2, we give a brief introduction to the energy distance, and propose our testing and clustering procedures. In Section 3, we present the performance of the proposed illustrate our proposed procedures using extensive simulation results. In Section 4, we apply our methods on China Health and Nutrition Survey data. We conclude this paper with a brief discussion in Section 5.
The energy distance is a statistical distance defined for two probability distributions. Following Székely and Rizzo (2004), we consider two distributions denoted by
where
Now consider two independent samples from
where  ·  is the Euclidean distance. Based on (
For the twosample problem, suppose we have two samples
where
Let
where
It is easy to extend the twosample testing problem to ksample problem. Suppose
and the alternative is
For
Similar to in the twosample case, we can get the critical value from the random permutations from
Multidimensional scaling

Kmeans clustering

According to the energy distance calculated by (
The purpose of the MDS is to provide a visual representation for the pattern of proximities among a set of objects, when only a table of the distances between them is given. When the objects are income distributions, the MDS is used to reduce the matrix of distance to two dimensions, which can be plotted on a twodimensional coordinate system. Similar functions are placed near each other on the map, while very different functions are placed far from each other. See Algorithm 1 for a brief description of the MDS. Based on the MDS, following the steps of Kmeans clustering in Algorithm 2, we are able to cluster income distributions from different regions using their energy distance matrix.
In this section, we set up different scenarios to assess the accuracy of hypothesis testing and Kmeans clustering using the methods described in Section 2. For hypothesis testing, we test cases of twosample and multisample under the null and alternative hypotheses respectively. Different sample sizes are considered to compare the error rates under different circumstances. For Kmeans clustering, we consider 2means clustering procedure within different sample sizes.
We first consider the twosample tests under two different circumstances. One case is that two groups of data
For the multisample scenario, the same simulation setting is used as the twosample test in Section 3.1.1 except for the number of groups and the values of
In this section, we consider twosample tests under the alternative hypothesis. We generated two groups of data,
Under the first case, the Energy test behaves the best compared to the other testing procedures since it has the greatest power with different values of
For multiplesample testing, we generated five groups of data
In this case, we test the accuracy of Kmeans clustering using energy distance through simulation. We considered the case where in each replicate, 60% of the data are generated from Gamma(2, 0.2), and the other 40% are generated from another Gamma distribution with different rates as displayed in Table 5. For each sample size
We apply the proposed methods to the analysis of a China Health and Nutrition Survey (CHNS) dataset for year 2011. The dataset contains information of 4,346 households in 12 provinces. Based on our simulation results, this sample size is big enough to have powerful testing results and clustering results. A brief description of the dataset by province is given in Table 6.
From the Gini coefficient calculated in Table 6, Beijing has the lowest Gini coefficient, which means that Beijing has the most balanced income distribution. On the contrary, Henan has the highest Gini coefficient, which means the income distribution in Henan is the least balanced. Most provinces have similar Gini coefficients, and it is difficult to set a threshold to say whether there are statistically significant differences among these provinces.
We calculated the pairwise test statistic in (
Also, we do the multisample test for all the provinces using (
From the test results of previous analysis, we learned that the income distributions among 12 provinces are different, but smaller subsets of provinces have similar poverty pattern. Therefore, it is of interest to further cluster the 12 provinces. Some exploratory data analysis is performed before clustering. We used the hierarchical clustering methods (Bibby
From Figure 2, we can easily find that there are four major clusters based on the energy distance. Therefore, we chose
From Figure 3, it is found that Beijing and Shanghai are in cluster 1. Jiangsu province is the only one province in cluster 2. Hunan, Hubei, Chongqing, Guizhou, Heilongjiang, and Liaoning are in cluster 3, and Henan, Guangxi, and Shandong are in cluster 4. These clustering results are consistent with explanatory analysis based on the Gini coefficient calculation and testing results. Also, the clustering results reflects actual development of economy in these provinces. Beijing, Shanghai, and Jiangsu have relatively better economic development, while Henan, Guangxi, and Shandong have worse economic development.
We developed nonparametric test and clustering procedure for income distribution in different geographical regions. The test was inspired by the energy distance between distributions (Székely and Rizzo, 2004), and the MDS (Kruskal, 1964) for reducing the dimension of data based on a distance matrix so that an appropriate clustering algorithm can be implemented. In simulation studies, the nonparametric test has size close to the nominal 0.05 level. When the samples are generated from different distributions, or the same distribution with different parameters, the proposed test has strong power in detecting the violation of the null hypothesis, and the power is even stronger than other frequently used testing procedures. In the application to CHNS data, the proposed methods rejected the null hypothesis, and identified four clusters of income distribution, which is robust against choice of initial centroids and consistent with the benchmark hierarchical clustering results. In this work, there are no significant differences between the results of hierarchical clustering and Kmeans clustering because of the relative small number of provinces we analyzed. If we compare the county level data, however, it is very difficult for us to interpret results form the dendrograms. Kmeans clustering method is more robust than the HC.
In this work, we are only concerned with using household incomes to calculate the energy distance between different regions. There could be other information sources, such as the value of property, amount of mortgage, and consumptions, that can be used to provide more accurate description of the distribution of income, as the energy distribution provides good measurement of highdimensional cumulative distribution functions. Geographical information, in addition, can be taken into account in comparing income distributions.
Dr. Hu’s research was supported by Dean’s office of College of Liberal Arts and Sciences in University of Connecticut.
Multidimensional scaling
Assign points to arbitrary coordinates in Compute the Euclidean distances among all pairs of points, to form the Compare the Adjust coordinates of each point in the direction that best minimize the stress Repeat step 2 through 4 until stress won’t get any lower 
Kmeans clustering
Set Kmeans { Each data point An alternative, equivalent representation of this assignment of points to cluster The model parameters, the means, are adjusted to mach the sample means of the data points that they are responsible for where Repeat the step 2 and step 3 until the assignments do not change 
Error rates under null hypothesis: twosample (
Sample size  20  40  60  80  100  150  200  250  

Energy test  0.0524  0.0532  0.0518  0.0516  0.0494  0.0490  0.0486  0.0466  
Wilcoxon rank sum test  0.0528  0.0520  0.0518  0.0499  0.0516  0.0484  0.0502  0.0486  
KolmogorovSmirnov test  0.0464  0.0371  0.0354  0.0330  0.0244  0.0436  0.0402  0.0394  
AndersonDarling test  0.0589  0.0531  0.0513  0.0502  0.0514  0.0494  0.0494  0.0462  
JonckheereTerpstra test  0.0522  0.0504  0.0488  0.0499  0.0536  0.0542  0.0518  0.0484  
Rank Score test  0.0560  0.0545  0.0494  0.0502  0.0482  0.0494  0.0470  0.0484  
Energy test  0.0502  0.0488  0.0510  0.0506  0.0504  0.0492  0.0472  0.0474  
Wilcoxon rank sum test  0.0554  0.0472  0.0508  0.0500  0.0520  0.0482  0.0502  0.0478  
KolmogorovSmirnov test  0.0380  0.0334  0.0490  0.0330  0.0400  0.0422  0.0410  0.0386  
AndersonDarling test  0.0546  0.0498  0.0530  0.0494  0.0550  0.0506  0.0504  0.0458  
JonckheereTerpstra test  0.0516  0.0430  0.0498  0.0454  0.0516  0.0514  0.0466  0.0452  
Rank Score test  0.0532  0.0444  0.0490  0.0510  0.0526  0.0482  0.0482  0.0492 
Error rates under null hypothesis: multisample
Sample size  10  20  30  40  45  50  55  

Energy test  0.0502  0.0497  0.0510  0.0476  0.0478  0.0488  0.0484  
Wilcoxon rank sum test  0.0466  0.0523  0.0492  0.0570  0.0468  0.0480  0.0464  
AndersonDarling test  0.0510  0.0486  0.0504  0.0460  0.0476  0.0440  0.0440  
JonckheereTerpstra test  0.0492  0.0542  0.0502  0.0514  0.0466  0.0460  0.0476  
Rank Score test  0.0498  0.0484  0.0514  0.0470  0.0464  0.0424  0.0448  
Energy test  0.0488  0.0548  0.0462  0.0482  0.0492  0.0494  0.0484  
Wilcoxon rank sum test  0.0468  0.0540  0.0436  0.0506  0.0500  0.0560  0.0430  
AndersonDarling test  0.0488  0.0566  0.0440  0.0524  0.0500  0.0536  0.0476  
JonckheereTerpstra test  0.0538  0.0494  0.0462  0.0498  0.0494  0.0534  0.0452  
Rank Score test  0.0508  0.0546  0.0436  0.0536  0.0514  0.0532  0.0504 
Error rates under alternative hypothesis: 2 groups
Sample size  20  40  60  80  100  150  200  250  

Energy test  0.6308  0.3428  0.1746  0.0798  0.0354  0.0054  0.0004  0.0000  
Wilcoxon rank sum test  0.6460  0.3778  0.2038  0.0994  0.0474  0.0064  0.0010  0.0000  
KolmogorovSmirnov test  0.7612  0.5598  0.2944  0.2060  0.1212  0.0202  0.0046  0.0004  
AndersonDarling test  0.6528  0.3786  0.1940  0.0936  0.0438  0.0070  0.0008  0.0001  
JonckheereTerpstra test  0.9992  0.9942  0.9814  0.9781  0.9583  0.9016  0.8875  0.8666  
Rank Score test  0.6360  0.3638  0.1860  0.0886  0.0384  0.0048  0.0006  0.0000  
Energy test  0.7088  0.4920  0.3040  0.1870  0.1086  0.0240  0.0048  0.0010  
Wilcoxon rank sum test  0.6800  0.4476  0.2552  0.1590  0.0802  0.0184  0.0038  0.0006  
KolmogorovSmirnov test  0.7868  0.6398  0.3750  0.3004  0.1944  0.0482  0.0146  0.0036  
AndersonDarling test  0.6836  0.4622  0.2654  0.1700  0.0876  0.0174  0.0040  0.0010  
JonckheereTerpstra test  0.5650  0.3214  0.1670  0.0906  0.0428  0.0060  0.0014  0.0000  
Rank Score test  0.6744  0.4324  0.2374  0.1418  0.0700  0.0126  0.0028  0.0004 
Error rates for the proposed nonparametric test under the alternative hypothesis: multisample
Sample size  10  20  30  40  45  50  55  

Energy test  0.6144  0.2646  0.0946  0.0220  0.0154  0.0074  0.0026  
Wilcoxon rank sum test  0.9968  0.9978  0.9992  0.9994  0.9998  0.9998  0.9998  
AndersonDarling test  0.6420  0.3054  0.1194  0.0306  0.0202  0.0116  0.0036  
JonckheereTerpstra test  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  
Rank Score test  0.6216  0.2896  0.1048  0.0282  0.0176  0.0086  0.0038  
Energy test  0.8118  0.6004  0.3972  0.2692  0.1940  0.1534  0.1138  
Wilcoxon rank sum test  0.6994  0.4516  0.2612  0.1564  0.1122  0.0794  0.0622  
AndersonDarling test  0.7888  0.5540  0.3430  0.2034  0.1548  0.1076  0.0790  
JonckheereTerpstra test  0.4784  0.2128  0.0818  0.0332  0.0208  0.0118  0.0060  
Rank Score test  0.7780  0.5254  0.3068  0.1734  0.1272  0.0826  0.0616 
Error rate for the proposed nonparametric clustering procedure under the alternative hypothesis: multisample
Density 1 Density 2  Gamma(2, 0.2) Gamma(2, 0.25)  Gamma(2, 0.2) Gamma(2, 0.3)  Gamma(2, 0.2) Gamma(2, 0.35)  Gamma(2, 0.2) Gamma(2, 0.4)  Gamma(2, 0.2) Gamma(2, 0.45) 

0.854  0.361  0.090  0.035  0.008  
0.823  0.303  0.073  0.012  0.001  
0.767  0.159  0.020  0.003  0.001  
0.721  0.133  0.015  0.001  0.001  
0.635  0.066  0.011  0.001  0.000  
0.624  0.064  0.006  0.001  0.000  
0.408  0.015  0.000  0.000  0.000 
Description of China Health and Nutrition Survey data
Province  Sample size  Average household income in thousands (SD)  Gini coefficient 

Beijing  416  74.98 (49.9)  0.3236 
Liaoling  395  49.43 (47.9)  0.3953 
Heilongjiang  398  46.11 (44.4)  0.4307 
Shanghai  424  87.46 (68.7)  0.3700 
Jiangsu  414  60.93 (43.5)  0.3712 
Shandong  399  41.00 (40.9)  0.4031 
Henan  299  36.57 (42.7)  0.5027 
Hubei  339  50.04 (57.3)  0.4142 
Hunan  245  47.95 (43.5)  0.3986 
Guangxi  362  36.81 (33.4)  0.3800 
Guizhou  339  45.39 (52.7)  0.4172 
Chongqing  329  40.98 (39.9)  0.4361 