One of the common issues in large dataset analyses is to detect and construct homogeneous groups of objects in those datasets. This is typically done by some form of clustering technique. In this study, we present a divisive hierarchical clustering method based on two monothetic characteristics of histogram data. Unlike a classical data point, a histogram carries internal variation as well as location information. However, to find the optimal bipartition, existing divisive monothetic clustering methods for histogram data consider only location information as a monothetic characteristic, and hence they cannot distinguish histograms with the same location but different internal variations. Thus, a divisive clustering method considering both the location and the internal variation of histograms is proposed in this study. The method has the advantage of interpretable clustering outcomes, since it provides a binary question for each split. The proposed clustering method is verified through a simulation study and applied to a large U.S. house property value dataset.
One of the common issues in large dataset analyses is to detect and construct homogeneous groups of objects in those datasets. This is typically done by some form of clustering technique. In this work, a monothetic divisive clustering procedure is introduced for histogram-valued (or, simply, histogram) datasets.
Histogram-valued data are a class of symbolic data (introduced by Diday, 1987). Suppose that we are interested in classes or groups rather than individual observations. In that case, we usually aggregate observations by each class or group, and summary statistics for each class, such as the mean and median, are used to analyse the data. However, the use of summary statistics might lose some valuable information. For example, credit card companies hold massive credit card transaction data for their customers and are usually interested in customers rather than individual transactions. Thus, to analyse customers, they might aggregate the transaction information by customer and use the average amount of usage for each customer. In this case, even if two customers have the same average amount, their consumption patterns might differ, and such summary statistics cannot capture customers' behaviors. If, instead of summary statistics, histograms of transaction amounts for each customer are used for the analysis, the loss of information is reduced and the consumption patterns of customers with the same average amount can be distinguished. In addition, if large classical datasets are transformed into histogram datasets, we obtain a more manageable data size and computation becomes faster.
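The credit-card scenario above can be sketched in a few lines. The amounts, customer names, and bin edges below are invented purely for illustration:

```python
import numpy as np

def to_histogram(values, bin_edges):
    """Summarise raw transaction amounts as relative frequencies over
    fixed subintervals [edge_k, edge_{k+1})."""
    counts, _ = np.histogram(values, bins=bin_edges)
    return counts / counts.sum()

# Hypothetical amounts for two customers with the same mean (50)
# but clearly different spending patterns.
edges = np.array([0, 25, 50, 75, 100])
cust_a = np.array([48.0, 50.0, 52.0, 49.0, 51.0])  # tightly clustered
cust_b = np.array([5.0, 95.0, 10.0, 90.0, 50.0])   # widely spread

h_a = to_histogram(cust_a, edges)
h_b = to_histogram(cust_b, edges)
# Equal means, but the two histograms differ sharply.
```

The average alone is identical for both customers; the histograms are what separate them.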
Histogram-valued data are also called multi-valued quantitative data in some of the literature, such as Irpino and Verde (2008). Unlike classical data, which are single points in
There are several different divisive hierarchical clustering methods for classical data. Edwards and Cavalli-Sforza (1965) introduced a divisive clustering method that finds the optimal bipartition among the 2^{n-1} - 1 possible bipartitions of the n objects.
Chavent (2000) developed divisive hierarchical clustering methods for interval and ordinal categorical data using a hierarchy of a set of objects and a monothetic characteristic for each cluster of the hierarchy. Brito and Chavent (2012) and Kim (2009) applied the monothetic method to interval and histogram data, and Kim and Billard (2012) introduced a monothetic clustering method for multimodal data. These monothetic methods reduce the amount of calculation needed to detect an optimal bipartition, so that only the bipartitions induced by ordering the observations on each single variable, rather than all possible bipartitions, need to be examined.
For histogram data, the existing monothetic methods rely only on the location information of individual histograms to find the optimal bipartition. For example, the method in Brito and Chavent (2012) classifies histograms by whether the cumulative relative frequencies at a certain subinterval are larger than 0.5, and Kim (2009) uses the internal mean values of each histogram. However, since, unlike classical data, histograms carry not only location but also variation information, clustering algorithms for histogram data should consider both location and dispersion. For example, suppose that histograms have the same mean or median but different variances. In this case, methods using only location information cannot distinguish them. Thus, in this study, we propose a divisive monothetic clustering algorithm using both of these characteristics of histograms.
Section 2 provides background material needed for the clustering method: basic statistics are presented in Section 2.1, and clustering criteria are defined in Section 2.2. The algorithm is presented in Section 3 and is verified through the well-known Fisher (1936) iris dataset and a simulation study in Section 4. A large dataset is analysed in Section 5.
Let
where
For example, suppose that we have three observed histograms for a variable
Under the assumption that values on each subinterval are uniformly distributed, Billard and Diday (2003) defined the internal mean and standard deviation for a histogram-valued observation as follows:
The internal mean of (
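Under the uniformity assumption, the Billard and Diday (2003) internal mean and standard deviation can be computed directly from the subinterval endpoints and relative frequencies. The sketch below uses generic notation (endpoints a_k, b_k and frequencies p_k) rather than the paper's own symbols:

```python
import numpy as np

def internal_mean_std(edges, freqs):
    """Internal mean and standard deviation of a histogram observation,
    assuming values are uniform within each subinterval [a_k, b_k)
    (the Billard and Diday, 2003 formulas; the notation is ours)."""
    a, b = np.asarray(edges[:-1], float), np.asarray(edges[1:], float)
    p = np.asarray(freqs, float)
    mean = np.sum(p * (a + b) / 2)                 # E[X]
    second = np.sum(p * (a**2 + a*b + b**2) / 3)   # E[X^2]
    return mean, np.sqrt(second - mean**2)

# Two histograms with the same internal mean (3) but different spreads:
m1, s1 = internal_mean_std([0, 2, 4, 6], [0.0, 1.0, 0.0])  # mass in [2, 4)
m2, s2 = internal_mean_std([0, 2, 4, 6], [0.5, 0.0, 0.5])  # mass at extremes
```

The second histogram has the same mean as the first but a much larger internal standard deviation, which is exactly the situation a location-only method cannot resolve.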
Suppose that there are histogram observations
Now our interest is which cluster
where D^{2}(
Various distance/dissimilarity measures between two histograms have been proposed so far. Kim and Billard (2013) extended several dissimilarity/distance measures for interval data into histogram data. Also, Verde and Irpino (2008), Arroyo
From (
Thus, minimizing
Therefore, clustering algorithms should efficiently find the bipartition (
In this section, we describe a divisive clustering method for histogram data with a double monothetic process at each stage. The clustering methods introduced by Chavent (2000) and Brito and Chavent (2012) use a monothetic searching algorithm to find the optimal bipartition at each stage, but they depend only on the location information of histograms. However, a histogram has both location and dispersion information; even when histograms have the same mean or median, they should be distinguishable if they have different dispersions. Since the algorithms in Chavent (2000) and Brito and Chavent (2012) do not consider dispersion, they cannot provide desirable partitions for clusters with the same location but different dispersions. Thus, we propose a divisive hierarchical clustering algorithm using both the location and the dispersion of histograms.
In contrast to a polythetic algorithm, which uses all variables simultaneously, a monothetic algorithm splits each cluster on a single variable at a time.
Suppose that there are histogram observations
Double monothetic algorithm

Algorithm 1 investigates two aspects of each observation to find the optimal bipartition. For each cluster, it first explores (
At every stage, the proposed method provides a binary question by either the internal mean or standard deviation as a cut point and it describes the clustering structure of histogram data. If the binary question comes from the bipartition (
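A minimal sketch of the double monothetic search is given below. The within-cluster criterion, the cut-point enumeration, and all names are our own simplifications of the method described above, not the paper's exact formulation:

```python
import numpy as np
from itertools import combinations

def within(cluster, D):
    """Within-cluster dispersion: sum of pairwise squared distances,
    normalised by cluster size (a simplified Chavent-style criterion)."""
    idx = list(cluster)
    if len(idx) < 2:
        return 0.0
    return sum(D[i, j] for i, j in combinations(idx, 2)) / len(idx)

def best_double_monothetic_split(means, stds, D):
    """Try every cut point on the internal means AND on the internal
    standard deviations; keep the bipartition with the smallest total
    within-cluster dispersion, together with its binary question."""
    n = len(means)
    best = (np.inf, None, None)
    for name, values in (("mean", means), ("std", stds)):
        for c in np.unique(values)[:-1]:          # candidate cut points
            left = {i for i in range(n) if values[i] <= c}
            right = set(range(n)) - left
            crit = within(left, D) + within(right, D)
            if crit < best[0]:
                best = (crit, (left, right), f"Is {name} <= {c:g}?")
    return best

# Toy data: observations 0 and 1 share a location, observation 2 is far away.
means = np.array([1.0, 1.0, 5.0])
stds = np.array([1.0, 3.0, 1.0])
pts = np.column_stack([means, stds])
D = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
crit, (left, right), question = best_double_monothetic_split(means, stds, D)
```

Because both the mean axis and the standard-deviation axis are searched, the winning binary question records which of the two characteristics produced the split.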
In this section, the proposed clustering algorithm is verified through a well-known dataset and a simulated dataset. Performing the proposed clustering method first requires a distance measure for histogram data. In this study, we use the squared Euclidean distance used in Brito and Chavent (2012). The squared Euclidean distance between two histogram observations,
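For histograms that share the same subintervals, our reading of this distance is the squared Euclidean distance between the vectors of relative frequencies; this should be checked against the exact definition in Brito and Chavent (2012). The three example histograms are the rows of the table of observations on the common subintervals [0, 2), ..., [10, 12):

```python
import numpy as np

def sq_euclidean(p_i, p_j):
    """Squared Euclidean distance between two histograms defined on the
    same subintervals, taken here as the squared distance between their
    vectors of relative frequencies (our reading of the Brito and
    Chavent, 2012 distance)."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    return float(np.sum((p_i - p_j) ** 2))

# Relative frequencies of the three example histograms:
y1 = [0.0, 0.2, 0.25, 0.25, 0.3, 0.0]
y2 = [0.7, 0.2, 0.10, 0.00, 0.0, 0.0]
y3 = [0.0, 0.0, 0.00, 0.20, 0.2, 0.6]
# y1 is closer to y3 than to y2.
```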
Consider the Iris dataset used in Fisher (1936). Originally, Fisher’s iris dataset has 150 classical observations for three species (setosa, versicolor, virginica) and four variables (
A dendrogram in Figure 1(a) was obtained by applying the proposed clustering algorithm to the Iris histogram dataset. The dendrogram shows that the proposed algorithm correctly classified observations into the three iris species. Also, from the binary questions in the dendrogram, the three clusters were distinguished by internal mean values of histograms. For example, the first bipartition (
In the Iris histogram dataset, we saw an example of clusters classified by the locations of histograms. Now, we consider a case in which clusters with different dispersions overlap. Therefore, consider the histogram observations generated from the three different bivariate normal distributions of (
Both the algorithm of Brito and Chavent (2012) and the proposed algorithm were applied to 1,000 sets of histogram data simulated from (
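A simulation of this kind can be sketched as follows. The means, covariances, sample sizes, number of observations, and bin edges below are illustrative stand-ins, not the settings used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_histogram_obs(mean, cov, n_points, bin_edges):
    """Draw n_points from a bivariate normal and summarise each of the
    two variables as a histogram of relative frequencies over fixed
    subintervals. Parameters here are illustrative, not the paper's."""
    sample = rng.multivariate_normal(mean, cov, size=n_points)
    return [np.histogram(sample[:, v], bins=bin_edges)[0] / n_points
            for v in range(sample.shape[1])]

edges = np.linspace(-15, 15, 11)   # 10 common subintervals per variable
# Two overlapping groups: the same centre but different dispersions --
# the situation a location-only method cannot separate.
obs_tight = [simulate_histogram_obs([0, 0], np.eye(2), 200, edges)
             for _ in range(5)]
obs_wide = [simulate_histogram_obs([0, 0], 9 * np.eye(2), 200, edges)
            for _ in range(5)]
```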
To demonstrate the divisive double monothetic clustering method for histogram-valued data proposed in this study, we analyse a large dataset of house property values for the 51 states of the U.S., obtained from the American Community Survey Public Use Microdata Sample (ACS PUMS) 2012 (available on the US Census Bureau website; http://www2.census.gov/acs2012_1yr/pums/csv_hus.zip). The original classical data (ACS PUMS) contain many records of various kinds for each sampled household. In this analysis, we are interested in classifying the 51 U.S. states by house property values. That is, the goal of this analysis is to investigate which states have the most similar distributions of house prices for various house sizes. Note that only single-family houses are considered in this analysis.
To analyse house property values for each state by house size, we measure house size as the number of bedrooms and aggregate the data by state and house size. Since the number of one-bedroom houses for each state in the original dataset is too small to be meaningful in the present context, one-bedroom houses are merged with two-bedroom houses. Similarly, since the sample sizes for houses with five bedrooms or more are too small, such houses are pooled into one variable. Note that all records with missing house property values are removed. Thus, we consider four variables;
To apply the method proposed in this study to this histogram-valued dataset, we first make the histogram-valued data share the same subintervals for each variable and calculate the squared Euclidean distance values using (
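Making histograms share the same subintervals can be done by exploiting the piecewise-linear CDF implied by the within-subinterval uniformity assumption. The following is a minimal sketch of that rebinning step, not the paper's exact procedure:

```python
import numpy as np

def rebin(src_edges, src_freqs, new_edges):
    """Re-express a histogram on new subintervals. Under the uniformity
    assumption the CDF is piecewise linear, so linearly interpolating
    the cumulative frequencies gives the mass in each new subinterval."""
    cdf = np.concatenate([[0.0], np.cumsum(src_freqs)])
    F = np.interp(new_edges, np.asarray(src_edges, float), cdf)
    return np.diff(F)

# A histogram on [0,2), [2,4) recast onto [0,1), [1,3), [3,4):
new_freqs = rebin([0, 2, 4], [0.4, 0.6], [0, 1, 3, 4])
# → [0.2, 0.5, 0.3]
```

Once every histogram for a variable is expressed on a common grid of subintervals, the frequency vectors are directly comparable and the distance computation becomes a simple vector operation.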
The vertical axis of Figure 2(a) is the total intra-cluster size of (
From Figure 2(a), it seems that there are two major clusters in the dataset. To investigate the optimal number of clusters, cluster validity indexes were computed; such indexes evaluate clustering outcomes and suggest the optimal number of clusters. In this analysis, we used the Dunn and Davies-Bouldin indexes developed in Kim and Billard (2011). Figure 2(b) shows both the Dunn and Davies-Bouldin index values for the clustering result in Figure 2(a). As shown in Figure 2(b), the first bipartition (
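For reference, the classical distance-matrix form of the Dunn index is sketched below. Kim and Billard (2011) define histogram-specific versions, so this shows only the generic idea:

```python
import numpy as np

def dunn_index(labels, D):
    """Classical Dunn index from a pairwise distance matrix D: the
    smallest between-cluster distance divided by the largest
    within-cluster diameter. Larger values indicate compact,
    well-separated clusters."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    min_between, max_diam = np.inf, 0.0
    for a in clusters:
        ia = np.where(labels == a)[0]
        max_diam = max(max_diam, D[np.ix_(ia, ia)].max())
        for b in clusters[clusters > a]:
            ib = np.where(labels == b)[0]
            min_between = min(min_between, D[np.ix_(ia, ib)].min())
    return min_between / max_diam

# Two obvious clusters on a line: within-diameter 1, between-gap 9.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])
score = dunn_index([0, 0, 1, 1], D)   # → 9.0
```

Scanning this index over candidate numbers of clusters and taking the maximum is one common way to pick the number of clusters from a dendrogram.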
This bipartition was detected by the binary question ‘Is
In this study, a divisive double monothetic clustering method for histogram data is introduced. Chavent (2000) proposed a divisive monothetic clustering method for interval data and ordered categorical data, and Brito and Chavent (2012) extended the method to histogram data. Both monothetic methods find the optimal bipartition by whether the cumulative probability is larger than 0.5. Kim (2009) also extended Chavent (2000) to histogram data by using the internal mean values to find the optimal bipartition. That is, they use only location information of histograms. However, a histogram has both location and internal variation. Thus, when two histograms have the same location but different dispersions, they should be distinguishable. To distinguish histograms in such situations, we propose a double monothetic clustering algorithm using both internal mean and standard deviation as location and dispersion information, respectively. Also, the proposed clustering algorithm has the same order of complexity,
However, since the proposed clustering algorithm depends on the mean and standard deviation of individual histograms, it might not distinguish histograms that have the same first two moments but different higher moments. For example, there might be right skewed and left skewed histograms with the same mean and standard deviation and the proposed algorithm cannot separate these histograms. In that case, a monothetic clustering algorithm exploring the third and higher moments of histograms can be considered.
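Such a higher-moment extension could start from an internal skewness computed under the same uniformity assumption used for the mean and standard deviation. The formula below follows from the moments of a uniform distribution on each subinterval; it is our own sketch of the suggested extension, not a published definition:

```python
import numpy as np

def internal_skewness(edges, freqs):
    """Internal skewness of a histogram observation under the
    within-subinterval uniformity assumption (our sketch of the
    higher-moment extension suggested in the text)."""
    a, b = np.asarray(edges[:-1], float), np.asarray(edges[1:], float)
    p = np.asarray(freqs, float)
    m1 = np.sum(p * (a + b) / 2)                          # E[X]
    m2 = np.sum(p * (a**2 + a*b + b**2) / 3)              # E[X^2]
    m3 = np.sum(p * (a**3 + a**2*b + a*b**2 + b**3) / 4)  # E[X^3]
    var = m2 - m1**2
    return (m3 - 3 * m1 * var - m1**3) / var**1.5

# Mirror-image histograms: identical mean and standard deviation,
# opposite skewness -- indistinguishable by the first two moments alone.
skew_r = internal_skewness([0, 1, 2, 3], [0.6, 0.3, 0.1])
skew_l = internal_skewness([0, 1, 2, 3], [0.1, 0.3, 0.6])
```

A third monothetic characteristic of this kind could be added to the mean/standard-deviation search to separate such mirror-image clusters.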
Limam
Although divisive clustering is slower than agglomerative clustering, divisive clustering usually yields more accurate hierarchies because it starts with complete information about the global distribution. In contrast, since agglomerative algorithms merge clusters based on local information, they might not capture global structures. Also, agglomerative clustering does not provide the variables related to the clustering outcomes, while the proposed algorithm gives binary questions with splitting variables and cut points, yielding more interpretable results.
This paper was supported by Sungkyun Research Fund, Sungkyunkwan University, 2015 (No. S201512950000).
An example of histogram observations with the same subintervals

Observation   [0, 2)   [2, 4)   [4, 6)   [6, 8)   [8, 10)   [10, 12)
1             0.0      0.2      0.25     0.25     0.3       0.0
2             0.7      0.2      0.10     0.00     0.0       0.0
3             0.0      0.0      0.00     0.20     0.2       0.6
The simulation result for bivariate normal distributions

                                                    Brito and Chavent (2012) algorithm   The proposed algorithm
Number of sets in which the algorithm correctly
classifies all observations into 3 clusters         29                                   1,000
Clusters in the U.S. house property value dataset

Cluster   State
1         AL, AR, AZ, FL, GA, IA, ID, IL, IN, KS, KY, LA, ME, MI, MN, MO, MS, NC, ND, NE, NM, NV, OH, OK, PA, SC, SD, TN, TX, UT, WI, WV
2         AK, CA, CO, CT, DC, DE, HI, MA, MD, MT, NH, NJ, NY, OR, RI, VA, VT, WA, WY