Double monothetic clustering for histogram-valued data
Communications for Statistical Applications and Methods 2018;25:263-274
Published online May 31, 2018
© 2018 Korean Statistical Society.

Jaejik Kim^{1,a} and L. Billard^{b}

^{a}Department of Statistics, Sungkyunkwan University, Korea; ^{b}Department of Statistics, University of Georgia, USA
Correspondence to: Department of Statistics, Sungkyunkwan University, 25-2 Sungkyunkwan-ro, Jongno-Gu, Seoul 03063, Korea. E-mail: jaejik@skku.edu
Received January 5, 2018; Revised April 21, 2018; Accepted April 26, 2018.
 Abstract

One of the common issues in large dataset analyses is to detect and construct homogeneous groups of objects in those datasets. This is typically done by some form of clustering technique. In this study, we present a divisive hierarchical clustering method based on two monothetic characteristics of histogram data. Unlike a classical data point, a histogram carries internal variation as well as location information. However, to find the optimal bipartition, existing divisive monothetic clustering methods for histogram data consider only location information as a monothetic characteristic, and hence they cannot distinguish histograms with the same location but different internal variations. Thus, a divisive clustering method considering both location and internal variation of histograms is proposed in this study. The method has an advantage in interpreting clustering outcomes because it provides a binary question for each split. The proposed clustering method is verified through a simulation study and applied to a large U.S. house property value dataset.

Keywords : divisive clustering, histogram data, symbolic data
1. Introduction

One of the common issues in large dataset analyses is to detect and construct homogeneous groups of objects in those datasets. This is typically done by some form of clustering technique. In this work, a monothetic divisive clustering procedure is introduced for histogram-valued (or, simply, histogram) datasets.

Histogram-valued data are a class of symbolic data (introduced by Diday, 1987). Suppose that we are interested in classes or groups rather than individual observations. In that case, we usually aggregate observations by class or group, and then summary statistics for each class, such as the mean or median, are used to analyse the data. However, the use of summary statistics may lose some valuable information. For example, credit card companies hold massive credit card transaction data for their customers and are usually interested in customers rather than individual transactions. Thus, to analyse customers, they might aggregate the transaction information by customer and use the average amount of usage for each customer. In this case, even if two customers have the same average amount, their consumption patterns might be different, and such summary statistics cannot capture customers' behaviors. If, instead of summary statistics, histograms of transaction amounts for each customer are used for the analysis, the loss of information can be reduced and the consumption patterns of customers with the same average amount can be distinguished. In addition, if large classical datasets are transformed into histogram datasets, the data become more manageable in size and computations become faster.

Histogram-valued data are also called multivalued quantitative data in some of the literature, such as Irpino and Verde (2008). Unlike classical data, which are single points in the p-dimensional space ℝ^p, symbolic data are hypercubes (broadly defined) and/or Cartesian products of distributions in ℝ^p (Billard, 2006). Detailed descriptions of such data can be found in, e.g., Bock and Diday (2000), Billard and Diday (2003, 2007), and Billard (2011). In particular, a symbolic variable X whose observed value is a histogram takes the form ξ = {[a_k, a_{k+1}), π_k; k = 1, …, v}, where the [a_k, a_{k+1}) are the histogram subintervals and the π_k are their supports (e.g., relative frequencies). It is assumed that values within a histogram subinterval [a_k, a_{k+1}) are uniformly spread across the subinterval. Interval-valued data are a special case where v = 1 and hence π_1 = 1.
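As a concrete illustration (ours, not from the paper), such a histogram-valued observation can be stored directly as a list of subinterval/weight triples; the type alias and variable names below are purely illustrative.

```python
# A minimal sketch (assumed representation): a histogram-valued observation is a
# list of (a_k, a_{k+1}, pi_k) triples whose weights sum to 1.
from typing import List, Tuple

Histogram = List[Tuple[float, float, float]]

# xi = {[2, 4), 0.2; [4, 8), 0.5; [8, 10), 0.3}
xi: Histogram = [(2.0, 4.0, 0.2), (4.0, 8.0, 0.5), (8.0, 10.0, 0.3)]

assert abs(sum(w for _, _, w in xi) - 1.0) < 1e-12  # weights form a valid support
```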

There are several different divisive hierarchical clustering methods for classical data. Edwards and Cavalli-Sforza (1965) introduced a divisive clustering method that finds the optimal bipartition among the 2^{n−1} − 1 possible bipartitions for a cluster with n objects using the intra-cluster sum of squares as a clustering criterion. Since this method considers all possible bipartitions in a cluster, it always gives a globally optimal bipartition but is not computationally efficient. MacNaughton-Smith et al. (1964) proposed a method that iteratively uses an average distance between an object and a group of objects. Also, Har-even and Brailovsky (1995) introduced a probabilistic validation approach for divisive clustering. Williams and Lambert (1959) and Lance and Williams (1968) proposed monothetic divisive clustering methods for classical binary data.

Chavent (2000) developed divisive hierarchical clustering methods for interval and ordinal categorical data using a hierarchy of a set of objects and a monothetic characteristic for each cluster of the hierarchy. Brito and Chavent (2012) and Kim (2009) applied the monothetic method to interval and histogram data, and Kim and Billard (2012) introduced a monothetic clustering method for multimodal data. These monothetic methods reduce the amount of calculation needed to detect an optimal bipartition: only p(n − 1) bipartitions need to be examined rather than all 2^{n−1} − 1 possible bipartitions, and the larger the number of objects, the greater this reduction becomes. They also provide binary questions which help to interpret the clustering structure.

For histogram data, the existing monothetic methods depend only on the location information of individual histograms to find the optimal bipartition. For example, the method in Brito and Chavent (2012) classifies histograms by whether the cumulative relative frequency at a certain subinterval is larger than 0.5, and Kim (2009) uses the internal mean value of each histogram. However, since, unlike classical data, histograms carry not only location but also variation information, clustering algorithms for histogram data should consider both location and dispersion. For example, suppose that histograms have the same mean or median but different variances. Methods using only location information cannot distinguish them. Thus, in this study, we propose a divisive monothetic clustering algorithm using both of these characteristics of histograms.

Section 2 provides the background material needed for the clustering method: basic statistics are presented in Section 2.1 and clustering criteria are defined in Section 2.2. The algorithm is presented in Section 3 and verified through the well-known Fisher (1936) iris dataset and a simulation study in Section 4. A large dataset is analysed in Section 5.

2. Background

2.1. Basic statistics

Let Y = (Y_1, …, Y_p) be a p-dimensional variable. Then, the variable Y_j, j = 1, …, p, takes histogram-valued (i.e., histogram) realizations y_{ij}, i = 1, …, n, for symbolic objects ω_i ∈ Ω, where Ω = {ω_i, i = 1, …, n} is a set of individual objects, defined by

$$Y_j(\omega_i) = y_{ij} = \left\{\,[a_{jk}, a_{j,k+1}),\ \pi_{ijk};\ k = 1, \ldots, v_j\,\right\}, \qquad i = 1, \ldots, n,\ j = 1, \ldots, p,$$

where a_{jk} < a_{j,k+1} ∈ ℝ, 0 ≤ π_{ijk} ≤ 1, and $\sum_{k=1}^{v_j} \pi_{ijk} = 1$ (Diday, 1995; Bock and Diday, 2000; Billard and Diday, 2007). In general, it is assumed that values on each subinterval are uniformly distributed. Note that we assume that the kth subinterval of histograms for the variable Y_j, [a_{jk}, a_{j,k+1}), depends only on the subscript j; that is, all histogram observations for the variable Y_j have the same subintervals. In the symbolic data context, individual histograms for each variable can have different numbers and lengths of subintervals. While theoretically this is not a problem, computationally it can be quite burdensome. To overcome this difficulty, Kim and Billard (2013), De Carvalho and De Souza (2010), and Brito and Chavent (2012) introduced a method to make all histograms for each variable have the same subintervals. That method is based on the uniformity assumption for each subinterval. However, since the shape of a histogram depends on the length of its subintervals, the common subinterval length (a_{j,k+1} − a_{jk}) for each variable should be small enough to avoid changing the histogram shapes.

For example, suppose that we have three observed histograms for a variable Y_1 as follows: y_1 = y_{11} = {[2, 4), 0.2; [4, 8), 0.5; [8, 10), 0.3}, y_2 = y_{21} = {[0, 2), 0.7; [2, 5), 0.3}, y_3 = y_{31} = {[6, 10), 0.4; [10, 12), 0.6}. The three histograms have different numbers and lengths of subintervals. To give them the same subintervals, the domain of Y_1 is first obtained from the observed histograms. Since the minimum and maximum values over all subintervals are 0 and 12 in this example, the interval [0, 12) is divided using the minimum length of all observed subintervals, here 2. Thus, the common subintervals for all histograms for Y_1 are {[0, 2), [2, 4), [4, 6), [6, 8), [8, 10), [10, 12)}. Under the uniformity assumption, the support for each common subinterval is obtained from the proportion of overlap between the observed and common subintervals, and zero is assigned to common subintervals that do not overlap with any observed subinterval. The histogram observations with the common subintervals are shown in Table 1. Henceforth, for computational efficiency, we assume that all histogram observations for each variable have been suitably transformed so as to have the same subintervals.
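The transformation just described can be sketched in a few lines of code (our illustration, not part of the paper); with the three histograms above it reproduces the supports in Table 1. The function names are ours.

```python
import numpy as np

def common_breakpoints(histograms):
    """Breakpoints of the common subintervals: span the overall domain and use
    the minimum observed subinterval length as the common width."""
    lo = min(a for h in histograms for a, _, _ in h)
    hi = max(b for h in histograms for _, b, _ in h)
    width = min(b - a for h in histograms for a, b, _ in h)
    return np.arange(lo, hi + width, width)

def rebin(hist, breaks):
    """Spread each observed subinterval over the common subintervals in
    proportion to the overlapped length (uniformity assumption)."""
    weights = np.zeros(len(breaks) - 1)
    for a, b, p in hist:
        for k in range(len(breaks) - 1):
            overlap = max(0.0, min(b, breaks[k + 1]) - max(a, breaks[k]))
            weights[k] += p * overlap / (b - a)
    return weights

y1 = [(2, 4, 0.2), (4, 8, 0.5), (8, 10, 0.3)]
y2 = [(0, 2, 0.7), (2, 5, 0.3)]
y3 = [(6, 10, 0.4), (10, 12, 0.6)]
breaks = common_breakpoints([y1, y2, y3])        # [0, 2, 4, 6, 8, 10, 12]
print([rebin(y, breaks) for y in (y1, y2, y3)])  # matches the rows of Table 1
```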

Under the assumption that values on each subinterval are uniformly distributed, Billard and Diday (2003) defined the internal mean and standard deviation for a histogram-valued observation as follows:

Definition 1

Let y_{ij} be the ith histogram observation for a variable Y_j. Then, the internal mean for y_{ij} is defined by

$$M_{ij} = \sum_{k=1}^{v_j} \left(\frac{a_{jk} + a_{j,k+1}}{2}\right) \pi_{ijk}, \tag{2.1}$$

and the internal standard deviation for y_{ij} is given by

$$S_{ij} = \left[\,\sum_{k=1}^{v_j} \frac{(a_{jk} - M_{ij})^2 + (a_{jk} - M_{ij})(a_{j,k+1} - M_{ij}) + (a_{j,k+1} - M_{ij})^2}{3}\, \pi_{ijk}\right]^{\frac{1}{2}}. \tag{2.2}$$

The internal mean of (2.1) serves as a representative value for the location of a histogram, and the internal standard deviation of (2.2) represents its dispersion. In our proposed clustering algorithm, both the internal mean and standard deviation of the histogram observations are used to find the optimal bipartition at each stage of clustering.
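For illustration (ours), the internal mean and standard deviation of Definition 1 can be computed directly from the common-subinterval representation; for the first histogram of Table 1, for instance, the internal mean works out to 6.3.

```python
import numpy as np

def internal_mean(breaks, weights):
    """Internal mean M_ij of (2.1): weighted subinterval midpoints."""
    breaks = np.asarray(breaks, dtype=float)
    return float(np.dot((breaks[:-1] + breaks[1:]) / 2.0, weights))

def internal_std(breaks, weights):
    """Internal standard deviation S_ij of (2.2) under the uniformity assumption."""
    a = np.asarray(breaks[:-1], dtype=float)
    b = np.asarray(breaks[1:], dtype=float)
    m = internal_mean(breaks, weights)
    var = np.sum(((a - m) ** 2 + (a - m) * (b - m) + (b - m) ** 2) / 3.0
                 * np.asarray(weights, dtype=float))
    return float(np.sqrt(var))

breaks = [0, 2, 4, 6, 8, 10, 12]
print(internal_mean(breaks, [0, 0.2, 0.25, 0.25, 0.3, 0]))  # 6.3 for y1 of Table 1
```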

2.2. Clustering criteria

Suppose that there are histogram observations y_i ∈ Λ = {y_1, …, y_n} of the p-dimensional variable {Y_j, j = 1, …, p}, where y_i = {y_{ij}, j = 1, …, p}, and suppose P_r is a partition of Λ at the rth stage of clustering. Then, P_r = {C_u, u = 1, …, r}, where C_u = {y_1, …, y_{n_u}} is a cluster with n_u observations and $\sum_{u=1}^{r} n_u = n$. At the (r + 1)th stage, a single cluster C_u in P_r is split into C_{u1} and C_{u2}. Thus, the new partition can be written as

$$P_{r+1} = \left(P_r \cup \{C_{u1}, C_{u2}\}\right) - \{C_u\}. \tag{2.3}$$

Our interest now is in which cluster C_u to split and how to find the optimal C_{u1} and C_{u2} within C_u. Given a partition P_r, there are $\sum_{u=1}^{r}\left(2^{n_u - 1} - 1\right) = z$ (say) possible bipartitions for P_{r+1}. Since the number of possible bipartitions increases exponentially with n, for large n it is computationally infeasible to explore all z bipartitions. To solve this problem, we need a clustering criterion and a strategy that can find the optimal C_{u1} and C_{u2} without having to consider all z possibilities. The criterion we adopt is to minimize the intra-cluster size, which evaluates the compactness of each cluster.

Definition 2

For a cluster C_u = {y_1, …, y_{n_u}}, the intra-cluster size I(C_u) is, for u = 1, …, r,

$$I(C_u) = \frac{1}{n \cdot n_u} \sum_{i_1 = 2}^{n_u} \sum_{\substack{i_2 = 1 \\ i_2 < i_1}} D^2\!\left(y_{i_1}, y_{i_2}\right), \qquad i_1, i_2 = 1, \ldots, n_u, \tag{2.4}$$

where D^2(y_{i_1}, y_{i_2}) is a distance or dissimilarity measure between the observations y_{i_1} and y_{i_2} in C_u and $n = \sum_{u=1}^{r} n_u$.

Various distance/dissimilarity measures between two histograms have been proposed. Kim and Billard (2013) extended several dissimilarity/distance measures for interval data to histogram data. Also, Verde and Irpino (2008), Arroyo et al. (2011), and Dias and Brito (2011) developed distance measures and applied them to statistical models for histogram data such as regression and time series models. In this study, we consider the squared Euclidean distance (Brito and Chavent, 2012) or the Mallows distance (Irpino and Verde, 2006) as D^2(y_{i_1}, y_{i_2}) in (2.4).
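A direct transcription of Definition 2 (our sketch): given a precomputed n × n matrix of pairwise D² values, the intra-cluster size of a cluster, identified by the indices of its observations, is the scaled sum of its within-cluster pairwise distances.

```python
def intra_cluster_size(cluster_idx, dist2, n_total):
    """Intra-cluster size I(C_u) of (2.4).  `dist2` is an n x n matrix of D^2
    values (e.g. squared Euclidean or Mallows), `cluster_idx` lists the indices
    of the observations in C_u, and `n_total` is n."""
    idx = list(cluster_idx)
    n_u = len(idx)
    pair_sum = sum(dist2[idx[i1]][idx[i2]]
                   for i1 in range(1, n_u) for i2 in range(i1))
    return pair_sum / (n_total * n_u)
```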

Definition 3

For a partition P_r = (C_1, …, C_r) at the rth stage, the total intra-cluster size is given by, for r = 1, …, n − 1,

$$T(P_r) = \sum_{u=1}^{r} I(C_u). \tag{2.5}$$

From (2.3) and (2.5), the total intra-cluster size for P_{r+1} can be written as

$$T(P_{r+1}) = T(P_r) - \left\{I(C_u) - I(C_{u1}) - I(C_{u2})\right\}. \tag{2.6}$$

Minimizing T(P_{r+1}) is therefore equivalent to maximizing T(P_r) − T(P_{r+1}) (note that T(P_{r+1}) ≤ T(P_r) for all r). Let Δ_u = T(P_r) − T(P_{r+1}) when C_u is split into (C_{u1}, C_{u2}). Then,

$$\Delta_u = I(C_u) - I(C_{u1}) - I(C_{u2}). \tag{2.7}$$

Therefore, clustering algorithms should efficiently find the bipartition (C_{u1}, C_{u2}) maximizing Δ_u.
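Continuing the sketch above (again ours), Δ_u of (2.7) is simply the drop in intra-cluster size produced by a candidate split, and by (2.6) the new total intra-cluster size is T(P_r) − Δ_u.

```python
def delta_u(parent_idx, child1_idx, child2_idx, dist2, n_total):
    """Reduction Delta_u of (2.7) achieved by splitting C_u into (C_u1, C_u2)."""
    return (intra_cluster_size(parent_idx, dist2, n_total)
            - intra_cluster_size(child1_idx, dist2, n_total)
            - intra_cluster_size(child2_idx, dist2, n_total))

def total_after_split(T_current, parent_idx, child1_idx, child2_idx, dist2, n_total):
    """T(P_{r+1}) = T(P_r) - Delta_u, as in (2.6)."""
    return T_current - delta_u(parent_idx, child1_idx, child2_idx, dist2, n_total)
```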

3. Clustering algorithm

In this section, we describe a divisive clustering method for histogram data with a double monothetic process at each stage. The clustering methods introduced by Chavent (2000) and Brito and Chavent (2012) use a monothetic search algorithm to find the optimal bipartition at each stage, but they depend only on the location information of the histograms. However, a histogram has both location and dispersion information; that is, even when histograms have the same mean or median, they should be distinguishable if they have different dispersions. Since the clustering algorithms in Chavent (2000) and Brito and Chavent (2012) do not consider dispersion information, they cannot provide desirable partitions for clusters with the same location but different dispersions. Thus, we propose a divisive hierarchical clustering algorithm using both the location and the dispersion of histograms.

In contrast to a polythetic algorithm, which uses all p variables simultaneously, a monothetic algorithm uses one variable at a time to find the bipartition that optimizes the clustering criterion. It also yields a binary question for each split, which expresses the monothetic characteristic at that stage. In general, a binary question has the form ‘Is Y_j ≤ c_r?’, where c_r is the cut point at the rth stage.

Suppose that there are histogram observations yi, i = 1, …, n, of the p-dimensional variable {Yj, j = 1, …, p}. From these histogram-valued observations, we can compute distances among all observations such as the squared Euclidean distance introduced in Brito and Chavent (2012) and the Mallows distance proposed in Irpino and Verde (2006). By using these distances, the proposed divisive hierarchical clustering method with a double monothetic process can be constructed. The double monothetic process consists of a process by the internal mean of histograms and a process by the internal standard deviation. The details of the proposed clustering algorithm are as follows:

Algorithm 1: Double monothetic algorithm

  1. For each variable Y_j, compute the internal mean M_{ij} and the internal standard deviation S_{ij} for all histogram observations y_{ij}, i = 1, …, n, j = 1, …, p, using (2.1) and (2.2), respectively.

  2. Start with P_1 ≡ Λ = {y_1, …, y_n} and set r = 1.

  3. At the rth stage, we have a partition P_r = {C_u, u = 1, …, r}, and each cluster contains histogram-valued observations (i.e., C_u = {y_{ui}, i = 1, …, n_u}).

  4. For each cluster C_u, u = 1, …, r, and each variable Y_j, j = 1, …, p:

    4-1. Sort the observations in C_u in ascending order of the internal mean values M_{ij}. Let {y_{u(i)}^{M_j}, i = 1, …, n_u} be the observations in C_u sorted by the internal mean values for the variable Y_j, and let C_{u,l}^{M_j,1} = {y_{u(1)}^{M_j}, …, y_{u(l)}^{M_j}} and C_{u,l}^{M_j,2} = {y_{u(l+1)}^{M_j}, …, y_{u(n_u)}^{M_j}}, l = 1, …, n_u − 1; i.e., the first l sorted observations go into C_{u,l}^{M_j,1}, and the remaining (n_u − l) observations go into C_{u,l}^{M_j,2}.

    4-2. For l = 1, …, n_u − 1,

      4-2-1. Compute Δ_u of (2.7) for the bipartition (C_{u,l}^{M_j,1}, C_{u,l}^{M_j,2}) as follows:

        $$\Delta_{u,l}^{M_j} = I(C_u) - I\!\left(C_{u,l}^{M_j,1}\right) - I\!\left(C_{u,l}^{M_j,2}\right),$$

        where I(·) is the intra-cluster size of (2.4).

    4-3. Similarly to step 4-1, sort the observations in C_u in ascending order of the internal standard deviation values S_{ij}. Let {y_{u(i)}^{S_j}, i = 1, …, n_u} be the observations in C_u sorted by the internal standard deviation values for the variable Y_j, and let C_{u,l}^{S_j,1} = {y_{u(1)}^{S_j}, …, y_{u(l)}^{S_j}} and C_{u,l}^{S_j,2} = {y_{u(l+1)}^{S_j}, …, y_{u(n_u)}^{S_j}}, l = 1, …, n_u − 1.

    4-4. For l = 1, …, n_u − 1,

      4-4-1. Compute Δ_u of (2.7) for the bipartition (C_{u,l}^{S_j,1}, C_{u,l}^{S_j,2}) as follows:

        $$\Delta_{u,l}^{S_j} = I(C_u) - I\!\left(C_{u,l}^{S_j,1}\right) - I\!\left(C_{u,l}^{S_j,2}\right).$$

  5. Find the bipartition corresponding to Δ, where

     $$\Delta = \max\left\{\Delta_{u,l}^{M_j},\ \Delta_{u,l}^{S_j};\ u = 1, \ldots, r,\ j = 1, \ldots, p,\ l = 1, \ldots, n_u - 1\right\},$$

     i.e., identify the u, l, j, and sort key (M_j or S_j) corresponding to Δ.

  6. If (C_{u,l}^{M_j,1}, C_{u,l}^{M_j,2}) is identified in step 5, the cut point at the rth stage is c_r = (M_{(l)j} + M_{(l+1)j})/2, where M_{(l)j} and M_{(l+1)j} are the internal means for the variable Y_j of y_{u(l)}^{M_j} and y_{u(l+1)}^{M_j}, respectively, and the binary question is ‘Is M(Y_j) ≤ c_r?’.

     If (C_{u,l}^{S_j,1}, C_{u,l}^{S_j,2}) is identified, the cut point is c_r = (S_{(l)j} + S_{(l+1)j})/2, where S_{(l)j} and S_{(l+1)j} are the internal standard deviations for the variable Y_j of y_{u(l)}^{S_j} and y_{u(l+1)}^{S_j}, respectively, and the binary question is ‘Is S(Y_j) ≤ c_r?’.

  7. Set r = r + 1 and repeat steps 3–6 until r = R or r = n, where R is a prespecified value and n is the total number of observations.

Algorithm 1 investigates two aspects of each observation to find the optimal bipartition. For each cluster and each variable Y_j, it first explores the (n_u − 1) bipartitions constructed from the internal mean values of the histogram observations, and then it considers another (n_u − 1) bipartitions obtained from the internal standard deviation values. Thus, it comprises a double monothetic process considering both the location and the dispersion information of the histograms and, at the rth stage, it finds the bipartition giving the largest reduction of the total intra-cluster size among $\sum_{u=1}^{r} 2p(n_u - 1)$ bipartitions. As mentioned in Section 1, there exist $\sum_{u=1}^{r}\left(2^{n_u - 1} - 1\right)$ possible bipartitions at the rth stage. Since Algorithm 1 investigates only $\sum_{u=1}^{r} 2p(n_u - 1)$ of them, the order of complexity is reduced from O(2^n) to O(n). However, because the proposed algorithm does not explore all possible bipartitions at each stage, like other greedy clustering algorithms it does not guarantee a globally optimal clustering outcome.
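For concreteness, one splitting stage of Algorithm 1 can be sketched as follows (our illustration, reusing intra_cluster_size from the sketch in Section 2.2): it scans every cluster, variable, and sort key, and returns the bipartition with the largest Δ together with the information needed to state the binary question.

```python
import numpy as np

def best_double_monothetic_split(clusters, M, S, dist2, n_total):
    """One stage of the double monothetic search (steps 3-5 of Algorithm 1).
    `clusters`: list of index lists; M, S: n x p arrays of internal means and
    internal standard deviations; `dist2`: n x n matrix of D^2 values."""
    best = None  # (delta, u, left, right, key, j, cut_point)
    for u, idx in enumerate(clusters):
        parent_size = intra_cluster_size(idx, dist2, n_total)
        for j in range(M.shape[1]):
            for key, values in (("M", M[:, j]), ("S", S[:, j])):
                order = sorted(idx, key=lambda i: values[i])
                for l in range(1, len(order)):           # the n_u - 1 ordered splits
                    left, right = order[:l], order[l:]
                    delta = (parent_size
                             - intra_cluster_size(left, dist2, n_total)
                             - intra_cluster_size(right, dist2, n_total))
                    if best is None or delta > best[0]:
                        cut = (values[order[l - 1]] + values[order[l]]) / 2.0
                        best = (delta, u, left, right, key, j, cut)
    return best  # binary question: 'Is key(Y_j) <= cut?' for the returned j
```

Repeating this search, replacing the chosen cluster with the two returned children, and recording the returned cut points would produce the hierarchy and its binary questions.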

At every stage, the proposed method provides a binary question with either an internal mean or an internal standard deviation as the cut point, and this describes the clustering structure of the histogram data. If the binary question comes from a bipartition (C_{u,l}^{M_j,1}, C_{u,l}^{M_j,2}) constructed from internal mean values, it can be interpreted to mean that the two clusters of the bipartition have different locations for the variable Y_j: observations in C_{u,l}^{M_j,1} have internal mean values for Y_j less than c_r, while the internal mean values for Y_j of observations in C_{u,l}^{M_j,2} are larger than c_r. If the binary question is obtained from a bipartition (C_{u,l}^{S_j,1}, C_{u,l}^{S_j,2}) based on internal standard deviations, it means that the two clusters have different degrees of dispersion for the variable Y_j and that C_{u,l}^{S_j,2} is more dispersed than C_{u,l}^{S_j,1}. The cut point for Y_j is the average of either the internal mean values or the internal standard deviation values of the two observations adjacent to the split with respect to Y_j.

4. Simulation study

In this section, the proposed clustering algorithm is verified through a well-known dataset and a simulated dataset. The proposed clustering method first requires a distance measure for histogram data. In this study, we use the squared Euclidean distance adopted in Brito and Chavent (2012). The squared Euclidean distance between two histogram observations, y_{i_1} and y_{i_2}, is defined by

$$D^2\!\left(y_{i_1}, y_{i_2}\right) = \sum_{j=1}^{p} D^2\!\left(y_{i_1 j}, y_{i_2 j}\right) = \sum_{j=1}^{p} \left[\sum_{k=1}^{v_j} \left(\pi_{i_1 jk} - \pi_{i_2 jk}\right)^2\right]. \tag{4.1}$$
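In code, (4.1) is a sum of squared differences between the relative-frequency vectors over the common subintervals (our sketch; each observation is assumed to be a list of p weight vectors).

```python
import numpy as np

def sq_euclidean(y1, y2):
    """Squared Euclidean distance (4.1) between two histogram observations,
    each a list of p relative-frequency vectors over common subintervals."""
    return float(sum(np.sum((np.asarray(w1) - np.asarray(w2)) ** 2)
                     for w1, w2 in zip(y1, y2)))
```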

Consider the iris dataset used in Fisher (1936). Originally, Fisher's iris dataset has 150 classical observations for three species (setosa, versicolor, virginica) and four variables (Y_1 = ‘Sepal Length’, Y_2 = ‘Sepal Width’, Y_3 = ‘Petal Length’, Y_4 = ‘Petal Width’), with 50 classical observations per species. By aggregating successive groups of ten classical observations into one histogram object, a histogram-valued dataset with 15 observations is generated. Histogram-valued observations y_1, …, y_5 belong to the species ‘setosa’, y_6, …, y_{10} to ‘versicolor’, and y_{11}, …, y_{15} to ‘virginica’. Thus, each iris species is represented by five histogram-valued observations.
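One possible way to construct such a histogram-valued iris dataset is sketched below (ours; the paper does not specify the binning, so the five equal-width bins per variable are an assumption, and scikit-learn is used merely as a convenient source of Fisher's data).

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data                       # 150 x 4, ordered by species (50 each)

def block_to_histograms(block, n_bins=5):
    """One histogram (bin edges, relative frequencies) per variable from a
    block of ten classical observations; the bin count is an assumption."""
    hists = []
    for j in range(block.shape[1]):
        counts, edges = np.histogram(block[:, j], bins=n_bins)
        hists.append((edges, counts / counts.sum()))
    return hists

# y1..y5: setosa, y6..y10: versicolor, y11..y15: virginica
iris_histograms = [block_to_histograms(X[i:i + 10]) for i in range(0, 150, 10)]
```

Before clustering, the histograms of each variable would still be put on common subintervals as described in Section 2.1.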

The dendrogram in Figure 1(a) was obtained by applying the proposed clustering algorithm to the iris histogram dataset. The dendrogram shows that the proposed algorithm correctly classified the observations into the three iris species. Also, the binary questions in the dendrogram show that the three clusters were distinguished by the internal mean values of the histograms. For example, the first bipartition (C_1 = {y_1, …, y_5}, C_2 = {y_6, …, y_{15}}) has the binary question ‘Is M(Y_4) ≤ 0.74?’. Here, C_1 corresponds to the answer ‘Yes’ and C_2 to ‘No’. This means that all observations in C_1 have internal mean values of ‘Petal Width’ less than 0.74, while observations in C_2 have internal mean values of Y_4 greater than 0.74. Note that the vertical axis in the dendrogram represents the total intra-cluster size of (2.5) for each partition, and that the answer ‘Yes’ to each binary question corresponds to the left hand side of a split and ‘No’ to the right hand side.

The iris histogram dataset gave an example of clusters distinguished by the locations of histograms. Now, we consider a case in which clusters with different dispersions overlap. Consider, therefore, histogram observations generated from the three bivariate normal distributions of (4.2). Each histogram-valued observation y_i is constructed from 100 classical observations generated from a bivariate normal distribution, and each cluster has five histogram observations. Figure 1(b) illustrates such a dataset generated from (4.2). As shown in Figure 1(b), clusters C_1 and C_2 have the same internal mean but different variances. In contrast, C_1 and C_3 have the same variance but different locations.

$$
\begin{aligned}
C_1 &= \{y_1, \ldots, y_5\} \sim \mathrm{BVN}\!\left(\begin{pmatrix}5\\5\end{pmatrix}, \begin{pmatrix}1 & 0\\0 & 1\end{pmatrix}\right), \\
C_2 &= \{y_6, \ldots, y_{10}\} \sim \mathrm{BVN}\!\left(\begin{pmatrix}5\\5\end{pmatrix}, \begin{pmatrix}5 & 0.8\\0.8 & 5\end{pmatrix}\right), \\
C_3 &= \{y_{11}, \ldots, y_{15}\} \sim \mathrm{BVN}\!\left(\begin{pmatrix}10\\5\end{pmatrix}, \begin{pmatrix}1 & 0\\0 & 1\end{pmatrix}\right).
\end{aligned} \tag{4.2}
$$
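One simulated dataset of this kind can be generated as follows (our sketch; the bin edges are an arbitrary choice, and using the same edges for every observation removes the need for the rebinning step of Section 2.1).

```python
import numpy as np

rng = np.random.default_rng(2018)
specs = [                                  # (mean, covariance) of (4.2)
    ([5, 5],  [[1.0, 0.0], [0.0, 1.0]]),   # C1
    ([5, 5],  [[5.0, 0.8], [0.8, 5.0]]),   # C2
    ([10, 5], [[1.0, 0.0], [0.0, 1.0]]),   # C3
]
edges = np.arange(-5.0, 25.0, 1.0)         # shared subintervals for both variables

simulated = []                             # 15 histogram-valued observations
for mean, cov in specs:
    for _ in range(5):                     # five observations per cluster
        pts = rng.multivariate_normal(mean, cov, size=100)
        simulated.append([np.histogram(pts[:, j], bins=edges)[0] / 100.0
                          for j in range(2)])
```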

Both the algorithm of Brito and Chavent (2012) and the proposed algorithm were applied to 1,000 sets of histogram data simulated from (4.2), and we counted the number of sets for which each algorithm correctly classified all 15 observations into the three clusters. As shown in Table 2, the proposed algorithm classified correctly in all 1,000 simulated datasets, whereas the Brito and Chavent (2012) algorithm did so in only 29 of them. Since the Brito and Chavent (2012) algorithm depends only on the location information of histograms to find the optimal bipartition, it cannot distinguish clusters C_1 and C_2, which have the same location but different dispersions. In contrast, the proposed algorithm considers both the internal mean and the internal standard deviation of histograms, i.e., location and dispersion information, and can therefore distinguish those clusters.

5. Data analysis

To demonstrate the divisive double monothetic clustering method for histogram-valued data proposed in this study, we analyse a large dataset of house property values for 51 U.S. states obtained from the American Community Survey Public Use Microdata Sample (ACS PUMS) 2012 (available on the US Census Bureau website; http://www2.census.gov/acs2012_1yr/pums/csv_hus.zip). The original classical data (ACS PUMS) contain many records for each sampled household. In this analysis, we are interested in classifying the 51 U.S. states by house property values; that is, the goal of this analysis is to investigate which states have the most similar distributions of house prices for various house sizes. Note that only single family houses are considered in this analysis.

To analyse house property values for each state by house size, we measure house size by the number of bedrooms and aggregate the data by state and house size. Since the number of one bedroom houses for each state in the original dataset is too small to be meaningful in the present context, one bedroom houses are merged with two bedroom houses. Similarly, since the sample sizes for each of the categories with five or more bedrooms are too small, houses with five bedrooms or more are pooled into one variable. Records with missing house property values are removed. Thus, we consider four variables: Y_1 is the house property value for a house with two bedrooms or less, Y_2 is the value for a three bedroom house, Y_3 is the value for a four bedroom house, and Y_4 is the value for a house with five bedrooms or more. From 753,917 individual households, the house property values for these four variables are aggregated by state, and a histogram-valued dataset with 51 observations (states) and 4 variables (Y_1, …, Y_4) is generated from these aggregated data.
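A hedged sketch of this aggregation step is given below. The file name is hypothetical, the column names (ST for state, BDSP for number of bedrooms, VALP for property value) follow our reading of the ACS PUMS codebook and should be checked against the 2012 documentation, and the restriction to single family houses as well as the choice of subintervals are simplified or omitted here.

```python
import numpy as np
import pandas as pd

# Household file extracted from csv_hus.zip (file name here is hypothetical)
hus = pd.read_csv("hus_2012.csv", usecols=["ST", "BDSP", "VALP"])
hus = hus.dropna(subset=["BDSP", "VALP"])

# Bedroom groups: Y1 = two or fewer, Y2 = three, Y3 = four, Y4 = five or more
hus["group"] = pd.cut(hus["BDSP"], bins=[-np.inf, 2, 3, 4, np.inf],
                      labels=["Y1", "Y2", "Y3", "Y4"])

edges = np.linspace(0, 2_000_000, 41)      # illustrative common subintervals

def rel_freq(values):
    counts, _ = np.histogram(values, bins=edges)
    return counts / max(counts.sum(), 1)

# One relative-frequency vector per (state, bedroom group) pair
state_histograms = {key: rel_freq(grp["VALP"].to_numpy())
                    for key, grp in hus.groupby(["ST", "group"], observed=True)}
```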

To apply the proposed method to this histogram-valued dataset, we first transform the histograms so that, for each variable, they have the same subintervals, and then calculate the squared Euclidean distances using (4.1). The proposed clustering method then classifies the 51 U.S. states, and the clustering result is shown in Figure 2(a).

The vertical axis of Figure 2(a) is the total intra-cluster size of (2.5), and the first five binary questions are provided in the dendrogram. The first three bipartitions (P_2, P_3, P_4) were determined by the internal mean values of the prices of three bedroom houses. The fourth bipartition had as its cut point an internal mean value of the price of houses with one or two bedrooms, and the fifth bipartition was obtained from the internal standard deviation of the prices of four bedroom houses.

From Figure 2(a), it seems that there are two major clusters in the dataset. To investigate the optimal number of clusters, cluster validity indexes were computed; such indexes evaluate clustering outcomes and suggest the optimal number of clusters. In this analysis, we used the Dunn and Davies-Bouldin indexes developed in Kim and Billard (2011). Figure 2(b) shows both index values for the clustering result in Figure 2(a). As shown in Figure 2(b), the first bipartition (P_2 = {C_1, C_2}) has the largest Dunn index and the smallest Davies-Bouldin index. Thus, we can conclude that there are two major clusters in the dataset. Table 3 lists the U.S. states in the two clusters.

This bipartition was detected by the binary question ‘Is M(Y_2) ≤ $205,595?’. That is, all U.S. states in C_1 have internal mean values for three bedroom houses lower than $205,595.46, while the internal mean values of the states in C_2 are higher than that price. The states in C_2 are mainly located in the northeastern region of the U.S.

6. Conclusion and discussion

In this study, a divisive double monothetic clustering method for histogram data is introduced. Chavent (2000) proposed a divisive monothetic clustering method for interval data and ordered categorical data, and Brito and Chavent (2012) extended the method to histogram data. Both monothetic methods find the optimal bipartition by checking whether the cumulative relative frequency exceeds 0.5. Kim (2009) also extended Chavent (2000) to histogram data by using the internal mean values to find the optimal bipartition. That is, these methods use only the location information of histograms. However, a histogram has both location and internal variation, so two histograms with the same location but different dispersions should be distinguishable. To distinguish histograms in such situations, we propose a double monothetic clustering algorithm using the internal mean and the internal standard deviation as location and dispersion information, respectively. The proposed clustering algorithm has the same order of complexity, O(n), as the existing monothetic methods and, like them, provides binary questions that help interpret the clustering outcome.

However, since the proposed clustering algorithm depends on the mean and standard deviation of individual histograms, it might not distinguish histograms that have the same first two moments but different higher moments. For example, a right-skewed and a left-skewed histogram may have the same mean and standard deviation, and the proposed algorithm cannot separate them. In that case, a monothetic clustering algorithm exploring the third and higher moments of histograms can be considered.

Limam et al. (2004) introduced a divisive monothetic clustering method for interval data using both intra-class homogeneity and inter-class discrimination criteria. Their method employs a stepwise strategy to find the best partition at each step and uses a probabilistic approach to assign objects to clusters. In contrast, the method proposed in this study uses a forward strategy and focuses on maximizing the homogeneity of clusters.

Although divisive clustering is slower than agglomerative clustering, it usually yields more accurate hierarchies because it starts with complete information about the global distribution. In contrast, since agglomerative algorithms merge clusters based on local information, they might not capture global structures. Also, agglomerative clustering does not indicate which variables drive the clustering outcome, while the proposed algorithm gives binary questions with splitting variables and cut points, yielding more interpretable results.

Acknowledgement

This paper was supported by Sungkyun Research Fund, Sungkyunkwan University, 2015 (No. S-2015-1295-0000).

Figures
Fig. 1. (a) Dendrogram from Iris histogram data; (b) A simulated dataset from three bivariate normal distributions.
Fig. 2. (a) Clustering result for house property values of each U.S. state; (b) Cluster validity indexes for the clustering result.
TABLES

Table 1

An example of histogram observations with the same subintervals

Observation   [0, 2)   [2, 4)   [4, 6)   [6, 8)   [8, 10)   [10, 12)
y1             0.0      0.2      0.25     0.25     0.3       0.0
y2             0.7      0.2      0.10     0.00     0.0       0.0
y3             0.0      0.0      0.00     0.20     0.2       0.6

Table 2

The simulation result for bivariate normal distributions

                                                                  Brito and Chavent (2012) algorithm   The proposed algorithm
# of sets correctly classifying all observations into 3 clusters                  29                          1,000

Table 3

Clusters in the U.S. house property value dataset

P2    State
C1AL, AR, AZ, FL, GA, IA, ID, IL, IN, KS, KY, LA, ME, MI, MN, MO, MS, NC, ND, NE, NM, NV, OH, OK, PA, SC, SD, TN, TX, UT, WI, WV
C2AK, CA, CO, CT, DC, DE, HI, MA, MD, MT, NH, NJ, NY, OR, RI, VA, VT, WA, WY

References
  1. Arroyo, J, González-Rivera, G, Maté, C, and Muñoz San Roque, A (2011). Smoothing methods for histogram-valued time series: an application to value-at-risk. Statistical Analysis and Data Mining. 4, 216-228.
  2. Billard, L (2006). Symbolic data analysis: what is it? Proceedings in Computational Statistics 2006, Rizzi, A, and Vichi, M, ed. Rome, Italy, pp. 261-269.
  3. Billard, L (2011). Brief overview of symbolic data and analytic issues. Statistical Analysis and Data Mining: The ASA Data Science Journal. 4, 149-156.
  4. Billard, L, and Diday, E (2003). From the statistics of data to the statistics of knowledge: symbolic data analysis. Journal of the American Statistical Association. 98, 470-487.
  5. Billard, L, and Diday, E (2007). Symbolic Data Analysis: Conceptual Statistics and Data Mining. England: John Wiley & Sons.
  6. Bock, HH, and Diday, E (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Berlin: Springer-Verlag.
  7. Brito, P, and Chavent, M (2012). Divisive monothetic clustering for interval and histogram-valued data. Proceedings of ICPRAM 2012 - 1st International Conference on Pattern Recognition Applications and Methods, Vilamoura, Portugal.
  8. Chavent, M (2000). Criterion-based divisive clustering for symbolic data. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, Bock, HH, and Diday, E, ed. Berlin: Springer-Verlag, pp. 299-311.
  9. De Carvalho, FAT, and De Souza, RMCR (2010). Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognition Letters. 31, 430-443.
  10. Dias, S, and Brito, P (2011). A new linear regression model for histogram-valued variables. Proceedings of the 58th ISI World Statistics Congress, Dublin, Ireland.
  11. Diday, E (1987). Introduction à l'approche symbolique en analyse des données. Premières Journées Symbolique-Numérique. Université Paris IX: CEREMADE, pp. 21-56.
  12. Diday, E (1995). Probabilist, possibilist and belief objects for knowledge analysis. Annals of Operations Research. 55, 225-276.
  13. Edwards, AWF, and Cavalli-Sforza, LL (1965). A method for cluster analysis. Biometrics. 21, 362-375.
  14. Fisher, RA (1936). The use of multiple measurements in taxonomic problems. Annals of Human Genetics. 7, 179-188.
  15. Har-even, M, and Brailovsky, VL (1995). Probabilistic validation approach for clustering. Pattern Recognition Letters. 16, 1189-1196.
  16. Irpino, A, and Verde, R (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. Proceedings IFCS 2006, Batagelj, V, ed. Heidelberg: Springer, pp. 185-192.
  17. Irpino, A, and Verde, R (2008). Dynamic clustering of interval data using a Wasserstein-based distance. Pattern Recognition Letters. 29, 1648-1658.
  18. Kim, J (2009). Dissimilarity Measures for Histogram-valued Data and Divisive Clustering of Symbolic Objects (PhD thesis). University of Georgia.
  19. Kim, J, and Billard, L (2011). A polythetic clustering process and cluster validity indexes for histogram-valued objects. Computational Statistics & Data Analysis. 55, 2250-2262.
  20. Kim, J, and Billard, L (2012). Dissimilarity measures and divisive clustering for symbolic multimodal-valued data. Computational Statistics & Data Analysis. 56, 2795-2808.
  21. Kim, J, and Billard, L (2013). Dissimilarity measures for histogram-valued observations. Communications in Statistics - Theory and Methods. 42, 283-303.
  22. Lance, GN, and Williams, WT (1968). Note on a new information-statistic classificatory program. The Computer Journal. 11, 195.
  23. Limam, MM, Diday, E, and Winsberg, S (2004). Probabilist allocation of aggregated statistical units in classification trees for symbolic class description. Classification, Clustering and Data Mining Applications, Banks, D, House, L, McMorris, FR, Arabie, P, and Gaul, W, ed. Heidelberg: Springer, pp. 371-379.
  24. MacNaughton-Smith, P, Williams, WT, Dale, MB, and Mockett, LG (1964). Dissimilarity analysis: a new technique of hierarchical sub-division. Nature. 202, 1034-1035.
  25. Verde, R, and Irpino, A (2008). Comparing histogram data using a Mahalanobis-Wasserstein distance. COMPSTAT 2008. Heidelberg: Physica-Verlag, pp. 77-89.
  26. Williams, WT, and Lambert, JM (1959). Multivariate methods in plant ecology: I. Association-analysis in plant communities. Journal of Ecology. 47, 83-101.