Exploratory Methods for Joint Distribution Valued Data and Their Application

Kazuto Igarashi1,a, Hiroyuki Minamib, and Masahiro Mizutab

aGraduate School of Information Science and Technology, Hokkaido University, Japan, bInformation Initiative Center, Hokkaido University, Japan
Correspondence to: Kazuto Igarashi
Information Initiative Center, Hokkaido University, N11-W5, Kita-ku, Sapporo, Hokkaido, 060-0811, Japan. E-mail: igarashi@iic.hokudai.ac.jp
Received January 23, 2015; Revised March 12, 2015; Accepted March 31, 2015.
Abstract

In this paper, we propose hierarchical cluster analysis and multidimensional scaling for joint distribution valued data. The development of information technology is increasing the need for statistical methods that can handle large and complex data, and Symbolic Data Analysis (SDA) is an attractive framework for such data. In SDA, target objects are typically represented by aggregated data. Most SDA methods deal with objects represented as intervals or histograms; however, those representations cannot retain information among variables, including correlation. Objects represented as a joint distribution, by contrast, can retain such information. Therefore, we focus on methods for joint distribution valued data. We expand two well-known exploratory methods using dissimilarities based on the Hall Type relative projection index among joint distribution valued data, and demonstrate the proposed methods with a simulation study and an actual example.

Keywords : Symbolic Data Analysis (SDA), cluster analysis, multidimensional scaling, projection index, kernel density estimation
1. Introduction

These days, the development of information technology enables us to collect and store data easily and has subsequently resulted in large and complex data. A conventional analytic framework is not effective when analyzing these kinds of data.

Symbolic Data Analysis (SDA) is one of the attractive frameworks for analyzing large and complex data. SDA was proposed by Diday in the 1980s, and many methods have been extended within its framework (Billard and Diday, 2006; Bock and Diday, 2000; Diday and Noirhomme-Fraiture, 2008). In SDA, target objects are typically represented by aggregated data, called symbolic objects; consequently, how they are represented is important for retaining the information in the original data. Most SDA methods analyze objects represented as intervals or histograms. However, objects represented as intervals do not contain information about distributions because they keep only a pair of minimum and maximum values. Objects represented as histograms can contain information about one dimensional distributions, but they lose information among variables, including correlation. In contrast, objects represented as a joint distribution can retain information among variables. When we analyze data in detail, methods for objects represented as a joint distribution are therefore effective; however, there are few such methods in SDA.

Therefore, we propose methods for objects represented as a joint distribution. In particular, we focus on two well-known statistical methods: hierarchical cluster analysis and multidimensional scaling. In both methods, how to define dissimilarities is the central problem in grasping the characteristic structure of joint distributions. To address it, we adopt the Hall Type relative projection index as the dissimilarity among joint distributions. The index is used in relative projection pursuit, an extension of projection pursuit, which is a method of dimension reduction.

This paper is organized as follows. In this section, we have introduced the background and the outline of this study. In section 2, we explain the basic concept and terms of SDA and introduce preceding studies. In section 3, we provide details of the proposed approaches. Section 4 presents a simulation study in which we show the effectiveness of the proposed methods by comparing them with existing methods. In section 5, we apply the proposed methods to telemonitoring data on Parkinson’s disease patients as an actual example. Section 6 provides a conclusion.

2. Symbolic Data Analysis

In this section, we explain the basic concept and terms of SDA, and preceding studies of cluster analysis and multidimensional scaling on SDA.

2.1. Target objects

In conventional analysis, an object is described by a single value or a vector. Therefore, when multiple observation values are provided for each variable of an object, we would lose most of the information by aggregating them into a summarized value such as an average. In SDA, objects in the conventional sense are called individuals, and we analyze aggregations of individuals, called concepts. Concepts are typically described by intervals, histograms, and distributions. We focus especially on multidimensional distributions, i.e., joint distributions, and call such objects joint distribution valued data.

2.2. Preceding studies

Many methods for cluster analysis in SDA have been proposed, such as Gowda and Diday (1991) and Chavent and Lechevallier (2002). For cluster analysis in which objects are represented as distributions, Katayama et al. (2010) proposed hierarchical cluster analysis using the symmetric Kullback-Leibler divergence as the dissimilarity. In addition, Terada and Yadohisa (2010) proposed non-hierarchical cluster analysis using cumulative distribution functions as dissimilarities. These methods realize cluster analysis that takes into account information among variables and the distributions of the original data. However, the method of Katayama et al. (2010) is limited in its applicable cases because it assumes that the target distributions are normal. The method of Terada and Yadohisa (2010) must keep all probabilities of the intervals because it uses cumulative distribution functions; therefore, its computational complexity becomes enormous in higher dimensions.

Multidimensional scaling on SDA was proposed by Groenen et al. (2006), in which the dissimilarities are represented as interval valued data. Mizuta and Minami (2012) proposed multidimensional scaling in which the dissimilarities are distributions. However, there are few multidimensional scaling methods in which the target objects are represented as multidimensional distributions.

3. Proposed Method

In this section, we give notations and explain how to calculate the dissimilarities among joint distributions. We also describe hierarchical cluster analysis and multidimensional scaling using these dissimilarities.

3.1. Notations

We assume that there are m concepts and that the ith concept consists of an ni × p matrix Xi, where ni is the number of individuals included in the ith concept and each individual is described by p variables. We denote the m concepts collectively as X.

$X_i = \begin{pmatrix} x_{i,1,1} & x_{i,1,2} & \cdots & x_{i,1,p} \\ x_{i,2,1} & x_{i,2,2} & \cdots & x_{i,2,p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{i,n_i,1} & x_{i,n_i,2} & \cdots & x_{i,n_i,p} \end{pmatrix}, \qquad X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}.$

The ith symbolic object is built by aggregating the ni individuals. We represent the ith symbolic object ξi(z) as a density function on the p variables: we approximate ξi(z) by a joint distribution in the form of a probability density function estimated from Xi by kernel density estimation.

3.2. Dissimilarities

It is important to define a dissimilarity among joint distributions. In this study, we adopt the Hall Type relative projection index, which is used in relative projection pursuit (Hiro et al., 2004). Projection pursuit is a dimension-reduction method that searches for a low dimensional space in which an interesting structure is found. The original projection pursuit regards the normal distribution as the most uninteresting structure: if the projected data have the structure most different from the normal distribution, it is regarded as the most interesting. Projection pursuit uses a projection index to measure the distance between distributions. There are several indices, including the Hall index, the area index, and the moment index; we focus on the Hall index, which defines the dissimilarity through the difference of density functions. The one dimensional Hall index is defined as

$J \equiv \int_{-\infty}^{\infty} \{f_\alpha(u) - \varphi(u)\}^2 \, du.$

Here fα(u) is the probability density function of the samples projected onto a one dimensional space by the projection vector α, and φ(·) is the probability density function of the standard normal distribution.

Projection pursuit was extended to use not only the standard normal distribution but arbitrary distributions as the reference; the extended method is called relative projection pursuit. Hiro et al. (2004) proposed the Hall Type relative projection index, which provides the dissimilarity between the distribution of an object’s samples and that of referred samples:

$I(\alpha) = \int_{-\infty}^{\infty} \{f_\alpha(u) - g_\alpha(u)\}^2 \, du = \int_{-\infty}^{\infty} f_\alpha(u)^2 \, du + \int_{-\infty}^{\infty} g_\alpha(u)^2 \, du - 2 \int_{-\infty}^{\infty} f_\alpha(u) \, g_\alpha(u) \, du,$

where fα(u) is the density function of the object’s samples and gα(u) is that of the referred samples, both projected onto a one dimensional space by the projection vector α.

We use the Hall Type relative projection index to calculate dissimilarities among joint distribution valued data. Let fi(z) denote the density function of the ith joint distribution valued data with p variables, where z = (z1, z2, . . . , zp); fj(z) is defined likewise for the jth data. The dissimilarity sij is represented as

$s_{ij} = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \{f_i(z) - f_j(z)\}^2 \, dz = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_i(z)^2 \, dz + \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_j(z)^2 \, dz - 2 \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_i(z) f_j(z) \, dz.$

We adopt kernel density estimation with the normal distribution as the kernel function. The distribution valued data are represented as

$\xi_i(z) \approx f_i(z) = \frac{1}{(2\pi)^{\frac{p}{2}} n_i h_{i,1} h_{i,2} \cdots h_{i,p}} \sum_{k=1}^{n_i} \left\{ \prod_{r=1}^{p} \exp\left( -\frac{(z_r - x_{i,k,r})^2}{2 h_{i,r}^2} \right) \right\},$

where hi,1, hi,2, . . . , hi,p are the optimal bandwidths of the density function fi(z), given by Scott (1992) as

$h_{i,r} = \left( \frac{4}{p+2} \right)^{\frac{1}{p+4}} \sigma_r \, n_i^{-\frac{1}{p+4}} \qquad (r = 1, 2, \ldots, p),$

where σr is the standard deviation of the rth variable over all objects’ samples.
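As an illustration, the product-kernel estimate above can be evaluated directly. The following sketch assumes NumPy; the function name `kde_eval` is ours, not from the paper:

```python
import numpy as np

def kde_eval(z, X, h):
    """Evaluate the Gaussian product-kernel estimate f_i(z) at one point z,
    given the n_i x p sample X and the per-variable bandwidths h."""
    n, p = X.shape
    # normalizing constant 1 / ((2*pi)^(p/2) * n * h_1 * ... * h_p)
    coef = 1.0 / ((2.0 * np.pi) ** (p / 2.0) * n * np.prod(h))
    # each row of X contributes a product of p one-dimensional kernels
    expo = -((z - X) ** 2 / (2.0 * h ** 2)).sum(axis=1)
    return coef * np.exp(expo).sum()
```

With a single sample point X = [[0]] and bandwidth h = 1, this reduces to the standard normal density at z, which is a convenient sanity check.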

Using the estimated density functions, we transform each term as follows:

$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_i(z)^2 \, dz = \frac{1}{2^{p} \pi^{\frac{p}{2}} n_i^2 h_{i,1} h_{i,2} \cdots h_{i,p}} \sum_{k=1}^{n_i} \sum_{l=1}^{n_i} \left\{ \prod_{r=1}^{p} \exp\left( -\frac{(x_{i,k,r} - x_{i,l,r})^2}{4 h_{i,r}^2} \right) \right\},$
$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_j(z)^2 \, dz = \frac{1}{2^{p} \pi^{\frac{p}{2}} n_j^2 h_{j,1} h_{j,2} \cdots h_{j,p}} \sum_{k=1}^{n_j} \sum_{l=1}^{n_j} \left\{ \prod_{r=1}^{p} \exp\left( -\frac{(x_{j,k,r} - x_{j,l,r})^2}{4 h_{j,r}^2} \right) \right\},$
$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_i(z) f_j(z) \, dz = \frac{1}{2^{\frac{p}{2}} \pi^{\frac{p}{2}} n_i n_j \prod_{r=1}^{p} \left( h_{i,r}^2 + h_{j,r}^2 \right)^{\frac{1}{2}}} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} \left\{ \prod_{r=1}^{p} \exp\left( -\frac{(x_{i,k,r} - x_{j,l,r})^2}{2 \left( h_{i,r}^2 + h_{j,r}^2 \right)} \right) \right\}.$

Then, the dissimilarity sij is represented as

$s_{ij} = \frac{1}{2^{p} \pi^{\frac{p}{2}} n_i^2 h_{i,1} h_{i,2} \cdots h_{i,p}} \sum_{k=1}^{n_i} \sum_{l=1}^{n_i} \left\{ \prod_{r=1}^{p} \exp\left( -\frac{(x_{i,k,r} - x_{i,l,r})^2}{4 h_{i,r}^2} \right) \right\} + \frac{1}{2^{p} \pi^{\frac{p}{2}} n_j^2 h_{j,1} h_{j,2} \cdots h_{j,p}} \sum_{k=1}^{n_j} \sum_{l=1}^{n_j} \left\{ \prod_{r=1}^{p} \exp\left( -\frac{(x_{j,k,r} - x_{j,l,r})^2}{4 h_{j,r}^2} \right) \right\} - \frac{1}{2^{\frac{p}{2}-1} \pi^{\frac{p}{2}} n_i n_j \prod_{r=1}^{p} \left( h_{i,r}^2 + h_{j,r}^2 \right)^{\frac{1}{2}}} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} \left\{ \prod_{r=1}^{p} \exp\left( -\frac{(x_{i,k,r} - x_{j,l,r})^2}{2 \left( h_{i,r}^2 + h_{j,r}^2 \right)} \right) \right\}.$
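The closed form lends itself to direct implementation. The following sketch (assuming NumPy; the function names are ours) computes the dissimilarity from the raw samples of two concepts, using the Scott bandwidths defined above:

```python
import numpy as np

def scott_bandwidths(X, sigma):
    """Scott-rule bandwidths h_{i,r} for an n_i x p sample X; sigma holds
    the per-variable standard deviations over all objects' samples."""
    n, p = X.shape
    return (4.0 / (p + 2.0)) ** (1.0 / (p + 4.0)) * sigma * n ** (-1.0 / (p + 4.0))

def overlap(Xi, Xj, hi, hj):
    """Closed-form integral of f_i(z) * f_j(z) over R^p for Gaussian-kernel
    density estimates with per-variable bandwidths hi and hj."""
    ni, p = Xi.shape
    nj = Xj.shape[0]
    v = hi ** 2 + hj ** 2                          # per-variable h_i^2 + h_j^2
    d2 = (Xi[:, None, :] - Xj[None, :, :]) ** 2    # (n_i, n_j, p) squared diffs
    s = np.exp(-(d2 / (2.0 * v)).sum(axis=2)).sum()
    coef = (2.0 * np.pi) ** (p / 2.0) * np.prod(np.sqrt(v)) * ni * nj
    return s / coef

def hall_dissimilarity(Xi, Xj, sigma):
    """Hall Type dissimilarity s_ij between two joint distribution valued data."""
    hi = scott_bandwidths(Xi, sigma)
    hj = scott_bandwidths(Xj, sigma)
    return (overlap(Xi, Xi, hi, hi) + overlap(Xj, Xj, hj, hj)
            - 2.0 * overlap(Xi, Xj, hi, hj))
```

By construction s_ii = 0 and s_ij = s_ji, which is what the clustering and scaling steps below require of a dissimilarity.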

3.3. Cluster analysis for joint distribution valued data

Cluster analysis classifies objects into groups by merging similar objects sequentially. The algorithm of the proposed hierarchical cluster analysis for joint distribution valued data is as follows.

• Step 1: As initial state, regard m symbolic objects as m clusters.

• Step 2: Calculate dissimilarities using the expression (3.8).

• Step 3: Merge the most similar two objects as one new cluster.

• Step 4: Calculate dissimilarities between the new cluster generated in Step 3 and the others.

• Step 5: Repeat Step 3 and Step 4 until the number of clusters is one.

Here, we explain the analysis procedure. First, we build symbolic objects. Next, we generate clusters using the above algorithm, choosing how to merge clusters by a linkage criterion such as the single linkage method, the complete linkage method, or the Ward method (Hartigan, 1975); hereafter, we adopt the Ward method. Then, we visualize the result by a dendrogram. Finally, we interpret the result.
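The algorithm above can be sketched with SciPy's hierarchical clustering routines, taking a precomputed matrix of the dissimilarities s_ij as input. Note one caveat: Ward linkage formally presumes Euclidean distances, so applying it to Hall Type dissimilarities is a pragmatic choice rather than a theoretical guarantee. The function name `cluster_concepts` is ours:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_concepts(S, n_clusters):
    """Hierarchical clustering of m concepts from a precomputed symmetric
    m x m dissimilarity matrix S; returns one cluster label per concept."""
    condensed = squareform(S, checks=False)   # strict upper triangle as a vector
    Z = linkage(condensed, method="ward")     # Steps 1-5 of the algorithm
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

The linkage matrix Z can also be passed to scipy.cluster.hierarchy.dendrogram to draw the dendrogram used for interpretation.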

3.4. Multidimensional scaling for joint distribution valued data

Multidimensional scaling visualizes relationships among objects as a configuration in a low dimensional space, based on dissimilarities. In the proposed method, we use the dissimilarity matrix S = {sij} as the input of Torgerson’s method (Torgerson, 1958). The analysis procedure is as follows. First, we build symbolic objects and calculate the dissimilarity matrix using the expression (3.8). Next, we apply Torgerson’s method, which is based on the Young-Householder theorem, to the dissimilarity matrix S. We transform the dissimilarity matrix into the matrix B = {bij},

$b_{ij} = -\frac{1}{2} \left( s_{ij}^2 - s_{i\cdot}^2 - s_{\cdot j}^2 + s_{\cdot\cdot}^2 \right),$

where

$s_{i\cdot}^2 = \frac{1}{m} \sum_{j=1}^{m} s_{ij}^2, \qquad s_{\cdot j}^2 = \frac{1}{m} \sum_{i=1}^{m} s_{ij}^2, \qquad s_{\cdot\cdot}^2 = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} s_{ij}^2.$

We apply eigenvalue decomposition to B and construct a configuration from the relatively large eigenvalues and their eigenvectors. Finally, we visualize the result as a configuration in a low dimensional space and interpret it.
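Torgerson's method can be sketched in a few lines (assuming NumPy; `torgerson` is our name for the routine). The double centering with J reproduces the b_ij transformation element by element:

```python
import numpy as np

def torgerson(S, dim=2):
    """Torgerson's classical multidimensional scaling: map an m x m
    dissimilarity matrix S to coordinates in `dim` dimensions."""
    m = S.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m       # centering matrix
    B = -0.5 * J @ (S ** 2) @ J               # double-centered matrix {b_ij}
    w, V = np.linalg.eigh(B)                  # eigenvalue decomposition of B
    order = np.argsort(w)[::-1][:dim]         # keep the largest eigenvalues
    w, V = w[order], V[:, order]
    return V * np.sqrt(np.maximum(w, 0.0))    # configuration coordinates
```

When S contains exact Euclidean distances, the recovered configuration reproduces them; for general dissimilarities the largest eigenvalues give the best low dimensional approximation.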

4. Simulation Study

We apply the proposed methods to an artificial dataset generated by copulas. A copula is a function that describes the dependence structure among marginal distribution functions. In addition, we compare the results with those of three existing approaches.

We use two types of copulas, the Gumbel copula and the Clayton copula (Nelsen, 1999), with various parameter values. With parameter θ, the Gumbel copula is represented as

$C(\mu, \nu) = \exp\left( -\left[ (-\ln \mu)^{\theta} + (-\ln \nu)^{\theta} \right]^{\frac{1}{\theta}} \right) \qquad (1 \le \theta).$

The Clayton copula is represented as

$C(\mu, \nu) = \left[ \max\left( \mu^{-\theta} + \nu^{-\theta} - 1, \, 0 \right) \right]^{-\frac{1}{\theta}} \qquad (-1 \le \theta < 0 \ \text{or} \ 0 < \theta).$

Table 1 shows the details of the copulas. There are four kinds of copulas, and each generates five concepts; thus there are 20 concepts in total, each consisting of 500 individuals. We adjust the parameters so that their values of Kendall’s τ are the same. Figure 1 shows examples of the copulas in the simulation.
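For reference, concepts from the Clayton copula can be generated by conditional inversion; the sketch below assumes NumPy (the Gumbel copula requires positive stable variates and is omitted here). The function name `clayton_sample` is ours:

```python
import numpy as np

def clayton_sample(theta, n, rng):
    """Draw n pairs (u, v) from a Clayton copula with parameter theta > 0.

    Conditional inversion: given U = u, V is obtained by inverting the
    conditional distribution C(v | u) at an independent uniform draw t."""
    u = rng.uniform(size=n)
    t = rng.uniform(size=n)
    v = ((t ** (-theta / (theta + 1.0)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
    return np.column_stack([u, v])
```

For the Clayton copula, Kendall's τ equals θ/(θ + 2), which matches the parameter choices in Table 1 (θ = 3 gives τ = 0.6, θ = 8 gives τ = 0.8).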

We introduce three existing methods: an interval based approach (Chavent and Lechevallier, 2002), a histogram based approach (Diday and Noirhomme-Fraiture, 2008), and a distribution based approach (Katayama et al., 2010). The dissimilarities used in these methods are the key points of comparison with the proposed methods.

Chavent and Lechevallier (2002) use the Hausdorff distance as the dissimilarity for interval valued data. When the ith concept is represented as the intervals [minik, maxik] (k = 1, . . . , p), the Hausdorff distance between concepts i and j is represented as

$s_{ij} = \sum_{k=1}^{p} \max\left[ \left| \min\nolimits_{ik} - \min\nolimits_{jk} \right|, \left| \max\nolimits_{ik} - \max\nolimits_{jk} \right| \right].$
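A minimal sketch of this dissimilarity (assuming NumPy; the concept's intervals are passed as vectors of per-variable minima and maxima, and the function name is ours):

```python
import numpy as np

def hausdorff_interval(mins_i, maxs_i, mins_j, maxs_j):
    """Sum over the p variables of the Hausdorff distance between the
    intervals [min_ik, max_ik] and [min_jk, max_jk]."""
    return float(np.sum(np.maximum(np.abs(mins_i - mins_j),
                                   np.abs(maxs_i - maxs_j))))
```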

The L2 distance is adopted as the dissimilarity for histogram valued data (Diday and Noirhomme-Fraiture, 2008). When the ith concept is represented as (qi,k,1, qi,k,2, . . . , qi,k,bk), k = 1, . . . , p, where $\sum_{l=1}^{b_k} q_{i,k,l} = 1$ and bk is the number of bins of the histogram for the kth variable, the L2 distance between concepts i and j is represented as

$s_{ij} = \sum_{k=1}^{p} \sum_{l=1}^{b_k} \left( q_{i,k,l} - q_{j,k,l} \right)^2.$

Katayama et al. (2010) use the symmetric Kullback-Leibler divergence as the dissimilarity for distribution valued data. When the ith distribution is the normal distribution N(μi, Σi), the symmetric Kullback-Leibler divergence between concepts i and j is represented as

$s_{ij} = \operatorname{tr}\left( \Sigma_i \Sigma_j^{-1} \right) + \operatorname{tr}\left( \Sigma_j \Sigma_i^{-1} \right) + \operatorname{tr}\left( \left( \Sigma_i^{-1} + \Sigma_j^{-1} \right) (\mu_i - \mu_j)(\mu_i - \mu_j)^{\mathrm{T}} \right) - 2p.$
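This dissimilarity translates directly into code. The sketch below assumes NumPy; in practice μi and Σi would be estimated from each concept's samples, and the function name is ours:

```python
import numpy as np

def sym_kl_normal(mu_i, Sig_i, mu_j, Sig_j):
    """Symmetric Kullback-Leibler dissimilarity between N(mu_i, Sig_i)
    and N(mu_j, Sig_j) in p dimensions."""
    p = mu_i.shape[0]
    Si_inv = np.linalg.inv(Sig_i)
    Sj_inv = np.linalg.inv(Sig_j)
    d = (mu_i - mu_j)[:, None]                   # mean difference as a column
    return (np.trace(Sig_i @ Sj_inv) + np.trace(Sig_j @ Si_inv)
            + np.trace((Si_inv + Sj_inv) @ (d @ d.T)) - 2.0 * p)
```

Identical distributions give s_ij = 0, since tr(ΣΣ⁻¹) = p and the mean term vanishes.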

Figures 2 and 3 show the results of the interval based approach, and Figures 4 and 5 those of the histogram based approach. These figures show that the existing approaches do not provide clusters based on the kinds of copulas. Figures 6 and 7 show the results of the distribution based approach: objects are classified into two clusters based on correlation, but this approach does not properly grasp the structure of the copulas.

Figure 8 shows the result of the proposed hierarchical cluster analysis. The symbolic objects are classified into four clusters by cutting the dendrogram at a height of 0.1, which shows that the proposed method can grasp the structure of the different correlations and copulas. Figure 9 shows the result of the proposed multidimensional scaling: the vertical axis reflects the differences among copulas, and the horizontal axis those of correlation. We can thus also grasp the relationships among objects using the proposed multidimensional scaling.

5. Application

In this section, we show an actual example of the proposed methods. We introduce the dataset at first; subsequently, we explain an application of the proposed methods for the dataset and interpret the results.

5.1. Dataset

We use telemonitoring data on Parkinson’s disease patients. The dataset is publicly available at the UCI Machine Learning Repository and concerns voice measures of the noise in patients’ phonation (Tsanas et al., 2010). It consists of 5,875 individuals and 20 variables, including the patient’s ID, sex, age, and voice measures, as well as two evaluations diagnosed by a doctor, called UPDRS. We focus on six voice measures (Table 2). The dataset covers 42 patients (28 male, 14 female) in the initial stage of Parkinson’s disease, whose phonation was collected by a telemonitoring system called Intel AHTD for six months. Tsanas et al. (2010) tried to predict UPDRS values from the voice measures by regression methods.

5.2. How to build symbolic objects

We explain how to build symbolic objects in order to apply the proposed methods to the dataset. We regard each patient as a concept; each measurement is then an individual. We build joint distribution valued data by aggregating the six voice measures. We excluded the 36th patient from the dataset because her values were extremely large and she was regarded as an outlier.

Jitter (y1) and Shimmer (y2) are parameters that evaluate periodic disorder of the vocal-fold vibration: Jitter quantifies the fluctuation of the periodicity in every basic period, and Shimmer similarly quantifies the fluctuation of the amplitude. NHR (y3) and HNR (y4) measure the power ratio between the harmonic component and the noise component of the speech signal. DFA (y5) measures the extent of turbulent noise in the speech signal, quantifying the stochastic self-similarity of the noise caused by turbulent airflow in the vocal tract (Little et al., 2007); incomplete vocal-fold closure causes a rise in DFA. PPE (y6) measures the impaired control of stable pitch during sustained phonation (Little et al., 2009). PPE is robust to confounding factors, such as smooth vibrato, which are present in healthy voices as well as dysphonic voices; thus, this measure contributes significant information for separating healthy controls and Parkinson’s disease patients (Tsanas et al., 2010).

5.3. Results

Figure 10 shows the result of the proposed hierarchical clustering method. Patients are classified into three clusters by cutting the dendrogram at a height of 4.0e+07; we label them cluster 1, cluster 2, and cluster 3 from the left.

Figure 11 shows the configuration obtained by the proposed multidimensional scaling. We consider that the configuration in the two dimensional space contains significant information because the cumulative contribution ratio of the first and second eigenvalues is over 90%. Figure 11 also shows the structure of the three clusters found in Figure 10; each patient number is plotted in the respective cluster’s color (cluster 1: red, cluster 2: blue, cluster 3: green).

We now interpret the results. Figure 12 is a scatterplot matrix of all individuals. The DFA values of cluster 1 are high, which means that the phonation of the patients in cluster 1 shows self-similarity of the noise. In contrast, the DFA and HNR values of cluster 2 are low: the phonation of the patients in cluster 2 does not show self-similarity of the noise, but much noise is contained in their phonation. Most values of the variables in cluster 3 are relatively low, except HNR.

We summarize the interpretation in Table 3. The configuration has two directions that represent voice features: one shows the degree of self-similarity of the noise in the phonation, and the other shows the amount of the noise.

6. Concluding Remarks

In this paper, we proposed hierarchical cluster analysis and multidimensional scaling for joint distribution valued data in the framework of SDA. Representing objects as joint distribution valued data keeps information among variables, including correlation. We expanded the existing methods using dissimilarities based on the Hall Type relative projection index among joint distribution valued data. We investigated the effectiveness of the proposed methods by a simulation study using copulas, comparing the proposed methods with existing approaches, and we also applied the proposed methods to telemonitoring data on Parkinson’s disease patients. We confirmed that the clusters and the configuration grasped the voice characteristics of Parkinson’s disease patients.

Figures
Fig. 1. Examples of copulas.
Fig. 2. Result of interval based clustering.
Fig. 3. Result of interval based multidimensional scaling.
Fig. 4. Result of histogram based clustering.
Fig. 5. Result of histogram based multidimensional scaling.
Fig. 6. Result of distribution based clustering.
Fig. 7. Result of distribution based multidimensional scaling.
Fig. 8. Result of the proposed hierarchical cluster analysis.
Fig. 9. Result of the proposed multidimensional scaling.
Fig. 10. Result of hierarchical cluster analysis.
Fig. 11. Result of multidimensional scaling.
Fig. 12. Pairs plot of individuals.
TABLES

Table 1

Copulas in the simulation

Concept | Copula | Parameter θ | Kendall’s τ
1–5 | Gumbel copula | 2.5 | 0.6
6–10 | Gumbel copula | 5.0 | 0.8
11–15 | Clayton copula | 3.0 | 0.6
16–20 | Clayton copula | 8.0 | 0.8

Table 2

Details of variables

Variable | Measure | Description
y1 | MDVP: Jitter(abs) | KP-MDVP absolute jitter in microseconds
y2 | MDVP: Shimmer | KP-MDVP local shimmer
y3 | NHR | Noise-to-Harmonics ratio
y4 | HNR | Harmonics-to-Noise ratio
y5 | DFA | Detrended fluctuation analysis
y6 | PPE | Pitch period entropy

Table 3

Interpretation of clusters

Cluster | Interpretation
1 | Patients who have self-similarity of the noise in their phonation.
2 | Patients who have a large amount of noise in their phonation.
3 | Patients who have relatively few symptoms of Parkinson’s disease.

References
1. Billard, L, and Diday, E (2006). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Chichester: John Wiley & Sons.
2. Bock, HH, and Diday, E (2000). Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Berlin: Springer.
3. Chavent, M, and Lechevallier, Y (2002). Dynamical clustering of interval data: Optimization of an adequacy criterion based on Hausdorff distance. Classification, Clustering and Data Analysis, Jajuga, K, Sokołowski, A, and Bock, HH, ed: Springer, pp. 53-59.
4. Diday, E, and Noirhomme-Fraiture, M (2008). Symbolic Data Analysis and the SODAS Software. Chichester: John Wiley & Sons.
5. Gowda, KC, and Diday, E (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition. 24, 567-578.
6. Groenen, PJF, Winsberg, S, Rodriguez, O, and Diday, E (2006). I-Scal: Multidimensional scaling of interval dissimilarities. Computational Statistics and Data Analysis. 51, 360-378.
7. Hartigan, JA (1975). Clustering Algorithms. New York: John Wiley & Sons.
8. Hiro, S, Komiya, Y, Minami, H, and Mizuta, M (2004). Multidimensional relative projection pursuit. Japanese Journal of Applied Statistics. 33, 225-241.
9. Katayama, K, Minami, H, and Mizuta, M (2010). Hierarchical symbolic clustering for distribution valued data. Journal of the Japanese Society of Computational Statistics. 22, 83-89.
10. Little, M, McSharry, PE, Hunter, EJ, Spielman, J, and Ramig, LO (2009). Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease. IEEE Transactions on Biomedical Engineering. 56, 1-19.
11. Little, M, McSharry, PE, Roberts, SJ, Castello, D, and Moroz, IM (2007). Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Biomedical Engineering OnLine. 6, 1015-1022.
12. Mizuta, M, and Minami, H (2012). Analysis of distribution valued dissimilarity data. Challenges at the Interface of Data Analysis, Computer Science, and Optimization, Gaul, WA, Geyer-Schulz, A, Schmidt-Thieme, L, and Kunze, J, ed. Berlin: Springer, pp. 23-28.
13. Nelsen, RB (1999). An Introduction to Copulas. New York: Springer.
14. Scott, DW (1992). Multivariate Density Estimation. New York: John Wiley & Sons.
15. Terada, Y, and Yadohisa, H (2010). Non-hierarchical clustering for distribution-valued data. Proceedings of COMPSTAT 2010. Berlin: Physica-Verlag, pp. 1653-1660.
16. Torgerson, WS (1958). Theory and Methods of Scaling. New York: Wiley.
17. Tsanas, A, Little, MA, McSharry, PE, and Ramig, LO (2010). Accurate telemonitoring of Parkinson’s disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering. 57, 884-893.
18. UCI Machine Learning Repository (2015).