
Binary classification on compositional data

Jae Yun Joo$^{a}$, Seokho Lee$^{1,a}$

$^{a}$Department of Statistics, Hankuk University of Foreign Studies, Korea
Correspondence to: $^{1}$Department of Statistics, Hankuk University of Foreign Studies, 81 Oedae-ro, Yongin, Gyeonggido 17035, Korea.
E-mail: lees@hufs.ac.kr

This research was supported by Hankuk University of Foreign Studies Research Fund (of 2020).
Received October 12, 2020; Revised November 27, 2020; Accepted December 22, 2020.
Abstract
Due to boundedness and the sum constraint, compositional data are often transformed by a logratio transformation, and the transformed data are then put into traditional binary classification or discriminant analysis. However, it may be problematic to directly apply traditional multivariate approaches to the transformed data because class distributions are not Gaussian and the Bayes decision boundary is not polynomial on the transformed space. In this study, we propose to apply flexible classification approaches to the transformed data for compositional data classification. Empirical studies using synthetic and real examples demonstrate that flexible approaches outperform traditional multivariate classification or discriminant analysis.
Keywords : Aitchison geometry, classification, compositional data, Gaussian mixture, isometric logratio transformation
1. Introduction

Compositional data is a special type of multivariate data with strictly positive real components under sum constraint. This kind of data often arises when we measure parts of a whole. A typical example of compositional data is a vector of proportions or percentages, where each component is positive and bounded, and the sum of components is 1 for proportions or 100% for percentages. Compositional data is commonly observed in various applications: ratio of components making up a rock, ratio of fine substances in atmosphere, to name a few (Pawlowsky-Glahn et al., 2015). The sample space of compositional data having D components is the simplex embedded in D dimensional space

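$\mathcal{S}^D = \left\{ x = (x_1, x_2, \ldots, x_D)^T \in \mathbb{R}^D \;\middle|\; \sum_{j=1}^{D} x_j = \kappa,\ x_j > 0,\ j = 1, 2, \ldots, D \right\}, \qquad (1.1)$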

where κ is the sum constraint. Here we deal with compositional data of proportions, so that we set κ = 1. Since compositional vectors reside in $\mathcal{S}^D$, their geometry is different from the typical Euclidean geometry in real space. Addition (perturbation) and scalar multiplication (powering) operations, inner product, norm, and distance can be properly defined under the name of Aitchison geometry (Pawlowsky-Glahn and Egozcue, 2001).

Although compositional data can be described and understood under the well-defined Aitchison geometry, statistical analysis of compositional data is limited by the lack of statistical models on the simplex. The Dirichlet distribution is the only known statistical distribution that is well defined on the simplex. Traditional multivariate data analysis techniques have been developed on Euclidean space, where the Gaussian distribution is the norm, so that they cannot be directly applied or generalized to compositional data analysis. Since the main difficulty with compositional data comes from the bounded elements ($0 < x_j < \kappa$, $j = 1, 2, \ldots, D$) and the sum constraint ($\sum_{j=1}^{D} x_j = \kappa$), researchers often transform compositional vectors on $\mathcal{S}^D$ into $(D-1)$-dimensional unrestricted vectors. Logratio transformations are sensible choices for compositional data because relative information is preserved. The isometric logratio (ilr) transformation is frequently used in compositional data analysis. Since the $(D-1)$-dimensional vector $z \in \mathbb{R}^{D-1}$ obtained by the ilr transformation of $x \in \mathcal{S}^D$ is free from constraint and boundedness, traditional multivariate analysis methods can be applied to $z$ without any technical problem. The results and their interpretation carry back seamlessly to the original compositional data because the ilr transformation is an isometry (distance-preserving transformation) between $\mathcal{S}^D$ and $\mathbb{R}^{D-1}$.

For binary classification on compositional data, a common practice is to apply traditional classification/discriminant analysis methods, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression, to ilr transformed data. LDA and QDA are based on the assumption that class distributions are multivariate Gaussian. They may be inappropriate since the distribution of ilr transformed data from compositional data with a Dirichlet distribution is not Gaussian. Logistic regression assumes a linear or polynomial classifier. When two classes come from separate Dirichlet distributions, the Bayes decision boundary on the ilr transformed space is not polynomial. Therefore, we expect that a more flexible classification rule is desirable.

In this article, we study the usage of flexible binary classification methods on compositional data. The support vector machine (SVM) is a popular method that produces a flexible decision boundary and does not require any distributional assumption for classification. Therefore, if SVM is applied to ilr transformed data and the resulting classifier is transformed back to the original simplex space, the resulting classification rule is expected to outperform the existing linear or polynomial classification rules in compositional data classification. We also consider a Gaussian mixture for the class distribution on ilr transformed data, which enables us to obtain posterior class probabilities, whereas SVM does not have any probabilistic interpretation. We demonstrate that these two types of flexible classification, SVM and Gaussian mixture, are desirable candidates for binary classification on compositional data.

The remainder of this article is organized as follows. We briefly review Aitchison geometry and the ilr transformation in Section 2 to help understand the behavior of compositional data on the simplex. In Section 3, we present why Gaussian-based discriminant analysis (LDA, QDA) and polynomial logistic regression are not appropriate on the ilr transformed space although they are frequently used in practice, and explain why more flexible classification methods are appropriate on the ilr transformed space to improve compositional data classification. Empirical evidence is provided in Section 4 with synthetic and real examples. Finally, some concluding remarks are given in Section 5.

2. Brief review on Aitchison geometry and ilr transformation

We consider compositional vectors on $\mathcal{S}^D$ of (1.1) with κ = 1. The following definitions and properties of Aitchison geometry are well established and studied, for example, in Pawlowsky-Glahn et al. (2015).

### Definition 1

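For $x, y \in \mathcal{S}^D$ and $\alpha \in \mathbb{R}$,

• (Perturbation)

$x \oplus y = \mathcal{C}(x_1 y_1, x_2 y_2, \ldots, x_D y_D) \in \mathcal{S}^D.$

• (Powering)

$\alpha \odot x = \mathcal{C}(x_1^{\alpha}, x_2^{\alpha}, \ldots, x_D^{\alpha}) \in \mathcal{S}^D$

with the closure operation $\mathcal{C}(x) = \left( x_1 / \sum_{i=1}^{D} x_i,\ x_2 / \sum_{i=1}^{D} x_i,\ \ldots,\ x_D / \sum_{i=1}^{D} x_i \right)^T$.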

The simplex with perturbation and powering, ($\mathcal{S}^D$, ⊕, ⊙), is a vector space. Here, perturbation and powering are analogous to addition (or translation) and scalar multiplication, respectively, in real space. The following theorems show that perturbation and powering serve as the basic operations required for a vector space structure of the simplex.
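The operations in Definition 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch with κ = 1; the function names `closure`, `perturb`, and `power` and the two example compositions are our own choices and do not come from the paper.

```python
from typing import List, Sequence


def closure(x: Sequence[float]) -> List[float]:
    """Rescale strictly positive parts so that they sum to 1 (kappa = 1)."""
    total = sum(x)
    return [xi / total for xi in x]


def perturb(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Perturbation: component-wise product followed by closure."""
    return closure([xi * yi for xi, yi in zip(x, y)])


def power(alpha: float, x: Sequence[float]) -> List[float]:
    """Powering: component-wise power followed by closure."""
    return closure([xi ** alpha for xi in x])


# Two hypothetical 3-part compositions used only for illustration.
x = [0.2, 0.3, 0.5]
y = [0.1, 0.6, 0.3]

xy = perturb(x, y)   # proportional to (0.02, 0.18, 0.15)
x2 = power(2.0, x)   # proportional to (0.04, 0.09, 0.25)
print(xy, sum(xy))
print(x2, sum(x2))
```

Both results are again compositions: all parts are positive and they sum to 1, which is what the closure step guarantees.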

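### Theorem 1

($\mathcal{S}^D$, ⊕) is a commutative group: for $x, y, z \in \mathcal{S}^D$, it satisfies

• (commutative) $x \oplus y = y \oplus x$.

• (associative) $(x \oplus y) \oplus z = x \oplus (y \oplus z)$.

• (neutral element) $n = \mathcal{C}(1, 1, \ldots, 1) = (1/D, 1/D, \ldots, 1/D)^T$, which is the unique barycenter of the simplex.

• (inverse) $x^{-1} = \mathcal{C}(x_1^{-1}, x_2^{-1}, \ldots, x_D^{-1})$, leading to $x \oplus x^{-1} = n$.

Often, $x \ominus y = x \oplus y^{-1}$ is defined and used for the perturbation difference.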

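### Theorem 2

For $x, y \in \mathcal{S}^D$ and $\alpha, \beta \in \mathbb{R}$, the following hold.

• (associative) $\alpha \odot (\beta \odot x) = (\alpha\beta) \odot x$.

• (distributive 1) $\alpha \odot (x \oplus y) = (\alpha \odot x) \oplus (\alpha \odot y)$.

• (distributive 2) $(\alpha + \beta) \odot x = (\alpha \odot x) \oplus (\beta \odot x)$.

• (identity) $1 \odot x = x$.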
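The group and vector space properties in Theorems 1 and 2 can be checked numerically on a pair of example compositions. The sketch below is not a proof; the helper names, the example vectors, and the scalars `a` and `b` are arbitrary choices made for illustration.

```python
def closure(x):
    """Rescale positive parts so that they sum to 1."""
    total = sum(x)
    return [xi / total for xi in x]


def perturb(x, y):
    """Perturbation: component-wise product followed by closure."""
    return closure([xi * yi for xi, yi in zip(x, y)])


def power(alpha, x):
    """Powering: component-wise power followed by closure."""
    return closure([xi ** alpha for xi in x])


def inverse(x):
    """Perturbation inverse: closure of the component-wise reciprocals."""
    return closure([1.0 / xi for xi in x])


def close_to(u, v, tol=1e-12):
    """Component-wise comparison up to floating-point tolerance."""
    return all(abs(a - b) <= tol for a, b in zip(u, v))


x = [0.2, 0.3, 0.5]          # hypothetical compositions
y = [0.1, 0.6, 0.3]
n = closure([1.0, 1.0, 1.0])  # neutral element (barycenter)
a, b = 1.7, -0.4              # arbitrary real scalars

checks = {
    "commutative": close_to(perturb(x, y), perturb(y, x)),
    "neutral element": close_to(perturb(x, n), x),
    "inverse": close_to(perturb(x, inverse(x)), n),
    "associative powering": close_to(power(a, power(b, x)), power(a * b, x)),
    "distributive 1": close_to(power(a, perturb(x, y)),
                               perturb(power(a, x), power(a, y))),
    "distributive 2": close_to(power(a + b, x),
                               perturb(power(a, x), power(b, x))),
    "identity": close_to(power(1.0, x), x),
}
for name, ok in checks.items():
    print(name, ok)
```

Each check compares both sides of one identity after closure, so agreement is only up to rounding error.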

The following definitions of inner product, norm, and distance make ($\mathcal{S}^D$, ⊕, ⊙) a finite-dimensional Hilbert space.

### Definition 2

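For $x, y \in \mathcal{S}^D$,

• (Aitchison inner product)

$\langle x, y \rangle_a = \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \log\frac{x_i}{x_j} \log\frac{y_i}{y_j}.$

• (Aitchison norm)

$\| x \|_a = \sqrt{\langle x, x \rangle_a} = \sqrt{\frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \left( \log\frac{x_i}{x_j} \right)^2}.$

• (Aitchison distance)

$d_a(x, y) = \| x \ominus y \|_a = \sqrt{\frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \left( \log\frac{x_i}{x_j} - \log\frac{y_i}{y_j} \right)^2}.$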

Since ($\mathcal{S}^D$, ⊕, ⊙) is a Hilbert space, the Cauchy-Schwarz inequality, the Pythagorean theorem, and the triangle inequality hold on it as well. This geometry is called Aitchison geometry.
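As a small illustration of Definition 2, the inner product, norm, and distance can be computed directly from the double sums over pairwise logratios. The function names and example compositions below are assumptions made for this sketch only.

```python
import math


def aitchison_inner(x, y):
    """Aitchison inner product via the double sum over pairwise logratios."""
    D = len(x)
    total = 0.0
    for i in range(D):
        for j in range(D):
            total += math.log(x[i] / x[j]) * math.log(y[i] / y[j])
    return total / (2 * D)


def aitchison_norm(x):
    """Aitchison norm: square root of the inner product of x with itself."""
    return math.sqrt(aitchison_inner(x, x))


def aitchison_distance(x, y):
    """Aitchison distance: norm of the perturbation difference of x and y."""
    D = len(x)
    total = sum(
        (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
        for i in range(D)
        for j in range(D)
    )
    return math.sqrt(total / (2 * D))


x = [0.2, 0.3, 0.5]   # hypothetical compositions
y = [0.1, 0.6, 0.3]
n = [1 / 3, 1 / 3, 1 / 3]

print(aitchison_inner(x, y), aitchison_norm(x), aitchison_distance(x, y))
```

Because only ratios of parts enter the formulas, rescaling `x` by a positive constant leaves all three quantities unchanged, and the distance from `x` to the barycenter `n` equals the norm of `x`.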

While compositional vectors behave well on the simplex, their sum constraint and bounded domain cause difficulties when they are put into traditional multivariate data analysis methods that are established in real space. To circumvent those difficulties, researchers introduced logratio transformations from the simplex onto unconstrained real space so that existing multivariate data analysis can be carried out without trouble. Aitchison (1986) originally proposed two types of logratio transformations: the additive logratio (alr) and centered logratio (clr) transformations. Logratio transformations of compositional data preserve the relative information on the simplex. However, the alr transformation fails to preserve distance (it is an isomorphism, but not an isometry). In contrast to the alr transformation, the clr transformation is an isometry from $\mathcal{S}^D$ to $\mathbb{R}^D$. However, the clr transformation leads to a degenerate distribution because it keeps all $D$ coordinates after transformation. Egozcue et al. (2003) proposed a new logratio transformation, called the isometric logratio (ilr) transformation, which is an isometry from $\mathcal{S}^D$ to $\mathbb{R}^{D-1}$ associated with an orthogonal coordinate system in the simplex. Thus, the results of traditional multivariate analysis applied to ilr transformed data can be carried back seamlessly to the original simplex through the inverse ilr transformation. For $x \in \mathcal{S}^D$, the ilr transformed vector $z = \mathrm{ilr}(x) \in \mathbb{R}^{D-1}$ is defined as:

$z_j = \frac{1}{\sqrt{(D-j+1)(D-j)}} \left( \log\frac{x_j}{x_{j+1}} + \cdots + \log\frac{x_j}{x_D} \right), \quad j = 1, \ldots, D-1. \qquad (2.1)$

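Its inverse transformation is

$x_1 = \exp\left( \sqrt{\frac{D-1}{D}}\, z_1 \right), \quad x_j = \exp\left( -\sum_{k=1}^{j-1} \frac{z_k}{\sqrt{(D-k+1)(D-k)}} + \sqrt{\frac{D-j}{D-j+1}}\, z_j \right), \quad j = 2, \ldots, D,$

where the $z_j$ term vanishes for $j = D$ and the right-hand sides are understood up to the closure operation $\mathcal{C}$, so that the components sum to 1.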

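Note that $z_j \in \mathbb{R}$ for $j = 1, \ldots, D-1$ and there is no constraint on $z_j$. We can also easily see that, for any $x, y \in \mathcal{S}^D$ with $u = \mathrm{ilr}(x)$ and $v = \mathrm{ilr}(y)$, $\langle x, y \rangle_a = \langle u, v \rangle$, $\| x \|_a = \| u \|$, and $d_a(x, y) = d(u, v)$, where $\langle \cdot, \cdot \rangle$, $\| \cdot \|$, and $d(\cdot, \cdot)$ are the inner product, norm, and distance defined on $(D-1)$-dimensional real space.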
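The ilr transformation and its inverse can be sketched as follows. This is a minimal sketch under one stated assumption: the inverse applies the closure operation after exponentiating, so that the recovered parts sum to 1. The function names and example compositions are ours, not the authors'.

```python
import math


def ilr(x):
    """ilr coordinates z_1, ..., z_{D-1} of a composition x with D parts."""
    D = len(x)
    z = []
    for j in range(1, D):  # j is 1-based, as in the text
        scale = 1.0 / math.sqrt((D - j + 1) * (D - j))
        # x[j - 1] is x_j; x[k] for k = j, ..., D - 1 are x_{j+1}, ..., x_D
        z.append(scale * sum(math.log(x[j - 1] / x[k]) for k in range(j, D)))
    return z


def ilr_inv(z):
    """Inverse ilr: exponentiate the linear forms in z, then apply closure."""
    D = len(z) + 1
    w = []
    for j in range(1, D + 1):
        s = -sum(z[k - 1] / math.sqrt((D - k + 1) * (D - k)) for k in range(1, j))
        if j < D:  # the z_j term has coefficient zero when j = D
            s += math.sqrt((D - j) / (D - j + 1)) * z[j - 1]
        w.append(math.exp(s))
    total = sum(w)
    return [wi / total for wi in w]


def aitchison_distance(x, y):
    """Aitchison distance from the double sum over pairwise logratios."""
    D = len(x)
    total = sum(
        (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
        for i in range(D)
        for j in range(D)
    )
    return math.sqrt(total / (2 * D))


x = [0.2, 0.3, 0.5]   # hypothetical compositions
y = [0.1, 0.6, 0.3]
u, v = ilr(x), ilr(y)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
print(u, ilr_inv(u))
print(euclidean, aitchison_distance(x, y))
```

The round trip `ilr_inv(ilr(x))` should return `x` up to rounding, and the Euclidean distance between the ilr coordinates should agree with the Aitchison distance between the compositions, which is the isometry property used throughout the paper.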

3. Classification on ilr transformed data

In the previous section, we explained that any multivariate data analysis method can be applied to ilr transformed data for compositional data analysis because the ilr transformation provides an unconstrained coordinate system while preserving distance and intrinsic dimensionality. However, there is still a concern that the distributions of ilr transformed data may differ from the typical distributions that are commonly assumed in multivariate data analysis on real space. For example, the multivariate Gaussian is the most popular distribution in multivariate data analysis for empirical reasons and/or theoretical considerations. However, assuming Gaussianity on ilr transformed data is not well justified. In binary classification for compositional data, two classes are assumed to follow their own class-specific distributions on the simplex $\mathcal{S}^D$. Let $x_i^c \overset{iid}{\sim} f(x \mid c)$ $(i = 1, \ldots, n_c)$ denote the observations belonging to class $c$ ($c = 0, 1$) on $\mathcal{S}^D$. A common practice is to transform $x_i^c$ into $z_i^c$ by the ilr transformation and then apply discriminant analysis or logistic regression to $z_i^c$.

Discriminant analysis finds a decision rule that classifies x or its ilr-transformed z, into c = 1 if δ(z) > 1, and c = 0 otherwise. Here, the classifier δ(z) is defined as

$\delta(z) = \frac{P(c = 1 \mid z)}{P(c = 0 \mid z)} = \frac{f(z \mid c = 1)\, \pi_1}{f(z \mid c = 0)\, \pi_0},$

where $f(z \mid c)$ are the class-specific distributions on the ilr-transformed space and $\pi_c$ are the prior probabilities of class $c$. If $f(z \mid c)$ is Gaussian, then the resulting decision boundary becomes linear (LDA) or quadratic (QDA) depending on the covariance assumption. Therefore, discriminant analysis will perform reasonably well when Gaussianity is valid on the ilr transformed space. Logistic regression assumes that $\log\{P(c = 1 \mid z) / P(c = 0 \mid z)\}$ is a polynomial of $z$. If the Bayes decision boundary on the ilr-transformed space is not polynomial, then logistic regression does not guarantee good performance.
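To make the decision rule concrete, the sketch below evaluates the log of δ(z) for user-supplied class log-densities and classifies into c = 1 when it is positive. The univariate Gaussian class densities are purely an assumed toy case (they correspond to D = 2, where z is a scalar); all names and parameter values are hypothetical.

```python
import math


def gaussian_logpdf(z, mean, sd):
    """Log density of a univariate Gaussian with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((z - mean) / sd) ** 2


def log_delta(z, logf1, logf0, pi1=0.5, pi0=0.5):
    """log delta(z) = log f(z|c=1) + log pi_1 - log f(z|c=0) - log pi_0."""
    return (logf1(z) + math.log(pi1)) - (logf0(z) + math.log(pi0))


def classify(z, logf1, logf0, pi1=0.5, pi0=0.5):
    """Assign class 1 when delta(z) > 1, i.e. when log delta(z) > 0."""
    return 1 if log_delta(z, logf1, logf0, pi1, pi0) > 0 else 0


# Toy class densities on a 1-dimensional ilr space (assumed for illustration).
def logf1(z):
    return gaussian_logpdf(z, mean=1.0, sd=1.0)


def logf0(z):
    return gaussian_logpdf(z, mean=-1.0, sd=1.0)


for z in (-0.5, 0.5, 2.0):
    print(z, log_delta(z, logf1, logf0), classify(z, logf1, logf0))
```

With equal standard deviations the log ratio is linear in z (here it equals 2z), which is the LDA case; giving the two classes different standard deviations would add a quadratic term, which is the QDA case.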

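Suppose that two classes follow separate Dirichlet distributions, which are the most popular assumption for compositional vectors. With the Dirichlet density function

$f(x \mid \alpha) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_D)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_D)}\, x_1^{\alpha_1 - 1} \cdots x_D^{\alpha_D - 1}, \quad 0 < x_j < 1,\ \alpha_j > 0,$

the ilr transformed vector $z$ follows the distribution given in the following theorem.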

### Theorem 3

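If $x \sim \mathrm{Dirichlet}(\alpha)$ on $\mathcal{S}^D$ with $\alpha = (\alpha_1, \ldots, \alpha_D)^T$, then $z = \mathrm{ilr}(x)$ has the following density function on $\mathbb{R}^{D-1}$:

$f(z \mid \alpha) = \frac{1}{\sqrt{D}\, B(\alpha)} \exp\left\{ \sum_{j=1}^{D-2} \left( \sqrt{\frac{D-j}{D-j+1}}\, \alpha_j - \frac{\sum_{k=j+1}^{D-1} \alpha_k}{\sqrt{(D-j+1)(D-j)}} \right) z_j + \frac{\alpha_{D-1}}{\sqrt{2}}\, z_{D-1} \right\} \times \left[ 1 - \exp\left( \sqrt{\frac{D-1}{D}}\, z_1 \right) - \sum_{j=2}^{D-1} \exp\left( \sqrt{\frac{D-j}{D-j+1}}\, z_j - \sum_{k=1}^{j-1} \frac{z_k}{\sqrt{(D-k+1)(D-k)}} \right) \right]^{\alpha_D - 1}$

where $B(\alpha) = \{\Gamma(\alpha_1) \cdots \Gamma(\alpha_D)\} / \Gamma(\alpha_1 + \cdots + \alpha_D)$ is the multivariate beta function.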

We omit the proof here because it can be easily obtained by applying the change-of-variable technique. Since $f(z \mid \alpha)$ provided in Theorem 3 is far from a Gaussian density, discriminant analysis on ilr-transformed data is not appropriate for compositional data from a two-class Dirichlet population. In addition, with the parameters $\alpha_0$ and $\alpha_1$ for the two classes, $\log\{f(z \mid \alpha_1) / f(z \mid \alpha_0)\}$ is not polynomial in $z$, so that logistic regression is not a good candidate for compositional data classification. Instead of naively applying discriminant analysis or logistic regression, more flexible classification methods can be a better option for compositional data classification. Moreover, we have limited knowledge of distributions established for the simplex. It may not be desirable to presume a specific one from the limited pool of distributions, such as the Dirichlet, for the class distribution.

Therefore, we expect that more flexible classification methods on the ilr-transformed data are desirable for compositional data classification. However, to the best of our knowledge, few studies consider flexible classification, such as SVM, for compositional data classification. In Section 4, we conduct a comparison study for compositional data classification with several classification approaches. As existing methods, polynomial logistic regression and discriminant analysis under Gaussianity are considered as references. In comparison with these frequently used methods, we consider SVM and Gaussian mixture discriminant analysis. SVM is a popular flexible binary classification method that is free from distributional assumptions. Discriminant analysis with a Gaussian mixture as the class-specific distribution produces a non-polynomial decision boundary as well, whose functional form is determined flexibly according to the actual class distributions. Both methods carry out the classification procedure on the ilr-transformed space. In addition, we consider discriminant analysis that is directly applied to the original simplex space under the assumption that the class distributions are Dirichlet.
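Discriminant analysis under Dirichlet class distributions can be sketched directly on the simplex: compare the two class log-densities plus log priors. In the sketch below the parameters are treated as known, whereas in practice they would have to be estimated from training data, a step this sketch does not cover. The function names and the two test points are hypothetical; the parameter values are those of scenario (S1) with D = 3.

```python
import math


def dirichlet_logpdf(x, alpha):
    """Log of the Dirichlet density at a composition x with parameter alpha."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))


def dda_classify(x, alpha1, alpha0, pi1=0.5, pi0=0.5):
    """Assign class 1 when f(x|alpha1) * pi1 exceeds f(x|alpha0) * pi0."""
    score = (dirichlet_logpdf(x, alpha1) + math.log(pi1)
             - dirichlet_logpdf(x, alpha0) - math.log(pi0))
    return 1 if score > 0 else 0


# Known parameters, taken from scenario (S1) with D = 3.
alpha0 = [0.4, 0.4, 0.4]
alpha1 = [1.5, 2.0, 1.5]

center = [1 / 3, 1 / 3, 1 / 3]   # barycenter of the simplex
corner = [0.98, 0.01, 0.01]      # point close to a vertex

print(dda_classify(center, alpha1, alpha0))
print(dda_classify(corner, alpha1, alpha0))
```

With these parameters class 0 concentrates its mass near the boundary of the simplex and class 1 near the center, so the barycenter is assigned to class 1 and the near-vertex point to class 0.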

4. Numerical studies

In this section, we provide numerical comparison with polynomial logistic regression (Logistic), Gaussian-based discriminant analysis (LDA, QDA), SVM, Gaussian mixture discriminant analysis (GMDA) on the ilr space and Dirichlet discriminant analysis (DDA) on the simplex. For performance comparison, we use synthetic data generated from class distributions of Dirichlet and Dirichlet mixture. A real-world data example (Hydrochem data) is used for comparison as well, since the set of known distributions on simplex is too limited to represent various real-world compositional data.

### 4.1. Synthetic data examples

We consider two scenarios for two-class compositional data generation. n0 and n1 are class sizes in training data with n = n0 + n1 and n0 = n1 = n/2.

• (S1) Dirichlet class distribution

$D$-dimensional vectors $x_i^c$ $(i = 1, \ldots, n_c;\ c = 0, 1)$ are independently generated from Dirichlet($\alpha_c$). We set $\alpha_0 = (0.4, 0.4, 0.4, 1, \ldots, 1)^T$ and $\alpha_1 = (1.5, 2, 1.5, 1, \ldots, 1)^T$.

• (S2) Dirichlet mixture class distribution

$D$-dimensional vectors $x_i^c$ $(i = 1, \ldots, n_c;\ c = 0, 1)$ are independently generated from the Dirichlet mixture distribution $\sum_{k=1}^{5} \pi_k\, \mathrm{Dirichlet}(\alpha_{c,k})$. We set $\alpha_{0,1} = (2, 1, 9, 1, \ldots, 1)^T$, $\alpha_{0,2} = (1, 5, 10, 1, \ldots, 1)^T$, $\alpha_{0,3} = (8, 10, 10, 1, \ldots, 1)^T$, $\alpha_{0,4} = (1, 10, 5, 1, \ldots, 1)^T$, $\alpha_{0,5} = (2, 9, 1, 1, \ldots, 1)^T$ for class 0, and $\alpha_{1,1} = (6, 6, 18, 1, \ldots, 1)^T$, $\alpha_{1,2} = (9, 2, 9, 1, \ldots, 1)^T$, $\alpha_{1,3} = (9, 3, 3, 1, \ldots, 1)^T$, $\alpha_{1,4} = (9, 9, 2, 1, \ldots, 1)^T$, $\alpha_{1,5} = (6, 18, 6, 1, \ldots, 1)^T$ for class 1. We use equal mixing probabilities, i.e., $\pi_k = 1/5$ for $k = 1, \ldots, 5$.
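The two data-generating scenarios can be sketched with the standard construction of a Dirichlet vector as normalized independent gamma variates. This is our own illustrative sketch for D = 3, not the authors' simulation code; the function names, the seed, and the sample sizes are assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalizing gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [gi / total for gi in g]


def sample_dirichlet_mixture(alphas, weights, rng):
    """Pick a mixture component by its weight, then draw from that Dirichlet."""
    alpha = rng.choices(alphas, weights=weights, k=1)[0]
    return sample_dirichlet(alpha, rng)


rng = random.Random(2020)  # arbitrary seed for reproducibility

# (S1) with D = 3: a single Dirichlet per class.
s1_alpha = {0: [0.4, 0.4, 0.4], 1: [1.5, 2.0, 1.5]}
s1_data = {c: [sample_dirichlet(a, rng) for _ in range(50)]
           for c, a in s1_alpha.items()}

# (S2) with D = 3: an equal-weight mixture of five Dirichlets per class.
s2_alphas = {
    0: [[2, 1, 9], [1, 5, 10], [8, 10, 10], [1, 10, 5], [2, 9, 1]],
    1: [[6, 6, 18], [9, 2, 9], [9, 3, 3], [9, 9, 2], [6, 18, 6]],
}
weights = [0.2] * 5
s2_data = {c: [sample_dirichlet_mixture(a, weights, rng) for _ in range(50)]
           for c, a in s2_alphas.items()}

print(s1_data[0][0], s2_data[1][0])
```

For D > 3 the parameter vectors would simply be padded with ones, as described in the scenarios above.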

The last $D - 3$ elements in $\alpha_c$ and $\alpha_{c,k}$ are all set to 1, implying that only the first 3 variables separate the two classes and the remaining variables have no discriminative power. After $x_i^c$ were generated, we transformed them into $z_i^c = \mathrm{ilr}(x_i^c)$ with the ilr transformation in (2.1). Logistic, LDA, QDA, SVM, and GMDA use the transformed data $z_i^c$, and DDA uses the original compositional data $x_i^c$ in the classification procedure. In scenario (S1), DDA is expected to outperform the other candidates because it correctly assumes the data generating process. Since each class consists of 5 different Dirichlet distributions in (S2), DDA no longer has this advantage and flexible classification methods are expected to show better performance. Figure 1 depicts the data distributions for D = 3 on the simplex and on the ilr space. For (S1), the Bayes decision boundary has a nearly circular shape in the ilr-transformed real space, as in the upper right panel of Figure 1. However, it takes a quite irregular shape in (S2), so that LDA and QDA do not seem suitable in this case. To account for the shape of the Bayes decision boundary, we fit logistic regression with polynomial models up to degree 5 and choose the best among them. To quantify performance, we additionally generated test data of size 5,000 from the same scenarios. The decision rules learned by the above methods are applied to the test data and the test error rates are computed. This procedure is repeated 100 times to reduce the effect of randomness in sampling, and the average and standard error are reported.

We summarize the averaged test error rates of the classifiers learned from training datasets with D = 3 and n = 100, 250, 500 in Table 1, and with n = 250 and D = 3, 6, 12 in Table 2. Since each class is generated from a single Dirichlet distribution under scenario (S1), DDA is expected to show the best performance, and the tables confirm this expectation. Most classification methods, except LDA, perform reasonably well. This is because the Bayes decision boundary on the real space is fitted well by a quadratic (QDA) or polynomial (Logistic) shape. The situation changes in scenario (S2), where each class comes from a Dirichlet mixture. For D = 3 (Table 1), the flexible classification rules from SVM and GMDA clearly outperform the other frequently used classification methods at every sample size. In Table 2, the advantage narrows as D grows: SVM still attains the lowest error for D = 6 and D = 12, whereas GMDA is no longer better than Logistic, LDA, or DDA at D = 12. These simulation results suggest that a flexible approach is beneficial when the ilr-transformed decision boundary is not guaranteed to be a low-degree polynomial.

### 4.2. Hydrochem data examples

Next, we apply all classification methods to the Hydrochem data (Otero et al., 2005) and compare them. This dataset contains measurements of 14 components ($\mathrm{H^+}$, $\mathrm{Na^+}$, $\mathrm{K^+}$, $\mathrm{Ca^{2+}}$, $\mathrm{Mg^{2+}}$, $\mathrm{Sr^{2+}}$, $\mathrm{Ba^{2+}}$, $\mathrm{NH_4^+}$, $\mathrm{Cl^-}$, $\mathrm{HCO_3^-}$, $\mathrm{NO_3^-}$, $\mathrm{SO_4^{2-}}$, $\mathrm{PO_4^{3-}}$, TOC) in water samples from the Llobregat river and its two main tributaries, Anoia and Cardener, in northeastern Spain. In total, 485 water samples were obtained from 4 distinct water bodies (Anoia, Cardener, lower and upper Llobregat River), whose geological background and nearby human activities vary greatly. For binary classification, we select only 2 groups, Anoia (143 samples) and lower Llobregat (135 samples), which are the two largest groups in the data.

We randomly split the 2-class data into training (70%) and test (30%) datasets and then applied all classification methods to the training data to learn their classifiers. The test error rate is evaluated by applying the fitted classifiers to the test dataset. Because a single test error rate depends heavily on the particular random split, we repeat this procedure 100 times so that conclusions do not rest on one accidental split. Figure 2 presents boxplots of the 100 test error rates for the classification methods we consider. LDA is the worst performer, which indicates that a linear classifier is not appropriate for this classification problem. SVM turns out to be the best, and GMDA also performs reasonably well compared with Logistic and QDA. This real example, where the actual data generating process on the simplex is unknown, indicates that flexible approaches, like SVM or GMDA, are competitive with the commonly used practices, such as LDA, QDA, and logistic regression.
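The repeated random-split evaluation can be sketched generically as below. The data here are a synthetic one-dimensional toy sample and the classifier is a simple midpoint threshold, both hypothetical stand-ins chosen so the sketch is self-contained; they are not the Hydrochem data or any of the methods compared in the paper.

```python
import random
from statistics import mean


def repeated_holdout(X, y, fit, predict, n_repeats=100, train_frac=0.7, seed=0):
    """Return the test error rate of each of n_repeats random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    errors = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        cut = int(round(train_frac * len(idx)))
        train, test = idx[:cut], idx[cut:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong = sum(predict(model, X[i]) != y[i] for i in test)
        errors.append(wrong / len(test))
    return errors


def fit_threshold(X, y):
    """Toy classifier: threshold halfway between the two class means."""
    m0 = mean(x for x, c in zip(X, y) if c == 0)
    m1 = mean(x for x, c in zip(X, y) if c == 1)
    return (m0 + m1) / 2.0, m1 >= m0


def predict_threshold(model, x):
    """Predict class 1 on the side of the threshold where class 1 lies."""
    threshold, one_is_upper = model
    return int((x > threshold) == one_is_upper)


# Synthetic toy data: two overlapping univariate Gaussian classes.
data_rng = random.Random(1)
X = [data_rng.gauss(-1.0, 1.0) for _ in range(100)] + \
    [data_rng.gauss(1.0, 1.0) for _ in range(100)]
y = [0] * 100 + [1] * 100

errors = repeated_holdout(X, y, fit_threshold, predict_threshold)
print(mean(errors), min(errors), max(errors))
```

The list `errors` plays the role of the 100 test error rates summarized by each boxplot in Figure 2; any classifier with a `fit`/`predict` pair of this shape could be plugged in.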

5. Conclusion and remarks

In this work, we provide empirical evidence that flexible classification approaches, SVM and Gaussian mixture discriminant analysis, are promising options for compositional data classification, compared with the traditional multivariate approaches that are commonly used in practice. We use the isometric logratio transformation for this application because it is an isometry between $\mathcal{S}^D$ and $\mathbb{R}^{D-1}$. Instead of the ilr transformation, one may use a different logratio transformation, for example the clr transformation. While clr is also an isometry between $\mathcal{S}^D$ and $\mathbb{R}^D$, the class distribution degenerates because the intrinsic dimensionality is $D - 1$ (for example, a sample covariance matrix on $\mathbb{R}^D$ becomes singular). Discriminant analysis (GMDA) is, therefore, not available under the clr transformation, but SVM is still applicable. Unlike the ilr transformation, the clr transformed $z_j$ is obtained as the logarithm of a scaled $x_j$, so that variable importance in the statistical analysis is preserved before and after transformation. Thus, variable selection or variable importance evaluation in classification can be implemented and evaluated under the clr transformation. We leave this research direction as future work.
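The degeneracy of clr transformed data can be seen in a short numerical sketch: clr coordinates always sum to zero, so every row of their sample covariance matrix sums to zero and the matrix is singular. The function names, the gamma-based generation of example compositions, and the seed are assumptions made for illustration.

```python
import math
import random


def clr(x):
    """Centered logratio: log parts minus their mean (log of x_j / geometric mean)."""
    logs = [math.log(xi) for xi in x]
    center = sum(logs) / len(logs)
    return [value - center for value in logs]


def sample_cov(rows):
    """Unbiased sample covariance matrix of a list of equal-length vectors."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
         for b in range(p)]
        for a in range(p)
    ]


# Hypothetical 4-part compositions obtained by normalizing gamma variates.
rng = random.Random(0)
X = []
for _ in range(50):
    g = [rng.gammavariate(2.0, 1.0) for _ in range(4)]
    total = sum(g)
    X.append([gi / total for gi in g])

Z = [clr(x) for x in X]
S = sample_cov(Z)
row_sums = [sum(row) for row in S]

print(max(abs(sum(z)) for z in Z))  # numerically zero
print(row_sums)                     # numerically zero, so S times the ones vector is 0
```

Since the vector of ones lies in the null space of `S`, Gaussian-based density estimates that require an invertible covariance matrix cannot be formed on clr coordinates without further adjustment.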

Figures

Fig. 1. Two-class compositional datasets of D = 3 generated from scenarios S1 (upper) and S2 (lower). Left panels depict the original compositional data on the simplex; right panels display their ilr transformed data on the 2-dimensional real space.

Fig. 2. Hydrochem data: boxplots of 100 test error rates are presented.
TABLES

### Table 1

Averages of 100 test error rates, with standard deviations in parentheses, for the case of D = 3

| Scenario | n | Bayes | Logistic | LDA | QDA | SVM | GMDA | DDA |
|---|---|---|---|---|---|---|---|---|
| S1 | 100 | 0.2130 (0.0064) | 0.2358 (0.0347) | 0.4571 (0.0335) | 0.2245 (0.0086) | 0.2413 (0.0201) | 0.2330 (0.0196) | **0.2184** (0.0085) |
| S1 | 250 | 0.2135 (0.0065) | 0.2224 (0.0079) | 0.4609 (0.0276) | 0.2215 (0.0065) | 0.2284 (0.0125) | 0.2245 (0.0092) | **0.2150** (0.0066) |
| S1 | 500 | 0.2128 (0.0064) | 0.2233 (0.0289) | 0.4578 (0.0181) | 0.2207 (0.0069) | 0.2224 (0.0085) | 0.2198 (0.0073) | **0.2136** (0.0065) |
| S2 | 100 | 0.1577 (0.0049) | 0.2489 (0.0165) | 0.3015 (0.0057) | 0.2652 (0.0151) | **0.2054** (0.0234) | 0.2098 (0.0222) | 0.2535 (0.0126) |
| S2 | 250 | 0.1578 (0.0049) | 0.2448 (0.0110) | 0.3029 (0.0048) | 0.2571 (0.0110) | **0.1781** (0.0110) | 0.1813 (0.0109) | 0.2458 (0.0075) |
| S2 | 500 | 0.1580 (0.0045) | 0.2446 (0.0118) | 0.3021 (0.0053) | 0.2562 (0.0084) | **0.1708** (0.0082) | 0.1715 (0.0066) | 0.2449 (0.0057) |

Best performer for each case is highlighted in bold face.

### Table 2

Averages of 100 test error rates, with standard deviations in parentheses, for the case of n = 250

| Scenario | D | Bayes | Logistic | LDA | QDA | SVM | GMDA | DDA |
|---|---|---|---|---|---|---|---|---|
| S1 | 3 | 0.2135 (0.0065) | 0.2224 (0.0079) | 0.4609 (0.0276) | 0.2215 (0.0065) | 0.2284 (0.0125) | 0.2245 (0.0092) | **0.2150** (0.0066) |
| S1 | 6 | 0.1070 (0.0039) | 0.1254 (0.0091) | 0.1368 (0.0073) | 0.1312 (0.0062) | 0.1247 (0.0086) | 0.1236 (0.0082) | **0.1087** (0.0043) |
| S1 | 12 | 0.0842 (0.0039) | 0.1080 (0.0110) | 0.1178 (0.0090) | 0.1186 (0.0072) | 0.1030 (0.0088) | 0.1112 (0.0084) | **0.0876** (0.0043) |
| S2 | 3 | 0.1578 (0.0049) | 0.2448 (0.0110) | 0.3029 (0.0048) | 0.2571 (0.0110) | **0.1781** (0.0110) | 0.1813 (0.0109) | 0.2458 (0.0075) |
| S2 | 6 | 0.1311 (0.0046) | 0.2283 (0.0105) | 0.2438 (0.0083) | 0.2322 (0.0010) | **0.2229** (0.0134) | 0.2331 (0.0173) | 0.2334 (0.0088) |
| S2 | 12 | 0.1090 (0.0039) | 0.2237 (0.0125) | 0.2156 (0.0083) | 0.2408 (0.0106) | **0.2071** (0.0109) | 0.2329 (0.0229) | 0.2133 (0.0103) |

Best performer for each case is highlighted in bold face.

References
1. Aitchison J (1986). The Statistical Analysis of Compositional Data, Monographs on Statistics and Applied Probability, London, Chapman & Hall.
2. Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, and Barceló-Vidal C (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35, 279-300.
3. Otero N, Tolosana-Delgado R, Soler A, Pawlowsky-Glahn V, and Canals A (2005). Relative vs. absolute statistical analysis of compositions: a comparative study of surface waters of a Mediterranean river. Water Research, 39, 1404-1414.
4. Pawlowsky-Glahn V and Egozcue JJ (2001). Geometric approach to statistical analysis on the simplex. Stochastic Environmental Research and Risk Assessment (SERRA), 15, 384-398.
5. Pawlowsky-Glahn V, Egozcue JJ, and Tolosana-Delgado R (2015). Modeling and Analysis of Compositional Data, Hoboken, John Wiley & Sons.