Compositional data is a special type of multivariate data whose components are strictly positive real numbers under a sum constraint. Such data arise whenever we measure parts of a whole. A typical example is a vector of proportions or percentages, where each component is positive and bounded, and the components sum to 1 for proportions or to 100% for percentages. Compositional data are commonly observed in many applications: the mineral composition of a rock or the relative concentrations of fine particulate matter in the atmosphere, to name a few (Pawlowsky-Glahn et al.).
Formally, a $D$-part composition lies on the simplex
$$\mathcal{S}^D = \Big\{\mathbf{x} = (x_1, \ldots, x_D)^\top : x_i > 0,\ i = 1, \ldots, D,\ \sum_{i=1}^D x_i = 1\Big\},$$
where each component $x_i$ represents the proportion of the $i$th part.
Although compositional data can be described and understood under the well-defined Aitchison geometry, statistical analysis of compositional data is limited by the lack of statistical models on the simplex; the Dirichlet distribution is essentially the only well-known distribution defined on the simplex. Traditional multivariate analysis techniques have been developed in Euclidean space, where the Gaussian distribution is the norm, so they cannot be directly applied or generalized to compositional data analysis. Since the main difficulty with compositional data comes from the bounded components (each lying in $(0, 1)$) together with the sum constraint, a common remedy is to map compositional vectors into unconstrained real space through a logratio transformation, such as the isometric logratio (ilr) transformation, and then apply standard multivariate methods there.
For binary classification of compositional data, a common practice is to apply traditional classification or discriminant analysis methods, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression, to the ilr transformed data. LDA and QDA assume that the class distributions are multivariate Gaussian, which may be inappropriate: the distribution of ilr transformed data arising from a Dirichlet distribution is not Gaussian. Logistic regression assumes a linear or polynomial classifier. When the two classes come from separate Dirichlet distributions, the Bayes decision boundary in the ilr transformed space is not polynomial. Therefore, a more flexible classification rule is desirable.
In this article, we study the use of flexible binary classification methods on compositional data. The support vector machine (SVM) is a popular method that produces a flexible decision boundary and requires no distributional assumption for classification. Therefore, if SVM is applied to ilr transformed data and the resulting classifier is mapped back to the original simplex space, the transformed classification rule is expected to outperform the existing linear or polynomial classification rules for compositional data. We also consider a Gaussian mixture model for the class distributions on the ilr transformed data, which yields posterior class probabilities, whereas SVM has no probabilistic interpretation. We demonstrate that both types of flexible classification, SVM and Gaussian mixture discriminant analysis, are desirable candidates for binary classification of compositional data.
The remainder of this article is organized as follows. We briefly review Aitchison geometry and the ilr transformation in Section 2 to help understand the behavior of compositional data on the simplex. In Section 3, we explain why Gaussian-based discriminant analysis (LDA, QDA) and polynomial logistic regression are not appropriate on the ilr transformed space, although they are frequently used in practice, and why more flexible classification methods can improve compositional data classification. Empirical evidence is provided in Section 4 with synthetic and real examples. Finally, concluding remarks are given in Section 5.
We consider compositional vectors on the simplex $\mathcal{S}^D$ of $D$ parts,
$$\mathcal{S}^D = \Big\{\mathbf{x} = (x_1, \ldots, x_D)^\top : x_i > 0,\ i = 1, \ldots, D,\ \sum_{i=1}^D x_i = 1\Big\}.$$
The simplex with perturbation and powering, $(\mathcal{S}^D, \oplus, \odot)$, is a vector space. Here, perturbation $\oplus$ and powering $\odot$ are analogous to addition (or translation) and scalar multiplication, respectively, in real space: for $\mathbf{x}, \mathbf{y} \in \mathcal{S}^D$ and $\alpha \in \mathbb{R}$,
$$\mathbf{x} \oplus \mathbf{y} = \mathcal{C}(x_1 y_1, \ldots, x_D y_D), \qquad \alpha \odot \mathbf{x} = \mathcal{C}(x_1^\alpha, \ldots, x_D^\alpha),$$
where $\mathcal{C}(\cdot)$ denotes the closure operation that rescales a positive vector to unit sum. The following theorems show that perturbation and powering serve as the basic operations required for a vector space structure on the simplex.
$(\mathcal{S}^D, \oplus)$ is a commutative group: perturbation is commutative and associative, the neutral element is $\mathbf{n} = \mathcal{C}(1, \ldots, 1) = (1/D, \ldots, 1/D)^\top$, and the inverse of $\mathbf{x}$ is $\mathbf{x}^{-1} = \mathcal{C}(x_1^{-1}, \ldots, x_D^{-1})$.

Powering satisfies, for $\alpha, \beta \in \mathbb{R}$ and $\mathbf{x}, \mathbf{y} \in \mathcal{S}^D$,
$$\alpha \odot (\mathbf{x} \oplus \mathbf{y}) = (\alpha \odot \mathbf{x}) \oplus (\alpha \odot \mathbf{y}), \quad (\alpha + \beta) \odot \mathbf{x} = (\alpha \odot \mathbf{x}) \oplus (\beta \odot \mathbf{x}), \quad \alpha \odot (\beta \odot \mathbf{x}) = (\alpha\beta) \odot \mathbf{x}, \quad 1 \odot \mathbf{x} = \mathbf{x}.$$

Often, perturbation by an inverse is used as a subtraction: $\mathbf{x} \ominus \mathbf{y} = \mathbf{x} \oplus \big((-1) \odot \mathbf{y}\big)$.
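As a concrete illustration, the two operations take only a few lines of code; the sketch below (NumPy, with helper names of our own choosing) checks the vector-space identities stated above.

```python
import numpy as np

def closure(x):
    """Rescale a strictly positive vector to unit sum (the C(.) operation)."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation: simplex analogue of addition, C(x1*y1, ..., xD*yD)."""
    return closure(np.asarray(x) * np.asarray(y))

def power(alpha, x):
    """Powering: simplex analogue of scalar multiplication, C(x1^a, ..., xD^a)."""
    return closure(np.asarray(x, dtype=float) ** alpha)

x = closure([1.0, 2.0, 3.0])
y = closure([2.0, 1.0, 1.0])
n = closure([1.0, 1.0, 1.0])  # neutral element (1/D, ..., 1/D)

assert np.allclose(perturb(x, n), x)                       # n is the zero element
assert np.allclose(perturb(x, power(-1.0, x)), n)          # powering by -1 gives the inverse
assert np.allclose(power(2.0, perturb(x, y)),
                   perturb(power(2.0, x), power(2.0, y)))  # distributivity
```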
The following definitions of inner product, norm, and distance establish that $(\mathcal{S}^D, \oplus, \odot)$ is a finite dimensional Hilbert space: for $\mathbf{x}, \mathbf{y} \in \mathcal{S}^D$,
$$\langle \mathbf{x}, \mathbf{y} \rangle_a = \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \ln\frac{x_i}{x_j} \ln\frac{y_i}{y_j}, \qquad \|\mathbf{x}\|_a = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_a}, \qquad d_a(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} \ominus \mathbf{y}\|_a.$$
As a Hilbert space, the Cauchy-Schwarz inequality, the Pythagorean theorem, and the triangle inequality hold on $(\mathcal{S}^D, \oplus, \odot)$ as well. This geometry is called the Aitchison geometry.
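These quantities are easy to compute using the known identity that the Aitchison inner product equals the Euclidean inner product of centered logratio (clr) coordinates; the helper names in this sketch are our own.

```python
import numpy as np

def clr(x):
    """Centered logratio coordinates: log(x_i) - mean_j log(x_j)."""
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean()

def a_inner(x, y):
    """Aitchison inner product <x, y>_a, computed as <clr(x), clr(y)>."""
    return float(clr(x) @ clr(y))

def a_norm(x):
    return a_inner(x, x) ** 0.5

def a_dist(x, y):
    """Aitchison distance: Euclidean distance between clr coordinates."""
    return float(np.linalg.norm(clr(x) - clr(y)))

x = np.array([0.1, 0.3, 0.6])
y = np.array([0.4, 0.4, 0.2])
z = np.array([0.2, 0.5, 0.3])

assert abs(a_inner(x, y)) <= a_norm(x) * a_norm(y) + 1e-12  # Cauchy-Schwarz
assert a_dist(x, z) <= a_dist(x, y) + a_dist(y, z) + 1e-12  # triangle inequality
```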
While compositional vectors behave well on the simplex, their sum constraint and bounded domain cause difficulties when they are fed into traditional multivariate analysis methods established in real space. To circumvent these difficulties, researchers introduced logratio transformations that map the simplex into unconstrained real space, so that existing multivariate analysis can proceed without trouble. Aitchison (1986) originally proposed two types of logratio transformations: the additive logratio (alr) and centered logratio (clr) transformations. Logratio transformations preserve the relative information of compositional data on the simplex. However, the alr transformation fails to preserve distance (it is an isomorphism but not an isometry). In contrast, the clr transformation is an isometry from $\mathcal{S}^D$ onto the hyperplane $\{\mathbf{z} \in \mathbb{R}^D : \sum_{i=1}^D z_i = 0\}$, but its image remains constrained. The isometric logratio (ilr) transformation removes this constraint: it is an isometry from $\mathcal{S}^D$ onto $\mathbb{R}^{D-1}$, defined by
$$\mathrm{ilr}(\mathbf{x}) = V \, \mathrm{clr}(\mathbf{x}), \qquad \mathrm{clr}(\mathbf{x}) = \Big(\ln\frac{x_1}{g(\mathbf{x})}, \ldots, \ln\frac{x_D}{g(\mathbf{x})}\Big)^\top,$$
where $g(\mathbf{x}) = \big(\prod_{i=1}^D x_i\big)^{1/D}$ is the geometric mean and the rows of the $(D-1) \times D$ matrix $V$ form an orthonormal basis of the clr hyperplane.
Its inverse transformation becomes
$$\mathrm{ilr}^{-1}(\mathbf{z}) = \mathcal{C}\big(\exp(V^\top \mathbf{z})\big), \qquad \mathbf{z} \in \mathbb{R}^{D-1},$$
where $\exp(\cdot)$ is applied componentwise and $\mathcal{C}(\cdot)$ is the closure operation.
Note that the ilr transformation preserves the Aitchison geometry, $d_a(\mathbf{x}, \mathbf{y}) = \|\mathrm{ilr}(\mathbf{x}) - \mathrm{ilr}(\mathbf{y})\|$ for all $\mathbf{x}, \mathbf{y} \in \mathcal{S}^D$, and retains the intrinsic dimensionality $D - 1$ of the simplex.
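The ilr transformation can be realized with any orthonormal basis of the clr hyperplane; one convenient choice is a Helmert-type construction. The sketch below (helper names our own) builds such a basis and verifies the round trip and the isometry numerically.

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def clr(x):
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean()

def helmert_basis(D):
    """(D-1) x D matrix whose rows are an orthonormal basis of
    the clr hyperplane {z in R^D : sum(z) = 0}."""
    V = np.zeros((D - 1, D))
    for i in range(1, D):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))
    return V

def ilr(x, V):
    return V @ clr(x)

def ilr_inv(z, V):
    return closure(np.exp(V.T @ z))

D = 4
V = helmert_basis(D)
x = closure([0.1, 0.2, 0.3, 0.4])
y = closure([0.4, 0.3, 0.2, 0.1])

assert np.allclose(V @ V.T, np.eye(D - 1))    # rows are orthonormal
assert np.allclose(ilr_inv(ilr(x, V), V), x)  # ilr is invertible
# isometry: Aitchison distance (clr form) equals Euclidean distance in ilr space
assert np.isclose(np.linalg.norm(clr(x) - clr(y)),
                  np.linalg.norm(ilr(x, V) - ilr(y, V)))
```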
In the previous section, we explained that any multivariate data analysis method can be applied to ilr transformed data because the ilr transformation provides an unconstrained coordinate system while preserving distance and intrinsic dimensionality. However, there remains a concern that the distribution of ilr transformed data may differ from the distributions commonly assumed in multivariate analysis on real space. For example, the multivariate Gaussian is the most popular distribution, for both empirical and theoretical reasons, yet Gaussianity of ilr transformed data is not well justified. In binary classification of compositional data, the two classes are assumed to follow their own class-specific distributions on the simplex $\mathcal{S}^D$. Consider class-conditional densities $f_1$ and $f_2$ on $\mathcal{S}^D$ with prior class probabilities $\pi_1$ and $\pi_2$.
Discriminant analysis finds a decision rule that classifies an observation $\mathbf{x}$ to the class with the larger posterior probability. The Bayes rule assigns $\mathbf{x}$ to class 1 if
$$\pi_1 f_1(\mathbf{x}) \ge \pi_2 f_2(\mathbf{x}),$$
where $f_k$ and $\pi_k$ denote the density and the prior probability of class $k$, and to class 2 otherwise.
Suppose that the two classes follow separate Dirichlet distributions, the most popular assumption for compositional vectors. The Dirichlet density with parameter $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_D)^\top$, $\alpha_i > 0$, is
$$f(\mathbf{x}; \boldsymbol{\alpha}) = \frac{\Gamma\big(\sum_{i=1}^D \alpha_i\big)}{\prod_{i=1}^D \Gamma(\alpha_i)} \prod_{i=1}^D x_i^{\alpha_i - 1}, \qquad \mathbf{x} \in \mathcal{S}^D.$$
The density of the ilr transformed vector $\mathbf{z} = \mathrm{ilr}(\mathbf{x})$ can be obtained by the change-of-variable technique; we omit the derivation. Since the resulting log-density is not a quadratic function of $\mathbf{z}$, the ilr transformed data are not Gaussian, and the log-ratio of two such class densities is not a polynomial in $\mathbf{z}$. Consequently, the Bayes decision boundary between two Dirichlet classes in the ilr space is neither linear, quadratic, nor of any fixed polynomial form.
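This non-Gaussianity is easy to observe empirically: sampling from a Dirichlet distribution (with illustrative parameters of our own choosing) and ilr-transforming the draws yields clearly skewed coordinates, whereas Gaussian data would have marginal skewness near zero.

```python
import numpy as np

def helmert_basis(D):
    """Orthonormal basis of the clr hyperplane, as rows of a (D-1) x D matrix."""
    V = np.zeros((D - 1, D))
    for i in range(1, D):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))
    return V

rng = np.random.default_rng(0)
alpha = np.array([2.0, 5.0, 1.5])       # hypothetical Dirichlet parameter
X = rng.dirichlet(alpha, size=20000)    # rows are compositions on S^3

L = np.log(X)
C = L - L.mean(axis=1, keepdims=True)   # clr coordinates, row-wise
Z = C @ helmert_basis(3).T              # ilr coordinates in R^2

# marginal skewness of each ilr coordinate; Gaussian data would give ~0
skew = ((Z - Z.mean(0)) ** 3).mean(0) / Z.std(0) ** 3
print("marginal skewness of ilr coordinates:", skew)
```

With these parameters the skewness is far from zero, which is incompatible with the Gaussian assumption underlying LDA and QDA.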
Therefore, more flexible classification methods on the ilr transformed data are desirable for compositional data classification. However, to the best of our knowledge, little of the literature considers flexible classifiers, such as SVM, for compositional data. In Section 4, we conduct a comparison study of several classification approaches. As existing practice, polynomial logistic regression and Gaussian-based discriminant analysis serve as references. Against these frequently used methods, we consider SVM and Gaussian mixture discriminant analysis. SVM is a popular flexible binary classifier that is free of distributional assumptions. Discriminant analysis with a Gaussian mixture as the class-specific distribution also produces a non-polynomial decision boundary, whose functional form adapts flexibly to the actual class distributions. Both methods carry out the classification procedure on the ilr transformed space. In addition, we consider discriminant analysis applied directly on the original simplex under the assumption that the class distributions are Dirichlet.
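To make the two flexible classifiers concrete, here is a minimal sketch (scikit-learn, with illustrative Dirichlet parameters of our own choosing, not the paper's exact setup): an RBF-kernel SVM on ilr coordinates, and GMDA implemented as one Gaussian mixture per class compared by log-density.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

def ilr_helmert(X):
    """ilr transform rows of X (compositions) with a Helmert-type basis."""
    D = X.shape[1]
    V = np.zeros((D - 1, D))
    for i in range(1, D):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))
    L = np.log(X)
    return (L - L.mean(axis=1, keepdims=True)) @ V.T

# two hypothetical Dirichlet classes (parameters are illustrative only)
X0 = rng.dirichlet([2.0, 6.0, 2.0], size=300)
X1 = rng.dirichlet([6.0, 2.0, 2.0], size=300)
Z = ilr_helmert(np.vstack([X0, X1]))
y = np.r_[np.zeros(300), np.ones(300)]

# SVM on ilr coordinates
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(Z, y)

# GMDA: fit one mixture per class, classify by the larger
# (equal-prior) mixture log-density
gm0 = GaussianMixture(n_components=2, random_state=0).fit(Z[y == 0])
gm1 = GaussianMixture(n_components=2, random_state=0).fit(Z[y == 1])
gmda_pred = (gm1.score_samples(Z) > gm0.score_samples(Z)).astype(float)

print("SVM train accuracy:", svm.score(Z, y))
print("GMDA train accuracy:", (gmda_pred == y).mean())
```

Unequal priors could be handled by adding the class log-priors to the mixture log-densities before comparing them.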
In this section, we provide a numerical comparison of polynomial logistic regression (Logistic), Gaussian-based discriminant analysis (LDA, QDA), SVM, and Gaussian mixture discriminant analysis (GMDA) on the ilr space, and Dirichlet discriminant analysis (DDA) on the simplex. For performance comparison, we use synthetic data generated from Dirichlet and Dirichlet mixture class distributions. A real-world example (the Hydrochem data) is also used, since the set of known distributions on the simplex is too limited to represent the variety of real-world compositional data.
We consider two scenarios for two-class compositional data generation.
The last
We summarize the averaged test error rates of the classifiers learned with training datasets of size $n = 100$, 250, and 500 (with $D = 3$) in Table 1, and with the number of parts $D = 3$, 6, and 12 (with $n = 250$) in Table 2.
Next, we apply and compare all classification methods on the Hydrochem data (Otero et al.).
We randomly split the two-class data into training (70%) and test (30%) sets and then apply all classification methods to the training data to learn their classifiers. The test error rate is evaluated by applying the fitted classifiers to the test set. Because this procedure is heavily influenced by the random train/test split, it is repeated 100 times to avoid conclusions driven by a particular accidental split. Figure 2 presents boxplots of the 100 test error rates for each classification method. LDA is the worst performer, which indicates that a linear classifier is not appropriate for this problem. SVM turns out to be the best, and GMDA also performs reasonably well compared with Logistic and QDA. This real example, whose actual data generating process on the simplex is unknown, indicates that flexible approaches such as SVM and GMDA are competitive against the commonly used practices of LDA, QDA, and logistic regression.
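The repeated-splitting protocol can be sketched as follows (scikit-learn; since the Hydrochem data is not bundled here, a synthetic two-class Dirichlet stand-in with made-up parameters is used in its place, and only 20 of the 100 repetitions are run).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(2)

def ilr_helmert(X):
    """ilr transform rows of X (compositions) with a Helmert-type basis."""
    D = X.shape[1]
    V = np.zeros((D - 1, D))
    for i in range(1, D):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))
    L = np.log(X)
    return (L - L.mean(axis=1, keepdims=True)) @ V.T

# placeholder data standing in for the (unavailable here) Hydrochem compositions
X = np.vstack([rng.dirichlet([3, 4, 2, 1], 200), rng.dirichlet([1, 2, 4, 3], 200)])
y = np.r_[np.zeros(200), np.ones(200)]
Z = ilr_helmert(X)

errors = []
for rep in range(20):  # the paper uses 100 repetitions
    Ztr, Zte, ytr, yte = train_test_split(Z, y, test_size=0.3, random_state=rep)
    clf = SVC(kernel="rbf", gamma="scale").fit(Ztr, ytr)
    errors.append(1.0 - clf.score(Zte, yte))

print(f"test error: {np.mean(errors):.3f} ({np.std(errors):.3f})")
```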
In this work, we provide empirical evidence that flexible classification approaches, SVM and Gaussian mixture discriminant analysis, are promising options for compositional data classification compared with the traditional multivariate approaches commonly used in practice. We use the isometric logratio transformation for this application because the ilr transformation is an isometry between $\mathcal{S}^D$ and $\mathbb{R}^{D-1}$.
Average of 100 test error rates and its standard deviation (in parentheses) are presented for the case of $D = 3$ with training sizes $n = 100$, 250, and 500.
Scenario | $n$ | Bayes | Logistic | LDA | QDA | SVM | GMDA | DDA
---|---|---|---|---|---|---|---|---
S1 | 100 | 0.2130 | 0.2358 | 0.4571 | 0.2245 | 0.2413 | 0.2330 | |
(0.0064) | (0.0347) | (0.0335) | (0.0086) | (0.0201) | (0.0196) | (0.0085) | ||
250 | 0.2135 | 0.2224 | 0.4609 | 0.2215 | 0.2284 | 0.2245 | ||
(0.0065) | (0.0079) | (0.0276) | (0.0065) | (0.0125) | (0.0092) | (0.0066) | ||
500 | 0.2128 | 0.2233 | 0.4578 | 0.2207 | 0.2224 | 0.2198 | ||
(0.0064) | (0.0289) | (0.0181) | (0.0069) | (0.0085) | (0.0073) | (0.0065) | ||
S2 | 100 | 0.1577 | 0.2489 | 0.3015 | 0.2652 | 0.2098 | 0.2535 | |
(0.0049) | (0.0165) | (0.0057) | (0.0151) | (0.0234) | (0.0222) | (0.0126) | ||
250 | 0.1578 | 0.2448 | 0.3029 | 0.2571 | 0.1813 | 0.2458 | ||
(0.0049) | (0.0110) | (0.0048) | (0.0110) | (0.0110) | (0.0109) | (0.0075) | ||
500 | 0.1580 | 0.2446 | 0.3021 | 0.2562 | 0.1715 | 0.2449 | ||
(0.0045) | (0.0118) | (0.0053) | (0.0084) | (0.0082) | (0.0066) | (0.0057) |
Best performer for each case is highlighted in bold face.
Average of 100 test error rates and its standard deviation (in parentheses) are presented for the case of $n = 250$ with the number of parts $D = 3$, 6, and 12.
Scenario | $D$ | Bayes | Logistic | LDA | QDA | SVM | GMDA | DDA
---|---|---|---|---|---|---|---|---
S1 | 3 | 0.2135 | 0.2224 | 0.4609 | 0.2215 | 0.2284 | 0.2245 | |
(0.0065) | (0.0079) | (0.0276) | (0.0065) | (0.0125) | (0.0092) | (0.0066) | ||
6 | 0.1070 | 0.1254 | 0.1368 | 0.1312 | 0.1247 | 0.1236 | ||
(0.0039) | (0.0091) | (0.0073) | (0.0062) | (0.0086) | (0.0082) | (0.0043) | ||
12 | 0.0842 | 0.1080 | 0.1178 | 0.1186 | 0.1030 | 0.1112 | ||
(0.0039) | (0.0110) | (0.0090) | (0.0072) | (0.0088) | (0.0084) | (0.0043) | ||
S2 | 3 | 0.1578 | 0.2448 | 0.3029 | 0.2571 | 0.1813 | 0.2458 | |
(0.0049) | (0.0110) | (0.0048) | (0.0110) | (0.0110) | (0.0109) | (0.0075) | ||
6 | 0.1311 | 0.2283 | 0.2438 | 0.2322 | 0.2331 | 0.2334 | ||
(0.0046) | (0.0105) | (0.0083) | (0.0010) | (0.0134) | (0.0173) | (0.0088) | ||
12 | 0.1090 | 0.2237 | 0.2156 | 0.2408 | 0.2329 | 0.2133 | ||
(0.0039) | (0.0125) | (0.0083) | (0.0106) | (0.0109) | (0.0229) | (0.0103) |
Best performer for each case is highlighted in bold face.