search for

CrossRef (0)
A guideline for the statistical analysis of compositional data in immunology
Communications for Statistical Applications and Methods 2022;29:453-469
Published online July 31, 2022
© 2022 Korean Statistical Society.

Jinkyung Yooa, Zequn Sunb, Michael Greenacrec, Qin Mad, Dongjun Chung1,d, Young Min Kim2,a

aDepartment of Statistics, Kyungpook National University, South Korea;
bDepartment of Preventive Medicine - Biostatistics, Northwestern University, USA;
cDepartment of Economics and Business, Universitat Pompeu Fabra, and Barcelona School of Management, Spain;
dDepartment of Biomedical Informatics, The Ohio State University, USA
Correspondence to: 1 Department of Biomedical Informatics, The Ohio State University, Columbus, OH, 43210, USA. E-mail: chung.911@osu.edu
2 Department of Statistics, Kyungpook National University, Daegu, 41566, South Korea. E-mail: kymmyself@knu.ac.kr

This work was supported by the National Institutes of Health (grant numbers R01-GM122078, R21-CA209848, U01- DA045300) awarded to Dongjun Chung,, and the Human Resources Program in Energy Technology of the Korea Institute of Energy Technology Evaluation and Planning(KETEP) granted financial resource from the Ministry of Trade, Industry & Energy, Republic of Korea (No. 20204010600060) awarded to Young Min Kim. The funders had no role in the study design, data collection, and analysis, decision to publish, or preparation of the manuscript.
Received December 28, 2021; Revised April 20, 2022; Accepted April 28, 2022.
The study of immune cellular composition has been of great scientific interest in immunology because of the generation of multiple large-scale data. From the statistical point of view, such immune cellular data should be treated as compositional. In compositional data, each element is positive, and all the elements sum to a constant, which can be set to one in general. Standard statistical methods are not directly applicable for the analysis of compositional data because they do not appropriately handle correlations between the compositional elements. In this paper, we review statistical methods for compositional data analysis and illustrate them in the context of immunology. Specifically, we focus on regression analyses using log-ratio transformations and the alternative approach using Dirichlet regression analysis, discuss their theoretical foundations, and illustrate their applications with immune cellular fraction data generated from colorectal cancer patients.
Keywords : compositional data, compositional regression, Dirichlet regression, immunology, immuno-oncology, log-ratio transformation
1. Introduction

The human immune system consists of various types of immune cells (e.g., T cells, B cells, natural killer (NK) cells, dendritic cells, among others). Upon viral infection, tissue transplantation, or disease occurrence, dynamic and extensive interaction among these immune cell types occurs in the human body. Hence, in the study of the human immune system, it is of great interest to understand composition, differentiation, and activities of various types of immune cells, and interactions among them. The composition of these immune cells is also associated with cancer progression, adverse events, and response to cancer immunotherapy, especially immune checkpoint blockades including Anti-PD1 and Anti-CTLA4.

Along with the interest in this association, there is a movement to gather immune cellular information and clinical information together. In the immunology field, multiple types of assays are used to interrogate such immune cellular composition, including flow cytometry and single cell RNA-seq. In addition, multiple computational algorithms have also been proposed to estimate immune cellular composition by deconvolving bulk gene expression data, where popular algorithms include CIBERSORT (Newman et al., 2019). Recognizing the importance of understanding the immune cellular composition, and the emergence of these computational algorithms and relevant assays, have led to the generation of multiple large-scale immune cellular composition datasets. For example, the Immune Landscape of Cancer generated a large-scale immuno-genomic dataset from more than 10,000 patients with 33 different cancer types based on the Cancer Genome Atlas (TCGA) data (Thorsson et al., 2018). Figure 1 illustrates the Immune Landscape of Cancer, which provides information about immune cell composition within the tumor microenvironment of each cancer patient, along with corresponding clinical information. This new type of data motivates the investigation of relevant statistical methods that can consider key characteristics of these datasets. Effective analysis of such datasets can potentially support development of diagnosis and treatment strategies for various diseases such as cancer and autoimmune diseases.

From the statistical point of view, such immune cellular data can be considered as compositional, since it carries relative information only in the form of proportions of a total amount, summing to 1 in each sample. John Aitchison was a pioneer in the statistical formulation of compositional data analysis, and developed the relevant geometry, metrics, and a guideline for the application of various statistical methods in this context. The constituent cellular fractions are defined on the Aitchison simplex, not in Euclidean space. The Aitchison simplex, S D, is a sample space for compositional data, and is defined as


where v1, . . ., vD are non-negative components of a D-part composition. The dimension of S D is D − 1 due to the constant sum constraint. Aitchison (1986) introduced statistical methods based on log-ratios, which are still most popularly used to analyze compositional data (in this case the compositional parts in (1.1) should be strictly positive). The method is invariant to scaling of compositions, called scale invariance, which gives coherent results regardless of multiplication of a row (composition) of the initial data by an arbitrary positive constant. The R environment includes several packages developed for the log-ratio approach, for example, compositions(van den Boogaart and Tolosana-Delgado, 2008), robCompositions(Templ et al., 2011), and easyCODA(Greenacre, 2018).

Maier (2014) asserted that it is often not straightforward to interpret the results from data analyses using log-ratio transformations and, in addition, these methods can often violate modeling assumptions such as homoscedasticity. As an alternative approach, Dirichlet regression was proposed, originally suggested as a null model for compositional data by Campbell and Mosimann (1987). Hijazi and Jernigan (2009) developed the maximum likelihood estimation methods for Dirichlet regression and also investigated the sampling distributions of the estimates. Camargo et al. (2012) introduced a new approach for estimating the Dirichlet model when each parameter has a linear structure on covariates and suggested a Bayesian model selection method.

These ongoing discussions are to determine optimal statistical strategies for compositional data analysis. Following the trend, in this paper, we aim to give a guideline for the statistical approaches of compositional data analysis in the context of immunology data: firstly, modeling using standard regression analysis with log-ratio transformations, and secondly, the approach using Dirichlet regression analysis.

This paper is structured as follows. In Section 2 we introduce the immune cellular fractions data for colorectal cancer, and the two compositional regression approaches, log-ratio regression and Dirichlet regression, which have been applied to this dataset. Section 3 gives the results of these alternative modeling approaches. Section 4 summarizes the key findings of this paper and comments on the similarities and differences of the two approaches.

2. Material and methods

2.1. Colorectal cancer data

In this paper, we focus on the analysis of the immune cellular fractions data of colorectal adenocarcinoma patients, generated from the Immune Landscape of Cancer project (Thorsson et al., 2018). Specifically, there are 254 colorectal adenocarcinoma patients, 58 (23%) of which are African Americans (AA) and 196 (77%) are European Americans (EA). These patients are almost equally divided between females and males, 126 and 128, respectively. Motivated by previous studies showing significant racial disparity in clinical outcomes (King Thomas et al., 2019; Curran et al., 2021), we focus here on investigating associations of immune cell compositions with race, while we also look at their possible relationships with age. In the Immune Landscape of Cancer, the immune cellular fractions were estimated by deconvolving the gene expression data of the TCGA PanCancer study using the CIBERSORT algorithm (Newman et al., 2015). Thorsson et al. (2018) provided three different aggregations of immune cell types, of which we use Aggregate 2. Specifically, it consists of nine immune cell types: CD8+ T cells (labeled as T.cells.CD8or T.CD8), CD4+ T cells ( T.cells.CD4or T.CD4), B cells ( B.cellsor B), NK cells ( NK.cellsor NK), macrophage ( Macrophageor Macro), dendritic cells ( Dendritic.cellsor Dendr), mast cells ( Mast.cellsor Mast), neutrophils ( Neutrophilsor Neutr), and eosinophils ( Eosinophilsor Eosin).

2.2. Log-ratio approaches

In this section, we describe how to apply the log-ratio regression model to the colorectal cancer data. Since this approach involves computing logarithms of ratios, zero values need to be replaced, which can be done in several ways (Lubbe et al., 2021). In this study we used the k-nearest neighbours (KNN) approach of Hron et al. (2010).

In the context of colorectal cancer data, we are mainly interested in racial differences in immune cellular compositions. To investigate this relationship using log-ratio regression models, the immune cellular compositions composed of the D = 9 immune cells are used as multivariate responses and race and/or age as explanatory variables.

The key step in the log-ratio approach is to choose a set of log-ratio transformations, which convert all the compositions on the Aitchison simplex to multivariate vectors on interval scales in a regular Euclidean space. We can then proceed to apply standard statistical analyses to the log-ratio transformed data, with some care taken in the way the results are interpreted. The widely used sets of log-ratio transformations are the additive log-ratios (ALRs) (Aitchison, 1982), the centered log-ratios (CLRs) (Aitchison, 1982), and isometric log-ratio (ILRs) (Egozcue et al., 2003). All of these transformations can be written as log(vQT, where log(v) is the row vector of compositional values transformed by the natural logarithm, and Q is a matrix with each row summing to zero — see, for example, Greenacre (2022).

The ALR transformations are the easiest to understand and interpret, since they can be simply calculated by taking D − 1 pairwise logratios with respect to a fixed denominator part, taken here as the last part:

ALR (v)=(log v1vD,,log vD-1vD)=log(v)·QALR,

where the (D−1)×D matrix QALR is the (D−1)×(D−1) identity matrix with an additional D-th column of all −1 values. Although technically the ALR transformations do not preserve exact isometricity, the denominator can be chosen to give a transformation close to being isometric (Greenacre et al., 2021). An isometric transformation is one that engenders the exact logratio geometry of all pairwise logratios (see, for example, Greenacre, 2019). The ALR transformation has a intuitive interpretation due to its simple definition, which is advantageous for practical applications.

The CLR transformations are defined as

CLR (v)=(log v1g(v),,log vDg(v))=log (v)·QCLR,

where g(v) is the geometric mean v1·v2vDD. The D × D matrix QCLR has all off-diagonal values equal to −1/D and diagonal values equal to 1 − 1/D. The D CLR transformations are symmetric with respect to the compositional components and they are isometric. The fact that the set of CLRs has a singular covariance matrix (Egozcue et al., 2003) is of no consequence to our present purpose of performing log-ratio regression.

Finally, the ILR transformations are the most complicated choice, both to define and interpret. It is also defined as a linear transformation

ILR (v)=log (v)·QILR,

where the (D − 1) × D matrix QILR has rows that contrast groups of compositional parts as ratios of their geometric means, along with an additional scalar multiplier. This set of transformations is used when the interpretability of the transformation is of less importance than its properties of having a nonsingular covariance matrix and being isometric. A special case of ILR transformations is the set of pivot log-ratios (Filzmoser et al., 2018), where single components are in the numerator of each ratio and the geometric means of the remaining components in the particular ordered sequence are in the denominator. The problem here would be to decide what that ordered sequence should be, since there are D! orderings. Again it should be noted that for the present purpose of compositional regression, the final results will be equivalent no matter which log-ratio transformation is chosen (see below, when the result is expressed as a log-contrast). Hence, if simplicity and interpretability are regarded as important, then the ALR transformations will be preferred.

When we use the ALR transformation, we need to decide which denominator part to choose. Since this choice will not affect our results, the fixed denominator can be chosen on substantive grounds to make the interpretation of the results more meaningful, or it can be chosen to optimize some favourable property. For example, Greenacre et al. (2021) showed how to choose the denominator that gives the set of transformations closest to being isometric. The Procrustes correlation (Gower and Dijksterhuis, 2004; Legendre and Legendre, 2012) can be used to measure the similarity of the geometry based on an ALR-transformation, with only D − 1 variables, to the geometry based on all D(D − 1)/2 pairwise log-ratios (Greenacre, 2019), sometimes called the Aitchison geometry. The same correlation can be used if an even smaller subset of pairwise log-ratios is selected and we want to measure how close to isometry they are, since it often happens that less log-ratios can be used to make the results more parsimonious (Greenacre, 2019; Graeve and Greenacre, 2020).

As said above, once the data have been log-ratio transformed, then standard statistical analysis can be used. Thus, multiple regression models on the log-ratio transformed data can be written as follows, for the j-th log-ratio LRj of a set of transformations for j = 1, 2, . . ., D − 1 or D, where LR can be ALR, ILR or CLR:


where αj, βj and γj are regression coefficients, x1, x2 are the two explanatory variables, race (dummy variable) and age in this case, and ej is a random error with mean 0 and a constant variance.

Once the set of log-ratio regressions (Equation (2.4)) is performed for each log-ratio j, then the coefficients βj and γj for the two predictors can be inversely transformed (Van den Boogaart and Tolosana-Delgado, 2013; Greenacre, 2022) to give a log-contrast on v of the form (for each predictor)

φ1log (v1)+φ2log (v2)++φDlog (vD),         where jφj=0.

This result shows how the compositional response, on a log-scale, is affected by a unit change in the predictor, as estimated by the set of log-ratio regressions. In the case of the dummy variable predictor for race, where EA = 0 and AA = 1, the coefficients show the “effect” of AA compare to EA. When Equation (2.5) is exponentiated, the φj’s are seen to be the estimated multiplicative effects, since the exponentiated log-contrast has this form: v1φ1v2φ2vDφD. The coefficients φj of the log-contrast will be the same no matter which log-ratio transformation, ALR, CLR or ILR, is used. This further justifies using the ALR transformation, which not only produces the correct log-contrast result but also has an easier interpretation of the responses in the original log-ratio regressions (Equation (2.4)). When log-ratios are used as predictors, there is a similar result—no matter which complete set of log-ratios is used, they all lead to the same log-contrast (Coenders and Pawlowsky-Glahn, 2020).

When it comes to visualizing compositional data by a dimension-reduction method such as principal component analysis (PCA), then CLR-transformed data are used, since the PCA of the CLRs has been shown to be equivalent to the PCA of all pairwise log-ratios (Aitchison and Greenacre, 2002). The PCA of CLR-transformed data has thus been called log-ratio analysis (LRA) (Greenacre, 2010). Here there is an issue whether to weight the parts or not, which is usually decided when comparing the variance contribution of the parts to the total variance, since it often occurs – as in this application – that rare parts contribute excessively to the total variance. With weights wd, d = 1, . . ., D, equal to the average compositional proportions, the total variance is then computed equivalently in two different ways as follows:


using either the variances σd2 of the CLRs or the variances σd,h2 of the pairwise logratios of the d-th and h-th parts (Greenacre, 2021).

2.3. Dirichlet approaches

The Dirichlet distribution

The Dirichlet distribution models the probability of a multinomial random variable, that is a random composition v = (v1, v2, . . ., vD) where vd ∈ [0, 1] and ∑d vd = 1. The Dirichlet distribution has shape parameters α = (α1, . . ., αD)T for the D respective components. The probability density function (pdf) is defined as

D(vα)=1B(α)d=1Dvdαd-1,         where         B(α)=Πd=1DΓ(αd)Γ(Σd=1Dαd),

where B(·) and Γ(·) are the Beta and Gamma functions, respectively. The marginal distributions of the Dirichlet distribution are all Beta distributions. For each component d, the mean and variance of the marginal Beta distribution are E[vd] = αd/α+ and Var[vd]=[αd(α+-αd)]+[α+2(α++1)], where α+ = ∑d αd. α+ is a measure of dispersion (or precision) of the distribution, with high values of α+ indicating higher density around the expected values.

Estimation of Dirichlet regression coefficients

The Dirichlet regression provides an alternative to log-ratio regression for the modeling of compositional data responses, and establishes the relationships between the parameters of the Dirichlet distribution and linear functions on the covariates. The regression model should be fitted for each αd. Suppose that there are independent random vector variables V1, . . ., Vn, where Vi = (Vi1, . . ., ViD) for i = 1, . . ., n satisfying Σd=1Dvid=1 for observed vid for Vid. In addition, we assume that given the covariate row vector Xi = (xi1, . . ., xip) corresponding to i-th observation, Vi|Xi follows the Dirichlet distribution with shape parameters (αi1, . . ., αiD), each of which satisfies for observed covariates xi,


where βd = {βdk}, k = 1, . . ., p, is a vector of regression coefficients. Here, each function hd: R → (0,∞) is an three times differentiable injective function. The functions hd work in the analogous way that the link function in generalized linear models does. We also assume that hd = h, especially designated by the log function (Gueorguieva et al., 2008; Melo et al., 2009; Maier, 2014). Therefore, the relation can be re-written for each element of αd = {αid} as

log (αid)=βdxi,         i=1,2,,n.

The unknown regression coefficients β̂dk are estimated by using maxim likelihood method, through the derivatives of the log-likelihood function given by,

L(vα˜)=i=1n{log Γ(d=1Dαid)-d=1Dlog Γ(αid)+d=1D(αid-1)log vid}.

There is no closed form solution, hence it must be calculated numerically using a nonlinear optimization procedure. The invariance property of the maximum likelihood estimator (MLE) leads to obtain the MLE α̂id of {αid} as

α^id=exp (β^dxi).

In this manuscript, we employed the Dirichlet regression for analysis of the colorectal cancer immunology data, where immune cellular composition is considered as conpositional outcomes.

Dirichlet regression diagnostics

For model diagnostics of the Dirichlet regression, two types of residuals can be considered, namely standardized residuals and composite residuals. For the d-th component and the i-th individual, standardized residuals rid and composite residuals Ci are defined as

rid=vid-E[Vidα^i·]Var (Vidα^i·),         andCi=drid2,

for d = 1, 2, . . ., D, and i = 1, 2, . . ., n. Here, Vid is a random variable for vid for d = 1, . . ., D and i = 1, . . ., n, and α̂i· is a D × 1 vector as (α̂i1, . . ., α̂iD) defined in Equation (2.7). The composite residuals Equation (2.9) are obtained using equal weights to all standardized residuals Equation (2.8).

For the Dirichlet regression model, Gueorguieva et al. (2008) investigated its diagnostic approach to identify influential observations utilizing score residuals, which can assess overall model fit through overdispersion. First of all, Cook (1986) suggested the local measures of influence to detect observations with high leverages. Specifically, the local measures are constructed by assigning minimal weight to an observation in the likelihood (Cook, 1986), whereas the original Cook’s distance, which is used for the global measures of influence, deletes an observation completely. The measure is defined as

ρid=sidG(α^id-G(Σiα^id),         d=1,2,,D,i=1,2,,n,

where α̂id is the maximum likelihood estimator of αid defined in Equation (2.7), G(x) = ∂ log Γ(x)/∂x is the digamma function, and a score residual sid = G(∑i α̂id)−G(α̂id)+log vid. Note that the denominator of ρid reflects the amount of observed information of vid that contributes to the estimation of the parameter αid (Gueorguieva et al., 2008).

Another important diagnostic tool for the Dirichlet regression is overdispersion, which evaluates the goodness-of-fit of the model. The overdispersion is defined as an increase of the variance of a response due to the lack of homogeneity in a parameter across observations (Zelterman and Chen, 1988). Zelterman and Chen (1988) derived overdispersion statistics when parameters are allowed to have small amount of random variability in the response across observations. The test statistics are to detect a lack of fit when the variability is larger than the expected variances. For a set of mutually independent response variables Vi having the pdf , the homogeneity in each parameter αid is satisfied when all αid’s are identical across observations. Gueorguieva et al. (2008) provided test statistics for overdispersion consisting of the shape parameters αid and coefficients βdk. For the Dirichlet regression model with the k-th covariates xk for the i-th observation, the overdispersion statistic for testing homogeneity of the parameter αid and the regression coefficient βdk is defined as

δidβdk=αid2xik2ηidαid,         d=1,2,,D,i=1,2,,n,



In general, if sample variances of certain observations are larger than what is expected, the estimates for δidβdk are also getting larger accordingly.

3. Results

3.1. Log-ratio regression

In an initial investigation of the variances of the log-ratios, it was found that the rarer components engendered high variances. Hence it was decided to use weighted analyses in the multivariate analyses where the transformed components of the cell types are combined, for example in the computation of total log-ratio variance as in Equation (2.6) or the computation of multivariate distances. Here the default weights were the average proportions of the nine immune cell types, so that rarer cell types have less weight than the more abundant ones. However, notice that the weighting does not affect the log-ratio regressions, where the transformed compositions act as response variables.

Choice of ALR transformation

In order to select the denominator for the ALR transformation, Table 1 presents the Procrustes correlations, in descending order, of the ALR-transformed data using each cell type in turn as denominator.

The weights in this table refer to the average proportions in the whole data set, which are used to compute total log-ratio variance in Equation (2.6) and the exact log-ratio geometry. In this case, the order of the correlations follows the weight of the denominator component. The higher the Procrustes correlation, the more accurately the ALR transformation reflects the exact logratio structure. Hence Table 1 shows that Macrophage is the reference part of choice, with a Procrustes correlation of 0.989.

Exploratory multivariate analyses of the immune cell compositions

Before doing the log-ratio regression, it is interesting to understand the multivariate structure of the data set. Figure 2 shows the weighted LRA on the left, that is the weighted PCA of the CLR-transformed data, where the two racial groups are coded into the label of the samples. This analysis, with 29.9 + 22.1 = 52.0% of the total log-ratio variance explained shows that four immune cell types dominate: Macrophage, T.cells.CD8, T.cells.CD4 and Mast.cells. On the right a discriminant version of the structure is shown, where the first (horizontal) axis is specifically constrained to coincide with the difference in the two group means of EA and AA. From this latter plot, it can be deduced that log-ratios such as B.cells/Macrophage and T.cells.CD4/T.cells.CD8 are good discriminators of the two groups, or even the amalgamation log-ratios of (B.cells+T.cells.CD4)/(Macrophage+T.cells.CD8), called SLRs (summated log-ratios, see Greenacre et al., 2020). This can be demonstrated by making 95% confidence plots of the group means (Greenacre, 2016) and performing a statistical test of the group differences, shown in Figure 3. If the confidence intervals do not overlap, this is a reliable indication that the group means are significantly different, but the statistical test should nevertheless be performed to confirm this result. The p-values are indeed both less than 0.05, with the summated log-ratio having the greater separation and the lower p-value.

Linear regressions using log-ratio transformed responses

Each ALR-transformed log-ratio is regressed in turn on race, a dummy variable coding AA, and a continuous variable for age. The estimated regression coefficients for these two variables are listed in Table 2, along with their 95% Bootstrapped confidence intervals (10000 bootstrap replicates).

The coefficients present how much the ALR response is influenced multiplicatively by the explanatory variable. For example, for AA the ratio B.cells/Macrophage is estimated as 66% higher than the same ratio for EA. The fact that the confidence interval does not include 1 means that this is a significant result, which has already been seen noted in Figure 3. Two other ALR ratios, NK.cells/Macrophage and Eosinophils/Macrophage, have significant coefficients, showing 39% and 49% increase in AA, respectively. The coefficients of age for all responses are all close to 1, and their confidence intervals all include 1, indicating that there is no significant effect of age on the responses.

The estimates for the ALR responses can be converted into log-contrast coefficients estimated for individual compositional components. Figure 4 shows these coefficients, also expressed as multiplicative effects, their bootstrap confidence intervals and p-values, for both race and age. Once again, for age, all the coefficients are close to 1 and none significantly different from 1, indicating no effect. For race, four components are significant, two in a positive direction (i.e. multiplicative effects greater than 1, T.cells.CD4 and B.cells) and two in a negative direction (T.cells.CD8 and Macrophage), coinciding exactly with the results on the right of Figures 2 and 3. These individual coefficients for race can be interpreted as follows: given any particular composition v of the nine cellular types for EA, the corresponding composition for AA is the set of estimated coefficients multiplied elementwise with v, hence pushing some values upwards, others downwards. Since race and age have been included in the regressions, the effects of age would also have to be applied if the hypothetical AA were different in age, but this will have only small non-significant effects. The beauty of the log-contrast coefficients is that exactly the same result would be obtained for any ALR transformation, any ILR transformation, or the CLR transformation. The only difference between the transformations is then the estimated coefficients in a table such as Table 2, where it is preferred to have simple and easily interpretable responses.

An alternative way to conclude that age is not a significant predictor of the compositional response is to conduct the MANOVA test for the models. Table 3 shows the results for ALR model with Pillai’s trace and approximated F-statistics with its degree of freedoms. Pillai’s trace value ranges from 0 to 1, which indicates that the explanatory variable has a significant effect on the multivariate response as being closer to 1 (Pillai, 1955). The result in Table 3 shows that race is significant on immune cells with Pillai’s trace of 0.0866 (p = 0.0043). On the other hand, age is not significant.

In regression analysis, to arrive at a final model, the nonsignificant predictor variable of age should be omitted. Furthermore, in this case where the composition is the multivariate response, the non-significant components of the response can also be eliminated, arriving at a parsimonious description of the relationship. The logratio regressions were thus repeated including only four immune cell types: T.cells.CD8, T.cells.CD4, B.cells and Macrophage, as a subcomposition. The results for the log-contrast coefficients are given in Figure 5, showing estimates, 95% confidence intervals and p-values. The results are more significant on each of the cell types, as might be expected in this reduced model. Furthermore, the similarity of the multiplicatively increasing coefficients for T.cells.CD4 and B.cells and those of the multiplicatively decreasing coefficients for T.cells.CD8 and Macrophage, gives further credence for using the simple ratio (T.cells.CD4+B.cells)/(T.cells.CD8+Macrophage), as already shown in Figure 3, deduced from the right hand biplot of Figure 2.

3.2. Dirichlet regression

Since the age covariate is insignificant, we only focus on a Dirichlet regression model with race.

Estimates of regression coefficients

Table 4 provides estimates of the β coefficients, standard errors (s.e.) and p-values for testing H0: β = 0 vs. H1: β ≠ 0. The coefficients for T.cells.CD4 and Macrophage are significant, which means that AA has exp(0.2127) = 1.2371 times or exp(−0.2290) = 0.7953 times more T.cells.CD4 or Macrophage than EA. The signs of the coefficients are generally agreeing with those in Figure 4, where positive and negative coefficients correspond to multiplicative effects above and below 1, respectively.

Figure 6 is a componentwise plot of the local influence measures against compositional values, which were generated based on the fitted Dirichlet models with race as an independent variable. We have already verified that race has no significant relevance with the immune cells in Table 4. Overall, the local influence measures tend to increase rapidly for values near zero and then increase more gradually as values increase. In spite of varying curvatures among cell types, which is smallest for Macrophage, it is common that as values are getting close to zero, the impact of individual observations on the estimation increases significantly. In Figure 6, we can also find two curves for T.cells.CD8, T.cells.CD4, B.cells, NK.cells and Macrophage corresponding to racial groups, which diverges. Specifically, AA has larger effects on estimates for Macrophage and T.cells.CD8 compared to EA, whereas opposite directionalities are observed for T.cells.CD4, B.cells and NK.cells. On the other hand, two curves are not visually separable for Dendritic.cells, Mast.cells, Neutrophils and Eosinophils.

Figure 7 illustrates the composite residual plots of the Dirichlet model, which shows that there are some observations with large composite residuals over 40 in both racial groups, but majority of the composite residuals are spread below 20.

Figure 8 illustrates the componentwise plots of the overdispersion statistics of individual observations against compositional values, based on the Dirichlet models fitted with race. In these plots, the red marked points indicate the observation with the largest overdispersion statistic value in each cell type. Nonetheless, no significant overdispersion issue is detected in general.

4. Discussion

With improved understanding of interaction between the immune system and various diseases such as cancer, the immunology field studying human immune system has gained significant attention. Investigation of immune cellular composition and its association with diseases constitutes the core of the immunologic studies. However, in spite of their importance, optimal statistical strategies for this type of data still remain to be studied. In this paper, we reviewed statistical methods for compositional data analysis and applied the methods to colorectal cancer immune cellular fractions data.

As illustrated throughout the manuscript, it is critical to consider unique aspects of compositional data to implement efficient data analysis of immune cellular composition data and guarantee meaningful scientific insight. Ignoring this can result in misleading conclusions based on inappropriately visualization and/or suboptimal selection of key variables ignoring inter-relationships among the elements in compositional data. As solutions for these issues, we especially investigated the log-ratio and Dirichlet regression models. Each approach has its own strengths. One of the key strengths of the log-ratio approaches is the fact that existing and established statistical methods can be employed. This allows utilization of a wide range of existing statistical models. The log-ratio approach involves choosing one of the available log-ratio transformations, which in the present application serve as multivariate responses in a regression model. Fortunately, for this purpose the final results in the form of log-contrast coefficients are invariant with respect to this choice, so we have used the simplest option, ALR. This choice has the favourable property that the individual regressions can be more easily interpreted. In contrast, the Dirichlet model handles compositional data more directly, without transformation and with a simpler interpretation, but the analysis is no longer subcompositionally coherent.

In the analysis of colorectal cancer immune cellular fractions data, we mainly focused on studying associations of immune cellular fractions with race, since age was found to not affect the responses significantly in both analyses. The log-ratio regression found that four cellular types were significantly associated with the two racial groups, whereas the Dirichlet regression found only two of those four types to be significant. From the log-ratio regression we can conclude that T.cells.CD4, B.cells, T.cells.CD8 and Macrophage can potentially be considered as key markers for racial difference, and that the ratio of the sum of first two versus the sum of the last two can be used as a single summary of the distinction between the two groups.

We hope that this paper provides a gentle but thorough guideline for the statistical analysis of compositional data, especially those generated in immunology.

Fig. 1. Immune Landscape of Cancer data and immune cell type composition within the tumor microenvironment of each cancer patient.
Fig. 2. (Left) Weighted log-ratio analysis of the immune cell compositional dataset, showing the contribution biplot. The African American samples are indicated by a cross, and the two ellipses are 95% confidence ellipses for the group means, with the African American group on the right. (Right) The discriminant version where the first dimension coincides with the group difference.
Fig. 3. 95% confidence plots of the log-ratio means of the pairwise log-ratio on the left and the summated log-ratio on the right. The dot indicates the mean, the box indicates the 50% confidence interval and the whiskers extend to the 95% confidence interval.
Fig. 4. Estimates of log-contrast coefficients (exponentiated) for each immune cell type, along with 95% bootstrap confidence intervals and -values. Race is coded as a dummy variable for African American.
Fig. 5. Estimates of log-contrast coefficients (exponentiated) for the four-part subcompositional repsonse modelled on the discrete predictor race, with a dummy variable for the category African American, along with 95% bootstrap confidence intervals and -values.
Fig. 6. Componentwise plots of the local influence measures against compositional values based on the Dirichlet model with the race variable.
Fig. 7. Composite residual plot for Dirichlet regression model with race.
Fig. 8. Componentwise plots of the overdispersion statistic against compositional values based on the Dirichlet model fitted with the race variable. The red marked points indicate the observation with the largest overdispersion statistic value in each cell type.

Table 1

Correlations from the Procrustes analysis between the geometries of the different ALR transformations and the exact geometry of all pairwise logratios, using each cell type in turn as denominator

Cell typeWeightProcrustes correlation

Weight is the average proportion of all log-ratios in the weighted analysis.

Table 2

Multiplicative coefficients for the the ALR transformed responses

T.CD8/Macro1.0234(0.7312 1.3902)1.0019(0.9927 1.0114)
T.CD4/Macro1.6476(1.2230 2.2024)1.0004(0.9889 1.0138)
B/Macro1.6617(1.1485 2.3378)0.9922(0.9811 1.0032)
NK/Macro1.3929(1.0139 1.8682)1.0032(0.9938 1.0121)
Dendr/Macro1.0676(0.6843 1.5620)0.9976(0.9872 1.0086)
Mast/Macro1.3601(0.9220 1.8937)0.9996(0.9899 1.0093)
Neutro/Macro1.1689(0.8237 1.6187)0.9944(0.9834 1.0055)
Eosin/Macro1.4902(1.0052 2.1052)0.9969(0.9874 1.0067)

Note that boot.CIrace and boot.CIage denote the 95% bootstrap confidence intervals for race and age.

Table 3

MANOVA table for model using ALR transformed immune cells as multivariate responses

Pillai(df1, df2)approx.Fp-value
Race(AA)0.0866(8, 244)2.89280.0043
Age0.021(8, 244)0.65280.7327

Pillai is Pillai’s trace, approx.F is the approximation of Pillai’s trace on a F-statistic and (df1, df2) are the degrees of freedom of the approximated F-statistics.

Table 4

Dirichlet regression outputs with race as an independent variable

Cell typeEstimateexp(estimate)s.e.p-value

  1. Aitchison J (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44, 139-160.
  2. Aitchison J (1986). Logratio analysis of composition. The Statistical Analysis of Compositional Data, (pp. 141-183), London, Champman & Hall.
  3. Aitchison J and Greenacre M (2002). Biplots of compositional data. Journal of the Royal Statistical Society: Series C (Applied Statistics), 51, 375-392.
  4. Camargo AP, Stern JM, and Lauretto MS (2012). Estimation and model selection in Dirichlet regression. AIP Conference Proceedings 31st, 1443, 206-213.
  5. Campbell G and Mosimann J (1987). Multivariate methods for proportional shape. ASA Proceedings of the Section on Statistical Graphics, 1, 10-17.
  6. Coenders G and Pawlowsky-Glahn RD (2020). On interpretations of tests and effect sizes in regression models with a compositional predictor. SORT Statistics and Operations Research Transactions, 44, 200-220.
  7. Cook RD (1986). Assessment of local influence. Journal of the Royal Statistical Society: Series B (Methodological), 48, 133-155.
  8. Curran T, Sun Z, and Gerry B, et al. (2021). Differential immune signatures in the tumor microenvironment are associated with colon cancer racial disparities. Cancer Medicine, 10, 1805-1814.
    KoreaMed CrossRef
  9. Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, and Barcelo-Vidal C (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35, 279-300.
  10. Filzmoser P, Hron K, and Templ M (2018). Applied Compositional Data Analysis, Cham, Springer.
  11. Gower JC and Dijksterhuis GB (2004). Procrustes problems, Oxford, New York, Oxford University Press.
  12. Graeve M and Greenacre M (2020). The selection and analysis of fatty acid ratios: a new approach for the univariate and multivariate analysis of fatty acid trophic markers in marine organisms. Limnology and Oceanography: Methods, 18, 196-210.
  13. Greenacre M (2010). Log-ratio analysis is a limiting case of correspondence analysis. Mathematical Geosciences, 42, 129-134.
  14. Greenacre M (2016). Data reporting and visualization in ecology. Polar Biology, 39, 2189-2205.
  15. Greenacre M (2018). Compositional Data Analysis in Practice, Chapman and Hall/CRC.
  16. Greenacre M (2019). Variable selection in compositional data analysis using pairwise logratios. Mathematical Geosciences, 51, 649-682.
  17. Greenacre M, Grunsky E, and Bacon-Shone J (2020). A comparison of isometric and amalgamation logratio balances in compositional data analysis. Computers & Geosciences, 148, 104621.
  18. Greenacre M, Mart챠nez-횁lvaro M, and Blasco A (2021). Compositional data analysis of microbiome and any-Omics datasets: A validation of the additive logratio transformation. Frontiers in Microbiology, 2625.
  19. Greenacre M (2021). Compositional data analysis. Annual Review of Statistics and its Application, 8, 271-299.
  20. Greenacre M (2022). Compositional data analysis linear algebra, visualization and interpretation, Bekker A (Ed). Innovations in Multivariate Statistical Modelling: Navigating Theoretical and Multidisciplinary Domains, Springer.
  21. Gueorguieva R, Rosenheck R, and Zelterman D (2008). Dirichlet component regression and its applications to psychiatric data. Computational Statistics & Data Analysis, 52, 5344-5355.
  22. Hijazi RH and Jernigan RW (2009). Modeling compositional data using Dirichlet regression models. Journal of Applied Probability & Statistics, 4, 77-91.
  23. Hron K, Templ M, and Filzmoser P (2010). Imputation of missing values for compositional data using classical and robust methods. Computational Statistics & Data Analysis, 54, 3095-3107.
  24. King Thomas J, Mir H, Kapur N, and Singh S (2019). Racial differences in immunological landscape modifiers contributing to disparity in prostate cancer. Cancers, 11, 1857.
  25. Legendre P and Legendre L (2012). Numerical Ecology, Elsevier Science.
  26. Lubbe S, Filzmoser P, and Templ M (2021). Comparison of zero replacement strategies for compositional data with large numbers of zeros. Chemometrics and Intelligent Laboratory Systems, 210, 104248.
  27. Maier MJ (2014). DirichletReg: Dirichlet regression for compositional data in R. Research Report Series, Vienna, Department of Statistics and Mathematics, 125, WU Vienna University of Economics and Business.
  28. Melo TFN, Vasconcellos KLP, and Lemonte AJ (2009). Some restriction tests in a new class of regression models for proportions. Computational Statistics & Data Analysis, 53, 3972-3979.
  29. Newman AM, Liu CL, and Green MR, et al. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12, 453-457.
    Pubmed KoreaMed CrossRef
  30. Newman AM, Steen CB, and Liu CL, et al. (2019). Determining cell type abundance and expression from bulk tissues with digital cytometry. Nature Biotechnology, 37, 773-782.
    Pubmed KoreaMed CrossRef
  31. Pillai KCS (1955). Some new test criteria in multivariate analysis. The Annals of Mathematical Statistics, 26, 117-121.
  32. Templ M, Hron K, and Filzmoser P (2011). robCompositions: An R-package for robust statistical analysis of compositional data, Ch. 25, Pawlowsky-Glahn V and Buccianti A (Eds). Compositional Data Analysis: Theory and Applications, (pp. 341-355), Chichester, UK, John Wiley & Sons, Ltd.
  33. Thorsson V, Gibbs DL, and Brown SD, et al. (2018). The immune landscape of cancer. Immunity, 48, 812-830.
    Pubmed KoreaMed CrossRef
  34. Van den Boogaart KG and Tolosana-Delgado R (2008). 쐁ompositions: a unified R package to analyze compositional data. Computers & Geosciences, 34, 320-338.
  35. Van den Boogaart KG and Tolosana-Delgado R (2013). Analyzing Compositional Data with R, Springer.
  36. Zelterman D and Chen CF (1988). Homogeneity tests against central-mixture alternatives. Journal of the American Statistical Association, 83, 179-182.