
The human immune system consists of various types of immune cells (e.g., T cells, B cells, natural killer (NK) cells, and dendritic cells, among others). Upon viral infection, tissue transplantation, or disease occurrence, dynamic and extensive interactions among these immune cell types occur in the human body. Hence, in the study of the human immune system, it is of great interest to understand the composition, differentiation, and activities of the various types of immune cells, and the interactions among them. The composition of these immune cells is also associated with cancer progression, adverse events, and response to cancer immunotherapy, especially immune checkpoint blockades including anti-PD1 and anti-CTLA4.
Along with the interest in this association, there is a movement to gather immune cellular information and clinical information together. In the immunology field, multiple types of assays are used to interrogate such immune cellular composition, including flow cytometry and single cell RNA-seq. In addition, multiple computational algorithms have also been proposed to estimate immune cellular composition by deconvolving bulk gene expression data, where popular algorithms include CIBERSORT (Newman
From the statistical point of view, such immune cellular data can be considered as
where
Maier (2014) asserted that it is often not straightforward to interpret the results from data analyses using log-ratio transformations and, in addition, these methods can often violate modeling assumptions such as homoscedasticity. As an alternative approach, Dirichlet regression was proposed, originally suggested as a null model for compositional data by Campbell and Mosimann (1987). Hijazi and Jernigan (2009) developed the maximum likelihood estimation methods for Dirichlet regression and also investigated the sampling distributions of the estimates. Camargo
These ongoing discussions are to determine optimal statistical strategies for compositional data analysis. Following the trend, in this paper, we aim to give a guideline for the statistical approaches of compositional data analysis in the context of immunology data: firstly, modeling using standard regression analysis with log-ratio transformations, and secondly, the approach using Dirichlet regression analysis.
This paper is structured as follows. In Section 2 we introduce the immune cellular fractions data for colorectal cancer, and the two compositional regression approaches, log-ratio regression and Dirichlet regression, which have been applied to this dataset. Section 3 gives the results of these alternative modeling approaches. Section 4 summarizes the key findings of this paper and comments on the similarities and differences of the two approaches.
In this paper, we focus on the analysis of the immune cellular fractions data of colorectal adenocarcinoma patients, generated from the Immune Landscape of Cancer project (Thorsson
In this section, we describe how to apply the log-ratio regression model to the colorectal cancer data. Since this approach involves computing logarithms of ratios, zero values need to be replaced, which can be done in several ways (Lubbe
In the context of colorectal cancer data, we are mainly interested in racial differences in immune cellular compositions. To investigate this relationship using log-ratio regression models, the immune cellular compositions composed of the
The key step in the log-ratio approach is to choose a set of log-ratio transformations, which convert all the compositions on the Aitchison simplex to multivariate vectors on interval scales in a regular Euclidean space. We can then proceed to apply standard statistical analyses to the log-ratio transformed data, with some care taken in the way the results are interpreted. The widely used sets of log-ratio transformations are the additive log-ratios (ALRs) (Aitchison, 1982), the centered log-ratios (CLRs) (Aitchison, 1982), and the isometric log-ratios (ILRs) (Egozcue
The ALR transformations are the easiest to understand and interpret, since they can be simply calculated by taking
where the (
The CLR transformations are defined as
where
Finally, the ILR transformations are the most complicated choice, both to define and to interpret. They are also defined as a linear transformation
where the (
When we use the ALR transformation, we need to decide which denominator part to choose. Since this choice will not affect our results, the fixed denominator can be chosen on substantive grounds to make the interpretation of the results more meaningful, or it can be chosen to optimize some favourable property. For example, Greenacre
As noted above, once the data have been log-ratio transformed, standard statistical analysis can be used. Thus, multiple regression models on the log-ratio transformed data can be written as follows, for the
where
Once the set of log-ratio regressions (
This result shows how the compositional response, on a log-scale, is affected by a unit change in the predictor, as estimated by the set of log-ratio regressions. In the case of the dummy variable predictor for race, where EA = 0 and AA = 1, the coefficients show the “effect” of AA compared to EA. When
When it comes to visualizing compositional data by a dimension-reduction method such as principal component analysis (PCA), then CLR-transformed data are used, since the PCA of the CLRs has been shown to be equivalent to the PCA of all pairwise log-ratios (Aitchison and Greenacre, 2002). The PCA of CLR-transformed data has thus been called
using either the variances
The Dirichlet distribution models the probability of a multinomial random variable, that is a random composition
where
The Dirichlet regression provides an alternative to log-ratio regression for the modeling of compositional data responses, and establishes the relationships between the parameters of the Dirichlet distribution and linear functions on the covariates. The regression model should be fitted for each
where
The unknown regression coefficients
There is no closed-form solution, hence the estimate must be computed numerically using a nonlinear optimization procedure. The invariance property of the maximum likelihood estimator (MLE) then yields the MLE
In this manuscript, we employed the Dirichlet regression for the analysis of the colorectal cancer immunology data, where the immune cellular composition is considered as the compositional outcome.
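As a rough illustration of the quantity that the numerical optimizer maximizes, the sketch below evaluates a Dirichlet regression log-likelihood. It assumes a log link, alpha_j = exp(x'beta_j), which is a common parametrization and is consistent with the exp(estimate) column of Table 4, but the exact model equation is not reproduced here. The function name `dirichlet_loglik`, the coefficient layout, and the toy data are hypothetical.

```python
import math

def dirichlet_loglik(compositions, covariates, betas):
    """Log-likelihood of a Dirichlet regression under an assumed log link.

    compositions: rows of D positive parts summing to 1
    covariates:   rows of p covariate values (include 1.0 for an intercept)
    betas:        D lists of p coefficients, one list per component (hypothetical layout)
    """
    total = 0.0
    for y, x in zip(compositions, covariates):
        # Assumed link: alpha_j = exp(x' beta_j), which keeps every alpha_j positive.
        alphas = [math.exp(sum(b * v for b, v in zip(beta_j, x))) for beta_j in betas]
        # Log of the Dirichlet normalizing constant.
        total += math.lgamma(sum(alphas)) - sum(math.lgamma(a) for a in alphas)
        # Log of the kernel: sum over parts of (alpha_j - 1) * log(y_j).
        total += sum((a - 1.0) * math.log(part) for a, part in zip(alphas, y))
    return total

# Toy data (not the study data): intercept plus a race dummy (0 = EA, 1 = AA).
y = [[0.5, 0.3, 0.2], [0.2, 0.2, 0.6]]
x = [[1.0, 0.0], [1.0, 1.0]]
flat = [[0.0, 0.0] for _ in range(3)]  # all-zero coefficients give alpha = (1, 1, 1)
print(dirichlet_loglik(y, x, flat))
```

With all coefficients at zero the model reduces to the uniform distribution on the simplex, which gives a convenient sanity check before handing the function to a general-purpose optimizer.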
For model diagnostics of the Dirichlet regression, two types of residuals can be considered, namely standardized residuals and composite residuals. For the
for
For the Dirichlet regression model, Gueorguieva
where
Another important diagnostic tool for the Dirichlet regression is , the homogeneity in each parameter
with
In general, if sample variances of certain observations are larger than what is expected, the estimates for
In an initial investigation of the variances of the log-ratios, it was found that the rarer components engendered high variances. Hence it was decided to use weighted analyses in the multivariate analyses where the transformed components of the cell types are combined, for example in the computation of total log-ratio variance as in
In order to select the denominator for the ALR transformation, Table 1 presents the Procrustes correlations, in descending order, of the ALR-transformed data using each cell type in turn as denominator.
The weights in this table refer to the average proportions in the whole data set, which are used to compute total log-ratio variance in
Before doing the log-ratio regression, it is useful to understand the multivariate structure of the data set. Figure 2 shows the weighted LRA on the left, that is, the weighted PCA of the CLR-transformed data, where the two racial groups are coded into the labels of the samples. This analysis, with 29.9 + 22.1 = 52.0% of the total log-ratio variance explained, shows that four immune cell types dominate: Macrophage, T.cells.CD8, T.cells.CD4 and Mast.cells. On the right a discriminant version of the structure is shown, where the first (horizontal) axis is specifically constrained to coincide with the difference in the two group means of EA and AA. From this latter plot, it can be deduced that log-ratios such as B.cells/Macrophage and T.cells.CD4/T.cells.CD8 are good discriminators of the two groups, or even the amalgamation log-ratios of (B.cells+T.cells.CD4)/(Macrophage+T.cells.CD8), called SLRs (summated log-ratios, see Greenacre
Each ALR-transformed log-ratio is regressed in turn on race, a dummy variable coding AA, and a continuous variable for age. The estimated regression coefficients for these two variables are listed in Table 2, along with their 95% Bootstrapped confidence intervals (10000 bootstrap replicates).
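The procedure just described can be sketched in a few lines. This is a simplified illustration, not the analysis code used for Table 2: it keeps only the race dummy (the model in the text also includes age), in which case the least-squares slope is simply the difference in group means of the log-ratio. The function names, the toy compositions, and the default of 2000 replicates are all assumptions made for the example.

```python
import math
import random

def alr(composition, denom_index):
    """Additive log-ratios of a composition against one fixed denominator part."""
    denom = composition[denom_index]
    return [math.log(part / denom)
            for j, part in enumerate(composition) if j != denom_index]

def race_effect(log_ratios, is_aa):
    """Multiplicative AA-vs-EA effect for one log-ratio with a single 0/1 predictor.

    With only a dummy predictor, the least-squares slope equals the difference
    in group means, so exp(slope) is a ratio of geometric means.
    """
    aa = [v for v, g in zip(log_ratios, is_aa) if g == 1]
    ea = [v for v, g in zip(log_ratios, is_aa) if g == 0]
    return math.exp(sum(aa) / len(aa) - sum(ea) / len(ea))

def bootstrap_ci(log_ratios, is_aa, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(log_ratios)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        groups = [is_aa[i] for i in idx]
        if 0 < sum(groups) < n:  # skip resamples that lose one group entirely
            stats.append(race_effect([log_ratios[i] for i in idx], groups))
    stats.sort()
    low = stats[round((1 - level) / 2 * n_boot)]
    high = stats[round((1 + level) / 2 * n_boot) - 1]
    return low, high

# Toy three-part compositions (B.cells, T.cells.CD4, Macrophage); not the study data.
toy = [[0.10, 0.30, 0.60], [0.12, 0.28, 0.60], [0.08, 0.32, 0.60],
       [0.20, 0.30, 0.50], [0.22, 0.28, 0.50], [0.18, 0.32, 0.50]]
is_aa = [0, 0, 0, 1, 1, 1]
b_over_macro = [alr(row, denom_index=2)[0] for row in toy]
estimate = race_effect(b_over_macro, is_aa)
low, high = bootstrap_ci(b_over_macro, is_aa)
print(f"B/Macro multiplicative effect: {estimate:.3f}, 95% CI ({low:.3f}, {high:.3f})")
```

Adding age as a second predictor would require solving the full least-squares problem inside each bootstrap replicate, but the resampling logic would stay the same.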
The coefficients show how much the ALR response is influenced multiplicatively by the explanatory variable. For example, for AA the ratio B.cells/Macrophage is estimated to be 66% higher than the same ratio for EA. The fact that the confidence interval does not include 1 means that this is a significant result, which has already been noted in Figure 3. Two other ALR ratios, NK.cells/Macrophage and Eosinophils/Macrophage, have significant coefficients, showing 39% and 49% increases in AA, respectively. The coefficients of age for all responses are close to 1, and their confidence intervals all include 1, indicating that there is no significant effect of age on the responses.
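The reading of the multiplicative coefficients can be made concrete with a short calculation. The sketch below takes a few race coefficients and bootstrap intervals as reported in Table 2 and converts them to percentage changes, flagging an effect as significant when the interval excludes 1. The helper names `percent_change` and `excludes_one` are hypothetical.

```python
# Race coefficients and 95% bootstrap intervals copied from Table 2.
table2_race = {
    "T.CD8/Macro": (1.0234, (0.7312, 1.3902)),
    "B/Macro": (1.6617, (1.1485, 2.3378)),
    "NK/Macro": (1.3929, (1.0139, 1.8682)),
    "Eosin/Macro": (1.4902, (1.0052, 2.1052)),
}

def percent_change(multiplicative_coef):
    """Percentage change in the ratio for AA relative to EA."""
    return (multiplicative_coef - 1.0) * 100.0

def excludes_one(interval):
    """True when a multiplicative interval lies entirely above or below 1."""
    low, high = interval
    return low > 1.0 or high < 1.0

for name, (coef, ci) in table2_race.items():
    print(f"{name}: {percent_change(coef):+.0f}% in AA vs EA, "
          f"interval excludes 1: {excludes_one(ci)}")
```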
The estimates for the ALR responses can be converted into log-contrast coefficients estimated for individual compositional components. Figure 4 shows these coefficients, also expressed as multiplicative effects, their bootstrap confidence intervals and
An alternative way to conclude that age is not a significant predictor of the compositional response is to conduct a MANOVA test for the models. Table 3 shows the results for the ALR model, with Pillai’s trace and the approximate F-statistics with their degrees of freedom. Pillai’s trace ranges from 0 to 1, and values closer to 1 indicate a stronger effect of the explanatory variable on the multivariate response (Pillai, 1955). The result in Table 3 shows that race has a significant effect on the immune cells, with Pillai’s trace of 0.0866 (
In regression analysis, to arrive at a final model, the nonsignificant predictor variable of age should be omitted. Furthermore, in this case where the composition is the multivariate response, the non-significant components of the response can also be eliminated, arriving at a parsimonious description of the relationship. The logratio regressions were thus repeated including only four immune cell types: T.cells.CD8, T.cells.CD4, B.cells and Macrophage, as a subcomposition. The results for the log-contrast coefficients are given in Figure 5, showing estimates, 95% confidence intervals and
Since the age covariate is insignificant, we only focus on a Dirichlet regression model with race.
Table 4 provides estimates of the
Figure 6 is a componentwise plot of the local influence measures against compositional values, which were generated based on the fitted Dirichlet models with race as an independent variable. We have already verified that race has no significant relevance with the immune cells in Table 4. Overall, the local influence measures tend to increase rapidly for values near zero and then increase more gradually as values increase. In spite of the varying curvatures among cell types, the smallest being for Macrophage, a common pattern is that the impact of individual observations on the estimation increases markedly as values approach zero. In Figure 6, we can also find two diverging curves for T.cells.CD8, T.cells.CD4, B.cells, NK.cells and Macrophage, corresponding to the two racial groups. Specifically, AA has larger effects on the estimates for Macrophage and T.cells.CD8 compared to EA, whereas the opposite direction is observed for T.cells.CD4, B.cells and NK.cells. On the other hand, the two curves are not visually separable for Dendritic.cells, Mast.cells, Neutrophils and Eosinophils.
Figure 7 illustrates the composite residual plots of the Dirichlet model, which show that there are some observations with large composite residuals over 40 in both racial groups, but the majority of the composite residuals are spread below 20.
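For readers who want to reproduce this kind of diagnostic, the sketch below computes residuals for a single observation. It assumes the usual Pearson-type definitions, in which each standardized residual is the observed part minus its Dirichlet mean divided by its Dirichlet standard deviation, and the composite residual is the sum of the squared standardized residuals. These definitions are an assumption here, since the formulas are not reproduced in this text, and the function names are hypothetical.

```python
import math

def standardized_residuals(y, alphas):
    """Pearson-type residuals of one composition given fitted Dirichlet parameters."""
    a0 = sum(alphas)  # precision: the sum of the Dirichlet parameters
    residuals = []
    for part, a in zip(y, alphas):
        mean = a / a0
        var = a * (a0 - a) / (a0 ** 2 * (a0 + 1.0))
        residuals.append((part - mean) / math.sqrt(var))
    return residuals

def composite_residual(y, alphas):
    """Assumed composite residual: sum of squared standardized residuals."""
    return sum(r * r for r in standardized_residuals(y, alphas))

# Toy observation and fitted parameters (not from the study).
print(standardized_residuals([0.5, 0.3, 0.2], [1.0, 1.0, 1.0]))
print(composite_residual([0.5, 0.3, 0.2], [1.0, 1.0, 1.0]))
```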
Figure 8 illustrates the componentwise plots of the overdispersion statistics of individual observations against compositional values, based on the Dirichlet models fitted with race. In these plots, the red marked points indicate the observation with the largest overdispersion statistic value in each cell type. Nonetheless, no significant overdispersion issue is detected in general.
With improved understanding of the interaction between the immune system and various diseases such as cancer, the field of immunology, which studies the human immune system, has gained significant attention. Investigation of immune cellular composition and its association with diseases constitutes the core of immunologic studies. However, in spite of their importance, optimal statistical strategies for this type of data still remain to be studied. In this paper, we reviewed statistical methods for compositional data analysis and applied the methods to colorectal cancer immune cellular fractions data.
As illustrated throughout the manuscript, it is critical to consider the unique aspects of compositional data in order to analyze immune cellular composition data efficiently and to obtain meaningful scientific insight. Ignoring these aspects can result in misleading conclusions based on inappropriate visualization and/or suboptimal selection of key variables that disregards the inter-relationships among the elements of compositional data. As solutions for these issues, we investigated the log-ratio and Dirichlet regression models in particular. Each approach has its own strengths. One of the key strengths of the log-ratio approach is that a wide range of existing and established statistical methods can be employed. The log-ratio approach involves choosing one of the available log-ratio transformations, which in the present application serve as multivariate responses in a regression model. Fortunately, for this purpose the final results in the form of log-contrast coefficients are invariant with respect to this choice, so we have used the simplest option, ALR. This choice has the favourable property that the individual regressions can be more easily interpreted. In contrast, the Dirichlet model handles compositional data more directly, without transformation and with a simpler interpretation, but the analysis is no longer subcompositionally coherent.
In the analysis of colorectal cancer immune cellular fractions data, we mainly focused on studying associations of immune cellular fractions with race, since age was not found to affect the responses significantly in either analysis. The log-ratio regression found that four cellular types were significantly associated with the two racial groups, whereas the Dirichlet regression found only two of those four types to be significant. From the log-ratio regression we can conclude that T.cells.CD4, B.cells, T.cells.CD8 and Macrophage can potentially be considered as key markers for racial difference, and that the ratio of the sum of the first two versus the sum of the last two can be used as a single summary of the distinction between the two groups.
We hope that this paper provides a gentle but thorough guideline for the statistical analysis of compositional data, especially those generated in immunology.
Correlations from the Procrustes analysis between the geometries of the different ALR transformations and the exact geometry of all pairwise logratios, using each cell type in turn as denominator
Cell type | Weight | Procrustes correlation |
---|---|---|
Macro | 0.453 | 0.989 |
T.CD8 | 0.142 | 0.920 |
T.CD4 | 0.165 | 0.909 |
Mast | 0.088 | 0.860 |
B | 0.072 | 0.851 |
NK | 0.051 | 0.829 |
Dendr | 0.017 | 0.689 |
Neutr | 0.008 | 0.628 |
Eosin | 0.003 | 0.569 |
Weight is the average proportion of all log-ratios in the weighted analysis.
Multiplicative coefficients for the ALR-transformed responses
Log-ratio | race(AA) | boot.CIrace | age | boot.CIage |
---|---|---|---|---|
T.CD8/Macro | 1.0234 | (0.7312, 1.3902) | 1.0019 | (0.9927, 1.0114) |
T.CD4/Macro | 1.6476 | (1.2230, 2.2024) | 1.0004 | (0.9889, 1.0138) |
B/Macro | 1.6617 | (1.1485, 2.3378) | 0.9922 | (0.9811, 1.0032) |
NK/Macro | 1.3929 | (1.0139, 1.8682) | 1.0032 | (0.9938, 1.0121) |
Dendr/Macro | 1.0676 | (0.6843, 1.5620) | 0.9976 | (0.9872, 1.0086) |
Mast/Macro | 1.3601 | (0.9220, 1.8937) | 0.9996 | (0.9899, 1.0093) |
Neutr/Macro | 1.1689 | (0.8237, 1.6187) | 0.9944 | (0.9834, 1.0055) |
Eosin/Macro | 1.4902 | (1.0052, 2.1052) | 0.9969 | (0.9874, 1.0067) |
Note that boot.CIrace and boot.CIage denote the 95% bootstrap confidence intervals for race and age.
MANOVA table for the model using ALR-transformed immune cells as multivariate responses
Term | Pillai | (df1, df2) | approx.F | p-value |
---|---|---|---|---|
Race(AA) | 0.0866 | (8, 244) | 2.8928 | 0.0043 |
Age | 0.021 | (8, 244) | 0.6528 | 0.7327 |
Pillai is Pillai’s trace, approx.F is the F-statistic approximation to Pillai’s trace, and (df1, df2) are the degrees of freedom of the approximate F-statistic.
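The relation between Pillai’s trace and the approximate F-statistic in this table can be checked by hand. The sketch below uses the standard conversion for a term with one hypothesis degree of freedom (a single dummy variable or a single continuous covariate), F = V / (1 − V) × df2 / df1. Treating both terms in the table as single-degree-of-freedom terms is an assumption based on the reported (8, 244) degrees of freedom, and the function name is hypothetical.

```python
def pillai_to_f(pillai, df1, df2):
    """Approximate F for Pillai's trace when the tested term has one degree of freedom."""
    return (pillai / (1.0 - pillai)) * (df2 / df1)

# Pillai's trace values and degrees of freedom as reported in the MANOVA table.
for term, pillai in [("Race(AA)", 0.0866), ("Age", 0.021)]:
    print(f"{term}: F is approximately {pillai_to_f(pillai, 8, 244):.3f}")
```

The small differences from the tabulated F values come from the rounding of Pillai’s trace in the table.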
Dirichlet regression outputs with race as an independent variable
Cell type | Estimate | exp(estimate) | s.e. | p-value |
---|---|---|---|---|
T.CD8 | −0.1946 | 0.8232 | 0.1098 | 0.0762 |
T.CD4 | 0.2127 | 1.2371 | 0.1032 | 0.0392 |
B | 0.2051 | 1.2277 | 0.1175 | 0.0809 |
NK | 0.0440 | 1.0450 | 0.1217 | 0.7180 |
Macro | −0.2290 | 0.7953 | 0.0938 | 0.0147 |
Dendr | −0.0924 | 0.9117 | 0.1345 | 0.4920 |
Mast | 0.0371 | 1.0378 | 0.1158 | 0.7490 |
Neutr | −0.0377 | 0.9630 | 0.1353 | 0.7810 |
Eosin | 0.0660 | 1.0682 | 0.1370 | 0.6300 |
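The columns of this table are linked by simple calculations, sketched below. The exp(estimate) column is the exponentiated coefficient, and the values in the last column appear consistent with two-sided Wald tests based on the standard normal distribution. That reading of the last column is an assumption rather than something stated in the text, and the function name `wald_p_value` is hypothetical.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided p-value from a standard normal Wald statistic."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal CDF Phi.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Estimates and standard errors copied from the table for three cell types.
rows = {"T.CD8": (-0.1946, 0.1098), "T.CD4": (0.2127, 0.1032), "Macro": (-0.2290, 0.0938)}
for cell, (est, se) in rows.items():
    print(f"{cell}: exp(estimate) = {math.exp(est):.4f}, p = {wald_p_value(est, se):.4f}")
```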