A common interest in gene expression data analysis is to identify genes that present significant changes in expression levels among biological experimental conditions. In this paper, we develop a Bayesian approach to make a gene-by-gene comparison in the case with a control and more than one treatment experimental condition. The proposed approach is within a Bayesian framework with a Dirichlet process prior. The comparison procedure is based on a model selection procedure developed using the discreteness of the Dirichlet process and its representation via Polya urn scheme. The posterior probabilities for models considered are calculated using a Gibbs sampling algorithm. A numerical simulation study is conducted to understand and compare the performance of the proposed method in relation to usual methods based on analysis of variance (ANOVA) followed by a Tukey test. The comparison among methods is made in terms of a true positive rate and false discovery rate. We find that proposed method outperforms the other methods based on ANOVA followed by a Tukey test. We also apply the methodologies to a publicly available data set on Plasmodium falciparum protein.
DNA array technology is capable of providing simultaneous gene expression level measurements for thousands of genes under different biological experimental conditions. Once the expression levels have been obtained one of the objectives is to identify genes that present significant changes in the expression levels among experimental conditions. The identification of these genes is important because it may bring to reveal new biological discoveries such as which genes are involved in the origin and/or evolution of some genetic disease or which genes react to a drug stimulus. For further discussion and additional references on DNA array technology, see DeRisi
According to Baldi and Long (2001) the first level of gene expression data analysis is the identification of the genes with expression levels different in a treatment condition in relation to a control condition. For this case, the usual methods used to identify differentially expressed genes are based on
In this paper, we consider gene expression data analysis from experimental conditions with a control and more than one treatment. We model each one of the hypothesis of equality or inequality among experimental conditions by a model. In this way our interest is to search for a model which best fits the data and meets conditions of inequality among the experimental conditions. We use a hierarchical Bayesian approach in order to select one of the models considered. This approach uses the Dirichlet process prior and its representation via Polya urn scheme (Blackwell and MacQueen, 1973). The advantage of using the Dirichlet process prior is its discreteness that allows the parameters to be coincident with positive probability. To calculate the posterior probability for each model we implement a Gibbs sampling (GS) algorithm considering the Polya urn scheme written through latent variables.
We develop a simulation study to verify the performance of the proposed method and compare it with the usual method based on ANOVA followed by Tukey test. The simulation study was implemented considering the cases with a control and two- and three-treatment experimental conditions. The ANOVA is applied to identify genes which show a significant difference among experimental conditions. ANOVA does not identify which experimental conditions show the difference; therefore, we also apply the Tukey test as a post-hoc test to identify which experimental conditions show significant difference, see for example Pavlidis (2003), Parkitna
The simulation results show a better performance of the proposed method. We also apply both methods to a real data set downloaded from http://cybert.ics.uci.edu/anova that concerns a proteomics experiment (Baldi and Long, 2001).
The pioneering paper using the Dirichlet process prior for multiple-comparison was by Gopalan and Berry (1998). The paper considers a hierarchical Bayesian approach with the Dirichlet process prior and develop a GS algorithm to make a comparison among various hypotheses. The posterior probability for the hypotheses are estimated using the Rao-Blackwellization method proposed by Gelfand and Smith (1990). Guindani
The remainder of the paper is structured as follows. In Section 2, we describe the Bayesian model for gene expression data analysis and the Polya urn scheme using latent variables. In Section 3, we compare the performance of the proposed method with the usual approach. In Section 4, both methods are applied to a real dataset. Section 5 concludes the paper with final remarks.
Consider a DNA array experiment with
We omit the index
As is usual in gene expression data analysis, consider that the logarithm of the observed gene expression levels in control and treatment conditions are generated from normal distributions with mean
Denote parameters of the experimental condition
where
Our interest is to verify whether a gene
successively for all combinations of inequality 2 to 2, 3 to 3, …, until the last one model
In this way our interest is to search for a model which best fits the data and meets conditions defined by models
Thus, consider the following hierarchical Bayesian model with the Dirichlet process prior
for
In order to complete the specification of the model (
where
Using the result of Blackwell and MacQueen (1973), we have that integrating
where = 1 if
Note that, at each step of the sample procedure defined by (
Let
Assume that clusters are numbered consecutively as they arise; therefore, the sampling procedure defined by (
Set
for
for
(ii-a) Generate
(ii-b) If
(ii-c) If
Given
Note from this procedure, that
As a sample from a Dirichlet process is exchangeable (Antoniak, 1974; Jain and Neal, 2004; MacEachern, 2016), so from (
where
Using the Bayes rule, the conditional posterior probabilities for latent indicator variable are given by
where
in which,
Given a configuration
and
where
In order to estimate the posterior probabilities for models
For
Calculate
Generate
Conditional on
Let
We discard the first
for
In this section we verify performance of the proposed method called Polya urn within Gibbs sampling (PUGS) and compare it with a standard method based on ANOVA followed by Tukey test (ATUK) using simulated data sets and a real data set.
In order to establish the hyperparameters values for the PUGS we consider the following procedure. Let (
Consider a DNA experiment with a control and two experimental conditions. The five possible models written in terms of latent variables are:
In order to generate the data sets, we fix control parameters as
- (
- (
- (
- (
- (
for
The generation of a simulated data set is given by the following steps. For
If
If 0.70 <
If 0.80 <
If 0.90 <
If
Generate
Generated the data set, we apply PUGS and ATUK to identify the cases with a difference. We apply the PUGS fixing
To record the configuration obtained by the PUGS and the ATUK, we consider the index vector
where
where = 1 if case is generate according to configuration
Moreover, for each pair (
where TPR^{(}
Figure 1 show the
As in application presented in the next Section no case was identified under model
Now consider the proteomics data set mentioned in the introduction. This dataset was extracted from the website cybert.ics.uci.edu/anova/. The data set is composed by
For application of the PUGS we consider the same number of iteration, burn-in and hyperparameters values used in the simulation section. Table 1 shows the number of cases identified under each model by method. The last column of this table shows the number of cases in which model
ATUK identify 140 case with evidence for difference while PUGS identify 127. Out of case identified with difference, 92 were identified by both methods. No cases were identified under
Tables 2 and 3 show the ten most evident cases identified by PUGS and ATUK, respectively. These tables show the number of the protein in the dataset, the sample mean and the standard deviation (SD) for control and two treatments, the configuration identified by PUGS and ATUK, the posterior probability obtained by PUGS and the
The 10 most evident cases identified by ATUK were also identified by PUGS. Out of the 10 most evident cases identified with difference by PUGS, two were not identified by ATUK. These both cases are highlighted with * in Table 2. Note that these both cases have higher differences in control SD (highlighted in bold) in relation to treatment SDs. As in the simulation study, this result indicates that PUGS has ability to identify cases not identified by the usual method ATUK if the difference is in variances.
In this paper, we develop a gene-by-gene multiple comparison analysis using a semi-parametric Bayesian model with prior given by a Dirichlet process. The comparison among experimental conditions were made using the discreteness of the Dirichlet process within a model selection framework. The posterior probability for models were calculated through a GS algorithm that was implemented using the Polya urn scheme written in terms of latent variables to indicate equality or inequality among the experimental conditions.
The performance of the PUGS as well as its comparison with the ATUK was verified on artificial data sets and on a real dataset. Results show a better performance of PUGS for cases with differences in variances. Two examples of the better performance of the PUGS are cases 557 and 526 presented in Table 2. These two cases present a clear difference in standard deviation among control and treatments conditions, but are not identified as a case with evidence for difference among control and treatment conditions by ATUK. However, PUGS consider these both cases as being from a model that considers differences between the control and the two-treatment conditions.
From the biological point of view the results shows that PUGS may illustrate cases not identified when the usual method ATUK is considered. From the statistical point of view the proposed method may be viewed as an effective Bayesian alternative to solve problems of multiple comparison. PUGS can also be easily implemented in usual software such as the R software (The Comprehensive R Archive Network, http://cran.r-project.org). The code used for computing is in the R language and can be obtained by e-mail from the first author.
Here we provide some additional results from simulation study for PUGS and ATUK.
The author Erlandson F. Saraiva acknowledges the Brazilian institution CNPq.
Model configurations
Number of cases identified by method
Method | Model | Total of cases with difference | ||||
---|---|---|---|---|---|---|
PUGS | 961 | 21 | 89 | 17 | 0 | 127 |
ATUK | 948 | 46 | 27 | 67 | 0 | 140 |
PUGS = Polya urn within Gibbs sampling; ATUK = analysis of variance followed by Tukey test.
Ten most evident cases identified by PUGS, where
Number | Sample mean | Sample standard deviation | Configuration | Posterior probability | ||||||
---|---|---|---|---|---|---|---|---|---|---|
PUGS | ATUK | |||||||||
557^{*} | 9.0897 | 8.7857 | 0.8938 | 0.0897 | ||||||
690 | 8.0427 | 9.5514 | 9.3821 | 0.5140 | 0.3500 | 0.3832 | 0.8854 | 0.0002 | ||
526^{*} | 8.8327 | 8.1400 | 8.7649 | 0.8682 | 0.9076 | |||||
666 | 8.2380 | 9.7643 | 8.5284 | 0.4457 | 0.4276 | 0.4502 | 0.8331 | 0.0003 | ||
1069 | 7.3039 | 8.8703 | 7.3194 | 0.6188 | 1.0238 | 0.3377 | 0.8248 | 0.0065 | ||
1024 | 8.3471 | 9.6321 | 8.3603 | 0.4866 | 0.5582 | 0.4792 | 0.8105 | 0.0023 | ||
936 | 7.5842 | 8.8570 | 7.8247 | 0.3111 | 0.7440 | 0.3207 | 0.7926 | 0.0039 | ||
932 | 8.1317 | 9.7893 | 8.4973 | 0.5588 | 0.3725 | 0.5818 | 0.7925 | 0.0006 | ||
730 | 8.3837 | 9.6405 | 8.1156 | 0.5543 | 0.3864 | 0.6852 | 0.7637 | 0.0021 | ||
1012 | 8.0056 | 9.1393 | 8.0072 | 0.4247 | 0.4005 | 0.4868 | 0.7633 | 0.0019 |
PUGS = Polya urn within Gibbs sampling; ATUK = analysis of variance followed by Tukey test.
Ten most evident cases identified by ATUK, where
Number | Sample mean | Sample standard deviation | Configuration | Posterior probability | ||||||
---|---|---|---|---|---|---|---|---|---|---|
PUGS | ATUK | |||||||||
690 | 11.6032 | 13.7804 | 13.5362 | 0.7423 | 0.5050 | 0.5537 | 0.8854 | <0.001 | ||
666 | 11.8855 | 14.0871 | 12.3043 | 0.6432 | 0.6170 | 0.6496 | 0.8331 | <0.001 | ||
932 | 11.7326 | 14.1231 | 12.2591 | 0.8065 | 0.5371 | 0.8393 | 0.7925 | 0.001 | ||
60 | 12.6841 | 13.7752 | 11.6590 | 0.8529 | 0.6342 | 0.3794 | 0.3671 | 0.001 | ||
649 | 12.1523 | 12.9857 | 11.1680 | 0.6238 | 0.3427 | 0.7030 | 0.4103 | 0.001 | ||
1012 | 11.5504 | 13.1859 | 11.5521 | 0.6131 | 0.5783 | 0.7021 | 0.7633 | 0.002 | ||
730 | 12.0950 | 13.9087 | 11.7082 | 0.8001 | 0.5572 | 0.9892 | 0.7637 | 0.002 | ||
1024 | 12.0420 | 13.8963 | 12.0614 | 0.7023 | 0.8051 | 0.6912 | 0.8105 | 0.002 | ||
1020 | 11.7981 | 13.2402 | 10.8990 | 1.2130 | 0.4921 | 0.6438 | 0.5424 | 0.003 | ||
936 | 10.9425 | 12.7781 | 11.2891 | 0.4494 | 1.0730 | 0.4630 | 0.7926 | 0.004 |
PUGS = Polya urn within Gibbs sampling; ATUK = analysis of variance followed by Tukey test.