Identifying differentially expressed genes using the Polya urn scheme

Erlandson Ferreira Saraiva1,a, Adriano Kamimura Suzukib, and Luís Aparecido Milanc

aInstitute of Mathematics, Federal University of Mato Grosso do Sul, Brazil, bDepartment of Applied Mathematics and Statistics, University of São Paulo, Brazil, cDepartment of Statistics, Federal University of São Carlos, Brazil
Correspondence to: Erlandson Ferreira Saraiva, Institute of Mathematics, Federal University of Mato Grosso do Sul, Av. Costa e Silva, Brazil. E-mail: erlandson.saraiva@ufms.br
Received June 30, 2017; Revised August 31, 2017; Accepted September 13, 2017.
Abstract

A common interest in gene expression data analysis is to identify genes that present significant changes in expression levels among biological experimental conditions. In this paper, we develop a Bayesian approach to make a gene-by-gene comparison in the case with a control and more than one treatment experimental condition. The proposed approach is set within a Bayesian framework with a Dirichlet process prior. The comparison procedure is based on a model selection procedure developed using the discreteness of the Dirichlet process and its representation via the Polya urn scheme. The posterior probabilities for the models considered are calculated using a Gibbs sampling algorithm. A numerical simulation study is conducted to understand and compare the performance of the proposed method in relation to the usual methods based on analysis of variance (ANOVA) followed by a Tukey test. The comparison among methods is made in terms of the true positive rate and the false discovery rate. We find that the proposed method outperforms the methods based on ANOVA followed by a Tukey test. We also apply the methodologies to a publicly available data set on Plasmodium falciparum protein.

Keywords: gene expression, Bayesian approach, Dirichlet process prior, Polya urn scheme, Gibbs sampling
1. Introduction

DNA array technology is capable of providing simultaneous gene expression level measurements for thousands of genes under different biological experimental conditions. Once the expression levels have been obtained, one of the objectives is to identify genes that present significant changes in expression levels among experimental conditions. The identification of these genes is important because it may lead to new biological discoveries, such as which genes are involved in the origin and/or evolution of some genetic disease or which genes react to a drug stimulus. For further discussion and additional references on DNA array technology, see DeRisi et al. (1997), Arfin et al. (2000), Wu (2001), Hatfield et al. (2003), and their references.

According to Baldi and Long (2001), the first level of gene expression data analysis is the identification of genes whose expression levels differ in a treatment condition in relation to a control condition. For this case, the usual methods used to identify differentially expressed genes are based on t-tests, such as the usual t-test for unequal variances, the Cyber-t (CT) proposed by Baldi and Long (2001), and the Bayesian t-test (BTT) proposed by Fox and Dimmic (2006). CT and BTT modify the standard error estimate of the two-sample differences found in the denominator of the standard t-statistic. Under the Bayesian approach, Oh and Yang (2006) developed a two-sample comparison considering a multiple testing scenario. The authors assume a conditional prior distribution for each pair of means being compared and propose an importance sampling method to estimate the parameters of interest from the posterior distribution. However, in situations with a control and more than one treatment, it remains common to apply analysis of variance (ANOVA) followed by a Tukey test to identify which treatment caused the difference; see, for example, Pavlidis (2003), Parkitna et al. (2006), and Goeman and Bühlmann (2007).

In this paper, we consider gene expression data analysis from experimental conditions with a control and more than one treatment. We represent each hypothesis of equality or inequality among experimental conditions by a model. In this way, our interest is to search for the model which best fits the data and thereby determine the pattern of equalities and inequalities among the experimental conditions. We use a hierarchical Bayesian approach in order to select one of the models considered. This approach uses the Dirichlet process prior and its representation via the Polya urn scheme (Blackwell and MacQueen, 1973). The advantage of using the Dirichlet process prior is its discreteness, which allows the parameters to be coincident with positive probability. To calculate the posterior probability for each model we implement a Gibbs sampling (GS) algorithm considering the Polya urn scheme written in terms of latent variables.

We develop a simulation study to verify the performance of the proposed method and compare it with the usual method based on ANOVA followed by a Tukey test. The simulation study was implemented considering the cases with a control and two- and three-treatment experimental conditions. The ANOVA is applied to identify genes which show a significant difference among experimental conditions. Since ANOVA does not identify which experimental conditions show the difference, we also apply the Tukey test as a post-hoc test to identify them; see, for example, Pavlidis (2003), Parkitna et al. (2006), Goeman and Bühlmann (2007), and Zollanvari et al. (2009). As comparison criteria between methods we consider the true positive rate (TPR) and the false discovery rate (FDR).

The simulation results show a better performance of the proposed method. We also apply both methods to a real data set downloaded from http://cybert.ics.uci.edu/anova that concerns a proteomics experiment (Baldi and Long, 2001).

The pioneering paper using the Dirichlet process prior for multiple comparison was by Gopalan and Berry (1998). That paper considers a hierarchical Bayesian approach with the Dirichlet process prior and develops a GS algorithm to compare various hypotheses. The posterior probabilities for the hypotheses are estimated using the Rao-Blackwellization method proposed by Gelfand and Smith (1990). More recently, Guindani et al. (2009) developed a semi-parametric Bayesian model with the Dirichlet process prior for multiple-comparison problems. Guindani et al. (2009) consider a loss function based on true positive and false positive counts to propose a decision rule based on a threshold of the posterior probability. Zou et al. (2010) consider a two-sample comparison and model the t-statistics using a hierarchical Bayesian approach with a Dirichlet process prior on the non-centrality parameter. Estimates for the parameters of interest and the false discovery rate are obtained via a GS algorithm. The distinction between the previous approaches and ours is that here the Dirichlet process is used jointly with a model selection procedure to identify the cases differentially expressed. The discreteness of the Dirichlet process is used to identify equality (or not) among parameters of the models considered. In this way, the procedure allows the search for a model that best fits the data and the identification of cases with differences in mean and/or variance.

The remainder of the paper is structured as follows. In Section 2, we describe the Bayesian model for gene expression data analysis and the Polya urn scheme using latent variables. In Section 3, we compare the performance of the proposed method with the usual approach. In Section 4, both methods are applied to a real dataset. Section 5 concludes the paper with final remarks.

2. Hierarchical Bayesian model

Consider a DNA array experiment with N genes performed under experimental conditions E1, …, ET, where E1 represents the control experimental condition, E2 represents the first treatment experimental condition, and so on up to the last treatment experimental condition ET. Suppose that each experimental condition is replicated n times. Denote by yigt the ith observed expression level for gene g in experimental condition t ∈ {1, …, T}, for i = 1, …, n and g = 1, …, N.

We omit the index g in what follows in order to simplify the notation. Thus, let Y = {Y1, …, YT} be the set of all observed expression levels for gene g in the T experimental conditions, where Yt = (y1t, y2t, …, ynt)′ is an n × 1 vector of conditionally independent observations for gene g in condition t, for g = 1, …, N and t = 1, …, T.

As is usual in gene expression data analysis, consider that the logarithms of the observed gene expression levels in the control and treatment conditions are generated from normal distributions with mean $\mu_t$ and variance $\sigma_t^2$, i.e., $Y_{it} \sim N(\mu_t, \sigma_t^2)$, for i = 1, …, n and t = 1, …, T. See, for example, Baldi and Long (2001), Fox and Dimmic (2006), Kim et al. (2013), Saraiva and Milan (2012), Louzada et al. (2014), and Oh (2015), among others.

Denote the parameters of experimental condition t by $\theta_t = (\mu_t, \sigma_t^2)$ and let $\Theta = \{\theta = (\theta_1, \ldots, \theta_T);\ \theta_t \in \mathbb{R} \times \mathbb{R}^+\}$ be the parameter space, for t = 1, …, T. The likelihood function for θ given y is given by

$L(\theta \mid \mathbf{y}) = \prod_{t=1}^{T} \prod_{i=1}^{n} f(y_{it} \mid \theta_t) \propto \prod_{t=1}^{T} \left(\sigma_t^2\right)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2\sigma_t^2} \sum_{i=1}^{n} \left(y_{it} - \mu_t\right)^2\right\},$

where $f(y_{it} \mid \theta_t)$ is the probability density function of the normal distribution with parameters $\theta_t = (\mu_t, \sigma_t^2)$, for t = 1, …, T.
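As a quick illustration, the log of this likelihood can be evaluated directly from the data matrix. The sketch below is ours (plain Python with NumPy; the function name and array layout are not from the paper):

```python
import numpy as np

def log_likelihood(y, mu, sigma2):
    """Log of L(theta | y) for one gene: y is a (T, n) array of log expression
    levels, mu and sigma2 are length-T arrays of condition means and variances."""
    y = np.asarray(y, dtype=float)
    n = y.shape[1]
    # within-condition sums of squares: sum_i (y_it - mu_t)^2
    ss = ((y - mu[:, None]) ** 2).sum(axis=1)
    # sum over t of the normal log-density terms
    return float(np.sum(-0.5 * n * np.log(2.0 * np.pi * sigma2) - ss / (2.0 * sigma2)))
```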

Our interest is to verify whether gene g is differentially expressed among the different experimental conditions, i.e., whether $\theta_t = \theta_j$ or $\theta_t \neq \theta_j$, for all t, j = 1, …, T with t ≠ j. This equality or inequality between $\theta_t$ and $\theta_j$ can be represented by the following models:

$M_0: \theta_1 = \theta_2 = \cdots = \theta_T,$
$M_1: \theta_1 \neq \theta_2,\ \theta_1 = \theta_3 = \cdots = \theta_T,$
$M_2: \theta_1 \neq \theta_3,\ \theta_1 = \theta_2 = \theta_4 = \cdots = \theta_T,$

and so on, through all combinations of inequalities taken two at a time, three at a time, …, up to the last model

$M_Q: \theta_1 \neq \theta_2 \neq \cdots \neq \theta_T.$

In this way, our interest is to search for the model which best fits the data among the conditions defined by the models Mq, q = 0, 1, …, Q. For each model Mq, the equality or inequality among the θt’s determines a particular partition of the parameter space Θ, for t ∈ {1, …, T}. This allows us to use a hierarchical Bayesian approach with the Dirichlet process prior on θ = (θ1, …, θT) in order to make simultaneous comparisons among the θt’s. This is possible due to the discreteness of the Dirichlet process, which allows parameters from distinct experimental conditions to be coincident with positive probability. For more details on the discreteness of the Dirichlet process see Gopalan and Berry (1998), Neal (1998), and references therein.

Thus, consider the following hierarchical Bayesian model with the Dirichlet process prior

$Y_t \mid \theta_t = (\mu_t, \sigma_t^2) \sim N(\mu_t, \sigma_t^2), \quad \theta_t \mid G \sim G, \quad G \mid \alpha, G_0 \sim DP(\alpha G_0), \qquad (2.1)$

for t = 1, …, T, where G is an unknown distribution function with a prior distribution given by a Dirichlet process with concentration parameter α and baseline distribution G0 (Antoniak, 1974; Ferguson, 1973). Model (2.1) is known in the literature as the Dirichlet process mixture model (DPMM).

In order to complete the specification of model (2.1), we fix α = 1, as done by Escobar and West (1995), Medvedovic and Sivaganesan (2002), and Jain and Neal (2004). Besides, in order to exploit the full conjugacy of the model, we set up G0 as

$G_0 \equiv N\left(\mu_0, \frac{\sigma_t^2}{\lambda}\right) IG\left(\frac{\tau}{2}, \frac{\beta}{2}\right),$

where μ0, λ, τ, and β are known hyperparameters and IG(·) represents the inverse gamma distribution in a parametrization such that the expected value is β/(τ − 2). The choice of hyperparameter values will generally depend upon the application at hand; for the moment, we leave them unspecified.
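For illustration, a draw from this base distribution can be generated by first sampling $\sigma^2$ from the inverse gamma and then μ given $\sigma^2$. The sketch below is ours; it uses the standard fact that if X ~ Gamma(a, rate b) then 1/X ~ IG(a, b):

```python
import numpy as np

def sample_G0(mu0, lam, tau, beta, rng):
    """One draw (mu, sigma2) from G0 = N(mu0, sigma2/lam) x IG(tau/2, beta/2),
    under the parametrization with E[sigma2] = beta/(tau - 2)."""
    # sigma2 ~ IG(tau/2, beta/2): draw Gamma(tau/2, rate beta/2) and invert
    sigma2 = 1.0 / rng.gamma(shape=tau / 2.0, scale=2.0 / beta)
    mu = rng.normal(mu0, np.sqrt(sigma2 / lam))
    return mu, sigma2
```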

Using the result of Blackwell and MacQueen (1973), integrating G over its prior distribution in model (2.1), θ follows the Polya urn scheme and can be written as

$\theta_1 \sim G_0, \qquad (2.2)$
$\theta_t \mid \theta_1, \ldots, \theta_{t-1} \sim \frac{\alpha}{\alpha + t - 1} G_0 + \frac{1}{\alpha + t - 1} \sum_{j=1}^{t-1} I_{\theta_t}(\theta_j), \qquad (2.3)$

where $I_{\theta_t}(\theta_j) = 1$ if $\theta_t = \theta_j$ and $I_{\theta_t}(\theta_j) = 0$ otherwise, for t = 1, …, T.

Note that, at each step of the sampling procedure defined by (2.3), θt may assume a new value generated from the baseline distribution G0 with probability α/(α + t − 1), or may assume the value of one of the previous θj’s with probability $\{1/(\alpha + t - 1)\} \sum_{j=1}^{t-1} I_{\theta_t}(\theta_j)$. Thus, a sample from the joint distribution of θ1, …, θT yields k groups (1 ≤ k ≤ T) of θt’s with distinct values φ1, …, φk generated from the baseline distribution G0. Using this fact, we propose an MCMC procedure to estimate the posterior probabilities for the models Mq through the Polya urn scheme, given by expressions (2.2) and (2.3), written in terms of latent indicator variables, for q = 1, …, Q.
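The sequential draw in (2.2)–(2.3) can be simulated directly; the sketch below (ours, a prior simulation only, not the paper's posterior sampler) accepts any routine `sample_G0` that draws from the base measure:

```python
import numpy as np

def polya_urn(T, alpha, sample_G0, rng):
    """Simulate theta_1,...,theta_T from the Polya urn: theta_t is a fresh draw
    from G0 with probability alpha/(alpha + t - 1), otherwise it copies a
    uniformly chosen earlier theta_j, producing ties with positive probability."""
    theta = [sample_G0(rng)]
    for t in range(2, T + 1):
        if rng.random() < alpha / (alpha + t - 1):
            theta.append(sample_G0(rng))                 # new value from G0
        else:
            theta.append(theta[rng.integers(t - 1)])     # copy a previous theta_j
    return theta
```

With a small α most θt’s coincide (few clusters), while a large α makes nearly all of them distinct, mirroring the role of the concentration parameter.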

2.1. Polya urn scheme via latent variables

Let c = (c1, …, cT) be a vector of latent indicator variables such that ct = j if θt = φj, for t = 1, …, T and j = 1, …, k. Let Dj = {yt; ct = j} be the cluster (set) of observations with identical configuration indicators ct, where D1, …, Dk are paired with φ1, …, φk, respectively, for t = 1, …, T.

Assume that clusters are numbered consecutively as they arise; therefore, the sampling procedure defined by (2.2) and (2.3) can be replicated by the following procedure:

• Set c1 = 1, D1 = {y1} and generate φ1 ~ G0;

• For t = 2, …, T, calculate the probabilities
$P(c_t = j \mid c_1, \ldots, c_{t-1}) = \frac{n_j}{\alpha + t - 1} \quad \text{and} \quad P(c_t = j^* \mid c_1, \ldots, c_{t-1}) = \frac{\alpha}{\alpha + t - 1},$

for j = 1, …, k(t) and j* = k(t) + 1, where k(t) is the number of distinct values in c(t−1) = (c1, …, ct−1) and nj is the number of indicators ct′ = j in c(t−1), for t′ = 1, …, t − 1 and j = 1, …, k(t).

• (ii-a) Generate Zt ~ Multinom(1, Pt), where Pt = (P(ct = 1 | ·), …, P(ct = k(t) | ·), P(ct = j* | ·)) and Multinom(1, Pt) is the multinomial distribution with k(t) + 1 categories and a single observation;

• (ii-b) If Zt,j = 1, for some j ∈ {1, …, k(t)}, then set ct = j, Dj = Dj ∪ {yt}, and nj = nj + 1;

• (ii-c) If Zt,j* = 1, then set ct = j*, Dj* = {yt}, nj* = 1, and generate φj* ~ G0.

• Given c, set θt = φj for all ct = j, t = 1, …, T and j = 1, …, k.

Note from this procedure that D1 is the set composed of the treatment conditions that show no difference in relation to the control experimental condition.

Since a sample from a Dirichlet process is exchangeable (Antoniak, 1974; Jain and Neal, 2004; MacEachern, 2016), from (2.4) we have the conditional probabilities

$P(c_t = j \mid \mathbf{c}_{-t}) = \frac{n_{j,-t}}{\alpha + T - 1} \quad \text{and} \quad P(c_t = j^* \mid \mathbf{c}_{-t}) = \frac{\alpha}{\alpha + T - 1},$

where $\mathbf{c}_{-t} = (c_1, \ldots, c_{t-1}, c_{t+1}, \ldots, c_T)$ and $n_{j,-t}$ is the number of indicators $c_{t'} = j$ in $\mathbf{c}_{-t}$, for j = 1, …, k(t). Here, k(t) is the number of distinct values in the configuration $\mathbf{c}_{-t}$. The φj’s remain drawn independently from the baseline distribution G0.

Using the Bayes rule, the conditional posterior probabilities for the latent indicator variables are given by

$P(c_t = j \mid \mathbf{c}_{-t}, \mathbf{y}_t, \varphi_j) = b_t \frac{n_{j,-t}}{\alpha + T - 1} f(\mathbf{y}_t \mid \varphi_j) \quad \text{and} \quad P(c_t = j^* \mid \mathbf{c}_{-t}, \mathbf{y}_t) = b_t \frac{\alpha}{\alpha + T - 1} q(\mathbf{y}_t), \qquad (2.5)$

where bt is the appropriate normalizing constant making these probabilities sum to one, $f(\mathbf{y}_t \mid \varphi_j)$ is the joint density of $\mathbf{y}_t$ given φj, and

$q(\mathbf{y}_t) = \int f(\mathbf{y}_t \mid \varphi_{j^*}) \, \pi_{G_0}(\varphi_{j^*}) \, d\varphi_{j^*} = \beta^* \lambda^* \Gamma^* \left[1 + \frac{\sum_{y_m \in D_j} y_m^2 + \lambda \mu_0^2}{\beta} - \frac{\left(\sum_{y_m \in D_j} y_m + \lambda \mu_0\right)^2}{\beta (n_j + \lambda)}\right]^{-\tau^*},$

in which,

$\beta^* = \left(\frac{1}{\beta \pi}\right)^{\frac{n_j}{2}}, \quad \lambda^* = \left(\frac{\lambda}{n_j + \lambda}\right)^{\frac{1}{2}}, \quad \Gamma^* = \frac{\Gamma\left(\frac{\tau + n_j}{2}\right)}{\Gamma\left(\frac{\tau}{2}\right)}, \quad \text{and} \quad \tau^* = \frac{\tau + n_j}{2}.$

Given a configuration c, the full conditional distributions for μj and $\sigma_j^2$ are, respectively,

$\mu_j \mid \sigma_j^2, \lambda, \mathbf{y} \sim N\left(\frac{n_j}{n_j + \lambda} \bar{y}_j + \frac{\lambda}{n_j + \lambda} \mu_0, \frac{\sigma_j^2}{n_j + \lambda}\right) \qquad (2.6)$

and

$\sigma_j^2 \mid \tau, \beta, \mathbf{y} \sim IG\left(\frac{\tau + n_j}{2}, \frac{1}{2}\left[\beta + (n_j - 1) s_j^2 + \frac{n_j \lambda (\bar{y}_j - \mu_0)^2}{n_j + \lambda}\right]\right), \qquad (2.7)$

where $\bar{y}_j$ and $s_j^2$ are the sample mean and sample variance, respectively, of the cluster Dj, for j = 1, …, k.
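These conjugate updates can be sketched as follows. This is our reading of the normal-inverse-gamma update implied by the base measure $IG(\tau/2, \beta/2)$; the exact shape/scale convention is an assumption, so the block is a sketch rather than a verbatim transcription of the paper's formulas:

```python
import numpy as np

def update_phi(y_cluster, mu0, lam, tau, beta, rng):
    """One draw of (mu_j, sigma2_j) for cluster D_j given its observations.
    Assumed convention: sigma2_j ~ IG((tau + n_j)/2, scale) with the scale
    built from the sufficient statistics, then mu_j | sigma2_j is normal."""
    y = np.asarray(y_cluster, dtype=float)
    nj = y.size
    ybar = y.mean()
    s2 = y.var(ddof=1) if nj > 1 else 0.0
    # sigma2_j | y ~ IG((tau+nj)/2, [beta + (nj-1)s2 + nj*lam*(ybar-mu0)^2/(nj+lam)]/2)
    shape = (tau + nj) / 2.0
    scale_b = (beta + (nj - 1) * s2 + nj * lam * (ybar - mu0) ** 2 / (nj + lam)) / 2.0
    sigma2 = 1.0 / rng.gamma(shape=shape, scale=1.0 / scale_b)
    # mu_j | sigma2_j, y ~ N((nj*ybar + lam*mu0)/(nj + lam), sigma2_j/(nj + lam))
    mu = rng.normal((nj * ybar + lam * mu0) / (nj + lam), np.sqrt(sigma2 / (nj + lam)))
    return mu, sigma2
```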

2.1.1. Gibbs sampling algorithm

In order to estimate the posterior probabilities for the models Mq, we use a GS algorithm that iterates over the conditional distributions (2.5)–(2.7).

• Gibbs sampling algorithm: Let the current state of the Markov chain consist of (c(l−1), φ(l−1)), where l denotes the lth iteration of the algorithm, for l = 1, …, L, and where $c^{(0)} = (c_1^{(0)}, \ldots, c_T^{(0)})$ and $\varphi^{(0)} = (\varphi_1^{(0)}, \ldots, \varphi_k^{(0)})$ are the initial values. Then do the following:

• For t = 1, …, T:

• Calculate Pt = (P(ct = 1| · ), …, P(ct = k(t)| · ), P(ct = j*| · )) as in (2.5);

• Generate Zt ~ Multinom(1, Pt). If $z_{t,j} = 1$, for some j ∈ {1, …, k(t)}, set $c_t^{(l)} = j$. Otherwise, if $z_{t, k^{(t)}+1} = 1$, set $c_t^{(l)} = k^{(t)} + 1$;

• Conditional on c(l) update parameters φj from posterior distribution in (2.6) and (2.7);

• Let $I_q^{(l)}$ be an indicator variable such that $I_q^{(l)} = 1$ if the configuration c(l) defines model Mq and $I_q^{(l)} = 0$ otherwise, for q = 0, 1, …, Q.

We discard the first B values of the generated chains as burn-in. The posterior estimates of the probabilities of the models Mq are given by

$\tilde{P}_q = \frac{1}{L - B} \sum_{l=B+1}^{L} I_q^{(l)},$

for q = 0, 1, …, Q. If $\tilde{P}_q = \max_{0 \le q' \le Q} \tilde{P}_{q'}$, then Mq is the selected model.
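In practice, once the configuration draws c(l) are stored, the $\tilde{P}_q$ are just relative frequencies of partitions. A minimal sketch (ours; the relabeling helper makes, e.g., (2, 1, 1) and (1, 2, 2) count as the same partition):

```python
def model_probabilities(c_draws):
    """Estimate posterior model probabilities from post-burn-in configuration
    draws: each draw is a tuple of cluster labels (c_1, ..., c_T).
    Returns a dict mapping canonical configurations to relative frequencies."""
    def canon(c):
        # relabel clusters in order of first appearance, so (2, 1, 1) -> (1, 2, 2)
        seen, out = {}, []
        for ci in c:
            seen.setdefault(ci, len(seen) + 1)
            out.append(seen[ci])
        return tuple(out)
    counts = {}
    for c in c_draws:
        key = canon(c)
        counts[key] = counts.get(key, 0) + 1
    total = len(c_draws)
    return {k: v / total for k, v in counts.items()}
```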

3. Data analysis

In this section, we verify the performance of the proposed method, called Polya urn within Gibbs sampling (PUGS), and compare it with a standard method based on ANOVA followed by a Tukey test (ATUK), using simulated data sets and a real data set.

In order to establish the hyperparameter values for PUGS we consider the following procedure. Let (a, b) be a rough interval that would include all observations produced by the experiment. The hyperparameter μ0 is chosen as the midpoint of this interval, μ0 = (a + b)/2, and we set λ = 0.1. Besides, we choose τ and β so that $E[\sigma_t^2] = \beta/(\tau - 2) = R$, where R = b − a is the range of the interval. Thus, we obtain β = (τ − 2)R and we set τ = 3. For ATUK we consider a significance level of 0.05.
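This elicitation rule is mechanical and can be written as a small helper (a direct transcription of the rules above; the function name is ours):

```python
def choose_hyperparameters(a, b):
    """Hyperparameters of G0 from a rough data interval (a, b): mu0 is the
    midpoint, lambda = 0.1, tau = 3, and beta is chosen so that
    E[sigma2] = beta/(tau - 2) equals the range R = b - a."""
    mu0 = (a + b) / 2.0
    lam = 0.1
    tau = 3.0
    beta = (tau - 2.0) * (b - a)   # gives beta/(tau - 2) = R
    return mu0, lam, tau, beta
```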

3.1. Simulated data set

Consider a DNA experiment with a control and two treatment experimental conditions. The five possible models written in terms of latent variables are:

$M_0: \mathbf{c}_0 = (c_1 = c_2 = c_3); \quad M_1: \mathbf{c}_1 = (c_1 = c_2 \neq c_3); \quad M_2: \mathbf{c}_2 = (c_1 = c_3 \neq c_2); \quad M_3: \mathbf{c}_3 = (c_1 \neq c_2 = c_3); \quad M_4: \mathbf{c}_4 = (c_1 \neq c_2 \neq c_3). \qquad (3.1)$

In order to generate the data sets, we fix the control parameters at μ1 = 8.44 and $\sigma_1^2 = 0.67$. These values are the sample mean and variance of the observed expression levels in the control condition of the proteomics data set (Baldi and Long, 2001). The parameter values for each configuration are given by

• (μ3, σ3) = (μ2, σ2) = (μ1, σ1) for c0;

• (μ2, σ2) = (μ1, σ1) and (μ3, σ3) = (μ1 + δσ1, γσ1) for c1;

• (μ3, σ3) = (μ1, σ1) and (μ2, σ2) = (μ1 + δσ1, γσ1) for c2;

• (μ2, σ2) = (μ1 + δσ1, γσ1) and (μ3, σ3) = (μ2, σ2) for c3;

• (μ2, σ2) = (μ1 + δσ1, γσ1) and (μ3, σ3) = (μ2 + δσ2, γσ2) for c4,

for δ ∈ {0.50, 1.00, 1.50, 2.00, 2.50, 3.00, 3.50, 4.00} and γ ∈ {0.5, 1.0, 1.5, 2.0, 2.5, 3.0}. We consider the proportions of cases generated from models (M0, M1, M2, M3, M4) to be (0.70, 0.10, 0.10, 0.05, 0.05), respectively. Besides, we set n = 5 and N = 1,000.

The generation of a simulated data set is given by the following steps. For g = 1, …, N, generate Ug from a uniform distribution on the interval (0, 1) and consider an indicator vector $\mathbb{G}_g$ of dimension 1 × 3 that records from which model the data are generated:

• If ug ≤ 0.70, fix the parameter values according to c0 and let the index vector $\mathbb{G}_g$ = (1, 1, 1) indicate that case g is generated from M0;

• If 0.70 < ug ≤ 0.80, fix the parameter values according to c1 and set $\mathbb{G}_g$ = (1, 1, 2);

• If 0.80 < ug ≤ 0.90, fix the parameter values according to c2 and set $\mathbb{G}_g$ = (1, 2, 1);

• If 0.90 < ug ≤ 0.95, fix the parameter values according to c3 and set $\mathbb{G}_g$ = (1, 2, 2);

• If ug > 0.95, fix the parameter values according to c4 and set $\mathbb{G}_g$ = (1, 2, 3);

• Generate $Yit~N(μt,σt2)$, for t = 1, 2, 3 and i = 1, …, n.
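The generation steps above can be sketched for a single gene as follows (ours; the defaults use the control values μ1 = 8.44 and σ1² = 0.67 stated earlier):

```python
import numpy as np

def simulate_gene(rng, mu1=8.44, sigma1=np.sqrt(0.67), delta=2.0, gamma=2.0, n=5):
    """Generate one gene's 3 x n expression matrix for a control and two
    treatments, following the mixture of configurations c0,...,c4.
    Returns (Y, G) with G the true index vector."""
    u = rng.random()
    params = [(mu1, sigma1)] * 3
    if u <= 0.70:                                           # c0 -> M0
        G = (1, 1, 1)
    elif u <= 0.80:                                         # c1 -> M1
        params[2] = (mu1 + delta * sigma1, gamma * sigma1)
        G = (1, 1, 2)
    elif u <= 0.90:                                         # c2 -> M2
        params[1] = (mu1 + delta * sigma1, gamma * sigma1)
        G = (1, 2, 1)
    elif u <= 0.95:                                         # c3 -> M3
        params[1] = params[2] = (mu1 + delta * sigma1, gamma * sigma1)
        G = (1, 2, 2)
    else:                                                   # c4 -> M4
        mu2, s2 = mu1 + delta * sigma1, gamma * sigma1
        params[1] = (mu2, s2)
        params[2] = (mu2 + delta * s2, gamma * s2)
        G = (1, 2, 3)
    Y = np.array([rng.normal(m, s, size=n) for m, s in params])
    return Y, G
```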

Once a data set is generated, we apply PUGS and ATUK to identify the cases with a difference. We apply PUGS with L = 33,000 iterations and a burn-in of B = 3,000; besides, out of the remaining 30,000 iterations, we take jumps of size 10, i.e., only 1 draw out of every 10 is kept, in order to construct a sample of size 3,000 for inference.

To record the configuration obtained by PUGS and ATUK, we consider the index vector $\mathbb{Z}_g^{method}$, where $\mathbb{Z}_g^{method}$ assumes one of the following configurations: (1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2), or (1, 2, 3), for method ∈ {PUGS, ATUK}. We then compare the performance of the methods using the TPR (the number of configurations correctly identified divided by N) and the FDR (the number of true M0 cases incorrectly declared as non-null divided by the total number of cases declared as non-null), given by

$TPR = \frac{1}{N} \sum_{g=1}^{N} I_{\mathbb{Z}_g^{method}}(\mathbb{G}_g),$

where $I_{\mathbb{Z}_g^{method}}(\mathbb{G}_g) = 1$ if the configuration identified by the method is equal to $\mathbb{G}_g$ and $I_{\mathbb{Z}_g^{method}}(\mathbb{G}_g) = 0$ otherwise, and

$FDR = \frac{\sum_{g=1}^{N} \left(1 - I_{\mathbb{Z}_g^{method}}(\mathbb{G}_g)\right) I_{\mathbb{G}_g}(\mathbb{Z}_0)}{N - \sum_{g=1}^{N} I_{\mathbb{Z}_g^{method}}(\mathbb{Z}_0)},$

where $I_{\mathbb{G}_g}(\mathbb{Z}_0) = 1$ if case g is generated according to the configuration $\mathbb{Z}_0 = (1, 1, 1)$ of the null model M0 and $I_{\mathbb{G}_g}(\mathbb{Z}_0) = 0$ otherwise, for method ∈ {PUGS, ATUK}.

Moreover, for each pair (δ, γ) considered, we generate L = 100 different artificial data sets according to the steps described above and present the results using the means of the TPR and FDR,

$\overline{TPR} = \frac{1}{L} \sum_{l=1}^{L} TPR^{(l)} \quad \text{and} \quad \overline{FDR} = \frac{1}{L} \sum_{l=1}^{L} FDR^{(l)},$

where $TPR^{(l)}$ and $FDR^{(l)}$ are the TPR and FDR calculated for the lth generated data set, respectively.
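The two criteria reduce to simple counts over the identified and true configurations; a sketch (ours; configurations stored as tuples, with the null configuration (1, 1, 1)):

```python
def tpr_fdr(Z, G):
    """TPR and FDR from identified configurations Z and true configurations G
    (parallel lists of tuples). Z0 = (1, 1, 1) is the null configuration."""
    Z0 = (1, 1, 1)
    N = len(G)
    tpr = sum(z == g for z, g in zip(Z, G)) / N
    discoveries = sum(z != Z0 for z in Z)                       # declared non-null
    false_disc = sum(z != g and g == Z0 for z, g in zip(Z, G))  # true nulls rejected
    fdr = false_disc / discoveries if discoveries else 0.0
    return tpr, fdr
```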

Figure 1 shows $\overline{TPR}$ and $\overline{FDR}$ for both methods. Note that PUGS presents a better performance than ATUK for all simulated cases, i.e., PUGS has a higher $\overline{TPR}$ and a smaller $\overline{FDR}$. These results mean that PUGS correctly classifies more cases and makes fewer errors than the ATUK method. In particular, this better performance occurs for cases with differences in variances, γ = 2 and γ = 3, as can be seen in Figures 1(b), (c), (e), (f). This fact is especially interesting from the biological point of view because PUGS may identify cases that are missed when the usual ATUK method is considered.

Since in the application presented in the next section no case was identified under model M4, we present in Appendix A a simulation study similar to the one above, but considering only the configurations cj, j = 0, 1, 2, 3, from (3.1). In Appendix B we present a simulation study conducted for a situation with a control and three treatment experimental conditions. Analogously to the results described in this section, PUGS also presents a better performance than ATUK, especially for cases with differences in variances.

4. Application

Now consider the proteomics data set mentioned in the introduction. This data set was extracted from the website cybert.ics.uci.edu/anova/. It is composed of N = 1,088 proteins from a control and two treatment conditions. The sample size for each experimental condition is n = 5.

For the application of PUGS, we consider the same number of iterations, burn-in, and hyperparameter values used in the simulation section. Table 1 shows the number of cases identified under each model by each method. The last column of this table shows the number of cases for which model M0 was not selected, i.e., the number of cases with a difference identified by each method.

ATUK identifies 140 cases with evidence for a difference while PUGS identifies 127. Of the cases identified with a difference, 92 were identified by both methods. No cases were identified under M4 by either method.

Tables 2 and 3 show the ten most evident cases identified by PUGS and ATUK, respectively. These tables show the number of the protein in the data set, the sample mean and standard deviation (SD) for the control and the two treatments, the configurations identified by PUGS and ATUK, the posterior probability obtained by PUGS, and the p-value from the ANOVA in ATUK.

The 10 most evident cases identified by ATUK were also identified by PUGS. Out of the 10 most evident cases identified with a difference by PUGS, two were not identified by ATUK. Both cases are highlighted with * in Table 2. Note that both cases present marked differences between the control and treatment SDs (large values highlighted in bold). As in the simulation study, this result indicates that PUGS is able to identify cases not identified by the usual ATUK method when the difference is in the variances.

5. Final remarks

In this paper, we developed a gene-by-gene multiple comparison analysis using a semi-parametric Bayesian model with a Dirichlet process prior. The comparison among experimental conditions was made using the discreteness of the Dirichlet process within a model selection framework. The posterior probabilities for the models were calculated through a GS algorithm implemented using the Polya urn scheme written in terms of latent variables that indicate equality or inequality among the experimental conditions.

The performance of PUGS, as well as its comparison with ATUK, was verified on artificial data sets and on a real data set. Results show a better performance of PUGS for cases with differences in variances. Two examples of the better performance of PUGS are cases 557 and 526 presented in Table 2. These two cases present a clear difference in standard deviation among the control and treatment conditions, but they are not identified as cases with evidence for a difference among control and treatment conditions by ATUK. PUGS, however, classifies both cases under models that include differences between the control and treatment conditions.

From the biological point of view, the results show that PUGS may reveal cases not identified when the usual ATUK method is considered. From the statistical point of view, the proposed method may be viewed as an effective Bayesian alternative for multiple comparison problems. PUGS can also be easily implemented in usual software such as R (The Comprehensive R Archive Network, http://cran.r-project.org). The code used for the computations is written in R and can be obtained by e-mail from the first author.

Here we provide some additional results from the simulation study for PUGS and ATUK (Appendices A and B).

Acknowledgements

The author Erlandson F. Saraiva acknowledges the Brazilian institution CNPq.

Figures
Fig. 1. Averages of TPR and FDR for PUGS and ATUK. TPR = number of configurations correctly identified divided by N; FDR = number of true M0 cases incorrectly declared as non-null divided by the total number of cases declared as non-null; ATUK = analysis of variance followed by Tukey test; PUGS = Polya urn within Gibbs sampling.
TABLES

Table B.1

Model configurations

M0 : c0 = (c1 = c2 = c3 = c4)
M1 : c1 = (c1 = c2 = c3 ≠ c4)
M2 : c2 = (c1 = c2 = c4 ≠ c3)
M3 : c3 = (c1 = c3 = c4 ≠ c2)
M4 : c4 = (c1 ≠ c2 = c3 = c4)
M5 : c5 = (c1 = c2 ≠ c3 = c4)
M6 : c6 = (c1 = c3 ≠ c2 = c4)
M7 : c7 = (c1 = c4 ≠ c2 = c3)
M8 : c8 = (c1 = c2 ≠ c3 ≠ c4)
M9 : c9 = (c1 = c3 ≠ c2 ≠ c4)
M10 : c10 = (c1 = c4 ≠ c2 ≠ c3)
M11 : c11 = (c1 ≠ c2 = c3 ≠ c4)
M12 : c12 = (c1 ≠ c2 = c4 ≠ c3)
M13 : c13 = (c1 ≠ c2 ≠ c3 = c4)
M14 : c14 = (c1 ≠ c2 ≠ c3 ≠ c4)

Table 1

Number of cases identified by method

Method    M0     M1    M2    M3    M4    Total of cases with difference
PUGS      961    21    89    17    0     127
ATUK      948    46    27    67    0     140

PUGS = Polya urn within Gibbs sampling; ATUK = analysis of variance followed by Tukey test.

Table 2

Ten most evident cases identified by PUGS, where cj is given in (3.1), j = 0, …, 4

Number   x̄1      x̄2      x̄3      s1      s2      s3      PUGS   ATUK   Posterior probability   p-value
557*     9.0897  8.7857  —       3.5739  0.9594  0.6304  c3     c0     0.8938                  0.0897
690      8.0427  9.5514  9.3821  0.5140  0.3500  0.3832  c3     c3     0.8854                  0.0002
526*     8.8327  8.1400  8.7649  0.9355  4.6101  0.5055  c2     c0     0.8682                  0.9076
666      8.2380  9.7643  8.5284  0.4457  0.4276  0.4502  c2     c2     0.8331                  0.0003
1069     7.3039  8.8703  7.3194  0.6188  1.0238  0.3377  c2     c2     0.8248                  0.0065
1024     8.3471  9.6321  8.3603  0.4866  0.5582  0.4792  c2     c2     0.8105                  0.0023
936      7.5842  8.8570  7.8247  0.3111  0.7440  0.3207  c2     c2     0.7926                  0.0039
932      8.1317  9.7893  8.4973  0.5588  0.3725  0.5818  c2     c2     0.7925                  0.0006
730      8.3837  9.6405  8.1156  0.5543  0.3864  0.6852  c2     c2     0.7637                  0.0021
1012     8.0056  9.1393  8.0072  0.4247  0.4005  0.4868  c2     c2     0.7633                  0.0019

PUGS = Polya urn within Gibbs sampling; ATUK = analysis of variance followed by Tukey test.

Table 3

Ten most evident cases identified by ATUK, where cj is given in (3.1), j = 0, …, 4.

Number   x̄1       x̄2       x̄3       s1      s2      s3      PUGS   ATUK   Posterior probability   p-value
690      11.6032  13.7804  13.5362  0.7423  0.5050  0.5537  c3     c3     0.8854                  <0.001
666      11.8855  14.0871  12.3043  0.6432  0.6170  0.6496  c2     c2     0.8331                  <0.001
932      11.7326  14.1231  12.2591  0.8065  0.5371  0.8393  c2     c2     0.7925                  0.001
60       12.6841  13.7752  11.6590  0.8529  0.6342  0.3794  c2     c1     0.3671                  0.001
649      12.1523  12.9857  11.1680  0.6238  0.3427  0.7030  c1     c1     0.4103                  0.001
1012     11.5504  13.1859  11.5521  0.6131  0.5783  0.7021  c2     c2     0.7633                  0.002
730      12.0950  13.9087  11.7082  0.8001  0.5572  0.9892  c2     c2     0.7637                  0.002
1024     12.0420  13.8963  12.0614  0.7023  0.8051  0.6912  c2     c2     0.8105                  0.002
1020     11.7981  13.2402  10.8990  1.2130  0.4921  0.6438  c2     c2     0.5424                  0.003
936      10.9425  12.7781  11.2891  0.4494  1.0730  0.4630  c2     c2     0.7926                  0.004

PUGS = Polya urn within Gibbs sampling; ATUK = analysis of variance followed by Tukey test.

References
1. Antoniak, CE (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics. 2, 1152-1174.
2. Arfin, SM, Long, AD, Ito, ET, Tolleri, L, Riehle, MM, Paegle, ES, and Hatfield, GW (2000). Global gene expression profiling in Escherichia coli K12: the effects of integration host factor. Journal of Biological Chemistry. 275, 29672-29684.
3. Baldi, P, and Long, DA (2001). A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics. 17, 509-519.
4. Blackwell, D, and MacQueen, JB (1973). Ferguson distribution via Polya urn schemes. The Annals of Statistics. 1, 353-355.
5. DeRisi, JL, Iyer, VR, and Brown, PO (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science. 278, 680-686.
6. Escobar, MD, and West, M (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 90, 577-588.
7. Ferguson, TS (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1, 209-230.
8. Fox, RJ, and Dimmic, MW (2006). A two-sample Bayesian t-test for microarray data. BMC Bioinformatics. 7, 126.
9. Gelfand, AE, and Smith, AFM (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association. 85, 398-409.
10. Goeman, JJ, and Bühlmann, P (2007). Analyzing gene expression data in terms of gene set: methodological issues. Bioinformatics. 23, 980-987.
11. Gopalan, R, and Berry, DA (1998). Bayesian multiple comparisons using Dirichlet process priors. Journal of the American Statistical Association. 93, 1130-1139.
12. Guindani, M, Müller, P, and Zhang, S (2009). A Bayesian discovery procedure. Journal of the Royal Statistical Society Series B (Statistical Methodology). 71, 905-925.
13. Hatfield, GW, Hung, SP, and Baldi, P (2003). Differential analysis of DNA microarray gene expression data. Molecular Microbiology. 47, 871-877.
14. Jain, S, and Neal, RM (2004). A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics. 13, 158-182.
15. Kim, SG, Park, JS, and Lee, YS (2013). Identification of target clusters by using the restricted normal mixture model. Journal of Applied Statistics. 40, 941-960.
16. Louzada, F, Saraiva, EF, Milan, LA, and Cobre, J (2014). A predictive Bayes factor approach to identify genes differentially expressed: an application to Escherichia coli bacterium data. Brazilian Journal of Probability Statistics. 28, 167-189.
17. MacEachern, SN (2016). Nonparametric Bayesian methods: a gentle introduction and overview. Communications for Statistical Applications and Methods. 23, 445-466.
18. Medvedovic, M, and Sivaganesan, S (2002). Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics. 18, 1194-1206.
19. Neal, RM (1998). Markov chain sampling methods for Dirichlet process mixture models, Technical Report 4915. Retrieved September 1, 2017, from: http://cs.toronto.edu/redford/mixmc.abstract.html
20. Oh, HS, and Yang, WY (2006). A Bayesian multiple testing of detecting differentially expressed genes in two-sample comparison problem. Communications for Statistical Applications and Methods. 13, 39-47.
21. Oh, S (2015). How are Bayesian and non-parametric methods doing a great job in RNA-seq differential expression analysis?: a review. Communications for Statistical Applications and Methods. 22, 181-199.
22. Parkitna, JR, Korostynski, M, Kaminska-Chowaniec, D, Obara, I, Mika, J, Przewlocka, B, and Przewlocki, R (2006). Comparison of gene expression profiles in neuropathic and inflammatory pain. Journal of Physiology and Pharmacology. 57, 401-414.
23. Pavlidis, P (2003). Using ANOVA for gene selection from microarray studies of the nervous system. Methods. 31, 282-289.
24. Saraiva, EF, and Milan, LA (2012). Clustering gene expression data using a posterior split-merge-birth procedure. Scandinavian Journal of Statistics. 39, 399-415.
25. Wu, TD (2001). Analyzing gene expression data from DNA microarrays to identify candidate genes. Journal of Pathology. 195, 53-65.
26. Zollanvari, A, Cunningham, MJ, Braga-Neto, U, and Dougherty, ER (2009). Analysis and modeling of time-course gene-expression profiles from nanomaterial-exposed primary human epidermal keratinocytes. BMC Bioinformatics. 10, S10.
27. Zou, F, Huang, H, and Ibrahim, JG (2010). A semiparametric Bayesian approach for estimating the gene expression distribution. Journal of Biopharmaceutical Statistics. 20, 267-280.