DNA microarray data record gene expression levels for thousands of genes (Speed, 2003; Baldi and Hatfield, 2002). A microarray study seeks to identify the differentially expressed genes among these thousands of genes, which requires testing one hypothesis per gene simultaneously. For each gene, the null hypothesis states that there is no association between the gene expression level and the explanatory variables (Speed, 2003). For instance, a microarray analysis could be conducted to examine differences in gene expression levels between cancer patients and healthy subjects.
When the multiple testing framework is applied to microarray data, a true null hypothesis corresponds to a gene that is not differentially expressed, whereas a non-true null hypothesis corresponds to a truly differentially expressed gene. A rejected hypothesis means that the corresponding gene is declared differentially expressed (Table 1).
A traditional error criterion in the multiple testing framework is the family-wise error rate (FWER), defined as the probability of making any type I error among all hypotheses at assigned level
The dependence structure among genes should be taken into account when analyzing microarray data. Many existing estimation procedures assume a restricted dependence structure among genes and therefore estimate the proportion of true null hypotheses under restrictive or unrealistic conditions, whereas the dependence structure among genes in microarray studies is often unknown. The hidden Markov model (HMM) exploits the local dependence structure and has been widely used in areas such as speech recognition, signal processing and DNA sequence analysis; see Rabiner (1989), Churchill (1992), Krogh
We consider nine widely used procedures for estimating the proportion of true null hypotheses: the least slope method (Benjamini and Hochberg, 2000), the smoother method described in Storey and Tibshirani (2003), the bootstrap method (Storey
Section 2 introduces the estimation procedures. In Section 3, simulation studies compare the procedures under independence and under the HMM dependence structure. In Section 4, each estimation procedure is applied to real data. Section 5 summarizes the paper.
The hypotheses,
Calculate
Starting with
The smoother method (Storey and Tibshirani, 2003) estimates the proportion of true null hypotheses as
The bootstrap method (Storey
The rationale for this estimate is that
The Langaas method utilizes a convex decreasing density estimate for
We assume that
By transforming the density
As for the histogram method (Nettleton
The proposed method (Pounds and Cheng, 2006) does not depend on assumptions that the tests are two-sided or produce continuously distributed
For the average estimate method (Jiang and Doerge, 2008), we estimate the proportion of true null hypotheses
The value of
From the basic non-linear model of the
We calculate the slope of that fitting line as the estimated
Then the estimate
Jin and Cai (2007) develop an approach based on the empirical characteristic function and Fourier analysis. The estimators are shown to be uniformly consistent over a wide class of parameters, and the approach is extended to dependent data structures. See Jin and Cai (2007) for more details.
To specify the HMM, we use a transition matrix with two states (0 and 1). The transition matrix is varied as:
The probability that the system moves from state 0 (gene not differentially expressed) to state 1 (gene differentially expressed), or from state 1 to state 0, is constant over time. The diagonal entries of the matrix do not represent changes of state; they are the probabilities of remaining in the same state, that is, of going from state 0 to state 0 or from state 1 to state 1.
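As an illustration only, a two-state chain of this kind can be simulated as in the following Python sketch. The function name `simulate_two_state_chain` and the off-diagonal probabilities `p01 = 0.1` and `p10 = 0.4` are hypothetical choices made for the example; they are not the transition matrices used in this study.

```python
import numpy as np

def simulate_two_state_chain(n_genes, p01, p10, seed=0):
    """Simulate hidden states (0 = not differentially expressed, 1 = differentially expressed)."""
    rng = np.random.default_rng(seed)
    # Row i holds the probabilities of moving from state i to states 0 and 1.
    transition = np.array([[1.0 - p01, p01],
                           [p10, 1.0 - p10]])
    # Stationary probability of state 1 for a two-state chain.
    pi1 = p01 / (p01 + p10)
    states = np.empty(n_genes, dtype=int)
    # Draw the first state from the stationary distribution.
    states[0] = rng.random() < pi1
    for i in range(1, n_genes):
        # The next state depends only on the current state (Markov property).
        states[i] = rng.random() < transition[states[i - 1], 1]
    return states, transition

# Hypothetical setup: 1,000 genes, illustrative transition probabilities.
states, transition = simulate_two_state_chain(1000, p01=0.1, p10=0.4)
pi0_true = np.mean(states == 0)  # realized proportion of true nulls
```

In this sketch the realized proportion of state 0 plays the role of the true proportion of true null hypotheses, and its long-run value is `p10 / (p01 + p10)`.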
The conditional probability of future states relies only upon the present state in the Markov chain. We could consider
The least slope method (ABH), the smoother method (Spline), the bootstrap method (Boot), the Langaas method with a convex decreasing density estimate (Langaas), the histogram method (Histo), the average method (Jiang), the SLIM method (SLIM), the robust method (Pounds) and the Jin and Cai method (Cai; Jin and Cai, 2007) are assessed on independent simulated data, on data simulated from the HMM under various setups, and on real microarray data. Estimates and standard errors are calculated for each estimation procedure.
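To illustrate how an estimate and its variability can be obtained by simulation, the following Python sketch repeats a small independent experiment and summarizes a fixed-threshold estimator: the proportion of p-values above a threshold, divided by one minus that threshold. This quantity is the building block of the smoother and bootstrap methods, but it is only a simplified stand-in here and is not any of the nine procedures as implemented in this study. The threshold 0.5, the effect size 2, the one-sided tests, the number of replications and all function names are assumptions made for the example.

```python
import math
import numpy as np

def fixed_lambda_pi0(pvalues, lam=0.5):
    """Proportion of p-values above lam, rescaled by 1 - lam and capped at 1."""
    return min(1.0, float(np.mean(pvalues > lam)) / (1.0 - lam))

def upper_tail_pvalues(z):
    """One-sided p-values P(Z > z) for standard normal test statistics."""
    return np.array([0.5 * math.erfc(v / math.sqrt(2.0)) for v in z])

def simulate_pi0_estimates(pi0, n_tests=1000, n_reps=200, effect=2.0, seed=0):
    """Return the mean and standard deviation of the estimate over replications."""
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_reps)
    for r in range(n_reps):
        # Each test is a true null with probability pi0 (assumed setup).
        is_alt = rng.random(n_tests) >= pi0
        # Nulls are N(0, 1); alternatives are shifted by the assumed effect size.
        z = rng.normal(loc=np.where(is_alt, effect, 0.0), scale=1.0)
        estimates[r] = fixed_lambda_pi0(upper_tail_pvalues(z))
    return estimates.mean(), estimates.std(ddof=1)

mean_est, sd_est = simulate_pi0_estimates(pi0=0.75)
```

The standard deviation across replications is used here as the measure of variability of a single estimate. Because some p-values from non-null tests also exceed the threshold, this simple estimator tends to overestimate the true proportion slightly.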
We present 9 estimation procedures in an independent data with a different true proportion of true null
For the independence case, we simulate 1,000 independent normal random variables
In a dependent simulation case, we model the two hidden state HMM (0 and 1) with a varying transition probability matrix in the previous section. We utilize Welsch
Table 2 summarizes the result for independent
As transition matrix varies, we compute each
The microarray data in an HIV study (Van’t Wout
Table 4 reports the estimates from the different procedures in the real data analysis. The Spline, Pounds, SLIM, Cai and Jiang procedures give relatively similar estimates of the proportion of true null hypotheses in the data.
We assess various estimation procedures on independent data and on dependent data generated from the HMM, and we conduct a real data analysis. The simulation results indicate that the Cai and SLIM procedures have relatively smaller biases and standard errors, making them more appropriate for estimating the proportion of true null hypotheses. In the real data analysis, the Spline, Pounds, SLIM, Cai and Jiang procedures give similar estimates of the proportion of true null hypotheses.
This work was supported by the Research Institute of Natural Science of Gangneung-Wonju National University.
Table 1. Multiple hypothesis testing
| | Not rejected | Rejected | Total |
|---|---|---|---|
| True null | | | |
| Non-true null | | | |
| Total | | | |
Table 2. Independent
| | True proportion | Spline | Boot | Jiang | Histo | Langaas | Pounds | ABH | SLIM | Cai |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.25 | 0.5500 (0.3572) | 0.5430 (0.3461) | 0.4770 (0.2673) | 0.4900 (0.3152) | 0.5415 (0.3162) | 0.4877 (0.2861) | 0.6104 (0.4157) | 0.3215 (0.1963) | 0.2356 (0.1952) |
| | 0.50 | 0.7299 (0.3532) | 0.6171 (0.2186) | 0.5335 (0.1565) | 0.5364 (0.1623) | 0.5951 (0.1603) | 0.5846 (0.1612) | 0.6104 (0.3642) | 0.4286 (0.1055) | 0.4985 (0.0864) |
| | 0.75 | 0.8324 (0.3805) | 0.8313 (0.1972) | 0.7846 (0.1329) | 0.8165 (0.1294) | 0.6951 (0.1603) | 0.7134 (0.1419) | 0.6104 (0.3914) | 0.7219 (0.1198) | 0.7386 (0.0785) |
| 2 | 0.25 | 0.5610 (0.3467) | 0.5391 (0.3378) | 0.4691 (0.2631) | 0.4896 (0.3254) | 0.5512 (0.3098) | 0.4912 (0.2918) | 0.6084 (0.3981) | 0.3175 (0.1895) | 0.2413 (0.1823) |
| | 0.50 | 0.7177 (0.3742) | 0.6319 (0.2218) | 0.5413 (0.1618) | 0.5429 (0.1719) | 0.6019 (0.1579) | 0.5912 (0.1701) | 0.6409 (0.3591) | 0.4809 (0.1193) | 0.4998 (0.0711) |
| | 0.75 | 0.8264 (0.3519) | 0.8516 (0.3609) | 0.7984 (0.1410) | 0.8093 (0.1310) | 0.6991 (0.1578) | 0.7231 (0.1092) | 0.6104 (0.4109) | 0.7410 (0.0909) | 0.7485 (0.0682) |
| 3 | 0.25 | 0.4811 (0.3561) | 0.4491 (0.3451) | 0.4519 (0.2718) | 0.4789 (0.3348) | 0.5029 (0.2966) | 0.4892 (0.2798) | 0.6139 (0.4001) | 0.3091 (0.1886) | 0.2331 (0.1765) |
| | 0.50 | 0.6156 (0.3697) | 0.6291 (0.2315) | 0.5542 (0.2234) | 0.5324 (0.1698) | 0.5967 (0.1498) | 0.6589 (0.1688) | 0.6340 (0.3610) | 0.4798 (0.1210) | 0.4999 (0.0691) |
| | 0.75 | 0.8109 (0.3509) | 0.8402 (0.3598) | 0.7975 (0.1409) | 0.7654 (0.1309) | 0.6580 (0.1569) | 0.7368 (0.1066) | 0.6291 (0.4091) | 0.7443 (0.0889) | 0.7495 (0.0661) |
Table 3. Dependent
| Transition matrix | | Spline | Boot | Jiang | Histo | Langaas | Pounds | ABH | SLIM | Cai |
|---|---|---|---|---|---|---|---|---|---|---|
| T1 | 0.7330 | 0.9990 (0.2608) | 0.9717 (0.2506) | 0.9290 (0.1698) | 0.9303 (0.1699) | 0.9477 (0.2018) | 0.9390 (0.1729) | 1.0000 (0.3608) | 0.9200 (0.1589) | 0.9042 (0.1546) |
| T2 | 0.8456 | 0.9897 (0.2658) | 0.9799 (0.2109) | 0.8955 (0.1462) | 0.9380 (0.1589) | 0.9620 (0.1603) | 0.9400 (0.1609) | 1.0000 (0.3756) | 0.8654 (0.1357) | 0.8569 (0.1164) |
| T3 | 0.8950 | 0.9871 (0.1998) | 0.9698 (0.1888) | 0.9665 (0.1112) | 0.9457 (0.1236) | 0.9620 (0.1603) | 0.9461 (0.1320) | 0.9963 (0.3756) | 0.8837 (0.1087) | 0.8900 (0.0975) |
| T4 | 0.8826 | 0.9823 (0.2965) | 0.9717 (0.1865) | 0.9125 (0.1131) | 0.9148 (0.1248) | 0.9414 (0.1686) | 0.9308 (0.1319) | 0.9985 (0.2995) | 0.9044 (0.1067) | 0.8977 (0.0825) |
| T5 | 0.8963 | 0.9716 (0.2876) | 0.9324 (0.1897) | 0.9133 (0.1234) | 0.9156 (0.1267) | 0.9414 (0.1698) | 0.9310 (0.1345) | 0.9966 (0.2898) | 0.9133 (0.0976) | 0.9126 (0.0784) |
| T6 | 0.8136 | 0.9810 (0.2619) | 0.9708 (0.2245) | 0.8857 (0.1198) | 0.9282 (0.1478) | 0.9707 (0.2198) | 0.8927 (0.1456) | 0.9987 (0.3101) | 0.8823 (0.1089) | 0.8589 (0.0884) |
| T7 | 0.8800 | 0.9822 (0.3109) | 0.9681 (0.2365) | 0.9278 (0.1287) | 0.9480 (0.1645) | 0.9502 (0.1780) | 0.9378 (0.1503) | 0.9999 (0.3589) | 0.9160 (0.1123) | 0.9155 (0.0967) |
| T8 | 0.9160 | 0.9678 (0.3069) | 0.9598 (0.2438) | 0.9227 (0.1277) | 0.9333 (0.1310) | 0.9456 (0.2101) | 0.9394 (0.1659) | 0.9897 (0.3090) | 0.9136 (0.1209) | 0.9150 (0.1095) |
| T9 | 0.8686 | 0.9780 (0.3109) | 0.9666 (0.2670) | 0.9016 (0.1310) | 0.8019 (0.1896) | 0.7889 (0.2546) | 0.8109 (0.1783) | 0.9888 (0.3245) | 0.8938 (0.1298) | 0.8922 (0.1156) |
Data analysis in Van’t Wout
Spline | Boot | Jiang | Histo | Langaas | Pounds | ABH | SLIM | Cai |
---|---|---|---|---|---|---|---|---|
0.8970344 | 0.9045681 | 0.8672398 | 0.9021678 | 0.9468924 | 0.8510235 | 0.9820108 | 0.8702348 | 0.8847652 |
Spline = smoother method; Boot = bootstrap method; Jiang = average method; Histo = histogram method; Langaas = Langaas method with a convex decreasing density estimate; Pounds = robust method; ABH = least slope method; SLIM = sliding linear model method; Cai = Jin and Cai method.