TEXT SIZE

CrossRef (0)
Robust inference with order constraint in microarray study

Joonsung Kang1,a

aDepartment of Information Statistics, Gangneung-Wonju national University, Korea
Correspondence to: 1Department of Information Statistics, Gangneung-Wonju National University, Jukheon-gil 7, Gangneung-si 25457, Republic of Korea. E-mail: mkang@gwnu.ac.kr
Received July 17, 2018; Revised August 29, 2018; Accepted August 30, 2018.
Abstract

Gene classification can involve complex order-restricted inference. Examining gene expression pattern across groups with order-restriction makes standard statistical inference ineffective and thus, requires different methods. For this problem, Roy’s union-intersection principle has some merit. The M-estimator adjusting for outlier arrays in a microarray study produces a robust test statistic with distribution-insensitive clustering of genes. The M-estimator in conjunction with a union-intersection principle provides a nonstandard robust procedure. By exact permutation distribution theory, a conditionally distribution-free test based on the proposed test statistic generates corresponding p-values in a small sample size setup. We apply a false discovery rate (FDR) as a multiple testing procedure to p-values in simulated data and real microarray data. FDR procedure for proposed test statistics controls the FDR at all levels of α and π0 (the proportion of true null); however, the FDR procedure for test statistics based upon normal theory (ANOVA) fails to control FDR.

Keywords : classification, distribution-free test, false discovery rate, M-estimator, union-intersection principle
1. Introduction

Classification in a genome-wide study separates subjects into similar groups so that subjects in the same group are more similar to each other subjects in different groups (Kim and Park, 2015; Choi et al., 2016). Gene expression data often has many outliers and is apt to be noisy. The small sample size in a microarray experiment makes the estimation of variance untrustworthy. A large number of genes and few number of arrays as well as a high signal-noise ratio in the microarray data make classical statistical approaches worthless. Robust statistical methods (Huber, 1981) offer a solution for the problem, especially when an underlying distribution is unknown. A robust statistical procedure should not be affected by departures from underlying assumptions caused by outliers (Jang et al., 2018). That is, it performs well under underlying assumptions whereas its performance deteriorates as the situation gets different from the assumptions (Son and Kim, 2017). Particularly, among robust estimators, M-estimator performs well even when we have a small sample size (Lim, 2018).

Order-restricted inference has been an issue that has been investigated for the last sixty years (Robertson et al., 1988). In a huge number of correlated heterogeneous genes with a small sample such as microarray data, order-restricted inference issues are often present in complicated ways. As for gene expression levels, the inference involves the mean expression over time or dose by inequalities of order-restriction (Peddada et al., 2003). For example, as for kth gene in the G groups, we can formulate hypotheses H0k vs H1k as below.

$H0k:μ1k=μ2k=⋯=μGk vs H1k:μ1k≤μ2k≤⋯≤μGk,$

where μjk means the average gene expression level in the kth gene in the jth group, where j = 1, …, G.

Now we define union-intersection principle (Roy, 1953) to involve order-restricted inference. We have a null hypothesis H0:θ ∈ Θ0 and rejection region C. A set of null hypotheses {Hl : θ ∈ Θl} and resulting set of tests with rejection regions {Cl}, where l belongs to Ł, follows the union-intersection principle (UIP) if Θ0 = ∩l∈Ł and C = ∪l∈ŁCl for each l ∈ Ł. Examining the gene expression pattern across different treatment groups requires different methods because the order limit makes standard statistical inference useless.

For this reason, nonstandard robust methods need to be addressed in order to classify genes taking order-restricted inference into account without assumptions. We propose an M-estimator based on a union-intersection principle for the distribution-insensitive classification of genes. An exact permutation enables a conditionally distribution-free test to compute p-values that are amenable in a small sample size, computationally tractable, and statistically robust.

Based on p-values, we need to test which genes has monotone increasing or decreasing pattern across the groups. Classical method in multiplicity controls the family-wise error rate (FWER), the probability of committing a type I error rate among all hypotheses at a preassigned level α. However, it is in general worthless to utilize it with a huge number of correlated gene due to losing truly differentially expressed genes or others. Benjamini and Hochberg (1995) proposed the false discovery rate (FDR) as an alternative of an the expected proportion of Type I error among the rejected hypotheses (genes). FDR procedures tend to produce less stringent Type I error compared to FWER controlling procedures (for example, the Bonferroni correction). Therefore, FDR procedures have greater power for the sake of increased numbers in Type I errors.

The paper is organized as follows. In Section 2, a general framework of robust inference is addressed. We discuss how to estimate Mn which is an M-estimator for a parameter of interest in each kth gene. In Section 3, a union-intersection principle is used to construct the test statistics based upon Mn in Section 2. Then, we use the test statistics to compute p-values in conditionally distribution-free tests with a small sample size setup. In Section 4, simulated data and real microarray data are assessed by applying a multiple testing procedure (FDR). Section 5 provides the concluding remarks.

2. Robust inference

### 2.1. Data structure

There are n subjects across G groups in the K genes (positions). Each subject has a gene expression level. We take the linear model Yk = Xβk + Ek in the kth gene into account where Yk = (Y1k, …, Ynk)t is the vector of gene expression levels collected from n individuals across G groups in the kth gene and At denotes the transpose of a matrix A. The design matrix of the n × G matrix is

$X=[10000⋯⋮⋮⋮⋮⋮⋮10000⋯11⋯00⋯⋮⋮⋮⋮⋮⋮11⋯00⋯⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮1000⋯1⋮⋮⋮⋮⋮⋮1000⋯1].$

The regression parameter of the G × 1 matrix is βk = (μ1k, δ2k, δ3k, …, δGk)t where μ1k denotes the average gene expression level in the kth gene in the first group and δjk denotes the difference between μ1k and the average expression level in the kth gene in the jth group, j = 1, 2, …, G. The vector of error of the n × 1 matrix denotes Ek = (ϵ1k, ϵ2k, …, ϵnk)t. We estimate a parameter of interest βk for each k = 1, …, K with the following M-estimator.

### 2.2. M-estimator

Let ρ:G × X → ℜ be a measurable function. We define an M-estimator Mn as a solution by minimizing with respect to t ∈ ℜG.

$∑i=1nρ (Yi-xitt),$

where Yi is the ith observation of Yk and xi(= (xi1, …, xiG)t), i = 1, …, n denotes the ith row of X. Mn is defined to be regression equivalent if Mn(Y + Xb) = Mn(Y) + b for b in ℜG. We define Mn to be scale equivalent if Mn(cY) = cMn(Y) for c > 0. Generally, the second condition is not satisfied. Fortunately, studentization makes Mn regression and scale equivalent. We consider a studentized M-estimator Mn of βk as a solution for minimization

$∑i=1nρ (Yi-xittSn)$

with respect to t (G × 1 matrix) and Sn = Sn(Y) is a scale statistic. The above linear model refers to the classical ANOVA model. But the distribution Y1k, …, Ynk may not be Gaussian. Researchers may be interested in G groups that may be stochastically ordered. For example, it is applicable to dose-response gene expression microarray data, introduced by Peddada et al. (2003) in cases when gene expression is stochastically increasing over time or dose. We define the null hypothesis H0k as the fact that the G groups in the kth gene are statistically homogeneous. The alternative hypothesis H1k is because G groups in the kth gene are ordered in an increasing level of dominance. H0k and H1k is constructed as below.

$H0k:δ2k=δ3k=⋯=δGk=0 vs H1k:0≤δ2k≤δ3k≤⋯≤δGk.$

These can be restated as the following hypotheses.

$H0k:θk=Aβk=0 H1k:θk=Aβk≥0,$

where the (G − 1) × G matrix

$A=(00000⋯01000⋯0-1100⋯00-110⋯⋮⋮⋮⋮⋮⋮000⋯-11).$

So as to test the null hypothesis, we tend to take alternatives that the vector θk belongs to the positive orthant space ℜ+(G−1) into account. As for the univariate case, we have an optimal UMP test. However, we do not have an optimal UMP test in a multivariate case. For instance, the Hotelling T2 creates a huge set of confidence intervals and involves loss of efficiency. It is therefore interesting to consider statistical inference under restricted setup. We can utilize UIP (Roy, 1953) for such a statistical inference under the positive orthant multivariate alternative hypothesis. We take the first derivative of ρ function as ψ. Mn should be a median-unbiased estimator of βk. Symmetry of F and skew symmetry of ψ, which is defined to be the symmetry of top left with bottom right and top right with bottom left, are necessary conditions for the median-unbiasedness of Mn. We select Huber loss function as a good candidate for ψ function. Therefore, minimization makes the estimator regression and scale equivalent. We define the Huber function as follows.

$ρ(t)={c∣t∣-12×c2,if ∣t∣ >c,12×t2,if ∣t∣ ≤c.$

The derivative of the Huber function ψ is defined as follows.

$ψ(t)={c×sign(t),if ∣t∣ >c,t,if ∣t∣ ≤c.$

ψ function is decomposed into the sum

$ψ=ψa+ψb+ψc.$

ψa is absolutely continuous function with absolutely continuous derivative. ψc should be a continuous and piecewise linear function. ψb is an increasing step function. We have ψa = ψb for the case of Huber loss function. This function satisfies the following conditions (Jurečková and Sen, 1996).

• M1: Sn should be both regression invariant and scale invariant, Sn > 0 a.s. and n1/2(SnS) = Op(1).

• M2: H(t) = ∫ ρ((zt)/S)dF(z) has the singular minimum at t = 0.

• M3: For some δ > 0 and η > 1,

$∫-∞∞[∣z∣sup∣u∣≤δ|ψa″(e-v(z+u)S)|]ηdF(z)<∞$

and

$∫-∞∞[∣z∣2sup∣u∣≤δ|ψa″(e-v(z+u)S)|]ηdF(z)<∞,$

where $ψa′(z)=(d/dz)ψa(z)$ and $ψa″(z)=(d2/dz2)ψa(z)$.

• M4: ψc(z) should be a continuous and piecewise linear function with knots at μ1, …, μr. Henceforth the derivative $ψc′(z)$ is a step function

$ψc′(z)=αν, μν

where α0, α1, …, αr ∈ ℜ1, α0 = αr = 0, and −∞ = μ0 < μ1 < ··· < μr < μr+1 < ∞. f(z) is assumed to be bounded in a neighborhood of Sμ1, …, Sμr.

• M5: ψb(z) = λν for qν < zqν+1, ν = 1, …, m where −∞ = q0 < q1 < ··· < qm+1 = ∞, −∞ < λ0 < λ1 < ··· < λm < ∞. f(z) and f′(z) are assumed to be bounded in Sq1, …, Sqm. We represent Mn asymptotically in the following functionals

$γ1=S-1∫-∞∞(ψa′(zS)+ψc′(zS))dF(z),γ2=S-1∫-∞∞z(ψa′(zS)+ψc′(zS))dF(z).$

Moreover, the following conditions are satisfied.

$X1. xi1=1, i=1,…,n,X2. n-1∑i=1n‖xi‖4=Op(1),X3. limn→∞ Qn=Q,$

where Qn = n−1XtX and Q is a positive definite p × p matrix.

Under these conditions, Mn is a solution of the equation

$∑i=1nxiψ (Yi-xittSn)=0.$

We calculate Sn follows in order to make Sn scale and regression invariant. Regression scores as defined below are used.

For α ∈ (0, 1), ân(α) = (ân1(α), …, ânn(α))t should be the unique solution to maximize

$∑i=1nYia^ni(α)$

with the constraint

$∑i=1nxija^ni(α)=(1-α)∑i=1nxij, j=1,…,G.$

Hajék (1965) proposed scores

$an*(Ri,α)={0,if Rin<α,Ri-nα,if Ri-1n<α

An increasing and square integrable function ϕ: (0, 1) → ℜ1, ϕ(α) = −ϕ(1 − α), 0 < α < 1 is chosen. For a number α0(0 < α0 < 12), ϕ is assumed to be standardized

$∫α01-α0ϕ2(α)dα=1.$

We define regression scores by ϕ as

$b^ni=-∫α01-α0ϕ(α)da^ni(α), i=1,…,n.$

That is,

$b^ni={n∫(Ri-1)nRinϕ(α)an*′(Ri,α)dα,if α≤(Ri-1)n≤1-α, Rin≤1-α,n∫(Ri-1)n1-αϕ(α)an*′(Ri,α)dα,if α≤(Ri-1)n≤1-α, Rin>1-α,n∫αRinϕ(α)an*′(Ri,α)dα,if α>(Ri-1)n, Rin≤1-α,0,if 1-α<(Ri-1)n,n∫α1-αϕ(α)an*′(Ri,α)dα,else.$

We define Sn as

$n-1∑i=1nYib^ni=n-1Xtb^n.$
3. Union-intersection principle and multiple testing

### 3.1. Construction of test statistics for each k gene

Suppose that γ1 is not equal to zero. For each k gene, the asymptotic distribution of Mn is as follows.

Theorem 1

The sequence

$n12γ^1 (Mn-βk)+γ^2 (SnS-1) e1$

follows the G-dimensional Gaussian distribution NG(0, σ2Q−1) asymptotically, where $σ2=∫-∞∞ψ2(z/S)dF(z)$(Jurečková and Sen, 1996).

The asymptotic variance of n1/2γ̂1(Mnβk) + γ̂2(Sn/S − 1)e1 is κk(= (γ̂1)−2σ̂2Q−1). The same factor (γ̂1)−2σ̂2 does not affect the parts of the following Zk and Vk. The last term γ̂2(Sn/S − 1)e1 is ignored while deriving a test statistic. Based upon the M-estimator Mn with theorem 1, we now tend to utilize the UIP in order to formulate a robust M-test as follows.

$H0k:θk=Aβk=0 H1k:θk=Aβk≥0,$

where the (G − 1) × G matrix

$A=(00000⋯01000⋯0-1100⋯00-110⋯⋮⋮⋮⋮⋮⋮000⋯-11).$

Zk = AMn and Vk = AQ−1At, where Mn is an estimator of βk. For ς = {1, …, G − 1} and every a : aς, we set a′ its complement and |a| its cardinality. For each a : aς, we partition Zk and Vk as follows.

$Zk=(ZkaZka′),Vk=(VkaaVkaa′Vka′aVka′a′).$

Write

$Zka:a′ =Zka-Zka′tVka′a′-1Zka′,Vkaa:a′ =Vkaa-Vkaa′Vka′a′-1Vka′a.$

By virtue of weak convergence of n1/2(Mnβk) to a G-variable Gaussian law, for n very large, we derive

$(nVk-1)12(Zk-θk)→DNG-1(0,I).$

Then, we construct the proposed test statistic

$Lk=∑a∈ςI(Zka:a′>0,Vka′a′-1Zka′≤0)(nZka:a′tVkaa:a′-1Zka:a′).$

### 3.2. p-value computation

We deal with high dimension low sample size data. n is small and we do not have asymptotic normality. The permutation distribution theory is still effective for such a setup. Under the null hypothesis of homogeneity, the joint distribution of n observations should be invariant under any permutation. We take all possible n!/(n1!n2! ··· nG!) as equally likely permutations. Henceforth, we construct conditionally distribution-free tests by utilizing the permutation. We calculate p-value for each k gene as below.

$Pk=Pr(Lk≥lk), k=1,…,K,$

where Lk is a test statistic and lk is an observed test statistic.

### 3.3. False discovery rate

We apply a proper multiple testing procedure (FDR) to K p-values to select which genes have monotone increasing or decreasing patterns.

FDR is defined as E(V/R|R > 0) · P(R > 0) in Table 1. V denotes the number of genes declared as differentially expressed among truly non-differentially expressed genes whereas S indicates the number of genes declared to be differentially expressed among truly differentially expressed genes. FDR has been in particular effective as the first alternative to the FWER to have broad acceptance in many scientific areas such as the life sciences, biology, chemistry, and medicine. For more details about the FDR, please see Dudoit et al. (2003), Storey (2002), and Sarkar (2006). For simulation study and real data analysis in next section, we select Benjamini and Hochberg’s FDR procedure (Benjamini and Hochberg, 1995) and Storey’s FDR procedure (Storey, 2002) among various FDR controlling procedures.

4. Numerical study

### 4.1. Simulation study

Simulation studies are conducted to show the performance of the proposed method and compare with the common procedure: ANOVA. We generate 100 random positions. Each position constitutes 5 groups with 30 observations. Each group contains 6 Gaussian variables with different means from others. Mean is assigned to each group in increasing order. The difference in the means between two adjunct groups increases by 1. We compute Mn based on those gene expression levels and corresponding Lk for each random position k = 1, …, 100 by using a union-intersection principle in Section 3 that also enable computing the proposed test statistics. We then calculate 100 p-values via the proposed method by permuting the distribution of test statistics with about 30!/(6!)5 iterations. For comparison, we also compute 100 p-values via test statistics using ANOVA. We now determine which positions have monotone increasing patterns across the groups.

FDR is a method to compute the rate of type I errors in multiple testing. FDR-controlling procedure is used to control the expected proportion of discoveries (rejected hypotheses) that are false. As a multiple testing procedure, we apply Storey’s FDR procedure and Benjamini and Hochberg’s FDR procedure to a set of p-values. Storey denotes Storey’s FDR procedure and BH denotes Benjamini and Hochberg’s FDR procedure. Table 2 shows that for each α (significance level), Storey (proposed) and BH (proposed) are less than α for all π0 (the proportion of true null hypotheses), which means they control the FDR at all levels of α and π0. Storey (normal); in addition, BH (normal) using the existing test statistics (ANOVA) fail to control the FDR since they are more than α for some α and π0 (Table 2). We infer that the proposed test statistics are optimistic for controlled FDR procedures and adequately reflect a monotone increasing or decreasing pattern of the response variable; however, existing test statistics fail to include a monotone increasing or decreasing trend of the response variable (failing to reflect genuine feature of the response variable).

### 4.2. Real data analysis

Application to the response of human fibroblasts to serum data set (Iyer et al., 1999) introduced the chronological program at 12 time points of gene expression levels throughout the physiological response of fibroblasts to serum using cDNA microarrays that included 8,613 genes over 24 hours. They had 517 genes whose expression levels differed concerning the stimulation of serum. They measured gene expression levels at 0, 0.25, 0.15, 1, 2, 4, 6, 8, 12, 16, 20, 24 hours after the stimulation of serum. Our data constitute 1000 genes measured at 6 time points 1, 2, 4, 6, and 8. One gene has 6 groups with 6 observations per group. We take log-transformation of gene expression levels. In order to compute the proposed test statistics in Section 3, we calculate Mn based on these gene expression levels and the resulting Lk for each gene k = 1, …, 1000 using a union-intersection principle in Section 3. We then compute 1000 p-values via the proposed method by permuting about 36!/(6!)6 iterations. We also calculate 1000 p-values using test statistics by ANOVA. We then apply FDR procedures to the p-values. Estimated $π^$0 = is 0.34 due to Storey’s FDR, which means that the proportion of no differentially expressed genes is 0.34 among all genes. Storey’s FDR procedure and Benjamini and Hochberg’s FDR procedure are then applied to them. Table 3 shows that for each α (significance level), values of Storey (proposed) and BH (proposed) are less than α, which means that they control FDR at all levels of α. Storey (normal) is more than α for some α (= 0.1, 0.01) (Table 3). BH (normal) is more than α for some α (= 0.05, 0.01) (Table 3). Storey (normal) and BH (normal) using the existing test statistics (ANOVA) do not control the FDR. To be more specific, we conclude that the proposed test statistics are well designed (optimal) to control FDR procedures and adequately follow a monotone increasing or decreasing pattern of the response variable. However, ANOVA test statistics do not account for the monotone increasing or decreasing trend of the response variable.

5. Concluding remarks

### 5. Concluding remarks

Classification of genes in a microarray data is often involved in and order-restricted constraint. Outlier arrays often make standard statistical inference useless. For this reason, we propose an M-estimator using a union-intersection principle to test which gene has a monotone increasing or decreasing pattern across the groups. FDR procedures based upon proposed test statistics control the FDR at all levels of α and π0, but FDR procedures for the test statistics from normal theory do not control the FDR at all levels of α and π0. Future research can be designed to develop test statistics based on a pure small sample theory. We extended the large sample theory results of Mn to small sample case. Microarray studies include many high dimension low sample size cases; therefore, we need to thoroughly explore test statistics with a small sample theory. In addition, we specifically designed a union-intersection principle for a monotone increasing or decreasing pattern of gene expression levels (the response variable) and an inequality-oriented inference. It is therefore worth developing union-intersection principle for a general case of constrained inference such as a shape constraint. In addition, more robust statistical methods can be developed for other measures such as L–statistics and a minimum distance method after examining the influence function for each measure.

TABLES

### Table 1

Multiple hypothesis testing

Not rejected Rejected Total
True null U V m0
Non-true null T S mm0

mR R m

### Table 2

False discovery rate (1)

α π0 Storey (proposed) BH (proposed) Storey (normal) BH (normal)
0.10 0.20 0.050 0.057 0.074 0.069
0.40 0.063 0.077 0.089 0.098
0.60 0.087 0.085 0.105 0.102

0.05 0.20 0.024 0.035 0.035 0.041
0.40 0.036 0.049 0.051 0.052
0.60 0.044 0.039 0.055 0.054

0.01 0.20 0.007 0.009 0.009 0.008
0.40 0.008 0.008 0.010 0.011
0.60 0.009 0.009 0.011 0.012

Storey = Storey’s FDR procedure; BH = Benjamini and Hochberg’s FDR procedure; FDR = false discovery rate.

### Table 3

False discovery rate (2)

α Storey (proposed) BH (proposed) Storey (normal) BH (normal)
0.10 0.089 0.077 0.110 0.099
0.05 0.039 0.047 0.049 0.053
0.01 0.009 0.007 0.013 0.015

Storey = Storey’s FDR procedure; BH = Benjamini and Hochberg’s FDR procedure; FDR = false discovery rate.

References
1. Benjamini, Y, and Hochberg, Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B. 57, 289-300.
2. Choi, D, Choi, H, and Park, C (2016). Classification of ratings in online reviews. Journal of the Korean Data and Information Science Society. 27, 845-854.
3. Dudoit, S, Shaffer, JP, and Boldrick, JC (2003). Multiple hypothesis testing in microarray experiments. Statistical Science. 18, 71-103.
4. Hajék, J (1965). Extension of the Kolmogorov-Smirnov test to regression alternatives. Proceedings of Bernoulli-Bayes-Laplace Seminar, LeCam, L, ed, pp. 45-60
5. Huber, PJ (1981). Robust Statistics. New York: John Wiley & Sons
6. Iyer, VR, Eisen, MB, and Ross, DT (1999). The transcriptional program in the response of human fibroblasts to serum. Sciences. 283, 83-87.
7. Jang, E, Choi, S, and Kim, D (2018). Robust Bayesian beta regression analysis. Journal of the Korean Data and Information Science Society. 29, 27-36.
8. Jurečková, J, and Sen, PK (1996). Robust Statistical Procedures, Asymptotics and Interrelations. New York: Wiley
9. Kim, G, and Park, C (2015). Analysis of English abstracts in journal of the Korean data and information science society using topic models and social network analysis. Journal of the Korean Data and Information Science Society. 26, 151-159.
10. Lim, Y (2018). M-estimation of the long-memory parameter by Laplace periodogram. Journal of the Korean Data and Information Science Society. 29, 523-532.
11. Peddada, SD, Lobenhofer, EK, Li, L, Afshari, CA, Weinberg, CR, and Umbach, DM (2003). Gene selection and clustering for time-course and dose-response microarray experiments using order-restricted inference. Bioinformatics. 19, 834-841.
12. Robertson, T, Wright, FT, and Dykstra, RL (1988). Order Restricted Statistical Inference. New York
13. Roy, SN (1953). On a heuristic method of test construction and its use in multivariate analysis. Annals of Mathematical Statistics. 24, 220-238.
14. Sarkar, SK (2006). False discovery and false nondiscovery rates in single-step multiple testing procedures. The Annals of Statistics. 34, 394-415.
15. Son, N, and Kim, M (2017). A study on robust regression estimators in heteroscedastic error models. Journal of the Korean Data and Information Science Society. 28, 1191-1204.
16. Storey, JD (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B. 64, 479-498.