aDepartment of Statistics, Duksung Women’s University, Korea
Correspondence to:1 Department of Statistics, Duksung Women’s University, 33 Samyang-ro 144-gil, Dobong-gu, Seoul 01369, Korea. E-mail: junsik@duksung.ac.kr
Received August 19, 2024; Revised September 23, 2024; Accepted October 18, 2024.
Abstract
Estimating the proportion of signals is a fundamental subject in a large scale inference for genome-wide association study (GWAS). In GWAS, hundreds of thousands of single-nucleotide polymorphisms (SNPs) are identified and often only p-values of each SNP are reported. The goal in GWAS is to test if any of these SNPs are associated with some disease or a phenotype of interest. When correlation structures of p-values are unknown, p-values under the null are no longer uniformly distributed and most existing methods become instable and inconsistent. To reduce these instability and inconsistency, we propose a new method that estimates the proportion of signals under arbitrary dependency structures of input data. We use a conditional distribution approach to account for the correlation information of observed p-values. We show the consistency of the proposed method under arbitrarily correlated p-values. We investigate the proposed method with extensive numerical studies and real data analysis to compare the performance of our method with existing methods under various dependency structures.
Keywords : proportion of null hypotheses, p-value, multiple testing, GWAS
1. Introduction
In the field of a large scale multiple testing procedures, estimating the proportion of signals, or equivalently true nulls is one of major research problems. However, it is difficult to employ a joint structure required for testing high dimensional problems. As a result, multiple test analysis is typically performed based on the p-values obtained from marginal tests. In this approach, estimation of the true signal proportion is often crucial. For example, in genome-wide association study (GWAS), a commonly used method is the set-based analysis which partitions the single-nucleotide polymorphisms (SNPs) into genes based on the biological information and then tests associations between the disease and each gene. Each gene consists of several SNPs, and the global testing of the SNPs of the gene can be used to determine whether each gene affects a certain disease. While most SNPs have no effect on diseases or phenotypes, only few genetic differences in SNPs are significant. One of the main interests in GWAS is determining how many SNPs are differently expressed among a large number of SNPs.
Another important reason for estimating the proportion of null hypotheses is that it is connected to a conservativeness of multiple testing procedures. In multiple testing procedures such as Benjamini and Hochberg (1995) and Storey (2002), controlling the false discovery rate (FDR) or the positive FDR (pFDR) depends on an upper bound of the proportion of null hypotheses which induces the conservativeness of the multiple testing procedures. By reducing the conservativeness of the FDR or pFDR with an accurate estimation of the proportion of null hypotheses, the average power of the testing procedures can be improved.
Methods of estimating the proportion of null hypotheses for independent p-values are well discussed. Suppose that independent p-values, p1, …, pd are given from a mixture density,
f(p)=π0f0(p)+(1-π0)f1(p),
where π0 is the proportion of null hypotheses, π0 = P(p is null), and f0, f1 are densities under the null and alternative, respectively. Under the mixture model, Langaas et al. (2005) proposed an estimator of π0 based on the ordered p-values. With an additional assumption of f1 being a decreasing convex function, they used an iterative method based on the steepest descent algorithm to estimate π0. Genovese and Wasserman (2004) discussed estimation of π0 based on a stochastic process. On the other hand, Schweder and Spjøtvoll (1982) proposed to use graphical methods to estimate π0.
However, p-values are usually correlated so that the assumption of independence is often unrealistic in practice. If p-values are arbitrarily dependent, observed p-values are not uniformly distributed but have increasing or decreasing patterns in practice. Even worse, on occasions, there are no visible patterns for p-values. The left and middle histograms in Figure 1 show the 500 null p-values, p1, …, p500 obtained by pi = 2(1 – Φ(|Xi|)) for i = 1, …, 500, where Φ is the cumulative distribution function of standard normal distribution and Xi is the ith element of X⃗ = (X1, …, X500)T ~ N500(0, ∑) with
Σ=(σij)i,j=1,…,500={1,i=j0.5,i≠j.
The right histogram in Figure 1 represents p-values of “NLGN1” gene in real dataset of Crohn’s disease used in Duerr et al. (2006).
The observed p-values in the left panel of Figure 1 indicate a decreasing pattern although the middle panel represents increasing pattern of p-values. For large p-values close to 1, the assumption of independence is no longer valid and most existing methods relying on the uniformity assumption can not handle these correlated cases. The right panel of Figure 1 also shows the deviation from the uniform pattern. Whereas in GWAS there are many factors that affect p-values of SNPs to deviate from the uniform distribution, it is hard to measure the effects of various factors accurately when there is no prior information. Therefore, it is often necessary for a robust statistical methodology that can take arbitrary correlation structures into account.
Recently, there are a number of studies considering multiple testing procedures for arbitrarily dependent p-values. Liu and Xie (2020) proposed a global testing procedure based on the Cauchy combination of dependent p-values. Kim and Park (2025) analyzed p-value combining methods according to heaviness of test statistics when observed p-values are arbitrarily dependent. Those studies deal with a case where only p-values are observed and test statistics or raw datasets are not provided.
In contrast, there are limited studies on estimating the proportion of true null hypotheses when arbitrarily dependent p-values are observed. Friguet and Causeur (2011) considered the empirical estimate of the probability distribution of p-values to estimate the proportion of null hypotheses based on Schweder and Spjøtvoll (1982). Fan and Han (2017) adjusted the dependence using the factor modeling of test statistics by estimating the principal factors. Methods of Friguet and Causeur (2011) and Fan and Han (2017) both are proposed under a factor analytic framework so that information of structures for raw test statistics and weak dependence covariance are necessary. However, when only p-values are observed, it is not possible to use their methods. Tong et al. (2013) used a linear fitting of marginal distribution of p-values under an assumption that the pattern of true null p-values is linear or unimodal to estimate the null distribution. On the other hand, the linear fitting too simplifies patterns of the observed p-values and the estimated proportion of null hypotheses often has large variations.
Sometimes it is unreasonable to consider the patterns in Figure 1 as a marginal probability of null p-values. Since p-values, p1, …, pd, are a d-dimensional vector of p-values, analyzing based on the marginal distribution fails to reflect patterns in the histogram when p-values are correlated. It is more natural to employ a conditional distribution. In this paper, we investigate the pattern in the histogram obtained when the values are observed as a conditional distribution and propose an estimator from this point of view, and show the consistency of the proposed estimator.
The rest of this paper is organized as follows. In Section 2, we propose a method estimating the proportion of null hypotheses under arbitrarily dependent p-values. We estimate the null distribution with a kernel distribution estimator based on the conditional probability of the null distribution. Choosing λ is also discussed. In Section 3, extensive numerical simulations show that the stability and accuracy of the proposed method under various dependency structures. In Section 4, we apply the proposed method to two case studies. Concluding remarks are presented in Section 5.
2. Estimating the proportion of null hypotheses under arbitrarily correlated p-values
In this section, we propose an estimator of the proportion of true null hypotheses when arbitrarily dependent p-values are given. In terms of the hypotheses testing procedure, one can approach the problem in the form of a mixture from the perspective of estimating the proportion of the non-null or alternative hypotheses. When p-values are obtained from the null and alternative distributions, the mixture distribution (1.1) of p-values can be considered as the marginal distribution of p-values.
Prior to propose our method, let us first briefly consider other existing methods for estimating the proportion of null hypotheses in that our proposed method extends these methods to accommodate the arbitrary dependence.
2.1. Methods of estimating the proportion of null hypotheses
If p-values are independent, the null density of p-values, f0 in (1.1), is the probability density function of the uniform distribution so that (1.1) can be expressed as follows.
f(p)=π0+(1-π0)f1(p).
From (2.1), estimating π0 is equivalent to estimating the mixing coefficient of the mixture model.
Storey (2002) assumed that the probability density function f1 of the alternative p-values is a decreasing function with f1(1) = 0. They further assumed the “zero-assumption” as in Efron (2010), that is, any p-value greater than some large constant λ ∈ (0, 1) is assumed to be the null p-value. In other words, if p-values are close to 1, then the value of f1 is close to 0, that is, f1(p) ≈ 0 for p ≈ 1. The formal definition of zero-assumption can be expressed as follow.
Definition 1. (Efron (2010))For a certain subsetof [0, 1],
f1(p)=0forp∈A0.
Integrating the both sides of (2.1) with respect to p from a given constant λ ∈ (0, 1) to one implies
∫λ1f(p)dp=π0(1-λ).
Using (2.2), the proportion of true null hypotheses in (λ, 1] can be expressed by
π0λ:=#{pj>λ}d(1-λ).
Storey (2002) proposed an estimator of π0 as follows:
π^0:=π0λ=#{pj>λ^}d(1-λ^),
where λ̂ can be chosen among λ ∈ (0, 1) to minimize the mean square error with bootstrap.
Under the “zero-assumption”, Schweder and Spjøtvoll (1982) proposed a graphical methods to use an approximation such that for t ∈ (0, 1),
1dE[∑i=1dI(pi>t)]≈π0(1-t).
By plotting the number of p-values greater than t against 1 – t, Schweder and Spjøtvoll (1982) estimated π0 based on p-values deviating from a straight line.
Tong et al. (2013) used a linear fitting in estimating the proportion of the true null p-values when true null distribution of p-values is linear or unimodal. For example, suppose that only p-values in the middle and right panels in Figure 1 are observed. When only p-values are given, one can not assume the independence between p-values or the specific type of dependency structures. For a given λ ∈ (0, 1), let W0(λ) be the total number of the true null p-values in (λ, 1] and
π0(λ,1]:=W0(λ)dπ0
be the proportion of the true null p-values in (λ, 1] over the range of [0, 1]. When π0(λ,1] is estimated, Tong et al. (2013) defined the following estimator,
π^0λ:=W(λ)d·π^0(λ,1],
where W(λ)=∑j=1dI(pj>λ) and π^0(λ,1] is an estimator of π0(λ,1] based on the linear fitting. It follows from the “zero-assumption” that W0(λ) ≈ W(λ) which implies that
where λ̂ is chosen based on bootstrap as Storey (2002).
However, the assumptions that p-values are independent or weakly dependent are often fail to estimate π0. Indeed, when p-values are independent, (2.3) is consistent under the independence assumption so that the estimator of Storey (2002) is also consistent. However if p-values are correlated, the following lemma shows that the consistency cannot be attained.
Lemma 1. If p-values, p1, …, pd are independent, (2.3) is a consistenct estimator under the “zero-assumption”. However, if p-values are dependent, the consistency of (2.3) can not hold.
Proof: Assume that p-values, p1, …, pd are independent. Under the “zero-assumption”, (2.3) is an unbiased estimator, trivially. Now, consider the variance of the estimator. For large λ ∈ (0, 1),
where I(·) is an indicator function. Under the independence of p-values, the numerator can be expressed by
Var(∑j=1dI(pj>λ))=∑i=1d{P(pi>λ)[1-P(pi>λ)]}.
Using the “zero-assumption” in Definition 1 with , we have
P(pi > λ) = P(pi > λ|pi is from null) · π0 + P(pi > λ|pi is from alternative) · (1 – π0) = (1 – λ)π0, since P(pi > λ|pi is from alternative) = 0 under the “zero-assumption”. Then,
Var(∑j=1dI(pj>λ))=d(1-λ)π0(1-π0(1-λ))
so that
Var(∑j=1dI(pj>λ))d2(1-λ)2=π0(1-π0(1-λ))d(1-λ)..
Therefore as d → ∞, the standard error goes to zero.
Now assume that p-values are correlated. Under the “zero-assumption”,
If pi and pk are not independent for i ≠ k = 1, …, d, P(pi > λ|pk > λ) – P(pi > λ) does not depend on dimension d and is nonzero. Hence it implies Var (π̂0(λ)) > 0 so that the variance cannot goes to zero as d → ∞. Hence the estimator (2.3) is not a consistent estimator of π0 if p-values are correlated.
In the next subsection, we propose a method to use a conditional probability which is robust to dependence structures p-values in estimating π0.
2.2. Estimating the proportion of null hypotheses based on conditional probability
Throughout this paper, we only use the “zero-assumption” but p-values are assumed to be arbitrarily correlated so that the covariance structure of p-values or input data can be arbitrary.
Let X→=(X1,…,Xd)T~fdX(X;μ,Σ) be test statistics of d-dimension, where fdX is any d-dimensional density function of X⃗, μ = (μ1, …, μd)T is a d-dimensional mean vector and ∑ is a covariance matrix of X⃗ ∈ ℝd. Sometimes it is of interest to test the following hypothesis problem,
H0:μ=0,H1:μ≠0,
where 0 is a d-dimensional vector whose elements are all 0. If H0 is rejected, then estimating how many signals in input data exist is an important subsequent subject. So, in this paper, we proposed to method that estimates the number of signals or the proportion on true null hypotheses.
Here, we assume that fdX(X):=fdX(X;μ,Σ) can be expressed as a mixture distribution, that is,
X→=(X1,…,Xd)T~fdX(X)=π0fd,0X(X)+(1-π0)fd,1X(X),
where fdX→,fd,0X→ and fd,1X→ are d-dimensional multivariate density functions, ℝd → ℝ, of marginal, null and alternative X⃗, respectively. Indeed, fd,0X→≡fdX(X;0,Σ) and fdX→≡fdX(X;μ,Σ) where at least one elements of μ are nonzero. In this case, p-values for testing (2.6) are obtained by transforming X⃗ based on marginal tests and defined by pi=1-Fd=1X(Xi) for i = 1, …, d if an one-sided test is considered and pi=2(1-Fd=1X(∣Xi∣)) for i = 1, …, d if a two-sided test is considered where Fd=1X is the marginal distribution function of fdX.
If X1, …, Xd are independent and identically distributed, their p-values, p1, …, pd can be considered as d independent random variables from the uniform distribution. Then
pi˜i.i.d.f(p)=π0·U(0,1)+(1-π0)f1(p),
where U(0, 1) denotes the uniform distribution and f1 is the alternative distribution.
On the other hand, if the input data or test statistics X1, …, Xd are correlated, that is, if ∑ is not a diagonal matrix, then each p-value is correlated through the covariance matrix, ∑. Therefore, the joint null distribution of p-values is no longer the uniform distribution. Figure 1 represents certain patterns of the null p-values although a marginal distribution of each p-values is still the uniform distribution.
For dependent input data or test statistics, d p-values are constructed from one d-dimensional observation X⃗ with a correlation structure, so it is more natural that the joint distribution of p⃗ is defined based on the following multivariate mixture model:
p→=(p1,…,pd)T~fd(p)=π0fd,0(p)+(1-π0)fd,1(p),
where fd, fd,0 and fd,1 are d-dimensional multivariate density functions, [0, 1]d → ℝ, of joint, null and alternative p-values, respectively. And π0 in (2.8) is the probability that the d-dimensional vector of p-values is true null.
In order to estimate π0 with correlation structures, we propose to use a conditional distribution of p-values. From the concept of conditional distribution, patterns in Figure 1 can be considered to be from the conditional distribution when p-values are observed. Hence, rather than using (2.8), we consider the following conditional mixture model given p-values, p⃗ = (p1, …, pd)T,
p∣p→~f(p∣p→):=π0(p→)f0(p∣p→)+(1-π0(p→))f1(p∣p→),
where p is a random variable following the conditional distribution of f (·|p⃗). In (2.9), f0(·|p⃗) and f1(·|p⃗) are conditional distributions given the observed p-values under null and alternative, respectively. Similarly, π0(p⃗) in (2.9) is defined by the conditional probability of true null when p-values are observed, that is, π0(p⃗) = P(p is null|p⃗) so that π0(p⃗) is a random variable whose randomness comes through p⃗.
The observed p-values construct a density function that includes the dependency structure and the pseudo random variable p that follows the density function is used to obtain the probability measure. A marginal distribution of the random variable p is equivalent to that of given p-values, that is, f (p) = f (pj), for all j = 1, …, d, and f (p|p⃗) is conditional density function of p given p⃗ representing a distribution of a p-value when d p-values are obtained. Increasing or decreasing patterns of frequencies of p-values in Figure 1 is from the conditional density function when p⃗ = (p1, …, pd)T is given.
f0(p|p⃗) is no longer the uniform distribution and f0 and f1 depend on p⃗. With the “zero-assumption” that the random variable p closes to 1 is obtained from the null distribution. In other words, for a large λ ∈ (0, 1), the conditional probability that a p-value from the alternative given other observed p-values is greater than λ is assumed to be zero. Specifically, we assume that for a given λ ∈ (0, 1),
∫λ1f1(p∣p→)dp=PH1(p>λ∣p→)=0,
where PH1 is the probability measure under H1. Under the “zero-assumption”, we have the following relation from the conditional mixture model (2.9). For a large λ ∈ (0, 1),
Note that both F̄(λ|p⃗) and F̄0(λ|p⃗) are random variables depending on p⃗ = (p1, …, pd)T. As a result, π0(λ, p⃗) in (2.11) is a random variable which implies that “estimating” π0(λ, p⃗) is not an appropriate notation, in a strict sense. However, when p-values, p1, …, pd are observed, “estimating” the proportion of null hypotheses makes sense. From Figure 1, histograms show different patterns depending on observed p-values. So estimating the proportion of true null hypotheses based on observed p-values is reasonable.
In the respect of the conditional distribution, F̄0(λ|p⃗) in (2.10) is an unbiased estimator of 1 – λ when p-values are independent, since
where Ep⃗ denotes an expectation with respect to p⃗, E·∣p→0 is a conditional expectation of a p-value given p⃗ under the null and E0 denotes a total expectation under the null. However, when dependent p-values are observed, the marginal null probability deviates from the conditional null probability resulting in inconsistent estimators if 1 – λ is used as a marginal probability that a null p-value is greater than λ. The mean squared error of F̄0(λ|p⃗) can be expressed as follows.
By definition, MSE in (2.12) is non-negative and as a deviation of 1–λ from the conditional probability increases, the mean squared error increases. The following example shows the deviation when input data or test statistics are collected from a multivariate normal distribution.
Example 1. Let input data X⃗ = (X1, …, Xd)T follow the d-dimensional normal distribution with mean zero and covariance matrix ∑, that is
X→~Nd(0,Σ).
Assume that diagonal elements of ∑ are all 1 and off-diagonal elements are ρ ∈ [0, 1) which is a compound symmetry structure. Consider an one-sided test that p-values are obtained by pi = 1 – Φ(Xi), where Φ is a distribution function of the standard normal distribution. Then by the conditional distribution of multivariate normal distribution implies that
Xi∣X→-i~N(Σi,-iΣ-i,-i-1X-i,1-Σi,-iΣ-i-1Σ-i,i),
where X⃗−i = (X1, …, Xi−1, Xi+1, …, Xd)T is a vector whose elements are equal to X⃗ without ith element, ∑i,−i = (ρ, …, ρ)T which is a 1 × (d – 1) dimensional vector, Σ-i,i=Σi,-iT and ∑−i,−i is a (d – 1) × (d – 1) compound symmetry covariance matrix with diagonal elements 1 and off-diagonal elements are ρ. Hence the conditional probability that pi is greater than λ given p⃗−i = (p1, …, pi−1, pi+1, …, pd)T can be expressed as follows.
where Z ~ N(0, 1). Therefore, when ρ ≠ 0, using 1 – λ as an estimator of F̄0 can not guarantee a consistency.
In Example 1, using 1 – λ is inconsistent when exchangeable p-values are given. As an equal correlation coefficient of exchangeable input data or p-values increases, the consistency of Storey’s estimator is not valid. Example 2 shows that if the correlation matrix of input data is a type of AR(1), using 1 – λ is also inconsistent.
Example 2. Following Example 1, let X⃗ = (X1, …, Xd)T ~ Nd(0, ∑). Here, we assume that the covariance matrix ∑ to be an exponentially decaying correlation matrix or the AR(1) correlation matrix such that for ρ ∈ (0, 1),
Σ=(σij)i,j=1,…,d,
where σi j = ρ|i−j| for i, j = 1, …, d. Let p1, …, pd be p-values corresponding X1, …, Xd and p be a p-value whose conditional distribution under the null is given as PH0 (·|p1, …, pd). In order to preserve the AR(1) correlation matrix in (2.13), denote the p-value p as pd+1 and Xd+1 is a corresponding input data. We estimate π0(λ) using a conditional probability of pd+1 when p1, …, pd are observed. Then Cov(Xd+1, Xi) = ρd+1−i for i = 1, …, d. Similar to Example 1,
Therefore, if elements of correlation matrix of input data are exponentially decaying with ρ ≠ 0, then using 1 – λ as an estimator of F̄0 can not guarantee consistency.
2.3. Proposed method
In this section, we propose an estimator of π0 defined in the previous section and show that the proposed estimator is consistent when d → ∞. Recall the definition of π0 in (2.11) with respect to the conditional probability such that
Then π0(λ, p⃗) denotes a ratio of the conditional probability that a p-value is greater than a given λ when p-values are observed to a conditional probability of a true null p-value being greater than a given λ.
As shown in the previous section, one can replace F̄0(λ|p⃗) with 1 – λ with the independence assumption as an unbiased estimator of Ep⃗F̄0(λ|p⃗). However, 1 – λ deviates from the F̄0(λ|p⃗) if p-values are correlated. If one can choose a more stable estimator of F̄0(λ|p⃗), the standard error of the estimator would decrease. To reduce the mean squared error in (2.12), one can choose an estimator of the null conditional probability, F̄0(λ|p⃗). Tong et al. (2013) pointed out that although the marginal distribution of the null p-values remains a standard uniform distribution regardless of the dependence of p-values, variations of the expected frequencies of the observed p-values increase as the correlation of p-values increases. They concluded that one can improve the estimation of the proportion of true nulls by knowing the pattern of the observed p-values. As a result, they proposed estimators using a linear fitting by assuming that the pattern of null p-values is linear. However, the assumption of linearity excessively simplifies patterns of p-values under arbitrarily dependent input data.
Revisiting the patterns of histograms of p-values in Figure 1, the histogram shows frequencies of d correlated p-values from one d dimensional observation. In this sense, one can explain the patterns of p-values based on the conditional distribution. In other words, on the histogram, the number of p-values greater than a large λ indicates F̄0(λ|p⃗), under the “zero-assumption”. Hence we propose to use a kernel distribution function to estimate F̄0(λ|p⃗). In Duong (2016), a cumulative distribution function, F(x), is estimated by the usual kernel estimator,
F^(x;H)=1d∑i=1dKH(x-Xi),
where KH(x-Xi)=∫-∞x-Xi∣H∣-1/2K(H1/2w)dw, K is a multivariate kernel function and H is the bandwidth matrix. To use the kernel estimation method for F̄0(λ|p⃗), we use the kcde function in the ks package of R. The kcde function covers 1 to 3 dimensional input data and in this paper, we use 1-dimensional method for d p-values. Denote F¯^0(λ∣p→) as the kernel distribution estimator of F̄0(λ|p⃗). For a given λ ∈ (0, 1), we define an estimator of π0(λ, p⃗) for observed p-values, p⃗, as follows.
π^0(λ,p→):=1d∑i=1dI(pi>λ)F¯^0(λ∣p→).
Note that correlations between p-values are implicit in the denominator of (2.15), F¯^0(λ∣p→), so that the proposed estimator can be applied under any dependency structure of input data. To make the definition in (2.15) valid, assume that there is at least one pi such that pi > λ for any λ ∈ (0, 1) and F¯^0(λ∣p→)>0 for any λ ∈ (0, 1).
The following theorem shows the consistency of π̂0(λ|p⃗) in (2.15). Prior to the proof of the theorem, we first consider the convergence of the conditional probability and a consequent assumption. As d increases, the conditional probability of a p-value given observed d values, p1, …, pd converges to a conditional probability given infinite sequence of p-values, that is,
P(p>λ∣p→)→P(p>λ∣p→∞),
where p→=(p1,…,pd)∈σ(∪i=1dPi) and p→∞=(p1,p2,⋯)∈σ(∪i=1∞Pi). Here, we denote that for i = 1, 2, · · ·, ℘i is a filtration or a σ-field containing p1, …, pi and σ(∪i=1dPi) and σ(∪i=1∞Pi) are σ-fields generated by ∪i=1dPi and ∪i=1∞Pi, respectively. Correlation structures between the p-value, p, and observed p-values are maintained even the number of observed p-values increases. In addition to the p-value, p, we assume that the conditional probability of one of the observed p-value is the same as the number of given p-values increases. Specifically, the following assumption represents that as the number of observed p-values increases, a conditional probability given infinite sequence of p-values equals for any i = 1, 2, . . ..
Assumption 1 corresponds to the equality of conditional distribution of p-values given infinitely many p-values and is mild in that in Example 1, the conditional distribution under the null converges to a non-degenerate distribution which is index-free. For an exchangeable case,
where μ is a signal under H1. So if p-values are exchangeable, Assumption 1 holds. Also, in Example 2, the conditional distribution converges to a distribution which is not related to index of p-values. So conditional distributions of any p-value under the null converge to the same distribution and Assumption 1 also holds in AR(1). The following theorem shows that the proposed estimator is consistent under Assumption 1.
Theorem 1. Under the “zero-assumption” and Assumption 1, the estimator π̂0(λ, p⃗) in (2.15) is a consistent estimator in L2sense.
Theorem 1 implies that when p-values are correlated, if the probability of a p-value greater than some constant λ under the null can be estimated correctly, π0 can be estimated with consistency. Information of correlations between p-values is projected to 1-dimensional space of conditional distribution of a generic p-value if p-values are observed. By estimating the conditional distribution under the null, one can summarize the information of correlations with a kernel distribution estimation F¯^0(λ). In Section 3, the proposed method estimates π0 correctly and stably even when the dimension is finite. It supports an efficiency of the proposed method compared to other methods that do not satisfy consistency. Before presenting extensive simulation studies, we discuss the choice of λ which is a key element of the “zero-assumption” in the next section.
2.4. Choosing λ
Consider the choice of a proper λ ∈ (0, 1) under the “zero-assumption”. If λ ≈ 1, there are few p-values in (λ, 1) so that estimating using the kernel method can result in an unreliable π̂0(λ). On the other hand, we can not ignore the alternative conditional probability, PH1 (p > λ|p⃗), if λ ≈ 0. Hence, an appropriate choice of λ is crucial in inference of π̂0(λ).
Efron (2004) used a prefixed interval based on the central proportion of the null distribution of input data which seems a strong condition. Methods using the p-value include an approach from the graphical form of the p-value (Schweder and Spjøtvoll, 1982) and from the break point on the plot (Turkheimer et al., 2001). By minimizing the mean squared error, Storey (2002) chose the optimal λ using a bootstrap method. Tong et al. (2013) extended the procedure of Storey (2002) with an adaptive bootstrap algorithm. They used a minimum value of a median observed p-value and 0.5 as a plug-in estimator of π0. However, since performances of Storey’s and Tong’s methods are quite similar, we use Storey’s bootstrap method, for the computational convenience. The bootstrap method chooses the optimal λ that minimizes the mean squared error of the estimator based on bootstrap samples. The following algorithm represents the bootstrap method.
Set ℛ = {0, 0.05, …, 0.95} to be the grid of λ.
For each λ ∈ ℛ, construct π̂0(λ, p⃗) using all p-values.
For each λ ∈ ℛ, generate B bootstrap copies of p-values then calculate π^0b based on each bootstrap copy for b = 1, …, B.
Estimate MSE(λ, p⃗) by
MSE^(λ,p→)=∑b=1B(π^0b(λ,p→)-π^0(p→))2B,
where π̂0(p⃗) = minλ∈ℛπ̂0(λ, p⃗).
Define λ^=arg minλ∈ℛMSE^(λ,p→) and obtain a plug-in estimator π^0λ^(p→):=π^0(λ^,p→).
Langaas et al. (2005) note that this bootstrap approach performs quite well in simulation studies. Indeed, in the next section, the bootstrap method shows rather stable results than other existing methods. Hence, in this paper, we use Storey’s approach to choose λ.
3. Numerical studies
In this section, we evaluate the numerical performance of the proposed method by comparing other competitive methods: Storey’s method (Storey, 2002), Langaas’ method (Langaas et al., 2005) and Tong’s method (Tong et al., 2013). To investigate the robustness for correlations between p-values in estimating the proportion of nulls, we generate p-values under various dependency scenarios including compound symmetry, polynomially decaying and exponentially decaying structures. Specifically, input data X⃗ are generated from multivariate normal distribution and multivariate t-distribution: X⃗ ~ Nd(X; μ, ∑) and X→~tdν(X;μ,Σ) where tdν denotes the d-dimensional t-distribution with degree of freedom ν, location vector and scale matrix μ and ∑, respectively. The following hypotheses are considered.
H0:μ=0vs.H1:μ≠0.
The dimension d has a different range from 50 to 3000. Let be a subset of {1, …, d} indicating indices of signals. Following Liu and Xie (2020), we assume all the signals for an alternative have the same strength μi:=μ0=3log(d)/s1/5 for , where s is the number of signals. The magnitude of signals is defined to account for the sparsity of signal. The proportion of signals is defined by π1 = s/d so that π0 = 1 – π1. We define p-values by pi = 2(1 – Φ(|Xi|)), i = 1, …, d where Xi is ith element of X⃗ for normal input data and pi = 2(1 – Tν(|Xi|)), i = 1, …, d where Tν is a distribution function of univariate t-distribution with degree of freedom ν. We use ν = 10 in this numerical study.
The dependency structures of the correlation matrix ∑ = (σi j)i, j=1,...,d are defined as follows.
Compound symmetry: For 0 < ρ < 1,
σij={ρi≠j1i=j
Polynomial decay: For ρ > 0,
σij={10.7+∣i-j∣ρi≠j1i=j
Exponential decay, AR(1): For 0 < ρ < 1,
σij=ρ∣i-j∣.
For each dependency structure, values of the correlation coefficient ρ decide the strength of the dependency. Different values of the correlation coefficient ρ are used to investigate effects of the dependencies. For the compound symmetry structure, ∑ = (σi j)i, j=1,...,d with σi j = ρ = 0, 0.1, …, 0.5 for all i ≠ j ∈ {1, …, d} and σi j = 1 for i = j. For the polynomially decaying structure, ρ = 0.1, 0.2, …, 2.5 and for the exponentially decaying structure, ρ = 0.1, …, 0.5 are used. The proportion of signals varies π1 = 0.05, 0.1, 0.15 to encompass both sparse and dense signal settings.
Figure 2 represents boxplots of estimates of the proportions for signals 0.05, 0.1 and 0.15 with the dimension 100 under compound symmetry correlation matrices with different correlation coefficients range from 0 to 0.5. The number of the bootstrap iterations to choose an optimal λ is B = 200 for each case. Figures 3 and 4 show boxplots of estimates of the proportions for signals under polynomially and exponentially decaying dependency structures, respectively. Through all boxplots, the proposed method has a much smaller standard error with considerably accurate estimates for all dependency and sparsity settings, while the method of Langaas et al. (2005) is accurate only for the independent case but standard errors increase as correlation coefficients increase, as mentioned in Langaas et al. (2005) and their Figure 5. Standard errors of Storey’s method and Tong’s method are much larger than the proposed and Langaas’ methods as the inconsistency of Storey’s and Tong’s methods are analyzed in the Section 2.
To evaluate effects of dimension, Figure 5, 6 and 7 present boxplots of estimates of the proportions for signals 0.1 with dimension 50, 500 and 3000 under compound symmetry, polynomially and exponentially decaying dependency structures. As the dimension becomes large, the proposed method outperforms other competitive methods and converges to the true values since the kernel estimation converges to true distribution function as d →∞. Figure 8 shows boxplots of estimates of the proportions for signals 0.1 with dimension500 under compound symmetry, polynomially and exponentially decaying dependency structures for multivariate t-distribution with degree of freedom ν = 1. The results are similar to cases of multivariate normal distribution, since the proposed method depends only on p-values not on the distribution of input data. Figure 9 shows the mean square errors (MSE) of estimates of each method under polynomially and exponentially decaying dependency structures, respectively. MSE of Storey’s method is too high in scale compared with other method. Figure 10 shows MSE of estimates of each method without Storey’s method. MSE of the proposed method are close to 0 for all correlation structures and correlation coefficients. On the other hand, MSE of Fan’s and Langaas’ methods increase as correlation coefficient increases.
4. Case study
We apply the proposed method to two genome-wide association studies (GWAS), schizophrenia (Lam et al., 2019) and Crohn’s disease (Duerr et al., 2006). Schizophrenia is a severe mental illness in which patients have an aberrant interpretation of reality. Symptoms of schizophrenia can include hallucinations, delusions, and profoundly abnormal thought and behavior, which can interfere with everyday functioning and be debilitating. For a large scale schizophrenia genetic study, Lam et al. (2019) identified 21 genome-wide significant associations in 19 genetic loci. The dataset contains 22,778 schizophrenia cases and 35,362 controls from 20 samples from East Asia. We conduct gene-based GWAS to discover significant genes that associate with schizophrenia and estimate the proportion of signals in significant genes, whereas the objective in Lam et al. (2019) is to identify individual SNPs that associate with schizophrenia. We group 10,694,925 SNPs to 7,093 genes using Genome Browser in a Box (GBiB) of University of California, Santa Cruz genome browser (UCSC). Genes consist of 1-4,467 SNPs and each SNP can be contained in multiple genes.
Crohn’s disease dataset consists of p-values of 242,535 SNPs. p-values are generated to test the case and control for the Jewish and non-Jewish cohorts by using the Cochran-Mantel Haenszel chisquare test computed marginally for each individual SNPs. 242,535 SNPs are grouped into 19,769 genes and each gene contains 1 to 676 SNPs using GBiB of UCSC. Figure 11 and 12 show histograms of p-values of SNPs in genes of Crohn’s disease and schizophrenia dataset. Histograms of p-values in each gene reflect that SNPs are correlated since patterns of p-values differ from the uniform distribution. In most of GWAS dataset, we have no additional information on joint distributions or dependency structures of p-values so that global testing procedures constructed under an independence assumption or based on estimating of dependency structure can not be applied.
To compare the proposed method with existing methods, we consider a two-stage method that uses the Cauchy combination method of Liu and Xie (2020) and bootstrap confidence interval in real data analysis. The Cauchy combination method is a testing procedure that is used to check if there are signals in each gene when p-values of each gene are arbitrarily dependent. Note that the Cauchy combination method is a global testing procedure that cannot identify how many signals in each gene. Hence, to find how many signals are in the identified genes, we apply our method other comparative methods to genes which are identified to have some signals. The process of the analysis is as follows:
For each gene in two diseases (Crohn’s disease and Schizophrenia), we apply the Cauchy combination method to check if there are some signals in each gene.
For genes identified that there are some signals, we apply our method and other comparative methods to find how many signals are in the significant genes.
We randomly resample p-values from the original p-values with replacement 100 times then apply our method and existing methods to those resampled bootstrap p-values. And then we construct bootstrap confidence intervals of those methods and compare lengths of the confidence intervals.
If the lengths of confidence interval are short, then the statistical power and stability of the method can be guaranteed. Hence, we present the confidence interval and its lengths in real data analysis as measures of the statistical power and stability of our method. If the lengths of confidence interval are short, then the power and stability of the method can be guaranteed.
First, we find significant genes using the Cauchy combination method. For schizophrenia dataset, we show randomly selected 5 genes among 155 genes which are identified to have signals and for Crohn’s disease, we show 5 genes among 12 identified genes in the following results. Table 1 presents estimates of proportion of signals for selected significant genes of schizophrenia and Crohn’s disease with their bootstrap 95% confidence intervals. Figures 13 shows boxplots of bootstrap estimates of each method. For all cases, from results in Table 1 and boxplots in Figures 13, the proposed method estimates the signals of proportion stably and the power of the method is guaranteed in that bootstrap confidence intervals are short compared to other methods whereas estimates of competitive methods are unstable since their bootstrap confidence interval is quite wide. For example, for “SLIT3” gene in Crohn’s disease in Table 1 and the third histogram in Figure 12, the proposed method reasonably estimates the number of signals, whereas other methods estimate unrealistically and unstably.
5. Concluding remarks
In this paper, we proposed a method to estimate the proportion of nulls or equivalently signals, when only p-values are given without information of input data or joint distribution so that correlation structures are unknown. When each p-values are independent, the null distribution of p-values follows an uniform distribution. However, in the case of non-independent, the robustness of the existing methods in estimating the proportion is weakened in that the values obtained from the null are not uniformly distributed, whereas the method proposed in this paper can be applied to any dependency structures in that correlation information is indirectly expressed in the form of conditional probability. In addition, the proposed method showed consistency, which is confirmed through numerical studies under various dependency settings. We also note that the proposed method requires no assumptions, for example, conditions of sparsity or dependency structures of p-values. We only use the “zero-assumption” for structures in p-values. Hence, the proposed method works even when there are no prior information of input data or test statistics.
However, there are few limitations of the proposed method. First, it requires the optimal value of the parameter λ and we obtained λ̂ numerically using bootstrap method. Hence to find λ̂, we searched many candidates points in [0, 1] so that it needs the computational efficiency. Second, since the goal of the proposed method is to provide a consistent estimator of the proportion of true null hypotheses, confidence bounds of the estimator are not provided. On the other hand, Meinshausen and Bühlmann (2005) proposed probabilistic lower bounds of the proportion of signals. Also, in simulation studies, the estimator of Langaas’ method tends to be larger than the target proportion of signals although confidence bounds of Langaas’ method is not presented explicitly in Langaas et al. (2005). However, although the proposed method outperforms in estimation, it does not provide confidence lower bounds of the proportion of signals. We leave those limitations as future works.
Appendix A
In this appendix, we prove the Theorem 1. Before presenting the proof, we first introduce a lemma which shows the expectation and variance of a ratio of two random variables.
Lemma 2. Let X1and X2be two random variables satisfying P(Xi > 0) = 1, for i = 1, 2. Then for μ1and μ2,
Proof of Theorem 1: Since F¯^0(λ) in (2.15) is a kernel distribution estimator of F̄0(λ) = PH0 (p > λ|p⃗), the consistency of kernel distribution function (Chacón and Rodríguez-Casal, 2010) implies that
For convenience, denote the random variable p as pd+1. Then as d → ∞, from Theorem 5.5.7 of Durrett (2010),
P(p>λ∣p→)=P(pd+1>λ∣p→)→P(p>λ∣p→∞-(d+1))a.s.,
and
PH0(p>λ∣p→)→PH0(p>λ∣p→∞-(d+1))a.s.,
since a σ-field ℘d which contains p1, …, pd is an increasing sequence with respect to d. It follows from the continuous mapping theorem with (A.2) and (A.3) as d →∞,
where Ep⃗∞ is the expectation corresponding probability measures defined on σ-fields generated by ∪i=1∞Pi and where ℘i is a σ-field containing p1, …, pi.
so that II → 0. Using similar steps in the convergence of bias, (A.4), it can be shown that VR → 0 as d →∞. Therefore, it can be concluded that as d →∞,
Varp→(1d∑i=1dI(pi>λ)-P(p>λ∣p→)PH0(p>λ∣p→))→0,
and from (A.1) that
∫[π^0(λ,p→)-π0(λ,p→)]2f(p→)d(p→)→0.
Figures
Fig. 1. Left and middle panels are histograms for p-values are obtained under a compound symmetric correlation structure whose off-diagonal elements are 0.5 and diagonal elements are 1 with dimension d = 500. Right panel is a histogram of “NLGN1” gene in real dataset.
Fig. 2. Simulation results of the case of compound symmetry dependence structure with different correlation coefficients for multivariate normal distribution. In each panels, 5 boxes indicate the proposed method(P), Storey(S), Langaas(L), Tong’s Linear method(T) and Fan (F) from the left. The proportion of signals is 0.05(top), 0.1 (middle) and 0.15 (bottom). Dimension is 100. Blue horizontal lines indicate 0.05, 0.1 and 0.15, respectively.
Fig. 3. Simulation results of the case of polynomially decaying dependence structure with different correlation coefficients for multivariate normal distribution. In each panels, 5 boxes indicate the proposed method(P), Storey(S), Langaas(L), Tong’s Linear method(T) and Fan (F) from the left. The proportion of signals is 0.05(top), 0.1 (middle) and 0.15 (bottom). Dimension is 100. Blue horizontal lines indicate 0.05, 0.1 and 0.15, respectively.
Fig. 4. Simulation results of the case of exponentially decaying dependence structure with different correlation coefficients for multivariate normal distribution. In each panels, 5 boxes indicate the proposed method(P), Storey(S), Langaas(L), Tong’s Linear method(T) and Fan (F) from the left. The proportion of signals is 0.05(top), 0.1 (middle) and 0.15 (bottom). Dimension is 100. Blue horizontal lines indicate 0.05, 0.1 and 0.15, respectively.
Fig. 5. Simulation results of the case of compound symmetry (top), polynomially (middle) and exponentially (bottom) decaying dependence structure with different correlation coefficients for multivariate normal distribution. In each panels, 5 boxes indicate the proposed method(P), Storey(S), Langaas(L), Tong’s Linear method(T) and Fan (F) from the left. The proportion of signals is 0.1. Dimension is 50. Blue horizontal lines indicate 0.1.
Fig. 6. Simulation results of the case of compound symmetry (top), polynomially (middle) and exponentially (bottom) decaying dependence structure with different correlation coefficients for multivariate normal distribution. The proportion of signals is 0.05. Dimension is 500. Blue horizontal lines indicate 0.05.
Fig. 7. Simulation results of the case of compound symmetry (top), polynomially (middle) and exponentially (bottom) decaying dependence structure with different correlation coefficients for multivariate normal distribution. The proportion of signals is 0.1. Dimension is 3000. Blue horizontal lines indicate 0.1.
Fig. 8. Simulation results of the case of compound symmetry (top), polynomially (middle) and exponentially (bottom) decaying dependence structure with different correlation coefficients for multivariate t-distribution with degree of freedom ν = 10. The proportion of signals is 0.1. Dimension is 500. Blue horizontal lines indicate 0.1.
Fig. 9. Mean squared errors for estimates of the proportion of signals under compound symmetry, polynomially and exponentially decaying dependence structure with different correlation coefficients for multivariate normal distribution. Columns from left to right correspond to the dimension, 50, 300, 500, 1000, 3000. The rows from top to bottom correspond to the signal proportion, 0.05, 0.1, 0.15.
Fig. 10. Mean squared errors for estimates of the proportion of signals under compound symmetry, polynomially and exponentially decaying dependence structure with different correlation coefficients for multivariate normal distribution without Storey’s method. Columns from left to right correspond to the dimension, 50, 300, 500, 1000, 3000. The rows from top to bottom correspond to the signal proportion, 0.05, 0.1, 0.15.
Fig. 11. Histogram of p-values of SNPs for genes of schizophrenia.
Fig. 12. Histogram of p-values of SNPs for genes of Crohn’s disease.
Fig. 13. Boxplots of bootstrap estimates of proportion of signals for selected genes of schizophrenia and Crohn’s disease.
TABLES
Table 1
Estimates of proportion of signals for selected genes of schizophrenia and Crohn’s disease with 95% bootstrap confidence intervals. d indicates the number of SNPs in each gene
Schizophrenia
GENE
d
Proposed
Storey
Langaas
Linear
uc010vip.3
160
0.3205 (0.2784, 0.3729)
0.5838 (0.5313, 0.7876)
0.5745 (0.4928, 0.7113)
0.4600 (0.4374, 0.4984)
uc288gjo.1
70
0.3912 (0.3261, 0.4337)
0.8052 (0.6844, 0.9405)
0.8704 (0.6646, 0.9593)
0.4553 (0.4546, 0.5720)
uc063ovd.1
207
0.0945 (0.0647, 0.1490)
0.3425 (0.2579, 0.5169)
0.1195 (0.0847, 0.2315)
0.0000 (0.0000, 0.0000)
uc287qtx.1
258
0.1391 (0.1268, 0.1719)
0.3992 (0.3271, 0.5034)
0.3679 (0.2868, 0.4635)
0.2524 (0.1863, 0.3729)
uc064pmp.1
87
0.3060 (0.2627, 0.3457)
0.5402 (0.3995, 0.6604)
0.4848 (0.3967, 0.7631)
0.4020 (0.0000, 0.5159)
Crohn’s disease
GENE
d
Proposed
Storey
Langaas
Linear
PRKG1
195
0.0391 (0.0337, 0.0686)
0.0962 (0.0357, 0.2053)
0.1006 (0.0314, 0.1914)
0.2262 (0.1045, 0.4698)
CASC15
163
0.0374 (0.0247, 0.0665)
0.3013 (0.1674, 0.4554)
0.2953 (0.1013, 0.5043)
0.0000 (0.0000, 0.1538)
SLIT3
125
0.0362 (0.0238, 0.0862)
0.3111 (0.1724, 0.5603)
0.2753 (0.0839, 0.4900)
0.0000 (0.0000, 0.1947)
PDE10A
115
0.0840 (0.0714, 0.1330)
0.2953 (0.1914, 0.6149)
0.2656 (0.1417, 0.4594)
0.0193 (0.0000, 0.3118)
CLSTN2
108
0.1105 (0.0868, 0.1609)
0.2394 (0.1469, 0.4599)
0.2518 (0.1379, 0.4280)
0.1504 (0.0000, 0.3807)
References
Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 57, 289-300.
Chacón JE and Rodríguez-Casal A (2010). A note on the universal consistency of the kernel distribution function estimator. Statistics & Probability Letters, 80, 1414-1419.
Duerr RH, Taylor KD, and Brant SR (2006). A genome-wide association study identifies il23r as an inflammatory bowel disease gene. Science, 314, 1461-1463.
Duong T (2016). Non-parametric smoothed estimation of multivariate cumulative distribution and survival functions, and receiver operating characteristic curves. Journal of the Korean Statistical Society, 45, 33-50.
Durrett R (2010). Probability; Theory and Examples (4th ed), Cambridge, Cambridge University Press.
Efron B (2004). Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. Journal of the American Statistical Association, 99, 96-104.
Efron B (2010). Large-scale Inference: Empirical Bayes Methods for Estimation, Testing and Prediction, Cambridge, Cambridge University Press.
Fan J and Han X (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society, Series B, 79, 1143-1164.
Fang Y, Chang C, Park Y, and Tseng GC (2023). Heavy-tailed distribution for combining dependent p-values with asymptotic robustness. Statistica Sinica, 33, 1115-1142.
Friguet G and Causeur D (2011). Estimation of the proportion of true null hypotheses in high-dimensional data under dependence. Computational Statistics and Data Analysis, 55, 2665- 2676.
Genovese C and Wasserman L (2004). A stochastic process approach to false discovery control. The Annals of Mathematical Statistics, 32, 1035-1061.
Kim J and Park J (2025). Combining P-values using heavy tailed distributions and their asymptotic results with applications to genomic data. Statistica Sinica, 35. Advance online publication.
Lam M, Chen C-Y, and Li Z, et al. (2019). Comparative genetic architectures of schizophrenia in east Asian and European populations. Nature Genetics, 51, 1670-1678.
Langaas M, Lindqvist BH, and Ferkingstad E (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67, 555-572.
Liu Y and Xie J (2020). Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. Journal of the American Statistical Association, 115, 393-402.
Meinshausen N and Bühlmann P (2005). Lower bounds for the number of false null hypotheses for multiple testing of associations under general dependence structures. Biometrika, 92, 893-907.
Schweder T and Spjøtvoll E (1982). Plots of p-values to evaluate many tests simultaneously. Biometrika, 69, 493-502.
Storey JD (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 64, 479-498.
Tong T, Feng Z, Hilton JS, and Zhao H (2013). Estimating the proportion of true null hypotheses using the pattern of observed p-values. Journal of Applied Statistics, 40, 1949-1964.
Turkheimer FE, Smith CB, and Schmidt K (2001). Estimation of the number of “true” null hypotheses in multivariate analysis of neuroimaging data. NeuroImage, 13, 920-930.