TEXT SIZE

search for



CrossRef (0)
Genetic association tests when a nuisance parameter is not identifiable under no association
Communications for Statistical Applications and Methods 2017;24:663-671
Published online November 30, 2017
© 2017 Korean Statistical Society.

Wonkuk Kim1,a, and Yeong-Hwa Kima

aDepartment of Applied Statistics, Chung-Ang University, Korea
Correspondence to: 1Corresponding author: Department of Applied Statistics, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea. E-mail: wkim@cau.ac.kr
Received August 22, 2017; Revised October 2, 2017; Accepted October 3, 2017.
 Abstract

Some genetic association tests include an unidentifiable nuisance parameter under the null hypothesis of no association. When the mode of inheritance (MOI) is not specified in a case-control design, the Cochran-Armitage (CA) trend test contains an unidentifiable nuisance parameter. The transmission disequilibrium test (TDT) in a family-based association study that includes the unaffected also contains an unidentifiable nuisance parameter. The hypothesis tests that include an unidentifiable nuisance parameter are typically performed by taking a supremum of the CA tests or TDT over reasonable values of the parameter. The p-values of the supremum test statistics cannot be obtained by a normal or chi-square distribution. A common method is to use a Davies’s upper bound of the p-value instead of an exact asymptotic p-value. In this paper, we provide a unified sine-cosine process expression of the CA trend test that does not specify the MOI and the TDT that includes the unaffected. We also present a closed form expression of the exact asymptotic formulas to calculate the p-values of the supremum tests when the score function can be written as a linear form in an unidentifiable parameter. We illustrate how to use the derived formulas using a pharmacogenetics case-control dataset and an attention deficit hyperactivity disorder family-based example.

Keywords : Cochran-Armitage trend test, Rice’s formula, sine-cosine process, transmission disequilibrium test, unidentifiable parameter
1. Introduction

A genome-wide association study is a powerful method to screen a high-dimensional genome data set and select candidate single nucleotide polymorphisms (SNPs) for genetic associations. Genetic association studies are commonly conducted by a case-control design or family-based design. Table 1 shows a summarization of a case-control data set at a single biallelic SNP. The Cochran-Armitage (CA) linear trend test (Armitage, 1955; Cochran, 1954) is known to be a powerful test in a genetic case-control association study, but it requires the specification of a mode of inheritance (MOI). The MOI can be specified by selecting a weight vector (0, θ, 1) where θ = 0 for recessive model, θ = 1/2 for additive model, and θ = 1 for dominant model. The CA linear trend test can be written as

ZCA(θ)=n[θ(rs1-sr1)+(rs2-sr2)]rs[θ2n1(n-n1)-2n1n2θ+n2(n-n2)].

The CA trend test statistic is a Rao’s score test statistic that can be derived from the logit model of

log(P(Y=1G)1-P(Y=1G))=α+β{θ·I(G=1)+I(G=2)},

where I(·) is an indicator function and G is the number of allele B in Table 1, that is, G = 0 for genotype AA, G = 1 for genotype AB and G = 2 for genotype BB. Under the null hypothesis of no association, that is, H0 : β = 0, the nuisance parameter θ is not estimable and hence it is not identifiable under the null hypothesis.

The family-based association study is known to be less powerful than the population-based association study, but the transmission disequilibrium test (TDT) (Spielman et al., 1993) may be a unique robust test that does not depend on the phenotype distribution, genotype frequencies, or mating distribution. The power of the TDT may increase by including the unaffected in the statistical analysis (Lange and Laird, 2002; Lunetta et al., 2000). Table 2 summarizes a single-locus data set in a family-based association study. The TDT including the unaffected is given by

ZTDT=(1-μ)(bA-cA)-μ(bU-cU)(1-μ)2(bA+cA)+μ2(bU+cU),

where the nuisance parameter μ is not identifiable under the null hypothesis of no association. One of ? common choices of the parameter μ is the prevalence of disease, that is not usually estimable in the case-control or family-based design.

Davies (1977) proposed the supremum test statistics when a nuisance parameter is present only under the alternative hypothesis. In addition, he presented a general formula for an upper bound of the p-value of the supremum test. This upper bound in an integral form is generally called the Davies’ upper bound.

Gaussian processes are widely used in many scientific fields (Choi and Lee, 2014; Lee and Park, 2017). A sine-cosine process is the simplest Gaussian wave process written as

U(t)=U1cos t+U2sin t

where U1 and U2 are independent standard normal random variables. Delmas (2003) derived the exact formula to calculate the upper tail probability of M(t) = sup0stU(s) by applying Rice’s formula (Rice, 1944, 1945).

In this work, we show that the supremum test of the CA linear trend test and the supremum test of the TDT can be written in terms of the supremum of a sine-cosine process, and we provide the exact asymptotic p-value calculation formulas for unified test statistics based on the supremum of the sine-cosine process.

2. Methods

Suppose that the probability function p(y, β, α, φ) does not depend on θ when the null hypothesis is true, that is, H0 : β = β0. In this work, we assume β and θ are scalars and α can be a parameter vector.

The score function is written as

Uβ,N(β0,α,φ)=1Ni=1Nβ|β=β0log p(yi,β,α,φ).

Let α̂0 be the maximum likelihood estimate of α under H0 : β = β0. Now suppose this score function can be written in form of

Uβ,N(β0,α^0,φ)=X1,N+w(φ)X2,N

where w is a continuous function in ℝ and let θ = w(φ). Suppose that the bivariate random vector XN = (X1,N, X2,N)′ converges in distribution to a mean zero bivariate normal random vector, that is, XN=(X1,N,X2,N)dX=(X1,X2)~N((0,0),Σ) where Σ=(σ12ρσ1σ2ρσ1σ2σ22) as N → ∞ under the null hypothesis. For simplicity, we use notation

Uβ,N(β0,α^0,θ)=X1,N+θX2,N

and we define the locally most powerful test given θ as

ZN(θ)=Uβ,N(β0,α^0,θ)σ12+2θρσ1σ2+θ2σ22=X1,N+θX2,Nσ12+2θρσ1σ2+θ2σ22.

Since ZN(θ)dZ(θ), the p-value of the test statistic can be obtained from the Gaussian process of

Z(θ)=[X1-σ1σ2ρX2]+(θ+σ1σ2ρ)X2σ12+2θρσ1σ2+θ2σ22.

We define two independent standard normal random variable U1 and U2 as

U1=X1-σ1σ2ρX2σ11-ρ2,         U2=X2σ2.

Using these standard random variables, we can write

Z(θ)=σ11-ρ2U1+(ρσ1+θσ2)U2(σ11-ρ2)2+(ρσ1+θσ2)2.

We define an angle t = t(θ) by

cos t=σ11-ρ2(σ11-ρ2)2+(ρσ1+θσ2)2,         sin t=ρσ1+θσ2(σ11-ρ2)2+(ρσ1+θσ2)2.

Here cos t is always nonnegative for all θ but sin t can be positive or negative depending upon θ. Hence we may assume that –π/2 ≤ tπ/2. In terms of the parameter t = t(θ),

Z(θ)=U1cos t(θ)+U2sin t(θ).

Let U(t) = U1 cos t +U2 sin t where t(θ)=arctan((ρσ1+θσ2)/(σ11-ρ2)). Let tL = minθLθθUt(θ) and tU = maxθLθθUt(θ). Here U(t) is the orthogonal projection of (U1, U2) onto the straight line joining the origin (0, 0) and (cos t, sin t). Without loss of generality, suppose tUtLπ since –π/2 ≤ tπ/2. First we can simply obtain the supremum of U(t) or |U(t)| by the following equations:

suptLttUU(t)={U12+U22,for tLarctan (U2U1)tU,U10,maxt{tL,tU}(U1cos t+U2sin t),otherwise,suptLttUU(t)={U12+U22,for tLarctan (U2U1)tU,maxt{tL,tU}U1cos t+U2sin t,otherwise,

For a given u > 0, we can calculate the right-tail probabilities of sup U(t) and sup |U(t)| by the following Theorem.

Theorem 1

Supposeπ/2 ≤ tLtUπ/2 and u > 0. Let U(t) be a sine-cosine process defined on [tL, tU]. Let Φ(u) be the distribution function of a standard normal random variate.

  • The one-sided p-value is the same as the Davies’ upper bound that is given byP(suptLttUU(t)u)=1-Φ(u)+tU-tL2πe-u22.

  • The two-sided p-value is smaller than the Davies’ upper bound and it can be calculated byP(suptLttUU(t)u)=2[1-Φ(u)]+tU-tLπe-u22-1π0tU-tLexp [-u21-cos s]ds=2P(suptLttUU(t)u)-1π0tU-tLexp [-u21-cos s]ds.

    If tUtL = π/2, thenP(suptLttUU(t)u)=2[1-Φ(u)]+12e-u22-2[1-Φ(u)]2.

The proof of Theorem 1 is given in Appendix. The following corollary is immediate from the contents in this section and Theorem 1.

Corollary 1. (Unified representation of ZTDT(μ) and ZCA(θ))

ZTDT(μ) and ZCA(θ) can be written as a sine-cosine process so that the supremum tests are

suptlttuU(t)         or         suptlttuU(t),

where U(t) = U1 cos t + U2 sin t is a sine-cosine process with independent standard normal random variates U1and U2. For the family-based association test,

cos t=(1-μ)bA+cA(1-μ)2(bA+cA)+μ2(bU+cU),         sin t=μbU+cU(1-μ)2(bA+cA)+μ2(bU+cU).

If 0 ≤ μLμμU ≤ 1, then

tL=arctan(μL1-μLbU+cUbA+cA)         and         tU=arctan(μU1-μUbU+cUbA+cA).

For the CA linear trend test,

cos t=1-ρ2σ1(1-ρ2)σ12+(σ1ρ+θσ2)2,         sin t=σ1ρ+θσ2(1-ρ2)σ12+(σ1ρ+θσ2)2,

whereσ1=p2(1-p2),σ2=p0(1-p0), andρ=-p1p2(1-p1)(1-p2),

tL=-arctan(p1p2p0),         tU=arctan(p0p1p2).

Here, p0 = P(AA), p1 = P(AB), and p2 = P(BB). The p-values of the supremum tests of the TDT including the unaffected and the CA trend test with unspecified MOI can be obtained from Theorem 1.

3. Examples

In this section, we illustrate two examples. One is a case-control type pharmacogenetics data set of anti-epileptic drug responses. The other example is the case-parent trio dataset of attention deficit hyperactivity disorder (ADHD) illustrated in Lunetta et al. (2000).

3.1. Case-control association study

Two hundred and eighty-eight patients of epilepsy were recruited from multiple epilepsy clinics in Korea and they were genotyped for whole-exomes by the next-generation sequencing experiments. All study participants were eligible if they had drug-resistant (case group) or drug-responsive (control group) epilepsy according to the following definitions and criteria explained in Kim et al. (2011). Drug resistance was defined as the occurrence of at least four unprovoked seizures in the course of the year before recruitment, with trials of two or more appropriate antiepileptic drugs (AEDs) at maximal tolerated doses. Patients who underwent surgical treatment for drug-resistant epilepsy were classified as having drug-resistant epilepsy, regardless of the surgical outcome. Patients who were frequently in poor compliance with AED therapy and those who had reported seizures with a questionable semiology were excluded from this study. We define drug responsiveness as complete freedom from seizures for at least one year up to the date of the last follow-up visit.

We performed the CA trend tests for recessive, additive, and dominant genetic models and the supremum test of the CA test for undetermined MOI but 0 ≤ θ ≤ 1. The p-values of the CA trend tests for the three specific genetic models are calculated by the upper tails of a chi-squared distribution with one degree of freedom while the p-values of the supremum tests are calculated by two ways. Table 3 shows the five smallest p-values of the supremum of CA trend tests for 0 ≤ θ ≤ 1 calculated by Equation (2.14) in Theorem 1 and the permutation p-values from 1 million and 10 millions permuted data. The p-values of the CA trend tests under recessive (θ = 0), additive (θ = 0.5), and dominant (θ = 1) MOI are included in the table. For 0 ≤ θ ≤ 1, the permutation p-value tends to be smaller than the exact asymptotic p-value from Equation (2.14). It appears that the permuted p-values based on 10 millions permuted data are closer to the exact asymptotic p-values derived in this paper than the permuted p-values obtained from one million permuted data.

3.2. Family-based association study

Lunetta et al. (2000) recruited 39 nuclear families in which at least one family member was believed to have ADHD and all individuals were assessed for the Diagnostic and Statistical Manual of Mental Disorders IV ADHD to determine their phenotypes. For DAT gene, 60 parent-child trios in 35 nuclear families, consisting of 33 affected and 27 unaffected children, were genotyped. A total of 15 unaffected children had exactly one heterozygous parent; 2 unaffected children had two heterozygous parents. A total of 15 affected children had one heterozygous parent; 6 affected children had two heterozygous parents. Twelve additional trios were genotyped for DRD4 gene, resulting in a total of 72 trios, with 42 affected and 30 unaffected children, in 39 nuclear families. For DRD4-7, 11 unaffected children had exactly one heterozygous parent; and 2 unaffected children had two heterozygous parents. A total of 13 affected children had exactly one heterozygous parent; 4 affected children had two heterozygous parents. The two genes were treated as biallelic; DRD4-7 against all other DRD4 alleles and DAT-480 against all other DAT alleles. For an illustration of the TDT example, we performed the TDT under no information of the prevalence of ADHD, that is, 0 ≤ μ ≤ 1. We also conducted the TDT under 0.05 ≤ μ ≤ 0.1 (Lunetta et al., 2000) and under 0.114 ≤ μ ≤ 0.161 (Faraone et al., 2003). Table 4 shows the p-values of the TDTs under two cases. The p-value under 0.05 ≤ μ ≤ 0.1 is greater than the p-value under 0 ≤ μ ≤ 1 for DAT-480 whereas the opposite holds for DRD4-7.

4. Discussion and conclusions

In this paper, we derived simple formulas to calculate the p-values of the supremum tests when a score function is linear in an unidentifiable nuisance parameter as in Equation (2.2). The derived formulas can be used to calculate the p-values of the CA trend test when the MOI is not specified as well as the TDT including the unaffected. In particular, the Davies’s upper bound is the same as the exact asymptotic p-value of the supremum test for a one-sided alternative hypothesis while the exact asymptotic two-tailed p-value is smaller than the corresponding Davies’s upper bound. As shown in Examples section, the approximate p-values of the supremum tests can be obtained by the Monte-Carlo permutation method. The exact asymptotic p-value tends to be greater than the permutation p-value in the case-control example, when the sample size is 288. To investigate the convergence of the exact asymptotic p-values, we conducted a small simulation study, in which we generated five fictitious SNPs data. For each SNP in Table 3, we used a dataset having three copies of genotypes while two copies of the phenotypes are used and a copy of phenotypes is randomized. Hence, the sample size of the simulation study is set as 864. By doing this, we could simulate the p-values around 10−6–10−8.

Table 5 shows the simulation results. We could not see any pattern that one method provides smaller p-values compared to the other method. It appears that the permutation p-value may be preferred when the sample size is not sufficiently large, whereas the exact asymptotic p-value may be preferred when the sample size is large enough. However, this permutation approach is computationally intensive and the Monte-Carlo permutation p-values depend on specified seeds and the number of resamples; 175 minutes on average were required to complete an empirical p-value calculation based on 10 million permuted data using a 3.5 GHz intel Xeon processor. Therefore, the permutation method for whole-exome or whole-genome data may not be feasible. Kim et al. (2012) and Kim (2015) proposed the linear trend tests and the TDT based on read counts for low-coverage next-generation sequences experiments in which genotypes are uncertain. Their extended tests require much computational resources due to estimating the parameters of mixture models. Our work in this paper can be applied to the extended linear trend tests and the TDT based on read counts while saving computational resources.

We also provided the unified sine-cosine process expression of the supremum tests for the CA linear trend without specifying the MOI and the TDT including the unaffected. This work focused on the case in which there is a one-dimensional unidentifiable nuisance parameter in a linear form. Davies (1987, 2002) considered a chi-square process and an F-process for more complex models and our work may be extended to cases in which two or more unidentifiable nuisance parameters exist.

Acknowledgements

This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute, funded by the Ministry of Health & Welfare, Republic of Korea (Grant No. HI15C1559).

TABLES

Table 1

Data structure in a case-control association study

AAABBBTotal
Controlr0r1r2r
Cases0s1s2s
Totaln0n1n2n

Table 2

Data structure in a family-based association study

TransmittedNot transmittedTotal
UnaffectedbUcUn0
AffectedbAcAn1

Table 3

The p-values of the top five SNPs selected by the CA trend tests for 0 ≤ θ ≤ 1 of the anti-epileptic drug data

SNP0 ≤ θ ≤ 1θ = 0θ = 0.5θ = 1

Exact1,000,00010,000,000
rs169643165.42 × 10−61.00 × 10−62.60 × 10−64.68 × 10−62.17 × 10−69.55 × 10−5
rs176713521.34 × 10−57.00 × 10−68.70 × 10−61.42 × 10−11.66 × 10−42.02 × 10−6
rs169096516.09 × 10−55.60 × 10−55.92 × 10−53.14 × 10−52.14 × 10−51.61 × 10−3
rs124172556.33 × 10−55.50 × 10−56.24 × 10−52.75 × 10−31.96 × 10−52.22 × 10−5
rs120414778.84 × 10−57.41 × 10−57.57 × 10−53.25 × 10−27.54 × 10−51.61 × 10−5

The numbers in the second column are obtained from Equation (2.14). The numbers in the third and fourth columns of 0 ≤ θ ≤ 1 are the p-values based on the Monte-Carlo permutation method from 1 million and 10 millions permuted data, respectively. SNP = single nucleotide polymorphism.


Table 4

Two-tailed p-values when the nuisance parameter μ is known to be in an interval, (0, 1), (0.05, 0.1), or (0.114, 0.161) for the attention deficit hyperactivity disorder data

AlleleNμAsymptotic p-valuePermutation p-value

bAcAbUcU
DAT-4801710613(0.000, 1.000)0.090990.083
(0.050, 0.100)0.141240.146
(0.114, 0.161)0.117620.146

DRD4-7156510(0.000, 1.000)0.050170.058
(0.050, 0.100)0.039700.040
(0.114, 0.161)0.033600.034

Table 5

The results of the simulation study

SNP used for simulated data0 ≤ θ ≤ 1

Exact asymptotic1,000,00010,000,000
rs169643164.39 × 10−80.001.00 × 10−7
rs176713523.12 × 10−62.00 × 10−62.50 × 10−6
rs169096515.05 × 10−70.004.00 × 10−7
rs124172559.66 × 10−71.00 × 10−69.00 × 10−7
rs120414772.23 × 10−62.00 × 10−62.70 × 10−6

Each dataset was generated by three copies of the genotypes of one of SNPs in Table 3. The phenotypes were generated by combining two copies of the phenotypes of the real data and a randomized copy of the phenotypes.

SNP = single nucleotide polymorphism.


References
  1. Armitage, P (1955). Tests for linear trends in proportions and frequencies. Biometrics. 11, 375-386.
    CrossRef
  2. Choi, ML, and Lee, J (2014). GLR charts for simultaneously monitoring a sustained shift and a linear drift in the process mean. Communications for Statistical Applications and Methods. 21, 69-80.
    CrossRef
  3. Cochran, WG (1954). Some methods for strengthening the common χ2 tests. Biometrics. 10, 417-451.
    CrossRef
  4. Davies, RB (1977). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika. 64, 247-254.
    CrossRef
  5. Davies, RB (1987). Hypothesis testing when a nuisance parameter is present only under the alternative. Biometrika. 74, 33-43.
  6. Davies, RB (2002). Hypothesis testing when a nuisance parameter is present only under the alternative: Linear model case. Biometrika. 89, 484-489.
    CrossRef
  7. Delmas, C (2003). Projections on spherical cones, maximum of Gaussian fields and Rice’s method. Statistics & Probability Letters. 64, 263-270.
    CrossRef
  8. Faraone, SV, Sergent, J, Gillberg, C, and Biederman, J (2003). The worldwide prevalence of ADHD: is it an American condition?. World Psychiatry. 2, 104-113.
  9. Kim, MK, Moore, JH, and Kim, JK (2011). Evidence for epistatic interactions in antiepileptic drug resistance. Journal of Human Genetics. 56, 71-76.
    CrossRef
  10. Kim, W (2015). Transmission disequilibrium tests based on read counts for low-coverage nextgeneration sequence data. Human Heredity. 80, 36-49.
    CrossRef
  11. Kim, W, Londono, D, and Zhou, L (2012). Single-variant and multi-variant trend tests for genetic association with next-generation sequencing that are robust to sequencing error. Human Heredity. 74, 172-183.
    Pubmed KoreaMed CrossRef
  12. Lange, C, and Laird, NM (2002). Power calculations for a general class of family-based association tests: dichotomous traits. The American Journal of Human Genetics. 71, 575-584.
    Pubmed KoreaMed CrossRef
  13. Lee, Y, and Park, JS (2017). Model selection algorithm in Gaussian process regression for computer experiments. Communications for Statistical Applications and Methods. 24, 383-396.
    CrossRef
  14. Lunetta, KL, Faraone, SV, Biederman, J, and Laird, NM (2000). Family-based tests of association and linkage that use unaffected sibs, covariates, and interactions. The American Journal of Human Genetics. 66, 605-614.
    Pubmed KoreaMed CrossRef
  15. Rice, SO (1944). Mathematical analysis of random noise. Bell Labs Technical Journal. 23, 282-332.
    CrossRef
  16. Rice, SO (1945). Mathematical analysis of random noise. The Bell System Technical Journal. 24, 46-156.
    CrossRef
  17. Spielman, RS, McGinnis, RE, and Ewens, WJ (1993). Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). The American Journal of Human Genetics. 52, 506-516.
    Pubmed KoreaMed