TEXT SIZE

search for



CrossRef (0)
Sample size calculations for clustered count data based on zero-inflated discrete Weibull regression models
Communications for Statistical Applications and Methods 2024;31:55-64
Published online January 31, 2024
© 2024 Korean Statistical Society.

Hanna Yoo1,a

aDepartment of Applied Statistics, Hanshin University, Korea
Correspondence to: 1 Department of Applied Statistics, Hanshin University, 137 Hanshindae-gil, Osan-si, Gyeonggi-do 18101, Korea. Email: yoohann@hs.ac.kr.
Received August 6, 2023; Revised October 18, 2023; Accepted October 18, 2023.
 Abstract
In this study, we consider the sample size determination problem for clustered count data with many zeros. In general, zero-inflated Poisson and binomial models are commonly used for zero-inflated data; however, in real data the assumptions that should be satisfied when using each model might be violated. We calculate the required sample size based on a discrete Weibull regression model that can handle both underdispersed and overdispersed data types. We use the Monte Carlo simulation to compute the required sample size. With our proposed method, a unified model with a low failure risk can be used to cope with the dispersed data type and handle data with many zeros, which appear in groups or clusters sharing a common variation source. A simulation study shows that our proposed method provides accurate results, revealing that the sample size is affected by the distribution skewness, covariance structure of covariates, and amount of zeros. We apply our method to the pancreas disorder length of the stay data collected from Western Australia.
Keywords : covariance structure, clustered count data, discrete Weibull regression, Monte Carlo simulations, sample size determination
1. Introduction

Clustered count data with an excess number of zeros are very common in the biomedical field, clinical studies, and health research. Zero-inflated models are a natural choice for this data type, with some studies utilizing these models. Hall (2000) first introduced random effects into a portion of the zero-inflated Poisson and binomial models. Tapak et al. (2019) extended the zero-inflated exponentiated-exponential geometric regression in the presence of a random effect. Furthermore, Choo-Wosoba et al. (2018) proposed a Bayesian approach by combining the Conway–Maxwell–Poisson distribution with a hurdle component.

Aside from proper data analysis, sample size determination before planning any statistical study is also a crucial part. Studies considering the sample size calculation for clustered count data with excessive zeros are scarce. Channouf et al. (2021) recently proposed sample size calculation methods for hierarchical Poisson and zero-inflated Poisson regression models by extending the work of Shieh (2001) and using the h-likelihood method with the Monte Carlo simulation to calculate the required sample size. In the present study, we calculate the required sample size based on a discrete Weibull model first introduced by Nakagawa and Osaki (1975). The discrete Weibull distribution can handle both overdispersion and underdispersion; thus, there is a low risk of failing to cope with the features of the dispersed data type. Yoo (2023) proposed a sample size calculation method for the clustered discrete Weibull regression model, which we expand herein to the zero-inflated data. To date, no method for the sample size calculation dealing with clustered count data with the zero-inflated discrete Weibull regression model has yet been reported. In this study, we modify the method of Channouf et al. (2021) to calculate the required sample size for the zero-inflated discrete Weibull regression model in the presence of clustered data. Perumean-Chaney et al. (2013) showed that ignoring the overdispersion within the zero-inflated data inflates the Type 1 error and results in poor estimates. Thus, considering both dispersion and zero inflation when calculating the sample size is meaningful.

The remainder of this paper is organized as follows: Section 2 presents the proposed sample size calculation method for clustered zero-inflated discrete Weibull (ZI-DW) model; Section 3 elaborates on the simulation study conducted to examine the method performance; Section 4 depicts an illustrative example; and Section 5 discusses the results and concludes this work.

2. Sample size for the clustered zero-inflated discrete Weibull regression model

For the clustered data with an excessive number of zeros, we expanded the ZI-DW model with a random effect. Let Yij denote the jth subject measured for cluster i, i = 1, . . . , n; j = 1, . . . , ni with the total number of subjects in the data, N=Σi=1nni. We assumed Yij to follow the DW distribution:

Yij~DW (q(Xij),β),

where Xij=(1,X1,ij,,Xp,ij) is a covariate vector. The probability mass function for the DW is

fDW(Yij;q(Xij),β)=q(Xij)Yijβ-q(Xij)(Yij+1)β,

for Yij = 0, 1, 2, 3, . . . , 0 < q(Xij) < 1, β > 0. Parameter q(Xij) is the probability of the outcome random variable Yij having a value greater than 0 and parameter β controlling the value range of Yij. The distribution became highly skewed as β converged to 0. The DW distribution approached the Bernoulli distribution as β converged to ∞. We incorporated a random intercept into the model to consider the subject correlation in the same cluster. The clustered DW regression model is presented as follows:

log(-log(q(Xij)))=θXij+U0,i=θ0+θ1X1,ij+θ2X2,ij++θpXp,ij+U0,i,

where θ = (θ0, θ1, θ2, . . . , θp) denotes the corresponding (p + 1) regression coefficients associated with Xij=(1,X1,ij,,Xp,ij), and U0,i is a random effect following an independently and identically distributed normal distribution with mean 0 and a common dispersion parameter σ02.

In the presence of clustered data with an excessive number of zeros, we extended the DW model with a zero-inflation parameter. The clustered ZI-DW had the following probability mass function:

fZI-DW(Yij,γ,q(Xij),β)={(π(Zij)+(1-π(Zij))fDW(Yij,q(Xij),β))for y=0,(1-π(Zij))fDW(Yij,q(Xij),β)for y=1,2,3,,

where 0 < π < 1 is the zero-inflation parameter, and fDW(Yij, q(Xij), β) is the probability mass function shown in Equation (2.2). The logit link for π is usually used:

logit(π(Zij))=γZij=γ0+γ1Z1,ij++γqZq,ij,

where Zij′ = (1, Z1,ij, Z2,ij, . . . , Zq,ij)′ is a vector with q covariates, and γ = (γ0, γ1, γ2, . . . , γq) ′ is the corresponding coefficient vector. A binary indicator variable δij, which is δij = 1 if Yij = 0 and δij = 0 if Yij > 0, is used for the likelihood of the ZI-DW regression model. We estimated the parameters by using the h-likelihood first introduced by Lee and Nelder (1996). It extended the penalized quasi-likelihood to estimate the parameters and was based on the joint distribution of the outcome variable and the random effect. Unlike the classical likelihood, the h-likelihood is constructed for both fixed parameters and unobserved frailties (Lee et al., 2017). The h-likelihood avoids such integration itself and provides the inference for unobservable random effects (Ha et al., 2001). A big merit of h-likelihood is that once the statistical model is specified, the required inference procedures can be made: See (Jin and Lee, 2020) for recent review of h-likelihood.

Given a random sample (Yij, Xij, Zij, δij), the log likelihood was obtained as follows:

l=i=1nj=1niδijlog[(e-zijγ+1)-1+[1-(e-zijγ+1)-1](1-e-exijθ++U0,i)]+i=1nj=1ni(1-δij)log[[1-(e-zijγ+1)-1]+(e-yijβeexijθ++U0,i-e-(yij+1)βeexijθ++U0,i)]+i=1nj=1nilogfX,Z(Xij,zij)+i=1n{-12log(2π)-12log(σ02)-12u0,i2}.

The score function is given as

Sn,k,l(θ,γ,β,u0)={l(y,x,θ,γ,β)θ0l(y,x,θ,γ,β)θkl(y,x,θ,γ,β)γ0l(y,x,θ,γ,β)γ1l(y,x,θ,γ,β)βl(y,x,θ,γ,β)u0,i         for k=1,,p;l=1,,q;i=1,,n,

where

l(y,x,θ,γ,β)θ0=i=1nj=1niδij[1-(e-zijγ+1)-1]wzlij(γ,θk)[exijθ+uo,i-exijθ+u0,i]+i=1nj=1ni(δij-1)[yijβe-yijβexijθ+u0,i-(yij+1)βe-(yij+1)βexijθ+uo,i]wzlij(γ,θk),l(y,x,θ,γ,β)θk=i=1nj=1niδij[1-(e-zijγ+1)-1]wzlij(γ,θk)[xikexijθ+u0,i-exijθ+u0,i]+i=1nj=1ni(δij-1)xik[yijβe-yijβexijθ+u0,i-(yij+1)βe-(yij+1)βexijθ+u0,i]wzlij(γ,θk)with wzlij(γ,θ)=(e-zijγ+1)-1+[1-(e-zijγ+1)-1](1-e-exijθ+u0,i)         andl(y,x,θ,γ,β)γ0=i=1nj=1niezijγl[(δij-1)eexijθ+u0,i(ezijγl+1)+1](ezijγl+1)(ezijγl+eexijθ+u0,i+eexijθ+u0,i-1),         l(y,x,θ,γ,β)γl=i=1nj=1niγlezijγl[(δij-1)eexijθ+u0,i(eezijγl+1)+1](ezijγl+1)(ezijγl+exijθ+u0,i+eexijθ+u0,i-1),   l(y,x,θ,γ,β)β=i=1nj=1ni(δij-1)exijθ[yijβlog(yij)e-(yij+1)βexijθ+u0,i-yijβlog(yij+1)e-(yij)βexijθ+u0,i]e-(yij+1)βexijθ+u0,i-e-yijβexijθ+u0,il(y,x,θ,γ,β)u0,i=i=1nj=1ni(δij-1)[yijβe-yijβexijθ+u0,i-(yij+1)βe-(yij+1)βexijθ+u0,i]e-yijβexijθ+u0,i-e-(yij+1)βexijθ+u0,i+i=1nj=1niδij[1-(e-zijγ+1)-1](e-exijθ+u0,i)wzlij(γ,θk)-u0,i.

The parameters θ, γ, β and u0 can be obtained by solving Sn,k,l(θ, γ, β, u0) = 0. Parameter σ02 was estimated through the second-stage procedure (Lee and Nelder, 1996). The variance–covariance matrix V(θ, γ, β, u0) of the maximum h-likelihood estimators was obtained as follows from the expected inverse of the Fisher information matrix:

V(θ,γ,β,u0)=E(-2l(θ,γ,β,uo)).

We restricted our scope to the case with the number of clusters fixed in advance and with an equal cluster size. We calculated the required sample size by modifying the methods of Channouf et al. (2014), which used a conditional expectation on the covariates using a Monte Carlo simulation for sample generation. The following hypothesis was tested:

H0:θ1=0         vs.         H1:θ10.

The sample size was calculated based on four steps. In the first step, the maximum likelihood estimates of parameters θ = (θ0, θ1, . . . , θp)′, γ = (γ0, γ1, . . . , γq)′, β, and u0 = (u0,1, . . . , u0,n)′ under the null hypothesis were obtained by solving the score function expectation as follows:

E[(θ0,0,,θp,γ,β,u0)X=x,Z=z]=0.

The parameters obtained from Equation (2.10) are denoted as θ*=(θ0*,0,θ2*,,θp*),γ=(γ0*,γ1*,,γq*), β*, and u0*=(u0,1*,,u0,n*). In the second step, the parameter variance was calculated under the null and alternative hypothesis. These variances, which were secondary diagonal elements, were obtained through the inverse of the Fisher information matrix and denoted as V(θ0*,0,θ2*,,θp*,γ0*,γ1*,,γq*,β*,u0,1*,,u0,n*)(2,2) and V(θ0, θ1, θ2, . . . , θp,, γ0, γ1, γ2, . . . , γq, β, u0,1, . . . , u0,n)(2,2).

In the third step, we calculated the sample size at each jth ( j = 1, . . . , B) Monte Carlo replication as follows:

Nj=[(V(θ0*,0,θ2*,,θp*,γ0*,γ1*,,γq*,β*,u0,1*,,u0,n*)(2,2)Zα/2+V(θ0,θ1,θ2,,θp,,γ0,γ1,γ2,,γq,β,u0,1,,u0,n)(2,2)Zrθ1)2],

where Zα/2 is the 100(1 − (α/2))th percentile of the standard normal distribution, and 1 − γ is power.

In the final step, the following sample size was used to test the hypothesis H0 : θ1 = 0 against the two-sided alternative hypothesis H1 : θ1 ≠ 0 with a specified significance level α and power 1 − γ:

NZI-DW=j=1BNjB.
3. Simulation study

We conducted a simulation study to assess the performance of our proposed formula for the sample size calculations. We calculated the required sample size for the 80% and 90% powers for each parameter setting with various cluster numbers. We then compared the true nominal power with the estimated power computed through a Monte Carlo simulation based on 1,000 independent datasets given the sample size. We considered the case of one and two covariates. In the one covariate case, we assumed a uniform distribution X ~ U(0, 1) and set the parameters to θ0 = −1, θ1 = −0.7 each to model the q(Xij) function. The random effect was assumed to follow a normal distribution with 0 mean and σ02,U0,i~N(0   σ02) variance. We observed the effect of the random-effect variance by considering two different values of σ0 = 0.3, 2. Moreover, we considered three different values for the zero-inflated parameter π = 10%, 30%, 50% for each representing the case of a rather small amount of zeros, a moderate amount of zeros, and a large amount of zeros in zero-state data. The Bernoulli distribution Z ~ B(0.5) was assumed for the corresponding covariate of the zero-inflated model. Parameters γ0 and γ1 satisfied each corresponding π value. For the dispersion parameter, we set two different values as β = 0.5, 2 to determine how the sample size is affected by the distribution skewness. Table 1 presents the required sample sizes for testing H0 : θ1 = 0 vs. H1 : θ1 = −0.7 with a significance level of α = 0.05 and powers of 1 − γ = 0.8 and 0.9. Along with Table 1, Tables 2 and 3 show the results for the 5, 10, and 20 cluster numbers, respectively. The estimated power calculated based on the sample size formula for each covariate X distribution was close to the true nominated power, showing that our proposed sample size methodology was very accurate. The sample generally required increases as the dispersion parameter β increased. The increased dispersion parameter decreased the conditional mean value of the response variable, thereby requiring a larger sample size. The bigger variance of the random effect demonstrated a sample size increase. In each table, the required sample size for the fixed β and σ0 increased by 30% when π = 30% and up to 100% when π = 50% compared to when π = 10% in average. The simulation study revealed that the required sample size was affected not only by the data skewness, but also by the zero-inflated parameter.

We held a simulation study with two covariates because the real data may contain more than one covariate. We assumed two covariates X1 and X2 following a bivariate normal distribution with mean μ′ = (0, 0) and a variance–covariance matrix as Σ=(1ρρ1) with ρ = 0.3 and 0.7, respectively. We set parameters θ0 = −1, θ1 = 0.2, and θ2 = 0.2. Tables 46 present the required sample sizes for testing H0 : θ1 = 0 vs. H1 : θ1 = 0.2 with a significance level of α = 0.05 and powers of 1 − γ = 0.8 and 0.9. Similar to the case with one covariate, the calculated power based on the calculated sample size was close to the true power. The effects of the parameters on the sample size were similar to those of one covariate. As regards the correlation effect between the covariates, a larger correlation ρ of the two covariates increased the required samples size. The sample size increased by 70% when ρ = 0.7 compared to when ρ = 0.3 in average.

4. Illustrative example

We applied our sample size calculation method to a real dataset. Yau et al. (2003) analyzed the pancreas disorder length of stay (LOS) data collected from 261 patients in Western Australia. The patients were collected from 36 different public hospitals. Yau et al. (2003) studied the factors associated with the patient LOS. The LOS distribution was highly skewed. The number of zeros was 17%. Among many risk factors affecting the LOS, the admission status (elective or emergency) was one of the factors associated with the LOS. We referred to their results to set the parameter values needed to calculate the required sample size. Furthermore, we considered a single covariate Z, X ~ B(0.5) assumed to be the admission status to model the ZI-DW regression model. The DW parameters were set as θ0 = −1.00, θ1 = −0.52, and β = 1.1. The zero-inflation part parameters were set as γ0 = −2.3, γ1 = −6.0. The standard deviation of the random intercept was set as 0.9. Based on our sample size calculation, four children were required in each cluster. As shown in Table 7, based on our proposed sample size method 144 patients were required under 80% power (i.e., 180 patients were required to acquire 90% power), which was only 55% of the data (261) that were actually analyzed.

5. Conclusion

In this study, we proposed a sample size calculation method for the clustered ZI-DW regression model. A random intercept was incorporated into a regression model. A Monte Carlo simulation was used to calculate the conditional expectation of the score function and the parameter variances under the null and alternative hypothesis. Our proposed sample size calculation method showed accurate results in the conducted simulation study. The sample size was affected by the distribution skewness, cluster number, and random-effect variance. In addition, the correlation between the variables and the amount of zeros in the data was found to affect the required sample size.

We believe that the proposed sample size method based on the clustered ZI-DW regression model is a suitable choice because it can handle both overdispersed and underdispersed data types at the presence of an excessive number of zeros, which may have been caused by a subpopulation with only zero counts. In this work, our method was restricted to the case with an equal cluster size. Thus, the future study should perform a sample size calculation that can handle an unequal cluster size and extend the model to a Bayesian approach, which offers more flexibility in model specification and uses prior information.

Funding

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1F1A1053119).

TABLES

Table 1

Calculated sample size for testing H0 : θ1 = 0 vs. H1 : θ1 = −0.7 at various excess zero probabilities and parameter settings under nominated powers of 80 and 90% with five clusters

πβσ80%90%

NEst powerNEst power
10%0.50.3510806690.900
2580.809770.911

20.3590.805780.900
2630.795840.910

30%0.50.3700.795930.897
2780.7911040.900

20.3780.8061040.882
2850.7881140.900

50%0.50.3980.7911320.913
21080.8111440.910

20.31110.8141490.902
21130.7871580.887

Table 2

Calculated sample size for testing H0 : θ1 = 0 vs. H1 : θ1 = −0.7 at various excess zero probabilities and parameter settings under nominated powers of 80 and 90% with 10 clusters

πβσ80%90%

NEst powerNEst power
10%0.50.3260.787340.900
2280.800360.900

20.3290.793380.908
2300.806390.894

30%0.50.3350.800460.887
2370.805490.890

20.3390.814520.913
2410.789540.888

50%0.50.3460.805620.905
2510.808660.895

20.3540.800720.911
2560.797740.891

Table 3

Calculated sample size for testing H0 : θ1 = 0 vs. H1 : θ1 = −0.7 at various excess zero probabilities and parameter settings under nominated powers of 80 and 90% with 20 clusters

πβσ80%90%

NEst powerNEst power
10%0.50.3130.804170.905
2140.800180.895

20.3140.810190.900
2150.813200.913

30%0.50.3180.800230.906
2190.813240.900

20.3200.809260.900
2210.804270.910

50%0.50.3240.812320.916
2250.808330.900

20.3260.792350.910
2270.805360.902

Table 4

Calculated sample size for testing H0 : θ1 = 0 vs. H1 : θ1 = 0.2 at various correlations of the two covariates and parameter settings under the nominated powers of 80% and 90% with five clusters

ρπβσ80%90%

NEst powerNEst power
0.310%0.50.3580.793780.888
2660.808880.895

20.3680.812920.904
2720.797960.900

30%0.50.3800.7881070.890
2880.8051180.900

20.3940.8101250.900
2980.7941310.900

50%0.50.31110.7821480.893
21270.8001690.913

20.31360.8131830.909
21420.8161900.905

0.710%0.50.31010.8131390.913
21160.7961570.882

20.31150.8001600.887
21270.8101740.888

30%0.50.31400.8051900.911
21570.8112160.903

20.31610.7922250.913
21730.7902400.900

50%0.50.31990.7902630.912
22250.7882990.890

20.32230.8203130.900
22330.8003350.893

Table 5

Calculated sample size for testing H0 : θ1 = 0 vs. H1 : θ1 = 0.2 at various correlations of the two covariates and parameter settings under the nominated powers of 80 and 90% with 10 clusters

ρπβσ80%90%

NEst powerNEst power
0.310%0.50.3300.800400.917
2330.787430.897

20.3340.800460.913
2360.805480.886

30%0.50.3410.810530.912
2440.793580.900

20.3480.805620.906
2490.797640.897

50%0.50.3560.794720.905
2600.800790.912

20.3630.812850.906
2670.790880.905

0.710%0.50.3510.788680.907
2580.806770.915

20.3590.800790.908
2640.795840.894

30%0.50.3710.806920.912
2790.8001030.888

20.3830.8161080.884
2870.8001140.890

50%0.50.3970.8131240.915
21090.7961410.912

20.31090.8001460.895
21200.8051560.900

Table 6

Calculated sample size for testing H0 : θ1 = 0 vs. H1 : θ1 = 0.2 at various correlations of the two covariates and parameter settings under the nominated powers of 80 and 90% with 20 clusters

ρπβσ80%90%

NEst powerNEst power
0.310%0.50.3140.812190.900
2160.800210.895

20.3170.806230.913
2180.800230.895

30%0.50.3200.795270.900
2220.803290.903

20.3240.815310.893
2280.812320.906

50%0.50.3270.813360.908
2290.806390.890

20.3310.800420.904
2320.804430.900

0.710%0.50.3250.788330.913
2280.812380.905

20.3290.813380.908
2310.795410.890

30%0.50.3350.794460.912
2380.803510.890

20.3400.800530.895
2420.797560.903

50%0.50.3470.816620.910
2520.813690.906

20.3550.806720.910
2580.794760.900

Table 7

Sample size results of the pancreas disorder length of stay (LOS) data

Actual sample sizeRequired sample size
26180% power90% power

144180

References
  1. Channouf N, Fredette M, and MacGibbon B (2014). Power and sample size calculations for poisson and zero inflated poisson regression models. Computational Statistics & Data Analysis, 72, 241-251.
    CrossRef
  2. Channouf N, Fredette M, and MacGibbon B (2021). Sample size calculations for hierarchical Poisson and zero-inflated Poisson regression models. Communications in Statistics Simulation and Computation, 50, 937-956.
    CrossRef
  3. Choo-Wosoba H, Gaskins J, Levy S, and Datta S (2018). A Bayesian approach for analyzing zero-inflated clustered count data with dispersion. Statistics in Medicine, 37, 801-812.
    Pubmed KoreaMed CrossRef
  4. Ha ID, Lee Y, and Song JK (2001). Hierarchical likelihood approach for frailty models. Biometrika, 88, 233-243.
    CrossRef
  5. Hall DB (2000). Zero-inflated Poisson and binomial regression with random effects: A case study. Biometrics, 56, 1030-1039.
    Pubmed CrossRef
  6. Jin S and Lee Y (2020). A review of -likelihood and hierarchical generalized linear model. WIREs Computational Statistics, 13, e1527.
    CrossRef
  7. Lee Y and Nelder JA (1996). Hierarchical generalized linear models. Journal of the Royal Statistical Society: Series B (Methodological), 58, 619-678.
    CrossRef
  8. Lee M, Ha ID, and Lee Y (2017). Frailty modeling for clustered competing risks data with missing cause of failure. Statistical Methods in Medical Research, 26, 356-373.
    Pubmed CrossRef
  9. Nakagawa T and Osaki S (1975). The discreteWeibull distribution. IEEE Transactions on Reliability, 24, 300-301.
    CrossRef
  10. Perumean-Chaney SE, Morgan C, McDowall D, and Aban I (2013). Zero-inflated and overdispersed: What셲 one to do?. Journal of Statistical Computation and Simulation, 83, 1671-1683.
    CrossRef
  11. Shieh G (2001). Sample size calculations for logistic and Poisson regression models. Biometrika, 88, 1193-1199.
    CrossRef
  12. Tapak L, Hamidi O, Amini P, and Verbeke G (2019). Random effect exponentiated-exponential geometric model for clustered/longitudinal zero-inflated count data. Journal of Applied Statistics, 47, 2272-2288.
    Pubmed KoreaMed CrossRef
  13. Yau KWK, Wang K, and Lee AH (2003). Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biometrical Journal, 45, 437-452.
    CrossRef
  14. Yoo H (2023). Sample size for clustered count data based on discrete Weibull regression model. Communications in Statistics - Simulation and Computation, 52, 5850-5856.
    CrossRef