search for

CrossRef (0)
Weighted zero-inflated Poisson mixed model with an application to Medicaid utilization data
Communications for Statistical Applications and Methods 2018;25:173-184
Published online March 30, 2018
© 2018 Korean Statistical Society.

Sang Mee Lee1,a, Theodore Karrisona, Robert S. Noconb, and Elbert Huangb

aDepartment of Public Health Sciences, University of Chicago, USA, bDepartment of Medicine, University of Chicago, USA
Correspondence to: Department of Public Health Sciences, University of Chicago, 5841 S. Maryland Ave. MC 2000, IL 60637, USA. E-mail: slee@health.bsd.uchicago.edu
Received September 28, 2017; Revised December 23, 2017; Accepted December 24, 2017.

In medical or public health research, it is common to encounter clustered or longitudinal count data that exhibit excess zeros. For example, health care utilization data often have a multi-modal distribution with excess zeroes as well as a multilevel structure where patients are nested within physicians and hospitals. To analyze this type of data, zero-inflated count models with mixed effects have been developed where a count response variable is assumed to be distributed as a mixture of a Poisson or negative binomial and a distribution with a point mass of zeros that include random effects. However, no study has considered a situation where data are also censored due to the finite nature of the observation period or follow-up. In this paper, we present a weighted version of zero-inflated Poisson model with random effects accounting for variable individual follow-up times. We suggested two different types of weight function. The performance of the proposed model is evaluated and compared to a standard zero-inflated mixed model through simulation studies. This approach is then applied to Medicaid data analysis.

Keywords : emergency department, Health care utilization, weight function, zero-inflated model
1. Introduction

Health care utilization data are often analyzed to address critical questions about health care service and delivery, such as resource utilization and planning, allocation of services and the evaluation of patient outcomes. Such analysis is of increasing importance for policy makers and health care institutions to improve the quality of patient care. Measurement of utilization includes the frequency of visits to medical providers, visits to emergency department (ED), days spent in a hospital, and use of prescription medication. Such count data are frequently found overdispersed with heavy tails and are often multi-modal with excess zeroes. For example, few patients use a service multiple times, whereas the majority report no utilization in a specific time period. Traditional approaches for overdispersed count data with extra variation have been the use of zero-inflated models, that include zero-inflated Poisson (ZIP) (Lambert, 1992) and zero-inflated negative binomial (ZINB) (Ridout et al., 1998). These models were developed by assuming that the outcome variable contains a mixture of a point mass at zero and a count distribution. For a comprehensive review of zero-inflated models, see Ridout et al. (1998). However, a hurdle model (Mullahy, 1986) has been independently developed for count data with excessive zeros assuming all zeros are from one structural source rather than two sources (both structural and sampling zeros) as assumed in the ZIP models. Thus, positive count data (nonzero) are assumed as either truncated Poisson or truncated negative binomial distribution whereas logistic regression part predicts all zeros in a hurdle model. The model choice between zero-inflated and hurdle models of the distinction between structural and sampling zeros, may be subtle in practice (Monod, 2014). In our application, we develop an approach based on a case in which structural zeros cannot be distinguished from random zeros; therefore, we choose to focus on zero-inflated models.

Data on health care use often have a multilevel structure where patients are typically nested within physicians, hospitals, and geographic regions. Consequently, intra-cluster correlation and heterogeneity between clusters must be considered. In such cases, random effects are commonly incorporated into the zero-inflated model. Hall (2000) proposed ZIP with a random intercept to account for the within-subject dependence in the Poisson state. Yao and Lee (2001) introduced a pair of uncorrelated random effects to both zero and Poisson components. Min and Agresti (2005) suggested linking the two components by the joint distribution of the random effects. Lee et al. (2006) and Moghimbeigi et al. (2008) extended to a three-level hierarchical model using normally-distributed random effects.

In addition to a clustered structure, health care utilization data are often censored due to the finite nature of the observation period or follow-up. For instance, when utilization data are collected in long-term observational or randomized clinical trials, complete data are not available for some patients because they are not followed until the endpoint of interest or they enter in the middle of the study. Thus, patients with shorter follow-up should be treated differently from patients who have data for the entire study period because the incidence rate of utilization would otherwise be underestimated. Several authors have proposed modified zero-inflated models to handle variable follow-up (Emerson et al., 1993; Hsu, 2005, 2007). They adapted a weight function for the simple fact that the likelihood of observing a greater number of utilizations is higher for patients who are followed for longer periods of time. However, no study has considered this in conjunction with multi-level data. This provided the motivation to modify zero-inflated models with random effects by incorporating a variable follow-up period.

In this paper, we present a weighted ZIP model with random effects for clustered count data with excess zeros. The proposed model accounts for variable individual follow-up times using a weight function that can be estimated by several approaches. We focus on Poisson models, but the methods can easily be extended to other count distributions, such as the negative binomial.

The paper is organized as follows. In Section 2 we define the weighted ZIP mixed model. In Section 3, we describe the estimation procedures. In Section 4 simulation studies are conducted to show model performance. Section 5 applies the models to an analysis of Medicaid data. Discussion and future research are provided in Section 6.

2. Weighted zero-inflated Poisson mixed model

We consider incorporating the duration of follow-up in ZIP mixed model through a weight function. Let Yi j be the outcome and ti j be the follow-up time for the jth subject within the ith cluster (i = 1, 2, . . . ,m; j = 1, 2, . . . , ni). The probability of observing events is denoted as πi j = Pr(Yi j > 0) and the total number of individuals n=i=1mni. The weight function, w(ti j), where 0 < w(ti j) ≤ 1, is assumed to be an increasing function. w(ti j) = 1 indicates the data is completely observed over the study period whereas w(ti j) < 1 indicates the information is partially observed. This accounts for the increased probability of observing an event for individuals with longer follow-ups than those with shorter periods of time. We assume that the probability of observing events over time ti j, Pr(Yi j > 0; ti j), is equal to w(ti j) Pr(Yi j > 0).

The weighted ZIP mixed model is



logit(πij)=ξij=zijα+ui,         log(λij)=ηij=xijβ+vi,

where zi j and xi j are respectively vectors of covariates for the logistic and the Poisson components, and α and β are the corresponding vectors of regression coefficients. The ui and vi denote the random effects and are assumed to be independent and normally distributed with mean 0 and variance σu2 and σv2, respectively. In this paper, we consider two weights functions. One is a uniform weight function (denoted by W-ZIPu), i.e., w(t) = t/T where T is the complete observation time. The other is an exponential weight function (denoted by W-ZIPe), i.e., w(t) = (1 – eλt)/(1 – eλT ), where λ is a constant hazard and is estimated from the data using an exponential survival function. Their performance is explored in the simulation studies.

3. Model estimation

To ensure convergence in estimation of parameters and random effects the penalized log-likelihood is given by l = l1 + l2, where

l1={ij:yij=0}log [1+exp(ξij)-w(tij)exp(ξij)+w(tij)exp(ξij)exp(-exp(ηij))]-log [1+exp(ξij)]+{ij:yij>0}log(w(tij))+ξij-log [1+exp(ξij)]+yijηij-exp(ηij)-log(yij!),l2=-12(mlog (2πσu2)+σu-2uu+mlog (2πσv2)+σv-2vv)

with l1 being the log-likelihood when the random effects are conditionally fixed, and l2 being the penalty. The complete data log-likelihood lc is constructed as lc = lξ + lη with

lξ=i,jψijlog(1+exp(ξij)-w(tij)exp(ξij))+(1-ψij)ξij-log (1+exp(ξij))-12(mlog (2πσu2)+σu-2uu).lη=i,j(1-ψij)(yijηij-exp(ηij))-12(mlog (2πσv2)+σv-2vv),

where ψi j is a latent variable indicating whether yi j comes from zero (ψi j = 1) or non-zero(ψi j = 0) state. The complete log-likelihood lc can be easily maximized by maximizing the lξ and lη separately. With EM algorithm, ψi j is estimated by its conditional expectation ψij(k) under the current estimates α(k), β(k), u(k), and v(k). Here

ψij(k)={[1+w(tij)exp (-exp (xijβ(k)+vi(k)))1-w(tij)+exp (-tijα(k)-ui(k))]-1,yij=0,0,yij>0.

Details are given in the Appendix A.

4. Simulation

Simulation studies were conducted to evaluate the performance of the proposed weighted ZIP mixed models. Each of 500 data sets was generated from W-ZIP mixed model in (2.1). We consider one common covariate for both logistic and Poisson components zi j = xi j generated from Uniform (0, 1). The true values of the regression coefficients are assumed to be α′= (1, 1) and β′ = (1, 2). The variances of random effects for the logistic ui and Poisson part vi are set as σu2=0.1,σv2=0.2. To mimic the Medicaid data analysis, we simulated observation time periods ti j from an exponential distribution with a parameter 0.1 and set the maximum follow-up at 12 month. As a result, the mean observation follow-up time was 6.8. The response yi j is obtained from weighted ZIP mixed model with either uniform or exponential weight. Two sample sizes n = 250, 500 with m = 10, 25 clusters are considered in this paper.

Tables 1 and 2 presents the results of evaluating the proposed W-ZIP mixed models compared to ZIP mixed model. The bias, mean square error (MSE) and coverage probability are used to compare the performance of the models. The W-ZIP mixed models performs well in estimating all the regression coefficients whereas estimates of logistic part in ZIP mixed models were biased with larger MSE and smaller coverage rate compared with W-ZIP models. In particular, the estimated 95% confidence interval for α̂0 in ZIP model never contained the true coefficient. The coverage probability for σu2 and σv2 across all models was not close to the nominal confidence level but it increases as the cluster size increases from 10 to 25. As expected, the bias and MSE decrease with increasing sample size.

In addition, the sensitivity of the proposed model was evaluated given data with outliers. We used the same simulation settings described earlier and randomly selected 10% or 20% of outcomes. Then outliers were generated from weight ZIP model in (2.1) with ui ~ N(0, 1) and vi ~ N(0, 2) for outliers. Table 3 shows that all W-ZIP models have low bias and MSE for estimating β and the coverage probabilities for all estimates were close to the nominal 95%.

5. Application to Medicaid data

To demonstrate the approach with real study data, we applied it to a study of federally-funded community health centers (HCs) funded by the Bureau of Primary Health Care (BPHC). This study was designed to assess how the use of HCs relates to health care costs and utilization for vulnerable populations in the US. HCs provide comprehensive primary care and supportive services and are required to provide care for Medicaid enrollees. Recent expansions in the HC program have raised concerns about the financial sustainability of the program. Therefore, it is critical to understand if receipt of primary care in a HC has any association with health service utilization and spending for Medicaid enrollees. Of several sub-studies, we focused on a specific study using Medicaid claims data to compare Medicaid enrollees receiving primary care at HCs to non-HC users. We obtained claims data from the Medicaid Analytic eXtract (MAX) files, which is a dataset that contains individual-level protected health information. Data are not public, but available for use by researchers under a data use agreement and demonstration of adequate privacy and security protections. We defined a HC user as a patient who had more than half of their primary care visits in HCs. For this analysis, we used Illinois Medicaid claims data from 2009. We restricted the study population to adults aged 18 years and older. There were 103,379 Medicaid enrollees and roughly 19.4% of them (n = 20,061) obtained the majority of their primary care at HCs and the remaining 83,318 received care mostly elsewhere (e.g., physician office, hospital outpatients, and mixed use). The data set included not only individual-level determinants but also geographical information, Primary Care Service Area (PCSA) (Goodman et al., 2003) for each beneficiary, which approximates the local geographic market for primary care. A high degree of variation across PCSAs exists; however, great homogeneity is frequently observed within a PCSA. Thus a random effect in a model is considered accounting for the clustering effect of a total of 379 PCSAs. The size of PCSA ranged from 1 to 5,655.

Due to the observational nature of the study, there are undoubtedly underlying characteristics that made patients more likely to visit HC or non-HC initially. Thus, we employed propensity scores to balance on observable characteristics between HC and non-HC. We considered potential confounding factors including patient demographics (age, sex, race), insurance characteristics (Medicaid eligibility category, Temporary Aid For Needy Families (TANF) beneficiary indicator), and disease burden (childbirth, Chronic Illness and Disability Payment System for Medicaid (CDPS) risk score). Using 1 : 1 nearest matching, we obtained a subsample (n = 40,122) of the dataset in which patients who did (and did not) receive their primary care service at HC were comparable. Table 4 describes the dataset before and after the propensity score matching.

Among various utilization measures of health services, we focused on the number of ED visits, which is a highly skewed distribution. Over half of individuals (68.6%) had never visited the ED over the year whereas about 0.5% (n = 564) utilized the service more than 10 days. The maximum was 133 days. We observed only 57.9% of the Medicaid patients were enrolled for the entire year and the mean observation period was 9.3 months.

The results of fitting ZIP and W-ZIP mixed models with two different weight functions are given in Table 5. The results suggest that W-ZIPe model fits the data better than the other models in terms of smaller Akaike information criteria (AIC) and Bayesian information criteria (BIC). The estimates across all three models were very similar; HC covariate significantly affect ED visits in both logistic and Poisson parts of the models. Based on the Poisson part of the model, among patients who are potentially in need of emergency room care, HC users have less utilization than non-HC users. The health center setting may provide an efficient means of providing primary care for the Medicaid population. However, the intercept in zero-inflated part of ZIP model is less than W-ZIP models, reflecting the bias of ZIP toward underestimating utilization.

6. Discussion

In this paper, a weighted ZIP mixed model has been developed to analyze hierarchical count data with excess zeros and a variable observation time. We present two simple weight functions to handle the different individual follow-up times; however, our simulation studies show that the weighted models improve the estimate of zero-inflated part adequately. As a result, we believe our proposed model will be a useful tool to analyze properly zero-inflated count data with both random effects (clustering) and censoring in order to answer critical questions in biomedical, medical, and public health applications where such complicated data sometimes arise. To our knowledge, models that address all three issues (i.e., zero-inflation, clustering, and censoring) do not currently exist. In Medicaid health care utilization data analysis, weighted ZIP mixed model using an exponential function provides the best fit and the intercept estimate of zero-inflated part in weighted ZIP models much less bias than the unweighted ZIP model. Alternatively, different individual follow-up times can be handled by incorporating an off-set into the Poisson part (Lee et al. 2001). However, we argue that a subject who experiences no event over the entire study period should be treated differently and properly from one experiencing no event but followed only partially. Model (1), therefore, incorporates a weight both the logistic and Poisson part of the model. Finally, the weight can be easily extended to a function incorporating covariates when an assumption that individual follow-up time depends on a certain covariates is reasonable. For instance, the weight function can be estimated semiparametrically based on the Cox regression model or parametrically. This represents an area of possible future research.


This work was funded under a contract to the Health Resources and Services Administration (HRSA). Medicaid claims data used in the application section is not available under the contract.

Appendix A:

A.1. EM algorithm



Jα,u=[ZK](-2lξξξ)[ZK]+[000σu-2Im],Jβ,v=[XK](-2lηηη)[XK]+[000σv-2Im],lξα=Zlξξ,         lξu=Klξξ-uσu2,lηβ=Xlηη,         lξv=Klηη-vαv2



Then we have



-2lξξξ=Diag [ψexp(ξ)(w-1)(1+exp(ξ)-wexp(ξ))2+exp(ξ)(1+exp(ξ))2],-2lηηη=Diag [(1-ψ)exp(η)].

A.2. Variance component estimation

Suppose is partitioned conformally to :

αuas Jα,u-1=[A11A12A21A22].

Suppose is partitioned conformally to :

βvas Jβ,v-1=[B11B12B21B22].

Then estimators of variance components are given by:


In addition, the asymptotic variance matrix of the variance component estimators are obtained from the inverse of the residual maximum likelihood information matrix (McGilchrist and Yau, 1995) as follows:

var [σ^u2σ^v2]=2[σu-4tr (Im-σu-2A22)200σv-4tr (Im-σv-2B22)2]-1.
Appendix B:

R for fitting weighted ZIP models.

# m = # clusters, ni=# subjects within cluster i, N=sum of ni
# Y(Nx1) outcomes
# Z(pz x N) covariates for logit part
# X(px x N) covariates for Poisson part
# K=cluster indicator (mxN)
# w(Nx1) weight
# alpha, beta, sig2u, sig2v : initial values
estpm =function(Y,Z,X,K,w, alpha=NULL, beta=NULL, sig2u=NULL, sig2v=NULL){
N = dim(Y)[1]
pz = dim(Z)[1]
px = dim(X)[1]
m = dim(K)[1]
if (is.null(alpha)){alpha=as.matrix(c(1,1))}
if (is.null(beta)){beta=as.matrix(c(1,2))}
if (is.null(sig2u)){sig2u=0.5}
if (is.null(sig2v)){sig2v=0.5}
u=as.matrix(rnorm(m, sd=sqrt(sig2u)),m,1)
v=as.matrix(rnorm(m, sd=sqrt(sig2v)),m,1)
it = 0; dta = 1
itMax=1e4; eps=1e-5
while( (iteps) ){
#1) psi
#2) first derivatives
dalpha=Z%*%d1 #pzx1
dbeta=X%*%d2 #pxx1
du=K%*%d1-u/sig2u #mx1
#3) “-2nd derivatives”
#4) inversion matrix
tmp1=T1=matrix(0,pzm,pzm) ; tmp2=T2=matrix(0,pxm,pxm)
for (i in 1:pzm){for (j in 1:pzm){T1[i,j]=sum(ZK[i,]*ZK[j,]*dd1)} }
for (i in 1:pxm){for (j in 1:pxm){T2[i,j]=sum(XK[i,]*XK[j,]*dd2)} }
# 5) update
dta = mean(c(abs(Alpha-alpha),abs(U-u),abs(Beta-beta),abs(V-v)
alpha = Alpha
u = U
beta = Beta
v = V
sig2u = as.vector(Sig2u)
sig2v = as.vector(Sig2v)
it = it + 1
out=list(alpha=alpha, beta=beta, u=u,v=v, sig2u=sig2u, sig2v=sig2v, var1=AI1
, var2=AI2, loglik=loglik,var_sig2u=var_sig2u,var_sig2v=var_sig2v)

Table 1

Bias, MSE, and Cov for ZIP and W-ZIP mixed models (uniform weight function)

ZIP mixedW-ZIPu mixedW-ZIPe mixed

n = 250
m = 10

n = 250
m = 25

n = 500
m = 10

n = 500
m = 25

MSE = mean square error; Cov = coverage probability; ZIP = zero-inflated Poisson; W-ZIP = weighted ZIP.

Coverage probability is the proportion of times the estimated 95% confidence interval contains the true value.

Table 2

Bias, MSE, and Cov for ZIP and W-ZIP mixed models (exponential weight function)

ZIP mixedW-ZIPu mixedW-ZIPe mixed

n = 250
m = 10

n = 250
m = 25

n = 500
m = 10

n = 500
m = 25

MSE = mean square error; Cov = coverage probability; ZIP = zero-inflated Poisson; W-ZIP = weighted ZIP.

Coverage probability is the proportion of times the estimated 95% confidence interval contains the true value.

Table 3

Evaluation for W-ZIP mixed models given data with outliers

W-ZIPu mixedW-ZIPe mixed

n = 250
m = 10
10% outliersα00.0600.2660.950−0.0540.2370.952

20% outliersα00.0880.3360.950−0.0250.2910.948

MSE = mean square error; Cov = coverage probability; ZIP = zero-inflated Poisson; W-ZIP = weighted ZIP.

Coverage probability is the proportion of times the estimated 95% confidence interval contains the true value.

Table 4

Means of patients’ characteristics before and after the propensity score matching

Unmatched (n = 103,379)Matched (n = 40,122)% Balance

n = 20,061n = 83,318n = 20,061n = 20,061Mean Diff.



Medicaid eligibility groupaCash assistance0.1660.1290.1660.16596.8
Medical need0.4090.4390.4090.38623.7

TANF eligibleb0.0570.0270.0570.05699.0

Restricted benefitsc0.1070.0400.1070.11192.9

Delivery during the year0.0790.0440.0790.07691.3

CDPS risk scored0.8051.0280.8050.81197.1

aEnrollees may be have more than one eligibility category assigned over the course of the year. Eligibility categories are grouped from original Medicaid Analytic eXtract data.

bEnrollee is eligible for Temporary Aid For Needy Families (TANF) program in any month during the data year.

cEnrollee is eligible under restricted benefits at any point during the data year.


Table 5

Parameter estimates and standard errors for ZIP and W-ZIP mixed models

VariableZIP mixedW-ZIPu mixedW-ZIPe mixed

EstimateStandard errorEstimateStandard errorEstimateStandard error
Logistic partInt.−0.446a0.0280.118a0.0220.077a0.022

Poisson partInt.0.466a0.0330.477a0.0310.476a0.031


ZIP = zero-inflated Poisson; W-ZIP = weighted ZIP; AIC = Akaike information criteria; BIC = Bayesian information criteria.

aIndicates significance with α < 0.05.

  1. Emerson, SS, McGee, DL, Fennerty, B, Hixson, L, Garewal, H, and Alberts, D (1993). Design and analysis of studies to reduce the incidence of colon polyps. Statistics in Medicine. 12, 339-351.
    Pubmed CrossRef
  2. Goodman, DC, Mick, SS, and Bott, D (2003). Primary care service areas: a new tool for the evaluation of primary care services. Health Services Research. 38, 287-309.
    Pubmed KoreaMed CrossRef
  3. Hall, DB (2000). Zero-inflated and binomial regression with random effects: a case study. Biometrics. 56, 1030-1039.
    Pubmed CrossRef
  4. Hsu, CH (2005). Joint modelling of recurrence and progression of adenomas: a latent variable approach. Statistical Modelling. 5, 201-215.
  5. Hsu, CH (2007). A weighted zero-inflated Poisson model for estimation of recurrence of adenomas. Statistical Methods in Medical Research. 16, 155-166.
    Pubmed CrossRef
  6. Lambert, D (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics. 34, 1-14.
  7. Lee, AH, Wang, K, Scott, JA, Yau, KK, and McLachlan, GJ (2006). Multi-level zero-inflated Poisson regression modelling of correlated count data with excess zeros. Statistical Methods in Medical Research. 15, 47-61.
    Pubmed CrossRef
  8. Lee, AH, Wang, K, and Yau, KK (2001). Analysis of zero-inflated Poisson incorporating extent of exposure. Biometrical Journal. 43, 963-975.
  9. McGilchrist, C, and Yau, K (1995). The derivation of BLUP, ML, REML estimation methods for generalised linear mixed models. Communications in Statistics-Theory and Methods. 24, 2963-2980.
  10. Min, Y, and Agresti, A (2005). Random effect models for repeated measures of zero-inflated count data. Statistical Modelling. 5, 1-19.
  11. Moghimbeigi, A, Eshraghian, MR, Mohammad, K, and McArdle, B (2008). Multilevel zero-inflated negative binomial regression modeling for over-dispersed count data with extra zeros. Journal of Applied Statistics. 35, 1193-1202.
  12. Monod, A (2014). Random effects modeling and the zero-inflated Poisson distribution. Communications in Statistics. 43, 664-680.
  13. Mullahy, J (1986). Specification and testing of some modified count data models. Journal of Econometrics. 33, 341-365.
  14. Ridout, M, Dem챕trio, CG, and Hinde, J (1998). Models for count data with many zeros. Proceedings of the XIXth International Biometric Conference. 19, 179-192.
  15. Yau, KK, and Lee, AH (2001). Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Statistics in Medicine. 20, 2907-2920.
    Pubmed CrossRef