In medical or public health research, it is common to encounter clustered or longitudinal count data that exhibit excess zeros. For example, health care utilization data often have a multi-modal distribution with excess zeroes as well as a multilevel structure where patients are nested within physicians and hospitals. To analyze this type of data, zero-inflated count models with mixed effects have been developed where a count response variable is assumed to be distributed as a mixture of a Poisson or negative binomial and a distribution with a point mass of zeros that include random effects. However, no study has considered a situation where data are also censored due to the finite nature of the observation period or follow-up. In this paper, we present a weighted version of zero-inflated Poisson model with random effects accounting for variable individual follow-up times. We suggested two different types of weight function. The performance of the proposed model is evaluated and compared to a standard zero-inflated mixed model through simulation studies. This approach is then applied to Medicaid data analysis.
Health care utilization data are often analyzed to address critical questions about health care service and delivery, such as resource utilization and planning, allocation of services and the evaluation of patient outcomes. Such analysis is of increasing importance for policy makers and health care institutions to improve the quality of patient care. Measurement of utilization includes the frequency of visits to medical providers, visits to emergency department (ED), days spent in a hospital, and use of prescription medication. Such count data are frequently found overdispersed with heavy tails and are often multi-modal with excess zeroes. For example, few patients use a service multiple times, whereas the majority report no utilization in a specific time period. Traditional approaches for overdispersed count data with extra variation have been the use of zero-inflated models, that include zero-inflated Poisson (ZIP) (Lambert, 1992) and zero-inflated negative binomial (ZINB) (Ridout
Data on health care use often have a multilevel structure where patients are typically nested within physicians, hospitals, and geographic regions. Consequently, intra-cluster correlation and heterogeneity between clusters must be considered. In such cases, random effects are commonly incorporated into the zero-inflated model. Hall (2000) proposed ZIP with a random intercept to account for the within-subject dependence in the Poisson state. Yao and Lee (2001) introduced a pair of uncorrelated random effects to both zero and Poisson components. Min and Agresti (2005) suggested linking the two components by the joint distribution of the random effects. Lee
In addition to a clustered structure, health care utilization data are often censored due to the finite nature of the observation period or follow-up. For instance, when utilization data are collected in long-term observational or randomized clinical trials, complete data are not available for some patients because they are not followed until the endpoint of interest or they enter in the middle of the study. Thus, patients with shorter follow-up should be treated differently from patients who have data for the entire study period because the incidence rate of utilization would otherwise be underestimated. Several authors have proposed modified zero-inflated models to handle variable follow-up (Emerson
In this paper, we present a weighted ZIP model with random effects for clustered count data with excess zeros. The proposed model accounts for variable individual follow-up times using a weight function that can be estimated by several approaches. We focus on Poisson models, but the methods can easily be extended to other count distributions, such as the negative binomial.
The paper is organized as follows. In Section 2 we define the weighted ZIP mixed model. In Section 3, we describe the estimation procedures. In Section 4 simulation studies are conducted to show model performance. Section 5 applies the models to an analysis of Medicaid data. Discussion and future research are provided in Section 6.
We consider incorporating the duration of follow-up in ZIP mixed model through a weight function. Let
The weighted ZIP mixed model is
and
where
To ensure convergence in estimation of parameters and random effects the penalized log-likelihood is given by
with
where
Details are given in the
Simulation studies were conducted to evaluate the performance of the proposed weighted ZIP mixed models. Each of 500 data sets was generated from W-ZIP mixed model in (
Tables 1 and 2 presents the results of evaluating the proposed W-ZIP mixed models compared to ZIP mixed model. The bias, mean square error (MSE) and coverage probability are used to compare the performance of the models. The W-ZIP mixed models performs well in estimating all the regression coefficients whereas estimates of logistic part in ZIP mixed models were biased with larger MSE and smaller coverage rate compared with W-ZIP models. In particular, the estimated 95% confidence interval for
In addition, the sensitivity of the proposed model was evaluated given data with outliers. We used the same simulation settings described earlier and randomly selected 10% or 20% of outcomes. Then outliers were generated from weight ZIP model in (
To demonstrate the approach with real study data, we applied it to a study of federally-funded community health centers (HCs) funded by the Bureau of Primary Health Care (BPHC). This study was designed to assess how the use of HCs relates to health care costs and utilization for vulnerable populations in the US. HCs provide comprehensive primary care and supportive services and are required to provide care for Medicaid enrollees. Recent expansions in the HC program have raised concerns about the financial sustainability of the program. Therefore, it is critical to understand if receipt of primary care in a HC has any association with health service utilization and spending for Medicaid enrollees. Of several sub-studies, we focused on a specific study using Medicaid claims data to compare Medicaid enrollees receiving primary care at HCs to non-HC users. We obtained claims data from the Medicaid Analytic eXtract (MAX) files, which is a dataset that contains individual-level protected health information. Data are not public, but available for use by researchers under a data use agreement and demonstration of adequate privacy and security protections. We defined a HC user as a patient who had more than half of their primary care visits in HCs. For this analysis, we used Illinois Medicaid claims data from 2009. We restricted the study population to adults aged 18 years and older. There were 103,379 Medicaid enrollees and roughly 19.4% of them (
Due to the observational nature of the study, there are undoubtedly underlying characteristics that made patients more likely to visit HC or non-HC initially. Thus, we employed propensity scores to balance on observable characteristics between HC and non-HC. We considered potential confounding factors including patient demographics (age, sex, race), insurance characteristics (Medicaid eligibility category, Temporary Aid For Needy Families (TANF) beneficiary indicator), and disease burden (childbirth, Chronic Illness and Disability Payment System for Medicaid (CDPS) risk score). Using 1 : 1 nearest matching, we obtained a subsample (
Among various utilization measures of health services, we focused on the number of ED visits, which is a highly skewed distribution. Over half of individuals (68.6%) had never visited the ED over the year whereas about 0.5% (
The results of fitting ZIP and W-ZIP mixed models with two different weight functions are given in Table 5. The results suggest that W-ZIP
In this paper, a weighted ZIP mixed model has been developed to analyze hierarchical count data with excess zeros and a variable observation time. We present two simple weight functions to handle the different individual follow-up times; however, our simulation studies show that the weighted models improve the estimate of zero-inflated part adequately. As a result, we believe our proposed model will be a useful tool to analyze properly zero-inflated count data with both random effects (clustering) and censoring in order to answer critical questions in biomedical, medical, and public health applications where such complicated data sometimes arise. To our knowledge, models that address all three issues (i.e., zero-inflation, clustering, and censoring) do not currently exist. In Medicaid health care utilization data analysis, weighted ZIP mixed model using an exponential function provides the best fit and the intercept estimate of zero-inflated part in weighted ZIP models much less bias than the unweighted ZIP model. Alternatively, different individual follow-up times can be handled by incorporating an off-set into the Poisson part (Lee
This work was funded under a contract to the Health Resources and Services Administration (HRSA). Medicaid claims data used in the application section is not available under the contract.
where
and
Then we have
and
Suppose is partitioned conformally to :
Suppose is partitioned conformally to :
Then estimators of variance components are given by:
In addition, the asymptotic variance matrix of the variance component estimators are obtained from the inverse of the residual maximum likelihood information matrix (McGilchrist and Yau, 1995) as follows:
R for fitting weighted ZIP models.
Bias, MSE, and Cov for ZIP and W-ZIP mixed models (uniform weight function)
ZIP mixed | W-ZIP | W-ZIP | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Bias | MSE | Cov† | Bias | MSE | Cov | Bias | MSE | Cov | ||
−1.292 | 1.735 | 0.000 | −0.037 | 0.268 | 0.930 | −0.194 | 0.297 | 0.912 | ||
−0.673 | 0.627 | 0.662 | 0.047 | 0.966 | 0.940 | 0.143 | 0.867 | 0.956 | ||
−0.046 | 0.007 | 0.282 | 0.098 | 0.080 | 0.750 | 0.067 | 0.058 | 0.790 | ||
0.008 | 0.015 | 0.952 | 0.010 | 0.015 | 0.952 | 0.022 | 0.017 | 0.924 | ||
0.002 | 0.017 | 0.938 | 0.000 | 0.017 | 0.944 | −0.011 | 0.017 | 0.940 | ||
−0.010 | 0.005 | 0.292 | −0.010 | 0.005 | 0.300 | −0.008 | 0.004 | 0.282 | ||
−1.282 | 1.714 | 0.000 | −0.001 | 0.287 | 0.918 | −0.127 | 0.337 | 0.900 | ||
−0.654 | 0.615 | 0.670 | 0.093 | 0.982 | 0.948 | 0.019 | 1.057 | 0.926 | ||
−0.058 | 0.007 | 0.446 | 0.080 | 0.077 | 0.854 | 0.050 | 0.062 | 0.886 | ||
−0.002 | 0.026 | 0.942 | −0.001 | 0.026 | 0.940 | 0.004 | 0.029 | 0.930 | ||
0.009 | 0.015 | 0.944 | 0.007 | 0.014 | 0.948 | −0.001 | 0.017 | 0.948 | ||
−0.002 | 0.010 | 0.490 | −0.003 | 0.010 | 0.496 | −0.004 | 0.010 | 0.456 | ||
−1.289 | 1.694 | 0.000 | 0.000 | 0.138 | 0.928 | −0.145 | 0.149 | 0.892 | ||
−0.646 | 0.520 | 0.482 | −0.017 | 0.509 | 0.914 | −0.005 | 0.413 | 0.950 | ||
−0.066 | 0.006 | 0.154 | 0.030 | 0.023 | 0.730 | 0.016 | 0.021 | 0.692 | ||
0.005 | 0.010 | 0.964 | 0.006 | 0.010 | 0.962 | 0.005 | 0.013 | 0.942 | ||
0.001 | 0.007 | 0.956 | −0.000 | 0.007 | 0.956 | 0.003 | 0.008 | 0.950 | ||
−0.005 | 0.004 | 0.306 | 0.005 | −0.004 | 0.306 | −0.004 | 0.004 | 0.298 | ||
−1.299 | 1.728 | 0.000 | −0.026 | 0.145 | 0.924 | −0.155 | 0.139 | 0.900 | ||
−0.631 | 0.508 | 0.470 | 0.029 | 0.484 | 0.934 | 0.008 | 0.401 | 0.938 | ||
−0.078 | 0.007 | 0.124 | 0.009 | 0.022 | 0.874 | 0.000 | 0.016 | 0.846 | ||
0.008 | 0.024 | 0.926 | 0.009 | 0.024 | 0.928 | −0.010 | 0.025 | 0.914 | ||
−0.006 | 0.007 | 0.950 | −0.007 | 0.007 | 0.954 | −0.003 | 0.008 | 0.936 | ||
−0.004 | 0.008 | 0.504 | −0.004 | 0.008 | 0.502 | 0.009 | 0.011 | 0.474 |
MSE = mean square error; Cov = coverage probability; ZIP = zero-inflated Poisson; W-ZIP = weighted ZIP.
Bias, MSE, and Cov for ZIP and W-ZIP mixed models (exponential weight function)
ZIP mixed | W-ZIP | W-ZIP | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Bias | MSE | Cov† | Bias | MSE | Cov | Bias | MSE | Cov | ||
−1.224 | 1.573 | 0.004 | 0.094 | 0.338 | 0.944 | −0.064 | 0.244 | 0.942 | ||
−0.620 | 0.590 | 0.710 | 0.034 | 1.058 | 0.956 | 0.078 | 0.911 | 0.958 | ||
−0.042 | 0.009 | 0.226 | 0.116 | 0.099 | 0.764 | 0.081 | 0.070 | 0.764 | ||
0.010 | 0.017 | 0.944 | 0.012 | 0.017 | 0.940 | 0.013 | 0.015 | 0.942 | ||
0.004 | 0.017 | 0.938 | 0.002 | 0.017 | 0.946 | −0.004 | 0.016 | 0.928 | ||
−0.005 | 0.004 | 0.348 | −0.005 | 0.004 | 0.336 | −0.010 | 0.004 | 0.284 | ||
−1.207 | 1.529 | 0.000 | 0.132 | 0.616 | 0.928 | 0.012 | 0.314 | 0.948 | ||
−0.621 | 0.602 | 0.686 | 0.081 | 1.628 | 0.942 | 0.025 | 0.905 | 0.954 | ||
−0.063 | 0.007 | 0.392 | 0.059 | 0.065 | 0.878 | 0.078 | 0.075 | 0.868 | ||
−0.014 | 0.027 | 0.954 | −0.013 | 0.027 | 0.954 | 0.002 | 0.031 | 0.910 | ||
0.008 | 0.015 | 0.942 | 0.007 | 0.015 | 0.944 | −0.005 | 0.014 | 0.946 | ||
−0.003 | 0.010 | 0.500 | −0.003 | 0.010 | 0.498 | −0.001 | 0.010 | 0.484 | ||
−1.197 | 1.466 | 0.000 | 0.075 | 0.155 | 0.938 | −0.018 | 0.126 | 0.946 | ||
−0.650 | 0.520 | 0.440 | 0.071 | 0.500 | 0.946 | 0.023 | 0.421 | 0.958 | ||
−0.065 | 0.006 | 0.144 | 0.040 | 0.034 | 0.732 | 0.027 | 0.026 | 0.718 | ||
0.003 | 0.013 | 0.930 | 0.005 | 0.013 | 0.930 | 0.010 | 0.012 | 0.944 | ||
0.003 | 0.007 | 0.944 | 0.001 | 0.007 | 0.944 | −0.006 | 0.007 | 0.954 | ||
−0.001 | 0.004 | 0.312 | −0.001 | 0.004 | 0.310 | 0.002 | 0.004 | 0.308 | ||
−1.197 | 1.463 | 0.000 | 0.087 | 0.123 | 0.962 | 0.016 | 0.126 | 0.952 | ||
−0.634 | 0.485 | 0.452 | 0.083 | 0.407 | 0.960 | 0.005 | 0.403 | 0.946 | ||
−0.075 | 0.007 | 0.156 | 0.016 | 0.028 | 0.850 | 0.030 | 0.029 | 0.800 | ||
−0.004 | 0.025 | 0.924 | −0.003 | 0.025 | 0.926 | 0.002 | 0.022 | 0.940 | ||
0.006 | 0.007 | 0.942 | 0.005 | 0.007 | 0.946 | −0.003 | 0.006 | 0.954 | ||
−0.006 | 0.009 | 0.486 | −0.006 | 0.009 | 0.482 | −0.004 | 0.009 | 0.488 |
MSE = mean square error; Cov = coverage probability; ZIP = zero-inflated Poisson; W-ZIP = weighted ZIP.
Evaluation for W-ZIP mixed models given data with outliers
W-ZIP | W-ZIP | |||||||
---|---|---|---|---|---|---|---|---|
Bias | MSE | Cov† | Bias | MSE | Cov | |||
10% outliers | 0.060 | 0.266 | 0.950 | −0.054 | 0.237 | 0.952 | ||
0.088 | 1.058 | 0.956 | 0.059 | 0.941 | 0.958 | |||
0.014 | 0.047 | 0.912 | 0.013 | 0.047 | 0.912 | |||
−0.012 | 0.022 | 0.914 | −0.011 | 0.022 | 0.914 | |||
20% outliers | 0.088 | 0.336 | 0.950 | −0.025 | 0.291 | 0.948 | ||
0.045 | 0.896 | 0.962 | 0.018 | 0.798 | 0.964 | |||
0.000 | 0.070 | 0.904 | −0.001 | 0.070 | 0.904 | |||
−0.023 | 0.030 | 0.918 | −0.023 | 0.030 | 0.918 |
MSE = mean square error; Cov = coverage probability; ZIP = zero-inflated Poisson; W-ZIP = weighted ZIP.
Means of patients’ characteristics before and after the propensity score matching
Unmatched ( | Matched ( | % Balance | ||||
---|---|---|---|---|---|---|
HC | Non-HC | HC | Non-HC | Improvement | ||
Mean Diff. | ||||||
Age | 31.4 | 33.0 | 31.4 | 31.5 | 94.9 | |
Female | 0.816 | 0.731 | 0.816 | 0.813 | 96.3 | |
Race | White | 0.186 | 0.509 | 0.186 | 0.189 | 99.1 |
Black | 0.539 | 0.283 | 0.539 | 0.538 | 99.6 | |
Hispanic | 0.134 | 0.073 | 0.134 | 0.135 | 99.2 | |
Others | 0.141 | 0.135 | 0.141 | 0.138 | 42.9 | |
Medicaid eligibility groupa | Cash assistance | 0.166 | 0.129 | 0.166 | 0.165 | 96.8 |
Blind/disabled | 0.206 | 0.274 | 0.206 | 0.208 | 96.9 | |
Medical need | 0.409 | 0.439 | 0.409 | 0.386 | 23.7 | |
Poverty | 0.279 | 0.221 | 0.279 | 0.291 | 79.1 | |
Aged | 0.006 | 0.011 | 0.006 | 0.006 | 94.4 | |
Others | 0.365 | 0.403 | 0.365 | 0.371 | 82.4 | |
TANF eligibleb | 0.057 | 0.027 | 0.057 | 0.056 | 99.0 | |
Restricted benefitsc | 0.107 | 0.040 | 0.107 | 0.111 | 92.9 | |
Delivery during the year | 0.079 | 0.044 | 0.079 | 0.076 | 91.3 | |
CDPS risk scored | 0.805 | 1.028 | 0.805 | 0.811 | 97.1 |
^{a}Enrollees may be have more than one eligibility category assigned over the course of the year. Eligibility categories are grouped from original Medicaid Analytic eXtract data.
^{b}Enrollee is eligible for Temporary Aid For Needy Families (TANF) program in any month during the data year.
^{c}Enrollee is eligible under restricted benefits at any point during the data year.
Parameter estimates and standard errors for ZIP and W-ZIP mixed models
Variable | ZIP mixed | W-ZIP | W-ZIP | ||||
---|---|---|---|---|---|---|---|
Estimate | Standard error | Estimate | Standard error | Estimate | Standard error | ||
Logistic part | Int. | −0.446a | 0.028 | 0.118a | 0.022 | 0.077a | 0.022 |
HC | 0.068a | 0.021 | 0.074a | 0.018 | 0.076a | 0.018 | |
0.068 | 0.001 | 0.033 | 0.001 | 0.032 | 0.001 | ||
Poisson part | Int. | 0.466a | 0.033 | 0.477a | 0.031 | 0.476a | 0.031 |
HC | −0.034a | 0.013 | −0.032a | 0.013 | −0.032a | 0.013 | |
0.171 | 0.001 | 0.152 | 0.001 | 0.154 | 0.001 | ||
−Log-likelihood | 47640.62 | 47719.06 | 47517.61 | ||||
AIC | 95293.24 | 95450.13 | 95047.23 | ||||
BIC | 95315.72 | 95472.60 | 95069.70 |
ZIP = zero-inflated Poisson; W-ZIP = weighted ZIP; AIC = Akaike information criteria; BIC = Bayesian information criteria.