search for

CrossRef (0)
A response probability estimation for non-ignorable non-response
Communications for Statistical Applications and Methods 2022;29:263-275
Published online March 31, 2022
© 2022 Korean Statistical Society.

Hee Young Chunga, Key-Il Shin1,a

aDepartment of Statistics, Hankuk University of Foreign Studies, Korea
Correspondence to: 1 Department of Statistics, Hankuk University of Foreign Studies, 81 Oedae-ro, Yongin, Gyeonggido 17035, Korea. E-mail: keyshin@hufs.ac.kr

This work was supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government
; Revised February 20, 2022; Accepted February 25, 2022.
Use of appropriate technique for non-response occurring in sample survey improves the accuracy of the estimation. Many studies have been conducted for handling non-ignorable non-response and commonly the response probability is estimated using the propensity score method. Recently, post-stratification method to obtain the response probability proposed by Chung and Shin (2017) reduces the e ect of bias and gives a good performance in terms of the MSE. In this study, we propose a new response probability estimation method by combining the propensity score adjustment method using the logistic regression model with post-stratification method used in Chung and Shin (2017). The superiority of the proposed method is confirmed through simulation.
Keywords : response probability model, bias estimation, sample distribution, population distribution, post-stratification
1. Introduction

Non-response in sample survey is a common source of non-sampling error that appears when part of the data to be collected is not observed. In the case of missing at random (MAR) in which non-response occurs randomly, many appropriate statistical methods have been developed. On the other hand, in the case of non-ignorable non-response, there are relatively few studies on this subject. Non-ignorable non-response is known to cause bias and so accurate bias estimation is the key to properly handle the non-response. The PSA estimator, propensity-score-adjusted estimator, defined by


is widely used to reduce non-response bias. Here Ri is 1 if response and Ri is 0 if non-response and s is an index set of sample. Also, wi is the sample weight and i is the estimated response probability. Kim and Riddles (2012) developed some asymptotic theories of the PSA estimator and suggested minimum variance of the estimator. Kim and Yu (2011) studied a mean estimation method for non-ignorable non-response case in which paper the missing value of the variable of interest is non-parametrically calculated using a kernel estimator. Riddles et al. (2016) proposed a propensity score adjustment method for non-ignorable non-response and showed that the proposed method is more efficient than the calibration-weighting method in Chang and Kott (2008) and Kott and Chang (2010).

In order to use the PSA estimator, it is necessary to estimate the response probability. Estimating response probabilities relies heavily on the use of model. Iannacchione et al. (1991) used a logistic regression model for the response probability estimation and this logistic regression model is widely used. Also Da Silva and Opsomer (2006) and Da Silva and Opsomer (2009) considered nonparametric methods to obtain the response probability. Bethlehem (2020) presented an approximate bias of the sample mean for non-ignorable non-response. Various response probability estimation methods have been addressed in that paper.

In general, the final sample weight considering non-response is used for estimation defined by


Therefore, the PSA estimator using w^iF=wi/p^i is one of the non-response adjusted weight estimators. Chung and Shin (2017) suggested a non-response adjusted weight estimation method using a post-stratification which is addressed in Bethlehem (2020). This method reduces the effect of bias and gives a good performance in terms of the MSE.

In this study, we propose a new response probability estimation method that combines the response probability estimation method using logistic regression model with the post-stratification method suggested by Chung and Shin (2017).

The composition of this paper is as follows. In Section 2, the existing methods and a proposed method for the response probability estimation are explained. Section 3 describes the bias corrected PSA estimators using the response probability estimates obtained in Section 2. Section 4 confirms the superiority of the proposed method through simulation studies. There is a conclusion in Section 5.

2. Response probability estimation

2.1. Modeling method

In order to properly handle non-ignorable non-response, the response probability of each data must be appropriately estimated. As mentioned in Bethlehem (2020), various methods can be used for the estimation. The propensity score using the logistic regression model is the most commonly used. Let Ri be the indicator variable representing the response and pi be the response probability. Also let the auxiliary variable xi be a p-dimensional vector and always observable. Usually for non-ignorable non-response, it is assumed that pi is a function of yi and xi. However, in this study, we assume that the response probability pi is a function of yi for simplicity.

Then, we can write pi = P(Ri = 1|yi, xi) = P(Ri = 1|yi) = g2(yi; φ). Practically, we do not have enough information to estimate pi and mostly the function is not known. To handle this situation, Kim and Yu (2011) and Kim and Riddles (2012) used some follow-up samples. However, this is not the usual case.

Mostly, the variable of interest is related to the auxiliary variables and therefore many studies consider a super-population model. Let the super-population model be yi = g1(xi; β) + εi and assume


where θ is a vector of some unknown parameters and ηi is an error related to εi. Since this study considers non-ignorable non-response, it may not be appropriate to use a function of the auxiliary variable xi. However, in reality, in order to estimate the response probability, we have no choice but to use the available auxiliary variables. Considering this, equation (2.1) may be useful. However, bias may occur in the result using the response probability obtained in equation (2.1).

The approximate estimate of pi can be obtained using the available auxiliary variable xi. The most common response probability estimation method is to use the parametric logistic regression model. Of course, nonparametric methods can be used as Da Silva and Opsomer (2006) and Da Silva and Opsomer (2009). Da Silva and Opsomer (2009) showed that local polynomial regression results in better practical and theoretical properties for the ignorable case. Recently Bethlehem (2020) studied the non-ignorable non-response and used the logistic regression modeling. Since the logistic regression model is commonly used and simple to implement, we use the logistic regression model in this paper. The logistic regression model is defined by

log (pi1-pi)=α0+α1x1i++αpxpi.

Using this model, we can obtain the response probability estimate p^iP. Now let w^iP=wi/p^iP. Then the final adjusted sample weight is obtained by following,


In Riddles et al. (2016), (2.2) is called naive estimator for non-ignorable non-response.

2.2. Post-stratification method

Several methods estimating response probability have been developed. One of them is the post-stratification method. As mentioned in Bethlehem (2020), post-stratification is a well-known and frequently used weighting technique. Usually categorical variables are used. Using these variables, population is divided into a number of non-overlapping subpopulations, called strata. Of course, continuous variables can be used and Bethlehem (2020) used the estimated response probability to construct subpopulations. Although (2.1) is an approximate result, naturally we can use the available auxiliary variable in order to construct strata. Chung and Shin (2017) and Min and Shin (2018) proposed a response probability estimation method using strata. In the proposed method by Min and Shin (2018), population is divided into L strata with boundaries obtained using percentiles of given auxiliary variable.

To construct the strata, we determine the total number of strata, L satisfying about r/L ≥ 10 where r is the total number of response samples. Then with L, we calculate the percentiles of the auxiliary variables and use the percentiles as boundaries. For instance, with L = 10, we use 10, 20, . . . , 90 percentiles for boundaries. For univariate case, the percentiles become directly boundaries. However, for multivariate case, we use the Cartesian product with the percentiles of each auxiliary variable to meet the total number of strata L.

Now, let Nh and rh be the number of population and the number of final response samples in the hth stratum, respectively. Then the sample weight of yi in the hth stratum using the post-stratification method is defined by


where sh is the index set of hth stratum sample. Since this method does not use any model to estimate the response probability, it is model free and naturally robust of model misspecification.

2.3. Proposed response probability estimation method

The widely used w^iF(P) is obtained using the individual data values. However, since it is based on a model, the results may vary depending on the model used. On the other hand, w^iF(D) has the advantage of being easy to calculate because it uses the number of samples in a given stratum rather than the individual data values. Again, the result may depend on the method of constructing the strata and the optimal boundaries of the strata are determined according to the given data.

A new method combining two methods explained in section 2.1 and 2.2 is proposed. First we calculate w^iF(P) explained in section 2.1. Then we make sum of w^jF(P) in the hth stratum be Nh. That is, the final sample weight is defined by


where Nh and sh are the same as those defined in (2.3).

3. Bias corrected propensity-score-adjusted (PSA) estimator

In the case of non-ignorable non-response, since the response probability is a function of the variable of interest, it is necessary to estimate the response probability using the variable of interest. However, as mentioned before it is practically hard to obtain the response probability using the variable of interest. In Section 2, the response probability is estimated using the available auxiliary variables and this usually produces bias in estimation. To obtain better estimation results, the bias should be estimated and corrected.

3.1. Bias estimation

For bias estimation, this section considers a super-population model. Let the inclusion probability, πi be a function of the variable of interest yi and fp(yi|xi) be a population distribution with a super-population model. Pfeffermann et al. (1998) showed fs(yi|θ*, xi) = f (yi|is, xi) = {Pr(is|yi, xi) fp(yi|θ, xi)}/{Pr(is|xi)} where θ* is a function of θ. Since Pr(is|yi, xi) = Ep(πi|yi, xi) and Pr(is|xi) = Ep(πi|xi), finally we have


where fs(yi|xi) is a sample distribution, Ep(πi|yi, xi) is the sample inclusion probability given xi, yi and Ep(πi|xi) is the sample inclusion probability given xi. Whenever Ep(πi|yi, xi) = Ep(πi|xi), then population distribution and sample distribution are the same. Therefore, with known inclusion probability and population distribution we can obtain the sample distribution and finally, we can calculate the bias. Bias estimation using various inclusion probabilities and population distributions have been obtained in previous studies.

In this study we consider a non-ignorable non-response with a non-informative sampling. Actually the final inclusion probability πi is obtained by combining the inclusion probability at the time of sampling design and the response probability. As a non-informative sampling design, we use simple random sampling and so wi in (1.1) is constant. Therefore πi used in (3.1) is defined by πi = pi/wi = pi/w where w = N/n. Since pi is a function of the variable of interest yi, we have fs(yi|xi) ≠ fp(yi|xi) and so we can apply the bias estimation method developed in the informative sampling design to non-ignorable non-response.

We consider a linear inclusion probability model whose bias is easily estimated theoretically. This is because, if the linear inclusion probability model is effective, it can be expected to be effective when using other inclusion probability models. The linear inclusion probability model considered is as follows,


Then simply we have


Let Es(yixi)=μi(s). Then plugging (3.4) into (3.1) we have,


where μi = Ep(yi|xi). Since Ep(yi2xi)=Varp(yixi)+μi2, we have μi(s)=μi+b1/b0+b1μi×Varp(yixi) and finally the bias is obtained as following,


3.2. Parameter estimation for inclusion probability model

Bias is estimated based on the super-population model and the inclusion probability model established.

In (3.2), Ep(πi|yi, xi) can be calculated using known parameters b0, b1 and given yi for the informative sampling. However, since we consider the non-ignorable non-response situation with non-informative sampling scheme, this situation usually can not happen in practice. The parameters b0, b1 should be estimated from the data.

In this study we use the linear probability model (3.2) defined by Ep(πi|yi, xi) = b0 + b1yi. Since Ep(πi|yi, xi) = 1/Es(wi|yi, xi) and Es(wi|yi, xi) ≈ wi from Pffeffermann and Sverchkov (2003) we have 1/wib0 + b1yi. Therefore by plugging the estimated final sample weight w^iF obtained in Section 2.1 into wi, we have the following model.


So using simple regression analysis we have the estimates of b0, b1.

3.3. Parameter estimation for super-population model

Three super-population models are considered in this study.

  • Normal distribution


    That is, yi=β0+β1Txi+ɛi where β1T is a p-dimensional vector of parameters, ɛiiidN(0,σ2) and Varp(yi|xi) = σ2.

  • Gamma distribution with log-linear model

    fp(yiαi,μi)yiα-1exp (-αyiμi),

    where μi = Ep(yi|xi) and ln(μi)=β0+β1Txi. That is, we use Γ(α, β*) with β* = μi/α and so we have Varp(yixi)=μi2/α.

  • Log-normal distribution with log-linear model


    where μi=Ep(yixi)=exp(β0+β1Txi+σ2/2) and Varp(yixi)=μi2(exp(σ2)-1).

As used in Riddles et al. (2016), using the obtained data (xi, yi) and the known super-population model, we can estimate μi and Varp(yi|xi).

3.4. Bias corrected propensity-score-adjusted (PSA) estimator

Bias corrected PSA estimator can be obtained based on the bias estimates and the PSA estimator defined in (1.2). That is, the bias corrected PSA estimator is defined by


In (3.11) we can use the final sample weights, w^iF(P),w^iF(D) and w^iF(C) defined in Section 2 and bias^i estimated according to the super-population model. Therefore, we have three bias corrected PSA estimators depending on the final sample weight,

4. Simulation studies

To investigate the finite sample properties of the proposed method, we perform simulation studies. In simulation, for x1iiidUnif(100,200) and x2iiidUnif(30,500) we generate a variable of interest yi according to the following error distribution of the super-population model,

  • Normal distribution: yi = β0 + β1x1i + β2x1i + εi, ɛiiidN(0,900).

    Here we use β0 = 10, β1 = 5 for univariate auxiliary variable case and β0 = 10, β1 = 5, β2 = 3 for bivariate auxiliary variable case.

  • Gamma distribution: yiiidΓ(10,μi/10).

    Here, we use μi = exp(0.01 + 0.03x1i) for univariate auxiliary variable case and μi = exp(0.01 + 0.03x1i + 0.002x2i) for bivariate auxiliary variable case.

  • Log-normal distribution: yiiidLN(μi,0.1) Here we use μi = 0.01 + 0.03x1i for univariate auxiliary variable case and μi = 0.01 + 0.03x1i + 0.002x2i for bivariate auxiliary variable case.

From N = 10, 000 population data, n = 500 samples are drawn using simple random sampling. In order to remove the influence of outliers, 10,100 data are generated and then 100 largest valued data are deleted in the samples of the gamma distribution and log-normal distribution.

We consider three response probability models,

  • exponential response: pi = exp (a0 + a1yi)

  • linear response: pi = b0 + b1yi

  • logistic response: ln (pi1-pi)=c0+c1yi

For the linear response probability model, pi = b0 + b1yi, pi ∈ [0, 1], let πymin and πymax denote the response probabilities corresponding to the minimum and maximum values of yi, respectively. Using given response probabilities, for instance (πymin,πymax)=(0.9,0.3), we calculate b0, b1 and then calculate pi. We consider four cases of response probabilities,

  • Case I (πymin,πymax)=(0.9,0.3),

  • Case II : (πymin,πymax)=(0.8,0.3),

  • Case III :(πymin,πymax)=(0.3,0.8),

  • Case IV :(πymin,πymax)=(0.3,0.9).

Response data are obtained according to the calculated response probability pi. Similarly, the exponential response probability model and the logistic response probability model are used to generate response data.

Three PSA estimators, Y¯^P,Y¯^D and Y¯^C are calculated according to the sample weight estimation method defined in (2.2), (2.3) and (2.4). In (2.3) and (2.4) for univariate case, the number of strata L = 20 is used. For bivariate case, we use L1 = 5 for x1i and L2 = 4 for x2i. Also three bias corrected PSA estimators, Y¯^PBC,Y¯^DBC and Y¯^CBC defined in section 3.4 are calculated. The performance of the estimators is compared using the following comparison statistics; bias, absolute relative bias (ARB), and root mean squared error (RMSE) defined by


Here, R = 1, 000 is used and the comparison statistics are calculated by generating a new population for each iteration. This is to reduce the influence of the specific population generated, so the true value of the rth repeat population mean is denoted as r.

Tables 1 to 3 contain the results of the PSA estimators. Table 1 shows the results when the super-population model is linear and the error distribution is normal with three response probability models. Even though there are some differences of response rates depending on the response probability cases, in Case I and IV, the response rate of about 53 – 65% is obtained, and in Case II and III, the response rate of about 50 – 56% is obtained. The bias of Y¯^P in Case I and II shows smaller than that in the other cases and shows the smallest in the exponential response model and the largest in the logistic response model. Also we have similar results of Y¯^P in terms of ARB and RMSE regardless of the response model.

Comparing Y¯^P,Y¯^D and Y¯^C in terms of bias, in some cases Y¯^P gives better results than the others, however Y¯^D and Y¯^C have good results in terms of ARB and RMSE with a very large difference. Also comparing Y¯^D and Y¯^C,Y¯^C is superior to Y¯^D, although it is not a big difference in the results of ARB and RMSE. In particular, Y¯^C is superior to Y¯^D in the bias results. Therefore, based on the results of Table 1, we can conclude that the proposed response probability estimation method gives the best results.

Table 2 shows the results when the super-population model is a log-linear model and the error distribution is a gamma distribution. Also three response probability models are considered. In Case I and II, the response rate of about 65 – 80% is obtained, and in Case III and IV, the response rate of about 40 – 45% is obtained. This result comes from the asymmetry of the gamma distribution. Investigating the results of Table 2, one can see that large biases occur in all estimators and the bias has a very large effect on the RMSE. As shown in Table 1, Y¯^C is the best through all comparison statistics results in Table 2.

Table 3 shows the results where the super-population model is a log-linear model and the error distribution is log-normal with three response probability models. The response rate is very similar to the result of gamma distribution. Also in Table 3, Y¯^C is the best through all comparison statistics results.

Table 4 shows the results of the bias corrected PSA estimator using the linear response probability model. To obtain the bias corrected PSA estimator, known response probability model and known super-population model are required. In this simulation, we use that the error distribution of the linear super-population model is normal and the response probability model is linear. Also we consider gamma and log-normal distributions as the error distribution of the log-linear super-population model.

Through the results, Y¯^PBC gives the worst results in terms of ARB and RMSE and even the results are worse than those of Y¯^C in the linear response probability model of Table 2. Comparing Y¯^DBC and Y¯^CBC, we can see that Y¯^DBC and Y¯^CBC have similar results in terms of ARB and RMSE. However, the bias of Y¯^CBC is smaller than that of Y¯^DBC. Also comparing Y¯^C and Y¯^CBC, we have that Y¯^CBC have better results with a very large difference. Therefore, it is very reasonable to use a bias corrected PSA estimator when the error distribution is known and the response probability model is linear. Of course, the bias corrected PSA estimator can be applied if bias is known.

Table 5 and Table 6 show the results of bivariate auxiliary variable case. The response rates are similar to the univariate variable cases. In Table 5, the result of the PSA estimator, it can be seen that the bias of Y¯^P is relatively small in the normal super-population result. However, the biases of Y¯^P in the gamma distribution and log-normal distribution are larger than those of other estimators. Also, in terms of ARB and RMSE, Y¯^P gives inferior results when compared to other estimators, so it is not good to use Y¯^P for a bivariate auxiliary variable case. Comparing Y¯^D and Y¯^C in the normal super-population result, it can be seen that the bias of Y¯^C is smaller. In addition, Y¯^C shows good results in terms of ARB and RMSE. This pattern of results also can be seen in the gamma distribution and log-normal distribution. Therefore, we conclude that Y¯^C gives the best result when using PSA estimator in the bivariate auxiliary variable case.

Table 6 shows the results of the bias corrected PSA estimator. As in the case of univariate auxiliary variable, Y¯^PBC gives the best result in terms of all comparison statistics in the case of bivariate auxiliary variable. Therefore, it is appropriate to use Y¯^DBC when using the bias corrected PSA estimator.

By comparing the results of Table 5 and Table 6, the effect of bias correction can be confirmed. The effect of bias correction of Y¯^PBC can be seen in the gamma distribution and the log-normal distribution except for the normal distribution. On the other hand, in the results of Y¯^DBC and Y¯^CBC, the bias correction effect can be seen in all distributions. Also the bias correction effect of Y¯^CBC is seen for all distributions, especially the gamma and log-normal distributions. Therefore, it can be confirmed through simulation studies that the bias corrected PSA estimator gives good results when a known super-population model is used in the linear response probability model.

5. Conclusion

Recently, a lot of non-ignorable non-response has occurred in sample survey and several studies have been conducted to properly deal with it. Non-ignorable non-response is not easy to deal with because it causes bias. In particular, in order to properly handle non-ignorable non-responses, accurate estimation of response probability plays an important role. However, since the response probability is a function of the variable of interest, it is not easy to accurately estimate the response probability in practice. Therefore, practically the method of estimating the response probability should use available auxiliary variables. In this study we propose a new response probability estimation method using available auxiliary variables. It is confirmed that the proposed method provides better result than the existing methods and the bias corrected PSA estimator produces significantly better results.

Here we note that the results in this study are obtained using a linear response probability model with known super population models. A model misspecification of the response probability model is a big concern and the linear model may be neither well motivated nor appropriate in practice. Also the super population model plays an important role and a model missspecification of the super population model is also concerned. Therefore, it is necessary to study a method that can be used for an arbitrary unknown response probability model and an arbitrary unknown super-population model.


Table 1

Results of PSA estimator with Normal distribution






Table 2

Results of PSA estimator with Gamma distribution






Table 3

Results of PSA estimator with Log-normal distribution






Table 4

Results of bias corrected PSA estimator using linear response probability model






Table 5

Results of PSA estimator using linear response probability model with bivariate auxiliary variable






Table 6

Results of bias corrected PSA estimator using linear response probability model with bivariate auxiliary variable






  1. Bethlehem J (2020). Working with response probabilities. Journal of Official Statistics, 36, 647-674.
  2. Chang T and Kott PS (2008). Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika, 95, 555-571.
  3. Chung HY and Shin KI (2017). Estimation using informative sampling technique when response probability follows exponential function of variable of interest. Korean Journal of Applied Statistics, 30, 993-1004.
  4. Da Silva DN and Opsomer JD (2006). A kernel smoothing method of adjusting for unit non-response in sample surveys. Canadian Journal of Statistics, 34, 563-579.
  5. Da Silva DN and Opsomer JD (2009). Nonparametric propensity weighting for survey nonresponse through local polynomial regression. Survey Methodology, 35, 165-176.
  6. Iannacchinoe VG, Milne JG, and Folsom RE (1991). Response probability weight adjustment using logistic regression. Proceedings of the Survey Research Methods Section. , American Statistical Association, 637-642.
  7. Kim JK and Riddles MK (2012). Some theory for propensity-score-adjustment estimators in survey sampling. Survey Methodology, 38, 157-165.
  8. Kim JK and Yu CL (2011). A semiparametric estimation of mean functionals with nonignorable missing data. Journal of the American Statistical Association, 106, 157-165.
  9. Kott PS and Chang T (2010). Using Calibration weighting to adjust for nonignorable unit nonresponse. American Statistical Association, 105, 1265-1275.
  10. Min JW and Shin KI (2018). A study on the determination of substrata using the information of exponential response rate by simulation studies. Korean Journal of Applied Statistics, 31, 621-636.
  11. Pfeffermann D, Krieger AM, and Rinott Y (1998). Parametric distributions of complex survey data under informative probability sampling. Statistica Sinica, 8, 1087-1114.
  12. Pfeffermann D and Sverchkov M (2003). Small area estimation under informative sampling. Proceedings of the Survey Research Method Section American Statistical Asscociation. , 3284-3295.
  13. Riddles MK, Kim JK, and Im J (2016). A propensity-score-adjustment method for nonignorable nonresponse. Journal of Survey Statistics and Methodology, 4, 215-245.