
Non-response in sample surveys is a common source of non-sampling error that arises when part of the data to be collected is not observed. When the data are missing at random (MAR), that is, when non-response occurs randomly, many appropriate statistical methods have been developed. For non-ignorable non-response, by contrast, there are relatively few studies. Non-ignorable non-response is known to cause bias, so accurate bias estimation is the key to handling it properly. The PSA estimator (propensity-score-adjusted estimator), defined by
is widely used to reduce non-response bias. Here
To use the PSA estimator, the response probability must be estimated. Estimating response probabilities relies heavily on the use of a model. Iannacchione
In general, estimation uses the final sample weight that accounts for non-response, defined by
Therefore, the PSA estimator using
In this study, we propose a new response probability estimation method that combines response probability estimation based on a logistic regression model with the post-stratification method suggested by Chung and Shin (2017).
The paper is organized as follows. Section 2 explains the existing methods and the proposed method for response probability estimation. Section 3 describes the bias-corrected PSA estimators that use the response probability estimates obtained in Section 2. Section 4 examines the performance of the proposed method relative to the existing methods through simulation studies. Section 5 concludes.
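As a concrete illustration of the PSA estimator discussed above, the following is a minimal sketch that assumes the common inverse-probability form, in which each respondent's design weight is divided by an estimated response probability. The function name `psa_total` and the toy numbers are hypothetical and are not taken from this paper; the paper's own definition remains authoritative.

```python
from typing import Sequence


def psa_total(y: Sequence[float], w: Sequence[float], p_hat: Sequence[float]) -> float:
    """PSA estimate of a population total from respondents only.

    Assumed form (not necessarily the paper's exact definition):
        sum over respondents of w_i * y_i / p_hat_i
    where w_i is the design weight and p_hat_i the estimated
    response probability of respondent i.
    """
    if not (len(y) == len(w) == len(p_hat)):
        raise ValueError("y, w and p_hat must have the same length")
    total = 0.0
    for yi, wi, pi in zip(y, w, p_hat):
        if not 0.0 < pi <= 1.0:
            raise ValueError("estimated response probabilities must lie in (0, 1]")
        total += wi * yi / pi
    return total


# Toy data: three respondents with equal design weights.
y = [10.0, 20.0, 30.0]
w = [5.0, 5.0, 5.0]
p_hat = [0.5, 0.5, 1.0]
estimate = psa_total(y, w, p_hat)  # 100 + 200 + 150 = 450
```

Dividing by a small estimated probability inflates a respondent's contribution, which is why the quality of the response probability estimates matters so much for this estimator.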
To handle non-ignorable non-response properly, the response probability of each unit must be estimated appropriately. As mentioned in Bethlehem (2020), various methods can be used for this estimation. The propensity score based on a logistic regression model is the most commonly used. Let
Then, we can write
In most cases the variable of interest is related to the auxiliary variables, and therefore many studies consider a super-population model. Let the super-population model be
where
The approximate estimate of
Using this model, we can obtain the response probability estimate
In Riddles
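The logistic regression approach of this subsection can be sketched as follows. This is a minimal illustration that assumes a single auxiliary variable `x` observed for every sampled unit and a response indicator `r`; the function names, the Newton-Raphson fitting routine, and the toy data are assumptions made for the example, not the paper's implementation.

```python
import math
from typing import Sequence, Tuple


def fit_logistic(x: Sequence[float], r: Sequence[int], n_iter: int = 50) -> Tuple[float, float]:
    """Fit P(r = 1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ri in zip(x, r):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            v = p * (1.0 - p)
            g0 += ri - p            # score for the intercept
            g1 += (ri - p) * xi     # score for the slope
            h00 += v                # entries of the information matrix
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:            # singular information matrix: stop
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def response_probability(b0: float, b1: float, xi: float) -> float:
    """Estimated response probability for a unit with auxiliary value xi."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))


# Toy data: a binary auxiliary variable with response rates 1/4 and 3/4.
x = [0, 0, 0, 0, 1, 1, 1, 1]
r = [1, 0, 0, 0, 1, 1, 1, 0]
b0, b1 = fit_logistic(x, r)
p_hat = [response_probability(b0, b1, xi) for xi in x]
```

With a binary auxiliary variable the fitted probabilities reproduce the group response rates, which is a convenient sanity check. Because only `x` enters the model, the fit cannot capture any dependence of response on the unobserved variable of interest.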
Several methods for estimating the response probability have been developed. One of them is the post-stratification method. As mentioned in Bethlehem (2020), post-stratification is a well-known and frequently used weighting technique. Categorical variables are usually used: the population is divided by these variables into a number of non-overlapping subpopulations, called strata. Continuous variables can also be used, and Bethlehem (2020) used the estimated response probability to construct the subpopulations. Although (
To construct the strata, we determine the total number of strata,
Now, let
where
The widely used
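A minimal sketch of a post-stratified response probability estimate, under the following assumptions that are not taken from the paper: strata are formed by sorting the sample on a single score (an auxiliary variable, or an estimated response probability as in Bethlehem (2020)), the strata have nearly equal sizes, and the estimate in each stratum is the unweighted response rate. The function name is hypothetical.

```python
from typing import List, Sequence


def poststratified_response_prob(score: Sequence[float],
                                 r: Sequence[int],
                                 n_strata: int) -> List[float]:
    """Assign each sampled unit the response rate of its stratum.

    score    : value used to order the units (assumed known for all units)
    r        : response indicator, 1 = respondent, 0 = non-respondent
    n_strata : number of strata, between 1 and the sample size
    """
    n = len(score)
    if len(r) != n:
        raise ValueError("score and r must have the same length")
    if not 1 <= n_strata <= n:
        raise ValueError("n_strata must be between 1 and the sample size")
    order = sorted(range(n), key=lambda i: score[i])
    p_hat = [0.0] * n
    for h in range(n_strata):
        # Integer cut points give strata whose sizes differ by at most one.
        members = order[h * n // n_strata:(h + 1) * n // n_strata]
        rate = sum(r[i] for i in members) / len(members)
        for i in members:
            p_hat[i] = rate
    return p_hat


# Toy data: six units split into two strata of three units each.
score = [3.0, 1.0, 2.0, 4.0, 6.0, 5.0]
r = [1, 0, 0, 1, 1, 1]
p_hat = poststratified_response_prob(score, r, 2)
```

A stratum without respondents yields an estimate of zero, which cannot be used as a divisor in a PSA estimator, so in practice the number of strata has to be chosen with the response rate in mind.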
We propose a new method that combines the two methods explained in Sections 2.1 and 2.2. First, we calculate
where
Under non-ignorable non-response, the response probability is a function of the variable of interest, so it should be estimated using the variable of interest. However, as mentioned earlier, this is hard to do in practice because the variable of interest is not observed for non-respondents. In Section 2, the response probability is therefore estimated using the available auxiliary variables, which usually produces bias in estimation. To obtain better estimates, this bias should be estimated and corrected.
For bias estimation, this section considers a super-population model. Let the inclusion probability,
where
In this study we consider non-ignorable non-response under non-informative sampling. In fact, the final inclusion probability
We consider a linear inclusion probability model, whose bias is easy to derive theoretically. The rationale is that if the approach is effective under the linear inclusion probability model, it can be expected to be effective under other inclusion probability models as well. The linear inclusion probability model considered is as follows,
Then simply we have
Let
where
The bias is estimated from the super-population model and the inclusion probability model established above.
In (
In this study we use the linear probability model (
Thus, using simple regression analysis, we obtain the estimates of
Three super-population models are considered in this study.
Normal distribution
That is,
Gamma distribution with log-linear model
where
Log-normal distribution with log-linear model
where
As used in Riddles
The bias-corrected PSA estimator can be obtained from the bias estimates and the PSA estimator defined in (
In (
To investigate the finite-sample properties of the proposed method, we perform simulation studies. In the simulation, for
Normal distribution:
Here we use
Gamma distribution:
Here, we use
Log-normal distribution:
From
We consider three response probability models,
exponential response:
linear response:
logistic response:
For the linear response probability model,
Case I:
Case II:
Case III:
Case IV:
Response data are obtained according to the calculated response probability
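A minimal sketch of how response indicators might be drawn once a response probability has been computed for each unit. The function names, the logistic form with coefficients `a` and `b`, the seed, and the toy population are illustrative assumptions and are not the settings used in this simulation.

```python
import math
import random
from typing import Callable, List, Sequence


def draw_response(y: Sequence[float],
                  prob_model: Callable[[float], float],
                  rng: random.Random) -> List[int]:
    """Draw r_i ~ Bernoulli(p_i) with p_i = prob_model(y_i), clipped to [0, 1]."""
    indicators = []
    for yi in y:
        p = min(1.0, max(0.0, prob_model(yi)))
        indicators.append(1 if rng.random() < p else 0)
    return indicators


def logistic_response(yi: float, a: float = -1.0, b: float = 0.02) -> float:
    """Hypothetical logistic response model in y (placeholder coefficients)."""
    return 1.0 / (1.0 + math.exp(-(a + b * yi)))


rng = random.Random(2024)  # fixed seed so that the run is reproducible
y = [rng.gauss(100.0, 10.0) for _ in range(1000)]
r = draw_response(y, logistic_response, rng)
response_rate = sum(r) / len(r)
```

Because the response probability depends on `y` itself, units with larger values respond more often under these placeholder coefficients, which is what makes the non-response non-ignorable.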
Three PSA estimators,
Here,
Tables 1 to 3 contain the results of the PSA estimators. Table 1 shows the results for the three response probability models when the super-population model is linear and the error distribution is normal. Although the response rates differ somewhat across the response probability cases, Cases I and IV give response rates of about 53–65%, and Cases II and III give response rates of about 50–56%. The bias of
Comparing
Table 2 shows the results when the super-population model is a log-linear model and the error distribution is a gamma distribution. The same three response probability models are considered. Cases I and II give response rates of about 65–80%, and Cases III and IV give response rates of about 40–45%. This difference comes from the asymmetry of the gamma distribution. Table 2 shows that large biases occur in all estimators and that the bias has a very large effect on the RMSE. As shown in Table 1,
Table 3 shows the results for the three response probability models when the super-population model is a log-linear model and the error distribution is log-normal. The response rates are very similar to those under the gamma distribution. Also in Table 3,
Table 4 shows the results of the bias-corrected PSA estimator using the linear response probability model. Obtaining the bias-corrected PSA estimator requires a known response probability model and a known super-population model. In this simulation, the response probability model is linear, and the linear super-population model has a normal error distribution. We also consider gamma and log-normal error distributions for the log-linear super-population model.
Through the results,
Tables 5 and 6 show the results for the bivariate auxiliary variable case. The response rates are similar to those in the univariate case. In Table 5, which reports the PSA estimators, it can be seen that the bias of
Table 6 shows the results of the bias-corrected PSA estimator. As in the univariate auxiliary variable case,
Comparing Table 5 with Table 6 confirms the effect of bias correction. The effect of bias correction of
Non-ignorable non-response has become frequent in sample surveys, and several studies have been conducted to deal with it properly. It is difficult to handle because it causes bias. In particular, accurate estimation of the response probability plays an important role. However, since the response probability is a function of the variable of interest, it is not easy to estimate accurately in practice, and a practical estimation method has to rely on the available auxiliary variables. In this study we propose a new response probability estimation method that uses the available auxiliary variables. The simulation results indicate that the proposed method performs better than the existing methods and that the bias-corrected PSA estimator produces significantly better results.
We note that the results in this study are obtained using a linear response probability model with known super-population models. Misspecification of the response probability model is a major concern, and the linear model may be neither well motivated nor appropriate in practice. The super-population model also plays an important role, so its misspecification is a concern as well. Therefore, it is necessary to study a method that can be used with an arbitrary unknown response probability model and an arbitrary unknown super-population model.
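The summary measures reported in the tables that follow (BIAS, ARB and RMSE) can be computed from the Monte Carlo replicates along the following lines. This is a sketch under stated assumptions: BIAS is taken as the mean error, RMSE as the root mean squared error, and ARB as the mean absolute error relative to the true value, which is one reading that appears consistent with the reported values. The paper's own definitions are authoritative, and the function name and toy numbers are hypothetical.

```python
import math
from typing import Dict, Sequence


def summarize(estimates: Sequence[float], true_value: float) -> Dict[str, float]:
    """Summarize Monte Carlo estimates of a known population quantity."""
    n_rep = len(estimates)
    if n_rep == 0:
        raise ValueError("at least one replicate is required")
    if true_value == 0:
        raise ValueError("a relative measure needs a non-zero true value")
    errors = [e - true_value for e in estimates]
    bias = sum(errors) / n_rep
    # Assumed ARB: average absolute error relative to the true value.
    arb = sum(abs(d) for d in errors) / (n_rep * abs(true_value))
    rmse = math.sqrt(sum(d * d for d in errors) / n_rep)
    return {"BIAS": bias, "ARB": arb, "RMSE": rmse}


# Toy replicates around a true value of 10.
result = summarize([9.0, 11.0, 10.0, 14.0], 10.0)
```

Under this reading, errors of opposite sign cancel in BIAS but not in ARB or RMSE, so an estimator can show a small BIAS together with a large RMSE.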
Table 1. Results of PSA estimator with normal distribution

| Case | Estimator | Exponential BIAS | Exponential ARB | Exponential RMSE | Linear BIAS | Linear ARB | Linear RMSE | Logistic BIAS | Logistic ARB | Logistic RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| I | | −0.319 | 0.007 | 6.608 | −0.889 | 0.007 | 6.597 | −1.233 | 0.007 | 6.632 |
| | | −1.557 | 0.003 | 2.518 | −1.408 | 0.003 | 2.332 | −1.521 | 0.003 | 2.370 |
| | | −1.476 | 0.003 | 2.468 | −1.360 | 0.002 | 2.304 | −1.443 | 0.003 | 2.320 |
| II | | −0.530 | 0.007 | 6.606 | −0.972 | 0.007 | 6.602 | −1.096 | 0.007 | 6.629 |
| | | −1.394 | 0.003 | 2.473 | −1.301 | 0.003 | 2.334 | −1.331 | 0.003 | 2.336 |
| | | −1.323 | 0.003 | 2.433 | −1.233 | 0.002 | 2.297 | −1.261 | 0.002 | 2.297 |
| III | | 0.989 | 0.007 | 6.638 | 1.455 | 0.007 | 6.670 | 1.585 | 0.007 | 6.715 |
| | | 1.436 | 0.003 | 2.446 | 1.427 | 0.003 | 2.374 | 1.468 | 0.003 | 2.389 |
| | | 1.365 | 0.003 | 2.407 | 1.360 | 0.002 | 2.336 | 1.398 | 0.003 | 2.348 |
| IV | | 0.772 | 0.007 | 6.626 | 1.321 | 0.007 | 6.687 | 1.655 | 0.007 | 6.694 |
| | | 1.636 | 0.003 | 2.523 | 1.564 | 0.003 | 2.418 | 1.590 | 0.003 | 2.393 |
| | | 1.554 | 0.003 | 2.473 | 1.489 | 0.003 | 2.371 | 1.513 | 0.003 | 2.345 |

Table 2. Results of PSA estimator with gamma distribution

| Case | Estimator | Exponential BIAS | Exponential ARB | Exponential RMSE | Linear BIAS | Linear ARB | Linear RMSE | Logistic BIAS | Logistic ARB | Logistic RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| I | | −5.459 | 0.050 | 7.434 | −5.288 | 0.048 | 7.161 | −5.483 | 0.049 | 7.208 |
| | | −4.626 | 0.038 | 5.398 | −4.262 | 0.035 | 4.995 | −4.219 | 0.035 | 4.896 |
| | | −4.569 | 0.038 | 5.350 | −4.217 | 0.035 | 4.957 | −4.181 | 0.035 | 4.864 |
| II | | −5.394 | 0.050 | 7.483 | −5.207 | 0.048 | 7.201 | −5.415 | 0.049 | 7.334 |
| | | −4.166 | 0.035 | 5.081 | −3.927 | 0.033 | 4.781 | −3.993 | 0.034 | 4.825 |
| | | −4.121 | 0.035 | 5.045 | −3.890 | 0.033 | 4.751 | −3.958 | 0.033 | 4.797 |
| III | | 5.717 | 0.053 | 8.004 | 5.262 | 0.050 | 7.579 | 5.363 | 0.050 | 7.638 |
| | | 4.380 | 0.037 | 5.472 | 4.228 | 0.036 | 5.244 | 4.359 | 0.037 | 5.338 |
| | | 4.338 | 0.037 | 5.439 | 4.182 | 0.036 | 5.206 | 4.311 | 0.036 | 5.299 |
| IV | | 6.225 | 0.056 | 8.378 | 5.601 | 0.051 | 7.766 | 5.486 | 0.050 | 7.608 |
| | | 4.885 | 0.041 | 5.852 | 4.676 | 0.039 | 5.543 | 4.794 | 0.039 | 5.568 |
| | | 4.839 | 0.040 | 5.813 | 4.625 | 0.038 | 5.499 | 4.738 | 0.039 | 5.519 |

Table 3. Results of PSA estimator with log-normal distribution

| Case | Estimator | Exponential BIAS | Exponential ARB | Exponential RMSE | Linear BIAS | Linear ARB | Linear RMSE | Logistic BIAS | Logistic ARB | Logistic RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| I | | −5.720 | 0.050 | 7.771 | −5.474 | 0.048 | 7.445 | −5.646 | 0.048 | 7.472 |
| | | −4.646 | 0.037 | 5.527 | −4.257 | 0.034 | 5.071 | −4.199 | 0.033 | 4.947 |
| | | −4.587 | 0.036 | 5.478 | −4.211 | 0.033 | 5.033 | −4.162 | 0.033 | 4.915 |
| II | | −5.618 | 0.050 | 7.751 | −5.380 | 0.048 | 7.459 | −5.611 | 0.049 | 7.574 |
| | | −4.154 | 0.034 | 5.198 | −3.893 | 0.032 | 4.882 | −3.976 | 0.032 | 4.907 |
| | | −4.106 | 0.033 | 5.160 | −3.855 | 0.032 | 4.852 | −3.941 | 0.032 | 4.879 |
| III | | 5.767 | 0.051 | 8.147 | 5.298 | 0.048 | 7.712 | 5.428 | 0.048 | 7.775 |
| | | 4.521 | 0.037 | 5.715 | 4.390 | 0.036 | 5.485 | 4.526 | 0.036 | 5.567 |
| | | 4.477 | 0.037 | 5.680 | 4.341 | 0.035 | 5.445 | 4.475 | 0.036 | 5.525 |
| IV | | 6.323 | 0.054 | 8.511 | 5.736 | 0.050 | 7.991 | 5.605 | 0.049 | 7.814 |
| | | 5.067 | 0.040 | 6.099 | 4.928 | 0.039 | 5.885 | 5.052 | 0.039 | 5.912 |
| | | 5.018 | 0.040 | 6.057 | 4.873 | 0.039 | 5.838 | 4.992 | 0.039 | 5.860 |

Table 4. Results of bias-corrected PSA estimator using linear response probability model

| Case | Estimator | Normal BIAS | Normal ARB | Normal RMSE | Gamma BIAS | Gamma ARB | Gamma RMSE | Log-normal BIAS | Log-normal ARB | Log-normal RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| I | | 0.459 | 0.007 | 6.569 | −2.442 | 0.037 | 5.614 | −1.577 | 0.035 | 5.601 |
| | | −0.150 | 0.002 | 1.936 | −1.320 | 0.021 | 3.216 | −0.168 | 0.021 | 3.386 |
| | | −0.100 | 0.002 | 1.933 | −1.261 | 0.020 | 3.196 | −0.103 | 0.021 | 3.390 |
| II | | 0.251 | 0.007 | 6.546 | −2.691 | 0.038 | 5.836 | −1.926 | 0.037 | 5.804 |
| | | −0.162 | 0.002 | 2.007 | −1.270 | 0.021 | 3.282 | −0.179 | 0.022 | 3.587 |
| | | −0.091 | 0.002 | 2.004 | −1.221 | 0.021 | 3.268 | −0.126 | 0.022 | 3.593 |
| III | | 0.233 | 0.007 | 6.539 | 2.893 | 0.039 | 6.139 | 1.416 | 0.035 | 5.683 |
| | | 0.284 | 0.002 | 1.971 | 1.831 | 0.023 | 3.656 | 0.513 | 0.021 | 3.481 |
| | | 0.213 | 0.002 | 1.966 | 1.781 | 0.023 | 3.633 | 0.458 | 0.021 | 3.475 |
| IV | | −0.020 | 0.007 | 6.589 | 2.947 | 0.039 | 6.087 | 1.452 | 0.034 | 5.643 |
| | | 0.293 | 0.002 | 1.919 | 1.954 | 0.023 | 3.620 | 0.577 | 0.021 | 3.394 |
| | | 0.214 | 0.002 | 1.913 | 1.899 | 0.023 | 3.592 | 0.516 | 0.021 | 3.385 |

Table 5. Results of PSA estimator using linear response probability model with bivariate auxiliary variable

| Case | Estimator | Normal BIAS | Normal ARB | Normal RMSE | Gamma BIAS | Gamma ARB | Gamma RMSE | Log-normal BIAS | Log-normal ARB | Log-normal RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| I | | 1.586 | 0.010 | 19.203 | −9.213 | 0.049 | 12.677 | −9.476 | 0.048 | 13.207 |
| | | −4.158 | 0.004 | 6.968 | −8.382 | 0.040 | 9.918 | −8.657 | 0.040 | 10.389 |
| | | −0.528 | 0.003 | 5.591 | −7.198 | 0.035 | 8.967 | −7.417 | 0.035 | 9.415 |
| II | | 0.545 | 0.010 | 18.977 | −9.179 | 0.050 | 13.058 | −9.274 | 0.048 | 13.174 |
| | | −4.149 | 0.004 | 7.283 | −7.679 | 0.038 | 9.559 | −7.875 | 0.037 | 9.750 |
| | | −0.940 | 0.003 | 6.005 | −6.734 | 0.034 | 8.836 | −6.898 | 0.033 | 8.997 |
| III | | 0.504 | 0.010 | 19.206 | 9.513 | 0.052 | 13.776 | 9.879 | 0.051 | 14.212 |
| | | 3.866 | 0.004 | 7.225 | 9.326 | 0.045 | 11.430 | 9.674 | 0.044 | 11.864 |
| | | 0.723 | 0.003 | 6.123 | 8.061 | 0.040 | 10.383 | 8.343 | 0.039 | 10.775 |
| IV | | −0.444 | 0.010 | 19.037 | 10.069 | 0.053 | 14.055 | 10.624 | 0.054 | 14.866 |
| | | 4.652 | 0.004 | 7.478 | 10.249 | 0.048 | 12.032 | 10.894 | 0.049 | 12.930 |
| | | 1.078 | 0.003 | 5.911 | 8.828 | 0.042 | 10.823 | 9.378 | 0.043 | 11.640 |

Table 6. Results of bias-corrected PSA estimator using linear response probability model with bivariate auxiliary variable

| Case | Estimator | Normal BIAS | Normal ARB | Normal RMSE | Gamma BIAS | Gamma ARB | Gamma RMSE | Log-normal BIAS | Log-normal ARB | Log-normal RMSE |
|---|---|---|---|---|---|---|---|---|---|---|
| I | | 2.064 | 0.010 | 19.253 | −3.554 | 0.037 | 9.936 | −3.053 | 0.036 | 10.351 |
| | | −3.728 | 0.004 | 6.728 | −2.883 | 0.026 | 6.851 | −2.369 | 0.026 | 7.366 |
| | | −0.079 | 0.003 | 5.583 | −1.256 | 0.024 | 6.535 | −0.588 | 0.026 | 7.331 |
| II | | 0.980 | 0.010 | 19.001 | −4.258 | 0.040 | 10.670 | −3.670 | 0.037 | 10.613 |
| | | −3.769 | 0.004 | 7.073 | −2.698 | 0.026 | 6.975 | −2.125 | 0.026 | 7.414 |
| | | −0.542 | 0.003 | 5.964 | −1.398 | 0.025 | 6.746 | −0.698 | 0.025 | 7.516 |
| III | | 0.076 | 0.010 | 19.200 | 5.367 | 0.042 | 11.326 | 3.483 | 0.037 | 10.535 |
| | | 3.482 | 0.004 | 7.031 | 5.319 | 0.032 | 8.640 | 3.518 | 0.027 | 7.824 |
| | | 0.322 | 0.003 | 6.102 | 3.950 | 0.029 | 7.851 | 2.033 | 0.026 | 7.273 |
| IV | | −0.915 | 0.010 | 19.062 | 5.297 | 0.041 | 11.100 | 3.627 | 0.038 | 10.786 |
| | | 4.228 | 0.004 | 7.228 | 5.566 | 0.032 | 8.628 | 4.063 | 0.029 | 8.152 |
| | | 0.635 | 0.003 | 5.862 | 4.035 | 0.029 | 7.735 | 2.387 | 0.026 | 7.429 |