^{a}Department of Mathematics, Shivaji College, University of Delhi, India
Correspondence to:^{1}Department of Mathematics, Shivaji College, University of Delhi, New Delhi 110 027, India. E-mail: priyanka.ism@gmail.com
Received October 4, 2018; Revised January 24, 2019; Accepted January 31, 2019.
Abstract
The estimation of the population mean of a quantitative sensitive variable using the item sum technique (IST) on successive occasions is discussed. IST difference, IST regression, and IST general class estimators are proposed to estimate a quantitative sensitive variable at the current occasion in two-occasion successive sampling. The proposed estimators are developed under the Trappmann et al. (Journal of Survey Statistics and Methodology, 2, 58–77, 2014) and Perri et al. (Biometrical Journal, 60, 155–173, 2018) allocation designs for allocating the long list and short list samples of the IST. The properties of all proposed estimators, including the optimum replacement policy, are derived. The proposed estimators are compared with one another under both allocation designs and with a direct method. Numerical applications through empirical and simple simulation studies show how the IST on successive occasions may perform in practical situations.
Keywords : sensitive variable, successive occasions, class of estimators, population mean, variance, bias, mean squared error, optimum matching fraction
1. Introduction
Gathering data on sensitive, incriminating, stigmatizing, or highly personal issues is a notoriously difficult task. Three outcomes are prevalent: under-reporting, over-reporting, and refusal to answer the sensitive questions. Generally, socially undesirable characteristics such as drug addiction, tax evasion, plagiarism, criminal conviction, and unauthorized natural resource use are likely to be under-reported, whereas socially desirable characteristics such as energy conservation, reducing pollution, ecological and biological conservation, and participation in elections are likely to be over-reported. Some respondents may also refuse to report because of social stigma or fear of privacy violations. All these phenomena produce non-sampling errors that are difficult to deal with and can seriously damage the validity of the analysis. To overcome misreporting on sensitive issues, many data collection strategies have been developed to elicit more honest responses by increasing the anonymity of the survey process and ensuring privacy protection.
One important survey method for addressing sensitive issues is indirect questioning. Indirect questioning techniques may be classified into three categories: the randomized response technique (RRT), the item count technique (ICT), and the non-RRT. The RRT was initiated by Warner (1965), and the ICT was originally proposed by Miller (1984) for binary variables to estimate the prevalence of a stigmatizing behavior within the population. The non-RRT was initiated more recently by Tian and Tang (2014).
In this paper we focus on a generalization of the second technique, the ICT. The ICT is used in surveys that study a qualitative variable. However, Chaudhuri and Christofides (2013) proposed a generalization of the ICT for estimating a quantitative sensitive variable. Trappmann et al. (2014) named this technique the item sum technique (IST). Perri et al. (2018) discussed optimal sample size allocation in the IST. Further developments in the IST literature can be seen in Hussian et al. (2015) and Rueda et al. (2017).
In many fields of applied research, data need to be gathered on sensitive variables that change over time. The statistical tool recommended for such a situation is successive sampling. Jessen (1942) initiated the idea of sampling the same population over time with partial replacement of units, called successive sampling. The analysis of a sensitive variable in successive sampling was initiated by Arnab and Singh (2013), who applied the RRT on successive occasions. Additional literature addressing sensitive issues over successive occasions can be seen in Yu et al. (2015), Naeem and Shabbir (2016), Singh et al. (2017), Priyanka et al. (2018), and Priyanka and Trisandhya (2018). These researchers focused on a scrambled response technique or the RRT to handle sensitive issues on successive occasions. However, the IST is now emerging as an alternative technique for dealing with sensitive issues. Hence, an attempt has been made in the present work to use the IST in successive sampling to estimate a sensitive population mean. To the best of our knowledge this is an initial attempt, and it contributes another useful method to the successive sampling literature.
Therefore, the IST difference, the IST regression, and the IST general class of estimators have been proposed in the present work to estimate a sensitive population mean in two-occasion successive sampling. All the proposed estimators have been discussed under the Trappmann et al. (2014) allocation design and the Perri et al. (2018) optimal allocation design. Detailed properties, including optimum replacement strategies, are elaborated. The proposed methods have been compared with one another as well as with the corresponding direct method. Numerical applications through empirical and simple simulation studies show how the IST on successive occasions may perform in practical situations. Finally, some concluding remarks are given.
2. The item sum technique
A well-known technique for estimating sensitive characteristics is the ICT; however, the ICT is generally applicable to dichotomous (qualitative) variables only. Hence, the ICT was generalized by Chaudhuri and Christofides (2013) so that it can be used to estimate a quantitative sensitive variable. Later, Trappmann et al. (2014) named this generalized version of the ICT the IST and used it to estimate quantitative sensitive variables. The algorithm for the IST is as follows.
From a random sample (s), two random sub-samples (s_{ll} and s_{sl}) are generated. The sub-sample s_{ll} is confronted with a long list (LL) of items containing the sensitive question and a number of innocuous/non-sensitive questions. The respondents in sub-sample s_{sl} are given a short list (SL) of items containing only the innocuous questions present in the LL. Respondents in each sub-sample are asked to report the total score of all items given to them, without disclosing the individual scores for the items. The difference between the mean answers in s_{ll} and s_{sl} is used as an unbiased estimator of the population mean of the sensitive variable. In the IST, all sensitive and innocuous variables should be quantitative in nature, and the innocuous variables should be measured on the same scale as the sensitive variable.
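The algorithm above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `ist_mean_estimate` and the example totals are hypothetical and are not taken from the paper.

```python
import numpy as np

def ist_mean_estimate(ll_totals, sl_totals):
    """IST estimate of the sensitive mean.

    ll_totals: totals reported by the long-list sub-sample
               (sensitive item plus innocuous items).
    sl_totals: totals reported by the short-list sub-sample
               (innocuous items only).
    """
    ll = np.asarray(ll_totals, dtype=float)
    sl = np.asarray(sl_totals, dtype=float)
    # Difference of the two mean reported totals.
    return ll.mean() - sl.mean()

# Hypothetical reported totals, for illustration only.
estimate = ist_mean_estimate([12, 15, 9, 14], [8, 10, 7, 11])
```

Only the reported totals enter the computation, so no individual answer to the sensitive item is ever needed.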
The decisive point in the IST is how to split the total sample into the LL sample and the SL sample. Trappmann et al. (2014) allocated the same number of units to each sample irrespective of the variation of the items in the two lists. However, Perri et al. (2018) advocated an optimum allocation of the LL and SL samples. The IST may be modified to deal with sensitive issues on successive occasions if the sensitive variable also changes over time, which is often the case. For example, if the sensitive variable is the amount spent per month on drugs such as cigarettes or pan masala by college students, then the non-sensitive variable may be taken as the total monthly pocket money received by them or the amount spent on purchasing books. Similarly, if the sensitive variable is the number of abortions, then the non-sensitive variable may be the number of children or the total number of members in the family. The sensitive question together with the non-sensitive questions is put to the LL sample, whereas only the non-sensitive questions are put to the SL sample. Any number of non-sensitive questions may accompany the sensitive question for the LL sample, with the same non-sensitive questions used for the SL sample. In this paper, however, we consider the case of one sensitive and one non-sensitive question on successive occasions.
3. Proposed IST framework in successive sampling design
Consider a finite population P consisting of N identifiable units for sampling over two successive occasions. Let x denote the quantitative sensitive variable at the first occasion, which changes to y at the second occasion. Similarly, let t_{1} be the non-sensitive/innocuous variable at the first occasion, which changes to t_{2} at the second occasion. Assume that x_{i}, y_{i}, t_{1i}, and t_{2i} denote the values of x, y, t_{1}, and t_{2}, respectively, on the unit i ∈ P. To estimate the population mean Ȳ of the quantitative sensitive variable at the current occasion using the IST, the sampling design is proposed as follows.
At the first occasion, a sample of size n is drawn using simple random sampling without replacement (SRSWOR) and is split into the samples s_{nll} and s_{nsl}, called the LL-sample and SL-sample, respectively. At the second occasion, considering the partial overlap case, two independent samples are considered: one is a matched sample of size m = nλ drawn as an SRSWOR sub-sample from the sample of size n at the first occasion, and the other is a fresh sample of size u = (n − m) = nμ drawn afresh at the current occasion. Further, the samples of sizes m and u are split into the corresponding LL-samples and SL-samples s_{mll}, s_{msl}, s_{ull}, and s_{usl}, respectively. The responses obtained from the respondents on the two occasions and the corresponding IST estimates based on the different samples are presented in Table 1.
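The selection steps of this design can be sketched as follows. The sketch makes two assumptions that the text does not spell out: the fresh sample is drawn from the N − n units not selected at the first occasion, and each LL/SL split is a simple random split. The names `two_occasion_design` and `split_ll_sl` are hypothetical.

```python
import numpy as np

def two_occasion_design(N, n, lam, rng):
    """Draw unit indices for the first-occasion, matched and fresh samples."""
    first = rng.choice(N, size=n, replace=False)         # SRSWOR of size n
    m = int(round(n * lam))                              # matched size m = n * lambda
    matched = rng.choice(first, size=m, replace=False)   # SRSWOR sub-sample of `first`
    u = n - m                                            # fresh size u = n - m
    # Assumption: the fresh units come from those not sampled at occasion one.
    remaining = np.setdiff1d(np.arange(N), first)
    fresh = rng.choice(remaining, size=u, replace=False)
    return first, matched, fresh

def split_ll_sl(sample, n_ll, rng):
    """Randomly split a sample into an LL part of size n_ll and an SL part."""
    shuffled = rng.permutation(sample)
    return shuffled[:n_ll], shuffled[n_ll:]

rng = np.random.default_rng(1)
first, matched, fresh = two_occasion_design(N=51, n=20, lam=0.5, rng=rng)
s_ull, s_usl = split_ll_sl(fresh, n_ll=5, rng=rng)
```

How the matched LL and SL units relate to the first-occasion LL and SL assignment is not shown here and would need to follow the design summarized in Table 1.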
4. IST successive difference estimator
To utilize the information available from the previous occasion, an IST difference-type estimator ${\mathbb{T}}_{1m}$ is considered, based on the sample of size m retained from the previous occasion, while the estimator based on the fresh sample of size u is the IST estimator ${\mathbb{T}}_{u}={\widehat{\overline{y}}}_{u}$. Combining the two estimators as a convex linear combination, the final estimator of the sensitive population mean at the current occasion is given by
${\mathbb{T}}_{1}={\phi}_{1}{\mathbb{T}}_{u}+(1-{\phi}_{1}){\mathbb{T}}_{1m},$
where ${\mathbb{T}}_{u}={\widehat{\overline{y}}}_{u}$ and ${\mathbb{T}}_{1m}={\widehat{\overline{y}}}_{m}+k({\widehat{\overline{x}}}_{n}-{\widehat{\overline{x}}}_{m})$; ϕ_{1} ∈ [0, 1] and k are scalar quantities to be chosen suitably.
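A small sketch of how the difference-type estimator and the convex combination fit together is given below. It assumes that the weight ϕ_{1} is attached to the fresh-sample estimator and (1 − ϕ_{1}) to the matched-sample estimator; the function names and the numerical inputs are hypothetical.

```python
def matched_difference_estimate(y_m, x_m, x_n, k):
    """T_1m = y_m + k * (x_n - x_m), where y_m, x_m and x_n are IST estimates."""
    return y_m + k * (x_n - x_m)

def composite_estimate(t_u, t_m, phi):
    """Convex combination of the fresh-sample and matched-sample estimators.

    Assumption: phi weights the fresh-sample estimator t_u.
    """
    if not 0.0 <= phi <= 1.0:
        raise ValueError("phi must lie in [0, 1]")
    return phi * t_u + (1.0 - phi) * t_m

# Hypothetical IST estimates, for illustration only.
t_1m = matched_difference_estimate(y_m=10.0, x_m=4.0, x_n=5.0, k=0.5)
t_1 = composite_estimate(t_u=12.0, t_m=t_1m, phi=0.4)
```

The matched-sample estimate is pulled upward when the full first-occasion estimate exceeds the matched one (for positive k), which is how the first-occasion information enters the current estimate.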
5. IST successive regression estimator
The regression estimator is another widely used estimator in survey sampling theory. Hence, the estimator for the matched portion of the sample has been chosen as a regression-type estimator ${\mathbb{T}}_{2m}$. The final IST successive regression estimator of the sensitive population mean at the current occasion is given as
${\mathbb{T}}_{2}={\phi}_{2}{\mathbb{T}}_{u}+(1-{\phi}_{2}){\mathbb{T}}_{2m},$
where ${\mathbb{T}}_{2m}={\widehat{\overline{y}}}_{m}+\widehat{b}({m}_{ll})({\widehat{\overline{x}}}_{n}-{\widehat{\overline{x}}}_{m})$ with $\widehat{b}({m}_{ll})=\{{s}_{{z}_{1}{z}_{2}}({m}_{ll})\}/\{{s}_{{z}_{2}}^{2}({m}_{ll})\}$ and ϕ_{2} ∈ [0, 1] is a scalar quantity to be chosen suitably.
6. IST successive general class of estimator
Many estimators, such as ratio, product, and exponential ratio estimators, may be constructed along similar lines for the matched sample of size m. Therefore, to generalize the framework, an IST general class of estimator has been proposed, so that the IST difference, IST regression, and other estimators may be viewed as members of the proposed class. The final estimator in this case is given as
${\mathbb{T}}_{3}={\phi}_{3}{\mathbb{T}}_{u}+(1-{\phi}_{3}){\mathbb{T}}_{3m},$
where ${\mathbb{T}}_{3m}=g({\widehat{\overline{y}}}_{m},{\widehat{\overline{x}}}_{m},{\widehat{\overline{x}}}_{n})$ is a function of ${\widehat{\overline{y}}}_{m}$, ${\widehat{\overline{x}}}_{m}$, and ${\widehat{\overline{x}}}_{n}$. Following Srivastava and Jhajj (1980), Tracy et al. (1996), and Priyanka and Trisandhya (2018), the function g is assumed to satisfy the following conditions:
(i) The point (${\widehat{\overline{y}}}_{m},{\widehat{\overline{x}}}_{m},{\widehat{\overline{x}}}_{n}$) assumes values in a closed convex subset ℝ^{3} of three-dimensional real space containing the point (Ȳ, X̄, X̄).
(ii) The function $g({\widehat{\overline{y}}}_{m},{\widehat{\overline{x}}}_{m},{\widehat{\overline{x}}}_{n})$ is continuous and bounded in ℝ^{3}.
(iii) g(Ȳ, X̄, X̄) = Ȳ and ${g}_{1}(K)=\partial g(\cdot)/\partial {{\widehat{\overline{y}}}_{m}\mid}_{K}=1$, where K = (Ȳ, X̄, X̄); i.e., the first-order partial derivative of g with respect to ${\widehat{\overline{y}}}_{m}$ at the point K equals 1.
(iv) The first- and second-order partial derivatives of $g({\widehat{\overline{y}}}_{m},{\widehat{\overline{x}}}_{m},{\widehat{\overline{x}}}_{n})$ exist and are continuous and bounded in ℝ^{3}.
7. Analysis of IST estimators on successive occasions
To assess the performance of the proposed IST estimators, their bias and variance/mean squared error have been derived as
It can be seen that the variance/mean squared error in equation (7.3) is a function of ϕ_{i}. Hence, it has been minimized with respect to ϕ_{i}, and the optimum value of ϕ_{i} is obtained as:
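As a hedged sketch of this optimization step, suppose that the weight ϕ_{i} is attached to the fresh-sample estimator and that the covariance between the fresh-sample and matched-sample estimators is negligible. Both are assumptions made here for illustration, and M(·) stands for the variance or mean squared error as appropriate. Under these assumptions the minimization runs as follows:

```latex
\begin{align*}
M(\mathbb{T}_{i}) &= \phi_{i}^{2}\, V(\mathbb{T}_{u}) + (1-\phi_{i})^{2}\, M(\mathbb{T}_{im}), \\
\frac{\partial M(\mathbb{T}_{i})}{\partial \phi_{i}}
  &= 2\phi_{i}\, V(\mathbb{T}_{u}) - 2(1-\phi_{i})\, M(\mathbb{T}_{im}) = 0, \\
\phi_{i}^{\mathrm{opt}} &= \frac{M(\mathbb{T}_{im})}{V(\mathbb{T}_{u}) + M(\mathbb{T}_{im})}, \\
M(\mathbb{T}_{i})_{\mathrm{opt}}
  &= \frac{V(\mathbb{T}_{u})\, M(\mathbb{T}_{im})}{V(\mathbb{T}_{u}) + M(\mathbb{T}_{im})}.
\end{align*}
```

The second derivative, 2{V(T_u) + M(T_im)}, is positive, so the stationary point is a minimum. The more precise of the two component estimators receives the larger weight.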
Now, since the IST successive regression estimator and the IST successive general class of estimator are biased for Ȳ, the expressions for their bias and mean squared error have been computed under the following transformations:
Expanding $g({\widehat{\overline{y}}}_{m},{\widehat{\overline{x}}}_{m},{\widehat{\overline{x}}}_{n})$ about the point K = (Ȳ, X̄, X̄) using Taylor series expansion, retaining terms up to first order of approximations, we have
Clearly, equation (7.12) is a function of G_{2} and G_{3}. Minimizing it by partially differentiating with respect to G_{2} and G_{3}, respectively, and equating the derivatives to zero gives the optimum values of G_{2} and G_{3} as
In the IST, a sample is split into an LL sample and an SL sample. Trappmann et al. (2014) considered an equal number of units in both samples, irrespective of the variability of the items in the two lists. Applying their approach on successive occasions, we have the following allocations:
However, Perri et al. (2018) concluded that the estimates may be affected by high variability of the items in the LL sample and the SL sample. Hence, they proposed an optimal sample size allocation to the LL and SL samples by minimizing the variance of the IST estimates under a budget constraint. Adapting this idea to allocate the LL and SL samples within the various samples at the first and second occasions, and assuming the same budget allocation for each LL and SL sample, we have:
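The contrast between the two allocation designs can be illustrated for a single sample of fixed size n. The sketch assumes equal unit costs, so that the budget constraint reduces to n_ll + n_sl = n, and a variance of the form S_ll²/n_ll + S_sl²/n_sl with the finite population correction ignored. Minimizing this variance gives n_ll proportional to S_ll, the standard deviation of the LL totals. The exact expressions used in Perri et al. (2018) may differ, and the function names are hypothetical.

```python
def equal_allocation(n):
    """Equal split of n units between the LL and SL samples."""
    n_ll = n // 2
    return n_ll, n - n_ll

def optimal_allocation(n, s_ll, s_sl):
    """Minimise s_ll**2 / n_ll + s_sl**2 / n_sl subject to n_ll + n_sl = n.

    s_ll, s_sl: standard deviations of the LL and SL reported totals.
    """
    n_ll = round(n * s_ll / (s_ll + s_sl))
    n_ll = min(max(n_ll, 1), n - 1)   # keep at least one unit in each list
    return n_ll, n - n_ll

def ist_variance(n_ll, n_sl, s_ll, s_sl):
    """Approximate variance of the IST estimate (no finite population correction)."""
    return s_ll ** 2 / n_ll + s_sl ** 2 / n_sl

# Hypothetical standard deviations, for illustration only.
v_equal = ist_variance(*equal_allocation(30), 6.0, 3.0)
v_optimal = ist_variance(*optimal_allocation(30, 6.0, 3.0), 6.0, 3.0)
```

When the LL totals are more variable than the SL totals, the optimal rule assigns more units to the LL sample, and the resulting variance does not exceed that of the equal split.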
To compare the proposed IST estimators in successive sampling, the percent relative efficiencies have been computed for the data considered in Section 9 under the Trappmann et al. (2014) and Perri et al. (2018) allocation designs as:
Population Source: [Free access to data by Statistical Abstracts of United States]
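To evaluate the performance of the proposed IST successive sampling estimators, numerical illustrations have been carried out using a natural population. The population consists of N = 51 states. Let the aim be to estimate the rate of abortion in the year 2004. Therefore, for the IST successive sampling framework, the rate of abortion in 2000 and in 2004 is taken as the sensitive variable at the first occasion (x) and the second occasion (y), respectively, and the rate of residents is taken as the non-sensitive variable.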
Clearly, the rate of abortion is sensitive, whereas the rate of residents is non-sensitive. Hence, the data are suitable for the IST framework. Since the same study variable, "rate of abortion", has been observed for two different years, 2000 and 2004, the data are also suitable for the IST successive sampling framework. The numerical calculations have been performed on the data, and the results are presented in Table 3.
The values of E_{i} (i = 1, 2) are observed to be more than 100, which indicates that the optimum allocation design of Perri et al. (2018) is preferable to the Trappmann et al. (2014) design. Therefore, the further numerical analysis has been done using the Perri et al. (2018) allocation design.
10. Simulation study
An extensive Monte Carlo simulation study has been carried out for the data mentioned in Section 9, using 5,000 Monte Carlo replications. The process is repeated for different combinations of constants, termed sets. The variance/mean squared error of the three proposed estimators (the IST successive difference, regression, and general class estimators) has been computed under the Perri et al. (2018) allocation design. The percent relative efficiencies of the IST successive general class of estimator with respect to the IST successive difference and regression estimators have been computed as:
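The mechanics of such a simulation can be sketched as follows. For brevity the sketch uses a single-occasion IST estimator on a synthetic population and compares the equal and optimal allocations; it does not use the abortion data or the three successive estimators of the paper. All names, distributions, and parameter values are assumptions chosen for illustration.

```python
import numpy as np

def ist_estimate(sensitive, innocuous, ll_idx, sl_idx):
    """IST estimate: mean LL total (sensitive + innocuous) minus mean SL total."""
    ll_totals = sensitive[ll_idx] + innocuous[ll_idx]
    sl_totals = innocuous[sl_idx]
    return ll_totals.mean() - sl_totals.mean()

def simulated_mse(sensitive, innocuous, n, n_ll, reps, rng):
    """Monte Carlo MSE of the IST estimate over `reps` SRSWOR samples of size n."""
    truth = sensitive.mean()
    sq_err = np.empty(reps)
    for r in range(reps):
        # choice(..., replace=False) returns the selected units in random order,
        # so the first n_ll units form a random LL sub-sample.
        sample = rng.choice(sensitive.size, size=n, replace=False)
        est = ist_estimate(sensitive, innocuous, sample[:n_ll], sample[n_ll:])
        sq_err[r] = (est - truth) ** 2
    return sq_err.mean()

def percent_relative_efficiency(mse_reference, mse_candidate):
    """Values above 100 favour the candidate estimator or design."""
    return 100.0 * mse_reference / mse_candidate

if __name__ == "__main__":
    rng = np.random.default_rng(2019)
    N, n, reps = 2000, 100, 5000
    innocuous = rng.normal(50.0, 1.0, N)   # synthetic innocuous variable
    sensitive = rng.gamma(2.0, 5.0, N)     # synthetic sensitive variable
    s_ll = (sensitive + innocuous).std(ddof=1)
    s_sl = innocuous.std(ddof=1)
    n_ll_opt = int(round(n * s_ll / (s_ll + s_sl)))
    mse_equal = simulated_mse(sensitive, innocuous, n, n // 2, reps, rng)
    mse_opt = simulated_mse(sensitive, innocuous, n, n_ll_opt, reps, rng)
    print(percent_relative_efficiency(mse_equal, mse_opt))
```

The same loop structure carries over to the successive estimators: only the sampling step and the estimator function would change, with one simulated mean squared error per estimator and per set of constants.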
Figure 1 and Figure 2 summarize the simulation results.
11. Direct method
Estimators that use the IST are less efficient than estimators obtained using direct questioning. Hence, to quantify the loss, we compare the proposed class of estimators with the corresponding direct method. The estimator under the direct questioning method is proposed as
Further, the simulated ratio of the mean squared error of the direct estimator to that of the IST successive general class of estimator has been computed by considering 5,000 different samples in a Monte Carlo simulation study for different sets, and the results are presented in Figure 3 and Figure 4, respectively.
Remark 1. The IST is used with the expectation of receiving true responses, since in the IST the response to the sensitive question is not asked for directly. If some false responses are still received, producing bias in the estimator even after the IST has been explained properly to respondents, standard bias reduction methods such as the jackknife, the bootstrap, and methods that approximate the bias function through asymptotic expansions of the bias, following Kosmidis (2014), may be used.
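As a reminder of how one of these methods works, the delete-one jackknife bias correction for a generic statistic of a single sample can be sketched as below. This is a textbook form shown for illustration only; how units should be deleted in the two-list, two-occasion IST design is not specified in the text, and the function name is hypothetical.

```python
import numpy as np

def jackknife_bias_corrected(data, statistic):
    """Delete-one jackknife bias-corrected value of `statistic` on `data`."""
    data = np.asarray(data, dtype=float)
    n = data.size
    full = statistic(data)
    # Recompute the statistic with each observation left out in turn.
    leave_one_out = np.array([statistic(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (leave_one_out.mean() - full)
    return full - bias

# Example: correcting the plug-in (divisor n) variance of a small sample.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
corrected = jackknife_bias_corrected(sample, lambda d: d.var())
```

For the plug-in variance this correction recovers the usual unbiased sample variance, which makes it a convenient check of the implementation.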
12. Interpretations of results
The following interpretations can be drawn from empirical and simulation results:
It has been observed that IST is feasible in successive sampling to handle sensitive issues on successive occasions.
From Table 3, it is clear that E_{1} and E_{2} are both greater than 100, which implies that the optimum allocation design of Perri et al. (2018) is more efficient than the allocation design of Trappmann et al. (2014) in two-occasion successive sampling. Note that, for the data considered, the optimum fraction to be drawn afresh does not exist for the IST successive general class of estimator, so the corresponding efficiency cannot be computed. Hence, to check the validity of the IST successive general class of estimator, a simulation has been carried out with several choices of parameters.
The simulation results in Figure 1 and Figure 2 show that E_{s1} and E_{s2} are greater than 100 for all three sets considered. This indicates that the IST successive general class of estimator is more efficient than the IST successive regression and IST successive difference estimators. However, E_{s1} > E_{s2} indicates that the IST successive difference estimator is better than the IST successive regression estimator. Also, the simulated percent relative efficiency increases as φ increases, in accordance with the theory of successive sampling.
From Figure 3 and Figure 4, it is observed that as φ increases, the simulated ratio of the mean squared error of the direct method and the IST successive general class of estimator increases. The values indicate a loss in precision of the IST general class of estimator relative to the direct method. However, the issues under consideration are sensitive, and a direct method is unsuitable because the privacy of respondents needs to be protected. Therefore, despite the loss in precision, the IST should be preferred over the direct method for sensitive issues on successive occasions.
13. Conclusion
The IST on successive occasions enables estimation of the population mean of a stigmatized quantitative variable using innocuous information, which reduces social desirability response bias and protects privacy. Of the three proposed IST successive estimators, the IST successive general class of estimator under both the Trappmann et al. (2014) and the Perri et al. (2018) allocation designs has been found to be more efficient than the other two. The optimum allocation of the LL and SL samples due to Perri et al. (2018) has been found to be more fruitful in successive sampling than the allocation due to Trappmann et al. (2014). In comparison with the direct method, a certain loss in precision is observed, but this is realistic: because the survey issues are sensitive, a direct method may lead to complete or partial refusal. With the IST on successive occasions, estimation of sensitive characteristics is at least possible. Therefore, it can be concluded that the proposed IST successive estimators offer respondents comfort in terms of privacy protection and represent a methodological advancement in the literature on successive sampling for sensitive issues. Hence, the IST successive class of estimators may be recommended for practical use by survey practitioners.
Acknowledgements
The authors are thankful to the reviewers and editors for their valuable suggestions that improved an earlier version of the paper. The authors are also thankful to SERB, New Delhi, India for providing the financial assistance to carry out the present work. The authors also sincerely acknowledge free access to data from the Statistical Abstracts of United States.
Figures
Fig. 1. Simulated percent relative efficiency of the proposed IST general class of estimator with respect to proposed IST difference estimator for three different sets.
Fig. 2. Simulated percent relative efficiency of the proposed IST general class of estimator with respect to proposed IST regression estimator for three different sets.
Fig. 3. Ratio of the mean squared error of the IST successive general class of estimator (under the optimum allocation design) with respect to the direct method in two-occasion successive sampling for Set-I.
Fig. 4. Ratio of the mean squared error of the IST successive general class of estimator (under the optimum allocation design) with respect to the direct method in two-occasion successive sampling for Set-II.
‘*’ indicates that the optimum value of the fraction of the sample to be drawn afresh does not exist; ‘-’ denotes that the corresponding percent relative efficiency cannot be computed.
References
Arnab R and Singh S (2013). Estimation of mean of sensitive characteristics for successive sampling. Communications in Statistics - Theory and Methods, 42, 2499-2524.
Chaudhuri A and Christofides TC (2013). Indirect Questioning in Sample Surveys, Berlin, Heidelberg, De, Springer-Verlag.
Hussian Z, Shabbir N, and Shabbir J (2015). An alternative item sum technique for improved estimators of population mean in sensitive surveys. Hacettepe University Bulletin of Natural Sciences and Engineering Series B: Mathematics and Statistics, 46, 1-30.
Jessen RJ (1942). Statistical investigation of a sample survey for obtaining farm facts. Iowa Agriculture and Home Economics Experiment Station. Research Bulletin, 26, 1-104.
Kosmidis I (2014). Bias in parametric estimation: reduction and useful side-effects. WIREs Computational Statistics, 6, 185-196.
Miller JD (1984). A new survey technique for studying deviant behavior (PhD thesis), The George Washington University, Washington DC.
Naeem N and Shabbir J (2016). Use of scrambled responses on two occasions successive sampling under non-response. Hacettepe University Bulletin of Natural Sciences and Engineering Series B: Mathematics and Statistics, 46.
Perri PF, Rueda Garcia M, and Cobo Rodriguez B (2018). Multiple sensitive estimation and optimal sample size allocation in the item sum technique. Biometrical Journal, 60, 155-173.
Priyanka K and Trisandhya P (2018). A composite class of estimators using scrambled response mechanism for sensitive population mean in successive sampling. Communications in Statistics - Theory and Methods.
Priyanka K, Trisandhya P, and Mittal R (2018). Dealing sensitive characters on successive occasions through a general class of estimators using scrambled response techniques. Metron, 76, 203-230.
Rueda Garcia M, Perri PF, and Cobo Rodriguez B (2017). Advances in estimation by the item sum technique using auxiliary information in complex surveys. Advances in Statistical Analysis, 102, 455-478.
Singh GN, Suman S, Khetan M, and Paul C (2017). Some estimation procedures of sensitive character using scrambled response techniques in successive sampling. Communications in Statistics - Theory and Methods.
Srivastava SK and Jhajj HS (1980). A class of estimators using auxiliary information for estimating finite population variance. Sankhya, C42, 87-96.
Tracy DS, Singh HP, and Singh R (1996). An alternative to the ratio-cum-product estimator in sample surveys. Journal of Statistical Planning and Inference, 53, 375-397.
Trappmann M, Krumpal I, Kirchner A, and Jann B (2014). Item sum: a new technique for asking quantitative sensitive questions. Journal of Survey Statistics and Methodology, 2, 58-77.
Tian GL and Tang ML (2014). Incomplete Categorical Data Design: Non-Randomized Response Techniques for Sensitive Questions in Surveys, FL, Chapman & Hall/CRC.
Warner SL (1965). Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60, 63-69.
Yu B, Jin Z, Tian J, and Gao G (2015). Estimation of sensitive proportion by randomized response data in successive sampling. Computational and Mathematical Methods in Medicine, 2015, 1-6.