Surveys, the traditional data collection method for making inferences about a finite population, are currently criticized for cost-inefficiency caused by a worsening survey environment, such as a growing number of not-at-home households and rising nonresponse rates.
Statistical matching, a less informative but more cost-efficient method of data collection or augmentation, has been suggested as a possible solution to this weakness of the conventional survey. Statistical matching combines multiple sources of data and has two versions, macro matching and micro matching. Macro matching is mainly used to estimate population parameters that are not estimable from a single source of data.
In many applications, statistical matching refers to micro matching, which is similar to data linkage. To explain micro statistical matching, assume that a recipient file A and a donor file B are sampled from the same population. Files A and B have no overlapping units, and the recipient file A contains variables (
Most micro matching methods, including Budd (1971) and Okner (1972), derive their theoretical validity from the conditional independence assumption (CIA), which means that unique variables
In this paper, we first suggest a new statistical matching method applicable to categorical variables under the CIA. The proposed method is a mixture of parametric and nonparametric methods that is robust to model misspecification. We also propose a statistical matching method for categorical variables that uses auxiliary information when the CIA is not satisfied. The auxiliary information could be obtained from small surveys or outdated proxy data, denoted by file C, which should contain the distribution of (
The paper is organized as follows. A brief review of statistical matching methods using auxiliary information is given in Section 2. In Section 3, we propose new statistical matching methods for categorical variables, without and with auxiliary information. In Section 4, we compare several statistical matching methods, including those suggested in Section 3, through a simulation. We make some concluding remarks in Section 5.
In this section, we briefly review the important previous statistical matching methods in which auxiliary information is considered, based on Singh
where
where
Renssen (1998) considered a statistical matching method in which a calibration technique is applied to categorical variables in the finite population set-up. Renssen’s method is based on the regression method suggested by Rubin (1986). It assumes that there are two registrations, files A and B, and auxiliary information, file C, which is derived from these registrations. The problem is imputing a value of
where,
In this section, we propose statistical matching methods for categorical variables with or without auxiliary information by developing the methods of Singh
where
where
D’Orazio
Hotdeck matching, the most commonly used micro matching method, is similar to imputation: both donor and recipient files are classified so that the missing-at-random mechanism is satisfied, and one observation in the same class is randomly chosen to impute the missing value in the recipient file. However, if no unit for the imputation exists in the same class of the donor file B, the random hotdeck does not work, and the class must be manually merged with an adjacent category in both files, which could violate the CIA.
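The random hotdeck procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the toy files, the field names `x` (the matching class) and `z` (the variable observed only in the donor file), and the function `random_hotdeck` are all hypothetical. A recipient whose class has no donor is returned as `None`, which is the situation in which classes would otherwise have to be merged by hand.

```python
import random
from collections import defaultdict


def random_hotdeck(recipients, donors, class_key, donor_value, rng):
    """Impute one donor value per recipient, drawn at random within its class.

    Returns a list aligned with `recipients`. An entry is None when the
    recipient's class contains no donor.
    """
    # Group the donor values by matching class.
    pools = defaultdict(list)
    for d in donors:
        pools[class_key(d)].append(donor_value(d))

    imputed = []
    for r in recipients:
        pool = pools.get(class_key(r))  # .get() does not create empty classes
        imputed.append(rng.choice(pool) if pool else None)
    return imputed


# Hypothetical toy files: class "c" has no donor in file B.
file_a = [{"x": "a"}, {"x": "b"}, {"x": "c"}]
file_b = [{"x": "a", "z": 1}, {"x": "a", "z": 2}, {"x": "b", "z": 3}]

result = random_hotdeck(
    file_a,
    file_b,
    class_key=lambda row: row["x"],
    donor_value=lambda row: row["z"],
    rng=random.Random(0),
)
print(result)
```

The last entry of `result` is `None` because no donor shares class `"c"`, so a purely class-based hotdeck has nothing to draw from for that recipient.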
To overcome this weakness of existing hotdeck matching methods, we propose a mixed method based on a multinomial logistic regression model, which can be used for both the micro and macro approaches when the recipient and donor files are composed of categorical variables.
The first proposed mixed method using multinomial logistic regression, denoted by mixed method using distance hotdeck under CIA (MHC), consists of the following three steps.
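To make the modelling ingredient concrete, the sketch below fits a multinomial logistic regression to a toy donor file and predicts category probabilities for a recipient unit. It is an illustration under stated assumptions, not the authors' code: plain batch gradient descent stands in for whatever fitting routine was actually used, and the variable names `donor_x` and `donor_z`, the function names, and the toy data are hypothetical.

```python
import math


def softmax(scores):
    """Convert a list of scores into probabilities that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, features):
    """Category probabilities for one unit; weights[k] belongs to category k."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    return softmax(scores)


def fit_multinomial_logit(X, y, n_classes, lr=0.5, n_iter=2000):
    """Batch gradient descent on the mean negative log-likelihood."""
    n, p = len(X), len(X[0])
    weights = [[0.0] * p for _ in range(n_classes)]
    for _ in range(n_iter):
        grad = [[0.0] * p for _ in range(n_classes)]
        for features, label in zip(X, y):
            probs = predict_proba(weights, features)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                for j in range(p):
                    grad[k][j] += err * features[j]
        for k in range(n_classes):
            for j in range(p):
                weights[k][j] -= lr * grad[k][j] / n
    return weights


# Hypothetical donor file: one binary common variable and a 3-category target.
donor_x = [0, 0, 0, 0, 1, 1, 1, 1]
donor_z = [0, 0, 1, 2, 0, 1, 2, 2]
features = [[1.0, float(x)] for x in donor_x]  # intercept plus dummy for x == 1

weights = fit_multinomial_logit(features, donor_z, n_classes=3)
probs_x0 = predict_proba(weights, [1.0, 0.0])  # recipient unit with x == 0
probs_x1 = predict_proba(weights, [1.0, 1.0])  # recipient unit with x == 1
print(probs_x0, probs_x1)
```

Because the toy model is saturated (an intercept plus one dummy for a binary covariate), the fitted probabilities approach the empirical conditional frequencies in the donor file, roughly (0.5, 0.25, 0.25) for `x == 0` and (0.25, 0.25, 0.5) for `x == 1`.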
From the donor file B, the estimated probability
For observation in the recipient file A, the predicted probability
where
Matching Step : a value of
The second proposed mixed method, denoted by mixed method using randomization mechanism under CIA (MRC), also consists of three steps.
Using the donor file B, fit a multinomial logistic regression model with a set of
Same as Step (2) of the MHC method.
Predict category
Unlike the usual hotdeck method, the proposed method uses the set of predicted probabilities from the multinomial logistic regression, so no further processing is necessary even when some classes in the donor file are empty. Compared with the first method, the MRC method has the advantage of both a lighter computational burden and greater simplicity.
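The difference between the two matching steps can be illustrated with a short sketch. It assumes that predicted probability vectors are already available for both files, and that the MHC distance is taken between those vectors; the latter is a reading of the steps above rather than something the text states in full. The Manhattan distance is used as one of the measures named in the paper, and all function names, category labels, and numbers are hypothetical.

```python
import random


def manhattan(p, q):
    """Manhattan distance between two probability vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def distance_hotdeck(recipient_probs, donor_probs, donor_values):
    """MHC-style step: copy the observed value of the nearest donor."""
    imputed = []
    for p in recipient_probs:
        best = min(
            range(len(donor_probs)),
            key=lambda i: manhattan(p, donor_probs[i]),
        )
        imputed.append(donor_values[best])
    return imputed


def randomized_prediction(recipient_probs, categories, rng):
    """MRC-style step: draw one category from each probability vector."""
    return [rng.choices(categories, weights=p, k=1)[0] for p in recipient_probs]


categories = ["low", "mid", "high"]

# Hypothetical predicted probabilities and observed values in the donor file.
donor_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
donor_values = ["low", "high"]

# Hypothetical predicted probabilities for three recipient units.
recipient_probs = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.0, 0.0, 1.0]]

mhc_values = distance_hotdeck(recipient_probs, donor_probs, donor_values)
mrc_values = randomized_prediction(recipient_probs, categories, random.Random(1))
print(mhc_values, mrc_values)
```

The distance step compares every recipient with every donor, whereas the randomized step needs only one draw per recipient, which is consistent with the lighter computational burden noted for MRC.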
Both existing and proposed matching methods introduced in Section 3.1 are based on the CIA between
As mentioned in Section 2, Singh
We propose several mixed statistical matching methods for categorical variables using a multinomial logistic regression model when auxiliary information is available. The first method, denoted by mixed method with auxiliary information 1 (MA1), consists of the following three steps.
Using the file C, where
For recipient file A, the predicted probability
Predict category
The second method, which is denoted by MA2, also consists of three steps. MA2 is the same as MA1 except that independent variables in Step (1) are (
(1) Using the file C, fit a multinomial logistic regression model in which (
(2)–(3) Same as Steps (2)–(3) of MA1.
The third method, denoted by MA3, consists of the following six steps.
Using the file C, fit a multinomial logistic regression model in which
For the donor file B, the predicted probability
Predict a category of
Using the donor file B, fit a multinomial logistic regression model in which (
For the recipient file A, the predicted probability
Predict a category of
The fourth method, which is denoted by MA4, consists of six steps, as well. MA4 is the same as MA3 except that independent variables in Step (1) are (
(1) Using the file C, fit a multinomial logistic regression model in which (
(2)–(6) Same as Steps (2)–(6) of MA3.
The proposed methods, which are mixtures of parametric and nonparametric methods, can be applied to match the files even when the CIA does not hold.
We conduct a simulation study to compare the performance of several matching methods, including the proposed ones, when all variables in both files are categorical. For the simulation study, we generate a population of size 100,000 with (
where
That is, we considered 3 categorical variables that have 4, 3, and 2 categories, respectively.
where
To evaluate the performance of the different matching methods under various association strengths of (
where different categories, (
In each replication, recipient sample A of size 1,000 and donor sample B of size 4,000 were selected using simple random sampling. As noted, we assumed only (
For MHC, we applied several distance measures, namely the Manhattan, Euclidean, and Gower distance functions, as well as a constrained statistical matching method using the Manhattan distance function. For details about the various distance measures, see D’Orazio
where
Cramer’s V, which appears in the table, is a measure of association between two nominal variables.
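For readers who want to reproduce this association measure, the following sketch computes Cramer's V from a contingency table of counts. It assumes the usual definition, the square root of the chi-square statistic divided by n times min(r - 1, c - 1), which may differ in detail from the formula used in the paper. The function name and the example tables are hypothetical, and the table is assumed to have no empty row or column.

```python
import math


def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson chi-square statistic against the independence model.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


v_independent = cramers_v([[10, 20], [20, 40]])  # rows are proportional
v_perfect = cramers_v([[30, 0], [0, 30]])        # one variable determines the other
print(v_independent, v_perfect)
```

Values near 0 correspond to settings where the two variables are close to independent, and values near 1 correspond to strong association.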
In the second simulation, we compared the performance of the statistical matching methods using the auxiliary information suggested in Section 3.2 to the conventional random hotdeck method. To obtain auxiliary information, file C of size 100 with (
Table 3 shows that in the case of
In the third simulation, we compared the performance of the matching methods RAND and MA3 for various sample sizes of file C. The performance of the MA3 method was compared to the random hotdeck when the size of file C is 50, 100, 150, 200, 250, and 300. Table 4 shows that the performance of MA3 in estimating the joint distribution of (
The main goal of a statistical micro matching is to generate synthetic data which has (
Supported by a Korea University Grant (K1910941).
1 | (0, 0, 0, 0, 0, 0, 0, 0, 0) | (0, 0, 0, 0) | |
2 | (0, 0, 0, 0, 0, 0, 0, 0, 0) | (0, 0, 0, 0) | |
1 | (0, 0, 0, 0, 0, 0, 0, 0, 0) | (0.3, 0, 0.3, 0) | |
2 | (0, 0, 0, 0, 0, 0, 0, 0, 0) | (0, 0.3, 0, 0.3) | |
1 | (0, 0, 0, 0, 0, 0, 0, 0, 0) | (0.7, 0, 0.7, 0) | |
2 | (0, 0, 0, 0, 0, 0, 0, 0, 0) | (0, 0.7, 0, 0.7) | |
1 | (0, 0, 0, 0, 0, 0, 0, 0, 0) | (1, 0, 1, 0) | |
2 | (0, 0, 0, 0, 0, 0, 0, 0, 0) | (0, 1, 0, 1) | |
1 | (0.2, 0, 0.2, 0, 0.2, 0, 0.2, 0, 0.2) | (0, 0, 0, 0) | |
2 | (0, −0.2, 0, −0.2, 0, −0.2, 0, −0.2, 0) | (0, 0, 0, 0) | |
1 | (0.2, 0, 0.2, 0, 0.2, 0, 0.2, 0, 0.2) | (0.3, 0, 0.3, 0) | |
2 | (0, −0.2, 0, −0.2, 0, −0.2, 0, −0.2, 0) | (0, 0.3, 0, 0.3) | |
1 | (0.2, 0, 0.2, 0, 0.2, 0, 0.2, 0, 0.2) | (0.7, 0, 0.7, 0) | |
2 | (0, −0.2, 0, −0.2, 0, −0.2, 0, −0.2, 0) | (0, 0.7, 0, 0.7) | |
1 | (0.2, 0, 0.2, 0, 0.2, 0, 0.2, 0, 0.2) | (1, 0, 1, 0) | |
2 | (0, −0.2, 0, −0.2, 0, −0.2, 0, −0.2, 0) | (0, 1, 0, 1) |
Table 2. Comparison of TVD between random hotdeck and matching methods under CIA

Cramer’s V with | RAND | MHC(MAN) | MHC(EUC) | MHC(GOW) | MHC(M.C) | MRC | |
---|---|---|---|---|---|---|---|
0.006 | 0.0415 | 0.0419 | 0.0417 | 0.0420 | 0.0406 | 0.0418 | |
0.089 | 0.0673 | 0.0677 | 0.0679 | 0.0671 | 0.0673 | 0.0672 | |
0.204 | 0.1343 | 0.1334 | 0.1336 | 0.1339 | 0.1333 | 0.1332 | |
0.285 | 0.1849 | 0.1856 | 0.1850 | 0.1853 | 0.1838 | 0.1848 | |
0.005 | 0.0412 | 0.0417 | 0.0413 | 0.0410 | 0.0406 | 0.0415 | |
0.083 | 0.0647 | 0.0647 | 0.0645 | 0.0645 | 0.0644 | 0.0646 | |
0.195 | 0.1290 | 0.1289 | 0.1288 | 0.1303 | 0.1287 | 0.1302 | |
0.276 | 0.1817 | 0.1826 | 0.1829 | 0.1810 | 0.1830 | 0.1826 |
TVD = Total Variation Distance; CIA = Conditional Independence Assumption; RAND = RANDom hotdeck; MHC = Mixed method using distance Hotdeck under CIA; MAN = MANhattan distance; EUC = EUClidean distance; GOW = GOWer distance; M.C = Manhattan distance with Constrained matching; MRC = Mixed method using Randomization mechanism under CIA.
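The TVD values reported in the table can be computed along the following lines. The exact formula used in the paper is not reproduced in this text, so the sketch assumes the usual definition: half the sum of absolute differences between two discrete distributions over the same cells. The function names and the toy records are hypothetical.

```python
from collections import Counter


def joint_distribution(rows):
    """Relative frequency of each distinct tuple in a list of records."""
    counts = Counter(rows)
    n = len(rows)
    return {cell: c / n for cell, c in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions given as dicts."""
    cells = set(p) | set(q)  # a cell missing from one side has probability 0
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


# Hypothetical records of two categorical variables.
population = [(0, 0), (0, 1), (1, 0), (1, 1)]
matched = [(0, 0), (0, 0), (1, 1), (1, 1)]

p = joint_distribution(population)
q = joint_distribution(matched)
tvd = total_variation_distance(p, q)
print(tvd)
```

A TVD of 0 means the matched file reproduces the reference joint distribution exactly, and larger values indicate a poorer match.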
Table 3. Comparison of TVD between random hotdeck and methods using auxiliary information

Cramer’s V with | RAND | MA1 | MA2 | MA3 | MA4 | |
---|---|---|---|---|---|---|
0.006 | 0.0415 | 0.1196 | 0.1238 | 0.1083 | 0.1189 | |
0.089 | 0.0673 | 0.1192 | 0.1229 | 0.1084 | 0.1195 | |
0.204 | 0.1343 | 0.1179 | 0.1204 | 0.1053 | 0.1128 | |
0.285 | 0.1849 | 0.1135 | 0.1146 | 0.1020 | 0.1110 | |
0.005 | 0.0412 | 0.1182 | 0.1215 | 0.1078 | 0.1159 | |
0.083 | 0.0647 | 0.1137 | 0.1180 | 0.1014 | 0.1133 | |
0.195 | 0.1290 | 0.1127 | 0.1158 | 0.1028 | 0.1087 | |
0.276 | 0.1817 | 0.1099 | 0.1126 | 0.1004 | 0.1077 |
TVD = Total Variation Distance; RAND = RANDom hotdeck; MA = Mixed method with Auxiliary information.
Table 4. Comparison of TVD between RAND and MA3 for various sample sizes of file C (size of file C in parentheses)

Cramer’s V with | RAND | MA3 (50) | MA3 (100) | MA3 (150) | MA3 (200) | MA3 (250) | MA3 (300) | |
---|---|---|---|---|---|---|---|---|
0.006 | 0.0415 | 0.1476 | 0.1083 | 0.0903 | 0.0808 | 0.0750 | 0.0698 | |
0.089 | 0.0673 | 0.1518 | 0.1084 | 0.0885 | 0.0806 | 0.0735 | 0.0704 | |
0.204 | 0.1343 | 0.1432 | 0.1053 | 0.0879 | 0.0786 | 0.0723 | 0.0687 | |
0.285 | 0.1849 | 0.1404 | 0.1020 | 0.0850 | 0.0754 | 0.0715 | 0.0664 | |
0.005 | 0.0412 | 0.1495 | 0.1078 | 0.0884 | 0.0789 | 0.0724 | 0.0693 | |
0.083 | 0.0647 | 0.1478 | 0.1014 | 0.0892 | 0.0801 | 0.0724 | 0.0700 | |
0.195 | 0.1290 | 0.1405 | 0.1028 | 0.0862 | 0.0769 | 0.0704 | 0.0681 | |
0.276 | 0.1817 | 0.1405 | 0.1004 | 0.0840 | 0.0748 | 0.0689 | 0.0657 |
TVD = Total Variation Distance; RAND = RANDom hotdeck; MA = Mixed method with Auxiliary information.