Non-identifiability and testability of missing mechanisms in incomplete two-way contingency tables
Communications for Statistical Applications and Methods 2021;28:307-314
Published online May 31, 2021
© 2021 Korean Statistical Society.

Yousung Park^a, Seung Mo Oh^a, Tae Yeon Kwon^1,b

^a Department of Statistics, Korea University, Korea;
^b Department of International Finance, Hankuk University of Foreign Studies, Korea
Correspondence to: ^1 Department of International Finance, Hankuk University of Foreign Studies, 81 Mohyeon-myeon, Oedae-ro Cheoin-gu, Yongin-si Gyeonggi-do, Republic of Korea. E-mail: tykwon@hufs.ac.kr
Received January 22, 2021; Revised March 2, 2021; Accepted March 26, 2021.
Abstract
We showed that any missing mechanism is reproduced by EMAR or MNAR with equal fit for the observed likelihood if there are non-negative solutions of the maximum likelihood equations. This is a generalization of Molenberghs et al. (2008) and Jeon et al. (2019). Nonetheless, as MCAR is a nested model of MNAR, a natural question is whether or not MNAR and MCAR are testable by using the three well-known statistics: the LR (likelihood ratio), Wald, and Score test statistics. Through simulation studies, we compared these three statistics. We investigated to what extent the boundary solution affects testing MCAR against MNAR, which is the only testable pair of missing mechanisms based on the observed likelihood. We showed that all three statistics are useful as long as the boundary proximity is far from 1.
Keywords: MNAR, MCAR, identifiability, observed likelihood, boundary proximity
1. Introduction

Traditionally, a missing mechanism is classified as missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), as defined by Little (1988) and Little and Rubin (2019). This taxonomy can be applied to two-way contingency tables with item missingness but without unit missingness, for which Molenberghs et al. (2008) showed that any MNAR model is uniquely reproduced by a MAR model with equal fit in the observed likelihood, implying that MAR is not testable against MNAR. Ibrahim et al. (2008) extended the missing mechanisms of MCAR, MAR, and MNAR to those of MCAR, extended missing at random (EMAR), and MNAR so that the taxonomy can be applied to two-way tables that include not only item missingness but also unit missingness. They also showed that EMAR is not testable against MNAR and proposed alternative criteria to determine whether or not the missing mechanism is EMAR.

Baker et al. (1992) reparameterized the missing cell probabilities so that the missing mechanisms are specified in a loglinear model. Following this reparameterization, Jeon et al. (2019) specified the missing mechanisms of unit and item missingness in the pattern-mixture model of Little (1994), the selection model of Little (2008), and the loglinear model of Baker et al. (1992). In this paper, using the observed likelihood defined with such a reparameterization, we first show that, based on the observed likelihood in an incomplete two-way table, it is impossible to test H0: EMAR vs. Ha: MNAR or MCAR and H0: MNAR vs. Ha: EMAR or MCAR. Park et al. (2014) proposed boundary solution conditions under which some missing cell probabilities are estimated to be zero in an MNAR missing mechanism. Jeon et al. (2019) showed that the missingness in a two-way table is EMAR if the boundary solution conditions are met. However, there still remains the problem of identification between MNAR and MCAR. Unlike the other missing mechanisms, under which a precise imputation procedure is required in advance, the missingness can be ignored under MCAR. As MCAR is nested in MNAR under the reparameterization of Baker et al. (1992), only MCAR is testable against MNAR by using the three well-known statistics, LR, Wald, and Score, defined by the observed likelihood. Although both the MCAR and MNAR missing mechanisms lie outside the boundary condition of Jeon et al. (2019), it is not clear whether the three statistics work in testing MCAR against MNAR when the boundary condition is only barely avoided. We conducted a simulation to examine the performance of the LR, Wald, and Score test statistics for testing H0: MCAR vs. Ha: MNAR, the only testable pair of missing mechanisms based on the observed likelihood. We found that they are useful as long as the boundary proximity is far from 1.

The paper is composed of four sections. In Section 2, we show the reproducibility of the observed likelihood by MNAR and/or EMAR, no matter what missing mechanism the observed likelihood is constructed from. In Section 3, after defining a boundary proximity from the boundary solution condition, we test MCAR against MNAR by using the LR, Wald, and Score tests. Simulation studies are carried out to see to what extent the boundary proximity affects the suitability and reliability of the three traditional tests. Finally, in Section 4, we close with conclusions and limitations of the study.

2. Reproducibility of observed likelihood

A two-way table with missing data for two categorical variables Y1 and Y2 can be summarized as below in Table 1.

The variable Y1 has I categories and the variable Y2 has J categories, and R1 and R2 indicate the missing state of Y1 and Y2, respectively. For i = 1, 2, Ri = 1 means that Yi has been observed and Ri = 2 means that Yi has not been observed. In this contingency table, we have completely observed cell counts denoted by $z_{ij11}$ when R1 = R2 = 1, supplemental margins only on Y1 denoted by $z_{i+12}$ when R1 = 1 and R2 = 2, supplemental margins only on Y2 denoted by $z_{+j21}$ when R1 = 2 and R2 = 1, and the count of missing units denoted by $z_{++22}$ when R1 = R2 = 2.

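Under a multinomial distribution, the observed likelihood function of the two-way table in Table 1, which includes both item and unit missingness, is given by

$$L = \prod_{i=1}^{I}\prod_{j=1}^{J} \pi_{ij11}^{z_{ij11}} \prod_{i=1}^{I} \pi_{i+12}^{z_{i+12}} \prod_{j=1}^{J} \pi_{+j21}^{z_{+j21}} \, \pi_{++22}^{z_{++22}},$$

where $\pi_{ij11} = P(Y_1 = i, Y_2 = j, R_1 = 1, R_2 = 1)$, $\pi_{+j21} = \sum_{i=1}^{I} P(Y_1 = i, Y_2 = j, R_1 = 2, R_2 = 1)$, $\pi_{i+12} = \sum_{j=1}^{J} P(Y_1 = i, Y_2 = j, R_1 = 1, R_2 = 2)$, and $\pi_{++22} = \sum_{i=1}^{I}\sum_{j=1}^{J} P(Y_1 = i, Y_2 = j, R_1 = 2, R_2 = 2)$, with fixed $N = \sum_{i,j} z_{ij11} + \sum_{i} z_{i+12} + \sum_{j} z_{+j21} + z_{++22}$.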

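We reparameterize the missing cell probabilities $\pi_{ij12}$ and $\pi_{ij21}$ by

$$\alpha_{ij} = \frac{\pi_{ij21}}{\pi_{ij11}} \quad \text{and} \quad \beta_{ij} = \frac{\pi_{ij12}}{\pi_{ij11}}, \tag{2.1}$$

so that we have $\pi_{ij21} = \alpha_{ij}\pi_{ij11}$ and $\pi_{ij12} = \beta_{ij}\pi_{ij11}$. Note that $\alpha_{ij} = \alpha_{\cdot\cdot}, \alpha_{i\cdot}, \alpha_{\cdot j}$ for MCAR, MNAR, and EMAR Y1, respectively, and $\beta_{ij} = \beta_{\cdot\cdot}, \beta_{i\cdot}, \beta_{\cdot j}$ for MCAR, EMAR, and MNAR Y2, respectively, as shown in Jeon et al. (2019). Then the observed log likelihood can be written as

$$\log L = \sum_{i=1}^{I}\sum_{j=1}^{J} z_{ij11}\log(\pi_{ij11}) + \sum_{i=1}^{I} z_{i+12}\log\Big(\sum_{j=1}^{J}\beta_{ij}\pi_{ij11}\Big) + \sum_{j=1}^{J} z_{+j21}\log\Big(\sum_{i=1}^{I}\alpha_{ij}\pi_{ij11}\Big) + z_{++22}\log(\pi_{++22}). \tag{2.2}$$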

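The reparameterization of (2.1) requires that the maximum likelihood estimates of $\pi_{ij11}$, $\alpha_{ij}$, and $\beta_{ij}$ in (2.2) satisfy

$$\pi_{+j21} = \sum_{i=1}^{I}\alpha_{ij}\pi_{ij11} \quad \text{for } j = 1, \ldots, J, \qquad \pi_{i+12} = \sum_{j=1}^{J}\beta_{ij}\pi_{ij11} \quad \text{for } i = 1, \ldots, I. \tag{2.3}$$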

We then have the following main result.

Theorem 1

The observed likelihood given in (2.2) under any missing mechanism of Y1 and Y2 is reproduced by a combination of EMAR and MNAR with equal fit if there are unique non-negative solutions of $\alpha_{ij}$ and $\beta_{ij}$ satisfying the system of equations (2.3).

A combination of EMAR and MNAR in Theorem 1 includes the cases in which both mechanisms are EMAR or both are MNAR, implying that Theorem 1 is a generalization of the results of Molenberghs et al. (2008) and Jeon et al. (2019), as they allowed only a pair of EMARs for reproducing equal fit in the observed likelihood. Theorem 1 also indicates that the EMAR and MNAR missing mechanisms are not identifiable and are not testable against the other missing mechanisms, including MCAR.
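The equal-fit statement can be illustrated numerically. The following Python sketch is not the authors' code: it assumes a hypothetical 2 × 2 set of probabilities for the fully observed cells and a hypothetical supplemental margin for the case where Y1 is missing, and all function names are made up for illustration. It builds one EMAR and one MNAR choice of the α parameters and checks that both reproduce the same margin in (2.3), so the corresponding term of the observed log likelihood is identical.

```python
import math

# Hypothetical probabilities pi_{ij11} (rows i, columns j); illustrative only.
pi11 = [[0.30, 0.20],
        [0.10, 0.25]]
# Hypothetical supplemental margin pi_{+j21} (Y1 missing, Y2 observed).
pi_j21 = [0.04, 0.05]

def emar_alpha(pi11, margin):
    """EMAR for Y1: alpha_ij depends only on j (alpha_.j)."""
    n_rows, n_cols = len(pi11), len(pi11[0])
    col_tot = [sum(pi11[i][j] for i in range(n_rows)) for j in range(n_cols)]
    a_col = [margin[j] / col_tot[j] for j in range(n_cols)]
    return [[a_col[j] for j in range(n_cols)] for _ in range(n_rows)]

def mnar_alpha_2x2(pi11, margin):
    """MNAR for Y1: alpha_ij depends only on i (alpha_i.).

    Solves sum_i alpha_i * pi_ij11 = pi_+j21 for j = 1, 2 by Cramer's rule,
    so this helper only handles 2 x 2 tables.
    """
    a, b = pi11[0][0], pi11[1][0]  # coefficients in the equation for j = 1
    c, d = pi11[0][1], pi11[1][1]  # coefficients in the equation for j = 2
    det = a * d - b * c
    a_row = [(margin[0] * d - b * margin[1]) / det,
             (a * margin[1] - c * margin[0]) / det]
    return [[a_row[i] for _ in range(2)] for i in range(2)]

def fitted_margin(pi11, alpha):
    """Right-hand side of (2.3): sum_i alpha_ij * pi_ij11 for each j."""
    n_rows, n_cols = len(pi11), len(pi11[0])
    return [sum(alpha[i][j] * pi11[i][j] for i in range(n_rows))
            for j in range(n_cols)]

alpha_emar = emar_alpha(pi11, pi_j21)
alpha_mnar = mnar_alpha_2x2(pi11, pi_j21)
fit_emar = fitted_margin(pi11, alpha_emar)
fit_mnar = fitted_margin(pi11, alpha_mnar)

# Contribution of the Y1-missing margin to log L for hypothetical counts z_{+j21}.
z_j21 = [40, 50]
ll_emar = sum(z * math.log(p) for z, p in zip(z_j21, fit_emar))
ll_mnar = sum(z * math.log(p) for z, p in zip(z_j21, fit_mnar))

print("EMAR margin:", fit_emar, "MNAR margin:", fit_mnar)
print("log-likelihood terms:", ll_emar, ll_mnar)
```

In this example the MNAR solution is non-negative, which is the condition required by Theorem 1. With other inputs the MNAR solution can turn negative, which corresponds to the boundary solution problem discussed next.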

Since Y1 and Y2 are MCAR when $\alpha_{ij} = \alpha_{\cdot\cdot}$ and $\beta_{ij} = \beta_{\cdot\cdot}$, respectively, the observed likelihood under MCAR is nested in those under MNAR and EMAR, implying that MCAR can be tested against MNAR or EMAR by using the likelihood ratio, Wald, and Score tests. However, testing MCAR against MNAR may be affected by the boundary solution problem, as some $\alpha_{i\cdot}$ and $\beta_{\cdot j}$ are forced to zero when MNAR suffers from a boundary solution. When a boundary solution occurs in MNAR, Rubin et al. (1995) and Jeon et al. (2019) suggested that EMAR produces better estimates of the missing cells than the true MNAR, as EMAR has no boundary solution. Thus, our interest is in testing MCAR against MNAR when no boundary solution occurs but a close-to-boundary solution does.

3. Simulation studies

Jeon et al. (2019) proposed a new criterion to distinguish EMAR from the other missing mechanisms as follows. The missing mechanism of Y1 is EMAR if there is a pair of j and j′ satisfying C1, and that of Y2 is EMAR if there is a pair of i and i′ satisfying C2.

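$$\text{C1}: \ \omega_{jj'}^{+} < \omega_{jj'}^{\min} \quad \text{or} \quad \omega_{jj'}^{+} > \omega_{jj'}^{\max}, \qquad \text{C2}: \ \omega_{ii'}^{+} < \omega_{ii'}^{\min} \quad \text{or} \quad \omega_{ii'}^{+} > \omega_{ii'}^{\max},$$

where

$$\omega_{ii'}^{+} = \frac{\pi_{i+12}}{\pi_{i'+12}}, \quad \omega_{ii'}^{\max} = \max_{j}\frac{\pi_{ij11}}{\pi_{i'j11}}, \quad \text{and} \quad \omega_{ii'}^{\min} = \min_{j}\frac{\pi_{ij11}}{\pi_{i'j11}} \quad \text{for } i \neq i';$$

$$\omega_{jj'}^{+} = \frac{\pi_{+j21}}{\pi_{+j'21}}, \quad \omega_{jj'}^{\max} = \max_{i}\frac{\pi_{ij11}}{\pi_{ij'11}}, \quad \text{and} \quad \omega_{jj'}^{\min} = \min_{i}\frac{\pi_{ij11}}{\pi_{ij'11}} \quad \text{for } j \neq j'.$$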

Therefore, the missing mechanisms of Y1 and Y2 are MCAR or MNAR when the conditions C1 and C2 are violated. Note that violating these conditions is, unfortunately, only a necessary condition for MNAR not to fall on a boundary solution, implying that an MNAR may still suffer from a boundary solution even though it violates the conditions C1 and C2. Any test to differentiate MCAR from MNAR is meaningless in such cases because some of $\pi_{ij21}$, $\pi_{ij12}$, and $\pi_{ij22}$ are forced to be zero solely due to mathematical restrictions. In practice, EMAR is used when applying an MNAR to the missing cells suffers from a boundary solution, as discussed before.

One of our interests is whether and how the distance of $\omega_{jj'}^{+}$ from the boundaries $\omega_{jj'}^{\min}$ and $\omega_{jj'}^{\max}$ in C1, and that of $\omega_{ii'}^{+}$ from the corresponding boundaries $\omega_{ii'}^{\min}$ and $\omega_{ii'}^{\max}$ in C2, affect the performance of the LR, Wald, and Score tests based on the observed likelihood. We call $\min(\omega_{jj'}^{\max}/\omega_{jj'}^{+}, \ \omega_{jj'}^{+}/\omega_{jj'}^{\min})$ and $\min(\omega_{ii'}^{\max}/\omega_{ii'}^{+}, \ \omega_{ii'}^{+}/\omega_{ii'}^{\min})$ the boundary proximity. The closer the boundary proximity is to 1, the closer the solution for $\alpha_{i\cdot}$ and/or $\beta_{\cdot j}$ under MNAR is to 0 (i.e., a boundary solution).
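The boundary proximity for Y1 can be computed directly from a table. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it plugs observed counts in place of the cell probabilities, the function name and the example counts are hypothetical, and it reports one value per pair of columns because the text does not say how pairs are combined when J > 2.

```python
from itertools import combinations

def boundary_proximity_y1(z11, z_j21):
    """Boundary proximity of Y1 for every pair of columns (j, j') with j < j'.

    z11   : I x J list of fully observed counts z_{ij11}
    z_j21 : length-J list of supplemental counts z_{+j21} (Y1 missing)
    Counts are used as plug-in estimates of the probabilities (an assumption).
    """
    n_rows, n_cols = len(z11), len(z11[0])
    result = {}
    for j, k in combinations(range(n_cols), 2):
        omega_plus = z_j21[j] / z_j21[k]
        ratios = [z11[i][j] / z11[i][k] for i in range(n_rows)]
        omega_max, omega_min = max(ratios), min(ratios)
        # min(omega_max / omega_plus, omega_plus / omega_min), as defined in the text
        result[(j, k)] = min(omega_max / omega_plus, omega_plus / omega_min)
    return result

# Hypothetical 2 x 2 example.
z11 = [[300, 200],
       [100, 250]]
z_j21 = [40, 50]
prox = boundary_proximity_y1(z11, z_j21)
print(prox)
```

A value below 1 means that the ratio of supplemental margins falls outside the interval spanned by the cell ratios, that is, condition C1 holds for that pair. Swapping j and j' inverts all three ratios and leaves the proximity unchanged, which is why only pairs with j < j' are listed.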

The simulations are carried out to compare the three statistics, LR, Wald, and Score, for testing MCAR against MNAR in terms of the significance level α = 0.05 and the power of each test statistic. We assume that Y1 is MCAR or MNAR and Y2 is EMAR and known so that we focus on only the missing mechanism of Y1 for simplicity of discussion. It is straightforward to show that when Y1 is MCAR and Y2 is EMAR, the maximum likelihood (ML) estimates maximizing the observed likelihood of (2.2) are given by

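$$N\hat{\pi}_{ij11} = \frac{z_{ij11}\, z_{+j+1}\, z_{++11}}{z_{+j11}\, z_{+++1}}, \quad \hat{\alpha}_{\cdot\cdot} = \frac{z_{++21}}{z_{++11}}, \quad \hat{\beta}_{i\cdot} = \frac{z_{i+12}}{N\hat{\pi}_{i+11}}.$$

When Y1 is MNAR and Y2 is EMAR, the ML estimates are

$$\hat{\pi}_{ij11} = \frac{z_{ij11}}{N}, \quad \hat{\alpha}_{i\cdot} \ \text{satisfying} \ \sum_{i} z_{ij11}\hat{\alpha}_{i\cdot} = z_{+j21}, \quad \hat{\beta}_{i\cdot} = \frac{z_{i+12}}{z_{i+11}}.$$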
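To show how these estimates feed a likelihood ratio test, here is a minimal Python sketch for a 2 × 2 table. It is an illustration, not the authors' code, and it rests on several assumptions: the subscripts in the MCAR estimate are read as $z_{+j+1} = z_{+j11} + z_{+j21}$ and $z_{+++1} = z_{++11} + z_{++21}$; the MNAR solution is taken to be interior (non-negative), so the MNAR fit reproduces the observed $z_{ij11}$ and $z_{+j21}$ exactly; the remaining terms of (2.2) are treated as fitted identically under both models and therefore dropped; and the p-value uses the usual chi-square reference distribution with I − 1 = 1 degree of freedom. All function names and counts are hypothetical.

```python
import math

def fit_mcar(z11, z_j21):
    """Closed-form MCAR fit for Y1: alpha_hat, fitted N*pi_ij11, fitted N*pi_+j21."""
    n_rows, n_cols = len(z11), len(z11[0])
    z_j11 = [sum(z11[i][j] for i in range(n_rows)) for j in range(n_cols)]  # z_{+j11}
    z_pp11 = sum(z_j11)                                                     # z_{++11}
    z_pp21 = sum(z_j21)                                                     # z_{++21}
    z_jp1 = [z_j11[j] + z_j21[j] for j in range(n_cols)]                    # z_{+j+1}
    z_ppp1 = z_pp11 + z_pp21                                                # z_{+++1}
    alpha = z_pp21 / z_pp11
    m11 = [[z11[i][j] * z_jp1[j] * z_pp11 / (z_j11[j] * z_ppp1)
            for j in range(n_cols)] for i in range(n_rows)]
    m_j21 = [alpha * sum(m11[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return alpha, m11, m_j21

def fit_mnar_2x2(z11, z_j21):
    """Solve sum_i z_ij11 * alpha_i = z_+j21 (j = 1, 2) by Cramer's rule."""
    a, b = z11[0][0], z11[1][0]
    c, d = z11[0][1], z11[1][1]
    det = a * d - b * c
    return [(z_j21[0] * d - b * z_j21[1]) / det,
            (a * z_j21[1] - c * z_j21[0]) / det]

def lr_mcar_vs_mnar(z11, z_j21):
    """LR statistic 2 * (logL_MNAR - logL_MCAR) under the stated assumptions."""
    _, m11, m_j21 = fit_mcar(z11, z_j21)
    stat = 0.0
    for i, row in enumerate(z11):
        for j, z in enumerate(row):
            stat += z * math.log(z / m11[i][j])
    for j, z in enumerate(z_j21):
        stat += z * math.log(z / m_j21[j])
    return 2.0 * stat

# Hypothetical counts.
z11 = [[300, 200],
       [100, 250]]
z_j21 = [40, 50]

alpha_mcar, m11, m_j21 = fit_mcar(z11, z_j21)
alpha_mnar = fit_mnar_2x2(z11, z_j21)
lr = lr_mcar_vs_mnar(z11, z_j21)
# Upper tail of a chi-square with 1 degree of freedom: P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr / 2.0))
print(alpha_mcar, alpha_mnar, lr, p_value)
```

If either entry of `alpha_mnar` were negative, the interior-solution assumption would fail and the statistic computed this way would no longer correspond to the constrained MNAR fit.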

Using these ML estimates, we test H0: MCAR vs. Ha: MNAR to check the type 1 error and to examine the power of the LR, Wald, and Score test statistics under three scenarios of boundary proximity. Table 2 summarizes the three scenarios with different degrees of MNAR for 2 × 2 × 2 × 2 and 3 × 3 × 2 × 2 contingency tables. In Table 2, $\alpha_{1\cdot} = \alpha_{2\cdot}$ and $\alpha_{1\cdot} = \alpha_{2\cdot} = \alpha_{3\cdot}$ are equivalent to MCAR. The boundary proximities defined by $\min(\omega_{jj'}^{\max}/\omega_{jj'}^{+}, \ \omega_{jj'}^{+}/\omega_{jj'}^{\min})$ are given as scenarios S1 to S3 for 2 × 2 × 2 × 2 tables and S4 to S6 for 3 × 3 × 2 × 2 tables.

Scenario S1 (and S4) is the furthest from the boundary solution, whereas S3 (and S6) is the closest. The other simulation factors are the sample size (5,000 and 10,000) and the item missing rate of Y1 (from 5% to 15%), with the missing rate of Y2 fixed at 5% and the unit missing rate fixed at 2%. Large samples are considered in order to secure a sufficient number of missing values to identify the missing mechanism, namely between 250 and 1,500. Each simulation combination is repeated 10,000 times.

Table 3 shows the type 1 errors, that is, the probabilities of rejecting H0: Y1 is MCAR when MCAR is true, at the nominal level α = 0.05. Except for S6 with a missing rate of 5% and a sample size of 5,000, the type 1 errors of all tests are well maintained near 5%. The type 1 errors get closer to 5% as the sample size increases from 5,000 to 10,000.
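As a rough yardstick for reading these rates, the Monte Carlo standard error of an estimated rejection probability follows from the number of replications. The short Python sketch below assumes independent replications and a binomial model for the number of rejections; the helper name is hypothetical, and 3.31% is the LR entry reported in Table 3 for S6 at a 5% missing rate with sample size 5,000.

```python
import math

def mc_standard_error(p, n_rep):
    """Standard error of a rejection rate estimated from n_rep independent runs."""
    return math.sqrt(p * (1.0 - p) / n_rep)

# Nominal level 5% with 10,000 replications per simulation combination.
se = mc_standard_error(0.05, 10_000)

# Distance of the reported S6 rate (3.31%) from the nominal 5%, in standard errors.
z_s6 = (0.0331 - 0.05) / se
print(f"Monte Carlo SE: {se:.4f}, standardized deviation for S6: {z_s6:.2f}")
```

Under these assumptions the standard error is roughly 0.22 percentage points, so the S6 entry lies well below what simulation noise alone would explain.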

Tables 4 and 5 present the powers of the three statistics for three degrees of boundary proximity under different sample sizes, missing rates, and strengths of MNAR. As with the maintenance of the significance level, no noticeable difference in power was found among the three test statistics. The boundary proximity has a profound effect on the power of the test statistics. For all three statistics, the power drops rapidly as the boundary proximity approaches 1 (i.e., as S1 → S2 → S3 and S4 → S5 → S6).

Tables 4 and 5 also show that the power increases as the sample size, the missing percentage, or the degree of MNAR increases, as desired. However, comparing the powers in Table 4 with those in Table 5, the power decreases as the number of levels I in the two-way table increases, because the larger the number of levels, the more likely the boundary proximity is to be close to 1. A boundary proximity close to 1 implies that there are levels of MNAR Y1 corresponding to $\alpha_{i\cdot}$ close to zero and levels of MNAR Y2 corresponding to $\beta_{\cdot j}$ close to zero; such levels should be rare events in incompletely observed data. In particular, as seen in Table 5, the very low powers of the three statistics for boundary proximities close to 1 indicate that one must be careful in accepting MCAR unless there is prior information that no level of the missing data is a rare event.

4. Conclusion

We showed that the LR, Wald, and Score test statistics based on the observed likelihood in an incomplete two-way table are not applicable to testing H0: EMAR vs. Ha: the other missing mechanisms or H0: MNAR vs. Ha: the other missing mechanisms, except for H0: MCAR vs. Ha: MNAR, because the observed likelihood constructed from any missing mechanism is reproduced by MNAR and EMAR with equal fit. Fortunately, since the boundary condition provided by Jeon et al. (2019) can be used to distinguish EMAR from MNAR or MCAR, MNAR is applied first when imputing missing values. If a boundary solution problem occurs, we adopt EMAR as the missing mechanism; otherwise, we adopt MNAR or MCAR.

The LR, Wald, and Score test statistics for testing MCAR against MNAR, the only testable pair of missing mechanisms based on the observed likelihood, are useful as long as the boundary proximity is far from 1. When the boundary proximity is close to 1, however, the test result should be interpreted with caution, as the powers of the three test statistics are very low in such a case. This calls for a test method that is free from the boundary proximity to distinguish between MCAR and MNAR.

Acknowledgement
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2018R1C1B5043739). This work was supported by Hankuk University of Foreign Studies Research Fund.
TABLES

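Table 1

I × J × 2 × 2 Table including item and unit missingness

|        |        | R2 = 1, Y2 = 1 | ... | R2 = 1, Y2 = J | R2 = 2 |
|--------|--------|----------------|-----|----------------|--------|
| R1 = 1 | Y1 = 1 | z1111          | ... | z1J11          | z1+12  |
|        | ...    | ...            |     | ...            | ...    |
|        | Y1 = I | zI111          | ... | zIJ11          | zI+12  |
| R1 = 2 |        | z+121          | ... | z+J21          | z++22  |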

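Table 2

Boundary proximity for degrees of MNAR

| I, J | Scenario | α1. : α2. = 1:1 | 1:2  | 1:3  |
|------|----------|-----------------|------|------|
| 2    | S1       | 8.18            | 5.65 | 4.16 |
|      | S2       | 2.33            | 1.78 | 1.55 |
|      | S3       | 1.40            | 1.23 | 1.17 |

| I, J | Scenario | α1. : α2. : α3. = 1:1:1 | 1:1:2 | 1:1:3 | 1:2:3 |
|------|----------|-------------------------|-------|-------|-------|
| 3    | S4       | 8.00                    | 4.89  | 3.69  | 3.85  |
|      | S5       | 2.33                    | 1.87  | 1.64  | 1.71  |
|      | S6       | 1.34                    | 1.35  | 1.35  | 1.45  |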

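Table 3

Type 1 Error

The probabilities (%) of rejecting H0: Y1 is MCAR with nominal α = 5% when Y1 is simulated under MCAR

| Scenario | Missing rate (%) | n = 5000: LR | n = 5000: Wald | n = 5000: Score | n = 10000: LR | n = 10000: Wald | n = 10000: Score |
|----------|------------------|------|------|------|------|------|------|
| S1       | 5                | 5.21 | 5.38 | 5.36 | 5.17 | 5.27 | 5.27 |
|          | 10               | 5.23 | 5.28 | 5.28 | 4.60 | 4.64 | 4.64 |
|          | 15               | 4.90 | 4.96 | 4.96 | 5.06 | 5.07 | 5.07 |
| S2       | 5                | 4.96 | 5.11 | 5.09 | 5.43 | 5.49 | 5.49 |
|          | 10               | 5.54 | 5.60 | 5.59 | 4.91 | 4.93 | 4.93 |
|          | 15               | 5.16 | 5.20 | 5.19 | 5.02 | 5.03 | 5.03 |
| S3       | 5                | 4.12 | 4.21 | 4.20 | 5.02 | 5.08 | 5.08 |
|          | 10               | 5.24 | 5.28 | 5.28 | 5.12 | 5.15 | 5.15 |
|          | 15               | 4.96 | 4.97 | 4.97 | 5.16 | 5.20 | 5.20 |
| S4       | 5                | 4.94 | 5.30 | 5.29 | 5.02 | 5.13 | 5.13 |
|          | 10               | 4.82 | 4.95 | 4.94 | 5.40 | 5.52 | 5.52 |
|          | 15               | 5.19 | 5.30 | 5.25 | 5.00 | 5.00 | 5.00 |
| S5       | 5                | 4.89 | 5.15 | 5.12 | 4.77 | 4.88 | 4.88 |
|          | 10               | 5.45 | 5.72 | 5.71 | 4.98 | 5.03 | 5.03 |
|          | 15               | 5.09 | 5.18 | 5.17 | 4.91 | 4.93 | 4.93 |
| S6       | 5                | 3.31 | 3.39 | 3.37 | 4.14 | 4.19 | 4.18 |
|          | 10               | 4.08 | 4.16 | 4.14 | 4.88 | 5.05 | 5.05 |
|          | 15               | 4.51 | 4.54 | 4.52 | 5.21 | 5.27 | 5.27 |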

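Table 4

Power of Test

The probabilities of rejecting MCAR against MNAR in 2 × 2 × 2 × 2 tables (%)

| Sample size | Scenario | Missing rate (%) | 1:2 LR | 1:2 Wald | 1:2 Score | 1:3 LR | 1:3 Wald | 1:3 Score |
|-------------|----------|------------------|------|------|------|------|------|------|
| 5000        | S1       | 5                | 98.8 | 98.7 | 98.7 | 100  | 100  | 100  |
|             |          | 10               | 100  | 100  | 100  | 100  | 100  | 100  |
|             |          | 15               | 100  | 100  | 100  | 100  | 100  | 100  |
|             | S2       | 5                | 53.7 | 54.1 | 54.1 | 87.3 | 87.5 | 87.5 |
|             |          | 10               | 80.9 | 81.1 | 81.1 | 99.0 | 99.0 | 99.0 |
|             |          | 15               | 91.8 | 91.9 | 91.9 | 99.9 | 99.9 | 99.9 |
|             | S3       | 5                | 11.2 | 11.5 | 11.4 | 18.8 | 19.4 | 19.4 |
|             |          | 10               | 25.4 | 25.5 | 25.5 | 45.3 | 45.6 | 45.5 |
|             |          | 15               | 35.1 | 35.2 | 35.2 | 60.7 | 60.8 | 60.8 |
| 10000       | S1       | 5                | 100  | 100  | 100  | 100  | 100  | 100  |
|             |          | 10               | 100  | 100  | 100  | 100  | 100  | 100  |
|             |          | 15               | 100  | 100  | 100  | 100  | 100  | 100  |
|             | S2       | 5                | 82.7 | 82.9 | 82.9 | 99.3 | 99.3 | 99.3 |
|             |          | 10               | 98.2 | 98.2 | 98.2 | 100  | 100  | 100  |
|             |          | 15               | 99.8 | 99.8 | 99.8 | 100  | 100  | 100  |
|             | S3       | 5                | 27.1 | 27.3 | 27.3 | 49.0 | 49.2 | 49.2 |
|             |          | 10               | 47.4 | 47.5 | 47.5 | 76.9 | 76.9 | 76.9 |
|             |          | 15               | 61.0 | 61.0 | 61.0 | 89.6 | 89.6 | 89.6 |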

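Table 5

Power of Test

The probabilities of rejecting MCAR against MNAR in 3 × 3 × 2 × 2 tables (%)

| Sample size | Scenario | Missing rate (%) | 1:1:2 LR | 1:1:2 Wald | 1:1:2 Score | 1:1:3 LR | 1:1:3 Wald | 1:1:3 Score | 1:2:3 LR | 1:2:3 Wald | 1:2:3 Score |
|-------------|----------|------------------|-------|------|------|------|------|------|------|------|------|
| 5000        | S4       | 5                | 92.1  | 91.5 | 91.5 | 100  | 100  | 100  | 98.0 | 98.1 | 98.1 |
|             |          | 10               | 99.7  | 99.7 | 99.7 | 100  | 100  | 100  | 100  | 100  | 100  |
|             |          | 15               | 100   | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  |
|             | S5       | 5                | 30.1  | 30.2 | 30.2 | 64.5 | 64.0 | 64.0 | 38.6 | 39.7 | 39.7 |
|             |          | 10               | 51.2  | 50.8 | 50.7 | 90.6 | 90.4 | 90.4 | 66.4 | 67.0 | 67.0 |
|             |          | 15               | 68.0  | 67.6 | 67.6 | 97.5 | 97.3 | 97.3 | 82.5 | 82.9 | 82.9 |
|             | S6       | 5                | 6.9   | 7.0  | 6.9  | 14.2 | 14.8 | 14.7 | 5.9  | 6.0  | 6.0  |
|             |          | 10               | 12.5  | 12.6 | 12.6 | 29.9 | 30.4 | 30.4 | 8.0  | 8.1  | 8.1  |
|             |          | 15               | 17.5  | 17.5 | 17.5 | 40.3 | 40.9 | 40.8 | 9.4  | 9.4  | 9.4  |
| 10000       | S4       | 5                | 99.8  | 99.8 | 99.8 | 100  | 100  | 100  | 100  | 100  | 100  |
|             |          | 10               | 100   | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  |
|             |          | 15               | 100   | 100  | 100  | 100  | 100  | 100  | 100  | 100  | 100  |
|             | S5       | 5                | 54.7  | 54.2 | 54.2 | 91.5 | 91.1 | 91.1 | 68.6 | 69.2 | 69.2 |
|             |          | 10               | 82.7  | 82.2 | 82.2 | 99.7 | 99.7 | 99.7 | 93.1 | 93.2 | 93.2 |
|             |          | 15               | 94.3  | 94.2 | 94.2 | 100  | 100  | 100  | 98.7 | 98.8 | 98.8 |
|             | S6       | 5                | 13.18 | 13.2 | 13.2 | 31.9 | 32.3 | 32.3 | 8.3  | 8.3  | 8.3  |
|             |          | 10               | 22.2  | 22.3 | 22.3 | 56.2 | 56.6 | 56.6 | 11.8 | 11.6 | 11.6 |
|             |          | 15               | 31.5  | 31.6 | 31.6 | 71.9 | 72.3 | 72.3 | 15.0 | 14.9 | 14.9 |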

References
  1. Baker SG, Rosenberger WF, and Dersimonian R (1992). Closed-form estimates for missing counts in two-way contingency tables. Statistics in Medicine, 11, 643-657.
  2. Ibrahim JG, Zhu H, and Tang N (2008). Model selection criteria for missing-data problems using the EM algorithm. Journal of the American Statistical Association, 103, 1648-1658.
  3. Jeon S, Kwon TY, and Park Y (2019). Variable-based missing mechanism for an incomplete contingency table with unit missingness. Statistics & Probability Letters, 146, 90-96.
  4. Little RJ (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83, 1198-1202.
  5. Little RJ (1994). A class of pattern-mixture models for normal incomplete data. Biometrika, 81, 471-483.
  6. Little RJ (2008). Selection and pattern-mixture models. Longitudinal Data Analysis, (pp. 409-431), London, Chapman and Hall.
  7. Little RJ and Rubin DB (2019). Statistical Analysis with Missing Data, England, Wiley Blackwell.
  8. Molenberghs G, Beunckens C, Sotto C, and Kenward MG (2008). Every missingness not at random model has a missingness at random counterpart with equal fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 371-388.
  9. Park Y, Kim D, and Kim S (2014). Identification of the occurrence of boundary solutions in a contingency table with nonignorable nonresponse. Statistics & Probability Letters, 93, 34-40.
  10. Rubin DB, Stern HS, and Vehovar V (1995). Handling "don't know" survey responses: the case of the Slovenian plebiscite. Journal of the American Statistical Association, 90, 822-828.