In this paper, we propose a procedure for building a prediction interval for the weighted sum of dependent binary random variables indexed by a graph, accounting for the dependence among them. The problem is motivated by seat prediction in elections such as the Korean National Assembly (KNA) election and the US presidential election. Traditional and popular approaches to constructing the prediction interval for the number of seats won by major parties are the normal approximation based on the central limit theorem (CLT) and the Monte Carlo method, which generates many Bernoulli random variables under the assumptions that the binary variables are independent and that their success probabilities are known constants. In practice, however, the survey results (including exit polls) are random and hardly independent of each other; more often they are spatially correlated. To take this into account, we suggest a spatial auto-regressive (AR) model for the surveyed success probabilities and propose a residual-based bootstrap procedure to construct the prediction interval for the sum of the binary outcomes. Finally, we apply the procedure to building prediction intervals for the number of legislative seats won by each party from the exit poll data in the 19
In this paper, we are interested in the interval prediction of the sum of dependent binary random variables indexed by a graph. Suppose an undirected graph
where
The problem above often arises in prediction problem in various elections in many countries. One example is the United States Electoral College for the US presidential election, where
A traditional and popular method to construct the prediction interval of T is by assuming that
where the existing methods disregard the first term and approximate the second to ∑
This is one of many reasons for the failure of the exit poll for the KNA election. We remark that the exit poll for the KNA election started in 2004 (the 17
In this paper, we propose a new method to build a prediction interval of the sum statistic T. Our new way is a resampling based procedure. It assumes the spatial auto-regressive (AR) model for
The new method is applied to the exit polls of the 19
The remainder of the paper is organized as follows. In Section 2, we introduce the spatial AR model assumed for the observations {(
In this section, we introduce the spatial AR model and the estimation procedure of the model parameters. Recall that
where
When the observations {(
where A(
where
In this section, we introduce the new method to build the 100(1 −
We let
where
We propose a method to build the prediction interval of T using the empirical 100(
Our proposal can be understood as a procedure with a
where T is the number of seats we want to predict,
In this section, we apply the method proposed in Sections 2 and 3 to building the prediction interval of the number of seats of each party from the exit poll data in the two most recent Korean National Assembly elections, the 19
We start with a brief introduction to election exit polls, which have been widely used in the U.S. since the 1970s. Their use has since expanded to other democracies (Mitofsky, 1991, 1995; Greiner and Quinn, 2010; Wang
The exit poll data we analyze in this section is obtained during the 19
The pollsters employ a two-stage cluster design (Mendenhall
In our exit poll example, the 246 election districts constitute the vertex set
be the set of all districts neighboring with
According to the above two types of neighboring system, we define the spatial model (
Thus, the model for spatial latent variables in (
Let
Two existing methods that are known to be used in practice are the normal approximation (NoA) and the Monte Carlo (MCind) approximation by Huh (2008) under the assumption of independence among the observed
It follows from the classical central limit theorem that the sum of
where
On the other hand, MCind approximation is based on independent Bernoulli random samples {
In the 19
We apply the proposed spatial bootstrap (SB) method to build the prediction intervals and compare the results with those of NoA and MCind. In the analysis below, the three methods are applied to each political party, SNP and MTP, independently, to evaluate the expected number of seats and its prediction interval for each party. The number of Monte Carlo samples for MCind and SB is set at 10,000.
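To make the two baseline methods concrete, the following is a minimal sketch of how NoA and MCind intervals could be computed when the district-level success probabilities are treated as known constants and the outcomes as independent. The function names `noa_interval` and `mcind_interval` and the toy vector `p_hat` are hypothetical and are not taken from the paper; the sketch only uses the standard facts that a sum of independent Bernoulli variables has mean equal to the sum of the probabilities and variance equal to the sum of p(1 − p).

```python
import numpy as np
from statistics import NormalDist


def noa_interval(p, alpha=0.05):
    """Normal-approximation interval for the sum of independent Bernoulli(p_i)."""
    p = np.asarray(p, dtype=float)
    mean = p.sum()                          # E[T] = sum of p_i
    sd = np.sqrt((p * (1.0 - p)).sum())     # Var[T] = sum of p_i (1 - p_i)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return float(mean - z * sd), float(mean + z * sd)


def mcind_interval(p, alpha=0.05, n_rep=10_000, seed=0):
    """Monte Carlo interval from independent Bernoulli draws with fixed p_i."""
    p = np.asarray(p, dtype=float)
    rng = np.random.default_rng(seed)
    draws = rng.random((n_rep, p.size)) < p  # each row: one simulated election
    totals = draws.sum(axis=1)               # simulated number of seats won
    lo, hi = np.quantile(totals, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)


# Hypothetical win probabilities for seven districts (illustration only).
p_hat = np.array([0.9, 0.8, 0.55, 0.5, 0.45, 0.3, 0.1])
print(noa_interval(p_hat))
print(mcind_interval(p_hat))
```

Both functions ignore the randomness of the surveyed probabilities and any dependence between districts, which is the limitation the SB method is meant to address.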
Table 2 reports the predictions of the number of seats by the three methods. We find that the prediction interval from the SB method is wider than those from the two existing methods. The SB method accounts for both the spatial dependence and the effects of the exogenous administrative districts, which makes its interval wider than those based on the independence assumption that ignore the administrative districts’ effects; consequently, the true numbers of seats fall within the SB intervals.
The covariate effects, that is, the effects of the administrative districts, are plotted in Figure 1, where the coefficient (effect) of the administrative district “Seoul” is fixed at 0 as the reference for comparison. The figure shows that the SNP has a positive effect in the eastern and south-eastern parts of Korea, whereas the MTP has a positive effect in the south-western part.
The estimated spatial coefficients are reported in Table 3, where the standard errors are estimated from the bootstrap replications and the
In an election, it is often of particular interest to examine a specific region with a small number of electoral precincts, such as the Nakdonggang River belt, which refers to 8 districts around the Nakdonggang River in western Busan. The seat prediction in the small area
The primary goal of exit polls in legislative elections is to predict the number of seats won by major parties. However, the predictions are not very accurate despite the large amount of financial resources devoted to conducting exit polls. Furthermore, no formal procedure has been suggested for building prediction intervals for the number of seats won by each party. In this work, we recast the problem as a more general one: the prediction of the sum of binary random variables on a graph when their success probabilities are observable. We consider the AR regression model to account for the effects of exogenous covariates and the spatial dependence over the graph. We propose a spatial bootstrap procedure, used along with the AR regression model, to build the prediction interval of the sum. We apply our procedure to the exit poll data from the 19
Predicted numbers of seats by three major broadcasting companies in Korea in the 20
Party | KBS | SBS | MBC | True |
---|---|---|---|---|
SNP | (121, 143) | (123, 147) | (118, 136) | 122 |
TMP | (101, 123) | (97, 120) | (107, 128) | 123 |
GMP | (34, 41) | (31, 43) | (32, 42) | 38 |
SNP = Saenuri Party; TMP = Minju Party; GMP = Gukmin Party.
95% prediction interval of the number of seats
Party | NoA | MCind | SB | True |
---|---|---|---|---|
SNP | (117.2, 134.1) | (116.0, 135.0) | (110.0, 159.0) | 122 |
TMP | (110.5, 127.7) | (109.0, 129.0) | (95.0, 141.0) | 123 |
GMP | (33.8, 40.4) | (33.0, 41.0) | (29.0, 41.0) | 38 |
In the table “NoA” and “MCind” stand for the normal approximation and Monte Carlo approximation under independence assumption. “SB” stands for the spatial bootstrap procedure. SNP = Saenuri Party; TMP = Minju Party; GMP = Gukmin Party.
Summary statistics for the spatial coefficients
Party | nhd-type | est | s.e. | p-value |
---|---|---|---|---|
SNP | | 0.065 | 0.053 | 0.217 |
SNP | | 0.189 | 0.039 | <0.001 |
TMP | | −0.013 | 0.054 | 0.803 |
TMP | | −0.198 | 0.045 | <0.001 |
GMP | | 0.013 | 0.076 | 0.868 |
GMP | | 0.145 | 0.081 | 0.075 |
“est” and “s.e.” are, respectively, an average and a standard deviation of
Prediction result from the exit poll data by three broadcasting systems in the 19
Party | KBS | SBS | MBC | True |
---|---|---|---|---|
SNP | (131, 147) | (126, 151) | (130, 153) | 152 |
MTP | (131, 147) | (128, 150) | (128, 153) | 127 |
KNA = Korean National Assembly; SNP = Saenuri Party; MTP = Minju Tonghap Party.
95% prediction interval of the number of seats
Party | NoA | MCind | SB | True |
---|---|---|---|---|
SNP | (137.6, 152.4) | (137.0, 153.0) | (132.0, 158.0) | 152 |
MTP | (128.8, 142.9) | (128.0, 144.0) | (116.0, 142.0) | 127 |
NoA = normal approximation; MCind = Monte Carlo approximation under independence assumption; SB = spatial bootstrap procedure; SNP = Saenuri Party; MTP = Minju Tonghap Party.
Summary statistics for the spatial coefficients
Party | nhd-type | est | s.e. | p-value |
---|---|---|---|---|
SNP | | −0.057 | 0.055 | 0.297 |
SNP | | 0.092 | 0.092 | 0.321 |
MTP | | −0.003 | 0.042 | 0.943 |
MTP | | −0.300 | 0.029 | <0.001 |
“est” and “s.e.” are, respectively, an average and a standard deviation of
95% prediction interval of the number of seats in small area
Region-Party | NoA | MCind | SB | True |
---|---|---|---|---|
Belt-SNP | (1.4, 5.7) | (1.0, 6.0) | (4.0, 8.0) | 5 |
Belt-TMP | (2.3, 6.6) | (2.0, 7.0) | (0.0, 4.0) | 3 |
Seoul-SNP | (10.5, 18.4) | (10.0, 18.0) | (9.0, 24.0) | 16 |
Seoul-TMP | (27.4, 35.2) | (27.0, 35.0) | (22.0, 37.0) | 30 |
In the table “NoA” and “MCind” stand for the normal approximation and Monte Carlo approximation under independence assumption. “SB” stands for the spatial bootstrap procedure. “Belt-PARTY” refers to the outcome of PARTY in the Nakdonggang belt and “Seoul-PARTY” in Seoul city.
Monte Carlo method with bootstrap sampling
1: | Estimate |
2: | |
3: | |
4: | Get a bootstrap sample |
5: | Obtain |
6: | Run the Bernoulli trial |
7: | |
8: | Compute their sum |
9: | |
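The steps listed above (estimate the model, resample residuals, regenerate the success probabilities, run Bernoulli trials, and sum the outcomes) can be illustrated with a rough sketch. Everything specific in it is an assumption made for illustration and is not the authors' implementation: the model is taken to be a single-weight-matrix spatial AR model on logit-transformed probabilities, z = ρWz + Xβ + ε, fitted by concentrated maximum likelihood over a grid of ρ, whereas the paper works with two neighborhood types. The names `fit_sar`, `spatial_bootstrap_interval`, and the ring-graph toy data are hypothetical.

```python
import numpy as np


def fit_sar(z, X, W, rho_grid=np.linspace(-0.9, 0.9, 181)):
    """Grid-search concentrated ML for the assumed model z = rho*W z + X beta + eps."""
    n = z.size
    best = None
    for rho in rho_grid:
        A = np.eye(n) - rho * W
        Az = A @ z
        beta, *_ = np.linalg.lstsq(X, Az, rcond=None)   # OLS given rho
        resid = Az - X @ beta
        sign, logdet = np.linalg.slogdet(A)
        if sign <= 0:
            continue                                    # skip inadmissible rho
        loglik = logdet - 0.5 * n * np.log(resid @ resid / n)
        if best is None or loglik > best[0]:
            best = (loglik, rho, beta, resid)
    _, rho, beta, resid = best
    return rho, beta, resid


def spatial_bootstrap_interval(p_hat, X, W, weights=None, alpha=0.05,
                               n_boot=2000, seed=0):
    """Residual-bootstrap prediction interval for the weighted sum of outcomes."""
    rng = np.random.default_rng(seed)
    p_hat = np.clip(np.asarray(p_hat, dtype=float), 1e-6, 1.0 - 1e-6)
    n = p_hat.size
    weights = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    z = np.log(p_hat / (1.0 - p_hat))                   # assumed logit link
    rho, beta, resid = fit_sar(z, X, W)                 # estimate the model
    resid = resid - resid.mean()                        # center the residuals
    A_inv = np.linalg.inv(np.eye(n) - rho * W)
    totals = np.empty(n_boot)
    for b in range(n_boot):
        eps_star = rng.choice(resid, size=n, replace=True)  # bootstrap residuals
        z_star = A_inv @ (X @ beta + eps_star)              # regenerate latent values
        p_star = 1.0 / (1.0 + np.exp(-z_star))              # bootstrap probabilities
        y_star = (rng.random(n) < p_star).astype(float)     # Bernoulli trials
        totals[b] = weights @ y_star                        # weighted sum T*
    lo, hi = np.quantile(totals, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi), float(rho)


# Toy data: 30 districts on a ring, each with two neighbors (illustration only).
n = 30
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5         # row-normalized weights
X = np.ones((n, 1))                                     # intercept-only covariates
p_hat = np.random.default_rng(1).uniform(0.2, 0.8, size=n)
lo, hi, rho_hat = spatial_bootstrap_interval(p_hat, X, W)
print(lo, hi, rho_hat)
```

Because the bootstrap probabilities vary from replicate to replicate, the resulting interval reflects both the uncertainty in the surveyed probabilities and the Bernoulli variation of the outcomes, which is why such an interval would typically be wider than one that holds the probabilities fixed.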