Optimizing the maximum reported cluster size for normal-based spatial scan statistics

Haerin Yoo^a and Inkyung Jung^1,a

^a Division of Biostatistics, Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Korea
Correspondence to: ^1 Division of Biostatistics, Department of Biomedical Systems Informatics, Yonsei University College of Medicine, 50-1 Yonsei-ro, Seodaemun-Gu, Seoul 03722, Korea. E-mail: ijung@yuhs.ac
Received February 28, 2018; Revised May 15, 2018; Accepted June 12, 2018.
Abstract

The spatial scan statistic is a widely used method to detect spatial clusters. The method imposes a large number of scanning windows with pre-defined shapes and varying sizes on the entire study region. The likelihood ratio test statistic comparing inside versus outside each window is then calculated and the window with the maximum value of test statistic becomes the most likely cluster. The results of cluster detection respond sensitively to the shape and the maximum size of scanning windows. The shape of scanning window has been extensively studied; however, there has been relatively little attention on the maximum scanning window size (MSWS) or maximum reported cluster size (MRCS). The Gini coefficient has recently been proposed by Han et al. (International Journal of Health Geographics, 15, 27, 2016) as a powerful tool to determine the optimal value of MRCS for the Poisson-based spatial scan statistic. In this paper, we apply the Gini coefficient to normal-based spatial scan statistics. Through a simulation study, we evaluate the performance of the proposed method. We illustrate the method using a real data example of female colorectal cancer incidence rates in South Korea for the year 2009.

Keywords : spatial cluster detection, Gini coefficient, maximum scanning window size, weighted normal model
1. Introduction

Spatial scan statistics are widely used for cluster detection in fields such as disease surveillance and spatial epidemiology. The method identifies statistically significant spatial clusters with higher or lower rates than other regions. It has been developed for various types of data such as Poisson (Kulldorff, 1997), ordinal (Jung et al., 2007), survival (Huang et al., 2007), normal (Kulldorff et al., 2009; Huang et al., 2009) and multinomial data (Jung et al., 2010). The method is implemented in the freely available software SaTScan™ (Kulldorff and Information Management Services, 2018).

Spatial cluster detection using spatial scan statistics is based on the likelihood ratio test. A very large number of candidate areas (scanning windows) with a pre-defined shape and varying sizes are placed over the study region in order to explore it completely. The likelihood ratio test statistic is calculated for each scanning window to compare inside versus outside the window; the window that maximizes the test statistic is then identified as the most likely cluster. Consequently, the results of cluster detection are sensitive to the shape and the maximum size of the scanning windows. Several previous studies focused on the shape of scanning windows such as circular (Kulldorff, 1997), elliptic (Kulldorff et al., 2006), or irregular shapes (Patil and Taillie, 2004; Duczmal and Assunção, 2004; Tango and Takahashi, 2005). However, the maximum scanning window size (MSWS) and the maximum reported cluster size (MRCS) have received much less attention.

The MSWS is generally defined as less than 50% of the total population at risk or of the number of geographical study areas. The SaTScan™ software also uses 50% as the default setting. However, a higher value of MSWS can lead to the detection of spatial clusters larger than the true clusters, including less informative surrounding areas. Conversely, when a smaller value of MSWS is used, larger clusters are not considered in the analysis and cannot be found. Ribeiro and Costa (2012) found that the performance of the spatial scan statistic is sensitive to the value of MSWS. Therefore, they suggested selecting the optimal value of the maximum cluster size rather than using the common 50%. However, Han et al. (2016) emphasized that spatial cluster detection analyses should never be run repeatedly with different values of MSWS, selecting the result with the lowest p-value, because this causes a multiple testing problem. Alternatively, we can run the analysis with a fixed larger value of MSWS (e.g., 50%) and with different values of MRCS (e.g., 5, 10, 15, 20, and 50%) to find an optimal cluster reporting size. We can then report clusters smaller than the MRCS while the inference remains adjusted for multiple testing through the fixed larger MSWS. Han et al. (2016) proposed an effective criterion, the Gini coefficient, to determine the optimal MRCS. The Gini coefficient represents the degree of heterogeneity of the disease clusters. Their simulation study indicates that the Gini coefficient can identify the best collection of non-overlapping clusters to report and that those clusters tend to be similar in size to the true clusters. Han et al. (2016) developed this method only for the Poisson model. Recently, Kim and Jung (2017) employed the Gini coefficient for the ordinal model and showed that it can also be used successfully to optimize the MRCS in the spatial scan statistic for ordinal data.
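The reporting step described above (one run at a fixed MSWS, then filtering by MRCS) can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` record, the example values, and the greedy rule of keeping the highest-ranked non-overlapping windows are hypothetical simplifications and do not reproduce SaTScan's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    districts: frozenset   # district IDs inside the window
    size: float            # share of the population (or weights) inside, in [0, 1]
    llr: float             # log likelihood ratio from the single MSWS = 50% run

def report_clusters(candidates, mrcs):
    """Keep non-overlapping candidates no larger than `mrcs`, best LLR first."""
    reported, used = [], set()
    for cand in sorted(candidates, key=lambda c: c.llr, reverse=True):
        if cand.size <= mrcs and used.isdisjoint(cand.districts):
            reported.append(cand)
            used |= cand.districts
    return reported

# Hypothetical candidate windows from one scan with MSWS = 50%
candidates = [
    Candidate(frozenset({1, 2, 3, 4, 5, 6}), 0.40, 12.0),
    Candidate(frozenset({1, 2}), 0.10, 9.0),
    Candidate(frozenset({5, 6}), 0.12, 7.5),
    Candidate(frozenset({8, 9}), 0.08, 3.0),
]

# The scan is not rerun; only the reporting threshold changes
by_mrcs = {m: report_clusters(candidates, m) for m in (0.10, 0.20, 0.50)}
```

Because every collection in `by_mrcs` is drawn from the same run, the p-values attached to the windows are those of the single analysis at the fixed MSWS.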

This paper applies the Gini coefficient to normal-based spatial scan statistics and evaluates it through simulations. The normal-based spatial scan statistic can be used for continuous data at the individual level (Kulldorff et al., 2009) or at some aggregated level with a heterogeneous population (Huang et al., 2009). The idea of using the Gini coefficient is similar to the work by Han et al. (2016); however, the method should be clearly defined for the specific model and fully evaluated before real use. In Section 2, we define the criterion for the weighted normal spatial scan statistic to optimize the MRCS. The standard normal model is a special case of the weighted normal model; therefore, it is straightforward to reduce the method to the standard normal model. In Section 3, we evaluate the applicability of the Gini coefficient through simulations under various scenarios. We illustrate the application of the proposed method to a real data example in Section 4. We then discuss our results and conclude in Section 5.

2. Methods

### 2.1. Gini coefficient for Poisson-based spatial scan statistic

The Poisson-based spatial scan statistic compares cases against the underlying population at risk. The null hypothesis is written as H0 : p = q for all z ∈ Z and the alternative hypothesis is Ha : p > q for some z ∈ Z, where p and q are the intensities of the outcome variable inside and outside scanning window z, and Z denotes the collection of all scanning windows. For a given scanning window z, the likelihood ratio test statistic LR(z) is expressed as

$LR(z)=\frac{\left(\frac{c_z}{n_z}\right)^{c_z}\left(\frac{C-c_z}{N-n_z}\right)^{C-c_z}}{\left(\frac{C}{N}\right)^{C}}$

if $c_z/n_z > (C-c_z)/(N-n_z)$, and LR(z) = 1 otherwise. Here, $c_z$ and $n_z$ are the number of cases and the population within z, and C and N are the total number of cases and the total population in the whole study region, respectively. The area with the maximum value of LR(z) over z ∈ Z is the most likely cluster. Monte Carlo hypothesis testing (Dwass, 1957) is often used to obtain a p-value for the most likely cluster.
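As a rough sketch of the two computations just described, the code below evaluates the logarithm of LR(z) for one window and a rank-based Monte Carlo p-value. The function names and the example numbers are illustrative assumptions; working on the log scale is a numerical convenience rather than something prescribed by the text.

```python
import math

def poisson_llr(c_z, n_z, C, N):
    """Log of LR(z) for a high-rate window; 0.0 (LR = 1) when the inside rate is not higher."""
    if n_z <= 0 or n_z >= N:
        return 0.0
    if c_z / n_z <= (C - c_z) / (N - n_z):
        return 0.0

    def xlogy(x, y):
        # x * log(y) with the convention 0 * log(0) = 0
        return 0.0 if x == 0 else x * math.log(y)

    return (xlogy(c_z, c_z / n_z)
            + xlogy(C - c_z, (C - c_z) / (N - n_z))
            - xlogy(C, C / N))

def monte_carlo_p_value(observed_max, simulated_maxima):
    """Rank of the observed maximum among maxima from data sets simulated under H0."""
    exceed = sum(1 for s in simulated_maxima if s >= observed_max)
    return (exceed + 1) / (len(simulated_maxima) + 1)

# Hypothetical window: 30 of 100 cases in 1,000 of 10,000 people
llr = poisson_llr(30, 1000, 100, 10000)
# Hypothetical null maxima, far fewer than the 999 or more replicates used in practice
p = monte_carlo_p_value(5.0, [1.0, 2.0, 3.0, 6.0])
```

In a full analysis, `observed_max` would be the maximum of `poisson_llr` over all windows in Z, and each simulated maximum would be obtained the same way from a data set generated under the null hypothesis.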

There can be significant secondary clusters with high values of the likelihood ratio test statistic, and we often report secondary clusters that do not overlap with a more likely cluster. However, this collection of clusters may not be the best one to report. Smaller but more meaningful clusters (i.e., clusters with a higher relative risk) may be hidden within the most likely cluster or may overlap with it. Han et al. (2016) proposed a criterion, the Gini coefficient, to find a suitable and informative collection of non-overlapping clusters to report.

The Gini coefficient was originally developed to represent the degree of heterogeneity of wealth distribution in economics (Gastwirth, 1972). It is a summary measure from a Lorenz curve (Lorenz, 1905) and a larger value indicates a more heterogeneous distribution. Han et al. (2016) applied the concept of the Gini coefficient to evaluate the degree of heterogeneity of a collection of clusters for count data. The x-axis of the Lorenz curve can be defined as the cumulative percentage of the number of disease cases (e.g., number of deaths or events of disease) and the y-axis as the cumulative percentage of the population. Figure 1 describes the corresponding Lorenz curve when there is only one significant cluster. The reference line OC indicates that the number of cases is proportional to the population for each region. The Gini coefficient is twice the area between the reference line and the Lorenz curve. As more cases are concentrated in the cluster, point A moves further away from the reference line and the value of the Gini coefficient increases. When there are I clusters, the coordinates of each cluster $(x_i, y_i)$ (i = 1, …, I) are defined using the cumulative cases and population in the order of the relative risk of the clusters. More specifically, $x_i=(1/C)\sum_{k=1}^{i}c_k$ and $y_i=(1/N)\sum_{k=1}^{i}n_k$, where $c_k$ and $n_k$ are the number of cases and the population in the kth cluster, respectively. The Gini coefficient can be calculated as $G=\sum_{i=1}^{I+1}(y_i x_{i-1}-y_{i-1}x_i)$, where $x_0 = y_0 = 0$ and $x_{I+1} = y_{I+1} = 1$ (Han et al., 2016). The value of the Gini coefficient is between 0 and 1. Among several competing collections of clusters, the one with the highest Gini coefficient is the best collection to report. Refer to the study by Han et al. (2016) for detailed information.
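The formula for G can be checked with a short sketch. The function name and the example counts are hypothetical, and the clusters are assumed to be passed already sorted in the order described above.

```python
def gini_from_clusters(clusters, C, N):
    """clusters: list of (cases, population) pairs, already ordered; returns G."""
    xs, ys = [0.0], [0.0]          # x_0 = y_0 = 0
    cum_c = cum_n = 0
    for c_k, n_k in clusters:
        cum_c += c_k
        cum_n += n_k
        xs.append(cum_c / C)       # cumulative share of cases
        ys.append(cum_n / N)       # cumulative share of population
    xs.append(1.0)                 # x_{I+1} = y_{I+1} = 1
    ys.append(1.0)
    return sum(ys[i] * xs[i - 1] - ys[i - 1] * xs[i] for i in range(1, len(xs)))

# One cluster holding 30% of the cases in 10% of the population
g_one = gini_from_clusters([(30, 1000)], C=100, N=10000)
# Two clusters, the second with a lower rate than the first
g_two = gini_from_clusters([(30, 1000), (20, 1000)], C=100, N=10000)
```

With a single cluster the sum reduces to $x_1 - y_1$, which is twice the area of the triangle OAC in Figure 1.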

### 2.2. Spatial scan statistics for continuous data

There are two types of spatial scan statistics based on the normal probability model. One detects spatial clusters of individuals or locations with high or low values of some continuous attribute (Kulldorff et al., 2009) and the other detects clusters of geographic units with continuous regional measures (Huang et al., 2009). The former is concerned with individual-level data, while the latter is used for continuous data at some aggregate level with a heterogeneous population, such as mortality rates, incidence rates and average survival at the district level. The two models were proposed separately in two different articles, but here we briefly review the weighted normal model for aggregate-level data because the standard normal model is a special case of the weighted normal model with homogeneous weights. The null and alternative hypotheses are written as $H_0 : \mu_z = \mu_{z^C} = \mu_G$ and $H_a : \mu_z \neq \mu_{z^C}$ for some z, where $\mu_z$, $\mu_{z^C}$, and $\mu_G$ are the means of the regional summary measurements $w_j$ inside scanning window z, outside it, and over the whole study area G. Maximizing the likelihood ratio test statistic given z is equivalent to maximizing

$-\sum_{j\in G}\delta_j w_j^2+\frac{\left(\sum_{j\in z}\delta_j w_j\right)^2}{\sum_{j\in z}\delta_j}+\frac{\left(\sum_{j\in z^C}\delta_j w_j\right)^2}{\sum_{j\in z^C}\delta_j} \qquad (2.1)$

assuming $w_j \mid \delta_j \sim N(\mu_z, \sigma_G^2/\delta_j)$ when j ∈ z and $w_j \mid \delta_j \sim N(\mu_{z^C}, \sigma_G^2/\delta_j)$ when j ∈ $z^C$ (= G − z), where $\delta_j$ is the weight associated with $w_j$ and $\sigma_G^2$ is the variance of the $w_j$ in the whole study region G. $\sigma_G^2/\delta_j$ is the variance of $w_j$ after adjusting for the local weight $\delta_j$. The weight $\delta_j$ is assumed to be a known measure proportional to the inverse of the uncertainty in each j (j ∈ G). For example, in the case of mortality rate data at the district level, $\delta_j$ can be the inverse of the associated variance for each district, or the population size of each district can be used as a substitute. The standard normal model is obtained by setting $\delta_j = 1$ for all j ∈ G. One can refer to the article by Huang et al. (2009) for the derivation of the test statistic (2.1) and more detailed information on the weighted normal model.
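To make expression (2.1) concrete, the sketch below evaluates it for a single window. The function name, the index-based representation of districts, and the toy values are assumptions made only for illustration.

```python
def weighted_normal_stat(w, delta, inside):
    """Value of expression (2.1) for the window whose district indices are in `inside`."""
    inside = set(inside)
    sw_in = sd_in = sw_out = sd_out = total_sq = 0.0
    for j, (w_j, d_j) in enumerate(zip(w, delta)):
        total_sq += d_j * w_j ** 2          # sum over G of delta_j * w_j^2
        if j in inside:
            sw_in += d_j * w_j
            sd_in += d_j
        else:
            sw_out += d_j * w_j
            sd_out += d_j
    if sd_in == 0 or sd_out == 0:
        raise ValueError("window must contain some, but not all, of the weight")
    return -total_sq + sw_in ** 2 / sd_in + sw_out ** 2 / sd_out

# Toy data: districts 0 and 1 have clearly higher values than the rest
w = [3.0, 3.2, 0.1, -0.2, 0.0]
delta = [1.0, 2.0, 1.0, 1.0, 2.0]
stat_good = weighted_normal_stat(w, delta, {0, 1})
stat_poor = weighted_normal_stat(w, delta, {2, 3})
```

The expression equals minus the weighted residual sum of squares around the inside and outside weighted means, so it is never positive and is largest for the window that best separates the two groups. Passing `delta = [1.0] * len(w)` gives the standard normal model.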

The inference procedure for the normal model is the same as that for the Poisson model. However, there is no criterion for selecting the optimal collection of clusters to report. In the next section, we propose a Gini coefficient specifically designed for the normal model, and we evaluate its performance through a simulation study in the subsequent section.

### 2.3. Optimizing the maximum reported cluster size for normal-based spatial scan statistics

Assuming that there is only one significant cluster z* (Figure 1), we define the x- and y-coordinates of point A for the cluster z* on the Lorenz curve in the weighted normal model as

$x=\frac{\sum_{j\in z^*}\delta_j w_j}{\sum_{j\in G}\delta_j w_j}=\frac{\hat{\mu}_{z^*}\sum_{j\in z^*}\delta_j}{\sum_{j\in G}\delta_j w_j} \qquad (2.2)$

and

$y=\frac{\sum_{j\in z^*}\delta_j}{\sum_{j\in G}\delta_j}, \qquad (2.3)$

where $\hat{\mu}_{z^*} = \sum_{j\in z^*}\delta_j w_j / \sum_{j\in z^*}\delta_j$. Here, the x-coordinate (2.2) represents the ratio of the weighted sum of the regional measures inside the cluster to the total weighted sum, and the y-coordinate (2.3) represents the proportion of the weights (or population) inside the cluster relative to the whole study region. The Gini coefficient is twice the area between the reference line and the Lorenz curve. As the weighted sum of the regional measures inside the cluster increases, point A moves further away from the reference line and the value of the Gini coefficient increases. When there are I clusters, we define the x- and y-coordinates for each cluster $(x_i, y_i)$ (i = 1, …, I) using the cumulative sums in the numerators of (2.2) and (2.3) in the order of the statistical significance of the clusters. That is, $x_i=\sum_{k=1}^{i}\sum_{j\in z_{(k)}}\delta_j w_j / \sum_{j\in G}\delta_j w_j$ and $y_i=\sum_{k=1}^{i}\sum_{j\in z_{(k)}}\delta_j / \sum_{j\in G}\delta_j$, where $z_{(1)}, \ldots, z_{(I)}$ are the detected clusters ordered by their statistical significance. The Gini coefficient can then be calculated in the same way as for the Poisson model, $G=\sum_{i=1}^{I+1}(y_i x_{i-1}-y_{i-1}x_i)$, where $x_0 = y_0 = 0$ and $x_{I+1} = y_{I+1} = 1$. Among several competing cluster collections, we select the one with the highest value of the Gini coefficient as the optimal one to report. The Gini coefficient for the standard normal model is obtained by setting $\delta_j = 1$ without any modification.
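The definitions in this section can be sketched in a few lines. The function names, the toy incidence-like values, and the dictionary of cluster collections keyed by MRCS are hypothetical; the sketch also assumes that the total weighted sum of the measures is positive, as it is for rates.

```python
def weighted_normal_gini(w, delta, clusters):
    """clusters: list of index collections z(1), ..., z(I), ordered by significance."""
    total_dw = sum(d * x for d, x in zip(delta, w))   # assumed to be positive
    total_d = sum(delta)
    xs, ys = [0.0], [0.0]
    cum_dw = cum_d = 0.0
    for z in clusters:
        cum_dw += sum(delta[j] * w[j] for j in z)     # cumulative numerator of (2.2)
        cum_d += sum(delta[j] for j in z)             # cumulative numerator of (2.3)
        xs.append(cum_dw / total_dw)
        ys.append(cum_d / total_d)
    xs.append(1.0)
    ys.append(1.0)
    return sum(ys[i] * xs[i - 1] - ys[i - 1] * xs[i] for i in range(1, len(xs)))

def best_mrcs(w, delta, collections):
    """collections: dict mapping each MRCS to its reported clusters; returns the MRCS with the largest G."""
    return max(collections, key=lambda m: weighted_normal_gini(w, delta, collections[m]))

# Toy rates for six districts with equal weights
w = [30.0, 31.0, 27.0, 27.0, 27.0, 28.0]
delta = [1.0] * 6
collections = {0.35: [[0, 1]], 0.50: [[0, 1, 5, 2]]}   # hypothetical reported clusters
g_small = weighted_normal_gini(w, delta, collections[0.35])
chosen = best_mrcs(w, delta, collections)
```

In this toy example the smaller collection wins because the larger one absorbs districts whose values are close to the overall mean, which pulls point A back toward the reference line.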

3. Simulation study

To evaluate the performance of the Gini coefficient in the weighted normal model, we conducted a simulation study under various scenarios. We used the area of Seoul and Gyeonggi province in South Korea as the whole study region, consisting of 69 districts at the “Si-gun-gu” (district) level. We assumed 8 different true cluster models (Figure 2). Four models (A–D) are circular-shaped clusters and include 6, 13, 20, and 34 districts in the true clusters, respectively. The number of districts in the true clusters in A, B, C, and D accounts for approximately 10%, 20%, 30%, and 50% of the whole study region, respectively. The other four models (E–H) are elliptic-shaped clusters, which also account for 10%, 20%, 30%, and 50% of the study region, respectively. For each true cluster model A–H, we assumed three different values for the mean inside the true cluster, $\mu_z$ = 1, 2, and 3. For the districts outside the true cluster, we assumed $\mu_{z^C} = 0$. For all settings, we used the same variance for the whole region, $\sigma_G^2=1$, and the true population of each district in 2010 for $\delta_j$.
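One simulated data set under this design could be generated along the following lines. The function name and the small illustrative weights are assumptions; in particular, the sketch does not reproduce the actual district populations or any scaling that may have been applied to them.

```python
import math
import random

def simulate_measures(delta, true_cluster, mu_in, mu_out=0.0, sigma2=1.0, rng=None):
    """Draw w_j ~ N(mu, sigma2 / delta_j), with mu = mu_in inside the true cluster."""
    rng = rng or random.Random()
    true_cluster = set(true_cluster)
    return [
        rng.gauss(mu_in if j in true_cluster else mu_out, math.sqrt(sigma2 / d_j))
        for j, d_j in enumerate(delta)
    ]

# Five hypothetical districts; districts 0 and 1 form the true cluster
delta = [2.0, 1.0, 4.0, 1.0, 2.0]
w = simulate_measures(delta, {0, 1}, mu_in=2.0, rng=random.Random(1))
```

Repeating the call 1,000 times with independent random streams would give the replicates for one of the 24 settings.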

First, we generated 1,000 random data sets under each of the 24 settings defined by the 8 cluster models and 3 mean values. Then, we analyzed each data set using the weighted normal spatial scan statistic in the SaTScan software, searching for clusters with high values. We used both circular and elliptic scanning windows. We fixed the MSWS at 50% and used 15 different values (3–6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 45, and 50%) for the MRCS. For the detection result at each value of MRCS, we calculated the Gini coefficient and reported how often each MRCS was chosen as the optimal maximum reporting size among the 1,000 data sets. We also evaluated the accuracy of the detected clusters using sensitivity and positive predictive value (PPV), defined as the proportion of districts in the true cluster that were correctly detected and the proportion of districts in the detected cluster that belong to the true cluster, respectively. Higher values of these two measures reflect more accurate detection. A smaller value of sensitivity means that the detected cluster missed more districts of the true cluster. A smaller value of PPV indicates that the detected cluster includes more districts outside the true cluster. The sensitivity and PPV were estimated as averages over the data sets, out of 1,000, in which the null hypothesis was rejected. We compared the accuracy of the detected clusters based on the Gini coefficient with that obtained under the default setting of 50% MSWS and 50% MRCS.
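The two accuracy measures can be written directly from their definitions. The function name and the district labels below are illustrative only.

```python
def sensitivity_ppv(true_cluster, detected_cluster):
    """Return (sensitivity, PPV) for one detected cluster against the true cluster."""
    true_set, detected = set(true_cluster), set(detected_cluster)
    correct = len(true_set & detected)
    sensitivity = correct / len(true_set)
    # PPV is undefined when nothing is detected
    ppv = correct / len(detected) if detected else float("nan")
    return sensitivity, ppv

true_cluster = {1, 2, 3, 4, 5, 6}              # a true cluster of 6 districts
too_small = sensitivity_ppv(true_cluster, {1, 2})
too_large = sensitivity_ppv(true_cluster, {1, 2, 3, 4, 5, 6, 7, 8})
```

A detected cluster that is too small has perfect PPV but low sensitivity, while one that is too large has perfect sensitivity but reduced PPV, which is why both measures are reported together.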

Tables 1–4 show the results for cluster models A–D. When the true cluster was circular-shaped and $\mu_z$ = 2 or 3, the MRCS most often selected by the Gini coefficient equaled the size of the true cluster, with either circular or elliptic windows. Compared with the default setting, both the sensitivity and the PPV of the detected clusters were higher at the most often selected MRCS. When $\mu_z$ = 1, the most often chosen MRCS did not exactly agree with the true cluster size. Still, the PPV was slightly higher than under the default setting, which suggests that the default setting tended to detect clusters larger than the true clusters. However, we do not consider these results meaningful because the statistical power in that case is very low, especially for cluster model A.

Tables 5–8 provide the results for the elliptic-shaped cluster models E–H. As expected, the elliptic spatial scan statistic with the Gini coefficient most often chose an MRCS similar to the true cluster size, with higher accuracy than the default setting. Even when the most often selected MRCS did not exactly agree with the true cluster size, the accuracy of the detected clusters was still generally higher than under the default setting. The Gini coefficient with circular windows did not work very well when the true clusters were elliptic-shaped. The most frequently selected MRCS was usually larger than the true cluster size, in which case the PPV was lower than under the default setting despite the higher sensitivity. The frequency with which each MRCS was chosen as optimal was also spread over all sizes considered, whereas it was usually concentrated around the size of the true cluster when using elliptic windows.

4. Application to real data

We applied the proposed method to a real data set of female colorectal cancer incidence rates in South Korea for the year 2009. The age-standardized incidence rates per 100,000 at the “Si-gun-gu” (district) level were obtained from the National Cancer Center. The incidence data at the individual level were not provided due to confidentiality concerns. The age-standardized incidence rates are continuous summary measures with varying regional uncertainty, so the weighted normal spatial scan statistic is appropriate for searching for clusters of districts with unusually high incidence rates. We used the 2010 population of each district as its weight. Among the 251 districts, we excluded five (four districts of Jeju Island and one of Ulleung Island) because these islands are far from the mainland.

To select the best MRCS based on the proposed method, we used 17 different values (1–6, 8, 10, 12, 15, 20, 25, 30, 35, 40, 45, and 50%) of MRCS and chose the MRCS with the highest value of the Gini coefficient as the optimal one. The clusters at the optimal MRCS were compared with the results of the default setting (50%). We used both circular and elliptic windows.

Figure 3 shows the detected clusters at the optimal MRCS and at the default setting when using circular and elliptic windows. With the elliptic window, the Gini coefficient selected 50% as the optimal MRCS. The most likely cluster was large, extending from the northwest areas including Seoul to the southeast areas. With the circular windows, however, 20% was selected as the best MRCS and two significant clusters were found, while only one cluster was detected at the default setting. Table 9 provides information on the detected clusters. The most likely cluster at the default setting included 105 districts, while the most likely and secondary clusters at the optimal MRCS based on the Gini coefficient included 27 and 46 districts, respectively. Similar to the simulation study, the default setting seemed to detect a larger cluster by absorbing neighboring areas with irrelevant risk. The weighted means of the incidence rates for the two clusters at the optimal MRCS were 29.89 and 28.63, respectively, both higher than the 28.29 for the most likely cluster at the default setting. The weighted mean of the incidence rates over the entire study region was 27.70. The clusters found at the optimal MRCS therefore appear more meaningful.

5. Discussion and conclusion

In this paper, we defined the Gini coefficient to evaluate the degree of heterogeneity of clusters for continuous data. Through the simulation study and the real data example, we showed that the proposed method can be useful for optimizing the MRCS for normal-based spatial scan statistics. In the simulation study, the Gini coefficient most often selected a value of MRCS similar to the true cluster size. The accuracy of the detected clusters at the most frequently chosen MRCS was also generally higher than that under the default setting. Regarding the scanning window shape, elliptic windows seemed to work well regardless of the shape of the true clusters. The real data example shows that it is possible to obtain a more meaningful and informative collection of clusters when using the Gini coefficient than when using the default setting.

Users of SaTScan™ are often tempted to rerun the analyses using different values of MSWS when they do not like the results at the default setting. However, we emphasize that this approach should not be used because it causes a multiple testing problem. It is acceptable to filter clusters at different values of MRCS to choose the clusters to report, and the Gini coefficient is a very useful criterion in this process. The Gini coefficient has been implemented in SaTScan™ for the Poisson model only. We think that it will be valuable to add the same option to the other models of spatial scan statistics, such as the normal model, to help researchers find a more refined collection of clusters to report.

Figures
Fig. 1. Illustration of Lorenz curve and Gini coefficient.
Fig. 2. True cluster models used in the simulation study.
Fig. 3. Detected clusters with high values of female colorectal cancer incidence rates at the optimal MRCS based on the Gini coefficient and at the default setting. MRCS = maximum reported cluster size.
TABLES

### Table 1

Simulation results for cluster model A (circular cluster, 10% of the whole study region)

μzMaximum reported cluster sizeDefault

3%5%6%8%10%12%15%20%25%30%35%40%45%50%
Circular window1Frequency219363228269111488543
Sensitivity0.3330.4560.5230.7400.8390.7690.8150.8180.8690.9380.9171.0001.0000.8890.733
PPV1.0000.9120.7850.8880.8390.6390.5140.4090.3460.2940.2440.2350.1970.1650.667

2Frequency10313913937812722322582722
Sensitivity0.3330.4780.6320.8210.9820.9510.9020.9320.9601.0001.0000.9521.0001.0000.903
PPV1.0000.9570.9490.9860.9820.7860.5570.4600.3850.3250.2610.2200.2070.1820.883

3Frequency2620126747716473-1--
Sensitivity0.3330.5000.6750.8330.9980.9860.8831.0001.0001.000-1.000--0.964
PPV1.0001.0000.9861.0000.9980.8230.5280.4830.4170.316-0.231--0.974

Elliptic window1Frequency121622131829211211910334
Sensitivity0.3190.3750.4850.6410.8800.8220.7780.8750.8940.8890.8330.8330.9171.0000.716
PPV0.9580.7500.7270.7690.8800.6640.5020.4370.3480.2830.2210.1920.1920.1820.603

2Frequency16295987248213594222112651
Sensitivity0.3230.4940.6530.8030.9680.9650.9800.9720.9620.9851.0001.0001.0001.0000.898
PPV0.9690.9890.9790.9630.9680.7830.6330.4970.3850.3130.2670.2310.2030.1760.833

3Frequency5433108610176351262----
Sensitivity0.3330.5000.6670.8330.9960.9930.9861.0001.0001.000----0.961
PPV1.0001.0001.0001.0000.9960.8270.6340.5200.3810.308----0.943

PPV = positive predictive value

### Table 2

Simulation results for cluster model B (circular cluster, 20% of the whole study region)

μzMaximum reported cluster sizeDefault

3%5%6%8%10%12%15%20%25%30%35%40%45%50%
Circular window1Frequency109916102648175893926152010
Sensitivity0.1540.2310.2910.3610.4080.5380.6460.8940.9180.8900.8930.9330.9500.9310.795
PPV1.0001.0000.9440.9380.8830.9270.8640.9250.8030.6180.5190.4630.4160.3670.809

2Frequency--1-11132756155189442
Sensitivity--0.308-0.4620.5800.7020.9850.9890.9530.9660.9620.9420.9230.969
PPV--1.000-1.0000.9760.9470.9970.8810.6640.5660.4860.4190.3530.961

3Frequency------2912824----
Sensitivity------0.6920.9940.9990.981----0.994
PPV------1.0001.0000.9090.654----0.991

Elliptic window1Frequency781310275357741213933191212
Sensitivity0.1430.2310.2780.3460.4270.5150.6570.8240.8880.9310.9250.9030.9360.9550.745
PPV0.9291.0000.9040.9000.9260.8950.8950.8860.7710.6410.5480.4530.4180.3760.779

2Frequency--14224635762603120432
Sensitivity--0.3080.3850.4620.5580.7160.9650.9720.9550.9810.9810.9231.0000.937
PPV--1.0001.0001.0000.9830.9770.9860.8700.6510.5860.5000.3960.3940.931

3Frequency-----11183215321---
Sensitivity-----0.6150.7480.9910.9891.0001.000---0.988
PPV-----1.0001.0000.9990.8900.6860.619---0.981

PPV = positive predictive value

### Table 3

Simulation results for cluster model C (circular cluster, 30% of the whole study region)

μzMaximum reported cluster sizeDefault

3%5%6%8%10%12%15%20%25%30%35%40%45%50%
Circular window1Frequency4612107212638162123123414430
Sensitivity0.1000.1500.1920.2400.3000.3710.4580.5620.7580.8910.9160.9400.9410.9550.777
PPV1.0001.0000.9580.9601.0000.9860.9630.9120.9540.9390.8240.7240.6360.5730.873

2Frequency-----14816359020117115
Sensitivity-----0.3500.4750.5940.8050.9780.9890.9650.9680.9700.946
PPV-----1.0000.9500.9890.9900.9970.9080.7560.6700.5920.968

3Frequency--------4286989---
Sensitivity--------0.8290.9941.000---0.988
PPV--------1.0001.0000.938---0.994

Elliptic window1Frequency4778627286312092112734626
Sensitivity0.0750.1500.2000.2380.2830.3630.4550.5790.7340.8370.9070.9530.9420.9670.762
PPV0.7501.0001.0000.9500.9440.9740.9520.9500.9300.8790.8060.7410.6430.5850.848

2Frequency---11222015052522760112
Sensitivity---0.2000.3000.3750.4500.6030.7890.9600.9790.9870.9821.0000.930
PPV---0.8001.0001.0001.0000.9880.9880.9850.8890.7720.6880.6060.948

3Frequency--------28871992--
Sensitivity--------0.8230.9900.9951.000--0.985
PPV--------1.0000.9980.9280.770--0.991

PPV = positive predictive value

### Table 4

Simulation results for cluster model D (circular cluster, 50% of the whole study region)

μzMaximum reported cluster sizeDefault

3%5%6%8%10%12%15%20%25%30%35%40%45%50%
Circular window1Frequency54644692128266593145317
Sensitivity0.0590.0810.1180.1470.1760.2010.2880.3560.4460.5280.6430.7400.8310.9370.777
PPV1.0000.9171.0001.0001.0000.8870.9890.9960.9670.9390.9600.9650.9500.9530.957

2Frequency--------131031109846
Sensitivity--------0.4410.5290.6710.7590.8790.9840.962
PPV--------1.0001.0000.9920.9960.9890.9930.992

3Frequency-----------224974
Sensitivity-----------0.7940.8970.9960.995
PPV-----------1.0000.9970.9990.999

Elliptic window1Frequency437575141831317489203224
Sensitivity0.0590.0780.1180.1410.1720.2060.2730.3420.4490.5350.6410.7280.8210.9130.744
PPV1.0000.8891.0000.9600.9760.9250.9690.9760.9710.9490.9660.9430.9380.9320.944

2Frequency--------111147208732
Sensitivity--------0.5000.5590.6740.7610.8640.9740.947
PPV--------1.0001.0000.9880.9850.9780.9860.985

3Frequency-----------145954
Sensitivity-----------0.7650.8930.9940.991
PPV-----------0.9630.9990.9980.998

PPV = positive predictive value

### Table 5

Simulation results for cluster model E (elliptic cluster, 10% of the whole study region)

μzMaximum reported cluster sizeDefault

3%5%6%8%10%12%15%20%25%30%35%40%45%50%
Circular window1Frequency61323211335181912137353
Sensitivity0.1670.4230.6090.6830.7180.8760.9170.8251.0000.8330.9291.0000.9670.8890.769
PPV0.5000.8460.9130.8190.7180.7240.6110.4090.3820.2660.2450.2280.2010.1620.616

2Frequency10411031314231885273989222
Sensitivity0.3000.5120.6550.8030.8100.9890.9730.9320.9871.0000.9631.0000.9171.0000.872
PPV0.9000.9760.9830.9630.8100.8330.6360.4780.3780.3120.2590.2310.1810.1790.809

3Frequency2109631471357448181233---
Sensitivity0.2500.9480.6670.8280.8211.0000.9900.9911.0001.0001.000---0.938
PPV0.7501.0001.0000.9930.8210.8470.6440.5050.3920.3330.281---0.859

Elliptic window1Frequency151214121732151714176449
Sensitivity0.2440.3190.4880.7360.7250.9270.9000.7940.8690.9410.9170.9170.9580.9070.753
PPV0.7330.6390.7320.8830.7250.7560.5810.4040.3440.3000.2470.2140.2040.1620.565

2Frequency102445113228258504320611142
Sensitivity0.3170.5000.6150.8040.9580.9810.9800.9610.9670.9440.9551.0001.0001.0000.907
PPV0.9501.0000.9220.9650.9580.8150.6370.4850.3820.3060.2620.2310.1980.1760.834

3Frequency-181282585253318221-1-
Sensitivity-0.4810.6670.8290.9960.9970.9841.0001.0001.0001.000-1.000-0.969
PPV-0.9631.0000.9950.9960.8310.6310.5130.3760.3250.286-0.214-0.934

PPV = positive predictive value.

### Table 6

Simulation results for cluster model F (elliptic cluster, 20% of the whole study region)

μzMaximum reported cluster sizeDefault

3%5%6%8%10%12%15%20%25%30%35%40%45%50%
Circular window1Frequency7811134181527221516141010
Sensitivity0.0880.1440.2170.3080.3270.4270.4670.5160.6750.7180.8370.9180.8310.8850.564
PPV0.5710.6250.7050.8000.7080.7280.6420.5690.5640.4860.4950.4620.3660.3450.575

2Frequency73044533010133604330541052114
Sensitivity0.1430.2210.2740.3600.4280.5060.5060.6170.7350.7920.8770.9850.9740.9340.634
PPV0.9290.9560.8920.9250.9280.7920.6940.6750.6270.5290.5250.5000.4280.3670.689

3Frequency3355885711444696431289227179
Sensitivity0.1540.2180.2920.3920.4890.5620.5470.6520.7620.7880.9050.9850.9950.9910.671
PPV1.0000.9430.9110.9750.9960.8140.7340.7110.6700.5310.5500.5050.4410.3930.707

Elliptic window1Frequency10473526214245312115238
Sensitivity0.1000.1540.1870.2050.2920.4640.5710.6720.7760.8640.9010.9280.9060.9420.671
PPV0.6500.6670.6070.5330.6330.8210.7770.7190.6490.5930.5410.4640.4070.3710.628

2Frequency1451173589346258855810101
Sensitivity0.1540.2310.2920.3710.5160.5270.7110.8800.9570.9610.9870.9920.9770.9230.873
PPV1.0001.0000.9500.9640.9880.9380.9620.9190.8090.6740.5910.5060.4430.3530.840

3Frequency-1121216465383143832---
Sensitivity-0.2310.6150.3850.8210.5820.7360.9110.9890.9961.000---0.926
PPV-1.0001.0001.0001.0000.9920.9930.9410.8390.6940.604---0.892

PPV = positive predictive value.

### Table 7

Simulation results for cluster model G (elliptic cluster, 30% of the whole study region)

μzMaximum reported cluster sizeDefault

3%5%6%8%10%12%15%20%25%30%35%40%45%50%
Circular window1Frequency11317618294164775259375233
Sensitivity0.1000.1310.1970.2420.2890.3600.4410.5410.6160.6770.7810.7920.8580.8790.615
PPV1.0000.8720.9850.9670.9630.9450.9340.9120.8040.7120.6800.6090.5800.5280.773

2Frequency-11-918532101671052156212523
Sensitivity-0.1500.200-0.3610.4360.5070.5780.6480.7680.8300.8370.8960.8980.720
PPV-1.0001.000-0.9810.9550.9790.9710.8440.7820.7440.6460.6140.5440.803

3Frequency--1-53318212158150264301236
Sensitivity--0.200-0.5100.6450.6470.5890.6690.7940.8460.8500.9000.9080.754
PPV--1.000-1.0000.9820.9690.9900.8860.7980.7600.6540.6180.5510.816

Elliptic window1Frequency4989152328591016869473532
Sensitivity0.1000.1500.1750.2390.2800.3650.4180.5660.6820.7830.8140.8710.8900.9340.678
PPV1.0001.0000.8750.9560.9330.9540.8770.9330.8830.8300.7290.6660.6030.5680.806

2Frequency-1--161362244372164943010
Sensitivity-0.150--0.3000.3750.4810.6020.7710.8830.9200.9540.9580.9550.845
PPV-1.000--1.0000.9790.9850.9890.9750.9510.8370.7370.6600.5750.903

3Frequency-----3-11146667120503-
Sensitivity-----0.633-0.5910.7920.8960.9640.9890.983-0.891
PPV-----1.000-1.0000.9860.9780.9040.7700.702-0.956

PPV = positive predictive value.

### Table 8

Simulation results for cluster model H (elliptic cluster, 50% of the whole study region)

Columns (maximum reported cluster size, then default setting): 3% 5% 6% 8% 10% 12% 15% 20% 25% 30% 35% 40% 45% 50% Default

Circular window, μz = 1
Frequency 217556122444436982155160
Sensitivity 0.059 0.088 0.118 0.147 0.159 0.211 0.270 0.337 0.440 0.531 0.644 0.729 0.822 0.894 0.706
PPV 1.000 1.000 1.000 1.000 0.900 0.955 0.965 0.975 0.70 0.949 0.958 0.945 0.943 0.914 0.939

Circular window, μz = 2
Frequency ------11152440254673
Sensitivity - - - - - - 0.294 0.353 0.471 0.576 0.662 0.764 0.865 0.946 0.912
PPV - - - - - - 1.000 1.000 1.000 1.000 0.982 0.989 0.986 0.970 0.972

Circular window, μz = 3
Frequency -----------296902
Sensitivity - - - - - - - - - - - 0.794 0.884 0.966 0.959
PPV - - - - - - - - - - - 1.000 0.991 0.986 0.986

Elliptic window, μz = 1
Frequency -310577101638337290138183
Sensitivity - 0.088 0.109 0.147 0.160 0.197 0.274 0.344 0.433 0.526 0.628 0.722 0.787 0.841 0.685
PPV - 1.000 0.925 1.000 0.905 0.908 0.950 0.955 0.948 0.950 0.949 0.941 0.900 0.857 0.909

Elliptic window, μz = 2
Frequency --------4943106186648
Sensitivity - - - - - - - - 0.434 0.556 0.650 0.751 0.805 0.867 0.830
PPV - - - - - - - - 0.926 0.994 0.969 0.977 0.928 0.875 0.901

Elliptic window, μz = 3
Frequency ---------1428132835
Sensitivity - - - - - - - - - 0.588 0.647 0.761 0.811 0.873 0.860
PPV - - - - - - - - - 1.000 0.968 0.989 0.941 0.876 0.889

PPV = positive predictive value.
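
As a reading aid for the sensitivity and PPV rows in the simulation tables, the following minimal Python sketch shows one common way these two measures can be computed for a single simulated data set. It assumes a district-count definition (sensitivity as the share of true-cluster districts that were detected, PPV as the share of detected districts that lie in the true cluster). The function name `sensitivity_ppv` and the district identifiers are hypothetical, and the exact definition and averaging used for the tables may differ.

```python
def sensitivity_ppv(detected, true_cluster):
    """Return (sensitivity, PPV) for one simulated data set.

    Assumption: both arguments are collections of district identifiers,
    and every district counts equally (no population weighting).
    """
    detected = set(detected)
    true_cluster = set(true_cluster)
    overlap = len(detected & true_cluster)  # districts correctly detected

    # Sensitivity: correctly detected districts / districts in the true cluster.
    sensitivity = overlap / len(true_cluster) if true_cluster else float("nan")
    # PPV: correctly detected districts / all detected districts.
    ppv = overlap / len(detected) if detected else float("nan")
    return sensitivity, ppv


# Hypothetical example: the true cluster is districts 1-10,
# and the detected cluster is districts 5-20.
true_cluster = set(range(1, 11))
detected = set(range(5, 21))
sens, ppv = sensitivity_ppv(detected, true_cluster)
print(f"sensitivity = {sens:.3f}, PPV = {ppv:.3f}")
```

In this illustration a window that is too large covers part of the true cluster but also includes many districts outside it, which lowers PPV. This is the same trade-off visible in the tables as the maximum reported cluster size grows.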

### Table 9

Detected clusters with high incidence rates of female colorectal cancer using the circular spatial scan statistic

| | Cluster | Number of districts | p-value | Weighted mean* |
|---|---|---|---|---|
| Gini (20%) | Most likely | 27 | 0.016 | 29.89 |
| Gini (20%) | Secondary | 46 | 0.020 | 28.63 |
| Default setting | Most likely | 105 | 0.006 | 28.29 |

*Weighted mean of incidence rates per 100,000.

References
1. Duczmal, L, and Assunção, R (2004). A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Computational Statistics & Data Analysis. 45, 269-286.
2. Dwass, M (1957). Modified randomization tests for nonparametric hypotheses. The Annals of Mathematical Statistics. 28, 181-187.
3. Gastwirth, JL (1972). The estimation of the Lorenz curve and Gini index. The Review of Economics and Statistics. 54, 306-316.
4. Han, J, Zhu, L, Kulldorff, M, Hostovich, S, Stinchcomb, DG, Tatalovich, Z, Lewis, DR, and Feuer, EJ (2016). Using Gini coefficient to determining optimal cluster reporting sizes for spatial scan statistics. International Journal of Health Geographics. 15, 27.
5. Huang, L, Kulldorff, M, and Gregorio, D (2007). A spatial scan statistic for survival data. Biometrics. 63, 109-118.
6. Huang, L, Tiwari, RC, Zou, Z, Kulldorff, M, and Feuer, EJ (2009). Weighted normal spatial scan statistic for heterogeneous population data. Journal of the American Statistical Association. 104, 886-898.
7. Jung, I, Kulldorff, M, and Klassen, AC (2007). A spatial scan statistic for ordinal data. Statistics in Medicine. 26, 1594-1607.
8. Jung, I, Kulldorff, M, and Richard, OJ (2010). A spatial scan statistic for multinomial data. Statistics in Medicine. 29, 1910-1918.
9. Kim, S, and Jung, I (2017). Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data. PLoS ONE. 12, e0182234.
10. Kulldorff, M (1997). A spatial scan statistic. Communications in Statistics - Theory and Methods. 26, 1481-1496.
11. Kulldorff, M, and Information Management Services Inc (2018). SaTScan™ v9.5: Software for the spatial and space-time spatial scan statistics. Available from: http://www.satscan.org/
12. Kulldorff, M, Huang, L, and Konty, K (2009). A scan statistic for continuous data based on the normal probability model. International Journal of Health Geographics. 8, 58.
13. Kulldorff, M, Huang, L, Pickle, L, and Duczmal, L (2006). An elliptic spatial scan statistic. Statistics in Medicine. 25, 3929-3943.
14. Lorenz, MO (1905). Methods of measuring the concentration of wealth. Publications of the American Statistical Association. 9, 209-219.
15. Ribeiro, SHR, and Costa, MA (2012). Optimal selection of the spatial scan parameters for cluster detection: a simulation study. Spatial and Spatio-Temporal Epidemiology. 3, 107-120.
16. Patil, GP, and Taillie, C (2004). Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics. 11, 183-197.
17. Tango, T, and Takahashi, K (2005). A flexibly shaped spatial scan statistic for detecting clusters. International Journal of Health Geographics. 4, 11.