
Particulate matter of less than 2.5
South Korea is among the countries most severely affected by air pollution, and many studies have focused on PM2.5 in South Korea. Choi
Various statistical models have been applied to the data to predict PM2.5 level. Ordieres
These models perform well for relatively common events, but they predict extremely high levels of PM2.5 poorly. Because extreme events occur rarely, few data are available to learn the extreme patterns of the PM2.5 level. Therefore, conventional statistical models do not perform well in predicting extreme events unless the models include assumptions for such events.
Various models have been developed to predict extreme values to overcome this issue. D’Amico
In this paper, we adapt the three-stage model proposed by Wang and Li (2013) to predict extreme concentrations of PM2.5 in Seoul, South Korea. The three-stage model extends the model by Wang
In this paper, we modify the three-stage model with lasso regression (Tibshirani, 1996) to improve the prediction performance of the extreme concentration of PM2.5. Lasso regression performs variable selection, and provides a sparse solution. Therefore, even when numerous predictors are included in the initial model stage, penalized regression methods, such as lasso, provide a parsimonious linear model. There are various penalized quantile regression models (Wu and Liu, 2009; Alhamzawi
The rest of the paper is organized as follows. The data description and exploratory data analysis are provided in Section 2. In Section 3, we propose a binary classifier based on a three-stage model with lasso regression, and an algorithm for implementation is presented in Section 4. In Section 5, we apply the proposed method to the PM2.5 in Seoul, South Korea, and validate the results. Finally, the concluding remarks are presented in Section 6.
Table 1 lists the variables used in this study. The meteorological variables, hourly precipitation, temperature, wind speed, and humidity, were collected from the South Korea Meteorological Administration. Several studies indicate that the concentration level of PM2.5 is affected by the meteorological factors. For example, Zhang
Air pollutant data were obtained from AIR KOREA, affiliated with the Korea Environment Corporation. The measurements include the hourly PM2.5, PM10, SO2, NO2, CO, and O3 content. Stracquadanio
The variables in Table 1 are measured in 25 districts in Seoul, South Korea (Figure 1). The meteorological variables and the air pollutant variables were measured at different stations, but the distances between the stations are relatively short.
All variables were obtained from January 1, 2015, to May 29, 2020, but in this study, we consider only the spring season (March, April, and May) of each year, when extremely high levels of PM2.5 are observed. In addition, for fast computation, we used four-hour average values for all variables. Therefore, the number of time points is 3298 for each district. For example, the four-hour average values of PM2.5 in the Gangnam district are presented in Figure 2.
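The four-hour averaging can be sketched as follows. This is a minimal illustration, assuming non-overlapping blocks aligned to the start of the series (the text does not state how the blocks are aligned); the function name `block_average` and the sample values are hypothetical.

```python
from statistics import mean


def block_average(hourly, block=4):
    """Average consecutive, non-overlapping blocks of `block` hourly values.

    Assumption: blocks start at the first observation and do not overlap.
    """
    if len(hourly) % block != 0:
        raise ValueError("series length must be a multiple of the block size")
    return [mean(hourly[i:i + block]) for i in range(0, len(hourly), block)]


# Hypothetical hourly PM2.5 readings covering eight hours.
hourly_pm25 = [10.0, 12.0, 14.0, 16.0, 30.0, 30.0, 34.0, 34.0]
print(block_average(hourly_pm25))  # [13.0, 32.0]
```

With this scheme, each day of 24 hourly readings yields six time points.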
Missing values occur in every measurement; thus, we imputed the missing points as follows. First, linear imputation was applied if the length of a sequence of missing points was less than or equal to four. Second, if the length was greater than four and less than or equal to six, the sequence was imputed with the values from the nearest region within 5 km. Finally, rows that the previous steps could not impute were omitted from the data.
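The three imputation rules can be sketched as follows. This is an illustrative reading of the rules rather than the authors' implementation: the function name `impute_gaps`, the use of `None` for missing points, and the treatment of gaps at the series boundary (left missing and later dropped) are assumptions. The `neighbor` argument stands for the series from the nearest region within 5 km.

```python
def impute_gaps(series, neighbor=None, short=4, medium=6):
    """Impute runs of missing points (None) according to their length.

    - run length <= short: linear interpolation between the adjacent values
    - short < run length <= medium: copy the neighbor region's values
    - otherwise: leave missing; the keep mask marks these rows for omission
    Returns (values, keep), where keep[i] is False for rows to drop.
    """
    out = list(series)
    n = len(out)
    i = 0
    while i < n:
        if series[i] is not None:
            i += 1
            continue
        j = i
        while j < n and series[j] is None:
            j += 1                      # j is the first index after the gap
        gap = j - i
        left = series[i - 1] if i > 0 else None
        right = series[j] if j < n else None
        if gap <= short and left is not None and right is not None:
            step = (right - left) / (gap + 1)
            for k in range(gap):
                out[i + k] = left + step * (k + 1)
        elif short < gap <= medium and neighbor is not None:
            for k in range(i, j):
                out[k] = neighbor[k]    # stays None if the neighbor is missing too
        i = j
    keep = [v is not None for v in out]
    return out, keep


# Hypothetical series with a gap of two and a gap of five.
series = [1.0, None, None, 4.0, None, None, None, None, None, 10.0]
neighbor = [1.5, 2.5, 3.5, 4.5, 5.0, 6.0, 7.0, 8.0, 9.0, 10.5]
values, keep = impute_gaps(series, neighbor)
print(values)
```

In the example, the gap of two is filled by interpolation and the gap of five is filled from the neighboring series; a gap longer than six would remain missing and be flagged for removal.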
To examine the relationships between variables, we computed the correlation coefficients. Figure 3 displays the correlations in the Gangnam district, where PM10, SO2, and CO have a relatively strong positive association (greater than 0.5) with PM2.5. The other variables have weaker associations, below 0.5. We also observed relatively strong associations between O3, NO2, CO, and wind speed, implying multicollinearity between variables. For the other districts, we observed similar correlations.
The Ministry of Environment in South Korea categorizes the concentrations of PM2.5 levels between 0 and 15
The proposed method is based on the three-stage model by Wang and Li (2013) with some modifications. We consider the response variable
The goal is to estimate the extremely high conditional quantiles for
where
Similar to Wang and Li (2013), we assume that
where
Although the three-stage model uses conventional quantile regression as a base model, we propose regularized quantile regression using the lasso penalty to overcome the multicollinearity problem and improve the prediction performance.
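As an illustration of what a lasso-penalized quantile regression minimizes, the sketch below evaluates the standard objective: the quantile check loss summed over the observations plus an L1 penalty on the coefficients. This is the textbook form, not necessarily the exact objective used in this paper. Leaving the intercept unpenalized, the scaling of `lam`, and all function and argument names are assumptions.

```python
def check_loss(u, tau):
    """Quantile check loss: u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def lasso_quantile_objective(X, y, intercept, beta, tau, lam):
    """Sum of check losses of the residuals plus lam * sum(|beta_j|).

    Assumption: the intercept is not penalized.
    """
    total = 0.0
    for row, yi in zip(X, y):
        fitted = intercept + sum(b * x for b, x in zip(beta, row))
        total += check_loss(yi - fitted, tau)
    return total + lam * sum(abs(b) for b in beta)


# Hypothetical toy data: one predictor, two observations.
X = [[1.0], [2.0]]
y = [3.0, 1.0]
print(lasso_quantile_objective(X, y, intercept=0.0, beta=[1.0], tau=0.75, lam=0.5))  # 2.25
```

For a high quantile level, positive residuals (under-prediction) are penalized more heavily than negative ones, while the L1 term shrinks some coefficients exactly to zero, which is how the penalty performs variable selection.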
The entire procedure consists of four steps. First, the power transformation parameter
In quantile regression, conditional quantiles of
where
The power transformation parameter
where
where
In this step, we estimate the intermediate quantiles
The estimate of the intermediate quantile is
where
Next, we extrapolated the intermediate quantile estimates to the extreme tails. For
where
The extreme value index
The selection of
Then, we ensemble the extreme quantile results using a simple average over a set of extreme quantile levels. We set this set to {0.950, 0.955, . . . , 0.990, 0.995, 0.999}; the number of elements in the set is 11.
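The simple average over the extreme quantile levels can be sketched as follows. The function names and the dictionary of estimates are hypothetical; the quantile levels are the 11 values stated in the text.

```python
def extreme_levels():
    """Quantile levels 0.950, 0.955, ..., 0.995 plus 0.999 (11 in total)."""
    levels = [k / 1000 for k in range(950, 1000, 5)]  # 0.950 through 0.995
    levels.append(0.999)
    return levels


def ensemble_average(estimates_by_level):
    """Simple average of the extreme quantile estimates over all levels.

    `estimates_by_level` maps each quantile level to the estimated extreme
    quantile at one time point (hypothetical input format).
    """
    levels = extreme_levels()
    return sum(estimates_by_level[t] for t in levels) / len(levels)


levels = extreme_levels()
print(len(levels))  # 11

# Hypothetical estimates: the same value at every level averages to itself.
print(ensemble_average({t: 80.0 for t in levels}))  # 80.0
```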
In the last step, the estimated extreme quantile,
We observed that {(
Note that we used predicted explanatory variables. Therefore, the proposed model can be written as
where
where
The number of time points in the data set is 3298. The overall data contain few ‘very bad’ cases; thus, a small test set would contain very few of them. Therefore, to validate the performance in predicting extreme cases, we split the data into training and test sets at a ratio of 1:2. In this study,
We constructed the proposed model using training data, and validated the model using test data. The training data are denoted as {(
For each district in Seoul, we ran the following algorithm.
The results were summarized using the sensitivity, specificity, positive predictive value, negative predictive value, and the F-score, which are statistical performance measures for a binary classification test. Based on the confusion matrix in Table 4, the following measures are defined,
Sensitivity measures the proportion of correctly identified positives (the proportion of ‘very bad’ time points that are correctly identified as ‘very bad’),
Specificity measures the proportion of correctly identified negatives (the proportion of ‘not very bad’ time points that are correctly identified as ‘not very bad’)
The positive predictive value (PPV) is calculated as follows
The negative predictive value (NPV) is calculated as follows
The F-score, which measures overall accuracy, is calculated as follows,
The F-score ranges from 0 to 1 and is controlled by
Three conventional classification models are applied to evaluate the relative performance of the proposed model. As a representative non-parametric ensemble algorithm, the random forest (RF) is known to be robust to outliers and performs well in many classification problems. The multilayer perceptron (MLP) and long short-term memory (LSTM) models are artificial neural networks that are widely used in various applications. In particular, variants of the LSTM have been used to predict PM2.5 because they are appropriate for time series data.
The details of each model are summarized in Table 5. The hyperparameters were selected by trial and error. For example, we considered 400, 600, and 800 trees for the RF and chose 800 as the optimal value.
The threshold commonly used to classify an observation as positive is 0.5. However, this value may be inappropriate, considering the extremeness of the ‘very bad’ events in the data (Zou
The results from three districts, Guro, Yangcheon, and Yeongdeungpo, are presented in Table 6. The proposed three-stage model with lasso ensemble (TSLE) method provides the highest PPVs and comparable sensitivity values, and works best according to the F-score. Although the sensitivity values for the MLP are higher than those for the other models for Yangcheon and Yeongdeungpo, the MLP has the lowest PPVs, which indicates that false alarms occur frequently. This outcome implies that the proposed method has relatively balanced performance for predicting ‘very bad’ events. However, all models perform well in terms of the NPV and specificity because of the extremely imbalanced data.
In Figure 4, we plot the F-scores from the three-stage model with lasso regression (TSL) before ensemble. A specific quantile may provide the best result, but this is inconsistent. For example, TSL prediction when
We obtained similar results for all 25 districts (not shown), and we plotted the average measures of all districts in Figure 5. Overall, the proposed TSLE method works best for all five measures.
Because lasso regression performs variable selection, some variables are not selected in the regression model. Therefore, we present the proportion of selection for each variable in Table 7. Temperature, humidity, and PM10 were selected in most models, whereas NO2 and O3 were rarely selected and SO2 was never selected.
In this paper, we consider the prediction of extreme values of PM2.5 in Seoul, South Korea. Unlike conventional mean-based models, the proposed method is based on quantile regression combined with extreme value theory. Therefore, the proposed model predicts extremely high values of PM2.5 especially well. Moreover, we added the lasso penalty term to the quantile model so that it performs variable selection. Based on the statistical measures of the performance of a binary classification test, such as the sensitivity, PPV, and F-score, the proposed method works well for the 25 districts in Seoul, South Korea.
We expect that the performance of the proposed model would improve by adding meteorological variables from China, which significantly affect the atmospheric conditions in South Korea. In addition, the data file and R code for implementation are provided at https://github.com/SaeSimcheon/Extreme-PM2.5-prediction.
Variables
Data | Variable | Interval | Source |
---|---|---|---|
Meteorological data | Precipitation, Temperature, Wind Speed, Humidity | Hourly | South Korea Meteorological Administration https://www.kma.go.kr/ |
Air pollution data | PM2.5, PM10, SO2, NO2, CO, O3 | Hourly | AIR KOREA https://www.airkorea.or.kr/ |
Category table for PM2.5 levels in South Korea
Category | Good | Normal | Bad | Very bad |
---|---|---|---|---|
Daily PM2.5 level (μg/m³) | 0 ~ 15 | 16 ~ 35 | 36 ~ 75 | 76 ~ |
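The category boundaries can be expressed as a small lookup function. This is a sketch under one assumption: the table gives integer ranges, so for fractional values the cut points are taken as 16, 36, and 76, which matches the "76 ≤ PM2.5" definition of the ‘very bad’ class used for the binary response. The function names are hypothetical.

```python
def pm25_category(level):
    """Map a PM2.5 level to its category.

    Assumption: fractional values fall in the lower category until the next
    integer boundary is reached (cut points 16, 36, 76).
    """
    if level < 0:
        raise ValueError("PM2.5 level cannot be negative")
    if level < 16:
        return "Good"
    if level < 36:
        return "Normal"
    if level < 76:
        return "Bad"
    return "Very bad"


def is_very_bad(level):
    """Binary response used for classification: 1 if 76 <= PM2.5, else 0."""
    return 1 if level >= 76 else 0


print(pm25_category(15), pm25_category(35), pm25_category(75), pm25_category(76))
# Good Normal Bad Very bad
```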
Number of time points (%) in each categorized PM2.5 level in 25 districts in Seoul
District | PM2.5 < 76 | 76 ≤ PM2.5 | District | PM2.5 < 76 | 76 ≤ PM2.5 |
---|---|---|---|---|---|
Gangnam | 3200 (97%) | 98 (3%) | Seodaemun | 3210 (97%) | 88 (3%) |
Gangdong | 3229 (98%) | 69 (2%) | Seocho | 3208 (97%) | 90 (3%) |
Gangbuk | 3236 (98%) | 62 (2%) | Seongdong | 3191 (97%) | 107 (3%) |
Gangseo | 3229 (98%) | 69 (2%) | Seongbuk | 3251 (99%) | 47 (1%) |
Gwanak | 3194 (97%) | 104 (3%) | Songpa | 3239 (98%) | 59 (2%) |
Gwangjin | 3203 (97%) | 95 (3%) | Yangcheon | 3217 (98%) | 81 (2%) |
Guro | 3207 (97%) | 91 (3%) | Yeongdeungpo | 3183 (97%) | 115 (3%) |
Geumcheon | 3225 (98%) | 73 (2%) | Yongsan | 3213 (97%) | 85 (3%) |
Nowon | 3224 (98%) | 74 (2%) | Eunpyeong | 3232 (98%) | 66 (2%) |
Dobong | 3244 (98%) | 54 (2%) | Jongno | 3215 (97%) | 83 (3%) |
Dongdaemun | 3221 (98%) | 77 (2%) | Jung | 3243 (98%) | 55 (2%) |
Dongjak | 3211 (97%) | 87 (3%) | Jungnang | 3211 (97%) | 87 (3%) |
Mapo | 3185 (97%) | 113 (3%) | | | |
Confusion matrix
 | Predicted positive (‘Very bad’) | Predicted negative (‘Not very bad’) |
---|---|---|
Actual positive (‘Very bad’) | True positive (TP) | False negative (FN) |
Actual negative (‘Not very bad’) | False positive (FP) | True negative (TN) |
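The five performance measures can be computed from the confusion matrix counts as sketched below. The sensitivity, specificity, PPV, and NPV follow their standard definitions. The F-score is written in the common F-beta form, which is an assumption about the exact formula used here; with beta = 2, it appears consistent with the values reported for the comparison methods (for example, PPV 0.241 and sensitivity 0.797 give approximately 0.545). The function names and the example counts are hypothetical.

```python
def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def f_beta(ppv, sensitivity, beta=1.0):
    """F-beta score: weighted harmonic mean of PPV and sensitivity."""
    b2 = beta * beta
    return _ratio((1 + b2) * ppv * sensitivity, b2 * ppv + sensitivity)


def classification_measures(tp, fn, fp, tn, beta=1.0):
    """Compute the five measures from confusion matrix counts."""
    sensitivity = _ratio(tp, tp + fn)   # correctly identified positives
    specificity = _ratio(tn, tn + fp)   # correctly identified negatives
    ppv = _ratio(tp, tp + fp)           # positive predictive value
    npv = _ratio(tn, tn + fn)           # negative predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_score": f_beta(ppv, sensitivity, beta),
    }


# Hypothetical counts for an imbalanced test set.
print(classification_measures(tp=8, fn=2, fp=12, tn=78, beta=2.0))

# Check against rounded PPV and sensitivity values (assumes beta = 2).
print(round(f_beta(0.241, 0.797, beta=2.0), 3))  # 0.545
```

A value of beta greater than 1 weights sensitivity more heavily than the PPV, which suits a setting where missing a ‘very bad’ event is costlier than a false alarm.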
Hyperparameters and architectures of comparison methods
Model | Setting / layer | Value / input size | Output size | Activation |
---|---|---|---|---|
RF | n_estimators | 800 | | |
RF | max features | 3 | | |
MLP | Optimizer | Adam | | |
MLP | Batch_size | 16 | | |
MLP | Loss function | binary cross entropy | | |
MLP | Learning rate | 0.001 | | |
MLP | Dense layer | 9 | 64 | ReLU |
MLP | Dropout layer | 64 | 64 | · |
MLP | Dense layer | 64 | 1 | Sigmoid |
LSTM | Optimizer | Adam | | |
LSTM | Batch_size | 16 | | |
LSTM | Loss function | binary cross entropy | | |
LSTM | Learning rate | 0.001 | | |
LSTM | Input window | 6 | | |
LSTM | LSTM layer | (6,9) | 64 | tanh |
LSTM | Dense layer | 64 | 1 | Sigmoid |
Performance table for Guro, Yangcheon, and Yeongdeungpo (Bold indicates best performance)
District | Method | F-score | PPV | NPV | Sensitivity | Specificity |
---|---|---|---|---|---|---|
Guro | TSLE | |||||
RF | 0.545 | 0.241 | 0.992 | 0.797 | 0.912 | |
MLP | 0.284 | 0.098 | 0.981 | 0.541 | 0.826 | |
LSTM | 0.331 | 0.137 | 0.981 | 0.514 | 0.887 | |
Yangcheon | TSLE | 0.562 | ||||
RF | 0.341 | 0.532 | 0.974 | 0.312 | 0.990 | |
MLP | 0.122 | 0.028 | 0.869 | 0.072 | ||
LSTM | 0.279 | 0.110 | 0.976 | 0.450 | 0.863 | |
Yeongdeungpo | TSLE | 0.990 | 0.781 | |||
RF | 0.561 | 0.356 | 0.984 | 0.656 | 0.946 | |
MLP | 0.464 | 0.161 | 0.791 | |||
LSTM | 0.498 | 0.253 | 0.983 | 0.656 | 0.911 |
Proportion of selection for each variable in the three-stage lasso models
Variable | Proportion of selection |
---|---|
Precipitation | 0.664 |
Temperature | 0.976 |
Humidity | 0.980 |
Wind Speed | 0.672 |
CO | 0.343 |
NO2 | 0.014 |
O3 | 0.031 |
PM10 | 1 |
SO2 | 0 |