Prediction of extreme PM concentrations via extreme quantile regression
Communications for Statistical Applications and Methods 2022;29:319-331
Published online May 31, 2022
© 2022 Korean Statistical Society.

SangHyuk Lee^a, Seoncheol Park^b, Yaeji Lim^{1,a}

^a Department of Statistics, Chung-Ang University, Korea
^b Department of Information Statistics, Chungbuk National University, Korea
^1 Correspondence to: Department of Applied Statistics, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea. E-mail: yaeji.lim@gmail.com

This research was supported by the Chung-Ang University Graduate Research Scholarship in 2020 and the National Research Foundation of Korea (NRF) funded by the Korean government (NRF-2021R1A2B5B01001790, NRF-2021R1F1A1064096).
Received October 7, 2021; Revised November 15, 2021; Accepted November 22, 2021.
In this paper, we develop a new statistical model to forecast the PM2.5 level in Seoul, South Korea. The proposed model is based on the extreme quantile regression model with a lasso penalty. Various meteorological and air pollution variables are considered as predictors in the regression model, and the lasso quantile regression performs variable selection and mitigates the multicollinearity problem. The final prediction model is obtained by combining various extreme lasso quantile regression estimators, and we construct a binary classifier based on the model. Prediction performance is evaluated using standard statistical measures for a binary classification test. We observe that the proposed method outperforms the other classification methods considered and predicts ‘very bad’ cases of the PM2.5 level well.
Keywords: PM2.5, prediction, classification, quantile regression
1. Introduction

Particulate matter of less than 2.5 μm (PM2.5) is a critical issue in modern society, and the World Health Organization (WHO, 2018) has estimated that around 7 million people die every year from exposure to fine particles in polluted air. Numerous studies have examined the associations between exposure to ambient PM2.5 and adverse health effects. Pui et al. (2014) reviewed various aspects of PM2.5, including its measurement, source apportionment, visibility and health effects, and mitigation, and Burnett et al. (2014) developed a fine particulate mass-based relative risk function. In addition, Song et al. (2017) estimated the health burden attributable to PM2.5 based on three years of observed data.

South Korea is among the countries most severely affected by air pollution, and many studies have focused on PM2.5 in South Korea. Choi et al. (2012) examined the characteristics, sources, and distributions of PM2.5 and carbonaceous species in Incheon, South Korea, and Ryou et al. (2018) summarized the findings of PM source apportionment studies on South Korea. More recently, Bae et al. (2020) estimated long-term foreign contributions to the PM2.5 concentrations in South Korea with a set of air quality simulations.

Various statistical models have been applied to the data to predict PM2.5 level. Ordieres et al. (2005) compared three different topologies of neural networks for predicting average PM2.5 concentrations: multilayer perceptron (MLP), radial basis function (RBF) and square MLP. Dong et al. (2009) proposed a model based on hidden semi-Markov models for high PM2.5 concentration value prediction, and Qiao et al. (2019) developed a model based on wavelet transform–stacked autoencoder–long short term memory (LSTM).

These models perform well for typical events, but they predict extremely high levels of PM2.5 poorly. Since extreme events rarely occur, few data are available to learn the extreme patterns of the PM2.5 level. Therefore, conventional statistical models do not perform well in predicting extreme events unless the models make explicit assumptions about such events.

Various models have been developed to predict extreme values and overcome this issue. D’Amico et al. (2015) used the generalized Pareto distribution to model the tail of the probability distribution function in order to predict wind speed in Alaska. Quintela-del-Río and Francisco-Fernández (2011) proposed a nonparametric functional data analysis to estimate ozone data in the UK, and Schaumburg (2012) combined nonparametric quantile regression with extreme value theory. However, few studies have been conducted to predict extreme PM2.5 levels (Qin et al., 2015).

In this paper, we adapt the three-stage model proposed by Wang and Li (2013) to predict extreme concentrations of PM2.5 in Seoul, South Korea. The three-stage model extends the model of Wang et al. (2012a) by relaxing the assumptions of linear quantile functions of Y and tail equivalency across covariates, x. Wang and Li (2013) integrated quantile regression and extreme value theory by estimating intermediate conditional quantiles using quantile regression and extrapolating these estimates to the tails based on extreme value theory.

In this paper, we modify the three-stage model with lasso regression (Tibshirani, 1996) to improve the prediction performance of the extreme concentration of PM2.5. Lasso regression performs variable selection, and provides a sparse solution. Therefore, even when numerous predictors are included in the initial model stage, penalized regression methods, such as lasso, provide a parsimonious linear model. There are various penalized quantile regression models (Wu and Liu, 2009; Alhamzawi et al., 2012; Wang et al., 2012b), and we consider the lasso quantile regression in this study. We further combine the three-stage models with the lasso penalty obtained from the various extreme quantile values, and generate a single prediction value. Finally, a binary classifier based on the proposed model is constructed.

The rest of the paper is organized as follows. The data description and exploratory data analysis are provided in Section 2. In Section 3, we propose a binary classifier based on a three-stage model with lasso regression, and an algorithm for implementation is presented in Section 4. In Section 5, we apply the proposed method to the PM2.5 in Seoul, South Korea, and validate the results. Finally, the concluding remarks are presented in Section 6.

2. Data

Table 1 lists the variables used in this study. The meteorological variables, hourly precipitation, temperature, wind speed, and humidity, were collected from the South Korea Meteorological Administration. Several studies indicate that the concentration level of PM2.5 is affected by the meteorological factors. For example, Zhang et al. (2015) confirmed the critical role of meteorological parameters in air pollution formation, and Zhang et al. (2018) studied the influences of critical meteorological parameters, such as wind and precipitation, on PM concentrations. Therefore, we also included these variables in this study.

Air pollutant data were obtained from AIR KOREA, affiliated with the Korea Environment Corporation. The measurements include the hourly PM2.5, PM10, SO2, NO2, CO, and O3 content. Stracquadanio et al. (2007) studied the correlations between PM2.5 and gases (benzene, O3, SO2, NO2, and CO). Song et al. (2015) used the generalized additive model to determine the statistical relationships between PM2.5 concentrations and other pollutants, including SO2, NO2, CO, and O3. Based on these studies, we also consider these pollutant variables to be covariates.

The variables in Table 1 are measured in 25 districts in Seoul, South Korea (Figure 1). The meteorological variables and the air pollutant variables were not always measured at the same locations, but the distance between the stations is relatively short.

All variables were obtained from January 1, 2015, to May 29, 2020, but in this study, we consider only the spring season (March, April, and May) of each year, when extremely high levels of PM2.5 are observed. In addition, to speed up computation, we used four-hour average values for all variables. Therefore, the number of time points is 3,298 for each district. For example, the four-hour average values of PM2.5 in the Gangnam district are presented in Figure 2.

Missing values occur in each measurement; thus, we imputed the missing points as follows. First, linear imputation was applied if the length of a run of consecutive missing points was less than or equal to four. Second, if the length of the run was greater than four and less than or equal to six, the run was imputed using the values from the nearest region within 5 km. Finally, the rows were omitted from the data if the previous steps did not impute the points.
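The first imputation rule can be illustrated with a short sketch. This is not the authors' code: the function name `impute_short_gaps`, the use of `None` for missing values, and the decision to leave gaps at the ends of the series untouched are assumptions made for illustration. The second rule (borrowing values from the nearest region within 5 km) is not implemented here.

```python
from typing import List, Optional


def impute_short_gaps(values: List[Optional[float]],
                      max_gap: int = 4) -> List[Optional[float]]:
    """Linearly interpolate runs of missing values (None) of length <= max_gap.

    Longer runs, and runs touching either end of the series, are left as None
    so that a later step (neighbouring station, or row removal) can handle them.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i                      # first missing index of the run
        while i < n and out[i] is None:
            i += 1
        end = i                        # first observed index after the run, or n
        gap = end - start
        if start == 0 or end == n or gap > max_gap:
            continue                   # no bracketing values, or run too long
        left, right = out[start - 1], out[end]
        for k in range(gap):
            # Evenly spaced points on the line joining the two observed values.
            out[start + k] = left + (right - left) * (k + 1) / (gap + 1)
    return out


# Hypothetical four-hour averages: a run of 2 is filled, a run of 5 is not.
series = [10.0, None, None, 16.0, None, None, None, None, None, 40.0]
filled = impute_short_gaps(series)
```

Runs that remain `None` after this step would then go to the neighbouring-station rule or be dropped, as described above.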

To examine the relationships between the variables, we computed the correlation coefficients. Figure 3 displays the correlations in the Gangnam district, where PM10, SO2, and CO each have a correlation of more than 0.5 with PM2.5, a relatively strong positive association. The other variables have weaker associations with PM2.5, with correlations of less than 0.5. We also observed relatively strong associations between O3, NO2, CO, and wind speed, implying multicollinearity between the variables. We observed similar correlation patterns for the other districts.

The Ministry of Environment in South Korea categorizes PM2.5 concentrations between 0 and 15 μg/m3 as ‘good’, between 16 and 35 as ‘normal’, between 36 and 75 as ‘bad’, and 76 or above as ‘very bad’ (Table 2). In this study, we focus on forecasting the ‘very bad’ cases in Seoul, which cause severe health effects. The numbers of time points in each PM2.5 category are presented in Table 3 for the 25 districts. ‘Very bad’ cases are rare in all districts (1% to 3% of the time points), implying that conventional mean-based statistical models may not work well in predicting extremely high PM2.5 levels.
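The categories in Table 2 can be written as a small lookup. In this sketch the function name `pm25_category` is hypothetical, and because the table gives integer ranges, treating 16, 36, and 76 as lower bounds for non-integer averages is an assumption.

```python
def pm25_category(level: float) -> str:
    """Map a PM2.5 concentration (in μg/m3) to the category of Table 2.

    Assumption: non-integer values are assigned by lower bound, so a value
    such as 15.5 falls in 'good' and 75.9 falls in 'bad'.
    """
    if level < 0:
        raise ValueError("PM2.5 concentration cannot be negative")
    if level >= 76:
        return "very bad"
    if level >= 36:
        return "bad"
    if level >= 16:
        return "normal"
    return "good"


# The binary target of this study: 1 for 'very bad', 0 otherwise.
is_very_bad = [int(pm25_category(v) == "very bad") for v in (12.0, 40.5, 76.0, 110.2)]
```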

3. Methodology

The proposed method is based on the three-stage model by Wang and Li (2013) with some modifications. We consider the response variable Y and covariate vector X = (X1, . . . , Xp)T with X1 = 1, and assume that we observe a random sample {(yi, xi), i = 1, . . . , n} of the random vector (Y, X).

The goal is to estimate the extremely high conditional quantiles for τn → 1 as n → ∞,

$$Q_Y(\tau_n \mid x) = \inf\{y : F_Y(y \mid x) \ge \tau_n\}, \tag{3.1}$$

where FY (·|x) is the conditional cumulative distribution function of Y.

Similar to Wang and Li (2013), we assume that FY (·|x) is in the maximum domain of attraction of an extreme value distribution Gγ(x), where γ(x) > 0 is the extreme value index. For the random sample Z1, . . . , Zn from FY (·|x), constants an(x) > 0 and bn(x) ∈ ℝ exist such that,

$$P\left(\frac{Z_{(n)} - b_n(x)}{a_n(x)} \le y\right) \to G_{\gamma(x)}(y) = \exp\left\{-\left(1 + \gamma(x)\,y\right)^{-1/\gamma(x)}\right\}, \quad \text{as } n \to \infty, \tag{3.2}$$

where Z(n) is the largest order statistic of the samples.

Although the three-stage model uses conventional quantile regression as a base model, we propose regularized quantile regression using the lasso penalty to overcome the multicollinearity problem and improve the prediction performance.

The entire procedure consists of four steps. First, the power transformation parameter λ of the response variable Y is estimated. Second, the conditional intermediate quantiles are fitted to the transformed response variable. Third, the extreme quantile is estimated by extrapolating the intermediate quantile estimates, and the estimates obtained at several extreme quantile levels are combined using a simple average. Finally, the result is classified according to a threshold into two groups: ‘not very bad’ and ‘very bad’. The following subsections describe the procedure in detail.

3.1. Power transformation

In quantile regression, conditional quantiles of Y are assumed to be linear in x at the tails. To relax this linearity assumption, Wang and Li (2013) considered the power transformation of Y. The power-transformed quantile regression model is defined as follows,

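$$Q_{\Lambda_\lambda(Y)}(\tau \mid x_i) = x_i^T \theta(\tau), \quad \text{for } i = 1, \ldots, n, \tag{3.3}$$

where τ ∈ [1 – ε, 1], ε is a small positive constant, θ(τ) is the τ-th quantile regression coefficient, and

$$\Lambda_\lambda(y) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \text{if } \lambda \ne 0, \\ \log(y), & \text{if } \lambda = 0. \end{cases} \tag{3.4}$$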

The power transformation parameter λ in (3.4) is estimated as follows,

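$$\hat{\lambda} = \arg\min_{\lambda} \sum_{i=1}^{n} \left\{ R_n(x_i, \lambda; \tau)^2 \right\}, \tag{3.5}$$

where $R_n(t, \lambda; \tau) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le t) \left[ \tau - I\{\Lambda_\lambda(y_i) - x_i^T \hat{\theta}_{\mathrm{LASSO}}(\tau; \lambda) \le 0\} \right]$, and I is the indicator function. The estimated lasso coefficient, $\hat{\theta}_{\mathrm{LASSO}}(\tau; \lambda)$, is computed as follows,

$$\hat{\theta}_{\mathrm{LASSO}}(\tau; \lambda) = \arg\min_{f = (b_1, \ldots, b_p)^T} \sum_{i=1}^{n} \rho_\tau\left(\Lambda_\lambda(y_i) - x_i^T f\right) + \nu \sum_{l=1}^{p} |b_l|,$$

where ρτ(x) = x{τ – I(x < 0)} is the τth quantile loss function, and ν is a penalty parameter in lasso regression, selected at each λ by cross-validation (Tibshirani, 1996; Wu and Liu, 2009). Wang and Li (2013) suggested using the upper quantile level τ = 1 – ε with a small positive constant ε, and we set τ = 0.95.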
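The building blocks of this step can be sketched in a few lines: the power transformation, its inverse, the quantile loss, and the penalized objective that the lasso quantile regression minimizes. This is a minimal sketch under stated assumptions, not the authors' implementation. All function names are hypothetical, the minimization itself is not shown, and the penalty is applied to every coefficient because the formula above sums over l = 1, . . . , p.

```python
import math
from typing import Sequence


def power_transform(y: float, lam: float) -> float:
    """Power transformation: (y**lam - 1) / lam, or log(y) when lam == 0."""
    if y <= 0:
        raise ValueError("the transformation requires y > 0")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


def inverse_power_transform(z: float, lam: float) -> float:
    """Inverse of power_transform, mapping back to the original scale."""
    return math.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)


def check_loss(u: float, tau: float) -> float:
    """Quantile (check) loss: u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def lasso_quantile_objective(coefs: Sequence[float],
                             xs: Sequence[Sequence[float]],
                             ys: Sequence[float],
                             tau: float, lam: float, nu: float) -> float:
    """Value of the penalized objective for a candidate coefficient vector."""
    loss = 0.0
    for x, y in zip(xs, ys):
        fitted = sum(b * xj for b, xj in zip(coefs, x))
        loss += check_loss(power_transform(y, lam) - fitted, tau)
    penalty = nu * sum(abs(b) for b in coefs)
    return loss + penalty


# An under-prediction costs 0.95 per unit at tau = 0.95, an over-prediction 0.05.
under = check_loss(2.0, 0.95)
over = check_loss(-2.0, 0.95)
```

The asymmetry of the loss is what pulls the fitted line toward the upper tail: at τ = 0.95, under-predicting is penalized 19 times as heavily as over-predicting by the same amount.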

3.2. Estimating intermediate quantiles

In this step, we estimate the intermediate quantiles QY (τj|x) for τj = j/(n + 1), j = 1, . . . , m, with m = n – [n^η]. Parameter η is set as 0.1, as suggested by Wang and Li (2013), and [x] denotes the integer part of x.
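As a quick illustration of the grid defined above, the following sketch computes the intermediate quantile levels for a given sample size. The function name `intermediate_levels` is hypothetical; the example uses the training sample size n1 = 1,099 reported in Section 4.

```python
import math
from typing import List


def intermediate_levels(n: int, eta: float = 0.1) -> List[float]:
    """Quantile levels tau_j = j / (n + 1) for j = 1, ..., m, with m = n - [n**eta]."""
    m = n - math.floor(n ** eta)
    return [j / (n + 1) for j in range(1, m + 1)]


# With n = 1099 and eta = 0.1, [n**eta] = 2, so m = 1097 levels are used.
taus = intermediate_levels(1099)
```

One lasso quantile regression is fitted at each of these levels, so the grid size drives the computational cost of the second stage.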

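The estimate of the intermediate quantile is

$$\hat{Q}_Y(\tau_j \mid x) = \Lambda_{\hat{\lambda}}^{-1}\left\{ x^T \hat{\theta}_{\mathrm{LASSO}}(\tau_j; \hat{\lambda}) \right\},$$

where $\hat{\lambda}$ is the power transformation parameter estimated from (3.5), and

$$\hat{\theta}_{\mathrm{LASSO}}(\tau_j; \hat{\lambda}) = \arg\min_{f = (b_1, \ldots, b_p)^T} \sum_{i=1}^{n} \rho_{\tau_j}\left(\Lambda_{\hat{\lambda}}(y_i) - x_i^T f\right) + \nu \sum_{l=1}^{p} |b_l|.$$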

3.3. Extrapolating intermediate quantiles to tails

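Next, we extrapolate the intermediate quantile estimates to the extreme tails. For τn → 1, we estimate QY (τn|x) as follows,

$$\hat{Q}_Y(\tau_n \mid x) = \left( \frac{1 - \tau_{n-k}}{1 - \tau_n} \right)^{\hat{\gamma}_k(x)} \hat{Q}_Y(\tau_{n-k} \mid x),$$

where k = kn → ∞ and k/n → 0.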

The extreme value index γk(x) is estimated as follows,

$$\hat{\gamma}_k(x) = \frac{1}{k - [n^\eta] + 1} \sum_{j=[n^\eta]}^{k} \log \frac{\hat{Q}_Y(\tau_{n-j} \mid x)}{\hat{Q}_Y(\tau_{n-k} \mid x)}.$$

The selection of k is a crucial part of the three-stage model. We applied the selection procedure in Section 3.3 of Wang and Li (2013), modified by adding a lasso penalty term.
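The extrapolation step can be sketched as follows, given a list of intermediate quantile estimates for one covariate value x. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, the extrapolation uses the Weissman-type form associated with Wang and Li (2013), and k is passed in by hand because the data-driven selection of k is not reproduced here.

```python
import math
from typing import Sequence


def extreme_value_index(q: Sequence[float], n: int, k: int, eta: float = 0.1) -> float:
    """Hill-type estimate of the extreme value index.

    q[j - 1] must hold the estimated conditional quantile at tau_j = j / (n + 1),
    for j = 1, ..., m with m = n - [n**eta].
    """
    lower = math.floor(n ** eta)
    if not lower <= k < n:
        raise ValueError("k must satisfy [n**eta] <= k < n")
    base = q[n - k - 1]                          # estimate at tau_{n-k}
    total = sum(math.log(q[n - j - 1] / base) for j in range(lower, k + 1))
    return total / (k - lower + 1)


def extrapolate_quantile(q: Sequence[float], n: int, k: int,
                         tau_n: float, eta: float = 0.1) -> float:
    """Weissman-type extrapolation from tau_{n-k} to the extreme level tau_n."""
    gamma = extreme_value_index(q, n, k, eta)
    tau_nk = (n - k) / (n + 1)
    return ((1.0 - tau_nk) / (1.0 - tau_n)) ** gamma * q[n - k - 1]


# Synthetic check with exact Pareto-type quantiles Q(tau) = (1 - tau)**(-0.5).
n, k = 1000, 100
m = n - math.floor(n ** 0.1)
q = [(1.0 - j / (n + 1)) ** (-0.5) for j in range(1, m + 1)]
gamma_hat = extreme_value_index(q, n, k)
q_999 = extrapolate_quantile(q, n, k, tau_n=0.999)
```

On this synthetic input the estimated index is slightly below the true value of 0.5, which is the usual finite-k behaviour of an average of log-ratios and one reason the choice of k matters.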

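Then, we ensemble the extreme quantile estimates, $\hat{Q}_Y(\tau_n \mid x)$, for $\tau_n \in \mathcal{T}$, using a simple average,

$$\hat{Q}_{\mathrm{ensemble}}(x) = \frac{1}{|\mathcal{T}|} \sum_{\tau_n \in \mathcal{T}} \hat{Q}_Y(\tau_n \mid x),$$

where $\mathcal{T}$ contains extreme quantile levels, and we set $\mathcal{T}$ = {0.950, 0.955, . . . , 0.990, 0.995, 0.999}. $|\mathcal{T}|$ denotes the number of elements in the set, which is equal to 11.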

3.4. Binary classification procedure

In the last step, the ensemble estimate of the extreme quantile, $\hat{Q}_{\mathrm{ensemble}}$, is classified into one of two classes based on a threshold value. This study focuses on the prediction of ‘very bad’ cases of PM2.5; thus, we set the threshold to 76 μg/m3, as defined in Table 2.

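$$\hat{Y} = \begin{cases} 1, & \text{if } \hat{Q}_{\mathrm{ensemble}} \ge 76, \\ 0, & \text{if } \hat{Q}_{\mathrm{ensemble}} < 76. \end{cases}$$
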
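The last two steps, averaging over the extreme quantile levels and thresholding at 76 μg/m3, can be sketched as below. The names `QUANTILE_LEVELS`, `ensemble_quantile`, and `classify_very_bad` are hypothetical, and the per-level estimates in the example are made-up numbers used only to show the mechanics.

```python
from typing import List, Sequence

# The 11 extreme quantile levels: 0.950, 0.955, ..., 0.995, plus 0.999.
QUANTILE_LEVELS: List[float] = [round(0.950 + 0.005 * i, 3) for i in range(10)] + [0.999]


def ensemble_quantile(estimates: Sequence[float]) -> float:
    """Simple average of the extreme quantile estimates, one per level."""
    if len(estimates) == 0:
        raise ValueError("at least one estimate is required")
    return sum(estimates) / len(estimates)


def classify_very_bad(q_ensemble: float, threshold: float = 76.0) -> int:
    """Return 1 ('very bad') if the ensemble estimate reaches the threshold, else 0."""
    return 1 if q_ensemble >= threshold else 0


# Made-up estimates for one time point, one value per quantile level.
example_estimates = [60.0, 62.0, 65.0, 68.0, 71.0, 75.0, 79.0, 84.0, 90.0, 98.0, 117.0]
label = classify_very_bad(ensemble_quantile(example_estimates))
```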
4. Algorithm

We observe {(yi, xi), i = 1, . . . , n}, where yi is the PM2.5 concentration, and xi = (x1,i, . . . , x9,i)T is the covariate vector at the ith time point. We used nine variables as covariates: PM10, SO2, NO2, CO, O3, precipitation, temperature, wind speed, and humidity.

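Note that we use predicted values of the explanatory variables, because the covariates at the forecast time are not yet observed. Therefore, the proposed model can be written as,

$$Y_{i+1} = f(\hat{x}_{i+1 \mid i}), \quad \text{where } \hat{x}_{i+1 \mid i} = g(x_1, \ldots, x_i),$$

where f (·) is the proposed three-stage model and g(·) is the forecast model based on the exponential smoothing (ETS) algorithm (Hyndman et al., 2008) computed as follows,

$$\hat{x}_{i+1 \mid i} = \alpha x_i + (1 - \alpha)\, \hat{x}_{i \mid i-1},$$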
where α is the smoothing parameter estimated by maximizing the likelihood.
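A one-step-ahead covariate forecast of this kind can be sketched with simple exponential smoothing. Note the assumptions: the text mentions only a single smoothing parameter α, so the sketch uses the simplest member of the ETS family, it fixes α by hand rather than estimating it by maximum likelihood, and it initializes the level at the first observation. The function name `ses_forecast` is hypothetical.

```python
from typing import Optional, Sequence


def ses_forecast(history: Sequence[float], alpha: float,
                 initial: Optional[float] = None) -> float:
    """One-step-ahead forecast by simple exponential smoothing.

    The level is updated as level = alpha * x + (1 - alpha) * level for each
    observation x; the final level is the forecast of the next value.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(history) == 0:
        raise ValueError("history must not be empty")
    level = history[0] if initial is None else initial
    for x in history:
        level = alpha * x + (1.0 - alpha) * level
    return level


# Forecast the next four-hour value of one covariate from three past values.
next_value = ses_forecast([10.0, 20.0, 30.0], alpha=0.5)
```

In the setting of this paper such a forecast would be produced separately for each of the nine covariates and then passed to the fitted three-stage model.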

The number of time points in the data set is 3,298. The overall data contain few ‘very bad’ cases; thus, a small test set would contain too few of them to assess performance. Therefore, to validate the performance in predicting extreme cases, we split the data into training and test sets at a ratio of 1:2. In this study, n1 = 1,099 observations from March 1, 2015, to March 31, 2017, were used as the training data, and the test data are n2 = 2,199 observations obtained from April 1, 2017, to May 29, 2020.

We constructed the proposed model using the training data, and validated the model using the test data. The training data are denoted as $\{(y_i^{tr}, x_i^{tr}), i = 1, \ldots, n_1\}$, and the test data are $\{(y_i^{te}, x_i^{te}), i = 1, \ldots, n_2\}$.

For each district in Seoul, we ran the following algorithm.

Algorithm 1: Three-stage model with Lasso ensemble

5. Application

5.1. Evaluation metrics

The results were summarized using the sensitivity, specificity, positive predictive value, negative predictive value, and F-score, which are standard performance measures for a binary classification test. Based on the confusion matrix in Table 4, the measures are defined as follows.

  • Sensitivity measures the proportion of correctly identified positives (the proportion of ‘very bad’ time points that are correctly identified as ‘very bad’),

$$\text{Sensitivity} = \frac{TP}{TP + FN}.$$

  • Specificity measures the proportion of correctly identified negatives (the proportion of ‘not very bad’ time points that are correctly identified as ‘not very bad’),

$$\text{Specificity} = \frac{TN}{TN + FP}.$$

  • The positive predictive value (PPV) is calculated as follows,

$$\text{PPV} = \frac{TP}{TP + FP}.$$

  • The negative predictive value (NPV) is calculated as follows,

$$\text{NPV} = \frac{TN}{TN + FN}.$$

  • The F-score, which measures overall accuracy, is calculated as follows,

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{PPV} \cdot \text{Sensitivity}}{\beta^2 \cdot \text{PPV} + \text{Sensitivity}}.$$

    The F-score ranges from 0 to 1 and is controlled by β, which is chosen such that sensitivity is considered β times as important as the PPV. Because our purpose is to accurately predict a ‘very bad’ event and such events are rare, we set β = 2 (Sasaki, 2007).
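These measures are straightforward to compute from the four cells of the confusion matrix. The sketch below is illustrative: the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the paper.

```python
from typing import Dict


def _ratio(num: float, den: float) -> float:
    """Safe division; returns 0.0 for an empty denominator (a convention)."""
    return num / den if den else 0.0


def f_beta(ppv: float, sensitivity: float, beta: float = 2.0) -> float:
    """F-score weighting sensitivity beta times as much as the PPV."""
    b2 = beta ** 2
    return _ratio((1.0 + b2) * ppv * sensitivity, b2 * ppv + sensitivity)


def classification_metrics(tp: int, fn: int, fp: int, tn: int,
                           beta: float = 2.0) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, NPV and F-score from a confusion matrix."""
    sensitivity = _ratio(tp, tp + fn)
    ppv = _ratio(tp, tp + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": _ratio(tn, tn + fn),
        "f_score": f_beta(ppv, sensitivity, beta),
    }


# Plugging in the PPV and sensitivity reported for TSLE in Guro (Table 6).
guro_f2 = round(f_beta(ppv=0.463, sensitivity=0.920, beta=2.0), 3)
```

With β = 2, the rounded PPV and sensitivity reported for TSLE in Guro give an F-score of about 0.768, which agrees with the value in Table 6.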

5.2. Comparison methods

Three conventional classification models are applied to evaluate the relative performance of the proposed model. As a representative non-parametric ensemble algorithm, the random forest (RF) is known to be robust to outliers and performs well in many classification problems. The MLP and LSTM models are artificial neural networks that are widely used in various applications. In particular, variants of the LSTM have been used to predict PM2.5 because the LSTM is well suited to time series data.

The details of each model are summarized in Table 5. The hyperparameters were selected by trial and error. For example, we considered 400, 600, and 800 trees for the RF, and chose 800 as the optimal value.

The threshold commonly used to classify an observation as positive is 0.5. However, this value may be inappropriate, considering the rarity of the ‘very bad’ events in the data (Zou et al., 2016). Therefore, we considered threshold values from 0.01 to 1 in the test data and chose the threshold that provides the best F-score (Lakshmi and Prasad, 2014).
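The threshold search for the comparison methods can be sketched as a simple grid sweep. This is an illustration under assumptions: the function name `best_threshold` is hypothetical, the grid step of 0.01 is assumed (the text gives only the range), ties are resolved in favour of the smallest threshold, and the predicted probabilities below are made up.

```python
from typing import Sequence, Tuple


def best_threshold(probs: Sequence[float], labels: Sequence[int],
                   beta: float = 2.0) -> Tuple[float, float]:
    """Sweep thresholds 0.01, 0.02, ..., 1.00 and return (threshold, best F-score)."""
    b2 = beta ** 2
    best_t, best_score = 0.01, -1.0
    for step in range(1, 101):
        t = step / 100
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        ppv = tp / (tp + fp) if tp + fp else 0.0
        sens = tp / (tp + fn) if tp + fn else 0.0
        denom = b2 * ppv + sens
        score = (1.0 + b2) * ppv * sens / denom if denom else 0.0
        if score > best_score:          # strict '>' keeps the smallest best threshold
            best_t, best_score = t, score
    return best_t, best_score


# Made-up predicted probabilities for five time points, two of them 'very bad'.
threshold, score = best_threshold([0.02, 0.05, 0.10, 0.30, 0.40], [0, 0, 0, 1, 1])
```

With rare positives, classifier outputs tend to sit well below 0.5, so the selected threshold is usually much smaller than the default, as in this toy example.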

5.3. Results

The results from three districts, Guro, Yangcheon, and Yeongdeungpo, are presented in Table 6. The proposed three-stage model with lasso ensemble (TSLE) method provides the highest PPVs and comparable sensitivity values, and works best according to the F-score. Although the sensitivity values for the MLP are higher than those for the other models for Yangcheon and Yeongdeungpo, the MLP has the lowest PPVs, which indicates that false alarms occur frequently. This outcome implies that the proposed method has relatively balanced performance for predicting ‘very bad’ events. However, all models perform well in terms of the NPV and specificity, because of the extremely imbalanced data.

In Figure 4, we plot the F-scores from the three-stage model with lasso regression (TSL) before the ensemble step. A specific quantile level may provide the best result, but the best level is inconsistent across districts. For example, the TSL prediction with τn = 0.990 has the highest F-score in Guro, whereas the highest value occurs with τn = 0.999 in Yangcheon. The ensemble technique therefore avoids this selection problem and offers superior performance overall.

We obtained similar results for all 25 districts (not shown), and we plotted the average measures of all districts in Figure 5. Overall, the proposed TSLE method works best for all five measures.

As the lasso regression performs the variable selection, some variables are not selected in the regression model. Therefore, we present the proportion of the selection for each variable in Table 7. Temperature, humidity, and PM10 were selected in most models, and NO2, O3 and SO2 were rarely selected.

6. Conclusion

In this paper, we consider the prediction of extreme values of PM2.5 in Seoul, South Korea. In contrast to conventional mean-based models, the proposed method is based on quantile regression combined with extreme value theory. Therefore, the proposed model predicts the extremely high values of PM2.5 especially well. Moreover, we added the lasso penalty term to the quantile model, so it also performs variable selection. Based on the statistical measures of the performance of a binary classification test, such as the sensitivity, PPV, and F-score, the proposed method works well for the 25 districts in Seoul, South Korea.

We expect that the performance of the proposed model would improve by adding meteorological variables from China, which significantly affect the atmospheric conditions in South Korea. In addition, the data file and R code for implementation are provided at https://github.com/SaeSimcheon/Extreme-PM2.5-prediction.

Fig. 1. Variables in 25 districts in Seoul, South Korea, measured at the marked stations.
Fig. 2. Four-hour average values of PM2.5 from spring 2015 to spring 2020 in the Gangnam district. Dashed horizontal lines indicate 76 μg/m3, the threshold value for the ‘very bad’ category.
Fig. 3. Correlation matrix between the variables in the Gangnam district.
Fig. 4. F-score values for three districts.
Fig. 5. Average measures for 25 districts.

Table 1

Data                  Variable                     Interval   Source
Meteorological data   Precipitation, Wind Speed    Hourly     South Korea Meteorological Administration
Air pollution data    PM2.5

Table 2

Category table for PM2.5 levels in South Korea

Category                     Good      Normal     Bad        Very bad
Daily PM2.5 level (μg/m3)    0 ~ 15    16 ~ 35    36 ~ 75    76 ~

Table 3

Number of time points (%) in each categorized PM2.5 level in 25 districts in Seoul

District      PM2.5 < 76    76 ≤ PM2.5    District        PM2.5 < 76    76 ≤ PM2.5
Gangnam       3200 (97%)    98 (3%)       Seodaemun       3210 (97%)    88 (3%)
Gangdong      3229 (98%)    69 (2%)       Seocho          3208 (97%)    90 (3%)
Gangbuk       3236 (98%)    62 (2%)       Seongdong       3191 (97%)    107 (3%)
Gangseo       3229 (98%)    69 (2%)       Seongbuk        3251 (99%)    47 (1%)
Gwanak        3194 (97%)    104 (3%)      Songpa          3239 (98%)    59 (2%)
Gwangjin      3203 (97%)    95 (3%)       Yangcheon       3217 (98%)    81 (2%)
Guro          3207 (97%)    91 (3%)       Yeongdeungpo    3183 (97%)    115 (3%)
Geumcheon     3225 (98%)    73 (2%)       Yongsan         3213 (97%)    85 (3%)
Nowon         3224 (98%)    74 (2%)       Eunpyeong       3232 (98%)    66 (2%)
Dobong        3244 (98%)    54 (2%)       Jongno          3215 (97%)    83 (3%)
Dongdaemun    3221 (98%)    77 (2%)       Jung            3243 (98%)    55 (2%)
Dongjak       3211 (97%)    87 (3%)       Jungnang        3211 (97%)    87 (3%)
Mapo          3185 (97%)    113 (3%)

Table 4

Confusion matrix


Positive (‘Very bad’ ) Negative (‘Not very bad’ )
Actual Positive (‘Very bad’ ) True positive (TP) False negative (FN)
Negative (‘Not very bad’ ) False positive (FP) True negative (TN)

Table 5

Hyper parameters and architectures of comparison methods

RF

Hyperparameter    Value
n_estimators      800
max features      3

MLP

Hyperparameter    Value
Optimizer         Adam
Batch_size        16
Loss function     binary cross entropy
Learning rate     0.001

Layer type       Input    Output    Activation
Dense layer      9        64        ReLU
Dropout layer    64       64        ·
Dense layer      64       1         Sigmoid

LSTM

Hyperparameter    Value
Optimizer         Adam
Batch_size        16
Loss function     binary cross entropy
Learning rate     0.001
Input window      6

Layer type     Input    Output    Activation
LSTM layer     (6,9)    64        tanh
Dense layer    64       1         Sigmoid

Table 6

Performance table for Guro, Yangcheon, and Yeongdeungpo (Bold indicates best performance)

District        Method    F-score    PPV      NPV      Sensitivity    Specificity
Guro            TSLE      0.768      0.463    0.997    0.920          0.962
                RF        0.545      0.241    0.992    0.797          0.912
                MLP       0.284      0.098    0.981    0.541          0.826
                LSTM      0.331      0.137    0.981    0.514          0.887

Yangcheon       TSLE      0.584      0.692    0.984    0.562          0.991
                RF        0.341      0.532    0.974    0.312          0.990
                MLP       0.122      0.028    0.869    0.712          0.072
                LSTM      0.279      0.110    0.976    0.450          0.863

Yeongdeungpo    TSLE      0.727      0.568    0.990    0.781          0.973
                RF        0.561      0.356    0.984    0.656          0.946
                MLP       0.464      0.161    0.993    0.875          0.791
                LSTM      0.498      0.253    0.983    0.656          0.911

Table 7

Proportion of selection for each variable in the three-stage lasso models

Variable         Proportion of selection
Precipitation    0.664
Temperature      0.976
Humidity         0.980
Wind Speed       0.672
CO               0.343
NO2              0.014
O3               0.031
PM10             1
SO2              0

  1. Alhamzawi R, Yu K, and Benoit DF (2012). Bayesian adaptive Lasso quantile regression. Statistical Modelling, 12, 279-297.
  2. Bae MA, Kim BU, Kim HC, and Kim ST (2020). A multiscale tiered approach to quantify contributions: A case study of PM2.5 in South Korea during 2010–2017. Atmosphere, 11, 141.
  3. Burnett RT, Pope CA, and Ezzati M, et al. (2014). An integrated risk function for estimating the global burden of disease attributable to ambient fine particulate matter exposure. Environmental Health Perspectives, 122, 397-403.
  4. Choi JK, Heo JB, Ban SJ, Yi SM, and Zoh KD (2012). Chemical characteristics of PM2.5 aerosol in Incheon, Korea. Atmospheric Environment, 60, 583-592.
  5. D’Amico G, Petroni F, and Prattico F (2015). Wind speed prediction for wind farm applications by extreme value theory and copulas. Journal of Wind Engineering and Industrial Aerodynamics, 145, 229-236.
  6. Dong M, Yang D, Kuang Y, He D, Erdal S, and Kenski D (2009). PM2.5 concentration prediction using hidden semi-Markov model-based times series data mining. Expert Systems with Applications, 36, 9046-9055.
  7. Hyndman R, Koehler AB, Ord JK, and Snyder RD (2008). Forecasting with Exponential Smoothing: The State Space Approach, Springer Science & Business Media.
  8. Lakshmi TJ and Prasad Ch (2014). A study on classifying imbalanced datasets. Proceedings of the 2014 First International Conference On Networks and Soft Computing (ICNSC2014), 141-145.
  9. Ordieres JB, Vergara EP, Capuz RS, and Salazar RE (2005). Neural network prediction model for fine particulate matter PM2.5 on the US-Mexico border in El Paso (Texas) and Ciudad Juárez (Chihuahua). Environmental Modelling and Software, 20, 547-559.
  10. Pui DYH, Chen S-C, and Zuo Z (2014). PM2.5 in China: Measurements, sources, visibility and health effects, and mitigation. Particuology, 13, 1-26.
  11. Qin S, Liu F, Wang C, Song Y, and Qu J (2015). Spatial-temporal analysis and projection of extreme particulate matter (PM10 and PM2.5) levels using association rules: A case study of the Jing-Jin-Ji region, China. Atmospheric Environment, 120, 339-350.
  12. Qiao W, Tian W, Tian Y, Yang Q, Wang Y, and Zhang J (2019). The forecasting of PM2.5 using a hybrid model based on wavelet transform and an improved deep learning algorithm. IEEE Access, 7, 142814-142825.
  13. Quintela-del-Río A and Francisco-Fernández M (2011). Nonparametric functional data estimation applied to ozone data: Prediction and extreme value analysis. Chemosphere, 82, 800-808.
  14. Ryou HG, Heo JB, and Kim SY (2018). Source apportionment of PM10 and PM2.5 air pollution, and possible impacts of study characteristics in South Korea. Environmental Pollution, 240, 963-972.
  15. Song C, He J, and Wu L, et al. (2017). Health burden attributable to ambient PM2.5 in China. Environmental Pollution, 223, 575-586.
  16. Sasaki Y (2007). The truth of the F-measure. Retrieved May 26th, 2021, from https://www.cs.odu.edu/~mukka/cs795sum09dm/Lecturenotes/Day3/F-measure-YS-26Oct07.pdf
  17. Schaumburg J (2012). Predicting extreme value at risk: Nonparametric quantile regression with refinements from extreme value theory. Computational Statistics and Data Analysis, 56, 4081-4096.
  18. Song YZ, Yang HL, Peng JH, Song YR, Sun Q, and Li Y (2015). Estimating PM2.5 concentrations in Xi’an city using a generalized additive model with multi-source monitoring data. PLoS One, 10, e0142149.
  19. Stracquadanio M, Apollo G, and Trombini C (2007). A study of PM2.5 and PM2.5-associated polycyclic aromatic hydrocarbons at an urban site in the Po Valley (Bologna, Italy). Water, Air, and Soil Pollution, 179, 227-237.
  20. Sun Y, Wong AKC, and Kamel MS (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23, 687-719.
  21. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267-288.
  22. Weissman I (1978). Estimation of parameters and large quantiles based on the k largest observations. Journal of the American Statistical Association, 73, 812-815.
  23. Wu Y and Liu Y (2009). Variable selection in quantile regression. Statistica Sinica, 801-817.
  24. Wang HJ, Li D, and He X (2012). Estimation of high conditional quantiles for heavy-tailed distributions. Journal of the American Statistical Association, 107, 1453-1464.
  25. Wang L, Wu Y, and Li R (2012). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107, 214-222.
  26. Wang HJ and Li D (2013). Estimation of extreme conditional quantiles through power transformation. Journal of the American Statistical Association, 108, 1062-1074.
  27. WHO (2018). 9 out of 10 people worldwide breathe polluted air, but more countries are taking action. Retrieved November 4th, 2021, from https://www.who.int/news/item/02-05-2018-9-out-of-10-people-worldwide-breathe-polluted-air-but-more-countries-are-taking-action
  28. Zhang H, Wang Y, Hu J, Ying Q, and Hu X-M (2015). Relationships between meteorological parameters and criteria air pollutants in three megacities in China. Environmental Research, 140, 242-254.
  29. Zhang B, Jiao L, Xu G, Zhao S, Tang X, Zhou Y, and Gong C (2018). Influences of wind and precipitation on different-sized particulate matter concentrations (PM2.5, PM10, PM2.5–10). Meteorology and Atmospheric Physics, 130, 383-392.
  30. Zou Q, Xie S, Lin Z, Wu M, and Ju Y (2016). Finding the best classification threshold in imbalanced classification. Big Data Research, 5, 2-8.