
Recently, air pollution has been a major environmental concern in Korea. Particularly, Seoul, the capital city of Korea with about 10 million residents, ranked 1217 out of more than 8,000 cities in the world in order of poor air quality, based on data collected during 2017–2022 by IQAir, the Swiss air quality technology company (www.iqair.com).
Fine dust is classified into primary fine dust and secondary fine dust according to the generation process. Primary fine dust is emitted through fuel combustion or vehicle exhaust, and secondary fine dust is generated through chemical reactions between already emitted substances (Jang, 2016). Studies have shown that fine dust causes cardiopulmonary diseases, lung cancer, respiratory diseases, and stroke (Shin, 2007; Han
To prevent the adverse effects of fine dust on the human body, it is important to predict the concentration of fine dust in advance in order to minimize exposure to fine dust and to identify factors that are closely related to fine dust in order to reduce concentration of fine dust (Lim, 2019; Cha and Kim, 2018; Shin
Various approaches to fine dust analysis have been proposed by many researchers. However, fine dust is affected by many factors such as weather conditions, population density, natural environment, industrial environment, and geographical conditions. Also, currently fine dust is measured every hour at observatories, producinpg a large amount of data. Therefore, there may be limitations in analyzing such a large amount of complex data using traditional statistical methods and fine dust analysis using machine learning techniques has recently attracted great attention of researchers (Cha and Kim, 2018; Cho
Bayesian network is a type of machine learning technique, which is a probabilistic graphical model that uses conditional independence/dependence to create a directed acyclic graph (DAG) that expresses the relationship between variables. Bayesian network is a model often used for large amount of complex data in medical, environmental, and transportation area. Recent studies on predicting diseases using Bayesian network include Oh
In this study, we constructed Bayesian network model for predicting concentration of fine dust in Seoul, Korea, identifying important factors that affect fine dust, and visualizing the relationship between variables. We used PM10 concentration as a measure of the concentration of fine dust. PM10 is dust with a diameter of 10 or less floating in the air, and it is a particle so thin and small that we cannot see (Korean Ministry of Environment, 2014). As predictors, we used variables related to air quality (Lim, 2019; Cho
This article consists of of 5 sections. Section 2 briefly describes the Bayesian network models used in this study, and Section 3 presents a description and preparation of the data. Section 4 describes construction of the Bayesian network prediction model and interpretation of the network, and Section 5 presents conclusions and comments.
A Bayesian network is a probabilistic graphical model based on a DAG. The nodes of DAG represent variables, and the connected arcs represent conditional dependencies between variables (Pearl, 2011). The joint probability distribution for the nodes of the Bayesian network can be expressed as
Bayesian networks make it easy to identify the conditional independence/dependence between variables by visualizing the relationship between variables, and have the advantage of being able to utilize expert knowledge when creating DAGs (Yang and Zhang, 2000). Especially, Bayesian networks can be used as a classifier to determine the probability distribution of a class node based on the conditional probability of other attributes. The classification is based on the highest posterior probability distribution. Bayesian networks are useful for classification because they allow for a fast and intuitive understanding of the relationships between nodes (Ang
Commonly used Bayesian network structures in classification are Naive Bayes (NB), Tree Augmented Naive Bayes (TAN), and General Bayesian Network (GBN) (Oh
NB is the simplest Bayesian network structure and is widely used in various fields such as medical and computer networks. NB assumes that all predictor variables affect the response variable independently, and treats only the response variable as the only parent node of each predictor variable. It is easy to interpret due to the simple structure. However, it rarely reflects the true relationship between variables in real data (Ang
TAN relaxes the structural restriction of NB by allowing the dependencies between predictor variables, hence the reliability of the structure and classification accuracy are usually higher than that of NB (Ang
GBN has no limit on the number of parent nodes in the DAG structures. In addition, it does not distinguish between the response and the predictor variable, and the response variable is considered as another predictor variable (Madden, 2009). While GBN has the advantage of allowing more flexibility, it frequently fails to provide an interpretive model in real data. The structure of GBN is illustrated in Figure 3. There are several ways to find a GBN structure suitable for a given data. In this article, the
Data on variables related to air quality - the concentrations of PM10 (
Data on meteorological variables - temperature (°C), wind speed (m/s), wind direction (16 directions), and humidity (%) - were obtained from the Automated Synoptic Observing System (ASOS) data of the Korea Meteorological Administration Meteorological Data Open Portal (https://data.kma.go.kr/). Specifically, we used hourly data collected from April 1, 2018 to March 31, 2021 at Seoul Observatory of Korea Meteorological Administration, 52 Songwol-gil, Jongno-gu, Seoul, which is very close to the station where the observations of the air quality variables were collected.
Considering that PM10, O3, NO2, CO, SO2, temperature, wind.speed, and humidity data were collected over time, missing values of these variables were imputed using linear interpolation (Roh, 2017), using na.interp function in R package forecast. Missing values of a categorical variable wind.direction were replaced with the most recent values.
Since weather forecasts usually predict the PM10 concentration of the next day, at each time point the concentration of PM10 after 24 hours was used as the target variable and it was named fPM10. As predictor variables, air quality variables (PM10, O3, NO2, CO, SO2) and meteorological variables (temperature, wind speed, wind direction, humidity) at each time point were used as predictors. The concentration of PM10 at each time point was named cPM10 to distinguish it from fPM10. The letters “c” and “f” in cPM10 and fPM10 stand for “current” and “future”, respectively. The names and descriptions of the variables are given in Table 1.
For categorical variables wind.direction, time, and month, we re-categorized them to reduce the number of categories. For wind.direction, the original 16 categories were merged into 8 categories - E (East), W (West), S (South), N (North), NE (Northeast), NW (Northwest), SE (Southeast), SW (Southwest). For time variable, its original 24 categories, from 0 to 23, were converted into 4 time zones - dawn (0–5), morning (6–11), afternoon (12–17), night (18–23), since the variation of fine dust concentration in each time zone is not large according to the Meteorological Administration of Korea.
To see the seasonal effect on PM10, the daily average PM10 concentration in 2019 was plotted (Figure 4). As shown in Figure 4, PM10 was highest from November to March, followed by April and May, and generally low from June to October. Therefore, we converted the variable month to season which has 3 categories; spring (Apr–May), summer-fall (Jun–Oct), winter (Nov–Mar).
After these pre-processing of data, the final data consisted of
Since Bayesian networks for mixed cases containing both continuous and discrete varaibles are very complex and it is difficult to find softwares that can be easily applied, we applied a discrete Bayesian network in this paper. To apply the discrete Bayesian network, the continuous variables fPM10, cPM10, O3, NO2, CO, SO2, temperature, wind.speed, and humidity were discretized as follows. For fPM10, cPM10, O3, and NO2, we categorised them as (good, normal, bad, very bad) following the standards given by Air Korea, and then merged ‘bad’ and ‘very bad’ into ‘bad’ since ‘very bad’ category had very small number of observations. CO, temperature, wind.speed, and humidity were discretized using quartiles Q1, Q2, Q3 as thresholds so that the probability of each category was similar. For SO2, since more than 25% of observed values of SO2 were 0.003 making Q1 = Q2 = 0.003, SO2 was descritized into three categories using Q1 and Q3 as thresholds. Details on discretization criteria for continuous variables are given in Table 1.
Among the final 26,244 observations, fPM10 was ‘good’ in 11,969 cases (45.69%), ‘normal’ in 12,644 cases (48.18%), and ‘bad’ in 1,631 cases (6.21%). Table 2 summarizes basic statistics of each predictor variable by fPM10 status. According to the status of fPM10, the mean ± standard deviation (SD) is given for each continuous variable, and the number of observations and the percentage (%) for each category is given for each discrete variable. The last column of Table 2 shows the
As the average concentrations of the air quality variables cPM10, O3, NO2, CO, and SO2 increased, fPM10 tended to increase. This implies that future PM10 concentration were positively correlated with current air quality variables. fPM10 concentrations were generally low in summer-fall and high in spring and winter, This phenominon was also reported in Oh and Park (2022) and it may be because of migrating cyclones and dry land surface in spring, and of increased fuel consumption in winter. In addition, the concentration of fPM10 was high when the wind direction was Northwest, Southwest, or West since transboundary air pollutants from China has significant effect on fine dust in Korea (Sohn and Kim, 2015; Kim, 2019). On the other hand, the effect of wind speed on fPM10 was relatively weak in terms of Cramér’s V (= 0.0413) which is a measure of association between two categorical variables.
Figure 5 summarizes the whole process of data collection, pre-processing, and discretization.
We considered the three Bayesian network structures introduced in Section 2 - NB, TAN, GBN - as candidate DAG structures in predicting fPM10. For each of the Bayesian network structure, we repeated 10-fold cross validation 10 times and calculated accuracy and AUC. Accuracy is the relative proportion of observations in the entire data whose predicted and actual values are the same, and AUC represents the area under the receiver operating characteristic (ROC) curve. Table 3 presents the mean and SD of accuracy and AUC for each Bayesian network structure. It can be seen that TAN had the best predictive performance in terms of accuracy and AUC. Thus, TAN was selected as the structure of the Bayesian network.
The final Bayesian network model was obtained from model averaging 100 TAN models from the 10 repetitions of 10-fold cross validation. Figure 6 shows the DAG structure of the final TAN prediction model.
We calculated the mutual information (MI) scores between the response variable fPM10 and the predictor variables. MI score is a quantitative measure of the degree of interaction between each node and its parent node in a network. In other words, MI(
Table 4 shows the MI scores between fPM10 and each predictor, computed using the mutinformation function of the R infotheo package.
The top three predictors with high MI scores are cPM10, season, CO. The variable cPM10 has the largest MI score, showing strong temporal effect on PM10 concentration. The variable season has the second largest MI score. This is in good agreement with the significant seasonal effect shown in Figure 4. Also, the percentages of the levels of ‘good’ or ‘normal’ of PM10 concentration in our data is 25 ~ 35% in spring and winter while it is about 68% in summer-fall. The variable CO has the third largest MI score. Previous studies supporting this result are as follows. According to Park
It should be noted however that MI score is a measure of a pairwise association between a predictor variable and the target variable, and it does not reflect the effect of the predictor variable when other variables are controlled for. When all the predictor variables are considered together, the order of strength of the effect of variables may differ from the order of MI scores.
Figure 7 visualizes a tree structure of the relationship between predictor variables in the final Bayesian network prediction model. From the network structure given in Figure 7, we may get useful information on causal relationships from the DAG structure. For instance, Figure 7 shows the well known causal effects of season on temperature, and of temperature on O3 (Yang, 2019; Korean Ministry of Environment, 2017). It also shows the effect of season on CO, and of CO on cPM10.
The two variables directly connected in Figure 7 have an interactive effect on the target variable fPM10, when other variables are adjusted for. For instance, cPM10 and CO have an interactive effect on fPM10. To see this more clearly, we plotted a bar plot of the predictive probability of fPM10 given cPM10 (Figure 8). As expected, as the concentration of cPM10 increases the probabilities of ‘normal’ and ‘bad’ for fPM10 increase. We also plotted a bar plot of the predictive probability of fPM10 given cPM10 and CO (Figure 9). It is clearly seen that, the effect of CO on fPM10 varies wildly depending on the level of cPM10 (shown at the top of the figure). This implies a significant interactive effect of cPM10 and CO on fPM10.
Fine dust is a very serious environmental problem in Seoul, Korea, a metropolitan city with a population of about 10million. Studies have revealed that fine dust is related to various diseases. Therefore, it is important to predict the concentration of fine dust to minimize exposure to fine dust, and to identify factors affecting fine dust to reduce fine dust. To this end, we developed a PM10 prediction model using a Bayesian network that was trained from air quality measurement data and meteorological data observed in Seoul between 2018 and 2021. A Bayesian network is a DAG-based probabilistic model, and has the advantage of good interpretability because it is possible to identify relationship between variables and discover causative factors using the generated DAG. In this study, we considered three commonly used Bayesian network structures - NB, TAN, GBN - and selected TAN as our prediction model structure based on prediction performance. The final Bayesian network prediction model was obtained from model averaging of TAN models from cross-validations.
The proposed Bayesian network prediction model visualized the relationship between variables in a simple and interpretable way. In terms of MI scores, current PM10 concentration, season, and CO level were highly effective variables in 24 hours ahead prediction of PM10 concetration. The model also showed that there were significant interactive effects among predictor variables in prediction of PM10 concentration.
There exist other studies on predicting PM10 concentration and identifying significant factors. However, many of the previous studies are based on linear or logistic regression models (see Park
In this study, we considered three structures of Bayesian network (NB, TAN, GBN) and applied a data-driven methodology to construct a Bayesian network model. Incorporating expert knowledge in selection of predictor variables and/or in construction of a DAG structure may improve the prediction performance and give more practically meaningful model.
We used air quality data observed at one location (Jongno-gu, Seoul), hence care needs to be taken to generalize the results of this studye to other regions of Seoul. Extending the method proposed in this study to handle data from multiple observatories and take into account of regional variables such as traffic volume, number of factories in the region is our future research interest.
Description and discretization of variables
Variable | Description | Category |
---|---|---|
fPM10 | PM10 concentration after 24 hours ( |
good(0–30); normal(31–80); bad(81-) |
cPM10 | PM10 concentration ( |
good(0–30); normal(31–80); bad(81-) |
O3 | O3 concentration (ppm) | good(0–0.03); normal(0.031–0.09); bad(0.091-) |
NO2 | NO2 concentration (ppm) | good(0–0.03); normal(0.031–0.06); bad(0.061-) |
CO | CO concentration (ppm) | < Q1, Q1-Q2, Q2-Q3, > Q3 |
SO2 | SO2 concentration (ppm) | < Q1, Q1-Q3, > Q3 |
Temperature | temperature (°C) | < Q1, Q1-Q2, Q2-Q3, > Q3 |
Wind.speed | wind speed (m/s) | < Q1, Q1-Q2, Q2-Q3, > Q3 |
Humidity | humidity (%) | < Q1, Q1-Q2, Q2-Q3, > Q3 |
Season | season | spring(Apr-May); summer-fall(Jun-Oct); winter(Nov-Mar) |
Time | time zone | dawn; morning; afternoon; night |
Wind.direction | wind direction | E; W; S; N; NE; NW; SE; SW |
Basic statistics of predictor variables according to fPM10
fPM10 | ||||||
---|---|---|---|---|---|---|
Total (N = 26244) | Good (N = 11969) | Normal (N = 12644) | Bad (N = 1631) | Assoc |
||
Variable | Mean ± SD or N (%) | |||||
cPM10 | 38.44 ± 26.62 | 44.69 ± 24.93 | 27.56 ± 18.66 | 69.78 ± 43.35 | <0.001 | |
O3 | 0.0223 ± 0.0174 | 0.0235 ± 0.0191 | 0.0215 ± 0.0153 | 0.0188 ± 0.0172 | <0.001 | |
NO2 | 0.0312 ± 0.0152 | 0.0327 ± 0.0153 | 0.028 ± 0.014 | 0.0438 ± 0.0148 | <0.001 | |
CO | 0.5163 ± 0.2119 | 0.5538 ± 0.2137 | 0.4452 ± 0.1652 | 0.7463 ± 0.2591 | <0.001 | |
SO2 | 0.0033 ± 0.0010 | 0.0035 ± 0.0011 | 0.003 ± 8e-04 | 0.0040 ± 0.0013 | <0.001 | |
Temperature | 13.47 ± 10.69 | 11.16 ± 10.60 | 16.59 ± 10.36 | 8.44 ± 6.59 | <0.001 | |
Wind speed | 2.0624 ± 1.1494 | 2.0979 ± 1.1501 | 2.0435 ± 1.1523 | 1.9256 ± 1.1097 | <0.001 | |
Humidity | 59.82 ± 20.33 | 56.00 ± 20.26 | 64.42 ± 19.39 | 55.67 ± 20.57 | <0.001 | |
Season | spring | 4386 (16.71 %) | 1512 (12.63 %) | 2631 (20.81 %) | 243 (14.9 %) | <0.001 |
summer-fall | 11001 (41.92 %) | 7533 (62.94 %) | 3355 (26.53 %) | 113 (6.93 %) | ||
winter | 10857 (41.37 %) | 2924 (24.43 %) | 6658 (52.66 %) | 1275 (78.17 %) | ||
Time | dawn | 6534 (24.9 %) | 3450 (28.82 %) | 2784 (22.02 %) | 300 (18.39 %) | <0.001 |
morning | 6570 (25.03 %) | 2991 (24.99 %) | 3175 (25.11 %) | 404 (24.77 %) | ||
afternoon | 6570 (25.03 %) | 2627 (21.95 %) | 3426 (27.1 %) | 517 (31.7 %) | ||
night | 6570 (25.03 %) | 2901 (24.24 %) | 3259 (25.78 %) | 410 (25.14 %) | ||
Wind direction | E | 1300 (4.95 %) | 864 (7.22 %) | 368 (2.91 %) | 68 (4.17 %) | <0.001 |
W | 3996 (15.23 %) | 1570 (13.12 %) | 2146 (16.97 %) | 280 (17.17 %) | ||
S | 549 (2.09 %) | 236 (1.97 %) | 255 (2.02 %) | 58 (3.56 %) | ||
N | 939 (3.58 %) | 574 (4.8 %) | 326 (2.58 %) | 39 (2.39 %) | ||
SE | 1454 (5.54 %) | 882 (7.37 %) | 509 (4.03 %) | 63 (3.86 %) | ||
SW | 4211 (16.05 %) | 1682 (14.05 %) | 2144 (16.96 %) | 385 (23.61 %) | ||
NE | 6156 (23.46 %) | 3222 (26.92 %) | 2554 (20.2 %) | 380 (23.3 %) | ||
NW | 7639 (29.11 %) | 2939 (24.56 %) | 4342 (34.34 %) | 358 (21.95 %) |
Assoc
Classification accuracy and AUC of Bayesian networks
Classifier | |||
---|---|---|---|
NB | TAN | GBN | |
Accuracy (%) | 63.59 ± 0.95 | 66.65 ± 0.86 | 65.92 ± 0.75 |
AUC | 0.7577 ± 0.056 | 0.7748 ± 0.0483 | 0.7567 ± 0.0518 |
Mutual information (MI) scores between PM10 and predictor variables
Variable | MI |
---|---|
cPM10 | 0.1013 |
Season | 0.0887 |
CO | 0.0615 |
Temperature | 0.0535 |
SO2 | 0.0370 |
NO2 | 0.0261 |
Humidity | 0.0187 |
Wind.direction | 0.0187 |
Time | 0.0047 |
O3 | 0.0041 |
Wind.speed | 0.0000 |