Interest in PM_{10} concentrations has increased greatly in Korea due to recent increases in air pollution levels. Therefore, we consider a forecasting model for next-day PM_{10} concentration based on the principal elements of air pollution, weather information, and Beijing PM_{2.5}
Recently, Korea scored 45.51 out of 100 in air quality in the “Environmental Performance Index 2016”, announced by a joint research team at Yale University and Columbia University, ranking 173rd out of 180 countries. People in Korea are now more interested in PM_{10} levels, and fine particulate pollution (PM_{10}) is widely covered in the media. PM_{10} is fine particulate matter with an aerodynamic diameter of up to 10 μm.
The Korea Ministry of Environment provides real-time PM_{10} concentrations and a next-day forecast with four classes: “good” for 0 to 30, “normal” for 31 to 80, “bad” for 81 to 150, and “very bad” for more than 150. Recently, there have been many days in Korea with PM_{10} classified as bad or very bad; therefore, people want to know what causes the high PM_{10} concentration. Many factors can produce a high PM_{10} concentration; however, most people believe that the major causes are severe air pollution from China, thermal power plants on Korea’s west coast, and old diesel vehicles.
We examine which factors affect the PM_{10} concentration and build a forecast model for the next-day mean and max PM_{10} concentration. We believe this analysis will be helpful to the public and policy makers in Korea. There have been several studies on PM_{10} forecasting. Sayegh
In this paper, we use seven different models to forecast PM_{10} concentrations: two time series models (ARIMA, ARFIMA) and five regression models (linear regression, Randomforest, support vector machine (SVM), boosting, neural network). We use several explanatory variables in our analysis. They are PM_{2.5}
The paper is organized as follows. In Section 2, we describe the data and the variables used in the analysis; since all data are time series, we explain the preparation of the variables in detail. In Section 3, we compare several PM_{10} forecasting models and find the best model for both regression and classification. Section 4 provides concluding remarks.
In this section, we describe the dataset used in forecasting daily PM_{10} and the preparation of the variables.
We consider various explanatory variables for forecasting daily PM_{10} concentration. Table 1 provides the descriptions and types of the variables. Data were collected from 2011/08/01 to 2015/07/31.
PM_{10} levels are recorded hourly, but we use the daily PM_{10} mean and max as the response variables in our analysis; the hourly PM_{10} values are instead used as explanatory variables in the model. Air pollution data are obtained from Air Korea (www.airkorea.or.kr/realSearch), which reports real-time air pollution levels recorded by the Korea Environmental Management Corporation. Meteorological information is obtained from the weather information portal (data.kma.go.kr), and yellow dust information is obtained from the Korea Meteorological Administration. Beijing PM_{2.5}
Some variables contain missing values. For the meteorological data, precipitation, maximum fresh snow cover, and duration of fog are recorded as missing by the Meteorological Administration when there was no rain, snow, or fog; these missing values are replaced by 0. Sea-level pressure and hours of daylight have one missing value, and 1-hour solar radiation and solar radiation have 13 missing values. These are imputed by the K-nearest neighbor (KNN) method, which replaces each missing value using only the K nearest observations.
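As a concrete illustration of this step, the KNN imputation can be sketched as follows. This is a minimal NumPy version written by us, not the exact implementation used in the paper; it assumes each missing entry has at least K complete neighboring rows.

```python
import numpy as np

def knn_impute(X, k=3):
    """Replace NaN entries of a 2-D array by the mean of the K nearest
    rows, where distance is computed over the columns both rows observe.
    Assumes at least k rows observe each missing column."""
    X = X.astype(float).copy()
    n = X.shape[0]
    for i in range(n):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        obs = ~miss
        dists = np.full(n, np.inf)
        for j in range(n):
            # skip the row itself and rows that are also missing there
            if j == i or np.isnan(X[j][miss]).any():
                continue
            shared = obs & ~np.isnan(X[j])
            if shared.any():
                dists[j] = np.sqrt(np.mean((X[i, shared] - X[j, shared]) ** 2))
        nearest = np.argsort(dists)[:k]
        X[i, miss] = X[nearest][:, miss].mean(axis=0)
    return X
```

For example, a row whose solar radiation is missing is filled with the average solar radiation of the k days whose other weather readings are closest to it.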
The goal of our study is to forecast the daily mean and max PM_{10}, as well as the class of PM_{10}, using variables that are expected to affect the concentration of PM_{10}. In this section, we describe how we build each explanatory variable in our model. Let
Figure 1 displays the time series plots of daily mean and max PM_{10} concentration in Seoul from 2011/08/01 to 2015/07/31. Both of
Previous PM_{10} values are expected to influence the present PM_{10}; therefore, we consider them as explanatory variables. The period over which past PM_{10} affects current PM_{10} is determined by examining the ACF plot. Figure 2 shows the ACF plots of the original series and the differenced series for
Previous
where
PM_{10} levels from the previous day are also expected to strongly affect next-day PM_{10}; therefore, we investigate the hourly PM_{10} levels of the previous day further. Including all 24 hourly PM_{10} levels in the model is not appropriate, so we search for the best time intervals of previous hourly PM_{10} using tree models and various correlation plots. We found that four six-hour intervals are easy to interpret and among the best for forecasting next-day PM_{10} levels. We denote that
We also consider day and month effects for PM_{10}. PM_{10} concentration is anticipated to be high in spring due to yellow dust and on weekdays due to commuter traffic in Seoul. To confirm that PM_{10} actually differs by month and day, we made box plots of
Consequently, the explanatory variables for ΔPM_{10,}
Well-known air pollutants include ozone (O_{3}), carbon monoxide (CO), sulfur dioxide (SO_{2}), and nitrogen dioxide (NO_{2}). Carbon monoxide is generated by the incomplete combustion of carbon and is mainly emitted by transportation. Sulfur dioxide is released when fossil fuels that contain sulfur, such as coal and petroleum, are burned; it is mainly generated in power plants, heating equipment, and industrial processes. Nitrogen dioxide is generated by the oxidation of nitrogen monoxide, which comes from vehicle exhaust and power plants. We consider air pollutants as explanatory variables since the increase in automobile emissions and coal consumption is one of the main causes of PM_{10}.
We can use the daily mean, the daily max, or both as explanatory variables in our model. We examined the relation between these variables and
Air quality is influenced by wind, humidity, duration of fog, precipitation, and yellow dust. Among the meteorological elements, yellow dust is a phenomenon in which fine dust originating in China and Inner Mongolia blows in on the westerlies. A yellow dust watch is issued if the level is greater than 800 μg/m³ and is expected to last more than 2 hours. Since the yellow dust level is likely to affect the PM_{10} level, we include an indicator variable in our model: it is one if a yellow dust watch was issued on the previous day and zero otherwise. The other meteorological elements are also considered as explanatory variables, and these time series variables are selected through the same method as in Section 2.3.2. Table 2 presents the selected weather-related variables.
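The previous-day indicator described above amounts to a one-day lag of the watch flag; a sketch with made-up data (`watch` is a hypothetical daily series, not the paper's data):

```python
# Hypothetical daily series: 1 if a yellow dust watch was issued that day.
watch = [0, 0, 1, 1, 0, 0]

# dust[t] = 1 if a watch was issued on day t-1; the first day has no
# previous day, so we default it to 0.
dust = [0] + watch[:-1]
print(dust)  # [0, 0, 0, 1, 1, 0]
```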
The concentration of PM_{10} in Korea is conjectured to be affected by the air quality in China due to geographical proximity. To account for this influence, we consider the PM_{2.5}
This section explains how to construct an optimal forecasting model for
We have to consider several factors in order to find the best model: the time period and the window type of the training data. Since we have three years of data in the training set, we can fit a model using one, two, or three years of data. We can also use either a growing window or a moving window. A more detailed explanation follows. Assume that we have a dataset up to time
We forecast the daily mean PM_{10} in Section 3.1.1 and the daily max PM_{10} in Section 3.1.2. We forecast ΔPM_{10} first and then convert ΔPM_{10} into PM_{10}. We now explain how to find the optimal model in detail. Since we have four years of data (2011/08/01–2015/07/31), we use the first three years as a training set and the last year as a test set. The best model is the one with the minimum RMSE in the test set. We need to find the optimal tuning parameter for each method using only the training data; therefore, we partition the training data according to time periods. For example, for the 1-year time period, we use one year of data (2011/08/01–2012/07/31) for training and the last two years (2012/08/01–2014/07/31) for testing. For the 2-year time period, we use the first two years for training and the last year for testing. We cannot partition the training data for the 3-year time period since we have only three years of data; therefore, we use the optimal tuning parameter values from the 2-year time period model (Figure 6).
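A rolling one-step-ahead evaluation with the two window types can be sketched as follows. This is our own illustrative code with generic `fit`/`predict` callables standing in for the paper's seven models:

```python
import numpy as np

def rolling_rmse(y, fit, predict, train_len, window="moving"):
    """One-step-ahead evaluation: for each day t past train_len, fit on a
    window ending at t-1 and forecast y[t].
    window="moving": the window keeps a fixed length train_len;
    window="growing": the window always starts at observation 0."""
    errors = []
    for t in range(train_len, len(y)):
        start = 0 if window == "growing" else t - train_len
        model = fit(y[start:t])      # refit on the chosen window
        errors.append(predict(model) - y[t])
    return float(np.sqrt(np.mean(np.square(errors))))
```

For instance, a naive last-value forecaster would be scored as `rolling_rmse(y, lambda w: w[-1], lambda m: m, train_len=365)`; the paper's models would take the explanatory variables as well, but the window logic is the same.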
We forecast the mean PM_{10} using the seven models with the explanatory variables in Table 2 and compare their forecast performances. For each model, we searched for the optimal tuning parameters with the above procedure. For example, for the 2-year training set, we obtained the optimal tuning parameters as: for ARIMA, (
We cannot examine all possible regression models since there are many explanatory variables, so we use stepwise regression for variable selection; however, we found that making all variables available to the model does not necessarily give the best result. Therefore, we tried the three sets of PM_{10}-related variables described in Table 3: Set 1 contains all PM_{10}-related variables, Set 2 excludes the previous daily PM_{10} variables, and Set 3 contains only the previous hourly PM_{10} variables. Set 3 gives the best forecast performance, so we report the results using only Set 3 for the PM_{10}-related variables (Table 4).
The RMSEs do not differ significantly across time periods and window types; however, they differ greatly across forecast models. In particular, the regression models show better forecast performance than the time series models, which indicates that forecast performance improves when explanatory variables are included. Among the seven models, linear regression and SVM perform better than the others. The best model is the linear regression model with a 2-year time period and a moving window.
Figure 7 displays the time series plots of the forecasts from linear regression and SVM. Both models forecast PM_{10} adequately. The performance difference between linear regression and SVM is very small; however, we choose the linear regression model as the best model because of its better interpretability for daily
Table 5 presents the signs of the selected variables from the final (best) linear regression model. The table shows that, among the previous day's meteorological variables, humidity, cloudiness, temperature, sea level pressure, and hours of daylight influence the daily mean PM_{10} on the present day. The daily mean PM_{10} increases as humidity and hours of daylight increase, whereas an increase in the other meteorological factors leads to a decrease in daily mean PM_{10}. Among the air pollutants, only sulfur dioxide affects the mean PM_{10}. Sulfur dioxide is a by-product of power plants and heating equipment; therefore, we can see the relationship between
In the same way as in Section 3.1.1, we forecast the max PM_{10} with the seven proposed models. We searched for the optimal tuning parameters for each model. For example, for the 2-year training set, we found (
We can see that forecast performance again does not depend on the time period or window type; however, it differs greatly by model, similarly to
Figure 8 shows the time series plot of the forecast
Table 7 provides the signs of the selected variables from the final linear regression model. Unlike the
Figure 9 shows the scatter plots of
The Korea Ministry of Environment classifies the PM_{10} concentration into four classes: “good” (0–30), “normal” (31–80), “bad” (81–150), and “very bad” (more than 150). We consider classification models based on these four categories, using three different approaches. The first method (Method 1) forecasts with the regression models: we forecast PM_{10} using the regression models from Section 3.1.1 and then classify the forecasts into the four categories (0–30 as “good”, 31–80 as “normal”, 81–150 as “bad”, and more than 150 as “very bad”). The second method (Method 2) uses numeric class labels as the response: we label the response 1 for “good”, 2 for “normal”, 3 for “bad”, and 4 for “very bad”, fit the data with the best regression model from Section 3.1.1, and classify the forecasts by the predicted value (smaller than 1.5 as “good”, between 1.5 and 2.5 as “normal”, between 2.5 and 3.5 as “bad”, and more than 3.5 as “very bad”). The last method (Method 3) uses classification algorithms: we label the response with the four categories and apply logistic regression, linear discriminant analysis (LDA), Randomforest, and SVM. Sections 3.2.1–3.2.3 present the forecast results for the three methods. As the measure of forecast performance, we use the misclassification rate: the model with the lowest misclassification rate on the test data (2014/08/01–2015/07/31) is considered the best. Table 8 gives the frequencies of the four categories in the test data; most days are either good or normal.
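The cutoffs for Methods 1 and 2 can be written directly as a pair of small functions (a straightforward transcription of the rules above):

```python
def classify_pm10(value):
    """Method 1: map a forecast PM10 concentration to the four
    Ministry of Environment categories."""
    if value <= 30:
        return "good"
    if value <= 80:
        return "normal"
    if value <= 150:
        return "bad"
    return "very bad"

def classify_label(pred):
    """Method 2: map a numeric-label forecast back to a category
    using the midpoint cutoffs 1.5, 2.5, 3.5."""
    if pred < 1.5:
        return "good"
    if pred < 2.5:
        return "normal"
    if pred < 3.5:
        return "bad"
    return "very bad"

print(classify_pm10(92.4))   # bad
print(classify_label(2.7))   # bad
```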
In this section, we forecast the class of daily mean PM_{10} with the regression methods. Table 9 shows the misclassification rate of
Table 10 shows confusion matrices for the test data (2014/08/01–2015/07/31) obtained from the best models for each method. Most misclassifications occur between good and normal, because most observations fall into these two adjacent categories. The risk of misclassification also differs by class: classifying bad or very bad days as good or normal is riskier than the opposite. The number of misclassifications for this risky case is 11 for linear regression, 15 for ARFIMA, 12 for Randomforest, 12 for Boosting, 13 for Neural network, and 11 for SVM. Again, the regression models perform better than the time series models.
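With the class order good, normal, bad, very bad and rows as true classes, the risky-case count is simply the lower-left block of the confusion matrix:

```python
import numpy as np

def risky_count(cm):
    """Number of days whose true class is bad or very bad (last two rows)
    but whose forecast is good or normal (first two columns).
    Rows = true classes, columns = predicted classes,
    both in the order good, normal, bad, very bad."""
    cm = np.asarray(cm)
    return int(cm[2:, :2].sum())

# Linear regression confusion matrix from Table 10 (rows = true class).
cm_lr = [[63,  38, 0, 0],
         [25, 213, 4, 0],
         [ 0,  10, 8, 0],
         [ 1,   0, 2, 1]]
print(risky_count(cm_lr))  # 11, matching the count reported in the text
```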
We next forecast the class of daily mean PM_{10} with numeric labels using regression models. We select the explanatory variables again with the method proposed in Section 2, since we cannot use the differenced
Table 11 shows the misclassification rates of
Lastly, we forecast the class of daily mean PM_{10} using classification methods. Table 13 presents the misclassification rate of
Table 14 shows the confusion matrices obtained from the best model for each method. The results are similar to those in the previous section.
In terms of misclassification rate, the performances of Randomforest with Methods 2 and 3 are the same. However, the numbers of misclassifications for risky cases differ: Randomforest with Method 2 has 12 such misclassifications, while Randomforest with Method 3 has 14. We also consider the geometric mean (G-mean) proposed by Kubat
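For the four-class case, the G-mean can be computed as the geometric mean of the per-class recalls. This is one common multi-class extension of Kubat's binary measure; we present it as an assumption, since the text does not spell out the formula used:

```python
import numpy as np

def g_mean(cm):
    """Geometric mean of per-class recalls.
    cm: confusion matrix with rows = true classes, columns = predicted."""
    cm = np.asarray(cm, dtype=float)
    recalls = np.diag(cm) / cm.sum(axis=1)   # per-class recall
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```

Note that a single class with zero recall drives the G-mean to zero, which is why it is informative for the rare “very bad” class even when the overall misclassification rate looks good.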
As the concentration of PM_{10} in Korea increases, people are paying more attention to PM_{10} forecasting. Accordingly, we proposed a forecast model for PM_{10} and examined important features that affect PM_{10} concentration. Since PM_{10} is time series data, we considered explanatory variables that include previous PM_{10} values and other elements such as air pollutants, meteorological elements, and China air quality. To determine the optimal lags for these variables, we used several plots, including ACF and CCF plots: the significant lags in the ACF of PM_{10} and in the CCFs between PM_{10} and the explanatory variables were taken as explanatory variables.
Using these selected variables, we forecast the mean and max of PM_{10} with seven forecast models: two time series models (ARIMA, ARFIMA) and five regression models (linear regression, SVM, boosting, Randomforest, neural network). We also consider various training data configurations: three time periods (1, 2, and 3 years) and two window types (growing, moving). We compare forecast performance over 42 combinations (seven models × six training configurations) to find the optimal forecast model. Among the models, linear regression shows the best forecast performance and allows clear interpretation of the relationships between the explanatory variables and the response. We also found that the regression models perform better than the pure time series models, and that forecast performance does not depend on the time period or window type of the training data.
We investigated the causes of PM_{10} through the variables selected in the linear regression. We found that there are seasonal and daily effects on PM_{10} levels. We also found that PM_{2.5}
We next forecast the class of
To improve our model, we might consider other explanatory variables such as the daily output of thermal power plants and daily traffic data. For the classification analysis, we may introduce an asymmetric loss to find a model whose decision boundary reflects that the loss for risky cases is larger than for less risky cases.
This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2017R1D1A1B03036078).
Description of variables
Category | Variables | Type | |
---|---|---|---|
Air pollutant | fine particulate (PM_{10}) | hourly | numeric |
ozone (O_{3}) | daily | numeric | |
carbon monoxide (CO) | daily | numeric | |
sulfur dioxide (SO_{2}) | daily | numeric | |
nitrogen dioxide (NO_{2}) | daily | numeric | |
Meteorological elements | temperature | daily | numeric |
wind speed | daily | numeric | |
wind direction | daily | category | |
relative humidity | daily | numeric | |
sea level pressure | daily | numeric | |
hours of daylight | daily | numeric | |
duration of fog | daily | numeric | |
precipitation | daily | numeric | |
duration of precipitation | daily | numeric | |
1 hour solar radiation | daily | numeric | |
solar radiation | daily | numeric | |
fresh snow cover | daily | numeric | |
amount of clouds | daily | numeric | |
yellow dust warning | daily | category |
China air quality | beijing PM_{2.5} | daily | numeric |
Description of response and selected explanatory variables
Category | Variables | Variable explanation | Variables | Variable explanation |
---|---|---|---|---|
Response | difference daily mean PM_{10} (present − 1 day ago) | difference daily max PM_{10} (present − 1 day ago) | |
Previous PM_{10} | daily mean PM_{10} (1 day ago) | daily max PM_{10} (1 day ago) | ||
daily mean PM_{10} (2 day ago) | daily max PM_{10} (2 day ago) | |||
daily mean PM_{10} (3 day ago) | daily max PM_{10} (3 day ago) | |||
difference daily mean PM_{10} (1 day–2 day ago) | difference daily max PM_{10} (1 day–2 day ago) | |||
difference daily mean PM_{10} (2 day–3 day ago) | difference daily max PM_{10} (2 day–3 day ago) | |||
difference daily mean PM_{10} (3 day–4 day ago) | difference daily max PM_{10} (3 day–4 day ago) | |||
mean PM_{10} 0–6 hours | max PM_{10} 0–6 hours | |||
mean PM_{10} 6–12 hours | max PM_{10} 6–12 hours | |||
mean PM_{10} 12–18 hours | max PM_{10} 12–18 hours | |||
mean PM_{10} 18–24 hours | max PM_{10} 18–24 hours | |||
month.int | month (January–December) | month.int | month (January–December) | |
day.int | Mon, Tues, Wed, Thurs, Fri, Sat, Sun | day.int | Mon, Tues, Wed, Thurs, Fri, Sat, Sun | |
Air pollutant | mean SO_{2} (1 day ago) | mean SO_{2} (1 day ago) | ||
mean SO_{2} (2 day ago) | mean SO_{2} (2 day ago) | |||
mean CO (1 day ago) | mean CO (1 day ago) | |||
mean CO (2 day ago) | mean CO (2 day ago) | |||
mean NO_{2} (1 day ago) | mean NO_{2} (1 day ago) | |||
mean NO_{2} (2 day ago) | mean NO_{2} (2 day ago) | |||
mean O_{3} (1 day ago) | mean O_{3} (1 day ago) | |||
Meteorological elements | mean temperature (1 day ago) | mean temperature (1 day ago) | ||
mean wind speed (1 day ago) | mean wind speed (1 day ago) | |||
dir | wind direction (16 bearing) (1 day ago) | dir | wind direction (16 bearing) (1 day ago) | |
dir | wind direction (16 bearing) (2 day ago) | mean relative humidity (%) (1 day ago) | ||
mean relative humidity (%) (1 day ago) | mean relative humidity (%) (2 day ago) | |||
rain.hour | duration of precipitation (1 day ago) | rain.hour | duration of precipitation (1 day ago) | |
rain | precipitation (1 day ago) | rain | precipitation (1 day ago) | |
mean sea level pressure (1 day ago) | mean sea level pressure (1 day ago) | |||
sun.sum | solar radiation (1 day ago) | sun.sum | solar radiation (1 day ago) | |
sun.hour | hours of daylight (1 day ago) | sun.hour | hours of daylight (1 day ago) | |
cloud.hour | mean amount of clouds (1 day ago) | cloud.hour | mean amount of clouds (1 day ago) | |
fog.hour | duration of fog (1 day ago) | fog.hour | duration of fog (1 day ago) | |
sun.high | 1 hour solar radiation (1 day ago) | sun.high | 1 hour solar radiation (1 day ago) | |
dust | yellow dust warning | dust | yellow dust warning | |
China air quality | beijing PM_{2.5} (1 day ago) | beijing PM_{2.5} (1 day ago) | ||
beijing PM_{2.5} (2 day ago) | beijing PM_{2.5} (2 day ago) | |||
beijing PM_{2.5} (3 day ago) | beijing PM_{2.5} (3 day ago) |
Set of explanatory variables related to previous PM_{10}
Previous daily PM_{10} | Previous differential PM_{10} | Previous hourly PM_{10} | |
---|---|---|---|
Set 1 | PM_{10,} | ΔPM_{10,} | |
Set 2 | ΔPM_{10,} | ||
Set 3 |
Test error of the models for
1 year | 2 year | 3 year | ||||
---|---|---|---|---|---|---|
Growing | Moving | Growing | Moving | Growing | Moving | |
Linear regression | 19.81 | 18.92 | 19.23 | |||
ARIMA | 37.15 | 39.11 | 35.49 | 36.15 | 34.69 | 34.91 |
ARFIMA | 38.28 | 34.80 | 35.86 | 36.50 | 34.62 | 40.72 |
Randomforest | 36.14 | 36.59 | 35.88 | 36.20 | 36.19 | 36.14 |
Boosting | 33.94 | 34.51 | 33.78 | 33.47 | 33.93 | 34.05 |
Neural Network | 37.68 | 37.67 | 37.67 | 37.67 | 37.67 | 37.67 |
SVM | 19.20 | 19.68 | 19.44 | 19.26 | 19.30 | 19.51 |
Note: Bold type is the smallest test error for each period of the train data. SVM = support vector machine.
Important variables in forecasting
Category | Increase (+) | Decrease (−) |
---|---|---|
meteorological elements | humid | cloud.mean |
air pollutant | NO_{2,} | |
month | May | January–April, June–December |
day | weekdays | weekend |
hourly mean PM_{10} | 18–24 hour | 0–18 hour |
China air quality | beijing | beijing |
Test error of the models for
1 year | 2 year | 3 year | ||||
---|---|---|---|---|---|---|
Growing | Moving | Growing | Moving | Growing | Moving | |
Linear regression | 54.05 | 52.54 | 52.05 | |||
ARIMA | 63.45 | 65.16 | 62.39 | 62.49 | 61.75 | 62.20 |
ARFIMA | 64.49 | 62.32 | 62.11 | 63.02 | 61.96 | 67.26 |
Randomforest | 64.21 | 64.43 | 63.78 | 64.10 | 63.17 | 63.33 |
Boosting | 63.73 | 64.79 | 62.84 | 63.31 | 62.59 | 62.47 |
Neural Network | 69.06 | 69.06 | 69.08 | 69.06 | 69.05 | 69.06 |
SVM | 52.95 | 53.22 | 52.98 | 52.82 | 52.73 | 52.65 |
Note: Bold type is the smallest test error for each period of the train data. SVM = support vector machine.
Important variables in forecasting
Category | Increase (+) | Decrease (−) |
---|---|---|
meteorological elements | humid | |
air pollutant | SO_{2,} | NO_{2,} |
day | Monday, Wednesday–Thursday | Tuesday, Saturday–Sunday |
month | March–May | January–February, June–December |
hourly max of PM_{10} | 18–24 hour | 0–18 hour |
China air quality | beijing |
The frequencies of four categories of
Good | Normal | Bad | Very bad | |
---|---|---|---|---|
2014/08/01–2015/07/31 | 101 | 242 | 18 | 4 |
Misclassification rate of
1 year | 2 year | 3 year | ||||
---|---|---|---|---|---|---|
Growing | Moving | Growing | Moving | Growing | Moving | |
Linear regression | 0.23 | 0.24 | 0.22 | 0.23 | 0.22 | 0.22 |
ARIMA | 0.26 | 0.27 | 0.26 | 0.27 | 0.26 | 0.26 |
ARFIMA | 0.29 | 0.29 | 0.27 | 0.27 | 0.26 | 0.27 |
Randomforest | 0.21 | 0.21 | 0.21 | |||
Boosting | 0.22 | 0.22 | 0.21 | 0.21 | 0.21 | |
Neural Network | 0.27 | 0.27 | 0.27 | 0.27 | 0.27 | 0.27 |
SVM | 0.23 | 0.22 | 0.22 | 0.23 | 0.24 | 0.23 |
Note: bold type is the smallest test error for each period of the train data. SVM = support vector machine.
Confusion matrices of
Linear regression | ARFIMA | Randomforest | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Good | Normal | Bad | Very bad | Good | Normal | Bad | Very bad | Good | Normal | Bad | Very bad | |
Good | 63 | 38 | 0 | 0 | 53 | 48 | 0 | 0 | 60 | 41 | 0 | 0 |
Normal | 25 | 213 | 4 | 0 | 29 | 202 | 10 | 1 | 15 | 224 | 3 | 0 |
Bad | 0 | 10 | 8 | 0 | 1 | 12 | 4 | 1 | 0 | 11 | 6 | 1 |
Very bad | 1 | 0 | 2 | 1 | 0 | 2 | 1 | 1 | 1 | 0 | 2 | 1 |
Boosting | Neural Network | SVM | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Good | Normal | Bad | Very bad | Good | Normal | Bad | Very bad | Good | Normal | Bad | Very bad | |
Good | 63 | 38 | 0 | 0 | 62 | 37 | 2 | 0 | 68 | 33 | 0 | 0 |
Normal | 19 | 222 | 1 | 0 | 32 | 199 | 11 | 0 | 31 | 207 | 4 | 0 |
Bad | 0 | 10 | 7 | 1 | 0 | 11 | 4 | 3 | 0 | 10 | 7 | 1 |
Very bad | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 2 | 1 |
SVM = support vector machine.
Misclassification rate of
1 year | 2 year | 3 year | ||||
---|---|---|---|---|---|---|
Growing | Moving | Growing | Moving | Growing | Moving | |
Linear regression | 0.22 | 0.23 | 0.21 | 0.21 | 0.23 | 0.22 |
Randomforest | 0.22 | 0.21 | 0.20 | |||
SVM | 0.22 | 0.23 | 0.24 | 0.23 | 0.22 | 0.23 |
Note: bold type is the smallest test error for each period of the train data. SVM = support vector machine.
Confusion matrices of
Linear regression | Randomforest | SVM | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Good | Normal | Bad | Very bad | Good | Normal | Bad | Very bad | Good | Normal | Bad | Very bad | |
Good | 69 | 32 | 0 | 0 | 64 | 37 | 0 | 0 | 60 | 41 | 0 | 0 |
Normal | 29 | 211 | 2 | 0 | 18 | 223 | 1 | 0 | 23 | 216 | 3 | 0 |
Bad | 1 | 10 | 6 | 1 | 0 | 10 | 8 | 0 | 0 | 10 | 7 | 1 |
Very bad | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 0 | 1 | 0 | 2 | 1 |
SVM = support vector machine.
Misclassification rate of
1 year | 2 year | 3 year | ||||
---|---|---|---|---|---|---|
Growing | Moving | Growing | Moving | Growing | Moving | |
Logistic regression | 0.31 | 0.25 | 0.24 | 0.23 | 0.25 | 0.24 |
LDA | 0.27 | 0.27 | 0.27 | 0.25 | 0.27 | 0.25 |
Randomforest | 0.20 | 0.20 | ||||
SVM | 0.23 | 0.24 | 0.23 | 0.22 | 0.22 | 0.23 |
Note: bold type is the smallest test error for each period of the train data.
LDA = linear discriminant analysis; SVM = support vector machine.
Confusion matrices of
Logistic regression | LDA | |||||||
---|---|---|---|---|---|---|---|---|
Good | Normal | Bad | Very bad | Good | Normal | Bad | Very bad | |
Good | 65 | 35 | 0 | 1 | 61 | 40 | 0 | 0 |
Normal | 32 | 202 | 5 | 3 | 33 | 197 | 12 | 0 |
Bad | 0 | 10 | 6 | 2 | 0 | 10 | 6 | 2 |
Very bad | 1 | 1 | 0 | 2 | 1 | 1 | 1 | 1 |
Randomforest | SVM | |||||||
Good | Normal | Bad | Very bad | Good | Normal | Bad | Very bad | |
Good | 66 | 35 | 0 | 0 | 56 | 45 | 0 | 0 |
Normal | 22 | 218 | 2 | 0 | 21 | 220 | 1 | 0 |
Bad | 0 | 11 | 7 | 0 | 0 | 11 | 7 | 0 |
Very bad | 1 | 2 | 1 | 0 | 1 | 2 | 1 | 0 |
LDA = linear discriminant analysis; SVM = support vector machine.