Air quality research addresses one of the most important atmospheric environmental issues today. Fine dust, a form of particulate pollution, is treated as a particularly significant air pollutant because it can cause health problems such as respiratory diseases in humans. Recently, to overcome the limitations of surface in-situ observations in fine dust monitoring networks, satellite-based measurements of aerosol optical depth (AOD) have been used to estimate PM2.5 concentrations across the world (Guo
Various statistical models for AOD data analysis have been developed. Lee
In line with this research trend, we apply three dimension reduction methods to analyze AOD over East Asia: nonnegative matrix factorization (NMF), empirical orthogonal function (EOF) analysis, and independent component analysis (ICA). Furthermore, we propose a simple but powerful prediction method based on these dimension reduction methods.
The remainder of the paper is organized as follows. In Section 2, we describe the AOD data over East Asia. In Section 3, we briefly review the three dimension reduction methods (NMF, EOF, and ICA) and propose a forecasting method. Section 4 presents the results of the AOD analysis, including prediction performance. Lastly, concluding remarks are given in Section 5.
In this study, we used the MYD04 L2 dataset of Aqua MODIS (Levy and Hsu, 2015), which has a spatial resolution of 10 km and a temporal resolution of 5 minutes. We focus on the region from 110°E to 145°E in longitude and 30°N to 45°N in latitude, which covers East Asia, including Korea, Japan, and eastern China (Figure 1). The number of grid points in this area is 53001. We use monthly averaged data from 2018 to 2022, so the number of time points is 60.
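As a rough check on these dimensions, the following sketch builds the study grid and the monthly time axis. The 0.1° grid spacing is an assumption: it is not stated in the text, although it reproduces the 53001 grid points. The variable names are illustrative only.

```python
import numpy as np

# Assumed 0.1-degree regular grid over the study region (spacing not stated in the text).
lon = np.linspace(110.0, 145.0, 351)   # 110E to 145E
lat = np.linspace(30.0, 45.0, 151)     # 30N to 45N
n_grid = lon.size * lat.size

# Monthly time axis from January 2018 to December 2022.
months = np.arange("2018-01", "2023-01", dtype="datetime64[M]")
n_time = months.size

# One possible data matrix layout: rows are months, columns are grid points.
X_shape = (n_time, n_grid)
print(X_shape)  # (60, 53001)
```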
The left panel of Figure 1 shows the average AOD from 2018 to 2022. The highest AOD values are found in the eastern coastal industrialized regions of China, the western coast of Korea, and the Yellow Sea. It is well known that the winds blowing in the southwest direction have a large impact on South Korea’s ambient air pollution levels, which is consistent with the direction of emissions from Shanghai, China (Kim, 2019). The right four panels in Figure 1 show the average AOD in each season. We observe that the AOD value is highest in spring (March-April-May; MAM), followed by summer (June-July-August; JJA), winter (December-January-February; DJF), and autumn (September-October-November; SON). This is consistent with the results of a study on the seasonal characteristics of the AOD distribution caused by the inflow of dust from China and Mongolia and by the seasonal wind (Kim
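The seasonal averages described above can be sketched as follows. The data here are synthetic random values standing in for the monthly AOD fields, and the function name `seasonal_means` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monthly time axis (2018-2022) and the calendar month (1..12) of each time point.
months = np.arange("2018-01", "2023-01", dtype="datetime64[M]")
month_of_year = months.astype(int) % 12 + 1

# Synthetic stand-in for the AOD matrix: 60 months x 20 grid points.
X = rng.random((months.size, 20))

SEASONS = {"MAM": (3, 4, 5), "JJA": (6, 7, 8), "SON": (9, 10, 11), "DJF": (12, 1, 2)}

def seasonal_means(X, month_of_year):
    """Average the monthly fields that fall in each season; returns {season: field}."""
    return {
        name: X[np.isin(month_of_year, mons)].mean(axis=0)
        for name, mons in SEASONS.items()
    }

season_maps = seasonal_means(X, month_of_year)
```

Because every season contains the same number of months over complete years, the mean of the four seasonal maps equals the overall mean map.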
NMF is a matrix decomposition technique that can only be applied to nonnegative data. It decomposes a nonnegative feature matrix into a product of two nonnegative matrices, and this nonnegativity makes the resulting matrices easier to inspect. NMF is often preferred over other dimensionality reduction approaches because its nonnegativity naturally fits the characteristics of the domain problem and offers a clear advantage in interpretation (Wang and Zhang, 2012).
For the nonnegative data matrix
where
To estimate the basis and coefficient matrices, Lee and Seung (1999) proposed the basic NMF with an associated algorithm, which uses the following objective function:
Under the objective function, we can obtain
Set initial value
For
Repeat 2 until convergence.
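The iterative scheme above can be illustrated with a minimal sketch. The update formulas are not reproduced in the text, so the code assumes the standard multiplicative updates for the squared Frobenius objective associated with Lee and Seung. The symbols `X`, `W`, `H`, the assignment of `W` to time coefficients and `H` to spatial patterns, and the function name are all assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate nonnegative X (n x p) by W (n x k) @ H (k x p), with W, H >= 0.

    Assumed standard multiplicative updates for the objective ||X - WH||_F^2.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Step 1: nonnegative random initial values.
    W = rng.random((n, k)) + eps
    H = rng.random((k, p)) + eps
    errors = []
    # Steps 2-3: alternate the two updates for a fixed number of iterations.
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H, "fro"))
    return W, H, errors

# Synthetic nonnegative data of rank 5 (a stand-in for the AOD matrix).
rng = np.random.default_rng(1)
X = rng.random((60, 5)) @ rng.random((5, 40))
W, H, errors = nmf_multiplicative(X, k=5)
```

The updates only multiply by nonnegative ratios, so `W` and `H` stay nonnegative without any explicit projection step. A fixed iteration count is used here in place of a convergence test.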
There are several methods that modify the original NMF functional to enforce sparseness on the basis matrix and the coefficient matrix, and nonsmoothNMF (nsNMF) proposed by Pascual-Montano
where the smoothing matrix
Here,
EOF analysis (Hannachi
In EOF, we decompose observed data
where
To obtain coefficient and EOF matrices, we perform spectral decomposition to the covariance matrix of
where the diagonal matrix
ICA is a popular method in signal processing for separating a multivariate signal into additive sub-components, and it aims to find statistically independent linear combinations of the variables. In ICA,
where
There are several algorithms for estimating ICA, and we use the FastICA algorithm proposed by Hyvärinen and Oja (2000). For the estimation, the FastICA algorithm maximizes non-Gaussianity, with negentropy used as the measure of non-Gaussianity (Hyvärinen and Oja, 2000). Negentropy, an information-theoretic quantity based on differential entropy, is difficult to calculate, so Hyvärinen and Oja (2000) used simple approximations of negentropy
where
where 1 ≤
A more detailed description of FastICA can be found in Hyvärinen and Oja (2000).
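To make the role of negentropy concrete, the sketch below evaluates one common approximation, proportional to (E[G(y)] − E[G(ν)])² with G(u) = log cosh(a₁u)/a₁ and ν a standard Gaussian variable. This form is assumed here because the approximation formula is not reproduced in the text; the function names and the numerical integration grid are illustrative choices.

```python
import numpy as np

def G(u, a1=1.0):
    """Contrast function G(u) = log(cosh(a1 * u)) / a1, with 1 <= a1 <= 2."""
    return np.log(np.cosh(a1 * u)) / a1

def gaussian_expectation(a1=1.0):
    """E[G(nu)] for standard Gaussian nu, by a simple Riemann sum on [-10, 10]."""
    x = np.linspace(-10.0, 10.0, 20001)
    dx = x[1] - x[0]
    pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    return np.sum(G(x, a1) * pdf) * dx

def negentropy_approx(y, a1=1.0):
    """Approximate negentropy of a sample, up to a positive constant factor."""
    y = (y - y.mean()) / y.std()          # standardize to zero mean, unit variance
    return (G(y, a1).mean() - gaussian_expectation(a1)) ** 2

rng = np.random.default_rng(0)
j_gauss = negentropy_approx(rng.standard_normal(100_000))   # close to zero
j_unif = negentropy_approx(rng.uniform(-1.0, 1.0, 100_000)) # clearly larger
```

A Gaussian sample gives a value near zero, while a non-Gaussian sample such as the uniform one gives a larger value, which is the property FastICA exploits when searching for independent components.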
In AOD data analysis, we can obtain components related to temporal and spatial patterns from each dimension reduction method. For example, in the case of NMF,
Given
For each
Then, we can predict
where
In general,
Similar to NMF, we can make predictions for EOF and ICA as follows:
where
Figure 2 shows the flow chart of the prediction algorithms.
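The general shape of such a decomposition-based forecast (decompose, forecast the time coefficients, reconstruct the field) can be sketched as follows. The forecasting rule is not reproduced in the text, so this sketch substitutes a seasonal-naive rule, the mean of the same calendar month over previous years, purely as a placeholder. It uses an SVD-based EOF decomposition, and all function names are hypothetical.

```python
import numpy as np

def decompose_eof(X, k):
    """Split X (n_time x n_grid) into a mean field, time coefficients, and patterns."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    coeffs = U[:, :k] * s[:k]      # time coefficients (n_time x k)
    patterns = Vt[:k]              # spatial patterns (k x n_grid)
    return mean, coeffs, patterns

def forecast_next_month(X, k, period=12):
    """One-step-ahead forecast of the field following the last row of X.

    Placeholder rule (an assumption, not the paper's method): each time
    coefficient is forecast by its mean over the same calendar month in
    previous years, then the field is reconstructed from the patterns.
    """
    n_time = X.shape[0]
    if n_time < period:
        raise ValueError("need at least one full period of training data")
    mean, coeffs, patterns = decompose_eof(X, k)
    same_month = coeffs[n_time % period::period]   # rows sharing the target month
    coeff_hat = same_month.mean(axis=0)
    return mean + coeff_hat @ patterns

# Synthetic, exactly periodic data: 5 years of monthly fields on 30 grid points.
rng = np.random.default_rng(0)
t = np.arange(60)
base, p1, p2 = rng.random(30) + 1.0, rng.random(30), rng.random(30)
X_full = (base
          + np.outer(np.sin(2 * np.pi * t / 12), p1)
          + np.outer(np.cos(2 * np.pi * t / 12), p2))

train, truth = X_full[:48], X_full[48]
pred = forecast_next_month(train, k=2)
```

On this idealized periodic example the forecast reproduces the held-out month. Real AOD data are noisier, and the choice of coefficient forecaster matters more.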
We apply NMF, EOF, and ICA to the AOD data and extract features. Since there is no unified method to determine the number of components in NMF and ICA, we examine various dimension reduction levels,
Figure 3 shows the 6 factors of NMF and the corresponding time series. We observe that factor 2 represents the Pacific Ocean adjacent to East Asian countries and the eastern coast of China, and from
Figure 4 shows the 6 EOFs with the largest explained variance and the corresponding PC time series. The 6 components explain 40.2% of the total variance. EOF 1 represents the eastern coastal industrialized regions of China and the western coastal regions of Korea, which is similar to factor 4 in NMF. PC 1 in Figure 4(b) shows that the first component has a semi-annual cycle, with peaks occurring in both spring (MAM) and autumn (SON). EOF 2 represents the area around Beijing and the Bohai Sea, and its pattern peaks every winter (DJF).
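The explained-variance share quoted above comes from the eigenvalues of the covariance matrix. The sketch below shows one way to compute EOFs, PC time series, and the explained-variance fraction by spectral decomposition, using synthetic data. The function name `eof_analysis` and the row/column layout (months by grid points) are assumptions.

```python
import numpy as np

def eof_analysis(X, k):
    """EOFs, PC time series, and explained-variance fraction of the leading k modes."""
    Xc = X - X.mean(axis=0)                       # remove the time mean at each grid point
    cov = Xc.T @ Xc / (X.shape[0] - 1)            # spatial covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eofs = eigvecs[:, :k]                         # spatial patterns (n_grid x k)
    pcs = Xc @ eofs                               # PC time series (n_time x k)
    explained = eigvals[:k].sum() / eigvals.sum() # fraction of total variance
    return eofs, pcs, explained

# Synthetic stand-in: 60 months x 30 grid points.
rng = np.random.default_rng(0)
X = rng.random((60, 30))
eofs, pcs, explained = eof_analysis(X, k=6)
```

For a grid as large as the one in this study, forming the full covariance matrix is expensive. A singular value decomposition of the centered data matrix yields the same EOFs and is usually the more practical route.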
Figure 5 shows the 6 sources and the corresponding IC time series obtained from ICA. Source 1 represents the eastern coastal industrialized regions of China and the western coastal regions of Korea, which is similar to EOF 1 in the EOF result and factor 4 in the NMF result. Source 3 represents the area around Beijing and the Bohai Sea, which is similar to EOF 2 in the EOF result. It also peaks every winter (DJF), a pattern similar to that of PC 2 in EOF.
From the results, we observe that some components of the three methods give similar spatial patterns of AOD. For example, factor 4 in NMF, EOF 1 in EOF, and source 1 in ICA all show high AOD levels in the eastern coastal industrialized regions of China and the western coastal regions of Korea. However, since NMF provides a nonnegative coefficient matrix, as shown in Figure 3(a), the results of NMF are easier to interpret, because negative values have little meaning in AOD data analysis. Also, its seasonality is more clearly observed in
Here, we predict monthly AOD in 2022 based on NMF, EOF, and ICA as described in Figure 2. To make a prediction at time point
The results are presented in Figure 6. We plot the average difference between the true AOD values and the predicted values, ∑
For numerical comparison, we compute the root mean square error (RMSE) and the mean absolute percentage error (MAPE) for each method as follows.
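A minimal sketch of the two metrics, using their usual definitions averaged over all grid points and months, is given below. Reporting MAPE in percent is an assumption based on the magnitude of the values in Table 1, and the small arrays are made-up numbers for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over all entries."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent (assumes no zero true values)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))

# Made-up example: 2 months x 2 grid points.
y_true = [[0.5, 0.4], [0.2, 0.8]]
y_pred = [[0.4, 0.5], [0.2, 0.6]]
example_rmse = rmse(y_true, y_pred)
example_mape = mape(y_true, y_pred)
```

MAPE divides by the true value, so it weights errors in low-AOD regions more heavily than RMSE does, which is one reason the two metrics can rank methods differently.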
The results are summarized in Table 1. For comparison, we add a simple ARIMA forecasting model. We fit an ARIMA model at each grid point, and the optimal parameters of the ARIMA model are chosen based on the Akaike information criterion (Hyndman and Khandakar, 2008). We also apply a simple LSTM forecasting model as a comparison method, which is known to perform well in particulate matter concentration prediction (Cho
In terms of RMSE, EOF provides the best performance for all seasons. ARIMA performs better than NMF and ICA but worse than EOF for all seasons. In terms of MAPE, EOF shows the best performance in spring, summer, autumn, and the total period, and ICA shows the best performance in winter. LSTM performs poorly for all seasons, since the sample size is not large enough to train the model.
As mentioned earlier, dimension reduction level is chosen as
We also consider 2- and 3-month-ahead predictions, and Figure 9 shows the RMSE values under various
In this study, we consider three dimension reduction methods to analyze AOD data and compare the results in terms of feature extraction and prediction. Regional information that contributes to AOD is clearly reflected by all three methods. For example, the eastern coastal industrialized regions of China and the western coastal regions of Korea, which contribute significantly to AOD, are well represented by all three methods. However, NMF yields a factor that represents the Pacific Ocean adjacent to East Asian countries and the eastern coast of China, which the other methods do not detect.
We also predict monthly AOD in 2022 based on the three methods and compare their predictive performance. From the results, we observe that EOF shows superior predictive performance compared with NMF, ICA, ARIMA, and LSTM. For various dimension reduction levels,
There are some limitations in this study. First, the dimension reduction level,
Table 1: Accuracy evaluation of methods
Season | RMSE | MAPE | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
NMF | EOF | ICA | ARIMA | LSTM | NMF | EOF | ICA | ARIMA | LSTM | |
Spring | 0.210 | 0.226 | 0.208 | 0.276 | 13.494 | 15.541 | 12.704 | 18.895 | ||
Summer | 0.177 | 0.21 | 0.171 | 0.238 | 11.253 | 11.988 | 10.846 | 14.125 | ||
Autumn | 0.128 | 0.143 | 0.128 | 0.170 | 7.544 | 9.206 | 7.929 | 11.592 | ||
Winter | 0.174 | 0.15 | 0.162 | 0.196 | 12.541 | 9.368 | 10.344 | 11.770 | ||
Total | 0.175 | 0.186 | 0.169 | 0.224 | 11.208 | 11.404 | 10.456 | 14.095 |