TEXT SIZE

search for



CrossRef (0)
Aerosol optical depth prediction based on dimension reduction methods
Communications for Statistical Applications and Methods 2024;31:521-533
Published online September 30, 2024
© 2024 Korean Statistical Society.

Jungkyun Leea, Yaeji Lim1,a

aDepartment of Statistics, Chung-Ang University, Korea
Correspondence to: 1 Department of Statistics, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea. E-mail: yaeji.lim@gmail.com
This research was supported by the Chung-Ang University Research Scholarship Grants in 2023.
Received January 24, 2024; Revised February 16, 2024; Accepted February 20, 2024.
 Abstract
As the concentration of fine dust has recently increased, numerous related studies are being conducted to address this issue. Aerosol optical depth (AOD) is a vital atmospheric parameter for measuring the optical properties of aerosols in the atmosphere, providing crucial information related to fine dust. In this paper, we apply three dimension reduction methods, nonnegative matrix factorization (NMF), empirical orthogonal functions (EOF) analysis and independent component analysis (ICA), to AOD data to analyze the patterns of fine dust in the East Asia region. Through a comparison of three dimension reduction methods, we observe that some patterns are observed in all three method, while some information are only extracted in a specific method. Additionally, we forecast AOD levels based on three methods, and compare the predictive performance of the three methodologies.
Keywords : nonnegative matrix factorization, empirical orthogonal function, independent component analysis, AOD prediction, dimension reduction
1. Introduction

The research related to air quality is one of the important atmospheric environmental issues these days. Among them, fine dust, which is particulate pollution, is being addressed as a very significant air pollutant, as it can cause health hazards such as respiratory diseases in humans. Recently, to overcome the limitations of surface in-situ observations in the fine dust monitoring network, satellite-based measurements of aerosol optical depth (AOD) are used to estimate PM2.5 concentrations across the world (Guo et al., 2009; Xin et al., 2014). AOD is a measure of the extinction of the solar beam by aerosols, which are tiny solid or liquid particles suspended in the air or as a gas, and measured by the Moderate Resolution Imaging Spectroradiometer (MODIS). AOD has its values typically range from 0 to 5, with higher values indicating a higher concentration of aerosol particles in the atmosphere or more effective scattering or absorption of radiation.

Various statistical models for AOD data analysis have been developed. Lee et al. (2012) developed spatial clustering models that combine AOD and ground monitoring data to predict PM2.5 concentrations, and Karimian et al. (2023) proposed wavelet based model to forecast PM2.5 concentrations using Himawari AOD. Various dimension reduction methods are also commonly used in analyzing AOD data. For example, Khoir et al. (2022) used rotated empirical orthogonal function (EOF) to analyze spatial and temporal AOD distribution, and Ma et al. (2021) compared EOF and nonnegative matrix factorization (NMF) in analyzing AOD over China.

In line with this latest research trend, we apply three dimension reduction methods to analyze AOD over East Asia; NMF, EOF, and ICA. Furthermore, we propose a simple, but powerful prediction method based on these dimension reduction methods.

The remainder of the paper is organized as follows: In Section 2, we describe the AOD data over East Asia. Then, we briefly review the three dimension reduction methods, NMF, EOF, and ICA in Section 3, and also propose a forecasting method. In Section 4, AOD analyzing results including prediction performances are given. Lastly, concluding remarks are presented in Section 5.

2. Data

In this study, we used the MYD04 L2 dataset of Aqua MODIS (Levy and Hsu, 2015), which has a spatial resolution of 10 km and a temporal resolution of 5 minutes. We focus on the region from 110°E to 145°E in longitude, and 30°N to 45°N in latitude which covers East Asia including Korea, Japan, and eastern China (Figure 1). The number of grid points in this area is 53001. We use monthly averaged data from 2018 to 2022, and the number of time points is 60.

The left panel of Figure 1 shows the average AOD from 2018 to 2022. The highest AOD values are found in the east coastal industrialized regions of China, the western coast of Korea, and the Yellow Sea. It is well known that the winds blowing in the southwest direction have the large impacts on South Korea’s ambient air pollution levels, which is consistent with the direction of emissions from Shanghai, China (Kim, 2019). The right four panels in Figure 1 shows the average AOD in each season. We observe that the AOD value is the highest in spring (March-April-May; MAM), followed by summer (July-June-August; JJA), winter (December-January-February; DJF), and autumn (September-October-November; SON). This is consistent with the results of a study on the characteristics of AOD distribution according to the season caused by the inflow of dust from China and Mongolia and the seasonal wind (Kim et al., 2010).

3. Methodology

3.1. Dimension reduction methods

3.1.1. NMF

NMF is a matrix decomposition technique that can only be applied to nonnegative data. It decompose a nonnegative feature matrix into a product of two nonnegative matrices, and this nonnegativity makes the resulting decomposed matrices easier to inspect. We often prefer NMF over other dimensionality reduction approaches because the nonnegativity of the approach naturally fits the characteristics of the domain problem and it has great advantage in terms of interpretation (Wang and Zhang, 2012).

For the nonnegative data matrix X = (x1, . . ., xn)’, where xi ∈ ℝm, NMF finds W and H approximate X with the property that matrices have all nonnegative elements;

XWH,

where W = (w1, . . ., wn)’ ∈ ℝn×r is a coefficient matrix, and each row wi = (wi1, . . ., wim) is the weight of the observation xi on the row of H. H = (h1, . . ., hr)’ ∈ ℝr×m is a basis matrix, and each row hk = (hk1, . . ., hkm) contains basis vectors (features). In other word, NMF decomposes each data into the linear combination of the basis vectors. Therefore, NMF is viewed as a dimension reduction method and also a feature extraction technique. Here, the rank r is the dimension reduction level, generally chosen so that (n + m)r < nm. There are some methods for determining rank r such as the Bayesian method (Hoffman et al., 2010) and hypothesis testing (Cai et al., 2023). But there is no unified criterion for selecting the rank r. Depending on the purpose of the study, measures such as residual some of squares (RSS) and cophenetic correlation, Euclidean distance also can be used to determine r. However, in most of the application studies using NMF, the rank r is predefined.

To estimate the basis and coefficient matrices, Lee and Seung (1999) proposed basic NMF with associated algorithm, which has the following equation as an objective function;

argminW,Hi=1nj=1m([X]ijln[X]ij[WH]ij-[X]ij+[WH]ij).

Under the objective function, we can obtain W and H by following algorithm.

  • Set initial value W and H with nonnegative random values.

  • For k = 1 . . . r, update each wik and hk j as

hkjhkji=1nwik[X]ij/[WH]iji=1nwik,         for         j=1m.wikwikj=1mhkj[X]ij/[WH]ijj=1mhkj,         for         i=1n.

Repeat 2 until convergence.

There are several methods that modify the original NMF functional to enforce sparseness on the basis matrix and the coefficient matrix, and nonsmoothNMF (nsNMF) proposed by Pascual-Montano et al. (2006) is one of the popular methods to find localized, part-based, representations of nonnegative data. By applying nsNMF to the AOD data, we can clearly verify AOD contribution of specific region, and obtain interpretable results. In nsNMF, we decompose X as

X=WSH,

where the smoothing matrix S ∈ ℝr×r is a positive symmetric matrix defined as

S=(1-θ)I+θr11T.

Here, I is the identity matrix, 1 is a vector of ones, and the parameter θ satisfies 0 ≤ θ ≤ 1. In practice, Pascual-Montano et al. (2006) proposed modified algorithm based on the basic NMF by replacing W with WS, and H with SH in (3.1). More detail description of nsNMF can be found in Pascual-Montano et al. (2006).

3.1.2. EOF

EOF analysis (Hannachi et al., 2007), also known as principal component analysis (PCA), has been widely used in climatology and meteorology (Shrestha and Barros, 2010). EOF analysis aims to find uncorrelated linear combination of the variables that explain maximum variance. In climate studies, EOF analysis is often used to study possible spatial patterns of variability and how they change with time, and it attempts to find a relatively small number of factors which convey as much of the original information as possible.

In EOF, we decompose observed data X ∈ ℝn×m into a set of orthogonal spatial components and its coefficient components;

X=VZ,

where V ∈ ℝn×r is coefficient matrix related to time and Z ∈ ℝr×m is EOFs related to location. Here, r is the number of EOFs, and usually determined by the percentage of explained variance.

To obtain coefficient and EOF matrices, we perform spectral decomposition to the covariance matrix of X, Σ;

Σ=UΛUT,         UUT=I,

where the diagonal matrix Λ has diagonal elements the ordered eigenvalues, λ1λ2 ··· ≥ λn ≥ 0, and the columns of U are the eigenvectors. Then each row of the the eigenvector U is EOFs, and the corresponding principal component (PC) time series also can be obtained. The proportion of total variation explained by the first r EOFs can be computed as

λ1++λrλ1++λn.
3.1.3. ICA

ICA is a popular method in signal processing for separating a multivariate signal into additive sub-components, and it aims to find statistically independent linear combination of the variables. In ICA, X ∈ ℝn×m is decomposed as

X=As,

where A ∈ ℝn×r is the mixing matrix and s ∈ ℝr×m is the source signals that can not be observed directly. In ICA, we assume that the source signals are independent of each other, and the values in each source signal are non-Gaussian distributed. In general, ICA cannot identify the actual number of source signals, nor a uniquely correct ordering of the source signals.

There are several algorithms for estimation of ICA, and we use the FastICA algorithm proposed by Hyvärinen and Oja (2000). For the estimation, FastICA algorithm uses non-Gaussianity, and negentropy is used to measure of non-Gaussianity (Hyvärinen and Oja, 2000). Negentropy, which is a information-theoretic quantity of differential entropy, is difficult to calculate, so Hyvärinen and Oja (2000) used simple approximations of negentropy J defined as;

J(y)[E{G(y)}-E{G(ν)}]2,

where ν is a Gaussian variable of zero mean and unit variance and G is non-quadratic function chosen by user according to the situation. Hyvärinen and Oja (2000) suggested several G;

G1(y)=1alog cosh(ay),         G2(y)=exp (-y22),

where 1 ≤ a ≤ 2 is some suitable constant. In this study, we use G1(y).

More detail description of FastICA can be found in Hyvärinen and Oja (2000).

3.2. Prediction based on dimension reduction methods

In AOD data analysis, we can obtain components related to time and spatial pattern from each dimension reduction methods. For example, in the case of NMF, W and H are time component and spatial component, respectively. That is, each row of spatial component is a latent variable made up of a linear combination for each grid of data, and each column of time component is the weight of latent variables over time. Therefore, if we can predict time component for the future time point, then AOD value for each grid point can be predicted by simply multiplying predicted time components and the corresponding spatial components. To take the case of NMF as an example, the procedure can be summurized by following algorithm.

  • Given T × m dimensional data X, perform nsNMF and obtain Ŵ and Ĥ.

  • For each kth column of Ŵ, ŵk = (ŵ1k, . . ., ŵTk)’, predict wT+1,k using ARIMA. Then, we have ŴT+1,*: = (ŵT+1,1, . . ., ŵT+1,r) ∈ ℝr.

  • Then, we can predict X at time T + 1 as

X^T+1,*NMF:=W^T+1,*H^,

where jth component of X^T+1,*NMFis computed as x^T+1,jNMF:k=1rw^T+1,kh^k,j for j = 1, . . ., J.

In general, h-ahead prediction also can be obtained as X^T+h,*NMF:=W^T+h,*H^, where ŴT+h,* is h-ahead predicted value using ARIMA model.

Similar to NMF, we can make predictions for EOF and ICA as follows;

X^T+1,*EOF=V^T+1,*Z^+1X¯X^T+1,*ICA=A^T+1,*s^+1 X¯,

where T+1, * and ÂT+1,* are predicted coefficient vector and mixing matrix at time T+1 using ARIMA, and and ŝ are estimated EOF matrix and source signal, respectively. Here, 1 ∈ ℝn is a vector of ones and = (χ̄1, . . ., χ̄m)’ ∈ ℝm is the average value at the each grid over the training time. In EOF and ICA, we subtract from the data as a pre-processing step to make the mean of the input data equal to zero. Therefore we should add the in the prediction formula in (3.2).

Figure 2 shows the flow chart of the prediction algorithms.

4. Results

4.1. Feature extraction

We apply NMF, EOF, and ICA to the AOD data, and extract features. Since there is no unified method to determine the number of components in NMF and ICA, we exam various dimension reduction level, r. As a result, r = 6 provides most interpretable results for all methods.

Figure 3 shows 6 factors of NMF and corresponding time series. We observe that the factor 2 represents the Pacific Ocean adjacent to East Asian countries and Eastern coast in China, and from ŵ2 in Figure 3(b), the corresponding time series has seasonality peak every January. We also observe that the factor 4 represents AOD distribution of Korea, Japan and Shandong Bandao of China, and corresponding time series has seasonality peak every spring (MAM).

Figure 4 shows that 6 EOFs and corresponding PC time series, which have the largest explained variance. The 6 components explain 40.2% of the total variance. From the figure, we observe that the EOF 1 represents the East coastal industrialized regions of China and the West coastal regions of Korea, which is similar to factor 4 in NMF. From the PC 1 in Figure 4(b), the time series of factor 1 has semi-annual cycle with peaks occuring in both spring (MAM) and autumn (SON). EOF 2 represents area around Beijing and Bohai Sea, and the pattern has seasonality peak every winter (DJF).

Figure 5 shows that 6 sources and corresponding IC time series obtained from ICA. Source 1 represents the East coastal industrialized regions of China and the West coastal regions of Korea, which is similar to EOF 1 of EOF result, and factor 4 of NMF result. Source 3 represents area around Beijing and Bohai Sea, which is similar to EOF 2 in EOF result. And it has seasonality peak every winter (DJF), which is similar pattern with the PC 2 in EOF.

From the results, we observe that some components of three methods give similar spatial patterns of AOD. For example, factor 4 in NMF, EOF 1 in EOF, and source 1 all provide high AOD levels in East coastal industrialized regions of China and the West coastal regions of Korea. However, since NMF provides nonnegative coefficient matrix as shown in Figure 3(a), the results of NMF are easier to understand, as negative values hold less meaning in AOD data analysis. Also its seasonality is more clearly observed in Ŵ1 (Figure 3(b)).

4.2. Prediction

Here, we predict monthly AOD in 2022 based on NMF, EOF, and ICA as described in Figure 2. To make prediction at time point T + 1, we use AOD data up to time point T. For example, to forecast AOD on January 2022, we use data from January 2018 to 2021 December. Then, we use data from January 2018 to 2022 January to forecast on February 2022, and so on.

The results are presented in Figure 6. We plot the average difference between the true AOD values and prediction values, ∑t([X]t j – []t j), for each grid j, where [X]t j and []t j are the true and predicted AOD values at time t on grid j, respectively. We can observe that the difference between the true values and the all predictions are close to zero, and all three methods tend to overestimate the same region, China’s coastline. We further plot the differences for each season; spring (MAM), summer (JJA), autumn (SON), and winter (DJF) (Figure 7). For all seasons, EOF based prediction works best, but NMF based prediction works similarly except winter. On the other hand, ICA based prediction works poorly compare other two methods except winter.

For the numerical comparison, we compute root mean square error (RMSE) and mean absolute percentage error (MAPE) for each method as follows.

RMSE=t=1Tj=1J([X]tj-[X^]tj)2TJ,MAPE=t=1Tj=1J|[X]tj-[X^]tj[X]tj|×100TJ.

The results are summurized in Table 1. Here, we add simple ARIMA forecasting model for the comparison method. We apply ARIMA model for each grid, and the optimal parameters for the ARIMA model are chosen based on the Akaike information criterion (Hyndman and Khandakar, 2008). We also apply simple LSTM forecasting model as the comparison method, which is known to be good in particulate matter concentration prediction (Cho et al., 2021). The model consists of two LSTM layers in addition to one dense layer with ReLu activation function. And the model uses stochastic gradient descent (SGD) optimizer algorithm and root mean square error (RMSE) loss function.

From the results, EOF provides best performance for all seasons. ARIMA shows higher performance than NMF and ICA and lower performance than EOF for all seasons. By MAPE, EOF shows best performance in spring, summer, autumn and total period, and ICA shows best performance in winter. LSTM shows poor performance for all seasons, sine we do not have enough sample size for the learning model.

As mentioned earlier, dimension reduction level is chosen as r = 6 for the interpretation. Therefore, we further examine the result under various dimension reduction levels, r = 4, 5, . . ., 20. From the Figure 8, we observe that EOF-based method works best for all r levels. Remarkable things are that large r does not guarantee high predictive performance. For NMF case, prediction performance even gets worse as r increased, and for the case of EOF, prediction performance does not change a lot depending on r. Main reason is that first 6 components explains most significant patterns of the AOD data, and if we consider large r value, meaningless patterns are included.

We also consider 2 and 3 month ahead predictions, and Figure 9 shows the RMSE values under various r. In both 2 and 3 month ahead predictions, proposed EOF forecasting shows best performances.

5. Concluding remarks

In this study, we consider three dimension reduction methodologies to analyze AOD data, and compare the results in terms of feature extraction and prediction. Regional information that contributes to AOD are clearly reflected from all three methods. For example, the East coastal industrialized regions of China and the West coastal regions of Korea, which has significant contribution of AOD are well represented by all three methods. However, there is a factor in NMF that represents the Pacific Ocean adjacent to East Asian countries and Eastern coast in China, but other methods can not detect it.

Also, we predict monthly AOD in 2022 based on three methods, and compare their predictive performance. From the results, we observe that the EOF shows superior predictive performance compare to that of NMF, ICA, ARIMA and LSTM model. For various dimension reduction levels, r, EOF based method still works better than other prediction methods.

There are some limitations in this study. First, the dimension reduction level, r, has different meaning in each method. In EOF, r means the number of EOFs that explains variance of the data most, whereas r means the target number of dimension reduction level in NMF. In ICA, it means the number of statistically independent components. Various methods have been developed in determining r in each method, and we expect that we can improve the results by finding optimal number r. Second, we use simple ARIMA and LSTM model in predicting time series components in each method, but more advanced time series forecasting methods can be used in the procedure. Although these limitations, we observe that all three dimension reduction methods provide meaningful patterns of the AOD data in East Asia region, and the methods provide promising prediction performances. From the prediction performance, we conclude that the EOF has superior predictive performance over other methods.

Figures
Fig. 1. Average of monthly AOD values in (a) 2018–2022, (b) spring, (c) summer, (d) autumn and (e) winter.
Fig. 2. Flow chart of the prediction algorithms for NMF, EOF, and ICA.
Fig. 3. NMF results from AOD data.
Fig. 4. EOF results from AOD data.
Fig. 5. ICA results from AOD data.
Fig. 6. Average difference between the prediction value and the true value for 2022.
Fig. 7. Average difference between the prediction value and the true value on each season in 2022.
Fig. 8. RMSE for different dimension reduction level, r.
Fig. 9. RMSE for (a) 2 month ahead prediction, and (b) 3 month ahead prediction.
TABLES

Table 1

Accuracy evaluation of methods

SeasonRMSEMAPE


NMFEOFICAARIMALSTMNMFEOFICAARIMALSTM
Spring0.2100.1890.2260.2080.27613.49411.71115.54112.70418.895
Summer0.1770.1560.210.1710.23811.2539.91411.98810.84614.125
Autumn0.1280.1120.1430.1280.1707.5447.2869.2067.92911.592
Winter0.1740.1480.150.1620.19612.5419.3688.87910.34411.770

Total0.1750.1540.1860.1690.22411.2089.5711.40410.45614.095

References
  1. Cai Y, Gu H, and Kenney T (2023). Rank selection for non-negative matrix factorization. Statistics in Medicine, 42, 5676-5693.
    Pubmed CrossRef
  2. Cho KW, Jung YJ, and Oh CH (2021). Comparative study of performance of deep learning algorithms in particulate matter concentration prediction. Journal of Advanced Navigation Technology, 25, 409-414.
  3. Guo JP, Zhang XY, and Che HZ, et al. (2009). Correlation between PM concentrations and aerosol optical depth in eastern China. Atmospheric Environment, 43, 5876-5886.
    CrossRef
  4. Hannachi A, Jolliffe IT, and Stephenson DB (2007). Empirical orthogonal functions and related techniques in atmospheric science: A review. International Journal of Climatology: A Journal of the Royal Meteorological Society, 27, 1119-1152.
    CrossRef
  5. Hoffman MD, Blei DM, and Cook PR (2010). Bayesian nonparametric matrix factorization for recorded music. ICML, 10, 439-446.
  6. Hyndman RJ and Khandakar Y (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27, 1-22.
    CrossRef
  7. Hyvärinen A and Oja E (2000). Independent component analysis: Algorithms and applications. Neural Networks, 13, 411-430.
    Pubmed CrossRef
  8. Karimian H, Li Y, Chen Y, and Wang Z (2023). Evaluation of different machine learning approaches and aerosol optical depth in PM2. 5 prediction. Environmental Research, 216, 114465.
    CrossRef
  9. Kim MJ (2019). The effects of transboundary air pollution from China on ambient air quality in South Korea. Heliyon, 5, e02953.
    KoreaMed CrossRef
  10. Kim SW, Choi IJ, and Yoon SC (2010). A multi-year analysis of clear-sky aerosol optical properties and direct radiative forcing at Gosan, Korea (2001–2008). Atmospheric Research, 95, 279-287.
    CrossRef
  11. Khoir ANU, Ooi MCG, Juneng L, Ramadhan MAI, Virgianto RH, and Tangang F (2022). Spatiotemporal analysis of aerosol optical depth using rotated empirical orthogonal function over the Maritime Continent from 2001 to 2020. Atmospheric Environment, 290, 119356.
    CrossRef
  12. Lee HJ, Coull BA, Bell ML, and Koutrakis P (2012). Use of satellite-based aerosol optical depth and spatial clustering to predict ambient PM2. 5 concentrations. Environmental Research, 118, 8-15.
    CrossRef
  13. Lee DD and Seung HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788-791.
    Pubmed CrossRef
  14. Levy R and Hsu C (2015). MODIS atmosphere L2 aerosol product NASA MODIS adaptive processing system, Goddard Space Flight Center, USA,
    CrossRef
  15. Ma Q, Zhang Q, Wang Q, Yuan X, Yuan R, and Luo C (2021). A comparative study of EOF and NMF analysis on downward trend of AOD over China from 2011 to 2019. Environmental Pollution, 288, 117713.
    Pubmed CrossRef
  16. Pascual-Montano A, Carazo JM, Kochi K, Lehmann D, and Pascual-Marqui RD (2006). Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Transactions on Pattern Analysis and Machine Intelligence, 28, 403-415.
    Pubmed CrossRef
  17. Shrestha P and Barros AP (2010). Joint spatial variability of aerosol, clouds and rainfall in the Himalayas from satellite data. Atmospheric Chemistry and Physics, 10, 8305-8317.
    CrossRef
  18. Xin J, Zhang Q, Wang L, Gong C, Wang Y, Liu Z, and Gao W (2014). The empirical relationship between the PM2. 5 concentration and aerosol optical depth over the background of North China from 2009 to 2011. Atmospheric Research, 138, 179-188.
    CrossRef
  19. Wang YX and Zhang YJ (2012). Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering, 25, 1336-1353.
    CrossRef