Functional data analysis continues to attract interest because advances in technology across many fields have increasingly permitted measurements to be made from continuous processes on a discretized scale. Particulate matter is among the most harmful air pollutants affecting public health and the environment, and levels of PM10 (particles less than 10 micrometers in diameter) for regions of California remain among the highest in the United States. The relatively high frequency of particulate matter sampling enables us to regard the data as functional data. In this work, we investigate the dominant modes of variation of PM10 using functional data analysis methodologies. Our analysis provides insight into the underlying data structure of PM10, and it captures the size and temporal variation of this underlying data structure. In addition, our study shows that certain aspects of size and temporal variation of the underlying PM10 structure are associated with changes in large-scale climate indices that quantify variations of sea surface temperature and atmospheric circulation patterns.
The effects of air pollution are diverse and numerous, such as increased mortality to increased sensitivity of an ecosystem at high concentrations. Effective air quality standards are designed to mitigate these effects on health and the environment. The pollutants that pose among the highest risk to California are ground level ozone and particulate matter (PM). California continues to mandate ambient air quality standards that tend to be more stringent than national standards. However, reducing air pollution concentrations to acceptable levels remains an on-going challenge in California where seven and eight of its cities rank in the top 10 of the highest levels of ozone and particulate matter pollution, respectively (American Lung Association, 2017).
Air quality is highly associated with weather conditions (Jhun
PM is one of the six common air pollutants identified by the U.S. Environmental Protection Agency (EPA, 2018), and it varies in size. Some recent results suggest that the Pacific Decadal Oscillation and South Oscillation index influence PM of a certain size (Singh and Palazoglu, 2012). Jerez
Many studies of PM10 utilize daily and monthly summary measures which are assumed to represent daily individual exposure to PM. In this article, we add to the literature by exploring the underlying data structure of the height or size variation and time variation of PM10 and its association with regimes of Pacific climate using functional data (FD) methodologies. FD methodologies allow us to explore the behavior of PM10 at any time
Monitoring stations provide hourly measurements of PM10 throughout the day. We processed hourly time series of PM10 concentrations (measured in micrograms per cubic meter) from 117 California monitoring stations among 35 counties for the years 1995–2017. Raw PM10 data is publicly available through the EPA Air Data website (https://www.epa.gov/outdoor-air-quality-data). The spatial locations of the stations are shown in the left panel of Figure 1. Of 117 monitoring stations, 37 and 80 were located in a rural and non-rural setting, respectively. A monitoring station in Riverside County, CA, at elevation -50m near the north shore of the Salton Sea in the eastern Coachella Valley, has the lowest elevation. A monitoring station located in Mono County, CA at elevation 2395m in the eastern slopes of California’s Sierra Nevada mountain range, has the highest elevation.
We propose to use the relatively high frequency of hourly PM10 sampling which enables us to analyze these hourly PM10 measurements using methods of functional data analysis (FDA). In this framework, we view hourly observations as a continuously evolving function over time at a monitoring station. Since this process is continuous, the hourly measurements are viewed as samples from that function plus measurement error. Such hourly measurements may be referred to as functional data. Key FDA references include Ramsay and Silverman, (2005) and Kokoszka and Reimherr (2017). Conceptually, PM10 is continuously defined. In practice, PM10 is observed on a discretized hourly scale. We represent the data as functional data using penalized regression B-splines with equidistant knots, where the number of knots are chosen based on the fixed selection method described in Montoya
Meteorological variables, such as relative humidity, temperature, wind speed, and wind direction, influence PM10 (Csavina
(https://www.esrl.noaa.gov/psd/data/correlation/pna.data). Negative and Positive values of these climate indices are associated with El Niño and La Niña episodes, respectively (Bridgman and Oliver, 2006; Yu
Functional data generally consists of two types of variations: height variation and lateral displacement or time variation. The lateral displacement in functional data arises due to randomness of the timing of the features in each curve in the sample, while the height variation represents randomness of the size of the values that these features in the sample assume. These variations are illustrated in Figure 3 using two simulated curves. The solid curves represent two functional observations. Individually, they show two notable features; a morning and afternoon rush hour peak. However, the timing of these features, as well as the size or height of these features, varies by curve. The functional average of these two curves do not represent the typical behavior of these curves. The two curves each have two peaks of varying size, yet the average of these functions would suggest that the general behavior of these two curves is two morning peaks followed by two afternoon peaks. Further, the size of the peaks in the mean curve are much lower than the peaks of the two curves. This simple example illustrates that the time variation may lead to a distorted summary of the data. Further, the time variation in the sample will also inflate the sample variance (Marron
To better explore the underlying structure of the data, a curve alignment procedure that separates the time and height variation of the data is applied. Curve alignment methods align a sample of curves to one common form. This alignment is carried out through the estimation of non-linear monotone functions for each curve in the sample that is applied to the time scale. These time scale transformations, often called warping functions, represent the time variation in the sample. The aligned sample represents the amplitude variation in the sample. Marron
To align the PM10 curves, we utilize the square-root slope function (SRSF) based curve alignment procedure proposed by Srivastava
To align multiple curves, the curves are aligned to the Karcher mean under the space of the SRSFs. Srivastava
We represent as functions the discretized estimates of the aligned PM10 curves using the B-spline basis expansion method (Ramsay and Silverman, 2005) with 23 basis functions. The discretized estimates of the warping functions are represented as functions in a similar manner but the B-spline basis coefficients are monotonically constrained to ensure that the functional representation of the warping functions are monotonically increasing. One may refer to Montoya and Meiring (2016) and others cited in their paper for details on monotonically constrained B-splines. The left, middle, and right panels of Figure 4 show the PM10 curves, the estimated warping functions (
To better summarize these data, Figure 5 displays the corresponding functional boxplots (Sun and Genton, 2011) of the panels shown in Figure 4. In the functional boxplots, the shaded region corresponds to a 50% central region akin to the box in a traditional boxplot with the functional median represented by a black line. The vertical lines extending from the shaded region to the light gray lines above and below the shaded region represent functional versions of the whiskers in traditional boxplots. The align PM10 curves show some notable characteristics regarding the height variation that are not visible in the unaligned PM10 curves. In particular, the larger features or episodes of PM10 are occurring during the morning and afternoon rush hours, as well as within a few hours of midnight. The timing of these features are quantified by the warping functions. For example, let
Functional Principal Component Analysis (FPCA) is a powerful exploratory method that extracts the dominant mode(s) of variability in a functional dataset. Such modes allow us to explore the major sources of variability in the data. We refer the reader to Shang (2014) for a review of different approaches to FPCA. Here, we use regularized FPCA (Silverman, 1996; Ramsay and Silverman, 2005) due its ability to control the smoothness of a functional principal component, and, as a consequence, may make the functional principal components less difficult to interpret.
We use regularized FPCA to explore whether the height and time variation of PM10 could be described by a few underlying components that primarily dominate its behavior. We briefly describe regularized FPCA. Assume the sample functional data are centered about its mean, and denote the centered sample as
subject to the constraints
Regarding the height variation of the PM10 curves, the first three FPCs shown in the left panel of Figure 6 account for 92.84% (62.67%, 24.40%, 5.77%) of the total variation in the
The right panel of Figure 6 displays the first three FPCs of the time variation of the PM10 curves, and they account for 91.56% (63.16%, 20.98%, 7.42%) of the total variation in the
The scores of a given FPC can be used to examine how climate and spatial location may influence the behavior of these components and hence, the behavior of PM10. For brevity, we only focus on the scores of the first FPC for the PM10 height and time variation. We use a generalized additive mixed model (Wood, 2017a) on the scores to examine these potential influences. As covariates, we consider the easting coordinate, northing coordinate, elevation, and location setting of a monitoring station. In addition, we take into account ONI and the PNA index. For each set of scores from the first FPC of the height and time variation of PM10 curves, we fit the following generalized additive mixed model (GAMM),
where
Regarding modeling results of the first FPC scores for the PM10 height variation, only the elevation of a monitoring station was not significant in model (
The estimated effect of ONI varies with PNA (top right panel of Figure 7), and it shows tha thet most cold (negative) ONI with cold PNA, and warm (positive) PNA with warm ONI, are associated with smaller scores. Warm ONI with cold PNA is associated with higher scores, and to a lesser extent during warm PNA with cold ONI. This suggests that during the coldest phases of ONI and PNA, as well as the warmest phases of ONI and PNA, features of PM10 tend to be smaller than average, whereas the opposite interactions of those phases are associated with larger than average features. While Feng
For the modeling of the first FPC scores for the PM10 time variation, all covariates considered in model (
The left and right panel of Figure 8 display the predicted random year effect for model (
The left and right panel of Figure 9 display predicted random monitoring station effects when averaged over the years for model (
This work proposes an exploratory data analysis of PM10 using California monitoring data by first performing curve alignment on PM10 to provide its height and time variation. Ignoring time variation can lead to distorted or biased estimates when modeling PM10 data, and thus, the time variability should be taken into account when using FD methodologies on PM10. The results show substantial height and time variation in the features of PM10, and, for brevity, we focused only on the dominating mode of variability for each process. The dominating mode of the underlying structure of the height and time variation processes of PM10 reflected an overall daily size effect primarily during the rush hours and a time shift effect throughout the day, respectively.
The scores of the first FPC allowed us to focus on these two aspects of the behavior of PM10. Using a GAMM, we investigated how both the first FPC scores of the PM10 height and time variation varied with large-scale climate indices and spatial location. The analysis showed that the size and timing of PM10 features varied throughout California. Furthermore, our analysis gives insight into the complicated relationship of the first FPC scores of the PM10 height and time variation with climate indices that measure variations of sea surface temperature and atmospheric circulation patterns. In particular, certain phase interactions between PNA and ONI are associated with above or below average levels of PM10, as well as its features occurring earlier or later than average, throughout the day. The random monitoring station and year effects in the models accounted for additional sources of unexplained variability. The predicted random monitoring station effects showed local spatial clustering of similar signed effects, which potentially may be partly driven by PM10 transport processes such as those discussed in Chow
This work presents an avenue for future research. Specifically, it motivates developing a functional data regression model that quantifies the association between the daily PM10 curves (functional response) with changes in several explanatory variables representing spatial location attributes (scalar covariates) and climate processes (functional covariates). This direction of research is significant to the atmospheric sciences, and would provide a new understanding and knowledge on the spatial and temporal behavior of daily PM10 at any time