Principal component analysis (PCA) is a useful tool for identifying the dominant directions of variation in multivariate data. Functional PCA (FPCA) is its generalization to time-varying data, called functional data or stochastic processes. Functional data may take values in Euclidean spaces, Hilbert spaces, or more generally in metric spaces. Most work on functional data has focused on the Euclidean case (Ramsay and Silverman, 2005; Chiou
Recently, Lin and Yao (2019) proposed an approach to FPCA for Riemannian manifolds. Although they are mainly concerned with manifold-valued functional data, their approach is based on an eigenanalysis for Hilbertian functional data, so that one may use the procedure described there for Hilbertian stochastic processes. The eigenanalysis discussed in Lin and Yao (2019) gives eigenfunctions that live in the same space as the data objects. This means that, if one adopts their approach for Hilbertian functional data, then one would have eigenfunctions whose trajectories are time-varying Hilbertian values. In such cases one may face some difficulty in visualizing or interpreting the resulting eigenfunctions.
In this paper we take a different approach. Our approach gives real-valued eigenfunctions regardless of the underlying Hilbert space where the random functions take values. The associated functional principal component (FPC) scores lie in the Hilbert space. For example, consider the case where the Hilbert space under study is a simplex, . Let
Let
Let
We let
In the case where
Then, is nonnegative definite. This follows since
for all
where
Now, we derive a Karhunen-Loève expansion of
and the integral on the right hand side of (2.3) exists with probability one. Since
Also, since
Here, we have used (2.4) for the second equation. The integrals on the right hand sides of the third and fourth equations in (2.5) are Bochner integrals. Let
Apparently,
We are aware of two recent pioneering works on PCA for Riemannian functional data, namely Dai and Müller (2018) and Lin and Yao (2019). Both deal with functional data taking values in Riemannian manifolds. The basic idea of the first is to embed the manifold under consideration in a Euclidean
Let
Define
Then, it is also nonnegative definite since
for all
The
The empirical FPC scores (
Also, we have
In the above theorem,
Here, we consider two Hilbert spaces and give some more details for the implementation of the methodology presented in the previous section. One is the space of compositional vectors and the other is the space of density functions. The vector operations for these spaces are unconventional. We use isometric isomorphisms that map the respective Hilbert spaces to
Suppose that
With
As we have seen in the previous section, the FPCA with Hilbertian functional data boils down to the computation of the covariance kernel
The eigenfunctions
We discuss the computation of the Bochner integrals at (2.7) for
since ℒ is injective, where the last integral is in Lebesgue sense. Since ℒ and its inverse are continuous, we get from (3.3) and the convergence
where the second and third integrals are in Lebesgue sense. This means that we may evaluate
For simplicity we consider the space of density functions supported on [0, 1]. An extension to the case of a general support on
For this space, we define vector addition, scalar multiplication and inner product as follows.
With
We define the map
Then, ℒ is an isometry preserving the metrics on the two spaces:
Here we recall that
The covariance kernel
Here, it is worthwhile to note that
with probability one. For the kernel
Since ℒ as defined at (3.6) is an isometry, injective, continuous and has continuous inverse, we may prove (3.4) as well for the space of density functions. In practical implementation, we may discretize densities on a grid of [0, 1], which results in functional compositional vectors. Then, we are able to apply the procedure described in Section 3.1 to density functional data. Specifically, we may choose a grid {
as -valued random elements.
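The discretization step can be sketched as follows. This is only a minimal illustration under assumptions of our own: equal-width bins on [0, 1], midpoint evaluation of the density, and a final rescaling so that the components sum to one. The function name `density_to_composition` and the Beta(2, 2) example density are hypothetical and are not taken from the procedure in Section 3.1.

```python
def density_to_composition(density, n_bins):
    """Discretize a density on [0, 1] into a compositional vector.

    Assumption: equal-width bins with midpoint evaluation. The grid used
    in the paper is not reproduced here.
    """
    width = 1.0 / n_bins
    midpoints = [(j + 0.5) * width for j in range(n_bins)]
    # Approximate the probability mass of each bin by density * bin width.
    masses = [density(u) * width for u in midpoints]
    total = sum(masses)
    # Rescale so that the components are positive and sum to one.
    return [m / total for m in masses]


def beta22(u):
    # Beta(2, 2) density, strictly positive on the open interval (0, 1).
    return 6.0 * u * (1.0 - u)


comp = density_to_composition(beta22, n_bins=10)
```

Applying this to each density at each time point would give a compositional vector per time point, to which a compositional-data procedure could then be applied.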
Here, we illustrate the methodology we presented in Section 2. We analyzed changes over time in population composition by age, which is an important element in demographic analysis and a key to understanding social dynamics. The dataset we analyzed came from the World Population Prospects 2019 (WPP, https://population.un.org/wpp/) offered by the United Nations (UN), available at https://population.un.org/wpp/Download/Files/1 Indicators%20(Standard)/CSV FILES/WPP. The WPP2019 contains many data series, among which we took those on population by location, age group and year. The original dataset consists of population estimates from the year 1950 to 2020, and projections from 2020 to 2100. Of these years we took the period 1950–2018. The original age groups were 0–4, 5–9, ..., 95–99, 100+. We aggregated them into three age groups: 0–19 (before working age), 20–64 (work force) and 65+ (retired). Locations are given in several types of categories such as country, region, subregion, etc. We chose ‘country’ as the data unit and the number of countries was 201. Thus, we obtained compositional vectors
To transform the dataset further into a set of smooth compositional vectors over time, we pre-smoothed
where
Our primary interest in this example was to see how well a small number of FPC scores can explain the apparent change in population composition over time. For this we took dominant eigenfunctions based on the criterion called ‘fraction of variance explained (FVE)’. The FVE of the first
To illustrate how well the two leading FPC scores can reproduce the original data, we computed
Figure 2 compares the original
We performed a cluster analysis based on the two leading FPC scores. We formed three clusters based on the following distance metric between
We employed hierarchical clustering based on complete linkage, which takes
as the distance between two clusters and . Figure 3 depicts the average compositional vectors over time for each cluster. The figure demonstrates the central features of the clusters. The cluster consists of countries where population composition does not change much over time. The proportion of the retired ages in the cluster is almost constant over time. Those countries in the clusters and experienced some level of decrease in the proportion of the young ages since near 1970, but the proportion in increased until 1970 while in they stayed constant until then. Also, both and saw some level of increase in the proportion of the retired ages, but started from a relatively larger elderly proportion and underwent more rapid increase than .
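The complete-linkage rule described above can be sketched in a few lines. This is a minimal illustration and not the implementation used for the analysis: the Euclidean distance between score vectors is an assumption standing in for the distance metric defined above, and the function `complete_linkage` and the toy `scores` are hypothetical.

```python
import math


def complete_linkage(points, n_clusters):
    """Agglomerative clustering with complete linkage.

    `points` is a list of equal-length score vectors. Euclidean distance
    is an assumption made for this sketch only.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Start with every observation in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest pairwise distance between their members.
                d = max(dist(points[i], points[j])
                        for i in clusters[p] for j in clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Merge the two closest clusters (q > p, so index p stays valid).
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Toy two-dimensional scores forming three well-separated groups.
scores = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1),
          (5.2, 4.9), (10.0, 0.0), (10.1, 0.3)]
groups = complete_linkage(scores, n_clusters=3)
```

Because the between-cluster distance is a maximum, complete linkage tends to produce compact clusters, which suits a grouping of countries with similar score profiles.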
We also compared our clustering result with the UN development groups obtained from the WPP2019. We took the three groups from the WPP2019: countries in more developed regions, least developed countries, and other less developed countries. More developed regions comprise Europe, Northern America, Australia/New Zealand and Japan. Less developed regions comprise all regions of Africa, Asia (except Japan), Latin America and the Caribbean plus Melanesia, Micronesia and Polynesia. Countries in less developed regions are divided further into two groups: the 47 least developed countries, and the remaining countries in less developed regions. The criterion for least developed countries can be found at http://unohrlls.org/about-ldcs/criteria-for-ldcs/. Table 1 is a contingency table comparing the two clusterings. We find that the clusters based on the compositional FPC are very close to the UN development groups.
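One simple way to quantify this closeness is the share of countries on the diagonal of Table 1. The sketch below uses the counts reported in Table 1 and assumes that cluster k is matched with the k-th development group in the order least developed, other less developed, more developed. That matching is our reading of the table, not a statement made in the analysis.

```python
# Counts copied from Table 1 (rows: clusters 1-3; columns: least developed,
# other less developed, more developed).
table = [
    [39, 17, 0],
    [7, 80, 2],
    [0, 13, 43],
]

total = sum(sum(row) for row in table)
# Assumption: cluster k corresponds to the k-th development group.
matched = sum(table[k][k] for k in range(3))
agreement = matched / total

print(f"{matched} of {total} countries agree ({agreement:.1%})")
```

Under this matching, 162 of the 201 countries fall on the diagonal.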
In this section, we discuss a multivariate extension of the methodology presented in Section 2. Suppose that we now have
To accommodate the dependency between
Define the integral operator
We endow
Thus, the integral operator is nonnegative definite. Also, its kernel
where
Let
Write
The corresponding FPC scores in the above eigenanalysis are
The FPC scores reside in
The following theorem is a multivariate version of Theorem 1. In the theorem,
We may follow the procedure we described in Section 2.2 to estimate the eigenfunctions and the FPC scores. Let
This would give the eigendecomposition
Then, as in the case of
For these estimators of eigenfunctions and FPC scores, we may also obtain an analogue of Theorem 2.
Research of Young Kyung Lee was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2018R1A2B6001068). Research of Dongwoo Kim and Byeong U. Park was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (No. 2019R1A2C3007355).
Table 1. Contingency table for the counts of countries

| | Least developed | Other less developed | More developed | Total |
|---|---|---|---|---|
| Cluster 1 | 39 | 17 | 0 | 56 |
| Cluster 2 | 7 | 80 | 2 | 89 |
| Cluster 3 | 0 | 13 | 43 | 56 |
| Total | 46 | 110 | 45 | 201 |