Functional clustering analysis is a crucial machine learning technique in functional data analysis (FDA) that assigns functional observations to groups (Ramsay and Silverman, 2005). The literature has proposed many functional clustering methods (e.g., Bouveyron and Jacques, 2011).
Therefore, many researchers are interested in combining registration with statistical methods to improve the resulting analyses. Several papers have developed joint methods that address phase variation between curves while implementing a statistical method (e.g., Ahn et al.).
To address this problem, we propose a functional hierarchical agglomerative clustering method using shape distances (FHACS), which accounts for both phase and amplitude variation in functional hierarchical clustering analysis. This approach is motivated by the phase-amplitude separation algorithm (Srivastava and Klassen, 2016).
The remainder of this paper is organized as follows. Section 2 describes FHACS under the Fisher-Rao invariant metric and summarizes the concepts underlying the clustering method. In Section 3, we demonstrate the method on several simulated datasets and compare its performance against three alternatives: 1) the standard functional hierarchical clustering model, 2) the pre-aligned functional hierarchical clustering method, and 3) the elastic functional hierarchical clustering model, in which functions are aligned pairwise while computing the distance between them. In Section 4, we apply the proposed method and the three candidate methods to six real datasets to compare their effectiveness and clustering performance. Finally, Section 5 presents concluding remarks and the limitations of this study.
To introduce FHACS, we first summarize the standard functional hierarchical clustering model and some critical concepts used to develop the proposed method.
Functional hierarchical clustering is the most basic and simple clustering algorithm: it determines the distances between pairs of functional observations and decides which clusters should be merged or divided depending on the strategy. We use an agglomerative strategy for the bottom-up approach and a divisive strategy for the top-down approach. Two critical choices arise when implementing hierarchical clustering: the distance and the linkage. The distance is nearly fixed in FDA: we treat observations as functions in 𝕃²([0, 1]), where the inner product of two functions is defined as ⟨f, g⟩ = ∫₀¹ f(t)g(t) dt, and the induced distance between f and g is ‖f − g‖ = √⟨f − g, f − g⟩.
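As a concrete illustration of this metric, the 𝕃² distance between two functions sampled on a common grid can be approximated with the trapezoidal rule. This is a minimal sketch; the grid size and test functions are illustrative choices, not taken from the paper.

```python
import numpy as np

def trapz(y, t):
    """Trapezoidal-rule integral of samples y over the grid t."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def l2_distance(f, g, t):
    """Approximate L2 distance ||f - g|| between functions sampled on t."""
    return np.sqrt(trapz((f - g) ** 2, t))

t = np.linspace(0, 1, 201)
f = np.sin(2 * np.pi * t)
g = np.cos(2 * np.pi * t)
d = l2_distance(f, g, t)  # analytically, ||sin - cos|| = 1 on [0, 1]
```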
For the linkage, several methods can be employed in hierarchical clustering analysis. Many linkages are possible; common examples are listed below.
Single linkage: The distance between two clusters is the minimum distance between members of the two clusters.
Centroid linkage: The distance between two clusters is the distance between their centroids.
Complete linkage: The distance between two clusters is the maximum distance between members of the two clusters.
Average linkage: The distance between two clusters is the average of all distances between members of the two clusters.
Each linkage method has advantages and disadvantages; thus, no universally best linkage exists for all situations. For example, single linkage can manage nonelliptical shapes and is best for capturing clusters of assorted sizes, but it suffers from chaining. The complete, centroid, and average linkage methods can separate clusters even when noise exists between them, but they suffer from crowding, dendrogram inversion, and sensitivity to monotone transformations, respectively.
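The three pairwise-based linkages above can be computed directly from a precomputed distance matrix; the sketch below (illustrative code, not the paper's implementation) mirrors the definitions. Centroid linkage is omitted because it requires the observations themselves, not just their pairwise distances.

```python
import numpy as np

def cluster_linkage(D, A, B, method="single"):
    """Distance between clusters A and B (lists of observation indices)
    under a given linkage, using the pairwise distance matrix D."""
    block = D[np.ix_(A, B)]          # all member-to-member distances
    if method == "single":
        return block.min()           # closest pair of members
    if method == "complete":
        return block.max()           # farthest pair of members
    if method == "average":
        return block.mean()          # mean over all member pairs
    raise ValueError(f"unknown linkage: {method}")

# Toy pairwise distances among 4 observations.
D = np.array([[0., 1., 4., 5.],
              [1., 0., 3., 6.],
              [4., 3., 0., 2.],
              [5., 6., 2., 0.]])
A, B = [0, 1], [2, 3]
single = cluster_linkage(D, A, B, "single")      # min of {4, 5, 3, 6} = 3
complete = cluster_linkage(D, A, B, "complete")  # max of {4, 5, 3, 6} = 6
average = cluster_linkage(D, A, B, "average")    # (4 + 5 + 3 + 6) / 4 = 4.5
```

In practice, the distance matrix D would hold functional distances (𝕃² or, later, shape distances) between curves; the linkage step itself is unchanged.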
Srivastava and colleagues introduced the phase-amplitude separation framework based on the square-root velocity function (SRVF), which we adopt here. To implement this method, we define, for a functional observation f, its SRVF q(t) = sign(f′(t))√|f′(t)|, where f′ denotes the time derivative of f. The SRVF is a convenient representation because the Fisher-Rao metric on functions corresponds to the standard 𝕃² metric on their SRVFs.
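Assuming the standard SRVF definition q(t) = sign(f′(t))√|f′(t)| stated above, a discrete version takes only a few lines of numpy; the finite-difference gradient stands in for the exact derivative (an illustrative sketch, not the paper's code).

```python
import numpy as np

def trapz(y, t):
    """Trapezoidal-rule integral of samples y over the grid t."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def srvf(f, t):
    """Square-root velocity function q(t) = sign(f'(t)) * sqrt(|f'(t)|),
    with the derivative approximated by finite differences."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

t = np.linspace(0, 1, 501)
q = srvf(np.sin(2 * np.pi * t), t)
```

A useful sanity check is that ‖q‖² = ∫|f′(t)| dt, the total variation of f; for sin(2πt) on [0, 1] this equals 4.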
Two registration problems arise depending on the number of functional observations: the pairwise alignment problem for two observations and the groupwise (multiple) alignment problem for many observations (Srivastava and Klassen, 2016). Functional hierarchical clustering requires only the distance between two functions, so only pairwise alignment is necessary. Hence, we summarize the pairwise registration problem.
To align functions, the optimal time-warping function γ* is obtained by solving

γ* = argmin_{γ ∈ Γ} ‖q₁ − (q₂ ∘ γ)√γ′‖,

where Γ is the set of boundary-preserving diffeomorphisms of [0, 1], q₁ and q₂ are the SRVFs of f₁ and f₂, and γ′ denotes the derivative of γ. Then, for any γ ∈ Γ, we have ‖(q₁ ∘ γ)√γ′ − (q₂ ∘ γ)√γ′‖ = ‖q₁ − q₂‖; this isometry guarantees that the infimum over warpings defines a proper distance on the quotient space of functions modulo warping.
Next, we define the amplitude distance. For any two functions f₁ and f₂ with SRVFs q₁ and q₂, the amplitude distance is

d_a(f₁, f₂) = inf_{γ ∈ Γ} ‖q₁ − (q₂ ∘ γ)√γ′‖.

Here, the distance d_a measures the remaining difference in amplitude after the phase difference between the two functions has been removed. We let γ* denote the optimal warping obtained from this alignment and define the phase distance as

d_p(f₁, f₂) = cos⁻¹( ∫₀¹ √γ*′(t) dt ),

where the integral lies in [0, 1] by the Cauchy-Schwarz inequality, so d_p is well defined and equals zero exactly when γ* is the identity warping.
Based on the two distances discussed above, we define the shape distance as

d_shape(f₁, f₂) = (1 − λ) d_a(f₁, f₂) + λ d_p(f₁, f₂),

where λ ∈ [0, 1] is a weight controlling the relative contribution of the two components: λ = 0 uses only the amplitude distance, and λ = 1 uses only the phase distance.
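A crude end-to-end sketch of a shape distance of this form (a convex combination of amplitude and phase distances) is given below. Two stated simplifications keep it short: integrals are approximated by the trapezoidal rule, and the optimal warping is searched over the one-parameter family γ(t) = t^a instead of the dynamic programming used in practice (e.g., in fdasrvf). All function names are illustrative, not the paper's implementation.

```python
import numpy as np

def trapz(y, t):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def srvf(f, t):
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

def warped_srvf(q, t, gamma):
    """Action of a warping on an SRVF: (q o gamma) * sqrt(gamma')."""
    return np.interp(gamma, t, q) * np.sqrt(np.abs(np.gradient(gamma, t)))

def shape_distance(f1, f2, t, lam=0.5):
    """d_shape = (1 - lam) * d_a + lam * d_p, with the optimal warping
    found by brute force over gamma(t) = t**a (illustration only)."""
    q1, q2 = srvf(f1, t), srvf(f2, t)
    best_cost, best_gamma = np.inf, t
    for a in np.linspace(0.2, 5.0, 97):      # coarse grid of exponents
        gamma = t ** a
        cost = np.sqrt(trapz((q1 - warped_srvf(q2, t, gamma)) ** 2, t))
        if cost < best_cost:
            best_cost, best_gamma = cost, gamma
    d_amp = best_cost                         # amplitude distance
    dg = np.abs(np.gradient(best_gamma, t))
    d_phase = np.arccos(np.clip(trapz(np.sqrt(dg), t), -1.0, 1.0))
    return (1 - lam) * d_amp + lam * d_phase

t = np.linspace(0, 1, 401)
f1 = np.sin(2 * np.pi * t)
f2 = np.sin(2 * np.pi * t ** 2)   # f1 composed with the warp t -> t**2
d_pair = shape_distance(f1, f2, t, lam=0.5)
```

For f2 above, the amplitude component is nearly zero (the warp is recoverable within the search family) while the phase component is strictly positive, so the two distances separate exactly the kind of variation they are designed to capture.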
To assess the performance of the functional hierarchical clustering algorithm using the shape distance, we evaluated the model on simulated data. Figure 1 presents the resulting simulated datasets.
To demonstrate the effectiveness and clustering performance of the model, we compared FHACS with three natural alternatives. These models are either commonly used in the literature or are modifications of current models addressing phase variability in functional data. These models are the standard functional hierarchical agglomerative clustering model (SFHAC), the pre-aligned functional hierarchical agglomerative clustering model (PAFHAC), and the elastic functional hierarchical agglomerative clustering model (EFHAC). We summarize these three clustering models.
SFHAC: This standard hierarchical agglomerative clustering (HAC) algorithm uses the 𝕃² distance in place of the Euclidean distance. The functions are defined in 𝕃²([0, 1]); thus, the 𝕃² metric is applied for the statistical clustering analysis. Hence, this model does not account for phase variability.
PAFHAC: In this model, functional observations are first pre-aligned using a phase-amplitude separation algorithm (for details, refer to Srivastava and Klassen (2016)), and the standard hierarchical clustering model is then applied. Phase variation is removed by aligning the functions with respect to their Karcher mean; hence, this is a registration problem for the groupwise case.
EFHAC: In this model, functions are aligned pairwise while computing the distance between them. The concept is similar to that of FHACS, but this model uses only the amplitude distance between two functions. The difference between PAFHAC and EFHAC is that PAFHAC computes the Karcher mean of the functions, registers all functions with respect to that mean, and then applies SFHAC, whereas EFHAC removes the phase difference and computes the amplitude distance separately for each pair of functions. Hence, the clustering results should be the same as those of FHACS when the phase distance is given zero weight.
We applied FHACS and the three alternative clustering algorithms to compare clustering accuracy. These models can easily be implemented using existing R packages. We used the fdacluster package (Stamm, 2024) for SFHAC and EFHAC. For PAFHAC, we aligned the functions using the phase-amplitude separation algorithm in the fdasrvf package (Tucker, 2024). We employed a dynamic programming algorithm to determine the optimal time-warping functions {γ*}.
To measure clustering performance, we calculated the Hubert-Arabie adjusted Rand index (ARI), the corrected-for-chance version of the Rand index (Rand, 1971; Hubert and Arabie, 1985; Vinh et al., 2010). Under this index, independent clusterings have an expected value of zero, and identical partitions have an ARI equal to 1.
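The ARI can be computed from the contingency table of the two partitions. The following is a minimal pure-Python sketch of the Hubert-Arabie formula (illustrative; in practice one would use an existing package implementation).

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie ARI = (Index - Expected) / (MaxIndex - Expected),
    computed from the contingency table of two partitions."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency counts
    rows = Counter(labels_true)                     # cluster sizes, partition 1
    cols = Counter(labels_pred)                     # cluster sizes, partition 2
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)

# Identical partitions up to label permutation give ARI = 1.
adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # = 1.0
```

Note that the ARI can be negative, indicating agreement worse than expected by chance, which is why small negative values appear in the result tables below.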
For the simulation experiments in the clustering analysis, we directly specified the true number of clusters for each dataset rather than estimating it from the data.
Tables 1 and 2 list the clustering results for the simulation studies. In Table 1, the ARI is computed for each value of the weight λ from 0 to 1.
Similarly, PAFHAC and EFHAC also performed well because both models cluster based on the amplitude distance after removing phase variation. The groups in simulated Dataset 2 were generated with two distinct phases, and the results reveal that the weight λ matters for this dataset: incorporating the phase distance improves the ARI (Table 1).
Functional data exhibit phase and amplitude variability in several crucial application areas, such as biology, human anatomy, biochemistry, imaging, and sensing. This section presents clustering results for the four methods on six real examples from the University of California, Riverside time-series classification archive (Chen et al., 2015).
Figure 2 presents the functional data for each real dataset: Gun Point (GP; Figure 2(a)), Toe Segmentation (TS; Figure 2(b)), Face Four (FF; Figure 2(c)), Cylinder-Bell-Funnel (CBF; Figure 2(d)), Italy Power Demand (IPD; Figure 2(e)), and Plane (Figure 2(f)). Each panel is colored to distinguish the true clusters (2, 2, 4, 3, 2, and 7 clusters for GP, TS, FF, CBF, IPD, and Plane, respectively). As the figures show, phase and amplitude variability arise naturally in these data because observations were recorded at different times for each measurement during collection.
Next, we applied FHACS and the three alternative hierarchical clustering methods to these real datasets. As in the simulation study, for FHACS we varied the weight λ from 0 to 1 in steps of 0.1 and specified the true number of clusters for each dataset.
In Tables 4 and 5, FHACS provides better clustering performance than the three competing models. The clustering results in these tables imply that the phase distance is also critical for classifying groups of functional data, because for several datasets the best ARI is attained at a nonzero weight λ.
Functional clustering algorithms were developed within FDA to improve clustering performance. Many functional datasets have phase variability due to the lack of temporal synchronization across measurements, so awareness has been rising regarding removing phase variability to improve clustering performance. However, removing or minimizing only the phase variation can be problematic because the phase itself can carry information about the functional data. Hence, we proposed a functional hierarchical clustering method that measures both the phase and amplitude distances through a combined shape distance. We applied this hierarchical clustering model to three simulated datasets to investigate clustering performance in different functional data situations, and then to six real datasets. The proposed method outperforms the three alternative functional hierarchical clustering models. Moreover, this definition of the shape distance can be applied to other clustering or classification methods.
This study has some limitations. First, functional and multivariate standard hierarchical clustering both suffer from the difficulty of choosing the optimal number of clusters.
FHACS results for each simulated dataset

| λ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Sim. Dataset 1 | 0.03 | 0.031 | 0.031 | | | | | | | | |
| Sim. Dataset 2 | 0.03 | 0.03 | 0.03 | 0.03 | 0.15 | 0.15 | 0.11 | 0.11 | 0.03 | | |
| Sim. Dataset 3 | 0.24 | 0.13 | 0.15 | 0.35 | 0.26 | 0.26 | 0.30 | 0.29 | 0.26 | 0.27 | |

ARI is computed for each value of the weight λ.
Three alternative clustering results for each simulated dataset

| | SFHAC | PAFHAC | EFHAC |
|---|---|---|---|
| Sim. Dataset 1 | −0.01 | | |
| Sim. Dataset 2 | 0.03 | 0.01 | 0.03 |
| Sim. Dataset 3 | 0.23 | 0.11 | 0.27 |

Bold font marks the highest ARI.
Description of the real data

| Data | # of obs. | Time points | Classes | Donor |
|---|---|---|---|---|
| Gun Point (GP) | 50 | 150 | 2 | A. Ratanamahatana & E. Keogh |
| Toe Segmentation (TS) | 40 | 270 | 2 | Tony Bagnall |
| Face Four (FF) | 24 | 350 | 4 | A. Ratanamahatana & E. Keogh |
| Cylinder-Bell-Funnel (CBF) | 30 | 128 | 3 | N. Saito |
| Italy Power Demand (IPD) | 27 | 64 | 2 | J.J. van Wijk & E. Keogh & L. Wi |
| Plane | 105 | 144 | 7 | J. Gao |
Clustering performance results for the real data

| λ | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | 0.001 | 0.179 | −0.003 | 0.003 | 0.003 | 0.003 | −0.003 | −0.003 | −0.003 | −0.003 | |
| TS | 0.100 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | 0.015 | |
| FF | 0.514 | 0.514 | 0.514 | 0.514 | 0.444 | 0.179 | 0.179 | 0.213 | 0.240 | 0.240 | |
| CBF | 0.028 | 0.059 | 0.020 | 0.059 | 0.091 | 0.059 | 0.110 | 0.110 | 0.145 | 0.145 | |
| IPD | 0.110 | 0.110 | 0.110 | 0.110 | −0.008 | 0.110 | 0.004 | 0.004 | 0.004 | −0.008 | |
| Plane | 0.806 | 0.806 | 0.806 | | | | | | | | |

Bold represents the highest ARI among the four clustering algorithms.
Results for real data using the three alternative clustering algorithms

| | SFHAC | PAFHAC | EFHAC |
|---|---|---|---|
| GP | 0.011 | 0.029 | −0.003 |
| TS | −0.018 | 0.015 | 0.015 |
| FF | 0.369 | −0.021 | 0.240 |
| CBF | 0.137 | 0.196 | 0.145 |
| IPD | 0.031 | −0.008 | −0.008 |
| Plane | 0.623 | 0.701 | 0.806 |