Compositional data analysis by the square-root transformation: Application to NBA USG% data
Communications for Statistical Applications and Methods 2024;31:349-363
Published online May 31, 2024
© 2024 Korean Statistical Society.

Jeseok Leea, Byungwon Kim1,a

aDepartment of Statistics, Kyungpook National University, Korea
Correspondence to: 1 Department of Statistics, Kyungpook National University, 80 Daehak-ro, Buk-gu, Daegu 41566, Korea. E-mail: byungwonkim@knu.ac.kr
Received October 21, 2023; Revised January 31, 2024; Accepted February 5, 2024.
 Abstract
Compositional data refers to data in which the components sum to a constant, so the sample space is a simplex and statistical methods developed for the usual Euclidean vector space cannot be applied directly. A natural approach to overcome this restriction is to consider an appropriate transformation that moves the sample space onto a Euclidean space, and log-ratio type transformations, such as the additive log-ratio (ALR), the centered log-ratio (CLR) and the isometric log-ratio (ILR) transformations, have mostly been employed. However, in scenarios with sparsity, where certain components take exact zero values, these log-ratio type transformations may not be effective. In this work, we suggest an alternative, the square-root transformation, which moves the original sample space onto a directional space. We compare the square-root transformation with the log-ratio type transformations through a simulation study and a real data example. In the real data example, we apply both types of transformations to USG% data from the NBA and use a density-based clustering method, DBSCAN (density-based spatial clustering of applications with noise), to present the results.
Keywords : compositional data analysis, log-ratio transformation, square-root transformation, sports data analysis, clustering
1. Introduction

Compositional data, a type of non-Euclidean spatial data, is found in various fields such as geochemistry (Buccianti and Pawlowsky-Glahn, 2005), sports (Ordóñez et al., 2016), and biochemistry (Li, 2015). A compositional random vector is a vector whose elements are all positive and sum to a fixed constant. In this context, the sum of the components is nothing but the size of the simplex, so we usually assume the sum to be 1 without loss of generality. This leads to the sample space being a simplex (Aitchison, 1982). When we consider a compositional random vector x of dimension d, we write x ∈ S^{d−1}, where S^{d−1} is the unit simplex defined by

$$S^{d-1} = \left\{ \mathbf{x} = [x_1, x_2, \ldots, x_d]^T \in \mathbb{R}^d \;\middle|\; x_i > 0,\ i = 1, 2, \ldots, d;\ \sum_{i=1}^{d} x_i = 1 \right\}.$$

A non-Euclidean sample space presents challenges when applying traditional statistical methods. To handle these challenges appropriately, research has been conducted on transformation methods that map the sample space to a Euclidean space. Notably, three representative transformations have been introduced: the additive log-ratio (ALR) transformation, the centered log-ratio (CLR) transformation, and the isometric log-ratio (ILR) transformation. Aitchison (1986), Kucera and Malmgren (1998), and Egozcue et al. (2003) provide detailed explanations of these transformations. Another transformation is the square-root transformation, which maps the sample space of compositional data onto the surface of the unit hypersphere (see, for example, Mardia and Jupp (2000), page 160). This transformation enables us to use probability distributions defined on a hypersphere, such as the Kent distribution (Mardia and Jupp, 2000; Scealy and Welsh, 2011, 2014). In this paper, we also aim to show that the square-root transformation can contribute to the diversity of approaches for analyzing compositional data.

In addition to the sample space being non-Euclidean, further challenges are known to arise in analyzing compositional data. Among them, a prominent challenge, and the one we mainly focus on in this research, is that one or more components may take exact zero values. For example, in baseball, the proportions of each pitcher's arsenal, that is, the mix of pitch types such as fastball, curveball, or slider, form compositional data with zero values in some parts. There are further examples in sports, such as goal-scoring or touchdown ratios of players in soccer or American football, and swing-type frequencies (proportions) in racket sports like tennis, table tennis, and badminton. In such cases the components form a composition and some of them can be exactly zero. Here a limitation arises: log-ratio type transformations such as the ALR, CLR, and ILR cannot be applied directly, while the square-root transformation can be applied to any compositional data, including data with exact zeros. Although a few studies have proposed imputation or replacement approaches that estimate zeros separately or replace them with small positive values (Hron et al., 2010), these suggestions remain too restrictive for broad use because, as in the sports examples above, the zero values often carry clear meaning. In the numerical study section of this paper, we set up a few scenarios in which one or two components take values close to zero, and we explore how the distances between two observations change under the CLR transformation and the square-root transformation.

Sports data analysis originated in efforts to enhance performance in baseball, as in Cook (1964). The field has gradually expanded to various sports and has recently begun to actively adopt modern statistical methods and computer science techniques, as in Rein and Memmert (2016) and Cust et al. (2019). Rein and Memmert (2016) propose a big data technology stack to overcome obstacles related to technology access in sports analysis. Cust et al. (2019) review studies that used machine and deep learning methods to investigate characteristics of athlete movement measured with IMUs, a type of motion tracking sensor. Beyond studies focused on enhancing performance, sports data analysis has found its way into media; ESPN, for instance, operates dedicated statistics sections, showcasing the extension of this domain. In the context of sports data analysis, compositional data is commonly encountered in the form of ratios. In competitive sports, ratios have a strength in expressing relativity, which encompasses the gap against opponents or comparisons across different metrics, and also includes comparing performance levels in different leagues under the same metric. Notably, there are common statistics such as win rates, as well as sport-specific metrics such as ball possession and expected goals (xG) in football (soccer), pitch selection and the BB/K ratio in baseball, and field goal percentage (FG%) and assist ratio (AST%) in basketball.

Although ratios are used extensively in sports data analysis, statistical methods developed specifically for compositional data have not yet been widely applied. Especially in elite competitive sports such as soccer and baseball, compositional observations have often been treated as ordinary data, ignoring the dependency among components. Ordóñez et al. (2016) introduced the use of the ALR transformation to analyze water polo offensive performance, taking the compositional data structure into account. This approach is uncommon in sports data analysis, and the authors themselves note the lack of similar studies. This motivates the second goal of this study: to show how compositional data analysis approaches can bring diversity to sports data analysis.

This paper presents the results of a DBSCAN (density-based spatial clustering of applications with noise) cluster analysis using the advanced basketball statistic USG% (usage percentage). The DBSCAN is an effective clustering technique for extensive datasets, particularly those with irregular or arbitrary cluster shapes. It is built on density-based principles, offering the advantage of flexibility in its application to various types of data, not confined to Euclidean space. USG% represents the proportion of a player's involvement in offensive situations, officially called possessions, and is commonly used as an individual player performance metric. This study encounters the following issues when utilizing this measure as is (i.e., without any manipulation). First, while USG% within a possession has the form of compositional data, it loses its compositional characteristic when aggregated after the game. Second, a simple algebraic comparison of USG% fails to consider the player's contribution to the team. Therefore, to address these issues, this study interprets USG% based on the official player positions (C - center, PF - power forward, SF - small forward, SG - shooting guard, PG - point guard) provided by Basketball Reference, taking into account both team and individual perspectives. We utilized data from the top-ranked teams in the Eastern and Western Conferences from the 1993–94 season to the 2021–22 season.

The remainder of this paper is organized as follows. In Section 2, we present the log-ratio transformations and the square-root transformation, along with the distance measures defined on the transformed sample spaces. Section 3 presents an overview of the DBSCAN, which is used for cluster analysis. In Section 4, we compare the two distance measures in a few data scenarios through a simulation study. Section 5 presents the results of applying the DBSCAN algorithm based on the two distance measures to USG% data. Finally, we close the paper with a few discussion issues and directions for future work.

2. Transformation

In this section, we briefly review two different types of transformations. First, we present the log-ratio transformations that map the sample space to Euclidean space, and we introduce the Aitchison distance, which is in fact the Euclidean distance between two CLR-transformed vectors. Next, we briefly review the square-root transformation, introduced as an alternative to the log-ratio type transformations, which maps the sample space onto the surface of a hypersphere, and we describe the associated distance measure, known as the directional distance.

2.1. Log-ratio transformation

Compositional data is a type of non-Euclidean spatial data, requiring transformations before traditional statistical methods developed for Euclidean data can be applied. According to Galletti and Maratea (2016), such transformations need to be isomorphisms and must guarantee scale invariance and subcompositional coherence. Consider a random compositional vector x ∈ S^{d−1}, where S^{d−1} is the unit simplex defined by (1.1), and let x_d denote the dth component of x; then it satisfies

$$x_d = 1 - x_1 - \cdots - x_{d-1}.$$

Hence, the components of the data are not independent, and the maximum number of degrees of freedom is d − 1. For later use, we denote the geometric mean of the components of x by g(x), defined by

$$g(\mathbf{x}) = \left( \prod_{j=1}^{d} x_j \right)^{1/d},$$

which is crucially used in the CLR and ILR transformations, while the ALR transformation simply uses x_d as the reference component, as shown below.

According to Aitchison's early studies, such as Aitchison (1982, 1986), there are two representative log-ratio type transformations. One is the additive log-ratio (ALR) transformation, defined as

$$\mathrm{alr}_{(d-1)\times 1}(\mathbf{x}) = \left[ \ln\frac{x_1}{x_d},\ \ln\frac{x_2}{x_d},\ \ldots,\ \ln\frac{x_{d-1}}{x_d} \right]^T,$$

which maps compositional data onto the (d − 1)-dimensional real space ℝ^{d−1}. It is asymmetric and does not preserve distances because an arbitrary component is chosen as the denominator; changing the denominator component gives the transformed data a different distribution. Thus, criticism exists regarding the validity of this transformation. However, its strength lies in its simplicity (Aitchison, 2008), and as a result it continues to be employed in non-mathematical studies such as Greenacre et al. (2021).
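To make the definition concrete, the following R sketch applies the ALR transformation to a toy composition; the function name alr_transform and the choice of reference component are ours for illustration and are not fixed by the paper.

```r
# ALR transformation: log-ratios of all parts against one reference part.
alr_transform <- function(x, ref = length(x)) {
  # x: compositional vector with strictly positive parts summing to 1
  log(x[-ref] / x[ref])
}

x <- c(0.2, 0.3, 0.5)
alr_transform(x)            # reference component: x3
alr_transform(x, ref = 1)   # changing the reference gives a different vector
```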

The other is the centered log-ratio (CLR) transformation, defined as

$$\mathrm{clr}_{d\times 1}(\mathbf{x}) = \left[ \ln\frac{x_1}{g(\mathbf{x})},\ \ln\frac{x_2}{g(\mathbf{x})},\ \ldots,\ \ln\frac{x_d}{g(\mathbf{x})} \right]^T,$$

which moves a compositional vector onto a hyperplane in ℝ^d that passes through the origin. Figures 1(a) and 1(b) show a toy example on the simplex and its image under the CLR transformation. In Figure 1(b), the hyperplane carrying the CLR-transformed data is parallel to the original compositional hyperplane (Figure 1(a)) and contains the origin; that is, the transformed components sum to zero, so the data are effectively projected onto this hyperplane. Additionally, the figure exhibits symmetry, which implies that distances are preserved (Galletti and Maratea, 2016). Since distances are preserved and the maximum number of degrees of freedom is also suitable, the CLR transformation is adopted for defining the Aitchison distance, provided in (2.1).
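The zero-sum property of the CLR image can be checked numerically with a short R sketch; the helper clr_transform below is our own minimal implementation (packages such as "compositions" provide equivalent functions).

```r
# CLR transformation: log-ratios against the geometric mean of all d parts.
clr_transform <- function(x) {
  gm <- exp(mean(log(x)))   # geometric mean g(x)
  log(x / gm)
}

x <- c(0.2, 0.3, 0.5)
z <- clr_transform(x)
sum(z)   # numerically zero: the CLR image lies on a hyperplane through the origin
```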

The last log-ratio typed transformation is the isometric log-ratio (ILR) transformation, which is defined as

$$\mathrm{ilr}_{(d-1)\times 1}(\mathbf{x}) = \left[ \sqrt{\frac{i}{i+1}}\, \ln\frac{\left( \prod_{j=1}^{i} x_j \right)^{1/i}}{x_{i+1}} \right]^T, \quad i = 1, 2, \ldots, d-1.$$

This transformation was introduced by Egozcue et al. (2003) to address the limitations of the ALR and CLR transformations and to give a more effective treatment of the concept of orthogonality. Egozcue et al. (2003) investigated the relationships among these three log-ratio type transformations, and Wang et al. (2020) compared the ALR and ILR transformations for soil particle size fraction data.
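A minimal R sketch of the ILR transformation in the balance form shown above is given below; the function name is ours, and the basis shown is one common choice rather than the only possible one.

```r
# ILR transformation (pivot/balance coordinates), one standard orthonormal basis.
ilr_transform <- function(x) {
  d <- length(x)
  sapply(1:(d - 1), function(i) {
    g_i <- exp(mean(log(x[1:i])))              # geometric mean of the first i parts
    sqrt(i / (i + 1)) * log(g_i / x[i + 1])
  })
}

x <- c(0.2, 0.3, 0.5)
ilr_transform(x)   # a (d - 1)-dimensional vector in Euclidean space
```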

Aitchison distance: The Aitchison distance, introduced by Aitchison et al. (2000), is the most widely used distance measure for compositional data. The Aitchison distance between two compositional vectors x = (x_1, …, x_d)^T and y = (y_1, …, y_d)^T on the simplex S^{d−1} is given by

$$d_A(\mathbf{x}, \mathbf{y}) = \sqrt{ \frac{1}{d} \sum_{i=1}^{d-1} \sum_{j=i+1}^{d} \left[ \ln\frac{x_i}{x_j} - \ln\frac{y_i}{y_j} \right]^2 }.$$

This measure is equivalent to the Euclidean distance between two CLR-transformed compositional vectors. The R package "robCompositions" developed by Hron et al. (2010) provides the function aDist for this calculation. Because the Aitchison distance is based on the CLR transformation, it inherits the same limitation: it cannot be calculated when exact zero values are present. Nevertheless, owing to its simplicity and the body of research on handling zero values, it is used as the distance measure in most compositional data analyses. Palarea-Albaladejo et al. (2012), for example, conducted clustering analysis on compositional data based on the Aitchison distance.
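The equivalence between the pairwise log-ratio form above and the Euclidean distance of CLR-transformed vectors can be verified in a few lines of R; the helper names below are ours, and robCompositions::aDist() can be used in place of the hand-written function.

```r
# Aitchison distance in its pairwise log-ratio form.
aitchison_dist <- function(x, y) {
  d <- length(x)
  s <- 0
  for (i in 1:(d - 1)) {
    for (j in (i + 1):d) {
      s <- s + (log(x[i] / x[j]) - log(y[i] / y[j]))^2
    }
  }
  sqrt(s / d)
}

clr_transform <- function(x) log(x / exp(mean(log(x))))

x <- c(0.2, 0.3, 0.5)
y <- c(0.1, 0.4, 0.5)
aitchison_dist(x, y)
sqrt(sum((clr_transform(x) - clr_transform(y))^2))   # identical value
```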

2.2. Square-root transformation

The square-root transformation is a type of power transformation that maps a compositional vector in S^{d−1} onto the surface of the (d − 1)-dimensional unit hypersphere, denoted by ℂ^{d−1}, which has been studied in the field of directional statistics; see, for example, Mardia and Jupp (2000), page 168. The square-root transformation directly provides an alternative for analyzing compositional data containing exact zero values, replacing the existing approaches to dealing with zeros suggested by Aitchison (1986), pages 266–274. After the data are transformed to the directional space, it is possible not only to apply statistical methods developed on ℂ^{d−1}, but also to use directional probability distributions such as the von Mises-Fisher and Kent distributions to describe distributional characteristics. Scealy and Welsh (2011) propose a regression model suitable for typical compositional data sets containing exact zero values; the model uses the mean direction of the Kent distribution as a function of a vector of covariates, and the square-root transformation is employed to handle exact zero values within the model. Furthermore, by projecting the data onto the hypersphere, we can calculate the distance between observations along the sphere, known as the directional distance.
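As a small illustration, the R sketch below square-root transforms a composition containing an exact zero and checks that its image has unit norm, i.e., lies on the unit hypersphere; the function name is ours.

```r
# Square-root transformation: u = sqrt(x) satisfies sum(u^2) = sum(x) = 1,
# so u lies on the unit sphere, even when some parts are exactly zero.
sqrt_transform <- function(x) sqrt(x)

x <- c(0.6, 0.4, 0.0)        # a composition with an exact zero
u <- sqrt_transform(x)
sum(u^2)                     # equals 1: u is a point on the unit sphere
```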

Directional distance: In a directional space, such as the circle in two dimensions, the sphere in three dimensions, or the hypersphere in higher dimensions, the Euclidean distance does not represent the distance between two points along the sample space. For example, Figure 2 shows a sphere in three dimensions with random directional vectors x and y on its surface. The Euclidean distance between these two points, denoted by δ1 in Figure 2, is the shortest path from one to the other, but that path does not follow the sample space. To construct statistical approaches defined appropriately on the sample space of directional vectors, we need a distance measure defined exactly on that space. Hence, we define the distance measure as

$$d_D(\mathbf{x}, \mathbf{y}) = \theta = \arccos\left( \mathbf{x}^T \mathbf{y} \right),$$

which follows the great circle on the sphere, denoted by δ2 in Figure 2; we call it the directional distance. With this directional distance, almost all statistical methods developed on directional spaces become available for analyzing square-root transformed compositional data.
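A minimal R sketch of the directional distance between two square-root transformed compositions follows; clamping the inner product to [−1, 1] is a numerical safeguard we add, not part of the definition.

```r
# Directional (great-circle) distance between square-root transformed compositions.
directional_dist <- function(x, y) {
  u <- sqrt(x); v <- sqrt(y)
  acos(min(1, max(-1, sum(u * v))))   # angle between the two unit vectors
}

x <- c(0.6, 0.4, 0.0)        # exact zeros pose no problem here
y <- c(0.2, 0.3, 0.5)
directional_dist(x, y)       # angle in radians along the great circle
```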

This paper proposes the square-root transformation and its associated measure, the directional distance, as an alternative to the log-ratio transformations and their associated measure, the Aitchison distance. We compare the two measures using simulation scenarios and clustering results from real data obtained with the DBSCAN, which is described in the next section.

3. Methodology

Clustering analysis for compositional data has been conducted in previous studies, such as Palarea-Albaladejo et al. (2012) and Godichon-Baggioni et al. (2019). For the purpose of comparing the Aitchison distance and the directional distance, we consider distance-based clustering, which can be sensitive to the choice of distance measure. We conduct our analysis using the DBSCAN, one of the most well-known clustering algorithms, and briefly review the algorithm in this section.

The DBSCAN (density-based spatial clustering of applications with noise) is a clustering method for spatial data proposed by Ester et al. (1996). The DBSCAN algorithm recognizes a cluster when a minimum number of points (minPts) exist within a certain radius (ε). Figure 3 shows an example of the DBSCAN with the required parameters set to minPts = 4 and ε = 4. Once the parameters are fixed, the procedure follows Algorithm 1.

Based on this algorithm, the DBSCAN finds clusters effectively even when the data are complex or high-dimensional. Shen et al. (2016) applied this algorithm to real-time image segmentation, which has a complex structure, and Schubert et al. (2017) provide details on additional advantages of the DBSCAN.
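As an illustration of how the DBSCAN can be run on an arbitrary distance measure, the R sketch below feeds a precomputed distance matrix to the "dbscan" package, which accepts a "dist" object; the package choice, the toy data, and the parameter values are ours for illustration and are not those used later in the paper.

```r
# DBSCAN on a precomputed distance matrix (here: directional distances).
library(dbscan)

set.seed(1)
comps <- matrix(runif(300), ncol = 3)
comps <- comps / rowSums(comps)              # normalize rows to compositions

# pairwise directional distances between square-root transformed rows
u <- sqrt(comps)
D <- as.dist(acos(pmin(pmax(tcrossprod(u), -1), 1)))

fit <- dbscan(D, eps = 0.1, minPts = 3)      # cluster 0 denotes noise points
table(fit$cluster)
```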

4. Numerical study

Compositional data analysis generally involves a transformation, and log-ratio transformations are commonly used because of their simplicity and familiarity. However, log-ratio transformations run into problems when the compositional data contain exact zero values, and the square-root transformation has been proposed as an alternative. In this section, we compare the two distance measures in a few challenging scenarios generated from the Dirichlet distribution. The Dirichlet distribution is defined by a natural number k and positive parameters α_1, …, α_k. For a vector of positive real numbers [x_1, x_2, …, x_k] such that Σ_{i=1}^{k} x_i = 1, its density is

$$f(x_1, \ldots, x_k; \alpha_1, \ldots, \alpha_k) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{i=1}^{k} x_i^{\alpha_i - 1},$$

where:

$$B(\boldsymbol{\alpha}) = \frac{\prod_{i=1}^{k} \Gamma(\alpha_i)}{\Gamma\left( \sum_{i=1}^{k} \alpha_i \right)}.$$

In (4.1), we can control the variance and the shape of the distribution by changing α. Using this, we simulated six scenarios whose data were generated on the two-dimensional simplex S^2. Our scenarios assume the following cases:

  • Large variance case:

    • Dirichlet(30, 30, 30)

    • Dirichlet(30, 30, 2)

    • Dirichlet(30, 2, 2)

  • Small variance case:

    • Dirichlet(1000, 1000, 1000)

    • Dirichlet(1000, 1000, 50)

    • Dirichlet(1000, 50, 50)

We assume three cases for both the large and small variance settings: (a) all components are equal, (b) one component is close to zero, and (c) two components are close to zero. Figure 4 shows the distributions for the large variance case. Figures for the small variance cases and the corresponding code are provided on our GitHub page (https://github.com/dlakakwns/Clustering-of-Compositional-data-using-DBSCAN).

To compare the two distance measures, we randomly generated 100 points from the Dirichlet distribution in each scenario. In the equal case, Dirichlet(30, 30, 30), the two distance measures appear almost perfectly linearly related. In the other two scenarios, where one or more components are close to zero, the Aitchison distance increases sharply while the directional distance remains on the same scale. As a result, the linearity shown in Figure 4(a) disappears, as seen in Figures 4(b) and 4(c). The loss of linearity can be further confirmed by looking at the border lines. The bottom border line in the case with one component close to zero is almost the same as in the equal case, so some linearity remains. However, in the case with two components close to zero, the border line is markedly different, so the linearity between the two distance measures is almost completely lost. This is because the difference in growth rates, mentioned earlier, appears even more sharply.
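A minimal R sketch of one such scenario is given below; it draws 100 compositions from Dirichlet(30, 30, 2) and plots the pairwise Aitchison distances against the pairwise directional distances. The use of gtools::rdirichlet() and the hand-written distance helpers are our choices for illustration; the code used for the figures in the paper is available on the GitHub page cited above.

```r
# Compare pairwise Aitchison and directional distances for one Dirichlet scenario.
library(gtools)

aitchison_dist_mat <- function(X) {
  Z <- log(X) - rowMeans(log(X))             # CLR transform of each row
  dist(Z)                                    # Euclidean distance of CLR vectors
}
directional_dist_mat <- function(X) {
  U <- sqrt(X)
  as.dist(acos(pmin(pmax(tcrossprod(U), -1), 1)))
}

set.seed(1)
X <- rdirichlet(100, c(30, 30, 2))           # one component close to zero
plot(as.vector(aitchison_dist_mat(X)), as.vector(directional_dist_mat(X)),
     xlab = "Aitchison distance", ylab = "Directional distance")
```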

5. Application to real data

In this section, we compare the Aitchison distance and the directional distance using USG%, a real-world example of compositional data. We define USG%, explain its preprocessing, and present the results of the DBSCAN analysis using radar plots.

5.1. Real data: USG%

Because regulation game time in the NBA is limited to 48 minutes, it is important to successfully execute offensive situations, officially called possessions. Although only one player attempts to score at the end of a possession, all players on the court contribute to it. USG% (usage percentage) is an advanced statistic developed to capture overall possession participation. Different media outlets define advanced statistics slightly differently; in this study, we use the definition of USG% from Basketball Reference, a US-based sports media outlet. The definition is as follows:

$$\mathrm{USG\%} = 100 \times \frac{(\mathrm{FGA} + 0.44 \times \mathrm{FTA} + \mathrm{TOV}) \times (\mathrm{Tm\ MP} / 5)}{\mathrm{MP} \times (\mathrm{Tm\ FGA} + 0.44 \times \mathrm{Tm\ FTA} + \mathrm{Tm\ TOV})},$$

where the meaning of each statistic used in this quantity is described in our GitHub repository. USG% is always positive and sums to 100% within a team's possessions, making it compositional data. However, it loses its compositional nature when interpreted at the individual player level because players differ in their numbers of possessions and playing times. Figure 5 shows an example of this problem, taken from the game between Dallas and Indiana on January 22, 2022. Luka Dončić, one of the superstars in the league and the most important player in Dallas, recorded a 34.3% USG% in 31 minutes and 45 seconds, while bench player Trey Burke recorded a 25.4% USG% in just 5 minutes and 7 seconds. This is a major weakness of USG%: it does not accurately reflect the importance of individual players. In this study, we therefore calculated USG% by summing match data for each position, as provided by Basketball Reference.
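For reference, the Basketball Reference USG% formula quoted above can be written as a short R function; the argument names mirror the box-score abbreviations, and the example input is hypothetical.

```r
# USG% as defined by Basketball Reference (box-score inputs).
usg_pct <- function(FGA, FTA, TOV, MP, TmFGA, TmFTA, TmTOV, TmMP) {
  100 * (FGA + 0.44 * FTA + TOV) * (TmMP / 5) /
    (MP * (TmFGA + 0.44 * TmFTA + TmTOV))
}

# hypothetical single-game line, for illustration only
usg_pct(FGA = 20, FTA = 8, TOV = 3, MP = 32,
        TmFGA = 88, TmFTA = 25, TmTOV = 12, TmMP = 240)
```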

We used data from the top-ranked teams in the Eastern and Western Conferences from the 1993–94 season to the 2021–22 season. Figure 6 and Table 1 summarize the data. PG has the lowest variance and a high average value, so we can confirm that point guards take on a certain level of USG% in most teams. SG, the other guard position, has the highest average value but also the highest variance, so it is likely that some teams recorded outlying USG% values for this position. C has the lowest average and a low variance, so centers take on a low share of possessions in most teams. SF and PF have similar averages, but their variances differ considerably, suggesting that there are more tactical differences between teams for PF than for SF.

5.2. Application of DBSCAN

The DBSCAN algorithm requires two tuning parameters: the minimum number of points (minPts) in each neighborhood and the size of the neighborhood, i.e., the radius ε. To keep the selection simple, we fixed the minimum number of points at 3 and chose only the radius ε by a grid search. Since the two distance measures return pairwise distances on different scales, we selected a different ε for each, namely 0.2542 for the CLR-transformed data and 0.0591 for the square-root transformed data. Although we basically followed the guidance of Ester et al. (1996) in determining the parameters, we also adjusted the choice so that the number of returned clusters was not too large, since interpretability needs to be considered as well.
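A sketch of this selection procedure in R is given below, assuming the pairwise Aitchison and directional distance matrices of the position-level USG% data have already been computed as "dist" objects; the grid of ε values and the summaries reported are our illustrative choices, not the exact procedure used for the reported results.

```r
# Grid search over eps with minPts fixed at 3, run separately per distance matrix.
library(dbscan)

choose_eps <- function(D, eps_grid, minPts = 3) {
  results <- sapply(eps_grid, function(e) {
    cl <- dbscan(D, eps = e, minPts = minPts)$cluster
    c(eps = e,
      n_clusters = length(setdiff(unique(cl), 0)),   # clusters excluding noise
      n_noise = sum(cl == 0))                        # points labeled as noise
  })
  t(results)   # inspect this table and keep an eps with an interpretable result
}

# D_clr and D_sqrt would be the pairwise Aitchison and directional distance
# matrices of the position-level USG% data (hypothetical objects here).
# choose_eps(D_clr,  seq(0.05, 0.50, by = 0.01))
# choose_eps(D_sqrt, seq(0.01, 0.20, by = 0.01))
```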

Figure 7 shows the clustering results in radar plots. The results formed three types of clusters: a noise cluster, two weakly constrained clusters, and two strongly constrained clusters. The USG% data used in this paper show a pattern similar to the equal case in Section 4, so we hypothesized that the results of the DBSCAN algorithm under the two distance measures would be almost identical. The results differed for three teams in the weakly constrained clusters, but the remaining teams were assigned very similarly. In general, the noise cluster in the DBSCAN does not account for a large portion of the data. In this study, however, the DBSCAN was run under strong conditions, such as using only the top team in each conference, a small minPts, and a limited number of clusters, which resulted in a higher proportion of noise. As a result, the noise cluster itself retains some interpretability. For example, the 12–13 Oklahoma City team, which won first place with an abnormally high proportion of usage for its two MVPs (Westbrook at PG and Durant at SF), and the 17–18 Houston team, which pursued a one-man offense led by Harden at SG (who won MVP that season), are both in the noise cluster.

For the weakly constrained clusters, there are two: 'Cluster: SG' and 'Cluster: Pentagon'. 'Cluster: SG' contains teams that treat the SG as a significant offensive option; the SG level in these teams is high, but not so high as to be considered an outlier. Teams include Kobe's Lakers and Klay's Golden State. The SG level of the 99–00 Indiana team was ambiguous, and the team was clustered differently under the two distance measures. 'Cluster: Pentagon' is the most pentagonal, or evenly distributed, of the clusters; teams in this cluster have a balanced distribution across the four positions other than center. These teams include the Boston Celtics with their Big 3 of Pierce, Garnett, and Allen, and the Cleveland Cavaliers with their Big 3 of LeBron, Irving, and Love. For Dallas in 02–03, the PF position is relatively high, and Detroit in 06–07 shows a regular pentagon shape up to the C position. As a result, both teams were ambiguous with respect to the rules and were clustered differently under the two distance measures.

For the strongly constrained clusters, there are also two: 'Cluster: Jordan' and 'Cluster: C'. 'Cluster: Jordan' consists of the Chicago teams from the prime of Michael Jordan, the greatest player in NBA history, whose SG usage was so extreme that it was classified as an outlier and formed its own cluster. 'Cluster: C' is also a cluster of outliers, containing Detroit and Seattle teams with very low offensive participation for the C position. It is noteworthy that the two teams in 'Cluster: C' come from quite different eras, yet the radar plot shows that they are almost identical.

Through the cluster analysis of real data, we were able to identify the characteristics of the top teams and to find teams that showed similar characteristics in different eras. As confirmed in the simulation study, the two distance measures produce similar results for compositional data that are close to the equal case.

6. Discussion

The purpose of this study was to compare the Aitchison distance and the directional distance, based on the log-ratio and square-root transformations respectively, for the analysis of compositional data. To illustrate the comparison, we selected the sports statistic USG% as real data and conducted the analysis. Sport is a field rich in data, and compositional data arises in various forms. However, compositional data analysis is not mainstream in sports analytics, so interpretation is difficult owing to the lack of previous research. In addition, USG% has weaknesses as a statistic, such as not capturing the defensive contribution of a player and being affected by playing time. Follow-up research that considers other statistics together is therefore needed to improve the completeness of the analysis.

In this study, we confirmed that the square-root transformation can be a viable alternative to the log-ratio typed transformations in the analysis of compositional data, especially when a few components are exactly zero. Hence, we believe this study makes a valuable contribution to the field of compositional data analysis. However, further research is required before directional methods can be applied more broadly to real-world compositional data. In particular, for microbiome data, high dimensionality and sparsity need to be considered simultaneously. Therefore, our immediate future work will be the development of directional statistical methods for microbiome data.

Acknowledgement

This research was supported by Kyungpook National University Research Fund, 2021.

Figures
Fig. 1. Toy examples.
Fig. 2. Description of directional distance. δ1 is the Euclidean distance between x and y, and δ2 is the directional distance.
Fig. 3. Description of DBSCAN algorithm.
Fig. 4. Examples of simulation data. The first row is the scatter plot for each case in the large variance scenario, the second row is the density plot for each case and the bottom row is the scatter plot of two distances (Aitchison and directional distances).
Fig. 5. Example of USG% in the NBA player’s statistics (https://www.basketball-reference.com/).
Fig. 6. The distribution of USG%.
Fig. 7. Radar plot for each cluster obtained by DBSCAN algorithm.
TABLES

Table 1

The mean and variance of the analyzed USG% data

Statistic      C            PF           PG           SF           SG
Mean           0.1884354    0.1921473    0.2071067    0.1948141    0.2174965
Var            0.001863014  0.002464389  0.001447866  0.001986486  0.002847740

Algorithm 1.

DBSCAN algorithm

Require: Data, minPts, ε
1:  C = current cluster
2:  for each unclassified point p in Data do
3:    ε-neighborhood(p) = points of Data within a radius ε of p
4:    if |ε-neighborhood(p)| < minPts then
5:      mark p as a noise point
6:    else
7:      C = next cluster
8:      do expandCluster(p, ε-neighborhood(p), C, minPts, ε)
9:  procedure expandCluster(p, ε-neighborhood(p), C, minPts, ε)
10:   add p to cluster C
11:   for each unclassified point o in ε-neighborhood(p) do
12:     ε-neighborhood(o) = points of Data within a radius ε of o
13:     if |ε-neighborhood(o)| ≥ minPts then
14:       ε-neighborhood(p) = ε-neighborhood(p) joined with ε-neighborhood(o)
15:     if o is still unclassified then
16:       add o to cluster C

References
  1. Aitchison J (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society: Series B (Methodological), 44, 139-160.
  2. Aitchison J (1986). The Statistical Analysis of Compositional Data (Monographs on Statistics and Applied Probability), Chapman and Hall, London, New York.
  3. Aitchison J, Barceló-Vidal C, Martín-Fernández JA, and Pawlowsky-Glahn V (2000). Logratio analysis and compositional distance. Mathematical Geology, 32, 271-275.
  4. Aitchison J (2008). The single principle of compositional data analysis, continuing fallacies, confusions and misunderstandings and some suggested remedies, Proceedings of CoDaWork'08, The 3rd Compositional Data Analysis Workshop, Girona, Spain.
  5. Buccianti A and Pawlowsky-Glahn V (2005). New perspectives on water chemistry and compositional data analysis. Mathematical Geology, 37, 703-727.
  6. Cook (1964). Percentage Baseball, Waverly Press, Brooklyn, New York City.
  7. Cust EE, Sweeting AJ, Ball K, and Robertson S (2019). Machine and deep learning for sport-specific movement recognition: A systematic review of model development and performance. Journal of Sports Sciences, 37, 568-600.
  8. Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, and Barceló-Vidal C (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35, 279-300.
  9. Ester M, Kriegel H-P, Sander J, and Xu X (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 96, 226-231.
  10. Galletti A and Maratea A (2016). Numerical stability analysis of the centered log-ratio transformation. Proceedings of the 2016 12th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), Napoli, 713-716, IEEE.
  11. Godichon-Baggioni A, Maugis-Rabusseau C, and Rau A (2019). Clustering transformed compositional data using K-means, with applications in gene expression and bicycle sharing system data. Journal of Applied Statistics, 46, 47-65.
  12. Greenacre M, Martínez-Álvaro M, and Blasco A (2021). Compositional data analysis of microbiome and any-omics datasets: A validation of the additive logratio transformation. Frontiers in Microbiology, 12, 727398.
  13. Hron K, Templ M, and Filzmoser P (2010). Imputation of missing values for compositional data using classical and robust methods. Computational Statistics & Data Analysis, 54, 3095-3107.
  14. Kucera M and Malmgren BA (1998). Logratio transformation of compositional data: A resolution of the constant sum constraint. Marine Micropaleontology, 34, 117-120.
  15. Li H (2015). Microbiome, metagenomics, and high-dimensional compositional data analysis. Annual Review of Statistics and Its Application, 2, 73-94.
  16. Mardia KV and Jupp PE (2000). Directional Statistics, Wiley, Chichester.
  17. Ordóñez EG, Pérez MdCI, and González CT (2016). Performance assessment in water polo using compositional data analysis. Journal of Human Kinetics, 54, 143-151.
  18. Palarea-Albaladejo J, Martín-Fernández JA, and Soto JA (2012). Dealing with distances and transformations for fuzzy C-means clustering of compositional data. Journal of Classification, 29, 144-169.
  19. Rein R and Memmert D (2016). Big data and tactical analysis in elite soccer: Future challenges and opportunities for sports science. SpringerPlus, 5, 1-13.
  20. Scealy J and Welsh AH (2011). Regression for compositional data by using distributions defined on the hypersphere. Journal of the Royal Statistical Society Series B: Statistical Methodology, 73, 351-375.
  21. Scealy J and Welsh AH (2014). Fitting Kent models to compositional data with small concentration. Statistics and Computing, 24, 165-179.
  22. Schubert E, Sander J, Ester M, Kriegel HP, and Xu X (2017). DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42, 1-21.
  23. Shen J, Hao X, Liang Z, Liu Y, Wang W, and Shao L (2016). Real-time superpixel segmentation by DBSCAN clustering algorithm. IEEE Transactions on Image Processing, 25, 5933-5942.
  24. Wang Z, Shi W, Zhou W, Li X, and Yue T (2020). Comparison of additive and isometric log-ratio transformations combined with machine learning and regression kriging models for mapping soil particle size fractions. Geoderma, 365, 114214.