Compositional data, a type of non-Euclidean spatial data, is found in various fields such as geochemistry (Buccianti and Pawlowsky-Glahn, 2005), sports (Ordóñez
A non-Euclidean sample space presents challenges when applying traditional statistical methods. To handle these challenges, research has been conducted on transformations that map the sample space to a Euclidean space. Notably, three representative transformations have been introduced: the additive log-ratio (ALR) transformation, the centered log-ratio (CLR) transformation, and the isometric log-ratio (ILR) transformation. Aitchison (1986), Kucera and Malmgren (1998), and Egozcue
In addition to the non-Euclidean sample space, the analysis of compositional data poses further challenges. The one we focus on in this research is that one or more components may take exact values of zero. For example, in baseball, the proportions of each pitcher's pitching arsenal, that is, the types of pitches thrown such as fastball, curveball, or slider, form compositional data with zero values in some parts. Further examples in sports include goal-scoring ratios of players in soccer, touchdown ratios of players in football, and swing-type frequencies (proportions) in racket sports such as tennis, table tennis, and badminton. In such cases, the components form a composition and some of them can be exactly zero. In this context, log-ratio type transformations such as ALR, CLR, and ILR cannot be applied directly, because the logarithm of zero is undefined, whereas the square-root transformation can be applied directly to any compositional data, including data with exact zeros. Though a few studies proposed an imputation approach or a replacement approach that estimates zeros separately or replaces zeros with small positive values respectively (Hron
Sports data analysis originated from efforts to enhance performance in baseball, such as Cook (1964). The field has since expanded to various sports and has recently begun to actively adopt modern statistical methods and computer science techniques, as in Rein and Memmert (2016) and Cust
In sports data analysis, ratios are used extensively, but statistical methods developed specifically for compositional data have not yet been widely applied. Especially in elite competitive sports such as soccer and baseball, observed compositional data has been treated as simple observational data, ignoring the dependency among components. Ordóñez
This paper presents the results of a DBSCAN (density-based spatial clustering of applications with noise) cluster analysis of the advanced basketball statistic USG% (usage percentage). DBSCAN is an effective clustering technique for large datasets, particularly those with irregular or arbitrarily shaped clusters. Because it is based on density, it can be applied flexibly to various types of data and is not confined to Euclidean space. USG% represents the proportion of a player’s involvement in offensive situations, officially called Possession, and is commonly used as an individual player performance metric. Two issues arise when this measure is used as is (i.e., without any manipulation). First, while USG% within a possession is in the form of compositional data, it loses its compositional character when aggregated after the game. Second, a simple algebraic comparison of USG% fails to consider the player’s contribution to the team. To address these issues, this study interprets USG% by the official player positions (C - center, PF - power forward, SF - small forward, SG - shooting guard, PG - point guard) provided by Basketball Reference, taking into account both team and individual perspectives. We used data from the top-ranked teams in the Eastern and Western Conferences from the 1993–94 season to the 2021–22 season.
The remainder of this paper is organized as follows. In Section 2, we explain the log-ratio transformation and the square-root transformation, along with the distance measures defined on the transformed sample spaces. Section 3 presents an overview of DBSCAN, which is used for the cluster analysis. In Section 4, we compare the two distance measures in a few data scenarios through a simulation study. Section 5 presents the results of applying the DBSCAN algorithm with the two distance measures to USG% data. Finally, we close the paper with a discussion and directions for future work.
In this section, we briefly review two types of transformations. First, we present log-ratio transformations that map the sample space to Euclidean space. Then, we introduce the Aitchison distance, which is in fact the Euclidean distance between two CLR-transformed vectors. Next, we briefly review the square-root transformation, introduced as an alternative to log-ratio type transformations, which maps the sample space onto the surface of a hypersphere. Finally, we describe the directional distance, the distance measure based on the square-root transformation.
Compositional data is a type of non-Euclidean spatial data, so transformations are required before applying traditional statistical methods developed for data in Euclidean space. According to Galletti and Maratea (2016), all transformations need to be isomorphisms and have to guarantee scale-invariance and subcompositional coherence. Consider a random compositional vector
Hence, the components of the data are not independent, and the maximum degree of freedom will be
which is crucially used in the CLR and ILR transformations, while the ALR transformation simply used
According to Aitchison’s early studies, such as Aitchison (1982, 1986), there are two representative log-ratio type transformations. The first is the additive log-ratio (ALR) transformation defined as
which maps compositional data onto the
The other one is the centered log-ratio (CLR) transformation defined as
which moves a compositional vector onto , where is a hyperplane in ℝ
The last log-ratio type transformation is the isometric log-ratio (ILR) transformation, which is defined as
This transformation is introduced by Egozcue
This measure is equivalent to the Euclidean distance between two CLR transformed compositional vectors. The R package “robCompositions” developed by Hron
The square-root transformation is a type of power transformation that maps the compositional vector in onto the surface of the
which follows the great circle on the sphere as denoted by
This paper proposes to use the square-root transformation and its measure, directional distance, as an alternative to the log-ratio transformation and its measure, Aitchison distance. We provide a comparison of the two measures using simulation scenarios and clustering results from real data using the DBSCAN, which will be described in the next section.
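The two measures compared in this paper can be illustrated with a minimal Python sketch. It assumes that the Aitchison distance is the Euclidean distance between CLR-transformed vectors, as stated above, and that the directional distance is the great-circle (arc-length) distance between square-root-transformed vectors on the unit sphere. The function names are hypothetical and are not taken from the paper or from any package.

```python
import math


def clr(x):
    """Centered log-ratio transform; every part must be strictly positive."""
    logs = [math.log(v) for v in x]  # math.log(0) raises ValueError
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the CLR transforms of two compositions."""
    cx, cy = clr(x), clr(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


def directional_distance(x, y):
    """Arc length between sqrt-transformed compositions (assumed definition)."""
    inner = sum(math.sqrt(a * b) for a, b in zip(x, y))
    inner = min(1.0, max(-1.0, inner))  # guard against rounding error
    return math.acos(inner)


# A composition with an exact zero: the directional distance is defined,
# while the CLR-based Aitchison distance is not.
with_zero = [0.5, 0.5, 0.0]
positive = [0.4, 0.4, 0.2]
d_dir = directional_distance(with_zero, positive)
try:
    d_ait = aitchison_distance(with_zero, positive)
except ValueError:
    d_ait = None  # log of zero is undefined
```

In this sketch the square-root transform needs no preprocessing of zeros, whereas the CLR step fails as soon as one part is exactly zero, which is the limitation discussed in the introduction.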
Clustering analysis for compositional data has been conducted in previous research, such as Palarea-Albaladejo
The DBSCAN (density-based spatial clustering of applications with noise) is a clustering method for spatial data that was proposed by Ester
Based on this algorithm, DBSCAN finds clusters effectively even when the data is complex or high-dimensional. Shen
Compositional data analysis generally involves a transformation. Log-ratio transformations are commonly used because of their simplicity and familiarity, but they run into problems when compositional data contains exact zero values. The square-root transformation is therefore proposed as an alternative. In this section, we compare the two distance measures in a few challenging scenarios generated from the Dirichlet distribution. The Dirichlet distribution is defined over a natural number
where:
In (
Dirichlet(30, 30, 30)
Dirichlet(30, 30, 2)
Dirichlet(30, 2, 2)
Dirichlet(1000, 1000, 1000)
Dirichlet(1000, 1000, 50)
Dirichlet(1000, 50, 50)
We consider three cases for both the large variance and the small variance settings: a) all components are equal, b) one component is close to zero, and c) two components are close to zero. Figure 4 shows the distribution of the large variance case. Figures for the small variance cases and the corresponding code are provided on our GitHub page (https://github.com/dlakakwns/Clustering-of-Compositional-data-using-DBSCAN).
To compare the two distance measures, we randomly generated 100 points from the Dirichlet distribution in each scenario. In the equal case, Dirichlet(30, 30, 30), the relationship between the two distance measures appears almost perfectly linear. In the other two scenarios, where one or more components are close to zero, the Aitchison distance increases substantially while the directional distance remains on the same scale. As a result, the linearity shown in Figure 4(a) disappears in Figures 4(b) and 4(c). The loss of linearity can be further confirmed by looking at the border lines. When one component is close to zero, the bottom border line is almost the same as in the equal case, so some linearity remains. However, when two components are close to zero, the border line is markedly different, and the linearity between the two distance measures is almost completely lost. This is because the difference in the rates at which the two distances increase, noted above, becomes sharper.
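The comparison described above can be reproduced in spirit with the following sketch, which is not the authors' code (theirs is on the GitHub page). It assumes the same distance definitions as in Section 2, draws Dirichlet samples by normalizing independent gamma variates, and summarizes each scenario by the mean pairwise distance. The helper names, the seed, and the choice of summary are illustrative assumptions.

```python
import math
import random


def rdirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def aitchison_distance(x, y):
    """Euclidean distance between CLR-transformed compositions."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    return math.sqrt(sum(((a - mx) - (b - my)) ** 2 for a, b in zip(lx, ly)))


def directional_distance(x, y):
    """Arc length between sqrt-transformed compositions (assumed definition)."""
    inner = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return math.acos(min(1.0, max(-1.0, inner)))


def mean_pairwise(points, dist):
    """Average distance over all unordered pairs of points."""
    n = len(points)
    total = sum(dist(points[i], points[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def compare(alpha, n=100, seed=0):
    """Return (mean Aitchison, mean directional) for n Dirichlet draws."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    points = [rdirichlet(alpha, rng) for _ in range(n)]
    return (mean_pairwise(points, aitchison_distance),
            mean_pairwise(points, directional_distance))


# The three large-variance scenarios listed above.
scenarios = [(30, 30, 30), (30, 30, 2), (30, 2, 2)]
results = {alpha: compare(alpha) for alpha in scenarios}
for alpha, (ait, dirn) in results.items():
    print(alpha, round(ait, 3), round(dirn, 3))
```

Under these assumptions, the mean Aitchison distance should grow markedly as components approach zero, while the mean directional distance stays on a similar scale, in line with the pattern reported for Figure 4.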
In this section, we compare the Aitchison distance and the directional distance using USG%, a real-world example of compositional data. We define USG%, explain its preprocessing, and present the results of the DBSCAN analysis using a radar plot.
Because regular game time in the NBA is limited to 48 minutes, it is important to successfully execute offensive situations, officially called Possession. Although only one player attempts to score at the end of a possession, all players on the court contribute to it. USG% (usage percentage) is an advanced statistic developed to capture overall participation in possessions. Different media outlets define advanced statistics slightly differently. In this study, we used the definition of USG% from Basketball Reference, a US-based sports media outlet. The definition is as follows
where the meaning of each statistic used in this quantity is described in our GitHub. USG% is always positive and sums to 1 (100%) across all possessions for a team, which makes it compositional data. However, USG% loses its compositional nature when interpreted at the individual player level, because players have different numbers of possessions and playing times. Figure 5 shows an example of this problem, using the game between Dallas and Indiana on January 22, 2022. Luka Dončić is one of the superstars of the league and the most important player in Dallas. He recorded a 34.3% USG% in 31 minutes and 45 seconds, whereas bench player Trey Burke recorded a 25.4% USG% in just 5 minutes and 7 seconds. This is a major weakness of USG%: it does not sufficiently reflect the importance of individual players. In this study, we calculated USG% by summing match data for each position, as provided by Basketball Reference.
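As a rough illustration of the definition referenced above, the sketch below assumes the formula published in the Basketball Reference glossary, USG% = 100 × ((FGA + 0.44 × FTA + TOV) × (Tm MP / 5)) / (MP × (Tm FGA + 0.44 × Tm FTA + Tm TOV)). The function name and the box-score numbers are hypothetical and serve only to show when the shares form a composition.

```python
def usage_percentage(fga, fta, tov, mp, tm_fga, tm_fta, tm_tov, tm_mp):
    """USG% under the assumed Basketball Reference glossary formula."""
    player_plays = fga + 0.44 * fta + tov
    team_plays = tm_fga + 0.44 * tm_fta + tm_tov
    return 100.0 * (player_plays * (tm_mp / 5.0)) / (mp * team_plays)


# Hypothetical box score: five players who each play all 48 minutes.
lineup = [
    {"fga": 22, "fta": 9, "tov": 4},
    {"fga": 18, "fta": 4, "tov": 2},
    {"fga": 14, "fta": 6, "tov": 3},
    {"fga": 12, "fta": 2, "tov": 1},
    {"fga": 9, "fta": 3, "tov": 2},
]
tm_fga = sum(p["fga"] for p in lineup)
tm_fta = sum(p["fta"] for p in lineup)
tm_tov = sum(p["tov"] for p in lineup)
tm_mp = 48 * 5  # total team minutes in a regulation game

usg = [
    usage_percentage(p["fga"], p["fta"], p["tov"], 48,
                     tm_fga, tm_fta, tm_tov, tm_mp)
    for p in lineup
]
total_usg = sum(usg)  # the five shares add up to 100 in this setting
```

When all five players are on the court for the same minutes, the shares sum to 100% and form a composition. Once playing times differ, each player's value is rescaled by his own minutes, so the per-player values no longer share a common total, which is the problem illustrated by Figure 5.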
We used data from the top-ranked teams in the Eastern and Western Conferences from 1993–94 to 2021–22. Figure 6 and Table 1 summarize the data. PG has the lowest variance and a high mean, which confirms that point guards take on a certain level of USG% in most teams. SG, the other guard position, has the highest mean but also the highest variance, which suggests that some teams recorded outlying USG% values at this position. C has the lowest mean and a low variance, which confirms that centers account for a low share of possessions in most teams. SF and PF have similar means, but their variances differ noticeably, which suggests that teams differ more tactically at PF than at SF.
The DBSCAN algorithm requires two tuning parameters to obtain clusters: the minimum number of points (minPts) in the neighborhood of each data point and the size of the neighborhood, i.e., the radius
Figure 7 shows the clustering results in a radar plot. The results comprise three types of clusters: a noise cluster, two weakly constrained clusters, and two strongly constrained clusters. The USG% data used in this paper showed a pattern similar to the equal case in Section 4. We therefore hypothesized that the DBSCAN results under the two distance measures would be nearly the same. The results differed for three teams in the weakly constrained clusters, but were otherwise very similar. In general, noise points in DBSCAN do not account for a large portion of the data. In this study, however, DBSCAN was run under strong conditions, such as using only the top team in each conference, the chosen minPts, and a fixed number of clusters, which resulted in a higher proportion of noise. As a result, the noise cluster is itself interpretable to some extent. For example, the 12–13 Oklahoma City team, which finished first with an abnormally high share for its two MVPs (Westbrook at PG and Durant at SF), and the 17–18 Houston team, which pursued a one-man offense led by Harden at SG (who won MVP that season), are both in the noise cluster.
Under the weak rules, there are two clusters: ‘Cluster: SG’ and ‘Cluster: Pentagon’. ‘Cluster: SG’ consists of teams that treat the SG as a significant offensive option; the SG share in these teams is high, but not so high as to be an outlier. Its teams include Kobe’s Lakers and Klay’s Golden State. The SG share of the 99–00 Indiana team was ambiguous, and the team was clustered differently under the two distance measures. ‘Cluster: Pentagon’ has the most pentagonal, that is, the most evenly distributed, radar shape of the clusters. Teams in this cluster have a balanced distribution across the four positions other than center. They include the Boston Celtics with their Big3 of Pierce, Garnett, and Allen, and the Cleveland Cavaliers with their Big3 of LeBron, Irving, and Love. For Dallas in 02–03, the PF share is relatively high, and for Detroit in 06–07, the regular pentagon shape extends to the C position as well. As a result, both teams were ambiguous with respect to the rules and were clustered differently under the two distance measures.
Under the strong rules, there are two clusters: ‘Cluster: Jordan’ and ‘Cluster: C’. ‘Cluster: Jordan’ consists of the Chicago teams from the prime of Michael Jordan, the greatest player in NBA history, whose SG share is extreme enough to be an outlier and therefore forms its own cluster. ‘Cluster: C’ is also a cluster of outliers, made up of Detroit and Seattle teams with very low offensive participation by the C. Notably, the two teams in ‘Cluster: C’ are from quite different eras, yet the radar plot shows that the cluster analysis finds them almost identical.
Through the cluster analysis of real data, we were able to identify the characteristics of the top teams and to find teams from different eras that showed similar characteristics. Consistent with the simulation study, the two distance measures produce similar results for compositional data close to the equal case.
The purpose of this study is to present a comparative analysis of the Aitchison distance and the directional distance, which are based on the log-ratio transformation and the square-root transformation respectively, for the analysis of compositional data. To support this comparison, we selected the sports statistic USG% as real data and conducted the analysis. Sport is a field rich in data, and compositional data arises there in various forms. However, compositional data analysis is not yet mainstream in sports data analysis, so interpretation is difficult because little previous research is available. In addition, USG% has weaknesses as a statistic: it does not include the defensive contribution of the player and it is affected by playing time. Follow-up research that considers other statistics together is therefore needed to make the analysis more complete.
In this study, we confirmed that the square-root transformation can be a viable alternative to log-ratio type transformations in the analysis of compositional data, especially when some components are exactly zero. We therefore believe this study makes a valuable contribution to the field of compositional data analysis. However, further research is required to apply directional methods to real-world compositional data. In particular, for microbiome data, high dimensionality and sparsity need to be considered simultaneously. Our immediate future work is therefore the development of directional statistical methods adapted to microbiome data.
This research was supported by Kyungpook National University Research Fund, 2021.
Table 1. Mean and variance of the analyzed USG% data
Statistic | C | PF | PG | SF | SG |
---|---|---|---|---|---|
Mean | 0.1884354 | 0.1921473 | 0.2071067 | 0.1948141 | 0.2174965 |
Variance | 0.001863014 | 0.002464389 | 0.001447866 | 0.001986486 | 0.002847740 |
DBSCAN algorithm
1: | C = current cluster |
2: | |
3: | |
4: | |
5: | Mark |
6: | |
7: | C = next cluster |
8: | Do expandCluster( |
9: | |
10: | add |
11: | |
12: | |
13: | |
14: | |
15: | |
16: | add |
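The listing above can be sketched in Python as follows. This is a minimal illustration of the standard DBSCAN procedure with a user-supplied distance function, not the authors' implementation. The function names, the toy compositions, and the values of `eps` and `min_pts` are assumptions chosen for the example, and the directional distance is assumed to be the arc length between square-root-transformed vectors.

```python
import math

NOISE = -1


def directional_distance(x, y):
    """Arc length between sqrt-transformed compositions (assumed definition)."""
    inner = sum(math.sqrt(a * b) for a, b in zip(x, y))
    return math.acos(min(1.0, max(-1.0, inner)))


def region_query(points, i, eps, dist):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts, dist):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps, dist)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # start the next cluster from core point i
        labels[i] = cluster
        queue = list(neighbors)
        k = 0
        while k < len(queue):  # expand the cluster through core points
            j = queue[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps, dist)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
    return labels


# Toy compositions: a balanced group, a group dominated by the first part,
# and one point with an exact zero.
points = [
    (0.34, 0.33, 0.33), (0.33, 0.34, 0.33), (0.33, 0.33, 0.34),
    (0.335, 0.335, 0.33),
    (0.80, 0.10, 0.10), (0.79, 0.11, 0.10), (0.81, 0.10, 0.09),
    (0.80, 0.09, 0.11),
    (0.0, 0.5, 0.5),
]
labels = dbscan(points, eps=0.05, min_pts=3, dist=directional_distance)
```

Because the distance function is passed in as an argument, the same routine can be run with the Aitchison distance instead, provided no composition contains an exact zero. With the toy data above, the two groups form separate clusters and the point containing a zero is labeled as noise.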