Age-related hearing loss (presbyacusis) is a major public health concern because of a growing aging population, its high prevalence among older adults, and the significant communication difficulties and health-related problems associated with it (Lin
The current study was motivated by ongoing genetic studies of a large number of older adults whose audiograms have been classified into one of four subtypes. In genetic studies, it is critical to have accurate phenotype definitions that differentiate subjects according to presumed underlying pathologies, in order to maximize statistical power, minimize biases, and demonstrate phenotype-genotype associations. In statistical genetics, extreme discordant phenotype (EDP) design has been used to address problems that arise when a wide range of classification accuracy within each phenotype is observed. In the EDP design, only subjects with purer phenotypes are used while subjects with ambiguous (impure) phenotypes are discarded. This approach has been reported to improve signals by “purifying” subjects (removing “noise” subjects) despite the loss of sample size (Barnett
Key variables of the presbyacusis dataset considered in this paper are the 8 variables indicating the probabilities that a subject’s right and left ears belong to one of the four presbyacusis subtypes (Older-Normal, Metabolic, Sensory, and Metabolic + Sensory), generated using the supervised machine learning approach (Dubno
In this paper, we investigate the problem of purifying subjects by formulating it as a supervised learning problem, integrating the transformation of paired compositional data with penalized multinomial logistic regression, and ranking subjects using predicted probabilities that indicate phenotype purity. We further evaluate the resulting ranking of presbyacusis subjects against the previous literature.
Subjects were participants in an ongoing longitudinal study of age-related hearing loss (Dubno
The paired compositional data with D components (subtypes) and n samples is expressed as a matrix of size
Figure 1 shows the overall workflow of the proposed supervised learning approach to rank subjects.
We generate three types of variables from the paired compositional data in order to take the rationale of ‘phenotype purity’ into account and to facilitate interpretation. Here, the phenotype of a subject is defined as “purer” if 1) the probability for one presbyacusis subtype is higher than those for the other subtypes and 2) these probabilities are more comparable between the two ears. First, we calculate the averaged probabilities for each component across the two ears:
Second, we calculate the absolute difference in probabilities for each component between two ears:
Here,
Then, we calculate the absolute difference between the two largest values of the perturbation probabilities
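The three transformations described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function and variable names are hypothetical, and the perturbation step is assumed to be the standard compositional perturbation (component-wise product of the two ears' compositions, rescaled to sum to one), which may differ in detail from the formula used in the paper.

```python
from typing import List

SUBTYPES = ["Older-Normal", "Metabolic", "Sensory", "Metabolic+Sensory"]


def transform_pair(right: List[float], left: List[float]) -> dict:
    """Turn one subject's paired compositions into the three variable types."""
    if len(right) != len(left):
        raise ValueError("both ears need the same number of components")

    # 1) Average probability of each subtype across the two ears.
    ave = [(r + l) / 2 for r, l in zip(right, left)]

    # 2) Absolute between-ear difference for each subtype.
    diff = [abs(r - l) for r, l in zip(right, left)]

    # 3) Perturbation probabilities (ASSUMPTION: component-wise product,
    #    rescaled to sum to one), then the gap between the two largest values.
    prod = [r * l for r, l in zip(right, left)]
    total = sum(prod)
    if total == 0:
        raise ValueError("the two compositions share no component with positive probability")
    pert = [p / total for p in prod]
    largest, second = sorted(pert, reverse=True)[:2]

    return {"ave": ave, "diff": diff, "pert": pert, "pert_diff": largest - second}


# Hypothetical subject whose two ears both favour the Sensory subtype.
example = transform_pair(
    right=[0.05, 0.10, 0.80, 0.05],
    left=[0.10, 0.05, 0.75, 0.10],
)
```

For a subject like this one, the averages concentrate on a single subtype, the between-ear differences are small, and the perturbation difference is close to one, which is the pattern the purity definition above is meant to capture.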
For the modelling purpose, we also reformulate the synthetic class variable (
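As the figure captions describe it, the class variable distinguishes four pure phenotype classes (both ears assigned the same subtype) from one impure class (the two ears assigned different subtypes). A minimal sketch of that construction follows. It assumes, purely for illustration, that each ear's assigned class is the subtype with the highest probability; in the study the per-ear assignments come from the earlier classification step, and the function names here are hypothetical.

```python
from typing import List

SUBTYPES = ["Older-Normal", "Metabolic", "Sensory", "Metabolic+Sensory"]


def assigned_class(probs: List[float]) -> str:
    """ASSUMPTION: an ear is assigned to its highest-probability subtype."""
    best = max(range(len(probs)), key=lambda j: probs[j])
    return SUBTYPES[best]


def synthetic_class(right: List[float], left: List[float]) -> str:
    """Pure subtype label when both ears agree, otherwise 'Impure'."""
    r, l = assigned_class(right), assigned_class(left)
    return r if r == l else "Impure"


# Hypothetical subjects: one with matching ears, one with mismatching ears.
pure_label = synthetic_class([0.05, 0.10, 0.80, 0.05], [0.10, 0.05, 0.75, 0.10])
impure_label = synthetic_class([0.70, 0.10, 0.10, 0.10], [0.10, 0.70, 0.10, 0.10])
```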
It is of interest to consider a multivariate approach instead of transforming
Instead of considering all the mismatches as one class (
While the data transformation procedures described in Sections 2.3 and 2.4 provide useful representations of the original paired compositional data and their class assignments, correlations among these transformed variables (
When the number of classes is larger than two ((
Based on this model, Friedman et al. modelled the penalized multinomial logistic regression using the regularized maximum multinomial likelihood (Friedman
To handle correlations among the transformed variables (
We use 10-fold cross-validation to simultaneously select optimal values of the tuning parameters
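For reference, the model and penalty described above can be written out as follows. The notation is an assumption and follows the convention of Friedman et al.'s regularized multinomial likelihood rather than necessarily the symbols used in this paper: $K$ is the number of classes, $g_i$ the class of subject $i$, $x_i$ the vector of transformed variables, $\lambda \ge 0$ the overall penalty strength, and $\alpha \in [0, 1]$ the Elastic Net mixing parameter.

```latex
% Symmetric multinomial logistic model
\Pr(G = k \mid X = x)
  = \frac{\exp\!\left(\beta_{0k} + x^{\top}\beta_{k}\right)}
         {\sum_{\ell=1}^{K} \exp\!\left(\beta_{0\ell} + x^{\top}\beta_{\ell}\right)},
  \qquad k = 1, \dots, K.

% Penalized (regularized) multinomial log-likelihood
\max_{\{\beta_{0k},\, \beta_{k}\}_{k=1}^{K}}
  \left[
    \frac{1}{n} \sum_{i=1}^{n} \log \Pr(G = g_i \mid X = x_i)
    \;-\; \lambda \sum_{k=1}^{K} P_{\alpha}(\beta_{k})
  \right],

% Elastic Net penalty
P_{\alpha}(\beta)
  = \frac{1 - \alpha}{2}\, \lVert \beta \rVert_2^2 + \alpha\, \lVert \beta \rVert_1 .
```

In this parameterization, $\alpha = 1$ gives the Lasso penalty and $\alpha = 0$ gives the ridge penalty; cross-validation then amounts to searching over a grid of $(\alpha, \lambda)$ pairs.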
After fitting the penalized multinomial logistic regression model that includes transformed paired compositional data and the modified synthetic class variable, we compute predicted probabilities for each subtype as Pr(
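The ranking step can be sketched as below, assuming each pure-class subject is ranked within their class by the predicted probability of that class and that the top half is taken as the ceiling of half the class size. Both choices are assumptions made for illustration; the function name, the tuple layout, and the example subjects are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_rank(subjects: List[Tuple[str, str, float]]) -> Dict[str, Dict[str, List[str]]]:
    """Rank subjects within each class and split them into top and bottom halves.

    Each subject is (subject_id, pure_class, predicted probability of that class).
    """
    by_class = defaultdict(list)
    for subject_id, pure_class, prob in subjects:
        by_class[pure_class].append((prob, subject_id))

    halves = {}
    for pure_class, members in by_class.items():
        # Higher predicted probability means a purer phenotype, so sort descending.
        ranked = [sid for _, sid in sorted(members, key=lambda m: m[0], reverse=True)]
        cut = math.ceil(len(ranked) / 2)  # ASSUMPTION: odd sizes put the extra subject on top
        halves[pure_class] = {"top": ranked[:cut], "bottom": ranked[cut:]}
    return halves


# Hypothetical predicted probabilities for five subjects in two classes.
halves = split_by_rank([
    ("s1", "Sensory", 0.95),
    ("s2", "Sensory", 0.60),
    ("s3", "Sensory", 0.80),
    ("s4", "Metabolic", 0.70),
    ("s5", "Metabolic", 0.90),
])
```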
As described in Section 2.4, we defined the synthetic classes (
Figure 3 and Table A.2 in the Supplementary Materials show the averaged probabilities of each of four presbyacusis subtypes across two ears (
Figure 4 and Table A.4 in the Supplementary Materials show absolute differences in probabilities between two ears for each of the four presbyacusis subtypes (
Figure 5 and Table A.6 in the Supplementary Materials show perturbation probabilities for each of the four presbyacusis subtypes (
In summary, data transformation results indicate that the three types of variables generated (
Based on the exploratory analysis results in Section 3.2, we transformed the original 8 variables of paired compositional data (probabilities for 4 presbyacusis subtypes in each ear;
We begin by exploring the correlation among the 9 predictor variables (
We further checked the variance inflation factors (VIFs) of the transformed variables to investigate multicollinearity among them. The VIF values were as follows: Ave.Older-normal = 6,967,878; Ave.Metabolic = 33,530,720; Ave.Sensory = 111,326,100; Ave.Met+Sen = 76,892,700; Diff.Older-normal = 1.71; Diff.Metabolic = 8.20; Diff.Sensory = 5.07; Diff.Met+Sen = 3.33; Pert.diff = 3.58. Extremely high VIF values were observed for the average variables, whereas low VIF values were observed for the absolute difference and absolute perturbation difference variables. Overall, this is consistent with what we observed in Figure 6.
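The very large VIFs of the average variables follow from the compositional constraint. The short derivation below uses hypothetical notation ($p^{R}_{ij}$ and $p^{L}_{ij}$ for the right- and left-ear probabilities of subtype $j$ in subject $i$, and $\bar{p}_{ij}$ for their average); the paper's own symbols may differ.

```latex
% Definition of the variance inflation factor for predictor j,
% where R_j^2 is obtained by regressing predictor j on all other predictors
\mathrm{VIF}_j = \frac{1}{1 - R_j^{2}}

% Each ear's probabilities sum to one, so the averages do as well
\sum_{j=1}^{4} \bar{p}_{ij}
  = \frac{1}{2} \left( \sum_{j=1}^{4} p^{R}_{ij} + \sum_{j=1}^{4} p^{L}_{ij} \right)
  = \frac{1}{2}\,(1 + 1) = 1

% Hence any one average is an exact linear function of the other three
\bar{p}_{i1} = 1 - \bar{p}_{i2} - \bar{p}_{i3} - \bar{p}_{i4}
  \;\;\Longrightarrow\;\; R_1^{2} \to 1
  \;\;\Longrightarrow\;\; \mathrm{VIF}_1 \to \infty
```

The reported values are finite but extremely large, which would be expected if the stored probabilities sum to one only approximately (for example, because of rounding); that explanation is a conjecture and is not stated in the source.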
The final values of
Table 1 shows regression coefficient estimates for the five classes we considered (
To further explore sparseness and shrinkage of the proposed approach, we investigated two extreme cases of the Elastic Net penalty, which are the Lasso penalty (
We next evaluated patterns of the predicted probability used to rank subjects, with respect to the average, absolute difference, and absolute perturbation difference variables (Figure A.4 in the Supplementary Materials). The results indicate that the predicted probabilities are positively associated with averages and absolute perturbation differences and negatively associated with absolute differences. Finally, we investigated our ranking results using the expected demographics based on previous literature (Dubno
Table 2 and Figures A.5–A.7 in the Supplementary Materials indicate that the expected demographics (age, sex, and noise exposure history) are observed for the top-ranked 50% of subjects. First, the Metabolic (mean = 75.69 years) and Metabolic+Sensory (mean = 74.98 years) subtypes are on average older than the Older-Normal (mean = 67.12 years) and Sensory (mean = 70.57 years) subtypes. Second, the percentage of females is lower in the Sensory (36.05%) and Metabolic+Sensory (51.52%) subtypes than in the Older-Normal (86.36%) and Metabolic (68.75%) subtypes. Third, the percentage of subjects with a positive noise exposure history is higher in the Sensory (60.47%) and Metabolic+Sensory (48.48%) subtypes than in the Older-Normal (27.27%) and Metabolic (43.75%) subtypes. In contrast, differences among presbyacusis subtypes are more diluted for the bottom-ranked 50% of subjects, as expected. Across the four subtypes, the standard deviation of the mean ages decreases from 4.01 for the top-ranked 50% of subjects to 3.89 for the bottom-ranked 50%. Likewise, the standard deviation of the female percentages decreases from 21.72 to 16.85, and that of the positive noise exposure history percentages decreases from 13.75 to 8.82. These results are consistent with what has been reported in the literature and further support the validity of our ranking approach.
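The between-subtype spread quoted above can be recomputed directly from the top-ranked and bottom-ranked columns of Table 2. The sketch below assumes the reported figures are sample standard deviations (n − 1 denominator) taken across the four pure subtypes; the dictionary layout and names are illustrative only.

```python
from statistics import stdev

# Values from Table 2, in the order Older-Normal, Metabolic, Sensory, Metabolic+Sensory.
TABLE2 = {
    "age_mean": {
        "top": [67.12, 75.69, 70.57, 74.98],
        "bottom": [66.05, 73.39, 72.42, 74.89],
    },
    "female_pct": {
        "top": [86.36, 68.75, 36.05, 51.52],
        "bottom": [80.95, 62.50, 51.72, 41.54],
    },
    "noise_pct": {
        "top": [27.27, 43.75, 60.47, 48.48],
        "bottom": [33.33, 43.75, 46.51, 54.69],
    },
}

# Sample standard deviation across subtypes, for each measure and each half.
spread = {
    measure: {half: round(stdev(values), 2) for half, values in halves.items()}
    for measure, halves in TABLE2.items()
}

for measure, halves in spread.items():
    print(f"{measure}: top = {halves['top']}, bottom = {halves['bottom']}")
```

Under that assumption, every measure shows a smaller spread in the bottom-ranked half than in the top-ranked half, which is the dilution pattern described in the text.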
In this paper, motivated by the problem of identifying and ranking subjects according to subtypes of age-related hearing loss, we proposed a new supervised learning approach that ranks subjects using a penalized multinomial logistic regression model integrated with a transformation of compositional data. We formulated the task as a supervised learning problem and used the penalized multinomial logistic regression model with the Elastic Net penalty, after transforming the paired compositional data into three types of variables: averaged probabilities of each component across the two sides of the pair, absolute differences in component probabilities between the two sides of the pair, and the absolute difference between the two largest values of the perturbation probabilities. Its application to age-related hearing loss data indicates that the proposed approach is effective for ranking subjects based on paired compositional data for age-related hearing loss subtyping. Furthermore, the results generated using our approach are consistent with biological knowledge and the previous literature.
The proposed approach is promising for the subject ranking problem discussed in this manuscript; however, the literature on the analysis of paired compositional data is still too limited to support more general recommendations at this point. There is an important need to develop more statistical approaches for handling paired compositional data in various contexts. In general, multiple issues need to be considered when analysis approaches are developed for paired compositional data. First, if the original space is considered, it is important to carefully take into account the ‘sum-to-one’ constraint, which places observations on a simplex, for example by visualizing the data using ternary diagrams (Van den Boogaart and Tolosana-Delgado, 2013). Second, for the same reason, various statistical assumptions need to be carefully checked and addressed. While the popular ilr, clr, and alr transformations are known to help address these issues, assumption violations such as heteroscedasticity are still often observed (Maier, 2014). Finally, as we illustrated in our approach to the subject ranking problem, additional relevant issues need to be considered when analyzing paired compositional data, such as the correspondence of each element between the two sides of the pair and the relationships between them. We plan to investigate paired compositional data for diverse problems and develop statistical approaches to handle them in a future study.
This work was supported by the National Institute on Deafness and Other Communication Disorders [grant number P50-DC000422]; the National Institute of General Medical Sciences [grant number R01-GM122078]; the National Cancer Institute [grant number R21-CA209848]; the National Institute on Drug Abuse [grant number U01-DA045300]; the National Institute of Arthritis and Musculoskeletal and Skin Diseases [grant number P30-AR072582]; and the South Carolina Clinical and Translational Research Institute with an academic home at the Medical University of South Carolina [National Center for Advancing Translational Sciences grant number UL1-TR001450].
The authors declare no conflict of interest.
Summary of the proposed workflow for ranking subjects based on paired compositional data.
Boxplots of probabilities for the four presbyacusis subtypes (color) for each assigned class (column), separated by ear (panel).
Boxplots of probabilities averaged across two ears for all four presbyacusis subtypes (color), for four pure phenotype classes (cases in which the assigned classes were the same for the right and left ears, including Older-Normal, Metabolic, Sensory, and Metabolic+Sensory) and one impure phenotype class (cases in which assigned classes for the two ears were different) (panel).
Boxplots of absolute differences in presbyacusis subtype probabilities between two ears (color), for four pure phenotype classes (cases in which the assigned classes were the same for the right and left ears, including Older-Normal, Metabolic, Sensory, and Metabolic+Sensory) and one impure phenotype class (cases in which assigned classes for the two ears were different) (panel).
Boxplots of perturbation probabilities for the four presbyacusis subtypes (the first four columns) and the absolute differences between the two largest perturbation probabilities (“Pert. Diff”; the last column) for four pure phenotype classes (cases in which the assigned classes were the same for the right and left ears, including Older-Normal, Metabolic, Sensory, and Metabolic+Sensory) and one impure phenotype class (cases in which assigned classes for the two ears were different) (panel).
Heatmap of correlation coefficients between three types of generated variables: averages (ave.*), absolute differences (diff.*), and absolute perturbation difference (pert.diff).
Coefficient estimates of the multinomial logistic regression model with the Elastic Net penalty
Variable | Older-Normal | Metabolic | Sensory | Met+Sen | Impure |
---|---|---|---|---|---|
Intercept | −0.589 | −1.768 | −0.420 | 0.769 | −2.009 |
ave.Older-Normal | 8.285 | 0.000 | −3.726 | −2.330 | −0.000 |
ave.Metabolic | −2.090 | 12.012 | −4.656 | −3.960 | −0.646 |
ave.Sensory | −1.027 | −0.913 | 9.456 | −4.284 | −0.000 |
ave.Met+Sen | −1.429 | −1.210 | −3.931 | 9.010 | −0.000 |
diff.Older-Normal | −3.122 | 0.000 | −1.308 | −1.320 | −9.907 |
diff.Metabolic | 0.000 | −1.926 | −1.058 | −1.130 | −8.944 |
diff.Sensory | −2.931 | 1.417 | −9.506 | 0.000 | 11.388 |
diff.Met+Sen | −1.404 | −0.981 | 0.000 | −6.728 | 13.886 |
pert.diff | 1.344 | −0.179 | 4.608 | 2.244 | −9.342 |
ave.*, diff.*, and pert.diff denote the average, absolute difference, and absolute perturbation difference variables, respectively.
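To make the coefficients in Table 1 concrete, the sketch below plugs them into the multinomial logistic (softmax) formula for one hypothetical subject. It assumes the tabulated coefficients apply to the untransformed, unstandardized variables, which the source does not state; the subject's values and all names in the code are illustrative.

```python
import math

# Coefficients from Table 1. Order within each list:
# ave.Older-Normal, ave.Metabolic, ave.Sensory, ave.Met+Sen,
# diff.Older-Normal, diff.Metabolic, diff.Sensory, diff.Met+Sen, pert.diff
INTERCEPT = {
    "Older-Normal": -0.589, "Metabolic": -1.768, "Sensory": -0.420,
    "Met+Sen": 0.769, "Impure": -2.009,
}
COEF = {
    "Older-Normal": [8.285, -2.090, -1.027, -1.429, -3.122, 0.000, -2.931, -1.404, 1.344],
    "Metabolic": [0.000, 12.012, -0.913, -1.210, 0.000, -1.926, 1.417, -0.981, -0.179],
    "Sensory": [-3.726, -4.656, 9.456, -3.931, -1.308, -1.058, -9.506, 0.000, 4.608],
    "Met+Sen": [-2.330, -3.960, -4.284, 9.010, -1.320, -1.130, 0.000, -6.728, 2.244],
    "Impure": [0.000, -0.646, 0.000, 0.000, -9.907, -8.944, 11.388, 13.886, -9.342],
}


def predict_proba(x):
    """Softmax over the class-specific linear predictors."""
    scores = {
        cls: INTERCEPT[cls] + sum(b * v for b, v in zip(COEF[cls], x))
        for cls in COEF
    }
    shift = max(scores.values())  # subtract the maximum for numerical stability
    weights = {cls: math.exp(s - shift) for cls, s in scores.items()}
    total = sum(weights.values())
    return {cls: w / total for cls, w in weights.items()}


# Hypothetical subject: high average Sensory probability, small between-ear
# differences, and a large perturbation difference.
subject = [0.05, 0.05, 0.85, 0.05, 0.02, 0.02, 0.05, 0.01, 0.90]
probs = predict_proba(subject)
predicted = max(probs, key=probs.get)
```

For this subject the Sensory class receives by far the largest linear predictor, driven by the positive coefficients on ave.Sensory and pert.diff, so the predicted probability of Sensory is close to one.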
Demographics (age, sex, and noise exposure history) of the top-ranked 50% and the bottom-ranked 50% subjects identified using the proposed statistical approach
Pure Phenotype Class | Age, Mean (SD): All | Age, Mean (SD): Top 50% | Age, Mean (SD): Bottom 50% | Female, %: All | Female, %: Top 50% | Female, %: Bottom 50% | Positive Noise Exposure, %: All | Positive Noise Exposure, %: Top 50% | Positive Noise Exposure, %: Bottom 50% |
---|---|---|---|---|---|---|---|---|---|
Older-Normal ( | 66.60 (6.82) | 67.12 (6.97) | 66.05 (6.80) | 83.72 | 86.36 | 80.95 | 30.23 | 27.27 | 33.33 |
Metabolic ( | 74.54 (6.91) | 75.69 (7.42) | 73.39 (6.38) | 65.63 | 68.75 | 62.50 | 43.75 | 43.75 | 43.75 |
Sensory ( | 71.50 (6.98) | 70.57 (6.71) | 72.42 (7.16) | 43.93 | 36.05 | 51.72 | 53.49 | 60.47 | 46.51 |
Metabolic+Sensory ( | 74.94 (7.39) | 74.98 (7.48) | 74.89 (7.36) | 46.56 | 51.52 | 41.54 | 51.54 | 48.48 | 54.69 |
Overall ( | 72.39 (7.54) | | | 51.19 | | | 50.40 | | |