
Thanks to microarray technology, researchers are able to analyze global gene expression at the molecular level. The application of these technologies advances not only medicine but also basic biology (Tan
The most widely used analysis for these issues is correlation analysis. Correlation analysis is employed to ascertain the relationship between datasets, particularly when quantifying the association between two sets of data. Canonical correlation analysis (CCA) (Hotelling, 1992), the most widely used method for correlation analysis, seeks specific linear transformations that maximize the correlation between two multivariate datasets. However, CCA faces challenges when handling high-dimensional datasets due to the necessity of computing the inverse of covariance matrices. In contrast, co-inertia analysis (CIA) (Dolédec
Although the sparsity of loading vectors addresses the interpretation problem to some extent, there remain challenges in ensuring that non-zero elements truly represent significant features. This issue is closely associated with the problem of multiplicity. Traditionally, it has been addressed by controlling the family-wise error rate (FWER), which is the probability of making at least one false positive across all hypotheses tested (Qian
Despite advances in managing multiple hypotheses, certain contexts, such as regression settings, pose additional challenges. In regression, variables with small
Building on this insight, we propose a novel penalized CIA method that integrates SLOPE as a penalty term to control the false discovery rate (FDR) of loading vectors while maintaining their sparsity. Unlike traditional CIA and sparse CIA methods, our approach keeps the FDR at a lower level. This provides more reliable identification of meaningful features while controlling false positives. It is particularly helpful for practitioners and researchers in biomedical fields: as treatments based on molecular-level information become more common, accurately identifying the genes that affect a specific disease becomes increasingly important.
In the rest of this paper, we first introduce the sparse co-inertia analysis (sCIA) method and SLOPE. Then we propose an extended CIA approach that employs SLOPE as a penalty function. In Section 3, we present the results of an extensive simulation study comparing the performance of our proposed method with existing methodologies. Additionally, we apply our method to the NCI60 dataset to evaluate its performance on real-world data, comparing the results with those obtained from the alternative techniques assessed in the simulation study. Finally, we conclude the paper with a discussion of our findings.
Co-inertia analysis (CIA) was initially developed in the field of ecology to explore the relationships between species and their environment. For instance, it can be used to analyze the relationship between the abundance of various species and the environmental variables of their habitats. Unlike CCA, which employs canonical correlation and requires the inversion of covariance matrices (a requirement that poses challenges with high-dimensional datasets), CIA uses co-inertia as a measure of association (Dolédec
CIA considers two datasets;
Let
Considering the high-dimensional nature of our data, achieving sparsity in the loading vector is essential for interpretability. Sparse CIA (sCIA) (Min
To solve (
where
Due to the inherited nature of the
In the linear regression model,
To address FDR control in regression settings, Bogdan
where
Suggested approach to decide
Here,
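The SLOPE penalty described above can be illustrated with a minimal sketch. It assumes the standard sorted-L1 form from the SLOPE literature, a weighted sum of the ordered absolute coefficients, and a BH-type weight sequence built from standard normal quantiles at a target FDR level `q`. The function names `bh_lambda_sequence` and `sorted_l1_penalty` are hypothetical, and the exact sequence used in this paper may differ from the one assumed here.

```python
from statistics import NormalDist


def bh_lambda_sequence(p, q):
    """Assumed BH-type weights: lambda_i = Phi^{-1}(1 - i*q / (2p)), i = 1..p.

    The sequence is nonincreasing, so larger coefficients are penalized more.
    """
    if p < 1:
        raise ValueError("p must be a positive integer")
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf(1.0 - i * q / (2.0 * p)) for i in range(1, p + 1)]


def sorted_l1_penalty(coefs, lambdas):
    """Sorted-L1 norm: sum_i lambda_i * |b|_(i), where |b|_(1) >= ... >= |b|_(p)."""
    if len(coefs) != len(lambdas):
        raise ValueError("coefs and lambdas must have the same length")
    # Pair the largest weight with the largest absolute coefficient.
    magnitudes = sorted((abs(b) for b in coefs), reverse=True)
    return sum(lam * mag for lam, mag in zip(lambdas, magnitudes))
```

When all weights are equal, the sorted-L1 penalty reduces to the usual LASSO penalty, which is one way to see SLOPE as a generalization of the L1 penalty used in sCIA.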
We propose employing the SLOPE penalty instead of the LASSO penalty in the sCIA problem to ensure that the estimated loading vectors achieve both sparsity and the desired FDR. Imposing the penalty term makes the loading vectors sparse, and because SLOPE mimics the BH procedure, it controls the FDR of the estimated loading vectors. The algorithm for the proposed model is as follows.
Lagrangian multipliers,
We follow the method of data generating presented in Min
The covariance matrix of
The datasets
We varied the number of nonzero elements in the true loading vectors and the signal intensity for performance comparison between CIA, sCIA, and SLOPE-CIA. The number of non-zero elements ranged from 5 to 80 to facilitate the comparison of variable selection performance. Additionally, signal intensity was varied to observe how the performance of the methods differed under different conditions. Specifically, scenarios 1 to 5 were characterized by higher signal intensity, whereas scenarios 6 to 10 had lower signal intensity. Within each scenario group, the number of non-zero elements increased as the scenario number increased.
SLOPE utilizes a single parameter,
Here,
where FP, TP, FN, and TN denote false positives, true positives, false negatives, and true negatives, respectively. FDR is the proportion of false discoveries among the selected elements, while sensitivity is the proportion of truly nonzero elements that are detected. Specificity is the proportion of truly zero elements that are estimated as zero, while accuracy is the proportion of all elements that are correctly classified. Additionally, we consider the angle to measure the similarity between a true loading vector and an estimated loading vector. Angle is calculated as ∠(
In this simulation study, we evaluate three methods: classical CIA, sCIA, and SLOPE-CIA. The simulation results are presented in Table 1 and Table 2. All results are also visualized in Figure 1, where solid lines represent the higher signal intensity cases and dashed lines represent the lower signal intensity cases.
First, we observe that SLOPE-CIA exhibits sensitivity comparable to that of sCIA while achieving the highest accuracy across all scenarios. This indicates that both sCIA and SLOPE-CIA are effective in detecting nonzero elements, but SLOPE-CIA more accurately distinguishes nonzero elements from zero elements.
In addition to accuracy and sensitivity, the FDR is another crucial measure to consider. Our objective was to maintain the FDR below 0.1, and we observe that SLOPE-CIA achieves FDR values close to this target with high accuracy. This suggests that variables selected by SLOPE-CIA are highly likely to be truly significant. Although SLOPE-CIA shows a slight decline in performance as the number of nonzero elements increases, it still demonstrates lower FDR and higher accuracy than sCIA, indicating greater reliability than the previous methods. This phenomenon of FDR inflation with an increasing number of significant variables is also observed in Bogdan
Furthermore, the angles between the true and estimated loading vectors are better for SLOPE-CIA than for the other methods. The angle measure not only assesses how well a method detects non-zero elements but also accounts for the magnitude of the selected elements. A higher angle value indicates greater precision in the estimated loading vectors, indicating that SLOPE-CIA is the most effective of the three methods at identifying meaningful variables as significant elements.
Lastly, Table 2 presents results under the same settings as Table 1, but with lower signal intensity. Comparing the results in Table 1 and Table 2 shows the differences caused by signal intensity. In general, performance decreases with lower signal intensity across all methods and all numbers of non-zero elements, which is expected because higher intensity makes signals easier to detect.
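The selection metrics reported in Tables 1 and 2 can be computed from the supports of a true and an estimated loading vector. The sketch below is illustrative only: the function names are hypothetical, and because the paper's angle formula is not reproduced here, `abs_cosine_similarity` assumes a sign-invariant cosine measure, which is one plausible reading given that higher values are described as better.

```python
import math


def selection_metrics(true_vec, est_vec, tol=0.0):
    """Return FDR, sensitivity, specificity, and accuracy of the estimated support."""
    true_nz = [abs(t) > tol for t in true_vec]
    est_nz = [abs(e) > tol for e in est_vec]
    tp = sum(t and e for t, e in zip(true_nz, est_nz))
    fp = sum((not t) and e for t, e in zip(true_nz, est_nz))
    fn = sum(t and (not e) for t, e in zip(true_nz, est_nz))
    tn = sum((not t) and (not e) for t, e in zip(true_nz, est_nz))
    return {
        # FDR is defined as 0 when nothing is selected.
        "fdr": fp / max(tp + fp, 1),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(true_vec),
    }


def abs_cosine_similarity(u, v):
    """Assumed similarity: |u'v| / (||u|| ||v||), invariant to sign flips."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("similarity is undefined for a zero vector")
    return abs(dot) / (norm_u * norm_v)
```

Taking the absolute value matters because loading vectors are only identified up to sign, so an estimate and its negation should score the same.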
To evaluate the performance of the proposed algorithm, we also conduct a real data analysis. Developed by the National Cancer Institute (NCI), the NCI60 dataset comprises 60 human cancer cell lines representing various cancer types, including leukemia, melanoma, non-small-cell lung carcinoma (NSCLC), central nervous system (CNS) cancers, ovarian cancer, breast cancer, colon cancer, renal cancer, and prostate cancer. Multiple datasets detailing the activities of DNA, RNA, and proteins are accessible for download via the CellMiner web platform (https://discover.nci.nih.gov/cellminer/).
Specifically, we applied our method to gene expression data generated by Staunton
To evaluate the performance of each method, we count the non-zero elements in the first two loading vectors and calculate the cumulative percentage of explained variability. A lower number of nonzero elements in the estimated loadings indicates greater sparsity, making the results easier to interpret. A method that remains sparse while still identifying the significant elements can explain most of the variability contained in the data. This is measured by the cumulative percentage of explained variability, which is the ratio of the co-inertia calculated from the estimated loading vectors to the total co-inertia of the datasets. This metric reflects the extent to which each method explains the co-variability between the two datasets. As shown in Table 3, our proposed method exhibits the lowest number of nonzero elements, while its cumulative percentage of explained variability is comparable to that of CIA and sCIA. This outcome demonstrates that our method effectively sets non-contributing elements to zero while maintaining the variability representation of the two datasets.
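The ratio described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact computation: it assumes uniform row weights, identity column metrics, unit-norm loading vectors, and that contributions from successive loading pairs can simply be added (which holds for orthogonal axes but ignores any deflation step). All function names are hypothetical.

```python
def center_columns(rows):
    """Subtract each column mean from a list-of-rows matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def cross_covariance(x_rows, y_rows):
    """C = Xc' Yc / n for column-centered X (n x p) and Y (n x q)."""
    n = len(x_rows)
    xc, yc = center_columns(x_rows), center_columns(y_rows)
    p, q = len(xc[0]), len(yc[0])
    return [
        [sum(xc[i][j] * yc[i][k] for i in range(n)) / n for k in range(q)]
        for j in range(p)
    ]


def cumulative_explained_coinertia(x_rows, y_rows, loading_pairs):
    """Cumulative ratio of (a' C b)^2 to the total co-inertia sum_jk C_jk^2."""
    c = cross_covariance(x_rows, y_rows)
    total = sum(v * v for row in c for v in row)
    if total == 0.0:
        raise ValueError("total co-inertia is zero")
    cumulative, running = [], 0.0
    for a, b in loading_pairs:
        proj = sum(
            a[j] * c[j][k] * b[k] for j in range(len(a)) for k in range(len(b))
        )
        running += proj * proj / total
        cumulative.append(running)
    return cumulative
```

Under these assumptions, the singular vectors of the cross-covariance matrix attain the largest possible ratio for each axis, so sparse loadings can only match or fall slightly below the values obtained by classical CIA.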
Following the methodology of Dolédec
Figures in the second and third rows illustrate the projected gene and protein spaces, respectively. We labeled the top 30 genes that are most distant from the origin in red. It is notable that the genes selected by CIA, sCIA, and SLOPE-CIA exhibit similar locations. For instance, KRT19, recognized as a tumor cell marker (Saha
Further details on the roles of each gene can be elucidated through gene list functional enrichment analysis using ToppFun of ToppGene Suite (Chen
Lastly, the most important difference between SLOPE-CIA and the other methods is its ability to detect and control false discoveries effectively. Among the 1679 elements in the first loading vectors, 281 are considered false discoveries, as they are zero elements in SLOPE-CIA but are selected by sCIA. Enrichment analysis of these genes shows a high occurrence of diseases such as HIV, anemia, arteriosclerosis, congenital hypoplastic anemia, and hypoplasia of thumb, which are not of primary interest in the NCI60 cell line data. This indicates that, in a real data setting, our method effectively controls false discoveries that do not contribute to explaining the correlation between the two datasets.
Among numerous correlation analysis methods for two multivariate datasets, CIA introduced a novel measurement called co-inertia to calculate the concordance between datasets. However, interpreting loading vectors becomes challenging in high-dimensional datasets. To address this, we proposed a method called SLOPE-CIA, which provides sparse loading vectors with false discovery rate (FDR) control. Our method not only improves the interpretability of loading vectors but also shows a lower number of erroneous discoveries compared to previous methods. Results from simulation studies validate this, and we demonstrate the effectiveness of our method in a real data setting by conducting an analysis using the NCI60 datasets.
Despite its advantages, SLOPE-CIA has some limitations. First, cross-validation is used to select the sparsity parameters, which can be computationally intensive. Second, the design of the SLOPE tuning parameters requires various methods for optimization, which may add complexity to the implementation.
For future research, there are several directions to explore. One is to extend the method to analyze more than two datasets simultaneously. Another direction is to incorporate gene network information into the analysis, which could enhance the biological relevance and interpretability of the results.
Results for Simulation (
FDR | Sens | Spec | Acc | Angle | FDR | Sens | Spec | Acc | Angle | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Scenario 1 | CIA | - | - | - | - | 0.945 (0.000) | - | - | - | - | 0.932 (0.000) |
sCIA | 0.039 (0.078) | 1.000 (0.000) | 0.999 (0.001) | 0.999 (0.001) | 0.951 (0.001) | 0.766 (0.112) | 1.000 (0.000) | 0.958 (0.023) | 0.958 (0.023) | 0.932 (0.000) | |
SLOPE-CIA | 1.000 (0.000) | 1.000 (0.000) | |||||||||
Scenario 2 | CIA | - | - | - | - | 0.936 (0.000) | - | - | - | - | 0.944 (0.000) |
sCIA | 0.335 (0.138) | 1.000 (0.000) | 0.985 (0.008) | 0.986 (0.008) | 0.938 (0.000) | 0.290 (0.154) | 1.000 (0.000) | 0.990 (0.008) | 0.990 (0.001) | 0.947 (0.000) | |
SLOPE-CIA | 1.000 (0.000) | 1.000 (0.000) | |||||||||
Scenario 3 | CIA | - | - | - | - | 0.931 (0.000) | - | - | - | - | 0.891 (0.000) |
sCIA | 0.332 (0.086) | 0.959 (0.015) | 0.959 (0.014) | 0.935 (0.000) | 0.303 (0.082) | 0.972 (0.011) | 0.972 (0.011) | 0.901 (0.001) | |||
SLOPE-CIA | 0.933 (0.000) | 0.933 (0.000) | |||||||||
Scenario 4 | CIA | - | - | - | - | 0.943 (0.000) | - | - | - | - | 0.905 (0.000) |
sCIA | 0.311 (0.060) | 0.936 (0.018) | 0.940 (0.015) | 0.949 (0.000) | 0.279 (0.068) | 0.957 (0.015) | 0.958 (0.013) | 0.914 (0.001) | |||
SLOPE-CIA | 0.960 (0.003) | 0.940 (0.000) | |||||||||
Scenario 5 | CIA | - | - | - | - | 0.932 (0.000) | - | - | - | - | 0.911 (0.000) |
sCIA | 0.255 (0.046) | 0.918 (0.020) | 0.924 (0.016) | 0.941 (0.000) | 0.294 (0.049) | 0.923 (0.019) | 0.928 (0.015) | 0.919 (0.000) | |||
SLOPE-CIA | 0.932 (0.007) | 0.916 (0.006) |
Results for Simulation (
FDR | Sens | Spec | Acc | Angle | FDR | Sens | Spec | Acc | Angle | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Scenario 6 | CIA | - | - | - | - | 0.945 (0.000) | - | - | - | - | 0.932 (0.000) |
sCIA | 0.518 (0.198) | 1.000 (0.000) | 0.981 (0.014) | 0.982 (0.013) | 0.949 (0.001) | 0.949 (0.009) | 1.000 (0.000) | 0.808 (0.037) | 0.809 (0.036) | 0.932 (0.000) | |
SLOPE-CIA | 1.000 (0.000) | 1.000 (0.000) | |||||||||
Scenario 7 | CIA | - | - | - | - | 0.936 (0.000) | - | - | - | - | 0.944 (0.000) |
sCIA | 0.785 (0.058) | 1.000 (0.000) | 0.898 (0.032) | 0.901 (0.032) | 0.937 (0.000) | 0.800 (0.060) | 1.000 (0.000) | 0.910 (0.030) | 0.912 (0.029) | 0.946 (0.000) | |
SLOPE-CIA | 1.000 (0.000) | 1.000 (0.000) | |||||||||
Scenario 8 | CIA | - | - | - | - | 0.931 (0.000) | - | - | - | - | 0.891 (0.000) |
sCIA | 0.636 (0.060) | 0.858 (0.347) | 0.866 (0.032) | 0.934 (0.000) | 0.640 (0.060) | 0.883 (0.030) | 0.980 (0.028) | 0.900 (0.001) | |||
SLOPE-CIA | 0.933 (0.000) | 0.933 (0.000) | |||||||||
Scenario 9 | CIA | - | - | - | - | 0.943 (0.000) | - | - | - | - | 0.905 (0.000) |
sCIA | 0.548 (0.050) | 0.828 (0.034) | 0.846 (0.030) | 0.948 (0.000) | 0.564 (0.052) | 0.857 (0.029) | 0.868 (0.026) | 0.913 (0.001) | |||
SLOPE-CIA | 0.961 (0.005) | 0.946 (0.009) | |||||||||
Scenario 10 | CIA | - | - | - | - | 0.932 (0.000) | - | - | - | - | 0.911 (0.000) |
sCIA | 0.441 (0.041) | 0.810 (0.031) | 0.838 (0.025) | 0.940 (0.000) | 0.512 (0.037) | 0.805 (0.030) | 0.830 (0.025) | 0.918 (0.001) | |||
SLOPE-CIA | 0.934 (0.008) | 0.918 (0.006) |
Results of real data analysis using NCI60 data
Number of nonzero elements | Cumulative percentage explained | |||||
---|---|---|---|---|---|---|
1st loading | 1st & 2nd loading | |||||
CIA | 1517 | 162 | 1517 | 162 | 0.379 | 0.594 |
sCIA | 1358 | 153 | 1382 | 157 | 0.378 | 0.593 |
SLOPE-CIA | 1096 | 139 | 1255 | 146 | 0.372 | 0.577 |