Co-Inertia analysis for multi-omics data with FDR control via SLOPE
Communications for Statistical Applications and Methods 2025;32:91-106
Published online January 31, 2025
© 2025 Korean Statistical Society.

Soyeon Paeng (a), Eun Jeong Min (a,b,1)

(a) Department of Medical Sciences, The Catholic University of Korea, Graduate School, Korea
(b) Department of Medical Life Sciences, The Catholic University of Korea, Korea
Correspondence to: (1) Department of Medical Life Sciences, The Catholic University of Korea, Banpo-daero 222, Seocho-gu, Seoul 06591, Korea. E-mail: ej.min@catholic.ac.kr
Received September 8, 2024; Revised October 10, 2024; Accepted November 4, 2024.
Abstract
Co-inertia analysis (CIA) is a multivariate analysis method that assesses relationships and trends in two sets of data. It has been effectively employed in the integrative analysis of high-dimensional multi-omics datasets. Recently, penalized CIA methods have been introduced to enhance interpretability by inducing sparsity in the loading vectors. However, challenges persist in ensuring that the non-zero elements in the estimated vectors genuinely represent significant features. To address these challenges, we propose a penalized CIA method that controls the false discovery rate (FDR) using sorted l-1 penalized estimation (SLOPE). This approach allows for simultaneous FDR control and sparsity induction in the estimated vectors. Extensive simulation studies demonstrate its performance compared to existing CIA methods. Additionally, we apply our method to the integrative analysis of the NCI60 data to show its effectiveness in real-world scenarios.
Keywords : co-inertia analysis, sorted L-one penalized estimation, FDR, sparsity, omics data
1. Introduction

Thanks to microarray technology, researchers are able to analyze global gene expression at the molecular level. The application of these technologies extends not only to medicine but also to basic biology (Tan et al., 2008). Consequently, these technologies generate diverse high-throughput multi-omics datasets, which screen various biological activities at the molecular level, including DNA, RNA, proteins, and beyond. With the advent of omics datasets, numerous researchers have started to analyze these data to obtain a comprehensive understanding of the underlying biological system or to correlate the omics-based molecular measurements with a clinical outcome of interest (Forward, 2012). For instance, leveraging genomics datasets, researchers can identify biomarkers associated with specific diseases or conditions, thus facilitating advancements in precision medicine and disease diagnosis.

The most widely used approach to these issues is correlation analysis, which is employed to ascertain the relationship between datasets, particularly when quantifying the association between two sets of data. Canonical correlation analysis (CCA) (Hotelling, 1992), the most popular such method, seeks specific linear transformations that maximize the correlation between two multivariate datasets. However, CCA faces challenges when handling high-dimensional datasets due to the necessity of computing the inverse of covariance matrices. In contrast, co-inertia analysis (CIA) (Dolédec and Chessel, 1994), another method for studying the association between two datasets, identifies pairs of linear vectors that maximize a measure known as ‘co-inertia’, which can be considered a generalized version of covariance between two datasets. Since CIA does not require the calculation of the inverse of covariance matrices, it does not encounter the same issues with high-dimensional data. Nevertheless, it still presents interpretational difficulties arising from high dimensionality. To overcome this problem, sparse co-inertia analysis (sCIA), which combines penalization with CIA, has emerged (Min et al., 2019). This method uses the l1-norm as a penalty term to obtain sparsity.

Although the sparsity of loading vectors addresses the interpretation problem to some extent, challenges remain in ensuring that non-zero elements truly represent significant features. This issue is closely associated with the problem of multiplicity. Traditionally, it has been addressed by controlling the family-wise error rate (FWER), the probability of making at least one false positive across all hypotheses tested (Qian et al., 2015). Beginning with Bonferroni’s correction, the most conservative approach, Holm proposed a step-down FWER-controlling procedure, and Hochberg utilized the Simes inequality to derive a step-up procedure (Hwang et al., 2010). However, FWER control does not take into account the severity of the loss incurred by erroneous rejections (Benjamini and Hochberg, 1995). In response, the false discovery rate (FDR), the expected proportion of rejected null hypotheses that are erroneously rejected, was proposed as an alternative error measure (Benjamini and Hochberg, 1995). Various methods to control the FDR, such as the step-up one-stage procedure (BH95) and Storey’s q-value (Storey, 2002), have been developed in response to this challenge.

Despite advances in managing multiple hypotheses, certain contexts, such as regression settings, pose additional challenges. In regression, variables with small p-values (typically less than 0.05) are not guaranteed to be genuinely correlated with the outcome variable. To tackle this, methods like sorted L-one penalized estimation (SLOPE) have been developed (Bogdan et al., 2015). SLOPE incorporates data-dependent penalties into the regression model, which increase with the size of the coefficients, helping to control the FDR at a targeted level while addressing the limitations of simple p-value adjustments.

Building on this insight, we propose a novel penalized CIA method that integrates SLOPE as a penalty term to control the false discovery rate (FDR) of loading vectors while maintaining their sparsity. Unlike traditional CIA and sparse CIA methods, our approach effectively keeps the FDR at a lower rate. This enhanced capability provides significant advantages by offering more reliable identification of meaningful features while controlling for false positives. This is particularly helpful for practitioners and researchers in biomedical fields: as molecular-level treatments become more common, accurately identifying the genes that affect a specific disease grows increasingly important.

In the rest of this paper, we first introduce the sparse co-inertia analysis (sCIA) method and SLOPE. Then we propose an extended CIA approach that employs SLOPE as a penalty function. In Section 3, we present results of an extensive simulation study comparing the performance of our proposed method with existing methodologies. Additionally, we apply our method to the NCI60 dataset to evaluate its performance on real-world data, comparing the results with those obtained using the alternative techniques assessed in the simulation study. Finally, we conclude this paper with a comprehensive discussion of our findings.

2. Methods

2.1. Co-inertia analysis and sparse co-inertia analysis

Co-inertia analysis (CIA) was initially developed in the field of ecology to explore the relationships between species and their environment. For instance, it can be used to analyze the relationship between the abundance of various species and the environmental variables of their habitats. Unlike CCA, which employs canonical correlation and requires the inversion of covariance matrices, posing challenges with high-dimensional datasets, CIA uses co-inertia as a measure of association (Dolédec and Chessel, 1994). Use of co-inertia, which can be considered a generalized form of covariance, allows CIA to avoid matrix inversion in its estimation procedure, which is advantageous when dealing with high-dimensional data. This benefit has led to the application of CIA in cross-platform comparison studies of two microarray datasets (Culhane et al., 2003) and to its widespread use in elucidating relationships between different omics datasets.

CIA considers two datasets: X ∈ ℝ^{n×p} contains observations of p variables from n samples, while Y ∈ ℝ^{n×q} contains observations of q variables from the same samples. D ∈ ℝ^{n×n} is a diagonal matrix with positive weights assigned to each sample. Additionally, Qx ∈ ℝ^{p×p} and Qy ∈ ℝ^{q×q} are diagonal matrices with positive weights assigned to the variables of X and Y, respectively. These matrices represent the networks among the samples and variables. If prior information about the relationships between samples or variables is available, researchers can construct the matrices based on that information. When no prior knowledge is available, Dray et al. (2003) described various types of matrices commonly used, depending on the study objectives. The co-inertia, which measures the concordance between the two datasets (Dolédec and Chessel, 1994), is defined as Ic = trace(X Qx X^T D Y Qy Y^T D).

Let X Qx u represent the projection of X onto the vector u normalized with Qx, where u is known as the inertia axis or inertia loading vector (Min et al., 2019). With a Qx-normed vector u and a Qy-normed vector v, the co-inertia between the two projections X Qx u and Y Qy v is defined as Ic(u, v) = (u^T Qx X^T D Y Qy v)^2 (Min et al., 2019). The objective of CIA is to identify a principal loading vector that maximizes the projected variability (Dolédec and Chessel, 1994). The objective function can be expressed as (2.1), and this can be reformulated as (2.2), where a = Qx^{1/2} u, b = Qy^{1/2} v, X̃ = D^{1/2} X Qx^{1/2}, and Ỹ = D^{1/2} Y Qy^{1/2} (Min et al., 2019). Subsequently, (2.2) can be solved via singular value decomposition (SVD).

maximize_{u,v} (u^T Qx X^T D Y Qy v)^2   subject to   u^T Qx u = v^T Qy v = 1.   (2.1)

maximize_{a,b} a^T X̃^T Ỹ b   subject to   ||a||_2 = 1, ||b||_2 = 1.   (2.2)
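As a concrete illustration of the SVD route to (2.2), the following sketch uses random matrices standing in for X̃ and Ỹ (a toy example, not the authors' code):

```python
import numpy as np

# Toy example: random matrices stand in for X tilde and Y tilde.
rng = np.random.default_rng(0)
n, p, q = 50, 20, 30
Xt = rng.standard_normal((n, p))
Yt = rng.standard_normal((n, q))

# (2.2): the maximizing (a, b) are the leading left and right singular
# vectors of the p x q cross-product matrix M = Xt^T Yt.
M = Xt.T @ Yt
U, s, Vt = np.linalg.svd(M)
a, b = U[:, 0], Vt[0, :]

# At the optimum, the objective a^T M b equals the largest singular value.
print(np.allclose(a @ M @ b, s[0]))  # True
```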

Considering the high-dimensional nature of our data, achieving sparsity in the loading vector is essential for interpretability. Sparse CIA (sCIA) (Min et al., 2019) has been proposed to achieve this by imposing an l1-constraint on the optimization problem. This constraint, as shown in (2.3), involves the pre-defined constants c1 and c2.

maximize_{a,b} a^T X̃^T Ỹ b   subject to   ||a||_2 = 1, ||b||_2 = 1, ||a||_1 ≤ c1, ||b||_1 ≤ c2.   (2.3)

To solve (2.3), Min et al. (2019) reformulated it using the Lagrangian formulation as follows:

minimize_{a,b} −a^T X̃^T Ỹ b + (1/2)||a||_2^2 + λ1 ||a||_1 + (1/2)||b||_2^2 + λ2 ||b||_1,   (2.4)

where λ1 and λ2 are Lagrangian multipliers. Problem (2.4) can be solved by an iterative approach, fixing one loading vector and estimating the other until the objective function converges. The iterative algorithm used to solve (2.4) is described below.

a ← argmin_a (1/2)||X̃^T Ỹ b − a||_2^2 + λ1 ||a||_1,
b ← argmin_b (1/2)||Ỹ^T X̃ a − b||_2^2 + λ2 ||b||_1.   (2.5)

Due to the inherent nature of the l1-penalty, the algorithm performs variable selection and shrinkage simultaneously (Kim, 2014). As sparse CIA uses diagonal matrices for Qx and Qy for interpretability, the sparsity of a and b transfers to the loading vectors u and v (Min et al., 2019).
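Each subproblem in the alternating updates has a closed-form soft-thresholding solution. The following is a minimal sketch (the function names are ours, and the renormalization to unit norm is the usual rescaling toward the constraint of (2.3), not necessarily the authors' exact implementation):

```python
import numpy as np

def soft_threshold(z, lam):
    """Closed-form solution of argmin_a 0.5*||z - a||^2 + lam*||a||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scia(Xt, Yt, lam1, lam2, n_iter=100):
    """Alternating soft-thresholding updates for (2.4); a sketch only."""
    M = Xt.T @ Yt
    b = np.linalg.svd(M)[2][0]              # initialize b from the SVD
    a = np.zeros(M.shape[0])
    for _ in range(n_iter):
        a = soft_threshold(M @ b, lam1)
        if np.linalg.norm(a) > 0:
            a = a / np.linalg.norm(a)       # rescale toward ||a||_2 = 1
        b = soft_threshold(M.T @ a, lam2)
        if np.linalg.norm(b) > 0:
            b = b / np.linalg.norm(b)       # rescale toward ||b||_2 = 1
    return a, b
```

With lam1 = lam2 = 0 the loop reduces to power iteration on M, recovering the SVD solution of (2.2).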

2.2. Sorted l-1 penalized estimation (SLOPE)

In the linear regression model y = Xβ + ε, it is hard to interpret the estimated coefficient vector β̂ when the data exhibit high dimensionality. To overcome this issue, Tibshirani (1996) proposed the LASSO, which imposes an l1-penalty on the estimation of the coefficient vector, making the estimated coefficient vector β̂ sparse. Despite its advantages, LASSO does not control the false discovery rate (FDR).

To address FDR control in regression settings, Bogdan et al. (2015) proposed a novel penalty term in which the penalty applied to each coefficient varies with the coefficient’s magnitude, and called it sorted l-1 penalized estimation (SLOPE). Larger penalties are imposed on the coefficients that appear most significant, which controls the FDR. The SLOPE estimator is obtained as the solution to

minimize_β (1/2)||y − Xβ||^2 + α Σ_{i=1}^p λ_i |β|_(i) = (1/2)||y − Xβ||^2 + α[λ1 |β|_(1) + λ2 |β|_(2) + ··· + λp |β|_(p)],

where α is a parameter determining the intensity of the penalty term and, as mentioned above, λ1 ≥ λ2 ≥ ··· ≥ λp ≥ 0 and |β|_(1) ≥ |β|_(2) ≥ ··· ≥ |β|_(p) ≥ 0 are the coefficient magnitudes sorted in decreasing order.

The suggested approach for choosing the λi’s is to utilize the Benjamini-Hochberg (BH) procedure. The BH procedure is a widely recognized method for controlling the FDR in multiple testing problems by adjusting p-values based on their rank. Adapting this concept, the penalty intensity for each variable is varied according to the rank of the corresponding coefficient’s magnitude. The penalty sequence is calculated as follows:

λBH(i) = Φ^{−1}(1 − q_i),   q_i = i · q / (2p).

Here, i denotes the rank of the i-th largest coefficient, making λBH(1) the penalty for the largest coefficient. Φ^{−1} represents the inverse cumulative distribution function of the standard normal distribution. Since Φ^{−1} is an increasing function and q_i increases with i, the top-ranked coefficient receives the largest penalty. Consequently, λBH(i) imposes a larger penalty on stronger coefficients. By incorporating the penalty sequence λBH(i) into the objective function, the regression model can keep the FDR below a target value: FDR ≤ q(p0/p), where q is the target FDR, p0 is the number of true null hypotheses, and p is the total number of hypotheses.
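The BH-style penalty sequence above can be computed with the Python standard library alone (a sketch; the function name lambda_bh is ours):

```python
from statistics import NormalDist

def lambda_bh(p, q=0.1):
    """BH-style SLOPE sequence: lambda_BH(i) = Phi^{-1}(1 - i*q/(2p))."""
    inv_cdf = NormalDist().inv_cdf  # standard normal quantile function
    return [inv_cdf(1 - i * q / (2 * p)) for i in range(1, p + 1)]

lam = lambda_bh(p=400, q=0.1)
# The sequence decreases with rank: the largest coefficient is penalized most.
print(lam[0] > lam[1] > lam[-1] > 0)  # True
```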

2.3. Penalized co-inertia analysis using SLOPE (SLOPE-CIA)

We propose employing the SLOPE penalty instead of the LASSO penalty in the sCIA problem so that the estimated loading vectors achieve both sparsity and the desired FDR. Imposing the penalty term induces sparsity in the loading vectors, and SLOPE, the penalty mimicking the BH procedure, controls the FDR of the estimated loading vectors. The algorithm for the proposed model is as follows.

a ← argmin_a (1/2)||X̃^T Ỹ b − a||_2^2 + α1 Σ_{i=1}^p λ_i |a|_(i),
b ← argmin_b (1/2)||Ỹ^T X̃ a − b||_2^2 + α2 Σ_{i=1}^q λ_i |b|_(i).   (2.6)

Lagrangian multipliers, λ1 and λ2 in (2.5), are denoted as α1 and α2 in equation (2.6). To estimate more than one pair of loading vectors, we can repeat the above iterative algorithm after deflation procedure. The deflation of the dataset is conducted by removing the effect of previously estimated loading vectors from the datasets. The pseudo-algorithm for the SLOPE-CIA is outlined in Algorithm 1.
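Each update in (2.6) is a proximal step with respect to the sorted-l1 norm. A sketch of that prox, using the pool-adjacent-violators (PAVA) idea behind the fast prox of Bogdan et al. (2015) (the helper name prox_sorted_l1 is ours):

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """argmin_x 0.5*||y - x||^2 + sum_i lam_i * |x|_(i), for a nonincreasing,
    nonnegative penalty sequence lam (e.g. the BH-style sequence)."""
    sign, y_abs = np.sign(y), np.abs(y)
    order = np.argsort(y_abs)[::-1]       # indices of |y| in decreasing order
    v = y_abs[order] - lam                # shift each entry by its penalty
    blocks = []                           # stack of [sum, count] blocks
    for vi in v:
        blocks.append([vi, 1])
        # merge while block means violate the nonincreasing requirement
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] <= blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    x_sorted = np.concatenate([np.full(c, max(s / c, 0.0)) for s, c in blocks])
    x = np.empty_like(y_abs)
    x[order] = x_sorted                   # undo the sorting
    return sign * x
```

With a constant sequence lam, this reduces to ordinary soft thresholding, so the l1 update of sCIA is recovered as the special case λ1 = ··· = λp; in the SLOPE-CIA updates, a is obtained as the prox of X̃^T Ỹ b under the scaled sequence α1·λ.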

3. Simulation study

3.1. Data generation

We follow the data-generating method presented in Min et al. (2019), which assumes the existence of a latent variable to establish a dependency between two sets of random variables. The dependency between datasets is generated by the latent variable μ ~ N(0, σμ^2). We generate two random vectors, x = μu + e_x and y = μv + e_y, where u = [u1, . . . , up′, 0, . . . , 0]^T ∈ ℝ^p, v = [v1, . . . , vq′, 0, . . . , 0]^T ∈ ℝ^q, e_x ~ N(0, Σx), and e_y ~ N(0, Σy). Here, u and v denote the true sparse loading vectors of CIA such that u^T Qx u = v^T Qy v = 1. The sparsity of the true loading vectors is determined by the number of non-zero elements in each vector, denoted p′ for u and q′ for v.

The cross-covariance matrix of x and y is Σxy = σμ^2 u v^T. As E(x) = E(y) = 0, the final datasets X and Y are generated from a multivariate normal distribution with mean zero and cross-covariance matrix Σxy. Because u and v are normalized by Qx and Qy, σμ determines the intensity of the correlation between the two datasets.
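Under these assumptions (with identity Qx, Qy, and D and identity error covariances taken purely for illustration), the generator can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q, p_nz, q_nz, sigma_mu = 200, 400, 500, 5, 5, 7.0

# True sparse loading vectors with unit norm (identity Qx, Qy assumed here).
u = np.zeros(p); u[:p_nz] = 1.0 / np.sqrt(p_nz)
v = np.zeros(q); v[:q_nz] = 1.0 / np.sqrt(q_nz)

# The shared latent variable mu induces the dependency between X and Y.
mu = rng.normal(0.0, sigma_mu, size=n)
X = np.outer(mu, u) + rng.standard_normal((n, p))   # e_x ~ N(0, I)
Y = np.outer(mu, v) + rng.standard_normal((n, q))   # e_y ~ N(0, I)

print(X.shape, Y.shape)  # (200, 400) (200, 500)
```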

3.2. Simulation design

The datasets X ∈ ℝ200×400 and Y ∈ ℝ200×500 are generated from a multivariate normal distribution following the probability model introduced in Section 3.1. We generate 100 Monte Carlo (MC) datasets.

We varied the number of nonzero elements in the true loading vectors and the signal intensity for performance comparison between CIA, sCIA, and SLOPE-CIA. The number of non-zero elements ranged from 5 to 80 to facilitate the comparison of variable selection performance. Additionally, signal intensity was varied to observe how the performance of the methods differed under different conditions. Specifically, scenarios 1 to 5 were characterized by higher signal intensity, whereas scenarios 6 to 10 had lower signal intensity. Within each scenario group, the number of non-zero elements increased as the scenario number increased.

3.3. Parameter tuning and performance measures

SLOPE utilizes a single parameter, α, which governs the extent of the penalty term’s influence. To identify the optimal value of α, we perform a five-fold cross-validation (CV) procedure. The objective function for the cross-validation process is as follows:

CV(α) = (1/K) Σ_{k=1}^K (1/2) [ (1/400) ||X̃_k^T Ỹ_k b̂_{−k}(α) − â_{−k}(α)||^2 + (1/500) ||Ỹ_k^T X̃_k â_{−k}(α) − b̂_{−k}(α)||^2 ].

Here, X_k and Y_k represent the kth subgroup of the dataset, and â_{−k}(α) and b̂_{−k}(α) are the estimators of a and b obtained without using the kth subgroup. To assess feature-selection performance, we calculate four metrics: false discovery rate (FDR), sensitivity, specificity, and accuracy, defined as follows,

FDR = FP / (TP + FP),   Sensitivity = TP / (TP + FN),   Specificity = TN / (TN + FP),   Accuracy = (TP + TN) / (TP + TN + FN + FP),

where FP, TP, FN, and TN denote the numbers of false positives, true positives, false negatives, and true negatives, respectively. FDR denotes the proportion of erroneous discoveries, while sensitivity represents the proportion of true discoveries. Specificity describes the proportion of correctly detected zero elements, while accuracy denotes the overall proportion of correctly classified variables. Additionally, we consider the angle to measure the similarity between the true and estimated loading vectors, calculated as ∠(û) = û^T u* / (||û||_2 ||u*||_2), where u* is the true loading vector and û is the estimated loading vector.
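These performance measures can be computed directly from the supports of the true and estimated vectors; a small sketch (the function name selection_metrics is ours):

```python
import numpy as np

def selection_metrics(est, truth, tol=1e-10):
    """FDR, sensitivity, specificity, accuracy of support recovery, plus the
    angle (cosine) between the estimated and true loading vectors."""
    sel, true_nz = np.abs(est) > tol, np.abs(truth) > tol
    tp = int(np.sum(sel & true_nz)); fp = int(np.sum(sel & ~true_nz))
    tn = int(np.sum(~sel & ~true_nz)); fn = int(np.sum(~sel & true_nz))
    fdr = fp / max(tp + fp, 1)
    sens = tp / max(tp + fn, 1)
    spec = tn / max(tn + fp, 1)
    acc = (tp + tn) / (tp + tn + fn + fp)
    angle = float(est @ truth) / (np.linalg.norm(est) * np.linalg.norm(truth))
    return fdr, sens, spec, acc, angle

truth = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
est = np.array([0.9, 0.0, 0.2, 0.0, 0.0])   # one miss, one false discovery
fdr, sens, spec, acc, angle = selection_metrics(est, truth)
print(fdr, sens, acc)  # 0.5 0.5 0.6
```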

3.4. Results

In this simulation study, we evaluate three methods: classical CIA, sCIA, and SLOPE-CIA. The simulation results are presented in Table 1 and Table 2. Also, all of the results are visualized in Figure 1, where solid lines represent higher signal intensity cases and dashed lines represent lower signal intensity cases, for enhanced interpretability.

First, we observe that SLOPE-CIA exhibits comparable sensitivity to sCIA while achieving the highest accuracy across all scenarios. This indicates that both sCIA and SLOPE-CIA are effective in identifying nonzero elements, but SLOPE-CIA identifies the truly non-zero elements more precisely.

In addition to accuracy and sensitivity, the FDR is another crucial measure to consider. Our objective was to maintain the FDR below 0.1, and we observe that SLOPE-CIA achieves FDR values close to this target while retaining high accuracy. This suggests that variables selected by SLOPE-CIA are highly likely to be truly significant. Although SLOPE-CIA shows a slight decline in performance as the number of nonzero elements increases, it still demonstrates lower FDR and higher accuracy than sCIA, indicating greater reliability than the previous methods. This phenomenon of FDR inflation with an increasing number of significant variables was also observed in Bogdan et al. (2015).

Furthermore, the angles between true loading vectors and the estimated loading vectors are superior for SLOPE-CIA compared to other methods. The angle measurement not only assesses how well the methods detect non-zero elements but also considers the magnitude of selected elements. A higher angle value indicates greater precision in the estimated loading vectors, demonstrating that SLOPE-CIA is the best method for identifying more meaningful variables as significant elements.

Lastly, Table 2 presents results with the same settings as Table 1, but with lower signal intensity. By comparing the results from Table 1 and Table 2, we observe the differences caused by the intensity of the signals. In general, performance decreases with lower signal intensity across all methods and numbers of non-zero elements, which is expected because higher intensity makes signals easier to detect.

4. Real data analysis: NCI60

4.1. Data description

To evaluate the performance of the proposed algorithm, we also conduct a real data analysis. Developed by the National Cancer Institute (NCI), the NCI60 dataset comprises 60 human cancer cell lines representing various cancer types, including leukemia, melanoma, non-small-cell lung carcinoma (NSCLC), central nervous system (CNS) cancers, ovarian cancer, breast cancer, colon cancer, renal cancer, and prostate cancer. Multiple datasets detailing the activities of DNA, RNA, and proteins are available for download via the CellMiner web platform (https://discover.nci.nih.gov/cellminer/).

Specifically, we applied our method to gene expression data generated by Staunton et al. (2001) and protein abundance data generated using reverse-phase lysate arrays (RPLA) (Nishizuka et al., 2003). Consistent with previous analyses (Min et al., 2019), after matching the cell-line labels of the two datasets, we use 57 of the 60 cell lines in the analysis. The datasets originally contained 3144 probes and 162 proteins, respectively. To reduce computation, we use the 1517 probes in the gene expression dataset whose expression range across the 60 cell lines is greater than 500 (Culhane et al., 2003). For the construction of the D matrix, we utilized an identity matrix. Additionally, the diagonal values for matrices Qx and Qy were set to the proportion of each column sum to the total sum.

4.2. Result

To evaluate the performance of each method, we count the non-zero elements in the first two loading vectors and calculate the cumulative percentage of explained variability. A lower number of nonzero elements in the estimated loadings indicates greater sparsity, making the results easier to interpret. Despite this sparsity, the method still effectively identifies significant elements, allowing us to explain most of the variability contained in the data. This is evidenced by the cumulative percentage of explained variability, which represents the ratio of the co-inertia calculated from the estimated loading vectors to the total co-inertia of the datasets. This metric reflects the extent to which each method explains the co-variability between the two datasets. As shown in Table 3, our proposed method yields the lowest number of nonzero elements, while its cumulative percentage of explained variance remains comparable to that of CIA and sCIA. This outcome demonstrates that our method effectively identifies non-contributing elements as zero, thereby maintaining the variability representation of the two datasets.

Following the methodology of Dolédec and Chessel (1994) and Culhane et al. (2003), we generated three figures, as depicted in Figure 2. The figures in the first row illustrate the projection space, with genes at the base of the arrows and proteins at the tips. The length of the arrows represents the concordance between the two datasets, with shorter arrows indicating a stronger correlation. We observe that the arrows representing renal cancer cells are longer than those for other cancer types, suggesting a weaker relationship in the renal cell line compared to the ovarian and leukemia cell lines. Also, we observe that leukemia and melanoma are separated from other cell lines along the first co-inertia axis, and these two cell lines are distinct from each other along the second co-inertia axis. This observation implies that these two cell lines significantly contribute to structuring the co-inertia axes. This finding is consistent with previous reports that leukemia and melanoma arise from hematopoietic cells and melanocytes, respectively, while the other cancer types arise from the epithelial cells of their respective tissues (Marshall et al., 2017).

Figures in the second and third rows illustrate the projected gene and protein spaces, respectively. We labeled the top 30 genes that are most distant from the origin in red. It is notable that the genes selected by CIA, sCIA, and SLOPE-CIA exhibit similar locations. For instance, KRT19, recognized as a tumor cell marker (Saha et al., 2018), is positioned at the bottom left in both the gene and protein spaces. Similarly, TYR, which is implicated in the increased risk of skin cancer (Saran et al., 2004) including melanoma, is consistently found in the top right region of both spaces. Despite the differences in the features measured in each dataset, we observe that genes situated in similar regions tend to have similar functions. For example, LGALS1 (Li et al., 2023) at the top end of the second axis in the gene space and MLH1 (Hinrichsen et al., 2014) in a similar position in the protein space both play roles in cell adhesion within cancer cells. This indicates that SLOPE-CIA effectively selects genes that significantly contribute to cancer cell progression, aligning with other established methods, and accurately groups genes with similar functional roles.

Further details on the roles of each gene can be elucidated through gene list functional enrichment analysis using ToppFun of ToppGene Suite (Chen et al., 2009). Firstly, genes commonly selected in the first loading vectors of sCIA and SLOPE-CIA (1230 out of 1679) are enriched in mammary neoplasms, breast carcinoma, mammary carcinoma, and other cancers such as prostate cancer, renal carcinoma, non-small cell lung carcinoma, and ovarian carcinoma. These diseases are consistent with those included in the NCI60 cell line data, supporting the observation above that our method selects biologically meaningful genes. Additionally, we identified the molecular functions of the selected genes using gene ontology (GO) terms. The most highly enriched GO term is cell adhesion molecule binding (GO:0050839), followed by structural molecule activity (GO:0005198), structural constituent of ribosome (GO:0003735), and integrin binding (GO:0005178). The functions of genes such as KRT19, which is involved in the structural integrity of epithelial cells (Saha et al., 2018), and LGALS1, which is involved in cell adhesion (Li et al., 2023), align with these highly ranked GO terms, further supporting the reliability of our estimated loading vectors.

Lastly, the most important difference of SLOPE-CIA from the other methods is its ability to detect and control false discoveries effectively. Among the 1679 elements in the first loading vectors, 281 are considered false discoveries, as they are zero in SLOPE-CIA but selected by sCIA. Enrichment analysis of these genes shows a high occurrence of diseases such as HIV, anemia, arteriosclerosis, congenital hypoplastic anemia, and hypoplasia of the thumb, which are not of primary interest in the NCI60 cell line data. This indicates that our method effectively controls for false discoveries that do not contribute to explaining the correlation between the two datasets under real data conditions.

5. Discussion

Among numerous correlation analysis methods for two multivariate datasets, CIA introduced a novel measurement called co-inertia to calculate the concordance between datasets. However, interpreting loading vectors becomes challenging in high-dimensional datasets. To address this, we proposed a method called SLOPE-CIA, which provides sparse loading vectors with false discovery rate (FDR) control. Our method not only improves the interpretability of loading vectors but also shows a lower number of erroneous discoveries compared to previous methods. Results from simulation studies validate this, and we demonstrate the effectiveness of our method in a real data setting by conducting an analysis using the NCI60 datasets.

Despite its advantages, SLOPE-CIA has some limitations. Firstly, cross-validation is used to select the sparsity parameter, which can be computationally intensive. Secondly, the design of the SLOPE tuning-parameter sequence admits various choices, which may add complexity to the implementation.

For future research, there are several directions to explore. One is to extend the method to analyze more than two datasets simultaneously. Another direction is to incorporate gene network information into the analysis, which could enhance the biological relevance and interpretability of the results.

Figures
Fig. 1. The overall results of the simulation study. The graphs on the left display the results for the loading vectors a (in X), while the graphs on the right show the results for the loading vectors b (in Y). The solid lines indicate the results for the larger signal and the dashed lines indicate the results for the smaller signal. (green: CIA, blue: sCIA, red: SLOPE-CIA)
Fig. 2. Results of real data analysis from CIA, sCIA, and SLOPE-CIA in each column, respectively. Figures in the first row represent the sample space of the analysis. Figures in the second and third rows represent the projected spaces of the first two loading vectors of the gene data and the protein data, respectively.
TABLES

Table 1

Results for the simulation (σμ = 7). The table shows the performance measures of each method: false discovery rate (FDR), sensitivity (Sens), specificity (Spec), accuracy (Acc), and the angle between the true and estimated vectors. Monte Carlo standard errors are given in parentheses.

            a in X                                          b in Y

FDR  Sens  Spec  Acc  Angle       FDR  Sens  Spec  Acc  Angle
Scenario 1 CIA - - - - 0.945 (0.000) - - - - 0.932 (0.000)
sCIA 0.039 (0.078) 1.000 (0.000) 0.999 (0.001) 0.999 (0.001) 0.951 (0.001) 0.766 (0.112) 1.000 (0.000) 0.958 (0.023) 0.958 (0.023) 0.932 (0.000)
SLOPE-CIA 0.002 (0.017) 1.000 (0.000) 1.000 (0.000) 1.000 (0.000) 0.962 (0.003) 0.000 (0.000) 1.000 (0.000) 1.000 (0.000) 1.000 (0.000) 0.935 (0.000)

Scenario 2 CIA - - - - 0.936 (0.000) - - - - 0.944 (0.000)
sCIA 0.335 (0.138) 1.000 (0.000) 0.985 (0.008) 0.986 (0.008) 0.938 (0.000) 0.290 (0.154) 1.000 (0.000) 0.990 (0.008) 0.990 (0.001) 0.947 (0.000)
SLOPE-CIA 0.002 (0.013) 1.000 (0.000) 1.000 (0.000) 1.000 (0.000) 0.940 (0.000) 0.016 (0.038) 1.000 (0.000) 1.000 (0.001) 1.000 (0.001) 0.952 (0.000)

Scenario 3 CIA - - - - 0.931 (0.000) - - - - 0.891 (0.000)
sCIA 0.332 (0.086) 0.959 (0.014) 0.959 (0.015) 0.959 (0.014) 0.935 (0.000) 0.303 (0.082) 0.974 (0.023) 0.972 (0.011) 0.972 (0.011) 0.901 (0.001)
SLOPE-CIA 0.073 (0.053) 0.933 (0.000) 0.994 (0.005) 0.990 (0.004) 0.939 (0.002) 0.025 (0.028) 0.933 (0.000) 0.998 (0.002) 0.992 (0.002) 0.917 (0.000)

Scenario 4 CIA - - - - 0.943 (0.000) - - - - 0.905 (0.000)
sCIA 0.311 (0.060) 0.961 (0.005) 0.936 (0.018) 0.940 (0.015) 0.949 (0.000) 0.279 (0.068) 0.962 (0.010) 0.957 (0.015) 0.958 (0.013) 0.914 (0.001)
SLOPE-CIA 0.042 (0.025) 0.960 (0.003) 0.994 (0.004) 0.990 (0.003) 0.957 (0.000) 0.028 (0.022) 0.940 (0.000) 0.997 (0.002) 0.991 (0.002) 0.930 (0.000)

Scenario 5 CIA - - - - 0.932 (0.000) - - - - 0.911 (0.000)
sCIA 0.255 (0.046) 0.949 (0.004) 0.918 (0.020) 0.924 (0.016) 0.941 (0.000) 0.294 (0.049) 0.956 (0.009) 0.923 (0.019) 0.928 (0.015) 0.919 (0.000)
SLOPE-CIA 0.185 (0.048) 0.932 (0.007) 0.946 (0.017) 0.943 (0.013) 0.946 (0.001) 0.083 (0.028) 0.916 (0.006) 0.984 (0.006) 0.973 (0.005) 0.927 (0.000)

Table 2

Results for the simulation (σμ = 5). The table shows the performance measures of each method: false discovery rate (FDR), sensitivity (Sens), specificity (Spec), accuracy (Acc), and the angle between the true and estimated vectors. Monte Carlo standard errors are given in parentheses.

            a in X                                          b in Y

FDR  Sens  Spec  Acc  Angle       FDR  Sens  Spec  Acc  Angle
Scenario 6 CIA - - - - 0.945 (0.000) - - - - 0.932 (0.000)
sCIA 0.518 (0.198) 1.000 (0.000) 0.981 (0.014) 0.982 (0.013) 0.949 (0.001) 0.949 (0.009) 1.000 (0.000) 0.808 (0.037) 0.809 (0.036) 0.932 (0.000)
SLOPE-CIA 0.003 (0.024) 1.000 (0.000) 1.000 (0.000) 1.000 (0.000) 0.957 (0.000) 0.137 (0.084) 1.000 (0.000) 0.998 (0.002) 0.998 (0.002) 0.934 (0.000)

Scenario 7 CIA - - - - 0.936 (0.000) - - - - 0.944 (0.000)
sCIA 0.785 (0.058) 1.000 (0.000) 0.898 (0.032) 0.901 (0.032) 0.937 (0.000) 0.800 (0.060) 1.000 (0.000) 0.910 (0.030) 0.912 (0.029) 0.946 (0.000)
SLOPE-CIA 0.049 (0.101) 1.000 (0.000) 0.998 (0.004) 0.998 (0.004) 0.940 (0.000) 0.055 (0.063) 1.000 (0.000) 0.999 (0.002) 0.999 (0.002) 0.952 (0.000)

Scenario 8 CIA - - - - 0.931 (0.000) - - - - 0.891 (0.000)
sCIA 0.636 (0.060) 0.965 (0.008) 0.858 (0.347) 0.866 (0.032) 0.934 (0.000) 0.640 (0.060) 0.990 (0.016) 0.883 (0.030) 0.980 (0.028) 0.900 (0.001)
SLOPE-CIA 0.206 (0.056) 0.933 (0.000) 0.980 (0.007) 0.976 (0.006) 0.939 (0.000) 0.076 (0.050) 0.933 (0.000) 0.995 (0.004) 0.991 (0.003) 0.917 (0.001)

Scenario 9 CIA - - - - 0.943 (0.000) - - - - 0.905 (0.000)
sCIA 0.548 (0.050) 0.972 (0.041) 0.828 (0.034) 0.846 (0.030) 0.948 (0.000) 0.564 (0.052) 0.973 (0.010) 0.857 (0.029) 0.868 (0.026) 0.913 (0.001)
SLOPE-CIA 0.272 (0.114) 0.961 (0.005) 0.945 (0.945) 0.947 (0.023) 0.953 (0.002) 0.244 (0.086) 0.946 (0.009) 0.965 (0.041) 0.963 (0.013) 0.922 (0.003)

Scenario 10 CIA - - - - 0.932 (0.000) - - - - 0.911 (0.000)
sCIA 0.441 (0.041) 0.951 (0.008) 0.810 (0.031) 0.838 (0.025) 0.940 (0.000) 0.512 (0.037) 0.961 (0.011) 0.805 (0.030) 0.830 (0.025) 0.918 (0.001)
SLOPE-CIA 0.335 (0.040) 0.934 (0.008) 0.881 (0.021) 0.892 (0.017) 0.946 (0.000) 0.186 (0.038) 0.918 (0.006) 0.960 (0.010) 0.953 (0.009) 0.927 (0.000)
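For reference, the support-recovery measures reported in the tables above (FDR, Sens, Spec, Acc) can be computed from an estimated loading vector and the true one, together with a cosine similarity as one common way to summarize the angle between the two vectors. The sketch below is illustrative, not the code used in the paper; the zero threshold `tol` and the convention of reporting FDR as 0 when no feature is selected are assumptions.

```python
import numpy as np

def support_metrics(est, truth, tol=1e-8):
    """FDR, sensitivity, specificity, and accuracy of the recovered support,
    plus |cos(angle)| between the estimated and true loading vectors."""
    est = np.asarray(est, dtype=float)
    truth = np.asarray(truth, dtype=float)
    sel = np.abs(est) > tol          # features selected by the method
    true_sel = np.abs(truth) > tol   # truly nonzero features

    tp = np.sum(sel & true_sel)      # true positives
    fp = np.sum(sel & ~true_sel)     # false positives
    tn = np.sum(~sel & ~true_sel)    # true negatives
    fn = np.sum(~sel & true_sel)     # false negatives

    fdr = fp / max(tp + fp, 1)       # 0 by convention if nothing is selected
    sens = tp / max(tp + fn, 1)
    spec = tn / max(tn + fp, 1)
    acc = (tp + tn) / est.size
    cos = abs(est @ truth) / (np.linalg.norm(est) * np.linalg.norm(truth))
    return fdr, sens, spec, acc, cos
```

For example, if the truth has two nonzero entries and the estimate selects those two plus one extra feature, one of three selections is false, giving FDR = 1/3 with perfect sensitivity.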

Table 3

Results of real data analysis using NCI60 data

Method | Number of nonzero elements (a1, a2, b1, b2) | Cumulative percentage explained (1st loading; 1st & 2nd loadings)
CIA | 1517, 162, 1517, 162 | 0.379; 0.594
sCIA | 1358, 153, 1382, 157 | 0.378; 0.593
SLOPE-CIA | 1096, 139, 1255, 146 | 0.372; 0.577

Algorithm 1: SLOPE-CIA

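The listing of Algorithm 1 is not reproduced in this extraction. As background, the core computational step of SLOPE-type methods is the proximal operator of the sorted ℓ1 penalty, which can be evaluated exactly by a stack-based pool-adjacent-violators scheme as described by Bogdan et al. (2015). The sketch below is a minimal illustrative implementation of that operator, not the authors' SLOPE-CIA code; the function name and interface are ours, and `lam` is assumed to be sorted in nonincreasing order.

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """Prox of the sorted-L1 (SLOPE) penalty:
    argmin_x 0.5*||x - y||^2 + sum_i lam_i * |x|_(i),
    where |x|_(1) >= |x|_(2) >= ... and lam is nonincreasing."""
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    sign = np.sign(y)
    order = np.argsort(-np.abs(y))           # sort |y| in decreasing order
    z = np.abs(y)[order] - lam               # soft-shift by the sorted penalties

    # Pool adjacent violators: enforce a nonincreasing solution by
    # averaging consecutive blocks whenever monotonicity is violated.
    stack = []                               # entries: (start, end, block mean)
    for i, v in enumerate(z):
        start, mean = i, float(v)
        while stack and stack[-1][2] <= mean:
            s, e, m = stack.pop()
            mean = (m * (e - s + 1) + mean * (i - start + 1)) / (i - s + 1)
            start = s
        stack.append((start, i, mean))

    x_sorted = np.empty_like(z)
    for s, e, m in stack:
        x_sorted[s:e + 1] = max(m, 0.0)      # clip at zero (induces sparsity)

    x = np.zeros_like(y)
    x[order] = x_sorted * sign[order]        # undo the sorting, restore signs
    return x
```

Because the penalty weights decrease with rank, the largest coefficients are shrunk the most, which is what ties SLOPE to FDR control under the Benjamini–Hochberg-type choice of the λ sequence.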

References
  1. Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57, 289-300.
  2. Bogdan M, van den Berg E, Sabatti C, Su W, and Candès EJ (2015). SLOPE—adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9, 1103-1140.
  3. Chen J, Bardes EE, Aronow BJ, and Jegga AG (2009). ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Research, 37, W305-W311.
  4. Culhane AC, Perrière G, and Higgins DG (2003). Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics, 4, 1-15.
  5. Dolédec S and Chessel D (1994). Co-inertia analysis: An alternative method for studying species–environment relationships. Freshwater Biology, 31, 277-294.
  6. Dray S, Chessel D, and Thioulouse J (2003). Co-inertia analysis and the linking of ecological data tables. Ecology, 84, 3078-3089.
  7. Forward P (2012). Evolution of Translational Omics, Washington (DC), National Academies Press (US).
  8. Hinrichsen I, Ernst BP, Nuber F, et al. (2014). Reduced migration of MLH1 deficient colon cancer cells depends on SPTAN1. Molecular Cancer, 13, 1-12.
  9. Hotelling H (1992). Relations between two sets of variates. Breakthroughs in Statistics: Methodology and Distribution, 162-190.
  10. Hwang Y-T, Lai J-J, and Ou S-T (2010). Evaluations of FWER-controlling methods in multiple hypothesis testing. Journal of Applied Statistics, 37, 1681-1694.
  11. Kim J (2014). Comparison of lasso type estimators for high-dimensional data. Communications for Statistical Applications and Methods, 21, 349-361.
  12. Li X, Wang H, Jia A, Cao Y, Yang L, and Jia Z (2023). LGALS1 regulates cell adhesion to promote the progression of ovarian cancer. Oncology Letters, 26, 1-11.
  13. Marshall EA, Sage AP, Ng KW, Martinez VD, Firmino NS, Bennewith KL, and Lam WL (2017). Small non-coding RNA transcriptome of the NCI-60 cell line panel. Scientific Data, 4, 1-8.
  14. Min EJ, Safo SE, and Long Q (2019). Penalized co-inertia analysis with applications to -omics data. Bioinformatics, 35, 1018-1025.
  15. Nishizuka S, Charboneau L, Young L, et al. (2003). Proteomic profiling of the NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays. Proceedings of the National Academy of Sciences, 100, 14229-14234.
  16. Qian H-R and Huang S (2005). Comparison of false discovery rate methods in identifying genes with differential expression. Genomics, 86, 495-503.
  17. Saha SK, Kim K, Yang G-M, Choi HY, and Cho S-G (2018). Cytokeratin 19 (KRT19) has a role in the reprogramming of cancer stem cell-like cells to less aggressive and more drug-sensitive cells. International Journal of Molecular Sciences, 19, 1423.
  18. Saran A, Spinola M, Pazzaglia S, et al. (2004). Loss of tyrosinase activity confers increased skin tumor susceptibility in mice. Oncogene, 23, 4130-4135.
  19. Staunton JE, Slonim DK, Coller HA, et al. (2001). Chemosensitivity prediction by transcriptional profiling. Proceedings of the National Academy of Sciences, 98, 10787-10792.
  20. Storey JD (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society Series B: Statistical Methodology, 64, 479-498.
  21. Tan Q, Zhao J, Li S, Christiansen L, Kruse TA, and Christensen K (2008). Differential and correlation analyses of microarray gene expression data in the CEPH Utah families. Genomics, 92, 94-100.
  22. Tibshirani R (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58, 267-288.