Seongho Kim1,a

a Biostatistics and Bioinformatics Core, Karmanos Cancer Institute, Wayne State University, USA
Correspondence to: 1 Biostatistics and Bioinformatics Core, Karmanos Cancer Institute, Department of Oncology, School of Medicine, Wayne State University, 87 E. Canfield St., Detroit, MI 48201, USA. E-mail: kimse@karmanos.org

This work is partially supported by NIH/NCI P30 CA022453 and NIH/NIGMS R21GM140352.
Received December 1, 2021; Revised January 12, 2022; Accepted January 13, 2022.
Abstract
The mathematical expression of the p-value calculation for the semi-partial correlation coefficient differs between Kim (2015) and Cohen et al. (2003). These two expressions were compared and the advantages of Kim (2015)’s approach over Cohen et al. (2003) were discussed.
Keywords : correlation, partial correlation, part correlation, ppcor, semi-partial correlation
1. Introduction

Suppose $X = (x_1, x_2, \ldots, x_i, \ldots, x_n)$ is a random vector with $|X| = n$. The variance of a random variable $x_i$ and the covariance between two random variables $x_i$ and $x_j$ are denoted by $v_i$ and $c_{ij}$, respectively. The correlation between two random variables $x_i$ and $x_j$ is denoted by $r_{ij} = c_{ij}/\sqrt{v_i v_j}$. Then the partial correlation of $x_i$ and $x_j$ given $x_k$ and the semi-partial correlation of $x_i$ with $x_j$ given $x_k$ are defined, respectively, by

$$r_{ij|k} = \frac{r_{ij} - r_{ik} r_{jk}}{\sqrt{1 - r_{ik}^2}\,\sqrt{1 - r_{jk}^2}}, \qquad r_{i(j|k)} = \frac{r_{ij} - r_{ik} r_{jk}}{\sqrt{1 - r_{jk}^2}}. \tag{1.1}$$

The partial and semi-partial correlations can also be defined, using multiple correlations, by

$$r_{ij|k} = \sqrt{\frac{R_{i.jk}^2 - R_{i.k}^2}{1 - R_{i.k}^2}}, \qquad r_{i(j|k)} = \sqrt{R_{i.jk}^2 - R_{i.k}^2}, \tag{1.2}$$

where $R_{i.k}^2$ is the squared multiple correlation of $x_i$ on $x_k$, which is equal to $r_{ik}^2$, and $R_{i.jk}^2$ is the squared multiple correlation of $x_i$ on $x_j$ and $x_k$. Note that the superscripts $k$ and $c$ will be used to denote expressions of Kim (2015) and Cohen et al. (2003), respectively.
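As a quick numerical check, the two definitions can be evaluated side by side for a single controlled variable. The sketch below is only illustrative: the function names and the input correlations are hypothetical, and the two-predictor expression used for $R_{i.jk}^2$ is the standard textbook formula rather than something stated in this note. Equation (1.2) returns magnitudes, so the comparison uses a case with a positive numerator.

```python
import math

def partial_and_semipartial(r_ij, r_ik, r_jk):
    """Equation (1.1): return (r_ij|k, r_i(j|k)) from pairwise correlations."""
    num = r_ij - r_ik * r_jk
    semi = num / math.sqrt(1.0 - r_jk ** 2)
    partial = semi / math.sqrt(1.0 - r_ik ** 2)
    return partial, semi

def squared_multiple_correlation(r_ij, r_ik, r_jk):
    """R^2_{i.jk} for the two predictors x_j and x_k (standard formula)."""
    return (r_ij ** 2 + r_ik ** 2 - 2.0 * r_ij * r_ik * r_jk) / (1.0 - r_jk ** 2)

def via_multiple_correlation(r_ij, r_ik, r_jk):
    """Equation (1.2): magnitudes of the same two coefficients."""
    r2_full = squared_multiple_correlation(r_ij, r_ik, r_jk)
    r2_k = r_ik ** 2  # R^2_{i.k} equals r_ik^2
    semi = math.sqrt(r2_full - r2_k)
    partial = math.sqrt((r2_full - r2_k) / (1.0 - r2_k))
    return partial, semi

# Hypothetical pairwise correlations, chosen only for illustration.
r_ij, r_ik, r_jk = 0.5, 0.3, 0.4
print(partial_and_semipartial(r_ij, r_ik, r_jk))
print(via_multiple_correlation(r_ij, r_ik, r_jk))
```

Both calls print the same pair of values up to floating-point error, which is what the equivalence of (1.1) and (1.2) predicts.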

Kim (2015) used the statistics $t^k_{ij|S}$ and $t^k_{i(j|S)}$, respectively, to calculate the p-values of the partial and semi-partial correlations of $x_i$ and (with) $x_j$ given $x_S$ ($= x_{(i,j)}$), which are

$$t^k_{ij|S} = r_{ij|S}\sqrt{\frac{N-2-g}{1 - r_{ij|S}^2}}, \qquad t^k_{i(j|S)} = r_{i(j|S)}\sqrt{\frac{N-2-g}{1 - r_{i(j|S)}^2}}, \tag{1.3}$$

where $N$ is the sample size, $g$ is the total number of given (or controlled) variables, and $x_S$ ($= x_{(i,j)}$) is the random sub-vector of $X$ after removing the random variables $x_i$ and $x_j$, with size $|S| = |X| - 2$. The corresponding p-values are then calculated by

$$p^k_{ij|S} = 2\Phi_t\left(-|t^k_{ij|S}|,\, N-2-g\right), \qquad p^k_{i(j|S)} = 2\Phi_t\left(-|t^k_{i(j|S)}|,\, N-2-g\right), \tag{1.4}$$

where $\Phi_t(\cdot)$ is the cumulative distribution function of a Student’s t distribution with $N - 2 - g$ degrees of freedom.
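The statistic and p-value above can be sketched in a few lines. This is a minimal illustration, not the ppcor implementation: the function names and the inputs ($N = 100$, $g = 28$, and the three correlation values) are assumptions, and the t distribution tail is obtained by Simpson's rule on the density so that the example needs only the standard library. A library routine such as `scipy.stats.t.sf` would be the natural choice in practice, particularly for very small p-values, where the subtraction used here loses relative accuracy.

```python
import math

def t_statistic(r, n_samples, n_given):
    """Equation (1.3): t = r * sqrt((N - 2 - g) / (1 - r^2))."""
    df = n_samples - 2 - n_given
    return r * math.sqrt(df / (1.0 - r * r))

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2.0 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Equation (1.4): 2 * Phi_t(-|t|, df), computed as 1 - 2 * integral of
    the density over [0, |t|] with Simpson's rule (steps must be even)."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for m in range(1, steps):
        total += (4.0 if m % 2 else 2.0) * t_pdf(m * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

# Hypothetical inputs: N = 100 samples and g = 28 controlled variables.
N, g = 100, 28
for r in (0.1, 0.3, 0.5):
    t = t_statistic(r, N, g)
    print(r, round(t, 3), two_sided_p(t, N - 2 - g))
```

The same two functions serve both coefficients: passing $r_{ij|S}$ gives $p^k_{ij|S}$, and passing $r_{i(j|S)}$ gives $p^k_{i(j|S)}$.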

We can readily observe that the statistic $t^k_{i(j|S)}$ can equal zero only when the semi-partial correlation $r_{i(j|S)}$ is zero in equation (1.3). This demonstrates that the statistic $t^k_{i(j|S)}$ in equation (1.3) is suitable for testing the deviation of the semi-partial correlation coefficient from zero. For the same reason, the statistic $t^k_{ij|S}$ is also sufficient to test the null hypothesis $H_0: r_{ij|S} = 0$.

On the other hand, Cohen et al. (2003) suggested a single statistic for calculating the p-values of both partial and semi-partial correlations of $x_i$ and (with) $x_j$ given $x_S$ ($= x_{(i,j)}$), which is

$$t^c_{ij|S} = t^c_{i(j|S)} = r_{i(j|S)}\sqrt{\frac{N-2-g}{1 - R_{i.jS}^2}}, \tag{1.5}$$

where $R_{i.jS}^2$ is the squared multiple correlation of $x_i$ on $x_j$ and $x_S$. The corresponding p-values are

$$p^c_{ij|S} = p^c_{i(j|S)} = 2\Phi_t\left(-|t^c_{i(j|S)}|,\, N-2-g\right).$$

In equation (1.1), we can notice that the partial and semi-partial correlations have the identical numerator and differ only in their denominators. Indeed, $r_{i(j|k)} = r_{ij|k} \cdot \sqrt{1 - r_{ik}^2}$ and so $|r_{i(j|k)}| \le |r_{ij|k}|$ because $\sqrt{1 - r_{ik}^2} \le 1$. This relationship confirms that if one of the correlations is zero, then the other correlation is also zero. Besides, the statistic $t^c_{i(j|S)}$ (or $t^c_{ij|S}$) cannot equal zero unless the semi-partial correlation $r_{i(j|S)}$ is zero in equation (1.5). Thus, equation (1.5) is appropriate for the hypothesis test to decide whether both partial and semi-partial correlation coefficients are significantly different from zero, which is the rationale for the use of the same statistic for both partial and semi-partial correlations in Cohen et al. (2003).

We can also see that, using equation (1.2),

$$r_{i(j|S)}\sqrt{\frac{N-2-g}{1 - R_{i.jS}^2}} = r_{ij|S}\sqrt{\frac{N-2-g}{1 - r_{ij|S}^2}},$$

which implies that $t^k_{ij|S} = t^c_{ij|S} = t^c_{i(j|S)}$ and so $p^k_{ij|S} = p^c_{ij|S} = p^c_{i(j|S)}$. That is, Cohen et al. (2003) use Kim (2015)’s statistic $t^k_{ij|S}$ to test whether both partial and semi-partial correlations are significantly different from zero.
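The step behind this identity is that equation (1.2) gives $r_{i(j|S)} = r_{ij|S}\sqrt{1 - R_{i.S}^2}$ and $1 - R_{i.jS}^2 = (1 - R_{i.S}^2)(1 - r_{ij|S}^2)$, so the factor $\sqrt{1 - R_{i.S}^2}$ cancels. The sketch below checks the equality numerically for the simplest case of one controlled variable $x_k$. The function names, the sample size, and the correlation triples are hypothetical, and $R_{i.jk}^2$ is computed with the standard two-predictor formula.

```python
import math

def cohen_t(r_ij, r_ik, r_jk, n_samples, n_given=1):
    """Equation (1.5) with a single controlled variable x_k."""
    df = n_samples - 2 - n_given
    semi = (r_ij - r_ik * r_jk) / math.sqrt(1.0 - r_jk ** 2)
    r2_full = (r_ij ** 2 + r_ik ** 2 - 2.0 * r_ij * r_ik * r_jk) / (1.0 - r_jk ** 2)
    return semi * math.sqrt(df / (1.0 - r2_full))

def kim_partial_t(r_ij, r_ik, r_jk, n_samples, n_given=1):
    """Equation (1.3), partial-correlation statistic, single controlled variable."""
    df = n_samples - 2 - n_given
    partial = (r_ij - r_ik * r_jk) / math.sqrt((1.0 - r_ik ** 2) * (1.0 - r_jk ** 2))
    return partial * math.sqrt(df / (1.0 - partial ** 2))

# Hypothetical triples (r_ij, r_ik, r_jk) and sample size, for illustration only.
for triple in [(0.5, 0.3, 0.4), (0.2, -0.6, 0.1), (0.892, 0.8, 0.8)]:
    tc = cohen_t(*triple, n_samples=100)
    tk = kim_partial_t(*triple, n_samples=100)
    print(triple, round(tc, 6), round(tk, 6), math.isclose(tc, tk, rel_tol=1e-9))
```

For each triple the two statistics agree, so the p-values computed from them agree as well.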

However, the partial correlation is not in one-to-one correspondence with the semi-partial correlation. For example, when $(r_{ij}, r_{ik}, r_{jk})$ is either (0.816, 0.6, 0.8) or (0.892, 0.8, 0.8), the partial correlation $r_{ij|k}$ is equal to 0.70 but the semi-partial correlation $r_{i(j|k)}$ is 0.56 and 0.42, respectively. This means that two or more semi-partial correlation coefficients can correspond to a single partial correlation coefficient, so that two different semi-partial correlation coefficients can share the identical statistical significance under Cohen et al. (2003)’s approach. Moreover, the monotone ordering is not preserved between the partial and semi-partial correlation coefficients. For instance, when $(r_{ij}, r_{ik}, r_{jk})$ is (0.892, 0.8, 0.8) and (0.512, 0.28, 0.80), the partial correlation $r_{ij|k}$ is equal to 0.7 and 0.5, respectively, but the semi-partial correlation $r_{i(j|k)}$ is 0.42 and 0.48, respectively. In fact, these properties cause the relationship between $p^c_{i(j|S)}$ and $r_{i(j|S)}$ to be non-monotonic, while $p^k_{i(j|S)}$ is monotonic with regard to $r_{i(j|S)}$. This difference between $p^c_{i(j|S)}$ and $p^k_{i(j|S)}$ is evident in the following simulation study.
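The quoted coefficients can be checked by evaluating equation (1.1) directly. In the sketch below the function name is hypothetical, and the triples are written as $(r_{ij}, r_{ik}, r_{jk})$ with $x_k$ partialled out of $x_j$, which is the ordering under which equation (1.1) reproduces the values 0.70, 0.56, 0.42, 0.5 and 0.48.

```python
import math

def partial_and_semipartial(r_ij, r_ik, r_jk):
    """Equation (1.1): return (r_ij|k, r_i(j|k)) from pairwise correlations."""
    num = r_ij - r_ik * r_jk
    semi = num / math.sqrt(1.0 - r_jk ** 2)
    partial = semi / math.sqrt(1.0 - r_ik ** 2)
    return partial, semi

# Triples (r_ij, r_ik, r_jk): the first two share one partial correlation,
# the last two reverse their order between the two coefficients.
triples = [(0.816, 0.6, 0.8), (0.892, 0.8, 0.8), (0.512, 0.28, 0.80)]
for triple in triples:
    partial, semi = partial_and_semipartial(*triple)
    print(triple, round(partial, 2), round(semi, 2))
```

The first two rows share a partial correlation but not a semi-partial correlation, and the last two rows swap order between the two coefficients.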

A simulation study was performed to compare the p-value calculation methods of Kim (2015) and Cohen et al. (2003). The expressions $p^k_{i(j|S)}$ and $p^k_{ij|S}$ were used to calculate the p-values of the semi-partial correlation coefficients under the approaches of Kim (2015) and Cohen et al. (2003), respectively, because $p^c_{i(j|S)} = p^k_{ij|S}$. The numbers of variables and samples were set to 30 and 100, respectively. For the partial correlation coefficient between the first and the second variables, 200 equally spaced values were selected between 0.01 and 0.99. For each of the 200 partial correlation coefficients, a data set was simulated such that the partial correlation coefficient calculated from the simulated data set was similar to the selected coefficient, resulting in 200 simulated data sets. For each simulated data set, the partial and semi-partial correlation coefficients (i.e., $r_{12|S}$ and $r_{1(2|S)}$) were calculated along with the corresponding p-values $p^k_{12|S}$ and $p^k_{1(2|S)}$, where S = {3, 4, 5, …, 30}. The simulated outcomes are depicted in Figure 1, and the R code used to perform the simulation study is available in the Appendix.

Figure 1(a) shows that the partial correlation coefficients are always greater than or equal to the semi-partial correlation coefficients (i.e., $r_{12|S} \ge r_{1(2|S)}$) as expected. Consequently, the corresponding p-values for the partial correlation coefficients are always less than or equal to those for the semi-partial correlation coefficients (i.e., $p^k_{12|S} \le p^k_{1(2|S)}$), as can be seen in Figure 1(b). In Figure 1(c), the p-values (the blue dotted line) for the semi-partial correlation coefficients by Cohen et al. (2003) are less than or equal to those (the red dashed line) by Kim (2015). This happens because $r_{12|S} \ge r_{1(2|S)}$ and Cohen et al. (2003) use $p^k_{12|S}$ for the corresponding p-values. Furthermore, Figure 1(d) confirms that the p-values ($p^k_{12|S}$, the blue dotted line) for the semi-partial correlation coefficients by Cohen et al. (2003) are not monotonic with regard to $r_{1(2|S)}$. On the other hand, the p-values ($p^k_{1(2|S)}$, the red dashed line) by Kim (2015) are monotonic with regard to the semi-partial correlation coefficients ($r_{1(2|S)}$).
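The loss of monotonicity can be seen in miniature with two correlation triples whose partial correlations are 0.7 and 0.5 and whose semi-partial correlations are 0.42 and 0.48. The sketch below is not the simulation behind Figure 1: the sample size $N = 100$, the single controlled variable, and the function names are assumptions made for illustration. Cohen et al. (2003)’s statistic is evaluated through the identity $t^c_{i(j|S)} = t^k_{ij|S}$.

```python
import math

def coefficients(r_ij, r_ik, r_jk):
    """Equation (1.1): return (partial, semi-partial) for one controlled variable."""
    semi = (r_ij - r_ik * r_jk) / math.sqrt(1.0 - r_jk ** 2)
    partial = semi / math.sqrt(1.0 - r_ik ** 2)
    return partial, semi

def t_from_r(r, df):
    """Equation (1.3): t = r * sqrt(df / (1 - r^2))."""
    return r * math.sqrt(df / (1.0 - r * r))

N, g = 100, 1  # hypothetical sample size and number of controlled variables
df = N - 2 - g

rows = []
for triple in [(0.892, 0.8, 0.8), (0.512, 0.28, 0.80)]:
    partial, semi = coefficients(*triple)
    rows.append({
        "semi": semi,
        "t_kim": t_from_r(semi, df),       # Kim (2015): based on the semi-partial
        "t_cohen": t_from_r(partial, df),  # Cohen et al. (2003): equals the partial statistic
    })

# Order the two cases by semi-partial correlation and compare the statistics.
rows.sort(key=lambda row: row["semi"])
kim_monotone = rows[0]["t_kim"] < rows[1]["t_kim"]
cohen_monotone = rows[0]["t_cohen"] < rows[1]["t_cohen"]
print("Kim statistic increases with the semi-partial correlation:", kim_monotone)
print("Cohen statistic increases with the semi-partial correlation:", cohen_monotone)
```

Because a larger $|t|$ gives a smaller p-value at fixed degrees of freedom, the case with the larger semi-partial correlation receives the smaller p-value under Kim (2015) but the larger p-value under Cohen et al. (2003).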

In conclusion, both statistics $t^k_{i(j|S)}$ and $t^c_{i(j|S)}$ are appropriate for testing the null hypothesis $H_0: r_{i(j|S)} = 0$. However, as for comparisons between semi-partial correlations, Kim (2015)’s statistic $t^k_{i(j|S)}$ will be more suitable because it maintains the monotonicity with respect to $r_{i(j|S)}$.

Figures
Fig. 1. Comparison between the p-value calculations of Kim (2015) and Cohen et al. (2003) when the number of variables is 30 (i.e., p = 30) and the number of samples is 100 (i.e., n = 100). Panels (a) and (b) are the scatter plots of the correlation coefficients and the significance levels (i.e., p-values), respectively, between the partial correlation (i.e., $r_{12|S}$) and the semi-partial correlation (i.e., $r_{1(2|S)}$), where S = {3, 4, 5, …, 30}. In (b), the p-values were calculated using $p^k_{12|S}$ and $p^k_{1(2|S)}$ for the partial and semi-partial correlations, respectively. Panels (c) and (d) show the relationships between the correlation coefficients and the corresponding significance levels (i.e., p-values). In (d), the significance levels were log-transformed.
References
1. Cohen J, Cohen P, West SG, and Aiken LS (2003). Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed), Lawrence Erlbaum Associates Publishers.
2. Kim S (2015). ppcor: An R Package for a fast calculation to semi-partial correlation coefficients. Communications for Statistical Applications and Methods, 22, 665-674.