
The lack of a general matrix formula hampers the implementation of higher-order semi-partial correlation (also known as part correlation) coefficients. This is because calculating a higher-order semi-partial correlation with a recursive formula requires an enormous number of recursive calculations. To resolve this difficulty, we derive a general matrix formula of the semi-partial correlation for fast computation. The semi-partial correlations are then implemented on an R package
The partial and semi-partial (also known as part) correlations are used to express the specific portion of variance explained by eliminating the effect of other variables when assessing the correlation between two variables (James, 2002; Johnson and Wichern, 2002; Whittaker, 1990). A great number of studies have been published using either partial or semi-partial correlations in many areas including cognitive psychology (e.g., Baum and Rude, 2013), genomics (e.g., Fang
The partial correlation can be explained as the association between two random variables after eliminating the effect of all other random variables, while the semi-partial correlation eliminates the effect of only a fraction of the other random variables, for instance, removing the effect of all other random variables from just one of the two random variables of interest. The rationale for the partial and semi-partial correlations is to estimate a direct relationship or association between two random variables. A brief explanation follows of the main differences among the correlation, the partial correlation, and the semi-partial correlation. Suppose there are three random variables (or vectors),
Several R packages have been developed only for the partial correlation. The R package
On the other hand, there has been no attempt to reduce the computational burden of the higher-order semi-partial correlation coefficients, while the higher-order partial correlation coefficients can be easily calculated using the inverse variance-covariance matrix. This means that a recursive formula (e.g., see
For these reasons, we derive a general matrix formula for the semi-partial correlation calculation (see
Consider the random vector
Whittaker (1990) defined the partial correlation using the correlation between two residuals. In fact, we can easily see that the definition in Whittaker (1990) is equivalent to the definition in
Note that the proof of Corollary 1 is omitted since it is straightforward. Corollary 1 can be further generalized to the case that there are two or more given variables. In other words, it can be extended to the higher-order partial and the higher-order semi-partial correlations. To do this, we need to consider the inverse variance-covariance matrix of
Most R packages for calculation of the partial correlation use the matrix-based calculation which is based on
R> -cov2cor(solve(cov(X)))
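As a hedged illustration of this matrix-based calculation, the sketch below compares one off-diagonal entry of `-cov2cor(solve(cov(X)))` with the residual-based definition of Whittaker (1990), computed by ordinary least squares. The simulated matrix `X` and the names `pc_matrix`, `e1`, `e2`, and `pc_resid` are assumptions introduced for this example only and do not come from any package.

```r
# Simulated data: 200 observations of 4 correlated variables (illustrative only)
set.seed(1)
n <- 200
Z <- matrix(rnorm(n * 4), n, 4)
A <- matrix(c(1, 0.5, 0.3, 0.2,
              0, 1,   0.4, 0.1,
              0, 0,   1,   0.6,
              0, 0,   0,   1), 4, 4, byrow = TRUE)
X <- Z %*% A

# Matrix-based partial correlations; the diagonal is reset to 1
pc_matrix <- -cov2cor(solve(cov(X)))
diag(pc_matrix) <- 1

# Residual-based partial correlation of variables 1 and 2 given 3 and 4:
# regress each of the two variables on the given variables, then
# correlate the two residual vectors
e1 <- resid(lm(X[, 1] ~ X[, 3] + X[, 4]))
e2 <- resid(lm(X[, 2] ~ X[, 3] + X[, 4]))
pc_resid <- cor(e1, e2)

# The two calculations should agree up to numerical tolerance
all.equal(pc_matrix[1, 2], pc_resid)
```

The residual-based route needs two regressions per pair of variables, whereas the matrix-based route obtains all pairs from a single matrix inversion.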
However, there is no matrix-based mathematical formula for the semi-partial correlation. Without a general matrix formula, users have to calculate the higher-order semi-partial correlation through a recursive formula in
By
Then, using
Using Theorem 1, we can readily calculate the semi-partial correlation with a few lines of R code. For example, the semi-partial correlation of
R> cx <- cov(X)
R> dx <- solve(cx)
R> pc <- -cov2cor(dx)
R> diag(pc) <- 1
R> pc/sqrt(diag(cx))/sqrt(abs(diag(dx)-t(t(dx^2)/diag(dx))))
It first calculates the variance-covariance matrix of
While, to our knowledge, no R packages provide the level of statistical significance for partial correlation coefficient, the R package
The statistics
where
where Φ
In case of Kendall’s rank correlation, the statistics are computed by (Abdi, 2007)
Using
where Φ(·) is the cumulative distribution function of a standard normal distribution. The standard error is
The R package
R> library(ppcor)
R> y.data <- data.frame(
+ hl = c(7,15,19,15,21,22,57,15,20,18),
+ disp = c(0,0.964,0,0,0.921,0,0,1.006,0,1.011),
+ deg = c(9,2,3,4,1,3,1,3,6,1),
+ BC = c(1.78e-02,1.05e-06,1.37e-05,7.18e-03,0,0,0,4.48e-03,2.10e-06,0)
+ )
This test data,
We can then calculate the partial correlation of each pair of variables given the other variables with
R> pcor(x=y.data,method="spearman")
Then we obtain the following output:
$estimate
             hl       disp        deg         BC
hl    1.0000000 -0.7647345 -0.1367596 -0.7860646
disp -0.7647345  1.0000000 -0.4845966 -0.4506273
deg  -0.1367596 -0.4845966  1.0000000  0.4010940
BC   -0.7860646 -0.4506273  0.4010940  1.0000000

$p.value
             hl       disp       deg         BC
hl   0.00000000 0.02708081 0.7467551 0.02071908
disp 0.02708081 0.00000000 0.2236095 0.26248897
deg  0.74675508 0.22360945 0.0000000 0.32471409
BC   0.02071908 0.26248897 0.3247141 0.00000000

$statistic
             hl      disp        deg        BC
hl    0.0000000 -2.907150 -0.3381686 -3.114899
disp -2.9071501  0.000000 -1.3569947 -1.236464
deg  -0.3381686 -1.356995  0.0000000  1.072529
BC   -3.1148991 -1.236464  1.0725286  0.000000

$n
[1] 10

$gp
[1] 2

$method
[1] "spearman"
The output has six values,
R> pcor.test(x=y.data$hl,y=y.data$disp,z=y.data[,c("deg","BC")],
+ method="spearman")
Then we obtain the following output:
    estimate    p.value statistic  n gp   Method
1 -0.7647345 0.02708081  -2.90715 10  2 spearman
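The reported statistic and p-value can be checked by hand from the estimate, the sample size, and the number of given variables. The sketch below assumes the statistic has the usual t form with n - 2 - gp degrees of freedom and a two-sided p-value from Student's t distribution. The variable names `r`, `n`, `gp`, `stat`, and `pval` are introduced here for illustration.

```r
# Values copied from the pcor.test output above
r  <- -0.7647345   # partial correlation estimate
n  <- 10           # sample size
gp <- 2            # number of given variables

# Assumed form: t statistic with n - 2 - gp degrees of freedom
df   <- n - 2 - gp
stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from Student's t distribution
pval <- 2 * pt(-abs(stat), df = df)

c(statistic = stat, p.value = pval)
```

Under this assumption the computed values agree with the reported statistic and p-value to the printed precision.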
Similarly, the semi-partial correlations can be calculated with
R> spcor(x=y.data,method="spearman")
Then we obtain the following output:
$estimate
              hl       disp         deg         BC
hl    1.00000000 -0.4254609 -0.04949092 -0.4558649
disp -0.59319449  1.0000000 -0.27689034 -0.2522965
deg  -0.06380762 -0.2560457  1.00000000  0.2023709
BC   -0.42262366 -0.1677612  0.14551866  1.0000000

$p.value
            hl      disp       deg        BC
hl   0.0000000 0.2933025 0.9073559 0.2562889
disp 0.1211334 0.0000000 0.5067562 0.5466351
deg  0.8806850 0.5404845 0.0000000 0.6307871
BC   0.2968811 0.6912998 0.7309799 0.0000000

$statistic
             hl       disp        deg         BC
hl    0.0000000 -1.1515898 -0.1213762 -1.2545787
disp -1.8048658  0.0000000 -0.7058372 -0.6386584
deg  -0.1566153 -0.6488095  0.0000000  0.5061789
BC   -1.1422336 -0.4168368  0.3602815  0.0000000

$n
[1] 10

$gp
[1] 2

$method
[1] "spearman"
The semi-partial correlation of
R> spcor.test(x=y.data$hl,y=y.data$disp,z=y.data[,c("deg","BC")],
+ method="spearman")
Then we obtain the following output:
    estimate   p.value statistic  n gp   Method
1 -0.4254609 0.2933025  -1.15159 10  2 spearman
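The matrix-based calculation of the semi-partial correlation can also be cross-checked against a residual-based calculation that visits every pair of variables in two nested loops. The sketch below uses only base R and simulated data. The names `X`, `sp_matrix`, and `sp_loop` are assumptions made for this example. It also assumes the convention that entry (i, j) is the correlation between variable i and the part of variable j left after regressing j on the remaining variables.

```r
# Simulated data: 200 observations of 4 correlated variables (illustrative only)
set.seed(2)
n <- 200
Z <- matrix(rnorm(n * 4), n, 4)
A <- matrix(c(1, 0.5, 0.3, 0.2,
              0, 1,   0.4, 0.1,
              0, 0,   1,   0.6,
              0, 0,   0,   1), 4, 4, byrow = TRUE)
X <- Z %*% A

# Matrix-based semi-partial correlations for all pairs at once
cx <- cov(X)
dx <- solve(cx)
pc <- -cov2cor(dx)
diag(pc) <- 1
sp_matrix <- pc / sqrt(diag(cx)) / sqrt(abs(diag(dx) - t(t(dx^2) / diag(dx))))
diag(sp_matrix) <- 1  # the diagonal divides by zero, so reset it

# Residual-based calculation, one pair at a time
p <- ncol(X)
sp_loop <- diag(p)
for (i in seq_len(p)) {
  for (j in seq_len(p)) {
    if (i != j) {
      # Remove the effect of the remaining variables from variable j only
      e_j <- resid(lm(X[, j] ~ X[, -c(i, j)]))
      sp_loop[i, j] <- cor(X[, i], e_j)
    }
  }
}

# The two calculations should agree up to numerical tolerance
all.equal(unname(sp_matrix), sp_loop)
```

The loop fits one regression per ordered pair, so its cost grows quickly with the number of variables, while the matrix-based route needs only one inversion of the variance-covariance matrix.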
It should be noted that, if a general matrix formula for the semi-partial correlation is not available, users have to loop over all pairs of variables with the function spcor.test, using two nested loops. To see how fast the general matrix formula can compute the semi-partial correlation, we compared the computational time by generating a data matrix with the size of 500 × 100 (i.e., the number of variables is 100 and the number of samples is 500). When the function
A general matrix formula for the semi-partial correlation is derived. The lack of this general matrix formula has hampered implementation of the higher-order semi-partial correlation for high-dimensional ‘omics’ data analysis because it requires an enormous number of recursive calculations to obtain the correlation coefficient when using a recursive formula in
The results in this paper were obtained using R 3.2.2 with the package