Bayesian inference of the cumulative logistic principal component regression models

Minjung Kyung1,a

aDepartment of Statistics, Duksung Women’s University, Korea
Correspondence to: 1 Department of Statistics, Duksung Women’s University, 33 Samyang-ro, 144-gil, Dobong-gu, Seoul 01369, Korea.
E-mail: mkyung@duksung.ac.kr

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government(MSIT) (No. 2021R1F1A1049834).
Received September 4, 2021; Revised January 24, 2022; Accepted February 21, 2022.
Abstract
We propose a Bayesian approach to the cumulative logistic regression model for ordinal responses based on orthogonal principal components obtained via singular value decomposition, accounting for multicollinearity among the predictors. The advantage of the suggested method is that it carries out dimension reduction and parameter estimation simultaneously. To evaluate the performance of the proposed model, we conduct a simulation study with a high-dimensional and highly correlated explanatory matrix. We also fit the suggested method to real data concerning sprout- and scab-damaged kernels of wheat and compare it to the EM-based proportional-odds logistic regression model. We argue that, compared to EM-based methods, the proposed model works better for highly correlated high-dimensional data, providing parameter estimates and good predictions.
Keywords : Bayesian inference, principal components regression, shrinkage priors
1. Introduction

Logistic regression has been considered the standard approach for the analysis of ordered categorical outcome data in application areas such as biomedical and social science studies. For ordinal outcomes, the most common regression models fall under the set of cumulative logit link models proposed by Walker and Duncan (1967), a direct result of their simplicity and ease of interpretation. This model was later called the proportional odds model by McCullagh (1980) since it assumes that the relationship between the cumulative log-odds and the explanatory variables does not depend on the response category. The cumulative logit model guarantees that for C ordered groups, the cumulative probability of belonging to group c is less than or equal to the cumulative probability of belonging to group c + 1, for c = 1, …, C − 1 (McKinley et al., 2015). Various inference procedures have been proposed for cumulative logit link models, and comprehensive details about models and the analysis of ordered category data can be found in Agresti (2010).

Another commonly used formulation of the cumulative logit model is based on an unobserved continuous latent variable. McCullagh and Nelder (1989) showed that the cumulative log-odds model could be derived from an underlying unobserved continuous random variable falling into an interval on the real line. A fully Bayesian method for modeling polychotomous ordinal categories based on the latent variable with a probit link was developed by Albert and Chib (1993) using the data augmentation approach. It might be easier to use a normal distribution with a probit link instead of the logistic distribution. For example, Yi et al. (2004) proposed a Bayesian approach via MCMC algorithms to extend the composite model space approach to detect genomewide multiple interacting quantitative trait loci (QTL) for ordinal traits on the basis of a threshold model. A Bayesian method for the cumulative logit link model was proposed by Lang (1999): because of the complexity of the likelihood function, a random-walk Metropolis-Hastings (MH) algorithm with a multivariate-Gaussian proposal distribution is used to estimate the posterior distribution of the parameters. Holmes and Held (2006) suggested Gibbs samplers based on the full posterior distribution for logistic multinomial regression models using an auxiliary variable approach, and O’Brien and Dunson (2004) developed a multivariate logistic regression framework that provides a marginal logistic structure for each of the outcomes via a multivariate t-distribution approximation to the multivariate logistic distribution.

For the mean trend in regression, the linear model is commonly used in various fields. However, with many explanatory variables, inference problems can arise from high-dimensional covariates. Principal component (PC) regression has been known as a tool for data analysis and dimension reduction in applications throughout science and engineering. The standard principal component regression can be represented using the reduced singular-value decomposition (SVD) of the matrix of explanatory variables. With the orthogonal linear combinations of predictors (the PCs) from the SVD, the p-predictor linear regression transforms into a K-factor linear regression on the subset of PCs with high variance. With a relevant subset of PCs, the PC regression model can be used for the prediction of responses. Because only the high-variance PCs corresponding to the larger eigenvalues are kept in the model and the low-variance components are discarded, for the purpose of prediction another shrinkage method such as the ridge regression estimator is perhaps a better choice than PC regression, which has a discrete shrinkage effect (Frank and Friedman, 1993). Even though there are many controversies about the interpretation and usage of PC regression, it is still a useful tool when a multicollinearity problem exists among the explanatory variables in a regression model.

The standard PC regression performs principal component analysis (PCA) on the centered explanatory variable matrix and selects a subset of the PCs based on some appropriate criterion. Then, the observed vector of responses is regressed linearly on the selected PCs as covariates, and the regression coefficients are estimated via ordinary least squares. The estimated PC regression coefficients are transformed back to the parameters of the actual predictors in the original model using the selected PCA loadings. In these steps, the principal components are obtained from the SVD of the explanatory variable matrix alone, so the PC regression estimator obtained from these principal components may produce unsatisfactory results for the prediction of the responses. Bair et al. (2006) suggested a technique called supervised principal components, which is similar to conventional PCA except that it uses a subset of the predictors selected based on their association with the outcome. To choose the reduced rank of the data, a threshold is considered, and cross-validation of the log-likelihood ratio test was used to estimate the best value of the threshold. In the Bayesian framework, West (2003) discussed the standard principal component regression with shrinkage priors on the PC regression coefficients, allowing for varying degrees of shrinkage in each of the orthogonal PC dimensions.

In this paper, we propose a Bayesian approach for the cumulative log-odds PC regression model to deal with ordinal responses based on a t-distribution approximation to the logistic distribution. We propose a Bayesian cumulative logit PC regression model to make MCMC implementation easier and to address the multicollinearity problem while keeping the parameter interpretation of the cumulative log-odds model. The close correspondence between the logistic and t distributions was noted by Albert and Chib (1993): for probabilities between 0.001 and 0.999, logistic quantiles are approximately a linear function of t(8) quantiles. O’Brien and Dunson (2004) instead considered a multivariate t-distribution with σ2 = π2(ν − 2)/3ν, a value chosen to make the variances of the univariate t and logistic distributions equal, and ν = 7.3, a value chosen to minimize the integrated squared distance between the univariate t and univariate logistic densities. The reason for considering a t-distribution approximation to the logistic is that back-transforming the logistic link facilitates interpretability by expressing the parameters of interest in the form of odds ratios. Even with an approximation, it becomes possible to interpret the effects of predictors on the probabilities of each category directly. In contrast, with a probit link, coefficients can be interpreted as the difference in standard normal score associated with each one-unit difference in a predictor variable, which is not intuitive. Thus we prefer to use a logistic model based on the t-distribution approximation. Also, for the mean, we consider PC regression with the shrinkage priors of West (2003) on the regression coefficients, and we choose the number of PCs based on the Bayesian information criterion (BIC). BIC is a popular criterion for model selection because of its simple form and easy application.
Raftery (1995) argued that Bayes factors allow the direct comparison of non-nested models in a simple way, and that BIC provides a simple and accurate approximation to Bayes factors for the Bayesian approach to hypothesis testing, model selection, and accounting for model uncertainty. Within Bayesian inference, it is easy to obtain BIC using the posterior mean from Gibbs samples of the posterior distribution of the parameters. Thus we consider that, for our Bayesian inference of the cumulative log-odds model with PC regression, BIC might be a good model selection criterion for choosing the number of PCs within the regression model framework.

In the remainder of the paper, we present the details of the cumulative log-odds model with PC-based regression and subsequently describe Bayesian inference with conjugate priors. We then apply the proposed method to real data concerning sprout- and scab-damaged kernels of wheat. We conclude with a discussion.

2. Cumulative logistic principal component regression model

For the Bayesian analysis of ordered categorical outcomes, we express the ordinal logistic regression model via underlying continuous variables. Then, with PC regression, we extend it to the cumulative log-odds PC regression model.

### 2.1. Cumulative logistic model

For the polychotomous ordinal outcomes of C categories, we usually assume that

$Y_i \sim \text{Multinomial}\left(1, (p_1, \ldots, p_C)\right).$

The correspondence between the observed outcome and the latent variable is defined by

$P\left(Y_i = j \mid X_i, \beta, \gamma, \sigma^2\right) = \int I\left(\gamma_{j-1} \le z_i \le \gamma_j\right) \ell\left(z_i \mid X_i\beta, \sigma^2\right) dz_i,$

where Xi is the ith observed predictor vector of length p, β is a p × 1 vector of regression parameters, γ = (γ0, γ1, …, γC)′ is a vector of cutpoints with −∞ = γ0 < γ1 < ⋯ < γC−1 < γC = ∞, and ℓ(· | μ, σ2) is the density function of the logistic distribution with mean μ and scale parameter σ2. With the given notation and σ2 = 1, the cumulative logit model for the jth ordinal outcome can be expressed as

$\operatorname{logit} P\left(Y_i \le j \mid X_i, \beta, \gamma\right) = \gamma_j - X_i\beta,$ (2.1)

for j = 1, 2, …, C − 1. Here, the logit link is monotone increasing, and for ease of notation we let P(Yi ≤ j | Xi, β, γ) ≡ ηij; then the probabilities of each category are computed by

$P\left(Y_i = j \mid X_i, \beta, \gamma, \sigma^2\right) = \begin{cases} \eta_{i1} & \text{for } j = 1, \\ \eta_{ij} - \eta_{i(j-1)} & \text{for } j = 2, \ldots, C-1, \\ 1 - \eta_{i(C-1)} & \text{for } j = C. \end{cases}$

In (2.1), the effect of the predictors is independent of the category involved, and subject to the constraint −∞ = γ0 < γ1 < ⋯ < γC−1 < γC = ∞, the cumulative probabilities satisfy 0 < ηi1 < ηi2 < ⋯ < ηi(C−1) < 1, a strict stochastic ordering (McCullagh, 1980).

As discussed in Section 1, the cumulative logistic model is known as the proportional odds model (McCullagh, 1980). This is because the cumulative log-odds ratio for two sets of predictors, X1 and X2, is given by

$\operatorname{logit} P\left(Y_i \le j \mid X_1, \beta, \gamma\right) - \operatorname{logit} P\left(Y_i \le j \mid X_2, \beta, \gamma\right) = \gamma_j - X_1\beta - \left(\gamma_j - X_2\beta\right) = \left(X_2 - X_1\right)\beta.$

Hence, the ratio of the corresponding odds is independent of category j and depends only on the difference between the covariate values, X2 − X1.
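To make the cumulative logit link and the proportional-odds property concrete, the following is a minimal numerical sketch in Python; the cutpoints and coefficients are hypothetical illustration values, not estimates from the paper.

```python
import math

def cumulative_probs(x, beta, gammas):
    """Cumulative probabilities eta_ij = P(Y <= j) under the cumulative
    logit model: logit P(Y <= j | x) = gamma_j - x'beta."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    return [1.0 / (1.0 + math.exp(-(g - xb))) for g in gammas]

def category_probs(x, beta, gammas):
    """Per-category probabilities from successive differences of the etas."""
    eta = cumulative_probs(x, beta, gammas)
    probs = [eta[0]]
    probs += [eta[j] - eta[j - 1] for j in range(1, len(eta))]
    probs.append(1.0 - eta[-1])
    return probs

# Hypothetical cutpoints gamma_1 < gamma_2 < gamma_3 for C = 4 categories,
# and hypothetical coefficients for two predictors.
gammas = [-1.0, 0.5, 2.0]
beta = [0.8, -0.4]
x1, x2 = [1.0, 2.0], [0.0, 1.0]

p1 = category_probs(x1, beta, gammas)
print(sum(p1))   # category probabilities sum to 1

# Proportional odds: the difference of cumulative log-odds between x1 and x2
# is the same for every category j and equals (x1 - x2)'beta here.
def logit(p):
    return math.log(p / (1.0 - p))

e1 = cumulative_probs(x1, beta, gammas)
e2 = cumulative_probs(x2, beta, gammas)
diffs = [logit(b) - logit(a) for a, b in zip(e1, e2)]
print(diffs)     # constant across categories
```

The constant difference across categories is exactly the proportional-odds property derived above.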

### 2.2. Principal component regression model

In statistics, principal component regression (PCR) is a regression technique based on principal component analysis. Massy (1965) introduced PCR to simplify the sample space of explanatory variables by a transformation to PCs, for cases where the original independent variables are highly collinear with one another or there is a great number of potential explanatory variables. PCR regresses on a small number of regressors accounting for most of the variability of the explanatory variables. Massy (1965) also proposed a method for choosing the PCs most highly correlated with the response to be included in the regression, and thus to explain the most variation therein.

Reiss and Ogden (2007) extended PCR to a functional version with splines, and introduced the reduced singular-value decomposition (SVD) of the matrix of explanatory variables to form a new design matrix for PCR. We consider the SVD of the n × p explanatory matrix X, X = UΣVT. Here, U is an n × n orthonormal matrix whose column vectors are the left singular vectors of X, which can be obtained by the spectral decomposition of XXT. V is a p × p orthonormal matrix whose column vectors are the right singular vectors of X, from the spectral decomposition of XTX. Also, Σ is an n × p diagonal matrix of the singular values (square roots of the eigenvalues) of XXT or XTX, in descending order.

Let UK, ΣK, and VK be the rank-K truncations of the three SVD matrices; then we assume that

$X \approx U_K \Sigma_K V_K^T \equiv FA,$

where F = UKΣK is an n × K factor matrix and A = VKT is a K × p SVD loading matrix with K < min(n, p). Here, AAT = I, FTF = ΣK2, and ΣK is the K × K upper-left submatrix of Σ, which we assume to be nonsingular. For the linear model, we usually assume that y = Xβ + ε, where y is the n-vector of responses and ε ~ N(0, σ2I) is the n-vector error term. Then the K-component PCR can be expressed as

$y = F\theta + \varepsilon,$

where ε ~ MVNn(0, σ2I). Here, the regression transforms via Xβ = Fθ, where θ = Aβ is the length-K vector of regression parameters for the PCR. The original regression parameter β can be recovered by the least-norm inverse β = ATθ.
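The SVD construction above can be sketched numerically. The sizes n, p, K, the random data, and the use of NumPy below are illustrative assumptions, not the paper's data or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 50, 10, 3          # illustrative sizes

X = rng.standard_normal((n, p))
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T

# Rank-K pieces: F = U_K Sigma_K (n x K factor matrix), A = V_K^T (K x p loadings).
F = U[:, :K] * s[:K]
A = Vt[:K, :]

# Properties stated in the text: A A^T = I_K and F^T F = Sigma_K^2.
assert np.allclose(A @ A.T, np.eye(K))
assert np.allclose(F.T @ F, np.diag(s[:K] ** 2))

# K-component PCR: regress y on F, then map back to the original predictors
# via the least-norm inverse beta = A^T theta.
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)
theta, *_ = np.linalg.lstsq(F, y, rcond=None)
beta_pcr = A.T @ theta       # length-p coefficients on the original scale
print(beta_pcr.shape)
```

The same back-transformation β = ATθ is used later to report coefficients of the original predictors after fitting the Bayesian PC regression.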

### 2.3. Cumulative logistic principal component regression model

A useful feature of the logistic distribution is that the logistic density and the t-distribution density closely approximate one another when the degrees of freedom ν and the variance ξ2 are chosen appropriately (O’Brien and Dunson, 2004). This fact allows an approximation to the posterior distribution to be derived by substituting an appropriate t-distribution for the logistic distribution. Albert and Chib (1993) noted that logistic quantiles are approximately a linear function of t(8) quantiles. O’Brien and Dunson (2004) considered a multivariate t-distribution with ξ2 = π2(ν − 2)/3ν, a value chosen to make the variances of the univariate t and logistic distributions equal, and ν = 7.3, a value chosen to minimize the integrated squared distance between the univariate t and univariate logistic densities. Thus, combining the properties of cumulative logistic regression discussed in Section 2.1, the t-approximation to the logistic distribution, and the PCR of Section 2.2 to account for the multicollinearity in X, the cumulative logistic PC regression model with latent variable Zi can be expressed as: for i = 1, …, n and j = 1, …, C,

$Y_i = j \quad \text{if } \gamma_{j-1} \le Z_i \le \gamma_j, \qquad Z_i \mid F_i, \theta, \varphi_i, a_0 \sim N\left(a_0 + F_i\theta,\ \psi\varphi_i^{-1}\sigma^2\right), \qquad \varphi_i \sim \text{Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right),$ (2.3)

where ν = 7.3 and ψ = π2(ν − 2)/3ν, values chosen to make the variances of the univariate t and logistic distributions equal in our model.
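The quality of this approximation is easy to check numerically. The sketch below compares the standard logistic density with a t density using ν = 7.3 and scale² = π²(ν − 2)/3ν; the evaluation grid is an arbitrary choice.

```python
import math

NU = 7.3
PSI = math.pi ** 2 * (NU - 2) / (3 * NU)   # scale^2 matching the logistic variance
SCALE = math.sqrt(PSI)

def logistic_pdf(y):
    """Standard logistic density (variance pi^2 / 3)."""
    e = math.exp(-y)
    return e / (1.0 + e) ** 2

def t_pdf(y, nu=NU, scale=SCALE):
    """Student-t density with location 0, the given scale, and nu df."""
    z = y / scale
    logc = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(scale))
    return math.exp(logc - (nu + 1) / 2 * math.log1p(z * z / nu))

# The two densities agree closely over the bulk of the support.
grid = [i / 10 for i in range(-80, 81)]
max_gap = max(abs(logistic_pdf(y) - t_pdf(y)) for y in grid)
print(max_gap)
```

The maximum pointwise gap on this grid is tiny relative to the peak density of 0.25, which is why the substitution changes the posterior very little.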

For the t-distribution with location μ, scale ξ and degrees of freedom ν, the density function can be expressed in the following scale mixtures of normal form,

$t_\nu\left(y \mid \mu, \xi\right) = \int_0^\infty N\left(y \,\middle|\, \mu, \frac{\xi^2}{\eta}\right) \text{Gamma}\left(\eta \,\middle|\, \frac{\nu}{2}, \frac{\nu}{2}\right) d\eta,$

where Gamma(· | a, b) is the gamma density function. This means that the t-distribution can be expressed in the following hierarchical form:

$Y \mid \mu, \xi^2, \eta \sim N\left(\mu, \frac{\xi^2}{\eta}\right) \quad \text{and} \quad \eta \mid \nu \sim \text{Gamma}\left(\frac{\nu}{2}, \frac{\nu}{2}\right).$

This representation is used for the model in (2.3), and based on (2.3) we develop Bayesian methods for the inference of the parameters; the details are in the following section.
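The scale-mixture representation can be verified by Monte Carlo: drawing η from the gamma mixing distribution and then Y from the conditional normal should reproduce the t variance ξ²ν/(ν − 2). The values of μ, ξ, and the sample size below are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)
NU, MU, XI = 7.3, 0.0, 1.5

# Draw from the hierarchy: eta ~ Gamma(nu/2, rate nu/2), then
# Y | eta ~ N(mu, xi^2 / eta).  Marginally, Y ~ t_nu(mu, xi).
draws = []
for _ in range(200_000):
    # random.gammavariate takes (shape, scale); scale = 1/rate = 2/nu.
    eta = random.gammavariate(NU / 2, 2 / NU)
    draws.append(random.gauss(MU, XI / eta ** 0.5))

# The t_nu variance is xi^2 * nu / (nu - 2); compare with the sample variance.
target = XI ** 2 * NU / (NU - 2)
print(round(statistics.variance(draws), 2), round(target, 2))
```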

3. Bayesian inference on the cumulative logistic principal component regression model

In the Bayesian framework, an unknown parameter is treated as a random variable, and uncertainty about the parameters is described by a probability distribution, the so-called posterior distribution. The posterior distribution combines the evidence collected from the data, formally expressed by the likelihood function, with a prior belief specified through a prior distribution. To sample from the posterior distribution of the parameters of interest, a Markov chain Monte Carlo (MCMC) algorithm is used, which makes it possible to iteratively draw samples from even a complex distribution without analytically solving for a point estimate.

### 3.1. Specifications of prior distribution

In our work, we consider shrinkage priors on the underlying factor regression parameters θ (West, 2003). West (2003) discussed a particularly flexible class of such priors that assumes independent t-distributions for the elements of θ, thereby allowing for varying degrees of shrinkage in each of the orthogonal factor dimensions. The shrinkage priors on the PC regression parameters are the following: for k = 1, …, K,

$\theta_k \sim N\left(0, \frac{\rho}{k^2\delta_k}\right) \quad \text{and} \quad \delta_k \sim \text{Gamma}\left(\frac{r}{2}, \frac{r}{2}\right),$ (3.1)

independently, for some r > 0. The tuning parameter r is the degree-of-freedom parameter for the implied t-distribution for θk that follows on marginalization over the random precision δk. Other prior distributions on the parameters of the model in (2.3) and prior distribution on the hyper-parameter ρ are the following,

$\rho \sim \pi(\rho) \propto \text{constant}, \quad \sigma^2 \sim IG(a, b), \quad r \sim U(0, 10), \quad a_0 \sim N\left(\alpha_0, \eta^2\right), \quad \text{and} \quad \gamma_j \sim \pi\left(\gamma_j\right) \propto I\left(\gamma_{j-1} < \gamma_j < \gamma_{j+1}\right),$ (3.2)

where j = 1, …, C − 1 and IG denotes the inverse gamma distribution. Here, the normal prior on a0 is indexed by two hyperparameters, so a particular prior distribution can be specified by fixing two features of the distribution. Without loss of generality, we usually set α0 = 0 and η2 to a large value such as 100.

### 3.2. Sampling scheme

With the model in (2.3) and the priors in (3.1) and (3.2), Gibbs sampling is straightforward because most of the conditional posterior distributions are available in closed form, the exception being the tuning parameter r. The conditional posterior distribution of r has no closed form, but sampling from it can be carried out by a Metropolis-Hastings algorithm. Thus, we iterate sampling from the full conditional posterior distributions until convergence.

Given the number of PCs, K, the joint posterior distribution has the form

$\pi\left(\theta, \varphi, a_0, \sigma^2, \delta, r, \rho, \gamma, Z \mid X, Y\right) \propto \left[\prod_{i=1}^{n}\left\{\prod_{j=1}^{C} I\left(\gamma_{j-1} \le Z_i \le \gamma_j\right)\right\}\left(\frac{\varphi_i}{\sigma^2}\right)^{\frac{1}{2}} \exp\left\{-\frac{\varphi_i}{2\psi\sigma^2}\left(Z_i - a_0 - F_i\theta\right)^2\right\} \times \varphi_i^{\frac{\nu}{2}-1}\exp\left(-\frac{\nu}{2}\varphi_i\right)\right]\left(\sigma^2\right)^{-(a+1)}\exp\left(-\frac{b}{\sigma^2}\right)\exp\left\{-\frac{1}{2\eta^2}\left(a_0 - \alpha_0\right)^2\right\}\prod_{j=1}^{C-1} I\left(\gamma_{j-1} < \gamma_j < \gamma_{j+1}\right) \times \rho^{-\frac{K}{2}}\left(\prod_{k=1}^{K}\delta_k^{\frac{1}{2}}\right)\exp\left(-\frac{1}{2}\theta^T G^{-1}\theta\right)\frac{\left(\frac{r}{2}\right)^{\frac{rK}{2}}}{\left\{\Gamma\left(\frac{r}{2}\right)\right\}^K}\left(\prod_{k=1}^{K}\delta_k^{\frac{r}{2}-1}\right)\exp\left(-\frac{r}{2}\sum_{k=1}^{K}\delta_k\right),$

where $G = \operatorname{diag}\left(\rho/(k^2\delta_k)\right)_{k=1,\ldots,K}$ and $\theta^T G^{-1}\theta = (1/\rho)\sum_{k=1}^{K} k^2\delta_k\theta_k^2$. Then, the posterior sampling is carried out via Gibbs sampling with the priors discussed earlier. For the tuning parameter r, we use a Metropolis-Hastings step because the conditional distribution has no closed form; for the other parameters, we use Gibbs updates because of conjugacy. Further details on the sampling scheme, with the full conditional distributions, are provided in Appendix A.

There is an identifiability problem among the intercept a0, the cutpoints γ, and the variance σ2. Hirk et al. (2019) pointed out that in ordinal models the absolute location and absolute scale of the underlying latent variable are not identifiable. In our model, only the quantities θ/σ* and (γj − a0)/σ*, j = 1, …, C − 1, are identifiable, where σ*2 is the full variance of the error term. Thus, to secure the identifiability of the scaled intercept a0/σ* and scaled cutpoints γj/σ*, we impose the typical constraints γ1 = 0 and σ2 = 1 in our model. This is common practice in Bayesian logistic regression models for both univariate and multivariate binary outcomes; fixing the first cutpoint at zero during the MCMC prevents the identifiability problem in the parameter estimation. With these constraints, we update the posterior samples of the parameters iteratively.
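Inside the Gibbs sampler, each latent Zi is drawn from its normal full conditional truncated to the interval (γ_{Yi−1}, γ_{Yi}). The following is a minimal inverse-CDF sketch of one such truncated draw; the mean, scale, and cutpoints are hypothetical stand-ins, not the actual full conditional of Appendix A.

```python
import random
from statistics import NormalDist

def draw_truncated_normal(mean, sd, lo, hi, rng=random):
    """Inverse-CDF draw from N(mean, sd^2) truncated to (lo, hi)."""
    d = NormalDist(mean, sd)
    # Map a uniform draw on (F(lo), F(hi)) back through the normal quantile.
    u = rng.uniform(d.cdf(lo), d.cdf(hi))
    return d.inv_cdf(u)

random.seed(2)
# e.g. an observation with Y_i = 2 and cutpoints gamma_1 = 0, gamma_2 = 1.84,
# so Z_i must land in (0, 1.84):
z = draw_truncated_normal(mean=0.3, sd=1.0, lo=0.0, hi=1.84)
print(0.0 < z < 1.84)
```

The draw is guaranteed to respect the category constraint, which is what keeps the latent variables consistent with the observed ordinal outcomes across iterations.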

As discussed in Section 1, we choose the number of PCs, K, based on the Bayesian information criterion (BIC). In Bayesian analysis, the Bayes factor is a standard tool for model selection. Computing the Bayes factor involves integrating the likelihood over the parameter space, which can be a high-dimensional and intractable integral. Exact analytic integration is possible for exponential family distributions with conjugate priors, including normal linear models. However, Kass and Raftery (1995) argued that the integral is more often intractable and must then be computed by numerical methods, and they introduced various asymptotic approximations to the marginal density of the data, among them BIC (the Schwarz criterion). Raftery (1995) argued that Bayes factors allow the direct comparison of non-nested models in a simple way, and that BIC provides a simple and accurate approximation to Bayes factors for hypothesis testing, model selection, and accounting for model uncertainty. BIC can be applied as a standard procedure even when the priors on the parameters are hard to set precisely. Another advantage is that, within Bayesian inference, it is easy to obtain BIC using the posterior mean from Gibbs samples of the posterior distribution. Thus we consider that, for our Bayesian inference of the cumulative log-odds model with PC regression, BIC might be a good model selection criterion for choosing the number of PCs within the regression model framework.
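The BIC comparison across candidate values of K reduces to the Schwarz formula. The sketch below uses hypothetical log-likelihood values and, for simplicity, counts only K parameters per model; in the paper's procedure the log-likelihood would be evaluated at the posterior means from the Gibbs samples, and the parameter count would also include the cutpoints and intercept.

```python
import math

def bic(log_lik, n_params, n_obs):
    """Schwarz criterion: BIC = -2 log L + k log n (smaller is better)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

# Hypothetical maximized/plug-in log-likelihoods for candidate K values.
candidates = {3: -210.4, 5: -204.9, 8: -201.2}   # K -> log-likelihood
n_obs = 275

scores = {K: bic(ll, K, n_obs) for K, ll in candidates.items()}
best_K = min(scores, key=scores.get)
print(best_K, {K: round(v, 1) for K, v in scores.items()})
```

Note how the log n penalty can overturn a small likelihood gain from extra components, which is exactly the behavior exploited when picking K from the BIC plot.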

4. Illustrative examples

For the simulation study and data analysis, diffuse prior distributions were used to reflect the lack of prior information. For this, we set the hyperparameters of the prior distribution of the intercept a0 (its mean and variance) as α0 = 0 and η2 = 100. For the posterior inference, we obtained the posterior median and the 95% credible intervals of the parameter estimates from the proposed MCMC algorithm, run for 30,000 iterations with a burn-in period of 15,000 iterations and thinning by taking every fifth posterior sample. Also, to fit the Bayesian cumulative logistic PC regression, we need to decide the optimal number of components. As discussed in Section 3.2, we compute the BIC of the Bayesian cumulative logistic PC regression for numbers of components from 2 to a large value L (L can vary depending on the data structure). Based on the BIC plot, we choose the number of components with the smallest BIC value.
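The burn-in, thinning, and posterior-summary step can be sketched as follows; the stand-in chain below is synthetic noise, whereas a real chain would come from the Gibbs sampler.

```python
import random
import statistics

random.seed(3)
# Stand-in chain: 30,000 draws of one parameter.  Discard the first 15,000
# as burn-in, then keep every fifth draw.
chain = [random.gauss(1.0, 0.5) for _ in range(30_000)]
kept = chain[15_000::5]                 # 3,000 retained samples

post_median = statistics.median(kept)
kept_sorted = sorted(kept)
lo = kept_sorted[int(0.025 * len(kept))]        # lower 95% credible bound
hi = kept_sorted[int(0.975 * len(kept)) - 1]    # upper 95% credible bound
print(len(kept), round(post_median, 2), (round(lo, 2), round(hi, 2)))
```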

For comparison with the suggested model, we consider EM-based inference with a linear mean trend using the polr (proportional-odds logistic regression) function in the MASS package (Venables and Ripley, 2002) and the bayespolr function in the arm package (Gelman and Su, 2020) in R (R Core Team, 2021). The polr function is based on the traditional maximum likelihood method using Fisher scoring or the Newton-Raphson algorithm. The bayespolr function is based on weakly informative priors for logistic regression (Gelman et al., 2008, 2015), assuming a Student-t prior distribution with mean 0, with degrees of freedom (df) and scale chosen to provide minimal prior information and to constrain the coefficients to lie in a reasonable range. Full Bayesian computation using the usual MCMC methods could be applied, but for a quick iterative calculation yielding point estimates of the regression coefficients and standard errors, an approximate EM algorithm is applied.

### 4.1. Simulation study

We conduct a simulation study to evaluate how well the proposed method performs with a high-dimensional and highly correlated explanatory matrix. For high-dimensional predictors, we set the number of observations to n = 50 and the number of explanatory variables to p = 70. To generate a highly correlated explanatory matrix, we set the correlation coefficient of Xi and Xj to $\sigma_{ij}^X = (0.8)^{|i-j|}$ and the standard deviation of Xi to 0.1. With the given covariance matrix and a zero mean vector, we generate the n × p explanatory matrix from a multivariate normal distribution. For the true value of the regression parameter, we set 35 ones followed by 35 zeros, β = (1, …, 1, 0, …, 0). Incorporating the regression mean Xβ, the ordered categorical outcomes y were determined from the logistic distribution and the predefined cutpoints (−∞, −1.9, 0, 1.8, ∞).
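The simulation design above can be sketched as follows; the seed and the use of NumPy are our own choices, but the sizes, correlation structure, coefficients, and cutpoints follow the stated settings.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 70

# AR(1)-style correlation: corr(X_i, X_j) = 0.8^|i-j|, sd(X_i) = 0.1.
idx = np.arange(p)
corr = 0.8 ** np.abs(idx[:, None] - idx[None, :])
cov = 0.01 * corr                       # sd 0.1  ->  variance 0.01
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

# True beta: 35 ones followed by 35 zeros; logistic latent errors;
# predefined cutpoints (-inf, -1.9, 0, 1.8, inf).
beta = np.concatenate([np.ones(35), np.zeros(35)])
z = X @ beta + rng.logistic(size=n)
cuts = np.array([-np.inf, -1.9, 0.0, 1.8, np.inf])
y = np.digitize(z, cuts)                # ordinal categories 1..4
print(X.shape, sorted(set(y)))
```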

To decide the number of components used in the Bayesian cumulative logistic PC regression, we computed the BIC for each number of components; the BIC plot for up to 15 components is in Figure 1. In Figure 1, we observe that the minimum BIC is obtained with 8 principal components, K = 8. Thus, for the remaining analysis and model comparison, we employ 8 principal components of the explanatory matrix as regressors, and after fitting the Bayesian cumulative logistic PC regression, we transform back to the regression parameters as discussed in Section 2.2. Also, to check the convergence of the MCMC, we draw trace plots and autocorrelation plots for a few selected θs; the plots are in Figure B.1 in Appendix B. The plots show that the proposed Gibbs sampler converged fast and mixed well.

A subset of the estimated values and 95% confidence intervals of selected regression parameters from EM-based proportional-odds logistic regression (polr) and EM-based Bayesian proportional-odds logistic regression (bayespolr), and the estimated values (posterior medians) and 95% credible intervals of the regression parameters from the Bayesian cumulative logistic PC regression (bayesPC) for the simulated data, are in Table B.1 in Appendix B. Estimated values and 95% confidence intervals of the re-transformed regression parameters of the suggested model can also be found in Figure 2. Because the explanatory matrix is not of full rank, the EM-based proportional-odds logistic regression model does not use all predictors; it estimated the regression parameters of only 49 explanatory variables along with the cutpoints. Moreover, none of the estimated regression parameters and cutpoints were significant because of the very large estimated variances. Thus, this model is not adequate for prediction and analysis of such highly correlated high-dimensional data. The Bayesian proportional-odds logistic regression shows almost the same results as the EM-based proportional-odds logistic regression model. In contrast, because of the dimension-reduction effect of PC regression and the generalized shrinkage prior, the proposed model provides reasonable parameter estimates. Even though somewhat amplified estimates and large variances were observed, the proposed model detects the true zero values of the last 35 regression parameters in this highly correlated high-dimensional setting. Also, the model fit based on the mean squared error could be computed, and the computed MSE was 0.06 with the explanatory matrix X used to fit the model. Thus, we argue that the proposed model works better for highly correlated high-dimensional data, providing parameter estimates and good predictions.

### 4.2. A real data example

4.2.1. Single-kernel density of wheat

Background: Damage from sprouted and scabbed kernels can be a critical problem in wheat production because preharvest sprouting causes harvest losses, reduced test weight, and loss of seed viability. Thus, kernel density measurement is important, and to facilitate this identification process, technological advances toward an objective grain inspection system have been developed, resulting in the rapid characterization of single-kernel wheat properties. Martin et al. (1998) conducted a study to investigate the effects of sprout and scab damage on single-kernel density and to explore the relationship between density and single-kernel weight, size, moisture, and hardness, measured on a sample of wheat kernels from two different classes of wheat, hard red winter (hrw) and soft red winter (srw). Sprout- and scab-damaged kernels were separated from non-damaged (healthy) kernels by Federal Grain Inspection Service personnel. Fifteen kernels of each damage type and an equal number of healthy kernels were selected from four lots of hard red winter wheat and four lots of soft red winter wheat. Bilder and Loughin (2015) were provided this wheat kernel data by Martin et al. (1998); originally, there were 275 wheat kernel measurements in the data set. Bilder and Loughin (2015) used a portion of the data to fit nominal and ordinal response regression models. We use the same data set that Bilder and Loughin used to show how the suggested Bayesian cumulative logistic PC regression model works.

Results: In the wheat kernels data, we consider type (“scab” < “sprout” < “healthy”) as the ordered response variable. For the explanatory variables, we consider a binary class variable (hrw = 0 & srw = 1), density (single-kernel density), hardness (endosperm texture), size (single-kernel size), weight (single-kernel weight), and moisture (single-kernel moisture content). A graphical display of the correlation coefficient matrix of the variables in the wheat kernel data is in Figure 3. To compute the correlation coefficients, the response variable type is coded numerically as “scab” = 1, “sprout” = 2 & “healthy” = 3, and because of the ordering in the response variable, we use a rank-based measure of association, Spearman’s rho. Also, the results of hierarchical clustering are shown as boxes on the correlation plot in Figure 3. In Figure 3, we observe that the class variable is positively correlated with moisture and somewhat negatively correlated with hardness, although neither relationship is strong. Also, moisture is negatively, but not strongly, correlated with hardness. As expected, size and weight are strongly positively correlated. Finally, density, weight, and size are positively correlated with the numerically coded response variable type, in decreasing order of strength.

Bilder and Loughin (2015) presented a parallel coordinate plot of the wheat data; it is reproduced in Figure B.2 in the Appendix. Three different line types (solid = “healthy”, long dash = “sprout”, dot dash = “scab”) are used to highlight some of the features that differentiate the healthy, sprout, and scab conditions. The y-axis of the plot is the observed value of each variable for each kernel. Also, to compare the explanatory variables across wheat types, means and 95% confidence intervals are given in Table B.3 in the Appendix. As Bilder and Loughin (2015) discussed, based on the parallel coordinate plot and the basic statistics for each wheat type, we observe that (1) scab kernels generally have smaller density, size, and weight values, (2) healthy kernels may have higher densities, (3) there is much overlap between healthy and sprout kernels, and (4) the moisture content appears to be related to the hard or soft red winter wheat class.

To explore the relationship between each explanatory variable and the response variable, we draw scatter plots of each explanatory variable vs. the response variable (type) with cubic regression spline curves; the scatter plot matrix is in Figure 4. We observe that as a wheat kernel gets healthier, its single-kernel density is higher. Sprout and healthy kernels have similar size and weight, and a strong positive relationship exists between size and weight, as discussed above. Hardness and moisture do not seem to show any distinct pattern across kernel types.

To fit the Bayesian cumulative logistic PC regression to the wheat kernel data, we need to decide the optimal number of components. First, we performed a principal component analysis on the explanatory variable matrix, where the components, ordered by decreasing variance, correspond to the eigenvectors associated with the decreasing eigenvalues of the variance-covariance matrix of the centered and scaled explanatory variables. Second, as discussed in Section 3.2, we computed the BIC of the Bayesian cumulative logistic PC regression for the number of components from 2 to 6, where 6 is the number of explanatory variables. The scree plot of the PC analysis on the explanatory variable matrix and the BIC plot of the Bayesian cumulative logistic PC regression are in Figure 5. Based on the explanatory matrix alone, K = 3 components is a good candidate for the optimal number of components. The computed BIC of 5 components is the smallest, but the BICs of 2, 3, and 5 components are quite similar. By the least-norm inverse β = A^Tθ, the original regression parameter β can be recovered, and the estimated regression parameters β̂ = A^Tθ̂ based on the Bayesian cumulative logistic PC regression with 5 components (bayesPC(K=5)) and with 3 components (bayesPC(K=3)) are in Table B.2 in the Appendix. As discussed above, for identifiability we fix the first cutpoint as γ1 = 0 and the variance parameter in (2.3) as σ2 = 1. With 5 components, hardness and weight appear significant, and with 3 components, hardness, weight, and size appear significant based on the 95% credible intervals. With the first cutpoint fixed, the posterior median of the second cutpoint is 1.84, and the 95% credible interval under the 5-component model is slightly narrower.
However, the number of explanatory variables is only 6, and a 5-component model retains most of the variance of the full model; we therefore use the 3-component Bayesian cumulative logistic PC regression model for comparison with existing methods. To check the convergence of the MCMC, we draw trace plots and autocorrelation plots for a few selected θ’s; the plots are in Figure B.3 in Appendix B. The plots show that the proposed Gibbs sampler converged fast and mixed well.
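The dimension-reduction step described above can be sketched as follows, under the paper’s setup of centering and scaling X and extracting principal components via the singular value decomposition; the data here are simulated stand-ins, and the θ values at the end are illustrative, not the fitted posterior medians.

```python
import numpy as np

# Simulated stand-in for the n x p explanatory matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(275, 6))

# Center and scale, then take the SVD  X_s = U D V^T.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, d, Vt = np.linalg.svd(Xs, full_matrices=False)

var_explained = d**2 / np.sum(d**2)  # scree-plot heights

K = 3                 # number of components (chosen here by scree/BIC)
F = U[:, :K] * d[:K]  # n x K matrix of principal-component scores
A = Vt[:K, :]         # K x p loading matrix, so F = Xs @ A.T

# Least-norm inverse: map PC-space coefficients theta back to the
# original regression parameters, beta = A^T theta.
theta = np.array([-0.09, 0.10, 0.07])  # illustrative values only
beta = A.T @ theta
```

The identity F = Xs V_K (equivalently U_K D_K) is what makes the PC scores orthogonal regressors, and β = A^Tθ is the back-transformation reported in Table B.2.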

Estimated values and 95% confidence intervals of the regression parameters from the EM-based proportional-odds logistic regression (polr) and from the EM-based Bayesian proportional-odds logistic regression (bayespolr) are in Table 1, together with the posterior medians and 95% credible intervals of the least-norm inverse estimates of the regression parameters from the Bayesian cumulative logistic PC regression with K = 3 components. Values in bold in the table are statistically significant. From the table, we observe that the estimated density parameters of polr and bayespolr are larger than the other estimated parameters, and the estimated first and second cutpoints are around 17~18 and 19~20, respectively. However, the estimated density parameter of the Bayesian cumulative logistic PC regression model is not significant based on the 95% credible interval, and the estimated value is small. Also, with the first cutpoint fixed at γ1 = 0, the estimated second cutpoint is 1.84, which is much smaller than the estimates from polr and bayespolr. For the EM-based methods, all parameter estimates are significant except the binary class variable, but for the suggested method, hardness, weight, and size are significant. These differences may be due to the difference in inference methods and the identifiability restriction of the Bayesian cumulative logistic PC regression model. Also, because of the shrinkage priors on the PC regression parameters θ, the estimates are small, (−0.09, 0.10, 0.07), so the estimates of the regression parameters are also small. In our proposed model, the MCMC works with the singular-value-decomposed explanatory matrix. The principal components with the highest variances are selected as regressors, and these are only a subset of all the principal components. Using a subset makes principal components regression a kind of regularized procedure and also a type of shrinkage estimator.
Therefore the density and size parameter estimates in our proposed model shrink toward zero, while the weight parameter estimate is similar to those of the other methods, because density, size, and weight are highly correlated, as we checked in the correlation plot in Figure 3.

### 4.2.2. Comparison of models

To compare model fit, we compute the mean squared error (MSE) and the mean absolute percentage error (MAPE) of the EM-based models and the proposed model after fitting. MSE is the average squared difference between the estimated type based on the original explanatory matrix and the observed type, and MAPE is the average of the absolute percentage errors. The estimated type is the category with the largest estimated probability. The computed MSE of the EM-based proportional-odds logistic regression (polr) and of the EM-based Bayesian proportional-odds logistic regression (bayespolr) is 0.375 for both, and the MSE of the Bayesian cumulative logistic PC regression with 3 components is 0.384. The computed MAPE of polr and bayespolr is 0.240 and 0.242, respectively, and the MAPE of the Bayesian cumulative logistic PC regression with 3 components is 0.248 (the table of MSE and MAPE for the data fit is omitted). As discussed above, the proposed model uses the higher-variance principal components of the explanatory matrix as regressors, and principal components regression acts as a type of shrinkage estimator. Thus, the computed MSE and MAPE of the Bayesian cumulative logistic PC regression are slightly larger than those of the EM-based methods.
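The two fit criteria can be sketched as follows; the observed types and fitted category probabilities below are hypothetical, not taken from the wheat data.

```python
import numpy as np

# Hypothetical coded types (scab = 1, sprout = 2, healthy = 3) and
# fitted category probabilities for six kernels.
y_obs = np.array([1, 2, 3, 3, 2, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.2, 0.6],
                  [0.3, 0.4, 0.3],
                  [0.3, 0.4, 0.3]])

# The estimated type is the category with the largest probability.
y_hat = probs.argmax(axis=1) + 1

mse = np.mean((y_hat - y_obs) ** 2)              # mean squared error
mape = np.mean(np.abs((y_hat - y_obs) / y_obs))  # mean abs. pct. error
```

In this toy example the last kernel is misclassified (predicted 2, observed 1), so both MSE and MAPE equal 1/6.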

Considering prediction, we also compute the MSE and MAPE of predicted values: for the n = 275 observations, about 70% of the data (n.train = 193) is used to fit the models and about 30% (n.test = 82) is used to compute the predictive MSE and MAPE. We repeat the process 100 times, with random training and testing sets created in each iteration. The means and standard deviations of the predictive MSE and MAPE of the EM-based proportional-odds logistic regression (polr), the EM-based Bayesian proportional-odds logistic regression (bayespolr), and the Bayesian cumulative logistic PC regression (bayesPC(K=3)) over the 100 repeats are in Table 2. The mean predictive MSEs of the EM-based methods are 0.333 (0.050) and 0.337 (0.050), respectively, which are quite similar, while the mean predictive MSE of the suggested model is 0.395 (0.052), slightly larger than those of the EM-based methods. The computed MAPEs of polr, bayespolr, and the suggested model are 0.213 (0.036), 0.214 (0.035), and 0.243 (0.039), respectively. We argue that the MSE and MAPE of the suggested model are slightly larger than those of the EM-based methods because the suggested model uses only part of the X space for prediction, whereas the other methods use all of X. For the 100 repeats, the number of components of the Bayesian cumulative logistic PC regression was decided by BIC; the bar plot of the selected numbers of components is in Figure B.4 in Appendix B. The most frequent number of components is K = 3, the value used in the whole-data analysis.
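The repeated hold-out protocol can be sketched as follows; the model-fitting step is left as a placeholder (random predictions keep the sketch runnable), since any of the compared models could be plugged in.

```python
import numpy as np

n, n_train = 275, 193  # roughly a 70/30 split of the wheat data
rng = np.random.default_rng(0)

mse_list = []
for rep in range(100):
    # Fresh random training/testing split in each iteration.
    idx = rng.permutation(n)
    train, test = idx[:n_train], idx[n_train:]

    # ... fit the model on `train` and predict types on `test` ...
    # Placeholder predictions stand in for a fitted model here:
    y_test = rng.integers(1, 4, size=n - n_train)
    y_pred = rng.integers(1, 4, size=n - n_train)

    mse_list.append(np.mean((y_pred - y_test) ** 2))

mse_mean, mse_sd = np.mean(mse_list), np.std(mse_list, ddof=1)
```

Reporting the mean and standard deviation over the 100 repeats is exactly what Table 2 summarizes for each method.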

5. Discussion

A Bayesian approach to the cumulative logistic regression model with orthogonal principal components based on singular value decomposition was proposed to account for multicollinearity among predictors. The suggested method has the advantage of considering dimension reduction and parameter estimation simultaneously. However, if the aim of the analysis is prediction of the response variable, the suggested model has a limitation, because it uses only the components with large variance and ignores the components with small variance. The considered prior is a generalized shrinkage prior; West (2003) discussed standard principal component regression with a shrinkage prior on the PC regression coefficients, allowing varying degrees of shrinkage in each of the orthogonal PC dimensions. Thus, the suggested model might not be the best choice for prediction. However, when multicollinearity among the explanatory variables is serious, standard inference with high-dimensional covariates becomes problematic.

We conducted a simulation study to show how well the proposed method performs with high-dimensional, highly correlated predictors. Because the predictor matrix is not of full rank, the proposed Bayesian cumulative logistic PC regression works better than the EM-based proportional-odds logistic regression (polr) and the EM-based Bayesian proportional-odds logistic regression (bayespolr) for highly correlated high-dimensional data, providing parameter estimates and good predictions. We also applied the proposed method to real data concerning sprout- and scab-damaged kernels of wheat. Because of the shrinkage priors on the components, the estimated regression parameters were small. Compared to polr and bayespolr, the proposed Bayesian cumulative logistic PC regression showed a limitation in prediction. Also, the estimated density parameter and the estimates of the first and second cutpoints of the suggested method differed considerably from those of the other methods. We argue that the scale of the density variable differs greatly from those of the other explanatory variables, which affects the position of the cutpoints.

Methodologically, the proposed method can be further extended to latent factor regression models that partition variation in the predictor variables into multiple components reflecting common patterns and separate these from noise variation. Although the suggested model considered orthogonal components, it had a limitation in prediction. With latent factors, the orthogonality property may be lost, but we may be able to find common structure in the explanatory matrix while isolating idiosyncratic variation, for dimension reduction in the predictor space.

Appendix A: Posterior distribution

The sampling scheme is as follows. With ν = 7.3 and ψ = π²(ν − 2)/(3ν):

• update Z: for i = 1, …, n and j = 1, …, C

$Z_i \mid \theta, \varphi_i, a_0, \sigma^2, \delta, r, \rho, \gamma, X, Y \sim N\left(a_0 + F_i\theta,\; \psi\varphi_i^{-1}\sigma^2\right),$

where Zi is truncated on the left and right by γj−1 and γj if Yi = j.

• update γ: for j = 1, …, C − 1

$\gamma_j \mid \gamma_{-j}, \theta, \varphi, a_0, \sigma^2, \delta, r, \rho, Z, X, Y \sim U\left(\max\{\max(Z_i : Y_i = j), \gamma_{j-1}\},\; \min\{\min(Z_i : Y_i = j+1), \gamma_{j+1}\}\right) I(\gamma_{j-1} < \gamma_j < \gamma_{j+1}),$

where γ−j = (γ1, …, γj−1, γj+1, …, γC).

• update θ

$\theta \mid \varphi, a_0, \sigma^2, \delta, r, \rho, \gamma, Z, X, Y \sim \mathrm{MVN}_K\left(\left(\frac{1}{\psi\sigma^2}\sum_{i=1}^n \varphi_i F_i^T F_i + G^{-1}\right)^{-1}\left(\frac{1}{\psi\sigma^2}\sum_{i=1}^n \varphi_i F_i^T (Z_i - a_0)\right),\; \left(\frac{1}{\psi\sigma^2}\sum_{i=1}^n \varphi_i F_i^T F_i + G^{-1}\right)^{-1}\right),$

where MVNK is the K-dimensional multivariate normal distribution.
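This θ update can be sketched as follows, with simulated stand-ins for F, Z, φ, and the prior precision G⁻¹ (the model’s actual G involves the shrinkage parameters δ and ρ; an identity matrix is used here purely for illustration).

```python
import numpy as np

# Simulated stand-ins for the quantities in the theta full conditional.
rng = np.random.default_rng(3)
n, Kc = 50, 3
F = rng.normal(size=(n, Kc))        # rows F_i of PC scores
Z = rng.normal(size=n)              # latent variables
phi = rng.gamma(3.5, 1.0, size=n)   # mixing weights phi_i
a0, sigma2 = 0.0, 1.0
nu = 7.3
psi = np.pi**2 * (nu - 2) / (3 * nu)
G_inv = np.eye(Kc)                  # stand-in prior precision G^{-1}

# Posterior precision Q and linear term b of the MVN full conditional.
Q = (F.T * phi) @ F / (psi * sigma2) + G_inv
b = F.T @ (phi * (Z - a0)) / (psi * sigma2)

# Draw theta ~ MVN_K(Q^{-1} b, Q^{-1}) via the Cholesky factor of Q.
L = np.linalg.cholesky(Q)
mean = np.linalg.solve(Q, b)
theta = mean + np.linalg.solve(L.T, rng.normal(size=Kc))
```

Solving against the Cholesky factor of the precision matrix avoids forming Q⁻¹ explicitly, which is the standard way to draw from a multivariate normal full conditional.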

• update φ: for i = 1, …, n

$\varphi_i \mid \theta, a_0, \sigma^2, \delta, r, \rho, \gamma, Z, X, Y \sim \mathrm{Gamma}\left(\frac{\nu}{2} + \frac{1}{2},\; \frac{\nu}{2} + \frac{1}{2\psi\sigma^2}(Z_i - a_0 - F_i\theta)^T(Z_i - a_0 - F_i\theta)\right),$

• update a0

$a_0 \mid \theta, \varphi, \sigma^2, \delta, r, \rho, \gamma, Z, X, Y \sim N\left(\left(\frac{1}{\psi\sigma^2}\sum_{i=1}^n \varphi_i + \frac{1}{\eta^2}\right)^{-1}\left(\frac{1}{\psi\sigma^2}\sum_{i=1}^n \varphi_i (Z_i - F_i\theta) + \frac{\alpha_0}{\eta^2}\right),\; \left(\frac{1}{\psi\sigma^2}\sum_{i=1}^n \varphi_i + \frac{1}{\eta^2}\right)^{-1}\right),$

• update σ2

$\sigma^2 \mid \theta, \varphi, a_0, \delta, r, \rho, \gamma, Z, X, Y \sim \mathrm{IG}\left(a + \frac{n}{2},\; b + \sum_{i=1}^n \varphi_i (Z_i - a_0 - F_i\theta)^T(Z_i - a_0 - F_i\theta)\right),$

• update δ: for k = 1, …, K

$\delta_k \mid \theta, \varphi, a_0, \sigma^2, r, \rho, \gamma, Z, X, Y \sim \mathrm{Gamma}\left(\frac{r}{2} + \frac{1}{2},\; \frac{r}{2} + \frac{1}{2\rho}k^2\theta_k^2\right),$

• update r

$r \mid \theta, \varphi, a_0, \sigma^2, \delta, \rho, \gamma, Z, X, Y \sim \pi(r \mid \theta, \varphi, a_0, \sigma^2, \delta, \rho, \gamma, Z, X, Y) \propto \frac{(r/2)^{rK/2}}{\{\Gamma(r/2)\}^K}\exp\left\{-r\left(\frac{1}{2}\sum_{k=1}^K \delta_k - \frac{1}{2}\sum_{k=1}^K \ln\delta_k\right)\right\}.$

For ease of notation, let the target distribution be $P(r) = \frac{(r/2)^{rK/2}}{\{\Gamma(r/2)\}^K}\exp\{-r((1/2)\sum_{k=1}^K \delta_k - (1/2)\sum_{k=1}^K \ln\delta_k)\}$. Consider an exponential distribution with parameter $\lambda = (1/2)\sum_{k=1}^K \delta_k - (1/2)\sum_{k=1}^K \ln\delta_k$ as a candidate distribution

$g^*(r \mid r_t) = \mathrm{Exponential}\left(r \,\middle|\, \lambda = \frac{1}{2}\sum_{k=1}^K \delta_k - \frac{1}{2}\sum_{k=1}^K \ln\delta_k\right),$

then for each iteration t,

• (a) generate a random candidate r′ from the candidate distribution r′ ~ g* (r|rt)

• (b) calculate the acceptance probability

$A(r', r_t) = \min\left(1,\; \frac{P(r')}{P(r_t)}\frac{g^*(r_t \mid r')}{g^*(r' \mid r_t)}\right)$

• (c) set rt+1 = r′ if u ≤ A(r′, rt) and rt+1 = rt if u > A(r′, rt), where u ~ U(0, 1)
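Steps (a)–(c) can be sketched as follows; the δ values are simulated stand-ins, and the acceptance ratio is computed on the log scale for numerical stability.

```python
import numpy as np
from scipy.special import gammaln

# Simulated stand-ins for the current delta_k values.
rng = np.random.default_rng(2)
delta = rng.gamma(2.0, 1.0, size=5)
K = delta.size

# Rate of the exponential candidate; positive because x - ln(x) >= 1.
lam = 0.5 * delta.sum() - 0.5 * np.log(delta).sum()

def log_P(r):
    # log of P(r) = (r/2)^{rK/2} / Gamma(r/2)^K * exp(-r * lam)
    return (r * K / 2) * np.log(r / 2) - K * gammaln(r / 2) - r * lam

r_t = 7.3                          # current state of the chain
r_prop = rng.exponential(1 / lam)  # (a) candidate r' ~ Exp(lam)

# (b) acceptance probability; the independence proposal gives
#     g*(r_t | r') / g*(r' | r_t) = exp(-lam * (r_t - r_prop)).
log_A = min(0.0, log_P(r_prop) - log_P(r_t) - lam * (r_t - r_prop))

# (c) accept or reject.
r_next = r_prop if np.log(rng.uniform()) <= log_A else r_t
```

Because the exponential candidate does not depend on the current state, this is an independence Metropolis-Hastings sampler, and the proposal densities in the ratio reduce to the simple exponential correction shown above.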

• update ρ

$\rho \mid \theta, \varphi, a_0, \sigma^2, \delta, r, \gamma, Z, X, Y \sim \mathrm{IG}\left(\frac{K}{2} - 1,\; \frac{1}{2}\sum_{k=1}^K k^2\delta_k\theta_k^2\right).$

Appendix B: Supplemental Tables and Plots

Table B.1. Estimated values and standard errors (s.e.) of selected regression parameters in EM-based proportional-odds logistic regression (polr), in EM-based Bayesian proportional-odds logistic regression (bayespolr), and in Bayesian cumulative logistic PC regression (bayesPC(K=3)) for simulated data

polr bayespolr bayesPC(K=3)

true estimate s.e. estimate s.e. estimate s.e.
X1 1.00 −72.29 8055354.67 −0.19 440.37 2.31 0.13
X2 1.00 −989.25 23279508.27 17.21 509.22 1.64 0.09
X3 1.00 1160.6 3065818.64 3.83 749.09 2.21 0.11
X4 1.00 1125.13 21542791.4 4.23 578.85 4.18 0.19
X5 1.00 −2067.27 10574041.95 14.75 672.63 4.36 0.18
X6 1.00 1136.96 11201451.06 5.86 531.71 3.80 0.18
X7 1.00 −184.73 6851054.61 0.86 594.45 4.06 0.22
X8 1.00 413.57 14165806.42 8.13 524.29 5.27 0.30
X9 1.00 −540.99 27327338.17 5.02 638.53 5.12 0.29
X10 1.00 1501.02 9970514.25 13.33 710.27 4.36 0.26
X21 1.00 172.77 9156189.68 22.05 599.75 4.44 0.17
X22 1.00 345.29 10475712.67 0.09 584.88 3.36 0.14
X23 1.00 1195.68 7540582.87 10.23 681.82 4.12 0.16
X24 1.00 230.59 12162209.05 −7.42 633.15 4.55 0.18
X25 1.00 333.14 4259936.06 2.18 641.95 4.53 0.18
X26 1.00 −174.64 11016684.56 3.49 568.32 5.71 0.25
X27 1.00 −348.29 6888168.51 16.08 710.90 4.94 0.23
X28 1.00 823.1 6345472.98 0.70 477.04 4.38 0.23
X29 1.00 −406.82 12339415.15 6.75 613.20 4.48 0.24
X30 1.00 864.38 5988978.42 14.71 580.43 2.88 0.19
X41 0.00 −465.59 7455209.5 12.52 551.48 −0.13 0.06
X42 0.00 −140.18 9274035.47 3.39 669.30 1.60 0.09
X43 0.00 1201.24 10358985.13 −16.64 666.33 0.96 0.07
X44 0.00 311.66 6866423.31 −4.13 569.38 −0.04 0.09
X45 0.00 174.68 10334135.45 15.01 661.61 −0.17 0.07
X46 0.00 −540.74 10310530.19 14.61 796.02 −1.16 0.10
X47 0.00 455.75 3635108.05 3.49 586.07 −0.01 0.06
X48 0.00 325.7 8039198.69 −2.51 627.92 0.90 0.06
X49 0.00 −64.21 26844484.18 −10.95 706.34 −0.29 0.07
X50 0.00 NA NA −3.29 511.10 −0.27 0.07
X51 0.00 NA NA −0.18 479.21 −0.14 0.07
X52 0.00 NA NA −2.52 719.77 −0.70 0.07
X53 0.00 NA NA −5.73 755.87 −0.35 0.06
X54 0.00 NA NA 2.86 586.26 0.20 0.08
X55 0.00 NA NA 3.63 638.93 0.72 0.08
X56 0.00 NA NA −13.43 532.67 0.53 0.07
X57 0.00 NA NA −2.71 587.05 1.71 0.11
X58 0.00 NA NA −3.15 557.46 1.75 0.09
X59 0.00 NA NA 2.07 551.61 0.67 0.05
X60 0.00 NA NA −2.50 569.40 −0.13 0.06
X1.2 −1.90 −164.42 24273592.35 −7.97 3.46 0.00 0.00
X2.3 0.00 −12.02 5147311.3 0.02 2.81 8.25 0.42
X3.4 1.80 93.3 35373841.15 8.41 3.43 14.29 0.59

Table B.2. Estimated values and 95% credible intervals of regression parameters of the Bayesian cumulative logistic PC regression with 5 components (bayesPC(K=5)) and with 3 components (bayesPC(K=3)) for the wheat data

bayesPC(K=5) bayesPC(K=3)

estimates 2.5% 97.5% estimates 2.5% 97.5%
class 0.08 −0.31 0.48 −0.00 −0.01 0.01
density −0.02 −0.11 0.08 0.00 −0.00 0.01
hardness 0.01 0.00 0.02 0.01 0.00 0.02
size −0.02 −0.40 0.37 0.01 0.00 0.01
weight 0.15 0.11 0.19 0.15 0.12 0.19
moisture −0.04 −0.16 0.09 −0.02 −0.12 0.09
Scab|Sprout 0.00 0.00 0.00 0.00 0.00 0.00
Sprout|Healthy 1.84 1.52 2.17 1.84 1.52 2.20

Table B.3. Means and 95% confidence intervals of explanatory variables by wheat type

Healthy Sprout Scab

estimates 2.5% 97.5% estimates 2.5% 97.5% estimates 2.5% 97.5%
density 1.28 1.26 1.29 1.19 1.18 1.21 1.08 1.04 1.13
hardness 28.22 23.25 33.18 18.06 12.72 23.40 31.17 24.88 37.47
size 2.33 2.26 2.41 2.37 2.28 2.46 1.86 1.76 1.97
weight 30.42 29.16 31.69 30.65 29.36 31.94 20.48 19.07 21.90
moisture 11.28 10.84 11.72 11.18 10.78 11.59 11.10 10.71 11.49

Fig. B.1. Convergence plots of selected posterior parameters for the simulated data.

Fig. B.2. Parallel coordinate plot of the wheat data.

Fig. B.3. Convergence plots of the posterior parameters for the wheat data.

Fig. B.4. Bar plot of the number of components K of the Bayesian cumulative logistic PC regression over 100 repeats.

Figures
Fig. 1. BIC plot of Bayesian cumulative logistic PC regression for the simulated samples.
Fig. 2. Estimated values and 95% credible intervals of re-transformed regression parameters in Bayesian cumulative logistic PC regression for the simulated samples.
Fig. 3. Spearman’s rank correlation coefficient matrix of variables in wheat kernel data.
Fig. 4. Scatter plot matrix of the wheat data.
Fig. 5. Scree plot of principal components on the explanatory variable matrix and BIC plot of Bayesian cumulative logistic PC regression.
TABLES

### Table 1

Estimated values and 95% confidence intervals of regression parameters in EM-based proportional-odds logistic regression (polr) and in EM-based Bayesian proportional-odds logistic regression (bayespolr), and estimated values (posterior medians) and 95% credible intervals of regression parameters in Bayesian cumulative logistic PC regression (bayesPC(K=3))

polr bayespolr bayesPC(K=3)

estimates lb* ub* estimates lb* ub* estimates 2.5% 97.5%
class 0.17 −0.13 0.51 0.18 −0.12 0.52 −0.00 −0.01 0.01
density 13.51 7.75 39.98 13.11 7.72 38.79 0.00 −0.00 0.01
hardness 0.01 0.01 0.03 0.01 0.01 0.03 0.01 0.00 0.02
size 0.29 0.63 0.87 0.29 0.62 0.86 0.01 0.00 0.01
weight 0.13 0.13 0.38 0.13 0.13 0.38 0.15 0.12 0.19
moisture 0.04 0.05 0.12 0.04 0.06 0.12 −0.02 −0.12 0.09
Scab|Sprout 17.57 7.68 52.01 17.11 7.75 50.65 0.00 0.00 0.00
Sprout|Healthy 20.04 9.32 59.33 19.55 9.42 57.88 1.84 1.52 2.20

* lb and ub are the lower and upper bounds of the 95% confidence intervals.

### Table 2

Predictive MSE and MAPE of EM-based proportional-odds logistic regression (polr), EM-based Bayesian proportional-odds logistic regression (bayespolr), and the Bayesian cumulative logistic PC regression (bayesPC(K=3)) from 100 repeats

polr bayespolr bayesPC(K=3)

mean standard deviation mean standard deviation mean standard deviation
MSE 0.333 0.050 0.337 0.050 0.395 0.052
MAPE 0.213 0.036 0.214 0.035 0.243 0.039

References
1. Agresti A (2010). Analysis of Ordinal Categorical Data (2nd Edition), Wiley.
2. Albert JH and Chib S (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669-679.
3. Bair E, Hastie T, Paul D, and Tibshirani R (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119-137.
4. Bilder CR and Loughin TM (2015). Analysis of Categorical Data with R, Boca Raton, Chapman & Hall/CRC.
5. Frank LE and Friedman JH (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109-135.
6. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, and Rubin DB (2015). Bayesian Data Analysis (3rd ed), CRC Press.
7. Gelman A, Jakulin A, Pittau MG, and Su YS (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2, 1360-1383.
8. Gelman A and Su YS (2020). Arm: Data Analysis Using Regression and Multilevel/Hierarchical Models, R package version 1.11-2. https://CRAN.R-project.org/package=arm
9. Hirk R, Hornik K, and Vana L (2019). Multivariate ordinal regression models: an analysis of corporate credit ratings. Statistical Methods and Applications, 28, 507-539.
10. Holmes CC and Held L (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1, 145-168.
11. Kass RE and Raftery AE (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
12. Lang JB (1999). Bayesian ordinal and binary regression models with a parametric family of mixture links. Computational Statistics & Data Analysis, 32, 59-87.
13. Martin C, Herrman TJ, Loughin T, and Oentong S (1998). Micropycnometer measurement of single-kernel density of healthy, sprouted, and scab-damaged wheats. Cereal Chemistry, 75, 177-180.
14. Massy WF (1965). Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60, 234-256.
15. McCullagh P (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, Series B, 42, 109-142.
16. McCullagh P and Nelder JA (1989). Generalized Linear Models (2nd Edition), London, Chapman and Hall.
17. McKinley TJ, Morters M, and Wood JLN (2015). Bayesian model choice in cumulative link ordinal regression models. Bayesian Analysis, 10, 1-30.
18. O’Brien SM and Dunson DB (2004). Bayesian multivariate logistic regression. Biometrics, 60, 739-746.
19. R Core Team (2021) R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing . https://www.R-project.org/
20. Raftery AE (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111-163.
21. Reiss PJ and Ogden TR (2007). Functional principal component regression and functional partial least squares. Journal of the American Statistical Association, 102, 984-997.
22. Sha N, Vannucci M, Tadesse MG, et al. (2004). Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics, 60, 812-819.
23. Venables WN and Ripley BD (2002). Modern Applied Statistics with S (4th ed), New York, Springer.
24. Walker SH and Duncan DB (1967). Estimation of the probability of an event as a function of several independent variables. Biometrika, 54, 167-179.
25. West M (2003). Bayesian factor regression models in the “Large p, Small n” paradigm. Bayesian Statistics, 7, 723-732.
26. Yi N, Banerjee S, Pomp D, and Yandell BS (2007). Bayesian mapping of genomewide interacting quantitative trait loci for ordinal traits. Genetics, 176, 1855-1864.