Load packages
library(dplyr)
library(rstanarm)
library(arm)
options(mc.cores = parallel::detectCores())
library(loo)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(GGally)
library(projpred)
This notebook demonstrates collinearity in multipredictor regression. Example of predicting the yields of mesquite bushes comes from Gelman and Hill (2007). The outcome variable is the total weight (in grams) of photosynthetic material as derived from actual harvesting of the bush. The predictor variables are:
dat <- read.table("mesquite.dat", header=TRUE) %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.factor, as.numeric)
summary(dat)
Obs Group Diam1 Diam2
Min. : 1.00 Min. :1.000 Min. :0.800 Min. :0.400
1st Qu.:12.25 1st Qu.:1.000 1st Qu.:1.400 1st Qu.:1.000
Median :23.50 Median :2.000 Median :1.950 Median :1.525
Mean :23.50 Mean :1.565 Mean :2.099 Mean :1.572
3rd Qu.:34.75 3rd Qu.:2.000 3rd Qu.:2.475 3rd Qu.:1.900
Max. :46.00 Max. :2.000 Max. :5.200 Max. :4.000
TotHt CanHt Dens LeafWt
Min. :0.650 Min. :0.5000 Min. :1.000 Min. : 60.2
1st Qu.:1.200 1st Qu.:0.8625 1st Qu.:1.000 1st Qu.: 219.6
Median :1.500 Median :1.1000 Median :1.000 Median : 361.9
Mean :1.482 Mean :1.1107 Mean :1.674 Mean : 559.7
3rd Qu.:1.700 3rd Qu.:1.3000 3rd Qu.:2.000 3rd Qu.: 688.7
Max. :3.000 Max. :2.5000 Max. :9.000 Max. :4052.0
Plot data
ggpairs(dat, diag=list(continuous="barDiag"))
Additional transformed variables
dat$CanVol <- dat$Diam1 * dat$Diam2 * dat$CanHt
dat$CanAre <- dat$Diam1 * dat$Diam2
dat$CanSha <- dat$Diam1 / dat$Diam2
It may be reasonable to fit on the logarithmic scale, so that effects are multiplicative rather than additive (we’ll return to checking this assumption in another notebook).
We first illustrate the problem with maxiumum likelihood estimate
lm1 <- lm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) + log(TotHt) + log(Dens) + Group, data = dat)
display(lm1)
lm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) +
log(TotHt) + log(Dens) + Group, data = dat)
coef.est coef.se
(Intercept) 4.18 0.23
log(CanVol) 0.37 0.28
log(CanAre) 0.40 0.29
log(CanSha) -0.38 0.23
log(TotHt) 0.39 0.31
log(Dens) 0.11 0.12
Group 0.58 0.13
---
n = 46, k = 7
residual sd = 0.33, R-Squared = 0.89
GroupMCD seems to be only variable which has coeffiecent far away from zero. Let’s try making a model with just the group variable.
lm2 <- lm(formula = log(LeafWt) ~ Group, data = dat)
display(lm2)
lm(formula = log(LeafWt) ~ Group, data = dat)
coef.est coef.se
(Intercept) 6.49 0.44
Group -0.36 0.27
---
n = 46, k = 2
residual sd = 0.91, R-Squared = 0.04
Hmmm…. R-squared dropped a lot, so it seems that other variables are useful even if estimated effects and their standard errors indicate that they are not relevant. There are approach for maximum likelihood estimated models to investigate this, but we’ll switch now to Bayesian inference using rstanarm
.
The corresponding rstanarm
model fit using stan_glm
fitg <- stan_glm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) + log(TotHt) + log(Dens) + Group, data = dat, refresh=0)
Print summary for some diagnostics.
summary(fitg)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) + log(TotHt) +
log(Dens) + Group
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 46
predictors: 7
Estimates:
mean sd 10% 50% 90%
(Intercept) 4.2 0.2 3.9 4.2 4.5
log(CanVol) 0.4 0.3 0.0 0.4 0.7
log(CanAre) 0.4 0.3 0.0 0.4 0.8
log(CanSha) -0.4 0.2 -0.7 -0.4 -0.1
log(TotHt) 0.4 0.3 0.0 0.4 0.8
log(Dens) 0.1 0.1 0.0 0.1 0.3
Group 0.6 0.1 0.4 0.6 0.7
sigma 0.3 0.0 0.3 0.3 0.4
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 5.9 0.1 5.8 5.9 6.0
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 4342
log(CanVol) 0.0 1.0 1339
log(CanAre) 0.0 1.0 1414
log(CanSha) 0.0 1.0 2818
log(TotHt) 0.0 1.0 1715
log(Dens) 0.0 1.0 2931
Group 0.0 1.0 2357
sigma 0.0 1.0 2525
mean_PPD 0.0 1.0 3477
log-posterior 0.1 1.0 1416
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Rhats and n_effs are good (see, e.g., RStan workflow), but QR transformation usually makes sampling work even better (see, The QR Decomposition For Regression Models)
Print summary for some diagnostics.
summary(fitg)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) + log(TotHt) +
log(Dens) + Group
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 46
predictors: 7
Estimates:
mean sd 10% 50% 90%
(Intercept) 4.2 0.2 3.9 4.2 4.5
log(CanVol) 0.4 0.3 0.0 0.4 0.7
log(CanAre) 0.4 0.3 0.0 0.4 0.8
log(CanSha) -0.4 0.2 -0.7 -0.4 -0.1
log(TotHt) 0.4 0.3 0.0 0.4 0.8
log(Dens) 0.1 0.1 0.0 0.1 0.3
Group 0.6 0.1 0.4 0.6 0.7
sigma 0.3 0.0 0.3 0.3 0.4
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 5.9 0.1 5.8 5.9 6.0
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 4342
log(CanVol) 0.0 1.0 1339
log(CanAre) 0.0 1.0 1414
log(CanSha) 0.0 1.0 2818
log(TotHt) 0.0 1.0 1715
log(Dens) 0.0 1.0 2931
Group 0.0 1.0 2357
sigma 0.0 1.0 2525
mean_PPD 0.0 1.0 3477
log-posterior 0.1 1.0 1416
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Use of QR decomposition improved sampling efficiency (actually we get superefficient sampling, ie better than independent sampling) and we continue with this model.
Instead of looking at the tables, it’s easier to look at plots
mcmc_areas(as.matrix(fitg), prob = .5, prob_outer = .95)
All 95% posterior intervals except for GroupMCD are overlapping 0 and it seems we have serious collinearity problem.
Looking at the pairwise posteriors we can see high correlations especially between log(CanVol) and log(CanAre).
mcmc_pairs(as.matrix(fitg),pars = c("log(CanVol)","log(CanAre)","log(CanSha)","log(TotHt)","log(Dens)"))
If look more carefully on of the subplots, we see that although marginal posterior intervals overlap 0, some pairwise joint posteriors are not overlapping 0. Let’s look more carefully the joint posterior of log(CanVol) and log(CanAre).
mcmc_scatter(as.matrix(fitg), pars = c("log(CanVol)","log(CanAre)")) +
geom_vline(xintercept=0) +
geom_hline(yintercept=0)
From the joint posterior scatter plot, we can see that 0 is far away fron the typical set.
In case of even more variables with some being relevant and some irrelevant, it will be difficult to analyse joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model,
fitg0 <- update(fitg, formula = log(LeafWt) ~ 1, QR=FALSE)
We compute leave-one-out cross-validation elpd’s using PSIS-LOO (Vehtari, Gelman and Gabry, 2017)
(loog <- loo(fitg))
Computed from 4000 by 46 log-likelihood matrix
Estimate SE
elpd_loo -19.2 5.4
p_loo 7.5 1.5
looic 38.4 10.8
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 44 95.7% 934
(0.5, 0.7] (ok) 1 2.2% 390
(0.7, 1] (bad) 1 2.2% 267
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
(loog0 <- loo(fitg0))
Computed from 4000 by 46 log-likelihood matrix
Estimate SE
elpd_loo -62.6 5.0
p_loo 1.9 0.6
looic 125.3 10.0
------
Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
The model with variables has one bad Pareto \(k\) value (Vehtari, Gelman and Gabry, 2017). We can fix that by computing the corresponding leave-one-out-posterior exactly (Vehtari, Gelman and Gabry, 2017).
(loog <- loo(fitg, k_threshold=0.7))
Computed from 4000 by 46 log-likelihood matrix
Estimate SE
elpd_loo -19.3 5.4
p_loo 7.5 1.5
looic 38.5 10.8
------
Monte Carlo SE of elpd_loo is 0.1.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 44 97.8% 934
(0.5, 0.7] (ok) 1 2.2% 390
(0.7, 1] (bad) 0 0.0% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
And then we can compare the models.
loo_compare(loog0, loog)
elpd_diff se_diff
fitg 0.0 0.0
fitg0 -43.4 7.0
Based on cross-validation covariates together contain significant information to improve predictions.
We might want to choose some variables 1) because we don’t want to observe all the variables in the future (e.g. due to the measurement cost), or 2) we want to most relevant variables which we define here as a minimal set of variables which can provide similar predictions to the full model.
LOO can be used for model selection, but we don’t recommend it for variable selection as discussed by Piironen and Vehtari (2017). The reason for not using LOO in variable selection is that the selection process uses the data twice, and in case of large number variable combinations the selection process overfits and can produce really bad models. Using the usual posterior inference given the selected variables ignores that the selected variables are conditonal on the selection process and simply setting some variables to 0 ignores the uncertainty related to their relevance.
Piironen and Vehtari (2017) also show that a projection predictive approach can be used to make a model reduction, that is, choosing a smaller model with some coefficients set to 0. The projection predictive approach solves the problem how to do inference after the selection. The solution is to project the full model posterior to the restricted subspace. See more by McLatchie et al. (2023), Pavone et al. (2022), and Piironen, Paasiniemi and Vehtari (2020).
We make the projective predictive variable selection using the previous full model. A fast leave-one-out cross-validation approach (Vehtari, Gelman and Gabry, 2017) is used to choose the model size. As the number of observations is large compared to the number of covariates, we estimate the performance using LOO-CV only along the search path (validate_search=FALSE
), as we may assume that the overfitting in search is negligible (see more about this in McLatchie et al. (2023)).
fitg_cv <- cv_varsel(fitg, method='forward', cv_method='LOO', validate_search=FALSE)
We can now look at the estimated predictive performance of smaller models compared to the full model.
plot(fitg_cv, stats = c('elpd', 'rmse'))
As the estimated predictive performance is not going much above the reference model performance, we know that the use of option validate_search=FALSE
was safe (see more in McLatchie et al. (2023)).
And we get a loo-cv based recommendation for the model size to choose
(nsel <- suggest_size(fitg_cv, alpha=0.1))
[1] 1
(vsel <- solution_terms(fitg_cv)[1:nsel])
[1] "log(CanVol)"
We see that 1 variables is enough to get the same predictive accuracy as with all 4 variables.
Next we form the projected posterior for the chosen model.
projg <- project(fitg_cv, nv = nsel, ns = 4000)
projdraws <- as.matrix(projg)
round(colMeans(projdraws),1)
(Intercept) log(CanVol) Group sigma
4.2 0.8 0.5 0.4
round(posterior_interval(projdraws),1)
5% 95%
(Intercept) 3.9 4.5
log(CanVol) 0.8 0.9
Group 0.4 0.7
sigma 0.3 0.4
mcmc_areas(projdraws, pars=c("(Intercept)",vsel), prob_outer=0.99)
McLatchie, Y., Rögnvaldsson, S., Weber, F. and Vehtari, A. (2023) ‘Robust and efficient projection predictive inference’, arXiv preprint arXiv:2306.15581.
Pavone, F., Piironen, J., Bürkner, P.-C. and Vehtari, A. (2022) ‘Using reference models in variable selection’, Computational Statistics, 38.
Piironen, J., Paasiniemi, M. and Vehtari, A. (2020) ‘Projective inference in high-dimensional problems: Prediction and feature selection’, Electronic Journal of Statistics, 14(1), pp. 2155–2197.
Piironen, J. and Vehtari, A. (2017) ‘Comparison of Bayesian predictive methods for model selection’, Statistics and Computing, 27(3), pp. 711–735. doi: 10.1007/s11222-016-9649-y.
Vehtari, A., Gelman, A. and Gabry, J. (2017) ‘Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC’, Statistics and Computing, 27(5), pp. 1413–1432. doi: 10.1007/s11222-016-9696-4.
sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fi_FI.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] splines stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] bridgesampling_1.1-2 ggridges_0.5.4 arm_1.13-1
[4] lme4_1.1-34 Matrix_1.5-1 reliabilitydiag_0.2.1
[7] MASS_7.3-58.2 corrplot_0.92 caret_6.0-94
[10] lattice_0.20-45 GGally_2.1.2 fivethirtyeight_0.6.2
[13] projpred_2.7.0 bayesplot_1.10.0 lubridate_1.9.2
[16] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
[19] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[22] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
[25] loo_2.6.0 rstanarm_2.26.1 Rcpp_1.0.11
loaded via a namespace (and not attached):
[1] backports_1.4.1 plyr_1.8.8 igraph_1.5.1
[4] listenv_0.9.0 crosstalk_1.2.0 rstantools_2.3.1.1
[7] inline_0.3.19 digest_0.6.33 foreach_1.5.2
[10] htmltools_0.5.6 fansi_1.0.4 magrittr_2.0.3
[13] checkmate_2.2.0 tzdb_0.4.0 globals_0.16.2
[16] recipes_1.0.8 gower_1.0.1 RcppParallel_5.1.7
[19] matrixStats_1.0.0 xts_0.13.1 hardhat_1.3.0
[22] timechange_0.2.0 prettyunits_1.1.1 colorspace_2.1-0
[25] xfun_0.40 callr_3.7.3 crayon_1.5.2
[28] jsonlite_1.8.7 survival_3.4-0 zoo_1.8-12
[31] iterators_1.0.14 glue_1.6.2 gtable_0.3.4
[34] ipred_0.9-14 distributional_0.3.2 pkgbuild_1.4.2
[37] rstan_2.26.23 future.apply_1.11.0 abind_1.4-5
[40] scales_1.2.1 mvtnorm_1.2-3 miniUI_0.1.1.1
[43] xtable_1.8-4 progress_1.2.2 lava_1.7.2.1
[46] stats4_4.2.2 StanHeaders_2.26.28 prodlim_2023.08.28
[49] DT_0.29 htmlwidgets_1.6.2 threejs_0.3.3
[52] RColorBrewer_1.1-3 posterior_1.4.1 ellipsis_0.3.2
[55] pkgconfig_2.0.3 reshape_0.8.9 farver_2.1.1
[58] nnet_7.3-18 sass_0.4.7 utf8_1.2.3
[61] tidyselect_1.2.0 labeling_0.4.3 rlang_1.1.1
[64] reshape2_1.4.4 later_1.3.1 munsell_0.5.0
[67] tools_4.2.2 cachem_1.0.8 cli_3.6.1
[70] generics_0.1.3 evaluate_0.21 fastmap_1.1.1
[73] yaml_2.3.7 ModelMetrics_1.2.2.2 processx_3.8.2
[76] knitr_1.43 future_1.33.0 nlme_3.1-162
[79] mime_0.12 compiler_4.2.2 shinythemes_1.2.0
[82] gamm4_0.2-6 bslib_0.5.1 stringi_1.7.12
[85] highr_0.10 ps_1.7.5 Brobdingnag_1.2-9
[88] nloptr_2.0.3 markdown_1.8 shinyjs_2.1.0
[91] tensorA_0.36.2 vctrs_0.6.3 pillar_1.9.0
[94] lifecycle_1.0.3 jquerylib_0.1.4 data.table_1.14.8
[97] httpuv_1.6.11 QuickJSR_1.0.5 R6_2.5.1
[100] promises_1.2.1 gridExtra_2.3 parallelly_1.36.0
[103] codetools_0.2-19 boot_1.3-28 colourpicker_1.3.0
[106] gtools_3.9.4 withr_2.5.0 shinystan_2.6.0
[109] mgcv_1.9-0 parallel_4.2.2 hms_1.1.3
[112] grid_4.2.2 rpart_4.1.19 timeDate_4022.108
[115] coda_0.19-4 class_7.3-21 minqa_1.2.5
[118] rmarkdown_2.24 pROC_1.18.4 shiny_1.7.5
[121] base64enc_0.1-3 dygraphs_1.1.1.6