library("rprojroot")
root <- has_file(".casestudies-root")$make_fix_file()
library(dplyr)
library(rstanarm)
library(arm)
options(mc.cores = 4)
library(loo)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans", base_size = 14))
library(ggdist)
library(GGally)
library(projpred)
library(posterior)
1 Introduction
This notebook demonstrates collinearity in multipredictor regression. The example of predicting the yields of mesquite bushes comes from Gelman and Hill (2007). The outcome variable is the total weight (in grams) of photosynthetic material, as derived from actual harvesting of the bush. The predictor variables are:
- diam1: diameter of the canopy (the leafy area of the bush) in meters, measured along the longer axis of the bush
- diam2: canopy diameter measured along the shorter axis
- canopy height: height of the canopy
- total height: total height of the bush
- density: plant unit density (# of primary stems per plant unit)
- group: group of measurements (0 for the first group, 1 for the second group)
2 Data
Available in mesquite.dat
mesquite <- read.table(root("mesquite", "mesquite.dat"), header=TRUE) |>
mutate_if(is.character, as.factor) |>
mutate_if(is.factor, as.numeric) |>
tibble() |>
mutate(Group = as.factor(Group))
summary(mesquite)
      Obs        Group      Diam1           Diam2           TotHt
 Min.   : 1.00   1:20   Min.   :0.800   Min.   :0.400   Min.   :0.650
 1st Qu.:12.25   2:26   1st Qu.:1.400   1st Qu.:1.000   1st Qu.:1.200
 Median :23.50          Median :1.950   Median :1.525   Median :1.500
 Mean   :23.50          Mean   :2.099   Mean   :1.572   Mean   :1.482
 3rd Qu.:34.75          3rd Qu.:2.475   3rd Qu.:1.900   3rd Qu.:1.700
 Max.   :46.00          Max.   :5.200   Max.   :4.000   Max.   :3.000
     CanHt             Dens           LeafWt
 Min.   :0.5000   Min.   :1.000   Min.   :  60.2
 1st Qu.:0.8625   1st Qu.:1.000   1st Qu.: 219.6
 Median :1.1000   Median :1.000   Median : 361.9
 Mean   :1.1107   Mean   :1.674   Mean   : 559.7
 3rd Qu.:1.3000   3rd Qu.:2.000   3rd Qu.: 688.7
 Max.   :2.5000   Max.   :9.000   Max.   :4052.0
Plot data
ggpairs(mesquite,
diag = list(continuous = "densityDiag", discrete = "barDiag"),
lower = list(combo = "points"),
upper = list(combo = "blank", continuous = "blank"),
axisLabels = "none")Additional transformed variables
mesquite <- mesquite |>
mutate(CanVol = Diam1 * Diam2 * CanHt,
CanAre = Diam1 * Diam2,
CanSha = Diam1 / Diam2)
It may be reasonable to fit on the logarithmic scale, so that effects are multiplicative rather than additive (we’ll return to checking this assumption in another notebook).
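As a quick check of that assumption (this plot is not part of the original notebook), we can compare the outcome on the original and log scales; the summary above already shows that LeafWt is strongly right-skewed (mean 560 vs. median 362), and the log transform removes most of that skew.
# LeafWt is heavily right-skewed on the original scale
ggplot(mesquite, aes(x = LeafWt)) +
  geom_histogram(bins = 20)
# log(LeafWt) is much closer to symmetric
ggplot(mesquite, aes(x = log(LeafWt))) +
  geom_histogram(bins = 20)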
3 Maximum likelihood estimate
We first illustrate the problem with the maximum likelihood estimate.
lm1 <- lm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) +
log(CanSha) + log(TotHt) + log(Dens) + Group,
data = mesquite)
display(lm1)
lm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) +
log(TotHt) + log(Dens) + Group, data = mesquite)
coef.est coef.se
(Intercept) 4.77 0.16
log(CanVol) 0.37 0.28
log(CanAre) 0.40 0.29
log(CanSha) -0.38 0.23
log(TotHt) 0.39 0.31
log(Dens) 0.11 0.12
Group2 0.58 0.13
---
n = 46, k = 7
residual sd = 0.33, R-Squared = 0.89
Group2 seems to be the only variable whose coefficient is far from zero. Let’s try making a model with just the group variable.
lm2 <- lm(formula = log(LeafWt) ~ Group, data = mesquite)
display(lm2)
lm(formula = log(LeafWt) ~ Group, data = mesquite)
coef.est coef.se
(Intercept) 6.13 0.20
Group2 -0.36 0.27
---
n = 46, k = 2
residual sd = 0.91, R-Squared = 0.04
R^2 dropped a lot, so it seems that the other variables are useful even if their estimated effects and standard errors indicate that they are not relevant. There are approaches for investigating this in maximum likelihood estimated models (one such diagnostic is sketched below), but we’ll switch now to Bayesian inference using rstanarm.
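One classical diagnostic for collinearity in maximum likelihood fits is the variance inflation factor. A minimal sketch, assuming the car package (an extra dependency not loaded above) is available:
# variance inflation factors for the ML fit; values much larger than
# about 5-10 indicate strong collinearity among the predictors
library(car)
vif(lm1)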
4 Bayesian inference
The corresponding Bayesian model is fit using rstanarm's stan_glm.
fitg <- stan_glm(formula = log(LeafWt) ~ log(CanVol) + log(CanAre) +
log(CanSha) + log(TotHt) + log(Dens) + Group,
data = mesquite,
refresh = 0)
Print summary for some diagnostics.
summary(fitg)
Model Info:
function: stan_glm
family: gaussian [identity]
formula: log(LeafWt) ~ log(CanVol) + log(CanAre) + log(CanSha) + log(TotHt) +
log(Dens) + Group
algorithm: sampling
sample: 4000 (posterior sample size)
priors: see help('prior_summary')
observations: 46
predictors: 7
Estimates:
mean sd 10% 50% 90%
(Intercept) 4.8 0.2 4.6 4.8 5.0
log(CanVol) 0.4 0.3 0.0 0.4 0.7
log(CanAre) 0.4 0.3 0.0 0.4 0.8
log(CanSha) -0.4 0.2 -0.7 -0.4 -0.1
log(TotHt) 0.4 0.3 0.0 0.4 0.8
log(Dens) 0.1 0.1 -0.1 0.1 0.3
Group2 0.6 0.1 0.4 0.6 0.8
sigma 0.3 0.0 0.3 0.3 0.4
Fit Diagnostics:
mean sd 10% 50% 90%
mean_PPD 5.9 0.1 5.8 5.9 6.0
The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
MCMC diagnostics
mcse Rhat n_eff
(Intercept) 0.0 1.0 3493
log(CanVol) 0.0 1.0 1560
log(CanAre) 0.0 1.0 1663
log(CanSha) 0.0 1.0 2772
log(TotHt) 0.0 1.0 1947
log(Dens) 0.0 1.0 3006
Group2 0.0 1.0 2577
sigma 0.0 1.0 2624
mean_PPD 0.0 1.0 3664
log-posterior 0.1 1.0 1442
For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Rhats and n_effs are good (see, e.g., RStan workflow), but the QR transformation usually makes sampling work even better (see The QR Decomposition For Regression Models).
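The refit with the QR parameterization is not shown in this extract; a minimal sketch of that step (consistent with the QR = FALSE used for the null model below) is:
# refit the same model using the QR reparameterization of the design matrix
fitg <- update(fitg, QR = TRUE)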
Use of the QR decomposition improved sampling efficiency (actually we get superefficient sampling, i.e., better than independent sampling) and we continue with this model.
Instead of looking at the tables, it’s easier to look at plots
mcmc_areas(as.matrix(fitg), prob = .5, prob_outer = .95)
All 95% posterior intervals except the one for Group2 overlap 0, and it seems we have a serious collinearity problem.
Looking at the pairwise posteriors, we can see high correlations, especially between log(CanVol) and log(CanAre).
mcmc_pairs(as.matrix(fitg),
pars = c("log(CanVol)", "log(CanAre)", "log(CanSha)",
"log(TotHt)", "log(Dens)"))If look more carefully on of the subplots, we see that although marginal posterior intervals overlap 0, some pairwise joint posteriors are not overlapping 0. Let’s look more carefully the joint posterior of log(CanVol) and log(CanAre).
mcmc_scatter(as.matrix(fitg), pars = c("log(CanVol)", "log(CanAre)")) +
geom_vline(xintercept = 0) +
geom_hline(yintercept = 0)
From the joint posterior scatter plot, we can see that 0 is far away from the typical set.
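As a quick numeric check (not part of the original notebook), we can look at the posterior correlation between the two coefficients and at their sum, which is much better identified than either coefficient alone:
draws <- as.matrix(fitg)
# strong negative posterior correlation between the collinear coefficients
cor(draws[, "log(CanVol)"], draws[, "log(CanAre)"])
# their sum is well identified even though both marginals overlap zero
quantile(draws[, "log(CanVol)"] + draws[, "log(CanAre)"], probs = c(0.05, 0.5, 0.95))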
In the case of even more variables, with some being relevant and some irrelevant, it would be difficult to analyse the joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model.
fitg0 <- update(fitg, formula = log(LeafWt) ~ 1, QR = FALSE)
We compute leave-one-out cross-validation elpd’s using PSIS-LOO (Vehtari, Gelman, and Gabry 2017).
(loog <- loo(fitg))
Computed from 4000 by 46 log-likelihood matrix.
Estimate SE
elpd_loo -19.4 5.3
p_loo 7.6 1.5
looic 38.8 10.7
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume independent draws (r_eff=1).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
(loog0 <- loo(fitg0))
Computed from 4000 by 46 log-likelihood matrix.
Estimate SE
elpd_loo -62.6 5.0
p_loo 1.9 0.5
looic 125.3 10.0
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume independent draws (r_eff=1).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
And then we can compare the models.
loo_compare(loog0, loog)
      elpd_diff se_diff
fitg 0.0 0.0
fitg0 -43.2 7.0
Based on cross-validation, the covariates together contain significant information that improves predictions.
We might want to choose a subset of variables 1) because we don’t want to observe all the variables in the future (e.g. due to measurement cost), or 2) because we want to find the most relevant variables, which we define here as a minimal set of variables that provides predictions similar to the full model.
LOO can be used for model selection, but we don’t recommend it for variable selection, as discussed by Piironen and Vehtari (2017) and McLatchie et al. (2025). The reason for not using LOO in variable selection is that the selection process uses the data twice, and with a large number of variable combinations the selection process overfits and can produce really bad models. Using the usual posterior inference given the selected variables ignores that the selected variables are conditional on the selection process, and simply setting some variables to 0 ignores the uncertainty related to their relevance.
Piironen and Vehtari (2017) and McLatchie et al. (2025) also show that a projection predictive approach can be used for model reduction, that is, choosing a smaller model with some coefficients set to 0. The projection predictive approach solves the problem of how to do inference after the selection: the full model posterior is projected onto the restricted subspace. See McLatchie et al. (2025), Pavone et al. (2022), and Piironen, Paasiniemi, and Vehtari (2020) for more.
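Schematically (this is a standard description of the draw-by-draw projection, not a formula shown in the original notebook), for each draw $\theta^{(s)}$ from the full-model posterior the projection finds submodel parameters $\theta_\perp^{(s)}$ that minimize the KL divergence from the reference model's predictive distribution to the submodel's:

$$
\theta_\perp^{(s)} = \arg\min_{\theta_\perp} \frac{1}{n} \sum_{i=1}^{n} \mathrm{KL}\!\left( p(\tilde{y}_i \mid \theta^{(s)}) \,\middle\|\, q(\tilde{y}_i \mid \theta_\perp) \right)
$$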
We make the projection predictive variable selection using the previous full model as the reference model. A fast leave-one-out cross-validation approach (Vehtari, Gelman, and Gabry 2017) is used to choose the model size.
fitg_cvvs <- cv_varsel(fitg,
method = "forward",
cv_method = "LOO",
validate_search = TRUE)
We can now look at the estimated predictive performance of smaller models compared to the full model.
plot(fitg_cvvs, stats = c("elpd", "rmse"))
We get a LOO-CV based recommendation for the model size to choose.
(nsel <- suggest_size(fitg_cvvs, alpha = 0.1))
[1] 3
We see that three variables are enough to get the same predictive accuracy as with all the variables.
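To see which terms were selected and in which order, the selection object can also be summarized; a minimal sketch (output not shown here, and the exact printout depends on the projpred version):
# print the forward search path and LOO-based performance at each model size
summary(fitg_cvvs)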
Next we form the projected posterior for the chosen model.
projg <- project(fitg_cvvs, nterms = nsel, ndraws = 4000)
projdraws <- as_draws_df(projg)
summarise_draws(projdraws)
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.88 4.89 0.143 0.136 4.65 5.12 1.01 366. 403.
2 log(CanVol) 0.805 0.804 0.0544 0.0508 0.719 0.897 0.998 406. 333.
3 Group2 0.585 0.586 0.118 0.120 0.401 0.785 1.00 229. 365.
4 log(CanSha) -0.379 -0.385 0.219 0.225 -0.727 -0.0519 1.00 365. 306.
5 sigma 0.359 0.354 0.0416 0.0379 0.298 0.434 0.999 338. 369.
mcmc_areas(projdraws, pars = variables(projdraws)[(1:nsel)+1], prob_outer = 0.99)
We can also look at the predictive intervals of our submodel predictions.
# convert posterior predictive draws (draws x observations) to an rvar vector
preds <- proj_predict(projg) |>
  rvar()
mesquite |>
as_tibble() |>
mutate(preds = preds) |>
ggplot(aes(x = log(LeafWt), ydist = preds)) +
stat_pointinterval(alpha = 0.3, color = "steelblue") +
geom_abline() +
labs(x = "log(leaf weight)", y = "Prediction")References
Licenses
- Code © 2018-2026, Aki Vehtari, licensed under BSD-3.
- Text © 2018-2026, Aki Vehtari, licensed under CC-BY-NC 4.0.