Bayesian variable selection for candy ranking data

Author

Aki Vehtari

Published

2018-02-27

Modified

2026-03-28

1 Introduction

This notebook was inspired by Joshua Loftus’ two blog posts Model selection bias invalidates significance tests and A conditional approach to inference after model selection.

In this notebook we illustrate Bayesian inference for model selection, including PSIS-LOO (Vehtari, Gelman, and Gabry 2017) and projection predictive approach (McLatchie et al. 2023; Piironen, Paasiniemi, and Vehtari 2020; Piironen and Vehtari 2017) which makes decision theoretically justified inference after model selection.

Setup

Load packages

library("rprojroot")
root <- has_file(".casestudies-root")$make_fix_file()
library(brms)
options(brms.backend = "cmdstanr")
options(mc.cores = 4)
library(loo)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans", base_size = 14))
library(ggdist)
library(posterior)
library(projpred)
library(fivethirtyeight)
library(dplyr)
SEED <- 150702646

2 Data

We use candy rankings data from fivethirtyeight package. Dataset was originally used in a fivethirtyeight story.

data(candy_rankings)
candy_rankings <- candy_rankings |>
  select(!competitorname) |>
  mutate_if(is.logical, as.numeric)
prednames <- candy_rankings |>
  select(!winpercent) |>
  colnames()
glimpse(candy_rankings)

Rows: 85
Columns: 12
$ chocolate        <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ fruity           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,…
$ caramel          <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ peanutyalmondy   <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ nougat           <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ crispedricewafer <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ hard             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,…
$ bar              <dbl> 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ pluribus         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,…
$ sugarpercent     <dbl> 0.732, 0.604, 0.011, 0.011, 0.906, 0.465, 0.604, 0.31…
$ pricepercent     <dbl> 0.860, 0.511, 0.116, 0.511, 0.511, 0.767, 0.767, 0.51…
$ winpercent       <dbl> 66.97173, 67.60294, 32.26109, 46.11650, 52.34146, 50.…

3 Random data

We start first analysing a random “null” data set, where winpercent has been replaced with random draws from a normal distribution (with same mean and standar deviation) so that covariates do not have any predictive information.

set.seed(SEED)
N <- nrow(candy_rankings)
candy_random <- candy_rankings |>
  mutate(random = rnorm(N, mean(winpercent), sd(winpercent))) |>
  select(!winpercent)

Doing variable selection we are anyway assuming that some of the variables are not relevant, and thus it is sensible to use priors which assume some of the covariate effects are close to zero. We use R2D2 prior (Zhang et al. 2022) which provides adaptive shrinkage of regression coefficients.

fit_random <- brm(random ~ .,
                  data = candy_random,
                  prior = prior(R2D2(mean_R2 = 1/3, prec_R2 = 3)),
                  seed = SEED,
                  silent = 2,
                  refresh = 0)

Let’s look at the summary:

summary(fit_random)

 Family: gaussian 
  Links: mu = identity 
Formula: random ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent 
   Data: candy_random (Number of observations: 85) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           49.85      3.18    43.62    56.24 1.00     5125     3100
chocolate            0.58      2.30    -3.38     6.47 1.00     3825     3238
fruity               0.85      2.17    -2.90     6.51 1.00     4538     3429
caramel             -1.37      2.65    -8.31     2.57 1.00     4329     3062
peanutyalmondy       0.61      2.32    -3.52     6.52 1.00     5168     3545
nougat              -0.74      2.53    -7.29     3.50 1.00     4349     3500
crispedricewafer    -0.78      2.66    -7.47     3.82 1.00     4998     3638
hard                 1.22      2.52    -2.61     8.00 1.00     4270     3648
bar                 -2.20      3.16   -10.13     1.85 1.00     3432     3707
pluribus             0.37      1.89    -3.36     5.01 1.00     6209     3842
sugarpercent         0.37      2.35    -4.11     5.98 1.00     5513     3637
pricepercent        -1.75      3.42   -11.10     2.82 1.00     3743     3504

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    15.99      1.26    13.75    18.73 1.00     6850     2856

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

We didn’t get divergences, Rhat’s are less than 1.01 and bulk/tail ESS are useful.

draws_random <- as_draws_df(fit_random) |>
  subset_draws(variable = paste0("b_", prednames)) |>
  set_variables(variable = prednames)
mcmc_areas(draws_random, prob_outer = .95)

All 95% posterior intervals are overlapping 0, the R2D2 prior makes the posteriors concentrate near 0, but there is some uncertainty.

We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model,

fit0 <- brm(random ~ 1,
            data = candy_random,
            seed = SEED,
            silent = 2,
            refresh = 0)

(loo_random <- loo(fit_random))
(loo0 <- loo(fit0))
loo_compare(loo0, loo_random)

Based on cross-validation covariates together do not contain any useful information, and there is no need to continue with variable selection. This step of checking whether the full model has any predictive power is often ignored especially when non-Bayesian methods are used. If loo (or AIC as Joshua Loftus demonstrated) would be used for stepwise variable selection it is possible that selection process over a large number of models overfits to the data.

To illustrate the robustness of projpred, we make the projective predictive variable selection using the previous model for “null” data. A fast leave-one-out cross-validation approach (Vehtari, Gelman, and Gabry 2017) is used to choose the model size. As the number of observations is large compared to the number of covariates, we estimate the performance using LOO-CV only along the search path (validate_search=FALSE), as we may assume that the overfitting in search is negligible (see more about this in McLatchie et al. (2023)).

fit_random_cv <- cv_varsel(fit_random,
                           method = "forward",
                           cv_method = "loo",
                           validate_search = FALSE)

We can now look at the estimated predictive performance of smaller models compared to the full model.

plot(fit_random_cv, stats = c("elpd"))

As the estimated predictive performance is not going much above the reference model performance, we know that the use of option validate_search=FALSE was safe (see more in McLatchie et al. (2023)).

And we get a LOO based recommendation for the model size to choose

(nv <- suggest_size(fit_random_cv, alpha = 0.1))

[1] 0

We see that projpred agrees that no variables have useful information.

Next we form the projected posterior for the chosen model.

proj_random <- project(fit_random_cv, nterms = nv, ns = 4000)
projdraws_random <- as_draws_df(proj_random)
round(colMeans(as.matrix(projdraws_random)), 1)

b_Intercept       sigma      .chain  .iteration       .draw 
       49.6        16.3         1.0       200.5       200.5

This looks good as the true values for “null” data are intercept=50.3167638, sigma=14.7143574.

4 Original data

Next we repeat the above analysis with original target variable winpercent.

fit_candy <- brm(winpercent ~ .,
                 data = candy_rankings,
                 prior = prior(R2D2(mean_R2 = 1/3, prec_R2 = 3)),
                 seed = SEED,
                 silent = 2,
                 refresh = 0)

Let’s look at the summary.

summary(fit_candy)

 Family: gaussian 
  Links: mu = identity 
Formula: winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent 
   Data: candy_rankings (Number of observations: 85) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           39.89      3.73    32.09    46.72 1.00     3342     3452
chocolate           15.34      3.76     8.05    22.95 1.00     3896     3776
fruity               2.52      3.23    -2.29    10.06 1.00     2562     3333
caramel              0.82      2.29    -3.30     6.18 1.00     4516     3559
peanutyalmondy       5.23      3.70    -0.73    12.83 1.00     2831     2420
nougat               0.38      2.72    -5.07     6.53 1.00     5627     3751
crispedricewafer     3.23      3.79    -2.17    12.02 1.00     3319     3569
hard                -2.29      2.83    -8.85     1.94 1.00     3249     3184
bar                  1.08      2.57    -3.64     7.08 1.00     4854     3313
pluribus            -0.43      1.87    -4.62     3.27 1.00     4804     3952
sugarpercent         3.77      3.77    -1.58    12.14 1.00     2819     3199
pricepercent        -0.27      2.91    -7.02     5.66 1.00     5036     4025

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    11.24      0.95     9.55    13.23 1.00     3559     3020

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

We didn’t get divergences, Rhat’s are less than 1.01 and bulk/tail ESS are useful.

Posterior predictive checking looks good enough

pp_check(fit_candy) +
  scale_x_continuous(breaks = seq(10, 90, by = 10)) +
  theme(axis.line.y = element_blank())

We can examine posterior marginals.

draws_candy <- as_draws_df(fit_candy) |>
  subset_draws(variable = paste0("b_", prednames)) |>
  set_variables(variable = prednames)
mcmc_areas(draws_candy, prob_outer = .95)

95% posterior interval for chocolateTRUE is not overlapping 0, so maybe there is something useful here.

In case of collinear variables it is possible that marginal posteriors overlap 0, but the covariates can still be useful for prediction. With many variables it will be difficult to analyse joint posterior to see which variables are jointly relevant. We can easily test whether any of the covariates are useful by using cross-validation to compare to a null model,

fit0 <- brm(winpercent ~ 1,
            data = candy_rankings,
            seed = SEED,
            silent = 2,
            refresh = 0)

(loo_candy <- loo(fit_candy))
(loo0 <- loo(fit0))
loo_compare(loo0, loo_candy)

Based on cross-validation covariates together do contain useful information. If we need just the predictions we can stop here, but if we want to learn more about the relevance of the covariates we can continue with variable selection.

We make the projective predictive variable selection using the previous model. A fast leave-one-out cross-validation approach is used to choose the model size.

fit_candy_cvv <- cv_varsel(fit_candy,
                          method = "forward",
                          cv_method = "loo",
                          validate_search = TRUE)

We can now look at the estimated predictive performance of smaller models compared to the full model.

plot(fit_candy_cvv, stats = c("elpd","R2"), deltas = "mixed")

Only one variable seems to be needed to get the same performance as the full model.

And we get a LOO based recommendation for the model size to choose

(nsel <- suggest_size(fit_candy_cvv, alpha = 0.1))

[1] 1

(vsel <- ranking(fit_candy_cvv, nterms_max = nsel)$fulldata)

[1] "chocolate"

projpred recommends to use just one variable.

Next we form the projected posterior for the chosen model.

proj_candy <- project(fit_candy_cvv, nterms = nsel)

We plot the marginals of projected posteriors

projdraws <- as_draws_df(proj_candy) |>
  subset_draws(variable = paste0("b_", vsel)) |>
  set_variables(variable = vsel)
mcmc_areas(projdraws,
           prob_outer = 0.99,
           area_method = "scaled height")

In our loo and projpred analysis, we find the chocolateTRUE to have predictive information. Other variables may have predictive power, too, but conditionally on chocolateTRUE other variables do not provide enough additional information to improve predictive performance.

preds <- proj_predict(proj_candy) |>
  as_draws_rvars(preds)
candy_rankings |>
  as_tibble() |>
  mutate(preds = preds) |>
  ggplot(aes(x = winpercent, ydist = preds)) +
  stat_pointinterval(alpha = 0.3, color = "steelblue") +
  geom_abline(linetype = "dashed", alpha = 0.5) +
  labs(x = "Win percent", y = "Prediction")

References

McLatchie, Yann, Sölvi Rögnvaldsson, Frank Weber, and Aki Vehtari. 2023. “Robust and Efficient Projection Predictive Inference.” arXiv Preprint arXiv:2306.15581.

Piironen, Juho, Markus Paasiniemi, and Aki Vehtari. 2020. “Projective Inference in High-Dimensional Problems: Prediction and Feature Selection.” Electronic Journal of Statistics 14 (1): 2155–97.

Piironen, Juho, and Aki Vehtari. 2017. “Comparison of Bayesian Predictive Methods for Model Selection.” Statistics and Computing 27 (3): 711–35. https://doi.org/10.1007/s11222-016-9649-y.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Zhang, Yan Dora, Brian P. Naughton, Howard D. Bondell, and Brian J. Reich. 2022. “Bayesian Regression Using a Prior on the Model Fit: The R2-D2 Shrinkage Prior.” Journal of the American Statistical Association 117: 862–74.