Student grades and variable selection with projpred

Author

Aki Vehtari

Published

2023-12-14

Modified

2025-08-15

1 Portugal students data

We work with an example of predicting mathematics and Portuguese exam grades for a sample of high school students in Portugal. The same data were used in Chapter 12 of the book Regression and Other Stories to illustrate different models for regression coefficients.

We predict the students’ final-year median exam grade in mathematics (n=407) and Portuguese (n=657) given a large number of potentially relevant predictors: student’s school, student’s sex, student’s age, student’s home address type, family size, parents’ cohabitation status, mother’s education, father’s education, home-to-school travel time, weekly study time, number of past class failures, extra educational support, extra paid classes within the course subject, extra-curricular activities, whether the student attended nursery school, whether the student wants to take higher education, internet access at home, whether the student has a romantic relationship, quality of family relationships, free time after school, going out with friends, weekday alcohol consumption, weekend alcohol consumption, current health status, and number of school absences.

2 Variable selection

If we cared only about predictive performance, we would not need variable selection; we would use all the variables with a sensible joint prior. Here we are interested in finding the smallest set of variables that provides predictive performance similar to using all the variables (with a sensible prior). This helps to improve explainability and to design further studies that could also include interventions. We are not considering causal structure, and the selected variables are unlikely to have direct causal effects, but variables selected for their high predictive relevance are ones whose role in the causal graph should eventually be considered.

We first build models with all predictors, and then use projection predictive variable selection (Piironen, Paasiniemi, and Vehtari 2020; McLatchie et al. 2025) implemented in the R package projpred. We also demonstrate the use of subsampling LOO with a difference estimator (Magnusson et al. 2020) to speed up model size selection.

Load packages

Code
library(brms)
options(brms.backend="cmdstanr")
library(cmdstanr)
options(mc.cores = parallel::detectCores()-2)
library(posterior)
options(digits=2, posterior.digits=2,
        pillar.neg = FALSE, pillar.subtle=FALSE, pillar.sigfig=2)
library(loo)
library(projpred)
if (interactive()) {
  library(progressr)
  options(projpred.use_progressr = TRUE)
  handlers(global = TRUE)
}
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans", base_size=16))
set1 <- RColorBrewer::brewer.pal(7, "Set1")
library(khroma)
library(latex2exp)
library(tinytable)
options(tinytable_format_num_fmt = "significant_cell", tinytable_format_digits = 2, tinytable_tt_digits=2)
library(dplyr)
library(matrixStats)
library(patchwork)
library(ggdist)
library(doFuture)
library(tictoc)

Set random seed for reproducibility

SEED <- 2132

3 Data preparation and visualization

Get the data from Regression and Other Stories R package.

student <- read.csv(url('https://raw.githubusercontent.com/avehtari/ROS-Examples/master/Student/data/student-merged-all.csv'))

List the predictors to be used.

predictors <- c("school","sex","age","address","famsize","Pstatus","Medu","Fedu","traveltime","studytime","failures","schoolsup","famsup","paid","activities", "nursery", "higher", "internet", "romantic","famrel","freetime","goout","Dalc","Walc","health","absences")
p <- length(predictors)

The data include three grades for both mathematics and Portuguese. To reduce the variability in the outcome we use the median of those three exam grades for each subject. We select only students with non-zero grades.

grades <- c("G1mat","G2mat","G3mat","G1por","G2por","G3por")
student <- student %>%
  mutate(across(matches("G[1-3]..."), ~na_if(.,0))) %>%
  mutate(Gmat = rowMedians(as.matrix(select(.,matches("G.mat"))), na.rm=TRUE),
         Gpor = rowMedians(as.matrix(select(.,matches("G.por"))), na.rm=TRUE))
student_Gmat <- subset(student, is.finite(Gmat), select=c("Gmat",predictors))
student_Gmat <- student_Gmat[is.finite(rowMeans(student_Gmat)),]
student_Gpor <- subset(student, is.finite(Gpor), select=c("Gpor",predictors))
(nmat <- nrow(student_Gmat))
[1] 407
head(student_Gmat) |> tt()
Gmat school sex age address famsize Pstatus Medu Fedu traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences
10 0 0 15 0 0 1 1 1 2 4 1 1 1 1 1 1 1 1 0 3 1 2 1 1 1 2
6 0 0 15 0 0 1 1 1 1 2 2 1 1 0 0 0 1 1 1 3 3 4 2 4 5 2
13 0 0 15 0 0 1 2 2 1 1 0 1 1 1 1 1 1 0 0 4 3 1 1 1 2 8
9 0 0 15 0 0 1 2 4 1 3 0 1 1 1 1 1 1 1 0 4 3 2 1 1 5 2
10 0 0 15 0 0 1 3 3 2 3 2 0 1 1 1 1 1 1 1 4 2 1 2 3 3 8
12 0 0 15 0 0 1 3 4 1 3 0 1 1 1 1 1 1 1 0 4 3 2 1 1 5 2
(npor <- nrow(student_Gpor))
[1] 657
head(student_Gpor) |> tt()
Gpor school sex age address famsize Pstatus Medu Fedu traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences
13 0 0 15 0 0 1 1 1 2 4 1 1 1 1 1 1 1 1 0 3 1 2 1 1 1 2
9 0 0 15 0 0 1 1 1 NA NA NA NA NA NA NA 1 NA 1 NA NA NA NA NA NA NA NA
11 0 0 15 0 0 1 1 1 1 2 2 1 1 0 0 0 1 1 1 3 3 4 2 4 5 2
13 0 0 15 0 0 1 2 2 1 1 0 1 1 1 1 1 1 0 0 4 3 1 1 1 2 8
10 0 0 15 0 0 1 2 4 1 3 0 1 1 1 1 1 1 1 0 4 3 2 1 1 5 2
13 0 0 15 0 0 1 3 3 2 3 2 0 1 1 1 1 1 1 1 4 2 1 2 3 3 8

The following plot shows the distributions of median math and Portuguese exam scores for each student.

p1 <- ggplot(student_Gmat, aes(x=Gmat)) + geom_dots() + labs(x='Median math exam score')
p2 <- ggplot(student_Gpor, aes(x=Gpor)) + geom_dots() + labs(x='Median Portuguese exam score')
(p1 + p2) * scale_x_continuous(lim=c(0,20)) *
  theme(axis.line.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
Figure 1

We standardize all predictors to have standard deviation 1, to make the comparison of relevances easier and also to make the definition of priors easier.

studentstd_Gmat <- student_Gmat
studentstd_Gmat[,predictors] <- scale(student_Gmat[,predictors])
studentstd_Gpor <- student_Gpor
studentstd_Gpor[,predictors] <- scale(student_Gpor[,predictors])

4 Default uniform prior on coefficients

Before variable selection, we want to build a good model with all covariates. We first illustrate that default priors may be bad when we have many predictors.

By default brms uses uniform (“flat”) priors on regression coefficients.

fitm_u <- brm(Gmat ~ ., data = studentstd_Gmat)

If we compare the posterior-R^2 (bayes_R2()) and LOO-R^2 (loo_R2()) (Gelman et al. 2019), we see that the posterior-R^2 is much higher, which indicates that the residual variance is strongly underestimated and the model has overfitted the data.

We first re-define bayes_R2() to match the definition in the paper.

bayes_R2.brmsfit <- function(fit, summary=TRUE, probs = c(0.025, 0.975)) {
  # posterior draws of the expected values of the predictions
  mupred <- brms::posterior_epred(fit)
  # variance of the predicted means, computed separately for each draw
  var_mupred <- apply(mupred, 1, var)
  # posterior draws of the residual variance
  sigma2 <- as.matrix(fit, variable = c("sigma"))^2
  # Bayesian R^2 (Gelman et al. 2019): explained variance over total variance
  R2 <- var_mupred / (var_mupred + sigma2)
  colnames(R2) <- "R2"
  if (summary) {
    R2 <- posterior_summary(R2, probs = probs)
  }
  R2
}
bayes_R2(fitm_u) |> as.data.frame() |> tt()
Estimate Est.Error Q2.5 Q97.5
0.31 0.033 0.24 0.37
loo_R2(fitm_u) |> as.data.frame() |> tt()
Estimate Est.Error Q2.5 Q97.5
0.18 0.039 0.1 0.26

We plot the marginal posteriors for coefficients. For many coefficients the posterior is quite wide.

drawsmu <- as_draws_df(fitm_u, variable=paste0('b_',predictors)) |>
  set_variables(predictors)
p <- mcmc_areas(drawsmu,
                 prob_outer=0.98, area_method = "scaled height") +
  xlim(c(-1.5,1.5))
p <- p + scale_y_discrete(limits = rev(levels(p$data$parameter)))
p
Figure 2

5 Piranha theorem

A common proper prior choice for the coefficients is an independent wide normal prior. The piranha theorem states that it is not possible for all predictors to be independent and have large coefficients at the same time (Tosh et al. 2024). Thus, when we have many predictors, we should include the prior information that not all coefficients can be large. One option is simply to divide the normal prior variance by the number of predictors, which keeps the total variance constant (assuming the predictors are normalized), as illustrated below. Another option is to use more elaborate joint priors on the coefficients.
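To make the variance-budget idea concrete, here is a small sketch (not part of the original analysis) using values from this case study: a prior guess of 0.3 for the proportion of explained variance and sd(y) of about 3.3. Dividing the total prior variance by the number of predictors keeps the prior variance of the linear predictor constant as predictors are added.

# sketch: keep the total prior variance of the linear predictor fixed
# while the number of standardized predictors grows
sigma2_total <- 0.3 * 3.3^2  # prior guess: 30% of the outcome variance
for (np in c(1, 5, 26, 100)) {
  tau <- sqrt(sigma2_total / np)  # per-coefficient prior sd
  cat(sprintf("p = %3d: prior sd per coefficient = %.2f, total prior variance = %.2f\n",
              np, tau, np * tau^2))
}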

6 Implied priors on R^2 and R2D2 prior

In regression analysis we might assume that the data are noisy and that an almost perfect fit is unlikely. We can measure the proportion of variance explained by the model with R^2. To understand the implied prior on R^2 given different priors on the coefficients, we can simply sample from the prior and compute the Bayesian R^2 using the prior draws. The Bayesian R^2 depends only on the model parameters, and thus can be used without computing residuals that depend on the data.
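As a minimal illustration (a sketch with assumed values, separate from the brms fits below), we can draw the coefficients and sigma from their priors and compute the implied R^2 directly. Here we assume 26 standardized and, for simplicity, independent predictors, independent normal(0, 2.5) coefficients, and sigma from normal+(0, 3).

# sketch: Monte Carlo approximation of the implied prior on R^2
S <- 4000
nbeta <- 26
beta <- matrix(rnorm(S * nbeta, mean = 0, sd = 2.5), nrow = S)
sigma <- abs(rnorm(S, mean = 0, sd = 3))
# with independent unit-variance predictors, the variance of the linear
# predictor is the sum of squared coefficients
var_mu <- rowSums(beta^2)
R2_prior <- var_mu / (var_mu + sigma^2)
hist(R2_prior, breaks = 50)  # mass piles up near 1 with this wide prior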

If we have some prior information about R^2, we can use the R2D2 prior (Zhang et al. 2022) to first define a prior directly on R^2; this prior is then propagated to the coefficients so that, regardless of the number of predictors, the prior on R^2 stays the same. As R^2 also depends on the residual scale, the R2D2 prior is a joint prior on the coefficients and the residual sigma.

Although we can fit models with a uniform prior on the coefficients and get a proper posterior, we cannot sample from an improper unbounded uniform prior. Thus, in the following we consider only models with proper priors. For all the following models we use a normal+(0, 3) prior (where normal+ indicates a normal distribution constrained to be positive) for the residual scale sigma, which is a very weak prior as the standard deviation of the whole data is 3.3. We use four different priors for the coefficients:

  1. Independent normal(0, 2.5) prior, which is a proper prior used as the default by rstanarm and considered weakly informative for a single coefficient.
  2. Independent scaled normal prior. If we assume that many predictors may each have a small relevance, we can scale the independent priors so that the sum of the prior variances stays reasonable. In this case we have 26 predictors and could have a prior guess that the proportion of explained variance is near 0.3. A simple approach is then to assign independent priors to the regression coefficients with mean 0 and standard deviation \sqrt{0.3/26}\operatorname{sd}(y).
  3. Regularized horseshoe prior (Piironen and Vehtari 2017), which is a joint prior for the coefficients that also depends on the residual scale, and which can be used to encode a sparsity assumption via a prior guess of the number of relevant coefficients. Here we guess that maybe 6 coefficients are relevant and set the global scale accordingly. The result is not sensitive to the exact value of this prior guess, as it sets only the prior mean for the number of relevant coefficients. The regularized horseshoe has good prior predictive behavior when more variables are added.
  4. R2D2 prior, which has the benefit that it first defines the prior directly on R^2, and then the prior is propagated to the coefficients. The R2D2 prior is predictively consistent, so that the prior on R^2 stays constant as the number of predictors increases. We assign the R2D2 prior with mean 1/3 and precision 3, which corresponds to a \operatorname{Beta}(1,2) distribution on R^2, implying that higher R^2 values are less likely (see the quick check after this list). We set the concentration parameter to 1/2, which implies we assume some of the coefficients can be big and some small. The R2D2 prior implementation in brms assumes the predictors have been standardized to have unit variance.
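To make the mean-precision parametrization in the last item concrete, here is a quick check using the standard conversion from mean and precision to Beta shape parameters:

# mean_R2 and prec_R2 map to Beta shape parameters a and b
mean_R2 <- 1/3
prec_R2 <- 3
(a <- mean_R2 * prec_R2)        # 1
(b <- (1 - mean_R2) * prec_R2)  # 2
curve(dbeta(x, a, b), from = 0, to = 1,
      xlab = "R^2", ylab = "Beta(1, 2) prior density")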
# we sample from both posterior and prior
# normal(0, 2.5)
fitm_n1 <- brm(Gmat ~ ., data = studentstd_Gmat,
               prior=c(prior(normal(0, 2.5), class = b)),
               warmup = 1000, iter = 5000,
               refresh = 0)
fitm_n1p <- update(fitm_n1, sample_prior = "only")
# we also sample from a truncated prior to get more draws in the
# interesting region of R^2<0.5 needed for zoomed plot
fitm_n1pt <- update(fitm_n1, sample_prior = "only",
                    prior=c(prior(normal(0, 2.5), class = b),
                            prior(student_t(3, 0, 3), lb=5, class = sigma)),
                    refresh = 0)

# normal(0, sqrt(0.3/26)*sd(y))
scale_b <- sqrt(0.3/26)*sd(studentstd_Gmat$Gmat)
fitm_n2 <- brm(Gmat ~ ., data = studentstd_Gmat,
               prior=c(prior(normal(0, scale_b), class = b)),
               warmup = 1000, iter = 5000,
               stanvars = stanvar(scale_b, name='scale_b'),
            refresh = 0)
fitm_n2p <- update(fitm_n2, sample_prior = "only")

# horseshoe
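# p0 is the prior guess for the number of relevant predictors;
# scale_global follows the global scale recommendation of Piironen and
# Vehtari (2017), and scale_slab corresponds to assuming the relevant
# coefficients jointly explain about 30% of the outcome variance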
p <- length(predictors)
p0 <- 6
scale_slab <- sd(studentstd_Gmat$Gmat)/sqrt(p0)*sqrt(0.3)
scale_global <- p0 / (p - p0) / sqrt(nrow(studentstd_Gmat))
fitm_hs <- brm(Gmat ~ ., data = studentstd_Gmat,
               prior=c(prior(horseshoe(scale_global = scale_global,
                                       scale_slab = scale_slab),
                             class = b)),
               warmup = 1000, iter = 5000,
               refresh = 0)
fitm_hsp <- update(fitm_hs, sample_prior = "only")

# R2D2
fitm <- brm(Gmat ~ ., data = studentstd_Gmat,
            prior=c(prior(R2D2(mean_R2 = 1/3, prec_R2 = 3, cons_D2 = 1/2), class=b)),
            warmup = 1000, iter = 5000,
            refresh = 0)
fitmp <- update(fitm, sample_prior = "only")
# plot prior on R^2 in (0,1)
types=rep(factor(c("Prior","Posterior"),levels=c("Prior","Posterior")), times=4)
priornames=rep(factor(c("Wide normal","Scaled normal","RHS","R2D2"),
                      levels=c(c("Wide normal","Scaled normal","RHS","R2D2"))), each=2)
clr<-colour("bright",names=FALSE)(7)
fits=list(fitm_n1p, fitm_n1, fitm_n2p, fitm_n2, fitm_hsp, fitm_hs, fitmp, fitm)
pp1 <- lapply(1:8, \(i) data.frame(R2=as.numeric(bayes_R2(fits[[i]], summary=FALSE)), type=types[i], priorname=priornames[i])) |>
  bind_rows() |>
  filter(type=="Prior") |>
  ggplot(aes(x=R2, color=priorname)) +
  stat_slab(expand=TRUE, trim=FALSE, alpha=.6, fill=NA, adjust=2) +
  xlim(c(0,1)) +
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.line.y=element_blank(),
        axis.title.y=element_blank(),
        plot.title = element_text(size=16),
        legend.position="none") +
  scale_color_bright() +
  labs(x=TeX("$R^2$"), title=TeX("a) Implied prior on $R^2$")) +
  annotate(geom="text", x=0.94, y=0.33, label="Wide normal", hjust=1, color=clr[1], size=5) + 
  annotate(geom="text", x=0.85, y=0.12, label="Scaled normal", hjust=1, color=clr[2], size=5) +
  annotate(geom="text", x=0.06, y=0.33, label="Horseshoe", hjust=0, color=clr[3], size=5) +
  annotate(geom="text", x=0.15, y=0.12, label="R2D2", hjust=0, color=clr[4], size=5) 

# plot prior on R^2 in narrower range
fits=list(fitm_n1pt, fitm_n1, fitm_n2p, fitm_n2, fitm_hsp, fitm_hs, fitmp, fitm)
pp2 <- lapply(1:8, \(i) data.frame(R2=as.numeric(bayes_R2(fits[[i]], summary=FALSE)), type=types[i], priorname=priornames[i])) |>
  bind_rows() |>
  filter(type=="Prior") |>
  ggplot(aes(x=R2, color=priorname)) +
  stat_slab(expand=TRUE, trim=FALSE, alpha=0.5, fill=NA, adjust=2) +
  xlim(c(0.04,.419)) +
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.line.y=element_blank(),
        axis.title.y=element_blank(),
        plot.title = element_text(size=16),
        legend.position="none") +
  scale_color_bright() +
  labs(x=TeX("$R^2$"), title=TeX("b) Implied prior on $R^2$")) +
  annotate(geom="text", x=0.415, y=0.61, label="Wide normal", hjust=1, color=clr[1], size=5) + 
  annotate(geom="text", x=0.415, y=0.34, label="Scaled normal", hjust=1, color=clr[2], size=5) +
  annotate(geom="text", x=0.06, y=0.87, label="Horseshoe", hjust=0, color=clr[3], size=5) +
  annotate(geom="text", x=0.06, y=0.28, label="R2D2", hjust=0, color=clr[4], size=5) 

# plot R^2 posterior in narrower range
pp3 <- lapply(1:8, \(i) data.frame(R2=as.numeric(bayes_R2(fits[[i]], summary=FALSE)), type=types[i], priorname=priornames[i])) |>
  bind_rows() |>
  filter(type=="Posterior") |>
  ggplot(aes(x=R2, color=priorname)) +
  stat_slab(expand=TRUE, trim=FALSE, alpha=0.5, fill=NA, adjust=2) +
  xlim(c(0.04,.42)) +
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.line.y=element_blank(),
        axis.title.y=element_blank(),
        plot.title = element_text(size=16),
        legend.position="none") +
  scale_color_bright() +
  labs(x=TeX("$R^2$"), title=TeX("c) Posterior-$R^2$")) +
  annotate(geom="text", x=0.28, y=0.95, label="Wide normal", hjust=0, color=clr[1], size=5) + 
  annotate(geom="text", x=0.26, y=0.95, label="Scaled normal", hjust=1, color=clr[2], size=5) +
  annotate(geom="text", x=0.19, y=0.73, label="Horseshoe", hjust=1, color=clr[3], size=5) +
  annotate(geom="text", x=0.21, y=0.84, label="R2D2", hjust=1, color=clr[4], size=5) 

# plot LOO-R^2 in narrower range
looR2 <- lapply(1:8, \(i) data.frame(R2=as.numeric(loo_R2(fits[[i]], summary=FALSE)), type=types[i], priorname=priornames[i])) |> bind_rows()
pp4 <- looR2 |>
    filter(type=="Posterior") |>
  ggplot(aes(x=R2, color=priorname)) +
  stat_slab(expand=TRUE, trim=FALSE, alpha=0.5, fill=NA, adjust=2) +
  xlim(c(0.04,.42)) +
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.line.y=element_blank(),
        axis.title.y=element_blank(),
        plot.title = element_text(size=16),
        legend.position="none") +
  scale_color_bright() +
  labs(x=TeX("$R^2$"), title=TeX("d) LOO-$R^2$")) +
  annotate(geom="text", x=0.175, y=0.72, label="Wide normal", hjust=1, color=clr[1], size=5) + 
  annotate(geom="text", x=0.235, y=0.72, label="Scaled normal", hjust=0, color=clr[2], size=5) +
  annotate(geom="text", x=0.175, y=0.9, label="Horseshoe", hjust=1, color=clr[3], size=5) +
  annotate(geom="text", x=0.23, y=0.9, label="R2D2", hjust=0, color=clr[4], size=5) 
pp1
pp2
pp3
pp4
Figure 3
Figure 4
Figure 5
Figure 6

Subplot a) shows the implied priors on R^2. We see that the independent wide normal prior on the coefficients leads to a prior strongly favoring values near 1. The independent scaled normal and R2D2 priors imply a relatively flat prior on R^2. The regularized horseshoe prior implies a prior favoring values near 0. Looking at the implied priors on the whole range can be misleading; what matters more is the behavior of the priors where the likelihood is not very close to 0. Subplot b) shows the implied priors on R^2 in a shorter range, and subplots c) and d) show the posterior-R^2 and LOO-R^2, respectively, in the same range.

The wide normal prior favors larger R^2 values, which pushes the posterior towards higher R^2 values and makes the model overfit; this causes LOO-R^2 to be lower than with priors that do not favor higher R^2 values. The scaled normal, regularized horseshoe, and R2D2 priors all slightly favor smaller R^2 values, which is a sensible prior assumption, and the posteriors have not been pushed towards higher R^2 values. The LOO-R^2 results are quite similar with the scaled normal, regularized horseshoe, and R2D2 priors, although with the regularized horseshoe and R2D2 priors there is slightly less uncertainty. The regularized horseshoe favors smaller R^2 values more than the scaled normal and R2D2 priors do, but as the likelihood has a thinner tail towards smaller R^2 values (is more informative in that direction), there is not much difference in the posteriors.

Although in this case the scaled normal, regularized horseshoe, and R2D2 priors all produce quite similar R^2 results, in general we favor the R2D2 prior, as it makes defining the prior on R^2 easiest. The regularized horseshoe can be easier to define when our prior information concerns the sparsity of the coefficients. We now continue with the R2D2 prior.

In addition to the LOO-R^2 shown above, we can compare LOO estimated expected log predictive scores. With the R2D2 prior the predictive performance is better than with the wide normal prior with probability 0.94 (a quick check of this number follows the comparison below). The differences to the scaled normal and horseshoe priors are smaller.

loo_compare(list(`Wide normal`=loo(fitm_n1),
                 `Scaled normal`=loo(fitm_n2),
                 `Horseshoe`=loo(fitm_hs),
                 `R2D2`=loo(fitm)))
              elpd_diff se_diff
R2D2           0.0       0.0   
Scaled normal -0.4       1.6   
Horseshoe     -0.5       0.5   
Wide normal   -4.1       2.6   
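The probability 0.94 quoted above comes from a normal approximation to the elpd difference, using the elpd_diff and se_diff values in the table:

# normal approximation to Pr(R2D2 has higher elpd than wide normal)
pnorm(4.1 / 2.6)
[1] 0.94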

7 Marginal posteriors of coefficients

We plot the marginal posteriors for coefficients. For many coefficients the posterior has been shrunk close to 0. Some marginal posteriors are wide.

# refit with fewer posterior draws
fitm <- update(fitm, iter=2000)
drawsm <- as_draws_df(fitm, variable=paste0('b_',predictors)) |>
  set_variables(predictors)
p <- mcmc_areas(drawsm,
                 prob_outer=0.98, area_method = "scaled height") +
  xlim(c(-1.5,1.5))
p <- p + scale_y_discrete(limits = rev(levels(p$data$parameter)))
p
Figure 7

We check the bivariate marginal for the Fedu and Medu coefficients, and see that while the univariate marginals overlap with 0, jointly there is not much posterior mass near 0. This is due to Fedu and Medu being collinear. Collinearity of predictors makes it difficult to infer predictor relevance from the marginal posteriors.

mcmc_scatter(drawsm, pars = c("Fedu","Medu"), size=1, alpha=0.1) +
  vline_0(linetype='dashed') +
  hline_0(linetype='dashed')
Figure 8

8 Model checking

We’re using a normal observation model, although we know that the exam scores lie in a bounded range. Posterior predictive checking shows that we sometimes predict exam scores higher than 20, but the discrepancy is minor.

pp_check(fitm, type="hist", ndraws=5)
Figure 9

The LOO-PIT-ECDF plot shows that otherwise the normal model is quite well calibrated.

pp_check(fitm, type="loo_pit_ecdf")
Figure 10

We could use a truncated normal as a more accurate model, but, for example, a beta-binomial model cannot be used for the median exam scores as some of the medians are not integers. A fancier model could model the three exams hierarchically, but as the normal model is not that bad, we now continue with it.
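For reference, the truncated normal alternative would be easy to express with the brms truncation syntax (a sketch only; this model is not fitted here):

# sketch: truncated normal observation model for the bounded exam scores,
# with the same R2D2 prior as the main model (not fitted in this notebook)
fitm_trunc <- brm(Gmat | trunc(lb = 0, ub = 20) ~ .,
                  data = studentstd_Gmat,
                  prior = prior(R2D2(mean_R2 = 1/3, prec_R2 = 3, cons_D2 = 1/2),
                                class = b),
                  refresh = 0)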

9 Projection predictive variable selection

We use projection predictive variable selection, implemented in the projpred R package, to find the minimal set of predictors that can provide predictive performance similar to all predictors jointly. By default projpred starts from the intercept-only model and uses forward search to find the order in which to add predictors so as to minimize the divergence from the full model's predictive distribution.

9.1 Math exam scores

We start by running fast PSIS-LOO-CV only for the full-data search path.

vselm_fast <- cv_varsel(fitm, nterms_max = 27, validate_search = FALSE)

The following plot shows the relevance order of the predictors and the estimated predictive performance given those variables. As the search can overfit and we didn't cross-validate the search, the performance estimates can go above the reference model performance. However, this plot helps us to see that 10 or fewer predictors would be sufficient.

plot(vselm_fast, stats=c("elpd", "R2"), deltas="mixed",
     text_angle = 45, alpha = 0.1, 
     size_position = "primary_x_top", show_cv_proportions=FALSE) +
  geom_vline(xintercept = seq(0, 25, by = 5), colour = "black", alpha = 0.1)
Figure 11

Next we repeat the search, but now cross-validate the search, too. We rerun the search with the PSIS-LOO-CV criterion only for nloo=50 folds, and combine the result with the fast PSIS-LOO result using a difference estimator (Magnusson et al. 2020). The use of subsampling LOO affects only the model size selection, and given that the selection is stable, the projection for the selected model is as good as with the computationally more expensive full search validation. Based on the previous quick result, we search only up to models of size 10. With my laptop and 8 parallel workers, this takes less than 5 minutes.
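The difference estimator itself is simple (a sketch of the idea from Magnusson et al. (2020), not of projpred's internals): a cheap approximation available for all n observations is corrected using exact values computed for a random subsample.

# sketch: difference estimator combining a cheap elpd approximation for all
# n observations with exact values for a random subsample idx of size m
diff_estimator <- function(elpd_approx, elpd_exact_sub, idx) {
  n <- length(elpd_approx)
  m <- length(idx)
  sum(elpd_approx) + n / m * sum(elpd_exact_sub - elpd_approx[idx])
}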

registerDoFuture()
plan(multisession, workers = 8)
tic()
vselm <- cv_varsel(fitm, nterms_max = 10, validate_search = TRUE,
                   refit_prj = TRUE, nloo = 50,
                   parallel = TRUE, verbose = TRUE)
toc()
plan(sequential)

The following plot shows the relevance order of the predictors and the estimated predictive performance given those variables. The order is the same as in the previous plot, but now the predictive performance estimates take the search into account and have smaller bias. It seems that using just four predictors can provide predictive performance similar to using all the predictors.

plot(vselm, stats=c("elpd","R2"), deltas='mixed',
     text_angle=45, alpha=0.1, 
     size_position = 'primary_x_top', show_cv_proportions=FALSE) +
  geom_vline(xintercept=seq(0,10,by=5), colour='black', alpha=0.1)
Figure 12

projpred can also provide a suggestion for a sufficient model size.

(nselm <- suggest_size(vselm))
[1] 4

Form the projected posterior for the selected model.

rankm <- ranking(vselm, nterms=nselm)
projm <- project(vselm, nterms=nselm)
drawsm_proj <- as_draws_df(projm) |>
  subset_draws(variable=paste0('b_',rankm$fulldata[1:nselm])) |>
  set_variables(variable=rankm$fulldata[1:nselm])

The marginals of the projected posterior are all clearly away from 0.

mcmc_areas(drawsm_proj, prob_outer=0.98, area_method = "scaled height")
Figure 13

The following plot shows the stability of the search over the different LOO-CV folds. The numbers indicate the proportion of folds in which the given predictor was included at the latest at the given model size.

plot(cv_proportions(rankm, cumulate=TRUE))
Figure 14

9.2 Portuguese exam scores

We repeat the same analysis, but now predict the grade for Portuguese instead of mathematics.

Fit a model with the R2D2 prior with mean 1/3 and precision 3.

fitp <- brm(Gpor ~ ., data = studentstd_Gpor,
            prior=c(prior(R2D2(mean_R2 = 1/3, prec_R2 = 3, cons_D2 = 1/2), class=b)),
            refresh = 0)

Compare posterior-R^2 and LOO-R^2. We see that the Portuguese grade is easier to predict given the predictors (but there is still a lot of unexplained variance).

fitp <- add_criterion(fitp, criterion='loo')
bayes_R2(fitp) |> round(2)
   Estimate Est.Error Q2.5 Q97.5
R2     0.33      0.04 0.25   0.4
loo_R2(fitp) |> round(2)
   Estimate Est.Error Q2.5 Q97.5
R2     0.28      0.03 0.21  0.35

Plot the marginal posteriors of the coefficients.

drawsp <- as_draws_df(fitp, variable=paste0('b_',predictors)) |>
  set_variables(predictors)
p <- mcmc_areas(drawsp, prob_outer=0.98, area_method = "scaled height") +
  xlim(c(-1.5,1.5))
p <- p + scale_y_discrete(limits = rev(levels(p$data$parameter)))
p
Figure 15

We use projection predictive variable selection with fast LOO-CV only for the full-data search path.

vselp_fast <- cv_varsel(fitp, nterms_max=27, validate_search=FALSE)

The following plot shows the relevance order of the predictors and the estimated predictive performance given those variables. As there is some overfitting in the search and we didn't cross-validate the search, the performance estimates can go above the reference model performance. However, this plot helps us to see that 10 or fewer predictors would be sufficient.

plot(vselp_fast, stats = c("elpd","R2"), deltas="mixed", 
     text_angle=45, alpha=0.1, 
     size_position = 'primary_x_top', show_cv_proportions=FALSE) +
  geom_vline(xintercept=seq(0,25,by=5), colour='black', alpha=0.1)
Figure 16

Next we repeat the search, but now cross-validate the search, too. We rerun the search with the PSIS-LOO-CV criterion only for nloo=50 folds, and combine the result with the fast PSIS-LOO result using a difference estimator (Magnusson et al. 2020), as sketched earlier for the math grades. The use of subsampling LOO affects only the model size selection, and given that the selection is stable, the projection for the selected model is as good as with the computationally more expensive full search validation. Based on the previous quick result, we search only up to models of size 10. With my laptop and 8 parallel workers, this takes less than 5 minutes.

registerDoFuture()
plan(multisession, workers = 8)
tic()
vselp <- cv_varsel(fitp, nterms_max=10, validate_search=TRUE,
                   refit_prj=TRUE, nloo=50,
                   parallel=TRUE)
toc()
plan(sequential)

The following plot shows the relevance order of the predictors and the estimated predictive performance given those variables. The order is the same as in the previous plot, but now the predictive performance estimates take the search into account and have smaller bias. It seems that using just seven predictors can provide predictive performance similar to using all the predictors.

plot(vselp, stats=c("elpd", "R2"), deltas="mixed",
     text_angle=45, alpha=0.1,
     size_position = 'primary_x_top', show_cv_proportions=FALSE) +
   geom_vline(xintercept=seq(0,10,by=5), colour='black', alpha=0.1)
Figure 17

projpred can also provide a suggestion for a sufficient model size.

(nselp <- suggest_size(vselp))
[1] 8

Form the projected posterior for the selected model.

rankp <- ranking(vselp, nterms=nselp)
projp <- project(vselp, nterms=nselp)
drawsp_proj <- as_draws_df(projp) |>
  subset_draws(variable=paste0('b_',rankp$fulldata[1:nselp])) |>
  set_variables(variable=rankp$fulldata[1:nselp])

The marginals of the projected posterior are all clearly away from 0.

mcmc_areas(drawsp_proj, prob_outer=0.98, area_method = "scaled height")
Figure 18

The following plot shows the stability of the search over the different LOO-CV folds. The numbers indicate the proportion of folds in which the given predictor was included at the latest at the given model size.

plot(cv_proportions(rankp, cumulate=TRUE))
Figure 19

9.3 Using the selected model

For further predictions we can use the projected draws. Due to how different packages work, it can sometimes be easier to rerun MCMC conditional on the selected variables. This gives a slightly different result, but when the reference model is good the difference tends to be small, and the main benefit from using projpred remains: the selection process itself has not caused overfitting or the selection of spurious covariates.
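For example (a sketch for the math model; proj_linpred() is the projpred function for predictions from the projected submodel, and fitm_sel is a hypothetical name for the refitted model):

# predictions from the projected submodel for the math grades
mu_proj <- proj_linpred(projm, newdata = studentstd_Gmat)
# alternatively, rerun MCMC conditional on the selected predictors
selected <- rankm$fulldata[1:nselm]
fitm_sel <- brm(reformulate(selected, response = "Gmat"),
                data = studentstd_Gmat,
                prior = prior(R2D2(mean_R2 = 1/3, prec_R2 = 3, cons_D2 = 1/2),
                              class = b),
                refresh = 0)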

References

Gelman, Andrew, Ben Goodrich, Jonah Gabry, and Aki Vehtari. 2019. “R-Squared for Bayesian Regression Models.” The American Statistician 73 (3): 307–9.
Magnusson, Måns, Michael Riis Andersen, Johan Jonasson, and Aki Vehtari. 2020. “Leave-One-Out Cross-Validation for Bayesian Model Comparison in Large Data.” In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 108:341–51. PMLR.
McLatchie, Yann, Sölvi Rögnvaldsson, Frank Weber, and Aki Vehtari. 2025. “Advances in Projection Predictive Inference.” Statistical Science 40: 128–47.
Piironen, Juho, Markus Paasiniemi, and Aki Vehtari. 2020. “Projective Inference in High-Dimensional Problems: Prediction and Feature Selection.” Electronic Journal of Statistics 14 (1): 2155–97.
Piironen, Juho, and Aki Vehtari. 2017. “Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors.” Electronic Journal of Statistics 11: 5018–51.
Tosh, Christopher, Philip Greengard, Ben Goodrich, Andrew Gelman, Aki Vehtari, and Daniel Hsu. 2024. “The Piranha Problem: Large Effects Swimming in a Small Pond.” Notices of the American Mathematical Society 72: 15–25.
Zhang, Yan Dora, Brian P. Naughton, Howard D. Bondell, and Brian J. Reich. 2022. “Bayesian Regression Using a Prior on the Model Fit: The R2-D2 Shrinkage Prior.” Journal of the American Statistical Association 117: 862–74.

Licenses

  • Code © 2023-2024, Aki Vehtari, licensed under BSD-3.
  • Text © 2023-2024, Aki Vehtari, licensed under CC-BY-NC 4.0.

Original Computing Environment

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Helsinki
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tictoc_1.2.1        doFuture_1.1.2      future_1.67.0      
 [4] foreach_1.5.2       ggdist_3.3.3        patchwork_1.3.1    
 [7] matrixStats_1.5.0   dplyr_1.1.4         tinytable_0.11.0   
[10] latex2exp_0.9.6     khroma_1.16.0       bayesplot_1.13.0   
[13] ggplot2_3.5.2       projpred_2.9.0      loo_2.8.0          
[16] posterior_1.6.1     cmdstanr_0.9.0.9000 brms_2.22.12       
[19] Rcpp_1.1.0         

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        farver_2.1.2            fastmap_1.2.0          
 [4] TH.data_1.1-3           tensorA_0.36.2.1        digest_0.6.37          
 [7] estimability_1.5.1      lifecycle_1.0.4         StanHeaders_2.36.0.9000
[10] survival_3.8-3          processx_3.8.6          magrittr_2.0.3         
[13] compiler_4.5.1          rlang_1.1.6             tools_4.5.1            
[16] yaml_2.3.10             knitr_1.50              labeling_0.4.3         
[19] bridgesampling_1.1-2    htmlwidgets_1.6.4       curl_6.4.0             
[22] pkgbuild_1.4.8          plyr_1.8.9              RColorBrewer_1.1-3     
[25] multcomp_1.4-28         abind_1.4-8             withr_3.0.2            
[28] grid_4.5.1              stats4_4.5.1            fansi_1.0.6            
[31] xtable_1.8-4            inline_0.3.21           ggridges_0.5.6         
[34] emmeans_1.11.2          globals_0.18.0          scales_1.4.0           
[37] iterators_1.0.14        MASS_7.3-65             cli_3.6.5              
[40] mvtnorm_1.3-3           rmarkdown_2.29          generics_0.1.4         
[43] RcppParallel_5.1.10     future.apply_1.20.0     reshape2_1.4.4         
[46] rstan_2.36.0.9000       stringr_1.5.1           splines_4.5.1          
[49] parallel_4.5.1          vctrs_0.6.5             V8_6.0.5               
[52] Matrix_1.7-3            sandwich_3.1-1          jsonlite_2.0.0         
[55] litedown_0.7            listenv_0.9.1           glue_1.8.0             
[58] parallelly_1.45.1       codetools_0.2-20        ps_1.9.1               
[61] distributional_0.5.0    stringi_1.8.7           gtable_0.3.6           
[64] QuickJSR_1.8.0          quadprog_1.5-8          tibble_3.3.0           
[67] pillar_1.11.0           htmltools_0.5.8.1       Brobdingnag_1.2-9      
[70] R6_2.6.1                evaluate_1.0.4          lattice_0.22-7         
[73] backports_1.5.0         rstantools_2.4.0        gridExtra_2.3          
[76] coda_0.19-4.1           nlme_3.1-168            checkmate_2.3.2        
[79] xfun_0.52               zoo_1.8-14              pkgconfig_2.0.3