Uncertainty in Bayesian LOO-CV Model Comparison

Author

Aki Vehtari and Florence Bockting

Published

2025-06-23

Modified

2026-04-21

1 Introduction

This case study is a new version of the case study in Section 5 of the paper Uncertainty in Bayesian leave-one-out cross-validation based model comparison (Sivula et al., 2025a). The original case study is available for reproducing the results in the paper. This version uses more recent loo package features and includes more model checking and more explanations.

1.1 Estimating and comparing predictive performance

When building Bayesian models for prediction, a key question is: which model predicts new data better? Since we cannot directly compute the predictive performance for unseen data, we estimate it using cross-validation. Here we focus specifically on leave-one-out cross-validation (LOO-CV), which fits the model to all observations except one, scores the held-out observation under the leave-one-out predictive distribution, and repeats this for every observation. We use the fast PSIS-LOO method (Vehtari, Gelman and Gabry, 2017) implemented in the loo R package (Vehtari et al., 2022).

An estimate of the model’s predictive accuracy for out-of-sample data is the sum of the pointwise log scores, elpd_loo. For model comparison, the quantity of interest is elpd_diff, the difference in elpd_loo between two models. See the CV-FAQ for explanations of the different terms used in Bayesian cross-validation.
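
For data y = (y_1, \ldots, y_N), the estimated quantity is

\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO} = \sum_{i=1}^{N} \log p(y_i \mid y_{-i}), \qquad p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta,

where y_{-i} denotes the data with the ith observation left out. PSIS-LOO approximates each leave-one-out predictive density from the full-data posterior draws, so the model does not need to be refitted N times.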

The loo package provides loo_compare(), which reports elpd_diff with respect to the model with the highest elpd_loo, and the associated standard error se_diff. Additionally, based on a normal approximation of the uncertainty in the difference, the quantity p_worse provides the probability that a model has worse predictive performance than the model with the best estimated predictive performance.

A useful way to read these three quantities together is:

  • elpd_diff: the magnitude of the estimated predictive performance difference,
  • se_diff: the uncertainty around that estimate,
  • p_worse: the implied probability that the model is truly worse than the current best, assuming the normal approximation is well calibrated.
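
Under the normal approximation, p_worse follows directly from elpd_diff and se_diff. A minimal sketch of the idea (an illustration, not necessarily the loo package’s exact implementation):

# Normal-approximation probability that a model is worse than the best model.
# For a non-best model elpd_diff is negative, so p_worse = P(true difference < 0).
p_worse_normal <- function(elpd_diff, se_diff) {
  pnorm(0, mean = elpd_diff, sd = se_diff)
}
p_worse_normal(elpd_diff = -3.1, se_diff = 1.1) # about 0.998, shown as 1.00 in the milk example below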

1.2 Scenarios when the normal approximation may be miscalibrated

Sivula et al. (2025b) analyse the conditions under which the normal approximation underlying se_diff and p_worse holds, and identify three scenarios where it may be miscalibrated:

  • Scenario 1 – Similar predictions. When the models make similar predictions, there is not much difference in the predictive performance and we can use either model for prediction. In practice, se_diff tends to be underestimated and p_worse overestimated.
  • Scenario 2 – Model misspecification with outliers. Outlier observations in the data can lead to a biased estimator and thus poor calibration. Model misspecification should be detected and addressed through model checking before LOO-CV comparison is used for final model selection.
  • Scenario 3 – Small number of observations. LOO-CV cannot reliably detect small differences in the predictive performance if the number of observations is small, and se_diff should then be treated as likely underestimated. A conservative heuristic is to double se_diff, which gives a more cautious estimate of the uncertainty.

1.3 Diagnostics in loo_compare()

To make these scenarios visible in practice, loo_compare() includes two diagnostic columns (since version 2.10.0):

  • diag_diff is a pairwise comparison diagnostic tied to Scenarios 1 and 3. It flags cases where se_diff and consequently p_worse are likely to be miscalibrated, currently using two criteria: |elpd_diff| < 4 for Scenario 1, and N < 100 for Scenario 3. These criteria follow a priority hierarchy: when both apply, only N < 100 is shown, as a small sample size already undermines the reliability of se_diff, which also renders the |elpd_diff| < 4 threshold less informative.
  • diag_elpd flags unreliable PSIS-LOO computation for an individual model. It is triggered when the Pareto-\hat{k} diagnostic identifies highly influential observations. This diagnostic is qualitatively different from the calibration issues above: here the elpd_loo estimate itself may be biased.
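
To make the priority hierarchy of diag_diff concrete, here is a minimal sketch of the logic described above (an illustration, not the loo package’s internal implementation):

# diag_diff priority: N < 100 takes precedence over |elpd_diff| < 4 when both apply.
diag_diff_sketch <- function(elpd_diff, n_obs) {
  if (n_obs < 100) {
    "N < 100"
  } else if (abs(elpd_diff) < 4) {
    "|elpd_diff| < 4"
  } else {
    ""
  }
}
diag_diff_sketch(elpd_diff = -3.1, n_obs = 17)  # "N < 100" (small data, as in the milk example below)
diag_diff_sketch(elpd_diff = -2.4, n_obs = 200) # "|elpd_diff| < 4" (small difference, N above 100)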

1.4 Examples

Each real-data example is chosen to illustrate the three scenarios described above:

  • Scenario 1 similar predictions: primate milk
  • Scenario 2 model misspecification: roaches, sleepstudy
  • Scenario 3 small data: primate milk

In all examples, the true data-generating process is more complex than any of the models considered — the realistic applied setting. The goal is to show how diag_diff and diag_elpd appear in practice, and how they should shape interpretation of the loo_compare() output.

All models were fit using MCMC with 4 chains, 1000 warmup and 1000 sampling iterations. Convergence was assessed using the diagnostics of Vehtari et al. (2021) via the posterior package (Bürkner et al., 2024). LOO-CV was computed with the loo package (Vehtari et al., 2022) using fast PSIS-LOO (Vehtari, Gelman and Gabry, 2017).


Load packages

library(tidyr)
library(dplyr)
library(tibble)
library(loo) # diagnostic columns require loo >= 2.10.0
library(brms)
options(brms.backend = "cmdstanr", mc.cores = 4)
library(posterior)
options(posterior.num_args=list(digits=2))
library(pillar)
options(pillar.negative = FALSE)
library(tinytable)
options(tinytable_format_num_fmt = "significant_cell",
        tinytable_format_digits = 2,
        tinytable_tt_digits=2)
library(ggplot2)
library(ggdist)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans", base_size = 14))

2 Primate milk

McElreath (2020) describes the primate milk data: "A popular hypothesis has it that primates with larger brains produce more energetic milk, so that brains can grow quickly… The question here is to what extent energy content of milk, measured here by kilocalories, is related to the percent of the brain mass that is neocortex… We’ll end up needing female body mass as well, to see the masking that hides the relationships among the variables." The data include 17 different primate species. The target variable is the energy content of milk (kcal.per.g) and the covariates are the percent of the brain mass that is neocortex (neocortex) and the logarithm of female body mass (log(mass)). The covariates and target are centered and scaled to have unit variance.

# primate milk data from the rethinking package (McElreath, 2020)
data(milk, package = "rethinking")
milk <- milk |>
  drop_na() |>
  mutate(neocortex = neocortex.perc / 100)
head(milk)
             clade            species kcal.per.g perc.fat perc.protein
1    Strepsirrhine     Eulemur fulvus       0.49    16.60        15.42
2 New World Monkey Alouatta seniculus       0.47    21.22        23.58
3 New World Monkey         A palliata       0.56    29.66        23.46
4 New World Monkey       Cebus apella       0.89    53.41        15.80
5 New World Monkey         S sciureus       0.92    50.58        22.33
6 New World Monkey   Cebuella pygmaea       0.80    41.35        20.85
  perc.lactose mass neocortex.perc neocortex
1        67.98 1.95          55.16    0.5516
2        55.20 5.25          64.54    0.6454
3        46.88 5.37          64.54    0.6454
4        30.79 2.51          67.64    0.6764
5        27.09 0.68          68.85    0.6885
6        37.80 0.12          58.85    0.5885

We use the following four models with weakly informative normal(0, 1) priors for the coefficients and an exponential(1) prior for the residual scale: \begin{align*} \mathrm{M}_1: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha, \sigma) \\ \mathrm{M}_2: \quad& \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_1\times\mathrm{neocortex}, \sigma) \\ \mathrm{M}_3: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_2\times\mathrm{log(mass)}, \sigma) \\ \mathrm{M}_4: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_1\times\mathrm{neocortex} + \beta_2\times\mathrm{log(mass)}, \sigma). \end{align*}

We fit the models with the brms package (Bürkner, 2017),

M_1 <- brm(kcal.per.g ~ 1,
           data = milk,
           seed = 2030,
           refresh = 0)
M_2 <- brm(kcal.per.g ~ neocortex,
           prior = prior(normal(0, 1), class = "b"),
           data = milk,
           seed = 2030,
           refresh = 0)
M_3 <- brm(kcal.per.g ~ log(mass),
           prior = prior(normal(0, 1), class = "b"),
           data = milk,
           seed = 2030,
           refresh = 0)
M_4 <- brm(kcal.per.g ~ neocortex + log(mass),
           prior = prior(normal(0, 1), class = "b"),
           data = milk,
           seed = 2030,
           refresh = 0)

We compute LOO-CV estimates using the fast PSIS-LOO method (Vehtari et al., 2024). We use add_criterion() to store the loo object inside the brmsfit objects.

M_1 <- add_criterion(M_1, criterion = "loo")
M_2 <- add_criterion(M_2, criterion = "loo")
M_3 <- add_criterion(M_3, criterion = "loo")
M_4 <- add_criterion(M_4, criterion = "loo")

There are no high Pareto-k warnings, which indicates there are no highly influential observations that could indicate model misspecification.
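
For example, we can print the Pareto-\hat{k} diagnostic table for one of the models explicitly (the loo object is stored in the criteria slot by add_criterion()):

loo::pareto_k_table(M_4$criteria$loo)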

We can assess how well the predictive distributions are calibrated with LOO-PIT predictive checking (Tesso and Vehtari, 2026).

pp_check(M_1, type = "loo_pit_ecdf", method = "correlated")

pp_check(M_2, type = "loo_pit_ecdf", method = "correlated")

pp_check(M_3, type = "loo_pit_ecdf", method = "correlated")

pp_check(M_4, type = "loo_pit_ecdf", method = "correlated")

All LOO-PITs are close to uniformly distributed, which indicates good calibration.

We compare the models \mathrm{M}_1, \mathrm{M}_2,\mathrm{M}_3,\mathrm{M}_4.

(loo_comp <- loo_compare(M_1, M_2, M_3, M_4))
 model elpd_diff se_diff p_worse diag_diff diag_elpd
   M_4       0.0     0.0      NA                    
   M_3      -3.1     1.1    1.00   N < 100          
   M_1      -3.3     2.0    0.95   N < 100          
   M_2      -3.7     2.1    0.96   N < 100          

As the data set contains only N=17 primate species, diag_diff shows N < 100 for all comparisons, and we get a message advising to read more in the loo-glossary.

We can also use tinytable for prettier table formatting

loo_comp |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(replace = "-")
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_4 0 0 -
M_3 -3.1 1.1 1 N < 100
M_1 -3.3 2 0.95 N < 100
M_2 -3.7 2.1 0.96 N < 100

Based on the model checking above, the models seem to be reasonably specified and we are fine with respect to Scenario 2 (model misspecification). Models \mathrm{M}_1, \mathrm{M}_2 and \mathrm{M}_3 have small {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl(\mathrm{M}_a,\mathrm{M}_b \mid y\bigr)} compared to model \mathrm{M}_4. The direct use of the normal approximation gives probabilities 1, 0.95 and 0.96 that these models have worse predictive performance than model \mathrm{M}_4. As {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl(\mathrm{M}_a,\mathrm{M}_b \mid y\bigr)} is small (Scenario 1) and the number of observations is small (Scenario 3), we may assume {\widehat{\mathrm{SE}}_\mathrm{\scriptscriptstyle LOO}\bigl(\mathrm{M}_a,\mathrm{M}_b \mid y\bigr)} to be underestimated and the error distribution to be more skewed than normal. However, since {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl(\mathrm{M}_a,\mathrm{M}_b \mid y\bigr)} is small, we can state that there is no practical or statistical difference in the predictive performance.

We can get more cautious estimates of p_worse by multiplying se_diff by 2; in this case we get 0.92, 0.8, and 0.81 (see the sketch below). This doesn’t correct for the skewness, but it provides some idea of the sensitivity of the estimates. Considering we have only 17 observations, it is natural that there is uncertainty in the comparisons. Collecting more data is recommended (if possible).
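
These more cautious values can be computed directly from the comparison table. A minimal sketch, assuming the loo_compare() result can be manipulated with dplyr verbs as in the tinytable call above:

# Recompute p_worse with se_diff doubled (conservative heuristic for small N).
loo_comp |>
  dplyr::filter(model != "M_4") |>
  dplyr::mutate(p_worse_2se = pnorm(0, mean = elpd_diff, sd = 2 * se_diff)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, p_worse_2se)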

As the predictive distribution includes the aleatoric uncertainty (modelled by the data model), there is often more uncertainty in the predictive-performance comparison than in the posterior distribution (see, e.g., Wang and Gelman, 2015). In simple models, we can also look at the posterior of the quantities of interest. With model \mathrm{M}_4, the 95\% central posterior intervals for \beta_1 and \beta_2 are (1.1,3.7) and (-0.12,-0.04) respectively, which indicates the data have information about the parameters. The covariates neocortex and log(mass) are collinear, which causes correlation in the posterior of the coefficients; this can make the marginal posteriors overlap 0 even when the joint posterior does not, in which case looking at the predictive performance is useful. In this case, although neocortex and log(mass) are collinear, neither of them carries useful information alone; the useful predictive information is along the second principal component of their joint distribution, which explains why the models with only one of the covariates are not better than the intercept-only model.
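
The intervals quoted above can be extracted with the posterior package. A minimal sketch, assuming brms’ default parameter naming (b_neocortex and b_logmass):

# 95% central posterior intervals for the coefficients of M_4
as_draws_df(M_4) |>
  subset_draws(variable = c("b_neocortex", "b_logmass")) |>
  summarise_draws(~ quantile2(.x, probs = c(0.025, 0.975)))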

3 Sleep study

Belenky et al. (2003) collected data on the effect of chronic sleep restriction. We use a subset of the data available in the R package lme4 (Bates et al., 2015). The data contain average reaction times (in milliseconds) for 18 subjects with sleep restricted to 3 hours per night for 7 consecutive nights (days 0 and 1 were adaptation and training, and are removed from this analysis).

data(sleepstudy, package="lme4")
sleepstudy2 <- sleepstudy |>
  dplyr::filter(Days >= 2)
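
A quick look at the size of the filtered data set, which is relevant for the N < 100 criterion of diag_diff discussed in the introduction:

# number of observations and subjects after dropping the adaptation and training days
sleepstudy2 |>
  dplyr::summarise(n_obs = dplyr::n(), n_subjects = dplyr::n_distinct(Subject))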

The compared models are a linear model, a linear model with varying intercept for each subject, and a linear model with varying intercept and slope for each subject. All models use a normal data model. The models were fitted using brms (Bürkner, 2017) with the default brms priors: the prior for the coefficient of Days is uniform, the prior for the varying intercept is normal with an unknown scale having a half-normal prior, and the prior for the varying intercepts and slopes is bivariate normal with unknown scales having half-normal priors and the correlation having an LKJ prior (Lewandowski, Kurowicka and Joe, 2009).

Using the brms formula notation, the compared models are \begin{align*} \mathrm{M}_1: \quad & \mathrm{Reaction} \sim \mathrm{Days} \\ \mathrm{M}_2: \quad & \mathrm{Reaction} \sim \mathrm{Days} + (1\,\mid\,\mathrm{Subject}) \\ \mathrm{M}_3: \quad & \mathrm{Reaction} \sim \mathrm{Days} + (\mathrm{Days}\,\mid\, \mathrm{Subject}). \end{align*}

Based on the study design, \mathrm{M}_3 is the appropriate model for the analysis, but comparing models is useful for assessing how much information the data has about the varying intercepts and slopes.

We fit the three models and use add_criterion() to store the loo object inside the brmsfit objects.

M_1 <- brm(Reaction ~ Days,
           data = sleepstudy2,
           family = gaussian(),
           refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE)
M_2 <- brm(Reaction ~ Days + (1 | Subject),
           data = sleepstudy2,
           family = gaussian(),
           refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE)
Warning: Found 1 observations with a pareto_k > 0.7 in model 'brm(Reaction ~
Days + (1 | Subject), data = sleepstudy2, family = gaussian(), refresh = 0)'.
We recommend to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
M_3 <- brm(Reaction ~ Days + (Days | Subject),
           data = sleepstudy2,
           family = gaussian(),
           refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE)
Warning: Found 3 observations with a pareto_k > 0.7 in model 'brm(Reaction ~
Days + (Days | Subject), data = sleepstudy2, family = gaussian(), refresh =
0)'. We recommend to set 'moment_match = TRUE' in order to perform moment
matching for problematic observations.

We get some warnings about high Pareto-k values, but we don’t fix these yet. If the high Pareto-k diagnostic messages were missed, loo_compare() reminds us about them.

loo_compare(M_1, M_2, M_3)
 model elpd_diff se_diff p_worse diag_diff      diag_elpd
   M_3       0.0     0.0      NA           3 k_psis > 0.7
   M_2     -14.7     9.3    0.94           1 k_psis > 0.7
   M_1     -79.6    20.2    1.00                         

Alternatively, we can use tinytable for prettier output and more precise control over the number of digits shown.

loo_compare(M_1, M_2, M_3) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-") |>
  format_tt(i=3, j=4, digits=5)
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3 0 0 - 3 k_psis > 0.7
M_2 -15 9.3 0.94 1 k_psis > 0.7
M_1 -80 20 0.99996

High Pareto-k values may be due to model misspecification or to a flexible but well-specified model, and they can lead to optimistic elpd_loo estimates. Especially if the PSIS-LOO computation for the model with the best elpd_loo has many high Pareto-k values, the p_worse values may also be biased. To improve the accuracy of the elpd computation, we use the argument reloo = TRUE, which re-runs MCMC for the problematic leave-one-out folds.

M_2 <- M_2 |> add_criterion(criterion="loo", save_psis = TRUE,
                            reloo = TRUE, overwrite = TRUE)
M_3 <- M_3 |> add_criterion(criterion="loo", save_psis = TRUE,
                            reloo = TRUE, overwrite = TRUE)

If we now re-run loo_compare(), there are no more warnings about high Pareto-k values. The elpd_diff and se_diff values have changed a bit, but in this case the p_worse values are about the same.

loo_compare(M_1, M_2, M_3) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-") |>
  format_tt(i=3, j=4, digits=5)
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3 0 0 -
M_2 -14 9.7 0.92
M_1 -78 21 0.99992

Although we can get more accurate elpd and elpd_diff estimates with more computation, the accuracy of the normal approximation for the uncertainty in the difference can still be compromised by the outliers. We check calibration of the predictive distributions with LOO-PIT-ECDF plots (Tesso and Vehtari, 2026).

pp_check(M_1, type = "loo_pit_ecdf", method = "correlated")

pp_check(M_2, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 1 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

pp_check(M_3, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 3 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

The pooled model \mathrm{M}_1 seems to be reasonably calibrated. The varying intercept model \mathrm{M}_2 has a slightly S-shaped LOO-PIT-ECDF curve, but the pre-defined diagnostic threshold is not exceeded. The varying intercept and slope model \mathrm{M}_3 has a more clearly S-shaped LOO-PIT-ECDF curve, and the diagnostic strongly indicates miscalibration. In general, an S-shaped LOO-PIT-ECDF curve indicates a too wide predictive distribution, and the red emphasis near 0 and 1 indicates outliers.

We can see the outliers better in the LOO-intervals plot. These outliers inflate the normal model residual variance, making the predictive distribution too wide for other observations.

pp_check(M_3, type = "loo_intervals")
Warning: Found 3 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

If we examine the pointwise differences, we see that model \mathrm{M}_3 is almost always better, but the outliers cause a couple of values with large magnitude, making it more likely that the normal approximation for quantifying the uncertainty in elpd_diff is not accurate.

ggplot(data=NULL, aes(x=1:nrow(sleepstudy2),
           pointwise(M_3$criteria$loo, "elpd_loo")-pointwise(M_2$criteria$loo, "elpd_loo"))) +
  geom_point() +
  hline_0(alpha=0.5) +
  labs(x="Data index", y="pointwise elpd_diff")

We next fit the same models with a Student’s t data model, creating models \mathrm{M}_{1t}, \mathrm{M}_{2t}, and \mathrm{M}_{3t}. We now use reloo = TRUE right away.

M_1t <- brm(Reaction ~ Days,
            data = sleepstudy2,
            family = student(),
            refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)
M_2t <- brm(Reaction ~ Days + (1 | Subject),
            data = sleepstudy2,
            family = student(),
            refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)
M_3t <- brm(Reaction ~ Days + (Days | Subject),
            data = sleepstudy2,
            family = student(),
            refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)

The LOO-PIT-ECDF plots now look better.

pp_check(M_2t, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 1 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

pp_check(M_3t, type = "loo_pit_ecdf", method = "correlated")

We first compare \mathrm{M}_{3} and \mathrm{M}_{3t} to see whether a Student’s t model has better predictive performance.

loo_compare(loo(M_3), loo(M_3t)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-", digits=3)
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3t 0 0 -
M_3 -41 13 0.999

Although in this comparison \mathrm{M}_3 is misspecified, the better specified model \mathrm{M}_{3t} shows much better predictive performance, and as we can expect se_diff to be inflated, the actual probability that \mathrm{M}_{3t} is better than \mathrm{M}_{3} is likely to be bigger than 0.999.

We then compare the three Student’s t models:

loo_compare(loo(M_1t), loo(M_2t), loo(M_3t)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-")
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3t 0 0 -
M_2t -45 8.6 1
M_1t -119 16 1

The probability that model \mathrm{M}_{3t} is better than models \mathrm{M}_{1t} and \mathrm{M}_{2t} is close to 1. The models appear sufficiently well specified, the number of observations is bigger than 100, and the differences are not small, so we can assume that the normal approximation is well calibrated.

If we examine the pointwise differences, we see that model \mathrm{M}_{3t} is almost always better than model \mathrm{M}_{2t}, and as there are no outliers, the distribution of pointwise elpd differences is such that the normal approximation for quantifying the uncertainty in elpd_diff is likely to be accurate.

ggplot(data=NULL, aes(x=1:nrow(sleepstudy2),
           pointwise(M_3t$criteria$loo, "elpd_loo")-pointwise(M_2t$criteria$loo, "elpd_loo"))) +
  geom_point() +
  hline_0(alpha=0.5) +
  labs(x="Data index", y="pointwise elpd_diff")

In this case, the effect of days with sleep restricted to 3 hours is so big that the main conclusion stays the same with all the models. Still, for example, model \mathrm{M}_{3t} does indicate higher variation between subjects than model \mathrm{M}_{3} (see the sketch below). As \mathrm{M}_{3t} passes the model checking and has higher predictive performance, we should continue looking at the posterior of model \mathrm{M}_{3t}.
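
One way to see that higher between-subject variation is to compare the posteriors of the group-level standard deviations in the two models. A minimal sketch, assuming brms’ default naming of these parameters (sd_Subject__Intercept and sd_Subject__Days):

# Summarise the between-subject SDs for the normal and Student's t models.
sd_summary <- function(fit, label) {
  as_draws_df(fit) |>
    subset_draws(variable = "sd_Subject__", regex = TRUE) |>
    summarise_draws() |>
    dplyr::mutate(model = label, .before = 1)
}
dplyr::bind_rows(sd_summary(M_3, "M_3"), sd_summary(M_3t, "M_3t"))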

4 Roaches

Gelman and Hill (2007) describe the roaches data in Chapter 8.3 as follows: "the treatment and control were applied to 160 and 104 apartments, respectively, and the outcome measurement y_i in each apartment i was the number of roaches caught in a set of traps. Different apartments had traps for different numbers of days". The goal is to estimate the efficacy of a pest management system at reducing the number of roaches.

# roaches data from the rstanarm package
data(roaches, package = "rstanarm")
# Roach1 is very skewed and we take a square root
roaches$sqrt_roach1 <- sqrt(roaches$roach1)

The target is the number of roaches (y), and the covariates include the square root of the pre-treatment number of roaches (sqrt_roach1), a treatment indicator variable (treatment), and a variable indicating whether the apartment is in a building restricted to elderly residents (senior). As the number of days for which the roach traps were used is not the same for all apartments, the offset argument includes the logarithm of the number of days the traps were used (log(exposure2)). The latent regression model presented with brms formula notation is:

\mathrm{y} \sim \mathrm{sqrt\_roach1} + \mathrm{treatment} + \mathrm{senior} + \mathrm{offset(log(exposure2))}.

We fit the following models using the brms package. \begin{align*} \mathrm{M}_1: \quad & \text{Poisson} \\ \mathrm{M}_2: \quad & \text{Negative binomial} \\ \mathrm{M}_3: \quad & \text{Zero-inflated negative binomial} \end{align*}

The zero-inflation is modelled using the same latent formula (with its own parameters). All coefficients have \operatorname{normal}(0,1) priors and the negative-binomial shape parameter has the brms default prior, which is inverse-gamma(.4, .3) (Vehtari, 2024).

For the Poisson model we re-ran MCMC for all LOO-folds with high Pareto-\hat{k} diagnostic value (>0.7) (with reloo = TRUE in brms), and for negative-binomial and zero-inflated negative-binomial we used moment matching (Paananen et al., 2021) for a few LOO-folds with high Pareto-\hat{k} diagnostic value (>0.7) (with moment_match = TRUE in brms).

M_1 <- brm(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
           family = poisson(),
           data = roaches,
           prior = c(prior(normal(0,1), class = "b")),
           seed = 1704009,
           refresh = 0) |>
  add_criterion(criterion = "loo", save_psis = TRUE, reloo = TRUE)
Warning: Ignoring relative efficiencies as some were NA. See argument 'r_eff'
in ?loo::loo for more details.
M_2 <- brm(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
           family = negbinomial(),
           data = roaches,
           prior = c(prior(normal(0,1), class = "b"),
                   prior(inv_gamma(0.4, 0.3), class = "shape")),
           save_pars = save_pars(all = TRUE), # save all draws for moment matching
           seed = 1704009,
           refresh = 0) |>
  add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)
M_3 <- brm(bf(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
              zi ~ sqrt_roach1 + treatment + senior + offset(log(exposure2))),
           family = zero_inflated_negbinomial(),
           data = roaches,
           prior = c(prior(normal(0,1), class = "b"),
                   prior(normal(0,1), class = "b", dpar = "zi"),
                   prior(normal(0,1), class = "Intercept", dpar = "zi"),
                   prior(inv_gamma(0.4, 0.3), class = "shape")),
           save_pars = save_pars(all = TRUE), # save all draws for moment matching
           seed = 1704009,
           refresh = 0) |>
  add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)

Compare with loo_compare()

loo_compare(loo(M_1), loo(M_2), loo(M_3)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-", digits=4)
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3 0 0 -
M_2 -23 6.9 0.9995
M_1 -4635 686 1

The zero-inflated negative-binomial model (\mathrm{M}_3) is clearly the best. Based on model checking (see Roaches case study), the Poisson model (\mathrm{M}_1) is underdispersed which indicates Scenario 2 (model misspecification), but the difference is so big that we can be certain that the zero-inflated negative-binomial model is better. As the number of observations is larger than 100, and the difference to model \mathrm{M}_2 is not small, we may assume the normal approximation is well calibrated.
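
As an example of such a check, we can compare the proportion of zero counts in the posterior predictive draws to the observed proportion; a minimal sketch using a user-defined test statistic:

# posterior predictive check for the proportion of zero counts
prop_zero <- function(y) mean(y == 0)
pp_check(M_1, type = "stat", stat = "prop_zero")
pp_check(M_3, type = "stat", stat = "prop_zero")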

As we had used an ad-hoc square root transformation of pre-treatment number of roaches, we fitted a model \mathrm{M}_4 replacing the latent linear term for the square root of pre-treatment number of roaches with a spline.

M_4 <- brm(bf(y ~ s(sqrt_roach1) + treatment + senior + offset(log(exposure2)),
              zi ~ s(sqrt_roach1) + treatment + senior + offset(log(exposure2))),
           family = zero_inflated_negbinomial(), data = roaches,
           prior = c(prior(normal(0,1), class = "b"),
                   prior(normal(0,1), class = "b", dpar = "zi"),
                   prior(normal(0,1), class = "Intercept", dpar = "zi"),
                   prior(inv_gamma(0.4, 0.3), class = "shape")),
           save_pars = save_pars(all = TRUE),
           seed = 1704009,
           refresh = 0) |> 
  add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)

Compare with loo_compare()

loo_compare(loo(M_3), loo(M_4)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-")
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_4 0 0 -
M_3 -2.4 3.2 0.78 |elpd_diff| < 4

Model \mathrm{M}_4 (with the spline) seems to be slightly better, but now the difference is so small (Scenario 1) that the normal approximation is unlikely to be perfectly calibrated. As the difference is small, we can proceed with either model.

In the milk example above, |elpd_diff| was also small, but diag_diff showed only N < 100, as that was the dominating issue. In this case N > 100, and now the |elpd_diff| < 4 diagnostic is shown.

References

Bates, D., Maechler, M., Bolker, B. and Walker, S. (2015) “Fitting linear mixed-effects models using lme4,” Journal of Statistical Software, 67, pp. 1–48.
Belenky, G. et al. (2003) “Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: A sleep dose-response study,” Journal of Sleep Research, 12, pp. 1–12.
Bürkner, P.-C. (2017) “brms: An R package for Bayesian multilevel models using Stan,” Journal of Statistical Software, 80, pp. 1–28.
Bürkner, P.-C., Gabry, J., Kay, M. and Vehtari, A. (2024) “posterior: Tools for working with posterior distributions.”
Gelman, A. and Hill, J. (2007) Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Lewandowski, D., Kurowicka, D. and Joe, H. (2009) “Generating random correlation matrices based on vines and extended onion method,” Journal of Multivariate Analysis, 100, pp. 1989–2001.
McElreath, R. (2020) Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
Paananen, T., Piironen, J., Bürkner, P.-C. and Vehtari, A. (2021) “Implicitly adaptive importance sampling.” Statistics and Computing, 31(16).
Sivula, T., Magnusson, M., Matamoros, A. A. and Vehtari, A. (2025a) “Uncertainty in Bayesian leave-one-out cross-validation based model comparison,” Bayesian Analysis. doi: 10.1214/25-BA1569.
Sivula, T., Magnusson, M., Matamoros, A. A. and Vehtari, A. (2025b) “Uncertainty in Bayesian leave-one-out cross-validation based model comparison,” Bayesian Analysis. doi: 10.1214/25-BA1569.
Tesso, H. and Vehtari, A. (2026) “LOO-PIT predictive model checking,” arXiv preprint arXiv:2603.02928.
Vehtari, A. (2024) “Default prior for negative-binomial shape parameter.”
Vehtari, A., Gabry, J., Magnusson, M., Yao, Y. and Gelman, A. (2022) “loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models.” Available at: https://mc-stan.org/loo.
Vehtari, A., Gelman, A. and Gabry, J. (2017) “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC,” Statistics and Computing, 27(5), pp. 1413–1432.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. and Bürkner, P.-C. (2021) “Rank-normalization, folding, and localization: An improved \widehat{R} for assessing convergence of MCMC (with discussion),” Bayesian Analysis, 16(2), pp. 667–718.
Vehtari, A., Simpson, D., Gelman, A., Yao, Y. and Gabry, J. (2024) “Pareto smoothed importance sampling,” Journal of Machine Learning Research, 25(72).
Wang, W. and Gelman, A. (2015) “Difficulty of selecting among multilevel models using predictive accuracy,” Statistics and Its Interface, 8(2), pp. 153–160.