Uncertainty in Bayesian LOO-CV Model Comparison

Author

Aki Vehtari and Florence Bockting

Published

2025-06-23

Modified

2026-04-21

1 Introduction

This case study is a new version of the case study in Section 5 of the paper Uncertainty in Bayesian leave-one-out cross-validation based model comparison (Sivula et al., 2025a). The original case study is available for reproducing the results in the paper. This version uses more recent loo package features and includes more model checking and more explanations.

1.1 Estimating and comparing predictive performance

When building Bayesian models for prediction, a key question is: which model predicts new data better? Since we cannot directly compute the predictive performance for unseen data, we estimate it using cross-validation. Here we focus specifically on leave-one-out cross-validation (LOO-CV), which fits the model to all observations except one, scores the held-out observation under the leave-one-out predictive distribution, and repeats this for every observation. We use the fast PSIS-LOO method (Vehtari, Gelman and Gabry, 2017) implemented in the loo R package (Vehtari et al., 2022).

An estimate of the model’s predictive accuracy for out-of-sample data is the sum of the pointwise log scores, elpd_loo. For model comparison, the quantity of interest is elpd_diff, the difference in elpd_loo between two models. See the CV-FAQ for explanations of the different terms used in Bayesian cross-validation.
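
For data y = (y_1, \ldots, y_N), the estimated quantity is

\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO} = \sum_{i=1}^{N} \log p(y_i \mid y_{-i}), \qquad p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta,

where y_{-i} denotes the data with the ith observation left out. PSIS-LOO approximates each leave-one-out predictive density from the full-data posterior draws, so the model does not need to be refitted N times.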

The loo package provides loo_compare(), which reports elpd_diff with respect to the model with the highest elpd_loo, and the associated standard error se_diff. Additionally, based on a normal approximation of the uncertainty in the difference, the quantity p_worse provides the probability that a model has worse predictive performance than the model with the best estimated predictive performance.

A useful way to read these three quantities together is:

  • elpd_diff: the magnitude of the estimated predictive performance difference,
  • se_diff: the uncertainty around that estimate,
  • p_worse: the implied probability that the model is truly worse than the current best, assuming the normal approximation is well calibrated.
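
Under the normal approximation, p_worse follows directly from elpd_diff and se_diff. A minimal sketch of the idea (an illustration, not necessarily the loo package’s exact implementation):

# Normal-approximation probability that a model is worse than the best model.
# For a non-best model elpd_diff is negative, so p_worse = P(true difference < 0).
p_worse_normal <- function(elpd_diff, se_diff) {
  pnorm(0, mean = elpd_diff, sd = se_diff)
}
p_worse_normal(elpd_diff = -3.1, se_diff = 1.1) # about 0.998, shown as 1.00 in the milk example below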

1.2 Scenarios when the normal approximation may be miscalibrated

Sivula et al. (2025b) analyse the conditions under which the normal approximation underlying se_diff and p_worse holds, and identify three scenarios where it may be miscalibrated:

  • Scenario 1 – Similar predictions. When the models make similar predictions, there is not much difference in the predictive performance and we can use either model for prediction. In practice, se_diff tends to be underestimated and p_worse overestimated.
  • Scenario 2 – Model misspecification with outliers. Outlier observations in the data can lead to a biased estimator and thus poor calibration. Model misspecification should be detected and addressed through model checking before LOO-CV comparison is used for final model selection.
  • Scenario 3 – Small number of observations. LOO-CV cannot reliably detect small differences in the predictive performance if the number of observations is small, and se_diff should then be treated as likely underestimated. A conservative heuristic is to double se_diff, which gives a more cautious estimate of the uncertainty.

1.3 Diagnostics in loo_compare()

To make these scenarios visible in practice, loo_compare() includes two diagnostic columns (since version 2.10.0):

  • diag_diff is a pairwise comparison diagnostic tied to Scenarios 1 and 3. It flags cases where se_diff and consequently p_worse are likely to be miscalibrated, currently using two criteria: |elpd_diff| < 4 for Scenario 1, and N < 100 for Scenario 3. These criteria follow a priority hierarchy: when both apply, only N < 100 is shown, as a small sample size already undermines the reliability of se_diff, which also renders the |elpd_diff| < 4 threshold less informative.
  • diag_elpd flags unreliable PSIS-LOO computation for an individual model. It is triggered when the Pareto-\hat{k} diagnostic identifies highly influential observations. This diagnostic is qualitatively different from the calibration issues above: here the elpd_loo estimate itself may be biased.
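
To make the priority hierarchy of diag_diff concrete, here is a minimal sketch of the logic described above (an illustration, not the loo package’s internal implementation):

# diag_diff priority: N < 100 takes precedence over |elpd_diff| < 4 when both apply.
diag_diff_sketch <- function(elpd_diff, n_obs) {
  if (n_obs < 100) {
    "N < 100"
  } else if (abs(elpd_diff) < 4) {
    "|elpd_diff| < 4"
  } else {
    ""
  }
}
diag_diff_sketch(elpd_diff = -3.1, n_obs = 17)  # "N < 100" (small data, as in the milk example below)
diag_diff_sketch(elpd_diff = -2.4, n_obs = 200) # "|elpd_diff| < 4" (small difference, N above 100)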

1.4 Examples

Each real-data example is chosen to illustrate the three scenarios described above:

  • Scenario 1 similar predictions: primate milk
  • Scenario 2 model misspecification: roaches, sleepstudy
  • Scenario 3 small data: primate milk

In all examples, the true data-generating process is more complex than any of the models considered — the realistic applied setting. The goal is to show how diag_diff and diag_elpd appear in practice, and how they should shape interpretation of the loo_compare() output.

All models were fit using MCMC with 4 chains, 1000 warmup and 1000 sampling iterations. Convergence was assessed using the diagnostics of Vehtari et al. (2021) via the posterior package (Bürkner et al., 2024). LOO-CV was computed with the loo package (Vehtari et al., 2022) using fast PSIS-LOO (Vehtari, Gelman and Gabry, 2017).


Load packages

library(tidyr)
library(dplyr)
library(tibble)
library(loo) # diagnostic columns require loo >= 2.10.0
library(brms)
options(brms.backend = "cmdstanr", mc.cores = 4)
library(posterior)
options(posterior.num_args=list(digits=2))
library(pillar)
options(pillar.negative = FALSE)
library(tinytable)
options(tinytable_format_num_fmt = "significant_cell",
        tinytable_format_digits = 2,
        tinytable_tt_digits=2)
library(ggplot2)
library(ggdist)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans", base_size = 14))

2 Primate milk

McElreath (2020) describes the primate milk data: "A popular hypothesis has it that primates with larger brains produce more energetic milk, so that brains can grow quickly… The question here is to what extent energy content of milk, measured here by kilocalories, is related to the percent of the brain mass that is neocortex… We’ll end up needing female body mass as well, to see the masking that hides the relationships among the variables." The data include 17 different primate species. The target variable is the energy content of milk (kcal.per.g) and the covariates are the percent of the brain mass that is neocortex (neocortex) and the logarithm of female body mass (log(mass)). The covariates and target are centered and scaled to have unit variance.

# primate milk data from the rethinking package (McElreath, 2020)
data(milk, package = "rethinking")
milk <- milk |>
  drop_na() |>
  mutate(neocortex = neocortex.perc / 100)
head(milk)
             clade            species kcal.per.g perc.fat perc.protein
1    Strepsirrhine     Eulemur fulvus       0.49    16.60        15.42
2 New World Monkey Alouatta seniculus       0.47    21.22        23.58
3 New World Monkey         A palliata       0.56    29.66        23.46
4 New World Monkey       Cebus apella       0.89    53.41        15.80
5 New World Monkey         S sciureus       0.92    50.58        22.33
6 New World Monkey   Cebuella pygmaea       0.80    41.35        20.85
  perc.lactose mass neocortex.perc neocortex
1        67.98 1.95          55.16    0.5516
2        55.20 5.25          64.54    0.6454
3        46.88 5.37          64.54    0.6454
4        30.79 2.51          67.64    0.6764
5        27.09 0.68          68.85    0.6885
6        37.80 0.12          58.85    0.5885

We use the following four models with weakly informative normal(0, 1) priors for the coefficients and an exponential(1) prior for the residual scale: \begin{align*} \mathrm{M}_1: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha, \sigma) \\ \mathrm{M}_2: \quad& \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_1\times\mathrm{neocortex}, \sigma) \\ \mathrm{M}_3: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_2\times\mathrm{log(mass)}, \sigma) \\ \mathrm{M}_4: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_1\times\mathrm{neocortex} + \beta_2\times\mathrm{log(mass)}, \sigma). \end{align*}

We fit the models with the brms package (Bürkner, 2017),

M_1 <- brm(kcal.per.g ~ 1,
           data = milk,
           seed = 2030,
           refresh = 0)
M_2 <- brm(kcal.per.g ~ neocortex,
           prior = prior(normal(0, 1), class = "b"),
           data = milk,
           seed = 2030,
           refresh = 0)
M_3 <- brm(kcal.per.g ~ log(mass),
           prior = prior(normal(0, 1), class = "b"),
           data = milk,
           seed = 2030,
           refresh = 0)
M_4 <- brm(kcal.per.g ~ neocortex + log(mass),
           prior = prior(normal(0, 1), class = "b"),
           data = milk,
           seed = 2030,
           refresh = 0)

We compute LOO-CV estimates using the fast PSIS-LOO method (Vehtari et al., 2024). We use add_criterion() to store the loo object inside the brmsfit objects.

M_1 <- add_criterion(M_1, criterion = "loo")
M_2 <- add_criterion(M_2, criterion = "loo")
M_3 <- add_criterion(M_3, criterion = "loo")
M_4 <- add_criterion(M_4, criterion = "loo")

There are no high Pareto-k warnings, which indicates there are no highly influential observations that could indicate model misspecification.
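
For example, we can print the Pareto-\hat{k} diagnostic table for one of the models explicitly (the loo object is stored in the criteria slot by add_criterion()):

loo::pareto_k_table(M_4$criteria$loo)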

We can assess how well the predictive distributions are calibrated with LOO-PIT predictive checking (Tesso and Vehtari, 2026).

pp_check(M_1, type = "loo_pit_ecdf", method = "correlated")

pp_check(M_2, type = "loo_pit_ecdf", method = "correlated")

pp_check(M_3, type = "loo_pit_ecdf", method = "correlated")

pp_check(M_4, type = "loo_pit_ecdf", method = "correlated")

All LOO-PITs are close to uniformly distributed, which indicates good calibration.

We compare the models \mathrm{M}_1, \mathrm{M}_2,\mathrm{M}_3,\mathrm{M}_4.

(loo_comp <- loo_compare(M_1, M_2, M_3, M_4))
 model elpd_diff se_diff p_worse diag_diff diag_elpd
   M_4       0.0     0.0      NA                    
   M_3      -3.1     1.1    1.00   N < 100          
   M_1      -3.3     2.0    0.95   N < 100          
   M_2      -3.7     2.1    0.96   N < 100          

As the data set contains only N=17 primate species, diag_diff shows N < 100 for all comparisons, and we get a message advising to read more in the loo-glossary.

We can also use tinytable for prettier table formatting

loo_comp |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(replace = "-")
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_4 0 0 -
M_3 -3.1 1.1 1 N < 100
M_1 -3.3 2 0.95 N < 100
M_2 -3.7 2.1 0.96 N < 100

Based on the model checking above, the models seem to be reasonably specified and we are fine with respect to Scenario 2 (model misspecification). Models \mathrm{M}_1, \mathrm{M}_2 and \mathrm{M}_3 have small {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl(\mathrm{M}_a,\mathrm{M}_b \mid y\bigr)} compared to model \mathrm{M}_4. The direct use of the normal approximation gives probabilities 1, 0.95 and 0.96 that these models have worse predictive performance than model \mathrm{M}_4. As {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl(\mathrm{M}_a,\mathrm{M}_b \mid y\bigr)} is small (Scenario 1) and the number of observations is small (Scenario 3), we may assume {\widehat{\mathrm{SE}}_\mathrm{\scriptscriptstyle LOO}\bigl(\mathrm{M}_a,\mathrm{M}_b \mid y\bigr)} to be underestimated and the error distribution to be more skewed than normal. However, since {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl(\mathrm{M}_a,\mathrm{M}_b \mid y\bigr)} is small, we can state that there is no practical or statistical difference in the predictive performance.

We can get more cautious estimates of p_worse by multiplying se_diff by 2; in this case we get 0.92, 0.8, and 0.81 (see the sketch below). This doesn’t correct for the skewness, but it provides some idea of the sensitivity of the estimates. Considering we have only 17 observations, it is natural that there is uncertainty in the comparisons. Collecting more data is recommended (if possible).
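
These more cautious values can be computed directly from the comparison table. A minimal sketch, assuming the loo_compare() result can be manipulated with dplyr verbs as in the tinytable call above:

# Recompute p_worse with se_diff doubled (conservative heuristic for small N).
loo_comp |>
  dplyr::filter(model != "M_4") |>
  dplyr::mutate(p_worse_2se = pnorm(0, mean = elpd_diff, sd = 2 * se_diff)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, p_worse_2se)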

As the predictive distribution includes the aleatoric uncertainty (modelled by the data model), there is often more uncertainty in the predictive-performance comparison than in the posterior distribution (see, e.g., Wang and Gelman, 2015). In simple models, we can also look at the posterior of the quantities of interest. With model \mathrm{M}_4, the 95\% central posterior intervals for \beta_1 and \beta_2 are (1.1,3.7) and (-0.12,-0.04) respectively, which indicates the data have information about the parameters. The covariates neocortex and log(mass) are collinear, which causes correlation in the posterior of the coefficients; this can make the marginal posteriors overlap 0 even when the joint posterior does not, in which case looking at the predictive performance is useful. In this case, although neocortex and log(mass) are collinear, neither of them carries useful information alone; the useful predictive information is along the second principal component of their joint distribution, which explains why the models with only one of the covariates are not better than the intercept-only model.
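
The intervals quoted above can be extracted with the posterior package. A minimal sketch, assuming brms’ default parameter naming (b_neocortex and b_logmass):

# 95% central posterior intervals for the coefficients of M_4
as_draws_df(M_4) |>
  subset_draws(variable = c("b_neocortex", "b_logmass")) |>
  summarise_draws(~ quantile2(.x, probs = c(0.025, 0.975)))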

3 Sleep study

Belenky et al. (2003) collected data on the effect of chronic sleep restriction. We use a subset of the data available in the R package lme4 (Bates et al., 2015). The data contain average reaction times (in milliseconds) for 18 subjects with sleep restricted to 3 hours per night for 7 consecutive nights (days 0 and 1 were adaptation and training, and are removed from this analysis).

data(sleepstudy, package="lme4")
sleepstudy2 <- sleepstudy |>
  dplyr::filter(Days >= 2)
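
A quick look at the size of the filtered data set, which is relevant for the N < 100 criterion of diag_diff discussed in the introduction:

# number of observations and subjects after dropping the adaptation and training days
sleepstudy2 |>
  dplyr::summarise(n_obs = dplyr::n(), n_subjects = dplyr::n_distinct(Subject))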

The compared models are a linear model, a linear model with varying intercept for each subject, and a linear model with varying intercept and slope for each subject. All models use a normal data model. The models were fitted using brms (Bürkner, 2017) with the default brms priors: the prior for the coefficient of Days is uniform, the prior for the varying intercept is normal with an unknown scale having a half-normal prior, and the prior for the varying intercepts and slopes is bivariate normal with unknown scales having half-normal priors and the correlation having an LKJ prior (Lewandowski, Kurowicka and Joe, 2009).

Using the brms formula notation, the compared models are \begin{align*} \mathrm{M}_1: \quad & \mathrm{Reaction} \sim \mathrm{Days} \\ \mathrm{M}_2: \quad & \mathrm{Reaction} \sim \mathrm{Days} + (1\,\mid\,\mathrm{Subject}) \\ \mathrm{M}_3: \quad & \mathrm{Reaction} \sim \mathrm{Days} + (\mathrm{Days}\,\mid\, \mathrm{Subject}). \end{align*}

Based on the study design, \mathrm{M}_3 is the appropriate model for the analysis, but comparing models is useful for assessing how much information the data has about the varying intercepts and slopes.

We fit the three models and use add_criterion() to store the loo object inside the brmsfit objects.

M_1 <- brm(Reaction ~ Days,
           data = sleepstudy2,
           family = gaussian(),
           refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE)
M_2 <- brm(Reaction ~ Days + (1 | Subject),
           data = sleepstudy2,
           family = gaussian(),
           refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE)
Warning: Found 1 observations with a pareto_k > 0.7 in model 'brm(Reaction ~
Days + (1 | Subject), data = sleepstudy2, family = gaussian(), refresh = 0)'.
We recommend to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
M_3 <- brm(Reaction ~ Days + (Days | Subject),
           data = sleepstudy2,
           family = gaussian(),
           refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE)
Warning: Found 3 observations with a pareto_k > 0.7 in model 'brm(Reaction ~
Days + (Days | Subject), data = sleepstudy2, family = gaussian(), refresh =
0)'. We recommend to set 'moment_match = TRUE' in order to perform moment
matching for problematic observations.

We get some warnings about high Pareto-k values, but we don’t fix these yet. If the high Pareto-k diagnostic messages were missed, loo_compare() reminds us about them.

loo_compare(M_1, M_2, M_3)
 model elpd_diff se_diff p_worse diag_diff      diag_elpd
   M_3       0.0     0.0      NA           3 k_psis > 0.7
   M_2     -14.7     9.3    0.94           1 k_psis > 0.7
   M_1     -79.6    20.2    1.00                         

Alternatively, we can use tinytable for prettier output and more precise control over the number of digits shown.

loo_compare(M_1, M_2, M_3) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-") |>
  format_tt(i=3, j=4, digits=5)
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3 0 0 - 3 k_psis > 0.7
M_2 -15 9.3 0.94 1 k_psis > 0.7
M_1 -80 20 0.99996

High Pareto-k values may be due to model misspecification or to a flexible but well-specified model, and they can lead to optimistic elpd_loo estimates. Especially if the PSIS-LOO computation for the model with the best elpd_loo has many high Pareto-k values, the p_worse values may also be biased. To improve the accuracy of the elpd computation, we use the argument reloo = TRUE, which re-runs MCMC for the problematic leave-one-out folds.

M_2 <- M_2 |> add_criterion(criterion="loo", save_psis = TRUE,
                            reloo = TRUE, overwrite = TRUE)
M_3 <- M_3 |> add_criterion(criterion="loo", save_psis = TRUE,
                            reloo = TRUE, overwrite = TRUE)

If we now re-run loo_compare(), there are no more warnings about high Pareto-k values. The elpd_diff and se_diff values have changed a bit, but in this case the p_worse values are about the same.

loo_compare(M_1, M_2, M_3) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-") |>
  format_tt(i=3, j=4, digits=5)
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3 0 0 -
M_2 -14 9.7 0.92
M_1 -78 21 0.99992

Although we can get more accurate elpd and elpd_diff estimates with more computation, the accuracy of the normal approximation for the uncertainty in the difference can still be compromised by the outliers. We check calibration of the predictive distributions with LOO-PIT-ECDF plots (Tesso and Vehtari, 2026).

pp_check(M_1, type = "loo_pit_ecdf", method = "correlated")

pp_check(M_2, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 1 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

pp_check(M_3, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 3 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

The pooled model \mathrm{M}_1 seems to be reasonably calibrated. The varying intercept model \mathrm{M}_2 has a slightly S-shaped LOO-PIT-ECDF curve, but the pre-defined diagnostic threshold is not exceeded. The varying intercept and slope model \mathrm{M}_3 has a more clearly S-shaped LOO-PIT-ECDF curve, and the diagnostic strongly indicates miscalibration. In general, an S-shaped LOO-PIT-ECDF curve indicates a too wide predictive distribution, and the red emphasis near 0 and 1 indicates outliers.

We can see the outliers better in the LOO-intervals plot. These outliers inflate the normal model residual variance, making the predictive distribution too wide for other observations.

pp_check(M_3, type = "loo_intervals")
Warning: Found 3 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

If we examine the pointwise differences, we see that model \mathrm{M}_3 is almost always better, but the outliers cause a couple of values with large magnitude, making it more likely that the normal approximation for quantifying the uncertainty in elpd_diff is not accurate.

ggplot(data=NULL, aes(x=1:nrow(sleepstudy2),
           pointwise(M_3$criteria$loo, "elpd_loo")-pointwise(M_2$criteria$loo, "elpd_loo"))) +
  geom_point() +
  hline_0(alpha=0.5) +
  labs(x="Data index", y="pointwise elpd_diff")

We next fit the same models with a Student’s t data model, creating models \mathrm{M}_{1t}, \mathrm{M}_{2t}, and \mathrm{M}_{3t}. We now use reloo = TRUE right away.

M_1t <- brm(Reaction ~ Days,
            data = sleepstudy2,
            family = student(),
            refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)
M_2t <- brm(Reaction ~ Days + (1 | Subject),
            data = sleepstudy2,
            family = student(),
            refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)
M_3t <- brm(Reaction ~ Days + (Days | Subject),
            data = sleepstudy2,
            family = student(),
            refresh=0) |>
  add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)

The LOO-PIT-ECDF plots now look better.

pp_check(M_2t, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 1 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.

pp_check(M_3t, type = "loo_pit_ecdf", method = "correlated")

We first compare \mathrm{M}_{3} and \mathrm{M}_{3t} to see whether a Student’s t model has better predictive performance.

loo_compare(loo(M_3), loo(M_3t)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-", digits=3)
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3t 0 0 -
M_3 -41 13 0.999

Although in this comparison \mathrm{M}_3 is misspecified, the better specified model \mathrm{M}_{3t} shows much better predictive performance, and as we can expect se_diff to be inflated, the actual probability that \mathrm{M}_{3t} is better than \mathrm{M}_{3} is likely to be bigger than 0.999.

We then compare the three Student’s t models:

loo_compare(loo(M_1t), loo(M_2t), loo(M_3t)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-")
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3t 0 0 -
M_2t -45 8.6 1
M_1t -119 16 1

The probability that model \mathrm{M}_{3t} is better than models \mathrm{M}_{1t} and \mathrm{M}_{2t} is close to 1. The models appear sufficiently well specified, the number of observations is bigger than 100, and the differences are not small, so we can assume that the normal approximation is well calibrated.

If we examine the pointwise differences, we see that model \mathrm{M}_{3t} is almost always better than model \mathrm{M}_{2t}, and as there are no outliers, the distribution of pointwise elpd differences is such that the normal approximation for quantifying the uncertainty in elpd_diff is likely to be accurate.

ggplot(data=NULL, aes(x=1:nrow(sleepstudy2),
           pointwise(M_3t$criteria$loo, "elpd_loo")-pointwise(M_2t$criteria$loo, "elpd_loo"))) +
  geom_point() +
  hline_0(alpha=0.5) +
  labs(x="Data index", y="pointwise elpd_diff")

In this case, the effect of days with sleep restricted to 3 hours is so big that the main conclusion stays the same with all the models. Still, for example, model \mathrm{M}_{3t} does indicate higher variation between subjects than model \mathrm{M}_{3} (see the sketch below). As \mathrm{M}_{3t} passes the model checking and has higher predictive performance, we should continue looking at the posterior of model \mathrm{M}_{3t}.
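
One way to see that higher between-subject variation is to compare the posteriors of the group-level standard deviations in the two models. A minimal sketch, assuming brms’ default naming of these parameters (sd_Subject__Intercept and sd_Subject__Days):

# Summarise the between-subject SDs for the normal and Student's t models.
sd_summary <- function(fit, label) {
  as_draws_df(fit) |>
    subset_draws(variable = "sd_Subject__", regex = TRUE) |>
    summarise_draws() |>
    dplyr::mutate(model = label, .before = 1)
}
dplyr::bind_rows(sd_summary(M_3, "M_3"), sd_summary(M_3t, "M_3t"))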

4 Roaches

Gelman and Hill (2007) describe the roaches data in Chapter 8.3 as follows: "the treatment and control were applied to 160 and 104 apartments, respectively, and the outcome measurement y_i in each apartment i was the number of roaches caught in a set of traps. Different apartments had traps for different numbers of days". The goal is to estimate the efficacy of a pest management system at reducing the number of roaches.

# roaches data from the rstanarm package
data(roaches, package = "rstanarm")
# Roach1 is very skewed and we take a square root
roaches$sqrt_roach1 <- sqrt(roaches$roach1)

The target is the number of roaches (y), and the covariates include the square root of the pre-treatment number of roaches (sqrt_roach1), a treatment indicator variable (treatment), and a variable indicating whether the apartment is in a building restricted to elderly residents (senior). As the number of days for which the roach traps were used is not the same for all apartments, the offset argument includes the logarithm of the number of days the traps were used (log(exposure2)). The latent regression model presented with brms formula notation is:

\mathrm{y} \sim \mathrm{sqrt\_roach1} + \mathrm{treatment} + \mathrm{senior} + \mathrm{offset(log(exposure2))}.

We fit the following models using the brms package. \begin{align*} \mathrm{M}_1: \quad & \text{Poisson} \\ \mathrm{M}_2: \quad & \text{Negative binomial} \\ \mathrm{M}_3: \quad & \text{Zero-inflated negative binomial} \end{align*}

The zero-inflation is modelled using the same latent formula (with its own parameters). All coefficients have \operatorname{normal}(0,1) priors and the negative-binomial shape parameter has the brms default prior, which is inverse-gamma(.4, .3) (Vehtari, 2024).

For the Poisson model we re-ran MCMC for all LOO-folds with high Pareto-\hat{k} diagnostic value (>0.7) (with reloo = TRUE in brms), and for negative-binomial and zero-inflated negative-binomial we used moment matching (Paananen et al., 2021) for a few LOO-folds with high Pareto-\hat{k} diagnostic value (>0.7) (with moment_match = TRUE in brms).

M_1 <- brm(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
           family = poisson(),
           data = roaches,
           prior = c(prior(normal(0,1), class = "b")),
           seed = 1704009,
           refresh = 0) |>
  add_criterion(criterion = "loo", save_psis = TRUE, reloo = TRUE)
Warning: Ignoring relative efficiencies as some were NA. See argument 'r_eff'
in ?loo::loo for more details.
M_2 <- brm(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
           family = negbinomial(),
           data = roaches,
           prior = c(prior(normal(0,1), class = "b"),
                   prior(inv_gamma(0.4, 0.3), class = "shape")),
           save_pars = save_pars(all = TRUE), # save all draws for moment matching
           seed = 1704009,
           refresh = 0) |>
  add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)
M_3 <- brm(bf(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
              zi ~ sqrt_roach1 + treatment + senior + offset(log(exposure2))),
           family = zero_inflated_negbinomial(),
           data = roaches,
           prior = c(prior(normal(0,1), class = "b"),
                   prior(normal(0,1), class = "b", dpar = "zi"),
                   prior(normal(0,1), class = "Intercept", dpar = "zi"),
                   prior(inv_gamma(0.4, 0.3), class = "shape")),
           save_pars = save_pars(all = TRUE), # save all draws for moment matching
           seed = 1704009,
           refresh = 0) |>
  add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)

Compare with loo_compare()

loo_compare(loo(M_1), loo(M_2), loo(M_3)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-", digits=4)
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3 0 0 -
M_2 -23 6.9 0.9995
M_1 -4635 686 1

The zero-inflated negative-binomial model (\mathrm{M}_3) is clearly the best. Based on model checking (see Roaches case study), the Poisson model (\mathrm{M}_1) is underdispersed which indicates Scenario 2 (model misspecification), but the difference is so big that we can be certain that the zero-inflated negative-binomial model is better. As the number of observations is larger than 100, and the difference to model \mathrm{M}_2 is not small, we may assume the normal approximation is well calibrated.
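
As an example of such a check, we can compare the proportion of zero counts in the posterior predictive draws to the observed proportion; a minimal sketch using a user-defined test statistic:

# posterior predictive check for the proportion of zero counts
prop_zero <- function(y) mean(y == 0)
pp_check(M_1, type = "stat", stat = "prop_zero")
pp_check(M_3, type = "stat", stat = "prop_zero")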

As we had used an ad-hoc square root transformation of pre-treatment number of roaches, we fitted a model \mathrm{M}_4 replacing the latent linear term for the square root of pre-treatment number of roaches with a spline.

M_4 <- brm(bf(y ~ s(sqrt_roach1) + treatment + senior + offset(log(exposure2)),
              zi ~ s(sqrt_roach1) + treatment + senior + offset(log(exposure2))),
           family = zero_inflated_negbinomial(), data = roaches,
           prior = c(prior(normal(0,1), class = "b"),
                   prior(normal(0,1), class = "b", dpar = "zi"),
                   prior(normal(0,1), class = "Intercept", dpar = "zi"),
                   prior(inv_gamma(0.4, 0.3), class = "shape")),
           save_pars = save_pars(all = TRUE),
           seed = 1704009,
           refresh = 0) |> 
  add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)

Compare with loo_compare()

loo_compare(loo(M_3), loo(M_4)) |>
  dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
  tt() |>
  format_tt(j=4, replace = "-")
model elpd_diff se_diff p_worse diag_diff diag_elpd
M_4 0 0 -
M_3 -2.4 3.2 0.78 |elpd_diff| < 4

Model \mathrm{M}_4 (with the spline) seems to be slightly better, but now the difference is so small (Scenario 1) that the normal approximation is unlikely to be perfectly calibrated. As the difference is small, we can proceed with either model.

In the milk example above, |elpd_diff| was also small, but diag_diff showed only N < 100, as that was the dominating issue. In this case N > 100, and now the |elpd_diff| < 4 diagnostic is shown.

References

Bates, D., Maechler, M., Bolker, B. and Walker, S. (2015) “Fitting linear mixed-effects models using lme4,” Journal of Statistical Software, 67, pp. 1–48.
Belenky, G. et al. (2003) “Patterns of performance degradation and restoration during sleep restriction and subsequent recovery: A sleep dose-response study,” Journal of Sleep Research, 12, pp. 1–12.
Bürkner, P.-C. (2017) “brms: An R package for Bayesian multilevel models using Stan,” Journal of Statistical Software, 80, pp. 1–28.
Bürkner, P.-C., Gabry, J., Kay, M. and Vehtari, A. (2024) “posterior: Tools for working with posterior distributions.”
Gelman, A. and Hill, J. (2007) Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Lewandowski, D., Kurowicka, D. and Joe, H. (2009) “Generating random correlation matrices based on vines and extended onion method,” Journal of Multivariate Analysis, 100, pp. 1989–2001.
McElreath, R. (2020) Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
Paananen, T., Piironen, J., Bürkner, P.-C. and Vehtari, A. (2021) “Implicitly adaptive importance sampling.” Statistics and Computing, 31(16).
Sivula, T., Magnusson, M., Matamoros, A. A. and Vehtari, A. (2025a) “Uncertainty in Bayesian leave-one-out cross-validation based model comparison,” Bayesian Analysis. doi: 10.1214/25-BA1569.
Sivula, T., Magnusson, M., Matamoros, A. A. and Vehtari, A. (2025b) “Uncertainty in Bayesian leave-one-out cross-validation based model comparison,” Bayesian Analysis. doi: 10.1214/25-BA1569.
Tesso, H. and Vehtari, A. (2026) “LOO-PIT predictive model checking,” arXiv preprint arXiv:2603.02928.
Vehtari, A. (2024) “Default prior for negative-binomial shape parameter.”
Vehtari, A., Gabry, J., Magnusson, M., Yao, Y. and Gelman, A. (2022) “loo: Efficient leave-one-out cross-validation and WAIC for Bayesian models.” Available at: https://mc-stan.org/loo.
Vehtari, A., Gelman, A. and Gabry, J. (2017) “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC,” Statistics and Computing, 27(5), pp. 1413–1432.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. and Bürkner, P.-C. (2021) “Rank-normalization, folding, and localization: An improved \widehat{R} for assessing convergence of MCMC (with discussion),” Bayesian Analysis, 16(2), pp. 667–718.
Vehtari, A., Simpson, D., Gelman, A., Yao, Y. and Gabry, J. (2024) “Pareto smoothed importance sampling,” Journal of Machine Learning Research, 25(72).
Wang, W. and Gelman, A. (2015) “Difficulty of selecting among multilevel models using predictive accuracy,” Statistics and Its Interface, 8(2), pp. 153–160.