library(tidyr)
library(dplyr)
library(tibble)
#library(loo)
devtools::load_all("~/proj/loo_diag/")
library(brms)
options(brms.backend = "cmdstanr", mc.cores = 4)
library(posterior)
options(posterior.num_args=list(digits=2))
library(pillar)
options(pillar.negative = FALSE)
library(tinytable)
options(tinytable_format_num_fmt = "significant_cell",
tinytable_format_digits = 2,
tinytable_tt_digits=2)
library(ggplot2)
library(ggdist)
#library(bayesplot)
devtools::load_all("~/proj/bayesplot/")
theme_set(bayesplot::theme_default(base_family = "sans", base_size = 14))

Uncertainty in Bayesian LOO-CV Model Comparison
1 Introduction
This case study is a new version of the case study in Section 5 of the paper Uncertainty in Bayesian leave-one-out cross-validation based model comparison (Sivula et al., 2025a). The original case study remains available for reproducing the results in the paper. This version uses more recent loo package features and includes more model checking and more explanation.
1.1 Estimating and comparing predictive performance
When building Bayesian models for prediction, a key question is: which model predicts new data better? Since we cannot directly compute the predictive performance for unseen data, we estimate it using cross-validation. Here we focus specifically on leave-one-out cross-validation (LOO-CV), which fits the model to all observations except one, scores the held-out observation under the leave-one-out predictive distribution, and repeats this for every observation. We use the fast PSIS-LOO method (Vehtari, Gelman and Gabry, 2017) implemented in the loo R package (Vehtari et al., 2022).
An estimate of the model’s predictive accuracy for out-of-sample data is the sum of the pointwise log scores, elpd_loo. For model comparison, the quantity of interest is elpd_diff, the difference in elpd_loo between two models. See CV-FAQ for explanations of the different terms in Bayesian cross-validation.
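As a concrete sketch of what elpd_loo measures, the following brute-force loop scores each observation under a simple linear model fitted without it. This is a plug-in (non-Bayesian) illustration with made-up data: PSIS-LOO avoids the n refits, and a fully Bayesian version would average over the posterior instead of plugging in point estimates.

```r
# Brute-force LOO log score for a simple linear model (plug-in sketch,
# not the Bayesian computation done by the loo package)
set.seed(1)
n <- 30
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
loo_lpd <- vapply(seq_len(n), function(i) {
  fit <- lm(y ~ x, data = d[-i, ])                  # fit without observation i
  mu <- predict(fit, newdata = d[i, , drop = FALSE])# prediction for held-out i
  sigma <- summary(fit)$sigma                       # plug-in residual scale
  dnorm(d$y[i], mean = mu, sd = sigma, log = TRUE)  # pointwise log score
}, numeric(1))
elpd_loo <- sum(loo_lpd)                            # sum of pointwise log scores
```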
The loo package provides loo_compare(), which reports elpd_diff with respect to the model with the highest elpd_loo, and the associated standard error se_diff. Additionally, based on a normal approximation of the uncertainty in the difference, the quantity p_worse gives the probability that a model has worse predictive performance than the model with the best estimated predictive performance.
A useful way to read these three quantities together is:
- elpd_diff: the magnitude of the estimated predictive performance difference,
- se_diff: the uncertainty around that estimate,
- p_worse: the implied probability that the model is truly worse than the current best, assuming the normal approximation is well calibrated.
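Under the normal approximation, p_worse is simply the probability that the true difference is negative given the estimate. A minimal sketch with made-up numbers (the helper name here is illustrative, not the loo package's internal function):

```r
# Illustrative helper: Pr(true elpd difference < 0) under the
# normal approximation elpd_diff ~ normal(elpd_diff_hat, se_diff)
p_worse_normal <- function(elpd_diff, se_diff) {
  pnorm(0, mean = elpd_diff, sd = se_diff)
}
p_worse_normal(-4, 2)   # pnorm(2), about 0.977
```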
1.2 Scenarios when the normal approximation may be miscalibrated
Sivula et al. (2025b) analyse the conditions under which the normal approximation underlying se_diff and p_worse holds, and identify three scenarios where it may be miscalibrated:
- Scenario 1 – Similar predictions. When the models make similar predictions, there is not much difference in predictive performance and we can use either model for prediction. In practice, se_diff tends to be underestimated and p_worse overestimated.
- Scenario 2 – Model misspecification with outliers. Outlier observations in the data can lead to a biased estimator and thus poor calibration. Model misspecification should be detected and addressed through model checking before the LOO-CV comparison is used for final model selection.
- Scenario 3 – Small number of observations. LOO-CV cannot reliably detect small differences in predictive performance if the number of observations is small, and se_diff should be treated as likely underestimated. A conservative heuristic is to double se_diff, which gives a more cautious estimate of the uncertainty.
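For intuition, elpd_diff and se_diff are computed from the pointwise elpd differences: the estimate is their sum, and the standard error is the usual normal-approximation standard error of a sum. A sketch on a toy vector (the numbers below are made up):

```r
# Toy pointwise elpd differences d_i = elpd_loo_i(A) - elpd_loo_i(B)
d <- c(1.2, -0.3, 0.8, 2.1, -0.5, 0.9)
elpd_diff <- sum(d)                    # estimated difference: 4.2
se_diff   <- sd(d) * sqrt(length(d))   # SE of the sum of n terms
c(elpd_diff = elpd_diff, se_diff = se_diff)
```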
1.3 Diagnostics in loo_compare()
To make these scenarios visible in practice, loo_compare() includes two diagnostic columns (since version v2.10.0):
- diag_diff is a pairwise comparison diagnostic tied to Scenarios 1 and 3. It flags cases where se_diff, and consequently p_worse, are likely to be miscalibrated, currently using two criteria: |elpd_diff| < 4 for Scenario 1, and N < 100 for Scenario 3. These criteria follow a priority hierarchy: when both apply, only N < 100 is shown, as a small sample size already undermines the reliability of se_diff, which also renders the |elpd_diff| < 4 threshold less informative.
- diag_elpd flags unreliable PSIS-LOO computation for an individual model. It is triggered when the Pareto-\hat{k} diagnostic identifies highly influential observations. This diagnostic is qualitatively different from the calibration issues above: here the elpd_loo estimate itself may be biased.
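The priority hierarchy can be sketched as a small helper; this illustrates the logic described above and is not the loo package's actual implementation:

```r
# Sketch of the diag_diff priority hierarchy (illustrative only)
diag_diff_sketch <- function(elpd_diff, n) {
  if (n < 100) {
    "N < 100"                 # Scenario 3 dominates when both criteria apply
  } else if (abs(elpd_diff) < 4) {
    "|elpd_diff| < 4"         # Scenario 1
  } else {
    ""                        # no calibration flag
  }
}
diag_diff_sketch(-3.1, 17)    # "N < 100" even though |elpd_diff| < 4
diag_diff_sketch(-2.4, 150)   # "|elpd_diff| < 4"
```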
1.4 Examples
Each real-data example is chosen to illustrate the three scenarios described above:
- Scenario 1 similar predictions: primate milk
- Scenario 2 model misspecification: roaches, sleepstudy
- Scenario 3 small data: primate milk
In all examples, the true data-generating process is more complex than any of the models considered — the realistic applied setting. The goal is to show how diag_diff and diag_elpd appear in practice, and how they should shape interpretation of the loo_compare() output.
All models were fit using MCMC with 4 chains, 1000 warmup and 1000 sampling iterations. Convergence was assessed using the diagnostics of Vehtari et al. (2021) via the posterior package (Bürkner et al., 2024). LOO-CV was computed with the loo package (Vehtari et al., 2022) using fast PSIS-LOO (Vehtari, Gelman and Gabry, 2017).
2 Primate milk
McElreath (2020) describes the primate milk data: “A popular hypothesis has it that primates with larger brains produce more energetic milk, so that brains can grow quickly… The question here is to what extent energy content of milk, measured here by kilocalories, is related to the percent of the brain mass that is neocortex… We’ll end up needing female body mass as well, to see the masking that hides the relationships among the variables.” The data include 17 different primate species. The target variable is the energy content of milk (kcal.per.g) and the covariates are the percent of the brain mass that is neocortex (neocortex) and the logarithm of female body mass (log(mass)). The covariates and target are centered and scaled to have unit variance.
data(milk)
milk <- milk |>
drop_na() |>
mutate(neocortex = neocortex.perc / 100)
head(milk)
             clade            species kcal.per.g perc.fat perc.protein
1 Strepsirrhine Eulemur fulvus 0.49 16.60 15.42
2 New World Monkey Alouatta seniculus 0.47 21.22 23.58
3 New World Monkey A palliata 0.56 29.66 23.46
4 New World Monkey Cebus apella 0.89 53.41 15.80
5 New World Monkey S sciureus 0.92 50.58 22.33
6 New World Monkey Cebuella pygmaea 0.80 41.35 20.85
perc.lactose mass neocortex.perc neocortex
1 67.98 1.95 55.16 0.5516
2 55.20 5.25 64.54 0.6454
3 46.88 5.37 64.54 0.6454
4 30.79 2.51 67.64 0.6764
5 27.09 0.68 68.85 0.6885
6 37.80 0.12 58.85 0.5885
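The centering and scaling mentioned above can be done with base R's scale(); a minimal sketch on a toy vector (the rendered chunk does not show this step, and the actual analysis may apply it per column):

```r
# Center and scale to zero mean and unit variance (base R)
x <- c(0.49, 0.47, 0.56, 0.89, 0.92, 0.80)
z <- as.numeric(scale(x))       # (x - mean(x)) / sd(x)
c(mean = mean(z), sd = sd(z))   # 0 and 1 (up to rounding)
```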
We use the following four models with weakly informative normal(0, 1) priors for the coefficients and an exponential(1) prior for the residual scale: \begin{align*} \mathrm{M}_1: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha, \sigma) \\ \mathrm{M}_2: \quad& \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_1\times\mathrm{neocortex}, \sigma) \\ \mathrm{M}_3: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_2\times\mathrm{log(mass)}, \sigma) \\ \mathrm{M}_4: \quad & \mathrm{kcal.per.g} \sim \operatorname{normal}(\alpha + \beta_1\times\mathrm{neocortex} + \beta_2\times\mathrm{log(mass)}, \sigma). \end{align*}
We fit the models with the brms package (Bürkner, 2017),
M_1 <- brm(kcal.per.g ~ 1,
data = milk,
seed = 2030,
refresh = 0)
M_2 <- brm(kcal.per.g ~ neocortex,
prior = prior(normal(0, 1), class = "b"),
data = milk,
seed = 2030,
refresh = 0)
M_3 <- brm(kcal.per.g ~ log(mass),
prior = prior(normal(0, 1), class = "b"),
data = milk,
seed = 2030,
refresh = 0)
M_4 <- brm(kcal.per.g ~ neocortex + log(mass),
prior = prior(normal(0, 1), class = "b"),
data = milk,
seed = 2030,
refresh = 0)
We compute LOO-CV estimates using the fast PSIS-LOO method (Vehtari et al., 2024). We use add_criterion() to store the loo object inside the brmsfit objects.
M_1 <- add_criterion(M_1, criterion = "loo")
M_2 <- add_criterion(M_2, criterion = "loo")
M_3 <- add_criterion(M_3, criterion = "loo")
M_4 <- add_criterion(M_4, criterion = "loo")
There are no high Pareto-k warnings, which indicates there are no highly influential observations that could signal model misspecification.
We can assess how well the predictive distributions are calibrated with LOO-PIT predictive checking (Tesso and Vehtari, 2026).
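The logic behind LOO-PIT checking: if the predictive distribution is well calibrated, the PIT values Pr(y_rep ≤ y_i) are uniform on (0, 1). A minimal simulation sketch where calibration holds by construction (the parameters are made up):

```r
# PIT values under a correctly specified predictive distribution are uniform
set.seed(1)
y <- rnorm(1000, mean = 2, sd = 3)    # data from the "true" model
pit <- pnorm(y, mean = 2, sd = 3)     # PIT under the true predictive dist
c(mean = mean(pit), sd = sd(pit))     # near 0.5 and sqrt(1/12) ~ 0.289
```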
pp_check(M_1, type = "loo_pit_ecdf", method = "correlated")
pp_check(M_2, type = "loo_pit_ecdf", method = "correlated")
pp_check(M_3, type = "loo_pit_ecdf", method = "correlated")
pp_check(M_4, type = "loo_pit_ecdf", method = "correlated")
All LOO-PITs are close to uniformly distributed, which indicates good calibration.
We compare the models \mathrm{M}_1, \mathrm{M}_2,\mathrm{M}_3,\mathrm{M}_4.
(loo_comp <- loo_compare(M_1, M_2, M_3, M_4))
      model elpd_diff se_diff p_worse diag_diff diag_elpd
M_4 0.0 0.0 NA
M_3 -3.1 1.1 1.00 N < 100
M_1 -3.3 2.0 0.95 N < 100
M_2 -3.7 2.1 0.96 N < 100
As the data set contains only N = 17 primate species, diag_diff shows N < 100 for all comparisons, and we get a message advising to read more in loo-glossary.
We can also use tinytable for prettier table formatting:
loo_comp |>
dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
tt() |>
format_tt(replace = "-")
| model | elpd_diff | se_diff | p_worse | diag_diff | diag_elpd |
|---|---|---|---|---|---|
| M_4 | 0 | 0 | - | | |
| M_3 | -3.1 | 1.1 | 1 | N < 100 | |
| M_1 | -3.3 | 2 | 0.95 | N < 100 | |
| M_2 | -3.7 | 2.1 | 0.96 | N < 100 |
Based on the model checking above, the models seem to be reasonably specified and we are fine with respect to Scenario 2 (model misspecification). Models \mathrm{M}_1, \mathrm{M}_2 and \mathrm{M}_3 have small {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl({{\mathrm{M}_a,\mathrm{M}_b}} \mid {y}\bigr)} compared to model \mathrm{M}_4. The direct use of the normal approximation gives probabilities 1, 0.95 and 0.96 that these models have worse predictive performance than model \mathrm{M}_4. As {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl({{\mathrm{M}_a,\mathrm{M}_b}} \mid {y}\bigr)} is small (Scenario 1) and the number of observations is small (Scenario 3), we may assume {\widehat{\mathrm{SE}}_\mathrm{\scriptscriptstyle LOO}\bigl({{\mathrm{M}_a,\mathrm{M}_b}} \mid {y}\bigr)} to be underestimated and the error distribution to be more skewed than normal. However, since {\widehat{\mathrm{elpd}}_\mathrm{\scriptscriptstyle LOO}\bigl({{\mathrm{M}_a,\mathrm{M}_b}} \mid {y}\bigr)} is small, we can state that there is no practical or statistical difference in the predictive performance.
We can get more cautious estimates of p_worse by multiplying se_diff by 2, and in this case get 0.92, 0.8, and 0.81. This doesn’t correct the skewness, but provides some idea of the sensitivity of the estimates. Considering we have only 17 observations, it is natural that there is uncertainty in the comparisons. Collecting more data is recommended (if possible).
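Assuming p_worse is the normal tail probability pnorm(0, elpd_diff, se_diff), the more cautious values quoted above follow directly from the comparison table by doubling se_diff:

```r
# Recompute p_worse with se_diff doubled (values from the comparison table)
elpd_diff <- c(-3.1, -3.3, -3.7)   # M_3, M_1, M_2 relative to M_4
se_diff   <- c( 1.1,  2.0,  2.1)
round(pnorm(0, mean = elpd_diff, sd = 2 * se_diff), 2)   # 0.92 0.80 0.81
```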
As the predictive distribution includes the aleatoric uncertainty (modelled by the data model), there is often more uncertainty in the predictive performance comparison than in the posterior distribution (see, e.g., Wang and Gelman, 2015). In simple models, we can also look at the posterior of the quantities of interest. With model \mathrm{M}_4, the 95\% central posterior intervals for \beta_1 and \beta_2 are (1.1,3.7) and (-0.12,-0.04) respectively, which indicates that the data have information about the parameters. The covariates neocortex and log(mass) are collinear, which causes correlation in the posterior of the coefficients. This could make the marginal posteriors overlap 0 even if the joint posterior does not, in which case looking at the predictive performance is useful. In this case, although neocortex and log(mass) are collinear, neither has useful information alone; the useful predictive information lies along the second principal component of their joint distribution, which explains why the models with only one of the covariates are not better than the intercept-only model.
3 Sleep study
Belenky et al. (2003) collected data on the effect of chronic sleep restriction. We use a subset of data in the R package lme4 (Bates et al., 2015). The data contains average reaction times (in milliseconds) for 18 subjects with sleep restricted to 3 hours per night for 7 consecutive nights (days 0 and 1 were adaptation and training and removed from this analysis).
data(sleepstudy, package="lme4")
sleepstudy2 <- sleepstudy |>
dplyr::filter(Days >= 2)
The compared models are a linear model, a linear model with a varying intercept for each subject, and a linear model with a varying intercept and slope for each subject. All models use a normal data model. The models were fitted using brms (Bürkner, 2017) with the default brms priors: the prior for the coefficient for Days is uniform, the prior for the varying intercept is normal with unknown scale having a half-normal prior, and the prior for the varying intercept and slope is bivariate normal with unknown scales having half-normal priors and the correlation having an LKJ prior (Lewandowski, Kurowicka and Joe, 2009).
Using the brms formula notation, the compared models are \begin{align*}
\mathrm{M}_1: \quad & \mathrm{Reaction} \sim \mathrm{Days} \\
\mathrm{M}_2: \quad & \mathrm{Reaction} \sim \mathrm{Days} + (1\,\mid\,\mathrm{Subject}) \\
\mathrm{M}_3: \quad & \mathrm{Reaction} \sim \mathrm{Days} + (\mathrm{Days}\,\mid\, \mathrm{Subject}).
\end{align*}
Based on the study design, \mathrm{M}_3 is the appropriate model for the analysis, but comparing models is useful for assessing how much information the data has about the varying intercepts and slopes.
We fit the three models and use add_criterion() to store the loo object inside the brmsfit objects.
M_1 <- brm(Reaction ~ Days,
data = sleepstudy2,
family = gaussian(),
refresh=0) |>
add_criterion(criterion="loo", save_psis = TRUE)
M_2 <- brm(Reaction ~ Days + (1 | Subject),
data = sleepstudy2,
family = gaussian(),
refresh=0) |>
add_criterion(criterion="loo", save_psis = TRUE)
Warning: Found 1 observations with a pareto_k > 0.7 in model 'brm(Reaction ~
Days + (1 | Subject), data = sleepstudy2, family = gaussian(), refresh = 0)'.
We recommend to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.
M_3 <- brm(Reaction ~ Days + (Days | Subject),
data = sleepstudy2,
family = gaussian(),
refresh=0) |>
add_criterion(criterion="loo", save_psis = TRUE)
Warning: Found 3 observations with a pareto_k > 0.7 in model 'brm(Reaction ~
Days + (Days | Subject), data = sleepstudy2, family = gaussian(), refresh =
0)'. We recommend to set 'moment_match = TRUE' in order to perform moment
matching for problematic observations.
We get some warnings about high Pareto-k values, but we don’t fix these yet. If one misses the high Pareto-k diagnostic messages, loo_compare() reminds about them.
loo_compare(M_1, M_2, M_3)
      model elpd_diff se_diff p_worse diag_diff diag_elpd
M_3 0.0 0.0 NA 3 k_psis > 0.7
M_2 -14.7 9.3 0.94 1 k_psis > 0.7
M_1 -79.6 20.2 1.00
Alternatively, we can use tinytable for prettier output and more precise control of the number of digits shown.
loo_compare(M_1, M_2, M_3) |>
dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
tt() |>
format_tt(j=4, replace = "-") |>
format_tt(i=3, j=4, digits=5)
| model | elpd_diff | se_diff | p_worse | diag_diff | diag_elpd |
|---|---|---|---|---|---|
| M_3 | 0 | 0 | - | | 3 k_psis > 0.7 |
| M_2 | -15 | 9.3 | 0.94 | | 1 k_psis > 0.7 |
| M_1 | -80 | 20 | 0.99996 |
High Pareto-k values may be due to model misspecification or a well-specified but flexible model, and they can lead to optimistic elpd_loo estimates. Especially if the PSIS-LOO computation for the model with the best elpd_loo has many high Pareto-k values, the p_worse values may also be biased. To improve the accuracy of the elpd computation, we use the argument reloo = TRUE.
M_2 <- M_2 |> add_criterion(criterion="loo", save_psis = TRUE,
reloo = TRUE, overwrite = TRUE)
M_3 <- M_3 |> add_criterion(criterion="loo", save_psis = TRUE,
reloo = TRUE, overwrite = TRUE)
If we now re-run loo_compare(), there are no more warnings about high Pareto-k values. The elpd_diff and se_diff values have changed a bit, but in this case the p_worse values are about the same.
loo_compare(M_1, M_2, M_3) |>
dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
tt() |>
format_tt(j=4, replace = "-") |>
format_tt(i=3, j=4, digits=5)
| model | elpd_diff | se_diff | p_worse | diag_diff | diag_elpd |
|---|---|---|---|---|---|
| M_3 | 0 | 0 | - | | |
| M_2 | -14 | 9.7 | 0.92 | ||
| M_1 | -78 | 21 | 0.99992 |
Although we can get more accurate elpd and elpd_diff estimates with more computation, the accuracy of the normal approximation for the uncertainty in the difference can still be compromised by the outliers. We check calibration of the predictive distributions with LOO-PIT-ECDF plots (Tesso and Vehtari, 2026).
pp_check(M_1, type = "loo_pit_ecdf", method = "correlated")
pp_check(M_2, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 1 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
pp_check(M_3, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 3 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
The pooled model \mathrm{M}_1 seems reasonably calibrated. The varying intercept model \mathrm{M}_2 has a slightly S-shaped LOO-PIT-ECDF curve, but the pre-defined diagnostic threshold is not exceeded. The varying intercept and slope model \mathrm{M}_3 has a more clearly S-shaped LOO-PIT-ECDF curve, and the diagnostic strongly indicates miscalibration. In general, an S-shaped LOO-PIT-ECDF curve indicates a too wide predictive distribution, and the red emphasis near 0 and 1 indicates outliers.
We can see the outliers better in the LOO-intervals plot. These outliers inflate the normal model’s residual variance, making the predictive distribution too wide for the other observations.
pp_check(M_3, type = "loo_intervals")
Warning: Found 3 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
If we examine the pointwise differences, we see that model \mathrm{M}_3 is almost always better, but the outliers produce a couple of values with large magnitude, making it more likely that the normal approximation for quantifying the uncertainty in elpd_diff is not accurate.
ggplot(data=NULL, aes(x=1:nrow(sleepstudy2),
pointwise(M_3$criteria$loo, "elpd_loo")-pointwise(M_2$criteria$loo, "elpd_loo"))) +
geom_point() +
hline_0(alpha=0.5) +
labs(x="Data index", y="pointwise elpd_diff")
We next fit the models using a Student’s t data model to create models \mathrm{M}_{1t}, \mathrm{M}_{2t}, and \mathrm{M}_{3t}. We now use reloo = TRUE right away.
M_1t <- brm(Reaction ~ Days,
data = sleepstudy2,
family = student(),
refresh=0) |>
add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)
M_2t <- brm(Reaction ~ Days + (1 | Subject),
data = sleepstudy2,
family = student(),
refresh=0) |>
add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)
M_3t <- brm(Reaction ~ Days + (Days | Subject),
data = sleepstudy2,
family = student(),
refresh=0) |>
add_criterion(criterion="loo", save_psis = TRUE, reloo = TRUE)
The LOO-PIT-ECDF plots now look better.
pp_check(M_2t, type = "loo_pit_ecdf", method = "correlated")
Warning: Found 1 observations with a pareto_k > 0.7 in model '.x1'. We
recommend to set 'moment_match = TRUE' in order to perform moment matching for
problematic observations.
pp_check(M_3t, type = "loo_pit_ecdf", method = "correlated")
We first compare \mathrm{M}_{3} and \mathrm{M}_{3t} to see whether the Student’s t model has better predictive performance.
loo_compare(loo(M_3), loo(M_3t)) |>
dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
tt() |>
format_tt(j=4, replace = "-", digits=3)
| model | elpd_diff | se_diff | p_worse | diag_diff | diag_elpd |
|---|---|---|---|---|---|
| M_3t | 0 | 0 | - | | |
| M_3 | -41 | 13 | 0.999 |
Although in this comparison \mathrm{M}_3 is misspecified, the better specified model \mathrm{M}_{3t} shows much better predictive performance, and as we can expect se_diff to be inflated, the actual probability that \mathrm{M}_{3t} is better than \mathrm{M}_{3} is likely to be larger than 0.999.
We then compare the three Student’s t models:
loo_compare(loo(M_1t), loo(M_2t), loo(M_3t)) |>
dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
tt() |>
format_tt(j=4, replace = "-")
| model | elpd_diff | se_diff | p_worse | diag_diff | diag_elpd |
|---|---|---|---|---|---|
| M_3t | 0 | 0 | - | | |
| M_2t | -45 | 8.6 | 1 | ||
| M_1t | -119 | 16 | 1 |
The probability that model \mathrm{M}_{3t} is better than models \mathrm{M}_{1t} and \mathrm{M}_{2t} is close to 1. The models appear sufficiently well specified, the number of observations is bigger than 100, and the differences are not small, so we can assume that the normal approximation is well calibrated.
If we examine the pointwise differences, we see that model \mathrm{M}_{3t} is almost always better than model \mathrm{M}_{2t}, and as there are no outliers, the distribution of pointwise elpd differences is such that the normal approximation for quantifying the uncertainty in elpd_diff is likely to be accurate.
ggplot(data=NULL, aes(x=1:nrow(sleepstudy2),
pointwise(M_3t$criteria$loo, "elpd_loo")-pointwise(M_2t$criteria$loo, "elpd_loo"))) +
geom_point() +
hline_0(alpha=0.5) +
labs(x="Data index", y="pointwise elpd_diff")
In this case, the effect of days with sleep restricted to 3 hours is so big that the main conclusion stays the same for all the models. Still, for example, model \mathrm{M}_{3t} does indicate higher variation between subjects than model \mathrm{M}_{3}. As \mathrm{M}_{3t} passes the model checking and has higher predictive performance, we should continue by looking at the posterior of model \mathrm{M}_{3t}.
4 Roaches
Gelman and Hill (2007) describe in Chapter 8.3 the roaches data as follows: “the treatment and control were applied to 160 and 104 apartments, respectively, and the outcome measurement y_i in each apartment i was the number of roaches caught in a set of traps. Different apartments had traps for different numbers of days”. The goal is to estimate the efficacy of a pest management system at reducing the number of roaches.
data(roaches)
# Roach1 is very skewed and we take a square root
roaches$sqrt_roach1 <- sqrt(roaches$roach1)
The target is the number of roaches (y), and the covariates include the square root of the pre-treatment number of roaches (sqrt_roach1), a treatment indicator variable (treatment), and a variable indicating whether the apartment is in a building restricted to elderly residents (senior). As the number of days for which the roach traps were used is not the same for all apartments, the offset argument includes the logarithm of the number of days the traps were used (log(exposure2)). The latent regression model presented with brms formula notation is:
\mathrm{y} \sim \mathrm{sqrt\_roach1} + \mathrm{treatment} + \mathrm{senior} + \mathrm{offset(log(exposure2))}.
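To see what the offset does, note that with a log link, log μ_i = η_i + log(exposure_i), i.e. μ_i = exposure_i × exp(η_i), so the linear predictor models the rate per day of trapping. A simulation sketch with a plain glm and made-up parameters (not the roaches model):

```r
# Offset sketch: Poisson counts proportional to exposure time
set.seed(1)
n <- 1000
exposure <- runif(n, 1, 30)            # e.g. trap-days per apartment
x <- rnorm(n)
mu <- exposure * exp(0.2 + 0.5 * x)    # rate per day times number of days
y <- rpois(n, mu)
fit <- glm(y ~ x, offset = log(exposure), family = poisson)
coef(fit)                              # close to the true (0.2, 0.5)
```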
We fit the following models using the brms package. \begin{align*}
\mathrm{M}_1: \quad & \text{Poisson} \\
\mathrm{M}_2: \quad & \text{Negative-binomial} \\
\mathrm{M}_3: \quad & \text{Zero-inflated negative-binomial} \\
\end{align*}
The zero-inflation is modelled using the same latent formula (with its own parameters). All coefficients have \operatorname{normal}(0,1) priors and the negative-binomial shape parameter has the brms default prior, which is inverse-gamma(0.4, 0.3) (Vehtari, 2024).
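To make the zero-inflation structure concrete, the mixture density can be written out by hand. A sketch using R's dnbinom with the mu/size parameterisation (size plays the role of brms's shape); the parameter values below are made up:

```r
# Zero-inflated negative-binomial density (sketch)
dzinb <- function(y, mu, shape, zi) {
  ifelse(y == 0,
         zi + (1 - zi) * dnbinom(0, mu = mu, size = shape),  # inflated zeros
         (1 - zi) * dnbinom(y, mu = mu, size = shape))
}
# sanity checks: extra mass at zero, and probabilities sum to 1
dzinb(0, mu = 5, shape = 0.5, zi = 0.3)
sum(dzinb(0:10000, mu = 5, shape = 0.5, zi = 0.3))
```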
For the Poisson model we re-ran MCMC for all LOO-folds with high Pareto-\hat{k} diagnostic value (>0.7) (with reloo = TRUE in brms), and for negative-binomial and zero-inflated negative-binomial we used moment matching (Paananen et al., 2021) for a few LOO-folds with high Pareto-\hat{k} diagnostic value (>0.7) (with moment_match = TRUE in brms).
M_1 <- brm(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
family = poisson(),
data = roaches,
prior = c(prior(normal(0,1), class = "b")),
seed = 1704009,
refresh = 0) |>
add_criterion(criterion = "loo", save_psis = TRUE, reloo = TRUE)
Warning: Ignoring relative efficiencies as some were NA. See argument 'r_eff'
in ?loo::loo for more details.
M_2 <- brm(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
family = negbinomial(),
data = roaches,
prior = c(prior(normal(0,1), class = "b"),
prior(inv_gamma(0.4, 0.3), class = "shape")),
seed = 1704009,
refresh = 0) |>
add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)
M_3 <- brm(bf(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
zi ~ sqrt_roach1 + treatment + senior + offset(log(exposure2))),
family = zero_inflated_negbinomial(),
data = roaches,
prior = c(prior(normal(0,1), class = "b"),
prior(normal(0,1), class = "b", dpar = "zi"),
prior(normal(0,1), class = "Intercept", dpar = "zi"),
prior(inv_gamma(0.4, 0.3), class = "shape")),
seed = 1704009,
refresh = 0) |>
add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)
We compare the models with loo_compare():
loo_compare(loo(M_1), loo(M_2), loo(M_3)) |>
dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
tt() |>
format_tt(j=4, replace = "-", digits=4)
| model | elpd_diff | se_diff | p_worse | diag_diff | diag_elpd |
|---|---|---|---|---|---|
| M_3 | 0 | 0 | - | | |
| M_2 | -23 | 6.9 | 0.9995 | ||
| M_1 | -4635 | 686 | 1 |
The zero-inflated negative-binomial model (\mathrm{M}_3) is clearly the best. Based on model checking (see Roaches case study), the Poisson model (\mathrm{M}_1) is underdispersed which indicates Scenario 2 (model misspecification), but the difference is so big that we can be certain that the zero-inflated negative-binomial model is better. As the number of observations is larger than 100, and the difference to model \mathrm{M}_2 is not small, we may assume the normal approximation is well calibrated.
As we had used an ad-hoc square root transformation of pre-treatment number of roaches, we fitted a model \mathrm{M}_4 replacing the latent linear term for the square root of pre-treatment number of roaches with a spline.
M_4 <- brm(bf(y ~ s(sqrt_roach1) + treatment + senior + offset(log(exposure2)),
zi ~ s(sqrt_roach1) + treatment + senior + offset(log(exposure2))),
family = zero_inflated_negbinomial(), data = roaches,
prior = c(prior(normal(0,1), class = "b"),
prior(normal(0,1), class = "b", dpar = "zi"),
prior(normal(0,1), class = "Intercept", dpar = "zi"),
prior(inv_gamma(0.4, 0.3), class = "shape")),
save_pars = save_pars(all = TRUE),
seed = 1704009,
refresh = 0) |>
add_criterion(criterion = "loo", save_psis = TRUE, moment_match = TRUE)
We again compare with loo_compare():
loo_compare(loo(M_3), loo(M_4)) |>
dplyr::select(model, elpd_diff, se_diff, p_worse, diag_diff, diag_elpd) |>
tt() |>
format_tt(j=4, replace = "-")
| model | elpd_diff | se_diff | p_worse | diag_diff | diag_elpd |
|---|---|---|---|---|---|
| M_4 | 0 | 0 | - | | |
| M_3 | -2.4 | 3.2 | 0.78 | \|elpd_diff\| < 4 | |
Model \mathrm{M}_4 (with the spline) seems to be slightly better, but now the difference is so small (Scenario 1) that the normal approximation is likely not perfectly calibrated. As the difference is small, we can proceed with either model.
In the milk example above, |elpd_diff| was also small, but diag_diff showed only N < 100 as that was the dominating issue. In this case N > 100, and the |elpd_diff| < 4 diagnostic is shown.