Roaches cross-validation case study

Author

Aki Vehtari

Published

2017-01-10

Modified

2025-09-03

Setup

Load packages

Code
Sys.setenv(PROCESSX_NOTIFY_OLD_SIGCHLD = 1) # avoid zombies
library(rstantools)
library(brms)
library(cmdstanr)
options(mc.cores = 8)
library(loo)
library(ggplot2)
library(ggdist)
library(bayesplot)
## theme_set(bayesplot::theme_default(base_family = "sans"))
theme_set(bayesplot::theme_default(base_family = "sans", base_size=16))
library(posterior)
options(posterior.num_args=list(digits=2))
library(priorsense)
library(dplyr)
library(tibble)
library(reliabilitydiag)

1 Introduction

This case study demonstrates cross-validation model comparison, and posterior and cross-validation predictive checking of models. Furthermore, the notebook demonstrates how to use integrated PSIS-LOO with varying intercept (“random effect”) models.

The roaches data example comes from Chapter 8.3 of Gelman and Hill (2007).

the treatment and control were applied to 160 and 104 apartments, respectively, and the outcome measurement y_i in each apartment i was the number of roaches caught in a set of traps. Different apartments had traps for different numbers of days

In addition to an intercept, the regression predictors for the model are the pre-treatment number of roaches roach1, the treatment indicator treatment, and a variable indicating whether the apartment is in a building restricted to elderly residents senior. The distribution of roach1 is very skewed and we take its square root. Because the number of days for which the roach traps were used is not the same for all apartments in the sample, we include it as an exposure (exposure2) by adding \ln(u_i) to the linear predictor \eta_i, which can be specified using the offset argument in brms.
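Written out, for example for the Poisson model fitted below, the observation model for apartment i with exposure u_i (exposure2) is

y_i \sim \mathrm{Poisson}\left(\exp(\eta_i + \ln u_i)\right) = \mathrm{Poisson}\left(u_i \exp(\eta_i)\right),

so the offset is simply a term in the linear predictor whose coefficient is fixed to 1.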

Load data

data(roaches, package = "rstanarm")
roaches$sqrt_roach1 <- sqrt(roaches$roach1)
head(roaches) |> print(digits=2)
    y roach1 treatment senior exposure2 sqrt_roach1
1 153  308.0         1      0       0.8        17.5
2 127  331.2         1      0       0.6        18.2
3   7    1.7         1      0       1.0         1.3
4   7    3.0         1      0       1.0         1.7
5   0    2.0         1      0       1.1         1.4
6   0    0.0         1      0       1.0         0.0

2 Poisson model

Fit a Poisson regression model with brms

fit_p <- brm(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
                data = roaches,
                family = poisson,
                prior = prior(normal(0,1), class = b),
                refresh = 0)

Plot posterior

mcmc_areas(as.matrix(fit_p), prob_outer = .999,
           regex_pars = c("sqrt_roach1","treatment","senior")) +
  coord_cartesian(xlim=c(-0.65,0.25))
Figure 1

All marginal posteriors are clearly away from zero. We need to do some model checking before trusting these.

2.1 Posterior predictive checking

Posterior predictive checking can often detect problems and also provide more information about their cause. As the range of counts is large, we use a kernel density estimate plot (Säilynoja et al. 2025).

pp_check(fit_p, type = "dens_overlay", ndraws = 20) +
  scale_x_sqrt(breaks=c(0,1,3,10,30,100,300), lim=c(0,400))
Figure 2

We see that the marginal distribution of the model replicated data is clearly different from the observed data, which is more dispersed. Posterior predictive checking can sometimes be biased due to double use of the data, but in this case the discrepancy is so big that further checks are not needed.
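The overdispersion is also easy to see directly from the data. As a rough numeric check (a small sketch, not part of the original analysis), the marginal variance of the counts is vastly larger than the marginal mean, whereas the Poisson model forces the conditional variance to equal the conditional mean:

mean(roaches$y)
var(roaches$y) / mean(roaches$y)  # far above 1; a Poisson would imply a conditional ratio of 1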

Although in this case the model misspecification is obvious from the kernel density plot, too, for count data a better choice is often the rootogram variant proposed by Säilynoja et al. (2025).

pp_check(fit_p, type = "rootogram", style = "discrete") +
  scale_x_sqrt() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
Figure 3

2.2 Cross-validation model comparison

For demonstration, we show what would happen in cross-validation based model comparison if we ignored the posterior predictive checking result.

We use Pareto-smoothed importance sampling leave-one-out (PSIS-LOO) cross-validation (Vehtari, Gelman, and Gabry 2017) as it is very fast to compute.

fit_p <- add_criterion(fit_p, criterion="loo")
# store the result for later use
loo_p1 <- loo(fit_p)

We get a warning about high Pareto-\hat{k}’s in the PSIS-LOO computation and recompute using moment matching (Paananen et al. 2021), which translates and scales the posterior draws to better match the moments of the leave-one-out posteriors (done only for the folds with high Pareto-\hat{k}).

fit_p <- add_criterion(fit_p, criterion="loo", moment_match=TRUE, overwrite=TRUE)

Now all Pareto-\hat{k}’s are ok, and we examine LOO results.

loo(fit_p)

Computed from 4000 by 262 log-likelihood matrix.

         Estimate     SE
elpd_loo  -5478.2  699.9
p_loo       271.5   61.2
looic     10956.4 1399.8
------
MCSE of elpd_loo is 0.4.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.0]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

p_loo is about 270, which is much higher than the number of parameters (4). This indicates severe model misspecification, which we already saw with posterior predictive checking. The many high Pareto-\hat{k}’s in PSIS-LOO without moment matching were likely also caused by the model misspecification.
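As a small sketch (not in the original code), the quantities referred to above can be extracted directly from the stored LOO object and the fitted model:

loo(fit_p)$estimates["p_loo", ]  # effective number of parameters estimate
nrow(fixef(fit_p))               # number of regression coefficients (incl. intercept)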

For demonstration, we show what would happen if we tried to use cross-validation based model comparison to assess the predictive relevance of the covariates. We later compare these to the results obtained with better models.

We form three models by dropping one covariate at a time.

fit_p_m1 <- update(fit_p, formula = y ~ treatment + senior) |>
  add_criterion(criterion="loo", moment_match=TRUE)
fit_p_m2 <- update(fit_p, formula = y ~ sqrt_roach1 + senior)  |>
  add_criterion(criterion="loo", moment_match=TRUE)
fit_p_m3 <- update(fit_p, formula = y ~ sqrt_roach1 + treatment) |>
  add_criterion(criterion="loo", moment_match=TRUE)

Moment matching is able to assist PSIS-LOO computation and all Pareto \hat{k} values are good.

loo_compare(fit_p,
            fit_p_m1,
            fit_p_m2,
            fit_p_m3,
            model_names=c("Poisson full model",
                          "Poisson without sqrt(roach1)",
                          "Poisson without treatment",
                          "Poisson without senior"))
                             elpd_diff se_diff
Poisson without senior           0.0       0.0
Poisson full model             -12.1     180.5
Poisson without treatment     -213.7     228.4
Poisson without sqrt(roach1) -2752.7     638.0

Based on this, the sqrt_roach1 covariate would be relevant, but although dropping the treatment or senior covariate makes a large change in elpd, the uncertainty is also very large and cross-validation indicates that these covariates are not necessarily relevant. The posterior marginals are conditional on the model, but cross-validation is more cautious by not using any model for the future data distribution.
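For assessing a single covariate, a direct pairwise comparison is clearer, since with only two models loo_compare() reports the elpd difference and its standard error for exactly that pair (a sketch, not part of the original analysis):

loo_compare(fit_p, fit_p_m2,
            model_names=c("Poisson full model",
                          "Poisson without treatment"))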

3 Negative binomial model

We change the Poisson model to a more robust negative binomial model. Often it would be sensible to start with a negative binomial model for counts and skip the Poisson model altogether. The negative binomial shape parameter has the brms default prior, which is inverse-gamma(0.4, 0.3) (Vehtari 2024).

fit_nb <- update(fit_p, family = negbinomial)

Plot posterior

mcmc_areas(as.matrix(fit_nb), prob_outer = .999,
           regex_pars = c("sqrt_roach1", "treatment", "senior"))
Figure 4

The treatment effect is much closer to zero, and the senior effect has a lot of probability mass on both sides of 0. So it matters which model we use, and we should trust the posteriors only if the model passes predictive checking.

3.1 Posterior and LOO predictive checking

We use posterior predictive checking to compare marginal data and predictive distributions.

pp_check(fit_nb, type = "dens_overlay", ndraws=20) +
  scale_x_sqrt(breaks=c(0,1,3,10,30,100,300), lim=c(0,400))
Figure 5

We see that the negative binomial model is much better, although it seems that the model predictive distribution has more mass for small counts than the real data. This discrepancy can also be an artefact of the kernel density estimate, and it is better to examine the discrete rootogram.

pp_check(fit_nb, type = "rootogram", style = "discrete") +
  scale_x_sqrt() +
  coord_cartesian(xlim=c(0, 400)) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
Figure 6

We see that the negative-binomial model is quite good, and there is no clear discrepancy based on the predictive intervals and data.

Instead of looking at the marginal distribution, we can look at the pointwise predictive distribution and corresponding observations.

pp_check(fit_nb, "intervals") +
  scale_y_sqrt() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.9, 0.9))
Figure 7

If the model predictive distributions are well calibrated, then the observations should look as if randomly drawn from the predictive distributions. We can use pit_ecdf to make a probability integral transformation (PIT) plot for comparing the pointwise posterior predictive distributions and the data. One PIT value is the cumulative probability of one observation under the corresponding conditional predictive distribution. If the pointwise predictive distributions are well calibrated, the PIT values are (almost) uniformly distributed. The PIT-ECDF plot compares the empirical CDF of the observed PIT values to the uniform distribution.
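As a sketch of what goes into such a plot (not part of the original code, and using the randomized PIT needed for discrete outcomes), the PIT values can be computed directly from posterior predictive draws:

yrep <- posterior_predict(fit_nb)
pit <- sapply(seq_along(roaches$y), function(i)
  runif(1, mean(yrep[, i] < roaches$y[i]), mean(yrep[, i] <= roaches$y[i])))
hist(pit)  # roughly uniform if the predictive distributions are well calibrated

The pit_ecdf check below makes the corresponding comparison of the PIT ECDF against the uniform distribution.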

pp_check(fit_nb, type = "pit_ecdf")
Figure 8

Now that the posterior predictive check looks quite good, it is useful to be more careful and use LOO predictive checking, too. We first run the PSIS-LOO computation to check that it works.

fit_nb <- add_criterion(fit_nb, criterion="loo")
# store the result for later use
loo_nb1 <- loo(fit_nb)

We get a warning about a high Pareto-\hat{k} and, to improve the computation accuracy, we re-run with moment matching. We will later need some intermediate Pareto smoothed importance sampling results, so we save them, too.

fit_nb <- add_criterion(fit_nb, criterion="loo",
                        moment_match = TRUE, save_psis=TRUE, overwrite = TRUE)

Let’s look at the LOO results.

(loo_nb <- loo(fit_nb))

Computed from 4000 by 262 log-likelihood matrix.

         Estimate   SE
elpd_loo   -881.9 38.3
p_loo         8.5  3.8
looic      1763.9 76.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.3]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

All Pareto-\hat{k}’s are good, indicating that the PSIS-LOO computation with moment matching works well. p_loo is now much closer to the actual number of parameters, but still slightly larger than the total number of parameters (5), which is slightly suspicious. p_loo is small compared to the number of observations, and we may expect pointwise LOO predictive intervals to be similar to pointwise posterior predictive intervals, and the LOO-PIT-ECDF to look similar to the posterior PIT-ECDF. Although we used moment matching above, some intermediate results are not stored as they may take a lot of memory, so we use the moment matching argument again when computing the LOO intervals and LOO-PIT’s.

pp_check(fit_nb, "loo_intervals", moment_match = TRUE) +
  scale_y_sqrt() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.9, 0.9))
Figure 9
pp_check(fit_nb, type = "loo_pit_ecdf", moment_match = TRUE)
Figure 10

As we guessed, there is not much difference between the LOO and posterior intervals, or between the LOO-PIT-ECDF and PIT-ECDF.

In LOO model comparison the negative binomial model is much better than the Poisson model.

loo_compare(fit_p,
            fit_nb,
            model_names=c("Poisson full model",
                          "Negative binomial full model"))
                             elpd_diff se_diff
Negative binomial full model     0.0       0.0
Poisson full model           -4596.3     679.8

The next comparison shows the result without the moment matching improved computation. The difference is now a bit smaller, but still huge. This illustrates that we don’t always need to improve the PSIS-LOO computation to get rid of all high Pareto-\hat{k}’s, if the difference between the models is clearly bigger than the possible bias.

loo_compare(list(`Poisson full model`=loo_p1,
                 `Negative binomial full model`= loo_nb1))
                             elpd_diff se_diff
Negative binomial full model     0.0       0.0
Poisson full model           -4578.2     674.6

As the Poisson is a special case of the negative binomial, instead of using cross-validation model comparison, we could have also seen that the Poisson is not a good model by looking at the posterior of the shape parameter (which concentrates on very small values, implying strong overdispersion and a model far from the Poisson special case), and there would have been no need to fit the Poisson model at all.

mcmc_areas(as.matrix(fit_nb), prob_outer = .999,
           pars = c("shape"))
Figure 11

The posterior predictive rootogram and LOO-PIT-ECDF looked good, but for discrete models where some discrete values have relatively high probability they may miss things, and it is good to examine more carefully how well calibrated the predictive probabilities are. In roaches, 36% of the observations are 0, and it makes sense to look at how well we predict zeros versus non-zeros. To check the calibration of the binary target predictive probabilities, we use the PAV-adjusted reliability diagram (Dimitriadis, Gneiting, and Jordan 2021) as recommended by Säilynoja et al. (2025). Although in this case the difference is small, for demonstration we use LOO predictive probabilities instead of posterior predictive probabilities for non-zeros. In case of good calibration, the calibration line would stay mostly inside the envelope.

rd=reliabilitydiag(EMOS = E_loo((posterior_predict(fit_nb)>0)+0,
                                loo_nb$psis_object)$value,
                   y = as.numeric(roaches$y>0))
autoplot(rd)+
  labs(x="Predicted probability of non-zero",
       y="Conditional event probabilities")+
  bayesplot::theme_default(base_family = "sans", base_size=16)
Figure 12

There is a slight miscalibration, indicated by the red curve being well outside the blue envelope in places. We later build a zero-inflated model to test whether that would improve the calibration and predictive performance.

4 Poisson model with varying intercepts

Sometimes overdispersion is modelled by adding varying intercepts (“random effects”) for each individual. The negative binomial model can be considered a mixture of Poisson distributions with a gamma mixing distribution. It is common to use a normal prior for the varying intercepts on the log intensity scale. This kind of varying intercept model may fit the data better than the negative binomial model for two reasons. First, it is possible that normal variation on the log intensity scale is closer to the actual variation. Second, explicitly representing the latent values and using a prior makes the model slightly more flexible, as the posterior of the varying intercepts can differ from the prior. However, in this case, with just one observation per varying intercept, it is likely that the posterior is close to normal, as the likelihood contribution from one observation per varying intercept is weak. The following example demonstrates the computational challenges with the varying intercept approach.
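As a small simulation sketch (not part of the original analysis, with arbitrary illustrative values), mixing Poisson rates over a gamma distribution reproduces a negative binomial, which is the connection mentioned above:

set.seed(1)
shape <- 2; mu <- 10
lambda <- rgamma(1e5, shape = shape, rate = shape / mu)  # gamma mixing distribution with mean mu
y_mix <- rpois(1e5, lambda)                              # Poisson counts given the mixed rates
y_nb  <- rnbinom(1e5, size = shape, mu = mu)             # direct negative binomial draws
c(var(y_mix), var(y_nb))                                 # both close to mu + mu^2/shape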

We add a varying intercept for each observation with a normal prior using the formula term (1 | id). We run the sampling four times longer to get high enough effective sample sizes.

roaches$id <- 1:dim(roaches)[1]
fit_pvi <- brm(y ~ sqrt_roach1 + treatment + senior + (1 | id) + offset(log(exposure2)),
                  data = roaches, family = poisson, 
                  prior = prior(normal(0, 1), class = b),
                  warmup = 1000, iter = 5000, thin = 4,
                  refresh = 0)

4.1 Analyse posterior

Plot posterior

mcmc_areas(as.matrix(fit_pvi), prob_outer = .999,
    regex_pars = c("sqrt_roach1", "treatment", "senior"))
Figure 13

The marginals are similar to those of the negative binomial model, but slightly closer to 0.

4.2 Cross-validation checking

Compute LOO using PSIS-LOO.

fit_pvi <- add_criterion(fit_pvi, criterion="loo")
loo(fit_pvi)

Computed from 4000 by 262 log-likelihood matrix.

         Estimate   SE
elpd_loo   -632.0 24.1
p_loo       167.8  4.6
looic      1264.1 48.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.7, 1.1]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)      53   20.2%   123     
   (0.7, 1]   (bad)      182   69.5%   <NA>    
   (1, Inf)   (very bad)  27   10.3%   <NA>    
See help('pareto-k-diagnostic') for details.

p_loo is about 168, which is less than the number of parameters (267), but it is relatively large compared to the number of observations (p_loo >> N/5), which indicates a very flexible model. In this case, this is due to having an intercept parameter for each observation. Removing one observation changes the posterior for that intercept so much that importance sampling fails (even with Pareto smoothing). The very large number of high Pareto-\hat{k} values is also probably due to the very flexible model. We can try to improve the computation with moment matching. By default brm() does not store all the parameter draws needed for moment matching, and we need to re-run with save_pars = save_pars(all = TRUE) before using it.
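Before that, we can list which observations have problematic Pareto-\hat{k} values (a quick sketch, not part of the original code):

pareto_k_ids(loo(fit_pvi), threshold = 0.7)  # indices of observations with khat > 0.7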

fit_pvi <- update(fit_pvi, save_pars = save_pars(all = TRUE))
fit_pvi <- add_criterion(fit_pvi, criterion="loo", moment_match = TRUE, overwrite = TRUE)
loo(fit_pvi)

Moment matching is able to reduce the number of high Pareto-\hat{k}’s from 204 to 46, which is still a lot. Varying coefficient models are one special case where moment matching is likely not able to help enough, as the posterior of the group specific parameter changes too much and moment matching of a high dimensional non-normal posterior cannot compensate. brms supports running MCMC for the LOO folds with high Pareto-\hat{k}’s with reloo = TRUE, but in this case that would require 46 refits. We can instead use K-fold-CV to re-fit the model 10 times, each time leaving out 10% of the observations. This shows that cross-validation itself is not infeasible for varying parameter models.

(kcvpvi <- kfold(fit_pvi, K = 10))

The loo package allows comparing PSIS-LOO and K-fold-CV results.

loo_compare(list(`Negative binomial`=loo(fit_nb),
                 `Poisson varying intercepts`=kcvpvi))
                           elpd_diff se_diff
Poisson varying intercepts  0.0       0.0   
Negative binomial          -2.3       7.8   

There is not much difference, and the uncertainty is big. To verify this, we can run K-fold-CV also for the negative binomial model.

(kcvnb <- kfold(fit_nb, K=10))

Based on 10-fold cross-validation.

           Estimate   SE
elpd_kfold   -878.9 37.8
p_kfold         5.4  2.9
kfoldic      1757.8 75.7
loo_compare(list(`Negative binomial`=kcvnb,
                 `Poisson varying intercepts`=kcvpvi))
                           elpd_diff se_diff
Negative binomial           0.0       0.0   
Poisson varying intercepts -0.8       7.2   

The difference in predictive performance is very similar.

Now that we’ve seen that based on robust K-fold-CV there is not much difference between negative-binomial and varying-intercept-Poisson models, we can also check how bad the comparison would have been with PSIS-LOO with high Pareto-\hat{k} warnings.

loo_compare(fit_nb, fit_pvi)
        elpd_diff se_diff
fit_pvi    0.0       0.0 
fit_nb  -276.0      18.3 

If we had ignored the Pareto-\hat{k} warnings, we would have mistakenly concluded that the varying intercept model is much better. Note that WAIC is (as usual) even worse (see also Vehtari, Gelman, and Gabry (2017)).

loo_compare(waic(fit_nb), waic(fit_pvi))
        elpd_diff se_diff
fit_pvi    0.0       0.0 
fit_nb  -309.5      18.2 

4.3 Posterior predictive checking

We do posterior predictive checking for varying intercept Poisson model.

pp_check(fit_pvi, type = "dens_overlay", ndraws = 20) +
  scale_x_sqrt(breaks=c(0,1,3,10,30,100,300), lim=c(0,400))
Figure 14

The match looks perfect, but that can be explained by having one parameter for each observation and by the kernel density estimate hiding remaining discrepancies.

The discrete rootogram also looks fine.

pp_check(fit_pvi, type = "rootogram", style = "discrete") +
  scale_x_sqrt() +
  coord_cartesian(xlim=c(0, 400)) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
Figure 15

PIT-ECDF plot seems to indicate problems

pp_check(fit_pvi, type = "pit_ecdf")
Figure 16

There are too many PIT values near 0.5. If we look at the predictive intervals and observations, we see that many of the observations are in the middle of the posterior predictive interval, which can be explained by the very flexible model with one parameter for each observation. Posterior predictive checking is likely to fail to detect problems with such flexible models (those with a big p_loo).

pp_check(fit_pvi, "intervals") +
  scale_y_sqrt() +
  theme(legend.position = "inside",
        legend.position.inside = c(0.9, 0.9))
Figure 17

LOO-PIT’s can take into account the model flexibility, if the computation works. In this case LOO-PIT’s look slightly better, but still showing problems, but this is because PSIS-LOO fails (as discussed above)

pp_check(fit_pvi, type = "loo_pit_ecdf")
Figure 18

5 Poisson model with varying intercept and integrated LOO

Removing one observation changes the posterior for that intercept so much that importance sampling fails (even with Pareto smoothing). We can get improved stability by integrating that intercept out with something more accurate than importance sampling (Vehtari et al. 2016). If there is only one group or individual specific parameter, we can integrate it out easily with 1D adaptive quadrature. Two parameters can be handled with nested 1D quadratures, but for more parameters nested quadrature is likely to be too slow and other integration methods are needed.
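As a small R sketch of the idea (not part of the original code, with arbitrary illustrative values), the varying intercept can be integrated out of the Poisson likelihood with adaptive quadrature; this mirrors what integrate_1d() does inside the Stan program shown below:

# log p(y_i | mu_i, sigmaz) with the intercept z integrated out numerically
int_loglik <- function(y_i, mu_i, sigmaz) {
  integrand <- function(z) {
    p <- dnorm(z, 0, sigmaz) * dpois(y_i, exp(z + mu_i))
    ifelse(is.finite(p), p, 0)  # guard against overflow, as in the Stan integrand
  }
  log(integrate(integrand, -Inf, Inf, rel.tol = 1e-6)$value)
}
int_loglik(y_i = 3, mu_i = log(2), sigmaz = 1)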

As we can easily integrate out only 1 (or 2) parameters (per group / individual), it’s not easy to make a generic approach for rstanarm and brms, and thus we illustrate the approach with direct Stan model code and cmdstanr.

The following Stan model uses an individual specific intercept term z[i] (with a common prior with scale sigmaz). In the generated quantities block, the usual log_lik computation is replaced with the integrated approach. We also generate y_loorep, which is a draw from the LOO predictive distribution given the parameters other than z. This is needed to get the correct LOO predictive distributions when combined with integrated PSIS-LOO.

poisson_vi_int <- "poisson_vi_integrate.stan"
writeLines(readLines(poisson_vi_int))
// Poisson regression with hierarchical intercept (varying effect)
functions {
  real integrand(real z, real notused, array[] real theta,
               array[] real X_i, array[] int y_i) {
    real sigmaz = theta[1];
    real mu_i = theta[2];
    real p = exp(normal_lpdf(z | 0, sigmaz) + poisson_log_lpmf(y_i | z + mu_i));
    return (is_inf(p) || is_nan(p)) ? 0 : p;
  }
}
data {
  int<lower=0> N;           // number of data points
  int<lower=0> P;           // number of covariates
  matrix[N,P] X;            // covariates
  array[N] int<lower=0> y;  // target
  vector[N] offsett;        // offset (offset variable name is reserved)
  real integrate_1d_reltol;
}
parameters {
  real alpha;               // intercept
  vector[P] beta;           // slope
  vector[N] z;              // individual intercept (varying effect)
  real<lower=0> sigmaz;     // prior scale for z
}
model {
  // priors
  alpha ~ normal(0, 1);
  beta ~ normal(0, 1);
  z ~ normal(0, sigmaz);
  sigmaz ~ normal(0, 1);
  // observation model
  y ~ poisson_log_glm(X, z+offsett+alpha, beta); 
}
generated quantities {
  // log_lik for PSIS-LOO
  vector[N] log_lik;
  vector[N] y_rep;
  vector[N] y_loorep;
  for (i in 1:N) {
    // z as posterior draws, this would be challenging for PSIS-LOO (and WAIC)
    // log_lik[i] = poisson_log_glm_lpmf({y[i]} | X[i,], z[i]+offsett[i]+alpha, beta);
    // posterior predictive replicates conditional on p(z[i] | y[i])
    // in theory these could be used with above (non-integrated) log_lik and PSIS-LOO
    // to get draws from LOO predictive distribution, but the distribution of the
// ratios from the above log_lik is bad
    real mu_i = offsett[i] + alpha + X[i,]*beta;
    y_rep[i] = poisson_log_rng(z[i] + mu_i);
    // we can integrate each z[i] out with 1D adaptive quadrature to get more
    // stable log_lik and corresponding importance ratios
    log_lik[i] = log(integrate_1d(integrand,
                              negative_infinity(),
                              positive_infinity(),
                              append_array({sigmaz}, {mu_i}),
                  {0}, // not used, but an empty array not allowed
                  {y[i]},
                  integrate_1d_reltol));
    // conditional LOO predictive replicates conditional on p(z[i] | sigmaz, mu_i)
    // these combined with integrated log_lik and PSIS-LOO provide
    // more stable LOO predictive distributions
    y_loorep[i] = poisson_log_rng(normal_rng(0, sigmaz) + mu_i);
  }
}

We could also move the integrated likelihood to the model block and not use MCMC to sample the varying intercepts at all. This would make the marginal posterior easier for MCMC, but using quadrature integration N times for each leapfrog step in Hamiltonian Monte Carlo sampling would increase the sampling time more than the simpler marginal posterior would save. When we do the integration in the generated quantities block, the quadrature is computed only once per saved iteration, making the computation faster.

Next we compile the model, prepare the data, and sample. integrate_1d_reltol sets the relative tolerance for the adaptive 1D quadrature function integrate_1d (if, with another model and data, you see messages about the error estimate of the integral exceeding the given relative tolerance times the norm of the integral, you will get NaNs and need to increase the relative tolerance).

We increase the number of sampling iterations to improve effective sample size needed for better LOO-PIT plot.

mod_p_vi <- cmdstan_model(stan_file = poisson_vi_int)
datap <- list(N = dim(roaches)[1],
              P = 3,
              offsett = log(roaches$exposure2),
              X = roaches[,c('sqrt_roach1','treatment','senior')],
              y = roaches$y,
              integrate_1d_reltol = 1e-6)
fit_p_vi <- mod_p_vi$sample(data = datap, refresh = 0, chains = 4, parallel_chains = 4,
                            iter_sampling = 8000, thin = 8)

The posterior is similar to that of the varying intercept Poisson model above, as it should be, since the generated quantities block does not affect the posterior.

mcmc_areas(as_draws_matrix(fit_p_vi$draws(variables=c('beta','sigmaz'))),
           prob_outer = .999)
Figure 19

Now the PSIS-LOO doesn’t give warnings and the result is close to K-fold-CV.

(loo_p_vi <- fit_p_vi$loo(save_psis=TRUE))

Computed from 4000 by 262 log-likelihood matrix.

         Estimate   SE
elpd_loo   -878.3 38.2
p_loo         4.6  0.5
looic      1756.6 76.4
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.2, 0.8]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

Comparing to the negative binomial model shows only a slight improvement in predictive performance (there may be a yhash warning, which in this case can be ignored).

loo_compare(list(`Poisson varying intercepts w. int-LOO` = loo_p_vi,
            `Negative binomial` = loo_nb))
                                      elpd_diff se_diff
Poisson varying intercepts w. int-LOO  0.0       0.0   
Negative binomial                     -3.6       7.7   

The LOO-PIT plot looks good, although with a bit more variation, which is probably due to lower effective sample sizes caused by the posterior being more challenging than for the negative binomial model.

ppc_loo_pit_ecdf(y=roaches$y,
                 yrep=fit_p_vi$draws(variables="y_loorep", format = "matrix"),
                 psis_object = loo_p_vi$psis_object)
Figure 20

We check the calibration of the predictive probabilities for zeros vs non-zeros. The calibration plot looks better than with the negative binomial model: the predicted probabilities have a wider range and the calibration curve stays better inside the envelope. Not shown here, but the varying intercepts are quite close to normally distributed (as we expected), and thus the difference to the negative binomial comes mostly from the different distribution for the individual variation.

rd=reliabilitydiag(EMOS = E_loo((fit_p_vi$draws(variables="y_loorep", format = "matrix")>0)+0,
                                loo_p_vi$psis_object)$value,
                   y = as.numeric(roaches$y>0))
autoplot(rd)+
  labs(x="Predicted probability of non-zero",
       y="Conditional event probabilities")+
  bayesplot::theme_default(base_family = "sans", base_size=16)
Figure 21

6 Zero-inflated negative-binomial model

As the proportion of zeros is quite high in the data and the calibration plot for the negative binomial model indicated slight miscalibration in the prediction of zeros and non-zeros, it is worthwhile to also test a zero-inflated negative binomial model, which is a mixture of two models:

  • a logistic regression to model the proportion of extra zero counts
  • a negative binomial model
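As a sketch of the resulting observation model (not from the original, with hypothetical parameter values), the zero-inflated negative binomial probability mass function combines these two parts:

# With probability zi the count is an extra zero, otherwise it comes from NB(mu, shape)
dzinb <- function(y, mu, shape, zi) {
  ifelse(y == 0,
         zi + (1 - zi) * dnbinom(0, size = shape, mu = mu),
         (1 - zi) * dnbinom(y, size = shape, mu = mu))
}
dzinb(0:3, mu = 50, shape = 0.3, zi = 0.2)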

Fit the zero-inflated negative binomial model.

fit_zinb <-
  brm(bf(y ~ sqrt_roach1 + treatment + senior + offset(log(exposure2)),
         zi ~ sqrt_roach1 + treatment + senior + offset(log(exposure2))),
      family=zero_inflated_negbinomial(), data=roaches,
      prior=c(prior(normal(0,1), class='b'),
              prior(normal(0,1), class='b', dpar='zi'),
              prior(normal(0,1), class='Intercept', dpar='zi')),
      seed=1704009, refresh=1000)

We compute PSIS-LOO also for the zero-inflated negative binomial model.

fit_zinb <- add_criterion(fit_zinb, criterion='loo', save_psis=TRUE)

We get a warning about one high Pareto-\hat{k} in PSIS-LOO, which we can fix with moment matching.

fit_zinb <- add_criterion(fit_zinb, criterion='loo', save_psis=TRUE,
                          moment_match = TRUE, overwrite = TRUE)

p_loo is close to the total number of parameters in the model (9), which is a good sign.

(loozinb <- loo(fit_zinb))

Computed from 4000 by 262 log-likelihood matrix.

         Estimate   SE
elpd_loo   -859.1 37.8
p_loo        10.2  2.7
looic      1718.2 75.7
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.5]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

Zero-inflated negative binomial model has clearly better predictive performance than the negative binomial.

loo_compare(fit_nb, fit_zinb,
            model_names=c("Negative binomial",
                          "Zero-inflated negative binomial"))
                                elpd_diff se_diff
Zero-inflated negative binomial   0.0       0.0  
Negative binomial               -22.9       7.0  

Posterior predictive checking looks good, but there is no clear difference from the negative binomial model when looking at the marginal predictive distributions or LOO-PIT.

pp_check(fit_zinb, type='dens_overlay', ndraws = 20) +
    scale_x_sqrt(breaks=c(0,1,3,10,30,100,300), lim=c(0,400))
Figure 22
pp_check(fit_zinb, type = "rootogram", style = "discrete") +
  scale_x_sqrt() +
  coord_cartesian(xlim=c(0, 400)) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.8, 0.8))
Figure 23
pp_check(fit_zinb, type='pit_ecdf')
Figure 24

LOO-PIT-ECDF

pp_check(fit_zinb, type='loo_pit_ecdf', moment_match = TRUE)
Figure 25

The reliability diagram assessing the calibration of the predicted probabilities of zero vs non-zero looks clearly better than the one for the negative binomial model.

rd=reliabilitydiag(EMOS = E_loo((posterior_predict(fit_zinb)>0)+0,
                                loozinb$psis_object)$value,
                   y = as.numeric(roaches$y>0))
autoplot(rd)+
  labs(x="Predicted probability of non-zero",
       y="Conditional event probabilities")+
  bayesplot::theme_default(base_family = "sans", base_size=16)
Figure 26

Although the models are different, with finite data and wide LOO predictive distributions there is a limit to which differences can be seen in LOO-PIT values. Both the negative binomial and the zero-inflated negative binomial are close enough that LOO-PIT can’t detect a discrepancy from the data, but elpd_loo and the calibration plot were able to show that the zero-inflation component improves the predictive accuracy and calibration.

6.1 Analyse posterior

Plot posterior

mcmc_areas(as.matrix(fit_zinb)[,3:8], prob_outer = .999)
Figure 27

The posterior marginals for the negative binomial part are similar to the marginals in the plain negative binomial model. The marginal effects for the logistic part have the opposite sign, as the logistic part is modelling the extra zeros.

The treatment effect is now divided between the negative binomial and logistic parts. We can use the model to make predictions for the expected number of roaches with and without treatment to get a single marginal posterior to examine.

Expectations of posterior predictive distributions given treatment=0 and treatment=1

pred_zinb <- posterior_epred(fit_zinb,
                           newdata=rbind(mutate(roaches, treatment=0),
                                         mutate(roaches, treatment=1)))

Ratio of expected number of roaches with vs without treatment

ratio_zinb <- array(rowMeans(pred_zinb[,263:524]/pred_zinb[,1:262]),
                    c(1000, 4, 1)) |>
  as_draws_df() |>
  set_variables(variables='ratio')
ratio_zinb |>
  ggplot(aes(x=ratio)) +
  stat_dots(quantiles=100) +
  stat_slab(fill=NA, color="gray") +
  labs(x='Ratio of roaches with vs without treatment', y=NULL) +
  scale_y_continuous(breaks=NULL) +
  theme(axis.line.y=element_blank(),
        strip.text.y=element_blank()) +
  xlim(c(0,1)) +
  geom_vline(xintercept=1, linetype='dotted')
Figure 28

The treatment clearly reduces the expected number of roaches.

Assuming our model is causally sensible, we can trust the posterior more if the model has well calibrated predictive distributions (passes posterior and LOO predictive, and calibration checking). For illustration purposes, we compare the posteriors using Poisson, negative binomial and zero-inflated negative binomial.

pred_p <- posterior_epred(fit_p,
                           newdata=rbind(mutate(roaches, treatment=0),
                                         mutate(roaches, treatment=1)))
ratio_p <- array(rowMeans(pred_p[,263:524]/pred_p[,1:262]),
                 c(1000, 4, 1)) |>
  as_draws_df() |>
  set_variables(variables='ratio')
pred_nb <- posterior_epred(fit_nb,
                           newdata=rbind(mutate(roaches, treatment=0),
                                         mutate(roaches, treatment=1)))
ratio_nb <- array(rowMeans(pred_nb[,263:524]/pred_nb[,1:262]),
                  c(1000, 4, 1)) |>
  as_draws_df() |>
  set_variables(variables='ratio')
clr <- khroma::colour("bright",names=FALSE)(7)
ratio_zinb |>
  ggplot(aes(x=ratio)) +
  stat_slab(data=ratio_p, fill=NA, color=clr[1], alpha=0.6) +
  stat_slab(data=ratio_nb, fill=NA, color=clr[2], alpha=0.6) +
  stat_slab(fill=NA, color=clr[3], alpha=0.6) +
  labs(x='Ratio of roaches with vs without treatment', y=NULL) +
  scale_y_continuous(breaks=NULL) +
  theme(axis.line.y=element_blank(),
        strip.text.y=element_blank()) +
  xlim(c(0,1)) +
  geom_vline(xintercept=1, linetype='dotted') +
  annotate(geom = "text", x=0.58, y=0.9, hjust=0, label="Poisson", color=clr[1], size=5) +
  annotate(geom = "text", x=0.33, y=0.9, hjust=1, label="NB", color=clr[2], size=5) +
  annotate(geom = "text", x=0.44, y=0.9, hjust=0, label="ZINB", color=clr[3], size=5)
Figure 29

All models show a benefit of the treatment, but the Poisson model is overconfident with a too narrow posterior. The posteriors using the negative binomial (NB) and zero-inflated negative binomial (ZINB) are quite similar, and in this case it would probably not matter for decision making even if the slightly worse negative binomial model had been used.

6.2 Prior sensitivity analysis

Finally, we perform a prior sensitivity analysis by power-scaling both the prior and the likelihood. As there is posterior dependency between the negative binomial and zero-inflation coefficients, it is not that useful to look at the prior sensitivity of the individual parameters. We focus on the actual quantity of interest, that is, the ratio of the expected number of roaches with vs without treatment.

powerscale_sensitivity(fit_zinb, prediction = \(x, ...) ratio_zinb) |>
                         filter(variable=='ratio') |>
                         mutate(across(where(is.double),  ~num(.x, digits=2)))
Sensitivity based on cjs_dist
Prior selection: all priors
Likelihood selection: all data
# A tibble: 1 × 4
  variable  prior likelihood diagnosis
  <chr>     <num>      <num> <chr>    
1 ratio    0.0408      0.133 -        

There is no prior sensitivity and the likelihood is informative.

6.3 Predictive relevance of covariates

Let’s finally check cross-validation model comparison to see whether improved model has effect on the predictive performance comparison.

fit_zinb_m2 <-
  update(fit_zinb,
         formula=bf(y ~ sqrt_roach1 + senior + offset(log(exposure2)),
                    zi ~ sqrt_roach1 + senior + offset(log(exposure2)))) |>
  add_criterion(criterion="loo", moment_match=TRUE)
loo_compare(fit_zinb,
            fit_zinb_m2,
            model_names=c("ZINB full model",
                          "ZINB without treatment"))
                       elpd_diff se_diff
ZINB full model         0.0       0.0   
ZINB without treatment -8.6       4.6   

The treatment effect improves the predictive performance with about 97% probability (the normal approximation can be trusted as 1) the number of observations is larger than 100, 2) the elpd difference is bigger than 4, 3) there are no clear outliers, and 4) the distribution of the pointwise differences has finite variance as tested with the Pareto-\hat{k} diagnostic (Sivula, Magnusson, and Vehtari 2020; Vehtari et al. 2024)).
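The quoted probability is a normal-approximation tail probability computed from the comparison table above (a sketch, not part of the original code):

pnorm(8.6 / 4.6)  # Pr(elpd difference in favour of the full model > 0) under the normal approximation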


References

Dimitriadis, T., T. Gneiting, and A. I. Jordan. 2021. “Stable Reliability Diagrams for Probabilistic Classifiers.” Proceedings of the National Academy of Sciences 118 (e2016191118).
Paananen, Topi, Juho Piironen, Paul-Christian Bürkner, and Aki Vehtari. 2021. “Implicitly Adaptive Importance Sampling.” Statistics and Computing 31 (16).
Säilynoja, T., A. R. Johnson, O. A. Martin, and A. Vehtari. 2025. “Recommendations for Visual Predictive Checks in Bayesian Workflow.”
Sivula, Tuomas, Måns Magnusson, and Aki Vehtari. 2020. “Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison.” arXiv:2008.10296.
Vehtari, Aki. 2024. “Default Prior for Negative-Binomial Shape Parameter.”
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.
Vehtari, Aki, Tommi Mononen, Ville Tolvanen, Tuomas Sivula, and Ole Winther. 2016. “Bayesian Leave-One-Out Cross-Validation Approximations for Gaussian Latent Variable Models.” The Journal of Machine Learning Research 17 (1): 3581–3618.
Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2024. “Pareto Smoothed Importance Sampling.” Journal of Machine Learning Research 25 (72): 1–58.

Licenses

  • Code © 2025, Aki Vehtari, licensed under BSD-3.
  • Text © 2025, Aki Vehtari, licensed under CC-BY-NC 4.0.

Original Computing Environment

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Helsinki
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reliabilitydiag_0.2.1 tibble_3.3.0          dplyr_1.1.4          
 [4] priorsense_1.1.0      posterior_1.6.1       bayesplot_1.14.0     
 [7] ggdist_3.3.3          ggplot2_3.5.2         loo_2.8.0            
[10] cmdstanr_0.9.0.9000   brms_2.22.12          Rcpp_1.1.0           
[13] rstantools_2.4.0.9000

loaded via a namespace (and not attached):
  [1] Rdpack_2.6.4            gridExtra_2.3           inline_0.3.21          
  [4] sandwich_3.1-1          rlang_1.1.6             magrittr_2.0.3         
  [7] multcomp_1.4-28         matrixStats_1.5.0       ggridges_0.5.6         
 [10] compiler_4.5.1          vctrs_0.6.5             reshape2_1.4.4         
 [13] quadprog_1.5-8          stringr_1.5.1           pkgconfig_2.0.3        
 [16] fastmap_1.2.0           backports_1.5.0         labeling_0.4.3         
 [19] utf8_1.2.6              threejs_0.3.4           promises_1.3.3         
 [22] rmarkdown_2.29          markdown_2.0            ps_1.9.1               
 [25] nloptr_2.2.1            purrr_1.1.0             xfun_0.52              
 [28] jsonlite_2.0.0          later_1.4.2             parallel_4.5.1         
 [31] R6_2.6.1                dygraphs_1.1.1.6        stringi_1.8.7          
 [34] RColorBrewer_1.1-3      StanHeaders_2.36.0.9000 boot_1.3-31            
 [37] estimability_1.5.1      rstan_2.36.0.9000       knitr_1.50             
 [40] zoo_1.8-14              base64enc_0.1-3         httpuv_1.6.16          
 [43] Matrix_1.7-3            splines_4.5.1           igraph_2.1.4           
 [46] tidyselect_1.2.1        abind_1.4-8             yaml_2.3.10            
 [49] codetools_0.2-20        miniUI_0.1.2            curl_6.4.0             
 [52] processx_3.8.6          pkgbuild_1.4.8          lattice_0.22-7         
 [55] plyr_1.8.9              shiny_1.11.1            withr_3.0.2            
 [58] bridgesampling_1.1-2    coda_0.19-4.1           evaluate_1.0.4         
 [61] survival_3.8-3          RcppParallel_5.1.10     xts_0.14.1             
 [64] pillar_1.11.0           tensorA_0.36.2.1        checkmate_2.3.3        
 [67] DT_0.33                 stats4_4.5.1            reformulas_0.4.1       
 [70] shinyjs_2.1.0           distributional_0.5.0    generics_0.1.4         
 [73] scales_1.4.0            minqa_1.2.8             gtools_3.9.5           
 [76] xtable_1.8-4            glue_1.8.0              emmeans_1.11.2         
 [79] tools_4.5.1             shinystan_2.6.0         data.table_1.17.8      
 [82] lme4_1.1-37             colourpicker_1.3.0      mvtnorm_1.3-3          
 [85] grid_4.5.1              tidyr_1.3.1             rbibutils_2.3          
 [88] QuickJSR_1.8.0          crosstalk_1.2.1         nlme_3.1-168           
 [91] cli_3.6.5               khroma_1.16.0           Brobdingnag_1.2-9      
 [94] V8_6.0.5                gtable_0.3.6            digest_0.6.37          
 [97] TH.data_1.1-3           htmlwidgets_1.6.4       farver_2.1.2           
[100] htmltools_0.5.8.1       lifecycle_1.0.4         mime_0.13              
[103] rstanarm_2.32.1         shinythemes_1.2.0       MASS_7.3-65