This case study uses simple examples to demonstrate the most common failure modes in Markov chain Monte Carlo (MCMC) based Bayesian inference, how to recognize them using diagnostics, and how to fix the problems.
library("rprojroot")
root<-has_file(".casestudies-root")$make_fix_file()
library(cmdstanr)
library(posterior)
options(pillar.neg = FALSE, pillar.subtle=FALSE, pillar.sigfig=2)
library(lemon)
library(tidyr)
library(dplyr)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans", base_size=16))
set1 <- RColorBrewer::brewer.pal(7, "Set1")
SEED <- 48927 # set random seed for reproducibility
An unbounded likelihood without a proper prior can lead to an improper posterior. We recommend always using proper priors (a proper distribution integrates to a finite value) to guarantee a proper posterior.
A commonly used model that can have an unbounded likelihood is logistic regression with complete separation in the data.
We simulate a univariate continuous predictor \(x\) and a binary target \(y\) so that the two classes are completely separable, which leads to an unbounded likelihood.
set.seed(SEED+4)
M=1;
N=10;
x=matrix(sort(rnorm(N)),ncol=M)
y=rep(c(0,1), each=N/2)
data_logit <-list(M = M, N = N, x = x, y = y)
data.frame(data_logit) |>
ggplot(aes(x, y)) +
geom_point(size = 3, shape=1, alpha=0.6) +
scale_y_continuous(breaks=c(0,1))
ggsave(file='separable_data.pdf', width=4, height=4)
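We can check numerically that the likelihood is unbounded. The following sketch scales a separating direction by increasing factors `s`; the log likelihood keeps increasing towards 0 (likelihood towards 1), so no maximum is attained:
# With complete separation there is a threshold xm between the classes;
# scaling the coefficient towards infinity drives the log likelihood to 0.
xm <- (max(x[y == 0]) + min(x[y == 1])) / 2
loglik <- function(s) sum(dbinom(y, 1, plogis(s * (x[, 1] - xm)), log = TRUE))
sapply(c(1, 10, 100, 1000), loglik)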
We use the following Stan logistic regression model, where we have "forgotten" to include a prior for the coefficient `beta`.
code_logit <- root("Problems", "logit_glm.stan")
writeLines(readLines(code_logit))
// logistic regression
data {
int<lower=0> N;
int<lower=0> M;
array[N] int<lower=0,upper=1> y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
}
model {
y ~ bernoulli_logit_glm(x, alpha, beta);
}
Sample
mod_logit <- cmdstan_model(stan_file = code_logit)
fit_logit <- mod_logit$sample(data = data_logit, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.9 seconds.
Chain 2 finished in 1.0 seconds.
Chain 3 finished in 1.1 seconds.
Chain 4 finished in 1.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 1.0 seconds.
Total execution time: 4.3 seconds.
Warning: 1580 of 4000 (40.0%) transitions ended with a divergence.
See https://mc-stan.org/misc/warnings for details.
Warning: 2420 of 4000 (60.0%) transitions hit the maximum treedepth limit of 10.
See https://mc-stan.org/misc/warnings for details.
When running Stan, we get warnings. We can also explicitly check the inference diagnostics:
fit_logit$diagnostic_summary()
$num_divergent
[1] 434 419 362 365
$num_max_treedepth
[1] 566 581 638 635
$ebfmi
[1] 1.957384 1.975748 2.023517 1.920562
We can also check \(\widehat{R}\) and effective sample size (ESS) diagnostics
draws <- as_draws_rvars(fit_logit$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | NA | NA | NA |
alpha | 1.892074e+34 | 3.417172e+29 | 3.847278e+34 | 5.066299e+29 | 1.206648e+24 | 1.052236e+35 | 3.44 | 4 | 11 |
beta | 4.517489e+34 | 8.496816e+29 | 9.178637e+34 | 1.259738e+30 | 2.718532e+24 | 2.707925e+35 | 3.45 | 4 | 12 |
We see that \(\widehat{R}\) for both `alpha` and `beta` is about 3 and Bulk-ESS is about 4, which indicates that the chains are not mixing at all.
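To focus on the convergence diagnostics only, we can request just those summaries from the posterior package:
# default_convergence_measures() returns rhat, ess_bulk, and ess_tail.
summarize_draws(draws, default_convergence_measures())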
The warnings refer to documentation (https://mc-stan.org/misc/warnings) that mentions the possibility of adjusting the sampling algorithm options (e.g., increasing `adapt_delta` and `max_treedepth`), but it is better to first investigate the posterior.
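For reference, adjusting the sampler options in cmdstanr would look like the following sketch (not run; in this example it would not fix the improper posterior):
# Increase the target acceptance rate and the maximum treedepth.
# fit <- mod_logit$sample(data = data_logit, seed = SEED, refresh = 0,
#                         adapt_delta = 0.99, max_treedepth = 15)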
The following figure shows the posterior draws as marginal histograms and joint scatterplots. The range of the values is huge, which is typical for an improper posterior, but in any practical application the values of `alpha` and `beta` are likely to have much smaller magnitudes. In this case, increasing `adapt_delta` and `max_treedepth` would not have solved the problem, and would only have wasted the modeler's time and computation time.
(p<-mcmc_pairs(as_draws_array(draws), pars=c("alpha","beta")))
ggsave(p, file='separable_pairs.pdf', width=6, height=4)
The above diagnostics are applicable with any probabilistic programming framework. The Stan compiler can also recognize some common problems. By default the pedantic mode is not enabled, but we can use the option `pedantic = TRUE` at compilation time, or call the `check_syntax` method after compilation.
mod_logit$check_syntax(pedantic = TRUE)
Warning: The parameter beta has no priors. This means either no prior is
provided, or the prior(s) depend on data variables. In the later case,
this may be a false positive.
Warning: The parameter alpha has no priors. This means either no prior is
provided, or the prior(s) depend on data variables. In the later case,
this may be a false positive.
Stan program is syntactically correct
The pedantic check correctly warns that `alpha` and `beta` don't have priors.
We add proper weak priors and rerun inference.
code_logit2 <- root("Problems", "logit_glm2.stan")
writeLines(readLines(code_logit2))
// logistic regression
data {
int<lower=0> N;
int<lower=0> M;
array[N] int<lower=0,upper=1> y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
}
model {
alpha ~ normal(0,10);
beta ~ normal(0,10);
y ~ bernoulli_logit_glm(x, alpha, beta);
}
Sample
mod_logit2 <- cmdstan_model(stan_file = code_logit2)
fit_logit2 <- mod_logit2$sample(data = data_logit, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
There were no convergence warnings. We can also explicitly check the inference diagnostics:
fit_logit2$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 1.0938307 1.0000890 1.0620493 0.9204544
We check \(\widehat{R}\) and ESS values, which in this case all look good.
draws <- as_draws_rvars(fit_logit2$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -2.59 | -2.29 | 1.02 | 0.75 | -4.64 | -1.60 | 1 | 1315 | 1542 |
alpha | 6.16 | 5.87 | 2.80 | 2.69 | 2.06 | 11.21 | 1 | 793 | 1203 |
beta | 14.55 | 14.10 | 5.76 | 5.89 | 5.92 | 24.74 | 1 | 761 | 975 |
The following figure shows the more reasonable marginal histograms and joint scatterplots of the posterior sample.
(p=mcmc_pairs(as_draws_array(draws), pars=c("alpha","beta")))
ggsave(p, file='separable_prior_pairs.pdf', width=6, height=4)
When writing and editing models, a common mistake is to declare a parameter but not use it in the model. If a parameter is not used at all, it has no proper prior and the likelihood provides no information about it, so the posterior along that parameter is improper. We use the previous logistic regression model with proper priors on `alpha` and `beta`, but include an extra parameter declaration `real gamma`.
code_logit3 <- root("Problems", "logit_glm3.stan")
writeLines(readLines(code_logit3))
// logistic regression
data {
int<lower=0> N;
int<lower=0> M;
array[N] int<lower=0,upper=1> y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
real gamma;
}
model {
alpha ~ normal(0,1);
beta ~ normal(0,1);
y ~ bernoulli_logit_glm(x, alpha, beta);
}
Sample
mod_logit3 <- cmdstan_model(stan_file = code_logit3)
fit_logit3 <- mod_logit3$sample(data = data_logit, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 1.4 seconds.
Chain 2 finished in 1.4 seconds.
Chain 3 finished in 1.4 seconds.
Chain 4 finished in 1.3 seconds.
All 4 chains finished successfully.
Mean chain execution time: 1.4 seconds.
Total execution time: 5.7 seconds.
Warning: 1686 of 4000 (42.0%) transitions hit the maximum treedepth limit of 10.
See https://mc-stan.org/misc/warnings for details.
There is a sampler warning. We can also explicitly check the inference diagnostics:
fit_logit3$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 425 414 442 405
$ebfmi
[1] 1.211723 1.207911 1.343447 1.375802
Instead of increasing `max_treedepth`, we check the other convergence diagnostics.
draws <- as_draws_rvars(fit_logit3$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -6.610000e+00 | -6.31000e+00 | 1.010000e+00 | 7.400000e-01 | -8.640000e+00 | -5.64000e+00 | 1.00 | 1641 | 2467 |
alpha | 3.700000e-01 | 3.60000e-01 | 6.100000e-01 | 5.900000e-01 | -6.500000e-01 | 1.35000e+00 | 1.00 | 3382 | 2727 |
beta | 1.290000e+00 | 1.27000e+00 | 7.600000e-01 | 7.500000e-01 | 7.000000e-02 | 2.56000e+00 | 1.00 | 3297 | 2928 |
gamma | -7.271885e+18 | -3.75658e+18 | 5.077307e+19 | 4.041267e+19 | -9.555314e+19 | 8.54153e+19 | 3.17 | 5 | 12 |
\(\widehat{R}\), Bulk-ESS, and Tail-ESS look good for `alpha` and `beta`, but are really bad for `gamma`, clearly pointing to where to look for problems in the model code. The histogram of the `gamma` posterior draws shows values of huge magnitude (larger than \(10^{20}\)), indicating an improper posterior.
mcmc_pairs(as_draws_array(draws), pars=c("alpha","beta","gamma"))
Non-mixing is well diagnosed by \(\widehat{R}\) and ESS, but the following figure shows one of the rare cases where trace plots are useful: they illustrate the type of non-mixing that occurs when the posterior is an improper uniform along one of the parameters.
mcmc_trace(as_draws_array(draws), pars=c("gamma"))
ggsave(file='unusedparam_trace.pdf', width=6, height=4)
The Stan compiler pedantic check also recognizes that the parameter `gamma` was declared but not used in the density calculation.
mod_logit3$check_syntax(pedantic = TRUE)
Warning: The parameter gamma was declared but was not used in the density
calculation.
Stan program is syntactically correct
Sometimes models have two or more parameters with similar or exactly the same role. We illustrate this by adding an extra column to the previous data matrix. A data matrix is sometimes augmented with a column of 1's to represent the intercept. In this case that is redundant, as our model has the explicit intercept term `alpha`, and this redundancy will lead to problems.
M=2;
N=1000;
x=matrix(c(rep(1,N),sort(rnorm(N))),ncol=M)
y=((x[,1]+rnorm(N)/2)>0)+0
data_logit4 <-list(M = M, N = N, x = x, y = y)
We use the previous logistic regression model with proper priors (and no extra `gamma`).
code_logit2 <- root("Problems", "logit_glm2.stan")
writeLines(readLines(code_logit2))
// logistic regression
data {
int<lower=0> N;
int<lower=0> M;
array[N] int<lower=0,upper=1> y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
}
model {
alpha ~ normal(0,10);
beta ~ normal(0,10);
y ~ bernoulli_logit_glm(x, alpha, beta);
}
Sample
mod_logit4 <- cmdstan_model(stan_file = code_logit2)
fit_logit4 <- mod_logit4$sample(data = data_logit4, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 4.0 seconds.
Chain 2 finished in 4.0 seconds.
Chain 3 finished in 3.5 seconds.
Chain 4 finished in 3.7 seconds.
All 4 chains finished successfully.
Mean chain execution time: 3.8 seconds.
Total execution time: 15.5 seconds.
The Stan sampling time with the original data matrix was less than 0.1 seconds per chain. Now the sampling time is several seconds per chain, which is suspicious. There are no automatic convergence diagnostic warnings, and checking the other diagnostics doesn't show anything really bad.
fit_logit4$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 1.1216053 0.9732017 0.8804549 1.0738646
draws <- as_draws_rvars(fit_logit4$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -116.81 | -116.49 | 1.26 | 1.03 | -119.24 | -115.44 | 1 | 1135 | 1954 |
alpha | 1.86 | 1.89 | 7.16 | 6.98 | -10.33 | 13.59 | 1 | 1119 | 1238 |
beta[1] | 1.92 | 1.87 | 7.16 | 6.99 | -9.80 | 14.08 | 1 | 1118 | 1223 |
beta[2] | 0.37 | 0.37 | 0.21 | 0.21 | 0.04 | 0.71 | 1 | 1360 | 1530 |
ESS estimates are above the recommended diagnostic thresholds, but lower than what we would in general expect from Stan for such a low-dimensional problem.
The following figure shows marginal histograms and joint scatterplots, and we can see that `alpha` and `beta[1]` are highly correlated.
(p=mcmc_pairs(as_draws_array(draws), pars=c("alpha","beta[1]","beta[2]")))
ggsave(p, file='competing_params_pairs.pdf', width=6, height=4)
We can compute the correlation.
cor(as_draws_matrix(draws)[,c("alpha","beta[1]")])[1,2]
[1] -0.9995226
The numerical value of the correlation is \(-0.999\). A correlation this close to 1 can also arise for other reasons (see the next example), but one possibility is that two parameters have a similar role in the model. Here the reason is the constant column in \(x\), which we put there for demonstration purposes. We may have a constant column, for example, if the predictor matrix is augmented with an intercept predictor, or if the observed data or the subset of data used in a specific analysis happens to contain only one unique value of some predictor.
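We can also check directly that only the sum of the two redundant parameters is well identified. A quick sketch using the posterior package:
# Only the sum alpha + beta[1] enters the likelihood (x[,1] is constant 1),
# so its posterior sd is much smaller than the marginal sds.
dm <- as_draws_matrix(draws)
c(sd_sum = sd(dm[, "alpha"] + dm[, "beta[1]"]),
  sd_alpha = sd(dm[, "alpha"]),
  sd_beta1 = sd(dm[, "beta[1]"]))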
The Stan compiler pedantic check examines only the code, and can't recognize this issue because the problem depends also on the data.
mod_logit4$check_syntax(pedantic = TRUE)
Stan program is syntactically correct
In the previous example the two parameters had the same role in the model, leading to high posterior correlation. High posterior correlations are also common in linear models when the predictor values are far from 0. We illustrate this with a linear regression model for the summer temperature in Kilpisjärvi, Finland, 1952–2013. We use the year as the covariate \(x\) without centering it.
The data are Kilpisjärvi summer month temperatures 1952-2013.
data_kilpis <- read.delim(root("Problems","kilpisjarvi-summer-temp.csv"), sep = ";")
data_lin <-list(M=1,
N = nrow(data_kilpis),
x = matrix(data_kilpis$year, ncol=1),
y = data_kilpis[,5])
data.frame(data_lin) |>
ggplot(aes(x, y)) +
geom_point(size = 1) +
labs(y = 'Summer temp. @Kilpisjärvi', x= "Year") +
guides(linetype = "none")
ggsave(file='Kilpisjarvi_data.pdf', width=6, height=4)
We use the following Stan linear regression model:
code_lin <- root("Problems", "linear_glm_kilpis.stan")
writeLines(readLines(code_lin))
// linear regression
data {
int<lower=0> N;
int<lower=0> M;
vector[N] y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
real sigma;
}
model {
alpha ~ normal(0, 100);
beta ~ normal(0, 100);
sigma ~ normal(0, 1);
y ~ normal_id_glm(x, alpha, beta, sigma);
}
mod_lin <- cmdstan_model(stan_file = code_lin)
fit_lin <- mod_lin$sample(data = data_lin, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 Rejecting initial value:
Chain 1 Error evaluating the log probability at the initial value.
Chain 1 Exception: normal_id_glm_lpdf: Scale vector is -1.5974, but must be positive finite! (in '/tmp/RtmpoIWO9X/model-c49cb463abc8e.stan', line 17, column 2 to column 43)
Chain 1 Exception: normal_id_glm_lpdf: Scale vector is -1.5974, but must be positive finite! (in '/tmp/RtmpoIWO9X/model-c49cb463abc8e.stan', line 17, column 2 to column 43)
Chain 1 Rejecting initial value:
Chain 1 Error evaluating the log probability at the initial value.
Chain 1 Exception: normal_id_glm_lpdf: Scale vector is -0.929594, but must be positive finite! (in '/tmp/RtmpoIWO9X/model-c49cb463abc8e.stan', line 17, column 2 to column 43)
Chain 1 Exception: normal_id_glm_lpdf: Scale vector is -0.929594, but must be positive finite! (in '/tmp/RtmpoIWO9X/model-c49cb463abc8e.stan', line 17, column 2 to column 43)
Chain 1 finished in 1.1 seconds.
Chain 2 finished in 1.0 seconds.
Chain 3 Rejecting initial value:
Chain 3 Error evaluating the log probability at the initial value.
Chain 3 Exception: normal_id_glm_lpdf: Scale vector is -1.50837, but must be positive finite! (in '/tmp/RtmpoIWO9X/model-c49cb463abc8e.stan', line 17, column 2 to column 43)
Chain 3 Exception: normal_id_glm_lpdf: Scale vector is -1.50837, but must be positive finite! (in '/tmp/RtmpoIWO9X/model-c49cb463abc8e.stan', line 17, column 2 to column 43)
Chain 3 finished in 1.0 seconds.
Chain 4 finished in 1.1 seconds.
All 4 chains finished successfully.
Mean chain execution time: 1.0 seconds.
Total execution time: 4.4 seconds.
Warning: 16 of 4000 (0.0%) transitions hit the maximum treedepth limit of 10.
See https://mc-stan.org/misc/warnings for details.
Stan warns that some transitions after warmup exceeded the maximum treedepth. The initial value rejection messages are caused by the missing positivity constraint on `sigma` (we return to this issue in the last section); the sampler recovered, and, as in the previous example, there are no other warnings.
fit_lin$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 10 4 0 2
$ebfmi
[1] 1.0136207 0.9657334 1.1039565 1.1209683
draws <- as_draws_rvars(fit_lin$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -38.50 | -38.18 | 1.25 | 1.02 | -40.89 | -37.15 | 1 | 1114 | 1649 |
alpha | -30.86 | -30.79 | 15.06 | 14.45 | -55.47 | -5.57 | 1 | 1185 | 1337 |
beta | 0.02 | 0.02 | 0.01 | 0.01 | 0.01 | 0.03 | 1 | 1184 | 1337 |
sigma | 1.12 | 1.11 | 0.10 | 0.10 | 0.97 | 1.30 | 1 | 1350 | 1287 |
ESS estimates are above the diagnostic threshold, but lower than we would expect for such a low-dimensional model unless there are strong posterior correlations. The following figure shows the marginal histograms and the joint scatterplot for `alpha` and `beta`, which are very highly correlated.
(p<-mcmc_pairs(as_draws_array(draws), pars=c("alpha","beta")))
ggsave(p, file='correlating_params_pairs.pdf', width=6, height=4)
Here the reason is that the \(x\) values are in the range 1952–2013, and the intercept `alpha` denotes the temperature at year 0, which is very far from the range of the observed \(x\). If the intercept `alpha` changes, the slope `beta` needs to change too. The high correlation makes the inference slower, and we can make it faster by centering \(x\). Here we simply subtract 1982.5 from the predictor `year`, so that the mean of \(x\) is 0. We could also include the centering and the corresponding back transformation in the Stan code.
data_lin <-list(M=1,
N = nrow(data_kilpis),
x = matrix(data_kilpis$year-1982.5, ncol=1),
y = data_kilpis[,5])
fit_lin <- mod_lin$sample(data = data_lin, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
We check the diagnostics
fit_lin$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 1.0579740 0.9583711 1.0273150 1.0810975
draws <- as_draws_rvars(fit_lin$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -38.55 | -38.21 | 1.32 | 1.06 | -41.14 | -37.11 | 1 | 1915 | 2466 |
alpha | 9.31 | 9.31 | 0.15 | 0.14 | 9.07 | 9.56 | 1 | 3681 | 2701 |
beta | 0.02 | 0.02 | 0.01 | 0.01 | 0.01 | 0.03 | 1 | 3927 | 2766 |
sigma | 1.12 | 1.11 | 0.11 | 0.10 | 0.96 | 1.31 | 1 | 3168 | 2454 |
The following figure shows the scatter plot.
mcmc_pairs(as_draws_array(draws), pars=c("alpha","beta"))
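We can verify numerically that the strong correlation is gone:
# Correlation between alpha and beta after centering x.
cor(as_draws_matrix(draws)[, c("alpha", "beta[1]")])[1, 2]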
With this change there is no strong posterior correlation, the Bulk-ESS estimates are about 3 times bigger, and the mean time per chain drops from about 1 second to less than 0.05 seconds; that is, we get two orders of magnitude faster inference. In a bigger problem this could correspond to reducing the computation time from 24 hours to less than 20 minutes.
Bimodal distributions can arise for many reasons, as in mixture models or models with non-log-concave likelihoods or priors (that is, distributions with thick tails). We illustrate the diagnostics that reveal a multimodal posterior using a simple toy problem: a Student's \(t\) model fitted to data that does not come from a \(t\) distribution, but from a mixture of two normal distributions.
Bimodally distributed data
N=20
y=c(rnorm(N/2, mean=-5, sd=1),rnorm(N/2, mean=5, sd=1));
data_tt <-list(N = N, y = y)
Unimodal Student’s \(t\) model:
code_tt <- root("Problems", "student.stan")
writeLines(readLines(code_tt))
// student-student model
data {
int<lower=0> N;
vector[N] y;
}
parameters {
real mu;
}
model {
mu ~ student_t(4, 0, 100);
y ~ student_t(4, mu, 1);
}
Sample
mod_tt <- cmdstan_model(stan_file = code_tt)
fit_tt <- mod_tt$sample(data = data_tt, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
We check the diagnostics
fit_tt$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 1.303367 1.037493 1.192409 1.165997
draws <- as_draws_rvars(fit_tt$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -84.57 | -84.34 | 0.74 | 0.46 | -86.03 | -83.90 | 1.13 | 21 | 120 |
mu | -0.08 | -0.35 | 4.62 | 6.81 | -5.11 | 4.99 | 1.73 | 6 | 153 |
The \(\widehat{R}\) value for `mu` is large and the ESS values for `mu` are small, indicating convergence problems. The following histogram and trace plot of the posterior draws clearly show the bimodality and that the chains are not mixing between the modes.
mcmc_hist(as_draws_array(draws), pars=c("mu"))
ggsave(file='bimodal1_hist.pdf', width=4, height=4)
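The corresponding trace plot shows each chain staying in the mode it started in:
mcmc_trace(as_draws_array(draws), pars=c("mu"))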
In this toy example, with random initialization each chain has a 50% probability of ending up in either mode. We used Stan's default of 4 chains, so with random initialization there is a \(2 \times 0.5^4 \approx 12\%\) chance that all chains end up in the same mode and we miss the multimodality when running Stan once. If the attraction areas within the random initialization range are not equal, the probability of missing one mode is even higher. There is a tradeoff between the default computation cost and the probability of finding multiple modes; if there is reason to suspect multimodality, it is useful to run more chains, as the calculation below illustrates. Running more chains helps to diagnose the multimodality, but the fraction of chains ending up in each mode can differ from the relative probability mass of that mode, and running more chains doesn't fix this. Other means are needed to improve mixing between the modes (e.g., Yao et al., 2020) or to approximately weight the chains (e.g., Yao et al., 2022).
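A quick calculation shows how fast the chance of missing a mode drops as the number of chains grows, assuming each chain independently lands in either of two equally attractive modes:
# Probability that all chains end up in the same mode, i.e. that we would
# miss the bimodality, for different numbers of chains.
n_chains <- c(4, 8, 16)
setNames(2 * 0.5^n_chains, paste0(n_chains, " chains"))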
If the modes of a bimodal distribution are not strongly separated, MCMC can jump from one mode to another, and there may be no convergence issues.
N=20
y=c(rnorm(N/2, mean=-3, sd=1),rnorm(N/2, mean=3, sd=1));
data_tt <-list(N = N, y = y)
fit_tt <- mod_tt$sample(data = data_tt, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
We check the diagnostics
fit_tt$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 0.8670786 0.8435780 0.7262904 0.7804591
draws <- as_draws_rvars(fit_tt$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -54.23 | -53.75 | 1.0 | 0.57 | -56.15 | -53.34 | 1.04 | 207 | 1181 |
mu | 0.84 | 1.34 | 1.3 | 0.62 | -2.17 | 2.06 | 1.06 | 107 | 357 |
Two modes are visible.
mcmc_hist(as_draws_array(draws), pars=c("mu"))
ggsave(file='bimodal2_hist.pdf', width=4, height=4)
The trace plot is not very useful. It shows that the chains jump between the modes, but it is difficult to see whether the jumps happen often enough for the chains to mix well.
mcmc_trace(as_draws_array(draws), pars=c("mu"))
ggsave(file='bimodal2_trace.pdf', width=4, height=4)
The rank ECDF difference plot indicates good mixing, as all chains have their lines inside the envelope (the envelope assumes no autocorrelation, which is why we thin the draws here).
draws |> thin_draws(ndraws(draws)/ess_basic(draws$mu)) |>
mcmc_rank_ecdf(pars=c("mu"), plot_diff=TRUE)
ggsave(file='bimodal2_rank_ecdf_diff.pdf', width=4, height=4)
MCMC requires initial values. By default, Stan generates them randomly from \([-2, 2]\) on the unconstrained space (constraints on parameters are achieved by transformations). Sometimes these initial values can be bad and cause numerical issues. Computers (in general) use a finite number of bits to represent numbers, so very small or very large numbers may not be representable, or there can be significant loss of accuracy.
The data is generated from a Poisson regression model. The Poisson intensity parameter has to be positive, and usually the latent linear predictor is exponentiated to make it positive (the exponentiation can also be justified by multiplicative effects on the Poisson intensity).
We intentionally generate the data so that there are initialization problems, but the same problem is common with real data when the scale of the predictors is large or small compared to unit scale. The following figure shows the simulated data.
set.seed(SEED)
M=1;
N=20;
x=1e3*matrix(c(sort(rnorm(N))),ncol=M)
y=rpois(N,exp(1e-3*x[,1]))
data_pois <-list(M = M, N = N, x = x, y = y)
data.frame(data_pois) |>
ggplot(aes(x, y)) +
geom_point(size = 3)
ggsave(file='poisson_data.pdf', width=4, height=4)
We use a Poisson regression model with proper priors. The line `poisson_log_glm(x, alpha, beta)` corresponds to a distribution in which the log intensity of the Poisson distribution is modeled as `alpha + beta * x`, but it is implemented with better computational efficiency.
code_pois <- root("Problems", "pois_glm.stan")
writeLines(readLines(code_pois))
// Poisson regression
data {
int<lower=0> N;
int<lower=0> M;
array[N] int<lower=0> y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
}
model {
alpha ~ normal(0,10);
beta ~ normal(0,10);
y ~ poisson_log_glm(x, alpha, beta);
}
Sample
mod_pois <- cmdstan_model(stan_file = code_pois)
fit_pois <- mod_pois$sample(data = data_pois, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 Rejecting initial value:
Chain 1 Log probability evaluates to log(0), i.e. negative infinity.
Chain 1 Stan can't start sampling from this initial value.
Chain 1 Rejecting initial value:
Chain 1 Log probability evaluates to log(0), i.e. negative infinity.
Chain 1 Stan can't start sampling from this initial value.
Chain 1 Rejecting initial value:
Chain 1 Log probability evaluates to log(0), i.e. negative infinity.
Chain 1 Stan can't start sampling from this initial value.
Chain 1 finished in 1.0 seconds.
Chain 2 Rejecting initial value:
Chain 2 Log probability evaluates to log(0), i.e. negative infinity.
Chain 2 Stan can't start sampling from this initial value.
Chain 2 finished in 0.2 seconds.
Chain 3 Rejecting initial value:
Chain 3 Log probability evaluates to log(0), i.e. negative infinity.
Chain 3 Stan can't start sampling from this initial value.
Chain 3 Rejecting initial value:
Chain 3 Log probability evaluates to log(0), i.e. negative infinity.
Chain 3 Stan can't start sampling from this initial value.
Chain 3 Rejecting initial value:
Chain 3 Log probability evaluates to log(0), i.e. negative infinity.
Chain 3 Stan can't start sampling from this initial value.
Chain 3 Rejecting initial value:
Chain 3 Log probability evaluates to log(0), i.e. negative infinity.
Chain 3 Stan can't start sampling from this initial value.
Chain 3 Rejecting initial value:
Chain 3 Log probability evaluates to log(0), i.e. negative infinity.
Chain 3 Stan can't start sampling from this initial value.
Chain 3 Rejecting initial value:
Chain 3 Log probability evaluates to log(0), i.e. negative infinity.
Chain 3 Stan can't start sampling from this initial value.
Chain 3 finished in 0.0 seconds.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.3 seconds.
Total execution time: 1.5 seconds.
Warning: 584 of 4000 (15.0%) transitions hit the maximum treedepth limit of 10.
See https://mc-stan.org/misc/warnings for details.
Warning: 2 of 4 chains had an E-BFMI less than 0.2.
See https://mc-stan.org/misc/warnings for details.
We get a lot of warnings:
Chain 4 Rejecting initial value:
Chain 4 Log probability evaluates to log(0), i.e. negative infinity.
Chain 4 Stan can't start sampling from this initial value.
We check the diagnostics:
fit_pois$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 582 0 1 1
$ebfmi
[1] 1.395338019 1.022888262 0.002205932 0.001748789
draws <- as_draws_rvars(fit_pois$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -9.023224e+80 | -1.859e+35 | 1.986335e+81 | 2.756153e+35 | -5.320882e+81 | 9.98 | 5.47 | 5 | 11 |
alpha | -7.000000e-02 | 9.000e-02 | 6.600000e-01 | 9.100000e-01 | -1.080000e+00 | 0.70 | 3.86 | 4 | NA |
beta | -2.000000e-02 | 0.000e+00 | 5.000000e-02 | 3.000000e-02 | -1.100000e-01 | 0.04 | 3.20 | 4 | 11 |
\(\widehat{R}\) values are large and ESS values are small, indicating bad mixing; the very low E-BFMI values for two of the chains point to the same problem. Marginal histograms and joint scatterplots of the posterior draws in the figure below clearly show that two chains have been stuck away from the other two.
(p=mcmc_pairs(as_draws_array(draws), pars=c("alpha","beta")))
ggsave(p, file='poisson_pairs.pdf', width=6, height=4)
The reason for the issue is that the initial values for `beta` are sampled from \((-2, 2)\) and `x` has some large values. If the initial value for `beta` is higher than about 0.3 or lower than \(-0.4\), some of the values of `exp(alpha + beta * x)` will overflow to floating point infinity (`Inf`).
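We can check the overflow threshold directly; in double precision, `exp()` overflows to `Inf` above roughly 709.8:
# exp(709) is still finite, exp(710) overflows; with the unscaled x an
# initial beta of 0.5 is already far past the threshold.
c(exp(709), exp(710))
exp(0.5 * max(data_pois$x))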
Sometimes an easy option is to change the initialization range. For example, in this case the sampling succeeds if the initial values are drawn from the range \((-0.001, 0.001)\), as sketched below. Alternatively, we can rescale `x` to be close to unit scale; after this scaling, the computation is fast and all convergence diagnostics look good.
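A sketch of the first option (not run): in cmdstanr, passing a single positive number as `init` draws the unconstrained initial values uniformly from `(-init, init)`.
# fit_pois_init <- mod_pois$sample(data = data_pois, seed = SEED, refresh = 0,
#                                  init = 0.001)
The code below applies the rescaling of `x` instead.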
data_pois <-list(M = M, N = N, x = x/1e3, y = y)
data.frame(data_pois) |>
ggplot(aes(x, y)) +
geom_point(size = 3)
mod_pois <- cmdstan_model(stan_file = code_pois)
fit_pois <- mod_pois$sample(data = data_pois, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
We check the diagnostics:
fit_pois$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 1.105842 1.094182 1.073488 1.112763
draws <- as_draws_rvars(fit_pois$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | 9.10 | 9.40 | 0.97 | 0.71 | 7.13 | 10.02 | 1.00 | 1329 | 1833 |
alpha | -0.05 | -0.04 | 0.26 | 0.26 | -0.48 | 0.37 | 1.00 | 1239 | 1364 |
beta | 1.01 | 1.01 | 0.18 | 0.18 | 0.72 | 1.31 | 1.01 | 1225 | 1473 |
If the initial value warning comes only once, it is possible that MCMC was able to escape the bad region and the rest of the inference is fine.
We return to the logistic regression example with separable data. Now we use a proper, but thick-tailed, Cauchy prior.
code_logit4 <- root("Problems", "logit_glm4.stan")
writeLines(readLines(code_logit4))
// logistic regression
data {
int<lower=0> N;
int<lower=0> M;
array[N] int<lower=0,upper=1> y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
}
model {
alpha ~ cauchy(0, 10);
beta ~ cauchy(0, 10);
y ~ bernoulli_logit_glm(x, alpha, beta);
}
Sample
mod_logit4 <- cmdstan_model(stan_file = code_logit4)
fit_logit4 <- mod_logit4$sample(data = data_logit, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
We check diagnostics
fit_logit4$diagnostic_summary()
$num_divergent
[1] 0 0 0 0
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 0.4959455 0.2817680 0.4079288 0.7125728
draws <- as_draws_rvars(fit_logit4$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | -3.92 | -3.26 | 2.11 | 1.52 | -8.37 | -1.93 | 1.04 | 122 | 95 |
alpha | 16.14 | 10.09 | 20.49 | 7.19 | 3.19 | 48.79 | 1.05 | 100 | 95 |
beta | 38.68 | 24.10 | 48.50 | 16.89 | 8.09 | 119.59 | 1.04 | 100 | 89 |
The rounded \(\widehat{R}\) values look good, but the ESS values are low. The marginal histograms and joint scatterplots of the posterior draws in the following figure show a thick tail.
(p<-mcmc_pairs(as_draws_array(draws), pars=c("alpha","beta")))
ggsave(p, file='thick_tail_pairs.pdf', width=6, height=4)
The dynamic HMC algorithm used by Stan, like many other MCMC methods, has problems with such thick tails, and mixing is slow.
The rank ECDF difference plot still indicates adequate mixing, as all chains have their lines inside the envelope (the envelope assumes no autocorrelation, which is why we thin the draws here).
(p<-draws |> thin_draws(ndraws(draws)/ess_bulk(draws$alpha)) |>
mcmc_rank_ecdf(pars=c("alpha"), plot_diff=TRUE))
ggsave(p, file='thick_tail_rank_ecdf_diff.pdf', width=6, height=4)
Running more iterations confirms reasonable mixing.
fit_logit4 <- mod_logit4$sample(data = data_logit, seed = SEED, refresh = 0, iter_sampling=4000)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.1 seconds.
Chain 2 finished in 0.1 seconds.
Chain 3 finished in 0.1 seconds.
Chain 4 finished in 0.1 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.1 seconds.
Total execution time: 0.8 seconds.
draws <- as_draws_rvars(fit_logit4$draws())
summarize_draws(draws)
# A tibble: 3 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -3.8 -3.2 2.0 1.4 -7.9 -1.9 1.0 712. 624.
2 alpha 15. 9.6 24. 6.7 2.7 45. 1.0 649. 577.
3 beta 36. 23. 54. 16. 7.2 104. 1.0 620. 620.
draws |> thin_draws(ndraws(draws)/ess_bulk(draws$alpha)) |>
mcmc_rank_ecdf(pars=c("alpha"), plot_diff=TRUE)
We demonstrate what happens if we forget to constrain a parameter that has to be positive. In Stan, the constraint can be added in the parameter declaration: `real<lower=0> sigma;`.
We simulate `x` and `y` independently from normal(0, 1) and normal(0, 0.1), respectively. As \(N = 8\) is small, there will be a lot of uncertainty about the parameters, including the scale `sigma`.
M=1;
N=8;
set.seed(SEED)
x=matrix(rnorm(N),ncol=M)
y=rnorm(N)/10
data_lin <-list(M = M, N = N, x = x, y = y)
We use a linear regression model with proper priors.
code_lin <- root("Problems", "linear_glm.stan")
writeLines(readLines(code_lin))
// linear regression
data {
int<lower=0> N;
int<lower=0> M;
vector[N] y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
real sigma;
}
model {
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
sigma ~ normal(0, 1);
y ~ normal_id_glm(x, alpha, beta, sigma);
}
Sample
mod_lin <- cmdstan_model(stan_file = code_lin)
fit_lin <- mod_lin$sample(data = data_lin, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
We repeatedly get the following warning:
Chain 4 Informational Message: The current Metropolis proposal is about to be rejected because of the following issue:
Chain 4 Exception: normal_id_glm_lpdf: Scale vector is -0.747476, but must be positive finite! (in '/tmp/RtmprEP4gg/model-7caa12ce8e405.stan', line 16, column 2 to column 43)
Chain 4 If this warning occurs sporadically, such as for highly constrained variable types like covariance matrices, then the sampler is fine,
Chain 4 but if this warning occurs often then your model may be either severely ill-conditioned or misspecified.
Sometimes these warnings appear in the early phase of the sampling even if the model has been correctly defined. Here we get too many of them, which indicates that the sampler keeps trying to jump to infeasible values, in this case negative values of the scale parameter. Many rejections may lead to biased estimates.
There are also some divergences reported, which is a further indication that there might be a problem (since the divergence diagnostic uses an ad hoc threshold, there can also be false positive warnings). The other convergence diagnostics are good, but given the many rejection warnings, it is good to check the model code and the numerical accuracy of the computations.
We check diagnostics
fit_lin$diagnostic_summary()
$num_divergent
[1] 0 0 0 3
$num_max_treedepth
[1] 0 0 0 0
$ebfmi
[1] 0.7768720 0.6476284 0.5988158 0.6908439
draws <- as_draws_rvars(fit_lin$draws())
summarize_draws(draws)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | 18.24 | 18.71 | 1.83 | 1.51 | 14.66 | 20.22 | 1.01 | 912 | 843 |
alpha | 0.05 | 0.05 | 0.03 | 0.02 | 0.01 | 0.10 | 1.00 | 1721 | 1776 |
beta | 0.04 | 0.04 | 0.02 | 0.02 | 0.00 | 0.07 | 1.00 | 1988 | 1389 |
sigma | 0.07 | 0.06 | 0.03 | 0.02 | 0.04 | 0.12 | 1.01 | 1398 | 1000 |
The Stan compiler pedantic check recognizes the issue: "A normal_id_glm distribution is given parameter sigma as a scale parameter (argument 4), but sigma was not constrained to be strictly positive."
mod_lin$check_syntax(pedantic = TRUE)
Warning in '/home/ave/proj/public_html/casestudies/Problems/linear_glm.stan', line 17, column 36: A
normal_id_glm distribution is given parameter sigma as a scale parameter
(argument 4), but sigma was not constrained to be strictly positive.
Stan program is syntactically correct
After fixing the model with a proper parameter constraint, MCMC runs without warnings and the sampling efficiency is better.
The fixed model includes the `<lower=0>` constraint for `sigma`.
code_lin2 <- root("Problems", "linear_glm2.stan")
writeLines(readLines(code_lin2))
// linear regression
data {
int<lower=0> N;
int<lower=0> M;
vector[N] y;
matrix[N,M] x;
}
parameters {
real alpha;
vector[M] beta;
real<lower=0> sigma;
}
model {
alpha ~ normal(0, 1);
beta ~ normal(0, 1);
sigma ~ normal(0, 1);
y ~ normal_id_glm(x, alpha, beta, sigma);
}
Sample
mod_lin2 <- cmdstan_model(stan_file = code_lin2)
fit_lin2 <- mod_lin2$sample(data = data_lin, seed = SEED, refresh = 0)
Running MCMC with 4 sequential chains...
Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.5 seconds.
We check diagnostics
draws2 <- as_draws_rvars(fit_lin2$draws())
summarize_draws(draws2)
variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
---|---|---|---|---|---|---|---|---|---|
lp__ | 15.52 | 15.95 | 1.61 | 1.27 | 12.25 | 17.21 | 1 | 1141 | 1322 |
alpha | 0.05 | 0.05 | 0.03 | 0.02 | 0.01 | 0.10 | 1 | 2024 | 1724 |
beta | 0.04 | 0.04 | 0.02 | 0.02 | 0.00 | 0.07 | 1 | 2217 | 2144 |
sigma | 0.07 | 0.06 | 0.03 | 0.02 | 0.04 | 0.13 | 1 | 1467 | 1539 |
In this specific case, the bias from running MCMC without the constraint is negligible, but this is difficult to know without also running the fixed model.
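As a final check, we can compare the posterior means of the two fits directly; `lp__` is excluded because the constraint adds a log-Jacobian term to the target density, so it is not comparable across the two models.
# Rows are alpha, beta, sigma; the estimates agree to within Monte Carlo error.
pars <- c("alpha", "beta", "sigma")
cbind(unconstrained = summarize_draws(subset_draws(draws, variable = pars))$mean,
      constrained   = summarize_draws(subset_draws(draws2, variable = pars))$mean)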