library(rprojroot)
root <- has_file(".Bayesian-Workflow-root")$make_fix_file()
library(tidyverse)
library(tictoc)
# Helper to print elapsed time with two significant digits
mytoc <- \() {
  toc(func.toc = \(tic, toc, msg) {
    sprintf("%s took %s sec", msg, as.character(signif(toc - tic, 2)))
  })
}
library(cmdstanr)
# Set the CmdStanR output directory so that Quarto caching works
dir.create(root("birthdays", "stan_output"), showWarnings = FALSE)
options(cmdstanr_output_dir = root("birthdays", "stan_output"))
library(posterior)
options(pillar.neg = FALSE,
pillar.subtle = FALSE,
pillar.sigfig = 2)
library(tinytable)
options(tinytable_format_num_fmt = "significant_cell",
tinytable_format_digits = 2,
tinytable_tt_digits = 2)
library(loo)
library(bayesplot)
theme_set(bayesplot::theme_default(base_family = "sans"))
library(ggdist)
library(patchwork)
library(ggrepel)
library(RColorBrewer)
set1 <- brewer.pal(7, "Set1")
Birthdays and time-series decomposition
This notebook includes the code for the Bayesian Workflow book, Chapter 27, Model building: Time-series decomposition for birthdays.
1 Introduction
We analyse the relative number of births per day in the USA 1969–1988 using a Gaussian process time series model with several components that can explain the long-term, seasonal, weekly, day-of-year, and special floating-day variation. We use the Hilbert space Gaussian process approximation (Riutort-Mayol et al. 2023) to speed up the computation.
We also illustrate the use of the Pathfinder algorithm (Zhang et al. 2022) to quickly check that the model code produces something reasonable and to initialize MCMC sampling.
Use English for names of weekdays and months
Sys.setlocale("LC_TIME", "en_GB.utf8")
[1] "en_GB.utf8"
Source common helpers for figures
source(root("birthdays", "plot_helpers.R"))
2 Data
Load birthdays per day in USA 1969-1988:
birthdays <- read_csv(root("birthdays/data", "births_usa_1969.csv"))
Add a date column for plotting and the number of births relative to the mean (scaled so that the mean is 100)
birthdays <- birthdays |>
mutate(date = as.Date("1968-12-31") + id,
         births_relative100 = births / mean(births) * 100)
2.1 Plot all births
We can see slow variation in the trend, a yearly pattern, and, especially in the later years, a spread to lower and higher values.
birthdays |>
ggplot(aes(x = date, y = births)) +
geom_point(color = col_data) +
labs(x = "Date", y = "Relative number of births")2.2 Plot all births as relative to mean
To ease interpretation, we switch to examining the relative change, with the mean level denoted by 100.
birthdays |>
ggplot(aes(x = date, y = births_relative100)) +
geom_point(color = col_data) +
layers_hline100 +
labs(x = "Date", y = "Relative births per day")2.3 Plot mean per day of year
We can see the generic yearly seasonal pattern simply by averaging over each day of year (day_of_year2 runs from 1 to 366 every year, with the leap day being day 60 and 1st March being day 61 also in non-leap years).
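As an aside, here is a minimal sketch of how such a leap-aligned index could be constructed. This is illustrative only; the dataset already contains day_of_year2, and this is not the notebook's preprocessing code.
library(lubridate)
# Leap-aligned day-of-year: slot 60 is always the leap day, so 1st March is
# always day 61, also in non-leap years
day_of_year2_sketch <- function(date) {
  doy <- yday(date)
  ifelse(!leap_year(date) & doy >= 60, doy + 1, doy)
}
day_of_year2_sketch(as.Date(c("1969-02-28", "1969-03-01", "1972-02-29")))
# 59 61 60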
birthdays |>
group_by(day_of_year2) |>
summarise(meanbirths = mean(births_relative100)) |>
ggplot(aes(x = as.Date("1986-12-31") + day_of_year2, y = meanbirths)) +
geom_point(color = col_data) +
layers_hline100 +
layers_scale_doy +
labs(x = "Day of year", y = "Relative births per day of year")2.4 Plot mean per day of week
We can see the generic weekly pattern simply by averaging over each day of week.
birthdays |>
group_by(day_of_week) |>
summarise(meanbirths = mean(births_relative100)) |>
ggplot(aes(x = day_of_week, y = meanbirths)) +
geom_point(color = col_data, size = 4) +
layers_hline100 +
layers_scale_weekday +
labs(x = "Day of week", y = "Relative number of births of week")3 Previous analyses
We had analysed the same data before (see Gelman et al. (2013)) and thus we already had an idea of what kind of model to use. Previously we used GPstuff, Gaussian-process-specific software for Matlab and Octave. As Stan aims to be very generic, it can be slower than specialized software for some specific models such as Gaussian processes, but Stan provides more flexibility in the model definition.
Riutort-Mayol et al. (2023) demonstrate the Hilbert space basis function approximation of Gaussian processes also for the same birthday data. In their experiments the inference was slower than expected, raising suspicion of inefficient model code or a bad posterior shape due to bad model specification.
4 Workflow for quick iterative model building
Even though we have a general idea for the model (slow trend, seasonal trend, weekday effect, etc.), adding all components at once makes the model complex and difficult to debug, and makes computational problems harder to solve. It is thus natural to build the model gradually and check that each addition works before adding the next model component. During this iterative model building we want the inference to be fast, but it doesn't need to be very accurate as long as the new model is qualitatively reasonable. For quick testing and iterative model building we can use optimization, Pathfinder (Zhang et al. 2022), and shorter MCMC chains than we would recommend for the final inference. Furthermore, in this specific example the new additions are such clear qualitative improvements that there is no need for a quantitative model comparison of whether the additions are "significant" (see also Navarro (2019)), and there is no danger of overfitting (see also McLatchie and Vehtari (2024)). There is, however, one part of the model where the data are weakly informative and the prior choices seem to matter; we'll get back to this and its consequences later. Overall we built tens of different models, but illustrate here only the main line.
5 Models for relative number of birthdays
As the relative number of births is positive, it's natural to model its logarithm. The generic form of the models is y \sim \mathrm{normal}(f(x), \sigma), where f is a different and gradually more complex function of x, which includes the running day number, day of year, day of week, and eventually some special floating US bank holidays.
5.1 Model 1: Slow trend
Model 1 is just the slow trend over the years, using a Hilbert space basis function approximated Gaussian process: \begin{aligned} f & = \mathrm{intercept} + f_1\\ \mathrm{intercept} & \sim \mathrm{normal}(0,1)\\ f_1 & \sim \mathrm{GP}(0,K_1) \end{aligned} where the GP has an exponentiated quadratic covariance function.
In this phase, the code by Riutort-Mayol et al. (2023) was cleaned up and rewritten to be more efficient, but only one GP component was included to make testing easier. Although the code was made more efficient, the aim wasn't to make it the fastest possible, as later model changes may have a bigger effect on performance (it's good to avoid premature optimization). We also use a quite small number of basis functions to make the code run faster, and only later examine more carefully whether the number of basis functions is sufficient given the posterior of the length scale (see Riutort-Mayol et al. (2023)).
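To make the approximation concrete, here is a minimal R sketch of the Hilbert space basis function idea for a one-dimensional GP with an exponentiated quadratic kernel, following Riutort-Mayol et al. (2023). This is an illustration of the structure only, not the code in gpbasisfun_functions.stan.
# Basis functions of the approximation on the interval [-L, L]
hsgp_basis <- function(x, M, c) {
  L <- c * max(abs(x))
  sapply(1:M, \(m) sin(pi * m * (x + L) / (2 * L)) / sqrt(L))
}
# Square root of the spectral density of the exponentiated quadratic kernel,
# evaluated at the square roots of the Laplacian eigenvalues
hsgp_sqrt_spd <- function(x, M, c, sigma, lengthscale) {
  L <- c * max(abs(x))
  omega <- pi * (1:M) / (2 * L)
  sigma * sqrt(sqrt(2 * pi) * lengthscale) * exp(-(lengthscale * omega)^2 / 4)
}
# The GP component is then approximately phi %*% (sqrt_spd * beta),
# with beta ~ normal(0, 1):
x <- seq(-1, 1, length.out = 200)
phi <- hsgp_basis(x, M = 20, c = 1.5)
f1 <- phi %*% (hsgp_sqrt_spd(x, M = 20, c = 1.5, sigma = 1, lengthscale = 0.2) * rnorm(20))
The key point is that the GP prior becomes a linear model in M basis functions, so the cost is linear in the number of observations; c controls the width of the approximation domain and M the resolution.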
Compile Stan model gpbf1.stan which includes gpbasisfun_functions.stan.
model1 <- cmdstan_model(stan_file = root("birthdays", "gpbf1.stan"),
include_paths = root("birthdays"))Data to be passed to Stan
standata1 <- list(x = birthdays$id,
y = log(birthdays$births_relative100),
N = length(birthdays$id),
c_f1 = 1.5,
                  M_f1 = 20)
In this simplest model with just one GP, and as the basis function approximation and the priors restrict the complexity of the GP, we can safely use optimization to find the maximum a posteriori (MAP) estimate and get a very quick initial result to check that the model code is computing what we intended (e.g. no NaNs, Infs, or nonsensical results). As there are only 24 parameters and 7305 observations, it's likely that the posterior in the unconstrained parameter space is close to normal. To obtain the correct mode in the unconstrained space, we need to call the Stan optimizer with the option jacobian = TRUE (see the Laplace and Jacobian case study for an illustration). Initialization at 0 in the unconstrained space is good for most GP models. In this case the optimization takes less than one second, while MCMC sampling with default options would have taken several minutes. Although this result can be useful in a quick workflow, it should not be used as the final result.
tic('Finding MAP for model 1 with optimization')
opt1 <- model1$optimize(data = standata1, init = 0, algorithm = "bfgs",
                         jacobian = TRUE)
mytoc()
Finding MAP for model 1 with optimization took 0.22 sec
Check whether parameters have reasonable values
odraws1 <- opt1$draws()
subset(odraws1, variable = c("intercept", "sigma_f1", "lengthscale_f1", "sigma")) |>
as.data.frame() |>
  tt()
| intercept | sigma_f1 | lengthscale_f1 | sigma |
|---|---|---|---|
| -0.056 | 1.1 | 0.18 | 0.81 |
Compare the model to the data
oEf <- exp(as.numeric(subset(odraws1, variable = "f")))
make_pf(birthdays, oEf) + layers_hline100
We can obtain a bit more information by making a normal approximation at the mode in the unconstrained parameter space. As Laplace was the first to use this modal normal approximation, the method is commonly called the Laplace method. Stan samples from the normal approximation in the unconstrained space and transforms the obtained draws to the constrained space. Stan's Laplace method uses jacobian = TRUE by default. As we have already optimized, we can pass the optimization result to the Laplace method. With an additional 2 seconds we get 400 approximate draws.
tic('Sampling from Laplace approximation of model 1 posterior')
lap1 <- model1$laplace(data = standata1, mode = opt1, draws = 400,
                       refresh = 0)
mytoc()
Check whether the parameters have reasonable values. With the Laplace method, we also get some information about the uncertainty in the posterior.
ldraws1 <- lap1$draws()
summarise_draws(subset(ldraws1, variable = c("intercept", "sigma_f1", "lengthscale_f1", "sigma")),
default_summary_measures()) |>
  tt()
| variable | mean | median | sd | mad | q5 | q95 |
|---|---|---|---|---|---|---|
| intercept | -0.068 | -0.04 | 0.36 | 0.37 | -0.65 | 0.51 |
| sigma_f1 | 1.2 | 1.1 | 0.29 | 0.31 | 0.76 | 1.7 |
| lengthscale_f1 | 0.19 | 0.19 | 0.063 | 0.058 | 0.11 | 0.3 |
| sigma | 0.81 | 0.81 | 0.0064 | 0.007 | 0.8 | 0.82 |
At the moment, the Laplace method doesn't automatically run a diagnostic to assess the quality of the normal approximation, but we can do it manually by checking the Pareto-\hat{k} diagnostic of the importance sampling weights that would arise if the normal approximation were used as a proposal distribution (Vehtari et al. 2024).
ldraws1 |>
mutate_variables(lw = lp__ - lp_approx__, w = exp(lw - max(lw))) |>
subset_draws(variable = "w") |>
summarise_draws(pareto_diags, .args = list(tail = "right")) |>
  tt()
| variable | khat | min_ss | khat_threshold | convergence_rate |
|---|---|---|---|---|
| w | 1.3 | Inf | 0.62 | 0 |
Here khat is larger than 0.7, indicating that importance sampling, even with Pareto smoothing, is not able to provide an accurate adjustment. min_ss indicates how many draws would be needed for an accurate importance weighting adjustment, and in this case that number is impractically big. Even though the Laplace approximation can be useful, this diagnostic shows that we would eventually want to run MCMC for more accurate inference.
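A small sketch of the min_ss rule from Vehtari et al. (2024) shows why the table reports Inf: with Pareto-\hat{k} below 1, an accurate importance sampling adjustment needs roughly 10^{1/(1-\hat{k})} draws, and with \hat{k} \geq 1 no finite number of draws is enough.
# min_ss as a function of Pareto-khat (Vehtari et al. 2024);
# with khat = 1.3 as above, the required sample size is infinite
min_ss_rule <- function(khat) ifelse(khat < 1, 10^(1 / (1 - khat)), Inf)
min_ss_rule(c(0.5, 0.7, 1.3))
# ~100 ~2154 Inf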
After we get the model working using optimization, we can compare the result to using short MCMC chains, which also provides additional information on the speed of different code implementations of the same model. We intentionally use just 1/10th of the usually recommended chain length, as rough results are sufficient during iterative model building. When testing the code we initially used just one chain, but at this point running four chains on a four-core CPU doesn't add much to the wall clock time, gives more information on how easy the posterior is to sample from, and can reveal if there are multiple modes. Although the result from short chains can be useful in a quick workflow, it should not be used as the final result.
tic('MCMC sampling from model 1 posterior')
fit1 <- model1$sample(data = standata1, iter_warmup = 100, iter_sampling = 100,
                      chains = 4, parallel_chains = 4, seed = 3896)
mytoc()
Depending on the random seed and luck, we sometimes observed that some of the chains got stuck in different modes. We could see this in high Rhat and low ESS diagnostic values. When updating this case study, we didn't see multimodality with a few different seeds, but you can see such an example in the Illustration of simple problematic posteriors case study.
We can reduce the possibility of getting stuck in minor modes and improve the warmup by using the Pathfinder algorithm. Pathfinder runs several optimizations, but chooses the normal approximation along each optimization path that minimizes the "exclusive" Kullback-Leibler divergence from the approximation to the target posterior. Pathfinder is better than Laplace for highly skewed and funnel-like posteriors, which are typical for hierarchical models. We get 400 draws from 10 Pathfinder runs.
tic('Sampling from Pathfinder approximation of model 1 posterior')
pth1 <- model1$pathfinder(data = standata1, init = 0.1,
num_paths = 10, single_path_draws = 40, draws = 400,
history_size = 100, max_lbfgs_iters = 100,
                            refresh = 0)
Pareto k value (1.4) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 2.1 seconds.
mytoc()
Sampling from Pathfinder approximation of model 1 posterior took 2.9 sec
Pathfinder automatically provides the Pareto-\hat{k} diagnostic, which is high here, indicating that the normal approximation is not good. When Pareto-\hat{k} is very high, Pareto smoothed importance resampling returns fewer distinct draws, and it is useful to check that, too.
pdraws1 <- pth1$draws()
summarise_draws(subset(pdraws1, variable = c("lp__")), n_distinct) |>
  tt()
| variable | n_distinct |
|---|---|
| lp__ | 50 |
Check whether parameters have reasonable values
summarise_draws(subset(pdraws1, variable = c("intercept", "sigma_f1", "lengthscale_f1", "sigma")),
default_summary_measures()) |>
  tt()
| variable | mean | median | sd | mad | q5 | q95 |
|---|---|---|---|---|---|---|
| intercept | -0.049 | 0.005 | 0.13 | 0.058 | -0.29 | 0.09 |
| sigma_f1 | 0.95 | 0.94 | 0.045 | 0.0062 | 0.87 | 1 |
| lengthscale_f1 | 0.2 | 0.22 | 0.033 | 0.03 | 0.16 | 0.24 |
| sigma | 0.82 | 0.82 | 0.0083 | 0.0064 | 0.8 | 0.83 |
In this case we get more than one distinct draw, the draws have reasonable values, and we also get some information about the uncertainty in the posterior (as with the Laplace method). We use only default_summary_measures() as the MCMC diagnostics are not useful for Pathfinder draws.
The Pathfinder draws can be used to initialize MCMC, as they are likely to be closer to where most of the posterior mass is than the default Stan initialization, which uses uniform random draws from -2 to 2 (in the unconstrained space).
tic('MCMC sampling from model 1 posterior with Pathfinder initialization')
fit1 <- model1$sample(data = standata1, iter_warmup = 100, iter_sampling = 100,
chains = 4, parallel_chains = 4,
                      init = pth1)
mytoc()
In many of the following short MCMC samplings we get some or many divergences and usually a very large number of treedepth exceedences. Divergences indicate possible bias and should eventually be investigated carefully. Treedepth exceedences indicate strong posterior dependencies and slow mixing; sometimes the posterior can be much improved by changing the parameterization or the priors, but as treedepth exceedences don't indicate bias, there is no need for more careful analysis if the resulting ESS and MCSE values are good for the purpose at hand. We'll come back later to a more careful analysis of the final models.
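These diagnostics are easy to check with cmdstanr's built-in diagnostic summary (this call is not in the original notebook, but the method is part of cmdstanr's CmdStanMCMC API):
# Per-chain counts of divergences and treedepth exceedences, plus E-BFMI
fit1$diagnostic_summary()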
draws1 <- fit1$draws()
summarise_draws(subset(draws1, variable = c("intercept", "sigma_f1", "lengthscale_f1", "sigma"))) |>
  tt()
| variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
|---|---|---|---|---|---|---|---|---|---|
| intercept | -0.013 | -0.017 | 0.24 | 0.21 | -0.4 | 0.4 | 1 | 174 | 182 |
| sigma_f1 | 0.58 | 0.57 | 0.12 | 0.12 | 0.42 | 0.81 | 1 | 194 | 290 |
| lengthscale_f1 | 0.23 | 0.23 | 0.04 | 0.035 | 0.16 | 0.29 | 1 | 286 | 248 |
| sigma | 0.81 | 0.81 | 0.0072 | 0.0068 | 0.8 | 0.83 | 1 | 398 | 298 |
Trace plot shows slow mixing but no multimodality.
mcmc_trace(draws1, regex_pars = c("intercept", "sigma_f1", "lengthscale_f1", "sigma"))
The model result from short MCMC chains looks very similar to the optimization result.
draws1 <- as_draws_matrix(draws1)
Ef <- exp(apply(subset(draws1, variable = "f"), 2, median))
make_pf(birthdays, Ef) + layers_hline100
If we compare the result from short sampling to the result from optimizing, we don't see a practical difference in the predictions (although later we see more differences between optimization and MCMC).
birthdays |>
mutate(Ef = Ef,
oEf = oEf) |>
ggplot(aes(x = Ef, y = oEf)) +
geom_point(color = col_data) +
geom_abline() +
labs(x = "Ef from short Markov chain", y = "Ef from optimizing")After the first version of this notebook, Nikolas Siccha examined more carefully the posterior correlations and noticed strong correlation between intercept and the first basis function. Stan’s dynamic HMC is so efficient that the inference is successful anyway. Nikolas suggested removing the intercept term. The intercept term is not necessarily needed as the data has been centered. We test a model without the explicit intercept term.
Compile Stan model gpbf1b.stan
model1b <- cmdstan_model(stan_file = root("birthdays", "gpbf1b.stan"),
include_paths = root("birthdays"))First run Pathfinder
tic('Sampling from Pathfinder approximation of model 1b posterior')
pth1b <- model1b$pathfinder(data = standata1, init = 0.1,
num_paths = 10, single_path_draws = 40, draws = 400,
history_size = 50, max_lbfgs_iters = 100,
                              refresh = 0)
Pareto k value (1.7) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 2.2 seconds.
mytoc()
Sampling from Pathfinder approximation of model 1b posterior took 2.8 sec
Then sample using the Pathfinder initialization.
tic('MCMC sampling from model 1b posterior with Pathfinder initialization')
fit1b <- model1b$sample(data = standata1, iter_warmup = 100, iter_sampling = 100,
chains = 4, parallel_chains = 4,
                        init = pth1b)
mytoc()
MCMC sampling from model 1b posterior with Pathfinder initialization took 16 sec
The sampling is even faster, indicating that the strong posterior correlation in the first model was causing trouble for the adaptation during the short warmup.
draws1b <- fit1b$draws()
summarise_draws(subset(draws1b, variable = c("sigma_f1", "lengthscale_f1", "sigma"))) |>
  tt()
| variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
|---|---|---|---|---|---|---|---|---|---|
| sigma_f1 | 0.59 | 0.57 | 0.13 | 0.13 | 0.42 | 0.83 | 1 | 140 | 341 |
| lengthscale_f1 | 0.23 | 0.24 | 0.04 | 0.039 | 0.16 | 0.29 | 1 | 201 | 167 |
| sigma | 0.81 | 0.81 | 0.007 | 0.0072 | 0.8 | 0.82 | 1 | 454 | 339 |
The trace plots don't show multimodality.
mcmc_trace(draws1b, regex_pars = c("sigma_f1", "lengthscale_f1", "sigma"))