Workflow case study for iterative building of a time series model.
We analyse the relative number of births per day in USA 1969-1988 using Gaussian process time series model with several model components that can explain the long term, seasonal, weekly, day of year, and special floating day variation. We use Hilbert space Gaussian process approximation (Riutort-Mayol et al. 2023) to speed up the computation.
We also illustrate the use Pathfinder algorithm (Zhang et al. 2022) to quickly check that model code produces something reasonable and to initialize MCMC sampling.
We can see the generic pattern in yearly seasonal trend simply by averaging over each day of year (day_of_year has numbers from 1 to 366 every year with leap day being 60 and 1st March 61 also on non-leap-years).
birthdays |>group_by(day_of_year2) |>summarise(meanbirths=mean(births_relative100)) |>ggplot(aes(x=as.Date("1986-12-31")+day_of_year2, y=meanbirths)) +geom_point(color=set1[2]) +geom_hline(yintercept=100, color='gray') +scale_x_date(date_breaks ="1 month", date_labels ="%b") +labs(x="Day of year", y="Relative births per day of year")
Figure 3
1.4 Plot mean per day of week
We can see the generic pattern in weekly trend simply by averaging over each day of week.
birthdays |>group_by(day_of_week) |>summarise(meanbirths=mean(births_relative100)) |>ggplot(aes(x=day_of_week, y=meanbirths)) +geom_point(color=set1[2], size=4) +geom_hline(yintercept=100, color='gray') +scale_x_continuous(breaks =1:7, labels=c('Mon','Tue','Wed','Thu','Fri','Sat','Sun')) +labs(x="Day of week", y="Relative number of births of week")
Figure 4
2 Previous analyses
We had analysed the same data before (see Gelman et al. (2013)) and thus we already had an idea of what kind of model to use. Previously we used GPstuff software which is Gaussian process specific software for Matlab and Octave. As Stan has aimed to be very generic it can be slower than specialized software for some specific models such as Gaussian processes, but Stan provides more flexibility in the model definition.
Riutort-Mayol et al. (2023) demonstrate Hilbert space approximate basis function approximation of Gaussian processes also for the same birthday data. In the experiments the inference was slower than expected raising suspicion of inefficient model code or bad posterior shape due to bad model specification.
3 Workflow for quick iterative model building
Even we have general idea for the model (slow trend, seasonal trend, weekday effect, etc), adding them all at once to the model makes the model complex and difficult to debug and solve the computational problems. It is thus natural to build the model gradually and check that each addition works before adding the next model component. During this iterative model building we want the inference to be fast, but it doesn’t need to be very accurate as long as qualitatively the new model is reasonable. For quick testing and iterative model building we can use optimization, Pathfinder (Zhang et al. 2022) and shorter MCMC chains that we would not recommend for the final inference. Furthermore, in this specific example, the new additions are qualitatively so clear improvements that there is no need for quantitative model comparison whether the additions are ``significant’’ (see also Navarro (2019)) and there is no danger of overfitting (see also (McLatchie+Vehtari:2024?)). Although there is one part of the model where the data is weakly informative and the prior choices seem to matter and we’ll get back to this and consequences later. Overall we build tens of different models, but illustrate here only the main line.
4 Models for relative number of birthdays
As the relative number of births is positive it’s natural to model the logarithm value. The generic form of the models is
y \sim \mathrm{normal}(f(x), \sigma),
where f is different and gradually more complex function conditional on x that includes running day number, day of year, day of week and eventually some special floating US bank holidays.
4.1 Model 1: Slow trend
The model 1 is just the slow trend over the years using Hilbert space basis function approximated Gaussian process
\begin{aligned}
f & = \mathrm{intercept} + f_1\\
\mathrm{intercept} & \sim \mathrm{normal}(0,1)\\
f_1 & \sim \mathrm{GP}(0,K_1)
\end{aligned}
where GP has exponentiated quadratic covariance function.
In this phase, the code by Riutort-Mayol et al. (2023) was cleaned and written to be more efficient, but only the one GP component was included to make the testing easier. Although the code was made more efficient, the aim wasn’t to make it the fastest possible, as the later model changes may have bigger effect on the performance (it’s good to avoid premature optimization). We also use quite small number of basis functions to make the code run faster, and only later examine more carefully whether the number of basis function is sufficient compared to the posterior of the length scale (see, Riutort-Mayol et al. (2023)).
standata1 <-list(x=birthdays$id,y=log(birthdays$births_relative100),N=length(birthdays$id),c_f1=1.5, # factor c of basis functions for GP for f1M_f1=20) # number of basis functions for GP for f1
In this simplest model with just one GP, and as the basis function approximation and priors restrict the complexity of GP, we can safely use optimization to find maximum a posteriori (MAP) estimate get a very quick initial result to check that the model code is computing what we intended (e.g. no NaN, Infs, or non-sensical results). As there are only 14 parameters and 7305 observations it’s likely that the posterior in the unconstrained parameter space is close to normal. To obtain the correct mode in the unconstrained space, we need to call Stan optimizer with option jacobian=TRUE (see Laplace and Jacobian case study for illustration). Initialization at 0 in unconstrained space is good for most GP models. In this case the optimization takes less than one second while MCMC sampling with default options would have taken several minutes. Although this result can be useful in a quick workflow, the result should not be used as the final result.
tic('Finding MAP for model 1 with optimization')opt1 <- model1$optimize(data = standata1, init=0, algorithm='bfgs',jacobian=TRUE, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
Finding MAP for model 1 with optimization took 0.19 sec
oEf <-exp(as.numeric(subset(odraws1, variable='f')))birthdays |>mutate(oEf = oEf) |>ggplot(aes(x=date, y=births_relative100)) +geom_point(color=set1[2], alpha=0.2) +geom_line(aes(y=oEf), color=set1[1]) +geom_hline(yintercept=100, color='gray') +labs(x="Date", y="Relative number of births")
Figure 5
We can obtain a bit more information by making a normal approximation at the mode in the unconstrained parameter space. As Laplace was the first one to use the modal normal approximation, the method is commonly called Laplace method. Stan samples from the normal approximation in the unconstrained space and transforms the obtained draws to the constrained space. Stan’s Laplace method uses jacobian=TRUE by default. As we did already optimize, we can pass the optimization result to the Laplace method. With additional 2s we get 400 approximate draws.
tic('Sampling from Laplace approximation of model 1 posterior')lap1 <- model1$laplace(data = standata1, mode=opt1, draws=400,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
Sampling from Laplace approximation of model 1 posterior took 1.1 sec
Check whether parameters have reasonable values. With Laplace method, we get also some information about the uncertainty in the posterior.
At the moment, the Laplace method doesn’t automatically run diagnostic to assess the quality of the normal approximation, but we can do it manually by checking the Pareto-\hat{k} diagnostic for the importance sampling weights if the normal approximation would be used as a proposal distribution (Vehtari et al. 2024).
Here khat is larger than 0.7 indicating that importance sampling even with Pareto smoothing is not able to provide accurate adjustment. min_ss indicates how many draws would be needed to get an accurate importance weighting adjustment, and in this that number is impractically big. Even the Laplace approximation can be useful, this diagnostic shows that we would eventually want to run MCMC for more accurate inference.
After we get the model working using optimization we can compare the result to using short MCMC chains which will also provide us additional information on speed of different code implementations for the same model. We intentionally use just 1/10th length from the usual recommendation, as during the iterative model building a rough results are sufficient. When testing the code we initially used just one chain, but at this point running four chains with four core CPU doesn’t add much to the wall clock time, but gives more information of how easy it is sample from the posterior and can reveal if there are multiple modes. Although the result from short chains can be useful in a quick workflow, the result should not be used as the final result.
tic('MCMC sampling from model 1 posterior')fit1 <- model1$sample(data=standata1, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4, seed=3896, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 1 posterior took 6.5 sec
Depending on the random seed and luck, we sometimes observed that some of the chains got stuck in different modes. We could see this in high Rhat and low ESS diagnostic values. When updating this case study, we didn’t see multimodality with a few different seeds, but you can see such an example in Illustration of simple problematic posteriors case study
We can reduce the possibility of getting stuck in minor modes and improve the warmup by using Pathfinder algorithm. Pathfinder runs several optimizations, but chooses a normal approximation along the optimization path that minimizes ``exclusive’’-Kullback-Leibler distance from the approximation to the target posterior. Pathfinder is better than Laplace for highly skewed and funnel like posteriors which are typical for hierarchical model. We get 400 draws from 10 Pathfinder runs.
tic('Sampling from Pathfinder approximation of model 1 posterior')pth1 <- model1$pathfinder(data = standata1, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=100, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (1) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 1.7 seconds.
mytoc()
Sampling from Pathfinder approximation of model 1 posterior took 1.7 sec
Pathfinder provides automatically Pareto-\hat{k} diagnostic which is high, indicating the normal approximation is not good. When Pareto-\hat{k} is very high the Pareto smoothed importance sampling returns fewer distinct draws, and it is useful to check that, too.
In this case, we get more than one distinct draw, and the draws do have reasonable values and we get also some information about the uncertainty in the posterior (as with Laplace method). We use only default_summary_measures() as the MCMC diagnostics are not useful for Pathfinder draws.
The Pathfinder draws can be used to initialize MCMC as they are likely to be closer to where most of the posterior mass is than the default Stan initialization using uniform random draws from -2 to 2 (in unconstrained space).
tic('MCMC sampling from model 1 posterior with Pathfinder initialization')fit1 <- model1$sample(data=standata1, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth1, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 1 posterior with Pathfinder initialization took 7.4 sec
In many of the following short MCMC samplings we get some or many divergences and usually very large number of treedepth exceedences. Divergences indicate possible bias and should be eventually investigated carefully. Treedepth exceedences indicate strong posterior dependencies and slow mixing and sometimes the posterior can be much improved by changing the parameterization or priors, but as the treedepth exceedences don’t indicate bias there is no need for more careful analysis if the resulting ESS and MCSE values are good for the purpose in hand. We’ll come back later to more careful analysis of the final models.
If we compare the result from short sampling to optimizing, we don’t see practical difference in the predictions (although we see later more differences between optimization and MCMC).
Code
birthdays |>mutate(Ef = Ef,oEf = oEf) |>ggplot(aes(x=Ef, y=oEf)) +geom_point(color=set1[2]) +geom_abline() +labs(x="Ef from short Markov chain", y="Ef from optimizing")
Figure 7
After the first version of this notebook, Nikolas Siccha examined more carefully the posterior correlations and noticed strong correlation between intercept and the first basis function. Stan’s dynamic HMC is so efficient that the inference is successful anyway. Nikolas suggested removing the intercept term. The intercept term is not necessarily needed as the data has been centered. We test a model without the explicit intercept term.
tic('Sampling from Pathfinder approximation of model 1b posterior')pth1b <- model1b$pathfinder(data = standata1, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (1.4) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 1.7 seconds.
mytoc()
Sampling from Pathfinder approximation of model 1b posterior took 1.8 sec
Then sample using the Pathfinder initialization.
tic('MCMC sampling from model 1b posterior with Pathfinder initialization')fit1b <- model1b$sample(data=standata1, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth1b, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 1b posterior with Pathfinder initialization took 7.6 sec
The sampling is even faster, indicating that the strong posterior correlation in the first model was causing troubles for the adaptation in the short warmup.
We drop global intercept from the rest of the models, but continue using Pathfinder to initialize the sampling.
Compare the mean and sd of parameters from Pathfinder and MCMC. When the normal approximation is not good, Pathfinder tends to underestimate the posterior variance, which makes it less useful for setting the initial mass matrix. Thus, here we are using Pathfinder only to get initial values for MCMC.
Code
variables <-names(model1b$variables()$parameters)sp<-summarise_draws(subset(pth1b$draws(), variable=variables))sm<-summarise_draws(subset(draws1b, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel() +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 8
4.2 Model 2: Slow trend + yearly seasonal trend
The model 2 adds yearly seasonal trend using GP with periodic covariance function.
\begin{aligned}
f & = \mathrm{intercept} + f_1 + f_2 \\
\mathrm{intercept} & \sim \mathrm{normal}(0,1)\\
f_1 & \sim \mathrm{GP}(0,K_1)\\
f_2 & \sim \mathrm{GP}(0,K_2)
\end{aligned}
where the first GP uses the exponentiated quadratic covariance function, and the second one a periodic covariance function. Most years have 365 calendar days and every four years (during the data range) there are 366 days, and thus we simplify and use period of 365.25 for the periodic component,
The first version of model 2 with the added periodic component following Riutort-Mayol:2023:HSGP turned out be very slow. With the default MCMC options the inference would have taken hours, but with the short chains it was possible to infer that something has to be wrong. The model output was sensible, but diagnostics indicated very slow mixing. By more careful examination of the model it turned out that the periodic component was including another intercept term and with two intercept terms their sum was well informed by the data, but individually they were not well informed and thus the posteriors were wide, which lead to very slow mixing. This bad model is not shown here, but the optimization, short MCMC chains and sampling diagnostic tools were crucial for fast experimentation and solving the problem.
Compile Stan model 2 (the fixed version) gpbf2.stan
standata2 <-list(x=birthdays$id,y=log(birthdays$births_relative100),N=length(birthdays$id),c_f1=1.5, # factor c of basis functions for GP for f1M_f1=20, # number of basis functions for GP for f1J_f2=20) # number of basis functions for periodic f2
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
tic('Sampling from Pathfinder approximation of model 2 posterior')pth2 <- model2$pathfinder(data = standata2, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (4.1) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 5.0 seconds.
mytoc()
Sampling from Pathfinder approximation of model 2 posterior took 5 sec
Pareto-\hat{k} is even higher, but the Pathfinder draws are likely to be useful for quick analysis and initialization of MCMC sampling.
Even Pareto-\hat{k} indicated that Pathfinder approximation was not good enough to be reliably adjusted using importance sampling, the draws produce sensible model predictions.
Sample short chains using the Pathfinder result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 2 posterior with Pathfinder initialization')fit2 <- model2$sample(data=standata2, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth2, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 2 posterior with Pathfinder initialization took 33 sec
While Pathfinder took about 12s, sampling with short chains is taking over 70s, and we see a clear benefit in being able to obtain approximate Pathfinder results faster.
Seasonal component has reasonable fit to the data.
Compare the mean and sd of parameters from Pathfinder and MCMC.
Code
variables <-names(model2$variables()$parameters)sp<-summarise_draws(subset(pdraws2, variable=variables))sm<-summarise_draws(subset(draws2, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel() +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 11
4.3 Model 3: Slow trend + yearly seasonal trend + day of week
Based on the quick plotting of the data above, day of week has a clear effect and there are less babies born on Saturday and Sunday. This can be taken into account with simple additive coefficients. We fix the effect of Monday to 0 and have additional coefficients for other weekdays.
\begin{aligned}
f & = \mathrm{intercept} + f_1 + f_2 + \beta_{\text{day of week}} \\
\mathrm{intercept} & \sim \mathrm{normal}(0,1)\\
f_1 & \sim \mathrm{GP}(0,K_1)\\
f_2 & \sim \mathrm{GP}(0,K_2)\\
\beta_{\text{day of week}} & = 0 \quad \text{if day of week is Monday}\\
\beta_{\text{day of week}} & \sim \mathrm{normal}(0,1) \quad \text{if day of week is not Monday}
\end{aligned}
standata3 <-list(x=birthdays$id,y=log(birthdays$births_relative100),N=length(birthdays$id),c_f1=1.5, # factor c of basis functions for GP for f1M_f1=20, # number of basis functions for GP for f1J_f2=20, # number of basis functions for periodic f2day_of_week=birthdays$day_of_week)
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
tic('Sampling from Pathfinder approximation of model 3 posterior')pth3 <- model3$pathfinder(data = standata3, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (2.3) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 5.2 seconds.
mytoc()
Sampling from Pathfinder approximation of model 3 posterior took 5.3 sec
Pareto-\hat{k} is even higher, but the Pathfinder draws are likely to be useful for quick analysis and initialization of MCMC sampling.
Sample short chains using the Pathfinder result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 3 posterior with Pathfinder initialization')fit3 <- model3$sample(data=standata3, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth3, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 3 posterior with Pathfinder initialization took 71 sec
Weekday effects are easy to estimate as there are about thousand observations per weekday.
Compare the mean and sd of parameters from Pathfinder and MCMC.
Code
variables <-names(model3$variables()$parameters)sp<-summarise_draws(subset(pdraws3, variable=variables))sm<-summarise_draws(subset(draws3, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel() +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 14
4.4 Model 4: long term smooth + seasonal + weekday with increasing magnitude
Looking at the time series of whole data we see the dots representing the daily values forming three branches that are getting further away from each other. In previous analysis (Gelman et al. 2013) we also had a model component allowing gradually changing effect for day of week and did observe that the effect of Saturday and Sunday did get stronger in time. The next model includes time dependent magnitude component for the day of week effect.
\begin{aligned}
f & = \mathrm{intercept} + f_1 + f_2 + \exp(g_3)\beta_{\text{day of week}} \\
\mathrm{intercept} & \sim \mathrm{normal}(0,1)\\
f_1 & \sim \mathrm{GP}(0,K_1)\\
f_2 & \sim \mathrm{GP}(0,K_2)\\
g_3 & \sim \mathrm{GP}(0,K_3)\\
\beta_{\text{day of week}} = & 0 \quad \text{if day of week is Monday}\\
\beta_{\text{day of week}} & \sim \mathrm{normal}(0,1) \quad \text{if day of week is not Monday}
\end{aligned}
The magnitude of the weekday effect is modelled with \exp(g_3), where g_3 has GP prior with zero mean and exponentiated quadratic covariance function.
standata4 <-list(x=birthdays$id,y=log(birthdays$births_relative100),N=length(birthdays$id),c_f1=1.5, # factor c of basis functions for GP for f1M_f1=20, # number of basis functions for GP for f1J_f2=20, # number of basis functions for periodic f2c_g3=1.5, # factor c of basis functions for GP for g3M_g3=5, # number of basis functions for GP for g3day_of_week=birthdays$day_of_week)
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
tic('Sampling from Pathfinder approximation of model 4 posterior')pth4 <- model4$pathfinder(data = standata4, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (2.2) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 7.3 seconds.
mytoc()
Sampling from Pathfinder approximation of model 4 posterior took 7.4 sec
Pareto-\hat{k} is even higher, but the Pathfinder draws are likely to be useful for quick analysis and initialization of MCMC sampling. Paret
Sample short chains using the Pathfinder result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 4 posterior with Pathfinder initialization')fit4 <- model4$sample(data=standata4, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth4, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 4 posterior with Pathfinder initialization took 120 sec
The model fits well the different branches visible in plotted daily relative number of births, that is, it is able to model the increasing weekend effect.
Compare the mean and sd of parameters from Pathfinder and MCMC.
Code
variables <-names(model4$variables()$parameters)sp<-summarise_draws(subset(pdraws4, variable=variables))sm<-summarise_draws(subset(draws4, variable=variables))ggplot(data =NULL, aes(x = sm$mean, xmin = sm$mean - sm$sd, xmax = sm$mean + sm$sd,y = sp$mean, ymin = sp$mean - sp$sd, ymax = sp$mean + sp$sd,label = sm$variable)) +geom_point(color =4) +geom_errorbar(width =0, color =4) +geom_errorbarh(height =0, color =4) +geom_text_repel() +geom_abline(linetype ="dotted") +labs(x ="MCMC mean and sd", y ="Pathfinder mean and sd")
Figure 17
4.5 Model 5: long term smooth + seasonal + weekday with time dependent magnitude + day of year RHS
The next component to add is day of year effect. Many bank holidays are every year on the same day of year and there might be also other special days that are favored or disfavored.
\begin{aligned}
f & = \mathrm{intercept} + f_1 + f_2 + \exp(g_3)\beta_{\text{day of week}} + \beta_{\text{day of year}}\\
\mathrm{intercept} & \sim \mathrm{normal}(0,1)\\
f_1 & \sim \mathrm{GP}(0,K_1)\\
f_2 & \sim \mathrm{GP}(0,K_2)\\
g_3 & \sim \mathrm{GP}(0,K_3)\\
\beta_{\text{day of week}} & = 0 \quad \text{if day of week is Monday}\\
\beta_{\text{day of week}} & \sim \mathrm{normal}(0,1) \quad \text{if day of week is not Monday}\\
\beta_{\text{day of year}} & \sim \mathrm{RHS}(0,0.1)
\end{aligned}
As we assume that only some days of year are special, we use regularized horseshoe (RHS) prior (Piironen and Vehtari 2017) for day of year effects.
At this point the optimization didn’t produce reasonable result as earlier and sampling turned out to be very slow. We assumed the optimization fails because there were so many more parameters with hierarchical prior. As even the short chain sampling would have taken more than hour, it would have been time consuming to further to test the model. As part of the quick iterative model building it was better to give up on this model for a moment. When revisiting this case study and adding Pathfinder approximation, it produced much better results and using it to initialize MCMC, the sampling took only 2.5 minutes.
standata5 <-list(x=birthdays$id,y=log(birthdays$births_relative100),N=length(birthdays$id),c_f1=1.5, # factor c of basis functions for GP for f1M_f1=20, # number of basis functions for GP for f1J_f2=20, # number of basis functions for periodic f2c_g3=1.5, # factor c of basis functions for GP for g3M_g3=5, # number of basis functions for GP for g3scale_global=0.1, # global scale for RHS priorday_of_week=birthdays$day_of_week,day_of_year=birthdays$day_of_year2) # 1st March = 61 every year
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
tic('Sampling from Pathfinder approximation of model 5 posterior')pth5 <- model5$pathfinder(data = standata5, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (12) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 15.8 seconds.
mytoc()
Sampling from Pathfinder approximation of model 5 posterior took 16 sec
Pareto-\hat{k} is even higher, but the Pathfinder draws are likely to be useful for quick analysis and initialization of MCMC sampling.
We now get only one or couple distinct draws (depending on luck), so we don’t get much information about the posterior width, but the draws are still providing a useful approximation.
The quick model fit looks reasonable for a quick fit.
Stan Pathfinder uses Pareto smoothed importance sampling with replacement as default, but when Pareto-\hat{k} is very large that may return less distinct draws than we would like to use for initializing MCMC. In such case we can turn of the PSIS resampling in Stan, and do PSIS without replacement in R.
tic('Sampling from Pathfinder approximation of model 5 posterior')pth5 <- model5$pathfinder(data = standata5, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 16.9 seconds.
mytoc()
Sampling from Pathfinder approximation of model 5 posterior took 17 sec
We check the number of distinct draws, which is now higher (no resampling).
Do importance resampling without replacement. When Pareto-\hat{k} is not too big, for inference we prefer resampling with replacement, but for initializing MCMC we prefer distinct draws, and prefer resampling without replacement.
These steps to obtain 4 distinct draws were for illustration, and sampling initialization makes the same steps internally given a Pathfinder object.
Sample short chains using the Pathfinder result as initial values. (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 5 posterior with Pathfinder initialization')fit5 <- model5$sample(data=standata5, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth5, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 5 posterior with Pathfinder initialization took 270 sec
Before using Pathfinder to initialize sampling, the sampling took longer than my patience, and the sampler result was not included in the case study. With Pathfinder initialization, the sampler finished in 2.5 mins, but reported 100% of maximum treedepths which indicates very strong posterior dependencies.
The plot looks quite good. Compare the mean and sd of parameters from Pathfinder and MCMC. In this case, we are using the non-resampled Pathfinder draws (the resampled draws had only one distinct draw). Compare the mean and sd of parameters from Pathfinder and MCMC. We see that MCMC estimates of sd for some parameters is super high, indicating bad model. Instead of trying the get the computation work better, we drop this model at the moment.
Code
variables <-names(model5$variables()$parameters)sp<-summarise_draws(subset(pth5$draws(), variable=variables))sm<-summarise_draws(subset(draws5, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel() +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 22
4.6 Model 6: long term smooth + seasonal + weekday + day of year
To simplify the analysis of the day of year effect and make the inference during the exploration faster, we drop the time dependent day of week effect and RHS for a moment and use normal prior for the day of year effect.
\begin{aligned}
f & = \mathrm{intercept} + f_1 + f_2 + \beta_{\text{day of week}} + \beta_{\text{day of year}}\\
\mathrm{intercept} & \sim \mathrm{normal}(0,1)\\
f_1 & \sim \mathrm{GP}(0,K_1)\\
f_2 & \sim \mathrm{GP}(0,K_2)\\
\beta_{\text{day of week}} & = 0 \quad \text{if day of week is Monday}\\
\beta_{\text{day of week}} & \sim \mathrm{normal}(0,1) \quad \text{if day of week is not Monday}\\
\beta_{\text{day of year}} & \sim \mathrm{normal}(0,0.1)
\end{aligned}
standata6 <-list(x=birthdays$id,y=log(birthdays$births_relative100),N=length(birthdays$id),c_f1=1.5, # factor c of basis functions for GP for f1M_f1=20, # number of basis functions for GP for f1J_f2=20, # number of basis functions for periodic f2day_of_week=birthdays$day_of_week,day_of_year=birthdays$day_of_year2) # 1st March = 61 every year
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
tic('Sampling from Pathfinder approximation of model 6 posterior')pth6 <- model6$pathfinder(data = standata6, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (13) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 8.5 seconds.
mytoc()
Sampling from Pathfinder approximation of model 6 posterior took 8.7 sec
Pathfinder provides automatically Pareto-\hat{k} diagnostic which is high, indicating the normal approximation is not good. When Pareto-\hat{k} is very high the Pareto smoothed importance sampling returns less distinct draws, and it is useful to check that, too.
Again we get only one or couple distinct draws (depending on luck), so we don’t get much information about the posterior width, but the draws are still providing a useful approximation.
We recognize some familiar structure in the day of year effect and proceed to sampling.
Stan Pathfinder uses Pareto smoothed importance sampling without replacement as default, but when Pareto-\hat{k} is very large that may return less distinct draws than we would like to use for initializing MCMC. In such case we can turn of the PSIS resampling in Stan.
tic('Sampling from Pathfinder approximation of model 6 posterior')pth6 <- model6$pathfinder(data = standata6, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 8.5 seconds.
mytoc()
Sampling from Pathfinder approximation of model 6 posterior took 8.6 sec
Sample short chains using the Pathfinder result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 6 posterior with Pathfinder initialization')fit6 <- model6$sample(data=standata6, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth6, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 6 posterior with Pathfinder initialization took 35 sec
The short sampling result looks reasonable and thus the problem is not in adding the day of year effect itself. In the bottom plot, the circles mark 13th day of each month. Results look similar to our previous analyses (Gelman et al. 2013), so it seems the day or year effect model component is working as it should, but there was some problem with our RHS implementation. As there is more variation in the day of year effects than we would hope, we did some additional experiments with different priors for the day of year effect (double exponential, Cauchy and Student’s t with unknown degrees of freedom as models 6b, 6c, 6d), but decided it’s better to add other components before investing that part more thoroughly.
Compare the mean and sd of parameters from Pathfinder and MCMC. In this case, we are using the non-resampled Pathfinder draws.
Code
variables <-names(model6$variables()$parameters)sp<-summarise_draws(subset(pth6$draws(), variable=variables))sm<-summarise_draws(subset(draws6, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel() +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 27
4.7 Model 7: long term smooth + seasonal + weekday + day of year normal + floating special days
We can see in the model 6 results that day of year effects have some dips in the relative number of births that are spread over a week. From previous analyse we know these correspond to holidays that are not on a specific day of year, but are for example on the last Monday of May. We call these floating special days and include Memorial day (last Monday of May), Labor day (first Monday of September, and we include also the following Tuesday), and Thanksgiving (fourth Thursday of November, and we include also the following Friday).
standata7 <-list(x=birthdays$id,y=log(birthdays$births_relative100),N=length(birthdays$id),c_f1=1.5, # factor c of basis functions for GP for f1M_f1=20, # number of basis functions for GP for f1J_f2=20, # number of basis functions for periodic f2day_of_week=birthdays$day_of_week,day_of_year=birthdays$day_of_year2, # 1st March = 61 every yearmemorial_days=memorial_days,labor_days=labor_days,thanksgiving_days=thanksgiving_days)
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
tic('Sampling from Pathfinder approximation of model 7 posterior')pth7 <- model7$pathfinder(data = standata7, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (9.4) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 8.6 seconds.
mytoc()
Sampling from Pathfinder approximation of model 7 posterior took 8.8 sec
Turn of the PSIS resampling in Stan to get distinct draws for initalization.
tic('Sampling from Pathfinder approximation of model 6 posterior')pth7 <- model7$pathfinder(data = standata7, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 8.6 seconds.
mytoc()
Sampling from Pathfinder approximation of model 6 posterior took 8.6 sec
Sample short chains using the Pathfinder result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 7 posterior with Pathfinder initialization')fit7 <- model7$sample(data=standata7, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth7, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 7 posterior with Pathfinder initialization took 31 sec
standata8 <-list(x=birthdays$id,y=log(birthdays$births_relative100),N=length(birthdays$id),c_f1=1.5, # factor c of basis functions for GP for f1M_f1=20, # number of basis functions for GP for f1J_f2=20, # number of basis functions for periodic f2c_g3=1.5, # factor c of basis functions for GP for g3M_g3=5, # number of basis functions for GP for g3day_of_week=birthdays$day_of_week,day_of_year=birthdays$day_of_year2, # 1st March = 61 every yearmemorial_days=memorial_days,labor_days=labor_days,thanksgiving_days=thanksgiving_days)
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
tic('Sampling from Pathfinder approximation of model 8 posterior')pth8 <- model8$pathfinder(data = standata8, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR)
Pareto k value (6.9) is greater than 0.7. Importance resampling was not able to improve the approximation, which may indicate that the approximation itself is poor.
Finished in 10.6 seconds.
mytoc()
Sampling from Pathfinder approximation of model 8 posterior took 11 sec
tic('Sampling from Pathfinder approximation of model 6 posterior')pth8 <- model8$pathfinder(data = standata8, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 11.8 seconds.
mytoc()
Sampling from Pathfinder approximation of model 6 posterior took 12 sec
Sample short chains using the Pathfinder result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 8 posterior with Pathfinder initialization')fit8 <- model8$sample(data=standata8, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth8, refresh=10, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 8 posterior with Pathfinder initialization took 49 sec
The inference for the model works fine, which hints that our RHS implementation for the model 5 had challenging posterior. Before testing RHS again, we’ll test with an easier to implement Student’s t prior whether long tailed prior for day of year effect is reasonable. These experiments help also to find out whether the day of year effect is sensitive to the prior choice.
Compare the mean and sd of parameters from Pathfinder and MCMC. In this case, we are using the non-resampled Pathfinder draws.
Code
variables <-names(model8$variables()$parameters)sp<-summarise_draws(subset(pth8$draws(), variable=variables))sm<-summarise_draws(subset(draws8, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel() +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 33
4.9 Model 8+t_nu: day of year effect with Student’s t prior
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result). We turn off the resampling in Stan to get distinct draws for initialization of MCMC.
tic('Sampling from Pathfinder approximation of model 8tnu posterior')pth8tnu <- model8tnu$pathfinder(data = standata8, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 11.2 seconds.
mytoc()
Sampling from Pathfinder approximation of model 8tnu posterior took 11 sec
Sample short chains using the Pathfinder result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 8tnu posterior with Pathfinder initialization')fit8tnu <- model8tnu$sample(data=standata8, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth8tnu, refresh=10, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 8tnu posterior with Pathfinder initialization took 260 sec
We get a high number of divergences. The posterior clearly has a challenging shape. If we would like to continue using this model, eventually we should fix the computational issues, but before that we can examine the initial results in case we decide to not continue with this model after all.
Posterior of degrees of freedom nu_f4 is very close to 0.5, and thus the distribution has thicker tails than Cauchy. This is strong evidence that the distribution of day of year effects is far from normal. Compare the model to the data
The other effects seem to be quite similar as with the previous model, but the day of year effects are clearly different with most days having non-detectable effect. There are also effects that seemed to be quite clear in normal prior model such as 13th day of month effect, which is not visible anymore. As the posterior of degrees of freedom t_nu was concentrated close to 1, it’s likely that the normal prior for day of year effect can’t be the best. So far we hadn’t used model comparison such as leave-one-out cross-validation (LOO-CV, Vehtari, Gelman, and Gabry (2017)) as each added component had qualitatively big and reasonable effect. Now as day of year effect is sensitive to prior choice, but it’s not clear how much better t_\nu prior distribution is we use LOO-CV to compare the models.
As we could have expected based on the posterior of nu_f4 Student’s t prior on day of year effects is better. As low degrees of freedom indicate a thick tailed distribution for day of year effect is needed, we decided to test again RHS prior.
We did get warnings about high Pareto-\hat{k} values, but as the difference is big, and the normal model is nested in the Student’s t model with the posterior clearly indicating non-normality, we don’t need to fix the LOO computation to trust this specific comparison.
Compare the mean and sd of parameters from Pathfinder and MCMC. In this case, we are using the non-resampled Pathfinder draws.
Code
variables <-names(model8tnu$variables()$parameters)sp<-summarise_draws(subset(pth8tnu$draws(), variable=variables))sm<-summarise_draws(subset(draws8tnu, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel() +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 35
4.10 Model 8+RHS: day of year effect with RHS prior
Model 5 had RHS prior but the problem was that optimization result wasn’t even close to sensible and MCMC was very slow. Given the other models we now know that the problem is not in adding day of year effect or combining it with time dependent magnitude for the day of week effect. It was easier now to focus on figuring out the problem in RHS. Since RHS is presented as a scale mixture of normals involving hierarchical prior, it is common to use non-centered parameterization for RHS prior. Non-centered parameterization is useful when the information from the likelihood is weak and the prior dependency dominates in the posterior dependency. RHS is often used when there are less observations than unknowns. In this problem each unknown (one day of year effect) is informed by several observations from different years, and then it might be that the centered parameterization is better. And this turned out to be true and the inference for model 8 with centered parameterization RHS prior on day of year effect worked much better than for model 5. (In Stan it was easy to test switch from non-centered to centered parameterization by removing the multiplier from one of the parameter declarations).
standata8 <-c(standata8,scale_global=0.1) # global scale for RHS prior
Pathfinder is faster than sampling (although this result can be useful in a quick workflow, the result should not be used as the final result).
tic('Sampling from Pathfinder approximation of model 8rhs posterior')pth8rhs <- model8rhs$pathfinder(data = standata8, init=0.1,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 14.3 seconds.
mytoc()
Sampling from Pathfinder approximation of model 8rhs posterior took 14 sec
Sample short chains using the optimization result as initial values (although the result from short chains can be useful in a quick workflow, the result should not be used as the final result).
tic('MCMC sampling from model 8rhs posterior with Pathfinder initialization')fit8rhs <- model8rhs$sample(data=standata8, iter_warmup=100, iter_sampling=100,chains=4, parallel_chains=4,init=pth8rhs, refresh=10, output_dir=CMDSTANR_OUTPUT_DIR)
mytoc()
MCMC sampling from model 8rhs posterior with Pathfinder initialization took 250 sec
We get a high number of divergences. The posterior clearly has a challenging shape. If we would like to continue using this model, eventually we should fix the computational issues, but before that we can examine the initial results in case we decide to not continue with this model after all.
Visually we get quite similar result as with t_\nu prior. When we compare the models with LOO-CV (Vehtari, Gelman, and Gabry 2017), RHS prior is better.
Compare the mean and sd of parameters from Pathfinder and MCMC. In this case, we are using the non-resampled Pathfinder draws.
Code
variables <-names(model8rhs$variables()$parameters)sp<-summarise_draws(subset(pth8rhs$draws(), variable=variables))sm<-summarise_draws(subset(draws8rhs, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel() +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 37
lambda_f4 is a vector parameter for scale mixture presentation of RHS prior, and it is a weakly identifiable which can explain bigger differences between Pathfinder and MCMC than for other parameters.
4.11 Further improvements for the day of year effect
It’s unlikely that day of year effect would be unstructured with some distribution like RHS, and thus instead of trying to find a prior distribution that would improve LOO-CV, it would make more sense to further add structural information. For example, it would be possible to add more known special days and take into account that a special day effect and weekend effect probably are not additive. Furthermore if there are less births during some day, the births need to happen some other day and it can be assumed that there would be corresponding excess of births before of after a bank holiday. This ringing around days with less births is not simple as it is also affected whether the previous and following days are weekend days. This all gets more complicated than we want to include in this case study, but the reader can see how the similar gradual model building could be made by adding additional components. Eventually it is likely that there starts to be worry of overfitting, but integration over the unknown alleviates that and looking at the predictive performance estimates such LOO-CV can help to decide when the additional model components don’t improve the predictive performance or can’t be well identified.
4.12 Quantitative predictive performance for the series of models
We didn’t use LOO-CV (Vehtari, Gelman, and Gabry 2017) until in the end, as the qualitative differences between models were very convincing. We can use LOO-CV to check how big the difference in the predictive performance are and if the differences are big, we know that model averaging that would take into account the uncertainty would give weights close to zero for all but the most elaborate models.
We can see some structure, specifically in years 1969–1978 the residual has negative peak in the middle of the year, while in years 1981–1988 the residual has positive peak in the middle of the year. This kind of pattern appears as we use the same seasonal effect for all years, but the magnitude of seasonal effect is changing. It would be possible to modify the model to include gradually changing seasonal effect, but leave it out from this case study.
The best model so far explains already 95% of the variance (LOO-R2).
As it seems we could still improve by adding more structure and time varying seasonal effect, it seems the variability in the number of births from day to day is quite well predictable. Of course big part of the variation is due to planned induced births and c-sections, and thus hospitals do already control the number of births per day and there is no really practical use for the result. However there are plenty of similar time series, for example, in consumer behavior that are affected by special days.
Above we used the same Stan model code for Pathfinder and MCMC which included generated quantities. If we want to use Pathfinder just for initialization, and trust it so that we don’t need to check the generated quantities we can drop that out (or call it separately).
tic('Sampling from Pathfinder approximation of model 8rhs posterior')pth8rhs <- model8rhs_nogq$pathfinder(data = standata8, init=0.1,num_threads=10,num_paths=10, single_path_draws=40, draws=400,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 3.1 seconds.
mytoc()
Sampling from Pathfinder approximation of model 8rhs posterior took 3.3 sec
As the generated quantities are not computed and written to the csv, the time in my laptop drops to 3s.
Above we created the MCMC initial values using PSIS resampling without replacement. When the Pathfinder is used just for initializing MCMC, we could further reduce the time by getting less draws. If we are worried of getting stuck in minor modes, we could still run more paths than the required number of initial values and resample without replacement in hope of dropping out the draws from the minor modes. At the moment (2024-01-21) there is a bug so that minimum number of draws per path is 25.
tic('Sampling from Pathfinder approximation of model 8rhs posterior')pth8rhs <- model8rhs_nogq$pathfinder(data = standata8, init=0.1,num_threads=10,num_paths=10, single_path_draws=1, draws=10,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 3.0 seconds.
mytoc()
Sampling from Pathfinder approximation of model 8rhs posterior took 3.1 sec
There is not much time saving and the number of paths run dominate.
If we are not worried about the minor modes we could run just 4 paths. At the moment (2024-01-21) there is a bug so that minimum number of draws per path is 25, but these draws the ones used to estimate ELBO, so there is no additional computational cost.
tic('Sampling from Pathfinder approximation of model 8rhs posterior')pth8rhs <- model8rhs_nogq$pathfinder(data = standata8, init=0.1,num_threads=4,num_paths=4, single_path_draws=1, draws=4,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 2.2 seconds.
mytoc()
Sampling from Pathfinder approximation of model 8rhs posterior took 2.3 sec
With my laptop, it takes about 1.5s to run 4 paths to get 4 distinct draws to initialize MCMC.
Alternatively we could run just one path with 4 draw from the normal approximation. At the moment (2024-01-21) there is a bug so that minimum number of draws per path is 25.
tic('Sampling from Pathfinder approximation of model 8rhs posterior')pth8rhs <- model8rhs_nogq$pathfinder(data = standata8, init=0.1,num_threads=4,num_paths=1, single_path_draws=4, draws=4,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 1.2 seconds.
mytoc()
Sampling from Pathfinder approximation of model 8rhs posterior took 1.3 sec
The time goes down to less than 1s, but these draws are likely to have less variation than draws from multi-Pathfinder.
Earlier we plotted mean and sd of several parameters based on MCMC and Pathfinder. Without generated quantities, we can easily increase the number of draws from the normal approximation to improve such mean and sd estimates, without increasing the computation time much.
tic('Sampling from Pathfinder approximation of model 8rhs posterior')pth8rhs <- model8rhs_nogq$pathfinder(data = standata8, init=0.1,num_threads=10,num_paths=40, single_path_draws=100, draws=4000,history_size=50, max_lbfgs_iters=100,refresh=0, output_dir=CMDSTANR_OUTPUT_DIR,psis_resample=FALSE)
Finished in 12.2 seconds.
mytoc()
Sampling from Pathfinder approximation of model 8rhs posterior took 12 sec
Compare the mean and sd of some parameters from Pathfinder and MCMC. In this case, we are using the non-resampled Pathfinder draws.
Code
variables <-names(model8rhs$variables()$parameters)sp<-summarise_draws(subset(pth8rhs$draws(), variable=variables))sm<-summarise_draws(subset(draws8rhs, variable=variables))ggplot(data=NULL, aes(x=sm$mean, xmin=sm$mean-sm$sd, xmax=sm$mean+sm$sd,y=sp$mean, ymin=sp$mean-sp$sd, ymax=sp$mean+sp$sd,label=sm$variable)) +geom_point(color=4) +geom_errorbar(width=0,color=4) +geom_errorbarh(height=0,color=4) +geom_text_repel(max.overlaps=20) +geom_abline(linetype='dotted') +labs(x='MCMC mean and sd', y='Pathfinder mean and sd')
Figure 39
Plot the sd from Pathfinder divided by sd from MCMC. We see that for some parameters the order of magnitude is fine, but for some sd is underestimated by 2-3 orders of magnitude, which would make these estimates bad for initializing the mass matrix.
During all the iterative model building we favored optimization and short MCMC chains. In the end we also run with higher adapt_delta to reduce the probability of divergences, higher maximum treedepth to ensure higher effective sample size per iteration (ESS per second doesn’t necessarily improve), and run much longer chains, but didn’t see practical differences in plots or LOO-CV values. As running these longer chains can take hours they are not run as part of this notebook. An example of how to reduce probability of divergences and increase maximum treedepth is shown below (there is rarely need to increase adapt_delta larger than 0.95 and if there are still divergences with adapt_delta equal to 0.99, the posterior has serious problems and it should be considered whether re-parameterization, better data or more informative priors could help).
Gelman, Andrew, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. Bayesian Data Analysis. Third edition. CRC Press.
Navarro, Danielle J. 2019. “Between the Devil and the Deep Blue Sea: Tensions Between Scientific Judgement and Statistical Model Selection.”Computational Brain & Behavior 2 (1): 28–34.
Piironen, Juho, and Aki Vehtari. 2017. “Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors.”Electronic Journal of Statistics 11: 5018–51.
Riutort-Mayol, Gabriel, Paul-Christian Bürkner, Michael R Andersen, Arno Solin, and Aki Vehtari. 2023. “Practical Hilbert Space Approximate BayesianGaussian Processes for Probabilistic Programming.”Statistics and Computing 33 (1): 17.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.”Statistics and Computing 27: 1413–32.
Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2024. “Pareto Smoothed Importance Sampling.”Journal of Machine Learning Research 25 (72): 1–58.
Zhang, Lu, Bob Carpenter, Andrew Gelman, and Aki Vehtari. 2022. “Pathfinder: Parallel Quasi-Newton Variational Inference.”Journal of Machine Learning Research 23 (306).