We work with an example of predicting mathematics and Portuguese exam grades for a sample of high school students in Portugal. The same data were used in Chapter 12 of the book Regression and Other Stories to illustrate different models for regression coefficients.
We predict the students’ final-year median exam grade in mathematics (n=382) and Portuguese (n=657) given a large number of potentially relevant predictors: student’s school, student’s sex, student’s age, student’s home address type, family size, parents’ cohabitation status, mother’s education, father’s education, home-to-school travel time, weekly study time, number of past class failures, extra educational support, extra paid classes within the course subject, extra-curricular activities, whether the student attended nursery school, whether the student wants to take higher education, internet access at home, whether the student has a romantic relationship, quality of family relationships, free time after school, going out with friends, weekday alcohol consumption, weekend alcohol consumption, current health status, and number of school absences.
2 Variable selection
If we cared only about predictive performance, we would not need to do variable selection; we could use all the variables with a sensible joint prior. Here we are interested in finding the smallest set of variables that provides predictive performance similar to using all the variables (with a sensible prior). This helps to improve explainability and to design further studies that could also include interventions. We are not considering causal structure, and the selected variables are unlikely to have a direct causal effect, but variables with high predictive relevance are ones whose role in the causal graph should eventually be considered.
The data include three grades for both mathematics and Portuguese. To reduce the variability in the outcome, we use the median of the three exam grades for each subject. We select only students with non-zero grades.
We standardize all non-binary predictors to have standard deviation 1 to make the comparison of relevances easier, as discussed in Regression and Other Stories Section 12.1.
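The data preparation can be sketched as follows; this is an illustrative sketch, not the case study's actual code, and the grade column names G1, G2, G3 and the data frame names are assumptions.

# Sketch of outcome construction and standardization; columns G1, G2, G3
# for the three exams are assumed.
library(dplyr)
studentstd_Gmat <- student_math |>
  mutate(Gmat = apply(cbind(G1, G2, G3), 1, median)) |>  # median of the three exams
  filter(Gmat > 0) |>                                    # keep non-zero grades
  select(-G1, -G2, -G3) |>
  # scale non-binary numeric predictors to sd 1; the outcome is left as-is
  mutate(across(where(~ is.numeric(.x) && n_distinct(.x) > 2) & !Gmat,
                ~ .x / sd(.x)))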
Before variable selection, we want to build a good model with all covariates. We first illustrate that default priors may be bad when we have many predictors.
By default, brms uses uniform (“flat”) priors on regression coefficients. We also use the option normalize=FALSE, which omits constant normalization terms from the log density for a small sampling speed-up.
fitmu <- brm(Gmat ~ ., data = studentstd_Gmat, normalize = FALSE)
If we compare posterior-R^2 (bayes_R2()) and LOO-R^2 (loo_R2()) (Gelman et al. 2019), we see that the posterior-R^2 is much higher, which indicates that the residual variance is strongly underestimated and the model has overfitted the data.
bayes_R2(fitmu) |> as.data.frame() |> tt()

Estimate  Est.Error  Q2.5  Q97.5
0.32      0.031      0.26  0.38
loo_R2(fitmu) |> as.data.frame() |> tt()

Estimate  Est.Error  Q2.5  Q97.5
0.2       0.04       0.11  0.27
We plot the marginal posteriors for coefficients. For many coefficients the posterior is quite wide.
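A possible way to produce such a plot with bayesplot (the draws object name drawsmu is an assumption):

library(bayesplot)
library(posterior)
drawsmu <- as_draws_df(fitmu)
# marginal posteriors of the regression coefficients (brms prefixes them with b_)
mcmc_areas(drawsmu, regex_pars = "^b_") + vline_0(linetype = 'dashed')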
The common proper prior choice for coefficients is an independent wide normal prior. The piranha theorem states that it is not possible for all predictors to be independent and have large coefficients at the same time (Tosh et al. 2021). Thus, when we have many predictors, we should include prior information that not all coefficients can be large. One option is simply to divide the normal prior variance by the number of predictors, which keeps the total variance constant (assuming the predictors are normalized); a sketch of this option is shown below. Another option is to use more elaborate, so-called sparsity priors, which assume that some of the coefficients are close to zero while some can be large.
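A rough sketch of the scaled-normal option; the base scale 2.5 and the object names are illustrative assumptions:

# Divide the normal prior variance by the number of predictors p; with
# standardized predictors, the implied variance of the linear predictor
# then stays constant regardless of p.
p <- ncol(studentstd_Gmat) - 1  # assumes one column is the outcome
scaled_prior <- set_prior(paste0("normal(0, ", 2.5 / sqrt(p), ")"), class = "b")
fitms <- brm(Gmat ~ ., data = studentstd_Gmat, prior = scaled_prior,
             normalize = FALSE)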
6 Implied prior on R^2
The Regression and Other Stories Chapter 12 example shows prior predictive simulations with 1) the usual independent wide normal prior, 2) a scaled normal prior (the wide normal scale divided by the number of predictors), and 3) a regularized horseshoe prior encoding the sparsity assumption, illustrating the implied prior on R^2. For convenience, these figures are shown below, too. Flat priors are improper, so we cannot do prior predictive simulation with them, and there is no corresponding figure.
7 R2D2 prior
In this case study we use the R2D2 prior (Zhang et al. 2022), which defines the prior directly on R^2 and propagates it to the coefficients, so that the implied prior on R^2 stays constant regardless of the number of predictors. We assign an R2D2 prior with mean 1/3 and precision 3, which corresponds to a Beta(1,2) distribution on R^2. The concentration parameter controls the prior assumption on sparsity, in other words, how much variation is assumed between the coefficients.
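The corresponding brms call might look like this; brms supports the R2D2 prior directly, but cons_D2 = 1 is an assumed choice for the concentration parameter:

# R2D2 prior with mean 1/3 and precision 3 on R^2, i.e. Beta(1, 2)
fitm <- brm(Gmat ~ ., data = studentstd_Gmat,
            prior = prior(R2D2(mean_R2 = 1/3, prec_R2 = 3, cons_D2 = 1),
                          class = b),
            normalize = FALSE)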
Posterior-R^2 (bayes_R2()) and LOO-R^2 (loo_R2()) are now much closer to each other, indicating that the prior is not pushing the posterior towards higher R^2 values.
bayes_R2(fitm) |> as.data.frame() |> tt()

Estimate  Est.Error  Q2.5  Q97.5
0.24      0.036      0.17  0.31
loo_R2(fitm) |> as.data.frame() |> tt()

Estimate  Est.Error  Q2.5  Q97.5
0.21      0.031      0.15  0.27
The following plot shows the Beta(1,2) prior on R^2 and the posterior of R^2.
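A minimal sketch of such a plot, using base graphics for brevity:

# posterior draws of R^2 from the R2D2 model
R2_draws <- bayes_R2(fitm, summary = FALSE)
hist(R2_draws, breaks = 50, freq = FALSE, xlim = c(0, 1),
     xlab = "R^2", main = "Beta(1,2) prior (curve) and posterior (histogram)")
curve(dbeta(x, 1, 2), from = 0, to = 1, add = TRUE, lwd = 2)  # Beta(1,2) prior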
We check the bivariate marginal for the Fedu and Medu coefficients and see that, while the univariate marginals overlap with 0, jointly there is not much posterior mass near 0. This is due to Fedu and Medu being collinear. Collinearity of predictors makes it difficult to infer predictor relevance from the marginal posteriors.
mcmc_scatter(drawsm, pars = c("Fedu", "Medu")) + vline_0(linetype = 'dashed') + hline_0(linetype = 'dashed')
Figure 6: Bivariate marginal posterior of the Fedu and Medu coefficients.
8 Model checking
We use a normal observation model, although we know that the exam scores are in a bounded range. Posterior predictive checking shows that we sometimes predict exam scores higher than 20, but the discrepancy is minor.
pp_check(fitm, type = "hist", ndraws = 5)
Figure 7: Posterior predictive check with histograms of five posterior predictive draws.
The PIT-ECDF plot shows that otherwise the normal model is quite well calibrated.
pp_check(fitm, type = "pit_ecdf")
Figure 8: Posterior predictive check with a PIT-ECDF plot.
We could use a truncated normal as a more accurate model; a beta-binomial model, for example, cannot be used for the median exam scores, as some of the medians are not integers. A fancier model could treat the three exams hierarchically, but as the normal model is not that bad, we continue with it.
9 Projection predictive variable selection
We use projection predictive variable selection, implemented in the projpred R package (Piironen, Paasiniemi, and Vehtari 2020; McLatchie et al. 2023), to find the minimal set of predictors that provides predictive performance similar to using all predictors jointly. By default, projpred starts from the intercept-only model and uses forward search to find the order in which to add predictors so as to minimize the divergence from the full model's predictive distribution.
9.1 Math exam scores
We start by running fast PSIS-LOO-CV only for the full-data search path, as sketched below.
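A sketch of the fast search; object names are assumptions, while the options follow projpred's API:

library(projpred)
# forward search on the full data; PSIS-LOO evaluates the models along that
# single path, but the search itself is not cross-validated
vselm_fast <- cv_varsel(fitm, method = "forward", cv_method = "LOO",
                        validate_search = FALSE)
plot(vselm_fast, stats = "elpd", deltas = TRUE)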
The following plot shows the relevance order of the predictors and the estimated predictive performance given those variables. As the search can overfit and we did not cross-validate the search, the performance estimates can go above the reference model performance. However, this plot helps us to see that 10 or fewer predictors would be sufficient.
Next we repeat the search, but now cross-validate the search, too. We rerun the search with the PSIS-LOO-CV criterion only for nloo=50 folds, and combine the result with the fast PSIS-LOO result using a difference estimator (Magnusson et al. 2020). The use of subsampling LOO affects only the model size selection, and provided that selection is stable, the projection for the selected model is as good as with the computationally more expensive search validation. Based on the previous quick result, we search only up to models of size 10. On my laptop with 8 parallel workers, this takes less than 5 minutes.
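A sketch of the cross-validated search; the parallel setup via doFuture is an assumption:

library(doFuture)
registerDoFuture()
plan(multisession, workers = 8)  # 8 parallel workers
# cross-validate the search itself, subsampling nloo = 50 LOO folds and
# searching only up to models of size 10
vselm <- cv_varsel(fitm, method = "forward", cv_method = "LOO",
                   validate_search = TRUE, nloo = 50, nterms_max = 10,
                   parallel = TRUE)
plot(vselm, stats = "elpd", deltas = TRUE)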
The following plot shows the relevance order of the predictors and the estimated predictive performance given those variables. The order is the same as in the previous plot, but now the predictive performance estimates take the search into account and have smaller bias. It seems that using just four predictors can provide predictive performance similar to using all the predictors.
The following plot shows the stability of the search over the different LOO-CV folds. The numbers indicate the proportion of folds in which the given predictor was included at the latest at the given model size.
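The ranking object used below can be extracted from the cross-validated search result; a sketch, assuming vselm from above:

# predictor ranking and per-fold rankings from the cross-validated search
rankm <- ranking(vselm)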
plot(cv_proportions(rankm, cumulate = TRUE))
Figure 12: Cumulative proportion of LOO-CV folds in which each predictor was selected, by model size (mathematics).
9.2 Portuguese exam scores
We repeat the same analysis, but now predict the Portuguese grade instead of the mathematics grade.
We fit a model with the R2D2 prior with mean 1/3 and precision 3, as sketched below.
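A sketch of the fit; the outcome and data frame names Gpor and studentstd_Gpor are assumptions, analogous to the math model:

fitp <- brm(Gpor ~ ., data = studentstd_Gpor,
            prior = prior(R2D2(mean_R2 = 1/3, prec_R2 = 3, cons_D2 = 1),
                          class = b),
            normalize = FALSE)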
We compare posterior-R^2 and LOO-R^2 and see that the Portuguese grade is easier to predict given the predictors (though there is still a lot of unexplained variance).
The following plot shows the relevance order of the predictors and the estimated predictive performance given those variables. As there is some overfitting in the search and we did not cross-validate the search, the performance estimates can go above the reference model performance. However, this plot helps us to see that 10 or fewer predictors would be sufficient.
Next we repeat the search, but now cross-validate the search, too. We rerun the search with the PSIS-LOO-CV criterion only for nloo=50 folds, and combine the result with the fast PSIS-LOO result using a difference estimator (Magnusson et al. 2020). The use of subsampling LOO affects only the model size selection, and provided that selection is stable, the projection for the selected model is as good as with the computationally more expensive search validation. Based on the previous quick result, we search only up to models of size 10. On my laptop with 8 parallel workers, this takes less than 5 minutes.
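A sketch, mirroring the math-model search; object names are assumptions:

vselp <- cv_varsel(fitp, method = "forward", cv_method = "LOO",
                   validate_search = TRUE, nloo = 50, nterms_max = 10,
                   parallel = TRUE)
plot(vselp, stats = "elpd", deltas = TRUE)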
The following plot shows the relevance order of the predictors and the estimated predictive performance given those variables. The order is the same as in the previous plot, but now the predictive performance estimates take the search into account and have smaller bias. It seems that using just seven predictors can provide predictive performance similar to using all the predictors.
The following plot shows the stability of the search over the different LOO-CV folds. The numbers indicate the proportion of folds in which the given predictor was included at the latest at the given model size.
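As before, the ranking used below is extracted from the search result; a sketch:

rankp <- ranking(vselp)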
plot(cv_proportions(rankp, cumulate = TRUE))
Figure 17: Cumulative proportion of LOO-CV folds in which each predictor was selected, by model size (Portuguese).
9.3 Using the selected model
For further predictions we can use the projected draws. Due to how different packages work, it can sometimes be easier to rerun MCMC conditional on the selected variables. This gives a slightly different result, but when the reference model is good the difference tends to be small, and the main benefit from using projpred remains: the selection process itself has not caused overfitting and selection of spurious covariates.
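A sketch of using the projected draws for prediction; the submodel size 4 follows the math-model result above, and object names are assumptions:

# project the reference model posterior onto the selected 4-predictor submodel
projm <- project(vselm, nterms = 4)
# posterior draws of the linear predictor from the projected submodel
predm <- proj_linpred(projm, newdata = studentstd_Gmat)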
References
Gelman, Andrew, Ben Goodrich, Jonah Gabry, and Aki Vehtari. 2019. “R-Squared for Bayesian Regression Models.” The American Statistician 73 (3): 307–9.
Magnusson, Måns, Michael Riis Andersen, Johan Jonasson, and Aki Vehtari. 2020. “Leave-One-Out Cross-Validation for Bayesian Model Comparison in Large Data.” In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 108:341–51. PMLR.
McLatchie, Yann, Sölvi Rögnvaldsson, Frank Weber, and Aki Vehtari. 2023. “Robust and Efficient Projection Predictive Inference.” arXiv preprint arXiv:2306.15581.
Piironen, Juho, Markus Paasiniemi, and Aki Vehtari. 2020. “Projective Inference in High-Dimensional Problems: Prediction and Feature Selection.” Electronic Journal of Statistics 14 (1): 2155–97.
Tosh, Christopher, Philip Greengard, Ben Goodrich, Andrew Gelman, Aki Vehtari, and Daniel Hsu. 2021. “The Piranha Problem: Large Effects Swimming in a Small Pond.” arXiv preprint arXiv:2105.13445.
Zhang, Yan Dora, Brian P. Naughton, Howard D. Bondell, and Brian J. Reich. 2022. “Bayesian Regression Using a Prior on the Model Fit: The R2-D2 Shrinkage Prior.” Journal of the American Statistical Association 117: 862–74.