Bayesian workflow book - Worldcup continuous vs discrete

Author

Andrew Gelman and Aki Vehtari

Published

2021-01-12

Modified

2024-09-17

1 Worldcup 2014 team performance analysis

We use an item-response model such as described in the previous chapter to fit a model to estimate the abilities of the teams in the 2014 football World Cup. We use score differentials as data (ignoring the shoot-outs).

The forthcoming Bayesian workflow book has more information about the modeling task, and example of workflow of checking and fixing some models. At this time this case study focuses on illustrating the predictive performance comparison of continuous and discrete models, and the rest of the code will be included later.

We’ll start with continuous models and discretized continuous models to illustrate how those can be compared, and eventually build also models using bivariate Poisson and Poisson difference models (Karlis and Ntzoufras 2003), which are commonly used in analysis of sports data.

Load packages

Code

library(rprojroot)
root<-has_file(".Workflow-Examples-root")$make_fix_file()
library(cmdstanr)
library(posterior)
options(tibble.print_max=35,
        pillar.neg=FALSE,
        pillar.subtle=FALSE,
        pillar.sigfig=2)
library(arm)
library(loo)
library(dplyr)
library(readr)

2 Data

Data include 64 game results from Worldcup 2014 and Soccer Power Index that was available on the internet a month before the tournament (Silver, 2014). We took the rankings, with Brazil at the top (getting a score of 32) and Australia at the bottom (with a score of 1), and then for simplicity in interpretation of the parameters we rescaled these to have mean 0 and standard deviation 1/2, to get ``prior scores’’ that ranged from -0.83 to +0.83.

powerindex <- read_csv(root("Worldcup/data","soccerpowerindex.csv")) %>%
  mutate(prior_score = as.vector(scale(rev(index))/2))
teamnames <- powerindex$team
worldcup2014 <- read_csv(root("Worldcup/data","worldcup2014.csv")) %>%
    mutate(team_1 = match(team1, teamnames),
         team_2 = match(team2, teamnames))
N_games <- nrow(worldcup2014)
gamenames <- with(worldcup2014, rev(paste(teamnames[team_1], "vs.", teamnames[team_2])))

2.1 Data for Stan

standata <- with(
  worldcup2014,
  list(
    N_teams = nrow(powerindex),
    N_games = N_games,
    team_1 = team_1,
    score_1 = score1,
    team_2 = team_2,
    score_2 = score2,
    prior_score = powerindex$prior_score
  )
)

3 Normal continuous model

The first model uses normal distribution for score difference.

The model structure is as follows: if game i has teams j_1 and team j_2 playing, and they score s_1 and s_2 goals, respectively, then the data point for this game is y_i = s_1-s_2, and the data model is: y_i \sim \mathrm{normal}(a_{j_1[i]}-a_{j_2[i]}, \sigma_y), where a_{j_1} and a_{j_2} are the ability parameters (to use psychometrics jargon) for the two teams and \sigma_y is a scale parameter estimated from the data.

To include the prior information the model for the team abilities is a_j \sim \mathrm{normal}(\mu + \beta*\mathrm{prior\_score}_j, \sigma_a), which we write as a_j=\mu+b*\mathrm{prior\_score}_j+\sigma_a*\alpha_j, with \alpha_j\sim\mathrm{normal}(0,1) \text{ for } j=1,\dots,J=32. Actually, though, all we care about are the relative, not the absolute, team abilities, so we can just set \mu=0, so that the model is, a_j = b*\mathrm{prior\_score}_j+\sigma_a*\alpha_j, \text{ with } \alpha_j \sim \mathrm{normal}(0, 1), \text{ for } j=1,\dots,J=32. Following our general practice, we add weakly informative priors for b, \sigma_a, and \sigma_y. The data and model are roughly on unit scale, so \mathrm{normal}(0,1) priors should give some computational stability while letting the data be dominant in estimating the parameters.

The data model in Stan is written as

  dif ~ normal(a[team_1] - a[team_2], sigma_y);

We compute the log_lik for PSIS-LOO as most would do with continuous models, that is using the corresponding log density at the observed value.

  log_lik[n] = normal_lpdf(dif[n] | a[team_1[n]] - a[team_2[n]], sigma_y);

Continuous normal model

model_normal <- cmdstan_model(stan_file = root("Worldcup", "worldcup_normal.stan"))
fit_normal <- model_normal$sample(data = standata, refresh = 0)

Running MCMC with 4 sequential chains...

Chain 1 finished in 0.2 seconds.
Chain 2 finished in 0.2 seconds.
Chain 3 finished in 0.2 seconds.
Chain 4 finished in 0.2 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 1.3 seconds.

fit_normal$summary(c("a[1]","a[32]","b","sigma_a","sigma_y"))

# A tibble: 5 × 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 a[1]      0.49   0.51  0.46  0.45 -0.29  1.2    1.0    3091.    3760.
2 a[32]    -1.0   -1.0   0.47  0.42 -1.8  -0.28   1.0    5580.    2335.
3 b         1.1    1.1   0.32  0.31  0.57  1.6    1.0    3945.    2875.
4 sigma_a   0.51   0.52  0.23  0.22  0.12  0.89   1.0     992.    1316.
5 sigma_y   1.5    1.5   0.17  0.16  1.3   1.8    1.0    2006.    2260.

loo_normal <- fit_normal$loo()

4 Normal model with rounding and explicit latent values

As the score differences are integers, a more appropriate model would be discrete. We can discretize the normal model by adding rounding model: The new discrete model is \begin{aligned} \alpha & \sim \mathrm{normal}(0, 1) \\ b & \sim \mathrm{normal}(0, 1) \\ \sigma_a & \sim \mathrm{normal}(0, 1) \\ \sigma_z & \sim \mathrm{normal}(0, 1) \\ a_j & = b * \mathrm{prior\_score} + \sigma_a * \alpha, \quad \text{for } j=1,\dots,32\\ z_i & \sim \mathrm{normal}(a_{j_1[i]}-a_{j_2[i]}, \sigma_z), \quad \text{for } i=1,\dots,64\\ y_i & = \mathrm{round}(z_i, 1) \end{aligned} where the observed score differential y_i is equal to the latent variable z_i rounded to the nearest integer. The likelihood p(y_i|z_i) is then a box function, \begin{aligned} p(y_i|z_i) = \begin{cases} \, 1 & \text{if } z_i\in(y_i-0.5, y_i+ 0.5) \\ \, 0 & \text{otherwise.} \end{cases} \end{aligned} Multiplying the other terms of the model with this likelihood sets the posterior density to zero when z_i is outside the interval (y_i-0.5, y_i+0.5). To handle this discontinuous posterior in Stan, we code the constraint as bounds for the latent continuous outcome in the parameters block:

  vector<lower=dif-0.5, upper=dif+0.5>[N_games] z;

The model block is then coded as

model {
  alpha ~ normal(0, 1);
  b ~ normal(0, 1);
  sigma_a ~ normal(0, 1);
  sigma_z ~ normal(0, 1);
  z ~ normal(a[team_1] - a[team_2], sigma_z);
}

The model block does not include the data model for y_i at all, but the effect of the likelihood p(y_i|z_i) is coded through the constraints on z_i. This is another example, where due to the properties of Stan, we don’t code the model completely as a generative model, and part of the data model comes in the constraint.

Discrete model with explicit latent z sampled

model_normal_rounded_z <- cmdstan_model(stan_file = root("Worldcup", "worldcup_normal_rounded_z.stan"))
fit_normal_rounded_z <- model_normal_rounded_z$sample(data = standata, refresh = 0)

Running MCMC with 4 sequential chains...

Chain 1 finished in 0.4 seconds.
Chain 2 finished in 0.4 seconds.
Chain 3 finished in 0.6 seconds.
Chain 4 finished in 0.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.5 seconds.
Total execution time: 2.3 seconds.

fit_normal_rounded_z$summary(c("b","sigma_a","sigma_z"))

# A tibble: 3 × 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 b         1.1    1.1   0.31  0.30  0.56  1.6    1.0    3945.    2888.
2 sigma_a   0.52   0.52  0.23  0.22  0.12  0.88   1.0     873.     879.
3 sigma_z   1.5    1.5   0.17  0.17  1.2   1.8    1.0    1934.    2248.

The model summary is very similar to continuous model. We’ll soon see the explanation why the difference is small. At this point we skip the LOO comparison as PSIS-LOO computation does not work well with the explicit latent value model.

5 Normal model with rounding and integrated latent values

In our first discrete model we presented the latent value explicitly, but we can also integrate that out: \begin{aligned} p(\alpha,b,\sigma_a,\sigma_z|y) & \propto \int \!\!p(\alpha,b,\sigma_a,\sigma_z,z|y) dz \\ & = p(\alpha,b,\sigma_a,\sigma_z)\prod_{i=1}^{64} \int\!\! p(z_i|\alpha,b,\sigma_z)p(y_i|z_i) dz_i \\ & = p(\alpha,b,\sigma_a,\sigma_z)\prod_{i=1}^{64} \int_{y_i-1/2}^{y_i+1/2} p(z_i|\alpha,b,\sigma_z) dz_i. \end{aligned} The integrals \int_{y_i-1/2}^{y_i+1/2} p(z_i|\alpha,b,\sigma_z)dz_i are easy to compute as \int_{y_i-1/2}^{y_i+1/2} p(z_i|\alpha,b,\sigma_z)dz_i = \int_{-\infty}^{y_i+1/2} p(z_i|\alpha,b,\sigma_z)dz_i \,- \int_{-\infty}^{y_i-1/2} p(z_i|\alpha,b,\sigma_z)dz_i, and p(z_i|\alpha,b,\sigma_z) is normal distribution so that we can use a common function to compute the cumulative distribution function efficiently.

We can’t write this only with the distribution notation, and we increment the log density directly:

Here computes \log(\exp(a)-\exp(b)) while taking care of the floating point numerical accuracy, and computes the log cumulative distribution function of the normal distribution.

Now the corresponding log_lik is computed as in the model block, and this integrated log_lik works well with PSIS-LOO computation.

  log_lik[n] = log_diff_exp(normal_lcdf(dif[n]+0.5 | a[team_1[n]] - a[team_2[n]], sigma_z),
                            normal_lcdf(dif[n]-0.5 | a[team_1[n]] - a[team_2[n]], sigma_z));

model_normal_rounded <- cmdstan_model(stan_file = root("Worldcup", "worldcup_normal_rounded.stan"))
fit_normal_rounded <- model_normal_rounded$sample(data = standata, refresh = 0)

Running MCMC with 4 sequential chains...

Chain 1 finished in 1.1 seconds.
Chain 2 finished in 0.8 seconds.
Chain 3 finished in 0.8 seconds.
Chain 4 finished in 0.8 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.9 seconds.
Total execution time: 3.8 seconds.

fit_normal_rounded$summary(c("b","sigma_a","sigma_z"))

# A tibble: 3 × 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 b         1.1    1.1   0.31  0.30  0.56  1.6    1.0    3774.    2763.
2 sigma_a   0.52   0.52  0.22  0.21  0.13  0.88   1.0    1166.    1347.
3 sigma_z   1.5    1.5   0.16  0.16  1.2   1.8    1.0    2152.    2937.

loo_normal_rounded <- fit_normal_rounded$loo()

The model summary is up to Monte Carlo variability the same as from the model with explicit latent variables.

6 Continuous model approximating a discrete model

We can approximate integrals using the midpoint rule, \int_{a}^{b}f(z)\,dz\approx (b-a)\,f((a+b)/2), which in this case is \int_{y_i-1/2}^{y_i+1/2}p(z_i|\alpha,b,\sigma_z) dz_i \approx (y_i+1/2-y_i-1/2)p(y_i|\alpha,b,\sigma_z) = p(y_i|\alpha,b,\sigma_z), exactly the term in the continuous model after replacing \sigma_z with \sigma_y! If \sigma_y is relatively big compared to the integration interval (which in this case is 1, as the score differential is represented as a latent continuous variable rounded to the nearest integer), then the midpoint rule shou#’ ld be fine. In this case the posterior mean of \sigma_y is about 1.5. For this problem it would be fine for most purposes to perform inferences treating the discrete score differential as a continuous outcome and then include the rounding just when generating predictions.

7 Comparing discrete and continuous models

When estimating predictive performance we use log predictive densities for continuous data models and log predictive probabilities for discrete data models. We can’t directly compare densities and probabilities, but the above discussion shows that for integer outcomes, we can use the midpoint rule to approximate the predictive probabilities for a discretized (rounded) continuous model.

If we compare LOO results, there is minimal difference due to the Monte Carlo variation.

loo_compare(list("Normal continuous model"=loo_normal,
                 "Normal rounded model"=loo_normal_rounded))

                        elpd_diff se_diff
Normal rounded model     0.0       0.0   
Normal continuous model -0.2       0.2

The computation gets more complicated if a continuous data model is used for discrete outcomes that are not integers, or more specifically if level of discretization or rounding is not the same for all observations. Previously we had used continuous data model for the square root of the score differentials. If the outcome were continuous, we could adjust the log predictive densities by including the Jacobian of the transformation. Chapter 12 of {} has an example of modeling the weight of leaves from mesquite bushes with normal model for the yield y directly and for the logarithm of the yield, \log(y). To make the comparison valid, for the model using \log(y) as data, we can compute p(y_i | \theta) = p(\log(y_i) | \theta)/y_i, where 1/y_i is the Jacobian of the transformation \log(y_i). When using a continuous data model for discrete outcome, simply adding the Jacobian term and using midpoint rule may work sometimes, but for example, if we use normal model for signed square root score difference, then this does not work, as the score difference y_i can be 0 and then the Jacobian 1/(2\sqrt{y_i}) becomes infinite. Even if the Jacobian were finite for all outcome values, we would need to take into account that the integration interval length is not 1, and the Jacobian can affect the curvature. Thus, we need to perform the actual integration, \int_{y_i-1/2}^{y_i+1/2} \frac{1}{2\sqrt{y_i}}\mathrm{normal}(y_i | a_{j_1[i]}-a_{j_2[i]}, \sigma_y) dy_i. This integral is well defined also for y_i=0, but we just need to compute the integral in parts avoiding the problematic singularity at 0. Stan provides the function using the double exponential quadrature algorithm, which we can use in generated quantities to compute the variable used in PSIS-LOO computation.

We build a normal for signed square root score difference. The transformed data block has

  for (i in 1:N_games) {
    sqrt_dif[i] = 2 * (step(dif[i]) - 0.5) * sqrt(abs(dif[i]));
  }

the model block has

 sqrt_dif ~ normal(a[team_1] - a[team_2], sigma_y);

and the generated quantities block has

    log_lik[n] = normal_lpdf(sqrt_dif[n] | a[team_1[n]] - a[team_2[n]], sigma_y);

We create a model using log densities for log_lik.

model_sqrt_normal <- cmdstan_model(stan_file = root("Worldcup", "worldcup_sqrt_normal.stan"))
fit_sqrt_normal <- model_sqrt_normal$sample(data = standata, refresh = 0)

Running MCMC with 4 sequential chains...

Chain 1 finished in 0.2 seconds.
Chain 2 finished in 0.2 seconds.
Chain 3 finished in 0.2 seconds.
Chain 4 finished in 0.2 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 1.0 seconds.

fit_sqrt_normal$summary(c("b","sigma_a","sigma_y"))

# A tibble: 3 × 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 b         0.83   0.83  0.20  0.20 0.50   1.1    1.0    3520.    2842.
2 sigma_a   0.33   0.34  0.15  0.14 0.075  0.57   1.0     848.    1375.
3 sigma_y   0.93   0.92  0.11  0.11 0.77   1.1    1.0    1365.    2342.

loo_sqrt_normal <- fit_sqrt_normal$loo()

If we compare the sqrt-normal model without taking into account the Jacobian to the normal model, it would falsely look like sqrt-normal is much better.

loo_compare(list("Normal model"=loo_normal,
                 "Sqrt-normal model (no Jacobian)"=loo_sqrt_normal))

                                elpd_diff se_diff
Sqrt-normal model (no Jacobian)   0.0       0.0  
Normal model                    -32.5       5.2

We then create a sqrt-normal model with log_lik computation using Jacobian of sqrt-transformation and integration to discretize the sqrt-normal. The integration is made using 1D quadrature integration.

    log_lik[n] = (dif[n] == 0)
  ? log(integrate_1d(integrand,
                            -0.5,
                            0,
                            append_array({sigma_y}, {a[team_1[n]] - a[team_2[n]]}),
                  {0}, // not used, but an empty array not allowed
                  {0}, // not used, but an empty array not allowed
                  1e-6) +
        integrate_1d(integrand,
                            0,
                            0.5,
                            append_array({sigma_y}, {a[team_1[n]] - a[team_2[n]]}),
                  {0}, // not used, but an empty array not allowed
                  {0}, // not used, but an empty array not allowed
                  1e-6))
  : log(integrate_1d(integrand,
                            dif[n]-0.5,
                            dif[n]+0.5,
                            append_array({sigma_y}, {a[team_1[n]] - a[team_2[n]]}),
                  {0}, // not used, but an empty array not allowed
                  {0}, // not used, but an empty array not allowed
                  1e-6));
}

model_sqrt_normal_jacobian_discretized <- cmdstan_model(stan_file = root("Worldcup", "worldcup_sqrt_normal_jacobian_discretized.stan"))
fit_sqrt_normal_jacobian_discretized <- model_sqrt_normal_jacobian_discretized$sample(data = standata, refresh = 0)

Running MCMC with 4 sequential chains...

Chain 1 finished in 0.6 seconds.
Chain 2 finished in 0.7 seconds.
Chain 3 finished in 0.5 seconds.
Chain 4 finished in 0.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.6 seconds.
Total execution time: 2.7 seconds.

fit_sqrt_normal_jacobian_discretized$summary(c("b","sigma_a","sigma_y"))

# A tibble: 3 × 10
  variable  mean median    sd   mad    q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 b         0.83   0.83  0.20  0.19 0.49   1.2    1.0    3824.    3137.
2 sigma_a   0.34   0.34  0.14  0.14 0.087  0.57   1.0     877.    1058.
3 sigma_y   0.92   0.92  0.10  0.10 0.76   1.1    1.0    1885.    2612.

loo_sqrt_normal_jacobian_discretized <- fit_sqrt_normal_jacobian_discretized$loo()

If we compare the sqrt-normal model to the normal model while taking into account the Jacobian and correct discretization, we see that the sqrt root model is worse.

loo_compare(list("Normal model"=loo_normal,
                 "Sqrt-normal model"=loo_sqrt_normal_jacobian_discretized))

                  elpd_diff se_diff
Normal model       0.0       0.0   
Sqrt-normal model -7.2       5.5

8 Bivariate Poisson and Poisson difference models

To complete the comparison of continuous and discrete models we build bivariate Poisson and Poisson difference models (Karlis and Ntzoufras 2003), which are naturally discrete and commonly used in sport statistics. For the bivariate Poisson we compute log_lik in generated quantities using the Poisson difference log probability to make it comparable with other models targeting the score difference.

model_bipois <- cmdstan_model(stan_file = root("Worldcup", "worldcup_bivariate_poisson.stan"))
fit_bipois <- model_bipois$sample(data = standata, refresh = 0)

Running MCMC with 4 sequential chains...

Chain 1 finished in 0.7 seconds.
Chain 2 finished in 0.6 seconds.
Chain 3 finished in 0.9 seconds.
Chain 4 finished in 0.6 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.7 seconds.
Total execution time: 3.2 seconds.

loo_bipois <- fit_bipois$loo()

model_poisdif <- cmdstan_model(stan_file = root("Worldcup", "worldcup_poisson_difference.stan"))

fit_poisdif <- model_poisdif$sample(data = standata, refresh = 0)

Running MCMC with 4 sequential chains...

Chain 1 finished in 2.7 seconds.
Chain 2 finished in 2.7 seconds.
Chain 3 finished in 2.8 seconds.
Chain 4 finished in 3.2 seconds.

All 4 chains finished successfully.
Mean chain execution time: 2.8 seconds.
Total execution time: 11.7 seconds.

fit_poisdif$summary(c("a[1]","a[32]","b","sigma_a"))

# A tibble: 4 × 10
  variable  mean median    sd   mad     q5   q95  rhat ess_bulk ess_tail
  <chr>    <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
1 a[1]      0.22   0.23 0.21  0.20  -0.14   0.54   1.0    3691.    3152.
2 a[32]    -0.47  -0.47 0.22  0.21  -0.82  -0.10   1.0    6836.    3074.
3 b         0.52   0.52 0.14  0.14   0.29   0.75   1.0    4643.    2928.
4 sigma_a   0.24   0.24 0.094 0.090  0.079  0.40   1.0    1150.    1054.

loo_poisdif <- fit_poisdif$loo()

When we compare different models, there is not much difference, but it would make sense to use bivariate Poisson or Poisson difference as they are naturally discrete and justfied for count data.

loo_compare(list("Normal (continuous)"=loo_normal,
                 "Normal rounded (discrete)"=loo_normal_rounded,
                 "Bivariate Poisson"=loo_bipois,
                 "Poisson difference"=loo_poisdif))

                          elpd_diff se_diff
Poisson difference         0.0       0.0   
Bivariate Poisson         -1.0       1.6   
Normal rounded (discrete) -1.7       1.6   
Normal (continuous)       -1.9       1.7

Although in this case the actual discrete count data models were better or as good as continuous, sometimes it can be useful to have more flexibility in the choice of distributions and some continuous distribution with flexible shape might work better than common count models. If some continuous model seems to better than usual count choices (and the log_lik computation has been made correctly), then it might be worthwhile to investigate if there is some less well known discrete distribution with the same flexibility or if using additional latent dispersion variable would help.

References

Karlis, Dimitris, and Ioannis Ntzoufras. 2003. “Analysis of Sports Data by Using Bivariate Poisson Models.” Journal of the Royal Statistical Society: Series D 52 (3): 381–93.