A big part of the purpose of `idealstan` is to give people different options in fitting ideal point models to diverse data. Along with that, `idealstan` makes use of Bayesian model evaluation via `loo` and can also analyze the posterior predictive distribution using `bayesplot`. `loo` approximates the predictive error of a Bayesian model via leave-one-out cross-validation (LOO-CV). True LOO-CV on Bayesian models is computationally prohibitive because it involves estimating a new model for each data point; for IRT models incorporating thousands or even millions of observations, this is practically infeasible.
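
To see why exact LOO-CV is so expensive, consider what it would require: one full model refit per held-out observation. A conceptual sketch follows (the helpers `fit_without()` and `lpd_for()` are hypothetical placeholders, not `idealstan` functions, and the code is not meant to be run):

```
# Conceptual sketch of exact LOO-CV: with N observations, this loop
# performs N complete model refits, which is infeasible for large IRT data.
elpd_loo_exact <- sum(sapply(seq_len(N), function(i) {
  fit_i <- fit_without(full_data, i)  # hypothetical: refit without observation i
  lpd_for(fit_i, full_data, i)        # hypothetical: log predictive density of obs i
}))
```

`loo` avoids this by approximating each held-out predictive density from a single model fit using Pareto-smoothed importance sampling.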

`bayesplot` lets us compare the data used to estimate the model with data produced by the model, known as the posterior predictive distribution. This is very useful as a general summary of model fit: it shows whether there are values of the outcome that we are over- or under-predicting.

`idealstan` implements functions for each ideal point model that calculate the log-posterior probability of the data, which is the necessary input to `loo`'s model evaluation features. This vignette demonstrates the basic usage and also discusses best practices for evaluating models.

We begin by simulating data for a standard 2-PL IRT ideal point model, but with strategically missing data:

`irt_2pl <- id_sim_gen(inflate=TRUE)`


We can then fit two ideal point models to the same data: one that uses the correct binomial model for a 0/1 binary outcome, and a second that incorrectly fits a Poisson count model to the same binary data.

```
# Because of CRAN limitations, only using 2 cores & 2 chains
irt_2pl_correct <- id_estimate(idealdata=irt_2pl,
                               model_type=2,
                               restrict_ind_high = as.character(sort(irt_2pl@simul_data$true_person,
                                                                     decreasing=TRUE,
                                                                     index=TRUE)$ix[1]),
                               restrict_ind_low = as.character(sort(irt_2pl@simul_data$true_person,
                                                                    decreasing=FALSE,
                                                                    index=TRUE)$ix[1]),
                               fixtype='vb_partial',
                               ncores=2,
                               nchains=2,
                               niters = 500)
```

```
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Iteration: 1 / 250 [ 0%] (Adaptation)
## Chain 1: Iteration: 50 / 250 [ 20%] (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%] (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%] (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%] (Adaptation)
## Chain 1: Success! Found best value [eta = 1] earlier than expected.
## Chain 1: 100 -1057.420 1.000 1.000
## Chain 1: 200 -1045.939 0.505 1.000
## Chain 1: 300 -1044.276 0.338 0.011
## Chain 1: 400 -1041.217 0.254 0.011
## Chain 1: 500 -1042.919 0.203 0.003 MEDIAN ELBO CONVERGED
## Chain 1: Drawing a sample of size 1000 from the approximate posterior...
```

```
irt_2pl_incorrect <- id_estimate(idealdata=irt_2pl,
                                 model_type=8,
                                 restrict_ind_high = as.character(sort(irt_2pl@simul_data$true_person,
                                                                       decreasing=TRUE,
                                                                       index=TRUE)$ix[1]),
                                 restrict_ind_low = as.character(sort(irt_2pl@simul_data$true_person,
                                                                      decreasing=FALSE,
                                                                      index=TRUE)$ix[1]),
                                 fixtype='vb_partial',
                                 ncores=2,
                                 nchains=2,
                                 niters=500)
```

```
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Iteration: 1 / 250 [ 0%] (Adaptation)
## Chain 1: Iteration: 50 / 250 [ 20%] (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%] (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%] (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%] (Adaptation)
## Chain 1: Success! Found best value [eta = 1] earlier than expected.
## Chain 1: 100 -4154172.311 1.000 1.000
## Chain 1: 200 -462000.315 4.496 7.992
## Chain 1: 300 -1605796.824 3.235 1.000
## Chain 1: 400 -9960842.290 2.636 1.000
## Chain 1: 500 -3660771.278 2.453 1.000
## Chain 1: 600 -208653028151.058 2.211 1.000
## Chain 1: 700 -871621239.323 35.950 1.000
## Chain 1: 800 -25862319.457 35.544 1.721
## Chain 1: 900 -47171459.169 31.645 1.000
## Chain 1: 1000 -13591951.336 28.727 1.721
## Chain 1: 1100 -1871355.503 29.254 2.471 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 1200 -1104070282.330 28.554 1.721 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 1300 -92502391119.340 28.582 1.721 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 1400 -563994.875 16429.680 2.471 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 1500 -49620069.571 16429.607 2.471 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 1600 -8661.581 17002.282 6.263 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 1700 -1810.391 16978.822 3.784 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 1800 -2552.708 16975.581 2.471 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 1900 -1682.214 16975.588 2.471 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 2000 -1659.393 16975.342 0.998 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 2100 -1588.202 16974.720 0.989 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 2200 -1558.193 16974.622 0.988 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 2300 -1547.898 16974.524 0.517 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 2400 -1546.471 573.342 0.291 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 2500 -1547.006 573.243 0.045 MAY BE DIVERGING... INSPECT ELBO
## Chain 1: 2600 -1548.441 0.468 0.019
## Chain 1: 2700 -1547.860 0.090 0.014
## Chain 1: 2800 -1546.677 0.061 0.007 MEDIAN ELBO CONVERGED
## Chain 1: Drawing a sample of size 1000 from the approximate posterior...
```

The first thing we want to check with any MCMC model is convergence. An easy check is to look at the distribution of Rhat values: if all of these values are below 1.1, then we have good reason to believe that the model converged. We can plot these distributions with the `id_plot_rhats` function:

`id_plot_rhats(irt_2pl_correct)`

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

`id_plot_rhats(irt_2pl_incorrect)`

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

From these plots, we can only assume that the first model converged. If the second model failed the Rhat test, at this point we should go back and see whether the model is mis-specified or whether there is something wrong with the data. But for the sake of illustration, we will look at other diagnostics. We can also examine 1) whether the models are able to accurately replicate the data they were fit on and 2) overall measures of model fit.

We can first look at how well the model reproduces the data, i.e., the posterior predictive distribution. We can obtain these distributions using the `id_post_pred` function:

`post_correct <- id_post_pred(irt_2pl_correct)`

`## [1] "Processing posterior replications for 1000 scores using 100 posterior samples out of a total of 500 samples."`

`post_incorrect <- id_post_pred(irt_2pl_incorrect)`

`## [1] "Processing posterior replications for 1000 scores using 100 posterior samples out of a total of 500 samples."`

We can then use `id_plot_ppc`, a wrapper around the `bayesplot` package, to see how well these models replicate their own data. These plots show the original data as blue bars and the posterior predictive estimate as an uncertainty interval. If the interval overlaps the observed data, then the model was able to replicate the observed data, which is a basic but important validity test for the model, i.e., could the model have generated the data that was used to fit it?

`id_plot_ppc(irt_2pl_correct,ppc_pred=post_correct)`

`id_plot_ppc(irt_2pl_incorrect,ppc_pred=post_incorrect)`

These plots make it immediately obvious that one should not use a Poisson distribution for a Bernoulli (binary) outcome: the Poisson model predicts values as high as 10, although it still puts the majority of the density on 1 and 2. Only two observed values show up in the Poisson predictive distribution because the missing-data model must be shown separately when the outcome is unbounded. To view the missing-data model (i.e., a binary response), simply change the `output` option in `id_post_pred` to `'missing'`. We can also look at particular persons or items to see how well the models predict them. For example, let's look at the first two persons in the simulated data for which we incorrectly used the Poisson model by passing their IDs as a character vector to the `group` option:

`id_plot_ppc(irt_2pl_incorrect,ppc_pred=post_incorrect,group=c('1','2'))`
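
The missing-data model mentioned above can be inspected in the same way. A sketch follows; the `output='missing'` option comes from the text above, while passing the result straight to `id_plot_ppc` is an assumption about the workflow:

```
# Sketch: generate posterior replications for the missing-data (inflation)
# component of the Poisson model and plot them against the observed data.
post_incorrect_miss <- id_post_pred(irt_2pl_incorrect, output = 'missing')
id_plot_ppc(irt_2pl_incorrect, ppc_pred = post_incorrect_miss)
```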

Finally, we can turn to summary measures of model fit that also allow us to compare models directly to each other (provided they were fit on the same data). To do so, I first use the `log_lik` option of the `id_post_pred` function to generate log-likelihood values for each of these models.

`log_lik_irt_2pl_correct <- id_post_pred(irt_2pl_correct,type='log_lik')`

`## [1] "Processing posterior replications for 1000 scores using 500 posterior samples out of a total of 500 samples."`

`log_lik_irt_2pl_incorrect <- id_post_pred(irt_2pl_incorrect,type='log_lik')`

`## [1] "Processing posterior replications for 1000 scores using 500 posterior samples out of a total of 500 samples."`

Note that we must also use the `relative_eff` function in order to calculate the relative effective sample sizes that `loo` needs. The first argument to `relative_eff` is the exponentiated log-likelihood matrix we calculated previously, and the second argument uses the `derive_chain` function, which also takes the log-likelihood matrix as its argument. I set the option for two cores, but a larger number could be used to speed up the calculations.

```
correct_loo <- loo(log_lik_irt_2pl_correct,
                   cores=2,
                   r_eff=relative_eff(exp(log_lik_irt_2pl_correct),
                                      chain_id=derive_chain(log_lik_irt_2pl_correct)))
```

`## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.`

```
incorrect_loo <- loo(log_lik_irt_2pl_incorrect,
                     cores=2,
                     r_eff=relative_eff(exp(log_lik_irt_2pl_incorrect),
                                        chain_id=derive_chain(log_lik_irt_2pl_incorrect)))
```

`## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.`

`print(correct_loo)`

```
##
## Computed from 500 by 1000 log-likelihood matrix
##
##          Estimate   SE
## elpd_loo   -630.2 14.8
## p_loo       105.4  3.8
## looic      1260.4 29.6
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
##                          Count Pct.  Min. n_eff
## (-Inf, 0.5]   (good)       936 93.6% 106
##  (0.5, 0.7]   (ok)          59  5.9% 73
##    (0.7, 1]   (bad)          5  0.5% 48
##    (1, Inf)   (very bad)     0  0.0% <NA>
## See help('pareto-k-diagnostic') for details.
```

`print(incorrect_loo)`

```
##
## Computed from 500 by 1000 log-likelihood matrix
##
##          Estimate   SE
## elpd_loo  -1042.5 12.2
## p_loo        39.3  2.2
## looic      2084.9 24.4
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
##                          Count Pct.  Min. n_eff
## (-Inf, 0.5]   (good)       992 99.2% 172
##  (0.5, 0.7]   (ok)           7  0.7% 227
##    (0.7, 1]   (bad)          1  0.1% 379
##    (1, Inf)   (very bad)     0  0.0% <NA>
## See help('pareto-k-diagnostic') for details.
```

With this calculation we can examine the models' `loo` values, which show the predictive performance of each model relative to the data. Overall, model performance seems quite good, as the Pareto k values show that only a few dozen observations in the dataset are not well predicted. The LOO-IC, or leave-one-out information criterion (think AIC or BIC), is much lower (i.e., better) for the correct model, as we would expect.

We can also compare the LOOIC of the two models explicitly using the `compare` function from `loo`, which even gives us a standard error around the difference. If the difference is negative, then the first (correct) model has higher predictive accuracy:

```
compare(correct_loo,
        incorrect_loo)
```

```
## elpd_diff        se
##    -412.3      14.8
```
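
As a rough check on this result, we can compare the magnitude of the difference to its standard error; a gap of many standard errors indicates a decisive preference (a common rule of thumb, not a formal test):

```
# The elpd difference of -412.3 is roughly 28 standard errors from zero,
# far beyond a conventional threshold of about 2 standard errors.
abs(-412.3) / 14.8
```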

The first (correct) model is clearly preferred, as we would certainly hope in this extreme example. In less obvious cases, the `elpd` values can help arbitrate between model choices when there is no theoretical reason for preferring one model over another.