The R package `SimCorMultRes`

is suitable for simulation of correlated binary responses (exactly two response categories) and of correlated nominal or ordinal multinomial responses (three or more response categories) conditional on a regression model specification for the marginal probabilities of the response categories. A more detailed description of `SimCorMultRes`

can be found in Touloumis (2016). This vignette briefly describes the simulation methods proposed by Touloumis (2016), introduces how to simulate ordinal responses under a marginal adjacent-category logit model and illustrates the use of the core functions of `SimCorMultRes`

.

This package was created to facilitate the task of carrying out simulation studies and evaluating the performance of statistical methods for estimating the regression parameters in a marginal model with clustered binary and multinomial responses. Examples of such statistical methods include maximum likelihood methods, copula approaches, quasi-least squares approaches, generalized quasi-likelihood methods and generalized estimating equations (GEE) approaches among others (see references in Touloumis 2016).

In addition, `SimCorMultRes`

can generate correlated binary and multinomial random variables conditional on a desired dependence structure and known marginal probabilities even if these are not determined by a regression model (see third example in Touloumis 2016) or to explore approximations of association measures for discrete variables that arise as realizations of an underlying continuum (see second example in Touloumis 2016).

Let \(Y_{it}\) be the binary or multinomial response for subject \(i\) (\(i=1,\ldots,N\)) at measurement occasion \(t\) (\(t=1,\ldots,T\)), and let \(\mathbf {x}_{it}\) be the associated covariates vector. We assume that \(Y_{it} \in \{0,1\}\) for binary responses and \(Y_{it} \in \{1,2,\ldots,J\geq 3\}\) for multinomial responses.

To achieve simulation of clustered binary, ordinal and nominal responses under no marginal model specification, perform the following intercepts:

Based on the marginal probabilities calculate the intercept of a marginal probit model for binary responses (see Example 3.7) or the category-specific intercepts of a cumulative probit model (see third example in Touloumis 2016) or of a baseline-category logit model for multinomial responses (see Example 3.8).

Create a pseudo-covariate say

`x`

of length equal to the number of cluster size (`clsize`

) times the desired number of clusters of simulated responses (say`R`

), that is`x = clsize * R`

. This step is required in order to identify the desired number of clustered responses.Set

`betas = 0`

in the core functions`rbin`

(see Example 3.7) or`rmult.clm`

, or set 0 all values of the`beta`

argument that correspond to the category-specific parameters in the core function`rmult.bcl`

(see Example 3.8).set

`xformula = ~ x`

.Run the core function to obtain realizations of the simulated clustered responses.

**Example 3.7 (Simulation of clustered binary responses without covariates) **Suppose the goal is to simulate \(5000\) clustered binary responses with \(\Pr(Y_{t}=1)=0.8\) for all \(t=1,\ldots,4\). For simplicity, assume that the clustered binary responses are independent.

```
set.seed(123)
# sample size
sample_size <- 5000
# cluster size
cluster_size <- 4
# intercept
beta_intercepts <- qnorm(0.8)
# pseudo-covariate
x <- rep(0, each = cluster_size * sample_size)
# regression parameter associated with the covariate
beta_coefficients <- 0
# correlation matrix for the NORTA method
latent_correlation_matrix <- diag(cluster_size)
# simulation of clustered binary responses
simulated_binary_dataset <- rbin(clsize = cluster_size, intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~x, cor.matrix = latent_correlation_matrix,
link = "probit")
# simulated marginal probabilities
colMeans(simulated_binary_dataset$Ysim)
#> t=1 t=2 t=3 t=4
#> 0.8024 0.7972 0.7948 0.8088
```

**Example 3.8 (Simulation of clustered nominal responses without covariates) **Suppose the aim is to simulate \(N=5000\) clustered nominal responses with
\(\Pr(Y_{t}=1)=0.1\), \(\Pr(Y_{t}=2)=0.2\), \(\Pr(Y_{t}=3)=0.3\) and \(\Pr(Y_{t}=4)=0.4\), for all \(i\) and \(t=1,\ldots,3\). For the sake of simplicity, we assume that the clustered responses are independent.

```
# sample size
sample_size <- 5000
# cluster size
cluster_size <- 3
# pseudo-covariate
x <- rep(0, each = cluster_size * sample_size)
# parameter vector
betas <- c(log(0.1/0.4), 0, log(0.2/0.4), 0, log(0.3/0.4), 0, 0, 0)
# number of nominal response categories
categories_no <- 4
set.seed(1)
# correlation matrix for the NORTA method
latent_correlation_matrix <- kronecker(diag(cluster_size), diag(categories_no))
# simulation of clustered nominal responses
simulated_nominal_dataset <- rmult.bcl(clsize = cluster_size, ncategories = categories_no,
betas = betas, xformula = ~x, cor.matrix = latent_correlation_matrix)
# simulated marginal probabilities
apply(simulated_nominal_dataset$Ysim, 2, table)/sample_size
#> t=1 t=2 t=3
#> 1 0.1000 0.0996 0.1036
#> 2 0.2034 0.2000 0.2000
#> 3 0.2874 0.3130 0.2894
#> 4 0.4092 0.3874 0.4070
```

In `SimCorMultRes`

, the user provides the correlation matrix (denoted as \(\mathbf{R}\)) for the multivariate normal distribution used in the intermediate step of the NORTA method, rather than the correlation matrix of the latent responses used in the corresponding threshold approach. This choice is motivated by the observation that when all the marginal distributions of the correlated latent responses follow a logistic distribution, the correlation matrix \(\mathbf{R}\) and the correlation matrix of the latent responses are expected to be similar, as noted by Touloumis (2016). Therefore, in `SimCorMultRes`

, this approximation is employed irrespective of the marginal distribution of the latent `responses`

.

To evaluate the validity of this approximation for the threshold approaches employed in `SimCorMultRes`

, a simulation study was conducted. For a fixed sample size \(N\) and a correlation parameter \(\rho\), \(N\) independent bivariate random vectors \(\{\mathbf y_{i}: i = 1, \ldots, N \}\) from a bivariate normal distribution with mean vector the zero vector and covariance matrix the correlation matrix
\[
\mathbf R = \begin{bmatrix}
1 & \rho\\
\rho & 1
\end{bmatrix}
\]
were drawn. The sample correlation was used to estimate \(\rho\). Next, the NORTA method was applied to obtain bivariate random vectors \(\{\mathbf z_{i}: i = 1, \ldots, N \}\) so that their marginal distribution is the logistic distribution. Their correlation parameter, say \(\rho_{z}\), was estimated using the corresponding sample correlation. Then, the NORTA method was applied to obtain bivariate random vectors \(\{\mathbf w_{i}: i = 1, \ldots, N \}\) so that their marginal distribution is the Gumbel distribution. Their correlation parameter, say \(\rho_{w}\), was estimated using their sample correlation.

For a fixed value of \(\rho\), this procedure was replicated \(10,000,000\) times to reduce the simulation error and with \(N=10,000\) to reduce the sampling error. The three correlation parameters \(\rho\), \(\rho_z\) and \(\rho_w\) were estimated using their corresponding Monte Carlo counterparts, denoted by \(\widehat{\rho}\), \(\widehat{\rho}_z\) and \(\widehat{\rho}_w\), respectively. We let \(\rho \in \{ 0, 0.01,0.02,\ldots, 0.99\}\).

The dataframe `simulation`

contains the true correlation parameter \(\rho\) (`rho`

) and the Monte Carlo estimates \(\widehat{\rho}\) (`normal`

), \(\widehat{\rho}_z\) (`logistic`

) and \(\widehat{\rho}_w\) (`gumbel`

) from the simulation study described above. As expected, \(\widehat{\rho} \approx \rho\) regardless of the strength of the correlation parameter. For the case of logistic marginal distributions, the average difference between \(\rho\) and \(\widehat{\rho}_z\) is 0.002, taking the maximum value of 0.0031 at \(\rho = 0.58\). Therefore \(\rho\) appears to approximate \(\rho_{z}\) to 2 decimal points. For the case of Gumbel marginal distributions, the average difference between \(\rho\) and \(\widehat{\rho}_w\) is 0.0103, taking the maximum value of 0.0154 at \(\rho = 0.51\). Although \(\rho\) appears to approximate well \(\rho_{w}\), there is some accuracy loss compared to \(\rho_{z}\). The plot below shows the differences between the true correlation coefficient of the bivariate normal distribution and the simulated correlations for \(\rho\), \(\rho_z\) or \(\rho_w\).

Overall, there is little accuracy loss by specifying the correlation matrix of the multivariate normal distribution in the intermediate step of the NORTA distribution instead of the correlation matrix of correlated latent responses regardless of whether their marginal distributions are either logistic or Gumbel distributions. Users can treat the correlation matrix passed on to the core functions of `SimCorMultRes`

as the correlation matrix of the latent variables.

```
citation("SimCorMultRes")
To cite 'SimCorMultRes' in publications, please use:
Touloumis A (2016). "Simulating Correlated Binary and Multinomial
Responses under Marginal Model Specification: The SimCorMultRes
Package." _The R Journal_, *8*(2), 79-91. R package version 1.9.0,
<https://journal.r-project.org/archive/2016/RJ-2016-034/index.html>.
A BibTeX entry for LaTeX users is
@Article{,
title = {Simulating Correlated Binary and Multinomial Responses under
Marginal Model Specification: The SimCorMultRes Package},
author = {Anestis Touloumis},
year = {2016},
journal = {The R Journal},
volume = {8},
number = {2},
note = {R package version 1.9.0},
pages = {79-91},
url = {https://journal.r-project.org/archive/2016/RJ-2016-034/index.html},
}
```

Gumbel, E. J. 1958. *Statistics of Extremes*. Columbia University Press, New York.

Touloumis, A. 2016. “Simulating Correlated Binary and Multinomial Responses under Marginal Model Specification: The SimCorMultRes Package.” *The R Journal* 8 (2): 79–91. https://journal.r-project.org/archive/2016/RJ-2016-034/index.html.