Advanced functionality

Complex simulation levels

Often, simulation levels will be simple, such as a vector of sample sizes:

sim <- new_sim()
sim %<>% set_levels(n = c(200,400,800))

However, there are many instances in which more complex objects are needed. For these cases, instead of a vector of numbers or character strings, a named list of lists can be used. The toy example below illustrates this.

sim <- new_sim()
sim %<>% set_levels(
  n = c(10,100),
  distribution = list(
    "Beta 1" = list(type="Beta", params=c(0.3, 0.7)),
    "Beta 2" = list(type="Beta", params=c(1.5, 0.4)),
    "Normal" = list(type="Normal", params=c(3.0, 0.2))
  )
)
create_data <- function(n, type, params) {
  if (type=="Beta") {
    return(rbeta(n, shape1=params[1], shape2=params[2]))
  } else if (type=="Normal") {
    return(rnorm(n, mean=params[1], sd=params[2]))
  }
}
sim %<>% set_script(function() {
  x <- create_data(L$n, L$distribution$type, L$distribution$params)
  return(list("y"=mean(x)))
})
sim %<>% run()
#> Done. No errors or warnings detected.

Note that the list names ("Beta 1", "Beta 2", and "Normal") become the values of the distribution column in the sim$results dataframe, as well as in the dataframe returned by summarize().
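
As a quick check, the distribution column of sim$results can be viewed directly (output not shown, since the simulated values of y vary from run to run):

head(sim$results[,c("level_id","n","distribution","y")])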

sim %>% summarize(list(stat="mean", x="y"))
#>   level_id   n distribution n_reps    mean_y
#> 1        1  10       Beta 1      1 0.3293261
#> 2        2 100       Beta 1      1 0.2975617
#> 3        3  10       Beta 2      1 0.8442316
#> 4        4 100       Beta 2      1 0.7989588
#> 5        5  10       Normal      1 2.9720820
#> 6        6 100       Normal      1 2.9747477

Sometimes you may want to run simulations for only a subset of level combinations. In these cases, use the .keep option of set_levels(). First, set the levels normally. Second, view the sim$levels_grid dataframe to examine the level combinations and their associated level_id values. Third, call set_levels() again with the .keep option to specify which level combinations to keep. This is demonstrated below:

sim <- new_sim()
sim %<>% set_levels(alpha=c(2,3,4), beta=c(50,60))
print(sim$levels_grid)
#>   level_id alpha beta
#> 1        1     2   50
#> 2        2     3   50
#> 3        3     4   50
#> 4        4     2   60
#> 5        5     3   60
#> 6        6     4   60
sim %<>% set_levels(.keep=c(1,2,6))
print(sim$levels_grid)
#>   level_id alpha beta
#> 1        1     2   50
#> 2        2     3   50
#> 6        6     4   60

Complex return data

In most situations, the results of simulations are numeric. However, we may want to return more complex data, such as matrices, lists, or model objects. To do this, we add a key-value pair to the list returned by the simulation script with the special key ".complex" and a list (containing the complex data) as the value. This is illustrated in the toy example below. Here, the simulation script estimates the parameters of a linear regression and returns these as numeric, but also returns the estimated covariance matrix and the entire model object.

sim <- new_sim()
sim %<>% set_levels(n=c(10, 100, 1000))
create_data <- function(n) {
  x <- runif(n)
  y <- 3 + 2*x + rnorm(n)
  return(data.frame("x"=x, "y"=y))
}
sim %<>% set_config(num_sim=2)
sim %<>% set_script(function() {
  dat <- create_data(L$n)
  model <- lm(y~x, data=dat)
  return(list(
    "beta0_hat" = model$coefficients[[1]],
    "beta1_hat" = model$coefficients[[2]],
    ".complex" = list(
      "model" = model,
      "cov_mtx" = vcov(model)
    )
  ))
})
sim %<>% run()
#> Done. No errors or warnings detected.

After running this simulation, the numeric results can be accessed directly via sim$results or using the summarize() function, as usual:

head(sim$results)
#>   sim_uid level_id rep_id    n      runtime beta0_hat  beta1_hat
#> 1       1        1      1   10 0.0023009777  3.313459 1.64224090
#> 2       4        1      2   10 0.0006840229  3.682214 0.04035482
#> 3       2        2      1  100 0.0026309490  3.083269 2.08769380
#> 4       5        2      2  100 0.0006608963  2.781492 2.11971793
#> 5       3        3      1 1000 0.0009598732  3.105255 1.86686623
#> 6       6        3      2 1000 0.0008368492  2.964213 2.09937006

To examine the complex return data, we can use the function get_complex(), as illustrated below:

c5 <- get_complex(sim, sim_uid=5)
print(summary(c5$model))
#> 
#> Call:
#> lm(formula = y ~ x, data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2.71501 -0.57515 -0.07813  0.74569  2.87825 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   2.7815     0.1926  14.442  < 2e-16 ***
#> x             2.1197     0.3234   6.554 2.63e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9914 on 98 degrees of freedom
#> Multiple R-squared:  0.3047, Adjusted R-squared:  0.2976 
#> F-statistic: 42.95 on 1 and 98 DF,  p-value: 2.628e-09
print(c5$cov_mtx)
#>             (Intercept)           x
#> (Intercept)  0.03709408 -0.05340809
#> x           -0.05340809  0.10461387
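
Since get_complex() retrieves the data for a single sim_uid, extracting one piece of complex data across all replicates can be done with an apply-style loop. The following is a minimal sketch (the object cov_11 is our own construction):

# Collect the estimated variance of the intercept, vcov(model)[1,1],
# across all simulation replicates
cov_11 <- sapply(sim$results$sim_uid, function(uid) {
  get_complex(sim, sim_uid=uid)$cov_mtx[1,1]
})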

Using the batch function

The batch() function is useful for sharing data or objects between simulation replicates. Essentially, it allows simulation replicates to be divided into “batches”; all replicates in a given batch will then share a certain set of objects. A common use case for this is a simulation that involves using multiple methods to analyze a shared dataset, and repeating this process over a number of dataset replicates. This may be of interest if, for example, it is computationally expensive to generate a simulated dataset.

To illustrate the use of batch() using this example, we first consider the following simulation:

sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3)
sim %<>% set_script(function() {
  dat <- create_data(n=100)
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()
#> Done. No errors or warnings detected.

From the "dat_1" column of the results object (equal to the first element of the dat vector created in the simulation script), we see that a unique dataset was created for each simulation replicate:

sim$results[order(sim$results$rep_id),c(1:7)!=5]
#>   sim_uid level_id rep_id        est mu_hat dat_1
#> 1       1        1      1   est_mean   3.17  3.75
#> 4       2        2      1 est_median   3.13  2.50
#> 2       3        1      2   est_mean   3.06  4.36
#> 5       5        2      2 est_median   2.88  2.31
#> 3       4        1      3   est_mean   3.00  2.27
#> 6       6        2      3 est_median   2.97  2.63

Suppose that instead, we wish to analyze each simulated dataset using multiple methods (in this case corresponding to "est_mean" and "est_median"), and repeat this procedure a total of three times. We can do this using the batch() function, as follows:

sim <- new_sim()
create_data <- function(n) { rnorm(n=n, mean=3) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=3, batch_levels=NULL)
sim %<>% set_script(function() {
  batch({
    dat <- create_data(n=100)
  })
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()
#> Done. No errors or warnings detected.

In the code above, we changed two things. First, we added batch_levels=NULL to the set_config() call; this will be explained below. Second, we wrapped the line dat <- create_data(n=100) inside the batch() function. Whatever code goes inside the batch() function will produce the same output for all simulation replicates in a batch.

sim$results[order(sim$results$rep_id),c(1:7)!=5]
#>   sim_uid level_id rep_id        est mu_hat dat_1
#> 1       1        1      1   est_mean   3.05  3.99
#> 4       2        2      1 est_median   2.98  3.99
#> 2       3        1      2   est_mean   3.09  1.72
#> 5       5        2      2 est_median   3.14  1.72
#> 3       4        1      3   est_mean   2.98  2.66
#> 6       6        2      3 est_median   2.90  2.66

In this case, from the "dat_1" column of the results object, we see that one dataset was created and shared by the batch corresponding to sim_uids 1 and 2 (likewise for sim_uids {3,5} and {4,6}).

However, the situation is often more complicated. Suppose we have a simulation with multiple levels, some of which govern how data are created and some of which govern how data are analyzed. This is where the batch_levels argument of set_config() comes in. Specifically, this argument should be a character vector containing the names of the simulation levels that are referenced (via the special variable L) from within a batch() block. In the example below, the levels n and mu are used within the batch() call, while the level est is not.

sim <- new_sim()
create_data <- function(n, mu) { rnorm(n=n, mean=mu) }
est_mean <- function(dat, type) {
  if (type=="est_mean") { return(mean(dat)) }
  if (type=="est_median") { return(median(dat)) }
}
sim %<>% set_levels(n=c(10,100), mu=c(3,5), est=c("est_mean","est_median"))
sim %<>% set_config(num_sim=2, batch_levels=c("n", "mu"), return_batch_id=T)
sim %<>% set_script(function() {
  batch({
    dat <- create_data(n=L$n, mu=L$mu)
  })
  mu_hat <- est_mean(dat=dat, type=L$est)
  return(list(
    "mu_hat" = round(mu_hat,2),
    "dat_1" = round(dat[1],2)
  ))
})
sim %<>% run()
#> Done. No errors or warnings detected.

sim$results[order(sim$results$batch_id),c(1:10)!=8]
#>    sim_uid level_id rep_id batch_id   n mu        est mu_hat dat_1
#> 1        1        1      1        1  10  3   est_mean   3.11  4.88
#> 9        5        5      1        1  10  3 est_median   2.93  4.88
#> 2        9        1      2        2  10  3   est_mean   2.79  1.65
#> 10      13        5      2        2  10  3 est_median   2.35  1.65
#> 3        2        2      1        3 100  3   est_mean   3.02  3.83
#> 11       6        6      1        3 100  3 est_median   3.00  3.83
#> 4       10        2      2        4 100  3   est_mean   3.10  2.81
#> 12      14        6      2        4 100  3 est_median   3.06  2.81
#> 5        3        3      1        5  10  5   est_mean   4.98  5.83
#> 13       7        7      1        5  10  5 est_median   5.03  5.83
#> 6       11        3      2        6  10  5   est_mean   5.36  7.78
#> 14      15        7      2        6  10  5 est_median   5.41  7.78
#> 7        4        4      1        7 100  5   est_mean   5.08  4.56
#> 15       8        8      1        7 100  5 est_median   4.99  4.56
#> 8       12        4      2        8 100  5   est_mean   5.30  5.69
#> 16      16        8      2        8 100  5 est_median   5.26  5.69

The batches were created such that each batch contains two replicates, one for each level value of est. For expository purposes, we also specified return_batch_id=T in set_config() so that the batch_id of each replicate appears in the results object; all replicates that share the same batch_id belong to a single batch. This option is not necessary in practice, but it can be useful for verifying correct usage of the batch() function.
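
One such check is to tabulate batch_id against est, confirming that each batch contains exactly one replicate per estimator (a quick base R check, not part of the package):

# Each batch_id should appear exactly once per value of 'est'
with(sim$results, table(batch_id, est))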

Random number generator seeds

In statistical research, it is desirable to be able to reproduce the exact results of a simulation study. Since R code often involves stochastic functions like rnorm() or sample() that return different values each time they are called, reproducibility is not guaranteed. In a simple R script, calling set.seed() at the beginning of the script ensures that the stochastic functions that follow will produce the same results whenever the script is run. However, a more nuanced strategy is needed when running simulations. When running 100 replicates of the same simulation, we do not want all replicates to return identical results; rather, we want the replicates to differ from one another, but the entire set of replicates to be identical whenever the simulation as a whole is rerun. SimEngine manages this process, even when simulations are run in parallel locally or on a cluster computing system. In SimEngine, a single “global seed” is used to generate a different seed for each simulation replicate. The set_config() function is used to set or change this global seed:

sim %<>% set_config(seed=123)
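
As a minimal sketch of what this guarantees (the toy script below is our own, not from the package documentation), two simulation objects configured with the same global seed should produce identical results:

sim_a <- new_sim()
sim_a %<>% set_levels(n=c(10,100))
sim_a %<>% set_config(num_sim=2, seed=123)
sim_a %<>% set_script(function() { return(list("y"=mean(rnorm(L$n)))) })
sim_a %<>% run()
sim_b <- new_sim()
sim_b %<>% set_levels(n=c(10,100))
sim_b %<>% set_config(num_sim=2, seed=123)
sim_b %<>% set_script(function() { return(list("y"=mean(rnorm(L$n)))) })
sim_b %<>% run()
identical(sim_a$results$y, sim_b$results$y)  # expected TRUE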

If a seed is not set using set_config(), SimEngine will set a random seed automatically so that the results can be replicated if desired. To view this seed, we use the vars() function:

sim <- new_sim()
print(vars(sim, "seed"))
#> (the randomly generated global seed is printed here as an integer)

Debugging and error/warning handling

In the simulation coding workflow, errors are inevitable. Some errors may affect all simulation replicates, while other errors may only affect a subset of replicates. By default, when a simulation is run, SimEngine will not stop if an error occurs; instead, errors are logged and stored in a dataframe along with information about the simulation replicates that resulted in those errors. Examining this dataframe by typing print(sim$errors) can sometimes help to quickly pinpoint the issue. This is demonstrated below:

sim <- new_sim()
sim %<>% set_config(num_sim=2)
sim %<>% set_levels(
  Sigma = list(
    s1 = list(mtx=matrix(c(3,1,1,2), nrow=2)),
    s3 = list(mtx=matrix(c(4,3,3,9), nrow=2)),
    s2 = list(mtx=matrix(c(1,2,2,1), nrow=2)),
    s4 = list(mtx=matrix(c(8,2,2,6), nrow=2))
  )
)
sim %<>% set_script(function() {
  x <- MASS::mvrnorm(n=1, mu=c(0,0), Sigma=L$Sigma$mtx)
  return(list(x1=x[1], x2=x[2]))
})
sim %<>% run()
#> Done. Errors detected in 25% of simulation replicates. Warnings detected in 0% of simulation replicates.
print(sim$errors)
#>   sim_uid level_id rep_id Sigma      runtime                          message
#> 1       5        3      1    s2 0.0002849102 'Sigma' is not positive definite
#> 2       6        3      2    s2 0.0002381802 'Sigma' is not positive definite
#>                                                      call
#> 1 MASS::mvrnorm(n = 1, mu = c(0, 0), Sigma = L$Sigma$mtx)
#> 2 MASS::mvrnorm(n = 1, mu = c(0, 0), Sigma = L$Sigma$mtx)

From the output above, we see that the code fails for the simulation replicates that use the level with Sigma="s2" because it uses an invalid (not positive definite) covariance matrix. Similarly, if a simulation involves replicates that throw warnings, all warnings are logged and stored in the dataframe sim$warnings.
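
As a minimal sketch (a toy example of our own), a script whose replicates can generate warnings will populate sim$warnings analogously:

sim <- new_sim()
sim %<>% set_config(num_sim=2)
sim %<>% set_script(function() {
  # log() of a negative draw produces a "NaNs produced" warning
  x <- log(rnorm(n=1))
  return(list("x"=x))
})
sim %<>% run()
print(sim$warnings)  # populated for any replicates that threw warnings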

The workflow demonstrated above can be useful for pinpointing errors, but it has two main drawbacks. First, it is undesirable to run a time-consuming simulation involving hundreds or thousands of replicates, only to find at the end that every replicate failed because of a typo. It may therefore be useful to stop an entire simulation as soon as a single error occurs. Second, it can sometimes be difficult to determine exactly what caused an error without access to more advanced debugging tools. For both of these situations, SimEngine includes the following configuration option:

sim %<>% set_config(stop_at_error=TRUE)

Setting stop_at_error=TRUE will stop the simulation as soon as it encounters an error. Furthermore, the error will be thrown by R in the usual way, so if the simulation is being run in RStudio, the built-in debugging tools (such as “Show Traceback” and “Rerun with debug”) can be used to find and fix the bug. Placing a call to browser() at the top of the simulation script can also be useful for debugging.
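
For instance, the covariance matrix example from above could be rerun with these tools (a sketch; browser() pauses execution when the simulation is run interactively):

sim <- new_sim()
sim %<>% set_config(num_sim=2, stop_at_error=TRUE)
sim %<>% set_levels(
  Sigma = list(
    s1 = list(mtx=matrix(c(3,1,1,2), nrow=2)),
    s2 = list(mtx=matrix(c(1,2,2,1), nrow=2))
  )
)
sim %<>% set_script(function() {
  browser()  # pause here to inspect objects such as L$Sigma$mtx
  x <- MASS::mvrnorm(n=1, mu=c(0,0), Sigma=L$Sigma$mtx)
  return(list(x1=x[1], x2=x[2]))
})
sim %<>% run()  # stops at the first error rather than logging it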

Monte Carlo error

Statistical simulations are often based on the principle of Monte Carlo approximation; specifically, pseudo-random sampling is used to evaluate the performance of a statistical procedure under a particular data-generating process. The performance of the procedure can be viewed as a statistical parameter, and because only a finite number of simulation replicates can be performed, there is uncertainty in any estimate of performance. This uncertainty is often referred to as Monte Carlo error. We can quantify Monte Carlo error using, for example, the standard error of the performance estimator.

Measuring and reporting Monte Carlo error is a vital component of a simulation study. SimEngine includes an option in the summarize() function to automatically estimate the Monte Carlo standard error for any inferential summary statistic, e.g., estimator bias or confidence interval coverage. The standard error estimates are based on the formulas provided in Morris et al. 2019. If the option mc_se is set to TRUE, estimates of Monte Carlo standard error will be included in the summary data frame, along with associated 95% confidence intervals based on a normal approximation.

sim <- new_sim()
create_data <- function(n) { rpois(n, lambda=5) }
est_mean <- function(dat) {
  return(mean(dat))
}
sim %<>% set_levels(n=c(10,100,1000))
sim %<>% set_config(num_sim=5)
sim %<>% set_script(function() {
  dat <- create_data(L$n)
  lambda_hat <- est_mean(dat=dat)
  return (list("lambda_hat"=lambda_hat))
})
sim %<>% run()
#> Done. No errors or warnings detected.
sim %>% summarize(
  list(stat="mse", name="lambda_mse", estimate="lambda_hat", truth=5), 
  mc_se = TRUE
)
#>   level_id    n n_reps lambda_mse lambda_mse_mc_se lambda_mse_mc_ci_l
#> 1        1   10      5   0.602000     0.4280700877       -0.237017372
#> 2        2  100      5   0.052200     0.0234607545        0.006216921
#> 3        3 1000      5   0.002099     0.0005030319        0.001113057
#>   lambda_mse_mc_ci_u
#> 1        1.441017372
#> 2        0.098183079
#> 3        0.003084943
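
As a check on these values, the Monte Carlo standard error of the MSE can be computed by hand from sim$results. The sketch below applies the formula from Morris et al. 2019 to the n=10 level (variable names are our own):

# MC standard error of the MSE: sqrt( sum((sq_err - mse_hat)^2) / (K*(K-1)) ),
# where sq_err holds the squared errors across the K replicates
r <- sim$results[sim$results$n==10,]
sq_err <- (r$lambda_hat - 5)^2
mse_hat <- mean(sq_err)  # should agree with lambda_mse for n=10
K <- length(sq_err)
mc_se <- sqrt(sum((sq_err - mse_hat)^2) / (K*(K-1)))  # cf. lambda_mse_mc_se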