Introduction to SimEngine

This vignette is adapted from the homepage of the SimEngine website.

library(SimEngine)
#> Loading required package: magrittr
#> Welcome to SimEngine! Full package documentation can be found at:
#>  https://avi-kenny.github.io/SimEngine

Overview

SimEngine is an open-source R package for structuring, maintaining, running, and debugging statistical simulations on both local and cluster-based computing environments.

Getting started

The goal of many statistical simulations is to test how a new statistical method performs against existing methods. Most statistical simulations include three basic phases: (1) generate some data, (2) run one or more methods using the generated data, and (3) compare the performance of the methods.

To briefly illustrate how these phases are implemented using SimEngine, we will use the example of estimating the average treatment effect of a drug in the context of a randomized controlled trial (RCT).

1) Load the package and create a “simulation object”

The simulation object (an R object of class sim_obj) will contain all data, functions, and results related to your simulation.

library(SimEngine)
sim <- new_sim()

2) Code a function to generate some data

Most simulations will involve one or more functions that create a dataset designed to mimic some real-world data structure. Here, we write a function that simulates data from an RCT in which we compare a continuous outcome (e.g. blood pressure) between a treatment group and a control group. We generate the data by looping through a set of patients, assigning them randomly to one of the two groups, and generating their outcome according to a simple model.

# Code up the dataset-generating function
create_rct_data <- function (num_patients) {
  df <- data.frame(
    "patient_id" = integer(),
    "group" = character(),
    "outcome" = double(),
    stringsAsFactors = FALSE
  )
  for (i in 1:num_patients) {
    group <- ifelse(sample(c(0,1), size=1)==1, "treatment", "control")
    treatment_effect <- ifelse(group=="treatment", -7, 0)
    outcome <- rnorm(n=1, mean=130, sd=2) + treatment_effect
    df[i,] <- list(i, group, outcome)
  }
  return (df)
}

# Test the function
create_rct_data(5)
#>   patient_id     group  outcome
#> 1          1 treatment 119.8892
#> 2          2   control 128.1227
#> 3          3   control 131.0566
#> 4          4   control 130.5807
#> 5          5 treatment 122.3775
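
The row-by-row loop above keeps the data-generating model easy to read, but growing a data frame inside a loop can become slow for large values of num_patients. As a hedged alternative, the sketch below draws all group assignments and outcomes at once under the same model (equal assignment probability, control mean 130, standard deviation 2, treatment effect -7). The name create_rct_data_vec is hypothetical; it is not part of SimEngine or of the original example.

```r
# Vectorized sketch of the same data-generating model (hypothetical helper)
create_rct_data_vec <- function(num_patients) {
  # Assign each patient to one of the two groups with probability 0.5
  group <- sample(c("treatment", "control"), size = num_patients, replace = TRUE)
  # Treated patients have their outcome shifted by -7
  treatment_effect <- ifelse(group == "treatment", -7, 0)
  outcome <- rnorm(n = num_patients, mean = 130, sd = 2) + treatment_effect
  data.frame(
    patient_id = seq_len(num_patients),
    group = group,
    outcome = outcome,
    stringsAsFactors = FALSE
  )
}

# Same column layout as create_rct_data
head(create_rct_data_vec(5))
```

Because the random draws happen in a different order, this version will not reproduce the loop's output for a given seed, even though both follow the same distribution.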

3) Code your methods (or other functions)

With SimEngine, any functions that you declare (or load via source) are automatically added to your simulation object when the simulation runs. In this example, we test two different estimators of the average treatment effect. For simplicity, we code both as a single function and use the type argument to specify which estimator to use, but you could also write two separate functions. The first estimator uses the known probability of being assigned to the treatment group (0.5), whereas the second estimator uses an estimate of this probability based on the observed data. Don’t worry too much about the mathematical details; the important thing is that both estimators take in the dataset generated by the create_rct_data function and return an estimate of the average treatment effect, whose true value in this example is -7.

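# Code up the estimators
est_tx_effect <- function(df, type) {
  n <- nrow(df)
  # Sum of outcomes within each group
  sum_t <- sum(df$outcome * (df$group=="treatment"))
  sum_c <- sum(df$outcome * (df$group=="control"))
  if (type=="est1") {
    # Estimator 1: weight by the known assignment probability
    true_prob <- 0.5
    return ( sum_t/(n*true_prob) - sum_c/(n*(1-true_prob)) )
  } else if (type=="est2") {
    # Estimator 2: weight by the observed proportion of treated patients
    est_prob <- sum(df$group=="treatment") / n
    return ( sum_t/(n*est_prob) - sum_c/(n*(1-est_prob)) )
  } else {
    # Fail loudly instead of silently returning NULL
    stop("Unknown estimator type: ", type)
  }
}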

# Test out the estimators
df <- create_rct_data(1000)
est_tx_effect(df, "est1")
#> [1] -15.66783
est_tx_effect(df, "est2")
#> [1] -7.063783
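
To see why the two estimators can differ so much on the same dataset, it may help to work through a tiny hand-made example. The toy data below (df_toy, a hypothetical dataset with three treated patients and one control) is not produced by create_rct_data; it only illustrates the arithmetic. Dividing by the estimated probability turns each sum into a group mean, whereas dividing by the fixed value 0.5 does not adjust for unequal group sizes.

```r
# Hypothetical toy dataset: three treated patients, one control
df_toy <- data.frame(
  group = c("treatment", "treatment", "treatment", "control"),
  outcome = c(123, 123, 123, 130),
  stringsAsFactors = FALSE
)

n <- nrow(df_toy)                                          # 4
sum_t <- sum(df_toy$outcome * (df_toy$group == "treatment"))  # 369
sum_c <- sum(df_toy$outcome * (df_toy$group == "control"))    # 130

# Estimator 1: known probability 0.5, so both sums are divided by n * 0.5 = 2
est1 <- sum_t / (n * 0.5) - sum_c / (n * 0.5)
est1  # 184.5 - 65 = 119.5

# Estimator 2: estimated probability 3/4, so each sum is divided by its group size
est_prob <- sum(df_toy$group == "treatment") / n
est2 <- sum_t / (n * est_prob) - sum_c / (n * (1 - est_prob))
est2  # 123 - 130 = -7
```

Because the outcomes are centered far from zero (around 130), even a small imbalance in group sizes moves the first estimator a long way, which is consistent with the much noisier est1 value shown above.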

4) Set the simulation levels

Often, we want to run the same simulation multiple times (with each run referred to as a “simulation replicate”), but with certain things changed. In this example, perhaps we want to vary the number of patients and the method used to estimate the average treatment effect. We refer to the things that vary as “simulation levels”. By default, SimEngine will run our simulation 10 times for each level combination. Below, since there are two methods and three values of num_patients, we have six level combinations and so SimEngine will run a total of 60 simulation replicates. Note that we make extensive use of the pipe operators (%>% and %<>%) from the magrittr package; if you have never used pipes, check out the magrittr documentation.

sim %<>% set_levels(
  estimator = c("est1", "est2"),
  num_patients = c(50, 200, 1000)
)
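
The count of level combinations can be checked in base R. The sketch below uses expand.grid only to illustrate the arithmetic; it is not a description of how SimEngine stores levels internally, and the names level_grid and n_replicates are hypothetical.

```r
# Enumerate every combination of the two simulation levels
level_grid <- expand.grid(
  estimator = c("est1", "est2"),
  num_patients = c(50, 200, 1000),
  stringsAsFactors = FALSE
)
level_grid

# Two estimators times three sample sizes
nrow(level_grid)  # 6

# Total replicates at 10 replicates per level combination
n_replicates <- nrow(level_grid) * 10
n_replicates  # 60
```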

5) Create a simulation script

The simulation script is a function that runs a single simulation replicate and returns the results. Within a script, you can reference the current simulation level values using the variable L. For example, when the first simulation replicate is running, L$estimator will equal “est1” and L$num_patients will equal 50. In the last simulation replicate, L$estimator will equal “est2” and L$num_patients will equal 1,000. Your script will automatically have access to any functions that you created earlier.

sim %<>% set_script(function() {
  df <- create_rct_data(L$num_patients)
  est <- est_tx_effect(df, L$estimator)
  return (list(
    "est" = est,
    "mean_t" = mean(df$outcome[df$group=="treatment"]),
    "mean_c" = mean(df$outcome[df$group=="control"])
  ))
})

Your script should always return a list containing key-value pairs, where the keys are character strings and the values are simple data types (numbers, character strings, or boolean values). If you need to return more complex data types (e.g. lists or dataframes), see the Advanced usage documentation page. Note that in this example, you could have alternatively coded your estimators as separate functions and called them from within the script using the use_method function.

6) Set the simulation configuration

The set_config function controls options related to your entire simulation, such as the number of simulation replicates to run for each level combination and how to parallelize your code. This is also where you should specify any packages your simulation needs (instead of using library or require). See the set_config docs for more info. We set num_sim to 100, and so SimEngine will run a total of 600 simulation replicates (100 for each of the six level combinations).

sim %<>% set_config(
  num_sim = 100,
  parallel = TRUE,
  n_cores = 2,
  packages = c("ggplot2", "stringr")
)
#> 
#> Attaching package: 'ggplot2'
#> The following object is masked from 'package:SimEngine':
#> 
#>     vars

7) Run the simulation

A single call to run executes all 600 replicates and stores the results in the simulation object.

sim %<>% run()
#> Done. No errors or warnings detected.

8) View and summarize results

Once the simulations have finished, use the summarize function to calculate common summary statistics, such as bias, variance, MSE, and coverage.

sim %>% summarize(
  list(stat="bias", truth=-7, estimate="est"),
  list(stat="mse", truth=-7, estimate="est")
)
#>   level_id estimator num_patients n_reps     bias_est      MSE_est
#> 1        1      est1           50    100  1.159739195 1.167700e+03
#> 2        2      est2           50    100  0.102151883 3.112068e-01
#> 3        3      est1          200    100  3.332611940 3.066223e+02
#> 4        4      est2          200    100 -0.049515894 7.599178e-02
#> 5        5      est1         1000    100  1.623324819 6.396039e+01
#> 6        6      est2         1000    100 -0.000810299 1.421038e-02

In this example, we see that the MSE of estimator 1 is much higher than that of estimator 2 and that MSE decreases with increasing sample size for both estimators, as expected. You can also directly access the results for individual simulation replicates.

head(sim$results)
#>   sim_uid level_id rep_id estimator num_patients     runtime        est
#> 1       1        1      1      est1           50 0.006202936 -17.028056
#> 2       7        1      2      est1           50 0.004877090  22.921537
#> 3       8        1      3      est1           50 0.006031990  -7.062596
#> 4       9        1      4      est1           50 0.005043983 -26.944357
#> 5      10        1      5      est1           50 0.007095098  23.259374
#> 6      11        1      6      est1           50 0.004637957  -7.814285
#>     mean_t   mean_c
#> 1 123.4479 130.3251
#> 2 122.5447 129.9188
#> 3 122.6812 129.7438
#> 4 122.7984 129.5545
#> 5 123.1836 130.3480
#> 6 122.8533 130.6675

Above, the sim_uid uniquely identifies a single simulation replicate and the level_id uniquely identifies a level combination. The rep_id is unique within a given level combination and identifies the replicate.
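
The bias and MSE columns can be related to per-replicate results through their conventional definitions: bias is the mean estimate minus the true value, and MSE is the mean squared difference between the estimates and the true value. The sketch below applies these definitions to the six est values printed above. It is an assumption that summarize uses these same formulas, and six replicates are far too few to reproduce the summary table; the variable names are hypothetical.

```r
# The six est values shown in head(sim$results) (est1, num_patients = 50)
est <- c(-17.028056, 22.921537, -7.062596, -26.944357, 23.259374, -7.814285)
truth <- -7

# Conventional definitions of bias and mean squared error
bias <- mean(est) - truth
mse <- mean((est - truth)^2)

# MSE decomposes into squared bias plus the (population) variance of the estimates
variance <- mean((est - mean(est))^2)
all.equal(mse, bias^2 + variance)
```

The decomposition helps explain the summary table: estimator 1 has a very large MSE mainly because its estimates vary widely from replicate to replicate, not because its bias is large.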