Disease outbreak outcome estimation using penalized splines

Overview

The purpose of this package is to allow estimation of complex outcome measures of infectious disease outbreaks.

The package works by using generalized additive models (GAMs) with penalized basis splines (P-splines) to approximate the observed data. Approximating splines are sampled from their distribution, and for each approximating spline, the outcome measure of interest is calculated. This yields a sampling distribution of the outcome measure of interest, from which point and itnerval estimates can then be obtained.

We begin by loading the package and a few other libraries needed by the code below:

Loading data and setting up the model

Next, we load observations. The example dataset is a CSV file consisting of weekly (time) count of cases of a seasonal infectious disease (cases).

We also calculate observed cumulative incidence (absolute and relative) here; these are used only for subsequent visualization, not for analysis.

Next, we generate a log-linked (family=poisson) GAM with 20-knot (k=20) cubic (m=3) cyclic P-splines (bs="cp"). You can vary the number of knots as needed. Fewer knots result in faster computation, but looser fit; more knots will take longer to compute, but increasing the number of knots beyond a certain point will not improve the fit. In your analysis, start with a small number of knots and increase it until adding more doesn’t change the results.

Generate a vector of time values at which the model will be sampled. This is used for analysis (in particular, for estimation of rates of change and areas under curves) as well as visualization (plotting splines). Higher sampleFreq increases the accuracy of results, but also increases computation time.

Here, we set up the time time values so they run from one half time period (inclusive) before the start of the observed times to one half time period (non-inclusive) after the end of the observed times.

This is the number of samples that will be drawn from the outcome distribution. Higher n decreases the variance of results, but also increases computation time. 20 is enough for draft analysis and this demo.

There are two main types of outcome measures that this package can estimate: time series outcomes (in which an outcome estimate is calculated for each observed time point), and scalar outcomes (in which a single overall outcome estimate is calculated across all time points).

Example 1: Time series outcome estimation

In this simple example, we estimate the 95% confidence interval for infection case counts. The workhorse of time series estimation is the pspline.estimate.timeseries function, which takes three parameters: a data frame of predictors at which the outcome will be estimated (predictors, which we set up above), a function which calculates the desired outcome, and the number of samples we want to draw (n).

We will see examples later of how to estimate custom outcomes, but for this simple example we can use the pspline.outbreak.cases function (which is built into the package) to calculate our outcome of interest.

The result of calling pspline.estimate.timeseries includes predictors, point estimates of our desired outcome (median), as well as an upper and lower confidence level (upper and lower), which we can plot. The results are what you might expect: the confidence interval is narrow outside of the main part of outbreak, and widest at the outbreak’s peak.

Example 2: Time series outcome estimation

Another simple outcome measure that the package can calculate for us is the relative cumulative incidence, using the pspline.outbreak.cumcases.relative function. Here we estimate and plot cumulative incidence at the 95% confidence level. Because we are calculating relative incidence, the outcome is constrained to 0 before the outbreak starts and to 1 after the outbreak ends, and therefore the confidence intervals are at their narrowest before and after the outbreak.

Example 3: Accessing individual time series samples

The package provides access to the individual samples of the outcome measure, which can be helpful for visualization. For example, if you ask for 15 sampled estimates of case counts, you will get a data frame with a pspline.sample column identifying the 15 time series samples:

Here casesSamples is a data frame of cases values for each time value, for 100 different samples. The samples are differentiated by the pspline.sample column.

Example 4: Estimation of a custom time series outcome

Suppose that the infection we are investigating progresses to serious disease in 80% of cases, but that it can be treated — before it progresses — with a treatment that has a 90% success rate. Also suppose that our supply is limited to 50 treatments, and that treatment is administered to all infected people until we run out of supply.

Then, among the first 100 cases, 10% will fail treatment, of which 80% will progress to serious disease; for all subsequent cases, no treatment will be available, and 80% will progress to serious disease. We are interested in an estimate of the number of cases of serious disease (as a function of time), based on our observations of the number of cases of the underlying infection.

The pspline.inference package does not include a built-in way to calculate this outcome, but we can write our own function to do it. This function will take a model object (model, obtained from the call to gam() above) and a data frame of predictors (data). It will return a data frame of calculated outcome estimates.

We can now combine our custom outcome measure calculation with time series estimation:

We’ll also calculate an estimate of all cumulative cases (serious or not), to better visualize the effect of the treatment

As you might imagine, as the estimated number of all cases rises past 50 and medication becomes unavailable, the estimated number of serious cases takes off sharply.

Our custom outcome calculation function would probably be more useful if it allowed progression rate, treatment success rate, and treatment supply to be varied. We can accomplish this by using a function within a function:

This is how we would use it:

Example 5: Scalar outcome measures

Rather than considering the expected incidence of the infection at hand, and its serious manifestation, let us now consider a related question: if we administer preventative medication on a first-come-first-served basis, and we have 50 doses of it, when do we expect to run out?

This is a scalar outcome measure, and we can estimate it using the pspline.estimate.scalars function. Similar to outbreak.estimate.timeseries, it takes a model, predictors, an outcome calculation function, the number of samples to draw, and the confidence interval. Also similar to pspline.estimate.timeseries, it returns median and lower/upper confidence limits for the outcome measure of interest.

For this particular outcome measure, we need to write a function to calculate it. We’ll use the same function-in-function technique to allow us to customize the number of medication doses as needed:

Estimated time before medication supply is exhausted, in weeks since July 1st
Lower CL Median Upper CL
23.025 24.025 24.75125

Example 6: Accessing individual scalar samples

Similar to time series outcomes, we can obtain the individual samples of a scalar outcome using the pspline.sample.scalars function. This is useful for visualization – in this example, we show the density plot of the estimated time when medication supply will run out:

Example 7: Outcome estimate validation

Being able to estimate an outcome is useful, but in order to trust those estimates, we need to validate the method of estimation. This is what the pspline.validate.scalars function is for. It performs a simulation study by first generating one of more hypothetical true states of the process we are interested in, then generating one or more sets of observations from each true state, and then computing the outcome measures for each true process and each set of observations. The results of validation specify whether the true value of an outcome measure was included in the confidence interval the required number of times.

Let’s run a simulation study to validate our estimate of the time when the medication supply will run out, as described in the previous examples. First, we need a function to generate a true outbreak:

Next, we need a function which will generate observations from a true process. Since we are dealing with infectious disease incidence, we will take observations to be Poisson-distributed with the distribution parameter equal to the true value. We will sample the true process in one-week increments:

We also need to specify how to create a model from a set of observations. We’ll use Poisson (log) link and a 20-knot cyclic cubic P-spline:

Finally, we need a way to calculate outcomes. We already wrote medicationSupplyEnd above, and it’s obviously necessaty that we use the exact same function for validation as we do for estimation.

Let’s now run a simulation study; we need to choose the number of truths we will generate and the number of observation sets we’ll generate from each truth, as well as the confidence level and the number of splines that will be sampled to estimate the outcome from each set of observations.

Simulation studies are time-consuming, so for demonstration purpose we’ll only include 10*10=100 simulations in this study. A proper simulation study would need to be bigger.

The simulation study shows whether the 95% confidence interval reported by the P-spline GAM method actually contains the true value 95% of the time. Due to the low power of the simulation study (only 10 truths and 10 observations), this validation doesn’t quite reach 95%:

Coverage: frequency with which true value is included in the 95% CI
Supply duration
92.00%