`drtmle`

: Doubly-Robust
Inference in RDavid Benkeser

Emory University

Department of Biostatistics and Bioinformatics

1518 Clifton Road, NE

Mailstop: 002-3AA

Atlanta, Georgia, 30322, U.S.A.

benkeser@emory.edu

Emory University

Department of Biostatistics and Bioinformatics

1518 Clifton Road, NE

Mailstop: 002-3AA

Atlanta, Georgia, 30322, U.S.A.

benkeser@emory.edu

Abstract

Inverse probability of treatment weighted estimators and
doubly-robust estimators (including augmented inverse probability of
treatment weight and targeted minimum loss-based estimators) are widely
used in causal inference to estimate and draw inference about the
average effect of a treatment. As an intermediate step, these estimators
require estimation of key nuisance parameters, which are often
regression functions. Typically, regressions are estimated using maximum
likelihood and parametric models. Confidence intervals and p-values may
be computed based on standard asymptotic results, such as the central
limit theorem and the delta method. However, in high-dimensional
settings maximum likelihood estimation often breaks down and standard
procedures no longer yield correct inference. Instead, we may rely on
adaptive estimators of nuisance parameters to construct flexible
regression estimators. However, use of adaptive estimators poses a
challenge for performing statistical inference about an estimated
treatment effect. While doubly-robust estimators facilitate inference
when *all* relevant regression functions are consistently
estimated, the same cannot be said when at least one estimator is
inconsistent. `drtmle`

implements doubly-robust confidence
intervals and hypothesis tests about targeted minimum loss-based
estimates of the average treatment effect. The package additionally
implements an inverse probability of treatment weighted estimator that
yields valid inference even when the probability of treatment is
estimated via adaptive methods.

In many fields researchers are interested in assessing the population-level effect of a particular treatment on an outcome of interest. The treatment might correspond to a drug, a potentially harmful exposure exposure, or a policy intervention. Often, the treatment may not be randomized due to ethical or logistical reasons. This has sparked great interest in developing statistical methodology to estimate population-level effects from observational data. Estimation of these effects often requires accounting for putative confounding factors, that is, factors related to whether a participant receives treatment and to the participant’s outcome. In many settings, researchers measure many potential confounders, for example using administrative databases or genetic sequencing technology. Due to the curse-of-dimensionality, common statistical estimation techniques are not feasible for such high-dimensional data without restrictive modeling assumptions. When these assumptions are violated, estimates of the population-level effect of a treatment may have large bias.

This has sparked an interest in using adaptive estimation techniques
from the machine learning literature to control for high-dimensional
confounders when estimating treatment effects. However, facilitating
proper inference (e.g., confidence intervals and hypothesis tests) about
estimates of effects based on adaptive estimators is challenging. In
particular, standard causal inference techniques, including both
G-computation (GCOMP) estimators (Robins
1986) and inverse probability of treatment weighted (IPTW)
estimators Robins, Hernan, and Brumback
(2000), fail to yield valid inference for common estimators of
effects. More complex proposals have been developed that *are
capable* of utilizing adaptive methods to control for confounding
while still yielding valid inference. These techniques include augmented
inverse probability of treatment estimation (AIPTW) (Robins, Rotnitzky, and Zhao 1994) and targeted
minimum loss-based estimation (TMLE) (van der
Laan and Rubin 2006). Under assumptions, these estimators have
limiting normal distributions with estimable variance. The asymptotic
variance estimates then serve as the basis for constructing confidence
intervals and hypothesis tests.

As an intermediate step, the AIPTW and TMLE estimators require
estimation of two key nuisance parameters: the probability of treatment
as a function of confounders (the propensity score, hereafter referred
to as the *PS*) and the average outcome as a function of
confounders and treatment (the outcome regression, *OR*). In
addition to their desirable inferential properties, AIPTW and TMLE
estimators also enjoy the property of double robustness. That is, the
estimators are consistent for the population-level effect of interest if
either of the PS or OR is consistently estimated. This property gives
the AIPTW and TMLE estimators a natural appeal: inconsistency in the
estimation of one nuisance parameter may be mitigated by the consistent
estimation of the other. As such, these estimators have increased in
popularity in recent years. However, recent work has shown that the
double robustness property does not necessarily extend to inferential
properties about these estimators when adaptive estimators of the PS and
OR are used Benkeser et al. (2017).
Instead, valid inference requires consistent estimation of *both*
the PS *and* the OR. van der Laan (2014) proposed new TMLE
estimators have an asymptotic normal sampling distribution with
estimable variance under consistent estimation of *either* the PS
*or* the OR, paving the way for inference that is doubly-robust.
Benkeser et al. (2017) proposed modifications of these estimators with
similar robustness properties and illustrated their practical
performance.

The two proposed procedures that yield doubly-robust inference are
quite involved, notably involving iterative estimation of additional,
complex nuisance parameters. The `drtmle`

package was
motivated by the need for a user-friendly tool to implement these
methods. The package implements the TMLE estimators of population-level
effects proposed in van der Laan (2014) and Benkeser et al. (2017), as
well as several extensions including cross-validated targeted minimum
loss-based estimators (CVTMLE). Additionally, `drtmle`

implements an IPTW estimator that overcomes shortcomings of existing
IPTW implementations with respect to statistical inference. In
particular, the estimator allows an adaptive estimator of the PS to be
used in construction of the IPTW estimator, while yielding valid
statistical inference about population-level effects. Both the TMLE and
IPTW estimators implemented in the `drtmle`

package are
capable of estimating population-level effects for discrete-valued
treatments and the package provides convenient utilities for
constructing confidence intervals and hypothesis tests about contrasts
of the mean outcome under different levels of treatment. This document
explains some of the key concepts underlying doubly-robust inference and
illustrates usage of the `drtmle`

package.

Suppose the observed data consist of \(n\) independent and identically distributed
copies of the random variable \(O = (W,A,Y)
\sim P_0\). Here, \(W\) is a
vector of baseline covariates, \(A\in\{0,1\}\) is a binary (for the sake of
simplicity) treatment, \(Y\) is an
outcome, and \(P_0\) is the true
data-generating distribution. The parameters of interest are \[\begin{equation}
\psi_0(a) = E_0\{\bar{Q}(a,W)\} = \int \bar{Q}_0(a,w) dQ_{W,0}(w) \ , \
\mbox{for} \ a = 0,1 \ ,
\label{parameterDef}
\end{equation}\]erDef} \end{equation} where \(\bar{Q}_0(a,w)=E_0 \left(Y \mid A=a,
W=w\right)\) is the so-called outcome regression and \(Q_W(w)=Pr_0\left(W\leq w\right)\) is the
distribution function of the covariates. The value \(\psi_0(a)\) represents the marginal (i.e.,
population-level)treatment-specific average outcome, hereafter referred
to as the *marginal mean* under treatment \(A=a\). The marginal effect of the treatment
on the outcome can be assessed by comparing marginal means under
different treatment levels. For example, the average treatment effect is
often defined as \(\psi_0 := \psi_0(1) -
\psi_0(0)\). Under additional assumptions, these parameters have
richer interpretations as mean counterfactual outcomes under the
different treatments and contrasts between these means define causal
effects.

An estimator of \(\psi_0(a)\) may be motivated directly by (1). Suppose we have available an estimate of the OR, \(\bar{Q}_n\), and an estimate of the distribution function of baseline covariates, \(Q_{W,n}\). An estimator of \(\psi_0(a)\) may be obtained by plugging these into (1), \(\psi_n(a) = \int \bar{Q}_0(a,w) dQ_{W,n}(w).\) Often the empirical distribution function of covariates is used so that this estimator can be equivalently written as \[\begin{equation*} \psi_{n,GCOMP}(a) = \frac{1}{n}\sum\limits_{i=1}^n \bar{Q}_n(a, W_i) \ . \end{equation*}\] This estimator is commonly referred to as the G-computation (GCOMP) estimator. Inference for GCOMP estimators based on parametric OR estimators may be facilitated through the nonparametric bootstrap. However, if adaptive approaches are used to estimate the OR, inference may not be available without further modification to the estimator.

An alternative estimator of the treatment-specific population-level mean may be obtained by noting that (1) can be equivalently written as \[\begin{align} E_0\{\bar{Q}(a,W)\} &= E_0\{E_0(Y \mid A = a, W)\} \notag \\ &= E_0\biggl\{\frac{I(A=a)}{Pr_0(A = a \mid W)} E_0(Y \mid A = a, W)\biggr\} \notag \\ &= E_0\biggl\{\frac{I(A=a)}{Pr_0(A = a \mid W)} E_0(Y \mid A, W)\biggr\} \notag \\ &= E_0\biggl\{\frac{I(A=a)}{Pr_0(A = a \mid W)} Y \biggr\} \ . \label{iptwID} \end{align}\] We use \(g_0(a \mid W) := Pr_0(A = a \mid W)\) to denote the propensity score. Suppose we had available \(g_n\), an estimate of the PS. Equation (2) motivates the estimator \[\begin{equation*} \psi_{n,IPTW}(a) := \frac{1}{n}\sum\limits_{i=1}^n \frac{I(A_i = a)}{g_n(a \mid W_i)} Y_i \ , \end{equation*}\] which is referred to as the inverse probability of treatment (IPTW) estimator. Inference for IPTW estimators based on parametric PS estimators may be facilitated through the nonparametric bootstrap. If adaptive approaches are used to estimate the PS, inference may not be available without further modification of the propensity score estimator, as we discuss below.

The GCOMP and IPTW estimators suffer from two important shortcomings. The first is a lack of robustness. That is, validity of GCOMP estimators relies on consistent estimation of the OR, while validity of the IPTW relies on consistent estimation of the PS. In practice, we may not know which of the OR or PS is simpler to consistently estimate and thus may not know which of the two estimators to prefer. Furthermore, if adaptive estimators (e.g., based on machine learning) are used to estimate the OR or PS, then the GCOMP and IPTW estimators lack a basis for statistical inference.

This motivates a new class of estimators that combine both the GCOMP and IPTW estimators. One example is the augmented IPTW (AIPTW) estimator \[\begin{equation*} \psi_{n,AIPTW}(a) = \psi_{n,IPTW}(a) + \frac{1}{n} \sum\limits_{i=1}^n \biggl\{ 1 - \frac{I(A_i = a)}{g_n(a \mid W_i)} \biggr\} \ \bar{Q}_n(a, W_i) \ , \end{equation*}\]

which adds an augmentation term to the IPTW estimator. Equivalently,
the estimator can be viewed as adding an augmentation to the GCOMP
estimator \[\begin{equation}
\label{oneStepEst}
\psi_{n,AIPTW}(a) = \psi_{n,GCOMP}(a) + \frac{1}{n} \sum\limits_{i=1}^n
\frac{I(A_i = a)}{g_n(a \mid W_i)} \{Y_i - \bar{Q}_n(a,W_i)\} \ .
\end{equation}\] The AIPTW estimator overcomes the two
limitations of the GCOMP and IPTW estimators. The estimator exhibits
double-robustness in that it is consistent for \(\psi_0(a)\) if either \(\bar{Q}_n\) is consistent for the true OR
*or* \(g_n\) is consistent for
the true PS. Furthermore, if both the OR and PS are consistently
estimated and regularity conditions hold, one can establish a limiting
normal distribution for the AIPTW estimator. Simple estimators of the
variance of this limiting distribution can be derived that may be used
to construct Wald-style confidence intervals and hypothesis tests.

A more recent proposal to improve robustness and inferential properties of the GCOMP and IPTW estimators utilizes targeted minimum loss-based estimation (TMLE). TMLE is a general methodology that may be used to modify an estimator of a regression function in such a way so as to ensure that the modified estimator satisfies user-selected equations. In the present problem, one can show that if an estimator of the OR \(\bar{Q}_n^*\) satisfies \[\begin{equation} \label{meanEIFZero} \frac{1}{n} \sum\limits_{i=1}^n \frac{I(A_i = a)}{g_n(a \mid W_i)} \{Y_i - \bar{Q}_n^*(a, W_i)\} = 0 \ , \end{equation}\] then the GCOMP estimator based on \(\bar{Q}_n^*\) will have the same asymptotic properties as the AIPTW estimator. That is, the estimator will be doubly-robust and, provided both the OR and PS are consistently estimated and regularity conditions are satisfied, will have a normal limiting distribution. TMLE provides a means of mapping an initial estimator of the OR into an estimator that satisfies (4). In addition to possessing the desirable asymptotic properties, TMLE estimators have been shown to outperform AIPTW estimators in finite samples (Porter et al. 2011). We refer readers to van der Laan and Rose (2011) for more about TMLE (van der Laan and Rose 2011)

van der Laan (2014) demonstrated that the double robustness property of AIPTW and TMLE estimators does not extend to their normal limiting distribution if adaptive estimators of the OR and PS are used to construct the estimators. However, in settings with high-dimensional and continuous-valued covariates, adaptive estimation is likely necessary in order to obtain a consistent estimate of either the OR or PS, and thus may be necessary to obtain minimally biased estimates of the marginal means of interest. van der Laan (2014) derived modified TMLE estimators that are doubly-robust not only with respect to consistency, but also with respect to asymptotic normality. Furthermore, he provided closed-form, doubly-robust variance estimators that may be used to construct doubly-robust confidence intervals and hypothesis tests. Benkeser et al. (2017) proposed simplifications of the van der Laan (2014) estimators and demonstrated the practical performance of estimators via simulation.

The key to the theory underlying these results is finding estimators of the OR and PS that simultaneously satisfy equation (4) in addition to two additional equations. These equations involve additional regression parameters that we refer to as the reduced outcome regression (R-OR) and the reduced propensity score (R-PS), defined respectively as \[\begin{align*} \bar{Q}_{r,0n}(a,w) &:= E_{0}\bigl\{Y - \bar{Q}_n(W) \mid A = a, g_n(W)=g_n(w)\bigr\} \ , \ \mbox{and} \\ g_{r,0n}(a \mid w) &:= Pr_0\left\{A = a \mid \bar{Q}_n(W)=\bar{Q}_n(w), g_n(W)=g_n(w)\right\} \ . \end{align*}\] In words, the R-OR is the regression of the residual from the outcome regression onto the estimated PS amongst observations with \(A = a\), while the R-PS is the propensity for treatment \(A = a\) given the estimated OR and PS. We note that a key feature of these reduced-dimension regressions is that they are low-dimensional regardless of the dimension of \(W\) and they do not depend on the true OR nor PS. Thus, these regressions may be consistently estimated at reasonably fast rates irrespective of the dimension of \(W\) and irrespective of whether the OR or PS are consistently estimated. Based on these reduced-dimension regressions, van der Laan (2014) showed that if estimates of the OR \(\bar{Q}_n^*\), PS \(g_n^*\), R-OR \(\bar{Q}_{r,n}^*\), and R-PS \(g_{r,n}^*\) satisfied \[\begin{align} & \frac{1}{n} \sum\limits_{i=1}^n \frac{I(A_i = a)}{g_n^*(a \mid W_i)} \{Y_i - \bar{Q}_n^*(a,W_i)\} = 0 \ , \label{tmleEIFSolve} \\ &\frac{1}{n} \sum\limits_{i=1}^n \frac{\bar{Q}_{r,n}(a, W_i)}{g_n(a \mid W_i)} \{I(A_i = a) - g_n^*(a \mid W_i)\} = 0 \ , \ \mbox{and} \label{tmleAsolve} \\ &\frac{1}{n} \sum\limits_{i=1}^n \frac{I(A_i = a)}{g_{r,n}^*(a \mid W_i)} \biggl\{\frac{g_{r,n}^*(a \mid W_i) - g_n^*(a \mid W_i)}{g_n^*(a \mid W_i)} \biggr\} \{Y_i - \bar{Q}_n^*(a, W_i) \} = 0 \ , \label{tmleQsolve} \end{align}\] then the GCOMP estimator based on \(\bar{Q}_n^*\) would be doubly-robust not only with respect to consistency, but also with respect to its limiting distribution. The author additionally proposed a TMLE algorithm to map initial estimates of the OR, PS, R-OR, and R-PS into estimates that satisfy these key equations.

Benkeser et al. (2017) demonstrated alternative equations that can be satisfied to achieve this form of double-robustness. In particular, they formulated an alternative version of the R-PS, \[\begin{align*} g_{r,0,1}(a \mid w) := Pr_0\{A = a \mid \bar{Q}_n(W) = \bar{Q}_n(w) \} \ , \end{align*}\]

which we refer to as R-PS1. They also introduced an additional regression of a weighted residual from the PS on the OR, \[\begin{align*} g_{r,0,2}(a \mid w) := E_0\biggl\{\frac{I(A = a) - g_n(a \mid W)}{g_n(a \mid W)} \ \bigg\vert \ \bar{Q}_n(W) = \bar{Q}_n(w) \biggr\} \ , \end{align*}\]

which we refer to as R-PS2. The key feature of R-PS1 and R-PS2 is that they are both univariate regressions, while R-PS is a bivariate regression. Benkeser and colleagues suggested that this may make R-PS1 and R-PS2 easier to estimate consistently than R-PS. Further, they showed that if estimators or the OR, PS, and R-OR satisfy equations (5)-(6), and if additionally, estimators of the OR, R-PS-1 \(g_{r,n,1}^*\), and R-PS-2 \(g_{r,n,2}^*\) satisfy \[ \frac{1}{n} \sum\limits_{i=1}^n \frac{I(A_i = a)}{g_{r,n,1}^*(a \mid W_i)} g_{r,n,2}^*(a \mid W_i) \ \{Y_i - \bar{Q}_n^*(a,W_i)\} = 0 \ , \] then the GCOMP estimator based on this \(\bar{Q}_n^*\) also enjoys a doubly-robust limiting distribution. Benkeser et al. (2017) provided a TMLE algorithm that produced such estimates and provided closed-form, doubly-robust variance estimates.

In the previous section, we introduced AIPTW and TMLE as two methods for constructing doubly-robust (with respect to consistency) estimators of our parameter of interest. Theoretically, these two frameworks are closely related and practically the two estimators are generally seen as competitors for estimating marginal means. In equation (3), we were able to generate a doubly-robust estimator by adding a term to the GCOMP estimator. This term is exactly the left-hand-side of the key equation (4), the key equation that is solved by the standard TMLE. This is no coincidence; this term is crucial in studying the asymptotic properties of the GCOMP, AIPTW, and TMLE estimators (for discussion, see Benkeser et al. (2017) Section 2.2). Recall that a TMLE estimator achieves doubly-robust consistency by satisfying equation (4), while the AIPTW achieves doubly-robust consistency by combining the GCOMP estimator with the left-hand-side of this term. Now, we have argued that a TMLE estimator may achieve doubly-robust limiting behavior if it additionally satisfies equations (6) and (7). It is thus sensible to wonder whether an AIPTW-style estimator could be constructed by combining the left-hand-side of these two additional equations with the AIPTW estimator. Benkeser and colleagues showed that, surprisingly, this AIPTW-style estimator does not achieve the same doubly-robust properties as the TMLE estimator. Thus, while AIPTW and TMLE are generally competitors for producing estimators doubly-robust with respect to consistency, for doubly-robust inference, TMLE is preferred.

In recent work, we
have proposed estimators of the average treatment effect to be used in
settings where this effect is weakly identifiable, as evidenced by small
estimated propensity scores. In these settings doubly robust estimators
may exhibit non-robust behavior in small samples. The proposed solution
was to target an alternative quantity in the propensity estimation step.
If we are interested in estimating \(\psi_0(a)\) for \(a = 0, 1\), we can estimate \(Pr_0(A = a \mid \bar{Q}_0(1, W), \bar{Q}_0(0,
W))\), which describes the conditional probability of receiving
treatment \(A = a\) given the value of
the two outcome regressions. For this reason, we call this quantity an
*outcome-adaptive propensity score*. We have shown that all of
the usual asymptotics for TMLE carry through when an estimator of this
quantity is substituted for the true propensity score. The resultant
TMLE is *super-efficient* meaning that it will generally have
greater asymptotic efficiency than a standard TMLE. Simulations indicate
that this approach also delivers more stable behavior in settings where
estimated propensities become very small. However, the estimator is not
doubly robust – the asymptotics rely on the outcome regression being
consistently estimated at \(n^{-1/4}\)
rate. In this sense, this estimator trades off robustness in the name of
achieving greater stability and efficiency. See our paper for further
discussion.

The main function of the package is the eponymous `drtmle`

function. This function estimates the treatment-specific marginal mean
for user-specified levels of a discrete-valued treatment and computes a
doubly-robust covariance matrix for these estimates. The
`drtmle`

function implements either the estimators of van der
Laan (2014) (if `reduction = "bivariate"`

) or the improved
estimators of Benkeser et al. (2017)
(`reduction = "univariate"`

). The package includes utility
functions for combining these estimates into estimates of average
treatment effects, as well as other contrasts, as discussed below.

The main data arguments for `drtmle`

are
`W, A, Y`

, which, as above, represent the covariates, a
discrete-valued treatment, and an outcome, respectively. Missing data
are allowed in `A`

and `Y`

; this is discussed
further in a section below. Beyond the data arguments, it is necessary
to specify how to estimate each of the OR, PS, R-OR, and R-PS (or R-PS1,
R-PS2). The package provides flexibility in how these regression
parameters may be estimated. Once the user has specified the data
arguments and how to estimate each regression parameter, the function
proceeds as follows: 1. estimate the OR via the internal
`estimateQ`

function; 2. estimate the PS via the internal
`estimateG`

function; 3. estimate reduced-dimension
regression via `estimateQrn`

and `estimategrn`

functions; 4. iteratively modify initial estimates of the OR and PS via
the `fluctuateQ`

and `fluctuateG`

functions until
equations (5)-(7) are approximately solved; 5. compute the TMLE
estimates and covariance estimates based on the final OR. We refer
interested readers to Benkeser et al. (2017) for a theoretical
derivation of how the iterative modification of initial OR and PS is
performed.

In addition to returning doubly-robust parameter and covariance
estimates, the `drtmle`

function returns estimates of the
GCOMP, AIPTW, standard TMLE, and the modified AIPTW estimator discussed
in the remark above. While doubly-robust inference based on these
estimators is not available, the estimators are provided for comparison.
In the following subsections, we look at various options for making
calls to `drtmle`

.

While the estimators computed by `drtmle`

are consistent
and asymptotically normal under inconsistent estimation of either the OR
or PS, it is nevertheless advisable to strive for consistent estimation
of *both* the OR and PS. If both of the regression parameters are
consistently estimated then the estimators computed by
`drtmle`

achieve the nonparametric efficiency bound,
resulting in narrower confidence intervals and more powerful hypothesis
tests. In order that the estimates of the OR and PS have the greatest
chance of consistency, the `drtmle`

function facilitates
estimation using adaptive estimation techniques. In fact, there are
three ways to estimate each of the regression parameters: generalized
linear models, super learning, and a user-specified estimation
technique. We demonstrate each of these approaches in turn on a simple
simulated data set. The true parameter values of interest are \(\psi_0(0)=\) `r round(EY0, 2)`

and \(\psi_0(1)=\) 0.72.

```
set.seed(1234)
<- 200
n <- data.frame(W1 = runif(n), W2 = rbinom(n, 1, 0.5))
W <- rbinom(n, 1, plogis(W$W1 + W$W2))
A <- rbinom(n, 1, plogis(W$W1 + W$W2*A)) Y
```

`glm`

to estimate regressionsGeneralized linear models are perhaps the most popular means of
estimating regression functions. Because of this legacy, these
estimators have been included as an option for estimating the regression
parameters. The user specifies the right-hand-side of a regression
formula as a character in the `glm_Q`

, `glm_g`

,
`glm_Qr`

, and `glm_gr`

options to estimate the OR,
PS, R-OR, and R-PS. These regressions are fit by calling the
`glm`

function in base R. Thus, the model specification
should be able to be parsed as a valid formula, as required by
`glm`

. Recall that the R-OR parameter involves a regression
onto the estimated propensity. Thus, the formula that is input via
`glm_Qr`

should assume data with a single column named
`"gn"`

. Similarly, the R-PS1 and R-PS2 that are computed for
the estimators of Benkeser et al. (2017) (the default estimators,
computed whenever `reduction = "univariate"`

) involve
regressions onto the estimated outcome regression. Thus, the formula
`glm_gr`

should assume data with a single column named
`"Qn"`

if `reduction = "univariate"`

. If instead,
the estimators of van der Laan (2014) are preferred, we can set the
option `reduction = "bivariate"`

and input a
`glm_gr`

formula that assumes data with two columns named
`"Qn"`

and `"gn"`

.

The code below illustrates how generalized linear models can be used
to estimate the regression parameters. In this call to
`drtmle`

, we specify `family = binomial()`

so that
logistic regression is used to estimate the OR. We specify
`stratify = FALSE`

to indicate that we wish to fit a single
OR to estimate \(\bar{Q}_0(A,W)\), as
opposed to fitting two separate regressions to estimate \(\bar{Q}_0(1,W)\) and \(\bar{Q}_0(0,W)\). If the former fitting is
specified, then the OR regression is fit using a data frame that
contains `W`

and `A`

. Thus, both
`colnames(W)`

and `"A"`

may appear in
`glm_Q`

. However, if `stratify = TRUE`

then the OR
model is fit separately in observations with \(A = 1\) and \(A =
0\). In this case, the `glm_Q`

option may not include
`"A"`

. We illustrate non-stratified regression for both the
univariate and bivariate reductions.

```
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
glm_fit_uni glm_g = "W1 + W2", glm_Q = "W1 + W2*A",
glm_gr = "Qn", glm_Qr = "gn", stratify = FALSE)
glm_fit_uni
```

```
## $est
##
## 1 0.7111213
## 0 0.5383661
##
## $cov
## 1 0
## 1 1.741923e-03 -1.003719e-05
## 0 -1.003719e-05 4.081532e-03
```

```
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
glm_fit_biv glm_g = "W1 + W2", glm_Q = "W1 + W2*A",
glm_gr = "Qn", glm_Qr = "gn", stratify = FALSE,
reduction = "bivariate")
glm_fit_biv
```

```
## $est
##
## 1 0.7111231
## 0 0.5385794
##
## $cov
## 1 0
## 1 1.709464e-03 -1.523075e-05
## 0 -1.523075e-05 3.593222e-03
```

`SuperLearner`

to estimate regressionsSuper learning is a loss-based ensemble machine learning technique
(van der Laan, Polley, and Hubbard 2007)
that is a generalization of stacking and stacked regression (Wolpert 1992)breiman:1996:ml}. The user
specifies many potential candidate regression estimators, referred to as
a *library* of candidate estimators, and the super learner
algorithm estimates the convex combination of regression estimators that
minimizes \(V\)-fold cross-validated
risk based on a user-selected loss function. Oracle inequalities show
that the super learner estimate performs essentially as well as the
unknown best convex combination of the candidate estimators (Vaart, Dudoit, and van der Laan 2006). Thus,
the methodology may be seen as an asymptotically optimal way of
performing estimator selection. Practically, the super learner provides
the best chance for obtaining a consistent estimator for both the OR and
PS, and thereby an efficient estimator of the marginal means of
interest.

Super learning is implemented in the R package
`SuperLearner`

(Polley et al.
2017) and `drtmle`

provides the option to utilize
super learning for estimation of the OR, PS, and reduced-dimension
regressions via the `SL_Q`

, `SL_g`

,
`SL_Qr`

, and `SL_gr`

options. When calling
`drtmle`

either the generalized linear model options for
regression modeling or the super learner options should be invoked. If
both are called, the function will default to the super learner option.
The inputs to the super learner options are passed to the
`SL.library`

argument of the `SuperLearner`

function of the `SuperLearner`

package. We refer readers to
the `SuperLearner`

help file for information on how to
properly specify a super learner library. In the code below, we
illustrate a call to `drtmle`

with a relatively simple
library. For the OR and PS, the library includes a main-terms
generalized linear model (via the function `SL.glm`

from
`SuperLearner`

) and an unadjusted generalized linear model
(`SL.mean`

). For the reduced dimension regressions we use a
super learner library with a main terms generalized linear model
(`SL.glm`

) and regression splines (`SL.gam`

, which
utilizes the `gam`

package (Hastie
2017)).

```
set.seed(1234)
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
sl_fit SL_g = c("SL.glm","SL.mean"),
SL_Q = c("SL.glm","SL.mean"),
SL_gr = c("SL.glm", "SL.gam"),
SL_Qr = c("SL.glm", "SL.gam"),
stratify = FALSE)
sl_fit
```

```
## $est
##
## 1 0.7109740
## 0 0.5381819
##
## $cov
## 1 0
## 1 1.737709e-03 -5.954220e-06
## 0 -5.954220e-06 4.142052e-03
```

The `drtmle`

package also allows users to pass in their
own functions for estimating regression parameters via the
`SL_`

options. These functions must be formatted properly for
compatibility with `drtmle`

. The necessary formatting is
identical the formatting required for writing a
`SuperLearner`

wrapper (see
`SuperLearner::write.SL.template()`

). The `drtmle`

package includes `SL.npreg`

, which is an example of how a
user can specify a function for estimating regression parameters. This
functions fits a kernel regression estimator using the `np`

package (Hayfield and Racine 2008). Below
we show the source code for this function.

` SL.npreg`

```
## function (Y, X, newX, family = gaussian(), obsWeights = rep(1,
## length(Y)), rangeThresh = 1e-07, ...)
## {
## options(np.messages = FALSE)
## if (abs(diff(range(Y))) <= rangeThresh) {
## thisMod <- glm(Y ~ 1, data = X)
## }
## else {
## bw <- np::npregbw(stats::as.formula(paste("Y ~", paste(names(X),
## collapse = "+"))), data = X, ftol = 0.01, tol = 0.01,
## remin = FALSE)
## thisMod <- np::npreg(bw)
## }
## pred <- stats::predict(thisMod, newdata = newX)
## fit <- list(object = thisMod)
## class(fit) <- "SL.npreg"
## out <- list(pred = pred, fit = fit)
## return(out)
## }
## <bytecode: 0x7fb0aeb22978>
## <environment: namespace:drtmle>
```

In the above code, we see that a user-specified function should have
options `Y, X, newX, family, obsWeights, ...`

, along with
other options specific to the function. The option `Y`

corresponds to the outcome of the regression and `X`

to the
variables onto which the outcome is regressed. The option
`newX`

should be of the same class and dimension as
`X`

and is meant to contain new values of variables at which
to evaluate the regression fit. The `family`

and
`obsWeights`

options are not strictly necessary (indeed, they
are not used by `SL.npreg`

), they are useful to include if
one wishes to utilize the function as part of a super learner library.
The `SL.npreg`

function proceeds by checking whether there is
sufficient variation in `Y`

to even both fitting the kernel
regression (as specified by option `rangeThresh`

), and if so,
fits a kernel regression with cross-validated bandwidth selection via
the `npregbw`

function. The output is then formatted in a
specific way so that `drtmle`

is able to retrieve the
appropriate objects from the fitted regression. In particular the output
should be a list with named objects `pred`

and
`fit`

; the former including predictions made from the fitted
regression on `newX`

, the latter including any information
that may be needed to predict new values based on the fitted object. A
class is assigned to the `fit`

object, so that an S3
`predict`

method may later be used to predict new values
based on the fitted regression object. The user must also specify this
`predict`

method. Use `drtmle:::predict.SL.npreg`

to view the simple `predict`

method for
`SL.npreg`

.

In the code below, we demonstrate a call to `drtmle`

with
`SL.npreg`

used to estimate each nuisance parameter. Here, we
also illustrate the use of `stratify = TRUE`

, which fits a
separate kernel regression of \(Y\)
onto \(W\) in observations with \(A = 1\) and \(A =
0\).

```
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
npreg_fit SL_g = "SL.npreg", SL_Q = "SL.npreg",
SL_gr = "SL.npreg", SL_Qr = "SL.npreg",
stratify = TRUE)
npreg_fit
```

```
## $est
##
## 1 0.7159062
## 0 0.5343013
##
## $cov
## 1 0
## 1 1.674562e-03 -1.023615e-05
## 0 -1.023615e-05 3.963064e-03
```

Some users may find it unsettling that when applying the super learner in real data analysis is that the results are dependent on the random seed that is set. To remove this dependence, one can instead average results over repeated super learner runs; more repeats will lead to less dependence on the random seed. There are generally two approaches that can be used: (i) averaging on the scale of the nuisance parameters and (ii) averaging on the scale of the point estimates.

For (i), suppose that we have \(n_{SL}\) estimated super learners for the OR, \(\bar{Q}_{n,j}, j = 1, \dots, n_{SL}\), and the PS \(g_{n,j}, j = 1, \dots, n_{SL}\). We can take the average over these fits, \[ \bar{Q}_n = \frac{1}{n_{SL}} \sum_{j = 1}^{n_{SL}} \bar{Q}_{n,j} \] as our estimate of the OR (similarly for the PS). These averaged estimates are then used to compute the AIPTW or TMLE, as described above.

For(ii), suppose for example that we have \(n_{SL}\) TMLE (similarly, AIPTW) estimates \(\psi_{n,j}(1)\) of \(\psi_0(1)\), each obtained based on one (or more) super learner of the OR and/or PS. We can take the average over these point estimates, \[ \psi_n(1) = \frac{1}{n_{SL}} \sum_{j=1}^{n_{SL}} \psi_{n,j}(1) \] as our estimate of \(\psi_{0}(1)\).

These feature is implemented via the `n_SL`

and
`avg_over`

options, where the former specifies the number of
super learners to repeat and the latter specifies the scale(s) to
average over. For example, if `avg_over = "drtmle"`

and
`n_SL = 2`

, then two point estimates are computed, each based
on a single super learner, and averaged to obtain a final estimate. If
instead `avg_over = "SL"`

and `n_SL = 2`

, then two
super learners are fit and averaged before a final point estimate is
obtained. Finally, if `avg_over = c("drtmle", "SL")`

and
`n_SL = 2`

then averaging on *both* scales is
performed. That is, the final estimate is the average of two point
estimates, each obtained using OR and PS estimates that are averaged
over two super learners.

Not surprisingly, averaging over multiple super learners can add
considerable computational expense to calls to `drtmle`

. The
package is currently parallelized at the level of individual calls to
`SuperLearner`

. With parallelization over enough cores,
repeated super learning should not add too much to the computational
burden relative to a single call to `SuperLearner`

.

If the outcome-adaptive PS is desired, the user may specify this via
the `adapt_g`

option. In this case, the additional targeting
steps needed for doubly robust inference are “turned off” (i.e.,
`guard = NULL`

), since, as discussed above, this estimator is
*not* doubly robust. Note that when the
`adapt-g = TRUE`

the data used for the PS estimation will be
a `data.frame`

with column names `QaW`

where
`a`

takes all the values specified in `a_0`

. For
example if `a_0 = c(0,1)`

, then the PS estimation data frame
will have two columns called `Q0W`

and `Q1W`

. This
may be relevant for users writing their own `SuperLearner`

wrappers or those who make use of the `glm_g`

setting, as
illustrated below.

```
# outcome adaptive propensity score fit using glms
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
adaptg_fit glm_Q = ".^2", glm_g = "Q1W + Q0W",
adapt_g = TRUE)
# outcome adaptive propensity score fit using super learner
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
adaptg_sl_fit SL_Q = c("SL.glm", "SL.mean"),
SL_g = c("SL.glm", "SL.mean"),
adapt_g = TRUE)
```

The `drtmle`

package also allows users to pass in their
own estimated OR and PS via the `Qn`

and `gn`

options, respectively. If `Qn`

is specified by the user,
`drtmle`

will ignore the nuisance parameter estimation
routines specified by `SL_Q`

and `glm_Q`

–
similarly for `gn`

and `SL_g`

, `glm_g`

.
As noted in the package documentation, there is a specific format
necessary for inputting `Qn`

and `gn`

. For
`Qn`

, the entries in the list should correspond to the OR
evaluated at A and the observed values of `W`

, with order
determined by the input to `a_0`

. For example, if
`a_0 = c(0, 1)`

then `Qn[[1]]`

should be OR at
\(A = 0\) and `Qn[[2]]`

should be outcome regression at \(A =
1\). The example below illustrates this concept.

```
# fit a GLM for outcome regression outside of drtmle
<- glm(Y ~ . , data = data.frame(A = A, W), family = binomial())
out_glm_fit
# generate Qn list
<- list(
Qn10 # first entry is predicted values setting A = 1
predict(out_glm_fit, newdata = data.frame(A = 1, W), type = "response"),
# second entry is predicted values setting A = 0
predict(out_glm_fit, newdata = data.frame(A = 0, W), type = "response")
)
# pass this list to drtmle to avoid internal estimation of
# outcome regression (note propensity score and reduced-dimension
# regressions are still estimated internally)
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
out_fit1 glm_g = ".", glm_Qr = "gn", glm_gr = "Qn", Qn = Qn10,
# crucial to set a_0 to match Qn's construction!
a_0 = c(1,0))
out_fit1
```

```
## $est
##
## 1 0.7116610
## 0 0.5376892
##
## $cov
## 1 0
## 1 1.726990e-03 1.428573e-06
## 0 1.428573e-06 4.183006e-03
```

```
# to be completely thorough let's re-order Qn10 and re-run
<- list(Qn10[[2]], Qn10[[1]])
Qn01
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
out_fit2 glm_g = ".", glm_Qr = "gn", glm_gr = "Qn",
# use re-ordered Qn list
Qn = Qn01,
# a_0 has to be reordered to match Qn01!
a_0 = c(0,1))
# should be the same as out_fit1, but re-ordered
out_fit2
```

```
## $est
##
## 0 0.5376892
## 1 0.7116610
##
## $cov
## 0 1
## 0 4.183006e-03 1.428573e-06
## 1 1.428573e-06 1.726990e-03
```

Similarly, for `gn`

we need to pass in properly ordered
lists. For example, if `a_0 = c(0, 1)`

then
`gn[[1]]`

should be propensity for receiving \(A = 0\) and `gn[[2]]`

should be
the propensity for receiving \(A = 1\).
The example below illustrates this concept.

```
# fit a GLM for propensity score outside of drtmle
<- glm(A ~ . , data = data.frame(W), family = binomial())
out_glm_fit
# get P(A = 1 | W) by calling predict
<- predict(out_glm_fit, newdata = data.frame(W), type = "response")
gn1W
# generate gn list
<- list(
gn10 # first entry is P(A = 1 | W)
gn1W,# second entry is P(A = 0 | W) = 1 - P(A = 1 | W)
1 - gn1W
)
# pass this list to drtmle to avoid internal estimation of
# propensity score (note reduced-dimension regressions are
# still estimated internally)
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
out_fit3 glm_Qr = "gn", glm_gr = "Qn",
Qn = Qn10, gn = gn10,
# crucial to set a_0 to match Qn and gn's construction!
a_0 = c(1,0))
out_fit3
```

```
## $est
##
## 1 0.7116610
## 0 0.5376892
##
## $cov
## 1 0
## 1 1.726990e-03 1.428573e-06
## 0 1.428573e-06 4.183006e-03
```

```
# to be completely thorough let's re-order gn10 and re-run
<- list(gn10[[2]], gn10[[1]])
gn01
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
out_fit4 glm_Qr = "gn", glm_gr = "Qn",
# use re-ordered Qn/gn list
Qn = Qn01, gn = gn01,
# a_0 has to be reordered to match Qn01!
a_0 = c(0,1))
# should be the same as out_fit3, but re-ordered
out_fit4
```

```
## $est
##
## 0 0.5376892
## 1 0.7116610
##
## $cov
## 0 1
## 0 4.183006e-03 1.428573e-06
## 1 1.428573e-06 1.726990e-03
```

This flexibility allows users to use any regression technique to
externally obtain initial estimates of nuisance parameters. The key step
is making sure that the ordering of `Qn`

and `gn`

is consistent with `a_0`

.

This section contains some more advanced mathematical material and may be omitted by users interested only in implementation. For simplicity, we focus this discussion on the standard TMLE/AIPTW estimators. The ideas generalize immediately to the DRTMLE estimator, but the formulas are more complex.

Reported standard error estimates are based on an estimate the asymptotic variance of the estimators. This variance can be characterized by the estimators influence function. For example, let \(\psi_{n}(1)\) be an AIPTW estimate of \(\psi_0(1)\). Under regularity conditions, \[ \psi_{n}(1) - \psi_0(1) = \frac{1}{n} \sum_{i=1}^n D^*(\bar{Q}_0, g_0, Q_{W,0})(O_i) + o_{p}(n^{-1/2}) \ , \] where \[ D^*(\bar{Q}_0, g_0, Q_{W,0})(O_i) = \frac{I(A_i = 1)}{g_0(1 \mid W_i)} \{ Y_i - \bar{Q}_0(1, W_i)\} + \bar{Q}_0(1, W_i) - \int \bar{Q}_0(1, w) dQ_{W,0}(w) \ . \] Thus, by the central limit theorem, \(n^{1/2} \psi_{n}(1)\) converges in distribution to a Normal distribution with mean \(\psi_0(1)\) and variance \[ \sigma^2_0 = E_0( [ D^*(\bar{Q}_0, g_0, Q_{W,0})(O) - E_0\{D^*(\bar{Q}_0, g_0, Q_{W,0})(O)\} ]^2 ) \ . \] An estimate of \(\sigma^2_0\) is naturally obtained by taking the empirical variance of \(D_i = D^*(\bar{Q}_n, g_n, Q_{W,n})(O_i)\), \[ \sigma^2_n = \frac{1}{n} \sum_{i=1}^n \left\{ D_i - \frac{1}{n} \sum_{j=1}^n D_j \right\}^2 \ . \] Thus, a natural estimate of the standard error of \(\psi_{n}(1)\) is \(\sigma_n / n^{1/2}\). This idea generalizes naturally to estimating the asymptotic covariance matrix of a vector \(n^{1/2} (\psi_{n}(a) : a)\).

For TMLE-based estimators, it is standard practice to construct \(D_i\) above using the *targeted*
version of the OR (denoted by \(\bar{Q}_n^*\) above). Recall that such
estimates are obtained by modifying an initial OR estimate in such a way
as to ensure a particular (set of) equation(s) is (are) solved. However,
simulations have shown that better standard error estimates for TMLE may
be achieved by instead using the *initial* OR estimate in the
computation of standard errors, as described in the previous subsection.
The `drtmle`

function includes the `targeted_se`

option to this end.

```
<-
glm_fit_uni_nontargeted_se drtmle(W = W, A = A, Y = Y, family = binomial(),
glm_g = "W1 + W2", glm_Q = "W1 + W2*A",
glm_gr = "Qn", glm_Qr = "gn", stratify = FALSE,
targeted_se = FALSE)
# compare to original call
list(
$drtmle$cov, # based on targeted OR
glm_fit_uni$drtmle$cov # untargeted OR
glm_fit_uni_nontargeted_se )
```

```
## [[1]]
## [,1] [,2]
## [1,] 1.741923e-03 -1.003719e-05
## [2,] -1.003719e-05 4.081532e-03
##
## [[2]]
## [,1] [,2]
## [1,] 1.745301e-03 -9.407578e-06
## [2,] -9.407578e-06 4.053834e-03
```

In simulations, we often find that when employing machine learning to
estimate the OR and/or PS, the estimated standard errors are *too
small* in small to moderate-sized samples, leading to under-coverage
of confidence intervals and possibly too-large type I errors of
hypothesis tests that are based on these standard error estimates. This
should not be too surprising when studying the form of the standard
error estimates. In particular, they will include a term that looks like
a (weighted) mean squared-error (MSE) of the OR. Because machine
learning algorithms often set out to minimize empirical MSE as part of
their training, we often find that the empirical MSE is biased too small
relative to the true MSE. Thus, in order to more accurately estimate MSE
of a machine learning algorithm, we typically rely on cross-validation.
The same trick can be used here to obtain a more conservative standard
error estimate.

Suppose we use \(V\)-fold cross-validation to obtained \(V\) training-sample-specific estimates of the OR, \(\bar{Q}_{n,v}, v = 1, \dots, V\) and the PS, \(g_{n,v}, v = 1, \dots, V\). For \(i = 1, \dots, n\), let \(\bar{Q}_{n,i}^{cv}(1)\) denote the estimate of \(\bar{Q}_{0}(a, W_i)\) obtained when observation \(i\) was in the validation fold. That is, \(\bar{Q}_{n,i}^{cv}\) is an out-of-sample prediction of \(Y_i\). We define \(g_{n,i}^{cv}\) similarly. Using these estimates, we can define \[ D_i^{cv} = \frac{I(A_i = 1)}{g_{n,i}^{cv}} \{ Y_i - \bar{Q}_{n,i}^{cv}(1)\} + \bar{Q}_{n,i}^{cv}(1) - \frac{1}{n} \sum_{i=1}^n \bar{Q}_{n,i}^{cv}(1) \] and compute an estimate of \(\sigma^2_0\) using \[ \sigma^2_{n,cv} = \frac{1}{n} \sum_{i=1}^n \left\{ D_i^{cv} - \frac{1}{n} \sum_{j=1}^n D_j^{cv} \right\}^2 \ . \]

Cross-validated standard errors can be obtained by setting
`se_cv = "full"`

and `se_cvFolds`

to a number
greater than 1.

```
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
glm_fit_uni_cv glm_g = "W1 + W2", glm_Q = "W1 + W2*A",
glm_gr = "Qn", glm_Qr = "gn",
se_cv = "full", se_cvFolds = 2, stratify = FALSE,
targeted_se = FALSE) # needed for this to work
# compare variance to previously obtained
list(glm_fit_uni$drtmle$cov, glm_fit_uni_cv$drtmle$cov)
```

```
## [[1]]
## [,1] [,2]
## [1,] 1.741923e-03 -1.003719e-05
## [2,] -1.003719e-05 4.081532e-03
##
## [[2]]
## [,1] [,2]
## [1,] 0.0020102441 0.0002597992
## [2,] 0.0002597992 0.0052698821
```

Obtaining fully cross-validated standard errors can be fairly time
intensive when super learner is used to estimate the OR and PS. In
`drtmle`

we give an option to compute standard error
estimates that are *partially cross-validated* and do not require
any additional fitting. Here we describe briefly how these estimates are
obtained and show calls to `drtmle`

to obtain these
estimates.

Suppose we have \(K\) candidate
regressions in our super learner library for both the OR and PS (in
practice, the number could be different for the OR as for the PS).
Recall that a super learner estimate of for example \(\bar{Q}_0(a, w)\) is obtained as a weighted
combination of estimates from each candidate regression. Let \(\bar{Q}_{n,k}\) denote the estimate of
\(\bar{Q}_{0}\) obtained by training
algorithm \(k\) using the full data
set. The super learner estimate of \(\bar{Q}_{0}(a,w)\) is \[
\bar{Q}_{n,SL}(a, w) = \sum_{k = 1}^K \alpha_{n,k} \bar{Q}_{n,k}(a, w)
\ ,
\] where \(\alpha_n = (\alpha_{n,1},
\dots, \alpha_{n,K})\) is a weight vector that is selected by
minimizing a cross-validated risk criteria. That is, each of the \(K\) algorithms is trained according to a
\(V\)-fold cross-validation scheme,
resulting in \(V + 1\) fits of each
algorithm (one in each of \(V\)
training folds plus one obtained by fitting the algorithm using the full
data). Let \(\bar{Q}_{n,k,v}\) denote
the \(k\)-th algorithm trained using
the \(v\)-th training fold. For \(i = 1,\dots,n\), let \(\bar{Q}_{n,k,i}^{cv}(1)\) denote the
estimate of \(\bar{Q}_0(1, W_i)\)
implied by the fitted algorithm \(\bar{Q}_{n,k,v}\), which was obtained when
observation \(i\) was in the validation
sample. That is, for observations with \(A_i =
1\), \(\bar{Q}_{n,k,i}^{cv}(1)\)
is an out of sample prediction of \(Y_i\) based on the \(k\)-th algorithm. Now, we can construct a
*partially cross-validated* super learner prediction as follows.
For \(i = 1, \dots, n\) let \[
\bar{Q}_{n,SL,i}^{cv}(1) = \sum_{k = 1}^K \alpha_{n,k}
\bar{Q}_{n,k,i}^{cv}(1)
\] be the estimate obtained by ensembling the
training-set-specific algorithm fits based on \(\alpha_n\). We can use these estimates to
produce our *partially cross-validated* standard error estimate
as follows. We define \[
D_i^{pcv} = \frac{I(A_i = 1)}{g_{n,SL,i}^{cv}} \{ Y_i -
\bar{Q}_{n,SL,i}^{cv}(1)\} + \bar{Q}_{n,SL,i}^{cv}(1) - \frac{1}{n}
\sum_{i=1}^n \bar{Q}_{n,SL,i}^{cv}(1)
\] and compute an estimate of \(\sigma^2_0\) using \[
\sigma^2_{n,pcv} = \frac{1}{n} \sum_{i=1}^n \left\{ D_i^{pcv} -
\frac{1}{n} \sum_{j=1}^n D_j^{pcv} \right\}^2 \ .
\]

Note that is not a proper cross-validated estimate, since observation \(i\) is used to compute the weight vector \(\alpha_n\). However, such estimates can be obtained based on the output of a single round of super learning, thereby providing improved standard error estimates at minimal additional computational cost.

Here, we demonstrate how to obtain such estimates using
`drtmle`

.

```
set.seed(1234)
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
sl_fit_pcv SL_g = c("SL.glm","SL.mean"),
SL_Q = c("SL.glm","SL.mean"),
SL_gr = c("SL.glm", "SL.gam"),
SL_Qr = c("SL.glm", "SL.gam"),
stratify = FALSE,
se_cv = "partial")
# compare estimates
# pt estimates should be same
# variance should be larger for pcv
list(
# no cv
sl_fit, # partial cv
sl_fit_pcv )
```

```
## [[1]]
## $est
##
## 1 0.7109740
## 0 0.5381819
##
## $cov
## 1 0
## 1 1.737709e-03 -5.954220e-06
## 0 -5.954220e-06 4.142052e-03
##
##
## [[2]]
## $est
##
## 1 0.7109740
## 0 0.5381819
##
## $cov
## 1 0
## 1 1.818141e-03 -1.846086e-05
## 0 -1.846086e-05 4.332436e-03
```

The `drtmle`

package contains methods that compute
doubly-robust confidence intervals. The `ci`

method computes
confidence intervals about the estimated marginal means in a fitted
`drtmle`

object, in addition to contrasts between those
means. By default the method gives confidence intervals about each
mean.

`ci(npreg_fit)`

```
## $drtmle
## est cil ciu
## 1 0.716 0.636 0.796
## 0 0.534 0.411 0.658
```

Alternatively the `contrast`

option allows users to
specify a linear combination of marginal means, which can be used to
estimate the average treatment effect as we show below.

`ci(npreg_fit, contrast = c(1,-1))`

```
## $drtmle
## est cil ciu
## E[Y(1)]-E[Y(0)] 0.182 0.034 0.329
```

The `contrast`

option can also be input as a list of
functions in order to compute a confidence interval for a user-specified
contrast. For example, we might be interested in the risk ratio \(\psi_0(1) / \psi_0(0)\). When constructing
Wald style confidence intervals for ratios, it is common to compute the
interval on the log scale and back-transform. That is, the confidence
interval is of the form \[
\mbox{exp}\biggl[\mbox{log}\biggl\{ \frac{\psi_0(1)}{\psi_0(0)} \biggr\}
\pm
z_{1-\alpha/2} \sigma^{\text{log}}_n \biggr] \ ,
\] where \(\sigma^{\text{log}}_n\) is the estimated
standard error on the log-scale. By the delta method, an estimate of
this standard error is given by \[
\sigma^{\text{log}}_n = \nabla g(\psi_n)^T \Sigma_n \nabla g(\psi_n),
\] where \(\Sigma_n\) is the
doubly-robust covariance matrix estimate output from `drtmle`

and \(\nabla g(\psi)\) is the gradient
of \(\mbox{log}\{\psi(1)/\psi(0)\}\).
To obtain this confidence interval, we may use the following code.

```
<- list(f = function(eff){ log(eff) },
riskRatio f_inv = function(eff){ exp(eff) },
h = function(est){ est[1]/est[2] },
fh_grad = function(est){ c(1/est[1],-1/est[2]) })
ci(npreg_fit, contrast = riskRatio)
```

```
## $drtmle
## est cil ciu
## user contrast 1.34 1.036 1.733
```

More generally this method of inputting the `contrast`

option to the `ci`

method can be used to compute confidence
intervals of the form \[
f^{-1}\bigl\{ f(h(\psi_n)) \pm z_{1-\alpha/2} \nabla f(h(\psi_n))^T
\Sigma_n
\nabla f(h(\psi_n)) \bigr\} \ ,
\] where \(f\)
(`contrast$f`

) is the transformation of the confidence
interval, \(f^{-1}\)
(`contrast$f_inv`

) is the back-transformation of the
interval, \(h\)
(`contrast$h`

) is the contrast of marginal means, and \(\nabla f(h(\psi_n))\)
(`contrast$fh_grad`

) defines the gradient of the transformed
contrast at the estimated marginal means.

We note that this method of computing confidence intervals can also be used to obtain a transformed confidence interval for a particular marginal mean. For example, in our running example \(Y\) is binary. Thus, we might wish to enforce that our confidence interval be computed on a scale which ensures that the confidence limits are bounded between zero and one. To that end, we can consider an interval of the form construct an interval on the logit scale and back-transform, \[ \mbox{expit}\biggl[ \mbox{log}\biggl\{ \frac{\psi_n(1)}{1 - \psi_n(1)} \biggr\} \pm z_{1-\alpha/2} \sigma^{\text{logit}}_n \biggr] \ . \] This confidence interval may be implemented as follows.

```
<- list(f = function(eff){ qlogis(eff) },
logitMean f_inv = function(eff){ plogis(eff) },
h = function(est){ est[1] },
fh_grad = function(est){ c(1/(est[1] - est[1]^2), 0) })
ci(npreg_fit, contrast = logitMean)
```

```
## $drtmle
## est cil ciu
## user contrast 0.716 0.629 0.789
```

Doubly-robust Wald-style hypothesis tests may be implemented using
the `wald_test`

method. As with confidence intervals, there
are three options for testing. First, we may test the null hypothesis
that the marginal means equal a fixed value (or values). Below, we test
the null hypothesis that \(\psi_0(1) =
0.5\) and that \(\psi_0(0) =
0.6\) by inputting a vector to the `null`

option.

`wald_test(npreg_fit, null = c(0.5, 0.6))`

```
## $drtmle
## zstat pval
## H0: E[Y(1)]=0.5 5.276 0.000
## H0: E[Y(0)]=0.6 -1.044 0.297
```

As with the `ci`

method, the `wald_test`

method
allows users to test linear combinations of marginal means by inputting
a vector of ones and negative ones in the `contrast`

option.
We can use this to test the null hypothesis of no average treatment
effect or to test the null hypothesis that the average treatment effect
equals 0.05 (by specifying the `null`

option)

`wald_test(npreg_fit, contrast = c(1, -1))`

```
## $drtmle
## zstat pval
## H0:E[Y(1)]-E[Y(0)]=0 2.414 0.016
```

`wald_test(npreg_fit, contrast = c(1, -1), null = 0.05)`

```
## $drtmle
## zstat pval
## H0:E[Y(1)]-E[Y(0)]=0.05 1.75 0.08
```

The `wald_test`

method also allows for user-specified
contrasts to be tested using syntax similar to the `ci`

method. The only difference syntactically is that it is not strictly
necessary to specify `f_inv`

, as the hypothesis test is
performed on the transformed scale. That is, we can generally test the
null hypothesis that \(f(h(\psi_0))\)
equals \(f_0\) (the function \(f\) applied to the value passed to
`null`

) using the Wald statistic \[
Z_n := \frac{\{f(h(\psi_n)) - f_0\}}{\nabla f(h(\psi_n))^T \Sigma_n
\nabla
f(h(\psi_n))} \ .
\] We can use the `riskRatio`

contrast defined
previously to test the null hypothesis that \(\psi_0(1)/\psi_0(0) = 1\).

`wald_test(npreg_fit, contrast = riskRatio, null = 1)`

```
## $drtmle
## zstat pval
## H0: user contrast = 1 2.231 0.026
```

The `drtmle`

function supports missing data in
`A`

and `Y`

. Such missing data are common in
practice and, if ignored, can bias estimates of the marginal means of
interest. The estimators previously discussed can be modified to allow
this missingness to depend on covariates. Let \(\Delta_A\) and \(\Delta_Y\) be indicators that \(A\) and \(Y\) are observed, respectively. We can
redefine the OR to be \(\bar{Q}_n(a,w) :=
E(\Delta_Y Y \mid \Delta_A A = a, \Delta_A = 1, \Delta_Y = 1, W =
w)\) and similarly redefine the PS to be \[\begin{align}
g_0(a \mid w) &= Pr_0(\Delta_A = 1, \Delta_A A = a, \Delta_Y = 1
\mid W = w)
\notag \\
&= Pr_0(\Delta_A = 1 \mid W) Pr_0(\Delta_A A = a \mid \Delta_A = 1,
W = w)
\label{modPS} \\
&\hspace{1.5in} Pr_0(\Delta_Y = 1 \mid \Delta_A = 1, \Delta_A A = a,
W = w) \ \notag.
\end{align}\] We introduce missing values to `A`

and
`Y`

in our running example.

```
<- rbinom(n, 1, plogis(2 + W$W1))
DeltaA <- rbinom(n, 1, plogis(2 + W$W2 + A))
DeltaY == 0] <- NA
A[DeltaA == 0] <- NA Y[DeltaY
```

The only modification of the call to `drtmle`

is in how
the PS estimation is performed. The options `glm_g`

and
`SL_g`

are still used to fit generalized linear models or a
super learner, respectively. However, the code allows for separate
regression fits for each of the three components of the PS in (8). To
perform separate generalized linear model fits for each of the three
components, `glm_g`

should be a named list with entries
`"DeltaA"`

, `"A"`

, and `"DeltaY"`

.
Respectively these should provide a regression formula for the
regression of \(\Delta_A\) on \(W\), \(A\)
on \(W\) in observations with \(\Delta_A = 1\), and a regression of \(\Delta_Y\) on \(W\) and \(A\) in observations with \(\Delta_A = \Delta_Y = 1\) (if
`stratify = FALSE`

) or a regression of \(\Delta_Y\) on \(W\) to be fit in observations with \(\Delta_A = \Delta_Y = 1\) and \(A = a\) for each \(a\) specified in option `a_0`

(if `stratify = TRUE`

). If only a single formula is passed to
`glm_g`

, it is used for each of the three regressions. We
illustrate a call to `drtmle`

with missing data and
generalized linear model fits below.

```
<- drtmle(W = W, A = A, Y = Y, stratify = FALSE,
glm_fit_wNAs glm_g = list(DeltaA = "W1 + W2", A = "W1 + W2",
DeltaY = "W1 + W2 + A"),
glm_Q = "W1 + W2*A", glm_Qr = "gn",
glm_gr = "Qn", family = binomial())
glm_fit_wNAs
```

```
## $est
##
## 1 0.7353583
## 0 0.5213146
##
## $cov
## 1 0
## 1 1.836839e-03 -2.380141e-05
## 0 -2.380141e-05 4.917920e-03
```

It is possible to mix generalized linear models and super learning
when fitting each of the components of the PS. In the below example we
use the wrapper function `SL.glm`

, which fits a main terms
logistic regression, to fit the regression of \(\Delta_A\) on \(W\), use `SL.npreg`

to fit the
regression of \(A\) on \(W\), and use a super learner with library
`SL.glm`

, `SL.mean`

, and `SL.gam`

to
fit the regression of \(Delta_Y\) on
\(W\) and \(A\).

```
<- drtmle(W = W, A = A, Y = Y, stratify = FALSE,
mixed_fit_wNAs SL_g = list(DeltaA = "SL.glm", A = "SL.npreg",
DeltaY = c("SL.glm", "SL.mean", "SL.gam")),
glm_Q = "W1 + W2*A", glm_Qr = "gn",
glm_gr = "Qn", family = binomial())
mixed_fit_wNAs
```

```
## $est
##
## 1 0.7386462
## 0 0.5187046
##
## $cov
## 1 0
## 1 0.0017938723 -2.77481e-05
## 0 -0.0000277481 4.88416e-03
```

We can also fit the PS regressions externally and pass in via
`gn`

, though this process is slightly more complicated than
when no missing data are present. The basic idea is to fit one
regression for (i) the indicator of missing \(A\), (ii) \(A\) itself, and (iii) the indicator of
missing the outcome. The three PS’s are then multiplied together and an
appropriately ordered list is constructed as before. We illustrate with
the following simple example.

```
# first regress indicator of missing A on W
<- glm(DeltaA ~ . , data = W, family = binomial())
fit_DeltaA
# get estimated propensity for observing A
<- predict(fit_DeltaA, type = "response")
ps_DeltaA
# now regress A on W | DeltaA = 1
<- glm(A[DeltaA == 1] ~ . , data = W[DeltaA == 1, , drop = FALSE],
fit_A family = binomial())
# get estimated propensity for receiving A = 1
<- predict(fit_A, newdata = W, type = "response")
ps_A1 # propensity for receiving A = 0
<- 1 - ps_A1
ps_A0
# now regress DeltaY on A + W | DeltaA = 1. Here we are pooling over
# values of A (i.e., this is what drtmle does if stratify = FALSE).
# We could also fit two regressions, one of DeltaY ~ W | DeltaA = 1, A = 1
# and one of DeltaY ~ W | DeltaA = 1, A = 0 (this is what drtmle does if
# stratify = TRUE).
<- glm(DeltaY[DeltaA == 1] ~ . ,
fit_DeltaY data = data.frame(A = A, W)[DeltaA == 1, ],
family = binomial())
# get estimated propensity for observing outcome if A = 1
<- predict(fit_DeltaY, newdata = data.frame(A = 1, W),
ps_DeltaY_A1 type = "response")
# get estimated propensity for observing outcome if A = 0
<- predict(fit_DeltaY, newdata = data.frame(A = 0, W),
ps_DeltaY_A0 type = "response")
# now combine all results into a single propensity score
<- list(
gn # propensity for DeltaA = 1, A = 1, DeltaY = 1
* ps_A1 * ps_DeltaY_A1,
ps_DeltaA # propensity for DeltaA = 1, A = 0, DeltaY = 1
* ps_A0 * ps_DeltaY_A0
ps_DeltaA
)
# pass in this gn to drtmle
<- drtmle(W = W, A = A, Y = Y, stratify = FALSE,
out_fit_ps # make sure a_0/gn are ordered properly!
gn = gn, a_0 = c(1, 0),
glm_Q = "W1 + W2*A", glm_Qr = "gn",
glm_gr = "Qn", family = binomial())
out_fit_ps
```

```
## $est
##
## 1 0.7353583
## 0 0.5213146
##
## $cov
## 1 0
## 1 1.836839e-03 -2.380141e-05
## 0 -2.380141e-05 4.917920e-03
```

So far in our examples, we have only considered binary treatments.
However, `drtmle`

is equipped to handle treatments with an
arbitrary number of discrete values. Suppose that \(A\) assumes values in a set \(\mathcal{A}\), the PS regression estimation
is modified to ensure that \[
\sum\limits_{a \in \mathcal{A}} g_n(a \mid w) = 1 \ \mbox{for all $w$}
\ .
\] If this condition is true, we say that the estimates of the PS
are compatible. Many regression methodologies that have been developed
work well with binary outcomes, but have not been extended to
multi-level outcomes. To ensure that users of `drtmle`

have
access to the large number of binary regression techniques when
estimating the PS, we have taken an alternative approach to estimation
of the propensity for each level of the treatment. Specifically, rather
than fitting a single multinomial-type regression, we fit a series of
binomial regressions. As a concrete example, consider the case of no
missing data and where \(\mathcal{A} =
\{0,1,2\}\). We can ensure compatible estimates of the PS by
generating an estimate \(g_n(0 \mid
w)\) of \(Pr_0(A = 0 \mid W =
w)\) and \(\tilde{g}_n(1 \mid
w)\) of \(Pr_0(A = 1 \mid A > 0, W =
w)\). The latter estimate may be generated by regression \(I(A = 1)\) on \(W\) in observations with \(A > 0\). Because the outcome is binary,
this regression may be estimated using any technique suitable for binary
outcomes. By simple rules of conditional probability, we know that \(g_0(1 \mid w) = Pr_0(A = 1 \mid A > 0, W =
w)\{1 - Pr_0(A = 0 \mid W)\}\) and thus an estimate of \(g_0(1 \mid w)\) can be computed as \(g_n(1 \mid w) = \tilde{g}_n(1 \mid w) \{1 - g_n(0
\mid w)\}\). Finally, we let \(g_n(2
\mid w) = 1 - g_n(0 \mid w) - g_n(1 \mid w)\), which ensures
compatibility of the estimates for each level of covariates. This
approach generalizes to an arbitrary number of treatment levels. Below,
we provide an illustration using simulated data with a treatment that
has three levels. The true parameter values in this simulation are \(\psi_0(0)=\) 0.62, \(\psi_0(1)=\) 0.72, and \(\psi_0(2)=\) 0.77.

```
set.seed(1234)
<- 200
n <- data.frame(W1 = runif(n), W2 = rbinom(n, 1, 0.5))
W <- rbinom(n, 2, plogis(W$W1 + W$W2))
A <- rbinom(n, 1, plogis(W$W1 + W$W2*A)) Y
```

The multi-level PS can still be estimated using any of the three techniques discussed previously. Here, we illustrate with simple generalized linear models.

```
<- drtmle(W = W, A = A, Y = Y, stratify = FALSE,
glm_fit_multiA glm_g = "W1 + W2", glm_Q = "W1 + W2*A",
glm_gr = "Qn", glm_Qr = "gn",
family = binomial(), a_0 = c(0,1,2))
glm_fit_multiA
```

```
## $est
##
## 0 0.4563335
## 1 0.7357243
## 2 0.7960642
##
## $cov
## 0 1 2
## 0 1.483804e-02 -3.223757e-05 -6.660173e-05
## 1 -3.223757e-05 2.178154e-03 6.376441e-05
## 2 -6.660173e-05 6.376441e-05 2.531909e-03
```

Above, we used the option `a_0`

to specify the levels of
the treatment at which we were interested in estimating marginal means.
There is no need for `a_0`

to contain all unique values of
`A`

. For example, certain levels of the treatment may not be
of scientific interest, but nevertheless we would like to use these data
to help estimate the OR and PS (by setting
`stratify = FALSE`

).

The confidence interval and testing procedures extend to multi-level
treatments with minor modifications. We can still obtain a confidence
interval and hypothesis test for each of the marginal means by not
specifying a `contrast`

.

`ci(glm_fit_multiA)`

```
## $drtmle
## est cil ciu
## 0 0.456 0.218 0.695
## 1 0.736 0.644 0.827
## 2 0.796 0.697 0.895
```

`wald_test(glm_fit_multiA, null = c(0.4, 0.5, 0.6))`

```
## $drtmle
## zstat pval
## H0: E[Y(0)]=0.4 0.462 0.644
## H0: E[Y(1)]=0.5 5.051 0.000
## H0: E[Y(2)]=0.6 3.896 0.000
```

Alternatively, we can specify a `contrast`

to estimate
confidence intervals about the average treatment effect of \(A=1\) vs. \(A=0\). Calls to `wald_test`

can
also be made.

`ci(glm_fit_multiA, contrast = c(-1, 1, 0))`

```
## $drtmle
## est cil ciu
## E[Y(1)]-E[Y(0)] 0.279 0.023 0.536
```

Similarly, we can compute confidence intervals comparing the average treatment effect of \(A = 2\) vs. \(A = 0\).

`ci(glm_fit_multiA, contrast = c(-1, 0, 1))`

```
## $drtmle
## est cil ciu
## E[Y(2)]-E[Y(0)] 0.34 0.08 0.599
```

The user-specified contrasts are also available. For example, we can
modify the `riskRatio`

object we used previously for a
two-level treatment to compute the risk ratio comparing \(A=1\) to \(A=0\) and comparing \(A = 2\) to \(A =
0\).

```
<- list(f = function(eff){ log(eff) },
riskRatio_1v0 f_inv = function(eff){ exp(eff) },
h = function(est){ est[2]/est[1] },
fh_grad = function(est){ c(1/est[2], -1/est[1], 0) })
ci(glm_fit_multiA, contrast = riskRatio_1v0)
```

```
## $drtmle
## est cil ciu
## user contrast 1.612 1.1 2.363
```

Similarly, we can compute confidence intervals for the risk ratio comparing the marginal mean for \(A=2\) to \(A=1\).

```
<- list(f = function(eff){ log(eff) },
riskRatio_2v0 f_inv = function(eff){ exp(eff) },
h = function(est){ est[3]/est[1] },
fh_grad = function(est){ c(0, -1/est[1], 1/est[3]) })
ci(glm_fit_multiA, contrast = riskRatio_2v0)
```

```
## $drtmle
## est cil ciu
## user contrast 1.744 1.382 2.202
```

When considering adaptive estimation techniques, one runs the risk of
overfitting the initial regressions. While the cross validation used by
super learning attempts to guard against overfitting, there are no
guarantees in finite-samples. To definitively avoid such overfitting,
one can use cross-validated estimates of the regression parameters.
Theoretically, the use of cross-validated regression weakens the
assumptions necessary to achieve asymptotic normality (Zheng and van der Laan 2010). This is true of
the regular TMLE and AIPTW, as well as of the TMLE estimator with
doubly-robust inference. The `drtmle`

function implements
cross-validated estimation of the regression parameters through the
`cvFolds`

option, as shown in the code below, where we
continue with the multi-level treatment example.

```
<- drtmle(W = W, A = A, Y = Y, family = binomial(),
cv_sl_fit SL_g = c("SL.glm", "SL.glm.interaction", "SL.gam"),
SL_Q = c("SL.glm", "SL.glm.interaction", "SL.gam"),
SL_gr = c("SL.glm", "SL.mean"),
SL_Qr = c("SL.glm", "SL.mean"),
stratify = FALSE, cvFolds = 2, a_0 = c(0, 1, 2))
cv_sl_fit
```

```
## $est
##
## 0 0.5018057
## 1 0.7366243
## 2 0.7882366
##
## $cov
## 0 1 2
## 0 1.193714e-02 -7.941656e-06 1.535563e-05
## 1 -7.941656e-06 2.235493e-03 6.137198e-05
## 2 1.535563e-05 6.137198e-05 2.758834e-03
```

Cross-validation may significantly increase the computation time
necessary for running `drtmle`

. Parallelized regression
fitting is available by setting the option
`use_future = TRUE`

. Since, internally, all computations are
carried out using futures (with the API provided by the “future” R
package), setting this option amounts only to parallelizing with futures
(Bengtsson 2017a) via
`future.apply::lapply`

. Selecting this option results in
parallelization of the estimation of the OR, PS, and reduced-dimension
regressions.

If no adjustments are made prior to invoking the `drtmle`

function, futures are computed sequentially. It is up to the user to
specify appropriate system-specific future back-ends that may better
suit their problem (e.g., using the `future.batchtools`

package (Bengtsson 2017b)). Below we
demonstrate registration using an ad-hoc cluster on the local system.
The choice of back-end is flexible, and submission of the job to a wide
variety of schedulers (e.g., SLURM, TORQUE, etc.) is also permitted.
Interested readers are invited to consult the documentation of the
`future.batchtools`

package for further information.

```
## Commented out to avoid build errors
# library(future.batchtools)
# library(parallel)
# cl <- makeCluster(2, type = "SOCK")
# plan(cluster, workers = cl)
# clusterEvalQ(cl, library("SuperLearner"))
# pcv_sl_fit <- drtmle(W = W, A = A, Y = Y, family = binomial(),
# SL_g = c("SL.glm", "SL.glm.interaction","SL.gam"),
# SL_Q = c("SL.glm", "SL.glm.interaction","SL.gam"),
# SL_gr = c("SL.glm", "SL.gam", "SL.mean"),
# SL_Qr = c("SL.glm", "SL.gam", "SL.mean"),
# stratify = FALSE, a_0 = c(0,1,2),
# cvFolds = 2, use_future = TRUE)
```

Doubly-robust estimators of marginal means are appealing in their robustness and efficiency properties. However, researchers may prefer the IPTW estimator for its intuitive derivation and implementation. Heretofore, inference for IPTW estimators precluded the use of adaptive estimators of the PS. However, van der Laan (2014) proposed modified IPTW estimators that allow for adaptive PS estimation. Similarly as above, this estimator is based on an additional regression parameter, \(\tilde{Q}_{r,0n}(a,w) := E_0\{Y \mid A = a, g_n(W) = g_n(w)\}\), that is the regression of the outcome \(Y\) on the estimated propensity amongst observations with \(A=a\). The author showed that if \(g_n^*(a \mid w)\), an estimator of the propensity for treatment \(A=a\), satisfied that \[ \frac{1}{n} \sum\limits_{i=1}^n \frac{\tilde{Q}_{r,0}(a,W_i)}{g_n(a \mid W = W_i)} \ \{ I(A_i = a) - g_n(a \mid W = W_i)\} = 0 \ ,\] then the IPTW estimator based on \(g_n^*\) has an asymptotic normal distribution with an estimable variance. In fact, van der Laan (2014) proposed a TMLE algorithm that mapped an initial PS estimate into an estimate that satisfies this key equation and variance estimators that can be used to generate Wald-style confidence intervals and hypothesis tests.

An estimator with equivalent asymptotic properties based on a PS
estimate, \(g_n(a \mid w)\), can be
computed as \[
\psi_{n,IPTW}(a) = \psi_{n,IPTW}(a) - \frac{1}{n} \sum\limits_{i=1}^n
\frac{\tilde{Q}_{r,0}(a,W_i)}{g_n(a \mid W = W_i)} \ \{ I(A_i = a) -
g_n(a
\mid W = W_i)\} \ .
\] Both this estimator and the TMLE-based IPTW estimator above
are implemented in `drtmle`

.

The IPTW estimators based on adaptive estimation of the PS are
implemented in the `adaptive_iptw`

function. The
`adaptive_iptw`

function shares many options with
`drtmle`

. However, as the IPTW estimator does not rely on an
estimate of the OR, these options are omitted. We continue with the
multilevel treatment example.

```
<- adaptive_iptw(W = W, A = A, Y = Y, stratify = FALSE,
npreg_iptw_multiA SL_g = "SL.npreg", SL_Qr = "SL.npreg",
family = binomial(), a_0 = c(0, 1, 2))
npreg_iptw_multiA
```

```
## $est
##
## 0 0.4646874
## 1 0.7052869
## 2 0.8030194
##
## $cov
## 0 1 2
## 0 1.414756e-02 -3.045258e-05 -4.361100e-05
## 1 -3.045258e-05 1.852371e-03 2.263188e-05
## 2 -4.361100e-05 2.263188e-05 2.371211e-03
```

The `"adaptive_iptw"`

class contains identical methods for
computing confidence intervals and Wald tests as the
`"drtmle"`

class and we refer readers back to confidence
intervals and hypothesis tests section for reminders on the various
options for these tests. By default, these methods are implemented for
the `iptw_tmle`

estimator of van der Laan (2014). We expect
that this estimator will have improved finite-sample behavior relative
to `iptw_os`

(the alternative estimator described in the
previous subsection), though further study is needed.

As with `drtmle`

, the `adaptive_iptw`

function
allows missing data in `A`

and `Y`

. Similarly,
`adaptive_iptw`

also allows for parallelized, cross-validated
estimation of the regression parameters via `cvFolds`

and
`parallel`

options. As with the estimators implemented in
`drtmle`

, cross-validation allows for valid inference under
weaker assumptions.

The `drtmle`

package was designed to provide a
user-friendly implementation of the TMLE algorithm that facilitates
doubly-robust inference about marginal means and average treatment
effects. While the theory underlying doubly-robust inference is complex,
the implementation of the methods only requires users to provide the
data and specify how to estimate regressions. While estimation of the OR
and PS is familiar to those familiar with causal inference-related
estimation, the reduced-dimension regressions may appear strange.
Indeed, it is difficult to provide intuition for these parameters and
more difficult to determine what scientific background knowledge might
guide users in how to estimate these regression parameters. Therefore,
we recommended estimating these quantities adaptively, e.g., with the
super learner. The super learner library should contain several
relatively smooth estimators such as `SL.mean`

,
`SL.glm`

, and `SL.gam`

. In particular, including
`SL.mean`

(an unadjusted generalized linear model) may be
important due to the fact that, when the OR and PS are consistently
estimated, the reduced-dimension regressions are identically equal to
zero. Thus, this unadjusted regression estimator may do a good job
estimating the reduced-dimension regressions in situations where the OR
and PS are estimated well. In addition to relatively smooth estimators,
we recommend additionally including several adaptive estimators such as
`SL.npreg`

or `SL.gam`

.

Planned extensions of the `drtmle`

package include
extensions of longitudinal settings involving treatments at multiple
time points and inclusion of the collaborative targeted maximum
likelihood estimator (CTMLE) of van der Laan (2014) that also
facilitates collaborative doubly-robust inference.

`sessionInfo()`

```
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] quadprog_1.5-8 nloptr_2.0.3 SuperLearner_2.0-28
## [4] gam_1.20.1 foreach_1.5.2 nnls_1.4
## [7] drtmle_1.1.2
##
## loaded via a namespace (and not attached):
## [1] np_0.60-11 Rcpp_1.0.9 bslib_0.4.2
## [4] compiler_4.2.0 jquerylib_0.1.4 iterators_1.0.14
## [7] tools_4.2.0 boot_1.3-28 digest_0.6.31
## [10] lattice_0.20-45 jsonlite_1.8.4 evaluate_0.19
## [13] lifecycle_1.0.3 rlang_1.0.6 Matrix_1.4-1
## [16] cli_3.5.0 yaml_2.3.6 parallel_4.2.0
## [19] SparseM_1.81 xfun_0.35 fastmap_1.1.0
## [22] stringr_1.5.0 knitr_1.41 MatrixModels_0.5-0
## [25] vctrs_0.5.1 sass_0.4.4 globals_0.16.2
## [28] grid_4.2.0 glue_1.6.2 listenv_0.9.0
## [31] R6_2.5.1 future.apply_1.10.0 parallelly_1.33.0
## [34] survival_3.4-0 rmarkdown_2.19 magrittr_2.0.3
## [37] MASS_7.3-57 codetools_0.2-18 htmltools_0.5.4
## [40] future_1.30.0 cubature_2.0.4.4 quantreg_5.94
## [43] stringi_1.7.8 cachem_1.0.6
```

Bengtsson, Henrik. 2017a. *future: A
Future API for R*.
https://cran.r-project.org/web/packages/future. https://github.com/HenrikBengtsson/future.

———. 2017b. *future.batchtools: A Future
API for Parallel and Distributed Processing Using
“Batchtools”*.
https://cran.r-project.org/web/packages/future.batchtools. https://github.com/HenrikBengtsson/future.batchtools.

Benkeser, D, M Carone, M van der Laan, and P Gilbert. 2017.
“Doubly-Robust Nonparametric Inference on the Average Treatment
Effect.” *Biometrika*.

Hastie, Trevor. 2017. *gam: Generalized
Additive Models*. https://CRAN.R-project.org/package=gam.

Hayfield, Tristen, and Jeffrey Racine. 2008. “Nonparametric
Econometrics: The Np Package.” *Journal of Statistical
Software* 27 (5). https://www.jstatsoft.org/v27/i05/.

Horvitz, D, and D Thompson. 1952. “A Generalization of Sampling
Without Replacement from a Finite Universe.” *Journal of the
American Statistical Association* 47 (260): 663–85. https://doi.org/10.1080/01621459.1952.10483446.

Polley, E, E LeDell, C Kennedy, S Lendle, and M van der Laan. 2017.
*SuperLearner: Super Learner Prediction*. https://CRAN.R-project.org/package=SuperLearner.

Porter, K, S Gruber, van der Laan M, and J Sekhon. 2011. “The
Relative Performance of Targeted Maximum Likelihood Estimators.”
*International Journal of Biostatistics* 7 (1). https://doi.org/10.2202/1557-4679.1308.

Robins, J. 1986. “A New Approach to Causal Inference in Mortality
Studies with a Sustained Exposure Period –- Application to Control of
the Healthy Worker Survivor Effect.” *Mathematical
Modelling* 7 (9–12): 1393–1512. https://doi.org/10.1016/0270-0255(86)90088-6.

Robins, J, M Hernan, and B Brumback. 2000. “Marginal Structural
Models and Causal Inference in Epidemiology.”
*Epidemiology* 11 (5): 550–60. https://doi.org/10.1097/00001648-200009000-00011.

Robins, J, A Rotnitzky, and L Zhao. 1994. “Estimation of
Regression Coefficients When Some Regressors Are Not Always
Observed.” *Journal of the American Statistical
Association* 89 (427): 846–66. https://doi.org/10.1080/01621459.1994.10476818.

Vaart, A van der, S Dudoit, and M van der Laan. 2006. “Oracle
Inequalities for Multi-Fold Cross-Validation.” *Statistics and
Decisions* 24 (3): 351–71. https://doi.org/10.1524/stnd.2006.24.3.351.

van der Laan, M. 2014. “Targeted Estimation of Nuisance Parameters
to Obtain Valid Statistical Inference.” *International Journal
of Biostatistics* 10 (1): 29–57. https://doi.org/10.1515/ijb-2012-0038.

van der Laan, M, E Polley, and A Hubbard. 2007. “Super
Learner.” *Statistical Applications in Genetics and Molecular
Biology* 6 (1). https://doi.org/10.2202/1544-6115.1309.

van der Laan, M, and S Rose. 2011. *Targeted Learning: Causal
Inference for Observational and Experimental Data*. Springer New
York. https://doi.org/10.1007/978-1-4419-9782-1.

van der Laan, M, and D Rubin. 2006. “Targeted Maximum Likelihood
Learning.” *International Journal of Biostatistics* 2 (1).
https://doi.org/10.2202/1557-4679.1043.

Wolpert, D. 1992. “Stacked Generalization.” *Neural
Networks* 5 (2): 241–59. https://doi.org/10.1016/S0893-6080(05)80023-1.

Zheng, W, and M van der Laan. 2010. “Asymptotic Theory for
Cross-Validated Targeted Maximum Likelihood Estimation.” *U.C.
Berkeley Division of Biostatistics Working Paper Series*. https://biostats.bepress.com/ucbbiostat/paper273/.