Residual Theory

Robert M. Cook

2019-07-01

The ‘residuals.em.glm’ method returns two types of residuals: deviance and Pearson. Both are defined using standard GLM theory, in which the log-likelihood of an observation is expressed in the form:

\[ l(y_i) = \frac{w_i(y_i\theta_i - b(\theta_i))}{\phi} + c(y_i, \phi) \]

where \(y_i\) is the observed value, \(w_i\) is the observation weight, \(\phi\) is the dispersion parameter (currently assumed to be 1 for the models implemented), and \(b\) and \(c\) are distribution-dependent functions. The function \(b\) determines the mean of \(y_i\) through \(\mu_i = E(y_i) = b'(\theta_i)\).
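As an illustration, the Poisson distribution fits this form with \(\theta_i = \log \mu_i\), \(b(\theta_i) = e^{\theta_i}\), \(c(y_i, \phi) = -\log y_i!\) and \(\phi = 1\), so that \(b'(\theta_i) = e^{\theta_i} = \mu_i\) as required.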

The deviance residuals, \(d_i\), are defined as:

\[d_i = sign(y_i - \mu_i) \sqrt{2 \left( l_{sat.}(y_i) - l_{model}(y_i) \right) }\]

where \(l_{sat.}\) is the log-likelihood of the ‘saturated model’, in which each \(\mu_i = y_i\), and \(l_{model}\) is the log-likelihood of \(y_i\) at the predicted value of \(\mu_i\). We will not go into the relationship between \(\mu_i\), the covariates \(X\), and the fitted parameters \(B\) here.
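As a concrete illustration, using the unweighted Poisson log-likelihood \(l(y_i) = y_i \log \mu_i - \mu_i - \log y_i!\), the definition above reduces to the familiar closed form

\[ d_i = sign(y_i - \mu_i) \sqrt{2 \left( y_i \log \frac{y_i}{\mu_i} - (y_i - \mu_i) \right)} \]

with \(y_i \log (y_i / \mu_i)\) taken to be 0 when \(y_i = 0\).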

The deviance residuals are implemented as:

rho <- x %*% B                # linear predictor
mu <- family()$linkinv(rho)   # fitted mean via the inverse link

# Poisson deviance residuals
sign(y - w * mu) * sqrt(2 * (dpois(y, y, log = TRUE) - dpois(y, w * mu, log = TRUE)))

# Binomial deviance residuals
sign(y - w * mu) * sqrt(2 * (dbinom(y, w, y / w, log = TRUE) - dbinom(y, w, mu, log = TRUE)))

where x is the design matrix, B contains the fitted parameters, w is the weights, y is the target values, and family is (currently) either poisson or binomial.
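As an illustrative check (not part of the package; the simulated data and variable names below are made up), the Poisson branch of this formula can be compared against the deviance residuals reported by stats::glm when the exposure enters as an offset:

set.seed(1)
w <- rpois(50, 5) + 1                        # hypothetical exposures
x <- cbind(1, rnorm(50))                     # design matrix with an intercept
y <- rpois(50, w * exp(0.2 + 0.5 * x[, 2]))  # simulated counts

fit <- glm(y ~ x[, 2] + offset(log(w)), family = poisson())
mu <- drop(exp(x %*% coef(fit)))             # fitted rate per unit exposure

d <- sign(y - w * mu) *
  sqrt(2 * (dpois(y, y, log = TRUE) - dpois(y, w * mu, log = TRUE)))

all.equal(d, unname(residuals(fit, type = "deviance")))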

The Pearson residuals, \(r_i\), when the weights all equal 1, are defined as:

\[ r_i = \frac{y_i - \mu_i}{\sqrt{Var(y_i)}} = \frac{y_i - \mu_i}{\sqrt{V(\mu_i)}}\]

where \(V(\mu)\) is the variance function of the GLM and depends on the underlying distribution.
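For the two families currently implemented the variance functions are \(V(\mu) = \mu\) (Poisson) and \(V(\mu) = \mu(1 - \mu)\) (binomial), giving, for example, the unweighted Pearson residuals

\[ r_i = \frac{y_i - \mu_i}{\sqrt{\mu_i}} \quad \text{(Poisson)} \qquad \text{and} \qquad r_i = \frac{y_i - \mu_i}{\sqrt{\mu_i (1 - \mu_i)}} \quad \text{(binomial).} \]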

When the models are weighted (either because we model the rate of successes in a binomial process or have an exposure for a Poisson process), the Pearson residuals become:

\[ r_i = \frac{y_i / w_i - \mu_i}{\sqrt{Var(y_i)}} = \frac{y_i/w_i - \mu_i}{\sqrt{V(\mu_i)/w_i}} = \frac{y_i - w_i \mu_i}{\sqrt{w_i V(\mu_i)}}\]
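Continuing the illustrative stats::glm example above, where \(V(\mu) = \mu\) for the Poisson family, this weighted form agrees with the Pearson residuals reported by glm:

r <- (y - w * mu) / sqrt(w * mu)   # weighted Pearson residuals, Poisson case
all.equal(r, unname(residuals(fit, type = "pearson")))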

The relationship between \(Var(y_i)\) and \(V(\mu_i)\) arises from taking derivatives of \(l\) with respect to \(\theta\). The first derivative gives the score, \(u\):

\[ \frac{d l}{d \theta} = u = \frac{w_i (y_i - b'(\theta))}{\phi}\]

and the second derivative:

\[ \frac{d^2 l}{d \theta^2} = -\frac{w_i b''(\theta)}{\phi} \]

Combining these two results with Bartlett’s second identity:

\[ Var_\theta\left[ \frac{d l}{d\theta} \right] = - E_\theta\left[ \frac{d^2 l}{d\theta^2} \right] \]

We can deduce that:

\[ Var_\theta\left[ \frac{w_i (y_i - b'(\theta))}{\phi} \right] = - E_\theta\left[ -\frac{w_i b''(\theta)}{\phi} \right] \]

\[ \frac{w_i^2 Var(y_i) }{\phi^2} = \frac{w_i b''(\theta)}{\phi} \]

\[ Var(y_i) = \frac{\phi b''(\theta)}{w_i} = \frac{\phi V(\theta)}{w_i} \]
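For example, for the Poisson family \(b(\theta) = e^\theta\), so \(b''(\theta) = e^\theta = \mu_i\) and, with \(\phi = 1\), \(Var(y_i) = \mu_i / w_i\).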

In the stats package, each distribution family carries its own variance function defined in terms of \(\mu\) rather than \(\theta\), \(V(\mu) = b''(b'^{-1}(\mu))\). Hence we use:

\[ Var(y_i) = \frac{\phi V(\mu)} {w_i} \]
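Putting this together, a sketch of the Pearson residual computation in the same style as the deviance code above (assuming \(\phi = 1\) and using the variance function supplied by the stats family object; this is illustrative rather than the exact package code) is:

rho <- x %*% B                # linear predictor
mu <- family()$linkinv(rho)   # fitted mean via the inverse link

# Pearson residuals for the weighted case
(y - w * mu) / sqrt(w * family()$variance(mu))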