Whittaker-Henderson (WH) smoothing is a gradation method aimed at correcting the effect of sampling fluctuations on an observation vector. It is applied to evenly-spaced discrete observations. Initially proposed by Whittaker (1922) for constructing mortality tables and further developed by the works of Henderson (1924), it remains one of the most popular methods among actuaries for constructing experience tables in life insurance. Extending to two-dimensional tables, it can be used for studying various risks, including but not limited to: mortality, disability, long-term care, lapse, mortgage default, and unemployment.

Let \(\mathbf{y}\) be a vector of observations and \(\mathbf{w}\) a vector of positive weights, both of size \(n\). The estimator associated with Whittaker-Henderson smoothing is given by:

\[ \hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}}\{F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) + R_{\lambda,q}(\boldsymbol{\theta})\} \]

where:

\(F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = \underset{i = 1}{\overset{n}{\sum}} w_i(y_i - \theta_i)^2\) represents a fidelity criterion to the observations,

\(R_{\lambda,q}(\boldsymbol{\theta}) = \lambda \underset{i = 1}{\overset{n - q}{\sum}} (\Delta^q\boldsymbol{\theta})_i^2\) represents a smoothness criterion.

In the latter expression, \(\Delta^q\) denotes the forward difference operator of order \(q\), such that for any \(i\in[1,n - q]\):

\[ (\Delta^q\boldsymbol{\theta})_i = \underset{k = 0}{\overset{q}{\sum}} \begin{pmatrix}q \\ k\end{pmatrix}(- 1)^{q - k} \theta_{i + k}. \]

Let us define \(W = \text{Diag}(\mathbf{w})\), the diagonal matrix of weights, and \(D_{n,q}\) as the order \(q\) difference matrix of dimensions \((n-q) \times n\), such that \((D_{n,q}\boldsymbol{\theta})_i = (\Delta^q\boldsymbol{\theta})_i\) for all \(i \in [1, n-q]\). The most commonly used difference matrices of order 1 and 2 have the following forms:

\[ D_{n,1} = \begin{bmatrix} -1 & 1 & 0 & \ldots & 0 \\ 0 & -1 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & 0 & -1 & 1 \\ \end{bmatrix} \quad\text{and}\quad D_{n,2} = \begin{bmatrix} 1 & -2 & 1 & 0 & \ldots & 0 \\ 0 & 1 & -2 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & \ldots & 0 & 1 & -2 & 1 \\ \end{bmatrix}. \]

The fidelity and smoothness criteria can be rewritten in matrix form as:

\[ F(\mathbf{y},\mathbf{w},\boldsymbol{\theta}) = (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \quad \text{and} \quad R_{\lambda,q}(\boldsymbol{\theta}) = \lambda\boldsymbol{\theta}^TD_{n,q}^TD_{n,q}\boldsymbol{\theta}. \]

The associated estimator for smoothing becomes:

\[ \hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace \]

where \(P_{\lambda} = \lambda D_{n,q}^TD_{n,q}\).

In the bidimensional case, let us consider a matrix \(Y\) of observations and a matrix \(\Omega\) of non-negative weights, both of dimensions \(n_x \times n_z\). The estimator associated with the Whittaker-Henderson smoothing can be written as:

\[ \widehat{Y} = \underset{\Theta}{\text{argmin}}\{F(Y,\Omega, \Theta) + R_{\lambda,q}(\Theta)\} \]

where:

\(F(Y,\Omega, \Theta) = \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z} \Omega_{i,j}(Y_{i,j} - \Theta_{i,j})^2\) represents a fidelity criterion to the observations,

\(R_{\lambda,q}(\Theta) = \lambda_x \sum_{j = 1}^{n_z}\sum_{i = 1}^{n_x - q_x} (\Delta^{q_x}\Theta_{\bullet,j})_i^2 + \lambda_z \sum_{i = 1}^{n_x}\sum_{j = 1}^{n_z - q_z} (\Delta^{q_z}\Theta_{i,\bullet})_j^2\) is a smoothness criterion.

This latter criterion can be written as the sum of two one-dimensional regularization criteria, with orders \(q_x\) and \(q_z\), applied respectively to all rows and all columns of \(\Theta\). It is also possible to adopt matrix notations by defining \(\mathbf{y} = \textbf{vec}(Y)\), \(\mathbf{w} = \textbf{vec}(\Omega)\), and \(\boldsymbol{\theta} = \textbf{vec}(\Theta)\) as the vectors obtained by stacking the columns of the matrices \(Y\), \(\Omega\), and \(\Theta\), respectively. Additionally, let us denote \(W = \text{Diag}(\mathbf{w})\) and \(n = n_x \times n_z\). The fidelity and smoothness criteria can then be rewritten as linear combinations of the vectors \(\mathbf{y}\), \(\mathbf{w}\), and \(\boldsymbol{\theta}\):

\[ \begin{aligned} F(\mathbf{y},\mathbf{w}, \boldsymbol{\theta}) &= (\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) \\ R_{\lambda,q}(\boldsymbol{\theta}) &= \boldsymbol{\theta}^{T}(\lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}) \boldsymbol{\theta}. \end{aligned} \]

and the estimator associated with the smoothing takes the same form as in the one-dimensional case:

\[ \hat{\mathbf{y}} = \underset{\boldsymbol{\theta}}{\text{argmin}} \left\lbrace(\mathbf{y} - \boldsymbol{\theta})^TW(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right\rbrace \]

except in this case \(P_{\lambda} = \lambda_x I_{n_z} \otimes D_{n_x,q_x}^{T}D_{n_x,q_x} + \lambda_z D_{n_z,q_z}^{T}D_{n_z,q_z} \otimes I_{n_x}\).

Whittaker-Henderson estimator has an explicit solution:

\[\hat{\mathbf{y}} = (W + P_{\lambda})^{-1}W\mathbf{y}.\]

Indeed, as a minimum, \(\hat{\mathbf{y}}\) satisfies:

\[0 = \left.\frac{\partial}{\partial \boldsymbol{\theta}}\right|_{\hat{\mathbf{y}}}\left\lbrace(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\right\rbrace = - 2 X^TW(y - \hat{\mathbf{y}}) +2X^TP_{\lambda} \hat{\mathbf{y}}.\]

It follows that \((W + P_{\lambda})\hat{\boldsymbol{\theta}} = W\mathbf{y}\), and if \(W + P_{\lambda}\) is invertible, then \(\hat{\mathbf{y}}\) is indeed a solution for the original equation. When \(\lambda \neq 0\), \(W + P_{\lambda}\) is invertible as long as \(\mathbf{w}\) has \(q\) non-zero elements in the one-dimensional case, and \(\Omega\) has at least \(q_x \times q_z\) non-zero elements distributed over \(q_x\) different rows and \(q_z\) different columns in the two-dimensional case. These are sufficient conditions, which are always satisfied in practice and will not be demonstrated here.

The `WH`

package features two main functions
`WH_1d`

and `WH_2d`

corresponding to the
one-dimensional and two-dimensional cases respectively. Two arguments
are mandatory for those functions:

The vector (or matrix in the two-dimension case)

`d`

corresponding to the number of observed events of interest by age (or by age and duration in the two-dimension case).`d`

should have named elements (or rows and columns) for the model results to be extrapolated.The vector (or matrix in the two-dimension case)

`ec`

corresponding to the portfolio central exposure by age (or by age and duration in the two-dimension case) whose dimensions should match those of`d`

. The contribution of each individual to the portfolio central exposure corresponds to the time the individual was actually observed with corresponding age (and duration in the two-dimension cas). It always ranges from 0 to 1 and is affected by individuals leaving the portfolio, no matter the cause, as well as censoring and truncating phenomena.

Additional arguments may be supplied, whose description is given in the documentation of the functions.

The package also embed two fictive agregated datasets to illustrate how to use it:

`portfolio_mortality`

contains the agregated number of deaths and associated central exposure by age for an annuity portfolio.`portfolio_LTC`

contains the agregated number of deaths and associated central exposure by age and duration (in years) since the onset of LTC for the annuitant database of a long-term care portfolio.

```
# One-dimensional case
<- portfolio_mort$d
d <- portfolio_mort$ec
ec
<- WH_1d(d, ec)
WH_1d_fit / Brent method Using outer iteration
```

```
# Two-dimensional case
<- which(rowSums(portfolio_LTC$ec) > 5e2)
keep_age <- which(colSums(portfolio_LTC$ec) > 1e3)
keep_duration
<- portfolio_LTC$d[keep_age, keep_duration]
d <- portfolio_LTC$ec[keep_age, keep_duration]
ec
<- WH_2d(d, ec)
WH_2d_fit / Nelder-Mead method Using performance iteration
```

Functions `WH_1d`

and `WH_2d`

output objects of
class `"WH_1d"`

and `"WH_2d"`

to which additional
functions (including generic S3 methods) may be applied:

- The
`print`

function provides a glimpse of the fitted results

```
WH_1d_fitfunction
An object fitted using the WH_1D 74 data points:
Initial data contains : 19 to 92
Observation positions: 13454
Smoothing parameter selected: 7.5
Associated degrees of freedom
WH_2d_fitfunction
An object fitted using the WH_2D 90 data points:
Initial data contains : 75 to 89
First dimension: 0 to 5
Second dimension: 301 2
Smoothing parameters selected: 13 Associated degrees of freedom
```

- The
`plot`

function generates rough plots of the model fit, the associated standard deviation, the model residuals or the associated degrees of freedom. See the`plot.WH_1d`

and`plot.WH_2d`

functions help for more details.

`plot(WH_1d_fit)`

`plot(WH_1d_fit, "res")`

`plot(WH_1d_fit, "edf")`

```
plot(WH_2d_fit)
```

`plot(WH_2d_fit, "std_y_hat")`

- The
`predict`

function generates an extrapolation of the model. It requires a`newdata`

argument, a named list with one or two elements corresponding to the positions of the new observations. In the two-dimension case constraints are used so that the predicted values matches the fitted values for the initial observations (see Carballo, Durban, and Lee 2021 to understand why this is required).

`|> predict(newdata = 18:99) |> plot() WH_1d_fit `

```
|> predict(newdata = list(age = 50:99,
WH_2d_fit duration = 0:19)) |> plot()
```

- Finally the
`output_to_df`

function converts an`"WH_1d"`

or`"WH_2d"`

object into a`data.frame`

. Information about the fit is discarded in the process. This function may be useful to produce better visualizations from the data, for example using the ggplot2 package.

```
<- WH_1d_fit |> output_to_df()
WH_1d_df <- WH_2d_fit |> output_to_df() WH_2d_df
```

From the explicit solution of WH smoothinh, \(\mathbb{E}(\hat{\mathbf{y}}) = (W + P_{\lambda})^{-1}W\mathbb{E}(\mathbf{y}) \ne \mathbb{E}(\mathbf{y})\) when \(\lambda \ne 0\). This implies that penalization introduces a smoothing bias, which prevents the construction of a confidence interval centered around \(\mathbb{E}(\mathbf{y})\). Therefore, in this section, we turn to a Bayesian approach where smoothing can be interpreted more naturally.

Let us suppose that \(\mathbf{y} | \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, W^{-})\) and \(\boldsymbol{\theta} \sim \mathcal{N}(0, P_{\lambda}^{-})\) where \(P_{\lambda}^{-}\) denotes the pseudo-inverse of the matrix \(P_{\lambda}\). The Bayes’ formula allows us to express the posterior likelihood \(f(\boldsymbol{\theta} | \mathbf{y})\) associated with these choices in the following form:

\[\begin{aligned} f(\boldsymbol{\theta} | \mathbf{y}) &= \frac{f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta})}{f(y)} \\ &\propto f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta}) \\ &\propto \exp\left(- \frac{1}{2}(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\right)\exp\left(-\frac{1}{2}\boldsymbol{\theta}^TP_{\lambda} \boldsymbol{\theta}\right) \\ &\propto \exp\left(- \frac{1}{2}\left[(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta}) + \boldsymbol{\theta}^TP_{\lambda}\boldsymbol{\theta}\right]\right). \end{aligned}\]

The mode of the posterior distribution, \(\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\text{argmax}} [f(\boldsymbol{\theta} | \mathbf{y})]\), also known as the maximum a posteriori (MAP) estimate, coincides with the explicit solution \(\hat{\mathbf{y}}\).

A second-order Taylor expansion of the log-posterior likelihood around \(\hat{\mathbf{y}} = \hat{\boldsymbol{\theta}}\) gives us:

\[\ln f(\boldsymbol{\theta} | \mathbf{y}) = \ln f(\hat{\boldsymbol{\theta}} | \mathbf{y}) + \left.\frac{\partial \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}^{T}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T} \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\]

\[\text{where} \quad \left.\frac{\partial \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0 \quad \text{and} \quad \left.\frac{\partial^2 \ln f(\boldsymbol{\theta} | \mathbf{y})}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{T}}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = - (W + P_{\lambda}).\]

Note that this last derivative no longer depends on \(\boldsymbol{\theta}\), and higher-order derivatives beyond 2 are all zero. The Taylor expansion allows for an exact computation of \(\ln f(\boldsymbol{\theta} | \mathbf{y})\). By substituting the result back into the Taylor expansion, we obtain:

\[\begin{aligned} f(\boldsymbol{\theta} | \mathbf{y}) &\propto \exp\left[\ln f(\hat{\boldsymbol{\theta}} | \mathbf{y}) - \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \\ &\propto \exp\left[- \frac{1}{2} (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right] \end{aligned}\]

which can immediately be recognized as the density of the \(\mathcal{N}(\hat{\boldsymbol{\theta}},(W + P_{\lambda})^{- 1})\) distribution.

The assumption \(\boldsymbol{\theta} \sim \mathcal{N}(0, P_{\lambda}^{-})\) corresponds to a simple Bayesian formalization of the smoothness criterion. It reflects a (improper) prior belief of the modeler about the underlying distribution of the observation vector \(\mathbf{y}\).

The use of Whittaker-Henderson smoothing in the Bayesian framework and the construction of credible intervals are conditioned on the validity of the assumption \(\mathbf{y} | \boldsymbol{\theta}\sim \mathcal{N}(\boldsymbol{\theta}, W^{-})\). The components of the observation vector should be independent and have known variances. The weight vector \(\mathbf{w}\) used should contain the inverses of these variances. Under these assumptions, credible intervals at \(100(1 - \alpha)\)% for smoothing can be constructed and take the form:

\[\mathbb{E}(\mathbf{y}) | \mathbf{y} \in \left[\hat{\mathbf{y}} \pm \Phi\left(1 -\frac{\alpha}{2}\right)\sqrt{\textbf{diag}\left\lbrace(W + P_{\lambda})^{-1}\right\rbrace}\right]\]

with probability \(1 -\frac{\alpha}{2}\) where \(\hat{\mathbf{y}} = (W + P_{\lambda})^{- 1}W\mathbf{y}\) and \(\Phi\) denotes the cumulative distribution function of the standard normal distribution. According to Marra and Wood (2012), these credible intervals provide satisfactory coverage of the corresponding frequentist confidence intervals and can therefore be used as practical substitutes.

The previous section highlighted the need to apply the Whittaker-Henderson smoothing to independent observation vectors \(\mathbf{y}\) and weight vectors \(\mathbf{w}\) corresponding to the inverses of the variances of the components of \(\mathbf{y}\) in order to obtain a measure of uncertainty in the results. In this section, we propose, within the framework of duration models used for constructing experience tables, vectors \(\mathbf{y}\) and \(\mathbf{w}\) that satisfy these conditions.

Consider the observation of \(m\) individuals in a longitudinal study subject to left truncation and right censoring phenomena. Suppose we want to estimate a single distribution that depends on only one continuous explanatory variable, denoted by \(x\). For illustration purposes, we can consider it as a mortality distribution with the explanatory variable of interest \(x\) representing age. Such a distribution is fully characterized by one of the following quantities:

- the cumulative distribution function \(F(x)\) or its complement, the survival function \(S(x) = 1 - F(x)\),
- the associated probability density function \(f(x) = - \frac{\text{d}}{\text{d}x}S(x)\),
- the instantaneous hazard function \(\mu(x) = - \frac{\text{d}}{\text{d}x}\ln S(x)\).

Suppose that the considered distribution depends on a vector of parameters \(\boldsymbol{\theta}\) that we want to estimate using maximum likelihood. The likelihood associated with the observation of the individuals can be written as follows:

\[ \mathcal{L}(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\prod}} \left[\frac{f(x_i + t_i,\boldsymbol{\theta})}{S(x_i,\boldsymbol{\theta})}\right]^{\delta_i}\left[\frac{S(x_i + t_i,\boldsymbol{\theta})}{S(x_i,\boldsymbol{\theta})}\right]^{1 - \delta_i} \]

Where \(x_i\) represents the age at the start of observation, \(t_i\) represents the observation duration, i.e., the time elapsed between the start and end dates of observation, and \(\delta_i\) is the indicator of event observation, which takes the value 1 if the event of interest is observed and 0 if the observation is censored. We will not go into the details of calculating these three quantities, which should take into account the individual-specific information such as the subscription date, redemption date if applicable, as well as the global characteristics of the product, such as the presence of a deductible or medical selection phenomenon, and the choice of a restricted observation period due to data quality issues or delays in the reporting of observations. These factors typically lead to a narrower observation period than the actual presence period of individuals in the portfolio.

The various quantities introduced above are related by the following relationships:

\[ S(x) = \exp\left(\underset{u = 0}{\overset{x}{\int}}\mu(u)\text{d}u\right)\quad \text{and} \quad f(x) = \mu(x)S(x). \]

The maximization of the likelihood in that equation is equivalent to the maximization of the associated log-likelihood, which can be rewritten using only the instantaneous hazard function (also known as the mortality rate in the case of the death risk):

\[ \ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} \left[\delta_i \ln\mu(x_i + t_i,\boldsymbol{\theta}) - \underset{u = 0}{\overset{t_i}{\int}}\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right] \]

To discretize the problem, let us assume that the mortality rate is piecewise constant over one-year intervals between two integer ages. Formally, we have \(\mu(x + \epsilon) = \mu(x)\) for all \(x \in \mathbb{N}\) and \(\epsilon \in [0,1[\). Furthermore, if \(\mathbf{1}\) denotes the indicator function, then for any \(0 \leq a < x_{\max}\), we have \(\sum_{x = x_{\min}}^{x_{\max}} \mathbf{1}(x \leq a < x + 1) = 1\), where \(x_{\min} = \min(\mathbf{x})\) and \(x_{\max} = \max(\mathbf{x})\). The log-likelihood can be rewritten as:

\[ \begin{aligned} \ell(\boldsymbol{\theta}) = \underset{i = 1}{\overset{m}{\sum}} &\left[\underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \delta_i\mathbf{1}(x \le x_i + t_i < x + 1)\ln\mu(x_i + t_i,\boldsymbol{\theta})\right. \\ &- \left.\underset{u = 0}{\overset{t_i}{\int}}\underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \mathbf{1}(x \le x_i + u < x + 1)\mu(x_i + u,\boldsymbol{\theta})\text{d}u\right]. \end{aligned} \]

The assumption of piecewise constant mortality rate implies that:

\[ \begin{aligned} \mathbf{1}(x \le x_i + t_i < x + 1) \ln\mu(x_i + t_i,\boldsymbol{\theta}) &= \mathbf{1}(x \le x_i + t_i < x + 1) \ln\mu(x,\boldsymbol{\theta})\quad\text{and}\\ \mathbf{1}(x \le x_i + u < x + 1)\mu(x_i + u,\boldsymbol{\theta}) &= \mathbf{1}(x \le x_i + u < x + 1) \ln\mu(x,\boldsymbol{\theta}). \end{aligned} \]

It is then possible to interchange the two summations to obtain the following expressions:

\[ \begin{aligned} \ell(\boldsymbol{\theta}) &= \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \left[\ln\mu(x,\boldsymbol{\theta}) d(x) - \mu(x,\boldsymbol{\theta}) e_c(x)\right] \quad \text{where} \\ d(x) & = \underset{i = 1}{\overset{m}{\sum}} \delta_i \mathbf{1}(x \le x_i + t_i < x + 1) \quad \text{and} \\ e_c(x) & = \underset{i = 1}{\overset{m}{\sum}}\underset{u = 0}{\overset{t_i}{\int}}\mathbf{1}(x \le x_i + u < x + 1)\text{d}u = \underset{i = 1}{\overset{m}{\sum}} \left[\min(t_i, x - x_i + 1) - \max(0, x - x_i)\right]^+ \end{aligned} \]

by denoting \(a^+ = \max(a, 0)\), where \(d(x)\) and \(e_c(x)\) correspond to the number of observed deaths between ages \(x\) and \(x + 1\) and the sum of observation durations of individuals between these ages, respectively (the latter quantity is also known as central exposure to risk).

The extension of the proposed approach to the two-dimensional framework requires only minor adjustments to the previous reasoning. Let \(z_{\min} = \min(\mathbf{z})\) and \(z_{\max} = \max(\mathbf{z})\). The piecewise constant assumption for the mortality rate needs to be extended to the second dimension. Formally, we now assume that \(\mu(x + \epsilon, z + \xi) = \mu(x, z)\) for all pairs \(x, z \in \mathbb{N}\) and \(\epsilon, \xi \in [0,1[\). The sums involving the variable \(x\) are then replaced by double sums considering all combinations of \(x\) and \(z\). The log-likelihood is given by:

\[ \begin{aligned} \ell(\boldsymbol{\theta}) &= \underset{x = x_{\min}}{\overset{x_{\max}}{\sum}} \underset{z = z_{\min}}{\overset{z_{\max}}{\sum}}\left[\ln\mu(x,z,\boldsymbol{\theta}) d(x,z) - \mu(x,z,\boldsymbol{\theta}) \mathbf{e_c}(x,z)\right] \quad \text{where} \\ d(x,z) & = \underset{i = 1}{\overset{m}{\sum}} \delta_i \mathbf{1}(x \le x_i + t_i < x + 1) \mathbf{1}(z \le z_i + t_i < z + 1) \quad \text{and}\\ \mathbf{e_c}(x,z) & = \underset{i = 1}{\overset{m}{\sum}}\underset{u = 0}{\overset{t_i}{\int}}\mathbf{1}(x \le x_i + u < x + 1)\mathbf{1}(z \le z_i + u < z + 1)\text{d}u \\ & = \underset{i = 1}{\overset{m}{\sum}} \left[\min(t_i, x + 1 - x_i, z + 1 - z_i) - \max(0, x - x_i, z - z_i)\right]^+ \end{aligned} \]

The choice \(\boldsymbol{\mu}(\boldsymbol{\theta}) = \exp(\boldsymbol{\theta})\), which corresponds to considering one parameter per observation, allows us to relate to the Whittaker-Henderson smoothing. Using the exponential function ensures positive values for the estimated mortality rate. The expressions of the likelihood in the unidimensional or two-dimensional case can then be written in a common vectorized form:

\[ \ell(\boldsymbol{\theta}) = \boldsymbol{\theta}^{T}\mathbf{d} - \exp(\boldsymbol{\theta})^{T}\mathbf{e_c} \]

where \(\mathbf{d}\) and \(\mathbf{e}_c\) represent the vectors of observed deaths and exposures to risk, respectively. The derivatives of the likelihood function for this model are given by:

\[\frac{\partial \ell}{\partial \boldsymbol{\theta}} = \left[\mathbf{d} -\exp(\boldsymbol{\theta}) \odot \mathbf{e_c}\right] \quad \text{and} \quad \frac{\partial^2 \ell}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^T} = - \text{Diag}(\exp(\boldsymbol{\theta}) \odot \mathbf{e_c}).\]

Note that these likelihood equations are exactly what we would obtain by assuming that the observed numbers of deaths, conditional on the observed exposures to risk \(\mathbf{e}_c\), follow Poisson distributions with parameters \(\boldsymbol{\mu}(\boldsymbol{\theta}) \odot \mathbf{e}_c\). The model presented here has many similarities with a Poisson GLM (John Ashworth Nelder and Wedderburn 1972). However, the initial assumptions are not the same for both models.

These likelihood equations have an explicit solution given by \(\hat{\boldsymbol{\theta}} = \ln(\mathbf{d} / \mathbf{e}_c)\). This model, which treats each age independently, is known as the crude rates estimator. The properties of the maximum likelihood estimator imply that asymptotically \(\hat{\boldsymbol{\theta}} \sim \mathcal{N}(\boldsymbol{\theta}, W_{\hat{\boldsymbol{\theta}}}^{-1})\), where \(W_{\hat{\boldsymbol{\theta}}}\) is a diagonal matrix with elements \(\exp(\hat{\boldsymbol{\theta}}) \odot \mathbf{e}_c = (\mathbf{d} / \mathbf{e}_c) \odot \mathbf{e}_c = \mathbf{d}\). It should be noted that the asymptotic nature and the validity of this approximation are related to the number of individuals \(m\) in the portfolio and not the size \(n\) of the vectors \(\mathbf{d}\) and \(\mathbf{e}_c\).

Thus, we have shown that in the framework of duration models, using the crude rates estimator, asymptotically \(\ln(\mathbf{d} / \mathbf{e}_c) \sim \mathcal{N}(\ln\boldsymbol{\mu}, W^{-1})\), where \(W = \text{Diag}(\mathbf{d})\). This justifies applying the Whittaker-Henderson smoothing to the observation vector \(y = \ln(\mathbf{d} / \mathbf{e}_c)\) and weight vector \(\mathbf{w} = \mathbf{d}\). We obtain the following associated credibility intervals:

\[\ln\boldsymbol{\mu} | \mathbf{d}, \mathbf{e_c} \in \left[\hat{\boldsymbol{\theta}} \pm \Phi\left(1 -\frac{\alpha}{2}\right)\sqrt{\textbf{diag}\left\lbrace(\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\right\rbrace}\right]\]

where \(\hat{\boldsymbol{\theta}} = (\text{Diag}(\mathbf{d}) + P_{\lambda})^{-1}\text{Diag}(\mathbf{d})[\ln(\mathbf{d}) - \ln \mathbf{e}_c]\). Results on \(\boldsymbol{\mu}\) can be obtained directly by exponentiating the above expression.

In the definition of WH smoothing, \((\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\) represents a fidelity criterion to the observations, and \(\boldsymbol{\theta}^{T}P_{\lambda}\boldsymbol{\theta}\) represents a smoothness criterion. The relative importance of these criteria is controlled by the parameter (or pair of parameters in the two-dimensional case) \(\lambda\). The result of smoothing is highly sensitive to the chosen value of the smoothing parameter. Ideally, the selection of the smoothing parameter is based on the optimization of a statistical criterion, typically belonging to one of two major families.

On one hand, there are criteria based on the minimization of the model’s prediction error, among which the Akaike information criterion (AIC, Akaike 1973) and the generalized cross-validation (GCV, Wahba 1980) are included. The Bayesian information criterion (BIC, Schwarz 1978) has a similar form to AIC, although its theoretical justification is very different, and thus it can be considered part of this group.

On the other hand, there are criteria based on the maximization of a likelihood function known as the marginal likelihood. This type of criterion was introduced by Patterson H. D. (1971) in the Gaussian case, initially under the name of restricted likelihood (REML), and used by Anderssen and Bloomfield (1974) for the selection of smoothing parameters. Wahba (1985) and Kauermann (2005) show that criteria minimizing prediction error have the best asymptotic performance, but their convergence to the optimal smoothing parameters is slower. For finite sample sizes, criteria based on the maximization of a likelihood function are considered a more robust choice by many authors, such as Reiss and Todd Ogden (2009) or Wood (2011). For these reasons, we will favor the marginal likelihood as the selection criterion, especially since this choice naturally fits into the Bayesian framework introduced in the past sections.

Let us start from the notations and assumptions from last sections, namely \(\mathbf{y} | \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, W^{-})\) and \(\boldsymbol{\theta} | \lambda \sim \mathcal{N}(0, P_{\lambda} ^{-})\). In a purely Bayesian approach, it would be necessary to define a prior distribution on \(\lambda\) and then estimate the posterior distribution of each parameter vector using methods such as Markov Chain Monte Carlo. The empirical Bayesian approach we adopt seeks to find the value of \(\lambda\) that maximizes the marginal likelihood:

\[\mathcal{L}^m_\text{norm}(\lambda) = f(\mathbf{y} | \lambda) = \int f(\mathbf{y}, \boldsymbol{\theta} | \lambda)\text{d}\boldsymbol{\theta} = \int f(\mathbf{y} | \boldsymbol{\theta}) f(\boldsymbol{\theta} |\lambda)\text{d}\boldsymbol{\theta}.\]

This corresponds to the maximum likelihood method applied to the smoothing parameter. Let us explicitly rewrite the expressions of \(f(\mathbf{y} | \boldsymbol{\theta})\) and \(f(\boldsymbol{\theta} |\lambda)\) introduced previously:

\[\begin{aligned} f(\mathbf{y} | \boldsymbol{\theta}) &= \sqrt{\frac{|W|_{+}}{(2\pi)^{n_*}}}\exp\left(- \frac{1}{2}(\mathbf{y} - \boldsymbol{\theta})^{T}W(\mathbf{y} - \boldsymbol{\theta})\right) \\ f(\boldsymbol{\theta} |\lambda) &= \sqrt{\frac{|P_{\lambda}|_{+}}{(2\pi)^{p - q}}} \exp\left(- \frac{1}{2}\boldsymbol{\theta}^{T}P_{\lambda} \boldsymbol{\theta}\right) \end{aligned}\]

where \(|A|_{+}\) denotes the product of the non-zero eigenvalues of \(A\), \(n_*\) is the number of non-zero diagonal elements of \(W\), and \(q\) is the number of zero eigenvalues of \(P_{\lambda}\) (\(q = q_x \times q_z\) in the two-dimensional case). Based on the Taylor expansion performed to obtain credibility intervals, let us recall that:

\[ \ln f(\mathbf{y}, \boldsymbol{\theta} | \lambda) = \ln f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) + \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})^{T} (W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda}) \]

which leads to:

\[\begin{aligned} \mathcal{L}^m_\text{norm}(\lambda) &= \int \exp[\ln f(\mathbf{y}, \boldsymbol{\theta} | \lambda)]\text{d}\boldsymbol{\theta} \\ &= f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \int \exp\left[- \frac{1}{2}(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda})^{T}(W + P_{\lambda})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}_{\lambda}) \right]\text{d}\boldsymbol{\theta} \\ &= f(\mathbf{y}, \hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \sqrt{\frac{(2\pi)^p}{|W + P_{\lambda}|}} \\ &= f_\mathbf{y}(\mathbf{y} | \hat{\boldsymbol{\theta}}_{\lambda}) f_{\boldsymbol{\theta}}(\hat{\boldsymbol{\theta}}_{\lambda} | \lambda) \sqrt{\frac{(2\pi)^p}{|W + P_{\lambda}|}} \\ &= \sqrt{\frac{|W|_{+}| P_{\lambda} |_{+}}{(2\pi)^{n_* - q}|W + P_{\lambda}|}} \exp\left(- \frac{1}{2}\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) + \hat{\boldsymbol{\theta}}_{\lambda}^T P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda}\right]\right). \end{aligned}\]

The associated log-likelihood can be expressed as follows:

\[\begin{aligned} \ell^m_\text{reg}(\lambda) = - \frac{1}{2}&\left[(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda})^{T}W(\mathbf{y} -\hat{\boldsymbol{\theta}}_{\lambda}) + \hat{\boldsymbol{\theta}}_{\lambda}^{T}P_{\lambda} \hat{\boldsymbol{\theta}}_{\lambda}\right. \\ &\left.- \ln|W|_{+} - \ln|P_{\lambda}|_{+} + \ln|W + P_{\lambda}| + (n_* - q)\ln(2\pi)\right]. \end{aligned}\]

Once the selection of \(\lambda\)
has been determined using the estimator \(\hat{\lambda} =
\underset{\lambda}{\text{argmax}}\: \ell^m_\text{reg}(\lambda)\),
the lack of an explicit solution to this equation forces us to resort to
numerical methods for its resolution. The Newton algorithm could once
again be employed here and is a robust choice. This approach was notably
adopted by Wood (2011). However,
explicitly calculating the derivatives of the likelihood \(\ell^m_\text{reg}\) is rather difficult
from an operational perspective. Instead, we will use general heuristics
such as those provided by Brent (1973) and
John A. Nelder and Mead (1965), which are
applicable to any sufficiently smooth function. These heuristics do not
require derivative calculations and are implemented in the
`optimize`

and `optim`

functions of the
statistical programming language \(\mathsf{R}\).

See the upcoming paper available here

Akaike, Hirotsugu. 1973. “Information Theory and an Extension of
the Maximum Likelihood Principle.” In *2nd International
Symposium on Information Theory, 1973*.

Anderssen, RS, and Peter Bloomfield. 1974. “A Time Series Approach
to Numerical Differentiation.” *Technometrics* 16 (1):
69–75.

Brent, Richard P. 1973. “Algorithms for Minimization Without
Derivatives, Chap. 4.” Prentice-Hall, Englewood Cliffs, NJ.

Carballo, Alba, Maria Durban, and Dae-Jin Lee. 2021.
“Out-of-Sample Prediction in Multidimensional p-Spline
Models.” *Mathematics* 9 (15): 1761.

Henderson, Robert. 1924. “A New Method of Graduation.”
*Transactions of the Actuarial Society of America* 25: 29–40.

Kauermann, Göran. 2005. “A Note on Smoothing Parameter Selection
for Penalized Spline Smoothing.” *Journal of Statistical
Planning and Inference* 127 (1-2): 53–69.

Marra, Giampiero, and Simon N Wood. 2012. “Coverage Properties of
Confidence Intervals for Generalized Additive Model Components.”
*Scandinavian Journal of Statistics* 39 (1): 53–74.

Nelder, John A, and Roger Mead. 1965. “A Simplex Method for
Function Minimization.” *The Computer Journal* 7 (4):
308–13.

Nelder, John Ashworth, and Robert WM Wedderburn. 1972.
“Generalized Linear Models.” *Journal of the Royal
Statistical Society: Series A (General)* 135 (3): 370–84.

Patterson H. D., Thompson R. 1971. “Recovery of Inter-Block
Information When Block Sizes Are Unequal.” *Biometrika*
58: 545–54.

Reiss, Philip T, and R Todd Ogden. 2009. “Smoothing Parameter
Selection for a Class of Semiparametric Linear Models.”
*Journal of the Royal Statistical Society: Series B (Statistical
Methodology)* 71 (2): 505–23.

Schwarz, Gideon. 1978. “Estimating the Dimension of a
Model.” *The Annals of Statistics*, 461–64.

Wahba, Grace. 1980. *Spline Bases, Regularization, and Generalized
Cross Validation for Solving Approximation Problems with Large
Quantities of Noisy Data*. University of WISCONSIN.

———. 1985. “A Comparison of GCV and GML for Choosing the Smoothing
Parameter in the Generalized Spline Smoothing Problem.” *The
Annals of Statistics*, 1378–1402.

Whittaker, Edmund T. 1922. “On a New Method of Graduation.”
*Proceedings of the Edinburgh Mathematical Society* 41: 63–75.

Wood, Simon N. 2011. “Fast Stable Restricted Maximum Likelihood
and Marginal Likelihood Estimation of Semiparametric Generalized Linear
Models.” *Journal of the Royal Statistical Society: Series B
(Statistical Methodology)* 73 (1): 3–36.