Predicting the distribution of species across space and time is a fundamental piece for answering a multitude of questions surrounding the field of ecology. Consequentially, numerous statistical tools have been developed to assist ecologists in making these types of predictions; the most common of which is certainly species distribution models (SDMs).

These types of models are heavily reliant on the quality and quantity of data available across multiple time periods in order to produce meaningful results. Thankfully, the \(21^{st}\) century has been associated with a boom in digital technology and the collapse of data storage costs achieved, allowing us to see data on the environment and species occurrences grow at unprecedented rates. The problem however is that all these data are disparate in nature; they all have their own sampling protocols, assumptions and attributes, making joint analysis of them difficult.

On account of this issue, the field of integrated species distribution modelling has progressed considerably over the last few years – and now the associated methods are well-developed, and the benefits of using such models (over single dataset or simple data-pooling models) are apparent.

A major handicap to integrated modeling lies in the fact that the tools required to make inference with these models have not been established – especially with regards to general software to be used by ecologists. This in turn has stagnated the growth of applying these models to real data.

In light of this impediment to the field of integrated modelling, we
decided to make *PointedSDMs*, an *R* package to help
simplify the modelling process of integrated species distribution models
(ISDMs) for ecologists. It does so by using the *INLA* framework
(Rue, Martino, and Chopin 2009) to
approximate the models and by constructing wrapper functions around
those provided in the *R* package, *inlabru* (Bachl et al. 2019).

This *R markdown* file illustrates an example of modelling an
ISDM with *PointedSDMs,* using five disparate datasets containing
species *Setophaga* collected around Pennsylvania state (United
States of America). The first part of this file contains a brief
introduction to the statistical model used in ISDMs, as well as an
introduction to the included functions in the package. The second part
of this file uses *PointedSDMs* to run the ISDM for the provided
data. We note that, for this particular vignette, no inference was made
due to computational intensity of the model. However the *R*
script and data are provided below so that the user may carry out
inference.

The goal of our statistical model is to infer the “true” distribution of our species’ using the observations we have at hand, as well as environmental covariates describing the studied area, \(\boldsymbol{X}\), and other parameters describing the species, \(\boldsymbol{\theta}\). To do so, we construct a hierarchical state-space model composing of two parts: the underlying process model as well as observation models.

The process model is a stochastic process which describes how the points are distributed in space. The most commonly assumed process model is the Log-Gaussian Cox process (LGCP), which is characterized by the intensity function \(\lambda(s)\): such that a higher a higher intensity at a location implies the species is more abundant there, as well as a Gaussian random field used to account for all the environmental covariates not included in the model.

The observation models give a statistical description of the data collection process, and are chosen for each dataset based on their underlying sampling protocols (see (Isaac et al. 2020) for a detailed review of the types of observation models typically used in integrated models). As a result, we can write the likelihood of the observation model for a given dataset, \(Y_i\) as \(\mathcal{L} \left( Y_i \mid \lambda(s), \theta_i \right)\).

Therefore, given a collection of heterogeneous datasets \(\{Y_1, Y_2,...,Y_n\}\), the full-likelihood of the statistical process is given by:\[\mathcal{L} \left( \boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{\theta}, \boldsymbol{\phi} \right) = p\left( \lambda(s), \boldsymbol{X}, \boldsymbol{\phi} \right) \times \prod_{i=1}^n \mathcal{L} \left(Y_i \mid \lambda(s), \theta_i\right),\]where \(\boldsymbol\phi\) are parameters assumed for the underlying process.

*PointedSDMs* is designed to simplify the modelling process of
integrated species distribution models by using wrapper functions around
those found in *R-INLA* and *inlabru*. The simplification
is regarding four key steps in the statistical-modelling process:

data preparation,

model fitting,

cross-validation,

prediction.

The first function of interest with regards to these steps is
`intModel`

, which is designed to create objects required by
*inlabru*, assign the relevant information to individual objects
used in model specification, as well as structure the provided datasets
together into a homogenized framework.

The output of this function is an *R6* object, and so there
are several slot functions included to help specify the model at a finer
scale. Use `?dataSDM`

to get a comprehensive description of
these slot functions.

The next function used is `fitISDM`

which takes the data
object created with `intModel`

and runs the integrated model.
The output of this function is in essence an *inlabru* model
output with additional information to be used by the sequential
functions.

Spatial blocked cross-validation may be performed using the
`blockedCV`

function – which iteratively calculates and
averages the deviance information criteria (DIC) for models run without
data from the selected spatial block. Before this function is used,
`.$spatialBlock()`

from the object produced by
`intModel`

is required in order to specify how the data
should be spatially blocked (see below for an example).

`datasetOut`

provides cross-validation for integrated
models by running the full model with one less dataset, and calculating
a score measuring the relative importance of the left out dataset in the
full model.

Finally prediction of the full model may be completed using the
`predict`

function; the arguments here are mostly related to
specifying which components are required to be predicted, however the
function can act identically to the *inlabru*
`predict`

function if need be. After predictions are
complete, plots may be made using the generic `plot`

function.

This example aims to predict the distribution of three species of
genus *setophaga* across Pennsylvania state. This example is
notable in integrated modelling since it has been used by two seminal
papers in the field, namely those by: Isaac et
al. (2020) and Miller et al.
(2019). This file extends the example by adding two additional
datasets containing a further two species.

The first step in our analysis is to load in the packages required.

```
library(INLA)
library(inlabru)
library(USAboundaries)
library(sp)
library(sf)
library(blockCV)
library(ggmap)
library(sn)
library(terra)
library(RColorBrewer)
library(cowplot)
library(knitr)
library(kableExtra)
library(dplyr)
library(spocc)
```

Finally, additional objects required by *PointedSDMs* and the
*R-INLA* (Martins et al. 2013) and
*inlabru* (Bachl et al. 2019)
packages need to be assembled.

A *SpatialPolygons* object of Pennsylvania (obtained using the
*USAboundaries* (rOpenSci 2018)
package) is the first of these objects required, and it will be used to
construct an *inla.mesh* object as well as help with the plotting
of graphics. In order to create one of these polygons objects, we are
first required to define the projection reference system (see
`?CRS`

).

```
proj <- '+proj=longlat +datum=WGS84 +no_defs'
PA <- USAboundaries::us_states(states = "Pennsylvania")
#PA <- PA$geometry
PA <- as(PA, "sf")
st_crs(PA) <- proj
```

For our analysis, we used five datasets containing the geocoded
locations of our studied species’. The first three of these datasets are
present only datasets obtained via *eBird* through the
*spocc* package (Chamberlain 2021),
which is a toolkit used to obtain species’ observation data from a
selection of popular online data repositories (*GBIF*,
*Vernet*, *BISON*, *iNaturalist* and
*eBird*). These data were obtained using the following
script:

```
species <- c('caerulescens', 'fusca', 'magnolia')
dataSets <- list()
for (bird in species) {
raw_data <- spocc::occ(
query = paste('Setophaga', bird),
from = "gbif",
date = c("2005-01-01", "2005-12-31"),
geometry = PA)$gbif
rows <- grep("EBIRD", raw_data$data[[paste0('Setophaga_', bird)]]$collectionCode)
raw_data <- data.frame(raw_data$data[[1]][rows, ])
raw_data$Species_name <- rep(bird, nrow(raw_data))
data_sp <- st_as_sf(
x = raw_data[, names(raw_data) %in% c("longitude", "latitude", 'Species_name')],
coords = c('longitude', 'latitude'),
crs = proj)
dataSets[[paste0('eBird_', bird)]] <- data_sp[unlist(st_intersects(PA, data_sp)),]
}
```

The *PointedSDMs* package also contains two additional
structured datasets (*BBA, BBS*), which were both assumed to be
presence absence datasets in the study conducted by Isaac et al. (2020), containing the variable,
*NPres*, denoting the presence (or number of presences in the
*BBS* dataset) at each sampling location. However, we changed the
*NPres* variable name of the *BBS* dataset to
*Counts* in order to consider it a counts dataset for
illustrative purposes. No additional changes were made to the
*BBA* dataset, and so it was considered a presence absence
dataset as originally intended. These datasets may be loaded using the
following script:

Dataset name |
Sampling protocol | Number of observations |
Species name |
Source |
---|---|---|---|---|

BBS |
Counts | 45 | Caerulescens |
North American Breeding Bird Survey |

BBA |
Detection/nondetection | 5165 | Caerulescens |
Pennsylvania Breeding Bird Atlas |

eBird_caerulescens |
Present only | 264 | Caerulescens |
eBird |

eBird_magnolia |
Present only | 354 | Magnolia |
eBird |

eBird_fusca |
Present only | 217 | Fusca |
eBird |

Table 1: Table illustrating the different datasets used in the analysis.

Species distribution models study the relationship between our
in-situ species observations and the underlying environment, and so in
line with the study conducted by Isaac et al.
(2020), we considered two covariates: *elevation*,
describing the height above sea level (meters) and *canopy*,
describing the proportion of tree canopy covered in the area. These two
covariates were obtained from the Hollister and
Shah (2017) and Bocinsky et al.
(2019) R packages respectively. The script to obtain these
spatial covariates is provided in the same github repository as the
species occurrence data above; the only thing we did differently was
convert the covariates to a *spatRaster* object.

The *R-INLA* package requires a Delaunay triangulated mesh
used to approximate our spatial random fields, which is created by
supplying the `max.edge`

*,* `offset`

and
`cutoff`

arguments as well as our *SpatialPolygons*
map of Pennsylvania to the `inla.mesh.2d`

function. With this
mesh, `intModel`

will create integration points required by
our model using *inlabru*’s `fm_int`

function.

```
mesh <- inla.mesh.2d(boundary = inla.sp2segment(PA),
cutoff = 0.2,
max.edge = c(0.1, 0.24),
offset = c(0.1, 0.4),
crs = st_crs(proj))
mesh_plot <- ggplot() +
gg(mesh) +
ggtitle('Plot of mesh') +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
mesh_plot
```

`intModel`

The `intModel`

function is the first step in the modeling
of our species data. The aim of the function is to organize and
standardize our data to ensure that each process is correctly modeled in
the integrated framework. For this standardization to work, we specify
the response variable names of the counts data
(`responseCounts`

) and present absence
(`responsePA`

), the coordinate names
(`Coordinates`

) as well as the coordinate reference system
used (`Projection`

). In addition, the species variable name
(`speciesName`

) was also specified in order to model the
effects for the species found in the datasets.

```
spatial_data <- intModel(dataSets,
Coordinates = c('X', 'Y'),
Projection = proj, Mesh = mesh,
responsePA = 'NPres', responseCounts = 'Counts',
spatialCovariates = covariates, speciesName = 'Species_name')
```

As mentioned above, there are some slot functions now inside the
`spatial_data`

objects, which allow for a finer level of
customization of the components for the integrated model.

`.$plot`

Let’s have a look at a graphical representation of the species’
distribution, using the `.$plot()`

function within
`spatial_data`

.

```
spatial_data$plot(Boundary = FALSE) +
geom_sf(data = PA, fill = 'black', alpha = 0.15, lwd = 1.2) +
theme_bw() +
ggtitle('Plot of the datasets') +
theme(plot.title = element_text(hjust = 0.5))
```

This will create a plot of the location of the species based on their
dataset, however if we wanted to plot by species, we could specify this
with the `Species = TRUE`

argument.

`.$specifySpatial`

When conducting Bayesian analysis, the choice of prior distribution
is imperative. In our example, we chose Penalizing complexity priors
(Simpson et al. 2017), which are designed
to control the spatial range and marginal standard deviation in the
GRF’s Matérn covariance function in order to reduce over-fitting in the
model. This can be done with ease using the
`.$specifySpatial`

function, which will also update all the
components associated with this field automatically.

`.$addBias`

Presence only datasets (such as our *eBird* data) are often
noted to contain numerous biases, which in turn can heavily skew results
obtained from our model. A common approach to account for these biases
is by using a variable in the model such as sampling effort (some
examples include: distance traveled, number of observations per visit,
duration spent observing). However in the absence of such variables,
running a second spatial field in the model can improve performance of
the integrated model (Simmonds et al.
2020). To add a second spatial field to our model, we use the
`.$addBias`

function, which is part of the *intModel*
object created above.

`.$priorsFixed`

Suppose we knew *a priori* what the mean and precision values
of one of the fixed terms for a given species was: we could add this
knowledge to the model using the `.$priorsFixed`

function.

`.$changeComponents`

If we would like to see what the components required by
*inlabru* in our model are, we can use the
`.$changeComponents`

function with no arguments
specified.

This model is thus specified with: two environmental covariates, an
intercept term per species, a spatial field per species, the three bias
fields for the *presence only* data as well as a shared spatial
field between the different datasets. If we wanted to add/change or
remove a component entirely, this could be done with the
`addComponent`

and `removeComponent`

argument
respectively.

`.$spatialBlock`

`.$spatialBlock`

is used to set up spatial blocked
cross-validation for the model by assigning each point in the datasets a
block based on where the point is located spatially. For this example,
we chose four blocks (`k=4`

) for our model, based around a
2x2 grid (`rows = 2, cols = 2`

). See the figure below for an
illustration of the spatial block: the amount of data within each block
appears reasonably equal.

`blockedCV`

Spatial blocked cross-validation may then be produced using the
`blockedCV`

function. We specify
`control.inla = list(int.strategy = 'eb')`

in our
`options`

argument in order to speed up the model running
time.

```
spatialBlocked <- blockedCV(data = spatial_data, options = list(control.inla = list(int.strategy = 'eb')))
```

More so, we can compare the cross-validation score from this model to
one without the shared spatial effect (specified with
`pointsSpatial = FALSE`

in `intModel`

).

```
no_fields <- intModel(dataSets,
Coordinates = c('X', 'Y'),
pointsSpatial = NULL,
Projection = proj, Mesh = mesh,
responsePA = 'NPres', responseCounts = 'Counts',
spatialCovariates = covariates, speciesName = 'Species_name')
no_fields$spatialBlock(k = 4, rows = 2, cols = 2)
```

```
spatialBlocked_no_fields <- blockedCV(data = no_fields, options = list(control.inla = list(int.strategy = 'eb')))
```

Based on the DIC scores, we conclude that the model with the shared spatial field provides a better fit of the data.

`fitISDM`

The integrated model is easily run with the `fitISDM`

function as seen below by supplying the occurrence and *R-INLA*
objects (created above with `intModel`

).

```
joint_model <- fitISDM(data = spatial_data,
options = list(control.inla = list(int.strategy = 'eb')))
```

```
results_plot <- joint_model$summary.fixed %>%
mutate(species = gsub('_.*$','',
row.names(joint_model$summary.fixed))) %>%
mutate(coefficient = row.names(joint_model$summary.fixed))
coefficient_plot <- ggplot(results_plot, aes(x = coefficient, y = mean)) +
geom_hline(yintercept = 0, colour = grey(0.25), lty = 2) +
geom_point(aes(x = coefficient,
y = mean)) +
geom_linerange(aes(x = coefficient,
ymin = `0.025quant`,
ymax = `0.975quant`,
col = species),
lwd = 1) +
theme_bw() +
scale_colour_manual(values = c('#003f5c', '#bc5090','#ffa600')) +
theme(legend.position="bottom",
plot.title = element_text(hjust = 0.5)) +
ggtitle("95% credibility intervals of the fixed effects\n
for the three studied species") +
labs(x = 'Variable', y = 'Coefficient value') +
coord_flip()
coefficient_plot
```

`datasetOut`

*PointedSDMs* also includes the function
`datasetOut`

, which iteratively omits one dataset (and its
associated marks) out of the full model. For example, we can calculate
the effect on the other variables by leaving out the dataset
*BBA*, which contributes the most occurrences used in our
joint-likelihood model by a significant amount.

Furthermore, by setting `predictions = TRUE`

we are able
to calculate some cross-validation score by leaving out the selected
dataset. This score is calculated by predicting the covariate values of
the left-out using the reduced model (i.e the model with the dataset
left out), using the predicted values as an offset in a new model, and
then finding the difference between the marginal-likelihood of the full
model (i.e the model with all the datasets considered) and the
marginal-likelihood of the offset model.

`predict`

A significant part of SDMs is creating prediction maps to understand
the species’ spread. Predictions of the integrated SDM’s from
`fitISDM`

are made easy using the `predict`

function. By supplying the relevant components, the predict function
will create the formula required by *inlabru*’s
`predict`

function to make predictive maps. In this example,
we made predictions for the spatial fields of the species, on the linear
scale using 1000 samples.

`plot`

*PointedSDMs* also provides a general plotting function to
plot the maps using the predicted values obtained in the
`predict`

function. The plot below shows the predictions of
the three spatial fields of our species.

Bachl, Fabian E, Finn Lindgren, David L Borchers, and Janine B Illian.
2019. “Inlabru: An r Package for Bayesian Spatial Modelling from
Ecological Survey Data.” *Methods in Ecology and
Evolution* 10 (6): 760–66. https://doi.org/10.1111/2041-210X.13168.

Bocinsky, R Kyle, Dylan Beaudette, Scott Chamberlain, and Maintainer R
Kyle Bocinsky. 2019. “FedData: Functions to Automate Downloading
Geospatial Data Available from Several Federated Data Sources.”
https://cran.r-project.org/package=FedData.

Chamberlain, Scott. 2021. “Spocc: Interface to Species Occurrence
Data Sources.” https://cran.r-project.org/package=spocc.

Hollister, J, and Tarak Shah. 2017. “Elevatr: Access Elevation
Data from Various APIs.” https://cran.r-project.org/package=elevatr.

Isaac, Nick JB, Marta A Jarzyna, Petr Keil, Lea I Dambly, Philipp H
Boersch-Supan, Ella Browning, Stephen N Freeman, et al. 2020.
“Data Integration for Large-Scale Models of Species
Distributions.” *Trends in Ecology & Evolution* 35
(1): 56–67. https://doi.org/10.1016/j.tree.2019.08.006.

Martins, Thiago G, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2013.
“Bayesian Computing with INLA: New Features.”
*Computational Statistics & Data Analysis* 67: 68–83. https://doi.org/10.1016/j.csda.2013.04.014.

Miller, David AW, Krishna Pacifici, Jamie S Sanderlin, and Brian J
Reich. 2019. “The Recent Past and Promising Future for Data
Integration Methods to Estimate Species’ Distributions.”
*Methods in Ecology and Evolution* 10 (1): 22–37. https://doi.org/10.1111/2041-210X.13110.

rOpenSci. 2018. “USABoundaries.” https://github.com/ropensci/USAboundaries.

Rue, Håvard, Sara Martino, and Nicolas Chopin. 2009. “Approximate
Bayesian Inference for Latent Gaussian Models by Using Integrated Nested
Laplace Approximations.” *Journal of the Royal Statistical
Society: Series b (Statistical Methodology)* 71 (2): 319–92. https://doi.org/10.1111/j.1467-9868.2008.00700.x.

Simmonds, Emily G, Susan G Jarvis, Peter A Henrys, Nick JB Isaac, and
Robert B O’Hara. 2020. “Is More Data Always Better? A Simulation
Study of Benefits and Limitations of Integrated Distribution
Models.” *Ecography* 43 (10): 1413–22. https://doi.org/10.1111/ecog.05146.

Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, and
Sigrunn H Sørbye. 2017. “Penalising Model Component Complexity: A
Principled, Practical Approach to Constructing Priors.”
*Statistical Science* 32 (1): 1–28. https://doi.org/10.1214/16-STS576.