intSDM: R package to produce species distribution models in a reproducible framework

2023-10-20

Introduction

In support of our manuscript, we developed an R (R Core Team 2023) package to help construct integrated species distribution models (ISDMs) from disparate datasets in a simple and reproducible framework. This R Markdown document presents an illustration of the package by creating an ISDM for red-listed plant species obtained via the Vascular Plant Field Notes survey program in Norway, as well as citizen-science data obtained from Global Biodiversity Information Facility (GBIF). The first step in exploring this document is to download the package using the following script:


library(intSDM)
library(INLA)

intSDM has two main functions: startWorkflow() and sdmWorkflow(). The first of which is designed to setup and specify all the individual components of the workflow using different slot functions. The functions related to this object include:

Function name Function use
.$plot() Plot data and other objects required for the model.
.$addStructured() Add data not available on GBIF.
.$addMesh() Create an inla.mesh object.
.$addGBIF() Add data from GBIF.
.$addArea() Specify sampling domain.
.$addCovariates() Add spatial covariates.
.$crossValidation() Specify the cross-validation method.
.$modelOptions() Add R-INLA (Martins et al. 2013), inlabru (Bachl et al. 2019) and PointedSDMs (Mostert and O’Hara 2023) options.
.$specifySpatial() Add penalizing complexity priors to the spatial effects.
.$biasFields() Specify an additional spatial effect for a dataset.
.$workflowOutput() Specify the output of the workflow.
.$obtainMeta() Obtain metadata for the occurrence records.

sdmWorkflow() implements the workflow based on the objects added in startWorkflow(). The output of this function is a list of objects specified in .$workflowOutput().

Workflow setup

To start the workflow, we need to specify the coordinate reference system (CRS) considered for the analysis as well as the species used. The three species selected for this analysis from the Vascular Plant Field Notes (arnica montana, fraxinus excelsior and ulmus glabra) have records predominantly spread across the southern and eastern part of Norway. However the species ulmus glabra (which has the largest spread of the three species selected), has some of the records approaching the middle and middle-upper parts of Norway.

The other arguments (saveOptions and Save) should be used if the user wants the objects to be saved in a folder created by the function.


workflow <- startWorkflow(
        Projection = '+proj=utm +zone=32 +ellps=WGS84 +datum=WGS84 +units=m +no_defs',
        Species = c("Fraxinus_excelsior", "Ulmus_glabra", "Arnica_montana"),
        saveOptions = list(projectName =  'Vascular'), Save = FALSE
        )

Select area

Next we specify the study domain for the study: in this case Norway. This can be achieved using the countryName argument which will then access the object from the giscoR (Hernangómez 2023) R package. The argument is used to specify any additional arguments for giscoR::gisco_get_countries() function (in this case, resolution). If you have your own layer you would like to specify for the workflow (for example, a non-country layer), this may be added using the Object argument.

We then use .$plot() to see what the boundary looks like.


workflow$addArea(countryName = 'Norway', resolution = '60')
workflow$plot()

Adding occurrence data

Species’ occurrence data is certainly the most important component of a SDM, and intSDM has two slot functions to help you add data into the workflow: .$addGBIF() and .$addStructured().

The former of which uses the rgbif package to download data directly from GBIF. For this function, we need to specify the name of the dataset (datasetName) and the type of the dataset (datasetType) – which can be one of PO, PA or Counts. The argument is used to specify any addiditional arguments for rgbif::occ_data() (Chamberlain, Oldoni, and Waller 2022) (in this case, limit and datasetKey). If datasetType = 'PA', absences may be generated using generateAbsences = TRUE. This will treat the obtained data as a checklist survey data: combining all the sampling locations for the species in the dataset, and creating absences when a given species did not occur in a given region.

For this example we consider three sources of data. The Vascular Plant Field Notes is a collection of observations provided by the Norwegian University of Science and Technology’s (NTNU) (Norwegian University of Science and Technology 2023) University Museum and the University of Oslo (UiO) (University of Oslo 2023), containing records of standardized cross-lists of most vascular plants found in Norway. We treat these two datasets as detection/non-detection data, generating absences in sampling locations where the species does not occur.

The other source of data considered comes from the Norwegian Species Observation service (published by Artsdatabanken) (The Norwegian Biodiversity Information Centre 2023). This data is a collection of citizen science records – and as a result we treat it as presence-only data.


workflow$addGBIF(datasetName = 'NTNU', 
                 datasetType = 'PA',
                 limit = 10000,
                 coordinateUncertaintyInMeters = '0,50',
                 generateAbsences = TRUE, 
                 datasetKey = 'd29d79fd-2dc4-4ef5-89b8-cdf66994de0d')

workflow$addGBIF(datasetName = 'UiO',
                 datasetType = 'PA',
                 limit = 10000,
                 coordinateUncertaintyInMeters = '0,50',
                 generateAbsences = TRUE,
                 datasetKey = 'e45c7d91-81c6-4455-86e3-2965a5739b1f')

workflow$addGBIF(datasetName = 'CZ', 
                 datasetType = 'PO',
                 coordinateUncertaintyInMeters = '0,50',
                 limit = 10000,
                 datasetKey = 'b124e1e0-4755-430f-9eab-894f25a9b59c')

workflow$plot(Species = TRUE)

Adding covariate data

Covariate data may be added to the model using .$addCovariates(). Layers from WorldClim (Fick and Hijmans 2017) may be accessed using the worldClim argument. This in turn uses the geodata (Hijmans et al. 2023) R package to obtain spatRaster objects of the covariates cropped around the study domain. Other covariate layers may be added using the Object argument.


workflow$addCovariates(worldClim = 'tavg', res = 5, Function = scale)
workflow$plot(Covariates = TRUE)

Metadata

We can then view the metadata for the obtained occurrence records using the .$obtainMeta() function, which will give us the citation for the datasets used in this workflow.


workflow$obtainMeta()

Creating an inla.mesh object

One of the objects required for our model is an inla.mesh object, which we will use in the approximation of our spatial random fields. The .$addMesh() function’s argument uses INLA::inla.mesh.2d() to create this object.


workflow$addMesh(cutoff = 20000,
                 max.edge = c(60000, 80000),
                 offset= 100000)

workflow$plot(Mesh = TRUE)

Specify priors

Furthermore we also used penalizing complexity (PC) priors in our model, which are designed to control the spatial range and standard deviation in the GRF’s Matérn covariance function in order to reduce over-fitting in the model (Simpson et al. 2017).


workflow$specifySpatial(prior.range = c(300000, 0.05),
                        prior.sigma = c(50, 0.2))

Bias field

We specify an additional spatial effect for the citizen science data using .$biasFields() to account biases in the collection process (Simmonds et al. 2020).


workflow$biasFields('CZ')

Model options

Finally we specify some options used in INLA::inla(), as well as the output from our workflow – in this case we want plots of the log-intensity for each species.


workflow$workflowOutput('Maps')

workflow$modelOptions(INLA = list(control.inla=list(int.strategy = 'eb',
                                                    cmin = 0),
                                  safe = TRUE,
                                  inla.mode = 'experimental'))

Running workflow

The workflow is then implemented using the sdmWorkflow() function. Due to the lengthy time it requires to produce this map, inference is not made in this vignette. However the script is available below for the user to run the model themselves.


Maps <- sdmWorkflow(workflow)
Maps$Fraxinus_excelsior$Maps
Maps$Ulmus_glabra$Maps
Maps$Arnica_montana$Maps
Bachl, Fabian E, Finn Lindgren, David L Borchers, and Janine B Illian. 2019. “Inlabru: An r Package for Bayesian Spatial Modelling from Ecological Survey Data.” Methods in Ecology and Evolution 10 (6): 760–66. https://doi.org/10.1111/2041-210X.13168.
Chamberlain, Scott, Damiano Oldoni, and John Waller. 2022. “Rgbif: Interface to the Global Biodiversity Information Facility API.” https://doi.org/10.5281/zenodo.6023735.
Fick, Stephen E, and Robert J Hijmans. 2017. “WorldClim 2: New 1-Km Spatial Resolution Climate Surfaces for Global Land Areas.” International Journal of Climatology 37 (12): 4302–15. https://doi.org/10.1002/joc.5086.
Hernangómez, Diego. 2023. giscoR: Download Map Data from GISCO API - Eurostat (version 0.3.3). https://doi.org/10.5281/zenodo.4317946.
Hijmans, Robert J., Márcia Barbosa, Aniruddha Ghosh, and Alex Mandel. 2023. Geodata: Download Geographic Data. https://CRAN.R-project.org/package=geodata.
Martins, Thiago G, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2013. Bayesian computing with INLA: new features.” Computational Statistics and Data Analysis 67: 68–83. https://doi.org/10.1016/j.csda.2013.04.014.
Mostert, Philip S, and Robert B O’Hara. 2023. PointedSDMs: An R Package to Help Facilitate the Construction of Integrated Species Distribution Models.” Methods in Ecology and Evolution 14 (5): 1200–1207. https://doi.org/10.1111/2041-210X.14091.
Norwegian University of Science and Technology. 2023. “Vascular Plant Herbarium TRH, NTNU University Museum. Version 30.2217.” The Global Biodiversity Information Facility. https://doi.org/10.15468/zrlqok.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Simmonds, Emily G, Susan G Jarvis, Peter A Henrys, Nick JB Isaac, and Robert B O’Hara. 2020. Is more data always better? A simulation study of benefits and limitations of integrated distribution models.” Ecography 43 (10): 1413–22. https://doi.org/10.1111/ecog.05146.
Simpson, Daniel, Håvard Rue, Andrea Riebler, Thiago G Martins, and Sigrunn H Sørbye. 2017. “Penalising Model Component Complexity: A Principled, Practical Approach to Constructing Priors.” Statistical Science 32 (1): 1–28. https://doi.org/10.1214/16-STS576.
The Norwegian Biodiversity Information Centre. 2023. “Norwegian Species Observation Service. Version 1.231.” The Global Biodiversity Information Facility. https://doi.org/10.15468/zjbzel.
University of Oslo. 2023. “Vascular Plant Herbarium, Oslo (o) UiO. Version 1.2160.” The Global Biodiversity Information Facility. https://doi.org/10.15468/wtlymk.