A modeling example

Julien Vollering

2020-05-04

This vignette is meant to introduce users to the MIAmaxent package by providing a worked example of a distribution modeling exercise. It shows how to use all of the top-level functions included in MIAmaxent in the order of a typical analysis. This vignette does NOT describe the theoretical underpinnings of the package. To learn more about the theory behind MIAmaxent, the user is referred to Halvorsen (2013) and Halvorsen et al. (2015), as well as other references listed in the documentation of the package.

Help pages for the package and for individual functions in the package can be accessed in R using the standard help command: ?"MIAmaxent" (after attaching the package using library(MIAmaxent)).

Abbreviations: EV = explanatory variable; DV = derived variable; RV = response variable; FOP = frequency of observed presence; AUC = area under the curve of the receiver operating characteristic plot


Introducing the data set

The data used in this vignette have been used to model the distribution of semi-natural grasslands in Østfold County, in southeastern Norway. The data set consists of 1059 locations where presence of semi-natural grasslands has been recorded, 13 environmental variables covering the extent of the study area, and 122 locations where the presence or absence of semi-natural grasslands has been recorded, independently of the presence-only records. The extent of the study area is about 4000 square kilometers, and the grain of the raster data is 500 meters (0.25 km2).

The data used in this vignette are included in the package as an example data set, so the results shown here can be directly replicated by executing the code as provided.

Before beginning the modeling exercise, it may be useful to see what some of the data look like in their geographical representation. We can use the raster package to plot the 1059 recorded presences on top of one of the environmental variable rasters:

library(raster)
#> Loading required package: sp
EV1 <- raster(list.files(system.file("extdata", "EV_continuous", 
                                     package="MIAmaxent"), full.names=TRUE)[1])
PO <- read.csv(system.file("extdata", "occurrence_PO.csv", package="MIAmaxent"))
plot(EV1, legend=FALSE)
points(PO$POINT_X, PO$POINT_Y, pch = 20, cex = 0.5, col = 'blue')

readData()

The starting point for modeling using MIAmaxent is a simple data object that contains occurrence data for the modeled target, and some number of explanatory variables (EVs). This data object must be formatted as a data frame, with the binary response variable (RV) representing occurrence in the first column, and corresponding EV values in subsequent columns. When the occurrence data consist of presence and absence records, these should be coded as “1” and “0” respectively. When the occurrence data consist of presence records only, presence locations are contrasted against locations with unknown occurrence, and the RV should be coded as “1” or “NA”. EVs may be continuous (numeric class) or categorical (factor class), as denoted in the data object.

The readData() function transforms data in CSV and ASCII raster file formats into a single data frame which serves as the starting point for modeling.

Users of the highly popular maxent.jar program for maximum entropy modeling are accustomed to data in a different format. Specifically, occurrence data is often in CSV file format, with presences records followed by coordinates, and explanatory data in ASCII raster file format. The readData() function makes it easy to read these data into the data object that is used in MIAmaxent. This function extracts values of the EVs at locations specified in the CSV file and properly formats these into the starting point for modeling. If the CSV file contains presence records only, then readData() also selects a random set of uninformed background locations for the data object. Alternatively, the user can specify a custom set of background locations by giving these in the CSV file.

We begin by creating our data object from file. Note that continuous and categorical environmental variables must be placed in separate directories:

library(MIAmaxent)
grasslandPO <- readData(
  occurrence=system.file("extdata", "occurrence_PO.csv", package="MIAmaxent"), 
  contEV=system.file("extdata", "EV_continuous", package="MIAmaxent"),
  catEV=system.file("extdata", "EV_categorical", package="MIAmaxent"),
  maxbkg=20000)

In this case, the number of uninformed background locations to be randomly selected (maxbkg=20000) was larger than the total number of raster cells in the study area, so all cells are included in the data object.

Most functions in MIAmaxent return console output. Therefore, it’s handy to assign function output to an object, so that you can manipulate that object further. If you forget, you can use ?.Last.value().

If we look at the resulting data object we see the response variable (with 1059 presence and 16420 background locations) along with 8 continuous and 5 categorical EVs:

str(grasslandPO)
#> 'data.frame':    17479 obs. of  14 variables:
#>  $ RV        : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pca1      : num  0.000504 0.000506 0.000554 0.00051 0.000486 ...
#>  $ prbygall  : num  0.00772 0.00134 0.003 0.01049 0.00092 ...
#>  $ prtilany  : num  0 0 0.3898 0.0884 0.1316 ...
#>  $ teraspif  : num  180 95 20 111 116 61 24 34 35 35 ...
#>  $ terdem    : num  0 0 0 12 0 0 0 11 26 39 ...
#>  $ terslpdg  : num  0 0.827 1.433 1.175 0.813 ...
#>  $ tersolrade: num  977 966 1018 956 960 ...
#>  $ tertpi09  : num  -3.54 -7.51 -6.11 -2.38 -15.16 ...
#>  $ geoberg   : Factor w/ 16 levels "0","2","3","4",..: 10 10 9 10 10 10 10 10 10 10 ...
#>  $ geolmja1  : Factor w/ 15 levels "11","12","15",..: 8 7 15 7 8 7 8 7 7 7 ...
#>  $ lcucor1   : Factor w/ 21 levels "111","112","121",..: 21 21 21 8 21 8 21 8 12 2 ...
#>  $ lcutilt4  : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 2 1 ...
#>  $ terslpps15: Factor w/ 6 levels "1","2","3","4",..: 1 1 1 6 1 1 1 1 6 6 ...
sum(grasslandPO$RV == 1, na.rm = TRUE)
#> [1] 1059
sum(is.na(grasslandPO$RV))
#> [1] 16420

IMPORTANT: A number of important issues for distribution modeling – such as accounting for sampling bias and setting study area extent – are not dealt with in MIAmaxent, and should be dealt with beforehand. Good modeling practice requires that these issues be considered carefully! See Guisan et al. (2017) for a full treatment of distribution modeling in R.


Examining patterns in occurrence

By its simplest definition, a distribution model examines the relationship between the modeled target and its environment. In this way, distribution modeling follows the long tradition of gradient analysis in vegetation ecology (Halvorsen, 2012). Therefore, before building an actual model, we should have some idea about what influence the environmental variables have on the occurrence of the target.

plotFOP()

We can use the plotFOP function to create a so-called Frequency of Observed Presence (FOP) plot. An FOP plot shows how commonly the target occurs across the range of the EV, and makes it possible to recognize patterns in frequency of occurrence. In theory, the relationship between a continuous EV and modeled target is expected to be unimodal, if the observed range of the EV is sufficiently large. In practice, the pattern seen in the FOP plot depends not only on the range of the EV — which is affected by the extent of the study area — but also the scaling of the EV.

Here we examine FOP plots for 2 of the continuous EVs:

teraspifFOP <- plotFOP(grasslandPO, "teraspif")
terslpdgFOP <- plotFOP(grasslandPO, "terslpdg")

The points in these FOP plots show the observed proportion of points in a given interval of the EV which contain presences. The red line is a local regression smoother which aims to summarize the pattern in the empirical FOP values. The grey distribution in the background is an approximation of the data density across the range of the EV.

Notice the difference in the scales of the FOP axes. EVs showing a larger interval on the FOP axis typically carry more explanatory power.

We can change the number of the number of intervals used to calculate FOP, or the neighborhood of the smoother, and we can access the plotted data directly:

terslpdgFOP <- plotFOP(grasslandPO, "terslpdg", span = 0.75, intervals = 20)

terslpdgFOP
#> $EVoptimum
#> [1] 6.382247
#> 
#> $FOPdata
#>                 int    n     intEV      intRV       loess
#> 1  (-0.00951,0.476] 2506 0.2601985 0.06304868  0.06692123
#> 2     (0.476,0.951] 3730 0.7188061 0.06729223  0.06253300
#> 3      (0.951,1.43] 3447 1.1799195 0.05860168  0.05922944
#> 4        (1.43,1.9] 2560 1.6513057 0.05507812  0.05698450
#> 5        (1.9,2.38] 1834 2.1238806 0.05507088  0.05588399
#> 6       (2.38,2.85] 1239 2.5967183 0.05004036  0.05601716
#> 7       (2.85,3.33]  830 3.0659710 0.06626506  0.05736699
#> 8        (3.33,3.8]  480 3.5487974 0.06666667  0.05992171
#> 9        (3.8,4.28]  318 4.0297763 0.05974843  0.06412131
#> 10      (4.28,4.76]  202 4.4976457 0.05445545  0.06853793
#> 11      (4.76,5.23]  131 4.9892501 0.08396947  0.07204624
#> 12      (5.23,5.71]   72 5.4752495 0.06944444  0.07506689
#> 13      (5.71,6.18]   44 5.9547193 0.11363636  0.07805324
#> 14      (6.18,6.66]   44 6.3822473 0.04545455  0.07853477
#> 15      (6.66,7.13]   21 6.8873543 0.19047619  0.07539665
#> 16      (7.13,7.61]   12 7.3856734 0.00000000  0.06830319
#> 17      (7.61,8.08]    5 7.7893479 0.00000000  0.05917056
#> 18      (8.08,8.56]    2 8.1287899 0.00000000  0.05042162
#> 19      (8.56,9.04]    1 9.0068598 0.00000000  0.01422974
#> 20      (9.04,9.52]    1 9.5107498 0.00000000 -0.01200278

Based on this FOP plot, the occurrence of semi-natural grasslands seems to be unimodally related to ‘terslopdg’ (terrain slope) with a maximum at around 6.

Now we examine FOP plots for one of the categorical EVs:

geobergFOP <- plotFOP(grasslandPO, 10)

We see that geoberg type 4 has the highest rate of observed presence, followed by type 2, and then types 3 and 28. If we look more closely however, we notice also that geoberg type 4 is sampled very rarely (see grey bars), with only 6 locations falling into that category:

geobergFOP
#> $EVoptimum
#> [1] 4
#> Levels: 0 2 3 4 5 21 22 26 27 28 29 35 40 62 82 85
#> 
#> $FOPdata
#>    level    n    levelRV
#> 1      0    5 0.00000000
#> 2      2   20 0.45000000
#> 3      3   74 0.28378378
#> 4      4    6 0.50000000
#> 5      5  456 0.10526316
#> 6     21 3448 0.05974478
#> 7     22   46 0.06521739
#> 8     26   14 0.00000000
#> 9     27   21 0.09523810
#> 10    28   58 0.27586207
#> 11    29    6 0.00000000
#> 12    35  369 0.11382114
#> 13    40    1 0.00000000
#> 14    62 8183 0.05230356
#> 15    82 4445 0.05939258
#> 16    85  327 0.05198777

If geoberg type 4 had shown a high FOP value and a large number of observations, the uncertainty associated with its FOP value would be lower and its likelihood of being selected in the model would be increased.

It’s useful to examine FOP plots for all candidate explanatory variables (EVs) before building a model.

Looking at FOP plots should help the modeler decide which EVs are likely to have greatest explanatory power, and gives an idea of the strength and shape of the relationships between the EVs and RV.


Transforming explanatory variables (EVs)

To fit the many different kinds of relationships between explanatory and response variables, we need to transform the EVs. This means that we create new “derived” variables (DVs) from the original EVs. Another way of thinking about this is to put it in terms of rescaling; we adjust the scale of the EV in many different ways in order to check which scaling is most ecologically relevant to the occurrence of the modeled target.

deriveVars()

The deriveVars() function produces DVs from EVs by 7 different transformation types: linear, monotonous, deviation, forward hinge, reverse hinge, threshold, and binary (Halvorsen et al., 2015). The first 6 of these are relevant for continuous variables and the binary transformation is relevant only for categorical variables. Different types of transformations can be turned on or off to balance model complexity with model fit.

For the spline-type transformations (forward hinge, reverse hinge, threshold) an endless number of different transformations are possible, so by default the function produces 20 of each, and then chooses those which explain the most variation in the RV. This means that 20 models are built and evaluated for each combination of EV and spline transformation, so running deriveVars() with these transformation types turned on can take a bit of time — depending on the size of the data set.

Here we produce all types of DVs from our EVs:

grasslandDVs <- deriveVars(grasslandPO, 
                           transformtype = c("L","M","D","HF","HR","T","B"))
#> Pre-selecting forward hinge transformations of pca1
#> Pre-selecting reverse hinge transformations of pca1
#> Pre-selecting threshold transformations of pca1
#> Pre-selecting forward hinge transformations of prbygall
#> Pre-selecting reverse hinge transformations of prbygall
#> Pre-selecting threshold transformations of prbygall
#> Pre-selecting forward hinge transformations of prtilany
#> Pre-selecting reverse hinge transformations of prtilany
#> Pre-selecting threshold transformations of prtilany
#> Pre-selecting forward hinge transformations of teraspif
#> Pre-selecting reverse hinge transformations of teraspif
#> Pre-selecting threshold transformations of teraspif
#> Pre-selecting forward hinge transformations of terdem
#> Pre-selecting reverse hinge transformations of terdem
#> Pre-selecting threshold transformations of terdem
#> Pre-selecting forward hinge transformations of terslpdg
#> Pre-selecting reverse hinge transformations of terslpdg
#> Pre-selecting threshold transformations of terslpdg
#> Pre-selecting forward hinge transformations of tersolrade
#> Pre-selecting reverse hinge transformations of tersolrade
#> Pre-selecting threshold transformations of tersolrade
#> Pre-selecting forward hinge transformations of tertpi09
#> Pre-selecting reverse hinge transformations of tertpi09
#> Pre-selecting threshold transformations of tertpi09

Turn write on and (optionally) specify a directory to save the transformation functions produced by deriveVars to file.

The output of deriveVars() is a list consisting of 2 parts:

Both list elements also contain the RV vector.

In our grasslands analysis, the contents of the list items look like this:

summary(grasslandDVs$dvdata)
#>            Length Class      Mode   
#> RV         17479  -none-     numeric
#> pca1          10  data.frame list   
#> prbygall       5  data.frame list   
#> prtilany       7  data.frame list   
#> teraspif       8  data.frame list   
#> terdem         8  data.frame list   
#> terslpdg       5  data.frame list   
#> tersolrade     5  data.frame list   
#> tertpi09       8  data.frame list   
#> geoberg       16  data.frame list   
#> geolmja1      15  data.frame list   
#> lcucor1       21  data.frame list   
#> lcutilt4       2  data.frame list   
#> terslpps15     6  data.frame list
head(summary(grasslandDVs$transformations))
#>                 Length Class  Mode    
#> RV              17479  -none- numeric 
#> pca1_L_transf       1  -none- function
#> pca1_M_transf       1  -none- function
#> pca1_D05_transf     1  -none- function
#> pca1_D1_transf      1  -none- function
#> pca1_D2_transf      1  -none- function
length(grasslandDVs$transformations)
#> [1] 117

Note that the names of DVs indicate the type of transformation was used to create them. For example, “terslpdg_D2” is a deviation-type transformation of terslpdg, where the slope of the deviation is controlled by a parameter value of 2. Meanwhile, “terslpdg_HR4” is a reverse hinge transformation, with the knot in the 4th position.

Underscores (’_‘) are used to denote DVs, and colons (’:’) are used to denote interaction terms, so EV names must not contain these characters. EV names should also be unique.

To illustrate, look at how a given DV relates to the original, untransformed EV from which it was derived. Here we examine “terslpdg_D2” and “terslpdg_M”:

plot(grasslandPO$terslpdg, grasslandDVs$dvdata$terslpdg$terslpdg_D2, pch=20, 
     ylab="terslpdg_D2")
plot(grasslandPO$terslpdg, grasslandDVs$dvdata$terslpdg$terslpdg_M, pch=20,
     ylab="terslpdg_M")

“terslpdg_D2” is the squared deviation (hence D2) from the estimated optimum in terslpdg (around 6). “terslpdg_M” is a monotone (hence M) transformation of terslpdg — specifically a zero-skew transformation.


Selecting variables

With DVs ready, we are ready to begin the process of choosing which variables to include in the model. This is arguably the most critical step in the whole modeling process. Following the principle of parsimony, the aim in selecting variables is to explain as much variation in the RV as efficiently as possible. The greater the number of EVs or DVs included in the model, the more variation in the RV we can explain, but at the cost of model complexity. In the MIAmaxent package, the benefit of additional variation explained is weighed against the cost in model complexity using an inference test (Chi-squared or F). Variables are added to the model one by one in a process termed forward selection, and each new model is compared to its predecessor. Another term for this process is “nested model comparison.”

Rather than selecting from the full pool of DVs one by one, MIAmaxent performs variable selection in two parts:

  1. First, a set of DVs is selected separately for each individual EV. This is done using the selectDVforEV() function.
  2. Second, the EVs themselves — each represented by a parsimonious set of DVs — are selected. This is done using the selectEV() function.

Variable selection occurs hierarchically: first DVs for each EV, then EVs for the full model.

selectDVforEV()

The selectDVforEV() function performs forward selection of individual DVs for each EV. In other words, the function takes each EV one by one, and narrows the group of DVs produced from that EV to a set which explains variation in the RVs most efficiently.

The alpha argument to selectDVforEV() is used in the inference test during forward selection, setting the threshold for how much variation a DV must explain to be retained. A lower alpha results in a more conservative test, i.e. DVs must explain more variation to be selected.

Here we use selectDVforEV() on the grassland data set. Note the “$dvdata” following grasslandsDV, which identifies the list of DVs we made using deriveVars() (see ?deriveVars() Value).

grasslandDVselect <- selectDVforEV(grasslandDVs$dvdata, alpha = 0.001, quiet = TRUE)

The output is a list consisting of 2 parts:

Comparing the list of DVs before and after selection, we can see that selectDVforEV() reduced the number of DVs considerably:

summary(grasslandDVs$dvdata)
#>            Length Class      Mode   
#> RV         17479  -none-     numeric
#> pca1          10  data.frame list   
#> prbygall       5  data.frame list   
#> prtilany       7  data.frame list   
#> teraspif       8  data.frame list   
#> terdem         8  data.frame list   
#> terslpdg       5  data.frame list   
#> tersolrade     5  data.frame list   
#> tertpi09       8  data.frame list   
#> geoberg       16  data.frame list   
#> geolmja1      15  data.frame list   
#> lcucor1       21  data.frame list   
#> lcutilt4       2  data.frame list   
#> terslpps15     6  data.frame list
sum(sapply(grasslandDVs$dvdata[-1], length))
#> [1] 116
summary(grasslandDVselect$dvdata)
#>            Length Class      Mode   
#> RV         17479  -none-     numeric
#> pca1           1  data.frame list   
#> prbygall       1  data.frame list   
#> prtilany       1  data.frame list   
#> teraspif       1  data.frame list   
#> terdem         1  data.frame list   
#> tertpi09       1  data.frame list   
#> geoberg        5  data.frame list   
#> geolmja1       3  data.frame list   
#> lcucor1        2  data.frame list   
#> terslpps15     1  data.frame list
sum(sapply(grasslandDVselect$dvdata[-1], length))
#> [1] 17

Note also that the number of EVs was reduced from 13 to 10. EVs for which none of the DVs explained a significant amount of variation are dropped.

Here is an example of one of the (13) trails of forward DV selection. Shown is the trail for the “terdem” EV:

grasslandDVselect$selection$terdem
#>    round                variables m   Dsq   Chisq df        P
#> 1      1               terdem_D05 1 0.006 106.684  1 5.22e-25
#> 2      1                 terdem_M 1 0.005  97.665  1 4.95e-23
#> 3      1              terdem_HR18 1 0.005  97.541  1 5.28e-23
#> 4      1                 terdem_L 1 0.005  97.460  1 5.49e-23
#> 5      1                terdem_D1 1 0.005  97.373  1 5.74e-23
#> 6      1               terdem_HR4 1 0.005  93.987  1 3.18e-22
#> 7      1                terdem_D2 1 0.005  83.598  1 6.06e-20
#> 8      1                terdem_T8 1 0.003  55.579  1 8.98e-14
#> 9      2  terdem_D05 + terdem_HR4 2 0.006   0.779  1 3.77e-01
#> 10     2   terdem_D05 + terdem_D2 2 0.006   0.572  1 4.49e-01
#> 11     2   terdem_D05 + terdem_T8 2 0.006   0.344  1 5.58e-01
#> 12     2    terdem_D05 + terdem_M 2 0.006   0.154  1 6.95e-01
#> 13     2 terdem_D05 + terdem_HR18 2 0.006   0.094  1 7.60e-01
#> 14     2   terdem_D05 + terdem_D1 2 0.006   0.092  1 7.61e-01
#> 15     2    terdem_D05 + terdem_L 2 0.006   0.091  1 7.63e-01

The columns in this data frame represent: the round of variable selection (round), the names of the DVs included in the model (variables), the number of DVs in the model (m), the fraction of deviance explained (D2, sensu Guisan & Zimmerman, 2000), the Chi-squared statistic for the nested model comparison (Chisq), the degrees of freedom for the nested model comparison (df), and the p-value for the Chi-squared statistic under the specified degrees of freedom (P).

This table shows, for example, that in the first round of selection, one model was built for each of the 8 DVs, and that all of these explained enough variation to be retained for the second round of selection. Of all the DVs produced from “terdem”, “terdem_D05” explained the most variation in the RV. However, none of the remaining DVs explained enough of the remaining variation to be selected in addition to “terdem_D05” (P > alpha, in round 2).

selectEV()

Part 2 of variable selection using MIAmaxent is performed by the selectEV() function. This function is similar to the selectDVforEV() function, but instead of selecting a parsimonious set of DVs to represent each EV, selectEV() selects a parsimonious set of EVs to comprise the model. This proceeds in the same manner as selectDVforEV(), with nested model comparison using inference tests. Where selectDVforEV() adds a single DV at a time, selectEV() adds a single set of DVs (representing one EV) at a time.

The selectEV() function also differs from selectDVforEV() in another important way; it provides the option of testing interaction terms between selected EVs (interaction = TRUE). Interaction terms are useful when one EV changes the effect of another EV on the modeled target. In MIAmaxent, only first-order interactions are tested, and only between EVs both included in the model.

Here we use selectEV() on the grassland data set. Note the “$dvdata” following grasslandsDVselect, which identifies the list of selected DVs we made using selectDVforEV() (see ?selectDVforEV() Value).

grasslandEVselect <- selectEV(grasslandDVselect$dvdata, alpha = 0.001, 
                              interaction = TRUE)
#> Forward selection of 10 EVs
#> Round 1 complete.
#> Round 2 complete.
#> Round 3 complete.
#> Round 4 complete.
#> Round 5 complete.
#> Round 6 complete.
#> Forward selection of interaction terms between 6 EVs
#> Round 7 complete.

The output is a list consisting of 3 parts:

Compare the list of EVs before and after:

summary(grasslandDVselect$dvdata)
#>            Length Class      Mode   
#> RV         17479  -none-     numeric
#> pca1           1  data.frame list   
#> prbygall       1  data.frame list   
#> prtilany       1  data.frame list   
#> teraspif       1  data.frame list   
#> terdem         1  data.frame list   
#> tertpi09       1  data.frame list   
#> geoberg        5  data.frame list   
#> geolmja1       3  data.frame list   
#> lcucor1        2  data.frame list   
#> terslpps15     1  data.frame list
length(grasslandDVselect$dvdata[-1])
#> [1] 10
summary(grasslandEVselect$dvdata)
#>          Length Class      Mode   
#> RV       17479  -none-     numeric
#> prbygall     1  data.frame list   
#> geoberg      5  data.frame list   
#> lcucor1      2  data.frame list   
#> tertpi09     1  data.frame list   
#> geolmja1     3  data.frame list   
#> teraspif     1  data.frame list
length(grasslandEVselect$dvdata[-1])
#> [1] 6

We can see that selectEV() reduced the number of EVs to 6. To check whether any interaction terms have been included in the model, we can look at its formula:

grasslandEVselect$selectedmodel$formula
#> RV ~ prbygall_M + geoberg_BX3 + geoberg_BX28 + geoberg_BX2 + 
#>     geoberg_BX35 + geoberg_BX5 + lcucor1_BX312 + lcucor1_BX322 + 
#>     tertpi09_HR12 + geolmja1_BX20 + geolmja1_BX42 + geolmja1_BX60 + 
#>     teraspif_T6
#> <environment: 0x000000001d5f6778>

No interaction terms were significant in the model. These would be denoted by two DV names separated by a colon (e.g. ‘pr.bygall_M:tertpi09_HR12’). This can be confirmed by looking at the trail of forward selection of EVs and interaction terms. Here we show only the best model of each round:

grasslandEVselect$selection[!duplicated(grasslandEVselect$selection$round), ]
#>    round
#> 1      1
#> 11     2
#> 20     3
#> 27     4
#> 33     5
#> 36     6
#> 37     7
#>                                                                           variables
#> 1                                                                          prbygall
#> 11                                                               prbygall + geoberg
#> 20                                                     prbygall + geoberg + lcucor1
#> 27                                          prbygall + geoberg + lcucor1 + tertpi09
#> 33                               prbygall + geoberg + lcucor1 + tertpi09 + geolmja1
#> 36                    prbygall + geoberg + lcucor1 + tertpi09 + geolmja1 + teraspif
#> 37 prbygall + geoberg + lcucor1 + tertpi09 + geolmja1 + teraspif + lcucor1:tertpi09
#>     m   Dsq   Chisq df        P
#> 1   1 0.012 219.973  1 9.17e-50
#> 11  6 0.019 112.063  5 1.50e-22
#> 20  8 0.023  69.863  2 6.75e-16
#> 27  9 0.026  62.059  1 3.33e-15
#> 33 12 0.028  25.978  3 9.64e-06
#> 36 13 0.028  14.735  1 1.24e-04
#> 37 15 0.029  10.593  2 5.01e-03

As expected, the model with the interaction term is not significant under the alpha value of 0.001. Instead, the selected model is the model with 6 single-order terms.

The full selection trail can be saved as a CSV-format file by setting write = TRUE in selectEV(), and optionally specifying a dir. A careful examination of the trail of forward selection can be helpful in choosing the final model for a given application.

Note that above we started the forward selection procedure from a null model including none of the EVs. However, if we had an a priori reason to include one or more of the EVs in the model regardless of explanatory power, we could do so using the formula argument in selectEV(). Then forward selection proceeds with the specified model as a starting point.

chooseModel()

We may choose to plot some of the data in the forward selection trail to help us decide which model to use. For example, we may plot fraction of deviance explained (D2) against round number, to see how much better the model fit is for each added term:

plot(grasslandEVselect$selection$round, grasslandEVselect$selection$Dsq, 
     xlab="round", ylab="Dsq")

In this case, we may decide that a simpler model with only 5 EVs is better than the selected model, since the amount of deviance explained levels off after round 5. The EVs included in this model are: prbygall + geoberg + lcucor1 + tertpi09 + geolmja1. To proceed with this model, instead of the selectedmodel in grasslandEVselect, we can use the chooseModel() function:

grasslandmodel <- chooseModel(grasslandDVselect$dvdata, 
                              formula("~ prbygall + geoberg + lcucor1 + 
                                      tertpi09 + geolmja1"))

This is the model which we explore further below.


Exploring the model

After building a model by selecting which EVs to include, it is useful to explore what the modeled relationships actually look like. A straightforward way to do this is to look at model predictions over a range of explanatory data. This gives the modeler a sense of the relationships that the model has captured, and can help the modeler understand strengths and weaknesses in the model.

plotResp()

We can use the plotResp() function to create a response curve. A response curve plots model output across the range of one particular EV. When other variables are excluded from the model entirely, this is called a “single-effect response curve”.

Here we examine a single-effect response curve for the most important EV included in the model chosen above:

plotResp(grasslandmodel, grasslandDVs$transformations, "prbygall")

To assess how well the relationship between the EV and RV is captured by the model, it can be useful to examine FOP plots and response plots side-by-side:

prbygallFOP <- plotFOP(grasslandPO, "prbygall")
plotResp(grasslandmodel, grasslandDVs$transformations, "prbygall")

The values on the y-axes of the plots are not directly comparable, but we expect the shape of the response plot curve to mirror, more or less, the shape of the FOP plot curve.

Here is the same comparison for one of the categorical variables included in the model:

geolmja1FOP <- plotFOP(grasslandPO, "geolmja1")
plotResp(grasslandmodel, grasslandDVs$transformations, "geolmja1")

plotResp2()

The plotResp2() function is very similar to the plotResp() function in that it takes the same arguments and is used to produce response plots. However, plotResp2() plots “marginal-effect” responses rather than “single-effect” responses. A marginal-effect response plot shows model predictions across the range of one EV while holding all other EVs in the model constant at their mean value. The resulting curves are often similar in shape, if not in scale.

Here is the marginal-effect response plot for the same continuous EV shown above:

plotResp2(grasslandmodel, grasslandDVs$transformations, "prbygall")

calculateFTVA()

As a measure of variable contribution, we can calculate the Fraction of Total Variation Accounted for (FTVA) by each variable (Halvorsen et al., 2015). Here with our chosen model:

calculateFTVA(grasslandEVselect, formula("~ prbygall + geoberg + lcucor1 + 
                                      tertpi09 + geolmja1"))
#>   variable  FTVA
#> 1 prbygall 0.429
#> 2  geoberg 0.250
#> 3  lcucor1 0.143
#> 4 tertpi09 0.107
#> 5 geolmja1 0.071

Applying the model

For many modeling applications — although not all — the motivation for building a model is to obtain model predictions. Model predictions, or model output, can be used in many different ways: to make predictions about parts of the study area for which there exist no occurrence data, to predict relative probability of occurrence outside the study area, or to predict relative probability of occurrence in the past or future. In other words, any form of spatial or temporal interpolation or extrapolation is possible (although not always recommended!). The only requirement is that the values of the EVs are known for the time or space for which model output is desired.

projectModel()

The projectModel() function can be used to obtain model predictions for any kind of modeling application. As input it takes the model to be projected (model), the transformation functions used by the model (transformations), and explanatory data to project across (data). For “maxent”-type models, the projectModel returns model predictions in probability ratio output (PRO) format for each location represented in data. PRO format gives relative probability of presence, and PRO = 1 is a reference value that represents the probability of presence in an “average” location in the training data.

A value of PRO = 1 can be interpreted as the relative probability of presence of a location randomly drawn from the training data. Put another way, values above 1 represent higher-than-average probability of presence, and vice versa.

Here, we obtain model output across the extent of the study area as represented by the training data. When we enter the data argument as a RasterStack or RasterBrick object, projectModel() automatically shows model predictions in geographical space. Note that the names of the raster layers must match names of EVs in the model.

EVfiles <- c(list.files(system.file("extdata", "EV_continuous", package="MIAmaxent"), 
             full.names=TRUE),
             list.files(system.file("extdata", "EV_categorical", package="MIAmaxent"), 
             full.names=TRUE))
EVstack <- raster::stack(EVfiles)
names(EVstack) <- gsub(".asc", "", basename(EVfiles))
grasslandPreds <- projectModel(model = grasslandmodel,
                               transformations = grasslandDVs$transformations,
                               data = EVstack)

It is often easier to visualize probability-ratio values on a log scale, so we plot the raster object again as log2(PRO + 1):

plot(log2(grasslandPreds$output + 1))

Alternatively, if the data are supplied in a simple data frame, model predictions are appended to the data in column 1, and returned as list item output. In this case, the predictions can be mapped to geographical space manually, by including spatial coordinates in the data input to the projectModel(), and then using the rasterize() function in the raster package.

Additionally, the projectModel() function automatically checks how the range of the input explanatory data compares to the range of the data used to train the model. This is important because if the range of the input data lies outside the range of the training data (i.e. the model is extrapolated), the predictions are less reliable. The range of continuous variables is reported on the training data scale, from 0 to 1, and the range of categorical variables is reported as “inside” if all categories are represented in the training data:

grasslandPreds
#> $output
#> class      : RasterLayer 
#> dimensions : 201, 158, 31758  (nrow, ncol, ncell)
#> resolution : 500, 500  (x, y)
#> extent     : 249000, 328000, 6531500, 6632000  (xmin, xmax, ymin, ymax)
#> crs        : NA 
#> source     : memory
#> names      : PRO 
#> values     : 0.3538549, 24.70643  (min, max)
#> 
#> 
#> $ranges
#> $ranges$prbygall
#> [1] 0 1
#> 
#> $ranges$geoberg
#> [1] "inside"
#> 
#> $ranges$lcucor1
#> [1] "inside"
#> 
#> $ranges$tertpi09
#> [1] 0 1
#> 
#> $ranges$geolmja1
#> [1] "inside"

Since we projected the model across the training data, it makes sense that all the ranges are reported as [0,1] or “inside.”


Evaluating the model

There are many ways of evaluating the quality of a model, including assessing which EVs are selected and gauging the realism of response curves. Arguably the best way to evaluate a model, however, is to test how often its predictions are correct using occurrence data which are independent from the data used to train the model. This is strongly preferable to using training data because it indicates whether the model reflects patterns specific to the training data or generalized patterns for the modeled target (i.e. whether the model is overfitted). Independent, presence-absence test data are not always available, for example when projecting a model into the future, but when they are, they represent a high standard in model validation.

testAUC()

The evaluation metric which is used most commonly for distribution models and is implemented in MIAmaxent is Area Under the Curve (AUC) of the receiver operating characteristic plot. This is a metric which measures the performance of the model as a binary classifier over the full range of discrimination thresholds. When it is calculated using independent test data we refer to it as “testAUC”.

The testAUC() function takes a data frame of presence and absence locations, along with the corresponding values of EVs at those locations, and calculates testAUC. The evaluation data can easily be read into R using the readData() function with PA = TRUE if desired, as shown below.

In our example, 122 test locations in Østfold County, Norway, were visited to record presence or absence of semi-natural grasslands:

grasslandPA <- readData(
  occurrence = system.file("extdata", "occurrence_PA.csv", package="MIAmaxent"), 
  contEV = system.file("extdata", "EV_continuous", package="MIAmaxent"),
  catEV = system.file("extdata", "EV_categorical", package="MIAmaxent"),
  PA = TRUE, XY = TRUE)
head(grasslandPA)
#>   RV      x       y     pca1    prbygall   prtilany teraspif terdem terslpdg
#> 1  1 262750 6557250 0.000646 0.002871206 0.45159969       93     24 1.114550
#> 2  1 263250 6558750 0.000635 0.029484030 0.01523342      109     19 1.236990
#> 3  1 263750 6555250 0.000694 0.000970874 0.92038828       41      0 0.365185
#> 4  1 265250 6555750 0.000719 0.003206413 0.27815631        7     13 0.326633
#> 5  1 265250 6559750 0.000652 0.002914238 0.27019149      151     20 1.382270
#> 6  1 266750 6553750 0.000848 0.012106540 0.21468930       49     10 0.769991
#>   tersolrade  tertpi09 geoberg geolmja1 lcucor1 lcutilt4 terslpps15
#> 1    967.471   8.67901      21      100     311        0          6
#> 2    956.743   1.50617      21      100     322        0          6
#> 3    981.625 -10.39510      21      130     523        1          1
#> 4    988.487  -4.71605      21      130     322        1          5
#> 5    941.362   6.06173      21      130     311        1          6
#> 6    985.827   2.41975      21      130     322        1          6
tail(grasslandPA)
#>     RV      x       y     pca1 prbygall  prtilany teraspif terdem terslpdg
#> 117  0 313250 6575250 0.000788        0 1.0101010      131    132 0.045296
#> 118  0 313750 6579250 0.000650        0 1.0076010       39    149 1.358030
#> 119  0 314250 6572750 0.000545        0 0.3865006       78    139 0.146076
#> 120  0 314250 6576750 0.000694        0 1.0060360       81    141 1.689610
#> 121  0 314750 6571250 0.000640        0 0.0000000      102    174 1.501690
#> 122  0 314750 6574750 0.000745        0 0.9434344      166    130 0.449772
#>     tersolrade tertpi09 geoberg geolmja1 lcucor1 lcutilt4 terslpps15
#> 117   1000.920 -3.39507      29      100     312        1          1
#> 118   1022.570 -1.76543      62      130     312        1          6
#> 119   1001.790 -7.23457      82       12     312        1          1
#> 120   1028.010  3.16049      62       12     312        1          6
#> 121    983.568 12.33330      82      130     312        0          6
#> 122    988.030 -6.40741      62       12     312        1          1

Plotted on the raster of (log-scaled) model predictions, these occurrences look like this:

plot(log2(grasslandPreds$output + 1))
presences <- grasslandPA[grasslandPA$RV==1, ]
absences <- grasslandPA[grasslandPA$RV==0, ]
points(presences$x, presences$y, pch = 20, cex = 0.5, col = 'red')
points(absences$x, absences$y, pch = 20, cex = 0.5, col = 'blue')
legend('topleft', c('presence', 'absence'), col = c('red', 'blue'), 
       pch = c(20, 20))

We can use these data to calculate testAUC for our distribution model:

testAUC(model = grasslandmodel, transformations = grasslandDVs$transformations,
        data = grasslandPA)

#> [1] 0.7817506

The ROC plot shows how the rate of true positives as well as false positives increases as the binary classification threshold is lowered. The area under this curve corresponds to the value returned by testAUC.

Note that the testAUC() function can also be used to calculate AUC from presence-only training data, but that the interpretation of these values differs importantly.


Alternative: logistic regression

In the modeling example above, we used presence-only data to fit maximum entropy models. It is important to note that MIAmaxent provides the exact same functionality for models fitted by standard logistic regression. Specifically, in all functions which perform model fitting (i.e. pre-selection of spline DVs in deriveVars(), along with selectDVforEV(), selectEV(), and chooseModel()), the fitting algorithm can be specified as "maxent" or"LR", where "maxent" is the default.

Conventionally, the maximum entropy estimation (algorithm = "maxent") is used with presence-only occurrence data, while logistic regression (algorithm = "LR"), is used with presence-absence data.

The maximum entropy algorithm represents a form of parametric density estimation using an exponential family model, and is equivalent to inhomogeneous Poisson process models (Fithian & Hastie, 2013). It is implemented in MIAmaxent using infinitely-weighted logistic regression, with presences automatically added to the background. The maximum entropy algorithm has been used primarily with presence-only occurrence data in the popular maxent.jar software (Phillips et al., 2006), but it may also be used with presence-absence data (Halvorsen, 2013). In contrast, standard logistic regression (binomial GLM) is customarily used with presence-absence occurrence data. For more information about the differences between these approaches, see Guisan et al. (2017) and Fithian & Hastie (2013).

Below, we follow the same procedure as above, but use the presence-absence data to fit models by logistic regression.

grasslandPA <- readData(
  occurrence = system.file("extdata", "occurrence_PA.csv", package="MIAmaxent"), 
  contEV = system.file("extdata", "EV_continuous", package="MIAmaxent"),
  catEV = system.file("extdata", "EV_categorical", package="MIAmaxent"),
  PA = TRUE, XY = FALSE)
str(grasslandPA)
#> 'data.frame':    122 obs. of  14 variables:
#>  $ RV        : num  1 1 1 1 1 1 1 1 1 1 ...
#>  $ pca1      : num  0.000646 0.000635 0.000694 0.000719 0.000652 ...
#>  $ prbygall  : num  0.002871 0.029484 0.000971 0.003206 0.002914 ...
#>  $ prtilany  : num  0.4516 0.0152 0.9204 0.2782 0.2702 ...
#>  $ teraspif  : num  93 109 41 7 151 49 62 90 48 167 ...
#>  $ terdem    : num  24 19 0 13 20 10 2 16 0 0 ...
#>  $ terslpdg  : num  1.115 1.237 0.365 0.327 1.382 ...
#>  $ tersolrade: num  967 957 982 988 941 ...
#>  $ tertpi09  : num  8.68 1.51 -10.4 -4.72 6.06 ...
#>  $ geoberg   : Factor w/ 5 levels "21","29","35",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ geolmja1  : Factor w/ 9 levels "12","15","41",..: 8 8 9 9 9 9 8 9 4 9 ...
#>  $ lcucor1   : Factor w/ 11 levels "112","142","211",..: 5 7 11 7 5 7 2 2 7 11 ...
#>  $ lcutilt4  : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 2 1 ...
#>  $ terslpps15: Factor w/ 5 levels "1","2","3","5",..: 5 5 1 4 5 5 1 5 1 1 ...

plotFOP(grasslandPA, "teraspif")
plotFOP(grasslandPA, "terslpdg")
plotFOP(grasslandPA, 10)

Note that since the RV in grasslandPA contains presence and absence, the plots above show empirical frequencies of presence, rather than observed frequencies of presence. This is a subtle but important distinction.

PA.grasslandDVs <- deriveVars(grasslandPA, algorithm = "LR")
#> Pre-selecting forward hinge transformations of pca1
#> Pre-selecting reverse hinge transformations of pca1
#> Pre-selecting threshold transformations of pca1
#> Pre-selecting forward hinge transformations of prbygall
#> Pre-selecting reverse hinge transformations of prbygall
#> Pre-selecting threshold transformations of prbygall
#> Pre-selecting forward hinge transformations of prtilany
#> Pre-selecting reverse hinge transformations of prtilany
#> Pre-selecting threshold transformations of prtilany
#> Pre-selecting forward hinge transformations of teraspif
#> Pre-selecting reverse hinge transformations of teraspif
#> Pre-selecting threshold transformations of teraspif
#> Pre-selecting forward hinge transformations of terdem
#> Pre-selecting reverse hinge transformations of terdem
#> Pre-selecting threshold transformations of terdem
#> Pre-selecting forward hinge transformations of terslpdg
#> Pre-selecting reverse hinge transformations of terslpdg
#> Pre-selecting threshold transformations of terslpdg
#> Pre-selecting forward hinge transformations of tersolrade
#> Pre-selecting reverse hinge transformations of tersolrade
#> Pre-selecting threshold transformations of tersolrade
#> Pre-selecting forward hinge transformations of tertpi09
#> Pre-selecting reverse hinge transformations of tertpi09
#> Pre-selecting threshold transformations of tertpi09

The DVs produced above are not the same as those produced previously with presence-only data, even though the variable names might be the same. That is because the exact forms of the transformations applied are data-dependent.

PA.grasslandDVselect <- selectDVforEV(PA.grasslandDVs$dvdata, alpha = 0.001, 
                                      algorithm = "LR", quiet = TRUE) 

PA.grasslandEVselect <- selectEV(PA.grasslandDVselect$dvdata, alpha = 0.001, algorithm = "LR")
#> Forward selection of 8 EVs
#> Round 1 complete.
#> Round 2 complete.
PA.grasslandEVselect$selection
#>    round           variables m   Dsq  Chisq df        P
#> 1      1              terdem 1 0.205 33.170  1 8.45e-09
#> 2      1            prbygall 1 0.187 30.177  1 3.94e-08
#> 3      1                pca1 1 0.120 19.379  1 1.07e-05
#> 4      1            prtilany 1 0.118 19.015  1 1.30e-05
#> 5      1             geoberg 1 0.107 17.312  1 3.17e-05
#> 6      1          tersolrade 1 0.095 15.338  1 8.99e-05
#> 7      1            terslpdg 1 0.085 13.809  1 2.02e-04
#> 8      1             lcucor1 1 0.080 12.869  1 3.34e-04
#> 9      2   terdem + prbygall 2 0.286 13.041  1 3.05e-04
#> 10     2   terdem + terslpdg 2 0.249  7.032  1 8.01e-03
#> 11     2   terdem + prtilany 2 0.238  5.310  1 2.12e-02
#> 12     2 terdem + tersolrade 2 0.233  4.564  1 3.26e-02
#> 13     2       terdem + pca1 2 0.218  2.151  1 1.42e-01
#> 14     2    terdem + lcucor1 2 0.205  0.003  1 9.59e-01
#> 15     2    terdem + geoberg 2 0.205  0.001  1 9.76e-01
PA.grasslandEVselect$selectedmodel
#> 
#> Call:  stats::glm(formula = formula, family = "binomial", data = data)
#> 
#> Coefficients:
#> (Intercept)    terdem_D1  prbygall_D2  
#>       3.266       -3.780      -51.989  
#> 
#> Degrees of Freedom: 121 Total (i.e. Null);  119 Residual
#> Null Deviance:       161.7 
#> Residual Deviance: 115.5     AIC: 121.5

plotResp(PA.grasslandEVselect$selectedmodel, PA.grasslandDVs$transformations, "terdem")
plotFOP(grasslandPA, "terdem")

plotResp(PA.grasslandEVselect$selectedmodel, PA.grasslandDVs$transformations, "prbygall")
plotFOP(grasslandPA, "prbygall")

calculateFTVA(PA.grasslandEVselect)
#>   variable  FTVA
#> 1   terdem 0.717
#> 2 prbygall 0.283
PA.grasslandPreds <- projectModel(model = PA.grasslandEVselect$selectedmodel,
                               transformations = PA.grasslandDVs$transformations,
                               data = EVstack)

Notice that predictions from the logistic regression model — unlike the maxent model — are on the interval [0,1]. These values represent predicted probability of presence, rather than predicted relative probability of presence.

Now, compare the model trained on presence-absence data by logistic regression with the model trained on presence-only data by maximum entropy estimation. First we calculate their AUCs on the presence-absence data set:

# logistic regression model
testAUC(PA.grasslandEVselect$selectedmodel, PA.grasslandDVs$transformations, 
        grasslandPA, plot = FALSE) 
#> [1] 0.8435355
# maxent model
testAUC(grasslandmodel, grasslandDVs$transformations, grasslandPA, plot = FALSE) 
#> [1] 0.7817506

Next we calculate their AUCs on the presence-only data set:

# logistic regression model
testAUC(PA.grasslandEVselect$selectedmodel, PA.grasslandDVs$transformations, 
        grasslandPO, plot = FALSE)
#> Warning: The test data consist of 1/NA only, so NA is treated as absence.
#> Be aware of implications for the interpretation of the AUC value.
#> [1] 0.6109549
# maxent model
testAUC(grasslandmodel, grasslandDVs$transformations, grasslandPO, plot = FALSE)
#> Warning: The test data consist of 1/NA only, so NA is treated as absence.
#> Be aware of implications for the interpretation of the AUC value.
#> [1] 0.6898254

Notice that:

  1. AUC values are considerably lower for both models when calculated on presence-only data. This evidences the fact that uninformed background locations are treated as absences in this calculation; therefore even a perfect model would not show a perfect AUC of 1.
  2. The model trained on the same data as used for evaluation shows the higher AUC, in both evaluation cases; that is, the PA model gives better predictions of the PA data, and the PO model gives better predictions of the PO data. This result is not surprising, but it is not clear which is the better model.

In general, the question of how to make best use of presence-only and presence-absence data, when both are available, is an area of active research. Similarly, the use of presence-absence data in maximum entropy models is mostly unexplored (Halvorsen, 2013).


Acknowledgements

Thank you to Sabrina Mazzoni and Rune Halvorsen for providing the data used in this vignette.


References

  1. Aarts, G., Fieberg, J., & Matthiopoulos, J. (2012). Comparative interpretation of count, presence–absence and point methods for species distribution models. Methods in Ecology and Evolution, 3(1), 177-187.
  2. Fithian, W., & Hastie, T. (2013). Finite-sample equivalence in statistical models for presence-only data. The annals of applied statistics, 7(4), 1917.
  3. Guisan, A., Thuiller, W., & Zimmermann, N. E. (2017) Habitat Suitability and Distribution Models: With Applications in R. 1st ed. Ecology, Biodiversity and Conservation. Cambridge, UK: Cambridge University Press. https://doi.org/10.1017/9781139028271.
  4. Guisan, A., & Zimmermann, N. E. (2000). Predictive habitat distribution models in ecology. Ecological modelling, 135(2-3), 147-186.
  5. Halvorsen, R. (2012) A gradient analytic perspective on distribution modelling. Sommerfeltia, 35, 1-165.
  6. Halvorsen, R. (2013) A strict maximum likelihood explanation of MaxEnt, and some implications for distribution modelling. Sommerfeltia, 36, 1-132.
  7. Halvorsen, R., Mazzoni, S., Bryn, A. & Bakkestuen, V. (2015) Opportunities for improved distribution modelling practice via a strict maximum likelihood interpretation of MaxEnt. Ecography, 38, 172-183.
  8. Phillips, S. J., Anderson, R. P., & Schapire, R. E. (2006) Maximum Entropy Modeling of Species Geographic Distributions. Ecological Modelling, 190, 231–59.