Biodiversity Data Journal :
R Package
|
Corresponding author: Ingolf Kühn (ingolf.kuehn@ufz.de)
Academic editor: Scott Chamberlain
Received: 01 Sep 2017 | Accepted: 24 Feb 2018 | Published: 28 Feb 2018
© 2018 Gudrun Carl, Sam Levin, Ingolf Kühn
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Carl G, Levin S, Kühn I (2018) spind: an R Package to Account for Spatial Autocorrelation in the Analysis of Lattice Data. Biodiversity Data Journal 6: e20760. https://doi.org/10.3897/BDJ.6.e20760
|
|
spind is an R package aiming to provide a useful toolkit to account for spatial dependence in the analysis of lattice data. Grid-based data sets in spatial modelling often exhibit spatial dependence, i.e. values sampled at nearby locations are more similar than those sampled further apart. spind methods, described here, take this kind of two-dimensional dependence into account and are sensitive to its variation across different spatial scales. Methods presented to account for spatial autocorrelation are based on the two fundamentally different approaches of generalised estimating equations as well as wavelet-revised methods. Both methods are extensions to generalised linear models. spind also provides functions for multi-model inference and scaling by wavelet multiresolution regression. Since model evaluation is essential for assessing prediction accuracy in species distribution modelling, spind additionally supplies users with spatial accuracy measures, i.e. measures that are sensitive to the spatial arrangement of the predictions.
Cohen's kappa coefficient, Generalised Estimating Equations, Goodness-of-fit, Multimodel Inferrence, Multiresolution Regression, Prediction accuracy, Spatial autocorrelation, Species distribution modelling, Wavelet Revised Models
Grid-based data sets, whether they are biodiversity data such as species distributions, species abundance or trait data or sets of ecosystem service values, sociological or economic indices, of concentrations of pollutants, soil or hydrological properties, are often used in spatial modelling and may exhibit some degree of spatial dependence. This means that sampling at nearby locations may lead to sample values that are more similar than those further apart (
Various spatial methods are available to account for spatial autocorrelation in multiple linear regressions (e.g.
Package spind is available on CRAN and can be downloaded there. See: https://CRAN.R-project.org/package=spind. The development version is available on GitHub as well. See: https://github.com/levisc8/spind.
# Install from CRAN
install.packages('spind')
# Install development version from GitHub
devtools::install_github('levisc8/spind')
Next, the other main model that is introduced in this package is explained- the wavelet-revised model (
This means that wavelet analysis can be used for spatial filtering similar to the principal coordinates of neighbour matrices (PCNM) analysis (e.g.
The number and kind of coefficients in discrete wavelet transforms depend on the number and kind of wavelets used in the analysis. Briefly, there are two kinds of coefficients: detail and smooth ones. These reflect the different oscillating behaviour and distinguish between highly varying (detailed) and slowly varying (smooth) parts. This enables the user to decompose and to filter a data set in relation to its locally varying frequency characteristics. The wavelet filters are implemented using wavelet transforms from the waveslim package (
Different from other methods such as PCNM/MEM, the wavelet filters in thisapproach are applied to the response variable as well as all explanatory variables in a multiple regression. Moreover, they are applied within every step of GLM iteration before the regression coefficients are computed (
Spatial autocorrelation occurs when data sampled at adjacent locations exhibit more similar values than distant ones. This feature is detectable by the coefficients of smooth and relatively small wavelets. In WRMs, therefore, such kind of coefficients are always removed and thus smooth (i.e. low frequency) parts of data and, as a consequence, spatial autocorrelation. The tuning parameter level is a preset for the window size, that is, the range of all smooth wavelets. Sequentially setting level to lower integers allows the method to be adapted to one's data and autocorrelation gradually reduced, where strongest reduction can be found at the finest resolution, with level=1 usually working best.
For illustration, WRM is presented using the same musdata data set as above.
mwrm <- WRM(musculus ~ pollution + exposure, family = "poisson", data = musdata,
coord = coords, level = 1, plot = TRUE)
summary(mwrm)
predictions <- predict(mwrm, newdata = musdata)
All calculations can be performed using different types of wavelet families and different types of wavelet transforms, where haar for wavelet and dwt for transform is the default. To minimise boundary effects, WRM offers different options for embedding data collected in two-dimensional space in the larger frame of a square matrix. This can be done either by padding with zeros, mean values or reflected values at boundaries acting like mirrors. Therefore, there are different settings for padform in pad=list(…). Moreover, there is a factor padzone in pad=list(…) for expanding the padding zone.
WRM has many of the same features as GEE. Setting plot=TRUE allows the examination of the autocorrelation of residuals from a GLM of the same family as the WRM and passing further ggplot2 functions to customize_plot allows the user to manually add or subtract the required features (Fig.
Autocorrelation of residuals from GEE and in comparison to GLM. GEE with correlation structure: fixed performed best for the musdata data set. Spatial autocorrelation is computed as Moran’s I using the acfft function. The figure depicts simulated occurrence data of Mus musculus in response to the degree of pollution and the degree of exposure (for instance, to light, noise or other hypothetical risk factors).
Having filtered all data sets in relation to their frequency characteristics and removed smooth parts with a WRM, one can additionally decompose the detail (i.e. high frequency) parts into scale-specific subcomponents. One is then able to develop a scale-specific regression technique, subsequently known as wavelet multiresolution regression (
data(carlinadata)
coords <- carlinadata[ ,4:5]
# scale-specific regressions for detail components
ms2 <- scaleWMRR(carlina.horrida ~ aridity + land.use,
family = "poisson", data = carlinadata, coord = coords, scale = 2, trace = TRUE)
ms3 <- scaleWMRR(carlina.horrida ~ aridity + land.use,
family = "poisson", data = carlinadata, coord = coords, scale = 3, trace = TRUE)
aic<-aic.calc(carlina.horrida ~ aridity + land.use, family = "poisson", data = carlinadata,
mu = ms3$fitted)
aic
It is important to mention that it is essential to avoid comparisons of significance tests across scales. This is due to the fact that the sample size in general changes when scale-specific subcomponents are eliminated. However, in order to provide a consistently good quality criterion, there is the possibility to calculate log likelihoods and AIC values and then estimate the relative importance of a variable using the approach of model selection based on multimodel inference (MMI)(see below).
The generalised estimating equations approach developed by
To illustrate the use of the package, a GEE model is presented using the generated musdata data set included in the package. The following code uses already available datasets and provides R examples:
library(spind)
data(musdata)
# Fit a GEE and view the output
coords <- musdata[ ,4:5]
mgee <- GEE(musculus ~ pollution + exposure, family = "poisson",
data = musdata, coord = coords, corstr = "fixed", plot = TRUE, scale.fix = FALSE)
summary(mgee, printAutoCorPars = TRUE)
predictions <- predict(mgee, newdata = musdata)
spind includes methods for GEEs (summary.GEE and predict.GEE). These are useful in evaluating model fit and autocorrelation of residuals compared to a non-spatial model (in spind, this is a GLM with the same family as the GEE). Additionally, one can use the plot and customize_plot arguments in GEE to visually inspect the autocorrelation of the residuals from each regression and edit the plot using ggplot2 style inputs (Fig.
One drawback to non-clustered methods for GEEs arises from the way that R handles matrices. Trying to fit GEEs with corstr="fixed" to large data sets (i.e. the number of observations is approximately sqrt(.Machine$integer.max)) will result in errors, as the resulting variance-covariance matrices will be too large to be handled in R. This is where fitting clustered models is useful, as they work with smaller, more manageable matrices. These can be specified by changing the corstr argument to either "quadratic" or "exchangeable".
The package includes other functions that may be useful in diagnosing scale-specific features.
For example, the user might want to plot the wavelet variance or covariance of each of the variables as a function of level. The covar.plot function allows the user to visually examine the wavelet relationships from the model. For wavelet variance and covariance computation, the wavelet family d4 and the wtrafo = "modwt" were found to be mathematically more appropriate than others (
covar.plot(carlina.horrida ~ aridity + land.use - 1,
data = carlinadata, coord = coords, wavelet = "d4",
wtrafo = "modwt", plot = "var")
covar.plot(carlina.horrida ~ aridity + land.use - 1,
data = carlinadata, coord = coords, wavelet = "d4",
wtrafo = "modwt", plot = "covar")
The user may also want to view the smooth (i.e. slowly varying) components of any spatial data set at different resolutions. For this, the upscale function is offered, which allows the user to visually examine his/her data. The data sampled on a grid of squared cells and mostly sampled on adjacent cells may have any external contour. A square is used that embeds or completely encapsulates the data. Therefore, it becomes visible as a surface in a square. This square consists of 2n x 2n grid cells allowing a dyadic up-scaling for a number of different levels of scale. This means that the procedure of up-scaling related to the resolution level can be imagined as a two-dimensionally gradual enlargement of cell sizes (
Smooth components of wavelet decompositions at different scale levels. The upscaling is performed by the upscale function for variable aridity belonging to carlinadata data set. The data represent a square region. (Any region is extended to the next or next but one square of 2nx2n grid cells and is padded with predefined values, default is mean value, by the function provided. Thus the data recorded is available in a form that enables wavelet analyses.) Level=0 displays the raw, full-resolution predictor values, which are then "aggregated" by wavelets to ever coarser resolutions. Values increase from black to white. This function can be applied to any variable of interest, e.g. predictor, response or residuals.
upscale(f = carlinadata$aridity, coord = coords)
spind provides a couple of frameworks for conducting multi-model inference analyses (
Currently, the function only supports backwards model selection. In other words, one has to start with a full model (i.e. all of the variables in your model formula) and they are removed in a stepwise fashion. It is intended to add forward model selection methods shortly. Additionally, step.spind is written to always respect the hierarchy of variables in the model and the user cannot override this currently. For example, step.spind would not remove a main effect if the variable was still present as an interaction or polynomial, e.g. removing race while retaining I(race^2).
An example of step.spind using a GEE on the birthwt data set is shown in the MASS package below. The data in birthwt are not actually spatially explicit; a grid structure is imposed on them artificially. However, it is hoped that in using this data set, it will be demonstrated how this function can work with many types of data.
library(MASS)
data(birthwt)
# impose an artificial (not fully appropriate) grid structure
x <- rep(1:14,14)
y <- as.integer(gl(14,14))
coords <- cbind(x[-(190:196)],y[-(190:196)])
formula <- formula(low ~ age + lwt + race + smoke + ftv + bwt + I(race^2))
mgee <- GEE(formula = formula, family="gaussian", data=birthwt,
coord=coords, corstr="fixed",scale.fix=TRUE)
mwrm <- WRM(formula = formula, family="gaussian", data=birthwt,
coord=coords, level=1)
ssgee <- step.spind(object = mgee,data = birthwt)
sswrm <- step.spind(object = mwrm, data = birthwt, AICc=TRUE, trace = FALSE)
best.mgee <- GEE(formula = ssgee$model, family = "gaussian", data=birthwt,
coord=coords, corstr="fixed",scale.fix=TRUE)
best.wrm <- WRM(formula = sswrm$model, family="gaussian", data=birthwt,
coord=coords, level = 1)
summary(best.mgee,printAutoCorPars=FALSE)
summary(best.wrm)
Additionally, multimodel inference tools are offered for GEEs, WRMs and WMRRs which are loosely based on the MuMIn package. These are implemented in mmiWMRR and mmiGEE. They enable the user to examine the effect that the grid resolution and variable selection have on the resulting regressions and then to select the appropriate model for subsequent analyses. Note that mmiWMRR has two more arguments than mmiGEE. Moreover, settings will be changed in WRM and scaleWMRR for illustrative purposes. For two-dimensional wavelet models in geographical applications, the wavelet family d4 was found to be mathematically appropriate as well (
# Example for WRMs
data(carlinadata)
coords <- carlinadata[,4:5]
wrm <- WRM(carlina.horrida ~ aridity + land.use, family = "poisson",
data = carlinadata, coord = coords,level=1,wavelet="d4",pad=list(padzone=1.1))
ms1 <- scaleWMRR(carlina.horrida ~ aridity + land.use, family = "poisson",
data = carlinadata, coord = coords,scale=1,wavelet='d4', pad=list(padzone=1.1),trace=TRUE)
mmi <- mmiWMRR(object = wrm, data=carlinadata, scale=1, detail=TRUE,trace=TRUE)
# Example for GEEs
library(MASS)
data(birthwt)
# impose an artificial (not fully appropriate) grid structure
x <- rep(1:14,14)
y <- as.integer(gl(14,14))
coords <- cbind(x[-(190:196)],y[-(190:196)])
formula <- formula(low ~ race + smoke + bwt)
mgee <- GEE(formula = formula, family = "gaussian", data = birthwt,
coord=coords, corstr="fixed", scale.fix=TRUE)
mmi <- mmiGEE(object = mgee, data = birthwt)
Finally, one further tool is offered for model selection specific to WMRRs. rvi.plot uses mmiWMRR and creates a plot of the relative importance for each explanatory variable as a function of the resolution of the grid (in other words, as a function of the scale argument in mmiWMRR). It will also print the resulting model selection tables to the console. The output can help the user in choosing the most appropriate variables for subsequent analyses.
data(carlinadata)
coords <- carlinadata[,4:5]
rvi.plot(carlina.horrida ~ aridity + land.use, family = "poisson",
data=carlinadata,coord=coords,maxlevel=4,detail=TRUE,wavelet="d4",trace=TRUE)
Using an appropriate accuracy measure is essential for assessing prediction quality in modelling spatially explicit data. Goodness of fit measures such as Cohen's kappa coefficient, receiver operating characteristic (ROC), the area under the ROC curve (AUC) and maximum true skill statistic (TSS) are widely used to assess prediction errors in presence/absence models. There are some problems, especially related to prevalence, with more traditional measures such as AUC and hence TSS is recommended nowadays (e.g.
In spind, these metrics are categorised according to whether or not their outputs are dependent on the chosen threshold. th.dep (threshold dependent) and th.indep (threshold independent) are designed to work on any number of model types; all that is needed is a set of actual values, predictions and their associated coordinates. The hook data set is used to see how these work.
data(hook)
df <- hook[,1:2]
coords <- hook[,3:4]
# Threshold dependent metrics
th.dep.indices <- th.dep(data=df,coord=coords,spatial=TRUE)
# Confusion Matrix
th.dep.indices$cm
# Kappa statistic
th.dep.indices$kappa
# Threshold independent metrics
th.indep.indices <-th.indep(data=df,coord=coords,spatial=TRUE,plot.ROC=TRUE)
# AUC
th.indep.indices$AUC
# TSS
th.indep.indices$TSS
Finally, the authors would like to mention that, in many of these analyses, it is necessary to calculate spatial autocorrelation using Moran's I function (
coords <- musdata[,4:5]
mglm <- glm(musculus ~ pollution + exposure, family = "poisson", data = musdata)
ac <- acfft(coord = coords, f = resid(mglm, type = "pearson"), lim1 = 0, lim2 = 1, dmax = 10)
ac
Note that you can adjust the number of distance bins to examine in acfft using the dmax argument. The default is 10. Moreover, the user can choose the limits for the first bin (lim1,lim2). Its difference acts as an increment for all the others.
First, while there are many packages and functions available accounting for spatial autocorrelation in linear modeling or generalised linear modeling like frameworks, this is the only one that offers the user-optimised regression methods: spatial GEE, spatial WRM and scale-specific WMRR in conjunction with extended methods, step-wise model selection and multi-model inference analysis. Advantages are its simplicity to use as well as it computationally efficiency.
Second, to the best of the authors' knowledge, no other package or commercially available software offers the possibility to assess the accuracy of model predictions taking spatial dependence into account and being comparable to classical measures of model accuracy.
It is therefore believed that spind is an extremely useful tool in spatial analyses of lattice data, whether in biology, ecology, economics, geology, climatology or any other discipline with spatially structured data.
The package, together with documentation, is available on CRAN: https://CRAN.R-project.org/package=spind. The development version can be found on GitHub: https://github.com/levisc8/spind.
It is open-source software (published under the GPL public licence, ver. 3).
We thank Joaquín Hortal, Carsten Dormann and Scott Chamberlain for critical and constructive comments on an earlier version of the manuscript. We acknowledge support by the EU Collaborative project 'EU BON: Building the European Biodiversity Observation Network' (grant 308454) funded under the European Commission Framework Programme (FP) 7.