This function takes the occurrence point files and the predictor layers and performs data cleaning, data partitioning, pseudo-absence point sampling and correlation-based variable selection. It saves the metadata and sdmdata files to disk.

setup_sdmdata(species_name, occurrences, predictors, lon = "lon",
  lat = "lat", models_dir = "./models", real_absences = NULL,
  buffer_type = NULL, dist_buf = NULL, env_filter = FALSE,
  env_distance = "centroid", buffer_shape = NULL, min_env_dist = NULL,
  min_geog_dist = NULL, write_buffer = FALSE, seed = NULL,
  clean_dupl = FALSE, clean_nas = FALSE, clean_uni = FALSE,
  geo_filt = FALSE, geo_filt_dist = NULL, select_variables = FALSE,
  cutoff = 0.8, sample_proportion = 0.8, png_sdmdata = TRUE,
  n_back = 1000, partition_type = c("bootstrap"), boot_n = 1,
  boot_proportion = 0.7, cv_n = NULL, cv_partitions = NULL)

Arguments

species_name

A character string with the species name. Because the species name will be used as a directory name, avoid non-ASCII characters, spaces and punctuation marks. The recommended format is "Genus_species". See the names in example_occs for an example

occurrences

A data frame with occurrence data. The data must contain at least the columns with the latitude and longitude values of the species occurrences. See example_occs for an example

predictors

A Raster or RasterStack object with the environmental raster layers

lon

The name of the longitude column. Defaults to "lon"

lat

The name of the latitude column. Defaults to "lat"

models_dir

Folder path to save the output files. Defaults to "./models"

real_absences

User-defined absence points

buffer_type

Character string indicating whether the buffer should be calculated using the "mean", "median", "maximum" distance between occurrence points, or an absolute geographic "distance". If set to "user", a user-supplied shapefile will be used as a sampling area, and buffer_shape needs to be specified. If NULL, no distance buffer is applied. If set to "distance", dist_buf needs to be specified

dist_buf

Defines the width of the buffer. Needs to be specified if buffer_type = "distance". The distance is expressed in the same unit as the RasterStack of predictor variables

env_filter

Logical. Should a Euclidean environmental filter be applied? If TRUE, env_distance and min_env_dist need to be specified. Areas closer than min_env_dist (expressed in quantiles in the environmental space) will be omitted from the pseudoabsence sampling

env_distance

Character. Type of environmental distance, either "centroid" or "mindist". Defaults to "centroid", the distance of each raster pixel to the environmental centroid of the distribution. When set to "mindist", the minimum distance of each raster pixel to any of the occurrence points is calculated. Needs to be specified if env_filter = TRUE. A minimum value needs to be specified (parameter min_env_dist)

buffer_shape

User-defined buffer shapefile in which pseudoabsences will be generated. Needs to be specified if buffer_type = "user"

min_env_dist

Numeric. Sets a minimum value to exclude the areas closest (in the environmental space) to the occurrences or their centroid, expressed in quantiles, from 0 (the closest) to 1. Defaults to 0.05, excluding areas belonging to the 5% closest in the environmental space. Because this is based on quantiles, and environmental similarity can take large negative values, this is an arbitrary value

min_geog_dist

Optional, numeric. A distance for the exclusion of the areas closest to the occurrence points (in the geographical space). The distance is expressed in the same unit as the RasterStack of predictor variables

write_buffer

Logical. Should the resulting buffer RasterLayer be written? Defaults to FALSE

seed

Seed for the random number generator, for reproducibility purposes. Used for sampling pseudoabsences

clean_dupl

Logical. If TRUE, removes points with the same longitude and latitude

clean_nas

Logical. If TRUE, removes points that are outside the bounds of the raster

clean_uni

Logical. If TRUE, selects only one point per pixel

geo_filt

Logical. Should occurrences that are too close to each other be deleted? See Varela et al. (2014)

geo_filt_dist

The distance of the geographic filter in the unit of the predictor raster, see Varela et al. (2014)

select_variables

Logical. Whether a variable selection should be performed. It excludes highly correlated environmental variables. If TRUE, cutoff and sample_proportion parameters must be specified

cutoff

Cutoff value of the correlation between variables, used to exclude environmental layers. The default is to exclude environmental variables with correlation > 0.8

sample_proportion

Numeric. Proportion of the raster values to be sampled to calculate the correlation. The value should be set as a decimal, between 0 and 1.

png_sdmdata

Logical, whether png files will be written

n_back

Number of pseudoabsence points. Default is 1,000

partition_type

Character. Type of data partitioning scheme, either "bootstrap" or k-fold "crossvalidation". If set to bootstrap, boot_proportion and boot_n must be specified. If set to crossvalidation, cv_n and cv_partitions must be specified

boot_n

Number of bootstrap runs

boot_proportion

Numeric, between 0 and 1. Proportion of points to be sampled for bootstrap

cv_n

Number of crossvalidation runs

cv_partitions

Number of partitions in the crossvalidation
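
The two partitioning schemes controlled by partition_type, boot_n, boot_proportion, cv_n and cv_partitions can be illustrated with a minimal base R sketch. It is not the package's implementation: the helpers make_boot_groups and make_cv_groups are hypothetical, and both the sampling without replacement and the coding of bootstrap groups (1 for sampled points, 0 for held-out points) are assumptions made for illustration. Only the column naming (boot.1, boot.2, cv.1, cv.2) follows this page.

```r
# Hypothetical helpers for illustration only; not part of the package.

# Bootstrap: each run marks a random subset of the points.
make_boot_groups <- function(n_points, boot_n = 1, boot_proportion = 0.7,
                             seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n_train <- round(n_points * boot_proportion)
  groups <- sapply(seq_len(boot_n), function(i) {
    g <- integer(n_points)                  # 0 = held out (assumed coding)
    g[sample.int(n_points, n_train)] <- 1L  # 1 = sampled in this run
    g
  })
  groups <- matrix(groups, nrow = n_points)
  colnames(groups) <- paste0("boot.", seq_len(boot_n))
  as.data.frame(groups)
}

# k-fold crossvalidation: each run assigns every point to one of
# cv_partitions folds of (nearly) equal size.
make_cv_groups <- function(n_points, cv_n = 1, cv_partitions = 3,
                           seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  groups <- sapply(seq_len(cv_n), function(i) {
    # Balanced fold labels 1..cv_partitions, shuffled across the points
    sample(rep(seq_len(cv_partitions), length.out = n_points))
  })
  groups <- matrix(groups, nrow = n_points)
  colnames(groups) <- paste0("cv.", seq_len(cv_n))
  as.data.frame(groups)
}

boot <- make_boot_groups(n_points = 10, boot_n = 2, boot_proportion = 0.7,
                         seed = 42)
cv <- make_cv_groups(n_points = 10, cv_n = 1, cv_partitions = 5, seed = 42)
```

In this sketch each bootstrap column marks 7 of the 10 points, and the single crossvalidation column places 2 points in each of the 5 folds.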

Value

Returns a data frame with the groups for each run (in columns called cv.1, cv.2 or boot.1, boot.2), presence/absence values, the geographical coordinates of the occurrence and pseudoabsence points, and the associated environmental variables (either all the layers or the selected ones if select_variables = TRUE).

The function also writes to disk (in a subfolder of the models_dir directory) a text file named sdmdata.csv, which will be used by do_any or do_many

References

Varela S, Anderson RP, García-Valdés R, Fernández-González F (2014). “Environmental Filters Reduce the Effects of Sampling Bias and Improve Predictions of Ecological Niche Models.” Ecography, 37(11), 1084-1091. ISSN 1600-0587, doi:10.1111/j.1600-0587.2013.00441.x.

Examples

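if (FALSE) {
# Take the name and the occurrence table of the first species in example_occs
sp <- names(example_occs)[1]
sp_coord <- example_occs[[1]]
# Run the setup with the default settings
sp_setup <- setup_sdmdata(species_name = sp,
                          occurrences = sp_coord,
                          predictors = example_vars)
# Inspect the first rows of the returned data frame
head(sp_setup)
}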
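
How the select_variables, cutoff and sample_proportion arguments interact can be sketched in base R on a plain data frame of predictor values. This is an illustration under stated assumptions, not the package's code: the helper select_uncorrelated and the toy data are hypothetical, Pearson correlation is assumed, and the selection rule used here (greedily keep a variable only if its absolute correlation with every variable already kept does not exceed the cutoff) is one common choice among several.

```r
# Hypothetical helper for illustration only; not part of the package.
select_uncorrelated <- function(env_values, cutoff = 0.8,
                                sample_proportion = 0.8, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Use only a random fraction of the rows (raster cells) for the correlation
  n_rows <- nrow(env_values)
  n_sample <- min(n_rows, max(2, round(n_rows * sample_proportion)))
  rows <- sample.int(n_rows, n_sample)
  cor_mat <- abs(cor(env_values[rows, , drop = FALSE]))

  # Greedy pass: keep a variable only if it is not highly correlated
  # with any variable kept so far
  kept <- character(0)
  for (v in colnames(cor_mat)) {
    if (all(cor_mat[v, kept] <= cutoff)) kept <- c(kept, v)
  }
  kept
}

# Toy predictor values (made up for this sketch)
set.seed(1)
temp <- rnorm(200)
toy <- data.frame(
  temp = temp,
  temp_copy = temp + rnorm(200, sd = 0.05),  # nearly identical to temp
  rain = rnorm(200)                          # independent of temp
)

selected <- select_uncorrelated(toy, cutoff = 0.8, sample_proportion = 0.8,
                                seed = 1)
```

On this toy data, temp_copy is dropped because it is almost perfectly correlated with temp, while temp and rain are kept.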