This function takes the occurrence point files and the predictor layers and performs data cleaning, data partitioning, pseudo-absence point sampling and correlation-based variable selection. It saves the metadata and sdmdata files to disk.
setup_sdmdata(species_name, occurrences, predictors, lon = "lon",
lat = "lat", models_dir = "./models", real_absences = NULL,
buffer_type = NULL, dist_buf = NULL, env_filter = FALSE,
env_distance = "centroid", buffer_shape = NULL, min_env_dist = NULL,
min_geog_dist = NULL, write_buffer = FALSE, seed = NULL,
clean_dupl = FALSE, clean_nas = FALSE, clean_uni = FALSE,
geo_filt = FALSE, geo_filt_dist = NULL, select_variables = FALSE,
cutoff = 0.8, sample_proportion = 0.8, png_sdmdata = TRUE,
n_back = 1000, partition_type = c("bootstrap"), boot_n = 1,
boot_proportion = 0.7, cv_n = NULL, cv_partitions = NULL)
species_name: A character string with the species name. Because the species name will be used as a directory name, avoid non-ASCII characters, spaces and punctuation marks. The recommended format is "Genus_species". See the names in example_occs for an example.
occurrences: A data frame with occurrence data. It must contain at least the columns with the latitude and longitude of the species occurrences. See example_occs for an example.
predictors: A Raster or RasterStack object with the environmental raster layers.
lon: The name of the longitude column. Defaults to "lon".
lat: The name of the latitude column. Defaults to "lat".
models_dir: Folder path to save the output files. Defaults to "./models".
real_absences: User-defined absence points.
buffer_type: Character string indicating whether the buffer should be calculated using the "mean", "median" or "maximum" distance between occurrence points, or an absolute geographic "distance". If set to "user", a user-supplied shapefile will be used as the sampling area, and buffer_shape needs to be specified. If NULL, no distance buffer is applied. If set to "distance", dist_buf needs to be specified.
dist_buf: Defines the width of the buffer. Needs to be specified if buffer_type = "distance". The distance is in the same unit as the RasterStack of predictor variables.
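The distance-based buffer types can be illustrated with a minimal base R sketch. It only shows how the mean, median and maximum pairwise distances between occurrence points differ; the coordinates are hypothetical, plain Euclidean distance in coordinate units is an assumption, and the package may compute the buffer differently.

```r
# Hypothetical occurrence coordinates (not taken from example_occs).
occs <- data.frame(lon = c(0, 3, 0), lat = c(0, 0, 4))

# Pairwise Euclidean distances in coordinate units (assumption: the
# package may use another distance measure).
pair_dist <- as.numeric(dist(occs[, c("lon", "lat")]))

# Candidate buffer widths for each distance-based buffer_type.
buffer_width <- c(
  mean    = mean(pair_dist),
  median  = median(pair_dist),
  maximum = max(pair_dist)
)
print(buffer_width)
```

With buffer_type = "distance" the width would instead come directly from dist_buf.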
env_filter: Logical. Should a Euclidean environmental filter be applied? If TRUE, env_distance and min_env_dist need to be specified. Areas closer than min_env_dist (expressed in quantiles in the environmental space) will be omitted from the pseudoabsence sampling.
env_distance: Character. Type of environmental distance, either "centroid" or "mindist". Defaults to "centroid", the distance of each raster pixel to the environmental centroid of the distribution. When set to "mindist", the minimum distance of each raster pixel to any of the occurrence points is calculated. Needs to be specified if env_filter = TRUE. A minimum value needs to be specified (parameter min_env_dist).
buffer_shape: User-defined buffer shapefile in which pseudoabsences will be generated. Needs to be specified if buffer_type = "user".
min_env_dist: Numeric. Sets a minimum value to exclude the areas closest (in the environmental space) to the occurrences or their centroid, expressed in quantiles, from 0 (the closest) to 1. Defaults to 0.05, excluding the areas belonging to the closest 5%. Since this is based on quantiles, and environmental similarity can take large negative values, this is an arbitrary value.
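The two environmental distances and the quantile threshold can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the pixel values, the variable names bio1 and bio12, and the occurrence sample are all simulated, and the variables are used unscaled.

```r
set.seed(1)
# Hypothetical environmental values: 200 pixels x 2 variables.
env_pixels <- matrix(rnorm(400), ncol = 2,
                     dimnames = list(NULL, c("bio1", "bio12")))
# Hypothetical environmental values at 10 occurrence points.
env_occs <- env_pixels[sample(nrow(env_pixels), 10), , drop = FALSE]

min_env_dist <- 0.05

# "centroid": Euclidean distance of each pixel to the mean environment
# of the occurrences.
centroid <- colMeans(env_occs)
d_centroid <- sqrt(rowSums(sweep(env_pixels, 2, centroid)^2))

# "mindist": Euclidean distance of each pixel to its nearest occurrence.
d_mindist <- apply(env_pixels, 1, function(p) {
  min(sqrt(rowSums(sweep(env_occs, 2, p)^2)))
})

# Keep only pixels beyond the min_env_dist quantile of each distance;
# the closest 5% are excluded from pseudoabsence sampling.
keep_centroid <- d_centroid > quantile(d_centroid, min_env_dist)
keep_mindist  <- d_mindist  > quantile(d_mindist,  min_env_dist)
```

Because the threshold is a quantile, the number of excluded pixels depends only on min_env_dist and the number of pixels, not on the scale of the distances.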
min_geog_dist: Optional, numeric. A distance for the exclusion of the areas closest to the occurrence points (in the geographical space). The distance is in the same unit as the RasterStack of predictor variables.
write_buffer: Logical. Should the resulting buffer RasterLayer be written? Defaults to FALSE.
seed: Seed for the random number generator, for reproducibility purposes. Used for sampling pseudoabsences.
clean_dupl: Logical. If TRUE, removes points with the same longitude and latitude.
clean_nas: Logical. If TRUE, removes points that are outside the bounds of the raster.
clean_uni: Logical. If TRUE, selects only one point per pixel.
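The three cleaning steps can be pictured with a small base R sketch. The coordinates, the extracted predictor values and the raster cell indices below are hypothetical stand-ins for what would be obtained from the occurrence data and the predictor raster; the package's internals may differ.

```r
# Hypothetical occurrences; the second row duplicates the first.
occs <- data.frame(lon = c(-40.1, -40.1, -41.3, -42.0),
                   lat = c(-20.5, -20.5, -21.0, -19.2))
# Hypothetical predictor value extracted at each point; NA stands for a
# point that fell outside the raster.
env_value <- c(23.1, 23.1, NA, 19.8)
# Hypothetical raster cell index of each point.
cell <- c(11, 11, 27, 35)

keep_dupl <- !duplicated(occs[, c("lon", "lat")])  # as in clean_dupl
keep_nas  <- !is.na(env_value)                     # as in clean_nas
keep_uni  <- !duplicated(cell)                     # as in clean_uni

clean_occs <- occs[keep_dupl & keep_nas & keep_uni, ]
print(clean_occs)
```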
geo_filt: Logical. Should occurrences that are too close to each other be deleted? See Varela et al. (2014).
geo_filt_dist: The distance of the geographic filter, in the unit of the predictor raster. See Varela et al. (2014).
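One simple way to picture a geographic filter is greedy thinning: a point is kept only if it is at least a minimum distance away from every point already kept. The sketch below is a swapped-in illustration, not necessarily the method used by the package or by Varela et al. (2014); thin_points and its arguments are hypothetical names, and distances are Euclidean in coordinate units.

```r
# Greedy thinning (hypothetical helper): visit points in order and keep
# each one only if it is at least min_dist from all points kept so far.
thin_points <- function(coords, min_dist) {
  kept <- integer(0)
  for (i in seq_len(nrow(coords))) {
    d <- sqrt((coords$lon[kept] - coords$lon[i])^2 +
              (coords$lat[kept] - coords$lat[i])^2)
    if (all(d >= min_dist)) kept <- c(kept, i)
  }
  coords[kept, , drop = FALSE]
}

# Hypothetical occurrences with two tight clusters and one isolated point.
occs <- data.frame(lon = c(0, 0.05, 1, 1.02, 3), lat = c(0, 0, 0, 0, 0))
thinned <- thin_points(occs, min_dist = 0.1)
print(thinned)
```

The result of greedy thinning depends on the order of the points, so a seed or a fixed ordering is needed for reproducible output.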
select_variables: Logical. Whether a variable selection should be performed. It excludes highly correlated environmental variables. If TRUE, the cutoff and sample_proportion parameters must be specified.
cutoff: Cutoff value of correlation between variables to exclude environmental layers. The default is to exclude environmental variables with correlation > 0.8.
sample_proportion: Numeric. Proportion of the raster values to be sampled to calculate the correlation. The value should be set as a decimal, between 0 and 1.
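How cutoff and sample_proportion interact can be sketched in base R. This is a minimal illustration under stated assumptions: the three variables are simulated, and the greedy rule (keep a variable only if it is not too correlated with any variable already kept) is a stand-in that may differ from the selection rule the package actually applies.

```r
set.seed(42)
n <- 1000
bio1 <- rnorm(n)
# Hypothetical raster values, one row per pixel.
env <- data.frame(
  bio1 = bio1,
  bio2 = bio1 + rnorm(n, sd = 0.1),  # nearly a copy of bio1
  bio3 = rnorm(n)                    # independent of the others
)

cutoff <- 0.8
sample_proportion <- 0.8

# Estimate correlations on a random subset of the pixel values.
rows <- sample(nrow(env), size = round(sample_proportion * nrow(env)))
cor_mat <- abs(cor(env[rows, ]))

# Greedy pass: keep a variable only if its absolute correlation with
# every variable already kept does not exceed the cutoff.
selected <- character(0)
for (v in colnames(cor_mat)) {
  if (all(cor_mat[v, selected] <= cutoff)) selected <- c(selected, v)
}
print(selected)
```

Sampling a proportion of the pixels keeps the correlation matrix cheap to compute on large rasters, at the cost of some sampling noise near the cutoff.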
png_sdmdata: Logical. Whether png files will be written.
n_back: Number of pseudoabsence points. Default is 1,000.
partition_type: Character. Type of data partitioning scheme, either "bootstrap" or k-fold "crossvalidation". If set to bootstrap, boot_proportion and boot_n must be specified. If set to crossvalidation, cv_n and cv_partitions must be specified.
boot_n: Number of bootstrap runs.
boot_proportion: Numeric, from 0 to 1. Proportion of points to be sampled for bootstrap.
cv_n: Number of crossvalidation runs.
cv_partitions: Number of partitions in the crossvalidation.
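The two partitioning schemes can be sketched in base R. The column names boot.1, boot.2 and cv.1 follow the naming described for the returned data frame; everything else is an assumption made for illustration, in particular the coding of bootstrap groups as 0 (sampled) and 1 (held out), which may not match the package.

```r
set.seed(512)
n_points <- 20  # hypothetical number of presence and pseudoabsence points

# Bootstrap: in each run, boot_proportion of the points are sampled
# (coded 0 here) and the remainder are held out (coded 1 here).
boot_n <- 2
boot_proportion <- 0.7
boot <- sapply(seq_len(boot_n), function(i) {
  group <- rep(1L, n_points)
  group[sample(n_points, round(boot_proportion * n_points))] <- 0L
  group
})
colnames(boot) <- paste0("boot.", seq_len(boot_n))

# k-fold cross-validation: in each run, every point is assigned to one
# of cv_partitions folds of (nearly) equal size.
cv_n <- 1
cv_partitions <- 5
cv <- sapply(seq_len(cv_n), function(i) {
  sample(rep(seq_len(cv_partitions), length.out = n_points))
})
colnames(cv) <- paste0("cv.", seq_len(cv_n))

groups <- as.data.frame(cbind(boot, cv))
print(head(groups))
```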
Returns a data frame with the groups for each run (in columns called cv.1, cv.2 or boot.1, boot.2), presence/absence values, the geographical coordinates of the occurrence and pseudoabsence points, and the associated environmental variables (either all the layers or the selected ones if select_variables = TRUE).
The function writes to disk (inside a subfolder of the models_dir directory) a text file named sdmdata.csv that will be used by do_any or do_many.
Varela S, Anderson RP, García-Valdés R, Fernández-González F (2014). “Environmental Filters Reduce the Effects of Sampling Bias and Improve Predictions of Ecological Niche Models.” Ecography, 37(11), 1084-1091. ISSN 1600-0587, doi:10.1111/j.1600-0587.2013.00441.x.