Spatial standardization of taxon occurrence data—a call to action

Published online by Cambridge University Press:  05 February 2024

Gawain T. Antell*
Affiliation:
Department of Earth and Planetary Sciences, University of California, Riverside, California 92521, U.S.A.
Roger B. J. Benson
Affiliation:
Richard Gilder Graduate School and Division of Paleontology, American Museum of Natural History, New York 10024, U.S.A.
Erin E. Saupe
Affiliation:
Department of Earth Sciences, University of Oxford, Oxford OX1 3AN, U.K.
*Corresponding author: Gawain T. Antell; Email: gawain.antell@ucr.edu

Abstract

The fossil record is spatiotemporally heterogeneous: taxon occurrence data have patchy spatial distributions, and this patchiness varies through time. Large-scale quantitative paleobiology studies that fail to account for heterogeneous sampling coverage will generate uninformative inferences at best and confidently draw wrong conclusions at worst. Explicitly spatial methods of standardization are necessary for analyses of large-scale fossil datasets, because nonspatial sample standardization, such as diversity rarefaction, is insufficient to reduce the signal of varying spatial coverage through time or between environments and clades. Spatial standardization should control both geographic area and dispersion (spread) of fossil localities. In addition to standardizing the spatial distribution of data, other factors may be standardized, including environmental heterogeneity or the number of publications or field collecting units that report taxon occurrences. Using a case study of published global Paleobiology Database occurrences, we demonstrate strong signals of sampling; without spatial standardization, these sampling signatures could be misattributed to biological processes. We discuss practical issues of implementing spatial standardization via subsampling and present the new R package divvy to improve the accessibility of spatial analysis. The software provides three spatial subsampling approaches, as well as related tools to quantify spatial coverage. After reviewing the theory, practice, and history of equalizing spatial coverage between data comparison groups, we outline priority areas to improve related data collection, analysis, and reporting practices in paleobiology.

Information

Type
On The Record
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike licence (http://creativecommons.org/licenses/by-nc-sa/4.0), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the same Creative Commons licence is used to distribute the re-used or adapted article and the original article is properly cited. The written permission of Cambridge University Press must be obtained prior to any commercial use.
Copyright
Copyright © The Author(s), 2024. Published by Cambridge University Press on behalf of Paleontological Society

Table 1. Glossary of disciplinary terms relevant to taxon occurrences and spatial standardization.

Figure 1. A schematic of the species–area effect, in map view. The total sampling area (gray boxes) in A and C is twice as large as in B; these bounding regions could represent the total preserved outcrop area from three time steps or continents of comparison. Individual sampling sites within a study region are indicated with clear boxes, and species occurrences are represented with lowercase letters. Species count at an individual site is alpha diversity (annotated at only one site in each panel, for simplicity). Total species count within a study area is gamma diversity. There are many metrics for beta diversity related to species turnover between sites, but a simple and original measure is the ratio of gamma to mean alpha (Whittaker 1960, 1972). Note that both beta and gamma diversity increase as sampling area doubles from B to A, even though the distributions of alpha diversity, species’ geographic range size, and site density are identical. Without accounting for the difference in sampling area, (paleo)ecologists might falsely infer time bin A more diverse than B and with smaller proportional range sizes. C also has larger beta and gamma diversity than B, despite the same number and cumulative area of sampled sites, because the dispersion between sites is larger.
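The ratio measure of beta diversity described in this caption can be illustrated with a minimal Python sketch. The site compositions are hypothetical (they are not the species drawn in Figure 1), and the function name `diversity_partition` is an assumption made for illustration only.

```python
from statistics import mean

def diversity_partition(sites):
    """Return (mean alpha, gamma, Whittaker beta) for a list of per-site species sets."""
    alphas = [len(s) for s in sites]       # alpha: species count at each site
    gamma = len(set().union(*sites))       # gamma: species count across all sites
    mean_alpha = mean(alphas)
    return mean_alpha, gamma, gamma / mean_alpha  # beta = gamma / mean alpha

# Hypothetical occurrences: the larger region adds two sites with new species,
# while every site still holds exactly two species.
region_small = [{"a", "b"}, {"b", "c"}]
region_large = region_small + [{"d", "e"}, {"e", "f"}]

small = diversity_partition(region_small)
large = diversity_partition(region_large)
print(small, large)
```

In this toy case mean alpha is the same in both regions, yet gamma and beta both double when the sampled area doubles, which is the pattern the schematic warns against reading as a biological signal.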

Figure 2. Five spatial subsamples of Pliocene bivalve occurrences from the Paleobiology Database (available as data object bivalves in the R package divvy). For each subsample, site dispersion is constrained by a circle of 3000 km diameter (A) or a minimum spanning tree with maximum great circle distance of 3000 km (B). Within each subsampling region, the number of occurrence sites is rarefied to 12 (open circles). Sites are raster grid cells of approximately equal area and shape. The random points to initiate subsamples are identical in A and B. Note that subsamples here are impervious to potential biogeographic barriers, for example, the Isthmus of Panama, which was not emergent for the full duration of the Pliocene. Subsamples can also overlap with each other, as shown in southeastern North America for two circular subsamples and three minimum spanning trees. Subsamples with overlapping regional boundaries may differ in the random subsets of sites they contain.
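The circular constraint in panel A can be sketched in Python as follows. This is not the divvy implementation (divvy is an R package); it is a simplified illustration that assumes a spherical Earth, treats sites as (longitude, latitude) pairs in degrees, and centres each circle on a randomly chosen occupied site. The function names, the `max_tries` parameter, and the example coordinates are all hypothetical.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)

def great_circle_km(p, q):
    """Haversine distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def circular_subsample(sites, n_sites=12, diameter_km=3000.0, rng=random, max_tries=1000):
    """Draw n_sites sites lying within a circle centred on a random seed site.

    Returns None if no seed yields enough sites within max_tries attempts.
    """
    for _ in range(max_tries):
        seed = rng.choice(sites)
        pool = [s for s in sites if great_circle_km(seed, s) <= diameter_km / 2]
        if len(pool) >= n_sites:
            return rng.sample(pool, n_sites)  # rarefy to a fixed number of sites
    return None

# Hypothetical sites: one dense cluster plus three isolated localities.
cluster = [(-80 + 0.5 * i, 30 + 0.5 * (i % 3)) for i in range(15)]
isolated = [(100.0, -40.0), (10.0, 60.0), (150.0, 0.0)]
subsample = circular_subsample(cluster + isolated, rng=random.Random(1))
```

Seeds that fall on an isolated locality are rejected because their circle holds fewer than 12 sites, so only well-sampled regions contribute subsamples. Repeated calls can return overlapping regions with different random subsets of sites, as the caption notes.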

Figure 3. Scatter plots indicate the relationship between species count and mean per-species occupied grid cells in 63 time bins, either as a proportion of all occupied grid cells (A) or as a count within subsample regions of 12 cells (B). Outlier points are labeled by geological stage and overplotted on C: Ar, Artinskian; Gz, Gzhelian; Hir, Hirnantian. C, Species count in each stage, either tallied globally (dashed line) or within subsampled regions (solid line). Note logarithmic y-axis scale in C. Error bars in B and C denote interquartile range across 500 replicate subsampled regions. Geological periods: O, Ordovician; S, Silurian; D, Devonian; C, Carboniferous; P, Permian; Tr, Triassic; J, Jurassic; K, Cretaceous; Pg, Paleogene; N, Neogene.

Figure 4. Scatter plots indicate the pairwise relationship between either species count (A and B) or mean proportional occupancy of equal-area grid cells (C and D) and spatial sampling coverage, measured as either a count of grid cells (A and C) or summed length of minimum spanning tree connecting occupied cell centroids (B and D). Outlier points are labeled by the earliest geological stage of a time bin, here and on the timescale in Fig. 3C: Ar, Artinskian; Gz, Gzhelian.
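The dispersion measure used in panels B and D, the summed length of a minimum spanning tree connecting cell centroids, can be sketched with Prim's algorithm. This Python version is illustrative only: it assumes a spherical Earth and haversine distances, and the function names and coordinates are hypothetical rather than taken from the study's code.

```python
import math

def great_circle_km(p, q, radius_km=6371.0):
    """Haversine distance in km between two (lon, lat) points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))

def mst_length_km(points):
    """Summed edge length of the minimum spanning tree (Prim's algorithm, O(n^2))."""
    if len(points) < 2:
        return 0.0
    # best[i] holds the shortest distance from point i to the growing tree
    best = [great_circle_km(points[0], p) for p in points]
    in_tree = [False] * len(points)
    in_tree[0] = True
    total = 0.0
    for _ in range(len(points) - 1):
        nxt = min((i for i in range(len(points)) if not in_tree[i]),
                  key=best.__getitem__)
        total += best[nxt]
        in_tree[nxt] = True
        for i, p in enumerate(points):
            if not in_tree[i]:
                best[i] = min(best[i], great_circle_km(points[nxt], p))
    return total

# Same number of occupied cells, different dispersion (hypothetical centroids).
clustered = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
dispersed = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
print(mst_length_km(clustered), mst_length_km(dispersed))
```

The two point sets would score identically on a count of occupied cells but differ tenfold in tree length, which is why the two coverage metrics are plotted separately.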

Appendix Table A1. Time bins to divide the global Phanerozoic dataset (n = 63). A species’ occurrence record was included in a time bin if the name or age of both its maximum and minimum occurrence estimates fell within the onset and terminus ages (in Ma) for the bin. During binning, ages were rounded to the nearest 0.01 Ma for boundaries younger than 10 Ma, 0.1 Ma for boundaries 50–150 Ma, or 1 Ma for boundaries older than 150 Ma. The number of unique occurrences for each species is tallied for each time bin. Replicated from table S1 in Antell et al. (2020).
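The age-based part of this inclusion rule can be expressed as a short Python sketch. The bin names and boundary ages below are hypothetical placeholders, not values from Table A1. The sketch also assumes inclusive boundaries, and it omits both name-based matching and the rounding step, since the caption does not fully specify them.

```python
def assign_bin(max_ma, min_ma, bins):
    """Return the name of the first bin that contains the whole age range, else None.

    bins: iterable of (name, onset_ma, terminus_ma), with onset older than terminus.
    """
    for name, onset, terminus in bins:
        # Both the maximum and the minimum age estimate must fall inside the bin.
        if onset >= max_ma >= min_ma >= terminus:
            return name
    return None

# Hypothetical bins (ages in Ma), for illustration only.
bins = [("bin_old", 30.0, 20.0), ("bin_young", 20.0, 10.0)]

inside = assign_bin(28.0, 22.0, bins)      # fully within bin_old
straddling = assign_bin(22.0, 18.0, bins)  # crosses the 20 Ma boundary
print(inside, straddling)
```

An occurrence whose age range straddles a bin boundary is dropped rather than assigned to both bins, so poorly dated records do not inflate counts in adjacent intervals.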

Figure A1. Kendall's τ (tau) coefficient distributions for correlations between (A) global species count and total occupied cells (sampling area), (B) global species count and summed minimum spanning tree length between occupied cells (sampling dispersion), (C) mean proportional species occupancy and sampling area, and (D) mean proportional occupancy and sampling dispersion. Figure 4 panels plot the corresponding scatter plot for each correlation.

Appendix Table A2. Kendall's τ (tau) coefficient estimates and 95% quantiles (from 500 subsamples) for pairwise nonparametric correlations between subsampled species count, mean occupied grid cells (excluding singly occurring species, and out of 12 cells in a subsample), and aggregation of sampling sites (summed length of minimum spanning tree connecting subsampled cell centroids). In each correlation, the predictor time series was pre-whitened with a first-order autoregressive model; the residuals of this model were correlated with the response series to account for temporal autocorrelation.
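The pre-whitening step described in this caption can be sketched in plain Python. Several choices here are assumptions rather than details confirmed by the source: the AR(1) model is fitted by ordinary least squares with an intercept, the residuals are aligned with the response series from its second value onward, and the correlation is Kendall's tau-a, which applies no correction for ties. All function names and the example series are hypothetical.

```python
def ar1_residuals(x):
    """Residuals of an OLS fit x[t] = c + phi * x[t-1] + e[t], for t = 1..n-1.

    Assumes x is not constant (otherwise the slope is undefined).
    """
    prev, curr = x[:-1], x[1:]
    n = len(prev)
    mean_prev, mean_curr = sum(prev) / n, sum(curr) / n
    sxx = sum((p - mean_prev) ** 2 for p in prev)
    phi = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr)) / sxx
    return [c - mean_curr - phi * (p - mean_prev) for p, c in zip(prev, curr)]

def kendall_tau_a(u, v):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    n = len(u)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (u[i] - u[j]) * (v[i] - v[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

def prewhitened_tau(predictor, response):
    """Correlate AR(1) residuals of the predictor with the response series."""
    resid = ar1_residuals(predictor)
    return kendall_tau_a(resid, response[1:])  # drop the first response value to align

# Hypothetical time series, for illustration only.
predictor = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0]
response = [2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0]
tau = prewhitened_tau(predictor, response)
```

Removing the first-order autoregressive component from the predictor means that a shared long-term trend alone cannot produce a strong correlation between the two series.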