Influence of different data cleaning solutions of point-occurrence records on downstream macroecological diversity models.

Führding-Potschkat, Petra; Kreft, Holger; Ickert-Bond, Stefanie M.

Affiliation

Führding-Potschkat P; Biodiversity, Macroecology and Conservation Biogeography, Faculty of Forest Sciences, University of Göttingen, Göttingen, Germany.
Kreft H; Biodiversity, Macroecology and Conservation Biogeography, Faculty of Forest Sciences, University of Göttingen, Göttingen, Germany.
Ickert-Bond SM; Department of Biology and Wildlife & UA Museum of the North, University of Alaska Fairbanks, Fairbanks, Alaska, USA.

Ecol Evol; 12(8): e9168, 2022 Aug.

Article in En | MEDLINE | ID: mdl-35949539

ABSTRACT

Digital point-occurrence records from the Global Biodiversity Information Facility (GBIF) and other data providers enable a wide range of research in macroecology and biogeography. However, data errors may hamper immediate use. Manual data cleaning is time-consuming and often unfeasible, given that the databases may contain thousands or millions of records. Automated data cleaning pipelines are therefore of high importance. Taking North American Ephedra as a model, we examined how different data cleaning pipelines (using, e.g., the GBIF web application, and four different R packages) affect downstream species distribution models (SDMs). We also assessed how data differed from expert data. From 13,889 North American Ephedra observations in GBIF, the pipelines removed 31.7% to 62.7% false positives, invalid coordinates, and duplicates, leading to datasets between 9484 (GBIF application) and 5196 records (manual-guided filtering). The expert data consisted of 704 records, comparable to data from field studies. Although differences in the absolute numbers of records were relatively large, species richness models based on stacked SDMs (S-SDM) from pipeline and expert data were strongly correlated (mean Pearson's r across the pipelines: .9986, vs. the expert data: .9173). Our results suggest that all R package-based pipelines reliably identified invalid coordinates. In contrast, the GBIF-filtered data still contained both spatial and taxonomic errors. Major drawbacks emerge from the fact that no pipeline fully discovered misidentified specimens without the assistance of taxonomic expert knowledge. We conclude that application-filtered GBIF data will still need additional review to achieve higher spatial data quality. Achieving high-quality taxonomic data will require extra effort, probably by thoroughly analyzing the data for misidentified taxa, supported by experts.
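
As a rough illustration of the kind of automated filtering the abstract describes (removing invalid coordinates and duplicates, then comparing outputs with Pearson's r), the following minimal Python sketch may help. It is not the authors' pipeline, which relied on the GBIF web application and R packages. The function names `clean_records` and `pearson_r`, the specific validity rules (coordinate range checks and the (0, 0) placeholder), the four-decimal rounding used for duplicate detection, and the toy records are all assumptions made for this example.

```python
from math import isfinite, sqrt


def clean_records(records):
    """Drop records with invalid coordinates, then drop duplicates.

    Each record is a (species, latitude, longitude) tuple.
    The validity rules below are illustrative assumptions, not the
    rules of any particular package.
    """
    seen = set()
    kept = []
    for species, lat, lon in records:
        # Missing or non-finite coordinates are treated as invalid.
        if lat is None or lon is None:
            continue
        if not (isfinite(lat) and isfinite(lon)):
            continue
        # Coordinates outside the valid geographic range are invalid.
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            continue
        # (0, 0) is assumed here to be a placeholder, not a real locality.
        if lat == 0 and lon == 0:
            continue
        # Duplicate: same species at the same rounded coordinates.
        key = (species, round(lat, 4), round(lon, 4))
        if key in seen:
            continue
        seen.add(key)
        kept.append((species, lat, lon))
    return kept


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy records, invented for illustration only (not data from the study).
toy_records = [
    ("Ephedra viridis", 37.10, -112.50),
    ("Ephedra viridis", 37.10, -112.50),    # exact duplicate
    ("Ephedra viridis", 0.0, 0.0),          # placeholder coordinates
    ("Ephedra nevadensis", 95.0, -115.0),   # latitude out of range
    ("Ephedra nevadensis", 36.20, -115.90),
    ("Ephedra nevadensis", None, -115.0),   # missing latitude
]

cleaned = clean_records(toy_records)
removed_share = 1 - len(cleaned) / len(toy_records)
```

In this sketch, `removed_share` plays the role of the percentage of records removed by a pipeline, and `pearson_r` could be applied to per-cell richness values from two models. Misidentified specimens would pass these checks unchanged, which is consistent with the abstract's point that taxonomic errors need expert review.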

Keywords

GBIF; automated data cleaning pipelines; data quality; expert data; species distribution modeling

Full text: 1 | Collections: 01-international | Database: MEDLINE | Language: En | Journal: Ecol Evol | Publication year: 2022 | Document type: Article | Country of publication: United Kingdom
