Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks.
Guralnick, Robert; LaFrance, Raphael; Denslow, Michael; Blickhan, Samantha; Bouslog, Mark; Miller, Sean; Yost, Jenn; Best, Jason; Paul, Deborah L; Ellwood, Elizabeth; Gilbert, Edward; Allen, Julie.
Affiliation
  • Guralnick R; Florida Museum of Natural History University of Florida Gainesville Florida USA.
  • LaFrance R; Florida Museum of Natural History University of Florida Gainesville Florida USA.
  • Denslow M; Florida Museum of Natural History University of Florida Gainesville Florida USA.
  • Blickhan S; The Adler Planetarium Chicago Illinois USA.
  • Bouslog M; The Adler Planetarium Chicago Illinois USA.
  • Miller S; The Adler Planetarium Chicago Illinois USA.
  • Yost J; California Polytechnic State University San Luis Obispo California USA.
  • Best J; Botanical Research Institute of Texas and Fort Worth Botanic Garden Fort Worth Texas USA.
  • Paul DL; Prairie Research Institute University of Illinois Urbana-Champaign Champaign Illinois USA.
  • Ellwood E; Florida Museum of Natural History University of Florida Gainesville Florida USA.
  • Gilbert E; Arizona State University Tempe Arizona USA.
  • Allen J; Department of Biological Sciences Virginia Tech Blacksburg Virginia USA.
Appl Plant Sci ; 12(1): e11560, 2024.
Article in En | MEDLINE | ID: mdl-38369981
ABSTRACT
Premise:

Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. We present here a working solution to overcome this long-recognized efficiency bottleneck that leverages synergies between community science efforts and machine learning approaches.

Methods:

We present two new semi-automated services. The first detects and classifies typewritten, handwritten, or mixed labels from herbarium sheets. The second uses a workflow tuned for specimen labels to label text using optical character recognition (OCR). The label finder and classifier was built via humans-in-the-loop processes that utilize the community science Notes from Nature platform to develop training and validation data sets to feed into a machine learning pipeline.

Results:

Our results showcase a >93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre-processing, multiple OCR engines, and post-processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields >4-fold reductions in errors compared to off-the-shelf open-source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool.
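
The abstract does not specify how the alignment step works, so the following is only a minimal sketch of the general idea: readings of the same label line from several OCR engines are aligned the way sequences are aligned in molecular systematics, and a per-character vote picks the consensus. The choice of Needleman-Wunsch global alignment, the scoring values, the gap symbol, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter

# Assumed gap symbol: NUL, chosen because it should not occur in label text.
GAP = "\x00"


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return both strings padded with GAP."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def consensus(readings):
    """Vote per character, using the first reading as the reference frame.

    Simplification: characters that other readings insert relative to the
    reference are ignored, so this is not a full multiple alignment.
    """
    ref = readings[0]
    votes = [Counter({ch: 1}) for ch in ref]
    for other in readings[1:]:
        aligned_ref, aligned_other = needleman_wunsch(ref, other)
        pos = 0
        for r, o in zip(aligned_ref, aligned_other):
            if r == GAP:
                continue  # insertion relative to the reference: skipped
            votes[pos][o] += 1
            pos += 1
    winners = (v.most_common(1)[0][0] for v in votes)
    # A winning GAP means most readings had no character at that position.
    return "".join(ch for ch in winners if ch != GAP)


# Hypothetical outputs from three OCR engines for the same label line.
readings = ["Quercus a1ba L.", "Quercus alba L.", "Quercvs alba L."]
print(consensus(readings))
```

Each engine here makes a different single-character mistake, so the vote at every aligned position is decided by the two engines that agree. A real pipeline would also need to handle ties, insertions, and multi-line labels.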

Discussion:

Our work showcases a usable set of tools for herbarium digitization including a custom-built web application that is freely accessible. Further work to better integrate these services into existing toolkits can support broad community use.
Full text: 1 Collections: 01-international Database: MEDLINE Language: En Publication year: 2024 Document type: Article