A benchmark dataset of herbarium specimen images with label data.

Dillen, Mathias; Groom, Quentin; Chagnoux, Simon; Güntsch, Anton; Hardisty, Alex; Haston, Elspeth; Livermore, Laurence; Runnel, Veljo; Schulman, Leif; Willemse, Luc; Wu, Zhengzhe; Phillips, Sarah

Affiliation

Dillen M; Meise Botanic Garden, Meise, Belgium.
Groom Q; Meise Botanic Garden, Meise, Belgium.
Chagnoux S; Muséum National d'Histoire Naturelle, Paris, France.
Güntsch A; Freie Universität Berlin, Berlin, Germany.
Hardisty A; School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom.
Haston E; Royal Botanic Garden Edinburgh, Edinburgh, United Kingdom.
Livermore L; The Natural History Museum, London, United Kingdom.
Runnel V; University of Tartu, Tartu, Estonia.
Schulman L; Finnish Museum of Natural History LUOMUS, Helsinki, Finland.
Willemse L; Naturalis, Leiden, Netherlands.
Wu Z; Finnish Museum of Natural History LUOMUS, Helsinki, Finland.
Phillips S; Royal Botanic Gardens Kew, Surrey, United Kingdom.

Biodivers Data J; (7): e31817, 2019.

Article in English | MEDLINE | ID: mdl-30833825

ABSTRACT

BACKGROUND:

More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons.

NEW INFORMATION:

To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used where a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.

Full text: 1 Databases: MEDLINE Language: English Journal: Biodivers Data J Year: 2019 Document type: Article