ABSTRACT
BACKGROUND: OpenBiodiv is a biodiversity knowledge graph containing a synthetic linked open dataset, OpenBiodiv-LOD, which combines knowledge extracted from academic literature with the taxonomic backbone used by the Global Biodiversity Information Facility. The linked open data is modelled according to the OpenBiodiv-O ontology, which integrates semantic resource types from recognised biodiversity and publishing ontologies with OpenBiodiv-O resource types introduced to capture the semantics of resources not modelled before. NEW INFORMATION: We introduce the new release of OpenBiodiv-LOD, attained through information extraction and modelling of additional biodiversity entities. It was achieved through further development of OpenBiodiv-O, of the data storage infrastructure, and of the workflow and accompanying R software packages used to transform academic literature into Resource Description Framework (RDF). We discuss how to utilise the LOD in biodiversity informatics and give examples by providing solutions to several competency questions. We investigate performance issues that arise from the large number of inferred statements in the graph and conclude that OWL Full inference is impractical for the project and that unnecessary inference should be avoided.
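Competency questions of this kind can be answered with SPARQL queries against the LOD. Below is a minimal sketch in Python using SPARQLWrapper; the endpoint URL is an assumption for illustration, and the query is an illustrative example rather than one of the paper's own competency questions.

```python
# A minimal sketch of answering a competency question over OpenBiodiv-LOD
# via SPARQL. The endpoint URL is assumed; the FaBiO class is real, but the
# query itself is illustrative rather than one of the paper's examples.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://graph.openbiodiv.net/repositories/OpenBiodiv"  # assumed

QUERY = """
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fabio: <http://purl.org/spar/fabio/>

# Illustrative competency question: list some journal articles in the graph.
SELECT ?article ?title WHERE {
  ?article a fabio:JournalArticle ;
           rdfs:label ?title .
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["article"]["value"], "-", row["title"]["value"])
```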
ABSTRACT
Connecting basic data about bats and other potential hosts of SARS-CoV-2 with their ecological context is crucial to the understanding of the emergence and spread of the virus. However, when lockdowns in many countries started in March, 2020, the world's bat experts were locked out of their research laboratories, which in turn impeded access to large volumes of offline ecological and taxonomic data. Pandemic lockdowns have drawn attention to the long-standing problem of so-called biological dark data: data that are published, but disconnected from digital knowledge resources and thus unavailable for high-throughput analysis. Knowledge of host-to-virus ecological interactions will be biased until this challenge is addressed. In this Viewpoint, we outline two viable solutions: first, in the short term, to interconnect published data about host organisms, viruses, and other pathogens; and second, to shift the publishing framework beyond unstructured text (the so-called PDF prison) to labelled networks of digital knowledge. As the indexing system for biodiversity data, biological taxonomy is foundational to both solutions. Building digitally connected knowledge graphs of host-pathogen interactions will establish the agility needed to quickly identify reservoir hosts of novel zoonoses, allow for more robust predictions of emergence, and thereby strengthen human and planetary health systems.
Subjects
COVID-19, Host-Microorganism Interactions, Information Storage and Retrieval, Animals, COVID-19/epidemiology, COVID-19/virology, Humans, SARS-CoV-2, Zoonoses
ABSTRACT
A paper on the occasion of the 75th birthday of Terry Lee Erwin (1940-2020), an outstanding biologist and founding Editor-in-Chief of ZooKeys, was published in 2015 and contained complete lists of Erwin's publications, patronyms (taxa named after him) and new taxa published by him. The present paper aims to complement these lists with all new information published after 2015, including the papers in the present special issue of ZooKeys dedicated to the blessed memory of Terry Lee Erwin.
ABSTRACT
BACKGROUND: Data papers have emerged as a powerful instrument for open data publishing, obtaining credit, and establishing priority for datasets generated in scientific experiments. Academic publishing improves data and metadata quality through peer review and increases the impact of datasets by enhancing their visibility, accessibility, and reusability. OBJECTIVE: We aimed to establish a new type of article structure and template for omics studies: the omics data paper. To improve data interoperability and further incentivize researchers to publish well-described datasets, we created a prototype workflow for streamlined import of genomics metadata from the European Nucleotide Archive directly into a data paper manuscript. METHODS: An omics data paper template was designed by defining key article sections that encourage the description of omics datasets and methodologies. A metadata import workflow, based on Representational State Transfer (REST) services and XPath, was prototyped to extract information from the European Nucleotide Archive, ArrayExpress, and BioSamples databases. FINDINGS: The template and workflow for automatic import of standard-compliant metadata into an omics data paper manuscript provide a mechanism for enhancing existing metadata through publishing. CONCLUSION: The omics data paper structure and workflow for import of genomics metadata will help to bring genomic and other omics datasets into the spotlight. Promoting enhanced metadata descriptions and enforcing manuscript peer review and data auditing of the underlying datasets bring additional quality to datasets. We hope that streamlined metadata reuse for scholarly publishing encourages authors to create enhanced metadata descriptions in the form of data papers, improving both the quality of their metadata and its findability and accessibility.
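As an illustration of the REST-plus-XPath approach, here is a minimal Python sketch that fetches study metadata from ENA's public XML endpoint and maps two fields onto manuscript sections. The accession, the endpoint path, and the XML element names are assumptions based on ENA's study XML and may differ from the production workflow.

```python
# A minimal sketch of the REST + XPath import step, assuming ENA's public
# XML endpoint and the element names of its study XML; the production
# workflow described in the paper may differ in detail.
import requests
from lxml import etree

accession = "PRJEB00000"  # hypothetical ENA study accession
url = f"https://www.ebi.ac.uk/ena/browser/api/xml/{accession}"

doc = etree.fromstring(requests.get(url, timeout=30).content)

# Extract a few metadata fields with XPath (element names are assumptions).
title = doc.xpath("string(//STUDY/DESCRIPTOR/STUDY_TITLE)")
abstract = doc.xpath("string(//STUDY/DESCRIPTOR/STUDY_ABSTRACT)")

# Map them onto data paper manuscript sections.
manuscript = {"Title": title, "Dataset description": abstract}
print(manuscript)
```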
Subjects
Genomics, Metadata, Factual Databases, Peer Review, Workflow
ABSTRACT
Illegal transfer of wildlife has two main purposes: trade and scientific research. Trade is the most common, whereas scientific research is much less common and unprofitable, yet still important. Biopiracy in science is often neglected, even though many researchers encounter it during their careers. The use of illegally acquired specimens has been detected in different research fields, from scientists bioprospecting for new pharmacological substances, to taxonomists working on natural history collections, to researchers working in zoos, aquariums, and botanical gardens. The practice can be due to a lack of knowledge about the permit requirements in different countries or, probably most often, to the generally high level of bureaucracy associated with rule compliance. Significant regulatory filters to avoid biopiracy can be provided by different stakeholders. Natural history collection hosts should adopt strict codes of conduct; editors of scientific publications should require authors to declare that all studied specimens were acquired legally and to cite museum catalog numbers as a guarantee of best practice. Scientific societies should actively encourage publication in peer-reviewed journals of work in which specimens collected from the wild were used. The International Commission on Zoological Nomenclature could require newly designated types based on recently collected specimens to be accompanied by statements of deposition in recognized scientific or educational institutions. We also propose the creation of an online platform that gathers information about the environmental regulations and permits required for scientific activities in different countries, and the respective responsible governmental agencies, as well as the simplification of the bureaucracy involved in regulating scientific activities. This would make regulations more agile and easier to comply with. The global biodiversity crisis means data need to be collected ever faster, but biopiracy is not the answer: it undermines the credibility of science and researchers. It is critical to find a modus vivendi that promotes both compliance with regulations and scientific progress.
Subjects
Wild Animals, Conservation of Natural Resources, Animals, Biodiversity, Natural History
ABSTRACT
Phenotypes are used for a multitude of purposes, such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most of this phenotypic data is published in free-text narratives that are not computable. This means that the complex relationship between the genome, the environment and phenotypes is largely inaccessible to analysis, and important questions related to the evolution of organisms, their diseases or their response to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable solution for producing computable phenotypic data, for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future; and, most importantly, 3) it does not solve the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other). Our empirical studies have shown that inter-curator variation is as high as 40%, even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardised vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof-of-concept project based on this idea was funded by NSF ABI in July 2017. We seek readers' input on, and critique of, the proposed approaches to help achieve community-based production of computable phenotype data in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production.
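To make the contrast between free text and a computable format concrete, here is a minimal Python sketch of an entity-quality (EQ) phenotype statement; the data class and the ontology term identifiers are illustrative assumptions, not the project's actual format.

```python
# A minimal sketch of a computable phenotype as an entity-quality (EQ)
# statement. The dataclass and the ontology term IDs are illustrative;
# real curation would use verified UBERON/PATO terms.
from dataclasses import dataclass

@dataclass
class EQStatement:
    entity: str     # anatomical entity term (e.g. from UBERON)
    quality: str    # quality term (e.g. from PATO)
    free_text: str  # the original narrative phrase

# Free text "dorsal fin elongated" becomes a machine-queryable record.
phenotype = EQStatement(
    entity="UBERON:0003097",  # dorsal fin (illustrative ID)
    quality="PATO:0001154",   # elongated (illustrative ID)
    free_text="dorsal fin elongated",
)
print(phenotype)
```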
ABSTRACT
BACKGROUND: The biodiversity domain, and in particular biological taxonomy, is moving in the direction of semantization of its research outputs. The present work introduces OpenBiodiv-O, the ontology that serves as the basis of the OpenBiodiv Knowledge Management System. Our intent is to provide an ontology that fills the gaps between ontologies for biodiversity resources, such as Darwin Core-based ontologies, and semantic publishing ontologies, such as the SPAR Ontologies. We bridge this gap by providing an ontology focusing on biological taxonomy. RESULTS: OpenBiodiv-O introduces classes, properties, and axioms in the domains of scholarly biodiversity publishing and biological taxonomy and aligns them with several important domain ontologies (FaBiO, DoCO, DwC, Darwin-SW, NOMEN, ENVO). By doing so, it bridges the ontological gap between scholarly biodiversity publishing and biological taxonomy, allows for the creation of a Linked Open Dataset (LOD) of biodiversity information (a biodiversity knowledge graph), and enables the creation of the OpenBiodiv Knowledge Management System. A key feature of the ontology is that it is an ontology of the scientific process of biological taxonomy, not of any particular state of knowledge. This feature allows it to express a multiplicity of scientific opinions. The resulting OpenBiodiv knowledge system may gain a high level of trust in the scientific community because it does not force a scientific opinion on its users (e.g. practicing taxonomists, library researchers, etc.), but rather provides the tools for experts to encode different views as science progresses. CONCLUSIONS: OpenBiodiv-O provides a conceptual model of the structure of a biodiversity publication and the development of related taxonomic concepts. It also serves as the basis for the OpenBiodiv Knowledge Management System.
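As an illustration of the kind of statements such an ontology enables, here is a minimal Python sketch using rdflib; the OpenBiodiv base IRI and the class and property names are assumptions for illustration, and the authoritative terms are those defined in OpenBiodiv-O itself.

```python
# A minimal sketch of stating a taxonomic name usage in RDF with rdflib.
# The OpenBiodiv base IRI, class and property names are assumptions for
# illustration; the authoritative terms are defined in OpenBiodiv-O itself.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

OPENBIODIV = Namespace("http://openbiodiv.net/")  # assumed base IRI
FABIO = Namespace("http://purl.org/spar/fabio/")

g = Graph()
g.bind("openbiodiv", OPENBIODIV)
g.bind("fabio", FABIO)

article = OPENBIODIV["article-0001"]  # hypothetical identifiers
usage = OPENBIODIV["tnu-0001"]

g.add((article, RDF.type, FABIO.JournalArticle))
g.add((usage, RDF.type, OPENBIODIV.TaxonomicNameUsage))          # assumed class
g.add((usage, RDFS.label, Literal("Aus bus Author, 2017")))
g.add((article, OPENBIODIV.containsTaxonomicNameUsage, usage))   # assumed property

print(g.serialize(format="turtle"))
```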
Subjects
Biological Ontologies, Biodiversity, Classification, Semantics
ABSTRACT
We describe a collaborative effort among four leading indexes of taxon names and nomenclatural acts (the International Plant Names Index (IPNI), Index Fungorum, MycoBank and ZooBank) and the journals PhytoKeys, MycoKeys and ZooKeys to create an automated, pre-publication registration workflow based on a server-to-server XML request/response model. The registration model for ZooBank uses the TaxPub schema, which is an extension of the Journal Article Tag Suite (JATS) of the National Library of Medicine (NLM). The indexing or registration model of IPNI and Index Fungorum will use the Taxonomic Concept Transfer Schema (TCS) as the basic standard for the workflow. Other journals and publishers who intend to implement automated, pre-publication registration of taxon names and nomenclatural acts can also use the open sample XML formats and the links to schemas and relevant information published in the paper.
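For illustration, here is a minimal Python sketch of a server-to-server registration request; the endpoint URL and the XML element names are hypothetical placeholders, as the real TaxPub- and TCS-based message formats are those published with the paper.

```python
# A minimal sketch of a server-to-server registration request. The endpoint
# URL and the XML element names here are hypothetical placeholders; the
# real TaxPub- and TCS-based message formats are given in the paper's
# sample XML files and schemas.
import requests
from lxml import etree

root = etree.Element("registration")
etree.SubElement(root, "nomenclaturalAct").text = "new species"
etree.SubElement(root, "taxonName").text = "Aus bus"
etree.SubElement(root, "journal").text = "ZooKeys"
payload = etree.tostring(root, xml_declaration=True, encoding="UTF-8")

# Placeholder URL standing in for a registry service such as ZooBank's.
response = requests.post(
    "https://registry.example.org/register",
    data=payload,
    headers={"Content-Type": "application/xml"},
    timeout=30,
)
print(response.status_code, response.text)
```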
ABSTRACT
Specimen data in taxonomic literature are among the highest quality primary biodiversity data. Innovative cybertaxonomic journals are using workflows that maintain data structure and disseminate electronic content to aggregators and other users; such structure is lost in traditional taxonomic publishing. Legacy taxonomic literature is a vast repository of knowledge about biodiversity. Currently, access to that resource is cumbersome, especially for non-specialist data consumers. Markup is a mechanism that makes this content more accessible, and is especially suited to machine analysis. Fine-grained XML (Extensible Markup Language) markup was applied to all (37) open-access articles published in the journal Zootaxa containing treatments on spiders (Order: Araneae). The markup approach was optimized to extract primary specimen data from legacy publications. These data were combined with data from articles containing treatments on spiders published in the Biodiversity Data Journal, where XML structure is part of the routine publication process. A series of charts was developed to visualize the content of specimen data in XML-tagged taxonomic treatments, either singly or in aggregate. The data can be filtered by several fields (including journal, taxon, institutional collection, collecting country, collector, author, article and treatment) to query particular aspects of the data. We demonstrate here that XML markup using GoldenGATE can address the challenge presented by unstructured legacy data, can extract structured primary biodiversity data which can be aggregated with and jointly queried with data from other Darwin Core-compatible sources, and show how visualization of these data can communicate key information contained in biodiversity literature. We complement recent studies on aspects of biodiversity knowledge by using XML-structured data to explore 1) the time lag between species discovery and description, and 2) the prevalence of rarity in species descriptions.
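As an illustration of how structured specimen data can be pulled from tagged treatments, here is a minimal Python sketch using lxml; the element and attribute names follow the general shape of fine-grained treatment markup but are assumptions, and real GoldenGATE or BDJ output will differ in detail.

```python
# A minimal sketch of extracting primary specimen data from an XML-tagged
# treatment. The element and attribute names follow the general shape of
# fine-grained treatment markup but are assumptions; real GoldenGATE or
# Biodiversity Data Journal output will differ in detail.
from lxml import etree

doc = etree.parse("treatment.xml")  # a marked-up taxonomic treatment

records = []
for cit in doc.xpath("//materialsCitation"):    # assumed element name
    records.append({
        "collector": cit.get("collectorName"),   # assumed attributes
        "country": cit.get("collectingCountry"),
        "collection": cit.get("collectionCode"),
    })

# Rows like these can be aggregated across articles and filtered by
# journal, taxon, collection, country, collector, etc.
print(records)
```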
ABSTRACT
Biodiversity data is being digitized and made available online at a rapidly increasing rate, but current practices typically do not preserve linkages between these data, which impedes interoperation, provenance tracking, and assembly of larger datasets. For data associated with biocollections, the biodiversity community has long recognized that an essential part of establishing and preserving linkages is to apply globally unique identifiers at the point when data are generated in the field and to persist these identifiers downstream, but this is seldom implemented in practice. There has been neither coalescence towards a single identifier solution (as in some other domains), nor even a set of recommended best practices and standards to support multiple identifier schemes sharing consistent responses. To make further progress towards a broader community consensus, a group of biocollections and informatics experts assembled in Stockholm in October 2014 to discuss community next steps to overcome current roadblocks. The workshop participants divided into four groups focusing on: identifier practice in current field biocollections; identifier application for legacy biocollections; identifiers as applied to biodiversity data records as they are published and made available in semantically marked-up publications; and cross-cutting identifier solutions that bridge across these domains. The main outcome was consensus on key issues, including recognition of the differences between legacy and new biocollections processes, the need for identifier metadata profiles that can report information on identifier persistence missions, and the unambiguous indication of the type of object associated with the identifier. Current identifier characteristics are also summarized, and an overview of available schemes and practices is provided.
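As a small illustration of one workshop theme, here is a minimal Python sketch that mints a globally unique identifier together with a metadata profile stating the object type and the persistence commitment; the profile fields are assumptions for illustration, not an agreed community standard.

```python
# A minimal sketch of minting a globally unique identifier with a small
# metadata profile stating what kind of object it names and what
# persistence is promised. The profile fields are assumptions for
# illustration, not an agreed community standard.
import uuid

def mint_identifier(object_type: str, persistence: str) -> dict:
    """Mint a UUID-based identifier with a minimal metadata profile."""
    return {
        "id": f"urn:uuid:{uuid.uuid4()}",
        "object_type": object_type,  # e.g. a physical specimen vs a record
        "persistence": persistence,  # the resolver's stated commitment
    }

record = mint_identifier(
    object_type="PhysicalSpecimen",
    persistence="resolvable for at least 10 years",  # illustrative
)
print(record)
```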
ABSTRACT
The Barcode of Life Data Systems (BOLD) is designed to support the generation and application of DNA barcode data, but it also provides a unique source of data with potential for many research uses. This paper explores the streamlining of BOLD specimen data to record species distributions, and their fast publication using the Biodiversity Data Journal (BDJ) and its authoring platform, the Pensoft Writing Tool (PWT). We selected a sample of 630 specimens and 10 species of a highly diverse group of parasitoid wasps (Hymenoptera: Braconidae, Microgastrinae) from the Nearctic region and used the information in BOLD to uncover a significant number of new records (of localities, provinces, territories and states). By converting specimen information (such as locality, collection date, collector, voucher depository) from the BOLD platform to the Excel template provided by the PWT, it is possible to quickly upload and generate long lists of "Material Examined" for papers discussing the taxonomy, ecology and/or new distribution records of species. For the vast majority of publications including DNA barcodes, the generation and publication of ancillary data associated with the barcoded material is seldom highlighted and often disregarded, and the analysis of those datasets to uncover new distribution patterns of species has rarely been explored, even though many BOLD records represent new and/or significant discoveries. The introduction of journals specializing in, and streamlining, the release of these datasets, such as the BDJ, should facilitate thorough analysis of these records, as shown in this paper.
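As an illustration of the conversion step, here is a minimal Python sketch that formats BOLD-style specimen rows as "Material Examined" entries; the file name and column names mirror common BOLD specimen fields but are assumptions, since the actual PWT Excel template defines its own columns.

```python
# A minimal sketch of turning BOLD-style specimen rows into "Material
# Examined" lines. The column names mirror common BOLD specimen fields but
# are assumptions; the actual PWT Excel template defines its own columns.
import csv

def material_examined(row: dict) -> str:
    """Format one specimen record as a Material Examined entry."""
    return (f"{row['country']}: {row['province_state']}, {row['locality']}, "
            f"{row['collection_date']}, leg. {row['collectors']} "
            f"({row['voucher_depository']})")

with open("bold_specimens.tsv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        print(material_examined(row))
```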
ABSTRACT
Fauna Europaea is Europe's main zoological taxonomic index, making the scientific names and distributions of all living, currently known, multicellular European land and freshwater animal species integrally available in one authoritative database. Fauna Europaea covers about 260,000 taxon names, including 145,000 accepted (sub)species, assembled by a large network of more than 400 leading specialists, using advanced electronic tools for data collation, with data quality assured through sophisticated validation routines. Fauna Europaea started in 2000 as an EC-funded FP5 project and provides a unique taxonomic reference for many user groups, such as scientists, governments, industries, nature conservation communities and educational programs. Fauna Europaea was formally accepted as an INSPIRE standard for Europe, as part of the European Taxonomic Backbone established in PESI. Fauna Europaea provides a public web portal at faunaeur.org with links to other key biodiversity services, is installed as a taxonomic backbone in a wide range of biodiversity services, and actively contributes to biodiversity informatics innovations in various initiatives and EC programs.
ABSTRACT
BACKGROUND: Recent years have seen a surge in projects that produce large volumes of structured, machine-readable biodiversity data. To make these data amenable to processing by generic, open source "data enrichment" workflows, they are increasingly being represented in a variety of standards-compliant interchange formats. Here, we report on an initiative in which software developers and taxonomists came together to address the challenges and highlight the opportunities in the enrichment of such biodiversity data by engaging in intensive, collaborative software development: the Biodiversity Data Enrichment Hackathon. RESULTS: The hackathon brought together 37 participants (including developers and taxonomists, i.e. scientific professionals who gather, identify, name and classify species) from 10 countries: Belgium, Bulgaria, Canada, Finland, Germany, Italy, the Netherlands, New Zealand, the UK, and the US. The participants brought expertise in processing structured data, text mining, development of ontologies, digital identification keys, geographic information systems, niche modeling, natural language processing, provenance annotation, semantic integration, taxonomic name resolution, web service interfaces, workflow tools and visualisation. Most use cases and exemplar data were provided by taxonomists. One goal of the meeting was to facilitate re-use and enhancement of biodiversity knowledge by a broad range of stakeholders, such as taxonomists, systematists, ecologists, niche modelers, informaticians and ontologists. The suggested use cases resulted in nine breakout groups addressing three main themes: i) mobilising heritage biodiversity knowledge; ii) formalising and linking concepts; and iii) addressing interoperability between service platforms. Another goal was to further foster a community of experts in biodiversity informatics and to build human links between research projects and institutions, in response to recent calls to further such integration in this research domain. CONCLUSIONS: Beyond deriving prototype solutions for each use case, areas of inadequacy were discussed and are being pursued further. It was striking how many possible applications for biodiversity data there were and how quickly solutions could be put together when the normal constraints to collaboration were broken down for a week. Conversely, mobilising biodiversity knowledge from its silos in heritage literature and natural history collections will continue to require formalisation of the concepts (and the links between them) that define the research domain, as well as increased interoperability between the software platforms that operate on these concepts.
ABSTRACT
With the publication of the first eukaryotic species description combining transcriptomic, DNA barcoding, and micro-CT imaging data, GigaScience and Pensoft demonstrate how the classical taxonomic description of a new species can be enhanced by applying new-generation molecular methods and novel computing and imaging technologies. This 'holistic' approach to the taxonomic description of a new species of cave-dwelling centipede is published in the Biodiversity Data Journal (BDJ), with a coordinated data release in the GigaScience GigaDB database.