ABSTRACT
Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies produce large amounts of genome-scale genetic diversity data for wild populations, but most (87%) do not include the associated spatial and temporal metadata necessary for them to be reused in monitoring programs or for acknowledging the sovereignty of nations or Indigenous peoples. We undertook a distributed datathon to quantify the availability of these missing metadata and to test the hypothesis that their availability decays with time. We also worked to remediate missing metadata by extracting them from associated published papers, online repositories, and direct communication with authors. Starting with 838 candidate genomic data sets (reduced representation and whole genome) from the International Nucleotide Sequence Database Collaboration, we determined that 561 contained mostly samples from wild populations. We successfully restored spatiotemporal metadata for 78% of these 561 data sets (n = 440 data sets with data on 45,105 individuals from 762 species in 17 phyla). Examining papers and online repositories was much more fruitful than contacting 351 authors, who replied to our email requests 45% of the time. Overall, 23% of our email queries to authors unearthed useful metadata. The probability of retrieving spatiotemporal metadata declined significantly as age of the data set increased. There was a 13.5% yearly decrease in metadata associated with published papers or online repositories and up to a 22% yearly decrease in metadata that were only available from authors. This rapid decay in metadata availability, mirrored in studies of other types of biological data, should motivate swift updates to data-sharing policies and researcher practices to ensure that the valuable context provided by metadata is not lost to conservation science forever.
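To give a feel for what yearly decreases of 13.5% and 22% imply over time, the following is a rough sketch only. It assumes the decrease compounds at a constant proportional rate and applies directly to the fraction of recoverable metadata; the abstract does not say whether the reported rates refer to probabilities or odds, so the function names and the compounding model here are illustrative assumptions, not the authors' analysis.

```python
import math


def remaining_fraction(yearly_decrease: float, years: float) -> float:
    """Fraction still recoverable after `years`, assuming constant compounding decline."""
    return (1.0 - yearly_decrease) ** years


def half_life_years(yearly_decrease: float) -> float:
    """Years until half of the metadata would be unrecoverable under the same assumption."""
    return math.log(0.5) / math.log(1.0 - yearly_decrease)


if __name__ == "__main__":
    # Rates taken from the abstract; the labels are shorthand for the two sources.
    for label, rate in (("papers/repositories", 0.135), ("authors only", 0.22)):
        print(
            f"{label}: {remaining_fraction(rate, 5):.2f} remaining after 5 years, "
            f"half-life {half_life_years(rate):.1f} years"
        )
```

Under these assumptions, half of the metadata tied to papers or repositories would be gone in a little under five years, and half of the author-held metadata in under three.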
Importance of timely metadata curation for global monitoring of genetic diversity
Subjects
Conservation of Natural Resources, Metadata, Humans, Biodiversity, Probability, Genetic Variation
ABSTRACT
Understanding variation of traits within and among species through time and across space is central to many questions in biology. Many resources assemble species-level trait data, but the data and metadata underlying those trait measurements are often not reported. Here, we introduce FuTRES (Functional Trait Resource for Environmental Studies; pronounced few-tress), an online datastore and community resource for individual-level trait reporting that utilizes a semantic framework. FuTRES already stores millions of trait measurements for paleobiological, zooarchaeological, and modern specimens, with a current focus on mammals. We compare dynamically derived extant mammal species' body size measurements in FuTRES with summary values from other compilations, highlighting potential issues with simply reporting a single mean estimate. We then show that individual-level data improve estimates of body mass, including uncertainty, for zooarchaeological specimens. FuTRES facilitates trait data integration and discoverability, accelerating new research agendas, especially scaling from intra- to interspecific trait variability.
ABSTRACT
Emerging infectious diseases have been especially devastating to amphibians, the most endangered class of vertebrates. For amphibians, the greatest disease threat is chytridiomycosis, caused by one of two chytridiomycete fungal pathogens, Batrachochytrium dendrobatidis (Bd) or Batrachochytrium salamandrivorans (Bsal). Research over the last two decades has shown that susceptibility to this disease varies greatly with respect to a suite of host and pathogen factors such as phylogeny, geography (including abiotic factors), host community composition, and historical exposure to pathogens; yet, despite a growing body of research, a comprehensive understanding of global chytridiomycosis incidence remains elusive. In a large collaborative effort, Bd-Maps was launched in 2007 to increase multidisciplinary investigations and understanding using compiled global Bd occurrence data (Bsal was not discovered until 2013). As its database functions aged and became unsustainable, we sought to address critical needs utilizing new technologies to meet the challenges of aggregating data to facilitate research on both Bd and Bsal. Here, we introduce an advanced central online repository to archive, aggregate, and share Bd and Bsal data collected from around the world. The Amphibian Disease Portal (https://amphibiandisease.org) addresses several critical community needs while also helping to build basic biological knowledge of chytridiomycosis. This portal could be useful for other amphibian diseases and could also be replicated for uses with other wildlife diseases. We show how the Amphibian Disease Portal provides: (1) a new repository for the legacy Bd-Maps data; (2) a repository for sample-level data to archive datasets and host published data with permanent DOIs; (3) a flexible framework to adapt to advances in field, laboratory, and informatics technologies; and (4) a global aggregation of Bd and Bsal infection data to enable and accelerate research and conservation. The new framework for this project is built using biodiversity informatics best practices and metadata standards to ensure scientific reproducibility and linkages across other biological and biodiversity repositories.
ABSTRACT
Sampling the natural world and built environment underpins much of science, yet systems for managing material samples and associated (meta)data are fragmented across institutional catalogs, practices for identification, and discipline-specific (meta)data standards. The Internet of Samples (iSamples) is a standards-based collaboration to uniquely, consistently, and conveniently identify material samples, record core metadata about them, and link them to other samples, data, and research products. iSamples extends existing resources and best practices in data stewardship to render a cross-domain cyberinfrastructure that enables transdisciplinary research, discovery, and reuse of material samples in 21st century natural science.
Subjects
Internet, Metadata
ABSTRACT
Genetic data represent a relatively new frontier for our understanding of global biodiversity. Ideally, such data should include both organismal DNA-based genotypes and the ecological context where the organisms were sampled. Yet most tools and standards for data deposition focus exclusively either on genetic or ecological attributes. The Genomic Observatories Metadatabase (GEOME: geome-db.org) provides an intuitive solution for maintaining links between genetic data sets stored by the International Nucleotide Sequence Database Collaboration (INSDC) and their associated ecological metadata. GEOME facilitates the deposition of raw genetic data to INSDC's Sequence Read Archive (SRA) while maintaining persistent links to standards-compliant ecological metadata held in the GEOME database. This approach facilitates findable, accessible, interoperable and reusable data archival practices. Moreover, GEOME enables data management solutions for large collaborative groups and expedites batch retrieval of genetic data from the SRA. The article that follows describes how GEOME can enable genuinely open data workflows for researchers in the field of molecular ecology.
Subjects
Biodiversity, Nucleic Acid Databases, Genomics, Metadata, Research, Ecology, Information Storage and Retrieval, Workflow
ABSTRACT
Plant and animal phenology is shifting in response to urbanization, with most hypotheses focusing on the 'urban heat island' (UHI) effect as the driver. However, generalities regarding the direction and magnitude of phenological response to urbanization have not yet emerged because most studies have focused on remote-sensed vegetative phenologies or at local scales with relatively few species. Furthermore, how urbanization interacts with broad-scale climate gradients remains an unknown but important component of anthropogenically driven phenological change. Here, we used a database with >22 million in situ plant phenological observations from the United States and Europe to study the joint influence of varying human population density, which serves as an urbanization measure, and of regional temperature on median flowering and leaf-out dates across a wide plant phylogenetic spectrum. Separately, increasing population density and warmer regional temperature both advanced plant flowering and leaf-out. However, the influence of human population density on plant flowering and leaf-out depends on the regional temperature: high population density advanced plant phenology in cold areas but this effect disappeared or even reversed in warm areas. UHI effects (as measured by daily land surface temperature) alone cannot explain the overall influence of urbanization on plant phenology, suggesting that urbanization also affects plant phenology via other mechanisms. Shorter plants with large specific leaf areas and early flower or leaf-out dates were most affected by urbanization and temperature changes. Our study provides strong empirical evidence that the influence of urbanization on plant phenology varies with regional temperature. Therefore, robust understanding and accurate prediction of phenological changes must take this interaction into account.
Subjects
Climate Change, Urbanization, Animals, Europe, Humans, Phylogeny, Seasons, Temperature, United States
ABSTRACT
PREMISE OF THE STUDY: The Plant Phenology Ontology (PPO) was originally developed to integrate phenology observations of whole plants across different global observation networks. Here we describe a new release of the PPO and associated data pipelines that supports integration of phenology observations from herbarium specimens, which provide historical and modern phenology data. METHODS AND RESULTS: Critical changes to the PPO include key terms that describe how measurements from parts of plants, which are captured in most imaged herbarium specimens, relate to whole plants. We provide proof of concept for ingesting annotations from imaged herbarium sheets of Prunus serotina, the common black cherry. We then provide an example analysis of changes in flowering timing over the past 125 years, demonstrating the value of integrating herbarium and observational phenology data sets. CONCLUSIONS: These conceptual and technical advances will support the addition of phenology data from herbaria, but also could be expanded upon to facilitate the inclusion of data from photograph-based citizen science platforms. With the incorporation of herbarium phenology data, new historical baseline data will strengthen the capability to monitor, model, and forecast plant phenology changes.
ABSTRACT
Plant phenology - the timing of plant life-cycle events, such as flowering or leafing out - plays a fundamental role in the functioning of terrestrial ecosystems, including human agricultural systems. Because plant phenology is often linked with climatic variables, there is widespread interest in developing a deeper understanding of global plant phenology patterns and trends. Although phenology data from around the world are currently available, truly global analyses of plant phenology have so far been difficult because the organizations producing large-scale phenology data are using non-standardized terminologies and metrics during data collection and data processing. To address this problem, we have developed the Plant Phenology Ontology (PPO). The PPO provides the standardized vocabulary and semantic framework that is needed for large-scale integration of heterogeneous plant phenology data. Here, we describe the PPO, and we also report preliminary results of using the PPO and a new data processing pipeline to build a large dataset of phenology information from North America and Europe.
ABSTRACT
The Genomic Observatories Metadatabase (GeOMe, http://www.geome-db.org/) is an open access repository for geographic and ecological metadata associated with biosamples and genetic data. Whereas public databases have served as vital repositories for nucleotide sequences, they do not accession all the metadata required for ecological or evolutionary analyses. GeOMe fills this need, providing a user-friendly, web-based interface for both data contributors and data recipients. The interface allows data contributors to create a customized yet standard-compliant spreadsheet that captures the temporal and geospatial context of each biosample. These metadata are then validated and permanently linked to archived genetic data stored in the National Center for Biotechnology Information's (NCBI's) Sequence Read Archive (SRA) via unique persistent identifiers. By linking ecologically and evolutionarily relevant metadata with publicly archived sequence data in a structured manner, GeOMe sets a gold standard for data management in biodiversity science.
Subjects
Biodiversity, Nucleic Acid Databases, Metadata, Metagenomics
ABSTRACT
In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline 10 lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision and reuse of identifiers. We also outline the important considerations for those referencing identifiers in various circumstances, including by authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness about how to avoid and manage common identifier problems, especially those related to persistence and web-accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
Subjects
Biological Science Disciplines/methods, Computational Biology/methods, Data Mining/methods, Software Design, Software, Biological Science Disciplines/statistics & numerical data, Biological Science Disciplines/trends, Computational Biology/trends, Data Mining/statistics & numerical data, Data Mining/trends, Factual Databases/statistics & numerical data, Factual Databases/trends, Forecasting, Humans, Internet
ABSTRACT
Systems biology promises to revolutionize medicine, yet human wellbeing is also inherently linked to healthy societies and environments (sustainability). The IDEA Consortium is a systems ecology open science initiative to conduct the basic scientific research needed to build use-oriented simulations (avatars) of entire social-ecological systems. Islands are the most scientifically tractable places for these studies and we begin with one of the best known: Moorea, French Polynesia. The Moorea IDEA will be a sustainability simulator modeling links and feedbacks between climate, environment, biodiversity, and human activities across a coupled marine-terrestrial landscape. As a model system, the resulting knowledge and tools will improve our ability to predict human and natural change on Moorea and elsewhere at scales relevant to management/conservation actions.
Subjects
Conservation of Natural Resources/methods, Ecology/methods, Ecosystem, Theoretical Models, Climate, Conservation of Natural Resources/trends, Ecology/trends, Forecasting, Human Activities, Humans, Islands, Polynesia
ABSTRACT
Biodiversity data are being digitized and made available online at a rapidly increasing rate, but current practices typically do not preserve linkages between these data, which impedes interoperation, provenance tracking, and assembly of larger datasets. For data associated with biocollections, the biodiversity community has long recognized that an essential part of establishing and preserving linkages is to apply globally unique identifiers at the point when data are generated in the field and to persist these identifiers downstream, but this is seldom implemented in practice. There has neither been coalescence towards one single identifier solution (as in some other domains), nor even a set of recommended best practices and standards to support multiple identifier schemes sharing consistent responses. In order to further progress towards a broader community consensus, a group of biocollections and informatics experts assembled in Stockholm in October 2014 to discuss community next steps to overcome current roadblocks. The workshop participants divided into four groups focusing on: identifier practice in current field biocollections; identifier application for legacy biocollections; identifiers as applied to biodiversity data records as they are published and made available in semantically marked-up publications; and cross-cutting identifier solutions that bridge across these domains. The main outcome was consensus on key issues, including recognition of differences between legacy and new biocollections processes, the need for identifier metadata profiles that can report information on identifier persistence missions, and the unambiguous indication of the type of object associated with the identifier. Current identifier characteristics are also summarized, and an overview of available schemes and practices is provided.
ABSTRACT
The biodiversity informatics community has discussed aspirations and approaches for assigning globally unique identifiers (GUIDs) to biocollections for nearly a decade. During that time, and despite misgivings, the de facto standard identifier has become the "Darwin Core Triplet", which is a concatenation of values for institution code, collection code, and catalog number associated with biocollections material. Our aim is not to rehash the challenging discussions regarding which GUID system in theory best supports the biodiversity informatics use case of discovering and linking digital data across the Internet, but how well we can link those data together at this moment, utilizing the current identifier schemes that have already been deployed. We gathered Darwin Core Triplets from a subset of VertNet records, along with vertebrate records from GenBank and the Barcode of Life Data System, in order to determine how Darwin Core Triplets are deployed "in the wild". We asked if those triplets follow the recommended structure and whether they provide an easy and unambiguous means to track from specimen records to genetic sequence records. We show that Darwin Core Triplets are often riddled with semantic and syntactic errors when deployed and curated in practice, despite specifications about how to construct them. Our results strongly suggest that Darwin Core Triplets that have not been carefully curated are not currently serving a useful role for relinking data. We briefly consider needed next steps to overcome current limitations.
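As an illustration of the structure described above, here is a minimal sketch of building and checking a Darwin Core Triplet. It assumes the commonly seen colon-delimited form (institution code, collection code, catalog number); the helper names, the delimiter default, and the example values are assumptions for illustration and are not taken from the study's own tooling.

```python
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    institution_code: str
    collection_code: str
    catalog_number: str


def build_triplet(institution_code: str, collection_code: str,
                  catalog_number: str, sep: str = ":") -> str:
    """Concatenate the three values, refusing empty parts or parts containing the delimiter."""
    parts = [p.strip() for p in (institution_code, collection_code, catalog_number)]
    if any(not p or sep in p for p in parts):
        raise ValueError("each part must be non-empty and must not contain the delimiter")
    return sep.join(parts)


def parse_triplet(text: str, sep: str = ":") -> Optional[Triplet]:
    """Return a Triplet if `text` has exactly three non-empty parts, else None."""
    parts = [p.strip() for p in text.split(sep)]
    if len(parts) != 3 or any(not p for p in parts):
        return None
    return Triplet(*parts)


if __name__ == "__main__":
    # Hypothetical example values.
    print(build_triplet("MVZ", "Mamm", "12345"))
    print(parse_triplet("MVZ 12345"))  # malformed: no delimiter, so None
```

A syntactic check of this kind catches only malformed strings; it cannot detect semantic errors such as a wrong collection code, which is part of why uncurated triplets relink poorly.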
Subjects
Biodiversity, Computational Biology/methods, Database Management Systems, Information Storage and Retrieval, Factual Databases, Internet
ABSTRACT
BACKGROUND: Recent years have brought great progress in efforts to digitize the world's biodiversity data, but integrating data from many different providers, and across research domains, remains challenging. Semantic Web technologies have been widely recognized by biodiversity scientists for their potential to help solve this problem, yet these technologies have so far seen little use for biodiversity data. Such slow uptake has been due, in part, to the relative complexity of Semantic Web technologies along with a lack of domain-specific software tools to help non-experts publish their data to the Semantic Web. RESULTS: The BiSciCol Triplifier is new software that greatly simplifies the process of converting biodiversity data in standard, tabular formats, such as Darwin Core Archives, into Semantic Web-ready Resource Description Framework (RDF) representations. The Triplifier uses a vocabulary based on the popular Darwin Core standard, includes both Web-based and command-line interfaces, and is fully open-source software. CONCLUSIONS: Unlike most other RDF conversion tools, the Triplifier does not require detailed familiarity with core Semantic Web technologies, and it is tailored to a widely popular biodiversity data format and vocabulary standard. As a result, the Triplifier can often fully automate the conversion of biodiversity data to RDF, thereby making the Semantic Web much more accessible to biodiversity scientists who might otherwise have relatively little knowledge of Semantic Web technologies. Easy availability of biodiversity data as RDF will allow researchers to combine data from disparate sources and analyze them with powerful linked data querying tools. However, before software like the Triplifier, and Semantic Web technologies in general, can reach their full potential for biodiversity science, the biodiversity informatics community must address several critical challenges, such as the widespread failure to use robust, globally unique identifiers for biodiversity data.
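The general idea of turning tabular biodiversity records into RDF can be sketched as follows. This is not the Triplifier's code, vocabulary mapping, or output format; it is a simplified illustration that assumes each CSV column name is a Darwin Core term, uses a hypothetical base IRI for record subjects, and emits every value as a plain string literal in N-Triples syntax.

```python
import csv
import io
from urllib.parse import quote

DWC = "http://rs.tdwg.org/dwc/terms/"      # Darwin Core terms namespace
BASE = "http://example.org/occurrence/"    # hypothetical subject base IRI


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes, and newlines for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def rows_to_ntriples(csv_text: str, id_column: str = "occurrenceID") -> list:
    """Emit one triple per non-empty cell, using the id column to form the subject."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the identifier itself and empty cells
            triples.append(f'{subject} <{DWC}{column}> "{escape_literal(value)}" .')
    return triples


if __name__ == "__main__":
    table = "occurrenceID,scientificName,catalogNumber\nocc1,Prunus serotina,123\n"
    print("\n".join(rows_to_ntriples(table)))
```

Note that the subject IRIs are only as reliable as the identifier column that feeds them, which is the identifier problem the abstract closes on.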
Subjects
Biodiversity, Computational Biology/methods, Internet, Semantics, Software, User-Computer Interface
ABSTRACT
The co-authors of this paper hereby state their intention to work together to launch the Genomic Observatories Network (GOs Network) for which this document will serve as its Founding Charter. We define a Genomic Observatory as an ecosystem and/or site subject to long-term scientific research, including (but not limited to) the sustained study of genomic biodiversity from single-celled microbes to multicellular organisms. An international group of 64 scientists first published the call for a global network of Genomic Observatories in January 2012. The vision for such a network was expanded in a subsequent paper and developed over a series of meetings in Bremen (Germany), Shenzhen (China), Moorea (French Polynesia), Oxford (UK), Pacific Grove (California, USA), Washington (DC, USA), and London (UK). While this community-building process continues, here we express our mutual intent to establish the GOs Network formally, and to describe our shared vision for its future. The views expressed here are ours alone as individual scientists, and do not necessarily represent those of the institutions with which we are affiliated.
ABSTRACT
The study of biodiversity spans many disciplines and includes data pertaining to species distributions and abundances, genetic sequences, trait measurements, and ecological niches, complemented by information on collection and measurement protocols. A review of the current landscape of metadata standards and ontologies in biodiversity science suggests that existing standards such as the Darwin Core terminology are inadequate for describing biodiversity data in a semantically meaningful and computationally useful way. Existing ontologies, such as the Gene Ontology and others in the Open Biological and Biomedical Ontologies (OBO) Foundry library, provide a semantic structure but lack many of the necessary terms to describe biodiversity data in all its dimensions. In this paper, we describe the motivation for and ongoing development of a new Biological Collections Ontology, the Environment Ontology, and the Population and Community Ontology. These ontologies share the aim of improving data aggregation and integration across the biodiversity domain and can be used to describe physical samples and sampling processes (for example, collection, extraction, and preservation techniques), as well as biodiversity observations that involve no physical sampling. Together they encompass studies of: 1) individual organisms, including voucher specimens from ecological studies and museum specimens, 2) bulk or environmental samples (e.g., gut contents, soil, water) that include DNA, other molecules, and potentially many organisms, especially microbes, and 3) survey-based ecological observations. We discuss how these ontologies can be applied to biodiversity use cases that span genetic, organismal, and ecosystem levels of organization. We argue that if adopted as a standard and rigorously applied and enriched by the biodiversity community, these ontologies would significantly reduce barriers to data discovery, integration, and exchange among biodiversity resources and researchers.
Subjects
Biodiversity, Knowledge, Semantics
ABSTRACT
Information capture pertaining to the "what?", "where?", and "when?" of biodiversity data is critical to maintain data integrity, interoperability, and utility. Moreover, DNA barcoding and other biodiversity studies must adhere to agreed-upon data standards in order to effectively contextualize the biota encountered. A field information management system (FIMS) is presented that locks down metadata associated with collecting events, specimens, and tissues. Emphasis is placed on ease of use and flexibility of operation. Standardized templates for data entry are validated through a flexible, project-oriented validation process that assures adherence to data standards and thus data quality. Furthermore, we provide export functionality to existing cloud-based solutions, including Google Fusion Tables and Flickr, to allow sharing of these data elements across research collaboration teams and other potential data harvesters via API services.
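A template validation step of the kind described might look roughly like the sketch below. The rules are hypothetical and are not the FIMS rule set: the sketch assumes a record is a dictionary keyed by Darwin Core-style field names and checks only that the "what", "where", and "when" fields are present, that coordinates are in range, and that the date is in ISO `YYYY-MM-DD` form.

```python
from datetime import date

# Assumed required fields covering "what?", "where?", and "when?".
REQUIRED = ("scientificName", "decimalLatitude", "decimalLongitude", "eventDate")


def _check_range(record: dict, field: str, low: float, high: float, errors: list) -> None:
    """Append an error if the field is present but non-numeric or outside [low, high]."""
    raw = str(record.get(field, "")).strip()
    if not raw:
        return  # already reported as missing
    try:
        value = float(raw)
    except ValueError:
        errors.append(f"{field} is not a number: {raw!r}")
        return
    if not low <= value <= high:
        errors.append(f"{field} out of range [{low}, {high}]: {raw}")


def validate_record(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passed."""
    errors = [f"missing {f}" for f in REQUIRED if not str(record.get(f, "")).strip()]
    _check_range(record, "decimalLatitude", -90.0, 90.0, errors)
    _check_range(record, "decimalLongitude", -180.0, 180.0, errors)
    raw_date = str(record.get("eventDate", "")).strip()
    if raw_date:
        try:
            date.fromisoformat(raw_date)
        except ValueError:
            errors.append(f"eventDate is not an ISO date: {raw_date!r}")
    return errors


if __name__ == "__main__":
    # Hypothetical record with an impossible latitude.
    print(validate_record({"scientificName": "Prunus serotina", "decimalLatitude": "95",
                           "decimalLongitude": "-149.8", "eventDate": "2012-02-27"}))
```

Returning all problems at once, rather than stopping at the first, suits spreadsheet-style entry where a collector corrects a whole row in one pass.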
Subjects
Taxonomic DNA Barcoding/methods, Database Management Systems, Information Management, Information Systems, Taxonomic DNA Barcoding/standards, Information Management/standards, Information Systems/standards
ABSTRACT
Following up on efforts from two earlier workshops, a meeting was convened in San Diego to (a) establish working connections between experts in the use of the Darwin Core and the GSC MIxS standards, (b) conduct mutual briefings to promote knowledge exchange and to increase the understanding of the two communities' approaches, constraints, community goals, subtleties, etc., (c) perform an element-by-element comparison of the two standards, assessing the compatibility and complementarity of the two approaches, (d) propose and consider possible use cases and test beds in which a joint annotation approach might be tried, to useful scientific effect, and (e) propose additional action items necessary to continue the development of this joint effort. Several focused working teams were identified to continue the work after the meeting ended.
ABSTRACT
The Global Biodiversity Information Facility and the Genomic Standards Consortium convened a joint workshop at the University of Oxford, 27-29 February 2012, with a small group of experts from Europe, USA, China and Japan, to continue the alignment of the Darwin Core with the MIxS and related genomics standards. Several reference mappings were produced as well as test expressions of MIxS in RDF. The use and management of controlled vocabulary terms was considered in relation to both GBIF and the GSC, and tools for working with terms were reviewed. Extensions for publishing genomic biodiversity data to the GBIF network via a Darwin Core Archive were prototyped and work begun on preparing translations of the Darwin Core to Japanese and Chinese. Five genomic repositories were identified for engagement to begin the process of testing the publishing of genomic data to the GBIF network commencing with the SILVA rRNA database.