ABSTRACT
The early twenty-first century has witnessed massive expansions in the availability and accessibility of digital data in virtually all domains of the biodiversity sciences. Led by an array of asynchronous digitization activities spanning ecological, environmental, climatological, and biological collections data, these initiatives have produced a plethora of mostly disconnected and siloed data, leaving researchers the tedious and time-consuming manual tasks of finding and connecting them in usable ways, integrating them into coherent data sets, and making them interoperable. The focus to date has been on elevating analog and physical records to digital replicas in local databases before pushing them into ever-growing aggregations of essentially disconnected, discipline-specific information. In the present article, we propose a new interconnected network of digital objects on the Internet, the Digital Extended Specimen (DES) network, that transcends existing aggregator technology, augments the DES with third-party data through machine algorithms, and provides a platform for more efficient research and robust interdisciplinary discovery.
ABSTRACT
High-quality microbiome research relies on the integrity, management and quality of supporting data. Currently, biobanks and culture collections have different formats and approaches to data management. This necessitates a standard data format to underpin research, particularly in line with the FAIR data principles of findability, accessibility, interoperability and reusability. We address the importance of a unified, coordinated approach that not only ensures compatibility of data between biobanks and culture collections but also ensures linkage to bioinformatic databases and the wider research community.
Subject(s)
Databases, Factual/standards; Microbiota; Computational Biology; Europe; Research/standards
ABSTRACT
BACKGROUND: Taxonomic descriptions are traditionally composed in natural language and published in a format that cannot be used directly by computers. The Exploring Taxon Concepts (ETC) project has been developing a set of web-based software tools that convert morphological descriptions published in telegraphic style into character data that can be reused and repurposed. This paper introduces the first semi-automated pipeline, to our knowledge, that converts morphological descriptions into taxon-character matrices to support systematics and evolutionary biology research. We then demonstrate and evaluate the use of the ETC Input Creation - Text Capture - Matrix Generation pipeline to generate body part measurement matrices from a set of 188 spider morphological descriptions and report the findings. RESULTS: From the given set of spider taxonomic publications, two versions of input (original and normalized) were generated and used by the ETC Text Capture and ETC Matrix Generation tools. The tools produced two corresponding spider body part measurement matrices, and the matrix from the normalized input was found to be much more similar to a gold standard matrix hand-curated by the scientist co-authors. The lower performance on the original input was attributed to special conventions used in the original descriptions (e.g., the omission of measurement units). The results show that simple normalization of the description text greatly increased the quality of the machine-generated matrix and reduced edit effort. The machine-generated matrix also helped identify issues in the gold standard matrix. CONCLUSIONS: ETC Text Capture and ETC Matrix Generation are low-barrier and effective tools for extracting measurement values from spider taxonomic descriptions and are more effective when the descriptions are self-contained. Special conventions that make the description text less self-contained challenge the automated extraction of data from biodiversity descriptions and hinder the automated reuse of published knowledge. The tools will be updated to support new requirements revealed in this case study.
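To make the pipeline's core task concrete, here is a minimal Python sketch (an illustration only, not the actual ETC implementation) that extracts body part measurements from normalized, telegraphic description sentences and assembles them into a taxon-character matrix. The taxa, descriptions, body part list and regular expressions are all assumptions for the example; note that, as the study found, this kind of extraction depends on measurement units being present in the text.

```python
import re

# Illustrative sketch only: the real ETC Text Capture / Matrix Generation
# tools use a much richer NLP pipeline. The taxa, descriptions, and the
# list of body parts below are hypothetical.
DESCRIPTIONS = {
    "Araneus sp. A": "Total length 5.2 mm. Carapace 2.1 mm long, 1.8 mm wide.",
    "Araneus sp. B": "Total length 6.0 mm. Carapace 2.4 mm long, 2.0 mm wide.",
}

PART = re.compile(r"^(total length|carapace|abdomen)\b", re.IGNORECASE)
VALUE = re.compile(r"(\d+(?:\.\d+)?)\s*mm(?:\s+(long|wide|high))?", re.IGNORECASE)

def extract_measurements(text: str) -> dict[str, float]:
    """Turn one telegraphic description into {character name: value in mm}."""
    characters: dict[str, float] = {}
    for sentence in re.split(r"\.\s+|\.$", text):
        part_match = PART.match(sentence.strip())
        if not part_match:
            continue
        part = part_match.group(1).lower()
        # A clause like "Carapace 2.1 mm long, 1.8 mm wide" yields two
        # characters: "carapace long" and "carapace wide".
        for value, dim in VALUE.findall(sentence):
            name = f"{part} {dim.lower()}" if dim else part
            characters[name] = float(value)
    return characters

# Rows are taxa, columns are characters: a taxon-character matrix.
matrix = {taxon: extract_measurements(d) for taxon, d in DESCRIPTIONS.items()}
for taxon, characters in matrix.items():
    print(taxon, characters)
```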
Subject(s)
Biological Evolution; Software; Spiders/anatomy & histology; Animals; Humans
ABSTRACT
BACKGROUND: The need to create controlled vocabularies such as ontologies for knowledge organization and access has been widely recognized in various domains. Despite the indispensable need for thorough domain knowledge in ontology construction, most software tools for ontology construction are designed for knowledge engineers rather than domain experts. Differences in the opinions of domain experts and in the terminology usage of the source literature are rarely addressed by existing software. METHODS: OTO was developed following Agile principles. Through iterations of software releases and user feedback, new features were added and existing features modified to make the tool more intuitive and efficient for both small and large data sets. The software is open source and built in Java. RESULTS: Ontology Term Organizer (OTO; http://biosemantics.arizona.edu/OTO/ ) is a user-friendly, web-based, consensus-promoting, open source application for organizing domain terms by dragging and dropping them into appropriate locations. The application is designed for users with specific domain knowledge, such as biology, but without in-depth ontology construction skills. Specifically, OTO can be used to establish is_a, part_of, synonym, and order relationships among terms in any domain, reflecting the terminology usage of the source literature and the opinions of multiple experts. The organized terms may be fed into formal ontologies to boost their coverage. All data sets organized on OTO are publicly available. CONCLUSION: OTO has been used to organize the terms extracted from thirty volumes of Flora of North America and Flora of China combined, in addition to several smaller data sets for different taxon groups. User feedback indicates that the tool is efficient and user friendly. Being open source, the application can be modified to fit varied term organization needs in different domains.
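To make the kinds of relationships OTO organizes concrete, the following Python sketch models terms linked by is_a, part_of and synonym relations and exports them as simple triples that could seed a formal ontology. The data model and term names are illustrative assumptions, not OTO's actual internals.

```python
from dataclasses import dataclass, field

# A minimal model of organized domain terms; illustrative only.
@dataclass
class Term:
    label: str
    synonyms: set[str] = field(default_factory=set)
    is_a: list["Term"] = field(default_factory=list)      # category parents
    part_of: list["Term"] = field(default_factory=list)   # structural parents

structure = Term("plant structure")
leaf = Term("leaf", is_a=[structure])
blade = Term("blade", synonyms={"lamina"}, part_of=[leaf])

def triples(term: Term):
    """Export a term's relations as (subject, predicate, object) triples."""
    for parent in term.is_a:
        yield (term.label, "is_a", parent.label)
    for whole in term.part_of:
        yield (term.label, "part_of", whole.label)
    for syn in term.synonyms:
        yield (syn, "synonym_of", term.label)

for t in (leaf, blade):
    for triple in triples(t):
        print(triple)
```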
Subject(s)
Computational Biology/methods; Natural Language Processing; Software; Terminology as Topic; Controlled Vocabulary; Humans; Semantics; User-Computer Interface
ABSTRACT
Critical to answering large-scale questions in biology is the integration of knowledge from different disciplines into a coherent, computable whole. Controlled vocabularies such as ontologies represent a clear path toward this goal. Using survey questionnaires, we examined the attitudes of biologists toward adopting controlled vocabularies in phenotype publications. Our questions cover current experience with and overall attitudes toward controlled vocabularies, awareness of the issues around ambiguity and inconsistency in phenotype descriptions and post-publication professional data curation, preferred solutions, and the effort and desired rewards for adopting a new authoring workflow. Results suggest that although controlled vocabularies are widespread, their use is not common. A majority of respondents (74%) are frustrated with ambiguity in phenotypic descriptions, and there is strong agreement (mean agreement score 4.21 out of 5) that author curation would better reflect the original meaning of phenotype data. Moreover, the vast majority (85%) of researchers would try a new authoring workflow if the resultant data were more consistent and less ambiguous. Even more respondents (93%) would try and possibly adopt a new authoring workflow if it required 5% additional effort compared to their normal workflow, but higher effort levels resulted in a steep decline in likely adoption. Among the four types of rewards, two types of citations were the most desired incentives for authors to produce computable data. Overall, our results suggest the adoption of a new authoring workflow would be accelerated by a user-friendly and efficient software-authoring tool, an increased awareness of the challenges text ambiguity creates for external curators, and an elevated appreciation of the benefits of controlled vocabularies.
Subject(s)
Data Curation; Software; Attitude; Phenotype; Workflow
ABSTRACT
Omic BON is a thematic Biodiversity Observation Network under the Group on Earth Observations Biodiversity Observation Network (GEO BON), focused on coordinating the observation of biomolecules in organisms and the environment. Our founding partners include representatives from national, regional, and global observing systems; standards organizations; and data and sample management infrastructures. By coordinating observing strategies, methods, and data flows, Omic BON will facilitate the co-creation of a global omics meta-observatory to generate actionable knowledge. Here, we present key elements of Omic BON's founding charter and first activities.
Subject(s)
Biodiversity; Knowledge
ABSTRACT
BACKGROUND: Here we present a revised species checklist for the Brassicaceae, updated from Warwick SI, Francis A, Al-Shehbaz IA (2006), Brassicaceae: Species checklist and database on CD-ROM, Plant Systematics and Evolution 259: 249-258. This update of the checklist was initiated based on recent taxonomic and molecular studies of the Brassicaceae that have resulted in new species names, combinations and associated synonyms. NEW INFORMATION: New data have been added indicating tribal affiliations within the family and where type specimens have been designated. In addition, information from many early publications has been checked and added to the database. The database now includes information on 14983 taxa, 4636 of which are currently accepted and divided into 340 genera and 52 tribes. A selected bibliography of recent publications on the Brassicaceae is included.
ABSTRACT
Producing findable, accessible, interoperable and reusable (FAIR) data cannot be accomplished solely by data curators in all disciplines. In biology, we have shown that phenotypic data curation is not only costly but also burdened with inter-curator variation. We aim to develop a software platform that would enable all data producers, including authors of scientific publications, to produce ontologized data at the time of publication. Working toward this goal, we need to identify the ontology construction methods preferred by end users. Here, we employ two usability studies to evaluate effectiveness, efficiency and user satisfaction with a set of four methods that allow an end user to add terms and their relations to an ontology. Thirty-three participants took part in a controlled experiment in which they evaluated the four methods (Quick Form, Wizard, WebProtégé and Wikidata) after watching demonstration videos and completing a hands-on task. A complementary think-aloud study was conducted with three professional botanists. The efficiency, effectiveness and user confidence associated with the methods are clearly revealed through statistical and content analyses of participants' comments. Quick Form, Wizard and WebProtégé offer distinct strengths that would benefit our author-driven FAIR data generation system. Features preferred by the participants will guide the design of future iterations.
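All four methods ultimately yield the same kind of artifact: a new term plus a relation added to an ontology. A minimal sketch of that end product, using the rdflib library; the namespace and term names are hypothetical, and no particular method's interface is implied.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Hypothetical ontology namespace for the example.
EX = Namespace("http://example.org/phenotype-ontology#")

g = Graph()
g.bind("ex", EX)

# A class assumed to exist in the ontology already.
g.add((EX.leaf, RDF.type, OWL.Class))
g.add((EX.leaf, RDFS.label, Literal("leaf")))

# The end user's contribution: a new term plus an is_a (subClassOf) relation.
g.add((EX.compound_leaf, RDF.type, OWL.Class))
g.add((EX.compound_leaf, RDFS.label, Literal("compound leaf")))
g.add((EX.compound_leaf, RDFS.subClassOf, EX.leaf))

print(g.serialize(format="turtle"))
```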
Subject(s)
Gene Ontology; Software; Humans
ABSTRACT
To use published phenotype information in computational analyses, there have been efforts to convert descriptions of phenotype characters from human languages into ontologized statements. This post-publication curation process is not only slow and costly, it is also burdened with significant inter-curator variation (including curator-author variation) due to different interpretations of a character by different individuals, a problem inherent in any human-based intellectual activity. To address it, making scientific publications semantically clear (i.e. computable) by the authors at the time of publication is a critical step if we are to avoid post-publication curation. To help authors efficiently describe species phenotypes while producing computable data, we are experimenting with an author-driven ontology development approach and are developing and evaluating a series of ontology-aware software modules that create publishable species descriptions readily usable in scientific computations. This paper reports on the first software module prototype, called Measurement Recorder, which assists authors in defining continuous measurements. Two usability studies of the software were conducted with 22 undergraduate students majoring in information science and 32 majoring in biology. Results suggest that participants can use Measurement Recorder without training and find it easy to use after limited practice. Participants also appreciate its semantic enhancement features. Measurement Recorder's character reuse features increase character convergence among participants by 48% and have the potential to further reduce user errors in defining characters. A set of software design issues has also been identified and corrected. Measurement Recorder enables authors to record measurements in a semantically clear manner and enriches a phenotype ontology along the way. Future work includes representing the semantic data as Resource Description Framework (RDF) knowledge graphs and characterizing the division of work between authors, as domain knowledge providers, and ontology engineers, as knowledge formalizers, in this new author-driven ontology development approach.
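As a rough sketch of the character reuse idea, and not Measurement Recorder's actual code, the fragment below defines a continuous character once in a shared registry and hands the same definition back to subsequent authors; this is the kind of mechanism behind the character convergence reported above. All names are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContinuousCharacter:
    structure: str   # e.g. an anatomy term label such as "carapace"
    quality: str     # e.g. "length", "width"
    unit: str        # e.g. "mm"

@dataclass
class Measurement:
    character: ContinuousCharacter
    value: float

# A shared registry lets a second author find and reuse an existing
# character definition instead of coining a near-duplicate.
registry: dict[tuple[str, str], ContinuousCharacter] = {}

def define_or_reuse(structure: str, quality: str, unit: str) -> ContinuousCharacter:
    key = (structure.lower(), quality.lower())
    if key not in registry:
        registry[key] = ContinuousCharacter(structure, quality, unit)
    return registry[key]

c1 = define_or_reuse("carapace", "length", "mm")
c2 = define_or_reuse("Carapace", "length", "mm")   # reused, not duplicated
assert c1 is c2
print(Measurement(c1, 2.1))
```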
Subject(s)
Semantics; Software; Humans; Phenotype
ABSTRACT
An amendment to this paper has been published and can be accessed via the original article.
ABSTRACT
The field of microbiome research has evolved rapidly over the past few decades and has become a topic of great scientific and public interest. As a result of this rapid growth in interest across different fields, we lack a clear, commonly agreed definition of the term "microbiome," and a consensus on best practices in microbiome research is missing. Recently, a panel of international experts discussed the current gaps in the frame of the European-funded MicrobiomeSupport project. The meeting brought together about 40 leaders from diverse microbiome areas, and more than a hundred experts from all over the world took part in an online survey accompanying the workshop. This article summarizes the outcomes of the workshop and the corresponding online survey, embedded in a short historical introduction and future outlook. We propose a definition of microbiome based on the compact, clear, and comprehensive description of the term provided by Whipps et al. in 1988, amended with a set of novel recommendations reflecting the latest technological developments and research findings. We clearly separate the terms microbiome and microbiota and provide a comprehensive discussion of the composition of microbiota, the heterogeneity and dynamics of microbiomes in time and space, the stability and resilience of microbial networks, the definition of core microbiomes, and functionally relevant keystone species, as well as the co-evolutionary principles of microbe-host and inter-species interactions within the microbiome. These broad definitions, together with the suggested unifying concepts, will help to improve the standardization of microbiome studies in the future and could be the starting point for an integrated assessment of data, resulting in a more rapid transfer of knowledge from basic science to practice. Furthermore, microbiome standards are important for solving new challenges associated with anthropogenic-driven changes in the field of planetary health, for which the understanding of microbiomes might play a key role.
Subject(s)
Microbiota; Terminology as Topic; Surveys and Questionnaires
ABSTRACT
Phenotypes are used for a multitude of purposes, such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most phenotypic data are published in free-text narratives that are not computable. This means that the complex relationships between the genome, the environment and phenotypes are largely inaccessible to analysis, and important questions related to the evolution of organisms, their diseases or their responses to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable way to produce computable phenotypic data, for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future; and, most importantly, 3) it does not solve the problem of inter-curator variation (curators interpret or convert a phenotype differently from one another). Our empirical studies have shown that inter-curator variation is as high as 40% even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation are semantic vagueness in the original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof-of-concept project on this idea was funded by NSF ABI in July 2017. We seek readers' input on, and critique of, the proposed approaches to help achieve community-based computable phenotype data production in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production.
ABSTRACT
Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks, and any consequent actions taken by the holders of the annotated data could be shared as well. Yet even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine enough granularity for the results to be used in this way for data quality control. We address these issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services; an important application is support for data quality control and the provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML Schema.
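To illustrate the granularity requirement, here is a sketch of a single-field data annotation in the spirit of the W3C Web Annotation model; the record URI, field selector and annotation body are hypothetical, and the ontology extensions described above are not reproduced here. The point is that the annotation targets exactly one metadata field of one specimen record, so a quality-control workflow can act on precisely that field.

```python
import json

# A hypothetical fine-grained annotation: it disputes the value of a single
# metadata field (dwc:country) on one specimen record, rather than the whole
# record or data set, which is what makes it usable for quality control.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "editing",
    "creator": "https://example.org/people/curator-42",
    "body": {
        "type": "TextualBody",
        "value": "Country should be 'Brazil'; 'Brasil' is a verbatim transcription.",
    },
    "target": {
        "source": "https://example.org/collection/specimen/1234",
        "selector": {
            "type": "FragmentSelector",
            "value": "dwc:country",   # the single field under discussion
        },
    },
}

print(json.dumps(annotation, indent=2))
```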