Results 1 - 20 of 48
1.
Syst Biol ; 71(6): 1290-1306, 2022 10 12.
Article in English | MEDLINE | ID: mdl-35285502

ABSTRACT

Morphology remains a primary source of phylogenetic information for many groups of organisms, and the only one for most fossil taxa. Organismal anatomy is not a collection of randomly assembled and independent "parts", but instead a set of dependent and hierarchically nested entities resulting from ontogeny and phylogeny. How do we make sense of these dependent and at times redundant characters? One promising approach is using ontologies: structured controlled vocabularies that summarize knowledge about different properties of anatomical entities, including developmental and structural dependencies. Here, we assess whether evolutionary patterns can explain the proximity of ontology-annotated characters within an ontology. To do so, we measure phylogenetic information across characters and evaluate if it matches the hierarchical structure given by ontological knowledge, in much the same way as across-species diversity structure is given by phylogeny. We implement an approach to evaluate the Bayesian phylogenetic information (BPI) content and phylogenetic dissonance among ontology-annotated anatomical data subsets. We applied this to data sets representing two disparate animal groups: bees (Hexapoda: Hymenoptera: Apoidea, 209 chars) and characiform fishes (Actinopterygii: Ostariophysi: Characiformes, 463 chars). For bees, we find that BPI is not substantially explained by anatomy since dissonance is often high among morphologically related anatomical entities. For fishes, we find substantial information for two clusters of anatomical entities instantiating concepts from the jaws and branchial arch bones, but among-subset information decreases and dissonance increases substantially moving to higher-level subsets in the ontology. We further applied our approach to address particular evolutionary hypotheses with an example of morphological evolution in miniature fishes.
While we show that phylogenetic information does match ontology structure for some anatomical entities, additional relationships and processes, such as convergence, likely play a substantial role in explaining BPI and dissonance, and merit future investigation. Our work demonstrates how complex morphological data sets can be interrogated with ontologies by allowing one to access how information is spread hierarchically across anatomical concepts, how congruent this information is, and what sorts of processes may play a role in explaining it: phylogeny, development, or convergence. [Apidae; Bayesian phylogenetic information; Ostariophysi; Phenoscape; phylogenetic dissonance; semantic similarity.].


Subjects
Arthropods , Characiformes , Animals , Bayes Theorem , Fossils , Phylogeny
2.
PeerJ ; 10: e12618, 2022.
Article in English | MEDLINE | ID: mdl-35186448

ABSTRACT

To be computationally reproducible and efficient, integration of disparate data depends on shared entities whose matching meaning (semantics) can be computationally assessed. For biodiversity data one of the most prevalent shared entities for linking data records is the associated taxon concept. Unlike Linnaean taxon names, the traditional way in which taxon concepts are provided, phylogenetic definitions are native to phylogenetic trees and offer well-defined semantics that can be transformed into formal, computationally evaluable logic expressions. These attributes make them highly suitable for phylogeny-driven comparative biology by allowing computationally verifiable and reproducible integration of taxon-linked data against Tree of Life-scale phylogenies. To achieve this, the first step is transforming phylogenetic definitions from the natural language text in which they are published to a structured interoperable data format that maintains strong ties to semantics and lends itself well to sharing, reuse, and long-term archival. To this end, we developed the Phyloreference Exchange Format (Phyx), a JSON-LD-based text format encompassing rich metadata for all elements of a phylogenetic definition, and we created a supporting software library, phyx.js, to streamline computational management of such files. Together they form a foundation layer for digitizing and computing with phylogenetic definitions of clades.
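
As a rough illustration of what "computationally evaluable" means here, the sketch below resolves a minimum-clade style definition (the smallest clade containing all listed specifiers) against a toy tree by finding the most recent common ancestor. It does not use phyx.js or the Phyx file format; the tree, the `PARENT` mapping, and the function names are hypothetical and chosen only for the example.

```python
# Hypothetical toy tree stored as child -> parent links (not taken from Phyx).
PARENT = {
    "Homo": "Hominini",
    "Pan": "Hominini",
    "Hominini": "Homininae",
    "Gorilla": "Homininae",
    "Homininae": "Hominidae",
    "Pongo": "Hominidae",
}


def ancestors(node, parent=PARENT):
    """Return the path from `node` up to the root, with `node` itself first."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def resolve_minimum_clade(specifiers, parent=PARENT):
    """Return the most recent common ancestor of all specifiers.

    Assumption: the definition is of the minimum-clade kind, so it resolves
    to the smallest clade that contains every specifier.
    """
    first, *rest = specifiers
    common = set(ancestors(first, parent))
    for spec in rest:
        common &= set(ancestors(spec, parent))
    # Walking upward from the first specifier, the first shared node is the MRCA.
    for node in ancestors(first, parent):
        if node in common:
            return node
    raise ValueError("specifiers do not share an ancestor in this tree")


if __name__ == "__main__":
    print(resolve_minimum_clade(["Homo", "Gorilla"]))
```

Because the definition is evaluated against whichever tree is supplied, the same specifiers can resolve to a different node on a different phylogeny, which is the behavior that distinguishes such definitions from fixed name strings.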


Subjects
Semantics , Software , Phylogeny , Biology , Records
4.
NPJ Digit Med ; 3: 24, 2020.
Article in English | MEDLINE | ID: mdl-32140567

ABSTRACT

Storing very large amounts of data and delivering them to researchers in an efficient, verifiable, and compliant manner is one of the major challenges faced by health care providers and researchers in the life sciences. The electronic health record (EHR) at a hospital or clinic currently functions as a silo, and although EHRs contain rich and abundant information that could be used to understand, improve, and learn from care as part of a learning health system, access to these data is difficult, and the technical, legal, ethical, and social barriers are significant. If we create a microservice ecosystem where data can be accessed through APIs, these challenges become easier to overcome: a service-driven design decouples data from clients. This decoupling provides flexibility: different users can write in their preferred language and use different clients depending on their needs. APIs can be written for iOS apps, web apps, or an R library, and this flexibility highlights the potential ecosystem-building power of APIs. In this article, we use two case studies to illustrate what it means to participate in and contribute to interconnected ecosystems that power APIs in healthcare systems.

5.
Syst Biol ; 69(2): 345-362, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31596473

ABSTRACT

There is a growing body of research on the evolution of anatomy in a wide variety of organisms. Discoveries in this field could be greatly accelerated by computational methods and resources that enable these findings to be compared across different studies and different organisms and linked with the genes responsible for anatomical modifications. Homology is a key concept in comparative anatomy; two important types are historical homology (the similarity of organisms due to common ancestry) and serial homology (the similarity of repeated structures within an organism). We explored how to most effectively represent historical and serial homology across anatomical structures to facilitate computational reasoning. We assembled a collection of homology assertions from the literature with a set of taxon phenotypes for the skeletal elements of vertebrate fins and limbs from the Phenoscape Knowledgebase. Using seven competency questions, we evaluated the reasoning ramifications of two logical models: the Reciprocal Existential Axioms (REA) homology model and the Ancestral Value Axioms (AVA) homology model. The AVA model returned all user-expected results in addition to the search term and any of its subclasses. The AVA model also returns any superclass of the query term in which a homology relationship has been asserted. The REA model returned the user-expected results for five out of seven queries. We identify some challenges of implementing complete homology queries due to limitations of OWL reasoning. This work lays the foundation for homology reasoning to be incorporated into other ontology-based tools, such as those that enable synthetic supermatrix construction and candidate gene discovery. [Homology; ontology; anatomy; morphology; evolution; knowledgebase; phenoscape.].


Subjects
Classification/methods , Biological Models , Animal Fins/anatomy & histology , Animals , Extremities/anatomy & histology , Vertebrates/anatomy & histology
6.
Database (Oxford) ; 2018, 2018 01 01.
Article in English | MEDLINE | ID: mdl-30576485

ABSTRACT

Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis on phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations.
These findings point toward ways to better design software to augment human curators and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.


Subjects
Data Curation/methods , Data Mining/methods , Gene Ontology , Natural Language Processing , Phenotype , Humans
7.
F1000Res ; 7, 2018.
Article in English | MEDLINE | ID: mdl-30210780

ABSTRACT

In 2018, the annual Bioinformatics Open Source Conference was held for the first time in conjunction with the Galaxy Community Conference, as an experiment to see if we could reach people in the bioinformatics community who aren't part of the audience attracted by ISMB. Held in June 2018 at Reed College in Portland, Oregon, GCCBOSC (Galaxy Community Conference and Bioinformatics Open Source Conference) attracted over 300 participants from around the world. The meeting started with two days of training, followed by two days of talks and poster/demo sessions (with some joint and some parallel sessions). The joint sessions included well-received keynote talks by Tracy Teal, Fernando Pérez and Lucia Peixoto, as well as a panel discussion about documentation and training. After the main meeting, many attendees stayed for up to four additional collaboration days, an extended version of the Codefests that have been held in conjunction with previous BOSCs. GCCBOSC was a successful experiment. The organizers concluded that the best way to serve the broadest community of potential BOSC attendees will be to partner some years with the International Society for Computational Biology (ISMB) and others with GCC.


Subjects
Computational Biology , Intersectoral Collaboration
8.
Cell Syst ; 6(4): 470-483.e8, 2018 Apr 25.
Article in English | MEDLINE | ID: mdl-29605182

ABSTRACT

Paralogous transcription factors (TFs) are oftentimes reported to have identical DNA-binding motifs, despite the fact that they perform distinct regulatory functions. Differential genomic targeting by paralogous TFs is generally assumed to be due to interactions with protein co-factors or the chromatin environment. Using a computational-experimental framework called iMADS (integrative modeling and analysis of differential specificity), we show that, contrary to previous assumptions, paralogous TFs bind differently to genomic target sites even in vitro. We used iMADS to quantify, model, and analyze specificity differences between 11 TFs from 4 protein families. We found that paralogous TFs have diverged mainly at medium- and low-affinity sites, which are poorly captured by current motif models. We identify sequence and shape features differentially preferred by paralogous TFs, and we show that the intrinsic differences in specificity among paralogous TFs contribute to their differential in vivo binding. Thus, our study represents a step forward in deciphering the molecular mechanisms of differential specificity in TF families.


Subjects
Genetic Models , Transcription Factors/physiology , Binding Sites , Gene Expression Regulation/physiology , Molecular Models , Nucleotide Motifs , Protein Sequence Analysis , Transcription Factors/chemistry
9.
F1000Res ; 6, 2017.
Article in English | MEDLINE | ID: mdl-29118973

ABSTRACT

The Bioinformatics Open Source Conference (BOSC) is a meeting organized by the Open Bioinformatics Foundation (OBF), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development and Open Science within the biological research community. The 18th annual BOSC (http://www.open-bio.org/wiki/BOSC_2017) took place in Prague, Czech Republic in July 2017. The conference brought together nearly 250 bioinformatics researchers, developers and users of open source software to interact and share ideas about standards, bioinformatics software development, open and reproducible science, and this year's theme, open data. As in previous years, the conference was preceded by a two-day collaborative coding event open to the bioinformatics community, called the OBF Codefest.

10.
Mol Ecol Resour ; 17(1): 120-128, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27297607

ABSTRACT

The R computing and statistical language community has developed a myriad of resources for conducting population genetic analyses. However, resources for learning how to carry out population genetic analyses in R are scattered and often incomplete, which can make acquiring this skill unnecessarily difficult and time consuming. To address this gap, we developed an online community resource with guidance and working demonstrations for conducting population genetic analyses in R. The resource is freely available at http://popgen.nescent.org and includes material for both novices and advanced users of R for population genetics. To facilitate continued maintenance and growth of this resource, we developed a toolchain, process and conventions designed to (i) minimize financial and labour costs of upkeep; (ii) provide a low barrier to contribution; and (iii) ensure strong quality assurance. The toolchain includes automatic integration testing of every change and rebuilding of the website when new vignettes or edits are accepted. The process and conventions largely follow a common, distributed version control-based contribution workflow, which is used to provide and manage open peer review by designated website editors. The online resources include detailed documentation of this process, including video tutorials. We invite the community of population geneticists working in R to contribute to this resource, whether for a new use case of their own, or as one of the vignettes from the 'wish list' we maintain, or by improving existing vignettes.


Subjects
Biostatistics/methods , Population Genetics/education , Population Genetics/methods , Statistics as Topic/education , Access to Information , Internet
12.
Mol Ecol Resour ; 17(1): 19-26, 2017 Jan.
Article in English | MEDLINE | ID: mdl-27417145

ABSTRACT

Genetic sequences of multiple genes are becoming increasingly common for a wide range of organisms including viruses, bacteria and eukaryotes. While such data may sometimes be treated as a single locus, in practice, a number of biological and statistical phenomena can lead to phylogenetic incongruence. In such cases, different loci should, at least as a preliminary step, be examined and analysed separately. The R software has become a popular platform for phylogenetics, with several packages implementing distance-based, parsimony and likelihood-based phylogenetic reconstruction, and an even greater number of packages implementing phylogenetic comparative methods. Unfortunately, basic data structures and tools for analysing multiple genes have so far been lacking, thereby limiting potential for investigating phylogenetic incongruence. In this study, we introduce the new R package apex to fill this gap. apex implements new object classes, which extend existing standards for storing DNA and amino acid sequences, and provides a number of convenient tools for handling, visualizing and analysing these data. In this study, we introduce the main features of the package and illustrate its functionalities through the analysis of a simple data set.


Subjects
Computational Biology/methods , Genes , Genetic Variation , Molecular Biology/methods , Phylogeny , Sequence Homology , Software
13.
F1000Res ; 5, 2016.
Article in English | MEDLINE | ID: mdl-27781083

ABSTRACT

Message from the ISCB: The Bioinformatics Open Source Conference (BOSC) is a yearly meeting organized by the Open Bioinformatics Foundation (OBF), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development and Open Science within the biological research community. BOSC has been run since 2000 as a two-day Special Interest Group (SIG) before the annual ISMB conference. The 17th annual BOSC (http://www.open-bio.org/wiki/BOSC_2016) took place in Orlando, Florida in July 2016. As in previous years, the conference was preceded by a two-day collaborative coding event open to the bioinformatics community. The conference brought together nearly 100 bioinformatics researchers, developers and users of open source software to interact and share ideas about standards, bioinformatics software development, and open and reproducible science.

14.
PLoS Comput Biol ; 12(2): e1004691, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26914653

ABSTRACT

The Bioinformatics Open Source Conference (BOSC) is organized by the Open Bioinformatics Foundation (OBF), a nonprofit group dedicated to promoting the practice and philosophy of open source software development and open science within the biological research community. Since its inception in 2000, BOSC has provided bioinformatics developers with a forum for communicating the results of their latest efforts to the wider research community. BOSC offers a focused environment for developers and users to interact and share ideas about standards; software development practices; practical techniques for solving bioinformatics problems; and approaches that promote open science and sharing of data, results, and software. BOSC is run as a two-day special interest group (SIG) before the annual Intelligent Systems in Molecular Biology (ISMB) conference. BOSC 2015 took place in Dublin, Ireland, and was attended by over 125 people, about half of whom were first-time attendees. Session topics included "Data Science;" "Standards and Interoperability;" "Open Science and Reproducibility;" "Translational Bioinformatics;" "Visualization;" and "Bioinformatics Open Source Project Updates". In addition to two keynote talks and dozens of shorter talks chosen from submitted abstracts, BOSC 2015 included a panel, titled "Open Source, Open Door: Increasing Diversity in the Bioinformatics Open Source Community," that provided an opportunity for open discussion about ways to increase the diversity of participants in BOSC in particular, and in open source bioinformatics in general. The complete program of BOSC 2015 is available online at http://www.open-bio.org/wiki/BOSC_2015_Schedule.


Subjects
Computational Biology/organization & administration , Congresses as Topic , Humans , Ireland
15.
PLoS One ; 11(2): e0149102, 2016.
Article in English | MEDLINE | ID: mdl-26870952

ABSTRACT

BACKGROUND: In recent years large bibliographic databases have made much of the published literature of biology available for searches. However, the capabilities of the search engines integrated into these databases for text-based bibliographic searches are limited. To enable searches that deliver the results expected by comparative anatomists, an underlying logical structure known as an ontology is required. DEVELOPMENT AND TESTING OF THE ONTOLOGY: Here we present the Mammalian Feeding Muscle Ontology (MFMO), a multi-species ontology focused on anatomical structures that participate in feeding and other oral/pharyngeal behaviors. A unique feature of the MFMO is that a simple, computable, definition of each muscle, which includes its attachments and innervation, is true across mammals. This construction mirrors the logical foundation of comparative anatomy and permits searches using language familiar to biologists. Further, it provides a template for muscles that will be useful in extending any anatomy ontology. The MFMO is developed to support the Feeding Experiments End-User Database Project (FEED, https://feedexp.org/), a publicly-available, online repository for physiological data collected from in vivo studies of feeding (e.g., mastication, biting, swallowing) in mammals. Currently the MFMO is integrated into FEED and also into two literature-specific implementations of Textpresso, a text-mining system that facilitates powerful searches of a corpus of scientific publications. We evaluate the MFMO by asking questions that test the ability of the ontology to return appropriate answers (competency questions). We compare the results of queries of the MFMO to results from similar searches in PubMed and Google Scholar. RESULTS AND SIGNIFICANCE: Our tests demonstrate that the MFMO is competent to answer queries formed in the common language of comparative anatomy, but PubMed and Google Scholar are not. 
Overall, our results show that by incorporating anatomical ontologies into searches, an expanded and anatomically comprehensive set of results can be obtained. The broader scientific and publishing communities should consider taking up the challenge of semantically enabled search capabilities.


Subjects
Databases as Topic , Pharyngeal Muscles/anatomy & histology , Animals , Humans , Oropharynx/anatomy & histology , Search Engine
16.
Pac Symp Biocomput ; 21: 132-43, 2016.
Article in English | MEDLINE | ID: mdl-26776180

ABSTRACT

There is growing use of ontologies for the measurement of cross-species phenotype similarity. Such similarity measurements contribute to diverse applications, such as identifying genetic models for human diseases, transferring knowledge among model organisms, and studying the genetic basis of evolutionary innovations. Two organismal features, whether genes, anatomical parts, or any other inherited feature, are considered to be homologous when they are evolutionarily derived from a single feature in a common ancestor. A classic example is the homology between the paired fins of fishes and vertebrate limbs. Anatomical ontologies that model the structural relations among parts may fail to include some known anatomical homologies unless they are deliberately added as separate axioms. The consequences of neglecting known homologies for applications that rely on such ontologies have not been well studied. Here, we examine how semantic similarity is affected when external homology knowledge is included. We measure phenotypic similarity between orthologous and non-orthologous gene pairs between humans and either mouse or zebrafish, and compare the inclusion of real with faux homology axioms. Semantic similarity was preferentially increased for orthologs when using real homology axioms, but only in the more divergent of the two species comparisons (human to zebrafish, not human to mouse), and the relative increase was less than 1% relative to non-orthologs. By contrast, inclusion of both real and faux random homology axioms preferentially increased similarities between genes that were initially more dissimilar in the other comparisons. Biologically meaningful increases in semantic similarity were seen for a select subset of gene pairs. Overall, the effect of including homology axioms on cross-species semantic similarity was modest at the levels of divergence examined here, but our results hint that it may be greater for more distant species comparisons.
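
To make the idea of homology axioms changing semantic similarity concrete, the sketch below computes a Jaccard similarity over ancestor closures in a tiny, hypothetical ontology, before and after adding a homology link. The abstract does not say which similarity metric or axiom encoding the study used, so Jaccard, the shared "homology group" node, and all term names here are assumptions made for illustration only.

```python
# Hypothetical mini-ontology: term -> set of direct parent terms.
PARENTS = {
    "pectoral fin": {"paired fin"},
    "pelvic fin": {"paired fin"},
    "paired fin": {"paired appendage"},
    "forelimb": {"paired limb"},
    "paired limb": {"paired appendage"},
    "paired appendage": set(),
}


def closure(term, parents):
    """Return the term together with all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def jaccard(a, b, parents):
    """Jaccard similarity of two terms: shared ancestors over all ancestors."""
    ca, cb = closure(a, parents), closure(b, parents)
    return len(ca & cb) / len(ca | cb)


def with_homology(parents, pairs):
    """Return a copy of `parents` in which each homologous pair shares an
    extra grouping ancestor (a simplification of real homology axioms)."""
    extended = {term: set(ps) for term, ps in parents.items()}
    for a, b in pairs:
        group = f"homology group: {a} / {b}"
        extended[group] = set()
        extended.setdefault(a, set()).add(group)
        extended.setdefault(b, set()).add(group)
    return extended


if __name__ == "__main__":
    before = jaccard("pectoral fin", "forelimb", PARENTS)
    linked = with_homology(PARENTS, [("paired fin", "paired limb")])
    after = jaccard("pectoral fin", "forelimb", linked)
    print(f"without homology: {before:.3f}, with homology: {after:.3f}")
```

In this toy case the added link raises the similarity of a fin term and a limb term because they gain one more shared ancestor; whether that gain is biologically meaningful is exactly the question the real-versus-faux axiom comparison addresses.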


Subjects
Comparative Anatomy/methods , Comparative Anatomy/statistics & numerical data , Animals , Computational Biology/methods , Computational Biology/statistics & numerical data , Molecular Evolution , Humans , Mice , Phenotype , Semantics , Nucleic Acid Sequence Homology , Species Specificity , Systems Integration , Zebrafish/anatomy & histology , Zebrafish/genetics
17.
Mol Biol Evol ; 33(1): 13-24, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26500251

ABSTRACT

Phenotypes resulting from mutations in genetic model organisms can help reveal candidate genes for evolutionarily important phenotypic changes in related taxa. Although testing candidate gene hypotheses experimentally in nonmodel organisms is typically difficult, ontology-driven information systems can help generate testable hypotheses about developmental processes in experimentally tractable organisms. Here, we tested candidate gene hypotheses suggested by expert use of the Phenoscape Knowledgebase, specifically looking for genes that are candidates responsible for evolutionarily interesting phenotypes in the ostariophysan fishes that bear resemblance to mutant phenotypes in zebrafish. For this, we searched ZFIN for genetic perturbations that result in either loss of basihyal element or loss of scales phenotypes, because these are the ancestral phenotypes observed in catfishes (Siluriformes). We tested the identified candidate genes by examining their endogenous expression patterns in the channel catfish, Ictalurus punctatus. The experimental results were consistent with the hypotheses that these features evolved through disruption in developmental pathways at, or upstream of, brpf1 and eda/edar for the ancestral losses of basihyal element and scales, respectively. These results demonstrate that ontological annotations of the phenotypic effects of genetic alterations in model organisms, when aggregated within a knowledgebase, can be used effectively to generate testable, and useful, hypotheses about evolutionary changes in morphology.


Subjects
Catfishes/genetics , Molecular Evolution , Gene Expression , Genetic Models , Phenotype , Animals , Computational Biology , Gene Expression/genetics , Gene Expression/physiology , Software
18.
Genesis ; 53(8): 561-71, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26220875

ABSTRACT

The abundance of phenotypic diversity among species can enrich our knowledge of development and genetics beyond the limits of variation that can be observed in model organisms. The Phenoscape Knowledgebase (KB) is designed to enable exploration and discovery of phenotypic variation among species. Because phenotypes in the KB are annotated using standard ontologies, evolutionary phenotypes can be compared with phenotypes from genetic perturbations in model organisms. To illustrate the power of this approach, we review the use of the KB to find taxa showing evolutionary variation similar to that of a query gene. Matches are made between the full set of phenotypes described for a gene and an evolutionary profile, the latter of which is defined as the set of phenotypes that are variable among the daughters of any node on the taxonomic tree. Phenoscape's semantic similarity interface allows the user to assess the statistical significance of each match and flags matches that may only result from differences in annotation coverage between genetic and evolutionary studies. Tools such as this will help meet the challenge of relating the growing volume of genetic knowledge in model organisms to the diversity of phenotypes in nature. The Phenoscape KB is available at http://kb.phenoscape.org.
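
The definition of an evolutionary profile given above (the phenotypes that vary among the daughters of a node) can be sketched as a set operation. The tree, taxon labels, and phenotype strings below are hypothetical, and summarizing each daughter by the union of phenotypes in its subtree is an assumption of this sketch, not a documented detail of the Phenoscape KB.

```python
# Hypothetical taxonomic tree: node -> list of daughter nodes.
CHILDREN = {
    "root": ["A", "B"],
    "A": ["t1", "t2"],
    "B": ["t3"],
}

# Hypothetical phenotype annotations for terminal taxa.
PHENOTYPES = {
    "t1": {"scales absent", "barbel present"},
    "t2": {"barbel present"},
    "t3": {"scales absent", "barbel present"},
}


def subtree_phenotypes(node):
    """Union of phenotypes annotated anywhere in the subtree under `node`."""
    if node in PHENOTYPES:
        return set(PHENOTYPES[node])
    found = set()
    for child in CHILDREN.get(node, []):
        found |= subtree_phenotypes(child)
    return found


def evolutionary_profile(node):
    """Phenotypes seen in at least one daughter of `node` but not in all."""
    daughters = [subtree_phenotypes(child) for child in CHILDREN.get(node, [])]
    if not daughters:
        return set()
    return set.union(*daughters) - set.intersection(*daughters)


if __name__ == "__main__":
    for node in ("root", "A", "B"):
        print(node, sorted(evolutionary_profile(node)))
```

A gene's phenotype set could then be compared against each node's profile; the KB does this with semantic similarity and a significance assessment rather than the exact string matching used here.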


Subjects
Genetic Databases , Genetic Association Studies/methods , Animals , Biological Evolution , Computational Biology/methods , Humans , Knowledge Bases , Phenotype
19.
Syst Biol ; 64(6): 936-52, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26018570

ABSTRACT

The reality of larger and larger molecular databases and the need to integrate data scalably have presented a major challenge for the use of phenotypic data. Morphology is currently primarily described in discrete publications, entrenched in noncomputer readable text, and requires enormous investments of time and resources to integrate across large numbers of taxa and studies. Here we present a new methodology, using ontology-based reasoning systems working with the Phenoscape Knowledgebase (KB; kb.phenoscape.org), to automatically integrate large amounts of evolutionary character state descriptions into a synthetic character matrix of neomorphic (presence/absence) data. Using the KB, which includes more than 55 studies of sarcopterygian taxa, we generated a synthetic supermatrix of 639 variable characters scored for 1051 taxa, resulting in over 145,000 populated cells. Of these characters, over 76% were made variable through the addition of inferred presence/absence states derived by machine reasoning over the formal semantics of the source ontologies. Inferred data reduced the missing data in the variable character-subset from 98.5% to 78.2%. Machine reasoning also enables the isolation of conflicts in the data, that is, cells where both presence and absence are indicated; reports regarding conflicting data provenance can be generated automatically. Further, reasoning enables quantification and new visualizations of the data, here for example, allowing identification of character space that has been undersampled across the fin-to-limb transition. The approach and methods demonstrated here to compute synthetic presence/absence supermatrices are applicable to any taxonomic and phenotypic slice across the tree of life, providing the data are semantically annotated. 
Because such data can also be linked to model organism genetics through computational scoring of phenotypic similarity, they open a rich set of future research questions into phenotype-to-genome relationships.
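
The inference step described in this abstract can be sketched with a toy part-of hierarchy: presence of a part implies presence of the wholes that contain it, and absence of a whole implies absence of its parts. These two rules, the `PART_OF` mapping, and the taxon names are assumptions made for illustration; the actual system reasons over the formal semantics of full ontologies rather than a single relation.

```python
# Hypothetical part-of hierarchy: part -> the whole it belongs to.
PART_OF = {"digit": "autopod", "autopod": "limb"}


def wholes(entity):
    """All structures that contain `entity`, nearest first."""
    found = []
    while entity in PART_OF:
        entity = PART_OF[entity]
        found.append(entity)
    return found


def parts(entity):
    """All structures contained in `entity`, at any depth."""
    found = []
    for part, whole in PART_OF.items():
        if whole == entity:
            found.append(part)
            found.extend(parts(part))
    return found


def infer(asserted):
    """Extend a matrix of (taxon, entity) -> set of states with inferred states."""
    cells = {key: set(states) for key, states in asserted.items()}
    for (taxon, entity), states in asserted.items():
        if "present" in states:  # a present part implies its wholes are present
            for whole in wholes(entity):
                cells.setdefault((taxon, whole), set()).add("present")
        if "absent" in states:  # an absent whole implies its parts are absent
            for part in parts(entity):
                cells.setdefault((taxon, part), set()).add("absent")
    return cells


def missing_fraction(cells, taxa, entities):
    """Share of taxon-by-entity cells that have no state at all."""
    filled = sum(1 for t in taxa for e in entities if cells.get((t, e)))
    return 1 - filled / (len(taxa) * len(entities))


def conflicts(cells):
    """Cells where both presence and absence are indicated."""
    return sorted(key for key, s in cells.items() if {"present", "absent"} <= s)


if __name__ == "__main__":
    asserted = {
        ("taxon1", "digit"): {"present"},
        ("taxon2", "limb"): {"absent"},
        ("taxon3", "autopod"): {"absent"},
        ("taxon3", "digit"): {"present"},
    }
    taxa, entities = ["taxon1", "taxon2", "taxon3"], ["digit", "autopod", "limb"]
    inferred = infer(asserted)
    print(missing_fraction(asserted, taxa, entities), missing_fraction(inferred, taxa, entities))
    print(conflicts(inferred))
```

Keeping a set of states per cell, rather than a single value, is what lets contradictory assertions surface as conflicts instead of silently overwriting one another.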


Subjects
Biological Ontologies , Computational Biology/methods , Phenotype , Amphibians/anatomy & histology , Amphibians/classification , Animals , Biological Evolution , Classification , Statistical Data Interpretation
20.
Database (Oxford) ; 2015: bav040, 2015.
Article in English | MEDLINE | ID: mdl-25972520

ABSTRACT

The diverse phenotypes of living organisms have been described for centuries, and though they may be digitized, they are not readily available in a computable form. Using over 100 morphological studies, the Phenoscape project has demonstrated that by annotating characters with community ontology terms, links between novel species anatomy and the genes that may underlie them can be made. But given the enormity of the legacy literature, how can this largely unexploited wealth of descriptive data be rendered amenable to large-scale computation? To identify the bottlenecks, we quantified the time involved in the major aspects of phenotype curation as we annotated characters from the vertebrate phylogenetic systematics literature. This involves attaching fully computable logical expressions consisting of ontology terms to the descriptions in character-by-taxon matrices. The workflow consists of: (i) data preparation, (ii) phenotype annotation, (iii) ontology development and (iv) curation team discussions and software development feedback. Our results showed that the completion of this work required two person-years by a team of two post-docs, a lead data curator, and students. Manual data preparation required close to 13% of the effort. This part in particular could be reduced substantially with better community data practices, such as depositing fully populated matrices in public repositories. Phenotype annotation required ∼40% of the effort. We are working to make this more efficient with Natural Language Processing tools. Ontology development (40%), however, remains a highly manual task requiring domain (anatomical) expertise and use of specialized software. The large overhead required for data preparation and ontology development contributed to a low annotation rate of approximately two characters per hour, compared with 14 characters per hour when activity was restricted to character annotation. 
Unlocking the potential of the vast stores of morphological descriptions requires better tools for efficiently processing natural language, and better community practices towards a born-digital morphology. Database URL: http://kb.phenoscape.org


Subjects
Comparative Anatomy , Biological Ontologies , Data Curation/methods , Data Mining/methods , Factual Databases , Natural Language Processing , Animals , Humans