Results 1 - 14 of 14
1.
PLoS Comput Biol ; 15(2): e1006493, 2019 Feb.
Article in English | MEDLINE | ID: mdl-30768597

ABSTRACT

Phylogenomic research is accelerating the publication of landmark studies that aim to resolve deep divergences of major organismal groups. Meanwhile, systems for identifying and integrating the products of phylogenomic inference, such as newly supported clade concepts, have not kept pace. However, the ability to verbalize node concept congruence and conflict across multiple, in effect simultaneously endorsed phylogenomic hypotheses is a prerequisite for building synthetic data environments for biological systematics and other domains impacted by these conflicting inferences. Here we develop a novel solution to the conflict verbalization challenge, based on a logic representation and reasoning approach that utilizes the language of Region Connection Calculus (RCC-5) to produce consistent alignments of node concepts endorsed by incongruent phylogenomic studies. The approach employs clade concept labels to individuate concepts used by each source, even if these carry identical names. Indirect RCC-5 modeling of intensional (property-based) node concept definitions, facilitated by the local relaxation of coverage constraints, allows parent concepts to attain congruence in spite of their differentially sampled children. To demonstrate the feasibility of this approach, we align two recent phylogenomic reconstructions of higher-level avian groups that entail strong conflict in the "neoavian explosion" region. According to our representations, this conflict is constituted by 26 instances of input "whole concept" overlap. These instances are further resolvable in the output labeling schemes and visualizations as "split concepts", which provide the labels and relations needed to build truly synthetic phylogenomic data environments. Because the RCC-5 alignments fundamentally reflect the trained, logic-enabled judgments of systematic experts, future designs for such environments need to promote a culture where experts routinely assess the intensionalities of node concepts published by our peers, even and especially when we are not in agreement with each other.
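
A minimal extensional sketch of the RCC-5 vocabulary used above, assuming clade concepts are reduced to sets of sampled terminal taxa (the paper's actual reasoning is logic-based and accommodates intensional definitions; the taxon names below are illustrative):

# Illustrative sketch, not the authors' implementation: deciding which of the
# five RCC-5 relations holds between two clade concepts modeled as sets.

def rcc5(a: set, b: set) -> str:
    """Return the RCC-5 relation between two non-empty concept extensions."""
    if a == b:
        return "congruent (==)"
    if a < b:
        return "properly included in (<)"
    if a > b:
        return "properly includes (>)"
    if a & b:
        return "overlaps (><)"
    return "excludes (!)"

# Two conflicting circumscriptions of a hypothetical clade from two studies:
study1_clade = {"Strisores", "Columbaves", "Gruiformes"}
study2_clade = {"Strisores", "Columbaves", "Opisthocomiformes"}
print(rcc5(study1_clade, study2_clade))  # -> overlaps (><)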


Subjects
Computational Biology/methods, Genomics/methods, Phylogeny, Animals, Birds/genetics, Computer Simulation, Humans, Language
2.
Syst Biol ; 65(4): 561-82, 2016 Jul.
Article in English | MEDLINE | ID: mdl-27009895

ABSTRACT

Classifications and phylogenies of perceived natural entities change in the light of new evidence. Taxonomic changes, translated into Code-compliant names, frequently lead to name:meaning dissociations across succeeding treatments. Classification standards such as the Mammal Species of the World (MSW) may experience significant levels of taxonomic change from one edition to the next, with potential costs to long-term, large-scale information integration. This circumstance challenges the biodiversity and phylogenetic data communities to express taxonomic congruence and incongruence in ways that both humans and machines can process, that is, to logically represent taxonomic alignments across multiple classifications. We demonstrate that such alignments are feasible for two classifications of primates corresponding to the second and third MSW editions. Our approach has three main components: (i) use of taxonomic concept labels, that is, name sec. author (where sec. means according to), to assemble each concept hierarchy separately via parent/child relationships; (ii) articulation of select concepts across the two hierarchies with user-provided Region Connection Calculus (RCC-5) relationships; and (iii) the use of an Answer Set Programming toolkit to infer and visualize logically consistent alignments of these input constraints. Our use case entails the Primates sec. Groves (1993; MSW2: 317 taxonomic concepts; 233 at the species level) and Primates sec. Groves (2005; MSW3: 483 taxonomic concepts; 376 at the species level). Using 402 RCC-5 input articulations, the reasoning process yields a single, consistent alignment and 153,111 Maximally Informative Relations that constitute a comprehensive meaning resolution map for every concept pair in the Primates sec. MSW2/MSW3. The complete alignment, and various partitions thereof, facilitate quantitative analyses of name:meaning dissociation, revealing that nearly one in three taxonomic names is not reliable across treatments, in the sense of the same name identifying congruent taxonomic meanings. The RCC-5 alignment approach is potentially widely applicable in systematics and can achieve scalable, precise resolution of semantically evolving name usages in synthetic, next-generation biodiversity and phylogeny data platforms.
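
As a rough sketch of component (i), taxonomic concept labels ("name sec. author") individuate otherwise identical names while each hierarchy is assembled from parent/child pairs; the data model below is an assumption for illustration, not the authors' toolkit:

# Hypothetical data model: concepts are "name sec. author" strings, and each
# hierarchy is built separately before RCC-5 articulation across the two.

from collections import defaultdict

def build_hierarchy(parent_child_pairs):
    children = defaultdict(list)
    for parent, child in parent_child_pairs:
        children[parent].append(child)
    return children

msw2 = build_hierarchy([
    ("Primates sec. Groves (1993)", "Lorisidae sec. Groves (1993)"),
    ("Primates sec. Groves (1993)", "Cebidae sec. Groves (1993)"),
])
msw3 = build_hierarchy([
    ("Primates sec. Groves (2005)", "Lorisidae sec. Groves (2005)"),
    ("Primates sec. Groves (2005)", "Cebidae sec. Groves (2005)"),
])

# One user-supplied RCC-5 articulation across the two hierarchies:
articulations = [("Cebidae sec. Groves (1993)", "overlaps", "Cebidae sec. Groves (2005)")]
print(msw2["Primates sec. Groves (1993)"])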


Subjects
Classification/methods, Phylogeny, Primates/classification, Animals, Biodiversity
3.
Libr Trends ; 65(4): 555-562, 2017.
Article in English | MEDLINE | ID: mdl-29375158

ABSTRACT

The era of big data and ubiquitous computation has brought with it concerns about ensuring reproducibility in this new research environment. It is easy to assume that computational methods self-document by their very nature as exact, deterministic processes. However, as with laboratory experiments, ensuring reproducibility in the computational realm requires the documentation of both the protocols used (workflows) and a detailed description of the computational environment: algorithms, implementations, and software environments, as well as the data ingested and the execution logs of the computation. These two aspects of computational reproducibility (workflows and execution details) are discussed in the context of biomolecular Nuclear Magnetic Resonance spectroscopy (bioNMR) as well as the PRIMAD model for computational reproducibility.
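
A small sketch of capturing the "execution details" side of reproducibility with standard-library calls; the field names and the example file name are assumptions, and the PRIMAD dimensions themselves are not modeled here:

# Minimal execution record: environment plus a hash tying the run to the
# exact data ingested. All keys are illustrative, not a PRIMAD schema.

import hashlib, platform, sys, time

def execution_record(input_path: str) -> dict:
    with open(input_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "python": sys.version,
        "platform": platform.platform(),
        "input_file": input_path,
        "input_sha256": digest,
    }

# record = execution_record("spectrum.ft2")  # hypothetical bioNMR input file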

4.
BMC Bioinformatics ; 17(1): 471, 2016 Nov 17.
Article in English | MEDLINE | ID: mdl-27855645

ABSTRACT

BACKGROUND: Taxonomic descriptions are traditionally composed in natural language and published in a format that cannot be directly used by computers. The Exploring Taxon Concepts (ETC) project has been developing a set of web-based software tools that convert morphological descriptions published in telegraphic style into character data that can be reused and repurposed. This paper introduces what is, to our knowledge, the first semi-automated pipeline that converts morphological descriptions into taxon-character matrices to support systematics and evolutionary biology research. We then demonstrate and evaluate the use of the ETC Input Creation - Text Capture - Matrix Generation pipeline to generate body part measurement matrices from a set of 188 spider morphological descriptions and report the findings. RESULTS: From the given set of spider taxonomic publications, two versions of input (original and normalized) were generated and used by the ETC Text Capture and ETC Matrix Generation tools. The tools produced two corresponding spider body part measurement matrices, and the matrix from the normalized input was found to be much more similar to a gold standard matrix hand-curated by the scientist co-authors. The lower performance on the original input was attributed to special conventions used in the original descriptions (e.g., the omission of measurement units). The results show that simple normalization of the description text greatly increased the quality of the machine-generated matrix and reduced edit effort. The machine-generated matrix also helped identify issues in the gold standard matrix. CONCLUSIONS: ETC Text Capture and ETC Matrix Generation are low-barrier and effective tools for extracting measurement values from spider taxonomic descriptions, and they are more effective when the descriptions are self-contained. Special conventions that make the description text less self-contained challenge the automated extraction of data from biodiversity descriptions and hinder the automated reuse of the published knowledge. The tools will be updated to support new requirements revealed in this case study.
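
To make the normalization point concrete, here is a hedged sketch of restoring omitted measurement units before extraction; the regular expression and the millimeter convention are assumptions, not the ETC tools' actual rules:

# Hypothetical normalization rule: a bare number after a body-part name is a
# length in mm, so make the unit explicit before running text capture.

import re

def normalize(description: str) -> str:
    pattern = r"(carapace|femur|tibia)(\s+[IV]+)?\s+(\d+(?:\.\d+)?)(?!\s*mm)"
    return re.sub(pattern, r"\1\2 \3 mm", description, flags=re.IGNORECASE)

print(normalize("Carapace 2.31, femur I 1.91."))
# -> "Carapace 2.31 mm, femur I 1.91 mm."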


Subjects
Biological Evolution, Software, Spiders/anatomy & histology, Animals, Humans
5.
BMC Bioinformatics ; 13: 102, 2012 May 17.
Article in English | MEDLINE | ID: mdl-22594911

ABSTRACT

BACKGROUND: Microarray data analysis has been the subject of extensive and ongoing pipeline development due to its complexity, the availability of several options at each analysis step, and the emergence of new analysis demands, including integration with new data sources. Bioinformatics pipelines are usually custom built for different applications, making them typically difficult to modify, extend, and repurpose. Scientific workflow systems are intended to address these issues by providing general-purpose frameworks in which to develop and execute such pipelines. The Kepler workflow environment is a well-established system under continual development that is employed in several areas of scientific research. Kepler provides a flexible graphical interface, featuring clear display of parameter values, for the design and modification of workflows. It has capabilities for developing novel computational components in the R, Python, and Java programming languages, all of which are widely used for bioinformatics algorithm development, along with capabilities for invoking external applications and using web services. RESULTS: We developed a series of fully functional bioinformatics pipelines addressing common tasks in microarray processing in the Kepler workflow environment. These pipelines consist of a set of tools for GFF file processing of NimbleGen chromatin immunoprecipitation on microarray (ChIP-chip) datasets and more comprehensive workflows for Affymetrix gene expression microarray bioinformatics and basic primer design for PCR experiments, which are often used to validate microarray results. Although functional in themselves, these workflows can be easily customized, extended, or repurposed to match the needs of specific projects and are designed to be a toolkit and starting point for specific applications. These workflows illustrate a workflow programming paradigm focused on local resources (programs and data) and are therefore close to traditional shell scripting or R/BioConductor scripting approaches to pipeline design. Finally, we suggest that microarray data processing task workflows may provide a basis for future example-based comparison of different workflow systems. CONCLUSIONS: We provide a set of tools and complete workflows for microarray data analysis in the Kepler environment, which offers the advantages of a clear graphical display of conceptual steps and parameters and the ability to easily integrate other resources such as remote data and web services.
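
The "local resources" paradigm the authors compare to shell scripting might look like the following outside Kepler; the step commands here are hypothetical placeholders, not the published workflow components:

# Each step is a small function with explicit parameters, chained over files
# on disk, failing fast like a strict shell script.

import subprocess

def run(cmd: list[str]) -> None:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def chip_chip_pipeline(gff_in: str, normalized: str, peaks: str) -> None:
    run(["normalize_gff", gff_in, "-o", normalized])  # hypothetical tool
    run(["call_peaks", normalized, "-o", peaks])      # hypothetical tool

# chip_chip_pipeline("arrays.gff", "arrays.norm.gff", "peaks.txt")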


Subjects
Computational Biology/methods, Oligonucleotide Array Sequence Analysis/methods, Software, Workflow, Chromatin Immunoprecipitation, Computer Graphics, User-Computer Interface
6.
Nucleic Acids Res ; 38(3): e13, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19906703

ABSTRACT

Next-generation sequencing is revolutionizing the identification of transcription factor binding sites throughout the human genome. However, the bioinformatics analysis of large datasets collected using chromatin immunoprecipitation and high-throughput sequencing is often a roadblock that impedes researchers in their attempts to gain biological insights from their experiments. We have developed integrated peak-calling and analysis software (Sole-Search), available through a user-friendly interface, which (i) converts raw data into a format for visualization on a genome browser, (ii) outputs ranked peak locations using a statistically based method that overcomes the significant problem of false positives, (iii) identifies the gene nearest to each peak, (iv) classifies the location of each peak relative to gene structure, (v) provides information such as the number of binding sites per chromosome and per gene, and (vi) allows the user to determine the overlap between two different experiments. In addition, the program performs an analysis of amplified and deleted regions of the input genome. This software is web-based and automated, allowing easy and immediate access to all investigators. We demonstrate the utility of our software by collecting, analyzing, and comparing ChIP-seq data for six different human transcription factor/cell line combinations.
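
Step (iii) can be illustrated with a stand-alone sketch; the boundary-distance rule and the gene coordinates are assumptions rather than Sole-Search's exact method:

# Assign a peak to its nearest gene by distance to the closer gene boundary
# (0 if the peak center falls inside the gene). Coordinates are invented.

def nearest_gene(peak_center: int, genes: dict[str, tuple[int, int]]) -> str:
    """genes maps gene name -> (start, end) on the same chromosome."""
    def distance(span: tuple[int, int]) -> int:
        start, end = span
        if start <= peak_center <= end:
            return 0
        return min(abs(peak_center - start), abs(peak_center - end))
    return min(genes, key=lambda name: distance(genes[name]))

genes = {"GENE_A": (1_000, 5_000), "GENE_B": (20_000, 30_000)}
print(nearest_gene(12_000, genes))  # -> GENE_A (7,000 bp vs 8,000 bp away)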


Subjects
Chromatin Immunoprecipitation, DNA Sequence Analysis, Software, Transcription Factors/metabolism, Binding Sites, Tumor Cell Line, E2F4 Transcription Factor/metabolism, Gene Expression Regulation, Humans, Internet, K562 Cells, Transcriptional Regulatory Elements, Transcription Factors/classification, Genetic Transcription
7.
BMC Bioinformatics ; 11: 317, 2010 Jun 12.
Article in English | MEDLINE | ID: mdl-20540779

ABSTRACT

BACKGROUND: For more than two decades, microbiologists have used a highly conserved microbial gene as a phylogenetic marker for bacteria and archaea. The small-subunit ribosomal RNA gene, also known as 16S rRNA, is encoded by ribosomal DNA (16S rDNA) and has provided a powerful comparative tool to microbial ecologists. Over time, the microbial ecology field has matured from small-scale studies in a select number of environments to massive collections of sequence data that are paired with dozens of corresponding collection variables. As the complexity of data and tool sets has grown, the need for flexible automation and maintenance of the core processes of 16S rDNA sequence analysis has increased correspondingly. RESULTS: We present WATERS, an integrated approach for 16S rDNA analysis that bundles a suite of publicly available 16S rDNA analysis software tools into a single software package. The "toolkit" includes sequence alignment, chimera removal, OTU determination, taxonomy assignment, and phylogenetic tree construction, as well as a host of ecological analysis and visualization tools. WATERS employs a flexible, collection-oriented "workflow" approach using the open-source Kepler system as a platform. CONCLUSIONS: By packaging available software tools into a single automated workflow, WATERS simplifies 16S rDNA analyses, especially for those without specialized bioinformatics or programming expertise. In addition, WATERS, like some of the newer comprehensive rRNA analysis tools, allows researchers to minimize the time dedicated to carrying out tedious informatics steps and to focus their attention instead on the biological interpretation of the results. One advantage of WATERS over other comprehensive tools is that the use of the Kepler workflow system facilitates result interpretation and reproducibility via a data provenance sub-system. Furthermore, new "actors" can be added to the workflow as desired, and we see WATERS as an initial seed for a sizeable and growing repository of interoperable, easy-to-combine tools for asking increasingly complex microbial ecology questions.
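
A toy sketch of the collection-oriented workflow idea with a simple provenance trail; the decorator mechanics and placeholder steps are assumptions, loosely analogous to Kepler's provenance sub-system rather than WATERS itself:

# Each named step records (step, items in, items out) as a provenance trail
# while a sequence collection flows through the pipeline.

provenance = []

def step(name):
    def wrap(fn):
        def inner(data):
            result = fn(data)
            provenance.append((name, len(data), len(result)))
            return result
        return inner
    return wrap

@step("chimera_removal")
def remove_chimeras(seqs):
    return [s for s in seqs if not s.startswith("CHIMERA")]

@step("otu_clustering")
def cluster_otus(seqs):
    return sorted(set(seqs))  # placeholder for real OTU determination

seqs = ["ACGT", "ACGT", "CHIMERA_ACGT", "TTGA"]
otus = cluster_otus(remove_chimeras(seqs))
print(otus, provenance)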


Subjects
Genomics/methods, Ribosomes/genetics, Sequence Alignment/methods, Software, Base Sequence, rRNA Genes, Phylogeny, RNA Sequence Analysis
8.
Proc Assoc Inf Sci Technol ; 57(1): e355, 2020.
Article in English | MEDLINE | ID: mdl-33173824

ABSTRACT

In this preliminary study, we investigate the case of United States COVID-19 confirmed-case datasets and perform experiments with aggregations of data by county, by state, and by different taxonomies for U.S. regions. The overarching goal of this study is to uncover potential data quality issues arising from different levels of geospatial aggregation of the data.
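
A minimal sketch of the kind of cross-level consistency check such aggregation experiments enable; all counts below are invented:

# Confirmed counts summed from the county level should match the state-level
# series; mismatches flag potential data quality issues.

from collections import Counter

county_cases = {("CA", "Alameda"): 120, ("CA", "Kern"): 80, ("NY", "Kings"): 300}
state_cases = {"CA": 210, "NY": 300}  # independently reported state totals

rollup = Counter()
for (state, _county), n in county_cases.items():
    rollup[state] += n

for state, reported in state_cases.items():
    if rollup[state] != reported:
        print(f"{state}: counties sum to {rollup[state]}, state reports {reported}")
# -> CA: counties sum to 200, state reports 210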

9.
Transform Digit Worlds (2018) ; 10766: 620-625, 2018 Mar.
Article in English | MEDLINE | ID: mdl-30334020

ABSTRACT

Two barriers to computational reproducibility are recording the critical metadata required for rerunning a computation and translating the semantics of that metadata so that alternate approaches can easily be configured to verify computational reproducibility. We are addressing this problem in the context of biomolecular NMR computational analysis by developing a series of linked ontologies that define the semantics of the various software tools used by researchers for data transformation and analysis. Building from a core ontology representing the primary observational data of NMR, the linked-data approach allows for the translation of metadata in order to configure alternate software approaches for given computational tasks. In this paper we illustrate the utility of this approach with a small sample of the core ontology, as well as tool-specific semantics for two third-party software tools. This approach to semantic mediation will help support an automated approach to validating the reliability of computation in which the same processing workflow is implemented with different software tools. In addition, the detailed semantics of both the data and the processing functionalities will provide a method for software tool classification.
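
The mediation idea can be sketched as mapping tool-specific parameter names through a shared core term; the vocabulary below (tool names, parameter names, the nmr: term) is entirely assumed:

# Tool-specific parameters are bridged via a core ontology term, so a workflow
# configured for one tool can be re-expressed for another.

CORE = {
    "tool_a.spectral_width": "nmr:sweep_width_hz",
    "tool_b.sw": "nmr:sweep_width_hz",
}

def translate(param: str, source: str, target: str) -> str | None:
    core_term = CORE.get(f"{source}.{param}")
    for key, value in CORE.items():
        tool, name = key.split(".", 1)
        if tool == target and value == core_term:
            return name
    return None  # no mapping via the core ontology

print(translate("spectral_width", "tool_a", "tool_b"))  # -> "sw"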

10.
PLoS One ; 10(2): e0118247, 2015.
Article in English | MEDLINE | ID: mdl-25700173

ABSTRACT

Classifications and phylogenetic inferences of organismal groups change in light of new insights. Over time these changes can result in an imperfect tracking of taxonomic perspectives through the re-/use of Code-compliant or informal names. To mitigate these limitations, we introduce a novel approach for aligning taxonomies through the interaction of human experts and logic reasoners. We explore the performance of this approach with the Perelleschus use case of Franz & Cardona-Duque (2013). The use case includes six taxonomies published from 1936 to 2013, 54 taxonomic concepts (i.e., circumscriptions of names individuated according to their respective source publications), and 75 expert-asserted Region Connection Calculus articulations (e.g., congruence, proper inclusion, overlap, or exclusion). An Open Source reasoning toolkit is used to analyze 13 paired Perelleschus taxonomy alignments under heterogeneous constraints and interpretations. The reasoning workflow optimizes the logical consistency and expressiveness of the input and infers the set of maximally informative relations among the entailed taxonomic concepts. The latter are then used to produce merge visualizations that represent all congruent and non-congruent taxonomic elements among the aligned input trees. In this small use case with 6-53 input concepts per alignment, the information gained through the reasoning process is on average one order of magnitude greater than in the input. The approach offers scalable solutions for tracking provenance among succeeding taxonomic perspectives that may have differential biases in naming conventions, phylogenetic resolution, ingroup and outgroup sampling, or ostensive (member-referencing) versus intensional (property-referencing) concepts and articulations.
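
A toy fragment of the inference step: composing expert articulations along concept chains yields relations absent from the input, which is how the reasoning turns 75 articulations into a much larger set of Maximally Informative Relations. The composition table below is only a fragment and the concept names are placeholders:

# Compose RCC-5 articulations along chains of concepts; the real toolkit uses
# Answer Set Programming over all five relations, not this two-entry table.

COMPOSE = {
    ("includes", "includes"): "includes",
    ("equals", "includes"): "includes",
    ("includes", "equals"): "includes",
    ("equals", "equals"): "equals",
}

asserted = {("A", "B"): "includes", ("B", "C"): "equals"}

def infer(x, y, z):
    r1, r2 = asserted[(x, y)], asserted[(y, z)]
    return COMPOSE.get((r1, r2))  # None when the composition is ambiguous

print(infer("A", "B", "C"))  # -> "includes", a relation not in the input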


Subjects
Algorithms, Classification/methods, Phylogeny, Weevils/classification, Animals, Sequence Alignment/methods
11.
Neural Netw ; 16(9): 1277-92, 2003 Nov.
Article in English | MEDLINE | ID: mdl-14622884

ABSTRACT

We present issues that arise when trying to formalize disease maps, i.e., ontologies representing the terminological relationships among concepts needed to construct a knowledge base of neurological disorders. These disease maps are being created in the context of a large-scale data mediation system under development for the Biomedical Informatics Research Network (BIRN). The BIRN is a multi-university consortium collaborating to establish a large-scale data and computational grid around neuroimaging data collected across multiple scales. Test bed projects within BIRN involve both animal and human studies of Alzheimer's disease, Parkinson's disease, and schizophrenia. Incorporating both static 'terminological' relationships and dynamic processes, disease maps are being created to encapsulate a comprehensive theory of a disease. Terms within a disease map can also be connected to relevant terms within other ontologies (e.g., the Unified Medical Language System) in order to allow the disease map management system to derive relationships between a larger set of terms than is contained within the disease map itself. In this paper, we use the basic structure of a disease map we are developing for Parkinson's disease to illustrate our initial formalization of disease maps.
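
A disease map's static, terminological side can be pictured as labeled triples with bridges to an external vocabulary; the relations and the UMLS identifier below are illustrative placeholders, not BIRN's actual formalization:

# Terminological relationships as subject-relation-object triples, plus a
# cross-ontology link so external terms can be reached from map terms.

disease_map = {
    ("Parkinson's disease", "has_finding", "resting tremor"),
    ("Parkinson's disease", "involves", "substantia nigra"),
    ("substantia nigra", "part_of", "basal ganglia"),
}

umls_links = {"Parkinson's disease": "UMLS:C0030567"}  # assumed identifier

def findings(disease):
    return [obj for subj, rel, obj in disease_map
            if subj == disease and rel == "has_finding"]

print(findings("Parkinson's disease"))  # -> ['resting tremor']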


Subjects
Electronic Data Processing/methods, Medical Informatics/methods, Nervous System Diseases, Information Storage and Retrieval/methods, Nervous System Diseases/classification
12.
PLoS One ; 8(11): e76093, 2013.
Article in English | MEDLINE | ID: mdl-24223697

ABSTRACT

Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and, in each case, an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks. Any consequent actions taken by the holders of the annotated data could be shared as well. But even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine enough granularity to use the results in this way for data quality control. We address these kinds of issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services. An important application is the support for data quality control and the provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML Schema.
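
A sketch of the fine-granularity requirement: an annotation that targets one field of one record, carries evidence, and names the network of interest for distribution. The schema is an assumption loosely inspired by annotation ontologies, not the paper's exact extension:

# Fine-grained annotation: scoped to a single field of a single record, so
# data holders can act on it for quality control.

from dataclasses import dataclass

@dataclass
class DataAnnotation:
    target_record: str   # e.g., a specimen record identifier (hypothetical)
    target_field: str    # one field, not the whole record
    assertion: str
    evidence: str
    interest_group: str  # used for interest-based distribution

note = DataAnnotation(
    target_record="specimen:UCMP-12345",
    target_field="collectingDate",
    assertion="year 1895 is likely a transcription error for 1985",
    evidence="collector was active 1979-1992",
    interest_group="natural-science-collections",
)
print(note.target_field)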


Subjects
Information Storage and Retrieval, Computational Biology, Humans, Information Dissemination, Molecular Sequence Annotation, Quality Control, Semantics, Software, Controlled Vocabulary
14.
J Struct Biol ; 138(1-2): 145-55, 2002.
Article in English | MEDLINE | ID: mdl-12160711

ABSTRACT

Electron tomography is providing a wealth of 3D structural data on biological components ranging from molecules to cells. We are developing a web-accessible database tailored to high-resolution, cellular-level structural and protein localization data derived from electron tomography. The Cell Centered Database, or CCDB, is built on an object-relational framework using Oracle 8i and is housed on a server at the San Diego Supercomputer Center at the University of California, San Diego. Data can be deposited and accessed via a web interface. Each volume reconstruction is stored with a full set of descriptors, along with tilt images and any derived products such as segmented objects and animations. Tomographic data are supplemented by high-resolution light microscopic data in order to provide correlated data on higher-order cellular and tissue structure. Every object segmented from a reconstruction is included as a distinct entity in the database, along with measurements such as volume, surface area, diameter, length, and amount of protein labeling, allowing the querying of image-specific attributes. Data sets obtained in response to a CCDB query are retrieved via the Storage Resource Broker, a data management system for transparent access to local and distributed data collections. The CCDB is designed to provide a resource for structural biologists and to make tomographic data sets available to the scientific community at large.
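
A toy stand-in for querying image-specific attributes of segmented objects as described; sqlite3 and all column names here are assumptions in place of the actual Oracle 8i object-relational schema:

# Segmented objects as rows with per-object measurements, queryable by
# attribute; the table and values are invented for illustration.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE segmented_object (
    reconstruction_id TEXT, name TEXT,
    volume_um3 REAL, surface_area_um2 REAL, protein_label_count INTEGER)""")
db.executemany("INSERT INTO segmented_object VALUES (?,?,?,?,?)", [
    ("recon-001", "dendritic spine", 0.12, 1.9, 45),
    ("recon-001", "mitochondrion", 0.35, 3.1, 0),
])

# e.g., "all labeled objects larger than 0.1 um^3"
for row in db.execute("""SELECT name, volume_um3 FROM segmented_object
                         WHERE volume_um3 > 0.1 AND protein_label_count > 0"""):
    print(row)  # -> ('dendritic spine', 0.12)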


Subjects
Cellular Structures/ultrastructure, Factual Databases, X-Ray Computed Tomography, Animals, Electronic Data Processing, Humans, Three-Dimensional Imaging, Electron Microscopy, Proteins