Results 1 - 4 of 4
1.
BMC Bioinformatics ; 11: 33, 2010 Jan 15.
Article in English | MEDLINE | ID: mdl-20078885

ABSTRACT

BACKGROUND: With the rapid expansion of DNA sequencing databases, it is now feasible to identify relevant information from prior sequencing projects and completed genomes and apply it to de novo sequencing of new organisms. As an example, this paper demonstrates how such extra information can be used to improve de novo assemblies by augmenting the overlapping step. Finding all pairs of overlapping reads is a key task in many genome assemblers, and to this end, highly efficient algorithms have been developed to find alignments in large collections of sequences. It is well known that due to repeated sequences, many aligned pairs of reads nevertheless do not overlap. But no overlapping algorithm to date takes a rigorous approach to separating aligned but non-overlapping read pairs from true overlaps. RESULTS: We present an approach that extends the Minimus assembler by a data-driven step to classify overlaps as true or false prior to contig construction. We trained several different classification models within the Weka framework using various statistics derived from overlaps of reads available from prior sequencing projects. These statistics included percent mismatch and k-mer frequencies within the overlaps as well as a comparative genomics score derived from mapping reads to multiple reference genomes. We show that in real whole-genome sequencing data from the E. coli and S. aureus genomes, by providing a curated set of overlaps to the contigging phase of the assembler, we nearly doubled the median contig length (N50) without sacrificing coverage of the genome or increasing the number of mis-assemblies. CONCLUSIONS: Machine learning methods that use comparative and non-comparative features to classify overlaps as true or false can be used to improve the quality of a sequence assembly.
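
The abstract glosses N50 as a median contig length. As commonly defined, N50 is a length-weighted median: the largest length L such that contigs of length L or more together cover at least half of the assembled bases. The sketch below illustrates that common definition only; the function name `n50` and the contig lengths are hypothetical and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together cover at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must be non-empty")


# Hypothetical contig lengths, for illustration only (not data from the paper).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))
```

For these made-up lengths the plain median is 40, while N50 weights each contig by the bases it contributes, so long contigs count for more.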


Subject(s)
Artificial Intelligence; Genomics/methods; Sequence Analysis, DNA/methods; Base Sequence; Databases, Nucleic Acid; Genome, Bacterial; Genome, Fungal
2.
AMIA Annu Symp Proc ; : 485-9, 2008 Nov 06.
Article in English | MEDLINE | ID: mdl-18999010

ABSTRACT

We present a study on how to predict new emerging trends in the biomedical domain based on textual data. We thereby propose a way of anticipating the transformation of arbitrary information into ground truth knowledge by predicting the inclusion of new terms into the MeSH ontology. We also discuss the preparation of a dataset for the evaluation of emerging trend prediction algorithms that is based on PubMed abstracts and related MeSH terms. The results suggest that early prediction of emerging trends is possible.


Subject(s)
Information Storage and Retrieval/methods; Journal Impact Factor; Medical Subject Headings/statistics & numerical data; Natural Language Processing; Pattern Recognition, Automated/methods; Periodicals as Topic/classification; Periodicals as Topic/trends; Terminology as Topic; Algorithms; Artificial Intelligence; Data Interpretation, Statistical; United States
3.
BMC Bioinformatics ; 9: 207, 2008 Apr 23.
Article in English | MEDLINE | ID: mdl-18433469

ABSTRACT

BACKGROUND: The increasing amount of published literature in biomedicine represents an immense source of knowledge, which can only efficiently be accessed by a new generation of automated information extraction tools. Named entity recognition of well-defined objects, such as genes or proteins, has achieved a sufficient level of maturity such that it can form the basis for the next step: the extraction of relations that exist between the recognized entities. Whereas most early work focused on the mere detection of relations, the classification of the type of relation is also of great importance and this is the focus of this work. In this paper we describe an approach that extracts both the existence of a relation and its type. Our work is based on Conditional Random Fields, which have been applied with much success to the task of named entity recognition. RESULTS: We benchmark our approach on two different tasks. The first task is the identification of semantic relations between diseases and treatments. The available data set consists of manually annotated PubMed abstracts. The second task is the identification of relations between genes and diseases from a set of concise phrases, so-called GeneRIF (Gene Reference Into Function) phrases. In our experimental setting, we do not assume that the entities are given, as is often the case in previous relation extraction work. Rather the extraction of the entities is solved as a subproblem. Compared with other state-of-the-art approaches, we achieve very competitive results on both data sets. To demonstrate the scalability of our solution, we apply our approach to the complete human GeneRIF database. The resulting gene-disease network contains 34758 semantic associations between 4939 genes and 1745 diseases. The gene-disease network is publicly available as a machine-readable RDF graph. 
CONCLUSION: We extend the framework of Conditional Random Fields towards the annotation of semantic relations from text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive with leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of detection of entities as well as entity boundaries, which will also greatly improve the relation extraction performance.


Subject(s)
Database Management Systems; Natural Language Processing; Biomedical Research/methods; Database Management Systems/standards; Database Management Systems/statistics & numerical data; Databases, Genetic; Disease/classification; Disease/etiology; Genes/physiology; Humans; MEDLINE; Models, Statistical; Semantics; Sequence Analysis; Terminology as Topic; Therapeutics/classification; Vocabulary, Controlled
4.
IEEE Trans Nanobioscience ; 3(3): 180-91, 2004 Sep.
Article in English | MEDLINE | ID: mdl-15473070

ABSTRACT

Structural learning of Bayesian networks applied to sets of genome-wide expression patterns has been recently discovered as a potentially useful tool for the systems-level statistical description of gene interactions. We train and analyze Bayesian networks with the goal of inferring biological aspects of gene function. Our two-component approach focuses on supporting the drug discovery process by identifying genes with central roles for the network operation, which could act as drug targets. The first component, referred to as scale-free analysis, uses topological measures of the network, related to a high traffic load of genes, as estimators for their functional importance. The second component, referred to as generative inverse modeling, is a method of estimating the effect of a simulated drug treatment or mutation on the global state of the network, as measured in the expression profile. We show for a dataset from acute lymphoblastic leukemia patients that both approaches are suitable for finding genes with central cellular functions. In addition, generative inverse modeling correctly identifies a known oncogene in a purely data-driven way.
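
The abstract does not name the topological measure used to estimate a gene's traffic load. One common measure of this kind is betweenness centrality, which counts how many shortest paths between other nodes pass through a given node. The sketch below uses it purely as an assumed stand-in, not as the authors' method; the function name `betweenness` and the toy graph are hypothetical.

```python
from collections import deque


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes' algorithm).

    graph maps each node to an iterable of its neighbours and is assumed
    to be undirected and unweighted, with every neighbour present as a key.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from source.
        order = []
        preds = {v: [] for v in graph}
        n_paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies, farthest nodes first.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += n_paths[v] / n_paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical toy network: one hub node linked to three leaf nodes.
toy_network = {
    "hub": ["g1", "g2", "g3"],
    "g1": ["hub"],
    "g2": ["hub"],
    "g3": ["hub"],
}
print(betweenness(toy_network))
```

In this toy network every shortest path between two leaves runs through the hub, so the hub carries all the traffic and the leaves carry none.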


Subject(s)
Biomarkers, Tumor/metabolism; Drug Delivery Systems/methods; Gene Expression Profiling/methods; Models, Biological; Precursor Cell Lymphoblastic Leukemia-Lymphoma/metabolism; Signal Transduction; Algorithms; Animals; Bayes Theorem; Computer Simulation; Drug Design; Gene Expression Regulation; Humans; Models, Statistical; Oligonucleotide Array Sequence Analysis/methods; Proteome/metabolism