Results 1 - 11 of 11
1.
Article in English | MEDLINE | ID: mdl-27141091

ABSTRACT

We describe the development of a chemical entity recognition system and its application in the CHEMDNER-patent track of BioCreative 2015. This community challenge includes a Chemical Entity Mention in Patents (CEMP) recognition task and a Chemical Passage Detection (CPD) classification task. We addressed both tasks by an ensemble system that combines a dictionary-based approach with a statistical one. For this purpose the performance of several lexical resources was assessed using Peregrine, our open-source indexing engine. We combined our dictionary-based results on the patent corpus with the results of tmChem, a chemical recognizer using a conditional random field classifier. To improve the performance of tmChem, we utilized three additional features, viz. part-of-speech tags, lemmas and word-vector clusters. When evaluated on the training data, our final system obtained an F-score of 85.21% for the CEMP task, and an accuracy of 91.53% for the CPD task. On the test set, the best system ranked sixth among 21 teams for CEMP with an F-score of 86.82%, and second among nine teams for CPD with an accuracy of 94.23%. The differences in performance between the best ensemble system and the statistical system separately were small.

Database URL: http://biosemantics.org/chemdner-patents.


Subject(s)
Data Mining/methods , Databases, Chemical , Machine Learning , Patents as Topic , Models, Statistical , Software
2.
PLoS One ; 11(2): e0149621, 2016.
Article in English | MEDLINE | ID: mdl-26919047

ABSTRACT

High-throughput experimental methods such as medical sequencing and genome-wide association studies (GWAS) identify increasingly large numbers of potential relations between genetic variants and diseases. Both biological complexity (millions of potential gene-disease associations) and the accelerating rate of data production necessitate computational approaches to prioritize and rationalize potential gene-disease relations. Here, we use concept profile technology to expose from the biomedical literature both explicitly stated gene-disease relations (the explicitome) and a much larger set of implied gene-disease associations (the implicitome). Implicit relations are largely unknown to, or are even unintended by the original authors, but they vastly extend the reach of existing biomedical knowledge for identification and interpretation of gene-disease associations. The implicitome can be used in conjunction with experimental data resources to rationalize both known and novel associations. We demonstrate the usefulness of the implicitome by rationalizing known and novel gene-disease associations, including those from GWAS. To facilitate the re-use of implicit gene-disease associations, we publish our data in compliance with FAIR Data Publishing recommendations [https://www.force11.org/group/fairgroup] using nanopublications. An online tool (http://knowledge.bio) is available to explore established and potential gene-disease associations in the context of other biomedical relations.


Subject(s)
Computational Biology/methods , Databases, Genetic , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans
3.
Nat Genet ; 47(2): 115-25, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25581432

ABSTRACT

Many cancer-associated somatic copy number alterations (SCNAs) are known. Currently, one of the challenges is to identify the molecular downstream effects of these variants. Although several SCNAs are known to change gene expression levels, it is not clear whether each individual SCNA affects gene expression. We reanalyzed 77,840 expression profiles and observed a limited set of 'transcriptional components' that describe well-known biology, explain the vast majority of variation in gene expression and enable us to predict the biological function of genes. On correcting expression profiles for these components, we observed that the residual expression levels (in 'functional genomic mRNA' profiling) correlated strongly with copy number. DNA copy number correlated positively with expression levels for 99% of all abundantly expressed human genes, indicating global gene dosage sensitivity. By applying this method to 16,172 patient-derived tumor samples, we replicated many loci with aberrant copy numbers and identified recurrently disrupted genes in genomically unstable cancers.


Subject(s)
DNA Copy Number Variations , Gene Dosage , Gene Expression Regulation, Neoplastic/genetics , Genomics , Neoplasms/genetics , Transcriptome , Comparative Genomic Hybridization , Gene Expression Profiling , Gene Regulatory Networks , Genetic Loci , Humans , RNA, Messenger/genetics , RNA, Neoplasm/genetics
4.
PLoS One ; 8(11): e78665, 2013.
Article in English | MEDLINE | ID: mdl-24260124

ABSTRACT

MOTIVATION: Weighted semantic networks built from text-mined literature can be used to retrieve known protein-protein or gene-disease associations, and have been shown to anticipate associations years before they are explicitly stated in the literature. Our text-mining system recognizes over 640,000 biomedical concepts: some are specific (i.e., names of genes or proteins), others generic (e.g., 'Homo sapiens'). Generic concepts may play important roles in automated information retrieval, extraction, and inference but may also result in concept overload and confound retrieval and reasoning with low-relevance or even spurious links. Here, we attempted to optimize the retrieval performance for protein-protein interactions (PPI) by filtering generic concepts (node filtering) or links to generic concepts (edge filtering) from a weighted semantic network. First, we defined metrics based on network properties that quantify the specificity of concepts. Then, using these metrics, we systematically filtered generic information from the network while monitoring retrieval performance of known protein-protein interactions. We also systematically filtered specific information from the network (inverse filtering), and assessed the retrieval performance of networks composed of generic information alone. RESULTS: Filtering generic or specific information induced a two-phase response in retrieval performance: initially the effects of filtering were minimal, but beyond a critical threshold network performance suddenly dropped. Contrary to expectations, networks composed exclusively of generic information demonstrated retrieval performance comparable to unfiltered networks that also contain specific concepts. Furthermore, an analysis using individual generic concepts demonstrated that they can effectively support the retrieval of known protein-protein interactions. For instance, the concept "binding" is indicative for PPI retrieval and the concept "mutation abnormality" is indicative for gene-disease associations. CONCLUSION: Generic concepts are important for information retrieval and cannot be removed from semantic networks without negative impact on retrieval performance.


Subject(s)
Data Mining/methods , Semantics , Vocabulary, Controlled , Humans
5.
Proteomics ; 12(4-5): 543-9, 2012 Feb.
Article in English | MEDLINE | ID: mdl-22246801

ABSTRACT

Bioinformatics is the field where computational methods from various domains have come together for analysis of biological data. Each domain has introduced its own specific jargon. However, in closely related domains, e.g. machine learning and statistics, both concordant and discordant terminology occurs, and the latter can lead to confusion. This article aims to help solve the confusion of tongues arising from these two closely related domains, which are frequently used in bioinformatics. We provide a short summary of the most commonly applied machine learning and statistical approaches to data analysis in bioinformatics, i.e. classification and statistical hypothesis testing. We explain differences and similarities in common terminology used in various domains, such as precision, recall, sensitivity and true positive rate. This primer can serve as a guide to the terminology used in these fields.


Subject(s)
Artificial Intelligence , Computational Biology/methods , Proteomics/methods , Statistics as Topic/methods , Models, Statistical
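
The terminology overlap mentioned in abstract 5 (precision, recall, sensitivity, true positive rate) can be illustrated with a small sketch. The class and method names below are hypothetical and are not taken from the article; the formulas are the standard definitions over a binary confusion matrix. The point is that recall, sensitivity and true positive rate are three names for the same ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts from a binary classifier evaluated against known labels."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def recall(self) -> float:
        # Machine-learning name for TP / (TP + FN).
        return self.tp / (self.tp + self.fn)

    # Statistics uses these names for exactly the same quantity.
    sensitivity = recall
    true_positive_rate = recall

    def precision(self) -> float:
        # Also called positive predictive value: TP / (TP + FP).
        return self.tp / (self.tp + self.fp)

    def specificity(self) -> float:
        # True negative rate: TN / (TN + FP).
        return self.tn / (self.tn + self.fp)

    def f_score(self) -> float:
        # Harmonic mean of precision and recall (F1).
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r)
```

This sketch does not guard against empty denominators (for example, a classifier that makes no positive predictions); a real evaluation script would need to decide how to report those cases.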
6.
Methods Mol Biol ; 760: 129-40, 2011.
Article in English | MEDLINE | ID: mdl-21779994

ABSTRACT

This chapter gives a brief overview of text-mining techniques to extract knowledge from large text collections. It describes the basic pipeline for getting from text to relationships between biological concepts and the problems that are encountered at each step in the pipeline. We first explain how words in text are recognized as concepts. Second, concepts are associated with each other using 2×2 contingency tables and test statistics. Third, we explain that it is possible to extract indirect links between concepts using the direct links taken from 2×2 table analyses. This we call implicit information extraction. Fourth, the validation techniques to evaluate a text-mining system such as ROC curves and retrospective studies are discussed. We conclude by examining how text information can be combined with other non-textual data sources such as microarray expression data and what the future directions are for text-mining within the Internet.


Subject(s)
Computational Biology , Data Mining , Information Storage and Retrieval , Databases, Factual , Humans , Internet , Reproducibility of Results , Retrospective Studies
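
As an illustration of the second step described in abstract 6 (associating two concepts through a 2×2 contingency table and a test statistic), the sketch below computes a Pearson chi-square statistic from document co-occurrence counts. The abstract does not say which test statistic the chapter uses, so the choice of chi-square, the function names and the argument names are all assumptions made for this example.

```python
import math


def chi_square_2x2(both: int, only_a: int, only_b: int, neither: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table.

                    B present   B absent
        A present   both        only_a
        A absent    only_b      neither

    Each cell counts documents; A and B are two concepts.
    """
    n = both + only_a + only_b + neither
    row_a, row_not_a = both + only_a, only_b + neither
    col_b, col_not_b = both + only_b, only_a + neither
    denom = row_a * row_not_a * col_b * col_not_b
    if denom == 0:
        raise ValueError("a row or column total is zero; statistic undefined")
    return n * (both * neither - only_a * only_b) ** 2 / denom


def p_value_1df(statistic: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))
```

A statistic of zero means the two concepts co-occur exactly as often as independence would predict; larger values indicate a stronger departure from independence, in either direction, so the sign of `both * neither - only_a * only_b` is still needed to tell attraction from avoidance.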
7.
Nat Genet ; 43(4): 281-3, 2011 Mar 29.
Article in English | MEDLINE | ID: mdl-21445068

ABSTRACT

Data citation and the derivation of semantic constructs directly from datasets have now both found their place in scientific communication. The social challenge facing us is to maintain the value of traditional narrative publications and their relationship to the datasets they report upon while at the same time developing appropriate metrics for citation of data and data constructs.


Subject(s)
Databases, Genetic , Communication , Genetic Variation , Humans , Knowledge Bases , Publishing
8.
Proteomics ; 11(5): 843-53, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21280221

ABSTRACT

We introduce a framework for predicting novel protein-protein interactions (PPIs), based on Fisher's method for combining probabilities of predictions that are based on different data sources, such as the biomedical literature, protein domain and mRNA expression information. Our method compares favorably to our previous method based on text-mining alone and other methods such as STRING. We evaluated our algorithms through the prediction of experimentally found protein interactions underlying Muscular Dystrophy, Huntington's Disease and Polycystic Kidney Disease, which had not yet been recorded in protein-protein interaction databases. We found a 1.74-fold increase in the mean average prediction precision for dysferlin and a 3.09-fold increase for huntingtin when compared to STRING. The top 10 predicted interaction partners of huntingtin were analysed in depth. Five were identified previously, and the other five were new potential interaction partners. The full matrix of human protein pairs and their prediction scores is available for download. Our framework can be extended to predict other types of relationships such as proteins in a complex, pathway or related disease mechanisms.


Subject(s)
Huntington Disease/metabolism , Membrane Proteins/metabolism , Muscle Proteins/metabolism , Muscular Dystrophies/metabolism , Nerve Tissue Proteins/metabolism , Nuclear Proteins/metabolism , Polycystic Kidney Diseases/metabolism , Protein Interaction Mapping/methods , Algorithms , Animals , Computational Biology/methods , Databases, Protein , Drosophila , Dysferlin , Gene Expression , Humans , Huntingtin Protein , Huntington Disease/genetics , Membrane Proteins/genetics , Mice , Molecular Targeted Therapy , Muscle Proteins/genetics , Muscular Dystrophies/genetics , Nerve Tissue Proteins/genetics , Nuclear Proteins/genetics , Polycystic Kidney Diseases/genetics , Predictive Value of Tests , Probability , Protein Binding , Protein Structure, Tertiary , RNA, Messenger
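
Abstract 8 names Fisher's method for combining probabilities from different data sources. A minimal sketch of the textbook form of that method follows: the statistic X = -2 Σ ln p_i is compared with a chi-square distribution with 2k degrees of freedom, which has a closed-form upper tail when the degrees of freedom are even. The function name is hypothetical, the inputs are assumed to be independent p-values, and how the authors derived a probability from each data source is not stated in the abstract and is not modelled here.

```python
import math
from typing import Sequence


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine independent p-values with Fisher's method.

    Returns the upper-tail probability of X = -2 * sum(ln p_i)
    under a chi-square distribution with 2k degrees of freedom.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2

    # For even degrees of freedom 2k the survival function is
    # exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single input the function returns that p-value unchanged, which is a convenient sanity check. Several moderately small p-values combine into a smaller one, which is how weak evidence from literature, domain and expression data could reinforce each other under the independence assumption.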
9.
PLoS One ; 5(11): e13854, 2010 Nov 05.
Article in English | MEDLINE | ID: mdl-21079765

ABSTRACT

Dysferlin is critical for repair of muscle membranes after damage. Mutations in dysferlin lead to a progressive muscular dystrophy. Recent studies suggest additional roles for dysferlin. We set out to study dysferlin's protein-protein interactions to obtain comprehensive knowledge of dysferlin functionalities in a myogenic context. We developed a robust and reproducible method to isolate dysferlin protein complexes from cells and tissue. We analyzed the composition of these complexes in cultured myoblasts, myotubes and skeletal muscle tissue by mass spectrometry and subsequently inferred potential protein functions through bioinformatics analyses. Our data confirm previously reported interactions and support a function for dysferlin as a vesicle trafficking protein. In addition novel potential functionalities were uncovered, including phagocytosis and focal adhesion. Our data reveal that the dysferlin protein complex has a dynamic composition as a function of myogenic differentiation. We provide additional experimental evidence and show dysferlin localization to, and interaction with the focal adhesion protein vinculin at the sarcolemma. Finally, our studies reveal evidence for cross-talk between dysferlin and its protein family member myoferlin. Together our analyses show that dysferlin is not only a membrane repair protein but also important for muscle membrane maintenance and integrity.


Subject(s)
Membrane Proteins/metabolism , Protein Interaction Mapping/methods , Proteome/metabolism , Proteomics/methods , Sarcolemma/metabolism , Animals , Blotting, Western , Cells, Cultured , Cluster Analysis , Dysferlin , Gene Expression Profiling , Immunoprecipitation , Membrane Proteins/genetics , Mice , Muscle Fibers, Skeletal/metabolism , Muscle Proteins/metabolism , Muscle, Skeletal/metabolism , Myoblasts/cytology , Myoblasts/metabolism , Protein Binding , Proteome/genetics , Spectrometry, Mass, Electrospray Ionization
10.
PLoS One ; 5(8): e11940, 2010 Aug 04.
Article in English | MEDLINE | ID: mdl-20694146

ABSTRACT

Calpain 3 (CAPN3) is a cysteine protease that when mutated causes Limb Girdle Muscular Dystrophy 2A. It is thereby the only described Calpain family member that genetically causes a disease. Due to its inherent instability little is known of its substrates or its mechanism of activity and pathogenicity. In this investigation we define a primary sequence motif underlying CAPN3 substrate cleavage. This motif can transform non-related proteins into substrates, and identifies >300 new putative CAPN3 targets. Bioinformatic analyses of these targets demonstrate a critical role in muscle cytoskeletal remodeling and identify novel CAPN3 functions. Among the new CAPN3 substrates are three E3 SUMO ligases of the Protein Inhibitor of Activated Stats (PIAS) family. CAPN3 can cleave PIAS proteins and negatively regulates PIAS3 sumoylase activity. Consequently, SUMO2 is deregulated in patient muscle tissue. Our study thus uncovers unexpected crosstalk between CAPN3 proteolysis and protein sumoylation, with strong implications for muscle remodeling.


Subject(s)
Calpain/metabolism , Computational Biology , Muscles/metabolism , Amino Acid Motifs , Amino Acid Sequence , Calpain/chemistry , Consensus Sequence , Cytoskeleton/metabolism , Humans , Kinetics , Molecular Sequence Data , Muscles/cytology
11.
PLoS One ; 4(11): e7894, 2009 Nov 18.
Article in English | MEDLINE | ID: mdl-19924298

ABSTRACT

We have developed a method that predicts Protein-Protein Interactions (PPIs) based on the similarity of the context in which proteins appear in literature. This method outperforms previously developed PPI prediction algorithms that rely on the conjunction of two protein names in MEDLINE abstracts. We show significant increases in coverage (76% versus 32%) and sensitivity (66% versus 41% at a specificity of 95%) for the prediction of PPIs currently archived in 6 PPI databases. A retrospective analysis shows that PPIs can efficiently be predicted before they enter PPI databases and before their interaction is explicitly described in the literature. The practical value of the method for discovery of novel PPIs is illustrated by the experimental confirmation of the inferred physical interaction between CAPN3 and PARVB, which was based on frequent co-occurrence of both proteins with concepts like Z-disc, dysferlin, and alpha-actinin. The relationships between proteins predicted by our method are broader than PPIs, and include proteins in the same complex or pathway. Dependent on the type of relationships deemed useful, the precision of our method can be as high as 90%. The full set of predicted interactions is available in a downloadable matrix and through the webtool Nermal, which lists the most likely interaction partners for a given protein. Our framework can be used for prioritizing potential interaction partners, hitherto undiscovered, for follow-up studies and to aid the generation of accurate protein interaction maps.


Subject(s)
Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/chemistry , Algorithms , Animals , Calpain/metabolism , Cloning, Molecular , Humans , MEDLINE , Mice , Models, Statistical , Muscle Proteins/metabolism , Protein Binding , Proteins/metabolism , ROC Curve , Recombinant Proteins/chemistry , Reproducibility of Results , United States
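
Abstract 11 describes predicting PPIs from the similarity of the literature context in which two proteins appear. One common way to compare such contexts is cosine similarity between weighted concept vectors; the sketch below shows that idea only. The abstract does not state which similarity measure the authors used, so cosine similarity, the function name, and the representation of a profile as a concept-to-weight mapping are assumptions, and any weights passed in would be illustrative rather than data from the paper.

```python
import math
from typing import Mapping


def cosine_similarity(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse concept profiles.

    Each profile maps a concept name to a non-negative weight describing
    how strongly that concept is associated with a protein in the literature.
    """
    shared = a.keys() & b.keys()
    dot = sum(a[concept] * b[concept] for concept in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile shares no context with anything
    return dot / (norm_a * norm_b)
```

Under this sketch, two proteins that are never mentioned together can still score highly if both co-occur with the same concepts, such as the Z-disc, dysferlin and alpha-actinin context that the abstract cites for CAPN3 and PARVB.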