Results 1 - 9 of 9
1.
BMC Bioinformatics ; 25(1): 106, 2024 Mar 10.
Article in English | MEDLINE | ID: mdl-38461247

ABSTRACT

BACKGROUND: Predicting protein-protein interactions (PPIs) from sequence data is a key challenge in computational biology. While various computational methods have been proposed, sequence embeddings from protein language models, which contain diverse structural, evolutionary, and functional information, have not been fully exploited. Additionally, there is a significant need for a comprehensive neural network capable of efficiently extracting these multifaceted representations. RESULTS: Addressing this gap, we propose xCAPT5, a novel hybrid classifier that leverages the T5-XL-UniRef50 protein large language model to generate rich amino acid embeddings from protein sequences. The core of xCAPT5 is a multi-kernel deep convolutional siamese neural network, which captures intricate interaction features at both micro and macro levels and is integrated with the XGBoost algorithm to enhance PPI classification performance. By concatenating max and average pooling features in a depth-wise manner, xCAPT5 learns crucial features at low computational cost. CONCLUSION: This study represents one of the initial efforts to extract informative amino acid embeddings from a large protein language model using a deep and wide convolutional network. Experimental results show that xCAPT5 outperforms recent state-of-the-art methods in binary PPI prediction, excelling in cross-validation on several benchmark datasets and demonstrating robust generalization across intra-species, cross-species, inter-species, and stringent similarity contexts.


Subject(s)
Neural Networks, Computer , Proteins , Proteins/chemistry , Algorithms , Amino Acid Sequence , Amino Acids
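
The pooling step mentioned in the abstract above (concatenating max and average pooling features) can be illustrated with a minimal sketch. This is not the xCAPT5 implementation: the function name `pooled_features`, the plain-list representation, and the toy two-channel embeddings are assumptions made for illustration only.

```python
from typing import List


def pooled_features(embeddings: List[List[float]]) -> List[float]:
    """Concatenate per-channel max and mean taken over the sequence axis.

    `embeddings` holds one row per residue and one column per channel.
    (Hypothetical helper, not taken from the xCAPT5 code base.)
    """
    if not embeddings:
        raise ValueError("need at least one residue embedding")
    # Transpose so that each tuple collects one channel across all residues.
    channels = list(zip(*embeddings))
    max_pool = [max(channel) for channel in channels]
    avg_pool = [sum(channel) / len(channel) for channel in channels]
    # The result has twice as many entries as there are channels.
    return max_pool + avg_pool


# Toy input: 3 residues, 2 channels (real language-model embeddings are far wider).
toy = [[0.0, 1.0], [2.0, -1.0], [4.0, 3.0]]
features = pooled_features(toy)
```

The output length is independent of the sequence length, which is one reason pooled summaries of this kind are inexpensive to pass on to a downstream classifier.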
2.
Bioinformatics ; 34(20): 3539-3546, 2018 10 15.
Article in English | MEDLINE | ID: mdl-29718118

ABSTRACT

Motivation: Recognition of biomedical named entities in the textual literature is a highly challenging research topic of great interest, and a prerequisite for extracting the huge amount of high-value biomedical knowledge deposited in unstructured text and transforming it into well-structured formats. Long Short-Term Memory (LSTM) networks have recently been employed in various biomedical named entity recognition (NER) models with great success. However, these models often do not take advantage of all useful linguistic information and leave room for improvement in several respects. Results: We propose D3NER, a novel biomedical NER model using conditional random fields and bidirectional long short-term memory, improved with fine-tuned embeddings of various linguistic information. D3NER is thoroughly compared with seven very recent state-of-the-art NER models, two of which are joint models with named entity normalization (NEN), which has been shown to improve NER performance. Experimental results on benchmark datasets, i.e. the BioCreative V Chemical Disease Relation (BC5 CDR), the NCBI Disease and the FSU-PRGE gene/protein corpora, demonstrate that D3NER outperforms, and is more stable than, all compared models for chemical and gene/protein NER, and all models without joint NEN (like D3NER itself) for disease NER, in almost all cases. On the BC5 CDR corpus, D3NER achieves F1 of 93.14% and 84.68% for chemical and disease NER, respectively, while on the NCBI Disease corpus its F1 for disease NER is 84.41%. Its F1 for gene/protein NER on FSU-PRGE is 87.62%. Availability and implementation: Data and source code are available at: https://github.com/aidantee/D3NER. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Linguistics , Molecular Sequence Annotation , Software , Benchmarking , Humans , Proteins/analysis , Proteins/genetics
3.
Nucleic Acids Res ; 39(2): e6, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21051340

ABSTRACT

Recognition of genomic binding sites by transcription factors can occur through base-specific recognition, or by recognition of variations within the structure of the DNA macromolecule. In this article, we investigate what information can be retrieved from local DNA structural properties that is relevant to transcription factor binding and that cannot be captured by the nucleotide sequence alone. More specifically, we explore the benefit of employing the structural characteristics of DNA to create binding-site models that encompass indirect recognition for the Escherichia coli model organism. We developed a novel methodology [Conditional Random fields of Smoothed Structural Data (CRoSSeD)], based on structural scales and conditional random fields, to model and predict regulator binding sites. The value of relying on local structural DNA properties is demonstrated by improved classifier performance on a large number of biological datasets, and by the detection of novel binding sites that could be validated by independent data sources and could not be identified using sequence data alone. We further show that the CRoSSeD binding-site models can be related to the actual molecular mechanisms of transcription factor-DNA binding, and thus can be used not only for prediction of novel sites, but might also give valuable insights into unknown binding mechanisms of transcription factors.


Subject(s)
Escherichia coli/genetics , Models, Statistical , Regulatory Elements, Transcriptional , Transcription Factors/metabolism , Binding Sites , DNA, Bacterial/chemistry , DNA, Bacterial/metabolism , Probability , Regulon
4.
BMC Bioinformatics ; 11: 360, 2010 Jul 01.
Article in English | MEDLINE | ID: mdl-20594316

ABSTRACT

BACKGROUND: Molecular interaction networks can be efficiently studied using network visualization software such as Cytoscape. The relevant nodes, edges and their attributes can be imported into Cytoscape in various file formats, or directly from external databases through specialized third-party plugins. However, molecular data are often stored in relational databases with their own specific structure, for which dedicated plugins do not exist. Therefore, a more generic solution is presented. RESULTS: A new Cytoscape plugin, 'CytoSQL', was developed to connect Cytoscape to any relational database. It allows users to launch SQL ('Structured Query Language') queries from within Cytoscape, with the option to inject node or edge features of an existing network as SQL arguments, and to convert the retrieved data to Cytoscape network components. Supported by a set of case studies, we demonstrate the flexibility and the power of the CytoSQL plugin in converting specific data subsets into meaningful network representations. CONCLUSIONS: CytoSQL offers a unified approach to let Cytoscape interact with relational databases. Thanks to the power of the SQL syntax, this tool can rapidly generate and enrich networks according to very complex criteria. The plugin is available at http://www.ptools.ua.ac.be/CytoSQL.


Subject(s)
Databases, Genetic , Software , Animals , Cell Physiological Phenomena , Genomics , Humans , Proteins/metabolism
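
The idea described in the abstract above (passing node attributes of an existing network as SQL arguments and turning the returned rows into network components) can be sketched outside Cytoscape. This is not the CytoSQL plugin or its API: the `interaction` table, its columns, and the helper `expand_network` are hypothetical, and an in-memory SQLite database stands in for an arbitrary relational database.

```python
import sqlite3
from typing import Dict, List, Sequence

# Hypothetical schema and rows, created in memory for the example only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE interaction (source TEXT, target TEXT, score REAL)")
conn.executemany(
    "INSERT INTO interaction VALUES (?, ?, ?)",
    [("P1", "P2", 0.9), ("P1", "P3", 0.4), ("P4", "P5", 0.8)],
)


def expand_network(
    db: sqlite3.Connection, node_ids: Sequence[str], min_score: float
) -> List[Dict[str, object]]:
    """Use existing node identifiers as SQL arguments and return edge records."""
    if not node_ids:
        return []
    # One bound placeholder per node id; values are never spliced into the SQL text.
    placeholders = ", ".join("?" for _ in node_ids)
    query = (
        "SELECT source, target, score FROM interaction "
        f"WHERE source IN ({placeholders}) AND score >= ? "
        "ORDER BY source, target"
    )
    rows = db.execute(query, [*node_ids, min_score]).fetchall()
    return [{"source": s, "target": t, "score": w} for s, t, w in rows]


# Expand the network around node P1, keeping only well-supported edges.
edges = expand_network(conn, ["P1"], 0.5)
```

Binding the node identifiers as query parameters, rather than formatting them into the query string, keeps the sketch safe for identifiers that contain quotes or other special characters.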
5.
Bioinformatics ; 24(24): 2857-64, 2008 Dec 15.
Article in English | MEDLINE | ID: mdl-18940828

ABSTRACT

MOTIVATION: Phosphorylation is a crucial post-translational protein modification mechanism with important regulatory functions in biological systems. It is catalyzed by a group of enzymes called kinases, each of which recognizes certain target sites in its substrate proteins. Several authors have built computational models trained from sets of experimentally validated phosphorylation sites to predict these target sites for each given kinase. All of these models suffer from certain limitations, such as the fact that they do not take into account the dependencies between amino acid motifs within protein sequences in a global fashion. RESULTS: We propose a novel approach to predict phosphorylation sites from the protein sequence. The method uses a positive dataset to train a conditional random field (CRF) model. The negative training dataset is used to specify the decision threshold corresponding to a desired false positive rate. Application of the method to experimentally verified benchmark phosphorylation data (Phospho.ELM) shows that it performs well compared to existing methods for most kinases. This is, to our knowledge, the first report of the use of CRFs to predict post-translational modification sites in protein sequences. AVAILABILITY: The source code of the implementation, called CRPhos, is available from http://www.ptools.ua.ac.be/CRPhos/


Subject(s)
Algorithms , Protein Kinases/metabolism , Computational Biology/methods , Databases, Protein , Phosphorylation , Protein Kinases/chemistry , Sequence Analysis, Protein
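
The thresholding step described in the abstract above (using the negative set to fix a decision threshold for a desired false positive rate) can be sketched as follows. This is not the CRPhos code: the function `threshold_for_fpr`, the rule "predict positive when score > threshold", and the toy scores are assumptions for illustration, and the scores could come from any model.

```python
import math
from typing import Sequence


def threshold_for_fpr(negative_scores: Sequence[float], target_fpr: float) -> float:
    """Return a threshold t such that the share of negatives scoring above t
    does not exceed target_fpr (a site is called positive when score > t)."""
    if not negative_scores:
        raise ValueError("negative_scores must not be empty")
    if not 0.0 <= target_fpr <= 1.0:
        raise ValueError("target_fpr must lie in [0, 1]")
    ordered = sorted(negative_scores, reverse=True)
    # Number of negatives that may score strictly above the threshold.
    allowed = math.floor(target_fpr * len(ordered))
    if allowed >= len(ordered):
        return -math.inf  # every negative may be called positive
    return ordered[allowed]


# Toy model scores for ten known non-phosphorylated sites (made-up values).
negatives = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
threshold = threshold_for_fpr(negatives, 0.2)
observed_fpr = sum(score > threshold for score in negatives) / len(negatives)
```

With these toy scores the threshold is 0.8, and the two negatives above it (0.9 and 0.95) give an observed false positive rate of 0.2 on the negative set.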
6.
Database (Oxford) ; 2016, 2016 07.
Article in English | MEDLINE | ID: mdl-27630201

ABSTRACT

The BioCreative V chemical-disease relation (CDR) track was proposed to accelerate the progress of text mining in facilitating integrative understanding of chemicals, diseases and their relations. In this article, we describe an extension of our system (namely UET-CAM) that participated in the BioCreative V CDR. The original UET-CAM system's performance was ranked fourth among 18 participating systems by the BioCreative CDR track committee. In the Disease Named Entity Recognition and Normalization (DNER) phase, our system employed joint inference (decoding) with a perceptron-based named entity recognizer (NER) and a back-off model with Semantic Supervised Indexing and Skip-gram for named entity normalization. In the chemical-induced disease (CID) relation extraction phase, we proposed a pipeline that includes a coreference resolution module and a Support Vector Machine relation extraction model. The former module utilized a multi-pass sieve to extend entity recall. In this article, the UET-CAM system was improved by adding a 'silver' CID corpus to train the prediction model. This silver-standard corpus of more than 50 thousand sentences was built automatically from the Comparative Toxicogenomics Database (CTD). We evaluated our method on the CDR test set. Results showed that our system could reach state-of-the-art performance, with F1 of 82.44 for the DNER task and 58.90 for the CID task. Analysis demonstrated substantial benefits of both the multi-pass sieve coreference resolution method (F1 +4.13%) and the silver CID corpus (F1 +7.3%). Database URL: SilverCID, the silver-standard corpus for CID relation extraction, is freely available online at: https://zenodo.org/record/34530 (doi:10.5281/zenodo.34530).


Subject(s)
Chemically-Induced Disorders/genetics , Chemically-Induced Disorders/metabolism , Data Mining/methods , Models, Theoretical , Support Vector Machine , Animals , Humans
7.
8.
Phytochemistry ; 72(10): 1192-218, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21345472

ABSTRACT

The congruent development of computational technology, bioinformatics and analytical instrumentation makes proteomics ready for the next leap. Present-day state-of-the-art proteomics has grown from a descriptive method into a full stakeholder in systems biology. High-throughput and genome-wide studies are now made at the functional level. These include quantitative aspects, functional aspects with respect to protein interactions as well as post-translational modifications, and advanced computational methods that aid in predicting protein function and mapping these functionalities across the species border. This review gives an overview of the current status of these aspects in plant studies, with special attention to non-genomic model plants.


Subject(s)
Plant Proteins/analysis , Proteomics , Computational Biology , Computer Simulation , Databases, Protein , Plant Proteins/genetics , Plant Proteins/metabolism , Protein Binding , Protein Processing, Post-Translational
9.
Int J Biol Sci ; 6(1): 51-67, 2010 Jan 12.
Article in English | MEDLINE | ID: mdl-20087442

ABSTRACT

Promyelocytic Leukaemia Protein nuclear bodies (PML-NBs) are dynamic nuclear protein aggregates. To gain insight into PML-NB function, reductionist and high-throughput techniques have been employed to identify PML-NB proteins. Here we present a manually curated network of the PML-NB interactome based on extensive literature review, including database information. By compiling 'the PML-ome', we highlighted the presence of interactors in the Small Ubiquitin-Like Modifier (SUMO) conjugation pathway. Additionally, we show an enrichment of SUMOylatable proteins in the PML-NBs through an in-house prediction algorithm. Therefore, based on the PML network, we hypothesize that PML-NBs may function as a nuclear SUMOylation hotspot.


Subject(s)
Nuclear Proteins/metabolism , Small Ubiquitin-Related Modifier Proteins/metabolism , Transcription Factors/metabolism , Tumor Suppressor Proteins/metabolism , Algorithms , Animals , Cell Nucleus/metabolism , Humans , Models, Biological , Promyelocytic Leukemia Protein , Protein Multimerization , Signal Transduction , Ubiquitination