Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 41
Filtrar
Más filtros

Banco de datos
País/Región como asunto
Tipo del documento
Intervalo de año de publicación
1.
Bioinformatics ; 39(12)2023 12 01.
Artículo en Inglés | MEDLINE | ID: mdl-38019945

RESUMEN

MOTIVATION: Technical errors in sequencing or bioinformatics steps and difficulties in alignment at some genomic sites result in false positive (FP) variants. Filtering based on quality metrics is a common method for detecting FP variants, but setting thresholds to reduce FP rates may reduce the number of true positive variants by overlooking the more complex relationships between features. The goal of this study is to develop a machine learning-based model for identifying FPs that integrates quality metrics with genomic features and with the feature interpretability property to provide insights into model results. RESULTS: We propose a random forest-based model that utilizes genomic features to improve identification of FPs. Further examination of the features shows that the newly introduced features have an important impact on the prediction of variants misclassified by VEF, GATK-CNN, and GARFIELD, recently introduced FP detection systems. We applied cost-sensitive training to avoid errors in misclassification of true variants and developed a model that provides a robust mechanism against misclassification of true variants while increasing the prediction rate of FP variants. This model can be easily re-trained when factors such as experimental protocols might alter the FP distribution. In addition, it has an interpretability mechanism that allows users to understand the impact of features on the model's predictions. AVAILABILITY AND IMPLEMENTATION: The software implementation can be found at https://github.com/ideateknoloji/FPDetect.


Asunto(s)
Genómica , Secuenciación de Nucleótidos de Alto Rendimiento , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Genómica/métodos , Genoma , Programas Informáticos , Biología Computacional
2.
Bioinformatics ; 38(Suppl_2): ii155-ii161, 2022 09 16.
Artículo en Inglés | MEDLINE | ID: mdl-36124801

RESUMEN

MOTIVATION: The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabelled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e. warm start) targeted molecule generation models. We investigate two warm start strategies: (i) a one-stage strategy where the initialized model is trained on targeted molecule generation and (ii) a two-stage strategy containing a pre-finetuning on molecular generation followed by target-specific training. We also compare two decoding strategies to generate compounds: beam search and sampling. RESULTS: The results show that the warm-started models perform better than a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results to each other with respect to widely used metrics from benchmarks. However, docking evaluation of the generated compounds for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both docking evaluation and benchmark metrics for assessing compound quality. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials (i.e., data, models, and outputs) are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Lenguaje , Programas Informáticos , Diseño de Fármacos , Ligandos , Proteínas
3.
BMC Bioinformatics ; 20(1): 156, 2019 Mar 27.
Artículo en Inglés | MEDLINE | ID: mdl-30917789

RESUMEN

BACKGROUND: Although there is an enormous number of textual resources in the biomedical domain, currently, manually curated resources cover only a small part of the existing knowledge. The vast majority of these information is in unstructured form which contain nonstandard naming conventions. The task of named entity recognition, which is the identification of entity names from text, is not adequate without a standardization step. Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities. This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary. We propose an approach for the normalization of biomedical entities through an ontology/dictionary by using word embeddings to represent semantic spaces, and a syntactic parser to give higher weight to the most informative word in the named entity mentions. RESULTS: We applied the proposed method to two different normalization tasks: the normalization of bacteria biotope entities through the Onto-Biotope ontology and the normalization of adverse drug reaction entities through the Medical Dictionary for Regulatory Activities (MedDRA). The proposed method achieved a precision score of 65.9%, which is 2.9 percentage points above the state-of-the-art result on the BioNLP Shared Task 2016 Bacteria Biotope test data and a macro-averaged precision score of 68.7% on the Text Analysis Conference 2017 Adverse Drug Reaction test data. CONCLUSIONS: The core contribution of this paper is a syntax-based way of combining the individual word vectors to form vectors for the named entity mentions and ontology concepts, which can then be used to measure the similarity between them. The proposed approach is unsupervised and does not require labeled data, making it easily applicable to different domains.


Asunto(s)
Minería de Datos , Semántica , Algoritmos , Bacterias/metabolismo , Efectos Colaterales y Reacciones Adversas Relacionados con Medicamentos , Estándares de Referencia , Programas Informáticos
4.
BMC Bioinformatics ; 20(Suppl 21): 707, 2019 Dec 23.
Artículo en Inglés | MEDLINE | ID: mdl-31865904

RESUMEN

BACKGROUND: Use of medication can cause adverse drug reactions (ADRs), unwanted or unexpected events, which are a major safety concern. Drug labels, or prescribing information or package inserts, describe ADRs. Therefore, systematically identifying ADR information from drug labels is critical in multiple aspects; however, this task is challenging due to the nature of the natural language of drug labels. RESULTS: In this paper, we present a machine learning- and rule-based system for the identification of ADR entity mentions in the text of drug labels and their normalization through the Medical Dictionary for Regulatory Activities (MedDRA) dictionary. The machine learning approach is based on a recently proposed deep learning architecture, which integrates bi-directional Long Short-Term Memory (Bi-LSTM), Convolutional Neural Network (CNN), and Conditional Random Fields (CRF) for entity recognition. The rule-based approach, used for normalizing the identified ADR mentions to MedDRA terms, is based on an extension of our in-house text-mining system, SciMiner. We evaluated our system on the Text Analysis Conference (TAC) Adverse Drug Reaction 2017 challenge test data set, consisting of 200 manually curated US FDA drug labels. Our ML-based system achieved 77.0% F1 score on the task of ADR mention recognition and 82.6% micro-averaged F1 score on the task of ADR normalization, while rule-based system achieved 67.4 and 77.6% F1 scores, respectively. CONCLUSION: Our study demonstrates that a system composed of a deep learning architecture for entity recognition and a rule-based model for entity normalization is a promising approach for ADR extraction from drug labels.


Asunto(s)
Etiquetado de Medicamentos , Efectos Colaterales y Reacciones Adversas Relacionados con Medicamentos , Aprendizaje Automático , Minería de Datos , Aprendizaje Profundo , Redes Neurales de la Computación , Estados Unidos , United States Food and Drug Administration
5.
BMC Bioinformatics ; 20(1): 324, 2019 Jun 13.
Artículo en Inglés | MEDLINE | ID: mdl-31195961

RESUMEN

BACKGROUND: As DNA sequencing technologies are improving and getting cheaper, genomic data can be utilized for diagnosis of many diseases such as cancer. Human raw genome data is huge in size for computational systems. Therefore, there is a need for a compact and accurate representation of the valuable information in DNA. The occurrence of complex genetic disorders often results from multiple gene mutations. The effect of each mutation is not equal for the development of a disease. Inspired from the field of information retrieval, we propose using the term frequency (tf) and BM25 term weighting measures with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease and the less mutations it has in other patients, the more discriminative that gene is. RESULTS: We evaluated the proposed representations on the task of cancer type classification. We applied various machine learning techniques using the tf-idf and tf-rf schemes and their BM25 versions. Our results show that the BM25-tf-rf representation leads to improved classification accuracy and f-score values compared to the other representations. The highest accuracy (76.44%) and f-score (76.95%) are achieved with the BM25-tf-rf based data representation. CONCLUSIONS: As a result of our experiments, the BM25-tf-rf scheme and the proposed neural network model is shown to be the best performing classification system for our case study of cancer type classification. This system is further utilized for causal gene analysis. Examples from the most effective genes that are used for decision making are found to be in the literature as target or causal genes.


Asunto(s)
Genómica/métodos , Modelos Genéticos , Modelos Estadísticos , Mutación/genética , Bases de Datos Genéticas , Exones/genética , Humanos , Intrones/genética , Aprendizaje Automático , Neoplasias/genética , Redes Neurales de la Computación
6.
Bioinformatics ; 34(13): i295-i303, 2018 07 01.
Artículo en Inglés | MEDLINE | ID: mdl-29949957

RESUMEN

Motivation: The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins suggesting that a ligand-based approach can be utilized in protein representation. In this study, we propose SMILESVec, a Simplified molecular input line entry system (SMILES)-based method to represent ligands and a novel method to compute similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word-embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in protein clustering task using TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, Basic local alignment tool and ProtVec, and two compound fingerprint-based protein representation methods are compared. Results: We showed that ligand-based protein representation, which uses only SMILES strings of the ligands that proteins bind to, performs as well as protein sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence or structure-based representation of proteins and this novel approach can be applied to different bioinformatics problems such as prediction of new protein-ligand interactions and protein function annotation. Availability and implementation: https://github.com/hkmztrk/SMILESVecProteinRepresentation. Supplementary information: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Biología Computacional , Ligandos , Proteínas , Secuencia de Aminoácidos , Análisis por Conglomerados , Biología Computacional/métodos , Modelos Moleculares , Unión Proteica , Proteínas/química , Análisis de Secuencia de Proteína
7.
Bioinformatics ; 34(17): i821-i829, 2018 09 01.
Artículo en Inglés | MEDLINE | ID: mdl-30423097

RESUMEN

Motivation: The identification of novel drug-target (DT) interactions is a substantial part of the drug discovery process. Most of the computational methods that have been proposed to predict DT interactions have focused on binary classification, where the goal is to determine whether a DT pair interacts or not. However, protein-ligand interactions assume a continuum of binding strength values, also called binding affinity and predicting this value still remains a challenge. The increase in the affinity data available in DT knowledge-bases allows the use of advanced learning techniques such as deep learning architectures in the prediction of binding affinities. In this study, we propose a deep-learning based model that uses only sequence information of both targets and drugs to predict DT interaction binding affinities. The few studies that focus on DT binding affinity prediction use either 3D structures of protein-ligand complexes or 2D features of compounds. One novel approach used in this work is the modeling of protein sequences and compound 1D representations with convolutional neural networks (CNNs). Results: The results show that the proposed deep learning based model that uses the 1D representations of targets and drugs is an effective approach for drug target binding affinity prediction. The model in which high-level representations of a drug and a target are constructed via CNNs achieved the best Concordance Index (CI) performance in one of our larger benchmark datasets, outperforming the KronRLS algorithm and SimBoost, a state-of-the-art method for DT binding affinity prediction. Availability and implementation: https://github.com/hkmztrk/DeepDTA. Supplementary information: Supplementary data are available at Bioinformatics online.


Asunto(s)
Mapeo de Interacción de Proteínas/métodos , Proteínas/química , Algoritmos , Secuencia de Aminoácidos , Ligandos , Aprendizaje Automático , Redes Neurales de la Computación , Programas Informáticos
8.
Bioinformatics ; 33(14): i49-i58, 2017 Jul 15.
Artículo en Inglés | MEDLINE | ID: mdl-28881973

RESUMEN

MOTIVATION: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. METHODS: We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. RESULTS: The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6% in terms of the Pearson correlation metric. AVAILABILITY AND IMPLEMENTATION: A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/ . CONTACT: gizemsogancioglu@gmail.com or arzucan.ozgur@boun.edu.tr.


Asunto(s)
Procesamiento de Lenguaje Natural , Aprendizaje Automático Supervisado , Ontologías Biológicas , Humanos , Semántica
9.
BMC Bioinformatics ; 17: 128, 2016 Mar 18.
Artículo en Inglés | MEDLINE | ID: mdl-26987649

RESUMEN

BACKGROUND: Molecular structures can be represented as strings of special characters using SMILES. Since each molecule is represented as a string, the similarity between compounds can be computed using SMILES-based string similarity functions. Most previous studies on drug-target interaction prediction use 2D-based compound similarity kernels such as SIMCOMP. To the best of our knowledge, using SMILES-based similarity functions, which are computationally more efficient than the 2D-based kernels, has not been investigated for this task before. RESULTS: In this study, we adapt and evaluate various SMILES-based similarity methods for drug-target interaction prediction. In addition, inspired by the vector space model of Information Retrieval we propose cosine similarity based SMILES kernels that make use of the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) weighting approaches. We also investigate generating composite kernels by combining our best SMILES-based similarity functions with the SIMCOMP kernel. With this study, we provided a comparison of 13 different ligand similarity functions, each of which utilizes the SMILES string of molecule representation. Additionally, TF and TF-IDF based cosine similarity kernels are proposed. CONCLUSION: The more efficient SMILES-based similarity functions performed similarly to the more complex 2D-based SIMCOMP kernel in terms of AUC-ROC scores. The TF-IDF based cosine similarity obtained a better AUC-PR score than the SIMCOMP kernel on the GPCR benchmark data set. The composite kernel of TF-IDF based cosine similarity and SIMCOMP achieved the best AUC-PR scores for all data sets.


Asunto(s)
Enzimas/metabolismo , Canales Iónicos/metabolismo , Modelos Moleculares , Preparaciones Farmacéuticas/metabolismo , Receptores Citoplasmáticos y Nucleares/metabolismo , Receptores Acoplados a Proteínas G/metabolismo , Ligandos , Estructura Molecular , Unión Proteica
10.
Bioinformatics ; 31(18): 3035-42, 2015 Sep 15.
Artículo en Inglés | MEDLINE | ID: mdl-25971743

RESUMEN

MOTIVATION: G protein-coupled receptors (GPCRs) are probably the most attractive drug target membrane proteins, which constitute nearly half of drug targets in the contemporary drug discovery industry. While the majority of drug discovery studies employ existing GPCR and ligand interactions to identify new compounds, there remains a shortage of specific databases with precisely annotated GPCR-ligand associations. RESULTS: We have developed a new database, GLASS, which aims to provide a comprehensive, manually curated resource for experimentally validated GPCR-ligand associations. A new text-mining algorithm was proposed to collect GPCR-ligand interactions from the biomedical literature, which is then crosschecked with five primary pharmacological datasets, to enhance the coverage and accuracy of GPCR-ligand association data identifications. A special architecture has been designed to allow users for making homologous ligand search with flexible bioactivity parameters. The current database contains ∼500 000 unique entries, of which the vast majority stems from ligand associations with rhodopsin- and secretin-like receptors. The GLASS database should find its most useful application in various in silico GPCR screening and functional annotation studies. AVAILABILITY AND IMPLEMENTATION: The website of GLASS database is freely available at http://zhanglab.ccmb.med.umich.edu/GLASS/. CONTACT: zhng@umich.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Algoritmos , Bases de Datos de Proteínas , Receptores Acoplados a Proteínas G/química , Receptores Acoplados a Proteínas G/metabolismo , Minería de Datos , Humanos , Internet , Ligandos , Modelos Moleculares , Unión Proteica
11.
BMC Bioinformatics ; 16 Suppl 10: S5, 2015.
Artículo en Inglés | MEDLINE | ID: mdl-26201262

RESUMEN

BACKGROUND: Information regarding bacteria biotopes is important for several research areas including health sciences, microbiology, and food processing and preservation. One of the challenges for scientists in these domains is the huge amount of information buried in the text of electronic resources. Developing methods to automatically extract bacteria habitat relations from the text of these electronic resources is crucial for facilitating research in these areas. METHODS: We introduce a linguistically motivated rule-based approach for recognizing and normalizing names of bacteria habitats in biomedical text by using an ontology. Our approach is based on the shallow syntactic analysis of the text that include sentence segmentation, part-of-speech (POS) tagging, partial parsing, and lemmatization. In addition, we propose two methods for identifying bacteria habitat localization relations. The underlying assumption for the first method is that discourse changes with a new paragraph. Therefore, it operates on a paragraph-basis. The second method performs a more fine-grained analysis of the text and operates on a sentence-basis. We also develop a novel anaphora resolution method for bacteria coreferences and incorporate it with the sentence-based relation extraction approach. RESULTS: We participated in the Bacteria Biotope (BB) Task of the BioNLP Shared Task 2013. Our system (Boun) achieved the second best performance with 68% Slot Error Rate (SER) in Sub-task 1 (Entity Detection and Categorization), and ranked third with an F-score of 27% in Sub-task 2 (Localization Event Extraction). This paper reports the system that is implemented for the shared task, including the novel methods developed and the improvements obtained after the official evaluation. The extensions include the expansion of the OntoBiotope ontology using the training set for Sub-task 1, and the novel sentence-based relation extraction method incorporated with anaphora resolution for Sub-task 2. These extensions resulted in promising results for Sub-task 1 with a SER of 68%, and state-of-the-art performance for Sub-task 2 with an F-score of 53%. CONCLUSIONS: Our results show that a linguistically-oriented approach based on the shallow syntactic analysis of the text is as effective as machine learning approaches for the detection and ontology-based normalization of habitat entities. Furthermore, the newly developed sentence-based relation extraction system with the anaphora resolution module significantly outperforms the paragraph-based one, as well as the other systems that participated in the BB Shared Task 2013.


Asunto(s)
Bacterias/clasificación , Ecosistema , Microbiología Ambiental , Lingüística , Procesamiento de Lenguaje Natural , Programas Informáticos , Bacterias/genética , Humanos , Aprendizaje Automático
12.
Bioinformatics ; 29(10): 1357-8, 2013 May 15.
Artículo en Inglés | MEDLINE | ID: mdl-23515528

RESUMEN

SUMMARY: Knowledge of pathogen-host protein interactions is required to better understand infection mechanisms. The pathogen-host interaction search tool (PHISTO) is a web-accessible platform that provides relevant information about pathogen-host interactions (PHIs). It enables access to the most up-to-date PHI data for all pathogen types for which experimentally verified protein interactions with human are available. The platform also offers integrated tools for visualization of PHI networks, graph-theoretical analysis of targeted human proteins, BLAST search and text mining for detecting missing experimental methods. PHISTO will facilitate PHI studies that provide potential therapeutic targets for infectious diseases. AVAILABILITY: http://www.phisto.org. CONTACT: saliha.durmus@boun.edu.tr SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Asunto(s)
Minería de Datos , Interacciones Huésped-Patógeno , Motor de Búsqueda , Enfermedades Transmisibles , Bases de Datos de Proteínas , Humanos , Internet , Dominios y Motivos de Interacción de Proteínas
13.
Mol Inform ; 43(3): e202300249, 2024 Mar.
Artículo en Inglés | MEDLINE | ID: mdl-38196065

RESUMEN

Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf-idf weighting. The experiments on multiple protein-ligand affinity datasets show that despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of target to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.


Asunto(s)
Algoritmos , Proteínas , Humanos , Ligandos , Proteínas/química , Aprendizaje Automático , Estructura Molecular
14.
J Comput Biol ; 30(11): 1240-1245, 2023 11.
Artículo en Inglés | MEDLINE | ID: mdl-37988394

RESUMEN

Robust generalization of drug-target affinity (DTA) prediction models is a notoriously difficult problem in computational drug discovery. In this article, we present pydebiaseddta: a computational software for improving the generalizability of DTA prediction models to novel ligands and/or proteins. pydebiaseddta serves as the practical implementation of the DebiasedDTA training framework, which advocates modifying the training distribution to mitigate the effect of spurious correlations in the training data set that leads to substantially degraded performance for novel ligands and proteins. Written in Python programming language, pydebiaseddta combines a user-friendly streamlined interface with a feature-rich and highly modifiable architecture. With this article we introduce our software, showcase its main functionalities, and describe practical ways for new users to engage with it.


Asunto(s)
Lenguajes de Programación , Programas Informáticos , Proteínas , Descubrimiento de Drogas
15.
J Comput Biol ; 30(11): 1226-1239, 2023 11.
Artículo en Inglés | MEDLINE | ID: mdl-37988395

RESUMEN

Statistical models that accurately predict the binding affinity of an input ligand-protein pair can greatly accelerate drug discovery. Such models are trained on available ligand-protein interaction data sets, which may contain biases that lead the predictor models to learn data set-specific, spurious patterns instead of generalizable relationships. This leads the prediction performances of these models to drop dramatically for previously unseen biomolecules. Various approaches that aim to improve model generalizability either have limited applicability or introduce the risk of degrading overall prediction performance. In this article, we present DebiasedDTA, a novel training framework for drug-target affinity (DTA) prediction models that addresses data set biases to improve the generalizability of such models. DebiasedDTA relies on reweighting the training samples to achieve robust generalization, and is thus applicable to most DTA prediction models. Extensive experiments with different biomolecule representations, model architectures, and data sets demonstrate that DebiasedDTA achieves improved generalizability in predicting drug-target affinities.


Asunto(s)
Modelos Estadísticos , Proteínas , Ligandos , Proteínas/química , Descubrimiento de Drogas
16.
ArXiv ; 2023 Dec 13.
Artículo en Inglés | MEDLINE | ID: mdl-38764593

RESUMEN

Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the performance of PPI identification of multiple GPT and BERT models using three manually curated gold-standard corpora: Learning Language in Logic (LLL) with 164 PPIs in 77 sentences, Human Protein Reference Database with 163 PPIs in 145 sentences, and Interaction Extraction Performance Assessment with 335 PPIs in 486 sentences. BERT-based models achieved the best overall performance, with BioBERT achieving the highest recall (91.95%) and F1-score (86.84%) and PubMedBERT achieving the highest precision (85.25%). Interestingly, despite not being explicitly trained for biomedical texts, GPT-4 achieved commendable performance, comparable to the top-performing BERT models. It achieved a precision of 88.37%, a recall of 85.14%, and an F1-score of 86.49% on the LLL dataset. These results suggest that GPT models can effectively detect PPIs from text data, offering promising avenues for application in biomedical literature mining. Further research could explore how these models might be fine-tuned for even more specialized tasks within the biomedical domain.

17.
BMC Bioinformatics ; 12: 481, 2011 Dec 18.
Artículo en Inglés | MEDLINE | ID: mdl-22177292

RESUMEN

BACKGROUND: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. RESULTS: We have integrated nine event extraction systems in the U-Compare framework, making them intercompatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. CONCLUSIONS: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.


Asunto(s)
Minería de Datos , Sistemas de Computación , Publicaciones Periódicas como Asunto , Programas Informáticos
18.
Nucleic Acids Res ; 37(Database issue): D642-6, 2009 Jan.
Artículo en Inglés | MEDLINE | ID: mdl-18978014

RESUMEN

Molecular interaction data exists in a number of repositories, each with its own data format, molecule identifier and information coverage. Michigan molecular interactions (MiMI) assists scientists searching through this profusion of molecular interaction data. The original release of MiMI gathered data from well-known protein interaction databases, and deep merged this information while keeping track of provenance. Based on the feedback received from users, MiMI has been completely redesigned. This article describes the resulting MiMI Release 2 (MiMIr2). New functionality includes extension from proteins to genes and to pathways; identification of highlighted sentences in source publications; seamless two-way linkage with Cytoscape; query facilities based on MeSH/GO terms and other concepts; approximate graph matching to find relevant pathways; support for querying in bulk; and a user focus-group driven interface design. MiMI is part of the NIH's; National Center for Integrative Biomedical Informatics (NCIBI) and is publicly available at: http://mimi.ncibi.org.


Asunto(s)
Bases de Datos de Proteínas , Mapeo de Interacción de Proteínas , Proteínas/metabolismo , Gráficos por Computador , Proteínas/genética , Interfaz Usuario-Computador
19.
Biol Direct ; 16(1): 7, 2021 02 08.
Artículo en Inglés | MEDLINE | ID: mdl-33557857

RESUMEN

Cancer is a poligenetic disease with each cancer type having a different mutation profile. Genomic data can be utilized to detect these profiles and to diagnose and differentiate cancer types. Variant calling provide mutation information. Gene expression data reveal the altered cell behaviour. The combination of the mutation and expression information can lead to accurate discrimination of different cancer types. In this study, we utilized and transferred the information of existing mutations for a novel gene selection method for gene expression data. We tested the proposed method in order to diagnose and differentiate cancer types. It is a disease specific method as both the mutations and expressions are filtered according to the selected cancer types. Our experiment results show that the proposed gene selection method leads to similar or improved performance metrics compared to classical feature selection methods and curated gene sets.


Asunto(s)
Perfilación de la Expresión Génica/métodos , Genómica/estadística & datos numéricos , Aprendizaje Automático , Neoplasias/clasificación , Algoritmos , Neoplasias/genética
20.
Mol Inform ; 40(5): e2000212, 2021 05.
Artículo en Inglés | MEDLINE | ID: mdl-33225594

RESUMEN

Identification of high affinity drug-target interactions is a major research question in drug discovery. Proteins are generally represented by their structures or sequences. However, structures are available only for a small subset of biomolecules and sequence similarity is not always correlated with functional similarity. We propose ChemBoost, a chemical language based approach for affinity prediction using SMILES syntax. We hypothesize that SMILES is a codified language and ligands are documents composed of chemical words. These documents can be used to learn chemical word vectors that represent words in similar contexts with similar vectors. In ChemBoost, the ligands are represented via chemical word embeddings, while the proteins are represented through sequence-based features and/or chemical words of their ligands. Our aim is to process the patterns in SMILES as a language to predict protein-ligand affinity, even when we cannot infer the function from the sequence. We used eXtreme Gradient Boosting to predict protein-ligand affinities in KIBA and BindingDB data sets. ChemBoost was able to predict drug-target binding affinity as well as or better than state-of-the-art machine learning systems. When powered with ligand-centric representations, ChemBoost was more robust to the changes in protein sequence similarity and successfully captured the interactions between a protein and a ligand, even if the protein has low sequence similarity to the known targets of the ligand.


Asunto(s)
Descubrimiento de Drogas/métodos , Aprendizaje Automático , Unión Proteica , Biología Computacional/métodos , Química Computacional/métodos , Ligandos
SELECCIÓN DE REFERENCIAS
DETALLE DE LA BÚSQUEDA