Search | BVS Regional Portal

Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations.

Miranda-Escalada, Antonio; Mehryary, Farrokh; Luoma, Jouni; Estrada-Zavala, Darryl; Gasco, Luis; Pyysalo, Sampo; Valencia, Alfonso; Krallinger, Martin.

Database (Oxford) ; 2023, 2023 11 28.

Article in English | MEDLINE | ID: mdl-38015956

ABSTRACT

It is getting increasingly challenging to efficiently exploit drug-related information described in the growing amount of scientific literature. Indeed, for drug-gene/protein interactions, the challenge is even bigger, considering the scattered information sources and types of interactions. However, their systematic, large-scale exploitation is key for developing tools, impacting knowledge fields as diverse as drug design or metabolic pathway research. Previous efforts in the extraction of drug-gene/protein interactions from the literature did not address these scalability and granularity issues. To tackle them, we have organized the DrugProt track at BioCreative VII. In the context of the track, we have released the DrugProt Gold Standard corpus, a collection of 5000 PubMed abstracts, manually annotated with granular drug-gene/protein interactions. We have proposed a novel large-scale track to evaluate the capacity of natural language processing systems to scale to the range of millions of documents, and generate with their predictions a silver standard knowledge graph of 53 993 602 nodes and 19 367 406 edges. Its use exceeds the shared task and points toward pharmacological and biological applications such as drug discovery or continuous database curation. Finally, we have created a persistent evaluation scenario on CodaLab to continuously evaluate new relation extraction systems that may arise. Thirty teams from four continents, which involved 110 people, sent 107 submission runs for the Main DrugProt track, and nine teams submitted 21 runs for the Large Scale DrugProt track. Most participants implemented deep learning approaches based on pretrained transformer-like language models (LMs) such as BERT or BioBERT, reaching precision and recall values as high as 0.9167 and 0.9542 for some relation types. 
Finally, some initial explorations of the applicability of the knowledge graph have shown its potential to explore the chemical-protein relations described in the literature, or chemical compound-enzyme interactions. Database URL: https://doi.org/10.5281/zenodo.4955410.

Subjects

Data Mining ; Pattern Recognition, Automated ; Humans ; Databases, Factual ; Data Mining/methods ; Proteins/metabolism

The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest.

Szklarczyk, Damian; Kirsch, Rebecca; Koutrouli, Mikaela; Nastou, Katerina; Mehryary, Farrokh; Hachilif, Radja; Gable, Annika L; Fang, Tao; Doncheva, Nadezhda T; Pyysalo, Sampo; Bork, Peer; Jensen, Lars J; von Mering, Christian.

Nucleic Acids Res ; 51(D1): D638-D646, 2023 01 06.

Article in English | MEDLINE | ID: mdl-36370105

ABSTRACT

Much of the complexity within cells arises from functional and regulatory interactions among proteins. The core of these interactions is increasingly known, but novel interactions continue to be discovered, and the information remains scattered across different database resources, experimental modalities and levels of mechanistic detail. The STRING database (https://string-db.org/) systematically collects and integrates protein-protein interactions, both physical interactions and functional associations. The data originate from a number of sources: automated text mining of the scientific literature, computational interaction predictions from co-expression, conserved genomic context, databases of interaction experiments and known complexes/pathways from curated sources. All of these interactions are critically assessed, scored, and subsequently automatically transferred to less well-studied organisms using hierarchical orthology information. The data can be accessed via the website, but also programmatically and via bulk downloads. The most recent developments in STRING (version 12.0) are: (i) it is now possible to create, browse and analyze a full interaction network for any novel genome of interest, by submitting its complement of encoded proteins; (ii) the co-expression channel now uses variational auto-encoders to predict interactions, and it covers two new sources, single-cell RNA-seq and experimental proteomics data; and (iii) the confidence in each experimentally derived interaction is now estimated based on the detection method used, and communicated to the user in the web interface. Furthermore, STRING continues to enhance its facilities for functional enrichment analysis, which are now also fully available for user-submitted genomes.

Subjects

Protein Interaction Mapping ; Proteins ; Protein Interaction Mapping/methods ; Databases, Protein ; Proteins/genetics ; Proteins/metabolism ; Genomics ; Proteomics ; User-Computer Interface

Neural Network and Random Forest Models in Protein Function Prediction.

Hakala, Kai; Kaewphan, Suwisa; Bjorne, Jari; Mehryary, Farrokh; Moen, Hans; Tolvanen, Martti; Salakoski, Tapio; Ginter, Filip.

IEEE/ACM Trans Comput Biol Bioinform ; 19(3): 1772-1781, 2022.

Article in English | MEDLINE | ID: mdl-33306472

ABSTRACT

Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system automatically assigning Gene Ontology (GO) terms to the given input protein sequence. We develop an ensemble system which combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with a NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the top 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models, together with the data pre-processing and feature generation tools, are publicly available as open-source software at https://github.com/TurkuNLP/CAFA3.

Subjects

Neural Networks, Computer ; Proteins ; Databases, Protein ; Proteins/chemistry ; Sequence Alignment ; Software

Potent pairing: ensemble of long short-term memory networks and support vector machine for chemical-protein relation extraction.

Mehryary, Farrokh; Björne, Jari; Salakoski, Tapio; Ginter, Filip.

Database (Oxford) ; 2018, 2018 01 01.

Article in English | MEDLINE | ID: mdl-30576487

ABSTRACT

Biomedical researchers regularly discover new interactions between chemical compounds/drugs and genes/proteins, and report them in research literature. Having knowledge about these interactions is crucially important in many research areas such as precision medicine and drug discovery. The BioCreative VI Task 5 (CHEMPROT) challenge promotes the development and evaluation of computer systems that can automatically recognize and extract statements of such interactions from biomedical literature. We participated in this challenge with a Support Vector Machine (SVM) system and a deep learning-based system (ST-ANN), and achieved an F-score of 60.99 for the task. After the shared task, we have significantly improved the performance of the ST-ANN system. Additionally, we have developed a new deep learning-based system (I-ANN) that considerably outperforms the ST-ANN system. Both ST-ANN and I-ANN systems are centered around training an ensemble of artificial neural networks and utilizing different bidirectional Long Short-Term Memory (LSTM) chains for representing the shortest dependency path and/or the full sentence. By combining the predictions of the SVM and the I-ANN systems, we achieved an F-score of 63.10 for the task, improving our previous F-score by 2.11 percentage points. Our systems are fully open-source and publicly available. We highlight that the systems we present in this study are not applicable only to the BioCreative VI Task 5, but can be effortlessly re-trained to extract any types of relations of interest, with no modifications of the source code required, if a manually annotated corpus is provided as training data in a specific file format.

Subjects

Drug Discovery/methods ; Neural Networks, Computer ; Pharmaceutical Preparations ; Proteins ; Support Vector Machine ; Data Mining ; Databases, Chemical ; Databases, Protein ; Deep Learning ; Pharmaceutical Preparations/chemistry ; Pharmaceutical Preparations/metabolism ; Protein Binding ; Proteins/chemistry ; Proteins/metabolism

Data and systems for medication-related text classification and concept normalization from Twitter: insights from the Social Media Mining for Health (SMM4H)-2017 shared task.

Sarker, Abeed; Belousov, Maksim; Friedrichs, Jasper; Hakala, Kai; Kiritchenko, Svetlana; Mehryary, Farrokh; Han, Sifei; Tran, Tung; Rios, Anthony; Kavuluru, Ramakanth; de Bruijn, Berry; Ginter, Filip; Mahata, Debanjan; Mohammad, Saif M; Nenadic, Goran; Gonzalez-Hernandez, Graciela.

J Am Med Inform Assoc ; 25(10): 1274-1283, 2018 10 01.

Article in English | MEDLINE | ID: mdl-30272184

ABSTRACT

Objective: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. Materials and Methods: We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3); and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Results: Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Discussion: Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Conclusions: Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (http://dx.doi.org/10.17632/rxwfb3tysd.1).

Subjects

Drug-Related Side Effects and Adverse Reactions/classification ; Natural Language Processing ; Neural Networks, Computer ; Social Media/classification ; Support Vector Machine ; Data Mining/methods ; Humans ; Pharmacovigilance

An expanded evaluation of protein function prediction methods shows an improvement in accuracy.

Jiang, Yuxiang; Oron, Tal Ronnen; Clark, Wyatt T; Bankapur, Asma R; D'Andrea, Daniel; Lepore, Rosalba; Funk, Christopher S; Kahanda, Indika; Verspoor, Karin M; Ben-Hur, Asa; Koo, Da Chen Emily; Penfold-Brown, Duncan; Shasha, Dennis; Youngs, Noah; Bonneau, Richard; Lin, Alexandra; Sahraeian, Sayed M E; Martelli, Pier Luigi; Profiti, Giuseppe; Casadio, Rita; Cao, Renzhi; Zhong, Zhaolong; Cheng, Jianlin; Altenhoff, Adrian; Skunca, Nives; Dessimoz, Christophe; Dogan, Tunca; Hakala, Kai; Kaewphan, Suwisa; Mehryary, Farrokh; Salakoski, Tapio; Ginter, Filip; Fang, Hai; Smithers, Ben; Oates, Matt; Gough, Julian; Törönen, Petri; Koskinen, Patrik; Holm, Liisa; Chen, Ching-Tai; Hsu, Wen-Lian; Bryson, Kevin; Cozzetto, Domenico; Minneci, Federico; Jones, David T; Chapman, Samuel; Bkc, Dukka; Khan, Ishita K; Kihara, Daisuke; Ofer, Dan.

Genome Biol ; 17(1): 184, 2016 09 07.

Article in English | MEDLINE | ID: mdl-27604469

ABSTRACT

BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.

Subjects

Computational Biology ; Proteins/chemistry ; Software ; Structure-Activity Relationship ; Algorithms ; Databases, Protein ; Gene Ontology ; Humans ; Molecular Sequence Annotation ; Proteins/genetics

Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification.

Mehryary, Farrokh; Kaewphan, Suwisa; Hakala, Kai; Ginter, Filip.

J Biomed Semantics ; 7: 27, 2016.

Article in English | MEDLINE | ID: mdl-27175227

ABSTRACT

BACKGROUND: Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task. Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction. METHODS: Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of trigger word classification produced by the unsupervised clustering method and manual annotation. RESULTS: The method is evaluated on the official test set of BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of the state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not solely bound to the EVEX resource and can thus be used to improve the quality of any event extraction system or database.
AVAILABILITY: The data and source code for this work are available at: http://bionlp-www.utu.fi/trigger-clustering/.

Subjects

Medical Informatics/methods ; Natural Language Processing ; Supervised Machine Learning ; Unsupervised Machine Learning ; Data Mining ; Databases, Factual
