Results 1 - 8 of 8
1.
Bioinformatics ; 39(11), 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37950510

ABSTRACT

SUMMARY: Relation extraction (RE) from large text collections is an important tool for database curation, pathway reconstruction, or functional omics data analysis. In practice, RE often is part of a complex data analysis pipeline requiring specific adaptations like restricting the types of relations or the set of proteins to be considered. However, current systems are either non-programmable web sites or research code with fixed functionality. We present PEDL+, a user-friendly tool for extracting protein-protein and protein-chemical associations from PubMed articles. PEDL+ combines state-of-the-art NLP technology with adaptable ranking and filtering options and can easily be integrated into analysis pipelines. We evaluated PEDL+ in two pathway curation projects and found that 59% to 80% of its extractions were helpful. AVAILABILITY AND IMPLEMENTATION: PEDL+ is freely available at https://github.com/leonweber/pedl.


Subjects
Software, PubMed, Factual Databases
2.
Bioinformatics ; 37(17): 2792-2794, 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-33508086

ABSTRACT

SUMMARY: Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate and be robust toward variations in text genre and style. We present HunFlair, a NER tagger fulfilling these requirements. HunFlair is integrated into the widely used NLP framework Flair, recognizes five biomedical entity types, reaches or exceeds state-of-the-art performance on a wide set of evaluation corpora, and is trained in a cross-corpus setting to avoid corpus-specific bias. Technically, it uses a character-level language model pretrained on roughly 24 million biomedical abstracts and three million full texts. It outperforms other off-the-shelf biomedical NER tools with an average gain of 7.26 pp over the next best tool in a cross-corpus setting and achieves on-par results with state-of-the-art research prototypes in in-corpus experiments. HunFlair can be installed with a single command and is applied with only four lines of code. Furthermore, it is accompanied by harmonized versions of 23 biomedical NER corpora. AVAILABILITY AND IMPLEMENTATION: HunFlair is freely available through the Flair NLP framework (https://github.com/flairNLP/flair) under an MIT license and is compatible with all major operating systems. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

3.
Bioinformatics ; 36(1): 295-302, 2020 Jan 01.
Article in English | MEDLINE | ID: mdl-31243432

ABSTRACT

MOTIVATION: Several recent studies showed that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of improvements crucially depends on the availability of sufficiently large training corpora, which is a problem in the biomedical domain with its often rather small gold standard corpora. RESULTS: We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focusing on a particular corpus. Experiments were performed using 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 pp compared to learning without pretraining. We experimented both with supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER incorporating fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5-13 pp on the entity types chemicals, species and genes. AVAILABILITY AND IMPLEMENTATION: HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology, Neural Networks (Computer), Computational Biology/methods, Data Analysis, Software
4.
Bioinformatics ; 36(Suppl_1): i490-i498, 2020 Jul 01.
Article in English | MEDLINE | ID: mdl-32657389

ABSTRACT

MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
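
The idea of distant supervision mentioned in this abstract can be sketched as follows. This is a minimal illustration of the general technique, not PEDL's actual pipeline: the knowledge base `KNOWN_PAIRS`, the helper `distant_labels`, and the example sentences are all hypothetical. Every sentence mentioning a protein pair that the knowledge base lists is treated as a (noisy) positive example, without any manual annotation.

```python
from itertools import combinations

# Hypothetical knowledge base of known protein-protein associations.
# Pairs are stored as frozensets so that order does not matter.
KNOWN_PAIRS = {frozenset({"TP53", "MDM2"}), frozenset({"RAF1", "MAP2K1"})}


def distant_labels(sentences):
    """Create noisy training examples from (text, protein mentions) tuples.

    A pair co-occurring in a sentence is labelled 1 if the knowledge base
    contains it and 0 otherwise. Labels are noisy by construction: a
    sentence may mention both proteins without stating the association.
    """
    examples = []
    for text, proteins in sentences:
        for a, b in combinations(sorted(set(proteins)), 2):
            label = int(frozenset({a, b}) in KNOWN_PAIRS)
            examples.append((text, a, b, label))
    return examples


# Illustrative sentences with pre-tagged protein mentions (assumed input).
corpus = [
    ("MDM2 binds TP53 and promotes its degradation.", ["MDM2", "TP53"]),
    ("TP53 and RAF1 were both measured.", ["TP53", "RAF1"]),
]
examples = distant_labels(corpus)
for example in examples:
    print(example)
```

Because labels come from a lookup rather than from annotators, the amount of training data scales with the size of the text collection, which is consistent with the abstract's remark about having an order of magnitude more training data.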


Subjects
Language, Proteins, Publications, Research Design
5.
Bioinformatics ; 33(14): i37-i48, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881963

ABSTRACT

MOTIVATION: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. RESULTS: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. AVAILABILITY AND IMPLEMENTATION: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/ . CONTACT: habibima@informatik.hu-berlin.de.
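
For readers unfamiliar with the F1-score and recall figures reported here, the following sketch shows how entity-level precision, recall and F1 are commonly computed. It assumes an exact-match criterion (same start offset, end offset and type), which the abstract does not specify; the function name `prf1` and the example annotations are illustrative only.

```python
def prf1(gold, predicted):
    """Entity-level precision, recall and F1 under exact matching.

    Each entity is a (start, end, type) tuple; a prediction counts as a
    true positive only if all three fields match a gold entity.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative annotations: the third prediction has a wrong end offset
# and the fourth gold entity is missed entirely.
gold = {(0, 4, "Gene"), (10, 17, "Chemical"), (25, 31, "Disease"), (40, 44, "Gene")}
pred = {(0, 4, "Gene"), (10, 17, "Chemical"), (25, 30, "Disease")}

p, r, f = prf1(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Since F1 is the harmonic mean of precision and recall, a tagger that finds more of the gold entities raises F1 through recall even if its precision stays unchanged.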


Subjects
Data Mining/methods, Machine Learning, Animals, Humans, Mice, Software
6.
J Chem Theory Comput ; 18(7): 4408-4417, 2022 Jul 12.
Article in English | MEDLINE | ID: mdl-35671364

ABSTRACT

Machine learning (ML) approaches have demonstrated the ability to predict molecular spectra at a fraction of the computational cost of traditional theoretical chemistry methods while maintaining high accuracy. Graph neural networks (GNNs) are particularly promising in this regard, but different types of GNNs have not yet been systematically compared. In this work, we benchmark and analyze five different GNNs for the prediction of excitation spectra from the QM9 dataset of organic molecules. We compare the GNNs in terms of runtime, prediction accuracy, and an analysis of outliers in the test set. Moreover, through TMAP clustering and statistical analysis, we are able to highlight clear hotspots of high prediction errors as well as optimal spectra prediction for molecules with certain functional groups. This in-depth benchmarking and subsequent analysis protocol lays down a recipe for comparing different ML methods and evaluating dataset quality.


Subjects
Machine Learning, Neural Networks (Computer)
7.
Database (Oxford) ; 2022, 2022 Nov 18.
Article in English | MEDLINE | ID: mdl-36399413

ABSTRACT

The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.
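
The ensembling step described in this abstract can be illustrated with a small sketch. The abstract does not say how the ensemble members are combined, so averaging the per-class probabilities is an assumption here; the function `ensemble_predict`, the label names and the probability values are likewise illustrative.

```python
def ensemble_predict(prob_lists):
    """Average class probabilities over models and return the top class.

    prob_lists holds one probability distribution per model, all for the
    same candidate chemical-protein pair. Averaging is an assumed
    combination rule, not necessarily the one used by the authors.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best, avg


# Illustrative relation labels and outputs of three hypothetical models.
LABELS = ["NONE", "INHIBITOR", "ACTIVATOR"]
model_probs = [
    [0.5, 0.4, 0.1],  # model 1 prefers NONE
    [0.2, 0.7, 0.1],  # model 2 prefers INHIBITOR
    [0.3, 0.6, 0.1],  # model 3 prefers INHIBITOR
]

best, avg = ensemble_predict(model_probs)
print(LABELS[best], [round(a, 3) for a in avg])
```

Averaging lets confident members outvote a single dissenting model, which is one common reason ensembles are more stable than any individual run.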


Subjects
Language, Proteins, Factual Databases, Toxicogenetics
8.
Annu Int Conf IEEE Eng Med Biol Soc ; 2019: 6677-6680, 2019 Jul.
Article in English | MEDLINE | ID: mdl-31947373

ABSTRACT

Sounds caused by the action of the heart reflect both its health and its deficiencies and have been examined by physicians since antiquity. Pathologies of the valves (e.g. insufficiencies and stenosis), cardiac effusion, arrhythmia, inflammation of the surrounding tissue and other diagnoses can be identified by experienced physicians. However, practice is needed to assess the findings correctly. Furthermore, stethoscopes do not allow for long-term monitoring of a patient. Recently, radar technology has shown the ability to perform continuous, touchless and thereby burden-free heart sound measurements. In order to perform automated classification of the signals, the first and most important step is to segment the heart sounds into their physiological phases. This paper examines the use of different Long Short-Term Memory (LSTM) architectures for this purpose based on a large dataset of radar-recorded heart sounds gathered from 30 different test persons in a clinical study. The best-performing network, a bidirectional LSTM, achieves a sample-wise accuracy of 93.4% and an F1 score for the first heart sound of 95.8%.


Subjects
Heart Sounds, Stethoscopes, Cardiac Arrhythmias, Heart, Humans, Radar