Results 1 - 20 of 80
1.
Bioinformatics ; 40(8), 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39067036

ABSTRACT

MOTIVATION: Biomedical entity linking (BEL) is the task of grounding entity mentions to a given knowledge base (KB). Recently, neural name-based methods, i.e. systems that identify the most appropriate name in the KB for a given mention using a neural network (either via dense retrieval or autoregressive modeling), achieved remarkable results for the task without requiring manual tuning or the definition of domain/entity-specific rules. However, as name-based methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance for KBs where homonyms account for a large share of entity mentions (e.g. UMLS and NCBI Gene). RESULTS: We present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. BELHD builds upon the BioSyn model with two crucial extensions. First, it performs pre-processing of the KB, during which it expands homonyms with a specifically constructed disambiguating string, thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy that strengthens the overall training signal by including similar mentions from the same document as positive or negative examples, according to their corresponding KB identifier. Experiments with 10 corpora and 5 entity types show that BELHD improves upon current neural state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the prediction model and thus can also improve other neural methods, which we exemplify for GenBioEL, a generative name-based BEL approach. AVAILABILITY AND IMPLEMENTATION: The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belhd.


Subject(s)
Natural Language Processing , Neural Networks, Computer , Knowledge Bases , Algorithms , Unified Medical Language System , Humans , Computational Biology/methods
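
The KB pre-processing step described in the BELHD abstract above can be illustrated with a minimal sketch. The abstract does not say how the disambiguating string is constructed, so appending the KB identifier in parentheses is purely an assumption made here for illustration; the function name `disambiguate_homonyms` and the toy KB are hypothetical and are not taken from the BELHD code base.

```python
from collections import defaultdict


def disambiguate_homonyms(kb):
    """Expand homonymous KB names so that every name maps to one identifier.

    kb: list of (identifier, name) pairs.
    Assumption: the disambiguating string is simply the identifier in
    parentheses; the actual BELHD construction may differ.
    """
    ids_by_name = defaultdict(set)
    for identifier, name in kb:
        ids_by_name[name].add(identifier)

    expanded = []
    for identifier, name in kb:
        if len(ids_by_name[name]) > 1:
            # Homonym: the same name is shared by several identifiers.
            expanded.append((identifier, f"{name} ({identifier})"))
        else:
            expanded.append((identifier, name))
    return expanded


# Hypothetical toy KB: two entities share the name "TP53".
toy_kb = [("G1", "TP53"), ("G2", "TP53"), ("G3", "BRCA1")]
for identifier, name in disambiguate_homonyms(toy_kb):
    print(identifier, name)
```

After this step a name-based linker that returns a name also returns a unique identifier, which is the property the abstract refers to as "enforcing unique linking decisions".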
2.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38896539

ABSTRACT

BACKGROUND: Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets, as they offer reproducibility, dependability, and scalability of analyses by automatic parallelization on large compute clusters. However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution. Simultaneously, user-supporting tools are rare, and the number of available examples is much lower than in classical programming languages. RESULTS: To address these challenges, we investigate the efficiency of large language models (LLMs), specifically ChatGPT, to support users when dealing with scientific workflows. We performed 3 user studies in 2 scientific domains to evaluate ChatGPT for comprehending, adapting, and extending workflows. Our results indicate that LLMs efficiently interpret workflows but achieve lower performance for exchanging components or purposeful workflow extensions. We characterize their limitations in these challenging scenarios and suggest future research directions. CONCLUSIONS: Our results show a high accuracy for comprehending and explaining scientific workflows while achieving a reduced performance for modifying and extending workflow descriptions. These findings clearly illustrate the need for further research in this area.


Subject(s)
Workflow , Programming Languages , Software , Computational Biology/methods , Humans
3.
Bioinformatics ; 40(3), 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38383060

ABSTRACT

MOTIVATION: In precision oncology (PO), clinicians aim to find the best treatment for any patient based on their molecular characterization. A major bottleneck is the manual annotation and evaluation of individual variants, for which usually a range of knowledge bases are screened. To incorporate and integrate the vast information of different databases, fast and accurate methods for harmonizing databases with different types of information are necessary. An essential step for harmonization in PO includes the normalization of tumor entities as well as therapy options for patients. SUMMARY: preon is a fast and accurate library for the normalization of drug names and cancer types in large-scale data integration. AVAILABILITY AND IMPLEMENTATION: preon is implemented in Python and freely available via the PyPI repository. Source code and the data underlying this article are available in GitHub at https://github.com/ermshaua/preon/.


Subject(s)
Neoplasms , Humans , Neoplasms/drug therapy , Precision Medicine , Medical Oncology , Software , Databases, Factual
4.
Oral Oncol ; 149: 106678, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38219707

ABSTRACT

AIM: We aimed to evaluate the applicability of a customized NanoString panel for molecular subtyping of recurrent or metastatic head and neck squamous cell carcinoma (R/M-HNSCC). Additionally, histological analyses were conducted, correlated with the molecular subtypes and tested for their prognostic value. MATERIAL AND METHODS: We conducted molecular subtyping of R/M-HNSCC according to the molecular subtypes defined by Keck et al. For molecular analyses, a customized 231-gene NanoString panel (comprising the genes that most accurately define the subtypes, based on previous analyses) was applied to tumor samples from R/M-HNSCC patients who were treated in the CeFCiD trial (AIO/IAG-KHT trial 1108). A total of 130 samples from 95 patients were available for sequencing, of which 80 samples from 67 patients passed quality controls and were included in histological analyses. H&E-stained slides were evaluated regarding distinct morphological patterns (e.g. tumor budding, nuclear size, stroma content). RESULTS: Determination of molecular subtypes led to classification of tumor samples as basal (n = 46, 45%), inflamed/mesenchymal (n = 31, 30%) and classical (n = 26, 25%). Expression levels of Amphiregulin (AREG) were significantly higher for the basal and classical subtypes compared to the mesenchymal subtype. While molecular subtypes did not have an impact on survival, high levels of tumor budding were associated with poor outcomes. No correlation was found between molecular subtypes and histological characteristics. CONCLUSIONS: Utilizing the 231-gene NanoString panel, we were able to determine the molecular subtype of R/M-HNSCC samples using FFPE material. Its value for stratification across different treatment options remains to be explored in the future. The prognostic value of tumor budding was underscored in this clinically well-annotated cohort.


Subject(s)
Carcinoma, Squamous Cell , Head and Neck Neoplasms , Humans , Carcinoma, Squamous Cell/pathology , Head and Neck Neoplasms/genetics , Neoplasm Recurrence, Local/pathology , Prognosis , Squamous Cell Carcinoma of Head and Neck/genetics , Clinical Trials as Topic
5.
J Dent ; 141: 104796, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38072335

ABSTRACT

INTRODUCTION: Natural language processing (NLP) is an intersection between Computer Science and Linguistics which aims to enable machines to process and understand human language. Here we summarize applications and limitations of NLP in dentistry. DATA AND SOURCES: Narrative review. FINDINGS: NLP has evolved increasingly fast. For the dental domain, relevant NLP applications are text classification (e.g., symptom classification) and natural language generation and understanding (e.g., clinical chatbots assisting professionals in office work and patient communication). Analyzing large quantities of text will allow understanding diseases and their trajectories and support more precise and personalized care. Speech recognition systems may serve as virtual assistants and facilitate automated documentation. However, to date, NLP has rarely been applied in dentistry. Existing research focuses mainly on rule-based solutions for narrow tasks. Technologies such as Recurrent Neural Networks and Transformers have been shown to surpass the language processing capabilities of such rule-based solutions in many fields, but are data-hungry (i.e., rely on large amounts of training data), which limits their application in the dental domain at present. Technologies such as federated or transfer learning or data sharing concepts may help overcome this limitation, while challenges in terms of explainability, reproducibility, generalizability and evaluation of NLP in dentistry remain to be resolved to enable approval of such technologies in medical devices and services. CONCLUSIONS: NLP will become a cornerstone of a number of applications in dentistry. The community is called to action to address the current limitations and foster reliable, high-quality dental NLP.
CLINICAL SIGNIFICANCE: NLP for text classification (e.g., dental symptom classification) and language generation and understanding (e.g., clinical chatbots, speech recognition) will support administrative tasks in dentistry, provide deeper insights for clinicians and support research and education.


Subject(s)
Communication , Natural Language Processing , Humans , Reproducibility of Results , Dentistry
6.
Bioinformatics ; 39(11), 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37975879

ABSTRACT

MOTIVATION: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups, making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad-coverage KB UMLS, leaving their performance on more specialized KBs, e.g. for genes or variants, understudied. RESULTS: We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora, offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture, showing that neural approaches fail to perform consistently across entity types and highlighting the need for further studies towards entity-agnostic models. AVAILABILITY AND IMPLEMENTATION: The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.


Subject(s)
Benchmarking , Data Mining , Data Mining/methods , Software , Language , Natural Language Processing
7.
JAMA Netw Open ; 6(11): e2343689, 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37976064

ABSTRACT

Importance: Clinical interpretation of complex biomarkers for precision oncology currently requires manual investigations of previous studies and databases. Conversational large language models (LLMs) might be beneficial as automated tools for assisting clinical decision-making. Objective: To assess performance and define their role using 4 recent LLMs as support tools for precision oncology. Design, Setting, and Participants: This diagnostic study examined 10 fictional cases of patients with advanced cancer with genetic alterations. Each case was submitted to 4 different LLMs (ChatGPT, Galactica, Perplexity, and BioMedLM) and 1 expert physician to identify personalized treatment options in 2023. Treatment options were masked and presented to a molecular tumor board (MTB), whose members rated the likelihood of a treatment option coming from an LLM on a scale from 0 to 10 (0, extremely unlikely; 10, extremely likely) and decided whether the treatment option was clinically useful. Main Outcomes and Measures: Number of treatment options, precision, recall, F1 score of LLMs compared with human experts, recognizability, and usefulness of recommendations. Results: For 10 fictional cancer patients (4 with lung cancer, 6 with other; median [IQR] 3.5 [3.0-4.8] molecular alterations per patient), a median (IQR) number of 4.0 (4.0-4.0) compared with 3.0 (3.0-5.0), 7.5 (4.3-9.8), 11.5 (7.8-13.0), and 13.0 (11.3-21.5) treatment options each was identified by the human expert and 4 LLMs, respectively. When considering the expert as a criterion standard, LLM-proposed treatment options reached F1 scores of 0.04, 0.17, 0.14, and 0.19 across all patients combined. Combining treatment options from different LLMs allowed a precision of 0.29 and a recall of 0.29 for an F1 score of 0.29. LLM-generated treatment options were recognized as AI-generated with a median (IQR) 7.5 (5.3-9.0) points in contrast to 2.0 (1.0-3.0) points for manually annotated cases. 
A crucial reason for identifying AI-generated treatment options was insufficient accompanying evidence. For each patient, at least 1 LLM generated a treatment option that was considered helpful by MTB members. Two unique useful treatment options (including 1 unique treatment strategy) were identified only by LLM. Conclusions and Relevance: In this diagnostic study, treatment options of LLMs in precision oncology did not reach the quality and credibility of human experts; however, they generated helpful ideas that might have complemented established procedures. Considering technological progress, LLMs could play an increasingly important role in assisting with screening and selecting relevant biomedical literature to support evidence-based, personalized treatment decisions.


Subject(s)
Lung Neoplasms , Precision Medicine , Humans , Medical Oncology , Language , Communication
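
The precision, recall and F1 figures reported in the abstract above follow the standard set-based definitions, with the expert's treatment options as the criterion standard. The sketch below shows that arithmetic; the function names and the toy treatment sets are hypothetical and are not taken from the study.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(proposed, reference):
    """Compare a set of proposed options against a reference set."""
    proposed, reference = set(proposed), set(reference)
    true_positives = len(proposed & reference)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical example: 4 proposed options, 2 reference options, 1 shared.
p, r, f1 = precision_recall_f1(
    {"drug A", "drug B", "drug C", "drug D"},
    {"drug A", "drug E"},
)
print(p, r, round(f1, 3))
```

Because F1 is the harmonic mean of the two values, equal precision and recall give an F1 of the same value, which matches the combined-LLM figures quoted in the abstract (0.29, 0.29 and 0.29).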
8.
Bioinformatics ; 39(11), 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37950510

ABSTRACT

SUMMARY: Relation extraction (RE) from large text collections is an important tool for database curation, pathway reconstruction, or functional omics data analysis. In practice, RE often is part of a complex data analysis pipeline requiring specific adaptations like restricting the types of relations or the set of proteins to be considered. However, current systems are either non-programmable web sites or research code with fixed functionality. We present PEDL+, a user-friendly tool for extracting protein-protein and protein-chemical associations from PubMed articles. PEDL+ combines state-of-the-art NLP technology with adaptable ranking and filtering options and can easily be integrated into analysis pipelines. We evaluated PEDL+ in two pathway curation projects and found that 59% to 80% of its extractions were helpful. AVAILABILITY AND IMPLEMENTATION: PEDL+ is freely available at https://github.com/leonweber/pedl.


Subject(s)
Software , PubMed , Databases, Factual
9.
Cancers (Basel) ; 15(3), 2023 Feb 01.
Article in English | MEDLINE | ID: mdl-36765893

ABSTRACT

Pancreatic neuroendocrine neoplasms (panNENs) are a rare yet diverse type of neoplasia whose precise clinical-pathological classification is frequently challenging. Since incorrect classifications can affect treatment decisions, additional tools which support the diagnosis, such as machine learning (ML) techniques, are critically needed but generally unavailable due to the scarcity of suitable ML training data for rare panNENs. Here, we demonstrate that a multi-step ML framework predicts clinically relevant panNEN characteristics while being exclusively trained on widely available data of a healthy origin. The approach classifies panNENs by deconvolving their transcriptomes into cell type proportions based on shared gene expression profiles with healthy pancreatic cell types. The deconvolution results were found to provide a prognostic value with respect to the prediction of the overall patient survival time, neoplastic grading, and carcinoma versus tumor subclassification. The performance with which a proliferation rate agnostic deconvolution ML model could predict the clinical characteristics was found to be comparable to that of a comparative baseline model trained on the proliferation rate-informed MKI67 levels. The approach is novel in that it complements established proliferation rate-oriented classification schemes whose results can be reproduced and further refined by differentiating between identically graded subgroups. By including non-endocrine cell types, the deconvolution approach furthermore provides an in silico quantification of panNEN dedifferentiation, optimizing it for challenging clinical classification tasks in more aggressive panNEN subtypes.

10.
Database (Oxford) ; 2022, 2022 Nov 18.
Article in English | MEDLINE | ID: mdl-36399413

ABSTRACT

The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.


Subject(s)
Language , Proteins , Databases, Factual , Toxicogenetics
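
The abstract above reports that ensembling was crucial but does not state how the individual model outputs were combined. One common scheme, assumed here purely for illustration, is to average the class probabilities of all models and pick the class with the highest mean. The function name and the toy probabilities are hypothetical.

```python
def ensemble_average(prob_lists):
    """Average class probabilities over several models.

    prob_lists: one list of class probabilities per model, all of equal length.
    Returns (averaged probabilities, index of the winning class).
    Assumption: plain unweighted averaging; the DrugProt submission may have
    combined its models differently.
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    averaged = [
        sum(probs[c] for probs in prob_lists) / n_models
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return averaged, best


# Hypothetical outputs of three models for one chemical-protein pair
# (class 0 = no relation, class 1 = relation).
averaged, best = ensemble_average([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
print(averaged, best)
```

In this toy case one model prefers class 0, but the averaged vote selects class 1, which illustrates how an ensemble can smooth over the errors of a single model.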
11.
Database (Oxford) ; 2022, 2022 Jun 27.
Article in English | MEDLINE | ID: mdl-35758881

ABSTRACT

High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48-0.91 for entity detection and 0.71-0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg.


Subject(s)
Algorithms , Data Mining , DNA/genetics , Data Mining/methods , Databases, Factual , Humans , PubMed
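
The co-occurrence extraction mentioned in the RegEl abstract above can be sketched as a simple counting step over detected entities. The abstract does not specify the co-occurrence window, so counting within a single sentence is an assumption made here; the input layout, the function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(sentences):
    """Count gene / regulatory-element pairs that share a sentence.

    sentences: list of dicts with the keys "genes" and "elements", each a
    list of entity strings already produced by an entity detection model.
    Assumption: the co-occurrence window is one sentence.
    """
    counts = Counter()
    for sentence in sentences:
        # Sets avoid counting a pair twice within the same sentence.
        for pair in product(set(sentence["genes"]), set(sentence["elements"])):
            counts[pair] += 1
    return counts


# Hypothetical detected entities for three sentences.
toy_sentences = [
    {"genes": ["SOX9"], "elements": ["enhancer"]},
    {"genes": ["SOX9", "PDX1"], "elements": ["enhancer"]},
    {"genes": ["PDX1"], "elements": []},
]
for pair, n in sorted(count_cooccurrences(toy_sentences).items()):
    print(pair, n)
```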
12.
J Chem Theory Comput ; 18(7): 4408-4417, 2022 Jul 12.
Article in English | MEDLINE | ID: mdl-35671364

ABSTRACT

Machine learning (ML) approaches have demonstrated the ability to predict molecular spectra at a fraction of the computational cost of traditional theoretical chemistry methods while maintaining high accuracy. Graph neural networks (GNNs) are particularly promising in this regard, but different types of GNNs have not yet been systematically compared. In this work, we benchmark and analyze five different GNNs for the prediction of excitation spectra from the QM9 dataset of organic molecules. We compare the GNN performance in the obvious runtime measurements, prediction accuracy, and analysis of outliers in the test set. Moreover, through TMAP clustering and statistical analysis, we are able to highlight clear hotspots of high prediction errors as well as optimal spectra prediction for molecules with certain functional groups. This in-depth benchmarking and subsequent analysis protocol lays down a recipe for comparing different ML methods and evaluating dataset quality.


Subject(s)
Machine Learning , Neural Networks, Computer
13.
Genome Med ; 14(1): 24, 2022 Mar 01.
Article in English | MEDLINE | ID: mdl-35227293

ABSTRACT

BACKGROUND: Pancreatic neuroendocrine neoplasms (PanNENs) fall into two subclasses: the well-differentiated, low- to high-grade pancreatic neuroendocrine tumors (PanNETs), and the poorly-differentiated, high-grade pancreatic neuroendocrine carcinomas (PanNECs). While recent studies suggest an endocrine descent of PanNETs, the origin of PanNECs remains unknown. METHODS: We performed DNA methylation analysis for 57 PanNEN samples and found that distinct methylation profiles separated PanNENs into two major groups, clearly distinguishing high-grade PanNECs from other PanNETs including high-grade NETG3. DNA alterations and immunohistochemistry of cell-type markers PDX1, ARX, and SOX9 were utilized to further characterize PanNECs and their cell of origin in the pancreas. RESULTS: Phylo-epigenetic and cell-type signature features derived from alpha, beta, acinar, and ductal adult cells suggest an exocrine cell of origin for PanNECs, thus separating them in cell lineage from other PanNENs of endocrine origin. CONCLUSIONS: Our study provides a robust and clinically applicable method to clearly distinguish PanNECs from G3 PanNETs, improving patient stratification.


Subject(s)
Carcinoma, Neuroendocrine , Neuroendocrine Tumors , Pancreatic Neoplasms , Adult , Carcinoma, Neuroendocrine/genetics , Carcinoma, Neuroendocrine/pathology , DNA Methylation , Humans , Neoplasm Grading , Neuroendocrine Tumors/genetics , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/pathology
14.
Datenbank Spektrum ; 21(3): 255-260, 2021.
Article in English | MEDLINE | ID: mdl-34786019

ABSTRACT

Today's scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating research is the reduction of development, adaptation, and maintenance times of DAWs. We describe the design and setup of the Collaborative Research Center (CRC) 1404 "FONDA -- Foundations of Workflows for Large-Scale Scientific Data Analysis", in which roughly 50 researchers jointly investigate new technologies, algorithms, and models to increase the portability, adaptability, and dependability of DAWs executed over distributed infrastructures. We describe the motivation behind our project, explain its underlying core concepts, introduce FONDA's internal structure, and sketch our vision for the future of workflow-based scientific data analysis. We also describe some lessons learned during the "making of" a CRC in Computer Science with strong interdisciplinary components, with the aim to foster similar endeavors.

15.
Cancers (Basel) ; 13(17), 2021 Sep 04.
Article in English | MEDLINE | ID: mdl-34503273

ABSTRACT

BACKGROUND: The clinical management of high-grade gastroenteropancreatic neuroendocrine neoplasms (GEP-NEN) is challenging due to disease heterogeneity, illustrating the need for reliable biomarkers facilitating patient stratification and guiding treatment decisions. FMS-like tyrosine kinase 3 ligand (Flt3L) is emerging as a prognostic or predictive surrogate marker of host tumoral immune response and might enable the stratification of patients with otherwise comparable tumor features. METHODS: We evaluated Flt3L gene expression in tumor tissue as well as circulating Flt3L levels as potential biomarkers in a cohort of 54 patients with GEP-NEN. RESULTS: We detected a prominent induction of Flt3L gene expression in individual G2 and G3 NEN, but not in G1 neuroendocrine tumors (NET). Flt3L mRNA expression levels in tumor tissue predicted the disease-related survival of patients with highly proliferative G2 and G3 NEN more accurately than the conventional criteria of grading or NEC/NET differentiation. High-level Flt3L mRNA expression was associated with the increased expression of genes related to immunogenic cell death, lymphocyte effector function and dendritic cell maturation, suggesting a less tolerogenic (more proinflammatory) phenotype of tumors with Flt3L induction. Importantly, circulating levels of Flt3L were also elevated in high-grade NEN and correlated with patients' progression-free and disease-related survival, thereby reflecting the results observed in tumor tissue. CONCLUSIONS: We propose Flt3L as a prognostic biomarker for high-grade GEP-NEN, harnessing its potential as a marker of an inflammatory tumor microenvironment. Flt3L measurements in serum, which can easily be incorporated into clinical routine, should be further evaluated to guide patient stratification and treatment decisions.

16.
JAMIA Open ; 4(2): ooab025, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33898938

ABSTRACT

OBJECTIVE: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnoses, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on Information Extraction (IE) from German medical texts. MATERIALS AND METHODS: BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research. RESULTS: The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences, and keep 25% as a held-out dataset for unbiased evaluation of future IE tools. On this held-out dataset, our baselines reach, depending on the specific entity type, F1-scores of 0.72-0.90 for named entity recognition, 0.10-0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection. DISCUSSION: Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important. CONCLUSION: To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English.

17.
Bioinformatics ; 37(17): 2792-2794, 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-33508086

ABSTRACT

SUMMARY: Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate and be robust toward variations in text genre and style. We present HunFlair, a NER tagger fulfilling these requirements. HunFlair is integrated into the widely used NLP framework Flair, recognizes five biomedical entity types, matches or exceeds state-of-the-art performance on a wide set of evaluation corpora, and is trained in a cross-corpus setting to avoid corpus-specific bias. Technically, it uses a character-level language model pretrained on roughly 24 million biomedical abstracts and three million full texts. It outperforms other off-the-shelf biomedical NER tools with an average gain of 7.26 pp over the next best tool in a cross-corpus setting and achieves on-par results with state-of-the-art research prototypes in in-corpus experiments. HunFlair can be installed with a single command and is applied with only four lines of code. Furthermore, it is accompanied by harmonized versions of 23 biomedical NER corpora. AVAILABILITY AND IMPLEMENTATION: HunFlair is freely available through the Flair NLP framework (https://github.com/flairNLP/flair) under an MIT license and is compatible with all major operating systems. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

18.
Bioinformatics ; 37(2): 236-242, 2021 Apr 19.
Article in English | MEDLINE | ID: mdl-32726411

ABSTRACT

MOTIVATION: The automatic extraction of published relationships between molecular entities has important applications in many biomedical fields, ranging from Systems Biology to Personalized Medicine. Existing works focused on extracting relationships described in single articles or in single sentences. However, a single record is rarely sufficient to judge the biological correctness of a relation, as experimental evidence might be weak or only valid in a certain context. Furthermore, statements may be more speculative than confirmative, and different articles often contradict each other. Experts therefore always take the complete literature into account to make a reliable decision about a relationship. It is an open research question how to do this effectively in an automatic manner. RESULTS: We propose two novel relation extraction approaches which use recent representation learning techniques to create comprehensive models of biomedical entities or entity pairs, respectively. These representations are learned by considering all publications from PubMed mentioning an entity or a pair. They are used as input for a neural network for classifying relations globally, i.e. the derived predictions are corpus-based, not sentence- or article-based as in prior art. Experiments on the extraction of mutation-disease, drug-disease and drug-drug relationships show that the learned embeddings indeed capture semantic information of the entities under study and outperform traditional methods by 4-29% regarding F1 score. AVAILABILITY AND IMPLEMENTATION: Source codes are available at: https://github.com/mariosaenger/bio-re-with-entity-embeddings. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , Software , Data Mining , PubMed , Publications , Semantics
19.
Med Genet ; 33(2): 167-177, 2021 Jun.
Article in English | MEDLINE | ID: mdl-38836022

ABSTRACT

High-throughput technologies have led to a continuously growing amount of information about regulatory features in the genome. A wealth of data generated by large international research consortia is available from online databases. Disease-driven studies provide details on specific DNA elements or epigenetic modifications regulating gene expression in specific cellular and developmental contexts, but these results are usually only published in scientific articles. All this information can be helpful in interpreting variants in the regulatory genome. This review describes a selection of high-profile data sources providing information on the non-coding genome, as well as pitfalls and techniques to search and capture information from the literature.

20.
Bioinformatics ; 36(Suppl_1): i490-i498, 2020 Jul 01.
Article in English | MEDLINE | ID: mdl-32657389

ABSTRACT

MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Language , Proteins , Publications , Research Design
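
The distant supervision idea named in the PEDL abstract above can be sketched in its textbook form: a sentence mentioning two proteins is labelled positive if that pair is already listed in a pathway database, and negative otherwise. This is a generic illustration under that assumption, not the labelling procedure of the PEDL repository; the function name and the toy data are hypothetical.

```python
from itertools import combinations


def distant_labels(sentences, known_pairs):
    """Create noisy training examples without manual annotation.

    sentences: list of (text, proteins) tuples, where proteins is a list of
    protein names mentioned in the text.
    known_pairs: protein pairs already recorded in a pathway database.
    Assumption: pair order does not matter and any sentence mentioning a
    known pair counts as a positive example.
    """
    known = {frozenset(pair) for pair in known_pairs}
    examples = []
    for text, proteins in sentences:
        for a, b in combinations(sorted(set(proteins)), 2):
            label = 1 if frozenset((a, b)) in known else 0
            examples.append((text, a, b, label))
    return examples


# Hypothetical sentences and a one-entry "database".
toy_sentences = [
    ("A binds B.", ["A", "B"]),
    ("A and C were measured.", ["A", "C"]),
]
for example in distant_labels(toy_sentences, [("B", "A")]):
    print(example)
```

Labels produced this way are noisy, since a sentence can mention a known pair without stating the association, but they are cheap to generate at scale, which is the trade-off the abstract describes.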