Results 1 - 20 of 77
1.
Bioinformatics ; 40(3), 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38383060

ABSTRACT

MOTIVATION: In precision oncology (PO), clinicians aim to find the best treatment for any patient based on their molecular characterization. A major bottleneck is the manual annotation and evaluation of individual variants, for which usually a range of knowledge bases are screened. To incorporate and integrate the vast information of different databases, fast and accurate methods for harmonizing databases with different types of information are necessary. An essential step for harmonization in PO includes the normalization of tumor entities as well as therapy options for patients. SUMMARY: preon is a fast and accurate library for the normalization of drug names and cancer types in large-scale data integration. AVAILABILITY AND IMPLEMENTATION: preon is implemented in Python and freely available via the PyPI repository. Source code and the data underlying this article are available in GitHub at https://github.com/ermshaua/preon/.


Subjects
Neoplasms , Humans , Neoplasms/drug therapy , Precision Medicine , Medical Oncology , Software , Databases, Factual
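The normalization step described in the abstract above (mapping free-text drug names or cancer types to reference identifiers) can be illustrated with a minimal sketch. This is not preon's API or algorithm; it is a generic exact-then-fuzzy lookup built on the Python standard library, and the vocabulary, the identifiers, and the `cutoff` value are all hypothetical.

```python
import difflib
import re


def simplify(name: str) -> str:
    """Lowercase a name and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def normalize(query: str, vocabulary: dict, cutoff: float = 0.85):
    """Map a free-text name to an identifier.

    vocabulary maps reference names to identifiers (hypothetical IDs here).
    Returns (identifier, "exact" | "fuzzy") or (None, "none").
    """
    index = {simplify(name): ident for name, ident in vocabulary.items()}
    key = simplify(query)
    if key in index:
        return index[key], "exact"
    # Fall back to the single closest reference name above the similarity cutoff.
    close = difflib.get_close_matches(key, list(index), n=1, cutoff=cutoff)
    if close:
        return index[close[0]], "fuzzy"
    return None, "none"


# Hypothetical toy vocabulary; a real one would come from a curated drug database.
DRUGS = {"Imatinib": "DRUG:1", "Vemurafenib": "DRUG:2", "Trastuzumab": "DRUG:3"}
```

Returning the match type alongside the identifier lets a downstream pipeline route fuzzy matches to manual review, which may matter when the result feeds clinical annotation.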
2.
Oral Oncol ; 149: 106678, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38219707

ABSTRACT

AIM: We aimed to evaluate the applicability of a customized NanoString panel for molecular subtyping of recurrent or metastatic head and neck squamous cell carcinoma (R/M-HNSCC). Additionally, histological analyses were conducted, correlated with the molecular subtypes and tested for their prognostic value. MATERIAL AND METHODS: We conducted molecular subtyping of R/M-HNSCC according to the molecular subtypes defined by Keck et al. For molecular analyses a 231 gene customized NanoString panel (the most accurately subtype defining genes, based on previous analyses) was applied to tumor samples from R/M-HNSCC patients that were treated in the CeFCiD trial (AIO/IAG-KHT trial 1108). A total of 130 samples from 95 patients were available for sequencing, of which 80 samples from 67 patients passed quality controls and were included in histological analyses. H&E stained slides were evaluated regarding distinct morphological patterns (e.g. tumor budding, nuclear size, stroma content). RESULTS: Determination of molecular subtypes led to classification of tumor samples as basal (n = 46, 45 %), inflamed/mesenchymal (n = 31, 30 %) and classical (n = 26, 25 %). Expression levels of Amphiregulin (AREG) were significantly higher for the basal and classical subtypes compared to the mesenchymal subtype. While molecular subtypes did not have an impact on survival, high levels of tumor budding were associated with poor outcomes. No correlation was found between molecular subtypes and histological characteristics. CONCLUSIONS: Utilizing the 231-gene NanoString panel we were able to determine the molecular subtype of R/M-HNSCC samples by the use of FFPE material. The value to stratify for different treatment options remains to be explored in the future. The prognostic value of tumor budding was underscored in this clinically well annotated cohort.


Subjects
Carcinoma, Squamous Cell , Head and Neck Neoplasms , Humans , Carcinoma, Squamous Cell/pathology , Head and Neck Neoplasms/genetics , Neoplasm Recurrence, Local/pathology , Prognosis , Squamous Cell Carcinoma of Head and Neck/genetics , Clinical Trials as Topic
3.
J Dent ; 141: 104796, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38072335

ABSTRACT

INTRODUCTION: Natural language processing (NLP) is an intersection between Computer Science and Linguistics which aims to enable machines to process and understand human language. Here we summarize applications and limitations of NLP in dentistry. DATA AND SOURCES: Narrative review. FINDINGS: NLP has evolved increasingly fast. For the dental domain, relevant NLP applications are text classification (e.g., symptom classification) and natural language generation and understanding (e.g., clinical chatbots assisting professionals in office work and patient communication). Analyzing large quantities of text will allow understanding diseases and their trajectories and support a more precise and personalized care. Speech recognition systems may serve as virtual assistants and facilitate automated documentation. However, to date, NLP has rarely been applied in dentistry. Existing research focuses mainly on rule-based solutions for narrow tasks. Technologies such as Recurrent Neural Networks and Transformers have been shown to surpass the language processing capabilities of such rule-based solutions in many fields, but are data-hungry (i.e., rely on large amounts of training data), which limits their application in the dental domain at present. Technologies such as federated or transfer learning or data sharing concepts may help to overcome this limitation, while challenges in terms of explainability, reproducibility, generalizability and evaluation of NLP in dentistry remain to be resolved for enabling approval of such technologies in medical devices and services. CONCLUSIONS: NLP will become a cornerstone of a number of applications in dentistry. The community is called to action to improve the current limitations and foster reliable, high-quality dental NLP.
CLINICAL SIGNIFICANCE: NLP for text classification (e.g., dental symptom classification) and language generation and understanding (e.g., clinical chatbots, speech recognition) will support administrative tasks in dentistry, provide deeper insights for clinicians and support research and education.


Subjects
Communication , Natural Language Processing , Humans , Reproducibility of Results , Dentistry
4.
Bioinformatics ; 39(11), 2023 11 01.
Article in English | MEDLINE | ID: mdl-37950510

ABSTRACT

SUMMARY: Relation extraction (RE) from large text collections is an important tool for database curation, pathway reconstruction, or functional omics data analysis. In practice, RE often is part of a complex data analysis pipeline requiring specific adaptations like restricting the types of relations or the set of proteins to be considered. However, current systems are either non-programmable web sites or research code with fixed functionality. We present PEDL+, a user-friendly tool for extracting protein-protein and protein-chemical associations from PubMed articles. PEDL+ combines state-of-the-art NLP technology with adaptable ranking and filtering options and can easily be integrated into analysis pipelines. We evaluated PEDL+ in two pathway curation projects and found that 59% to 80% of its extractions were helpful. AVAILABILITY AND IMPLEMENTATION: PEDL+ is freely available at https://github.com/leonweber/pedl.


Subjects
Software , PubMed , Databases, Factual
5.
Bioinformatics ; 39(11), 2023 11 01.
Article in English | MEDLINE | ID: mdl-37975879

ABSTRACT

MOTIVATION: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage KB UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. RESULTS: We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models. AVAILABILITY AND IMPLEMENTATION: The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.


Subjects
Benchmarking , Data Mining , Data Mining/methods , Software , Language , Natural Language Processing
6.
JAMA Netw Open ; 6(11): e2343689, 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37976064

ABSTRACT

Importance: Clinical interpretation of complex biomarkers for precision oncology currently requires manual investigations of previous studies and databases. Conversational large language models (LLMs) might be beneficial as automated tools for assisting clinical decision-making. Objective: To assess performance and define their role using 4 recent LLMs as support tools for precision oncology. Design, Setting, and Participants: This diagnostic study examined 10 fictional cases of patients with advanced cancer with genetic alterations. Each case was submitted to 4 different LLMs (ChatGPT, Galactica, Perplexity, and BioMedLM) and 1 expert physician to identify personalized treatment options in 2023. Treatment options were masked and presented to a molecular tumor board (MTB), whose members rated the likelihood of a treatment option coming from an LLM on a scale from 0 to 10 (0, extremely unlikely; 10, extremely likely) and decided whether the treatment option was clinically useful. Main Outcomes and Measures: Number of treatment options, precision, recall, F1 score of LLMs compared with human experts, recognizability, and usefulness of recommendations. Results: For 10 fictional cancer patients (4 with lung cancer, 6 with other; median [IQR] 3.5 [3.0-4.8] molecular alterations per patient), a median (IQR) number of 4.0 (4.0-4.0) compared with 3.0 (3.0-5.0), 7.5 (4.3-9.8), 11.5 (7.8-13.0), and 13.0 (11.3-21.5) treatment options each was identified by the human expert and 4 LLMs, respectively. When considering the expert as a criterion standard, LLM-proposed treatment options reached F1 scores of 0.04, 0.17, 0.14, and 0.19 across all patients combined. Combining treatment options from different LLMs allowed a precision of 0.29 and a recall of 0.29 for an F1 score of 0.29. LLM-generated treatment options were recognized as AI-generated with a median (IQR) 7.5 (5.3-9.0) points in contrast to 2.0 (1.0-3.0) points for manually annotated cases. 
A crucial reason for identifying AI-generated treatment options was insufficient accompanying evidence. For each patient, at least 1 LLM generated a treatment option that was considered helpful by MTB members. Two unique useful treatment options (including 1 unique treatment strategy) were identified only by LLM. Conclusions and Relevance: In this diagnostic study, treatment options of LLMs in precision oncology did not reach the quality and credibility of human experts; however, they generated helpful ideas that might have complemented established procedures. Considering technological progress, LLMs could play an increasingly important role in assisting with screening and selecting relevant biomedical literature to support evidence-based, personalized treatment decisions.


Subjects
Lung Neoplasms , Precision Medicine , Humans , Medical Oncology , Language , Communication
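The precision, recall, and F1 figures in the abstract above follow the standard definitions, with the expert's treatment options serving as the reference set. The sketch below is a generic illustration and not the study's evaluation code; the function names and the set-based representation of treatment options are assumptions. It also shows why a precision of 0.29 and a recall of 0.29 give an F1 score of 0.29: the harmonic mean of two equal values is that value.

```python
def precision_recall(proposed: set, reference: set):
    """Precision and recall of proposed options against a reference set."""
    true_positives = len(proposed & reference)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, four proposed options of which one matches a two-option reference set give a precision of 0.25 and a recall of 0.5, for an F1 score of about 0.33.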
7.
Cancers (Basel) ; 15(3), 2023 Feb 01.
Article in English | MEDLINE | ID: mdl-36765893

ABSTRACT

Pancreatic neuroendocrine neoplasms (panNENs) are a rare yet diverse type of neoplasia whose precise clinical-pathological classification is frequently challenging. Since incorrect classifications can affect treatment decisions, additional tools which support the diagnosis, such as machine learning (ML) techniques, are critically needed but generally unavailable due to the scarcity of suitable ML training data for rare panNENs. Here, we demonstrate that a multi-step ML framework predicts clinically relevant panNEN characteristics while being exclusively trained on widely available data of a healthy origin. The approach classifies panNENs by deconvolving their transcriptomes into cell type proportions based on shared gene expression profiles with healthy pancreatic cell types. The deconvolution results were found to provide a prognostic value with respect to the prediction of the overall patient survival time, neoplastic grading, and carcinoma versus tumor subclassification. The performance with which a proliferation rate agnostic deconvolution ML model could predict the clinical characteristics was found to be comparable to that of a comparative baseline model trained on the proliferation rate-informed MKI67 levels. The approach is novel in that it complements established proliferation rate-oriented classification schemes whose results can be reproduced and further refined by differentiating between identically graded subgroups. By including non-endocrine cell types, the deconvolution approach furthermore provides an in silico quantification of panNEN dedifferentiation, optimizing it for challenging clinical classification tasks in more aggressive panNEN subtypes.

8.
Database (Oxford) ; 2022, 2022 11 18.
Article in English | MEDLINE | ID: mdl-36399413

ABSTRACT

The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.


Subjects
Language , Proteins , Databases, Factual , Toxicogenetics
9.
Database (Oxford) ; 2022, 2022 06 27.
Article in English | MEDLINE | ID: mdl-35758881

ABSTRACT

High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48-0.91 for entity detection and 0.71-0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg.


Subjects
Algorithms , Data Mining , DNA/genetics , Data Mining/methods , Databases, Factual , Humans , PubMed
10.
J Chem Theory Comput ; 18(7): 4408-4417, 2022 Jul 12.
Article in English | MEDLINE | ID: mdl-35671364

ABSTRACT

Machine learning (ML) approaches have demonstrated the ability to predict molecular spectra at a fraction of the computational cost of traditional theoretical chemistry methods while maintaining high accuracy. Graph neural networks (GNNs) are particularly promising in this regard, but different types of GNNs have not yet been systematically compared. In this work, we benchmark and analyze five different GNNs for the prediction of excitation spectra from the QM9 dataset of organic molecules. We compare the GNN performance in the obvious runtime measurements, prediction accuracy, and analysis of outliers in the test set. Moreover, through TMAP clustering and statistical analysis, we are able to highlight clear hotspots of high prediction errors as well as optimal spectra prediction for molecules with certain functional groups. This in-depth benchmarking and subsequent analysis protocol lays down a recipe for comparing different ML methods and evaluating dataset quality.


Subjects
Machine Learning , Neural Networks, Computer
11.
Genome Med ; 14(1): 24, 2022 03 01.
Article in English | MEDLINE | ID: mdl-35227293

ABSTRACT

BACKGROUND: Pancreatic neuroendocrine neoplasms (PanNENs) fall into two subclasses: the well-differentiated, low- to high-grade pancreatic neuroendocrine tumors (PanNETs), and the poorly-differentiated, high-grade pancreatic neuroendocrine carcinomas (PanNECs). While recent studies suggest an endocrine descent of PanNETs, the origin of PanNECs remains unknown. METHODS: We performed DNA methylation analysis for 57 PanNEN samples and found that distinct methylation profiles separated PanNENs into two major groups, clearly distinguishing high-grade PanNECs from other PanNETs including high-grade NETG3. DNA alterations and immunohistochemistry of cell-type markers PDX1, ARX, and SOX9 were utilized to further characterize PanNECs and their cell of origin in the pancreas. RESULTS: Phylo-epigenetic and cell-type signature features derived from alpha, beta, acinar, and ductal adult cells suggest an exocrine cell of origin for PanNECs, thus separating them in cell lineage from other PanNENs of endocrine origin. CONCLUSIONS: Our study provides a robust and clinically applicable method to clearly distinguish PanNECs from G3 PanNETs, improving patient stratification.


Subjects
Carcinoma, Neuroendocrine , Neuroendocrine Tumors , Pancreatic Neoplasms , Adult , Carcinoma, Neuroendocrine/genetics , Carcinoma, Neuroendocrine/pathology , DNA Methylation , Humans , Neoplasm Grading , Neuroendocrine Tumors/genetics , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/pathology
12.
Datenbank Spektrum ; 21(3): 255-260, 2021.
Article in English | MEDLINE | ID: mdl-34786019

ABSTRACT

Today's scientific data analysis very often requires complex Data Analysis Workflows (DAWs) executed over distributed computational infrastructures, e.g., clusters. Much research effort is devoted to the tuning and performance optimization of specific workflows for specific clusters. However, an arguably even more important problem for accelerating research is the reduction of development, adaptation, and maintenance times of DAWs. We describe the design and setup of the Collaborative Research Center (CRC) 1404 "FONDA -- Foundations of Workflows for Large-Scale Scientific Data Analysis", in which roughly 50 researchers jointly investigate new technologies, algorithms, and models to increase the portability, adaptability, and dependability of DAWs executed over distributed infrastructures. We describe the motivation behind our project, explain its underlying core concepts, introduce FONDA's internal structure, and sketch our vision for the future of workflow-based scientific data analysis. We also describe some lessons learned during the "making of" a CRC in Computer Science with strong interdisciplinary components, with the aim to foster similar endeavors.

13.
Cancers (Basel) ; 13(17), 2021 Sep 04.
Article in English | MEDLINE | ID: mdl-34503273

ABSTRACT

BACKGROUND: The clinical management of high-grade gastroenteropancreatic neuroendocrine neoplasms (GEP-NEN) is challenging due to disease heterogeneity, illustrating the need for reliable biomarkers facilitating patient stratification and guiding treatment decisions. FMS-like tyrosine kinase 3 ligand (Flt3L) is emerging as a prognostic or predictive surrogate marker of host tumoral immune response and might enable the stratification of patients with otherwise comparable tumor features. METHODS: We evaluated Flt3L gene expression in tumor tissue as well as circulating Flt3L levels as potential biomarkers in a cohort of 54 patients with GEP-NEN. RESULTS: We detected a prominent induction of Flt3L gene expression in individual G2 and G3 NEN, but not in G1 neuroendocrine tumors (NET). Flt3L mRNA expression levels in tumor tissue predicted the disease-related survival of patients with highly proliferative G2 and G3 NEN more accurately than the conventional criteria of grading or NEC/NET differentiation. High level Flt3L mRNA expression was associated with the increased expression of genes related to immunogenic cell death, lymphocyte effector function and dendritic cell maturation, suggesting a less tolerogenic (more proinflammatory) phenotype of tumors with Flt3L induction. Importantly, circulating levels of Flt3L were also elevated in high grade NEN and correlated with patients' progression-free and disease-related survival, thereby reflecting the results observed in tumor tissue. CONCLUSIONS: We propose Flt3L as a prognostic biomarker for high grade GEP-NEN, harnessing its potential as a marker of an inflammatory tumor microenvironment. Flt3L measurements in serum, which can easily be incorporated into clinical routine, should be further evaluated to guide patient stratification and treatment decisions.

14.
JAMIA Open ; 4(2): ooab025, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33898938

ABSTRACT

OBJECTIVE: We present the Berlin-Tübingen-Oncology corpus (BRONCO), a large and freely available corpus of shuffled sentences from German oncological discharge summaries annotated with diagnosis, treatments, medications, and further attributes including negation and speculation. The aim of BRONCO is to foster reproducible and openly available research on Information Extraction from German medical texts. MATERIALS AND METHODS: BRONCO consists of 200 manually deidentified discharge summaries of cancer patients. Annotation followed a structured and quality-controlled process involving 2 groups of medical experts to ensure consistency, comprehensiveness, and high quality of annotations. We present results of several state-of-the-art techniques for different IE tasks as baselines for subsequent research. RESULTS: The annotated corpus consists of 11 434 sentences and 89 942 tokens, annotated with 11 124 annotations for medical entities and 3118 annotations of related attributes. We publish 75% of the corpus as a set of shuffled sentences, and keep 25% as a held-out dataset for unbiased evaluation of future IE tools. On this held-out dataset, our baselines reach, depending on the specific entity type, F1-scores of 0.72-0.90 for named entity recognition, 0.10-0.68 for entity normalization, 0.55 for negation detection, and 0.33 for speculation detection. DISCUSSION: Medical corpus annotation is a complex and time-consuming task. This makes sharing of such resources even more important. CONCLUSION: To our knowledge, BRONCO is the first sizable and freely available German medical corpus. Our baseline results show that more research efforts are necessary to lift the quality of information extraction in German medical texts to the level already possible for English.

15.
Bioinformatics ; 37(17): 2792-2794, 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-33508086

ABSTRACT

SUMMARY: Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate and be robust toward variations in text genre and style. We present HunFlair, a NER tagger fulfilling these requirements. HunFlair is integrated into the widely used NLP framework Flair, recognizes five biomedical entity types, matches or exceeds state-of-the-art performance on a wide set of evaluation corpora, and is trained in a cross-corpus setting to avoid corpus-specific bias. Technically, it uses a character-level language model pretrained on roughly 24 million biomedical abstracts and three million full texts. It outperforms other off-the-shelf biomedical NER tools with an average gain of 7.26 pp over the next best tool in a cross-corpus setting and achieves on-par results with state-of-the-art research prototypes in in-corpus experiments. HunFlair can be installed with a single command and is applied with only four lines of code. Furthermore, it is accompanied by harmonized versions of 23 biomedical NER corpora. AVAILABILITY AND IMPLEMENTATION: HunFlair is freely available through the Flair NLP framework (https://github.com/flairNLP/flair) under an MIT license and is compatible with all major operating systems. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

16.
Bioinformatics ; 37(2): 236-242, 2021 04 19.
Article in English | MEDLINE | ID: mdl-32726411

ABSTRACT

MOTIVATION: The automatic extraction of published relationships between molecular entities has important applications in many biomedical fields, ranging from Systems Biology to Personalized Medicine. Existing works focused on extracting relationships described in single articles or in single sentences. However, a single record is rarely sufficient to judge the biological correctness of a relation, as experimental evidence might be weak or only valid in a certain context. Furthermore, statements may be more speculative than confirmative, and different articles often contradict each other. Experts therefore always take the complete literature into account to reach a reliable decision on a relationship. It is an open research question how to do this effectively in an automatic manner. RESULTS: We propose two novel relation extraction approaches which use recent representation learning techniques to create comprehensive models of biomedical entities or entity-pairs, respectively. These representations are learned by considering all publications from PubMed mentioning an entity or a pair. They are used as input for a neural network for classifying relations globally, i.e. the derived predictions are corpus-based, not sentence- or article-based as in prior art. Experiments on the extraction of mutation-disease, drug-disease and drug-drug relationships show that the learned embeddings indeed capture semantic information of the entities under study and outperform traditional methods by 4-29% regarding F1 score. AVAILABILITY AND IMPLEMENTATION: Source codes are available at: https://github.com/mariosaenger/bio-re-with-entity-embeddings. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neural Networks, Computer , Software , Data Mining , PubMed , Publications , Semantics
17.
Bioinformatics ; 36(Suppl_1): i490-i498, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657389

ABSTRACT

MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Language , Proteins , Publications , Research Design
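Distant supervision, which the abstract above credits for the larger training set, can be illustrated in its simplest generic form: any sentence mentioning two proteins that a knowledge base lists as associated is treated as a (noisy) positive example. The sketch below is not PEDL's implementation; the data layout, the function name, and the toy inputs are assumptions made for illustration.

```python
from itertools import combinations


def distant_labels(sentences, known_pairs):
    """Create noisy training examples from co-mentions.

    sentences:   list of (text, set of protein names mentioned in the text)
    known_pairs: set of frozensets, each holding two associated proteins
    Returns a list of (text, (protein_a, protein_b), label), where label is
    1 if the pair is in known_pairs and 0 otherwise.
    """
    examples = []
    for text, proteins in sentences:
        # Consider every unordered pair of proteins co-mentioned in the sentence.
        for a, b in combinations(sorted(proteins), 2):
            label = 1 if frozenset((a, b)) in known_pairs else 0
            examples.append((text, (a, b), label))
    return examples
```

Labels produced this way are noisy because a sentence can mention two associated proteins without stating the association, which is one reason such data is usually combined with a model that tolerates label noise.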
18.
Nat Commun ; 11(1): 3651, 2020 07 20.
Article in English | MEDLINE | ID: mdl-32686676

ABSTRACT

Lesion-based targeting strategies underlie cancer precision medicine. However, biological principles - such as cellular senescence - remain difficult to implement in molecularly informed treatment decisions. Functional analyses in syngeneic mouse models and cross-species validation in patient datasets might uncover clinically relevant genetics of biological response programs. Here, we show that chemotherapy-exposed primary Eµ-myc transgenic lymphomas - with and without defined genetic lesions - recapitulate molecular signatures of patients with diffuse large B-cell lymphoma (DLBCL). Importantly, we interrogate the murine lymphoma capacity to senesce and its epigenetic control via the histone H3 lysine 9 (H3K9)-methyltransferase Suv(ar)39h1 and H3K9me3-active demethylases by loss- and gain-of-function genetics, and an unbiased clinical trial-like approach. A mouse-derived senescence-indicating gene signature, termed "SUVARness", as well as high-level H3K9me3 lymphoma expression, predict favorable DLBCL patient outcome. Our data support the use of functional genetics in transgenic mouse models to incorporate basic biology knowledge into cancer precision medicine in the clinic.


Subjects
Cellular Senescence , Histone Methyltransferases , Lymphoma, Large B-Cell, Diffuse , 3T3 Cells , Animals , Cell Line, Tumor , Disease Models, Animal , Epigenesis, Genetic , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Histone Methyltransferases/genetics , Histone Methyltransferases/metabolism , Humans , Lymphoma, Large B-Cell, Diffuse/genetics , Lymphoma, Large B-Cell, Diffuse/pathology , Mice , Mice, Transgenic , Prognosis
19.
Bioinformatics ; 36(1): 295-302, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31243432

ABSTRACT

MOTIVATION: Several recent studies showed that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of improvements crucially depends on the availability of sufficiently large training corpora, which is a problem in the biomedical domain with its often rather small gold standard corpora. RESULTS: We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focusing on a particular corpus. Experiments were performed using 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 pp compared to learning without pretraining. We experimented both with supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER incorporating fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5-13 pp on the entity types chemicals, species and genes. AVAILABILITY AND IMPLEMENTATION: HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology , Neural Networks, Computer , Computational Biology/methods , Data Analysis , Software
20.
BMC Bioinformatics ; 20(1): 429, 2019 Aug 16.
Article in English | MEDLINE | ID: mdl-31419935

ABSTRACT

BACKGROUND: Diagnosis and treatment decisions in cancer increasingly depend on a detailed analysis of the mutational status of a patient's genome. This analysis relies on previously published information regarding the association of variations to disease progression and possible interventions. Clinicians to a large degree use biomedical search engines to obtain such information; however, the vast majority of scientific publications focus on basic science and have no direct clinical impact. We develop the Variant-Information Search Tool (VIST), a search engine designed for the targeted search of clinically relevant publications given an oncological mutation profile. RESULTS: VIST indexes all PubMed abstracts and content from ClinicalTrials.gov. It applies advanced text mining to identify mentions of genes, variants and drugs and uses machine learning based scoring to judge the clinical relevance of indexed abstracts. Its functionality is available through a fast and intuitive web interface. We perform several evaluations, showing that VIST's ranking is superior to that of PubMed or a pure vector space model with regard to the clinical relevance of a document's content. CONCLUSION: Different user groups search repositories of scientific publications with different intentions. This diversity is not adequately reflected in the standard search engines, often leading to poor performance in specialized settings. We develop a search engine for the specific case of finding documents that are clinically relevant in the course of cancer treatment. We believe that the architecture of our engine, heavily relying on machine learning algorithms, can also act as a blueprint for search engines in other, equally specific domains. VIST is freely available at https://vist.informatik.hu-berlin.de/.


Subjects
Neoplasms/pathology , Precision Medicine , Search Engine , Algorithms , Databases as Topic , Documentation , Humans , Internet , User-Computer Interface
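The "pure vector space model" that the abstract above names as a comparison baseline can be sketched as TF-IDF vectors ranked by cosine similarity. This is a generic textbook version, not VIST's scoring and not necessarily the exact baseline the authors used; whitespace tokenization and the smoothed IDF formula are assumptions.

```python
import math
from collections import Counter


def tokenize(text: str):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def rank(query: str, documents: list):
    """Return document indices ordered by TF-IDF cosine similarity, best first."""
    tokenized = [tokenize(doc) for doc in documents]
    n = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

    def vector(tokens):
        tf = Counter(t for t in tokens if t in idf)
        return {t: freq * idf[t] for t, freq in tf.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vector(tokenize(query))
    scores = [cosine(q, vector(doc)) for doc in tokenized]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

A model of this kind scores only term overlap; it has no notion of whether a document is clinically relevant, which is the gap the abstract says a learned relevance score is meant to close.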