Results 1 - 20 of 79
1.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38383060

ABSTRACT

MOTIVATION: In precision oncology (PO), clinicians aim to find the best treatment for any patient based on their molecular characterization. A major bottleneck is the manual annotation and evaluation of individual variants, for which usually a range of knowledge bases are screened. To incorporate and integrate the vast information of different databases, fast and accurate methods for harmonizing databases with different types of information are necessary. An essential step for harmonization in PO includes the normalization of tumor entities as well as therapy options for patients. SUMMARY: preon is a fast and accurate library for the normalization of drug names and cancer types in large-scale data integration. AVAILABILITY AND IMPLEMENTATION: preon is implemented in Python and freely available via the PyPI repository. Source code and the data underlying this article are available in GitHub at https://github.com/ermshaua/preon/.
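
As an illustration of what normalizing drug names against a reference vocabulary can involve, here is a minimal sketch using only the Python standard library. It does not use or reproduce preon's API; the vocabulary, the placeholder identifiers, the `normalize` function, and the similarity cutoff are all assumptions made for this example.

```python
import difflib

# Hypothetical reference vocabulary: canonical drug name -> placeholder identifier.
# These identifiers are invented for illustration and are not real database IDs.
REFERENCE = {
    "cetuximab": "DRUG:0001",
    "docetaxel": "DRUG:0002",
    "everolimus": "DRUG:0003",
}


def normalize(name, cutoff=0.8):
    """Map a free-text drug name to (canonical name, identifier), or None."""
    key = name.strip().lower()
    if key in REFERENCE:
        # Exact match after trimming whitespace and case folding.
        return key, REFERENCE[key]
    # Fall back to approximate string matching to tolerate misspellings.
    close = difflib.get_close_matches(key, list(REFERENCE), n=1, cutoff=cutoff)
    if close:
        return close[0], REFERENCE[close[0]]
    return None
```

Calling `normalize("docetaxol")` would resolve the misspelling to the `docetaxel` entry, while a name with no sufficiently similar entry returns `None`. A production system would likely need synonym tables and a more careful matching strategy than a single similarity cutoff.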


Subjects
Neoplasms , Humans , Neoplasms/drug therapy , Precision Medicine , Medical Oncology , Software , Databases, Factual
2.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37975879

ABSTRACT

MOTIVATION: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups, making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad-coverage KB UMLS, leaving their performance on more specialized KBs, e.g. for genes or variants, understudied. RESULTS: We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora, offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture, showing that neural approaches fail to perform consistently across entity types and highlighting the need for further studies towards entity-agnostic models. AVAILABILITY AND IMPLEMENTATION: The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.


Subjects
Benchmarking , Data Mining , Data Mining/methods , Software , Language , Natural Language Processing
3.
Bioinformatics ; 39(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37950510

ABSTRACT

SUMMARY: Relation extraction (RE) from large text collections is an important tool for database curation, pathway reconstruction, or functional omics data analysis. In practice, RE often is part of a complex data analysis pipeline requiring specific adaptations like restricting the types of relations or the set of proteins to be considered. However, current systems are either non-programmable web sites or research code with fixed functionality. We present PEDL+, a user-friendly tool for extracting protein-protein and protein-chemical associations from PubMed articles. PEDL+ combines state-of-the-art NLP technology with adaptable ranking and filtering options and can easily be integrated into analysis pipelines. We evaluated PEDL+ in two pathway curation projects and found that 59% to 80% of its extractions were helpful. AVAILABILITY AND IMPLEMENTATION: PEDL+ is freely available at https://github.com/leonweber/pedl.


Subjects
Software , PubMed , Databases, Factual
4.
Bioinformatics ; 37(2): 236-242, 2021 04 19.
Article in English | MEDLINE | ID: mdl-32726411

ABSTRACT

MOTIVATION: The automatic extraction of published relationships between molecular entities has important applications in many biomedical fields, ranging from Systems Biology to Personalized Medicine. Existing work has focused on extracting relationships described in single articles or in single sentences. However, a single record is rarely sufficient to judge the biological correctness of a relation, as experimental evidence might be weak or only valid in a certain context. Furthermore, statements may be more speculative than confirmative, and different articles often contradict each other. Experts therefore always take the complete literature into account to reach a reliable decision on a relationship. It is an open research question how to do this effectively in an automatic manner. RESULTS: We propose two novel relation extraction approaches which use recent representation learning techniques to create comprehensive models of biomedical entities or entity-pairs, respectively. These representations are learned by considering all publications from PubMed mentioning an entity or a pair. They are used as input for a neural network for classifying relations globally, i.e. the derived predictions are corpus-based, not sentence- or article-based as in prior work. Experiments on the extraction of mutation-disease, drug-disease and drug-drug relationships show that the learned embeddings indeed capture semantic information of the entities under study and outperform traditional methods by 4-29% regarding F1 score. AVAILABILITY AND IMPLEMENTATION: Source code is available at: https://github.com/mariosaenger/bio-re-with-entity-embeddings. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neural Networks, Computer , Software , Data Mining , PubMed , Publications , Semantics
5.
Bioinformatics ; 37(17): 2792-2794, 2021 Sep 09.
Article in English | MEDLINE | ID: mdl-33508086

ABSTRACT

SUMMARY: Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate and be robust toward variations in text genre and style. We present HunFlair, a NER tagger fulfilling these requirements. HunFlair is integrated into the widely used NLP framework Flair, recognizes five biomedical entity types, matches or exceeds state-of-the-art performance on a wide set of evaluation corpora, and is trained in a cross-corpus setting to avoid corpus-specific bias. Technically, it uses a character-level language model pretrained on roughly 24 million biomedical abstracts and three million full texts. It outperforms other off-the-shelf biomedical NER tools with an average gain of 7.26 pp over the next best tool in a cross-corpus setting and achieves on-par results with state-of-the-art research prototypes in in-corpus experiments. HunFlair can be installed with a single command and is applied with only four lines of code. Furthermore, it is accompanied by harmonized versions of 23 biomedical NER corpora. AVAILABILITY AND IMPLEMENTATION: HunFlair is freely available through the Flair NLP framework (https://github.com/flairNLP/flair) under an MIT license and is compatible with all major operating systems. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

6.
Bioinformatics ; 36(Suppl_1): i490-i498, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657389

ABSTRACT

MOTIVATION: A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein-protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. RESULTS: We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. AVAILABILITY AND IMPLEMENTATION: PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
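
The distant supervision idea mentioned in this abstract can be sketched in a few lines: sentences mentioning a protein pair are labeled automatically by looking the pair up in a database of known associations, instead of being annotated by hand. This is a generic illustration and not PEDL's implementation; the function name, the input layout, and the example protein names used to call it are assumptions.

```python
def distant_labels(mentions, known_pairs):
    """Attach a weak label to each (protein_a, protein_b, sentence) mention.

    known_pairs is a set of frozensets holding protein pairs that a
    (hypothetical) pathway database lists as associated. A mention is
    labeled 1 if its pair is in that set and 0 otherwise.
    """
    labeled = []
    for protein_a, protein_b, sentence in mentions:
        # frozenset makes the lookup independent of the order of the pair.
        label = 1 if frozenset((protein_a, protein_b)) in known_pairs else 0
        labeled.append((protein_a, protein_b, sentence, label))
    return labeled
```

Labels produced this way are noisy, because a sentence can mention both proteins without stating the association, so models trained on them usually need some mechanism to tolerate wrong labels.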


Subjects
Language , Proteins , Publications , Research Design
7.
Bioinformatics ; 36(1): 295-302, 2020 01 01.
Article in English | MEDLINE | ID: mdl-31243432

ABSTRACT

MOTIVATION: Several recent studies showed that the application of deep neural networks advanced the state-of-the-art in named entity recognition (NER), including biomedical NER. However, the impact on performance and the robustness of improvements crucially depends on the availability of sufficiently large training corpora, which is a problem in the biomedical domain with its often rather small gold standard corpora. RESULTS: We evaluate different methods for alleviating the data sparsity problem by pretraining a deep neural network (LSTM-CRF), followed by a rather short fine-tuning phase focusing on a particular corpus. Experiments were performed using 34 different corpora covering five different biomedical entity types, yielding an average increase in F1-score of ∼2 pp compared to learning without pretraining. We experimented both with supervised and semi-supervised pretraining, leading to interesting insights into the precision/recall trade-off. Based on our results, we created the stand-alone NER tool HUNER incorporating fully trained models for five entity types. On the independent CRAFT corpus, which was not used for creating HUNER, it outperforms the state-of-the-art tools GNormPlus and tmChem by 5-13 pp on the entity types chemicals, species and genes. AVAILABILITY AND IMPLEMENTATION: HUNER is freely available at https://hu-ner.github.io. HUNER comes in containers, making it easy to install and use, and it can be applied off-the-shelf to arbitrary texts. We also provide an integrated tool for obtaining and converting all 34 corpora used in our evaluation, including fixed training, development and test splits to enable fair comparisons in the future. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology , Neural Networks, Computer , Computational Biology/methods , Data Analysis , Software
8.
BMC Bioinformatics ; 20(1): 429, 2019 Aug 16.
Article in English | MEDLINE | ID: mdl-31419935

ABSTRACT

BACKGROUND: Diagnosis and treatment decisions in cancer increasingly depend on a detailed analysis of the mutational status of a patient's genome. This analysis relies on previously published information regarding the association of variations to disease progression and possible interventions. Clinicians to a large degree use biomedical search engines to obtain such information; however, the vast majority of scientific publications focus on basic science and have no direct clinical impact. We develop the Variant-Information Search Tool (VIST), a search engine designed for the targeted search of clinically relevant publications given an oncological mutation profile. RESULTS: VIST indexes all PubMed abstracts and content from ClinicalTrials.gov. It applies advanced text mining to identify mentions of genes, variants and drugs and uses machine learning based scoring to judge the clinical relevance of indexed abstracts. Its functionality is available through a fast and intuitive web interface. We perform several evaluations, showing that VIST's ranking is superior to that of PubMed or a pure vector space model with regard to the clinical relevance of a document's content. CONCLUSION: Different user groups search repositories of scientific publications with different intentions. This diversity is not adequately reflected in the standard search engines, often leading to poor performance in specialized settings. We develop a search engine for the specific case of finding documents that are clinically relevant in the course of cancer treatment. We believe that the architecture of our engine, heavily relying on machine learning algorithms, can also act as a blueprint for search engines in other, equally specific domains. VIST is freely available at https://vist.informatik.hu-berlin.de/.


Subjects
Neoplasms/pathology , Precision Medicine , Search Engine , Algorithms , Databases as Topic , Documentation , Humans , Internet , User-Computer Interface
9.
Brief Bioinform ; 18(5): 837-850, 2017 09 01.
Article in English | MEDLINE | ID: mdl-27473063

ABSTRACT

Differential network analysis (DiNA) denotes a recent class of network-based bioinformatics algorithms which focus on the differences in network topologies between two states of a cell, such as healthy and disease, to identify key players in the discriminating biological processes. In contrast to conventional differential analysis, DiNA identifies changes in the interplay between molecules, rather than changes in single molecules. This ability is especially important in cases where effectors are changed, e.g. mutated, but their expression is not. A number of different DiNA approaches have been proposed, yet a comparative assessment of their performance in different settings is still lacking. In this paper, we evaluate 10 different DiNA algorithms regarding their ability to recover genetic key players from transcriptome data. We construct high-quality regulatory networks and enrich them with co-expression data from four different types of cancer. Next, we assess the results of applying DiNA algorithms on these data sets using a gold standard list (GSL). We find that local DiNA algorithms are generally superior to global algorithms, and that all DiNA algorithms outperform conventional differential expression analysis. We also assess the ability of DiNA methods to exploit additional knowledge in the underlying cellular networks. To this end, we enrich the cancer-type specific networks with known regulatory miRNAs and compare the algorithms' performance in networks with and without miRNA. We find that including miRNAs consistently and considerably improves the performance of almost all tested algorithms. Our results underline the advantages of comprehensive cell models for the analysis of -omics data.
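
To make the idea of comparing network topologies between two cell states concrete, the following sketch ranks nodes by how much their degree changes between two undirected networks. It is one simple node-level measure chosen for illustration and is not claimed to be among the ten algorithms evaluated in the paper; the function names and the edge-list input format are assumptions.

```python
from collections import Counter


def degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def differential_degree(edges_a, edges_b):
    """Rank nodes by the absolute change in degree between two network states."""
    degree_a, degree_b = degrees(edges_a), degrees(edges_b)
    nodes = set(degree_a) | set(degree_b)
    # A Counter returns 0 for nodes that are absent from one of the states.
    change = {node: abs(degree_a[node] - degree_b[node]) for node in nodes}
    # Largest change first; ties are broken alphabetically for stable output.
    return sorted(change.items(), key=lambda item: (-item[1], item[0]))
```

A pure degree comparison misses nodes whose partners change while their edge count stays the same, which is one reason more elaborate measures exist.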


Subjects
Gene Regulatory Networks , Algorithms , Computational Biology , Gene Expression Profiling , MicroRNAs
10.
Bioinformatics ; 33(14): i37-i48, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881963

ABSTRACT

MOTIVATION: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. RESULTS: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. AVAILABILITY AND IMPLEMENTATION: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/ . CONTACT: habibima@informatik.hu-berlin.de.
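
Since the results above are reported as F1-scores driven by recall, a short sketch of how precision, recall and F1 are commonly computed for NER may help. It assumes exact-match evaluation over sets of (start, end, type) spans, which is a common convention but not necessarily the exact protocol used in the paper; the function name is hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With this definition, a tagger that finds more of the gold entities raises recall, and F1 rises with it as long as precision does not drop by a comparable amount.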


Subjects
Data Mining/methods , Machine Learning , Animals , Humans , Mice , Software
11.
BMC Med Inform Decis Mak ; 18(1): 107, 2018 11 21.
Article in English | MEDLINE | ID: mdl-30463544

ABSTRACT

BACKGROUND: The decreasing cost of obtaining high-quality calls of genomic variants and the increasing availability of clinically relevant data on such variants are important drivers for personalized oncology. To allow rational genome-based decisions in diagnosis and treatment, clinicians need intuitive access to up-to-date and comprehensive variant information, encompassing, for instance, prevalence in populations and diseases, functional impact at the molecular level, associations to druggable targets, or results from clinical trials. In practice, collecting such comprehensive information on genomic variants is difficult since the underlying data is dispersed over a multitude of distributed, heterogeneous, sometimes conflicting, and quickly evolving data sources. To work efficiently, clinicians require powerful Variant Information Systems (VIS) which automatically collect and aggregate available evidences from such data sources without suppressing existing uncertainty. METHODS: We address the most important cornerstones of modeling a VIS: We take from emerging community standards regarding the necessary breadth of variant information and procedures for their clinical assessment, long standing experience in implementing biomedical databases and information systems, our own clinical record of diagnosis and treatment of cancer patients based on molecular profiles, and extensive literature review to derive a set of design principles along which we develop a relational data model for variant level data. In addition, we characterize a number of public variant data sources, and describe a data integration pipeline to integrate their data into a VIS. RESULTS: We provide a number of contributions that are fundamental to the design and implementation of a comprehensive, operational VIS. 
In particular, we (a) present a relational data model to accurately reflect data extracted from public databases relevant for clinical variant interpretation, (b) introduce a fault-tolerant and performant integration pipeline for public variant data sources, and (c) offer recommendations regarding a number of intricate challenges encountered when integrating variant data for clinical interpretation. CONCLUSION: The analysis of requirements for representation of variant level data in an operational data model, together with the implementation-ready relational data model presented here, and the instructional description of methods to acquire comprehensive information to fill it, are an important step towards variant information systems for genomic medicine.


Subjects
Genetic Variation , Genomics , Medical Informatics Applications , Medical Oncology , Precision Medicine , Genomics/methods , Humans , Medical Oncology/methods , Precision Medicine/methods
12.
Int J Cancer ; 141(6): 1215-1221, 2017 09 15.
Article in English | MEDLINE | ID: mdl-28560858

ABSTRACT

Cetuximab is the only targeted therapy approved for the treatment of head and neck cancer (HNSCC). Predictive biomarkers have not been established and patient stratification based on molecular tumor profiles has not been possible. Since EGFR pathway activation is pronounced in the basal subtype, we hypothesized this activation could be a predictive signature for an EGFR-directed treatment. From our patient-derived xenograft platform of HNSCC, 28 models were subjected to Affymetrix gene expression studies on HG U133+ 2.0. Based on the expression of 821 genes, the subtype of each of the 28 models was determined by integrating gene expression profiles through centroid-clustering with previously published gene expression data by Keck et al. The models were treated in groups of 5-6 animals with docetaxel, cetuximab, everolimus, cis- or carboplatin and 5-fluorouracil. Response was evaluated by comparing tumor volume at treatment initiation and after 3 weeks of treatment (RTV). Tumors were distributed over the 3 signature-defined subtypes: 5 mesenchymal/inflamed phenotype (MS), 15 basal type (BA), 8 classical type (CL). Cluster analysis revealed a strong correlation between response to cetuximab and the basal subtype: RTV MS 3.32 vs. BA 0.78 (MS vs. BA, unpaired t-test, p = 0.0002). Cetuximab responders were distributed as follows: 1/5 in MS, 5/8 in CL and 13/15 in the BA group. Activity of classical chemotherapies did not differ between the subtypes. In conclusion, the basal subtype was associated with response to EGFR-directed therapy in head and neck squamous cell cancer patient-derived xenografts.
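
The arithmetic behind the response figures can be sketched as follows. The abstract does not spell out the RTV formula, so the ratio of tumor volume after treatment to volume at treatment initiation is an assumption here (it is consistent with values above 1 indicating growth); the function names are hypothetical, while the responder counts are those reported in the abstract.

```python
def relative_tumor_volume(volume_start, volume_after):
    """RTV, assumed here to be volume after treatment divided by volume at start."""
    if volume_start <= 0:
        raise ValueError("volume at treatment initiation must be positive")
    return volume_after / volume_start


# Cetuximab responders per subtype as (responders, models), from the abstract.
RESPONDERS = {"MS": (1, 5), "CL": (5, 8), "BA": (13, 15)}


def response_rates(counts):
    """Fraction of responding models per subtype."""
    return {subtype: hits / total for subtype, (hits, total) in counts.items()}
```

Under this assumption an RTV of 0.78 corresponds to a tumor that shrank to 78% of its starting volume, and the reported counts give response rates of 20% (MS), 62.5% (CL) and about 87% (BA) across the 28 models.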


Subjects
Carcinoma, Basal Cell/drug therapy , Carcinoma, Squamous Cell/drug therapy , Cetuximab/pharmacology , Head and Neck Neoplasms/drug therapy , Animals , Antineoplastic Agents/pharmacology , Carboplatin/pharmacology , Carcinoma, Basal Cell/enzymology , Carcinoma, Basal Cell/genetics , Carcinoma, Basal Cell/pathology , Carcinoma, Squamous Cell/enzymology , Carcinoma, Squamous Cell/genetics , Carcinoma, Squamous Cell/pathology , DNA Mutational Analysis , Docetaxel , ErbB Receptors/genetics , Everolimus/pharmacology , Fluorouracil/pharmacology , Gene Expression , Head and Neck Neoplasms/enzymology , Head and Neck Neoplasms/genetics , Head and Neck Neoplasms/pathology , High-Throughput Nucleotide Sequencing , Humans , Mice , Mice, Inbred NOD , Retrospective Studies , Squamous Cell Carcinoma of Head and Neck , Taxoids/pharmacology , Xenograft Model Antitumor Assays
13.
Bioinformatics ; 32(18): 2883-5, 2016 09 15.
Article in English | MEDLINE | ID: mdl-27256315

ABSTRACT

Descriptions of genetic variations and their effect are widely spread across the biomedical literature. However, finding all mentions of a specific variation, or all mentions of variations in a specific gene, is difficult to achieve due to the many ways such variations are described. Here, we describe SETH, a tool for the recognition of variations from text and their subsequent normalization to dbSNP or UniProt. SETH achieves high precision and recall on several evaluation corpora of PubMed abstracts. It is freely available and encompasses stand-alone scripts for isolated application and evaluation as well as thorough documentation for integration into other applications. AVAILABILITY AND IMPLEMENTATION: SETH is released under the Apache 2.0 license and can be downloaded from http://rockt.github.io/SETH/. CONTACT: thomas@informatik.hu-berlin.de or leser@informatik.hu-berlin.de.
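
The difficulty described here, that the same variation can be written in many ways, can be illustrated with a small pattern-based recognizer. This sketch covers only a few common notations and is not SETH's method, which the abstract does not detail; the patterns and the `find_variants` function are assumptions made for this example.

```python
import re

# Three-letter amino acid codes, used for protein-level notation.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|"
       "Phe|Pro|Ser|Thr|Trp|Tyr|Val")
AA1 = "ACDEFGHIKLMNPQRSTVWY"

PATTERNS = [
    re.compile(r"\bc\.\d+[ACGT]>[ACGT]"),          # coding DNA substitution
    re.compile(rf"\bp\.(?:{AA3})\d+(?:{AA3})\b"),  # protein substitution
    re.compile(rf"\b[{AA1}]\d+[{AA1}]\b"),         # short protein form
    re.compile(r"\brs\d+\b"),                      # dbSNP-style identifier
]


def find_variants(text):
    """Return variant-like mentions in order of appearance in the text."""
    hits = []
    for pattern in PATTERNS:
        hits.extend((m.start(), m.group()) for m in pattern.finditer(text))
    return [mention for _, mention in sorted(hits)]
```

On a sentence such as "The BRAF V600E mutation (p.Val600Glu, c.1799T>A; rs113488022) was found." the function returns the four mentions in reading order. The short protein form in particular is prone to false positives, since any token of the shape letter-digits-letter matches, which hints at why dedicated tools are needed for recognition and for the later normalization step.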


Subjects
Data Curation , Data Mining , Genetic Variation , Computational Biology/methods , Genes , Humans , Information Storage and Retrieval/methods , Natural Language Processing , PubMed , Publications , Terminology as Topic
14.
Bioinformatics ; 32(17): 2590-7, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27187206

ABSTRACT

MOTIVATION: Integrating heterogeneous datasets from several sources is a common bioinformatics task that often requires implementing a complex workflow intermixing database access, data filtering, format conversions, and identifier mapping, among other diverse operations. Data integration is especially important when annotating next generation sequencing data, where a multitude of diverse tools and heterogeneous databases can be used to provide a large variety of annotation for genomic locations, such as single nucleotide variants or genes. Each tool and data source is potentially useful for a given project and often more than one are used in parallel for the same purpose. However, software that always produces all available data is difficult to maintain and quickly leads to an excess of data, creating an information overload rather than the desired goal-oriented and integrated result. RESULTS: We present SoFIA, a framework for workflow-driven data integration with a focus on genomic annotation. SoFIA conceptualizes workflow templates as comprehensive workflows that cover as many data integration operations as possible in a given domain. However, these templates are not intended to be executed as a whole; instead, when given an integration task consisting of a set of input data and a set of desired output data, SoFIA derives a minimal workflow that completes the task. These workflows are typically fast and create exactly the information a user wants without requiring them to do any implementation work. Using a comprehensive genome annotation template, we highlight the flexibility, extensibility and power of the framework using real-life case studies. AVAILABILITY AND IMPLEMENTATION: https://github.com/childsish/sofia/releases/latest under the GNU General Public License. CONTACT: liam.childs@hu-berlin.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
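
Deriving a small workflow from a larger template, given the provided inputs and the requested outputs, can be sketched as backward chaining over a dependency graph. This is not SoFIA's algorithm, which the abstract does not specify; the template contents, step names, and data type names are hypothetical, and the sketch assumes that each data type is produced by exactly one step.

```python
# Hypothetical template: step name -> (required inputs, produced outputs).
TEMPLATE = {
    "map_to_gene": ({"variant", "gene_model"}, {"gene_id"}),
    "gene_to_pathway": ({"gene_id"}, {"pathway"}),
    "variant_effect": ({"variant", "gene_model"}, {"effect"}),
    "gene_to_drug": ({"gene_id"}, {"drug"}),
}


def derive_workflow(template, provided, requested):
    """Return only the steps needed for `requested`, in execution order."""
    # Assumption: every data type has a single producing step.
    producer = {out: step for step, (_, outs) in template.items() for out in outs}
    ordered, visiting = [], set()

    def require(item):
        if item in provided:
            return
        if item not in producer:
            raise ValueError(f"no step produces {item!r}")
        step = producer[item]
        if step in ordered:
            return
        if step in visiting:
            raise ValueError(f"cyclic dependency at {step!r}")
        visiting.add(step)
        for needed in sorted(template[step][0]):
            require(needed)  # resolve this step's inputs first
        visiting.discard(step)
        ordered.append(step)

    for item in sorted(requested):
        require(item)
    return ordered
```

Requesting only `pathway` from a variant and a gene model selects two of the four template steps and leaves the effect and drug annotations out, which mirrors the goal of producing exactly the information a user asks for.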


Subjects
Data Curation , High-Throughput Nucleotide Sequencing , Software , Genome , Genomics , Humans , Information Storage and Retrieval
15.
PLoS Genet ; 10(5): e1004338, 2014.
Article in English | MEDLINE | ID: mdl-24875049

ABSTRACT

Circadian rhythms are essential to the temporal regulation of molecular processes in living systems and as such to life itself. Deregulation of these rhythms leads to failures in biological processes and eventually to the manifestation of pathological phenotypes including cancer. To address the questions as to what are the elicitors of a disrupted clock in cancer, we applied a systems biology approach to correlate experimental, bioinformatics and modelling data from several cell line models for colorectal and skin cancer. We found strong and weak circadian oscillators within the same type of cancer and identified a set of genes, which allows the discrimination between the two oscillator-types. Among those genes are IFNGR2, PITX2, RFWD2, PPARγ, LOXL2, Rab6 and SPARC, all involved in cancer-related pathways. Using a bioinformatics approach, we extended the core-clock network and present its interconnection to the discriminative set of genes. Interestingly, such gene signatures link the clock to oncogenic pathways like the RAS/MAPK pathway. To investigate the potential impact of the RAS/MAPK pathway - a major driver of colorectal carcinogenesis - on the circadian clock, we used a computational model which predicted that perturbation of BMAL1-mediated transcription can generate the circadian phenotypes similar to those observed in metastatic cell lines. Using an inducible RAS expression system, we show that overexpression of RAS disrupts the circadian clock and leads to an increase of the circadian period while RAS inhibition causes a shortening of period length, as predicted by our mathematical simulations. Together, our data demonstrate that perturbations induced by a single oncogene are sufficient to deregulate the mammalian circadian clock.


Subjects
Circadian Clocks/genetics , Colorectal Neoplasms/genetics , Proto-Oncogene Proteins/biosynthesis , Skin Neoplasms/genetics , ras Proteins/biosynthesis , Cell Line, Tumor , Colorectal Neoplasms/pathology , Gene Expression Regulation, Neoplastic , Humans , Mitogen-Activated Protein Kinase Kinases/genetics , Proto-Oncogene Proteins/genetics , Proto-Oncogene Proteins p21(ras) , Signal Transduction , Skin Neoplasms/pathology , ras Proteins/genetics
16.
Brief Bioinform ; 15(2): 327-40, 2014 Mar.
Article in English | MEDLINE | ID: mdl-23255168

ABSTRACT

New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms and also for better understanding the information sought by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered one of the main bottlenecks for developing novel text mining methods. This situation led to the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools which implement some of these functionalities are available. In this article, we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered a truly comprehensive solution.


Subjects
Data Mining/methods , Publications , Software , Algorithms , Artificial Intelligence , Computational Biology/methods , Data Mining/standards , Humans , Natural Language Processing
17.
Bioinformatics ; 31(8): 1258-66, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25433699

ABSTRACT

MOTIVATION: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome-scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases together currently contain only 503 regulatory relations between human TFs. RESULTS: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences by a machine-learning approach. The top-2500 sentences contained ∼900 sentences that encompass relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not present in a regulatory database before. Full-text curation allowed us to obtain detailed information on the strength of experimental evidence supporting a relationship. CONCLUSIONS: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state-of-the-art. AVAILABILITY AND IMPLEMENTATION: Web-service is freely accessible at http://fastforward.sys-bio.net/. CONTACT: leser@informatik.hu-berlin.de or nils.bluethgen@charite.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Gene Regulatory Networks , Genome, Human , Information Storage and Retrieval/methods , MEDLINE , Neoplasms/metabolism , Transcription Factors/metabolism , Artificial Intelligence , Computer Simulation , Data Mining , Databases, Factual , Gene Expression Profiling , Gene Expression Regulation , Humans , Models, Biological , Neoplasms/classification , Neoplasms/genetics , Transcription Factors/genetics
18.
Methods ; 74: 36-46, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25448292

ABSTRACT

Biologists often pose queries to search engines and biological databases to obtain answers related to ongoing experiments. This is known to be a time-consuming, and sometimes frustrating, task in which more than one query is posed and many databases are consulted to arrive at possible answers for a single fact. Question answering offers an alternative to this process by allowing queries to be posed as questions, by integrating various resources of different kinds, and by returning an exact answer to the user. We survey the current solutions for question answering in biology, present an overview of the methods that are usually employed, and give insights on how to boost the performance of systems in this domain.


Subjects
Biology/methods , Databases, Factual , Natural Language Processing , User-Computer Interface , Animals , Biology/trends , Databases, Factual/trends , Humans , Internet/trends
19.
Nucleic Acids Res ; 42(Database issue): D950-8, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24304896

ABSTRACT

CellFinder (http://www.cellfinder.org) is a comprehensive one-stop resource for molecular data characterizing mammalian cells in different tissues and at different developmental stages. It is built from carefully selected data sets stemming from other curated databases and the biomedical literature. To date, CellFinder describes 3394 cell types and 50 951 cell lines. The database currently contains 3055 microscopic and anatomical images, 205 whole-genome expression profiles of 194 cell/tissue types from RNA-seq and microarrays, and 553 905 protein expressions for 535 cells/tissues. Text mining of a corpus of >2000 publications followed by manual curation confirmed expression information on ∼900 proteins and genes. CellFinder's data model can seamlessly represent entities from single cells to the organ level, incorporate mappings between homologous entities in different species, and describe processes of cell development and differentiation. Its ontological backbone currently consists of 204 741 ontology terms incorporated from 10 different ontologies unified under the novel CELDA ontology. CellFinder's web portal allows searching, browsing and comparing the stored data, interactive construction of developmental trees, and navigating the partonomic hierarchy of cells and tissues through a unique body browser designed for life scientists and clinicians.


Subjects
Cells/metabolism , Databases, Factual , Animals , Cell Line , Cell Physiological Phenomena , Cells/cytology , Cellular Structures/ultrastructure , Data Mining , Gene Expression Profiling , Humans , Internet , Kidney/cytology , Liver/cytology , Proteins/metabolism , RNA/metabolism
20.
BMC Genomics ; 16: 136, 2015 02 27.
Article in English | MEDLINE | ID: mdl-27391904

ABSTRACT

BACKGROUND: The analysis of differential splicing (DS) is crucial for understanding physiological processes in cells and organs. In particular, aberrant transcripts are known to be involved in various diseases including cancer. Exon arrays are a widely used technique for studying DS. Over the last decade, a variety of algorithms for the detection of DS events from exon arrays have been developed. However, no comprehensive, comparative evaluation including sensitivity to the most important data features has been conducted so far. To this end, we created multiple data sets based on simulated data to assess strengths and weaknesses of seven published methods as well as a newly developed method, KLAS. Additionally, we evaluated all methods on two cancer data sets that comprised RT-PCR validated results. RESULTS: Our studies indicated ARH as the most robust method when integrating the results over all scenarios and data sets. Nevertheless, special cases or requirements favor other methods. While FIRMA was highly sensitive according to experimental data, SplicingCompass, MIDAS and ANOSVA showed high specificity throughout the scenarios. On experimental data, ARH, FIRMA, MIDAS, and KLAS performed best. CONCLUSIONS: Each method shows different characteristics regarding sensitivity, specificity, interference to certain data settings and robustness over multiple data sets. While some methods can be considered generally good choices over all data sets and scenarios, other methods show heterogeneous prediction quality on the different data sets. The appropriate method has to be chosen carefully and with a defined study aim in mind.


Subjects
Algorithms , Alternative Splicing , Exons , RNA Splicing , RNA, Neoplasm/genetics , Humans , Sensitivity and Specificity