ABSTRACT
S-phase entry and exit are regulated by hundreds of protein complexes that assemble "just in time," orchestrated by a multitude of distinct events. To help understand their interplay, we have created a tailored visualization based on the Minardo layout, highlighting over 80 essential events. This complements our earlier visualization of M-phase, and both can be displayed together, giving a comprehensive overview of the events regulating the cell division cycle. To view this SnapShot, open or download the PDF.
Subjects
Cell Cycle/genetics, Mitosis/genetics, Multiprotein Complexes/genetics, S Phase/genetics, Cell Division/genetics, Cyclin B/genetics, Cyclin D/genetics, Cyclin-Dependent Kinases/genetics, G2 Phase/genetics, Humans, Phosphorylation/genetics, Proteasome Endopeptidase Complex/genetics
ABSTRACT
Although countless highly penetrant variants have been associated with Mendelian disorders, the genetic etiologies underlying complex diseases remain largely unresolved. By mining the medical records of over 110 million patients, we examine the extent to which Mendelian variation contributes to complex disease risk. We detect thousands of associations between Mendelian and complex diseases, revealing a nondegenerate, phenotypic code that links each complex disorder to a unique collection of Mendelian loci. Using genome-wide association results, we demonstrate that common variants associated with complex diseases are enriched in the genes indicated by this "Mendelian code." Finally, we detect hundreds of comorbidity associations among Mendelian disorders, and we use probabilistic genetic modeling to demonstrate that Mendelian variants likely contribute nonadditively to the risk for a subset of complex diseases. Overall, this study illustrates a complementary approach for mapping complex disease loci and provides unique predictions concerning the etiologies of specific diseases.
Subjects
Disease/genetics, Genetic Predisposition to Disease, Genome-Wide Association Study, Genetic Models, Personal Health Records, Humans, Penetrance, Single Nucleotide Polymorphism
ABSTRACT
MOTIVATION: Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly. RESULTS: In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ~5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%). AVAILABILITY AND IMPLEMENTATION: All resources are available through Zenodo https://doi.org/10.5281/zenodo.11243139 and GitHub https://doi.org/10.5281/zenodo.10289360.
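The interplay of a dictionary, a corpus-wide block list, and a document-specific block list can be illustrated with a minimal sketch. This is not the authors' tagger: the function name `tag_document`, the data layout, and the example identifiers are all hypothetical, and matching is reduced to case-insensitive whole-word lookup.

```python
import re

def tag_document(doc_id, text, dictionary, global_block, doc_block):
    """Return sorted (start, end, matched_text, identifier) tuples.

    dictionary   : name -> identifier (hypothetical identifiers)
    global_block : lowercase names blocked in every document
    doc_block    : doc_id -> lowercase names blocked only in that document
    """
    blocked_here = doc_block.get(doc_id, set())
    matches = []
    for name, identifier in dictionary.items():
        key = name.lower()
        # Skip names blocked corpus-wide or for this particular document.
        if key in global_block or key in blocked_here:
            continue
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((m.start(), m.end(), m.group(0), identifier))
    return sorted(matches)

# Hypothetical toy data for illustration only.
dictionary = {"TP53": "gene:1", "CAT": "gene:2", "ACT": "gene:3"}
global_block = {"act"}            # too ambiguous anywhere in the corpus
doc_block = {"doc1": {"cat"}}     # "cat" means the animal in doc1
text = "The cat study shows TP53 and ACT levels."

print([m[2] for m in tag_document("doc1", text, dictionary, global_block, doc_block)])
print([m[2] for m in tag_document("doc2", text, dictionary, global_block, doc_block)])
```

In this toy setup the name "cat" is suppressed only in `doc1`, while "ACT" is suppressed everywhere, which mirrors the distinction the abstract draws between document-specific and corpus-wide blocking.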
Subjects
Deep Learning, Factual Databases, Dictionaries as Topic, Computational Biology/methods, Natural Language Processing, Data Mining/methods
ABSTRACT
MOTIVATION: Understanding biological processes relies heavily on curated knowledge of physical interactions between proteins. Yet, a notable gap remains between the information stored in databases of curated knowledge and the plethora of interactions documented in the scientific literature. RESULTS: To bridge this gap, we introduce ComplexTome, a manually annotated corpus designed to facilitate the development of text-mining methods for the extraction of complex formation relationships among biomedical entities targeting the downstream semantics of the physical interaction subnetwork of the STRING database. This corpus comprises 1287 documents with ~3500 relationships. We train a novel relation extraction model on this corpus and find that it can highly reliably identify physical protein interactions (F1-score = 82.8%). We additionally enhance the model's capabilities through unsupervised trigger word detection and apply it to extract relations and trigger words for these relations from all open publications in the domain literature. This information has been fully integrated into the latest version of the STRING database. AVAILABILITY AND IMPLEMENTATION: We provide the corpus, code, and all results produced by the large-scale runs of our systems on biomedical literature via Zenodo https://doi.org/10.5281/zenodo.8139716, GitHub https://github.com/farmeh/ComplexTome_extraction, and the latest version of the STRING database https://string-db.org/.
Subjects
Data Mining, Protein Databases, Data Mining/methods, Protein Interaction Mapping/methods, Proteins/metabolism, Proteins/chemistry
ABSTRACT
MOTIVATION: Despite lifestyle factors (LSFs) being increasingly acknowledged in shaping individual health trajectories, particularly in chronic diseases, they have still not been systematically described in the biomedical literature. This is in part because no named entity recognition (NER) system exists that can comprehensively detect all types of LSFs in text. The task is challenging due to their inherent diversity, lack of a comprehensive LSF classification for dictionary-based NER, and lack of a corpus for deep learning-based NER. RESULTS: We present a novel Lifestyle Factor Ontology (LSFO), which we used to develop a dictionary-based system for recognition and normalization of LSFs. Additionally, we introduce a manually annotated corpus for LSFs (LSF200) suitable for training and evaluation of NER systems, and use it to train a transformer-based system. Evaluating the performance of both NER systems on the corpus revealed an F-score of 64% for the dictionary-based system and 76% for the transformer-based system. Large-scale application of these systems on PubMed abstracts and PMC Open Access articles identified over 300 million mentions of LSFs in the biomedical literature. AVAILABILITY: LSFO, the annotated LSF200 corpus, and the detected LSFs in PubMed and PMC-OA articles using both NER systems are available under open licenses via the following GitHub repository: https://github.com/EsmaeilNourani/LSFO-expansion. This repository contains links to two associated GitHub repositories and a Zenodo project related to the study. LSFO is also available at BioPortal: https://bioportal.bioontology.org/ontologies/LSFO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
MOTIVATION: Protein networks are commonly used for understanding how proteins interact. However, they are typically biased by data availability, favoring well-studied proteins with more interactions. To uncover functions of understudied proteins, we must use data that are not affected by this literature bias, such as single-cell RNA-seq and proteomics. Due to data sparseness and redundancy, functional association analysis becomes complex. RESULTS: To address this, we have developed FAVA (Functional Associations using Variational Autoencoders), which compresses high-dimensional data into a low-dimensional space. FAVA infers networks from high-dimensional omics data with much higher accuracy than existing methods, across a diverse collection of real as well as simulated datasets. FAVA can process large datasets with over 0.5 million conditions and has predicted 4210 interactions between 1039 understudied proteins. Our findings showcase FAVA's capability to offer novel perspectives on protein interactions. FAVA functions within the scverse ecosystem, employing AnnData as its input source. AVAILABILITY AND IMPLEMENTATION: Source code, documentation, and tutorials for FAVA are accessible on GitHub at https://github.com/mikelkou/fava. FAVA can also be installed and used via pip/PyPI.
Subjects
Proteomics, Single-Cell Gene Expression Analysis, Gene Expression Profiling, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Software
ABSTRACT
The Illuminating the Druggable Genome (IDG) project aims to improve our understanding of understudied proteins and our ability to study them in the context of disease biology by perturbing them with small molecules, biologics, or other therapeutic modalities. Two main products from the IDG effort are the Target Central Resource Database (TCRD) (http://juniper.health.unm.edu/tcrd/), which curates and aggregates information, and Pharos (https://pharos.nih.gov/), a web interface for users to extract and visualize data from TCRD. Since the 2021 release, TCRD/Pharos has focused on developing visualization and analysis tools that help reveal higher-level patterns in the underlying data. The current iterations of TCRD and Pharos enable users to perform enrichment calculations based on subsets of targets, diseases, or ligands and to create interactive heat maps and UpSet charts of many types of annotations. Using several examples, we show how to address disease biology and drug discovery questions through enrichment calculations and UpSet charts.
Subjects
Factual Databases, Molecular Targeted Therapy, Proteome, Humans, Biological Products, Drug Discovery, Internet, Proteome/drug effects
ABSTRACT
Multiple myeloma (MM) is a neoplasia of B plasma cells that often induces bone pain. However, the mechanisms underlying myeloma-induced bone pain (MIBP) are mostly unknown. Using a syngeneic MM mouse model, we show that periosteal nerve sprouting of calcitonin gene-related peptide (CGRP+) and growth associated protein 43 (GAP43+) fibers occurs concurrently with the onset of nociception and its blockade provides transient pain relief. MM patient samples also showed increased periosteal innervation. Mechanistically, we investigated MM-induced gene expression changes in the dorsal root ganglia (DRG) innervating the MM-bearing bone of male mice and found alterations in pathways associated with cell cycle, immune response and neuronal signaling. The MM transcriptional signature was consistent with metastatic MM infiltration to the DRG, a never-before described feature of the disease that we further demonstrated histologically. In the DRG, MM cells caused loss of vascularization and neuronal injury, which may contribute to late-stage MIBP. Interestingly, the transcriptional signature of an MM patient was consistent with MM cell infiltration to the DRG. Overall, our results suggest that MM induces a plethora of peripheral nervous system alterations that may contribute to the failure of current analgesics and suggest neuroprotective drugs as appropriate strategies to treat early onset MIBP.
SIGNIFICANCE STATEMENT: Multiple myeloma (MM) is a painful bone marrow cancer that significantly impairs the quality of life of the patients. Analgesic therapies for myeloma-induced bone pain (MIBP) are limited and often ineffective, and the mechanisms of MIBP remain unknown. In this manuscript, we describe cancer-induced periosteal nerve sprouting in a mouse model of MIBP, where we also encounter metastasis to the dorsal root ganglia (DRG), a never-before described feature of the disease. Concomitant to myeloma infiltration, the lumbar DRGs presented blood vessel damage and transcriptional alterations, which may mediate MIBP. Explorative studies on human tissue support our preclinical findings. Understanding the mechanisms of MIBP is crucial to develop targeted analgesics with better efficacy and fewer side effects for this patient population.
Subjects
Bone Diseases, Multiple Myeloma, Nerve Tissue, Humans, Mice, Male, Animals, Multiple Myeloma/complications, Multiple Myeloma/metabolism, Multiple Myeloma/pathology, Quality of Life, Pain/metabolism, Nerve Tissue/metabolism, Nerve Tissue/pathology, Spinal Ganglia/metabolism
ABSTRACT
The rising prevalence of liver diseases related to obesity and excessive use of alcohol is fuelling an increasing demand for accurate biomarkers aimed at community screening, diagnosis of steatohepatitis and significant fibrosis, monitoring, prognostication and prediction of treatment efficacy. Breakthroughs in omics methodologies and the power of bioinformatics have created an excellent opportunity to apply technological advances to clinical needs, for instance in the development of precision biomarkers for personalised medicine. Via omics technologies, biological processes from the genes to circulating protein, as well as the microbiome - including bacteria, viruses and fungi, can be investigated on an axis. However, there are important barriers to omics-based biomarker discovery and validation, including the use of semi-quantitative measurements from untargeted platforms, which may exhibit high analytical, inter- and intra-individual variance. Standardising methods and the need to validate them across diverse populations presents a challenge, partly due to disease complexity and the dynamic nature of biomarker expression at different disease stages. Lack of validity causes lost opportunities when studies fail to provide the knowledge needed for regulatory approvals, all of which contributes to a delayed translation of these discoveries into clinical practice. While no omics-based biomarkers have matured to clinical implementation, the extent of data generated has enabled the hypothesis-free discovery of a plethora of candidate biomarkers that warrant further validation. To explore the many opportunities of omics technologies, hepatologists need detailed knowledge of commonalities and differences between the various omics layers, and both the barriers to and advantages of these approaches.
Subjects
Biomarkers, Humans, Biomarkers/analysis, Biomarkers/metabolism, Fatty Liver/diagnosis, Fatty Liver/genetics, Proteomics/methods, Metabolomics/methods, Genomics/methods
ABSTRACT
MOTIVATION: The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in many named entity recognition tasks, results for species name recognition remain poor. We hypothesize that this is primarily due to the lack of appropriate corpora. RESULTS: We introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We demonstrate that S1000 makes highly accurate recognition of species names possible (F-score = 93.1%), both for deep learning and dictionary-based methods. AVAILABILITY AND IMPLEMENTATION: All resources introduced in this study are available under open licenses from https://jensenlab.org/resources/s1000/. The webpage contains links to a Zenodo project and three GitHub repositories associated with the study.
Subjects
Data Mining, Data Mining/methods
ABSTRACT
Hypothesis-free high-throughput profiling allows relative quantification of thousands of proteins or transcripts across samples and thereby identification of differentially expressed genes. It is used in many biological contexts to characterize differences between cell lines and tissues, identify drug mode of action or drivers of drug resistance, among others. Changes in gene expression can also be due to confounding factors that were not accounted for in the experimental plan, such as change in cell proliferation. We combined the analysis of 1,076 and 1,040 cell lines in five proteomics and three transcriptomics data sets to identify 157 genes that correlate with cell proliferation rates. These include actors in DNA replication and mitosis, and genes periodically expressed during the cell cycle. This signature of cell proliferation is a valuable resource when analyzing high-throughput data showing changes in proliferation across conditions. We show how to use this resource to help in interpretation of in vitro drug screens and tumor samples. It informs on differences of cell proliferation rates between conditions where such information is not directly available. The signature genes also highlight which hits in a screen may be due to proliferation changes; this can either contribute to biological interpretation or help focus on experiment-specific regulation events otherwise buried in the statistical analysis.
Subjects
Proteomics, Transcriptome, Transcriptome/genetics, Gene Expression Profiling, Cell Proliferation/genetics, Mitosis
ABSTRACT
In 2014, the National Institutes of Health (NIH) initiated the Illuminating the Druggable Genome (IDG) program to identify and improve our understanding of poorly characterized proteins that can potentially be modulated using small molecules or biologics. Two resources produced from these efforts are the Target Central Resource Database (TCRD) (http://juniper.health.unm.edu/tcrd/) and Pharos (https://pharos.nih.gov/), a web interface to browse the TCRD. The ultimate goal of these resources is to highlight and facilitate research into currently understudied proteins by aggregating a multitude of data sources, ranking targets based on the amount of data available, and presenting data in a machine-learning-ready format. Since the 2017 release, both TCRD and Pharos have produced two major releases, which have incorporated or expanded an additional 25 data sources. Recently incorporated data types include human and viral-human protein-protein interactions, protein-disease and protein-phenotype associations, and drug-induced gene signatures, among others. These aggregated data have enabled us to generate new visualizations and content sections in Pharos, in order to empower users to find new areas of study in the druggable genome.
Subjects
Factual Databases, Human Genome, Neurodegenerative Diseases/genetics, Proteomics/methods, Software, Virus Diseases/genetics, Animals, Anticonvulsants/chemistry, Anticonvulsants/therapeutic use, Antiviral Agents/chemistry, Antiviral Agents/therapeutic use, Biological Products/chemistry, Biological Products/therapeutic use, Data Mining/statistics & numerical data, Host-Pathogen Interactions/drug effects, Host-Pathogen Interactions/genetics, Humans, Internet, Machine Learning/statistics & numerical data, Mice, Knockout Mice, Molecular Targeted Therapy/methods, Neurodegenerative Diseases/classification, Neurodegenerative Diseases/drug therapy, Neurodegenerative Diseases/virology, Protein Interaction Mapping, Proteome/agonists, Proteome/antagonists & inhibitors, Proteome/genetics, Proteome/metabolism, Small Molecule Libraries/chemistry, Small Molecule Libraries/therapeutic use, Virus Diseases/classification, Virus Diseases/drug therapy, Virus Diseases/virology
ABSTRACT
MOTIVATION: Genome-wide association studies can reveal important genotype-phenotype associations; however, data quality and interpretability issues must be addressed. For drug discovery scientists seeking to prioritize targets based on the available evidence, these issues go beyond the single study. RESULTS: Here, we describe rational ranking, filtering and interpretation of inferred gene-trait associations and data aggregation across studies by leveraging existing curation and harmonization efforts. Each gene-trait association is evaluated for confidence, with scores derived solely from aggregated statistics, linking a protein-coding gene and phenotype. We propose a method for assessing confidence in gene-trait associations from evidence aggregated across studies, including a bibliometric assessment of scientific consensus based on the iCite relative citation ratio, and meanRank scores, to aggregate multivariate evidence. This method, intended for drug target hypothesis generation, scoring and ranking, has been implemented as an analytical pipeline, available as open source, with public datasets of results, and a web application designed for usability by drug discovery scientists. AVAILABILITY AND IMPLEMENTATION: Web application, datasets and source code via https://unmtid-shinyapps.net/tiga/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
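The idea of aggregating several evidence variables through a mean rank can be sketched generically as follows. The abstract does not define meanRank, so this is an assumption-laden illustration rather than the published pipeline: the function name `mean_rank`, the variable names `n_study` and `rcr`, the "higher is better" convention, and the tie handling are all hypothetical.

```python
def mean_rank(rows, variables):
    """Average the per-variable ranks of each item (rank 1 = strongest).

    rows      : item -> {variable: value}, higher value = stronger evidence
    variables : evidence variables to aggregate (hypothetical names)
    Ties receive the same rank (competition ranking: 1, 1, 3, ...).
    """
    ranks = {item: [] for item in rows}
    for var in variables:
        ordered = sorted(rows, key=lambda item: rows[item][var], reverse=True)
        prev_val, prev_rank = None, 0
        for pos, item in enumerate(ordered, start=1):
            val = rows[item][var]
            rank = prev_rank if val == prev_val else pos
            ranks[item].append(rank)
            prev_val, prev_rank = val, rank
    return {item: sum(r) / len(r) for item, r in ranks.items()}

# Hypothetical gene-trait associations with two evidence variables.
associations = {
    "geneA-trait": {"n_study": 5, "rcr": 2.0},
    "geneB-trait": {"n_study": 3, "rcr": 4.0},
    "geneC-trait": {"n_study": 1, "rcr": 1.0},
}
scores = mean_rank(associations, ["n_study", "rcr"])
for item in sorted(scores, key=scores.get):
    print(item, scores[item])
```

A rank-based aggregate of this kind avoids having to put heterogeneous variables on a common scale, at the cost of discarding the magnitude of differences between items.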
Subjects
Genome-Wide Association Study, Lighting, Genotype, Single Nucleotide Polymorphism, Phenotype
ABSTRACT
MOTIVATION: Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. RESULTS: We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications. AVAILABILITY AND IMPLEMENTATION: CoCoScore is available at: https://github.com/JungeAlexander/cocoscore. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
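A context-aware co-occurrence score can be sketched as follows. The abstract gives no formula, so everything here is an assumption for illustration: each co-mentioning sentence is taken to carry a score from some classifier, the scores are summed per entity pair in place of raw counts, and the weighted count is combined with an observed-over-expected ratio using a hypothetical exponent `alpha`.

```python
from collections import defaultdict

def cooccurrence_scores(comentions, alpha=0.65):
    """Score entity pairs from sentence-level co-mention scores.

    comentions : iterable of (entity_a, entity_b, sentence_score), where
                 sentence_score in [0, 1] comes from any sentence classifier
    alpha      : hypothetical weight between the weighted count and the
                 observed-over-expected ratio
    """
    pair = defaultdict(float)
    marg_a = defaultdict(float)
    marg_b = defaultdict(float)
    total = 0.0
    for a, b, s in comentions:
        pair[(a, b)] += s      # weighted co-occurrence count C(a, b)
        marg_a[a] += s         # C(a, .)
        marg_b[b] += s         # C(., b)
        total += s             # C(., .)
    scores = {}
    for (a, b), c in pair.items():
        if c <= 0:
            continue  # no supporting evidence; also avoids division by zero
        ratio = c * total / (marg_a[a] * marg_b[b])
        scores[(a, b)] = (c ** alpha) * (ratio ** (1 - alpha))
    return scores

# Hypothetical disease-gene co-mentions with classifier scores.
comentions = [
    ("disease1", "gene1", 0.9),
    ("disease1", "gene1", 0.8),
    ("disease1", "gene2", 0.1),
    ("disease2", "gene2", 0.7),
]
scores = cooccurrence_scores(comentions)
print(sorted(scores, key=scores.get, reverse=True))
```

Setting every sentence score to 1 reduces this sketch to plain count-based co-occurrence scoring, which makes the contribution of the sentence-level context explicit.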
Subjects
Computational Biology, Data Mining, Natural Language Processing, Publications, Computational Biology/methods, Humans, Proteins/genetics
ABSTRACT
Gene expression studies are reported to be influenced by pre-analytical factors that can compromise RNA yield and integrity, which in turn may confound the experimental findings. Here we investigate the impact of four pre-analytical factors on brain-derived RNA: time-before-collection, tissue specimen size, tissue collection method, and RNA isolation method. We report no significant differences in RNA yield or integrity between 20 mg and 60 mg tissue samples collected in either liquid nitrogen or the RNAlater stabilizing solution. Isolation of RNA employing the TRIzol reagent resulted in a higher yield compared to isolation via the QIAcube kit while the latter resulted in RNA of slightly better integrity. Keeping brain tissue samples at room temperature for up to 160 min prior to collection and isolation of RNA resulted in no significant difference in yield or integrity. Our findings have significant practical and financial consequences for clinical genomic departments and other laboratory settings performing large-scale routine RNA expression analysis of brain samples.
Subjects
Brain/metabolism, RNA/metabolism, Animals, Mice, RNA/isolation & purification, RNA Stability, Specimen Handling/methods, Temperature, Time Factors
ABSTRACT
MOTIVATION: Long non-coding RNAs (lncRNAs) are important regulators in a wide variety of biological processes, which are linked to many diseases. Compared to protein-coding genes (PCGs), the association between diseases and lncRNAs is still not well studied. Thus, inferring disease-associated lncRNAs on a genome-wide scale has become imperative. RESULTS: In this study, we propose a machine learning-based method, DislncRF, which infers disease-associated lncRNAs on a genome-wide scale based on tissue expression profiles. DislncRF uses random forest models trained on expression profiles of known disease-associated PCGs across human tissues to extract general patterns between expression profiles and diseases. These models are then applied to score associations between lncRNAs and diseases. DislncRF was benchmarked against a gold standard dataset and compared to other methods. The results show that DislncRF yields promising performance and outperforms the existing methods. The utility of DislncRF is further substantiated on two diseases in which we find that top scoring candidates are supported by literature or independent datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/xypan1232/DislncRF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Long Noncoding RNA/genetics, Genome, Humans, Machine Learning
ABSTRACT
Tailored therapy aims to cure cancer patients effectively and safely, based on the complex interactions between patients' genomic features, disease pathology and drug metabolism. Thus, the continual increase in scientific literature drives the need for efficient methods of data mining to improve the extraction of useful information from texts based on patients' genomic features. An important application of text mining to tailored therapy in cancer encompasses the use of mutations and cancer fusion genes as moieties that change patients' cellular networks to develop cancer, and also affect drug metabolism. Fusion proteins, which are derived from the slippage of two parental genes, are produced in cancer by chromosomal aberrations and trans-splicing. Given that the two parental proteins for predicted fusion proteins are known, we used our previously developed method for identifying chimeric protein-protein interactions (ChiPPIs) associated with the fusion proteins. Here, we present a validation approach that receives fusion proteins of interest, predicts their cellular network alterations by ChiPPI and validates them by our new method, ProtFus, using an online literature search. This process resulted in a set of 358 fusion proteins and their corresponding protein interactions, as a training set for a Naïve Bayes classifier, to identify predicted fusion proteins that have reliable evidence in the literature and that were confirmed experimentally. Next, for a test group of 1817 fusion proteins, we were able to identify from the literature 2908 PPIs in total, across 18 cancer types. The described method, ProtFus, can be used for screening the literature to identify unique cases of fusion proteins and their PPIs, as means of studying alterations of protein networks in cancers. Availability: http://protfus.md.biu.ac.il/.
Subjects
Data Mining/methods, Oncogene Fusion Proteins/genetics, Protein Interaction Mapping/methods, Algorithms, Bayes Theorem, Big Data, Computational Biology, Data Mining/statistics & numerical data, Genetic Databases, Humans, Mutation, Neoplasms/genetics, Neoplasms/therapy, Oncogene Fusion Proteins/chemistry, Oncogene Fusion Proteins/metabolism, Precision Medicine, Protein Interaction Mapping/statistics & numerical data, Protein Interaction Maps
ABSTRACT
Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.
Subjects
Computational Biology/standards, Genomics/standards, Phylogeny, Proteomics/standards, Archaea/classification, Archaea/genetics, Bacteria/classification, Bacteria/genetics, Computational Biology/methods, Genetic Databases, Eukaryota/classification, Eukaryota/genetics, Gene Ontology, Genomics/methods, Genetic Models, Proteomics/methods, Protein Sequence Analysis, Sequence Homology, Species Specificity
ABSTRACT
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 200 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.