ABSTRACT
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
Subject(s)
Data Mining/methods; Pharmacogenetics; Data Collection; Information Storage and Retrieval; Natural Language Processing; Publications; Semantics
ABSTRACT
Most pharmacogenomics knowledge is contained in the text of published studies, and is thus not available for automated computation. Natural Language Processing (NLP) techniques for extracting relationships in specific domains often rely on hand-built rules and domain-specific ontologies to achieve good performance. In a new and evolving field such as pharmacogenomics (PGx), rules and ontologies may not be available. Recent progress in syntactic NLP parsing in the context of a large corpus of pharmacogenomics text provides new opportunities for automated relationship extraction. We describe an ontology of PGx relationships built starting from a lexicon of key pharmacogenomic entities and a syntactic parse of more than 87 million sentences from 17 million MEDLINE abstracts. We used the syntactic structure of PGx statements to systematically extract commonly occurring relationships and to map them to a common schema. Our extracted relationships have a 70-87.7% precision and involve not only key PGx entities such as genes, drugs, and phenotypes (e.g., VKORC1, warfarin, clotting disorder), but also critical entities that are frequently modified by these key entities (e.g., VKORC1 polymorphism, warfarin response, clotting disorder treatment). The result of our analysis is a network of 40,000 relationships between more than 200 entity types with clear semantics. This network is used to guide the curation of PGx knowledge and provide a computable resource for knowledge discovery.
Subject(s)
Pharmacogenetics/methods; Semantics; Databases, Factual; MEDLINE; Natural Language Processing; Terminology as Topic; United States
ABSTRACT
BACKGROUND: Pharmacogenomics studies the relationship between genetic variation and the variation in drug response phenotypes. The field is rapidly gaining importance: it promises drugs targeted to particular subpopulations based on genetic background. The pharmacogenomics literature has expanded rapidly, but is dispersed in many journals. It is challenging, therefore, to identify important associations between drugs and molecular entities, particularly genes and gene variants, and thus these critical connections are often lost. Text mining techniques can allow us to convert the free-style text to a computable, searchable format in which pharmacogenomic concepts (such as genes, drugs, polymorphisms, and diseases) are identified, and important links between these concepts are recorded. Availability of full text articles as input into text mining engines is key, as literature abstracts often do not contain sufficient information to identify these pharmacogenomic associations. RESULTS: Thus, building on a tool called Textpresso, we have created the Pharmspresso tool to assist in identifying important pharmacogenomic facts in full text articles. Pharmspresso parses text to find references to human genes, polymorphisms, drugs and diseases and their relationships. It presents these as a series of marked-up text fragments, in which key concepts are visually highlighted. To evaluate Pharmspresso, we used a gold standard of 45 human-curated articles. Pharmspresso identified 78%, 61%, and 74% of target gene, polymorphism, and drug concepts, respectively. CONCLUSION: Pharmspresso is a text analysis tool that extracts pharmacogenomic concepts from the literature automatically and thus captures our current understanding of gene-drug interactions in a computable form. We have made Pharmspresso available at http://pharmspresso.stanford.edu.
Subject(s)
Pharmacogenetics/methods; Software; Computational Biology/methods; Databases, Genetic; Information Storage and Retrieval/methods; Internet
ABSTRACT
Deciphering gene regulatory network architecture amounts to the identification of the regulators, conditions in which they act, genes they regulate, cis-acting motifs they bind, expression profiles they dictate and more complex relationships between alternative regulatory partnerships and alternative regulatory motifs that give rise to sub-modalities of expression profiles. The 'location data' in yeast is a comprehensive resource that provides transcription factor-DNA interaction information in vivo. Here, we provide two contributions: first, we developed means to assess the extent of noise in the location data, and consequently for extracting signals from it. Second, we couple signal extraction with better characterization of the genetic network architecture. We apply two methods for the detection of combinatorial associations between transcription factors (TFs), the integration of which provides a global map of combinatorial regulatory interactions. We discover the capacity of regulatory motifs and TF partnerships to dictate fine-tuned expression patterns of subsets of genes, which are clearly distinct from those displayed by most genes assigned to the same TF. Our findings provide carefully prioritized, high-quality assignments between regulators and regulated genes and as such should prove useful for experimental and computational biologists alike.
Subject(s)
Computational Biology/methods; DNA-Binding Proteins/metabolism; Gene Expression Regulation; Genomics/methods; Transcription Factors/metabolism; Binding Sites; DNA-Binding Proteins/analysis; Data Interpretation, Statistical; Fungal Proteins/metabolism; Gene Expression Profiling; Genome; Promoter Regions, Genetic; Regulatory Sequences, Nucleic Acid; Transcription Factors/analysis; Transcription, Genetic
ABSTRACT
In order to make more informed healthcare decisions, consumers need information systems that deliver accurate and reliable information about their illnesses and potential treatments. Reports of randomized clinical trials (RCTs) provide reliable medical evidence about the efficacy of treatments. Current methods to access, search for, and retrieve RCTs are keyword-based, time-consuming, and suffer from poor precision. Personalized semantic search and medical evidence summarization aim to solve this problem. The performance of these approaches may improve if they have access to study subject descriptors (e.g. age, gender, and ethnicity), trial sizes, and diseases/symptoms studied. We have developed a novel method to automatically extract such subject demographic information from RCT abstracts. We used text classification augmented with a Hidden Markov Model to identify sentences containing subject demographics, and subsequently these sentences were parsed using Natural Language Processing techniques to extract relevant information. Our results show accuracy levels of 82.5%, 92.5%, and 92.0% for extraction of subject descriptors, trial sizes, and diseases/symptoms descriptors respectively.
Subject(s)
Abstracting and Indexing; Information Storage and Retrieval/methods; Natural Language Processing; Markov Chains; Randomized Controlled Trials as Topic
ABSTRACT
Drug-drug interactions (DDIs) can occur when two drugs interact with the same gene product. Most available information about gene-drug relationships is contained within the scientific literature, but is dispersed over a large number of publications, with thousands of new publications added each month. In this setting, automated text mining is an attractive solution for identifying gene-drug relationships and aggregating them to predict novel DDIs. In previous work, we have shown that gene-drug interactions can be extracted from Medline abstracts with high fidelity: we extract not only the genes and drugs, but also the type of relationship expressed in individual sentences (e.g. metabolize, inhibit, activate and many others). We normalize these relationships and map them to a standardized ontology. In this work, we hypothesize that we can combine these normalized gene-drug relationships, drawn from a very broad and diverse literature, to infer DDIs. Using a training set of established DDIs, we have trained a random forest classifier to score potential DDIs based on the features of the normalized assertions extracted from the literature that relate two drugs to a gene product. The classifier recognizes the combinations of relationships, drugs and genes that are most associated with the gold standard DDIs, correctly identifying 79.8% of assertions relating interacting drug pairs and 78.9% of assertions relating noninteracting drug pairs. Most significantly, because our text processing method captures the semantics of individual gene-drug relationships, we can construct mechanistic pharmacological explanations for the newly proposed DDIs. We show how our classifier can be used to explain known DDIs and to uncover new DDIs that have not yet been reported.
Subject(s)
Data Mining/methods; Drug Interactions; Algorithms; Aryl Hydrocarbon Hydroxylases/genetics; Aryl Hydrocarbon Hydroxylases/metabolism; Computational Biology; Cytochrome P-450 CYP2C9; Cytochrome P-450 CYP3A/genetics; Cytochrome P-450 CYP3A/metabolism; Humans; Knowledge Bases; MEDLINE; Pharmacogenetics/statistics & numerical data; Verapamil/metabolism; Warfarin/metabolism
ABSTRACT
The need for efficient text-mining tools that support curation of the biomedical literature is ever increasing. In this article, we describe an experiment aimed at verifying whether a text-mining tool capable of extracting meaningful relationships among domain entities can be successfully integrated into the curation workflow of a major biological database. We evaluate in particular (i) the usability of the system's interface, as perceived by users, and (ii) the correlation of the ranking of interactions, as provided by the text-mining system, with the choices of the curators.
Subject(s)
Data Mining/methods; Database Management Systems; Databases, Factual; Pharmacogenetics; Abstracting and Indexing; Biomedical Research; Reproducibility of Results; User-Computer Interface
ABSTRACT
The mission of the Pharmacogenomics Knowledge Base (PharmGKB; www.pharmgkb.org ) is to collect, encode and disseminate knowledge about the impact of human genetic variations on drug responses. It is an important worldwide resource of clinical pharmacogenomic biomarkers available to all. The PharmGKB website has evolved to highlight our knowledge curation and aggregation over our previous emphasis on collecting primary data. This review summarizes the methods we use to drive this expanded scope of 'Knowledge Acquisition to Clinical Applications', the new features available on our website and our future goals.
Subject(s)
Biomarkers/metabolism; Databases, Factual; Pharmacogenetics; Genetic Variation; Humans; Internet; Knowledge Bases
ABSTRACT
BACKGROUND: Advances in Natural Language Processing (NLP) techniques enable the extraction of fine-grained relationships mentioned in biomedical text. The variability and the complexity of natural language in expressing similar relationships cause the extracted relationships to be highly heterogeneous, which makes the construction of knowledge bases difficult and poses a challenge in using these for data mining or question answering. RESULTS: We report on the semi-automatic construction of the PHARE relationship ontology (the PHArmacogenomic RElationships Ontology) consisting of 200 curated relations from over 40,000 heterogeneous relationships extracted via text-mining. These heterogeneous relations are then mapped to the PHARE ontology using synonyms, entity descriptions and hierarchies of entities and roles. Once mapped, relationships can be normalized and compared using the structure of the ontology to identify relationships that have similar semantics but different syntax. We compare and contrast the manual procedure with a fully automated approach using WordNet to quantify the degree of integration enabled by iterative curation and refinement of the PHARE ontology. The result of such integration is a repository of normalized biomedical relationships, named PHARE-KB, which can be queried using Semantic Web technologies such as SPARQL and can be visualized in the form of a biological network. CONCLUSIONS: The PHARE ontology serves as a common semantic framework to integrate more than 40,000 relationships pertinent to pharmacogenomics. The PHARE ontology forms the foundation of a knowledge base named PHARE-KB. Once populated with relationships, PHARE-KB (i) can be visualized in the form of a biological network to guide human tasks such as database curation and (ii) can be queried programmatically to guide bioinformatics applications such as the prediction of molecular interactions. PHARE is available at http://purl.bioontology.org/ontology/PHARE.
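The synonym-based mapping step described in this abstract can be illustrated with a minimal sketch. It is not the PHARE implementation: the synonym table, the list of inverted (passive) phrases, and the function name `normalize` are all hypothetical and chosen only to show how relationships with different syntax can collapse to the same normalized triple.

```python
# Hypothetical synonym table: surface phrase -> canonical relation name.
# These entries are illustrative and are NOT taken from the PHARE ontology.
SYNONYMS = {
    "metabolizes": "metabolize",
    "is metabolized by": "metabolize",
    "inhibits": "inhibit",
    "is an inhibitor of": "inhibit",
}

# Passive phrases whose subject and object must be swapped (assumption).
INVERTED = {"is metabolized by"}


def normalize(subject, phrase, obj):
    """Map a raw (subject, phrase, object) triple to a canonical triple.

    Returns None when the phrase is not in the synonym table.
    """
    key = " ".join(phrase.lower().split())  # lowercase, collapse whitespace
    relation = SYNONYMS.get(key)
    if relation is None:
        return None
    if key in INVERTED:
        subject, obj = obj, subject
    return (subject, relation, obj)


# Toy extracted relationships (invented for illustration).
raw = [
    ("CYP2C9", "metabolizes", "warfarin"),
    ("warfarin", "is metabolized by", "CYP2C9"),
    ("verapamil", "Inhibits", "CYP3A4"),
    ("warfarin", "co-occurs with", "VKORC1"),
]

# Using a set merges triples that are syntactically different but equivalent.
normalized = {t for t in (normalize(*r) for r in raw) if t is not None}
```

In this toy input, the active and passive statements about CYP2C9 and warfarin reduce to one triple, and the unmapped phrase is dropped rather than guessed at.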
ABSTRACT
The biomedical literature holds our understanding of pharmacogenomics, but it is dispersed across many journals. In order to integrate our knowledge, connect important facts across publications and generate new hypotheses we must organize and encode the contents of the literature. By creating databases of structured pharmacogenomic knowledge, we can make the value of the literature much greater than the sum of the individual reports. We can, for example, generate candidate gene lists or interpret surprising hits in genome-wide association studies. Text mining automatically adds structure to the unstructured knowledge embedded in millions of publications, and recent years have seen a surge in work on biomedical text mining, some specific to pharmacogenomics literature. These methods enable extraction of specific types of information and can also provide answers to general, systemic queries. In this article, we describe the main tasks of text mining in the context of pharmacogenomics, summarize recent applications and anticipate the next phase of text mining applications.
Subject(s)
Data Mining/trends; Pharmacogenetics/methods; Animals; Data Mining/methods; Databases, Genetic/trends; Humans; Information Storage and Retrieval/methods; Information Storage and Retrieval/trends; Pharmacogenetics/statistics & numerical data; Pharmacogenetics/trends
ABSTRACT
A critical goal of pharmacogenomics research is to identify genes that can explain variation in drug response. We have previously reported a method that creates a genome-scale ranking of genes likely to interact with a drug. The algorithm uses information about drug structure and indications of use to rank the genes. Although the algorithm has good performance, its performance depends on a curated set of drug-gene relationships that is expensive to create and difficult to maintain. In this work, we assess the utility of text mining in extracting a network of drug-gene relationships automatically. This provides a valuable aggregate source of knowledge, subsequently used as input into the algorithm that ranks potential pharmacogenes. Using a drug-gene network created from sentence-level co-occurrence in the full text of scientific articles, we compared the performance to that of a network created by manual curation of those articles. Under a wide range of conditions, we show that a knowledge base derived from text-mining the literature performs as well as, and sometimes better than, a high-quality, manually curated knowledge base. We conclude that we can use relationships mined automatically from the literature as a knowledgebase for pharmacogenomics relationships. Additionally, when relationships are missed by text mining, our system can accurately extrapolate new relationships with 77.4% precision.
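As a rough illustration of the sentence-level co-occurrence idea mentioned in this abstract, the sketch below counts how often a drug and a gene appear in the same sentence. It is a minimal sketch under stated assumptions, not the authors' pipeline: the lexicons `DRUGS` and `GENES`, the naive sentence splitter, and the function names are hypothetical, and real systems would use curated dictionaries and proper entity recognition.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy lexicons; stand-ins for curated drug and gene dictionaries.
DRUGS = {"warfarin", "verapamil"}      # matched case-insensitively
GENES = {"VKORC1", "CYP2C9", "CYP3A4"}  # matched case-sensitively


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurrence_network(documents):
    """Count, for each (drug, gene) pair, the sentences that mention both."""
    edges = Counter()
    for doc in documents:
        for sentence in split_sentences(doc):
            tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
            drugs = DRUGS & {t.lower() for t in tokens}
            genes = GENES & tokens
            for pair in product(sorted(drugs), sorted(genes)):
                edges[pair] += 1
    return edges


# Toy documents (invented for illustration).
docs = [
    "Warfarin dose is influenced by VKORC1 and CYP2C9. Verapamil was not studied.",
    "CYP3A4 metabolizes verapamil. VKORC1 variants alter warfarin response.",
]
network = cooccurrence_network(docs)
```

The edge weights are plain sentence counts; a drug and a gene mentioned in different sentences of the same document produce no edge, which is what distinguishes sentence-level from document-level co-occurrence.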
Subject(s)
Pharmacogenetics/statistics & numerical data; Algorithms; Computational Biology; Data Mining/statistics & numerical data; Humans; Knowledge Bases
ABSTRACT
The immune system of higher organisms is, by any standard, complex. To date, using reductionist techniques, immunologists have elucidated many of the basic principles of how the immune system functions, yet our understanding is still far from complete. In an era of high throughput measurements, it is already clear that the scientific knowledge we have accumulated has itself grown larger than our ability to cope with it, and thus it is increasingly important to develop bioinformatics tools with which to navigate the complexity of the information that is available to us. Here, we describe ImmuneXpresso, an information extraction system, tailored for parsing the primary literature of immunology and relating it to experimental data. The immune system is very much dependent on the interactions of various white blood cells with each other, either in synaptic contacts, at a distance using cytokines or chemokines, or both. Therefore, as a first approximation, we used ImmuneXpresso to create a literature derived network of interactions between cells and cytokines. Integration of cell-specific gene expression data facilitates cross-validation of cytokine mediated cell-cell interactions and suggests novel interactions. We evaluate the performance of our automatically generated multi-scale model against existing manually curated data, and show how this system can be used to guide experimentalists in interpreting multi-scale, experimental data. Our methodology is scalable and can be generalized to other systems.
Subject(s)
Cell Communication/immunology; Cytokines/immunology; Immune System/physiology; Knowledge Bases; Animals; Biometry; CD4-Positive T-Lymphocytes/immunology; Cytokines/blood; Databases, Factual; Female; Gene Expression Profiling/statistics & numerical data; Humans; Lymphocyte Subsets/immunology; Male; Mice
ABSTRACT
Oncologists managing cancer patients use radiology imaging studies to evaluate changes in measurable cancer lesions. Currently, the textual radiology report summarizes the findings, but is disconnected from the primary image data. This makes it difficult for the physician to obtain a visual overview of the location and behavior of the disease. LesionViewer is a prototype software system designed to assist clinicians in comprehending and reviewing radiology imaging studies. The interface provides an Anatomical Summary View of the location of lesions identified in a series of studies, and direct navigation to the relevant primary image data. LesionViewer's Disease Summary View provides a temporal abstraction of the disease behavior between studies utilizing methods of the RECIST guideline. In a usability study, nine physicians used the system to accurately perform clinical tasks appropriate to the analysis of radiology reports and image data. All users reported they would use the system if available.
Subject(s)
Neoplasms/pathology; Radiographic Image Interpretation, Computer-Assisted; Algorithms; Attitude of Health Personnel; Clinical Competence; Humans; Neoplasms/diagnostic imaging; Radiology Information Systems; Software
ABSTRACT
Numerous health decision aids (HDAs) have been developed to increase the participation of patients in shared decision-making, but many have limited accessibility and narrow applicability in clinical care. In the Health e-Decision project, we address these limitations in our work on building general HDAs targeted for older adults. Our approach uses a decision-support software architecture that enables principled methods for HDAs. We have formalized a novel knowledge-based decision model (KBDM), using Protégé OWL, that developers and clinicians can instantiate to tailor the components of the architecture for a particular health problem. In this paper, we present the methods used in the architecture and the knowledgebase design; the latter encompasses influence-diagram concepts, specific health problems, health outcome states, and probabilistic relationships. We discuss how this approach improves upon prior HDA methods. We also show that our use of computer-interpretable knowledge provides a structured, customizable means of enabling patient-centered decision support.