Results 1 - 20 of 31
1.
Nat Genet ; 51(2): 364, 2019 02.
Article in English | MEDLINE | ID: mdl-30559491

ABSTRACT

In the version of this article originally published, the name of author Serafim Batzoglou was misspelled. The error has been corrected in the HTML and PDF versions of the article.

3.
Cell ; 174(3): 758-769.e9, 2018 07 26.
Article in English | MEDLINE | ID: mdl-30033370

ABSTRACT

While mutations affecting protein-coding regions have been examined across many cancers, structural variants at the genome-wide level are still poorly defined. Through integrative deep whole-genome and -transcriptome analysis of 101 castration-resistant prostate cancer metastases (109X tumor/38X normal coverage), we identified structural variants altering critical regulators of tumorigenesis and progression not detectable by exome approaches. Notably, we observed amplification of an intergenic enhancer region 624 kb upstream of the androgen receptor (AR) in 81% of patients, correlating with increased AR expression. Tandem duplication hotspots also occur near MYC, in lncRNAs associated with post-translational MYC regulation. Classes of structural variations were linked to distinct DNA repair deficiencies, suggesting their etiology, including associations of CDK12 mutation with tandem duplications, TP53 inactivation with inverted rearrangements and chromothripsis, and BRCA2 inactivation with deletions. Together, these observations provide a comprehensive view of how structural variations affect critical regulators in metastatic prostate cancer.


Subjects
Genomic Structural Variation/genetics , Prostatic Neoplasms/genetics , Aged , Aged, 80 and over , BRCA2 Protein/metabolism , Cyclin-Dependent Kinases/metabolism , DNA Copy Number Variations , Exome , Gene Expression Profiling/methods , Genomics/methods , Humans , Male , Middle Aged , Mutation , Neoplasm Metastasis/genetics , Proto-Oncogene Proteins c-myc/genetics , Proto-Oncogene Proteins c-myc/metabolism , Receptors, Androgen/genetics , Receptors, Androgen/metabolism , Tandem Repeat Sequences/genetics , Tumor Suppressor Protein p53/metabolism , Whole Genome Sequencing/methods
4.
Nat Genet ; 50(8): 1161-1170, 2018 08.
Article in English | MEDLINE | ID: mdl-30038395

ABSTRACT

Millions of human genomes and exomes have been sequenced, but their clinical applications remain limited due to the difficulty of distinguishing disease-causing mutations from benign genetic variation. Here we demonstrate that common missense variants in other primate species are largely clinically benign in human, enabling pathogenic mutations to be systematically identified by the process of elimination. Using hundreds of thousands of common variants from population sequencing of six non-human primate species, we train a deep neural network that identifies pathogenic mutations in rare disease patients with 88% accuracy and enables the discovery of 14 new candidate genes in intellectual disability at genome-wide significance. Cataloging common variation from additional primate species would improve interpretation for millions of variants of uncertain significance, further advancing the clinical utility of human genome sequencing.


Subjects
Genome, Human , Mutation , Nerve Net/physiology , Animals , Exome , Genetic Predisposition to Disease , Humans , Intellectual Disability/genetics , Intellectual Disability/pathology , Primates
5.
Hum Mutat ; 37(11): 1215-1222, 2016 11.
Article in English | MEDLINE | ID: mdl-27539938

ABSTRACT

Acute intermittent porphyria results from hydroxymethylbilane synthase (HMBS) mutations that markedly decrease HMBS enzymatic activity. This dominant disease is diagnosed when heterozygotes have life-threatening acute attacks, while most heterozygotes remain asymptomatic and undiagnosed. Although >400 HMBS mutations have been reported, the prevalence of pathogenic HMBS mutations in genomic/exomic databases, and the actual disease penetrance are unknown. Thus, we interrogated genomic/exomic databases, identified non-synonymous variants (NSVs) and consensus splice-site variants (CSSVs) in various demographic/racial groups, and determined the NSV's pathogenicity by prediction algorithms and in vitro expression assays. Caucasians had the most: 58 NSVs and two CSSVs among ∼92,000 alleles, a 0.00575 combined allele frequency. In silico algorithms predicted 14 out of 58 NSVs as "likely-pathogenic." In vitro expression identified 10 out of 58 NSVs as likely-pathogenic (seven predicted in silico), which together with two CSSVs had a combined allele frequency of 0.00056. Notably, six presumably pathogenic mutations/NSVs in the Human Gene Mutation Database were benign. Compared with the recent prevalence estimate of symptomatic European heterozygotes (∼0.000005), the prevalence of likely-pathogenic HMBS mutations among Caucasians was >100 times more frequent. Thus, the estimated penetrance of acute attacks was ∼1% of heterozygotes with likely-pathogenic mutations, highlighting the importance of predisposing/protective genes and environmental modifiers that precipitate/prevent the attacks.
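The penetrance estimate in this abstract can be reproduced as a ratio of two of its reported figures. The following is a minimal Python sketch of that arithmetic. It assumes that the combined allele frequency is compared directly with the prevalence of symptomatic heterozygotes, which the abstract's ">100 times" comparison suggests but does not state. The variable names are illustrative only.

```python
# Figures quoted in the abstract above.
symptomatic_prevalence = 0.000005  # approx. prevalence of symptomatic European heterozygotes
likely_pathogenic_freq = 0.00056   # combined allele frequency of likely-pathogenic variants

# Assumption of this sketch: the two figures are compared directly,
# with no separate conversion from allele frequency to carrier frequency.
fold_excess = likely_pathogenic_freq / symptomatic_prevalence
penetrance = symptomatic_prevalence / likely_pathogenic_freq

print(f"fold excess: {fold_excess:.0f}")
print(f"estimated penetrance: {penetrance:.1%}")
```

Under that assumption the ratio is about 112-fold and the penetrance about 0.9%, consistent with the ">100 times" and "∼1%" reported in the abstract.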


Subjects
Genetic Variation , Penetrance , Porphyria, Acute Intermittent/genetics , White People/genetics , Computer Simulation , Female , Gene Frequency , Humans , Male , Porphyria, Acute Intermittent/ethnology , Sequence Analysis, DNA
6.
Bioinformatics ; 32(18): 2883-5, 2016 09 15.
Article in English | MEDLINE | ID: mdl-27256315

ABSTRACT

Descriptions of genetic variations and their effect are widely spread across the biomedical literature. However, finding all mentions of a specific variation, or all mentions of variations in a specific gene, is difficult to achieve due to the many ways such variations are described. Here, we describe SETH, a tool for the recognition of variations from text and their subsequent normalization to dbSNP or UniProt. SETH achieves high precision and recall on several evaluation corpora of PubMed abstracts. It is freely available and encompasses stand-alone scripts for isolated application and evaluation as well as thorough documentation for integration into other applications. AVAILABILITY AND IMPLEMENTATION: SETH is released under the Apache 2.0 license and can be downloaded from http://rockt.github.io/SETH/ CONTACT: thomas@informatik.hu-berlin.de or leser@informatik.hu-berlin.de.
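To give a sense of what recognizing variation mentions in text involves, here is a deliberately small Python sketch. It is not SETH's method, which the abstract does not describe. The two regular expressions and the function name `find_variant_mentions` are hypothetical, and they cover only two common notations. Normalization to dbSNP or UniProt is not attempted.

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
# Coding-DNA substitution, e.g. "c.1799T>A".
DNA_SUBSTITUTION = re.compile(r"\bc\.(\d+)([ACGT])>([ACGT])\b")
# Protein substitution with three-letter residue codes, e.g. "p.Val600Glu".
PROTEIN_SUBSTITUTION = re.compile(r"\bp\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\b")


def find_variant_mentions(text):
    """Return (kind, matched_text) tuples in order of appearance."""
    hits = []
    for kind, pattern in (("dna", DNA_SUBSTITUTION), ("protein", PROTEIN_SUBSTITUTION)):
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group(0)))
    # Sort by character offset so mentions come back in reading order.
    return [(kind, mention) for _, kind, mention in sorted(hits)]


example = "BRAF c.1799T>A (p.Val600Glu) was reported in the cohort."
print(find_variant_mentions(example))
```

Free-text descriptions such as "a valine to glutamate change at position 600" would be missed by patterns like these, which illustrates the difficulty the abstract points to.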


Subjects
Data Curation , Data Mining , Genetic Variation , Computational Biology/methods , Genes , Humans , Information Storage and Retrieval/methods , Natural Language Processing , PubMed , Publications , Terminology as Topic
7.
Bioinformatics ; 32(12): i101-i110, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307606

ABSTRACT

MOTIVATION: Underrepresentation of racial groups represents an important challenge and major gap in phenomics research. Most of the current human phenomics research is based primarily on European populations; hence it is an important challenge to expand it to consider other population groups. One approach is to utilize data from EMR databases that contain patient data from diverse demographics and ancestries. The implications of this racial underrepresentation of data can be profound regarding effects on the healthcare delivery and actionability. To the best of our knowledge, our work is the first attempt to perform comparative, population-scale analyses of disease networks across three different populations, namely Caucasian (EA), African American (AA) and Hispanic/Latino (HL). RESULTS: We compared susceptibility profiles and temporal connectivity patterns for 1988 diseases and 37 282 disease pairs represented in a clinical population of 1 025 573 patients. Accordingly, we revealed appreciable differences in disease susceptibility, temporal patterns, network structure and underlying disease connections between EA, AA and HL populations. We found 2158 significantly comorbid diseases for the EA cohort, 3265 for AA and 672 for HL. We further outlined key disease pair associations unique to each population as well as categorical enrichments of these pairs. Finally, we identified 51 key 'hub' diseases that are the focal points in the race-centric networks and of particular clinical importance. Incorporating race-specific disease comorbidity patterns will produce a more accurate and complete picture of the disease landscape overall and could support more precise understanding of disease relationships and patient management towards improved clinical outcomes. CONTACTS: rong.chen@mssm.edu or joel.dudley@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Electronic Health Records , Black or African American , Databases, Factual , Hispanic or Latino , Humans , White People
8.
Nat Biotechnol ; 34(5): 531-8, 2016 05.
Article in English | MEDLINE | ID: mdl-27065010

ABSTRACT

Genetic studies of human disease have traditionally focused on the detection of disease-causing mutations in afflicted individuals. Here we describe a complementary approach that seeks to identify healthy individuals resilient to highly penetrant forms of genetic childhood disorders. A comprehensive screen of 874 genes in 589,306 genomes led to the identification of 13 adults harboring mutations for 8 severe Mendelian conditions, with no reported clinical manifestation of the indicated disease. Our findings demonstrate the promise of broadening genetic studies to systematically search for well individuals who are buffering the effects of rare, highly penetrant, deleterious mutations. They also indicate that incomplete penetrance for Mendelian diseases is likely more common than previously believed. The identification of resilient individuals may provide a first step toward uncovering protective genetic variants that could help elucidate the mechanisms of Mendelian diseases and new therapeutic strategies.


Subjects
Chromosome Mapping/methods , Disease Resistance/genetics , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/genetics , Genome, Human/genetics , Mendelian Randomization Analysis/methods , Child , Child, Preschool , Chromosome Mapping/statistics & numerical data , DNA Mutational Analysis/methods , Female , Genetic Predisposition to Disease/genetics , Genetic Testing/methods , Humans , Infant , Infant, Newborn , Male , Mendelian Randomization Analysis/statistics & numerical data , Polymorphism, Single Nucleotide/genetics , Reproducibility of Results , Sensitivity and Specificity
9.
BMC Bioinformatics ; 17: 24, 2016 Jan 08.
Article in English | MEDLINE | ID: mdl-26746786

ABSTRACT

BACKGROUND: Data from a plethora of high-throughput sequencing studies is readily available to researchers, providing genetic variants detected in a variety of healthy and disease populations. While each individual cohort helps gain insights into polymorphic and disease-associated variants, a joint perspective can be more powerful in identifying polymorphisms, rare variants, disease-associations, genetic burden, somatic variants, and disease mechanisms. DESCRIPTION: We have set up a Reference Variant Store (RVS) containing variants observed in a number of large-scale sequencing efforts, such as 1000 Genomes, ExAC, Scripps Wellderly, UK10K; various genotyping studies; and disease association databases. RVS holds extensive annotations pertaining to affected genes, functional impacts, disease associations, and population frequencies. RVS currently stores 400 million distinct variants observed in more than 80,000 human samples. CONCLUSIONS: RVS facilitates cross-study analysis to discover novel genetic risk factors, gene-disease associations, potential disease mechanisms, and actionable variants. Due to its large reference populations, RVS can also be employed for variant filtration and gene prioritization. AVAILABILITY: A web interface to public datasets and annotations in RVS is available at https://rvs.u.hpc.mssm.edu/.


Subjects
Databases, Genetic , Disease/genetics , Genetic Variation , Molecular Sequence Annotation/methods , Genome, Human , Genotyping Techniques , High-Throughput Nucleotide Sequencing , Humans , Knowledge Bases , Reference Values , Risk Factors
10.
Bioinformatics ; 32(1): 151-3, 2016 Jan 01.
Article in English | MEDLINE | ID: mdl-26363178

ABSTRACT

MOTIVATION: A plethora of sequenced and genotyped disease cohorts is available to the biomedical research community, spread across many portals and represented in various formats. RESULTS: We have gathered several large studies, including GERA and GRU, and computed population- and disease-specific genetic variant frequencies. In total, our portal provides fast access to genetic variants observed in 84,928 individuals from 39 disease populations. We also include 66,335 controls, such as the 1000 Genomes and Scripps Wellderly. CONCLUSION: Combining multiple studies helps validate disease-associated variants in each underlying data set, detect potential false positives using frequencies of control populations, and identify novel candidate disease-causing alterations in known or suspected genes. AVAILABILITY AND IMPLEMENTATION: https://rvs.u.hpc.mssm.edu/divas CONTACT: rong.chen@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Databases, Genetic , Disease/genetics , Genetic Variation , Software , Cohort Studies , Humans , User-Computer Interface
11.
Genome Med ; 7: 77, 2015 Jul 29.
Article in English | MEDLINE | ID: mdl-26338694

ABSTRACT

Routine clinical application of whole exome sequencing remains challenging due to difficulties in variant interpretation, large dataset management, and workflow integration. We describe a tool named ClinLabGeneticist to implement a workflow in clinical laboratories for management of variant assessment in genetic testing and disease diagnosis. We established an extensive variant annotation data source for the identification of pathogenic variants. A dashboard was deployed to aid a multi-step, hierarchical review process leading to final clinical decisions on genetic variant assessment. In addition, a central database was built to archive all of the genetic testing data, notes, and comments throughout the review process, variant validation data by Sanger sequencing as well as the final clinical reports for future reference. The entire workflow including data entry, distribution of work assignments, variant evaluation and review, selection of variants for validation, report generation, and communications between various personnel is integrated into a single data management platform. Three case studies are presented to illustrate the utility of ClinLabGeneticist. ClinLabGeneticist is freely available to academia at http://rongchenlab.org/software/clinlabgeneticist .


Subjects
Computational Biology/methods , Genetic Testing , Genetic Variation , Software , Child , Clinical Laboratory Services , Developmental Disabilities/genetics , Exome , High-Throughput Nucleotide Sequencing , Humans , Male , Porphyria, Erythropoietic/genetics , Sequence Analysis, DNA , Workflow
12.
BMC Genomics ; 16 Suppl 8: S3, 2015.
Article in English | MEDLINE | ID: mdl-26110593

ABSTRACT

BACKGROUND: The invention of high-throughput sequencing technologies has led to the discoveries of hundreds of thousands of genetic variants associated with thousands of human diseases. Many of these genetic variants are located outside the protein coding regions, and as such, it is challenging to interpret the function of these genetic variants by traditional genetic approaches. Recent genome-wide functional genomics studies, such as FANTOM5 and ENCODE, have uncovered a large number of regulatory elements across hundreds of different tissues or cell lines in the human genome. These findings provide an opportunity to study the interaction between regulatory elements and disease-associated genetic variants. Identifying these disease-related regulatory elements will shed light on understanding the mechanisms of how these variants regulate gene expression and ultimately result in disease formation and progression. RESULTS: In this study, we curated and categorized 27,558 Mendelian disease variants, 20,964 complex disease variants, 5,809 cancer predisposing germline variants, and 43,364 recurrent cancer somatic mutations. Compared against nine different types of regulatory regions from FANTOM5 and ENCODE projects, we found that different types of disease variants show distinctive propensity for particular regulatory elements. Mendelian disease variants and recurrent cancer somatic mutations are 22-fold and 10-fold significantly enriched in promoter regions, respectively (q<0.001), compared with allele-frequency-matched genomic background. Separate from these two categories, cancer predisposing germline variants are 27-fold enriched in histone modification regions (q<0.001), 10-fold enriched in chromatin physical interaction regions (q<0.001), and 6-fold enriched in transcription promoters (q<0.001). Furthermore, Mendelian disease variants and recurrent cancer somatic mutations share very similar distribution across types of functional effects.
We further found that regulatory regions are located within over 50% coding exon regions. Transcription promoters, methylation regions, and transcription insulators have the highest density of disease variants, with 472, 239, and 72 disease variants per one million base pairs, respectively. CONCLUSIONS: Disease-associated variants in different disease categories are preferentially located in particular regulatory elements. These results will be useful for an overall understanding about the differences among the pathogenic mechanisms of various disease-associated variants.


Subjects
Disease/genetics , Genetic Variation , Regulatory Elements, Transcriptional , Chromatin/genetics , Computational Biology , Genome-Wide Association Study , High-Throughput Nucleotide Sequencing , Histones/genetics , Humans , Protein Isoforms/genetics
13.
Pac Symp Biocomput ; : 407-18, 2015.
Article in English | MEDLINE | ID: mdl-25592600

ABSTRACT

In the past decade there has been an explosion in genetic research that has resulted in the generation of enormous quantities of disease-related data. In the current study, we have compiled disease risk gene variant information and Electronic Medical Record (EMR) classification codes from various repositories for 305 diseases. Using such data, we developed a pipeline to test for clinical prevalence, gene-variant overlap, and literature presence for all 46,360 unique disease pairs. To determine whether disease pairs were enriched, we systematically employed both Fisher's exact (medical and literature) and Term Frequency-Inverse Document Frequency (genetics) methodologies, defining statistical significance at a Bonferroni-adjusted threshold of p < 1 × 10⁻⁶ and weighted q < 0.05 accordingly. We hypothesize that disease pairs that are statistically enriched in medical and genetic spheres, but not so in the literature, have the potential to reveal non-obvious connections between clinically disparate phenotypes. Using this pipeline, we identified 2,316 disease pairs that were significantly enriched within an EMR and 213 enriched genetically. Of these, 65 disease pairs were statistically enriched in both, 19 of which are believed to be novel. These identified non-obvious relationships between disease pairs are suggestive of a shared underlying etiology with clinical presentation. Further investigation of uncovered disease-pair relationships has the potential to provide insights into the architecture of complex diseases, and update existing knowledge of risk factors.
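The pair count and the significance threshold in this abstract can be checked with a few lines of Python. This is a sketch rather than the authors' pipeline. The family-wise error rate `alpha = 0.05` is an assumption, since the abstract reports only the resulting threshold of roughly one in a million.

```python
import math

n_diseases = 305

# Number of unordered disease pairs: "305 choose 2".
n_pairs = math.comb(n_diseases, 2)

# Assumption of this sketch: a conventional family-wise alpha of 0.05,
# divided by the number of tests (the Bonferroni correction).
alpha = 0.05
bonferroni_threshold = alpha / n_pairs

print(n_pairs)
print(f"{bonferroni_threshold:.2e}")
```

The pair count comes out at 46,360, matching the abstract. The corrected threshold is about 1.08e-06, in line with the reported cutoff if alpha was indeed 0.05.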


Subjects
Disease/genetics , Computational Biology , Databases, Genetic/statistics & numerical data , Electronic Health Records , Gene Ontology/statistics & numerical data , Genetic Markers , Genetic Variation , Humans , Mutation , Phenotype , Risk Factors
15.
J Biomed Inform ; 45(5): 842-50, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22564364

ABSTRACT

MOTIVATION: Genetic factors determine differences in pharmacokinetics, drug efficacy, and drug responses between individuals and sub-populations. Wrong dosages of drugs can lead to severe adverse drug reactions in individuals whose drug metabolism drastically differs from the "assumed average". Databases such as PharmGKB are excellent sources of pharmacogenetic information on enzymes, genetic variants, and drug response affected by changes in enzymatic activity. Here, we seek to aid researchers, database curators, and clinicians in their search for relevant information by automatically extracting these data from literature. APPROACH: We automatically populate a repository of information on genetic variants, relations to drugs, occurrence in sub-populations, and associations with disease. We mine textual data from PubMed abstracts to discover such genotype-phenotype associations, focusing on SNPs that can be associated with variations in drug response. The overall repository covers relations found between genes, variants, alleles, drugs, diseases, adverse drug reactions, populations, and allele frequencies. We cross-reference these data to EntrezGene, PharmGKB, PubChem, and others. RESULTS: The performance regarding entity recognition and relation extraction yields a precision of 90-92% for the major entity types (gene, drug, disease), and 76-84% for relations involving these types. Comparison of our repository to PharmGKB reveals a coverage of 93% of gene-drug associations in PharmGKB and 97% of the gene-variant mappings based on 180,000 PubMed abstracts. AVAILABILITY: http://bioai4core.fulton.asu.edu/snpshot.


Subjects
Data Mining/methods , Databases, Genetic , Disease/genetics , Pharmacogenetics/methods , Polymorphism, Single Nucleotide , Animals , Genetic Association Studies/methods , Humans , Knowledge Bases , Mice , PubMed , Rats
16.
Bioinformatics ; 27(19): 2769-71, 2011 Oct 01.
Article in English | MEDLINE | ID: mdl-21813477

ABSTRACT

SUMMARY: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a TAP-20 score of 0.1987. AVAILABILITY: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net. CONTACT: jorg.hakenberg@roche.com.


Subjects
Data Mining , Gene Library , Electronic Data Processing , Genes , Internet , Proteins , Publishing , Terminology as Topic
17.
PLoS Comput Biol ; 6: e1000837, 2010 Jul 01.
Article in English | MEDLINE | ID: mdl-20617200

ABSTRACT

The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein-protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. 
Nevertheless, our study shows that three kernels are clearly superior to the other methods.


Subjects
Data Mining/methods , Databases, Protein , Natural Language Processing , Protein Interaction Mapping/methods , Proteins/classification , Algorithms , Area Under Curve , Decision Trees , Models, Molecular , Reproducibility of Results
18.
Article in English | MEDLINE | ID: mdl-20498514

ABSTRACT

Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings pertaining to such interactions are discussed in research papers, which, in turn, get curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to alleviate the tasks of searching for relevant papers, evidence for physical interactions, and proper identifiers for each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to make aware of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein-named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions to handle a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence-level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to further remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions, and 43 percent for mapping proteins to UniProt IDs; disregarding species, f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text. 
All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).


Subjects
Computational Biology/methods , Data Mining/methods , Databases, Genetic , Protein Interaction Mapping/methods , Databases, Protein , Natural Language Processing , Periodicals as Topic , Societies, Scientific
19.
Pac Symp Biocomput ; : 465-76, 2010.
Article in English | MEDLINE | ID: mdl-19908398

ABSTRACT

Biological pathways are seen as highly critical in our understanding of the mechanism of biological functions. To collect information about pathways, manual curation has been the most popular method. However, pathway annotation is regarded as heavily time-consuming, as it requires expert curators to identify and collect information from different sources. Even with the pieces of biological facts and interactions collected from various sources, curators have to apply their biological knowledge to arrange the acquired interactions in such a way that together they perform a common biological function as a pathway. In this paper, we propose a novel approach for automated pathway synthesis that acquires facts from hand-curated knowledge bases. To comprehend the incompleteness of the knowledge bases, our approach also obtains facts through automated extraction from Medline abstracts. An essential component of our approach is to apply logical reasoning to the acquired facts based on the biological knowledge about pathways. By representing such biological knowledge, the reasoning component is capable of assigning ordering to the acquired facts and interactions that is necessary for pathway synthesis. We demonstrate the feasibility of our approach with the development of a system that synthesizes pharmacokinetic pathways. We evaluate our approach by reconstructing the existing pharmacokinetic pathways available in PharmGKB. Our results show not only that our approach is capable of synthesizing these pathways but also that it uncovers information that is not available in the manually annotated pathways.


Subjects
Pharmacokinetics , Artificial Intelligence , Carbamates/pharmacokinetics , Computational Biology , Humans , Knowledge Bases , MEDLINE , Metabolic Networks and Pathways , Models, Biological , Piperidines/pharmacokinetics , Pravastatin/pharmacokinetics , Synthetic Biology
20.
Nucleic Acids Res ; 37(Web Server issue): W300-4, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19465383

ABSTRACT

High-throughput screens such as microarrays and RNAi screens produce huge amounts of data. They typically result in hundreds of genes, which are often further explored and clustered via enriched GeneOntology terms. The strength of such analyses is that they build on high-quality manual annotations provided with the GeneOntology. However, the weakness is that annotations are restricted to process, function and location and that they do not cover all known genes in model organisms. GoGene addresses this weakness by complementing high-quality manual annotation with high-throughput text mining extracting co-occurrences of genes and ontology terms from literature. GoGene contains over 4,000,000 associations between genes and gene-related terms for 10 model organisms extracted from more than 18,000,000 PubMed entries. It does not cover only process, function and location of genes, but also biomedical categories such as diseases, compounds, techniques and mutations. By bringing it all together, GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. GoGene accepts keywords, gene lists, gene sequences and protein sequences as input and supports search for genes in PubMed, EntrezGene and via BLAST. Since all associations of genes to terms are supported by evidence in the literature, the results are transparent and can be verified by the user. GoGene is available at http://gopubmed.org/gogene.


Subjects
Genes , Software , Animals , Bone Resorption/genetics , Gene Expression Profiling , Humans , Mice , Mutation , Oligonucleotide Array Sequence Analysis , Osteoporosis/genetics , Pancreatic Neoplasms/genetics , Porphyria, Hepatoerythropoietic/genetics , PubMed , Rats , Vocabulary, Controlled