ABSTRACT
While mutations affecting protein-coding regions have been examined across many cancers, structural variants at the genome-wide level are still poorly defined. Through integrative deep whole-genome and -transcriptome analysis of 101 castration-resistant prostate cancer metastases (109X tumor/38X normal coverage), we identified structural variants altering critical regulators of tumorigenesis and progression not detectable by exome approaches. Notably, we observed amplification of an intergenic enhancer region 624 kb upstream of the androgen receptor (AR) in 81% of patients, correlating with increased AR expression. Tandem duplication hotspots also occur near MYC, in lncRNAs associated with post-translational MYC regulation. Classes of structural variations were linked to distinct DNA repair deficiencies, suggesting their etiology, including associations of CDK12 mutation with tandem duplications, TP53 inactivation with inverted rearrangements and chromothripsis, and BRCA2 inactivation with deletions. Together, these observations provide a comprehensive view of how structural variations affect critical regulators in metastatic prostate cancer.
Subjects
Genomic Structural Variation/genetics , Prostatic Neoplasms/genetics , Aged , Aged, 80 and over , BRCA2 Protein/metabolism , Cyclin-Dependent Kinases/metabolism , DNA Copy Number Variations , Exome , Gene Expression Profiling/methods , Genomics/methods , Humans , Male , Middle Aged , Mutation , Neoplasm Metastasis/genetics , Proto-Oncogene Proteins c-myc/genetics , Proto-Oncogene Proteins c-myc/metabolism , Receptors, Androgen/genetics , Receptors, Androgen/metabolism , Tandem Repeat Sequences/genetics , Tumor Suppressor Protein p53/metabolism , Whole Genome Sequencing/methods
ABSTRACT
UNLABELLED: Descriptions of genetic variations and their effects are widely spread across the biomedical literature. However, finding all mentions of a specific variation, or all mentions of variations in a specific gene, is difficult to achieve due to the many ways such variations are described. Here, we describe SETH, a tool for the recognition of variations from text and their subsequent normalization to dbSNP or UniProt. SETH achieves high precision and recall on several evaluation corpora of PubMed abstracts. It is freely available and encompasses stand-alone scripts for isolated application and evaluation as well as thorough documentation for integration into other applications. AVAILABILITY AND IMPLEMENTATION: SETH is released under the Apache 2.0 license and can be downloaded from http://rockt.github.io/SETH/ CONTACT: thomas@informatik.hu-berlin.de or leser@informatik.hu-berlin.de.
Subjects
Data Curation , Data Mining , Genetic Variation , Computational Biology/methods , Genes , Humans , Information Storage and Retrieval/methods , Natural Language Processing , PubMed , Publications , Terminology as Topic
ABSTRACT
MOTIVATION: A plethora of sequenced and genotyped disease cohorts is available to the biomedical research community, spread across many portals and represented in various formats. RESULTS: We have gathered several large studies, including GERA and GRU, and computed population- and disease-specific genetic variant frequencies. In total, our portal provides fast access to genetic variants observed in 84,928 individuals from 39 disease populations. We also include 66,335 controls, such as the 1000 Genomes and Scripps Wellderly. CONCLUSION: Combining multiple studies helps validate disease-associated variants in each underlying data set, detect potential false positives using frequencies of control populations, and identify novel candidate disease-causing alterations in known or suspected genes. AVAILABILITY AND IMPLEMENTATION: https://rvs.u.hpc.mssm.edu/divas CONTACT: rong.chen@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Databases, Genetic , Disease/genetics , Genetic Variation , Software , Cohort Studies , Humans , User-Computer Interface
ABSTRACT
MOTIVATION: Underrepresentation of racial groups represents an important challenge and major gap in phenomics research. Most of the current human phenomics research is based primarily on European populations; hence it is an important challenge to expand it to consider other population groups. One approach is to utilize data from EMR databases that contain patient data from diverse demographics and ancestries. The implications of this racial underrepresentation of data can be profound regarding effects on the healthcare delivery and actionability. To the best of our knowledge, our work is the first attempt to perform comparative, population-scale analyses of disease networks across three different populations, namely Caucasian (EA), African American (AA) and Hispanic/Latino (HL). RESULTS: We compared susceptibility profiles and temporal connectivity patterns for 1988 diseases and 37 282 disease pairs represented in a clinical population of 1 025 573 patients. Accordingly, we revealed appreciable differences in disease susceptibility, temporal patterns, network structure and underlying disease connections between EA, AA and HL populations. We found 2158 significantly comorbid diseases for the EA cohort, 3265 for AA and 672 for HL. We further outlined key disease pair associations unique to each population as well as categorical enrichments of these pairs. Finally, we identified 51 key 'hub' diseases that are the focal points in the race-centric networks and of particular clinical importance. Incorporating race-specific disease comorbidity patterns will produce a more accurate and complete picture of the disease landscape overall and could support more precise understanding of disease relationships and patient management towards improved clinical outcomes. CONTACTS: rong.chen@mssm.edu or joel.dudley@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Electronic Health Records , Black or African American , Databases, Factual , Hispanic or Latino , Humans , White People
ABSTRACT
BACKGROUND: Data from a plethora of high-throughput sequencing studies is readily available to researchers, providing genetic variants detected in a variety of healthy and disease populations. While each individual cohort helps gain insights into polymorphic and disease-associated variants, a joint perspective can be more powerful in identifying polymorphisms, rare variants, disease-associations, genetic burden, somatic variants, and disease mechanisms. DESCRIPTION: We have set up a Reference Variant Store (RVS) containing variants observed in a number of large-scale sequencing efforts, such as 1000 Genomes, ExAC, Scripps Wellderly, UK10K; various genotyping studies; and disease association databases. RVS holds extensive annotations pertaining to affected genes, functional impacts, disease associations, and population frequencies. RVS currently stores 400 million distinct variants observed in more than 80,000 human samples. CONCLUSIONS: RVS facilitates cross-study analysis to discover novel genetic risk factors, gene-disease associations, potential disease mechanisms, and actionable variants. Due to its large reference populations, RVS can also be employed for variant filtration and gene prioritization. AVAILABILITY: A web interface to public datasets and annotations in RVS is available at https://rvs.u.hpc.mssm.edu/.
Subjects
Databases, Genetic , Disease/genetics , Genetic Variation , Molecular Sequence Annotation/methods , Genome, Human , Genotyping Techniques , High-Throughput Nucleotide Sequencing , Humans , Knowledge Bases , Reference Values , Risk Factors
ABSTRACT
Acute intermittent porphyria results from hydroxymethylbilane synthase (HMBS) mutations that markedly decrease HMBS enzymatic activity. This dominant disease is diagnosed when heterozygotes have life-threatening acute attacks, while most heterozygotes remain asymptomatic and undiagnosed. Although >400 HMBS mutations have been reported, the prevalence of pathogenic HMBS mutations in genomic/exomic databases, and the actual disease penetrance are unknown. Thus, we interrogated genomic/exomic databases, identified non-synonymous variants (NSVs) and consensus splice-site variants (CSSVs) in various demographic/racial groups, and determined the NSVs' pathogenicity by prediction algorithms and in vitro expression assays. Caucasians had the most: 58 NSVs and two CSSVs among ~92,000 alleles, a 0.00575 combined allele frequency. In silico algorithms predicted 14 out of 58 NSVs as "likely-pathogenic." In vitro expression identified 10 out of 58 NSVs as likely-pathogenic (seven predicted in silico), which together with two CSSVs had a combined allele frequency of 0.00056. Notably, six presumably pathogenic mutations/NSVs in the Human Gene Mutation Database were benign. Compared with the recent prevalence estimate of symptomatic European heterozygotes (~0.000005), the prevalence of likely-pathogenic HMBS mutations among Caucasians was >100 times higher. Thus, the estimated penetrance of acute attacks was ~1% of heterozygotes with likely-pathogenic mutations, highlighting the importance of predisposing/protective genes and environmental modifiers that precipitate/prevent the attacks.
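The penetrance estimate in this abstract can be retraced with simple arithmetic. The following is a minimal sketch, assuming (as the abstract's own comparison appears to) that the symptomatic prevalence is divided directly by the combined allele frequency; the variable names are illustrative and not taken from the study.

```python
# Figures quoted in the abstract above.
likely_pathogenic_allele_freq = 0.00056  # 10 NSVs plus 2 CSSVs among Caucasian alleles
symptomatic_prevalence = 0.000005        # estimated symptomatic European heterozygotes

# How many times more frequent the likely-pathogenic alleles are
# than symptomatic disease.
fold_excess = likely_pathogenic_allele_freq / symptomatic_prevalence

# Rough penetrance: share of carriers expected to experience acute attacks.
# Assumption: allele frequency is used directly as the carrier frequency.
penetrance = symptomatic_prevalence / likely_pathogenic_allele_freq

print(f"fold excess: {fold_excess:.0f}")
print(f"penetrance: {penetrance:.2%}")
```

Under these assumptions the ratio comes out a little above 100 and the penetrance a little under 1%, in line with the ">100 times" and "~1%" figures reported.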
Subjects
Genetic Variation , Penetrance , Porphyria, Acute Intermittent/genetics , White People/genetics , Computer Simulation , Female , Gene Frequency , Humans , Male , Porphyria, Acute Intermittent/ethnology , Sequence Analysis, DNA
ABSTRACT
BACKGROUND: The invention of high throughput sequencing technologies has led to the discoveries of hundreds of thousands of genetic variants associated with thousands of human diseases. Many of these genetic variants are located outside the protein coding regions, and as such, it is challenging to interpret the function of these genetic variants by traditional genetic approaches. Recent genome-wide functional genomics studies, such as FANTOM5 and ENCODE, have uncovered a large number of regulatory elements across hundreds of different tissues or cell lines in the human genome. These findings provide an opportunity to study the interaction between regulatory elements and disease-associated genetic variants. Identifying these disease-related regulatory elements will shed light on the mechanisms by which these variants regulate gene expression and ultimately result in disease formation and progression. RESULTS: In this study, we curated and categorized 27,558 Mendelian disease variants, 20,964 complex disease variants, 5,809 cancer predisposing germline variants, and 43,364 recurrent cancer somatic mutations. Compared against nine different types of regulatory regions from FANTOM5 and ENCODE projects, we found that different types of disease variants show distinctive propensity for particular regulatory elements. Mendelian disease variants and recurrent cancer somatic mutations are 22-fold and 10-fold significantly enriched in promoter regions, respectively (q<0.001), compared with allele-frequency-matched genomic background. Separate from these two categories, cancer predisposing germline variants are 27-fold enriched in histone modification regions (q<0.001), 10-fold enriched in chromatin physical interaction regions (q<0.001), and 6-fold enriched in transcription promoters (q<0.001). Furthermore, Mendelian disease variants and recurrent cancer somatic mutations share very similar distribution across types of functional effects.
We further found that regulatory regions are located within over 50% of coding exon regions. Transcription promoters, methylation regions, and transcription insulators have the highest density of disease variants, with 472, 239, and 72 disease variants per one million base pairs, respectively. CONCLUSIONS: Disease-associated variants in different disease categories are preferentially located in particular regulatory elements. These results will be useful for an overall understanding of the differences among the pathogenic mechanisms of various disease-associated variants.
Subjects
Disease/genetics , Genetic Variation , Regulatory Elements, Transcriptional , Chromatin/genetics , Computational Biology , Genome-Wide Association Study , High-Throughput Nucleotide Sequencing , Histones/genetics , Humans , Protein Isoforms/genetics
ABSTRACT
SUMMARY: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987. AVAILABILITY: The library and web services are implemented in Java and the sources are available from http://gnat.sourceforge.net. CONTACT: jorg.hakenberg@roche.com.
Subjects
Data Mining , Gene Library , Electronic Data Processing , Genes , Internet , Proteins , Publishing , Terminology as Topic
ABSTRACT
MOTIVATION: Genetic factors determine differences in pharmacokinetics, drug efficacy, and drug responses between individuals and sub-populations. Wrong dosages of drugs can lead to severe adverse drug reactions in individuals whose drug metabolism drastically differs from the "assumed average". Databases such as PharmGKB are excellent sources of pharmacogenetic information on enzymes, genetic variants, and drug response affected by changes in enzymatic activity. Here, we seek to aid researchers, database curators, and clinicians in their search for relevant information by automatically extracting these data from literature. APPROACH: We automatically populate a repository of information on genetic variants, relations to drugs, occurrence in sub-populations, and associations with disease. We mine textual data from PubMed abstracts to discover such genotype-phenotype associations, focusing on SNPs that can be associated with variations in drug response. The overall repository covers relations found between genes, variants, alleles, drugs, diseases, adverse drug reactions, populations, and allele frequencies. We cross-reference these data to EntrezGene, PharmGKB, PubChem, and others. RESULTS: The performance regarding entity recognition and relation extraction yields a precision of 90-92% for the major entity types (gene, drug, disease), and 76-84% for relations involving these types. Comparison of our repository to PharmGKB reveals a coverage of 93% of gene-drug associations in PharmGKB and 97% of the gene-variant mappings based on 180,000 PubMed abstracts. AVAILABILITY: http://bioai4core.fulton.asu.edu/snpshot.
Subjects
Data Mining/methods , Databases, Genetic , Disease/genetics , Pharmacogenetics/methods , Polymorphism, Single Nucleotide , Animals , Genetic Association Studies/methods , Humans , Knowledge Bases , Mice , PubMed , Rats
ABSTRACT
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein-protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. 
Nevertheless, our study shows that three kernels are clearly superior to the other methods.
Subjects
Data Mining/methods , Databases, Protein , Natural Language Processing , Protein Interaction Mapping/methods , Proteins/classification , Algorithms , Area Under Curve , Decision Trees , Models, Molecular , Reproducibility of Results
ABSTRACT
High-throughput screens such as microarrays and RNAi screens produce huge amounts of data. They typically result in hundreds of genes, which are often further explored and clustered via enriched GeneOntology terms. The strength of such analyses is that they build on high-quality manual annotations provided with the GeneOntology. However, the weakness is that annotations are restricted to process, function and location and that they do not cover all known genes in model organisms. GoGene addresses this weakness by complementing high-quality manual annotation with high-throughput text mining extracting co-occurrences of genes and ontology terms from literature. GoGene contains over 4,000,000 associations between genes and gene-related terms for 10 model organisms extracted from more than 18,000,000 PubMed entries. It does not cover only process, function and location of genes, but also biomedical categories such as diseases, compounds, techniques and mutations. By bringing it all together, GoGene provides the most recent and most complete facts about genes and can rank them according to novelty and importance. GoGene accepts keywords, gene lists, gene sequences and protein sequences as input and supports search for genes in PubMed, EntrezGene and via BLAST. Since all associations of genes to terms are supported by evidence in the literature, the results are transparent and can be verified by the user. GoGene is available at http://gopubmed.org/gogene.
Subjects
Genes , Software , Animals , Bone Resorption/genetics , Gene Expression Profiling , Humans , Mice , Mutation , Oligonucleotide Array Sequence Analysis , Osteoporosis/genetics , Pancreatic Neoplasms/genetics , Porphyria, Hepatoerythropoietic/genetics , PubMed , Rats , Vocabulary, Controlled
ABSTRACT
BACKGROUND: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively. RESULTS: The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves on average 80% success rate. CONCLUSION: Metadata is valuable for disambiguation, but requires high quality training data. Closest Sense requires no training, but a large, consistently modelled ontology, which are two opposing conditions.
Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation. AVAILABILITY: The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1.
Subjects
Computational Biology/methods , Vocabulary, Controlled , Algorithms , Information Storage and Retrieval , Medical Informatics/methods , Medical Subject Headings , Pattern Recognition, Automated , Unified Medical Language System
ABSTRACT
MOTIVATION: Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed information not contained in a publication; it is also key for integrating different knowledge sources. From an information retrieval perspective, normalization facilitates indexing and querying. Gene mention normalization (GN) is particularly challenging given the high ambiguity of gene names: they refer to orthologous or entirely different genes, are named after phenotypes and other biomedical terms, or they resemble common English words. RESULTS: We present the first publicly available system, GNAT, reported to handle inter-species GN. Our method uses extensive background knowledge on genes to resolve ambiguous names to EntrezGene identifiers. It performs comparably to single-species approaches proposed by us and others. On a benchmark set derived from BioCreative 1 and 2 data that contains genes from 13 species, GNAT achieves an F-measure of 81.4% (90.8% precision at 73.8% recall). For the single-species task, we report an F-measure of 85.4% on human genes. AVAILABILITY: A web-frontend is available at http://cbioc.eas.asu.edu/gnat/. GNAT will also be available within the BioCreativeMetaService project, see http://bcms.bioinfo.cnio.es. SUPPLEMENTARY INFORMATION: The test data set, lexica, and links to external data are available at http://cbioc.eas.asu.edu/gnat/
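The F-measure quoted in this abstract can be checked against its precision and recall. This is a minimal sketch, assuming the reported figure is the balanced F-measure (F1, the harmonic mean of precision and recall); the function name is illustrative and not part of GNAT.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the 13-species benchmark.
f1 = f_measure(0.908, 0.738)
print(f"{f1:.1%}")
```

With these inputs the result rounds to 81.4%, matching the reported value.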
Subjects
Algorithms , Artificial Intelligence , Databases, Genetic , Genes/genetics , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Periodicals as Topic , Software , Species Specificity
ABSTRACT
In the version of this article originally published, the name of author Serafim Batzoglou was misspelled. The error has been corrected in the HTML and PDF versions of the article.
ABSTRACT
Millions of human genomes and exomes have been sequenced, but their clinical applications remain limited due to the difficulty of distinguishing disease-causing mutations from benign genetic variation. Here we demonstrate that common missense variants in other primate species are largely clinically benign in human, enabling pathogenic mutations to be systematically identified by the process of elimination. Using hundreds of thousands of common variants from population sequencing of six non-human primate species, we train a deep neural network that identifies pathogenic mutations in rare disease patients with 88% accuracy and enables the discovery of 14 new candidate genes in intellectual disability at genome-wide significance. Cataloging common variation from additional primate species would improve interpretation for millions of variants of uncertain significance, further advancing the clinical utility of human genome sequencing.
Subjects
Genome, Human , Mutation , Nerve Net/physiology , Animals , Exome , Genetic Predisposition to Disease , Humans , Intellectual Disability/genetics , Intellectual Disability/pathology , Primates
ABSTRACT
UNLABELLED: The biomedical literature contains a wealth of information on associations between many different types of objects, such as protein-protein interactions, gene-disease associations and subcellular locations of proteins. When searching for such information using conventional search engines, e.g. PubMed, users see the data only one abstract at a time and 'hidden' in natural language text. AliBaba is an interactive tool for graphical summarization of search results. It parses the set of abstracts that fit a PubMed query and presents extracted information on biomedical objects and their relationships as a graphical network. AliBaba extracts associations between cells, diseases, drugs, proteins, species and tissues. Several filter options allow for a more focused search. Thus, researchers can grasp complex networks described in various articles at a glance. AVAILABILITY: http://alibaba.informatik.hu-berlin.de/
Subjects
Abstracting and Indexing/methods , Algorithms , Computer Graphics , Natural Language Processing , Periodicals as Topic , Software , User-Computer Interface , Database Management Systems , Information Storage and Retrieval/methods , Vocabulary, Controlled
ABSTRACT
Genetic studies of human disease have traditionally focused on the detection of disease-causing mutations in afflicted individuals. Here we describe a complementary approach that seeks to identify healthy individuals resilient to highly penetrant forms of genetic childhood disorders. A comprehensive screen of 874 genes in 589,306 genomes led to the identification of 13 adults harboring mutations for 8 severe Mendelian conditions, with no reported clinical manifestation of the indicated disease. Our findings demonstrate the promise of broadening genetic studies to systematically search for well individuals who are buffering the effects of rare, highly penetrant, deleterious mutations. They also indicate that incomplete penetrance for Mendelian diseases is likely more common than previously believed. The identification of resilient individuals may provide a first step toward uncovering protective genetic variants that could help elucidate the mechanisms of Mendelian diseases and new therapeutic strategies.
Subjects
Chromosome Mapping/methods , Disease Resistance/genetics , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/genetics , Genome, Human/genetics , Mendelian Randomization Analysis/methods , Child , Child, Preschool , Chromosome Mapping/statistics & numerical data , DNA Mutational Analysis/methods , Female , Genetic Predisposition to Disease/genetics , Genetic Testing/methods , Humans , Infant , Infant, Newborn , Male , Mendelian Randomization Analysis/statistics & numerical data , Polymorphism, Single Nucleotide/genetics , Reproducibility of Results , Sensitivity and Specificity
ABSTRACT
In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.
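The kinds of features named in this abstract (affixes, character n-grams, capitalization patterns, neighbouring words in a sliding window) can be illustrated with a short sketch. This is not the authors' feature set: the function names, the window size, and the feature encoding are all assumptions chosen for illustration. Such features would then be passed to a classifier such as an SVM, and recursive feature elimination would repeatedly drop the least useful ones.

```python
def shape(token: str) -> str:
    """Capitalization pattern: uppercase -> 'A', lowercase -> 'a', digit -> '0'."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)


def token_features(tokens, i, window=1, n=3):
    """Collect string features for tokens[i] and its neighbours in the window."""
    feats = set()
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the sentence are filled with a padding symbol.
        tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        feats.add(f"{offset}:shape={shape(tok)}")
        feats.add(f"{offset}:prefix={tok[:2]}")
        feats.add(f"{offset}:suffix={tok[-2:]}")
    # Character n-grams of the token being classified.
    center = tokens[i]
    for k in range(max(len(center) - n + 1, 1)):
        feats.add(f"ngram={center[k:k + n]}")
    return feats


tokens = "Mutations in BRCA2 increase risk".split()
features = token_features(tokens, 2)
print(sorted(features))
```

Keeping each feature as a named string makes it easy to see which groups (shape, affix, n-gram, context) survive when features are eliminated.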
Subjects
Computational Biology/methods , Genes , Pattern Recognition, Automated/classification , Pattern Recognition, Automated/methods , Recognition, Psychology , Terminology as Topic
ABSTRACT
In the past decade there has been an explosion in genetic research that has resulted in the generation of enormous quantities of disease-related data. In the current study, we have compiled disease risk gene variant information and Electronic Medical Record (EMR) classification codes from various repositories for 305 diseases. Using such data, we developed a pipeline to test for clinical prevalence, gene-variant overlap, and literature presence for all 46,360 unique disease pairs. To determine whether disease pairs were enriched, we systematically employed both Fisher's exact test (medical and literature) and Term Frequency-Inverse Document Frequency (genetics) methodologies, defining statistical significance at a Bonferroni-adjusted threshold of p < 1 × 10^-6 and weighted q < 0.05 accordingly. We hypothesize that disease pairs that are statistically enriched in medical and genetic spheres, but not so in the literature, have the potential to reveal non-obvious connections between clinically disparate phenotypes. Using this pipeline, we identified 2,316 disease pairs that were significantly enriched within an EMR and 213 enriched genetically. Of these, 65 disease pairs were statistically enriched in both, 19 of which are believed to be novel. These identified non-obvious relationships between disease pairs are suggestive of a shared underlying etiology with clinical presentation. Further investigation of uncovered disease-pair relationships has the potential to provide insights into the architecture of complex diseases, and update existing knowledge of risk factors.
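The number of disease pairs and the order of magnitude of the significance threshold in this abstract can be retraced as follows. This is a minimal sketch: the family-wise alpha of 0.05 is an assumption, since the abstract states only the adjusted threshold.

```python
from math import comb

n_diseases = 305

# Unordered pairs of distinct diseases.
n_pairs = comb(n_diseases, 2)

# Bonferroni correction: divide the family-wise alpha by the number of tests.
# Assumption: alpha = 0.05 (not stated in the abstract).
bonferroni_alpha = 0.05 / n_pairs

print(n_pairs)
print(f"{bonferroni_alpha:.2e}")
```

The pair count reproduces the 46,360 reported, and the corrected alpha is close to the 1 × 10^-6 threshold used in the study.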