ABSTRACT
BACKGROUND: Diagnosing genetic disorders requires extensive manual curation and interpretation of candidate variants, a labor-intensive task even for trained geneticists. Although artificial intelligence (AI) shows promise in aiding these diagnoses, existing AI tools have only achieved moderate success for primary diagnosis. METHODS: AI-MARRVEL (AIM) uses a random-forest machine-learning classifier trained on over 3.5 million variants from thousands of diagnosed cases. AIM additionally incorporates expert-engineered features into training to recapitulate the intricate decision-making processes in molecular diagnosis. The online version of AIM is available at https://ai.marrvel.org. To evaluate AIM, we benchmarked it with diagnosed patients from three independent cohorts. RESULTS: AIM improved the rate of accurate genetic diagnosis, doubling the number of solved cases as compared with benchmarked methods, across three distinct real-world cohorts. To better identify diagnosable cases from the unsolved pools accumulated over time, we designed a confidence metric on which AIM achieved a precision rate of 98% and identified 57% of diagnosable cases out of a collection of 871 cases. Furthermore, AIM's performance improved after being fine-tuned for targeted settings including recessive disorders and trio analysis. Finally, AIM demonstrated potential for novel disease gene discovery by correctly predicting two newly reported disease genes from the Undiagnosed Diseases Network. CONCLUSIONS: AIM achieved superior accuracy compared with existing methods for genetic diagnosis. We anticipate that this tool may aid in primary diagnosis, reanalysis of unsolved cases, and the discovery of novel disease genes. (Funded by the NIH Common Fund and others.)
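The confidence-metric result above can be read as a precision/recall trade-off at a chosen threshold. The following is a minimal sketch, assuming each case carries a hypothetical confidence score and a flag saying whether it is truly diagnosable; the function name, threshold, and toy data are illustrative assumptions and do not reproduce AIM's actual metric.

```python
def precision_and_recall(cases, threshold):
    """cases: list of (confidence, is_diagnosable) pairs (hypothetical inputs).

    A case is "called" when its confidence is at or above the threshold.
    Precision = correct calls / all calls.
    Recall    = correct calls / all diagnosable cases.
    """
    called = [ok for conf, ok in cases if conf >= threshold]
    true_calls = sum(called)
    total_diagnosable = sum(ok for _, ok in cases)
    precision = true_calls / len(called) if called else 0.0
    recall = true_calls / total_diagnosable if total_diagnosable else 0.0
    return precision, recall


# Toy data, not taken from the study.
toy_cases = [(0.95, True), (0.90, True), (0.80, False), (0.40, True), (0.10, False)]
precision, recall = precision_and_recall(toy_cases, threshold=0.85)
```

Raising the threshold typically trades recall for precision, which is the shape of the reported result: very high precision (98%) while still identifying a majority (57%) of diagnosable cases.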
ABSTRACT
In the effort to treat Mendelian disorders, correcting the underlying molecular imbalance may be more effective than symptomatic treatment. Identifying treatments that might accomplish this goal requires extensive and up-to-date knowledge of molecular pathways, including drug-gene and gene-gene relationships. To address this challenge, we present "parsing modifiers via article annotations" (PARMESAN), a computational tool that searches PubMed and PubMed Central for information to assemble these relationships into a central knowledge base. PARMESAN then predicts putatively novel drug-gene relationships, assigning an evidence-based score to each prediction. We compare PARMESAN's drug-gene predictions to all of the drug-gene relationships displayed by the Drug-Gene Interaction Database (DGIdb) and show that higher-scoring relationship predictions are more likely to match the directionality (up- versus down-regulation) indicated by this database. PARMESAN had more than 200,000 drug predictions scoring above 8 (as one example cutoff), for more than 3,700 genes. Among these predicted relationships, 210 were registered in DGIdb and 201 (96%) had matching directionality. This publicly available tool provides an automated way to prioritize drug screens to target the most promising drugs to test, thereby saving time and resources in the development of therapeutics for genetic disorders.
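The 96% figure follows directly from the two counts reported above (201 matches among 210 DGIdb-registered predictions). A short check in Python; the variable names are illustrative.

```python
registered_in_dgidb = 210  # predictions scoring above 8 that are registered in DGIdb
matching_direction = 201   # of those, predictions whose direction agrees with DGIdb

match_rate = matching_direction / registered_in_dgidb
print(f"{match_rate:.1%}")  # prints 95.7%, which rounds to the reported 96%
```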
Subjects
PubMed, Humans, Databases, Factual
ABSTRACT
OBJECTIVE: Collier/Olf/EBF (COE) transcription factors have distinct expression patterns in the developing and mature nervous system. To date, a neurological disease association has been conclusively established for only the Early B-cell Factor-3 (EBF3) COE family member through the identification of heterozygous loss-of-function variants in individuals with autism spectrum/neurodevelopmental disorders (NDD). Here, we identify a symptom severity risk association with missense variants primarily disrupting the zinc finger domain (ZNF) in EBF3-related NDD. METHODS: A phenotypic assessment of 41 individuals was combined with a literature meta-analysis for a total of 83 individuals diagnosed with EBF3-related NDD. Quantitative diagnostic phenotypic and symptom severity scales were developed to compare EBF3 variant type and location to identify genotype-phenotype correlations. To stratify the effects of EBF3 variants disrupting either the DNA-binding domain (DBD) or the ZNF, we used in vivo fruit fly UAS-GAL4 expression and in vitro luciferase assays. RESULTS: We show that patient symptom severity correlates with EBF3 missense variants perturbing the ZNF, which is a key protein domain required for stabilizing the interaction between EBF3 and the target DNA sequence. We found that ZNF-associated variants failed to restore viability in the fruit fly and impaired transcriptional activation. However, the recurrent variant EBF3 p.Arg209Trp in the DBD is capable of partially rescuing viability in the fly and preserved transcriptional activation. INTERPRETATION: We describe a symptom severity risk association with ZNF perturbations and EBF3 loss-of-function in the largest reported cohort to date of EBF3-related NDD patients. This analysis should have potential predictive clinical value for newly identified patients with EBF3 gene variants. ANN NEUROL 2022;92:138-153.
Subjects
Autism Spectrum Disorder, Neurodevelopmental Disorders, Transcription Factors, Zinc Fingers, Autism Spectrum Disorder/genetics, Humans, Mutation, Missense/genetics, Neurodevelopmental Disorders/genetics, Transcription Factors/genetics, Transcription Factors/metabolism, Zinc Fingers/genetics
ABSTRACT
The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient's disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient's given set of phenotypes. Diagnosis of singleton patients (without relatives' exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database-based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children's Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu.
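The ranking results above (causative gene at rank 1 for 66% of patients, within the top 11 for more than 90%) are top-k rates. A minimal sketch of how such a rate can be computed, assuming a hypothetical list holding the 1-based rank of the causative gene for each patient; the ranks below are toy values, not data from the study.

```python
def top_k_rate(causal_ranks, k):
    """Fraction of patients whose causative gene is ranked within the top k.

    causal_ranks: 1-based rank of the causative gene per patient (hypothetical input).
    """
    if not causal_ranks:
        return 0.0
    return sum(rank <= k for rank in causal_ranks) / len(causal_ranks)


# Toy ranks for ten patients.
toy_ranks = [1, 1, 3, 1, 12, 2, 1, 7, 1, 25]
top1 = top_k_rate(toy_ranks, 1)    # 5 of 10 patients
top11 = top_k_rate(toy_ranks, 11)  # 8 of 10 patients
```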
Subjects
Exome, Child, Genotype, Humans, Phenotype, Probability, Retrospective Studies
ABSTRACT
PURPOSE: Both monogenic pathogenic variant cataloging and clinical patient diagnosis start with variant-level evidence retrieval followed by expert evidence integration in search of diagnostic variants and genes. Here, we try to accelerate pathogenic variant evidence retrieval by an automatic approach. METHODS: Automatic VAriant evidence DAtabase (AVADA) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic genetic variant evidence in full-text primary literature about monogenic disease and convert it to genomic coordinates. RESULTS: AVADA automatically retrieved almost 60% of likely disease-causing variants deposited in the Human Gene Mutation Database (HGMD), a 4.4-fold improvement over the current best open source automated variant extractor. AVADA contains over 60,000 likely disease-causing variants that are in HGMD but not in ClinVar. AVADA also highlights the challenges of automated variant mapping and pathogenicity curation. However, when combined with manual validation, on 245 diagnosed patients, AVADA provides valuable evidence for an additional 18 diagnostic variants, on top of ClinVar's 21, versus only 2 using the best current automated approach. CONCLUSION: AVADA advances automated retrieval of pathogenic monogenic variant evidence from full-text literature. Although far from perfect, AVADA is much faster than PubMed/Google Scholar search, and careful curation of AVADA-retrieved evidence can aid both database curation and patient diagnosis.
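To give a feel for the first step of variant evidence retrieval, the sketch below spots variant-like strings in free text. It is a deliberately simple regular-expression stand-in, not AVADA's machine learning approach: the two patterns cover only a small part of HGVS-style notation, the function name is hypothetical, and mapping mentions to genomic coordinates is not attempted.

```python
import re

# Simplified patterns for two common HGVS-style notations (not the full grammar).
CDNA_SUBSTITUTION = re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b")
PROTEIN_MISSENSE = re.compile(r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b")


def find_variant_mentions(text):
    """Return variant-like strings: cDNA substitutions first, then protein changes."""
    return CDNA_SUBSTITUTION.findall(text) + PROTEIN_MISSENSE.findall(text)


# Toy sentence with a made-up gene and variant, not taken from any paper.
sentence = "The proband carried c.100A>G (p.Thr34Ala) in GENE1."
mentions = find_variant_mentions(sentence)
```

Real articles describe variants in many inconsistent ways, which is one reason the abstract stresses the difficulty of automated variant mapping and the value of manual validation.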
Subjects
Electronic Data Processing/methods, Genomics/methods, Information Storage and Retrieval/methods, Data Management/methods, Databases, Factual, Databases, Genetic, Humans, Natural Language Processing, PubMed, Publications
ABSTRACT
PURPOSE: Exome sequencing and diagnosis are beginning to spread across the medical establishment. The most time-consuming part of genome-based diagnosis is the manual step of matching the potentially long list of patient candidate genes to patient phenotypes to identify the causative disease. METHODS: We introduce Phrank (for phenotype ranking), an information theory-inspired method that utilizes a Bayesian network to prioritize candidate diseases or genes, as a stand-alone module that can be run with any underlying knowledgebase and any variant filtering scheme. RESULTS: Phrank outperforms existing methods at ranking the causative disease or gene when applied to 169 real patient exomes with Mendelian diagnoses. Phrank's greatest improvement is in disease space, where across all 169 patients it ranks only 3 diseases on average ahead of the true diagnosis, whereas Phenomizer ranks 32 diseases ahead of the causal one. CONCLUSIONS: Using Phrank to rank all patient candidate genes or diseases at the start of a new case will save the busy clinician much time in deriving a genetic diagnosis.
Subjects
Diagnosis, Computer-Assisted, Genetic Diseases, Inborn/diagnosis, Genetic Testing, Phenotype, Software, Benchmarking, Computational Biology/methods, Exome, Humans, Knowledge Bases, Pathology, Molecular/methods
ABSTRACT
PURPOSE: Diagnosing monogenic diseases facilitates optimal care, but can involve the manual evaluation of hundreds of genetic variants per case. Computational tools like Phrank expedite this process by ranking all candidate genes by their ability to explain the patient's phenotypes. To use these tools, busy clinicians must manually encode patient phenotypes from lengthy clinical notes. With 100 million human genomes estimated to be sequenced by 2025, a fast alternative to manual phenotype extraction from clinical notes will become necessary. METHODS: We introduce ClinPhen, a fast, high-accuracy tool that automatically converts clinical notes into a prioritized list of patient phenotypes using Human Phenotype Ontology (HPO) terms. RESULTS: ClinPhen shows superior accuracy and 20× speedup over existing phenotype extractors, and its novel phenotype prioritization scheme improves the performance of gene-ranking tools. CONCLUSION: While a dedicated clinician can process 200 patient records in a 40-hour workweek, ClinPhen does the same in 10 minutes. Compared with manual phenotype extraction, ClinPhen saves an additional 3-5 hours per Mendelian disease diagnosis. Providers can now add ClinPhen's output to each summary note attached to a filled testing laboratory request form. ClinPhen makes a substantial contribution to improvements in efficiency critically needed to meet the surging demand for clinical diagnostic sequencing.
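The manual-versus-ClinPhen comparison above can be worked out per record. A short calculation in Python using only the figures stated in the abstract (200 records, a 40-hour workweek, 10 minutes); the variable names are illustrative.

```python
records = 200
manual_minutes = 40 * 60   # one 40-hour workweek, in minutes
clinphen_minutes = 10      # reported ClinPhen time for the same 200 records

manual_minutes_per_record = manual_minutes / records            # 12.0 minutes
clinphen_seconds_per_record = clinphen_minutes * 60 / records   # 3.0 seconds
speedup_over_manual = manual_minutes / clinphen_minutes         # 240.0
```

This 240-fold ratio is relative to manual extraction and is separate from the 20× speedup the abstract reports over existing automated phenotype extractors.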
Subjects
Computational Biology, Genetic Diseases, Inborn/diagnosis, Medical Records, Algorithms, Humans, Natural Language Processing, Phenotype
ABSTRACT
A major contributor to the scientific reproducibility crisis has been that the results from homogeneous, single-center studies do not generalize to heterogeneous, real-world populations. Multi-cohort gene expression analysis has helped to increase reproducibility by aggregating data from diverse populations into a single analysis. To make the multi-cohort analysis process more feasible, we have assembled an analysis pipeline that implements rigorously studied meta-analysis best practices. We have compiled and made publicly available the results of our own multi-cohort gene expression analysis of 103 diseases, spanning 615 studies and 36,915 samples, through a novel and interactive web application. As a result, we have made both the process of and the results from multi-cohort gene expression analysis more approachable for non-technical users.
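The abstract does not say which meta-analysis model the pipeline uses. As a generic illustration of how per-study effect sizes for one gene can be combined across cohorts, the sketch below applies inverse-variance (fixed-effect) pooling; the function name and numbers are assumptions, and a real pipeline may well use a random-effects model instead.

```python
import math


def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted pooled effect and its standard error.

    effects:   per-study effect sizes for one gene (hypothetical inputs)
    variances: per-study variances of those effect sizes
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    standard_error = math.sqrt(1.0 / sum(weights))
    return pooled, standard_error


# Toy values for three studies, not taken from the paper.
toy_effects = [0.8, 1.2, 1.0]
toy_variances = [0.04, 0.16, 0.08]
pooled, se = fixed_effect_pool(toy_effects, toy_variances)
```

Studies with smaller variance (usually larger cohorts) receive more weight, so the pooled estimate sits closer to the most precise study.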