ABSTRACT
Variants which disrupt splicing are a frequent cause of rare disease that have been under-ascertained clinically. Accurate and efficient methods to predict a variant's impact on splicing are needed to interpret the growing number of variants of unknown significance (VUS) identified by exome and genome sequencing. Here, we present the results of the CAGI6 Splicing VUS challenge, which invited predictions of the splicing impact of 56 variants ascertained clinically and functionally validated to determine splicing impact. The performance of 12 prediction methods, along with SpliceAI and CADD, was compared on the 56 functionally validated variants. The maximum accuracy achieved was 82% from two different approaches, one weighting SpliceAI scores by minor allele frequency, and one applying the recently published Splicing Prediction Pipeline (SPiP). SPiP performed optimally in terms of sensitivity, while an ensemble method combining multiple prediction tools and information from databases exceeded all others for specificity. Several challenge methods equalled or exceeded the performance of SpliceAI, with ultimate choice of prediction method likely to depend on experimental or clinical aims. One quarter of the variants were incorrectly predicted by at least 50% of the methods, highlighting the need for further improvements to splicing prediction methods for successful clinical application.
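The accuracy, sensitivity, and specificity figures quoted above are standard confusion-matrix quantities. The following is a minimal sketch of how such metrics can be computed from binary calls; the function name, the data class, and the toy labels are illustrative assumptions and are not taken from the challenge data.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float


def evaluate(truth, predicted):
    """Compare binary calls (True = splice-altering) against validated labels."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and of equal length")
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    return BinaryMetrics(
        accuracy=(tp + tn) / len(pairs),
        sensitivity=tp / (tp + fn) if (tp + fn) else float("nan"),
        specificity=tn / (tn + fp) if (tn + fp) else float("nan"),
    )


# Toy labels only (hypothetical, not the 56 challenge variants).
toy_truth = [True, True, True, False, False]
toy_calls = [True, True, False, False, True]
toy_metrics = evaluate(toy_truth, toy_calls)
```

A method tuned for sensitivity and one tuned for specificity can reach similar accuracy on the same variants, which is why the abstract reports the three quantities separately.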
ABSTRACT
A critical challenge in genetic diagnostics is the computational assessment of candidate splice variants, specifically the interpretation of nucleotide changes located outside of the highly conserved dinucleotide sequences at the 5' and 3' ends of introns. To address this gap, we developed the Super Quick Information-content Random-forest Learning of Splice variants (SQUIRLS) algorithm. SQUIRLS generates a small set of interpretable features for machine learning by calculating the information-content of wild-type and variant sequences of canonical and cryptic splice sites, assessing changes in candidate splicing regulatory sequences, and incorporating characteristics of the sequence such as exon length, disruptions of the AG exclusion zone, and conservation. We curated a comprehensive collection of disease-associated splice-altering variants at positions outside of the highly conserved AG/GT dinucleotides at the termini of introns. SQUIRLS trains two random-forest classifiers for the donor and for the acceptor and combines their outputs by logistic regression to yield a final score. We show that SQUIRLS transcends previous state-of-the-art accuracy in classifying splice variants as assessed by rank analysis in simulated exomes, and is significantly faster than competing methods. SQUIRLS provides tabular output files for incorporation into diagnostic pipelines for exome and genome analysis, as well as visualizations that contextualize predicted effects of variants on splicing to make it easier to interpret splice variants in diagnostic settings.
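As an illustration of the information-content idea, the sketch below scores a short sequence against a position-specific base frequency table using the individual information formula, Ri = sum over positions of (2 + log2 f(base, position)), and reports the change between wild-type and variant sequence. The frequency table is a made-up 4-nucleotide toy motif and the function names are hypothetical; the actual SQUIRLS features and matrices are not described in enough detail here to reproduce.

```python
import math

# Hypothetical per-position base frequencies for a 4-nt toy motif.
# Real splice-site matrices are longer and estimated from annotated sites.
TOY_FREQS = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},
]


def individual_information(seq, freqs=TOY_FREQS):
    """Information content (bits) of one sequence under the frequency table."""
    if len(seq) != len(freqs):
        raise ValueError("sequence length must match the number of motif positions")
    return sum(2.0 + math.log2(col[base]) for base, col in zip(seq.upper(), freqs))


def delta_information(ref_seq, alt_seq, freqs=TOY_FREQS):
    """Bits lost (positive) or gained (negative) when ref_seq changes to alt_seq."""
    return individual_information(ref_seq, freqs) - individual_information(alt_seq, freqs)


# A>C at the last toy position lowers the score by log2(0.60 / 0.05) bits.
toy_delta = delta_information("GGTA", "GGTC")
```

A large positive change at a canonical site, or a large negative change at a nearby cryptic site, is the kind of interpretable signal a downstream classifier can use.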
Subjects
Algorithms; Data Curation/methods; Genetic Diseases, Inborn/genetics; RNA Splice Sites; RNA Splicing; Software; Base Sequence; Computational Biology/methods; Exome; Exons; Genetic Diseases, Inborn/diagnosis; Genetic Diseases, Inborn/pathology; High-Throughput Nucleotide Sequencing; Humans; Introns; Mutation; Exome Sequencing
ABSTRACT
MOTIVATION: Methods for concept recognition (CR) in clinical texts have largely been tested on abstracts or articles from the medical literature. However, texts from electronic health records (EHRs) frequently contain spelling errors, abbreviations, and other nonstandard ways of representing clinical concepts. RESULTS: Here, we present a method inspired by the BLAST algorithm for biosequence alignment that screens texts for potential matches on the basis of matching k-mer counts and scores candidates based on conformance to typical patterns of spelling errors derived from 2.9 million clinical notes. Our method, the Term-BLAST-like alignment tool (TBLAT), leverages a gold standard corpus for typographical errors to implement a sequence alignment-inspired method for efficient entity linkage. We present a comprehensive experimental comparison of TBLAT with five widely used tools. Experimental results show an increase of 10% in recall on scientific publications and of 20% in recall on EHR records (compared against the next best method), supporting a significant enhancement of the entity linking task. The method can be used stand-alone or as a complement to existing approaches. AVAILABILITY AND IMPLEMENTATION: Fenominal is a Java library that implements TBLAT for named CR of Human Phenotype Ontology terms and is available at https://github.com/monarch-initiative/fenominal under the GNU General Public License v3.0.
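The k-mer screening step can be pictured with a small sketch. It keeps only those ontology labels that share a minimum fraction of the query's character k-mers, so that a misspelled mention still reaches the more expensive scoring stage. The function names, the choice of k = 3, and the 0.5 threshold are assumptions made for illustration; this is not the Fenominal API, and the error-pattern scoring step is not shown.

```python
from collections import Counter


def kmer_counts(text, k=3):
    """Count overlapping character k-mers in a lower-cased string."""
    s = text.lower()
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def shared_kmers(a, b, k=3):
    """Number of k-mers common to both strings (multiset intersection)."""
    return sum((kmer_counts(a, k) & kmer_counts(b, k)).values())


def screen(query, labels, k=3, min_fraction=0.5):
    """Return labels sharing at least min_fraction of the query's k-mers, best first."""
    total = sum(kmer_counts(query, k).values())
    if total == 0:
        return []
    scored = [(shared_kmers(query, label, k) / total, label) for label in labels]
    return [label for frac, label in sorted(scored, reverse=True) if frac >= min_fraction]


# A misspelled mention still shares 9 of its 12 trigrams with the intended label.
toy_hits = screen("arachnodactily", ["Arachnodactyly", "Seizure", "Scoliosis"])
```

Because the screen only counts k-mers, it is cheap enough to run against every label before any alignment-style scoring is attempted.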
Subjects
Algorithms; Language; Humans; Sequence Alignment; Electronic Health Records; Publications
ABSTRACT
PURPOSE: Clinical intuition is commonly incorporated into the differential diagnosis as an assessment of the likelihood of candidate diagnoses based either on the patient population being seen in a specific clinic or on the signs and symptoms of the initial presentation. Algorithms to support diagnostic sequencing in individuals with a suspected rare genetic disease do not yet incorporate intuition and instead assume that each Mendelian disease has an equal pretest probability. METHODS: The LIRICAL algorithm calculates the likelihood ratio of clinical manifestations represented by Human Phenotype Ontology (HPO) terms to rank candidate diagnoses. The initial version of LIRICAL assumed an equal pretest probability for each disease in its calculation of the posttest probability (where the test is diagnostic exome or genome sequencing). We introduce Clinical Intuition for Likelihood Ratios (ClintLR), an extension of the LIRICAL algorithm that boosts the pretest probability of groups of related diseases deemed to be more likely. RESULTS: The average rank of the correct diagnosis in simulations using ClintLR showed a statistically significant improvement over a range of adjustment factors. CONCLUSION: ClintLR successfully encodes clinical intuition to improve ranking of rare diseases in diagnostic sequencing. ClintLR is freely available at https://github.com/TheJacksonLaboratory/ClintLR.
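One way to picture a pretest adjustment is to start from equal weights, multiply the weights of the diseases judged more likely by an adjustment factor, and renormalize so that the pretest probabilities still sum to one. The sketch below does exactly that; the renormalization scheme, the function name, and the disease identifiers are assumptions for illustration, since the abstract does not specify how ClintLR applies its adjustment factors.

```python
def adjusted_pretest(diseases, boosted, factor):
    """Equal pretest weights, with `boosted` diseases up-weighted by `factor`."""
    if factor <= 0:
        raise ValueError("adjustment factor must be positive")
    if not diseases:
        raise ValueError("at least one disease is required")
    weights = {d: (factor if d in boosted else 1.0) for d in diseases}
    total = sum(weights.values())
    # Renormalize so the pretest probabilities sum to 1.
    return {d: w / total for d, w in weights.items()}


# Hypothetical identifiers: one of four candidate diseases is boosted threefold.
toy_pretest = adjusted_pretest(["D1", "D2", "D3", "D4"], boosted={"D1"}, factor=3.0)
```

With a factor of 1 the function reduces to the equal-pretest assumption of the original algorithm.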
ABSTRACT
Human Phenotype Ontology (HPO)-based analysis has become standard for genomic diagnostics of rare diseases. Current algorithms use a variety of semantic and statistical approaches to prioritize the typically long lists of genes with candidate pathogenic variants. These algorithms do not provide robust estimates of the strength of the predictions beyond the placement in a ranked list, nor do they provide measures of how much any individual phenotypic observation has contributed to the prioritization result. However, given that the overall success rate of genomic diagnostics is only around 25%-50% or less in many cohorts, a good ranking cannot be taken to imply that the gene or disease at rank one is necessarily a good candidate. Here, we present an approach to genomic diagnostics that exploits the likelihood ratio (LR) framework to provide an estimate of (1) the posttest probability of candidate diagnoses, (2) the LR for each observed HPO phenotype, and (3) the predicted pathogenicity of observed genotypes. LIkelihood Ratio Interpretation of Clinical AbnormaLities (LIRICAL) placed the correct diagnosis within the first three ranks in 92.9% of 384 case reports comprising 262 Mendelian diseases, and the correct diagnosis had a mean posttest probability of 67.3%. Simulations show that LIRICAL is robust to many typically encountered forms of genomic and phenomic noise. In summary, LIRICAL provides accurate, clinically interpretable results for phenotype-driven genomic diagnostics.
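The likelihood ratio framework converts a pretest probability into a posttest probability through odds: posttest odds equal pretest odds multiplied by the likelihood ratio of each finding. The sketch below shows that arithmetic only. Multiplying the ratios treats the findings as independent, and the function name and example values are illustrative assumptions rather than LIRICAL's implementation.

```python
import math


def posttest_probability(pretest, likelihood_ratios):
    """Combine a pretest probability with per-finding likelihood ratios.

    Each likelihood ratio must be positive; the computation is done in
    log-odds space so that many ratios can be combined without underflow.
    """
    if not 0.0 < pretest < 1.0:
        raise ValueError("pretest probability must lie strictly between 0 and 1")
    log_odds = math.log(pretest / (1.0 - pretest))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Even pretest odds and two findings with LR 2 and LR 3 give posttest odds of 6:1.
toy_posttest = posttest_probability(0.5, [2.0, 3.0])
```

Because each finding contributes its own ratio, the per-finding values can be reported alongside the final probability, which is what makes the result interpretable.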
Subjects
Computational Biology; Databases, Genetic; Genomics; Rare Diseases/diagnosis; Algorithms; Exome/genetics; Humans; Phenotype; Rare Diseases/genetics; Software
ABSTRACT
The Human Phenotype Ontology (HPO, https://hpo.jax.org) was launched in 2008 to provide a comprehensive logical standard to describe and computationally analyze phenotypic abnormalities found in human disease. The HPO is now a worldwide standard for phenotype exchange. The HPO has grown steadily since its inception due to considerable contributions from clinical experts and researchers from a diverse range of disciplines. Here, we present recent major extensions of the HPO for neurology, nephrology, immunology, pulmonology, newborn screening, and other areas. For example, the seizure subontology now reflects the International League Against Epilepsy (ILAE) guidelines and these enhancements have already shown clinical validity. We present new efforts to harmonize computational definitions of phenotypic abnormalities across the HPO and multiple phenotype ontologies used for animal models of disease. These efforts will benefit software such as Exomiser by improving the accuracy and scope of cross-species phenotype matching. The computational modeling strategy used by the HPO to define disease entities and phenotypic features and distinguish between them is explained in detail. We also report on recent efforts to translate the HPO into indigenous languages. Finally, we summarize recent advances in the use of HPO in electronic health record systems.
Subjects
Biological Ontologies; Computational Biology/methods; Databases, Factual; Disease/genetics; Genome; Phenotype; Software; Animals; Disease Models, Animal; Genotype; Humans; Infant, Newborn; International Cooperation; Internet; Neonatal Screening/methods; Pharmacogenetics/methods; Terminology as Topic
ABSTRACT
Rare disease diagnostics and disease gene discovery have been revolutionized by whole-exome and genome sequencing but identifying the causative variant(s) from the millions in each individual remains challenging. The use of deep phenotyping of patients and reference genotype-phenotype knowledge, alongside variant data such as allele frequency, segregation, and predicted pathogenicity, has proved an effective strategy to tackle this issue. Here we review the numerous tools that have been developed to automate this approach and demonstrate the power of such an approach on several thousand diagnosed cases from the 100,000 Genomes Project. Finally, we discuss the challenges that need to be overcome if we are going to improve detection rates and help the majority of patients that still remain without a molecular diagnosis after state-of-the-art genomic interpretation.
Subjects
Exome; Rare Diseases; Exome/genetics; Genomics; Humans; Phenotype; Rare Diseases/diagnosis; Rare Diseases/genetics; Exome Sequencing
ABSTRACT
The Human Phenotype Ontology (HPO), a standardized vocabulary of phenotypic abnormalities associated with 7000+ diseases, is used by thousands of researchers, clinicians, informaticians and electronic health record systems around the world. Its detailed descriptions of clinical abnormalities and computable disease definitions have made HPO the de facto standard for deep phenotyping in the field of rare disease. The HPO's interoperability with other ontologies has enabled it to be used to improve diagnostic accuracy by incorporating model organism data. It also plays a key role in the popular Exomiser tool, which identifies potential disease-causing variants from whole-exome or whole-genome sequencing data. Since the HPO was first introduced in 2008, its users have become both more numerous and more diverse. To meet these emerging needs, the project has added new content, language translations, mappings and computational tooling, as well as integrations with external community data. The HPO continues to collaborate with clinical adopters to improve specific areas of the ontology and extend standardized disease descriptions. The newly redesigned HPO website (www.human-phenotype-ontology.org) simplifies browsing terms and exploring clinical features, diseases, and human genes.
Subjects
Biological Ontologies; Computational Biology/methods; Congenital Abnormalities/genetics; Genetic Predisposition to Disease/genetics; Knowledge Bases; Rare Diseases/genetics; Congenital Abnormalities/diagnosis; Databases, Genetic; Genetic Variation; Humans; Internet; Phenotype; Rare Diseases/diagnosis; Whole Genome Sequencing/methods
ABSTRACT
BACKGROUND: Target enrichment combined with chromosome conformation capturing methodologies such as capture Hi-C (CHC) can be used to investigate spatial layouts of genomic regions with high resolution and at scalable costs. A common application of CHC is the investigation of regulatory elements that are in contact with promoters, but CHC can be used for a range of other applications. Therefore, probe design for CHC needs to be adapted to experimental needs, but no flexible tool is currently available for this purpose. RESULTS: We present a Java desktop application called GOPHER (Generator Of Probes for capture Hi-C Experiments at high Resolution) that implements three strategies for CHC probe design. GOPHER's simple approach is similar to the probe design of previous approaches that employ CHC to investigate all promoters, with one probe being placed at each margin of a single digest that overlaps the transcription start site (TSS) of each promoter. GOPHER's simple-patched approach extends this methodology with a heuristic that improves coverage of viewpoints in which the TSS is located near one of the boundaries of the digest. GOPHER's extended approach is intended mainly for focused investigations of smaller gene sets. GOPHER can also be used to design probes for regions other than TSSs, such as GWAS hits or large blocks of genomic sequence. GOPHER additionally provides a number of features that allow users to visualize and edit viewpoints, and outputs a range of files useful for documentation, ordering probes, and downstream analysis. CONCLUSION: GOPHER is an easy-to-use and robust desktop application for CHC probe design. Source code and a precompiled executable can be downloaded from the GOPHER GitHub page at https://github.com/TheJacksonLaboratory/Gopher.
Subjects
DNA Probes/genetics; Software; Gene Regulatory Networks; Promoter Regions, Genetic; Regulatory Sequences, Nucleic Acid; Transcription Initiation Site
ABSTRACT
BACKGROUND: Progressive bilateral sensorineural deafness in the postlingual period may be linked to many different etiologies, including genetic factors. Identifying the exact cause of deafness may therefore be quite challenging. Here we present a family with late-onset hearing loss as an autosomal dominant trait caused by a novel EYA4 mutation. CASE PRESENTATION: A 44-year-old female proband, clinically investigated for progressive hearing loss and occasional dizziness and with a positive family history of deafness, underwent molecular genetic testing. The patient's DNA sample was analyzed by whole exome sequencing. We identified a novel missense variant, c.804G>C, located at the last base pair of exon 10 in EYA4. The candidate variant was confirmed by Sanger sequencing in the proband and her family members. In silico prediction tools and co-segregation analysis were used to assess the pathogenicity of the identified variant. To confirm our hypothesis, we performed a minigene assay to determine whether exon 10 of EYA4 is present in the transcript. We provide evidence that this mutation compromises donor site functionality in vitro and causes exon 10 skipping and a frameshift that most likely results in nonsense-mediated mRNA decay. The onset of moderate to severe hearing loss in the family ranged from 10 to 40 years. A normal cardiac phenotype was confirmed by ECG and echocardiography. CONCLUSIONS: We identified a novel EYA4 mutation associated with adult-onset autosomal dominant sensorineural hearing loss. This report extends the knowledge of the spectrum of EYA4 mutations and demonstrates the pathogenicity of a variant affecting a specific position in the gene. A comprehensive review of known EYA4 mutations is also given and their impact on cardiac phenotype is discussed. Our findings highlight the importance of genetic testing and complex clinical assessment in patients with familial progressive hearing loss.
Subjects
Genes, Dominant; Hearing Loss/genetics; Trans-Activators/genetics; Age of Onset; Female; Humans; Middle Aged; Slovakia
ABSTRACT
Objective: Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information. Materials and Methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically. Results: Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task. Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings. Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.
ABSTRACT
Large language models (LLMs) have shown great promise in supporting differential diagnosis, but the 23 available published studies on diagnostic accuracy evaluated small cohorts (number of cases, 30-422, mean 104) and assessed LLM responses subjectively by manual curation (23/23 studies). The performance of LLMs for rare disease diagnosis has not been evaluated systematically. Here, we perform a rigorous and large-scale analysis of the performance of GPT-4 in prioritizing candidate diagnoses, using the largest-ever cohort of rare disease patients. Our computational study used 5267 computational case reports from previously published data. Each case was formatted as a Global Alliance for Genomics and Health (GA4GH) phenopacket, in which clinical anomalies were represented as Human Phenotype Ontology (HPO) terms. We developed software to generate prompts from each phenopacket. Prompts were sent to Generative Pre-trained Transformer 4 (GPT-4), and the rank of the correct diagnosis, if present in the response, was recorded. The mean reciprocal rank (MRR) of the correct diagnosis was 0.24 (with the reciprocal of the MRR corresponding to a rank of 4.2), and the correct diagnosis was placed in rank 1 in 19.2% of the cases, in the first 3 ranks in 28.6%, and in the first 10 ranks in 32.5%. Our study is the largest to be reported to date and provides a realistic estimate of the performance of GPT-4 in rare disease medicine.
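The ranking statistics reported above can be reproduced from a list of per-case ranks. The sketch below computes the mean reciprocal rank and the fraction of cases with the correct diagnosis in the top k. It assumes that a case whose correct diagnosis is absent from the response contributes a reciprocal rank of 0, which the abstract does not state; the function names and the toy ranks are hypothetical.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over cases; None marks a case with no correct diagnosis."""
    if not ranks:
        raise ValueError("at least one case is required")
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def top_k_fraction(ranks, k):
    """Fraction of cases whose correct diagnosis appears within the first k ranks."""
    if not ranks:
        raise ValueError("at least one case is required")
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Toy ranks for four cases; the third response missed the diagnosis entirely.
toy_ranks = [1, 2, None, 4]
toy_mrr = mean_reciprocal_rank(toy_ranks)

# An MRR of 0.24 corresponds to a rank of about 1 / 0.24, i.e. roughly 4.2.
equivalent_rank = 1 / 0.24
```

Taking the reciprocal of the MRR gives a harmonic-mean style summary rank, which is why an MRR of 0.24 is described as equivalent to a rank of about 4.2.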
ABSTRACT
Sarcopenia is a serious systemic disease that reduces overall survival. Transcatheter aortic valve implantation (TAVI) is selectively performed in patients with severe aortic stenosis who are not indicated for open cardiac surgery due to severe polymorbidity. Artificial intelligence-assisted body composition assessment from available CT scans appears to be a simple tool to stratify these patients into low and high risk based on estimates of future all-cause mortality. In our study, preprocedural CT scans of patients undergoing TAVI were segmented at the level of the third lumbar vertebra using a neural network (AutoMATiCA). The obtained parameters (area and density of skeletal muscles and of intramuscular, visceral, and subcutaneous adipose tissue) were analyzed using Cox univariate and multivariable models for continuous and categorical variables to assess the relation of selected variables with all-cause mortality. A total of 866 patients were included, with median (interquartile range) age 79.7 (74.9-83.3) years and BMI 28.9 (25.9-32.6) kg/m². Survival analysis was performed on all automatically obtained parameters of muscle and fat density and area. Skeletal muscle index (SMI, in cm²/m²) and the density of visceral (VAT, in HU) and subcutaneous (SAT, in HU) adipose tissue predicted all-cause mortality in patients after TAVI, expressed as hazard ratio (HR) with 95% confidence interval (CI): SMI HR 0.986, 95% CI (0.975-0.996); VAT 1.015 (1.002-1.028); and SAT 1.014 (1.004-1.023), all p < 0.05. Automatic body composition assessment can identify patients at higher risk of all-cause mortality after TAVI, which may be useful in preoperative clinical reasoning and stratification of patients.
Subjects
Sarcopenia; Humans; Aged; Artificial Intelligence; Adipose Tissue; Muscle, Skeletal; Subcutaneous Fat; Body Composition/physiology; Retrospective Studies
ABSTRACT
The Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema was released in 2022 and approved by ISO as a standard for sharing clinical and genomic information about an individual, including phenotypic descriptions, numerical measurements, genetic information, diagnoses, and treatments. A phenopacket can be used as an input file for software that supports phenotype-driven genomic diagnostics and for algorithms that facilitate patient classification and stratification for identifying new diseases and treatments. There has been a great need for a collection of phenopackets to test software pipelines and algorithms. Here, we present Phenopacket Store. Version 0.1.19 of Phenopacket Store includes 6668 phenopackets representing 475 Mendelian and chromosomal diseases associated with 423 genes and 3834 unique pathogenic alleles curated from 959 different publications. This represents the first large-scale collection of case-level, standardized phenotypic information derived from case reports in the literature with detailed descriptions of the clinical data and will be useful for many purposes, including the development and testing of software for prioritizing genes and diseases in diagnostic genomics, machine learning analysis of clinical phenotype data, patient stratification, and genotype-phenotype correlations. This corpus also provides best-practice examples for curating literature-derived data using the GA4GH Phenopacket Schema.
ABSTRACT
The Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema was released in 2022 and approved by ISO as a standard for sharing clinical and genomic information about an individual, including phenotypic descriptions, numerical measurements, genetic information, diagnoses, and treatments. A phenopacket can be used as an input file for software that supports phenotype-driven genomic diagnostics and for algorithms that facilitate patient classification and stratification for identifying new diseases and treatments. There has been a great need for a collection of phenopackets to test software pipelines and algorithms. Here, we present phenopacket-store. Version 0.1.12 of phenopacket-store includes 4916 phenopackets representing 277 Mendelian and chromosomal diseases associated with 236 genes, and 2872 unique pathogenic alleles curated from 605 different publications. This represents the first large-scale collection of case-level, standardized phenotypic information derived from case reports in the literature with detailed descriptions of the clinical data and will be useful for many purposes, including the development and testing of software for prioritizing genes and diseases in diagnostic genomics, machine learning analysis of clinical phenotype data, patient stratification, and genotype-phenotype correlations. This corpus also provides best-practice examples for curating literature-derived data using the GA4GH Phenopacket Schema.
ABSTRACT
Numerous factors regulate alternative splicing of human genes at a co-transcriptional level. However, how alternative splicing depends on the regulation of gene expression is poorly understood. We leveraged data from the Genotype-Tissue Expression (GTEx) project to show a significant association of gene expression and splicing for 6874 (4.9%) of 141,043 exons in 1106 (13.3%) of 8314 genes with substantially variable expression in ten GTEx tissues. About half of these exons demonstrate higher inclusion with higher gene expression, and half demonstrate higher exclusion, with the observed direction of coupling being highly consistent across different tissues and in external datasets. The exons differ with respect to sequence characteristics, enriched sequence motifs, RNA polymerase II binding, and inferred transcription rate of downstream introns. The exons were enriched for hundreds of isoform-specific Gene Ontology annotations, suggesting that the coupling of expression and alternative splicing described here may provide an important gene regulatory mechanism that might be used in a variety of biological contexts. In particular, higher inclusion exons could play an important role during cell division.
ABSTRACT
We identified a de novo heterozygous transient receptor potential cation channel subfamily M (melastatin) member 3 (TRPM3) missense variant, p.(Asn1126Asp), in a patient with developmental delay and manifestations of cerebral palsy (CP) using phenotype-driven prioritization analysis of whole-genome sequencing data with Exomiser. The variant is localized in the functionally important ion transport domain of the TRPM3 protein and predicted to impact the protein structure. Our report adds TRPM3 to the list of Mendelian disease-associated genes that can be associated with CP and provides further evidence for the pathogenicity of the variant p.(Asn1126Asp).
Subjects
Cerebral Palsy; Intellectual Disability; Nervous System Malformations; TRPM Cation Channels; Humans; Cerebral Palsy/genetics; Intellectual Disability/genetics; Mutation, Missense/genetics; Phenotype; TRPM Cation Channels/genetics
ABSTRACT
The Global Alliance for Genomics and Health (GA4GH) is a standards-setting organization that is developing a suite of coordinated standards for genomics. The GA4GH Phenopacket Schema is a standard for sharing disease and phenotype information that characterizes an individual person or biosample. The Phenopacket Schema is flexible and can represent clinical data for any kind of human disease including rare disease, complex disease, and cancer. It also allows consortia or databases to apply additional constraints to ensure uniform data collection for specific goals. We present phenopacket-tools, an open-source Java library and command-line application for construction, conversion, and validation of phenopackets. Phenopacket-tools simplifies construction of phenopackets by providing concise builders, programmatic shortcuts, and predefined building blocks (ontology classes) for concepts such as anatomical organs, age of onset, biospecimen type, and clinical modifiers. Phenopacket-tools can be used to validate the syntax and semantics of phenopackets as well as to assess adherence to additional user-defined requirements. The documentation includes examples showing how to use the Java library and the command-line tool to create and validate phenopackets. We demonstrate how to create, convert, and validate phenopackets using the library or the command-line application. Source code, API documentation, comprehensive user guide and a tutorial can be found at https://github.com/phenopackets/phenopacket-tools. The library can be installed from the public Maven Central artifact repository and the application is available as a standalone archive. The phenopacket-tools library helps developers implement and standardize the collection and exchange of phenotypic and other clinical data for use in phenotype-driven genomic diagnostics, translational research, and precision medicine applications.
Subjects
Neoplasms; Software; Humans; Genomics; Databases, Factual; Gene Library
ABSTRACT
The Global Alliance for Genomics and Health (GA4GH) is developing a suite of coordinated standards for genomics for healthcare. The Phenopacket is a new GA4GH standard for sharing disease and phenotype information that characterizes an individual person, linking that individual to detailed phenotypic descriptions, genetic information, diagnoses, and treatments. A detailed example is presented that illustrates how to use the schema to represent the clinical course of a patient with retinoblastoma, including demographic information, the clinical diagnosis, phenotypic features and clinical measurements, an examination of the extirpated tumor, therapies, and the results of genomic analysis. The Phenopacket Schema, together with other GA4GH data and technical standards, will enable data exchange and provide a foundation for the computational analysis of disease and phenotype information to improve our ability to diagnose and conduct research on all types of disorders, including cancer and rare diseases.