ABSTRACT
PURPOSE: Phenotype information is crucial for the interpretation of genomic variants. So far it has only been accessible for bioinformatics workflows after encoding into clinical terms by expert dysmorphologists. METHODS: Here, we introduce an approach driven by artificial intelligence that uses portrait photographs for the interpretation of clinical exome data. We measured the value added by computer-assisted image analysis to the diagnostic yield on a cohort consisting of 679 individuals with 105 different monogenic disorders. For each case in the cohort we compiled frontal photos, clinical features, and the disease-causing variants, and simulated multiple exomes of different ethnic backgrounds. RESULTS: The additional use of similarity scores from computer-assisted analysis of frontal photos improved the top 1 accuracy rate by more than 20-89% and the top 10 accuracy rate by more than 5-99% for the disease-causing gene. CONCLUSION: Image analysis by deep-learning algorithms can be used to quantify the phenotypic similarity (PP4 criterion of the American College of Medical Genetics and Genomics guidelines) and to advance the performance of bioinformatics pipelines for exome analysis.
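The top 1 and top 10 accuracy rates reported above can be read as the fraction of cases whose disease-causing gene is ranked within the first k candidates. A minimal sketch of that metric follows; the function name `top_k_accuracy` and the toy rankings are illustrative assumptions, not data or code from the study.

```python
def top_k_accuracy(ranked_genes_per_case, causal_genes, k):
    """Fraction of cases whose causal gene is among the first k ranked genes."""
    if len(ranked_genes_per_case) != len(causal_genes):
        raise ValueError("need exactly one causal gene per ranked list")
    if not causal_genes:
        return 0.0
    hits = sum(
        1
        for ranked, causal in zip(ranked_genes_per_case, causal_genes)
        if causal in ranked[:k]
    )
    return hits / len(causal_genes)


# Hypothetical prioritization output for three cases (toy data only).
rankings = [
    ["KMT2D", "PIGO", "ARID1B"],
    ["PIGV", "KMT2D", "PIGO"],
    ["ARID1B", "PIGV", "KMT2D"],
]
causal = ["KMT2D", "PIGO", "PIGO"]

top1 = top_k_accuracy(rankings, causal, 1)
top3 = top_k_accuracy(rankings, causal, 3)
```

Comparing this value for rankings produced with and without image-derived similarity scores would give the kind of improvement the abstract describes.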
Subjects
Computational Biology/methods , Computer-Assisted Image Processing/methods , DNA Sequence Analysis/methods , Algorithms , Genetic Databases , Deep Learning , Exome/genetics , Female , Genomics , Humans , Male , Phenotype , Software
ABSTRACT
MOTIVATION: Next generation sequencing technology considerably changed the way we screen for pathogenic mutations in rare Mendelian disorders. However, the identification of the disease-causing mutation amongst thousands of variants of partly unknown relevance is still challenging, and efficient techniques that reduce the genomic search space play a decisive role. Segregation or linkage analysis is often used to prioritize candidates; however, these approaches require correct information about the degree of relationship among the sequenced samples. For quality assurance, an automated control of pedigree structures and sample assignment is therefore highly desirable in order to detect label mix-ups that might otherwise corrupt downstream analysis. RESULTS: We developed an algorithm based on likelihood ratios that discriminates between different classes of relationship for an arbitrary number of genotyped samples. By identifying the most likely class we are able to reconstruct entire pedigrees iteratively, even for highly consanguineous families. We tested our approach on exome data of different sequencing studies and achieved high precision for all pedigree predictions. By analyzing the precision for varying degrees of relatedness or inbreeding we could show that a prediction is robust down to magnitudes of a few hundred loci. AVAILABILITY AND IMPLEMENTATION: A Java standalone application that computes the relationships between multiple samples and an R script that visualizes the pedigree information are available for download, and a web service is available at www.gene-talk.de CONTACT: heinrich@molgen.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Human Genome , Mutation , Pedigree , DNA Sequence Analysis/methods , Software , Algorithms , Exome , Female , Genetic Linkage , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Humans , Male
ABSTRACT
Significant improvements in automated image analysis have been achieved in recent years and tools are now increasingly being used in computer-assisted syndromology. However, the ability to recognize a syndromic facial gestalt might depend on the syndrome and may also be confounded by severity of phenotype, size of available training sets, ethnicity, age, and sex. Therefore, benchmarking and comparing the performance of deep-learned classification processes is inherently difficult. For a systematic analysis of these influencing factors we chose the lysosomal storage diseases mucolipidosis as well as mucopolysaccharidosis type I and II that are known for their wide and overlapping phenotypic spectra. For a dysmorphic comparison we used Smith-Lemli-Opitz syndrome as another inborn error of metabolism and Nicolaides-Baraitser syndrome as another disorder that is also characterized by coarse facies. A classifier that was trained on these five cohorts, comprising 289 patients in total, achieved a mean accuracy of 62%. We also developed a simulation framework to analyze the effect of potential confounders, such as cohort size, age, sex, or ethnic background on the distinguishability of phenotypes. We found that the true positive rate increases for all analyzed disorders for growing cohorts (n = [10...40]) while ethnicity and sex have no significant influence. The dynamics of the accuracies strongly suggest that the maximum distinguishability is a phenotype-specific value, which has not been reached yet for any of the studied disorders. This should also be a motivation to further intensify data sharing efforts, as computer-assisted syndrome classification can still be improved by enlarging the available training sets.
Subjects
Computer-Assisted Image Processing/methods , Computer-Assisted Image Processing/trends , Inborn Errors of Metabolism/diagnosis , Adolescent , Algorithms , Child , Facies , Female , Congenital Foot Deformities/diagnosis , Congenital Foot Deformities/metabolism , Humans , Hypotrichosis/diagnosis , Hypotrichosis/metabolism , Intellectual Disability/diagnosis , Intellectual Disability/metabolism , Male , Inborn Errors of Metabolism/metabolism , Inborn Errors of Metabolism/pathology , Molecular Diagnostic Techniques/methods , Molecular Diagnostic Techniques/trends , Phenotype , Smith-Lemli-Opitz Syndrome/diagnosis , Smith-Lemli-Opitz Syndrome/metabolism , Syndrome
ABSTRACT
MOTIVATION: When analyzing a case group of patients with ultra-rare disorders the ethnicities are often diverse and the data quality might vary. The population substructure in the case group as well as the heterogeneous data quality can cause substantial inflation of test statistics and result in spurious associations in case-control studies if not properly adjusted for. Existing techniques to correct for confounding effects were especially developed for common variants and are not applicable to rare variants. RESULTS: We analyzed strategies to select suitable controls for cases that are based on similarity metrics that vary in their weighting schemes. We simulated different disease entities on real exome data and show that a similarity-based selection scheme can help to reduce false positive associations and to optimize the performance of the statistical tests. Especially when data quality as well as ethnicities vary widely in the case group, a matching approach that puts more weight on rare variants shows the best performance. We reanalyzed collections of unrelated patients with Kabuki make-up syndrome, Hyperphosphatasia with Mental Retardation syndrome and Catel-Manzke syndrome for which the disease genes were recently described. We show that rare variant association tests are more sensitive and specific in identifying the disease gene than intersection filters and should thus be considered as a favorable approach in analyzing even small patient cohorts. AVAILABILITY AND IMPLEMENTATION: Datasets used in our analysis are available at ftp://ftp.1000genomes.ebi.ac.uk./vol1/ftp/ CONTACT: peter.krawitz@charite.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
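To illustrate what a matching scheme that puts more weight on rare variants could look like, the sketch below scores each candidate control by a frequency-weighted overlap with a case and keeps the best matches. The weight `-log10(frequency)`, the weighted Jaccard similarity, and all names and values are assumptions chosen for illustration; the abstract does not specify the metrics that were actually compared.

```python
import math


def variant_weight(freq):
    """Rarer variants get larger weights (assumed weighting: -log10 of frequency)."""
    return -math.log10(freq)


def weighted_similarity(a, b, freqs):
    """Weighted Jaccard similarity between two sets of variant IDs."""
    union = a | b
    if not union:
        return 0.0
    shared = sum(variant_weight(freqs[v]) for v in a & b)
    total = sum(variant_weight(freqs[v]) for v in union)
    return shared / total if total > 0 else 0.0


def select_controls(case, controls, freqs, n):
    """Return the names of the n controls most similar to the case."""
    ranked = sorted(
        controls,
        key=lambda name: weighted_similarity(case, controls[name], freqs),
        reverse=True,
    )
    return ranked[:n]


# Toy data (hypothetical variant IDs and population frequencies).
freqs = {"v1": 0.5, "v2": 0.4, "v3": 0.001, "v4": 0.002}
case = {"v1", "v2", "v3"}
controls = {"c1": {"v1", "v2"}, "c2": {"v1", "v3"}, "c3": {"v4"}}

matched = select_controls(case, controls, freqs, 2)
```

In this toy example the control sharing the rare variant `v3` outranks the one sharing two common variants, which is the behavior a rare-variant-weighted scheme is meant to produce.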
Subjects
Genetic Association Studies , Genetic Variation , Case-Control Studies , Data Accuracy , Disease/genetics , Ethnicity/genetics , Humans , ROC Curve , DNA Sequence Analysis
ABSTRACT
Dysmorphologists sometimes encounter challenges in recognizing disorders due to phenotypic variability influenced by factors such as age and ethnicity. Moreover, the performance of Next Generation Phenotyping Tools such as GestaltMatcher is dependent on the diversity of the training set. Therefore, we developed GestaltMatcher Database (GMDB) - a global reference for the phenotypic variability of rare diseases that complies with the FAIR-principles. We curated dysmorphic patient images and metadata from 2,224 publications, transforming GMDB into an online dynamic case report journal. To encourage clinicians worldwide to contribute, each case can receive a Digital Object Identifier (DOI), making it a citable micro-publication. This resulted in a collection of 2,312 unpublished images, partly with longitudinal data. We have compiled a collection of 10,189 frontal images from 7,695 patients representing 683 disorders. The web interface enables gene- and phenotype-centered queries for registered users (https://db.gestaltmatcher.org/). Despite the predominant European ancestry of most patients (59%), our global collaborations have facilitated the inclusion of data from frequently underrepresented ethnicities, with 17% Asian, 4% African, and 6% with other ethnic backgrounds. The analysis has revealed a significant enhancement in GestaltMatcher performance across all ethnic groups, incorporating non-European ethnicities, showcasing a remarkable increase in Top-1-Accuracy by 31.56% and Top-5-Accuracy by 12.64%. Importantly, this improvement was achieved without altering the performance metrics for European patients. GMDB addresses dysmorphology challenges by representing phenotypic variability and including underrepresented groups, enhancing global diagnostic rates and serving as a vital clinician reference database.
ABSTRACT
The most important factor that complicates the work of dysmorphologists is the significant phenotypic variability of the human face. Next-Generation Phenotyping (NGP) tools that assist clinicians with recognizing characteristic syndromic patterns are particularly challenged when confronted with patients from populations different from their training data. To that end, we systematically analyzed the impact of genetic ancestry on facial dysmorphism. For that purpose, we established the GestaltMatcher Database (GMDB) as a reference dataset for medical images of patients with rare genetic disorders from around the world. We collected 10,980 frontal facial images - more than a quarter previously unpublished - from 8,346 patients, representing 581 rare disorders. Although the predominant ancestry is still European (67%), data from underrepresented populations have been increased considerably via global collaborations (19% Asian and 7% African). This includes previously unpublished reports for more than 40% of the African patients. The NGP analysis on this diverse dataset revealed characteristic performance differences depending on the composition of training and test sets corresponding to genetic relatedness. For clinical use of NGP, incorporating non-European patients resulted in a profound enhancement of GestaltMatcher performance. The top-5 accuracy rate increased by +11.29%. Importantly, this improvement in delineating the correct disorder from a facial portrait was achieved without decreasing the performance on European patients. By design, GMDB complies with the FAIR principles by rendering the curated medical data findable, accessible, interoperable, and reusable. This means GMDB can also serve as data for training and benchmarking. In summary, our study on facial dysmorphism on a global sample revealed a considerable cross-ancestral phenotypic variability confounding NGP that should be counteracted by international efforts for increasing data diversity. GMDB will serve as a vital reference database for clinicians and a transparent training set for advancing NGP technology.
ABSTRACT
SUMMARY: Next-generation sequencing has become a powerful tool in personalized medicine. Exomes or even whole genomes of patients suffering from rare diseases are screened for sequence variants. After filtering out common polymorphisms, the assessment and interpretation of detected personal variants in the clinical context is often a time-consuming effort. We have developed GeneTalk, a web-based platform that serves as an expert exchange network for the assessment of personal and potentially disease-relevant sequence variants. GeneTalk assists a clinical geneticist who is searching for information about specific sequence variants and connects this user to other users with expertise for the same sequence variant. AVAILABILITY: GeneTalk is available at www.gene-talk.de. Users can log in to a demo account without registering. CONTACT: peter.krawitz@gene-talk.de.
Subjects
Computational Biology/methods , Human Genome , Information Dissemination/methods , Genetic Polymorphism , DNA Sequence Analysis/methods , Exome , Humans , Internet , Knowledge Management , Molecular Sequence Annotation , Precision Medicine , Software , User-Computer Interface
ABSTRACT
Many monogenic disorders cause a characteristic facial morphology. Artificial intelligence can support physicians in recognizing these patterns by associating facial phenotypes with the underlying syndrome through training on thousands of patient photographs. However, this 'supervised' approach means that diagnoses are only possible if the disorder was part of the training set. To improve recognition of ultra-rare disorders, we developed GestaltMatcher, an encoder for portraits that is based on a deep convolutional neural network. Photographs of 17,560 patients with 1,115 rare disorders were used to define a Clinical Face Phenotype Space, in which distances between cases define syndromic similarity. Here we show that patients can be matched to others with the same molecular diagnosis even when the disorder was not included in the training set. Together with mutation data, GestaltMatcher could not only accelerate the clinical diagnosis of patients with ultra-rare disorders and facial dysmorphism but also enable the delineation of new phenotypes.
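The idea that distances in a phenotype space define syndromic similarity can be sketched as a nearest-neighbor lookup over face encodings. The example below assumes cosine distance and uses tiny made-up vectors; the function names, case IDs, and three-dimensional embeddings are hypothetical and stand in for the output of a trained encoder.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / (norm_u * norm_v)


def rank_matches(query, gallery):
    """Return gallery case IDs ordered from most to least similar to the query."""
    return sorted(gallery, key=lambda case_id: cosine_distance(query, gallery[case_id]))


# Hypothetical embeddings of previously diagnosed cases (toy values).
gallery = {
    "case_A": [1.0, 0.0, 0.0],
    "case_B": [0.9, 0.1, 0.0],
    "case_C": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

ranking = rank_matches(query, gallery)
```

Because matching only needs encodings of other cases, not a class label learned during training, a lookup of this kind can in principle return patients with a disorder the encoder never saw, which is the property the abstract emphasizes.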
Subjects
Artificial Intelligence , Rare Diseases , Face , Humans , Computer Neural Networks , Phenotype , Rare Diseases/genetics
ABSTRACT
Many rare syndromes can be well described and delineated from other disorders by a combination of characteristic symptoms. These phenotypic features are best documented with terms of the Human Phenotype Ontology (HPO), which are increasingly used in electronic health records (EHRs), too. Many algorithms that perform HPO-based gene prioritization have also been developed; however, the performance of many such tools suffers from an over-representation of atypical cases in the medical literature. This is certainly the case if the algorithm cannot handle features that occur with reduced frequency in a disorder. With Cada, we built a knowledge graph based on both case annotations and disorder annotations. Using network representation learning, we achieve gene prioritization by link prediction. Our results suggest that Cada exhibits superior performance particularly for patients that present with the pathognomonic findings of a disease. Additionally, information about the frequency of occurrence of a feature can readily be incorporated, when available. Crucial in the design of our approach is the use of the growing amount of phenotype-genotype information that diagnostic labs deposit in databases such as ClinVar. By this means, Cada is an ideal reference tool for differential diagnostics in rare disorders that can also be updated regularly.
ABSTRACT
The identification of disease-causing mutations in next-generation sequencing (NGS) data requires efficient filtering techniques. In patients with rare recessive diseases, compound heterozygosity of pathogenic mutations is the most likely inheritance model if the parents are non-consanguineous. We developed a web-based compound heterozygous filter that is suited for data from NGS projects and that is easy to use for non-bioinformaticians. We analyzed the power of compound heterozygous mutation filtering by deriving background distributions for healthy individuals from different ethnicities and studied the effectiveness in trios as well as more complex pedigree structures. While usually more than 30 genes harbor potential compound heterozygotes in single exomes, this number can be markedly reduced with every additional member of the pedigree that is included in the analysis. In a real data set with exomes of four family members, two sisters affected by Mabry syndrome and their healthy parents, the disease-causing gene PIGO, which harbors the pathogenic compound heterozygous variants, could be readily identified. Compound heterozygous filtering is an efficient means to reduce the number of candidate mutations in studies aiming at identifying recessive disease genes in non-consanguineous families. A web-server is provided to make this filtering strategy available at www.gene-talk.de.
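The trio case of such a filter can be sketched as follows: a gene is kept when the affected child carries at least one heterozygous variant inherited only from the mother and at least one inherited only from the father. This is a simplified illustration, not the web server's implementation; genotypes are assumed to be encoded as alternate-allele counts (0, 1, 2), and the record layout and variant IDs are hypothetical.

```python
from collections import defaultdict


def compound_het_genes(variants):
    """Map each candidate gene to (maternal variant IDs, paternal variant IDs)."""
    maternal = defaultdict(list)
    paternal = defaultdict(list)
    for v in variants:
        if v["child"] != 1:
            continue  # only heterozygous calls in the child are considered
        if v["mother"] == 1 and v["father"] == 0:
            maternal[v["gene"]].append(v["id"])
        elif v["father"] == 1 and v["mother"] == 0:
            paternal[v["gene"]].append(v["id"])
    # A gene qualifies only if it has hits on both parental haplotypes.
    return {
        gene: (ids, paternal[gene])
        for gene, ids in maternal.items()
        if gene in paternal
    }


# Toy trio (hypothetical variant records; genotype = alternate-allele count).
trio_variants = [
    {"id": "var_a", "gene": "PIGO", "child": 1, "mother": 1, "father": 0},
    {"id": "var_b", "gene": "PIGO", "child": 1, "mother": 0, "father": 1},
    {"id": "var_c", "gene": "GENE_X", "child": 1, "mother": 1, "father": 0},
    {"id": "var_d", "gene": "GENE_X", "child": 1, "mother": 1, "father": 0},
    {"id": "var_e", "gene": "GENE_Y", "child": 2, "mother": 1, "father": 1},
]

candidates = compound_het_genes(trio_variants)
```

`GENE_X` is dropped because both of its variants come from the mother, which shows why parental genotypes shrink the candidate list compared with a single exome. Extending the check to an affected sibling would require the same variant pair in both children.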
Subjects
Computational Biology/methods , Heterozygote , High-Throughput Nucleotide Sequencing , Exome/genetics , Humans , Mutation , Pedigree
ABSTRACT
With exome sequencing becoming a tool for mutation detection in routine diagnostics there is an increasing need for platform-independent methods of quality control. We present a genotype-weighted metric that allows comparison of all the variant calls of an exome to a high-quality reference dataset of an ethnically matched population. The exome-wide genotyping accuracy is estimated from the distance to this reference set, and does not require any further knowledge about data generation or the bioinformatics involved. The distances of our metric are visualized by non-metric multidimensional scaling and serve as an intuitive, standardizable score for the quality assessment of exome data.
ABSTRACT
The bioreaction database established by Ma and Zeng (Bioinformatics, 2003, 19, 270-277) for in silico reconstruction of genome-scale metabolic networks has been widely used. Based on more recent information in the reference databases KEGG LIGAND and Brenda, we upgrade the bioreaction database in this work by almost doubling the number of reactions from 3565 to 6851. Over 70% of the reactions have been manually updated/revised in terms of reversibility, reactant pairs, currency metabolites and error correction. For the first time, 41 spontaneous sugar mutarotation reactions are introduced into the biochemical database. The upgrade significantly improves the reconstruction of genome-scale metabolic networks. Many gaps or missing biochemical links can be recovered, as exemplified with three model organisms: Homo sapiens, Aspergillus niger, and Escherichia coli. The topological parameters of the constructed networks were also largely affected; however, the overall network structure remains scale-free. Furthermore, we consider the problem of computing biologically feasible shortest paths in reconstructed metabolic networks. We show that these paths are hard to compute and present solutions to find such paths in networks of small and medium size.
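One reason currency metabolites matter for path finding can be shown with a small breadth-first search: if cofactors such as ATP and ADP are left in the graph, they create shortcuts that are not biologically meaningful. The sketch below applies only this one heuristic (dropping currency metabolites) and is not the method proposed in the paper, which reports that truly feasible paths are hard to compute. The reaction list is a toy, partly lumped simplification.

```python
from collections import deque


def shortest_path(reactions, source, target, currency):
    """BFS over substrate -> product links, ignoring currency metabolites."""
    graph = {}
    for substrates, products in reactions:
        for s in substrates:
            if s in currency:
                continue
            for p in products:
                if p not in currency:
                    graph.setdefault(s, set()).add(p)

    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(graph.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # target not reachable


# Toy network; "F16BP -> PEP" lumps several steps for brevity (illustration only).
reactions = [
    (["glucose", "ATP"], ["G6P", "ADP"]),
    (["G6P"], ["F6P"]),
    (["F6P", "ATP"], ["F16BP", "ADP"]),
    (["F16BP"], ["PEP"]),
    (["PEP", "ADP"], ["pyruvate", "ATP"]),
]

naive = shortest_path(reactions, "glucose", "pyruvate", currency=set())
filtered = shortest_path(reactions, "glucose", "pyruvate", currency={"ATP", "ADP"})
```

Without the filter the search jumps from glucose to pyruvate through ADP; with it, the path follows the carbon backbone through the intermediates.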