ABSTRACT
Diagnosis for rare genetic diseases often relies on phenotype-driven methods, which hinge on the accuracy and completeness of the rare disease phenotypes in the underlying annotation knowledgebase. Existing knowledgebases are often manually curated with additional annotations found in published case reports. Despite their potential, real-world data such as electronic health records (EHRs) have not been fully exploited to derive rare disease annotations. Here, we present open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives. The rare disease prevalence derived by OARD is highly correlated with that annotated in the original rare disease knowledgebase. By performing association analysis, we identified more than 1 million novel disease-phenotype association pairs that were previously missed by human annotation, and >60% were confirmed as true associations via manual review of a list of sampled pairs. Compared to the manually curated annotation, OARD is 100% data driven and its pipeline can be shared across different institutions. By supporting privacy-preserving sharing of aggregated summary statistics, such as term frequencies and disease-phenotype associations, it fills an important gap to facilitate data-driven research in the rare disease community.
Subjects
Natural Language Processing, Rare Diseases, Electronic Health Records, Humans, Phenotype, Rare Diseases/genetics
ABSTRACT
Human Phenotype Ontology (HPO)-based approaches have gained popularity in recent times as a tool for genomic diagnostics of rare diseases. However, these approaches do not make full use of the available information on disease and patient phenotypes. We present a new method called Phen2Disease, which utilizes the bidirectional maximum matching semantic similarity between two phenotype sets of patients and diseases to prioritize diseases and genes. Our comprehensive experiments have been conducted on six real data cohorts with 2051 cases (Cohort 1, n = 384; Cohort 2, n = 281; Cohort 3, n = 185; Cohort 4, n = 784; Cohort 5, n = 208; and Cohort 6, n = 209) and two simulated data cohorts with 1000 cases. The results of the experiments showed that Phen2Disease outperforms the three state-of-the-art methods when only phenotype information and HPO knowledge base are used, particularly in cohorts with fewer average numbers of HPO terms. We also observed that patients with higher information content scores have more specific information, leading to more accurate predictions. Moreover, Phen2Disease provides high interpretability with ranked diseases and patient HPO terms presented. Our method provides a novel approach to utilizing phenotype data for genomic diagnostics of rare diseases, with potential for clinical impact. Phen2Disease is freely available on GitHub at https://github.com/ZhuLab-Fudan/Phen2Disease.
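The idea of a bidirectional best-match similarity between a patient's phenotype set and a disease's phenotype set can be illustrated with a minimal Python sketch. It is not the Phen2Disease implementation: the toy ontology `ANCESTORS`, the Jaccard-style `term_sim`, and the unweighted averaging are all assumptions made for illustration, whereas the published method relies on HPO-based semantic similarity and information content.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical toy ontology: each term maps to its ancestor set (including itself).
ANCESTORS: Dict[str, Set[str]] = {
    "HP:A": {"HP:A", "HP:ROOT"},
    "HP:B": {"HP:B", "HP:A", "HP:ROOT"},
    "HP:C": {"HP:C", "HP:ROOT"},
}


def term_sim(t1: str, t2: str) -> float:
    """Toy Jaccard similarity over ancestor sets (stand-in for an IC-based measure)."""
    a, b = ANCESTORS[t1], ANCESTORS[t2]
    return len(a & b) / len(a | b)


def directed_best_match(
    source: Iterable[str], target: Iterable[str], sim: Callable[[str, str], float]
) -> float:
    """Average, over source terms, of the best similarity to any target term."""
    source, target = list(source), list(target)
    if not source or not target:
        return 0.0
    return sum(max(sim(s, t) for t in target) for s in source) / len(source)


def bidirectional_similarity(
    patient: Iterable[str],
    disease: Iterable[str],
    sim: Callable[[str, str], float] = term_sim,
) -> float:
    """Mean of the patient-to-disease and disease-to-patient best-match scores."""
    patient, disease = list(patient), list(disease)
    forward = directed_best_match(patient, disease, sim)
    backward = directed_best_match(disease, patient, sim)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    # Rank two hypothetical diseases for a patient annotated with HP:B only.
    candidates = {"disease_1": ["HP:B"], "disease_2": ["HP:A", "HP:C"]}
    scores = {d: bidirectional_similarity(["HP:B"], t) for d, t in candidates.items()}
    for name in sorted(scores, key=scores.get, reverse=True):
        print(name, round(scores[name], 4))
```

Scoring in both directions penalizes a disease that matches the patient's terms well but also carries many phenotypes the patient lacks, which a one-directional score would ignore.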
Subjects
Biological Ontologies, Rare Diseases, Humans, Semantics, Genomics, Phenotype
ABSTRACT
Speech and language disorders are known to have a substantial genetic contribution. Although frequently examined as components of other conditions, research on the genetic basis of linguistic differences as separate phenotypic subgroups has been limited so far. Here, we performed an in-depth characterization of speech and language disorders in 52 143 individuals, reconstructing clinical histories using a large-scale data-mining approach of the electronic medical records from an entire large paediatric healthcare network. The reported frequency of these disorders was the highest between 2 and 5 years old and spanned a spectrum of 26 broad speech and language diagnoses. We used natural language processing to assess the degree to which clinical diagnoses in full-text notes were reflected in ICD-10 diagnosis codes. We found that aphasia and speech apraxia could be retrieved easily through ICD-10 diagnosis codes, whereas stuttering as a speech phenotype was coded in only 12% of individuals through appropriate ICD-10 codes. We found significant comorbidity of speech and language disorders in neurodevelopmental conditions (30.31%) and, to a lesser degree, with epilepsies (6.07%) and movement disorders (2.05%). The most common genetic disorders retrievable in our analysis of electronic medical records were STXBP1 (n = 21), PTEN (n = 20) and CACNA1A (n = 18). When assessing associations of genetic diagnoses with specific linguistic phenotypes, we observed associations of STXBP1 and aphasia (P = 8.57 × 10⁻⁷, 95% confidence interval = 18.62-130.39) and MYO7A with speech and language development delay attributable to hearing loss (P = 1.24 × 10⁻⁵, 95% confidence interval = 17.46-infinity).
Finally, in a sub-cohort of 726 individuals with whole-exome sequencing data, we identified an enrichment of rare variants in neuronal receptor pathways, in addition to associations of UQCRC1 and KIF17 with expressive aphasia, MROH8 and BCHE with poor speech, and USP37, SLC22A9 and UMODL1 with aphasia. In summary, our study outlines the landscape of paediatric speech and language disorders, confirming the phenotypic complexity of linguistic traits and novel genotype-phenotype associations. Subgroups of paediatric speech and language disorders differ significantly with respect to the composition of monogenic aetiologies.
ABSTRACT
Autoimmunity in inborn errors of immunity (IEIs) has a multifactorial pathogenesis and develops subsequent to a genetic predisposition in conjunction with gene regulation, environmental modifiers, and infectious triggers. On the basis of incremental data availability owing to upfront application of omics technologies, a more granular and dynamic view of mechanisms and manifestations is warranted. Here, we present a comprehensive novel concept of autoimmunity in IEIs that considers multiple layers of interdependent elements and connects 101 causative genes or deletions according to the quality of the allelic variants with 47 molecular pathways and 22 immune effector mechanisms. Furthermore, we list 50 resulting manifestations together with the corresponding Human Phenotype Ontology terms and review the types and frequencies of the most relevant clinical presentations. When all of its elements are taken together, this concept (1) extends the historical anatomic view of central versus peripheral tolerance toward multiple interdependent mechanisms of immune tolerance, (2) delineates the mechanisms underlying the protean clinical manifestations, and thereby, (3) points toward the most suitable precision therapy for autoimmunity in IEIs. The multilayer concept of autoimmune mechanisms and manifestations in IEIs will facilitate research design and provide clinical guidance on the use of precision medicine irrespective of the data depth available in each health care scenario.
Subjects
Autoimmunity, Precision Medicine, Humans, Alleles, Genetic Predisposition to Disease, Immune Tolerance
ABSTRACT
PURPOSE: Clinical intuition is commonly incorporated into the differential diagnosis as an assessment of the likelihood of candidate diagnoses based either on the patient population being seen in a specific clinic or on the signs and symptoms of the initial presentation. Algorithms to support diagnostic sequencing in individuals with a suspected rare genetic disease do not yet incorporate intuition and instead assume that each Mendelian disease has an equal pretest probability. METHODS: The LIRICAL algorithm calculates the likelihood ratio of clinical manifestations represented by Human Phenotype Ontology (HPO) terms to rank candidate diagnoses. The initial version of LIRICAL assumed an equal pretest probability for each disease in its calculation of the posttest probability (where the test is diagnostic exome or genome sequencing). We introduce Clinical Intuition for Likelihood Ratios (ClintLR), an extension of the LIRICAL algorithm that boosts the pretest probability of groups of related diseases deemed to be more likely. RESULTS: The average rank of the correct diagnosis in simulations using ClintLR showed a statistically significant improvement over a range of adjustment factors. CONCLUSION: ClintLR successfully encodes clinical intuition to improve ranking of rare diseases in diagnostic sequencing. ClintLR is freely available at https://github.com/TheJacksonLaboratory/ClintLR.
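The effect of boosting the pretest probability of a disease group can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ClintLR code: the function names, the multiplicative `factor`, and the renormalization step are hypothetical, and the likelihood-ratio values are made up.

```python
from math import prod
from typing import Dict, Iterable, Set


def boosted_pretest(
    diseases: Iterable[str], boosted: Set[str], factor: float
) -> Dict[str, float]:
    """Start from equal weights, multiply the chosen diseases by `factor`,
    then renormalize so the pretest probabilities sum to 1 (an assumption)."""
    weights = {d: (factor if d in boosted else 1.0) for d in diseases}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


def posttest_probability(pretest: float, term_lrs: Iterable[float]) -> float:
    """Posttest odds = pretest odds x product of per-term likelihood ratios."""
    odds = pretest / (1.0 - pretest) * prod(term_lrs)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    diseases = ["D1", "D2", "D3", "D4"]  # hypothetical candidate diagnoses
    uniform = boosted_pretest(diseases, set(), 1.0)
    boosted = boosted_pretest(diseases, {"D1"}, 3.0)
    lrs_for_d1 = [2.0]  # made-up likelihood ratio for one observed HPO term
    print("equal pretest:  ", posttest_probability(uniform["D1"], lrs_for_d1))
    print("boosted pretest:", posttest_probability(boosted["D1"], lrs_for_d1))
```

With these toy numbers the posttest probability of D1 rises from about 0.40 to about 0.67, which shows how the same phenotypic evidence ranks a disease higher once prior clinical suspicion is encoded.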
ABSTRACT
OBJECTIVE: Clinical deep phenotyping and phenotype annotation play a critical role in both the diagnosis of patients with rare disorders as well as in building computationally-tractable knowledge in the rare disorders field. These processes rely on using ontology concepts, often from the Human Phenotype Ontology, in conjunction with a phenotype concept recognition task (supported usually by machine learning methods) to curate patient profiles or existing scientific literature. With the significant shift in the use of large language models (LLMs) for most NLP tasks, we examine the performance of the latest Generative Pre-trained Transformer (GPT) models underpinning ChatGPT as a foundation for the tasks of clinical phenotyping and phenotype annotation. MATERIALS AND METHODS: The experimental setup of the study included seven prompts of various levels of specificity, two GPT models (gpt-3.5-turbo and gpt-4.0) and two established gold standard corpora for phenotype recognition, one consisting of publication abstracts and the other clinical observations. RESULTS: The best run, using in-context learning, achieved 0.58 document-level F1 score on publication abstracts and 0.75 document-level F1 score on clinical observations, as well as a mention-level F1 score of 0.7, which surpasses the current best in class tool. Without in-context learning, however, performance is significantly below the existing approaches. CONCLUSION: Our experiments show that gpt-4.0 surpasses the state of the art performance if the task is constrained to a subset of the target ontology where there is prior knowledge of the terms that are expected to be matched. While the results are promising, the non-deterministic nature of the outcomes, the high cost and the lack of concordance between different runs using the same prompt and input make the use of these LLMs challenging for this particular task.
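For readers unfamiliar with the metric, a document-level F1 score for concept recognition can be computed as in the following Python sketch. It assumes micro-averaging over documents and treats HPO identifiers as opaque strings; the function name and the toy gold and predicted annotations are hypothetical, and the study's exact evaluation protocol may differ.

```python
from typing import Dict, Set


def document_level_f1(gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]) -> float:
    """Micro-averaged F1 over per-document sets of ontology term IDs.

    A term counts once per document, regardless of how often it is mentioned.
    Documents that appear only in `pred` are ignored in this sketch.
    """
    tp = fp = fn = 0
    for doc_id, gold_terms in gold.items():
        pred_terms = pred.get(doc_id, set())
        tp += len(gold_terms & pred_terms)   # correctly recognized terms
        fp += len(pred_terms - gold_terms)   # spurious terms
        fn += len(gold_terms - pred_terms)   # missed terms
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative annotations only; not taken from either gold-standard corpus.
    gold = {"doc1": {"HP:0001250", "HP:0001263"}, "doc2": {"HP:0000252"}}
    pred = {"doc1": {"HP:0001250"}, "doc2": {"HP:0000252", "HP:0001249"}}
    print(round(document_level_f1(gold, pred), 3))
```

A mention-level score would instead compare individual text spans, so the same system can obtain different document-level and mention-level F1 values.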
Subjects
Knowledge, Language, Humans, Machine Learning, Phenotype, Rare Diseases
ABSTRACT
BACKGROUND: There are approximately 8,000 different rare diseases that affect roughly 400 million people worldwide. Many of them suffer from delayed diagnosis. Ciliopathies are rare monogenic disorders characterized by a significant phenotypic and genetic heterogeneity that raises an important challenge for clinical diagnosis. Diagnosis support systems (DSS) applied to electronic health record (EHR) data may help identify undiagnosed patients, which is of paramount importance to improve patients' care. Our objective was to evaluate three online-accessible rare disease DSSs using phenotypes derived from EHRs for the diagnosis of ciliopathies. METHODS: Two datasets of ciliopathy cases, either proven or suspected, and two datasets of controls were used to evaluate the DSSs. Patient phenotypes were automatically extracted from their EHRs and converted to Human Phenotype Ontology terms. We tested the ability of the DSSs to diagnose cases in contrast to controls based on Orphanet ontology. RESULTS: A total of 79 cases and 38 controls were selected. Performances of the DSSs on ciliopathy real world data (best DSS with area under the ROC curve = 0.72) were not as good as published performances on the test set used in the DSS development phase. None of these systems obtained results which could be described as "expert-level". Patients with multisystemic symptoms were generally easier to diagnose than patients with isolated symptoms. Diseases easily confused with ciliopathy generally affected multiple organs and had overlapping phenotypes. Four challenges need to be considered to improve the performances: to make the DSSs interoperable with EHR systems, to validate the performances in real-life settings, to deal with data quality, and to leverage methods and resources for rare and complex diseases. 
CONCLUSION: Our study provides insights into the complexities of diagnosing highly heterogeneous rare diseases and offers lessons derived from evaluating existing DSSs in real-world settings. These insights are not only beneficial for ciliopathy diagnosis but also relevant to the enhancement of DSSs for various complex rare disorders, by guiding the development of more clinically relevant rare disease DSSs that could support early diagnosis and ultimately make more patients eligible for treatment.
Subjects
Ciliopathies, Electronic Health Records, Rare Diseases, Humans, Ciliopathies/diagnosis, Rare Diseases/diagnosis, Clinical Decision Support Systems, Phenotype
ABSTRACT
Human Phenotype Ontology (HPO)-based analysis has become standard for genomic diagnostics of rare diseases. Current algorithms use a variety of semantic and statistical approaches to prioritize the typically long lists of genes with candidate pathogenic variants. These algorithms do not provide robust estimates of the strength of the predictions beyond the placement in a ranked list, nor do they provide measures of how much any individual phenotypic observation has contributed to the prioritization result. However, given that the overall success rate of genomic diagnostics is only around 25%-50% or less in many cohorts, a good ranking cannot be taken to imply that the gene or disease at rank one is necessarily a good candidate. Here, we present an approach to genomic diagnostics that exploits the likelihood ratio (LR) framework to provide an estimate of (1) the posttest probability of candidate diagnoses, (2) the LR for each observed HPO phenotype, and (3) the predicted pathogenicity of observed genotypes. LIkelihood Ratio Interpretation of Clinical AbnormaLities (LIRICAL) placed the correct diagnosis within the first three ranks in 92.9% of 384 case reports comprising 262 Mendelian diseases, and the correct diagnosis had a mean posttest probability of 67.3%. Simulations show that LIRICAL is robust to many typically encountered forms of genomic and phenomic noise. In summary, LIRICAL provides accurate, clinically interpretable results for phenotype-driven genomic diagnostics.
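The likelihood ratio framework referred to above can be written out in its standard textbook form. The notation is introduced here for illustration and is not taken from the paper: $h_1, \dots, h_n$ are the observed phenotypic features, $D$ is a candidate disease, and the features are assumed to be conditionally independent so that their likelihood ratios multiply.

```latex
\[
\mathrm{LR}(h_i \mid D) = \frac{P(h_i \mid D)}{P(h_i \mid \neg D)}, \qquad
\mathrm{odds}_{\mathrm{post}}(D) = \frac{p_{\mathrm{pre}}(D)}{1 - p_{\mathrm{pre}}(D)}
  \prod_{i=1}^{n} \mathrm{LR}(h_i \mid D), \qquad
p_{\mathrm{post}}(D) = \frac{\mathrm{odds}_{\mathrm{post}}(D)}{1 + \mathrm{odds}_{\mathrm{post}}(D)}
\]
```

As a worked example with hypothetical numbers, a pretest probability of 1/1000 gives pretest odds of 1/999. Two observed features with likelihood ratios of 50 and 40 give a composite ratio of 2000, so the posttest odds are 2000/999 and the posttest probability is 2000/2999, or about 0.667. Each per-feature ratio also shows how much that single observation moved the result, which is the interpretability the abstract describes.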
Subjects
Computational Biology, Genetic Databases, Genomics, Rare Diseases/diagnosis, Algorithms, Exome/genetics, Humans, Phenotype, Rare Diseases/genetics, Software
ABSTRACT
More than 100 genetic etiologies have been identified in developmental and epileptic encephalopathies (DEEs), but correlating genetic findings with clinical features at scale has remained a hurdle because of a lack of frameworks for analyzing heterogeneous clinical data. Here, we analyzed 31,742 Human Phenotype Ontology (HPO) terms in 846 individuals with existing whole-exome trio data and assessed associated clinical features and phenotypic relatedness by using HPO-based semantic similarity analysis for individuals with de novo variants in the same gene. Gene-specific phenotypic signatures included associations of SCN1A with "complex febrile seizures" (HP:0011172; p = 2.1 × 10⁻⁵) and "focal clonic seizures" (HP:0002266; p = 8.9 × 10⁻⁶), STXBP1 with "absent speech" (HP:0001344; p = 1.3 × 10⁻¹¹), and SLC6A1 with "EEG with generalized slow activity" (HP:0010845; p = 0.018). Of 41 genes with de novo variants in two or more individuals, 11 genes showed significant phenotypic similarity, including SCN1A (n = 16, p < 0.0001), STXBP1 (n = 14, p = 0.0021), and KCNB1 (n = 6, p = 0.011). Including genetic and phenotypic data of control subjects increased phenotypic similarity for all genetic etiologies, whereas the probability of observing de novo variants decreased, emphasizing the conceptual differences between semantic similarity analysis and approaches based on the expected number of de novo events. We demonstrate that HPO-based phenotype analysis captures unique profiles for distinct genetic etiologies, reflecting the breadth of the phenotypic spectrum in genetic epilepsies. Semantic similarity can be used to generate statistical evidence for disease causation analogous to the traditional approach of primarily defining disease entities through similar clinical features.
Subjects
GABA Plasma Membrane Transport Proteins/genetics, Munc18 Proteins/genetics, NAV1.1 Voltage-Gated Sodium Channel/genetics, Seizures/genetics, Infantile Spasms/genetics, Speech Disorders/genetics, Preschool Child, Cohort Studies, Female, Gene Expression, Gene Ontology, Humans, Male, Mutation, Phenotype, Seizures/classification, Seizures/diagnosis, Seizures/physiopathology, Semantics, Shab Potassium Channels/genetics, Infantile Spasms/classification, Infantile Spasms/diagnosis, Infantile Spasms/physiopathology, Speech Disorders/classification, Speech Disorders/diagnosis, Speech Disorders/physiopathology, Terminology as Topic, Exome Sequencing
ABSTRACT
Disease-causing variants in STXBP1 are among the most common genetic causes of neurodevelopmental disorders. However, the phenotypic spectrum in STXBP1-related disorders is wide and clear correlations between variant type and clinical features have not been observed so far. Here, we harmonized clinical data across 534 individuals with STXBP1-related disorders and analysed 19 973 derived phenotypic terms, including phenotypes of 253 individuals previously unreported in the scientific literature. The overall phenotypic landscape in STXBP1-related disorders is characterized by neurodevelopmental abnormalities in 95% and seizures in 89% of individuals, including focal-onset seizures as the most common seizure type (47%). More than 88% of individuals with STXBP1-related disorders have seizure onset in the first year of life, including neonatal seizure onset in 47%. Individuals with protein-truncating variants and deletions in STXBP1 (n = 261) were almost twice as likely to present with West syndrome and were more phenotypically similar than expected by chance. Five genetic hotspots with recurrent variants were identified in more than 10 individuals, including p.Arg406Cys/His (n = 40), p.Arg292Cys/His/Leu/Pro (n = 30), p.Arg551Cys/Gly/His/Leu (n = 24), p.Pro139Leu (n = 12), and p.Arg190Trp (n = 11). None of the recurrent variants were significantly associated with distinct electroclinical syndromes or single phenotypic features, or showed overall clinical similarity, indicating that the baseline variability in STXBP1-related disorders is too high for discrete phenotypic subgroups to emerge. We then reconstructed the seizure history in 62 individuals with STXBP1-related disorders in detail, retrospectively assigning seizure type and seizure frequency monthly across 4433 time intervals, and retrieved 251 anti-seizure medication prescriptions from the electronic medical records.
We demonstrate a dynamic pattern of seizure control and complex interplay with response to specific medications particularly in the first year of life when seizures in STXBP1-related disorders are the most prominent. Adrenocorticotropic hormone and phenobarbital were more likely to initially reduce seizure frequency in infantile spasms and focal seizures compared to other treatment options, while the ketogenic diet was most effective in maintaining seizure freedom. In summary, we demonstrate how the multidimensional spectrum of phenotypic features in STXBP1-related disorders can be assessed using a computational phenotype framework to facilitate the development of future precision-medicine approaches.
Subjects
Epilepsy, Infantile Spasms, Electroencephalography, Epilepsy/genetics, Humans, Infant, Munc18 Proteins/genetics, Retrospective Studies, Seizures/genetics, Infantile Spasms/drug therapy, Infantile Spasms/genetics
ABSTRACT
Identifying the causal variant for diagnosis of genetic diseases is challenging when using next-generation sequencing approaches, and variant prioritization tools can assist in this task. These tools provide in silico predictions of variant pathogenicity; however, they are agnostic to the disease under study. We previously performed a disease-specific benchmark of 24 such tools to assess how they perform in different disease contexts. We found that the tools themselves show large differences in performance but, more importantly, that the best tools for variant prioritization depend on the disease phenotypes being considered. Here we expand the assessment to 37 tools and refine it by separating performance for nonsynonymous single nucleotide variants (nsSNVs) and missense variants (i.e., excluding nonsense variants). We found differences in performance for missense variants compared to nsSNVs and recommend three tools that stand out in terms of their performance (BayesDel, CADD, and ClinPred).
Subjects
Benchmarking, Computational Biology, High-Throughput Nucleotide Sequencing, Humans, Missense Mutation, Phenotype
ABSTRACT
Making a specific diagnosis in neurodevelopmental disorders is traditionally based on recognizing clinical features of a distinct syndrome, which guides testing of its possible genetic etiologies. Scalable frameworks for genomic diagnostics, however, have struggled to integrate meaningful measurements of clinical phenotypic features. While standardization has enabled generation and interpretation of genomic data for clinical diagnostics at unprecedented scale, making the equivalent breakthrough for clinical data has proven challenging. However, increasingly clinical features are being recorded using controlled dictionaries with machine readable formats such as the Human Phenotype Ontology (HPO), which greatly facilitates their use in the diagnostic space. Improving the tractability of large-scale clinical information will present new opportunities to inform genomic research and diagnostics from a clinical perspective. Here, we describe novel approaches for computational phenotyping to harmonize clinical features, improve data translation through revising domain-specific dictionaries, quantify phenotypic features, and determine clinical relatedness. We demonstrate how these concepts can be applied to longitudinal phenotypic information, which represents a critical element of developmental disorders and pediatric conditions. Finally, we expand our discussion to clinical data derived from electronic medical records, a largely untapped resource of deep clinical information with distinct strengths and weaknesses.
Subjects
Electronic Health Records, Genomics, Child, Humans, Phenotype
ABSTRACT
Technological advances in both genome sequencing and prenatal imaging are increasing our ability to accurately recognize and diagnose Mendelian conditions prenatally. Phenotype-driven early genetic diagnosis of fetal genetic disease can help to strategize treatment options and clinical preventive measures during the perinatal period, to plan in utero therapies, and to inform parental decision-making. Fetal phenotypes of genetic diseases are often unique and at present are not well understood; more comprehensive knowledge about prenatal phenotypes and computational resources have an enormous potential to improve diagnostics and translational research. The Human Phenotype Ontology (HPO) has been widely used to support diagnostics and translational research in human genetics. To better support prenatal usage, the HPO consortium conducted a series of workshops with a group of domain experts in a variety of medical specialties, diagnostic techniques, as well as diseases and phenotypes related to prenatal medicine, including perinatal pathology, musculoskeletal anomalies, neurology, medical genetics, hydrops fetalis, craniofacial malformations, cardiology, neonatal-perinatal medicine, fetal medicine, placental pathology, prenatal imaging, and bioinformatics. We expanded the representation of prenatal phenotypes in HPO by adding 95 new phenotype terms under the Abnormality of prenatal development or birth (HP:0001197) grouping term, and revised definitions, synonyms, and disease annotations for most of the 152 terms that existed before the beginning of this effort. The expansion of prenatal phenotypes in HPO will support phenotype-driven prenatal exome and genome sequencing for precision genetic diagnostics of rare diseases to support prenatal care.
Subjects
Computational Biology, Placenta, Newborn Infant, Humans, Female, Pregnancy, Computational Biology/methods, Phenotype, Rare Diseases, Exome Sequencing
ABSTRACT
The developmental and epileptic encephalopathies (DEEs) are heterogeneous disorders with a strong genetic contribution, but the underlying genetic etiology remains unknown in a significant proportion of individuals. To explore whether statistical support for genetic etiologies can be generated on the basis of phenotypic features, we analyzed whole-exome sequencing data and phenotypic similarities by using Human Phenotype Ontology (HPO) in 314 individuals with DEEs. We identified a de novo c.508C>T (p.Arg170Trp) variant in AP2M1 in two individuals with a phenotypic similarity that was higher than expected by chance (p = 0.003) and a phenotype related to epilepsy with myoclonic-atonic seizures. We subsequently found the same de novo variant in two individuals with neurodevelopmental disorders and generalized epilepsy in a cohort of 2,310 individuals who underwent diagnostic whole-exome sequencing. AP2M1 encodes the µ-subunit of the adaptor protein complex 2 (AP-2), which is involved in clathrin-mediated endocytosis (CME) and synaptic vesicle recycling. Modeling of protein dynamics indicated that the p.Arg170Trp variant impairs the conformational activation and thermodynamic entropy of the AP-2 complex. Functional complementation of both the µ-subunit carrying the p.Arg170Trp variant in human cells and astrocytes derived from AP-2µ conditional knockout mice revealed a significant impairment of CME of transferrin. In contrast, stability, expression levels, membrane recruitment, and localization were not impaired, suggesting a functional alteration of the AP-2 complex as the underlying disease mechanism. We establish a recurrent pathogenic variant in AP2M1 as a cause of DEEs with distinct phenotypic features, and we implicate dysfunction of the early steps of endocytosis as a disease mechanism in epilepsy.
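One generic way to test whether a group of individuals is more phenotypically similar than expected by chance is a permutation test on pairwise similarity scores, sketched below in Python. This is an assumption-laden illustration rather than the study's procedure: the summary statistic (mean pairwise similarity), the sampling scheme, the function names, and the toy similarity values are all hypothetical.

```python
import random
from itertools import combinations
from statistics import mean
from typing import Dict, FrozenSet, List, Sequence


def mean_pairwise(sim: Dict[FrozenSet[str], float], group: Sequence[str]) -> float:
    """Mean similarity over all unordered pairs of individuals in the group."""
    return mean(sim[frozenset(pair)] for pair in combinations(group, 2))


def permutation_p_value(
    sim: Dict[FrozenSet[str], float],
    individuals: List[str],
    group: Sequence[str],
    n_perm: int = 10_000,
    seed: int = 0,
) -> float:
    """Fraction of random same-size groups at least as similar as the observed one."""
    rng = random.Random(seed)  # seeded for reproducibility
    observed = mean_pairwise(sim, group)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(individuals, len(group))
        if mean_pairwise(sim, sample) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    people = ["a", "b", "c", "d"]  # hypothetical cohort
    # Toy similarity table: only the pair (a, b) is highly similar.
    sim = {frozenset(p): 0.1 for p in combinations(people, 2)}
    sim[frozenset(("a", "b"))] = 0.9
    print(permutation_p_value(sim, people, ["a", "b"]))
```

In a real cohort the similarity table would come from an HPO-based semantic similarity measure, and the cohort would be large enough for small p-values to be attainable; with only four individuals, as here, the smallest possible p-value is about 1/6.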
Subjects
Adaptor Protein Complex 2/genetics, Adaptor Protein Complex mu Subunits/genetics, Brain Diseases/etiology, Clathrin/metabolism, Endocytosis, Epilepsy/etiology, Missense Mutation, Neurodevelopmental Disorders/etiology, Adolescent, Animals, Brain Diseases/pathology, Child, Preschool Child, Clathrin/genetics, Epilepsy/pathology, Female, Humans, Infant, Mice, Knockout Mice, Neurodevelopmental Disorders/pathology, Exome Sequencing
ABSTRACT
The study aims to develop a neural network model to improve the performance of Human Phenotype Ontology (HPO) concept recognition tools. We used the terms, definitions, and comments about the phenotypic concepts in the HPO database to train our model. The document to be analyzed is first split into sentences and annotated with a base method to generate candidate concepts. The sentences, along with the candidate concepts, are then fed into the pre-trained model for re-ranking. Our model comprises the pre-trained BlueBERT and a feature selection module, followed by a contrastive loss. We re-ranked the results generated by three robust HPO annotation tools and compared the performance against most of the existing approaches. The experimental results show that our model can improve the performance of the existing methods. Notably, it improved the F1 score by 3.0% and 5.6% on the two evaluated datasets compared with the base methods. It removed more than 80% of the false positives predicted by the base methods, resulting in up to 18% improvement in precision. Our model utilizes the descriptive data in the ontology and the contextual information in the sentences for re-ranking. The results indicate that the additional information and the re-ranking model can significantly enhance the precision of HPO concept recognition compared with the base method.
Subjects
Language, Neural Networks (Computer), Factual Databases, Humans, Phenotype
ABSTRACT
The ICD-10-GM coding system used in the German healthcare system only captures a minority of rare disease diagnoses. Therefore, information on the incidence and prevalence of rare diseases as well as necessary (financial) resources for the expert care required for evidence-based decisions by health insurers, care providers, and politicians are lacking. Furthermore, the missing information complicates and sometimes even precludes the generation of scientific knowledge on rare diseases. Therefore, starting in 2023, all in-patient cases in Germany with a rare disease diagnosis must be coded with an ORPHAcode using the Alpha-ID-SE file. The Alpha-ID-SE file links the ICD-10-GM codes to the internationally established ORPHAcodes for rare diseases. Commercially available software tools progressively support the coding of rare diseases. In several centers for rare diseases linked to university hospitals, IT tools and procedures were established to realize a complete coding of rare diseases. These include financial incentives for the institutions providing rare disease codes, systematic queries asking for rare disease codes during the coding process, and a semi-automated coding process for all patients with a rare disease previously seen at the institution. A combination of the different approaches probably results in the most complete coding. To get the complete picture of rare disease epidemiology and care requirements, a specific and unique coding of out-patient cases is also desirable. Furthermore, a structured reporting of phenotype is required, especially for complex rare diseases and for yet undiagnosed cases.
Subjects
International Classification of Diseases, Rare Diseases, Humans, Rare Diseases/diagnosis, Rare Diseases/epidemiology, Rare Diseases/therapy, Germany/epidemiology, Delivery of Health Care, Health Facilities
ABSTRACT
BACKGROUND: Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitive, automated tools capable of accurately extracting these associations from the biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. RESULTS: In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised learning, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework makes it possible to incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes with a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach provides a new state-of-the-art performance in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. CONCLUSIONS: This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Subjects
Neural Networks, Computer; Supervised Machine Learning; Data Mining; Humans; Phenotype
ABSTRACT
Bi-allelic TECPR2 variants have been associated with a complex syndrome with features of both a neurodevelopmental and neurodegenerative disorder. Here, we provide a comprehensive clinical description and variant interpretation framework for this genetic locus. Through international collaboration, we identified 17 individuals from 15 families with bi-allelic TECPR2 variants. We systematically reviewed clinical and molecular data from this cohort and 11 previously reported cases. Phenotypes were standardized using Human Phenotype Ontology terms. A cross-sectional analysis revealed global developmental delay/intellectual disability, muscular hypotonia, ataxia, hyporeflexia, respiratory infections, and central/nocturnal hypopnea as core manifestations. A review of brain magnetic resonance imaging scans demonstrated a thin corpus callosum in 52%. We evaluated 17 distinct variants. Missense variants in TECPR2 are predominantly located in the N- and C-terminal regions containing β-propeller repeats. Although missense variants constitute nearly half of disease-associated TECPR2 variants, classifying them as (likely) pathogenic according to ACMG criteria remains challenging. We estimate a pathogenic variant carrier frequency of 1/1221 in the general population and 1/155 in the Ashkenazi Jewish population. Based on clinical, neuroimaging, and genetic data, we provide recommendations for variant reporting, clinical assessment, and surveillance/treatment of individuals with TECPR2-associated disorder. This sets the stage for future prospective natural history studies.
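To relate a carrier frequency to the expected frequency of bi-allelic individuals, one common back-of-the-envelope approach is Hardy-Weinberg equilibrium. The abstract does not say how its estimates were derived or whether the authors made this conversion, so the sketch below is an assumption-laden illustration only: it assumes random mating, a single pooled pathogenic allele of frequency q, and a carrier frequency equal to 2q(1 - q). The function names are hypothetical.

```python
import math

def allele_frequency_from_carrier(carrier_freq):
    """Solve 2*q*(1 - q) = carrier_freq for the smaller root q (Hardy-Weinberg assumption)."""
    return (1.0 - math.sqrt(1.0 - 2.0 * carrier_freq)) / 2.0

def expected_biallelic_frequency(carrier_freq):
    """Expected fraction of individuals carrying two pathogenic alleles, q**2."""
    q = allele_frequency_from_carrier(carrier_freq)
    return q * q

if __name__ == "__main__":
    # Carrier frequencies quoted in the abstract.
    for label, carrier in [("general", 1 / 1221), ("Ashkenazi Jewish", 1 / 155)]:
        freq = expected_biallelic_frequency(carrier)
        print(f"{label}: roughly 1 in {round(1 / freq):,} expected bi-allelic")
```

Real populations can deviate from these assumptions (for example through founder effects or consanguinity), so the output is an order-of-magnitude guide rather than a prevalence estimate.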
Subjects
Carrier Proteins/genetics; Hereditary Sensory and Autonomic Neuropathies; Intellectual Disability; Nerve Tissue Proteins/genetics; Adolescent; Carrier Proteins/chemistry; Child; Child, Preschool; Cohort Studies; Cross-Sectional Studies; Family; Female; Hereditary Sensory and Autonomic Neuropathies/complications; Hereditary Sensory and Autonomic Neuropathies/diagnosis; Hereditary Sensory and Autonomic Neuropathies/genetics; Hereditary Sensory and Autonomic Neuropathies/pathology; Humans; Infant; Intellectual Disability/complications; Intellectual Disability/diagnosis; Intellectual Disability/genetics; Intellectual Disability/pathology; Magnetic Resonance Imaging; Male; Models, Molecular; Mutation, Missense; Nerve Tissue Proteins/chemistry; Neuroimaging/methods; Pedigree; Phenotype; Protein Conformation
ABSTRACT
To speed up the differential diagnosis of rare diseases based on the symptoms and signs observed in an affected individual, researchers have recently developed and implemented phenotype-driven differential-diagnosis systems. The performance of those systems relies on the quantity and quality of the underlying databases of disease-phenotype associations (DPAs). Although such databases are often developed by manual curation, they inherently suffer from limited coverage. To address this problem, we propose a text-mining approach to increase the coverage of DPA databases and consequently improve the performance of differential-diagnosis systems. Our analysis showed that a text-mining approach using one million case reports obtained from PubMed could increase the coverage of manually curated DPAs in Orphanet by 125.6%. We also present PubCaseFinder (see Web Resources), a new phenotype-driven differential-diagnosis system available as a free web application. By utilizing DPAs automatically extracted from case reports in addition to manually curated DPAs, PubCaseFinder improves the performance of automated differential diagnosis. Moreover, PubCaseFinder helps clinicians search for relevant case reports by using phenotype-based comparisons and confirm the results with detailed contextual information.
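One plausible way to read a coverage increase such as the one reported above is as the number of text-mined DPAs not already in the curated set, divided by the size of the curated set. The abstract does not define the metric, so this definition is an assumption, and the disease and phenotype identifiers below are made-up placeholders rather than real Orphanet or HPO codes.

```python
def coverage_increase(curated, mined):
    """Percentage growth in distinct DPAs when mined pairs are added to curated pairs.

    Both arguments are sets of (disease_id, phenotype_id) tuples.
    Assumed definition: 100 * |mined - curated| / |curated|.
    """
    if not curated:
        raise ValueError("curated set must not be empty")
    new_pairs = mined - curated
    return 100.0 * len(new_pairs) / len(curated)

# Placeholder identifiers for illustration only.
CURATED = {("D1", "P1"), ("D1", "P2"), ("D2", "P1"), ("D2", "P3")}
MINED = {("D1", "P1"), ("D1", "P4"), ("D2", "P5"), ("D3", "P1"), ("D3", "P2"), ("D3", "P6")}

if __name__ == "__main__":
    print(f"{coverage_increase(CURATED, MINED):.1f}% more DPAs after adding mined pairs")
```

Pairs that appear in both sources are counted once, so only genuinely new associations contribute to the increase.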
Subjects
Rare Diseases/diagnosis; Rare Diseases/genetics; Data Mining/methods; Databases, Genetic; Diagnosis, Differential; Humans; Phenotype
ABSTRACT
BACKGROUND: 16p13.11 microduplication syndrome has a variable presentation and is characterized primarily by neurodevelopmental and physical phenotypes resulting from copy number variation at chromosome 16p13.11. Given its variability, there may be features that have not yet been reported. The goal of this study was to use a patient "self-phenotyping" survey to collect data directly from patients to further characterize the phenotypes of 16p13.11 microduplication syndrome. OBJECTIVE: This study aimed to (1) discover self-identified phenotypes in 16p13.11 microduplication syndrome that have been underrepresented in the scientific literature and (2) demonstrate that self-phenotyping tools are valuable sources of data for the medical and scientific communities. METHODS: As part of a large study to compare and evaluate patient self-phenotyping surveys, an online survey tool, Phenotypr, was developed for patients with rare disorders to self-report phenotypes. Participants with 16p13.11 microduplication syndrome were recruited through the Boston Children's Hospital 16p13.11 Registry. The survey was completed by the caregiver, parent, or legal guardian of an affected child, or by the affected person if aged 18 years or above. Results were securely transferred to a Research Electronic Data Capture database and aggregated for analysis. RESULTS: A total of 19 participants enrolled in the study. Notably, among the 19 participants, aggression and anxiety were mentioned by 3 (16%) and 4 (21%) participants, respectively, which is an increase over the numbers in previously published literature. Additionally, among the 19 participants, 3 (16%) had asthma and 2 (11%) had other immunological disorders, neither of which has been previously described in the syndrome. CONCLUSIONS: Several phenotypes might be underrepresented in the previous 16p13.11 microduplication literature, and new possible phenotypes have been identified. Whenever possible, patients should continue to be referenced as a source of complete phenotyping data on their condition. Self-phenotyping may lead to a better understanding of the prevalence of phenotypes in genetic disorders and may identify previously unreported phenotypes.