Search | BVS Colombia Search Portal

A hybrid framework with large language models for rare disease phenotyping.

Wu, Jinge; Dong, Hang; Li, Zexi; Wang, Haowei; Li, Runci; Patra, Arijit; Dai, Chengliang; Ali, Waqar; Scordis, Phil; Wu, Honghan.

BMC Med Inform Decis Mak ; 24(1): 289, 2024 Oct 08.

Article in English | MEDLINE | ID: mdl-39375687

ABSTRACT

PURPOSE: Rare diseases pose significant challenges in diagnosis and treatment due to their low prevalence and heterogeneous clinical presentations. Unstructured clinical notes contain valuable information for identifying rare diseases, but manual curation is time-consuming and prone to subjectivity. This study aims to develop a hybrid approach combining dictionary-based natural language processing (NLP) tools with large language models (LLMs) to improve rare disease identification from unstructured clinical reports. METHODS: We propose a novel hybrid framework that integrates the Orphanet Rare Disease Ontology (ORDO) and the Unified Medical Language System (UMLS) to create a comprehensive rare disease vocabulary. SemEHR, a dictionary-based NLP tool, is employed to extract rare disease mentions from clinical notes. To refine the results and improve accuracy, we leverage various LLMs, including LLaMA3, Phi3-mini, and domain-specific models like OpenBioLLM and BioMistral. Different prompting strategies, such as zero-shot, few-shot, and knowledge-augmented generation, are explored to optimize the LLMs' performance. RESULTS: The proposed hybrid approach demonstrates superior performance compared to traditional NLP systems and standalone LLMs. LLaMA3 and Phi3-mini achieve the highest F1 scores in rare disease identification. Few-shot prompting with 1-3 examples yields the best results, while knowledge-augmented generation shows limited improvement. Notably, the approach uncovers a significant number of potential rare disease cases not documented in structured diagnostic records, highlighting its ability to identify previously unrecognized patients. CONCLUSION: The hybrid approach combining dictionary-based NLP tools with LLMs shows great promise for improving rare disease identification from unstructured clinical reports. By leveraging the strengths of both techniques, the method demonstrates superior performance and the potential to uncover hidden rare disease cases. Further research is needed to address limitations related to ontology mapping and overlapping case identification, and to integrate the approach into clinical practice for early diagnosis and improved patient outcomes.
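
The two-stage idea in the abstract above (dictionary lookup first, then an LLM check with a few examples in the prompt) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code and not SemEHR: the vocabulary, the concept IDs (`RD-0001`, `RD-0002`), the few-shot examples, and every function name are hypothetical, and `stub_llm` is a keyword rule standing in for a real model call.

```python
import re
from typing import Callable, Dict, List, NamedTuple, Tuple


class Mention(NamedTuple):
    term: str        # surface form as it appears in the note
    concept_id: str  # placeholder ID, not a real ORDO/UMLS code
    start: int
    end: int


# Hypothetical vocabulary: lower-cased surface form -> placeholder concept ID.
VOCAB: Dict[str, str] = {
    "cystic fibrosis": "RD-0001",
    "huntington disease": "RD-0002",
}

# Hypothetical few-shot examples: (text, disease, expected answer).
FEW_SHOT: List[Tuple[str, str, str]] = [
    ("Patient has confirmed cystic fibrosis.", "cystic fibrosis", "yes"),
    ("Mother had Huntington disease; patient untested.", "Huntington disease", "no"),
]


def find_mentions(note: str, vocab: Dict[str, str]) -> List[Mention]:
    """Stage 1: case-insensitive dictionary matching on word boundaries."""
    found = []
    for term, cid in vocab.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, note, flags=re.IGNORECASE):
            found.append(Mention(m.group(0), cid, m.start(), m.end()))
    return sorted(found, key=lambda x: x.start)


def sentence_around(note: str, start: int, end: int) -> str:
    """Naive sentence context: split on '.', which breaks on abbreviations."""
    left = note.rfind(".", 0, start) + 1
    right = note.find(".", end)
    right = len(note) if right == -1 else right + 1
    return note[left:right].strip()


def build_prompt(note: str, mention: Mention,
                 examples: List[Tuple[str, str, str]]) -> str:
    """Stage 2 input: a few-shot prompt ending with the unanswered query."""
    lines = ["Does the text say the patient has the named rare disease? "
             "Answer yes or no.", ""]
    for text, disease, label in examples:
        lines += [f"Text: {text}", f"Disease: {disease}", f"Answer: {label}", ""]
    context = sentence_around(note, mention.start, mention.end)
    lines += [f"Text: {context}", f"Disease: {mention.term}", "Answer:"]
    return "\n".join(lines)


def verify(note: str, mentions: List[Mention],
           ask_llm: Callable[[str], str],
           examples: List[Tuple[str, str, str]] = FEW_SHOT) -> List[Mention]:
    """Keep only the mentions for which the model answers 'yes'."""
    kept = []
    for m in mentions:
        answer = ask_llm(build_prompt(note, m, examples))
        if answer.strip().lower().startswith("yes"):
            kept.append(m)
    return kept


def stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM: rejects negated or family-history contexts."""
    query = prompt.rsplit("Text: ", 1)[1].lower()
    rejected = "no evidence of" in query or "family history" in query
    return "no" if rejected else "yes"
```

In use, `verify(note, find_mentions(note, VOCAB), stub_llm)` would return only the mentions that survive the second stage; replacing `stub_llm` with a function that calls an actual model is the only change needed to try a real LLM under these assumptions.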

Subject(s)

Natural Language Processing; Rare Diseases; Unified Medical Language System; Rare Diseases/diagnosis; Humans; Phenotype; Electronic Health Records; Biological Ontologies

PDON: Parkinson's disease ontology for representation and modeling of the Parkinson's disease knowledge domain.

Younesi, Erfan; Malhotra, Ashutosh; Gündel, Michaela; Scordis, Phil; Kodamullil, Alpha Tom; Page, Matt; Müller, Bernd; Springstubbe, Stephan; Wüllner, Ullrich; Scheller, Dieter; Hofmann-Apitius, Martin.

Theor Biol Med Model ; 12: 20, 2015 Sep 22.

Article in English | MEDLINE | ID: mdl-26395080

ABSTRACT

BACKGROUND: Despite the unprecedented and increasing amount of data, relatively little progress has been made in molecular characterization of mechanisms underlying Parkinson's disease. In the area of Parkinson's research, there is a pressing need to integrate various pieces of information into a meaningful context of presumed disease mechanism(s). Disease ontologies provide a novel means for organizing, integrating, and standardizing the knowledge domains specific to disease in a compact, formalized and computer-readable form and serve as a reference for knowledge exchange or systems modeling of disease mechanism. METHODS: The Parkinson's disease ontology was built according to the life cycle of ontology building. Structural, functional, and expert evaluation of the ontology was performed to ensure the quality and usability of the ontology. A novelty metric has been introduced to measure the gain of new knowledge using the ontology. Finally, a cause-and-effect model was built around PINK1 and two gene expression studies from the Gene Expression Omnibus database were re-annotated to demonstrate the usability of the ontology. RESULTS: The Parkinson's disease ontology with a subclass-based taxonomic hierarchy covers the broad spectrum of major biomedical concepts from molecular to clinical features of the disease, and also reflects different views on disease features held by molecular biologists, clinicians and drug developers. The current version of the ontology contains 632 concepts, which are organized under nine views. The structural evaluation showed the balanced dispersion of concept classes throughout the ontology. The functional evaluation demonstrated that the ontology-driven literature search could gain novel knowledge not present in the reference Parkinson's knowledge map. The ontology was able to answer specific questions related to Parkinson's when evaluated by experts. Finally, the added value of the Parkinson's disease ontology is demonstrated by ontology-driven modeling of PINK1 and re-annotation of gene expression datasets relevant to Parkinson's disease. CONCLUSIONS: Parkinson's disease ontology delivers the knowledge domain of Parkinson's disease in a compact, computer-readable form, which can be further edited and enriched by the scientific community and also to be used to construct, represent and automatically extend Parkinson's-related computable models. A practical version of the Parkinson's disease ontology for browsing and editing can be publicly accessed at http://bioportal.bioontology.org/ontologies/PDON .

Subject(s)

Gene Ontology; Knowledge; Parkinson Disease/genetics; Software; Animals; Databases, Genetic; Disease Models, Animal; Gene Expression Regulation; Gene Regulatory Networks; Humans; Molecular Sequence Annotation; Parkinson Disease/etiology

Predicting seizure recurrence after an initial seizure-like episode from routine clinical notes using large language models: a retrospective cohort study.

Beaulieu-Jones, Brett K; Villamar, Mauricio F; Scordis, Phil; Bartmann, Ana Paula; Ali, Waqar; Wissel, Benjamin D; Alsentzer, Emily; de Jong, Johann; Patra, Arijit; Kohane, Isaac.

Lancet Digit Health ; 5(12): e882-e894, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38000873

ABSTRACT

BACKGROUND: The evaluation and management of first-time seizure-like events in children can be difficult because these episodes are not always directly observed and might be epileptic seizures or other conditions (seizure mimics). We aimed to evaluate whether machine learning models using real-world data could predict seizure recurrence after an initial seizure-like event. METHODS: This retrospective cohort study compared models trained and evaluated on two separate datasets between Jan 1, 2010, and Jan 1, 2020: electronic medical records (EMRs) at Boston Children's Hospital and de-identified, patient-level, administrative claims data from the IBM MarketScan research database. The study population comprised patients with an initial diagnosis of either epilepsy or convulsions before the age of 21 years, based on International Classification of Diseases, Clinical Modification (ICD-CM) codes. We compared machine learning-based predictive modelling using structured data (logistic regression and XGBoost) with emerging techniques in natural language processing by use of large language models. FINDINGS: The primary cohort comprised 14 021 patients at Boston Children's Hospital matching inclusion criteria with an initial seizure-like event and the comparison cohort comprised 15 062 patients within the IBM MarketScan research database. Seizure recurrence based on a composite expert-derived definition occurred in 57% of patients at Boston Children's Hospital and 63% of patients within IBM MarketScan. Large language models with additional domain-specific and location-specific pre-training on patients excluded from the study (F1-score 0·826 [95% CI 0·817-0·835], AUC 0·897 [95% CI 0·875-0·913]) performed best. All large language models, including the base model without additional pre-training (F1-score 0·739 [95% CI 0·738-0·741], AUROC 0·846 [95% CI 0·826-0·861]) outperformed models trained with structured data. With structured data only, XGBoost outperformed logistic regression and XGBoost models trained with the Boston Children's Hospital EMR (logistic regression: F1-score 0·650 [95% CI 0·643-0·657], AUC 0·694 [95% CI 0·685-0·705], XGBoost: F1-score 0·679 [0·676-0·683], AUC 0·725 [0·717-0·734]) performed similarly to models trained on the IBM MarketScan database (logistic regression: F1-score 0·596 [0·590-0·601], AUC 0·670 [0·664-0·675], XGBoost: F1-score 0·678 [0·668-0·687], AUC 0·710 [0·703-0·714]). INTERPRETATION: Physician's clinical notes about an initial seizure-like event include substantial signals for prediction of seizure recurrence, and additional domain-specific and location-specific pre-training can significantly improve the performance of clinical large language models, even for specialised cohorts. FUNDING: UCB, National Institute of Neurological Disorders and Stroke (US National Institutes of Health).

Subject(s)

Epilepsy; Seizures; Child; Humans; Young Adult; Adult; Retrospective Studies; Seizures/diagnosis; Machine Learning; Electronic Health Records

Data and sample sharing as an enabler for large-scale biomarker research and development: The EPND perspective.

Bose, Niranjan; Brookes, Anthony J; Scordis, Phil; Visser, Pieter Jelle.

Front Neurol ; 13: 1031091, 2022.

Article in English | MEDLINE | ID: mdl-36530625

ABSTRACT

Biomarker discovery, development, and validation are reliant on large-scale analyses of high-quality samples and data. Currently, significant quantities of data and samples have been generated by European studies on Alzheimer's disease (AD) and other neurodegenerative diseases (NDD), representing a valuable resource for developing biomarkers to support early detection of disease, treatment monitoring, and patient stratification. However, discovery of, access to, and sharing of data and samples from AD and NDD research are hindered both by silos that limit collaboration, and by the array of complex requirements for secure, legal, and ethical sharing. In this Perspective article, we examine key challenges currently hampering large-scale biomarker research, and outline how the European Platform for Neurodegenerative Diseases (EPND) plans to address them. The first such challenge is a fragmented landscape filled with technical barriers that make it difficult to discover and access high-quality samples and data in one location. A second challenge is related to the complex array of legal and ethical requirements that must be navigated by researchers when sharing data and samples, to ensure compliance with data protection regulations and research ethics. Another challenge is the lack of broad-scale collaboration and opportunities to facilitate partnerships between data and sample contributors and researchers, in addition to a lack of regulatory engagement early in the research process to enable validation of potential biomarkers. A further challenge facing projects is the need to remain sustainable beyond initial funding periods, ensuring data and samples are shared and reused, thereby driving further research and innovation. In addressing these challenges, EPND will enable an environment of faster and more disruptive research on diagnostics and disease-modifying therapies for Alzheimer's disease and other neurodegenerative diseases.

Clustering of Alzheimer's and Parkinson's disease based on genetic burden of shared molecular mechanisms.

Emon, Mohammad Asif; Heinson, Ashley; Wu, Ping; Domingo-Fernández, Daniel; Sood, Meemansa; Vrooman, Henri; Corvol, Jean-Christophe; Scordis, Phil; Hofmann-Apitius, Martin; Fröhlich, Holger.

Sci Rep ; 10(1): 19097, 2020 Nov 05.

Article in English | MEDLINE | ID: mdl-33154531

ABSTRACT

One of the visions of precision medicine has been to re-define disease taxonomies based on molecular characteristics rather than on phenotypic evidence. However, achieving this goal is highly challenging, specifically in neurology. Our contribution is a machine-learning based joint molecular subtyping of Alzheimer's (AD) and Parkinson's Disease (PD), based on the genetic burden of 15 molecular mechanisms comprising 27 proteins (e.g. APOE) that have been described in both diseases. We demonstrate that our joint AD/PD clustering using a combination of sparse autoencoders and sparse non-negative matrix factorization is reproducible and can be associated with significant differences of AD and PD patient subgroups on a clinical, pathophysiological and molecular level. Hence, clusters are disease-associated. To our knowledge this work is the first demonstration of a mechanism based stratification in the field of neurodegenerative diseases. Overall, we thus see this work as an important step towards a molecular mechanism-based taxonomy of neurological disorders, which could help in developing better targeted therapies in the future by going beyond classical phenotype based disease definitions.

Subject(s)

Alzheimer Disease/classification; Alzheimer Disease/genetics; Parkinson Disease/classification; Parkinson Disease/genetics; Aged; Aged, 80 and over; Alzheimer Disease/metabolism; Amyloid beta-Peptides/cerebrospinal fluid; Brain/diagnostic imaging; Cluster Analysis; Cohort Studies; Drug Development; Epigenome; Female; Genotype; Humans; Male; Middle Aged; Neuroimaging; Outcome Assessment, Health Care; Parkinson Disease/metabolism; Polymorphism, Single Nucleotide; Precision Medicine; Transcriptome; Unsupervised Machine Learning
