Search | VHL Regional Portal

1.

Optimizing Clinical Trial Eligibility Design Using Natural Language Processing Models and Real-World Data: Algorithm Development and Validation.

Lee, Kyeryoung; Liu, Zongzhi; Mai, Yun; Jun, Tomi; Ma, Meng; Wang, Tongyu; Ai, Lei; Calay, Ediz; Oh, William; Stolovitzky, Gustavo; Schadt, Eric; Wang, Xiaoyan.

JMIR AI ; 3: e50800, 2024 Jul 29.

Article in English | MEDLINE | ID: mdl-39073872

ABSTRACT

BACKGROUND: Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocol, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) has the potential to achieve these objectives. OBJECTIVE: This study aims to assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records using deep learning-based NLP techniques. METHODS: We obtained data on 3281 industry-sponsored phase 2 or 3 interventional clinical trials recruiting patients with non-small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, and Crohn disease from ClinicalTrials.gov, spanning the period between 2013 and 2020. A customized bidirectional long short-term memory- and conditional random field-based NLP pipeline was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms along with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected a subset of patients with non-small cell lung cancer (n=2775), curated from the Mount Sinai Health System, as a pilot study. RESULTS: We manually annotated the clinical trial eligibility corpus (485/3281, 14.78% trials) and constructed an eligibility criteria-specific ontology. Our customized NLP pipeline, developed based on the eligibility criteria-specific ontology that we created through manual annotation, achieved high precision (0.91, range 0.67-1.00) and recall (0.79, range 0.50-1.00) scores, as well as a high F1-score (0.83, range 0.67-1.00), enabling the efficient extraction of granular criteria entities and relevant attributes from 3281 clinical trials.
A standardized eligibility criteria knowledge base, compatible with electronic health records, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values. In addition, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients. CONCLUSIONS: Our customized NLP pipeline successfully generated a standardized eligibility criteria knowledge base by transforming hypernym criteria into machine-readable hyponyms along with their corresponding values. A prototype interface integrating real-world patient information allows us to assess the impact of each eligibility criterion on the number of patients eligible for the trial. Leveraging NLP and real-world data in a data-driven approach holds promise for streamlining the overall clinical trial process, optimizing processes, and improving efficiency in patient identification.
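
The abstract above describes assessing the impact of each eligibility criterion on the number of eligible patients. A minimal sketch of that idea follows. It is not the authors' implementation: the criterion names, thresholds, patient fields, and the helper functions `count_eligible` and `criterion_impact` are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Patient = Dict[str, float]
Criterion = Callable[[Patient], bool]

# Hypothetical criteria; names and thresholds are illustrative assumptions,
# not taken from any trial or from the study's knowledge base.
CRITERIA: Dict[str, Criterion] = {
    "age_18_or_older": lambda p: p["age"] >= 18,
    "ecog_0_or_1": lambda p: p["ecog"] <= 1,
    "egfr_at_least_60": lambda p: p["egfr"] >= 60,
}

# Made-up patient records for demonstration only.
PATIENTS: List[Patient] = [
    {"age": 64, "ecog": 1, "egfr": 75},  # passes every criterion
    {"age": 71, "ecog": 2, "egfr": 80},  # fails the ECOG criterion only
    {"age": 58, "ecog": 0, "egfr": 45},  # fails the eGFR criterion only
    {"age": 80, "ecog": 3, "egfr": 30},  # fails ECOG and eGFR
]


def count_eligible(patients: List[Patient], criteria: Dict[str, Criterion]) -> int:
    """Count patients who satisfy every criterion."""
    return sum(all(check(p) for check in criteria.values()) for p in patients)


def criterion_impact(patients: List[Patient], criteria: Dict[str, Criterion]) -> Dict[str, int]:
    """For each criterion, count the extra patients who become eligible if it is dropped."""
    baseline = count_eligible(patients, criteria)
    impact = {}
    for name in criteria:
        relaxed = {k: v for k, v in criteria.items() if k != name}
        impact[name] = count_eligible(patients, relaxed) - baseline
    return impact


if __name__ == "__main__":
    print("eligible under all criteria:", count_eligible(PATIENTS, CRITERIA))
    for name, gained in criterion_impact(PATIENTS, CRITERIA).items():
        print(f"dropping {name} adds {gained} patient(s)")
```

A leave-one-out comparison is only one way to rank criteria; it ignores interactions between criteria, which is why the fourth patient is not recovered by dropping any single criterion.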

2.

Author Correction: Divergent landscapes of A-to-I editing in postmortem and living human brain.

Rodriguez de Los Santos, Miguel; Kopell, Brian H; Buxbaum Grice, Ariela; Ganesh, Gauri; Yang, Andy; Amini, Pardis; Liharska, Lora E; Vornholt, Eric; Fullard, John F; Dong, Pengfei; Park, Eric; Zipkowitz, Sarah; Kaji, Deepak A; Thompson, Ryan C; Liu, Donjing; Park, You Jeong; Cheng, Esther; Ziafat, Kimia; Moya, Emily; Fennessy, Brian; Wilkins, Lillian; Silk, Hannah; Linares, Lisa M; Sullivan, Brendan; Cohen, Vanessa; Kota, Prashant; Feng, Claudia; Johnson, Jessica S; Rieder, Marysia-Kolbe; Scarpa, Joseph; Nadkarni, Girish N; Wang, Minghui; Zhang, Bin; Sklar, Pamela; Beckmann, Noam D; Schadt, Eric E; Roussos, Panos; Charney, Alexander W; Breen, Michael S.

Nat Commun ; 15(1): 5828, 2024 Jul 11.

Article in English | MEDLINE | ID: mdl-38992062

3.

Dual-specificity protein phosphatase 6 (DUSP6) overexpression reduces amyloid load and improves memory deficits in male 5xFAD mice.

Pan, Allen L; Audrain, Mickael; Sakakibara, Emmy; Joshi, Rajeev; Zhu, Xiaodong; Wang, Qian; Wang, Minghui; Beckmann, Noam D; Schadt, Eric E; Gandy, Sam; Zhang, Bin; Ehrlich, Michelle E; Salton, Stephen R.

Front Aging Neurosci ; 16: 1400447, 2024.

Article in English | MEDLINE | ID: mdl-39006222

ABSTRACT

Introduction: Dual specificity protein phosphatase 6 (DUSP6) was recently identified as a key hub gene in a causal VGF gene network that regulates late-onset Alzheimer's disease (AD). Importantly, decreased DUSP6 levels are correlated with an increased clinical dementia rating (CDR) in human subjects, and DUSP6 levels are additionally decreased in the 5xFAD amyloidopathy mouse model. Methods: To investigate the role of DUSP6 in AD, we stereotactically injected AAV5-DUSP6 or AAV5-GFP (control) into the dorsal hippocampus (dHc) of both female and male 5xFAD or wild type mice, to induce overexpression of DUSP6 or GFP. Results: Barnes maze testing indicated that DUSP6 overexpression in the dHc of 5xFAD mice improved memory deficits and was associated with reduced amyloid plaque load, Aβ1-40 and Aβ1-42 levels, and amyloid precursor protein processing enzyme BACE1, in male but not in female mice. Microglial activation, which was increased in 5xFAD mice, was significantly reduced by dHc DUSP6 overexpression in both males and females, as was the number of "microglial clusters," which correlated with reduced amyloid plaque size. Transcriptomic profiling of female 5xFAD hippocampus revealed upregulation of inflammatory and extracellular signal-regulated kinase pathways, while dHc DUSP6 overexpression in female 5xFAD mice downregulated a subset of genes in these pathways. Gene ontology analysis of DEGs (p < 0.05) identified a greater number of synaptic pathways that were regulated by DUSP6 overexpression in male compared to female 5xFAD. Discussion: In summary, DUSP6 overexpression in dHc reduced amyloid deposition and memory deficits in male but not female 5xFAD mice, whereas reduced neuroinflammation and microglial activation were observed in both males and females, suggesting that DUSP6-induced reduction of microglial activation did not contribute to sex-dependent improvement in memory deficits.
The sex-dependent regulation of synaptic pathways by DUSP6 overexpression, however, correlated with the improvement of spatial memory deficits in male but not female 5xFAD.

4.

Divergent landscapes of A-to-I editing in postmortem and living human brain.

Rodriguez de Los Santos, Miguel; Kopell, Brian H; Buxbaum Grice, Ariela; Ganesh, Gauri; Yang, Andy; Amini, Pardis; Liharska, Lora E; Vornholt, Eric; Fullard, John F; Dong, Pengfei; Park, Eric; Zipkowitz, Sarah; Kaji, Deepak A; Thompson, Ryan C; Liu, Donjing; Park, You Jeong; Cheng, Esther; Ziafat, Kimia; Moya, Emily; Fennessy, Brian; Wilkins, Lillian; Silk, Hannah; Linares, Lisa M; Sullivan, Brendan; Cohen, Vanessa; Kota, Prashant; Feng, Claudia; Johnson, Jessica S; Rieder, Marysia-Kolbe; Scarpa, Joseph; Nadkarni, Girish N; Wang, Minghui; Zhang, Bin; Sklar, Pamela; Beckmann, Noam D; Schadt, Eric E; Roussos, Panos; Charney, Alexander W; Breen, Michael S.

Nat Commun ; 15(1): 5366, 2024 Jun 26.

Article in English | MEDLINE | ID: mdl-38926387

ABSTRACT

Adenosine-to-inosine (A-to-I) editing is a prevalent post-transcriptional RNA modification within the brain. Yet, most research has relied on postmortem samples, assuming it is an accurate representation of RNA biology in the living brain. We challenge this assumption by comparing A-to-I editing between postmortem and living prefrontal cortical tissues. Major differences were found, with over 70,000 A-to-I sites showing higher editing levels in postmortem tissues. Increased A-to-I editing in postmortem tissues is linked to higher ADAR and ADARB1 expression, is more pronounced in non-neuronal cells, and indicative of postmortem activation of inflammation and hypoxia. Higher A-to-I editing in living tissues marks sites that are evolutionarily preserved, synaptic, developmentally timed, and disrupted in neurological conditions. Common genetic variants were also found to differentially affect A-to-I editing levels in living versus postmortem tissues. Collectively, these discoveries offer more nuanced and accurate insights into the regulatory mechanisms of RNA editing in the human brain.

Subject(s)

Adenosine Deaminase , Adenosine , Autopsy , Brain , Inosine , RNA Editing , RNA-Binding Proteins , Humans , Adenosine/metabolism , Adenosine Deaminase/metabolism , Adenosine Deaminase/genetics , Brain/metabolism , Inosine/metabolism , Inosine/genetics , RNA-Binding Proteins/metabolism , RNA-Binding Proteins/genetics , Prefrontal Cortex/metabolism , Postmortem Changes , Male

5.

An algorithm to identify patients aged 0-3 with rare genetic disorders.

Webb, Bryn D; Lau, Lisa Y; Tsevdos, Despina; Shewcraft, Ryan A; Corrigan, David; Shi, Lisong; Lee, Seungwoo; Tyler, Jonathan; Li, Shilong; Wang, Zichen; Stolovitzky, Gustavo; Edelmann, Lisa; Chen, Rong; Schadt, Eric E; Li, Li.

Orphanet J Rare Dis ; 19(1): 183, 2024 May 02.

Article in English | MEDLINE | ID: mdl-38698482

ABSTRACT

BACKGROUND: With over 7000 Mendelian disorders, identifying children with a specific rare genetic disorder diagnosis through structured electronic medical record data is challenging given incompleteness of records, inaccurate medical diagnosis coding, as well as heterogeneity in clinical symptoms and procedures for specific disorders. We sought to develop a digital phenotyping algorithm (PheIndex) using electronic medical records to identify children aged 0-3 diagnosed with genetic disorders or who present with illness with an increased risk for genetic disorders. RESULTS: Through expert opinion, we established 13 criteria for the algorithm and derived a score and a classification. The performance of each criterion and the classification were validated by chart review. PheIndex identified 1,088 children out of 93,154 live births who may be at an increased risk for genetic disorders. Chart review demonstrated that the algorithm achieved 90% sensitivity, 97% specificity, and 94% accuracy. CONCLUSIONS: The PheIndex algorithm can help identify when a rare genetic disorder may be present, alerting providers to consider ordering a diagnostic genetic test and/or referring a patient to a medical geneticist.

Subject(s)

Algorithms , Rare Diseases , Humans , Rare Diseases/genetics , Rare Diseases/diagnosis , Infant , Infant, Newborn , Child, Preschool , Female , Male , Electronic Health Records , Genetic Diseases, Inborn/diagnosis , Genetic Diseases, Inborn/genetics , Phenotype
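
The PheIndex record above reports sensitivity, specificity, and accuracy from chart review. As a reminder of how those three measures relate to a confusion matrix, here is a minimal sketch. The function name `classification_metrics` and the counts in the example are hypothetical; they are not the study's chart-review counts, which the abstract does not give.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        # Share of truly affected patients that the algorithm flags.
        "sensitivity": tp / (tp + fn),
        # Share of unaffected patients that the algorithm correctly leaves unflagged.
        "specificity": tn / (tn + fp),
        # Share of all reviewed charts classified correctly.
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    metrics = classification_metrics(tp=40, fp=10, tn=90, fn=10)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

Accuracy depends on how many affected and unaffected charts are reviewed, so it cannot be derived from sensitivity and specificity alone.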

6.

Divergent landscapes of A-to-I editing in postmortem and living human brain.

de Los Santos, Miguel Rodriguez; Kopell, Brian H; Grice, Ariela Buxbaum; Ganesh, Gauri; Yang, Andy; Amini, Pardis; Liharska, Lora E; Vornholt, Eric; Fullard, John F; Dong, Pengfei; Park, Eric; Zipkowitz, Sarah; Kaji, Deepak A; Thompson, Ryan C; Liu, Donjing; Park, You Jeong; Cheng, Esther; Ziafat, Kimia; Moya, Emily; Fennessy, Brian; Wilkins, Lillian; Silk, Hannah; Linares, Lisa M; Sullivan, Brendan; Cohen, Vanessa; Kota, Prashant; Feng, Claudia; Johnson, Jessica S; Rieder, Marysia-Kolbe; Scarpa, Joseph; Nadkarni, Girish N; Wang, Minghui; Zhang, Bin; Sklar, Pamela; Beckmann, Noam D; Schadt, Eric E; Roussos, Panos; Charney, Alexander W; Breen, Michael S.

medRxiv ; 2024 May 09.

Article in English | MEDLINE | ID: mdl-38765961

ABSTRACT

Adenosine-to-inosine (A-to-I) editing is a prevalent post-transcriptional RNA modification within the brain. Yet, most research has relied on postmortem samples, assuming it is an accurate representation of RNA biology in the living brain. We challenge this assumption by comparing A-to-I editing between postmortem and living prefrontal cortical tissues. Major differences were found, with over 70,000 A-to-I sites showing higher editing levels in postmortem tissues. Increased A-to-I editing in postmortem tissues is linked to higher ADAR1 and ADARB1 expression, is more pronounced in non-neuronal cells, and indicative of postmortem activation of inflammation and hypoxia. Higher A-to-I editing in living tissues marks sites that are evolutionarily preserved, synaptic, developmentally timed, and disrupted in neurological conditions. Common genetic variants were also found to differentially affect A-to-I editing levels in living versus postmortem tissues. Collectively, these discoveries illuminate the nuanced functions and intricate regulatory mechanisms of RNA editing within the human brain.

7.

Multiomic foundations of human prefrontal cortex tissue function.

Kopell, Brian H; Kaji, Deepak A; Liharska, Lora E; Vornholt, Eric; Valentine, Alissa; Lund, Anina; Hashemi, Alice; Thompson, Ryan C; Lohrenz, Terry; Johnson, Jessica S; Bussola, Nicole; Cheng, Esther; Park, You Jeong; Shah, Punit; Ma, Weiping; Searfoss, Richard; Qasim, Salman; Miller, Gregory M; Chand, Nischal Mahaveer; Aristel, Alisha; Humphrey, Jack; Wilkins, Lillian; Ziafat, Kimia; Silk, Hannah; Linares, Lisa M; Sullivan, Brendan; Feng, Claudia; Batten, Seth R; Bang, Dan; Barbosa, Leonardo S; Twomey, Thomas; White, Jason P; Vannucci, Marina; Hadj-Amar, Beniamino; Cohen, Vanessa; Kota, Prashant; Moya, Emily; Rieder, Marysia-Kolbe; Figee, Martijn; Nadkarni, Girish N; Breen, Michael S; Kishida, Kenneth T; Scarpa, Joseph; Ruderfer, Douglas M; Narain, Niven R; Wang, Pei; Kiebish, Michael A; Schadt, Eric E; Saez, Ignacio; Montague, P Read.

medRxiv ; 2024 May 17.

Article in English | MEDLINE | ID: mdl-38798344

ABSTRACT

The prefrontal cortex (PFC) is a region of the brain that in humans is involved in the production of higher-order functions such as cognition, emotion, perception, and behavior. Neurotransmission in the PFC produces higher-order functions by integrating information from other areas of the brain. At the foundation of neurotransmission, and by extension at the foundation of higher-order brain functions, are an untold number of coordinated molecular processes involving the DNA sequence variants in the genome, RNA transcripts in the transcriptome, and proteins in the proteome. These "multiomic" foundations are poorly understood in humans, perhaps in part because most modern studies that characterize the molecular state of the human PFC use tissue obtained when neurotransmission and higher-order brain functions have ceased (i.e., the postmortem state). Here, analyses are presented on data generated for the Living Brain Project (LBP) to investigate whether PFC tissue from individuals with intact higher-order brain function has characteristic multiomic foundations. Two complementary strategies were employed towards this end. The first strategy was to identify in PFC samples obtained from living study participants a signature of RNA transcript expression associated with neurotransmission measured intracranially at the time of PFC sampling, in some cases while participants performed a task engaging higher-order brain functions. The second strategy was to perform multiomic comparisons between PFC samples obtained from individuals with intact higher-order brain function at the time of sampling (i.e., living study participants) and PFC samples obtained in the postmortem state. RNA transcript expression within multiple PFC cell types was associated with fluctuations of dopaminergic, serotonergic, and/or noradrenergic neurotransmission in the substantia nigra measured while participants played a computer game that engaged higher-order brain functions. 
A subset of these associations - termed the "transcriptional program associated with neurotransmission" (TPAWN) - were reproduced in analyses of brain RNA transcript expression and intracranial neurotransmission data obtained from a second LBP cohort and from a cohort in an independent study. RNA transcripts involved in TPAWN were found to be (1) enriched for RNA transcripts associated with measures of neurotransmission in rodent and cell models, (2) enriched for RNA transcripts encoded by evolutionarily constrained genes, (3) depleted of RNA transcripts regulated by common DNA sequence variants, and (4) enriched for RNA transcripts implicated in higher-order brain functions by human population genetic studies. In PFC excitatory neurons of living study participants, higher expression of the genes in TPAWN tracked with higher expression of RNA transcripts that in rodent PFC samples are markers of a class of excitatory neurons that connect the PFC to deep brain structures. TPAWN was further reproduced by RNA transcript expression patterns differentiating living PFC samples from postmortem PFC samples, and significant differences between living and postmortem PFC samples were additionally observed with respect to (1) the expression of most primary RNA transcripts, mature RNA transcripts, and proteins, (2) the splicing of most primary RNA transcripts into mature RNA transcripts, (3) the patterns of co-expression between RNA transcripts and proteins, and (4) the effects of some DNA sequence variants on RNA transcript and protein expression. Taken together, this report highlights that studies of brain tissue obtained in a safe and ethical manner from large cohorts of living individuals can help advance understanding of the multiomic foundations of brain function.

8.

Characterizing cell type specific transcriptional differences between the living and postmortem human brain.

Vornholt, Eric; Liharska, Lora E; Cheng, Esther; Hashemi, Alice; Park, You Jeong; Ziafat, Kimia; Wilkins, Lillian; Silk, Hannah; Linares, Lisa M; Thompson, Ryan C; Sullivan, Brendan; Moya, Emily; Nadkarni, Girish N; Sebra, Robert; Schadt, Eric E; Kopell, Brian H; Charney, Alexander W; Beckmann, Noam D.

medRxiv ; 2024 May 01.

Article in English | MEDLINE | ID: mdl-38746297

ABSTRACT

Single-nucleus RNA sequencing (snRNA-seq) is often used to define gene expression patterns characteristic of brain cell types as well as to identify cell type specific gene expression signatures of neurological and mental illnesses in postmortem human brains. As methods to obtain brain tissue from living individuals emerge, it is essential to characterize gene expression differences associated with tissue originating from either living or postmortem subjects using snRNA-seq, and to assess whether and how such differences may impact snRNA-seq studies of brain tissue. To address this, human prefrontal cortex single nuclei gene expression was generated and compared between 31 samples from living individuals and 21 postmortem samples. The same cell types were consistently identified in living and postmortem nuclei, though for each cell type, a large proportion of genes were differentially expressed between samples from postmortem and living individuals. Notably, estimation of cell type proportions by cell type deconvolution of pseudo-bulk data was found to be more accurate in samples from living individuals. To allow for future integration of living and postmortem brain gene expression, a model was developed that quantifies from gene expression data the probability a human brain tissue sample was obtained postmortem. These probabilities are established as a means to statistically account for the gene expression differences between samples from living and postmortem individuals. Together, the results presented here provide a deep characterization of both differences between snRNA-seq derived from samples from living and postmortem individuals, as well as qualify and account for their effect on common analyses performed on this type of data.

9.

Transcriptomic landscape of human induced pluripotent stem cell-derived osteogenic differentiation identifies a regulatory role of KLF16.

Ru, Ying; Ma, Meng; Zhou, Xianxiao; Kriti, Divya; Cohen, Ninette; D'Souza, Sunita; Schaniel, Christoph; Motch Perrine, Susan M; Kuo, Sharon; Pinto, Dalila; Housman, Genevieve; Wu, Meng; Holmes, Greg; Schadt, Eric; van Bakel, Harm; Zhang, Bin; Jabs, Ethylin Wang.

bioRxiv ; 2024 Feb 12.

Article in English | MEDLINE | ID: mdl-38405902

ABSTRACT

Osteogenic differentiation is essential for bone development and metabolism, but the underlying gene regulatory networks have not been well investigated. We differentiated mesenchymal stem cells, derived from 20 human induced pluripotent stem cell lines, into preosteoblasts and osteoblasts, and performed systematic RNA-seq analyses of 60 samples for differential gene expression. We noted a highly significant correlation in expression patterns and genomic proximity among transcription factor (TF) and long noncoding RNA (lncRNA) genes. We identified TF-TF regulatory networks, regulatory roles of lncRNAs on their neighboring coding genes for TFs and splicing factors, and differential splicing of TF, lncRNA, and splicing factor genes. TF-TF regulatory and gene co-expression network analyses suggested an inhibitory role of TF KLF16 in osteogenic differentiation. We demonstrate that in vitro overexpression of human KLF16 inhibits osteogenic differentiation and mineralization, and in vivo Klf16+/- mice exhibit increased bone mineral density, trabecular number, and cortical bone area. Thus, our model system highlights the regulatory complexity of osteogenic differentiation and identifies novel osteogenic genes.

10.

Analytical challenges in omics research on asthma and allergy: A National Institute of Allergy and Infectious Diseases workshop.

Bunyavanich, Supinda; Becker, Patrice M; Altman, Matthew C; Lasky-Su, Jessica; Ober, Carole; Zengler, Karsten; Berdyshev, Evgeny; Bonneau, Richard; Chatila, Talal; Chatterjee, Nilanjan; Chung, Kian Fan; Cutcliffe, Colleen; Davidson, Wendy; Dong, Gang; Fang, Gang; Fulkerson, Patricia; Himes, Blanca E; Liang, Liming; Mathias, Rasika A; Ogino, Shuji; Petrosino, Joseph; Price, Nathan D; Schadt, Eric; Schofield, James; Seibold, Max A; Steen, Hanno; Wheatley, Lisa; Zhang, Hongmei; Togias, Alkis; Hasegawa, Kohei.

J Allergy Clin Immunol ; 153(4): 954-968, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38295882

ABSTRACT

Studies of asthma and allergy are generating increasing volumes of omics data for analysis and interpretation. The National Institute of Allergy and Infectious Diseases (NIAID) assembled a workshop comprising investigators studying asthma and allergic diseases using omics approaches, omics investigators from outside the field, and NIAID medical and scientific officers to discuss the following areas in asthma and allergy research: genomics, epigenomics, transcriptomics, microbiomics, metabolomics, proteomics, lipidomics, integrative omics, systems biology, and causal inference. Current states of the art, present challenges, novel and emerging strategies, and priorities for progress were presented and discussed for each area. This workshop report summarizes the major points and conclusions from this NIAID workshop. As a group, the investigators underscored the imperatives for rigorous analytic frameworks, integration of different omics data types, cross-disciplinary interaction, strategies for overcoming current limitations, and the overarching goal to improve scientific understanding and care of asthma and allergic diseases.

Subject(s)

Asthma , Hypersensitivity , United States , Humans , National Institute of Allergy and Infectious Diseases (U.S.) , Hypersensitivity/genetics , Asthma/etiology , Genomics , Proteomics , Metabolomics

11.

Anti-correlated feature selection prevents false discovery of subpopulations in scRNAseq.

Tyler, Scott R; Lozano-Ojalvo, Daniel; Guccione, Ernesto; Schadt, Eric E.

Nat Commun ; 15(1): 699, 2024 Jan 24.

Article in English | MEDLINE | ID: mdl-38267438

ABSTRACT

While sub-clustering cell-populations has become popular in single cell-omics, negative controls for this process are lacking. Popular feature-selection/clustering algorithms fail the null-dataset problem, allowing erroneous subdivisions of homogenous clusters until nearly each cell is called its own cluster. Using real and synthetic datasets, we find that anti-correlated gene selection reduces or eliminates erroneous subdivisions, increases marker-gene selection efficacy, and efficiently scales to millions of cells.

Subject(s)

Algorithms , Single-Cell Gene Expression Analysis , Cluster Analysis

12.

Detecting Ground Glass Opacity Features in Patients With Lung Cancer: Automated Extraction and Longitudinal Analysis via Deep Learning-Based Natural Language Processing.

Lee, Kyeryoung; Liu, Zongzhi; Chandran, Urmila; Kalsekar, Iftekhar; Laxmanan, Balaji; Higashi, Mitchell K; Jun, Tomi; Ma, Meng; Li, Minghao; Mai, Yun; Gilman, Christopher; Wang, Tongyu; Ai, Lei; Aggarwal, Parag; Pan, Qi; Oh, William; Stolovitzky, Gustavo; Schadt, Eric; Wang, Xiaoyan.

JMIR AI ; 2: e44537, 2023 Jun 01.

Article in English | MEDLINE | ID: mdl-38875565

ABSTRACT

BACKGROUND: Ground-glass opacities (GGOs) appearing in computed tomography (CT) scans may indicate potential lung malignancy. Proper management of GGOs based on their features can prevent the development of lung cancer. Electronic health records are rich sources of information on GGO nodules and their granular features, but most of the valuable information is embedded in unstructured clinical notes. OBJECTIVE: We aimed to develop, test, and validate a deep learning-based natural language processing (NLP) tool that automatically extracts GGO features to inform the longitudinal trajectory of GGO status from large-scale radiology notes. METHODS: We developed a bidirectional long short-term memory with a conditional random field-based deep-learning NLP pipeline to extract GGO and granular features of GGO retrospectively from radiology notes of 13,216 lung cancer patients. We evaluated the pipeline with quality assessments and analyzed cohort characterization of the distribution of nodule features longitudinally to assess changes in size and solidity over time. RESULTS: Our NLP pipeline built on the GGO ontology we developed achieved between 95% and 100% precision, 89% and 100% recall, and 92% and 100% F1-scores on different GGO features. We deployed this GGO NLP model to extract and structure comprehensive characteristics of GGOs from 29,496 radiology notes of 4521 lung cancer patients. Longitudinal analysis revealed that size increased in 16.8% (240/1424) of patients, decreased in 14.6% (208/1424), and remained unchanged in 68.5% (976/1424) in their last note compared to the first note. Among 1127 patients who had longitudinal radiology notes of GGO status, 815 (72.3%) were reported to have stable status, and 259 (23%) had increased/progressed status in the subsequent notes. 
CONCLUSIONS: Our deep learning-based NLP pipeline can automatically extract granular GGO features at scale from electronic health records when this information is documented in radiology notes and help inform the natural history of GGO. This will open the way for a new paradigm in lung cancer prevention and early detection.
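
The longitudinal analysis in the abstract above compares each patient's last note with the first. A minimal sketch of that comparison follows, assuming nodule sizes have already been extracted per note and ordered by date. The function names, the `tolerance_mm` parameter, and the sample sizes are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("increased", "decreased", "unchanged")


def size_trend(first_mm: float, last_mm: float, tolerance_mm: float = 0.0) -> str:
    """Label the change in nodule size between the first and last note."""
    delta = last_mm - first_mm
    if delta > tolerance_mm:
        return "increased"
    if delta < -tolerance_mm:
        return "decreased"
    return "unchanged"


def summarize_trends(sizes_by_patient: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the fraction of patients in each trend category.

    Patients with fewer than two size measurements are skipped because no
    first-versus-last comparison is possible for them.
    """
    counts = Counter(
        size_trend(sizes[0], sizes[-1])
        for sizes in sizes_by_patient.values()
        if len(sizes) >= 2
    )
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    # Made-up sizes in millimetres, ordered by note date.
    example = {
        "p1": [8.0, 11.0],
        "p2": [6.0, 6.0],
        "p3": [12.0, 10.0, 9.0],
        "p4": [5.0, 5.0],
        "p5": [7.0],  # single note, excluded from the comparison
    }
    print(summarize_trends(example))
```

In practice a nonzero `tolerance_mm` may be preferable, since small differences between reports can reflect measurement variability rather than real change.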
