Results 1 - 20 of 334
1.
medRxiv ; 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38699370

ABSTRACT

Phenome-wide association studies (PheWAS) have become widely used for efficient, high-throughput evaluation of the relationship between a genetic factor and a large number of disease phenotypes, typically extracted from a DNA biobank linked with electronic medical records (EMR). Phecodes, billing code-derived disease case-control statuses, are usually used as outcome variables in PheWAS, and logistic regression has been the standard choice of analysis method. Since the clinical diagnoses in EMR are often inaccurate, with errors that can bias the odds ratio estimates, much effort has been put into accurately defining the cases and controls to ensure an accurate analysis. Specifically, in order to correctly classify controls in the population, an exclusion criteria list for each Phecode was manually compiled to obtain unbiased odds ratios. However, the accuracy of the list cannot be guaranteed without an extensive data curation process. The costly curation process limits the efficiency of large-scale analyses that take full advantage of all structured phenotypic information available in EMR. Here, we proposed to estimate relative risks (RR) instead. We first demonstrated, via a theoretical formula, the desirable property of RR that overcomes the inaccuracy in the controls. With simulation and real data application, we further confirmed that RR is unbiased without compiling exclusion criteria lists. With RR as estimates, we are able to efficiently extend PheWAS to a larger-scale, phenome construction agnostic analysis of phenotypes, using ICD 9/10 codes, which preserve much more disease-related clinical information than Phecodes.
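The contrast between RR and OR can be illustrated with a minimal sketch. It is not the authors' derivation: it assumes one textbook scenario in which a fixed fraction of true cases goes unrecorded (and so sits among the controls), the same fraction in both exposure groups, and no non-case is recorded as a case. The risks and the sensitivity value are hypothetical.

```python
import math


def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def relative_risk(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of disease risks between exposed and unexposed groups."""
    return risk_exposed / risk_unexposed


def odds_ratio(risk_exposed: float, risk_unexposed: float) -> float:
    """Ratio of disease odds between exposed and unexposed groups."""
    return odds(risk_exposed) / odds(risk_unexposed)


def observed_risk(true_risk: float, sensitivity: float) -> float:
    """Recorded risk when only a fraction `sensitivity` of true cases is coded
    and no non-case is coded as a case (assumed perfect specificity)."""
    return sensitivity * true_risk


# Hypothetical true risks and recording sensitivity (assumptions for illustration).
TRUE_EXPOSED, TRUE_UNEXPOSED = 0.30, 0.10
SENSITIVITY = 0.5

obs_exposed = observed_risk(TRUE_EXPOSED, SENSITIVITY)
obs_unexposed = observed_risk(TRUE_UNEXPOSED, SENSITIVITY)

print("true RR:", relative_risk(TRUE_EXPOSED, TRUE_UNEXPOSED))
print("observed RR:", relative_risk(obs_exposed, obs_unexposed))
print("true OR:", odds_ratio(TRUE_EXPOSED, TRUE_UNEXPOSED))
print("observed OR:", odds_ratio(obs_exposed, obs_unexposed))
```

Under these assumptions the sensitivity factor cancels in the risk ratio, so the observed RR equals the true RR, while the observed OR moves toward the RR because the denominator odds now include the missed cases.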

3.
medRxiv ; 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38410487

ABSTRACT

Summary: With the rapid growth of genetic data linked to electronic health record data in huge cohorts, large-scale phenome-wide association studies (PheWAS) have become powerful discovery tools in biomedical research. PheWAS is an analysis method to study phenotype associations utilizing longitudinal electronic health record (EHR) data. Previous PheWAS packages were developed mostly in the days of smaller biobanks and with earlier PheWAS approaches. PheTK was designed to simplify analysis and efficiently handle biobank-scale data. PheTK uses multithreading and supports a full PheWAS workflow including extraction of data from OMOP databases and Hail matrix tables as well as PheWAS analysis for both phecode version 1.2 and phecodeX. Benchmarking results showed PheTK took 64% less time than the R PheWAS package to complete the same workflow. PheTK can be run locally or on cloud platforms such as the All of Us Researcher Workbench (All of Us) or the UK Biobank (UKB) Research Analysis Platform (RAP). Availability and implementation: The PheTK package is freely available on the Python Package Index (PyPI) and on GitHub under GNU Public License (GPL-3) at https://github.com/nhgritctran/PheTK . It is implemented in Python and is platform independent. The demonstration workspace for All of Us will be made available in the future as a featured workspace. Contact: PheTK@mail.nih.gov.

4.
medRxiv ; 2024 Jan 10.
Article in English | MEDLINE | ID: mdl-38260403

ABSTRACT

Genome-wide association studies (GWAS) have been instrumental in identifying genetic associations for various diseases and traits. However, uncovering genetic underpinnings among traits beyond univariate phenotype associations remains a challenge. Multi-phenotype associations (MPA), or genetic pleiotropy, offer important insights into shared genes and pathways among traits, enhancing our understanding of genetic architectures of complex diseases. GWAS of biobank-linked electronic health record (EHR) data are increasingly being utilized to identify MPA among various traits and diseases. However, methodologies that can efficiently take advantage of distributed EHR to detect MPA are still lacking. Here, we introduce mixWAS, a novel algorithm that efficiently and losslessly integrates multiple EHRs via summary statistics, allowing the detection of MPA among mixed phenotypes while accounting for heterogeneities across EHRs. Simulations demonstrate that mixWAS outperforms the widely used MPA detection method, phenome-wide association study (PheWAS), across diverse scenarios. Applying mixWAS to data from seven EHRs in the US, we identified 4,534 MPA among blood lipids, BMI, and circulatory diseases. Validation in independent EHR data from the UK confirmed 97.7% of the associations. mixWAS fundamentally improves the detection of MPA and is available as free, open-source software.

5.
Sci Transl Med ; 15(726): eade9214, 2023 12 13.
Article in English | MEDLINE | ID: mdl-38091411

ABSTRACT

The National Institutes of Health's All of Us Research Program is an accessible platform that hosts genomic and phenotypic data to be collected from 1 million participants in the United States. Its mission is to accelerate medical research and clinical breakthroughs with a special emphasis on diversity.


Subject(s)
Biomedical Research , Population Health , Humans , United States , Data Science , National Institutes of Health (U.S.)
7.
J Am Med Inform Assoc ; 31(1): 139-153, 2023 Dec 22.
Article in English | MEDLINE | ID: mdl-37885303

ABSTRACT

OBJECTIVE: The All of Us Research Program (All of Us) aims to recruit over a million participants to further precision medicine. Essential to the verification of biobanks is a replication of known associations to establish validity. Here, we evaluated how well All of Us data replicated known cigarette smoking associations. MATERIALS AND METHODS: We defined smoking exposure as follows: (1) an EHR Smoking exposure that used International Classification of Disease codes; (2) participant provided information (PPI) Ever Smoking; and, (3) PPI Current Smoking, both from the lifestyle survey. We performed a phenome-wide association study (PheWAS) for each smoking exposure measurement type. For each, we compared the effect sizes derived from the PheWAS to published meta-analyses that studied cigarette smoking from PubMed. We defined two levels of replication of meta-analyses: (1) nominally replicated: which required agreement of direction of effect size, and (2) fully replicated: which required overlap of confidence intervals. RESULTS: PheWASes with EHR Smoking, PPI Ever Smoking, and PPI Current Smoking revealed 736, 492, and 639 phenome-wide significant associations, respectively. We identified 165 meta-analyses representing 99 distinct phenotypes that could be matched to EHR phenotypes. At P < .05, 74 were nominally replicated and 55 were fully replicated. At P < 2.68 × 10⁻⁵ (Bonferroni threshold), 58 were nominally replicated and 40 were fully replicated. DISCUSSION: Most phenotypes found in published meta-analyses associated with smoking were nominally replicated in All of Us. Both survey and EHR definitions for smoking produced similar results. CONCLUSION: This study demonstrated the feasibility of studying common exposures using All of Us data.
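The two replication levels and the Bonferroni threshold described above can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, effect sizes are assumed to be ratio measures (odds ratios or relative risks, with 1.0 as the null), and the example numbers are invented rather than taken from the study.

```python
def direction_agrees(estimate: float, published: float, null_value: float = 1.0) -> bool:
    """Nominal replication check: both effects lie on the same side of the null."""
    return (estimate - null_value) * (published - null_value) > 0


def intervals_overlap(ci_a: tuple[float, float], ci_b: tuple[float, float]) -> bool:
    """Full replication check as stated in the abstract: the two confidence
    intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# Invented example: a PheWAS estimate compared with a published meta-analysis.
phewas_or, phewas_ci = 1.5, (1.1, 1.9)
published_or, published_ci = 2.0, (1.5, 2.5)

print("nominally replicated:", direction_agrees(phewas_or, published_or))
print("fully replicated:", intervals_overlap(phewas_ci, published_ci))
print("threshold for 1,000 tests:", bonferroni_threshold(0.05, 1000))
```

The sketch keeps the two checks separate because the abstract defines them separately; whether full replication also presupposes directional agreement is not stated there.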


Subject(s)
Genome-Wide Association Study , Population Health , Humans , Genome-Wide Association Study/methods , Phenotype , Polymorphism, Single Nucleotide , Smoking
8.
Sci Rep ; 13(1): 18532, 2023 10 28.
Article in English | MEDLINE | ID: mdl-37898691

ABSTRACT

Clostridioides difficile (C. diff.) infection (CDI) is a leading cause of hospital-acquired diarrhea in North America and Europe and a major cause of morbidity and mortality. Known risk factors do not fully explain CDI susceptibility, and genetic susceptibility is suggested by the fact that some patients with colons that are colonized with C. diff. do not develop any infection while others develop severe or recurrent infections. To identify common genetic variants associated with CDI, we performed a genome-wide association analysis in 19,861 participants (1349 cases; 18,512 controls) from the Electronic Medical Records and Genomics (eMERGE) Network. Using logistic regression, we found strong evidence for genetic variation in the DRB locus of the MHC (HLA) II region that predisposes individuals to CDI (P < 1.0 × 10⁻¹⁴; OR 1.56). Altered transcriptional regulation in the HLA region may play a role in conferring susceptibility to this opportunistic enteric pathogen.


Subject(s)
Clostridium Infections , Genome-Wide Association Study , Humans , Clostridium Infections/genetics , Diarrhea , Histocompatibility Antigens , HLA Antigens/genetics , Histocompatibility Antigens Class II , Genetic Variation
9.
Nat Commun ; 14(1): 5419, 2023 09 05.
Article in English | MEDLINE | ID: mdl-37669985

ABSTRACT

Recently, large-scale genomic projects such as All of Us and the UK Biobank have introduced a new research paradigm where data are stored centrally in cloud-based Trusted Research Environments (TREs). To characterize the advantages and drawbacks of different TRE attributes in facilitating cross-cohort analysis, we conduct a Genome-Wide Association Study of standard lipid measures using two approaches: meta-analysis and pooled analysis. Comparison of full summary data from both approaches with an external study shows strong correlation of known loci with lipid levels (R² ~ 83-97%). Importantly, 90 variants meet the significance threshold only in the meta-analysis and 64 variants are significant only in pooled analysis, with approximately 20% of variants in each of those groups being most prevalent in non-European, non-Asian ancestry individuals. These findings have important implications, as technical and policy choices lead to cross-cohort analyses generating similar, but not identical results, particularly for non-European ancestral populations.
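For readers unfamiliar with the meta-analysis side of this comparison, the sketch below combines per-cohort effect estimates for one variant. The abstract does not say which meta-analysis method was used; inverse-variance fixed-effect weighting is assumed here because it is a common default, and the per-cohort numbers are hypothetical.

```python
import math


def fixed_effect_meta(betas: list[float], standard_errors: list[float]) -> tuple[float, float]:
    """Inverse-variance weighted fixed-effect meta-analysis.

    Each cohort's effect estimate is weighted by 1 / SE^2; the combined
    standard error is the square root of the inverse of the total weight.
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("betas and standard_errors must be non-empty and the same length")
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    combined_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)
    return combined_beta, combined_se


# Hypothetical per-cohort estimates for a single variant.
beta, se = fixed_effect_meta([0.10, 0.30], [0.10, 0.10])
print(f"combined beta = {beta:.3f}, combined SE = {se:.4f}, z = {beta / se:.2f}")
```

A pooled analysis would instead fit one model to the individual-level data from all cohorts together, which is only possible when the data can be brought into the same environment.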


Subject(s)
Genome-Wide Association Study , Population Health , Humans , Genomics , Policy , Lipids
10.
Genet Med ; 25(12): 100966, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37622442

ABSTRACT

PURPOSE: Automated use of electronic health records may aid in decreasing the diagnostic delay for rare diseases. The phenotype risk score (PheRS) is a weighted aggregate of syndromically related phenotypes that measures the similarity between an individual's conditions and features of a disease. For some diseases, there are individuals without a diagnosis of that disease who have scores similar to diagnosed patients. These individuals may have that disease but not yet be diagnosed. METHODS: We calculated the PheRS for cystic fibrosis (CF) for 965,626 subjects in the Vanderbilt University Medical Center electronic health record. RESULTS: Of the 400 subjects with the highest PheRS for CF, 248 (62%) had been diagnosed with CF. Twenty-six of the remaining participants, those who were alive and had DNA available in the linked DNA biobank, underwent clinical review and sequencing analysis of CFTR and SERPINA1. This uncovered a potential diagnosis for 2 subjects, 1 with CF and 1 with alpha-1-antitrypsin deficiency. An additional 7 subjects had pathogenic or likely pathogenic variants, 2 in CFTR and 5 in SERPINA1. CONCLUSION: These findings may be clinically actionable for the providers caring for these patients. Importantly, this study highlights feasibility and challenges for future implementations of this approach.
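The "weighted aggregate" can be sketched as below. The abstract does not state the weights; this sketch assumes the log inverse prevalence weighting commonly described for PheRS, so that rarer features count more. The feature names and prevalences are hypothetical placeholders, not real phecodes.

```python
import math


def phenotype_risk_score(
    individual_features: set[str],
    disease_features: list[str],
    prevalence: dict[str, float],
) -> float:
    """Sum the weights of the disease features present in one individual's record.

    Assumed weighting: log(1 / prevalence), where prevalence is the fraction of
    the cohort with that feature.
    """
    return sum(
        math.log(1.0 / prevalence[feature])
        for feature in disease_features
        if feature in individual_features
    )


# Hypothetical disease definition and cohort prevalences.
disease_features = ["feature_a", "feature_b", "feature_c"]
prevalence = {"feature_a": 0.01, "feature_b": 0.10, "feature_c": 0.50}

# An individual whose record contains two of the three features plus an unrelated one.
score = phenotype_risk_score({"feature_a", "feature_b", "unrelated"}, disease_features, prevalence)
print(f"PheRS = {score:.3f}")
```

Ranking a cohort by this score and reviewing the highest-scoring individuals without the diagnosis mirrors the screening step the abstract describes.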


Subject(s)
Cystic Fibrosis Transmembrane Conductance Regulator , Cystic Fibrosis , Humans , Cystic Fibrosis Transmembrane Conductance Regulator/genetics , Electronic Health Records , Delayed Diagnosis , Cystic Fibrosis/diagnosis , Cystic Fibrosis/genetics , Cystic Fibrosis/pathology , DNA , Mutation
11.
Annu Rev Biomed Data Sci ; 6: 443-464, 2023 08 10.
Article in English | MEDLINE | ID: mdl-37561600

ABSTRACT

The All of Us Research Program's Data and Research Center (DRC) was established to help acquire, curate, and provide access to one of the world's largest and most diverse datasets for precision medicine research. Already, over 500,000 participants are enrolled in All of Us, 80% of whom are underrepresented in biomedical research, and data are being analyzed by a community of over 2,300 researchers. The DRC created this thriving data ecosystem by collaborating with engaged participants, innovative program partners, and empowered researchers. In this review, we first describe how the DRC is organized to meet the needs of this broad group of stakeholders. We then outline guiding principles, common challenges, and innovative approaches used to build the All of Us data ecosystem. Finally, we share lessons learned to help others navigate important decisions and trade-offs in building a modern biomedical data platform.


Subject(s)
Biomedical Research , Population Health , Humans , Ecosystem , Precision Medicine
12.
PLoS One ; 18(8): e0286469, 2023.
Article in English | MEDLINE | ID: mdl-37651384

ABSTRACT

Alpha-1 antitrypsin deficiency (AATD), a relatively common autosomal recessive genetic disorder, is underdiagnosed in symptomatic individuals. We sought to compare the risk of liver transplantation associated with hepatitis C infection with AATD heterozygotes and homozygotes and determine if SERPINA1 sequencing would identify undiagnosed AATD. We performed a retrospective cohort study in a deidentified Electronic Health Record (EHR)-linked DNA biobank with 72,027 individuals genotyped for the M, Z, and S alleles in SERPINA1. We investigated liver transplantation frequency by genotype group and compared with hepatitis C infection. We performed SERPINA1 sequencing in carriers of pathogenic AATD alleles who underwent liver transplantation. Liver transplantation was associated with the Z allele (ZZ: odds ratio [OR] = 1.31, p<2e-16; MZ: OR = 1.02, p = 1.2e-13) and with hepatitis C (OR = 1.20, p<2e-16). For liver transplantation, there was a significant interaction between genotype and hepatitis C (ZZ: interaction OR = 1.23, p = 4.7e-4; MZ: interaction OR = 1.11, p = 6.9e-13). Sequencing uncovered a second, rare, pathogenic SERPINA1 variant in six of 133 individuals with liver transplants and without hepatitis C. Liver transplantation was more common in individuals with AATD risk alleles (including heterozygotes), and AATD and hepatitis C demonstrated evidence of a gene-environment interaction in relation to liver transplantation. The current AATD screening strategy may miss diagnoses whereas SERPINA1 sequencing may increase diagnostic yield for AATD, stratify risk for liver disease, and inform clinical management for individuals with AATD risk alleles and liver disease risk factors.


Subject(s)
Hepatitis C , alpha 1-Antitrypsin Deficiency , Humans , Alleles , Gene-Environment Interaction , Retrospective Studies , alpha 1-Antitrypsin Deficiency/diagnosis , alpha 1-Antitrypsin Deficiency/genetics , Hepatitis C/genetics , Hepacivirus/genetics , Genetics, Population , alpha 1-Antitrypsin/genetics
13.
Am J Hum Genet ; 110(9): 1522-1533, 2023 09 07.
Article in English | MEDLINE | ID: mdl-37607538

ABSTRACT

Population-scale biobanks linked to electronic health record data provide vast opportunities to extend our knowledge of human genetics and discover new phenotype-genotype associations. Given their dense phenotype data, biobanks can also facilitate replication studies on a phenome-wide scale. Here, we introduce the phenotype-genotype reference map (PGRM), a set of 5,879 genetic associations from 523 GWAS publications that can be used for high-throughput replication experiments. PGRM phenotypes are standardized as phecodes, ensuring interoperability between biobanks. We applied the PGRM to five ancestry-specific cohorts from four independent biobanks and found evidence of robust replications across a wide array of phenotypes. We show how the PGRM can be used to detect data corruption and to empirically assess parameters for phenome-wide studies. Finally, we use the PGRM to explore factors associated with replicability of GWAS results.


Subject(s)
Biological Specimen Banks , Data Science , Humans , Phenomics , Phenotype , Genotype
14.
PLoS One ; 18(5): e0283553, 2023.
Article in English | MEDLINE | ID: mdl-37196047

ABSTRACT

OBJECTIVE: Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ~50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electronic health record (EHR) data sources of 91,166 multi-ancestry participants with a Natural Language Processing (NLP) technique. MATERIALS AND METHODS: We developed an NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes. RESULTS: Our developed algorithm showed a significant improvement in patient classification performance for DD analysis (algorithm PPVs ≥ 0.94), with up to a 3.5-fold increase in the number of identified patients compared with the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis of the identified subjects replicated the well-established associations between ARHGAP15 loci and DD, showing overall intensified GWAS signals in diverticulitis patients compared to diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes. DISCUSSION: As the first multi-ancestry GWAS-PheWAS study, we showcased that heterogeneous EHR data can be mapped through an integrative analytical pipeline and reveal significant genotype-phenotype associations with clinical interpretation. CONCLUSION: A systematic framework to process unstructured EHR data with NLP could advance deep and scalable phenotyping for better patient identification and facilitate etiological investigation of a disease with multilayered data.


Subject(s)
Diverticular Diseases , Diverticulitis , Diverticulum , Humans , Electronic Health Records , Genome-Wide Association Study/methods , Natural Language Processing , Phenotype , Algorithms , Polymorphism, Single Nucleotide
15.
Clin Pharmacol Ther ; 114(2): 404-412, 2023 08.
Article in English | MEDLINE | ID: mdl-37150941

ABSTRACT

Antibiotics are a known cause of idiosyncratic drug-induced liver injury (DILI). According to the Centers for Disease Control and Prevention, the five most commonly prescribed antibiotics in the United States are azithromycin, ciprofloxacin, cephalexin, amoxicillin, and amoxicillin-clavulanate. We quantified the frequency of acute DILI for these common antibiotics in the All of Us Research Program, one of the largest electronic health record (EHR)-linked research cohorts in the United States. Retrospective analyses were conducted applying a standardized phenotyping algorithm to de-identified clinical data available in the All of Us database for 318,598 study participants. Between February 1984 and December 2022, more than 30% of All of Us participants (n = 119,812 individuals) had been exposed to at least 1 of our 5 study drugs. Initial screening identified 591 potential case patients that met our preselected laboratory-based phenotyping criteria. Because DILI is a diagnosis of exclusion, we then used phenome scanning to narrow the case counts by (i) scanning all EHRs to identify all alternative diagnostic explanations for the laboratory abnormalities, and (ii) leveraging International Classification of Diseases, 9th revision (ICD-9) and 10th revision (ICD-10) codes as exclusion criteria to eliminate misclassification. Our final case counts were 30 DILI cases with amoxicillin-clavulanate, 24 cases with azithromycin, 24 cases with ciprofloxacin, 22 cases with amoxicillin alone, and < 20 cases with cephalexin. These findings demonstrate that data from EHR-linked research cohorts can be efficiently mined to identify DILI cases related to the use of common antibiotics.


Subject(s)
Chemical and Drug Induced Liver Injury , Population Health , Humans , United States/epidemiology , Anti-Bacterial Agents/adverse effects , Azithromycin/adverse effects , Retrospective Studies , Chemical and Drug Induced Liver Injury/diagnosis , Chemical and Drug Induced Liver Injury/epidemiology , Chemical and Drug Induced Liver Injury/etiology , Amoxicillin-Potassium Clavulanate Combination/adverse effects , Amoxicillin , Ciprofloxacin/adverse effects , Cephalexin
16.
NPJ Digit Med ; 6(1): 89, 2023 May 19.
Article in English | MEDLINE | ID: mdl-37208468

ABSTRACT

Common data models solve many challenges of standardizing electronic health record (EHR) data but are unable to semantically integrate all of the resources needed for deep phenotyping. Open Biological and Biomedical Ontology (OBO) Foundry ontologies provide computable representations of biological knowledge and enable the integration of heterogeneous data. However, mapping EHR data to OBO ontologies requires significant manual curation and domain expertise. We introduce OMOP2OBO, an algorithm for mapping Observational Medical Outcomes Partnership (OMOP) vocabularies to OBO ontologies. Using OMOP2OBO, we produced mappings for 92,367 conditions, 8,611 drug ingredients, and 10,673 measurement results, which covered 68-99% of concepts used in clinical practice when examined across 24 hospitals. When used to phenotype rare disease patients, the mappings helped systematically identify undiagnosed patients who might benefit from genetic testing. By aligning OMOP vocabularies to OBO ontologies, our algorithm presents new opportunities to advance EHR-based deep phenotyping.

17.
Arthritis Rheumatol ; 75(9): 1532-1541, 2023 09.
Article in English | MEDLINE | ID: mdl-37096581

ABSTRACT

OBJECTIVE: Systemic lupus erythematosus (SLE) poses diagnostic challenges. We undertook this study to evaluate the utility of a phenotype risk score (PheRS) and a genetic risk score (GRS) to identify SLE individuals in a real-world setting. METHODS: Using a de-identified electronic health record (EHR) database with an associated DNA biobank, we identified 789 SLE cases and 2,261 controls with available MEGAEX genotyping. A PheRS for SLE was developed using billing codes that captured American College of Rheumatology SLE criteria. We developed a GRS with 58 SLE risk single-nucleotide polymorphisms (SNPs). RESULTS: SLE cases had a significantly higher PheRS (mean ± SD 7.7 ± 8.0 versus 0.8 ± 2.0 in controls; P < 0.001) and GRS (mean ± SD 12.2 ± 2.3 versus 11.0 ± 2.0 in controls; P < 0.001). Black individuals with SLE had a higher PheRS compared to White individuals (mean ± SD 10.0 ± 10.1 versus 7.1 ± 7.2, respectively; P = 0.002) but a lower GRS (mean ± SD 9.0 ± 1.4 versus 12.3 ± 1.7, respectively; P < 0.001). Models predicting SLE that used only the PheRS had an area under the curve (AUC) of 0.87. Adding the GRS to the PheRS resulted in a minimal difference with an AUC of 0.89. On chart review, controls with the highest PheRS and GRS had undiagnosed SLE. CONCLUSION: We developed an SLE PheRS to identify established and undiagnosed SLE individuals. An SLE GRS using known risk SNPs did not add value beyond the PheRS and was of limited utility in Black individuals with SLE. More work is needed to understand the genetic risks of SLE in diverse populations.
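To make the reported AUC values concrete, the sketch below computes an AUC directly from its definition: the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counted as one half. The scores are hypothetical, and the study's own models may have been evaluated with different software.

```python
def auc_from_scores(case_scores: list[float], control_scores: list[float]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0   # case ranked above control
            elif case == control:
                wins += 0.5   # ties contribute half
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical risk scores for a handful of cases and controls.
cases = [3.0, 4.0]
controls = [1.0, 2.0, 3.5]
print(f"AUC = {auc_from_scores(cases, controls):.3f}")
```

The pairwise form is quadratic in cohort size, which is fine for illustration; rank-based implementations give the same value more efficiently on large cohorts.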


Subject(s)
Electronic Health Records , Lupus Erythematosus, Systemic , Humans , Lupus Erythematosus, Systemic/epidemiology , Lupus Erythematosus, Systemic/genetics , Lupus Erythematosus, Systemic/diagnosis , Risk Factors , Phenotype , White
19.
Am J Epidemiol ; 192(1): 11-24, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36205043

ABSTRACT

The All of Us Research Program, a health and genetics epidemiologic data collection program, has been substantially affected by the coronavirus disease 2019 (COVID-19) pandemic. Although the program is highly digital in nature, certain aspects of the data collection require in-person interaction between staff and participants. Before the pandemic, the program was enrolling approximately 12,500 participants per month at more than 400 clinical sites. In March 2020, because of the pandemic, all in-person activity at program sites and by engagement partners was paused to develop processes and procedures for in-person activities that incorporated strict safety protocols. In addition, the program adopted new data collection methodologies to reduce the need for in-person activities. Through February 2022, a total of 224 clinical sites had reactivated in-person activity, and all enrollment and engagement partners have adopted new data collection methods that can be used remotely. As the COVID-19 pandemic persists, the program continues to require safety procedures for in-person activity and continues to generate and pilot methodologies that reduce risk and make it easier for participants to provide information.


Subject(s)
COVID-19 , Population Health , Humans , COVID-19/epidemiology , Pandemics/prevention & control , Data Collection
20.
Obesity (Silver Spring) ; 30(12): 2477-2488, 2022 12.
Article in English | MEDLINE | ID: mdl-36372681

ABSTRACT

OBJECTIVE: High BMI is associated with many comorbidities and mortality. This study aimed to elucidate the overall clinical risk of obesity using a genome- and phenome-wide approach. METHODS: This study performed a phenome-wide association study of BMI using a clinical cohort of 736,726 adults. This was followed by genetic association studies using two separate cohorts: one consisting of 65,174 adults in the Electronic Medical Records and Genomics (eMERGE) Network and another with 405,432 participants in the UK Biobank. RESULTS: Class 3 obesity was associated with 433 phenotypes, representing 59.3% of all billing codes in individuals with severe obesity. A genome-wide polygenic risk score for BMI, accounting for 7.5% of variance in BMI, was associated with 296 clinical diseases, including strong associations with type 2 diabetes, sleep apnea, hypertension, and chronic liver disease. In all three cohorts, 199 phenotypes were associated with class 3 obesity and polygenic risk for obesity, including novel associations such as increased risk of renal failure, venous insufficiency, and gastroesophageal reflux. CONCLUSIONS: This combined genomic and phenomic systematic approach demonstrated that obesity has a strong genetic predisposition and is associated with a considerable burden of disease across all disease classes.


Subject(s)
Diabetes Mellitus, Type 2 , Phenomics , Humans , Electronic Health Records , Genome-Wide Association Study , Diabetes Mellitus, Type 2/epidemiology , Diabetes Mellitus, Type 2/genetics , Polymorphism, Single Nucleotide , Genomics , Genetic Predisposition to Disease , Obesity/epidemiology , Obesity/genetics , Phenotype , Cost of Illness