
1.

To weight or not to weight? The effect of selection bias in 3 large electronic health record-linked biobanks and recommendations for practice.

Salvatore, Maxwell; Kundu, Ritoban; Shi, Xu; Friese, Christopher R; Lee, Seunggeun; Fritsche, Lars G; Mondul, Alison M; Hanauer, David; Pearce, Celeste Leigh; Mukherjee, Bhramar.

J Am Med Inform Assoc ; 2024 May 14.

Article En | MEDLINE | ID: mdl-38742457

OBJECTIVES: To develop recommendations regarding the use of weights to reduce selection bias for commonly performed analyses using electronic health record (EHR)-linked biobank data. MATERIALS AND METHODS: We mapped diagnosis (ICD code) data to standardized phecodes from 3 EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n = 244 071), Michigan Genomics Initiative (MGI; n = 81 243), and UK Biobank (UKB; n = 401 167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to be more representative of the US adult population. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted 4 common analyses comparing unweighted and weighted results. RESULTS: For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted phenome-wide association study for colorectal cancer, the strongest associations remained unaltered, with considerable overlap in significant hits. Weighting shifted the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates. DISCUSSION: Weighting had a limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation. When effect size estimation is of interest, specific signals from untargeted association analyses should be followed up by weighted analysis. CONCLUSION: EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.
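
The weighted-unweighted prevalence ratio reported above can be illustrated with a minimal sketch. It assumes a normalized weighted mean as the prevalence estimator, which the abstract does not specify; the function names, the toy cohort, and the weights are all hypothetical, and the construction of the weights from survey data is not shown.

```python
from statistics import median


def prevalence(cases, weights=None):
    """Proportion of participants with a phecode, optionally weighted.

    cases   -- list of 0/1 indicators, one per participant
    weights -- optional selection weights (hypothetical values in this sketch)
    """
    if weights is None:
        weights = [1.0] * len(cases)
    return sum(w * c for w, c in zip(weights, cases)) / sum(weights)


def prevalence_ratio(cases, weights):
    """Weighted prevalence divided by unweighted prevalence for one phecode."""
    return prevalence(cases, weights) / prevalence(cases)


def median_prevalence_ratio(phecodes, weights):
    """Median of the per-phecode ratios (assumes every phecode has a case)."""
    return median(prevalence_ratio(c, weights) for c in phecodes.values())


# Toy cohort: participants with diagnoses are over-represented relative to the
# target population, so they receive smaller weights.
toy_weights = [0.5, 0.5, 1.5, 1.5]
toy_phecodes = {
    "phecode_A": [1, 1, 0, 0],  # ratio 0.5
    "phecode_B": [1, 0, 1, 0],  # ratio 1.0
    "phecode_C": [0, 0, 1, 1],  # ratio 1.5
}
toy_mpr = median_prevalence_ratio(toy_phecodes, toy_weights)
```

A ratio below 1 for a phecode means that weighting lowers its estimated prevalence, which is the direction reported for AOU and MGI.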

2.

To weight or not to weight? Studying the effect of selection bias in three large EHR-linked biobanks.

Salvatore, Maxwell; Kundu, Ritoban; Shi, Xu; Friese, Christopher R; Lee, Seunggeun; Fritsche, Lars G; Mondul, Alison M; Hanauer, David; Pearce, Celeste Leigh; Mukherjee, Bhramar.

medRxiv ; 2024 Feb 13.

Article En | MEDLINE | ID: mdl-38405832

Objective: To explore the role of selection bias adjustment by weighting electronic health record (EHR)-linked biobank data for commonly performed analyses. Materials and methods: We mapped diagnosis (ICD code) data to standardized phecodes from three EHR-linked biobanks with varying recruitment strategies: All of Us (AOU; n=244,071), Michigan Genomics Initiative (MGI; n=81,243), and UK Biobank (UKB; n=401,167). Using 2019 National Health Interview Survey data, we constructed selection weights for AOU and MGI to be more representative of the US adult population. We used weights previously developed for UKB to represent the UKB-eligible population. We conducted four common descriptive and analytic tasks comparing unweighted and weighted results. Results: For AOU and MGI, estimated phecode prevalences decreased after weighting (weighted-unweighted median phecode prevalence ratio [MPR]: 0.82 and 0.61), while UKB's estimates increased (MPR: 1.06). Weighting minimally impacted latent phenome dimensionality estimation. Comparing weighted versus unweighted PheWAS for colorectal cancer, the strongest associations remained unaltered and there was large overlap in significant hits. Weighting affected the estimated log-odds ratio for sex and colorectal cancer to align more closely with national registry-based estimates. Discussion: Weighting had limited impact on dimensionality estimation and large-scale hypothesis testing but impacted prevalence and association estimation more. Results from untargeted association analyses should be followed by weighted analysis when effect size estimation is of interest for specific signals. Conclusion: EHR-linked biobanks should report recruitment and selection mechanisms and provide selection weights with defined target populations. Researchers should consider their intended estimands, specify source and target populations, and weight EHR-linked biobank analyses accordingly.

3.

Uncovering associations between pre-existing conditions and COVID-19 Severity: A polygenic risk score approach across three large biobanks.

Fritsche, Lars G; Nam, Kisung; Du, Jiacong; Kundu, Ritoban; Salvatore, Maxwell; Shi, Xu; Lee, Seunggeun; Burgess, Stephen; Mukherjee, Bhramar.

PLoS Genet ; 19(12): e1010907, 2023 Dec.

Article En | MEDLINE | ID: mdl-38113267

OBJECTIVE: To overcome the limitations associated with the collection and curation of COVID-19 outcome data in biobanks, this study proposes the use of polygenic risk scores (PRS) as reliable proxies of COVID-19 severity across three large biobanks: the Michigan Genomics Initiative (MGI), UK Biobank (UKB), and NIH All of Us. The goal is to identify associations between pre-existing conditions and COVID-19 severity. METHODS: Drawing on a sample of more than 500,000 individuals from the three biobanks, we conducted a phenome-wide association study (PheWAS) to identify associations between a PRS for COVID-19 severity, derived from a genome-wide association study on COVID-19 hospitalization, and clinical pre-existing, pre-pandemic phenotypes. We performed cohort-specific PRS PheWAS and a subsequent fixed-effects meta-analysis. RESULTS: The current study uncovered 23 pre-existing conditions significantly associated with the COVID-19 severity PRS in cohort-specific analyses, of which 21 were observed in the UKB cohort and two in the MGI cohort. The meta-analysis yielded 27 significant phenotypes predominantly related to obesity, metabolic disorders, and cardiovascular conditions. After adjusting for body mass index, several clinical phenotypes, such as hypercholesterolemia and gastrointestinal disorders, remained associated with an increased risk of hospitalization following COVID-19 infection. CONCLUSION: By employing PRS as a proxy for COVID-19 severity, we corroborated known risk factors and identified novel associations between pre-existing clinical phenotypes and COVID-19 severity. Our study highlights the potential value of using PRS when actual outcome data may be limited or inadequate for robust analyses.

COVID-19 , Population Health , Humans , Genome-Wide Association Study , Genetic Risk Score , COVID-19/genetics , Biological Specimen Banks , Preexisting Condition Coverage , Risk Factors , Genetic Predisposition to Disease
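
The fixed-effects meta-analysis step mentioned in this abstract can be sketched as follows. The abstract does not state the estimator used; this sketch assumes the standard inverse-variance weighted form, and the function name and input values are hypothetical.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance weighted pooled estimate and its standard error.

    betas           -- per-cohort effect estimates (e.g. log-odds ratios)
    standard_errors -- matching per-cohort standard errors
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical estimates for one phenotype from two cohorts of equal precision.
pooled_beta, pooled_se = fixed_effects_meta([0.1, 0.3], [0.1, 0.1])
```

With equal standard errors the pooled estimate is the simple average of the cohort estimates, and the pooled standard error shrinks by a factor of the square root of the number of cohorts.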

4.

Comparative impact assessment of COVID-19 policy interventions in five South Asian countries using reported and estimated unreported death counts during 2020-2021.

Kundu, Ritoban; Datta, Jyotishka; Ray, Debashree; Mishra, Swapnil; Bhattacharyya, Rupam; Zimmermann, Lauren; Mukherjee, Bhramar.

PLOS Glob Public Health ; 3(12): e0002063, 2023.

Article En | MEDLINE | ID: mdl-38150465

There has been raging discussion and debate around the quality of COVID-19 death data in South Asia. According to WHO, of the 5.5 million reported COVID-19 deaths from 2020-2021, 0.57 million (10%) were contributed by five low- and middle-income countries (LMICs) in the Global South: India, Pakistan, Bangladesh, Sri Lanka and Nepal. However, a number of excess death estimates show that the actual death toll from COVID-19 is significantly higher than the reported number of deaths. For example, the IHME and WHO both project around 14.9 million total deaths, of which 4.5-5.5 million were attributed to these five countries in 2020-2021. We focus on the COVID-19 performance of these five countries, home to 23.5% of the world population in 2020 and 2021, through a counterfactual lens and ask to what extent the mortality of one LMIC would have been affected had it adopted the pandemic policies of another, similar country. We use a Bayesian semi-mechanistic model developed by Mishra et al. (2021) to compare both the reported and estimated total death tolls by permuting the time-varying reproduction number (Rt) across these countries over a similar time period. Our analysis shows that, in the first half of 2021, mortality in India in terms of reported deaths could have been reduced to 96 and 102 deaths per million, compared to the actual 170 reported deaths per million, had it adopted the policies of Nepal and Pakistan respectively. In terms of total deaths, India could have averted 481 and 466 deaths per million had it adopted the policies of Bangladesh and Pakistan. On the other hand, India had a lower number of reported COVID-19 deaths per million (48 deaths per million) and a lower estimated total deaths per million (80 deaths per million) in the second half of 2021, and LMICs other than Pakistan would have had lower reported mortality had they followed India's strategy. The gap between the reported and estimated total deaths highlights the varying level and extent of under-reporting of deaths across the subcontinent, and that model estimates are contingent on the accuracy of the death data. Our analysis shows the importance of timely public health intervention and vaccines for lowering mortality and the need for better coverage infrastructure for the death registration system in LMICs.

5.

Lessons from SARS-CoV-2 in India: A data-driven framework for pandemic resilience.

Salvatore, Maxwell; Purkayastha, Soumik; Ganapathi, Lakshmi; Bhattacharyya, Rupam; Kundu, Ritoban; Zimmermann, Lauren; Ray, Debashree; Hazra, Aditi; Kleinsasser, Michael; Solomon, Sunil; Subbaraman, Ramnath; Mukherjee, Bhramar.

Sci Adv ; 8(24): eabp8621, 2022 Jun 17.

Article En | MEDLINE | ID: mdl-35714183

India experienced a massive surge in SARS-CoV-2 infections and deaths during April to June 2021 despite having controlled the epidemic relatively well during 2020. Using counterfactual predictions from epidemiological disease transmission models, we produce evidence in support of how strengthening public health interventions early would have helped control transmission in the country and significantly reduced mortality during the second wave, even without harsh lockdowns. We argue that enhanced surveillance at district, state, and national levels and constant assessment of risk associated with increased transmission are critical for future pandemic responsiveness. Building on our retrospective analysis, we provide a tiered data-driven framework for timely escalation of future interventions as a tool for policy-makers.

6.

Extending the susceptible-exposed-infected-removed (SEIR) model to handle the false negative rate and symptom-based administration of COVID-19 diagnostic tests: SEIR-fansy.

Bhaduri, Ritwik; Kundu, Ritoban; Purkayastha, Soumik; Kleinsasser, Michael; Beesley, Lauren J; Mukherjee, Bhramar; Datta, Jyotishka.

Stat Med ; 41(13): 2317-2337, 2022 06 15.

Article En | MEDLINE | ID: mdl-35224743

False negative rates of severe acute respiratory syndrome coronavirus 2 diagnostic tests, together with selection bias due to prioritized testing, can result in inaccurate modeling of COVID-19 transmission dynamics based on reported "case" counts. We propose an extension of the widely used Susceptible-Exposed-Infected-Removed (SEIR) model that accounts for misclassification error and selection bias, and derive an analytic expression for the basic reproduction number R0 as a function of false negative rates of the diagnostic tests and selection probabilities for getting tested. Analyzing data from the first two waves of the pandemic in India, we show that correcting for misclassification and selection leads to more accurate prediction in a test sample. We provide estimates of undetected infections and deaths between April 1, 2020 and August 31, 2021. At the end of the first wave in India (as of February 1, 2021), the estimated under-reporting factor was 11.1 (95% CI: 10.7, 11.5) for cases and 3.58 (95% CI: 3.5, 3.66) for deaths; as of July 1, 2021, these had changed to 19.2 (95% CI: 17.9, 19.9) and 4.55 (95% CI: 4.32, 4.68). Equivalently, 9.0% (95% CI: 8.7%, 9.3%) and 5.2% (95% CI: 5.0%, 5.6%) of total estimated infections were reported on these two dates, while 27.9% (95% CI: 27.3%, 28.6%) and 22% (95% CI: 21.4%, 23.1%) of estimated total deaths were reported. Extensive simulation studies demonstrate the effect of misclassification and selection on estimation of R0 and prediction of future infections. An R package, SEIRfansy, is developed for broader dissemination.

COVID-19 , Basic Reproduction Number , COVID-19/diagnosis , COVID-19/epidemiology , Humans , India/epidemiology , Pandemics , SARS-CoV-2
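
The "equivalently" step in this abstract, converting an under-reporting factor into a percentage reported, is a reciprocal. The short sketch below checks the point estimates quoted in the abstract; the function name is hypothetical, and it assumes the under-reporting factor is defined as total estimated count divided by reported count.

```python
def percent_reported(underreporting_factor):
    """Percentage of the estimated total that was reported.

    Assumes underreporting_factor = estimated total / reported count,
    so the reported share is its reciprocal.
    """
    if underreporting_factor <= 0:
        raise ValueError("under-reporting factor must be positive")
    return 100.0 / underreporting_factor


# Point estimates quoted in the abstract (cases, then deaths, on each date).
cases_feb_2021 = percent_reported(11.1)
cases_jul_2021 = percent_reported(19.2)
deaths_feb_2021 = percent_reported(3.58)
deaths_jul_2021 = percent_reported(4.55)
```

Rounded to one decimal place, these reproduce the 9.0%, 5.2%, 27.9% and 22% figures given in the abstract.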

7.

Author Correction: Incorporating false negative tests in epidemiological models for SARS-CoV-2 transmission and reconciling with seroprevalence estimates.

Bhattacharyya, Rupam; Kundu, Ritoban; Bhaduri, Ritwik; Ray, Debashree; Beesley, Lauren J; Salvatore, Maxwell; Mukherjee, Bhramar.

Sci Rep ; 11(1): 17221, 2021 Aug 20.

Article En | MEDLINE | ID: mdl-34417536

8.

Estimating the wave 1 and wave 2 infection fatality rates from SARS-CoV-2 in India.

Purkayastha, Soumik; Kundu, Ritoban; Bhaduri, Ritwik; Barker, Daniel; Kleinsasser, Michael; Ray, Debashree; Mukherjee, Bhramar.

BMC Res Notes ; 14(1): 262, 2021 Jul 08.

Article En | MEDLINE | ID: mdl-34238344

OBJECTIVE: There has been much discussion and debate around the underreporting of COVID-19 infections and deaths in India. In this short report we first estimate the underreporting factor for infections from publicly available data released by the Indian Council of Medical Research on reported number of cases and national seroprevalence surveys. We then use a compartmental epidemiologic model to estimate the undetected number of infections and deaths, yielding estimates of the corresponding underreporting factors. We compare the serosurvey-based ad hoc estimate of the infection fatality rate (IFR) with the model-based estimate. Since the first and second waves in India are intrinsically different in nature, we carry out this exercise in two periods: the first wave (April 1, 2020-January 31, 2021) and part of the second wave (February 1, 2021-May 15, 2021). The latest national seroprevalence estimate is from January 2021, and thus only relevant to our wave 1 calculations. RESULTS: Both wave 1 and wave 2 estimates qualitatively show that there is a large degree of "covert infections" in India, with a model-based estimated underreporting factor of 11.11 (95% credible interval (CrI) 10.71-11.47) for infections and 3.56 (95% CrI 3.48-3.64) for deaths in wave 1. For wave 2, the underreporting factor escalates to 26.77 (95% CrI 24.26-28.81) for infections and to 5.77 (95% CrI 5.34-6.15) for deaths. If we rely on only reported deaths, the IFR estimate is 0.13% for wave 1 and 0.03% for part of wave 2. Taking underreporting of deaths into account, the IFR estimate is 0.46% for wave 1 and 0.18% for wave 2 (till May 15). Combining waves 1 and 2, as of May 15, while India reported a total of nearly 25 million cases and 270 thousand deaths, the estimated numbers of infections and deaths stand at 491 million (36% of the population) and 1.21 million respectively, yielding an estimated (combined) infection fatality rate of 0.25%. There is considerable variation in these estimates across Indian states. Up to date seroprevalence studies and mortality data are needed to validate these model-based estimates.

Biomedical Research , COVID-19 , Humans , India/epidemiology , SARS-CoV-2 , Seroepidemiologic Studies
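
The combined infection fatality rate quoted in this abstract follows from the estimated totals by simple division. The sketch below redoes that arithmetic; the function name is hypothetical, and the reported counts are the rounded values from the abstract ("nearly 25 million" and "270 thousand"), so the implied combined under-reporting factors are approximate.

```python
def ifr_percent(deaths, infections):
    """Infection fatality rate as a percentage of infections."""
    if infections <= 0:
        raise ValueError("infections must be positive")
    return 100.0 * deaths / infections


# Figures quoted in the abstract, as of May 15, 2021.
estimated_infections = 491e6
estimated_deaths = 1.21e6
reported_cases = 25e6      # "nearly 25 million", rounded
reported_deaths = 270e3    # "270 thousand", rounded

combined_ifr = ifr_percent(estimated_deaths, estimated_infections)

# Approximate combined under-reporting factors implied by the rounded counts.
implied_urf_cases = estimated_infections / reported_cases
implied_urf_deaths = estimated_deaths / reported_deaths
```

The combined IFR rounds to 0.25%, matching the abstract. The implied factors (about 19.6 for cases and 4.5 for deaths) are back-of-envelope values from rounded inputs, not figures reported by the authors.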

9.

COVID-19 Pandemic in India: Through the Lens of Modeling.

Babu, Giridhara R; Ray, Debashree; Bhaduri, Ritwik; Halder, Aritra; Kundu, Ritoban; Menon, Gautam I; Mukherjee, Bhramar.

Glob Health Sci Pract ; 9(2): 220-228, 2021 06 30.

Article En | MEDLINE | ID: mdl-34234020

COVID-19 , Pandemics , Public Health , COVID-19/mortality , COVID-19/prevention & control , Humans , India/epidemiology , SARS-CoV-2

10.

A comparison of five epidemiological models for transmission of SARS-CoV-2 in India.

Purkayastha, Soumik; Bhattacharyya, Rupam; Bhaduri, Ritwik; Kundu, Ritoban; Gu, Xuelin; Salvatore, Maxwell; Ray, Debashree; Mishra, Swapnil; Mukherjee, Bhramar.

BMC Infect Dis ; 21(1): 533, 2021 Jun 07.

Article En | MEDLINE | ID: mdl-34098885

BACKGROUND: Many popular disease transmission models have helped nations respond to the COVID-19 pandemic by informing decisions about pandemic planning, resource allocation, implementation of social distancing measures, lockdowns, and other non-pharmaceutical interventions. We study how five epidemiological models forecast and assess the course of the pandemic in India: a baseline curve-fitting model, an extended SIR (eSIR) model, two extended SEIR (SAPHIRE and SEIR-fansy) models, and a semi-mechanistic Bayesian hierarchical model (ICM). METHODS: Using COVID-19 case-recovery-death count data reported in India from March 15 to October 15 to train the models, we generate predictions from each of the five models from October 16 to December 31. To compare prediction accuracy with respect to reported cumulative and active case counts and reported cumulative death counts, we compute the symmetric mean absolute prediction error (SMAPE) for each of the five models. For reported cumulative cases and deaths, we compute Pearson's and Lin's correlation coefficients to investigate how well the projected and observed reported counts agree. We also present underreporting factors when available, and comment on uncertainty of projections from each model. RESULTS: For active case counts, SMAPE values are 35.14% (SEIR-fansy) and 37.96% (eSIR). For cumulative case counts, SMAPE values are 6.89% (baseline), 6.59% (eSIR), 2.25% (SAPHIRE) and 2.29% (SEIR-fansy). For cumulative death counts, the SMAPE values are 4.74% (SEIR-fansy), 8.94% (eSIR) and 0.77% (ICM). Three models (SAPHIRE, SEIR-fansy and ICM) return total (sum of reported and unreported) cumulative case counts as well. We compute underreporting factors as of October 31 and note that for cumulative cases, the SEIR-fansy model yields an underreporting factor of 7.25 and the ICM yields 4.54 for the same quantity. For total (sum of reported and unreported) cumulative deaths the SEIR-fansy model reports an underreporting factor of 2.97. On October 31, we observe 8.18 million cumulative reported cases; the baseline model projects 8.71 million (95% credible interval: 8.63-8.80), eSIR yields 8.35 (7.19-9.60), SAPHIRE returns 8.17 (7.90-8.52) and SEIR-fansy projects 8.51 (8.18-8.85) million cases. Cumulative case projections from the eSIR model have the highest uncertainty in terms of width of 95% credible intervals, followed by those from SAPHIRE, the baseline model and finally SEIR-fansy. CONCLUSIONS: In this comparative paper, we describe five different models used to study the transmission dynamics of the SARS-CoV-2 virus in India. While simulation studies are the only gold standard way to compare the accuracy of the models, here we were uniquely poised to compare the projected case counts against observed data on a test period. The largest variability across models is observed in predicting the "total" number of infections including reported and unreported cases (on which we have no validation data). The degree of under-reporting has been a major concern in India and is characterized in this report. Overall, the SEIR-fansy model appeared to be a good choice, with a publicly available R package and the desired flexibility and accuracy.

COVID-19/epidemiology , COVID-19/transmission , Pandemics , Bayes Theorem , Communicable Disease Control/methods , Computer Simulation , Forecasting , Humans , India/epidemiology , Models, Statistical
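
The symmetric mean absolute prediction error used to compare the models can be sketched as below. Several SMAPE variants exist and the abstract does not say which one the authors used; this sketch assumes the common form that divides each absolute error by the mean of the absolute observed and predicted values. The function name and the example series are hypothetical.

```python
def smape(observed, predicted):
    """Symmetric mean absolute prediction error, in percent.

    Uses |predicted - observed| / ((|observed| + |predicted|) / 2),
    averaged over all time points. Points where both values are zero
    contribute zero error.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("series must be non-empty and of equal length")
    total = 0.0
    for actual, forecast in zip(observed, predicted):
        denominator = (abs(actual) + abs(forecast)) / 2.0
        if denominator > 0:
            total += abs(forecast - actual) / denominator
    return 100.0 * total / len(observed)


# Hypothetical daily cumulative counts over a three-day test period.
example_observed = [100.0, 200.0, 300.0]
example_predicted = [100.0, 200.0, 300.0]
perfect_score = smape(example_observed, example_predicted)
```

A perfect forecast scores 0%, and under this variant the score is bounded above by 200%, which makes it comparable across series of very different magnitude such as case and death counts.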

11.

Incorporating false negative tests in epidemiological models for SARS-CoV-2 transmission and reconciling with seroprevalence estimates.

Bhattacharyya, Rupam; Kundu, Ritoban; Bhaduri, Ritwik; Ray, Debashree; Beesley, Lauren J; Salvatore, Maxwell; Mukherjee, Bhramar.

Sci Rep ; 11(1): 9748, 2021 05 07.

Article En | MEDLINE | ID: mdl-33963259

Susceptible-Exposed-Infected-Removed (SEIR)-type epidemiologic models, modeling unascertained infections latently, can predict unreported cases and deaths assuming perfect testing. We apply a method we developed to account for the high false negative rates of diagnostic RT-PCR tests for detecting an active SARS-CoV-2 infection in a classic SEIR model. Because the numbers of unascertained cases and false negatives are unobservable in a real study, population-based serosurveys can help validate model projections. Applying our method to training data from Delhi, India, during March 15-June 30, 2020, we estimate the underreporting factor for cases at 34-53 (deaths: 8-13) on July 10, 2020, largely consistent with the findings of the first round of serosurveys for Delhi (done during June 27-July 10, 2020) with an estimated 22.86% IgG antibody prevalence, yielding estimated underreporting factors of 30-42 for cases. Together, these imply approximately 96-98% of cases in Delhi remained unreported (July 10, 2020). Updated calculations using training data during March 15-December 31, 2020 yield an estimated underreporting factor for cases of 13-22 (deaths: 3-7) on January 23, 2021, which is again consistent with the latest (fifth) round of serosurveys for Delhi (done during January 15-23, 2021) with an estimated 56.13% IgG antibody prevalence, yielding an estimated range for the underreporting factor for cases of 17-21. Together, these updated estimates imply approximately 92-96% of cases in Delhi remained unreported (January 23, 2021). Such model-based estimates, updated with the latest data, provide a viable alternative to repeated resource-intensive serosurveys for tracking unreported cases and deaths and gauging the true extent of the pandemic.

COVID-19/diagnosis , COVID-19/epidemiology , SARS-CoV-2/isolation & purification , Adolescent , Adult , Antibodies, Viral/immunology , COVID-19/immunology , COVID-19/transmission , COVID-19 Testing , Child , Child, Preschool , False Negative Reactions , Female , Humans , Immunoglobulin G/immunology , India/epidemiology , Male , SARS-CoV-2/immunology , Seroepidemiologic Studies , Young Adult

12.

EXTENDING THE SUSCEPTIBLE-EXPOSED-INFECTED-REMOVED (SEIR) MODEL TO HANDLE THE HIGH FALSE NEGATIVE RATE AND SYMPTOM-BASED ADMINISTRATION OF COVID-19 DIAGNOSTIC TESTS: SEIR-fansy.

Bhaduri, Ritwik; Kundu, Ritoban; Purkayastha, Soumik; Kleinsasser, Michael; Beesley, Lauren J; Mukherjee, Bhramar.

medRxiv ; 2020 Sep 25.

Article En | MEDLINE | ID: mdl-32995829

The false negative rate of the diagnostic RT-PCR test for SARS-CoV-2 has been reported to be high. Due to limited availability of testing, only a non-random subset of the population can get tested. Hence, the reported test counts are subject to a large degree of selection bias. We consider an extension of the Susceptible-Exposed-Infected-Removed (SEIR) model under both selection bias and misclassification. We derive a closed-form expression for the basic reproduction number under such data anomalies using the next generation matrix method. We conduct extensive simulation studies to quantify the effect of misclassification and selection on the resultant estimation and prediction of future case counts. Finally, we apply the methods to reported case-death-recovery count data from India, a nation with more than 5 million cases reported over the last seven months. We show that correcting for misclassification and selection can lead to more accurate prediction of case counts (and death counts) using the observed data as a beta tester. The model also provides an estimate of undetected infections and thus an under-reporting factor. For India, the estimated under-reporting factor for cases is around 21 and for deaths is around 6. We develop an R package (SEIRfansy) for broader dissemination of the methods.
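
The next generation matrix method named in this abstract can be illustrated on the textbook SEIR model. This sketch is not the authors' extended model: it omits the false-negative and untested compartments, and does not reproduce their closed-form expression. It assumes no births or deaths, a transmission rate `beta`, an incubation rate `sigma`, and a removal rate `gamma`; all names and parameter values are hypothetical.

```python
import math


def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(x, y):
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def spectral_radius_2x2(m):
    """Largest eigenvalue modulus of a 2x2 matrix."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    disc = trace ** 2 - 4.0 * det
    if disc >= 0:
        root = math.sqrt(disc)
        return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))
    return math.sqrt(det)  # complex conjugate pair: modulus is sqrt(det)


def seir_r0(beta, sigma, gamma):
    """R0 of the textbook SEIR model via the next generation matrix F V^-1.

    Infected compartments are ordered (E, I). F holds new infections,
    V holds transitions out of and between infected compartments.
    """
    new_infections = [[0.0, beta], [0.0, 0.0]]
    transitions = [[sigma, 0.0], [-sigma, gamma]]
    ngm = matmul_2x2(new_infections, inverse_2x2(transitions))
    return spectral_radius_2x2(ngm)


example_r0 = seir_r0(beta=0.5, sigma=0.2, gamma=0.25)
```

For this simple model the spectral radius reduces to `beta / gamma`, so the example gives an R0 of 2. In an extended model with additional infected compartments, the same recipe applies with larger matrices.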