Pesquisa | Portal de Pesquisa da BVS Enfermagem

1.

Trans-Balance: Reducing demographic disparity for prediction models in the presence of class imbalance.

Hong, Chuan; Liu, Molei; Wojdyla, Daniel M; Hickey, Jimmy; Pencina, Michael; Henao, Ricardo.

J Biomed Inform ; 149: 104532, 2024 Jan.

Artigo em Inglês | MEDLINE | ID: mdl-38070817

RESUMO

INTRODUCTION: Risk prediction, including early disease detection, prevention, and intervention, is essential to precision medicine. However, systematic bias in risk estimation caused by heterogeneity across different demographic groups can lead to inappropriate or misinformed treatment decisions. In addition, low incidence (class-imbalance) outcomes negatively impact the classification performance of many standard learning algorithms which further exacerbates the racial disparity issues. Therefore, it is crucial to improve the performance of statistical and machine learning models in underrepresented populations in the presence of heavy class imbalance. METHOD: To address demographic disparity in the presence of class imbalance, we develop a novel framework, Trans-Balance, by leveraging recent advances in imbalance learning, transfer learning, and federated learning. We consider a practical setting where data from multiple sites are stored locally under privacy constraints. RESULTS: We show that the proposed Trans-Balance framework improves upon existing approaches by explicitly accounting for heterogeneity across demographic subgroups and cohorts. We demonstrate the feasibility and validity of our methods through numerical experiments and a real application to a multi-cohort study with data from participants of four large, NIH-funded cohorts for stroke risk prediction. CONCLUSION: Our findings indicate that the Trans-Balance approach significantly improves predictive performance, especially in scenarios marked by severe class imbalance and demographic disparity. Given its versatility and effectiveness, Trans-Balance offers a valuable contribution to enhancing risk prediction in biomedical research and related fields.

Assuntos

Algoritmos , Pesquisa Biomédica , Humanos , Estudos de Coortes , Aprendizado de Máquina , Demografia

2.

Efficient Evaluation of Prediction Rules in Semi-Supervised Settings under Stratified Sampling.

Gronsbell, Jessica; Liu, Molei; Tian, Lu; Cai, Tianxi.

J R Stat Soc Series B Stat Methodol ; 84(4): 1353-1391, 2022 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-36275859

RESUMO

In many contemporary applications, large amounts of unlabeled data are readily available while labeled examples are limited. There has been substantial interest in semi-supervised learning (SSL) which aims to leverage unlabeled data to improve estimation or prediction. However, current SSL literature focuses primarily on settings where labeled data is selected uniformly at random from the population of interest. Stratified sampling, while posing additional analytical challenges, is highly applicable to many real world problems. Moreover, no SSL methods currently exist for estimating the prediction performance of a fitted model when the labeled data is not selected uniformly at random. In this paper, we propose a two-step SSL procedure for evaluating a prediction rule derived from a working binary regression model based on the Brier score and overall misclassification rate under stratified sampling. In step I, we impute the missing labels via weighted regression with nonlinear basis functions to account for stratified sampling and to improve efficiency. In step II, we augment the initial imputations to ensure the consistency of the resulting estimators regardless of the specification of the prediction model or the imputation model. The final estimator is then obtained with the augmented imputations. We provide asymptotic theory and numerical studies illustrating that our proposals outperform their supervised counterparts in terms of efficiency gain. Our methods are motivated by electronic health record (EHR) research and validated with a real data analysis of an EHR-based study of diabetic neuropathy.

3.

Weakly Semi-supervised phenotyping using Electronic Health records.

Nogues, Isabelle-Emmanuella; Wen, Jun; Lin, Yucong; Liu, Molei; Tedeschi, Sara K; Geva, Alon; Cai, Tianxi; Hong, Chuan.

J Biomed Inform ; 134: 104175, 2022 10.

Artigo em Inglês | MEDLINE | ID: mdl-36064111

RESUMO

OBJECTIVE: Electronic Health Record (EHR) based phenotyping is a crucial yet challenging problem in the biomedical field. Though clinicians typically determine patient-level diagnoses via manual chart review, the sheer volume and heterogeneity of EHR data renders such tasks challenging, time-consuming, and prohibitively expensive, thus leading to a scarcity of clinical annotations in EHRs. Weakly supervised learning algorithms have been successfully applied to various EHR phenotyping problems, due to their ability to leverage information from large quantities of unlabeled samples to better inform predictions based on a far smaller number of patients. However, most weakly supervised methods are subject to the challenge to choose the right cutoff value to generate an optimal classifier. Furthermore, since they only utilize the most informative features (i.e., main ICD and NLP counts) they may fail for episodic phenotypes that cannot be consistently detected via ICD and NLP data. In this paper, we propose a label-efficient, weakly semi-supervised deep learning algorithm for EHR phenotyping (WSS-DL), which overcomes the limitations above. MATERIALS AND METHODS: WSS-DL classifies patient-level disease status through a series of learning stages: 1) generating silver standard labels, 2) deriving enhanced-silver-standard labels by fitting a weakly supervised deep learning model to data with silver standard labels as outcomes and high dimensional EHR features as input, and 3) obtaining the final prediction score and classifier by fitting a supervised learning model to data with a minimal number of gold standard labels as the outcome, and the enhanced-silver-standard labels and a minimal set of most informative EHR features as input. To assess the generalizability of WSS-DL across different phenotypes and medical institutions, we apply WSS-DL to classify a total of 17 diseases, including both acute and chronic conditions, using EHR data from three healthcare systems. Additionally, we determine the minimum quantity of training labels required by WSS-DL to outperform existing supervised and semi-supervised phenotyping methods. RESULTS: The proposed method, in combining the strengths of deep learning and weakly semi-supervised learning, successfully leverages the crucial phenotyping information contained in EHR features from unlabeled samples. Indeed, the deep learning model's ability to handle high-dimensional EHR features allows it to generate strong phenotype status predictions from silver standard labels. These predictions, in turn, provide highly effective features in the final logistic regression stage, leading to high phenotyping accuracy in notably small subsets of labeled data (e.g. n = 40 labeled samples). CONCLUSION: Our method's high performance in EHR datasets with very small numbers of labels indicates its potential value in aiding doctors to diagnose rare diseases as well as conditions susceptible to misdiagnosis.

Assuntos

Registros Eletrônicos de Saúde , Aprendizado de Máquina Supervisionado , Algoritmos , Modelos Logísticos , Fenótipo

4.

International Changes in COVID-19 Clinical Trajectories Across 315 Hospitals and 6 Countries: Retrospective Cohort Study.

Weber, Griffin M; Zhang, Harrison G; L'Yi, Sehi; Bonzel, Clara-Lea; Hong, Chuan; Avillach, Paul; Gutiérrez-Sacristán, Alba; Palmer, Nathan P; Tan, Amelia Li Min; Wang, Xuan; Yuan, William; Gehlenborg, Nils; Alloni, Anna; Amendola, Danilo F; Bellasi, Antonio; Bellazzi, Riccardo; Beraghi, Michele; Bucalo, Mauro; Chiovato, Luca; Cho, Kelly; Dagliati, Arianna; Estiri, Hossein; Follett, Robert W; García Barrio, Noelia; Hanauer, David A; Henderson, Darren W; Ho, Yuk-Lam; Holmes, John H; Hutch, Meghan R; Kavuluru, Ramakanth; Kirchoff, Katie; Klann, Jeffrey G; Krishnamurthy, Ashok K; Le, Trang T; Liu, Molei; Loh, Ne Hooi Will; Lozano-Zahonero, Sara; Luo, Yuan; Maidlow, Sarah; Makoudjou, Adeline; Malovini, Alberto; Martins, Marcelo Roberto; Moal, Bertrand; Morris, Michele; Mowery, Danielle L; Murphy, Shawn N; Neuraz, Antoine; Ngiam, Kee Yuan; Okoshi, Marina P; Omenn, Gilbert S.

J Med Internet Res ; 23(10): e31400, 2021 10 11.

Artigo em Inglês | MEDLINE | ID: mdl-34533459

RESUMO

BACKGROUND: Many countries have experienced 2 predominant waves of COVID-19-related hospitalizations. Comparing the clinical trajectories of patients hospitalized in separate waves of the pandemic enables further understanding of the evolving epidemiology, pathophysiology, and health care dynamics of the COVID-19 pandemic. OBJECTIVE: In this retrospective cohort study, we analyzed electronic health record (EHR) data from patients with SARS-CoV-2 infections hospitalized in participating health care systems representing 315 hospitals across 6 countries. We compared hospitalization rates, severe COVID-19 risk, and mean laboratory values between patients hospitalized during the first and second waves of the pandemic. METHODS: Using a federated approach, each participating health care system extracted patient-level clinical data on their first and second wave cohorts and submitted aggregated data to the central site. Data quality control steps were adopted at the central site to correct for implausible values and harmonize units. Statistical analyses were performed by computing individual health care system effect sizes and synthesizing these using random effect meta-analyses to account for heterogeneity. We focused the laboratory analysis on C-reactive protein (CRP), ferritin, fibrinogen, procalcitonin, D-dimer, and creatinine based on their reported associations with severe COVID-19. RESULTS: Data were available for 79,613 patients, of which 32,467 were hospitalized in the first wave and 47,146 in the second wave. The prevalence of male patients and patients aged 50 to 69 years decreased significantly between the first and second waves. Patients hospitalized in the second wave had a 9.9% reduction in the risk of severe COVID-19 compared to patients hospitalized in the first wave (95% CI 8.5%-11.3%). Demographic subgroup analyses indicated that patients aged 26 to 49 years and 50 to 69 years; male and female patients; and black patients had significantly lower risk for severe disease in the second wave than in the first wave. At admission, the mean values of CRP were significantly lower in the second wave than in the first wave. On the seventh hospital day, the mean values of CRP, ferritin, fibrinogen, and procalcitonin were significantly lower in the second wave than in the first wave. In general, countries exhibited variable changes in laboratory testing rates from the first to the second wave. At admission, there was a significantly higher testing rate for D-dimer in France, Germany, and Spain. CONCLUSIONS: Patients hospitalized in the second wave were at significantly lower risk for severe COVID-19. This corresponded to mean laboratory values in the second wave that were more likely to be in typical physiological ranges on the seventh hospital day compared to the first wave. Our federated approach demonstrated the feasibility and power of harmonizing heterogeneous EHR data from multiple international health care systems to rapidly conduct large-scale studies to characterize how COVID-19 clinical trajectories evolve.

Assuntos

COVID-19 , Pandemias , Adulto , Idoso , Feminino , Hospitalização , Hospitais , Humanos , Masculino , Pessoa de Meia-Idade , Estudos Retrospectivos , SARS-CoV-2

5.

Authorship Correction: International Changes in COVID-19 Clinical Trajectories Across 315 Hospitals and 6 Countries: Retrospective Cohort Study.

Weber, Griffin M; Zhang, Harrison G; L'Yi, Sehi; Bonzel, Clara-Lea; Hong, Chuan; Avillach, Paul; Gutiérrez-Sacristán, Alba; Palmer, Nathan P; Tan, Amelia Li Min; Wang, Xuan; Yuan, William; Gehlenborg, Nils; Alloni, Anna; Amendola, Danilo F; Bellasi, Antonio; Bellazzi, Riccardo; Beraghi, Michele; Bucalo, Mauro; Chiovato, Luca; Cho, Kelly; Dagliati, Arianna; Estiri, Hossein; Follett, Robert W; García Barrio, Noelia; Hanauer, David A; Henderson, Darren W; Ho, Yuk-Lam; Holmes, John H; Hutch, Meghan R; Kavuluru, Ramakanth; Kirchoff, Katie; Klann, Jeffrey G; Krishnamurthy, Ashok K; Le, Trang T; Liu, Molei; Loh, Ne Hooi Will; Lozano-Zahonero, Sara; Luo, Yuan; Maidlow, Sarah; Makoudjou, Adeline; Malovini, Alberto; Martins, Marcelo Roberto; Moal, Bertrand; Morris, Michele; Mowery, Danielle L; Murphy, Shawn N; Neuraz, Antoine; Ngiam, Kee Yuan; Okoshi, Marina P; Omenn, Gilbert S.

J Med Internet Res ; 23(11): e34625, 2021 Nov 30.

Artigo em Inglês | MEDLINE | ID: mdl-34889759

RESUMO

[This corrects the article DOI: 10.2196/31400.].

6.

Modeling individualized coefficient alpha to measure quality of test score data.

Liu, Molei; Hu, Ming; Zhou, Xiao-Hua.

Stat Med ; 37(22): 3230-3243, 2018 09 30.

Artigo em Inglês | MEDLINE | ID: mdl-29797426

RESUMO

Individualized coefficient alpha is defined. It is item and subject specific and is used to measure the quality of test score data with heterogenicity among the subjects and items. A regression model is developed based on 3 sets of generalized estimating equations. The first set of generalized estimating equation models the expectation of the responses, the second set models the response's variance, and the third set is proposed to estimate the individualized coefficient alpha, defined and used to measure individualized internal consistency of the responses. We also use different techniques to extend our method to handle missing data. Asymptotic property of the estimators is discussed, based on which inference on the coefficient alpha is derived. Performance of our method is evaluated through simulation study and real data analysis. The real data application is from a health literacy study in Hunan province of China.

Assuntos

Avaliação Educacional , Letramento em Saúde , Modelos Estatísticos , China , Simulação por Computador , Confiabilidade dos Dados , Humanos

7.

Heterogeneous associations between interleukin-6 receptor variants and phenotypes across ancestries and implications for therapy.

Wang, Xuan; Liu, Molei; Nogues, Isabelle-Emmanuella; Chen, Tony; Xiong, Xin; Bonzel, Clara-Lea; Zhang, Harrison; Hong, Chuan; Xia, Yin; Dahal, Kumar; Costa, Lauren; Cui, Jing; Gaziano, J Michael; Kim, Seoyoung C; Ho, Yuk-Lam; Cho, Kelly; Cai, Tianxi; Liao, Katherine P.

Sci Rep ; 14(1): 8021, 2024 04 05.

Artigo em Inglês | MEDLINE | ID: mdl-38580710

RESUMO

The Phenome-Wide Association Study (PheWAS) is increasingly used to broadly screen for potential treatment effects, e.g., IL6R variant as a proxy for IL6R antagonists. This approach offers an opportunity to address the limited power in clinical trials to study differential treatment effects across patient subgroups. However, limited methods exist to efficiently test for differences across subgroups in the thousands of multiple comparisons generated as part of a PheWAS. In this study, we developed an approach that maximizes the power to test for heterogeneous genotype-phenotype associations and applied this approach to an IL6R PheWAS among individuals of African (AFR) and European (EUR) ancestries. We identified 29 traits with differences in IL6R variant-phenotype associations, including a lower risk of type 2 diabetes in AFR (OR 0.96) vs EUR (OR 1.0, p-value for heterogeneity = 8.5 × 10-3), and higher white blood cell count (p-value for heterogeneity = 8.5 × 10-131). These data suggest a more salutary effect of IL6R blockade for T2D among individuals of AFR vs EUR ancestry and provide data to inform ongoing clinical trials targeting IL6 for an expanding number of conditions. Moreover, the method to test for heterogeneity of associations can be applied broadly to other large-scale genotype-phenotype screens in diverse populations.

Assuntos

Diabetes Mellitus Tipo 2 , Humanos , Diabetes Mellitus Tipo 2/tratamento farmacológico , Diabetes Mellitus Tipo 2/genética , Estudos de Associação Genética , Fenótipo , Polimorfismo de Nucleotídeo Único , Receptores de Interleucina-6/genética

8.

Accelerating Genome- and Phenome-Wide Association Studies using GPUs - A case study using data from the Million Veteran Program.

Rodriguez, Alex; Kim, Youngdae; Nandi, Tarak Nath; Keat, Karl; Kumar, Rachit; Bhukar, Rohan; Conery, Mitchell; Liu, Molei; Hessington, John; Maheshwari, Ketan; Schmidt, Drew; Begoli, Edmon; Tourassi, Georgia; Muralidhar, Sumitra; Natarajan, Pradeep; Voight, Benjamin F; Cho, Kelly; Gaziano, J Michael; Damrauer, Scott M; Liao, Katherine P; Zhou, Wei; Huffman, Jennifer E; Verma, Anurag; Madduri, Ravi K.

bioRxiv ; 2024 May 22.

Artigo em Inglês | MEDLINE | ID: mdl-38826407

RESUMO

The expansion of biobanks has significantly propelled genomic discoveries yet the sheer scale of data within these repositories poses formidable computational hurdles, particularly in handling extensive matrix operations required by prevailing statistical frameworks. In this work, we introduce computational optimizations to the SAIGE (Scalable and Accurate Implementation of Generalized Mixed Model) algorithm, notably employing a GPU-based distributed computing approach to tackle these challenges. We applied these optimizations to conduct a large-scale genome-wide association study (GWAS) across 2,068 phenotypes derived from electronic health records of 635,969 diverse participants from the Veterans Affairs (VA) Million Veteran Program (MVP). Our strategies enabled scaling up the analysis to over 6,000 nodes on the Department of Energy (DOE) Oak Ridge Leadership Computing Facility (OLCF) Summit High-Performance Computer (HPC), resulting in a 20-fold acceleration compared to the baseline model. We also provide a Docker container with our optimizations that was successfully used on multiple cloud infrastructures on UK Biobank and All of Us datasets where we showed significant time and cost benefits over the baseline SAIGE model.

9.

Assessing the Most Vulnerable Subgroup to Type II Diabetes Associated with Statin Usage: Evidence from Electronic Health Record Data.

Guo, Xinzhou; Wei, Waverly; Liu, Molei; Cai, Tianxi; Wu, Chong; Wang, Jingshen.

J Am Stat Assoc ; 118(543): 1488-1499, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-38223220

RESUMO

There have been increased concerns that the use of statins, one of the most commonly prescribed drugs for treating coronary artery disease, is potentially associated with the increased risk of new-onset Type II diabetes (T2D). Nevertheless, to date, there is no robust evidence supporting as to whether and what kind of populations are indeed vulnerable for developing T2D after taking statins. In this case study, leveraging the biobank and electronic health record data in the Partner Health System, we introduce a new data analysis pipeline and a novel statistical methodology that address existing limitations by (i) designing a rigorous causal framework that systematically examines the causal effects of statin usage on T2D risk in observational data, (ii) uncovering which patient subgroup is most vulnerable for developing T2D after taking statins, and (iii) assessing the replicability and statistical significance of the most vulnerable subgroup via a bootstrap calibration procedure. Our proposed approach delivers asymptotically sharp confidence intervals and debiased estimate for the treatment effect of the most vulnerable subgroup in the presence of high-dimensional covariates. With our proposed approach, we find that females with high T2D genetic risk are at the highest risk of developing T2D due to statin usage.

10.

Knowledge-Driven Online Multimodal Automated Phenotyping System.

Xiong, Xin; Sweet, Sara Morini; Liu, Molei; Hong, Chuan; Bonzel, Clara-Lea; Panickan, Vidul Ayakulangara; Zhou, Doudou; Wang, Linshanshan; Costa, Lauren; Ho, Yuk-Lam; Geva, Alon; Mandl, Kenneth D; Cheng, Suchun; Xia, Zongqi; Cho, Kelly; Gaziano, J Michael; Liao, Katherine P; Cai, Tianxi; Cai, Tianrun.

medRxiv ; 2023 Oct 02.

Artigo em Inglês | MEDLINE | ID: mdl-37873131

RESUMO

Though electronic health record (EHR) systems are a rich repository of clinical information with large potential, the use of EHR-based phenotyping algorithms is often hindered by inaccurate diagnostic records, the presence of many irrelevant features, and the requirement for a human-labeled training set. In this paper, we describe a knowledge-driven online multimodal automated phenotyping (KOMAP) system that i) generates a list of informative features by an online narrative and codified feature search engine (ONCE) and ii) enables the training of a multimodal phenotyping algorithm based on summary data. Powered by composite knowledge from multiple EHR sources, online article corpora, and a large language model, features selected by ONCE show high concordance with the state-of-the-art AI models (GPT4 and ChatGPT) and encourage large-scale phenotyping by providing a smaller but highly relevant feature set. Validation of the KOMAP system across four healthcare centers suggests that it can generate efficient phenotyping algorithms with robust performance. Compared to other methods requiring patient-level inputs and gold-standard labels, the fully online KOMAP provides a significant opportunity to enable multi-center collaboration.

11.

Diversity and Scale: Genetic Architecture of 2,068 Traits in the VA Million Veteran Program.

Verma, Anurag; Huffman, Jennifer E; Rodriguez, Alex; Conery, Mitchell; Liu, Molei; Ho, Yuk-Lam; Kim, Youngdae; Heise, David A; Guare, Lindsay; Panickan, Vidul Ayakulangara; Garcon, Helene; Linares, Franciel; Costa, Lauren; Goethert, Ian; Tipton, Ryan; Honerlaw, Jacqueline; Davies, Laura; Whitbourne, Stacey; Cohen, Jeremy; Posner, Daniel C; Sangar, Rahul; Murray, Michael; Wang, Xuan; Dochtermann, Daniel R; Devineni, Poornima; Shi, Yunling; Nandi, Tarak Nath; Assimes, Themistocles L; Brunette, Charles A; Carroll, Robert J; Clifford, Royce; Duvall, Scott; Gelernter, Joel; Hung, Adriana; Iyengar, Sudha K; Joseph, Jacob; Kember, Rachel; Kranzler, Henry; Levey, Daniel; Luoh, Shiuh-Wen; Merritt, Victoria C; Overstreet, Cassie; Deak, Joseph D; Grant, Struan F A; Polimanti, Renato; Roussos, Panos; Sun, Yan V; Venkatesh, Sanan; Voloudakis, Georgios; Justice, Amy.

medRxiv ; 2023 Jun 29.

Artigo em Inglês | MEDLINE | ID: mdl-37425708

RESUMO

Genome-wide association studies (GWAS) have underrepresented individuals from non-European populations, impeding progress in characterizing the genetic architecture and consequences of health and disease traits. To address this, we present a population-stratified phenome-wide GWAS followed by a multi-population meta-analysis for 2,068 traits derived from electronic health records of 635,969 participants in the Million Veteran Program (MVP), a longitudinal cohort study of diverse U.S. Veterans genetically similar to the respective African (121,177), Admixed American (59,048), East Asian (6,702), and European (449,042) superpopulations defined by the 1000 Genomes Project. We identified 38,270 independent variants associating with one or more traits at experiment-wide P<4.6×10-11 significance; fine-mapping 6,318 signals identified from 613 traits to single-variant resolution. Among these, a third (2,069) of the associations were found only among participants genetically similar to non-European reference populations, demonstrating the importance of expanding diversity in genetic studies. Our work provides a comprehensive atlas of phenome-wide genetic associations for future studies dissecting the architecture of complex traits in diverse populations.

12.

Prior Adaptive Semi-supervised Learning with Application to EHR Phenotyping.

Zhang, Yichi; Liu, Molei; Neykov, Matey; Cai, Tianxi.

J Mach Learn Res ; 232022.

Artigo em Inglês | MEDLINE | ID: mdl-37974910

RESUMO

Electronic Health Record (EHR) data, a rich source for biomedical research, have been successfully used to gain novel insight into a wide range of diseases. Despite its potential, EHR is currently underutilized for discovery research due to its major limitation in the lack of precise phenotype information. To overcome such difficulties, recent efforts have been devoted to developing supervised algorithms to accurately predict phenotypes based on relatively small training datasets with gold standard labels extracted via chart review. However, supervised methods typically require a sizable training set to yield generalizable algorithms, especially when the number of candidate features, p, is large. In this paper, we propose a semi-supervised (SS) EHR phenotyping method that borrows information from both a small, labeled dataset (where both the label Y and the feature set X are observed) and a much larger, weakly-labeled dataset in which the feature set X is accompanied only by a surrogate label S that is available to all patients. Under a working prior assumption that S is related to X only through Y and allowing it to hold approximately, we propose a prior adaptive semi-supervised (PASS) estimator that incorporates the prior knowledge by shrinking the estimator towards a direction derived under the prior. We derive asymptotic theory for the proposed estimator and justify its efficiency and robustness to prior information of poor quality. We also demonstrate its superiority over existing estimators under various scenarios via simulation studies and on three real-world EHR phenotyping studies at a large tertiary hospital.

13.

Individual Data Protected Integrative Regression Analysis of High-Dimensional Heterogeneous Data.

Cai, Tianxi; Liu, Molei; Xia, Yin.

J Am Stat Assoc ; 117(540): 2105-2119, 2022.

Artigo em Inglês | MEDLINE | ID: mdl-37975021

RESUMO

Evidence-based decision making often relies on meta-analyzing multiple studies, which enables more precise estimation and investigation of generalizability. Integrative analysis of multiple heterogeneous studies is, however, highly challenging in the ultra high-dimensional setting. The challenge is even more pronounced when the individual-level data cannot be shared across studies, known as DataSHIELD contraint. Under sparse regression models that are assumed to be similar yet not identical across studies, we propose in this paper a novel integrative estimation procedure for data-Shielding High-dimensional Integrative Regression (SHIR). SHIR protects individual data through summary-statistics-based integrating procedure, accommodates between-study heterogeneity in both the covariate distribution and model parameters, and attains consistent variable selection. Theoretically, SHIR is statistically more efficient than the existing distributed approaches that integrate debiased LASSO estimators from the local sites. Furthermore, the estimation error incurred by aggregating derived data is negligible compared to the statistical minimax rate and SHIR is shown to be asymptotically equivalent in estimation to the ideal estimator obtained by sharing all data. The finite-sample performance of our method is studied and compared with existing approaches via extensive simulation settings. We further illustrate the utility of SHIR to derive phenotyping algorithms for coronary artery disease using electronic health records data from multiple chronic disease cohorts.

14.

Fast and powerful conditional randomization testing via distillation.

Liu, Molei; Katsevich, Eugene; Janson, Lucas; Ramdas, Aaditya.

Biometrika ; 109(2): 277-293, 2022 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-37416628

RESUMO

We consider the problem of conditional independence testing: given a response Y and covariates (X,Z), we test the null hypothesis that Yâ««Xâ£Z. The conditional randomization test was recently proposed as a way to use distributional information about Xâ£Z to exactly and nonasymptotically control Type-I error using any test statistic in any dimensionality without assuming anything about Yâ£(X,Z). This flexibility, in principle, allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the conditional randomization test is prohibitively computationally expensive, especially with multiple testing, due to the requirement to recompute the test statistic many times on resampled data. We propose the distilled conditional randomization test, a novel approach to using state-of-the-art machine learning algorithms in the conditional randomization test while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the conditional randomization test's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks, like screening and recycling computations, to further speed up the conditional randomization test without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to the most powerful existing conditional randomization test implementations, but requires orders of magnitude less computation, making it a practical tool even for large datasets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.

15.

International comparisons of laboratory values from the 4CE collaborative to predict COVID-19 mortality.

Weber, Griffin M; Hong, Chuan; Xia, Zongqi; Palmer, Nathan P; Avillach, Paul; L'Yi, Sehi; Keller, Mark S; Murphy, Shawn N; Gutiérrez-Sacristán, Alba; Bonzel, Clara-Lea; Serret-Larmande, Arnaud; Neuraz, Antoine; Omenn, Gilbert S; Visweswaran, Shyam; Klann, Jeffrey G; South, Andrew M; Loh, Ne Hooi Will; Cannataro, Mario; Beaulieu-Jones, Brett K; Bellazzi, Riccardo; Agapito, Giuseppe; Alessiani, Mario; Aronow, Bruce J; Bell, Douglas S; Benoit, Vincent; Bourgeois, Florence T; Chiovato, Luca; Cho, Kelly; Dagliati, Arianna; DuVall, Scott L; Barrio, Noelia García; Hanauer, David A; Ho, Yuk-Lam; Holmes, John H; Issitt, Richard W; Liu, Molei; Luo, Yuan; Lynch, Kristine E; Maidlow, Sarah E; Malovini, Alberto; Mandl, Kenneth D; Mao, Chengsheng; Matheny, Michael E; Moore, Jason H; Morris, Jeffrey S; Morris, Michele; Mowery, Danielle L; Ngiam, Kee Yuan; Patel, Lav P; Pedrera-Jimenez, Miguel.

NPJ Digit Med ; 5(1): 74, 2022 Jun 13.

Artigo em Inglês | MEDLINE | ID: mdl-35697747

RESUMO

Given the growing number of prediction algorithms developed to predict COVID-19 mortality, we evaluated the transportability of a mortality prediction algorithm using a multi-national network of healthcare systems. We predicted COVID-19 mortality using baseline commonly measured laboratory values and standard demographic and clinical covariates across healthcare systems, countries, and continents. Specifically, we trained a Cox regression model with nine measured laboratory test values, standard demographics at admission, and comorbidity burden pre-admission. These models were compared at site, country, and continent level. Of the 39,969 hospitalized patients with COVID-19 (68.6% male), 5717 (14.3%) died. In the Cox model, age, albumin, AST, creatine, CRP, and white blood cell count are most predictive of mortality. The baseline covariates are more predictive of mortality during the early days of COVID-19 hospitalization. Models trained at healthcare systems with larger cohort size largely retain good transportability performance when porting to different sites. The combination of routine laboratory test values at admission along with basic demographic features can predict mortality in patients hospitalized with COVID-19. Importantly, this potentially deployable model differs from prior work by demonstrating not only consistent performance but also reliable transportability across healthcare systems in the US and Europe, highlighting the generalizability of this model and the overall approach.

16.

Changes in laboratory value improvement and mortality rates over the course of the pandemic: an international retrospective cohort study of hospitalised patients infected with SARS-CoV-2.

Hong, Chuan; Zhang, Harrison G; L'Yi, Sehi; Weber, Griffin; Avillach, Paul; Tan, Bryce W Q; Gutiérrez-Sacristán, Alba; Bonzel, Clara-Lea; Palmer, Nathan P; Malovini, Alberto; Tibollo, Valentina; Luo, Yuan; Hutch, Meghan R; Liu, Molei; Bourgeois, Florence; Bellazzi, Riccardo; Chiovato, Luca; Sanz Vidorreta, Fernando J; Le, Trang T; Wang, Xuan; Yuan, William; Neuraz, Antoine; Benoit, Vincent; Moal, Bertrand; Morris, Michele; Hanauer, David A; Maidlow, Sarah; Wagholikar, Kavishwar; Murphy, Shawn; Estiri, Hossein; Makoudjou, Adeline; Tippmann, Patric; Klann, Jeffery; Follett, Robert W; Gehlenborg, Nils; Omenn, Gilbert S; Xia, Zongqi; Dagliati, Arianna; Visweswaran, Shyam; Patel, Lav P; Mowery, Danielle L; Schriver, Emily R; Samayamuthu, Malarkodi Jebathilagam; Kavuluru, Ramakanth; Lozano-Zahonero, Sara; Zöller, Daniela; Tan, Amelia L M; Tan, Byorn W L; Ngiam, Kee Yuan; Holmes, John H.

BMJ Open ; 12(6): e057725, 2022 06 23.

Artigo em Inglês | MEDLINE | ID: mdl-35738646

RESUMO

OBJECTIVE: To assess changes in international mortality rates and laboratory recovery rates during hospitalisation for patients hospitalised with SARS-CoV-2 between the first wave (1 March to 30 June 2020) and the second wave (1 July 2020 to 31 January 2021) of the COVID-19 pandemic. DESIGN, SETTING AND PARTICIPANTS: This is a retrospective cohort study of 83 178 hospitalised patients admitted between 7 days before or 14 days after PCR-confirmed SARS-CoV-2 infection within the Consortium for Clinical Characterization of COVID-19 by Electronic Health Record, an international multihealthcare system collaborative of 288 hospitals in the USA and Europe. The laboratory recovery rates and mortality rates over time were compared between the two waves of the pandemic. PRIMARY AND SECONDARY OUTCOME MEASURES: The primary outcome was all-cause mortality rate within 28 days after hospitalisation stratified by predicted low, medium and high mortality risk at baseline. The secondary outcome was the average rate of change in laboratory values during the first week of hospitalisation. RESULTS: Baseline Charlson Comorbidity Index and laboratory values at admission were not significantly different between the first and second waves. The improvement in laboratory values over time was faster in the second wave compared with the first. The average C reactive protein rate of change was -4.72 mg/dL vs -4.14 mg/dL per day (p=0.05). The mortality rates within each risk category significantly decreased over time, with the most substantial decrease in the high-risk group (42.3% in March-April 2020 vs 30.8% in November 2020 to January 2021, p<0.001) and a moderate decrease in the intermediate-risk group (21.5% in March-April 2020 vs 14.3% in November 2020 to January 2021, p<0.001). CONCLUSIONS: Admission profiles of patients hospitalised with SARS-CoV-2 infection did not differ greatly between the first and second waves of the pandemic, but there were notable differences in laboratory improvement rates during hospitalisation. Mortality risks among patients with similar risk profiles decreased over the course of the pandemic. The improvement in laboratory values and mortality risk was consistent across multiple countries.

Assuntos

COVID-19 , Pandemias , Hospitalização , Humanos , Estudos Retrospectivos , SARS-CoV-2

17.

Integrative High Dimensional Multiple Testing with Heterogeneity under Data Sharing Constraints.

Liu, Molei; Xia, Yin; Cho, Kelly; Cai, Tianxi.

J Mach Learn Res ; 222021 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-37426040

RESUMO

Identifying informative predictors in a high dimensional regression model is a critical step for association analysis and predictive modeling. Signal detection in the high dimensional setting often fails due to the limited sample size. One approach to improving power is through meta-analyzing multiple studies which address the same scientific question. However, integrative analysis of high dimensional data from multiple studies is challenging in the presence of between-study heterogeneity. The challenge is even more pronounced with additional data sharing constraints under which only summary data can be shared across different sites. In this paper, we propose a novel data shielding integrative large-scale testing (DSILT) approach to signal detection allowing between-study heterogeneity and not requiring the sharing of individual level data. Assuming the underlying high dimensional regression models of the data differ across studies yet share similar support, the proposed method incorporates proper integrative estimation and debiasing procedures to construct test statistics for the overall effects of specific covariates. We also develop a multiple testing procedure to identify significant effects while controlling the false discovery rate (FDR) and false discovery proportion (FDP). Theoretical comparisons of the new testing procedure with the ideal individual-level meta-analysis (ILMA) approach and other distributed inference methods are investigated. Simulation studies demonstrate that the proposed testing procedure performs well in both controlling false discovery and attaining power. The new method is applied to a real example detecting interaction effects of the genetic variants for statins and obesity on the risk for type II diabetes.

18.

Double/debiased machine learning for logistic partially linear model.

Liu, Molei; Zhang, Y I; Zhou, Doudou.

Econom J ; 24(3): 559-588, 2021 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-38223304

RESUMO

We propose double/debiased machine learning approaches to infer a parametric component of a logistic partially linear model. Our framework is based on a Neyman orthogonal score equation consisting of two nuisance models for the nonparametric component of the logistic model and conditional mean of the exposure with the control group. To estimate the nuisance models, we separately consider the use of high dimensional (HD) sparse regression and (nonparametric) machine learning (ML) methods. In the HD case, we derive certain moment equations to calibrate the first order bias of the nuisance models, which preserves the model double robustness property. In the ML case, we handle the nonlinearity of the logit link through a novel and easy-to-implement 'full model refitting' procedure. We evaluate our methods through simulation and apply them in assessing the effect of the emergency contraceptive pill on early gestation and new births based on a 2008 policy reform in Chile.

19.

National Trends in Disease Activity for COVID-19 Among Children in the US.

Hutch, Meghan R; Liu, Molei; Avillach, Paul; Luo, Yuan; Bourgeois, Florence T.

Front Pediatr ; 9: 700656, 2021.

Artigo em Inglês | MEDLINE | ID: mdl-34307261

RESUMO

Ongoing monitoring of COVID-19 disease burden in children will help inform mitigation strategies and guide pediatric vaccination programs. Leveraging a national, comprehensive dataset, we sought to quantify and compare disease burden and trends in hospitalizations for children and adults in the US.

20.

A high-throughput phenotyping algorithm is portable from adult to pediatric populations.

Geva, Alon; Liu, Molei; Panickan, Vidul A; Avillach, Paul; Cai, Tianxi; Mandl, Kenneth D.

J Am Med Inform Assoc ; 28(6): 1265-1269, 2021 06 12.

Artigo em Inglês | MEDLINE | ID: mdl-33594412

RESUMO

OBJECTIVE: Multimodal automated phenotyping (MAP) is a scalable, high-throughput phenotyping method, developed using electronic health record (EHR) data from an adult population. We tested transportability of MAP to a pediatric population. MATERIALS AND METHODS: Without additional feature engineering or supervised training, we applied MAP to a pediatric population enrolled in a biobank and evaluated performance against physician-reviewed medical records. We also compared performance of MAP at the pediatric institution and the original adult institution where MAP was developed, including for 6 phenotypes validated at both institutions against physician-reviewed medical records. RESULTS: MAP performed equally well in the pediatric setting (average AUC 0.98) as it did at the general adult hospital system (average AUC 0.96). MAP's performance in the pediatric sample was similar across the 6 specific phenotypes also validated against gold-standard labels in the adult biobank. CONCLUSIONS: MAP is highly transportable across diverse populations and has potential for wide-scale use.

Assuntos

Algoritmos , Registros Eletrônicos de Saúde , Humanos , Fenótipo

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA