Search | VHL Regional Portal

1.

Functional Principal Component Analysis as an Alternative to Mixed-Effect Models for Describing Sparse Repeated Measures in Presence of Missing Data.

Ségalas, Corentin; Helmer, Catherine; Genuer, Robin; Proust-Lima, Cécile.

Stat Med ; 2024 Sep 09.

Article in English | MEDLINE | ID: mdl-39248704

ABSTRACT

Analyzing longitudinal data in health studies is challenging due to sparse and error-prone measurements, strong within-individual correlation, missing data and various trajectory shapes. While mixed-effect models (MM) effectively address these challenges, they remain parametric models and may incur computational costs. In contrast, functional principal component analysis (FPCA) is a non-parametric approach developed for regular and dense functional data that flexibly describes temporal trajectories at a potentially lower computational cost. This article presents an empirical simulation study evaluating the behavior of FPCA with sparse and error-prone repeated measures and its robustness under different missing data schemes in comparison with MM. The results show that FPCA is well-suited in the presence of missing at random data caused by dropout, except in scenarios involving most frequent and systematic dropout. Like MM, FPCA fails under missing not at random mechanism. The FPCA was applied to describe the trajectories of four cognitive functions before clinical dementia and contrast them with those of matched controls in a case-control study nested in a population-based aging cohort. The average cognitive declines of future dementia cases showed a sudden divergence from those of their matched controls with a sharp acceleration 5 to 2.5 years prior to diagnosis.

2.

Gaps in the usage and reporting of multiple imputation for incomplete data: findings from a scoping review of observational studies addressing causal questions.

Mainzer, Rheanna M; Moreno-Betancur, Margarita; Nguyen, Cattram D; Simpson, Julie A; Carlin, John B; Lee, Katherine J.

BMC Med Res Methodol ; 24(1): 193, 2024 Sep 04.

Article in English | MEDLINE | ID: mdl-39232661

ABSTRACT

BACKGROUND: Missing data are common in observational studies and often occur in several of the variables required when estimating a causal effect, i.e. the exposure, outcome and/or variables used to control for confounding. Analyses involving multiple incomplete variables are not as straightforward as analyses with a single incomplete variable. For example, in the context of multivariable missingness, the standard missing data assumptions ("missing completely at random", "missing at random" [MAR], "missing not at random") are difficult to interpret and assess. It is not clear how the complexities that arise due to multivariable missingness are being addressed in practice. The aim of this study was to review how missing data are managed and reported in observational studies that use multiple imputation (MI) for causal effect estimation, with a particular focus on missing data summaries, missing data assumptions, primary and sensitivity analyses, and MI implementation. METHODS: We searched five top general epidemiology journals for observational studies that aimed to answer a causal research question and used MI, published between January 2019 and December 2021. Article screening and data extraction were performed systematically. RESULTS: Of the 130 studies included in this review, 108 (83%) derived an analysis sample by excluding individuals with missing data in specific variables (e.g., outcome) and 114 (88%) had multivariable missingness within the analysis sample. Forty-four (34%) studies provided a statement about missing data assumptions, 35 of which stated the MAR assumption, but only 11/44 (25%) studies provided a justification for these assumptions. The number of imputations, MI method and MI software were generally well-reported (71%, 75% and 88% of studies, respectively), while aspects of the imputation model specification were not clear for more than half of the studies. A secondary analysis that used a different approach to handle the missing data was conducted in 69/130 (53%) studies. Of these 69 studies, 68 (99%) lacked a clear justification for the secondary analysis. CONCLUSION: Effort is needed to clarify the rationale for and improve the reporting of MI for estimation of causal effects from observational data. We encourage greater transparency in making and reporting analytical decisions related to missing data.

Subject(s)

Observational Studies as Topic , Humans , Observational Studies as Topic/methods , Observational Studies as Topic/statistics & numerical data , Data Interpretation, Statistical , Causality , Research Design/standards , Research Design/statistics & numerical data

3.

RNAseqCovarImpute: a multiple imputation procedure that outperforms complete case and single imputation differential expression analysis.

Baker, Brennan H; Sathyanarayana, Sheela; Szpiro, Adam A; MacDonald, James W; Paquette, Alison G.

Genome Biol ; 25(1): 236, 2024 Sep 03.

Article in English | MEDLINE | ID: mdl-39227979

ABSTRACT

Missing covariate data is a common problem that has not been addressed in observational studies of gene expression. Here, we present a multiple imputation method that accommodates high dimensional gene expression data by incorporating principal component analysis of the transcriptome into the multiple imputation prediction models to avoid bias. Simulation studies using three datasets show that this method outperforms complete case and single imputation analyses at uncovering true positive differentially expressed genes, limiting false discovery rates, and minimizing bias. This method is easily implemented via an R Bioconductor package, RNAseqCovarImpute that integrates with the limma-voom pipeline for differential expression analysis.

Subject(s)

Software , Humans , Gene Expression Profiling/methods , Transcriptome , Principal Component Analysis , Sequence Analysis, RNA/methods

4.

Efficient use of binned data for imputing univariate time series data.

Darji, Jay; Biswas, Nupur; Padul, Vijay; Gill, Jaya; Kesari, Santosh; Ashili, Shashaanka.

Front Big Data ; 7: 1422650, 2024.

Article in English | MEDLINE | ID: mdl-39234189

ABSTRACT

Time series data are recorded in various sectors, resulting in a large amount of data. However, the continuity of these data is often interrupted, resulting in periods of missing data. Several algorithms are used to impute the missing data, and the performance of these methods is widely varied. Apart from the choice of algorithm, the effective imputation depends on the nature of missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated the missing data for different time spans and imputed using different algorithms with binned data of different sizes. The performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared to the entire dataset, particularly in the case of the expectation-maximization (EM) algorithm. We found that RMSE was reduced when using binned data for 1-, 5-, and 15-min missing data, with greater reduction observed for 15-min missing data. We also observed the effect of data fluctuation. We conclude that the usefulness of binned data depends precisely on the span of missing data, sampling frequency of the data, and fluctuation within data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can impute a wide variety of data, including biological heart rate data derived from the Internet of Things (IoT) device smartwatch and non-biological data such as household power consumption data.

5.

Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little's MCAR test).

Hayes, Timothy; Baraldi, Amanda N; Coxe, Stefany.

Behav Res Methods ; 2024 Sep 09.

Article in English | MEDLINE | ID: mdl-39251529

ABSTRACT

The selection of auxiliary variables is an important first step in appropriately implementing missing data methods such as full information maximum likelihood (FIML) estimation or multiple imputation. However, practical guidelines and statistical tests for selecting useful auxiliary variables are somewhat lacking, leading to potentially biased estimates. We propose the use of random forest analysis and lasso regression as alternative methods to select auxiliary variables, particularly in situations in which the missing data pattern is nonlinear or otherwise complex (i.e., interactive relationships between variables and missingness). Monte Carlo simulations demonstrate the effectiveness of random forest analysis and lasso regression compared to traditional methods (t-tests, Little's MCAR test, logistic regressions), in terms of both selecting auxiliary variables and the performance of said auxiliary variables when incorporated in an analysis with missing data. Both techniques outperformed traditional methods, providing a promising direction for improvement of practical methods for handling missing data in statistical analyses.

6.

Handling missing data and measurement error for early-onset myopia risk prediction models.

Lai, Hongyu; Gao, Kaiye; Li, Meiyan; Li, Tao; Zhou, Xiaodong; Zhou, Xingtao; Guo, Hui; Fu, Bo.

BMC Med Res Methodol ; 24(1): 194, 2024 Sep 06.

Article in English | MEDLINE | ID: mdl-39243025

ABSTRACT

BACKGROUND: Early identification of children at high risk of developing myopia is essential to prevent myopia progression by introducing timely interventions. However, missing data and measurement error (ME) are common challenges in risk prediction modelling that can introduce bias in myopia prediction. METHODS: We explore four imputation methods to address missing data and ME: single imputation (SI), multiple imputation under missing at random (MI-MAR), multiple imputation with calibration procedure (MI-ME), and multiple imputation under missing not at random (MI-MNAR). We compare four machine-learning models (Decision Tree, Naive Bayes, Random Forest, and Xgboost) and three statistical models (logistic regression, stepwise logistic regression, and least absolute shrinkage and selection operator logistic regression) in myopia risk prediction. We apply these models to the Shanghai Jinshan Myopia Cohort Study and also conduct a simulation study to investigate the impact of missing mechanisms, the degree of ME, and the importance of predictors on model performance. Model performance is evaluated using the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). RESULTS: Our findings indicate that in scenarios with missing data and ME, using MI-ME in combination with logistic regression yields the best prediction results. In scenarios without ME, employing MI-MAR to handle missing data outperforms SI regardless of the missing mechanisms. When ME has a greater impact on prediction than missing data, the relative advantage of MI-MAR diminishes, and MI-ME becomes more superior. Furthermore, our results demonstrate that statistical models exhibit better prediction performance than machine-learning models. CONCLUSION: MI-ME emerges as a reliable method for handling missing data and ME in important predictors for early-onset myopia risk prediction.

Subject(s)

Machine Learning , Myopia , Humans , Myopia/diagnosis , Myopia/epidemiology , Female , Child , Male , Logistic Models , Models, Statistical , Risk Assessment/methods , Risk Assessment/statistics & numerical data , Risk Factors , ROC Curve , Bayes Theorem , China/epidemiology , Cohort Studies , Age of Onset

7.

Stream salinity prediction in data-scarce regions: Application of transfer learning and uncertainty quantification.

Khodkar, Kasra; Mirchi, Ali; Nourani, Vahid; Kaghazchi, Afsaneh; Sadler, Jeffrey M; Mansaray, Abubakarr; Wagner, Kevin; Alderman, Phillip D; Taghvaeian, Saleh; Bailey, Ryan T.

J Contam Hydrol ; 266: 104418, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39217676

ABSTRACT

Scarcity of stream salinity data poses a challenge to understanding salinity dynamics and its implications for water supply management in water-scarce salt-prone regions around the world. This paper introduces a framework for generating continuous daily stream salinity estimates using instance-based transfer learning (TL) and assessing the reliability of the synthetic salinity data through uncertainty quantification via prediction intervals (PIs). The framework was developed using two temporally distinct specific conductance (SC) datasets from the Upper Red River Basin (URRB) located in southwestern Oklahoma and Texas Panhandle, United States. The instance-based TL approach was implemented by calibrating Feedforward Neural Networks (FFNNs) on a source SC dataset of around 1200 instantaneous grab samples collected by United States Geological Survey (USGS) from 1959 to 1993. The trained FFNNs were subsequently tested on a target dataset (1998-present) of 220 instantaneous grab samples collected by the Oklahoma Water Resources Board (OWRB). The framework's generalizability was assessed in the data-rich Bird Creek watershed in Oklahoma by manipulating continuous SC data to simulate data-scarce conditions for training the models and using the complete Bird Creek dataset for model evaluation. The Lower Upper Bound Estimation (LUBE) method was used with FFNNs to estimate PIs for uncertainty quantification. Autoregressive SC prediction methods via FFNN were found to be reliable with Nash Sutcliffe Efficiency (NSE) values of 0.65 and 0.45 on in-sample and out-of-sample test data, respectively. The same modeling scenario resulted in an NSE of 0.54 for the Bird Creek data using a similar missing data ratio, whereas a higher ratio of observed data increased the accuracy (NSE = 0.84). The relatively narrow estimated PIs for the North Fork Red River in the URRB indicated satisfactory stream salinity predictions, showing an average width equivalent to 25 % of the observed range and a confidence level of 70 %.

Subject(s)

Environmental Monitoring , Rivers , Salinity , Rivers/chemistry , Uncertainty , Oklahoma , Environmental Monitoring/methods , Texas , Neural Networks, Computer , Models, Theoretical

8.

An improved Gaussian process for filling the missing data in GNSS position time series considering the influence of adjacent stations.

Qiu, Xiaomeng; Wang, Fengwei; Zhang, Qiuxi; Tao, Guoqiang; Zhou, Shijian.

Sci Rep ; 14(1): 19268, 2024 Aug 20.

Article in English | MEDLINE | ID: mdl-39164405

ABSTRACT

Due to various unavoidable reasons or gross error elimination, missing data inevitably exist in global navigation satellite system (GNSS) position time series, which may result in many analysis methods not being applicable. Typically, interpolating the missing data is a crucial preprocessing step before analyzing the time series. The conventional methods for filling missing data do not consider the influence of adjacent stations. In this work, an improved Gaussian process (GP) approach is developed to fill the missing data of GNSS time series, in which the time series of adjacent stations are applied to construct impact factors, together with a comparison of the conventional GP and the commonly used cubic spline methods. For the simulation experiments, the root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R) are adopted to evaluate the performance of the improved GP. The results show that the filled missing data of the improved GP are closer to the true values than those of the conventional GP and cubic spline methods, regardless of the missing percentages ranging from 5 to 30%, with an interval of 5%. Specifically, the mean relative RMSE and MAE improvements for the improved GP with respect to the conventional GP are 21.2%, 21.3% and 8.3% and 12.7%, 16.2% and 11.01% for the North (N), East (E) and Up (U) components, respectively. In the real experiment, eight GNSS stations are analyzed using improved GP, together with conventional GP and a cubic spline. The results indicate that the first three principal components (PCs) of the improved GP can perverse 98.3%, 99.8% and 77.0% of the total variance for the N, E and U components, respectively. This value is obviously higher than those of the conventional GP and cubic spline. Therefore, we can conclude that the improved GP can better fill in the missing data in GNSS position time series than the conventional GP and cubic spline because of the impacts of adjacent stations.

9.

Balancing efficacy and computational burden: weighted mean, multiple imputation, and inverse probability weighting methods for item non-response in reliable scales.

Guide, Andrew; Garbett, Shawn; Feng, Xiaoke; Mapes, Brandy M; Cook, Justin; Sulieman, Lina; Cronin, Robert M; Chen, Qingxia.

J Am Med Inform Assoc ; 2024 Aug 13.

Article in English | MEDLINE | ID: mdl-39138951

ABSTRACT

IMPORTANCE: Scales often arise from multi-item questionnaires, yet commonly face item non-response. Traditional solutions use weighted mean (WMean) from available responses, but potentially overlook missing data intricacies. Advanced methods like multiple imputation (MI) address broader missing data, but demand increased computational resources. Researchers frequently use survey data in the All of Us Research Program (All of Us), and it is imperative to determine if the increased computational burden of employing MI to handle non-response is justifiable. OBJECTIVES: Using the 5-item Physical Activity Neighborhood Environment Scale (PANES) in All of Us, this study assessed the tradeoff between efficacy and computational demands of WMean, MI, and inverse probability weighting (IPW) when dealing with item non-response. MATERIALS AND METHODS: Synthetic missingness, allowing 1 or more item non-response, was introduced into PANES across 3 missing mechanisms and various missing percentages (10%-50%). Each scenario compared WMean of complete questions, MI, and IPW on bias, variability, coverage probability, and computation time. RESULTS: All methods showed minimal biases (all <5.5%) for good internal consistency, with WMean suffered most with poor consistency. IPW showed considerable variability with increasing missing percentage. MI required significantly more computational resources, taking >8000 and >100 times longer than WMean and IPW in full data analysis, respectively. DISCUSSION AND CONCLUSION: The marginal performance advantages of MI for item non-response in highly reliable scales do not warrant its escalated cloud computational burden in All of Us, particularly when coupled with computationally demanding post-imputation analyses. Researchers using survey scales with low missingness could utilize WMean to reduce computing burden.

10.

Novel logarithmic imputation procedures using multi auxiliary information under ranked set sampling.

Kumar, Anoop; Bhushan, Shashi; Emam, Walid; Tashkandy, Yusra; Khan, M J S.

Sci Rep ; 14(1): 18027, 2024 Aug 04.

Article in English | MEDLINE | ID: mdl-39098844

ABSTRACT

Ranked set sampling (RSS) is known to increase the efficiency of the estimators while comparing it with simple random sampling. The problem of missingness creates a gap in the information that needs to be addressed before proceeding for estimation. Negligible amount of work has been carried out to deal with missingness utilizing RSS. This paper proposes some logarithmic type methods of imputation for the estimation of population mean under RSS using auxiliary information. The properties of the suggested imputation procedures are examined. A simulation study is accomplished to show that the proposed imputation procedures exhibit better results in comparison to some of the existing imputation procedures. Few real applications of the proposed imputation procedures is also provided to generalize the simulation study.

11.

Machine Learning-Based Risk Prediction of Discharge Status for Sepsis.

Cai, Kaida; Lou, Yuqing; Wang, Zhengyan; Yang, Xiaofang; Zhao, Xin.

Entropy (Basel) ; 26(8)2024 Jul 25.

Article in English | MEDLINE | ID: mdl-39202095

ABSTRACT

As a severe inflammatory response syndrome, sepsis presents complex challenges in predicting patient outcomes due to its unclear pathogenesis and the unstable discharge status of affected individuals. In this study, we develop a machine learning-based method for predicting the discharge status of sepsis patients, aiming to improve treatment decisions. To enhance the robustness of our analysis against outliers, we incorporate robust statistical methods, specifically the minimum covariance determinant technique. We utilize the random forest imputation method to effectively manage and impute missing data. For feature selection, we employ Lasso penalized logistic regression, which efficiently identifies significant predictors and reduces model complexity, setting the stage for the application of more complex predictive methods. Our predictive analysis incorporates multiple machine learning methods, including random forest, support vector machine, and XGBoost. We compare the prediction performance of these methods with Lasso penalized logistic regression to identify the most effective approach. Each method's performance is rigorously evaluated through ten iterations of 10-fold cross-validation to ensure robust and reliable results. Our comparative analysis reveals that XGBoost surpasses the other models, demonstrating its exceptional capability to navigate the complexities of sepsis data effectively.

12.

Analyses Using Multiple Imputation Need to Consider Missing Data in Auxiliary Variables.

Madley-Dowd, Paul; Curnow, Elinor; Hughes, Rachael A; Cornish, Rosie; Tilling, Kate; Heron, Jon.

Am J Epidemiol ; 2024 Aug 27.

Article in English | MEDLINE | ID: mdl-39191658

ABSTRACT

Auxiliary variables are used in multiple imputation (MI) to reduce bias and increase efficiency. These variables may often themselves be incomplete. We explored how missing data in auxiliary variables influenced estimates obtained from MI. We implemented a simulation study with three different missing data mechanisms for the outcome. We then examined the impact of increasing proportions of missing data and different missingness mechanisms for the auxiliary variable on bias of an unadjusted linear regression coefficient and the fraction of missing information. We illustrate our findings with an applied example in the Avon Longitudinal Study of Parents and Children. We found that where complete records analyses were biased, increasing proportions of missing data in auxiliary variables, under any missing data mechanism, reduced the ability of MI including the auxiliary variable to mitigate this bias. Where there was no bias in the complete records analysis, inclusion of a missing not at random auxiliary variable in MI introduced bias of potentially important magnitude (up to 17% of the effect size in our simulation). Careful consideration of the quantity and nature of missing data in auxiliary variables needs to be made when selecting them for use in MI models.

13.

Factors Associated with Missing Sociodemographic Data in the IRIS® (Intelligent Research in Sight) Registry.

Ross, Connor; Ivanov, Alexander; Elze, Tobias; Miller, Joan W; Lum, Flora; Lorch, Alice C; Oke, Isdin.

Ophthalmol Sci ; 4(6): 100542, 2024.

Article in English | MEDLINE | ID: mdl-39139543

ABSTRACT

Purpose: To describe the prevalence of missing sociodemographic data in the IRIS® (Intelligent Research in Sight) Registry and to identify practice-level characteristics associated with missing sociodemographic data. Design: Cross-sectional study. Participants: All patients with clinical encounters at practices participating in the IRIS Registry prior to December 31, 2020. Methods: We describe geographic and temporal trends in the prevalence of missing data for each sociodemographic variable (age, sex, race, ethnicity, geographic location, insurance type, and smoking status). Each practice contributing data to the registry was categorized based on the number of patients, number of physicians, geographic location, patient visit frequency, and patient population demographics. Main Outcome Measures: Multivariable linear regression was used to describe the association of practice-level characteristics with missing patient-level sociodemographic data. Results: This study included the electronic health records of 66 477 365 patients receiving care at 3306 practices participating in the IRIS Registry. The median number of patients per practice was 11 415 (interquartile range: 5849-24 148) and the median number of physicians per practice was 3 (interquartile range: 1-7). The prevalence of missing patient sociodemographic data were 0.1% for birth year, 0.4% for sex, 24.8% for race, 30.2% for ethnicity, 2.3% for 3-digit zip code, 14.8% for state, 5.5% for smoking status, and 17.0% for insurance type. The prevalence of missing data increased over time and varied at the state-level. Missing race data were associated with practices that had fewer visits per patient (P < 0.001), cared for a larger nonprivately insured patient population (P = 0.001), and were located in urban areas (P < 0.001). Frequent patient visits were associated with a lower prevalence of missing race (P < 0.001), ethnicity (P < 0.001), and insurance (P < 0.001), but a higher prevalence of missing smoking status (P < 0.001). Conclusions: There are geographic and temporal trends in missing race, ethnicity, and insurance type data in the IRIS Registry. Several practice-level characteristics, including practice size, geographic location, and patient population, are associated with missing sociodemographic data. While the prevalence and patterns of missing data may change in future versions of the IRIS registry, there will remain a need to develop standardized approaches for minimizing potential sources of bias and ensure reproducibility across research studies. Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.

14.

Deeply-Learned Generalized Linear Models with Missing Data.

Lim, David K; Rashid, Naim U; Oliva, Junier B; Ibrahim, Joseph G.

J Comput Graph Stat ; 33(2): 638-650, 2024.

Article in English | MEDLINE | ID: mdl-39184956

ABSTRACT

Deep Learning (DL) methods have dramatically increased in popularity in recent years, with significant growth in their application to various supervised learning problems. However, the greater prevalence and complexity of missing data in such datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of deeply learned generalized linear models, a supervised DL architecture for regression and classification problems. We propose a new architecture, dlglm, that is one of the first to be able to flexibly account for both ignorable and non-ignorable patterns of missingness in input features and response at training time. We demonstrate through statistical simulation that our method outperforms existing approaches for supervised learning tasks in the presence of missing not at random (MNAR) missingness. We conclude with a case study of the Bank Marketing dataset from the UCI Machine Learning Repository, in which we predict whether clients subscribed to a product based on phone survey data. Supplementary materials for this article are available online.

15.

Multivariable Mendelian randomization with incomplete measurements on the exposure variables in the Hispanic Community Health Study/Study of Latinos.

Li, Yilun; Wong, Kin Yau; Howard, Annie Green; Gordon-Larsen, Penny; Highland, Heather M; Graff, Mariaelisa; North, Kari E; Downie, Carolina G; Avery, Christy L; Yu, Bing; Young, Kristin L; Buchanan, Victoria L; Kaplan, Robert; Hou, Lifang; Joyce, Brian Thomas; Qi, Qibin; Sofer, Tamar; Moon, Jee-Young; Lin, Dan-Yu.

HGG Adv ; 5(4): 100338, 2024 Aug 02.

Article in English | MEDLINE | ID: mdl-39095990

ABSTRACT

Multivariable Mendelian randomization allows simultaneous estimation of direct causal effects of multiple exposure variables on an outcome. When the exposure variables of interest are quantitative omic features, obtaining complete data can be economically and technically challenging: the measurement cost is high, and the measurement devices may have inherent detection limits. In this paper, we propose a valid and efficient method to handle unmeasured and undetectable values of the exposure variables in a one-sample multivariable Mendelian randomization analysis with individual-level data. We estimate the direct causal effects with maximum likelihood estimation and develop an expectation-maximization algorithm to compute the estimators. We show the advantages of the proposed method through simulation studies and provide an application to the Hispanic Community Health Study/Study of Latinos, which has a large amount of unmeasured exposure data.

16.

Software application profile: tpc and micd-R packages for causal discovery with incomplete cohort data.

Andrews, Ryan M; Bang, Christine W; Didelez, Vanessa; Witte, Janine; Foraita, Ronja.

Int J Epidemiol ; 53(5)2024 Aug 14.

Article in English | MEDLINE | ID: mdl-39186942

ABSTRACT

MOTIVATION: The Peter Clark (PC) algorithm is a popular causal discovery method to learn causal graphs in a data-driven way. Until recently, existing PC algorithm implementations in R had important limitations regarding missing values, temporal structure or mixed measurement scales (categorical/continuous), which are all common features of cohort data. The new R packages presented here, micd and tpc, fill these gaps. IMPLEMENTATION: micd and tpc packages are R packages. GENERAL FEATURES: The micd package provides add-on functionality for dealing with missing values to the existing pcalg R package, including methods for multiple imputations relying on the Missing At Random assumption. Also, micd allows for mixed measurement scales assuming conditional Gaussianity. The tpc package efficiently exploits temporal information in a way that results in a more informative output that is less prone to statistical errors. AVAILABILITY: The tpc and micd packages are freely available on the Comprehensive R Archive Network (CRAN). Their source code is also available on GitHub (https://github.com/bips-hb/micd; https://github.com/bips-hb/tpc).

Subject(s)

Algorithms , Causality , Software , Humans , Cohort Studies , Data Interpretation, Statistical

17.

Phylogenomic analyses of Blattodea combining traditional methods, incremental tree-building, and quality-aware support.

Evangelista, Dominic A; Nelson, Dvorah; Kotyková Varadínová, Zuzana; Kotyk, Michael; Rousseaux, Nicolas; Shanahan, Tristan; Grandcolas, Phillippe; Legendre, Frédéric.

Mol Phylogenet Evol ; 200: 108177, 2024 Nov.

Article in English | MEDLINE | ID: mdl-39142526

ABSTRACT

Despite the many advances of the genomic era, there is a persistent problem in assessing the uncertainty of phylogenomic hypotheses. We see this in the recent history of phylogenetics for cockroaches and termites (Blattodea), where huge advances have been made, but there are still major inconsistencies between studies. To address this, we present a phylogenetic analysis of Blattodea that emphasizes identification and quantification of uncertainty. We analyze 1183 gene domains using three methods (multi-species coalescent inference, concatenation, and a supermatrix-supertree hybrid approach) and assess support for controversial relationships while considering data quality. The hybrid approach-here dubbed "tiered phylogenetic inference"-incorporates information about data quality into an incremental tree building framework. Leveraging this method, we are able to identify cases of low or misleading support that would not be possible otherwise, and explore them more thoroughly with follow-up tests. In particular, quality annotations pointed towards nodes with high bootstrap support that later turned out to have large ambiguities, sometimes resulting from low-quality data. We also clarify issues related to some recalcitrant nodes: Anaplectidae's placement lacks unbiased signal, Ectobiidae s.s. and Anaplectoideini need greater taxon sampling, the deepest relationships among most Blaberidae lack signal. As a result, several previous phylogenetic uncertainties are now closer to being resolved (e.g., African and Malagasy "Rhabdoblatta" spp. are the sister to all other Blaberidae, and Oxyhaloinae is sister to the remaining Blaberidae). Overall, we argue for more approaches to quantifying support that take data quality into account to uncover the nature of recalcitrant nodes.

Subject(s)

Cockroaches , Isoptera , Phylogeny , Animals , Isoptera/genetics , Isoptera/classification , Cockroaches/genetics , Cockroaches/classification , Genomics , Models, Genetic

18.

Incorporating informatively collected laboratory data from EHR in clinical prediction models.

Sun, Minghui; Engelhard, Matthew M; Bedoya, Armando D; Goldstein, Benjamin A.

BMC Med Inform Decis Mak ; 24(1): 206, 2024 Jul 24.

Article in English | MEDLINE | ID: mdl-39049049

ABSTRACT

BACKGROUND: Electronic Health Records (EHR) are widely used to develop clinical prediction models (CPMs). However, one of the challenges is that there is often a degree of informative missing data. For example, laboratory measures are typically taken when a clinician is concerned that there is a need. When data are the so-called Not Missing at Random (NMAR), analytic strategies based on other missingness mechanisms are inappropriate. In this work, we seek to compare the impact of different strategies for handling missing data on CPMs performance. METHODS: We considered a predictive model for rapid inpatient deterioration as an exemplar implementation. This model incorporated twelve laboratory measures with varying levels of missingness. Five labs had missingness rate levels around 50%, and the other seven had missingness levels around 90%. We included them based on the belief that their missingness status can be highly informational for the prediction. In our study, we explicitly compared the various missing data strategies: mean imputation, normal-value imputation, conditional imputation, categorical encoding, and missingness embeddings. Some of these were also combined with the last observation carried forward (LOCF). We implemented logistic LASSO regression, multilayer perceptron (MLP), and long short-term memory (LSTM) models as the downstream classifiers. We compared the AUROC of testing data and used bootstrapping to construct 95% confidence intervals. RESULTS: We had 105,198 inpatient encounters, with 4.7% having experienced the deterioration outcome of interest. LSTM models generally outperformed other cross-sectional models, where embedding approaches and categorical encoding yielded the best results. For the cross-sectional models, normal-value imputation with LOCF generated the best results. CONCLUSION: Strategies that accounted for the possibility of NMAR missing data yielded better model performance than those did not. The embedding method had an advantage as it did not require prior clinical knowledge. Using LOCF could enhance the performance of cross-sectional models but have countereffects in LSTM models.

Subject(s)

Electronic Health Records , Humans , Clinical Deterioration , Models, Statistical , Clinical Laboratory Techniques

19.

Consequences of Data Loss on Clinical Decision-Making in Continuous Glucose Monitoring: Retrospective Cohort Study.

den Braber, Niala; Braem, Carlijn I R; Vollenbroek-Hutten, Miriam M R; Hermens, Hermie J; Urgert, Thomas; Yavuz, Utku S; Veltink, Peter H; Laverman, Gozewijn D.

Interact J Med Res ; 13: e50849, 2024 Jul 31.

Article in English | MEDLINE | ID: mdl-39083801

ABSTRACT

BACKGROUND: The impact of missing data on individual continuous glucose monitoring (CGM) data is unknown but can influence clinical decision-making for patients. OBJECTIVE: We aimed to investigate the consequences of data loss on glucose metrics in individual patient recordings from continuous glucose monitors and assess its implications on clinical decision-making. METHODS: The CGM data were collected from patients with type 1 and 2 diabetes using the FreeStyle Libre sensor (Abbott Diabetes Care). We selected 7-28 days of 24 hours of continuous data without any missing values from each individual patient. To mimic real-world data loss, missing data ranging from 5% to 50% were introduced into the data set. From this modified data set, clinical metrics including time below range (TBR), TBR level 2 (TBR2), and other common glucose metrics were calculated in the data sets with and that without data loss. Recordings in which glucose metrics deviated relevantly due to data loss, as determined by clinical experts, were defined as expert panel boundary error (ÎµEPB). These errors were expressed as a percentage of the total number of recordings. The errors for the recordings with glucose management indicator <53 mmol/mol were investigated. RESULTS: A total of 84 patients contributed to 798 recordings over 28 days. With 5%-50% data loss for 7-28 days recordings, the ÎµEPB varied from 0 out of 798 (0.0%) to 147 out of 736 (20.0%) for TBR and 0 out of 612 (0.0%) to 22 out of 408 (5.4%) recordings for TBR2. In the case of 14-day recordings, TBR and TBR2 episodes completely disappeared due to 30% data loss in 2 out of 786 (0.3%) and 32 out of 522 (6.1%) of the cases, respectively. However, the initial values of the disappeared TBR and TBR2 were relatively small (<0.1%). In the recordings with glucose management indicator <53 mmol/mol the ÎµEPB was 9.6% for 14 days with 30% data loss. CONCLUSIONS: With a maximum of 30% data loss in 14-day CGM recordings, there is minimal impact of missing data on the clinical interpretation of various glucose metrics. TRIAL REGISTRATION: ClinicalTrials.gov NCT05584293; https://clinicaltrials.gov/study/NCT05584293.

20.

The importance of missing data in estimating BMI trajectories.

Gray, Laura A.

Sci Rep ; 14(1): 17740, 2024 07 31.

Article in English | MEDLINE | ID: mdl-39085396

ABSTRACT

Body Mass Index (BMI) trajectories are important for understanding how BMI develops over time. Missing data is often stated as a limitation in studies that analyse BMI over time and there is limited research exploring how missing data influences BMI trajectories. This study explores the influence missing data has in estimating BMI trajectories and the impact on subsequent analysis. This study uses data from the English Longitudinal Study of Ageing. Distinct BMI trajectories are estimated for adults aged 50 years and over. Next, multiple methods accounting for missing data are implemented and compared. Estimated trajectories are then used to predict the risk of developing type 2 diabetes mellitus (T2DM). Four distinct trajectories are identified using each of the missing data methods: stable overweight, elevated BMI, increasing BMI, and decreasing BMI. However, the likelihoods of individuals following the different trajectories differ between the different methods. The influence of BMI trajectory on T2DM is reduced after accounting for missing data. More work is needed to understand which methods for missing data are most reliable. When estimating BMI trajectories, missing data should be considered. The extent to which accounting for missing data influences cost-effectiveness analyses should be investigated.

Subject(s)

Body Mass Index , Diabetes Mellitus, Type 2 , Humans , Middle Aged , Diabetes Mellitus, Type 2/epidemiology , Female , Male , Longitudinal Studies , Aged , Overweight/epidemiology , Obesity/epidemiology

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL