Results 1 - 20 of 1,566
1.
J Comput Graph Stat ; 33(2): 638-650, 2024.
Article in English | MEDLINE | ID: mdl-39184956

ABSTRACT

Deep Learning (DL) methods have dramatically increased in popularity in recent years, with significant growth in their application to various supervised learning problems. However, the greater prevalence and complexity of missing data in such datasets present significant challenges for DL methods. Here, we provide a formal treatment of missing data in the context of deeply learned generalized linear models, a supervised DL architecture for regression and classification problems. We propose a new architecture, dlglm, that is one of the first to be able to flexibly account for both ignorable and non-ignorable patterns of missingness in input features and response at training time. We demonstrate through statistical simulation that our method outperforms existing approaches for supervised learning tasks in the presence of missing not at random (MNAR) missingness. We conclude with a case study of the Bank Marketing dataset from the UCI Machine Learning Repository, in which we predict whether clients subscribed to a product based on phone survey data. Supplementary materials for this article are available online.
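
The dlglm architecture itself is not reproduced here, but the MNAR setting it targets is straightforward to illustrate. Below is a minimal Python sketch (simulated data and hypothetical variable names, not the authors' code) in which a feature's probability of being missing depends on its own unobserved value, i.e. missing not at random.

```python
# Illustrative MNAR simulation: the chance that x1 is missing grows with the
# (unobserved) value of x1 itself, so complete cases are a biased subsample.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=(n, 3))                       # fully observed covariates
beta = np.array([1.0, -0.5, 2.0])
y = x @ beta + rng.normal(scale=0.5, size=n)      # continuous response

p_miss = 1.0 / (1.0 + np.exp(-(x[:, 0] - 0.5)))   # logistic in the value itself
mask = rng.uniform(size=n) < p_miss
x_obs = x.copy()
x_obs[mask, 0] = np.nan                           # MNAR missingness in x1

print(f"missingness in x1: {mask.mean():.1%}")
print(f"mean of observed x1: {x[~mask, 0].mean():+.2f} (true mean is ~0)")
```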

2.
Ophthalmol Sci ; 4(6): 100542, 2024.
Article in English | MEDLINE | ID: mdl-39139543

ABSTRACT

Purpose: To describe the prevalence of missing sociodemographic data in the IRIS® (Intelligent Research in Sight) Registry and to identify practice-level characteristics associated with missing sociodemographic data. Design: Cross-sectional study. Participants: All patients with clinical encounters at practices participating in the IRIS Registry prior to December 31, 2020. Methods: We describe geographic and temporal trends in the prevalence of missing data for each sociodemographic variable (age, sex, race, ethnicity, geographic location, insurance type, and smoking status). Each practice contributing data to the registry was categorized based on the number of patients, number of physicians, geographic location, patient visit frequency, and patient population demographics. Main Outcome Measures: Multivariable linear regression was used to describe the association of practice-level characteristics with missing patient-level sociodemographic data. Results: This study included the electronic health records of 66 477 365 patients receiving care at 3306 practices participating in the IRIS Registry. The median number of patients per practice was 11 415 (interquartile range: 5849-24 148) and the median number of physicians per practice was 3 (interquartile range: 1-7). The prevalence of missing patient sociodemographic data was 0.1% for birth year, 0.4% for sex, 24.8% for race, 30.2% for ethnicity, 2.3% for 3-digit zip code, 14.8% for state, 5.5% for smoking status, and 17.0% for insurance type. The prevalence of missing data increased over time and varied at the state level. Missing race data were associated with practices that had fewer visits per patient (P < 0.001), cared for a larger nonprivately insured patient population (P = 0.001), and were located in urban areas (P < 0.001). Frequent patient visits were associated with a lower prevalence of missing race (P < 0.001), ethnicity (P < 0.001), and insurance (P < 0.001), but a higher prevalence of missing smoking status (P < 0.001). Conclusions: There are geographic and temporal trends in missing race, ethnicity, and insurance type data in the IRIS Registry. Several practice-level characteristics, including practice size, geographic location, and patient population, are associated with missing sociodemographic data. While the prevalence and patterns of missing data may change in future versions of the IRIS Registry, there will remain a need to develop standardized approaches for minimizing potential sources of bias and ensuring reproducibility across research studies. Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
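
As a rough illustration of the per-variable missingness summary reported above, the following pandas sketch computes the percentage of records missing each sociodemographic field; the column names and values are made up, not the IRIS Registry schema.

```python
# Hypothetical EHR extract; isna().mean() gives the fraction missing per column.
import numpy as np
import pandas as pd

ehr = pd.DataFrame({
    "birth_year": [1950, 1962, np.nan, 1971],
    "sex":        ["F", None, "M", "F"],
    "race":       [None, "White", None, "Asian"],
    "insurance":  ["Medicare", None, "Private", None],
})

missing_pct = ehr.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_pct)   # % of records missing each sociodemographic variable
```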

3.
Article in English | MEDLINE | ID: mdl-39138951

ABSTRACT

IMPORTANCE: Scales often arise from multi-item questionnaires, yet commonly face item non-response. Traditional solutions use weighted mean (WMean) from available responses, but potentially overlook missing data intricacies. Advanced methods like multiple imputation (MI) address broader missing data, but demand increased computational resources. Researchers frequently use survey data in the All of Us Research Program (All of Us), and it is imperative to determine if the increased computational burden of employing MI to handle non-response is justifiable. OBJECTIVES: Using the 5-item Physical Activity Neighborhood Environment Scale (PANES) in All of Us, this study assessed the tradeoff between efficacy and computational demands of WMean, MI, and inverse probability weighting (IPW) when dealing with item non-response. MATERIALS AND METHODS: Synthetic missingness, allowing non-response on 1 or more items, was introduced into PANES across 3 missing mechanisms and various missing percentages (10%-50%). Each scenario compared WMean of complete questions, MI, and IPW on bias, variability, coverage probability, and computation time. RESULTS: All methods showed minimal biases (all <5.5%) when internal consistency was good, with WMean suffering the most when consistency was poor. IPW showed considerable variability with increasing missing percentage. MI required significantly more computational resources, taking >8000 and >100 times longer than WMean and IPW in full data analysis, respectively. DISCUSSION AND CONCLUSION: The marginal performance advantages of MI for item non-response in highly reliable scales do not warrant its escalated cloud computational burden in All of Us, particularly when coupled with computationally demanding post-imputation analyses. Researchers using survey scales with low missingness could utilize WMean to reduce computing burden.
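
The WMean approach described above amounts to scoring each respondent from whatever items they answered. A small pandas sketch of that idea for a 5-item scale follows (hypothetical item values, not the PANES data in All of Us).

```python
# Prorated scale score: mean of the answered items, ignoring item non-response.
import numpy as np
import pandas as pd

items = pd.DataFrame({
    "item1": [4, 3, np.nan],
    "item2": [4, np.nan, 2],
    "item3": [5, 3, np.nan],
    "item4": [np.nan, 4, 1],
    "item5": [4, 4, np.nan],
})

wmean_score = items.mean(axis=1, skipna=True)     # mean of available items per row
n_answered = items.notna().sum(axis=1)
print(pd.DataFrame({"score": wmean_score, "items_answered": n_answered}))
```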

4.
Sci Rep ; 14(1): 18027, 2024 Aug 04.
Article in English | MEDLINE | ID: mdl-39098844

ABSTRACT

Ranked set sampling (RSS) is known to increase the efficiency of estimators compared with simple random sampling. The problem of missingness creates a gap in the information that needs to be addressed before proceeding with estimation. Little work has been carried out to deal with missingness under RSS. This paper proposes some logarithmic-type methods of imputation for the estimation of the population mean under RSS using auxiliary information. The properties of the suggested imputation procedures are examined. A simulation study shows that the proposed imputation procedures exhibit better results than some of the existing imputation procedures. A few real applications of the proposed imputation procedures are also provided to support the findings of the simulation study.

5.
Sci Rep ; 14(1): 19268, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39164405

ABSTRACT

Due to various unavoidable reasons or gross error elimination, missing data inevitably exist in global navigation satellite system (GNSS) position time series, which may result in many analysis methods not being applicable. Typically, interpolating the missing data is a crucial preprocessing step before analyzing the time series. The conventional methods for filling missing data do not consider the influence of adjacent stations. In this work, an improved Gaussian process (GP) approach is developed to fill the missing data of GNSS time series, in which the time series of adjacent stations are applied to construct impact factors, together with a comparison of the conventional GP and the commonly used cubic spline methods. For the simulation experiments, the root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R) are adopted to evaluate the performance of the improved GP. The results show that the filled missing data of the improved GP are closer to the true values than those of the conventional GP and cubic spline methods for missing percentages ranging from 5% to 30% in steps of 5%. Specifically, the mean relative RMSE and MAE improvements for the improved GP with respect to the conventional GP are 21.2%, 21.3% and 8.3% and 12.7%, 16.2% and 11.01% for the North (N), East (E) and Up (U) components, respectively. In the real experiment, eight GNSS stations are analyzed using the improved GP, together with the conventional GP and the cubic spline. The results indicate that the first three principal components (PCs) of the improved GP can preserve 98.3%, 99.8% and 77.0% of the total variance for the N, E and U components, respectively. These values are clearly higher than those of the conventional GP and cubic spline. Therefore, we can conclude that the improved GP can better fill in the missing data in GNSS position time series than the conventional GP and cubic spline because of the impact of adjacent stations.
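
A hedged sketch of the general idea with scikit-learn's Gaussian process regressor on a simulated daily coordinate series: the neighbouring station's series is added as an extra input feature, a simplified stand-in for the paper's impact factors rather than the authors' exact construction.

```python
# Fill gaps in a target GNSS component using time plus an adjacent station's series.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
t = np.arange(365.0)                              # daily epochs (days)
common = 3.0 * np.sin(2 * np.pi * t / 365.0)      # shared seasonal signal (mm)
target = common + rng.normal(scale=0.8, size=t.size)
neighbour = common + rng.normal(scale=0.8, size=t.size)

missing = rng.choice(t.size, size=int(0.2 * t.size), replace=False)   # 20% gaps
observed = np.setdiff1d(np.arange(t.size), missing)

X = np.column_stack([t, neighbour])               # time + adjacent-station feature
gp = GaussianProcessRegressor(kernel=RBF(length_scale=30.0) + WhiteKernel(),
                              normalize_y=True)
gp.fit(X[observed], target[observed])
filled = gp.predict(X[missing])

rmse = np.sqrt(np.mean((filled - target[missing]) ** 2))
print(f"RMSE of filled values: {rmse:.2f} mm")
```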

6.
Int J Epidemiol ; 53(5)2024 Aug 14.
Article in English | MEDLINE | ID: mdl-39186942

ABSTRACT

MOTIVATION: The Peter Clark (PC) algorithm is a popular causal discovery method to learn causal graphs in a data-driven way. Until recently, existing PC algorithm implementations in R had important limitations regarding missing values, temporal structure or mixed measurement scales (categorical/continuous), which are all common features of cohort data. The new R packages presented here, micd and tpc, fill these gaps. IMPLEMENTATION: micd and tpc are R packages. GENERAL FEATURES: The micd package provides add-on functionality for dealing with missing values to the existing pcalg R package, including methods for multiple imputation relying on the Missing At Random assumption. Also, micd allows for mixed measurement scales assuming conditional Gaussianity. The tpc package efficiently exploits temporal information in a way that results in a more informative output that is less prone to statistical errors. AVAILABILITY: The tpc and micd packages are freely available on the Comprehensive R Archive Network (CRAN). Their source code is also available on GitHub (https://github.com/bips-hb/micd; https://github.com/bips-hb/tpc).


Subjects
Algorithms , Causality , Software , Humans , Cohort Studies , Data Interpretation, Statistical
7.
Am J Epidemiol ; 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39191658

ABSTRACT

Auxiliary variables are used in multiple imputation (MI) to reduce bias and increase efficiency. These variables may often themselves be incomplete. We explored how missing data in auxiliary variables influenced estimates obtained from MI. We implemented a simulation study with three different missing data mechanisms for the outcome. We then examined the impact of increasing proportions of missing data and different missingness mechanisms for the auxiliary variable on the bias of an unadjusted linear regression coefficient and on the fraction of missing information. We illustrate our findings with an applied example in the Avon Longitudinal Study of Parents and Children. We found that where complete records analyses were biased, increasing proportions of missing data in auxiliary variables, under any missing data mechanism, reduced the ability of MI including the auxiliary variable to mitigate this bias. Where there was no bias in the complete records analysis, inclusion of a missing not at random auxiliary variable in MI introduced bias of potentially important magnitude (up to 17% of the effect size in our simulation). Careful consideration of the quantity and nature of missing data in auxiliary variables needs to be made when selecting them for use in MI models.
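
The simulation design can be sketched as follows, with scikit-learn's IterativeImputer standing in for a full multiple-imputation procedure (a single stochastic imputation rather than pooled MI, and hypothetical variable names rather than ALSPAC data): the outcome is made missing given an observed covariate, the auxiliary variable is itself partly missing, and the unadjusted slope is estimated with and without the auxiliary variable in the imputation model.

```python
# Compare imputation with and without an incomplete auxiliary variable z.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 5_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(scale=0.6, size=n)                  # auxiliary variable

y_obs = y.copy()
y_obs[rng.uniform(size=n) < 1 / (1 + np.exp(-x))] = np.nan   # y missing given x
z_obs = z.copy()
z_obs[rng.uniform(size=n) < 0.3] = np.nan                    # auxiliary also incomplete

for label, data in [("without auxiliary", np.column_stack([x, y_obs])),
                    ("with auxiliary",    np.column_stack([x, y_obs, z_obs]))]:
    imputed = IterativeImputer(sample_posterior=True,
                               random_state=0).fit_transform(data)
    slope = LinearRegression().fit(imputed[:, [0]], imputed[:, 1]).coef_[0]
    print(f"{label}: estimated slope = {slope:.3f} (true value 0.5)")
```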

8.
HGG Adv ; 5(4): 100338, 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39095990

ABSTRACT

Multivariable Mendelian randomization allows simultaneous estimation of direct causal effects of multiple exposure variables on an outcome. When the exposure variables of interest are quantitative omic features, obtaining complete data can be economically and technically challenging: the measurement cost is high, and the measurement devices may have inherent detection limits. In this paper, we propose a valid and efficient method to handle unmeasured and undetectable values of the exposure variables in a one-sample multivariable Mendelian randomization analysis with individual-level data. We estimate the direct causal effects with maximum likelihood estimation and develop an expectation-maximization algorithm to compute the estimators. We show the advantages of the proposed method through simulation studies and provide an application to the Hispanic Community Health Study/Study of Latinos, which has a large amount of unmeasured exposure data.

9.
Mol Phylogenet Evol ; 200: 108177, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39142526

ABSTRACT

Despite the many advances of the genomic era, there is a persistent problem in assessing the uncertainty of phylogenomic hypotheses. We see this in the recent history of phylogenetics for cockroaches and termites (Blattodea), where huge advances have been made, but there are still major inconsistencies between studies. To address this, we present a phylogenetic analysis of Blattodea that emphasizes identification and quantification of uncertainty. We analyze 1183 gene domains using three methods (multi-species coalescent inference, concatenation, and a supermatrix-supertree hybrid approach) and assess support for controversial relationships while considering data quality. The hybrid approach, here dubbed "tiered phylogenetic inference", incorporates information about data quality into an incremental tree building framework. Leveraging this method, we are able to identify cases of low or misleading support that would not be apparent otherwise, and explore them more thoroughly with follow-up tests. In particular, quality annotations pointed towards nodes with high bootstrap support that later turned out to have large ambiguities, sometimes resulting from low-quality data. We also clarify issues related to some recalcitrant nodes: Anaplectidae's placement lacks unbiased signal; Ectobiidae s.s. and Anaplectoideini need greater taxon sampling; and the deepest relationships among most Blaberidae lack signal. As a result, several previous phylogenetic uncertainties are now closer to being resolved (e.g., African and Malagasy "Rhabdoblatta" spp. are the sister to all other Blaberidae, and Oxyhaloinae is sister to the remaining Blaberidae). Overall, we argue for more approaches to quantifying support that take data quality into account to uncover the nature of recalcitrant nodes.

11.
Article in English | MEDLINE | ID: mdl-38947282

ABSTRACT

Integrative factorization methods for multi-omic data estimate factors explaining biological variation. Factors can be treated as covariates to predict an outcome and the factorization can be used to impute missing values. However, no available methods provide a comprehensive framework for statistical inference and uncertainty quantification for these tasks. A novel framework, Bayesian Simultaneous Factorization (BSF), is proposed to decompose multi-omics variation into joint and individual structures simultaneously within a probabilistic framework. BSF uses conjugate normal priors and the posterior mode of this model can be estimated by solving a structured nuclear norm-penalized objective that also achieves rank selection and motivates the choice of hyperparameters. BSF is then extended to simultaneously predict a continuous or binary phenotype while estimating latent factors, termed Bayesian Simultaneous Factorization and Prediction (BSFP). BSF and BSFP accommodate concurrent imputation, i.e., imputation during the model-fitting process, and full posterior inference for missing data, including "blockwise" missingness. It is shown via simulation that BSFP is competitive in recovering latent variation structure, and the importance of accounting for uncertainty in the estimated factorization within the predictive model is demonstrated. The imputation performance of BSF is examined via simulation under missing-at-random and missing-not-at-random assumptions. Finally, BSFP is used to predict lung function based on the bronchoalveolar lavage metabolome and proteome from a study of HIV-associated obstructive lung disease, revealing multi-omic patterns related to lung function decline and a cluster of patients with obstructive lung disease driven by shared metabolomic and proteomic abundance patterns.
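
The nuclear-norm connection mentioned above can be illustrated with a generic "soft-impute" loop (iterative SVD soft-thresholding) that fills missing entries of a stacked data matrix with a low-rank reconstruction; this is a standard matrix-completion sketch on simulated data, not the BSF or BSFP model.

```python
# Soft-impute: alternate a soft-thresholded SVD with restoring the observed entries.
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_feats, rank = 100, 60, 4
truth = rng.normal(size=(n_samples, rank)) @ rng.normal(size=(rank, n_feats))
X = truth + rng.normal(scale=0.1, size=truth.shape)
mask = rng.uniform(size=X.shape) < 0.25                 # 25% of entries missing
X_obs = np.where(mask, np.nan, X)

def soft_impute(X_obs, lam=5.0, n_iter=100):
    observed = ~np.isnan(X_obs)
    Z = np.where(observed, X_obs, 0.0)                  # start with zeros in the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        low_rank = U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt   # shrink singular values
        Z = np.where(observed, X_obs, low_rank)         # keep observed entries fixed
    return Z

Z_hat = soft_impute(X_obs)
rmse = np.sqrt(np.mean((Z_hat[mask] - X[mask]) ** 2))
print(f"imputation RMSE on held-out entries: {rmse:.3f}")
```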

12.
Sci Rep ; 14(1): 17740, 2024 07 31.
Article in English | MEDLINE | ID: mdl-39085396

ABSTRACT

Body Mass Index (BMI) trajectories are important for understanding how BMI develops over time. Missing data is often stated as a limitation in studies that analyse BMI over time, and there is limited research exploring how missing data influences BMI trajectories. This study explores the influence missing data has on the estimation of BMI trajectories and the impact on subsequent analysis. This study uses data from the English Longitudinal Study of Ageing. Distinct BMI trajectories are estimated for adults aged 50 years and over. Next, multiple methods accounting for missing data are implemented and compared. Estimated trajectories are then used to predict the risk of developing type 2 diabetes mellitus (T2DM). Four distinct trajectories are identified using each of the missing data methods: stable overweight, elevated BMI, increasing BMI, and decreasing BMI. However, the likelihoods of individuals following the different trajectories differ between the different methods. The influence of BMI trajectory on T2DM is reduced after accounting for missing data. More work is needed to understand which methods for missing data are most reliable. When estimating BMI trajectories, missing data should be considered. The extent to which accounting for missing data influences cost-effectiveness analyses should be investigated.


Subjects
Body Mass Index , Diabetes Mellitus, Type 2 , Humans , Middle Aged , Diabetes Mellitus, Type 2/epidemiology , Female , Male , Longitudinal Studies , Aged , Overweight/epidemiology , Obesity/epidemiology
13.
J Clin Epidemiol ; 173: 111458, 2024 Jul 09.
Article in English | MEDLINE | ID: mdl-38986959

ABSTRACT

OBJECTIVES: This paper discusses methodological challenges in epidemiological association analysis of a time-to-event outcome and hypothesized risk factors, where age/time at the onset of the outcome may be missing in some cases, a condition commonly encountered when the outcome is self-reported. STUDY DESIGN AND SETTING: A cohort study with long-term follow-up for outcome ascertainment, such as the Childhood Cancer Survivor Study (CCSS), a large cohort study of 5-year survivors of childhood cancer diagnosed in 1970-1999 in which occurrences and age at onset of various chronic health conditions (CHCs) are self-reported in surveys. Simple methods for handling missing onset age and their potential bias in the exposure-outcome association inference are discussed. The interval-censored method is discussed as a remedy for handling this problem. The finite-sample performance of these approaches is compared through Monte Carlo simulations. Examples from the CCSS include four CHCs (diabetes, myocardial infarction, osteoporosis/osteopenia, and growth hormone deficiency). RESULTS: The interval-censored method is usable in practice with standard statistical software. The simulation study showed that the regression coefficient estimates from the interval-censored method consistently displayed reduced bias and, in most cases, smaller standard deviations, resulting in smaller mean square errors, compared to those from the simple approaches, regardless of the proportion of subjects with an event of interest, the proportion of missing onset age, and the sample size. CONCLUSION: The interval-censored method is a statistically valid and practical approach to the association analysis of self-reported time-to-event data when onset age may be missing. While the simpler approaches that force such data into complete data may enable the standard analytic methods to be applicable, there is considerable loss in both accuracy and precision relative to the interval-censored method.
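
A minimal illustration of the interval-censored idea on toy data, assuming the lifelines library is available: when a condition is reported but the onset age is missing, onset is treated as falling somewhere between cohort entry and the age at report rather than being forced to a single value.

```python
# Interval-censored parametric fit: onset lies in (entry_age, report_age] when the
# event is reported without an onset age, and in (report_age, inf) otherwise.
import numpy as np
from lifelines import WeibullFitter

rng = np.random.default_rng(3)
n = 500
entry_age = rng.uniform(5, 20, size=n)                  # age at cohort entry
true_onset = entry_age + 25 * rng.weibull(1.5, size=n)  # unobserved onset age
report_age = entry_age + rng.uniform(20, 40, size=n)    # age at survey

had_event = true_onset <= report_age
lower = np.where(had_event, entry_age, report_age)
upper = np.where(had_event, report_age, np.inf)

wf = WeibullFitter()
wf.fit_interval_censoring(lower, upper)
wf.print_summary()
```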

14.
Psychoneuroendocrinology ; 168: 107116, 2024 Oct.
Article in English | MEDLINE | ID: mdl-38981200

ABSTRACT

INTRODUCTION: Living in socioeconomic disadvantage has been conceptualised as a chronic stressor, although this contradicts evidence from studies using hair cortisol and cortisone as a measure of hypothalamus-pituitary-adrenal (HPA) axis activity. These studies used complete case analyses, ignoring the impact of missing data on inference, despite the high proportion of missing biomarker data. The methodological limitations of studies investigating the association between socioeconomic position (SEP), defined as education, wealth, and social class, and hair cortisol and cortisone are considered in this study by comparing three common methods to deal with missing data: (1) Complete Case Analysis (CCA), (2) Inverse Probability Weighting (IPW) and (3) weighted Multiple Imputation (MI). This study examines if socioeconomic disadvantage is associated with higher levels of HPA axis activity as measured by hair cortisol and cortisone among older adults using three approaches for compensating for missing data. METHOD: Cortisol and cortisone levels in hair samples from 4573 participants in the 6th wave (2012-2013) of the English Longitudinal Study of Ageing (ELSA) were examined, in relation to education, wealth, and social class. We compared CCA linear regression models with weighted and multiply imputed weighted linear regression models. RESULTS: Social groups with certain characteristics (i.e., ethnic minorities, in routine and manual occupations, physically inactive, with poorer health, and smokers) were less likely to have hair cortisol and hair cortisone data compared to the most advantaged groups. We found a consistent pattern of higher levels of hair cortisol and cortisone among the most socioeconomically disadvantaged groups compared to the most advantaged groups. Complete case approaches to missing data underestimated the levels of hair cortisol in education and social class and the levels of hair cortisone in education, wealth, and social class in the most disadvantaged groups. CONCLUSION: This study demonstrates that social disadvantage as measured by disadvantaged SEP is associated with increased HPA axis activity. The conceptualisation of social disadvantage as a chronic stressor may be valid, and previous studies reporting no associations between SEP and hair cortisol may be biased by their lack of consideration of missing data, which led to the underrepresentation of disadvantaged social groups in the analyses. Future analyses using biosocial data may need to consider and adjust for missing data.
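
The IPW approach compared above can be sketched in a few lines: model each participant's probability of having a hair biomarker measurement from fully observed covariates, then weight the analysis model by the inverse of that probability. The variables below are simulated stand-ins, not the ELSA data.

```python
# IPW: a response model for having biomarker data, then a weighted analysis model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 4_000
education_low = rng.binomial(1, 0.4, size=n)
smoker = rng.binomial(1, 0.25, size=n)
log_cortisol = 1.0 + 0.15 * education_low + rng.normal(scale=0.5, size=n)

# Disadvantaged and smoking participants are less likely to have a hair sample
p_observed = 1 / (1 + np.exp(-(1.5 - 0.8 * education_low - 0.6 * smoker)))
observed = rng.uniform(size=n) < p_observed

resp = sm.Logit(observed.astype(float),
                sm.add_constant(np.column_stack([education_low, smoker]))).fit(disp=0)
weights = 1 / resp.predict()[observed]

ipw = sm.WLS(log_cortisol[observed],
             sm.add_constant(education_low[observed]), weights=weights).fit()
cca = sm.OLS(log_cortisol[observed],
             sm.add_constant(education_low[observed])).fit()
print(f"CCA coefficient: {cca.params[1]:.3f}   IPW coefficient: {ipw.params[1]:.3f}")
```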


Subjects
Cortisone , Hair , Hydrocortisone , Hypothalamo-Hypophyseal System , Pituitary-Adrenal System , Social Class , Stress, Psychological , Humans , Hypothalamo-Hypophyseal System/metabolism , Hydrocortisone/metabolism , Hydrocortisone/analysis , Pituitary-Adrenal System/metabolism , Aged , Hair/chemistry , Hair/metabolism , Male , Female , Cortisone/metabolism , Cortisone/analysis , England , Stress, Psychological/metabolism , Middle Aged , Longitudinal Studies , Socioeconomic Factors , Aged, 80 and over , Aging/physiology , Aging/metabolism
15.
BMC Med Inform Decis Mak ; 24(1): 206, 2024 Jul 24.
Article in English | MEDLINE | ID: mdl-39049049

ABSTRACT

BACKGROUND: Electronic Health Records (EHR) are widely used to develop clinical prediction models (CPMs). However, one of the challenges is that there is often a degree of informative missing data. For example, laboratory measures are typically taken only when a clinician is concerned there is a need. When data are the so-called Not Missing at Random (NMAR), analytic strategies based on other missingness mechanisms are inappropriate. In this work, we seek to compare the impact of different strategies for handling missing data on CPM performance. METHODS: We considered a predictive model for rapid inpatient deterioration as an exemplar implementation. This model incorporated twelve laboratory measures with varying levels of missingness. Five labs had missingness rates around 50%, and the other seven had missingness rates around 90%. We included them based on the belief that their missingness status can be highly informative for prediction. In our study, we explicitly compared various missing data strategies: mean imputation, normal-value imputation, conditional imputation, categorical encoding, and missingness embeddings. Some of these were also combined with last observation carried forward (LOCF). We implemented logistic LASSO regression, multilayer perceptron (MLP), and long short-term memory (LSTM) models as the downstream classifiers. We compared the AUROC on test data and used bootstrapping to construct 95% confidence intervals. RESULTS: We had 105,198 inpatient encounters, with 4.7% having experienced the deterioration outcome of interest. LSTM models generally outperformed the cross-sectional models, with embedding approaches and categorical encoding yielding the best results. For the cross-sectional models, normal-value imputation with LOCF generated the best results. CONCLUSION: Strategies that accounted for the possibility of NMAR missing data yielded better model performance than those that did not. The embedding method had an advantage as it did not require prior clinical knowledge. Using LOCF could enhance the performance of cross-sectional models but had the opposite effect in LSTM models.
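
Two of the simpler strategies compared above, explicit missingness indicators (a form of categorical encoding of absence) and LOCF with a normal-value fallback, can be sketched with pandas on made-up lab data (hypothetical column names, not the study's EHR):

```python
# Missingness indicators plus LOCF within an encounter, then a normal-value fill.
import numpy as np
import pandas as pd

labs = pd.DataFrame({
    "encounter_id": [1, 1, 1, 2, 2],
    "hour":         [0, 6, 12, 0, 6],
    "lactate":      [1.2, np.nan, np.nan, np.nan, 3.4],
    "creatinine":   [np.nan, 1.1, np.nan, 0.9, np.nan],
})

for col in ["lactate", "creatinine"]:
    labs[f"{col}_missing"] = labs[col].isna().astype(int)   # 1 = not measured

normal_values = {"lactate": 1.0, "creatinine": 0.9}          # assumed normal levels
filled = labs.groupby("encounter_id")[["lactate", "creatinine"]].ffill()
labs[["lactate", "creatinine"]] = filled.fillna(value=normal_values)
print(labs)
```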


Subjects
Electronic Health Records , Humans , Clinical Deterioration , Models, Statistical , Clinical Laboratory Techniques
16.
Interact J Med Res ; 13: e50849, 2024 Jul 31.
Article in English | MEDLINE | ID: mdl-39083801

ABSTRACT

BACKGROUND: The impact of missing data on individual continuous glucose monitoring (CGM) data is unknown but can influence clinical decision-making for patients. OBJECTIVE: We aimed to investigate the consequences of data loss on glucose metrics in individual patient recordings from continuous glucose monitors and assess its implications on clinical decision-making. METHODS: The CGM data were collected from patients with type 1 and 2 diabetes using the FreeStyle Libre sensor (Abbott Diabetes Care). We selected 7-28 days of continuous 24-hour data without any missing values from each individual patient. To mimic real-world data loss, missing data ranging from 5% to 50% were introduced into the data set. From this modified data set, clinical metrics including time below range (TBR), TBR level 2 (TBR2), and other common glucose metrics were calculated in the data sets with and without data loss. Recordings in which glucose metrics deviated relevantly due to data loss, as determined by clinical experts, were defined as expert panel boundary errors (εEPB). These errors were expressed as a percentage of the total number of recordings. The errors for the recordings with glucose management indicator <53 mmol/mol were investigated. RESULTS: A total of 84 patients contributed 798 recordings over 28 days. With 5%-50% data loss for 7-28 day recordings, the εEPB varied from 0 out of 798 (0.0%) to 147 out of 736 (20.0%) recordings for TBR and from 0 out of 612 (0.0%) to 22 out of 408 (5.4%) recordings for TBR2. In the case of 14-day recordings, TBR and TBR2 episodes completely disappeared due to 30% data loss in 2 out of 786 (0.3%) and 32 out of 522 (6.1%) of the cases, respectively. However, the initial values of the disappeared TBR and TBR2 were relatively small (<0.1%). In the recordings with glucose management indicator <53 mmol/mol, the εEPB was 9.6% for 14 days with 30% data loss. CONCLUSIONS: With a maximum of 30% data loss in 14-day CGM recordings, there is minimal impact of missing data on the clinical interpretation of various glucose metrics. TRIAL REGISTRATION: ClinicalTrials.gov NCT05584293; https://clinicaltrials.gov/study/NCT05584293.
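
The data-loss experiment can be mimicked in a few lines: remove a fixed fraction of readings from a complete trace and recompute a glucose metric such as TBR. The sketch below uses simulated glucose values, not FreeStyle Libre recordings.

```python
# Introduce synthetic data loss into a CGM trace and recompute time below range (TBR).
import numpy as np

rng = np.random.default_rng(4)
n_readings = 14 * 24 * 4                         # 14 days of 15-minute readings
glucose = rng.normal(loc=8.0, scale=2.5, size=n_readings).clip(2.0, 22.0)  # mmol/L

def tbr(values):
    """Percentage of readings below 3.9 mmol/L."""
    return 100.0 * np.mean(values < 3.9)

for loss in (0.05, 0.30, 0.50):
    keep = rng.uniform(size=n_readings) >= loss  # drop `loss` fraction at random
    print(f"{loss:.0%} data loss: TBR {tbr(glucose[keep]):.2f}% "
          f"(complete data: {tbr(glucose):.2f}%)")
```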

17.
Heliyon ; 10(13): e33826, 2024 Jul 15.
Article in English | MEDLINE | ID: mdl-39027625

ABSTRACT

Although presepsin, a crucial biomarker for the diagnosis and management of sepsis, has gained prominence in contemporary medical research, its relationship with routine laboratory parameters, including demographic data and hospital blood test data, remains underexplored. This study integrates machine learning with explainable artificial intelligence (XAI) to provide insights into the relationship between presepsin and these parameters. Advanced machine learning classifiers provide a multilateral view of data and play an important role in highlighting the interrelationships between presepsin and other parameters. XAI enhances analysis by ensuring transparency in the model's decisions, especially in selecting key parameters that significantly enhance classification accuracy. Utilizing XAI, this study successfully identified critical parameters that increased the predictive accuracy for sepsis patients, achieving a remarkable ROC AUC of 0.97 and an accuracy of 0.94. This breakthrough is possibly attributed to the comprehensive utilization of XAI in refining parameter selection, thus leading to these significant predictive metrics. The presence of missing data in datasets is another concern; this study addresses it by employing Extreme Gradient Boosting (XGBoost) to manage missing data, effectively mitigating potential biases while preserving both the accuracy and relevance of the results. The perspective of examining data from higher dimensions using machine learning transcends traditional observation and analysis. The findings of this study hold the potential to enhance patient diagnoses and treatment, underscoring the value of merging traditional research methods with advanced analytical tools.
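
One practical point behind the abstract is that XGBoost's tree learner accepts feature matrices containing NaN directly, learning a default split direction for missing values, so no prior imputation is required. A short sketch on synthetic data (stand-ins for presepsin and routine laboratory values, not the study's dataset):

```python
# XGBoost trains directly on data with NaNs; missing values follow learned defaults.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(5)
n = 2_000
X = rng.normal(size=(n, 6))                          # e.g. presepsin + routine labs
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)
X[rng.uniform(size=X.shape) < 0.2] = np.nan          # 20% of values missing

model = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X, y)                                      # no imputation step needed
print(f"training accuracy with missing data: {model.score(X, y):.3f}")
```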

18.
Brain Commun ; 6(4): fcae219, 2024.
Article in English | MEDLINE | ID: mdl-39035417

ABSTRACT

Alzheimer's disease is a highly heterogeneous disease in which different biomarkers are dynamic over different windows of the decades-long pathophysiological processes, and potentially have distinct involvement in different subgroups. Subtype and Stage Inference is an unsupervised learning algorithm that disentangles the phenotypic heterogeneity and temporal progression of disease biomarkers, providing disease insight and quantitative estimates of individual subtype and stage. However, a key limitation of Subtype and Stage Inference is that it requires a complete set of biomarkers for each subject, reducing the number of datapoints available for model fitting and limiting applications of Subtype and Stage Inference to modalities that are widely collected, e.g. volumetric biomarkers derived from structural MRI. In this study, we adapted the Subtype and Stage Inference algorithm to handle missing data, enabling the application of Subtype and Stage Inference to multimodal data (magnetic resonance imaging, positron emission tomography, cerebrospinal fluid and cognitive tests) from 789 participants in the Alzheimer's Disease Neuroimaging Initiative. Missing-data Subtype and Stage Inference identified five subtypes having distinct progression patterns, which we describe by the earliest unique abnormality as 'Typical AD with Early Tau', 'Typical AD with Late Tau', 'Cortical', 'Cognitive' and 'Subcortical'. These new multimodal subtypes were differentially associated with age, years of education, Apolipoprotein E (APOE4) status, white matter hyperintensity burden and the rate of conversion from mild cognitive impairment to Alzheimer's disease, with the 'Cognitive' subtype showing the fastest clinical progression, and the 'Subcortical' subtype the slowest. Overall, we demonstrate that missing-data Subtype and Stage Inference reveals a finer landscape of Alzheimer's disease subtypes, each of which are associated with different risk factors. Missing-data Subtype and Stage Inference has broad utility, enabling the prediction of progression in a much wider set of individuals, rather than being restricted to those with complete data.

19.
Comput Methods Programs Biomed ; 254: 108308, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38968829

ABSTRACT

BACKGROUND AND OBJECTIVE: In the field of lung cancer research, particularly in the analysis of overall survival (OS), artificial intelligence (AI) serves crucial roles with specific aims. Given the prevalent issue of missing data in the medical domain, our primary objective is to develop an AI model capable of dynamically handling this missing data. Additionally, we aim to leverage all accessible data, effectively analyzing both uncensored patients who have experienced the event of interest and censored patients who have not, by embedding a specialized technique within our AI model, not commonly utilized in other AI tasks. Through the realization of these objectives, our model aims to provide precise OS predictions for non-small cell lung cancer (NSCLC) patients, thus overcoming these significant challenges. METHODS: We present a novel approach to survival analysis with missing values in the context of NSCLC, which exploits the strengths of the transformer architecture to account only for available features without requiring any imputation strategy. More specifically, this model tailors the transformer architecture to tabular data by adapting its feature embedding and masked self-attention to mask missing data and fully exploit the available data. By making use of ad hoc losses designed for OS, it is able to account for both censored and uncensored patients, as well as changes in risk over time. RESULTS: We compared our method with state-of-the-art models for survival analysis coupled with different imputation strategies. We evaluated the results obtained over a period of 6 years using different time granularities, obtaining a Ct-index, a time-dependent variant of the C-index, of 71.97, 77.58 and 80.72 for time units of 1 month, 1 year and 2 years, respectively, outperforming all state-of-the-art methods regardless of the imputation method used. CONCLUSIONS: The results show that our model not only outperforms state-of-the-art methods but also simplifies the analysis in the presence of missing data, by effectively eliminating the need to identify the most appropriate imputation strategy for predicting OS in NSCLC patients.
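
The masking idea, attending only to observed features, can be sketched with PyTorch's built-in transformer encoder: each tabular feature becomes a token and the missingness pattern becomes a key-padding mask. This is a generic illustration under assumed dimensions, not the authors' architecture or survival losses.

```python
# Mask missing tabular features so self-attention only attends to observed ones.
import torch
import torch.nn as nn

batch, n_features, d_model = 32, 12, 64
x = torch.randn(batch, n_features)
x[torch.rand(batch, n_features) < 0.3] = float("nan")    # 30% of entries missing

observed = ~torch.isnan(x)
values = torch.nan_to_num(x, nan=0.0).unsqueeze(-1)      # placeholder for missing

feature_embed = nn.Embedding(n_features, d_model)        # one embedding per feature
idx = torch.arange(n_features).expand(batch, -1)
tokens = feature_embed(idx) * values                     # value-scaled feature tokens

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
out = encoder(tokens, src_key_padding_mask=~observed)    # True = ignore this token
print(out.shape)                                         # torch.Size([32, 12, 64])
```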


Subjects
Carcinoma, Non-Small-Cell Lung , Deep Learning , Lung Neoplasms , Humans , Lung Neoplasms/mortality , Carcinoma, Non-Small-Cell Lung/mortality , Survival Analysis , Algorithms , Male , Female , Prognosis , Artificial Intelligence
20.
Mol Ecol Resour ; : e13992, 2024 Jul 06.
Article in English | MEDLINE | ID: mdl-38970328

ABSTRACT

Current methodologies of genome-wide single-nucleotide polymorphism (SNP) genotyping produce large amounts of missing data that may affect statistical inference and bias the outcome of experiments. Genotype imputation is routinely used in well-studied species to buffer the impact in downstream analysis, and several algorithms are available to fill in missing genotypes. The lack of reference haplotype panels precludes the use of these methods in genomic studies on non-model organisms. As an alternative, machine learning algorithms are employed to explore the genotype data and to estimate the missing genotypes. Here, we propose an imputation method based on self-organizing maps (SOM), a widely used type of neural network formed by spatially distributed neurons that cluster similar inputs onto nearby neurons. The method explores genotype datasets to select SNP loci to build binary vectors from the genotypes, and initializes and trains neural networks for each query missing SNP genotype. The SOM-derived clustering is then used to impute the best genotype. To automate the imputation process, we have implemented gtImputation, an open-source application programmed in Python3 with a user-friendly GUI to facilitate the whole process. The method's performance was validated by comparing its accuracy, precision and sensitivity with those of other available imputation algorithms on several benchmark genotype datasets. Our approach produced highly accurate and precise genotype imputations even for SNPs with alleles at low frequency and outperformed other algorithms, especially for datasets from mixed populations with unrelated individuals.
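
A much-simplified version of the SOM idea can be put together with the MiniSom library; this illustrates the cluster-then-impute principle only and is not the gtImputation implementation. Individuals are clustered on a small SOM and a missing genotype is imputed from the complete individuals that map to the same neuron.

```python
# Toy SOM-based genotype imputation with MiniSom (0/1/2 genotype coding).
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(8)
n_ind, n_snp = 300, 40
genotypes = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)
query = genotypes[0].copy()
query[5] = np.nan                                   # missing genotype to impute

reference = genotypes[1:]                           # complete individuals
som = MiniSom(6, 6, n_snp, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(reference, 2000)

# Map the query with its missing locus temporarily set to the reference mean
filled_query = query.copy()
filled_query[5] = reference[:, 5].mean()
node = som.winner(filled_query)

# Most frequent genotype at the missing locus among individuals in the same neuron
members = np.array([row for row in reference if som.winner(row) == node])
imputed = (np.bincount(members[:, 5].astype(int)).argmax()
           if members.size else int(round(reference[:, 5].mean())))
print(f"imputed genotype: {imputed} (true value {int(genotypes[0, 5])})")
```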
