Results 1 - 20 of 2,088
1.
BMC Bioinformatics ; 25(1): 317, 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-39354334

ABSTRACT

BACKGROUND: Single-cell RNA sequencing (scRNA-seq) technology has emerged as a crucial tool for studying cellular heterogeneity. However, dropout events (zeros introduced by the sequencing process rather than true absence of expression) are inherent to the technology and pose challenges in downstream analysis and interpretation. Imputing these dropouts is therefore a critical concern in scRNA-seq data analysis. Existing imputation methods rely predominantly on statistical or machine learning approaches and often overlook inter-sample correlations. RESULTS: To address this limitation, we introduce SAE-Impute, a new computational method for imputing single-cell data that combines subspace regression with an autoencoder to enhance the accuracy and reliability of the imputation process. Specifically, SAE-Impute assesses sample correlations via subspace regression, predicts potential dropout values, and then leverages these predictions within an autoencoder framework for interpolation. To validate the performance of SAE-Impute, we systematically conducted experiments on both simulated and real scRNA-seq datasets. The results show that SAE-Impute effectively reduces false-negative signals in single-cell data and enhances the recovery of dropout values as well as gene-gene and cell-cell correlations. Finally, we conducted several downstream analyses on the imputed scRNA-seq data, including identification of differentially expressed genes, cell clustering and visualization, and cell trajectory construction. CONCLUSIONS: These results demonstrate that SAE-Impute effectively reduces dropouts in single-cell datasets, thereby improving the functional interpretability of the data.
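As a rough illustration of the autoencoder half of such a pipeline (not the authors' SAE-Impute code, and omitting the subspace-regression step), the sketch below trains a small denoising autoencoder to reconstruct masked entries of a toy cell-by-gene matrix; all layer sizes, hyperparameters and the simulated data are assumptions.

```python
# Minimal sketch: train a denoising autoencoder to fill zero (dropout) entries
# of a cell-by-gene expression matrix. Illustrative only; not SAE-Impute itself.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(500, 200)).astype(np.float32)      # toy counts
X = np.log1p(X)
mask = rng.random(X.shape) < 0.3                                # simulated dropouts
X_obs = X.copy()
X_obs[mask] = 0.0

x_obs = torch.tensor(X_obs)
obs = torch.tensor(~mask, dtype=torch.float32)                  # 1 = observed

model = nn.Sequential(
    nn.Linear(X.shape[1], 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, X.shape[1]),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    opt.zero_grad()
    recon = model(x_obs)
    # reconstruct only the observed entries; the network learns gene-gene structure
    loss = (((recon - x_obs) ** 2) * obs).sum() / obs.sum()
    loss.backward()
    opt.step()

with torch.no_grad():
    imputed = torch.where(obs.bool(), x_obs, model(x_obs))      # fill dropouts only
    rmse = torch.sqrt((((imputed - torch.tensor(X)) ** 2)[~obs.bool()]).mean())
print("imputation RMSE on masked entries:", float(rmse))
```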


Subject(s)
Sequence Analysis, RNA , Single-Cell Analysis , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Computational Biology/methods , Algorithms , Humans , Machine Learning , Software
2.
Caspian J Intern Med ; 15(4): 615-622, 2024.
Article in English | MEDLINE | ID: mdl-39359440

ABSTRACT

Background: Diabetes, a disease of growing concern, has severe consequences for individuals' health. The present study aimed to investigate factors affecting longitudinal changes in blood sugar in diabetic patients, using a three-level analysis in the presence of missing data. Methods: A total of 526 diabetic patients were followed longitudinally, selected from annual data on the rural population monitored by Tonekabon health centers in the north of Iran during 2018-2019, drawn from the Iranian Integrated Health System (SIB) database. A three-level model (level 1: observation/time, level 2: subject, level 3: health center) was fitted to these longitudinal data, with multiple imputation of possible missing values. Results: Fitting the three-level model indicated that each unit increase in body mass index (BMI) significantly increased fasting blood sugar by an average of 0.5 mg/dl (p=0.024). The level-1 (observation) random effect was not significant in the three-level model, but the level-3 (healthcare center) random effect was highly significant (14.62, p<0.001). Conclusion: BMI reduction, the socioeconomic status of the healthcare centers, and the health services provided have potential effects on diabetes control.

3.
BMC Cardiovasc Disord ; 24(1): 544, 2024 Oct 09.
Article in English | MEDLINE | ID: mdl-39385080

ABSTRACT

BACKGROUND: Hypertension is a common disease that is often overlooked in its early stages because symptoms are mild, yet persistently elevated blood pressure can lead to adverse outcomes such as coronary heart disease, stroke, and kidney disease. Many risk factors contribute to hypertension, including exposure to various environmental chemicals, which are believed to be modifiable risk factors. OBJECTIVE: To investigate the role of environmental chemical exposures in predicting hypertension. METHODS: A total of 11,039 eligible participants were obtained from NHANES 2003-2016, and multiple imputation was used to handle missing data, yielding 5 imputed datasets. Eight machine learning algorithms were applied to the 5 imputed datasets to build hypertension prediction models, and the average accuracy, precision, recall, and F1 scores were calculated. A generalized linear model was also built to predict systolic and diastolic blood pressure levels. RESULTS: All 8 algorithms predicted hypertension well, with the Support Vector Machine (SVM) performing best: accuracy, precision, recall, F1 score and area under the curve (AUC) were 0.751, 0.699, 0.717, 0.708 and 0.822, respectively. The R2 of the linear model on the training and test sets was 0.28 and 0.25 for systolic and 0.06 and 0.05 for diastolic blood pressure. CONCLUSIONS: In this study, relatively accurate prediction of hypertension was achieved from environmental chemical exposures using machine learning algorithms, demonstrating the predictive value of environmental chemicals for hypertension.
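A minimal sketch of this kind of workflow, assuming synthetic data and scikit-learn stand-ins (IterativeImputer for the multiple imputation step, SVC for the SVM); it is not the authors' NHANES pipeline.

```python
# Sketch: multiple imputation followed by SVM classification, averaging metrics
# across imputed datasets. Synthetic data; illustrative only.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(1)
n, p = 2000, 10
X = rng.normal(size=(n, p))                       # stand-in for chemical exposures
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)   # "hypertension"
X[rng.random(X.shape) < 0.15] = np.nan            # inject missingness

scores = []
for m in range(5):                                # 5 imputed datasets
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    Xm = imp.fit_transform(X)
    Xtr, Xte, ytr, yte = train_test_split(Xm, y, test_size=0.3, random_state=0)
    clf = SVC(kernel="rbf").fit(Xtr, ytr)
    pred = clf.predict(Xte)
    scores.append([accuracy_score(yte, pred), precision_score(yte, pred),
                   recall_score(yte, pred), f1_score(yte, pred)])

print("mean accuracy/precision/recall/F1:", np.round(np.mean(scores, axis=0), 3))
```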


Subject(s)
Blood Pressure , Environmental Exposure , Hypertension , Nutrition Surveys , Predictive Value of Tests , Humans , Hypertension/epidemiology , Hypertension/diagnosis , Hypertension/physiopathology , Male , Female , Risk Assessment , Middle Aged , United States/epidemiology , Risk Factors , Environmental Exposure/adverse effects , Adult , Blood Pressure/drug effects , Support Vector Machine , Environmental Pollutants/adverse effects , Machine Learning , Reproducibility of Results , Aged , Time Factors , Cross-Sectional Studies
4.
BMC Med Res Methodol ; 24(1): 231, 2024 Oct 07.
Article in English | MEDLINE | ID: mdl-39375597

ABSTRACT

BACKGROUND: Epidemiological and clinical studies often have missing data, frequently analysed using multiple imputation (MI). In general, MI estimates will be biased if data are missing not at random (MNAR). Bias due to data MNAR can be reduced by including other variables ("auxiliary variables") in imputation models, in addition to those required for the substantive analysis. Common advice is to take an inclusive approach to auxiliary variable selection (i.e. include all variables thought to be predictive of missingness and/or the missing values). There are no clear guidelines about the impact of this strategy when data may be MNAR. METHODS: We explore the impact of including an auxiliary variable predictive of missingness but, in truth, unrelated to the partially observed variable, when data are MNAR. We quantify, algebraically and by simulation, the magnitude of the additional bias of the MI estimator for the exposure coefficient (fitting either a linear or logistic regression model), when the (continuous or binary) partially observed variable is either the analysis outcome or the exposure. Here, "additional bias" refers to the difference in magnitude of the MI estimator when the imputation model includes (i) the auxiliary variable and the other analysis model variables, versus (ii) just the other analysis model variables, noting that both will be biased due to data MNAR. We illustrate the extent of this additional bias by re-analysing data from a birth cohort study. RESULTS: The additional bias can be relatively large when the outcome is partially observed and missingness is caused by the outcome itself, and even larger if missingness is caused by both the outcome and the exposure (when either the outcome or exposure is partially observed). CONCLUSIONS: When using MI, the naïve and commonly used strategy of including all available auxiliary variables should be avoided. We recommend including as auxiliary variables those variables most predictive of the partially observed variable, where these can be identified through consideration of plausible causal diagrams and missingness mechanisms, as well as data exploration (noting that associations with the partially observed variable in the complete records may be distorted due to selection bias).
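A toy simulation in the spirit of the setup described here, assuming an approximate multiple-imputation procedure via scikit-learn's IterativeImputer rather than the paper's algebra or software; the variable names and effect sizes are illustrative.

```python
# Sketch: additional MNAR bias from an auxiliary variable that predicts
# missingness but is unrelated to the partially observed outcome.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)                       # exposure
a = rng.normal(size=n)                       # auxiliary: predicts missingness only
y = 1.0 * x + rng.normal(size=n)             # outcome, true coefficient = 1
# MNAR: missingness of y depends on y itself and on a
p_miss = 1 / (1 + np.exp(-(y + a)))
y_obs = np.where(rng.random(n) < p_miss, np.nan, y)

def mi_slope(columns):
    """Approximate MI estimate of the x-coefficient, pooled over 5 imputations."""
    betas = []
    for m in range(5):
        imp = IterativeImputer(sample_posterior=True, random_state=m)
        filled = imp.fit_transform(np.column_stack(columns))
        betas.append(np.polyfit(x, filled[:, 0], 1)[0])
    return np.mean(betas)

print("MI without auxiliary:", round(mi_slope([y_obs, x]), 3))
print("MI with auxiliary:   ", round(mi_slope([y_obs, x, a]), 3))
# Both estimates are biased because y is MNAR; including the auxiliary variable
# that is unrelated to y typically adds further bias, as the paper quantifies.
```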


Subject(s)
Bias , Humans , Data Interpretation, Statistical , Models, Statistical , Computer Simulation , Algorithms , Logistic Models , Research Design/statistics & numerical data
5.
Sleep Adv ; 5(1): zpae069, 2024.
Article in English | MEDLINE | ID: mdl-39372544

ABSTRACT

Study Objectives: Obstructive sleep apnea (OSA) can induce excessive sleepiness, causing work-related injuries and low productivity. Most individuals with OSA in the United Kingdom are undiagnosed; in theory, workplace screening could identify these individuals and thereby improve both their health and overall productivity. However, the prevalence of OSA in different workplaces is unclear. This study aimed to estimate the prevalence of OSA by industry and occupation in England. Methods: The Health Survey for England 2019 dataset was combined with the Sleep Heart Health Study dataset. We applied multiple imputation to the combined dataset to estimate OSA prevalence in the English population aged 40-64. We estimated the pooled prevalence of OSA by both industry and occupation, separating samples by Standard Industry Classification and Standard Occupation Classification. Results: The overall OSA prevalence estimated by imputation for ages 40-64 was 17.8% (95% CI = 15.9% to 19.9%). When the samples were separated into industrial/occupational groups, the estimated prevalence of OSA varied widely by industry/occupation. Descriptive analysis revealed that the estimated prevalence of OSA was relatively higher in the Accommodation and food, Public administration and defence; compulsory social security, and Construction industries, and in the Protective service occupations, health and social care associate professionals, and skilled construction and building trades occupations. Conclusions: In England in 2019, the Accommodation and food, Public administration and defence; compulsory social security, and Construction industries, and the Protective service occupations, health and social care associate professionals, and skilled construction and building trades occupations showed a relatively higher prevalence of OSA, indicating that they may be target populations for workplace screening.

6.
Med Decis Making ; : 272989X241285038, 2024 Oct 08.
Article in English | MEDLINE | ID: mdl-39377510

ABSTRACT

BACKGROUND: Estimating change in health-related quality of life (HRQOL) from pre- to poststroke is challenging because HRQOL is rarely collected prior to stroke. Leveraging HRQOL data collected both before and after stroke, we sought to estimate the change in HRQOL from prestroke to early poststroke. METHODS: Stroke survivors completed the Patient-Reported Outcomes Measurement Information System Global Health (PROMIS-GH) scale at both pre- and early poststroke. Patient characteristics were compared for those who did and did not complete the PROMIS-GH. The mean change in PROMIS-GH T-score was estimated using complete case analysis, multiple imputation, and multiple imputation with delta adjustment. RESULTS: A total of 4,473 stroke survivors were included (mean age 63.1 ± 14.1 y, 47.5% female, 82.6% ischemic stroke). A total of 993 (22.2%) patients completed the PROMIS-GH prestroke, while 2,298 (51.4%) completed it early poststroke. Compared with those without PROMIS-GH, patients with prestroke PROMIS-GH had a worse comorbidity burden. Patients who completed the PROMIS-GH early poststroke had better early poststroke clinician-rated function and a shorter hospital length of stay. Complete case analysis and multiple imputation revealed that patients' PROMIS-GH T-scores worsened by 2 to 3 points. Multiple imputation with delta adjustment revealed that patients' PROMIS-GH T-scores worsened by 4 to 10 points, depending on the delta values chosen. CONCLUSIONS: Systematic differences in patients who completed the PROMIS-GH at both pre- and early poststroke suggest that missing PROMIS-GH scores may be missing not at random (MNAR). Multiple imputation with delta adjustment, which is better suited to MNAR data, may be a preferable method for analyzing change in HRQOL from pre- to poststroke. Given our study's large proportion of missing HRQOL data, future studies with less missing HRQOL data are necessary to verify our results. HIGHLIGHTS: Estimating the change in health-related quality of life from pre- to poststroke is challenging because health-related quality-of-life data are rarely collected prior to stroke. Previously used methods to assess the burden of stroke on health-related quality of life suffer from recall bias and selection bias. Using health-related quality-of-life data collected both before and after stroke, we sought to estimate the change in health-related quality of life after stroke using statistical methods that account for missing data. Comparisons of patients who did and did not complete health-related quality-of-life scales at both pre- and poststroke suggested that missing data may be missing not at random. Statistical methods that account for data that are missing not at random revealed more worsening in health-related quality of life after stroke than traditional methods such as complete case analysis or multiple imputation.
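A minimal sketch of delta-adjusted multiple imputation of this kind, assuming synthetic T-score-like data and scikit-learn's IterativeImputer as an approximate MI engine; it is not the study's code, and the delta values are arbitrary.

```python
# Sketch: multiple imputation with delta adjustment for data suspected to be
# MNAR. Imputations are drawn under a MAR model and then shifted downward by a
# fixed delta before pooling. Synthetic data; illustrative only.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
n = 5000
age = rng.normal(63, 14, n)
tscore = rng.normal(50 - 0.1 * (age - 63), 8)          # toy PROMIS-GH-like T-score
# sicker (lower-scoring) patients are less likely to respond -> MNAR
p_obs = 1 / (1 + np.exp(-(tscore - 50) / 8))
t_obs = np.where(rng.random(n) < p_obs, tscore, np.nan)

def pooled_mean(delta, n_imp=20):
    means = []
    for m in range(n_imp):
        imp = IterativeImputer(sample_posterior=True, random_state=m)
        filled = imp.fit_transform(np.column_stack([t_obs, age]))[:, 0]
        filled[np.isnan(t_obs)] += delta       # delta adjustment of imputed values
        means.append(filled.mean())
    return np.mean(means)

for delta in (0, -2, -5, -10):                 # 0 = standard MAR-based MI
    print(f"delta={delta:>4}: pooled mean T-score = {pooled_mean(delta):.2f}")
print("true mean:", round(tscore.mean(), 2))
```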

7.
Proc (IEEE Int Conf Healthc Inform) ; 2024: 177-182, 2024 Jun.
Article in English | MEDLINE | ID: mdl-39387063

ABSTRACT

The imputation of missing values (IMV) in electronic health records tabular data is crucial for enabling machine learning in patient-specific predictive modeling. While IMV methods have been developed in biostatistics and, more recently, in machine learning, deep learning-based solutions have shown limited success in learning from tabular data. This paper proposes a novel attention-based missing value imputation framework that learns to reconstruct data with missing values by leveraging between-feature (self-attention) or between-sample attention. We adopt data manipulation methods used in contrastive learning to improve the generalization of the trained imputation model. The proposed self-attention imputation method outperforms state-of-the-art statistical and machine learning-based (decision-tree) imputation methods, reducing the normalized root mean squared error by 18.4% to 74.7% on five tabular data sets and by 52.6% to 82.6% on two electronic health records data sets. The proposed attention-based missing value imputation method shows superior performance across a wide range of missingness (10% to 50%) when values are missing completely at random.
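A minimal sketch of between-feature (self-attention) reconstruction of masked tabular values, assuming PyTorch and toy data; it is not the paper's framework and omits the contrastive-learning data manipulation.

```python
# Sketch: treat each feature as a token, apply self-attention across features,
# and train to reconstruct observed values; missing entries are read off the
# reconstruction. Illustrative only.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 12)).astype(np.float32)       # toy tabular data
mask = rng.random(X.shape) < 0.3                          # 30% missing
x = torch.tensor(np.where(mask, 0.0, X), dtype=torch.float32)
obs = torch.tensor(~mask, dtype=torch.float32)
x_true = torch.tensor(X)

d = 32
embed = nn.Linear(1, d)                                   # each feature -> a token
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
readout = nn.Linear(d, 1)
params = list(embed.parameters()) + list(attn.parameters()) + list(readout.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def reconstruct(batch):
    tokens = embed(batch.unsqueeze(-1))                   # (B, n_features, d)
    out, _ = attn(tokens, tokens, tokens)                 # between-feature attention
    return readout(out).squeeze(-1)                       # (B, n_features)

for step in range(300):
    opt.zero_grad()
    recon = reconstruct(x)
    loss = (((recon - x) ** 2) * obs).sum() / obs.sum()   # fit observed entries only
    loss.backward()
    opt.step()

with torch.no_grad():
    recon = reconstruct(x)
    nrmse = torch.sqrt(((recon - x_true) ** 2)[~obs.bool()].mean()) / x_true.std()
print("NRMSE on masked entries:", float(nrmse))
```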

8.
Cell Genom ; 4(10): 100668, 2024 Oct 09.
Article in English | MEDLINE | ID: mdl-39389019

ABSTRACT

Genetic factors significantly influence the concentration of metabolites in adults. Nevertheless, the genetic influence on neonatal metabolites remains uncertain. To bridge this gap, we employed genotype imputation techniques on large-scale low-pass genome data obtained from non-invasive prenatal testing. Subsequently, we conducted association studies on a total of 75 metabolic components in neonates. The study identified 19 previously reported associations and 11 novel associations between single-nucleotide polymorphisms and metabolic components. These associations were initially found in the discovery cohort (8,744 participants) and subsequently confirmed in a replication cohort (19,041 participants). The average heritability of metabolic components was estimated to be 76.2%, with a range of 69%-78.8%. These findings offer valuable insights into the genetic architecture of neonatal metabolism.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Humans , Infant, Newborn , Female , Male , Cohort Studies , Genotype , Metabolome/genetics
9.
Cell Genom ; 4(10): 100669, 2024 Oct 09.
Article in English | MEDLINE | ID: mdl-39389018

ABSTRACT

Non-invasive prenatal testing (NIPT) employs ultra-low-pass sequencing of maternal plasma cell-free DNA to detect fetal trisomy. Its global adoption has established NIPT as a large human genetic resource for exploring genetic variations and their associations with phenotypes. Here, we present methods for analyzing large-scale, low-depth NIPT data, including customized algorithms and software for genetic variant detection, genotype imputation, family relatedness, population structure inference, and genome-wide association analysis of maternal genomes. Our results demonstrate accurate allele frequency estimation and high genotype imputation accuracy (R2>0.84) for NIPT sequencing depths from 0.1× to 0.3×. We also achieve effective classification of duplicates and first-degree relatives, along with robust principal-component analysis. Additionally, we obtain an R2>0.81 for estimating genetic effect sizes across genotyping and sequencing platforms with adequate sample sizes. These methods offer a robust theoretical and practical foundation for utilizing NIPT data in medical genetic research.
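As a small illustration of how per-variant imputation accuracy (the squared correlation, R2, between imputed dosages and true genotypes) is typically summarized, here is a toy calculation with simulated genotypes; it is not the authors' pipeline.

```python
# Sketch: per-variant imputation R2 between imputed dosages and true genotypes,
# as commonly reported for low-pass sequencing data. Toy data; illustrative only.
import numpy as np

rng = np.random.default_rng(5)
n_samples, n_variants = 2000, 50
freqs = rng.uniform(0.05, 0.5, n_variants)
true_geno = rng.binomial(2, freqs, size=(n_samples, n_variants)).astype(float)
# imputed dosages = truth plus noise whose size stands in for sequencing depth
imputed = np.clip(true_geno + rng.normal(0, 0.6, true_geno.shape), 0, 2)

r2 = np.array([np.corrcoef(true_geno[:, j], imputed[:, j])[0, 1] ** 2
               for j in range(n_variants)])
print(f"mean imputation R2 across variants: {r2.mean():.3f}")
```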


Subject(s)
Genome-Wide Association Study , Humans , Female , Pregnancy , Genome-Wide Association Study/methods , Noninvasive Prenatal Testing/methods , Prenatal Diagnosis/methods , Gene Frequency , Algorithms , Genotype , Sequence Analysis, DNA/methods , Polymorphism, Single Nucleotide , Software
10.
Front Genet ; 15: 1444554, 2024.
Article in English | MEDLINE | ID: mdl-39385936

ABSTRACT

Introduction: Modern histocompatibility algorithms depend on the comparison and analysis of high-resolution HLA protein sequences and structures, especially when considering epitope-based algorithms, which aim to model the interactions involved in antibody or T cell binding. HLA genotype imputation can be performed in cases where only a low/intermediate-resolution HLA genotype is available or specific loci are missing, and providing an individual's race/ethnicity/ancestry information can make imputation results more accurate. This study assesses the effect of imputing high-resolution genotypes on molecular mismatch scores under a variety of ancestry assumptions. Methods: We compared molecular matching scores from "ground-truth" high-resolution genotypes against scores from genotypes imputed from low-resolution genotypes. Analysis focused on a simulated patient-donor dataset and was confirmed using two real-world datasets, and deviations were aggregated based on various ancestry assumptions. Results: We observed that multiple imputation generally results in lower error in molecular matching scores compared with single imputation, and that using the correct ancestry assumptions can reduce the error introduced during imputation. Discussion: We conclude that for epitope analysis, imputation is a valuable and low-risk strategy, as long as care is taken regarding the epitope analysis context, ancestry assumptions, and (multiple) imputation strategy.

11.
J Clin Epidemiol ; : 111539, 2024 Sep 24.
Article in English | MEDLINE | ID: mdl-39326470

ABSTRACT

OBJECTIVE: The development of clinical prediction models is often impeded by missing values in the predictors. Various methods for imputing missing values before modelling have been proposed. Some are based on variants of multiple imputation by chained equations, while others are based on single imputation. These methods may include elements of flexible modelling or machine learning algorithms, and for some of them user-friendly software packages are available. The aim of this study was to investigate by simulation whether some of these methods consistently outperform others on performance measures of clinical prediction models. STUDY DESIGN AND SETTING: We simulated development and validation cohorts by mimicking the observed distributions of predictors and outcome variable of a real data set. In the development cohorts, missing predictor values were created in 36 scenarios defined by the missingness mechanism and the proportion of non-complete cases. We applied three imputation algorithms available in R software: mice, aregImpute and missForest. These algorithms differ in their use of linear or flexible models, or random forests, the way of sampling from the predictive posterior distribution, and the generation of a single or multiple imputed data sets. For multiple imputation we also investigated the impact of the number of imputations. Logistic regression models were fitted to the simulated development cohorts before (full data analysis) and after missing value generation (complete case analysis), and to the imputed data. Prognostic model performance was measured by the scaled Brier score, c-statistic, calibration intercept and slope, and by the mean absolute prediction error, evaluated in validation cohorts without missing values. Performance of full data analysis was considered ideal. RESULTS: None of the imputation methods achieved the predictive accuracy that would be obtained in the case of no missingness. In general, complete case analysis yielded the worst performance, and deviation from ideal performance increased with increasing percentage of missingness and decreasing sample size. Across all scenarios and performance measures, aregImpute and mice, both with 100 imputations, resulted in the highest predictive accuracy. Surprisingly, aregImpute outperformed full data analysis in achieving calibration slopes very close to 1 across all scenarios and outcome models. The improvement in mice's performance with 100 compared with 5 imputations was only marginal. The differences between the imputation methods decreased with increasing sample size and decreasing proportion of non-complete cases. CONCLUSION: In our simulation study, model calibration was more affected by the choice of imputation method than model discrimination. While differences in model performance after using imputation methods were generally small, multiple imputation methods such as mice and aregImpute, which can handle linear or nonlinear associations between predictors and outcome, are an attractive and reliable choice in most situations.
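A rough Python analogue of this kind of comparison, assuming scikit-learn stand-ins for the R packages (IterativeImputer for a mice-like chained-equations imputer, a random-forest-based IterativeImputer for a missForest-like imputer) and synthetic data; the scoring is limited to the Brier score and calibration slope.

```python
# Sketch: compare two imputation strategies before fitting a logistic prediction
# model, scoring Brier and calibration slope on complete validation data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(6)
n, p = 3000, 6
X = rng.normal(size=(n, p))
logit = X @ np.array([1.0, -0.8, 0.5, 0.0, 0.0, 0.0])
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
Xtr, ytr, Xval, yval = X[:2000].copy(), y[:2000], X[2000:], y[2000:]
Xtr[rng.random(Xtr.shape) < 0.25] = np.nan            # missingness in development data

imputers = {
    "chained equations (mice-like)": IterativeImputer(sample_posterior=True, random_state=0),
    "random forest (missForest-like)": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=30, random_state=0), random_state=0),
}
for name, imp in imputers.items():
    model = LogisticRegression(max_iter=1000).fit(imp.fit_transform(Xtr), ytr)
    pv = np.clip(model.predict_proba(Xval)[:, 1], 1e-6, 1 - 1e-6)
    brier = brier_score_loss(yval, pv)
    # calibration slope: logistic regression of the outcome on the linear predictor
    lp = np.log(pv / (1 - pv)).reshape(-1, 1)
    slope = LogisticRegression(max_iter=1000).fit(lp, yval).coef_[0, 0]
    print(f"{name}: Brier={brier:.3f}, calibration slope={slope:.2f}")
```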

12.
Public Health Nutr ; 27(1): e184, 2024 Sep 27.
Article in English | MEDLINE | ID: mdl-39327915

ABSTRACT

OBJECTIVE: Studies using the dietary inflammatory index often perform complete case analyses (CCA) to handle missing data, which may reduce the sample size and increase the risk of bias. Furthermore, population-level socio-economic differences in the energy-adjusted dietary inflammatory index (E-DII) have not been recently studied. Therefore, we aimed to describe socio-demographic differences in E-DII scores among American adults and compare the results using two statistical approaches for handling missing data, i.e. CCA and multiple imputation (MI). DESIGN: Cross-sectional analysis. E-DII scores were computed using a 24-hour dietary recall. Linear regression was used to compare the E-DII scores by age, sex, race/ethnicity, education and income using both CCA and MI. SETTING: USA. PARTICIPANTS: This study included 34 547 non-Hispanic White, non-Hispanic Black and Hispanic adults aged ≥ 20 years from the 2005-2018 National Health and Nutrition Examination Survey. RESULTS: The MI and CCA subpopulations comprised 34 547 and 23 955 participants, respectively. Overall, 57 % of the American adults reported 24-hour dietary intakes associated with inflammation. Both methods showed similar patterns wherein 24-hour dietary intakes associated with high inflammation were commonly reported among males, younger adults, non-Hispanic Black adults and those with lower education or income. Differences in point estimates between CCA and MI were mostly modest at ≤ 20 %. CONCLUSIONS: The two approaches for handling missing data produced comparable point estimates and 95 % CI. Differences in the E-DII scores by age, sex, race/ethnicity, education and income suggest that socio-economic disparities in health may be partially explained by the inflammatory potential of diet.


Subject(s)
Diet , Inflammation , Nutrition Surveys , Socioeconomic Factors , Humans , Male , Female , Adult , Cross-Sectional Studies , Inflammation/epidemiology , Middle Aged , Diet/statistics & numerical data , United States/epidemiology , Young Adult , Hispanic or Latino/statistics & numerical data , Aged , White People/statistics & numerical data , Black or African American/statistics & numerical data , Sociodemographic Factors
13.
Nan Fang Yi Ke Da Xue Xue Bao ; 44(8): 1561-1570, 2024 Aug 20.
Article in Chinese | MEDLINE | ID: mdl-39276052

ABSTRACT

OBJECTIVE: To evaluate the performance of a mutual-aid model for magnetic resonance imaging (MRI) multi-sequence feature imputation and fusion based on sequence deletion in differentiating high-grade glioma (HGG) from low-grade glioma (LGG). METHODS: We retrospectively collected multi-sequence MR images from 305 glioma patients, including 189 HGG patients and 116 LGG patients. The regions of interest (ROI) of T1-weighted images (T1WI), T2-weighted images (T2WI), T2 fluid-attenuated inversion recovery (T2_FLAIR) and post-contrast enhanced T1WI (CE_T1WI) were delineated to extract the radiomics features. The mutual-aid model of MRI multi-sequence feature imputation and fusion based on sequence deletion was used for imputation and fusion of the feature matrix with missing data. The discriminative ability of the model was evaluated using 5-fold cross-validation and by assessing the accuracy, balanced accuracy, area under the ROC curve (AUC), specificity, and sensitivity. The proposed model was quantitatively compared with other non-holonomic multimodal classification models for discriminating HGG and LGG. Class separability experiments were performed on the latent features learned by the proposed feature imputation and fusion methods to observe the classification of the samples in a two-dimensional plane. Convergence experiments were used to verify the feasibility of the model. RESULTS: For differentiation of HGG from LGG at a missing rate of 10%, the proposed model achieved accuracy, balanced accuracy, AUC, specificity, and sensitivity of 0.777, 0.768, 0.826, 0.754 and 0.780, respectively. The fused latent features showed excellent performance in the class separability experiment, and the algorithm could be iterated to convergence with classification performance superior to other methods at missing rates of 30% and 50%. CONCLUSION: The proposed model performs excellently on the HGG versus LGG classification task and outperforms other non-holonomic multimodal classification models, demonstrating its potential for efficient processing of non-holonomic multimodal data.


Subject(s)
Brain Neoplasms , Glioma , Magnetic Resonance Imaging , Humans , Glioma/diagnostic imaging , Glioma/pathology , Magnetic Resonance Imaging/methods , Retrospective Studies , Brain Neoplasms/diagnostic imaging , Brain Neoplasms/pathology , Algorithms , Neoplasm Grading , ROC Curve , Sensitivity and Specificity
14.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39302340

ABSTRACT

The Hardy-Weinberg equilibrium (HWE) assumption is essential to many population genetics models. Multiple tests have been developed to test whether it holds in observed genotypes. Current methods are divided into exact tests, applicable to small populations and a small number of alleles, and approximate goodness-of-fit tests. Existing tests cannot handle ambiguous typing in multi-allelic loci. Here we present a novel exact test, the Unambiguous Multi Allelic Test (UMAT), which is not limited by the number of alleles or the population size and is based on a perturbative approach around the current observations. We show its accuracy in detecting deviations from HWE. We then propose an additional model to handle ambiguous typing, using either sampling into UMAT or a goodness-of-fit test with a variance estimate that takes ambiguity into account, named the Asymptotic Statistical Test with Ambiguity (ASTA). We show the accuracy of ASTA and the possibility of detecting the source of deviation from HWE. We apply these tests to the HLA loci, reproducing multiple previously reported deviations from HWE and finding a large number of new ones.
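For orientation, a standard chi-square goodness-of-fit check of HWE at a multi-allelic locus looks like the sketch below; this is the classical approximate test, not UMAT or ASTA, and the genotype counts are made up.

```python
# Sketch: classical chi-square goodness-of-fit test of Hardy-Weinberg
# equilibrium at a multi-allelic locus (not the UMAT/ASTA tests of the paper).
from itertools import combinations_with_replacement
from scipy.stats import chi2

# toy observed genotype counts for a locus with alleles A, B, C
obs = {("A", "A"): 60, ("A", "B"): 95, ("A", "C"): 42,
       ("B", "B"): 45, ("B", "C"): 38, ("C", "C"): 10}
n = sum(obs.values())
alleles = ["A", "B", "C"]

# allele frequencies estimated from the genotype counts
freq = {a: sum(c * g.count(a) for g, c in obs.items()) / (2 * n) for a in alleles}

# expected counts under HWE: p_i^2 for homozygotes, 2*p_i*p_j for heterozygotes
chi2_stat = 0.0
for g in combinations_with_replacement(alleles, 2):
    p = freq[g[0]] ** 2 if g[0] == g[1] else 2 * freq[g[0]] * freq[g[1]]
    expected = n * p
    chi2_stat += (obs[g] - expected) ** 2 / expected

k = len(alleles)
df = k * (k + 1) // 2 - k          # genotype categories minus estimated allele freqs
print(f"chi2 = {chi2_stat:.2f}, df = {df}, p = {chi2.sf(chi2_stat, df):.4f}")
```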


Subject(s)
Genetics, Population , Humans , Polymorphism, Genetic , Models, Genetic , Alleles , Gene Frequency , Genotype , Genetic Loci
15.
Genetics ; 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39255064

ABSTRACT

The expansive collection of genetic and phenotypic data within biobanks offers an unprecedented opportunity for biomedical research. However, the frequent occurrence of missing phenotypes presents a significant barrier to fully leveraging this potential. In our target application, on one hand we have only a small and complete dataset with both genotypes and phenotypes to build a genetic prediction model, commonly called a polygenic (risk) score (PGS or PRS); on the other hand, we have a large dataset of genotypes (e.g. from a biobank) without the phenotype of interest. Our goal is to leverage the large dataset of genotypes (but without the phenotype) and a separate GWAS summary dataset of the phenotype to impute the phenotypes, which are then used as an individual-level dataset, along with the small complete dataset, to build a nonlinear model as the PGS. More specifically, we trained nonlinear models on 7 imputed and observed phenotypes from the UK Biobank data. We then trained an ensemble model to integrate these models for each trait, resulting in higher R2 values in prediction than using only the small complete (observed) dataset. Additionally, for 2 of the 7 traits, we observed that the nonlinear model trained with the imputed traits had a higher R2 than using the imputed traits directly as the PGS, while for the remaining 5 traits no improvement was found. These findings demonstrate the potential of leveraging existing genetic data and accounting for nonlinear genetic relationships to improve prediction accuracy for some traits.

16.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39242194

ABSTRACT

MOTIVATION: The single-cell RNA sequencing (scRNA-seq) technique enables transcriptome profiling of hundreds to tens of thousands of cells at the individual-cell level and provides new insights for studying cell heterogeneity. However, its advantages are hampered by dropout events. To address this problem, we propose a Blockwise Accelerated Non-negative Matrix Factorization framework with Structural network constraints (BANMF-S) to impute these technical zeros. RESULTS: BANMF-S constructs a gene-gene similarity network to integrate prior information from the external PPI network via the Triadic Closure Principle, and a cell-cell similarity network to capture the neighborhood structure and temporal information through a Minimum Spanning Tree. By collaboratively employing these two networks as regularizations, BANMF-S encourages the coherence of similar gene and cell pairs in the latent space, enhancing the potential to recover the underlying features. In addition, BANMF-S adopts a blockwise strategy to solve the traditional NMF problem with a distributed stochastic gradient descent method in parallel to accelerate the optimization. Numerical experiments on simulated and real datasets verify that BANMF-S can improve the accuracy of downstream clustering and pseudo-trajectory inference, and its performance is superior to seven state-of-the-art algorithms. AVAILABILITY: All data used in this work are downloaded from publicly available data sources, and their corresponding accession numbers or source URLs are provided in Supplementary File Section 5.1 Dataset Information. The source codes are publicly available in the GitHub repository https://github.com/jiayingzhao/BANMF-S.
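A minimal sketch of plain masked NMF imputation with multiplicative updates, omitting BANMF-S's network regularizers and blockwise parallel acceleration; the simulated matrix and rank are assumptions.

```python
# Sketch: masked non-negative matrix factorization for dropout imputation,
# optimized with multiplicative updates restricted to observed entries.
import numpy as np

rng = np.random.default_rng(7)
true = rng.gamma(2.0, 1.0, size=(300, 40)) @ rng.gamma(2.0, 1.0, size=(40, 500))
mask = (rng.random(true.shape) > 0.3).astype(float)        # 1 = observed, 0 = dropout
X = true * mask

k, eps = 40, 1e-9
W = rng.random((X.shape[0], k)) + 0.1
H = rng.random((k, X.shape[1])) + 0.1
for it in range(200):
    WH = W @ H
    # multiplicative updates of W and H, weighted by the observation mask
    W *= ((mask * X) @ H.T) / ((mask * WH) @ H.T + eps)
    H *= (W.T @ (mask * X)) / (W.T @ (mask * (W @ H)) + eps)

imputed = np.where(mask == 1, X, W @ H)                     # fill dropouts only
rel_err = np.linalg.norm((imputed - true)[mask == 0]) / np.linalg.norm(true[mask == 0])
print(f"relative error on dropout entries: {rel_err:.3f}")
```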


Subject(s)
Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Gene Regulatory Networks , Gene Expression Profiling/methods , Computational Biology/methods , Sequence Analysis, RNA/methods , Software
17.
Biometrics ; 80(3)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-39271117

ABSTRACT

In randomized controlled trials, adjusting for baseline covariates is commonly used to improve the precision of treatment effect estimation. However, covariates often have missing values. Recently, Zhao and Ding studied two simple strategies, the single imputation method and the missingness-indicator method (MIM), to handle missing covariates and showed that both methods can provide an efficiency gain compared with not adjusting for covariates. To better understand and compare these two strategies, we propose and investigate a novel theoretical imputation framework termed cross-world imputation (CWI). This framework includes both single imputation and MIM as special cases, facilitating the comparison of their efficiency. Through the lens of CWI, we show that MIM implicitly searches for the optimal CWI values and thus achieves optimal efficiency. We also derive conditions under which the single imputation method, by searching for the optimal single imputation values, can achieve the same efficiency as MIM. We illustrate our findings through simulation studies and a real data analysis based on the Childhood Adenotonsillectomy Trial. We conclude by discussing the practical implications of our findings.
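A small simulated-trial sketch contrasting the two strategies discussed here (single mean imputation versus the missingness-indicator method), assuming statsmodels OLS and made-up data; it only illustrates the efficiency comparison, not the CWI theory.

```python
# Sketch: covariate adjustment in a simulated randomized trial when the
# baseline covariate has missing values, comparing single mean imputation with
# the missingness-indicator method (MIM). Illustrative only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 5000
z = rng.binomial(1, 0.5, n)                     # randomized treatment
x = rng.normal(size=n)                          # baseline covariate
y = 1.0 * z + 2.0 * x + rng.normal(size=n)      # outcome; true effect = 1
x_obs = np.where(rng.random(n) < 0.3, np.nan, x)

def fit(design):
    res = sm.OLS(y, sm.add_constant(design)).fit()
    return res.params[1], res.bse[1]            # treatment coefficient and its SE

x_mean = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)
r = np.isnan(x_obs).astype(float)               # missingness indicator

for label, design in [
    ("unadjusted", np.column_stack([z])),
    ("single mean imputation", np.column_stack([z, x_mean])),
    ("missingness-indicator method", np.column_stack([z, x_mean, r])),
]:
    est, se = fit(design)
    print(f"{label:30s} effect = {est:.3f}  SE = {se:.4f}")
```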


Subject(s)
Computer Simulation , Models, Statistical , Randomized Controlled Trials as Topic , Randomized Controlled Trials as Topic/statistics & numerical data , Randomized Controlled Trials as Topic/methods , Humans , Data Interpretation, Statistical , Child , Biometry/methods , Adenoidectomy/statistics & numerical data , Tonsillectomy/statistics & numerical data
18.
Front Big Data ; 7: 1422650, 2024.
Article in English | MEDLINE | ID: mdl-39234189

ABSTRACT

Time series data are recorded in many sectors, producing large amounts of data. However, the continuity of these data is often interrupted, leaving periods of missing data. Several algorithms are used to impute the missing data, and the performance of these methods varies widely. Apart from the choice of algorithm, effective imputation depends on the nature of the missing and available data. We conducted extensive studies using different types of time series data, specifically heart rate data and power consumption data. We generated missing data over different time spans and imputed them using different algorithms applied to binned data of different sizes. Performance was evaluated using the root mean square error (RMSE) metric. We observed a reduction in RMSE when using binned data compared with the entire dataset, particularly for the expectation-maximization (EM) algorithm. RMSE was reduced when using binned data for 1-, 5-, and 15-min spans of missing data, with the greatest reduction observed for 15-min missing data. We also observed the effect of fluctuation within the data. We conclude that the usefulness of binned data depends on the span of missing data, the sampling frequency of the data, and the fluctuation within the data. Depending on the inherent characteristics, quality, and quantity of the missing and available data, binned data can be used to impute a wide variety of data, including biological heart rate data derived from Internet of Things (IoT) smartwatches and non-biological data such as household power consumption data.

19.
J Am Coll Cardiol ; 84(11): 1025-1037, 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39232630

ABSTRACT

During patient follow-up in a randomized trial, some deaths may occur. Where death (or noncardiovascular death) is not part of an outcome of interest, it is termed a competing risk. Conventional analyses (eg, the Cox proportional hazards model) handle death similarly to other censored follow-up: patients still alive are unrealistically assumed to be representative of those who died. The Fine and Gray model has been used to handle competing risks, but it is often used inappropriately and can be misleading. We propose an alternative multiple imputation approach that plausibly accounts for the fact that patients who die tend also to be at high risk for the (unobserved) outcome of interest. This provides a logical framework for exploring the impact of a competing risk, recognizing that there is no unique solution. We illustrate these issues in 3 cardiovascular trials and in simulation studies. We conclude with practical recommendations for handling competing risks in future trials.


Subject(s)
Cardiovascular Diseases , Humans , Risk Assessment/methods , Cardiovascular Diseases/mortality , Cardiovascular Diseases/therapy , Randomized Controlled Trials as Topic/methods , Clinical Trials as Topic , Proportional Hazards Models
20.
Patterns (N Y) ; 5(8): 101021, 2024 Aug 09.
Article in English | MEDLINE | ID: mdl-39233691

ABSTRACT

Imputation of missing features in spatial transcriptomics is urgently needed due to technological limitations. However, most existing computational methods suffer from moderate accuracy and cannot estimate the reliability of the imputation. To fill this research gap, we introduce a computational model, TransImpute, that imputes the missing feature modality in spatial transcriptomics by mapping it from single-cell reference data. We derive a set of attributes that can accurately predict imputation uncertainty, enabling us to select reliably imputed genes. In addition, we introduce a spatial autocorrelation metric as a regularization to avoid overestimating spatial patterns. Multiple datasets from various platforms demonstrate that our approach significantly improves the reliability of downstream analyses in detecting spatially variable genes and interacting ligand-receptor pairs. Therefore, TransImpute offers a reliable approach to spatial analysis of missing features for both matched and unseen modalities, such as nascent RNAs.
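As a simplistic stand-in for reference-based feature imputation (not TransImpute, and without its uncertainty attributes or spatial-autocorrelation regularization), the sketch below transfers an unmeasured gene from a single-cell reference onto spatial spots by k-nearest neighbors over shared genes; all data are simulated.

```python
# Sketch: impute an unmeasured gene in spatial spots by k-nearest-neighbor
# mapping from a single-cell reference using the genes the two data sets share.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(9)
n_ref, n_spot, n_shared = 2000, 800, 50
ref_shared = rng.normal(size=(n_ref, n_shared))              # single-cell reference
ref_missing_gene = ref_shared[:, 0] * 2 + rng.normal(0, 0.5, n_ref)
spot_shared = rng.normal(size=(n_spot, n_shared))            # spatial data (gene absent)
spot_truth = spot_shared[:, 0] * 2 + rng.normal(0, 0.5, n_spot)

nn = NearestNeighbors(n_neighbors=15).fit(ref_shared)
_, idx = nn.kneighbors(spot_shared)
imputed = ref_missing_gene[idx].mean(axis=1)                 # average over neighbors

corr = np.corrcoef(imputed, spot_truth)[0, 1]
print(f"correlation between imputed and true expression: {corr:.3f}")
```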
