Búsqueda | Biblioteca Virtual en Salud

1.

Self-supervised deep learning of gene-gene interactions for improved gene expression recovery.

Wei, Qingyue; Islam, Md Tauhidul; Zhou, Yuyin; Xing, Lei.

Brief Bioinform ; 25(2)2024 Jan 22.

Artículo en Inglés | MEDLINE | ID: mdl-38349062

RESUMEN

Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expressions due to a lack of mechanisms that explicitly consider the underlying biological knowledge of the system. In reality, it has long been recognized that gene-gene interactions may serve as reflective indicators of underlying biology processes, presenting discriminative signatures of the cells. A genomic data analysis framework that is capable of leveraging the underlying gene-gene interactions is thus highly desirable and could allow for more reliable identification of distinctive patterns of the genomic data through extraction and integration of intricate biological characteristics of the genomic data. Here we tackle the problem in two steps to exploit the gene-gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground truth gene expression datasets, a self-supervised 2D convolutional neural network is employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets are carried out to demonstrate the superior performance of the proposed strategy against the existing imputation methods.

Asunto(s)

Aprendizaje Profundo , Epistasis Genética , Análisis de Datos , Genómica , Expresión Génica , Perfilación de la Expresión Génica , Análisis de Secuencia de ARN

2.

Multi-omics regulatory network inference in the presence of missing data.

Henao, Juan D; Lauber, Michael; Azevedo, Manuel; Grekova, Anastasiia; Theis, Fabian; List, Markus; Ogris, Christoph; Schubert, Benjamin.

Brief Bioinform ; 24(5)2023 09 20.

Artículo en Inglés | MEDLINE | ID: mdl-37670505

RESUMEN

A key problem in systems biology is the discovery of regulatory mechanisms that drive phenotypic behaviour of complex biological systems in the form of multi-level networks. Modern multi-omics profiling techniques probe these fundamental regulatory networks but are often hampered by experimental restrictions leading to missing data or partially measured omics types for subsets of individuals due to cost restrictions. In such scenarios, in which missing data is present, classical computational approaches to infer regulatory networks are limited. In recent years, approaches have been proposed to infer sparse regression models in the presence of missing information. Nevertheless, these methods have not been adopted for regulatory network inference yet. In this study, we integrated regression-based methods that can handle missingness into KiMONo, a Knowledge guided Multi-Omics Network inference approach, and benchmarked their performance on commonly encountered missing data scenarios in single- and multi-omics studies. Overall, two-step approaches that explicitly handle missingness performed best for a wide range of random- and block-missingness scenarios on imbalanced omics-layers dimensions, while methods implicitly handling missingness performed best on balanced omics-layers dimensions. Our results show that robust multi-omics network inference in the presence of missing data with KiMONo is feasible and thus allows users to leverage available multi-omics data to its full extent.

Asunto(s)

Benchmarking , Multiómica , Humanos , Biología de Sistemas

3.

scGGAN: single-cell RNA-seq imputation by graph-based generative adversarial network.

Huang, Zimo; Wang, Jun; Lu, Xudong; Mohd Zain, Azlan; Yu, Guoxian.

Brief Bioinform ; 24(2)2023 03 19.

Artículo en Inglés | MEDLINE | ID: mdl-36733262

RESUMEN

Single-cell RNA sequencing (scRNA-seq) data are typically with a large number of missing values, which often results in the loss of critical gene signaling information and seriously limit the downstream analysis. Deep learning-based imputation methods often can better handle scRNA-seq data than shallow ones, but most of them do not consider the inherent relations between genes, and the expression of a gene is often regulated by other genes. Therefore, it is essential to impute scRNA-seq data by considering the regional gene-to-gene relations. We propose a novel model (named scGGAN) to impute scRNA-seq data that learns the gene-to-gene relations by Graph Convolutional Networks (GCN) and global scRNA-seq data distribution by Generative Adversarial Networks (GAN). scGGAN first leverages single-cell and bulk genomics data to explore inherent relations between genes and builds a more compact gene relation network to jointly capture the homogeneous and heterogeneous information. Then, it constructs a GCN-based GAN model to integrate the scRNA-seq, gene sequencing data and gene relation network for generating scRNA-seq data, and trains the model through adversarial learning. Finally, it utilizes data generated by the trained GCN-based GAN model to impute scRNA-seq data. Experiments on simulated and real scRNA-seq datasets show that scGGAN can effectively identify dropout events, recover the biologically meaningful expressions, determine subcellular states and types, improve the differential expression analysis and temporal dynamics analysis. Ablation experiments confirm that both the gene relation network and gene sequence data help the imputation of scRNA-seq data.

Asunto(s)

Análisis de Expresión Génica de una Sola Célula , Programas Informáticos , Análisis de Secuencia de ARN/métodos , Análisis de la Célula Individual/métodos , Genómica , Perfilación de la Expresión Génica

4.

Comparing preprocessing strategies for 3D-Gene microarray data of extracellular vesicle-derived miRNAs.

Takemoto, Yuto; Ito, Daisuke; Komori, Shota; Kishimoto, Yoshiyuki; Yamada, Shinichiro; Hashizume, Atsushi; Katsuno, Masahisa; Nakatochi, Masahiro.

BMC Bioinformatics ; 25(1): 221, 2024 Jun 20.

Artículo en Inglés | MEDLINE | ID: mdl-38902629

RESUMEN

BACKGROUND: Extracellular vesicle-derived (EV)-miRNAs have potential to serve as biomarkers for the diagnosis of various diseases. miRNA microarrays are widely used to quantify circulating EV-miRNA levels, and the preprocessing of miRNA microarray data is critical for analytical accuracy and reliability. Thus, although microarray data have been used in various studies, the effects of preprocessing have not been studied for Toray's 3D-Gene chip, a widely used measurement method. We aimed to evaluate batch effect, missing value imputation accuracy, and the influence of preprocessing on measured values in 18 different preprocessing pipelines for EV-miRNA microarray data from two cohorts with amyotrophic lateral sclerosis using 3D-Gene technology. RESULTS: Eighteen different pipelines with different types and orders of missing value completion and normalization were used to preprocess the 3D-Gene microarray EV-miRNA data. Notable results were suppressed in the batch effects in all pipelines using the batch effect correction method ComBat. Furthermore, pipelines utilizing missForest for missing value imputation showed high agreement with measured values. In contrast, imputation using constant values for missing data exhibited low agreement. CONCLUSIONS: This study highlights the importance of selecting the appropriate preprocessing strategy for EV-miRNA microarray data when using 3D-Gene technology. These findings emphasize the importance of validating preprocessing approaches, particularly in the context of batch effect correction and missing value imputation, for reliably analyzing data in biomarker discovery and disease research.

Asunto(s)

Vesículas Extracelulares , MicroARNs , Análisis de Secuencia por Matrices de Oligonucleótidos , Vesículas Extracelulares/metabolismo , Vesículas Extracelulares/genética , MicroARNs/genética , MicroARNs/metabolismo , Humanos , Análisis de Secuencia por Matrices de Oligonucleótidos/métodos , Esclerosis Amiotrófica Lateral/genética , Esclerosis Amiotrófica Lateral/metabolismo , Perfilación de la Expresión Génica/métodos

5.

Missing Values in Longitudinal Proteome Dynamics Studies: Making a Case for Data Multiple Imputation.

Yan, Yu; Sankar, Baradwaj Simha; Mirza, Bilal; Ng, Dominic C M; Pelletier, Alexander R; Huang, Sarah D; Wang, Wei; Watson, Karol; Wang, Ding; Ping, Peipei.

J Proteome Res ; 23(9): 4151-4162, 2024 Sep 06.

Artículo en Inglés | MEDLINE | ID: mdl-39189460

RESUMEN

Temporal proteomics data sets are often confounded by the challenges of missing values. These missing data points, in a time-series context, can lead to fluctuations in measurements or the omission of critical events, thus hindering the ability to fully comprehend the underlying biomedical processes. We introduce a Data Multiple Imputation (DMI) pipeline designed to address this challenge in temporal data set turnover rate quantifications, enabling robust downstream analysis to gain novel discoveries. To demonstrate its utility and generalizability, we applied this pipeline to two use cases: a murine cardiac temporal proteomics data set and a human plasma temporal proteomics data set, both aimed at examining protein turnover rates. This DMI pipeline significantly enhanced the detection of protein turnover rate in both data sets, and furthermore, the imputed data sets captured new representation of proteins, leading to an augmented view of biological pathways, protein complex dynamics, as well as biomarker-disease associations. Importantly, DMI exhibited superior performance in benchmark data sets compared to single imputation methods (DSI). In summary, we have demonstrated that this DMI pipeline is effective at overcoming challenges introduced by missing values in temporal proteome dynamics studies.

Asunto(s)

Proteoma , Proteómica , Humanos , Proteoma/análisis , Proteoma/metabolismo , Proteómica/métodos , Animales , Ratones , Estudios Longitudinales , Interpretación Estadística de Datos

6.

Matrix factorization for biomedical link prediction and scRNA-seq data imputation: an empirical survey.

Ou-Yang, Le; Lu, Fan; Zhang, Zi-Chao; Wu, Min.

Brief Bioinform ; 23(1)2022 01 17.

Artículo en Inglés | MEDLINE | ID: mdl-34864871

RESUMEN

Advances in high-throughput experimental technologies promote the accumulation of vast number of biomedical data. Biomedical link prediction and single-cell RNA-sequencing (scRNA-seq) data imputation are two essential tasks in biomedical data analyses, which can facilitate various downstream studies and gain insights into the mechanisms of complex diseases. Both tasks can be transformed into matrix completion problems. For a variety of matrix completion tasks, matrix factorization has shown promising performance. However, the sparseness and high dimensionality of biomedical networks and scRNA-seq data have raised new challenges. To resolve these issues, various matrix factorization methods have emerged recently. In this paper, we present a comprehensive review on such matrix factorization methods and their usage in biomedical link prediction and scRNA-seq data imputation. Moreover, we select representative matrix factorization methods and conduct a systematic empirical comparison on 15 real data sets to evaluate their performance under different scenarios. By summarizing the experimental results, we provide general guidelines for selecting matrix factorization methods for different biomedical matrix completion tasks and point out some future directions to further improve the performance for biomedical link prediction and scRNA-seq data imputation.

Asunto(s)

Análisis de Datos , Análisis de la Célula Individual , Análisis de Secuencia de ARN/métodos , Análisis de la Célula Individual/métodos , Secuenciación del Exoma

7.

Evaluating the state of the art in missing data imputation for clinical data.

Luo, Yuan.

Brief Bioinform ; 23(1)2022 01 17.

Artículo en Inglés | MEDLINE | ID: mdl-34882223

RESUMEN

Clinical data are increasingly being mined to derive new medical knowledge with a goal of enabling greater diagnostic precision, better-personalized therapeutic regimens, improved clinical outcomes and more efficient utilization of health-care resources. However, clinical data are often only available at irregular intervals that vary between patients and type of data, with entries often being unmeasured or unknown. As a result, missing data often represent one of the major impediments to optimal knowledge derivation from clinical data. The Data Analytics Challenge on Missing data Imputation (DACMI) presented a shared clinical dataset with ground truth for evaluating and advancing the state of the art in imputing missing data for clinical time series. We extracted 13 commonly measured blood laboratory tests. To evaluate the imputation performance, we randomly removed one recorded result per laboratory test per patient admission and used them as the ground truth. DACMI is the first shared-task challenge on clinical time series imputation to our best knowledge. The challenge attracted 12 international teams spanning three continents across multiple industries and academia. The evaluation outcome suggests that competitive machine learning and statistical models (e.g. LightGBM, MICE and XGBoost) coupled with carefully engineered temporal and cross-sectional features can achieve strong imputation performance. However, care needs to be taken to prevent overblown model complexity. The challenge participating systems collectively experimented with a wide range of machine learning and probabilistic algorithms to combine temporal imputation and cross-sectional imputation, and their design principles will inform future efforts to better model clinical missing data.

Asunto(s)

Algoritmos , Aprendizaje Automático , Estudios Transversales , Recolección de Datos , Humanos , Modelos Estadísticos

8.

Multiple augmented reduced rank regression for pan-cancer analysis.

Wang, Jiuzhou; Lock, Eric F.

Biometrics ; 80(1)2024 Jan 29.

Artículo en Inglés | MEDLINE | ID: mdl-38281771

RESUMEN

Statistical approaches that successfully combine multiple datasets are more powerful, efficient, and scientifically informative than separate analyses. To address variation architectures correctly and comprehensively for high-dimensional data across multiple sample sets (ie, cohorts), we propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method to concurrently learn both covariate-driven and auxiliary structured variations. We consider a structured nuclear norm objective that is motivated by random matrix theory, in which the regression or factorization terms may be shared or specific to any number of cohorts. Our framework subsumes several existing methods, such as reduced rank regression and unsupervised multimatrix factorization approaches, and includes a promising novel approach to regression and factorization of a single dataset (aRRR) as a special case. Simulations demonstrate substantial gains in power from combining multiple datasets, and from parsimoniously accounting for all structured variations. We apply maRRR to gene expression data from multiple cancer types (ie, pan-cancer) from The Cancer Genome Atlas, with somatic mutations as covariates. The method performs well with respect to prediction and imputation of held-out data, and provides new insights into mutation-driven and auxiliary variations that are shared or specific to certain cancer types.

Asunto(s)

Neoplasias , Humanos , Análisis Multivariante , Neoplasias/genética

9.

Predicting lung cancer survival prognosis based on the conditional survival bayesian network.

Zhong, Lu; Yang, Fan; Sun, Shanshan; Wang, Lijie; Yu, Hong; Nie, Xiushan; Liu, Ailing; Xu, Ning; Zhang, Lanfang; Zhang, Mingjuan; Qi, Yue; Ji, Huaijun; Liu, Guiyuan; Zhao, Huan; Jiang, Yinan; Li, Jingyi; Song, Chengcun; Yu, Xin; Yang, Liu; Yu, Jinchao; Feng, Hu; Guo, Xiaolei; Yang, Fujun; Xue, Fuzhong.

BMC Med Res Methodol ; 24(1): 16, 2024 Jan 22.

Artículo en Inglés | MEDLINE | ID: mdl-38254038

RESUMEN

Lung cancer is a leading cause of cancer deaths and imposes an enormous economic burden on patients. It is important to develop an accurate risk assessment model to determine the appropriate treatment for patients after an initial lung cancer diagnosis. The Cox proportional hazards model is mainly employed in survival analysis. However, real-world medical data are usually incomplete, posing a great challenge to the application of this model. Commonly used imputation methods cannot achieve sufficient accuracy when data are missing, so we investigated novel methods for the development of clinical prediction models. In this article, we present a novel model for survival prediction in missing scenarios. We collected data from 5,240 patients diagnosed with lung cancer at the Weihai Municipal Hospital, China. Then, we applied a joint model that combined a BN and a Cox model to predict mortality risk in individual patients with lung cancer. The established prognostic model achieved good predictive performance in discrimination and calibration. We showed that combining the BN with the Cox proportional hazards model is highly beneficial and provides a more efficient tool for risk prediction.

Asunto(s)

Neoplasias Pulmonares , Humanos , Neoplasias Pulmonares/diagnóstico , Teorema de Bayes , Pronóstico , Calibración , China/epidemiología

10.

Machine Learning-Based Risk Prediction of Discharge Status for Sepsis.

Cai, Kaida; Lou, Yuqing; Wang, Zhengyan; Yang, Xiaofang; Zhao, Xin.

Entropy (Basel) ; 26(8)2024 Jul 25.

Artículo en Inglés | MEDLINE | ID: mdl-39202095

RESUMEN

As a severe inflammatory response syndrome, sepsis presents complex challenges in predicting patient outcomes due to its unclear pathogenesis and the unstable discharge status of affected individuals. In this study, we develop a machine learning-based method for predicting the discharge status of sepsis patients, aiming to improve treatment decisions. To enhance the robustness of our analysis against outliers, we incorporate robust statistical methods, specifically the minimum covariance determinant technique. We utilize the random forest imputation method to effectively manage and impute missing data. For feature selection, we employ Lasso penalized logistic regression, which efficiently identifies significant predictors and reduces model complexity, setting the stage for the application of more complex predictive methods. Our predictive analysis incorporates multiple machine learning methods, including random forest, support vector machine, and XGBoost. We compare the prediction performance of these methods with Lasso penalized logistic regression to identify the most effective approach. Each method's performance is rigorously evaluated through ten iterations of 10-fold cross-validation to ensure robust and reliable results. Our comparative analysis reveals that XGBoost surpasses the other models, demonstrating its exceptional capability to navigate the complexities of sepsis data effectively.

11.

Long-term SARS-CoV-2 neutralizing antibody level prediction using multimodal deep learning: A prospective cohort study on longitudinal data in Wuhan, China.

Fang, Cong; Yan, Weiming; Chen, Yuying; Dou, Zhiyong; Liu, Tingting; Luo, Fengning; Chen, Weiwei; Li, Xitang; Chen, Yajie; Wu, Wenhui; Yuan, Zhize; Niu, Yuxin; Wang, Peng; Zhu, Wenzhen; Luo, Xiaoping; Chen, Tao; Bai, Xiang; Wang, Xiaojing; Ning, Qin.

J Med Virol ; 95(8): e29036, 2023 08.

Artículo en Inglés | MEDLINE | ID: mdl-37621210

RESUMEN

The ongoing epidemic of SARS-CoV-2 is taking a substantial financial and health toll on people worldwide. Assessing the level and duration of SARS-CoV-2 neutralizing antibody (Nab) would provide key information for government to make sound healthcare policies. Assessed at 3-, 6-, 12-, and 18-month postdischarge, we described the temporal change of IgG levels in 450 individuals with moderate to critical COVID-19 infection. Moreover, a data imputation framework combined with a novel deep learning model was implemented to predict the long-term Nab and IgG levels in these patients. Demographic characteristics, inspection reports, and CT scans during hospitalization were used in this model. Interpretability of the model was further validated with Shapely Additive exPlanation (SHAP) and Gradient-weighted Class Activation Mapping (GradCAM). IgG levels peaked at 3 months and remained stable in 12 months postdischarge, followed by a significant decline in 18 months postdischarge. However, the Nab levels declined from 6 months postdischarge. By training on the cohort of 450 patients, our long-term antibody prediction (LTAP) model could predict long-term IgG levels with relatively high area under the receiver operating characteristic curve (AUC), accuracy, precision, recall, and F1-score, which far exceeds the performance achievable by commonly used models. Several prognostic factors including FDP levels, the percentages of T cells, B cells and natural killer cells, older age, sex, underlying diseases, and so forth, served as important indicators for IgG prediction. Based on these top 15 prognostic factors identified in IgG prediction, a simplified LTAP model for Nab level prediction was established and achieved an AUC of 0.828, which was 8.9% higher than MLP and 6.6% higher than LSTM. The close correlation between IgG and Nab levels making it possible to predict long-term Nab levels based on the factors selected by our LTAP model. Furthermore, our model identified that coagulation disorders and excessive immune response, which indicate disease severity, are closely related to the production of IgG and Nab. This universal model can be used as routine discharge tests to identify virus-infected individuals at risk for recurrent infection and determine the optimal timing of vaccination for general populations.

Asunto(s)

COVID-19 , Aprendizaje Profundo , Humanos , Anticuerpos Neutralizantes , SARS-CoV-2 , Cuidados Posteriores , Estudios Prospectivos , COVID-19/diagnóstico , Alta del Paciente , China/epidemiología , Anticuerpos Antivirales , Inmunoglobulina G

12.

Missing data imputation, prediction, and feature selection in diagnosis of vaginal prolapse.

Fan, Mingxuan; Peng, Xiaoling; Niu, Xiaoyu; Cui, Tao; He, Qiaolin.

BMC Med Res Methodol ; 23(1): 259, 2023 11 06.

Artículo en Inglés | MEDLINE | ID: mdl-37932660

RESUMEN

BACKGROUND: Data loss often occurs in the collection of clinical data. Directly discarding the incomplete sample may lead to low accuracy of medical diagnosis. A suitable data imputation method can help researchers make better use of valuable medical data. METHODS: In this paper, five popular imputation methods including mean imputation, expectation-maximization (EM) imputation, K-nearest neighbors (KNN) imputation, denoising autoencoders (DAE) and generative adversarial imputation nets (GAIN) are employed on an incomplete clinical data with 28,274 cases for vaginal prolapse prediction. A comprehensive comparison study for the performance of these methods has been conducted through certain classification criteria. It is shown that the prediction accuracy can be greatly improved by using the imputed data, especially by GAIN. To find out the important risk factors to this disease among a large number of candidate features, three variable selection methods: the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD) and the broken adaptive ridge (BAR) are implemented in logistic regression for feature selection on the imputed datasets. In pursuit of our primary objective, which is accurate diagnosis, we employed diagnostic accuracy (classification accuracy) as a pivotal metric to assess both imputation and feature selection techniques. This assessment encompassed seven classifiers (logistic regression (LR) classifier, random forest (RF) classifier, support machine classifier (SVC), extreme gradient boosting (XGBoost) , LASSO classifier, SCAD classifier and Elastic Net classifier)enhancing the comprehensiveness of our evaluation. RESULTS: The proposed framework imputation-variable selection-prediction is quite suitable to the collected vaginal prolapse datasets. It is observed that the original dataset is well imputed by GAIN first, and then 9 most significant features were selected using BAR from the original 67 features in GAIN imputed dataset, with only negligible loss in model prediction. BAR is superior to the other two variable selection methods in our tests. CONCLUDES: Overall, combining the imputation, classification and variable selection, we achieve good interpretability while maintaining high accuracy in computer-aided medical diagnosis.

Asunto(s)

Prolapso Uterino , Femenino , Humanos , Diagnóstico por Computador , Modelos Logísticos

13.

Graph Machine Learning for Improved Imputation of Missing Tropospheric Ozone Data.

Betancourt, Clara; Li, Cathy W Y; Kleinert, Felix; Schultz, Martin G.

Environ Sci Technol ; 57(46): 18246-18258, 2023 Nov 21.

Artículo en Inglés | MEDLINE | ID: mdl-37661931

RESUMEN

Gaps in the measurement series of atmospheric pollutants can impede the reliable assessment of their impacts and trends. We propose a new method for missing data imputation of the air pollutant tropospheric ozone by using the graph machine learning algorithm "correct and smooth". This algorithm uses auxiliary data that characterize the measurement location and, in addition, ozone observations at neighboring sites to improve the imputations of simple statistical and machine learning models. We apply our method to data from 278 stations of the year 2011 of the German Environment Agency (Umweltbundesamt - UBA) monitoring network. The preliminary version of these data exhibits three gap patterns: shorter gaps in the range of hours, longer gaps of up to several months in length, and gaps occurring at multiple stations at once. For short gaps of up to 5 h, linear interpolation is most accurate. Longer gaps at single stations are most effectively imputed by a random forest in connection with the correct and smooth. For longer gaps at multiple stations, the correct and smooth algorithm improved the random forest despite a lack of data in the neighborhood of the missing values. We therefore suggest a hybrid of linear interpolation and graph machine learning for the imputation of tropospheric ozone time series.

Asunto(s)

Contaminantes Atmosféricos , Contaminación del Aire , Ozono , Ozono/análisis , Contaminación del Aire/análisis , Monitoreo del Ambiente/métodos , Contaminantes Atmosféricos/análisis , Aprendizaje Automático

14.

A Hybrid Missing Data Imputation Method for Batch Process Monitoring Dataset.

Gan, Qihong; Gong, Lang; Hu, Dasha; Jiang, Yuming; Ding, Xuefeng.

Sensors (Basel) ; 23(21)2023 Oct 24.

Artículo en Inglés | MEDLINE | ID: mdl-37960379

RESUMEN

Batch process monitoring datasets usually contain missing data, which decreases the performance of data-driven modeling for fault identification and optimal control. Many methods have been proposed to impute missing data; however, they do not fulfill the need for data quality, especially in sensor datasets with different types of missing data. We propose a hybrid missing data imputation method for batch process monitoring datasets with multi-type missing data. In this method, the missing data is first classified into five categories based on the continuous missing duration and the number of variables missing simultaneously. Then, different categories of missing data are step-by-step imputed considering their unique characteristics. A combination of three single-dimensional interpolation models is employed to impute transient isolated missing values. An iterative imputation based on a multivariate regression model is designed for imputing long-term missing variables, and a combination model based on single-dimensional interpolation and multivariate regression is proposed for imputing short-term missing variables. The Long Short-Term Memory (LSTM) model is utilized to impute both short-term and long-term missing samples. Finally, a series of experiments for different categories of missing data were conducted based on a real-world batch process monitoring dataset. The results demonstrate that the proposed method achieves higher imputation accuracy than other comparative methods.

15.

A Missing Traffic Data Imputation Method Based on a Diffusion Convolutional Neural Network-Generative Adversarial Network.

Zhang, Chenchen; Zhou, Lei; Xiao, Xuemei; Xu, Dongwei.

Sensors (Basel) ; 23(23)2023 Dec 04.

Artículo en Inglés | MEDLINE | ID: mdl-38067974

RESUMEN

Traffic state data are key to the proper operation of intelligent transportation systems (ITS). However, traffic detectors often receive environmental factors that cause missing values in the collected traffic state data. Therefore, aiming at the above problem, a method for imputing missing traffic state data based on a Diffusion Convolutional Neural Network-Generative Adversarial Network (DCNN-GAN) is proposed in this paper. The proposed method uses a graph embedding algorithm to construct a road network structure based on spatial correlation instead of the original road network structure; through the use of a GAN for confrontation training, it is possible to generate missing traffic state data based on the known data of the road network. In the generator, the spatiotemporal features of the reconstructed road network are extracted by the DCNN to realize the imputation. Two real traffic datasets were used to verify the effectiveness of this method, with the results of the proposed model proving better than those of the other models used for comparison.

16.

Dynamic Tensor Modeling for Missing Data Completion in Electronic Toll Collection Gantry Systems.

Rui, Yikang; Zhao, Yan; Lu, Wenqi; Wang, Can.

Sensors (Basel) ; 24(1)2023 Dec 23.

Artículo en Inglés | MEDLINE | ID: mdl-38202948

RESUMEN

The deployment of Electronic Toll Collection (ETC) gantry systems marks a transformative advancement in the journey toward an interconnected and intelligent highway traffic infrastructure. The integration of these systems signifies a leap forward in streamlining toll collection and minimizing environmental impact through decreased idle times. To solve the problems of missing sensor data in an ETC gantry system with large volumes and insufficient traffic detection among ETC gantries, this study constructs a high-order tensor model based on the analysis of the high-dimensional, sparse, large-volume, and heterogeneous characteristics of ETC gantry data. In addition, a missing data completion method for the ETC gantry data is proposed based on an improved dynamic tensor flow model. This study approximates the decomposition of neighboring tensor blocks in the high-order tensor model of the ETC gantry data based on tensor Tucker decomposition and the Laplacian matrix. This method captures the correlations among space, time, and user information in the ETC gantry data. Case studies demonstrate that our method enhances ETC gantry data quality across various rates of missing data while also reducing computational complexity. For instance, at a less than 5% missing data rate, our approach reduced the RMSE for time vehicle distance by 0.0051, for traffic volume by 0.0056, and for interval speed by 0.0049 compared to the MATRIX method. These improvements not only indicate a potential for more precise traffic data analysis but also add value to the application of ETC systems and contribute to theoretical and practical advancements in the field.

17.

Multi-Type Missing Imputation of Time-Series Power Equipment Monitoring Data Based on Moving Average Filter-Asymmetric Denoising Autoencoder.

Jiang, Ling; Gu, Juping; Zhang, Xinsong; Hua, Liang; Cai, Yueming.

Sensors (Basel) ; 23(24)2023 Dec 08.

Artículo en Inglés | MEDLINE | ID: mdl-38139543

RESUMEN

Supervisory control and data acquisition (SCADA) systems are widely utilized in power equipment for condition monitoring. For the collected data, there generally exists a problem-missing data of different types and patterns. This leads to the poor quality and utilization difficulties of the collected data. To address this problem, this paper customizes methodology that combines an asymmetric denoising autoencoder (ADAE) and moving average filter (MAF) to perform accurate missing data imputation. First, convolution and gated recurrent unit (GRU) are applied to the encoder of the ADAE, while the decoder still utilizes the fully connected layers to form an asymmetric network structure. The ADAE extracts the local periodic and temporal features from monitoring data and then decodes the features to realize the imputation of the multi-type missing. On this basis, according to the continuity of power data in the time domain, the MAF is utilized to fuse the prior knowledge of the neighborhood of missing data to secondarily optimize the imputed data. Case studies reveal that the developed method achieves greater accuracy compared to existing models. This paper adopts experiments under different scenarios to justify that the MAF-ADAE method applies to actual power equipment monitoring data imputation.

18.

Bioactive Compounds in Plasma as a Function of Sex and Sweetener Resulting from a Maqui-Lemon Beverage Consumption Using Statistical and Machine Learning Techniques.

Hernández-Prieto, Diego; Fernández, Pablo S; Agulló, Vicente; García-Viguera, Cristina; Egea, Jose A.

Int J Mol Sci ; 24(3)2023 Jan 21.

Artículo en Inglés | MEDLINE | ID: mdl-36768467

RESUMEN

The present study analyses the effect of a beverage composed of citrus and maqui (Aristotelia chilensis) with different sweeteners on male and female consumers. Beverages were designed and tested (140 volunteers) as a source of polyphenols, in a previous work. Plasma samples were taken before and after two months of daily intake. Samples were measured for bioactive-compound levels with metabolomics techniques, and the resulting data were analysed with advanced versions of ANOVA and clustering analysis, to describe the effects of sex and sweetener factors on bioactive compounds. To improve the results, machine learning techniques were applied to perform feature selection and data imputation. The results reflect a series of compounds which are more regulated for men, such as caffeic acid or 3,4-dihydroxyphenylacetic acid, and for women, trans ferulic acid (TFA) or naringenin glucuronide. Regulations are also observed with sweeteners, such as TFA with stevia in women, or vanillic acid with sucrose in men. These results show that there is a differential regulation of these two families of polyphenols by sex, and that this is influenced by sweeteners.

Asunto(s)

Citrus , Stevia , Animales , Edulcorantes/farmacología , Bebidas/análisis , Polifenoles/análisis

19.

ImputeGAN: Generative Adversarial Network for Multivariate Time Series Imputation.

Qin, Rui; Wang, Yong.

Entropy (Basel) ; 25(1)2023 Jan 10.

Artículo en Inglés | MEDLINE | ID: mdl-36673278

RESUMEN

Since missing values in multivariate time series data are inevitable, many researchers have come up with methods to deal with the missing data. These include case deletion methods, statistics-based imputation methods, and machine learning-based imputation methods. However, these methods cannot handle temporal information, or the complementation results are unstable. We propose a model based on generative adversarial networks (GANs) and an iterative strategy based on the gradient of the complementary results to solve these problems. This ensures the generalizability of the model and the reasonableness of the complementation results. We conducted experiments on three large-scale datasets and compare them with traditional complementation methods. The experimental results show that imputeGAN outperforms traditional complementation methods in terms of accuracy of complementation.

20.

Fast and robust imputation for miRNA expression data using constrained least squares.

Webber, James W; Elias, Kevin M.

BMC Bioinformatics ; 23(1): 145, 2022 Apr 22.

Artículo en Inglés | MEDLINE | ID: mdl-35459087

RESUMEN

BACKGROUND: High dimensional transcriptome profiling, whether through next generation sequencing techniques or high-throughput arrays, may result in scattered variables with missing data. Data imputation is a common strategy to maximize the inclusion of samples by using statistical techniques to fill in missing values. However, many data imputation methods are cumbersome and risk introduction of systematic bias. RESULTS: We present a new data imputation method using constrained least squares and algorithms from the inverse problems literature and present applications for this technique in miRNA expression analysis. The proposed technique is shown to offer an imputation orders of magnitude faster, with greater than or equal accuracy when compared to similar methods from the literature. CONCLUSIONS: This study offers a robust and efficient algorithm for data imputation, which can be used, e.g., to improve cancer prediction accuracy in the presence of missing data.

Asunto(s)

MicroARNs , Algoritmos , Perfilación de la Expresión Génica/métodos , Análisis de los Mínimos Cuadrados , MicroARNs/genética , Modelos Estadísticos , Análisis de Secuencia por Matrices de Oligonucleótidos/métodos

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

RESUMEN

RESUMEN

RESUMEN

Asunto(s)

RESUMEN

RESUMEN

Asunto(s)

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

Detalles de la búsqueda