Results 1 - 20 of 136
1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38349062

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to impute gene expression reliably because they lack mechanisms that explicitly incorporate prior biological knowledge of the system. In fact, it has long been recognized that gene-gene interactions can reflect underlying biological processes and present discriminative signatures of the cells. A genomic data analysis framework capable of leveraging these gene-gene interactions is therefore highly desirable, as it could identify distinctive patterns in the data more reliably by extracting and integrating intricate biological characteristics. Here we tackle the problem in two steps to exploit the gene-gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground-truth gene expression datasets, a self-supervised 2D convolutional neural network is then employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets demonstrate the superior performance of the proposed strategy over existing imputation methods.
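To make the masking idea concrete, here is a minimal self-supervised sketch in PyTorch, assuming a toy random gene-to-grid layout and a small convolutional network; the layout, architecture, and training loop are illustrative placeholders, not the authors' published model.

```python
# Hypothetical sketch: self-supervised imputation of spatially arranged genes.
# The gene-to-grid mapping and network are placeholders, not the published model.
import numpy as np
import torch
import torch.nn as nn

n_cells, n_genes, grid = 500, 1024, 32            # 32*32 = 1024 genes per "image"
expr = np.random.rand(n_cells, n_genes).astype(np.float32)
order = np.argsort(np.random.rand(n_genes))       # stand-in for an interaction-aware layout
images = torch.tensor(expr[:, order].reshape(n_cells, 1, grid, grid))

net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(20):                               # a few epochs, for illustration only
    mask = (torch.rand_like(images) < 0.2).float()    # pretend 20% of entries are dropouts
    corrupted = images * (1 - mask)
    recon = net(corrupted)
    loss = ((recon - images) ** 2 * mask).sum() / mask.sum()  # score only masked entries
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    imputed = torch.where(images == 0, net(images), images)   # fill only missing (zero) values
```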


Subject(s)
Deep Learning, Genetic Epistasis, Data Analysis, Genomics, Gene Expression, Gene Expression Profiling, RNA Sequence Analysis
2.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37670505

ABSTRACT

A key problem in systems biology is the discovery of regulatory mechanisms that drive the phenotypic behaviour of complex biological systems in the form of multi-level networks. Modern multi-omics profiling techniques probe these fundamental regulatory networks, but experimental and cost restrictions often lead to missing data or omics types that are only partially measured for subsets of individuals. In such scenarios, classical computational approaches for inferring regulatory networks are of limited use. In recent years, approaches have been proposed to infer sparse regression models in the presence of missing information. Nevertheless, these methods have not yet been adopted for regulatory network inference. In this study, we integrated regression-based methods that can handle missingness into KiMONo, a Knowledge guided Multi-Omics Network inference approach, and benchmarked their performance on commonly encountered missing data scenarios in single- and multi-omics studies. Overall, two-step approaches that explicitly handle missingness performed best for a wide range of random- and block-missingness scenarios on imbalanced omics-layer dimensions, while methods that handle missingness implicitly performed best on balanced omics-layer dimensions. Our results show that robust multi-omics network inference with KiMONo is feasible in the presence of missing data, allowing users to leverage available multi-omics data to their full extent.
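As an illustration of the regression-based network inference the benchmark builds on, the sketch below regresses one gene on features from a second, partially missing omics layer after a simple two-step imputation; KiMONo itself and the benchmarked missingness-aware estimators are not reproduced, and all data and names are synthetic.

```python
# Illustrative sketch only: per-feature sparse regression as a stand-in for
# knowledge-guided network inference; not the KiMONo implementation.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 50))          # expression layer (samples x genes)
meth = rng.normal(size=(100, 30))          # second omics layer with missing values
meth[rng.random(meth.shape) < 0.2] = np.nan

meth_filled = KNNImputer(n_neighbors=5).fit_transform(meth)   # simple two-step approach
predictors = np.hstack([np.delete(expr, 0, axis=1), meth_filled])

# Regress gene 0 on all other features; non-zero coefficients become candidate edges.
model = LassoCV(cv=5).fit(predictors, expr[:, 0])
edges = np.flatnonzero(model.coef_)
print(f"gene 0 is connected to {edges.size} features")
```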


Subject(s)
Benchmarking, Multiomics, Humans, Systems Biology
3.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36733262

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) data typically contain a large number of missing values, which often results in the loss of critical gene signaling information and seriously limits downstream analysis. Deep learning-based imputation methods can often handle scRNA-seq data better than shallow ones, but most of them do not consider the inherent relations between genes, even though the expression of a gene is often regulated by other genes. Therefore, it is essential to impute scRNA-seq data by considering these gene-to-gene relations. We propose a novel model (named scGGAN) to impute scRNA-seq data that learns the gene-to-gene relations with Graph Convolutional Networks (GCN) and the global scRNA-seq data distribution with Generative Adversarial Networks (GAN). scGGAN first leverages single-cell and bulk genomics data to explore inherent relations between genes and builds a more compact gene relation network to jointly capture homogeneous and heterogeneous information. Then, it constructs a GCN-based GAN model that integrates the scRNA-seq data, gene sequence data and gene relation network to generate scRNA-seq data, and trains the model through adversarial learning. Finally, it uses the data generated by the trained GCN-based GAN model to impute the scRNA-seq data. Experiments on simulated and real scRNA-seq datasets show that scGGAN can effectively identify dropout events, recover biologically meaningful expression values, determine cellular states and types, and improve differential expression and temporal dynamics analyses. Ablation experiments confirm that both the gene relation network and the gene sequence data help the imputation of scRNA-seq data.
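The graph-convolution idea used to propagate information over a gene relation network can be sketched in a few lines of numpy; this is only the generic GCN message-passing step, not the scGGAN generator or discriminator.

```python
# Toy graph-convolution propagation over a gene relation network (numpy only).
import numpy as np

rng = np.random.default_rng(11)
n_genes, n_cells = 200, 64
A = (rng.random((n_genes, n_genes)) < 0.05).astype(float)
A = np.maximum(A, A.T) + np.eye(n_genes)            # symmetric adjacency with self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt                 # normalized adjacency

H = rng.random((n_genes, n_cells))                  # gene features (expression across cells)
W = rng.normal(scale=0.1, size=(n_cells, 32))       # learnable weights in a real model
H_next = np.maximum(A_hat @ H @ W, 0)               # one GCN layer: propagate + ReLU
```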


Subject(s)
Single-Cell Gene Expression Analysis, Software, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Genomics, Gene Expression Profiling
4.
BMC Bioinformatics ; 25(1): 221, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38902629

ABSTRACT

BACKGROUND: Extracellular vesicle-derived miRNAs (EV-miRNAs) have the potential to serve as biomarkers for the diagnosis of various diseases. miRNA microarrays are widely used to quantify circulating EV-miRNA levels, and the preprocessing of miRNA microarray data is critical for analytical accuracy and reliability. Thus, although microarray data have been used in various studies, the effects of preprocessing have not been studied for Toray's 3D-Gene chip, a widely used measurement method. We aimed to evaluate batch effects, missing value imputation accuracy, and the influence of preprocessing on measured values in 18 different preprocessing pipelines for EV-miRNA microarray data from two cohorts with amyotrophic lateral sclerosis using 3D-Gene technology. RESULTS: Eighteen different pipelines with different types and orders of missing value completion and normalization were used to preprocess the 3D-Gene microarray EV-miRNA data. Notably, batch effects were suppressed in all pipelines that used the batch effect correction method ComBat. Furthermore, pipelines utilizing missForest for missing value imputation showed high agreement with measured values. In contrast, imputation using constant values for missing data exhibited low agreement. CONCLUSIONS: This study highlights the importance of selecting the appropriate preprocessing strategy for EV-miRNA microarray data when using 3D-Gene technology. These findings emphasize the importance of validating preprocessing approaches, particularly in the context of batch effect correction and missing value imputation, for reliable data analysis in biomarker discovery and disease research.
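A missForest-style imputation can be approximated with scikit-learn's IterativeImputer wrapped around a random forest, as sketched below on synthetic data; the study's actual 3D-Gene pipelines and the ComBat batch correction step are not reproduced here.

```python
# Sketch of a missForest-style imputation using scikit-learn; this approximates the
# idea (iterative random-forest imputation) rather than the study's pipeline.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
mirna = rng.lognormal(size=(60, 200))                  # samples x miRNA probes (synthetic)
mirna[rng.random(mirna.shape) < 0.1] = np.nan          # simulate missing signals

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, n_jobs=-1),
    max_iter=5, random_state=0,
)
mirna_imputed = imputer.fit_transform(np.log2(mirna))  # impute on the log2 scale
```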


Subject(s)
Extracellular Vesicles, MicroRNAs, Oligonucleotide Array Sequence Analysis, Extracellular Vesicles/metabolism, Extracellular Vesicles/genetics, MicroRNAs/genetics, MicroRNAs/metabolism, Humans, Oligonucleotide Array Sequence Analysis/methods, Amyotrophic Lateral Sclerosis/genetics, Amyotrophic Lateral Sclerosis/metabolism, Gene Expression Profiling/methods
5.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34864871

ABSTRACT

Advances in high-throughput experimental technologies have promoted the accumulation of vast amounts of biomedical data. Biomedical link prediction and single-cell RNA-sequencing (scRNA-seq) data imputation are two essential tasks in biomedical data analysis, which can facilitate various downstream studies and provide insights into the mechanisms of complex diseases. Both tasks can be transformed into matrix completion problems. For a variety of matrix completion tasks, matrix factorization has shown promising performance. However, the sparseness and high dimensionality of biomedical networks and scRNA-seq data have raised new challenges. To resolve these issues, various matrix factorization methods have emerged recently. In this paper, we present a comprehensive review of such matrix factorization methods and their usage in biomedical link prediction and scRNA-seq data imputation. Moreover, we select representative matrix factorization methods and conduct a systematic empirical comparison on 15 real data sets to evaluate their performance under different scenarios. By summarizing the experimental results, we provide general guidelines for selecting matrix factorization methods for different biomedical matrix completion tasks and point out some future directions to further improve the performance for biomedical link prediction and scRNA-seq data imputation.
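A bare-bones example of matrix completion by masked matrix factorization, the core operation the reviewed methods build on, might look as follows (plain gradient descent on synthetic data; not any specific method from the review).

```python
# Minimal masked matrix factorization for matrix completion (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((200, 100))
observed = rng.random(X.shape) < 0.7          # ~30% of entries treated as missing
rank, lr, lam = 10, 0.01, 0.1

U = rng.normal(scale=0.1, size=(X.shape[0], rank))
V = rng.normal(scale=0.1, size=(X.shape[1], rank))
for _ in range(500):
    R = (U @ V.T - X) * observed              # residual on observed entries only
    U -= lr * (R @ V + lam * U)
    V -= lr * (R.T @ U + lam * V)

X_completed = U @ V.T                         # predictions for the unobserved entries
```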


Subject(s)
Data Analysis, Single-Cell Analysis, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Exome Sequencing
6.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34882223

ABSTRACT

Clinical data are increasingly being mined to derive new medical knowledge, with the goal of enabling greater diagnostic precision, better-personalized therapeutic regimens, improved clinical outcomes and more efficient utilization of health-care resources. However, clinical data are often only available at irregular intervals that vary between patients and data types, with entries often unmeasured or unknown. As a result, missing data often represent one of the major impediments to optimal knowledge derivation from clinical data. The Data Analytics Challenge on Missing data Imputation (DACMI) presented a shared clinical dataset with ground truth for evaluating and advancing the state of the art in imputing missing data for clinical time series. We extracted 13 commonly measured blood laboratory tests. To evaluate imputation performance, we randomly removed one recorded result per laboratory test per patient admission and used it as the ground truth. To the best of our knowledge, DACMI is the first shared-task challenge on clinical time series imputation. The challenge attracted 12 international teams spanning three continents across multiple industries and academia. The evaluation outcome suggests that competitive machine learning and statistical models (e.g. LightGBM, MICE and XGBoost) coupled with carefully engineered temporal and cross-sectional features can achieve strong imputation performance. However, care needs to be taken to avoid excessive model complexity. The participating systems collectively experimented with a wide range of machine learning and probabilistic algorithms to combine temporal and cross-sectional imputation, and their design principles will inform future efforts to better model clinical missing data.
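The winning recipe described above, gradient boosting on temporal plus cross-sectional features, can be sketched roughly as below; the lab names, features, and model are stand-ins, not a DACMI submission.

```python
# Hedged sketch: predict a lab value from its neighbours in time plus same-time
# values of other labs, using gradient boosting. Synthetic data and features.
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["creatinine", "bun", "sodium"])

target = "creatinine"
feats = pd.DataFrame({
    "prev": df[target].shift(1),               # temporal neighbours
    "next": df[target].shift(-1),
    "bun": df["bun"],                          # cross-sectional features
    "sodium": df["sodium"],
})
known = feats.notna().all(axis=1)
model = HistGradientBoostingRegressor().fit(feats[known], df.loc[known, target])
estimate = model.predict(feats[known].iloc[[0]])   # estimate one (pretend held-out) value
```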


Subject(s)
Algorithms, Machine Learning, Cross-Sectional Studies, Data Collection, Humans, Statistical Models
7.
Biometrics ; 80(1)2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38281771

ABSTRACT

Statistical approaches that successfully combine multiple datasets are more powerful, efficient, and scientifically informative than separate analyses. To address variation architectures correctly and comprehensively for high-dimensional data across multiple sample sets (i.e., cohorts), we propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method to concurrently learn both covariate-driven and auxiliary structured variations. We consider a structured nuclear norm objective that is motivated by random matrix theory, in which the regression or factorization terms may be shared or specific to any number of cohorts. Our framework subsumes several existing methods, such as reduced rank regression and unsupervised multi-matrix factorization approaches, and includes a promising novel approach to regression and factorization of a single dataset (aRRR) as a special case. Simulations demonstrate substantial gains in power from combining multiple datasets and from parsimoniously accounting for all structured variations. We apply maRRR to gene expression data from multiple cancer types (i.e., pan-cancer) from The Cancer Genome Atlas, with somatic mutations as covariates. The method performs well with respect to prediction and imputation of held-out data, and provides new insights into mutation-driven and auxiliary variations that are shared or specific to certain cancer types.
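For orientation, a generic nuclear-norm-based completion step (SVD soft-thresholding, often called soft-impute) is sketched below; it only illustrates the class of objective maRRR builds on and is not the maRRR estimator.

```python
# Generic SVD soft-thresholding for nuclear-norm-regularized matrix completion;
# shown only to illustrate the class of objective, not the maRRR method.
import numpy as np

rng = np.random.default_rng(4)
Y = rng.normal(size=(150, 80))
obs = rng.random(Y.shape) < 0.8
Z, lam = np.zeros_like(Y), 5.0

for _ in range(100):
    filled = np.where(obs, Y, Z)               # keep observed entries, current guess elsewhere
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    Z = (U * np.maximum(s - lam, 0)) @ Vt      # shrink singular values (nuclear-norm proximal step)
```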


Subject(s)
Neoplasms, Humans, Multivariate Analysis, Neoplasms/genetics
8.
BMC Med Res Methodol ; 24(1): 16, 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38254038

ABSTRACT

Lung cancer is a leading cause of cancer deaths and imposes an enormous economic burden on patients. It is important to develop an accurate risk assessment model to determine the appropriate treatment after an initial lung cancer diagnosis. The Cox proportional hazards model is the model most commonly employed in survival analysis. However, real-world medical data are usually incomplete, posing a great challenge to the application of this model. Commonly used imputation methods cannot achieve sufficient accuracy when data are missing, so we investigated novel methods for developing clinical prediction models. In this article, we present a novel model for survival prediction under missing-data scenarios. We collected data from 5,240 patients diagnosed with lung cancer at the Weihai Municipal Hospital, China. We then applied a joint model that combines a Bayesian network (BN) and a Cox model to predict mortality risk in individual patients with lung cancer. The established prognostic model achieved good predictive performance in terms of discrimination and calibration. We show that combining the BN with the Cox proportional hazards model is highly beneficial and provides a more efficient tool for risk prediction.
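A Cox proportional hazards fit of the kind combined with the Bayesian network can be sketched with the lifelines package as follows; the covariates and data are synthetic, and the BN component is not reproduced.

```python
# Illustrative Cox proportional hazards fit with lifelines; the Bayesian-network
# part of the joint model described above is not shown.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "age": rng.integers(40, 85, size=300),
    "stage": rng.integers(1, 5, size=300),
    "time": rng.exponential(24, size=300),      # months of follow-up (synthetic)
    "event": rng.integers(0, 2, size=300),      # 1 = death observed
})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
risk = cph.predict_partial_hazard(df.head())    # relative mortality risk per patient
```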


Subject(s)
Lung Neoplasms, Humans, Lung Neoplasms/diagnosis, Bayes Theorem, Prognosis, Calibration, China/epidemiology
9.
J Med Virol ; 95(8): e29036, 2023 08.
Article in English | MEDLINE | ID: mdl-37621210

ABSTRACT

The ongoing epidemic of SARS-CoV-2 is taking a substantial financial and health toll on people worldwide. Assessing the level and duration of SARS-CoV-2 neutralizing antibodies (Nab) would provide key information for governments to make sound healthcare policies. We described the temporal change of IgG levels in 450 individuals with moderate to critical COVID-19 infection, assessed at 3, 6, 12, and 18 months postdischarge. Moreover, a data imputation framework combined with a novel deep learning model was implemented to predict the long-term Nab and IgG levels in these patients. Demographic characteristics, inspection reports, and CT scans obtained during hospitalization were used in this model. Interpretability of the model was further validated with SHapley Additive exPlanations (SHAP) and Gradient-weighted Class Activation Mapping (GradCAM). IgG levels peaked at 3 months postdischarge and remained stable through 12 months, followed by a significant decline by 18 months postdischarge. The Nab levels, however, declined from 6 months postdischarge. By training on this cohort of 450 patients, our long-term antibody prediction (LTAP) model could predict long-term IgG levels with relatively high area under the receiver operating characteristic curve (AUC), accuracy, precision, recall, and F1-score, far exceeding the performance achievable by commonly used models. Several prognostic factors, including FDP levels, the percentages of T cells, B cells and natural killer cells, older age, sex, underlying diseases, and so forth, served as important indicators for IgG prediction. Based on the top 15 prognostic factors identified for IgG prediction, a simplified LTAP model for Nab level prediction was established and achieved an AUC of 0.828, which was 8.9% higher than MLP and 6.6% higher than LSTM. The close correlation between IgG and Nab levels makes it possible to predict long-term Nab levels based on the factors selected by our LTAP model. Furthermore, our model identified that coagulation disorders and excessive immune response, which indicate disease severity, are closely related to the production of IgG and Nab. This universal model can be used in routine discharge testing to identify virus-infected individuals at risk of recurrent infection and to determine the optimal timing of vaccination for general populations.


Subject(s)
COVID-19, Deep Learning, Humans, Neutralizing Antibodies, SARS-CoV-2, Aftercare, Prospective Studies, COVID-19/diagnosis, Patient Discharge, China/epidemiology, Viral Antibodies, Immunoglobulin G
10.
BMC Med Res Methodol ; 23(1): 259, 2023 11 06.
Article in English | MEDLINE | ID: mdl-37932660

ABSTRACT

BACKGROUND: Data loss often occurs during the collection of clinical data. Directly discarding incomplete samples may lead to low accuracy of medical diagnosis. A suitable data imputation method can help researchers make better use of valuable medical data. METHODS: In this paper, five popular imputation methods, including mean imputation, expectation-maximization (EM) imputation, K-nearest neighbors (KNN) imputation, denoising autoencoders (DAE) and generative adversarial imputation nets (GAIN), are applied to an incomplete clinical dataset of 28,274 cases for vaginal prolapse prediction. A comprehensive comparison of the performance of these methods is conducted using classification criteria. It is shown that prediction accuracy can be greatly improved by using the imputed data, especially with GAIN. To find the important risk factors for this disease among a large number of candidate features, three variable selection methods, the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD) and the broken adaptive ridge (BAR), are implemented in logistic regression for feature selection on the imputed datasets. In pursuit of our primary objective, accurate diagnosis, we employed diagnostic accuracy (classification accuracy) as the pivotal metric to assess both the imputation and the feature selection techniques. This assessment encompassed seven classifiers (a logistic regression (LR) classifier, a random forest (RF) classifier, a support vector classifier (SVC), extreme gradient boosting (XGBoost), a LASSO classifier, a SCAD classifier and an Elastic Net classifier), enhancing the comprehensiveness of our evaluation. RESULTS: The proposed imputation-variable selection-prediction framework is well suited to the collected vaginal prolapse datasets. The original dataset is first imputed well by GAIN, after which the 9 most significant features are selected by BAR from the original 67 features in the GAIN-imputed dataset, with only negligible loss in model prediction. BAR is superior to the other two variable selection methods in our tests. CONCLUSIONS: Overall, by combining imputation, classification and variable selection, we achieve good interpretability while maintaining high accuracy in computer-aided medical diagnosis.
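The impute-select-classify pipeline can be approximated with scikit-learn components, as in the hedged sketch below; a KNN imputer and an L1-penalized logistic regression stand in for GAIN and the BAR penalty, which are not available off the shelf here.

```python
# Sketch of an impute -> select -> classify pipeline with scikit-learn stand-ins;
# not the GAIN/BAR combination evaluated in the paper. All data are synthetic.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 67))                      # 67 candidate clinical features
X[rng.random(X.shape) < 0.15] = np.nan              # incomplete records
y = rng.integers(0, 2, size=500)                    # prolapse yes/no (synthetic labels)

pipe = make_pipeline(
    KNNImputer(n_neighbors=5),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),  # sparse feature selection
)
print(cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean())
```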


Subject(s)
Uterine Prolapse, Female, Humans, Computer-Assisted Diagnosis, Logistic Models
11.
Environ Sci Technol ; 57(46): 18246-18258, 2023 Nov 21.
Article in English | MEDLINE | ID: mdl-37661931

ABSTRACT

Gaps in the measurement series of atmospheric pollutants can impede the reliable assessment of their impacts and trends. We propose a new method for imputing missing data of the air pollutant tropospheric ozone using the graph machine learning algorithm "correct and smooth". This algorithm uses auxiliary data that characterize the measurement location and, in addition, ozone observations at neighboring sites to improve the imputations of simple statistical and machine learning models. We apply our method to data from 278 stations of the German Environment Agency (Umweltbundesamt, UBA) monitoring network for the year 2011. The preliminary version of these data exhibits three gap patterns: shorter gaps on the order of hours, longer gaps of up to several months, and gaps occurring at multiple stations at once. For short gaps of up to 5 h, linear interpolation is most accurate. Longer gaps at single stations are most effectively imputed by a random forest combined with correct and smooth. For longer gaps at multiple stations, the correct and smooth algorithm improved the random forest despite the lack of data in the neighborhood of the missing values. We therefore suggest a hybrid of linear interpolation and graph machine learning for imputing tropospheric ozone time series.
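The residual correct-and-smooth post-processing step can be illustrated on a toy station graph as follows; the adjacency, base predictor, and propagation weights are assumptions for the sketch, not the authors' configuration.

```python
# Simplified residual "correct and smooth"-style smoothing over a station graph.
import numpy as np

rng = np.random.default_rng(7)
n_stations = 278
A = rng.random((n_stations, n_stations)) < 0.02
A = np.maximum(A, A.T).astype(float)
A /= A.sum(axis=1, keepdims=True) + 1e-9             # row-normalize neighbour weights

truth = rng.normal(40, 10, size=n_stations)          # hourly ozone values (synthetic)
base_pred = truth + rng.normal(0, 5, size=n_stations)    # e.g. random-forest output
observed = rng.random(n_stations) < 0.7              # stations with a measurement

residual = np.where(observed, truth - base_pred, 0.0)
for _ in range(10):                                  # propagate residuals to gap stations
    residual = 0.5 * residual + 0.5 * A @ residual
corrected = base_pred + residual
```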


Subject(s)
Air Pollutants, Air Pollution, Ozone, Ozone/analysis, Air Pollution/analysis, Environmental Monitoring/methods, Air Pollutants/analysis, Machine Learning
12.
Sensors (Basel) ; 23(23)2023 Dec 04.
Article in English | MEDLINE | ID: mdl-38067974

ABSTRACT

Traffic state data are key to the proper operation of intelligent transportation systems (ITS). However, traffic detectors are often affected by environmental factors, causing missing values in the collected traffic state data. Aiming at this problem, a method for imputing missing traffic state data based on a Diffusion Convolutional Neural Network-Generative Adversarial Network (DCNN-GAN) is proposed in this paper. The proposed method uses a graph embedding algorithm to construct a road network structure based on spatial correlation instead of the original road network structure; through adversarial training of the GAN, missing traffic state data can be generated from the known data of the road network. In the generator, the spatiotemporal features of the reconstructed road network are extracted by the DCNN to realize the imputation. Two real traffic datasets were used to verify the effectiveness of this method, and the results of the proposed model proved better than those of the other models used for comparison.

13.
Sensors (Basel) ; 23(21)2023 Oct 24.
Article in English | MEDLINE | ID: mdl-37960379

ABSTRACT

Batch process monitoring datasets usually contain missing data, which decreases the performance of data-driven modeling for fault identification and optimal control. Many methods have been proposed to impute missing data; however, they do not fulfill the need for data quality, especially in sensor datasets with different types of missing data. We propose a hybrid missing data imputation method for batch process monitoring datasets with multi-type missing data. In this method, the missing data is first classified into five categories based on the continuous missing duration and the number of variables missing simultaneously. Then, different categories of missing data are step-by-step imputed considering their unique characteristics. A combination of three single-dimensional interpolation models is employed to impute transient isolated missing values. An iterative imputation based on a multivariate regression model is designed for imputing long-term missing variables, and a combination model based on single-dimensional interpolation and multivariate regression is proposed for imputing short-term missing variables. The Long Short-Term Memory (LSTM) model is utilized to impute both short-term and long-term missing samples. Finally, a series of experiments for different categories of missing data were conducted based on a real-world batch process monitoring dataset. The results demonstrate that the proposed method achieves higher imputation accuracy than other comparative methods.
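The gap-length-dependent treatment can be sketched with pandas: interpolate only short gaps and leave long gaps for a model-based imputer. The thresholds and data below are illustrative, not those used in the paper.

```python
# Sketch of gap-length-aware imputation: short gaps via interpolation, long gaps
# left for a regression/LSTM model. Thresholds and data are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
s = pd.Series(np.sin(np.linspace(0, 20, 500)) + rng.normal(0, 0.05, 500))
s.iloc[100:103] = np.nan                  # transient isolated gap
s.iloc[300:360] = np.nan                  # long-term gap

short_filled = s.interpolate(method="linear", limit=5, limit_area="inside")
long_gaps = short_filled.isna()           # still missing -> handle with a model-based imputer
print(f"{long_gaps.sum()} points left for model-based imputation")
```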

14.
Sensors (Basel) ; 24(1)2023 Dec 23.
Article in English | MEDLINE | ID: mdl-38202948

ABSTRACT

The deployment of Electronic Toll Collection (ETC) gantry systems marks a transformative advancement toward an interconnected and intelligent highway traffic infrastructure. The integration of these systems signifies a leap forward in streamlining toll collection and minimizing environmental impact through decreased idle times. To address the problems of missing sensor data in large-volume ETC gantry systems and insufficient traffic detection between ETC gantries, this study constructs a high-order tensor model based on an analysis of the high-dimensional, sparse, large-volume, and heterogeneous characteristics of ETC gantry data. In addition, a missing data completion method for ETC gantry data is proposed based on an improved dynamic tensor flow model. This study approximates the decomposition of neighboring tensor blocks in the high-order tensor model of the ETC gantry data using tensor Tucker decomposition and the Laplacian matrix. This method captures the correlations among space, time, and user information in the ETC gantry data. Case studies demonstrate that our method enhances ETC gantry data quality across various rates of missing data while also reducing computational complexity. For instance, at a missing data rate of less than 5%, our approach reduced the RMSE for time vehicle distance by 0.0051, for traffic volume by 0.0056, and for interval speed by 0.0049 compared to the MATRIX method. These improvements not only indicate potential for more precise traffic data analysis but also add value to the application of ETC systems and contribute to theoretical and practical advancements in the field.
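For reference, the third-order Tucker decomposition that this approach builds on can be written in its general form as follows (a textbook statement, not the paper's specific dynamic model), where the three modes might index, e.g., gantry, time interval, and traffic attribute, and missing entries are recovered from the low-rank reconstruction:

```latex
\mathcal{X} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)},
\qquad
x_{ijt} \approx \sum_{p,q,r} g_{pqr}\, u^{(1)}_{ip}\, u^{(2)}_{jq}\, u^{(3)}_{tr}.
```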

15.
Sensors (Basel) ; 23(24)2023 Dec 08.
Article in English | MEDLINE | ID: mdl-38139543

ABSTRACT

Supervisory control and data acquisition (SCADA) systems are widely utilized in power equipment for condition monitoring. The collected data, however, commonly suffer from missing values of different types and patterns, which degrades data quality and makes the data difficult to use. To address this problem, this paper presents a method that combines an asymmetric denoising autoencoder (ADAE) and a moving average filter (MAF) to perform accurate missing data imputation. First, convolution and gated recurrent units (GRU) are applied in the encoder of the ADAE, while the decoder still utilizes fully connected layers, forming an asymmetric network structure. The ADAE extracts the local periodic and temporal features from monitoring data and then decodes the features to impute multiple types of missing data. On this basis, exploiting the continuity of power data in the time domain, the MAF fuses prior knowledge from the neighborhood of missing data to further refine the imputed values. Case studies reveal that the developed method achieves greater accuracy compared with existing models. Experiments under different scenarios show that the MAF-ADAE method applies to actual power equipment monitoring data imputation.
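The moving-average refinement step can be sketched with pandas as below; the window size and the stand-in "autoencoder output" are assumptions, and the ADAE network itself is not shown.

```python
# Sketch of the moving-average post-processing step only: smooth freshly imputed
# points using neighbouring samples. The ADAE outputs here are simulated.
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
signal = pd.Series(np.cos(np.linspace(0, 10, 300)) + rng.normal(0, 0.02, 300))
was_missing = pd.Series(False, index=signal.index)
was_missing.iloc[120:140] = True
signal.iloc[120:140] += rng.normal(0, 0.2, 20)    # pretend these are rough autoencoder outputs

smoothed = signal.rolling(window=7, center=True, min_periods=1).mean()
signal[was_missing] = smoothed[was_missing]       # refine only the imputed points
```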

16.
Int J Mol Sci ; 24(3)2023 Jan 21.
Article in English | MEDLINE | ID: mdl-36768467

ABSTRACT

The present study analyses the effect of a beverage composed of citrus and maqui (Aristotelia chilensis) with different sweeteners on male and female consumers. The beverages were designed and tested (140 volunteers) as a source of polyphenols in a previous work. Plasma samples were taken before and after two months of daily intake. Bioactive-compound levels in the samples were measured with metabolomics techniques, and the resulting data were analysed with advanced versions of ANOVA and clustering analysis to describe the effects of sex and sweetener on bioactive compounds. To improve the results, machine learning techniques were applied to perform feature selection and data imputation. The results reveal a series of compounds that are more strongly regulated in men, such as caffeic acid or 3,4-dihydroxyphenylacetic acid, and in women, trans-ferulic acid (TFA) or naringenin glucuronide. Sweetener-dependent regulation is also observed, such as TFA with stevia in women or vanillic acid with sucrose in men. These results show that there is a sex-dependent differential regulation of these two families of polyphenols and that it is influenced by the sweeteners.


Subject(s)
Citrus, Stevia, Animals, Sweetening Agents/pharmacology, Beverages/analysis, Polyphenols/analysis
17.
Entropy (Basel) ; 25(1)2023 Jan 10.
Article in English | MEDLINE | ID: mdl-36673278

ABSTRACT

Since missing values in multivariate time series data are inevitable, many researchers have proposed methods to deal with the missing data. These include case deletion methods, statistics-based imputation methods, and machine learning-based imputation methods. However, these methods either cannot handle temporal information or produce unstable imputation results. We propose a model based on generative adversarial networks (GANs) and an iterative strategy based on the gradient of the imputation results to solve these problems. This ensures the generalizability of the model and the reasonableness of the imputation results. We conducted experiments on three large-scale datasets and compared the model with traditional imputation methods. The experimental results show that imputeGAN outperforms traditional imputation methods in terms of imputation accuracy.

18.
BMC Bioinformatics ; 23(1): 145, 2022 Apr 22.
Article in English | MEDLINE | ID: mdl-35459087

ABSTRACT

BACKGROUND: High-dimensional transcriptome profiling, whether through next-generation sequencing techniques or high-throughput arrays, may result in scattered variables with missing data. Data imputation is a common strategy to maximize the inclusion of samples by using statistical techniques to fill in missing values. However, many data imputation methods are cumbersome and risk introducing systematic bias. RESULTS: We present a new data imputation method using constrained least squares and algorithms from the inverse-problems literature, and present applications of this technique to miRNA expression analysis. The proposed technique is shown to perform imputation orders of magnitude faster, with accuracy greater than or equal to that of similar methods from the literature. CONCLUSIONS: This study offers a robust and efficient algorithm for data imputation, which can be used, e.g., to improve cancer prediction accuracy in the presence of missing data.
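A plain unconstrained least-squares variant of this gap-filling idea is sketched below with numpy: regress a probe with missing samples on probes observed everywhere and predict the gaps. The constrained solver developed in the paper is not reproduced.

```python
# Generic least-squares gap filling with np.linalg.lstsq; a plain stand-in for the
# constrained solver described above. Data are synthetic.
import numpy as np

rng = np.random.default_rng(10)
expr = rng.normal(size=(40, 120))                  # samples x miRNAs
expr[5:10, 0] = np.nan                             # one probe has missing samples

complete_cols = ~np.isnan(expr).any(axis=0)        # probes observed in every sample
obs_rows = ~np.isnan(expr[:, 0])
A = expr[obs_rows][:, complete_cols]
coef, *_ = np.linalg.lstsq(A, expr[obs_rows, 0], rcond=None)
expr[~obs_rows, 0] = expr[~obs_rows][:, complete_cols] @ coef   # fill the gaps
```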


Subject(s)
MicroRNAs, Algorithms, Gene Expression Profiling/methods, Least-Squares Analysis, MicroRNAs/genetics, Statistical Models, Oligonucleotide Array Sequence Analysis/methods
19.
BMC Genomics ; 23(1): 377, 2022 May 18.
Article in English | MEDLINE | ID: mdl-35585494

ABSTRACT

BACKGROUND: In the pursuit of a better understanding of biodiversity, evolutionary biologists rely on the study of phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted in the shape of phylogenetic trees, not only help to understand evolutionary history but also have a wide range of additional applications in science. One of the most challenging problems that arise when building phylogenetic trees is the presence of missing biological data. More specifically, the likelihood of inferring wrong phylogenetic trees increases with the amount of missing values in the input data. Although methods have been proposed to deal with this issue, their applicability and accuracy are often restricted by different constraints. RESULTS: We propose a framework, called PhyloMissForest, to impute missing entries in phylogenetic distance matrices and infer accurate evolutionary relationships. PhyloMissForest is built upon a random forest structure that infers the missing entries of the input data based on its known parts. PhyloMissForest provides a robust and configurable framework that incorporates multiple search strategies and machine learning, complemented by phylogenetic techniques, to deliver a more accurate inference of lost phylogenetic distances. We evaluate our framework on three real-world datasets, two DNA-based sequence alignments and one containing amino acid data, and two additional instances with simulated DNA data. Moreover, we follow a design-of-experiments methodology to define the hyperparameter values of our algorithm, a concise method that is preferable to an exhaustive parameter search. By varying the percentage of missing data from 5% to 60%, we generally outperform the state-of-the-art alternative imputation techniques in the tests conducted on real DNA data. In addition, significant improvements in execution time are observed for the amino acid instance. The results observed on simulated data also show improved imputations when dealing with large percentages of missing data. CONCLUSIONS: By merging multiple search strategies, machine learning, and phylogenetic techniques, PhyloMissForest provides a highly customizable and robust framework for phylogenetic missing data imputation, with significant topological accuracy and effective speedups over the state of the art.


Subject(s)
Algorithms, DNA, Amino Acids, Phylogeny, Sequence Alignment
20.
Mol Pharm ; 19(5): 1488-1504, 2022 05 02.
Article in English | MEDLINE | ID: mdl-35412314

ABSTRACT

Animal pharmacokinetic (PK) data as well as human and animal in vitro systems are utilized in drug discovery to define the rate and route of drug elimination. Accurate prediction and mechanistic understanding of drug clearance and disposition in animals provide a degree of confidence for extrapolation to humans. In addition, prediction of in vivo properties can be used to improve design during drug discovery, help select compounds with better properties, and reduce the number of in vivo experiments. In this study, we generated machine learning models able to predict rat in vivo PK parameters and concentration-time PK profiles based on the molecular chemical structure and either measured or predicted in vitro parameters. The models were trained on internal in vivo rat PK data for over 3000 diverse compounds from multiple projects and therapeutic areas, and the predicted endpoints include clearance and oral bioavailability. We compared the performance of various traditional machine learning algorithms and deep learning approaches, including graph convolutional neural networks. The best models for PK parameters achieved R2 = 0.63 [root mean squared error (RMSE) = 0.26] for clearance and R2 = 0.55 (RMSE = 0.46) for bioavailability. The models provide a fast and cost-efficient way to guide the design of molecules with optimal PK profiles, to enable the prediction of virtual compounds at the point of design, and to drive prioritization of compounds for in vivo assays.
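A minimal version of structure-to-property prediction can be sketched with RDKit descriptors and a random forest, as below; the descriptors, labels, and model are placeholders, not the published clearance or bioavailability models.

```python
# Hedged sketch: predict a PK-like endpoint from simple RDKit descriptors with a
# random forest. Training labels and molecules are made up for illustration.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor

smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1", "CCN(CC)CC"]
log_cl = np.array([0.2, 0.5, 0.1, 0.8])            # made-up training labels

def featurize(smi):
    mol = Chem.MolFromSmiles(smi)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumHDonors(mol)]

X = np.array([featurize(s) for s in smiles])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, log_cl)
print(model.predict([featurize("CCOC(=O)C")]))     # predict a new virtual compound
```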


Subject(s)
Machine Learning, Biological Models, Animals, Biological Availability, Drug Discovery, Metabolic Clearance Rate, Pharmaceutical Preparations, Pharmacokinetics, Rats