Search | BVS Search Portal

A comparison of RNA-Seq data preprocessing pipelines for transcriptomic predictions across independent studies.

Van, Richard; Alvarez, Daniel; Mize, Travis; Gannavarapu, Sravani; Chintham Reddy, Lohitha; Nasoz, Fatma; Han, Mira V.

BMC Bioinformatics ; 25(1): 181, 2024 May 08.

Article in English | MEDLINE | ID: mdl-38720247

ABSTRACT

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge of current cancer predictors is that they often yield suboptimal performance estimates when integrating molecular datasets generated by different labs. Often, the data vary in quality, are procured differently, and contain unwanted noise that hampers the ability of a predictive model to extract useful information. Data preprocessing methods can be applied in an attempt to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin.

RESULTS: We aimed to investigate the impact of data preprocessing steps, focusing on normalization, batch effect correction, and data scaling, through trial and comparison. Our goal was to improve the cross-study predictions of tissue of origin for common cancers on large-scale RNA-Seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the associated classifier models constructed for tissue of origin predictions in cancer.

CONCLUSION: By using TCGA as a training set and applying data preprocessing methods, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, the use of data preprocessing operations worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-Seq datasets, the application of data preprocessing techniques to a machine learning pipeline is not always appropriate.
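As an illustration of two ideas named in the abstract above, the following minimal Python sketch shows (1) per-batch mean-centering of one gene's expression values and (2) the weighted F1-score. The centering function is a deliberately simplified stand-in for batch effect correction and is not the method used in the paper, which the abstract does not specify. The function names, tissue labels, and numbers are hypothetical.

```python
from collections import Counter


def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples (values for a single gene).

    Simplified stand-in for batch effect correction; an assumption for
    illustration, not the correction method used in the study.
    """
    sums, counts = {}, {}
    for value, batch in zip(values, batches):
        sums[batch] = sums.get(batch, 0.0) + value
        counts[batch] = counts.get(batch, 0) + 1
    return [value - sums[batch] / counts[batch]
            for value, batch in zip(values, batches)]


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Hypothetical expression values for one gene measured in two batches.
centered = center_by_batch([1.0, 3.0, 10.0, 14.0], ["A", "A", "B", "B"])

# Hypothetical tissue-of-origin labels for six test samples.
y_true = ["lung", "lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "breast", "lung"]
print(centered, round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class support, a frequent tumor type contributes more to the score than a rare one, which matters when the test set is imbalanced across tissues.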

Subject(s)

Machine Learning; Neoplasms; RNA-Seq; Humans; RNA-Seq/methods; Neoplasms/genetics; Transcriptome/genetics; Sequence Analysis, RNA/methods; Gene Expression Profiling/methods; Computational Biology/methods

Machine Learning Approaches for Fracture Risk Assessment: A Comparative Analysis of Genomic and Phenotypic Data in 5130 Older Men.

Wu, Qing; Nasoz, Fatma; Jung, Jongyun; Bhattarai, Bibek; Han, Mira V.

Calcif Tissue Int ; 107(4): 353-361, 2020 Oct.

Article in English | MEDLINE | ID: mdl-32728911

ABSTRACT

The study aimed to develop fracture prediction models using machine learning approaches and genomic data, and to identify the best modeling approach for fracture prediction. The genomic data of the Osteoporotic Fractures in Men cohort study (n = 5130) were analyzed. After a comprehensive genotype imputation, a genetic risk score (GRS) was calculated for each participant from 1103 associated single nucleotide polymorphisms (SNPs). Data were normalized and split into a training set (80%) and a validation set (20%) for analysis. Random forest, gradient boosting, neural network, and logistic regression were used separately to develop prediction models for major osteoporotic fractures, with GRS, bone density, and other risk factors as predictors. In model training, the synthetic minority oversampling technique was used to account for the low fracture rate, and tenfold cross-validation was employed for hyperparameter optimization. In testing, the area under the curve (AUC) and accuracy were used to assess model performance. The McNemar test was employed to examine the accuracy difference between models. The results showed that the prediction performance of gradient boosting was the best, with an AUC of 0.71 and an accuracy of 0.88, and that the GRS ranked as the 7th most important variable in the model. The performance of random forest and neural network was also significantly better than that of logistic regression. This study suggested that fracture prediction in older men can be improved by incorporating genetic profiling and by utilizing the gradient boosting approach. This result should not be extrapolated to women or young individuals.
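The two evaluation tools named in the abstract above, AUC and the McNemar test, can be sketched in plain Python as follows. This is an illustration under stated assumptions: the abstract does not say which McNemar variant was used, so the exact binomial form is assumed here, and the function names and data are hypothetical.

```python
from math import comb


def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for the accuracy difference of two models.

    Only discordant samples matter: those that exactly one model classifies
    correctly. Assumed variant; the source does not specify exact vs. chi-square.
    """
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree on correctness
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical fracture labels (1 = fracture) and predicted risk scores.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))

# Hypothetical case: model A is right on five samples where model B is wrong.
print(mcnemar_exact([1] * 5, [1] * 5, [0] * 5))
```

The McNemar test suits this comparison because both models are scored on the same validation samples, so their errors are paired rather than independent.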

Subject(s)

Bone Density; Fractures, Bone/diagnosis; Machine Learning; Risk Assessment; Activities of Daily Living; Aged; Aged, 80 and over; Cohort Studies; Genomics; Humans; Male; Phenotype

Machine learning approaches for the prediction of bone mineral density by using genomic and phenotypic data of 5130 older men.

Wu, Qing; Nasoz, Fatma; Jung, Jongyun; Bhattarai, Bibek; Han, Mira V; Greenes, Robert A; Saag, Kenneth G.

Sci Rep ; 11(1): 4482, 2021 Feb 24.

Article in English | MEDLINE | ID: mdl-33627720

ABSTRACT

The study aimed to utilize machine learning (ML) approaches and genomic data to develop a prediction model for bone mineral density (BMD) and to identify the best modeling approach for BMD prediction. The genomic and phenotypic data of the Osteoporotic Fractures in Men Study (n = 5130) were analyzed. A genetic risk score (GRS) was calculated for each participant from 1103 associated SNPs after a comprehensive genotype imputation. Data were normalized and divided into a training set (80%) and a validation set (20%) for analysis. Random forest, gradient boosting, neural network, and linear regression were used separately to develop BMD prediction models. Ten-fold cross-validation was used for hyperparameter optimization. Mean square error and mean absolute error were used to assess model performance. When GRS and phenotypic covariates were used as the predictors, the performance of all ML models and of linear regression in BMD prediction was similar. However, when GRS was replaced with the 1103 individual SNPs in the model, ML models performed significantly better than linear regression (with lasso regularization), and the gradient boosting model performed the best. Our study suggested that ML models, especially gradient boosting, can improve BMD prediction from genomic data.
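The quantities mentioned in the abstract above can be sketched in a few lines of Python. The genetic risk score is written here as a weighted sum of risk-allele dosages, which is a common construction but an assumption, since the abstract does not state whether the score was weighted. The function names, dosages, weights, and BMD values are hypothetical.

```python
def genetic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per SNP).

    Assumed form of the GRS; the source does not give the weighting scheme.
    """
    if len(dosages) != len(weights):
        raise ValueError("one weight is required per SNP")
    return sum(d * w for d, w in zip(dosages, weights))


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical participant with three SNPs and illustrative effect weights.
print(genetic_risk_score([2, 1, 0], [0.5, 0.25, 1.0]))

# Hypothetical observed and predicted BMD values.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.5, 2.0]
print(mean_squared_error(observed, predicted), mean_absolute_error(observed, predicted))
```

Collapsing all SNPs into one score gives a model a single genetic predictor, whereas supplying the individual SNPs, as in the second analysis described above, lets the model weigh each variant on its own.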

Subject(s)

Bone Density/genetics; Bone Density/physiology; Aged; Fractures, Bone/genetics; Fractures, Bone/pathology; Genomics/methods; Genotype; Humans; Linear Models; Machine Learning; Male; Polymorphism, Single Nucleotide/genetics; Risk Assessment; Risk Factors
