ABSTRACT
Feature selection plays an important role in improving classification performance and in reducing the dimensionality of high-dimensional datasets, such as high-throughput genomics and proteomics data in bioinformatics. Information theory, a popular approach that is computationally efficient and scalable, has been widely incorporated into feature selection. In this study, we propose a weight-based feature selection (WBFS) algorithm that assesses both selected features and candidate features to identify the key protein biomarkers for classifying lung cancer subtypes from The Cancer Proteome Atlas (TCPA) database. We further performed survival analysis relating the selected biomarkers to the subtypes of lung cancer. The results show that combining our WBFS method with a Bayesian network performs well for mining potential biomarkers. These candidate signatures have valuable biological significance in tumor classification and patient survival analysis. Taken together, this study proposes the WBFS method, which helps to explore candidate biomarkers from biomedical datasets and provides useful information for tumor diagnosis or therapy strategies.
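The abstract does not specify how WBFS computes its weights, so the following is not the WBFS algorithm. It is only a minimal sketch of the general idea behind information-theoretic feature selection: ranking features by their mutual information with the class label. The feature names, subtype labels, and values are made up for illustration, and the features are assumed to be already discretized (for example into "high" and "low" expression).

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Return I(X;Y) in bits for two equally long discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def rank_features(features, labels):
    """Sort feature names by mutual information with the labels, highest first."""
    scores = {name: mutual_information(values, labels)
              for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: six samples, two subtypes, three discretized proteins.
labels = ["subtype_1", "subtype_1", "subtype_2",
          "subtype_2", "subtype_1", "subtype_2"]
features = {
    "protein_a": ["high", "high", "low", "low", "high", "low"],   # tracks the label
    "protein_b": ["high", "low", "high", "low", "high", "low"],   # weakly related
    "protein_c": ["low", "low", "low", "low", "low", "low"],      # constant
}

ranking = rank_features(features, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

A feature that perfectly tracks the label receives the full label entropy as its score, while a constant feature scores zero. Methods in this family typically go further and also penalize redundancy among the features already selected.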
ABSTRACT
Reverse phase protein array (RPPA) is a functional proteomics technology amenable to moderately high throughputs of samples and antibodies. The University of Texas MD Anderson Cancer Center RPPA Core Facility has implemented various processes and techniques to maximize RPPA throughput; key among them are maximizing array configuration and relying on database management and automation. One major tool used by the RPPA Core is a semi-automated RPPA process management system referred to as the RPPA Pipeline. The RPPA Pipeline, developed with the aid of MD Anderson's Department of Bioinformatics and Computational Biology and InSilico Solutions, has streamlined sample and antibody tracking as well as advanced quality control measures of various RPPA processes. This chapter covers the RPPA Core processes associated with the RPPA Pipeline workflow, from sample receipt through sample printing and slide staining to RPPA report generation, which together enable the RPPA Core to process at least 13,000 samples per year with approximately 450 individual RPPA-quality antibodies. Additionally, this chapter covers results of large-scale clinical sample processing, including The Cancer Genome Atlas Project and The Cancer Proteome Atlas.
Subjects
Protein Array Analysis, Proteomics, Clinical Studies as Topic, Humans, Proteome, Proteomics/instrumentation, Proteomics/methods, Proteomics/trends, Quality Control
ABSTRACT
The Cancer Proteome Atlas (TCPA) project collects reverse-phase protein array (RPPA)-based proteome datasets from nearly 8000 samples across 32 cancer types. This study aims to investigate the pan-cancer proteome signature and identify cancer subtypes of glioma, kidney cancer, and lung cancer based on TCPA data. We first visualized the tumor clustering models using t-distributed stochastic neighbour embedding (t-SNE) and a bi-clustering heatmap. Then, three feature selection methods (pyHSICLasso, XGBoost, and Random Forest) were applied to select protein features for classifying cancer subtypes in the training dataset, and the LibSVM algorithm was employed to test classification accuracy in the validation dataset. Clustering analysis revealed that different kinds of tumors have relatively distinct proteomic profiles based on tissue of origin. We identified 20, 10, and 20 protein features with the highest accuracies in classifying subtypes of glioma, kidney cancer, and lung cancer, respectively. The predictive abilities of the selected proteins were confirmed by receiver operating characteristic (ROC) analysis. Finally, a Bayesian network was used to explore the protein biomarkers that have direct causal relationships with cancer subtypes. Overall, we highlight the theoretical and technical applications of machine learning-based feature selection approaches in the analysis of high-throughput biological data, particularly for cancer biomarker research. SIGNIFICANCE: Functional proteomics is a powerful approach for characterizing cell signaling pathways and understanding their phenotypic effects on cancer development. The TCPA database provides a platform to explore and analyze TCGA pan-cancer RPPA-based protein expression.
With the advent of RPPA technology, the availability of high-throughput data on the TCPA platform has made it possible to use machine learning methods to identify protein biomarkers and further differentiate cancer subtypes based on proteomic data. In this study, we highlight the role of feature selection and Bayesian networks in discovering protein biomarkers for classifying cancer subtypes based on functional proteomic data. The application of machine learning methods to the analysis of high-throughput biological data, particularly in cancer biomarker research, has potential clinical value in developing individualized treatment strategies.
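The ROC analysis mentioned above summarizes how well a score separates two classes. As a minimal sketch, not the authors' code, the area under the ROC curve (AUC) can be computed as the fraction of positive/negative sample pairs in which the positive sample receives the higher score, with ties counted as half. The scores and labels below are made up, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels):
    """Return the AUC for binary labels (1 = positive, 0 = negative).

    Uses the pairwise definition: the probability that a randomly chosen
    positive sample scores higher than a randomly chosen negative one.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for six validation samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to a score that ranks positives no better than chance, and 1.0 to perfect separation. The quadratic pairwise loop is chosen here for readability; a rank-based computation gives the same value more efficiently on large datasets.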