ABSTRACT
BACKGROUND: The histopathological and molecular heterogeneity of normal tissue adjacent to cancerous tissue (NTAC) and normal tissue adjacent to benign tissue (NTAB), together with the limited availability of specimens, makes deciphering the mechanisms of carcinogenesis challenging. Our goal was to identify histogenetic biomarkers that could be reliably used to define a transforming fingerprint using RNA in situ hybridization. METHODS: We evaluated 15 tumor-related RNA in situ hybridization biomarkers using tumor microarray and samples of seven tumor-adjacent normal tissues from 314 patients. Biomarkers were determined using comprehensive statistical methods (significance of support vector machine-based artificial intelligence and area under curve scoring of classification distribution). RESULTS: TP53 was found to be the most reliable index (P < 10^-7; area under curve > 87%) for distinguishing NTAC from NTAB, according to the results of a significance panel (BCL10, BECN1, BRCA2, FITH, PTCH11 and TP53). CONCLUSIONS: The genetic alterations in TP53 between NTAC and NTAB may provide new insight into the field of cancerization and tumor transformation.
Subject(s)
Biomarkers, Tumor/analysis , Tumor Suppressor Protein p53/analysis , Cell Transformation, Neoplastic , Genes, p53 , Humans , In Situ Hybridization
ABSTRACT
MOTIVATION: Feature selection approaches have been widely applied to deal with the small sample size problem in the analysis of microarray datasets. For the multiclass problem, previously proposed methods are based on the idea of selecting a single gene subset to distinguish all classes. However, it can be more effective to split a multiclass problem into a set of two-class problems and solve each one with its own classification system. RESULTS: We propose a genetic programming (GP)-based approach to analyze multiclass microarray datasets. Unlike in traditional GP, each individual proposed in this article consists of a set of small-scale ensembles, called sub-ensembles (SEs). Each SE consists of a set of trees. In application, a multiclass problem is divided into a set of two-class problems, each of which is first tackled by one SE. The SEs tackling the respective two-class problems are then combined to construct a GP individual, so each individual can deal with a multiclass problem directly. Effective methods are proposed to solve the problems arising in the fusion of SEs, and a greedy algorithm is designed to maintain high diversity among SEs. This GP is tested on five datasets. The results show that the proposed method effectively implements the feature selection and classification tasks.
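The decomposition of a multiclass problem into two-class problems can be illustrated with a minimal Python sketch. It is not the authors' GP: a hypothetical nearest-centroid rule stands in for each sub-ensemble, the pairwise decisions are combined by simple voting (the abstract does not specify the fusion rule), and all function names and the toy data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_pairwise(X, y):
    """Build one two-class model per pair of classes.

    Each model is only a pair of class centroids; in the abstract this
    role is played by a sub-ensemble (SE) of GP trees.
    """
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    models = {}
    for a, b in combinations(sorted(by_class), 2):
        models[(a, b)] = (centroid(by_class[a]), centroid(by_class[b]))
    return models


def predict(models, row):
    """Let every two-class model vote, then return the most voted class."""
    votes = Counter()
    for (a, b), (ca, cb) in models.items():
        winner = a if sq_dist(row, ca) <= sq_dist(row, cb) else b
        votes[winner] += 1
    return votes.most_common(1)[0][0]


# Toy three-class data (assumed, not from the paper).
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = ["A", "A", "B", "B", "C", "C"]
models = train_pairwise(X, y)
print(len(models), predict(models, [9, 1]))
```

With three classes the sketch builds three two-class models, one per class pair; with k classes it would build k(k-1)/2.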
Subject(s)
Algorithms , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Classification/methods , Pattern Recognition, Automated/methods , Reproducibility of Results , Sample Size
ABSTRACT
A previously developed quadratic model for predicting palm oil mill effluent (POME) degradation efficiency quantitatively evaluated the effects of O2 flowrate, TiO2 loading and initial POME concentration in a lab-scale photocatalytic system, but it generalized poorly because of overfitting. Its high RMSE (131.61) and low R2 (-630.49) show that it cannot describe POME degradation at unseen factor ranges, which confirms the poor generalization. To overcome this issue, several models were developed with machine learning techniques, namely Gaussian Process Regression (GPR), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM) and Regression Tree Ensemble (RTE), and were then assessed systematically. To achieve high generalization, all models were subjected to a 'train-all-test-all' strategy and to 5-fold and 10-fold cross validation. The GPR model was highly accurate under the 'train-all-test-all' strategy, judging from its low RMSE (1.0394) and high R2 (0.9962), but this result carries a risk of overfitting. In 5-fold cross validation, despite a relatively poorer RMSE and R2 (1.7964 and 0.9886), the GPR model showed the highest generalization while sufficiently preserving its accuracy. The SVM and RTE models also showed promising R2 values (0.9372 and 0.9208), which were however overshadowed by their high RMSEs (4.2174 and 4.7366). The strong generalization of the GPR model was further verified in 10-fold cross validation: the lowest RMSE (2.1624) and highest R2 (0.9835), obtained with a feature number of 36, indicate that it is adequate in both generalization and accuracy. The other models all gave slightly lower R2 (> 0.9), plausibly because of their higher RMSE (> 4.0).
According to the GPR model, optimized POME degradation (52.52%) can be obtained at 70 mL/min of O2, 70.0 g/L of TiO2 and 250 ppm of POME concentration, with only ~3% error compared with the actual data.
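The evaluation scheme in this abstract (RMSE and R2 measured on held-out folds) can be sketched in plain Python. This is an illustrative sketch under stated assumptions: a one-variable least-squares line stands in for GPR and the other models, the fold assignment by index is an arbitrary choice, and the data are invented. It also shows why R2 can be negative, as reported for the quadratic model: R2 = 1 - SS_res/SS_tot drops below zero whenever the predictions are worse than predicting the mean.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (can be negative)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one input variable."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def k_fold_predictions(x, y, k):
    """Out-of-fold predictions: each point is predicted by a model
    fitted without it (fold membership is index modulo k)."""
    preds = [None] * len(x)
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        test = [i for i in range(len(x)) if i % k == fold]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = slope * x[i] + intercept
    return preds


# Invented, perfectly linear data: held-out error should be near zero.
x = list(range(10))
y = [2 * v + 1 for v in x]
preds = k_fold_predictions(x, y, 5)
print(rmse(y, preds), r2(y, preds))

# Predictions far from the data give a negative R2.
print(r2([1, 2, 3], [10, 10, 10]))
```

Comparing the 'train-all-test-all' score with the cross-validated score, as the abstract does, is what exposes overfitting: only the latter is computed on points the fitted model has not seen.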
Subject(s)
Industrial Waste , Waste Disposal, Fluid , Industrial Waste/analysis , Machine Learning , Palm Oil , Plant Oils
ABSTRACT
This paper proposes an efficient ensemble system, with neural networks as base classifiers, to tackle the protein secondary structure prediction problem. The experimental results show that the multi-layer system can lead to better results. Deploying more accurate base classifiers yields higher accuracy for the ensemble system.
Subject(s)
Computational Biology/methods , Neural Networks, Computer , Protein Structure, Secondary , Proteins/chemistry , Protein Conformation , Protein Folding
ABSTRACT
We address microarray-based cancer classification using a recently proposed multiple classifier system (MCS), referred to as Rotation Forest. To the best of our knowledge, this is the first time that Rotation Forest has been applied to microarray dataset classification. In the framework of Rotation Forest, a linear transformation method is required to project data into a new feature space for each classifier, and the base classifiers are then trained in different new spaces so as to enhance both the accuracy of the base classifiers and the diversity of the ensemble system. Principal component analysis (PCA), non-parametric discriminant analysis (NDA) and random projections (RP) were applied to feature transformation in the original Rotation Forest. In this paper, we use independent component analysis (ICA) as a new transformation method since it can better describe the properties of microarray data. The breast cancer dataset and prostate dataset are used to validate the efficiency of Rotation Forest. In all the experiments, Rotation Forest outperforms other MCSs, such as Bagging and Boosting. In addition, the experimental results show that ICA can further improve the performance of Rotation Forest compared with the original transformation methods.
Subject(s)
Artificial Intelligence , Breast Neoplasms/classification , Models, Statistical , Pattern Recognition, Automated/methods , Prostatic Neoplasms/classification , Algorithms , Breast Neoplasms/genetics , Female , Humans , Male , Prostatic Neoplasms/genetics
ABSTRACT
Recently, more and more machine learning techniques have been applied to microarray data analysis. The aim of this study is to propose a new genetic programming (GP)-based ensemble system (named GPES), which can be used to effectively classify different types of cancers. Decision trees are deployed as base classifiers in this ensemble framework with three operators: Min, Max, and Average. Each individual of the GP is an ensemble system, and the individuals become more and more accurate during the evolutionary process. The feature selection technique and balanced subsampling technique are applied to increase the diversity in each ensemble system. The final ensemble committee is selected by a forward search algorithm, which is shown to be capable of fitting data automatically. The performance of GPES is evaluated using five binary class and six multiclass microarray datasets, and results show that the algorithm can achieve better results in most cases compared with some other ensemble systems. By using more elaborate base classifiers or applying other sampling techniques, the performance of GPES may be further improved.
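How Min, Max and Average operators can fuse the outputs of several base classifiers is sketched below in Python. The abstract does not say exactly what the operators act on, so this sketch assumes each base classifier emits a vector of class probabilities; the function names and the numbers are hypothetical.

```python
def fuse(prob_rows, operator):
    """Combine per-classifier class-probability vectors, class by class.

    prob_rows: one probability vector per base classifier.
    operator: "min", "max" or "average".
    """
    ops = {
        "min": min,
        "max": max,
        "average": lambda values: sum(values) / len(values),
    }
    combine = ops[operator]
    # zip(*prob_rows) groups the outputs of all classifiers for each class.
    return [combine(column) for column in zip(*prob_rows)]


def decide(prob_rows, operator):
    """Index of the class with the largest fused score."""
    fused = fuse(prob_rows, operator)
    return max(range(len(fused)), key=fused.__getitem__)


# Three hypothetical base classifiers, two classes.
outputs = [[0.8, 0.2], [0.3, 0.7], [0.3, 0.7]]
for name in ("min", "max", "average"):
    print(name, decide(outputs, name))
```

On this input Max follows the single most confident classifier, while Average sides with the two classifiers that agree, so the operators can disagree; that difference is one reason to let an evolutionary search choose among them.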
Subject(s)
Gene Expression Regulation, Neoplastic , Neoplasms/diagnosis , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Area Under Curve , Artificial Intelligence , Computational Biology/methods , Gene Expression Profiling/methods , Humans , Machine Learning , Models, Statistical , Neoplasms/pathology , Pattern Recognition, Automated , Reproducibility of Results
ABSTRACT
In this paper, a genetic algorithm (GA) based ensemble support vector machine (SVM) classifier built on gene pairs (GA-ESP) is proposed. The SVMs (base classifiers of the ensemble system) are trained on different informative gene pairs. These gene pairs are selected by the top scoring pair (TSP) criterion. Each of these pairs projects the original microarray expression onto a 2-D space. Extensive permutation of gene pairs may reveal more useful information and potentially lead to an ensemble classifier with satisfactory accuracy and interpretability. GA is further applied to select an optimized combination of base classifiers. The effectiveness of the GA-ESP classifier is evaluated on both binary-class and multi-class datasets.
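The top scoring pair (TSP) criterion mentioned here is commonly defined, for genes i and j and two classes, as |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|. The Python sketch below computes that score and ranks gene pairs by it. It assumes this standard definition is the one used in the paper; the function names and the toy expression matrix are hypothetical, and the SVM and GA stages are not shown.

```python
from itertools import combinations


def tsp_score(samples, labels, i, j):
    """|P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| for labels 0/1."""
    freq = []
    for cls in (0, 1):
        rows = [s for s, lab in zip(samples, labels) if lab == cls]
        freq.append(sum(1 for r in rows if r[i] < r[j]) / len(rows))
    return abs(freq[0] - freq[1])


def top_pairs(samples, labels, n_pairs):
    """Gene-index pairs with the highest TSP score, best first.

    Ties are broken by the smaller index pair so the result is deterministic.
    """
    n_genes = len(samples[0])
    scored = [
        (tsp_score(samples, labels, i, j), (i, j))
        for i, j in combinations(range(n_genes), 2)
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [pair for _, pair in scored[:n_pairs]]


# Toy matrix: rows are samples, columns are genes (invented values).
samples = [[1, 5, 3], [2, 6, 9], [7, 2, 3], [8, 1, 9]]
labels = [0, 0, 1, 1]
print(top_pairs(samples, labels, 1))
```

Each selected pair defines a 2-D projection of the expression data, on which one base classifier of the ensemble could then be trained.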
Subject(s)
Gene Expression Profiling/methods , Gene Expression Regulation , Genes , Oligonucleotide Array Sequence Analysis/methods , Support Vector Machine , Transcriptome
ABSTRACT
AIM: To investigate the diverse characteristics of different pathological gradings of gastric adenocarcinoma (GA) using tumor-related genes. METHODS: GA tissues in different pathological gradings and normal tissues were subjected to tissue arrays. Expressions of 15 major tumor-related genes were detected by RNA in situ hybridization along with 3' terminal digoxin-labeled anti-sense single stranded oligonucleotide and locked nucleic acid modifying probe within the tissue array. The data obtained were processed by support vector machines with four different feature selection methods to discover the respective critical gene/gene subsets contributing to the GA activities of different pathological gradings. RESULTS: In the comparison of poorly differentiated GA with normal tissues, the tumor-related gene TP53 plays a key role, although six other tumor-related genes could each independently achieve an area under the receiver operating characteristic curve (AUC) of more than 80%. Comparing well differentiated GA with normal tissues, we found that 11 tumor-related genes could each independently obtain an AUC of more than 80%, but only the gene subset TP53, RB and PTEN plays a key role. Only the gene subset Bcl10, UVRAG, APC, Beclin1, NM23, PTEN and RB could distinguish between poorly differentiated and well differentiated GA. No single gene could provide a valid distinction. CONCLUSION: Contrary to the traditional point of view, well differentiated cancer tissues have more alterations of important tumor-related genes than poorly differentiated cancer tissues.
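The per-gene AUC screening reported in the results can be illustrated with a short Python sketch. It computes the area under the ROC curve as the fraction of (positive, negative) sample pairs that a score ranks correctly, counting ties as half, which is a standard equivalent formulation. The scores, the labels and the function name are invented for illustration and are not data from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive).

    Equals the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented expression scores for two hypothetical genes.
labels = [1, 1, 0, 0]
gene_a = [0.9, 0.8, 0.3, 0.2]  # separates the groups perfectly
gene_b = [0.9, 0.4, 0.6, 0.2]  # one positive ranked below a negative
print(auc(gene_a, labels), auc(gene_b, labels))
```

Under a threshold of 80%, as used in the abstract, the first hypothetical gene would pass the screen and the second would not.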
Subject(s)
Adenocarcinoma/genetics , Adenocarcinoma/pathology , Biomarkers, Tumor/genetics , Cell Differentiation/genetics , Stomach Neoplasms/genetics , Stomach Neoplasms/pathology , Adult , Aged , Female , Gene Expression Regulation, Neoplastic , Humans , In Situ Hybridization , Male , Middle Aged , Neoplasm Staging , Predictive Value of Tests , RNA, Messenger/analysis , ROC Curve , Tissue Array Analysis
ABSTRACT
Independent component analysis (ICA) has been widely applied to the analysis of microarray datasets. Although it has been pointed out that, after ICA transformation, different independent components (ICs) have different biological significance, the IC selection problem is still far from fully explored. In this paper, we propose a genetic algorithm (GA) based ensemble independent component selection (EICS) system. In this system, GA is applied to select a set of optimal IC subsets, which are then used to build diverse and accurate base classifiers. Finally, all base classifiers are combined by the majority vote rule. To show the validity of the proposed method, we apply it to classify three DNA microarray datasets involving various human normal and tumor tissue samples. The experimental results show that our ensemble method obtains stable and satisfactory classification results when compared with several existing methods.
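The final combination step, the majority vote rule, can be sketched in a few lines of Python. The function name, the labels and the tie-breaking behavior (the label seen first among the tied ones wins) are assumptions of this sketch; the abstract does not describe how ties are handled.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label predictions from several base classifiers.

    predictions: one list of predicted labels per classifier, all of
    the same length (one entry per sample).
    Returns one label per sample, the most frequent among classifiers.
    """
    combined = []
    for sample_votes in zip(*predictions):
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


# Three hypothetical base classifiers voting on three samples.
predictions = [
    ["tumor", "normal", "tumor"],
    ["tumor", "tumor", "normal"],
    ["normal", "normal", "tumor"],
]
print(majority_vote(predictions))
```

Using an odd number of base classifiers avoids ties in the two-class case.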