Search | Biblioteca Virtual em Saúde

Integration of machine learning-based prediction for enhanced Model's generalization: Application in photocatalytic polishing of palm oil mill effluent (POME).

Ng, Kim Hoong; Gan, Y S; Cheng, Chin Kui; Liu, Kun-Hong; Liong, Sze-Teng.

Environ Pollut ; 267: 115500, 2020 Dec.

Article in English | MEDLINE | ID: mdl-33254722

ABSTRACT

A previously developed quadratic model for predicting palm oil mill effluent (POME) degradation efficiency quantified the effects of O2 flowrate, TiO2 loading and initial POME concentration in a lab-scale photocatalytic system, but it generalized poorly because of overfitting. Its high RMSE (131.61) and low R2 (-630.49) show that it cannot describe POME degradation at unseen factor ranges, which confirms the poor generalization. To overcome this issue, several models were developed with machine learning techniques, namely Gaussian Process Regression (GPR), Linear Regression (LR), Decision Tree (DT), Support Vector Machine (SVM) and Regression Tree Ensemble (RTE), and were then assessed systematically. To achieve high generalization, all models were subjected to a 'train-all-test-all' strategy and to 5-fold and 10-fold cross validation. The GPR model was highly accurate under the 'train-all-test-all' strategy, judging from its low RMSE (1.0394) and high R2 (0.9962), but this result carries a risk of overfitting. In contrast, despite the somewhat poorer RMSE and R2 (1.7964 and 0.9886) obtained in 5-fold cross validation, the GPR model showed the highest generalization while sufficiently preserving its accuracy. The SVM and RTE models also showed promising R2 values (0.9372 and 0.9208), which were, however, overshadowed by their high RMSEs (4.2174 and 4.7366). The strong generalization of the GPR model was also confirmed in 10-fold cross validation: the lowest RMSE (2.1624) and highest R2 (0.9835), obtained with a feature number of 36, indicate that it is sufficient in both generalization and accuracy. The other models all had slightly lower R2 (> 0.9), plausibly due to their higher RMSE (> 4.0). According to the GPR model, optimized POME degradation (52.52%) can be obtained at 70 mL/min of O2, 70.0 g/L of TiO2 and 250 ppm of POME concentration, with only ~3% error compared with the actual data.
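The RMSE and R2 figures quoted in this abstract, including the strongly negative R2, can be illustrated with a minimal sketch. It is not the authors' code: the simple line fit stands in for GPR, the interleaved fold layout is an arbitrary choice, and the toy data and all function names are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination; negative when predictions are worse than the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Ordinary least-squares line (stand-in for the regression models in the abstract)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def k_fold_scores(xs, ys, k):
    """Predict every point from a model trained without its fold, then score all predictions."""
    n = len(xs)
    preds = [0.0] * n
    for start in range(k):
        held_out = set(range(start, n, k))  # interleaved folds (assumed layout)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        slope, intercept = fit_line(train_x, train_y)
        for i in held_out:
            preds[i] = slope * xs[i] + intercept
    return rmse(ys, preds), r2(ys, preds)

# Toy data invented for illustration: a noiseless line y = 2x + 1.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
cv_rmse, cv_r2 = k_fold_scores(xs, ys, k=5)

# Predictions far from the observations give a negative R2.
bad_r2 = r2([1.0, 2.0, 3.0], [10.0, 10.0, 10.0])
```

A 'train-all-test-all' score would instead fit and evaluate on the same points, which is why it tends to look better than the cross-validated score.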

Subjects

Industrial Waste; Waste Disposal, Fluid; Industrial Waste/analysis; Machine Learning; Palm Oil; Plant Oils

Genetic programming based ensemble system for microarray data classification.

Liu, Kun-Hong; Tong, Muchenxuan; Xie, Shu-Tong; Yee Ng, Vincent To.

Comput Math Methods Med ; 2015: 193406, 2015.

Article in English | MEDLINE | ID: mdl-25810748

ABSTRACT

Recently, more and more machine learning techniques have been applied to microarray data analysis. The aim of this study is to propose a new genetic programming (GP) based ensemble system (named GPES), which can be used to effectively classify different types of cancers. Decision trees are deployed as base classifiers in this ensemble framework with three operators: Min, Max, and Average. Each individual of the GP is an ensemble system, and the individuals become more and more accurate in the evolutionary process. The feature selection technique and balanced subsampling technique are applied to increase the diversity in each ensemble system. The final ensemble committee is selected by a forward search algorithm, which is shown to be capable of fitting data automatically. The performance of GPES is evaluated using five binary-class and six multiclass microarray datasets, and results show that the algorithm can achieve better results in most cases compared with some other ensemble systems. By using elaborate base classifiers or applying other sampling techniques, the performance of GPES may be further improved.
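The Min, Max and Average operators can be illustrated with a small sketch. It assumes that each base classifier returns one score per class and that the fused decision is the class with the highest combined score; the abstract does not state how GPES applies the operators, so this is only one plausible reading, and all names are hypothetical.

```python
def fuse_scores(score_vectors, operator):
    """Combine per-class scores from several base classifiers with Min, Max or Average."""
    operators = {
        "min": min,
        "max": max,
        "average": lambda values: sum(values) / len(values),
    }
    fuse = operators[operator]
    n_classes = len(score_vectors[0])
    return [fuse([scores[c] for scores in score_vectors]) for c in range(n_classes)]

def fused_class(score_vectors, operator):
    """Index of the class with the highest fused score."""
    fused = fuse_scores(score_vectors, operator)
    return max(range(len(fused)), key=lambda c: fused[c])

# Invented outputs of three base classifiers for one sample (two classes).
outputs = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]]
decisions = {op: fused_class(outputs, op) for op in ("min", "max", "average")}
```

On these toy scores, Min and Max both select class 0 while Average selects class 1, which shows that the choice of operator can change the ensemble decision.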

Subjects

Gene Expression Regulation, Neoplastic; Neoplasms/diagnosis; Oligonucleotide Array Sequence Analysis/methods; Algorithms; Area Under Curve; Artificial Intelligence; Computational Biology/methods; Gene Expression Profiling/methods; Humans; Machine Learning; Models, Statistical; Neoplasms/pathology; Pattern Recognition, Automated; Reproducibility of Results

An ensemble of SVM classifiers based on gene pairs.

Tong, Muchenxuan; Liu, Kun-Hong; Xu, Chungui; Ju, Wenbin.

Comput Biol Med ; 43(6): 729-37, 2013 Jul.

Article in English | MEDLINE | ID: mdl-23668348

ABSTRACT

In this paper, a genetic algorithm (GA) based ensemble support vector machine (SVM) classifier built on gene pairs (GA-ESP) is proposed. The SVMs (base classifiers of the ensemble system) are trained on different informative gene pairs. These gene pairs are selected by the top scoring pair (TSP) criterion. Each of these pairs projects the original microarray expression onto a 2-D space. Extensive permutation of gene pairs may reveal more useful information and potentially lead to an ensemble classifier with satisfactory accuracy and interpretability. GA is further applied to select an optimized combination of base classifiers. The effectiveness of the GA-ESP classifier is evaluated on both binary-class and multi-class datasets.
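The top scoring pair (TSP) criterion can be sketched as follows. The score used here is the standard TSP definition for two classes: the absolute difference between the classes in how often gene i is expressed below gene j. The toy expression matrix and function names are hypothetical, and the SVM training and GA selection steps of GA-ESP are not reproduced.

```python
from itertools import combinations

def tsp_score(samples, labels, i, j):
    """|P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| for binary labels 0/1."""
    freq = []
    for cls in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        freq.append(sum(1 for r in rows if r[i] < r[j]) / len(rows))
    return abs(freq[0] - freq[1])

def top_scoring_pairs(samples, labels, n_pairs):
    """Return the n_pairs gene index pairs with the highest TSP score."""
    n_genes = len(samples[0])
    scored = [(tsp_score(samples, labels, i, j), (i, j))
              for i, j in combinations(range(n_genes), 2)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:n_pairs]]

def project(samples, pair):
    """Project each sample onto the 2-D space spanned by one gene pair."""
    i, j = pair
    return [[s[i], s[j]] for s in samples]

# Invented expression values: rows are samples, columns are three genes.
samples = [[1, 5, 6], [2, 6, 1], [7, 3, 9], [8, 2, 1]]
labels = [0, 0, 1, 1]
best = top_scoring_pairs(samples, labels, n_pairs=1)
points_2d = project(samples, best[0])
```

Each 2-D projection would then serve as the training set for one base SVM.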

Subjects

Gene Expression Profiling/methods; Gene Expression Regulation; Genes; Oligonucleotide Array Sequence Analysis/methods; Support Vector Machine; Transcriptome

Novel cancerization marker, TP53, and its role in distinguishing normal tissue adjacent to cancerous tissue from normal tissue adjacent to benign tissue.

Liu, Guo-Yan; Liu, Kun-Hong; Li, Yin; Pan, Chao; Su, Ji-Qin; Liao, Hong-Feng; Yv, Ren-Xiang; Li, Zhao-Hui; Yuan, Li; Zhang, Huan-Jing; Tzeng, Chi-Meng; Xiong, Bing.

World J Surg Oncol ; 10: 252, 2012 Nov 21.

Article in English | MEDLINE | ID: mdl-23170979

ABSTRACT

BACKGROUND: The histopathological and molecular heterogeneity of normal tissue adjacent to cancerous tissue (NTAC) and normal tissue adjacent to benign tissue (NTAB), and the limited availability of specimens, make deciphering the mechanisms of carcinogenesis challenging. Our goal was to identify histogenetic biomarkers that could be reliably used to define a transforming fingerprint using RNA in situ hybridization. METHODS: We evaluated 15 tumor-related RNA in situ hybridization biomarkers using tumor microarray and samples of seven tumor-adjacent normal tissues from 314 patients. Biomarkers were determined using comprehensive statistical methods (significance of support vector machine-based artificial intelligence and area under curve scoring of classification distribution). RESULTS: TP53 was found to be the most reliable index (P < 10^-7; area under curve > 87%) for distinguishing NTAC from NTAB, according to the results of a significance panel (BCL10, BECN1, BRCA2, FITH, PTCH11 and TP53). CONCLUSIONS: The genetic alterations in TP53 between NTAC and NTAB may provide new insight into the field of cancerization and tumor transformation.

Subjects

Biomarkers, Tumor/analysis; Tumor Suppressor Protein p53/analysis; Cell Transformation, Neoplastic; Genes, p53; Humans; In Situ Hybridization

Alterations of tumor-related genes do not exactly match the histopathological grade in gastric adenocarcinomas.

Liu, Guo-Yan; Liu, Kun-Hong; Zhang, Yong; Wang, Yu-Zhi; Wu, Xiao-Hong; Lu, Yi-Zhuo; Pan, Chao; Yin, Ping; Liao, Hong-Feng; Su, Ji-Qin; Ge, Qing; Luo, Qi; Xiong, Bin.

World J Gastroenterol ; 16(9): 1129-37, 2010 Mar 07.

Article in English | MEDLINE | ID: mdl-20205286

ABSTRACT

AIM: To investigate the diverse characteristics of different pathological gradings of gastric adenocarcinoma (GA) using tumor-related genes. METHODS: GA tissues in different pathological gradings and normal tissues were subjected to tissue arrays. Expressions of 15 major tumor-related genes were detected by RNA in situ hybridization along with 3' terminal digoxin-labeled anti-sense single stranded oligonucleotide and locked nucleic acid modifying probe within the tissue array. The data obtained were processed by support vector machines with four different feature selection methods to discover the respective critical gene/gene subsets contributing to the GA activities of different pathological gradings. RESULTS: In the comparison of poorly differentiated GA with normal tissues, the tumor-related gene TP53 plays a key role, although six other tumor-related genes could each independently achieve an area under the receiver operating characteristic curve (AUC) of more than 80%. Comparing the well differentiated GA with normal tissues, we found that 11 tumor-related genes could independently obtain an AUC of more than 80%, but only the gene subset of TP53, RB and PTEN plays a key role. Only the gene subset of Bcl10, UVRAG, APC, Beclin1, NM23, PTEN and RB could distinguish between the poorly differentiated and well differentiated GA. No single gene could achieve a valid distinction. CONCLUSION: Contrary to the traditional view, the well differentiated cancer tissues have more alterations of important tumor-related genes than the poorly differentiated cancer tissues.
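The per-gene AUC figures reported above can be illustrated with a short sketch. It computes the AUC of a single score (for example, one gene's expression level) as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The expression values are invented and the function name is hypothetical; this is not the authors' pipeline.

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(scores_pos) * len(scores_neg))

# Invented expression levels of one gene in tumor and normal samples.
tumor = [3.0, 4.0, 2.5, 5.0]
normal = [1.0, 2.0, 2.5, 1.5]
gene_auc = auc(tumor, normal)
passes_80_percent = gene_auc > 0.80  # threshold mentioned in the abstract
```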

Subjects

Adenocarcinoma/genetics; Adenocarcinoma/pathology; Biomarkers, Tumor/genetics; Cell Differentiation/genetics; Stomach Neoplasms/genetics; Stomach Neoplasms/pathology; Adult; Aged; Female; Gene Expression Regulation, Neoplastic; Humans; In Situ Hybridization; Male; Middle Aged; Neoplasm Staging; Predictive Value of Tests; RNA, Messenger/analysis; ROC Curve; Tissue Array Analysis

Microarray data classification based on ensemble independent component selection.

Liu, Kun-Hong; Li, Bo; Wu, Qing-Qiang; Zhang, Jun; Du, Ji-Xiang; Liu, Guo-Yan.

Comput Biol Med ; 39(11): 953-60, 2009 Nov.

Article in English | MEDLINE | ID: mdl-19716554

ABSTRACT

Independent component analysis (ICA) has been widely applied to the analysis of microarray datasets. Although it has been pointed out that, after ICA transformation, different independent components (ICs) are of different biological significance, the IC selection problem is still far from fully explored. In this paper, we propose a genetic algorithm (GA) based ensemble independent component selection (EICS) system. In this system, GA is applied to select a set of optimal IC subsets, which are then used to build diverse and accurate base classifiers. Finally, all base classifiers are combined by the majority vote rule. To show the validity of the proposed method, we apply it to classify three DNA microarray data sets involving various human normal and tumor tissue samples. The experimental results show that our ensemble method obtains stable and satisfying classification results when compared with several existing methods.
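The majority vote rule used to combine the base classifiers can be sketched in a few lines. The labels and function names are hypothetical, and ties are resolved in favor of the label seen first, which is an assumption the abstract does not address.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most base classifiers (first seen wins ties)."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(per_classifier_predictions):
    """Apply majority voting sample by sample across all base classifiers."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]

# Invented predictions of three base classifiers for four samples.
predictions = [
    ["tumor", "normal", "tumor", "normal"],
    ["tumor", "tumor", "normal", "normal"],
    ["normal", "tumor", "tumor", "normal"],
]
combined = ensemble_predict(predictions)
```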

Subjects

Oligonucleotide Array Sequence Analysis; Algorithms; Models, Theoretical; Neoplasms/genetics; Neoplasms/pathology

A genetic programming-based approach to the classification of multiclass microarray datasets.

Liu, Kun-Hong; Xu, Chun-Gui.

Bioinformatics ; 25(3): 331-7, 2009 Feb 01.

Article in English | MEDLINE | ID: mdl-19088122

ABSTRACT

MOTIVATION: Feature selection approaches have been widely applied to deal with the small sample size problem in the analysis of microarray datasets. For the multiclass problem, the proposed methods are based on the idea of selecting a gene subset to distinguish all classes. However, it will be more effective to solve a multiclass problem by splitting it into a set of two-class problems and solving each problem with a respective classification system. RESULTS: We propose a genetic programming (GP)-based approach to analyze multiclass microarray datasets. Unlike in traditional GP, the individual proposed in this article consists of a set of small-scale ensembles, called sub-ensembles (SEs). Each SE consists of a set of trees. In application, a multiclass problem is divided into a set of two-class problems, each of which is first tackled by an SE. The SEs tackling the respective two-class problems are combined to construct a GP individual, so each individual can deal with a multiclass problem directly. Effective methods are proposed to solve the problems arising in the fusion of SEs, and a greedy algorithm is designed to keep high diversity in SEs. This GP is tested on five datasets. The results show that the proposed method effectively implements the feature selection and classification tasks.
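The idea of splitting a multiclass problem into two-class problems can be sketched as follows. The abstract does not say how the split is made, so this example assumes a one-vs-one decomposition with vote counting, and it uses a nearest-centroid rule as a stand-in for each sub-ensemble. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_pairwise(samples, labels):
    """Build one two-class model (a pair of centroids) for every pair of classes."""
    models = {}
    for a, b in combinations(sorted(set(labels)), 2):
        rows_a = [s for s, y in zip(samples, labels) if y == a]
        rows_b = [s for s, y in zip(samples, labels) if y == b]
        models[(a, b)] = (centroid(rows_a), centroid(rows_b))
    return models

def predict(models, x):
    """Let every two-class model vote, then return the class with the most votes."""
    votes = Counter()
    for (a, b), (cen_a, cen_b) in models.items():
        votes[a if sq_dist(x, cen_a) <= sq_dist(x, cen_b) else b] += 1
    return votes.most_common(1)[0][0]

# Invented three-class data with two features per sample.
samples = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
labels = ["A", "A", "B", "B", "C", "C"]
models = train_pairwise(samples, labels)
predicted = predict(models, [9, 0.5])
```

With three classes this produces three two-class problems; in general a one-vs-one split of k classes yields k(k-1)/2 of them.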

Subjects

Algorithms; Gene Expression Profiling/methods; Oligonucleotide Array Sequence Analysis/methods; Classification/methods; Pattern Recognition, Automated/methods; Reproducibility of Results; Sample Size

Efficient ensemble schemes for protein secondary structure prediction.

Liu, Kun-Hong; Xia, Jun-Feng; Li, Xueling.

Protein Pept Lett ; 15(5): 488-93, 2008.

Article in English | MEDLINE | ID: mdl-18537739

ABSTRACT

This paper proposes an efficient ensemble system to tackle the protein secondary structure prediction problem with neural networks as base classifiers. The experimental results show that the multi-layer system can lead to better results, and that deploying more accurate base classifiers yields a more accurate ensemble system.

Subjects

Computational Biology/methods; Neural Networks, Computer; Protein Structure, Secondary; Proteins/chemistry; Protein Conformation; Protein Folding

Cancer classification using Rotation Forest.

Liu, Kun-Hong; Huang, De-Shuang.

Comput Biol Med ; 38(5): 601-10, 2008 May.

Article in English | MEDLINE | ID: mdl-18394595

ABSTRACT

We address microarray-based cancer classification using a newly proposed multiple classifier system (MCS), referred to as Rotation Forest. To the best of our knowledge, this is the first time that Rotation Forest has been applied to microarray dataset classification. In the framework of Rotation Forest, a linear transformation method is required to project data into a new feature space for each classifier, and the base classifiers are then trained in different new spaces so as to enhance both the accuracies of the base classifiers and the diversity in the ensemble system. Principal component analysis (PCA), non-parametric discriminant analysis (NDA) and random projections (RP) were applied to feature transformation in the original Rotation Forest. In this paper, we use independent component analysis (ICA) as a new transformation method since it can better describe the property of microarray data. The breast cancer dataset and prostate dataset are deployed to validate the efficiency of Rotation Forest. In all the experiments, Rotation Forest outperforms other MCSs, such as Bagging and Boosting. In addition, the experimental results reveal that ICA can further improve the performance of Rotation Forest compared with the original transformation methods.

Subjects

Artificial Intelligence; Breast Neoplasms/classification; Models, Statistical; Pattern Recognition, Automated/methods; Prostatic Neoplasms/classification; Algorithms; Breast Neoplasms/genetics; Female; Humans; Male; Prostatic Neoplasms/genetics
