ABSTRACT
Extracting predictive features from complex, high-dimensional multi-omic data is necessary for decoding and overcoming therapeutic responses in systems pharmacology. Developing computational methods that reduce the high-dimensional feature space of in vitro, in vivo, and clinical data is essential for discovering the evolution and mechanisms of drug response and drug resistance. In this paper, we use matrix factorization (MF) for dimensionality reduction in systems pharmacology. We propose three novel feature selection methods based on the mathematical concept of a basis for features. We apply these techniques, together with three other MF methods, to eight gene expression datasets to investigate and compare their feature selection performance. Our results show that these methods can reduce the feature space and find features that are predictive of phenotype. The three proposed techniques outperform the other methods used and can extract a 2-gene signature predictive of response to tyrosine kinase inhibitor treatment in the Cancer Cell Line Encyclopedia.
Subjects
Algorithms , Neoplasms , Humans , Neoplasms/drug therapy , Neoplasms/genetics , Network Pharmacology
ABSTRACT
Multi-classifier systems (MCSs) are predictive models that classify instances by combining the outputs of an ensemble of classifiers drawn from a pool. Dynamic selection (DS) techniques have been introduced to enhance the performance of MCSs. For each test sample, a DS method selects only the most competent classifiers. The central question for DS techniques is how to estimate the competence of the classifiers for each new test sample. In traditional dynamic selection methods, to classify an unknown test sample x, a local region of data similar to x is first identified. Then, the classifiers that classify the data in that local region well are selected to classify x. The main effort of these methods is therefore focused on one of two tasks: (i) providing a measure for identifying the local region, or (ii) providing a criterion for measuring the efficiency of classifiers in the local region (a competence measure). This paper proposes a new dynamic selection technique that does not follow this approach. The proposed method uses a multi-label classifier in the training phase to determine the appropriate set of classifiers directly, without applying any criterion such as a competence measure. In the generalization phase, the method predicts the appropriate set of classifiers for classifying the test sample x. The suggested multi-label-based framework is the first method that uses multi-label classification concepts for dynamic classifier selection. Unlike the existing meta-learning methods for dynamic ensemble selection in the literature, the proposed method is very simple to implement and does not need meta-features. The experimental results indicate that the suggested technique performs well in terms of both classification accuracy and simplicity, and is fairly comparable with the benchmark DS techniques. The results of the Quade non-parametric statistical test corroborate the clear dominance of the proposed method over the other benchmark methods.
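The traditional scheme summarized in this abstract (find a local region around x, then keep the classifiers that do well there) can be sketched as follows. This is only an illustration of that baseline, not the multi-label method the paper proposes. It assumes a k-nearest-neighbour region with Euclidean distance and uses local accuracy as the competence measure; the function names, the toy classifiers, and the validation data are all hypothetical.

```python
import math
from collections import Counter

def k_nearest(x, validation_X, k):
    """Return indices of the k validation points closest to x (Euclidean)."""
    order = sorted(range(len(validation_X)),
                   key=lambda i: math.dist(x, validation_X[i]))
    return order[:k]

def select_competent(x, pool, validation_X, validation_y, k=3):
    """Keep the classifiers with the highest accuracy on the local region of x."""
    region = k_nearest(x, validation_X, k)
    local_acc = [
        sum(clf(validation_X[i]) == validation_y[i] for i in region) / len(region)
        for clf in pool
    ]
    best = max(local_acc)
    return [clf for clf, acc in zip(pool, local_acc) if acc == best]

def ds_predict(x, pool, validation_X, validation_y, k=3):
    """Classify x by majority vote of the locally most competent classifiers."""
    selected = select_competent(x, pool, validation_X, validation_y, k)
    votes = Counter(clf(x) for clf in selected)
    return votes.most_common(1)[0][0]

# Hypothetical pool: each classifier is a plain function from a point to a label.
def clf_a(p):
    return int(p[0] > 0.5)

def clf_b(p):
    return int(p[1] > 0.5)

def clf_c(p):
    return 0  # always predicts class 0

pool = [clf_a, clf_b, clf_c]

# Hypothetical labelled validation set used to estimate local competence.
validation_X = [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7),
                (0.9, 0.1), (0.8, 0.2), (0.7, 0.3)]
validation_y = [1, 1, 1, 0, 0, 0]

label = ds_predict((0.15, 0.85), pool, validation_X, validation_y)
```

Both the region definition and the competence measure are swappable, which is exactly the pair of design choices, (i) and (ii), that the abstract says traditional methods concentrate on.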
ABSTRACT
One of the most critical challenges in managing complex diseases like COVID-19 is establishing an intelligent triage system that can optimize clinical decision-making during a global pandemic. Clinical presentation and patients' characteristics are usually used to identify the patients who need more critical care. However, the clinical evidence shows an unmet need for more accurate and optimal clinical biomarkers to triage patients in a situation like the COVID-19 crisis. Here we present a machine learning approach for finding a group of clinical indicators, drawn from the blood tests of a set of COVID-19 patients, that are predictive of poor prognosis and morbidity. Our approach consists of two interconnected schemes: Feature Selection and Prognosis Classification. The former is based on several Matrix Factorization (MF)-based methods, and the latter is performed using the Random Forest algorithm. Our model reveals that Arterial Blood Gas (ABG) O2 Saturation and C-Reactive Protein (CRP) are the most important clinical biomarkers determining poor prognosis in these patients. Our approach paves the way for building quantitative and optimized clinical management systems for COVID-19 and similar diseases.
Subjects
COVID-19 , Biomarkers , Humans , Machine Learning , Pandemics , Triage/methods
ABSTRACT
Recent advances in bioinformatics have produced high-dimensional microarray datasets. Such datasets remain challenging for machine learning researchers because they combine a small sample size with an extremely large number of features. Feature selection is therefore a problem of interest in the learning process in this area. In this paper, a novel feature selection method called CCFS is proposed, based on a global search that uses the main concepts of the divide-and-conquer technique. The CCFS algorithm randomly partitions the dataset vertically (by features) and applies the fundamental concepts of cooperative coevolution, using a filter criterion in the fitness function, to search the solution space with a binary gravitational search algorithm. To determine the effectiveness of the proposed method, experiments are carried out on seven binary high-dimensional microarray datasets. The results are compared with nine state-of-the-art feature selection algorithms, including Interact (INT) and Maximum Relevancy Minimum Redundancy (MRMR). A non-parametric statistical test on the average results reveals that the proposed method differs significantly from the others in terms of accuracy, sensitivity, specificity, and number of selected features.
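The divide-and-conquer step in this abstract, a random vertical split of the features followed by cooperation between the sub-problems, can be illustrated with a minimal sketch. It covers only the partitioning and the reassembly of a full feature mask from per-group binary masks. The filter-based fitness function and the binary gravitational search itself are not shown, and the function names, the group count, and the seed are hypothetical choices rather than details taken from the paper.

```python
import random

def random_feature_partition(n_features, n_groups, seed=0):
    """Randomly split feature indices 0..n_features-1 into n_groups disjoint groups."""
    rng = random.Random(seed)
    indices = list(range(n_features))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so group sizes differ by at most one.
    return [sorted(indices[g::n_groups]) for g in range(n_groups)]

def assemble_mask(partition, sub_masks, n_features):
    """Merge one binary mask per group into a mask over all features."""
    full = [0] * n_features
    for group, mask in zip(partition, sub_masks):
        for feature_index, bit in zip(group, mask):
            full[feature_index] = bit
    return full

# Hypothetical example: 10 features split into 3 groups.
groups = random_feature_partition(10, 3, seed=1)

# Each sub-population would evolve a mask for its own group; here every
# group simply keeps its first feature, as a stand-in for a search result.
sub_masks = [[1] + [0] * (len(g) - 1) for g in groups]
full_mask = assemble_mask(groups, sub_masks, 10)
```

In a cooperative coevolution setting, a candidate mask for one group is typically scored by assembling it with the current best masks of the other groups, which is why the reassembly helper is separated from the partitioning.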
Subjects
Algorithms , Computational Biology , Microarray Analysis , Factual Databases , Humans
ABSTRACT
MicroRNAs (miRNAs) are small non-coding RNAs that have important functions in gene regulation. Because finding miRNA targets experimentally is costly and time-consuming, the use of machine learning methods for miRNA target prediction is a growing research area. In this paper, a new approach, EP-RTF, is proposed that combines two popular ensemble strategies, Ensemble Pruning (EP) and Rotation Forest (RTF), to predict human miRNA targets. For EP, the approach uses a Genetic Algorithm (GA): a subset of classifiers is first selected from the heterogeneous ensemble by the GA. Next, the selected classifiers are trained with the RTF method and then combined using weighted majority voting. In addition to seeking a better subset of classifiers, the GA also optimizes the RTF parameter. The findings of the present study confirm that the newly developed EP-RTF outperforms previously applied methods in terms of classification accuracy, sensitivity, and specificity on four human miRNA target datasets. Diversity-error diagrams reveal that the proposed ensemble approach constructs individual classifiers that are more accurate and usually more diverse than those of the other ensemble approaches. Given these experimental results, we highly recommend EP-RTF for improving the performance of miRNA target prediction.
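The combination step named in this abstract, weighted majority voting, can be sketched in a few lines. The abstract does not say how the vote weights are obtained, so the weights, the labels, and the function name below are hypothetical. The sketch shows only the voting rule, not the GA-based pruning or the Rotation Forest training.

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # On a tie, the label that received a vote first is returned.
    return max(totals, key=totals.get)

# Hypothetical votes from three selected classifiers for one candidate site.
predictions = ["target", "non-target", "non-target"]
weights = [0.9, 0.4, 0.3]  # assumed per-classifier weights

decision = weighted_majority_vote(predictions, weights)
unweighted = weighted_majority_vote(predictions, [1.0, 1.0, 1.0])
```

With these illustrative numbers the single heavily weighted classifier overrides the two lighter ones, whereas equal weights reduce the rule to plain majority voting.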
Subjects
Algorithms , MicroRNAs/genetics , Human Chromosomes , Gene Expression Regulation , Humans , Genetic Models
ABSTRACT
In this study, the problem of protein fold recognition, which is a classification task, is solved with a hybrid of two evolutionary algorithms: multi-gene Genetic Programming (GP) and a Genetic Algorithm (GA). The proposed method consists of two main stages and is evaluated on three datasets taken from the literature. Each dataset contains different feature groups and classes. In the first stage, multi-gene GP is used to produce binary classifiers for each class based on the various feature groups. The classifiers obtained for each class are then combined via weighted voting, with the weights determined by the GA. At the end of the first stage, there is a separate binary classifier for each class. In the second stage, these binary classifiers are combined via GA weighting to generate the overall classifier. The final classifier is superior to previous works in the literature in terms of classification accuracy.
Subjects
Algorithms , Chemical Models , Molecular Models , Automated Pattern Recognition/methods , Proteins/chemistry , Proteins/ultrastructure , Amino Acid Sequence , Biomimetics/methods , Computer Simulation , Genetic Models , Molecular Sequence Data , Protein Conformation , Protein Folding
ABSTRACT
In this paper, several decision tree (DT)-based ensemble learning methods for protein fold recognition are compared over three datasets taken from the literature. Following previously reported studies, the features of the datasets are divided into groups. For each group, three ensemble classifiers are employed: random forest, rotation forest, and AdaBoost.M1. Several fusion methods are then introduced for combining the ensemble classifiers obtained in the previous step. This step yields three classifiers, one for each of the random forest, rotation forest, and AdaBoost.M1 types. Finally, these three classifiers are combined into an overall classifier. Experimental results show that the overall classifier obtained with the genetic algorithm (GA) weighting fusion method outperforms previously applied methods in terms of classification accuracy.