Search | BVS Regional Portal

1.

Nonlinear physics opens a new paradigm for accurate transcription start site prediction.

Barbero-Aparicio, José Antonio; Cuesta-Lopez, Santiago; García-Osorio, César Ignacio; Pérez-Rodríguez, Javier; García-Pedrajas, Nicolás.

BMC Bioinformatics ; 23(1): 565, 2022 Dec 30.

Article in English | MEDLINE | ID: mdl-36585618

ABSTRACT

There is evidence that DNA breathing (spontaneous opening of the DNA strands) plays a relevant role in the interactions of DNA with other molecules, and in particular in the transcription process. Therefore, having physical models that can predict these openings is of interest. However, this source of information has not previously been used for either transcription start site (TSS) or promoter prediction. In this article, one such model is used as an additional information source that, when used by a machine learning (ML) model, improves the results of current methods for the prediction of TSSs. In addition, we provide evidence on the validity of the physical model, as it is able by itself to predict TSSs with high accuracy. This opens an exciting avenue of research at the intersection of statistical mechanics and ML, where ML models in bioinformatics can be improved using physical models of DNA as feature extractors.

Subjects

Computational Biology; DNA; Transcription Initiation Site; Promoter Regions, Genetic; Computational Biology/methods

2.

Graph-Based Feature Selection Approach for Molecular Activity Prediction.

Cerruela-García, Gonzalo; Cuevas-Muñoz, José Manuel; García-Pedrajas, Nicolás.

J Chem Inf Model ; 62(7): 1618-1632, 2022 04 11.

Article in English | MEDLINE | ID: mdl-35315648

ABSTRACT

In the construction of QSAR models for the prediction of molecular activity, feature selection is a common task aimed at improving the results and understanding of the problem. The selection of features allows elimination of irrelevant and redundant features, reduces the effect of dimensionality problems, and improves the generalization and interpretability of the models. In many feature selection applications, such as those based on ensembles of feature selectors, it is necessary to combine different selection processes. In this work, we evaluate the application of a new feature selection approach to the prediction of molecular activity, based on the construction of an undirected graph to combine base feature selectors. The experimental results demonstrate the efficiency of the graph-based method in terms of the classification performance, reduction, and redundancy compared to the standard voting method. The graph-based method can be extended to different feature selection algorithms and applied to other cheminformatics problems.

Subjects

Algorithms; Research Design

3.

Grab'Em: A Novel Graph-Based Method for Combining Feature Subset Selectors.

de Haro-Garcia, Aida; Toledano, Jose Perez-Parras; Cerruela-Garcia, Gonzalo; Garcia-Pedrajas, Nicolas.

IEEE Trans Cybern ; 52(5): 2942-2954, 2022 May.

Article in English | MEDLINE | ID: mdl-33027013

ABSTRACT

Feature selection is one of the most frequent tasks in data mining applications. Its ability to remove useless and redundant features, improve classification performance, and yield knowledge about a given problem makes feature selection a common first step in data mining. In many feature selection applications, we need to combine the results of different feature selection processes. The two most common scenarios are ensembles of feature selectors and the scaling up of feature selection methods using a data division approach. The standard procedure is to store the number of times every feature has been selected as a vote for the feature and then evaluate different selection thresholds with a certain criterion to obtain the final subset of selected features. However, this method is suboptimal, as the relationships among the features are not considered in the voting process. Two redundant features may be selected a similar number of times due to the different sets of instances used each time. Thus, a voting scheme would tend to select both of them. In this article, we present a new approach: instead of using only the number of times a feature has been selected, the approach considers how many times the features have been selected together by a feature selection algorithm. The proposal is based on constructing an undirected graph where the vertices are the features, and the edges count the number of times every pair of features has been selected together. This graph is used to select the best subset of features, avoiding the redundancy introduced by the voting scheme. The proposal improves the results of the standard voting scheme in both ensembles of feature selectors and data division methods for scaling up feature selection.

Subjects

Algorithms; Data Mining; Research Design
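The difference between plain voting and the co-selection graph described in the abstract above can be illustrated with a minimal Python sketch. It only builds the two data structures (per-feature votes and pairwise co-selection counts); how the published method then extracts the final subset from the graph is not stated in the abstract and is not attempted here. The function names and the `runs` data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def vote_counts(selections):
    """Standard voting: number of runs in which each feature was selected."""
    votes = Counter()
    for subset in selections:
        votes.update(set(subset))
    return votes


def co_selection_graph(selections):
    """Undirected graph stored as a Counter keyed by sorted feature pairs.

    The weight of edge (a, b) is the number of runs that selected both
    a and b together.
    """
    edges = Counter()
    for subset in selections:
        for a, b in combinations(sorted(set(subset)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical outputs of four feature-selection runs (feature indices).
runs = [[0, 1, 4], [0, 2, 4], [1, 4], [0, 4, 5]]
votes = vote_counts(runs)
graph = co_selection_graph(runs)
```

In this toy data, features 0 and 4 both collect many votes, and the edge between them shows that they are almost always chosen together, which is information a pure vote count discards.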

4.

Floating Search Methodology for Combining Classification Models for Site Recognition in DNA Sequences.

Perez-Rodriguez, Javier; de Haro-Garcia, Aida; Garcia-Pedrajas, Nicolas.

IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2471-2482, 2021.

Article in English | MEDLINE | ID: mdl-32078558

ABSTRACT

Recognition of the functional sites of genes, such as translation initiation sites, donor and acceptor splice sites and stop codons, is a relevant part of many current problems in bioinformatics. The best approaches use sophisticated classifiers, such as support vector machines. However, with the rapid accumulation of sequence data, methods for combining many sources of evidence are necessary as it is unlikely that a single classifier can solve this problem with the best possible performance. A major issue is that the number of possible models to combine is large and the use of all of these models is impractical. In this paper we present a methodology for combining many sources of information to recognize any functional site using "floating search", a powerful heuristic applicable when the cost of evaluating each solution is high. We present experiments on four functional sites in the human genome, which is used as the target genome, and use another 20 species as sources of evidence. The proposed methodology shows significant improvement over state-of-the-art methods. The results show an advantage of the proposed method and also challenge the standard assumption of using only genomes not very close and not very far from the human to improve the recognition of functional sites.

Subjects

Computational Biology/methods; Gene Components/genetics; Genome, Human/genetics; Sequence Analysis, DNA/methods; Algorithms; Base Sequence/genetics; Humans; Models, Genetic
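For readers unfamiliar with the "floating search" heuristic named in the abstract above, the following Python sketch shows a generic sequential floating forward selection over a list of candidate models. It is the textbook heuristic, not the authors' implementation: the `evaluate` callback, the toy score table and the model names are all assumptions made for illustration.

```python
def floating_search(candidates, evaluate, max_size):
    """Sequential floating forward selection.

    `evaluate(subset)` returns a score to maximise. Returns the best
    (score, subset) pair found over all visited subset sizes.
    """
    best = {}      # subset size -> (score, subset)
    current = ()
    while len(current) < max_size:
        remaining = [c for c in candidates if c not in current]
        if not remaining:
            break
        # Forward step: add the single candidate that helps most.
        score, current = max(
            ((evaluate(current + (c,)), current + (c,)) for c in remaining),
            key=lambda pair: pair[0],
        )
        if len(current) not in best or score > best[len(current)][0]:
            best[len(current)] = (score, current)
        # Floating step: drop members while that strictly beats the best
        # subset previously recorded for the smaller size.
        while len(current) > 2:
            r_score, reduced = max(
                ((evaluate(tuple(x for x in current if x != c)),
                  tuple(x for x in current if x != c)) for c in current),
                key=lambda pair: pair[0],
            )
            if r_score > best[len(reduced)][0]:
                current = reduced
                best[len(reduced)] = (r_score, reduced)
            else:
                break
    return max(best.values(), key=lambda pair: pair[0])


# Hypothetical scores: models "b" and "c" are complementary, "a" is the
# best single model, and any subset containing "d" performs poorly.
TABLE = {
    frozenset("a"): 0.60, frozenset("b"): 0.50, frozenset("c"): 0.50,
    frozenset("ab"): 0.65, frozenset("ac"): 0.65, frozenset("bc"): 0.90,
    frozenset("abc"): 0.85,
}


def toy_score(subset):
    return TABLE.get(frozenset(subset), 0.10)


best_score, best_subset = floating_search(["a", "b", "c", "d"], toy_score, 3)
```

A purely greedy forward search would commit to "a" and never recover the pair ("b", "c"); the backward floating step is what allows the early choice to be undone.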

5.

Effective Feature Selection Method for Class-Imbalance Datasets Applied to Chemical Toxicity Prediction.

Antelo-Collado, Aurelio; Carrasco-Velar, Ramón; García-Pedrajas, Nicolás; Cerruela-García, Gonzalo.

J Chem Inf Model ; 61(1): 76-94, 2021 01 25.

Article in English | MEDLINE | ID: mdl-33350301

ABSTRACT

During the drug development process, it is common to carry out toxicity tests and adverse effect studies, which are essential to guarantee patient safety and the success of the research. The use of in silico quantitative structure-activity relationship (QSAR) approaches for this task involves processing a huge amount of data that, in many cases, have an imbalanced distribution of active and inactive samples. This is usually termed the class-imbalance problem and may have a significant negative effect on the performance of the learned models. The performance of feature selection (FS) for QSAR models is usually damaged by the class-imbalance nature of the involved datasets. This paper proposes the use of an FS method focused on dealing with the class-imbalance problems. The method is based on the use of FS ensembles constructed by boosting and using two well-known FS methods, fast clustering-based FS and the fast correlation-based filter. The experimental results demonstrate the efficiency of the proposal in terms of the classification performance compared to standard methods. The proposal can be extended to other FS methods and applied to other problems in cheminformatics.

Subjects

Algorithms; Quantitative Structure-Activity Relationship; Computer Simulation; Humans; Research Design

6.

Maximum common property: a new approach for molecular similarity.

Antelo-Collado, Aurelio; Carrasco-Velar, Ramón; García-Pedrajas, Nicolás; Cerruela-García, Gonzalo.

J Cheminform ; 12(1): 61, 2020 Oct 09.

Article in English | MEDLINE | ID: mdl-33372638

ABSTRACT

The maximum common property similarity (MCPhd) method is presented using descriptors as a new approach to determine the similarity between two chemical compounds or molecular graphs. This method uses the concept of maximum common property arising from the concept of maximum common substructure and is based on the electrotopographic state index for atoms. A new algorithm to quantify the similarity values of chemical structures based on the presented maximum common property concept is also developed in this paper. To verify the validity of this approach, the similarity of a sample of compounds with antimalarial activity is calculated and compared with the results obtained by four different similarity methods: the small molecule subgraph detector (SMSD), molecular fingerprint based (OBabel_FP2), ISIDA descriptors and shape-feature similarity (SHAFTS). The results obtained by the MCPhd method differ significantly from those obtained by the compared methods, improving the quantification of the similarity. A major advantage of the proposed method is that it helps to understand the analogy or proximity between physicochemical properties of the molecular fragments or subgraphs compared with the biological response or biological activity. In this new approach, more than one property can potentially be used. The method can be considered a hybrid procedure because it combines the descriptor and fragment approaches.

7.

Molecular Mechanisms Controlling the Disease Cycle in the Vascular Pathogen Verticillium dahliae Characterized Through Forward Genetics and Transcriptomics.

Sarmiento-Villamil, Jorge L; García-Pedrajas, Nicolás E; Cañizares, M Carmen; García-Pedrajas, María D.

Mol Plant Microbe Interact ; 33(6): 825-841, 2020 Jun.

Article in English | MEDLINE | ID: mdl-32154756

ABSTRACT

The soil-borne pathogen Verticillium dahliae has a worldwide distribution and a plethora of hosts of agronomic value. Molecular analysis of virulence processes can identify targets for disease control. In this work, we compared the global gene transcription profile of random T-DNA insertion mutant strain D-10-8F, which exhibits reduced virulence and alterations in microsclerotium formation and polar growth, with that of the wild-type strain. Three genes identified as differentially expressed were selected for functional characterization. To produce deletion mutants, we developed an updated version of one-step construction of Agrobacterium-recombination-ready plasmids (OSCAR) that included the negative selection marker HSVtk (herpes simplex virus thymidine kinase gene) to prevent ectopic integration of the deletion constructs. Deletion of VdRGS1 (VDAG_00683), encoding a regulator of G protein signaling (RGS) protein and highly upregulated in the wild type versus D-10-8F, resulted in phenotypic alterations in development and virulence that were indistinguishable from those of the random T-DNA insertion mutant. In contrast, deletion of the other two genes selected, vrg1 (VDAG_07039) and vvs1 (VDAG_01858), showed that they do not play major roles in morphogenesis or virulence in V. dahliae. Taken together the results presented here on the transcriptomic analysis and phenotypic characterization of D-10-8F and ∆VdRGS1 strains provide evidence that variations in G protein signaling control the progression of the disease cycle in V. dahliae. We propose that G protein-mediated signals induce the expression of multiple virulence factors during biotrophic growth, whereas massive production of microsclerotia at late stages of infection requires repression of G protein signaling via upregulation of VdRGS1 activity.

Subjects

Plant Diseases/microbiology; Transcriptome; Verticillium/genetics; Verticillium/pathogenicity; DNA, Bacterial; Fungal Proteins; Gene Deletion; Virulence

8.

Influence of feature rankers in the construction of molecular activity prediction models.

Cerruela-García, Gonzalo; Pérez-Parra Toledano, José; de Haro-García, Aída; García-Pedrajas, Nicolás.

J Comput Aided Mol Des ; 34(3): 305-325, 2020 03.

Article in English | MEDLINE | ID: mdl-31893338

ABSTRACT

In the construction of activity prediction models, the use of feature ranking methods is a useful mechanism for extracting information for ranking features in terms of their significance to develop predictive models. This paper studies the influence of feature rankers in the construction of molecular activity prediction models; for this purpose, a comparative study of fourteen ranking methods for feature selection was conducted. The activity prediction models were constructed using four well-known classifiers and a wide collection of datasets. The ranking algorithms were compared considering the performance of these classifiers using different metrics and the consistency of the ranked features.

Subjects

Models, Molecular; Software; Algorithms; Humans

9.

Multilabel and Missing Label Methods for Binary Quantitative Structure-Activity Relationship Models: An Application for the Prediction of Adverse Drug Reactions.

Pérez-Parras Toledano, José; García-Pedrajas, Nicolás; Cerruela-García, Gonzalo.

J Chem Inf Model ; 59(10): 4120-4130, 2019 10 28.

Article in English | MEDLINE | ID: mdl-31514503

ABSTRACT

The prediction of adverse drug reactions in the discovery of new medicines is highly challenging. In the task of predicting the adverse reactions of chemical compounds, information about different targets is often available. Although we can focus on every adverse drug reaction prediction separately, multilabel approaches have been proven useful in many research areas for taking advantage of the relationship among the targets. However, when approaching the prediction problem from a multilabel point of view, we have to deal with the lack of information for some labels. This missing labels problem is a relevant issue in the field of cheminformatics approaches. This paper aims to predict the adverse drug reaction of commercial drugs using a multilabel approach where the possible presence of missing labels is also taken into consideration. We propose the use of multilabel methods to deal with the prediction of a large set of 27 different adverse reaction targets. We also propose the use of multilabel methods specifically designed to deal with the missing labels problem to test their ability to solve this difficult problem. The results show the validity of the proposed approach, demonstrating a superior performance of the multilabel method compared with the single-label approach in addressing the problem of adverse drug reaction prediction.

Subjects

Computational Biology/methods; Algorithms; Computer Simulation; Drug Discovery; Drug-Related Side Effects and Adverse Reactions; Models, Molecular; Quantitative Structure-Activity Relationship

10.

Improving the combination of results in the ensembles of prototype selectors.

Cerruela-García, Gonzalo; de Haro-García, Aida; Toledano, José Pérez-Parras; García-Pedrajas, Nicolás.

Neural Netw ; 118: 175-191, 2019 Oct.

Article in English | MEDLINE | ID: mdl-31299623

ABSTRACT

Prototype selection is one of the most common preprocessing tasks in data mining applications. The vast amounts of data that we must handle in practical problems render the removal of noisy, redundant or useless instances a convenient first step for any real-world application. Many algorithms have been proposed for prototype selection. For difficult problems, however, a single method is unlikely to achieve the desired performance. Similar to the problem of classification, ensembles of prototype selectors have been proposed to overcome the limitations of single algorithms. In ensembles of prototype selectors, the usual combination method is based on a voting scheme coupled with an acceptance threshold. However, this method is suboptimal, because the relationships among the prototypes are not taken into account. In this paper, we propose a different approach, in which we consider not only the number of times every prototype has been selected but also the subsets of prototypes that are selected. With this additional information we develop GEEBIES, which is a new way of combining the results of ensembles of prototype selectors. In a large set of problems, we show that our proposal outperforms the standard boosting approach. A way of scaling up our method to large datasets is also proposed and experimentally tested.

Subjects

Algorithms; Databases, Factual; Proof of Concept Study; Data Mining/standards; Databases, Factual/standards

11.

A nonparametric weighted feature extraction-based method for c-Jun N-terminal kinase-3 inhibitor prediction.

Cerruela García, Gonzalo; García-Pedrajas, Nicolás.

J Mol Graph Model ; 90: 235-242, 2019 07.

Article in English | MEDLINE | ID: mdl-31103916

ABSTRACT

In this work, the application of a new strategy called NWFE ensemble (nonparametric weighted feature extraction ensemble) method is proposed. Subspace-supervised projections based on NWFE are incorporated into the construction of ensembles of classifiers to facilitate the correct classification of wrongly classified instances without being detrimental to the overall performance of the ensemble. The performance of NWFE is investigated with a c-Jun N-terminal kinase-3 inhibitor benchmark dataset using different chemical compound representation models. Compared with the standard method, the results obtained show that the applied method improves the prediction performance using two classifiers based on decision trees and support vector machines.

Subjects

Mitogen-Activated Protein Kinase 10/antagonists & inhibitors; Protein Kinase Inhibitors/chemistry; Protein Kinase Inhibitors/pharmacology; Algorithms; Decision Trees; Humans; Support Vector Machine

12.

Boosted feature selectors: a case study on prediction P-gp inhibitors and substrates.

Cerruela García, Gonzalo; García-Pedrajas, Nicolás.

J Comput Aided Mol Des ; 32(11): 1273-1294, 2018 11.

Article in English | MEDLINE | ID: mdl-30367310

ABSTRACT

Feature selection is commonly used as a preprocessing step to machine learning for improving learning performance, lowering computational complexity and facilitating model interpretation. This paper proposes the application of boosting feature selection to improve the classification performance of standard feature selection algorithms evaluated for the prediction of P-gp inhibitors and substrates. Two well-known classification algorithms, decision trees and support vector machines, were used to classify the chemical compounds. The experimental results showed better performance for boosting feature selection with respect to the standard feature selection algorithms while maintaining the capability for feature reduction.

Subjects

ATP Binding Cassette Transporter, Subfamily B/antagonists & inhibitors; ATP Binding Cassette Transporter, Subfamily B/chemistry; Ligands; Machine Learning; Algorithms; Decision Trees; Molecular Structure; Protein Binding; Quantitative Structure-Activity Relationship; Support Vector Machine

13.

The APSES transcription factor Vst1 is a key regulator of development in microsclerotium- and resting mycelium-producing Verticillium species.

Sarmiento-Villamil, Jorge L; García-Pedrajas, Nicolás E; Baeza-Montañez, Lourdes; García-Pedrajas, María D.

Mol Plant Pathol ; 19(1): 59-76, 2018 01.

Article in English | MEDLINE | ID: mdl-27696683

ABSTRACT

Plant pathogens of the genus Verticillium pose a threat to many important crops worldwide. They are soil-borne fungi which invade the plant systemically, causing wilt symptoms. We functionally characterized the APSES family transcription factor Vst1 in two Verticillium species, V. dahliae and V. nonalfalfae, which produce microsclerotia and melanized hyphae as resistant structures, respectively. We found that, in V. dahliae Δvst1 strains, microsclerotium biogenesis stalled after an initial swelling of hyphal cells and cultures were never pigmented. In V. nonalfalfae Δvst1, melanized hyphae were also absent. These results suggest that Vst1 controls melanin biosynthesis independent of its role in morphogenesis. The absence of vst1 also had a great impact on sporulation in both species, affecting the generation of the characteristic verticillate conidiophore structure and sporulation rates in liquid medium. In contrast with these key roles in development, Vst1 activity was dispensable for virulence. We performed a microarray analysis comparing global transcription patterns of wild-type and Δvst1 in V. dahliae. G-protein/cyclic adenosine monophosphate (G-protein/cAMP) signalling and mitogen-activated protein kinase (MAPK) cascades are known to regulate fungal morphogenesis and virulence. The microarray analysis revealed a negative interaction of Vst1 with G-protein/cAMP signalling and a positive interaction with MAPK signalling. This analysis also identified Rho signalling as a potential regulator of morphogenesis in V. dahliae, positively interacting with Vst1. Furthermore, it exposed the association of secondary metabolism and development in this species, identifying Vst1 as a potential co-regulator of both processes. Characterization of the putative Vst1 targets identified in this study will aid in the dissection of specific aspects of development.

Subjects

Fungal Proteins/metabolism; Mycelium/metabolism; Transcription Factors/metabolism; Verticillium/growth & development; Verticillium/metabolism; Down-Regulation/genetics; Fungal Proteins/genetics; Gene Deletion; Gene Expression Regulation, Fungal; Melanins/biosynthesis; Morphogenesis/genetics; Multigene Family; Mycelium/cytology; Oxidation-Reduction; Secondary Metabolism/genetics; Signal Transduction/genetics; Spores, Fungal/drug effects; Spores, Fungal/physiology; Transcription, Genetic; Verticillium/pathogenicity

14.

A Proposal for Local $k$ Values for $k$-Nearest Neighbor Rule.

Garcia-Pedrajas, Nicolas; Romero Del Castillo, Juan A; Cerruela-Garcia, Gonzalo.

IEEE Trans Neural Netw Learn Syst ; 28(2): 470-475, 2017 02.

Article in English | MEDLINE | ID: mdl-26731778

ABSTRACT

The k-nearest neighbor (k-NN) classifier is one of the most widely used methods of classification due to several interesting features, including good generalization and easy implementation. Although simple, it is usually able to match and even outperform more sophisticated and complex methods. One of the problems with this approach is fixing the appropriate value of k. Although a good value might be obtained using cross validation, it is unlikely that the same value could be optimal for the whole space spanned by the training set. It is evident that different regions of the feature space would require different values of k due to the different distributions of prototypes. The situation of a query instance in the center of a class is very different from the situation of a query instance near the boundary between two classes. In this brief, we present a simple yet powerful approach to setting a local value of k. We associate a potentially different k to every prototype and obtain the best value of k by optimizing a criterion consisting of the local and global effects of the different k values in the neighborhood of the prototype. The proposed method has a fast training stage and the same complexity as the standard k-NN approach at the testing stage. The experiments show that this simple approach can significantly outperform the standard k-NN rule for both standard and class-imbalanced problems in a large set of different problems.
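The idea of attaching a separate k to every prototype can be made concrete with a short Python sketch. The selection rule used here (the smallest k whose leave-one-out vote recovers the prototype's own label) is a simplified stand-in chosen for illustration; it is not the local/global criterion optimized in the paper, which the abstract does not specify. The function names and the toy data are assumptions, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_local_k(X, y, k_values=(1, 3, 5)):
    """Assign one k to each prototype (assumed rule, see lead-in).

    Falls back to the first candidate k when no value classifies the
    prototype correctly from its neighbours.
    """
    local_k = []
    for i, x in enumerate(X):
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(x, X[j]))
        chosen = k_values[0]
        for k in k_values:
            if majority([y[j] for j in order[:k]]) == y[i]:
                chosen = k
                break
        local_k.append(chosen)
    return local_k


def predict(x, X, y, local_k):
    """Classify x with the k stored for its nearest prototype."""
    order = sorted(range(len(X)), key=lambda j: math.dist(x, X[j]))
    k = local_k[order[0]]
    return majority([y[j] for j in order[:k]])


# Toy one-dimensional problem; class "b" intrudes at 2.3, near the boundary.
X = [(0.0,), (1.0,), (2.0,), (2.3,), (4.1,), (5.0,)]
y = ["a", "a", "a", "b", "b", "b"]
local_k = fit_local_k(X, y)
label = predict((2.1,), X, y, local_k)
```

As in the abstract, the test-time cost is that of a standard k-NN query: one neighbour search, followed by a lookup of the stored k.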

15.

Stepwise approach for combining many sources of evidence for site-recognition in genomic sequences.

Pérez-Rodríguez, Javier; García-Pedrajas, Nicolás.

BMC Bioinformatics ; 17: 117, 2016 Mar 05.

Article in English | MEDLINE | ID: mdl-26945666

ABSTRACT

BACKGROUND: Recognizing the different functional parts of genes, such as promoters, translation initiation sites, donors, acceptors and stop codons, is a fundamental task of many current studies in Bioinformatics. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. However, with the rapid evolution of our ability to collect genomic information, it has been shown that combining many sources of evidence is fundamental to the success of any recognition task. With the advent of next-generation sequencing, the number of available genomes is increasing very rapidly. Thus, methods for making use of such large amounts of information are needed. RESULTS: In this paper, we present a methodology for combining tens or even hundreds of different classifiers for an improved performance. Our approach can include almost a limitless number of sources of evidence. We can use the evidence for the prediction of sites in a certain species, such as human, or other species as needed. This approach can be used for any of the functional recognition tasks cited above. However, to provide the necessary focus, we have tested our approach in two functional recognition tasks: translation initiation site and stop codon recognition. We have used the entire human genome as a target and another 20 species as sources of evidence and tested our method on five different human chromosomes. The proposed method achieves better accuracy than the best state-of-the-art method both in terms of the geometric mean of the specificity and sensitivity and the area under the receiver operating characteristic and precision recall curves. Furthermore, our approach shows a more principled way for selecting the best genomes to be combined for a given recognition task. 
CONCLUSIONS: Our approach has proven to be a powerful tool for improving the performance of functional site recognition, and it is a useful method for combining many sources of evidence for any recognition task in Bioinformatics. The results also show that the common approach of heuristically choosing the species to be used as source of evidence can be improved because the best combinations of genomes for recognition were those not usually selected. Although the experiments were performed for translation initiation site and stop codon recognition, any other recognition task may benefit from our methodology.

Subjects

Computational Biology/methods; Genome, Human; Genomics/methods; Protein Biosynthesis/genetics; Codon, Terminator/genetics; Humans; ROC Curve; Sensitivity and Specificity; Software; Support Vector Machine
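The abstract above reports results as the geometric mean of specificity and sensitivity. A minimal Python sketch of that measure follows; the function name and the confusion-matrix counts are illustrative and are not taken from the paper.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts for a site-recognition task with many negatives.
balanced = g_mean(tp=80, fn=20, tn=900, fp=100)
# A classifier that rejects every candidate site scores zero.
all_negative = g_mean(tp=0, fn=100, tn=1000, fp=0)
```

Because the two rates are multiplied, a classifier cannot obtain a high value by favouring the majority class alone, which is why the measure suits the heavily imbalanced data typical of site recognition.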

16.

Improving translation initiation site and stop codon recognition by using more than two classes.

Pérez-Rodríguez, Javier; Arroyo-Peña, Alexis G; García-Pedrajas, Nicolás.

Bioinformatics ; 30(19): 2702-8, 2014 Oct.

Article in English | MEDLINE | ID: mdl-24903421

ABSTRACT

MOTIVATION: The recognition of translation initiation sites and stop codons is a fundamental part of any gene recognition program. Currently, the most successful methods use powerful classifiers, such as support vector machines with various string kernels. These methods all use two classes, one of positive instances and another one of negative instances that are constructed using sequences from the whole genome. However, the features of the negative sequences differ depending on the position of the negative samples in the gene. There are differences depending on whether they are from exons, introns, intergenic regions or any other functional part of the genome. Thus, the positive class is fairly homogeneous, as all its sequences come from the same part of the gene, but the negative class is composed of different instances. The classifier suffers from this problem. In this article, we propose the training of different classifiers with different negative, more homogeneous, classes and the combination of these classifiers for improved accuracy. RESULTS: The proposed method achieves better accuracy than the best state-of-the-art method, both in terms of the geometric mean of the specificity and sensitivity and the area under the receiver operating characteristic and precision recall curves. The method is tested on the whole human genome. The results for recognizing both translation initiation sites and stop codons indicated improvements in the rates of both false-negative results (FN) and false-positive results (FP). On average, for translation initiation site recognition, the FN ratio was reduced by 30.2% and the FP ratio by 10.9%. For stop codon prediction, FP were reduced by 41.4% and FN by 31.7%. AVAILABILITY AND IMPLEMENTATION: The source code is licensed under the General Public License and is thus freely available. The datasets and source code can be obtained from http://cib.uco.es/site-recognition. CONTACT: npedrajas@uco.es.

Subjects

Codon, Initiator; Codon, Terminator; Computational Biology/methods; Protein Biosynthesis; Base Sequence; Genome, Human; Humans; ROC Curve; Reproducibility of Results; Sensitivity and Specificity; Software; Support Vector Machine

17.

A scalable memetic algorithm for simultaneous instance and feature selection.

García-Pedrajas, Nicolás; de Haro-García, Aida; Pérez-Rodríguez, Javier.

Evol Comput ; 22(1): 1-45, 2014.

Article in English | MEDLINE | ID: mdl-23544367

ABSTRACT

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly produced in many fields of research. At the same time, most of the recent pattern recognition problems involve highly complex datasets with a large number of possible explanatory variables. For many reasons, this abundance of variables significantly harms classification or recognition tasks. There are efficiency issues, too, because the speed of many classification algorithms is largely improved when the complexity of the data is reduced. One of the approaches to address problems that have too many features or instances is feature or instance selection, respectively. Although most methods address instance and feature selection separately, both problems are interwoven, and benefits are expected from facing these two tasks jointly. This paper proposes a new memetic algorithm for dealing with many instances and many features simultaneously by performing joint instance and feature selection. The proposed method performs four different local search procedures with the aim of obtaining the most relevant subsets of instances and features to perform an accurate classification. A new fitness function is also proposed that enforces instance selection but avoids putting too much pressure on removing features. We prove experimentally that this fitness function improves the results in terms of testing error. Regarding the scalability of the method, an extension of the stratification approach is developed for simultaneous instance and feature selection. This extension allows the application of the proposed algorithm to large datasets. An extensive comparison using 55 medium to large datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 30 large problems, with very good results. The accuracy of the method for class-imbalanced problems in a set of 40 datasets is shown. The usefulness of the method is also tested using decision trees and support vector machines as classification methods.

Subjects

Algorithms , Classification/methods , Computing Methodologies , Automated Pattern Recognition/methods , Search Engine/methods , Computer Simulation , Decision Trees , Support Vector Machine

18.

OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets.

García-Pedrajas, Nicolás; Perez-Rodríguez, Javier; de Haro-García, Aida.

IEEE Trans Cybern ; 43(1): 332-46, 2013 Feb.

Article in English | MEDLINE | ID: mdl-22868583

ABSTRACT

In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
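The divide-and-conquer idea described in this abstract can be illustrated with a minimal sketch. It is not the published OligoIS algorithm: the function names, the `subset_size` parameter, the random undersampling used to balance each block, and the single-pass condensed nearest neighbor used as the base selector are all assumptions made for illustration.

```python
import random


def condensed_nn(points, labels):
    """Stand-in base selector: a single pass of Hart's condensed nearest neighbor.

    Returns the positions (within the subset) of the instances that are kept.
    """
    kept = [0]
    for i in range(1, len(points)):
        # Nearest already-kept instance under squared Euclidean distance.
        nearest = min(
            kept,
            key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
        )
        # Keep the instance only if the current selection misclassifies it.
        if labels[nearest] != labels[i]:
            kept.append(i)
    return kept


def balanced_divide_and_select(X, y, subset_size, minority_label,
                               select=condensed_nn, seed=0):
    """Split the data into disjoint random blocks, balance each block by
    undersampling the majority class, and run the base selector on it.

    Returns the sorted indices (into X) of all selected instances.
    """
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    selected = set()
    for start in range(0, len(order), subset_size):
        block = order[start:start + subset_size]
        minority = [i for i in block if y[i] == minority_label]
        majority = [i for i in block if y[i] != minority_label]
        if not minority or not majority:
            # Assumption: a one-class block keeps its minority instances (if any).
            selected.update(minority)
            continue
        rng.shuffle(majority)
        balanced = minority + majority[:len(minority)]
        kept = select([X[i] for i in balanced], [y[i] for i in balanced])
        selected.update(balanced[k] for k in kept)
    return sorted(selected)


if __name__ == "__main__":
    # Toy data: one minority instance out of every ten.
    X = [(float(i), 0.0) for i in range(100)]
    y = [1 if i % 10 == 0 else 0 for i in range(100)]
    print(balanced_divide_and_select(X, y, subset_size=25, minority_label=1))
```

Because every block has at most `subset_size` instances, the cost per block is bounded, so the total cost in this sketch grows linearly with the number of instances; the blocks are also independent, which is what makes a parallel or out-of-memory implementation straightforward.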

19.

Constructing ensembles of classifiers by means of weighted instance selection.

García-Pedrajas, Nicolás.

IEEE Trans Neural Netw ; 20(2): 258-77, 2009 Feb.

Article in English | MEDLINE | ID: mdl-19179252

ABSTRACT

In this paper, we approach the problem of constructing ensembles of classifiers from the point of view of instance selection. Instance selection is aimed at obtaining a subset of the instances available for training capable of achieving, at least, the same performance as the whole training set. In this way, instance selection algorithms try to keep the performance of the classifiers while reducing the number of instances in the training set. Meanwhile, boosting methods construct an ensemble of classifiers iteratively, focusing each new member on the most difficult instances by means of a biased distribution of the training instances. In this work, we show how these two methodologies can be combined advantageously. We can use instance selection algorithms for boosting, using as the objective to optimize the training error weighted by the biased distribution of the instances given by the boosting method. Our method can be considered as boosting by instance selection. Instance selection has mostly been developed and used for k-nearest neighbor (k-NN) classifiers. So, as a first step, our methodology is suited to construct ensembles of k-NN classifiers. Constructing ensembles of classifiers by means of instance selection has the important feature of reducing the space complexity of the final ensemble, as only a subset of the instances is selected for each classifier. However, the methodology is not restricted to k-NN classifiers. Other classifiers, such as decision trees and support vector machines (SVMs), may also benefit from a smaller training set, as they produce simpler classifiers if an instance selection algorithm is performed before training. In the experimental section, we show that the proposed approach is able to produce better and simpler ensembles than the random subspace method (RSM) for k-NN and standard ensemble methods for C4.5 and SVMs.

Subjects

Algorithms , Automated Pattern Recognition/methods , Support Vector Machine
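The "boosting by instance selection" idea in the abstract above can be sketched as follows. This is an illustrative toy, not the paper's method: it assumes labels in {-1, +1}, uses the standard AdaBoost weight update, and replaces a real instance selection algorithm with a brute-force search over all subsets of a fixed small size. All function names and parameters are hypothetical.

```python
import itertools
import math


def nn_label(protos, proto_labels, x):
    """Label of the prototype closest to x (squared Euclidean distance)."""
    j = min(range(len(protos)),
            key=lambda k: sum((a - b) ** 2 for a, b in zip(protos[k], x)))
    return proto_labels[j]


def weighted_error(subset, X, y, w):
    """Boosting-weighted training error of a 1-NN rule built on `subset`."""
    protos = [X[i] for i in subset]
    labels = [y[i] for i in subset]
    return sum(wi for xi, yi, wi in zip(X, y, w)
               if nn_label(protos, labels, xi) != yi)


def select_instances(X, y, w, size):
    """Stand-in selector: brute-force search over all subsets of `size` instances."""
    best, best_err = None, float("inf")
    for cand in itertools.combinations(range(len(X)), size):
        err = weighted_error(cand, X, y, w)
        if err < best_err:
            best, best_err = cand, err
    return best, best_err


def boost_by_instance_selection(X, y, rounds=5, size=2):
    """AdaBoost-style loop in which each member is a 1-NN rule on a selected subset."""
    n = len(X)
    w = [1.0 / n] * n  # boosting distribution over training instances
    ensemble = []
    for _ in range(rounds):
        subset, err = select_instances(X, y, w, size)
        if err >= 0.5:
            break  # no subset beats chance under the current weights
        err = max(err, 1e-10)  # avoid division by zero for a perfect member
        alpha = 0.5 * math.log((1.0 - err) / err)
        protos = [X[i] for i in subset]
        labels = [y[i] for i in subset]
        ensemble.append((alpha, protos, labels))
        # Increase the weight of misclassified instances, decrease the rest.
        w = [wi * math.exp(-alpha if nn_label(protos, labels, xi) == yi else alpha)
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of the ensemble members; labels are -1 or +1."""
    score = sum(alpha * nn_label(protos, labels, x)
                for alpha, protos, labels in ensemble)
    return 1 if score >= 0 else -1
```

Each ensemble member stores only the selected prototypes, which reflects the reduced space complexity mentioned in the abstract. The exhaustive search is only feasible for tiny examples; a practical version would plug in an actual instance selection algorithm at that step.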

20.

Boosting random subspace method.

García-Pedrajas, Nicolás; Ortiz-Boyer, Domingo.

Neural Netw ; 21(9): 1344-62, 2008 Nov.

Article in English | MEDLINE | ID: mdl-18272334

ABSTRACT

In this paper we propose a boosting approach to the random subspace method (RSM) to achieve an improved performance and avoid some of the major drawbacks of RSM. RSM is a successful method for classification. However, the random selection of inputs, its source of success, can also be a major problem. For several problems some of the selected subspaces may lack the discriminant ability to separate the different classes. These subspaces produce poor classifiers that harm the performance of the ensemble. Additionally, boosting RSM would also be an interesting approach for improving its performance. Nevertheless, the application of the two methods together, boosting and RSM, achieves poor results, worse than the results of each method separately. In this work, we propose a new approach for combining RSM and boosting. Instead of obtaining random subspaces, we search for subspaces that optimize the weighted classification error given by the boosting algorithm, and then the new classifier added to the ensemble is trained using the obtained subspace. An additional advantage of the proposed methodology is that it can be used with any classifier, including those, such as k-nearest neighbor classifiers, that cannot use boosting methods easily. The proposed approach is compared with standard AdaBoost and RSM, showing an improved performance on a large set of 45 problems from the UCI Machine Learning Repository. An additional study of the effect of noise on the labels of the training instances shows that the less aggressive versions of the proposed methodology are more robust than AdaBoost in the presence of noise.

Subjects

Algorithms , Artificial Intelligence , Classification , Genetics/statistics & numerical data , Statistical Models , Mutation/physiology , Computer Neural Networks
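The subspace search step described in the abstract above (choosing a feature subspace that minimizes the boosting-weighted error instead of drawing one at random) might look like the sketch below. The abstract does not say how the search is done or how the error is estimated, so the exhaustive enumeration of subspaces and the leave-one-out 1-NN error used here are assumptions, as are the function names.

```python
import itertools


def loo_weighted_error(features, X, y, w):
    """Weighted leave-one-out 1-NN error using only the given feature indices."""
    err = 0.0
    for i, (xi, yi, wi) in enumerate(zip(X, y, w)):
        # Nearest other instance, with distance measured in the subspace only.
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in features),
        )
        if y[nearest] != yi:
            err += wi
    return err


def best_subspace(X, y, w, dim):
    """Stand-in search: try every subspace of `dim` features and keep the one
    with the lowest weighted error under the boosting distribution `w`."""
    n_features = len(X[0])
    best, best_err = None, float("inf")
    for features in itertools.combinations(range(n_features), dim):
        err = loo_weighted_error(features, X, y, w)
        if err < best_err:
            best, best_err = features, err
    return best, best_err


if __name__ == "__main__":
    # Feature 0 separates the classes; feature 1 is misleading.
    X = [(0.0, 5.0), (0.1, 1.0), (0.2, 9.0), (1.0, 5.1), (1.1, 1.1), (1.2, 9.1)]
    y = [0, 0, 0, 1, 1, 1]
    w = [1.0 / len(X)] * len(X)  # in a boosting round, w comes from the booster
    print(best_subspace(X, y, w, dim=1))
```

In a full boosting loop, the classifier for the current round would be trained on the returned subspace and the weights `w` would then be updated by the boosting algorithm before the next search.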
