1.
IEEE Trans Cybern ; 52(5): 2942-2954, 2022 May.
Article in English | MEDLINE | ID: mdl-33027013

ABSTRACT

Feature selection is one of the most frequent tasks in data mining applications. Its ability to remove useless and redundant features, improve classification performance, and yield knowledge about a given problem makes feature selection a common first step in data mining. In many feature selection applications, we need to combine the results of different feature selection processes. The two most common scenarios are the ensembles of feature selectors and the scaling up of feature selection methods using a data division approach. The standard procedure is to store the number of times every feature has been selected as a vote for the feature and then evaluate different selection thresholds with a certain criterion to obtain the final subset of selected features. However, this method is suboptimal as the relationships of the features are not considered in the voting process. Two redundant features may be selected a similar number of times due to the different sets of instances used each time. Thus, a voting scheme would tend to select both of them. In this article, we present a new approach: instead of using only the number of times a feature has been selected, the approach considers how many times the features have been selected together by a feature selection algorithm. The proposal is based on constructing an undirected graph where the vertices are the features, and the edges count the number of times every pair of features has been selected together. This graph is used to select the best subset of features, avoiding the redundancy introduced by the voting scheme. The proposal improves the results of the standard voting scheme in both ensembles of feature selectors and data division methods for scaling up feature selection.
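
The contrast between per-feature votes and pairwise co-selection counts can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary-based graph representation, and the four toy selection runs are hypothetical, and the step that extracts the final subset from the graph is omitted because the abstract does not describe it.

```python
from collections import Counter
from itertools import combinations


def vote_counts(selections):
    """Standard voting: how many runs selected each feature."""
    votes = Counter()
    for subset in selections:
        votes.update(set(subset))
    return votes


def co_selection_graph(selections):
    """Undirected graph stored as edge weights.

    Key (i, j) with i < j maps to the number of runs in which
    features i and j were selected together.
    """
    edges = Counter()
    for subset in selections:
        for i, j in combinations(sorted(set(subset)), 2):
            edges[(i, j)] += 1
    return edges


def select_by_votes(votes, threshold):
    """Keep every feature whose vote count reaches the threshold."""
    return sorted(f for f, v in votes.items() if v >= threshold)


# Hypothetical outputs of four feature selection runs.
# Features 0 and 1 play the role of two redundant features:
# each run picks one of them, never both.
runs = [{0, 2}, {1, 2}, {0, 2}, {1, 2}]
votes = vote_counts(runs)
graph = co_selection_graph(runs)
voted = select_by_votes(votes, threshold=2)
```

In this toy case, features 0 and 1 receive the same number of votes, so a threshold of 2 keeps both, while the graph has no edge between them. That missing edge is the kind of information a plain voting scheme discards.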


Subjects
Algorithms, Data Mining, Research Design
2.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2471-2482, 2021.
Article in English | MEDLINE | ID: mdl-32078558

ABSTRACT

Recognition of the functional sites of genes, such as translation initiation sites, donor and acceptor splice sites and stop codons, is a relevant part of many current problems in bioinformatics. The best approaches use sophisticated classifiers, such as support vector machines. However, with the rapid accumulation of sequence data, methods for combining many sources of evidence are necessary as it is unlikely that a single classifier can solve this problem with the best possible performance. A major issue is that the number of possible models to combine is large and the use of all of these models is impractical. In this paper we present a methodology for combining many sources of information to recognize any functional site using "floating search", a powerful heuristic applicable when the cost of evaluating each solution is high. We present experiments on four functional sites in the human genome, which is used as the target genome, and use another 20 species as sources of evidence. The proposed methodology shows significant improvement over state-of-the-art methods. The results show an advantage of the proposed method and also challenge the standard assumption of using only genomes not very close and not very far from the human to improve the recognition of functional sites.
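
As a rough illustration of the "floating search" idea, the sketch below implements a generic sequential forward floating search over a set of candidate items. It is not the paper's procedure: the function name, the `score` callback, and the toy score table are assumptions made for the example, and in the paper's setting the score would come from evaluating a combination of evidence sources.

```python
def floating_forward_search(candidates, score, max_size):
    """Generic sequential forward floating search (illustrative sketch).

    candidates: list of hashable items.
    score: callable taking a list of items and returning a number
           (higher is better).
    Returns (best_score, best_subset) over all subset sizes visited.
    """
    selected = []
    best_by_size = {}  # subset size -> (score, subset)
    while len(selected) < max_size:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # Forward step: add the candidate that helps most.
        best = max(remaining, key=lambda c: score(selected + [c]))
        selected = selected + [best]
        s = score(selected)
        k = len(selected)
        if k not in best_by_size or s > best_by_size[k][0]:
            best_by_size[k] = (s, list(selected))
        # Floating (backward) steps: drop an item whenever doing so
        # beats the best subset of that smaller size seen so far.
        while len(selected) > 2:
            drop = max(
                selected,
                key=lambda c: score([x for x in selected if x != c]),
            )
            reduced = [x for x in selected if x != drop]
            s_red = score(reduced)
            if s_red > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return max(best_by_size.values(), key=lambda t: t[0])


# Hypothetical score table: "b" and "c" are only strong together.
TOY_SCORES = {
    frozenset("a"): 5, frozenset("b"): 4, frozenset("c"): 4,
    frozenset("ab"): 6, frozenset("ac"): 6, frozenset("bc"): 9,
    frozenset("abc"): 8,
}


def toy_score(subset):
    return TOY_SCORES[frozenset(subset)]


best_score, best_subset = floating_forward_search(["a", "b", "c"], toy_score, 3)
```

On this table a purely greedy forward pass stops at {a, b} for size two, whereas the backward step revisits that choice after adding "c" and recovers {b, c}.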


Subjects
Computational Biology/methods, Gene Components/genetics, Human Genome/genetics, DNA Sequence Analysis/methods, Algorithms, Base Sequence/genetics, Humans, Genetic Models
3.
J Comput Aided Mol Des ; 34(3): 305-325, 2020 03.
Article in English | MEDLINE | ID: mdl-31893338

ABSTRACT

In the construction of activity prediction models, feature ranking methods are a useful mechanism for ordering features by their significance to the predictive model. This paper studies the influence of feature rankers in the construction of molecular activity prediction models; for this purpose, a comparative study of fourteen ranking methods for feature selection was conducted. The activity prediction models were constructed using four well-known classifiers and a wide collection of datasets. The ranking algorithms were compared considering the performance of these classifiers using different metrics and the consistency of the ranked features.


Subjects
Molecular Models, Software, Algorithms, Humans
4.
Neural Netw ; 118: 175-191, 2019 Oct.
Article in English | MEDLINE | ID: mdl-31299623

ABSTRACT

Prototype selection is one of the most common preprocessing tasks in data mining applications. The vast amounts of data that we must handle in practical problems render the removal of noisy, redundant or useless instances a convenient first step for any real-world application. Many algorithms have been proposed for prototype selection. For difficult problems, however, a single method is unlikely to achieve the desired performance. Similar to the problem of classification, ensembles of prototype selectors have been proposed to overcome the limitations of single algorithms. In ensembles of prototype selectors, the usual combination method is based on a voting scheme coupled with an acceptance threshold. However, this method is suboptimal, because the relationships among the prototypes are not taken into account. In this paper, we propose a different approach, in which we consider not only the number of times every prototype has been selected but also the subsets of prototypes that are selected. With this additional information we develop GEEBIES, which is a new way of combining the results of ensembles of prototype selectors. In a large set of problems, we show that our proposal outperforms the standard boosting approach. A way of scaling up our method to large datasets is also proposed and experimentally tested.


Subjects
Algorithms, Factual Databases, Proof of Concept Study, Data Mining/standards, Factual Databases/standards
5.
Evol Comput ; 22(1): 1-45, 2014.
Article in English | MEDLINE | ID: mdl-23544367

ABSTRACT

Instance selection is becoming increasingly relevant due to the huge amount of data that is constantly produced in many fields of research. At the same time, most of the recent pattern recognition problems involve highly complex datasets with a large number of possible explanatory variables. For many reasons, this abundance of variables significantly harms classification or recognition tasks. There are efficiency issues, too, because the speed of many classification algorithms is largely improved when the complexity of the data is reduced. One of the approaches to address problems that have too many features or instances is feature or instance selection, respectively. Although most methods address instance and feature selection separately, both problems are interwoven, and benefits are expected from facing these two tasks jointly. This paper proposes a new memetic algorithm for dealing with many instances and many features simultaneously by performing joint instance and feature selection. The proposed method performs four different local search procedures with the aim of obtaining the most relevant subsets of instances and features to perform an accurate classification. A new fitness function is also proposed that enforces instance selection but avoids putting too much pressure on removing features. We show experimentally that this fitness function improves the results in terms of testing error. Regarding the scalability of the method, an extension of the stratification approach is developed for simultaneous instance and feature selection. This extension allows the application of the proposed algorithm to large datasets. An extensive comparison using 55 medium to large datasets from the UCI Machine Learning Repository shows the usefulness of our method. Additionally, the method is applied to 30 large problems, with very good results. The accuracy of the method for class-imbalanced problems in a set of 40 datasets is shown. The usefulness of the method is also tested using decision trees and support vector machines as classification methods.


Subjects
Algorithms, Classification/methods, Computing Methodologies, Automated Pattern Recognition/methods, Search Engine/methods, Computer Simulation, Decision Trees, Support Vector Machine
6.
IEEE Trans Cybern ; 43(1): 332-46, 2013 Feb.
Article in English | MEDLINE | ID: mdl-22868583

ABSTRACT

In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.
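
One possible reading of "divide-and-conquer over balanced subsets" is sketched below. Everything specific here is an assumption rather than the paper's algorithm: the function names, the choice to pair all minority instances with equally sized chunks of the majority class, the `selector` callback standing in for any instance selection method, and the union used to combine the per-subset results.

```python
import random


def balanced_subsets(labels, minority_label, seed=0):
    """Yield index lists that are (roughly) class-balanced.

    Assumption: each subset holds all minority indices plus one
    chunk of shuffled majority indices of the same size.
    """
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    random.Random(seed).shuffle(majority)
    size = max(1, len(minority))
    for start in range(0, len(majority), size):
        yield minority + majority[start:start + size]


def select_divide_and_conquer(labels, minority_label, selector, seed=0):
    """Apply a selector to each balanced subset and merge the results.

    selector: hypothetical callable mapping a list of indices to the
              indices it keeps. Merging by union is an assumption.
    """
    kept = set()
    for subset in balanced_subsets(labels, minority_label, seed):
        kept.update(selector(subset))
    return sorted(kept)


# Toy labels: class 1 is the minority (indices 0 and 5).
labels = [1, 0, 0, 0, 0, 1, 0]
subsets = list(balanced_subsets(labels, minority_label=1))
kept = select_divide_and_conquer(labels, 1, selector=lambda idx: idx)
```

Because each subset is built and processed on its own, the loop body could be handed to separate workers and each subset read from disk when needed, which fits the abstract's remarks about parallel execution and not loading the whole data set into memory.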
