Results 1 - 20 of 30
1.
PeerJ ; 11: e14779, 2023.
Article in English | MEDLINE | ID: mdl-36785708

ABSTRACT

A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, the degree to which all related sequences are grouped into the same cluster. Most algorithms are conservative in grouping sequences with other sequences. Remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters. The resulting clusters have high homogeneity but low completeness. We propose Complet+, a computationally scalable post-processing method to increase the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related protein clusters that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2's clusterupdate achieves an increased V-measure of 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by a substantial increase in Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with respect to the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus.
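The homogeneity/completeness trade-off described above can be made concrete with the entropy-based V-measure, the metric the abstract reports. Below is a minimal pure-Python sketch of that metric; it is an illustration, not the Complet+ implementation:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cond_entropy(a, b):
    # H(a | b): how mixed the labels `a` remain within each group of `b`
    n = len(a)
    total = 0.0
    for bv in set(b):
        members = [av for av, bb in zip(a, b) if bb == bv]
        total += (len(members) / n) * entropy(members)
    return total

def v_measure(true_classes, pred_clusters):
    h_c, h_k = entropy(true_classes), entropy(pred_clusters)
    homogeneity = 1.0 if h_c == 0 else 1 - cond_entropy(true_classes, pred_clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1 - cond_entropy(pred_clusters, true_classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0, homogeneity, completeness
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return v, homogeneity, completeness
```

Splitting one true family across two clusters keeps homogeneity at 1 but lowers completeness; merging those clusters, which is what Complet+ does, restores it.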


Subjects
Algorithms , Proteins , Sequence Alignment , Amino Acid Sequence , Proteins/chemistry , Cluster Analysis
2.
J Int Neuropsychol Soc ; 26(7): 690-700, 2020 08.
Article in English | MEDLINE | ID: mdl-32200771

ABSTRACT

OBJECTIVE: To determine how well machine learning algorithms can classify mild cognitive impairment (MCI) subtypes and Alzheimer's disease (AD) using features obtained from the digital Clock Drawing Test (dCDT). METHODS: dCDT protocols were administered to 163 patients diagnosed with AD (n = 59), amnestic MCI (aMCI; n = 26), combined mixed/dysexecutive MCI (mixed/dys MCI; n = 43), and patients without MCI (non-MCI; n = 35) using standard clock drawing command and copy procedures, that is, draw the face of the clock, put in all of the numbers, and set the hands for "10 after 11." A digital pen and custom software recorded patients' drawings. Three hundred and fifty features were evaluated for maximum information/minimum redundancy. The best subset of features was used to train classification models to determine diagnostic accuracy. RESULTS: A neural network employing information-theoretic feature selection achieved the best 2-group classification results, with 10-fold cross-validation accuracies at or above 83%, that is, AD versus non-MCI = 91.42%; AD versus aMCI = 91.49%; AD versus mixed/dys MCI = 84.05%; aMCI versus mixed/dys MCI = 84.11%; aMCI versus non-MCI = 83.44%; and mixed/dys MCI versus non-MCI = 85.42%. A follow-up two-group non-MCI versus all-MCI analysis yielded comparable results (83.69%). Two-group classification analyses used 25-125 dCDT features, depending on the groups compared. Three- and four-group analyses yielded lower but still promising levels of classification accuracy. CONCLUSION: Early identification of emergent neurodegenerative illness is critical for better disease management. Applying machine learning to standard neuropsychological tests promises to be an effective first-line screening method for classification of non-MCI and MCI subtypes.
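The 10-fold cross-validation protocol behind the reported accuracies can be sketched generically. The fold loop below is paired with a deliberately simple nearest-centroid stand-in classifier; the study's actual models were neural networks over selected dCDT features:

```python
import random

def nearest_centroid_train(X, y):
    # one centroid (feature-wise mean) per class
    cents = {}
    for label in set(y):
        pts = [x for x, lbl in zip(X, y) if lbl == label]
        cents[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return cents

def nearest_centroid_predict(cents, x):
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def k_fold_accuracy(X, y, k=10, seed=0):
    # shuffle once, split into k disjoint folds, hold each fold out in turn
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for fold in folds:
        held = set(fold)
        tr = [i for i in idx if i not in held]
        model = nearest_centroid_train([X[i] for i in tr], [y[i] for i in tr])
        accs.append(sum(nearest_centroid_predict(model, X[i]) == y[i] for i in fold) / len(fold))
    return sum(accs) / len(accs)
```

On well-separated synthetic two-class data this loop reports perfect held-out accuracy, which illustrates the protocol rather than the difficulty of the clinical task.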


Subjects
Alzheimer Disease/diagnosis , Cognitive Dysfunction/diagnosis , Machine Learning , Neuropsychological Tests , Aged , Aged, 80 and over , Female , Humans , Male
3.
IEEE Trans Neural Netw Learn Syst ; 29(9): 4504-4509, 2018 09.
Article in English | MEDLINE | ID: mdl-29028210

ABSTRACT

Feature subset selection can be used to sieve through large volumes of data and discover the most informative subset of variables for a particular learning problem. Yet, due to memory and other resource constraints (e.g., CPU availability), many of the state-of-the-art feature subset selection methods cannot be extended to high-dimensional data or data sets with an extremely large volume of instances. In this brief, we extend online feature selection (OFS), a recently introduced approach that uses partial feature information, by developing an ensemble of online linear models to make predictions. The OFS approach employs a linear model as the base classifier, which allows the $l_0$-norm of the parameter vector to be constrained to perform feature selection, leading to sparse linear models. We demonstrate that the proposed ensemble model typically yields a smaller error rate than any single linear model, while maintaining the same level of sparsity and complexity at the time of testing.
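The l0-constrained online update can be illustrated with a single truncated online-gradient learner. This is a hedged sketch of one base model only; the brief's contribution is an ensemble of such learners:

```python
def ofs_single(stream, num_features, budget, lr=0.2):
    # Online gradient descent on the hinge loss; after each update the
    # weight vector is projected onto the l0 ball by keeping only the
    # `budget` largest-magnitude coordinates, giving a sparse linear model.
    w = [0.0] * num_features
    for x, y in stream:  # y in {-1, +1}
        if y * sum(wi * xi for wi, xi in zip(w, x)) < 1:  # hinge loss active
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            keep = set(sorted(range(num_features),
                              key=lambda i: abs(w[i]), reverse=True)[:budget])
            w = [wi if i in keep else 0.0 for i, wi in enumerate(w)]
    return w
```

An ensemble would train several such models (for example on different feature orderings or data permutations) and average their predictions, which is where the reported error-rate reduction comes from.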

4.
IEEE Trans Neural Netw Learn Syst ; 29(6): 2530-2544, 2018 06.
Article in English | MEDLINE | ID: mdl-28504951

ABSTRACT

Many machine learning applications are now associated with very large data sets whose sizes were almost unimaginable just a short time ago. As a result, many of the current algorithms cannot handle, or do not scale to, today's extremely large volumes of data. Fortunately, not all features that make up a typical data set carry information that is relevant or useful for prediction, and identifying and removing such irrelevant features can significantly reduce the total data size. The unfortunate dilemma, however, is that some of the current data sets are so large that common feature selection algorithms, whose very goal is to reduce the dimensionality, cannot handle such large data sets, creating a vicious cycle. We describe a sequential learning framework for feature subset selection (SLSS) that can scale with both the number of features and the number of observations. The proposed framework uses multi-armed bandit algorithms to sequentially search a subset of variables and assign a level of importance to each feature. The novel contribution of SLSS is its ability to naturally scale to large data sets, evaluate such data in a very small amount of time, and be performed independently of the optimization of any classifier to reduce unnecessary complexity. We demonstrate the capabilities of SLSS on synthetic and real-world data sets.
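A hedged sketch of the bandit idea: treat each feature as an arm, reward a pull with the feature's absolute correlation to the label on a random mini-batch, and let UCB1 allocate pulls. This illustrates the framework's flavor only; the reward definition here is an assumption, not the authors' exact SLSS algorithm:

```python
import math
import random

def bandit_feature_importance(X, y, rounds=300, batch=25, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    counts, sums = [0] * d, [0.0] * d

    def reward(j):
        # |Pearson correlation| between feature j and the label on a mini-batch
        idx = [rng.randrange(len(X)) for _ in range(batch)]
        xs, ys = [X[i][j] for i in idx], [y[i] for i in idx]
        mx, my = sum(xs) / batch, sum(ys) / batch
        cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        vx = sum((a - mx) ** 2 for a in xs)
        vy = sum((b - my) ** 2 for b in ys)
        return abs(cov) / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

    for t in range(1, rounds + 1):
        if t <= d:
            j = t - 1  # initialization: pull every arm once
        else:  # UCB1: exploit high means, explore rarely pulled arms
            j = max(range(d), key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2 * math.log(t) / counts[a]))
        counts[j] += 1
        sums[j] += reward(j)
    return [s / c for s, c in zip(sums, counts)]  # mean reward = importance
```

Because each pull touches only one feature on a small batch, the cost per round is independent of the total number of features and observations, which is the scaling property the abstract emphasizes.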

5.
Alzheimers Dement (Amst) ; 1(3): 339-348, 2015 Sep 01.
Article in English | MEDLINE | ID: mdl-26693175

ABSTRACT

INTRODUCTION: The dynamic range of cerebrospinal fluid (CSF) amyloid ß (Aß1-42) measurement does not parallel cognitive changes in Alzheimer's disease (AD) and cognitively normal (CN) subjects across different studies. Therefore, identifying novel proteins to characterize symptomatic AD samples is important. METHODS: Proteins were profiled using a multianalyte platform by Rules Based Medicine (MAP-RBM). Due to underlying heterogeneity and unbalanced sample size, we combined subjects (344 AD and 325 CN) from three cohorts: the Alzheimer's Disease Neuroimaging Initiative, the Penn Center for Neurodegenerative Disease Research of the University of Pennsylvania, and the Knight Alzheimer's Disease Research Center at Washington University in St. Louis. We focused on samples whose cognitive and amyloid status was consistent. We performed linear regression (accounting for age, gender, number of APOE e4 alleles, and a cohort variable) to identify amyloid-related proteins for symptomatic AD subjects in this largest CSF-based MAP-RBM study to date. ANOVA and Tukey's test were used to evaluate whether these proteins were related to cognitive impairment changes as measured by the mini-mental state examination (MMSE). RESULTS: Seven proteins were significantly associated with Aß1-42 levels in the combined cohort (false discovery rate adjusted P < .05), of which lipoprotein a (Lp(a)), prolactin (PRL), resistin, and vascular endothelial growth factor (VEGF) have consistent directions of association across every individual cohort. VEGF was strongly associated with MMSE scores, followed by pancreatic polypeptide and immunoglobulin A (IgA), suggesting they may be related to the staging of AD. DISCUSSION: Lp(a), PRL, IgA, and tissue factor/thromboplastin have never been reported for AD diagnosis in previous individual CSF-based MAP-RBM studies. Although some of our reported analytes are related to AD pathophysiology, the roles of the others in symptomatic AD warrant further exploration.
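The FDR adjustment mentioned above is typically the Benjamini-Hochberg step-up procedure; a minimal sketch follows (the p-values in the usage example are placeholders, not the study's):

```python
def benjamini_hochberg(pvals, q=0.05):
    # Step-up procedure: find the largest rank k such that the k-th
    # smallest p-value satisfies p_(k) <= q * k / m, then reject the
    # k hypotheses with the smallest p-values.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return sorted(order[:k])  # indices of rejected (significant) hypotheses
```

Unlike a Bonferroni correction, the threshold relaxes with rank, which is why it controls the expected fraction of false discoveries rather than the chance of any single one.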

6.
IEEE Trans Nanobioscience ; 14(6): 608-16, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26316190

ABSTRACT

Recent advances in machine learning, specifically in deep learning with neural networks, have made a profound impact on fields such as natural language processing, image classification, and language modeling; however, the feasibility and potential benefits of these approaches for metagenomic data analysis have been largely under-explored. Deep learning exploits many layers that learn nonlinear feature representations, typically in an unsupervised fashion, and recent results have shown outstanding generalization performance on previously unseen data. Furthermore, some deep learning methods can also represent the structure in a data set. Consequently, deep learning and neural networks may prove to be an appropriate approach for metagenomic data. To determine whether such approaches are indeed appropriate for metagenomics, we experiment with two deep learning methods: i) a deep belief network, and ii) a recursive neural network, the latter of which provides a tree representing the structure of the data. We compare these approaches to the standard multi-layer perceptron, which has been well established in the machine learning community as a powerful prediction algorithm, though its presence is largely missing in the metagenomics literature. We find that traditional neural networks can be quite powerful classifiers on metagenomic data compared to baseline methods, such as random forests. On the other hand, while the deep learning approaches did not improve classification accuracy, they do provide the ability to learn hierarchical representations of a data set that standard classification methods do not allow. Our goal in this effort is not to determine the best algorithm in terms of accuracy, as that depends on the specific application, but rather to highlight the benefits and drawbacks of each of the approaches we discuss and to provide insight into how they can be improved for predictive metagenomic analysis.
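Composition-based classifiers such as the networks compared here typically take a normalized k-mer composition vector as input; a sketch of that common preprocessing step (the exact representation used in the paper may differ):

```python
from itertools import product

def kmer_profile(seq, k=3):
    # fraction of each of the 4^k possible k-mers among the read's windows
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    counts = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in index:  # skip windows containing ambiguous bases (e.g. N)
            counts[index[km]] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]
```

The resulting fixed-length vector is what makes reads of varying lengths comparable inputs to a perceptron, deep belief network, or random forest.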


Subjects
Metagenomics/methods , Neural Networks, Computer , Algorithms , Microbiota
7.
Brain Res ; 1624: 390-397, 2015 Oct 22.
Article in English | MEDLINE | ID: mdl-26256251

ABSTRACT

INTRODUCTION: Reductions of cerebrospinal fluid (CSF) amyloid-beta (Aß42) and elevated phosphorylated tau (p-Tau) reflect in vivo Alzheimer's disease (AD) pathology and show utility in predicting conversion from mild cognitive impairment (MCI) to dementia. We investigated the P50 event-related potential component as a noninvasive biomarker of AD pathology in non-demented elderly. METHODS: 36 MCI patients were stratified into amyloid-positive (MCI-AD, n=17) and amyloid-negative (MCI-Other, n=19) groups using CSF levels of Aß42. All amyloid-positive patients were also p-Tau positive. P50s were elicited with an auditory oddball paradigm. RESULTS: MCI-AD patients yielded larger P50s than MCI-Other. The best amyloid-status predictor model showed 94.7% sensitivity, 94.1% specificity, and 94.4% total accuracy. DISCUSSION: P50 predicted amyloid status in MCI patients, thereby distinguishing MCI due to AD pathology from MCI of other etiologies. The P50 may have clinical utility for inexpensive pre-screening and assessment of Alzheimer's pathology.
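Sensitivity, specificity, and total accuracy follow directly from confusion counts. As a sanity check on the reported figures, one misclassification per group of roughly these cohort sizes is consistent with 94.7%, 94.1%, and 94.4% (the exact count assignment below is an illustration, not taken from the paper):

```python
def diagnostic_metrics(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)                  # true-positive rate
    specificity = tn / (tn + fp)                  # true-negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)    # total accuracy
    return sensitivity, specificity, accuracy
```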


Subjects
Alzheimer Disease/physiopathology , Evoked Potentials/physiology , Prodromal Symptoms , Aged , Alzheimer Disease/cerebrospinal fluid , Amyloid beta-Peptides/cerebrospinal fluid , Biomarkers , Cognitive Dysfunction/cerebrospinal fluid , Cognitive Dysfunction/physiopathology , Electroencephalography , Female , Humans , Male , Middle Aged , Peptide Fragments/cerebrospinal fluid , Reaction Time/physiology , Statistics, Nonparametric
8.
IEEE Trans Neural Netw Learn Syst ; 26(4): 880-6, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25794384

ABSTRACT

Selection of the most informative features, those that lead to a small loss on future data, is arguably one of the most important steps in classification, data analysis, and model selection. Several feature selection (FS) algorithms are available; however, due to the noise present in any data set, FS algorithms are typically accompanied by an appropriate cross-validation scheme. In this brief, we propose a statistical hypothesis test derived from the Neyman-Pearson lemma for determining whether a feature is statistically relevant. The proposed approach can be applied as a wrapper to any FS algorithm, regardless of the FS criteria used by that algorithm, to determine whether a feature belongs in the relevant set. Perhaps more importantly, this procedure efficiently determines the number of relevant features given an initial starting point. We provide freely available software implementations of the proposed methodology.
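The Neyman-Pearson lemma states that the likelihood-ratio test is the most powerful test at a fixed false-alarm rate. A generic toy sketch of such a test deciding between two Gaussian hypotheses is shown below; the paper's statistic for feature relevance is derived from this lemma but is not identical to this toy:

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio_test(samples, mu0=0.0, mu1=1.0, log_threshold=0.0):
    # Reject H0 (e.g. "this feature is irrelevant") when the log-likelihood
    # ratio of H1 to H0 exceeds a threshold chosen for the desired alarm rate.
    log_lr = sum(math.log(gaussian_pdf(x, mu1)) - math.log(gaussian_pdf(x, mu0))
                 for x in samples)
    return log_lr > log_threshold
```

Wrapped around an FS algorithm, a test of this form decides, feature by feature, whether the observed evidence of relevance exceeds what noise alone would produce.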

10.
IEEE Trans Neural Netw Learn Syst ; 25(1): 12-26, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24806641

ABSTRACT

An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions that change over time. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires a substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this paper, we introduce compacted object sample extraction (COMPOSE), a computational geometry-based framework to learn from nonstationary streaming data, where labels are unavailable (or presented very sporadically) after initialization. We introduce the algorithm in detail, and discuss its results and performance on several synthetic and real-world data sets, which demonstrate the ability of the algorithm to learn under several different scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch.
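One COMPOSE iteration can be caricatured with a nearest-centroid labeler and a crude "core" extraction that keeps only the points closest to each class centroid. The real algorithm uses computational-geometry constructs (alpha shapes) for the compaction step; the centroid-distance shrink below is a stand-in for illustration:

```python
def compose_step(labeled, unlabeled, shrink=0.5):
    # labeled: list of (point, label); unlabeled: list of points.
    # 1) label the unlabeled points by nearest class centroid,
    # 2) keep the `shrink` fraction nearest each centroid as the "core"
    #    that seeds learning on the next (possibly drifted) batch.
    groups = {}
    for x, lbl in labeled:
        groups.setdefault(lbl, []).append(x)
    cents = {lbl: [sum(col) / len(pts) for col in zip(*pts)]
             for lbl, pts in groups.items()}
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    pred = [(x, min(cents, key=lambda lbl: d2(x, cents[lbl]))) for x in unlabeled]
    core = []
    for lbl, cent in cents.items():
        members = sorted((x for x, p in pred if p == lbl), key=lambda x: d2(x, cent))
        core.extend((x, lbl) for x in members[: max(1, int(shrink * len(members)))])
    return core
```

Iterating this step is what lets the algorithm ride slow drift: as long as consecutive distributions overlap, the extracted core from one batch remains correctly labeled for the next.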

11.
Ann Biomed Eng ; 41(2): 223-37, 2013 Feb.
Article in English | MEDLINE | ID: mdl-22956158

ABSTRACT

The objective of this research was to assess the utility of a simple near-infrared spectroscopy (NIRS) technology for objective assessment of the hemodynamic response to acute pain. For this exploration, we used functional near-infrared spectroscopy (fNIRS) to measure the hemodynamic response on the forehead during three trials of a cold pressor test (CPT) in 20 adults. To measure hemodynamic changes in the superficial tissues as well as the intracranial tissues, two configurations of 'far' and 'near' source-detector separations were used. We identified two features that were found to be fairly consistent across all subjects. The first feature was the change of total hemoglobin (THb) concentration in a given condition divided by the duration of that condition [Formula: see text]. Statistical analyses revealed that during the first CPT trial, [Formula: see text] significantly changed from its baseline value in all channels. Also, adaptation to repeated CPTs was observed in both the [Formula: see text] parameter and the reported post-stimulus pain rating scores. The second feature was the difference between the maximum and the minimum of the evoked changes in the THb concentration (ΔTHb). A significant correlation was observed between the post-stimulus pain rating score and ΔTHb at all channels. An asymmetrical activity was observed only at the 'far' channels. These results suggest that fNIRS can potentially be used as a reliable technique for the assessment of the hemodynamic response to tonic pain induced by the CPT.
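The two features reduce to simple arithmetic on a THb time series for one condition; a sketch (the sampling details are assumptions, not from the paper):

```python
def thb_features(thb, duration_s):
    # Feature 1: net THb change over the condition divided by its
    # duration (the per-condition rate-of-change feature in the text).
    # Feature 2: evoked range, max minus min of THb (the ΔTHb feature).
    slope = (thb[-1] - thb[0]) / duration_s
    evoked_range = max(thb) - min(thb)
    return slope, evoked_range
```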


Subjects
Acute Pain/diagnosis , Adult , Cold Temperature , Female , Forehead , Hand , Hemodynamics , Hemoglobins/analysis , Humans , Hydrogen Peroxide/analysis , Male , Spectroscopy, Near-Infrared
12.
IEEE Trans Neural Netw ; 22(10): 1517-31, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21824845

ABSTRACT

We introduce an ensemble-of-classifiers approach for incremental learning of concept drift, characterized by nonstationary environments (NSEs), where the underlying data distributions change over time. The proposed algorithm, named Learn(++).NSE, learns from consecutive batches of data without making any assumptions on the nature or rate of drift; it can learn from environments that experience constant or variable rates of drift, addition or deletion of concept classes, as well as cyclical drift. The algorithm learns incrementally, like other members of the Learn(++) family of algorithms, that is, without requiring access to previously seen data. Learn(++).NSE trains one new classifier for each batch of data it receives, and combines these classifiers using dynamically weighted majority voting. The novelty of the approach is in determining the voting weights based on each classifier's time-adjusted accuracy on current and past environments. This approach allows the algorithm to recognize, and react to, changes in the underlying data distributions, as well as a possible reoccurrence of an earlier distribution. We evaluate the algorithm on several synthetic datasets designed to simulate a variety of nonstationary environments, as well as a real-world weather prediction dataset. Comparisons with several other approaches are also included. Results indicate that Learn(++).NSE can track changing environments very closely, regardless of the type of concept drift. To allow future use, comparison, and benchmarking by interested researchers, we also release the data used in this paper.
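The voting step can be sketched as follows: each classifier's error on past batches is averaged through a sigmoid that emphasizes recent batches, and its vote weight is log((1-ε)/ε). This is a simplified reading of the Learn(++).NSE weighting, with the sigmoid parameters chosen arbitrarily for illustration:

```python
import math

def nse_weighted_vote(classifiers, batch_accuracies, x, a=0.5, b=2):
    # batch_accuracies[k]: the k-th classifier's accuracy on each past
    # batch, oldest first. The sigmoid omega is centered b batches before
    # the newest, so recent batches dominate the averaged error.
    votes = {}
    for clf, accs in zip(classifiers, batch_accuracies):
        t = len(accs)
        omega = [1 / (1 + math.exp(-a * (j - (t - 1 - b)))) for j in range(t)]
        err = sum(w * (1 - acc) for w, acc in zip(omega, accs)) / sum(omega)
        err = min(max(err, 1e-6), 0.499)  # keep the log-weight finite, positive
        weight = math.log((1 - err) / err)
        label = clf(x)
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)
```

A classifier that was accurate long ago but poor on recent batches is thus down-weighted, while its weight recovers automatically if an earlier distribution reoccurs and its recent accuracy climbs back.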


Subjects
Algorithms , Artificial Intelligence , Models, Neurological , Neural Networks, Computer , Electronic Data Processing/methods , Environment , Humans , Learning/physiology , Nonlinear Dynamics
13.
Comput Biol Chem ; 35(3): 199-209, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21704267

ABSTRACT

Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features instead of a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost the performance of taxonomic classifiers. This work proposes three different filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler divergence, mutual information, and distance information, (2) a text-mining technique, TF-IDF, and (3) minimum-redundancy-maximum-relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6-mer mRMR method performs well, especially at the phylum level. If the number of total features is very large, feature selection becomes difficult, because a small subset of features that captures a majority of the data variance is less likely to exist. Therefore, we conclude that there is a trade-off between feature set size and feature selection method to optimize classification performance. For larger feature set sizes, TF-IDF works better at finer resolutions, while mRMR performs best of any method for N=6 at all taxonomic levels.
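The TF-IDF variant treats each genome fragment as a document whose "words" are k-mers; a compact sketch:

```python
import math
from collections import Counter

def kmer_tfidf(fragments):
    # fragments: list of k-mer lists, one list per genomic read
    df = Counter()
    for frag in fragments:
        df.update(set(frag))          # document frequency per k-mer
    n = len(fragments)
    vectors = []
    for frag in fragments:
        tf = Counter(frag)
        total = len(frag)
        vectors.append({km: (c / total) * math.log(n / df[km])
                        for km, c in tf.items()})
    return vectors
```

A k-mer present in every fragment gets weight zero (log 1 = 0), so only k-mers that discriminate between fragments survive as features for the downstream SVM.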


Subjects
Computational Biology , Metagenome/genetics , Algorithms , Bacteria/genetics , Computer Simulation , Data Mining , Databases, Genetic , Sequence Analysis, DNA
14.
J Biomed Biotechnol ; 2011: 495849, 2011.
Article in English | MEDLINE | ID: mdl-21541181

ABSTRACT

High-throughput sequencing technologies enable metagenome profiling, the simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data include sequence fragments ("reads") from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between "known" and "unknown" taxa, which can be used with composition-based methods as well as in a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that integrating a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with the best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an "unknown" class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially at the species level and for ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.
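At its simplest, the detector idea reduces to rejecting low-confidence assignments. A toy sketch of attaching an "unknown" class to any scoring classifier follows; the threshold value and the thresholding rule itself are assumptions for illustration, not the paper's calibrated detector:

```python
def classify_with_unknown(scores, threshold=0.6):
    # scores: mapping from known taxon to classifier confidence for one
    # read. Reads whose best score falls below the threshold are flagged
    # "unknown" instead of being forced into the nearest known taxon.
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"
```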


Subjects
DNA Barcoding, Taxonomic/methods , Sequence Analysis, DNA/methods , Algorithms , Bacteria/genetics , Databases, Nucleic Acid , Genome/genetics , Metagenomics , Mining , Open Reading Frames/genetics , ROC Curve , Species Specificity , Waste Disposal, Fluid
15.
Article in English | MEDLINE | ID: mdl-22254735

ABSTRACT

As life expectancy increases, particularly in the developed world, so does the prevalence of Alzheimer's disease (AD). AD is a neurodegenerative disorder characterized by amyloid plaques and neurofibrillary tangles in the brain that lead to neuronal death and dementia. Early diagnosis of AD is still a major unresolved health concern: several biomarkers are being investigated, among which the electroencephalogram (EEG) provides the only electrophysiological option. In this study, EEG signals obtained from 161 subjects--79 with AD, and 82 age-matched controls (CN)--are analyzed using several nonlinear signal complexity measures. These measures include: Higuchi fractal dimension (HFD), spectral entropy (SE), spectral centroid (SC), spectral roll-off (SR), and zero-crossing rate (ZCR). HFD is a quantitative measure of time series complexity derived from fractal theory. Among the spectral measures, SE measures the level of disorder in the spectrum, SC is a measure of spectral shape, and SR is the frequency below which a specified percentage of the spectral magnitude distribution is contained. Lastly, ZCR is simply the rate at which the signal changes sign. A t-test was first applied to determine those features that provide significant differences between the groups. Those features were then used to train a neural network. The classification accuracies ranged from 60% to 66%, suggesting the features contain some discriminatory information; however, not enough to be clinically useful alone. Combining these features and training a support vector machine (SVM) resulted in a diagnostic accuracy of 78%, indicating that these features carry complementary information.
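Of these measures, the Higuchi fractal dimension is easy to state: the mean curve length L(k) at downsampling factor k scales as k^(-D), and D is the slope of log L(k) against log(1/k). A sketch of the standard construction:

```python
import math

def higuchi_fd(x, kmax=8):
    n = len(x)
    log_inv_k, log_len = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):  # k interleaved subsequences starting at offset m
            pts = x[m::k]
            if len(pts) < 2:
                continue
            raw = sum(abs(pts[i + 1] - pts[i]) for i in range(len(pts) - 1))
            # Higuchi's normalization of the subsequence curve length
            lengths.append(raw * (n - 1) / ((len(pts) - 1) * k) / k)
        log_inv_k.append(math.log(1.0 / k))
        log_len.append(math.log(sum(lengths) / len(lengths)))
    # least-squares slope of log L(k) versus log(1/k) is the dimension
    mk = sum(log_inv_k) / len(log_inv_k)
    ml = sum(log_len) / len(log_len)
    num = sum((a - mk) * (b - ml) for a, b in zip(log_inv_k, log_len))
    return num / sum((a - mk) ** 2 for a in log_inv_k)
```

A smooth straight line has dimension 1, while a highly irregular signal approaches 2; the hypothesis behind using HFD on EEG is that AD alters this irregularity.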


Subjects
Algorithms , Alzheimer Disease/diagnosis , Alzheimer Disease/physiopathology , Artificial Intelligence , Brain/physiopathology , Diagnosis, Computer-Assisted/methods , Electroencephalography/methods , Pattern Recognition, Automated/methods , Aged , Female , Humans , Male , Reproducibility of Results , Sensitivity and Specificity
16.
Article in English | MEDLINE | ID: mdl-22255310

ABSTRACT

Due to the enormity of the solution space for sequential ordering problems, non-exhaustive heuristic techniques have been the focus of many research efforts, particularly in the field of operations research. In this paper, we outline an ecologically motivated problem in which environmental samples have been obtained along a gradient (e.g., pH), and we wish to recover the sample order. Not only do we model the problem for the benefit of an optimization approach, we also incorporate hybrid particle swarm techniques to address it. The described method is applied to a real dataset in which 22 biological samples were obtained along a pH gradient. We show that we are able to approach the optimal permutation of samples by evaluating only approximately 5000 solutions, a tiny fraction of the 22! possible orderings.
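One common way to let a particle swarm search over permutations is the random-key trick: each particle is a real-valued vector, and sorting its components yields a sample order. The sketch below uses that hybridization with the total gradient distance as the objective; it illustrates the general approach, not the paper's exact operators or parameters:

```python
import random

def pso_order(cost, n, swarm=20, iters=120, seed=1):
    # cost: function mapping a permutation (list of indices) to a scalar.
    rng = random.Random(seed)
    perm = lambda keys: sorted(range(n), key=lambda i: keys[i])
    pos = [[rng.random() for _ in range(n)] for _ in range(swarm)]
    vel = [[0.0] * n for _ in range(swarm)]
    pbest = [p[:] for p in pos]
    pcost = [cost(perm(p)) for p in pos]
    g = min(range(swarm), key=lambda i: pcost[i])
    gbest, gcost = pbest[g][:], pcost[g]
    for _ in range(iters):
        for i in range(swarm):
            for j in range(n):
                r1, r2 = rng.random(), rng.random()
                vel[i][j] = (0.7 * vel[i][j]                         # inertia
                             + 1.5 * r1 * (pbest[i][j] - pos[i][j])  # cognitive
                             + 1.5 * r2 * (gbest[j] - pos[i][j]))    # social
                pos[i][j] += vel[i][j]
            c = cost(perm(pos[i]))
            if c < pcost[i]:
                pbest[i], pcost[i] = pos[i][:], c
                if c < gcost:
                    gbest, gcost = pos[i][:], c
    return perm(gbest), gcost
```

With pH values as 1-D coordinates, the cost of an order is the total pH distance traveled; the minimum is max(pH) - min(pH), achieved by the sorted (or reverse-sorted) order.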


Subjects
Models, Theoretical , Algorithms
17.
Article in English | MEDLINE | ID: mdl-21097123

ABSTRACT

The alarmingly increasing prevalence of Alzheimer's disease (AD) due to the aging population in developing countries, combined with the lack of standardized and conclusive diagnostic procedures, makes early diagnosis of AD a major public health concern. While no current medical treatment exists to stop or reverse this disease, recent dementia-specific pharmacological advances can slow its progression, making early diagnosis all the more important. Several noninvasive biomarkers have been proposed, including P300-based EEG analysis, MRI volumetric analysis, and PET-based metabolic activity analysis, as alternatives to neuropsychological evaluation, the current gold standard of diagnosis. Each of these approaches has shown some promising outcomes; however, a comprehensive data fusion analysis has not yet been conducted to investigate whether these different modalities carry complementary information, and if so, whether they can be combined to provide a more accurate analysis. In this effort, we provide a first look at such an analysis by combining EEG, MRI, and PET data using an ensemble-of-classifiers-based decision fusion approach, to determine whether a strategic combination of these different modalities can improve the diagnostic accuracy over any of the individual data sources when used with an automated classifier. Results show an improvement of up to 10%-20% using this approach compared to the classification performance obtained when using each individual data source.
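Decision-level fusion can be as simple as the sum rule over per-modality posteriors; a toy sketch (class names and numbers are illustrative only, and the paper's ensemble weighting is richer than an unweighted sum):

```python
def sum_rule_fusion(modality_posteriors):
    # modality_posteriors: one dict per data source (e.g. EEG, MRI, PET)
    # mapping class label -> posterior probability from that modality's
    # classifier; the fused decision maximizes the summed posterior.
    fused = {}
    for posteriors in modality_posteriors:
        for label, p in posteriors.items():
            fused[label] = fused.get(label, 0.0) + p
    return max(fused, key=fused.get)
```

In the usage below, the MRI classifier's dissent is overruled because EEG and PET agree with higher combined confidence, which is exactly the complementary-information effect fusion is meant to exploit.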


Subjects
Alzheimer Disease/diagnosis , Electroencephalography/methods , Image Interpretation, Computer-Assisted/methods , Magnetic Resonance Imaging/methods , Positron-Emission Tomography/methods , Aged , Alzheimer Disease/diagnostic imaging , Decision Support Systems, Clinical , Electrodes , Female , Humans , Male
18.
Article in English | MEDLINE | ID: mdl-19965206

ABSTRACT

As average life expectancy increases, particularly in developing countries, the prevalence of neurodegenerative diseases has also increased. This trend is especially alarming for Alzheimer's disease (AD), as there is no cure to stop or reverse its effects. However, recent pharmacological advances can slow the progression of AD, but only if it is diagnosed at an early stage. We have previously introduced an ensemble-of-classifiers approach that combines event-related potentials obtained from different electrode locations as an effective approach for early diagnosis of AD. We further expand this approach and analyze its robustness and stability in two ways: comparing the diagnostic accuracy on hand-selected and cleaned data vs. standard automated preprocessing, and, more importantly, comparing the diagnostic accuracy on two cohorts whose data were collected under different settings: a research university lab and a community clinic.


Subjects
Alzheimer Disease/diagnosis , Evoked Potentials , Aged, 80 and over , Algorithms , Alzheimer Disease/physiopathology , Automation , Brain/pathology , Cohort Studies , Early Diagnosis , Electrodes , Humans , Reproducibility of Results , Signal Processing, Computer-Assisted
19.
IEEE Trans Neural Netw ; 20(1): 152-68, 2009 Jan.
Article in English | MEDLINE | ID: mdl-19109088

ABSTRACT

We have previously introduced an incremental learning algorithm, Learn(++), which learns novel information from consecutive data sets by generating an ensemble of classifiers with each data set, and combining them by weighted majority voting. However, Learn(++) suffers from an inherent "outvoting" problem when asked to learn a new class ω_new introduced by a subsequent data set, as earlier classifiers not trained on this class are guaranteed to misclassify ω_new instances. The collective votes of earlier classifiers, for an inevitably incorrect decision, then outweigh the votes of the new classifiers' correct decision on ω_new instances, until there are enough new classifiers to counteract the unfair outvoting. This forces Learn(++) to generate an unnecessarily large number of classifiers. This paper describes Learn(++).NC, specifically designed for efficient incremental learning of multiple new classes using significantly fewer classifiers. To do so, Learn(++).NC introduces dynamically weighted consult and vote (DW-CAV), a novel voting mechanism for combining classifiers: individual classifiers consult with each other to determine which ones are most qualified to classify a given instance, and decide how much weight, if any, each classifier's decision should carry. Experiments on real-world problems indicate that the new algorithm performs remarkably well with substantially fewer classifiers, not only compared to its predecessor Learn(++), but also compared to several other algorithms recently proposed for similar problems.
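The consult step can be caricatured in a few lines: classifiers first nominate labels, and any classifier never trained on one of the nominated classes abstains, so old ensemble members cannot outvote new ones on ω_new. This is a deliberate simplification of DW-CAV, not its exact weighting scheme:

```python
def consult_and_vote(classifiers, known_classes, weights, x):
    nominations = [clf(x) for clf in classifiers]
    candidates = set(nominations)
    # a classifier that has never seen some nominated class cannot
    # judge it, so its vote is discounted to zero for this instance
    adjusted = [w if candidates <= set(known) else 0.0
                for known, w in zip(known_classes, weights)]
    if not any(adjusted):          # everyone abstained: fall back
        adjusted = weights
    tallies = {}
    for label, w in zip(nominations, adjusted):
        tallies[label] = tallies.get(label, 0.0) + w
    return max(tallies, key=tallies.get)
```

In the usage below, three older classifiers unaware of class 2 would win a plain majority vote 3 to 2, but they abstain once class 2 is nominated, so the newer classifiers' correct decision prevails.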


Subjects
Algorithms , Artificial Intelligence , Computer Simulation , Databases, Factual , Image Processing, Computer-Assisted , Pattern Recognition, Automated , Reproducibility of Results , Time Factors , Volatile Organic Compounds/chemistry
20.
Curr Genomics ; 10(7): 493-510, 2009 Nov.
Article in English | MEDLINE | ID: mdl-20436876

ABSTRACT

Traditionally, studies in microbial genomics have focused on single genomes from cultured species, thereby limiting their focus to the small percentage of species that can be cultured outside their natural environment. Fortunately, recent advances in high-throughput sequencing and computational analyses have ushered in the new field of metagenomics, which aims to decode the genomes of microbes from natural communities without the need for cultivation. Although metagenomic studies have shed a great deal of insight into bacterial diversity and coding capacity, several computational challenges remain due to the massive size and complexity of metagenomic sequence data. This paper reviews current tools and techniques that address challenges in 1) genomic fragment annotation, 2) phylogenetic reconstruction, 3) functional classification of samples, and 4) interpreting complementary metaproteomic and metametabolomic data. Also surveyed are important applications of metagenomic studies, including microbial forensics and the roles of microbial communities in shaping human health and soil ecology.
