Results 1 - 20 of 30

1.
J Int Neuropsychol Soc; 26(7): 690-700, 2020 Aug.
Article in English | MEDLINE | ID: mdl-32200771

ABSTRACT

OBJECTIVE: To determine how well machine learning algorithms can classify mild cognitive impairment (MCI) subtypes and Alzheimer's disease (AD) using features obtained from the digital Clock Drawing Test (dCDT). METHODS: dCDT protocols were administered to 163 patients diagnosed with AD (n = 59), amnestic MCI (aMCI; n = 26), combined mixed/dysexecutive MCI (mixed/dys MCI; n = 43), and patients without MCI (non-MCI; n = 35) using standard clock drawing command and copy procedures, that is, draw the face of the clock, put in all of the numbers, and set the hands for "10 after 11." A digital pen and custom software recorded patients' drawings. Three hundred and fifty features were evaluated for maximum information/minimum redundancy. The best subset of features was used to train classification models to determine diagnostic accuracy. RESULTS: A neural network employing information-theoretic feature selection achieved the best two-group classification results, with 10-fold cross-validation accuracies at or above 83%: AD versus non-MCI = 91.42%; AD versus aMCI = 91.49%; AD versus mixed/dys MCI = 84.05%; aMCI versus mixed/dys MCI = 84.11%; aMCI versus non-MCI = 83.44%; and mixed/dys MCI versus non-MCI = 85.42%. A follow-up two-group analysis of non-MCI versus all MCI patients yielded comparable results (83.69%). Two-group classifications were achieved with 25-125 dCDT features, depending on the groups compared. Three- and four-group analyses yielded lower but still promising levels of classification accuracy. CONCLUSION: Early identification of emergent neurodegenerative illness is critical for better disease management. Applying machine learning to standard neuropsychological tests promises to be an effective first-line screening method for classification of non-MCI and MCI subtypes.
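
The pipeline this abstract describes, information-theoretic feature selection followed by a neural network evaluated with 10-fold cross-validation, can be approximated with standard tooling. A minimal sketch, assuming the 350 dCDT features are already extracted into a matrix; the data below are random stand-ins, and mutual-information ranking stands in for the paper's maximum-information/minimum-redundancy criterion:

    # Sketch: mutual-information feature ranking + MLP with 10-fold CV.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(163, 350))    # stand-in for the 350 dCDT features
    y = rng.integers(0, 2, size=163)   # stand-in for two-group labels

    clf = make_pipeline(
        StandardScaler(),
        SelectKBest(mutual_info_classif, k=25),  # paper used 25-125 features
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    )
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"mean accuracy: {scores.mean():.3f}")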


Subject(s)
Alzheimer Disease/diagnosis, Cognitive Dysfunction/diagnosis, Machine Learning, Neuropsychological Tests, Aged, Aged, 80 and over, Female, Humans, Male
2.
PeerJ; 11: e14779, 2023.
Article in English | MEDLINE | ID: mdl-36785708

ABSTRACT

A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, the degree to which all related sequences are grouped into the same cluster. Most algorithms are conservative in grouping sequences with other sequences: remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters, so the resulting clusters have high homogeneity but too little completeness. We propose Complet+, a computationally scalable post-processing method that increases the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related clusters of proteins with verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2's clusterupdate increases the V-measure by 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by substantial increases in the Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus.
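
For reference, the homogeneity/completeness trade-off and the reported V-measure, AMI, and ARI all correspond to standard scores in scikit-learn. A small sketch scoring hypothetical cluster assignments against a reference classification; the label vectors below are toy stand-ins for SCOPe labels:

    # Sketch: scoring predicted clusters against a reference classification.
    from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                                 completeness_score, homogeneity_score,
                                 v_measure_score)

    reference = [0, 0, 1, 1, 2, 2]  # hypothetical SCOPe superfamily labels
    oversplit = [0, 1, 2, 3, 4, 4]  # conservative clustering: homogeneous, incomplete
    merged    = [0, 0, 1, 1, 2, 2]  # after merging, e.g., by Complet+

    for name, pred in [("before", oversplit), ("after", merged)]:
        print(name,
              "homogeneity=%.2f" % homogeneity_score(reference, pred),
              "completeness=%.2f" % completeness_score(reference, pred),
              "V=%.2f" % v_measure_score(reference, pred),
              "AMI=%.2f" % adjusted_mutual_info_score(reference, pred),
              "ARI=%.2f" % adjusted_rand_score(reference, pred))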


Subject(s)
Algorithms, Proteins, Sequence Alignment, Amino Acid Sequence, Proteins/chemistry, Cluster Analysis
3.
J Biomed Biotechnol; 2011: 495849, 2011.
Article in English | MEDLINE | ID: mdl-21541181

ABSTRACT

High-throughput sequencing technologies enable metagenome profiling: simultaneous sequencing of multiple microbial species present within an environmental sample. Since metagenomic data include sequence fragments ("reads") from organisms that are absent from any database, new algorithms must be developed for the identification and annotation of novel sequence fragments. Homology-based techniques have been modified to detect novel species and genera, but composition-based methods have not been adapted. We develop a detection technique that can discriminate between "known" and "unknown" taxa, which can be used with composition-based methods as well as with a hybrid method. Unlike previous studies, we rigorously evaluate all algorithms for their ability to detect novel taxa. First, we show that integrating a detector with a composition-based method performs significantly better than homology-based methods for the detection of novel species and genera, with the best performance at finer taxonomic resolutions. Most importantly, we evaluate all the algorithms by introducing an "unknown" class and show that the modified version of PhymmBL has similar or better overall classification performance than the other modified algorithms, especially at the species level and for ultrashort reads. Finally, we evaluate the performance of several algorithms on a real acid mine drainage dataset.
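
A common way to give a composition-based classifier an "unknown" class, in the spirit of the detector described above, is to reject reads whose highest class posterior falls below a threshold. A simplified sketch under that assumption; the classifier, threshold, and toy k-mer counts are illustrative, not the paper's actual detector:

    # Sketch: labeling low-confidence reads as "unknown" taxa.
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def predict_with_unknown(clf, X, threshold=0.8):
        """Return class predictions, or -1 ("unknown") when no known
        taxon reaches the posterior threshold."""
        proba = clf.predict_proba(X)
        pred = np.argmax(proba, axis=1)
        pred[proba.max(axis=1) < threshold] = -1
        return pred

    rng = np.random.default_rng(1)
    X_train = rng.poisson(5.0, size=(100, 64))  # toy k-mer count vectors
    y_train = rng.integers(0, 2, size=100)      # two "known" taxa
    clf = MultinomialNB().fit(X_train, y_train)
    print(predict_with_unknown(clf, rng.poisson(5.0, size=(5, 64))))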


Subject(s)
DNA Barcoding, Taxonomic/methods, Sequence Analysis, DNA/methods, Algorithms, Bacteria/genetics, Databases, Nucleic Acid, Genome/genetics, Metagenomics, Mining, Open Reading Frames/genetics, ROC Curve, Species Specificity, Waste Disposal, Fluid
4.
IEEE Trans Neural Netw Learn Syst; 29(6): 2530-2544, 2018 Jun.
Article in English | MEDLINE | ID: mdl-28504951

ABSTRACT

Increasingly, many machine learning applications are associated with very large data sets whose sizes were almost unimaginable just a short time ago. As a result, many current algorithms cannot handle, or do not scale to, today's extremely large volumes of data. Fortunately, not all features that make up a typical data set carry information that is relevant or useful for prediction, and identifying and removing such irrelevant features can significantly reduce the total data size. The unfortunate dilemma, however, is that some current data sets are so large that common feature selection algorithms, whose very goal is to reduce the dimensionality, cannot handle them, creating a vicious cycle. We describe a sequential learning framework for feature subset selection (SLSS) that can scale with both the number of features and the number of observations. The proposed framework uses multi-armed bandit algorithms to sequentially search subsets of variables and assign a level of importance to each feature. The novel contribution of SLSS is its ability to naturally scale to large data sets, evaluate such data in a very small amount of time, and operate independently of the optimization of any classifier, reducing unnecessary complexity. We demonstrate the capabilities of SLSS on synthetic and real-world data sets.
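
In the bandit view, each feature is an arm, and the reward of playing a subset is the validation score of a model restricted to those features. A heavily simplified UCB-style sketch of that idea; the reward sharing and scoring model are placeholders, not the actual SLSS framework:

    # Sketch: UCB-style sequential search over feature subsets.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 30))
    y = (X[:, 0] + X[:, 3] > 0).astype(int)    # features 0 and 3 are informative

    n_feat, k = X.shape[1], 5
    counts, rewards = np.ones(n_feat), np.zeros(n_feat)
    for t in range(1, 101):
        ucb = rewards / counts + np.sqrt(2 * np.log(t) / counts)
        arms = np.argsort(ucb)[-k:]            # play the k most promising arms
        r = cross_val_score(LogisticRegression(), X[:, arms], y, cv=3).mean()
        counts[arms] += 1                      # each played arm shares the reward
        rewards[arms] += r
    print("highest-importance features:", np.sort(np.argsort(rewards / counts)[-k:]))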

5.
IEEE Trans Neural Netw Learn Syst; 29(9): 4504-4509, 2018 Sep.
Article in English | MEDLINE | ID: mdl-29028210

ABSTRACT

Feature subset selection can be used to sieve through large volumes of data and discover the most informative subset of variables for a particular learning problem. Yet, due to memory and other resource constraints (e.g., CPU availability), many state-of-the-art feature subset selection methods cannot be extended to high-dimensional data or to data sets with an extremely large volume of instances. In this brief, we extend online feature selection (OFS), a recently introduced approach that uses partial feature information, by developing an ensemble of online linear models to make predictions. OFS employs a linear model as the base classifier, which allows the $l_0$-norm of the parameter vector to be constrained to perform feature selection, leading to sparse linear models. We demonstrate that the proposed ensemble model typically yields a smaller error rate than any single linear model, while maintaining the same level of sparsity and complexity at the time of testing.
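
The $l_0$ constraint is usually enforced by truncating the weight vector to its largest-magnitude entries after each online update; an ensemble then averages several such learners trained on the stream. A minimal sketch of a single truncated, mistake-driven linear learner, a simplification of (not exactly) the OFS update:

    # Sketch: online linear learner with an l0 (keep-top-B) truncation step.
    import numpy as np

    def truncate(w, B):
        """Zero out all but the B largest-magnitude weights (l0 constraint)."""
        if np.count_nonzero(w) > B:
            keep = np.argsort(np.abs(w))[-B:]
            mask = np.zeros_like(w)
            mask[keep] = 1.0
            w = w * mask
        return w

    rng = np.random.default_rng(0)
    w, B, eta = np.zeros(50), 5, 0.1
    for _ in range(2000):                    # simulated instance stream
        x = rng.normal(size=50)
        y = np.sign(x[0] + 0.5 * x[1])       # only two informative features
        if y * np.dot(w, x) <= 0:            # mistake-driven perceptron update
            w = truncate(w + eta * y * x, B)
    print("retained features:", np.nonzero(w)[0])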

6.
Comput Biol Med; 37(4): 542-58, 2007 Apr.
Article in English | MEDLINE | ID: mdl-16989799

ABSTRACT

Early diagnosis of Alzheimer's disease (AD) is becoming an increasingly important healthcare concern. Prior approaches analyzing event-related potentials (ERPs) had varying degrees of success, primarily due to smaller study cohorts and the inherent difficulty of the problem. A new effort using multiresolution analysis of ERPs is described. Distinctions of this study include analyzing a larger cohort, comparing different wavelets and different frequency bands, using ensemble-based decisions, and, most importantly, aiming at the earliest possible diagnosis of the disease. Surprising yet promising outcomes indicate that ERPs in response to the novel sounds of an oddball paradigm may be more reliable as a biomarker than the more commonly used responses to target sounds.
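
Multiresolution analysis of an ERP reduces, in code, to a discrete wavelet decomposition whose band energies (or coefficients) become classifier features. A sketch with PyWavelets on a synthetic waveform; the 'db4' wavelet and five levels are arbitrary choices, not necessarily those compared in the study:

    # Sketch: wavelet multiresolution decomposition of an ERP-like signal.
    import numpy as np
    import pywt

    fs = 256                                    # assumed sampling rate (Hz)
    t = np.arange(0, 1, 1 / fs)
    erp = np.exp(-((t - 0.3) ** 2) / 0.002)     # synthetic P300-like deflection

    coeffs = pywt.wavedec(erp, "db4", level=5)  # [approx, detail5, ..., detail1]
    for i, c in enumerate(coeffs):
        print(f"band {i}: {len(c)} coefficients, energy {np.sum(c ** 2):.4f}")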


Subject(s)
Alzheimer Disease/classification, Cerebral Cortex/physiology, Electroencephalography/classification, Event-Related Potentials, P300/physiology, Fourier Analysis, Signal Processing, Computer-Assisted, Aged, Algorithms, Alzheimer Disease/diagnosis, Attention/physiology, Brain Mapping, Cohort Studies, Computer Graphics, Humans, Male, Mental Status Schedule, Middle Aged, Pitch Perception/physiology, Reference Values, Sensitivity and Specificity
7.
IEEE Trans Syst Man Cybern B Cybern; 37(2): 437-50, 2007 Apr.
Article in English | MEDLINE | ID: mdl-17416170

ABSTRACT

This paper introduces Learn++, an ensemble-of-classifiers algorithm originally developed for incremental learning and now adapted for information/data fusion applications. Recognizing the conceptual similarity between incremental learning and data fusion, Learn++ follows an alternative approach to data fusion: sequentially generating an ensemble of classifiers that specifically seek the most discriminating information from each data set. Across several applications, Learn++-based data fusion consistently outperformed a similarly configured ensemble classifier trained on any of the individual data sources. Furthermore, even if the classifiers trained on individual data sources are fine-tuned for the given problem, Learn++ can still achieve a statistically significant improvement by combining them, provided the additional data sets carry complementary information. The algorithm can also identify, albeit indirectly, those data sets that do not carry such additional information. Finally, it was shown that the algorithm can consecutively learn both the supplementary novel information coming from additional data of the same source and the complementary information coming from new data sources, without requiring access to any of the previously seen data.
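
The fusion step itself, classifiers trained on different sources casting weighted votes, can be sketched compactly. The log-odds weighting below is an AdaBoost-style assumption, simplified from Learn++'s actual weight assignment:

    # Sketch: decision fusion by weighted majority voting across data sources.
    import numpy as np

    def fuse(predictions, weights, classes):
        """predictions: (n_classifiers, n_samples) array of class labels.
        Each classifier votes with its weight; the heaviest total wins."""
        votes = np.zeros((len(classes), predictions.shape[1]))
        for pred, w in zip(predictions, weights):
            for ci, c in enumerate(classes):
                votes[ci] += w * (pred == c)
        return np.asarray(classes)[np.argmax(votes, axis=0)]

    preds = np.array([[0, 1, 1, 0],    # classifier trained on source A
                      [0, 1, 0, 0],    # classifier trained on source B
                      [1, 1, 1, 0]])   # classifier trained on source C
    err = np.array([0.20, 0.30, 0.40])           # per-source training errors
    weights = np.log((1 - err) / err)            # lower error -> louder vote
    print(fuse(preds, weights, classes=[0, 1]))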


Subject(s)
Algorithms, Artificial Intelligence, Cluster Analysis, Database Management Systems, Databases, Factual, Information Storage and Retrieval/methods, Pattern Recognition, Automated/methods, Software
8.
IEEE Trans Neural Netw Learn Syst; 26(4): 880-6, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25794384

ABSTRACT

Selecting the most informative features, those that lead to a small loss on future data, is arguably one of the most important steps in classification, data analysis, and model selection. Several feature selection (FS) algorithms are available; however, due to the noise present in any data set, FS algorithms are typically accompanied by an appropriate cross-validation scheme. In this brief, we propose a statistical hypothesis test derived from the Neyman-Pearson lemma for determining whether a feature is statistically relevant. The proposed approach can be applied as a wrapper to any FS algorithm, regardless of the FS criteria used by that algorithm, to determine whether a feature belongs in the relevant set. Perhaps more importantly, this procedure efficiently determines the number of relevant features given an initial starting point. We provide freely available software implementations of the proposed methodology.

9.
IEEE Trans Nanobioscience; 14(6): 608-16, 2015 Sep.
Article in English | MEDLINE | ID: mdl-26316190

ABSTRACT

Recent advances in machine learning, specifically in deep learning with neural networks, have made a profound impact on fields such as natural language processing, image classification, and language modeling; however, the feasibility and potential benefits of these approaches for metagenomic data analysis have been largely under-explored. Deep learning exploits many layers of nonlinear feature representations, typically learned in an unsupervised fashion, and recent results have shown outstanding generalization performance on previously unseen data. Furthermore, some deep learning methods can also represent the structure in a data set. Consequently, deep learning and neural networks may prove to be an appropriate approach for metagenomic data. To determine whether such approaches are indeed appropriate for metagenomics, we experiment with two deep learning methods: i) a deep belief network, and ii) a recursive neural network, the latter of which provides a tree representing the structure of the data. We compare these approaches to the standard multi-layer perceptron, which is well established in the machine learning community as a powerful prediction algorithm, though it is largely missing from the metagenomics literature. We find that traditional neural networks can be quite powerful classifiers on metagenomic data compared to baseline methods such as random forests. On the other hand, while the deep learning approaches did not improve classification accuracy, they do provide the ability to learn hierarchical representations of a data set that standard classification methods do not allow. Our goal in this effort is not to determine the best algorithm in terms of accuracy, as that depends on the specific application, but rather to highlight the benefits and drawbacks of each of the approaches we discuss and to provide insight on how they can be improved for predictive metagenomic analysis.
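
The baseline comparison described here, a multi-layer perceptron against a random forest on tabular abundance features, is easy to reproduce in outline (synthetic count data below; the deep belief and recursive networks need more machinery than a short sketch allows):

    # Sketch: MLP vs. random forest baselines on tabular metagenomic features.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.poisson(3.0, size=(300, 100)).astype(float)  # e.g., k-mer/OTU counts
    y = (X[:, :10].sum(axis=1) > 30).astype(int)         # synthetic phenotype

    models = [("MLP", MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                                    random_state=0)),
              ("random forest", RandomForestClassifier(random_state=0))]
    for name, clf in models:
        print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))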


Subject(s)
Metagenomics/methods, Neural Networks, Computer, Algorithms, Microbiota
10.
Brain Res; 1624: 390-397, 2015 Oct 22.
Article in English | MEDLINE | ID: mdl-26256251

ABSTRACT

INTRODUCTION: Reductions of cerebrospinal fluid (CSF) amyloid-beta (Aβ42) and elevated phosphorylated-tau (p-Tau) reflect in vivo Alzheimer's disease (AD) pathology and show utility in predicting conversion from mild cognitive impairment (MCI) to dementia. We investigated the P50 event-related potential component as a noninvasive biomarker of AD pathology in non-demented elderly. METHODS: 36 MCI patients were stratified into amyloid-positive (MCI-AD, n=17) and amyloid-negative (MCI-Other, n=19) groups using CSF levels of Aβ42. All amyloid-positive patients were also p-Tau positive. P50s were elicited with an auditory oddball paradigm. RESULTS: MCI-AD patients yielded larger P50s than MCI-Other patients. The best amyloid-status predictor model showed 94.7% sensitivity, 94.1% specificity, and 94.4% total accuracy. DISCUSSION: P50 predicted amyloid status in MCI patients, thereby showing a relationship with AD pathology versus MCI from another etiology. The P50 may have clinical utility for inexpensive pre-screening and assessment of Alzheimer's pathology.
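
For reference, the sensitivity/specificity/accuracy triplet reported above comes straight from a 2x2 confusion matrix; a sketch on hypothetical predictions:

    # Sketch: sensitivity, specificity, and accuracy from a confusion matrix.
    from sklearn.metrics import confusion_matrix

    y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = amyloid positive (hypothetical)
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("sensitivity:", tp / (tp + fn))   # true-positive rate
    print("specificity:", tn / (tn + fp))   # true-negative rate
    print("accuracy:   ", (tp + tn) / (tp + tn + fp + fn))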


Subject(s)
Alzheimer Disease/physiopathology, Evoked Potentials/physiology, Prodromal Symptoms, Aged, Alzheimer Disease/cerebrospinal fluid, Amyloid beta-Peptides/cerebrospinal fluid, Biomarkers, Cognitive Dysfunction/cerebrospinal fluid, Cognitive Dysfunction/physiopathology, Electroencephalography, Female, Humans, Male, Middle Aged, Peptide Fragments/cerebrospinal fluid, Reaction Time/physiology, Statistics, Nonparametric
11.
Alzheimers Dement (Amst); 1(3): 339-348, 2015 Sep 01.
Article in English | MEDLINE | ID: mdl-26693175

ABSTRACT

INTRODUCTION: The dynamic range of cerebrospinal fluid (CSF) amyloid β (Aβ1-42) measurement does not parallel cognitive changes in Alzheimer's disease (AD) and cognitively normal (CN) subjects across different studies. Therefore, identifying novel proteins to characterize symptomatic AD samples is important. METHODS: Proteins were profiled using a multianalyte platform by Rules Based Medicine (MAP-RBM). Due to underlying heterogeneity and unbalanced sample sizes, we combined subjects (344 AD and 325 CN) from three cohorts: the Alzheimer's Disease Neuroimaging Initiative, the Penn Center for Neurodegenerative Disease Research of the University of Pennsylvania, and the Knight Alzheimer's Disease Research Center at Washington University in St. Louis. We focused on samples whose cognitive and amyloid status was consistent. We performed linear regression (accounting for age, gender, number of APOE e4 alleles, and a cohort variable) to identify amyloid-related proteins in symptomatic AD subjects in this largest-ever CSF-based MAP-RBM study. ANOVA and Tukey's test were used to evaluate whether these proteins were also related to cognitive impairment changes as measured by the mini-mental state examination (MMSE). RESULTS: Seven proteins were significantly associated with Aβ1-42 levels in the combined cohort (false discovery rate adjusted P < .05), of which lipoprotein a (Lp(a)), prolactin (PRL), resistin, and vascular endothelial growth factor (VEGF) had consistent directions of association across every individual cohort. VEGF was strongly associated with MMSE scores, followed by pancreatic polypeptide and immunoglobulin A (IgA), suggesting they may be related to staging of AD. DISCUSSION: Lp(a), PRL, IgA, and tissue factor/thromboplastin have never been reported for AD diagnosis in previous individual CSF-based MAP-RBM studies. Although some of our reported analytes are related to AD pathophysiology, the roles of the others in symptomatic AD warrant further exploration.
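
The regression-plus-FDR workflow generalizes beyond this study: regress each analyte on Aβ1-42 with the covariates, collect the p-values, and apply Benjamini-Hochberg. A sketch with simulated data; all column names and effect sizes are hypothetical:

    # Sketch: per-protein covariate-adjusted regression + FDR correction.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({"abeta": rng.normal(size=n),
                       "age": rng.normal(70, 8, n),
                       "apoe4": rng.integers(0, 3, n),
                       "cohort": rng.integers(0, 3, n)})
    pvals = []
    for j in range(50):                         # 50 hypothetical proteins
        df["prot"] = (0.4 if j < 5 else 0.0) * df["abeta"] + rng.normal(size=n)
        fit = smf.ols("prot ~ abeta + age + apoe4 + C(cohort)", data=df).fit()
        pvals.append(fit.pvalues["abeta"])
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    print("proteins passing FDR:", np.flatnonzero(reject))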

12.
Article in English | MEDLINE | ID: mdl-15344404

ABSTRACT

An incremental learning algorithm is introduced for learning new information from additional data that may later become available, after a classifier has already been trained using a previously available database. The proposed algorithm is capable of incrementally learning new information without forgetting previously acquired knowledge and without requiring access to the original database, even when new data include examples of previously unseen classes. Scenarios requiring such a learning algorithm are encountered often in nondestructive evaluation (NDE), in which large volumes of data are collected in batches over a period of time, and new defect types may appear in subsequent databases. The algorithm, named Learn++, takes advantage of the synergistic generalization performance of an ensemble of classifiers, in which each classifier is trained with a strategically chosen subset of the training databases that subsequently become available. The ensemble of classifiers is then combined through a weighted majority voting procedure. Learn++ is independent of the specific classifier(s) comprising the ensemble, and hence may be used with any supervised learning algorithm. The voting procedure also allows Learn++ to estimate the confidence in its own decision. We present the algorithm and its promising results on two separate ultrasonic weld inspection applications.

13.
IEEE Trans Neural Netw Learn Syst; 25(1): 12-26, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24806641

ABSTRACT

An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions that change over time. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires a substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this paper, we introduce compacted object sample extraction (COMPOSE), a computational geometry-based framework to learn from nonstationary streaming data, where labels are unavailable (or presented very sporadically) after initialization. We introduce the algorithm in detail and discuss its results and performance on several synthetic and real-world data sets, which demonstrate the ability of the algorithm to learn under several different scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we compare the performance of COMPOSE against the optimal Bayes classifier, as well as against the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch.

14.
Ann Biomed Eng; 41(2): 223-37, 2013 Feb.
Article in English | MEDLINE | ID: mdl-22956158

ABSTRACT

The objective of this research was to assess the utility of a simple near infrared spectroscopy (NIRS) technology for objective assessment of the hemodynamic response to acute pain. For this exploration, we used functional near infrared spectroscopy (fNIRS) to measure the hemodynamic response on the forehead during three trials of a cold pressor test (CPT) in 20 adults. To measure hemodynamic changes at the superficial tissues as well as the intracranial tissues, two configurations of 'far' and 'near' source-detector separations were used. We identified two features that were fairly consistent across all subjects. The first feature was the change of total hemoglobin (THb) concentration in a given condition divided by the duration of that condition [Formula: see text]. Statistical analyses revealed that during the first CPT trial, [Formula: see text] changed significantly from its baseline value in all channels. Adaptation to repeated CPTs was also observed, both in the [Formula: see text] parameter and in the reported post-stimulus pain rating scores. The second feature was the difference between the maximum and the minimum of the evoked changes in THb concentration (ΔTHb). A significant correlation was observed between the post-stimulus pain rating score and ΔTHb at all channels. An asymmetrical activity was observed only at the 'far' channels. These results suggest that fNIRS can potentially be used as a reliable technique for the assessment of the hemodynamic response to tonic pain induced by the CPT.
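
Both features reduce to simple operations on a THb time series: an average rate of change over a condition, and the evoked peak-to-trough range (ΔTHb). A sketch, assuming a trace sampled at a known rate with known condition boundaries; the synthetic signal is only for illustration:

    # Sketch: the two THb features described above.
    import numpy as np

    def rate_of_change(thb, fs):
        """Change in THb over a condition divided by its duration (feature 1)."""
        return (thb[-1] - thb[0]) / (len(thb) / fs)

    def evoked_range(thb):
        """Max minus min of the evoked THb change (feature 2, delta-THb)."""
        return thb.max() - thb.min()

    fs = 10.0                                 # assumed sampling rate (Hz)
    t = np.arange(0.0, 60.0, 1.0 / fs)
    thb = 0.02 * t + 0.5 * np.sin(t / 5.0)    # synthetic THb trace
    print(rate_of_change(thb, fs), evoked_range(thb))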


Subject(s)
Acute Pain/diagnosis, Adult, Cold Temperature, Female, Forehead, Hand, Hemodynamics, Hemoglobins/analysis, Humans, Hydrogen Peroxide/analysis, Male, Spectroscopy, Near-Infrared
15.
IEEE Trans Neural Netw; 22(10): 1517-31, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21824845

ABSTRACT

We introduce an ensemble-of-classifiers approach for incremental learning of concept drift, characterized by nonstationary environments (NSEs) in which the underlying data distributions change over time. The proposed algorithm, named Learn++.NSE, learns from consecutive batches of data without making any assumptions on the nature or rate of drift; it can learn from environments that experience constant or variable rates of drift, addition or deletion of concept classes, as well as cyclical drift. The algorithm learns incrementally, like the other members of the Learn++ family of algorithms, that is, without requiring access to previously seen data. Learn++.NSE trains one new classifier for each batch of data it receives and combines these classifiers using dynamically weighted majority voting. The novelty of the approach is in determining the voting weights based on each classifier's time-adjusted accuracy on current and past environments. This allows the algorithm to recognize, and react accordingly to, changes in the underlying data distributions, as well as to a possible reoccurrence of an earlier distribution. We evaluate the algorithm on several synthetic datasets designed to simulate a variety of nonstationary environments, as well as on a real-world weather prediction dataset. Comparisons with several other approaches are also included. Results indicate that Learn++.NSE can track the changing environments very closely, regardless of the type of concept drift. To allow future use, comparison, and benchmarking by interested researchers, we also release the data used in this paper.
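
The core of the method is the voting weight: each classifier's error on current and past batches is averaged through a sigmoid that emphasizes recent batches, and the weight is the log-odds of that averaged error. A condensed sketch of this weighting; the sigmoid parameters a and b follow the usual convention, and the error bookkeeping is simplified:

    # Sketch: time-adjusted voting weights in the style of Learn++.NSE.
    import numpy as np

    def nse_weights(errors, a=0.5, b=10.0):
        """errors[k, t] = error of classifier k on batch t (NaN before the
        classifier existed). Recent batches count more via a sigmoid."""
        n_clf, T = errors.shape
        w = np.zeros(n_clf)
        for k in range(n_clf):
            age = np.arange(T - k)                 # 0 = oldest batch seen by k
            sig = 1.0 / (1.0 + np.exp(-a * (age - b)))
            sig /= sig.sum()                       # normalized, newest largest
            e = float(np.clip(np.sum(sig * errors[k, k:]), 1e-3, 0.499))
            w[k] = np.log((1.0 - e) / e)           # log-odds voting weight
        return w

    errors = np.array([[0.10, 0.30, 0.45],         # old classifier, drifting away
                       [np.nan, 0.15, 0.20],
                       [np.nan, np.nan, 0.12]])
    print(nse_weights(errors))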


Subject(s)
Algorithms, Artificial Intelligence, Models, Neurological, Neural Networks, Computer, Electronic Data Processing/methods, Environment, Humans, Learning/physiology, Nonlinear Dynamics
16.
Article in English | MEDLINE | ID: mdl-22254735

ABSTRACT

As life expectancy increases, particularly in the developed world, so does the prevalence of Alzheimer's Disease (AD). AD is a neurodegenerative disorder characterized by amyloid plaques and neurofibrillary tangles in the brain that lead to neuronal death and dementia. Early diagnosis of AD is still a major unresolved health concern: several biomarkers are being investigated, among which the electroencephalogram (EEG) provides the only option for electrophysiological information. In this study, EEG signals obtained from 161 subjects, 79 with AD and 82 age-matched controls (CN), are analyzed using several nonlinear signal complexity measures: Higuchi fractal dimension (HFD), spectral entropy (SE), spectral centroid (SC), spectral roll-off (SR), and zero-crossing rate (ZCR). HFD is a quantitative measure of time series complexity derived from fractal theory. Among the spectral measures, SE measures the level of disorder in the spectrum, SC is a measure of spectral shape, and SR is the frequency below which a specified percentage of the spectral magnitude distribution is contained. Lastly, ZCR is simply the rate at which the signal changes sign. A t-test was first applied to determine the features that provide significant differences between the groups. Those features were then used to train a neural network. The classification accuracies ranged from 60% to 66%, suggesting that the features contain some discriminatory information, though not enough to be clinically useful alone. Combining these features and training a support vector machine (SVM) resulted in a diagnostic accuracy of 78%, indicating that the features carry complementary information.
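
Two of these measures are compact enough to define inline. A sketch of the Higuchi fractal dimension and the zero-crossing rate for a 1-D signal, using standard formulations (not necessarily the exact implementations behind the reported numbers):

    # Sketch: Higuchi fractal dimension and zero-crossing rate.
    import numpy as np

    def higuchi_fd(x, kmax=8):
        """Slope of log(curve length L(k)) versus log(1/k)."""
        N, L = len(x), []
        for k in range(1, kmax + 1):
            Lm = []
            for m in range(k):
                idx = np.arange(m, N, k)
                length = np.abs(np.diff(x[idx])).sum() * (N - 1) / ((len(idx) - 1) * k)
                Lm.append(length / k)
            L.append(np.mean(Lm))
        slope, _ = np.polyfit(np.log(1.0 / np.arange(1, kmax + 1)), np.log(L), 1)
        return slope

    def zero_crossing_rate(x):
        """Fraction of consecutive samples whose signs differ."""
        return np.mean(np.sign(x[:-1]) != np.sign(x[1:]))

    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 2.0, 512)
    eeg = np.sin(2 * np.pi * 10 * t) + 0.3 * rng.normal(size=512)
    print(higuchi_fd(eeg), zero_crossing_rate(eeg))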


Subject(s)
Algorithms, Alzheimer Disease/diagnosis, Alzheimer Disease/physiopathology, Artificial Intelligence, Brain/physiopathology, Diagnosis, Computer-Assisted/methods, Electroencephalography/methods, Pattern Recognition, Automated/methods, Aged, Female, Humans, Male, Reproducibility of Results, Sensitivity and Specificity
17.
Article in English | MEDLINE | ID: mdl-22255310

ABSTRACT

Due to the enormity of the solution space for sequential ordering problems, non-exhaustive heuristic techniques have been the focus of many research efforts, particularly in the field of operations research. In this paper, we outline an ecologically motivated problem in which environmental samples have been obtained along a gradient (e.g., pH), and we wish to recover the sample order. We not only model the problem for the benefit of an optimization approach, but also incorporate hybrid particle swarm techniques to address it. The described method is applied to a real dataset in which 22 biological samples were obtained along a pH gradient. We show that we are able to approach the optimal permutation of samples by evaluating only approximately 5000 candidate solutions, a vanishingly small fraction of the 22! possible orderings.
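
The objective and evaluation budget are easy to illustrate with a much simpler optimizer: score an ordering by the summed dissimilarity of consecutive samples and improve it by random swaps. This is a stand-in for the hybrid particle swarm, under roughly the same ~5000-evaluation budget:

    # Sketch: recovering a gradient ordering by swap-based local search.
    import numpy as np

    rng = np.random.default_rng(0)
    true_ph = np.sort(rng.uniform(2.0, 9.0, 22))        # hidden pH gradient
    D = np.abs(true_ph[:, None] - true_ph[None, :])     # pairwise dissimilarity

    def cost(order):
        """Total dissimilarity along the path; smooth orderings score low."""
        return D[order[:-1], order[1:]].sum()

    order = rng.permutation(22)
    for _ in range(5000):                               # ~5000 evaluations
        i, j = rng.integers(0, 22, size=2)
        cand = order.copy()
        cand[i], cand[j] = cand[j], cand[i]
        if cost(cand) < cost(order):
            order = cand
    print("found:", cost(order), "optimal:", cost(np.arange(22)))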


Subject(s)
Models, Theoretical, Algorithms
18.
Comput Biol Chem; 35(3): 199-209, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21704267

ABSTRACT

Analysis of DNA sequences isolated directly from the environment, known as metagenomics, produces a large quantity of genome fragments that need to be classified into specific taxa. Most composition-based classification methods use all features rather than a subset of features that may maximize classifier accuracy. We show that feature selection methods can boost the performance of taxonomic classifiers. This work proposes three different filter-based feature selection methods that stem from information theory: (1) a technique that combines Kullback-Leibler divergence, mutual information, and distance information; (2) a text mining technique, TF-IDF; and (3) minimum-redundancy maximum-relevance (mRMR). The feature selection methods are compared by how well they improve support vector machine classification of genomic reads. Overall, the 6-mer mRMR method performs well, especially at the phylum level. If the total number of features is very large, feature selection becomes difficult, because a small subset of features that captures a majority of the data variance is less likely to exist. Therefore, we conclude that there is a trade-off between feature set size and feature selection method when optimizing classification performance. For larger feature set sizes, TF-IDF works better at finer resolutions, while mRMR performs best of any method for N=6 across all taxonomic levels.
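
The TF-IDF variant is a direct reuse of text tooling: each read becomes a 'document' whose words are its overlapping 6-mers. A sketch on toy sequences; real pipelines would stream far larger read sets:

    # Sketch: TF-IDF features over the 6-mers of DNA reads.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def to_kmer_doc(seq, k=6):
        """Represent a read as a space-separated 'document' of k-mers."""
        return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

    reads = ["ACGTACGTACGTAAATTTGGG",
             "TTTAAAGGGCCCACGTACGTA",
             "ACGTACGTACGTACGTACGTA"]
    docs = [to_kmer_doc(r) for r in reads]
    vec = TfidfVectorizer(lowercase=False)
    X = vec.fit_transform(docs)          # reads x distinct-6-mer TF-IDF matrix
    print(X.shape, vec.get_feature_names_out()[:5])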


Subject(s)
Computational Biology, Metagenome/genetics, Algorithms, Bacteria/genetics, Computer Simulation, Data Mining, Databases, Genetic, Sequence Analysis, DNA
19.
Article in English | MEDLINE | ID: mdl-21097123

ABSTRACT

The alarmingly increasing prevalence of Alzheimer's disease (AD) due to the aging population in developing countries, combined with the lack of standardized and conclusive diagnostic procedures, makes early diagnosis of AD a major public health concern. While no current medical treatment exists to stop or reverse this disease, recent dementia-specific pharmacological advances can slow its progression, making early diagnosis all the more important. Several noninvasive biomarkers have been proposed, including P300-based EEG analysis, MRI volumetric analysis, and PET-based metabolic activity analysis, as alternatives to neuropsychological evaluation, the current gold standard of diagnosis. Each of these approaches has shown some promising outcomes; however, a comprehensive data fusion analysis has not yet been conducted to investigate whether these different modalities carry complementary information, and if so, whether they can be combined to provide a more accurate analysis. In this effort, we provide a first look at such an analysis by combining EEG, MRI, and PET data using an ensemble-of-classifiers decision fusion approach, to determine whether a strategic combination of these different modalities can improve the diagnostic accuracy over any of the individual data sources when used with an automated classifier. Results show an improvement of up to 10%-20% using this approach compared to the classification performance obtained with each individual data source.
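
In outline, the fusion trains one classifier per modality and combines their class-probability outputs. A compact soft-voting sketch on hypothetical EEG/MRI/PET feature blocks; the actual study used an ensemble-of-classifiers decision fusion scheme, so simple probability averaging here is an assumption made for brevity:

    # Sketch: multimodal fusion by averaging per-modality class probabilities.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=120)
    blocks = {name: rng.normal(size=(120, dim)) + shift * y[:, None]
              for name, dim, shift in
              [("EEG", 20, 0.3), ("MRI", 10, 0.4), ("PET", 5, 0.5)]}

    tr, te = train_test_split(np.arange(120), test_size=0.25, random_state=0)
    proba = np.zeros((len(te), 2))
    for name, X in blocks.items():
        clf = LogisticRegression().fit(X[tr], y[tr])
        proba += clf.predict_proba(X[te])        # soft vote across modalities
    print("fused accuracy:", (proba.argmax(axis=1) == y[te]).mean())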


Subject(s)
Alzheimer Disease/diagnosis, Electroencephalography/methods, Image Interpretation, Computer-Assisted/methods, Magnetic Resonance Imaging/methods, Positron-Emission Tomography/methods, Aged, Alzheimer Disease/diagnostic imaging, Decision Support Systems, Clinical, Electrodes, Female, Humans, Male