Search | VHL Regional Portal

1.

CellBoost: A pipeline for machine assisted annotation in Neuroanatomy.

Qian, Kui; Friedman, Beth; Takatoh, Jun; Wang, Fan; Kleinfeld, David; Freund, Yoav.

bioRxiv ; 2024 Jan 21.

Article in English | MEDLINE | ID: mdl-38293051

ABSTRACT

One of the important yet labor intensive tasks in neuroanatomy is the identification of select populations of cells. Current high-throughput techniques enable marking cells with histochemical fluorescent molecules as well as through the genetic expression of fluorescent proteins. Modern scanning microscopes allow high resolution multi-channel imaging of the mechanically or optically sectioned brain with thousands of marked cells per square millimeter. Manual identification of all marked cells is prohibitively time consuming. At the same time, simple segmentation algorithms suffer from high error rates and sensitivity to variation in fluorescent intensity and spatial distribution. We present a methodology that combines human judgement and machine learning that serves to significantly reduce the labor of the anatomist while improving the consistency of the annotation. As a demonstration, we analyzed murine brains with marked premotor neurons in the brainstem. We compared the error rate of our method to the disagreement rate among human anatomists. This comparison shows that our method can reduce the time to annotate by as much as ten-fold without significantly increasing the rate of errors. We show that our method achieves significant reduction in labor while achieving an accuracy that is similar to the level of agreement between different anatomists.

2.

An active texture-based digital atlas enables automated mapping of structures and markers across brains.

Chen, Yuncong; McElvain, Lauren E; Tolpygo, Alexander S; Ferrante, Daniel; Friedman, Beth; Mitra, Partha P; Karten, Harvey J; Freund, Yoav; Kleinfeld, David.

Nat Methods ; 16(4): 341-350, 2019 04.

Article in English | MEDLINE | ID: mdl-30858600

ABSTRACT

Brain atlases enable the mapping of labeled cells and projections from different brains onto a standard coordinate system. We address two issues in the construction and use of atlases. First, expert neuroanatomists ascertain the fine-scale pattern of brain tissue, the 'texture' formed by cellular organization, to define cytoarchitectural borders. We automate the processes of localizing landmark structures and alignment of brains to a reference atlas using machine learning and training data derived from expert annotations. Second, we construct an atlas that is active; that is, augmented with each use. We show that the alignment of new brains to a reference atlas can continuously refine the coordinate system and associated variance. We apply this approach to the adult murine brainstem and achieve a precise alignment of projections in cytoarchitecturally ill-defined regions across brains from different animals.

Subject(s)

Brain Mapping/methods , Brain/diagnostic imaging , Computational Biology/methods , Image Processing, Computer-Assisted/methods , Algorithms , Animals , Brain/anatomy & histology , Brain Stem/diagnostic imaging , Machine Learning , Magnetic Resonance Imaging , Male , Mice , Mice, Inbred C57BL , Motor Neurons , Neuroanatomy , Neurons , Probability , Spinal Cord/diagnostic imaging

3.

An online learning approach to occlusion boundary detection.

Jacobson, Natan; Freund, Yoav; Nguyen, Truong Q.

IEEE Trans Image Process ; 21(1): 252-61, 2012 Jan.

Article in English | MEDLINE | ID: mdl-21788193

ABSTRACT

We propose a novel online learning-based framework for occlusion boundary detection in video sequences. This approach does not require any prior training and instead "learns" occlusion boundaries by updating a set of weights for the online learning Hedge algorithm at each frame instance. Whereas previous training-based methods perform well only on data similar to the trained examples, the proposed method is well suited for any video sequence. We demonstrate the performance of the proposed detector both for the CMU data set, which includes hand-labeled occlusion boundaries, and for a novel video sequence. In addition to occlusion boundary detection, the proposed algorithm is capable of classifying occlusion boundaries by angle and by whether the occluding object is covering or uncovering the background.

Subject(s)

Algorithms , Image Interpretation, Computer-Assisted/methods , Pattern Recognition, Automated/methods , Photography/methods , Video Recording/methods , Artificial Intelligence , Image Enhancement/methods , Online Systems , Reproducibility of Results , Sensitivity and Specificity
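
The abstract above names the Hedge algorithm but does not spell out its update rule. The following is a minimal sketch, assuming the standard multiplicative-weights form of Hedge with per-expert losses in [0, 1]. The function names, the discount parameter `beta`, and the toy losses are illustrative assumptions; how the paper derives losses from video frames is not stated in the abstract.

```python
def hedge_update(weights, losses, beta=0.5):
    """One Hedge round: scale each weight by beta ** loss, then renormalize.

    `beta` (hypothetical default 0.5) must lie strictly between 0 and 1;
    losses are assumed to lie in [0, 1].
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    scaled = [w * beta ** loss for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def run_hedge(loss_rounds, n_experts, beta=0.5):
    """Start from uniform weights and apply one update per round (e.g. per frame)."""
    weights = [1.0 / n_experts] * n_experts
    for losses in loss_rounds:
        weights = hedge_update(weights, losses, beta)
    return weights


if __name__ == "__main__":
    # Two toy experts: the first never errs, the second always does.
    print(run_hedge([[0.0, 1.0], [0.0, 1.0]], n_experts=2))
```

Because the weights are updated at every round, no separate training phase is needed, which matches the "no prior training" claim in the abstract.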

4.

Automatic identification of fluorescently labeled brain cells for rapid functional imaging.

Valmianski, Ilya; Shih, Andy Y; Driscoll, Jonathan D; Matthews, David W; Freund, Yoav; Kleinfeld, David.

J Neurophysiol ; 104(3): 1803-11, 2010 Sep.

Article in English | MEDLINE | ID: mdl-20610792

ABSTRACT

The on-line identification of labeled cells and vessels is a rate-limiting step in scanning microscopy. We use supervised learning to formulate an algorithm that rapidly and automatically tags fluorescently labeled somata in full-field images of cortex and constructs an optimized scan path through these cells. A single classifier works across multiple subjects, regions of the cortex of similar depth, and different magnification and contrast levels without the need to retrain the algorithm. Retraining only has to be performed when the morphological properties of the cells change significantly. In conjunction with two-photon laser scanning microscopy and bulk-labeling of cells in layers 2/3 of rat parietal cortex with a calcium indicator, we can automatically identify ~50 cells within 1 min and sample them at ~100 Hz with a signal-to-noise ratio of ~10.

Subject(s)

Fluorescent Dyes/analysis , Microscopy, Confocal/methods , Somatosensory Cortex/chemistry , Somatosensory Cortex/cytology , Animals , Rats , Rats, Sprague-Dawley , Somatosensory Cortex/physiology , Time Factors

5.

Minimizing off-target signals in RNA fluorescent in situ hybridization.

Arvey, Aaron; Hermann, Anita; Hsia, Cheryl C; Ie, Eugene; Freund, Yoav; McGinnis, William.

Nucleic Acids Res ; 38(10): e115, 2010 Jun.

Article in English | MEDLINE | ID: mdl-20164092

ABSTRACT

Fluorescent in situ hybridization (FISH) techniques are becoming extremely sensitive, to the point where individual RNA or DNA molecules can be detected with small probes. At this level of sensitivity, the elimination of 'off-target' hybridization is of crucial importance, but typical probes used for RNA and DNA FISH contain sequences repeated elsewhere in the genome. We find that very short (e.g. 20 nt) perfect repeated sequences within much longer probes (e.g. 350-1500 nt) can produce significant off-target signals. The extent of noise is surprising given the long length of the probes and the short length of non-specific regions. When we removed the small regions of repeated sequence from either short or long probes, we find that the signal-to-noise ratio is increased by orders of magnitude, putting us in a regime where fluorescent signals can be considered to be a quantitative measure of target transcript numbers. As the majority of genes in complex organisms contain repeated k-mers, we provide genome-wide annotations of k-mer-uniqueness at http://cbio.mskcc.org/~aarvey/repeatmap.

Subject(s)

In Situ Hybridization, Fluorescence/methods , RNA Probes/chemistry , RNA, Messenger/analysis , Animals , Drosophila Proteins/genetics , Drosophila melanogaster/embryology , Drosophila melanogaster/genetics , Embryo, Nonmammalian/chemistry , Nuclear Proteins/genetics , RNA, Messenger/chemistry , Repetitive Sequences, Nucleic Acid , Transcription Factors/genetics
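
The idea of flagging short perfect repeats inside a probe can be illustrated with a small k-mer count. This is only a sketch under simplifying assumptions that are not taken from the paper: exact matches on the forward strand only, a probe that is itself drawn from the genome (so a count above 1 means the k-mer also occurs elsewhere), and hypothetical function names.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every k-length substring of `sequence` (case-insensitive)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeated_kmer_positions(probe, genome, k=20):
    """Return probe start positions whose k-mer occurs more than once in `genome`.

    Assumes the probe is derived from the genome, so a genome-wide count
    above 1 indicates a copy outside the intended target.
    """
    counts = kmer_counts(genome, k)
    probe = probe.upper()
    return [i for i in range(len(probe) - k + 1) if counts[probe[i:i + k]] > 1]


if __name__ == "__main__":
    toy_genome = "ACGTACGTTTGCA"
    toy_probe = "ACGTT"
    # With k=4, only the k-mer starting at probe position 0 is repeated.
    print(repeated_kmer_positions(toy_probe, toy_genome, k=4))
```

Positions returned by `repeated_kmer_positions` mark candidate regions to trim from a probe; a realistic pipeline would also need reverse-complement matches and an index suited to genome-scale input.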

6.

Visualization of individual Scr mRNAs during Drosophila embryogenesis yields evidence for transcriptional bursting.

Paré, Adam; Lemons, Derek; Kosman, Dave; Beaver, William; Freund, Yoav; McGinnis, William.

Curr Biol ; 19(23): 2037-42, 2009 Dec 15.

Article in English | MEDLINE | ID: mdl-19931455

ABSTRACT

The detection and counting of transcripts within single cells via fluorescent in situ hybridization (FISH) has allowed researchers to ask quantitative questions about gene expression at the level of individual cells. This method is often preferable to quantitative RT-PCR, because it does not necessitate destruction of the cells being probed and maintains spatial information that may be of interest. Until now, studies using FISH at single-molecule resolution have only been rigorously carried out in isolated cells (e.g., yeast cells or mammalian cell culture). Here, we describe the detection and counting of transcripts within single cells of fixed, whole-mount Drosophila embryos via a combination of FISH, immunohistochemistry, and image segmentation. Our method takes advantage of inexpensive, long RNA probes detected with antibodies, and we present novel evidence to show that we can robustly detect single mRNA molecules. We use this method to characterize transcription at the endogenous locus of the Hox gene Sex combs reduced (Scr), by comparing a stably expressing group of cells to a group that only transiently expresses the gene. Our data provide evidence for transcriptional bursting, as well for divergent "accumulation" and "maintenance" phases of gene activity at the Scr locus.

Subject(s)

Drosophila Proteins/metabolism , Drosophila/embryology , RNA, Messenger/metabolism , Transcription Factors/metabolism , Transcription, Genetic/physiology , Animals , Drosophila Proteins/genetics , Gene Expression Regulation, Developmental/physiology , Immunohistochemistry , In Situ Hybridization, Fluorescence , RNA, Messenger/genetics , Transcription Factors/genetics

7.

ResBoost: characterizing and predicting catalytic residues in enzymes.

Alterovitz, Ron; Arvey, Aaron; Sankararaman, Sriram; Dallett, Carolina; Freund, Yoav; Sjölander, Kimmen.

BMC Bioinformatics ; 10: 197, 2009 Jun 27.

Article in English | MEDLINE | ID: mdl-19558703

ABSTRACT

BACKGROUND: Identifying the catalytic residues in enzymes can aid in understanding the molecular basis of an enzyme's function and has significant implications for designing new drugs, identifying genetic disorders, and engineering proteins with novel functions. Since experimentally determining catalytic sites is expensive, better computational methods for identifying catalytic residues are needed. RESULTS: We propose ResBoost, a new computational method to learn characteristics of catalytic residues. The method effectively selects and combines rules of thumb into a simple, easily interpretable logical expression that can be used for prediction. We formally define the rules of thumb that are often used to narrow the list of candidate residues, including residue evolutionary conservation, 3D clustering, solvent accessibility, and hydrophilicity. ResBoost builds on two methods from machine learning, the AdaBoost algorithm and Alternating Decision Trees, and provides precise control over the inherent trade-off between sensitivity and specificity. We evaluated ResBoost using cross-validation on a dataset of 100 enzymes from the hand-curated Catalytic Site Atlas (CSA). CONCLUSION: ResBoost achieved 85% sensitivity for a 9.8% false positive rate and 73% sensitivity for a 5.7% false positive rate. ResBoost reduces the number of false positives by up to 56% compared to the use of evolutionary conservation scoring alone. We also illustrate the ability of ResBoost to identify recently validated catalytic residues not listed in the CSA.

Subject(s)

Computational Biology/methods , Enzymes/chemistry , Software , Binding Sites , Catalysis , Databases, Protein
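
The trade-off between sensitivity and false positive rate reported in the abstract above can be made concrete with a threshold on a classifier score. The sketch below is not ResBoost itself; the function name, the scores, and the labels are illustrative assumptions. It only shows how the two rates are computed at a given threshold.

```python
def sensitivity_and_fpr(scores, labels, threshold):
    """Return (sensitivity, false positive rate) when predicting positive for score >= threshold.

    `labels` holds 1 for a true catalytic residue and 0 otherwise (toy convention).
    """
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label and predicted_positive:
            tp += 1
        elif label:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return sensitivity, fpr


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
    toy_labels = [1, 0, 1, 0, 0]
    for t in (0.5, 0.35):
        print(t, sensitivity_and_fpr(toy_scores, toy_labels, t))
```

Lowering the threshold can only keep or raise both rates, which is why a method reports operating points as pairs such as the sensitivity and false positive rate figures quoted in the abstract.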

8.

Image-based crystal detection: a machine-learning approach.

Liu, Roy; Freund, Yoav; Spraggon, Glen.

Acta Crystallogr D Biol Crystallogr ; 64(Pt 12): 1187-95, 2008 Dec.

Article in English | MEDLINE | ID: mdl-19018095

ABSTRACT

The ability of computers to learn from and annotate large databases of crystallization-trial images provides not only the ability to reduce the workload of crystallization studies, but also an opportunity to annotate crystallization trials as part of a framework for improving screening methods. Here, a system is presented that scores sets of images based on the likelihood of containing crystalline material as perceived by a machine-learning algorithm. The system can be incorporated into existing crystallization-analysis pipelines, whereby specialists examine images as they normally would with the exception that the images appear in rank order according to a simple real-valued score. Promising results are shown for 319 112 images associated with 150 structures solved by the Joint Center for Structural Genomics pipeline during the 2006-2007 year. Overall, the algorithm achieves a mean receiver operating characteristic score of 0.919 and a 78% reduction in human effort per set when considering an absolute score cutoff for screening images, while incurring a loss of five out of 150 structures.

Subject(s)

Artificial Intelligence , Crystallography, X-Ray/methods , Image Processing, Computer-Assisted/methods , Proteins/chemistry , Algorithms , Crystallization , Crystallography, X-Ray/instrumentation , Crystallography, X-Ray/trends , Database Management Systems/economics , Database Management Systems/instrumentation , Image Interpretation, Computer-Assisted , Image Processing, Computer-Assisted/instrumentation , ROC Curve
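
The rank-order presentation and absolute score cutoff described in the abstract above can be sketched in a few lines. The function names, image names, scores, and cutoff value are hypothetical; the scoring model itself is outside the scope of this sketch.

```python
def rank_images(scored_images):
    """Sort (image_name, score) pairs from highest to lowest score."""
    return sorted(scored_images, key=lambda pair: pair[1], reverse=True)


def screen(scored_images, cutoff):
    """Keep images scoring at or above `cutoff`, in rank order.

    Also returns the fraction of images a specialist no longer has to inspect.
    """
    ranked = rank_images(scored_images)
    kept = [name for name, score in ranked if score >= cutoff]
    reduction = 1.0 - len(kept) / len(ranked) if ranked else 0.0
    return kept, reduction


if __name__ == "__main__":
    toy_set = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.1)]
    print(screen(toy_set, cutoff=0.5))
```

A stricter cutoff increases the effort saved but risks discarding images that contain crystals, which is the loss the abstract quantifies as structures missed.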

9.

Lamellipodial actin mechanically links myosin activity with adhesion-site formation.

Giannone, Grégory; Dubin-Thaler, Benjamin J; Rossier, Olivier; Cai, Yunfei; Chaga, Oleg; Jiang, Guoying; Beaver, William; Döbereiner, Hans-Günther; Freund, Yoav; Borisy, Gary; Sheetz, Michael P.

Cell ; 128(3): 561-75, 2007 Feb 09.

Article in English | MEDLINE | ID: mdl-17289574

ABSTRACT

Cell motility proceeds by cycles of edge protrusion, adhesion, and retraction. Whether these functions are coordinated by biochemical or biomechanical processes is unknown. We find that myosin II pulls the rear of the lamellipodial actin network, causing upward bending, edge retraction, and initiation of new adhesion sites. The network then separates from the edge and condenses over the myosin. Protrusion resumes as lamellipodial actin regenerates from the front and extends rearward until it reaches newly assembled myosin, initiating the next cycle. Upward bending, observed by evanescence and electron microscopy, results in ruffle formation when adhesion strength is low. Correlative fluorescence and electron microscopy shows that the regenerating lamellipodium forms a cohesive, separable layer of actin above the lamellum. Thus, actin polymerization periodically builds a mechanical link, the lamellipodium, connecting myosin motors with the initiation of adhesion sites, suggesting that the major functions driving motility are coordinated by a biomechanical process.

Subject(s)

Actins/metabolism , Cell Adhesion , Myosins/metabolism , Pseudopodia/chemistry , Animals , Cell Movement , Fibroblasts/cytology , Mice , Microscopy, Electron , Microscopy, Fluorescence , Myosin Type II/genetics , Myosin Type II/metabolism , Periodicity , Polymers/metabolism , Pseudopodia/ultrastructure

10.

A classification-based framework for predicting and analyzing gene regulatory response.

Kundaje, Anshul; Middendorf, Manuel; Shah, Mihir; Wiggins, Chris H; Freund, Yoav; Leslie, Christina.

BMC Bioinformatics ; 7 Suppl 1: S5, 2006 Mar 20.

Article in English | MEDLINE | ID: mdl-16723008

ABSTRACT

BACKGROUND: We have recently introduced a predictive framework for studying gene transcriptional regulation in simpler organisms using a novel supervised learning algorithm called GeneClass. GeneClass is motivated by the hypothesis that in model organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular microarray experiment based on the presence of binding site subsequences ("motifs") in the gene's regulatory region and the expression levels of regulators such as transcription factors in the experiment ("parents"). GeneClass formulates the learning task as a classification problem--predicting +1 and -1 labels corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. Using the Adaboost algorithm, GeneClass learns a prediction function in the form of an alternating decision tree, a margin-based generalization of a decision tree. METHODS: In the current work, we introduce a new, robust version of the GeneClass algorithm that increases stability and computational efficiency, yielding a more scalable and reliable predictive model. The improved stability of the prediction tree enables us to introduce a detailed post-processing framework for biological interpretation, including individual and group target gene analysis to reveal condition-specific regulation programs and to suggest signaling pathways. Robust GeneClass uses a novel stabilized variant of boosting that allows a set of correlated features, rather than single features, to be included at nodes of the tree; in this way, biologically important features that are correlated with the single best feature are retained rather than decorrelated and lost in the next round of boosting. 
Other computational developments include fast matrix computation of the loss function for all features, allowing scalability to large datasets, and the use of abstaining weak rules, which results in a more shallow and interpretable tree. We also show how to incorporate genome-wide protein-DNA binding data from ChIP chip experiments into the GeneClass algorithm, and we use an improved noise model for gene expression data. RESULTS: Using the improved scalability of Robust GeneClass, we present larger scale experiments on a yeast environmental stress dataset, training and testing on all genes and using a comprehensive set of potential regulators. We demonstrate the improved stability of the features in the learned prediction tree, and we show the utility of the post-processing framework by analyzing two groups of genes in yeast--the protein chaperones and a set of putative targets of the Nrg1 and Nrg2 transcription factors--and suggesting novel hypotheses about their transcriptional and post-transcriptional regulation. Detailed results and Robust GeneClass source code is available for download from http://www.cs.columbia.edu/compbio/robust-geneclass.

Subject(s)

Computational Biology/methods , Gene Expression Profiling/methods , Gene Expression Regulation , Algorithms , Amino Acid Motifs , Binding Sites , Data Interpretation, Statistical , Databases, Protein , Fungal Proteins/chemistry , Heat-Shock Proteins/metabolism , Molecular Chaperones/chemistry , Oligonucleotide Array Sequence Analysis/methods , Saccharomyces cerevisiae/metabolism

11.

Identifying metabolic enzymes with multiple types of association evidence.

Kharchenko, Peter; Chen, Lifeng; Freund, Yoav; Vitkup, Dennis; Church, George M.

BMC Bioinformatics ; 7: 177, 2006 Mar 29.

Article in English | MEDLINE | ID: mdl-16571130

ABSTRACT

BACKGROUND: Existing large-scale metabolic models of sequenced organisms commonly include enzymatic functions which can not be attributed to any gene in that organism. Existing computational strategies for identifying such missing genes rely primarily on sequence homology to known enzyme-encoding genes. RESULTS: We present a novel method for identifying genes encoding for a specific metabolic function based on a local structure of metabolic network and multiple types of functional association evidence, including clustering of genes on the chromosome, similarity of phylogenetic profiles, gene expression, protein fusion events and others. Using E. coli and S. cerevisiae metabolic networks, we illustrate predictive ability of each individual type of association evidence and show that significantly better predictions can be obtained based on the combination of all data. In this way our method is able to predict 60% of enzyme-encoding genes of E. coli metabolism within the top 10 (out of 3551) candidates for their enzymatic function, and as a top candidate within 43% of the cases. CONCLUSION: We illustrate that a combination of genome context and other functional association evidence is effective in predicting genes encoding metabolic enzymes. Our approach does not rely on direct sequence homology to known enzyme-encoding genes, and can be used in conjunction with traditional homology-based metabolic reconstruction methods. The method can also be used to target orphan metabolic activities.

Subject(s)

Enzymes/genetics , Escherichia coli Proteins/genetics , Genome, Bacterial , Saccharomyces cerevisiae Proteins/genetics , Sequence Analysis, Protein/methods , Energy Metabolism/genetics , Phylogeny

12.

Profile-based string kernels for remote homology detection and motif extraction.

Kuang, Rui; Ie, Eugene; Wang, Ke; Wang, Kai; Siddiqi, Mahira; Freund, Yoav; Leslie, Christina.

J Bioinform Comput Biol ; 3(3): 527-50, 2005 Jun.

Article in English | MEDLINE | ID: mdl-16108083

ABSTRACT

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We further examine how to incorporate predicted secondary structure information into the profile kernel to obtain a small but significant performance improvement. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs"--short regions of the original profile that contribute almost all the weight of the SVM classification score--and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results also outperform cluster kernels while providing much better scalability to large datasets.

Subject(s)

Algorithms , Artificial Intelligence , Pattern Recognition, Automated/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Motifs , Amino Acid Sequence , Molecular Sequence Data , Proteins/analysis , Proteins/classification , Sequence Homology, Amino Acid
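
To give a feel for what a string kernel over k-mers computes, here is a deliberately simplified sketch. It implements the plain exact-match spectrum kernel, not the profile-based kernel of the abstract above: it ignores PSI-BLAST profiles and the position-dependent mutation neighborhoods entirely. The function name and toy sequences are assumptions.

```python
from collections import Counter


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences (exact matches only)."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    # Only k-mers present in both sequences contribute to the sum.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


if __name__ == "__main__":
    # "ABC" occurs twice in the first string and once in the second.
    print(spectrum_kernel("ABCABC", "ABCD", k=3))
```

The profile kernel generalizes this by letting a k-mer also count toward nearby k-mers that the profile considers likely substitutions at that position; the resulting kernel values can be passed to an SVM in the same way.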

13.

Predicting genetic regulatory response using classification.

Middendorf, Manuel; Kundaje, Anshul; Wiggins, Chris; Freund, Yoav; Leslie, Christina.

Bioinformatics ; 20 Suppl 1: i232-40, 2004 Aug 04.

Article in English | MEDLINE | ID: mdl-15262804

ABSTRACT

MOTIVATION: Studying gene regulatory mechanisms in simple model organisms through analysis of high-throughput genomic data has emerged as a central problem in computational biology. Most approaches in the literature have focused either on finding a few strong regulatory patterns or on learning descriptive models from training data. However, these approaches are not yet adequate for making accurate predictions about which genes will be up- or down-regulated in new or held-out experiments. By introducing a predictive methodology for this problem, we can use powerful tools from machine learning and assess the statistical significance of our predictions. RESULTS: We present a novel classification-based method for learning to predict gene regulatory response. Our approach is motivated by the hypothesis that in simple organisms such as Saccharomyces cerevisiae, we can learn a decision rule for predicting whether a gene is up- or down-regulated in a particular experiment based on (1) the presence of binding site subsequences ('motifs') in the gene's regulatory region and (2) the expression levels of regulators such as transcription factors in the experiment ('parents'). Thus, our learning task integrates two qualitatively different data sources: genome-wide cDNA microarray data across multiple perturbation and mutant experiments along with motif profile data from regulatory sequences. We convert the regression task of predicting real-valued gene expression measurements to a classification task of predicting +1 and -1 labels, corresponding to up- and down-regulation beyond the levels of biological and measurement noise in microarray measurements. The learning algorithm employed is boosting with a margin-based generalization of decision trees, alternating decision trees. This large-margin classifier is sufficiently flexible to allow complex logical functions, yet sufficiently simple to give insight into the combinatorial mechanisms of gene regulation. 
We observe encouraging prediction accuracy on experiments based on the Gasch S.cerevisiae dataset, and we show that we can accurately predict up- and down-regulation on held-out experiments. We also show how to extract significant regulators, motifs and motif-regulator pairs from the learned models for various stress responses. Our method thus provides predictive hypotheses, suggests biological experiments, and provides interpretable insight into the structure of genetic regulatory networks. AVAILABILITY: The MLJava package is available upon request to the authors. Supplementary: Additional results are available from http://www.cs.columbia.edu/compbio/geneclass

Subject(s)

Chromosome Mapping/methods , Gene Expression Regulation/physiology , Models, Genetic , Proteome/metabolism , Regulatory Elements, Transcriptional/genetics , Signal Transduction/genetics , Transcription Factors/genetics , Binding Sites , Computer Simulation , Protein Binding , Saccharomyces cerevisiae Proteins/physiology , Sequence Analysis, DNA/methods , Transcriptional Activation/physiology
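
The conversion from real-valued expression measurements to +1 and -1 labels described in the abstract above can be sketched as a simple thresholding step. The function name, the use of log expression ratios, the threshold value, and the choice to mark within-noise measurements as 0 are all assumptions made for illustration; the abstract does not state how the noise level is estimated.

```python
def discretize_expression(log_ratios, noise_threshold=1.0):
    """Map each log expression ratio to +1 (up), -1 (down), or 0 (within noise).

    `noise_threshold` is a hypothetical stand-in for the level of biological
    and measurement noise; values inside the band carry no class label.
    """
    labels = []
    for value in log_ratios:
        if value > noise_threshold:
            labels.append(1)
        elif value < -noise_threshold:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


if __name__ == "__main__":
    print(discretize_expression([2.0, -1.5, 0.3]))
```

Only the examples labeled +1 or -1 would then be passed to a classifier such as boosting with alternating decision trees.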

14.

Profile-based string kernels for remote homology detection and motif extraction.

Kuang, Rui; Ie, Eugene; Wang, Ke; Wang, Kai; Siddiqi, Mahira; Freund, Yoav; Leslie, Christina.

Proc IEEE Comput Syst Bioinform Conf ; : 152-60, 2004.

Article in English | MEDLINE | ID: mdl-16448009

ABSTRACT

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" -- short regions of the original profile that contribute almost all the weight of the SVM classification score -- and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to cluster kernels while providing much better scalability to large datasets.

Subject(s)

Algorithms , Gene Expression Profiling/methods , Gene Expression/genetics , Pattern Recognition, Automated/methods , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Motifs , Artificial Intelligence , Cluster Analysis , Sequence Homology, Amino Acid
