Results 1 - 20 of 60
1.
Bioinformatics ; 38(Suppl 1): i84-i91, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758812

ABSTRACT

MOTIVATION: Molecular carcinogenicity is a preventable cause of cancer, but systematically identifying carcinogenic compounds, which involves performing experiments on animal models, is expensive, time-consuming and low throughput. As a result, carcinogenicity information is limited and building data-driven models with good prediction accuracy remains a major challenge. RESULTS: In this work, we propose CONCERTO, a deep learning model that uses a graph transformer in conjunction with a molecular fingerprint representation for carcinogenicity prediction from molecular structure. Special efforts have been made to overcome the data size constraint, such as multi-round pre-training on related but lower quality mutagenicity data, and transfer learning from a large self-supervised model. Extensive experiments demonstrate that our model performs well and can generalize to external validation sets. CONCERTO could be useful for guiding future carcinogenicity experiments and provide insight into the molecular basis of carcinogenicity. AVAILABILITY AND IMPLEMENTATION: The code and data underlying this article are available on GitHub at https://github.com/bowang-lab/CONCERTO.


Subject(s)
Carcinogens , Neural Networks, Computer , Animals , Carcinogens/toxicity , Forecasting , Mutagens
2.
Bioinformatics ; 34(13): i429-i437, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949959

ABSTRACT

Motivation: Alternative splice site selection is inherently competitive, and the probability that a given splice site is used also depends on the strength of neighboring sites. Here, we present a new model named the competitive splice site model (COSSMO), which explicitly accounts for these competitive effects and predicts the percent selected index (PSI) distribution over any number of putative splice sites. We model an alternative splicing event as the choice of a 3' acceptor site conditional on a fixed upstream 5' donor site or the choice of a 5' donor site conditional on a fixed 3' acceptor site. We build four different architectures that use convolutional layers, communication layers, long short-term memory and residual networks, respectively, to learn relevant motifs from sequence alone. We also construct a new dataset from genome annotations and RNA-Seq read data that we use to train our model. Results: COSSMO is able to predict the most frequently used splice site with an accuracy of 70% on unseen test data, and achieve an R² of 0.6 in modeling the PSI distribution. We visualize the motifs that COSSMO learns from sequence and show that COSSMO recognizes the consensus splice site sequences and many known splicing factors with high specificity. Availability and implementation: Model predictions, our training dataset, and code are available from http://cossmo.genes.toronto.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Alternative Splicing , Deep Learning , RNA Splice Sites , Sequence Analysis, RNA/methods , Computational Biology/methods , Humans , Models, Genetic , Probability , Software
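
Note on entry 2: the abstract describes splice sites competing with one another, with the model predicting a PSI distribution over any number of putative sites. One simple way to picture that competition is a softmax over per-site scores, so that raising one site's score lowers the share of every other site. The sketch below is only an illustration under that assumption; the function name `psi_distribution` and the example scores are hypothetical, and the abstract does not state which normalization COSSMO actually uses.

```python
import math


def psi_distribution(site_scores):
    """Map per-site scores to a distribution over competing sites (softmax).

    Assumption: a softmax stands in for the competitive normalization;
    the abstract does not specify the actual output layer.
    """
    if not site_scores:
        raise ValueError("need at least one putative splice site")
    top = max(site_scores)  # subtract the max for numerical stability
    weights = [math.exp(score - top) for score in site_scores]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical scores for three putative acceptor sites.
psi = psi_distribution([2.0, 0.5, -1.0])

# Adding a stronger fourth site reduces the share of the first one.
psi_with_competitor = psi_distribution([2.0, 0.5, -1.0, 3.0])
```

Because the output always sums to one, the same function works for any number of candidate sites, which matches the "any number of putative splice sites" framing in the abstract.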
3.
Bioinformatics ; 34(17): 2889-2898, 2018 09 01.
Article in English | MEDLINE | ID: mdl-29648582

ABSTRACT

Motivation: Processing of transcripts at the 3'-end involves cleavage at a polyadenylation site followed by the addition of a poly(A)-tail. By selecting which site is cleaved, the process of alternative polyadenylation enables genes to produce transcript isoforms with different 3'-ends. To facilitate the identification and treatment of disease-causing mutations that affect polyadenylation and to understand the sequence determinants underlying this regulatory process, a computational model that can accurately predict polyadenylation patterns from genomic features is desirable. Results: Previous works have focused on identifying candidate polyadenylation sites and classifying tissue-specific sites. By training on how multiple sites in genes are competitively selected for polyadenylation from 3'-end sequencing data, we developed a deep learning model that can predict the tissue-specific strength of a polyadenylation site in the 3' untranslated region of the human genome given only its genomic sequence. We demonstrate the model's broad utility on multiple tasks, without any application-specific training. The model can be used to predict which polyadenylation site is more likely to be selected in genes with multiple sites. It can be used to scan the 3' untranslated region to find candidate polyadenylation sites. It can be used to classify the pathogenicity of variants near annotated polyadenylation sites in ClinVar. It can also be used to anticipate the effect of antisense oligonucleotide experiments to redirect polyadenylation. We provide analysis on how different features affect the model's predictive performance and a method to identify sensitive regions of the genome at single-base resolution that can affect polyadenylation regulation. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Polyadenylation , 3' Untranslated Regions , Gene Expression Regulation , Genome, Human , Genomics , Humans , Poly A
4.
Nature ; 498(7453): 241-5, 2013 Jun 13.
Article in English | MEDLINE | ID: mdl-23739326

ABSTRACT

Previous investigations of the core gene regulatory circuitry that controls the pluripotency of embryonic stem (ES) cells have largely focused on the roles of transcription, chromatin and non-coding RNA regulators. Alternative splicing represents a widely acting mode of gene regulation, yet its role in regulating ES-cell pluripotency and differentiation is poorly understood. Here we identify the muscleblind-like RNA binding proteins, MBNL1 and MBNL2, as conserved and direct negative regulators of a large program of cassette exon alternative splicing events that are differentially regulated between ES cells and other cell types. Knockdown of MBNL proteins in differentiated cells causes switching to an ES-cell-like alternative splicing pattern for approximately half of these events, whereas overexpression of MBNL proteins in ES cells promotes differentiated-cell-like alternative splicing patterns. Among the MBNL-regulated events is an ES-cell-specific alternative splicing switch in the forkhead family transcription factor FOXP1 that controls pluripotency. Consistent with a central and negative regulatory role for MBNL proteins in pluripotency, their knockdown significantly enhances the expression of key pluripotency genes and the formation of induced pluripotent stem cells during somatic cell reprogramming.


Subject(s)
Alternative Splicing , Cellular Reprogramming , DNA-Binding Proteins/metabolism , Embryonic Stem Cells/cytology , Embryonic Stem Cells/metabolism , RNA-Binding Proteins/metabolism , Alternative Splicing/genetics , Amino Acid Motifs , Animals , Cell Differentiation/genetics , Cell Line , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/deficiency , DNA-Binding Proteins/genetics , Fibroblasts/cytology , Fibroblasts/metabolism , Forkhead Transcription Factors/metabolism , Gene Knockdown Techniques , HEK293 Cells , HeLa Cells , Humans , Induced Pluripotent Stem Cells/cytology , Induced Pluripotent Stem Cells/metabolism , Kinetics , Mice , RNA-Binding Proteins/chemistry , RNA-Binding Proteins/genetics , Repressor Proteins/metabolism
5.
Nature ; 499(7457): 172-7, 2013 Jul 11.
Article in English | MEDLINE | ID: mdl-23846655

ABSTRACT

RNA-binding proteins are key regulators of gene expression, yet only a small fraction have been functionally characterized. Here we report a systematic analysis of the RNA motifs recognized by RNA-binding proteins, encompassing 205 distinct genes from 24 diverse eukaryotes. The sequence specificities of RNA-binding proteins display deep evolutionary conservation, and the recognition preferences for a large fraction of metazoan RNA-binding proteins can thus be inferred from their RNA-binding domain sequence. The motifs that we identify in vitro correlate well with in vivo RNA-binding data. Moreover, we can associate them with distinct functional roles in diverse types of post-transcriptional regulation, enabling new insights into the functions of RNA-binding proteins both in normal physiology and in human disease. These data provide an unprecedented overview of RNA-binding proteins and their targets, and constitute an invaluable resource for determining post-transcriptional regulatory mechanisms in eukaryotes.


Subject(s)
Gene Expression Regulation/genetics , Nucleotide Motifs/genetics , RNA-Binding Proteins/metabolism , Autistic Disorder/genetics , Base Sequence , Binding Sites/genetics , Conserved Sequence/genetics , Eukaryotic Cells/metabolism , Humans , Molecular Sequence Data , Protein Structure, Tertiary/genetics , RNA Splicing Factors , RNA Stability/genetics , RNA-Binding Proteins/chemistry , RNA-Binding Proteins/genetics
6.
Crit Rev Biochem Mol Biol ; 51(2): 102-9, 2016.
Article in English | MEDLINE | ID: mdl-26806341

ABSTRACT

High Content Screening (HCS) technologies that combine automated fluorescence microscopy with high throughput biotechnology have become powerful systems for studying cell biology and drug screening. These systems can produce more than 100 000 images per day, making their success dependent on automated image analysis. In this review, we describe the steps involved in quantifying microscopy images and different approaches for each step. Typically, individual cells are segmented from the background using a segmentation algorithm. Each cell is then quantified by extracting numerical features, such as area and intensity measurements. As these feature representations are typically high dimensional (>500), modern machine learning algorithms are used to classify, cluster and visualize cells in HCS experiments. Machine learning algorithms that learn feature representations, in addition to the classification or clustering task, have recently advanced the state of the art on several benchmarking tasks in the computer vision community. These techniques have also recently been applied to HCS image analysis.


Subject(s)
Image Processing, Computer-Assisted , Microscopy, Fluorescence , Algorithms , Biotechnology , Machine Learning , Software , Vision, Ocular
7.
Mol Syst Biol ; 13(4): 924, 2017 04 18.
Article in English | MEDLINE | ID: mdl-28420678

ABSTRACT

Existing computational pipelines for quantitative analysis of high-content microscopy data rely on traditional machine learning approaches that fail to accurately classify more than a single dataset without substantial tuning and training, requiring extensive analysis. Here, we demonstrate that the application of deep learning to biological image data can overcome the pitfalls associated with conventional machine learning classifiers. Using a deep convolutional neural network (DeepLoc) to analyze yeast cell images, we show improved performance over traditional approaches in the automated classification of protein subcellular localization. We also demonstrate the ability of DeepLoc to classify highly divergent image sets, including images of pheromone-arrested cells with abnormal cellular morphology, as well as images generated in different genetic backgrounds and in different laboratories. We offer an open-source implementation that enables updating DeepLoc on new microscopy datasets. This study highlights deep learning as an important tool for the expedited analysis of high-content microscopy data.


Subject(s)
Saccharomyces cerevisiae Proteins/metabolism , Saccharomyces cerevisiae/ultrastructure , Systems Biology/methods , Machine Learning , Microscopy , Neural Networks, Computer , Saccharomyces cerevisiae/metabolism
8.
Genome Res ; 24(11): 1774-86, 2014 Nov.
Article in English | MEDLINE | ID: mdl-25258385

ABSTRACT

Alternative splicing (AS) of precursor RNAs is responsible for greatly expanding the regulatory and functional capacity of eukaryotic genomes. Of the different classes of AS, intron retention (IR) is the least well understood. In plants and unicellular eukaryotes, IR is the most common form of AS, whereas in animals, it is thought to represent the least prevalent form. Using high-coverage poly(A)+ RNA-seq data, we observe that IR is surprisingly frequent in mammals, affecting transcripts from as many as three-quarters of multiexonic genes. A highly correlated set of cis features comprising an "IR code" reliably discriminates retained from constitutively spliced introns. We show that IR acts widely to reduce the levels of transcripts that are less or not required for the physiology of the cell or tissue type in which they are detected. This "transcriptome tuning" function of IR acts through both nonsense-mediated mRNA decay and nuclear sequestration and turnover of IR transcripts. We further show that IR is linked to a cross-talk mechanism involving localized stalling of RNA polymerase II (Pol II) and reduced availability of spliceosomal components. Collectively, the results implicate a global checkpoint-type mechanism whereby reduced recruitment of splicing components coupled to Pol II pausing underlies widespread IR-mediated suppression of inappropriately expressed transcripts.


Subject(s)
Alternative Splicing , Introns/genetics , Mammals/genetics , Transcriptome/genetics , 3T3 Cells , Animals , Cell Differentiation/genetics , Cell Line , Cell Line, Tumor , Cells, Cultured , Evolution, Molecular , HeLa Cells , Humans , K562 Cells , Mammals/classification , Mice , Models, Genetic , Organ Specificity , Principal Component Analysis , RNA Polymerase II/metabolism , RNA Precursors/genetics , RNA Precursors/metabolism , Reverse Transcriptase Polymerase Chain Reaction , Species Specificity , Vertebrates/classification , Vertebrates/genetics
9.
Bioinformatics ; 32(12): i52-i59, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307644

ABSTRACT

MOTIVATION: High-content screening (HCS) technologies have enabled large scale imaging experiments for studying cell biology and for drug screening. These systems produce hundreds of thousands of microscopy images per day and their utility depends on automated image analysis. Recently, deep learning approaches that learn feature representations directly from pixel intensity values have dominated object recognition challenges. These tasks typically have a single centered object per image and existing models are not directly applicable to microscopy datasets. Here we develop an approach that combines deep convolutional neural networks (CNNs) with multiple instance learning (MIL) in order to classify and segment microscopy images using only whole image level annotations. RESULTS: We introduce a new neural network architecture that uses MIL to simultaneously classify and segment microscopy images with populations of cells. We base our approach on the similarity between the aggregation function used in MIL and pooling layers used in CNNs. To facilitate aggregating across large numbers of instances in CNN feature maps we present the Noisy-AND pooling function, a new MIL operator that is robust to outliers. Combining CNNs with MIL enables training CNNs using whole microscopy images with image level labels. We show that training end-to-end MIL CNNs outperforms several previous methods on both mammalian and yeast datasets without requiring any segmentation steps. AVAILABILITY AND IMPLEMENTATION: Torch7 implementation available upon request. CONTACT: oren.kraus@mail.utoronto.ca.


Subject(s)
Image Interpretation, Computer-Assisted , Machine Learning , Microscopy , Algorithms , Humans , Neural Networks, Computer , Yeasts/cytology
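
Note on entry 9: the abstract introduces a "Noisy-AND" pooling function, a multiple instance learning operator that aggregates many per-instance scores and is robust to outliers. The sketch below shows one formulation of a Noisy-AND style pooling: a sigmoid applied to the mean instance probability and rescaled so that the output is exactly 0 when no instance is positive and 1 when all are. The exact functional form and the parameters `a` (slope) and `b` (threshold) are assumptions for illustration; the abstract itself gives no formula.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def noisy_and(instance_probs, a=10.0, b=0.5):
    """Pool per-instance probabilities into one bag-level probability.

    Assumed form: sigmoid of (mean - b) with slope a, rescaled to [0, 1].
    The bag is called positive only when the mean instance probability
    exceeds roughly b, so a few outlier instances cannot flip the result.
    """
    if not instance_probs:
        raise ValueError("need at least one instance")
    mean_p = sum(instance_probs) / len(instance_probs)
    low = sigmoid(-a * b)          # raw value when mean_p == 0
    high = sigmoid(a * (1.0 - b))  # raw value when mean_p == 1
    return (sigmoid(a * (mean_p - b)) - low) / (high - low)


# One confident outlier among 100 instances barely moves the pooled score,
# whereas max pooling over the same list would return 1.0.
outlier_bag = [1.0] + [0.0] * 99
pooled = noisy_and(outlier_bag)
max_pooled = max(outlier_bag)
```

The contrast with `max` is the point of the example: max pooling reacts to a single instance, while a mean-based operator with a threshold requires a sizeable fraction of instances to agree.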
10.
BMC Genomics ; 17(1): 787, 2016 10 07.
Article in English | MEDLINE | ID: mdl-27717327

ABSTRACT

BACKGROUND: Alternative mRNA splicing is critical to proteomic diversity and tissue and species differentiation. Exclusion of cassette exons, also called exon skipping, is the most common type of alternative splicing in mammals. RESULTS: We present a computational model that predicts absolute (though not tissue-differential) percent-spliced-in of cassette exons more accurately than previous models, despite not using any 'hand-crafted' biological features such as motif counts. We achieve nearly identical performance using only the conservation score (mammalian phastCons) of each splice junction normalized by average conservation over 100 bp of the corresponding flanking intron, demonstrating that conservation is an unexpectedly powerful indicator of alternative splicing patterns. Using this method, we provide evidence that intronic splicing regulation occurs predominantly within 100 bp of the alternative splice sites and that conserved elements in this region are, as expected, functioning as splicing regulators. We show that among conserved cassette exons, increased conservation of flanking introns is associated with reduced inclusion. We also propose a new definition of intronic splicing regulatory elements (ISREs) that is independent of conservation, and show that most ISREs do not match known binding sites or splicing factors despite being predictive of percent-spliced-in. CONCLUSIONS: These findings suggest that one mechanism for the evolutionary transition from constitutive to alternative splicing is the emergence of cis-acting splicing inhibitors. The association of our ISREs with differences in splicing suggests the existence of novel RNA-binding proteins and/or novel splicing roles for known RNA-binding proteins.


Subject(s)
Alternative Splicing , Evolution, Molecular , Models, Biological , Animals , Area Under Curve , Brain/metabolism , Exons , Gene Expression Regulation , Humans , Introns , Organ Specificity/genetics , RNA Splice Sites , Regulatory Sequences, Nucleic Acid
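
Note on entry 10: the abstract's key feature is the conservation score of each splice junction normalized by the average conservation over 100 bp of the corresponding flanking intron. A minimal sketch of such a feature follows, assuming that "normalized by" means a simple ratio and that per-base conservation scores (for example phastCons values) are already available as a list ordered outward from the junction. The function name, argument layout and the ratio itself are illustrative assumptions, not the authors' published definition.

```python
def normalized_junction_conservation(junction_score, intron_scores, window=100):
    """Divide a junction's conservation by the mean of its flanking intron.

    junction_score: conservation at the splice junction (assumed in [0, 1]).
    intron_scores: per-base conservation of the flanking intron, ordered
        from the junction outward (hypothetical input layout).
    window: number of intronic bases to average; 100 follows the abstract.
    """
    flank = intron_scores[:window]
    if not flank:
        raise ValueError("flanking intron has no scored bases")
    mean_flank = sum(flank) / len(flank)
    if mean_flank == 0:
        raise ValueError("flanking conservation is zero; ratio undefined")
    return junction_score / mean_flank


# A junction at 0.9 next to a weakly conserved intron (mean 0.3)
# is about three times more conserved than its surroundings.
ratio = normalized_junction_conservation(0.9, [0.3] * 150)
```

A ratio above one marks a junction that stands out from its intronic background, while a ratio near one marks a junction embedded in a uniformly conserved (or uniformly unconserved) region.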
11.
Nature ; 465(7294): 53-9, 2010 May 06.
Article in English | MEDLINE | ID: mdl-20445623

ABSTRACT

Alternative splicing has a crucial role in the generation of biological complexity, and its misregulation is often involved in human disease. Here we describe the assembly of a 'splicing code', which uses combinations of hundreds of RNA features to predict tissue-dependent changes in alternative splicing for thousands of exons. The code determines new classes of splicing patterns, identifies distinct regulatory programs in different tissues, and identifies mutation-verified regulatory sequences. Widespread regulatory strategies are revealed, including the use of unexpectedly large combinations of features, the establishment of low exon inclusion levels that are overcome by features in specific tissues, the appearance of features deeper into introns than previously appreciated, and the modulation of splice variant levels by transcript structure characteristics. The code detected a class of exons whose inclusion silences expression in adult tissues by activating nonsense-mediated messenger RNA decay, but whose exclusion promotes expression during embryogenesis. The code facilitates the discovery and detailed characterization of regulated alternative splicing events on a genome-wide scale.


Subject(s)
Alternative Splicing/genetics , Gene Expression Regulation , Genetic Code/genetics , Models, Genetic , RNA, Messenger/metabolism , Animals , Gene Silencing , Humans , Mice , Reproducibility of Results
12.
Bioinformatics ; 30(12): i121-9, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24931975

ABSTRACT

MOTIVATION: Alternative splicing (AS) is a regulated process that directs the generation of different transcripts from single genes. A computational model that can accurately predict splicing patterns based on genomic features and cellular context is highly desirable, both in understanding this widespread phenomenon, and in exploring the effects of genetic variations on AS. METHODS: Using a deep neural network, we developed a model inferred from mouse RNA-Seq data that can predict splicing patterns in individual tissues and differences in splicing patterns across tissues. Our architecture uses hidden variables that jointly represent features in genomic sequences and tissue types when making predictions. A graphics processing unit was used to greatly reduce the training time of our models with millions of parameters. RESULTS: We show that the deep architecture surpasses the performance of the previous Bayesian method for predicting AS patterns. With the proper optimization procedure and selection of hyperparameters, we demonstrate that deep architectures can be beneficial, even with a moderately sparse dataset. An analysis of what the model has learned in terms of the genomic features is presented.


Subject(s)
Alternative Splicing , Artificial Intelligence , Algorithms , Animals , Bayes Theorem , Genomics/methods , Humans , Mice , Neural Networks, Computer , Sequence Analysis, RNA
13.
Bioinformatics ; 29(7): 821-9, 2013 Apr 01.
Article in English | MEDLINE | ID: mdl-23419374

ABSTRACT

MOTIVATION: Tandem mass spectrometry (MS/MS) is a dominant approach for large-scale high-throughput post-translational modification (PTM) profiling. Although current state-of-the-art blind PTM spectral analysis algorithms can predict thousands of modified peptides (PTM predictions) in an MS/MS experiment, a significant percentage of these predictions have inaccurate modification mass estimates and false modification site assignments. This problem can be addressed by post-processing the PTM predictions with a PTM refinement algorithm. We developed a novel PTM refinement algorithm, iPTMClust, which extends a recently introduced PTM refinement algorithm PTMClust and uses a non-parametric Bayesian model to better account for uncertainties in the quantity and identity of PTMs in the input data. The use of this new modeling approach enables iPTMClust to provide a confidence score per modification site that allows fine-tuning and interpreting resulting PTM predictions. RESULTS: The primary goal behind iPTMClust is to improve the quality of the PTM predictions. First, to demonstrate that iPTMClust produces sensible and accurate cluster assignments, we compare it with k-means clustering, mixtures of Gaussians (MOG) and PTMClust on a synthetically generated PTM dataset. Second, in two separate benchmark experiments using PTM data taken from a phosphopeptide and a yeast proteome study, we show that iPTMClust outperforms state-of-the-art PTM prediction and refinement algorithms, including PTMClust. Finally, we illustrate the general applicability of our new approach on a set of human chromatin protein complex data, where we are able to identify putative novel modified peptides and modification sites that may be involved in the formation and regulation of protein complexes. Our method facilitates accurate PTM profiling, which is an important step in understanding the mechanisms behind many biological processes and should be an integral part of any proteomic study. 
AVAILABILITY: Our algorithm is implemented in Java and is freely available for academic use from http://genes.toronto.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Protein Processing, Post-Translational , Tandem Mass Spectrometry , Bayes Theorem , Cluster Analysis , Fungal Proteins/metabolism , Humans , Phosphopeptides/chemistry , Protein Interaction Mapping , Proteome/metabolism , Proteomics/methods , Statistics, Nonparametric
14.
Nat Genet ; 37(9): 991-6, 2005 Sep.
Article in English | MEDLINE | ID: mdl-16127451

ABSTRACT

Recent mammalian microarray experiments detected widespread transcription and indicated that there may be many undiscovered multiple-exon protein-coding genes. To explore this possibility, we hybridized labeled cDNA from unamplified, polyadenylation-selected RNA samples from 37 mouse tissues to microarrays encompassing 1.14 million exon probes. We analyzed these data using GenRate, a Bayesian algorithm that uses a genome-wide scoring function in a factor graph to infer genes. At a stringent exon false detection rate of 2.7%, GenRate detected 12,145 gene-length transcripts and confirmed 81% of the 10,000 most highly expressed known genes. Notably, our analysis showed that most of the 155,839 exons detected by GenRate were associated with known genes, providing microarray-based evidence that most multiple-exon genes have already been identified. GenRate also detected tens of thousands of potential new exons and reconciled discrepancies in current cDNA databases by 'stitching' new transcribed regions into previously annotated genes.


Subject(s)
Computational Biology , DNA, Complementary/chemistry , Databases as Topic , Exons/genetics , Genome , Transcription, Genetic , Algorithms , Animals , Gene Expression Profiling , Humans , Mice , Microarray Analysis , RNA, Messenger/chemistry , RNA, Messenger/metabolism
15.
BMC Bioinformatics ; 13 Suppl 6: S11, 2012 Apr 19.
Article in English | MEDLINE | ID: mdl-22537040

ABSTRACT

Transcript quantification is a long-standing problem in genomics and estimating the relative abundance of alternatively-spliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by high-throughput RNA sequencing experiments which are quickly generating large amounts of data. However, much of the signal present in this data is corrupted or obscured by biases resulting in non-uniform and non-proportional representation of sequences from different transcripts. Many existing analyses attempt to deal with these and other biases with various task-specific approaches, which makes direct comparison between them difficult. However, two popular tools for isoform quantification, MISO and Cufflinks, have adopted a general probabilistic framework to model and mitigate these biases in a more general fashion. These advances motivate the need to investigate the effects of RNA-seq biases on the accuracy of different approaches for isoform quantification. We conduct the investigation by building models of increasing sophistication to account for noise introduced by the biases and compare their accuracy to the established approaches. We focus on methods that estimate the expression of alternatively-spliced isoforms with the percent-spliced-in (PSI) metric for each exon skipping event. To improve their estimates, many methods use evidence from RNA-seq reads that align to exon bodies. However, the methods we propose focus on reads that span only exon-exon junctions. As a result, our approaches are simpler and less sensitive to exon definitions than existing methods, which enables us to distinguish their strengths and weaknesses more easily. We present several probabilistic models of position-specific read counts with increasing complexity and compare them to each other and to the current state-of-the-art methods in isoform quantification, MISO and Cufflinks.
On a validation set with RT-PCR measurements for 26 cassette events, some of our methods are more accurate and some are significantly more consistent than these two popular tools. This comparison demonstrates the challenges in estimating the percent inclusion of alternatively spliced junctions and illuminates the tradeoffs between different approaches.


Subject(s)
Alternative Splicing , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Exons , Gene Expression Profiling , HeLa Cells , Humans , Models, Statistical , Reverse Transcriptase Polymerase Chain Reaction
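
Note on entry 15: the abstract estimates percent-spliced-in (PSI) for exon skipping events using only reads that span exon-exon junctions. The simplest junction-only estimate, far cruder than the probabilistic models the abstract describes, compares the reads supporting inclusion with those supporting skipping. The sketch below assumes three junction counts per cassette exon and averages the two inclusion junctions so that inclusion is not counted twice; the function and argument names are hypothetical, and no bias correction of any kind is attempted.

```python
def psi_from_junctions(upstream_inclusion, downstream_inclusion, skipping):
    """Naive PSI estimate for one cassette exon from junction read counts.

    upstream_inclusion: reads spanning upstream exon -> cassette exon.
    downstream_inclusion: reads spanning cassette exon -> downstream exon.
    skipping: reads spanning upstream exon -> downstream exon directly.
    Returns None when no junction reads are available.
    """
    # An included exon produces two junctions, a skipped one only one,
    # so the inclusion evidence is averaged over its two junctions.
    inclusion = (upstream_inclusion + downstream_inclusion) / 2.0
    total = inclusion + skipping
    if total == 0:
        return None
    return inclusion / total


# 30 reads on each inclusion junction and 10 skipping reads give PSI = 0.75.
psi_example = psi_from_junctions(30, 30, 10)
```

With low read counts this ratio is very noisy, which is one motivation for the more elaborate models of position-specific read counts that the abstract compares against MISO and Cufflinks.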
16.
Bioinformatics ; 27(6): 797-806, 2011 Mar 15.
Article in English | MEDLINE | ID: mdl-21258065

ABSTRACT

MOTIVATION: A post-translational modification (PTM) is a chemical modification of a protein that occurs naturally. Many of these modifications, such as phosphorylation, are known to play pivotal roles in the regulation of protein function. Consequently, PTM perturbations have been linked to diverse diseases like Parkinson's, Alzheimer's, diabetes and cancer. To discover PTMs on a genome-wide scale, there is a recent surge of interest in analyzing tandem mass spectrometry data, and several unrestrictive (so-called 'blind') PTM search methods have been reported. However, these approaches are subject to noise in mass measurements and in the predicted modification site (amino acid position) within peptides, which can result in false PTM assignments. RESULTS: To address these issues, we devised a machine learning algorithm, PTMClust, that can be applied to the output of blind PTM search methods to improve prediction quality, by suppressing noise in the data and clustering peptides with the same underlying modification to form PTM groups. We show that our technique outperforms two standard clustering algorithms on a simulated dataset. Additionally, we show that our algorithm significantly improves sensitivity and specificity when applied to the output of three different blind PTM search engines, SIMS, InsPecT and MODmap. Moreover, PTMClust markedly outperforms another PTM refinement algorithm, PTMFinder. We demonstrate that our technique is able to reduce false PTM assignments, improve overall detection coverage and facilitate novel PTM discovery, including terminus modifications. We applied our technique to a large-scale yeast MS/MS proteome profiling dataset and found numerous known and novel PTMs. Accurately identifying modifications in protein sequences is a critical first step for PTM profiling, and thus our approach may benefit routine proteomic analysis. AVAILABILITY: Our algorithm is implemented in Matlab and is freely available for academic use.
The software is available online from http://genes.toronto.edu.


Subject(s)
Artificial Intelligence , Computational Biology/methods , Protein Processing, Post-Translational , Tandem Mass Spectrometry , Algorithms , Amino Acid Sequence , Bayes Theorem , Cluster Analysis , Models, Statistical , Proteomics/methods , Sequence Analysis, Protein/methods , Software
17.
Bioinformatics ; 27(18): 2554-62, 2011 Sep 15.
Article in English | MEDLINE | ID: mdl-21803804

ABSTRACT

MOTIVATION: Alternative splicing is a major contributor to cellular diversity in mammalian tissues and relates to many human diseases. An important goal in understanding this phenomenon is to infer a 'splicing code' that predicts how splicing is regulated in different cell types by features derived from RNA, DNA and epigenetic modifiers. METHODS: We formulate the assembly of a splicing code as a problem of statistical inference and introduce a Bayesian method that uses an adaptively selected number of hidden variables to combine subgroups of features into a network, allows different tissues to share feature subgroups and uses a Gibbs sampler to hedge predictions and ascertain the statistical significance of identified features. RESULTS: Using data for 3665 cassette exons, 1014 RNA features and 4 tissue types derived from 27 mouse tissues (http://genes.toronto.edu/wasp), we benchmarked several methods. Our method outperforms all others, and achieves relative improvements of 52% in splicing code quality and up to 22% in classification error, compared with the state of the art. Novel combinations of regulatory features and novel combinations of tissues that share feature subgroups were identified using our method. CONTACT: frey@psi.toronto.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Alternative Splicing/genetics; RNA Isoforms/genetics; RNA/genetics; Algorithms; Animals; Base Sequence; Bayes Theorem; Exons; Gene Expression; Gene Expression Regulation; Humans; Mice; Models, Genetic; RNA Splicing; Transcription, Genetic
18.
Bioinformatics ; 26(12): i325-33, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20529924

ABSTRACT

MOTIVATION: Transcripts from approximately 95% of human multi-exon genes are subject to alternative splicing (AS). The growing interest in AS is propelled by its prominent contribution to transcriptome and proteome complexity and the role of aberrant AS in numerous diseases. Recent technological advances enable thousands of exons to be simultaneously profiled across diverse cell types and cellular conditions, but require accurate identification of condition-specific splicing changes. Such identification is necessary to elucidate the underlying regulatory programs or to link the splicing changes to specific diseases. RESULTS: We present a probabilistic model tailored for high-throughput AS data, where observed isoform levels are explained as combinations of condition-specific AS signals. According to our formulation, given an AS dataset our tasks are to detect common signals in the data and identify the exons relevant to each signal. Our model can incorporate prior knowledge about underlying AS signals, measurement quality and gene expression level effects. Using a large-scale multi-tissue AS dataset, we demonstrate the advantage of our method over standard alternative approaches. In addition, we describe newly found tissue-specific AS signals which were verified experimentally, and discuss associated regulatory features. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
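
One way to write down "isoform levels explained as combinations of condition-specific AS signals" is as a generic linear factor model. The abstract does not give the authors' formulation, so every symbol here is an assumption introduced for illustration: $x_{ec}$ is the observed inclusion level of exon $e$ in condition $c$, $s_{kc}$ is the value of signal $k$ in condition $c$, $m_{ek}$ is the weight (zero if exon $e$ is not relevant to signal $k$), and $\epsilon_{ec}$ is measurement noise with an exon- and condition-specific variance $\sigma_{ec}^2$.

```latex
x_{ec} = \sum_{k=1}^{K} m_{ek}\, s_{kc} + \epsilon_{ec},
\qquad \epsilon_{ec} \sim \mathcal{N}\!\left(0, \sigma_{ec}^{2}\right)
```

Under this reading, detecting common signals corresponds to inferring the $s_{kc}$, and identifying the exons relevant to each signal corresponds to inferring which $m_{ek}$ are non-zero.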


Subject(s)
Alternative Splicing/genetics; Models, Statistical; Algorithms; Exons; Gene Expression Profiling; RNA Splicing
19.
Nat Methods ; 4(12): 1045-9, 2007 Dec.
Article in English | MEDLINE | ID: mdl-18026111

ABSTRACT

We demonstrate that paired expression profiles of microRNAs (miRNAs) and mRNAs can be used to identify functional miRNA-target relationships with high precision. We used a Bayesian data analysis algorithm, GenMiR++, to identify a network of 1,597 high-confidence target predictions for 104 human miRNAs, which was supported by RNA expression data across 88 tissues and cell types, sequence complementarity and comparative genomics data. We experimentally verified our predictions by investigating the result of let-7b downregulation in retinoblastoma using quantitative reverse transcriptase (RT)-PCR and microarray profiling; our verified let-7b targets include CDC25A and BCL7A. Compared with sequence-based predictions, our high-scoring GenMiR++ predictions had much more consistent Gene Ontology annotations and were more accurate predictors of which mRNA levels respond to changes in let-7b levels.
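
The use of paired expression profiles can be illustrated with a much simpler stand-in than the Bayesian GenMiR++ algorithm. The sketch below assumes, as a simplification not stated in the abstract, that a functional target's mRNA level tends to fall as the miRNA level rises, and ranks candidate targets by Pearson correlation. The function names, gene labels and expression values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_candidates(mirna_profile, mrna_profiles):
    """Sort candidate targets from most anti-correlated to most correlated."""
    scored = [(gene, pearson(mirna_profile, profile))
              for gene, profile in mrna_profiles.items()]
    return sorted(scored, key=lambda item: item[1])


# Invented expression levels across five tissues.
mirna = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "GENE_A": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as the miRNA rises
    "GENE_B": [1.0, 2.2, 2.9, 4.1, 5.0],  # rises with the miRNA
    "GENE_C": [3.0, 2.9, 3.1, 3.0, 3.0],  # roughly flat
}
ranking = rank_candidates(mirna, candidates)
for gene, r in ranking:
    print(f"{gene}: r = {r:.2f}")
```

Correlation alone ignores the sequence complementarity and comparative genomics evidence that the abstract says supported the predictions.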


Subject(s)
Gene Expression Profiling/methods; Gene Targeting/methods; MicroRNAs/genetics; Oligonucleotide Array Sequence Analysis/methods; Sequence Analysis, RNA/methods; Base Sequence; Humans; Molecular Sequence Data
20.
Mol Cell Proteomics ; 7(3): 519-33, 2008 Mar.
Article in English | MEDLINE | ID: mdl-18056057

ABSTRACT

Defective mobilization of Ca2+ by cardiomyocytes can lead to cardiac insufficiency, but the causative mechanisms leading to congestive heart failure (HF) remain unclear. In the present study we performed exhaustive global proteomics surveys of cardiac ventricle isolated from a mouse model of cardiomyopathy overexpressing a phospholamban mutant, R9C (PLN-R9C), and exhibiting impaired Ca2+ handling and death at 24 weeks and compared them with normal control littermates. The relative expression patterns of 6190 high confidence proteins were monitored by shotgun tandem mass spectrometry at 8, 16, and 24 weeks of disease progression. Significant differential abundance of 593 proteins was detected. These proteins mapped to select biological pathways such as endoplasmic reticulum stress response, cytoskeletal remodeling, and apoptosis and included known biomarkers of HF (e.g. brain natriuretic peptide/atrial natriuretic factor and angiotensin-converting enzyme) and other indicators of presymptomatic functional impairment. These altered proteomic profiles were concordant with cognate mRNA patterns recorded in parallel using high density mRNA microarrays, and top candidates were validated by RT-PCR and Western blotting. Mapping of our highest ranked proteins against a human diseased explant and to available data sets indicated that many of these proteins could serve as markers of disease. Indeed, we showed that several of these proteins are detectable in mouse and human plasma and display differential abundance in the plasma of diseased mice and affected patients. These results offer a systems-wide perspective of the dynamic maladaptations associated with impaired Ca2+ homeostasis that perturb myocyte function and ultimately converge to cause HF.


Subject(s)
Calcium-Binding Proteins/genetics; Cardiomyopathy, Dilated/metabolism; Mutation/genetics; Protein Array Analysis; Proteomics/methods; Stress, Physiological/metabolism; Animals; Biomarkers/blood; Cardiomyopathy, Dilated/blood; Cardiomyopathy, Dilated/diagnostic imaging; Cardiomyopathy, Dilated/physiopathology; Disease Models, Animal; Female; Gene Expression Regulation; Heart Failure; Hemodynamics; Humans; Male; Metabolic Networks and Pathways; Mice; Mice, Transgenic; Myocardium/pathology; Oligonucleotide Array Sequence Analysis; Phenotype; RNA, Messenger/genetics; RNA, Messenger/metabolism; Reproducibility of Results; Time Factors; Ultrasonography