ABSTRACT
MOTIVATION: Molecular carcinogenicity is a preventable cause of cancer, but systematically identifying carcinogenic compounds, which involves performing experiments on animal models, is expensive, time-consuming and low throughput. As a result, carcinogenicity information is limited and building data-driven models with good prediction accuracy remains a major challenge. RESULTS: In this work, we propose CONCERTO, a deep learning model that uses a graph transformer in conjunction with a molecular fingerprint representation for carcinogenicity prediction from molecular structure. Special efforts have been made to overcome the data size constraint, such as multi-round pre-training on related but lower quality mutagenicity data, and transfer learning from a large self-supervised model. Extensive experiments demonstrate that our model performs well and can generalize to external validation sets. CONCERTO could be useful for guiding future carcinogenicity experiments and provide insight into the molecular basis of carcinogenicity. AVAILABILITY AND IMPLEMENTATION: The code and data underlying this article are available on GitHub at https://github.com/bowang-lab/CONCERTO.
Subjects
Carcinogens , Neural Networks, Computer , Animals , Carcinogens/toxicity , Forecasting , Mutagens
ABSTRACT
Recent papers have described the first application of high-throughput sequencing (HTS) technologies to the characterization of transcriptomes. These studies emphasize the tremendous power of this new technology, in terms of both profiling coverage and quantitative accuracy. Initial discoveries include the detection of substantial new transcript complexity, the elucidation of binding maps and regulatory properties of RNA-binding proteins, and new insights into the links between different steps in pre-mRNA processing. We review these findings, focusing on results from profiling mammalian transcriptomes. The strengths and limitations of HTS relative to microarray profiling are discussed. We also consider how future advances in HTS technology are likely to transform our understanding of integrated cellular networks operating at the RNA level.
Subjects
Gene Expression Profiling , Sequence Analysis, RNA/methods , Animals , Gene Expression Profiling/economics , Gene Expression Profiling/trends , Gene Expression Regulation , Mammals , RNA Processing, Post-Transcriptional , Sequence Analysis, RNA/economics , Trans-Activators
ABSTRACT
MOTIVATION: Alternative splicing (AS) is a regulated process that directs the generation of different transcripts from single genes. A computational model that can accurately predict splicing patterns based on genomic features and cellular context is highly desirable, both in understanding this widespread phenomenon, and in exploring the effects of genetic variations on AS. METHODS: Using a deep neural network, we developed a model inferred from mouse RNA-Seq data that can predict splicing patterns in individual tissues and differences in splicing patterns across tissues. Our architecture uses hidden variables that jointly represent features in genomic sequences and tissue types when making predictions. A graphics processing unit was used to greatly reduce the training time of our models with millions of parameters. RESULTS: We show that the deep architecture surpasses the performance of the previous Bayesian method for predicting AS patterns. With the proper optimization procedure and selection of hyperparameters, we demonstrate that deep architectures can be beneficial, even with a moderately sparse dataset. An analysis of what the model has learned in terms of the genomic features is presented.
Subjects
Alternative Splicing , Artificial Intelligence , Algorithms , Animals , Bayes Theorem , Genomics/methods , Humans , Mice , Neural Networks, Computer , Sequence Analysis, RNA
ABSTRACT
Alternative splicing (AS) plays a crucial role in the diversification of gene function and regulation. Consequently, the systematic identification and characterization of temporally regulated splice variants is of critical importance to understanding animal development. We have used high-throughput RNA sequencing and microarray profiling to analyze AS in C. elegans across various stages of development. This analysis identified thousands of novel splicing events, including hundreds of developmentally regulated AS events. To make these data easily accessible and informative, we constructed the C. elegans Splice Browser, a web resource in which researchers can mine AS events of interest and retrieve information about their relative levels and regulation across development. The data presented in this study, along with the Splice Browser, provide the most comprehensive set of annotated splice variants in C. elegans to date, and are therefore expected to facilitate focused, high resolution in vivo functional assays of AS function.
Subjects
Alternative Splicing/genetics , Caenorhabditis elegans/genetics , Animals , Databases, Genetic , Exons/genetics , Female , Gene Expression Profiling , Genome-Wide Association Study , Male , Molecular Sequence Data , Oligonucleotide Array Sequence Analysis , Software
ABSTRACT
In the face of rapidly accumulating genomic data, our understanding of the RNA regulatory code remains incomplete. Pre-trained genomic foundation models offer an avenue to adapt learned RNA representations to biological prediction tasks. However, existing genomic foundation models are trained using strategies borrowed from textual or visual domains, such as masked language modelling or next token prediction, that do not leverage biological domain knowledge. Here, we introduce Orthrus, a Mamba-based RNA foundation model pre-trained using a novel self-supervised contrastive learning objective with biological augmentations. Orthrus is trained by maximizing embedding similarity between curated pairs of RNA transcripts, where pairs are formed from splice isoforms of 10 model organisms and transcripts from orthologous genes in 400+ mammalian species from the Zoonomia Project. This training objective results in a latent representation that clusters RNA sequences with functional and evolutionary similarities. We find that the generalized mature RNA isoform representations learned by Orthrus significantly outperform existing genomic foundation models on five mRNA property prediction tasks, and require only a fraction of fine-tuning data to do so.
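To make the idea of "maximizing embedding similarity between curated pairs" concrete, the following is a minimal sketch of a generic contrastive (InfoNCE-style) loss over paired embeddings. The abstract does not specify the exact objective used by Orthrus, so the loss form, the `temperature` value, and the function names `cosine` and `info_nce` are all assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in, not the published objective).

    anchors[i] and positives[i] are embeddings of a paired transcript
    (for example two splice isoforms or two orthologs); every other
    positives[j] in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the true partner.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: correctly paired vs. deliberately mismatched partners.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is close to zero when each anchor is most similar to its own partner and large when the partners are swapped, which is the behaviour a pairing-based objective of this kind is meant to reward.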
ABSTRACT
Transcript quantification is a long-standing problem in genomics and estimating the relative abundance of alternatively-spliced isoforms from the same transcript is an important special case. Both problems have recently been illuminated by high-throughput RNA sequencing experiments which are quickly generating large amounts of data. However, much of the signal present in this data is corrupted or obscured by biases resulting in non-uniform and non-proportional representation of sequences from different transcripts. Many existing analyses attempt to deal with these and other biases with various task-specific approaches, which makes direct comparison between them difficult. However, two popular tools for isoform quantification, MISO and Cufflinks, have adopted a general probabilistic framework to model and mitigate these biases in a more general fashion. These advances motivate the need to investigate the effects of RNA-seq biases on the accuracy of different approaches for isoform quantification. We conduct the investigation by building models of increasing sophistication to account for noise introduced by the biases and compare their accuracy to the established approaches. We focus on methods that estimate the expression of alternatively-spliced isoforms with the percent-spliced-in (PSI) metric for each exon skipping event. To improve their estimates, many methods use evidence from RNA-seq reads that align to exon bodies. However, the methods we propose focus on reads that span only exon-exon junctions. As a result, our approaches are simpler and less sensitive to exon definitions than existing methods, which enables us to distinguish their strengths and weaknesses more easily. We present several probabilistic models of position-specific read counts with increasing complexity and compare them to each other and to the current state-of-the-art methods in isoform quantification, MISO and Cufflinks.
On a validation set with RT-PCR measurements for 26 cassette events, some of our methods are more accurate and some are significantly more consistent than these two popular tools. This comparison demonstrates the challenges in estimating the percent inclusion of alternatively spliced junctions and illuminates the tradeoffs between different approaches.
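As a rough illustration of the PSI metric for a cassette exon estimated from junction-spanning reads only, here is a minimal count-based sketch. It is not one of the probabilistic models described in the abstract: the estimator (averaging the two inclusion junctions and comparing against the single skipping junction), the function name `psi_from_junctions`, and the example counts are assumptions chosen for illustration.

```python
def psi_from_junctions(inc_upstream, inc_downstream, exclusion):
    """Naive percent-spliced-in estimate for one exon-skipping event.

    inc_upstream   -- reads spanning the upstream exon / cassette exon junction
    inc_downstream -- reads spanning the cassette exon / downstream exon junction
    exclusion      -- reads spanning the upstream / downstream (skipping) junction

    Inclusion is supported by two junctions and exclusion by one, so the
    inclusion counts are averaged before forming the ratio (an assumed,
    commonly used normalisation). Returns None when no junction reads exist.
    """
    inclusion = (inc_upstream + inc_downstream) / 2.0
    total = inclusion + exclusion
    if total == 0:
        return None
    return inclusion / total


# Hypothetical counts: 30 and 50 inclusion-junction reads, 10 skipping reads.
example_psi = psi_from_junctions(30, 50, 10)
```

With these hypothetical counts the averaged inclusion support is 40 reads against 10 skipping reads, giving a PSI of 0.8. A point estimate like this ignores read-position biases and sampling noise, which is the gap the probabilistic models in the abstract aim to address.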
Subjects
Alternative Splicing , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Exons , Gene Expression Profiling , HeLa Cells , Humans , Models, Statistical , Reverse Transcriptase Polymerase Chain Reaction
ABSTRACT
Although database search tools originally developed for shotgun proteome have been widely used in immunopeptidomic mass spectrometry identifications, they have been reported to achieve undesirably low sensitivities or high false positive rates as a result of the hugely inflated search space caused by the lack of specific enzymic digestions in immunopeptidome. To overcome such a problem, we developed a motif-guided immunopeptidome database building tool named IntroSpect, which is designed to first learn the peptide motifs from high confidence hits in the initial search, and then build a targeted database for refined search. Evaluated on 18 representative HLA class I datasets, IntroSpect can improve the sensitivity by an average of 76%, compared to conventional searches with unspecific digestions, while maintaining a very high level of accuracy (~96%), as confirmed by synthetic validation experiments. A distinct advantage of IntroSpect is that it does not depend on any external HLA data, so that it performs equally well on both well-studied and poorly-studied HLA types, unlike the previously developed method SpectMHC. We have also designed IntroSpect to keep a global FDR that can be conveniently controlled, similar to a conventional database search. Finally, we demonstrate the practical value of IntroSpect by discovering neoepitopes from MS data directly, an important application in cancer immunotherapies. IntroSpect is freely available to download and use.
Subjects
Peptides , Proteome , Databases, Factual , Databases, Protein , Immunotherapy , Mass Spectrometry/methods , Peptides/chemistry
ABSTRACT
Neoantigen-based cancer immunotherapies hold the promise of being a truly personalized, effective treatment for diverse cancer types. ELISPOT assays, as a powerful experimental technique, can verify the existence of antigen specific T cells to support basic clinical research and monitor clinical trials. However, despite the high sensitivity of ELISPOT assays, detecting immune responses of neoantigen specific T cells in a patient or healthy donor's PBMCs is still extremely difficult, since the frequency of these T cells can be very low. We developed a novel experimental method, by co-stimulation of T cells with anti-CD28 and IL-2 at the beginning of ELISPOT, to further increase the sensitivity of ELISPOT and mitigate the challenge introduced by low frequency T cells. Under the optimal concentration of 1 µg/ml for anti-CD28 and 1 U/ml for IL-2, an 11.7-fold increase of T cell response against CMV peptide was observed by using our method, and it outperforms other cytokine stimulation alternatives (5- to 10-fold). We also showed that this method can be effectively applied to detect neoantigen-specific T cells in healthy donors' and a melanoma patient's PBMCs. To the best of our knowledge, this is the first report that the co-stimulation of anti-CD28 and IL-2 is able to significantly improve the sensitivity of ELISPOT assays, indicating that anti-CD28 and IL-2 signaling can act in synergy to lower the T cell activation threshold and trigger more neoantigen-specific T cells.
Subjects
Antibodies/pharmacology , Antigens, Neoplasm/immunology , CD28 Antigens/immunology , ELISPOT , Interferon-gamma Release Tests , Interleukin-2/pharmacology , Lymphocyte Activation/drug effects , Neoplasms/immunology , T-Lymphocytes/drug effects , Antigens, Neoplasm/genetics , Antigens, Neoplasm/metabolism , Cells, Cultured , Drug Synergism , Humans , Interferon-gamma/immunology , Interferon-gamma/metabolism , MART-1 Antigen/immunology , Mutation , Neoplasms/genetics , Neoplasms/metabolism , Peptide Fragments/immunology , Reproducibility of Results , Signal Transduction , T-Lymphocytes/immunology , T-Lymphocytes/metabolism , Viral Matrix Proteins/immunology
ABSTRACT
RNA editing of adenosine to inosine (A to I) is catalyzed by ADAR1 and dramatically alters the cellular transcriptome, although its functional roles in somatic cell reprogramming are largely unexplored. Here, we show that loss of ADAR1-mediated A-to-I editing disrupts mesenchymal-to-epithelial transition (MET) during induced pluripotent stem cell (iPSC) reprogramming and impedes acquisition of induced pluripotency. Using chemical and genetic approaches, we show that absence of ADAR1-dependent RNA editing induces aberrant innate immune responses through the double-stranded RNA (dsRNA) sensor MDA5, unleashing endoplasmic reticulum (ER) stress and hindering epithelial fate acquisition. We found that A-to-I editing impedes MDA5 sensing and sequestration of dsRNAs encoding membrane proteins, which promote ER homeostasis by activating the PERK-dependent unfolded protein response pathway to consequently facilitate MET. This study therefore establishes a critical role for ADAR1 and its A-to-I editing activity during cell fate transitions and delineates a key regulatory layer underlying MET to control efficient reprogramming.
Subjects
Induced Pluripotent Stem Cells , RNA Editing , Adenosine Deaminase/genetics , Adenosine Deaminase/metabolism , Induced Pluripotent Stem Cells/metabolism , Inosine/metabolism , RNA, Double-Stranded
ABSTRACT
RNA editing results in post-transcriptional modification and could potentially contribute to carcinogenesis. However, RNA editing in advanced lung adenocarcinomas has not yet been studied. Based on whole genome and transcriptome sequencing data, we identified 1,071,296 RNA editing events from matched normal, primary and metastatic samples contributed by 24 lung adenocarcinoma patients, with 91.3% A-to-G editing on average, and found significantly more RNA editing sites in tumors than in normal samples. To investigate cancer relevant editing events, we detected 67,851 hyper-editing sites in primary and 50,480 hyper-editing sites in metastatic samples. 46 genes with hyper-editing in coding regions were found to result in amino acid alterations, while hundreds of hyper-editing events in non-coding regions could modulate splicing or gene expression, including genes related to tumor stage or clinic prognosis. Comparing RNA editome of primary and metastatic samples, we also discovered hyper-edited genes that may promote metastasis development. These findings showed a landscape of RNA editing in matched normal, primary and metastatic tissues of lung adenocarcinomas for the first time and provided new insights to understand the molecular characterization of this disease.
Subjects
Adenocarcinoma/genetics , Adenocarcinoma/pathology , Lung Neoplasms/genetics , Lung Neoplasms/pathology , RNA Editing/genetics , Adenocarcinoma/mortality , Adenocarcinoma of Lung , Computational Biology/methods , Gene Expression Profiling/methods , Humans , Kaplan-Meier Estimate , Lung Neoplasms/mortality , Sequence Analysis, RNA/methods , Transcriptome
ABSTRACT
With the advancement of second generation sequencing techniques, our ability to detect and quantify RNA editing on a global scale has been vastly improved. As a result, RNA editing is now being studied under a growing number of biological conditions so that its biochemical mechanisms and functional roles can be further understood. However, a major barrier that prevents RNA editing from being a routine RNA-seq analysis, similar to gene expression and splicing analysis, for example, is the lack of user-friendly and effective computational tools. Based on years of experience of analyzing RNA editing using diverse RNA-seq datasets, we have developed a software tool, RED-ML: RNA Editing Detection based on Machine learning (pronounced as "red ML"). The input to RED-ML can be as simple as a single BAM file, while it can also take advantage of matched genomic variant information when available. The output not only contains detected RNA editing sites, but also a confidence score to facilitate downstream filtering. We have carefully designed validation experiments and performed extensive comparison and analysis to show the efficiency and effectiveness of RED-ML under different conditions, and it can accurately detect novel RNA editing sites without relying on curated RNA editing databases. We have also made this tool freely available via GitHub.
Subjects
Machine Learning , RNA Editing , Software , Humans , Male , Sequence Analysis, RNA
ABSTRACT
To facilitate precision medicine and whole-genome annotation, we developed a machine-learning technique that scores how strongly genetic variants affect RNA splicing, whose alteration contributes to many diseases. Analysis of more than 650,000 intronic and exonic variants revealed widespread patterns of mutation-driven aberrant splicing. Intronic disease mutations that are more than 30 nucleotides from any splice site alter splicing nine times as often as common variants, and missense exonic disease mutations that have the least impact on protein function are five times as likely as others to alter splicing. We detected tens of thousands of disease-causing mutations, including those involved in cancers and spinal muscular atrophy. Examination of intronic and exonic variants found using whole-genome sequencing of individuals with autism revealed misspliced genes with neurodevelopmental phenotypes. Our approach provides evidence for causal variants and should enable new discoveries in precision medicine.
Subjects
Artificial Intelligence , Child Development Disorders, Pervasive/genetics , Colorectal Neoplasms, Hereditary Nonpolyposis/genetics , Genome-Wide Association Study/methods , Molecular Sequence Annotation/methods , Muscular Atrophy, Spinal/genetics , RNA Splicing/genetics , Adaptor Proteins, Signal Transducing/genetics , Computer Simulation , DNA/genetics , Exons/genetics , Genetic Code , Genetic Markers , Genetic Variation , Humans , Introns/genetics , Models, Genetic , MutL Protein Homolog 1 , Mutation, Missense , Nuclear Proteins/genetics , Polymorphism, Single Nucleotide , Quantitative Trait Loci , RNA Splice Sites/genetics , RNA-Binding Proteins/genetics
ABSTRACT
Transcriptome complexity and its relation to numerous diseases underpins the need to predict in silico splice variants and the regulatory elements that affect them. Building upon our recently described splicing code, we developed AVISPA, a Galaxy-based web tool for splicing prediction and analysis. Given an exon and its proximal sequence, the tool predicts whether the exon is alternatively spliced, displays tissue-dependent splicing patterns, and whether it has associated regulatory elements. We assess AVISPA's accuracy on an independent dataset of tissue-dependent exons, and illustrate how the tool can be applied to analyze a gene of interest. AVISPA is available at http://avispa.biociphers.org.
Subjects
Alternative Splicing , Computational Biology/methods , Web Browser , Algorithms , Databases, Nucleic Acid , Exons , Genomics/methods , Organ Specificity/genetics , ROC Curve , Transcriptome , Vascular Endothelial Growth Factor A/genetics
ABSTRACT
Microcephaly-capillary malformation (MIC-CAP) syndrome is characterized by severe microcephaly with progressive cortical atrophy, intractable epilepsy, profound developmental delay and multiple small capillary malformations on the skin. We used whole-exome sequencing of five patients with MIC-CAP syndrome and identified recessive mutations in STAMBP, a gene encoding the deubiquitinating (DUB) isopeptidase STAMBP (STAM-binding protein, also known as AMSH, associated molecule with the SH3 domain of STAM) that has a key role in cell surface receptor-mediated endocytosis and sorting. Patient cell lines showed reduced STAMBP expression associated with accumulation of ubiquitin-conjugated protein aggregates, elevated apoptosis and insensitive activation of the RAS-MAPK and PI3K-AKT-mTOR pathways. The latter cellular phenotype is notable considering the established connection between these pathways and their association with vascular and capillary malformations. Furthermore, our findings of a congenital human disorder caused by a defective DUB protein that functions in endocytosis implicate ubiquitin-conjugate aggregation and elevated apoptosis as factors potentially influencing the progressive neuronal loss underlying MIC-CAP syndrome.
Subjects
Capillaries/pathology , Developmental Disabilities/genetics , Endosomal Sorting Complexes Required for Transport/genetics , Epilepsy/genetics , Microcephaly/genetics , Mutation/genetics , Skin Diseases/genetics , Ubiquitin Thiolesterase/genetics , Case-Control Studies , Child, Preschool , Cohort Studies , Developmental Disabilities/pathology , Endosomal Sorting Complexes Required for Transport/antagonists & inhibitors , Endosomal Sorting Complexes Required for Transport/metabolism , Epilepsy/pathology , Exome/genetics , Female , Fluorescent Antibody Technique, Indirect , Genes, Recessive , Genome, Human , Genotype , Humans , Infant , Male , Microcephaly/pathology , RNA, Small Interfering/genetics , Skin Diseases/pathology , Syndrome , Ubiquitin Thiolesterase/antagonists & inhibitors , Ubiquitin Thiolesterase/metabolism
ABSTRACT
How species with similar repertoires of protein-coding genes differ so markedly at the phenotypic level is poorly understood. By comparing organ transcriptomes from vertebrate species spanning ~350 million years of evolution, we observed significant differences in alternative splicing complexity between vertebrate lineages, with the highest complexity in primates. Within 6 million years, the splicing profiles of physiologically equivalent organs diverged such that they are more strongly related to the identity of a species than they are to organ type. Most vertebrate species-specific splicing patterns are cis-directed. However, a subset of pronounced splicing changes are predicted to remodel protein interactions involving trans-acting regulators. These events likely further contributed to the diversification of splicing and other transcriptomic changes that underlie phenotypic differences among vertebrate species.
Subjects
Alternative Splicing , Evolution, Molecular , Transcriptome , Vertebrates/genetics , Animals , Biological Evolution , Chickens/genetics , Exons , Introns , Lizards/genetics , Mice/genetics , Mice, Inbred C57BL/genetics , Opossums/genetics , Phenotype , Platypus/genetics , Primates/genetics , RNA Splice Sites , Regulatory Sequences, Ribonucleic Acid , Species Specificity , Xenopus/genetics
ABSTRACT
BACKGROUND: Transcriptome profiling of patterns of RNA expression is a powerful approach to identify networks of genes that play a role in disease. To date, most mRNA profiling of tissues has been accomplished using microarrays, but next-generation sequencing can offer a richer and more comprehensive picture. METHODOLOGY/PRINCIPAL FINDINGS: ECO is a rare multi-system developmental disorder caused by a homozygous mutation in ICK encoding intestinal cell kinase. We performed gene expression profiling using both cDNA microarrays and next-generation mRNA sequencing (mRNA-seq) of skin fibroblasts from ECO-affected subjects. We then validated a subset of differentially expressed transcripts identified by each method using quantitative reverse transcription-polymerase chain reaction (qRT-PCR). Finally, we used gene ontology (GO) to identify critical pathways and processes that were abnormal according to each technical platform. Methodologically, mRNA-seq identifies a much larger number of differentially expressed genes with much better correlation to qRT-PCR results than the microarray (r² = 0.794 and 0.137, respectively). Biologically, cDNA microarray identified functional pathways focused on anatomical structure and development, while the mRNA-seq platform identified a higher proportion of genes involved in cell division and DNA replication pathways. CONCLUSIONS/SIGNIFICANCE: Transcriptome profiling with mRNA-seq had greater sensitivity, range and accuracy than the microarray. The two platforms generated different but complementary hypotheses for further evaluation.
Subjects
Bone Diseases, Developmental/genetics , Brain Diseases/genetics , Endocrine System Diseases/genetics , Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, RNA/methods , Bone Diseases, Developmental/pathology , Brain Diseases/pathology , Cell Line , Cell Proliferation , Endocrine System Diseases/pathology , Fibroblasts/metabolism , Fibroblasts/pathology , High-Throughput Nucleotide Sequencing , Humans , Reproducibility of Results , Skin/pathology
ABSTRACT
We carried out the first analysis of alternative splicing complexity in human tissues using mRNA-Seq data. New splice junctions were detected in approximately 20% of multiexon genes, many of which are tissue specific. By combining mRNA-Seq and EST-cDNA sequence data, we estimate that transcripts from approximately 95% of multiexon genes undergo alternative splicing and that there are approximately 100,000 intermediate- to high-abundance alternative splicing events in major human tissues. From a comparison with quantitative alternative splicing microarray profiling data, we also show that mRNA-Seq data provide reliable measurements for exon inclusion levels.