ABSTRACT
RNA editing is a relevant epitranscriptome phenomenon able to increase the transcriptome and proteome diversity of eukaryotic organisms. ADAR-mediated RNA editing is widespread in humans, in whom millions of A-to-I changes modify thousands of primary transcripts. RNA editing has pivotal roles in the regulation of gene expression, the modulation of the innate immune response and the functioning of several neurotransmitter receptors. Massive transcriptome sequencing has fostered research in this field. Nonetheless, several aspects of RNA editing biology are still unknown and need to be elucidated. To support the study of A-to-I RNA editing, we have updated our REDIportal catalogue, raising its content to about 16 million events detected in 9642 human RNAseq samples from the GTEx project by using a dedicated pipeline based on the HPC version of the REDItools software. REDIportal now allows searches at the sample level, provides overviews of RNA editing profiles for each RNAseq experiment, implements a Gene View module to look at individual events in their genic context and hosts the CLAIRE database. Starting from this version, REDIportal will begin collecting non-human RNA editing changes for comparative genomics investigations. The database is freely available at http://srv00.recas.ba.infn.it/atlas/index.html.
Subjects
Computational Biology/methods , Databases, Genetic , Gene Expression Regulation , Proteome/genetics , RNA Editing/genetics , Transcriptome/genetics , Base Sequence/genetics , Data Curation/methods , Data Mining/methods , Gene Expression Profiling/methods , Genomics/methods , Humans , Internet , Proteomics/methods
ABSTRACT
BACKGROUND: RNA editing is a widespread co-/post-transcriptional mechanism that alters primary RNA sequences through the modification of specific nucleotides, and it can increase both transcriptome and proteome diversity. The automatic detection of RNA editing from RNA-seq data is computationally intensive and limited to small data sets, thus preventing a reliable genome-wide characterisation of this process. RESULTS: In this work we introduce HPC-REDItools, an upgraded tool for the accurate discovery of RNA-editing events in large dataset repositories. AVAILABILITY: https://github.com/BioinfoUNIBA/REDItools2 . CONCLUSIONS: HPC-REDItools is dramatically faster than the previous version, REDItools, enabling big-data analysis by means of an MPI-based implementation and scaling almost linearly with the number of available cores.
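The near-linear scaling reported above is typical of workloads that can be cut into independent genomic intervals and spread over parallel workers. The following is a minimal sketch of such a static decomposition, not the partitioning scheme actually used by HPC-REDItools: the function names, the fixed window size and the round-robin assignment are all assumptions made for illustration, and no MPI library is involved.

```python
from typing import Dict, List, Tuple

# (chromosome, start, end), 0-based and half-open; a hypothetical convention
Interval = Tuple[str, int, int]


def split_into_intervals(chrom_sizes: Dict[str, int], window: int) -> List[Interval]:
    """Cut every chromosome into consecutive windows of at most `window` bases."""
    intervals: List[Interval] = []
    for chrom, size in chrom_sizes.items():
        for start in range(0, size, window):
            intervals.append((chrom, start, min(start + window, size)))
    return intervals


def assign_to_ranks(intervals: List[Interval], n_ranks: int) -> List[List[Interval]]:
    """Round-robin static assignment: rank r receives intervals r, r + n_ranks, ..."""
    return [intervals[rank::n_ranks] for rank in range(n_ranks)]
```

In an MPI setting, each process would select its own list by rank and analyse those intervals independently. A static split like this balances poorly when coverage is uneven across the genome, which is one reason a real tool might prefer a different strategy.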
Subjects
Computing Methodologies , RNA Editing/genetics , Software , Algorithms , Base Sequence , Genome , Transcriptome/genetics
ABSTRACT
BACKGROUND: The advent of Next Generation Sequencing (NGS) technologies and the concomitant reduction in sequencing costs allow unprecedented high-throughput profiling of biological systems in a cost-efficient manner. Modern biological experiments are becoming increasingly data- and computation-intensive, and the wealth of publicly available biological data is bringing bioinformatics into the "Big Data" era. For these reasons, the value of applying High Performance Computing (HPC) architectures effectively is increasingly recognized by bioinformaticians as well. Here we describe HPC resource provisioning pilot programs dedicated to bioinformaticians, run by the Italian Node of ELIXIR (ELIXIR-IT) in collaboration with CINECA, the main Italian supercomputing center. RESULTS: Starting from April 2016, CINECA and ELIXIR-IT launched the pilot Call "ELIXIR-IT HPC@CINECA", offering streamlined access to HPC resources for bioinformatics. Resources are made available either through web front-ends to dedicated workflows developed at CINECA or by providing direct access to the High Performance Computing systems through a standard command-line interface tailored for bioinformatics data analysis. This offers the biomedical research community a production-scale environment, continuously updated with the latest versions of publicly available reference datasets and bioinformatic tools. Currently, 63 research projects have gained access to the HPC@CINECA program, for a total allocation of ~8 million CPU hours and, for data storage, ~100 TB of permanent and ~300 TB of temporary space. CONCLUSIONS: Three years after the beginning of the ELIXIR-IT HPC@CINECA program, we can appreciate its impact on the Italian bioinformatics community and draw some conclusions. Several Italian researchers who applied to the program have gained access to one of the top-ranking public scientific supercomputing facilities in Europe. Those investigators had the opportunity to substantially reduce computational turnaround times in their research projects and to process massive amounts of data, pursuing research approaches that would otherwise have been difficult or impossible to undertake. Moreover, by taking advantage of the wealth of documentation and training material provided by CINECA, participants had the opportunity to improve their skills in the use of HPC systems and to be better positioned to apply to similar EU programs of greater scale, such as PRACE. To illustrate the effective use and impact of the resources awarded by the program in different research applications, we report five successful use cases whose findings have already been published in peer-reviewed journals.
Subjects
Computational Biology , Computing Methodologies , Software , Algorithms , Animals , Cell Line , Databases, Genetic , Gene Fusion , Genome , Humans , Prunus persica/genetics , RNA Editing , Swallows/genetics
ABSTRACT
BACKGROUND: R-loops are three-stranded nucleic acid structures that usually form during transcription and that may lead to gene regulation or genome instability. DRIP (DNA:RNA Immunoprecipitation)-seq techniques are widely used to map R-loops genome-wide, providing insights into R-loop biology. However, annotating DRIP-seq peaks to genes can be difficult because the common basic DRIP technique provides no strand information. RESULTS: Here, we introduce DRIP-seq Optimized Peak Annotator (DROPA), a new tool for gene annotation of R-loop peaks based on gene expression information. DROPA allows full customization of annotation options, ranging from the choice of reference datasets to gene feature definitions. DROPA assigns R-loop peaks to the DNA template strand in the gene body with a false positive rate of less than 7%. A comparison of DROPA performance with three widely used annotation tools shows that it produces fewer false-positive annotations than the others. CONCLUSIONS: DROPA is a fully customizable peak-annotation tool optimized for co-transcriptional DRIP-seq peaks, which allows finer gene annotation based on gene expression information. Its output can easily be integrated into pipelines for downstream analyses, and it can produce informative summary plots and statistical enrichment tests.
Subjects
DNA/metabolism , Immunoprecipitation , Molecular Sequence Annotation , RNA/metabolism , Software , Base Sequence , DNA/genetics , Gene Expression Regulation , RNA/genetics
ABSTRACT
BACKGROUND: The advent and ongoing development of next generation sequencing (NGS) technologies have led to a rapid increase in the production of human genome resequencing data, paving the way for personalized genomics and precision medicine. The body of genome resequencing data is progressively increasing, underlining the need for accurate and time-effective bioinformatics systems for genotyping, a crucial prerequisite for the identification of candidate causal mutations in diagnostic screens. RESULTS: Here we present CoVaCS, a fully automated, highly accurate system with a web-based graphical interface for genotyping and variant annotation. Extensive tests on a gold standard benchmark dataset (the NA12878 Illumina platinum genome) confirm that call-sets based on our consensus strategy are completely in line with those attained by similar command-line-based approaches, and far more accurate than call-sets from any individual tool. Importantly, our system exhibits better sensitivity and higher specificity than equivalent commercial software. CONCLUSIONS: CoVaCS offers optimized pipelines integrating state-of-the-art tools for variant calling and annotation for whole-genome sequencing (WGS), whole-exome sequencing (WES) and target-gene sequencing (TGS) data. The system is currently hosted at Cineca and offers the speed of an HPC facility, a crucial consideration when large numbers of samples must be analysed. Importantly, all the analyses are performed automatically, allowing high reproducibility of the results. As such, we believe that CoVaCS can be a valuable tool for the analysis of human genome resequencing studies. CoVaCS is available at: https://bioinformatics.cineca.it/covacs .
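The abstract does not spell out the consensus rule, so the following is only a sketch of one common way to build a consensus call-set: keep a variant when at least a chosen number of independent callers report it. The `Variant` tuple layout, the function name and the `min_support` threshold are assumptions made for illustration and are not taken from CoVaCS.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# (chromosome, position, reference allele, alternate allele); hypothetical layout
Variant = Tuple[str, int, str, str]


def consensus_calls(call_sets: Iterable[Set[Variant]], min_support: int) -> List[Variant]:
    """Return the variants reported by at least `min_support` call-sets, sorted."""
    support: Counter = Counter()
    for calls in call_sets:
        # Each call-set is a set, so it contributes at most one vote per variant.
        support.update(calls)
    return sorted(v for v, votes in support.items() if votes >= min_support)
```

Raising `min_support` trades sensitivity for specificity: requiring agreement from every caller yields the strict intersection, while a lower threshold retains variants that a single tool may have missed.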
Subjects
Computational Biology/methods , Consensus Sequence , Sequence Analysis, DNA/methods , Software , Algorithms , Databases, Genetic , INDEL Mutation , Molecular Sequence Annotation , Polymorphism, Single Nucleotide , Sensitivity and Specificity , User-Computer Interface , Web Browser , Workflow
ABSTRACT
Applying next-generation sequencing (NGS) technologies to species of agricultural interest has the potential to accelerate the understanding and exploration of genetic resources. The storage, availability and maintenance of huge quantities of NGS-generated data remain a major challenge. The PeachVar-DB portal, available at http://hpc-bioinformatics.cineca.it/peach, is an open-source catalog of genetic variants present in peach (Prunus persica L. Batsch) and wild related species of the genus Prunus, annotated from 146 samples publicly released on the Sequence Read Archive (SRA). We designed a user-friendly web-based interface to the database, providing search tools to retrieve single nucleotide polymorphism (SNP) and InDel variants, along with useful statistics and information. PeachVar-DB results are linked to the Genome Database for Rosaceae (GDR) and the Phytozome database to allow easy access to other useful plant-oriented resources. To further extend the genetic diversity covered by PeachVar-DB and to allow increasingly powerful comparative analyses, we will progressively integrate newly released data.
Subjects
Computational Biology/methods , Genetic Variation , Genome, Plant/genetics , Prunus persica/genetics , Data Mining/methods , Databases, Genetic , High-Throughput Nucleotide Sequencing/methods , Internet , Phylogeny , Polymorphism, Single Nucleotide , Prunus persica/classification , Rosaceae/classification , Rosaceae/genetics
ABSTRACT
RNA editing by A-to-I deamination is a relevant co-/post-transcriptional modification carried out by ADAR enzymes. In humans, it has pivotal cellular effects and its deregulation has been linked to a variety of disorders, including neurological and neurodegenerative diseases and cancer. Despite its biological relevance, the detection of RNA editing variants in large transcriptome sequencing experiments (RNAseq) is still a challenging computational task. To drastically reduce computing times, we have developed a novel REDItools version able to identify A-to-I events in huge amounts of RNAseq data by employing High Performance Computing (HPC) infrastructures. Here we show how to use REDItools v2 in HPC systems.
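Because inosine is read as guanosine during sequencing, an A-to-I event appears in RNAseq reads as an A-to-G mismatch at a reference adenosine. The sketch below estimates an editing level at a single site as the fraction of G among the A and G reads covering it. It is a simplified illustration under stated assumptions, not the REDItools algorithm: the function name is hypothetical, and strand, base quality, mapping quality and genomic SNPs are all ignored.

```python
from collections import Counter


def editing_level(ref_base: str, read_bases: str) -> float:
    """Fraction of G among A/G reads at a reference-A site.

    Returns 0.0 when the reference base is not A or when no A/G read covers
    the site. `read_bases` is a string with one base per aligned read.
    """
    if ref_base.upper() != "A":
        return 0.0
    counts = Counter(read_bases.upper())
    covered = counts["A"] + counts["G"]
    return counts["G"] / covered if covered else 0.0
```

For a site with seven reads showing A and three showing G, the estimate is 3 / (7 + 3) = 0.3. A practical pipeline would also require a minimum coverage before reporting a site.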
Subjects
Computing Methodologies , RNA Editing/physiology , Sequence Analysis, RNA/methods , Animals , Computational Biology/methods , Databases, Genetic , Datasets as Topic , Genomics , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/genetics , Nervous System Diseases/genetics , Neurodegenerative Diseases/genetics , Software , Transcriptome
ABSTRACT
Stressful experiences are part of everyday life and animals have evolved physiological and behavioral responses aimed at coping with stress and maintaining homeostasis. However, repeated or intense stress can induce maladaptive reactions leading to behavioral disorders. Adaptations in the brain, mediated by changes in gene expression, have a crucial role in the stress response. Recent years have seen a tremendous increase in studies on the transcriptional effects of stress. The input raw data are freely available from public repositories and represent a wealth of information for further global and integrative retrospective analyses. We downloaded from the Sequence Read Archive 751 samples (SRA-experiments), from 18 independent BioProjects studying the effects of different stressors on the brain transcriptome in mice. We performed a massive bioinformatics re-analysis applying a single, standardized pipeline for computing differential gene expression. This data mining allowed the identification of novel candidate stress-related genes and specific signatures associated with different stress conditions. The large amount of computational results produced was systematized in the interactive "Stress Mice Portal".
Subjects
Brain/physiology , Gene Expression , Stress, Physiological , Stress, Psychological , Transcriptome , Animals , Computational Biology , Data Mining , Datasets as Topic , Female , Male , Mice
ABSTRACT
Autism spectrum disorder (ASD) is a heterogeneous neurodevelopmental condition with unknown etiology. Recent experimental evidence suggests that non-coding RNAs (ncRNAs) contribute to the pathophysiology of ASD. In this work, we aimed to investigate the expression profile of the ncRNA class of circular RNAs (circRNAs) in the hippocampus of the BTBR T+ tf/J (BTBR) mouse model and age-matched C57BL/6J (B6) mice. In parallel, we analyzed the BTBR hippocampal gene expression profile to evaluate possible correlations between the differential abundance of circular and linear gene products. From RNA sequencing data, we identified circRNAs highly modulated in BTBR mice. Thirteen circRNAs and their corresponding linear isoforms were validated by RT-qPCR analysis. The BTBR-regulated circCdh9 was further characterized in terms of molecular structure and expression, highlighting altered levels not only in the hippocampus, but also in the cerebellum, prefrontal cortex, and amygdala. Finally, gene expression analysis of the BTBR hippocampus pinpointed altered biological and molecular pathways relevant to the ASD phenotype. By comparing circRNA and gene expression profiles, we identified 6 genes significantly regulated at the level of either circRNA or mRNA gene products, suggesting low overall correlation between circRNA and host gene expression. In conclusion, our results indicate a consistent deregulation of circRNA expression in the hippocampus of BTBR mice. ASD-related circRNAs should be considered in functional studies to identify their contribution to the etiology of the disorder. In addition, as abundant and highly stable molecules, circRNAs represent interesting potential biomarkers for autism.
Subjects
Autism Spectrum Disorder/metabolism , Disease Models, Animal , Hippocampus/metabolism , Mice, Inbred Strains/metabolism , Mice, Mutant Strains/metabolism , RNA, Circular/biosynthesis , RNA, Messenger/biosynthesis , Animals , Autism Spectrum Disorder/genetics , Brain Chemistry , Gene Expression Profiling , Gene Ontology , Humans , Male , Mice, Inbred C57BL , Mice, Inbred Strains/genetics , Mice, Mutant Strains/genetics , Reverse Transcriptase Polymerase Chain Reaction , Species Specificity
ABSTRACT
Background: Gene fusions derive from chromosomal rearrangements. The resulting chimeric transcripts are often endowed with oncogenic potential. Furthermore, they serve as diagnostic tools for the clinical classification of cancer subgroups with different prognoses and, in some cases, they can provide specific drug targets. To date, considerable effort has been devoted to studying gene fusion events occurring in tumor samples. In recent years, the availability of a comprehensive next-generation sequencing dataset for all existing human tumor cell lines has provided the opportunity to further investigate these data in order to identify novel and still uncharacterized gene fusion events. Results: In our work, we have extensively reanalyzed 935 paired-end RNA-sequencing experiments downloaded from the Cancer Cell Line Encyclopedia repository, aiming to identify novel putative cell-line-specific gene fusion events in human malignancies. The bioinformatics analysis was performed by running four gene fusion detection algorithms. The results were further prioritized by running a Bayesian classifier that performs an in silico validation. The collection of fusion events supported by all of the predictive software results in a robust set of ~1,700 in silico predicted novel candidates suitable for downstream analyses. Given the huge amount of data and information produced, the computational results have been systematized in a database named LiGeA. The database can be browsed through a dynamic and interactive web portal, further integrated with validated data from other well-known repositories. Taking advantage of the intuitive query forms, users can easily access, navigate, filter, and select the putative gene fusions for further validation and study. They can also find suitable experimental models for a given fusion of interest. Conclusions: We believe that the LiGeA resource can represent not only the first compendium of both known and putative novel gene fusion events in the catalog of all human malignant cell lines, but also a handy starting point for wet-lab biologists who wish to investigate novel cancer biomarkers and specific drug targets.