ABSTRACT
Promoter annotation is an important task in genome analysis. One of the main challenges is locating the border between the promoter region and the transcribed region of the gene, the transcription start site (TSS). The TSS is the reference point used to delimit the DNA sequence responsible for the assembly of the transcription complex. Because the same gene can have more than one TSS, delimiting the promoter region requires locating the TSS closest to the translation start site. This paper presents TSSFinder, new software for predicting the TSS signal of eukaryotic genes that is significantly more accurate than other available software. TSSFinder is currently the only application to offer pre-trained models for six different eukaryotic organisms: Arabidopsis thaliana, Drosophila melanogaster, Gallus gallus, Homo sapiens, Oryza sativa and Saccharomyces cerevisiae. Additionally, the software can be easily customized for specific organisms using only 125 DNA sequences with a validated TSS signal and corresponding genomic locations as a training set. TSSFinder is a valuable new tool for the annotation of genomes. TSSFinder source code and a Docker container can be downloaded from http://tssfinder.github.io. Alternatively, TSSFinder is also available as a web service at http://sucest-fun.org/wsapp/tssfinder/.
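The selection rule described above (among several candidate TSSs, keep the one nearest the translation start) can be sketched as follows. This is only an illustration and not TSSFinder's prediction method: the function name `closest_upstream_tss`, the coordinate convention, and the restriction to candidates upstream of the start codon are assumptions made for the example.

```python
def closest_upstream_tss(tss_positions, start_codon, strand="+"):
    """Return the candidate TSS nearest to the start codon, or None.

    Hypothetical helper: positions are genomic coordinates on the forward
    strand, and only candidates upstream of the start codon are considered
    (an assumption of this sketch).
    """
    if strand == "+":
        # On the forward strand, upstream means a smaller coordinate.
        upstream = [p for p in tss_positions if p <= start_codon]
        return max(upstream, default=None)
    if strand == "-":
        # On the reverse strand, upstream means a larger coordinate.
        upstream = [p for p in tss_positions if p >= start_codon]
        return min(upstream, default=None)
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    candidates = [100, 250, 400]
    print(closest_upstream_tss(candidates, 300, "+"))
    print(closest_upstream_tss(candidates, 300, "-"))
```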
Subject(s)
Computational Biology/methods , Eukaryota/genetics , Genome , Genomics/methods , Promoter Regions, Genetic , Software , Transcription Initiation Site , Algorithms , Databases, Genetic , Reproducibility of Results , Sequence Analysis, DNA , Web Browser
ABSTRACT
Discrete Markovian models can be used to characterize patterns in sequences of values and have many applications in biological sequence analysis, including gene prediction, CpG island detection, alignment, and protein profiling. We present ToPS, a computational framework that can be used to implement different applications in bioinformatics analysis by combining eight kinds of models: (i) independent and identically distributed process; (ii) variable-length Markov chain; (iii) inhomogeneous Markov chain; (iv) hidden Markov model; (v) profile hidden Markov model; (vi) pair hidden Markov model; (vii) generalized hidden Markov model (GHMM); and (viii) similarity-based sequence weighting. The framework includes functionality for training, simulation and decoding of the models. Additionally, it provides two methods to help with parameter setting: the Akaike and Bayesian information criteria (AIC and BIC). The models can be used stand-alone, combined in Bayesian classifiers, or included in more complex, multi-model, probabilistic architectures using GHMMs. In particular, the framework provides a novel, flexible implementation of decoding in GHMMs that detects when the architecture can be traversed efficiently.
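As a rough illustration of how AIC and BIC can guide parameter setting, the sketch below picks the order of a fixed-order Markov chain for a DNA sequence. It does not use the ToPS API. The function names and the choice to score every candidate order on the same positions are assumptions of this example. It uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters and n the number of scored positions.

```python
import math
from collections import Counter


def markov_log_likelihood(seq, order, first):
    """Maximum-likelihood log-likelihood of seq[first:] under a Markov chain.

    `first` must be >= `order` so that every scored position has a full context.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for i in range(first, len(seq)):
        context = seq[i - order:i]
        context_counts[context] += 1
        transition_counts[(context, seq[i])] += 1
    return sum(
        n * math.log(n / context_counts[context])
        for (context, _symbol), n in transition_counts.items()
    )


def information_criteria(seq, order, first, alphabet_size=4):
    """Return (AIC, BIC) for a Markov chain of the given order."""
    log_l = markov_log_likelihood(seq, order, first)
    k = (alphabet_size ** order) * (alphabet_size - 1)  # free parameters
    n = len(seq) - first                                # scored positions
    return 2 * k - 2 * log_l, k * math.log(n) - 2 * log_l


def select_order(seq, max_order, criterion="bic"):
    """Pick the order in 0..max_order that minimizes the chosen criterion."""
    index = {"aic": 0, "bic": 1}[criterion]
    # Score all candidates from position max_order so they are comparable.
    return min(
        range(max_order + 1),
        key=lambda order: information_criteria(seq, order, max_order)[index],
    )


if __name__ == "__main__":
    periodic = "ACGT" * 50
    print(select_order(periodic, max_order=2, criterion="aic"))
    print(select_order(periodic, max_order=2, criterion="bic"))
```

Both criteria penalize the exponential growth in parameters with the order, so a higher order is chosen only when it improves the likelihood enough to offset the penalty.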
Subject(s)
Computational Biology/methods , Markov Chains , Sequence Analysis/methods , Bayes Theorem , CpG Islands/genetics
ABSTRACT
BACKGROUND: The implication of post-transcriptional regulation by microRNAs in the molecular mechanisms underlying cancer is well documented. However, their interference at the cellular level is not fully explored. Functional in vitro studies are fundamental for the comprehension of their role; nevertheless, results are highly dependent on the adopted cellular model. Next-generation small RNA transcriptomic sequencing data of a tumor cell line and keratinocytes derived from primary culture were generated in order to characterize the microRNA content of these systems, thus helping in their understanding. Both constitute cell models for functional studies of microRNAs in head and neck squamous cell carcinoma (HNSCC), a smoking-related cancer. Known microRNAs were quantified and analyzed in the context of gene regulation. New microRNAs were investigated using similarity and structural search, ab initio classification, and prediction of the location of mature microRNAs within would-be precursor sequences. Results were compared with small RNA transcriptomic sequences from HNSCC samples in order to assess the applicability of these cell models for cancer phenotype comprehension and for novel molecule discovery. RESULTS: Ten miRNAs represented over 70% of the mature molecules present in each of the cell types. The most expressed molecules were miR-21, miR-24 and miR-205. Accordingly, miR-21 and miR-205 have been previously shown to play a role in epithelial cell biology. Although miR-21 has been implicated in cancer development, and evaluated as a biomarker in HNSCC progression, no significant expression differences were seen between cell types. We demonstrate that differentially expressed mature miRNAs target biological processes related to cell differentiation and apoptosis, indicating that they might represent, with acceptable accuracy, the genetic context from which they derive.
Most miRNAs identified in the cancer cell line and in keratinocytes were present in tumor samples and cancer-free samples, respectively, with miR-21, miR-24 and miR-205 still among the most prevalent molecules in all instances. Thirteen miRNA-like structures, containing reads identified by the deep sequencing, were predicted from putative miRNA precursor sequences. Strong evidence suggests that one of them could be a new miRNA. This molecule was mostly expressed in the tumor cell line and HNSCC samples, indicating a possible biological function in cancer. CONCLUSIONS: Critical biological features of cells must be fully understood before they can be chosen as models for functional studies. Expression levels of miRNAs relate to cell type and tissue context. This study provides insights into the miRNA content of two cell models used for cancer research. Pathways commonly deregulated in HNSCC might be targeted by the most expressed and also by differentially expressed miRNAs. Results indicate that the use of cell models for cancer research demands careful assessment of underlying molecular characteristics for proper data interpretation. Additionally, one new miRNA-like molecule with a potential role in cancer was identified in the cell lines and clinical samples.
Subject(s)
Carcinoma, Squamous Cell/genetics , Head and Neck Neoplasms/genetics , MicroRNAs/metabolism , RNA/metabolism , Aged , Aged, 80 and over , Carcinoma, Squamous Cell/pathology , Cells, Cultured , Female , Gene Expression Regulation, Neoplastic , Gene Library , Head and Neck Neoplasms/pathology , High-Throughput Nucleotide Sequencing , Humans , Keratinocytes/metabolism , Male , Middle Aged , Principal Component Analysis , RNA/chemistry , RNA, Messenger/metabolism , Sequence Analysis, RNA , Squamous Cell Carcinoma of Head and Neck , Transcriptome
ABSTRACT
Large-scale transcriptome projects have shown that the number of RNA transcripts not coding for proteins (non-coding RNAs) is much larger than previously recognized. High-throughput technologies, coupled with bioinformatics approaches, have produced increasing amounts of data, highlighting the role of non-coding RNAs (ncRNAs) in biological processes. Data generated by these studies include diverse non-coding RNA classes from organisms of different kingdoms, which were obtained using different experimental and computational assays. This has led to a rapid increase in specialized RNA databases. The fast growth in the number of available databases makes integration of stored information a difficult task. We present here NRDR, a Non-coding RNA Databases Resource for information retrieval on ncRNA databases (www.ncrnadatabases.org). We performed a survey of 102 public databases on ncRNAs and introduced four categorizations to classify these databases and to help researchers quickly search and find the information they need: RNA family, information source, information content and available search mechanisms. NRDR is a useful database search tool that will facilitate research on ncRNAs.
Subject(s)
Molecular Sequence Annotation , RNA, Untranslated/genetics , Transcription, Genetic , Computational Biology/methods , Databases, Nucleic Acid , Internet , RNA, Untranslated/classification
ABSTRACT
This chapter provides two main contributions: (1) a description of computational tools and databases used to identify and analyze transposable elements (TEs) and circular RNAs (circRNAs) in plants; and (2) data analysis on public TE and circRNA data. Our goal is to highlight the primary information available in the literature on circular noncoding RNAs and transposable elements in plants. The exploratory analysis performed on publicly available circRNA and TE data helps discuss four sequence features. Finally, we investigate the association between circRNAs and TEs in the model plant Arabidopsis thaliana.
Subject(s)
Arabidopsis , DNA Transposable Elements , Arabidopsis/genetics , Computational Biology , DNA Transposable Elements/genetics , Plants/genetics , RNA, Circular
ABSTRACT
BACKGROUND: A large number of probabilistic models used in sequence analysis assign non-zero probability values to most input sequences. The most common way to decide whether a given probability is sufficient is Bayesian binary classification, where the probability of the model characterizing the sequence family of interest is compared to that of an alternative probability model. A null model can serve as the alternative model. This is the scoring technique used by sequence analysis tools such as HMMER, SAM and INFERNAL. The most prevalent null models are position-independent residue distributions, which include the uniform distribution, the genomic distribution, the family-specific distribution and the target sequence distribution. This paper presents a study to evaluate the impact of the choice of null model on the final result of classifications. In particular, we are interested in minimizing the number of false predictions in a classification. This is a crucial issue for reducing the costs of biological validation. RESULTS: For all the tests, the target null model presented the lowest number of false positives when using random sequences as a test. The study was performed on DNA sequences using GC content as the measure of content bias, but the results should also be valid for protein sequences. To broaden the application of the results, the study was performed using randomly generated sequences. Previous studies were performed on amino acid sequences, using only one probabilistic model (HMM) and a specific benchmark, and lack more general conclusions about the performance of null models. Finally, a benchmark test with P. falciparum confirmed these results. CONCLUSIONS: Of the evaluated models, the best suited for classification are the uniform model and the target model. However, the use of the uniform model presents a GC bias that can cause more false positives for candidate sequences with extreme compositional bias, a characteristic not described in previous studies.
In these cases the target model is more dependable for biological validation due to its higher specificity.
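The scoring scheme discussed in this abstract, a log-odds comparison between a family model and a null model, can be sketched in a few lines. The example below is a toy and not the study's experimental setup: the family model is reduced to a position-independent distribution (real tools such as HMMER use HMMs), and the function names, the pseudocount, and the example sequence are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def log_prob_iid(seq, dist):
    """Log-probability of seq under a position-independent distribution."""
    return sum(math.log(dist[base]) for base in seq)


def uniform_null():
    """Null model assigning equal probability to every base."""
    return {base: 1.0 / len(ALPHABET) for base in ALPHABET}


def target_null(seq, pseudocount=1.0):
    """Null model estimated from the composition of the scored sequence."""
    counts = Counter(seq)
    total = len(seq) + pseudocount * len(ALPHABET)
    return {base: (counts[base] + pseudocount) / total for base in ALPHABET}


def log_odds(seq, family_model, null_model):
    """Positive scores favor the family model over the null model."""
    return log_prob_iid(seq, family_model) - log_prob_iid(seq, null_model)


if __name__ == "__main__":
    # Hypothetical GC-rich family model and an unrelated GC-rich candidate.
    family = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}
    candidate = "GCGGCCGCGC" * 5
    print(log_odds(candidate, family, uniform_null()))
    print(log_odds(candidate, family, target_null(candidate)))
```

In this toy case the candidate scores above zero against the uniform null purely because of its base composition, while the target null absorbs that composition and the score drops below zero. This mirrors the GC-bias effect the abstract attributes to the uniform model.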
Subject(s)
Base Sequence/genetics , Classification/methods , Models, Statistical , Sequence Analysis, DNA/methods , Base Composition , Bayes Theorem , Likelihood Functions , Plasmodium falciparum/genetics , ROC Curve , Research Design , Sensitivity and Specificity
ABSTRACT
Hepatitis C virus (HCV) exhibits considerable genetic diversity, but presents a relatively well conserved 5' noncoding region (5' NCR) among all genotypes. In this study, the structural features and translational efficiency of HCV 5' NCR sequences were analyzed using the programs RNAfold, RNAshapes and RNApdist and a bicistronic dual luciferase expression system, respectively. RNA structure prediction software indicated that base substitutions can potentially alter the 5' NCR structure. The sequence heterogeneity observed in the 5' NCR led to important changes in translation efficiency in different cell culture lines. Interactions of the viral RNA with cellular trans-acting factors may vary according to the cell type and viral genome polymorphisms, which may account for the translational efficiency observed.
Subject(s)
5' Untranslated Regions , Hepacivirus/classification , Hepacivirus/genetics , Hepatitis C, Chronic/virology , Aged , Base Sequence , Cell Line , Female , Genes, Reporter , Hepacivirus/isolation & purification , Humans , Luciferases/metabolism , Male , Middle Aged , Models, Molecular , Molecular Sequence Data , Nucleic Acid Conformation , Protein Biosynthesis , RNA, Viral/genetics , Sequence Analysis, DNA
ABSTRACT
One of the most important resources for researchers of noncoding RNAs is the information available in public databases spread over the internet. However, the effective exploration of this data can be a daunting task, given the large number of databases available and the variety of stored data. This chapter describes a classification of databases based on information source, type of RNA, source organisms, data formats, and the mechanisms for information retrieval, detailing the relevance of each of these classifications and its usability by researchers. This classification is used to update a 2012 review, now indexing more than 229 public databases. This review will include an assessment of the new trends for ncRNA research based on the information that is being offered by the databases. Additionally, we will expand the previous analysis, focusing on the usability and application of these databases in pathogen and disease research. Finally, this chapter will analyze how currently available database schemas can help the development of new and improved web resources.
Subject(s)
Computational Biology/methods , Databases, Nucleic Acid/trends , Information Storage and Retrieval/trends , RNA, Untranslated/genetics , Computational Biology/trends , Databases, Nucleic Acid/statistics & numerical data , Datasets as Topic , Humans , Information Storage and Retrieval/statistics & numerical data
ABSTRACT
BACKGROUND: Sugarcane cultivars are polyploid interspecific hybrids of giant genomes, typically with 10-13 sets of chromosomes from 2 Saccharum species. The ploidy, hybridity, and size of the genome, estimated to have >10 Gb, pose a challenge for sequencing. RESULTS: Here we present a gene space assembly of SP80-3280, including 373,869 putative genes and their potential regulatory regions. The alignment of single-copy genes in diploid grasses to the putative genes indicates that we could resolve 2-6 (up to 15) putative homo(eo)logs that are 99.1% identical within their coding sequences. Dissimilarities increase in their regulatory regions, and gene promoter analysis shows differences in regulatory elements within gene families that are expressed in a species-specific manner. We exemplify these differences for sucrose synthase (SuSy) and phenylalanine ammonia-lyase (PAL), 2 gene families central to carbon partitioning. SP80-3280 has particular regulatory elements involved in sucrose synthesis not found in the ancestor Saccharum spontaneum. PAL regulatory elements are found in co-expressed genes related to fiber synthesis within gene networks defined during plant growth and maturation. Comparison with sorghum reveals predominantly bi-allelic variations in sugarcane, consistent with the formation of 2 "subgenomes" after their divergence ∼3.8-4.6 million years ago and reveals single-nucleotide variants that may underlie their differences. CONCLUSIONS: This assembly represents a large step towards a whole-genome assembly of a commercial sugarcane cultivar. It includes a rich diversity of genes and homo(eo)logous resolution for a representative fraction of the gene space, relevant to improve biomass and food production.
Subject(s)
Contig Mapping/methods , Glucosyltransferases/genetics , Phenylalanine Ammonia-Lyase/genetics , Saccharum/growth & development , Biomass , Crops, Agricultural/genetics , Crops, Agricultural/growth & development , Genetic Variation , Genome Size , Genome, Plant , Multigene Family , Plant Proteins/genetics , Polyploidy , Promoter Regions, Genetic , Saccharum/genetics
ABSTRACT
BACKGROUND: Small non-coding regulatory RNAs control cellular functions at the transcriptional and post-transcriptional levels. Oral squamous cell carcinoma is among the leading cancers in the world and the presence of cervical lymph node metastases is currently its strongest prognostic factor. In this work we aimed at finding small RNAs expressed in oral squamous cell carcinoma that could be associated with the presence of lymph node metastasis. METHODS: Small RNA libraries from metastatic and non-metastatic oral squamous cell carcinomas were sequenced for the identification and quantification of known small RNAs. Selected markers were validated in plasma samples. Additionally, we used in silico analysis to investigate possible new molecules, not previously described, involved in the metastatic process. RESULTS: Global expression patterns were not associated with cervical metastases. MiR-21, miR-203 and miR-205 were highly expressed throughout samples, in agreement with their role in epithelial cell biology, but disagreeing with studies correlating these molecules with cancer invasion. Eighteen microRNAs, but no other small RNA class, varied consistently between metastatic and non-metastatic samples. Nine of these microRNAs had been previously detected in human plasma, eight of which presented consistent results between tissue and plasma samples. MiR-31 and miR-130b, known to inhibit several steps in the metastatic process, were over-expressed in non-metastatic samples and the expression of miR-130b was confirmed in plasma of patients showing no metastasis. MiR-181 and miR-296 were detected in metastatic tumors and the expression of miR-296 was confirmed in plasma of patients presenting metastasis. A novel microRNA-like molecule was also associated with non-metastatic samples, potentially targeting cell-signaling mechanisms. 
CONCLUSIONS: We corroborate literature data on the role of small RNAs in cancer metastasis and suggest the detection of microRNAs as a tool that may assist in the evaluation of oral squamous cell carcinoma metastatic potential.
Subject(s)
Carcinoma, Squamous Cell/genetics , Carcinoma, Squamous Cell/pathology , Gene Expression Profiling , MicroRNAs/genetics , Mouth Neoplasms/genetics , Mouth Neoplasms/pathology , Aged , Aged, 80 and over , Base Sequence , Biomarkers, Tumor/blood , Biomarkers, Tumor/genetics , Carcinoma, Squamous Cell/blood , Female , High-Throughput Nucleotide Sequencing , Humans , Lymphatic Metastasis , Male , MicroRNAs/blood , Middle Aged , Molecular Sequence Data , Mouth Neoplasms/blood , Neoplasm Staging , Sequence Analysis, RNA
ABSTRACT
Non-protein-coding RNAs (ncRNAs) are a research hotspot in bioinformatics. Recent discoveries have revealed new ncRNA families performing a variety of roles, from gene expression regulation to catalytic activities. It is also believed that other families are still to be unveiled. Computational methods developed for protein-coding genes often fail when searching for ncRNAs. The function of noncoding RNAs often depends heavily on their secondary structure, which makes ncRNA gene discovery very different from the discovery of protein-coding genes. This motivated the development of specific methods for ncRNA research. This article reviews the main approaches used to identify ncRNAs and predict secondary structure.