2.
BMC Bioinformatics ; 24(1): 471, 2023 Dec 13.
Article En | MEDLINE | ID: mdl-38093195

BACKGROUND: In canonical protein translation, ribosomes initiate translation at a specific start codon, maintain a single reading frame throughout elongation, and terminate at the first in-frame stop codon. However, ribosomal behavior can deviate at each of these steps, sometimes in a programmed manner. Certain mRNAs contain sequence and structural elements that cause ribosomes to begin translation at alternative start codons, shift reading frame, read through stop codons, or reinitiate on the same mRNA. These processes represent important translational control mechanisms that can allow an mRNA to encode multiple functional protein products or regulate protein expression. The prevalence of these events remains uncertain, due to the difficulty of systematic detection. RESULTS: We have developed a computational model to infer non-canonical translation events from ribosome profiling data. CONCLUSION: ORFeus identifies known examples of alternative open reading frames and recoding events across different organisms and enables transcriptome-wide searches for novel events.


Frameshifting, Ribosomal , Ribosomes , Codon, Terminator/genetics , Ribosomes/genetics , Ribosomes/metabolism , Open Reading Frames , RNA, Messenger/genetics , RNA, Messenger/metabolism , Protein Biosynthesis
3.
PLoS Comput Biol ; 19(3): e1010971, 2023 Mar.
Article En | MEDLINE | ID: mdl-36888579

[This corrects the article DOI: 10.1371/journal.pcbi.1009492.].

4.
Bioinformatics ; 39(1), 2023 01 01.
Article En | MEDLINE | ID: mdl-36511586

SUMMARY: Codetta is a Python program for predicting the genetic code table of an organism from nucleotide sequences. Codetta can analyze an arbitrary nucleotide sequence and needs no sequence annotation or taxonomic placement. The most likely amino acid decoding for each of the 64 codons is inferred from alignments of profile hidden Markov models of conserved proteins to the input sequence. AVAILABILITY AND IMPLEMENTATION: Codetta 2.0 is implemented as a Python 3 program for MacOS and Linux and is available from http://eddylab.org/software/codetta/codetta2.tar.gz and at http://github.com/kshulgina/codetta. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
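
As an illustration of the idea of choosing the most likely amino acid for each codon from profile alignments, here is a minimal Python sketch. It is not Codetta's implementation: the input format (a list of codon occurrences paired with the emission probabilities of the profile column each was aligned to), the function name `infer_decoding`, and the pseudocount are all assumptions made for the example.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def infer_decoding(aligned_columns, pseudo=1e-3):
    """Pick the most likely amino acid for each codon.

    aligned_columns: iterable of (codon, emissions) pairs, where emissions
    maps amino acid -> probability at the profile column aligned to that
    codon occurrence (hypothetical input format).
    Returns {codon: amino acid with the highest summed log-probability}.
    """
    log_like = defaultdict(lambda: dict.fromkeys(AMINO_ACIDS, 0.0))
    for codon, emissions in aligned_columns:
        for aa in AMINO_ACIDS:
            # The pseudocount keeps log() finite for amino acids absent from a column.
            log_like[codon][aa] += math.log(emissions.get(aa, 0.0) + pseudo)
    return {codon: max(scores, key=scores.get) for codon, scores in log_like.items()}

# Toy evidence: AGG mostly aligns to methionine columns, CGT to arginine columns.
columns = [
    ("AGG", {"M": 0.8, "R": 0.1}),
    ("AGG", {"M": 0.7, "L": 0.2}),
    ("CGT", {"R": 0.9}),
]
decoding = infer_decoding(columns)
```

Summing log-probabilities treats each aligned column as independent evidence, which is a simplification chosen for brevity rather than something stated in the abstract.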


Genetic Code , Software , Base Sequence
5.
Curr Biol ; 32(12): 2632-2639.e2, 2022 06 20.
Article En | MEDLINE | ID: mdl-35588743

Comparisons of genomes of different species are used to identify lineage-specific genes, those genes that appear unique to one species or clade. Lineage-specific genes are often thought to represent genetic novelty that underlies unique adaptations. Identification of these genes depends not only on genome sequences, but also on inferred gene annotations. Comparative analyses typically use available genomes that have been annotated using different methods, increasing the risk that orthologous DNA sequences may be erroneously annotated as a gene in one species but not another, appearing lineage specific as a result. To evaluate the impact of such "annotation heterogeneity," we identified four clades of species with sequenced genomes with more than one publicly available gene annotation, allowing us to compare the number of lineage-specific genes inferred when differing annotation methods are used to those resulting when annotation method is uniform across the clade. In these case studies, annotation heterogeneity increases the apparent number of lineage-specific genes by up to 15-fold, suggesting that annotation heterogeneity is a substantial source of potential artifact.


Genome , Base Sequence , Genome/genetics , Molecular Sequence Annotation
6.
PLoS Comput Biol ; 18(3): e1009492, 2022 03.
Article En | MEDLINE | ID: mdl-35255082

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
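
The guarantee described above (no test sequence reaches p% identity to any training sequence) can be illustrated with a small greedy procedure. This is a simplified sketch and not either of the published algorithms: the names `percent_identity` and `split_dissimilar`, the gap handling, and the random tie-breaking rule are assumptions, and the sequences are assumed to be pre-aligned to equal length.

```python
import random

def percent_identity(a, b):
    """Percent identity of two aligned sequences over columns with no gap in either."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def split_dissimilar(seqs, p, seed=0):
    """Greedily assign sequences so that no train/test pair is >= p% identical.

    Returns (train, test, discarded) as lists of indices into seqs.
    """
    rng = random.Random(seed)
    order = list(range(len(seqs)))
    rng.shuffle(order)
    train, test, discarded = [], [], []
    for i in order:
        near_train = any(percent_identity(seqs[i], seqs[j]) >= p for j in train)
        near_test = any(percent_identity(seqs[i], seqs[j]) >= p for j in test)
        if near_train and near_test:
            discarded.append(i)   # cannot join either side without breaking the rule
        elif near_train:
            train.append(i)       # similar only to training sequences
        elif near_test:
            test.append(i)        # similar only to test sequences
        else:
            (train if rng.random() < 0.5 else test).append(i)
    return train, test, discarded

family = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCA", "GGGGGGGG"]
train, test, discarded = split_dissimilar(family, p=75)
```

A purely random split of `family` could place `ACGTACGT` in training and `ACGTACGA` in test; the greedy rule above keeps such near-duplicates on the same side or discards them.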


Algorithms , Benchmarking , Sequence Analysis
7.
Elife ; 10, 2021 11 09.
Article En | MEDLINE | ID: mdl-34751130

The genetic code has been proposed to be a 'frozen accident,' but the discovery of alternative genetic codes over the past four decades has shown that it can evolve to some degree. Since most examples were found anecdotally, it is difficult to draw general conclusions about the evolutionary trajectories of codon reassignment and why some codons are affected more frequently. To fill in the diversity of genetic codes, we developed Codetta, a computational method to predict the amino acid decoding of each codon from nucleotide sequence data. We surveyed the genetic code usage of over 250,000 bacterial and archaeal genome sequences in GenBank and discovered five new reassignments of arginine codons (AGG, CGA, and CGG), representing the first sense codon changes in bacteria. In a clade of uncultivated Bacilli, the reassignment of AGG to become the dominant methionine codon likely evolved by a change in the amino acid charging of an arginine tRNA. The reassignments of CGA and/or CGG were found in genomes with low GC content, an evolutionary force that likely helped drive these codons to low frequency and enable their reassignment.


All life forms rely on a 'code' to translate their genetic information into proteins. This code relies on limited permutations of three nucleotides, the building blocks that form DNA and other types of genetic information. Each 'triplet' of nucleotides (or codon) encodes a specific amino acid, the basic component of proteins. Reading the sequence of codons in the right order will let the cell know which amino acid to assemble next on a growing protein. For instance, the codon CGG, formed of the nucleotides guanine (G) and cytosine (C), codes for the amino acid arginine. From bacteria to humans, most life forms rely on the same genetic code. Yet certain organisms have evolved to use slightly different codes, where one or several codons have an altered meaning. To better understand how alternative genetic codes have evolved, Shulgina and Eddy set out to find more organisms featuring these altered codons, creating a new software called Codetta that can analyze the genome of a microorganism and predict the genetic code it uses. Codetta was then used to sift through the genetic information of 250,000 microorganisms. This was made possible by the sequencing, in recent years, of the genomes of hundreds of thousands of bacteria and other microorganisms, including many never studied before. These analyses revealed five groups of bacteria with alternative genetic codes, all of which had changes in the codons that code for arginine. Amongst these, four had genomes with a low proportion of guanine and cytosine nucleotides. This may have made some guanine and cytosine-rich arginine codons very rare in these organisms and, therefore, easier to be reassigned to encode another amino acid. The work by Shulgina and Eddy demonstrates that Codetta is a new, useful tool that scientists can use to understand how genetic codes evolve. In addition, it can also help to ensure the accuracy of widely used protein databases, which assume which genetic code organisms use to predict protein sequences from their genomes.


Computational Biology/methods , Evolution, Molecular , Genetic Code , Genetic Techniques/instrumentation , Genome, Archaeal , Genome, Bacterial , Codon/genetics
8.
Nucleic Acids Res ; 49(D1): D192-D200, 2021 01 08.
Article En | MEDLINE | ID: mdl-33211869

Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.


Databases, Nucleic Acid , Metagenome , MicroRNAs/genetics , RNA, Bacterial/genetics , RNA, Untranslated/genetics , RNA, Viral/genetics , Bacteria/genetics , Bacteria/metabolism , Base Pairing , Base Sequence , Humans , Internet , MicroRNAs/classification , MicroRNAs/metabolism , Molecular Sequence Annotation , Nucleic Acid Conformation , RNA, Bacterial/classification , RNA, Bacterial/metabolism , RNA, Untranslated/classification , RNA, Untranslated/metabolism , RNA, Viral/classification , RNA, Viral/metabolism , Sequence Alignment , Sequence Analysis, RNA , Software , Viruses/genetics , Viruses/metabolism
9.
PLoS Comput Biol ; 16(11): e1008085, 2020 11.
Article En | MEDLINE | ID: mdl-33253143

Most methods for biological sequence homology search and alignment work with primary sequence alone, neglecting higher-order correlations. Recently, statistical physics models called Potts models have been used to infer all-by-all pairwise correlations between sites in deep multiple sequence alignments, and these pairwise couplings have improved 3D structure predictions. Here we extend the use of Potts models from structure prediction to sequence alignment and homology search by developing what we call a hidden Potts model (HPM) that merges a Potts emission process to a generative probability model of insertion and deletion. Because an HPM is incompatible with efficient dynamic programming alignment algorithms, we develop an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test an HPM implementation on RNA structure homology search benchmarks, where we can compare directly to exact alignment methods that capture nested RNA base-pairing correlations (stochastic context-free grammars). HPMs perform promisingly in these proof of principle experiments.
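
To make the importance-sampling step concrete, the following Python sketch estimates an intractable sum by drawing states from a simpler proposal distribution and reweighting. It is a generic toy example, not the HPM alignment algorithm: the four-state space, the weights `f`, the proposal `q`, and the function name `importance_estimate` are all assumptions.

```python
import random

def importance_estimate(f, q, n_samples, seed=0):
    """Estimate sum_x f[x] using samples drawn from the proposal distribution q.

    f: unnormalized target weights, one per state.
    q: proposal probabilities (must be > 0 wherever f is > 0, and sum to 1).
    """
    rng = random.Random(seed)
    states = list(range(len(q)))
    draws = rng.choices(states, weights=q, k=n_samples)
    # Each draw contributes f(x) / q(x); the mean of these ratios is an
    # unbiased estimate of the total sum of f.
    return sum(f[x] / q[x] for x in draws) / n_samples

f = [1.0, 2.0, 3.0, 4.0]                      # true sum is 10
uniform = importance_estimate(f, [0.25] * 4, n_samples=20000)
matched = importance_estimate(f, [0.1, 0.2, 0.3, 0.4], n_samples=100)
```

The closer the proposal is to the target, the lower the variance of the estimate; in the `matched` case the proposal is exactly proportional to `f`, so every sample contributes the same ratio.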


Models, Statistical , Algorithms , Computer Simulation , Likelihood Functions , Nucleic Acid Conformation , Sequence Analysis, RNA/methods
10.
PLoS Biol ; 18(11): e3000862, 2020 11.
Article En | MEDLINE | ID: mdl-33137085

Genes for which homologs can be detected only in a limited group of evolutionarily related species, called "lineage-specific genes," are pervasive: Essentially every lineage has them, and they often comprise a sizable fraction of the group's total genes. Lineage-specific genes are often interpreted as "novel" genes, representing genetic novelty born anew within that lineage. Here, we develop a simple method to test an alternative null hypothesis: that lineage-specific genes do have homologs outside of the lineage that, even while evolving at a constant rate in a novelty-free manner, have merely become undetectable by search algorithms used to infer homology. We show that this null hypothesis is sufficient to explain the lack of detected homologs of a large number of lineage-specific genes in fungi and insects. However, we also find that a minority of lineage-specific genes in both clades are not well explained by this novelty-free model. The method provides a simple way of identifying which lineage-specific genes call for special explanations beyond homology detection failure, highlighting them as interesting candidates for further study.
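
One way to picture the homology detection failure hypothesis is to extrapolate how a gene's similarity score decays with evolutionary distance and ask whether the predicted score in a distant species falls below the search threshold. The sketch below assumes an exponential decay, score = a * exp(-b * distance), fitted by least squares on log scores; this functional form, the function names, the example numbers, and the threshold are assumptions for illustration, and the sketch ignores the variance in scores that a full treatment would need.

```python
import math

def fit_decay(distances, scores):
    """Least-squares fit of log(score) = log(a) - b * distance. Returns (a, b)."""
    n = len(distances)
    logs = [math.log(s) for s in scores]
    mean_t = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in distances)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(distances, logs))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_t)
    return a, -slope

def predicted_score(a, b, distance):
    """Expected similarity score at a given evolutionary distance."""
    return a * math.exp(-b * distance)

# Hypothetical scores of one gene's detected homologs in three close species.
distances = [0.0, 0.5, 1.0]
scores = [200.0 * math.exp(-1.5 * t) for t in distances]
a, b = fit_decay(distances, scores)

threshold = 30.0                       # hypothetical detection cutoff
far_score = predicted_score(a, b, 2.0)
undetectable = far_score < threshold   # True: absence is explained without novelty
```

Under this reading, a gene is a candidate for genuine novelty only when its predicted score in the outgroup stays above the threshold and a homolog is still not found.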


Sequence Analysis, DNA/methods , Sequence Homology, Nucleic Acid , Algorithms , Biological Evolution , Evolution, Molecular , Genes, Fungal/genetics , Genes, Insect/genetics , Models, Genetic , Phylogeny , Species Specificity , Structural Homology, Protein
11.
Bioinformatics ; 36(10): 3072-3076, 2020 05 01.
Article En | MEDLINE | ID: mdl-32031582

Pairwise sequence covariations are a signal of conserved RNA secondary structure. We describe a method for distinguishing when lack of covariation signal can be taken as evidence against a conserved RNA structure, as opposed to when a sequence alignment merely has insufficient variation to detect covariations. We find that alignments for several long non-coding RNAs previously shown to lack covariation support do have adequate covariation detection power, providing additional evidence against their proposed conserved structures. AVAILABILITY AND IMPLEMENTATION: The R-scape web server is at eddylab.org/R-scape, with a link to download the source code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


RNA, Long Noncoding , RNA , Algorithms , Conserved Sequence , Nucleic Acid Conformation , RNA/genetics , Sequence Alignment , Sequence Analysis, RNA , Software
12.
Elife ; 9, 2020 01 15.
Article En | MEDLINE | ID: mdl-31939737

The anatomy of many neural circuits is being characterized with increasing resolution, but their molecular properties remain mostly unknown. Here, we characterize gene expression patterns in distinct neural cell types of the Drosophila visual system using genetic lines to access individual cell types, the TAPIN-seq method to measure their transcriptomes, and a probabilistic method to interpret these measurements. We used these tools to build a resource of high-resolution transcriptomes for 100 driver lines covering 67 cell types, available at http://www.opticlobe.com. Combining these transcriptomes with recently reported connectomes helps characterize how information is transmitted and processed across a range of scales, from individual synapses to circuit pathways. We describe examples that include identifying neurotransmitters, including cases of apparent co-release, generating functional hypotheses based on receptor expression, as well as identifying strong commonalities between different cell types.


In the brain, large numbers of different types of neurons connect with each other to form complex networks. In recent years, researchers have made great progress in mapping all the connections between these cells, creating 'wiring diagrams' known as connectomes. However, charting the connections between neurons does not give all the answers as to how the brain works; for example, it does not necessarily reveal the nature of the information two connected cells exchange. Assessing which genes are switched on in different neurons can give insight into neuronal properties that are not obvious from physical connections alone. To fill that knowledge gap, Davis, Nern et al. aimed to measure the genes expressed in a well-characterized network of neurons in the fruit fly visual system. First, 100 fly strains were established, each carrying a single type of neuron colored with a fluorescent marker. Then, a biochemical approach was developed to extract the part of the cell that contains the genetic code from the neurons with the marker. Finally, a statistical tool was used to assess which genes were on in each type of neurons. This led to the creation of a database that shows whether 15,000 genes in each neuron type across 100 fly strains were switched on. Combining this information with previous knowledge about the flies' visual system revealed new information: for example, it helped to understand which chemicals the neurons use to communicate, and whether certain cells activate or inhibit each other. The work by Davis, Nern et al. demonstrates how genetic approaches can complement other methods, and it offers a new tool for other scientists to use in their work. With more advanced genetic methods, it may one day become possible to better grasp how complex brains in other organisms are organized, and how they are disrupted in disease.


Connectome , Genome , Neurons/physiology , Animals , Drosophila/genetics , Drosophila/physiology , Gene Expression , Probability , Transcriptome , Visual Pathways/metabolism
13.
PLoS Comput Biol ; 15(12): e1007560, 2019 12.
Article En | MEDLINE | ID: mdl-31856220

Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they build representations of regulatory genomic sequences. Here we perform systematic experiments on synthetic sequences to reveal how CNN architecture, specifically convolutional filter size and max-pooling, influences the extent to which sequence motif representations are learned by first layer filters. We find that CNNs designed to foster hierarchical representation learning of sequence motifs (assembling partial features into whole features in deeper layers) tend to learn distributed representations, i.e. partial motifs. On the other hand, CNNs that are designed to limit the ability to hierarchically build sequence motif representations in deeper layers tend to learn more interpretable localist representations, i.e. whole motifs. We then validate that this representation learning principle established from synthetic sequences generalizes to in vivo sequences.


Genomics/statistics & numerical data , Neural Networks, Computer , Amino Acid Motifs , Binding Sites/genetics , Computational Biology , Computer Simulation , DNA/genetics , Databases, Genetic/statistics & numerical data , Deep Learning/statistics & numerical data , Genome, Human , Humans , Transcription Factors/chemistry , Transcription Factors/genetics , Transcription Factors/metabolism
14.
Elife ; 8, 2019 11 07.
Article En | MEDLINE | ID: mdl-31697236

The polarized structure of axons and dendrites in neuronal cells depends in part on RNA localization. Previous studies have looked at which polyadenylated RNAs are enriched in neuronal projections or at synapses, but less is known about the distribution of non-adenylated RNAs. By physically dissecting projections from cell bodies of primary rat hippocampal neurons and sequencing total RNA, we found an unexpected set of free circular introns with a non-canonical branchpoint enriched in neuronal projections. These introns appear to be tailless lariats that escape debranching. They lack ribosome occupancy, sequence conservation, and known localization signals, and their function, if any, is not known. Nonetheless, their enrichment in projections has important implications for our understanding of the mechanisms by which RNAs reach distal compartments of asymmetric cells.


Hippocampus/cytology , Introns/genetics , Neurons/metabolism , RNA, Circular/genetics , Animals , Axons/metabolism , Cells, Cultured , Dendrites/metabolism , Female , Gene Expression Profiling , Gene Ontology , High-Throughput Nucleotide Sequencing/methods , Nucleic Acid Conformation , RNA, Circular/chemistry , RNA, Circular/metabolism , Rats, Sprague-Dawley
15.
Nucleic Acids Res ; 47(D1): D427-D432, 2019 01 08.
Article En | MEDLINE | ID: mdl-30357350

The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors' ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.


Databases, Protein , Proteins/classification , Molecular Sequence Annotation , Protein Domains , Proteins/chemistry , Repetitive Sequences, Amino Acid
16.
Nucleic Acids Res ; 46(W1): W200-W204, 2018 07 02.
Article En | MEDLINE | ID: mdl-29905871

The HMMER webserver [http://www.ebi.ac.uk/Tools/hmmer] is a free-to-use service which provides fast searches against widely used sequence databases and profile hidden Markov model (HMM) libraries using the HMMER software suite (http://hmmer.org). The results of a sequence search may be summarized in a number of ways, allowing users to view and filter the significant hits by domain architecture or taxonomy. For large scale usage, we provide an application programmatic interface (API) which has been expanded in scope, such that all result presentations are available via both HTML and API. Furthermore, we have refactored our JavaScript visualization library to provide standalone components for different result representations. These consume the aforementioned API and can be integrated into third-party websites. The range of databases that can be searched against has been expanded, adding four sequence datasets (12 in total) and one profile HMM library (6 in total). To help users explore the biological context of their results, and to discover new data resources, search results are now supplemented with cross references to other EMBL-EBI databases.


Sequence Analysis , Software , Catalytic Domain , Databases, Genetic , Internet , Markov Chains , Sequence Analysis, Protein , User-Computer Interface
17.
Nucleic Acids Res ; 46(15): 7970-7976, 2018 09 06.
Article En | MEDLINE | ID: mdl-29788499

Group I catalytic introns have been found in bacterial, viral, organellar, and some eukaryotic genomes, but not in archaea. All known archaeal introns are bulge-helix-bulge (BHB) introns, with the exception of a few group II introns. It has been proposed that BHB introns arose from extinct group I intron ancestors, much like eukaryotic spliceosomal introns are thought to have descended from group II introns. However, group I introns have little sequence conservation, making them difficult to detect with standard sequence similarity searches. Taking advantage of recent improvements in a computational homology search method that accounts for both conserved sequence and RNA secondary structure, we have identified 39 group I introns in a wide range of archaeal phyla, including examples of group I introns and BHB introns in the same host gene.


Archaea/genetics , Introns/genetics , RNA, Archaeal/genetics , RNA, Catalytic/genetics , Archaea/classification , Archaea/enzymology , Base Sequence , Nucleic Acid Conformation , Phylogeny , RNA, Archaeal/chemistry , RNA, Archaeal/classification , RNA, Catalytic/chemistry , RNA, Catalytic/classification , Species Specificity
18.
Nucleic Acids Res ; 46(D1): D335-D342, 2018 01 04.
Article En | MEDLINE | ID: mdl-29112718

The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.


Databases, Nucleic Acid , Genome , RNA, Untranslated/chemistry , RNA, Untranslated/genetics , Humans , Molecular Sequence Annotation , Nucleic Acid Conformation , RNA, Untranslated/classification , Sequence Alignment , Sequence Analysis, RNA
19.
Curr Biol ; 27(13): R661-R663, 2017 07 10.
Article En | MEDLINE | ID: mdl-28697368

New genes arise from pre-existing genes, but some de novo origin from non-genic sequence also seems plausible. A new study has surprisingly concluded that 25% of random DNA sequences yield beneficial products when expressed in bacteria.

20.
Cell Rep ; 19(8): 1723-1738, 2017 05 23.
Article En | MEDLINE | ID: mdl-28538188

The MALAT1 (Metastasis-Associated Lung Adenocarcinoma Transcript 1) gene encodes a noncoding RNA that is processed into a long nuclear retained transcript (MALAT1) and a small cytoplasmic tRNA-like transcript (mascRNA). Using an RNA sequence- and structure-based covariance model, we identified more than 130 genomic loci in vertebrate genomes containing the MALAT1 3' end triple-helix structure and its immediate downstream tRNA-like structure, including 44 in the green lizard Anolis carolinensis. Structural and computational analyses revealed a co-occurrence of components of the 3' end module. MALAT1-like genes in Anolis carolinensis are highly expressed in adult testis, thus we named them testis-abundant long noncoding RNAs (tancRNAs). MALAT1-like loci also produce multiple small RNA species, including PIWI-interacting RNAs (piRNAs), from the antisense strand. The 3' ends of tancRNAs serve as potential targets for the PIWI-piRNA complex. Thus, we have identified an evolutionarily conserved class of long noncoding RNAs (lncRNAs) with similar structural constraints, post-transcriptional processing, and subcellular localization and a distinct function in spermatocytes.


Genetic Loci , Genome, Human , RNA, Long Noncoding/genetics , Animals , Base Sequence , Cell Nucleus/metabolism , Humans , Lizards/genetics , Male , Nucleic Acid Conformation , Organ Specificity/genetics , RNA, Long Noncoding/chemistry , RNA, Small Interfering/genetics , Spermatocytes/metabolism
...