Results 1 - 20 of 92
2.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36511586

ABSTRACT

SUMMARY: Codetta is a Python program for predicting the genetic code table of an organism from nucleotide sequences. Codetta can analyze an arbitrary nucleotide sequence and needs no sequence annotation or taxonomic placement. The most likely amino acid decoding for each of the 64 codons is inferred from alignments of profile hidden Markov models of conserved proteins to the input sequence. AVAILABILITY AND IMPLEMENTATION: Codetta 2.0 is implemented as a Python 3 program for MacOS and Linux and is available from http://eddylab.org/software/codetta/codetta2.tar.gz and at http://github.com/kshulgina/codetta. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
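
Illustrative note: the codon inference described in this abstract can be pictured with a toy tally. The sketch below is not Codetta's implementation; it assumes a hypothetical input of (codon, emission probabilities) pairs, one per codon aligned to a profile HMM column, and the function name, input layout, and pseudocount are all assumptions made for illustration. For each codon it sums log emission probabilities per amino acid and reports the best-scoring one.

```python
import math
from collections import defaultdict


def infer_codon_table(aligned_columns, pseudo=1e-3):
    """Pick the best-supported amino acid for each codon.

    aligned_columns: iterable of (codon, {amino_acid: emission_prob}) pairs,
    a hypothetical stand-in for profile HMM columns aligned to codons.
    pseudo: small constant so unseen amino acids do not give log(0).
    """
    aligned_columns = list(aligned_columns)

    # Collect every amino acid mentioned anywhere in the input.
    amino_acids = set()
    for _, emissions in aligned_columns:
        amino_acids.update(emissions)

    # Accumulate a log-probability score per (codon, amino acid).
    scores = defaultdict(lambda: defaultdict(float))
    for codon, emissions in aligned_columns:
        for aa in amino_acids:
            scores[codon][aa] += math.log(emissions.get(aa, 0.0) + pseudo)

    # Report the highest-scoring amino acid for each observed codon.
    return {
        codon: max(aa_scores, key=aa_scores.get)
        for codon, aa_scores in scores.items()
    }


if __name__ == "__main__":
    columns = [
        ("CTG", {"L": 0.7, "S": 0.1}),
        ("CTG", {"S": 0.8, "L": 0.05}),
        ("CTG", {"S": 0.9, "L": 0.02}),
        ("AAA", {"K": 0.9}),
    ]
    print(infer_codon_table(columns))
```

A real analysis would also need to decide when the evidence for a codon is too weak to call, which this toy omits.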


Subject(s)
Genetic Code, Software, Base Sequence
3.
PLoS Comput Biol ; 19(3): e1010971, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36888579

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pcbi.1009492.]

4.
BMC Bioinformatics ; 24(1): 471, 2023 Dec 13.
Article in English | MEDLINE | ID: mdl-38093195

ABSTRACT

BACKGROUND: In canonical protein translation, ribosomes initiate translation at a specific start codon, maintain a single reading frame throughout elongation, and terminate at the first in-frame stop codon. However, ribosomal behavior can deviate at each of these steps, sometimes in a programmed manner. Certain mRNAs contain sequence and structural elements that cause ribosomes to begin translation at alternative start codons, shift reading frame, read through stop codons, or reinitiate on the same mRNA. These processes represent important translational control mechanisms that can allow an mRNA to encode multiple functional protein products or regulate protein expression. The prevalence of these events remains uncertain, due to the difficulty of systematic detection. RESULTS: We have developed a computational model to infer non-canonical translation events from ribosome profiling data. CONCLUSION: ORFeus identifies known examples of alternative open reading frames and recoding events across different organisms and enables transcriptome-wide searches for novel events.


Subject(s)
Ribosomal Frameshifting, Ribosomes, Terminator Codon/genetics, Ribosomes/genetics, Ribosomes/metabolism, Open Reading Frames, Messenger RNA/genetics, Messenger RNA/metabolism, Protein Biosynthesis
5.
PLoS Biol ; 18(11): e3000862, 2020 11.
Article in English | MEDLINE | ID: mdl-33137085

ABSTRACT

Genes for which homologs can be detected only in a limited group of evolutionarily related species, called "lineage-specific genes," are pervasive: Essentially every lineage has them, and they often comprise a sizable fraction of the group's total genes. Lineage-specific genes are often interpreted as "novel" genes, representing genetic novelty born anew within that lineage. Here, we develop a simple method to test an alternative null hypothesis: that lineage-specific genes do have homologs outside of the lineage that, even while evolving at a constant rate in a novelty-free manner, have merely become undetectable by search algorithms used to infer homology. We show that this null hypothesis is sufficient to explain the lack of detected homologs of a large number of lineage-specific genes in fungi and insects. However, we also find that a minority of lineage-specific genes in both clades are not well explained by this novelty-free model. The method provides a simple way of identifying which lineage-specific genes call for special explanations beyond homology detection failure, highlighting them as interesting candidates for further study.


Subject(s)
DNA Sequence Analysis/methods, Nucleic Acid Sequence Homology, Algorithms, Biological Evolution, Molecular Evolution, Fungal Genes/genetics, Insect Genes/genetics, Genetic Models, Phylogeny, Species Specificity, Protein Structural Homology
6.
PLoS Comput Biol ; 18(3): e1009492, 2022 03.
Article in English | MEDLINE | ID: mdl-35255082

ABSTRACT

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
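
Illustrative note: the constraint stated in this abstract (every test sequence less than p% identical to every training sequence) can be shown with a small greedy pass. The sketch below is not either of the paper's algorithms. It assumes the sequences are already aligned to equal length, measures identity as the fraction of matching positions, and uses hypothetical function names. Each sequence goes to the test set if it is dissimilar to all training sequences, otherwise to the training set if it is dissimilar to all test sequences, and is discarded if neither holds.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_dissimilar_split(seqs, p):
    """Split seqs so that no test/train pair is p% identical or more.

    Returns (train, test, discarded). Sequences that are too similar to
    members of both sets cannot be placed and are discarded.
    """
    train, test, discarded = [], [], []
    for s in seqs:
        if not train:
            # Seed the training set with the first sequence.
            train.append(s)
        elif all(percent_identity(s, t) < p for t in train):
            # Dissimilar to every training sequence: safe to test on.
            test.append(s)
        elif all(percent_identity(s, t) < p for t in test):
            # Similar to training data but not to any test sequence.
            train.append(s)
        else:
            discarded.append(s)
    return train, test, discarded


if __name__ == "__main__":
    family = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC",
              "CCCCCCCCCG", "AAAAACCCCC"]
    print(greedy_dissimilar_split(family, p=50))
```

The result depends on input order, and a sequence that bridges both groups is lost, which is the kind of trade-off a more careful splitting algorithm tries to manage.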


Subject(s)
Algorithms, Benchmarking, Sequence Analysis
7.
Nucleic Acids Res ; 49(D1): D192-D200, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33211869

ABSTRACT

Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.


Subject(s)
Nucleic Acid Databases, Metagenome, MicroRNAs/genetics, Bacterial RNA/genetics, Untranslated RNA/genetics, Viral RNA/genetics, Bacteria/genetics, Bacteria/metabolism, Base Pairing, Base Sequence, Humans, Internet, MicroRNAs/classification, MicroRNAs/metabolism, Molecular Sequence Annotation, Nucleic Acid Conformation, Bacterial RNA/classification, Bacterial RNA/metabolism, Untranslated RNA/classification, Untranslated RNA/metabolism, Viral RNA/classification, Viral RNA/metabolism, Sequence Alignment, RNA Sequence Analysis, Software, Viruses/genetics, Viruses/metabolism
8.
Bioinformatics ; 36(10): 3072-3076, 2020 05 01.
Article in English | MEDLINE | ID: mdl-32031582

ABSTRACT

Pairwise sequence covariations are a signal of conserved RNA secondary structure. We describe a method for distinguishing when lack of covariation signal can be taken as evidence against a conserved RNA structure, as opposed to when a sequence alignment merely has insufficient variation to detect covariations. We find that alignments for several long non-coding RNAs previously shown to lack covariation support do have adequate covariation detection power, providing additional evidence against their proposed conserved structures. AVAILABILITY AND IMPLEMENTATION: The R-scape web server is at eddylab.org/R-scape, with a link to download the source code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Long Noncoding RNA, RNA, Algorithms, Conserved Sequence, Nucleic Acid Conformation, RNA/genetics, Sequence Alignment, RNA Sequence Analysis, Software
9.
PLoS Comput Biol ; 16(11): e1008085, 2020 11.
Article in English | MEDLINE | ID: mdl-33253143

ABSTRACT

Most methods for biological sequence homology search and alignment work with primary sequence alone, neglecting higher-order correlations. Recently, statistical physics models called Potts models have been used to infer all-by-all pairwise correlations between sites in deep multiple sequence alignments, and these pairwise couplings have improved 3D structure predictions. Here we extend the use of Potts models from structure prediction to sequence alignment and homology search by developing what we call a hidden Potts model (HPM) that merges a Potts emission process to a generative probability model of insertion and deletion. Because an HPM is incompatible with efficient dynamic programming alignment algorithms, we develop an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test an HPM implementation on RNA structure homology search benchmarks, where we can compare directly to exact alignment methods that capture nested RNA base-pairing correlations (stochastic context-free grammars). HPMs perform promisingly in these proof of principle experiments.
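
Illustrative note: the importance sampling idea mentioned in this abstract can be shown on a toy problem. The sketch below is a generic estimator, not the HPM alignment algorithm. It estimates the sum of an unnormalized weight function over a small discrete state space by drawing states from a simpler proposal distribution and averaging the weight divided by the proposal probability. The function name and the dictionary-based interface are assumptions made for illustration.

```python
import random


def importance_sample_sum(f, proposal_probs, n_samples, rng):
    """Estimate sum_x f(x) by sampling x from a proposal distribution.

    f: function returning a non-negative unnormalized weight for a state.
    proposal_probs: dict mapping each state to its proposal probability;
        it must be positive wherever f is non-zero.
    rng: a random.Random instance, passed in for reproducibility.
    """
    states = list(proposal_probs)
    weights = [proposal_probs[s] for s in states]
    total = 0.0
    for x in rng.choices(states, weights=weights, k=n_samples):
        # Each draw contributes its weight, corrected by how likely
        # the proposal was to produce it.
        total += f(x) / proposal_probs[x]
    return total / n_samples


if __name__ == "__main__":
    target_weights = {"a": 3.0, "b": 1.0, "c": 2.0}  # true sum is 6.0
    proposal = {"a": 0.5, "b": 0.25, "c": 0.25}
    estimate = importance_sample_sum(
        target_weights.get, proposal, 20000, random.Random(0)
    )
    print(estimate)
```

The estimate has lower variance the more closely the proposal tracks the target weights; when the two are exactly proportional, every sample contributes the same value.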


Subject(s)
Statistical Models, Algorithms, Computer Simulation, Likelihood Functions, Nucleic Acid Conformation, RNA Sequence Analysis/methods
10.
Nucleic Acids Res ; 47(D1): D427-D432, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30357350

ABSTRACT

The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors' ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.


Subject(s)
Protein Databases, Proteins/classification, Molecular Sequence Annotation, Protein Domains, Proteins/chemistry, Amino Acid Repetitive Sequences
11.
Nat Methods ; 14(1): 45-48, 2017 01.
Article in English | MEDLINE | ID: mdl-27819659

ABSTRACT

Many functional RNAs have an evolutionarily conserved secondary structure. Conservation of RNA base pairing induces pairwise covariations in sequence alignments. We developed a computational method, R-scape (RNA Structural Covariation Above Phylogenetic Expectation), that quantitatively tests whether covariation analysis supports the presence of a conserved RNA secondary structure. R-scape analysis finds no statistically significant support for proposed secondary structures of the long noncoding RNAs HOTAIR, SRA, and Xist.
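
Illustrative note: pairwise covariation between two alignment columns can be quantified in several ways. The sketch below uses plain mutual information as a simple stand-in; it is not R-scape's statistic and it includes no phylogenetic null model or significance test, which are the substance of the method described in this abstract. The function name and the convention of skipping rows with a gap in either column are assumptions made for illustration.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j.

    alignment: list of equal-length aligned sequences; rows with a gap
    character '-' in either column are ignored.
    """
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0

    pair_counts = Counter(pairs)
    left_counts = Counter(a for a, _ in pairs)
    right_counts = Counter(b for _, b in pairs)

    mi = 0.0
    for (a, b), c in pair_counts.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts.
        mi += (c / n) * math.log2(c * n / (left_counts[a] * right_counts[b]))
    return mi


if __name__ == "__main__":
    toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
    print(column_mutual_information(toy_alignment, 0, 4))  # compensatory pair
    print(column_mutual_information(toy_alignment, 1, 2))  # invariant columns
```

Invariant columns score zero even if they are base paired, which is why a low score alone cannot distinguish an absent structure from an alignment with too little variation.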


Subject(s)
Molecular Evolution, Phylogeny, Long Noncoding RNA/chemistry, Long Noncoding RNA/genetics, Base Pairing, Base Sequence, Humans, Nucleic Acid Conformation
12.
PLoS Comput Biol ; 15(12): e1007560, 2019 12.
Article in English | MEDLINE | ID: mdl-31856220

ABSTRACT

Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they build representations of regulatory genomic sequences. Here we perform systematic experiments on synthetic sequences to reveal how CNN architecture, specifically convolutional filter size and max-pooling, influences the extent that sequence motif representations are learned by first layer filters. We find that CNNs designed to foster hierarchical representation learning of sequence motifs-assembling partial features into whole features in deeper layers-tend to learn distributed representations, i.e. partial motifs. On the other hand, CNNs that are designed to limit the ability to hierarchically build sequence motif representations in deeper layers tend to learn more interpretable localist representations, i.e. whole motifs. We then validate that this representation learning principle established from synthetic sequences generalizes to in vivo sequences.


Subject(s)
Genomics/statistics & numerical data, Computer Neural Networks, Amino Acid Motifs, Binding Sites/genetics, Computational Biology, Computer Simulation, DNA/genetics, Genetic Databases/statistics & numerical data, Deep Learning/statistics & numerical data, Human Genome, Humans, Transcription Factors/chemistry, Transcription Factors/genetics, Transcription Factors/metabolism
13.
Nucleic Acids Res ; 46(15): 7970-7976, 2018 09 06.
Article in English | MEDLINE | ID: mdl-29788499

ABSTRACT

Group I catalytic introns have been found in bacterial, viral, organellar, and some eukaryotic genomes, but not in archaea. All known archaeal introns are bulge-helix-bulge (BHB) introns, with the exception of a few group II introns. It has been proposed that BHB introns arose from extinct group I intron ancestors, much like eukaryotic spliceosomal introns are thought to have descended from group II introns. However, group I introns have little sequence conservation, making them difficult to detect with standard sequence similarity searches. Taking advantage of recent improvements in a computational homology search method that accounts for both conserved sequence and RNA secondary structure, we have identified 39 group I introns in a wide range of archaeal phyla, including examples of group I introns and BHB introns in the same host gene.


Subject(s)
Archaea/genetics, Introns/genetics, Archaeal RNA/genetics, Catalytic RNA/genetics, Archaea/classification, Archaea/enzymology, Base Sequence, Nucleic Acid Conformation, Phylogeny, Archaeal RNA/chemistry, Archaeal RNA/classification, Catalytic RNA/chemistry, Catalytic RNA/classification, Species Specificity
14.
Nucleic Acids Res ; 46(W1): W200-W204, 2018 07 02.
Article in English | MEDLINE | ID: mdl-29905871

ABSTRACT

The HMMER webserver [http://www.ebi.ac.uk/Tools/hmmer] is a free-to-use service which provides fast searches against widely used sequence databases and profile hidden Markov model (HMM) libraries using the HMMER software suite (http://hmmer.org). The results of a sequence search may be summarized in a number of ways, allowing users to view and filter the significant hits by domain architecture or taxonomy. For large scale usage, we provide an application programmatic interface (API) which has been expanded in scope, such that all result presentations are available via both HTML and API. Furthermore, we have refactored our JavaScript visualization library to provide standalone components for different result representations. These consume the aforementioned API and can be integrated into third-party websites. The range of databases that can be searched against has been expanded, adding four sequence datasets (12 in total) and one profile HMM library (6 in total). To help users explore the biological context of their results, and to discover new data resources, search results are now supplemented with cross references to other EMBL-EBI databases.


Subject(s)
Sequence Analysis, Software, Catalytic Domain, Genetic Databases, Internet, Markov Chains, Protein Sequence Analysis, User-Computer Interface
15.
Nucleic Acids Res ; 46(D1): D335-D342, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29112718

ABSTRACT

The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.


Subject(s)
Nucleic Acid Databases, Genome, Untranslated RNA/chemistry, Untranslated RNA/genetics, Humans, Molecular Sequence Annotation, Nucleic Acid Conformation, Untranslated RNA/classification, Sequence Alignment, RNA Sequence Analysis
16.
Nucleic Acids Res ; 44(D1): D81-9, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26612867

ABSTRACT

Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.


Subject(s)
DNA Transposable Elements, DNA/chemistry, Nucleic Acid Databases, Nucleic Acid Repetitive Sequences, Animals, DNA/classification, Genome, Humans, Internet, Markov Chains, Mice, Molecular Sequence Annotation, Sequence Alignment
17.
Nucleic Acids Res ; 44(D1): D279-85, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26673716

ABSTRACT

In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteomes sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) which have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically-generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool.


Subject(s)
Protein Databases, Proteins/classification, Proteome/chemistry, Sequence Alignment, Protein Sequence Analysis, Molecular Sequence Annotation
18.
Nucleic Acids Res ; 43(W1): W30-8, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-25943547

ABSTRACT

The HMMER website, available at http://www.ebi.ac.uk/Tools/hmmer/, provides access to the protein homology search algorithms found in the HMMER software suite. Since the first release of the website in 2011, the search repertoire has been expanded to include the iterative search algorithm, jackhmmer. The continued growth of the target sequence databases means that traditional tabular representations of significant sequence hits can be overwhelming to the user. Consequently, additional ways of presenting homology search results have been developed, allowing them to be summarised according to taxonomic distribution or domain architecture. The taxonomy and domain architecture representations can be used in combination to filter the results according to the needs of a user. Searches can also be restricted prior to submission using a new taxonomic filter, which not only ensures that the results are specific to the requested taxonomic group, but also improves search performance. The repertoire of profile hidden Markov model libraries, which are used for annotation of query sequences with protein families and domains, has been expanded to include the libraries from CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs. Finally, we discuss the relocation of the HMMER webserver to the European Bioinformatics Institute and the potential impact that this will have.


Subject(s)
Amino Acid Sequence Homology, Software, Algorithms, Protein Databases, Internet, Markov Chains, Protein Tertiary Structure, Sequence Alignment, Protein Sequence Analysis
19.
Nucleic Acids Res ; 43(Database issue): D130-7, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25392425

ABSTRACT

The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.


Subject(s)
Nucleic Acid Databases, Untranslated RNA/chemistry, Genomics, Internet, Molecular Sequence Annotation, Nucleic Acid Conformation, Nucleotide Motifs, Long Noncoding RNA/chemistry, Untranslated RNA/classification, Software
20.
PLoS Biol ; 11(1): e1001473, 2013.
Article in English | MEDLINE | ID: mdl-23382650

ABSTRACT

The macronuclear genome of the ciliate Oxytricha trifallax displays an extreme and unique eukaryotic genome architecture with extensive genomic variation. During sexual genome development, the expressed, somatic macronuclear genome is whittled down to the genic portion of a small fraction (∼5%) of its precursor "silent" germline micronuclear genome by a process of "unscrambling" and fragmentation. The tiny macronuclear "nanochromosomes" typically encode single, protein-coding genes (a small portion, 10%, encode 2-8 genes), have minimal noncoding regions, and are differentially amplified to an average of ∼2,000 copies. We report the high-quality genome assembly of ∼16,000 complete nanochromosomes (∼50 Mb haploid genome size) that vary from 469 bp to 66 kb long (mean ∼3.2 kb) and encode ∼18,500 genes. Alternative DNA fragmentation processes ∼10% of the nanochromosomes into multiple isoforms that usually encode complete genes. Nucleotide diversity in the macronucleus is very high (SNP heterozygosity is ∼4.0%), suggesting that Oxytricha trifallax may have one of the largest known effective population sizes of eukaryotes. Comparison to other ciliates with nonscrambled genomes and long macronuclear chromosomes (on the order of 100 kb) suggests several candidate proteins that could be involved in genome rearrangement, including domesticated MULE and IS1595-like DDE transposases. The assembly of the highly fragmented Oxytricha macronuclear genome is the first completed genome with such an unusual architecture. This genome sequence provides tantalizing glimpses into novel molecular biology and evolution. For example, Oxytricha maintains tens of millions of telomeres per cell and has also evolved an intriguing expansion of telomere end-binding proteins. In conjunction with the micronuclear genome in progress, the O. trifallax macronuclear genome will provide an invaluable resource for investigating programmed genome rearrangements, complementing studies of rearrangements arising during evolution and disease.


Subject(s)
Protozoan DNA/genetics, Protozoan Genome/genetics, Oxytricha/genetics, Base Sequence, DNA Copy Number Variations, DNA Fragmentation, Gene Amplification, Gene Rearrangement/genetics, Protozoan Genes, Genetic Variation, Macronucleus/genetics, Molecular Sequence Data, Protein Binding, Messenger RNA/genetics, DNA Sequence Analysis, Telomere/genetics