2.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36511586

ABSTRACT

SUMMARY: Codetta is a Python program for predicting the genetic code table of an organism from nucleotide sequences. Codetta can analyze an arbitrary nucleotide sequence and needs no sequence annotation or taxonomic placement. The most likely amino acid decoding for each of the 64 codons is inferred from alignments of profile hidden Markov models of conserved proteins to the input sequence. AVAILABILITY AND IMPLEMENTATION: Codetta 2.0 is implemented as a Python 3 program for MacOS and Linux and is available from http://eddylab.org/software/codetta/codetta2.tar.gz and at http://github.com/kshulgina/codetta. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
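
The inference step described above can be illustrated with a minimal sketch. It is not Codetta's implementation: it assumes we already have, for each aligned codon, the amino acid emission probabilities of the profile HMM column it aligned to. The function name `infer_codon_table`, the floor probability `1e-4`, and the `min_log_odds` threshold are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def infer_codon_table(aligned_columns, min_log_odds=2.0):
    """Pick the best-supported amino acid for each observed codon.

    aligned_columns: iterable of (codon, emission_probs) pairs, where
    emission_probs maps amino acid -> probability at the profile column
    that the codon aligned to (hypothetical input format).
    Returns {codon: amino acid}, or '?' when the evidence is too weak.
    """
    log_like = defaultdict(lambda: defaultdict(float))
    for codon, probs in aligned_columns:
        for aa in AMINO_ACIDS:
            # Floor missing probabilities so the logarithm stays finite.
            log_like[codon][aa] += math.log(probs.get(aa, 1e-4))

    table = {}
    for codon, scores in log_like.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        (best, top), (_, runner_up) = ranked[0], ranked[1]
        # Only call a codon when it clearly beats the second-best amino acid.
        table[codon] = best if top - runner_up >= min_log_odds else "?"
    return table

columns = [("CTG", {"S": 0.9, "L": 0.05})] * 5 + [("AAA", {"K": 0.5, "R": 0.5})]
print(infer_codon_table(columns))
```

In this toy input, CTG aligns five times to serine-rich columns and is called as S, while the single ambiguous AAA observation is left uncalled.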


Subjects
Genetic Code, Software, Base Sequence
3.
PLoS Comput Biol ; 19(3): e1010971, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36888579

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pcbi.1009492.].

4.
BMC Bioinformatics ; 24(1): 471, 2023 Dec 13.
Article in English | MEDLINE | ID: mdl-38093195

ABSTRACT

BACKGROUND: In canonical protein translation, ribosomes initiate translation at a specific start codon, maintain a single reading frame throughout elongation, and terminate at the first in-frame stop codon. However, ribosomal behavior can deviate at each of these steps, sometimes in a programmed manner. Certain mRNAs contain sequence and structural elements that cause ribosomes to begin translation at alternative start codons, shift reading frame, read through stop codons, or reinitiate on the same mRNA. These processes represent important translational control mechanisms that can allow an mRNA to encode multiple functional protein products or regulate protein expression. The prevalence of these events remains uncertain, due to the difficulty of systematic detection. RESULTS: We have developed a computational model to infer non-canonical translation events from ribosome profiling data. CONCLUSION: ORFeus identifies known examples of alternative open reading frames and recoding events across different organisms and enables transcriptome-wide searches for novel events.


Subjects
Ribosomal Frameshifting, Ribosomes, Terminator Codon/genetics, Ribosomes/genetics, Ribosomes/metabolism, Open Reading Frames, Messenger RNA/genetics, Messenger RNA/metabolism, Protein Biosynthesis
5.
PLoS Biol ; 18(11): e3000862, 2020 11.
Article in English | MEDLINE | ID: mdl-33137085

ABSTRACT

Genes for which homologs can be detected only in a limited group of evolutionarily related species, called "lineage-specific genes," are pervasive: Essentially every lineage has them, and they often comprise a sizable fraction of the group's total genes. Lineage-specific genes are often interpreted as "novel" genes, representing genetic novelty born anew within that lineage. Here, we develop a simple method to test an alternative null hypothesis: that lineage-specific genes do have homologs outside of the lineage that, even while evolving at a constant rate in a novelty-free manner, have merely become undetectable by search algorithms used to infer homology. We show that this null hypothesis is sufficient to explain the lack of detected homologs of a large number of lineage-specific genes in fungi and insects. However, we also find that a minority of lineage-specific genes in both clades are not well explained by this novelty-free model. The method provides a simple way of identifying which lineage-specific genes call for special explanations beyond homology detection failure, highlighting them as interesting candidates for further study.


Subjects
DNA Sequence Analysis/methods, Nucleic Acid Sequence Homology, Algorithms, Biological Evolution, Molecular Evolution, Fungal Genes/genetics, Insect Genes/genetics, Genetic Models, Phylogeny, Species Specificity, Protein Structural Homology
6.
PLoS Comput Biol ; 18(3): e1009492, 2022 03.
Article in English | MEDLINE | ID: mdl-35255082

ABSTRACT

Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
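
The constraint stated above (no test sequence at or above p% identity to any training sequence) can be illustrated with a simple greedy heuristic. This is a sketch only, not either of the two algorithms the abstract describes. The function names, the gap handling in `percent_identity`, and the rule of discarding sequences that are close to both sets are assumptions made for illustration.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def split_family(seqs, p):
    """Greedily split aligned sequences into dissimilar train and test sets.

    seqs: {name: aligned sequence}; p: identity threshold in percent.
    Invariant: no train/test pair is ever >= p% identical, because a
    sequence only joins a set when it is far from the opposite set.
    """
    train, test, discarded = [], [], []
    for name, seq in seqs.items():
        near_train = any(percent_identity(seq, seqs[t]) >= p for t in train)
        near_test = any(percent_identity(seq, seqs[t]) >= p for t in test)
        if near_train and near_test:
            discarded.append(name)   # would bridge the two sets
        elif near_train:
            train.append(name)
        elif near_test:
            test.append(name)
        else:
            # Unconstrained: put it in the smaller set to keep some balance.
            (train if len(train) <= len(test) else test).append(name)
    return train, test, discarded

family = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTCCCC", "d": "TTTTCCCG"}
train, test, discarded = split_family(family, p=75)
print(train, test, discarded)
```

Discarding bridging sequences trades dataset size for a guaranteed separation. A family in which every sequence is close to every other cannot be split this way at all.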


Subjects
Algorithms, Benchmarking, Sequence Analysis
7.
Nucleic Acids Res ; 49(D1): D192-D200, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33211869

ABSTRACT

Rfam is a database of RNA families where each of the 3444 families is represented by a multiple sequence alignment of known RNA sequences and a covariance model that can be used to search for additional members of the family. Recent developments have involved expert collaborations to improve the quality and coverage of Rfam data, focusing on microRNAs, viral and bacterial RNAs. We have completed the first phase of synchronising microRNA families in Rfam and miRBase, creating 356 new Rfam families and updating 40. We established a procedure for comprehensive annotation of viral RNA families starting with Flavivirus and Coronaviridae RNAs. We have also increased the coverage of bacterial and metagenome-based RNA families from the ZWD database. These developments have enabled a significant growth of the database, with the addition of 759 new families in Rfam 14. To facilitate further community contribution to Rfam, expert users are now able to build and submit new families using the newly developed Rfam Cloud family curation system. New Rfam website features include a new sequence similarity search powered by RNAcentral, as well as search and visualisation of families with pseudoknots. Rfam is freely available at https://rfam.org.


Subjects
Nucleic Acid Databases, Metagenome, MicroRNAs/genetics, Bacterial RNA/genetics, Untranslated RNA/genetics, Viral RNA/genetics, Bacteria/genetics, Bacteria/metabolism, Base Pairing, Base Sequence, Humans, Internet, MicroRNAs/classification, MicroRNAs/metabolism, Molecular Sequence Annotation, Nucleic Acid Conformation, Bacterial RNA/classification, Bacterial RNA/metabolism, Untranslated RNA/classification, Untranslated RNA/metabolism, Viral RNA/classification, Viral RNA/metabolism, Sequence Alignment, RNA Sequence Analysis, Software, Viruses/genetics, Viruses/metabolism
8.
Bioinformatics ; 36(10): 3072-3076, 2020 05 01.
Article in English | MEDLINE | ID: mdl-32031582

ABSTRACT

Pairwise sequence covariations are a signal of conserved RNA secondary structure. We describe a method for distinguishing when lack of covariation signal can be taken as evidence against a conserved RNA structure, as opposed to when a sequence alignment merely has insufficient variation to detect covariations. We find that alignments for several long non-coding RNAs previously shown to lack covariation support do have adequate covariation detection power, providing additional evidence against their proposed conserved structures. AVAILABILITY AND IMPLEMENTATION: The R-scape web server is at eddylab.org/R-scape, with a link to download the source code. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Long Noncoding RNA, RNA, Algorithms, Conserved Sequence, Nucleic Acid Conformation, RNA/genetics, Sequence Alignment, RNA Sequence Analysis, Software
9.
PLoS Comput Biol ; 16(11): e1008085, 2020 11.
Article in English | MEDLINE | ID: mdl-33253143

ABSTRACT

Most methods for biological sequence homology search and alignment work with primary sequence alone, neglecting higher-order correlations. Recently, statistical physics models called Potts models have been used to infer all-by-all pairwise correlations between sites in deep multiple sequence alignments, and these pairwise couplings have improved 3D structure predictions. Here we extend the use of Potts models from structure prediction to sequence alignment and homology search by developing what we call a hidden Potts model (HPM) that merges a Potts emission process to a generative probability model of insertion and deletion. Because an HPM is incompatible with efficient dynamic programming alignment algorithms, we develop an approximate algorithm based on importance sampling, using simpler probabilistic models as proposal distributions. We test an HPM implementation on RNA structure homology search benchmarks, where we can compare directly to exact alignment methods that capture nested RNA base-pairing correlations (stochastic context-free grammars). HPMs perform promisingly in these proof of principle experiments.
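
The importance sampling idea mentioned above can be written in its generic textbook form. The notation is an assumption, not taken from the paper: x is the query sequence, π an alignment of x to the model, P the joint probability under the hidden Potts model, and Q a proposal distribution over alignments drawn from a simpler model.

```latex
P(x) \;=\; \sum_{\pi} P(x, \pi)
     \;=\; \mathbb{E}_{\pi \sim Q(\cdot \mid x)}\!\left[ \frac{P(x, \pi)}{Q(\pi \mid x)} \right]
     \;\approx\; \frac{1}{N} \sum_{i=1}^{N} \frac{P\big(x, \pi^{(i)}\big)}{Q\big(\pi^{(i)} \mid x\big)},
\qquad \pi^{(i)} \sim Q(\cdot \mid x).
```

The estimate is unbiased provided Q assigns nonzero probability to every alignment that P does. Its variance falls as Q comes to resemble the true posterior over alignments, which is why a reasonably accurate simpler model makes a useful proposal.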


Subjects
Statistical Models, Algorithms, Computer Simulation, Likelihood Functions, Nucleic Acid Conformation, RNA Sequence Analysis/methods
10.
Nucleic Acids Res ; 47(D1): D427-D432, 2019 01 08.
Article in English | MEDLINE | ID: mdl-30357350

ABSTRACT

The last few years have witnessed significant changes in Pfam (https://pfam.xfam.org). The number of families has grown substantially to a total of 17,929 in release 32.0. New additions have been coupled with efforts to improve existing families, including refinement of domain boundaries, their classification into Pfam clans, as well as their functional annotation. We recently began to collaborate with the RepeatsDB resource to improve the definition of tandem repeat families within Pfam. We carried out a significant comparison to the structural classification database, namely the Evolutionary Classification of Protein Domains (ECOD) that led to the creation of 825 new families based on their set of uncharacterized families (EUFs). Furthermore, we also connected Pfam entries to the Sequence Ontology (SO) through mapping of the Pfam type definitions to SO terms. Since Pfam has many community contributors, we recently enabled the linking between authorship of all Pfam entries with the corresponding authors' ORCID identifiers. This effectively permits authors to claim credit for their Pfam curation and link them to their ORCID record.


Subjects
Protein Databases, Proteins/classification, Molecular Sequence Annotation, Protein Domains, Proteins/chemistry, Amino Acid Repetitive Sequences
11.
Nat Methods ; 14(1): 45-48, 2017 01.
Article in English | MEDLINE | ID: mdl-27819659

ABSTRACT

Many functional RNAs have an evolutionarily conserved secondary structure. Conservation of RNA base pairing induces pairwise covariations in sequence alignments. We developed a computational method, R-scape (RNA Structural Covariation Above Phylogenetic Expectation), that quantitatively tests whether covariation analysis supports the presence of a conserved RNA secondary structure. R-scape analysis finds no statistically significant support for proposed secondary structures of the long noncoding RNAs HOTAIR, SRA, and Xist.
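
To make "pairwise covariation" concrete, the sketch below scores two alignment columns with a G-test statistic, one common covariation measure. It is a simplified illustration under stated assumptions, not R-scape itself: it leaves out the phylogeny-aware null model that the method's name refers to, and the function name and gap handling are hypothetical.

```python
import math
from collections import Counter

def g_test_covariation(col_i, col_j):
    """G-test statistic for two alignment columns given as equal-length strings.

    G = 2 * sum_ab Obs_ab * ln(Obs_ab / Exp_ab), with Exp_ab computed from
    the marginal residue counts. Rows with a gap in either column are skipped.
    """
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    g = 0.0
    for (a, b), observed in joint.items():
        expected = left[a] * right[b] / n
        g += observed * math.log(observed / expected)
    return 2.0 * g

covarying = g_test_covariation("GGCCAAUU", "CCGGUUAA")  # compensatory changes
conserved = g_test_covariation("GGGGGGGG", "CCCCCCCC")  # invariant base pair
print(covarying, conserved)
```

The invariant pair scores zero even though it could be a perfectly good base pair. A raw score of zero therefore says nothing about structure when the alignment has too little variation, and a statistical test against a null model is needed to interpret any score.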


Subjects
Molecular Evolution, Phylogeny, Long Noncoding RNA/chemistry, Long Noncoding RNA/genetics, Base Pairing, Base Sequence, Humans, Nucleic Acid Conformation
12.
PLoS Comput Biol ; 15(12): e1007560, 2019 12.
Article in English | MEDLINE | ID: mdl-31856220

ABSTRACT

Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they build representations of regulatory genomic sequences. Here we perform systematic experiments on synthetic sequences to reveal how CNN architecture, specifically convolutional filter size and max-pooling, influences the extent that sequence motif representations are learned by first layer filters. We find that CNNs designed to foster hierarchical representation learning of sequence motifs-assembling partial features into whole features in deeper layers-tend to learn distributed representations, i.e. partial motifs. On the other hand, CNNs that are designed to limit the ability to hierarchically build sequence motif representations in deeper layers tend to learn more interpretable localist representations, i.e. whole motifs. We then validate that this representation learning principle established from synthetic sequences generalizes to in vivo sequences.


Subjects
Genomics/statistics & numerical data, Neural Networks (Computer), Amino Acid Motifs, Binding Sites/genetics, Computational Biology, Computer Simulation, DNA/genetics, Genetic Databases/statistics & numerical data, Deep Learning/statistics & numerical data, Human Genome, Humans, Transcription Factors/chemistry, Transcription Factors/genetics, Transcription Factors/metabolism
13.
Nucleic Acids Res ; 46(15): 7970-7976, 2018 09 06.
Article in English | MEDLINE | ID: mdl-29788499

ABSTRACT

Group I catalytic introns have been found in bacterial, viral, organellar, and some eukaryotic genomes, but not in archaea. All known archaeal introns are bulge-helix-bulge (BHB) introns, with the exception of a few group II introns. It has been proposed that BHB introns arose from extinct group I intron ancestors, much like eukaryotic spliceosomal introns are thought to have descended from group II introns. However, group I introns have little sequence conservation, making them difficult to detect with standard sequence similarity searches. Taking advantage of recent improvements in a computational homology search method that accounts for both conserved sequence and RNA secondary structure, we have identified 39 group I introns in a wide range of archaeal phyla, including examples of group I introns and BHB introns in the same host gene.


Subjects
Archaea/genetics, Introns/genetics, Archaeal RNA/genetics, Catalytic RNA/genetics, Archaea/classification, Archaea/enzymology, Base Sequence, Nucleic Acid Conformation, Phylogeny, Archaeal RNA/chemistry, Archaeal RNA/classification, Catalytic RNA/chemistry, Catalytic RNA/classification, Species Specificity
14.
Nucleic Acids Res ; 46(W1): W200-W204, 2018 07 02.
Article in English | MEDLINE | ID: mdl-29905871

ABSTRACT

The HMMER webserver [http://www.ebi.ac.uk/Tools/hmmer] is a free-to-use service which provides fast searches against widely used sequence databases and profile hidden Markov model (HMM) libraries using the HMMER software suite (http://hmmer.org). The results of a sequence search may be summarized in a number of ways, allowing users to view and filter the significant hits by domain architecture or taxonomy. For large scale usage, we provide an application programmatic interface (API) which has been expanded in scope, such that all result presentations are available via both HTML and API. Furthermore, we have refactored our JavaScript visualization library to provide standalone components for different result representations. These consume the aforementioned API and can be integrated into third-party websites. The range of databases that can be searched against has been expanded, adding four sequence datasets (12 in total) and one profile HMM library (6 in total). To help users explore the biological context of their results, and to discover new data resources, search results are now supplemented with cross references to other EMBL-EBI databases.


Subjects
Sequence Analysis, Software, Catalytic Domain, Genetic Databases, Internet, Markov Chains, Protein Sequence Analysis, User-Computer Interface
15.
Nucleic Acids Res ; 46(D1): D335-D342, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29112718

ABSTRACT

The Rfam database is a collection of RNA families in which each family is represented by a multiple sequence alignment, a consensus secondary structure, and a covariance model. In this paper we introduce Rfam release 13.0, which switches to a new genome-centric approach that annotates a non-redundant set of reference genomes with RNA families. We describe new web interface features including faceted text search and R-scape secondary structure visualizations. We discuss a new literature curation workflow and a pipeline for building families based on RNAcentral. There are 236 new families in release 13.0, bringing the total number of families to 2687. The Rfam website is http://rfam.org.


Subjects
Nucleic Acid Databases, Genome, Untranslated RNA/chemistry, Untranslated RNA/genetics, Humans, Molecular Sequence Annotation, Nucleic Acid Conformation, Untranslated RNA/classification, Sequence Alignment, RNA Sequence Analysis
16.
Nucleic Acids Res ; 44(D1): D81-9, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26612867

ABSTRACT

Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.


Subjects
DNA Transposable Elements, DNA/chemistry, Nucleic Acid Databases, Nucleic Acid Repetitive Sequences, Animals, DNA/classification, Genome, Humans, Internet, Markov Chains, Mice, Molecular Sequence Annotation, Sequence Alignment
17.
Nucleic Acids Res ; 44(D1): D279-85, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26673716

ABSTRACT

In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteomes sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) which have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically-generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool.


Subjects
Protein Databases, Proteins/classification, Proteome/chemistry, Sequence Alignment, Protein Sequence Analysis, Molecular Sequence Annotation
18.
Nucleic Acids Res ; 43(W1): W30-8, 2015 Jul 01.
Article in English | MEDLINE | ID: mdl-25943547

ABSTRACT

The HMMER website, available at http://www.ebi.ac.uk/Tools/hmmer/, provides access to the protein homology search algorithms found in the HMMER software suite. Since the first release of the website in 2011, the search repertoire has been expanded to include the iterative search algorithm, jackhmmer. The continued growth of the target sequence databases means that traditional tabular representations of significant sequence hits can be overwhelming to the user. Consequently, additional ways of presenting homology search results have been developed, allowing them to be summarised according to taxonomic distribution or domain architecture. The taxonomy and domain architecture representations can be used in combination to filter the results according to the needs of a user. Searches can also be restricted prior to submission using a new taxonomic filter, which not only ensures that the results are specific to the requested taxonomic group, but also improves search performance. The repertoire of profile hidden Markov model libraries, which are used for annotation of query sequences with protein families and domains, has been expanded to include the libraries from CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs. Finally, we discuss the relocation of the HMMER webserver to the European Bioinformatics Institute and the potential impact that this will have.


Subjects
Amino Acid Sequence Homology, Software, Algorithms, Protein Databases, Internet, Markov Chains, Protein Tertiary Structure, Sequence Alignment, Protein Sequence Analysis
19.
Nucleic Acids Res ; 43(Database issue): D130-7, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25392425

ABSTRACT

The Rfam database (available at http://rfam.xfam.org) is a collection of non-coding RNA families represented by manually curated sequence alignments, consensus secondary structures and annotation gathered from corresponding Wikipedia, taxonomy and ontology resources. In this article, we detail updates and improvements to the Rfam data and website for the Rfam 12.0 release. We describe the upgrade of our search pipeline to use Infernal 1.1 and demonstrate its improved homology detection ability by comparison with the previous version. The new pipeline is easier for users to apply to their own data sets, and we illustrate its ability to annotate RNAs in genomic and metagenomic data sets of various sizes. Rfam has been expanded to include 260 new families, including the well-studied large subunit ribosomal RNA family, and for the first time includes information on short sequence- and structure-based RNA motifs present within families.


Subjects
Nucleic Acid Databases, Untranslated RNA/chemistry, Genomics, Internet, Molecular Sequence Annotation, Nucleic Acid Conformation, Nucleotide Motifs, Long Noncoding RNA/chemistry, Untranslated RNA/classification, Software
20.
PLoS Biol ; 11(1): e1001473, 2013.
Article in English | MEDLINE | ID: mdl-23382650

ABSTRACT

The macronuclear genome of the ciliate Oxytricha trifallax displays an extreme and unique eukaryotic genome architecture with extensive genomic variation. During sexual genome development, the expressed, somatic macronuclear genome is whittled down to the genic portion of a small fraction (∼5%) of its precursor "silent" germline micronuclear genome by a process of "unscrambling" and fragmentation. The tiny macronuclear "nanochromosomes" typically encode single, protein-coding genes (a small portion, 10%, encode 2-8 genes), have minimal noncoding regions, and are differentially amplified to an average of ∼2,000 copies. We report the high-quality genome assembly of ∼16,000 complete nanochromosomes (∼50 Mb haploid genome size) that vary from 469 bp to 66 kb long (mean ∼3.2 kb) and encode ∼18,500 genes. Alternative DNA fragmentation processes ∼10% of the nanochromosomes into multiple isoforms that usually encode complete genes. Nucleotide diversity in the macronucleus is very high (SNP heterozygosity is ∼4.0%), suggesting that Oxytricha trifallax may have one of the largest known effective population sizes of eukaryotes. Comparison to other ciliates with nonscrambled genomes and long macronuclear chromosomes (on the order of 100 kb) suggests several candidate proteins that could be involved in genome rearrangement, including domesticated MULE and IS1595-like DDE transposases. The assembly of the highly fragmented Oxytricha macronuclear genome is the first completed genome with such an unusual architecture. This genome sequence provides tantalizing glimpses into novel molecular biology and evolution. For example, Oxytricha maintains tens of millions of telomeres per cell and has also evolved an intriguing expansion of telomere end-binding proteins. In conjunction with the micronuclear genome in progress, the O. trifallax macronuclear genome will provide an invaluable resource for investigating programmed genome rearrangements, complementing studies of rearrangements arising during evolution and disease.


Subjects
Protozoan DNA/genetics, Protozoan Genome/genetics, Oxytricha/genetics, Base Sequence, DNA Copy Number Variations, DNA Fragmentation, Gene Amplification, Gene Rearrangement/genetics, Protozoan Genes, Genetic Variation, Macronucleus/genetics, Molecular Sequence Data, Protein Binding, Messenger RNA/genetics, DNA Sequence Analysis, Telomere/genetics