1.
Genome Res ; 24(12): 2077-89, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25273068

ABSTRACT

Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: two were simulated and based on primate and mammalian phylogenies, and one comprised 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.


Subjects
Genome, Genomics/methods, Sequence Alignment/methods, Software, Animals, Computational Biology/methods, Computer Simulation, Datasets as Topic, Genome-Wide Association Study, Humans, Mammals/genetics, Phylogeny, Reproducibility of Results
2.
Bioinformatics ; 27(23): 3266-75, 2011 Dec 01.
Article in English | MEDLINE | ID: mdl-21994225

ABSTRACT

MOTIVATION: There have been several studies on the micro-inversions between human and chimpanzee, but there are large discrepancies among their results. Furthermore, all of them rely on alignment procedures or existing alignment results to identify inversions. However, the core alignment procedures do not take very small inversions into consideration. Therefore, their analyses cannot find inversions that are too small to be detected by a classic aligner. We call such inversions pico-inversions. RESULTS: We re-analyzed human-chimpanzee alignment from the UCSC Genome Browser for micro-inplace-inversions and screened for pico-inplace-inversions using a likelihood ratio test. We report that the quantity of inplace-inversions between human and chimpanzee is substantially greater than what had previously been discovered. We also present the software tool PicoInversionMiner to detect pico-inplace-inversions between closely related species. AVAILABILITY: Software tools, scripts and result data are available at http://faculty.cs.niu.edu/~hou/PicoInversion.html. CONTACT: mhou@cs.niu.edu.
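
The idea of screening a short aligned segment for an in-place inversion with a likelihood ratio can be illustrated with a minimal sketch. This is not the PicoInversionMiner algorithm: the i.i.d. match/mismatch model, the `p_match` value, and all function names are assumptions chosen only for illustration. The score compares how well the second sequence explains the first when read as given versus when reverse-complemented.

```python
import math

# Watson-Crick complements for unambiguous DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def log_likelihood(a, b, p_match=0.95):
    """Log-likelihood of two equal-length, ungapped sequences under a toy
    model (an assumption): each column matches with probability p_match,
    and the three possible mismatches share the remaining probability."""
    p_mismatch = (1.0 - p_match) / 3.0
    return sum(math.log(p_match if x == y else p_mismatch) for x, y in zip(a, b))


def inversion_log_ratio(ref_segment, other_segment, p_match=0.95):
    """Log-likelihood ratio of 'inverted in place' versus 'not inverted'.
    Positive values favour the inversion hypothesis."""
    forward = log_likelihood(ref_segment, other_segment, p_match)
    inverted = log_likelihood(ref_segment, reverse_complement(other_segment), p_match)
    return inverted - forward
```

A real screen would also need a significance threshold and a treatment of reverse-palindromic segments, which score identically under both hypotheses; neither is modelled here.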


Subjects
Chromosome Inversion, Molecular Evolution, Pan troglodytes/genetics, Sequence Alignment/methods, Animals, Genome, Humans, Software
3.
Bioinformatics ; 23(8): 917-25, 2007 Apr 15.
Article in English | MEDLINE | ID: mdl-17308341

ABSTRACT

MOTIVATION: Complex genomes contain numerous repeated sequences, and genomic duplication is believed to be a main evolutionary mechanism to obtain new functions. Several tools are available for de novo repeat sequence identification, and many approaches exist for clustering homologous protein sequences. We present an efficient new approach to identify and cluster homologous DNA sequences with high accuracy at the level of whole genomes, excluding low-complexity repeats, tandem repeats and annotated interspersed repeats. We also determine the boundaries of each group member so that it closely represents a biological unit, e.g. a complete gene, or a partial gene coding a protein domain. RESULTS: We developed a program called HomologMiner to identify homologous groups applicable to genome sequences that have been properly marked for low-complexity repeats and annotated interspersed repeats. We applied it to the whole genomes of human (hg17), macaque (rheMac2) and mouse (mm8). Groups obtained include gene families (e.g. olfactory receptor gene family, zinc finger families), unannotated interspersed repeats and additional homologous groups that resulted from recent segmental duplications. Our program incorporates several new methods: a new abstract definition of consistent duplicate units, a new criterion to remove moderately frequent tandem repeats, and new algorithmic techniques. We also provide preliminary analysis of the output on the three genomes mentioned above, and show several applications including identifying boundaries of tandem gene clusters and novel interspersed repeat families. AVAILABILITY: All programs and datasets are downloadable from www.bx.psu.edu/miller_lab.


Subjects
Algorithms, Chromosome Mapping/methods, Nucleic Acid Repetitive Sequences/genetics, Sequence Alignment/methods, DNA Sequence Analysis/methods, Nucleic Acid Sequence Homology, Software, Base Sequence, Molecular Sequence Data
4.
Genome Res ; 17(12): 1797-808, 2007 Dec.
Article in English | MEDLINE | ID: mdl-17984227

ABSTRACT

This article describes a set of alignments of 28 vertebrate genome sequences that is provided by the UCSC Genome Browser. The alignments can be viewed on the Human Genome Browser (March 2006 assembly) at http://genome.ucsc.edu, downloaded in bulk by anonymous FTP from http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz28way, or analyzed with the Galaxy server at http://g2.bx.psu.edu. This article illustrates the power of this resource for exploring vertebrate and mammalian evolution, using three examples. First, we present several vignettes involving insertions and deletions within protein-coding regions, including a look at some human-specific indels. Then we study the extent to which start codons and stop codons in the human sequence are conserved in other species, showing that start codons are in general more poorly conserved than stop codons. Finally, an investigation of the phylogenetic depth of conservation for several classes of functional elements in the human genome reveals striking differences in the rates and modes of decay in alignability. Each functional class has a distinctive period of stringent constraint, followed by decays that allow (for the case of regulatory regions) or reject (for coding regions and ultraconserved elements) insertions and deletions.
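
As a rough illustration of the start- and stop-codon conservation question, the sketch below measures the fraction of non-reference species whose aligned bases match the reference codon at a given alignment column. It assumes a toy in-memory alignment (a dict of equal-length gapped strings); the function name, the data layout, and the exact-match criterion are hypothetical and are not taken from the article or from the UCSC tools.

```python
def codon_conservation(alignment, reference, start):
    """Fraction of non-reference species whose three aligned columns
    beginning at `start` are identical to the reference codon.

    alignment: dict mapping species name to an aligned (gapped) sequence;
               all sequences are assumed to have the same length.
    reference: name of the reference species, e.g. "human".
    start:     0-based alignment column where the reference codon begins.
    """
    ref_codon = alignment[reference][start:start + 3].upper()
    others = [seq for name, seq in alignment.items() if name != reference]
    if not others:
        return 0.0
    # A gap or any substitution within the codon counts as not conserved.
    conserved = sum(1 for seq in others if seq[start:start + 3].upper() == ref_codon)
    return conserved / len(others)


# Toy example: the start codon ATG is kept in mouse only.
toy_alignment = {
    "human": "ATGGCC",
    "mouse": "ATGGCA",
    "dog": "CTGGCC",
    "chicken": "---GCC",
}
start_conservation = codon_conservation(toy_alignment, "human", 0)
```

Applying the same function at the column of a stop codon would give the corresponding stop-codon figure, so the two rates can be compared on equal terms.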


Subjects
Conserved Sequence, Genetic Databases, Sequence Alignment/methods, Animals, Base Sequence, Cats, Cattle, Initiator Codon/genetics, Terminator Codon/genetics, Dogs, Human Genome, Guinea Pigs, Humans, Mice, Molecular Sequence Data, Insertional Mutagenesis, Rabbits, Rats, Sequence Deletion
5.
Genome Res ; 17(6): 760-74, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17567995

ABSTRACT

A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.


Subjects
Molecular Evolution, Human Genome, Mammals/genetics, Open Reading Frames, Phylogeny, Sequence Alignment, Animals, Human Genome Project, Humans
6.
Genome Res ; 15(1): 184-94, 2005 Jan.
Article in English | MEDLINE | ID: mdl-15590941

ABSTRACT

Multiple-sequence alignment analysis is a powerful approach for understanding phylogenetic relationships, annotating genes, and detecting functional regulatory elements. With a growing number of partly or fully sequenced vertebrate genomes, effective tools for performing multiple comparisons are required to accurately and efficiently assist biological discoveries. Here we introduce Mulan (http://mulan.dcode.org/), a novel method and a network server for comparing multiple draft and finished-quality sequences to identify functional elements conserved over evolutionary time. Mulan brings together several novel algorithms: the TBA multi-aligner program for rapid identification of local sequence conservation, and the multiTF program for detecting evolutionarily conserved transcription factor binding sites in multiple alignments. In addition, Mulan supports two-way communication with the GALA database; alignments of multiple species dynamically generated in GALA can be viewed in Mulan, and conserved transcription factor binding sites identified with Mulan/multiTF can be integrated and overlaid with extensive genome annotation data using GALA. Local multiple alignments computed by Mulan ensure reliable representation of short- and large-scale genomic rearrangements in distant organisms. Mulan allows for interactive modification of critical conservation parameters to differentially predict conserved regions in comparisons of both closely and distantly related species. We illustrate the uses and applications of the Mulan tool through multispecies comparisons of the GATA3 gene locus and the identification of elements that are conserved in a different way in avians than in other genomes, allowing speculation on the evolution of birds. Source code for the aligners and the aligner-evaluation software can be freely downloaded from http://www.bx.psu.edu/miller_lab/.


Subjects
Computer Graphics, Molecular Evolution, Sequence Alignment/methods, Animals, Anura/genetics, Binding Sites/genetics, Chickens/genetics, Computational Biology/methods, Conserved Sequence/genetics, DNA-Binding Proteins/genetics, Fishes/genetics, GATA3 Transcription Factor, Genome, Human Genome, Humans, Mice, Phylogeny, Rats, Nucleic Acid Sequence Homology, Software, Trans-Activators/genetics
7.
Genome Res ; 15(8): 1034-50, 2005 Aug.
Article in English | MEDLINE | ID: mdl-16024819

ABSTRACT

We have conducted a comprehensive search for conserved elements in vertebrate genomes, using genome-wide multiple alignments of five vertebrate species (human, mouse, rat, chicken, and Fugu rubripes). Parallel searches have been performed with multiple alignments of four insect species (three species of Drosophila and Anopheles gambiae), two species of Caenorhabditis, and seven species of Saccharomyces. Conserved elements were identified with a computer program called phastCons, which is based on a two-state phylogenetic hidden Markov model (phylo-HMM). PhastCons works by fitting a phylo-HMM to the data by maximum likelihood, subject to constraints designed to calibrate the model across species groups, and then predicting conserved elements based on this model. The predicted elements cover roughly 3%-8% of the human genome (depending on the details of the calibration procedure) and substantially higher fractions of the more compact Drosophila melanogaster (37%-53%), Caenorhabditis elegans (18%-37%), and Saccharomyces cerevisiae (47%-68%) genomes. From yeasts to vertebrates, in order of increasing genome size and general biological complexity, increasing fractions of conserved bases are found to lie outside of the exons of known protein-coding genes. In all groups, the most highly conserved elements (HCEs), by log-odds score, are hundreds or thousands of bases long. These elements share certain properties with ultraconserved elements, but they tend to be longer and less perfectly conserved, and they overlap genes of somewhat different functional categories. In vertebrates, HCEs are associated with the 3' UTRs of regulatory genes, stable gene deserts, and megabase-sized regions rich in moderately conserved noncoding sequences. Noncoding HCEs also show strong statistical evidence of an enrichment for RNA secondary structure.
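
The two-state segmentation idea can be sketched with an ordinary hidden Markov model decoded by the Viterbi algorithm. This is a deliberately simplified stand-in, not phastCons itself: a real phylo-HMM emits whole alignment columns under phylogenetic substitution models and is fitted by maximum likelihood, whereas here each column is reduced to a boolean ("identical across species" or not) and every probability is a hand-picked assumption. The function name and parameters are hypothetical.

```python
import math


def viterbi_conserved(columns, p_stay=0.9, p_id_conserved=0.9, p_id_neutral=0.3):
    """Label each alignment column as conserved ('C') or non-conserved ('N').

    columns: list of booleans, True where a column is identical across species.
    p_stay:  probability of remaining in the same state between columns.
    p_id_*:  probability that a column is identical in each state.
    All parameter values are illustrative assumptions, not fitted estimates.
    """
    if not columns:
        return []
    states = ("C", "N")
    p_identical = {"C": p_id_conserved, "N": p_id_neutral}

    def log_emit(state, observed):
        p = p_identical[state] if observed else 1.0 - p_identical[state]
        return math.log(p)

    def log_trans(prev, curr):
        return math.log(p_stay if prev == curr else 1.0 - p_stay)

    # Initialise with a uniform prior over the two states.
    score = {s: math.log(0.5) + log_emit(s, columns[0]) for s in states}
    backpointers = []
    for observed in columns[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: score[r] + log_trans(r, s))
            new_score[s] = score[best_prev] + log_trans(best_prev, s) + log_emit(s, observed)
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state to recover the full path.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these illustrative parameters, a single non-identical column inside a long identical run is absorbed into the conserved state, because two state switches cost more than one unlikely emission. That is the behaviour that lets a model of this kind report long, imperfectly conserved elements.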


Subjects
Conserved Sequence, Molecular Evolution, Insects/genetics, Vertebrates/genetics, Yeasts/genetics, 3' Untranslated Regions, Animals, Base Pairing/genetics, Base Sequence, Caenorhabditis elegans/genetics, Intergenic DNA, Genome, Humans, Molecular Sequence Data, Saccharomyces/genetics