ABSTRACT
Low-cost short-read sequencing technology has revolutionized genomics, though it is only just becoming practical for the high-quality de novo assembly of a novel large genome. We describe the Assemblathon 1 competition, which aimed to comprehensively assess the state of the art in de novo assembly methods when applied to current sequencing technologies. In a collaborative effort, teams were asked to assemble a simulated Illumina HiSeq data set of an unknown, simulated diploid genome. A total of 41 assemblies from 17 different groups were received. Novel haplotype-aware assessments of coverage, contiguity, structure, base calling, and copy number were made. We establish that within this benchmark (1) it is possible to assemble the genome to a high level of coverage and accuracy, and (2) large differences exist between the assemblies, suggesting room for further improvements in current methods. The simulated benchmark, including the correct answer, the assemblies, and the code that was used to evaluate the assemblies, is now public and freely available from http://www.assemblathon.org/.
Subjects
Genome/physiology, Genomics/methods, DNA Sequence Analysis/methods
ABSTRACT
Genome sequencing projects have been initiated for a wide range of eukaryotes. A few projects have reached completion, but most exist as draft assemblies. As one of the main reasons to sequence a genome is to obtain its catalog of genes, an important question is how complete or completable the catalog is in unfinished genomes. To answer this question, we have identified a set of core eukaryotic genes (CEGs), that are extremely highly conserved and which we believe are present in low copy numbers in higher eukaryotes. From an analysis of a phylogenetically diverse set of eukaryotic genome assemblies, we found that the proportion of CEGs mapped in draft genomes provides a useful metric for describing the gene space, and complements the commonly used N50 length and x-fold coverage values.
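The two kinds of summary statistic mentioned above can be illustrated with a minimal sketch. It assumes the common definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length) and treats the CEG metric as a simple fraction. The function names `n50` and `ceg_fraction` are hypothetical and are not taken from any published tool.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of the
    summed length of all sequences.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid floating-point division.
        if running * 2 >= total:
            return length


def ceg_fraction(mapped_cegs, total_cegs):
    """Return the proportion of core eukaryotic genes found in an assembly.

    Both arguments are plain counts; how a gene is judged to be 'mapped'
    is outside the scope of this sketch.
    """
    if total_cegs <= 0:
        raise ValueError("total_cegs must be positive")
    return mapped_cegs / total_cegs
```

An assembly can score well on N50 while still missing genes, which is why a gene-space measure such as the CEG proportion is described as complementary rather than as a replacement.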
Subjects
Genes, Genomics, Animals, Chromosome Mapping, Humans, Proteins/genetics
ABSTRACT
MOTIVATION: The numbers of finished and ongoing genome projects are increasing at a rapid rate, and providing the catalog of genes for these new genomes is a key challenge. Obtaining a set of well-characterized genes is a basic requirement in the initial steps of any genome annotation process. An accurate set of genes is needed in order to learn about species-specific properties, to train gene-finding programs, and to validate automatic predictions. Unfortunately, many new genome projects lack comprehensive experimental data to derive a reliable initial set of genes. RESULTS: In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of conserved protein families that occur in a wide range of eukaryotes, and present a mapping procedure that accurately identifies their exon-intron structures in a novel genomic sequence. CEGMA includes the use of profile-hidden Markov models to ensure the reliability of the gene structures. Our procedure allows one to build an initial set of reliable gene annotations in potentially any eukaryotic genome, even those in draft stages. AVAILABILITY: Software and data sets are available online at http://korflab.ucdavis.edu/Datasets.
Subjects
Algorithms, Chromosome Mapping/methods, Documentation/methods, Proteome/genetics, Sequence Alignment/methods, DNA Sequence Analysis/methods, Software, Animals, Base Sequence, Humans, Molecular Sequence Data, Nucleic Acid Sequence Homology
ABSTRACT
WormBase (http://www.wormbase.org), the model organism database for information about Caenorhabditis elegans and related nematodes, continues to expand in breadth and depth. Over the past year, WormBase has added multiple large-scale datasets including SAGE, interactome, 3D protein structure datasets and NCBI KOGs. To accommodate this growth, the International WormBase Consortium has improved the user interface by adding new features to aid in navigation, visualization of large-scale datasets, advanced searching and data mining. Internally, we have restructured the database models to rationalize the representation of genes and to prepare the system to accept the genome sequences of three additional Caenorhabditis species over the coming year.
Subjects
Caenorhabditis elegans Proteins/chemistry, Caenorhabditis elegans Proteins/genetics, Caenorhabditis elegans/genetics, Caenorhabditis/genetics, Genetic Databases, Genomics, Animals, Caenorhabditis/metabolism, Caenorhabditis elegans/metabolism, Caenorhabditis elegans Proteins/metabolism, Genetic Databases/trends, Gene Expression Profiling, Protein Conformation, Software, Systems Integration, Two-Hybrid System Techniques, User-Computer Interface
ABSTRACT
WormBase (http://www.wormbase.org/) is a web-accessible central data repository for information about Caenorhabditis elegans and related nematodes. The past two years have seen a significant expansion in the biological scope of WormBase, including the integration of large-scale, genome-wide data sets, the inclusion of genome sequence and gene predictions from related species and active literature curation. This expansion of data has also driven the development and refinement of user interfaces and operability, including a new Genome Browser, new searches and facilities for data access and the inclusion of extensive documentation. These advances have expanded WormBase beyond the obvious target audience of C. elegans researchers, to include researchers wishing to explore problems in functional and comparative genomics within the context of a powerful genetic system.
Subjects
Caenorhabditis elegans/genetics, Caenorhabditis/genetics, Nucleic Acid Databases, Genomics, Animals, Caenorhabditis elegans/embryology, Caenorhabditis elegans/growth & development, Helminth DNA/analysis, Data Collection, Expressed Sequence Tags, Gene Expression, Information Storage and Retrieval, Neurons/classification, Single Nucleotide Polymorphism, Quality Control, RNA Interference, Helminth RNA/antagonists & inhibitors, Nucleic Acid Sequence Homology
ABSTRACT
WormBase (http://www.wormbase.org/) is the central data repository for information about Caenorhabditis elegans and related nematodes. As a model organism database, WormBase extends beyond the genomic sequence, integrating experimental results with extensively annotated views of the genome. The WormBase Consortium continues to expand the biological scope and utility of WormBase with the inclusion of large-scale genomic analyses, through active data and literature curation, through new analysis and visualization tools, and through refinement of the user interface. Over the past year, the nearly complete genomic sequence and comparative analyses of the closely related species Caenorhabditis briggsae have been integrated into WormBase, including gene predictions, ortholog assignments and a new synteny viewer to display the relationships between the two species. Extensive site-wide refinement of the user interface now provides quick access to the most frequently accessed resources and a consistent browsing experience across the site. Unified single-page views now provide complete summaries of commonly accessed entries like genes. These advances continue to increase the utility of WormBase for C. elegans researchers, as well as for those researchers exploring problems in functional and comparative genomics in the context of a powerful genetic system.
Subjects
Caenorhabditis elegans/genetics, Caenorhabditis/genetics, Genetic Databases, Genomics, Animals, Computational Biology, Information Storage and Retrieval, Internet, User-Computer Interface
ABSTRACT
Genome instability is associated with mitotic errors and cancer. This phenomenon can lead to deleterious rearrangements, but also genetic novelty, and many questions regarding its genesis, fate and evolutionary role remain unanswered. Here, we describe extreme chromosomal restructuring during genome elimination, a process resulting from hybridization of Arabidopsis plants expressing different centromere histones H3. Shattered chromosomes are formed from the genome of the haploid inducer, consistent with genomic catastrophes affecting a single, laggard chromosome compartmentalized within a micronucleus. Analysis of breakpoint junctions implicates breaks followed by repair through non-homologous end joining (NHEJ) or stalled fork repair. Furthermore, mutation of required NHEJ factor DNA Ligase 4 results in enhanced haploid recovery. Lastly, heritability and stability of a rearranged chromosome suggest a potential for enduring genomic novelty. These findings provide a tractable, natural system towards investigating the causes and mechanisms of complex genomic rearrangements similar to those associated with several human disorders.
Subjects
Arabidopsis/genetics, Chromosome Aberrations, Plant Genome/genetics, Genomic Instability/physiology, Genetic Hybridization/genetics, Base Sequence, Cytogenetic Analysis, DNA Damage/genetics, DNA End-Joining Repair/genetics, ATP-Dependent DNA Ligase, DNA Ligases/genetics, DNA Primers/genetics, Genomic Instability/genetics, Genotype, Molecular Sequence Data, Single Nucleotide Polymorphism/genetics, DNA Sequence Analysis
ABSTRACT
BACKGROUND: Centromeres are essential for chromosome segregation, yet their DNA sequences evolve rapidly. In most animals and plants that have been studied, centromeres contain megabase-scale arrays of tandem repeats. Despite their importance, very little is known about the degree to which centromere tandem repeats share common properties between different species across different phyla. We used bioinformatic methods to identify high-copy tandem repeats from 282 species using publicly available genomic sequence and our own data. RESULTS: Our methods are compatible with all current sequencing technologies. Long Pacific Biosciences sequence reads allowed us to find tandem repeat monomers up to 1,419 bp. We assumed that the most abundant tandem repeat is the centromere DNA, which was true for most species whose centromeres have been previously characterized, suggesting this is a general property of genomes. High-copy centromere tandem repeats were found in almost all animal and plant genomes, but repeat monomers were highly variable in sequence composition and length. Furthermore, phylogenetic analysis of sequence homology showed little evidence of sequence conservation beyond approximately 50 million years of divergence. We find that despite an overall lack of sequence conservation, centromere tandem repeats from diverse species showed similar modes of evolution. CONCLUSIONS: While centromere position in most eukaryotes is epigenetically determined, our results indicate that tandem repeats are highly prevalent at centromeres of both animal and plant genomes. This suggests a functional role for such repeats, perhaps in promoting concerted evolution of centromere DNA across chromosomes.
Subjects
Centromere/genetics, Molecular Evolution, Tandem Repeat Sequences, Animals, Base Sequence, Molecular Sequence Data, Plants/genetics, Species Specificity
ABSTRACT
BACKGROUND: The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how to best assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. RESULTS: In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. CONCLUSIONS: Many current genome assemblers produced useful assemblies, containing a significant representation of their genes and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.
ABSTRACT
Many introns significantly increase gene expression through a process termed intron-mediated enhancement (IME). Introns exist in the transcribed DNA and the nascent RNA, and could affect expression from either location. To determine which is more relevant to IME, hybrid introns were constructed that contain sequences from stimulating Arabidopsis thaliana introns either in their normal orientation or as the reverse complement. Both ends of each intron are from the non-stimulatory COR15a intron in their normal orientation to allow splicing. The inversions create major alterations to the sequence of the transcribed RNA with relatively minor changes to the DNA structure. Introns containing portions of either the UBQ10 or ATPK1 intron increased expression to a similar degree regardless of orientation. Also, computational predictions of IME improve when both intron strands are considered. These findings are more consistent with models of IME that act at the level of DNA rather than RNA.
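The orientation swap described above can be sketched in a few lines. This is an illustration only: the helper names `reverse_complement` and `build_hybrid_intron` are hypothetical, the sequences in the test are placeholders rather than real COR15a, UBQ10, or ATPK1 sequence, and the sketch assumes unambiguous upper- or lower-case A/C/G/T input.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_hybrid_intron(five_prime_end, internal, three_prime_end, inverted=False):
    """Assemble a hybrid intron from three pieces (hypothetical helper).

    The two ends are kept in their normal orientation so that splice
    signals are preserved; only the internal fragment is optionally
    replaced by its reverse complement.
    """
    middle = reverse_complement(internal) if inverted else internal
    return five_prime_end + middle + three_prime_end
```

Inverting only the internal fragment changes the sequence of the transcribed RNA substantially while leaving the double-stranded DNA largely the same, which is the contrast the experiment relies on.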
ABSTRACT
BACKGROUND: A proposed role for Myc in maintaining mouse embryonic stem (ES) cell pluripotency is transcriptional repression of key differentiation-promoting genes, but the details of the mechanism have remained an important open topic. RESULTS: To test the hypothesis that the zinc finger protein Miz-1 plays a central role, in the present work we conducted chromatin immunoprecipitation/microarray (ChIP-chip) analysis of Myc and Miz-1 in human ES cells, finding homeobox (Hox) genes as the most significant functional class of Miz-1 direct targets. Miz-1 differentiation-associated target genes specifically lack acetylated lysine 9 and trimethylated lysine 4 of histone H3 (AcH3K9 and H3K4me3) histone marks, consistent with a repressed transcriptional state. Almost 30% of Miz-1 targets are also bound by Myc, and these cobound genes are mostly factors that promote differentiation, including Hox genes. Knockdown of Myc increased expression of differentiation genes directly bound by Myc and Miz-1, while a subset of the same genes is downregulated by Miz-1 loss-of-function. Myc and Miz-1 proteins interact with each other and associate with several corepressor factors in ES cells, suggesting a mechanism of repression of differentiation genes. CONCLUSIONS: Taken together, our data indicate that Miz-1 and Myc maintain human ES cell pluripotency by coordinately suppressing differentiation genes, particularly Hox genes. These data also support a new model of how Myc and Miz-1 function on chromatin.
ABSTRACT
While many properties of eukaryotic gene structure are well characterized, differences in the form and function of introns that occur at different positions within a transcript are less well understood. In particular, the dynamics of intron length variation with respect to intron position has received relatively little attention. This study analyzes all available data on intron lengths in GenBank and finds a significant trend of increased length in first introns throughout a wide range of species. This trend was found to be even stronger when using high-confidence gene annotation data for three model organisms (Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster) which show that the first intron in the 5' UTR is--on average--significantly longer than all downstream introns within a gene. A partial explanation for increased first intron length in A. thaliana is suggested by the increased frequency of certain motifs that are present in first introns. The phenomenon of longer first introns can potentially be used to improve gene prediction software and also to detect errors in existing gene annotations.
Subjects
Arabidopsis/genetics, Caenorhabditis elegans/genetics, Drosophila melanogaster/genetics, Helminth Genes, Insect Genes, Genes, Introns/genetics, Animals, Nucleic Acid Databases/organization & administration, Nucleic Acid Databases/standards, Plant Genes/genetics, Reproducibility of Results
ABSTRACT
The ORFeome project has validated and corrected a large number of predicted gene models in the nematode C. elegans, and has provided an enormous resource for proteome-scale studies. To make the resource useful to the research and teaching community, it needs to be integrated with other large-scale data sets, including the C. elegans genome, cell lineage, neurological wiring diagram, transcriptome, and gene expression map. This integration is also critical because the ORFeome data sets, like other 'omics' data sets, have significant false-positive and false-negative rates, and comparison to related data is necessary to make confidence judgments in any given data point. WormBase, the central data repository for information about C. elegans and related nematodes, provides such a platform for integration. In this report, we will describe how C. elegans ORFeome data are deposited in the database, how they are used to correct gene models, how they are integrated and displayed in the context of other data sets at the WormBase Web site, and how WormBase establishes connection with the reagent-based resources at the ORFeome project Web site.