Search | BVS Infectious and Parasitic Diseases

1.

The evolution of synaptic and cognitive capacity: Insights from the nervous system transcriptome of Aplysia.

Orvis, Joshua; Albertin, Caroline B; Shrestha, Pragya; Chen, Shuangshuang; Zheng, Melanie; Rodriguez, Cheyenne J; Tallon, Luke J; Mahurkar, Anup; Zimin, Aleksey V; Kim, Michelle; Liu, Kelvin; Kandel, Eric R; Fraser, Claire M; Sossin, Wayne; Abrams, Thomas W.

Proc Natl Acad Sci U S A ; 119(28): e2122301119, 2022 07 12.

Article in English | MEDLINE | ID: mdl-35867761

ABSTRACT

The gastropod mollusk Aplysia is an important model for cellular and molecular neurobiological studies, particularly for investigations of molecular mechanisms of learning and memory. We developed an optimized assembly pipeline to generate an improved Aplysia nervous system transcriptome. This improved transcriptome enabled us to explore the evolution of cognitive capacity at the molecular level. Were there evolutionary expansions of neuronal genes between this relatively simple gastropod Aplysia (20,000 neurons) and Octopus (500 million neurons), the invertebrate with the most elaborate neuronal circuitry and greatest behavioral complexity? Are the tremendous advances in cognitive power in vertebrates explained by expansion of the synaptic proteome that resulted from multiple rounds of whole genome duplication in this clade? Overall, the complement of genes linked to neuronal function is similar between Octopus and Aplysia. As expected, a number of synaptic scaffold proteins have more isoforms in humans than in Aplysia or Octopus. However, several scaffold families present in mollusks and other protostomes are absent in vertebrates, including the Fifes, Lev10s, SOLs, and a NETO family. Thus, whereas vertebrates have more scaffold isoforms from select families, invertebrates have additional scaffold protein families not found in vertebrates. This analysis provides insights into the evolution of the synaptic proteome. Both synaptic proteins and synaptic plasticity evolved gradually, yet the last deuterostome-protostome common ancestor already possessed an elaborate suite of genes associated with synaptic function, and critical for synaptic plasticity.

Subjects

Aplysia, Biological Evolution, Cognition, Synapses, Animals, Aplysia/genetics, Aplysia/metabolism, Neuronal Plasticity/genetics, Neurons/metabolism, Protein Isoforms/genetics, Proteome, Synapses/metabolism, Transcriptome

2.

JASPER: A fast genome polishing tool that improves accuracy of genome assemblies.

Guo, Alina; Salzberg, Steven L; Zimin, Aleksey V.

PLoS Comput Biol ; 19(3): e1011032, 2023 03.

Article in English | MEDLINE | ID: mdl-37000853

ABSTRACT

Advances in long-read sequencing technologies have dramatically improved the contiguity and completeness of genome assemblies. Using the latest nanopore-based sequencers, we can generate enough data for the assembly of a human genome from a single flow cell. With the long-read data from these sequences, we can now routinely produce de novo genome assemblies in which half or more of a genome is contained in megabase-scale contigs. Assemblies produced from nanopore data alone, though, have relatively high error rates and can benefit from a process called polishing, in which more-accurate reads are used to correct errors in the consensus sequence. In this manuscript, we present a novel tool for genome polishing called JASPER (Jellyfish-based Assembly Sequence Polisher for Error Reduction). In contrast to many other polishing methods, JASPER gains efficiency by avoiding the alignment of reads to the assembly. Instead, JASPER uses a database of k-mer counts that it creates from the reads to detect and correct errors in the consensus. Our experiments demonstrate that JASPER is faster than alignment-based polishers, and both faster and more accurate than other k-mer based polishing methods. We also introduce the idea of using a polishing tool to create population-specific reference genomes, and illustrate this idea using sequence data from multiple individuals from Tokyo, Japan.
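The alignment-free idea in this abstract (count k-mers in the accurate reads, then look for consensus positions that those counts do not support) can be illustrated with a minimal sketch. This is not JASPER's implementation: the function names, the `min_count` threshold, and the use of a plain in-memory `Counter` in place of a Jellyfish database are all assumptions made for illustration, and the sketch only detects suspicious positions without correcting them.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (stand-in for a k-mer database)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(consensus, counts, k, min_count=2):
    """Return consensus positions where every covering k-mer is rare in the reads.

    Assumes len(consensus) >= k. `min_count` is a hypothetical threshold.
    """
    n_kmers = len(consensus) - k + 1
    # weak[i] is True when the k-mer starting at i is seen fewer than min_count times
    weak = [counts[consensus[i:i + k]] < min_count for i in range(n_kmers)]
    flagged = []
    for pos in range(len(consensus)):
        first = max(0, pos - k + 1)      # first k-mer that covers pos
        last = min(pos, n_kmers - 1)     # last k-mer that covers pos
        if all(weak[first:last + 1]):
            flagged.append(pos)
    return flagged


if __name__ == "__main__":
    truth = "ACGTTGCAAGCTTAGGCTA"
    reads = [truth] * 3                        # idealized error-free reads
    consensus = truth[:9] + "T" + truth[10:]   # one substitution at position 9
    print(suspicious_positions(consensus, count_kmers(reads, 4), 4))
```

A single substitution invalidates all k k-mers that overlap it, so the position covered only by unsupported k-mers is the natural candidate for correction.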

Subjects

High-Throughput Nucleotide Sequencing, Nanopores, Humans, DNA Sequence Analysis, Human Genome/genetics, Metagenomics

3.

The SAMBA tool uses long reads to improve the contiguity of genome assemblies.

Zimin, Aleksey V; Salzberg, Steven L.

PLoS Comput Biol ; 18(2): e1009860, 2022 02.

Article in English | MEDLINE | ID: mdl-35120119

ABSTRACT

Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.

Subjects

High-Throughput Nucleotide Sequencing/methods, Software

4.

Genome sequence of the progenitor of the wheat D genome Aegilops tauschii.

Luo, Ming-Cheng; Gu, Yong Q; Puiu, Daniela; Wang, Hao; Twardziok, Sven O; Deal, Karin R; Huo, Naxin; Zhu, Tingting; Wang, Le; Wang, Yi; McGuire, Patrick E; Liu, Shuyang; Long, Hai; Ramasamy, Ramesh K; Rodriguez, Juan C; Van, Sonny L; Yuan, Luxia; Wang, Zhenzhong; Xia, Zhiqiang; Xiao, Lichan; Anderson, Olin D; Ouyang, Shuhong; Liang, Yong; Zimin, Aleksey V; Pertea, Geo; Qi, Peng; Bennetzen, Jeffrey L; Dai, Xiongtao; Dawson, Matthew W; Müller, Hans-Georg; Kugler, Karl; Rivarola-Duarte, Lorena; Spannagl, Manuel; Mayer, Klaus F X; Lu, Fu-Hao; Bevan, Michael W; Leroy, Philippe; Li, Pingchuan; You, Frank M; Sun, Qixin; Liu, Zhiyong; Lyons, Eric; Wicker, Thomas; Salzberg, Steven L; Devos, Katrien M; Dvorák, Jan.

Nature ; 551(7681): 498-502, 2017 11 23.

Article in English | MEDLINE | ID: mdl-29143815

ABSTRACT

Aegilops tauschii is the diploid progenitor of the D genome of hexaploid wheat (Triticum aestivum, genomes AABBDD) and an important genetic resource for wheat. The large size and highly repetitive nature of the Ae. tauschii genome has until now precluded the development of a reference-quality genome sequence. Here we use an array of advanced technologies, including ordered-clone genome sequencing, whole-genome shotgun sequencing, and BioNano optical genome mapping, to generate a reference-quality genome sequence for Ae. tauschii ssp. strangulata accession AL8/78, which is closely related to the wheat D genome. We show that compared to other sequenced plant genomes, including a much larger conifer genome, the Ae. tauschii genome contains unprecedented amounts of very similar repeated sequences. Our genome comparisons reveal that the Ae. tauschii genome has a greater number of dispersed duplicated genes than other sequenced genomes and its chromosomes have been structurally evolving an order of magnitude faster than those of other grass genomes. The decay of colinearity with other grass genomes correlates with recombination rates along chromosomes. We propose that the vast amounts of very similar repeated sequences cause frequent errors in recombination and lead to gene duplications and structural chromosome changes that drive fast genome evolution.

Subjects

Plant Genome, Phylogeny, Poaceae/genetics, Triticum/genetics, Chromosome Mapping, Diploidy, Molecular Evolution, Gene Duplication, Plant Genes/genetics, Genomics/standards, Poaceae/classification, Genetic Recombination/genetics, DNA Sequence Analysis/standards, Triticum/classification

5.

Genome assembly and characterization of a complex zfBED-NLR gene-containing disease resistance locus in Carolina Gold Select rice with Nanopore sequencing.

Read, Andrew C; Moscou, Matthew J; Zimin, Aleksey V; Pertea, Geo; Meyer, Rachel S; Purugganan, Michael D; Leach, Jan E; Triplett, Lindsay R; Salzberg, Steven L; Bogdanove, Adam J.

PLoS Genet ; 16(1): e1008571, 2020 01.

Article in English | MEDLINE | ID: mdl-31986137

ABSTRACT

Long-read sequencing facilitates assembly of complex genomic regions. In plants, loci containing nucleotide-binding, leucine-rich repeat (NLR) disease resistance genes are an important example of such regions. NLR genes constitute one of the largest gene families in plants and are often clustered, evolving via duplication, contraction, and transposition. We recently mapped the Xo1 locus for resistance to bacterial blight and bacterial leaf streak, found in the American heirloom rice variety Carolina Gold Select, to a region that in the Nipponbare reference genome is NLR gene-rich. Here, toward identification of the Xo1 gene, we combined Nanopore and Illumina reads and generated a high-quality Carolina Gold Select genome assembly. We identified 529 complete or partial NLR genes and discovered, relative to Nipponbare, an expansion of NLR genes at the Xo1 locus. One of these has high sequence similarity to the cloned, functionally similar Xa1 gene. Both harbor an integrated zfBED domain, and the repeats within each protein are nearly perfect. Across diverse Oryzeae, we identified two sub-clades of NLR genes with these features, varying in the presence of the zfBED domain and the number of repeats. The Carolina Gold Select genome assembly also uncovered at the Xo1 locus a rice blast resistance gene and a gene encoding a polyphenol oxidase (PPO). PPO activity has been used as a marker for blast resistance at the locus in some varieties; however, the Carolina Gold Select sequence revealed a loss-of-function mutation in the PPO gene that breaks this association. Our results demonstrate that whole genome sequencing combining Nanopore and Illumina reads effectively resolves NLR gene loci. Our identification of an Xo1 candidate is an important step toward mechanistic characterization, including the role(s) of the zfBED domain. Finally, the Carolina Gold Select genome assembly will facilitate identification of other useful traits in this historically important variety.

Subjects

Disease Resistance, NLR Proteins/genetics, Oryza/genetics, Plant Proteins/genetics, Molecular Sequence Annotation, NLR Proteins/chemistry, NLR Proteins/metabolism, Nanopore Sequencing/methods, Oryza/immunology, Plant Proteins/chemistry, Plant Proteins/metabolism, Whole Genome Sequencing/methods, Zinc Fingers

6.

Human contamination in bacterial genomes has created thousands of spurious proteins.

Breitwieser, Florian P; Pertea, Mihaela; Zimin, Aleksey V; Salzberg, Steven L.

Genome Res ; 29(6): 954-960, 2019 06.

Article in English | MEDLINE | ID: mdl-31064768

ABSTRACT

Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein "families" across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
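The mitigation suggested at the end of this abstract, dropping small contigs from a draft assembly, can be sketched as follows. The abstract gives no length cutoff, so `min_length=1000` is a purely hypothetical default, and the function names are illustrative rather than taken from any tool described in the paper.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def filter_small_contigs(records, min_length=1000):
    """Keep only contigs at least min_length bases long (hypothetical cutoff)."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


if __name__ == "__main__":
    fasta = ">contig1\nACGTACGT\nACGT\n>contig2\nACG\n"
    kept = filter_small_contigs(parse_fasta(fasta), min_length=10)
    print([name for name, _ in kept])
```

In practice the cutoff would need to be chosen per assembly so that genuine sequence is not discarded along with the contaminants.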

Subjects

DNA Contamination, Bacterial Genome, Human Genome, Genomics, Genetic Databases, Genetic Variation, Archaeal Genome, Genomics/methods, Genomics/standards, High-Throughput Nucleotide Sequencing, Humans, Open Reading Frames, Nucleic Acid Repetitive Sequences

7.

The Atlantic salmon genome provides insights into rediploidization.

Lien, Sigbjørn; Koop, Ben F; Sandve, Simen R; Miller, Jason R; Kent, Matthew P; Nome, Torfinn; Hvidsten, Torgeir R; Leong, Jong S; Minkley, David R; Zimin, Aleksey; Grammes, Fabian; Grove, Harald; Gjuvsland, Arne; Walenz, Brian; Hermansen, Russell A; von Schalburg, Kris; Rondeau, Eric B; Di Genova, Alex; Samy, Jeevan K A; Olav Vik, Jon; Vigeland, Magnus D; Caler, Lis; Grimholt, Unni; Jentoft, Sissel; Våge, Dag Inge; de Jong, Pieter; Moen, Thomas; Baranski, Matthew; Palti, Yniv; Smith, Douglas R; Yorke, James A; Nederbragt, Alexander J; Tooming-Klunderud, Ave; Jakobsen, Kjetill S; Jiang, Xuanting; Fan, Dingding; Hu, Yan; Liberles, David A; Vidal, Rodrigo; Iturra, Patricia; Jones, Steven J M; Jonassen, Inge; Maass, Alejandro; Omholt, Stig W; Davidson, William S.

Nature ; 533(7602): 200-5, 2016 05 12.

Article in English | MEDLINE | ID: mdl-27088604

ABSTRACT

The whole-genome duplication 80 million years ago of the common ancestor of salmonids (salmonid-specific fourth vertebrate whole-genome duplication, Ss4R) provides unique opportunities to learn about the evolutionary fate of a duplicated vertebrate genome in 70 extant lineages. Here we present a high-quality genome assembly for Atlantic salmon (Salmo salar), and show that large genomic reorganizations, coinciding with bursts of transposon-mediated repeat expansions, were crucial for the post-Ss4R rediploidization process. Comparisons of duplicate gene expression patterns across a wide range of tissues with orthologous genes from a pre-Ss4R outgroup unexpectedly demonstrate far more instances of neofunctionalization than subfunctionalization. Surprisingly, we find that genes that were retained as duplicates after the teleost-specific whole-genome duplication 320 million years ago were not more likely to be retained after the Ss4R, and that the duplicate retention was not influenced to a great extent by the nature of the predicted protein interactions of the gene products. Finally, we demonstrate that the Atlantic salmon assembly can serve as a reference sequence for the study of other salmonids for a range of purposes.

Subjects

Diploidy, Molecular Evolution, Gene Duplication/genetics, Duplicate Genes/genetics, Genome/genetics, Salmo salar/genetics, Animals, DNA Transposable Elements/genetics, Female, Genomics, Male, Genetic Models, Mutagenesis/genetics, Phylogeny, Reference Standards, Salmo salar/classification, Sequence Homology

8.

The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies.

Zimin, Aleksey V; Salzberg, Steven L.

PLoS Comput Biol ; 16(6): e1007981, 2020 06.

Article in English | MEDLINE | ID: mdl-32589667

ABSTRACT

The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8-15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to "polish" the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.

Subjects

Bacterial Genome, Algorithms, Biopolymers/genetics, DNA Sequence Analysis/methods

9.

Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm.

Zimin, Aleksey V; Puiu, Daniela; Luo, Ming-Cheng; Zhu, Tingting; Koren, Sergey; Marçais, Guillaume; Yorke, James A; Dvorák, Jan; Salzberg, Steven L.

Genome Res ; 27(5): 787-792, 2017 05.

Article in English | MEDLINE | ID: mdl-28130360

ABSTRACT

Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here, we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and extremely repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807 nucleotides. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
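The N50 contig size quoted in this abstract is a standard contiguity statistic: the length L such that contigs of length L or longer together contain at least half of the assembled bases. A minimal sketch of the calculation, with a hypothetical function name and toy contig lengths rather than data from the paper:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    account for at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("n50() requires at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Toy contig lengths; total is 32, and the two 8-base contigs cover half of it.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 depends only on contig lengths, it says nothing about correctness, which is why the abstract also reports comparisons against optical maps and BAC-based assemblies.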

Subjects

Contig Mapping/methods, Plant Genome, Genomics/methods, Poaceae/genetics, Nucleic Acid Repetitive Sequences, DNA Sequence Analysis/methods, Software, Contig Mapping/standards, Genome Size, Genomics/standards, DNA Sequence Analysis/standards

10.

MUMmer4: A fast and versatile genome alignment system.

Marçais, Guillaume; Delcher, Arthur L; Phillippy, Adam M; Coston, Rachel; Salzberg, Steven L; Zimin, Aleksey.

PLoS Comput Biol ; 14(1): e1005944, 2018 01.

Article in English | MEDLINE | ID: mdl-29373581

ABSTRACT

The MUMmer system and the genome sequence aligner nucmer included within it are among the most widely used alignment packages in genomics. Since the last major release of MUMmer version 3 in 2004, it has been applied to many types of problems including aligning whole genome sequences, aligning reads to a reference genome, and comparing different assemblies of the same genome. Despite its broad utility, MUMmer3 has limitations that can make it difficult to use for large genomes and for the very large sequence data sets that are common today. In this paper we describe MUMmer4, a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes; we illustrate this with an alignment of the human and chimpanzee genomes, which allows us to compute that the two species are 98% identical across 96% of their length. With the enhancements described here, MUMmer4 can also be used to efficiently align reads to reference genomes, although it is less sensitive and accurate than the dedicated read aligners. The nucmer aligner in MUMmer4 can now be called from scripting languages such as Perl, Python and Ruby. These improvements make MUMmer4 one of the most versatile genome alignment packages available.
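For readers unfamiliar with the data structure named in this abstract, a suffix array is the list of all suffix start positions of a text, sorted by the suffixes they begin; exact matches to a query can then be located by binary search. The sketch below is a naive illustration of that idea only. It is not MUMmer4's implementation or its scripting interface, the function names are hypothetical, and the quadratic-time construction is suitable only for toy inputs.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by suffix (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all start positions of pattern in text, in increasing order."""
    m = len(pattern)

    def prefix(rank):
        # Length-m prefix of the suffix at this rank; these prefixes are
        # non-decreasing along the suffix array, so binary search applies.
        start = suffix_array[rank]
        return text[start:start + m]

    # Lower bound: first rank whose prefix is >= pattern
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose prefix is > pattern
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    genome = "banana"
    sa = build_suffix_array(genome)
    print(find_occurrences(genome, sa, "ana"))
```

Each entry of the array is an integer offset into the text, so the width of that integer bounds the size of the input that can be indexed, which is the constraint the move to a wider index addresses.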

Subjects

Computational Biology/methods, Sequence Alignment/methods, Software, Algorithms, Animals, Arabidopsis/genetics, Human Genome, Plant Genome, Genomics, Humans, Theoretical Models, Pan troglodytes, Single Nucleotide Polymorphism, Programming Languages, DNA Sequence Analysis, Protein Sequence Analysis

11.

Evolution of transcriptional networks in yeast: alternative teams of transcriptional factors for different species.

Muñoz, Adriana; Santos Muñoz, Daniella; Zimin, Aleksey; Yorke, James A.

BMC Genomics ; 17(Suppl 10): 826, 2016 11 11.

Article in English | MEDLINE | ID: mdl-28185554

ABSTRACT

BACKGROUND: The diversity in eukaryotic life reflects a diversity in regulatory pathways. Nocedal and Johnson argue that the rewiring of gene regulatory networks is a major force for the diversity of life, that changes in regulation can create new species. RESULTS: We have created a method (based on our new "ping-pong" algorithm) for detecting more complicated rewirings, where several transcription factors can substitute for one or more transcription factors in the regulation of a family of co-regulated genes. An example is illustrative. A rewiring has been reported by Hogues et al. that RAP1 in Saccharomyces cerevisiae substitutes for TBF1/CBF1 in Candida albicans for ribosomal RP genes. There one transcription factor substitutes for another on some collection of genes. Such a substitution is referred to as a "rewiring". We agree with this finding of rewiring as far as it goes but the situation is more complicated. Many transcription factors can regulate a gene and our algorithm finds that in this example a "team" (or collection) of three transcription factors including RAP1 substitutes for TBF1 for 19 genes. The switch occurs for a branch of the phylogenetic tree containing 10 species (including Saccharomyces cerevisiae), while the remaining 13 species (including Candida albicans) are regulated by TBF1. CONCLUSIONS: To gain insight into more general evolutionary mechanisms, we have created a mathematical algorithm that finds such general switching events and we prove that it converges. Of course any such computational discovery should be validated in biological tests. For each branch of the phylogenetic tree and each gene module, our algorithm finds a sub-group of co-regulated genes and a team of transcription factors that substitutes for another team of transcription factors. In most cases the signal will be small but in some cases we find a strong signal of switching. We report our findings for 23 Ascomycota fungi species.

Subjects

Algorithms, Molecular Evolution, Saccharomyces cerevisiae Proteins/genetics, Saccharomyces cerevisiae/genetics, Transcription Factors/genetics, Candida albicans/classification, Candida albicans/genetics, Candida albicans/metabolism, Gene Regulatory Networks, Phylogeny, Saccharomyces cerevisiae/classification, Saccharomyces cerevisiae/metabolism, Saccharomyces cerevisiae Proteins/metabolism, Shelterin Complex, Telomere-Binding Proteins/genetics, Transcription Factors/metabolism, Genetic Transcription

12.

GAGE: A critical evaluation of genome assemblies and assembly algorithms.

Salzberg, Steven L; Phillippy, Adam M; Zimin, Aleksey; Puiu, Daniela; Magoc, Tanja; Koren, Sergey; Treangen, Todd J; Schatz, Michael C; Delcher, Arthur L; Roberts, Michael; Marçais, Guillaume; Pop, Mihai; Yorke, James A.

Genome Res ; 22(3): 557-67, 2012 Mar.

Article in English | MEDLINE | ID: mdl-22147368

ABSTRACT

New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just a few days. The sequence data generated by one of these projects consist of millions or billions of short DNA sequences (reads) that range from 50 to 150 nt in length. These sequences must then be assembled de novo before most genome analyses can begin. Unfortunately, genome assembly remains a very difficult problem, made more difficult by shorter reads and unreliable long-range linking information. In this study, we evaluated several of the leading de novo assembly algorithms on four different short-read data sets, all generated by Illumina sequencers. Our results describe the relative performance of the different assemblers as well as other significant differences in assembly difficulty that appear to be inherent in the genomes themselves. Three overarching conclusions are apparent: first, that data quality, rather than the assembler itself, has a dramatic effect on the quality of an assembled genome; second, that the degree of contiguity of an assembly varies enormously among different assemblers and different genomes; and third, that the correctness of an assembly also varies widely and is not well correlated with statistics on contiguity. To enable others to replicate our results, all of our data and methods are freely available, as are all assemblers used in this study.

Subjects

Algorithms, Genomics/methods, DNA Sequence Analysis, Animals, Computational Biology/methods, Genome, Bacterial Genome/genetics, Humans, Internet, Reproducibility of Results

13.

The MaSuRCA genome assembler.

Zimin, Aleksey V; Marçais, Guillaume; Puiu, Daniela; Roberts, Michael; Salzberg, Steven L; Yorke, James A.

Bioinformatics ; 29(21): 2669-77, 2013 Nov 01.

Article in English | MEDLINE | ID: mdl-23990416

ABSTRACT

MOTIVATION: Second-generation sequencing technologies produce high coverage of the genome by short reads at a low cost, which has prompted development of new assembly methods. In particular, multiple algorithms based on de Bruijn graphs have been shown to be effective for the assembly problem. In this article, we describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies, and which allows variable read lengths while tolerating a significant level of sequencing error. Our method transforms large numbers of paired-end reads into a much smaller number of longer 'super-reads'. The use of super-reads allows us to assemble combinations of Illumina reads of differing lengths together with longer reads from 454 and Sanger sequencing technologies, making it one of the few assemblers capable of handling such mixtures. We call our system the Maryland Super-Read Celera Assembler (abbreviated MaSuRCA and pronounced 'mazurka'). RESULTS: We evaluate the performance of MaSuRCA against two of the most widely used assemblers for Illumina data, Allpaths-LG and SOAPdenovo2, on two datasets from organisms for which high-quality assemblies are available: the bacterium Rhodobacter sphaeroides and chromosome 16 of the mouse genome. We show that MaSuRCA performs on par with or better than Allpaths-LG and significantly better than SOAPdenovo on these data, when evaluated against the finished sequence. We then show that MaSuRCA can significantly improve its assemblies when the original data are augmented with long reads. AVAILABILITY: MaSuRCA is available as open-source code at ftp://ftp.genome.umd.edu/pub/MaSuRCA/. Previous (pre-publication) releases have been publicly available for over a year. CONTACT: alekseyz@ipst.umd.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Genomics/methods, Algorithms, Animals, Bacterial Genome, Mice, Rhodobacter sphaeroides/genetics, DNA Sequence Analysis/methods, Software

14.

Draft genome of the globally widespread and invasive Argentine ant (Linepithema humile).

Smith, Christopher D; Zimin, Aleksey; Holt, Carson; Abouheif, Ehab; Benton, Richard; Cash, Elizabeth; Croset, Vincent; Currie, Cameron R; Elhaik, Eran; Elsik, Christine G; Fave, Marie-Julie; Fernandes, Vilaiwan; Gadau, Jürgen; Gibson, Joshua D; Graur, Dan; Grubbs, Kirk J; Hagen, Darren E; Helmkampf, Martin; Holley, Jo-Anne; Hu, Hao; Viniegra, Ana Sofia Ibarraran; Johnson, Brian R; Johnson, Reed M; Khila, Abderrahman; Kim, Jay W; Laird, Joseph; Mathis, Kaitlyn A; Moeller, Joseph A; Muñoz-Torres, Monica C; Murphy, Marguerite C; Nakamura, Rin; Nigam, Surabhi; Overson, Rick P; Placek, Jennifer E; Rajakumar, Rajendhran; Reese, Justin T; Robertson, Hugh M; Smith, Chris R; Suarez, Andrew V; Suen, Garret; Suhr, Elissa L; Tao, Shu; Torres, Candice W; van Wilgenburg, Ellen; Viljakainen, Lumi; Walden, Kimberly K O; Wild, Alexander L; Yandell, Mark; Yorke, James A; Tsutsui, Neil D.

Proc Natl Acad Sci U S A ; 108(14): 5673-8, 2011 Apr 05.

Article in English | MEDLINE | ID: mdl-21282631

ABSTRACT

Ants are some of the most abundant and familiar animals on Earth, and they play vital roles in most terrestrial ecosystems. Although all ants are eusocial, and display a variety of complex and fascinating behaviors, few genomic resources exist for them. Here, we report the draft genome sequence of a particularly widespread and well-studied species, the invasive Argentine ant (Linepithema humile), which was accomplished using a combination of 454 (Roche) and Illumina sequencing and community-based funding rather than federal grant support. Manual annotation of >1,000 genes from a variety of different gene families and functional classes reveals unique features of the Argentine ant's biology, as well as similarities to Apis mellifera and Nasonia vitripennis. Distinctive features of the Argentine ant genome include remarkable expansions of gustatory (116 genes) and odorant receptors (367 genes), an abundance of cytochrome P450 genes (>110), lineage-specific expansions of yellow/major royal jelly proteins and desaturases, and complete CpG DNA methylation and RNAi toolkits. The Argentine ant genome contains fewer immune genes than Drosophila and Tribolium, which may reflect the prominent role played by behavioral and chemical suppression of pathogens. Analysis of the ratio of observed to expected CpG nucleotides for genes in the reproductive development and apoptosis pathways suggests higher levels of methylation than in the genome overall. The resources provided by this genome sequence will offer an abundance of tools for researchers seeking to illuminate the fascinating biology of this emerging model organism.
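The ratio of observed to expected CpG dinucleotides mentioned in this abstract is commonly computed as (number of CG dinucleotides × sequence length) / (number of C × number of G); methylated cytosines in CpG context tend to mutate over time, so a low ratio is usually read as a sign of historical methylation. The sketch below uses that common normalization as an assumption, since the abstract does not state which formula the authors applied, and the function name is illustrative.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a DNA sequence.

    Uses the common normalization (CG count * length) / (C count * G count),
    which is an assumption here rather than the paper's stated method.
    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg_count * len(seq) / (c_count * g_count)


if __name__ == "__main__":
    for s in ("CCGG", "CGCG", "AATT"):
        print(s, cpg_obs_exp(s))
```

Values near 1 indicate that CpG occurs about as often as base composition alone predicts, while values well below 1 indicate depletion.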

Subjects

Ants/genetics; Genome, Insect/genetics; Genomics/methods; Phylogeny; Animals; Ants/physiology; Base Sequence; California; DNA Methylation; Gene Library; Genetics, Population; Hierarchy, Social; Molecular Sequence Data; Polymorphism, Single Nucleotide/genetics; Receptors, Odorant/genetics; Sequence Analysis, DNA

15.

Draft genome of the red harvester ant Pogonomyrmex barbatus.

Smith, Chris R; Smith, Christopher D; Robertson, Hugh M; Helmkampf, Martin; Zimin, Aleksey; Yandell, Mark; Holt, Carson; Hu, Hao; Abouheif, Ehab; Benton, Richard; Cash, Elizabeth; Croset, Vincent; Currie, Cameron R; Elhaik, Eran; Elsik, Christine G; Favé, Marie-Julie; Fernandes, Vilaiwan; Gibson, Joshua D; Graur, Dan; Gronenberg, Wulfila; Grubbs, Kirk J; Hagen, Darren E; Viniegra, Ana Sofia Ibarraran; Johnson, Brian R; Johnson, Reed M; Khila, Abderrahman; Kim, Jay W; Mathis, Kaitlyn A; Munoz-Torres, Monica C; Murphy, Marguerite C; Mustard, Julie A; Nakamura, Rin; Niehuis, Oliver; Nigam, Surabhi; Overson, Rick P; Placek, Jennifer E; Rajakumar, Rajendhran; Reese, Justin T; Suen, Garret; Tao, Shu; Torres, Candice W; Tsutsui, Neil D; Viljakainen, Lumi; Wolschin, Florian; Gadau, Jürgen.

Proc Natl Acad Sci U S A ; 108(14): 5667-72, 2011 Apr 05.

Article in English | MEDLINE | ID: mdl-21282651

ABSTRACT

We report the draft genome sequence of the red harvester ant, Pogonomyrmex barbatus. The genome was sequenced using 454 pyrosequencing, and the current assembly and annotation were completed in less than 1 y. Analyses of conserved gene groups (more than 1,200 manually annotated genes to date) suggest a high-quality assembly and annotation comparable to recently sequenced insect genomes using Sanger sequencing. The red harvester ant is a model for studying reproductive division of labor, phenotypic plasticity, and sociogenomics. Although the genome of P. barbatus is similar to other sequenced hymenopterans (Apis mellifera and Nasonia vitripennis) in GC content and compositional organization, and possesses a complete CpG methylation toolkit, its predicted genomic CpG content differs markedly from the other hymenopterans. Gene networks involved in generating key differences between the queen and worker castes (e.g., wings and ovaries) show signatures of increased methylation and suggest that ants and bees may have independently co-opted the same gene regulatory mechanisms for reproductive division of labor. Gene family expansions (e.g., 344 functional odorant receptors) and pseudogene accumulation in chemoreception and P450 genes compared with A. mellifera and N. vitripennis are consistent with major life-history changes during the adaptive radiation of Pogonomyrmex spp., perhaps in parallel with the development of the North American deserts.

Subjects

Ants/genetics; Gene Regulatory Networks/genetics; Genome, Insect/genetics; Genomics/methods; Phylogeny; Animals; Ants/physiology; Base Sequence; Desert Climate; Hierarchy, Social; Molecular Sequence Data; North America; Phenotype; Polymorphism, Single Nucleotide/genetics; Receptors, Odorant/genetics; Sequence Analysis, DNA

16.

Next-generation sequencing strategies for characterizing the turkey genome.

Dalloul, Rami A; Zimin, Aleksey V; Settlage, Robert E; Kim, Sungwon; Reed, Kent M.

Poult Sci ; 93(2): 479-84, 2014 Feb.

Article in English | MEDLINE | ID: mdl-24570472

ABSTRACT

The turkey genome sequencing project was initiated in 2008 and has relied primarily on next-generation sequencing (NGS) technologies. Our first efforts used a synergistic combination of 2 NGS platforms (Roche/454 and Illumina GAII), detailed bacterial artificial chromosome (BAC) maps, and unique assembly tools to sequence and assemble the genome of the domesticated turkey, Meleagris gallopavo. Since the first release in 2010, efforts to improve the genome assembly, gene annotation, and genomic analyses continue. The initial assembly build (2.01) represented about 89% of the genome sequence with 17X coverage depth (931 Mb). Sequence contigs were assigned to 30 of the 40 chromosomes with approximately 10% of the assembled sequence corresponding to unassigned chromosomes (ChrUn). The sequence has been refined through both genome-wide and area-focused sequencing, including shotgun and paired-end sequencing, and targeted sequencing of chromosomal regions with low or incomplete coverage. These additional efforts have improved the sequence assembly resulting in 2 subsequent genome builds of higher genome coverage (25X/Build3.0 and 30X/Build4.0) with a current sequence totaling 1,010 Mb. Further, BAC with end sequences assigned to the Z/W and MG18 (MHC) chromosomes, ChrUn, or not placed in the previous build were isolated, deeply sequenced (Hi-Seq), and incorporated into the latest build (5.0). To aid in the annotation and to generate a gene expression atlas of major tissues, a comprehensive set of RNA samples was collected at various developmental stages of female and male turkeys. Transcriptome sequencing data (using Illumina Hi-Seq) will provide information to enhance the final assembly and ultimately improve sequence annotation. The most current sequence covers more than 95% of the turkey genome and should yield a much improved gene level of annotation, making it a valuable resource for studying genetic variations underlying economically important traits in poultry.

Subjects

Chromosome Mapping/methods; Genome; Sequence Analysis, DNA/methods; Turkeys/genetics; Animals; Chromosome Mapping/veterinary; Chromosomes, Artificial, Bacterial; High-Throughput Nucleotide Sequencing

17.

PSAURON: a tool for assessing protein annotation across a broad range of species.

Sommer, Markus J; Zimin, Aleksey V; Salzberg, Steven L.

bioRxiv ; 2024 May 18.

Article in English | MEDLINE | ID: mdl-38798674

ABSTRACT

Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein coding region. PSAURON scores can be used for genome-wide protein annotation assessment as well as the rapid identification of potentially spurious annotated proteins. Validation against established benchmarks demonstrates PSAURON's effectiveness and correlation with recognized measures of protein quality, highlighting its potential use as a general-purpose method to evaluate gene annotation. PSAURON is open source and freely available at https://github.com/salzberg-lab/PSAURON . One-Sentence Summary: PSAURON is a machine learning-based tool for rapid assessment of protein coding gene annotation.

18.

Direct sequencing of insect symbionts via nanopore adaptive sampling.

Badger, Jonathan H; Giordano, Rosanna; Zimin, Aleksey; Wappel, Robert; Eskipehlivan, Senem M; Muller, Stephanie; Donthu, Ravikiran; Soto-Adames, Felipe; Vieira, Paulo; Zasada, Inga; Goodwin, Sara.

Curr Opin Insect Sci ; 61: 101135, 2024 Feb.

Article in English | MEDLINE | ID: mdl-37926187

ABSTRACT

Insect symbionts can alter their host's phenotype, and their effects can range from beneficial to pathogenic. Moreover, many insects exhibit co-infections, making their study more challenging. Less than 1% of insect species have high-quality reference genomes available, and fewer still also have their symbionts sequenced. Two methods are commonly used to sequence symbionts: whole-genome sequencing to concomitantly capture the host and bacterial genomes, or isolation of the symbiont's genome before sequencing. These methods are limited when dealing with rare or poorly characterized symbionts. Long-read technology is an important tool for generating high-quality genomes, as it can overcome the high levels of heterozygosity, repeat content, and transposable elements that confound short-read methods. Oxford Nanopore (ONT) adaptive sampling allows a sequencing instrument to select or reject sequences in real time. We describe a method based on a subtractive ONT adaptive sampling approach that readily permitted the sequencing of the complete genomes of mitochondria, Buchnera and its plasmids (pLeu, pTrp), and Wolbachia in two aphid species, Aphis glycines and Pentalonia nigronervosa. Adaptive sampling is able to retrieve organelles such as mitochondria and symbionts that have high representation in their hosts, such as Buchnera and Wolbachia, but is less successful at retrieving symbionts present in low concentrations.

Subjects

Buchnera; Nanopores; Animals; Buchnera/genetics; DNA Transposable Elements; Insects/genetics

19.

A genome sequence for the threatened whitebark pine.

Neale, David B; Zimin, Aleksey V; Meltzer, Amy; Bhattarai, Akriti; Amee, Maurice; Figueroa Corona, Laura; Allen, Brian J; Puiu, Daniela; Wright, Jessica; De La Torre, Amanda R; McGuire, Patrick E; Timp, Winston; Salzberg, Steven L; Wegrzyn, Jill L.

G3 (Bethesda) ; 14(5), 2024 May 07.

Article in English | MEDLINE | ID: mdl-38526344

ABSTRACT

Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the Western contiguous United States and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gb of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gb). Approximately 87.2% (24.0 Gb) of total sequence was placed on the 12 WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the 3 subclasses of NLRs. Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo-assembled transcriptomes.

Subjects

Genome, Plant; Molecular Sequence Annotation; Pinus; Pinus/genetics; Pinus/parasitology; Genomics/methods; Endangered Species; High-Throughput Nucleotide Sequencing

20.

Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis.

Dalloul, Rami A; Long, Julie A; Zimin, Aleksey V; Aslam, Luqman; Beal, Kathryn; Blomberg, Le Ann; Bouffard, Pascal; Burt, David W; Crasta, Oswald; Crooijmans, Richard P M A; Cooper, Kristal; Coulombe, Roger A; De, Supriyo; Delany, Mary E; Dodgson, Jerry B; Dong, Jennifer J; Evans, Clive; Frederickson, Karin M; Flicek, Paul; Florea, Liliana; Folkerts, Otto; Groenen, Martien A M; Harkins, Tim T; Herrero, Javier; Hoffmann, Steve; Megens, Hendrik-Jan; Jiang, Andrew; de Jong, Pieter; Kaiser, Pete; Kim, Heebal; Kim, Kyu-Won; Kim, Sungwon; Langenberger, David; Lee, Mi-Kyung; Lee, Taeheon; Mane, Shrinivasrao; Marcais, Guillaume; Marz, Manja; McElroy, Audrey P; Modise, Thero; Nefedov, Mikhail; Notredame, Cédric; Paton, Ian R; Payne, William S; Pertea, Geo; Prickett, Dennis; Puiu, Daniela; Qioa, Dan; Raineri, Emanuele; Ruffier, Magali.

PLoS Biol ; 8(9), 2010 Sep 07.

Article in English | MEDLINE | ID: mdl-20838655

ABSTRACT

A synergistic combination of two next-generation sequencing platforms with a detailed comparative BAC physical contig map provided a cost-effective assembly of the genome sequence of the domestic turkey (Meleagris gallopavo). Heterozygosity of the sequenced source genome allowed discovery of more than 600,000 high quality single nucleotide variants. Despite this heterozygosity, the current genome assembly (~1.1 Gb) includes 917 Mb of sequence assigned to specific turkey chromosomes. Annotation identified nearly 16,000 genes, with 15,093 recognized as protein coding and 611 as non-coding RNA genes. Comparative analysis of the turkey, chicken, and zebra finch genomes, and comparing avian to mammalian species, supports the characteristic stability of avian genomes and identifies genes unique to the avian lineage. Clear differences are seen in number and variety of genes of the avian immune system where expansions and novel genes are less frequent than examples of gene loss. The turkey genome sequence provides resources to further understand the evolution of vertebrate genomes and genetic variation underlying economically important quantitative traits in poultry. This integrated approach may be a model for providing both gene and chromosome level assemblies of other species with agricultural, ecological, and evolutionary interest.

Subjects

Genome; Turkeys/genetics; Animals; Base Sequence; Chromosome Mapping; DNA/genetics; Polymorphism, Single Nucleotide; Sequence Analysis, DNA; Sequence Homology, Nucleic Acid; Species Specificity
