Results 1 - 20 of 41
1.
bioRxiv ; 2024 May 18.
Article in English | MEDLINE | ID: mdl-38798674

ABSTRACT

Evaluating the accuracy of protein-coding sequences in genome annotations is a challenging problem for which there is no broadly applicable solution. In this manuscript we introduce PSAURON (Protein Sequence Assessment Using a Reference ORF Network), a novel software tool developed to assess the quality of protein-coding gene annotations. Utilizing a machine learning model trained on a diverse dataset from over 1000 plant and animal genomes, PSAURON assigns a score to coding DNA or protein sequence that reflects the likelihood that the sequence is a genuine protein coding region. PSAURON scores can be used for genome-wide protein annotation assessment as well as the rapid identification of potentially spurious annotated proteins. Validation against established benchmarks demonstrates PSAURON's effectiveness and correlation with recognized measures of protein quality, highlighting its potential use as a general-purpose method to evaluate gene annotation. PSAURON is open source and freely available at https://github.com/salzberg-lab/PSAURON . One-Sentence Summary: PSAURON is a machine learning-based tool for rapid assessment of protein coding gene annotation.

2.
G3 (Bethesda) ; 14(5), 2024 05 07.
Article in English | MEDLINE | ID: mdl-38526344

ABSTRACT

Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the Western contiguous United States and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gb of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gb). Approximately 87.2% (24.0 Gb) of total sequence was placed on the 12 WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the 3 subclasses of NLRs. Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo-assembled transcriptomes.


Subjects
Plant Genome, Molecular Sequence Annotation, Pinus, Pinus/genetics, Pinus/parasitology, Genomics/methods, Endangered Species, High-Throughput Nucleotide Sequencing
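The contig and scaffold N50 values quoted in the abstract above follow the usual definition: the largest length L such that sequences of length L or longer hold at least half of the assembled bases. The sketch below illustrates that definition only; the function name and the toy lengths are hypothetical and are not taken from the paper or its software.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy contig lengths (not data from the assembly).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))
```

Because the statistic depends only on the sorted lengths, the same function applies to contigs and to scaffolds.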
3.
bioRxiv ; 2023 Nov 17.
Article in English | MEDLINE | ID: mdl-38014212

ABSTRACT

Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the western contiguous US and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gbp of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gbp). Approximately 87.2% (24.0 Gbp) of total sequence was placed on the twelve WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich-repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the three subclasses of NLRs (TNL, CNL, RNL). Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo assembled transcriptomes.

4.
Genome Biol Evol ; 15(7), 2023 07 03.
Article in English | MEDLINE | ID: mdl-37364298

ABSTRACT

Stalk-eyed flies in the genus Teleopsis carry selfish genetic elements that induce sex ratio (SR) meiotic drive and impact the fitness of male and female carriers. Here, we assemble and describe a chromosome-level genome assembly of the stalk-eyed fly, Teleopsis dalmanni, to elucidate patterns of divergence associated with SR. The genome contains tens of thousands of transposable element (TE) insertions and hundreds of transcriptionally and insertionally active TE families. By resequencing pools of SR and ST males using short and long reads, we find widespread differentiation and divergence between XSR and XST associated with multiple nested inversions involving most of the SR haplotype. Examination of genomic coverage and gene expression data revealed seven X-linked genes with elevated expression and coverage in SR males. The most extreme and likely drive candidate involves an XSR-specific expansion of an array of partial copies of JASPer, a gene necessary for maintenance of euchromatin and associated with regulation of TE expression. In addition, we find evidence for rapid protein evolution between XSR and XST for testis expressed and novel genes, that is, either recent duplicates or lacking a Dipteran ortholog, including an X-linked duplicate of maelstrom, which is also involved in TE silencing. Overall, the evidence suggests that this ancient XSR polymorphism has had a variety of impacts on repetitive DNA and its regulation in this species.


Subjects
Diptera, X Chromosome, Animals, Female, Male, X Chromosome/genetics, Diptera/genetics, Sex Ratio, Eye, Testis
5.
PLoS Comput Biol ; 19(3): e1011032, 2023 03.
Article in English | MEDLINE | ID: mdl-37000853

ABSTRACT

Advances in long-read sequencing technologies have dramatically improved the contiguity and completeness of genome assemblies. Using the latest nanopore-based sequencers, we can generate enough data for the assembly of a human genome from a single flow cell. With the long-read data from these sequences, we can now routinely produce de novo genome assemblies in which half or more of a genome is contained in megabase-scale contigs. Assemblies produced from nanopore data alone, though, have relatively high error rates and can benefit from a process called polishing, in which more-accurate reads are used to correct errors in the consensus sequence. In this manuscript, we present a novel tool for genome polishing called JASPER (Jellyfish-based Assembly Sequence Polisher for Error Reduction). In contrast to many other polishing methods, JASPER gains efficiency by avoiding the alignment of reads to the assembly. Instead, JASPER uses a database of k-mer counts that it creates from the reads to detect and correct errors in the consensus. Our experiments demonstrate that JASPER is faster than alignment-based polishers, and both faster and more accurate than other k-mer based polishing methods. We also introduce the idea of using a polishing tool to create population-specific reference genomes, and illustrate this idea using sequence data from multiple individuals from Tokyo, Japan.


Subjects
High-Throughput Nucleotide Sequencing, Nanopores, Humans, DNA Sequence Analysis, Human Genome/genetics, Metagenomics
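The alignment-free idea in the abstract above (build a table of k-mer counts from the reads, then look for consensus k-mers that the reads do not support) can be illustrated with a toy sketch. This is not JASPER's implementation: the function names, the `min_count` threshold, and the example sequences are assumptions made for illustration, and the sketch only flags suspicious positions without attempting a correction.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (toy stand-in for a k-mer database)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(assembly, counts, k, min_count=2):
    """Return start positions of assembly k-mers seen fewer than min_count times in the reads.

    min_count is a hypothetical threshold; a real tool would choose it
    from the observed k-mer count distribution.
    """
    return [
        i
        for i in range(len(assembly) - k + 1)
        if counts[assembly[i:i + k]] < min_count
    ]


# Hypothetical toy data: the assembly carries one substitution (A -> T at index 4).
reads = ["ACGTACGT", "CGTACGTA", "GTACGTAC"]
assembly = "ACGTTCGT"
kmer_counts = count_kmers(reads, k=4)
print(suspicious_positions(assembly, kmer_counts, k=4))
```

A single base error makes every k-mer that overlaps it unsupported, so a run of consecutive flagged positions points at the base they share. A corrector could then try alternative bases at that position and keep the one whose k-mers are all well supported.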
6.
G3 (Bethesda) ; 13(3), 2023 03 09.
Article in English | MEDLINE | ID: mdl-36630290

ABSTRACT

We used long-read DNA sequencing to assemble the genome of a Southern Han Chinese male. We organized the sequence into chromosomes and filled in gaps using the recently completed T2T-CHM13 genome as a guide, yielding a gap-free genome, Han1, containing 3,099,707,698 bases. Using the T2T-CHM13 annotation as a reference, we mapped all genes onto the Han1 genome and identified additional gene copies, generating a total of 60,708 putative genes, of which 20,003 are protein-coding. A comprehensive comparison between the genes revealed that 235 protein-coding genes were substantially different between the individuals, with frameshifts or truncations affecting the protein-coding sequence. Most of these were heterozygous variants in which one gene copy was unaffected. This represents the first gene-level comparison between two finished, annotated individual human genomes.


Subjects
East Asian People, Human Genome, Humans, Male, East Asian People/genetics, Molecular Sequence Annotation, DNA Sequence Analysis
7.
Proc Natl Acad Sci U S A ; 119(28): e2122301119, 2022 07 12.
Article in English | MEDLINE | ID: mdl-35867761

ABSTRACT

The gastropod mollusk Aplysia is an important model for cellular and molecular neurobiological studies, particularly for investigations of molecular mechanisms of learning and memory. We developed an optimized assembly pipeline to generate an improved Aplysia nervous system transcriptome. This improved transcriptome enabled us to explore the evolution of cognitive capacity at the molecular level. Were there evolutionary expansions of neuronal genes between this relatively simple gastropod Aplysia (20,000 neurons) and Octopus (500 million neurons), the invertebrate with the most elaborate neuronal circuitry and greatest behavioral complexity? Are the tremendous advances in cognitive power in vertebrates explained by expansion of the synaptic proteome that resulted from multiple rounds of whole genome duplication in this clade? Overall, the complement of genes linked to neuronal function is similar between Octopus and Aplysia. As expected, a number of synaptic scaffold proteins have more isoforms in humans than in Aplysia or Octopus. However, several scaffold families present in mollusks and other protostomes are absent in vertebrates, including the Fifes, Lev10s, SOLs, and a NETO family. Thus, whereas vertebrates have more scaffold isoforms from select families, invertebrates have additional scaffold protein families not found in vertebrates. This analysis provides insights into the evolution of the synaptic proteome. Both synaptic proteins and synaptic plasticity evolved gradually, yet the last deuterostome-protostome common ancestor already possessed an elaborate suite of genes associated with synaptic function, and critical for synaptic plasticity.


Subjects
Aplysia, Biological Evolution, Cognition, Synapses, Animals, Aplysia/genetics, Aplysia/metabolism, Neuronal Plasticity/genetics, Neurons/metabolism, Protein Isoforms/genetics, Proteome, Synapses/metabolism, Transcriptome
8.
Nat Commun ; 13(1): 2047, 2022 04 19.
Article in English | MEDLINE | ID: mdl-35440538

ABSTRACT

The genus Quercus, which emerged ∼55 million years ago during globally warm temperatures, diversified into ∼450 extant species. We present a high-quality de novo genome assembly of a California endemic oak, Quercus lobata, revealing features consistent with oak evolutionary success. Effective population size remained large throughout history despite declining since early Miocene. Analysis of 39,373 mapped protein-coding genes outlined copious duplications consistent with genetic and phenotypic diversity, both by retention of genes created during the ancient γ whole genome hexaploid duplication event and by tandem duplication within families, including numerous resistance genes and a very large block of duplicated DUF247 genes, which have been found to be associated with self-incompatibility in grasses. An additional surprising finding is that subcontext-specific patterns of DNA methylation associated with transposable elements reveal broadly-distributed heterochromatin in intergenic regions, similar to grasses. Collectively, these features promote genetic and phenotypic variation that would facilitate adaptability to changing environments.


Subjects
Quercus, Biological Evolution, DNA Methylation/genetics, Epigenome, Molecular Evolution, Humans, Quercus/genetics
9.
PLoS Comput Biol ; 18(2): e1009860, 2022 02.
Article in English | MEDLINE | ID: mdl-35120119

ABSTRACT

Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca.


Subjects
High-Throughput Nucleotide Sequencing/methods, Software
10.
G3 (Bethesda) ; 12(1), 2022 01 04.
Article in English | MEDLINE | ID: mdl-35100403

ABSTRACT

Sequencing, assembly, and annotation of the 26.5 Gbp hexaploid genome of coast redwood (Sequoia sempervirens) was completed, leading toward discovery of genes related to climate adaptation and investigation of the origin of the hexaploid genome. Deep-coverage short-read Illumina sequencing data from haploid tissue from a single seed were combined with long-read Oxford Nanopore Technologies sequencing data from diploid needle tissue to create an initial assembly, which was then scaffolded using proximity ligation data to produce a highly contiguous final assembly, SESE 2.1, with a scaffold N50 size of 44.9 Mbp. The assembly included several scaffolds that span entire chromosome arms, confirmed by the presence of telomere and centromere sequences on the ends of the scaffolds. The structural annotation produced 118,906 genes, with 113 containing introns that exceed 500 kbp in length and one reaching 2 Mbp. Nearly 19 Gbp of the genome represented repetitive content, with the vast majority characterized as long terminal repeats, with a 2.9:1 ratio of Copia to Gypsy elements that may aid in gene expression control. Comparison of coast redwood to other conifers revealed species-specific expansions for a plethora of abiotic and biotic stress response genes, including those involved in fungal disease resistance, detoxification, and physical injury/structural remodeling and others supporting flavonoid biosynthesis. Analysis of multiple genes that exist in triplicate in coast redwood but only once in its diploid relative, giant sequoia, supports a previous hypothesis that the hexaploidy is the result of autopolyploidy rather than any hybridizations with separate but closely related conifer species.


Subjects
Sequoia, Biological Evolution, Chromosomes, Genome, High-Throughput Nucleotide Sequencing, Sequoia/genetics
11.
Article in English | MEDLINE | ID: mdl-37602140

ABSTRACT

Kraken and KrakenUniq are widely used tools for classifying metagenomics sequences. A key requirement for these systems is a database containing all k-mers from all genomes that the users want to be able to detect, where k = 31 by default. This database can be very large, easily exceeding 100 gigabytes (GB) and sometimes 400 GB. Previously, Kraken and KrakenUniq required loading the entire database into main memory (RAM), and if RAM was insufficient, they used memory mapping, which significantly increased the running time for large datasets. We have implemented a new algorithm in KrakenUniq that allows it to load and process the database in chunks, with only a modest increase in running time. This enhancement now makes it feasible to run KrakenUniq on very large datasets and huge databases on virtually any computer, even a laptop, while providing the same very high classification accuracy as the previous system. Statement of need: The KrakenUniq software classifies reads from metagenomic samples to establish which organisms are present in the samples and estimate their abundance. The software is widely used by researchers and clinicians in medical diagnostics, microbiome and environmental studies. Typical databases used by KrakenUniq are tens to hundreds of gigabytes in size. The original KrakenUniq code required loading the entire database in RAM, which demanded expensive high-memory servers to run it efficiently. If a user did not have enough physical RAM to load the entire database, KrakenUniq resorted to memory-mapping the database, which significantly increased run times, frequently by a factor of more than 100. The new functionality described in this paper enables users who do not have access to high-memory servers to run KrakenUniq efficiently, with CPU time increasing by only 3- to 4-fold rather than by a factor of 100 or more.
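The chunked-loading idea described above can be sketched as follows: rather than holding the whole k-mer database in memory, each chunk is loaded in turn, every read's k-mers are looked up against it, and the hits are accumulated across chunks. This is a simplified illustration, not KrakenUniq's code: the function name, the dictionary-based chunks, and the majority-vote label assignment are all assumptions, and real Kraken-family tools resolve hits over a taxonomy tree.

```python
from collections import Counter


def classify_in_chunks(reads, db_chunks, k):
    """Assign each read the label with the most k-mer hits across all chunks.

    db_chunks is an iterable of dicts (k-mer -> label); in a real setting each
    chunk would be read from disk so that only one is in memory at a time.
    """
    # Extract each read's k-mers once, so every chunk pass is a pure lookup.
    read_kmers = [
        [read[i:i + k] for i in range(len(read) - k + 1)] for read in reads
    ]
    hits = [Counter() for _ in reads]
    for chunk in db_chunks:  # one chunk resident at a time
        for idx, kmers in enumerate(read_kmers):
            for kmer in kmers:
                label = chunk.get(kmer)
                if label is not None:
                    hits[idx][label] += 1
    # Simplified majority vote; unclassified reads get None.
    return [h.most_common(1)[0][0] if h else None for h in hits]


# Hypothetical toy database split into two chunks.
chunks = [{"ACG": "taxA"}, {"TTT": "taxB", "CGT": "taxA"}]
print(classify_in_chunks(["ACGT", "TTTT", "GGGG"], chunks, k=3))
```

The trade-off is that every read is revisited once per chunk, which is consistent with the modest running-time increase the abstract reports in exchange for a much smaller memory footprint.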

12.
Genetics ; 220(2), 2022 02 04.
Article in English | MEDLINE | ID: mdl-34897437

ABSTRACT

Until 2019, the human genome was available in only one fully annotated version, GRCh38, which was the result of 18 years of continuous improvement and revision. Despite dramatic improvements in sequencing technology, no other genome was available as an annotated reference until 2019, when the genome of an Ashkenazi individual, Ash1, was released. In this study, we describe the assembly and annotation of a second individual genome, from a Puerto Rican individual whose DNA was collected as part of the Human Pangenome project. The new genome, called PR1, is the first true reference genome created from an individual of African descent. Due to recent improvements in both sequencing and assembly technology, and particularly to the use of the recently completed CHM13 human genome as a guide to assembly, PR1 is more complete and more contiguous than either GRCh38 or Ash1. Annotation revealed 37,755 genes (of which 19,999 are protein coding), including 12 additional gene copies that are present in PR1 and missing from CHM13. Fifty-seven genes have fewer copies in PR1 than in CHM13, 9 map only partially, and 3 genes (all noncoding) from CHM13 are entirely missing from PR1.


Subjects
Black People, Human Genome, Hispanic or Latino/genetics, Humans, Molecular Sequence Annotation
13.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: mdl-36762707

ABSTRACT

The orb web is a remarkable example of animal architecture that is observed in families of spiders that diverged over 200 million years ago. While several genomes exist for araneid orb-weavers, none exist for other orb-weaving families, hampering efforts to investigate the genetic basis of this complex behavior. Here we present a chromosome-level genome assembly for the cribellate orb-weaving spider Uloborus diversus. The assembly reinforces evidence of an ancient arachnid genome duplication and identifies complete open reading frames for every class of spidroin gene, which encode the proteins that are the key structural components of spider silks. We identify the 2 X chromosomes of U. diversus as well as candidate sex-determining loci. This chromosome-level assembly will be a valuable resource for evolutionary research into the origins of orb-weaving, spidroin evolution, chromosomal rearrangement, and chromosomal sex determination in spiders.


Subjects
Fibroins, Spiders, Animals, Phylogeny, Fibroins/genetics, Silk/genetics, Genome, Sex Chromosomes/genetics, Spiders/genetics
14.
Sci Adv ; 7(26), 2021 06.
Article in English | MEDLINE | ID: mdl-34162536

ABSTRACT

The American lobster, Homarus americanus, is integral to marine ecosystems and supports an important commercial fishery. This iconic species also serves as a valuable model for deciphering neural networks controlling rhythmic motor patterns and olfaction. Here, we report a high-quality draft assembly of the H. americanus genome with 25,284 predicted gene models. Analysis of the neural gene complement revealed extraordinary development of the chemosensory machinery, including a profound diversification of ligand-gated ion channels and secretory molecules. The discovery of a novel class of chimeric receptors coupling pattern recognition and neurotransmitter binding suggests a deep integration between the neural and immune systems. A robust repertoire of genes involved in innate immunity, genome stability, cell survival, chemical defense, and cuticle formation represents a diversity of defense mechanisms essential to thrive in the benthic marine environment. Together, these unique evolutionary adaptations contribute to the longevity and ecological success of this long-lived benthic predator.


Subjects
Longevity, Nephropidae, Animals, Ecosystem, Longevity/genetics, Nephropidae/genetics, Nephropidae/metabolism, Nervous System
15.
G3 (Bethesda) ; 10(11): 3907-3919, 2020 11 05.
Article in English | MEDLINE | ID: mdl-32948606

ABSTRACT

The giant sequoia (Sequoiadendron giganteum) of California are massive, long-lived trees that grow along the U.S. Sierra Nevada mountains. Genomic data are limited in giant sequoia, and producing a reference genome sequence has been an important goal to allow marker development for restoration and management. Using deep-coverage Illumina and Oxford Nanopore sequencing, combined with Dovetail chromosome conformation capture libraries, the genome was assembled into eleven chromosome-scale scaffolds containing 8.125 Gbp of sequence. Iso-Seq transcripts, assembled from three distinct tissues, were used as evidence to annotate a total of 41,632 protein-coding genes. The genome was found to contain, distributed unevenly across all 11 chromosomes and in 63 orthogroups, over 900 complete or partial predicted NLR genes, of which 375 are supported by annotation derived from protein evidence and gene modeling. This giant sequoia reference genome sequence represents the first genome sequenced in the Cupressaceae family, and lays a foundation for using genomic tools to aid in giant sequoia conservation and management.


Subjects
Sequoiadendron, Chromosomes, Genome, High-Throughput Nucleotide Sequencing, Molecular Sequence Annotation, Trees
16.
Genetics ; 216(2): 599-608, 2020 10.
Article in English | MEDLINE | ID: mdl-32796007

ABSTRACT

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus.


Subjects
Plant Chromosomes/genetics, Contig Mapping/methods, Gene Dosage, Triticum/genetics, Contig Mapping/standards, Plant Genome, Genomics/methods, Genomics/standards, Reference Standards
17.
Genome Biol ; 21(1): 129, 2020 06 02.
Article in English | MEDLINE | ID: mdl-32487205

ABSTRACT

BACKGROUND: Thousands of experiments and studies use the human reference genome as a resource each year. This single reference genome, GRCh38, is a mosaic created from a small number of individuals, representing a very small sample of the human population. There is a need for reference genomes from multiple human populations to avoid potential biases. RESULTS: Here, we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are > 99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. Forty of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. Eleven genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~ 1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes. CONCLUSIONS: The Ash1 genome is presented as a reference for any genetic studies involving Ashkenazi Jewish individuals.


Subjects
Human Genome, Humans, Molecular Sequence Annotation, Reference Values, Genetic Translocation
18.
PLoS Comput Biol ; 16(6): e1007981, 2020 06.
Article in English | MEDLINE | ID: mdl-32589667

ABSTRACT

The introduction of third-generation DNA sequencing technologies in recent years has allowed scientists to generate dramatically longer sequence reads, which when used in whole-genome sequencing projects have yielded better repeat resolution and far more contiguous genome assemblies. While the promise of better contiguity has held true, the relatively high error rate of long reads, averaging 8-15%, has made it challenging to generate a highly accurate final sequence. Current long-read sequencing technologies display a tendency toward systematic errors, in particular in homopolymer regions, which present additional challenges. A cost-effective strategy to generate highly contiguous assemblies with a very low overall error rate is to combine long reads with low-cost short-read data, which currently have an error rate below 0.5%. This hybrid strategy can be pursued either by incorporating the short-read data into the early phase of assembly, during the read correction step, or by using short reads to "polish" the consensus built from long reads. In this report, we present the assembly polishing tool POLCA (POLishing by Calling Alternatives) and compare its performance with two other popular polishing programs, Pilon and Racon. We show that on simulated data POLCA is more accurate than Pilon, and comparable in accuracy to Racon. On real data, all three programs show similar performance, but POLCA is consistently much faster than either of the other polishing programs.


Subjects
Bacterial Genome, Algorithms, Biopolymers/genetics, DNA Sequence Analysis/methods
19.
Gigascience ; 9(5), 2020 05 01.
Article in English | MEDLINE | ID: mdl-32432329

ABSTRACT

BACKGROUND: The release of the first reference genome of walnut (Juglans regia L.) enabled many achievements in the characterization of walnut genetic and functional variation. However, it is highly fragmented, preventing the integration of genetic, transcriptomic, and proteomic information to fully elucidate walnut biological processes. FINDINGS: Here, we report the new chromosome-scale assembly of the walnut reference genome (Chandler v2.0) obtained by combining Oxford Nanopore long-read sequencing with chromosome conformation capture (Hi-C) technology. Relative to the previous reference genome, the new assembly features an 84.4-fold increase in N50 size, with the 16 chromosomal pseudomolecules assembled and representing 95% of its total length. Using full-length transcripts from single-molecule real-time sequencing, we predicted 37,554 gene models, with a mean gene length higher than the previous gene annotations. Most of the new protein-coding genes (90%) present both start and stop codons, which represents a significant improvement compared with Chandler v1.0 (only 48%). We then tested the potential impact of the new chromosome-level genome on different areas of walnut research. By studying the proteome changes occurring during male flower development, we observed that the virtual proteome obtained from Chandler v2.0 presents fewer artifacts than the previous reference genome, enabling the identification of a new potential pollen allergen in walnut. Also, the new chromosome-scale genome facilitates in-depth studies of intraspecies genetic diversity by revealing previously undetected autozygous regions in Chandler, likely resulting from inbreeding, and 195 genomic regions highly differentiated between Western and Eastern walnut cultivars. CONCLUSION: Overall, Chandler v2.0 will serve as a valuable resource to better understand and explore walnut biology.


Subjects
Plant Chromosomes, Computational Biology/methods, Plant Genome, Genomics/methods, Juglans/genetics, Genetic Variation, Genome-Wide Association Study, Juglans/metabolism, Molecular Sequence Annotation, Open Reading Frames, Proteomics/methods, Species Specificity
20.
Insect Biochem Mol Biol ; 120: 103334, 2020 05.
Article in English | MEDLINE | ID: mdl-32109587

ABSTRACT

The soybean aphid, Aphis glycines Matsumura (Hemiptera: Aphididae) is a serious pest of the soybean plant, Glycine max, a major world-wide agricultural crop. We assembled a de novo genome sequence of Ap. glycines Biotype 1, from a culture established shortly after this species invaded North America. 20.4% of the Ap. glycines proteome is duplicated. These in-paralogs are enriched with Gene Ontology (GO) categories mostly related to apoptosis, a possible adaptation to plant chemistry and other environmental stressors. Approximately one-third of these genes show parallel duplication in other aphids. But Ap. gossypii, its closest related species, has the lowest number of these duplicated genes. An Illumina GoldenGate assay of 2380 SNPs was used to determine the world-wide population structure of Ap. glycines. Chinese and South Korean aphids are the closest to those in North America. China is the likely origin of other Asian aphid populations. The most distantly related aphids to those in North America are from Australia. The diversity of Ap. glycines in North America has decreased over time since its arrival. The genetic diversity of the North American Ap. glycines population, sampled from shortly after its first detection in 2001 up to 2012, does not appear to correlate with geography. However, aphids collected on soybean Rag experimental varieties in Minnesota (MN), Iowa (IA), and Wisconsin (WI), closer to high density Rhamnus cathartica stands, appear to have higher capacity to colonize resistant soybean plants than aphids sampled in Ohio (OH), North Dakota (ND), and South Dakota (SD). Samples from the former states have SNP alleles with high FST values and frequencies that overlap with genes involved in iron metabolism, a crucial metabolic pathway that may be affected by the Rag-associated soybean plant response. The Ap. glycines Biotype 1 genome will provide needed information for future analyses of mechanisms of aphid virulence and pesticide resistance as well as facilitate comparative analyses between aphids with differing natural history and host plant range.


Subjects
Biological Adaptation, Aphids/genetics, Biological Evolution, Ecotype, Insect Genome, Introduced Species, Alleles, Animals, Single Nucleotide Polymorphism, United States