Results 1 - 20 of 409
1.
Nat Commun ; 15(1): 8357, 2024 Sep 27.
Article in English | MEDLINE | ID: mdl-39333501

ABSTRACT

For taxonomy-based classification of metagenomics assembled contigs, current methods use sequence similarity to identify their most likely taxonomy. However, in the related field of metagenomic binning, contigs are routinely clustered using information from both the contig sequences and their abundance. We introduce Taxometer, a neural network-based method that improves the annotations and estimates the quality of any taxonomic classifier using contig abundance profiles and tetra-nucleotide frequencies. We apply Taxometer to five short-read CAMI2 datasets and find that it increases the average share of correct species-level contig annotations of the MMSeqs2 tool from 66.6% to 86.2%. Additionally, it reduces the share of wrong species-level annotations in the CAMI2 Rhizosphere dataset by an average of two-fold for Metabuli, Centrifuge, and Kraken2. Furthermore, we use Taxometer for benchmarking taxonomic classifiers on two complex long-read metagenomics data sets where ground truth is not known. Taxometer is available as open-source software and can enhance any taxonomic annotation of metagenomic contigs.


Subjects
Metagenomics , Software , Metagenomics/methods , Neural Networks, Computer , Classification/methods , Metagenome/genetics , Algorithms , Contig Mapping/methods , Rhizosphere
2.
BMC Bioinformatics ; 25(1): 241, 2024 Jul 16.
Article in English | MEDLINE | ID: mdl-39014300

ABSTRACT

BACKGROUND: Using next-generation sequencing technologies, scientists can sequence complex microbial communities directly from the environment. Significant insights into the structure, diversity, and ecology of microbial communities have resulted from the study of metagenomics. The assembly of reads into longer contigs, which are then binned into groups of contigs that correspond to different species in the metagenomic sample, is a crucial step in the analysis of metagenomics. It is necessary to organize these contigs into operational taxonomic units (OTUs) for further taxonomic profiling and functional analysis. For binning, which is synonymous with the clustering of OTUs, the tetra-nucleotide frequency (TNF) is typically utilized as a compositional feature for each OTU. RESULTS: In this paper, we present AFIT, a new l-mer statistic vector for each contig, and AFITBin, a novel method for metagenomic binning based on AFIT and a matrix factorization method. To evaluate the performance of the AFIT vector, the t-SNE algorithm is used to compare species clustering based on AFIT and TNF information. In addition, the efficacy of AFITBin is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning methods such as MetaBAT 2, MaxBin 2.0, CONCOCT, MetaCon, SolidBin, BusyBee Web, and MetaBinner. To further analyze the performance of the proposed AFIT vector, we compare the barcodes of the AFIT vector and the TNF vector. CONCLUSION: The results demonstrate that AFITBin shows superior performance in taxonomic identification compared to existing methods, leveraging the AFIT vector for improved results in metagenomic binning. This approach holds promise for advancing the analysis of metagenomic data, providing more reliable insights into microbial community composition and function. AVAILABILITY: A python package is available at: https://github.com/SayehSobhani/AFITBin .


Subjects
Algorithms , Metagenomics , Metagenomics/methods , Nucleotides/genetics , High-Throughput Nucleotide Sequencing/methods , Software , Microbiota/genetics , Sequence Analysis, DNA/methods , Cluster Analysis , Contig Mapping/methods , Metagenome/genetics
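
The tetra-nucleotide frequency (TNF) feature mentioned in the record above can be illustrated with a minimal Python sketch. It is not taken from AFITBin or any other tool named in the abstract; the function name `tetranucleotide_frequencies` is hypothetical, and the sketch assumes a plain 256-dimensional vector over A, C, G, T without merging reverse complements, which many binners do in addition.

```python
from collections import Counter
from itertools import product


def tetranucleotide_frequencies(sequence: str) -> dict[str, float]:
    """Return the relative frequency of each of the 256 possible 4-mers.

    Hypothetical helper for illustration only. Windows that contain
    characters other than A, C, G, T (for example N) are ignored.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Normalise only over windows made of unambiguous bases.
    total = sum(counts[k] for k in kmers)
    if total == 0:
        return {k: 0.0 for k in kmers}
    return {k: counts[k] / total for k in kmers}


if __name__ == "__main__":
    tnf = tetranucleotide_frequencies("ACGTACGT")
    print(tnf["ACGT"])
```

In a binning setting, one such vector would be computed per contig and passed to a clustering step together with abundance information.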
3.
G3 (Bethesda) ; 14(8)2024 Aug 07.
Article in English | MEDLINE | ID: mdl-38861393

ABSTRACT

The nine-spined stickleback (Pungitius pungitius) has been increasingly used as a model system in studies of local adaptation and sex chromosome evolution but its current reference genome assembly is far from perfect, lacking distinct sex chromosomes. We generated an improved assembly of the nine-spined stickleback reference genome (98.3% BUSCO completeness) with the aid of linked-read mapping. While the new assembly (v8) was of similar size as the earlier version (v7), we were able to assign 4.4 times more contigs to the linkage groups and improve the contiguity of the genome. Moreover, the new assembly contains a ∼22.8 Mb Y-linked scaffold (LG22) consisting mainly of previously assigned X-contigs, putative Y-contigs, putative centromere contigs, and highly repetitive elements. The male individual showed an even mapping depth on LG12 (pseudo X chromosome) and LG22 (Y-linked scaffold) in the segregating sites, suggesting near-pure X and Y representation in the v8 assembly. A total of 26,803 genes were annotated, and about 33% of the assembly was found to consist of repetitive elements. The high proportion of repetitive elements in LG22 (53.10%) suggests it can be difficult to assemble the complete sequence of the species' Y chromosome. Nevertheless, the new assembly is a significant improvement over the previous version and should provide a valuable resource for genomic studies of stickleback fishes.


Subjects
Genome , Smegmamorpha , Animals , Smegmamorpha/genetics , Male , Female , Genomics/methods , Contig Mapping/methods , Sex Chromosomes/genetics , Chromosome Mapping , Molecular Sequence Annotation , Genetic Linkage , Repetitive Sequences, Nucleic Acid
4.
Bioinformatics ; 38(10): 2675-2682, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561180

ABSTRACT

MOTIVATION: Crucial to the correctness of a genome assembly is the accuracy of the underlying scaffolds that specify the orders and orientations of contigs together with the gap distances between contigs. The current methods construct scaffolds based on the alignments of 'linking' reads against contigs. We found that some 'optimal' alignments are mistaken due to factors such as the contig boundary effect, particularly in the presence of repeats. Occasionally, the incorrect alignments can even overwhelm the correct ones. Detecting incorrect linking information is challenging for existing methods. RESULTS: In this study, we present a novel scaffolding method, RegScaf. It first examines the distribution of distances between contigs, obtained from read alignments, using a kernel density estimate. When multiple modes are shown in a density, orientation-supported links are grouped into clusters, each of which defines a linking distance corresponding to a mode. The linear model parameterizes contigs by their positions on the genome; then each linking distance between a pair of contigs is taken as an observation on the difference of their positions. The parameters are estimated by minimizing a global loss function, which is a version of trimmed sum of squares. The least trimmed squares estimate has such a high breakdown value that it can automatically remove the mistaken linking distances. The results on both synthetic and real datasets demonstrate that RegScaf outperforms some popular scaffolders, especially in the accuracy of gap estimates by substantially reducing extremely abnormal errors. Its strength in resolving repeat regions is exemplified by a real case. Its adaptability to large genomes and TGS long reads is validated as well. AVAILABILITY AND IMPLEMENTATION: RegScaf is publicly available at https://github.com/lemontealala/RegScaf.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , Contig Mapping/methods , Genome , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods
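
The idea of a least trimmed squares (LTS) estimate, which the record above credits with discarding mistaken linking distances, can be shown on the simplest possible case: a single gap distance estimated from several noisy observations. The Python sketch below is not RegScaf's implementation, which fits a linear model over all contig positions; the function name `lts_location` and the `trim_fraction` parameter are hypothetical. It relies on the known property that, for one-dimensional data, the best subset is a contiguous window of the sorted values.

```python
def lts_location(values: list[float], trim_fraction: float = 0.25) -> float:
    """Least trimmed squares estimate of a single location parameter.

    Illustrative helper only. Keeps the h = n - floor(n * trim_fraction)
    observations whose squared deviations from their own mean sum to the
    smallest value, and returns that mean.
    """
    if not values:
        raise ValueError("values must not be empty")
    xs = sorted(values)
    n = len(xs)
    h = max(1, n - int(n * trim_fraction))  # number of observations kept
    best_mean, best_loss = xs[0], float("inf")
    # In one dimension the optimal h-subset is contiguous in sorted order.
    for start in range(n - h + 1):
        window = xs[start:start + h]
        mean = sum(window) / h
        loss = sum((x - mean) ** 2 for x in window)
        if loss < best_loss:
            best_mean, best_loss = mean, loss
    return best_mean


if __name__ == "__main__":
    # Five plausible linking distances plus two gross errors (made-up numbers).
    distances = [500, 510, 495, 505, 490, 5000, 4800]
    print(lts_location(distances, trim_fraction=0.3))
```

With these made-up numbers the ordinary mean is pulled above 1700 by the two outliers, whereas the trimmed estimate stays at the centre of the consistent observations.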
5.
Genome Biol ; 23(1): 29, 2022 01 20.
Article in English | MEDLINE | ID: mdl-35057847

ABSTRACT

Haplotype-resolved de novo assembly of highly diverse virus genomes is critical in prevention, control and treatment of viral diseases. Current methods either can handle only relatively accurate short read data, or collapse haplotype-specific variations into consensus sequence. Here, we present Strainline, a novel approach to assemble viral haplotypes from noisy long reads without a reference genome. Strainline is the first approach to provide strain-resolved, full-length de novo assemblies of viral quasispecies from noisy third-generation sequencing data. Benchmarking on simulated and real datasets of varying complexity and diversity confirms this novelty and demonstrates the superiority of Strainline.


Subjects
Contig Mapping/methods , Genome, Viral , Haplotypes , SARS-CoV-2/genetics , Software , Benchmarking , COVID-19/virology , High-Throughput Nucleotide Sequencing , Humans , SARS-CoV-2/classification , Sequence Analysis, DNA
6.
Nucleic Acids Res ; 49(20): e117, 2021 11 18.
Article in English | MEDLINE | ID: mdl-34417615

ABSTRACT

Scaffolding, i.e. ordering and orienting contigs, is an important step in genome assembly. We present a method for scaffolding using second-generation sequencing reads based on likelihoods of genome assemblies. A generative model for sequencing is used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then link contigs if they can be unambiguously joined or if the corresponding increase in likelihood is substantially greater than that of other possible joins of those contigs. The method is implemented in a tool called Swalo with approximations to make it efficient and applicable to large datasets. Analysis on real and simulated datasets reveals that it consistently makes more correct joins than, or a similar number to, other scaffolders while linking very few contigs incorrectly, thus outperforming other scaffolders and demonstrating that substantial improvement in genome assembly may be achieved through the use of statistical models. Swalo is freely available for download at https://atifrahman.github.io/SWALO/.


Subjects
Contig Mapping/methods , Sequence Analysis, DNA/methods , Software , Bacteria , Humans , Likelihood Functions
7.
PLoS Genet ; 17(8): e1009705, 2021 08.
Article in English | MEDLINE | ID: mdl-34437539

ABSTRACT

Whole-genome duplication and genome compaction are thought to have played important roles in teleost fish evolution. Ayu (or sweetfish), Plecoglossus altivelis, belongs to the superorder Stomiati, order Osmeriformes. Stomiati is phylogenetically classified as sister taxa of Neoteleostei. Thus, ayu holds an important position in the fish tree of life. Although ayu is economically important for the food industry and recreational fishing in Japan, few genomic resources are available for this species. To address this problem, we produced a draft genome sequence of ayu by whole-genome shotgun sequencing and constructed linkage maps using a genotyping-by-sequencing approach. Syntenic analyses of ayu and other teleost fish provided information about chromosomal rearrangements during the divergence of Stomiati, Protacanthopterygii and Neoteleostei. The size of the ayu genome indicates that genome compaction occurred after the divergence of the family Osmeridae. Ayu has an XX/XY sex-determination system for which we identified sex-associated loci by a genome-wide association study by genotyping-by-sequencing and whole-genome resequencing using wild populations. Genome-wide association mapping using wild ayu populations revealed three sex-linked scaffolds (total, 2.03 Mb). Comparison of whole-genome resequencing mapping coverage between males and females identified male-specific regions in sex-linked scaffolds. A duplicate copy of the anti-Müllerian hormone type-II receptor gene (amhr2bY) was found within these male-specific regions, distinct from the autosomal copy of amhr2. Expression of the Y-linked amhr2 gene was male-specific in sox9b-positive somatic cells surrounding germ cells in undifferentiated gonads, whereas autosomal amhr2 transcripts were detected in somatic cells in sexually undifferentiated gonads of both genetic males and females. Loss-of-function mutation for amhr2bY induced male to female sex reversal. 
Taken together with the known role of Amh and Amhr2 in sex differentiation, these results indicate that the paralog of amhr2 on the ayu Y chromosome determines genetic sex, and the male-specific amh-amhr2 pathway is critical for testicular differentiation in ayu.


Subjects
Contig Mapping/methods , Osmeriformes/genetics , Receptors, Peptide/genetics , Receptors, Transforming Growth Factor beta/genetics , Whole Genome Sequencing/methods , Animals , Female , Fish Proteins/genetics , Loss of Function Mutation , Male , Sex Characteristics , Synteny
8.
Sci Rep ; 11(1): 14944, 2021 07 22.
Article in English | MEDLINE | ID: mdl-34294764

ABSTRACT

Picrorhiza kurrooa is an endangered medicinal herb which is distributed across the Himalayan region at altitudes of 3000-5000 m above mean sea level. The medicinal properties of P. kurrooa are attributed to monoterpenoid picrosides present in leaf, rhizome and root of the plant. However, no genomic information is currently available for P. kurrooa, which limits our understanding of its molecular systems and associated responses. The present study brings the first assembled draft genome of P. kurrooa by using 227 Gb of raw data generated by Illumina and PacBio RS II sequencing platforms. The assembled genome has a size of n = ~1.7 Gb with 12,924 scaffolds. Four-pronged assembly quality validation studies, including experimentally reported ESTs mapping and directed sequencing of the assembled contigs, confirmed the high reliability of the assembly. About 76% of the genome is covered by complex repeats alone. Annotation revealed 24,798 protein coding and 9789 non-coding genes. Using the assembled genome, a total of 710 miRNAs were discovered, many of which were found responsible for molecular response against temperature changes. The miRNAs and targets were validated experimentally. The availability of draft genome sequence will aid in genetic improvement and conservation of P. kurrooa. Also, this study provided an efficient approach for assembling complex genomes while dealing with repeats when regular assemblers failed to progress due to repeats.


Subjects
Contig Mapping/methods , Genome, Plant , Picrorhiza/genetics , Sequence Analysis, DNA/methods , Endangered Species , Genome Size , High-Throughput Nucleotide Sequencing , Plants, Medicinal/genetics
10.
PLoS One ; 16(4): e0249850, 2021.
Article in English | MEDLINE | ID: mdl-33844699

ABSTRACT

In this article, we present QuASeR, a reference-free DNA sequence reconstruction implementation via de novo assembly on both gate-based and quantum annealing platforms. This is the first time this important application in bioinformatics is modeled using quantum computation. Each one of the four steps of the implementation (TSP, QUBO, Hamiltonians and QAOA) is explained with a proof-of-concept example to target both the genomics research community and quantum application developers in a self-contained manner. The implementation and results on executing the algorithm from a set of DNA reads to a reconstructed sequence, on a gate-based quantum simulator, the D-Wave quantum annealing simulator and hardware are detailed. We also highlight the limitations of current classical simulation and available quantum hardware systems. The implementation is open-source and can be found on https://github.com/QE-Lab/QuASeR.


Subjects
Sequence Analysis, DNA/methods , Software , Animals , Contig Mapping/methods , Humans
11.
Genes (Basel) ; 12(5)2021 04 26.
Article in English | MEDLINE | ID: mdl-33926025

ABSTRACT

Sequencing of whole microbial genomes has become a standard procedure for cluster detection, source tracking, outbreak investigation and surveillance of many microorganisms. An increasing number of laboratories are currently in a transition phase from classical methods towards next generation sequencing, generating unprecedented amounts of data. Since the precision of downstream analyses depends significantly on the quality of raw data generated on the sequencing instrument, a comprehensive, meaningful primary quality control is indispensable. Here, we present AQUAMIS, a Snakemake workflow for an extensive quality control and assembly of raw Illumina sequencing data, allowing laboratories to automatize the initial analysis of their microbial whole-genome sequencing data. AQUAMIS performs all steps of primary sequence analysis, consisting of read trimming, read quality control (QC), taxonomic classification, de-novo assembly, reference identification, assembly QC and contamination detection, both on the read and assembly level. The results are visualized in an interactive HTML report including species-specific QC thresholds, allowing non-bioinformaticians to assess the quality of sequencing experiments at a glance. All results are also available as a standard-compliant JSON file, facilitating easy downstream analyses and data exchange. We have applied AQUAMIS to analyze ~13,000 microbial isolates as well as ~1000 in-silico contaminated datasets, proving the workflow's ability to perform in high throughput routine sequencing environments and reliably predict contaminations. We found that intergenus and intragenus contaminations can be detected most accurately using a combination of different QC metrics available within AQUAMIS.


Subjects
Genome, Bacterial , Quality Control , Whole Genome Sequencing/methods , Contig Mapping/methods , Contig Mapping/standards , DNA Contamination , Escherichia coli , Listeria monocytogenes , Salmonella enterica , Sensitivity and Specificity , Software , Species Specificity , Whole Genome Sequencing/standards , Workflow
12.
Nat Commun ; 12(1): 1935, 2021 04 28.
Article in English | MEDLINE | ID: mdl-33911078

ABSTRACT

Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. To date, these assemblies have been best created with complex protocols, such as cultured cells that contain a single-haplotype (haploid) genome, single cells where haplotypes are separated, or co-sequencing of parental genomes in a trio-based approach. These approaches are impractical in most situations. To address this issue, we present FALCON-Phase, a phasing tool that uses ultra-long-range Hi-C chromatin interaction data to extend phase blocks of partially-phased diploid assemblies to chromosome or scaffold scale. FALCON-Phase uses the inherent phasing information in Hi-C reads, skipping variant calling, and reduces the computational complexity of phasing. Our method is validated on three benchmark datasets generated as part of the Vertebrate Genomes Project (VGP), including human, cow, and zebra finch, for which high-quality, fully haplotype-resolved assemblies are available using the trio-based approach. FALCON-Phase is accurate without having parental data and performance is better in samples with higher heterozygosity. For cow and zebra finch the accuracy is 97% compared to 80-91% for human. FALCON-Phase is applicable to any draft assembly that contains long primary contigs and phased associate contigs.


Subjects
Contig Mapping/methods , Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Algorithms , Animals , Cattle , Haplotypes/genetics , Humans , Polymorphism, Single Nucleotide/genetics , Zebrafish/genetics
13.
Nat Commun ; 12(1): 1485, 2021 03 05.
Article in English | MEDLINE | ID: mdl-33674578

ABSTRACT

Yeast whole genome sequencing (WGS) lacks end-to-end workflows that identify genetic engineering. Here we present Prymetime, a tool that assembles yeast plasmids and chromosomes and annotates genetic engineering sequences. It is a hybrid workflow: it uses short and long reads as inputs to perform separate linear and circular assembly steps. This structure is necessary to accurately resolve genetic engineering sequences in plasmids and the genome. We show this by assembling diverse engineered yeasts, in some cases revealing unintended deletions and integrations. Furthermore, the resulting whole genomes are high quality, although the underlying assembly software does not consistently resolve highly repetitive genome features. Finally, we assemble plasmids and genome integrations from metagenomic sequencing, even with 1 engineered cell in 1000. This work is a blueprint for building WGS workflows and establishes WGS-based identification of yeast genetic engineering.


Subjects
Genetic Engineering/methods , Genome, Fungal , Saccharomyces cerevisiae/genetics , Whole Genome Sequencing/methods , Base Sequence , Chromosomes , Chromosomes, Artificial, Yeast , Cloning, Molecular , Computer Simulation , Contig Mapping/methods , Metagenome , Metagenomics , Plasmids , Software , Transformation, Genetic
14.
Nat Genet ; 53(4): 574-584, 2021 04.
Article in English | MEDLINE | ID: mdl-33737755

ABSTRACT

Rye is a valuable food and forage crop, an important genetic resource for wheat and triticale improvement and an indispensable material for efficient comparative genomic studies in grasses. Here, we sequenced the genome of Weining rye, an elite Chinese rye variety. The assembled contigs (7.74 Gb) accounted for 98.47% of the estimated genome size (7.86 Gb), with 93.67% of the contigs (7.25 Gb) assigned to seven chromosomes. Repetitive elements constituted 90.31% of the assembled genome. Compared to previously sequenced Triticeae genomes, Daniela, Sumaya and Sumana retrotransposons showed strong expansion in rye. Further analyses of the Weining assembly shed new light on genome-wide gene duplications and their impact on starch biosynthesis genes, physical organization of complex prolamin loci, gene expression features underlying early heading trait and putative domestication-associated chromosomal regions and loci in rye. This genome sequence promises to accelerate genomic and breeding studies in rye and related cereal crops.


Subjects
Contig Mapping/methods , Crops, Agricultural/genetics , Genome, Plant , Plant Proteins/genetics , Quantitative Trait, Heritable , Secale/genetics , Gene Duplication , Gene Expression Regulation, Plant , Genetic Loci , Genome Size , High-Throughput Nucleotide Sequencing , Plant Breeding , Plant Proteins/metabolism , Retroelements , Starch/biosynthesis , Triticum/genetics
15.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33634311

ABSTRACT

In the field of genome assembly, scaffolding methods make it possible to obtain a more complete and contiguous reference genome, which is the cornerstone of genomic research. Scaffolding methods typically utilize the alignments between contigs and sequencing data (reads) to determine the orientation and order among contigs and to produce longer scaffolds, which are helpful for genomic downstream analysis. With the rapid development of high-throughput sequencing technologies, diverse types of reads have emerged over the past decade, especially in long-range sequencing, which have greatly enhanced the assembly quality of scaffolding methods. As the number of scaffolding methods increases, biology and bioinformatics researchers need to perform in-depth analyses of state-of-the-art scaffolding methods. In this article, we focus on the difficulties in scaffolding, the differences in characteristics among various kinds of reads, the methods by which current scaffolding methods address these difficulties, and future research opportunities. We hope this work will benefit the design of new scaffolding methods and the selection of appropriate scaffolding methods for specific biological studies.


Subjects
Computational Biology/methods , Contig Mapping/methods , Genome , Software , Animals , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA
16.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33621981

ABSTRACT

Contigs assembled from the third-generation sequencing long reads are usually more complete than the second-generation short reads. However, the current algorithms still have difficulty in assembling the long reads into the ideal complete and accurate genome, or the theoretical best result [1]. To improve the long read contigs and with more and more fully sequenced genomes available, it could still be possible to use the similar genome-assisted reassembly method [2], which was initially proposed for the short reads making use of a closely related genome (similar genome) to the sequencing genome (target genome). The method aligns the contigs and reads to the similar genome, and then extends and refines the aligned contigs with the aligned reads. Here, we introduce AlignGraph2, a similar genome-assisted reassembly pipeline for the PacBio long reads. The AlignGraph2 pipeline is the second version of AlignGraph algorithm proposed by us but completely redesigned, can be inputted with either error-prone or HiFi long reads, and contains four novel algorithms: similarity-aware alignment algorithm and alignment filtration algorithm for alignment of the long reads and preassembled contigs to the similar genome, and reassembly algorithm and weight-adjusted consensus algorithm for extension and refinement of the preassembled contigs. In our performance tests on both error-prone and HiFi long reads, AlignGraph2 can align 5.7-27.2% more long reads and 7.3-56.0% more bases than some current alignment algorithm and is more efficient or comparable to the others. For contigs assembled with various de novo algorithms and aligned to similar genomes (aligned contigs), AlignGraph2 can extend 8.7-94.7% of them (extendable contigs), and obtain contigs of 7.0-249.6% larger N50 value and 5.2-87.7% smaller number of indels per 100 kbp (extended contigs). With genomes of decreased similarities, AlignGraph2 also has relatively stable performance. 
The AlignGraph2 software can be downloaded for free from this site: https://github.com/huangs001/AlignGraph2.


Subjects
Algorithms , Arabidopsis/genetics , CpG Islands/genetics , Genome, Fungal , Genome, Human , Genome, Plant , Saccharomyces cerevisiae/genetics , Sequence Alignment/methods , Software , Contig Mapping/methods , High-Throughput Nucleotide Sequencing/methods , Humans , Whole Genome Sequencing/methods
17.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33429431

ABSTRACT

With the rapid progress of sequencing technologies, various types of sequencing reads and assembly algorithms have been designed to construct genome assemblies. Although recent studies have attempted to evaluate the appropriate type of sequencing reads and algorithms for assembling high-quality genomes, it is still a challenge to set the correct combination for constructing animal genomes. Here, we present a comparative performance assessment of 14 assembly combinations (9 software programs with different short and long reads) of Duroc pig. Based on the results of the optimization process for genome construction, we designed an integrated hybrid de novo assembly pipeline, HSCG, and constructed a draft genome for Duroc pig. Comparison between the new genome and Sus scrofa 11.1 revealed important breakpoints in two S. scrofa 11.1 genes. Our findings may provide new insights into the pan-genome analysis studies of agricultural animals, and the integrated assembly pipeline may serve as a guide for the assembly of other animal genomes.


Subjects
Algorithms , Chromosome Mapping/methods , Contig Mapping/methods , Genome , Swine/genetics , Animals , Gene Library , High-Throughput Nucleotide Sequencing , Male , Sequence Analysis, DNA , Software
18.
Nat Biotechnol ; 39(4): 422-430, 2021 04.
Article in English | MEDLINE | ID: mdl-33318652

ABSTRACT

Generating accurate genome assemblies of large, repeat-rich human genomes has proved difficult using only long, error-prone reads, and most human genomes assembled from long reads add accurate short reads to polish the consensus sequence. Here we report an algorithm for hybrid assembly, WENGAN, that provides very high quality at low computational cost. We demonstrate de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms to improve assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 17.24-80.64 Mb), few assembly errors (contig NGA50: 11.8-59.59 Mb), good consensus quality (QV: 27.84-42.88) and high gene completeness (BUSCO complete: 94.6-95.2%), while consuming low computational resources (CPU hours: 187-1,200). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50: 59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb).


Subjects
Computational Biology/methods , Contig Mapping/methods , Genome, Human , Algorithms , Haploidy , High-Throughput Nucleotide Sequencing , Humans , Sequence Analysis, DNA
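
The contiguity figures quoted in the record above (contig NG50) follow a simple definition that a short Python sketch can make concrete. The function name `nx50` and the example lengths are hypothetical and chosen only for illustration. N50 is the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly size; NG50 uses half of an estimated genome size instead. NGA50, which additionally breaks contigs at misassemblies, is not covered here.

```python
from typing import Optional


def nx50(lengths: list[int], genome_size: Optional[int] = None) -> int:
    """Return N50, or NG50 when an estimated genome_size is supplied.

    Illustrative helper only. Returns 0 when the contigs cover less than
    half of genome_size, in which case NG50 is undefined.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    reference = genome_size if genome_size is not None else sum(lengths)
    target = reference / 2
    running = 0
    # Walk from the longest contig down until half the reference is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20]       # made-up contig lengths
    print(nx50(contigs))                      # N50
    print(nx50(contigs, genome_size=400))     # NG50 for an assumed genome size
```

Because NG50 is normalised by genome size rather than assembly size, it allows assemblies of different total length to be compared on the same footing.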
19.
BMC Genomics ; 21(1): 631, 2020 Sep 14.
Article in English | MEDLINE | ID: mdl-32928108

ABSTRACT

BACKGROUND: We benchmarked the hybrid assembly approaches of MaSuRCA, SPAdes, and Unicycler for bacterial pathogens using Illumina and Oxford Nanopore sequencing by determining genome completeness and accuracy, antimicrobial resistance (AMR), virulence potential, multilocus sequence typing (MLST), phylogeny, and pan genome. Ten bacterial species (10 strains) were tested for simulated reads of both mediocre- and low-quality, whereas 11 bacterial species (12 strains) were tested for real reads. RESULTS: Unicycler performed the best for achieving contiguous genomes, closely followed by MaSuRCA, while all SPAdes assemblies were incomplete. MaSuRCA was less tolerant of low-quality long reads than SPAdes and Unicycler. The hybrid assemblies of five antimicrobial-resistant strains with simulated reads provided consistent AMR genotypes with the reference genomes. The MaSuRCA assembly of Staphylococcus aureus with real reads contained msr(A) and tet(K), while the reference genome and SPAdes and Unicycler assemblies harbored blaZ. The AMR genotypes of the reference genomes and hybrid assemblies were consistent for the other five antimicrobial-resistant strains with real reads. The numbers of virulence genes in all hybrid assemblies were similar to those of the reference genomes, irrespective of simulated or real reads. Only one exception existed that the reference genome and hybrid assemblies of Pseudomonas aeruginosa with mediocre-quality long reads carried 241 virulence genes, whereas 184 virulence genes were identified in the hybrid assemblies of low-quality long reads. The MaSuRCA assemblies of Escherichia coli O157:H7 and Salmonella Typhimurium with mediocre-quality long reads contained 126 and 118 virulence genes, respectively, while 110 and 107 virulence genes were detected in their MaSuRCA assemblies of low-quality long reads, respectively. All approaches performed well in our MLST and phylogenetic analyses. The pan genomes of the hybrid assemblies of S. 
Typhimurium with mediocre-quality long reads were similar to that of the reference genome, while SPAdes and Unicycler were more tolerant of low-quality long reads than MaSuRCA for the pan-genome analysis. All approaches functioned well in the pan-genome analysis of Campylobacter jejuni with real reads. CONCLUSIONS: Our research demonstrates the hybrid assembly pipeline of Unicycler as a superior approach for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing.


Subjects
Genome, Bacterial , Genomics/methods , Nanopore Sequencing/methods , Benchmarking , Campylobacter jejuni , Contig Mapping/methods , Contig Mapping/standards , Cronobacter sakazakii , Drug Resistance, Bacterial , Genomics/standards , Listeria monocytogenes , Nanopore Sequencing/standards , Pseudomonas aeruginosa , Salmonella typhimurium , Virulence
20.
Genetics ; 216(2): 599-608, 2020 10.
Article in English | MEDLINE | ID: mdl-32796007

ABSTRACT

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus.


Subjects
Chromosomes, Plant/genetics , Contig Mapping/methods , Gene Dosage , Triticum/genetics , Contig Mapping/standards , Genome, Plant , Genomics/methods , Genomics/standards , Reference Standards