Results 1 - 20 of 44
1.
Genes (Basel) ; 12(5)2021 04 26.
Article in English | MEDLINE | ID: mdl-33926025

ABSTRACT

Sequencing of whole microbial genomes has become a standard procedure for cluster detection, source tracking, outbreak investigation and surveillance of many microorganisms. An increasing number of laboratories are currently in a transition phase from classical methods towards next generation sequencing, generating unprecedented amounts of data. Since the precision of downstream analyses depends significantly on the quality of raw data generated on the sequencing instrument, a comprehensive, meaningful primary quality control is indispensable. Here, we present AQUAMIS, a Snakemake workflow for an extensive quality control and assembly of raw Illumina sequencing data, allowing laboratories to automatize the initial analysis of their microbial whole-genome sequencing data. AQUAMIS performs all steps of primary sequence analysis, consisting of read trimming, read quality control (QC), taxonomic classification, de-novo assembly, reference identification, assembly QC and contamination detection, both on the read and assembly level. The results are visualized in an interactive HTML report including species-specific QC thresholds, allowing non-bioinformaticians to assess the quality of sequencing experiments at a glance. All results are also available as a standard-compliant JSON file, facilitating easy downstream analyses and data exchange. We have applied AQUAMIS to analyze ~13,000 microbial isolates as well as ~1000 in-silico contaminated datasets, proving the workflow's ability to perform in high throughput routine sequencing environments and reliably predict contaminations. We found that intergenus and intragenus contaminations can be detected most accurately using a combination of different QC metrics available within AQUAMIS.


Subjects
Genome, Bacterial , Quality Control , Whole Genome Sequencing/methods , Contig Mapping/methods , Contig Mapping/standards , DNA Contamination , Escherichia coli , Listeria monocytogenes , Salmonella enterica , Sensitivity and Specificity , Software , Species Specificity , Whole Genome Sequencing/standards , Workflow
2.
BMC Genomics ; 21(1): 631, 2020 Sep 14.
Article in English | MEDLINE | ID: mdl-32928108

ABSTRACT

BACKGROUND: We benchmarked the hybrid assembly approaches of MaSuRCA, SPAdes, and Unicycler for bacterial pathogens using Illumina and Oxford Nanopore sequencing by determining genome completeness and accuracy, antimicrobial resistance (AMR), virulence potential, multilocus sequence typing (MLST), phylogeny, and pan genome. Ten bacterial species (10 strains) were tested with simulated reads of both mediocre and low quality, whereas 11 bacterial species (12 strains) were tested with real reads. RESULTS: Unicycler performed the best for achieving contiguous genomes, closely followed by MaSuRCA, while all SPAdes assemblies were incomplete. MaSuRCA was less tolerant of low-quality long reads than SPAdes and Unicycler. The hybrid assemblies of five antimicrobial-resistant strains with simulated reads provided AMR genotypes consistent with the reference genomes. The MaSuRCA assembly of Staphylococcus aureus with real reads contained msr(A) and tet(K), while the reference genome and the SPAdes and Unicycler assemblies harbored blaZ. The AMR genotypes of the reference genomes and hybrid assemblies were consistent for the other five antimicrobial-resistant strains with real reads. The numbers of virulence genes in all hybrid assemblies were similar to those of the reference genomes, irrespective of simulated or real reads. The one exception was that the reference genome and hybrid assemblies of Pseudomonas aeruginosa with mediocre-quality long reads carried 241 virulence genes, whereas 184 virulence genes were identified in the hybrid assemblies of low-quality long reads. The MaSuRCA assemblies of Escherichia coli O157:H7 and Salmonella Typhimurium with mediocre-quality long reads contained 126 and 118 virulence genes, respectively, while 110 and 107 virulence genes were detected in their MaSuRCA assemblies of low-quality long reads, respectively. All approaches performed well in our MLST and phylogenetic analyses. The pan genomes of the hybrid assemblies of S. Typhimurium with mediocre-quality long reads were similar to that of the reference genome, while SPAdes and Unicycler were more tolerant of low-quality long reads than MaSuRCA for the pan-genome analysis. All approaches functioned well in the pan-genome analysis of Campylobacter jejuni with real reads. CONCLUSIONS: Our research demonstrates the hybrid assembly pipeline of Unicycler as a superior approach for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing.


Subjects
Genome, Bacterial , Genomics/methods , Nanopore Sequencing/methods , Benchmarking , Campylobacter jejuni , Contig Mapping/methods , Contig Mapping/standards , Cronobacter sakazakii , Drug Resistance, Bacterial , Genomics/standards , Listeria monocytogenes , Nanopore Sequencing/standards , Pseudomonas aeruginosa , Salmonella typhimurium , Virulence
3.
Genetics ; 216(2): 599-608, 2020 10.
Article in English | MEDLINE | ID: mdl-32796007

ABSTRACT

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus.


Subjects
Chromosomes, Plant/genetics , Contig Mapping/methods , Gene Dosage , Triticum/genetics , Contig Mapping/standards , Genome, Plant , Genomics/methods , Genomics/standards , Reference Standards
4.
Genes (Basel) ; 11(7)2020 07 15.
Article in English | MEDLINE | ID: mdl-32679846

ABSTRACT

RPGR exon ORF15 variants are one of the most frequent causes for inherited retinal disorders (IRDs), in particular retinitis pigmentosa. The low sequence complexity of this mutation hotspot makes it prone to indels and challenging for sequence data analysis. Whole-exome sequencing generally fails to provide adequate coverage in this region. Therefore, complementary methods are needed to avoid false positives as well as negative results. In this study, next-generation sequencing (NGS) was used to sequence long-range PCR amplicons for an IRD cohort of African ancestry. By developing a novel secondary analysis pipeline based on de novo assembly, we were able to avoid the miscalling of variants generated by standard NGS analysis tools. We identified pathogenic variants in 11 patients (13% of the cohort), two of which have not been reported previously. We provide a novel and alternative end-to-end secondary analysis pipeline for targeted NGS of ORF15 that is less prone to false positive and negative variant calls.


Subjects
Eye Proteins/genetics , Genetic Testing/methods , High-Throughput Nucleotide Sequencing/methods , Retinitis Pigmentosa/genetics , Black People/genetics , Contig Mapping/methods , Contig Mapping/standards , Exons , Female , Genetic Testing/standards , High-Throughput Nucleotide Sequencing/standards , Humans , Indigenous Peoples/genetics , Male , Mutation , Pedigree , Retinitis Pigmentosa/diagnosis , Sensitivity and Specificity , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards
5.
Sci Rep ; 9(1): 17716, 2019 11 27.
Article in English | MEDLINE | ID: mdl-31776409

ABSTRACT

Atlantic herring (Clupea harengus) is one of the most abundant fish species in the world. It is an important economical and nutritional resource, as well as a crucial part of the North Atlantic ecosystem. In 2016, a draft herring genome assembly was published. Being a species of such importance, we sought to independently verify and potentially improve the herring genome assembly. We sequenced the herring genome generating paired-end, mate-pair, linked and long reads. Three assembly versions of the herring genome were generated based on a de novo assembly (A1), which was scaffolded using linked and long reads (A2) and then merged with the previously published assembly (A3). The resulting assemblies were compared using parameters describing the size, fragmentation, correctness, and completeness of the assemblies. Results showed that the A2 assembly was less fragmented, more complete and more correct than A1. A3 showed improvement in fragmentation and correctness compared with A2 and the published assembly but was slightly less complete than the published assembly. Thus, we here confirmed the previously published herring assembly, and made improvements by further scaffolding the assembly and removing low-quality sequences using linked and long reads and merging of assemblies.


Subjects
Contig Mapping/methods , Fishes/genetics , Genome , Whole Genome Sequencing/methods , Animals , Contig Mapping/standards , Whole Genome Sequencing/standards
6.
Gigascience ; 8(7)2019 07 01.
Article in English | MEDLINE | ID: mdl-31289836

ABSTRACT

BACKGROUND: Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. Similar sequence content can confound the mapping of short next-generation sequencing reads to a reference genome. It is therefore possible that the presence of both sex chromosomes in a reference genome can cause technical artifacts in genomic data and affect downstream analyses and applications. Understanding this problem is critical for medical genomics and population genomic inference. RESULTS: Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We find that sequence homology affects read mapping on the sex chromosomes and this has downstream effects on variant calling. However, we show that XYalign can correct mismapping, resulting in more accurate variant calling. We also show how metrics output by XYalign can be used to identify XX and XY individuals across diverse sequencing experiments, including low- and high-coverage whole-genome sequencing, and exome sequencing. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other uses including the identification of aneuploidy on the autosomes. XYalign is available open source under the GNU General Public License (version 3). CONCLUSIONS: Sex chromosome sequence homology causes the mismapping of short reads, which in turn affects downstream analyses. XYalign provides a reproducible framework to correct mismapping and improve variant calling on the sex chromosomes.


Subjects
Chromosomes, Human, X/genetics , Chromosomes, Human, Y/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Sequence Homology, Nucleic Acid , Artifacts , Contig Mapping/methods , Contig Mapping/standards , Female , High-Throughput Nucleotide Sequencing/standards , Humans , Male , Sequence Alignment/methods , Sequence Alignment/standards , Sequence Analysis, DNA/standards
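
Illustrative note on the record above: the abstract mentions inferring sex chromosome complement from sequencing depth. A minimal sketch of that general idea follows, comparing mean X and Y depth with autosomal depth. The function name, the two thresholds and the toy depths are assumptions made for this example; this is not XYalign's implementation, interface or default settings.

```python
def infer_sex_complement(autosome_depth, x_depth, y_depth,
                         x_two_copy_min=0.75, y_present_min=0.25):
    """Guess the sex chromosome complement from mean read depths.

    Depths are divided by the autosomal depth, so a chromosome present in
    two copies sits near 1.0 and a single copy near 0.5. Both thresholds
    are hypothetical values picked for this sketch.
    """
    if autosome_depth <= 0:
        raise ValueError("autosome_depth must be positive")
    x_ratio = x_depth / autosome_depth
    y_ratio = y_depth / autosome_depth
    # Two X copies if X depth is close to autosomal depth, otherwise one.
    complement = "XX" if x_ratio >= x_two_copy_min else "X"
    # A Y is called only when Y depth clearly exceeds background mismapping.
    if y_ratio >= y_present_min:
        complement += "Y"
    return complement


# Toy depths (hypothetical), both samples with ~30x autosomal depth.
print(infer_sex_complement(30.0, 29.0, 0.5))
print(infer_sex_complement(30.0, 15.5, 14.0))
```

With these toy values the first call yields "XX" and the second "XY". Because homologous X and Y sequence can be mismapped, as the abstract stresses, real thresholds would need calibrating on regions where reads map unambiguously.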
7.
Gigascience ; 8(5)2019 05 01.
Article in English | MEDLINE | ID: mdl-31077315

ABSTRACT

BACKGROUND: In recent years, massively parallel complementary DNA sequencing (RNA sequencing [RNA-Seq]) has emerged as a fast, cost-effective, and robust technology to study entire transcriptomes in various manners. In particular, for non-model organisms and in the absence of an appropriate reference genome, RNA-Seq is used to reconstruct the transcriptome de novo. Although the de novo transcriptome assembly of non-model organisms has been on the rise recently and new tools are frequently being developed, there is still a knowledge gap about which assembly software should be used to build a comprehensive de novo assembly. RESULTS: Here, we present a large-scale comparative study in which 10 de novo assembly tools are applied to 9 RNA-Seq data sets spanning different kingdoms of life. Overall, we built >200 single assemblies and evaluated their performance on a combination of 20 biological-based and reference-free metrics. Our study is accompanied by a comprehensive and extensible Electronic Supplement that summarizes all data sets, assembly execution instructions, and evaluation results. Trinity, SPAdes, and Trans-ABySS, followed by Bridger and SOAPdenovo-Trans, generally outperformed the other tools compared. Moreover, we observed species-specific differences in the performance of each assembler. No tool delivered the best results for all data sets. CONCLUSIONS: We recommend a careful choice and normalization of evaluation metrics to select the best assembling results as a critical step in the reconstruction of a comprehensive de novo transcriptome assembly.


Subjects
Sequence Analysis, RNA/methods , Software , Transcriptome , Animals , Arabidopsis , Contig Mapping/methods , Contig Mapping/standards , Escherichia coli , Humans , Mice , Sequence Analysis, RNA/standards
8.
Genes (Basel) ; 10(1)2019 01 18.
Article in English | MEDLINE | ID: mdl-30669388

ABSTRACT

A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives; however, relatively high DNA input requirements (~5 µg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 h movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes were present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. By sequencing and assembling material from a single diploid individual, only two haplotypes were present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size. This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.


Subjects
Anopheles/genetics , Genome, Insect , Sequence Analysis, DNA/methods , Animals , Contig Mapping/methods , Contig Mapping/standards , Ploidies , Polymorphism, Genetic , Sequence Analysis, DNA/standards
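
Illustrative note on the record above: the abstract reports contiguity as a contig N50 (3.5 Mb). For readers unfamiliar with the statistic, the sketch below shows the usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size. The function name and the toy contig lengths are assumptions for illustration and are not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy assembly of five contigs (total 270 bp).
toy_contigs = [80, 70, 50, 40, 30]
print(n50(toy_contigs))
```

For the toy list the half-way point is 135 bp; the two longest contigs sum to 150 bp, so the N50 is 70.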
9.
Genome Res ; 28(11): 1720-1732, 2018 11.
Article in English | MEDLINE | ID: mdl-30341161

ABSTRACT

Despite the rapid development of sequencing technologies, the assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout 2, a reference-assisted assembly tool that works for large and complex genomes. By taking one or more target assemblies (generated from an NGS assembler) and one or multiple related reference genomes, Ragout 2 infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. By using Ragout 2, we transformed NGS assemblies of 16 laboratory mouse strains into sets of complete chromosomes, leaving <5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realigning of long Pacific Biosciences (PacBio) reads, suggest only a small number of structural errors in the final assemblies, comparable with direct assembly approaches. We applied Ragout 2 to the Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared with other genomes from the Muridae family. Chromosome painting maps confirmed most large-scale rearrangements that Ragout 2 detected. We applied Ragout 2 to improve draft sequences of three ape genomes that have recently been published. Ragout 2 transformed three sets of contigs (generated using PacBio reads only) into chromosome-scale assemblies with accuracy comparable to chromosome assemblies generated in the original study using BioNano maps, Hi-C, BAC clones, and FISH.


Subjects
Contig Mapping/methods , Whole Genome Sequencing/methods , Animals , Contig Mapping/standards , Mice , Reference Standards , Whole Genome Sequencing/standards
10.
Sci Rep ; 8(1): 2059, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29391438

ABSTRACT

Single-cell genomics is a straightforward approach to obtain genomes from uncultured microbes. However, sequence reads from a single-cell amplified genome (SAG) contain significant bias and chimeric sequences. Here, we describe Cleaning and Co-assembly of a Single-Cell Amplified Genome (ccSAG), a novel analytical workflow to obtain composite single-cell genomes with elimination of sequence errors. By integrating ccSAG with a massively parallel single-cell genome amplification platform based on droplet microfluidics, we can generate multiple SAGs and effectively integrate them into composite genomes of a quality equivalent to data obtained from bulk DNA. We obtained two novel draft genomes from single gut microbial cells with high completeness (>96.6%) and extremely low contamination (<1.25%). Moreover, we revealed the presence of single nucleotide polymorphisms in a specific gene by sequence comparison at the single-cell level. Thus, the workflow yields near-complete genomes from uncultured microbes, and enables analyses of genetic heterogeneity within identical strains.


Subjects
Contig Mapping/methods , Gastrointestinal Microbiome , Genome, Bacterial , Single-Cell Analysis/methods , Whole Genome Sequencing/methods , Animals , Bacillus subtilis/genetics , Contig Mapping/standards , Escherichia coli/genetics , Male , Mice , Mice, Inbred ICR , Polymorphism, Single Nucleotide , Single-Cell Analysis/standards , Whole Genome Sequencing/standards
11.
J Comput Biol ; 25(2): 214-235, 2018 02.
Article in English | MEDLINE | ID: mdl-29028176

ABSTRACT

We formalize a new problem variant in gene-block discovery, denoted Reference-Anchored Gene Blocks (RAGB), given a query sequence Q of length n, representing the gene array of a DNA element, a window size bound d on the length of a substring of interest in Q, and a set of target gene sequences [Formula: see text]. Our objective is to identify gene blocks in [Formula: see text] that are centered in a subset q of co-localized genes from Q, and contain genomes from [Formula: see text] in which the corresponding orthologs of the genes from q are also co-localized. We cast RAGB as a variant of a (colored) biclique problem in bipartite graphs, and analyze its parameterized complexity, as well as the parameterized complexity of other related problems. We give an [Formula: see text] time algorithm for the uncolored variant of our biclique problem, where m is the number of areas of interest that are parsed from the target sequences, and n and d are as defined earlier. Our algorithm can be adapted to compute all maximal bicliques in the graph within the same time complexity, and to handle edge weights with a slight [Formula: see text] increase to its time complexity. For the colored version of the problem, our algorithm has a time complexity of [Formula: see text]. We implement the algorithm and exemplify its application to the data mining of proteobacterial gene blocks that are centered in predicted proteobacterial genomic islands, leading to the identification of putatively mobilized clusters of virulence, pathogenicity, and resistance genes.


Subjects
Genome, Bacterial , Multigene Family , Sequence Analysis, DNA/methods , Algorithms , Contig Mapping/methods , Contig Mapping/standards , Reference Standards , Sequence Analysis, DNA/standards
12.
Gigascience ; 6(11): 1-6, 2017 11 01.
Article in English | MEDLINE | ID: mdl-29092041

ABSTRACT

The chimpanzee is arguably the most important species for the study of human origins. A key resource for these studies is a high-quality reference genome assembly; however, as with most mammalian genomes, the current iteration of the chimpanzee reference genome assembly is highly fragmented. In the current iteration of the chimpanzee reference genome assembly (Pan_tro_2.1.4), the sequence is scattered across more than 183 000 contigs, incorporating more than 159 000 gaps, with a genome-wide contig N50 of 51 Kbp. In this work, we produce an extensive and diverse array of sequencing datasets to rapidly assemble a new chimpanzee reference that surpasses previous iterations in bases represented and organized in large scaffolds. To this end, we show substantial improvements over the current release of the chimpanzee genome (Pan_tro_2.1.4) by several metrics, such as increased contiguity by >750% and 300% on contigs and scaffolds, respectively, and closure of 77% of the gaps in the Pan_tro_2.1.4 assembly, spanning >850 Kbp of novel coding sequence based on RNA-Seq data. We further report more than 2700 genes that had putatively erroneous frame-shift predictions to human in Pan_tro_2.1.4 and show a substantial increase in the annotation of repetitive elements. We apply a simple 3-way hybrid approach to considerably improve the reference genome assembly for the chimpanzee, providing a valuable resource for the study of human origins. Furthermore, we produce extensive sequencing datasets that are all derived from the same cell line, generating a broad non-human benchmark dataset.


Subjects
Contig Mapping/standards , Genome , Genomics/standards , Molecular Sequence Annotation/standards , Pan troglodytes/genetics , Whole Genome Sequencing/standards , Animals , Contig Mapping/methods , Genomics/methods , Molecular Sequence Annotation/methods , Reference Standards , Whole Genome Sequencing/methods
13.
Genome Biol ; 18(1): 182, 2017 09 21.
Article in English | MEDLINE | ID: mdl-28934964

ABSTRACT

BACKGROUND: One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited. RESULTS: In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages. CONCLUSIONS: This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.


Subjects
Benchmarking/methods , Contig Mapping/methods , DNA Barcoding, Taxonomic/methods , Metagenome , Sequence Analysis, DNA/methods , Software , Benchmarking/standards , Contig Mapping/standards , DNA Barcoding, Taxonomic/standards , Humans , Microbiota , Phylogeny , Sequence Analysis, DNA/standards
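
Illustrative note on the record above: the abstract names abundance filtering and tool intersection as strategies against false positives, and compares tools by precision and recall. The sketch below shows one simple way those ideas can be expressed. All function names, the 2% cutoff and the toy profiles are assumptions made for this example; they are not taken from the study or from any of the benchmarked classifiers.

```python
def abundance_filter(profile, min_fraction):
    """Keep taxa whose relative abundance is at least min_fraction."""
    return {taxon for taxon, frac in profile.items() if frac >= min_fraction}


def intersect_calls(call_sets):
    """Keep only taxa reported by every tool (tool intersection)."""
    call_sets = [set(calls) for calls in call_sets]
    if not call_sets:
        return set()
    return set.intersection(*call_sets)


def precision_recall(called, truth):
    """Precision and recall of a set of called taxa against a truth set."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical mock community and two hypothetical classifier outputs.
truth = {"E. coli", "L. monocytogenes", "S. enterica"}
tool_a = {"E. coli": 0.50, "L. monocytogenes": 0.30,
          "S. enterica": 0.19, "B. subtilis": 0.01}
tool_b = {"E. coli": 0.55, "L. monocytogenes": 0.25,
          "S. enterica": 0.15, "P. aeruginosa": 0.05}

raw = precision_recall(set(tool_a), truth)
filtered = precision_recall(abundance_filter(tool_a, 0.02), truth)
consensus = precision_recall(intersect_calls([tool_a, tool_b]), truth)
print(raw, filtered, consensus)
```

In this toy case each tool reports one spurious taxon, and both filtering and intersection remove it without losing a true one. As the abstract notes, real samples are less forgiving: a cutoff that removes false positives can also discard genuine low-abundance species.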
14.
Sci Rep ; 7(1): 10963, 2017 09 08.
Article in English | MEDLINE | ID: mdl-28887485

ABSTRACT

Complementary to reference-based variant detection, recent studies revealed that many novel variants could be detected with de novo assembled genomes. To evaluate the effect of read coverage and the accuracy of assembly-based variant calling, we simulated short reads containing more than 3 million single nucleotide variants (SNVs) from the whole human genome and compared the efficiency of SNV calling between the assembly-based and alignment-based calling approaches. We assessed the quality of the assembled contigs and found that a minimum of 30X coverage of short reads was needed to ensure reliable SNV calling and to generate assembled contigs with good coverage of the genome and genes. In addition, we observed that the assembly-based approach had a much lower recall rate and precision compared with the alignment-based approach, which recovered 99% of imputed SNVs. We observed similar results with experimental reads for NA24385, an individual whose germline variants were well characterized. Although it adds value for SNV detection, the assembly-based approach carries a high risk of false discovery of novel SNVs. Further improvements in de novo assembly algorithms are needed to ensure good genome completeness with resolved haplotypes and high fidelity of assembled sequences.


Subjects
Contig Mapping/methods , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Contig Mapping/standards , Genome-Wide Association Study/standards , Humans , Sequence Alignment/standards , Sequence Analysis, DNA/standards
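
Illustrative note on the record above: the abstract's 30X minimum refers to mean read coverage, commonly computed as (number of reads × read length) / genome size. The sketch below turns that formula around to estimate how many reads a target coverage requires. The 150 bp read length and the 3.1 Gbp genome size are assumptions chosen for the example; the abstract states neither.

```python
def mean_coverage(n_reads, read_length, genome_size):
    """Mean coverage C = N * L / G for N reads of length L on a genome of size G."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest whole number of reads reaching the target mean coverage.

    Expects integer arguments so the ceiling division below stays exact.
    """
    total_bases = target_coverage * genome_size
    return -(-total_bases // read_length)  # ceiling division on integers


# Assumed values for illustration: 150 bp reads, ~3.1 Gbp human genome.
READ_LENGTH = 150
GENOME_SIZE = 3_100_000_000

needed = reads_needed(30, READ_LENGTH, GENOME_SIZE)
print(needed, mean_coverage(needed, READ_LENGTH, GENOME_SIZE))
```

Under these assumptions, 30X corresponds to 620 million reads. This is a mean over the genome; local depth varies, which is one reason a nominal coverage figure does not guarantee that every region is assembled or called reliably.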
15.
Genome ; 60(12): 1086-1088, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28841400

ABSTRACT

RNA-Seq methodology has been used to generate a comprehensive transcriptome sequence resource for perennial ryegrass, an important temperate pasture grass species. A total of 931 547 255 reads were obtained from libraries corresponding to 19 distinct tissue samples, including both vegetative and reproductive stages of development. Assembly of data generated a final filtered reference set of 48 713 contigs and scaffolds. The transcriptome resource will support whole genome sequence assembly, comparative genomics, implementation of genotyping-by-sequencing (GBS) methods based on transcript sampling, and identification of candidate genes for multiple biological functions.


Subjects
Contig Mapping/standards , Genome, Plant , Lolium/genetics , Transcriptome , Contig Mapping/methods , Molecular Sequence Annotation , Reference Values
16.
Gigascience ; 6(7): 1-8, 2017 07 01.
Article in English | MEDLINE | ID: mdl-28531267

ABSTRACT

The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.


Subjects
Contig Mapping/methods , Human Genome Project , Sequence Alignment/methods , Whole Genome Sequencing/methods , Algorithms , Contig Mapping/standards , Humans , Reference Standards , Sequence Alignment/standards , Whole Genome Sequencing/standards
17.
Genome Res ; 27(5): 885-896, 2017 05.
Article in English | MEDLINE | ID: mdl-28420692

ABSTRACT

Advances in genome sequencing and assembly technologies are generating many high-quality genome sequences, but assemblies of large, repeat-rich polyploid genomes, such as that of bread wheat, remain fragmented and incomplete. We have generated a new wheat whole-genome shotgun sequence assembly using a combination of optimized data types and an assembly algorithm designed to deal with large and complex genomes. The new assembly represents >78% of the genome with a scaffold N50 of 88.8 kb that has a high fidelity to the input data. Our new annotation combines strand-specific Illumina RNA-seq and Pacific Biosciences (PacBio) full-length cDNAs to identify 104,091 high-confidence protein-coding genes and 10,156 noncoding RNA genes. We confirmed three known genome rearrangements and identified one novel rearrangement. Our approach enables the rapid and scalable assembly of wheat genomes, the identification of structural variants, and the definition of complete gene models, all powerful resources for trait analysis and breeding of this key global crop.


Subjects
Contig Mapping/methods , Genome, Plant , Molecular Sequence Annotation/methods , Plant Proteins/genetics , Translocation, Genetic , Triticum/genetics , Algorithms , Contig Mapping/standards , Molecular Sequence Annotation/standards , Polymorphism, Genetic , Polyploidy
18.
Genome Res ; 27(5): 813-823, 2017 05.
Article in English | MEDLINE | ID: mdl-28360230

ABSTRACT

The most polymorphic part of the human genome, the MHC, encodes over 160 proteins of diverse function. Half of them, including the HLA class I and II genes, are directly involved in immune responses. Consequently, the MHC region strongly associates with numerous diseases and clinical therapies. Notoriously, the MHC region has been intractable to high-throughput analysis at complete sequence resolution, and current reference haplotypes are inadequate for large-scale studies. To address these challenges, we developed a method that specifically captures and sequences the 4.8-Mbp MHC region from genomic DNA. For 95 MHC homozygous cell lines we assembled, de novo, a set of high-fidelity contigs and a sequence scaffold, representing a mean 98% of the target region. Included are six alternative MHC reference sequences of the human genome that we completed and refined. Characterization of the sequence and structural diversity of the MHC region shows the approach accurately determines the sequences of the highly polymorphic HLA class I and HLA class II genes and the complex structural diversity of complement factor C4A/C4B. It has also uncovered extensive and unexpected diversity in other MHC genes; an example is MUC22, which encodes a lung mucin and exhibits more coding sequence alleles than any HLA class I or II gene studied here. More than 60% of the coding sequence alleles analyzed were previously uncharacterized. We have created a substantial database of robust reference MHC haplotype sequences that will enable future population scale studies of this complicated and clinically important region of the human genome.


Subjects
Complement C4/genetics , Genes, MHC Class II , Genes, MHC Class I , Haplotypes , Mucins/genetics , Polymorphism, Genetic , Animals , Cell Line , Contig Mapping/methods , Contig Mapping/standards , Genome, Human , Genomics/methods , Genomics/standards , Humans , Open Reading Frames , Pan troglodytes/genetics , Reference Standards
19.
Genome Res ; 27(5): 697-708, 2017 05.
Article in English | MEDLINE | ID: mdl-28360231

ABSTRACT

Accurate and contiguous genome assembly is key to a comprehensive understanding of the processes shaping genomic diversity and evolution. Yet, it is frequently constrained by constitutive heterochromatin, usually characterized by highly repetitive DNA. As a key feature of genome architecture associated with centromeric and subtelomeric regions, it locally influences meiotic recombination. In this study, we assess the impact of large tandem repeat arrays on the recombination rate landscape in an avian speciation model, the Eurasian crow. We assembled two high-quality genome references using single-molecule real-time sequencing (long-read assembly [LR]) and single-molecule optical maps (optical map assembly [OM]). A three-way comparison including the published short-read assembly (SR) constructed for the same individual allowed assessing assembly properties and pinpointing misassemblies. By combining information from all three assemblies, we characterized 36 previously unidentified large repetitive regions in the proximity of sequence assembly breakpoints, the majority of which contained complex arrays of a 14-kb satellite repeat or its 1.2-kb subunit. Using whole-genome population resequencing data, we estimated the population-scaled recombination rate (ρ) and found it to be significantly reduced in these regions. These findings are consistent with an effect of low recombination in regions adjacent to centromeric or subtelomeric heterochromatin and add to our understanding of the processes generating widespread heterogeneity in genetic diversity and differentiation along the genome. By combining three different technologies, our results highlight the importance of adding a layer of information on genome structure that is inaccessible to each approach independently.


Subjects
Contig Mapping/standards , Genome , Tandem Repeat Sequences , Animals , Chromatin/genetics , Chromatin/metabolism , Contig Mapping/methods , Crows/genetics , Homologous Recombination , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards
20.
Genome Res ; 27(5): 849-864, 2017 05.
Article in English | MEDLINE | ID: mdl-28396521

ABSTRACT

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.


Subjects
Contig Mapping/methods , Genome, Human , Genomics/methods , Sequence Analysis, DNA/methods , Software , Contig Mapping/standards , Genomics/standards , Haploidy , Haplotypes , Humans , Polymorphism, Genetic , Reference Standards , Sequence Analysis, DNA/standards