1.
Genes (Basel) ; 12(5)2021 04 26.
Article in English | MEDLINE | ID: mdl-33926025

ABSTRACT

Sequencing of whole microbial genomes has become a standard procedure for cluster detection, source tracking, outbreak investigation and surveillance of many microorganisms. An increasing number of laboratories are currently in a transition phase from classical methods towards next generation sequencing, generating unprecedented amounts of data. Since the precision of downstream analyses depends significantly on the quality of raw data generated on the sequencing instrument, a comprehensive, meaningful primary quality control is indispensable. Here, we present AQUAMIS, a Snakemake workflow for an extensive quality control and assembly of raw Illumina sequencing data, allowing laboratories to automatize the initial analysis of their microbial whole-genome sequencing data. AQUAMIS performs all steps of primary sequence analysis, consisting of read trimming, read quality control (QC), taxonomic classification, de-novo assembly, reference identification, assembly QC and contamination detection, both on the read and assembly level. The results are visualized in an interactive HTML report including species-specific QC thresholds, allowing non-bioinformaticians to assess the quality of sequencing experiments at a glance. All results are also available as a standard-compliant JSON file, facilitating easy downstream analyses and data exchange. We have applied AQUAMIS to analyze ~13,000 microbial isolates as well as ~1000 in-silico contaminated datasets, proving the workflow's ability to perform in high throughput routine sequencing environments and reliably predict contaminations. We found that intergenus and intragenus contaminations can be detected most accurately using a combination of different QC metrics available within AQUAMIS.


Subject(s)
Genome, Bacterial , Quality Control , Whole Genome Sequencing/methods , Contig Mapping/methods , Contig Mapping/standards , DNA Contamination , Escherichia coli , Listeria monocytogenes , Salmonella enterica , Sensitivity and Specificity , Software , Species Specificity , Whole Genome Sequencing/standards , Workflow
2.
BMC Genomics ; 21(1): 631, 2020 Sep 14.
Article in English | MEDLINE | ID: mdl-32928108

ABSTRACT

BACKGROUND: We benchmarked the hybrid assembly approaches of MaSuRCA, SPAdes, and Unicycler for bacterial pathogens using Illumina and Oxford Nanopore sequencing by determining genome completeness and accuracy, antimicrobial resistance (AMR), virulence potential, multilocus sequence typing (MLST), phylogeny, and pan genome. Ten bacterial species (10 strains) were tested with simulated reads of both mediocre and low quality, whereas 11 bacterial species (12 strains) were tested with real reads. RESULTS: Unicycler performed the best for achieving contiguous genomes, closely followed by MaSuRCA, while all SPAdes assemblies were incomplete. MaSuRCA was less tolerant of low-quality long reads than SPAdes and Unicycler. The hybrid assemblies of five antimicrobial-resistant strains with simulated reads provided AMR genotypes consistent with the reference genomes. The MaSuRCA assembly of Staphylococcus aureus with real reads contained msr(A) and tet(K), while the reference genome and the SPAdes and Unicycler assemblies harbored blaZ. The AMR genotypes of the reference genomes and hybrid assemblies were consistent for the other five antimicrobial-resistant strains with real reads. The numbers of virulence genes in all hybrid assemblies were similar to those of the reference genomes, irrespective of simulated or real reads. The one exception was Pseudomonas aeruginosa: the reference genome and the hybrid assemblies with mediocre-quality long reads carried 241 virulence genes, whereas 184 virulence genes were identified in the hybrid assemblies with low-quality long reads. The MaSuRCA assemblies of Escherichia coli O157:H7 and Salmonella Typhimurium with mediocre-quality long reads contained 126 and 118 virulence genes, respectively, while 110 and 107 virulence genes were detected in their MaSuRCA assemblies with low-quality long reads, respectively. All approaches performed well in our MLST and phylogenetic analyses. The pan genomes of the hybrid assemblies of S. Typhimurium with mediocre-quality long reads were similar to that of the reference genome, while SPAdes and Unicycler were more tolerant of low-quality long reads than MaSuRCA for the pan-genome analysis. All approaches functioned well in the pan-genome analysis of Campylobacter jejuni with real reads. CONCLUSIONS: Our research demonstrates the hybrid assembly pipeline of Unicycler as a superior approach for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing.


Subject(s)
Genome, Bacterial , Genomics/methods , Nanopore Sequencing/methods , Benchmarking , Campylobacter jejuni , Contig Mapping/methods , Contig Mapping/standards , Cronobacter sakazakii , Drug Resistance, Bacterial , Genomics/standards , Listeria monocytogenes , Nanopore Sequencing/standards , Pseudomonas aeruginosa , Salmonella typhimurium , Virulence
3.
Genetics ; 216(2): 599-608, 2020 10.
Article in English | MEDLINE | ID: mdl-32796007

ABSTRACT

Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of nongap sequence anchored to chromosomes, which is 1.2 Gbp more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2000 genes that were previously unplaced. We also discovered >5700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus.


Subject(s)
Chromosomes, Plant/genetics , Contig Mapping/methods , Gene Dosage , Triticum/genetics , Contig Mapping/standards , Genome, Plant , Genomics/methods , Genomics/standards , Reference Standards
4.
Genes (Basel) ; 11(7)2020 07 15.
Article in English | MEDLINE | ID: mdl-32679846

ABSTRACT

RPGR exon ORF15 variants are one of the most frequent causes for inherited retinal disorders (IRDs), in particular retinitis pigmentosa. The low sequence complexity of this mutation hotspot makes it prone to indels and challenging for sequence data analysis. Whole-exome sequencing generally fails to provide adequate coverage in this region. Therefore, complementary methods are needed to avoid false positives as well as negative results. In this study, next-generation sequencing (NGS) was used to sequence long-range PCR amplicons for an IRD cohort of African ancestry. By developing a novel secondary analysis pipeline based on de novo assembly, we were able to avoid the miscalling of variants generated by standard NGS analysis tools. We identified pathogenic variants in 11 patients (13% of the cohort), two of which have not been reported previously. We provide a novel and alternative end-to-end secondary analysis pipeline for targeted NGS of ORF15 that is less prone to false positive and negative variant calls.


Subject(s)
Eye Proteins/genetics , Genetic Testing/methods , High-Throughput Nucleotide Sequencing/methods , Retinitis Pigmentosa/genetics , Black People/genetics , Contig Mapping/methods , Contig Mapping/standards , Exons , Female , Genetic Testing/standards , High-Throughput Nucleotide Sequencing/standards , Humans , Indigenous Peoples/genetics , Male , Mutation , Pedigree , Retinitis Pigmentosa/diagnosis , Sensitivity and Specificity , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards
5.
Sci Rep ; 9(1): 17716, 2019 11 27.
Article in English | MEDLINE | ID: mdl-31776409

ABSTRACT

Atlantic herring (Clupea harengus) is one of the most abundant fish species in the world. It is an important economical and nutritional resource, as well as a crucial part of the North Atlantic ecosystem. In 2016, a draft herring genome assembly was published. Being a species of such importance, we sought to independently verify and potentially improve the herring genome assembly. We sequenced the herring genome generating paired-end, mate-pair, linked and long reads. Three assembly versions of the herring genome were generated based on a de novo assembly (A1), which was scaffolded using linked and long reads (A2) and then merged with the previously published assembly (A3). The resulting assemblies were compared using parameters describing the size, fragmentation, correctness, and completeness of the assemblies. Results showed that the A2 assembly was less fragmented, more complete and more correct than A1. A3 showed improvement in fragmentation and correctness compared with A2 and the published assembly but was slightly less complete than the published assembly. Thus, we here confirmed the previously published herring assembly, and made improvements by further scaffolding the assembly and removing low-quality sequences using linked and long reads and merging of assemblies.


Subject(s)
Contig Mapping/methods , Fishes/genetics , Genome , Whole Genome Sequencing/methods , Animals , Contig Mapping/standards , Whole Genome Sequencing/standards
6.
Gigascience ; 8(7)2019 07 01.
Article in English | MEDLINE | ID: mdl-31289836

ABSTRACT

BACKGROUND: Mammalian X and Y chromosomes share a common evolutionary origin and retain regions of high sequence similarity. Similar sequence content can confound the mapping of short next-generation sequencing reads to a reference genome. It is therefore possible that the presence of both sex chromosomes in a reference genome can cause technical artifacts in genomic data and affect downstream analyses and applications. Understanding this problem is critical for medical genomics and population genomic inference. RESULTS: Here, we characterize how sequence homology can affect analyses on the sex chromosomes and present XYalign, a new tool that (1) facilitates the inference of sex chromosome complement from next-generation sequencing data; (2) corrects erroneous read mapping on the sex chromosomes; and (3) tabulates and visualizes important metrics for quality control such as mapping quality, sequencing depth, and allele balance. We find that sequence homology affects read mapping on the sex chromosomes and this has downstream effects on variant calling. However, we show that XYalign can correct mismapping, resulting in more accurate variant calling. We also show how metrics output by XYalign can be used to identify XX and XY individuals across diverse sequencing experiments, including low- and high-coverage whole-genome sequencing, and exome sequencing. Finally, we discuss how the flexibility of the XYalign framework can be leveraged for other uses including the identification of aneuploidy on the autosomes. XYalign is available open source under the GNU General Public License (version 3). CONCLUSIONS: Sex chromosome sequence homology causes the mismapping of short reads, which in turn affects downstream analyses. XYalign provides a reproducible framework to correct mismapping and improve variant calling on the sex chromosomes.
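
To illustrate how a sequencing-depth metric of the kind mentioned in this abstract could separate XX from XY samples, here is a minimal Python sketch. It is not XYalign's implementation or interface: the function names, the input dictionary layout, the example depths, and the 0.25 cutoff are all hypothetical and chosen only for illustration.

```python
def normalized_depths(mean_depth):
    """Divide chrX and chrY mean depth by the autosomal mean depth."""
    autosomal = mean_depth["autosomes"]
    if autosomal <= 0:
        raise ValueError("autosomal depth must be positive")
    return mean_depth["chrX"] / autosomal, mean_depth["chrY"] / autosomal


def guess_complement(mean_depth, y_cutoff=0.25):
    """Label a sample from its normalized chrY depth.

    y_cutoff is a hypothetical threshold, not a published value.
    """
    _, y_ratio = normalized_depths(mean_depth)
    return "XY-like" if y_ratio >= y_cutoff else "XX-like"


# Invented per-sample mean depths (not data from the study).
samples = {
    "sample_1": {"autosomes": 30.0, "chrX": 29.1, "chrY": 0.6},
    "sample_2": {"autosomes": 30.0, "chrX": 15.4, "chrY": 13.8},
}

for name, depths in samples.items():
    x_ratio, y_ratio = normalized_depths(depths)
    print(name, round(x_ratio, 2), round(y_ratio, 2), guess_complement(depths))
```

In practice the abstract's point is that mismapping between homologous X and Y regions distorts exactly these depths, so any cutoff would need to be calibrated on data where mapping quality has been taken into account.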


Subject(s)
Chromosomes, Human, X/genetics , Chromosomes, Human, Y/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Sequence Homology, Nucleic Acid , Artifacts , Contig Mapping/methods , Contig Mapping/standards , Female , High-Throughput Nucleotide Sequencing/standards , Humans , Male , Sequence Alignment/methods , Sequence Alignment/standards , Sequence Analysis, DNA/standards
7.
Gigascience ; 8(5)2019 05 01.
Article in English | MEDLINE | ID: mdl-31077315

ABSTRACT

BACKGROUND: In recent years, massively parallel complementary DNA sequencing (RNA sequencing [RNA-Seq]) has emerged as a fast, cost-effective, and robust technology to study entire transcriptomes in various manners. In particular, for non-model organisms and in the absence of an appropriate reference genome, RNA-Seq is used to reconstruct the transcriptome de novo. Although the de novo transcriptome assembly of non-model organisms has been on the rise recently and new tools are frequently developing, there is still a knowledge gap about which assembly software should be used to build a comprehensive de novo assembly. RESULTS: Here, we present a large-scale comparative study in which 10 de novo assembly tools are applied to 9 RNA-Seq data sets spanning different kingdoms of life. Overall, we built >200 single assemblies and evaluated their performance on a combination of 20 biological-based and reference-free metrics. Our study is accompanied by a comprehensive and extensible Electronic Supplement that summarizes all data sets, assembly execution instructions, and evaluation results. Trinity, SPAdes, and Trans-ABySS, followed by Bridger and SOAPdenovo-Trans, generally outperformed the other tools compared. Moreover, we observed species-specific differences in the performance of each assembler. No tool delivered the best results for all data sets. CONCLUSIONS: We recommend a careful choice and normalization of evaluation metrics to select the best assembling results as a critical step in the reconstruction of a comprehensive de novo transcriptome assembly.


Subject(s)
Sequence Analysis, RNA/methods , Software , Transcriptome , Animals , Arabidopsis , Contig Mapping/methods , Contig Mapping/standards , Escherichia coli , Humans , Mice , Sequence Analysis, RNA/standards
8.
Genes (Basel) ; 10(1)2019 01 18.
Article in English | MEDLINE | ID: mdl-30669388

ABSTRACT

A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives; however, relatively high DNA input requirements (~5 µg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 h movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes were present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. By sequencing and assembling material from a single diploid individual, only two haplotypes were present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size. This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
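
Several records in this list report contiguity as contig N50 (3.5 Mb here). As a reminder of what that statistic means, the following Python sketch computes N50 under its usual definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are hypothetical and do not come from any of the cited assemblies.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the total is the N50 contig.
        if running >= half_total:
            return length
    return 0


# Invented contig lengths (total 290, so half is 145).
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))  # 80 + 70 = 150 >= 145, so this prints 70
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstracts above pair it with completeness or error measures.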


Subject(s)
Anopheles/genetics , Genome, Insect , Sequence Analysis, DNA/methods , Animals , Contig Mapping/methods , Contig Mapping/standards , Ploidies , Polymorphism, Genetic , Sequence Analysis, DNA/standards
9.
Genome Res ; 28(11): 1720-1732, 2018 11.
Article in English | MEDLINE | ID: mdl-30341161

ABSTRACT

Despite the rapid development of sequencing technologies, the assembly of mammalian-scale genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we developed Ragout 2, a reference-assisted assembly tool that works for large and complex genomes. By taking one or more target assemblies (generated from an NGS assembler) and one or multiple related reference genomes, Ragout 2 infers the evolutionary relationships between the genomes and builds the final assemblies using a genome rearrangement approach. By using Ragout 2, we transformed NGS assemblies of 16 laboratory mouse strains into sets of complete chromosomes, leaving <5% of sequence unlocalized per set. Various benchmarks, including PCR testing and realigning of long Pacific Biosciences (PacBio) reads, suggest only a small number of structural errors in the final assemblies, comparable with direct assembly approaches. We applied Ragout 2 to the Mus caroli and Mus pahari genomes, which exhibit karyotype-scale variations compared with other genomes from the Muridae family. Chromosome painting maps confirmed most large-scale rearrangements that Ragout 2 detected. We applied Ragout 2 to improve draft sequences of three ape genomes that have recently been published. Ragout 2 transformed three sets of contigs (generated using PacBio reads only) into chromosome-scale assemblies with accuracy comparable to chromosome assemblies generated in the original study using BioNano maps, Hi-C, BAC clones, and FISH.


Subject(s)
Contig Mapping/methods , Whole Genome Sequencing/methods , Animals , Contig Mapping/standards , Mice , Reference Standards , Whole Genome Sequencing/standards
10.
Sci Rep ; 8(1): 2059, 2018 02 01.
Article in English | MEDLINE | ID: mdl-29391438

ABSTRACT

Single-cell genomics is a straightforward approach to obtain genomes from uncultured microbes. However, sequence reads from a single-cell amplified genome (SAG) contain significant bias and chimeric sequences. Here, we describe Cleaning and Co-assembly of a Single-Cell Amplified Genome (ccSAG), a novel analytical workflow to obtain composite single-cell genomes with elimination of sequence errors. By integrating ccSAG with a massively parallel single-cell genome amplification platform based on droplet microfluidics, we can generate multiple SAGs and effectively integrate them into composite genomes of a quality equivalent to the data obtained from bulk DNA. We obtained two novel draft genomes from single gut microbial cells with high completeness (>96.6%) and extremely low contamination (<1.25%). Moreover, we revealed the presence of single nucleotide polymorphisms in the specific gene by sequence comparison at the single-cell level. Thus, the workflow yields near-complete genomes from uncultured microbes, and enables analyses of genetic heterogeneity within identical strains.


Subject(s)
Contig Mapping/methods , Gastrointestinal Microbiome , Genome, Bacterial , Single-Cell Analysis/methods , Whole Genome Sequencing/methods , Animals , Bacillus subtilis/genetics , Contig Mapping/standards , Escherichia coli/genetics , Male , Mice , Mice, Inbred ICR , Polymorphism, Single Nucleotide , Single-Cell Analysis/standards , Whole Genome Sequencing/standards
11.
J Comput Biol ; 25(2): 214-235, 2018 02.
Article in English | MEDLINE | ID: mdl-29028176

ABSTRACT

We formalize a new problem variant in gene-block discovery, denoted Reference-Anchored Gene Blocks (RAGB), given a query sequence Q of length n, representing the gene array of a DNA element, a window size bound d on the length of a substring of interest in Q, and a set of target gene sequences [Formula: see text]. Our objective is to identify gene blocks in [Formula: see text] that are centered in a subset q of co-localized genes from Q, and contain genomes from [Formula: see text] in which the corresponding orthologs of the genes from q are also co-localized. We cast RAGB as a variant of a (colored) biclique problem in bipartite graphs, and analyze its parameterized complexity, as well as the parameterized complexity of other related problems. We give an [Formula: see text] time algorithm for the uncolored variant of our biclique problem, where m is the number of areas of interest that are parsed from the target sequences, and n and d are as defined earlier. Our algorithm can be adapted to compute all maximal bicliques in the graph within the same time complexity, and to handle edge weights with a slight [Formula: see text] increase to its time complexity. For the colored version of the problem, our algorithm has a time complexity of [Formula: see text]. We implement the algorithm and exemplify its application to the data mining of proteobacterial gene blocks that are centered in predicted proteobacterial genomic islands, leading to the identification of putatively mobilized clusters of virulence, pathogenicity, and resistance genes.


Subject(s)
Genome, Bacterial , Multigene Family , Sequence Analysis, DNA/methods , Algorithms , Contig Mapping/methods , Contig Mapping/standards , Reference Standards , Sequence Analysis, DNA/standards
12.
Gigascience ; 6(11): 1-6, 2017 11 01.
Article in English | MEDLINE | ID: mdl-29092041

ABSTRACT

The chimpanzee is arguably the most important species for the study of human origins. A key resource for these studies is a high-quality reference genome assembly; however, as with most mammalian genomes, the current iteration of the chimpanzee reference genome assembly is highly fragmented. In the current iteration of the chimpanzee reference genome assembly (Pan_tro_2.1.4), the sequence is scattered across more than 183 000 contigs, incorporating more than 159 000 gaps, with a genome-wide contig N50 of 51 Kbp. In this work, we produce an extensive and diverse array of sequencing datasets to rapidly assemble a new chimpanzee reference that surpasses previous iterations in bases represented and organized in large scaffolds. To this end, we show substantial improvements over the current release of the chimpanzee genome (Pan_tro_2.1.4) by several metrics, such as increased contiguity by >750% and 300% on contigs and scaffolds, respectively, and closure of 77% of gaps in the Pan_tro_2.1.4 assembly, spanning >850 Kbp of novel coding sequence based on RNASeq data. We further report more than 2700 genes that had putatively erroneous frame-shift predictions to human in Pan_tro_2.1.4 and show a substantial increase in the annotation of repetitive elements. We apply a simple 3-way hybrid approach to considerably improve the reference genome assembly for the chimpanzee, providing a valuable resource for the study of human origins. Furthermore, we produce extensive sequencing datasets that are all derived from the same cell line, generating a broad non-human benchmark dataset.


Subject(s)
Contig Mapping/standards , Genome , Genomics/standards , Molecular Sequence Annotation/standards , Pan troglodytes/genetics , Whole Genome Sequencing/standards , Animals , Contig Mapping/methods , Genomics/methods , Molecular Sequence Annotation/methods , Reference Standards , Whole Genome Sequencing/methods
13.
Genome Biol ; 18(1): 182, 2017 09 21.
Article in English | MEDLINE | ID: mdl-28934964

ABSTRACT

BACKGROUND: One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited. RESULTS: In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages. CONCLUSIONS: This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
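
The trade-off this abstract describes, where tool intersection removes false positives but can also drop true taxa, can be made concrete with a small Python sketch. The species sets below are invented for illustration, and `tool_a` and `tool_b` are placeholder names rather than any of the 11 classifiers evaluated in the study.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for a set of predicted taxa against a truth set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical truth set and classifier outputs.
truth = {"Escherichia coli", "Listeria monocytogenes", "Salmonella enterica"}
tool_a = {"Escherichia coli", "Listeria monocytogenes", "Bacillus subtilis"}
tool_b = {"Escherichia coli", "Listeria monocytogenes",
          "Salmonella enterica", "Pseudomonas aeruginosa"}

# Tool intersection: keep only the taxa reported by both tools.
intersection = tool_a & tool_b

for label, calls in [("tool_a", tool_a), ("tool_b", tool_b),
                     ("intersection", intersection)]:
    precision, recall = precision_recall(calls, truth)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the intersection has no false positives, but it loses Salmonella enterica because only one tool reported it, so precision rises while recall falls relative to `tool_b`.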


Subject(s)
Benchmarking/methods , Contig Mapping/methods , DNA Barcoding, Taxonomic/methods , Metagenome , Sequence Analysis, DNA/methods , Software , Benchmarking/standards , Contig Mapping/standards , DNA Barcoding, Taxonomic/standards , Humans , Microbiota , Phylogeny , Sequence Analysis, DNA/standards
14.
Sci Rep ; 7(1): 10963, 2017 09 08.
Article in English | MEDLINE | ID: mdl-28887485

ABSTRACT

Complementary to reference-based variant detection, recent studies revealed that many novel variants could be detected with de novo assembled genomes. To evaluate the effect of read coverage and the accuracy of assembly-based variant calling, we simulated short reads containing more than 3 million single nucleotide variants (SNVs) from the whole human genome and compared the efficiency of SNV calling between the assembly-based and alignment-based calling approaches. We assessed the quality of the assembled contigs and found that a minimum of 30X coverage of short reads was needed to ensure reliable SNV calling and to generate assembled contigs with good coverage of the genome and genes. In addition, we observed that the assembly-based approach had a much lower recall rate and precision compared with the alignment-based approach, which recovered 99% of imputed SNVs. We observed similar results with experimental reads for NA24385, an individual whose germline variants were well characterized. Although it adds value for SNV detection, the assembly-based approach carries a great risk of false discovery of novel SNVs. Further improvements of de novo assembly algorithms are needed to ensure good genome completeness with resolved haplotypes and high fidelity of the assembled sequences.


Subject(s)
Contig Mapping/methods , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Contig Mapping/standards , Genome-Wide Association Study/standards , Humans , Sequence Alignment/standards , Sequence Analysis, DNA/standards
15.
Genome ; 60(12): 1086-1088, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28841400

ABSTRACT

RNA-Seq methodology has been used to generate a comprehensive transcriptome sequence resource for perennial ryegrass, an important temperate pasture grass species. A total of 931 547 255 reads were obtained from libraries corresponding to 19 distinct tissue samples, including both vegetative and reproductive stages of development. Assembly of data generated a final filtered reference set of 48 713 contigs and scaffolds. The transcriptome resource will support whole genome sequence assembly, comparative genomics, implementation of genotyping-by-sequencing (GBS) methods based on transcript sampling, and identification of candidate genes for multiple biological functions.


Subject(s)
Contig Mapping/standards , Genome, Plant , Lolium/genetics , Transcriptome , Contig Mapping/methods , Molecular Sequence Annotation , Reference Values
16.
Gigascience ; 6(7): 1-8, 2017 07 01.
Article in English | MEDLINE | ID: mdl-28531267

ABSTRACT

The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process as coordinate remapping is most appropriate only for non-repetitive regions of the genome and those that did not see significant change between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.


Subject(s)
Contig Mapping/methods , Human Genome Project , Sequence Alignment/methods , Whole Genome Sequencing/methods , Algorithms , Contig Mapping/standards , Humans , Reference Standards , Sequence Alignment/standards , Whole Genome Sequencing/standards
17.
Genome Res ; 27(5): 849-864, 2017 05.
Article in English | MEDLINE | ID: mdl-28396521

ABSTRACT

The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.


Subject(s)
Contig Mapping/methods , Human Genome , Genomics/methods , DNA Sequence Analysis/methods , Software , Contig Mapping/standards , Genomics/standards , Haploidy , Haplotypes , Humans , Genetic Polymorphism , Reference Standards , DNA Sequence Analysis/standards
18.
Genome Res ; 27(5): 835-848, 2017 05.
Article in English | MEDLINE | ID: mdl-28396522

ABSTRACT

A viral quasispecies, the ensemble of viral strains populating an infected person, can be highly diverse. For optimal assessment of virulence, pathogenesis, and therapy selection, determining the haplotypes of the individual strains can play a key role. As many viruses are subject to high mutation and recombination rates, high-quality reference genomes are often not available at the time of a new disease outbreak. We present SAVAGE, a computational tool for reconstructing individual haplotypes of intra-host virus strains without the need for a high-quality reference genome. SAVAGE makes use of either FM-index-based data structures or ad hoc consensus reference sequence for constructing overlap graphs from patient sample data. In this overlap graph, nodes represent reads and/or contigs, while edges reflect that two reads/contigs, based on sound statistical considerations, represent identical haplotypic sequence. Following an iterative scheme, a new overlap assembly algorithm that is based on the enumeration of statistically well-calibrated groups of reads/contigs then efficiently reconstructs the individual haplotypes from this overlap graph. In benchmark experiments on simulated and on real deep-coverage data, SAVAGE drastically outperforms generic de novo assemblers as well as the only specialized de novo viral quasispecies assembler available so far. When run on ad hoc consensus reference sequence, SAVAGE performs very favorably in comparison with state-of-the-art reference genome-guided tools. We also apply SAVAGE on two deep-coverage samples of patients infected by the Zika and the hepatitis C virus, respectively, which sheds light on the genetic structures of the respective viral quasispecies.
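The overlap graph described in this abstract can be sketched in a few lines of Python. This is a simplified illustration and not SAVAGE's method: it uses brute-force exact suffix-prefix matching in place of FM-index-based lookup and statistical edge criteria, and the function names, the toy reads, and the `min_overlap` parameter are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_overlap."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads: List[str], min_overlap: int) -> Dict[int, List[Tuple[int, int]]]:
    """Build a directed overlap graph over read indices.

    Nodes are reads; an edge i -> (j, length) means that the last `length`
    bases of read i exactly match the first `length` bases of read j.
    """
    graph: Dict[int, List[Tuple[int, int]]] = {i: [] for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = suffix_prefix_overlap(a, b, min_overlap)
            if length:
                graph[i].append((j, length))
    return graph


# Toy reads (hypothetical), each overlapping the next by four bases.
TOY_READS = ["ACGTAC", "GTACGG", "ACGGTT"]

if __name__ == "__main__":
    for node, edges in build_overlap_graph(TOY_READS, min_overlap=3).items():
        print(node, edges)
```

The all-pairs comparison is quadratic in the number of reads, which is workable only for toy input; an index over the reads, as the abstract mentions, is what makes overlap detection feasible on deep-coverage data.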


Subject(s)
Contig Mapping/methods , Viral Genome , Genomics/methods , DNA Sequence Analysis/methods , Software , Contig Mapping/standards , Genomics/standards , Haplotypes , Hepacivirus/genetics , Genetic Polymorphism , Reference Standards , DNA Sequence Analysis/standards , Zika Virus/genetics
19.
Genome Res ; 27(5): 813-823, 2017 05.
Article in English | MEDLINE | ID: mdl-28360230

ABSTRACT

The most polymorphic part of the human genome, the MHC, encodes over 160 proteins of diverse function. Half of them, including the HLA class I and II genes, are directly involved in immune responses. Consequently, the MHC region strongly associates with numerous diseases and clinical therapies. Notoriously, the MHC region has been intractable to high-throughput analysis at complete sequence resolution, and current reference haplotypes are inadequate for large-scale studies. To address these challenges, we developed a method that specifically captures and sequences the 4.8-Mbp MHC region from genomic DNA. For 95 MHC homozygous cell lines we assembled, de novo, a set of high-fidelity contigs and a sequence scaffold, representing a mean 98% of the target region. Included are six alternative MHC reference sequences of the human genome that we completed and refined. Characterization of the sequence and structural diversity of the MHC region shows the approach accurately determines the sequences of the highly polymorphic HLA class I and HLA class II genes and the complex structural diversity of complement factor C4A/C4B. It has also uncovered extensive and unexpected diversity in other MHC genes; an example is MUC22, which encodes a lung mucin and exhibits more coding sequence alleles than any HLA class I or II gene studied here. More than 60% of the coding sequence alleles analyzed were previously uncharacterized. We have created a substantial database of robust reference MHC haplotype sequences that will enable future population scale studies of this complicated and clinically important region of the human genome.


Subject(s)
Complement C4/genetics , MHC Class II Genes , MHC Class I Genes , Haplotypes , Mucins/genetics , Genetic Polymorphism , Animals , Cell Line , Contig Mapping/methods , Contig Mapping/standards , Human Genome , Genomics/methods , Genomics/standards , Humans , Open Reading Frames , Pan troglodytes/genetics , Reference Standards
20.
Genome Res ; 27(5): 697-708, 2017 05.
Article in English | MEDLINE | ID: mdl-28360231

ABSTRACT

Accurate and contiguous genome assembly is key to a comprehensive understanding of the processes shaping genomic diversity and evolution. Yet, it is frequently constrained by constitutive heterochromatin, usually characterized by highly repetitive DNA. As a key feature of genome architecture associated with centromeric and subtelomeric regions, it locally influences meiotic recombination. In this study, we assess the impact of large tandem repeat arrays on the recombination rate landscape in an avian speciation model, the Eurasian crow. We assembled two high-quality genome references using single-molecule real-time sequencing (long-read assembly [LR]) and single-molecule optical maps (optical map assembly [OM]). A three-way comparison including the published short-read assembly (SR) constructed for the same individual allowed assessing assembly properties and pinpointing misassemblies. By combining information from all three assemblies, we characterized 36 previously unidentified large repetitive regions in the proximity of sequence assembly breakpoints, the majority of which contained complex arrays of a 14-kb satellite repeat or its 1.2-kb subunit. Using whole-genome population resequencing data, we estimated the population-scaled recombination rate (ρ) and found it to be significantly reduced in these regions. These findings are consistent with an effect of low recombination in regions adjacent to centromeric or subtelomeric heterochromatin and add to our understanding of the processes generating widespread heterogeneity in genetic diversity and differentiation along the genome. By combining three different technologies, our results highlight the importance of adding a layer of information on genome structure that is inaccessible to each approach independently.


Subject(s)
Contig Mapping/standards , Genome , Tandem Repeat Sequences , Animals , Chromatin/genetics , Chromatin/metabolism , Contig Mapping/methods , Crows/genetics , Homologous Recombination , DNA Sequence Analysis/methods , DNA Sequence Analysis/standards