RESUMEN
Unlike copy number variants (CNVs), inversions remain an underexplored genetic variation class. By integrating multiple genomic technologies, we discover 729 inversions in 41 human genomes. Approximately 85% of inversions <2 kbp form by twin-priming during L1 retrotransposition; 80% of the larger inversions are balanced and affect twice as many nucleotides as CNVs. Balanced inversions show an excess of common variants, and 72% are flanked by segmental duplications (SDs) or retrotransposons. Since flanking repeats promote non-allelic homologous recombination, we developed complementary approaches to identify recurrent inversion formation. We describe 40 recurrent inversions encompassing 0.6% of the genome, showing inversion rates up to 2.7 × 10-4 per locus per generation. Recurrent inversions exhibit a sex-chromosomal bias and co-localize with genomic disorder critical regions. We propose that inversion recurrence results in an elevated number of heterozygous carriers and structural SD diversity, which increases mutability in the population and predisposes specific haplotypes to disease-causing CNVs.
Asunto(s)
Inversión Cromosómica , Duplicaciones Segmentarias en el Genoma , Inversión Cromosómica/genética , Variaciones en el Número de Copia de ADN/genética , Genoma Humano , Genómica , HumanosRESUMEN
The prevalence of highly repetitive sequences within the human Y chromosome has prevented its complete assembly to date1 and led to its systematic omission from genomic analyses. Here we present de novo assemblies of 43 Y chromosomes spanning 182,900 years of human evolution and report considerable diversity in size and structure. Half of the male-specific euchromatic region is subject to large inversions with a greater than twofold higher recurrence rate compared with all other chromosomes2. Ampliconic sequences associated with these inversions show differing mutation rates that are sequence context dependent, and some ampliconic genes exhibit evidence for concerted evolution with the acquisition and purging of lineage-specific pseudogenes. The largest heterochromatic region in the human genome, Yq12, is composed of alternating repeat arrays that show extensive variation in the number, size and distribution, but retain a 1:1 copy-number ratio. Finally, our data suggest that the boundary between the recombining pseudoautosomal region 1 and the non-recombining portions of the X and Y chromosomes lies 500 kb away from the currently established1 boundary. The availability of fully sequence-resolved Y chromosomes from multiple individuals provides a unique opportunity for identifying new associations of traits with specific Y-chromosomal variants and garnering insights into the evolution and function of complex regions of the human genome.
Asunto(s)
Cromosomas Humanos Y , Evolución Molecular , Humanos , Masculino , Cromosomas Humanos Y/genética , Genoma Humano/genética , Genómica , Tasa de Mutación , Fenotipo , Eucromatina/genética , Seudogenes , Variación Genética/genética , Cromosomas Humanos X/genética , Regiones Pseudoautosómicas/genéticaRESUMEN
The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.
Asunto(s)
Genoma Humano , Genómica , Genoma Humano/genética , Haplotipos/genética , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Análisis de Secuencia de ADNRESUMEN
The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society1,2. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals3,4. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome5. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity6. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
Asunto(s)
Mapeo Cromosómico , Diploidia , Genoma Humano , Genómica , Humanos , Mapeo Cromosómico/normas , Genoma Humano/genética , Haplotipos/genética , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/normas , Análisis de Secuencia de ADN/métodos , Análisis de Secuencia de ADN/normas , Estándares de Referencia , Genómica/métodos , Genómica/normas , Cromosomas Humanos/genética , Variación Genética/genéticaRESUMEN
The secreted mucins MUC5AC and MUC5B are large glycoproteins that play critical defensive roles in pathogen entrapment and mucociliary clearance. Their respective genes contain polymorphic and degenerate protein-coding variable number tandem repeats (VNTRs) that make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5,761-5,762 amino acids [aa]); however, seven haplotypes have expanded VNTRs (6,291-7,019 aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5,249-6,325 aa) with cysteine-rich domain and VNTR copy-number variation. We group MUC5AC alleles into three phylogenetic clades: H1 (46%, â¼5,654 aa), H2 (33%, â¼5,742 aa), and H3 (7%, â¼6,325 aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium and Tajima's D analyses reveal that East Asians carry exceptionally large blocks with an excess of rare variation (p < 0.05) at MUC5AC. To validate this result, we use Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observe a signature of positive selection in H1 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium (p < 0.05), consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein-coding VNTRs for improved disease associations.
Asunto(s)
Alelos , Variación Genética , Haplotipos , Repeticiones de Minisatélite , Mucina 5AC , Mucina 5B , Filogenia , Humanos , Mucina 5B/genética , Animales , Mucina 5AC/genética , Mucina 5AC/metabolismo , Repeticiones de Minisatélite/genética , Variaciones en el Número de Copia de ADN , Primates/genéticaRESUMEN
Advances in long-read sequencing (LRS) technologies continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant-calling precision and recall of Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant-calling precision and recall of SVs and indels in HiFi data sets with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant call sets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.
Asunto(s)
Genómica , Nanoporos , Mutación INDEL , Secuenciación Completa del GenomaRESUMEN
There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
Asunto(s)
ADN Satélite , Polimorfismo Genético , Humanos , ADN Satélite/genética , Haplotipos , Duplicaciones Segmentarias en el Genoma , Análisis de Secuencia de ADNRESUMEN
Complex structural variants (CSVs) encompass multiple breakpoints and are often missed or misinterpreted. We developed SVision, a deep-learning-based multi-object-recognition framework, to automatically detect and haracterize CSVs from long-read sequencing data. SVision outperforms current callers at identifying the internal structure of complex events and has revealed 80 high-quality CSVs with 25 distinct structures from an individual genome. SVision directly detects CSVs without matching known structures, allowing sensitive detection of both common and previously uncharacterized complex rearrangements.
Asunto(s)
Aprendizaje Profundo , Genoma , Secuenciación de Nucleótidos de Alto Rendimiento , Análisis de Secuencia de ADNRESUMEN
Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.
Asunto(s)
Genoma Humano/genética , Variación Estructural del Genoma , Genómica/métodos , Objetivos , Secuenciación Completa del Genoma/métodos , Secuenciación Completa del Genoma/normas , Variaciones en el Número de Copia de ADN , Exones/genética , Humanos , Proyectos de Investigación , Duplicaciones Segmentarias en el Genoma , Alineación de SecuenciaRESUMEN
Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linearreference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.
Asunto(s)
Algoritmos , Biología Computacional/métodos , Gráficos por Computador , Genoma Humano , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Análisis de Secuencia de ADNRESUMEN
MOTIVATION: With the fast development of sequencing technology, accurate de novo genome assembly is now possible even for larger genomes. Graph-based representations of genomes arise both as part of the assembly process, but also in the context of pangenomes representing a population. In both cases, polymorphic loci lead to bubble structures in such graphs. Detecting bubbles is hence an important task when working with genomic variants in the context of genome graphs. RESULTS: Here, we present a fast general-purpose tool, called BubbleGun, for detecting bubbles and superbubbles in genome graphs. Furthermore, BubbleGun detects and outputs runs of linearly connected bubbles and superbubbles, which we call bubble chains. We showcase its utility on de Bruijn graphs and compare our results to vg's snarl detection. We show that BubbleGun is considerably faster than vg especially in bigger graphs, where it reports all bubbles in less than 30 min on a human sample de Bruijn graph of around 2 million nodes. AVAILABILITY AND IMPLEMENTATION: BubbleGun is available and documented as a Python3 package at https://github.com/fawaz-dabbaghieh/bubble_gun under MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Algoritmos , Programas Informáticos , Humanos , Análisis de Secuencia de ADN/métodos , Genoma , Genómica/métodosRESUMEN
MOTIVATION: De Bruijn graphs can be constructed from short reads efficiently and have been used for many purposes. Traditionally, long-read sequencing technologies have had too high error rates for de Bruijn graph-based methods. Recently, HiFi reads have provided a combination of long-read length and low error rate, which enables de Bruijn graphs to be used with HiFi reads. RESULTS: We have implemented MBG, a tool for building sparse de Bruijn graphs from HiFi reads. MBG outperforms existing tools for building dense de Bruijn graphs and can build a graph of 50× coverage whole human genome HiFi reads in four hours on a single core. MBG also assembles the bacterial E.coli genome into a single contig in 8 s. AVAILABILITY AND IMPLEMENTATION: Package manager: https://anaconda.org/bioconda/mbg and source code: https://github.com/maickrau/MBG. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
RESUMEN
SUMMARY: Single-cell DNA template strand sequencing (Strand-seq) enables chromosome length haplotype phasing, construction of phased assemblies, mapping sister-chromatid exchange events and structural variant discovery. The initial quality control of potentially thousands of single-cell libraries is still done manually by domain experts. ASHLEYS automates this tedious task, delivers near-expert performance and labels even large datasets in seconds. AVAILABILITY AND IMPLEMENTATION: github.com/friendsofstrandseq/ashleys-qc, MIT license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
RESUMEN
MOTIVATION: Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus-a property that is not exploited by extant methods. RESULTS: We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and on a whole human genome graph constructed from variants in the 1000 Genome Project dataset. On this graph, PSI outperforms GCSA2 in terms of index size, query time and sensitivity. AVAILABILITY AND IMPLEMENTATION: The C++ implementation is publicly available at: https://github.com/cartoonist/psi.
Asunto(s)
Algoritmos , Genoma Humano , Programas Informáticos , Alelos , Diploidia , Humanos , Análisis de Secuencia de ADNRESUMEN
MOTIVATION: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. RESULTS: We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers' bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(|V|+âmwâ|E| log w) for acyclic graphs and O(|V|+m|E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm. AVAILABILITY AND IMPLEMENTATION: https://github.com/maickrau/GraphAligner. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Algoritmos , Genoma , Alineación de Secuencia , Análisis de Secuencia de ADNRESUMEN
Motivation: Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. Results: We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Availability and implementation: https://github.com/whatshap/whatshap. Supplementary information: Supplementary data are available at Bioinformatics online.
Asunto(s)
Diploidia , Genoma Fúngico , Análisis de Secuencia de ADN/métodos , Visualización de Datos , Haplotipos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Levaduras/genéticaRESUMEN
Motivation: Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately. Results: To address this, we show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization algorithm, termed SaaRclust, and demonstrates its ability to reliably cluster long reads by chromosome. For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it allows to assess the amount of uncertainty inherent to sparse Strand-seq data on the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of Pacific Bioscience reads with 30.1× coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly. Availability and implementation: https://github.com/daewoooo/SaaRclust.
Asunto(s)
Cromosomas Humanos , Simulación por Computador , Genómica/métodos , Secuenciación de Nucleótidos de Alto Rendimiento/métodos , Programas Informáticos , Algoritmos , Femenino , Genoma Humano , Humanos , Análisis de Secuencia de ADN/métodosRESUMEN
Small insertions and deletions (indels) and large structural variations (SVs) are major contributors to human genetic diversity and disease. However, mutation rates and characteristics of de novo indels and SVs in the general population have remained largely unexplored. We report 332 validated de novo structural changes identified in whole genomes of 250 families, including complex indels, retrotransposon insertions, and interchromosomal events. These data indicate a mutation rate of 2.94 indels (1-20 bp) and 0.16 SVs (>20 bp) per generation. De novo structural changes affect on average 4.1 kbp of genomic sequence and 29 coding bases per generation, which is 91 and 52 times more nucleotides than de novo substitutions, respectively. This contrasts with the equal genomic footprint of inherited SVs and substitutions. An excess of structural changes originated on paternal haplotypes. Additionally, we observed a nonuniform distribution of de novo SVs across offspring. These results reveal the importance of different mutational mechanisms to changes in human genome structure across generations.
Asunto(s)
Variación Genética , Genoma Humano , Alelos , Secuencia de Aminoácidos , Femenino , Genómica , Haplotipos , Humanos , Mutación INDEL , Masculino , Datos de Secuencia Molecular , Tasa de Mutación , Polimorfismo de Nucleótido Simple , Retroelementos/genética , Alineación de Secuencia , Análisis de Secuencia de ADNRESUMEN
MOTIVATION: Next Generation Sequencing (NGS) has enabled studying structural genomic variants (SVs) such as duplications and inversions in large cohorts. SVs have been shown to play important roles in multiple diseases, including cancer. As costs for NGS continue to decline and variant databases become ever more complete, the relevance of genotyping also SVs from NGS data increases steadily, which is in stark contrast to the lack of tools to do so. RESULTS: We introduce a novel statistical approach, called DIGTYPER (Duplication and Inversion GenoTYPER), which computes genotype likelihoods for a given inversion or duplication and reports the maximum likelihood genotype. In contrast to purely coverage-based approaches, DIGTYPER uses breakpoint-spanning read pairs as well as split alignments for genotyping, enabling typing also of small events. We tested our approach on simulated and on real data and compared the genotype predictions to those made by DELLY, which discovers SVs and computes genotypes, and SVTyper, a genotyping program used to genotype variants detected by LUMPY. DIGTYPER compares favorable especially for duplications (of all lengths) and for shorter inversions (up to 300 bp). In contrast to DELLY, our approach can genotype SVs from data bases without having to rediscover them. AVAILABILITY AND IMPLEMENTATION: https://bitbucket.org/jana_ebler/digtyper.git. CONTACT: t.marschall@mpi-inf.mpg.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Asunto(s)
Duplicación Cromosómica , Inversión Cromosómica , Variación Estructural del Genoma , Técnicas de Genotipaje/métodos , Bases de Datos de Ácidos Nucleicos , Genotipo , Secuenciación de Nucleótidos de Alto Rendimiento , Humanos , Análisis de Secuencia de ADN , Eliminación de Secuencia , Programas InformáticosRESUMEN
MOTIVATION: Horizontal gene transfer (HGT) is a fundamental mechanism that enables organisms such as bacteria to directly transfer genetic material between distant species. This way, bacteria can acquire new traits such as antibiotic resistance or pathogenic toxins. Current bioinformatics approaches focus on the detection of past HGT events by exploring phylogenetic trees or genome composition inconsistencies. However, these techniques normally require the availability of finished and fully annotated genomes and of sufficiently large deviations that allow detection and are thus not widely applicable. Especially in outbreak scenarios with HGT-mediated emergence of new pathogens, like the enterohemorrhagic Escherichia coli outbreak in Germany 2011, there is need for fast and precise HGT detection. Next-generation sequencing (NGS) technologies facilitate rapid analysis of unknown pathogens but, to the best of our knowledge, so far no approach detects HGTs directly from NGS reads. RESULTS: We present Daisy, a novel mapping-based tool for HGT detection. Daisy determines HGT boundaries with split-read mapping and evaluates candidate regions relying on read pair and coverage information. Daisy successfully detects HGT regions with base pair resolution in both simulated and real data, and outperforms alternative approaches using a genome assembly of the reads. We see our approach as a powerful complement for a comprehensive analysis of HGT in the context of NGS data. AVAILABILITY AND IMPLEMENTATION: Daisy is freely available from http://github.com/ktrappe/daisy CONTACT: renardb@rki.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.