Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 27
Filtrar
1.
Cell ; 173(6): 1356-1369.e22, 2018 05 31.
Artigo em Inglês | MEDLINE | ID: mdl-29856954

RESUMO

Genetic changes causing brain size expansion in human evolution have remained elusive. Notch signaling is essential for radial glia stem cell proliferation and is a determinant of neuronal number in the mammalian cortex. We find that three paralogs of human-specific NOTCH2NL are highly expressed in radial glia. Functional analysis reveals that different alleles of NOTCH2NL have varying potencies to enhance Notch signaling by interacting directly with NOTCH receptors. Consistent with a role in Notch signaling, NOTCH2NL ectopic expression delays differentiation of neuronal progenitors, while deletion accelerates differentiation into cortical neurons. Furthermore, NOTCH2NL genes provide the breakpoints in 1q21.1 distal deletion/duplication syndrome, where duplications are associated with macrocephaly and autism and deletions with microcephaly and schizophrenia. Thus, the emergence of human-specific NOTCH2NL genes may have contributed to the rapid evolution of the larger human neocortex, accompanied by loss of genomic stability at the 1q21.1 locus and resulting recurrent neurodevelopmental disorders.


Assuntos
Encéfalo/embriologia , Córtex Cerebral/fisiologia , Neurogênese/fisiologia , Receptor Notch2/metabolismo , Transdução de Sinais , Animais , Diferenciação Celular , Células-Tronco Embrionárias/metabolismo , Feminino , Deleção de Genes , Genes Reporter , Gorilla gorilla , Células HEK293 , Humanos , Neocórtex/citologia , Células-Tronco Neurais/metabolismo , Neuroglia/metabolismo , Neurônios/metabolismo , Pan troglodytes , Receptor Notch2/genética , Análise de Sequência de RNA
2.
Nature ; 617(7960): 312-324, 2023 05.
Artigo em Inglês | MEDLINE | ID: mdl-37165242

RESUMO

Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals1. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.


Assuntos
Genoma Humano , Genômica , Humanos , Diploide , Genoma Humano/genética , Haplótipos/genética , Análise de Sequência de DNA , Genômica/normas , Padrões de Referência , Estudos de Coortes , Alelos , Variação Genética
3.
Nature ; 604(7906): 437-446, 2022 04.
Artigo em Inglês | MEDLINE | ID: mdl-35444317

RESUMO

The human reference genome is the most widely used resource in human genetics and is due for a major update. Its current structure is a linear composite of merged haplotypes from more than 20 people, with a single individual comprising most of the sequence. It contains biases and errors within a framework that does not represent global human genomic variation. A high-quality reference with global representation of common variants, including single-nucleotide variants, structural variants and functional elements, is needed. The Human Pangenome Reference Consortium aims to create a more sophisticated and complete human reference genome with a graph-based, telomere-to-telomere representation of global genomic diversity. Here we leverage innovations in technology, study design and global partnerships with the goal of constructing the highest-possible quality human pangenome reference. Our goal is to improve data representation and streamline analyses to enable routine assembly of complete diploid genomes. With attention to ethical frameworks, the human pangenome reference will contain a more accurate and diverse representation of global genomic variation, improve gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine.


Assuntos
Genoma Humano , Genômica , Genoma Humano/genética , Haplótipos/genética , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA
4.
Nat Methods ; 2024 Sep 11.
Artigo em Inglês | MEDLINE | ID: mdl-39261641

RESUMO

Pangenomes reduce reference bias by representing genetic diversity better than a single reference sequence. Yet when comparing a sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with by filtering rare variants. However, this blunt heuristic both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach that imputes a personalized pangenome subgraph by sampling local haplotypes according to k-mer counts in the reads. We implement the approach in the vg toolkit ( https://github.com/vgteam/vg ) for the Giraffe short-read aligner and compare its accuracy to state-of-the-art methods using human pangenome graphs from the Human Pangenome Reference Consortium. This reduces small variant genotyping errors by four times relative to the Genome Analysis Toolkit and makes short-read structural variant genotyping of known variants competitive with long-read variant discovery methods.

5.
Nat Methods ; 20(2): 239-247, 2023 02.
Artigo em Inglês | MEDLINE | ID: mdl-36646895

RESUMO

Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our toolchain, which consists of additions to the VG toolkit and a standalone tool, RPVG, can construct spliced pangenome graphs, map RNA sequencing data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. We show that this workflow improves accuracy over state-of-the-art RNA sequencing mapping methods, and that it can efficiently quantify haplotype-specific transcript expression without needing to characterize the haplotypes of a sample beforehand.


Assuntos
Biologia Computacional , Perfilação da Expressão Gênica , Haplótipos , Metagenômica , Transcriptoma
6.
Nature ; 587(7833): 246-251, 2020 11.
Artigo em Inglês | MEDLINE | ID: mdl-33177663

RESUMO

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies1-3. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database4 increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies5 are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus6, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.


Assuntos
Genoma/genética , Genômica/métodos , Alinhamento de Sequência/métodos , Software , Vertebrados/genética , Âmnio , Animais , Simulação por Computador , Genômica/normas , Haplótipos , Humanos , Controle de Qualidade , Alinhamento de Sequência/normas , Software/normas
7.
Annu Rev Genomics Hum Genet ; 21: 139-162, 2020 08 31.
Artigo em Inglês | MEDLINE | ID: mdl-32453966

RESUMO

Low-cost whole-genome assembly has enabled the collection of haplotype-resolved pangenomes for numerous organisms. In turn, this technological change is encouraging the development of methods that can precisely address the sequence and variation described in large collections of related genomes. These approaches often use graphical models of the pangenome to support algorithms for sequence alignment, visualization, functional genomics, and association studies. The additional information provided to these methods by the pangenome allows them to achieve superior performance on a variety of bioinformatic tasks, including read alignment, variant calling, and genotyping. Pangenome graphs stand to become a ubiquitous tool in genomics. Although it is unclear whether they will replace linearreference genomes, their ability to harmoniously relate multiple sequence and coordinate systems will make them useful irrespective of which pangenomic models become most common in the future.


Assuntos
Algoritmos , Biologia Computacional/métodos , Gráficos por Computador , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA
8.
Bioinformatics ; 36(21): 5139-5144, 2021 01 29.
Artigo em Inglês | MEDLINE | ID: mdl-33040146

RESUMO

MOTIVATION: Pangenomics is a growing field within computational genomics. Many pangenomic analyses use bidirected sequence graphs as their core data model. However, implementing and correctly using this data model can be difficult, and the scale of pangenomic datasets can be challenging to work at. These challenges have impeded progress in this field. RESULTS: Here, we present a stack of two C++ libraries, libbdsg and libhandlegraph, which use a simple, field-proven interface, designed to expose elementary features of these graphs while preventing common graph manipulation mistakes. The libraries also provide a Python binding. Using a diverse collection of pangenome graphs, we demonstrate that these tools allow for efficient construction and manipulation of large genome graphs with dense variation. For instance, the speed and memory usage are up to an order of magnitude better than the prior graph implementation in the VG toolkit, which has now transitioned to using libbdsg's implementations. AVAILABILITY AND IMPLEMENTATION: libhandlegraph and libbdsg are available under an MIT License from https://github.com/vgteam/libhandlegraph and https://github.com/vgteam/libbdsg.


Assuntos
Bibliotecas , Software , Genoma , Genômica
9.
Bioinformatics ; 36(2): 400-407, 2020 01 15.
Artigo em Inglês | MEDLINE | ID: mdl-31406990

RESUMO

MOTIVATION: The variation graph toolkit (VG) represents genetic variation as a graph. Although each path in the graph is a potential haplotype, most paths are non-biological, unlikely recombinations of true haplotypes. RESULTS: We augment the VG model with haplotype information to identify which paths are more likely to exist in nature. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by building a whole-genome index of the 5008 haplotypes of the 1000 Genomes Project, and an index of all 108 070 Trans-Omics for Precision Medicine Freeze 5 chromosome 17 haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes. AVAILABILITY AND IMPLEMENTATION: Our software is available at https://github.com/vgteam/vg, https://github.com/jltsiren/gbwt and https://github.com/jltsiren/gcsa2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Haplótipos , Algoritmos , Genoma , Análise de Sequência de DNA , Software
10.
Bioinformatics ; 36(Suppl_1): i146-i153, 2020 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-32657356

RESUMO

MOTIVATION: Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become much more difficult in genome graphs. Calculating distance is one such function that is simple in a linear genome but complicated in a graph context. In read mapping algorithms such distance calculations are fundamental to determining if seed alignments could belong to the same mapping. RESULTS: We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for a new generation of mapping algorithms based upon genome graphs. AVAILABILITY AND IMPLEMENTATION: Our algorithms have been implemented as part of the vg toolkit and are available at https://github.com/vgteam/vg.


Assuntos
Genoma , Software , Algoritmos , Análise por Conglomerados , Análise de Sequência de DNA
11.
Genome Res ; 27(5): 665-676, 2017 05.
Artigo em Inglês | MEDLINE | ID: mdl-28360232

RESUMO

The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures-which we collectively refer to as genome graphs-and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.


Assuntos
Genoma Humano , Estudo de Associação Genômica Ampla/métodos , Genômica/métodos , Estudo de Associação Genômica Ampla/normas , Genômica/normas , Humanos , Polimorfismo Genético
12.
Bioinformatics ; 35(24): 5318-5320, 2019 12 15.
Artigo em Inglês | MEDLINE | ID: mdl-31368484

RESUMO

MOTIVATION: Compared to traditional haploid reference genomes, graph genomes are an efficient and compact data structure for storing multiple genomic sequences, for storing polymorphisms or for mapping sequencing reads with greater sensitivity. Further, graphs are well-studied computer science objects that can be efficiently analyzed. However, their adoption in genomic research is slow, in part because of the cognitive difficulty in interpreting graphs. RESULTS: We present an intuitive graphical representation for graph genomes that re-uses well-honed techniques developed to display public transport networks, and demonstrate it as a web tool. AVAILABILITY AND IMPLEMENTATION: Code: https://github.com/vgteam/sequenceTubeMap. DEMONSTRATION: https://vgteam.github.io/sequenceTubeMap/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Algoritmos , Genoma , Software , Genômica , Análise de Sequência de DNA
13.
Bioinformatics ; 34(13): i105-i114, 2018 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-29949989

RESUMO

Motivation: Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community. Results: We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50× coverage Illumina data and 10× PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants. Availability and implementation: https://github.com/whatshap/whatshap. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Diploide , Genoma Fúngico , Análise de Sequência de DNA/métodos , Visualização de Dados , Haplótipos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Leveduras/genética
14.
Bioinformatics ; 31(22): 3569-76, 2015 Nov 15.
Artigo em Inglês | MEDLINE | ID: mdl-26220960

RESUMO

MOTIVATION: Sequence mapping is the cornerstone of modern genomics. However, most existing sequence mapping algorithms are insufficiently general. RESULTS: We introduce context schemes: a method that allows the unambiguous recognition of a reference base in a query sequence by testing the query for substrings from an algorithmically defined set. Context schemes only map when there is a unique best mapping, and define this criterion uniformly for all reference bases. Mappings under context schemes can also be made stable, so that extension of the query string (e.g. by increasing read length) will not alter the mapping of previously mapped positions. Context schemes are general in several senses. They natively support the detection of arbitrary complex, novel rearrangements relative to the reference. They can scale over orders of magnitude in query sequence length. Finally, they are trivially extensible to more complex reference structures, such as graphs, that incorporate additional variation. We demonstrate empirically the existence of high-performance context schemes, and present efficient context scheme mapping algorithms. AVAILABILITY AND IMPLEMENTATION: The software test framework created for this study is available from https://registry.hub.docker.com/u/adamnovak/sequence-graphs/. CONTACT: anovak@soe.ucsc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Genômica/métodos , Software , Algoritmos , Simulação por Computador , Loci Gênicos , Humanos , Complexo Principal de Histocompatibilidade/genética , Alinhamento de Sequência
15.
Nat Biotechnol ; 42(4): 663-673, 2024 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-37165083

RESUMO

Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.


Assuntos
Drosophila melanogaster , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Animais , Drosophila melanogaster/genética , Haplótipos/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Alelos , Análise de Sequência de DNA , Genoma Humano/genética
16.
bioRxiv ; 2024 Oct 15.
Artigo em Inglês | MEDLINE | ID: mdl-39464141

RESUMO

The current reference genome is the backbone of diverse and rich annotations. Simple text formats, like VCF or BED, have been widely adopted and helped the critical exchange of genomic information. There is a dire need for tools and formats enabling pangenomic annotation to facilitate such enrichment of pangenomic references. The Graph Alignment Format (GAF) is a text format, tab-delimited like BED/VCF files, which was proposed to represent alignments. GAF could also be used to store paths representing annotations in a pangenome graph, but there are no tools to index and query them efficiently. Here, we present extensions to vg and HTSlib that provide efficient sorting, indexing, and querying for GAF files. With this approach, annotations overlapping a subgraph can be extracted quickly. Paths are sorted based on the IDs of traversed nodes, compressed with BGZIP, and indexed with HTSlib/tabix via our extensions for the GAF format. Compared to the binary GAM format, GAF files are easier to edit or inspect because they are plain text, and we show that they are twice as fast to sort and half as large on disk. In addition, we updated vg annotate, which takes BED or GFF3 annotation files relative to linear sequences and projects them into the pangenome. It can now produce GAF files representing these annotations' paths through the pangenome. We showcase these new tools on several applications. We projected annotations for all Human Pangenome Reference Consortium Year 1 haplotypes, including genes, segmental duplications, tandem repeats and repeats annotations, into the Minigraph-Cactus pangenome (GRCh38-based v1.1). We also projected known variants from the GWAS Catalog and expression QTLs from the GTEx project into the pangenome. Finally, we reanalyzed ATAC-seq data from ENCODE to demonstrate what a coverage track could look like in a pangenome graph. These rich annotations can be quickly queried with vg and visualized using existing tools like the Sequence Tube Map or Bandage.

17.
bioRxiv ; 2023 Dec 15.
Artigo em Inglês | MEDLINE | ID: mdl-38168361

RESUMO

Pangenomes, by including genetic diversity, should reduce reference bias by better representing new samples compared to them. Yet when comparing a new sample to a pangenome, variants in the pangenome that are not part of the sample can be misleading, for example, causing false read mappings. These irrelevant variants are generally rarer in terms of allele frequency, and have previously been dealt with using allele frequency filters. However, this is a blunt heuristic that both fails to remove some irrelevant variants and removes many relevant variants. We propose a new approach, inspired by local ancestry inference methods, that imputes a personalized pangenome subgraph based on sampling local haplotypes according to k-mer counts in the reads. Our approach is tailored for the Giraffe short read aligner, as the indexes it needs for read mapping can be built quickly. We compare the accuracy of our approach to state-of-the-art methods using graphs from the Human Pangenome Reference Consortium. The resulting personalized pangenome pipelines provide faster pangenome read mapping than comparable pipelines that use a linear reference, reduce small variant genotyping errors by 4x relative to the Genome Analysis Toolkit (GATK) best-practice pipeline, and for the first time make short-read structural variant genotyping competitive with long-read discovery methods.

18.
Science ; 374(6574): abg8871, 2021 Dec 17.
Artigo em Inglês | MEDLINE | ID: mdl-34914532

RESUMO

We introduce Giraffe, a pangenome short-read mapper that can efficiently map to a collection of haplotypes threaded through a sequence graph. Giraffe maps sequencing reads to thousands of human genomes at a speed comparable to that of standard methods mapping to a single reference genome. The increased mapping accuracy enables downstream improvements in genome-wide genotyping pipelines for both small variants and larger structural variants. We used Giraffe to genotype 167,000 structural variants, discovered in long-read studies, in 5202 diverse human genomes that were sequenced using short reads. We conclude that pangenomics facilitates a more comprehensive characterization of variation and, as a result, has the potential to improve many genomic analyses.


Assuntos
Variação Genética , Genoma Humano , Genômica/métodos , Técnicas de Genotipagem , Algoritmos , Alelos , Biologia Computacional , Genoma Fúngico , Genótipo , Haplótipos , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Polimorfismo de Nucleotídeo Único , Locos de Características Quantitativas , Saccharomyces/genética , Saccharomyces cerevisiae/genética , Análise de Sequência de DNA
19.
Genome Biol ; 21(1): 35, 2020 02 12.
Artigo em Inglês | MEDLINE | ID: mdl-32051000

RESUMO

Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmark vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.


Assuntos
Variação Estrutural do Genoma , Técnicas de Genotipagem/métodos , Software , Genoma Fúngico , Saccharomyces cerevisiae , Sequenciamento Completo do Genoma/métodos
20.
F1000Res ; 8: 1751, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-34386196

RESUMO

In March 2019, 45 scientists and software engineers from around the world converged at the University of California, Santa Cruz for the first pangenomics codeathon. The purpose of the meeting was to propose technical specifications and standards for a usable human pangenome as well as to build relevant tools for genome graph infrastructures. During the meeting, the group held several intense and productive discussions covering a diverse set of topics, including advantages of graph genomes over a linear reference representation, design of new methods that can leverage graph-based data structures, and novel visualization and annotation approaches for pangenomes. Additionally, the participants self-organized themselves into teams that worked intensely over a three-day period to build a set of pipelines and tools for specific pangenomic applications. A summary of the questions raised and the tools developed are reported in this manuscript.

SELEÇÃO DE REFERÊNCIAS
Detalhe da pesquisa