Pesquisa | BVS - MINISTÉRIO DA SAÚDE

1.

Complex genetic variation in nearly complete human genomes.

Logsdon, Glennis A; Ebert, Peter; Audano, Peter A; Loftus, Mark; Porubsky, David; Ebler, Jana; Yilmaz, Feyza; Hallast, Pille; Prodanov, Timofey; Yoo, DongAhn; Paisie, Carolyn A; Harvey, William T; Zhao, Xuefang; Martino, Gianni V; Henglin, Mir; Munson, Katherine M; Rabbani, Keon; Chin, Chen-Shan; Gu, Bida; Ashraf, Hufsah; Austine-Orimoloye, Olanrewaju; Balachandran, Parithi; Bonder, Marc Jan; Cheng, Haoyu; Chong, Zechen; Crabtree, Jonathan; Gerstein, Mark; Guethlein, Lisbeth A; Hasenfeld, Patrick; Hickey, Glenn; Hoekzema, Kendra; Hunt, Sarah E; Jensen, Matthew; Jiang, Yunzhe; Koren, Sergey; Kwon, Youngjun; Li, Chong; Li, Heng; Li, Jiaqi; Norman, Paul J; Oshima, Keisuke K; Paten, Benedict; Phillippy, Adam M; Pollock, Nicholas R; Rausch, Tobias; Rautiainen, Mikko; Scholz, Stephan; Song, Yuwei; Söylev, Arda; Sulovari, Arvis.

bioRxiv ; 2024 Sep 25.

Artigo em Inglês | MEDLINE | ID: mdl-39372794

RESUMO

Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here, we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (130 Mbp median continuity), closing 92% of all previous assembly gaps1,2 and reaching telomere-to-telomere (T2T) status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8, and AMY1/AMY2, and fully resolve 1,852 complex structural variants (SVs). In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite high-order repeat (HOR) array length and characterize the pattern of mobile element insertions into α-satellite HOR arrays. While most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference1 significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference3 to a median quality value (QV) of 45. Using this approach, 26,115 SVs per sample are detected, substantially increasing the number of SVs now amenable to downstream disease association studies.

2.

A de novo variant in PAK2 detected in an individual with Knobloch type 2 syndrome.

Werren, Elizabeth A; Kalsner, Louisa; Ewald, Jessica; Peracchio, Michael; King, Cameron; Vats, Purva; Audano, Peter A; Robinson, Peter N; Adams, Mark D; Kelly, Melissa A; Matson, Adam P.

bioRxiv ; 2024 Apr 22.

Artigo em Inglês | MEDLINE | ID: mdl-38712026

RESUMO

P21-activated kinase 2 (PAK2) is a serine/threonine kinase essential for a variety of cellular processes including signal transduction, cellular survival, proliferation, and migration. A recent report proposed monoallelic PAK2 variants cause Knobloch syndrome type 2 (KNO2)-a developmental disorder primarily characterized by ocular anomalies. Here, we identified a novel de novo heterozygous missense variant in PAK2, NM_002577.4:c.1273G>A, p.(D425N), by whole genome sequencing in an individual with features consistent with KNO2. Notable clinical phenotypes include global developmental delay, congenital retinal detachment, mild cerebral ventriculomegaly, hypotonia, FTT, pyloric stenosis, feeding intolerance, patent ductus arteriosus, and mild facial dysmorphism. The p.(D425N) variant lies within the protein kinase domain and is predicted to be functionally damaging by in silico analysis. Previous clinical genetic testing did not report this variant due to unknown relevance of PAK2 variants at the time of testing, highlighting the importance of reanalysis. Our findings also substantiate the candidacy of PAK2 variants in KNO2 and expand the KNO2 clinical spectrum.

3.

Structurally divergent and recurrently mutated regions of primate genomes.

Mao, Yafei; Harvey, William T; Porubsky, David; Munson, Katherine M; Hoekzema, Kendra; Lewis, Alexandra P; Audano, Peter A; Rozanski, Allison; Yang, Xiangyu; Zhang, Shilong; Yoo, DongAhn; Gordon, David S; Fair, Tyler; Wei, Xiaoxi; Logsdon, Glennis A; Haukness, Marina; Dishuck, Philip C; Jeong, Hyeonsoo; Del Rosario, Ricardo; Bauer, Vanessa L; Fattor, Will T; Wilkerson, Gregory K; Mao, Yuxiang; Shi, Yongyong; Sun, Qiang; Lu, Qing; Paten, Benedict; Bakken, Trygve E; Pollen, Alex A; Feng, Guoping; Sawyer, Sara L; Warren, Wesley C; Carbone, Lucia; Eichler, Evan E.

Cell ; 187(6): 1547-1562.e13, 2024 Mar 14.

Artigo em Inglês | MEDLINE | ID: mdl-38428424

RESUMO

We sequenced and assembled using multiple long-read sequencing technologies the genomes of chimpanzee, bonobo, gorilla, orangutan, gibbon, macaque, owl monkey, and marmoset. We identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. We estimate that 819.47 Mbp or â¼27% of the genome has been affected by SVs across primate evolution. We identify 1,607 structurally divergent regions wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (e.g., CARD, C4, and OLAH gene families) and additional lineage-specific genes are generated (e.g., CKAP2, VPS36, ACBD7, and NEK5 paralogs), becoming targets of rapid chromosomal diversification and positive selection (e.g., RGPD gene family). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species.

Assuntos

Genoma , Primatas , Animais , Humanos , Sequência de Bases , Primatas/classificação , Primatas/genética , Evolução Biológica , Análise de Sequência de DNA , Variação Estrutural do Genoma

4.

Small polymorphisms are a source of ancestral bias in structural variant breakpoint placement.

Audano, Peter A; Beck, Christine R.

Genome Res ; 34(1): 7-19, 2024 02 07.

Artigo em Inglês | MEDLINE | ID: mdl-38176712

RESUMO

High-quality genome assemblies and sophisticated algorithms have increased sensitivity for a wide range of variant types, and breakpoint accuracy for structural variants (SVs, ≥50 bp) has improved to near base pair precision. Despite these advances, many SV breakpoint locations are subject to systematic bias affecting variant representation. To understand why SV breakpoints are inconsistent across samples, we reanalyzed 64 phased haplotypes constructed from long-read assemblies released by the Human Genome Structural Variation Consortium (HGSVC). We identify 882 SV insertions and 180 SV deletions with variable breakpoints not anchored in tandem repeats (TRs) or segmental duplications (SDs). SVs called from aligned sequencing reads increase breakpoint disagreements by 2×-16×. Sequence accuracy had a minimal impact on breakpoints, but we observe a strong effect of ancestry. We confirm that SNP and indel polymorphisms are enriched at shifted breakpoints and are also absent from variant callsets. Breakpoint homology increases the likelihood of imprecise SV calls and the distance they are shifted, and tandem duplications are the most heavily affected SVs. Because graph genome methods normalize SV calls across samples, we investigated graphs generated by two different methods and find the resulting breakpoints are subject to other technical biases affecting breakpoint accuracy. The breakpoint inconsistencies we characterize affect â¼5% of the SVs called in a human genome and can impact variant interpretation and annotation. These limitations underscore a need for algorithm development to improve SV databases, mitigate the impact of ancestry on breakpoints, and increase the value of callsets for investigating breakpoint features.

Assuntos

Algoritmos , Genoma Humano , Humanos , Análise de Sequência , Variação Estrutural do Genoma , Viés , Análise de Sequência de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala

5.

Assembly of 43 human Y chromosomes reveals extensive complexity and variation.

Hallast, Pille; Ebert, Peter; Loftus, Mark; Yilmaz, Feyza; Audano, Peter A; Logsdon, Glennis A; Bonder, Marc Jan; Zhou, Weichen; Höps, Wolfram; Kim, Kwondo; Li, Chong; Hoyt, Savannah J; Dishuck, Philip C; Porubsky, David; Tsetsos, Fotios; Kwon, Jee Young; Zhu, Qihui; Munson, Katherine M; Hasenfeld, Patrick; Harvey, William T; Lewis, Alexandra P; Kordosky, Jennifer; Hoekzema, Kendra; O'Neill, Rachel J; Korbel, Jan O; Tyler-Smith, Chris; Eichler, Evan E; Shi, Xinghua; Beck, Christine R; Marschall, Tobias; Konkel, Miriam K; Lee, Charles.

Nature ; 621(7978): 355-364, 2023 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-37612510

RESUMO

The prevalence of highly repetitive sequences within the human Y chromosome has prevented its complete assembly to date1 and led to its systematic omission from genomic analyses. Here we present de novo assemblies of 43 Y chromosomes spanning 182,900 years of human evolution and report considerable diversity in size and structure. Half of the male-specific euchromatic region is subject to large inversions with a greater than twofold higher recurrence rate compared with all other chromosomes2. Ampliconic sequences associated with these inversions show differing mutation rates that are sequence context dependent, and some ampliconic genes exhibit evidence for concerted evolution with the acquisition and purging of lineage-specific pseudogenes. The largest heterochromatic region in the human genome, Yq12, is composed of alternating repeat arrays that show extensive variation in the number, size and distribution, but retain a 1:1 copy-number ratio. Finally, our data suggest that the boundary between the recombining pseudoautosomal region 1 and the non-recombining portions of the X and Y chromosomes lies 500 kb away from the currently established1 boundary. The availability of fully sequence-resolved Y chromosomes from multiple individuals provides a unique opportunity for identifying new associations of traits with specific Y-chromosomal variants and garnering insights into the evolution and function of complex regions of the human genome.

Assuntos

Cromossomos Humanos Y , Evolução Molecular , Humanos , Masculino , Cromossomos Humanos Y/genética , Genoma Humano/genética , Genômica , Taxa de Mutação , Fenótipo , Eucromatina/genética , Pseudogenes , Variação Genética/genética , Cromossomos Humanos X/genética , Regiões Pseudoautossômicas/genética

6.

Small allelic variants are a source of ancestral bias in structural variant breakpoint placement.

Audano, Peter A; Beck, Christine R.

bioRxiv ; 2023 Jun 26.

Artigo em Inglês | MEDLINE | ID: mdl-37425850

RESUMO

High-quality genome assemblies and sophisticated algorithms have increased sensitivity for a wide range of variant types, and breakpoint accuracy for structural variants (SVs, ≥ 50 bp) has improved to near basepair precision. Despite these advances, many SVs in unique regions of the genome are subject to systematic bias that affects breakpoint location. This ambiguity leads to less accurate variant comparisons across samples, and it obscures true breakpoint features needed for mechanistic inferences. To understand why SVs are not consistently placed, we reanalyzed 64 phased haplotypes constructed from long-read assemblies released by the Human Genome Structural Variation Consortium (HGSVC). We identified variable breakpoints for 882 SV insertions and 180 SV deletions not anchored in tandem repeats (TRs) or segmental duplications (SDs). While this is unexpectedly high for genome assemblies in unique loci, we find read-based callsets from the same sequencing data yielded 1,566 insertions and 986 deletions with inconsistent breakpoints also not anchored in TRs or SDs. When we investigated causes for breakpoint inaccuracy, we found sequence and assembly errors had minimal impact, but we observed a strong effect of ancestry. We confirmed that polymorphic mismatches and small indels are enriched at shifted breakpoints and that these polymorphisms are generally lost when breakpoints shift. Long tracts of homology, such as SVs mediated by transposable elements, increase the likelihood of imprecise SV calls and the distance they are shifted. Tandem Duplication (TD) breakpoints are the most heavily affected SV class with 14% of TDs placed at different locations across haplotypes. While graph genome methods normalize SV calls across many samples, the resulting breakpoints are sometimes incorrect, highlighting a need to tune graph methods for breakpoint accuracy. The breakpoint inconsistencies we characterize collectively affect ~5% of the SVs called in a human genome and underscore a need for algorithm development to improve SV databases, mitigate the impact of ancestry on breakpoint placement, and increase the value of callsets for investigating mutational processes.

7.

Resolution of structural variation in diverse mouse genomes reveals chromatin remodeling due to transposable elements.

Ferraj, Ardian; Audano, Peter A; Balachandran, Parithi; Czechanski, Anne; Flores, Jacob I; Radecki, Alexander A; Mosur, Varun; Gordon, David S; Walawalkar, Isha A; Eichler, Evan E; Reinholdt, Laura G; Beck, Christine R.

Cell Genom ; 3(5): 100291, 2023 May 10.

Artigo em Inglês | MEDLINE | ID: mdl-37228752

RESUMO

Diverse inbred mouse strains are important biomedical research models, yet genome characterization of many strains is fundamentally lacking in comparison with humans. In particular, catalogs of structural variants (SVs) (variants ≥ 50 bp) are incomplete, limiting the discovery of causative alleles for phenotypic variation. Here, we resolve genome-wide SVs in 20 genetically distinct inbred mice with long-read sequencing. We report 413,758 site-specific SVs affecting 13% (356 Mbp) of the mouse reference assembly, including 510 previously unannotated coding variants. We substantially improve the Mus musculus transposable element (TE) callset, and we find that TEs comprise 39% of SVs and account for 75% of altered bases. We further utilize this callset to investigate how TE heterogeneity affects mouse embryonic stem cells and find multiple TE classes that influence chromatin accessibility. Our work provides a comprehensive analysis of SVs found in diverse mouse genomes and illustrates the role of TEs in epigenetic differences.

8.

Whole-genome long-read sequencing downsampling and its effect on variant calling precision and recall.

Harvey, William T; Ebert, Peter; Ebler, Jana; Audano, Peter A; Munson, Katherine M; Hoekzema, Kendra; Porubsky, David; Beck, Christine R; Marschall, Tobias; Garimella, Kiran; Eichler, Evan E.

bioRxiv ; 2023 May 04.

Artigo em Inglês | MEDLINE | ID: mdl-37205567

RESUMO

Advances in long-read sequencing (LRS) technology continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant calling precision and recall of Oxford Nanopore Technologies (ONT) and PacBio HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant calling precision and recall of SVs and indels in HiFi datasets with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant callsets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.

9.

Structurally divergent and recurrently mutated regions of primate genomes.

Mao, Yafei; Harvey, William T; Porubsky, David; Munson, Katherine M; Hoekzema, Kendra; Lewis, Alexandra P; Audano, Peter A; Rozanski, Allison; Yang, Xiangyu; Zhang, Shilong; Gordon, David S; Wei, Xiaoxi; Logsdon, Glennis A; Haukness, Marina; Dishuck, Philip C; Jeong, Hyeonsoo; Del Rosario, Ricardo; Bauer, Vanessa L; Fattor, Will T; Wilkerson, Gregory K; Lu, Qing; Paten, Benedict; Feng, Guoping; Sawyer, Sara L; Warren, Wesley C; Carbone, Lucia; Eichler, Evan E.

bioRxiv ; 2023 Mar 07.

Artigo em Inglês | MEDLINE | ID: mdl-36945442

RESUMO

To better understand the pattern of primate genome structural variation, we sequenced and assembled using multiple long-read sequencing technologies the genomes of eight nonhuman primate species, including New World monkeys (owl monkey and marmoset), Old World monkey (macaque), Asian apes (orangutan and gibbon), and African ape lineages (gorilla, bonobo, and chimpanzee). Compared to the human genome, we identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. Across 50 million years of primate evolution, we estimate that 819.47 Mbp or ~27% of the genome has been affected by SVs based on analysis of these primate lineages. We identify 1,607 structurally divergent regions (SDRs) wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (CARDs, ABCD7, OLAH) and new lineage-specific genes are generated (e.g., CKAP2, NEK5) and have become targets of rapid chromosomal diversification and positive selection (e.g., RGPDs). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species for the first time.

10.

Whole-genome long-read sequencing downsampling and its effect on variant-calling precision and recall.

Harvey, William T; Ebert, Peter; Ebler, Jana; Audano, Peter A; Munson, Katherine M; Hoekzema, Kendra; Porubsky, David; Beck, Christine R; Marschall, Tobias; Garimella, Kiran; Eichler, Evan E.

Genome Res ; 33(12): 2029-2040, 2023 12 27.

Artigo em Inglês | MEDLINE | ID: mdl-38190646

RESUMO

Advances in long-read sequencing (LRS) technologies continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant-calling precision and recall of Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant-calling precision and recall of SVs and indels in HiFi data sets with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant call sets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.

Assuntos

Genômica , Nanoporos , Mutação INDEL , Sequenciamento Completo do Genoma

11.

Transposable element-mediated rearrangements are prevalent in human genomes.

Balachandran, Parithi; Walawalkar, Isha A; Flores, Jacob I; Dayton, Jacob N; Audano, Peter A; Beck, Christine R.

Nat Commun ; 13(1): 7115, 2022 11 19.

Artigo em Inglês | MEDLINE | ID: mdl-36402840

RESUMO

Transposable elements constitute about half of human genomes, and their role in generating human variation through retrotransposition is broadly studied and appreciated. Structural variants mediated by transposons, which we call transposable element-mediated rearrangements (TEMRs), are less well studied, and the mechanisms leading to their formation as well as their broader impact on human diversity are poorly understood. Here, we identify 493 unique TEMRs across the genomes of three individuals. While homology directed repair is the dominant driver of TEMRs, our sequence-resolved TEMR resource allows us to identify complex inversion breakpoints, triplications or other high copy number polymorphisms, and additional complexities. TEMRs are enriched in genic loci and can create potentially important risk alleles such as a deletion in TRIM65, a known cancer biomarker and therapeutic target. These findings expand our understanding of this important class of structural variation, the mechanisms responsible for their formation, and establish them as an important driver of human diversity.

Assuntos

Elementos de DNA Transponíveis , Genoma Humano , Humanos , Elementos de DNA Transponíveis/genética , Genoma Humano/genética , Rearranjo Gênico/genética , Variações do Número de Cópias de DNA , Proteínas com Motivo Tripartido/genética , Ubiquitina-Proteína Ligases/genética

12.

SVision: a deep learning approach to resolve complex structural variants.

Lin, Jiadong; Wang, Songbo; Audano, Peter A; Meng, Deyu; Flores, Jacob I; Kosters, Walter; Yang, Xiaofei; Jia, Peng; Marschall, Tobias; Beck, Christine R; Ye, Kai.

Nat Methods ; 19(10): 1230-1233, 2022 10.

Artigo em Inglês | MEDLINE | ID: mdl-36109679

RESUMO

Complex structural variants (CSVs) encompass multiple breakpoints and are often missed or misinterpreted. We developed SVision, a deep-learning-based multi-object-recognition framework, to automatically detect and haracterize CSVs from long-read sequencing data. SVision outperforms current callers at identifying the internal structure of complex events and has revealed 80 high-quality CSVs with 25 distinct structures from an individual genome. SVision directly detects CSVs without matching known structures, allowing sensitive detection of both common and previously uncharacterized complex rearrangements.

Assuntos

Aprendizado Profundo , Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA

13.

Recurrent inversion polymorphisms in humans associate with genetic instability and genomic disorders.

Porubsky, David; Höps, Wolfram; Ashraf, Hufsah; Hsieh, PingHsun; Rodriguez-Martin, Bernardo; Yilmaz, Feyza; Ebler, Jana; Hallast, Pille; Maria Maggiolini, Flavia Angela; Harvey, William T; Henning, Barbara; Audano, Peter A; Gordon, David S; Ebert, Peter; Hasenfeld, Patrick; Benito, Eva; Zhu, Qihui; Lee, Charles; Antonacci, Francesca; Steinrücken, Matthias; Beck, Christine R; Sanders, Ashley D; Marschall, Tobias; Eichler, Evan E; Korbel, Jan O.

Cell ; 185(11): 1986-2005.e26, 2022 05 26.

Artigo em Inglês | MEDLINE | ID: mdl-35525246

RESUMO

Unlike copy number variants (CNVs), inversions remain an underexplored genetic variation class. By integrating multiple genomic technologies, we discover 729 inversions in 41 human genomes. Approximately 85% of inversions <2 kbp form by twin-priming during L1 retrotransposition; 80% of the larger inversions are balanced and affect twice as many nucleotides as CNVs. Balanced inversions show an excess of common variants, and 72% are flanked by segmental duplications (SDs) or retrotransposons. Since flanking repeats promote non-allelic homologous recombination, we developed complementary approaches to identify recurrent inversion formation. We describe 40 recurrent inversions encompassing 0.6% of the genome, showing inversion rates up to 2.7 × 10-4 per locus per generation. Recurrent inversions exhibit a sex-chromosomal bias and co-localize with genomic disorder critical regions. We propose that inversion recurrence results in an elevated number of heterozygous carriers and structural SD diversity, which increases mutability in the population and predisposes specific haplotypes to disease-causing CNVs.

Assuntos

Inversão Cromossômica , Duplicações Segmentares Genômicas , Inversão Cromossômica/genética , Variações do Número de Cópias de DNA/genética , Genoma Humano , Genômica , Humanos

14.

Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes.

Ebler, Jana; Ebert, Peter; Clarke, Wayne E; Rausch, Tobias; Audano, Peter A; Houwaart, Torsten; Mao, Yafei; Korbel, Jan O; Eichler, Evan E; Zody, Michael C; Dilthey, Alexander T; Marschall, Tobias.

Nat Genet ; 54(4): 518-525, 2022 04.

Artigo em Inglês | MEDLINE | ID: mdl-35410384

RESUMO

Typical genotyping workflows map reads to a reference genome before identifying genetic variants. Generating such alignments introduces reference biases and comes with substantial computational burden. Furthermore, short-read lengths limit the ability to characterize repetitive genomic regions, which are particularly challenging for fast k-mer-based genotypers. In the present study, we propose a new algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation-a process we refer to as genome inference. Compared with mapping-based approaches, PanGenie is more than 4 times faster at 30-fold coverage and achieves better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (≥50 bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the increasing amount of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while being faster compared with alignment-based workflows.

Assuntos

Variação Genética , Genoma Humano , Genômica , Algoritmos , Genoma Humano/genética , Estudo de Associação Genômica Ampla , Genômica/métodos , Genótipo , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA

15.

Familial long-read sequencing increases yield of de novo mutations.

Noyes, Michelle D; Harvey, William T; Porubsky, David; Sulovari, Arvis; Li, Ruiyang; Rose, Nicholas R; Audano, Peter A; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Mantere, Tuomo; Graves-Lindsay, Tina A; Sanders, Ashley D; Goodwin, Sara; Kramer, Melissa; Mokrab, Younes; Zody, Michael C; Hoischen, Alexander; Korbel, Jan O; McCombie, W Richard; Eichler, Evan E.

Am J Hum Genet ; 109(4): 631-646, 2022 04 07.

Artigo em Inglês | MEDLINE | ID: mdl-35290762

RESUMO

Studies of de novo mutation (DNM) have typically excluded some of the most repetitive and complex regions of the genome because these regions cannot be unambiguously mapped with short-read sequencing data. To better understand the genome-wide pattern of DNM, we generated long-read sequence data from an autism parent-child quad with an affected female where no pathogenic variant had been discovered in short-read Illumina sequence data. We deeply sequenced all four individuals by using three sequencing platforms (Illumina, Oxford Nanopore, and Pacific Biosciences) and three complementary technologies (Strand-seq, optical mapping, and 10X Genomics). Using long-read sequencing, we initially discovered and validated 171 DNMs across two children-a 20% increase in the number of de novo single-nucleotide variants (SNVs) and indels when compared to short-read callsets. The number of DNMs further increased by 5% when considering a more complete human reference (T2T-CHM13) because of the recovery of events in regions absent from GRCh38 (e.g., three DNMs in heterochromatic satellites). In total, we validated 195 de novo germline mutations and 23 potential post-zygotic mosaic mutations across both children; the overall true substitution rate based on this integrated callset is at least 1.41 × 10-8 substitutions per nucleotide per generation. We also identified six de novo insertions and deletions in tandem repeats, two of which represent structural variants. We demonstrate that long-read sequencing and assembly, especially when combined with a more complete reference genome, increases the number of DNMs by >25% compared to previous studies, providing a more complete catalog of DNM compared to short-read data alone.

Assuntos

Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Feminino , Humanos , Mutação/genética , Nucleotídeos , Análise de Sequência de DNA , Software

16.

A high-quality bonobo genome refines the analysis of hominid evolution.

Mao, Yafei; Catacchio, Claudia R; Hillier, LaDeana W; Porubsky, David; Li, Ruiyang; Sulovari, Arvis; Fernandes, Jason D; Montinaro, Francesco; Gordon, David S; Storer, Jessica M; Haukness, Marina; Fiddes, Ian T; Murali, Shwetha Canchi; Dishuck, Philip C; Hsieh, PingHsun; Harvey, William T; Audano, Peter A; Mercuri, Ludovica; Piccolo, Ilaria; Antonacci, Francesca; Munson, Katherine M; Lewis, Alexandra P; Baker, Carl; Underwood, Jason G; Hoekzema, Kendra; Huang, Tzu-Hsueh; Sorensen, Melanie; Walker, Jerilyn A; Hoffman, Jinna; Thibaud-Nissen, Françoise; Salama, Sofie R; Pang, Andy W C; Lee, Joyce; Hastie, Alex R; Paten, Benedict; Batzer, Mark A; Diekhans, Mark; Ventura, Mario; Eichler, Evan E.

Nature ; 594(7861): 77-81, 2021 06.

Artigo em Inglês | MEDLINE | ID: mdl-33953399

RESUMO

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation1,2. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes1,3-5 and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.

Assuntos

Evolução Molecular , Genoma/genética , Genômica , Pan paniscus/genética , Filogenia , Animais , Fator de Iniciação 4A em Eucariotos/genética , Feminino , Genes , Gorilla gorilla/genética , Anotação de Sequência Molecular/normas , Pan troglodytes/genética , Pongo/genética , Duplicações Segmentares Genômicas , Análise de Sequência de DNA

17.

Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies.

Zhao, Xuefang; Collins, Ryan L; Lee, Wan-Ping; Weber, Alexandra M; Jun, Yukyung; Zhu, Qihui; Weisburd, Ben; Huang, Yongqing; Audano, Peter A; Wang, Harold; Walker, Mark; Lowther, Chelsea; Fu, Jack; Gerstein, Mark B; Devine, Scott E; Marschall, Tobias; Korbel, Jan O; Eichler, Evan E; Chaisson, Mark J P; Lee, Charles; Mills, Ryan E; Brand, Harrison; Talkowski, Michael E.

Am J Hum Genet ; 108(5): 919-928, 2021 05 06.

Artigo em Inglês | MEDLINE | ID: mdl-33789087

RESUMO

Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.

Assuntos

Genoma Humano/genética , Variação Estrutural do Genoma , Genômica/métodos , Objetivos , Sequenciamento Completo do Genoma/métodos , Sequenciamento Completo do Genoma/normas , Variações do Número de Cópias de DNA , Éxons/genética , Humanos , Projetos de Pesquisa , Duplicações Segmentares Genômicas , Alinhamento de Sequência

18.

Haplotype-resolved diverse human genomes and integrated analysis of structural variation.

Ebert, Peter; Audano, Peter A; Zhu, Qihui; Rodriguez-Martin, Bernardo; Porubsky, David; Bonder, Marc Jan; Sulovari, Arvis; Ebler, Jana; Zhou, Weichen; Serra Mari, Rebecca; Yilmaz, Feyza; Zhao, Xuefang; Hsieh, PingHsun; Lee, Joyce; Kumar, Sushant; Lin, Jiadong; Rausch, Tobias; Chen, Yu; Ren, Jingwen; Santamarina, Martin; Höps, Wolfram; Ashraf, Hufsah; Chuang, Nelson T; Yang, Xiaofei; Munson, Katherine M; Lewis, Alexandra P; Fairley, Susan; Tallon, Luke J; Clarke, Wayne E; Basile, Anna O; Byrska-Bishop, Marta; Corvelo, André; Evani, Uday S; Lu, Tsung-Yu; Chaisson, Mark J P; Chen, Junjie; Li, Chong; Brand, Harrison; Wenger, Aaron M; Ghareghani, Maryam; Harvey, William T; Raeder, Benjamin; Hasenfeld, Patrick; Regier, Allison A; Abel, Haley J; Hall, Ira M; Flicek, Paul; Stegle, Oliver; Gerstein, Mark B; Tubio, Jose M C.

Science ; 372(6537)2021 04 02.

Artigo em Inglês | MEDLINE | ID: mdl-33632895

RESUMO

Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.

Assuntos

Variação Genética , Genoma Humano , Haplótipos , Feminino , Genótipo , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Mutação INDEL , Sequências Repetitivas Dispersas , Masculino , Grupos Populacionais/genética , Locos de Características Quantitativas , Retroelementos , Análise de Sequência de DNA , Inversão de Sequência , Sequenciamento Completo do Genoma

19.

Fully phased human genome assembly without parental data using single-cell strand sequencing and long reads.

Porubsky, David; Ebert, Peter; Audano, Peter A; Vollger, Mitchell R; Harvey, William T; Marijon, Pierre; Ebler, Jana; Munson, Katherine M; Sorensen, Melanie; Sulovari, Arvis; Haukness, Marina; Ghareghani, Maryam; Lansdorp, Peter M; Paten, Benedict; Devine, Scott E; Sanders, Ashley D; Lee, Charles; Chaisson, Mark J P; Korbel, Jan O; Eichler, Evan E; Marschall, Tobias.

Nat Biotechnol ; 39(3): 302-308, 2021 03.

Artigo em Inglês | MEDLINE | ID: mdl-33288906

RESUMO

Human genomes are typically assembled as consensus sequences that lack information on parental haplotypes. Here we describe a reference-free workflow for diploid de novo genome assembly that combines the chromosome-wide phasing and scaffolding capabilities of single-cell strand sequencing1,2 with continuous long-read or high-fidelity3 sequencing data. Employing this strategy, we produced a completely phased de novo genome assembly for each haplotype of an individual of Puerto Rican descent (HG00733) in the absence of parental data. The assemblies are accurate (quality value > 40) and highly contiguous (contig N50 > 23 Mbp) with low switch error rates (0.17%), providing fully phased single-nucleotide variants, indels and structural variants. A comparison of Oxford Nanopore Technologies and Pacific Biosciences phased assemblies identified 154 regions that are preferential sites of contig breaks, irrespective of sequencing technology or phasing algorithms.

Assuntos

Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Pais , Análise de Sequência de DNA/métodos , Análise de Célula Única/métodos , Algoritmos , Haplótipos , Humanos , Porto Rico/etnologia

20.

Sequence diversity analyses of an improved rhesus macaque genome enhance its biomedical utility.

Warren, Wesley C; Harris, R Alan; Haukness, Marina; Fiddes, Ian T; Murali, Shwetha C; Fernandes, Jason; Dishuck, Philip C; Storer, Jessica M; Raveendran, Muthuswamy; Hillier, LaDeana W; Porubsky, David; Mao, Yafei; Gordon, David; Vollger, Mitchell R; Lewis, Alexandra P; Munson, Katherine M; DeVogelaere, Elizabeth; Armstrong, Joel; Diekhans, Mark; Walker, Jerilyn A; Tomlinson, Chad; Graves-Lindsay, Tina A; Kremitzki, Milinn; Salama, Sofie R; Audano, Peter A; Escalona, Merly; Maurer, Nicholas W; Antonacci, Francesca; Mercuri, Ludovica; Maggiolini, Flavia A M; Catacchio, Claudia Rita; Underwood, Jason G; O'Connor, David H; Sanders, Ashley D; Korbel, Jan O; Ferguson, Betsy; Kubisch, H Michael; Picker, Louis; Kalin, Ned H; Rosene, Douglas; Levine, Jon; Abbott, David H; Gray, Stanton B; Sanchez, Mar M; Kovacs-Balint, Zsofia A; Kemnitz, Joseph W; Thomasy, Sara M; Roberts, Jeffrey A; Kinnally, Erin L; Capitanio, John P.

Science ; 370(6523)2020 12 18.

Artigo em Inglês | MEDLINE | ID: mdl-33335035

RESUMO

The rhesus macaque (Macaca mulatta) is the most widely studied nonhuman primate (NHP) in biomedical research. We present an updated reference genome assembly (Mmul_10, contig N50 = 46 Mbp) that increases the sequence contiguity 120-fold and annotate it using 6.5 million full-length transcripts, thus improving our understanding of gene content, isoform diversity, and repeat organization. With the improved assembly of segmental duplications, we discovered new lineage-specific genes and expanded gene families that are potentially informative in studies of evolution and disease susceptibility. Whole-genome sequencing (WGS) data from 853 rhesus macaques identified 85.7 million single-nucleotide variants (SNVs) and 10.5 million indel variants, including potentially damaging variants in genes associated with human autism and developmental delay, providing a framework for developing noninvasive NHP models of human disease.

Assuntos

Predisposição Genética para Doença , Genoma , Macaca mulatta/genética , Polimorfismo de Nucleotídeo Único , Animais , Variação Genética , Humanos , Anotação de Sequência Molecular , Sequenciamento Completo do Genoma

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA