Search | VHL Regional Portal

A High-Quality Blue Whale Genome, Segmental Duplications, and Historical Demography.

Bukhman, Yury V; Morin, Phillip A; Meyer, Susanne; Chu, Li-Fang; Jacobsen, Jeff K; Antosiewicz-Bourget, Jessica; Mamott, Daniel; Gonzales, Maylie; Argus, Cara; Bolin, Jennifer; Berres, Mark E; Fedrigo, Olivier; Steill, John; Swanson, Scott A; Jiang, Peng; Rhie, Arang; Formenti, Giulio; Phillippy, Adam M; Harris, Robert S; Wood, Jonathan M D; Howe, Kerstin; Kirilenko, Bogdan M; Munegowda, Chetan; Hiller, Michael; Jain, Aashish; Kihara, Daisuke; Johnston, J Spencer; Ionkov, Alexander; Raja, Kalpana; Toh, Huishi; Lang, Aimee; Wolf, Magnus; Jarvis, Erich D; Thomson, James A; Chaisson, Mark J P; Stewart, Ron.

Mol Biol Evol ; 41(3), 2024 Mar 01.

Article in English | MEDLINE | ID: mdl-38376487

ABSTRACT

The blue whale, Balaenoptera musculus, is the largest animal known to have ever existed, making it an important case study in longevity and resistance to cancer. To further this and other blue whale-related research, we report a reference-quality, long-read-based genome assembly of this fascinating species. We assembled the genome from PacBio long reads and utilized Illumina/10×, optical maps, and Hi-C data for scaffolding, polishing, and manual curation. We also provided long read RNA-seq data to facilitate the annotation of the assembly by NCBI and Ensembl. Additionally, we annotated both haplotypes using TOGA and measured the genome size by flow cytometry. We then compared the blue whale genome with other cetaceans and artiodactyls, including vaquita (Phocoena sinus), the world's smallest cetacean, to investigate blue whale's unique biological traits. We found a dramatic amplification of several genes in the blue whale genome resulting from a recent burst in segmental duplications, though the possible connection between this amplification and giant body size requires further study. We also discovered sites in the insulin-like growth factor-1 gene correlated with body size in cetaceans. Finally, using our assembly to examine the heterozygosity and historical demography of Pacific and Atlantic blue whale populations, we found that the genomes of both populations are highly heterozygous and that their genetic isolation dates to the last interglacial period. Taken together, these results indicate how a high-quality, annotated blue whale genome will serve as an important resource for biology, evolution, and conservation research.

Subject(s)

Balaenoptera , Neoplasms , Animals , Balaenoptera/genetics , Genomic Segmental Duplications , Genome , Demography , Neoplasms/genetics

The motif composition of variable number tandem repeats impacts gene expression.

Lu, Tsung-Yu; Smaruj, Paulina N; Fudenberg, Geoffrey; Mancuso, Nicholas; Chaisson, Mark J P.

Genome Res ; 33(4): 511-524, 2023 04.

Article in English | MEDLINE | ID: mdl-37037626

ABSTRACT

Understanding the impact of DNA variation on human traits is a fundamental question in human genetics. Variable number tandem repeats (VNTRs) make up ~3% of the human genome but are often excluded from association analysis owing to poor read mappability or divergent repeat content. Although methods exist to estimate VNTR length from short-read data, it is known that VNTRs vary in both length and repeat (motif) composition. Here, we use a repeat-pangenome graph (RPGG) constructed on 35 haplotype-resolved assemblies to detect variation in both VNTR length and repeat composition. We align population-scale data from the Genotype-Tissue Expression (GTEx) Consortium to examine how variations in sequence composition may be linked to expression, including cases independent of overall VNTR length. We find that 9422 out of 39,125 VNTRs are associated with nearby gene expression through motif variations, of which only 23.4% are accessible from length. Fine-mapping identifies 174 genes to be likely driven by variation in certain VNTR motifs and not overall length. We highlight two genes, CACNA1C and RNF213, that have expression associated with motif variation, showing the utility of RPGG analysis as a new approach for trait association in multiallelic and highly variable loci.

Subject(s)

Adenosine Triphosphatases , Minisatellite Repeats , Humans , Minisatellite Repeats/genetics , Phenotype , Haplotypes , Gene Expression , Adenosine Triphosphatases/genetics , Ubiquitin-Protein Ligases/genetics

Semi-automated assembly of high-quality diploid human reference genomes.

Jarvis, Erich D; Formenti, Giulio; Rhie, Arang; Guarracino, Andrea; Yang, Chentao; Wood, Jonathan; Tracey, Alan; Thibaud-Nissen, Francoise; Vollger, Mitchell R; Porubsky, David; Cheng, Haoyu; Asri, Mobin; Logsdon, Glennis A; Carnevali, Paolo; Chaisson, Mark J P; Chin, Chen-Shan; Cody, Sarah; Collins, Joanna; Ebert, Peter; Escalona, Merly; Fedrigo, Olivier; Fulton, Robert S; Fulton, Lucinda L; Garg, Shilpa; Gerton, Jennifer L; Ghurye, Jay; Granat, Anastasiya; Green, Richard E; Harvey, William; Hasenfeld, Patrick; Hastie, Alex; Haukness, Marina; Jaeger, Erich B; Jain, Miten; Kirsche, Melanie; Kolmogorov, Mikhail; Korbel, Jan O; Koren, Sergey; Korlach, Jonas; Lee, Joyce; Li, Daofeng; Lindsay, Tina; Lucas, Julian; Luo, Feng; Marschall, Tobias; Mitchell, Matthew W; McDaniel, Jennifer; Nie, Fan; Olsen, Hugh E; Olson, Nathan D.

Nature ; 611(7936): 519-531, 2022 Nov.

Article in English | MEDLINE | ID: mdl-36261518

ABSTRACT

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.

Subject(s)

Chromosome Mapping , Diploidy , Human Genome , Genomics , Humans , Chromosome Mapping/standards , Human Genome/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/standards , DNA Sequence Analysis/methods , DNA Sequence Analysis/standards , Reference Standards , Genomics/methods , Genomics/standards , Human Chromosomes/genetics , Genetic Variation/genetics

Long-read sequence and assembly of segmental duplications.

Vollger, Mitchell R; Dishuck, Philip C; Sorensen, Melanie; Welch, AnneMarie E; Dang, Vy; Dougherty, Max L; Graves-Lindsay, Tina A; Wilson, Richard K; Chaisson, Mark J P; Eichler, Evan E.

Nat Methods ; 16(1): 88-94, 2019 01.

Article in English | MEDLINE | ID: mdl-30559433

ABSTRACT

We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.

Subject(s)

Computational Biology , Genomic Segmental Duplications , DNA Sequence Analysis/methods , Human Genome , Humans , Molecular Sequence Annotation

Resolving multicopy duplications de novo using polyploid phasing.

Chaisson, Mark J; Mukherjee, Sudipto; Kannan, Sreeram; Eichler, Evan E.

Res Comput Mol Biol ; 10229: 117-133, 2017 May.

Article in English | MEDLINE | ID: mdl-28808695

ABSTRACT

While the rise of single-molecule sequencing systems has enabled an unprecedented rise in the ability to assemble complex regions of the genome, long segmental duplications in the genome still remain a challenging frontier in assembly. Segmental duplications are at the same time both gene rich and prone to large structural rearrangements, making the resolution of their sequences important in medical and evolutionary studies. Duplicated sequences that are collapsed in mammalian de novo assemblies are rarely identical; after a sequence is duplicated, it begins to acquire paralog specific variants. In this paper, we study the problem of resolving the variations in multicopy long-segmental duplications by developing and utilizing algorithms for polyploid phasing. We develop two algorithms: the first one is targeted at maximizing the likelihood of observing the reads given the underlying haplotypes using discrete matrix completion. The second algorithm is based on correlation clustering and exploits an assumption, which is often satisfied in these duplications, that each paralog has a sizable number of paralog-specific variants. We develop a detailed simulation methodology, and demonstrate the superior performance of the proposed algorithms on an array of simulated datasets. We measure the likelihood score as well as reconstruction accuracy, i.e., what fraction of the reads are clustered correctly. In both the performance metrics, we find that our algorithms dominate existing algorithms on more than 93% of the datasets. While the discrete matrix completion performs better on likelihood score, the correlation clustering algorithm performs better on reconstruction accuracy due to the stronger regularization inherent in the algorithm. We also show that our correlation-clustering algorithm can reconstruct on an average 7.0 haplotypes in 10-copy duplication data-sets whereas existing algorithms reconstruct less than 1 copy on average.

HySA: a Hybrid Structural variant Assembly approach using next-generation and single-molecule sequencing technologies.

Fan, Xian; Chaisson, Mark; Nakhleh, Luay; Chen, Ken.

Genome Res ; 27(5): 793-800, 2017 05.

Article in English | MEDLINE | ID: mdl-28104618

ABSTRACT

Achieving complete, accurate, and cost-effective assembly of human genomes is of great importance for realizing the promise of precision medicine. The abundance of repeats and genetic variations in human genomes and the limitations of existing sequencing technologies call for the development of novel assembly methods that can leverage the complementary strengths of multiple technologies. We propose a Hybrid Structural variant Assembly (HySA) approach that integrates sequencing reads from next-generation sequencing and single-molecule sequencing technologies to accurately assemble and detect structural variants (SVs) in human genomes. By identifying homologous SV-containing reads from different technologies through a bipartite-graph-based clustering algorithm, our approach turns a whole genome assembly problem into a set of independent SV assembly problems, each of which can be effectively solved to enhance the assembly of structurally altered regions in human genomes. We used data generated from a haploid hydatidiform mole genome (CHM1) and a diploid human genome (NA12878) to test our approach. The result showed that, compared with existing methods, our approach had a low false discovery rate and substantially improved the detection of many types of SVs, particularly novel large insertions, small indels (10-50 bp), and short tandem repeat expansions and contractions. Our work highlights the strengths and limitations of current approaches and provides an effective solution for extending the power of existing sequencing technologies for SV discovery.

Subject(s)

Contig Mapping/methods , Human Genome , Genomic Structural Variation , Genomics/methods , DNA Sequence Analysis/methods , Software , Animals , Contig Mapping/standards , Diploidy , Genomics/standards , Haploidy , Humans , Mice , DNA Sequence Analysis/standards , Tandem Repeat Sequences
