Search | BVS (Virtual Health Library) Search Portal

1.

Characterizing the Major Structural Variant Alleles of the Human Genome.

Audano, Peter A; Sulovari, Arvis; Graves-Lindsay, Tina A; Cantsilieris, Stuart; Sorensen, Melanie; Welch, AnneMarie E; Dougherty, Max L; Nelson, Bradley J; Shah, Ankeeta; Dutcher, Susan K; Warren, Wesley C; Magrini, Vincent; McGrath, Sean D; Li, Yang I; Wilson, Richard K; Eichler, Evan E.

Cell ; 176(3): 663-675.e19, 2019 01 24.

Article in English | MEDLINE | ID: mdl-30661756

ABSTRACT

In order to provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for fifteen human genomes. We sequence resolved 99,604 insertions, deletions, and inversions including 2,238 (1.6 Mbp) that are shared among all discovery genomes with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms the most common SVs in unique euchromatin are now sequence resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity.

Subject(s)

Gene Frequency/genetics, Human Genome/genetics, Genomic Structural Variation/genetics, Alleles, Euchromatin/genetics, Genomics/methods, Humans, Minisatellite Repeats/genetics, DNA Sequence Analysis/methods

2.

Structural and genetic diversity in the secreted mucins MUC5AC and MUC5B.

Plender, Elizabeth G; Prodanov, Timofey; Hsieh, PingHsun; Nizamis, Evangelos; Harvey, William T; Sulovari, Arvis; Munson, Katherine M; Kaufman, Eli J; O'Neal, Wanda K; Valdmanis, Paul N; Marschall, Tobias; Bloom, Jesse D; Eichler, Evan E.

Am J Hum Genet ; 111(8): 1700-1716, 2024 Aug 08.

Article in English | MEDLINE | ID: mdl-38991590

ABSTRACT

The secreted mucins MUC5AC and MUC5B are large glycoproteins that play critical defensive roles in pathogen entrapment and mucociliary clearance. Their respective genes contain polymorphic and degenerate protein-coding variable number tandem repeats (VNTRs) that make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5,761-5,762 amino acids [aa]); however, seven haplotypes have expanded VNTRs (6,291-7,019 aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5,249-6,325 aa) with cysteine-rich domain and VNTR copy-number variation. We group MUC5AC alleles into three phylogenetic clades: H1 (46%, ~5,654 aa), H2 (33%, ~5,742 aa), and H3 (7%, ~6,325 aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium and Tajima's D analyses reveal that East Asians carry exceptionally large blocks with an excess of rare variation (p < 0.05) at MUC5AC. To validate this result, we use Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observe a signature of positive selection in H1 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium (p < 0.05), consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein-coding VNTRs for improved disease associations.

Subject(s)

Alleles, Genetic Variation, Haplotypes, Minisatellite Repeats, Mucin 5AC, Mucin 5B, Phylogeny, Humans, Mucin 5B/genetics, Animals, Mucin 5AC/genetics, Mucin 5AC/metabolism, Minisatellite Repeats/genetics, DNA Copy Number Variations, Primates/genetics
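
The abstract of record 2 reports a deviation from Hardy-Weinberg equilibrium. As a rough illustration of what such a test checks, here is a minimal sketch of the textbook chi-square goodness-of-fit test for one biallelic locus. It is not the authors' pipeline; the function name `hwe_chi_square` and the genotype counts in the usage note are hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for Hardy-Weinberg equilibrium at a biallelic locus.

    n_aa, n_ab, n_bb are observed genotype counts (hypothetical inputs).
    Returns (chi_square, expected_heterozygotes).
    """
    n = n_aa + n_ab + n_bb
    # Allele frequency of A estimated from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    # Genotype counts expected under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi_square = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )
    return chi_square, expected[1]


# Hypothetical counts with more heterozygotes than expected.
stat, expected_het = hwe_chi_square(10, 80, 10)
# With one degree of freedom, a statistic above 3.841 corresponds to p < 0.05.
print(stat > 3.841, 80 > expected_het)
```

An excess of observed heterozygotes over the expected count is the pattern the abstract describes as consistent with heterozygote advantage. For small samples or rare alleles, an exact test is usually preferred over this chi-square approximation.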

3.

A high-quality bonobo genome refines the analysis of hominid evolution.

Mao, Yafei; Catacchio, Claudia R; Hillier, LaDeana W; Porubsky, David; Li, Ruiyang; Sulovari, Arvis; Fernandes, Jason D; Montinaro, Francesco; Gordon, David S; Storer, Jessica M; Haukness, Marina; Fiddes, Ian T; Murali, Shwetha Canchi; Dishuck, Philip C; Hsieh, PingHsun; Harvey, William T; Audano, Peter A; Mercuri, Ludovica; Piccolo, Ilaria; Antonacci, Francesca; Munson, Katherine M; Lewis, Alexandra P; Baker, Carl; Underwood, Jason G; Hoekzema, Kendra; Huang, Tzu-Hsueh; Sorensen, Melanie; Walker, Jerilyn A; Hoffman, Jinna; Thibaud-Nissen, Françoise; Salama, Sofie R; Pang, Andy W C; Lee, Joyce; Hastie, Alex R; Paten, Benedict; Batzer, Mark A; Diekhans, Mark; Ventura, Mario; Eichler, Evan E.

Nature ; 594(7861): 77-81, 2021 06.

Article in English | MEDLINE | ID: mdl-33953399

ABSTRACT

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation [1,2]. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes [1,3-5] and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.

Subject(s)

Molecular Evolution, Genome/genetics, Genomics, Pan paniscus/genetics, Phylogeny, Animals, Eukaryotic Initiation Factor-4A/genetics, Female, Genes, Gorilla gorilla/genetics, Molecular Sequence Annotation/standards, Pan troglodytes/genetics, Pongo/genetics, Genomic Segmental Duplications, DNA Sequence Analysis

4.

Familial long-read sequencing increases yield of de novo mutations.

Noyes, Michelle D; Harvey, William T; Porubsky, David; Sulovari, Arvis; Li, Ruiyang; Rose, Nicholas R; Audano, Peter A; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Mantere, Tuomo; Graves-Lindsay, Tina A; Sanders, Ashley D; Goodwin, Sara; Kramer, Melissa; Mokrab, Younes; Zody, Michael C; Hoischen, Alexander; Korbel, Jan O; McCombie, W Richard; Eichler, Evan E.

Am J Hum Genet ; 109(4): 631-646, 2022 04 07.

Article in English | MEDLINE | ID: mdl-35290762

ABSTRACT

Studies of de novo mutation (DNM) have typically excluded some of the most repetitive and complex regions of the genome because these regions cannot be unambiguously mapped with short-read sequencing data. To better understand the genome-wide pattern of DNM, we generated long-read sequence data from an autism parent-child quad with an affected female where no pathogenic variant had been discovered in short-read Illumina sequence data. We deeply sequenced all four individuals by using three sequencing platforms (Illumina, Oxford Nanopore, and Pacific Biosciences) and three complementary technologies (Strand-seq, optical mapping, and 10X Genomics). Using long-read sequencing, we initially discovered and validated 171 DNMs across two children, a 20% increase in the number of de novo single-nucleotide variants (SNVs) and indels when compared to short-read callsets. The number of DNMs further increased by 5% when considering a more complete human reference (T2T-CHM13) because of the recovery of events in regions absent from GRCh38 (e.g., three DNMs in heterochromatic satellites). In total, we validated 195 de novo germline mutations and 23 potential post-zygotic mosaic mutations across both children; the overall true substitution rate based on this integrated callset is at least 1.41 × 10⁻⁸ substitutions per nucleotide per generation. We also identified six de novo insertions and deletions in tandem repeats, two of which represent structural variants. We demonstrate that long-read sequencing and assembly, especially when combined with a more complete reference genome, increases the number of DNMs by >25% compared to previous studies, providing a more complete catalog of DNM compared to short-read data alone.

Subject(s)

Genomics, High-Throughput Nucleotide Sequencing, Female, Humans, Mutation/genetics, Nucleotides, DNA Sequence Analysis, Software

5.

Targeted long-read sequencing identifies missing disease-causing variation.

Miller, Danny E; Sulovari, Arvis; Wang, Tianyun; Loucks, Hailey; Hoekzema, Kendra; Munson, Katherine M; Lewis, Alexandra P; Fuerte, Edith P Almanza; Paschal, Catherine R; Walsh, Tom; Thies, Jenny; Bennett, James T; Glass, Ian; Dipple, Katrina M; Patterson, Karynne; Bonkowski, Emily S; Nelson, Zoe; Squire, Audrey; Sikes, Megan; Beckman, Erika; Bennett, Robin L; Earl, Dawn; Lee, Winston; Allikmets, Rando; Perlman, Seth J; Chow, Penny; Hing, Anne V; Wenger, Tara L; Adam, Margaret P; Sun, Angela; Lam, Christina; Chang, Irene; Zou, Xue; Austin, Stephanie L; Huggins, Erin; Safi, Alexias; Iyengar, Apoorva K; Reddy, Timothy E; Majoros, William H; Allen, Andrew S; Crawford, Gregory E; Kishnani, Priya S; King, Mary-Claire; Cherry, Tim; Chong, Jessica X; Bamshad, Michael J; Nickerson, Deborah A; Mefford, Heather C; Doherty, Dan; Eichler, Evan E.

Am J Hum Genet ; 108(8): 1436-1449, 2021 08 05.

Article in English | MEDLINE | ID: mdl-34216551

ABSTRACT

Despite widespread clinical genetic testing, many individuals with suspected genetic conditions lack a precise diagnosis, limiting their opportunity to take advantage of state-of-the-art treatments. In some cases, testing reveals difficult-to-evaluate structural differences, candidate variants that do not fully explain the phenotype, single pathogenic variants in recessive disorders, or no variants in genes of interest. Thus, there is a need for better tools to identify a precise genetic diagnosis in individuals when conventional testing approaches have been exhausted. We performed targeted long-read sequencing (T-LRS) using adaptive sampling on the Oxford Nanopore platform on 40 individuals, 10 of whom lacked a complete molecular diagnosis. We computationally targeted up to 151 Mbp of sequence per individual and searched for pathogenic substitutions, structural variants, and methylation differences using a single data source. We detected all genomic aberrations (including single-nucleotide variants, copy number changes, repeat expansions, and methylation differences) identified by prior clinical testing. In 8/8 individuals with complex structural rearrangements, T-LRS enabled more precise resolution of the mutation, leading to changes in clinical management in one case. In ten individuals with suspected Mendelian conditions lacking a precise genetic diagnosis, T-LRS identified pathogenic or likely pathogenic variants in six and variants of uncertain significance in two others. T-LRS accurately identifies pathogenic structural variants, resolves complex rearrangements, and identifies Mendelian variants not detected by other technologies. T-LRS represents an efficient and cost-effective strategy to evaluate high-priority genes and regions or complex clinical testing results.

Subject(s)

Chromosome Aberrations, Cytogenetic Analysis/methods, Inborn Genetic Diseases/diagnosis, Inborn Genetic Diseases/genetics, Genetic Predisposition to Disease, Human Genome, Mutation, DNA Copy Number Variations, Female, Genetic Testing, High-Throughput Nucleotide Sequencing, Humans, Karyotyping, Male, DNA Sequence Analysis

6.

Characterizing nucleotide variation and expansion dynamics in human-specific variable number tandem repeats.

Course, Meredith M; Sulovari, Arvis; Gudsnuk, Kathryn; Eichler, Evan E; Valdmanis, Paul N.

Genome Res ; 31(8): 1313-1324, 2021 08.

Article in English | MEDLINE | ID: mdl-34244228

ABSTRACT

There are more than 55,000 variable number tandem repeats (VNTRs) in the human genome, notable for both their striking polymorphism and mutability. Despite their role in human evolution and genomic variation, they have yet to be studied collectively and in detail, partially owing to their large size, variability, and predominant location in noncoding regions. Here, we examine 467 VNTRs that are human-specific expansions, unique to one location in the genome, and not associated with retrotransposons. We leverage publicly available long-read genomes, including from the Human Genome Structural Variant Consortium, to ascertain the exact nucleotide composition of these VNTRs and compare their composition of alleles. We then confirm repeat unit composition in more than 3000 short-read samples from the 1000 Genomes Project. Our analysis reveals that these VNTRs contain highly structured repeat motif organization, modified by frequent deletion and duplication events. Although overall VNTR compositions tend to remain similar between 1000 Genomes Project superpopulations, we describe a notable exception with substantial differences in repeat composition (in PCBP3), as well as several VNTRs that are significantly different in length between superpopulations (in ART1, PROP1, DYNC2I1, and LOC102723906). We also observe that most of these VNTRs are expanded in archaic human genomes, yet remain stable in length between single generations. Collectively, our findings indicate that repeat motif variability, repeat composition, and repeat length are all informative modalities to consider when characterizing VNTRs and their contribution to genomic variation.

Subject(s)

Minisatellite Repeats, Nucleotides, Human Genome, Genomic Structural Variation, Humans, Minisatellite Repeats/genetics, Genetic Polymorphism

7.

Quantitative assessment reveals the dominance of duplicated sequences in germline-derived extrachromosomal circular DNA.

Mouakkad-Montoya, Lila; Murata, Michael M; Sulovari, Arvis; Suzuki, Ryusuke; Osia, Beth; Malkova, Anna; Katsumata, Makoto; Giuliano, Armando E; Eichler, Evan E; Tanaka, Hisashi.

Proc Natl Acad Sci U S A ; 118(47)2021 11 23.

Article in English | MEDLINE | ID: mdl-34789574

ABSTRACT

Extrachromosomal circular DNA (eccDNA) originates from linear chromosomal DNA in various human tissues under physiological and disease conditions. The genomic origins of eccDNA have largely been investigated using in vitro-amplified DNA. However, in vitro amplification obscures quantitative information by skewing the total population stoichiometry. In addition, the analyses have focused on eccDNA stemming from single-copy genomic regions, leaving eccDNA from multicopy regions unexamined. To address these issues, we isolated eccDNA without in vitro amplification (naïve small circular DNA, nscDNA) and assessed the populations quantitatively by integrated genomic, molecular, and cytogenetic approaches. nscDNA of up to tens of kilobases were successfully enriched by our approach and were predominantly derived from multicopy genomic regions including segmental duplications (SDs). SDs, which account for 5% of the human genome and are hotspots for copy number variations, were significantly overrepresented in sperm nscDNA, with three times more sequencing reads derived from SDs than from the entire single-copy regions. SDs were also overrepresented in mouse sperm nscDNA, which we estimated to comprise 0.2% of nuclear DNA. Considering that eccDNA can be integrated into chromosomes, germline-derived nscDNA may be a mediator of genome diversity.

Subject(s)

Circular DNA, Germ Cells, Animals, Chromosomes, DNA, DNA Copy Number Variations, Human Genome, HeLa Cells, Humans, Male, Mice, Inbred C57BL Mice, Genomic Segmental Duplications, Spermatozoa

8.

Evolution of a Human-Specific Tandem Repeat Associated with ALS.

Course, Meredith M; Gudsnuk, Kathryn; Smukowski, Samuel N; Winston, Kosuke; Desai, Nitin; Ross, Jay P; Sulovari, Arvis; Bourassa, Cynthia V; Spiegelman, Dan; Couthouis, Julien; Yu, Chang-En; Tsuang, Debby W; Jayadev, Suman; Kay, Mark A; Gitler, Aaron D; Dupre, Nicolas; Eichler, Evan E; Dion, Patrick A; Rouleau, Guy A; Valdmanis, Paul N.

Am J Hum Genet ; 107(3): 445-460, 2020 09 03.

Article in English | MEDLINE | ID: mdl-32750315

ABSTRACT

Tandem repeats are proposed to contribute to human-specific traits, and more than 40 tandem repeat expansions are known to cause neurological disease. Here, we characterize a human-specific 69 bp variable number tandem repeat (VNTR) in the last intron of WDR7, which exhibits striking variability in both copy number and nucleotide composition, as revealed by long-read sequencing. In addition, greater repeat copy number is significantly enriched in three independent cohorts of individuals with sporadic amyotrophic lateral sclerosis (ALS). Each unit of the repeat forms a stem-loop structure with the potential to produce microRNAs, and the repeat RNA can aggregate when expressed in cells. We leveraged its remarkable sequence variability to align the repeat in 288 samples and uncover its mechanism of expansion. We found that the repeat expands in the 3'-5' direction, in groups of repeat units divisible by two. The expansion patterns we observed were consistent with duplication events, and a replication error called template switching. We also observed that the VNTR is expanded in both Denisovan and Neanderthal genomes but is fixed at one copy or fewer in non-human primates. Evaluating the repeat in 1000 Genomes Project samples reveals that some repeat segments are solely present or absent in certain geographic populations. The large size of the repeat unit in this VNTR, along with our multiplexed sequencing strategy, provides an unprecedented opportunity to study mechanisms of repeat expansion, and a framework for evaluating the roles of VNTRs in human evolution and disease.

Subject(s)

Signal Transducing Adaptor Proteins/genetics, Amyotrophic Lateral Sclerosis/genetics, Molecular Evolution, Tandem Repeat Sequences/genetics, Aged, Alzheimer Disease/genetics, Alzheimer Disease/pathology, Amyotrophic Lateral Sclerosis/pathology, DNA Repeat Expansion/genetics, Female, Gene Expression Regulation/genetics, Humans, Male, Minisatellite Repeats/genetics, Phenotype, Species Specificity

9.

Single-cell strand sequencing of a macaque genome reveals multiple nested inversions and breakpoint reuse during primate evolution.

Maggiolini, Flavia Angela Maria; Sanders, Ashley D; Shew, Colin James; Sulovari, Arvis; Mao, Yafei; Puig, Marta; Catacchio, Claudia Rita; Dellino, Maria; Palmisano, Donato; Mercuri, Ludovica; Bitonto, Miriana; Porubský, David; Cáceres, Mario; Eichler, Evan E; Ventura, Mario; Dennis, Megan Y; Korbel, Jan O; Antonacci, Francesca.

Genome Res ; 30(11): 1680-1693, 2020 11.

Article in English | MEDLINE | ID: mdl-33093070

ABSTRACT

Rhesus macaque is an Old World monkey that shared a common ancestor with human ~25 Myr ago and is an important animal model for human disease studies. A deep understanding of its genetics is therefore required for both biomedical and evolutionary studies. Among structural variants, inversions represent a driving force in speciation and play an important role in disease predisposition. Here we generated a genome-wide map of inversions between human and macaque, combining single-cell strand sequencing with cytogenetics. We identified 375 total inversions between 859 bp and 92 Mbp, increasing by eightfold the number of previously reported inversions. Among these, 19 inversions flanked by segmental duplications overlap with recurrent copy number variants associated with neurocognitive disorders. Evolutionary analyses show that in 17 out of 19 cases, the Hominidae orientation of these disease-associated regions is always derived. This suggests that duplicated sequences likely played a fundamental role in generating inversions in humans and great apes, creating architectures that nowadays predispose these regions to disease-associated genetic instability. Finally, we identified 861 genes mapping at 156 inversion breakpoints, with some showing evidence of differential expression in human and macaque cell lines, thus highlighting candidates that might have contributed to the evolution of species-specific features. This study depicts the most accurate fine-scale map of inversions between human and macaque using a two-pronged integrative approach, combining single-cell strand sequencing and cytogenetics, and represents a valuable resource toward understanding the biology and evolution of primate species.

Subject(s)

Chromosome Breakpoints, Chromosome Inversion, Molecular Evolution, Macaca mulatta/genetics, Animals, Disease/genetics, Gene Expression Regulation, Genome, Genomics, Heterozygote, Humans, Fluorescence In Situ Hybridization, Genetic Recombination, DNA Sequence Analysis, Single-Cell Analysis

10.

A virome-wide clonal integration analysis platform for discovering cancer viral etiology.

Chen, Xun; Kost, Jason; Sulovari, Arvis; Wong, Nathalie; Liang, Winnie S; Cao, Jian; Li, Dawei.

Genome Res ; 29(5): 819-830, 2019 05.

Article in English | MEDLINE | ID: mdl-30872350

ABSTRACT

Oncoviral infection is responsible for 12%-15% of cancer in humans. Convergent evidence from epidemiology, pathology, and oncology suggests that new viral etiologies for cancers remain to be discovered. Oncoviral profiles can be obtained from cancer genome sequencing data; however, widespread viral sequence contamination and noncausal viruses complicate the process of identifying genuine oncoviruses. Here, we propose a novel strategy to address these challenges by performing virome-wide screening of early-stage clonal viral integrations. To implement this strategy, we developed VIcaller, a novel platform for identifying viral integrations that are derived from any characterized viruses and shared by a large proportion of tumor cells using whole-genome sequencing (WGS) data. The sensitivity and precision were confirmed with simulated and benchmark cancer data sets. By applying this platform to cancer WGS data sets with proven or speculated viral etiology, we newly identified or confirmed clonal integrations of hepatitis B virus (HBV), human papillomavirus (HPV), Epstein-Barr virus (EBV), and BK Virus (BKV), suggesting the involvement of these viruses in early stages of tumorigenesis in affected tumors, such as HBV in TERT and KMT2B (also known as MLL4) gene loci in liver cancer, HPV and BKV in bladder cancer, and EBV in non-Hodgkin's lymphoma. We also showed the capacity of VIcaller to identify integrations from some uncharacterized viruses. This is the first study to systematically investigate the strategy and method of virome-wide screening of clonal integrations to identify oncoviruses. Searching clonal viral integrations with our platform has the capacity to identify virus-caused cancers and discover cancer viral etiologies.

Subject(s)

Neoplasms/virology, Virus Integration/genetics, Whole Genome Sequencing, BK Virus/genetics, BK Virus/pathogenicity, Carcinogenesis/genetics, Neoplastic Cell Transformation, Viral DNA, DNA-Binding Proteins/genetics, Hepatitis B Virus/genetics, Hepatitis B Virus/pathogenicity, Human Herpesvirus 4/genetics, Human Herpesvirus 4/pathogenicity, Histone-Lysine N-Methyltransferase, Humans, Liver Neoplasms/genetics, Liver Neoplasms/virology, Non-Hodgkin Lymphoma/genetics, Non-Hodgkin Lymphoma/virology, Neoplasms/genetics, Papillomaviridae/genetics, Papillomaviridae/pathogenicity, Software, Urinary Bladder Neoplasms/genetics, Urinary Bladder Neoplasms/virology

11.

Human-specific tandem repeat expansion and differential gene expression during primate evolution.

Sulovari, Arvis; Li, Ruiyang; Audano, Peter A; Porubsky, David; Vollger, Mitchell R; Logsdon, Glennis A; Warren, Wesley C; Pollen, Alex A; Chaisson, Mark J P; Eichler, Evan E.

Proc Natl Acad Sci U S A ; 116(46): 23243-23253, 2019 11 12.

Article in English | MEDLINE | ID: mdl-31659027

ABSTRACT

Short tandem repeats (STRs) and variable number tandem repeats (VNTRs) are important sources of natural and disease-causing variation, yet they have been problematic to resolve in reference genomes and genotype with short-read technology. We created a framework to model the evolution and instability of STRs and VNTRs in apes. We phased and assembled 3 ape genomes (chimpanzee, gorilla, and orangutan) using long-read and 10x Genomics linked-read sequence data for 21,442 human tandem repeats discovered in 6 haplotype-resolved assemblies of Yoruban, Chinese, and Puerto Rican origin. We define a set of 1,584 STRs/VNTRs expanded specifically in humans, including large tandem repeats affecting coding and noncoding portions of genes (e.g., MUC3A, CACNA1C). We show that short interspersed nuclear element-VNTR-Alu (SVA) retrotransposition is the main mechanism for distributing GC-rich human-specific tandem repeat expansions throughout the genome but with a bias against genes. In contrast, we observe that VNTRs not originating from retrotransposons have a propensity to cluster near genes, especially in the subtelomere. Using tissue-specific expression from human and chimpanzee brains, we identify genes where transcript isoform usage differs significantly, likely caused by cryptic splicing variation within VNTRs. Using single-cell expression from cerebral organoids, we observe a strong effect for genes associated with transcription profiles analogous to intermediate progenitor cells. Finally, we compare the sequence composition of some of the largest human-specific repeat expansions and identify 52 STRs/VNTRs with at least 40 uninterrupted pure tracts as candidates for genetically unstable regions associated with disease.

Subject(s)

Molecular Evolution, Human Genome, Primates/genetics, Tandem Repeat Sequences, Animals, Disease/genetics, Genomic Structural Variation, Humans, RNA Splicing

12.

VIpower: Simulation-based tool for estimating power of viral integration detection via high-throughput sequencing.

Sulovari, Arvis; Li, Dawei.

Genomics ; 112(1): 207-211, 2020 01.

Article in English | MEDLINE | ID: mdl-30710609

ABSTRACT

Viral sequence integrations in the human genome have been implicated in various human diseases. Viral integrations remain among the most challenging-to-detect structural changes of the human genome. No studies have systematically analyzed how molecular and bioinformatics factors affect the power (sensitivity) to detect viral integrations using high-throughput sequencing (HTS). We selected a wide range of molecular and bioinformatics factors covering genome sequence characteristics, HTS features, and viral integration detection. We designed a fast simulation-based framework to model the process of detecting variable viral integration events in the human genome. We then examined the associations of selected factors with viral integration detection power. We identified six factors that significantly affected viral integration detection power (P < 2 × 10⁻¹⁶). The strongest factors associated with detection power included proportion of sample cells with clonal viral integrations (Pearson's ρ = 0.64), sequencing depth (ρ = 0.37), length of viral integration (ρ = 0.37), paired-end read insert size (ρ = 0.23), user-defined threshold (number of supporting reads) to claim successful identification of integrations (ρ = -0.19), and read length (when sequence volume was fixed) (ρ = -0.09). As the first tool of its kind, VIpower incorporates all these factors, which can be manipulated in concert with each other to optimize the detection power. This tool may be used to estimate viral integration detection power for various combinations of sequencing or analytic parameters. It may also be used to estimate the parameters required to achieve a specific power when designing new sequencing experiments.

Subject(s)

High-Throughput Nucleotide Sequencing/methods, Software, Virus Integration, Human Genome, Humans

13.

Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.

Vollger, Mitchell R; Logsdon, Glennis A; Audano, Peter A; Sulovari, Arvis; Porubsky, David; Peluso, Paul; Wenger, Aaron M; Concepcion, Gregory T; Kronenberg, Zev N; Munson, Katherine M; Baker, Carl; Sanders, Ashley D; Spierings, Diana C J; Lansdorp, Peter M; Surti, Urvashi; Hunkapiller, Michael W; Eichler, Evan E.

Ann Hum Genet ; 84(2): 125-140, 2020 03.

Article in English | MEDLINE | ID: mdl-31711268

ABSTRACT

The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes.

Subject(s)

Biomarkers/analysis, Genetic Variation, Human Genome, Haploidy, Hydatidiform Mole/genetics, DNA Sequence Analysis/methods, Single-Cell Analysis/methods, Female, High-Throughput Nucleotide Sequencing, Humans, Molecular Sequence Annotation, Pregnancy
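
The abstract of record 13 compares assemblies by NG50. For readers unfamiliar with the statistic, the following is a minimal sketch of the usual definition: the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of an assumed genome (or region) size. The function name `ng50` and the contig lengths in the usage lines are hypothetical; only the two NG50 values used for the fold change come from the abstract.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    contig_lengths: iterable of contig lengths (hypothetical inputs).
    genome_size: assumed size of the genome or region being assembled.
    Returns 0 if the contigs cover less than half of genome_size.
    """
    half = genome_size / 2
    running_total = 0
    # Walk contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Hypothetical toy assembly: 50 + 40 + 30 first reaches half of 200.
print(ng50([10, 20, 30, 40, 50], genome_size=200))

# Fold change between the NG50 values quoted in the abstract (in kbp).
print(round(480.6 / 191.5, 1))
```

Unlike N50, which uses the total assembly length as its denominator, NG50 uses a fixed genome size, so two assemblies of the same genome can be compared on the same scale.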

14.

Atlas of human diseases influenced by genetic variants with extreme allele frequency differences.

Sulovari, Arvis; Chen, Yolanda H; Hudziak, James J; Li, Dawei.

Hum Genet ; 136(1): 39-54, 2017 01.

Article in English | MEDLINE | ID: mdl-27699474

ABSTRACT

Genetic variants with extreme allele frequency differences (EAFD) may underlie some human health disparities across populations. To identify EAFD loci, we systematically analyzed and characterized 81 million genomic variants from 2504 unrelated individuals of 26 world populations (phase III of the 1000 Genomes Project). Our analyses revealed a total of 434 genes, 15 pathways, and 18 diseases and traits influenced by EAFD variants from five continental populations. They included known EAFD genes, such as LCT (lactose tolerance), SLC24A5 (skin pigmentation), and EDAR (hair morphology). We found many novel EAFD genes, including TBC1D2B (autophagy mediator), TRIM40 (gastrointestinal inflammatory regulator), KRT71, KRT75, KRT83, and KRTAP10-1 (hair and epithelial keratin synthesis), PIK3R3 (insulin receptor interaction), DARS (neurological disorders), and NACA2 (skin inflammatory response). Our results also showed four complex diseases significantly associated with EAFD loci, including asthma (adjusted enrichment P = 4 × 10⁻⁸), type I diabetes (P = 6 × 10⁻⁹), alcohol consumption (P = 0.0002), and attention deficit/hyperactivity disorder (P = 0.003). This study provides a comprehensive atlas of genes, pathways, and human diseases significantly influenced by EAFD variants.

Subject(s)

Gene Frequency , Genetic Predisposition to Disease , Genetic Variation , Asian People/genetics , Black People/genetics , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Theoretical Models , Phenotype , Single Nucleotide Polymorphism , White People/genetics

15.

Eye color: A potential indicator of alcohol dependence risk in European Americans.

Sulovari, Arvis; Kranzler, Henry R; Farrer, Lindsay A; Gelernter, Joel; Li, Dawei.

Am J Med Genet B Neuropsychiatr Genet ; 168B(5): 347-53, 2015 Jul.

Article in English | MEDLINE | ID: mdl-25921801

ABSTRACT

In archival samples of European-ancestry subjects, light-eyed individuals have been found to consume more alcohol than dark-eyed individuals. No published population-based studies have directly tested the association between alcohol dependence (AD) and eye color. We hypothesized that light-eyed individuals have a higher prevalence of AD than dark-eyed individuals. A mixture model was used to select a homogeneous sample of 1,263 European-Americans and control for population stratification. After quality control, we conducted an association study using logistic regression, adjusting for confounders (age, sex, and genetic ancestry). We found evidence of association between AD and blue eye color (P = 0.0005 and odds ratio = 1.83 (1.31-2.57)), supporting light eye color as a risk factor relative to brown eye color. Network-based analyses revealed a statistically significant (P = 0.02) number of genetic interactions between eye color genes and AD-associated genes. We found evidence of linkage disequilibrium between an AD-associated GABA receptor gene cluster, GABRB3/GABRG3, and eye color genes, OCA2/HERC2, as well as between AD-associated GRM5 and pigmentation-associated TYR. Our population-phenotype, network, and linkage disequilibrium analyses support association between blue eye color and AD. Although we controlled for stratification we cannot exclude underlying occult stratification as a contributor to this observation. Although replication is needed, our findings suggest that eye pigmentation information may be useful in research on AD. Further characterization of this association may unravel new AD etiological factors. © 2015 Wiley Periodicals, Inc.

Subject(s)

Alcoholism/diagnosis , Alcoholism/genetics , Eye Color , Linkage Disequilibrium/genetics , Single Nucleotide Polymorphism/genetics , Eye Color/physiology , Female , Genotype , Humans , Male , Risk , White People/genetics

16.

GACT: a Genome build and Allele definition Conversion Tool for SNP imputation and meta-analysis in genetic association studies.

Sulovari, Arvis; Li, Dawei.

BMC Genomics ; 15: 610, 2014 Jul 19.

Article in English | MEDLINE | ID: mdl-25038819

ABSTRACT

BACKGROUND: Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires the same allele definition (nomenclature) and genome build among individual studies. Similarly, imputation, commonly used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in use of different genome builds and allele definitions. Incorrect assumptions of identical allele definition among combined GWAS lead to a large portion of discarded genotypes or incorrect association findings. There is no published tool that predicts and converts among all major allele definitions.
RESULTS: In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease of imputation quality (even significantly higher when compared to the imputation with singletons in the reference), especially for rare SNPs.
CONCLUSION: GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP-arrays or deep-sequencing, particularly for data from the dbGaP and other public databases. GACT SOFTWARE: http://www.uvm.edu/genomics/software/gact.

Subject(s)

Alleles , Genomics , Single Nucleotide Polymorphism , Software , Datasets as Topic , Gene Frequency , Genetic Association Studies , Genome , Genomics/methods , Genotype , Humans

17.

Structural and genetic diversity in the secreted mucins, MUC5AC and MUC5B.

Plender, Elizabeth G; Prodanov, Timofey; Hsieh, PingHsun; Nizamis, Evangelos; Harvey, William T; Sulovari, Arvis; Munson, Katherine M; Kaufman, Eli J; O'Neal, Wanda K; Valdmanis, Paul N; Marschall, Tobias; Bloom, Jesse D; Eichler, Evan E.

bioRxiv ; 2024 Mar 20.

Article in English | MEDLINE | ID: mdl-38562829

ABSTRACT

The secreted mucins MUC5AC and MUC5B play critical defensive roles in airway pathogen entrapment and mucociliary clearance by encoding large glycoproteins with variable number tandem repeats (VNTRs). These polymorphic and degenerate protein coding VNTRs make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5761-5762aa); however, seven haplotypes have expanded VNTRs (6291-7019aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5249-6325aa) with cysteine-rich domain and VNTR copy number variation. We grouped MUC5AC alleles into three phylogenetic clades: H1 (46%, ~5654aa), H2 (33%, ~5742aa), and H3 (7%, ~6325aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium (LD) and Tajima's D analyses reveal that East Asians carry exceptionally large MUC5AC LD blocks with an excess of rare variation (p<0.05). To validate this result, we used Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observed signatures of positive selection in H1 and H2 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Africans and Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium, consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein coding VNTRs for improved disease associations.

18.

Complex genetic variation in nearly complete human genomes.

Logsdon, Glennis A; Ebert, Peter; Audano, Peter A; Loftus, Mark; Porubsky, David; Ebler, Jana; Yilmaz, Feyza; Hallast, Pille; Prodanov, Timofey; Yoo, DongAhn; Paisie, Carolyn A; Harvey, William T; Zhao, Xuefang; Martino, Gianni V; Henglin, Mir; Munson, Katherine M; Rabbani, Keon; Chin, Chen-Shan; Gu, Bida; Ashraf, Hufsah; Austine-Orimoloye, Olanrewaju; Balachandran, Parithi; Bonder, Marc Jan; Cheng, Haoyu; Chong, Zechen; Crabtree, Jonathan; Gerstein, Mark; Guethlein, Lisbeth A; Hasenfeld, Patrick; Hickey, Glenn; Hoekzema, Kendra; Hunt, Sarah E; Jensen, Matthew; Jiang, Yunzhe; Koren, Sergey; Kwon, Youngjun; Li, Chong; Li, Heng; Li, Jiaqi; Norman, Paul J; Oshima, Keisuke K; Paten, Benedict; Phillippy, Adam M; Pollock, Nicholas R; Rausch, Tobias; Rautiainen, Mikko; Scholz, Stephan; Song, Yuwei; Söylev, Arda; Sulovari, Arvis.

bioRxiv ; 2024 Sep 25.

Article in English | MEDLINE | ID: mdl-39372794

ABSTRACT

Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here, we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (130 Mbp median continuity), closing 92% of all previous assembly gaps [1,2] and reaching telomere-to-telomere (T2T) status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8, and AMY1/AMY2, and fully resolve 1,852 complex structural variants (SVs). In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite high-order repeat (HOR) array length and characterize the pattern of mobile element insertions into α-satellite HOR arrays. While most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference [1] significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference [3] to a median quality value (QV) of 45. Using this approach, 26,115 SVs per sample are detected, substantially increasing the number of SVs now amenable to downstream disease association studies.

19.

Advances in the discovery and analyses of human tandem repeats.

Chaisson, Mark J P; Sulovari, Arvis; Valdmanis, Paul N; Miller, Danny E; Eichler, Evan E.

Emerg Top Life Sci ; 7(3): 361-381, 2023 Dec 14.

Article in English | MEDLINE | ID: mdl-37905568

ABSTRACT

Long-read sequencing platforms provide unparalleled access to the structure and composition of all classes of tandemly repeated DNA from STRs to satellite arrays. This review summarizes our current understanding of their organization within the human genome, their importance with respect to disease, as well as the advances and challenges in understanding their genetic diversity and functional effects. Novel computational methods are being developed to visualize and associate these complex patterns of human variation with disease, expression, and epigenetic differences. We predict accurate characterization of this repeat-rich form of human variation will become increasingly relevant to both basic and clinical human genetics.

Subject(s)

DNA , Tandem Repeat Sequences , Humans , Tandem Repeat Sequences/genetics , Genetic Epigenesis

20.

Segmental duplications and their variation in a complete human genome.

Vollger, Mitchell R; Guitart, Xavi; Dishuck, Philip C; Mercuri, Ludovica; Harvey, William T; Gershman, Ariel; Diekhans, Mark; Sulovari, Arvis; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Porubsky, David; Li, Ruiyang; Nurk, Sergey; Koren, Sergey; Miga, Karen H; Phillippy, Adam M; Timp, Winston; Ventura, Mario; Eichler, Evan E.

Science ; 376(6588): eabj6965, 2022 04.

Article in English | MEDLINE | ID: mdl-35357917

ABSTRACT

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.

Subject(s)

DNA Copy Number Variations , Gene Duplication , Human Genome , Genomic Segmental Duplications , Molecular Evolution , GTPase-Activating Proteins/genetics , Humans , Single Nucleotide Polymorphism , Proto-Oncogene Proteins/genetics

