Search | BVS Integralidade em Saúde

1.

Characterizing the Major Structural Variant Alleles of the Human Genome.

Audano, Peter A; Sulovari, Arvis; Graves-Lindsay, Tina A; Cantsilieris, Stuart; Sorensen, Melanie; Welch, AnneMarie E; Dougherty, Max L; Nelson, Bradley J; Shah, Ankeeta; Dutcher, Susan K; Warren, Wesley C; Magrini, Vincent; McGrath, Sean D; Li, Yang I; Wilson, Richard K; Eichler, Evan E.

Cell ; 176(3): 663-675.e19, 2019 01 24.

Article in English | MEDLINE | ID: mdl-30661756

ABSTRACT

In order to provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for fifteen human genomes. We sequence resolved 99,604 insertions, deletions, and inversions including 2,238 (1.6 Mbp) that are shared among all discovery genomes with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms the most common SVs in unique euchromatin are now sequence resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity.

Subjects

Gene Frequency/genetics, Human Genome/genetics, Genomic Structural Variation/genetics, Alleles, Euchromatin/genetics, Genomics/methods, Humans, Minisatellite Repeats/genetics, DNA Sequence Analysis/methods

2.

Structural and genetic diversity in the secreted mucins MUC5AC and MUC5B.

Plender, Elizabeth G; Prodanov, Timofey; Hsieh, PingHsun; Nizamis, Evangelos; Harvey, William T; Sulovari, Arvis; Munson, Katherine M; Kaufman, Eli J; O'Neal, Wanda K; Valdmanis, Paul N; Marschall, Tobias; Bloom, Jesse D; Eichler, Evan E.

Am J Hum Genet ; 111(8): 1700-1716, 2024 Aug 08.

Article in English | MEDLINE | ID: mdl-38991590

ABSTRACT

The secreted mucins MUC5AC and MUC5B are large glycoproteins that play critical defensive roles in pathogen entrapment and mucociliary clearance. Their respective genes contain polymorphic and degenerate protein-coding variable number tandem repeats (VNTRs) that make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5,761-5,762 amino acids [aa]); however, seven haplotypes have expanded VNTRs (6,291-7,019 aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5,249-6,325 aa) with cysteine-rich domain and VNTR copy-number variation. We group MUC5AC alleles into three phylogenetic clades: H1 (46%, ~5,654 aa), H2 (33%, ~5,742 aa), and H3 (7%, ~6,325 aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium and Tajima's D analyses reveal that East Asians carry exceptionally large blocks with an excess of rare variation (p < 0.05) at MUC5AC. To validate this result, we use Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observe a signature of positive selection in H1 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium (p < 0.05), consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein-coding VNTRs for improved disease associations.

Subjects

Alleles, Genetic Variation, Haplotypes, Minisatellite Repeats, Mucin-5AC, Mucin-5B, Phylogeny, Humans, Mucin-5B/genetics, Animals, Mucin-5AC/genetics, Mucin-5AC/metabolism, Minisatellite Repeats/genetics, DNA Copy Number Variations, Primates/genetics

3.

A high-quality bonobo genome refines the analysis of hominid evolution.

Mao, Yafei; Catacchio, Claudia R; Hillier, LaDeana W; Porubsky, David; Li, Ruiyang; Sulovari, Arvis; Fernandes, Jason D; Montinaro, Francesco; Gordon, David S; Storer, Jessica M; Haukness, Marina; Fiddes, Ian T; Murali, Shwetha Canchi; Dishuck, Philip C; Hsieh, PingHsun; Harvey, William T; Audano, Peter A; Mercuri, Ludovica; Piccolo, Ilaria; Antonacci, Francesca; Munson, Katherine M; Lewis, Alexandra P; Baker, Carl; Underwood, Jason G; Hoekzema, Kendra; Huang, Tzu-Hsueh; Sorensen, Melanie; Walker, Jerilyn A; Hoffman, Jinna; Thibaud-Nissen, Françoise; Salama, Sofie R; Pang, Andy W C; Lee, Joyce; Hastie, Alex R; Paten, Benedict; Batzer, Mark A; Diekhans, Mark; Ventura, Mario; Eichler, Evan E.

Nature ; 594(7861): 77-81, 2021 06.

Article in English | MEDLINE | ID: mdl-33953399

ABSTRACT

The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation [1,2]. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes [1,3-5] and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.

Subjects

Molecular Evolution, Genome/genetics, Genomics, Pan paniscus/genetics, Phylogeny, Animals, Eukaryotic Initiation Factor-4A/genetics, Female, Genes, Gorilla gorilla/genetics, Molecular Sequence Annotation/standards, Pan troglodytes/genetics, Pongo/genetics, Genomic Segmental Duplications, DNA Sequence Analysis

4.

Familial long-read sequencing increases yield of de novo mutations.

Noyes, Michelle D; Harvey, William T; Porubsky, David; Sulovari, Arvis; Li, Ruiyang; Rose, Nicholas R; Audano, Peter A; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Mantere, Tuomo; Graves-Lindsay, Tina A; Sanders, Ashley D; Goodwin, Sara; Kramer, Melissa; Mokrab, Younes; Zody, Michael C; Hoischen, Alexander; Korbel, Jan O; McCombie, W Richard; Eichler, Evan E.

Am J Hum Genet ; 109(4): 631-646, 2022 04 07.

Article in English | MEDLINE | ID: mdl-35290762

ABSTRACT

Studies of de novo mutation (DNM) have typically excluded some of the most repetitive and complex regions of the genome because these regions cannot be unambiguously mapped with short-read sequencing data. To better understand the genome-wide pattern of DNM, we generated long-read sequence data from an autism parent-child quad with an affected female where no pathogenic variant had been discovered in short-read Illumina sequence data. We deeply sequenced all four individuals by using three sequencing platforms (Illumina, Oxford Nanopore, and Pacific Biosciences) and three complementary technologies (Strand-seq, optical mapping, and 10X Genomics). Using long-read sequencing, we initially discovered and validated 171 DNMs across two children, a 20% increase in the number of de novo single-nucleotide variants (SNVs) and indels when compared to short-read callsets. The number of DNMs further increased by 5% when considering a more complete human reference (T2T-CHM13) because of the recovery of events in regions absent from GRCh38 (e.g., three DNMs in heterochromatic satellites). In total, we validated 195 de novo germline mutations and 23 potential post-zygotic mosaic mutations across both children; the overall true substitution rate based on this integrated callset is at least 1.41 × 10⁻⁸ substitutions per nucleotide per generation. We also identified six de novo insertions and deletions in tandem repeats, two of which represent structural variants. We demonstrate that long-read sequencing and assembly, especially when combined with a more complete reference genome, increases the number of DNMs by >25% compared to previous studies, providing a more complete catalog of DNM compared to short-read data alone.

Subjects

Genomics, High-Throughput Nucleotide Sequencing, Female, Humans, Mutation/genetics, Nucleotides, DNA Sequence Analysis, Software

5.

Targeted long-read sequencing identifies missing disease-causing variation.

Miller, Danny E; Sulovari, Arvis; Wang, Tianyun; Loucks, Hailey; Hoekzema, Kendra; Munson, Katherine M; Lewis, Alexandra P; Fuerte, Edith P Almanza; Paschal, Catherine R; Walsh, Tom; Thies, Jenny; Bennett, James T; Glass, Ian; Dipple, Katrina M; Patterson, Karynne; Bonkowski, Emily S; Nelson, Zoe; Squire, Audrey; Sikes, Megan; Beckman, Erika; Bennett, Robin L; Earl, Dawn; Lee, Winston; Allikmets, Rando; Perlman, Seth J; Chow, Penny; Hing, Anne V; Wenger, Tara L; Adam, Margaret P; Sun, Angela; Lam, Christina; Chang, Irene; Zou, Xue; Austin, Stephanie L; Huggins, Erin; Safi, Alexias; Iyengar, Apoorva K; Reddy, Timothy E; Majoros, William H; Allen, Andrew S; Crawford, Gregory E; Kishnani, Priya S; King, Mary-Claire; Cherry, Tim; Chong, Jessica X; Bamshad, Michael J; Nickerson, Deborah A; Mefford, Heather C; Doherty, Dan; Eichler, Evan E.

Am J Hum Genet ; 108(8): 1436-1449, 2021 08 05.

Article in English | MEDLINE | ID: mdl-34216551

ABSTRACT

Despite widespread clinical genetic testing, many individuals with suspected genetic conditions lack a precise diagnosis, limiting their opportunity to take advantage of state-of-the-art treatments. In some cases, testing reveals difficult-to-evaluate structural differences, candidate variants that do not fully explain the phenotype, single pathogenic variants in recessive disorders, or no variants in genes of interest. Thus, there is a need for better tools to identify a precise genetic diagnosis in individuals when conventional testing approaches have been exhausted. We performed targeted long-read sequencing (T-LRS) using adaptive sampling on the Oxford Nanopore platform on 40 individuals, 10 of whom lacked a complete molecular diagnosis. We computationally targeted up to 151 Mbp of sequence per individual and searched for pathogenic substitutions, structural variants, and methylation differences using a single data source. We detected all genomic aberrations (including single-nucleotide variants, copy number changes, repeat expansions, and methylation differences) identified by prior clinical testing. In 8/8 individuals with complex structural rearrangements, T-LRS enabled more precise resolution of the mutation, leading to changes in clinical management in one case. In ten individuals with suspected Mendelian conditions lacking a precise genetic diagnosis, T-LRS identified pathogenic or likely pathogenic variants in six and variants of uncertain significance in two others. T-LRS accurately identifies pathogenic structural variants, resolves complex rearrangements, and identifies Mendelian variants not detected by other technologies. T-LRS represents an efficient and cost-effective strategy to evaluate high-priority genes and regions or complex clinical testing results.

Subjects

Chromosome Aberrations, Cytogenetic Analysis/methods, Inborn Genetic Diseases/diagnosis, Inborn Genetic Diseases/genetics, Genetic Predisposition to Disease, Human Genome, Mutation, DNA Copy Number Variations, Female, Genetic Testing, High-Throughput Nucleotide Sequencing, Humans, Karyotyping, Male, DNA Sequence Analysis

6.

Characterizing nucleotide variation and expansion dynamics in human-specific variable number tandem repeats.

Course, Meredith M; Sulovari, Arvis; Gudsnuk, Kathryn; Eichler, Evan E; Valdmanis, Paul N.

Genome Res ; 31(8): 1313-1324, 2021 08.

Article in English | MEDLINE | ID: mdl-34244228

ABSTRACT

There are more than 55,000 variable number tandem repeats (VNTRs) in the human genome, notable for both their striking polymorphism and mutability. Despite their role in human evolution and genomic variation, they have yet to be studied collectively and in detail, partially owing to their large size, variability, and predominant location in noncoding regions. Here, we examine 467 VNTRs that are human-specific expansions, unique to one location in the genome, and not associated with retrotransposons. We leverage publicly available long-read genomes, including from the Human Genome Structural Variant Consortium, to ascertain the exact nucleotide composition of these VNTRs and compare their composition of alleles. We then confirm repeat unit composition in more than 3000 short-read samples from the 1000 Genomes Project. Our analysis reveals that these VNTRs contain highly structured repeat motif organization, modified by frequent deletion and duplication events. Although overall VNTR compositions tend to remain similar between 1000 Genomes Project superpopulations, we describe a notable exception with substantial differences in repeat composition (in PCBP3), as well as several VNTRs that are significantly different in length between superpopulations (in ART1, PROP1, DYNC2I1, and LOC102723906). We also observe that most of these VNTRs are expanded in archaic human genomes, yet remain stable in length between single generations. Collectively, our findings indicate that repeat motif variability, repeat composition, and repeat length are all informative modalities to consider when characterizing VNTRs and their contribution to genomic variation.

Subjects

Minisatellite Repeats, Nucleotides, Human Genome, Genomic Structural Variation, Humans, Minisatellite Repeats/genetics, Genetic Polymorphism

7.

Quantitative assessment reveals the dominance of duplicated sequences in germline-derived extrachromosomal circular DNA.

Mouakkad-Montoya, Lila; Murata, Michael M; Sulovari, Arvis; Suzuki, Ryusuke; Osia, Beth; Malkova, Anna; Katsumata, Makoto; Giuliano, Armando E; Eichler, Evan E; Tanaka, Hisashi.

Proc Natl Acad Sci U S A ; 118(47), 2021 11 23.

Article in English | MEDLINE | ID: mdl-34789574

ABSTRACT

Extrachromosomal circular DNA (eccDNA) originates from linear chromosomal DNA in various human tissues under physiological and disease conditions. The genomic origins of eccDNA have largely been investigated using in vitro-amplified DNA. However, in vitro amplification obscures quantitative information by skewing the total population stoichiometry. In addition, the analyses have focused on eccDNA stemming from single-copy genomic regions, leaving eccDNA from multicopy regions unexamined. To address these issues, we isolated eccDNA without in vitro amplification (naïve small circular DNA, nscDNA) and assessed the populations quantitatively by integrated genomic, molecular, and cytogenetic approaches. nscDNA of up to tens of kilobases were successfully enriched by our approach and were predominantly derived from multicopy genomic regions including segmental duplications (SDs). SDs, which account for 5% of the human genome and are hotspots for copy number variations, were significantly overrepresented in sperm nscDNA, with three times more sequencing reads derived from SDs than from the entire single-copy regions. SDs were also overrepresented in mouse sperm nscDNA, which we estimated to comprise 0.2% of nuclear DNA. Considering that eccDNA can be integrated into chromosomes, germline-derived nscDNA may be a mediator of genome diversity.

Subjects

Circular DNA, Germ Cells, Animals, Chromosomes, DNA, DNA Copy Number Variations, Human Genome, HeLa Cells, Humans, Male, Mice, Inbred C57BL Mice, Genomic Segmental Duplications, Spermatozoa

8.

Evolution of a Human-Specific Tandem Repeat Associated with ALS.

Course, Meredith M; Gudsnuk, Kathryn; Smukowski, Samuel N; Winston, Kosuke; Desai, Nitin; Ross, Jay P; Sulovari, Arvis; Bourassa, Cynthia V; Spiegelman, Dan; Couthouis, Julien; Yu, Chang-En; Tsuang, Debby W; Jayadev, Suman; Kay, Mark A; Gitler, Aaron D; Dupre, Nicolas; Eichler, Evan E; Dion, Patrick A; Rouleau, Guy A; Valdmanis, Paul N.

Am J Hum Genet ; 107(3): 445-460, 2020 09 03.

Article in English | MEDLINE | ID: mdl-32750315

ABSTRACT

Tandem repeats are proposed to contribute to human-specific traits, and more than 40 tandem repeat expansions are known to cause neurological disease. Here, we characterize a human-specific 69 bp variable number tandem repeat (VNTR) in the last intron of WDR7, which exhibits striking variability in both copy number and nucleotide composition, as revealed by long-read sequencing. In addition, greater repeat copy number is significantly enriched in three independent cohorts of individuals with sporadic amyotrophic lateral sclerosis (ALS). Each unit of the repeat forms a stem-loop structure with the potential to produce microRNAs, and the repeat RNA can aggregate when expressed in cells. We leveraged its remarkable sequence variability to align the repeat in 288 samples and uncover its mechanism of expansion. We found that the repeat expands in the 3'-5' direction, in groups of repeat units divisible by two. The expansion patterns we observed were consistent with duplication events, and a replication error called template switching. We also observed that the VNTR is expanded in both Denisovan and Neanderthal genomes but is fixed at one copy or fewer in non-human primates. Evaluating the repeat in 1000 Genomes Project samples reveals that some repeat segments are solely present or absent in certain geographic populations. The large size of the repeat unit in this VNTR, along with our multiplexed sequencing strategy, provides an unprecedented opportunity to study mechanisms of repeat expansion, and a framework for evaluating the roles of VNTRs in human evolution and disease.

Subjects

Signal Transducing Adaptor Proteins/genetics, Amyotrophic Lateral Sclerosis/genetics, Molecular Evolution, Tandem Repeat Sequences/genetics, Aged, Alzheimer Disease/genetics, Alzheimer Disease/pathology, Amyotrophic Lateral Sclerosis/pathology, DNA Repeat Expansion/genetics, Female, Gene Expression Regulation/genetics, Humans, Male, Minisatellite Repeats/genetics, Phenotype, Species Specificity

9.

Single-cell strand sequencing of a macaque genome reveals multiple nested inversions and breakpoint reuse during primate evolution.

Maggiolini, Flavia Angela Maria; Sanders, Ashley D; Shew, Colin James; Sulovari, Arvis; Mao, Yafei; Puig, Marta; Catacchio, Claudia Rita; Dellino, Maria; Palmisano, Donato; Mercuri, Ludovica; Bitonto, Miriana; Porubský, David; Cáceres, Mario; Eichler, Evan E; Ventura, Mario; Dennis, Megan Y; Korbel, Jan O; Antonacci, Francesca.

Genome Res ; 30(11): 1680-1693, 2020 11.

Article in English | MEDLINE | ID: mdl-33093070

ABSTRACT

Rhesus macaque is an Old World monkey that shared a common ancestor with human ~25 Myr ago and is an important animal model for human disease studies. A deep understanding of its genetics is therefore required for both biomedical and evolutionary studies. Among structural variants, inversions represent a driving force in speciation and play an important role in disease predisposition. Here we generated a genome-wide map of inversions between human and macaque, combining single-cell strand sequencing with cytogenetics. We identified 375 total inversions between 859 bp and 92 Mbp, increasing by eightfold the number of previously reported inversions. Among these, 19 inversions flanked by segmental duplications overlap with recurrent copy number variants associated with neurocognitive disorders. Evolutionary analyses show that in 17 out of 19 cases, the Hominidae orientation of these disease-associated regions is always derived. This suggests that duplicated sequences likely played a fundamental role in generating inversions in humans and great apes, creating architectures that nowadays predispose these regions to disease-associated genetic instability. Finally, we identified 861 genes mapping at 156 inversion breakpoints, with some showing evidence of differential expression in human and macaque cell lines, thus highlighting candidates that might have contributed to the evolution of species-specific features. This study depicts the most accurate fine-scale map of inversions between human and macaque using a two-pronged integrative approach, such as single-cell strand sequencing and cytogenetics, and represents a valuable resource toward understanding of the biology and evolution of primate species.

Subjects

Chromosome Breakpoints, Chromosome Inversion, Molecular Evolution, Macaca mulatta/genetics, Animals, Disease/genetics, Gene Expression Regulation, Genome, Genomics, Heterozygote, Humans, Fluorescence In Situ Hybridization, Genetic Recombination, DNA Sequence Analysis, Single-Cell Analysis

10.

A virome-wide clonal integration analysis platform for discovering cancer viral etiology.

Chen, Xun; Kost, Jason; Sulovari, Arvis; Wong, Nathalie; Liang, Winnie S; Cao, Jian; Li, Dawei.

Genome Res ; 29(5): 819-830, 2019 05.

Article in English | MEDLINE | ID: mdl-30872350

ABSTRACT

Oncoviral infection is responsible for 12%-15% of cancer in humans. Convergent evidence from epidemiology, pathology, and oncology suggests that new viral etiologies for cancers remain to be discovered. Oncoviral profiles can be obtained from cancer genome sequencing data; however, widespread viral sequence contamination and noncausal viruses complicate the process of identifying genuine oncoviruses. Here, we propose a novel strategy to address these challenges by performing virome-wide screening of early-stage clonal viral integrations. To implement this strategy, we developed VIcaller, a novel platform for identifying viral integrations that are derived from any characterized viruses and shared by a large proportion of tumor cells using whole-genome sequencing (WGS) data. The sensitivity and precision were confirmed with simulated and benchmark cancer data sets. By applying this platform to cancer WGS data sets with proven or speculated viral etiology, we newly identified or confirmed clonal integrations of hepatitis B virus (HBV), human papillomavirus (HPV), Epstein-Barr virus (EBV), and BK Virus (BKV), suggesting the involvement of these viruses in early stages of tumorigenesis in affected tumors, such as HBV in TERT and KMT2B (also known as MLL4) gene loci in liver cancer, HPV and BKV in bladder cancer, and EBV in non-Hodgkin's lymphoma. We also showed the capacity of VIcaller to identify integrations from some uncharacterized viruses. This is the first study to systematically investigate the strategy and method of virome-wide screening of clonal integrations to identify oncoviruses. Searching clonal viral integrations with our platform has the capacity to identify virus-caused cancers and discover cancer viral etiologies.

Subjects

Neoplasms/virology, Virus Integration/genetics, Whole Genome Sequencing, BK Virus/genetics, BK Virus/pathogenicity, Carcinogenesis/genetics, Neoplastic Cell Transformation, Viral DNA, DNA-Binding Proteins/genetics, Hepatitis B Virus/genetics, Hepatitis B Virus/pathogenicity, Human Herpesvirus 4/genetics, Human Herpesvirus 4/pathogenicity, Histone-Lysine N-Methyltransferase, Humans, Liver Neoplasms/genetics, Liver Neoplasms/virology, Non-Hodgkin Lymphoma/genetics, Non-Hodgkin Lymphoma/virology, Neoplasms/genetics, Papillomaviridae/genetics, Papillomaviridae/pathogenicity, Software, Urinary Bladder Neoplasms/genetics, Urinary Bladder Neoplasms/virology

11.

Human-specific tandem repeat expansion and differential gene expression during primate evolution.

Sulovari, Arvis; Li, Ruiyang; Audano, Peter A; Porubsky, David; Vollger, Mitchell R; Logsdon, Glennis A; Warren, Wesley C; Pollen, Alex A; Chaisson, Mark J P; Eichler, Evan E.

Proc Natl Acad Sci U S A ; 116(46): 23243-23253, 2019 11 12.

Article in English | MEDLINE | ID: mdl-31659027

ABSTRACT

Short tandem repeats (STRs) and variable number tandem repeats (VNTRs) are important sources of natural and disease-causing variation, yet they have been problematic to resolve in reference genomes and genotype with short-read technology. We created a framework to model the evolution and instability of STRs and VNTRs in apes. We phased and assembled 3 ape genomes (chimpanzee, gorilla, and orangutan) using long-read and 10x Genomics linked-read sequence data for 21,442 human tandem repeats discovered in 6 haplotype-resolved assemblies of Yoruban, Chinese, and Puerto Rican origin. We define a set of 1,584 STRs/VNTRs expanded specifically in humans, including large tandem repeats affecting coding and noncoding portions of genes (e.g., MUC3A, CACNA1C). We show that short interspersed nuclear element-VNTR-Alu (SVA) retrotransposition is the main mechanism for distributing GC-rich human-specific tandem repeat expansions throughout the genome but with a bias against genes. In contrast, we observe that VNTRs not originating from retrotransposons have a propensity to cluster near genes, especially in the subtelomere. Using tissue-specific expression from human and chimpanzee brains, we identify genes where transcript isoform usage differs significantly, likely caused by cryptic splicing variation within VNTRs. Using single-cell expression from cerebral organoids, we observe a strong effect for genes associated with transcription profiles analogous to intermediate progenitor cells. Finally, we compare the sequence composition of some of the largest human-specific repeat expansions and identify 52 STRs/VNTRs with at least 40 uninterrupted pure tracts as candidates for genetically unstable regions associated with disease.

Subjects

Molecular Evolution, Human Genome, Primates/genetics, Tandem Repeat Sequences, Animals, Disease/genetics, Genomic Structural Variation, Humans, RNA Splicing

12.

VIpower: Simulation-based tool for estimating power of viral integration detection via high-throughput sequencing.

Sulovari, Arvis; Li, Dawei.

Genomics ; 112(1): 207-211, 2020 01.

Article in English | MEDLINE | ID: mdl-30710609

ABSTRACT

Viral sequence integrations in the human genome have been implicated in various human diseases. Viral integrations remain among the most challenging-to-detect structural changes of the human genome. No studies have systematically analyzed how molecular and bioinformatics factors affect the power (sensitivity) to detect viral integrations using high-throughput sequencing (HTS). We selected a wide range of molecular and bioinformatics factors covering genome sequence characteristics, HTS features, and viral integration detection. We designed a fast simulation-based framework to model the process of detecting variable viral integration events in the human genome. We then examined the associations of selected factors with viral integration detection power. We identified six factors that significantly affected viral integration detection power (P < 2 × 10⁻¹⁶). The strongest factors associated with detection power included proportion of sample cells with clonal viral integrations (Pearson's ρ = 0.64), sequencing depth (ρ = 0.37), length of viral integration (ρ = 0.37), paired-end read insert size (ρ = 0.23), user-defined threshold (number of supporting reads) to claim successful identification of integrations (ρ = -0.19), and read length (when sequence volume was fixed) (ρ = -0.09). As the first tool of its kind, VIpower incorporates all these factors, which can be manipulated in concert with each other to optimize the detection power. This tool may be used to estimate viral integration detection power for various combinations of sequencing or analytic parameters. It may also be used to estimate the parameters required to achieve a specific power when designing new sequencing experiments.

Subjects

High-Throughput Nucleotide Sequencing/methods, Software, Virus Integration, Human Genome, Humans

13.

Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.

Vollger, Mitchell R; Logsdon, Glennis A; Audano, Peter A; Sulovari, Arvis; Porubsky, David; Peluso, Paul; Wenger, Aaron M; Concepcion, Gregory T; Kronenberg, Zev N; Munson, Katherine M; Baker, Carl; Sanders, Ashley D; Spierings, Diana C J; Lansdorp, Peter M; Surti, Urvashi; Hunkapiller, Michael W; Eichler, Evan E.

Ann Hum Genet ; 84(2): 125-140, 2020 03.

Article in English | MEDLINE | ID: mdl-31711268

ABSTRACT

The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes.

Subjects

Biomarkers/analysis, Genetic Variation, Human Genome, Haploidy, Hydatidiform Mole/genetics, DNA Sequence Analysis/methods, Single-Cell Analysis/methods, Female, High-Throughput Nucleotide Sequencing, Humans, Molecular Sequence Annotation, Pregnancy

14.

Atlas of human diseases influenced by genetic variants with extreme allele frequency differences.

Sulovari, Arvis; Chen, Yolanda H; Hudziak, James J; Li, Dawei.

Hum Genet ; 136(1): 39-54, 2017 01.

Article in English | MEDLINE | ID: mdl-27699474

ABSTRACT

Genetic variants with extreme allele frequency differences (EAFD) may underlie some human health disparities across populations. To identify EAFD loci, we systematically analyzed and characterized 81 million genomic variants from 2504 unrelated individuals of 26 world populations (phase III of the 1000 Genomes Project). Our analyses revealed a total of 434 genes, 15 pathways, and 18 diseases and traits influenced by EAFD variants from five continental populations. They included known EAFD genes, such as LCT (lactose tolerance), SLC24A5 (skin pigmentation), and EDAR (hair morphology). We found many novel EAFD genes, including TBC1D2B (autophagy mediator), TRIM40 (gastrointestinal inflammatory regulator), KRT71, KRT75, KRT83, and KRTAP10-1 (hair and epithelial keratin synthesis), PIK3R3 (insulin receptor interaction), DARS (neurological disorders), and NACA2 (skin inflammatory response). Our results also showed four complex diseases significantly associated with EAFD loci, including asthma (adjusted enrichment P = 4 × 10⁻⁸), type I diabetes (P = 6 × 10⁻⁹), alcohol consumption (P = 0.0002), and attention deficit/hyperactivity disorder (P = 0.003). This study provides a comprehensive atlas of genes, pathways, and human diseases significantly influenced by EAFD variants.

Subjects

Gene Frequency , Genetic Predisposition to Disease , Genetic Variation , Asian People/genetics , Black People/genetics , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Models, Theoretical , Phenotype , Polymorphism, Single Nucleotide , White People/genetics

15.

Eye color: A potential indicator of alcohol dependence risk in European Americans.

Sulovari, Arvis; Kranzler, Henry R; Farrer, Lindsay A; Gelernter, Joel; Li, Dawei.

Am J Med Genet B Neuropsychiatr Genet ; 168B(5): 347-53, 2015 Jul.

Article in English | MEDLINE | ID: mdl-25921801

ABSTRACT

In archival samples of European-ancestry subjects, light-eyed individuals have been found to consume more alcohol than dark-eyed individuals. No published population-based studies have directly tested the association between alcohol dependence (AD) and eye color. We hypothesized that light-eyed individuals have a higher prevalence of AD than dark-eyed individuals. A mixture model was used to select a homogeneous sample of 1,263 European-Americans and control for population stratification. After quality control, we conducted an association study using logistic regression, adjusting for confounders (age, sex, and genetic ancestry). We found evidence of association between AD and blue eye color (P = 0.0005 and odds ratio = 1.83 (1.31-2.57)), supporting light eye color as a risk factor relative to brown eye color. Network-based analyses revealed a statistically significant (P = 0.02) number of genetic interactions between eye color genes and AD-associated genes. We found evidence of linkage disequilibrium between an AD-associated GABA receptor gene cluster, GABRB3/GABRG3, and eye color genes, OCA2/HERC2, as well as between AD-associated GRM5 and pigmentation-associated TYR. Our population-phenotype, network, and linkage disequilibrium analyses support association between blue eye color and AD. Although we controlled for stratification, we cannot exclude underlying occult stratification as a contributor to this observation. Although replication is needed, our findings suggest that eye pigmentation information may be useful in research on AD. Further characterization of this association may unravel new AD etiological factors. © 2015 Wiley Periodicals, Inc.

Subjects

Alcoholism/diagnosis , Alcoholism/genetics , Eye Color , Linkage Disequilibrium/genetics , Polymorphism, Single Nucleotide/genetics , Eye Color/physiology , Female , Genotype , Humans , Male , Risk , White People/genetics

16.

GACT: a Genome build and Allele definition Conversion Tool for SNP imputation and meta-analysis in genetic association studies.

Sulovari, Arvis; Li, Dawei.

BMC Genomics ; 15: 610, 2014 Jul 19.

Article in English | MEDLINE | ID: mdl-25038819

ABSTRACT

BACKGROUND: Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires the same allele definition (nomenclature) and genome build among individual studies. Similarly, imputation, commonly used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in use of different genome builds and allele definitions. Incorrect assumptions of identical allele definition among combined GWAS lead to a large portion of discarded genotypes or incorrect association findings. There is no published tool that predicts and converts among all major allele definitions. RESULTS: In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease of imputation quality (even significantly higher when compared to the imputation with singletons in the reference), especially for rare SNPs.
CONCLUSION: GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict, and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP-arrays or deep-sequencing, particularly for data from the dbGaP and other public databases. GACT SOFTWARE: http://www.uvm.edu/genomics/software/gact.
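To make the allele-definition problem concrete, the sketch below shows one common harmonization step: reporting a study SNP on the same strand as a reference, and flagging strand-ambiguous SNPs (A/T and C/G pairs) whose orientation cannot be inferred from the alleles alone. This is an illustrative assumption, not GACT's implementation. The helper names `flip_strand`, `is_ambiguous`, and `harmonize` are hypothetical and do not come from the tool.

```python
# Watson-Crick complements used to move an allele to the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip_strand(alleles):
    """Return the allele pair as read on the opposite strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def is_ambiguous(alleles):
    """True for A/T or C/G SNPs, which look identical on both strands."""
    first, second = alleles
    return COMPLEMENT[first] == second


def harmonize(study, reference):
    """Return the study alleles expressed on the reference strand,
    or None when the SNP is ambiguous or the alleles cannot be matched."""
    if is_ambiguous(reference):
        return None
    if set(study) == set(reference):
        return study
    flipped = flip_strand(study)
    if set(flipped) == set(reference):
        return flipped
    return None


# Hypothetical example: the study reports A/G, the reference panel T/C.
print(harmonize(("A", "G"), ("T", "C")))
```

Returning `None` for ambiguous or unmatched SNPs leaves the decision to drop or rescue them (for example, by allele frequency) to the caller.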

Subjects

Alleles , Genomics , Polymorphism, Single Nucleotide , Software , Datasets as Topic , Gene Frequency , Genetic Association Studies , Genome , Genomics/methods , Genotype , Humans

17.

Structural and genetic diversity in the secreted mucins, MUC5AC and MUC5B.

Plender, Elizabeth G; Prodanov, Timofey; Hsieh, PingHsun; Nizamis, Evangelos; Harvey, William T; Sulovari, Arvis; Munson, Katherine M; Kaufman, Eli J; O'Neal, Wanda K; Valdmanis, Paul N; Marschall, Tobias; Bloom, Jesse D; Eichler, Evan E.

bioRxiv ; 2024 Mar 20.

Article in English | MEDLINE | ID: mdl-38562829

ABSTRACT

The secreted mucins MUC5AC and MUC5B play critical defensive roles in airway pathogen entrapment and mucociliary clearance by encoding large glycoproteins with variable number tandem repeats (VNTRs). These polymorphic and degenerate protein coding VNTRs make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5761-5762aa); however, seven haplotypes have expanded VNTRs (6291-7019aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5249-6325aa) with cysteine-rich domain and VNTR copy number variation. We grouped MUC5AC alleles into three phylogenetic clades: H1 (46%, ~5654aa), H2 (33%, ~5742aa), and H3 (7%, ~6325aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium (LD) and Tajima's D analyses reveal that East Asians carry exceptionally large MUC5AC LD blocks with an excess of rare variation (p<0.05). To validate this result, we used Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observed signatures of positive selection in H1 and H2 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Africans and Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium, consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein coding VNTRs for improved disease associations.

18.

Complex genetic variation in nearly complete human genomes.

Logsdon, Glennis A; Ebert, Peter; Audano, Peter A; Loftus, Mark; Porubsky, David; Ebler, Jana; Yilmaz, Feyza; Hallast, Pille; Prodanov, Timofey; Yoo, DongAhn; Paisie, Carolyn A; Harvey, William T; Zhao, Xuefang; Martino, Gianni V; Henglin, Mir; Munson, Katherine M; Rabbani, Keon; Chin, Chen-Shan; Gu, Bida; Ashraf, Hufsah; Austine-Orimoloye, Olanrewaju; Balachandran, Parithi; Bonder, Marc Jan; Cheng, Haoyu; Chong, Zechen; Crabtree, Jonathan; Gerstein, Mark; Guethlein, Lisbeth A; Hasenfeld, Patrick; Hickey, Glenn; Hoekzema, Kendra; Hunt, Sarah E; Jensen, Matthew; Jiang, Yunzhe; Koren, Sergey; Kwon, Youngjun; Li, Chong; Li, Heng; Li, Jiaqi; Norman, Paul J; Oshima, Keisuke K; Paten, Benedict; Phillippy, Adam M; Pollock, Nicholas R; Rausch, Tobias; Rautiainen, Mikko; Scholz, Stephan; Song, Yuwei; Söylev, Arda; Sulovari, Arvis.

bioRxiv ; 2024 Sep 25.

Article in English | MEDLINE | ID: mdl-39372794

ABSTRACT

Diverse sets of complete human genomes are required to construct a pangenome reference and to understand the extent of complex structural variation. Here, we sequence 65 diverse human genomes and build 130 haplotype-resolved assemblies (130 Mbp median continuity), closing 92% of all previous assembly gaps [1,2] and reaching telomere-to-telomere (T2T) status for 39% of the chromosomes. We highlight complete sequence continuity of complex loci, including the major histocompatibility complex (MHC), SMN1/SMN2, NBPF8, and AMY1/AMY2, and fully resolve 1,852 complex structural variants (SVs). In addition, we completely assemble and validate 1,246 human centromeres. We find up to 30-fold variation in α-satellite high-order repeat (HOR) array length and characterize the pattern of mobile element insertions into α-satellite HOR arrays. While most centromeres predict a single site of kinetochore attachment, epigenetic analysis suggests the presence of two hypomethylated regions for 7% of centromeres. Combining our data with the draft pangenome reference [1] significantly enhances genotyping accuracy from short-read data, enabling whole-genome inference [3] to a median quality value (QV) of 45. Using this approach, 26,115 SVs per sample are detected, substantially increasing the number of SVs now amenable to downstream disease association studies.

19.

Advances in the discovery and analyses of human tandem repeats.

Chaisson, Mark J P; Sulovari, Arvis; Valdmanis, Paul N; Miller, Danny E; Eichler, Evan E.

Emerg Top Life Sci ; 7(3): 361-381, 2023 Dec 14.

Article in English | MEDLINE | ID: mdl-37905568

ABSTRACT

Long-read sequencing platforms provide unparalleled access to the structure and composition of all classes of tandemly repeated DNA from STRs to satellite arrays. This review summarizes our current understanding of their organization within the human genome, their importance with respect to disease, as well as the advances and challenges in understanding their genetic diversity and functional effects. Novel computational methods are being developed to visualize and associate these complex patterns of human variation with disease, expression, and epigenetic differences. We predict accurate characterization of this repeat-rich form of human variation will become increasingly relevant to both basic and clinical human genetics.

Subjects

DNA , Tandem Repeat Sequences , Humans , Tandem Repeat Sequences/genetics , Epigenesis, Genetic

20.

Segmental duplications and their variation in a complete human genome.

Vollger, Mitchell R; Guitart, Xavi; Dishuck, Philip C; Mercuri, Ludovica; Harvey, William T; Gershman, Ariel; Diekhans, Mark; Sulovari, Arvis; Munson, Katherine M; Lewis, Alexandra P; Hoekzema, Kendra; Porubsky, David; Li, Ruiyang; Nurk, Sergey; Koren, Sergey; Miga, Karen H; Phillippy, Adam M; Timp, Winston; Ventura, Mario; Eichler, Evan E.

Science ; 376(6588): eabj6965, 2022 Apr.

Article in English | MEDLINE | ID: mdl-35357917

ABSTRACT

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.

Subjects

DNA Copy Number Variations , Gene Duplication , Genome, Human , Segmental Duplications, Genomic , Evolution, Molecular , GTPase-Activating Proteins/genetics , Humans , Polymorphism, Single Nucleotide , Proto-Oncogene Proteins/genetics
