RESUMO
We sequenced and assembled using multiple long-read sequencing technologies the genomes of chimpanzee, bonobo, gorilla, orangutan, gibbon, macaque, owl monkey, and marmoset. We identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. We estimate that 819.47 Mbp or â¼27% of the genome has been affected by SVs across primate evolution. We identify 1,607 structurally divergent regions wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (e.g., CARD, C4, and OLAH gene families) and additional lineage-specific genes are generated (e.g., CKAP2, VPS36, ACBD7, and NEK5 paralogs), becoming targets of rapid chromosomal diversification and positive selection (e.g., RGPD gene family). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species.
Assuntos
Genoma , Primatas , Animais , Humanos , Sequência de Bases , Primatas/classificação , Primatas/genética , Evolução Biológica , Análise de Sequência de DNA , Variação Estrutural do GenomaRESUMO
Single-nucleotide variants (SNVs) in segmental duplications (SDs) have not been systematically assessed because of the limitations of mapping short-read sequencing data1,2. Here we constructed 1:1 unambiguous alignments spanning high-identity SDs across 102 human haplotypes and compared the pattern of SNVs between unique and duplicated regions3,4. We find that human SNVs are elevated 60% in SDs compared to unique regions and estimate that at least 23% of this increase is due to interlocus gene conversion (IGC) with up to 4.3 megabase pairs of SD sequence converted on average per human haplotype. We develop a genome-wide map of IGC donors and acceptors, including 498 acceptor and 454 donor hotspots affecting the exons of about 800 protein-coding genes. These include 171 genes that have 'relocated' on average 1.61 megabase pairs in a subset of human haplotypes. Using a coalescent framework, we show that SD regions are slightly evolutionarily older when compared to unique sequences, probably owing to IGC. SNVs in SDs, however, show a distinct mutational spectrum: a 27.1% increase in transversions that convert cytosine to guanine or the reverse across all triplet contexts and a 7.6% reduction in the frequency of CpG-associated mutations when compared to unique DNA. We reason that these distinct mutational properties help to maintain an overall higher GC content of SD DNA compared to that of unique DNA, probably driven by GC-biased conversion between paralogous sequences5,6.
Assuntos
Conversão Gênica , Mutação , Duplicações Segmentares Genômicas , Humanos , Conversão Gênica/genética , Genoma Humano/genética , Polimorfismo de Nucleotídeo Único/genética , Haplótipos/genética , Éxons/genética , Citosina/química , Guanina/química , Ilhas de CpG/genéticaRESUMO
The prevalence of highly repetitive sequences within the human Y chromosome has prevented its complete assembly to date1 and led to its systematic omission from genomic analyses. Here we present de novo assemblies of 43 Y chromosomes spanning 182,900 years of human evolution and report considerable diversity in size and structure. Half of the male-specific euchromatic region is subject to large inversions with a greater than twofold higher recurrence rate compared with all other chromosomes2. Ampliconic sequences associated with these inversions show differing mutation rates that are sequence context dependent, and some ampliconic genes exhibit evidence for concerted evolution with the acquisition and purging of lineage-specific pseudogenes. The largest heterochromatic region in the human genome, Yq12, is composed of alternating repeat arrays that show extensive variation in the number, size and distribution, but retain a 1:1 copy-number ratio. Finally, our data suggest that the boundary between the recombining pseudoautosomal region 1 and the non-recombining portions of the X and Y chromosomes lies 500 kb away from the currently established1 boundary. The availability of fully sequence-resolved Y chromosomes from multiple individuals provides a unique opportunity for identifying new associations of traits with specific Y-chromosomal variants and garnering insights into the evolution and function of complex regions of the human genome.
Assuntos
Cromossomos Humanos Y , Evolução Molecular , Humanos , Masculino , Cromossomos Humanos Y/genética , Genoma Humano/genética , Genômica , Taxa de Mutação , Fenótipo , Eucromatina/genética , Pseudogenes , Variação Genética/genética , Cromossomos Humanos X/genética , Regiões Pseudoautossômicas/genéticaRESUMO
The 10q11.22 chromosomal region is a duplication-rich interval of the human genome and one of the last to be fully assembled. It carries copy number-variable genes associated with intellectual disability, bipolar disorder, and obesity. In this study, we characterized the structural diversity at this locus by analyzing 64 haploid assemblies produced by the Human Pangenome Reference Consortium. We identified 11 alternative haplotypes that differ in the copy number and/or orientation of large genomic segments, ranging from hundreds of kilobase pairs (kbp) to over one megabase pair (Mbp). We uncovered a 2.4 Mbp size difference between the shortest and longest haplotypes. Breakpoint analysis revealed that genomic instability results from nonallelic homologous recombination between segmental duplication (SD) pairs with varying similarity (94.4%-99.6%). Nonetheless, these pairs generally recombine at positions where their identity is higher (>99.6%). Recurrent inversions occur with different breakpoints within the same inverted SD pair. Inversion polymorphisms shuffle the entire SD arrangement, creating new predispositions to copy-number variations. The SD architecture is associated with a catarrhine-specific subgroup of the AGAP gene family, which likely triggered the accumulation of SDs at this locus over the past 25 million years of human evolution. Our results reveal extensive structural diversity and genomic instability at the 10q11.22 locus, and expand the general understanding of the mutational mechanisms behind SD-mediated rearrangements.
Assuntos
Cromossomos Humanos Par 10 , Variações do Número de Cópias de DNA , Haplótipos , Duplicações Segmentares Genômicas , Humanos , Cromossomos Humanos Par 10/genética , Instabilidade Genômica , Recombinação Homóloga , Genoma Humano , Inversão Cromossômica , Variação Estrutural do GenomaRESUMO
TBC1D3 is a primate-specific gene family that has expanded in the human lineage and has been implicated in neuronal progenitor proliferation and expansion of the frontal cortex. The gene family and its expression have been challenging to investigate because it is embedded in high-identity and highly variable segmental duplications. We sequenced and assembled the gene family using long-read sequencing data from 34 humans and 11 non-human primate species. Our analysis shows that this particular gene family has independently duplicated in at least five primate lineages, and the duplicated loci are enriched at sites of large-scale chromosomal rearrangements on Chromosome 17. We find that all human copy number variation maps to two distinct clusters located at Chromosome 17q12 and that humans are highly structurally variable at this locus, differing by as many as 20 copies and ~1 Mbp in length depending on haplotypes. We also show evidence of positive selection, as well as a significant change in the predicted human TBC1D3 protein sequence. Lastly, we find that, despite multiple duplications, human TBC1D3 expression is limited to a subset of copies and, most notably, from a single paralog group: TBC1D3-CDKL These observations may help explain why a gene potentially important in cortical development can be so variable in the human population.
RESUMO
The complete assembly of each human chromosome is essential for understanding human biology and evolution1,2. Here we use complementary long-read sequencing technologies to complete the linear assembly of human chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08-Mb centromeric α-satellite array, a 644-kb copy number polymorphism in the ß-defensin gene cluster that is important for disease risk, and an 863-kb variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73-kb hypomethylated region of diverse higher-order α-satellites enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. In addition, we confirm the overall organization and methylation pattern of the centromere in a diploid human genome. Using a dual long-read sequencing approach, we complete high-quality draft assemblies of the orthologous centromere from chromosome 8 in chimpanzee, orangutan and macaque to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved in the great ape ancestor with a layered symmetry, in which more ancient higher-order repeats locate peripherally to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated by more than 2.2-fold compared to the unique portions of the genome, and this acceleration extends into the flanking sequence.
Assuntos
Cromossomos Humanos Par 8/química , Cromossomos Humanos Par 8/genética , Evolução Molecular , Animais , Linhagem Celular , Centrômero/química , Centrômero/genética , Centrômero/metabolismo , Cromossomos Humanos Par 8/fisiologia , Metilação de DNA , DNA Satélite/genética , Epigênese Genética , Feminino , Humanos , Macaca mulatta/genética , Masculino , Repetições Minissatélites/genética , Pan troglodytes/genética , Filogenia , Pongo abelii/genética , Telômero/química , Telômero/genética , Telômero/metabolismoRESUMO
The divergence of chimpanzee and bonobo provides one of the few examples of recent hominid speciation1,2. Here we describe a fully annotated, high-quality bonobo genome assembly, which was constructed without guidance from reference genomes by applying a multiplatform genomics approach. We generate a bonobo genome assembly in which more than 98% of genes are completely annotated and 99% of the gaps are closed, including the resolution of about half of the segmental duplications and almost all of the full-length mobile elements. We compare the bonobo genome to those of other great apes1,3-5 and identify more than 5,569 fixed structural variants that specifically distinguish the bonobo and chimpanzee lineages. We focus on genes that have been lost, changed in structure or expanded in the last few million years of bonobo evolution. We produce a high-resolution map of incomplete lineage sorting and estimate that around 5.1% of the human genome is genetically closer to chimpanzee or bonobo and that more than 36.5% of the genome shows incomplete lineage sorting if we consider a deeper phylogeny including gorilla and orangutan. We also show that 26% of the segments of incomplete lineage sorting between human and chimpanzee or human and bonobo are non-randomly distributed and that genes within these clustered segments show significant excess of amino acid replacement compared to the rest of the genome.
Assuntos
Evolução Molecular , Genoma/genética , Genômica , Pan paniscus/genética , Filogenia , Animais , Fator de Iniciação 4A em Eucariotos/genética , Feminino , Genes , Gorilla gorilla/genética , Anotação de Sequência Molecular/normas , Pan troglodytes/genética , Pongo/genética , Duplicações Segmentares Genômicas , Análise de Sequência de DNARESUMO
MOTIVATION: Highly contiguous de novo phased diploid genome assemblies are now feasible for large numbers of species and individuals. Methods are needed to validate assembly accuracy and detect misassemblies with orthologous sequencing data to allow for confident downstream analyses. RESULTS: We developed GAVISUNK, an open-source pipeline that detects misassemblies and produces a set of reliable regions genome-wide by assessing concordance of distances between unique k-mers in Pacific Biosciences high-fidelity assemblies and raw Oxford Nanopore Technologies reads. AVAILABILITY AND IMPLEMENTATION: GAVISUNK is available at https://github.com/pdishuck/GAVISUNK. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Nanoporos , Software , Humanos , Análise de Sequência de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , GenomaRESUMO
We have developed a computational method based on polyploid phasing of long sequence reads to resolve collapsed regions of segmental duplications within genome assemblies. Segmental Duplication Assembler (SDA; https://github.com/mvollger/SDA ) constructs graphs in which paralogous sequence variants define the nodes and long-read sequences provide attraction and repulsion edges, enabling the partition and assembly of long reads corresponding to distinct paralogs. We apply it to single-molecule, real-time sequence data from three human genomes and recover 33-79 megabase pairs (Mb) of duplications in which approximately half of the loci are diverged (<99.8%) compared to the reference genome. We show that the corresponding sequence is highly accurate (>99.9%) and that the diverged sequence corresponds to copy-number-variable paralogs that are absent from the human reference genome. Our method can be applied to other complex genomes to resolve the last gene-rich gaps, improve duplicate gene annotation, and better understand copy-number-variant genetic diversity at the base-pair level.
Assuntos
Biologia Computacional , Duplicações Segmentares Genômicas , Análise de Sequência de DNA/métodos , Genoma Humano , Humanos , Anotação de Sequência MolecularRESUMO
Motivation: We are now in the era of being able to routinely generate highly contiguous (near telomere-to-telomere) genome assemblies of human and nonhuman species. Complex structural variation and regions of rapid evolutionary turnover are being discovered for the first time. Thus, efficient and informative visualization tools are needed to evaluate and directly observe structural differences between two or more genomes. Results: We developed SVbyEye, an open-source R package to visualize and annotate sequence-to-sequence alignments along with various functionalities to process alignments in PAF format. The tool facilitates the characterization of complex structural variants in the context of sequence homology helping resolve the mechanisms underlying their formation. Availability and implementation: SVbyEye is available at https://github.com/daewoooo/SVbyEye.
RESUMO
Segmental duplications (SDs) contribute significantly to human disease, evolution, and diversity yet have been difficult to resolve at the sequence level. We present a population genetics survey of SDs by analyzing 170 human genome assemblies where the majority of SDs are fully resolved using long-read sequence assembly. Excluding the acrocentric short arms, we identify 173.2 Mbp of duplicated sequence (47.4 Mbp not present in the telomere-to-telomere reference) distinguishing fixed from structurally polymorphic events. We find that intrachromosomal SDs are among the most variable with rare events mapping near their progenitor sequences. African genomes harbor significantly more intrachromosomal SDs and are more likely to have recently duplicated gene families with higher copy number when compared to non-African samples. A comparison to a resource of 563 million full-length Iso-Seq reads identifies 201 novel, potentially protein-coding genes corresponding to these copy number polymorphic SDs.
RESUMO
TBC1D3 is a primate-specific gene family that has expanded in the human lineage and has been implicated in neuronal progenitor proliferation and expansion of the frontal cortex. The gene family and its expression have been challenging to investigate because it is embedded in high-identity and highly variable segmental duplications. We sequenced and assembled the gene family using long-read sequencing data from 34 humans and 11 nonhuman primate species. Our analysis shows that this particular gene family has independently duplicated in at least five primate lineages, and the duplicated loci are enriched at sites of large-scale chromosomal rearrangements on chromosome 17. We find that most humans vary along two TBC1D3 clusters where human haplotypes are highly variable in copy number, differing by as many as 20 copies, and structure (structural heterozygosity 90%). We also show evidence of positive selection, as well as a significant change in the predicted human TBC1D3 protein sequence. Lastly, we find that, despite multiple duplications, human TBC1D3 expression is limited to a subset of copies and, most notably, from a single paralog group: TBC1D3-CDKL. These observations may help explain why a gene potentially important in cortical development can be so variable in the human population.
RESUMO
To better understand the pattern of primate genome structural variation, we sequenced and assembled using multiple long-read sequencing technologies the genomes of eight nonhuman primate species, including New World monkeys (owl monkey and marmoset), Old World monkey (macaque), Asian apes (orangutan and gibbon), and African ape lineages (gorilla, bonobo, and chimpanzee). Compared to the human genome, we identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. Across 50 million years of primate evolution, we estimate that 819.47 Mbp or ~27% of the genome has been affected by SVs based on analysis of these primate lineages. We identify 1,607 structurally divergent regions (SDRs) wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (CARDs, ABCD7, OLAH) and new lineage-specific genes are generated (e.g., CKAP2, NEK5) and have become targets of rapid chromosomal diversification and positive selection (e.g., RGPDs). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species for the first time.
RESUMO
Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.
Assuntos
Variações do Número de Cópias de DNA , Duplicação Gênica , Genoma Humano , Duplicações Segmentares Genômicas , Evolução Molecular , Proteínas Ativadoras de GTPase/genética , Humanos , Polimorfismo de Nucleotídeo Único , Proteínas Proto-Oncogênicas/genéticaRESUMO
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.
Assuntos
Genoma Humano , Projeto Genoma Humano , Análise de Sequência de DNA/normas , Linhagem Celular , Cromossomos Artificiais Bacterianos/genética , Cromossomos Humanos/genética , Humanos , Valores de ReferênciaRESUMO
TRP channel-associated factor 1/2 (TCAF1/TCAF2) proteins antagonistically regulate the cold-sensor protein TRPM8 in multiple human tissues. Understanding their significance has been complicated given the locus spans a gap-ridden region with complex segmental duplications in GRCh38. Using long-read sequencing, we sequence-resolve the locus, annotate full-length TCAF models in primate genomes, and show substantial human-specific TCAF copy number variation. We identify two human super haplogroups, H4 and H5, and establish that TCAF duplications originated ~1.7 million years ago but diversified only in Homo sapiens by recurrent structural mutations. Conversely, in all archaic-hominin samples the fixation for a specific H4 haplotype without duplication is likely due to positive selection. Here, our results of TCAF copy number expansion, selection signals in hominins, and differential TCAF2 expression between haplogroups and high TCAF2 and TRPM8 expression in liver and prostate in modern-day humans imply TCAF diversification among hominins potentially in response to cold or dietary adaptations.
Assuntos
Duplicação Gênica , Hominidae/genética , Proteínas de Membrana/genética , Seleção Genética , Animais , Variações do Número de Cópias de DNA , Evolução Molecular , Genoma Humano , Haplótipos , Humanos , Homem de Neandertal , FilogeniaRESUMO
The rhesus macaque (Macaca mulatta) is the most widely studied nonhuman primate (NHP) in biomedical research. We present an updated reference genome assembly (Mmul_10, contig N50 = 46 Mbp) that increases the sequence contiguity 120-fold and annotate it using 6.5 million full-length transcripts, thus improving our understanding of gene content, isoform diversity, and repeat organization. With the improved assembly of segmental duplications, we discovered new lineage-specific genes and expanded gene families that are potentially informative in studies of evolution and disease susceptibility. Whole-genome sequencing (WGS) data from 853 rhesus macaques identified 85.7 million single-nucleotide variants (SNVs) and 10.5 million indel variants, including potentially damaging variants in genes associated with human autism and developmental delay, providing a framework for developing noninvasive NHP models of human disease.
Assuntos
Predisposição Genética para Doença , Genoma , Macaca mulatta/genética , Polimorfismo de Nucleotídeo Único , Animais , Variação Genética , Humanos , Anotação de Sequência Molecular , Sequenciamento Completo do GenomaRESUMO
Gene expression data are accumulating exponentially in public repositories. Reanalysis and integration of themed collections from these studies may provide new insights, but requires further human curation. Here we report a crowdsourcing project to annotate and reanalyse a large number of gene expression profiles from Gene Expression Omnibus (GEO). Through a massive open online course on Coursera, over 70 participants from over 25 countries identify and annotate 2,460 single-gene perturbation signatures, 839 disease versus normal signatures, and 906 drug perturbation signatures. All these signatures are unique and are manually validated for quality. Global analysis of these signatures confirms known associations and identifies novel associations between genes, diseases and drugs. The manually curated signatures are used as a training set to develop classifiers for extracting similar signatures from the entire GEO repository. We develop a web portal to serve these signatures for query, download and visualization.