RESUMO
The complete assembly of each human chromosome is essential for understanding human biology and evolution1,2. Here we use complementary long-read sequencing technologies to complete the linear assembly of human chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08-Mb centromeric α-satellite array, a 644-kb copy number polymorphism in the ß-defensin gene cluster that is important for disease risk, and an 863-kb variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73-kb hypomethylated region of diverse higher-order α-satellites enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. In addition, we confirm the overall organization and methylation pattern of the centromere in a diploid human genome. Using a dual long-read sequencing approach, we complete high-quality draft assemblies of the orthologous centromere from chromosome 8 in chimpanzee, orangutan and macaque to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved in the great ape ancestor with a layered symmetry, in which more ancient higher-order repeats locate peripherally to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated by more than 2.2-fold compared to the unique portions of the genome, and this acceleration extends into the flanking sequence.
Assuntos
Cromossomos Humanos Par 8/química , Cromossomos Humanos Par 8/genética , Evolução Molecular , Animais , Linhagem Celular , Centrômero/química , Centrômero/genética , Centrômero/metabolismo , Cromossomos Humanos Par 8/fisiologia , Metilação de DNA , DNA Satélite/genética , Epigênese Genética , Feminino , Humanos , Macaca mulatta/genética , Masculino , Repetições Minissatélites/genética , Pan troglodytes/genética , Filogenia , Pongo abelii/genética , Telômero/química , Telômero/genética , Telômero/metabolismoRESUMO
Recent advances in long-read sequencing opened a possibility to address the long-standing questions about the architecture and evolution of human centromeres. They also emphasized the need for centromere annotation (partitioning human centromeres into monomers and higher-order repeats [HORs]). Although there was a half-century-long series of semi-manual studies of centromere architecture, a rigorous centromere annotation algorithm is still lacking. Moreover, an automated centromere annotation is a prerequisite for studies of genetic diseases associated with centromeres and evolutionary studies of centromeres across multiple species. Although the monomer decomposition (transforming a centromere into a monocentromere written in the monomer alphabet) and the HOR decomposition (representing a monocentromere in the alphabet of HORs) are currently viewed as two separate problems, we show that they should be integrated into a single framework in such a way that HOR (monomer) inference affects monomer (HOR) inference. We thus developed the HORmon algorithm that integrates the monomer/HOR inference and automatically generates the human monomers/HORs that are largely consistent with the previous semi-manual inference.
Assuntos
Algoritmos , Centrômero , Centrômero/genética , HumanosRESUMO
MOTIVATION: Recent advances in long-read sequencing technologies led to rapid progress in centromere assembly in the last year and, for the first time, opened a possibility to address the long-standing questions about the architecture and evolution of human centromeres. However, since these advances have not been yet accompanied by the development of the centromere-specific bioinformatics algorithms, even the fundamental questions (e.g. centromere annotation by deriving the complete set of human monomers and high-order repeats), let alone more complex questions (e.g. explaining how monomers and high-order repeats evolved) about human centromeres remain open. Moreover, even though there was a four-decade-long series of studies aimed at cataloging all human monomers and high-order repeats, the rigorous algorithmic definitions of these concepts are still lacking. Thus, the development of a centromere annotation tool is a prerequisite for follow-up personalized biomedical studies of centromeres across the human population and evolutionary studies of centromeres across various species. RESULTS: We describe the CentromereArchitect, the first tool for the centromere annotation in a newly sequenced genome, apply it to the recently generated complete assembly of a human genome by the Telomere-to-Telomere consortium, generate the complete set of human monomers and high-order repeats for 'live' centromeres, and reveal a vast set of hybrid monomers that may represent the focal points of centromere evolution. AVAILABILITY AND IMPLEMENTATION: CentromereArchitect is publicly available on https://github.com/ablab/stringdecomposer/tree/ismb2021. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Centrômero , Genoma , Algoritmos , Sequência de Bases , Centrômero/genética , Humanos , TelômeroRESUMO
MOTIVATION: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate translation of reads into the monomer alphabet turns the notoriously difficult problem of assembling centromeres from reads (in the nucleotide alphabet) into a more tractable problem of assembling centromeres from translated reads. RESULTS: We describe a StringDecomposer (SD) algorithm for solving this problem, benchmark it on the set of long error-prone Oxford Nanopore reads generated by the Telomere-to-Telomere consortium and identify a novel (rare) monomer that extends the set of known X-chromosome specific monomers. Our identification of a novel monomer emphasizes the importance of identification of all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied SD to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. SD opens a possibility to generate a complete set of human monomers and HORs for using in the ongoing efforts to generate the complete assembly of the human genome. AVAILABILITY AND IMPLEMENTATION: StringDecomposer is publicly available on https://github.com/ablab/stringdecomposer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Centrômero , Nanoporos , Algoritmos , Centrômero/genética , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Análise de Sequência de DNA , Sequências de Repetição em TandemRESUMO
BACKGROUND: Graph-based representation of genome assemblies has been recently used in different contexts - from improved reconstruction of plasmid sequences and refined analysis of metagenomic data to read error correction and reference-free haplotype reconstruction. While many of these applications heavily utilize the alignment of long nucleotide sequences to assembly graphs, first general-purpose software tools for finding such alignments have been released only recently and their deficiencies and limitations are yet to be discovered. Moreover, existing tools can not perform alignment of amino acid sequences, which could prove useful in various contexts - in particular the analysis of metagenomic sequencing data. RESULTS: In this work we present a novel SPAligner (Saint-Petersburg Aligner) tool for aligning long diverged nucleotide and amino acid sequences to assembly graphs. We demonstrate that SPAligner is an efficient solution for mapping third generation sequencing reads onto assembly graphs of various complexity and also show how it can facilitate the identification of known genes in complex metagenomic datasets. CONCLUSIONS: Our work will facilitate accelerating the development of graph-based approaches in solving sequence to genome assembly alignment problem. SPAligner is implemented as a part of SPAdes tools library and is available on Github.
Assuntos
Algoritmos , Variação Genética , Alinhamento de Sequência , Sequência de Bases , Haplótipos/genética , Humanos , Software , Estatística como Assunto , beta-Lactamases/químicaRESUMO
Existing human genome assemblies have almost entirely excluded repetitive sequences within and near centromeres, limiting our understanding of their organization, evolution, and functions, which include facilitating proper chromosome segregation. Now, a complete, telomere-to-telomere human genome assembly (T2T-CHM13) has enabled us to comprehensively characterize pericentromeric and centromeric repeats, which constitute 6.2% of the genome (189.9 megabases). Detailed maps of these regions revealed multimegabase structural rearrangements, including in active centromeric repeat arrays. Analysis of centromere-associated sequences uncovered a strong relationship between the position of the centromere and the evolution of the surrounding DNA through layered repeat expansions. Furthermore, comparisons of chromosome X centromeres across a diverse panel of individuals illuminated high degrees of structural, epigenetic, and sequence variation in these complex and rapidly evolving regions.
Assuntos
Centrômero/genética , Mapeamento Cromossômico , Epigênese Genética , Genoma Humano , Evolução Molecular , Genômica , Humanos , Sequências Repetitivas de Ácido NucleicoRESUMO
Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion-base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies.
Assuntos
Genoma Humano , Projeto Genoma Humano , Análise de Sequência de DNA/normas , Linhagem Celular , Cromossomos Artificiais Bacterianos/genética , Cromossomos Humanos/genética , Humanos , Valores de ReferênciaRESUMO
BACKGROUND: Since the prolonged use of insecticidal proteins has led to toxin resistance, it is important to search for novel insecticidal protein genes (IPGs) that are effective in controlling resistant insect populations. IPGs are usually encoded in the genomes of entomopathogenic bacteria, especially in large plasmids in strains of the ubiquitous soil bacteria, Bacillus thuringiensis (Bt). Since there are often multiple similar IPGs encoded by such plasmids, their assemblies are typically fragmented and many IPGs are scattered through multiple contigs. As a result, existing gene prediction tools (that analyze individual contigs) typically predict partial rather than complete IPGs, making it difficult to conduct downstream IPG engineering efforts in agricultural genomics. METHODS: Although it is difficult to assemble IPGs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding a single IPG. RESULTS: We describe ORFograph, a pipeline for predicting IPGs in assembly graphs, benchmark it on (meta)genomic datasets, and discover nearly a hundred novel IPGs. This work shows that graph-aware gene prediction tools enable the discovery of greater diversity of IPGs from (meta)genomes. CONCLUSIONS: We demonstrated that analysis of the assembly graphs reveals novel candidate IPGs. ORFograph identified both already known genes "hidden" in assembly graphs and potential novel IPGs that evaded existing tools for IPG identification. As ORFograph is fast, one could imagine a pipeline that processes many (meta)genomic assembly graphs to identify even more novel IPGs for phenotypic testing than would previously be inaccessible by traditional gene-finding methods. While here we demonstrated the results of ORFograph only for IPGs, the proposed approach can be generalized to any class of genes. Video abstract.
Assuntos
Inseticidas , Algoritmos , Genômica , Metagenoma , MetagenômicaRESUMO
OBJECTIVES: Despite the frequency of disasters in Africa, almost nothing is known about ethnic affiliations in relation to psychopathology after such incidents. This study examined the mental health outcomes of members of 7 major ethnic groups exposed to the 1998 terrorist bombing of the US Embassy in Nairobi, Kenya. METHODS: Approximately 8 to 10 months after the disaster, 229 civilian employees, 99 locally engaged staff workers of the US State Department and the US Agency for International Development, and 64 workers of the Kenyan Red Cross Society (total N=392) were assessed with the Diagnostic Interview Schedule for the Diagnostic and Statistical Manual of Mental Disorders (Fourth Edition). Additional data were gathered on demographic characteristics, disaster exposures and injuries, and ethnic affiliations. RESULTS: Disaster-related post-traumatic stress disorder (PTSD) was significantly less prevalent among members of the Kikuyu group (28%) and post-disaster major depression was significantly more prevalent among members of the Meru group (64%), compared with all others in the sample. Preexisting psychopathology and disaster injury were independently associated with bombing-related psychopathology. CONCLUSIONS: Further study of disaster-related psychopathology in relation to African ethnic affiliations is needed to better understand these associations and to assist in planning resources and interventions for African disaster survivors. (Disaster Med Public Health Preparedness. 2018; 12: 360-365).