ABSTRACT
Indigenous Australians harbour rich and unique genomic diversity. However, Aboriginal and Torres Strait Islander ancestries are historically under-represented in genomics research and almost completely missing from reference datasets1-3. Addressing this representation gap is critical, both to advance our understanding of global human genomic diversity and as a prerequisite for ensuring equitable outcomes in genomic medicine. Here we apply population-scale whole-genome long-read sequencing4 to profile genomic structural variation across four remote Indigenous communities. We uncover an abundance of large insertion-deletion variants (20-49 bp; n = 136,797), structural variants (50 bp-50 kb; n = 159,912) and regions of variable copy number (>50 kb; n = 156). The majority of variants are composed of tandem repeat or interspersed mobile element sequences (up to 90%) and have not been previously annotated (up to 62%). A large fraction of structural variants appear to be exclusive to Indigenous Australians (12% lower-bound estimate) and most of these are found in only a single community, underscoring the need for broad and deep sampling to achieve a comprehensive catalogue of genomic structural variation across the Australian continent. Finally, we explore short tandem repeats throughout the genome to characterize allelic diversity at 50 known disease loci5, uncover hundreds of novel repeat expansion sites within protein-coding genes, and identify unique patterns of diversity and constraint among short tandem repeat sequences. Our study sheds new light on the dimensions and dynamics of genomic structural variation within and beyond Australia.
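The size classes quoted in this abstract can be expressed as a small lookup. The following is a minimal sketch, not the authors' pipeline: the function name `classify_variant` and the label for variants shorter than 20 bp are assumptions made for illustration, and the 50 bp lower bound for structural variants is assumed from the stated range.

```python
def classify_variant(length_bp: int) -> str:
    """Assign a variant to a size class using the ranges quoted in the abstract.

    The "small variant" label for lengths under 20 bp is an assumption;
    the abstract does not name that class.
    """
    if length_bp < 20:
        return "small variant"
    if length_bp < 50:
        return "large indel"  # 20-49 bp
    if length_bp <= 50_000:
        return "structural variant"  # 50 bp-50 kb
    return "copy-number variable region"  # >50 kb


# Hypothetical variant lengths chosen to sit on either side of each boundary.
examples = [25, 49, 50, 300, 50_000, 120_000]
for n in examples:
    print(n, classify_variant(n))
```

Boundary values such as 49 bp and 50 bp are the cases most worth checking when comparing call sets that use different size conventions.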
Subjects
Australian Aboriginal and Torres Strait Islander Peoples, Human Genome, Genomic Structural Variation, Humans, Alleles, Australia/ethnology, Australian Aboriginal and Torres Strait Islander Peoples/genetics, Datasets as Topic, DNA Copy Number Variations/genetics, Genetic Loci/genetics, Medical Genetics, Genomic Structural Variation/genetics, Genomics, INDEL Mutation/genetics, Interspersed Repetitive Sequences/genetics, Microsatellite Repeats/genetics, Human Genome/genetics
ABSTRACT
The notion that mobile units of nucleic acid known as transposable elements can operate as genomic controlling elements was put forward over six decades ago1,2. However, it was not until the advancement of genomic sequencing technologies that the abundance and repertoire of transposable elements were revealed, and they are now known to constitute up to two-thirds of mammalian genomes3,4. The presence of DNA regulatory regions including promoters, enhancers and transcription-factor-binding sites within transposable elements5-8 has led to the hypothesis that transposable elements have been co-opted to regulate mammalian gene expression and cell phenotype8-14. Mammalian transposable elements include recent acquisitions and ancient transposable elements that have been maintained in the genome over evolutionary time. The presence of ancient conserved transposable elements correlates positively with the likelihood of a regulatory function, but functional validation remains an essential step to identify transposable element insertions that have a positive effect on fitness. Here we show that CRISPR-Cas9-mediated deletion of a transposable element, namely the LINE-1 retrotransposon Lx9c11, in mice results in an exaggerated and lethal immune response to virus infection. Lx9c11 is critical for the neogenesis of a non-coding RNA (Lx9c11-RegoS) that regulates genes of the Schlafen family, reduces the hyperinflammatory phenotype and rescues lethality in virus-infected Lx9c11-/- mice. These findings provide evidence that a transposable element can control the immune system to favour host survival during virus infection.
Subjects
DNA Transposable Elements, Host Microbial Interactions, Immunity, Retroelements, Virus Diseases, Animals, CRISPR-Cas Systems/genetics, DNA Transposable Elements/genetics, DNA Transposable Elements/immunology, Molecular Evolution, Host Microbial Interactions/genetics, Host Microbial Interactions/immunology, Immunity/genetics, Mice, Untranslated RNA/genetics, Nucleic Acid Regulatory Sequences/genetics, Retroelements/genetics, Retroelements/immunology, Virus Diseases/genetics, Virus Diseases/immunology
ABSTRACT
BACKGROUND: Loss-of-function variants in MME (membrane metalloendopeptidase) are a known cause of recessive Charcot-Marie-Tooth neuropathy (CMT). A deep intronic variant, MME c.1188+428A>G (NM_000902.5), was identified through whole-genome sequencing (WGS) of two Australian families with recessive inheritance of axonal CMT using the seqr platform. MME c.1188+428A>G was detected in a homozygous state in Family 1, and in a compound heterozygous state with a known pathogenic MME variant (c.467del; p.Pro156Leufs*14) in Family 2. AIMS: We aimed to determine the pathogenicity of the MME c.1188+428A>G variant through segregation and splicing analysis. METHODS: The splicing impact of the deep intronic MME variant c.1188+428A>G was assessed using an in vitro exon-trapping assay. RESULTS: The exon-trapping assay demonstrated that the MME c.1188+428A>G variant created a novel splice donor site, resulting in the inclusion of an 83 bp pseudoexon between MME exons 12 and 13. The incorporation of the pseudoexon into the MME transcript is predicted to lead to a coding frameshift and a premature termination codon (PTC) in MME exon 14 (p.Ala397ProfsTer47). This PTC is likely to result in nonsense-mediated decay (NMD) of the MME transcript, leading to a pathogenic loss of function. INTERPRETATION: To our knowledge, this is the first report of a pathogenic deep intronic MME variant causing CMT. This is significant because deep intronic variants are missed by whole-exome sequencing screening methods. Individuals with CMT should be reassessed for deep intronic variants, with splicing impacts being considered in relation to the potential pathogenicity of variants.
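The frameshift prediction follows from simple arithmetic: an insertion into the coding sequence preserves the reading frame only if its length is a multiple of three. The sketch below checks this for the 83 bp pseudoexon reported above; the helper name `reading_frame_shift` is hypothetical, and the code does not attempt to reproduce the predicted protein consequence.

```python
def reading_frame_shift(inserted_bp: int) -> int:
    """Return how many nucleotides the downstream reading frame is displaced (0, 1 or 2)."""
    return inserted_bp % 3


pseudoexon_bp = 83  # pseudoexon length reported in the abstract
shift = reading_frame_shift(pseudoexon_bp)
is_frameshift = shift != 0

print(f"{pseudoexon_bp} bp insertion: frame displaced by {shift} nt, frameshift={is_frameshift}")
# prints "83 bp insertion: frame displaced by 2 nt, frameshift=True"
```

Because 83 = 27 x 3 + 2, every codon downstream of the pseudoexon is read two nucleotides out of register, which is consistent with the premature termination codon described in the results.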
Subjects
Charcot-Marie-Tooth Disease, Metalloendopeptidases, RNA Splicing, Adult, Female, Humans, Male, Charcot-Marie-Tooth Disease/genetics, Introns, Metalloendopeptidases/genetics, Mutation, Pedigree
ABSTRACT
The expression of genes encompasses their transcription into mRNA followed by translation into protein. In recent years, next-generation sequencing and mass spectrometry methods have profiled DNA, RNA and protein abundance in cells. However, there are currently no reference standards that are compatible across these genomic, transcriptomic and proteomic methods, and provide an integrated measure of gene expression. Here, we use synthetic biology principles to engineer a multi-omics control, termed pREF, that can act as a universal molecular standard for next-generation sequencing and mass spectrometry methods. The pREF sequence encodes 21 synthetic genes that can be in vitro transcribed into spike-in mRNA controls, and in vitro translated to generate matched protein controls. The synthetic genes provide qualitative controls that can measure sensitivity and quantitative accuracy of DNA, RNA and peptide detection. We demonstrate the use of pREF in metagenome DNA sequencing and RNA sequencing experiments and evaluate the quantification of proteins using mass spectrometry. Unlike previous spike-in controls, pREF can be independently propagated and the synthetic mRNA and protein controls can be sustainably prepared by recipient laboratories using common molecular biology techniques. Together, this provides a universal synthetic standard able to integrate genomic, transcriptomic and proteomic methods.
Subjects
DNA, Proteomics, Messenger RNA/genetics, Messenger RNA/metabolism, DNA/genetics, Genomics, RNA
ABSTRACT
BACKGROUND: Next-generation sequencing (NGS) can identify mutations in the human genome that cause disease and has been widely adopted in clinical diagnosis. However, the human genome contains many polymorphic, low-complexity, and repetitive regions that are difficult to sequence and analyze. Despite their difficulty, these regions include many clinically important sequences that can inform the treatment of human diseases and improve the diagnostic yield of NGS. RESULTS: To evaluate the accuracy by which these difficult regions are analyzed with NGS, we built an in silico decoy chromosome, along with corresponding synthetic DNA reference controls, that encode difficult and clinically important human genome regions, including repeats, microsatellites, HLA genes, and immune receptors. These controls provide a known ground-truth reference against which to measure the performance of diverse sequencing technologies, reagents, and bioinformatic tools. Using this approach, we provide a comprehensive evaluation of short- and long-read sequencing instruments, library preparation methods, and software tools and identify the errors and systematic bias that confound our resolution of these remaining difficult regions. CONCLUSIONS: This study provides an analytical validation of diagnosis using NGS in difficult regions of the human genome and highlights the challenges that remain to resolve these difficult regions.
Subjects
Human Genome, High-Throughput Nucleotide Sequencing, Chromosomes, High-Throughput Nucleotide Sequencing/methods, Humans, Microsatellite Repeats, DNA Sequence Analysis/methods, Software
ABSTRACT
Our understanding of the molecular pathology of posttraumatic stress disorder (PTSD) is evolving due to advances in sequencing technologies. With the recent emergence of Oxford Nanopore direct RNA-seq (dRNA-seq), it is now also possible to interrogate diverse RNA modifications, collectively known as the "epitranscriptome". Here, we present our analyses of the male and female mouse amygdala transcriptome and epitranscriptome, obtained using parallel Illumina RNA-seq and Oxford Nanopore dRNA-seq, associated with the acquisition of PTSD-like fear induced by Pavlovian cued-fear conditioning. We report significant sex-specific differences in the amygdala transcriptional response during fear acquisition and a range of shared and dimorphic epitranscriptomic signatures. Differential RNA modifications are enriched among mRNA transcripts associated with neurotransmitter regulation and mitochondrial function, many of which have been previously implicated in PTSD. Very few differentially modified transcripts are also differentially expressed, suggesting an influential, expression-independent role for epitranscriptional regulation in PTSD-like fear acquisition.
ABSTRACT
Library adaptors are short oligonucleotides that are attached to RNA and DNA samples in preparation for next-generation sequencing (NGS). Adaptors can also include additional functional elements, such as sample indexes and unique molecular identifiers, to improve library analysis. Here, we describe Control Library Adaptors, termed CAPTORs, that measure the accuracy and reliability of NGS. CAPTORs can be integrated within the library preparation of RNA and DNA samples, and their encoded information is retrieved during sequencing. We show how CAPTORs can measure the accuracy of nanopore sequencing, evaluate the quantitative performance of metagenomic and RNA sequencing, and improve normalisation between samples. CAPTORs can also be customised for clinical diagnoses, correcting systematic sequencing errors and improving the diagnosis of pathogenic BRCA1/2 variants in breast cancer. CAPTORs are a simple and effective method to increase the accuracy and reliability of NGS, enabling comparisons between samples, reagents and laboratories, and supporting the use of nanopore sequencing for clinical diagnosis.
Subjects
Nanopore Sequencing, Reproducibility of Results, Gene Library, High-Throughput Nucleotide Sequencing/methods, RNA
ABSTRACT
DNA synthesis in vitro has enabled the rapid production of reference standards. These are used as controls, and allow measurement and improvement of the accuracy and quality of diagnostic tests. Current reference standards typically represent target genetic material, and act only as positive controls to assess test sensitivity. However, negative controls are also required to evaluate test specificity. Using a pair of chimeric A/B RNA standards, we incorporated both positive and negative controls into diagnostic testing for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The chimeric standards comprise target regions for RT-PCR primer/probe sets that are joined in tandem across two separate synthetic molecules. Accordingly, a target region that is present in standard A provides a positive control, whilst its absence from standard B provides a negative control. This design enables cross-validation of positive and negative controls between the paired standards in the same reaction, under identical conditions. Control and test failures can therefore be distinguished, increasing confidence in the accuracy of results. The chimeric A/B standards were assessed using the US Centers for Disease Control and Prevention real-time RT-PCR protocol, and showed results congruent with other commercial controls in detecting SARS-CoV-2 in patient samples. This chimeric reference standard design approach offers extensive flexibility, allowing representation of diverse genetic features and distantly related sequences, even from different organisms.
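The paired-control logic described above can be sketched as a small decision function. This is an illustrative reading of the abstract, not the published analysis: the function name `interpret_paired_controls`, the boolean inputs and the outcome labels are all assumptions.

```python
def interpret_paired_controls(target_in: str, detected_in_a: bool, detected_in_b: bool) -> str:
    """Interpret one primer/probe target across the paired A/B standards.

    target_in names the standard ("A" or "B") that carries the target region;
    the other standard lacks it and so serves as the negative control.
    The outcome labels are hypothetical.
    """
    if target_in not in ("A", "B"):
        raise ValueError("target_in must be 'A' or 'B'")
    positive_signal = detected_in_a if target_in == "A" else detected_in_b
    negative_signal = detected_in_b if target_in == "A" else detected_in_a

    if positive_signal and not negative_signal:
        return "controls valid"
    if not positive_signal and not negative_signal:
        return "positive control failed"  # e.g. the assay did not amplify at all
    if positive_signal and negative_signal:
        return "negative control failed"  # e.g. contamination or non-specific signal
    return "both controls failed"


# A target carried by standard A, detected only in standard A.
print(interpret_paired_controls("A", detected_in_a=True, detected_in_b=False))
```

Since each target is read in both standards, a missing signal in a patient sample can be separated from a failed reaction, which is the distinction between test and control failures that the abstract emphasises.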
Subjects
Chimera, Amino Acid Sequence, COVID-19/diagnosis, COVID-19/virology, Humans, Viral RNA/standards, Reference Standards, Reproducibility of Results, SARS-CoV-2/chemistry, SARS-CoV-2/genetics, SARS-CoV-2/isolation & purification, Sensitivity and Specificity
ABSTRACT
Standard units of measurement are required for the quantitative description of nature; however, few standard units have been established for genomics to date. Here, we have developed a synthetic DNA ladder that defines a quantitative standard unit that can measure DNA sequence abundance within a next-generation sequencing library. The ladder can be spiked into a DNA sample, and act as an internal scale that measures quantitative genetic features. Unlike previous spike-ins, the ladder is encoded within a single molecule, and can be equivalently and independently synthesized by different laboratories. We show how the ladder can measure diverse quantitative features, including human genetic variation and microbial abundance, and also estimate uncertainty due to technical variation and improve normalization between libraries. This ladder provides an independent quantitative unit that can be used with any organism, application or technology, thereby providing a common metric by which genomes can be measured.
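One way to picture an internal scale is as a calibration line: ladder elements present at known relative copy numbers are counted in the library, and the fitted relationship converts any other count into ladder units. The sketch below assumes four ladder steps at copy numbers 1, 2, 4 and 8 and invented read counts; these values and the function names are hypothetical, and the fitting procedure is ordinary least squares on log2 values rather than the authors' method.

```python
import math


def fit_ladder(copy_numbers, observed_counts):
    """Least-squares fit of log2(observed count) against log2(known copy number)."""
    xs = [math.log2(c) for c in copy_numbers]
    ys = [math.log2(o) for o in observed_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def counts_to_ladder_units(count, slope, intercept):
    """Convert a raw count into copy-number units defined by the fitted ladder."""
    return 2 ** ((math.log2(count) - intercept) / slope)


# Assumed ladder design and invented counts, for illustration only.
copy_numbers = [1, 2, 4, 8]
observed = [50, 100, 200, 400]

slope, intercept = fit_ladder(copy_numbers, observed)
units = counts_to_ladder_units(150, slope, intercept)
print(f"slope={slope:.3f}, 150 reads = {units:.2f} ladder units")
```

A slope close to 1 indicates that counts scale proportionally with copy number, and fitting the ladder separately in each library places counts from different libraries on a shared scale.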
Subjects
DNA/analysis, DNA/chemical synthesis, Base Sequence, DNA/genetics, Gene Dosage, Gene Library, Genomics, Humans
ABSTRACT
Next-generation sequencing (NGS) has been widely adopted to identify genetic variants and investigate their association with disease. However, the analysis of sequencing data remains challenging because of the complexity of human genetic variation and confounding errors introduced during library preparation, sequencing and analysis. We have developed a set of synthetic DNA spike-ins, termed 'sequins' (sequencing spike-ins), that are directly added to DNA samples before library preparation. Sequins can be used to measure technical biases and to act as internal quantitative and qualitative controls throughout the sequencing workflow. This step-by-step protocol explains the use of sequins for both whole-genome and targeted sequencing of the human genome. This includes instructions regarding the dilution and addition of sequins to human DNA samples, followed by the bioinformatic steps required to separate sequin- and sample-derived sequencing reads and to evaluate the diagnostic performance of the assay. These practical guidelines are accompanied by a broader discussion of the conceptual and statistical principles that underpin the design of sequin standards. This protocol is suitable for users with standard laboratory and bioinformatic experience. The laboratory steps require ~1-4 d and the bioinformatic steps (which can be performed with the provided example data files) take an additional day.