ABSTRACT
Short tandem repeats (STRs) are a class of repetitive elements, composed of tandem arrays of 1-6 base pair sequence motifs, that comprise a substantial fraction of the human genome. STR expansions can cause a wide range of neurological and neuromuscular conditions, known as repeat expansion disorders, whose age of onset, severity, penetrance and/or clinical phenotype are influenced by the length of the repeats and their sequence composition. The presence of non-canonical motifs, depending on the type, frequency and position within the repeat tract, can alter clinical outcomes by modifying somatic and intergenerational repeat stability, gene expression and mutant transcript-mediated and/or protein-mediated toxicities. Here, we review the diverse structural conformations of repeat expansions, technological advances for the characterization of changes in sequence composition, their clinical correlations and the impact on disease mechanisms.
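As an illustration of the definition above (tandem arrays of 1-6 base pair motifs), the following minimal sketch scans a DNA string for perfect tandem arrays. The function name `find_strs` and the `min_copies` threshold are assumptions made for this example and do not come from the review; real STR callers also handle imperfect repeats and interruptions, which this sketch ignores.

```python
import re

def find_strs(seq, min_copies=4, max_motif=6):
    """Return (start, motif, copies) for perfect tandem arrays.

    Hypothetical helper: a motif of 1..max_motif bases must be repeated
    at least min_copies times in a row to be reported.
    """
    # Lazy quantifier so the shortest repeating unit is chosen as the motif.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits

if __name__ == "__main__":
    print(find_strs("TTCAGCAGCAGCAGCAGAA"))
```

Because the pattern requires exact copies, a single non-canonical motif inside a tract splits it into two shorter hits, which is one reason interruptions matter for repeat sizing.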
Subjects
Microsatellite Repeats; Humans; Microsatellite Repeats/genetics; DNA Repeat Expansion/genetics; Genome, Human
ABSTRACT
Expansion of a single repetitive DNA sequence, termed a tandem repeat (TR), is known to cause more than 50 diseases [1,2]. However, repeat expansions are often not explored beyond neurological and neurodegenerative disorders. In some cancers, mutations accumulate in short tracts of TRs, a phenomenon termed microsatellite instability; however, larger repeat expansions have not been systematically analysed in cancer [3-8]. Here we identified TR expansions in 2,622 cancer genomes spanning 29 cancer types. In seven cancer types, we found 160 recurrent repeat expansions (rREs), most of which (155/160) were subtype specific. We found that rREs were non-uniformly distributed in the genome with enrichment near candidate cis-regulatory elements, suggesting a potential role in gene regulation. One rRE, a GAAA-repeat expansion, located near a regulatory element in the first intron of UGT2B7 was detected in 34% of renal cell carcinoma samples and was validated by long-read DNA sequencing. Moreover, in preliminary experiments, treating cells that harbour this rRE with a GAAA-targeting molecule led to a dose-dependent decrease in cell proliferation. Overall, our results suggest that rREs may be an important but unexplored source of genetic variation in human cancer, and we provide a comprehensive catalogue for further study.
Subjects
DNA Repeat Expansion; Genome, Human; Neoplasms; Humans; Base Sequence; DNA Repeat Expansion/genetics; Genome, Human/genetics; Neoplasms/classification; Neoplasms/genetics; Neoplasms/pathology; Sequence Analysis, DNA; Gene Expression Regulation; Regulatory Elements, Transcriptional/genetics; Introns/genetics; Carcinoma, Renal Cell/genetics; Carcinoma, Renal Cell/pathology; Cell Proliferation/drug effects; Reproducibility of Results
ABSTRACT
Spinal muscular atrophy, a leading cause of early infant death, is caused by bi-allelic mutations of SMN1. Sequence analysis of SMN1 is challenging due to high sequence similarity with its paralog SMN2. Both genes have variable copy numbers across populations. Furthermore, without pedigree information, it is currently not possible to identify silent carriers (2+0) with two copies of SMN1 on one chromosome and zero copies on the other. We developed Paraphase, an informatics method that identifies full-length SMN1 and SMN2 haplotypes, determines the gene copy numbers, and calls phased variants using long-read PacBio HiFi data. The SMN1 and SMN2 copy-number calls by Paraphase are highly concordant with orthogonal methods (99.2% for SMN1 and 100% for SMN2). We applied Paraphase to 438 samples across 5 ethnic populations to conduct a population-wide haplotype analysis of these highly homologous genes. We identified major SMN1 and SMN2 haplogroups and characterized their co-segregation through pedigree-based analyses. We identified two SMN1 haplotypes that form a common two-copy SMN1 allele in African populations. Testing positive for these two haplotypes in an individual with two copies of SMN1 gives a silent carrier risk of 88.5%, which is significantly higher than the currently used marker (1.7%-3.0%). Extending beyond simple copy-number testing, Paraphase can detect pathogenic variants and enable potential haplotype-based screening of silent carriers through statistical phasing of haplotypes into alleles. Future analysis of larger population data will allow identification of more diverse haplotypes and genetic markers for silent carriers.
Subjects
Muscular Atrophy, Spinal; Infant; Humans; Muscular Atrophy, Spinal/genetics; Muscular Atrophy, Spinal/diagnosis; Mutation; Gene Dosage; Pedigree; Sequence Analysis; Survival of Motor Neuron 1 Protein/genetics; Survival of Motor Neuron 2 Protein/genetics
ABSTRACT
Adult-onset cerebellar ataxias are a group of neurodegenerative conditions that challenge both genetic discovery and molecular diagnosis. In this study, we identified an intronic (GAA) repeat expansion in fibroblast growth factor 14 (FGF14). Genetic analysis of 95 Australian individuals with adult-onset ataxia identified four (4.2%) with (GAA)>300 and a further nine individuals with (GAA)>250. PCR and long-read sequence analysis revealed these were pure (GAA) repeats. In comparison, no control subjects had (GAA)>300 and only 2/311 control individuals (0.6%) had a pure (GAA)>250. In a German validation cohort, 9/104 (8.7%) of affected individuals had (GAA)>335 and a further six had (GAA)>250, whereas 10/190 (5.3%) control subjects had (GAA)>250 but none were (GAA)>335. The combined data suggest (GAA)>335 are disease causing and fully penetrant (p = 6.0 × 10⁻⁸, OR = 72 [95% CI = 4.3-1,227]), while (GAA)>250 is likely pathogenic with reduced penetrance. Affected individuals had an adult-onset, slowly progressive cerebellar ataxia with variable features including vestibular impairment, hyper-reflexia, and autonomic dysfunction. A negative correlation between age at onset and repeat length was observed (R² = 0.44, p = 0.00045, slope = -0.12) and identification of a shared haplotype in a minority of individuals suggests that the expansion can be inherited or generated de novo during meiotic division. This study demonstrates the power of genome sequencing and advanced bioinformatic tools to identify novel repeat expansions via model-free, genome-wide analysis and identifies SCA50/ATX-FGF14 as a frequent cause of adult-onset ataxia.
Subjects
Cerebellar Ataxia; Fibroblast Growth Factors; Friedreich Ataxia; Trinucleotide Repeat Expansion; Adult; Humans; Ataxia/genetics; Australia; Cerebellar Ataxia/genetics; Friedreich Ataxia/genetics; Trinucleotide Repeat Expansion/genetics
ABSTRACT
Tandem DNA repeats vary in the size and sequence of each unit (motif). When expanded, these tandem DNA repeats have been associated with more than 40 monogenic disorders [1]. Their involvement in disorders with complex genetics is largely unknown, as is the extent of their heterogeneity. Here we investigated the genome-wide characteristics of tandem repeats that had motifs with a length of 2-20 base pairs in 17,231 genomes of families containing individuals with autism spectrum disorder (ASD) [2,3] and population control individuals [4]. We found extensive polymorphism in the size and sequence of motifs. Many of the tandem repeat loci that we detected correlated with cytogenetic fragile sites. At 2,588 loci, gene-associated expansions of tandem repeats that were rare among population control individuals were significantly more prevalent among individuals with ASD than their siblings without ASD, particularly in exons and near splice junctions, and in genes related to the development of the nervous system and cardiovascular system or muscle. Rare tandem repeat expansions had a prevalence of 23.3% in children with ASD compared with 20.7% in children without ASD, which suggests that tandem repeat expansions make a collective contribution to the risk of ASD of 2.6%. These rare tandem repeat expansions included previously undescribed ASD-linked expansions in DMPK and FXN, which are associated with neuromuscular conditions, and in previously unknown loci such as FGF14 and CACNB1. Rare tandem repeat expansions were associated with lower IQ and adaptive ability. Our results show that tandem DNA repeat expansions contribute strongly to the genetic aetiology and phenotypic complexity of ASD.
Subjects
Autism Spectrum Disorder/genetics; DNA Repeat Expansion/genetics; Genome, Human/genetics; Genomics; Tandem Repeat Sequences/genetics; Female; Fibroblast Growth Factors/genetics; Genetic Predisposition to Disease; Humans; Intelligence/genetics; Iron-Binding Proteins/genetics; Male; Myotonin-Protein Kinase/genetics; Nucleotide Motifs; Polymorphism, Genetic; Frataxin
ABSTRACT
The RecQ DNA helicase WRN is a synthetic lethal target for cancer cells with microsatellite instability (MSI), a form of genetic hypermutability that arises from impaired mismatch repair [1-4]. Depletion of WRN induces widespread DNA double-strand breaks in MSI cells, leading to cell cycle arrest and/or apoptosis. However, the mechanism by which WRN protects MSI-associated cancers from double-strand breaks remains unclear. Here we show that TA-dinucleotide repeats are highly unstable in MSI cells and undergo large-scale expansions, distinct from previously described insertion or deletion mutations of a few nucleotides [5]. Expanded TA repeats form non-B DNA secondary structures that stall replication forks, activate the ATR checkpoint kinase, and require unwinding by the WRN helicase. In the absence of WRN, the expanded TA-dinucleotide repeats are susceptible to cleavage by the MUS81 nuclease, leading to massive chromosome shattering. These findings identify a distinct biomarker that underlies the synthetic lethal dependence on WRN, and support the development of therapeutic agents that target WRN for MSI-associated cancers.
Subjects
DNA Breaks, Double-Stranded; DNA Repeat Expansion/genetics; Dinucleotide Repeats/genetics; Neoplasms/genetics; Werner Syndrome Helicase/metabolism; Ataxia Telangiectasia Mutated Proteins/metabolism; Cell Line, Tumor; Chromosomes, Human/genetics; Chromosomes, Human/metabolism; Chromothripsis; DNA Cleavage; DNA Replication; DNA-Binding Proteins/metabolism; Endodeoxyribonucleases/metabolism; Endonucleases/metabolism; Genomic Instability; Humans; Recombinases/metabolism
ABSTRACT
Huntington disease (HD) is caused by a CAG repeat expansion in the huntingtin (HTT) gene. Although the length of this repeat is inversely correlated with age of onset (AOO), it does not fully explain the variability in AOO. We assessed the sequence downstream of the CAG repeat in HTT [reference: (CAG)n-CAA-CAG], since variants within this region have been previously described, but no study of AOO has been performed. These analyses identified a variant that results in complete loss of interrupting (LOI) adenine nucleotides in this region [(CAG)n-CAG-CAG]. Analysis of multiple HD pedigrees showed that this LOI variant is associated with dramatically earlier AOO (average of 25 years) despite the same polyglutamine length as in individuals with the interrupting penultimate CAA codon. This LOI allele is particularly frequent in persons with reduced penetrance alleles who manifest with HD and increases the likelihood of presenting clinically with HD with a CAG of 36-39 repeats. Further, we show that the LOI variant is associated with increased somatic repeat instability, highlighting this as a significant driver of this effect. These findings indicate that the number of uninterrupted CAG repeats, which is lengthened by the LOI, is the most significant contributor to AOO of HD and is more significant than polyglutamine length, which is not altered in these individuals. In addition, we identified another variant in this region, where the CAA-CAG sequence is duplicated, which was associated with later AOO. Identification of these cis-acting modifiers has potentially important implications for genetic counselling in HD-affected families.
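The distinction drawn above between polyglutamine length and uninterrupted CAG length can be made concrete with a short sketch. It relies only on the fact that CAG and CAA both encode glutamine; the function name `repeat_summary` and the example repeat sizes are assumptions chosen for illustration, not values from the study.

```python
def repeat_summary(tract):
    """Summarise a glutamine-coding tract made only of CAG/CAA codons.

    Hypothetical helper: returns the polyglutamine length (all codons)
    and the longest run of uninterrupted CAG codons.
    """
    codons = [tract[i:i + 3] for i in range(0, len(tract), 3)]
    if any(codon not in ("CAG", "CAA") for codon in codons):
        raise ValueError("tract must consist of CAG/CAA codons only")
    longest = run = 0
    for codon in codons:
        run = run + 1 if codon == "CAG" else 0  # a CAA resets the run
        longest = max(longest, run)
    return {"polyQ": len(codons), "uninterrupted_CAG": longest}

# Illustrative alleles with the same polyglutamine length (42 codons).
reference = "CAG" * 40 + "CAA" + "CAG"      # (CAG)n-CAA-CAG
loi = "CAG" * 40 + "CAG" + "CAG"            # loss of interruption
duplication = "CAG" * 38 + "CAACAG" * 2     # CAA-CAG duplicated

if __name__ == "__main__":
    for name, allele in [("reference", reference), ("LOI", loi),
                         ("duplication", duplication)]:
        print(name, repeat_summary(allele))
```

In this toy example all three alleles encode 42 glutamines, yet the uninterrupted CAG run is 40, 42 and 38 codons respectively, mirroring the direction of the reported AOO effects.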
Subjects
Codon/genetics; Huntington Disease/genetics; Huntington Disease/pathology; Peptides/genetics; Trinucleotide Repeat Expansion/genetics; Adolescent; Adult; Age of Onset; Child; Female; Humans; Male; Middle Aged; Pedigree
ABSTRACT
Genomic technologies such as next-generation sequencing (NGS) are revolutionizing molecular diagnostics and clinical medicine. However, these approaches have proven inefficient at identifying pathogenic repeat expansions. Here, we apply a collection of bioinformatics tools that can be utilized to identify either known or novel expanded repeat sequences in NGS data. We performed genetic studies of a cohort of 35 individuals from 22 families with a clinical diagnosis of cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome (CANVAS). Analysis of whole-genome sequence (WGS) data with five independent algorithms identified a recessively inherited intronic repeat expansion [(AAGGG)exp] in the gene encoding Replication Factor C1 (RFC1). This motif, not reported in the reference sequence, localized to an Alu element and replaced the reference (AAAAG)11 short tandem repeat. Genetic analyses confirmed the pathogenic expansion in 18 of 22 CANVAS-affected families and identified a core ancestral haplotype, estimated to have arisen in Europe more than twenty-five thousand years ago. WGS of the four RFC1-negative CANVAS-affected families identified plausible variants in three, with genomic re-diagnosis of SCA3, spastic ataxia of the Charlevoix-Saguenay type, and SCA45. This study identified the genetic basis of CANVAS and demonstrated that these improved bioinformatics tools increase the diagnostic utility of WGS to determine the genetic basis of a heterogeneous group of clinically overlapping neurogenetic disorders.
Subjects
Cerebellar Ataxia/etiology; Computational Biology/methods; Introns; Microsatellite Repeats; Polyneuropathies/etiology; Replication Protein C/genetics; Sensation Disorders/etiology; Vestibular Diseases/etiology; Algorithms; Cerebellar Ataxia/pathology; Cohort Studies; Family; Female; Genomics; Humans; Male; Middle Aged; Polyneuropathies/pathology; Sensation Disorders/pathology; Syndrome; Vestibular Diseases/pathology; Whole Genome Sequencing
ABSTRACT
We report an inborn error of metabolism caused by an expansion of a GCA-repeat tract in the 5' untranslated region of the gene encoding glutaminase (GLS) that was identified through detailed clinical and biochemical phenotyping, combined with whole-genome sequencing. The expansion was observed in three unrelated patients who presented with an early-onset delay in overall development, progressive ataxia, and elevated levels of glutamine. In addition to ataxia, one patient also showed cerebellar atrophy. The expansion was associated with a relative deficiency of GLS messenger RNA transcribed from the expanded allele, which probably resulted from repeat-mediated chromatin changes upstream of the GLS repeat. Our discovery underscores the importance of careful examination of regions of the genome that are typically excluded from or poorly captured by exome sequencing.
Subjects
Amino Acid Metabolism, Inborn Errors/genetics; Ataxia/genetics; Developmental Disabilities/genetics; Glutaminase/deficiency; Glutaminase/genetics; Glutamine/metabolism; Microsatellite Repeats; Mutation; Atrophy/genetics; Cerebellum/pathology; Child, Preschool; Female; Genotype; Glutamine/analysis; Humans; Male; Phenotype; Polymerase Chain Reaction; Whole Genome Sequencing
ABSTRACT
OBJECTIVE: The role of the survival of motor neuron (SMN) gene in amyotrophic lateral sclerosis (ALS) is unclear, with several conflicting reports. A decisive result on this topic is needed, given that treatment options are available now for SMN deficiency. METHODS: In this largest multicenter case control study to evaluate the effect of SMN1 and SMN2 copy numbers in ALS, we used whole genome sequencing data from Project MinE data freeze 2. SMN copy numbers of 6,375 patients with ALS and 2,412 controls were called from whole genome sequencing data, and the reliability of the calls was tested with multiplex ligation-dependent probe amplification data. RESULTS: The copy number distribution of SMN1 and SMN2 between cases and controls did not show any statistical differences (binomial multivariate logistic regression SMN1 p = 0.54 and SMN2 p = 0.49). In addition, the copy number of SMN did not associate with patient survival (Royston-Parmar; SMN1 p = 0.78 and SMN2 p = 0.23) or age at onset (Royston-Parmar; SMN1 p = 0.75 and SMN2 p = 0.63). INTERPRETATION: In our well-powered study, there was no association of SMN1 or SMN2 copy numbers with the risk of ALS or ALS disease severity. This suggests that changing SMN protein levels in the physiological range may not modify ALS disease course. This is an important finding in the light of emerging therapies targeted at SMN deficiencies. ANN NEUROL 2021;89:686-697.
Subjects
Amyotrophic Lateral Sclerosis/genetics; Amyotrophic Lateral Sclerosis/pathology; Survival of Motor Neuron 1 Protein/genetics; Case-Control Studies; Cohort Studies; Female; Gene Dosage; Humans; Male; Reproducibility of Results; Risk Factors; Severity of Illness Index; Survival of Motor Neuron 2 Protein/genetics; Whole Genome Sequencing
ABSTRACT
Responsible for the metabolism of ~21% of clinically used drugs, CYP2D6 is a critical component of personalized medicine initiatives. Genotyping CYP2D6 is challenging due to sequence similarity with its pseudogene paralog CYP2D7 and a high number and variety of common structural variants (SVs). Here we describe a novel bioinformatics method, Cyrius, that accurately genotypes CYP2D6 using whole-genome sequencing (WGS) data. We show that Cyrius has superior performance (96.5% concordance with truth genotypes) compared to existing methods (84-86.8%). After implementing the improvements identified from the comparison against the truth data, Cyrius's accuracy has since been improved to 99.3%. Using Cyrius, we built a haplotype frequency database from 2504 ethnically diverse samples and estimate that SV-containing star alleles are more frequent than previously reported. Cyrius will be an important tool to incorporate pharmacogenomics in WGS-based precision medicine initiatives.
Subjects
Cytochrome P-450 CYP2D6/genetics; Genotyping Techniques/methods; Alleles; Computational Biology/methods; Ethnicity/genetics; Genotype; Haplotypes/genetics; Humans; Polymorphism, Genetic/genetics; Whole Genome Sequencing/methods
ABSTRACT
Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalog of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of 17 individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased "Platinum" variant catalog of 4.7 million single-nucleotide variants (SNVs) plus 0.7 million small (1-50 bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and 11 children of this pedigree. Platinum genotypes are highly concordant with the current catalog of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%) and add a validated truth catalog that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission ("nonplatinum") revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.
Subjects
Genome, Human/genetics; Genomics; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Algorithms; Databases, Genetic; Exome/genetics; Genotype; Humans; INDEL Mutation/genetics; Pedigree; Polymorphism, Single Nucleotide; Software
ABSTRACT
Identifying large expansions of short tandem repeats (STRs), such as those that cause amyotrophic lateral sclerosis (ALS) and fragile X syndrome, is challenging for short-read whole-genome sequencing (WGS) data. A solution to this problem is an important step toward integrating WGS into precision medicine. We developed a software tool called ExpansionHunter that, using PCR-free WGS short-read data, can genotype repeats at the locus of interest, even if the expanded repeat is larger than the read length. We applied our algorithm to WGS data from 3001 ALS patients who have been tested for the presence of the C9orf72 repeat expansion with repeat-primed PCR (RP-PCR). Compared against this truth data, ExpansionHunter correctly classified all (212/212, 95% CI [0.98, 1.00]) of the expanded samples as either expansions (208) or potential expansions (4). Additionally, 99.9% (2786/2789, 95% CI [0.997, 1.00]) of the wild-type samples were correctly classified as wild type by this method with the remaining three samples identified as possible expansions. We further applied our algorithm to a set of 152 samples in which every sample had one of eight different pathogenic repeat expansions, including those associated with fragile X syndrome, Friedreich's ataxia, and Huntington's disease, and correctly flagged all but one of the known repeat expansions. Thus, ExpansionHunter can be used to accurately detect known pathogenic repeat expansions and provides researchers with a tool that can be used to identify new pathogenic repeat expansions.
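The proportions and confidence intervals quoted above can be approximated with standard binomial formulas. The abstract does not say which interval method was used, so the sketch below is an assumption: it uses the exact (Clopper-Pearson) lower bound in the special case where every trial succeeds, and a Wilson score interval for the general case. Both function names are hypothetical.

```python
import math

def exact_lower_bound_all_successes(n, confidence=0.95):
    """Exact lower confidence bound when all n trials succeed.

    With k == n the Clopper-Pearson lower limit has the closed form
    (alpha / 2) ** (1 / n).
    """
    alpha = 1 - confidence
    return (alpha / 2) ** (1 / n)

def wilson_interval(k, n, z=1.959964):
    """Wilson score interval for k successes in n trials (95% by default)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, min(1.0, centre + half)

if __name__ == "__main__":
    # Expanded samples: 212 of 212 flagged as expansion or potential expansion.
    print(round(exact_lower_bound_all_successes(212), 2))
    # Wild-type samples: 2786 of 2789 classified as wild type.
    print(2786 / 2789, wilson_interval(2786, 2789))
```

Other interval methods give slightly different limits, so small discrepancies from the published bounds would not by themselves indicate an error.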
Subjects
Amyotrophic Lateral Sclerosis/genetics; DNA Repeat Expansion; Whole Genome Sequencing/methods; Algorithms; C9orf72 Protein/genetics; Databases, Genetic; Humans; Precision Medicine; Sensitivity and Specificity; Software
ABSTRACT
SUMMARY: We describe a novel computational method for genotyping repeats using sequence graphs. This method addresses the long-standing need to accurately genotype medically important loci containing repeats adjacent to other variants or imperfect DNA repeats such as polyalanine repeats. Here we introduce a new version of our repeat genotyping software, ExpansionHunter, that uses this method to perform targeted genotyping of a broad class of such loci. AVAILABILITY AND IMPLEMENTATION: ExpansionHunter is implemented in C++ and is available under the Apache License Version 2.0. The source code, documentation, and Linux/macOS binaries are available at https://github.com/Illumina/ExpansionHunter/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Microsatellite Repeats; Software; Genotype
ABSTRACT
PURPOSE: Spinal muscular atrophy (SMA), caused by loss of the SMN1 gene, is a leading cause of early childhood death. Due to the near identical sequences of SMN1 and SMN2, analysis of this region is challenging. Population-wide SMA screening to quantify the SMN1 copy number (CN) is recommended by the American College of Medical Genetics and Genomics. METHODS: We developed a method that accurately identifies the CN of SMN1 and SMN2 using genome sequencing (GS) data by analyzing read depth and eight informative reference genome differences between SMN1/2. RESULTS: We characterized SMN1/2 in 12,747 genomes, identified 1568 samples with SMN1 gains or losses and 6615 samples with SMN2 gains or losses, and calculated a pan-ethnic carrier frequency of 2%, consistent with previous studies. Additionally, 99.8% of our SMN1 and 99.7% of SMN2 CN calls agreed with orthogonal methods, with a recall of 100% for SMA and 97.8% for carriers, and a precision of 100% for both SMA and carriers. CONCLUSION: This SMN copy-number caller can be used to identify both carrier and affected status of SMA, enabling SMA testing to be offered as a comprehensive test in neonatal care and an accurate carrier screening tool in GS sequencing projects.
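The general idea of combining read depth with SMN1/SMN2-distinguishing sites can be sketched as follows. This is a deliberately simplified illustration and not the published caller: the function names, the single pooled read count, and the plain rounding step are all assumptions, whereas a production method would model depth noise and evaluate each informative site separately.

```python
def call_smn_copy_numbers(region_depth, genome_depth, smn1_reads, smn2_reads):
    """Toy copy-number estimate for SMN1 and SMN2.

    region_depth: combined depth of reads aligned to SMN1 or SMN2 (assumed)
    genome_depth: typical depth of a diploid (two-copy) region (assumed)
    smn1_reads / smn2_reads: reads supporting the SMN1 or SMN2 base at
    sites that differ between the two genes (pooled here for simplicity)
    """
    total_cn = round(2 * region_depth / genome_depth)
    smn1_fraction = smn1_reads / (smn1_reads + smn2_reads)
    smn1_cn = round(total_cn * smn1_fraction)
    return smn1_cn, total_cn - smn1_cn

def sma_status(smn1_cn):
    """Interpret SMN1 copy number: 0 copies affected, 1 copy carrier."""
    if smn1_cn == 0:
        return "affected"
    if smn1_cn == 1:
        return "carrier"
    return "non-carrier by copy number"

if __name__ == "__main__":
    smn1, smn2 = call_smn_copy_numbers(45.0, 30.0, 20, 40)
    print(smn1, smn2, sma_status(smn1))
```

Copy number alone cannot separate a two-copy individual with one copy on each chromosome from one carrying both copies on the same chromosome, so the last category is worded cautiously.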
Subjects
Muscular Atrophy, Spinal; Base Sequence; Child; Child, Preschool; Humans; Muscular Atrophy, Spinal/diagnosis; Muscular Atrophy, Spinal/genetics; Survival of Motor Neuron 1 Protein/genetics
ABSTRACT
Standard whole-genome genotyping technologies are unable to determine haplotypes. Here we describe a method for rapid and cost-effective long-range haplotyping. Genomic DNA is diluted and distributed into multiple aliquots such that each aliquot receives a fraction of a haploid copy. The DNA template in each aliquot is amplified by multiple displacement amplification, converted into barcoded sequencing libraries using Nextera technology, and sequenced in multiplexed pools. To assess the performance of our method, we combined two male genomic DNA samples at equal ratios, resulting in a sample with diploid X chromosomes with known haplotypes. Pools of the multiplexed sequencing libraries were subjected to targeted pull-down of a 1-Mb contiguous region of the X-chromosome Duchenne muscular dystrophy gene. We were able to phase the Duchenne muscular dystrophy region into two contiguous haplotype blocks with a mean length of 494 kb. The haplotypes showed 99% agreement with the consensus base calls made by sequencing the individual DNAs. We subsequently used the strategy to haplotype two human genomes. Standard genomic sequencing to identify all heterozygous SNPs in the sample was combined with dilution-amplification-based sequencing data to resolve the phase of identified heterozygous SNPs. Using this procedure, we were able to phase >95% of the heterozygous SNPs from the diploid sequence data. The N50 for a Yoruba male DNA was 702 kb whereas the N50 for a European female DNA was 358 kb. Therefore, the strategy described here is suitable for haplotyping of a set of targeted regions as well as of the entire genome.
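The N50 values reported above summarise the distribution of haplotype block lengths. A minimal sketch of the usual definition follows: sort blocks from longest to shortest and report the length at which the running total first reaches half of the summed length. The function name `n50` and the example block lengths are illustrative assumptions, and the abstract does not state whether the authors used exactly this convention.

```python
def n50(block_lengths):
    """Return the N50 of a collection of block lengths.

    Blocks of length >= N50 together cover at least half of the total
    length covered by all blocks.
    """
    lengths = sorted(block_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no blocks given")

if __name__ == "__main__":
    # Hypothetical block lengths in kb.
    print(n50([700, 500, 300, 200, 100]))
```

Unlike a mean block length, N50 is weighted towards long blocks, so a few long phased blocks can raise it substantially.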
Subjects
Genetic Techniques; Genome, Human/genetics; Haplotypes/genetics; High-Throughput Nucleotide Sequencing/methods; Nucleic Acid Amplification Techniques/methods; DNA Barcoding, Taxonomic/methods; Dystrophin/genetics; Female; Gene Library; Genotype; Humans; Male; Polymorphism, Single Nucleotide/genetics
ABSTRACT
Tandem repeats (TRs) are highly polymorphic in the human genome, have thousands of associated molecular traits and are linked to over 60 disease phenotypes. However, they are often excluded from at-scale studies because of challenges with variant calling and representation, as well as a lack of a genome-wide standard. Here, to promote the development of TR methods, we created a catalog of TR regions and explored TR properties across 86 haplotype-resolved long-read human assemblies. We curated variants from the Genome in a Bottle (GIAB) HG002 individual to create a TR dataset to benchmark existing and future TR analysis methods. We also present an improved variant comparison method that handles variants greater than 4 bp in length and varying allelic representation. The 8.1% of the genome covered by the TR catalog holds ~24.9% of variants per individual, including 124,728 small and 17,988 large variants for the GIAB HG002 'truth-set' TR benchmark. We demonstrate the utility of this pipeline across short-read and long-read technologies.
ABSTRACT
Tandem repeat (TR) variation is associated with gene expression changes and numerous rare monogenic diseases. Although long-read sequencing provides accurate full-length sequences and methylation of TRs, there is still a need for computational methods to profile TRs across the genome. Here we introduce the Tandem Repeat Genotyping Tool (TRGT) and an accompanying TR database. TRGT determines the consensus sequences and methylation levels of specified TRs from PacBio HiFi sequencing data. It also reports reads that support each repeat allele. These reads can be subsequently visualized with a companion TR visualization tool. Assessing 937,122 TRs, TRGT showed a Mendelian concordance of 98.38%, allowing a single repeat unit difference. In six samples with known repeat expansions, TRGT detected all expansions while also identifying methylation signals and mosaicism and providing finer repeat length resolution than existing methods. Additionally, we released a database with allele sequences and methylation levels for 937,122 TRs across 100 genomes.
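One way to read "Mendelian concordance ... allowing a single repeat unit difference" is sketched below: a trio is counted as concordant if one child allele matches a maternal allele and the other matches a paternal allele, each within one motif length. This interpretation, the use of allele lengths in base pairs, and the function names are assumptions for illustration and are not taken from the TRGT implementation.

```python
def is_mendelian_consistent(child, mother, father, motif_len, tolerance_units=1):
    """Check one trio at one repeat locus.

    child, mother, father: pairs of allele lengths in bp (assumed input).
    A child allele matches a parental allele if their lengths differ by
    at most tolerance_units * motif_len.
    """
    tolerance = tolerance_units * motif_len

    def close(a, b):
        return abs(a - b) <= tolerance

    c1, c2 = child
    return any(
        close(from_mother, m) and close(from_father, f)
        for from_mother, from_father in ((c1, c2), (c2, c1))
        for m in mother
        for f in father
    )

def mendelian_concordance(trios, motif_len):
    """Fraction of trios (child, mother, father) that are consistent."""
    consistent = sum(
        is_mendelian_consistent(child, mother, father, motif_len)
        for child, mother, father in trios
    )
    return consistent / len(trios)

if __name__ == "__main__":
    trio = ((30, 45), (30, 33), (42, 60))  # hypothetical lengths, 3 bp motif
    print(is_mendelian_consistent(*trio, motif_len=3))
    print(is_mendelian_consistent(*trio, motif_len=3, tolerance_units=0))
```

Allowing a one-unit difference absorbs small sizing errors and single-unit changes between generations, at the cost of missing genuine one-unit inconsistencies.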