Results 1 - 20 of 592
1.
Cell ; 187(5): 1024-1037, 2024 Feb 29.
Article in English | MEDLINE | ID: mdl-38290514

ABSTRACT

This perspective focuses on advances in genome technology over the last 25 years and their impact on germline variant discovery within the field of human genetics. The field has witnessed tremendous technological advances from microarrays to short-read sequencing and now long-read sequencing. Each technology has provided genome-wide access to different classes of human genetic variation. We are now on the verge of comprehensive variant detection of all forms of variation for the first time with a single assay. We predict that this transition will further transform our understanding of human health and biology and, more importantly, provide novel insights into the dynamic mutational processes shaping our genomes.


Subject(s)
Genomic Structural Variation , Genomics , Humans , Genomics/methods , Germ-Line Mutation , Mutation , Technology
2.
Cell ; 187(6): 1547-1562.e13, 2024 Mar 14.
Article in English | MEDLINE | ID: mdl-38428424

ABSTRACT

We sequenced and assembled using multiple long-read sequencing technologies the genomes of chimpanzee, bonobo, gorilla, orangutan, gibbon, macaque, owl monkey, and marmoset. We identified 1,338,997 lineage-specific fixed structural variants (SVs) disrupting 1,561 protein-coding genes and 136,932 regulatory elements, including the most complete set of human-specific fixed differences. We estimate that 819.47 Mbp or ∼27% of the genome has been affected by SVs across primate evolution. We identify 1,607 structurally divergent regions wherein recurrent structural variation contributes to creating SV hotspots where genes are recurrently lost (e.g., CARD, C4, and OLAH gene families) and additional lineage-specific genes are generated (e.g., CKAP2, VPS36, ACBD7, and NEK5 paralogs), becoming targets of rapid chromosomal diversification and positive selection (e.g., RGPD gene family). High-fidelity long-read sequencing has made these dynamic regions of the genome accessible for sequence-level analyses within and between primate species.


Subject(s)
Genome , Primates , Animals , Humans , Base Sequence , Primates/classification , Primates/genetics , Biological Evolution , Sequence Analysis, DNA , Genomic Structural Variation
3.
Cell ; 184(13): 3542-3558.e16, 2021 06 24.
Article in English | MEDLINE | ID: mdl-34051138

ABSTRACT

Structural variations (SVs) and gene copy number variations (gCNVs) have contributed to crop evolution, domestication, and improvement. Here, we assembled 31 high-quality genomes of genetically diverse rice accessions. Coupling with two existing assemblies, we developed pan-genome-scale genomic resources including a graph-based genome, providing access to rice genomic variations. Specifically, we discovered 171,072 SVs and 25,549 gCNVs and used an Oryza glaberrima assembly to infer the derived states of SVs in the Oryza sativa population. Our analyses of SV formation mechanisms, impacts on gene expression, and distributions among subpopulations illustrate the utility of these resources for understanding how SVs and gCNVs shaped rice environmental adaptation and domestication. Our graph-based genome enabled genome-wide association study (GWAS)-based identification of phenotype-associated genetic variations undetectable when using only SNPs and a single reference assembly. Our work provides rich population-scale resources paired with easy-to-access tools to facilitate rice breeding as well as plant functional genomics and evolutionary biology research.


Subject(s)
Ecotype , Genetic Variation , Genome, Plant , Oryza/genetics , Adaptation, Physiological/genetics , Agriculture , Domestication , Gene Expression Profiling , Gene Expression Regulation, Plant , Genes, Plant , Genomic Structural Variation , Molecular Sequence Annotation , Phenotype
4.
Cell ; 182(1): 189-199.e15, 2020 07 09.
Article in English | MEDLINE | ID: mdl-32531199

ABSTRACT

Structural variants contribute substantially to genetic diversity and are important evolutionarily and medically, but they are still understudied. Here we present a comprehensive analysis of structural variation in the Human Genome Diversity panel, a high-coverage dataset of 911 samples from 54 diverse worldwide populations. We identify, in total, 126,018 variants, 78% of which were not identified in previous global sequencing projects. Some reach high frequency and are private to continental groups or even individual populations, including regionally restricted runaway duplications and putatively introgressed variants from archaic hominins. By de novo assembly of 25 genomes using linked-read sequencing, we discover 1,643 breakpoint-resolved unique insertions, in aggregate accounting for 1.9 Mb of sequence absent from the GRCh38 reference. Our results illustrate the limitation of a single human reference and the need for high-quality genomes from diverse populations to fully discover and understand human genetic variation.


Subject(s)
Genetics, Population , Genomic Structural Variation , Alleles , Databases, Genetic , Gene Dosage , Gene Duplication , Gene Frequency/genetics , Genetic Variation , Genome, Human , Humans
5.
Cell ; 183(1): 197-210.e32, 2020 10 01.
Article in English | MEDLINE | ID: mdl-33007263

ABSTRACT

Cancer genomes often harbor hundreds of somatic DNA rearrangement junctions, many of which cannot be easily classified into simple (e.g., deletion) or complex (e.g., chromothripsis) structural variant classes. Applying a novel genome graph computational paradigm to analyze the topology of junction copy number (JCN) across 2,778 tumor whole-genome sequences, we uncovered three novel complex rearrangement phenomena: pyrgo, rigma, and tyfonas. Pyrgo are "towers" of low-JCN duplications associated with early-replicating regions, superenhancers, and breast or ovarian cancers. Rigma comprise "chasms" of low-JCN deletions enriched in late-replicating fragile sites and gastrointestinal carcinomas. Tyfonas are "typhoons" of high-JCN junctions and fold-back inversions associated with expressed protein-coding fusions, breakend hypermutation, and acral, but not cutaneous, melanomas. Clustering of tumors according to genome graph-derived features identified subgroups associated with DNA repair defects and poor prognosis.


Subject(s)
Genomic Structural Variation/genetics , Genomics/methods , Neoplasms/genetics , Chromosome Inversion/genetics , Chromothripsis , DNA Copy Number Variations/genetics , Gene Rearrangement/genetics , Genome, Human/genetics , Humans , Mutation/genetics , Whole Genome Sequencing/methods
6.
Cell ; 182(1): 145-161.e23, 2020 07 09.
Article in English | MEDLINE | ID: mdl-32553272

ABSTRACT

Structural variants (SVs) underlie important crop improvement and domestication traits. However, resolving the extent, diversity, and quantitative impact of SVs has been challenging. We used long-read nanopore sequencing to capture 238,490 SVs in 100 diverse tomato lines. This panSV genome, along with 14 new reference assemblies, revealed large-scale intermixing of diverse genotypes, as well as thousands of SVs intersecting genes and cis-regulatory regions. Hundreds of SV-gene pairs exhibit subtle and significant expression changes, which could broadly influence quantitative trait variation. By combining quantitative genetics with genome editing, we show how multiple SVs that changed gene dosage and expression levels modified fruit flavor, size, and production. In the last example, higher order epistasis among four SVs affecting three related transcription factors allowed introduction of an important harvesting trait in modern tomato. Our findings highlight the underexplored role of SVs in genotype-to-phenotype relationships and their widespread importance and utility in crop improvement.


Subject(s)
Crops, Agricultural/genetics , Gene Expression Regulation, Plant , Genomic Structural Variation , Solanum lycopersicum/genetics , Alleles , Cytochrome P-450 Enzyme System/genetics , Ecotype , Epistasis, Genetic , Fruit/genetics , Gene Duplication , Genome, Plant , Genotype , Inbreeding , Molecular Sequence Annotation , Phenotype , Plant Breeding , Quantitative Trait Loci/genetics
7.
Cell ; 176(3): 663-675.e19, 2019 01 24.
Article in English | MEDLINE | ID: mdl-30661756

ABSTRACT

In order to provide a comprehensive resource for human structural variants (SVs), we generated long-read sequence data and analyzed SVs for fifteen human genomes. We sequence resolved 99,604 insertions, deletions, and inversions including 2,238 (1.6 Mbp) that are shared among all discovery genomes with an additional 13,053 (6.9 Mbp) present in the majority, indicating minor alleles or errors in the reference. Genotyping in 440 additional genomes confirms the most common SVs in unique euchromatin are now sequence resolved. We report a ninefold SV bias toward the last 5 Mbp of human chromosomes with nearly 55% of all VNTRs (variable number of tandem repeats) mapping to this portion of the genome. We identify SVs affecting coding and noncoding regulatory loci improving annotation and interpretation of functional variation. These data provide the framework to construct a canonical human reference and a resource for developing advanced representations capable of capturing allelic diversity.


Subject(s)
Gene Frequency/genetics , Genome, Human/genetics , Genomic Structural Variation/genetics , Alleles , Euchromatin/genetics , Genomics/methods , Humans , Minisatellite Repeats/genetics , Sequence Analysis, DNA/methods
8.
Cell ; 176(6): 1310-1324.e10, 2019 03 07.
Article in English | MEDLINE | ID: mdl-30827684

ABSTRACT

DNA rearrangements resulting in human genome structural variants (SVs) are caused by diverse mutational mechanisms. We used long- and short-read sequencing technologies to investigate end products of de novo chromosome 17p11.2 rearrangements and query the molecular mechanisms underlying both recurrent and non-recurrent events. Evidence for an increased rate of clustered single-nucleotide variant (SNV) mutation in cis with non-recurrent rearrangements was found. Indel and SNV formation are associated with both copy-number gains and losses of 17p11.2, occur up to ∼1 Mb away from the breakpoint junctions, and favor C > G transversion substitutions; results suggest that single-stranded DNA is formed during the genesis of the SV and provide compelling support for a microhomology-mediated break-induced replication (MMBIR) mechanism for SV formation. Our data show an additional mutational burden of MMBIR consisting of hypermutation confined to the locus and manifesting as SNVs and indels predominantly within genes.


Subject(s)
Chromosomes, Human, Pair 17 , Mutation , Abnormalities, Multiple/genetics , Chromosome Breakpoints , Chromosome Disorders/genetics , Chromosome Duplication/genetics , DNA Copy Number Variations , DNA Repair/genetics , DNA Replication , Gene Rearrangement , Genome, Human , Genomic Structural Variation , Humans , INDEL Mutation , Models, Genetic , Polymorphism, Single Nucleotide , Recombination, Genetic , Sequence Analysis, DNA/methods , Smith-Magenis Syndrome/genetics
9.
Cell ; 174(3): 758-769.e9, 2018 07 26.
Article in English | MEDLINE | ID: mdl-30033370

ABSTRACT

While mutations affecting protein-coding regions have been examined across many cancers, structural variants at the genome-wide level are still poorly defined. Through integrative deep whole-genome and -transcriptome analysis of 101 castration-resistant prostate cancer metastases (109X tumor/38X normal coverage), we identified structural variants altering critical regulators of tumorigenesis and progression not detectable by exome approaches. Notably, we observed amplification of an intergenic enhancer region 624 kb upstream of the androgen receptor (AR) in 81% of patients, correlating with increased AR expression. Tandem duplication hotspots also occur near MYC, in lncRNAs associated with post-translational MYC regulation. Classes of structural variations were linked to distinct DNA repair deficiencies, suggesting their etiology, including associations of CDK12 mutation with tandem duplications, TP53 inactivation with inverted rearrangements and chromothripsis, and BRCA2 inactivation with deletions. Together, these observations provide a comprehensive view of how structural variations affect critical regulators in metastatic prostate cancer.


Subject(s)
Genomic Structural Variation/genetics , Prostatic Neoplasms/genetics , Aged , Aged, 80 and over , BRCA2 Protein/metabolism , Cyclin-Dependent Kinases/metabolism , DNA Copy Number Variations , Exome , Gene Expression Profiling/methods , Genomics/methods , Humans , Male , Middle Aged , Mutation , Neoplasm Metastasis/genetics , Proto-Oncogene Proteins c-myc/genetics , Proto-Oncogene Proteins c-myc/metabolism , Receptors, Androgen/genetics , Receptors, Androgen/metabolism , Tandem Repeat Sequences/genetics , Tumor Suppressor Protein p53/metabolism , Whole Genome Sequencing/methods
10.
Nature ; 624(7992): 602-610, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38093003

ABSTRACT

Indigenous Australians harbour rich and unique genomic diversity. However, Aboriginal and Torres Strait Islander ancestries are historically under-represented in genomics research and almost completely missing from reference datasets [1-3]. Addressing this representation gap is critical, both to advance our understanding of global human genomic diversity and as a prerequisite for ensuring equitable outcomes in genomic medicine. Here we apply population-scale whole-genome long-read sequencing [4] to profile genomic structural variation across four remote Indigenous communities. We uncover an abundance of large insertion-deletion variants (20-49 bp; n = 136,797), structural variants (50 bp-50 kb; n = 159,912) and regions of variable copy number (>50 kb; n = 156). The majority of variants are composed of tandem repeat or interspersed mobile element sequences (up to 90%) and have not been previously annotated (up to 62%). A large fraction of structural variants appear to be exclusive to Indigenous Australians (12% lower-bound estimate) and most of these are found in only a single community, underscoring the need for broad and deep sampling to achieve a comprehensive catalogue of genomic structural variation across the Australian continent. Finally, we explore short tandem repeats throughout the genome to characterize allelic diversity at 50 known disease loci [5], uncover hundreds of novel repeat expansion sites within protein-coding genes, and identify unique patterns of diversity and constraint among short tandem repeat sequences. Our study sheds new light on the dimensions and dynamics of genomic structural variation within and beyond Australia.


Subject(s)
Australian Aboriginal and Torres Strait Islander Peoples , Genome, Human , Genomic Structural Variation , Humans , Alleles , Australia/ethnology , Australian Aboriginal and Torres Strait Islander Peoples/genetics , Datasets as Topic , DNA Copy Number Variations/genetics , Genetic Loci/genetics , Genetics, Medical , Genomic Structural Variation/genetics , Genomics , INDEL Mutation/genetics , Interspersed Repetitive Sequences/genetics , Microsatellite Repeats/genetics , Genome, Human/genetics
11.
Nature ; 624(7992): 593-601, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38093005

ABSTRACT

The Indigenous peoples of Australia have a rich linguistic and cultural history. How this relates to genetic diversity remains largely unknown because of their limited engagement with genomic studies. Here we analyse the genomes of 159 individuals from four remote Indigenous communities, including people who speak a language (Tiwi) not from the most widespread family (Pama-Nyungan). This large collection of Indigenous Australian genomes was made possible by careful community engagement and consultation. We observe exceptionally strong population structure across Australia, driven by divergence times between communities of 26,000-35,000 years ago and long-term low but stable effective population sizes. This demographic history, including early divergence from Papua New Guinean (47,000 years ago) and Eurasian groups [1], has generated the highest proportion of previously undescribed genetic variation seen outside Africa and the most extended homozygosity compared with global samples. A substantial proportion of this variation is not observed in global reference panels or clinical datasets, and variation with predicted functional consequence is more likely to be homozygous than in other populations, with consequent implications for medical genomics [2]. Our results show that Indigenous Australians are not a single homogeneous genetic group and their genetic relationship with the peoples of New Guinea is not uniform. These patterns imply that the full breadth of Indigenous Australian genetic diversity remains uncharacterized, potentially limiting genomic medicine and equitable healthcare for Indigenous Australians.


Subject(s)
Australian Aboriginal and Torres Strait Islander Peoples , Genome, Human , Genomic Structural Variation , Humans , Australia/ethnology , Australian Aboriginal and Torres Strait Islander Peoples/genetics , Australian Aboriginal and Torres Strait Islander Peoples/history , Datasets as Topic , Genetics, Medical , Genome, Human/genetics , Genomic Structural Variation/genetics , Genomics , History, Ancient , Homozygote , Language , New Guinea/ethnology , Population Density , Population Dynamics
12.
Nature ; 612(7940): 564-572, 2022 12.
Article in English | MEDLINE | ID: mdl-36477537

ABSTRACT

Higher-order chromatin structure is important for the regulation of genes by distal regulatory sequences [1,2]. Structural variants (SVs) that alter three-dimensional (3D) genome organization can lead to enhancer-promoter rewiring and human disease, particularly in the context of cancer [3]. However, only a small minority of SVs are associated with altered gene expression [4,5], and it remains unclear why certain SVs lead to changes in distal gene expression and others do not. To address these questions, we used a combination of genomic profiling and genome engineering to identify sites of recurrent changes in 3D genome structure in cancer and determine the effects of specific rearrangements on oncogene activation. By analysing Hi-C data from 92 cancer cell lines and patient samples, we identified loci affected by recurrent alterations to 3D genome structure, including oncogenes such as MYC, TERT and CCND1. By using CRISPR-Cas9 genome engineering to generate de novo SVs, we show that oncogene activity can be predicted by using 'activity-by-contact' models that consider partner region chromatin contacts and enhancer activity. However, activity-by-contact models are only predictive of specific subsets of genes in the genome, suggesting that different classes of genes engage in distinct modes of regulation by distal regulatory elements. These results indicate that SVs that alter 3D genome organization are widespread in cancer genomes and begin to illustrate predictive rules for the consequences of SVs on oncogene activation.


Subject(s)
Genomic Structural Variation , Neoplasms , Oncogene Proteins , Oncogenes , Humans , Chromatin/genetics , Gene Rearrangement/genetics , Genomic Structural Variation/genetics , Neoplasms/genetics , Neoplasms/pathology , Oncogenes/genetics , Oncogene Proteins/chemistry , Oncogene Proteins/genetics , Oncogene Proteins/metabolism , Chromosomes, Human/genetics , Cell Line, Tumor , Enhancer Elements, Genetic/genetics , Models, Genetic
13.
Am J Hum Genet ; 111(8): 1524-1543, 2024 Aug 08.
Article in English | MEDLINE | ID: mdl-39053458

ABSTRACT

Gene misexpression is the aberrant transcription of a gene in a context where it is usually inactive. Despite its known pathological consequences in specific rare diseases, we have a limited understanding of its wider prevalence and mechanisms in humans. To address this, we analyzed gene misexpression in 4,568 whole-blood bulk RNA sequencing samples from INTERVAL study blood donors. We found that while individual misexpression events occur rarely, in aggregate they were found in almost all samples and a third of inactive protein-coding genes. Using 2,821 paired whole-genome and RNA sequencing samples, we identified that misexpression events are enriched in cis for rare structural variants. We established putative mechanisms through which a subset of SVs lead to gene misexpression, including transcriptional readthrough, transcript fusions, and gene inversion. Overall, we develop misexpression as a type of transcriptomic outlier analysis and extend our understanding of the variety of mechanisms by which genetic variants can influence gene expression.


Subject(s)
Gene Expression Regulation , Humans , Sequence Analysis, RNA , Genetic Variation , Genomic Structural Variation/genetics , Transcriptome/genetics , Blood Donors
14.
Genome Res ; 34(2): 300-309, 2024 03 20.
Article in English | MEDLINE | ID: mdl-38355307

ABSTRACT

Expression and splicing quantitative trait loci (e/sQTL) are large contributors to phenotypic variability. Achieving sufficient statistical power for e/sQTL mapping requires large cohorts with both genotypes and molecular phenotypes, and so, the genomic variation is often called from short-read alignments, which are unable to comprehensively resolve structural variation. Here we build a pangenome from 16 HiFi haplotype-resolved cattle assemblies to identify small and structural variation and genotype them with PanGenie in 307 short-read samples. We find high (>90%) concordance of PanGenie-genotyped and DeepVariant-called small variation and confidently genotype close to 21 million small and 43,000 structural variants in the larger population. We validate 85% of these structural variants (with MAF > 0.1) directly with a subset of 25 short-read samples that also have medium coverage HiFi reads. We then conduct e/sQTL mapping with this comprehensive variant set in a subset of 117 cattle that have testis transcriptome data, and find 92 structural variants as causal candidates for eQTL and 73 for sQTL. We find that roughly half of the top associated structural variants affecting expression or splicing are transposable elements, such as SV-eQTL for STN1 and MYH7 and SV-sQTL for CEP89 and ASAH2. Extensive linkage disequilibrium between small and structural variation results in only 28 additional eQTL and 17 sQTL discovered when including SVs, although many top associated SVs are compelling candidates.


Subject(s)
Quantitative Trait Loci , RNA Splicing , Male , Cattle/genetics , Animals , Genotype , Phenotype , Linkage Disequilibrium , Genomic Structural Variation
15.
Genome Res ; 34(1): 7-19, 2024 02 07.
Article in English | MEDLINE | ID: mdl-38176712

ABSTRACT

High-quality genome assemblies and sophisticated algorithms have increased sensitivity for a wide range of variant types, and breakpoint accuracy for structural variants (SVs, ≥50 bp) has improved to near base pair precision. Despite these advances, many SV breakpoint locations are subject to systematic bias affecting variant representation. To understand why SV breakpoints are inconsistent across samples, we reanalyzed 64 phased haplotypes constructed from long-read assemblies released by the Human Genome Structural Variation Consortium (HGSVC). We identify 882 SV insertions and 180 SV deletions with variable breakpoints not anchored in tandem repeats (TRs) or segmental duplications (SDs). SVs called from aligned sequencing reads increase breakpoint disagreements by 2×-16×. Sequence accuracy had a minimal impact on breakpoints, but we observe a strong effect of ancestry. We confirm that SNP and indel polymorphisms are enriched at shifted breakpoints and are also absent from variant callsets. Breakpoint homology increases the likelihood of imprecise SV calls and the distance they are shifted, and tandem duplications are the most heavily affected SVs. Because graph genome methods normalize SV calls across samples, we investigated graphs generated by two different methods and find the resulting breakpoints are subject to other technical biases affecting breakpoint accuracy. The breakpoint inconsistencies we characterize affect ∼5% of the SVs called in a human genome and can impact variant interpretation and annotation. These limitations underscore a need for algorithm development to improve SV databases, mitigate the impact of ancestry on breakpoints, and increase the value of callsets for investigating breakpoint features.


Subject(s)
Algorithms , Genome, Human , Humans , Sequence Analysis , Genomic Structural Variation , Bias , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing
16.
Proc Natl Acad Sci U S A ; 121(27): e2322291121, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38913905

ABSTRACT

Tibetan sheep were introduced to the Qinghai Tibet plateau roughly 3,000 B.P., making this species a good model for investigating genetic mechanisms of high-altitude adaptation over a relatively short timescale. Here, we characterize genomic structural variants (SVs) that distinguish Tibetan sheep from closely related, low-altitude Hu sheep, and we examine associated changes in tissue-specific gene expression. We document differentiation between the two sheep breeds in frequencies of SVs associated with genes involved in cardiac function and circulation. In Tibetan sheep, we identified high-frequency SVs in a total of 462 genes, including EPAS1, PAPSS2, and PTPRD. Single-cell RNA-Seq data and luciferase reporter assays revealed that the SVs had cis-acting effects on the expression levels of these three genes in specific tissues and cell types. In Tibetan sheep, we identified a high-frequency chromosomal inversion that exhibited modified chromatin architectures relative to the noninverted allele that predominates in Hu sheep. The inversion harbors several genes with altered expression patterns related to heart protection, brown adipocyte proliferation, angiogenesis, and DNA repair. These findings indicate that SVs represent an important source of genetic variation in gene expression and may have contributed to high-altitude adaptation in Tibetan sheep.


Subject(s)
Altitude , Animals , Sheep/genetics , Tibet , Genomic Structural Variation , Basic Helix-Loop-Helix Transcription Factors/genetics , Gene Expression Regulation , Genome , Acclimatization/genetics
17.
Nat Methods ; 20(4): 559-568, 2023 04.
Article in English | MEDLINE | ID: mdl-36959322

ABSTRACT

Structural variants (SVs) are a major driver of genetic diversity and disease in the human genome and their discovery is imperative to advances in precision medicine. Existing SV callers rely on hand-engineered features and heuristics to model SVs, which cannot scale to the vast diversity of SVs nor fully harness the information available in sequencing datasets. Here we propose an extensible deep-learning framework, Cue, to call and genotype SVs that can learn complex SV abstractions directly from the data. At a high level, Cue converts alignments to images that encode SV-informative signals and uses a stacked hourglass convolutional neural network to predict the type, genotype and genomic locus of the SVs captured in each image. We show that Cue outperforms the state of the art in the detection of several classes of SVs on synthetic and real short-read data and that it can be easily extended to other sequencing platforms, while achieving competitive performance.


Subject(s)
Deep Learning , Software , Humans , Genotype , Cues , Genomic Structural Variation , Genome, Human
18.
Nat Methods ; 20(8): 1143-1158, 2023 08.
Article in English | MEDLINE | ID: mdl-37386186

ABSTRACT

As long-read sequencing technologies are becoming increasingly popular, a number of methods have been developed for the discovery and analysis of structural variants (SVs) from long reads. Long reads enable detection of SVs that could not be previously detected from short-read sequencing, but computational methods must adapt to the unique challenges and opportunities presented by long-read sequencing. Here, we summarize over 50 long-read-based methods for SV detection, genotyping and visualization, and discuss how new telomere-to-telomere genome assemblies and pangenome efforts can improve the accuracy and drive the development of SV callers in the future.


Subject(s)
Algorithms , Genome , Humans , Sequence Analysis, DNA/methods , Genomic Structural Variation , High-Throughput Nucleotide Sequencing/methods , Genome, Human
19.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39297879

ABSTRACT

Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. SVs are present in approximately 1.5% of the human genome. Still, this small subset of genetic variation has been implicated in the pathogenesis of psoriasis, Crohn's disease and other autoimmune disorders, autism spectrum and other neurodevelopmental disorders, and schizophrenia. Since identifying structural variants is an important problem in genetics, several specialized computational techniques have been developed to detect structural variants directly from sequencing data. With advances in whole-genome sequencing (WGS) technologies, a plethora of SV detection methods have been developed. However, dissecting SVs from WGS data remains a challenge, with the majority of SV detection methods prone to a high false-positive rate, and no existing method able to precisely detect a full range of SVs present in a sample. Previous studies have shown that none of the existing SV callers can maintain high accuracy across various SV lengths and genomic coverages. Here, we report an integrated structural variant calling framework, Variant Identification and Structural Variant Analysis (VISTA), that leverages the results of individual callers using a novel and robust filtering and merging algorithm. In contrast to existing consensus-based tools which ignore the length and coverage, VISTA overcomes this limitation by executing various combinations of top-performing callers based on variant length and genomic coverage to generate SV events with high accuracy. We evaluated the performance of VISTA on comprehensive gold-standard datasets across varying organisms and coverage. We benchmarked VISTA using the Genome-in-a-Bottle gold standard SV set, haplotype-resolved de novo assemblies from the Human Pangenome Reference Consortium, along with an in-house polymerase chain reaction (PCR)-validated mouse gold standard set. 
VISTA maintained the highest F1 score among top consensus-based tools measured using a comprehensive gold standard across both mouse and human genomes. VISTA also has an optimized mode, where the calls can be optimized for precision or recall. VISTA-optimized can attain 100% precision and the highest sensitivity among other variant callers. In conclusion, VISTA represents a significant advancement in structural variant calling, offering a robust and accurate framework that outperforms existing consensus-based tools and sets a new standard for SV detection in genomic research.


Subject(s)
Genome, Human , Genomic Structural Variation , Software , Humans , Whole Genome Sequencing/methods , Algorithms , Genomics/methods , Computational Biology/methods , Genetic Variation
20.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38980375

ABSTRACT

Structural variation (SV) is an important form of genomic variation that influences gene function and expression by altering the structure of the genome. Although long-read data have been proven to better characterize SVs, SVs detected from noisy long-read data still include a considerable portion of false-positive calls. To accurately detect SVs in long-read data, we present SVDF, a method that employs a learning-based noise filtering strategy and an SV signature-adaptive clustering algorithm, for effectively reducing the likelihood of false-positive events. Benchmarking results from multiple orthogonal experiments demonstrate that, across different sequencing platforms and depths, SVDF achieves higher calling accuracy for each sample compared to several existing general SV calling tools. We believe that, with its meticulous and sensitive SV detection capability, SVDF can bring new opportunities and advancements to cutting-edge genomic research.


Subject(s)
Algorithms , Humans , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Genomic Structural Variation , Software