Results 1 - 20 of 32,462
1.
Methods Mol Biol ; 2856: 293-308, 2025.
Article in English | MEDLINE | ID: mdl-39283460

ABSTRACT

In order to analyze the three-dimensional genome architecture, it is important to simulate how the genome is structured through the cell cycle progression. In this chapter, we present the usage of our computation codes for simulating how the human genome is formed as the cell transforms from anaphase to interphase. We do not use the global Hi-C data as an input into the genome simulation but represent all chromosomes as linear polymers annotated by the neighboring region contact index (NCI), which classifies the A/B type of each local chromatin region. The simulated mitotic chromosomes heterogeneously expand upon entry to the G1 phase, which induces phase separation of A and B chromatin regions, establishing chromosome territories, compartments, and lamina and nucleolus associations in the interphase nucleus. When the appropriate one-dimensional chromosomal annotation is possible, using the protocol of this chapter, one can quantitatively simulate the three-dimensional genome structure and dynamics of human cells of interest.


Subjects
Anaphase, Chromatin, Human Genome, Interphase, Humans, Anaphase/genetics, Interphase/genetics, Chromatin/genetics, Chromatin/metabolism, Computer Simulation, Human Chromosomes/genetics, Mitosis/genetics
2.
Genome Biol ; 25(1): 253, 2024 Oct 02.
Article in English | MEDLINE | ID: mdl-39358801

ABSTRACT

In this work, we extend vcfdist to be the first variant call benchmarking tool to jointly evaluate phased single-nucleotide polymorphisms (SNPs), small insertions/deletions (INDELs), and structural variants (SVs) for the whole genome. First, we find that a joint evaluation of small and structural variants uniformly reduces measured errors for SNPs (-28.9%), INDELs (-19.3%), and SVs (-52.4%) across three datasets. vcfdist also corrects a common flaw in phasing evaluations, reducing measured flip errors by over 50%. Lastly, we show that vcfdist is more accurate than previously published works and on par with the newest approaches while providing improved result interpretability.


Subjects
Benchmarking, INDEL Mutation, Single Nucleotide Polymorphism, Software, Humans, Genomic Structural Variation, Human Genome
3.
BMC Genomics ; 25(1): 942, 2024 Oct 07.
Article in English | MEDLINE | ID: mdl-39375616

ABSTRACT

BACKGROUND: The Csangos are an East-Central European ethnographic group living mainly in east of Transylvania in Romania. Traditionally, ethnography distinguishes three Csango subpopulations: the Moldavian, Gyimes and Burzenland Csangos. In our previous study we found that the Moldavian Csangos have East Asian/Siberian Turkic ancestry components that might be unique in the East-Central European region and might help to better understand the history of Hungarian-speaking ethnic groups of the area. Since then, we have obtained further Csango samples from Moldavia and from a distinct region of Gyimes; these two Csango subgroups are traditionally distinct, since they live in a degree of isolation not only from other people but also from each other. Here we present the first genomic analysis of Gyimes Csangos, which was intended to compare the genetic makeup of these two Csango subgroups using both allele-frequency and haplotype-based methods. The main goal of the study was to investigate the genetic isolation of the Csangos on a genome-wide SNP basis and to assess the isolation of Gyimes Csangos, which in contrast to the Moldavians had not yet been studied. RESULTS: Our results show that these two Csango groups differ slightly from each other. We confirmed the genetic isolation of Moldavian Csangos and revealed that Gyimes Csangos have a similar, but detectably weaker isolation. In the case of Gyimes Csangos we also detected a stronger East European or presumably Asian derived ancestry. CONCLUSION: The Gyimes Csangos show a degree of genetic isolation comparable to that of the Moldavians. The Asian ancestry that differentiates the Moldavian Csango people from the other East-Central European populations may be present in the Gyimes Csangos to an even higher degree, since Gyimes Csango individuals show a larger share of that ancestry component.


Subjects
Haplotypes, Single Nucleotide Polymorphism, Humans, Population Genetics, Gene Frequency, Ethnicity/genetics, Human Genome, White People/genetics
4.
Nucleus ; 15(1): 2400525, 2024 Dec.
Article in English | MEDLINE | ID: mdl-39377317

ABSTRACT

Cytogenetic bands reflect genomic organization in large blocks of DNA with similar properties. Because banding patterns are invariant, this organization may often be assumed unimportant for genome regulation. Results here challenge that view. Findings here suggest cytogenetic bands reflect a visible framework upon which regulated genome architecture is built. Given that Alu and L1 densities differ in cytogenetic bands, we examined their distribution after X-chromosome inactivation or formation of senescence-associated heterochromatin foci (SAHFs). Alu-rich regions remain outside both SAHFs and the Barr Body (BB), affirming that the BB is not the whole chromosome but a condensed, L1-rich core. Hi-C analysis of senescent cells demonstrates that large (~10 Mb) G-bands remodel as a contiguous unit, gaining distal intrachromosomal interactions as syntenic G-bands coalesce into SAHFs. Striking peaks of Alu within R-bands strongly resist condensation. Thus, large-scale segmental genome architecture relates to dark versus light cytogenetic bands and Alu-peaks, implicating both in chromatin regulation.


Subjects
Alu Elements, Alu Elements/genetics, Humans, Heterochromatin/metabolism, Heterochromatin/genetics, Human Genome/genetics, Cell Nucleus/genetics, Cell Nucleus/metabolism
5.
Cancer Discov ; 14(10): 1766-1767, 2024 Oct 04.
Article in English | MEDLINE | ID: mdl-39363744

ABSTRACT

Baker and colleagues developed a new algorithm called "Gain Route Identification and Timing In Cancer" (GRITIC) to uncover the path of chromosomal evolution in a tumor, particularly in the context of whole-genome duplication. Their approach found that tumors with genome doubling frequently take an indirect path from one copy number state to another. In addition, the timing of genome doubling within a tumor's evolution impacts its consequences on downstream chromosomal instability. See related article by Baker et al., p. 1810.


Subjects
Neoplasms, Humans, Neoplasms/genetics, Algorithms, Molecular Evolution, Human Genome
6.
Nat Commun ; 15(1): 8549, 2024 Oct 03.
Article in English | MEDLINE | ID: mdl-39362880

ABSTRACT

The role of rare non-coding variation in complex human phenotypes is still largely unknown. To elucidate the impact of rare variants in regulatory elements, we performed a whole-genome sequencing association analysis for height using 333,100 individuals from three datasets: UK Biobank (N = 200,003), TOPMed (N = 87,652) and All of Us (N = 45,445). We performed rare (<0.1% minor allele frequency) single-variant and aggregate testing of non-coding variants in regulatory regions based on proximal-regulatory, intergenic-regulatory and deep-intronic annotation. We observed 29 independent variants associated with height at P < 6 × 10⁻¹⁰ after conditioning on previously reported variants, with effect sizes ranging from -7 cm to +4.7 cm. We also identified and replicated non-coding aggregate-based associations proximal to HMGA1 containing variants associated with a 5 cm taller height and of highly-conserved variants in MIR497HG on chromosome 17. We have developed an approach for identifying non-coding rare variants in regulatory regions with large effects from whole-genome sequencing data associated with complex traits.


Subjects
Body Height, Genome-Wide Association Study, Single Nucleotide Polymorphism, Whole Genome Sequencing, Humans, Body Height/genetics, Male, Female, Gene Frequency, Human Genome, Genetic Variation, Phenotype
7.
Sci Rep ; 14(1): 22774, 2024 Oct 01.
Article in English | MEDLINE | ID: mdl-39354004

ABSTRACT

While significant strides have been made in understanding pharmacogenetics (PGx) and gene-drug interactions, there remains limited characterization of population-level PGx variation. This study aims to comprehensively profile global star alleles (haplotype patterns) and phenotype frequencies in 58 pharmacogenes associated with drug absorption, distribution, metabolism, and excretion. PyPGx, a star-allele calling tool, was employed to identify star alleles within high-coverage whole genome sequencing (WGS) data from the 1000 Genomes Project (N = 2504; 26 global populations). This process involved detecting structural variants (SVs), such as gene deletions, duplications, hybrids, as well as single nucleotide variants and insertion-deletion variants. The majority of our PyPGx calls for star alleles and phenotype frequencies aligned with the Pharmacogenomics Knowledge Base, although notable population-specific frequencies differed at least twofold. Validation efforts confirmed known SVs while uncovering several novel SVs currently undefined as star alleles. Additionally, we identified 210 small nucleotide variants associated with severe functional consequences that are not defined as star alleles. The study serves as a valuable resource, providing updated population-level star allele and phenotype frequencies while incorporating SVs. It also highlights the burgeoning potential of cost-effective WGS for PGx genotyping, offering invaluable insights to improve tailored drug therapies across diverse populations.


Subjects
Alleles, Pharmacogenetics, Whole Genome Sequencing, Humans, Whole Genome Sequencing/methods, Pharmacogenetics/methods, Gene Frequency, Single Nucleotide Polymorphism, Human Genome, Phenotype, Haplotypes, Genomic Structural Variation, Pharmacogenomic Testing/methods, Human Genome Project
8.
Nat Commun ; 15(1): 8454, 2024 Oct 02.
Article in English | MEDLINE | ID: mdl-39358353

ABSTRACT

It is unclear how patterns of regional genetic differentiation in the UK and Ireland might impact the protein-coding fraction of the genome. We exploit UK Biobank (UKB) and Viking Genes whole exome sequencing data to study regional genetic differentiation across the UK and Ireland in protein coding genes, encompassing 44,696 unrelated individuals from 20 regions of origin. We demonstrate substantial exonic differentiation among Shetlanders, Orcadians, individuals with full or partial Ashkenazi Jewish ancestry and in several mainland regions (particularly north and south Wales, southeast Scotland and Ireland). With stringent filtering criteria, we find 67 regionally enriched (≥5-fold) variants likely to have adverse biomedical consequences in homozygous individuals. Here, we show that regional genetic variation across the UK and Ireland should be considered in the design of genetic studies and may inform effective genetic screening and counselling.


Subjects
Exons, Genetic Variation, Humans, Ireland, United Kingdom, Exons/genetics, Exome Sequencing, Population Genetics, Jews/genetics, Human Genome, Single Nucleotide Polymorphism
9.
PeerJ ; 12: e18050, 2024.
Article in English | MEDLINE | ID: mdl-39351368

ABSTRACT

Background: Recent advances in long-read sequencing technologies have enabled accurate and contiguous de novo assemblies of large genomes and metagenomes. However, even long and accurate high-fidelity (HiFi) reads do not resolve repeats that are longer than the read lengths. This limitation negatively affects the contiguity of diploid genome assemblies since two haplomes share many long identical regions. To generate the telomere-to-telomere assemblies of diploid genomes, biologists now construct their HiFi-based phased assemblies and use additional experimental technologies to transform them into more contiguous diploid assemblies. The barcoded linked-reads, generated using an inexpensive TELL-Seq technology, provide an attractive way to bridge unresolved repeats in phased assemblies of diploid genomes. Results: We developed the SpLitteR tool for diploid genome assembly using linked-reads and assembly graphs and benchmarked it against state-of-the-art linked-read scaffolders ARKS and SLR-superscaffolder using the human HG002 genome and sheep gut microbiome datasets. The benchmark showed that SpLitteR scaffolding results in a 1.5-fold increase in NGA50 compared to the baseline LJA assembly and other scaffolders while introducing no additional misassemblies on the human dataset. Conclusion: We developed the SpLitteR tool for assembly graph phasing and scaffolding using barcoded linked-reads. We benchmarked SpLitteR on assembly graphs produced by various long-read assemblers and have demonstrated that TELL-Seq reads facilitate phasing and scaffolding in these graphs. This benchmarking demonstrates that SpLitteR improves upon the state-of-the-art linked-read scaffolders in the accuracy and contiguity metrics. SpLitteR is implemented in C++ as a part of the freely available SPAdes package and is available at https://github.com/ablab/spades/releases/tag/splitter-preprint.


Subjects
Diploidy, Animals, Humans, Human Genome/genetics, Sheep/genetics, Software, DNA Sequence Analysis/methods, Gastrointestinal Microbiome/genetics, High-Throughput Nucleotide Sequencing/methods, Genome/genetics
10.
Nat Commun ; 15(1): 8007, 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-39266513

ABSTRACT

Modern sequencing technology enables the systematic detection of complex structural variation (SV) across genomes. However, extensive DNA rearrangements arising through a series of mutations, a phenomenon we refer to as serial SV (sSV), remain underexplored, posing a challenge for SV discovery. Here, we present NAHRwhals ( https://github.com/WHops/NAHRwhals ), a method to infer repeat-mediated series of SVs in long-read genomic assemblies. Applying NAHRwhals to haplotype-resolved human genomes from 28 individuals reveals 37 sSV loci of various length and complexity. These sSVs explain otherwise cryptic variation in medically relevant regions such as the TPSAB1 gene, 8p23.1, 22q11 and Sotos syndrome regions. Comparisons with great ape assemblies indicate that most human sSVs formed recently, after the human-ape split, and involved non-repeat-mediated processes in addition to non-allelic homologous recombination. NAHRwhals reliably discovers and characterizes sSVs at scale and independent of species, uncovering their genomic abundance and suggesting broader implications for disease.


Subjects
Human Genome, Genomic Structural Variation, Hominidae, Humans, Animals, Hominidae/genetics, Human Genome/genetics, Genomics/methods, Haplotypes
11.
BMC Bioinformatics ; 25(1): 301, 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-39272021

ABSTRACT

Transformer-based large language models (LLMs) are well suited to biological sequence data because of analogies to natural language. Complex relationships can be learned because a concept of "words" can be generated through tokenization. When trained with masked token prediction, the models learn both token sequence identity and larger sequence context. We developed a methodology to interrogate model learning, which is relevant both for the interpretability of the model and for evaluating its potential for specific tasks. We used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens. To gain insight into the model's learning, we interrogated how the model performs predictions, extracted token embeddings, and defined a fine-tuning benchmarking task to predict the next tokens of different sizes without overlaps. This task evaluates foundation models without interrogating specific genome biology; it does not depend on tokenization strategies, vocabulary size, the dictionary, or the number of training parameters. Lastly, there is no leakage of information from token identity into the prediction task, which makes it particularly useful to evaluate the learning of sequence context. We discovered that the model with overlapping k-mers struggles to learn larger sequence context. Instead, the learned embeddings largely represent token sequence. Still, good performance is achieved for genome-biology-inspired fine-tuning tasks. Models with overlapping tokens may be used for tasks where a larger sequence context is of less relevance, but the token sequence directly represents the desired learning features. This emphasizes the need to interrogate knowledge representation in biological LLMs.
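
Note: the difference between overlapping and non-overlapping k-mer tokens mentioned in this abstract can be illustrated with a minimal sketch. The function name `kmer_tokens` and the toy sequence are assumptions made for illustration only; this is not DNABERT's tokenizer.

```python
def kmer_tokens(seq: str, k: int, overlapping: bool = True) -> list[str]:
    """Split a DNA string into k-mers (hypothetical helper, not DNABERT code).

    overlapping=True slides the window by 1 base, so neighbouring tokens
    share k-1 bases; overlapping=False slides by k, so tokens are disjoint.
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


# Toy sequence, chosen only for illustration.
overlap = kmer_tokens("ACGTAC", 3)
disjoint = kmer_tokens("ACGTAC", 3, overlapping=False)
print(overlap)   # ['ACG', 'CGT', 'GTA', 'TAC']
print(disjoint)  # ['ACG', 'TAC']
```

With overlapping tokens, each neighbour already reveals most of a masked token's bases, which is one way to read the abstract's point about information leaking from token identity.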


Subjects
DNA, Humans, DNA/chemistry, Human Genome, DNA Sequence Analysis/methods, Natural Language Processing, Computational Biology/methods
12.
Am J Hum Genet ; 111(10): 2129-2138, 2024 Oct 03.
Article in English | MEDLINE | ID: mdl-39270648

ABSTRACT

Large-scale, multi-ethnic whole-genome sequencing (WGS) studies, such as the National Human Genome Research Institute Genome Sequencing Program's Centers for Common Disease Genomics (CCDG), play an important role in increasing diversity for genetic research. Before performing association analyses, assessing Hardy-Weinberg equilibrium (HWE) is a crucial step in quality control procedures to remove low quality variants and ensure valid downstream analyses. Diverse WGS studies contain ancestrally heterogeneous samples; however, commonly used HWE methods assume that the samples are homogeneous. Therefore, directly applying these to the whole dataset can yield statistically invalid results. To account for this heterogeneity, HWE can be tested on subsets of samples that have genetically homogeneous ancestries and the results aggregated at each variant. To facilitate valid HWE subset testing, we developed a semi-supervised learning approach that predicts homogeneous ancestries based on the genotype. This method provides a convenient tool for estimating HWE in the presence of population structure and missing self-reported race and ethnicities in diverse WGS studies. In addition, assessing HWE within the homogeneous ancestries provides reliable HWE estimates that will directly benefit downstream analyses, including association analyses in WGS studies. We applied our proposed method on the CCDG dataset, predicting homogeneous genetic ancestry groups for 60,545 multi-ethnic WGS samples to assess HWE within each group.
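
Note: the reason pooled HWE testing can mislead on ancestrally heterogeneous samples can be sketched with a textbook chi-square test. This is an illustrative assumption, not the authors' method; the function name `hwe_chi2` and the genotype counts are hypothetical.

```python
import math


def hwe_chi2(n_aa: int, n_ab: int, n_bb: int) -> tuple[float, float]:
    """Textbook 1-df chi-square test for Hardy-Weinberg equilibrium.

    Takes genotype counts for one biallelic variant and returns
    (chi-square statistic, p-value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts: each group is in HWE on its own (allele frequency 0.9 vs 0.1).
group_1 = (81, 18, 1)
group_2 = (1, 18, 81)
pooled = tuple(a + b for a, b in zip(group_1, group_2))

for label, counts in [("group 1", group_1), ("group 2", group_2), ("pooled", pooled)]:
    chi2, p_value = hwe_chi2(*counts)
    print(f"{label}: chi2={chi2:.2f}, p={p_value:.3g}")
```

In this toy case each group passes the test while the pooled sample fails it, which is the situation that subset testing is meant to avoid. How the per-group results are aggregated at each variant is not specified in the abstract and is not shown here.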


Subjects
Supervised Machine Learning, Whole Genome Sequencing, Humans, Whole Genome Sequencing/methods, Human Genome, Population Genetics/methods, Ethnicity/genetics, Genome-Wide Association Study/methods, Single Nucleotide Polymorphism, Genotype
13.
PLoS Genet ; 20(9): e1011198, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39302992

ABSTRACT

Dominance is a fundamental parameter in genetics, determining the dynamics of natural selection on deleterious and beneficial mutations, the patterns of genetic variation in natural populations, and the severity of inbreeding depression in a population. Despite this importance, dominance parameters remain poorly known, particularly in humans or other non-model organisms. A key reason for this lack of information about dominance is that it is extremely challenging to disentangle the selection coefficient (s) of a mutation from its dominance coefficient (h). Here, we explore dominance and selection parameters in humans by fitting models to the site frequency spectrum (SFS) for nonsynonymous mutations. When assuming a single dominance coefficient for all nonsynonymous mutations, we find that numerous h values can fit the data, so long as h is greater than ~0.15. Moreover, we also observe that theoretically-predicted models with a negative relationship between h and s can also fit the data well, including models with h = 0.05 for strongly deleterious mutations. Finally, we use our estimated dominance and selection parameters to inform simulations revisiting the question of whether the out-of-Africa bottleneck has led to differences in genetic load between African and non-African human populations. These simulations suggest that the relative burden of genetic load in non-African populations depends on the dominance model assumed, with slight increases for more weakly recessive models and slight decreases shown for more strongly recessive models. Moreover, these results also demonstrate that models of partially recessive nonsynonymous mutations can explain the observed severity of inbreeding depression in humans, bridging the gap between molecular population genetics and direct measures of fitness in humans. Our work represents a comprehensive assessment of dominance and deleterious variation in humans, with implications for parameterizing models of deleterious variation in humans and other mammalian species.
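
Note: the roles of the selection coefficient (s) and dominance coefficient (h) can be made concrete with the standard population-genetic parameterization, in which the three genotypes have relative fitness 1, 1 - hs and 1 - s. The sketch below assumes that convention and random mating; the function names are hypothetical and the abstract does not state which exact model the authors fitted.

```python
def genotype_fitness(s: float, h: float) -> dict[str, float]:
    """Relative fitness under the common 1, 1 - h*s, 1 - s convention.

    'A' is the wild-type allele and 'a' the deleterious one.
    h = 0 is fully recessive, h = 0.5 additive, h = 1 fully dominant.
    """
    return {"AA": 1.0, "Aa": 1.0 - h * s, "aa": 1.0 - s}


def mean_fitness(q: float, s: float, h: float) -> float:
    """Mean fitness at deleterious allele frequency q, assuming random mating."""
    p = 1.0 - q
    w = genotype_fitness(s, h)
    return p * p * w["AA"] + 2 * p * q * w["Aa"] + q * q * w["aa"]


# Same s, different h: only the heterozygote's fitness changes.
for h in (0.0, 0.05, 0.5):
    print(h, genotype_fitness(s=0.1, h=h)["Aa"])
```

Because h only enters through the product h*s in heterozygotes, different (h, s) pairs can produce similar heterozygote fitness, which is one intuition for why the two are hard to disentangle from frequency data.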


Subjects
Population Genetics, Human Genome, Genetic Models, Mutation, Genetic Selection, Humans, Genetic Selection/genetics, Dominant Genes, Genetic Variation, Genetic Load, Inbreeding Depression/genetics
14.
Nat Commun ; 15(1): 7731, 2024 Sep 04.
Article in English | MEDLINE | ID: mdl-39231944

ABSTRACT

Whole genome sequencing (WGS) provides comprehensive, individualised cancer genomic information. However, routine tumour biopsies are formalin-fixed and paraffin-embedded (FFPE), damaging DNA, historically limiting their use in WGS. Here we analyse FFPE cancer WGS datasets from England's 100,000 Genomes Project, comparing 578 FFPE samples with 11,014 fresh frozen (FF) samples across multiple tumour types. We use an approach that characterises rather than discards artefacts. We identify three artefactual signatures, including one known (SBS57) and two previously uncharacterised (SBS FFPE, ID FFPE), and develop an "FFPEImpact" score that quantifies sample artefacts. Despite inferior sequencing quality, FFPE-derived data identifies clinically-actionable variants and mutational signatures and permits algorithmic stratification. Matched FF/FFPE validation cohorts show good concordance while acknowledging SBS, ID and copy-number artefacts. While FF-derived WGS data remains the gold standard, FFPE samples can be used for WGS if required, using analytical advancements developed here, potentially democratising whole cancer genomics to many.


Subjects
Formaldehyde, Neoplasms, Paraffin Embedding, Tissue Fixation, Whole Genome Sequencing, Humans, Paraffin Embedding/methods, Neoplasms/genetics, Neoplasms/pathology, Whole Genome Sequencing/methods, Tissue Fixation/methods, Genomics/methods, Mutation, Human Genome, Artifacts
16.
Science ; 385(6714): 1146-1147, 2024 Sep 13.
Article in English | MEDLINE | ID: mdl-39265004
17.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39256198

ABSTRACT

Accurate assessment of fragment abundance within a genome is crucial in clinical genomics applications such as the analysis of copy number variation (CNV). However, this task is often hindered by biased coverage in regions with varying guanine-cytosine (GC) content. These biases are particularly exacerbated in hybridization capture sequencing due to GC effects on probe hybridization and polymerase chain reaction (PCR) amplification efficiency. Such GC content-associated variations can exert a negative impact on the fidelity of CNV calling within hybridization capture panels. In this report, we present panelGC, a novel metric, to quantify and monitor GC biases in hybridization capture sequencing data. We establish the efficacy of panelGC, demonstrating its proficiency in identifying and flagging potential procedural anomalies, even in situations where instrument and experimental monitoring data may not be readily accessible. Validation using real-world datasets demonstrates that panelGC enhances the quality control and reliability of hybridization capture panel sequencing.
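
Note: the quantity underlying GC bias, the GC fraction of a region, can be computed as in the minimal sketch below. The abstract does not define the panelGC metric itself, so this only illustrates the GC content that such a metric would be compared against; the function names and toy sequence are assumptions.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C; NaN if none are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return float("nan")
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(seq: str, window: int) -> list[float]:
    """GC fraction in consecutive, non-overlapping windows of the given size."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]


# Toy sequence: a GC-rich window followed by an AT-rich one.
print(windowed_gc("GGCCATAT", 4))  # [1.0, 0.0]
```

In practice such per-region GC values would be paired with observed read depth per probe or bin to see whether coverage drifts with GC content.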


Subjects
Base Composition, DNA Copy Number Variations, Genomics, Humans, Genomics/methods, DNA Sequence Analysis/methods, Nucleic Acid Hybridization/methods, High-Throughput Nucleotide Sequencing/methods, High-Throughput Nucleotide Sequencing/standards, Human Genome, Reproducibility of Results
18.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39256200

ABSTRACT

Copy number variations (CNVs) play pivotal roles in disease susceptibility and have been intensively investigated in human disease studies. Long-read sequencing technologies offer opportunities for comprehensive structural variation (SV) detection, and numerous methodologies have been developed recently. Consequently, there is a pressing need to assess these methods and aid researchers in selecting appropriate techniques for CNV detection using long-read sequencing. Hence, we conducted an evaluation of eight CNV calling methods across 22 datasets from nine publicly available samples and 15 simulated datasets, covering multiple sequencing platforms. The overall performance of CNV callers varied substantially and was influenced by the input dataset type, sequencing depth, and CNV type, among others. Specifically, the PacBio CCS sequencing platform outperformed PacBio CLR and Nanopore platforms regarding CNV detection recall rates. A sequencing depth of 10x demonstrated the capability to identify 85% of the CNVs detected in a 50x dataset. Moreover, deletions were more generally detectable than duplications. Among the eight benchmarked methods, cuteSV, Delly, pbsv, and Sniffles2 demonstrated superior accuracy, while SVIM exhibited high recall rates.


Subjects
Algorithms, DNA Copy Number Variations, High-Throughput Nucleotide Sequencing, Humans, High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Computational Biology/methods, Human Genome
19.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39256199

ABSTRACT

Deoxyribonucleic acid (DNA) methylation plays a key role in gene regulation and is critical for development and human disease. Techniques such as whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) allow DNA methylation analysis at the genome scale, with Illumina NovaSeq 6000 and MGI Tech DNBSEQ-T7 being popular due to their efficiency and affordability. However, detailed comparative studies of their performance are not available. In this study, we constructed 60 WGBS and RRBS libraries for two platforms using different types of clinical samples and generated approximately 2.8 terabases of sequencing data. We systematically compared quality control metrics, genomic coverage, CpG methylation levels, intra- and interplatform correlations, and performance in detecting differentially methylated positions. Our results revealed that the DNBSEQ platform exhibited better raw read quality, although base quality recalibration indicated potential overestimation of base quality. The DNBSEQ platform also showed lower sequencing depth and less coverage uniformity in GC-rich regions than did the NovaSeq platform and tended to enrich methylated regions. Overall, both platforms demonstrated robust intra- and interplatform reproducibility for RRBS and WGBS, with NovaSeq performing better for WGBS, highlighting the importance of considering these factors when selecting a platform for bisulfite sequencing.


Subjects
CpG Islands, DNA Methylation, DNA Sequence Analysis, Humans, DNA Sequence Analysis/methods, Human Genome, High-Throughput Nucleotide Sequencing/methods, Sulfites/chemistry, Base Pairing, Whole Genome Sequencing/methods, Reproducibility of Results
20.
Bioinformatics ; 40(Suppl 2): ii11-ii19, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-39230689

ABSTRACT

MOTIVATION: Complex structural variants (SVs) are genomic rearrangements that involve multiple segments of DNA. They contribute to human diversity and have been shown to cause Mendelian disease. Nevertheless, our abilities to analyse complex SVs are very limited. As opposed to deletions and other canonical types of SVs, there are no established tools that have explicitly been designed for analysing complex SVs. RESULTS: Here, we describe a new computational approach that we specifically designed for genotyping complex SVs in short-read sequenced genomes. Given a variant description, our approach computes genotype-specific probability distributions for observing aligned read pairs with a wide range of properties. Subsequently, these distributions can be used to efficiently determine the most likely genotype for any set of aligned read pairs observed in a sequenced genome. In addition, we use these distributions to compute a genotyping difficulty for a given variant, which predicts the amount of data needed to achieve a reliable call. Careful evaluation confirms that our approach outperforms other genotypers by making reliable genotype predictions across both simulated and real data. On up to 7829 human genomes, we achieve high concordance with population-genetic assumptions and expected inheritance patterns. On simulated data, we show that precision correlates well with our prediction of genotyping difficulty. This together with low memory and time requirements makes our approach well-suited for application in biomedical studies involving small to very large numbers of short-read sequenced genomes. AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/kehrlab/Complex-SV-Genotyping.


Subjects
Human Genome, Genomic Structural Variation, DNA Sequence Analysis, Software, Humans, DNA Sequence Analysis/methods, Genotype, Genotyping Techniques/methods, Algorithms, High-Throughput Nucleotide Sequencing/methods, Genomics/methods