ABSTRACT
Structural variants are associated with cancers and developmental disorders, but challenges with estimating population frequency remain a barrier to prioritizing mutations over inherited variants. In particular, variability in variant calling heuristics and filtering limits the use of current structural variant catalogs. We present STIX, a method that, instead of relying on variant calls, indexes and searches the raw alignments from thousands of samples to enable more comprehensive allele frequency estimation.
Subjects
Genome, Genomic Structural Variation, Neoplasms, Algorithms, Genomic Structural Variation/genetics, High-Throughput Nucleotide Sequencing, Humans, Neoplasms/genetics, Software
ABSTRACT
Germline and somatic variants within an individual or cohort are interpreted with information from large cohorts. Annotation with this information becomes a computational bottleneck as population sets grow to terabytes of data. Here, we introduce echtvar, which efficiently encodes population variants and annotation fields into a compressed archive that can be used for rapid variant annotation and filtering. Most variants, represented by chromosome, position, and alleles, are encoded into 32 bits, half the size of previous encoding schemes and at least 4 times smaller than a naive encoding. The annotations, stored separately within the same archive, are also encoded and compressed. We show that echtvar is faster and uses less space than existing tools and that it can effectively reduce the number of candidate variants. We give examples on germline and somatic variants to document how echtvar can facilitate exploratory data analysis on genetic variants. Echtvar is available at https://github.com/brentp/echtvar under an MIT license.
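To make the idea of a 32-bit variant encoding concrete, the following is a minimal sketch of one way a position and short alleles could be packed into a single integer. The bit layout, the chunk size, and the names `encode` and `decode` are assumptions made for illustration; they are not taken from the abstract and should not be read as echtvar's actual format.

```python
# Hypothetical 32-bit layout (an illustration, NOT echtvar's on-disk format):
#   bits 12-31: position offset within a 2**20-base chunk (20 bits)
#   bits 10-11: len(ref) - 1 (2 bits)
#   bits  8-9 : len(alt) - 1 (2 bits)
#   bits  0-7 : up to four bases at 2 bits each, ref followed by alt
CHUNK_BITS = 20
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode(pos, ref, alt):
    """Pack a variant into a 32-bit integer, or return None if it does not fit."""
    bases = ref + alt
    if not ref or not alt or len(bases) > 4:
        return None  # longer alleles would need a separate fallback store
    if any(b not in BASE_CODE for b in bases):
        return None  # only plain A/C/G/T alleles fit this layout
    offset = pos & ((1 << CHUNK_BITS) - 1)
    packed = 0
    for b in bases:
        packed = (packed << 2) | BASE_CODE[b]
    packed <<= 2 * (4 - len(bases))  # left-align inside the 8-bit allele field
    return (offset << 12) | ((len(ref) - 1) << 10) | ((len(alt) - 1) << 8) | packed


def decode(value, chunk_index):
    """Recover (pos, ref, alt) from a packed value and the index of its chunk."""
    offset = value >> 12
    ref_len = ((value >> 10) & 3) + 1
    alt_len = ((value >> 8) & 3) + 1
    packed = value & 0xFF
    bases = "".join(
        CODE_BASE[(packed >> (6 - 2 * i)) & 3] for i in range(ref_len + alt_len)
    )
    pos = (chunk_index << CHUNK_BITS) | offset
    return pos, bases[:ref_len], bases[ref_len:]
```

In this sketch the chromosome and the chunk index are stored once per chunk rather than once per variant, which is what lets each variant fit in 32 bits; variants whose alleles are too long are rejected and would have to be kept elsewhere.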
Subjects
Single Nucleotide Polymorphism, Software, Humans, INDEL Mutation, Alleles, Chromosomes, Molecular Sequence Annotation
ABSTRACT
Each human genome includes de novo mutations that arose during gametogenesis. While these germline mutations represent a fundamental source of new genetic diversity, they can also create deleterious alleles that impact fitness. Whereas the rate and patterns of point mutations in the human germline are now well understood, far less is known about the frequency and features that impact de novo structural variants (dnSVs). We report a family-based study of germline mutations among 9,599 human genomes from 33 multigenerational CEPH-Utah families and 2,384 families from the Simons Foundation Autism Research Initiative. We find that de novo structural mutations detected by alignment-based, short-read WGS occur at an overall rate of at least 0.160 events per genome in unaffected individuals, and we observe a significantly higher rate (0.206 per genome) in ASD-affected individuals. In both probands and unaffected samples, nearly 73% of de novo structural mutations arose in paternal gametes, and we predict most de novo structural mutations to be caused by mutational mechanisms that do not require sequence homology. After multiple testing correction, we did not observe a statistically significant correlation between parental age and the rate of de novo structural variation in offspring. These results highlight that a spectrum of mutational mechanisms contribute to germline structural mutations and that these mechanisms most likely have markedly different rates and selective pressures than those leading to point mutations.
Subjects
Family, Human Genome/genetics, Germ Cells, Germ-Line Mutation/genetics, Mutation Rate, Aging/genetics, Autistic Disorder/genetics, Bias, DNA Copy Number Variations/genetics, DNA Mutational Analysis, Female, Humans, Male, Paternal Age, Point Mutation/genetics
ABSTRACT
Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. VCF can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary free and open source software tools and libraries that we wrote and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformation of variant files. These tools run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem and we highlight best practices. We briefly discuss the design of VCF, lessons learnt, and how we can address more complex variation through pangenome graph formats, variation that cannot easily be represented by VCF.
Subjects
Ecosystem, Genetic Variation, Computational Biology, Genetic Variation/genetics, Nucleotides, Software
ABSTRACT
BACKGROUND: Despite numerous molecular and computational advances, roughly half of patients with a rare disease remain undiagnosed after exome or genome sequencing. A particularly challenging barrier to diagnosis is identifying variants that cause deleterious alternative splicing at intronic or exonic loci outside of canonical donor or acceptor splice sites. RESULTS: Several existing tools predict the likelihood that a genetic variant causes alternative splicing. We sought to extend such methods by developing a new metric that aids in discerning whether a genetic variant leads to deleterious alternative splicing. Our metric combines genetic variation in the Genome Aggregation Database with alternative splicing predictions from SpliceAI to compare observed and expected levels of splice-altering genetic variation. We infer genic regions with significantly less splice-altering variation than expected to be constrained. The resulting model of regional splicing constraint captures differential splicing constraint across gene and exon categories, and the most constrained genic regions are enriched for pathogenic splice-altering variants. Building from this model, we developed ConSpliceML. This ensemble machine learning approach combines regional splicing constraint with multiple per-nucleotide alternative splicing scores to guide the prediction of deleterious splicing variants in protein-coding genes. ConSpliceML more accurately distinguishes deleterious and benign splicing variants than state-of-the-art splicing prediction methods, especially in "cryptic" splicing regions beyond canonical donor or acceptor splice sites. CONCLUSION: Integrating a model of genetic constraint with annotations from existing alternative splicing tools allows ConSpliceML to prioritize potentially deleterious splice-altering variants in studies of rare human diseases.
Subjects
Alternative Splicing, Rare Diseases, Humans, Rare Diseases/genetics, RNA Splicing, Introns, Exons, Mutation, RNA Splice Sites
ABSTRACT
SUMMARY: Unfazed is a command-line tool to determine the parental gamete of origin for de novo mutations from paired-end Illumina DNA sequencing reads. Unfazed uses variant information for a sequenced trio to identify the parental gamete of origin by linking phase-informative inherited variants to de novo mutations using read-based phasing. It achieves a high success rate by chaining reads into haplotype groups, thus increasing the search space for informative sites. Unfazed provides a simple command-line interface and scales well to large inputs, determining parent-of-origin for nearly 30 000 de novo variants in under 60 h. AVAILABILITY AND IMPLEMENTATION: Unfazed is available at https://github.com/jbelyeu/unfazed. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
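The core idea of read-based phasing described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Unfazed's implementation: reads are modelled as plain dictionaries of position to observed base, the function name `phase_de_novo` and the `informative` mapping are hypothetical, and the chaining of reads into haplotype groups is omitted.

```python
from collections import Counter


def phase_de_novo(reads, dnm_pos, dnm_alt, informative):
    """Vote on the parent of origin for a de novo mutation.

    reads:       list of {position: base} dicts, one per read or read pair
    dnm_pos:     position of the de novo mutation
    dnm_alt:     the de novo (alternate) base
    informative: {position: {base: "father" or "mother"}} for inherited sites
                 where the transmitted allele identifies one parent
    """
    votes = Counter()
    for read in reads:
        # Only reads that carry the de novo allele are informative.
        if read.get(dnm_pos) != dnm_alt:
            continue
        for pos, origin_by_base in informative.items():
            parent = origin_by_base.get(read.get(pos))
            if parent is not None:
                votes[parent] += 1
    if not votes or votes["father"] == votes["mother"]:
        return None, votes  # no evidence, or a tie: leave unphased
    return votes.most_common(1)[0][0], votes


# Two reads link the de novo T at 1000 to the paternally inherited G at 1050.
reads = [
    {1000: "T", 1050: "G"},
    {1000: "T", 1050: "G"},
    {1000: "C", 1050: "A"},  # reference allele at the de novo site: ignored
]
informative = {1050: {"G": "father", "A": "mother"}}
origin, votes = phase_de_novo(reads, 1000, "T", informative)
```

A single read can only link sites within its own span, which is why extending the search through chains of overlapping reads, as the abstract describes, increases the number of mutations that can be phased.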
Subjects
Algorithms, Software, DNA Sequence Analysis, Haplotypes, High-Throughput Nucleotide Sequencing
ABSTRACT
The textbook view that most germline mutations in mammals arise from replication errors is indirectly supported by the fact that there are both more mutations and more cell divisions in the male than in the female germline. When analyzing large de novo mutation datasets in humans, we find multiple lines of evidence that call that view into question. Notably, despite the drastic increase in the ratio of male to female germ cell divisions after the onset of spermatogenesis, even young fathers contribute three times more mutations than young mothers, and this ratio barely increases with parental age. This surprising finding points to a substantial contribution of damage-induced mutations. Indeed, C-to-G transversions and CpG transitions, which together constitute over one-fourth of all base substitution mutations, show genomic distributions and sex-specific age dependencies indicative of double-strand break repair and methylation-associated damage, respectively. Moreover, we find evidence that maternal age at conception influences the mutation rate both because of the accumulation of damage in oocytes and potentially through an influence on the number of postzygotic mutations in the embryo. These findings reveal underappreciated roles of DNA damage and maternal age in the genesis of human germline mutations.
Subjects
Double-Stranded DNA Breaks, DNA Repair, Nucleic Acid Databases, Germ-Line Mutation, Maternal Age, Adolescent, Adult, Female, Humans, Male, Middle Aged, Oocytes, Pregnancy, Spermatogenesis/genetics
ABSTRACT
GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.
Subjects
Breast Neoplasms/genetics, Human Genome, Genomics/methods, Search Engine/methods, DNA Sequence Analysis/methods, Software, Genetic Databases, Female, Humans, Internet
ABSTRACT
The potential for genetic discovery in human DNA sequencing studies is greatly diminished if DNA samples from a cohort are mislabeled, swapped, or contaminated or if they include unintended individuals. Unfortunately, the potential for such errors is significant since DNA samples are often manipulated by several protocols, labs, or scientists in the process of sequencing. We have developed a software package, peddy, to identify and facilitate the remediation of such errors via interactive visualizations and reports comparing the stated sex, relatedness, and ancestry to what is inferred from the individual genotypes derived from whole-genome (WGS) or whole-exome (WES) sequencing. Peddy predicts a sample's ancestry using a machine learning model trained on individuals of diverse ancestries from the 1000 Genomes Project reference panel. Peddy facilitates both automated and interactive, visual detection of sample swaps, poor sequencing quality, and other indicators of sample problems that, if left undetected, would inhibit discovery.
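As one illustration of comparing stated sample metadata with what the genotypes imply, the sketch below infers sex from the fraction of heterozygous calls on the X chromosome. It is a hypothetical example rather than peddy's code: the function names, the genotype representation, and the 0.1 threshold are all assumptions chosen for illustration.

```python
def x_het_ratio(genotypes):
    """Fraction of called chrX genotypes (outside the PAR) that are heterozygous.

    genotypes: list of (allele1, allele2) index pairs; None marks a missing allele.
    """
    called = [g for g in genotypes if None not in g]
    if not called:
        return None
    return sum(a != b for a, b in called) / len(called)


def check_sex(stated, genotypes, threshold=0.1):
    """Compare stated sex with sex inferred from X heterozygosity.

    The threshold is illustrative only; a real tool would calibrate it on data.
    """
    ratio = x_het_ratio(genotypes)
    if ratio is None:
        return None
    # Males carry one X, so true heterozygous calls there should be rare.
    inferred = "female" if ratio > threshold else "male"
    return {"inferred": inferred, "het_ratio": ratio, "matches": inferred == stated}


# A sample labelled male but with many heterozygous X genotypes is flagged.
report = check_sex("male", [(0, 1), (0, 0), (0, 1), (1, 1), (None, None)])
```

A mismatch such as this one does not by itself say what went wrong; it marks the sample for follow-up, for example as a possible swap or a mislabeled record.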
Subjects
Human Genome, Machine Learning, DNA Sequence Analysis/methods, Software, Human X Chromosomes/genetics, Exome, Female, Genetic Association Studies, Genetic Loci, Genotype, Humans, Male, Pedigree, Single Nucleotide Polymorphism
ABSTRACT
Motivation: Extracting biological insight from genomic data inevitably requires custom software. In many cases, this is accomplished with scripting languages, owing to their accessibility and brevity. Unfortunately, the ease of scripting languages typically comes at a substantial performance cost that is especially acute with the scale of modern genomics datasets. Results: We present hts-nim, a high-performance library written in the Nim programming language that provides a simple, scripting-like syntax without sacrificing performance. Availability and implementation: hts-nim is available at https://github.com/brentp/hts-nim and the example tools are at https://github.com/brentp/hts-nim-tools both under the MIT license.
Subjects
Genomics, Programming Languages, Software, Computational Biology
ABSTRACT
Summary: Mosdepth is a new command-line tool for rapidly calculating genome-wide sequencing coverage. It measures depth from BAM or CRAM files at either each nucleotide position in a genome or for sets of genomic regions. Genomic regions may be specified as either a BED file to evaluate coverage across capture regions, or as a fixed-size window as required for copy-number calling. Mosdepth uses a simple algorithm that is computationally efficient and enables it to quickly produce coverage summaries. We demonstrate that mosdepth is faster than existing tools and provides flexibility in the types of coverage profiles produced. Availability and implementation: mosdepth is available from https://github.com/brentp/mosdepth under the MIT license. Contact: bpederse@gmail.com. Supplementary information: Supplementary data are available at Bioinformatics online.
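One simple and efficient way to compute per-base depth, consistent with the description above, is a difference array followed by a cumulative sum. The sketch below works on plain (start, end) tuples rather than BAM or CRAM records, and the function names are hypothetical; it illustrates the technique and is not mosdepth's source code.

```python
from itertools import accumulate


def per_base_depth(chrom_length, alignments):
    """Per-base depth from 0-based, half-open (start, end) alignment intervals.

    Assumes 0 <= start < end <= chrom_length for every interval.
    """
    delta = [0] * (chrom_length + 1)
    for start, end in alignments:
        delta[start] += 1  # coverage begins at the alignment start
        delta[end] -= 1    # and stops at the alignment end
    # The running sum of the deltas is the depth at each position.
    return list(accumulate(delta[:-1]))


def window_means(depth, window):
    """Mean depth in consecutive fixed-size windows (the last may be shorter)."""
    means = []
    for i in range(0, len(depth), window):
        chunk = depth[i:i + window]
        means.append(sum(chunk) / len(chunk))
    return means


depth = per_base_depth(10, [(0, 5), (3, 8)])
means = window_means(depth, 5)
```

Each alignment costs two array updates regardless of its length, and one pass over the array then yields the depth at every position, so the work grows with the number of reads plus the chromosome length rather than with their product.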
Subjects
High-Throughput Nucleotide Sequencing/methods, DNA Sequence Analysis/methods, Software, Algorithms, Human Genome, Genomics/methods, Humans, Exome Sequencing/methods
ABSTRACT
MOTIVATION: Variant call format (VCF) files document the genetic variation observed after DNA sequencing, alignment and variant calling of a sample cohort. Given the complexity of the VCF format as well as the diverse variant annotations and genotype metadata, there is a need for fast, flexible methods enabling intuitive analysis of the variant data within VCF and BCF files. RESULTS: We introduce cyvcf2, a Python library and software package for fast parsing and querying of VCF and BCF files, and illustrate its speed, simplicity and utility. CONTACT: bpederse@gmail.com or aaronquinlan@gmail.com. AVAILABILITY AND IMPLEMENTATION: cyvcf2 is available from https://github.com/brentp/cyvcf2 under the MIT license and from common Python package managers. Detailed documentation is available at http://brentp.github.io/cyvcf2/.
Subjects
Genetic Variation, Genotyping Techniques/methods, DNA Sequence Analysis/methods, Software, Humans, Metadata
ABSTRACT
BACKGROUND: Given the strong environmental influence on both epigenetic marks and allergic asthma in children, the epigenetic alterations in respiratory epithelia might provide insight into allergic asthma. OBJECTIVE: We sought to identify DNA methylation and gene expression changes associated with childhood allergic persistent asthma. METHODS: We compared genomic DNA methylation patterns and gene expression in African American children with persistent atopic asthma (n = 36) versus healthy control subjects (n = 36). Results were validated in an independent population of asthmatic children (n = 30) by using a shared healthy control population (n = 36) and in an independent population of white adult atopic asthmatic patients (n = 12) and control subjects (n = 12). RESULTS: We identified 186 genes with significant methylation changes, differentially methylated regions or differentially methylated probes, after adjustment for age, sex, race/ethnicity, batch effects, inflation, and multiple comparisons. Genes differentially methylated included those with established roles in asthma and atopy and genes related to extracellular matrix, immunity, cell adhesion, epigenetic regulation, and airflow obstruction. The methylation changes were substantial (median, 9.5%; range, 2.6% to 29.5%). Hypomethylated and hypermethylated genes were associated with increased and decreased gene expression, respectively (P < 2.8 × 10^-6 for differentially methylated regions and P < 7.8 × 10^-10 for differentially methylated probes). Quantitative analysis in 53 differentially expressed genes demonstrated that 32 (60%) have significant methylation-expression relationships within 5 kb of the gene. Ten loci selected based on the relevance to asthma, magnitude of methylation change, and methylation-expression relationships were validated in an independent cohort of children with atopic asthma. Sixty-seven of 186 genes also have significant asthma-associated methylation changes in nasal epithelia of adult white asthmatic patients. CONCLUSIONS: Epigenetic marks in respiratory epithelia are associated with allergic asthma and gene expression changes in inner-city children.
Subjects
Asthma/genetics, DNA Methylation, Nasal Mucosa/metabolism, Adult, Black or African American/genetics, Aged, Aged 80 and over, Child, Genetic Epigenesis, Female, Gene Expression Regulation, Humans, Male, Middle Aged, White People/genetics, Young Adult
ABSTRACT
The gain-of-function mucin 5B (MUC5B) promoter variant, rs35705950, confers the largest risk, genetic or otherwise, for the development of idiopathic pulmonary fibrosis; however, the mechanisms underlying the regulation of MUC5B expression have yet to be elucidated. Here, we identify a critical regulatory domain that contains the MUC5B promoter variant and has a highly conserved forkhead box protein A2 (FOXA2) binding motif. This region is differentially methylated in association with idiopathic pulmonary fibrosis, MUC5B expression, and rs35705950. In addition, we show that this locus binds FOXA2 dynamically, and that binding of FOXA2 is necessary for enhanced expression of MUC5B. In aggregate, our findings identify novel targets to regulate the expression of MUC5B.
Subjects
Idiopathic Pulmonary Fibrosis/genetics, Mucin-5B/genetics, Base Sequence, Binding Sites, Chromatin Immunoprecipitation, CpG Islands/genetics, DNA Methylation/genetics, Gene Knockdown Techniques, Hepatocyte Nuclear Factor 3-beta/metabolism, Humans, Lung/metabolism, Lung/pathology, Mucin-5B/metabolism, Single Nucleotide Polymorphism/genetics, Genetic Promoter Regions, Protein Binding/genetics, RNA Polymerase II/metabolism, Small Interfering RNA/metabolism
ABSTRACT
RATIONALE: Sequence variation, methylation differences, and transcriptional changes in desmoplakin (DSP) have been observed in patients with idiopathic pulmonary fibrosis (IPF). OBJECTIVES: To identify novel variants in DSP associated with IPF and to characterize the relationship of these IPF sequence variants with DSP gene expression in human lung. METHODS: A chromosome 6 locus (7,370,061-7,606,946) was sequenced in 230 subjects with IPF and 228 control subjects. Validation genotyping of disease-associated variants was conducted in 936 subjects with IPF and 936 control subjects. DSP gene expression was measured in lung tissue from 334 subjects with IPF and 201 control subjects. MEASUREMENTS AND MAIN RESULTS: We identified 23 sequence variants in the chromosome 6 locus associated with IPF. Genotyping of selected variants in our validation cohort revealed that noncoding intron 1 variant rs2744371 (odds ratio = 0.77, 95% confidence interval [CI] = 0.66-0.91, P = 0.002) is protective for IPF, and a previously described IPF-associated intron 5 variant (rs2076295) is associated with increased risk of IPF (odds ratio = 1.36, 95% CI = 1.19-1.56, P < 0.001) after controlling for sex and age. DSP expression is 2.3-fold increased (95% CI = 1.91-2.71) in IPF lung tissue (P < 0.0001). Only the minor allele at rs2076295 is associated with decreased DSP expression (P = 0.001). Staining of fibrotic and normal human lung tissue localized DSP to airway epithelia. CONCLUSIONS: Sequence variants in DSP are associated with IPF, and rs2076295 genotype is associated with differential expression of DSP in the lung. DSP expression is increased in IPF lung and concentrated in the airway epithelia, suggesting a potential role for DSP in the pathogenesis of IPF.
Subjects
Desmoplakins/genetics, Genetic Variation/genetics, Idiopathic Pulmonary Fibrosis/genetics, Aged, Female, Gene Expression/genetics, Humans, Male, Middle Aged, Odds Ratio
ABSTRACT
Cancer-associated somatic mutations outside protein-coding regions remain largely unexplored. Analyses of the TERT locus have indicated that non-coding regulatory mutations can be more frequent than previously suspected and play important roles in oncogenesis. Using a computational method called SASE-hunter, developed here, we identified a novel signature of accelerated somatic evolution (SASE) marked by a significant excess of somatic mutations localized in a genomic locus, and prioritized those loci that carried the signature in multiple cancer patients. Interestingly, even when an affected locus carried the signature in multiple individuals, the mutations contributing to SASE themselves were rarely recurrent at the base-pair resolution. In a pan-cancer analysis of 906 samples from 12 tumor types, we detected SASE in the promoters of several genes, including known cancer genes such as MYC, BCL2, RBM5 and WWOX. Nucleotide substitution patterns consistent with oxidative DNA damage and local somatic hypermutation appeared to contribute to this signature in selected gene promoters (e.g. MYC). SASEs in selected cancer gene promoters were associated with over-expression, and also correlated with the age of onset of cancer, aggressiveness of the disease and survival. Taken together, our work detects a hitherto under-appreciated and clinically important class of regulatory changes in cancer genomes.
Subjects
Mutation, Neoplasms/genetics, Genetic Promoter Regions, Adult, Gene Expression, Genomics, Humans, Middle Aged, Neoplasms/diagnosis, Software
ABSTRACT
Integrative analyses of genomic, epigenomic and transcriptomic features for human and various model organisms have revealed that many such features are nonrandomly distributed in the genome. Significant enrichment (or depletion) of genomic features is anticipated to be biologically important. Detection of genomic regions enriched for certain features, and estimation of the corresponding statistical significance, rely on the expected null distribution generated by a permutation model. We discuss different genome-wide permutation approaches, present examples where the permutation strategy affects the null model and show that the confidence in estimating statistical significance of genome-wide enrichment might depend on the choice of the permutation approach. In cases where biologically relevant constraints are unclear, it is preferable to examine whether key conclusions are consistent, irrespective of the choice of the randomization strategy.
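The dependence of the null model on the permutation strategy can be illustrated with a small sketch. The code below estimates an empirical p-value for the overlap between a feature set and a region set by randomly relocating the features, first anywhere on a toy chromosome and then only within a restricted domain. All names, coordinates, and the two strategies are assumptions made for illustration and are not taken from the paper.

```python
import random


def count_overlaps(features, regions):
    """Number of features (half-open intervals) that overlap at least one region."""
    return sum(
        any(f_start < r_end and r_start < f_end for r_start, r_end in regions)
        for f_start, f_end in features
    )


def shuffle_features(features, domain, rng):
    """Relocate each feature uniformly at random inside domain = (lo, hi)."""
    lo, hi = domain
    shuffled = []
    for start, end in features:
        length = end - start
        new_start = rng.randrange(lo, hi - length + 1)
        shuffled.append((new_start, new_start + length))
    return shuffled


def empirical_p(features, regions, domain, n_perm=1000, seed=0):
    """Observed overlap count and one-sided empirical p-value under the null."""
    rng = random.Random(seed)
    observed = count_overlaps(features, regions)
    as_extreme = sum(
        count_overlaps(shuffle_features(features, domain, rng), regions) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (as_extreme + 1) / (n_perm + 1)


features = [(100, 110), (200, 210), (300, 310)]
regions = [(90, 320)]
# Null 1: features may land anywhere on a 100 kb toy chromosome.
obs, p_genome = empirical_p(features, regions, domain=(0, 100_000))
# Null 2: features are confined to the first 1 kb (e.g. a mappable sub-region).
_, p_local = empirical_p(features, regions, domain=(0, 1_000))
```

The observed overlap is identical in both runs, but the restricted domain makes chance overlaps far more likely, so the same data look less significant under the second null. Reporting results under more than one defensible strategy is a direct way to check whether a conclusion is robust.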
Subjects
Genome-Wide Association Study/statistics & numerical data, Binding Sites/genetics, CCCTC-Binding Factor, Computational Biology, DNA/chemistry, DNA/genetics, DNA/metabolism, Epigenomics/statistics & numerical data, G-Quadruplexes, Gene Expression Profiling/statistics & numerical data, Human Genome, Genomics/statistics & numerical data, High-Throughput Nucleotide Sequencing/statistics & numerical data, Humans, Genetic Models, Statistical Models, Repressor Proteins/metabolism, STAT2 Transcription Factor/metabolism, Software
ABSTRACT
BACKGROUND: Fibrotic idiopathic interstitial pneumonias (fIIP) are a group of fatal lung diseases with largely unknown etiology and without definitive treatment other than lung transplant to prolong life. There is strong evidence for the importance of both rare and common genetic risk alleles in familial and sporadic disease. We have previously used genome-wide single nucleotide polymorphism data to identify 10 risk loci for fIIP. Here we extend that work to imputed genome-wide genotypes and conduct new RNA sequencing studies of lung tissue to identify and characterize new fIIP risk loci. RESULTS: We performed genome-wide genotype imputation association analyses in 1616 non-Hispanic white (NHW) cases and 4683 NHW controls followed by validation and replication (878 cases, 2017 controls) genotyping and targeted gene expression in lung tissue. Following meta-analysis of the discovery and replication populations, we identified a novel fIIP locus in the HLA region of chromosome 6 (rs7887, P_meta = 3.7 × 10^-9). Imputation of classic HLA alleles identified two in high linkage disequilibrium that are associated with fIIP (DRB1*15:01, P = 1.3 × 10^-7 and DQB1*06:02, P = 6.1 × 10^-8). Targeted RNA-sequencing of the HLA locus identified 21 genes differentially expressed between fibrotic and control lung tissue (Q < 0.001), many of which are involved in immune and inflammatory response regulation. In addition, the putative risk alleles, DRB1*15:01 and DQB1*06:02, are associated with expression of the DQB1 gene among fIIP cases (Q < 1 × 10^-16). CONCLUSIONS: We have identified a genome-wide significant association between the HLA region and fIIP. Two HLA alleles are associated with fIIP and affect expression of HLA genes in lung tissue, indicating that the potential genetic risk due to HLA alleles may involve gene regulation in addition to altered protein structure. These studies reveal the importance of the HLA region for risk of fIIP and a basis for the potential etiologic role of auto-immunity in fIIP.
Subjects
Genome-Wide Association Study/methods, HLA-DQ beta-Chains/genetics, HLA-DRB1 Chains/genetics, Idiopathic Pulmonary Fibrosis/genetics, Pulmonary Fibrosis/genetics, RNA Sequence Analysis/methods, Adult, Aged, Human Chromosome Pair 6/genetics, Female, Gene Expression Profiling, Gene Expression Regulation, Genetic Loci, Genetic Predisposition to Disease, Humans, Linkage Disequilibrium, Male, Middle Aged
ABSTRACT
BACKGROUND: Epigenetic marks are heritable, influenced by the environment, direct the maturation of T lymphocytes, and in mice enhance the development of allergic airway disease. Thus it is important to define epigenetic alterations in asthmatic populations. OBJECTIVE: We hypothesize that epigenetic alterations in circulating PBMCs are associated with allergic asthma. METHODS: We compared DNA methylation patterns and gene expression in inner-city children with persistent atopic asthma versus healthy control subjects by using DNA and RNA from PBMCs. Results were validated in an independent population of asthmatic patients. RESULTS: Comparing asthmatic patients (n = 97) with control subjects (n = 97), we identified 81 regions that were differentially methylated. Several immune genes were hypomethylated in asthma, including IL13, RUNX3, and specific genes relevant to T lymphocytes (TIGIT). Among asthmatic patients, 11 differentially methylated regions were associated with higher serum IgE concentrations, and 16 were associated with percent predicted FEV1. Hypomethylated and hypermethylated regions were associated with increased and decreased gene expression, respectively (P < 6 × 10^-12 for asthma and P < .01 for IgE). We further explored the relationship between DNA methylation and gene expression using an integrative analysis and identified additional candidates relevant to asthma (IL4 and ST2). Methylation marks involved in T-cell maturation (RUNX3), TH2 immunity (IL4), and oxidative stress (catalase) were validated in an independent asthmatic cohort of children living in the inner city. CONCLUSIONS: Our results demonstrate that DNA methylation marks in specific gene loci are associated with asthma and suggest that epigenetic changes might play a role in establishing the immune phenotype associated with asthma.