ABSTRACT
Since its introduction in 2011, the variant call format (VCF) has been widely adopted for processing DNA and RNA variants in practically all population studies, as well as in somatic and germline mutation studies. VCF can represent single nucleotide variants, multi-nucleotide variants, insertions and deletions, and simple structural variants called and anchored against a reference genome. Here we present a spectrum of over 125 useful, complementary, free and open source software tools and libraries that we wrote and made available through the vcflib, bio-vcf, cyvcf2, hts-nim and slivar projects. These tools are applied for comparison, filtering, normalisation, smoothing and annotation of VCF, as well as output of statistics, visualisation, and transformation of variant files. They run every day in critical biomedical pipelines and countless shell scripts. Our tools are part of the wider bioinformatics ecosystem, and we highlight best practices. We briefly discuss the design of VCF, lessons learnt, and how more complex variation that cannot easily be represented in VCF can be addressed through pangenome graph formats.
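The variant classes named in this abstract can be told apart from the REF and ALT columns of a VCF record. The following is a minimal sketch in plain Python, not the API of vcflib, bio-vcf, cyvcf2, hts-nim or slivar; the function names `classify_allele` and `parse_vcf_line` are hypothetical, and the length-based rule assumes alleles are already normalised.

```python
def classify_allele(ref, alt):
    """Classify one ALT allele against REF using allele lengths only.

    Assumption: alleles are normalised, so equal-length alleles longer
    than one base are treated as multi-nucleotide variants (MNVs).
    """
    if alt in (".", "*"):
        # "." means no alternate allele; "*" marks a spanning deletion
        return "no-call/spanning"
    if alt.startswith("<") or "[" in alt or "]" in alt:
        # symbolic alleles such as <DEL> and breakend notation
        return "symbolic/structural"
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def parse_vcf_line(line):
    """Split a tab-separated VCF data line into one tuple per ALT allele."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    return [(chrom, int(pos), ref, a, classify_allele(ref, a))
            for a in alt.split(",")]


if __name__ == "__main__":
    for row in parse_vcf_line("chr1\t100\t.\tA\tG,AT\t50\tPASS\t."):
        print(row)
```

A multi-allelic record yields one tuple per ALT allele, which is why the parser returns a list rather than a single classification.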
Subjects
Ecosystem, Genetic Variation, Computational Biology, Genetic Variation/genetics, Nucleotides, Software
ABSTRACT
BACKGROUND: Human papillomavirus (HPV) is a common sexually transmitted infection associated with cervical cancer that frequently occurs as a coinfection of types and subtypes. Highly similar sublineages that show over 100-fold differences in cancer risk are not distinguishable in coinfections with current typing methods. RESULTS: We describe an efficient set of computational tools, rkmh, for analyzing complex mixed infections of related viruses based on sequence data. rkmh makes extensive use of MinHash similarity measures, and includes utilities for removing host DNA and classifying reads by type, lineage, and sublineage. We show that rkmh is capable of assigning reads to their HPV type as well as HPV16 lineage and sublineages. CONCLUSIONS: Accurate read classification enables estimates of percent composition when there are multiple infecting lineages or sublineages. While we demonstrate rkmh for HPV with multiple sequencing technologies, it is also applicable to other mixtures of related sequences.
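The MinHash-based read classification described above can be illustrated with a small sketch. This is not rkmh's implementation: the k-mer size, sketch size, hash function (blake2b, chosen only for determinism) and all function names are assumptions made for illustration. Each sequence is reduced to its smallest k-mer hashes, and Jaccard similarity is estimated from the overlap of those bottom sketches.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=4, size=16):
    """Bottom sketch: the `size` smallest k-mer hashes, sorted."""
    return sorted(hash64(km) for km in kmers(seq, k))[:size]


def similarity(a, b, size=16):
    """Estimate Jaccard similarity from two bottom sketches."""
    union = sorted(set(a) | set(b))[:size]
    if not union:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in union if h in shared) / len(union)


def classify(read, references, k=4, size=16):
    """Return the name of the reference whose sketch best matches the read."""
    read_sketch = sketch(read, k, size)
    return max(
        references,
        key=lambda name: similarity(
            read_sketch, sketch(references[name], k, size), size),
    )


if __name__ == "__main__":
    # Toy sequences, not real HPV genomes
    refs = {"typeB": "GGGGGGCCCCCCGGGG", "typeA": "ACGTTGCAAGCTTGCA"}
    print(classify("CGTTGCAAG", refs, k=4, size=64))
```

When the sketch size exceeds the number of distinct k-mers, as in the toy usage, the estimate equals the exact Jaccard index; smaller sketches trade accuracy for speed and memory.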
Subjects
Coinfection/diagnosis, Coinfection/virology, Computational Biology/methods, Human papillomavirus 16/physiology, Software, Viral DNA/genetics, Human papillomavirus 16/classification, Humans, Papillomavirus Infections/virology, Phylogeny, DNA Sequence Analysis, Time Factors
ABSTRACT
Errors in multiple sequence alignments (MSAs) can reduce accuracy in positive-selection inference. Therefore, it has been suggested to filter MSAs before conducting further analyses. One widely used filter, Guidance, allows users to remove MSA positions aligned with low confidence. However, Guidance's utility in positive-selection inference has been disputed in the literature. We have conducted an extensive simulation-based study to characterize fully how Guidance impacts positive-selection inference, specifically for protein-coding sequences of realistic divergence levels. We also investigated whether novel scoring algorithms, which phylogenetically corrected confidence scores, and a new gap-penalization score-normalization scheme improved Guidance's performance. We found that no filter, including original Guidance, consistently benefitted positive-selection inferences. Moreover, all improvements detected were exceedingly minimal, and in certain circumstances, Guidance-based filters worsened inferences.
Subjects
Computational Biology/methods, Sequence Alignment/methods, Computer Simulation, Proteins/genetics, Genetic Selection, Software
ABSTRACT
Several recent works have shown that protein structure can predict site-specific evolutionary sequence variation. In particular, sites that are buried and/or have many contacts with other sites in a structure have been shown to evolve more slowly, on average, than surface sites with few contacts. Here, we present a comprehensive study of the extent to which numerous structural properties can predict sequence variation. The quantities we considered include buriedness (as measured by relative solvent accessibility), packing density (as measured by contact number), structural flexibility (as measured by B factors, root-mean-square fluctuations, and variation in dihedral angles), and variability in designed structures. We obtained structural flexibility measures both from molecular dynamics simulations performed on nine non-homologous viral protein structures and from variation in homologous variants of those proteins, where they were available. We obtained measures of variability in designed structures from flexible-backbone design in the Rosetta software. We found that most of the structural properties correlate with site variation in the majority of structures, though the correlations are generally weak (correlation coefficients of 0.1-0.4). Moreover, we found that buriedness and packing density were better predictors of evolutionary variation than structural flexibility. Finally, variability in designed structures was a weaker predictor of evolutionary variability than buriedness or packing density, but it was comparable in its predictive power to the best structural flexibility measures. We conclude that simple measures of buriedness and packing density are better predictors of evolutionary variation than the more complicated predictors obtained from dynamic simulations, ensembles of homologous structures, or computational protein design.
Subjects
Molecular Evolution, Viral Proteins/chemistry, Amino Acid Sequence, Entropy, Molecular Dynamics Simulation, Protein Conformation
ABSTRACT
Childhood radioactive iodine exposure from the Chornobyl accident increased papillary thyroid carcinoma (PTC) risk. While cervical lymph node metastases (cLNM) are well-recognized in pediatric PTC, the PTC metastatic process and potential radiation association are poorly understood. Here, we analyze cLNM occurrence among 428 PTC with genomic landscape analyses and known drivers (131I-exposed = 349, unexposed = 79; mean age = 27.9 years). We show that cLNM are more frequent in PTC with fusion (55%) versus mutation (30%) drivers, although the proportion varies by specific driver gene (RET-fusion = 71%, BRAF-mutation = 38%, RAS-mutation = 5%). cLNM frequency is not associated with other characteristics, including radiation dose. cLNM molecular profiling (N = 47) demonstrates 100% driver concordance with matched primary PTCs and highly concordant mutational spectra. Transcriptome analysis reveals 17 differentially expressed genes, particularly in the HOXC cluster and BRINP3; the strongest differentially expressed microRNA also is near HOXC10. Our findings underscore the critical role of driver alterations and provide promising candidates for elucidating the biological underpinnings of PTC cLNM.
Subjects
Chernobyl Nuclear Accident, Iodine Radioisotopes, Lymphatic Metastasis, Mutation, Papillary Thyroid Cancer, Thyroid Neoplasms, Humans, Papillary Thyroid Cancer/genetics, Papillary Thyroid Cancer/pathology, Lymphatic Metastasis/genetics, Male, Adult, Female, Thyroid Neoplasms/genetics, Thyroid Neoplasms/pathology, Adolescent, Proto-Oncogene Proteins B-raf/genetics, Young Adult, Lymph Nodes/pathology, Proto-Oncogene Proteins c-ret/genetics, Child, Genomics, Middle Aged, Homeodomain Proteins/genetics, Homeodomain Proteins/metabolism, Gene Expression Profiling, MicroRNAs/genetics, MicroRNAs/metabolism, Radiation-Induced Neoplasms/genetics, Radiation-Induced Neoplasms/pathology, Neck/pathology, Neoplastic Gene Expression Regulation
ABSTRACT
Acral melanoma, which is not ultraviolet (UV)-associated, is the most common type of melanoma in several low- and middle-income countries including Mexico. Latin American samples are significantly underrepresented in global cancer genomics studies, which directly affects patients in these regions as it is known that cancer risk and incidence may be influenced by ancestry and environmental exposures. To address this, here we characterise the genome and transcriptome of 128 acral melanoma tumours from 96 Mexican patients, a population notable because of its genetic admixture. Compared with other studies of melanoma, we found fewer frequent mutations in classical driver genes such as BRAF, NRAS or NF1. While most patients had predominantly Amerindian genetic ancestry, those with higher European ancestry had increased frequency of BRAF mutations and a lower number of structural variants. These BRAF-mutated tumours have a transcriptional profile similar to cutaneous non-volar melanocytes, suggesting that acral melanomas in these patients may arise from a distinct cell of origin compared to other tumours arising in these locations. KIT mutations were found in a subset of these tumours, and transcriptional profiling defined three expression clusters; these characteristics were associated with overall survival. We highlight novel low-frequency drivers, such as SPHKAP, which correlate with a distinct genomic profile and clinical characteristics. Our study enhances knowledge of this understudied disease and underscores the importance of including samples from diverse ancestries in cancer genomics studies.
ABSTRACT
The 1986 Chernobyl nuclear power plant accident increased papillary thyroid carcinoma (PTC) incidence in surrounding regions, particularly for radioactive iodine (131I)-exposed children. We analyzed genomic, transcriptomic, and epigenomic characteristics of 440 PTCs from Ukraine (from 359 individuals with estimated childhood 131I exposure and 81 unexposed children born after 1986). PTCs displayed radiation dose-dependent enrichment of fusion drivers, nearly all in the mitogen-activated protein kinase pathway, and increases in small deletions and simple/balanced structural variants that were clonal and bore hallmarks of nonhomologous end-joining repair. Radiation-related genomic alterations were more pronounced for individuals who were younger at exposure. Transcriptomic and epigenomic features were strongly associated with driver events but not radiation dose. Our results point to DNA double-strand breaks as early carcinogenic events that subsequently enable PTC growth after environmental radiation exposure.
Subjects
Chernobyl Nuclear Accident, Mutation, Radiation-Induced Neoplasms/genetics, Papillary Thyroid Cancer/etiology, Papillary Thyroid Cancer/genetics, Thyroid Neoplasms/etiology, Thyroid Neoplasms/genetics, Adolescent, Adult, Child, Preschool Child, DNA Copy Number Variations, Epigenome, Female, Gene Expression Profiling, ras Genes, Genetic Variation, Humans, Infant, Iodine Radioisotopes, Loss of Heterozygosity, Male, Middle Aged, Proto-Oncogene Proteins B-raf/genetics, RNA-Seq, Radiation Dosage, Thyroid Gland/physiology, Thyroid Gland/radiation effects, Genetic Translocation, Ukraine, Whole Genome Sequencing, Young Adult
ABSTRACT
Structural variants (SVs) remain challenging to represent and study relative to point mutations despite their demonstrated importance. We show that variation graphs, as implemented in the vg toolkit, provide an effective means for leveraging SV catalogs for short-read SV genotyping experiments. We benchmark vg against state-of-the-art SV genotypers using three sequence-resolved SV catalogs generated by recent long-read sequencing studies. In addition, we use assemblies from 12 yeast strains to show that graphs constructed directly from aligned de novo assemblies improve genotyping compared to graphs built from intermediate SV catalogs in the VCF format.
Subjects
Genomic Structural Variation, Genotyping Techniques/methods, Software, Fungal Genome, Saccharomyces cerevisiae, Whole Genome Sequencing/methods
ABSTRACT
SUMMARY: GFA has emerged as a standard format for the exchange of genome assemblies and sequence graphs. To encourage further adoption in high-performance software we have developed an open-source C++ library for GFA and a set of utilities for summarizing and manipulating the format. AVAILABILITY: The gfakluge source code is freely available under the MIT license at https://github.com/edawson/gfakluge. It has been tested on both Mac OS X and Linux.
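To make the kind of summarizing utility mentioned above concrete, here is a minimal Python sketch that reads the segment (S) and link (L) lines of a GFA1 file and reports basic counts. It is not gfakluge's C++ API; `parse_gfa`, `summarize` and the tuple types are hypothetical names, and only the S and L record types are handled.

```python
from collections import namedtuple

Segment = namedtuple("Segment", "name sequence")
Link = namedtuple("Link", "source source_orient target target_orient overlap")


def parse_gfa(text):
    """Parse GFA1 text into a dict of segments and a list of links.

    Only S (segment) and L (link) lines are read; header, path and
    other record types are skipped in this sketch.
    """
    segments, links = {}, []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = Segment(fields[1], fields[2])
        elif fields[0] == "L" and len(fields) >= 6:
            links.append(Link(*fields[1:6]))
    return segments, links


def summarize(segments, links):
    """Count segments and links and sum the stored sequence lengths."""
    total_bp = sum(len(s.sequence) for s in segments.values()
                   if s.sequence != "*")  # "*" means sequence not stored
    return {"segments": len(segments), "links": len(links),
            "total_bp": total_bp}


if __name__ == "__main__":
    example = "H\tVN:Z:1.0\nS\t1\tACGT\nS\t2\tGG\nL\t1\t+\t2\t-\t0M\n"
    print(summarize(*parse_gfa(example)))
```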
ABSTRACT
Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.
Subjects
Genetic Variation, Computer Simulation, DNA/genetics, Humans
ABSTRACT
We investigate the causes of site-specific evolutionary-rate variation in influenza haemagglutinin (HA) between human and avian influenza, for subtypes H1, H3, and H5. By calculating the evolutionary-rate ratio, ω = dN/dS, as a function of a residue's solvent accessibility in the three-dimensional protein structure, we show that solvent accessibility has a significant but relatively modest effect on site-specific rate variation. By comparing rates within HA subtypes among host species, we derive an upper limit to the amount of variation that can be explained by structural constraints of any kind. Protein structure explains only 20-40% of the variation in ω. Finally, by comparing ω at sites near the sialic-acid-binding region to ω at other sites, we show that ω near the sialic-acid-binding region is significantly elevated in both human and avian influenza, with the exception of avian H5. We conclude that protein structure, HA subtype, and host biology all impose distinct selection pressures on sites in influenza HA.