ABSTRACT
There is increasing interest in developing diagnostics that discriminate individual mutagenic mechanisms in a range of applications that include identifying population-specific mutagenesis and resolving distinct mutation signatures in cancer samples. Analyses for these applications assume that mutagenic mechanisms have a distinct relationship with neighboring bases that allows them to be distinguished. Direct support for this assumption is limited to a small number of simple cases, e.g., CpG hypermutability. We have evaluated whether the mechanistic origin of a point mutation can be resolved using only sequence context for a more complicated case. We contrasted single nucleotide variants originating from the multitude of mutagenic processes that normally operate in the mouse germline with those induced by the potent mutagen N-ethyl-N-nitrosourea (ENU). The considerable overlap in the mutation spectra of these two samples makes this a challenging problem. Employing a new, robust log-linear modeling method, we demonstrate that neighboring bases contain information regarding point mutation direction that differs between the ENU-induced and spontaneous mutation variant classes. A logistic regression classifier exhibited strong performance at discriminating between the different mutation classes. Concordance between the feature set of the best classifier and information content analyses suggests our results can be generalized to other mutation classification problems. We conclude that machine learning can be used to build a practical classification tool to identify the mutation mechanism for individual genetic variants. Software implementing our approach is freely available under an open-source license.
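As an illustration of what "using only sequence context" can mean for a classifier, the sketch below one-hot encodes the bases flanking a variant into a numeric feature vector of the kind a logistic regression could consume. The function name `encode_context`, the flank size of 2, and the feature layout are assumptions made for this example; the abstract does not describe the authors' actual feature set.

```python
BASES = "ACGT"


def encode_context(context, flank=2):
    """One-hot encode the bases flanking a central (mutated) position.

    `context` is a string of length 2 * flank + 1 whose middle character
    is the mutated base. The middle base is skipped, so the result has
    4 * 2 * flank entries. Hypothetical helper for illustration only.
    """
    if len(context) != 2 * flank + 1:
        raise ValueError("context must have length 2 * flank + 1")
    features = []
    for offset, base in enumerate(context.upper()):
        if offset == flank:
            continue  # the mutated base itself is not a neighbor
        if base not in BASES:
            raise ValueError("unexpected base: %r" % base)
        # Four indicator values per flanking position, in A, C, G, T order.
        features.extend(1 if base == b else 0 for b in BASES)
    return features


# Example: flanks A, C (5' side) and T, A (3' side) around a central G.
print(encode_context("ACGTA"))
```

Each variant becomes one such vector, labeled with its class (for instance ENU-induced or spontaneous), and the labeled vectors are what a classifier would be trained on.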
Subjects
Machine Learning , Point Mutation , Sequence Analysis, DNA/methods , Animals , Ethylnitrosourea/toxicity , Mice , Mutagens/toxicity , Nucleotide Motifs
ABSTRACT
Mutation processes differ between types of point mutation, genomic locations, cells, and biological species. For some point mutations, specific neighboring bases are known to be mechanistically influential. Beyond these cases, numerous questions remain unresolved, including: what are the sequence motifs that affect point mutations? How large are the motifs? Are they strand symmetric? And do they vary between samples? We present new log-linear models that allow explicit examination of these questions, along with sequence-logo-style visualization to enable identifying specific motifs. We demonstrate the performance of these methods by analyzing mutation processes in human germline and malignant melanoma. We recapitulate the known CpG effect, and identify novel motifs, including a highly significant motif associated with A→G mutations. We show that major effects of neighbors on germline mutation lie within [Formula: see text] of the mutating base. Models are also presented for contrasting the entire mutation spectra (the distribution of the different point mutations). We show the spectra vary significantly between autosomes and X-chromosome, with a difference in T→C transition dominating. Analyses of malignant melanoma confirmed reported characteristic features of this cancer, including statistically significant strand asymmetry, and markedly different neighboring influences. The methods we present are made freely available as a Python library at https://bitbucket.org/pycogent3/mutationmotif.
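The simplest log-linear question about a neighboring base is whether the base found at one flanking position is independent of mutation status. A minimal sketch of that comparison (independence model against the saturated model) is the likelihood-ratio statistic G = 2 Σ O ln(O/E) on a two-way table of counts. This is a generic test written for illustration, not the models implemented in the library named above; the function name `g_statistic` and the counts in the example are hypothetical.

```python
import math


def g_statistic(table):
    """Deviance (G) of the independence log-linear model for a two-way table.

    `table` is a list of rows of observed counts, e.g. one row for mutated
    sites and one for reference sites, with one column per base (A, C, G, T)
    at a chosen flanking position.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed == 0:
                continue  # a zero cell contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / total
            g += observed * math.log(observed / expected)
    return 2.0 * g


# Hypothetical counts of A, C, G, T at the -1 position.
mutated = [120, 340, 90, 150]
reference = [180, 170, 160, 190]
print(g_statistic([mutated, reference]))
```

Under independence, G is approximately chi-square distributed with (rows - 1) × (columns - 1) degrees of freedom, here 3.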
Subjects
Nucleotide Motifs , Point Mutation , Sequence Analysis, DNA/methods , Software , Animals , CpG Islands , Data Interpretation, Statistical , Humans
ABSTRACT
Whole-exome sequencing (WES) is a new tool that allows the rapid, inexpensive and accurate exploration of Mendelian and complex diseases, such as obesity. To identify sequence variants associated with obesity, we performed WES of family trios of one male teenager and one female child with severe early-onset obesity. Additionally, the teenage patient had hypopituitarism and hyperprolactinaemia. A comprehensive bioinformatics analysis found de novo and compound heterozygous sequence variants with a damaging effect on genes previously associated with obesity in mice (LRP2) and humans (UCP2), among other intriguing mutations affecting ciliary function (DNAAF1). A gene ontology and pathway analysis of genes harbouring mutations identified significantly overrepresented pathways related to ATP/ITP (adenosine/inosine triphosphate) metabolism and, in general, to the regulation of lipid metabolism. We discuss the clinical and physiological consequences of these mutations and the importance of these findings for either the clinical assessment or eventual treatment of morbid obesity.
ABSTRACT
OBJECTIVE: In the sanroque mouse model of lupus, pathologic germinal centers (GCs) arise due to increased numbers of follicular helper T (Tfh) cells, resulting in high-affinity anti-double-stranded DNA antibodies that cause end-organ inflammation, such as glomerulonephritis. The purpose of this study was to examine the hypothesis that this pathway could account for a subset of patients with systemic lupus erythematosus (SLE). METHODS: An expansion of Tfh cells is a causal, and therefore consistent, component of the sanroque mouse phenotype. We validated the enumeration of circulating T cells resembling Tfh cells as a biomarker of this expansion in sanroque mice, and we performed a comprehensive comparison of the surface phenotype of circulating and tonsillar Tfh cells in humans. This circulating biomarker was enumerated in SLE patients (n = 46), Sjögren's syndrome patients (n = 17), and healthy controls (n = 48) and was correlated with disease activity and end-organ involvement. RESULTS: In sanroque mice, circulating Tfh cells increased in proportion to their GC counterparts, making circulating Tfh cells a feasible human biomarker of this novel mechanism of breakdown in GC tolerance. In a subset of SLE patients (14 of 46), but in none of the controls, the levels of circulating Tfh cells (defined as circulating CXCR5+CD4+ cells with high expression of Tfh-associated molecules, such as inducible T cell costimulator or programmed death 1) were increased. This cellular phenotype did not vary with time, disease activity, or treatment, but it did correlate with the diversity and titers of autoantibodies and with the severity of end-organ involvement. CONCLUSION: These findings in SLE patients are consistent with the autoimmune mechanism in sanroque mice and identify Tfh effector molecules as possible therapeutic targets in a recognizable subset of patients with SLE.
Subjects
Autoimmunity/immunology , Germinal Center/pathology , Lupus Erythematosus, Systemic/pathology , T-Lymphocytes, Helper-Inducer/pathology , Animals , Antibody Formation/immunology , Antigens, CD/metabolism , Antigens, Differentiation, T-Lymphocyte/metabolism , Apoptosis Regulatory Proteins/metabolism , Cell Count , Disease Models, Animal , Germinal Center/immunology , Germinal Center/metabolism , Humans , Immunologic Memory/immunology , Inducible T-Cell Co-Stimulator Protein , Lupus Erythematosus, Systemic/immunology , Mice , Mice, Inbred C57BL , Palatine Tonsil/immunology , Palatine Tonsil/metabolism , Palatine Tonsil/pathology , Programmed Cell Death 1 Receptor , Receptors, CXCR5/metabolism , Sjogren's Syndrome/blood , Sjogren's Syndrome/immunology , Sjogren's Syndrome/pathology , T-Lymphocytes, Helper-Inducer/immunology , T-Lymphocytes, Helper-Inducer/metabolism
ABSTRACT
BACKGROUND: Identifying coevolving positions in protein sequences has myriad applications, ranging from understanding and predicting the structure of single molecules to generating proteome-wide predictions of interactions. Algorithms for detecting coevolving positions can be classified into two categories: tree-aware, which incorporate knowledge of phylogeny, and tree-ignorant, which do not. Tree-ignorant methods are frequently orders of magnitude faster, but are widely held to be insufficiently accurate because of a confounding of shared ancestry with coevolution. We conjectured that by using a null distribution that appropriately controls for the shared-ancestry signal, tree-ignorant methods would exhibit equivalent statistical power to tree-aware methods. Using a novel t-test transformation of coevolution metrics, we systematically compared four tree-aware and five tree-ignorant coevolution algorithms, applying them to myoglobin and myosin. We further considered the influence of sequence recoding using reduced-state amino acid alphabets, a common tactic employed in coevolutionary analyses to improve both statistical and computational performance. RESULTS: Consistent with our conjecture, the transformed tree-ignorant metrics (particularly Mutual Information) often outperformed the tree-aware metrics. Our examination of the effect of recoding suggested that charge-based alphabets were generally superior for identifying the stabilizing interactions in alpha helices. Performance was not always improved by recoding, however, indicating that the choice of alphabet is critical. CONCLUSION: The results suggest that t-test transformation of tree-ignorant metrics can be sufficient to control for patterns arising from shared ancestry.
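Mutual Information, the tree-ignorant metric singled out in the results, can be computed directly from two alignment columns. The sketch below shows the plain, untransformed statistic in bits; it does not include the t-test transformation the abstract describes, and the function name `mutual_information` and the toy columns are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns.

    `col_x` and `col_y` hold the residues observed at two positions,
    one entry per sequence, in the same sequence order.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must come from the same set of sequences")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), n_xy in count_xy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts.
        mi += (n_xy / n) * math.log2(n_xy * n / (count_x[x] * count_y[y]))
    return mi


# Two perfectly covarying columns share one bit of information.
print(mutual_information("AAGG", "CCTT"))
```

Because this statistic ignores phylogeny, closely related sequences can inflate it, which is the shared-ancestry confounding the abstract sets out to control.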
Subjects
Algorithms , Computational Biology/methods , Evolution, Molecular , Models, Statistical , Phylogeny , Models, Genetic , Myoglobin/genetics , Myosins/genetics , Protein Structure, Secondary , Sequence Alignment , Sequence Analysis, Protein
ABSTRACT
We have implemented in Python the COmparative GENomic Toolkit, a fully integrated and thoroughly tested framework for novel probabilistic analyses of biological sequences, devising workflows, and generating publication quality graphics. PyCogent includes connectors to remote databases, built-in generalized probabilistic techniques for working with biological sequences, and controllers for third-party applications. The toolkit takes advantage of parallel architectures and runs on a range of hardware and operating systems, and is available under the general public license from http://sourceforge.net/projects/pycogent.
Subjects
Genomics/methods , Sequence Analysis/methods , Software , Animals , BRCA1 Protein/genetics , Databases, Genetic , Humans , Phylogeny , Protein Conformation , Proteobacteria/classification , Proteobacteria/genetics , von Willebrand Factor/chemistry , von Willebrand Factor/genetics
ABSTRACT
Over 3% of human proteins contain single amino acid repeats (repeat-containing proteins, RCPs). Many repeats (homopeptides) localize to important proteins involved in transcription, and the expansion of certain repeats, in particular poly-Q and poly-A tracts, can also lead to the development of neurological diseases. Previous studies have suggested that the homopeptide makeup is a result of the presence of G+C-rich tracts in the encoding genes and that expansion occurs via replication slippage. Here, we have performed a large-scale genomic analysis of the variation of the genes encoding RCPs in 13 species and present these data in an online database (http://repeats.med.monash.edu.au/genetic_analysis/). This resource allows rapid comparison and analysis of RCPs, homopeptides, and their underlying genetic tracts across the eukaryotic species considered. We report three major findings. First, there is a bias for a small subset of codons being reiterated within homopeptides, and there is no G+C or A+T bias relative to the organism's transcriptome. Second, single base pair transversions from the homocodon are unusually common and may represent a mechanism of reducing the rate of homopeptide mutations. Third, homopeptides that are conserved across different species lie within regions that are under stronger purifying selection in contrast to nonconserved homopeptides.
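Locating single amino acid repeats in a protein sequence is straightforward to sketch with a regular expression back-reference. The minimum run length of 5 used below is an assumption chosen for illustration, since the abstract does not state the threshold used in the study, and `find_homopeptides` is a hypothetical helper name.

```python
import re


def find_homopeptides(protein, min_length=5):
    """Return (start, residue, length) for each run of one repeated residue.

    `start` is the 0-based index of the first residue of the run. Only runs
    of at least `min_length` identical residues are reported.
    """
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # ([A-Z]) captures one residue; \1{k,} requires k or more further copies.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_length - 1))
    return [
        (match.start(), match.group(1), len(match.group(0)))
        for match in pattern.finditer(protein.upper())
    ]


# A toy sequence with a poly-Q tract and a poly-A tract.
print(find_homopeptides("MKQQQQQQLAAAAAG"))
```

Running the same scan over the coding sequence, three bases at a time, would expose which codons are reiterated within each tract.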
Subjects
Codon/genetics , Evolution, Molecular , Proteins/genetics , Repetitive Sequences, Amino Acid/genetics , 3' Untranslated Regions , 5' Untranslated Regions , Humans , Peptides/chemistry , Peptides/genetics , Polymorphism, Single Nucleotide , Proteins/chemistry
ABSTRACT
BACKGROUND: Phylogenetic footprinting is the identification of functional regions of DNA by their evolutionary conservation. This is achieved by comparing orthologous regions from multiple species and identifying the DNA regions that have diverged less than neutral DNA. Vestige is a phylogenetic footprinting package built on the PyEvolve toolkit that uses probabilistic molecular evolutionary modelling to represent aspects of sequence evolution, including the conventional divergence measure employed by other footprinting approaches. In addition to measuring the divergence, Vestige allows the expansion of the definition of a phylogenetic footprint to include variation in the distribution of any molecular evolutionary processes. This is achieved by displaying the distribution of model parameters that represent partitions of molecular evolutionary substitutions. Examination of the spatial incidence of these effects across regions of the genome can identify DNA segments that differ in the nature of the evolutionary process. RESULTS: Vestige was applied to a reference dataset of the SCL locus from four species and provided clear identification of the known conserved regions in this dataset. To demonstrate the flexibility to use diverse models of molecular evolution and dissect the nature of the evolutionary process, Vestige was used to footprint the Ka/Ks ratio in primate BRCA1 with a codon model of evolution. Two regions of putative adaptive evolution were identified, illustrating the ability of Vestige to represent the spatial distribution of distinct molecular evolutionary processes. CONCLUSION: Vestige provides a flexible, open platform for phylogenetic footprinting. Underpinned by the PyEvolve toolkit, Vestige provides a framework for visualising the signatures of evolutionary processes across the genome of numerous organisms simultaneously. By exploiting the maximum-likelihood statistical framework, the complex interplay between mutational processes, DNA repair and selection can be evaluated both spatially (along a sequence alignment) and temporally (for each branch of the tree), providing visual indicators of the attributes and functions of DNA sequences.
Subjects
Computational Biology/methods , Data Interpretation, Statistical , Algorithms , Animals , BRCA1 Protein/genetics , Base Sequence , Codon , Computer Simulation , DNA/chemistry , DNA Repair , Evolution, Molecular , Genome , Humans , Likelihood Functions , Models, Biological , Models, Statistical , Phylogeny , Programming Languages , Regulatory Sequences, Nucleic Acid , Sequence Alignment , Sequence Analysis, DNA , Sequence Analysis, Protein , Software , Species Specificity , Time Factors
ABSTRACT
The modified base 5-methylcytosine ((m)C) plays an important functional role in the biology of mammals as an epigenetic modification and appears to exert a striking impact on the molecular evolution of mammal genomes. The collective epigenetic functions of (m)C revolve around its effect on gene transcription, while the influence of this modified base on the evolution of mammal genomes derives from the greatly elevated spontaneous mutation rate of (m)C to T. In mammals, (m)C occurs at the dinucleotides CpG, CpA, and CpT. As a step toward a comprehensive statistical examination of the role of (m)C in mammal molecular evolution, we have developed novel Markov models of codon substitution that incorporate dinucleotide-level terms relevant to (m)C mutation. We apply these models to two data sets of aligned BRCA1 exon 11 sequences from bats and primates. In all cases, terms specific to mutations that affect the dinucleotides CpG, CpA, and CpT significantly improved model fit. For the CpG-specific terms, both transition and transversion substitution rates were elevated. These rates differed between the data sets. Bats exhibited a lower relative rate of substitutions at CpG-containing codons. Transition substitutions were significantly less than 1 at CpA-containing codons but greater than 1 at CpT-containing codons. The inclusion of interaction terms in the codon models to represent possible confounding with the effect of natural selection was supported for codons that contained CpG and CpT, but not CpA. From the results, we infer that mutation of (m)C is a probable factor that affects BRCA1 codons containing the dinucleotide CpG, a possible factor for CpA-containing codons, and an unlikely factor that affects CpT-containing codons. The confounding of estimated terms with the effect of natural selection indicates that this confounding must be addressed for comparisons between different coding and noncoding regions.
Subjects
DNA Methylation , Genes, BRCA1 , Mammals/genetics , Animals , Biometry , Chiroptera/genetics , CpG Islands , Evolution, Molecular , Humans , Models, Genetic , Primates/genetics
ABSTRACT
BACKGROUND: Examining the distribution of variation has proven an extremely profitable technique in the effort to identify sequences of biological significance. Most approaches in the field, however, evaluate only the conserved portions of sequences, ignoring the biological significance of sequence differences. A suite of sophisticated likelihood-based statistical models from the field of molecular evolution provides the basis for extracting the information from the full distribution of sequence variation. The number of different problems to which phylogeny-based maximum likelihood calculations can be applied is extensive. Available software packages that can perform likelihood calculations suffer from a lack of flexibility and scalability, or employ error-prone approaches to model parameterisation. RESULTS: Here we describe the implementation of PyEvolve, a toolkit for the application of existing, and development of new, statistical methods for molecular evolution. We present the object architecture and design schema of PyEvolve, which includes an adaptable multi-level parallelisation schema. The approach for defining new methods is illustrated by implementing a novel dinucleotide model of substitution that includes a parameter for mutation of methylated CpGs, which required 8 lines of standard Python code to define. Benchmarking was performed using either a dinucleotide or codon substitution model applied to an alignment of BRCA1 sequences from 20 mammals, or a 10-species subset. Up to five-fold parallel performance gains over serial were recorded. Compared to leading alternative software, PyEvolve exhibited significantly better real-world performance for parameter-rich models with a large data set, reducing the time required for optimisation from approximately 10 days to approximately 6 hours. CONCLUSION: PyEvolve provides flexible functionality that can be used either for statistical modelling of molecular evolution, or the development of new methods in the field. The toolkit can be used interactively or by writing and executing scripts. The toolkit uses efficient processes for specifying the parameterisation of statistical models, and implements numerous optimisations that make highly parameter-rich likelihood functions solvable within hours on multi-CPU hardware. PyEvolve can be readily adapted in response to changing computational demands and hardware configurations to maximise performance. PyEvolve is released under the GPL and can be downloaded from http://cbis.anu.edu.au/software.