ABSTRACT
BACKGROUND: De novo mutations (DNMs) are variants that occur anew in the offspring of noncarrier parents. They are not inherited from either parent but rather result from endogenous mutational processes involving errors of DNA repair/replication. These spontaneous errors play a significant role in the causation of genetic disorders, and their importance in the context of molecular diagnostic medicine has become steadily more apparent as more DNMs have been reported in the literature. In this study, we examined 46,489 disease-associated DNMs annotated by the Human Gene Mutation Database (HGMD) to ascertain their distribution across gene and disease categories. RESULTS: Most disease-associated DNMs reported to date are found to be associated with developmental and psychiatric disorders, a reflection of the focus of sequencing efforts over the last decade. Of the 13,277 human genes in which DNMs have so far been found, the top-10 genes with the highest proportions of DNM relative to gene size were H3-3A, DDX3X, CSNK2B, PURA, ZC4H2, STXBP1, SCN1A, SATB2, H3-3B and TUBA1A. The distribution of CADD and REVEL scores for both disease-associated DNMs and those mutations not reported to be de novo revealed a trend towards higher deleteriousness for DNMs, consistent with the likely lower selection pressure impacting them. This contrasts with the non-DNMs, which are presumed to have been subject to continuous negative selection over multiple generations. CONCLUSION: This meta-analysis provides important information on the occurrence and distribution of disease-associated DNMs in association with heritable disease and should make a significant contribution to our understanding of this major type of mutation.
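As a rough illustration of ranking genes by "proportion of DNMs relative to gene size", the sketch below divides a per-gene DNM count by gene length and sorts the result. The abstract does not say how gene size was measured or normalised, so the per-kilobase scaling, the function name `rank_by_dnm_density`, and the toy counts and lengths are all assumptions made for illustration; none of the numbers are HGMD data.

```python
# Minimal sketch: rank genes by DNM count per kilobase of gene length.
# The per-kb normalisation is an assumption, not the study's stated method.

def rank_by_dnm_density(dnm_counts, gene_sizes_bp):
    """Return (gene, DNMs per kb) pairs sorted from highest to lowest density."""
    densities = {
        gene: 1000.0 * count / gene_sizes_bp[gene]
        for gene, count in dnm_counts.items()
        if gene_sizes_bp.get(gene, 0) > 0  # skip genes with unknown or zero size
    }
    return sorted(densities.items(), key=lambda item: item[1], reverse=True)


# Made-up placeholder genes, counts and sizes (not real data).
toy_counts = {"GENE_A": 30, "GENE_B": 120, "GENE_C": 5}
toy_sizes = {"GENE_A": 1500, "GENE_B": 12000, "GENE_C": 1000}

ranking = rank_by_dnm_density(toy_counts, toy_sizes)
for gene, per_kb in ranking:
    print(f"{gene}\t{per_kb:.1f} DNMs/kb")
```

A small gene with a moderate number of DNMs can outrank a large gene with many more, which is why a size-normalised ranking can differ from a ranking by raw counts.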
Subject(s)
Germ Cells , Parents , Humans , Mutation
ABSTRACT
The evolution of gene expression in mammalian organ development remains largely uncharacterized. Here we report the transcriptomes of seven organs (cerebrum, cerebellum, heart, kidney, liver, ovary and testis) across developmental time points from early organogenesis to adulthood for human, rhesus macaque, mouse, rat, rabbit, opossum and chicken. Comparisons of gene expression patterns identified correspondences of developmental stages across species, and differences in the timing of key events during the development of the gonads. We found that the breadth of gene expression and the extent of purifying selection gradually decrease during development, whereas the amount of positive selection and expression of new genes increase. We identified differences in the temporal trajectories of expression of individual genes across species, with brain tissues showing the smallest percentage of trajectory changes, and the liver and testis showing the largest. Our work provides a resource of developmental transcriptomes of seven organs across seven species, and comparative analyses that characterize the development and evolution of mammalian organs.
Subject(s)
Gene Expression Regulation, Developmental , Organogenesis/genetics , Transcriptome/genetics , Animals , Biological Evolution , Chickens/genetics , Female , Humans , Macaca mulatta/genetics , Male , Mice , Opossums/genetics , Rabbits , Rats
ABSTRACT
Recent evidence from proteomics and deep massively parallel sequencing studies has revealed that eukaryotic genomes contain substantial numbers of as-yet-uncharacterized open reading frames (ORFs). We define these uncharacterized ORFs as novel ORFs (nORFs). nORFs in humans are mostly under 100 codons and are found in diverse regions of the genome, including in long noncoding RNAs, pseudogenes, 3' UTRs, 5' UTRs, and alternative reading frames of canonical protein coding exons. There is therefore a pressing need to evaluate the potential functional importance of these unannotated transcripts and proteins in biological pathways and human disease on a larger scale, rather than one at a time. In this study, we outline the creation of a valuable nORFs data set with experimental evidence of translation for the community, use measures of heritability and selection that reveal signals for functional importance, and show the potential implications for functional interpretation of genetic variants in nORFs. Our results indicate that some variants that were previously classified as being benign or of uncertain significance may have to be reinterpreted.
ABSTRACT
Whilst DNA repeat expansions cause numerous heritable human disorders, their origins and underlying pathological mechanisms are often unclear. We collated a dataset comprising 224 human repeat expansions encompassing 203 different genes, and performed a systematic analysis with respect to key topological features at the DNA, RNA and protein levels. Comparison with controls without known pathogenicity and with genomic regions lacking repeats allowed the construction of the first tool to discriminate repeat regions harboring pathogenic repeat expansions (DPREx). At the DNA level, pathogenic repeat expansions exhibited stronger signals for DNA regulatory factors (e.g. H3K4me3, transcription factor-binding sites) in exons, promoters, 5'UTRs and 5'genes but were not significantly different from controls in introns, 3'UTRs and 3'genes. Additionally, pathogenic repeat expansions were found to be enriched in non-B DNA structures. At the RNA level, pathogenic repeat expansions were characterized by lower free energy for forming RNA secondary structure and were closer to splice sites in introns, exons, promoters and 5'genes than controls. At the protein level, pathogenic repeat expansions exhibited a preference to form coil rather than other types of secondary structure, and tended to encode surface-located protein domains. Guided by these features, DPREx ( http://biomed.nscc-gz.cn/zhaolab/geneprediction/# ) achieved an Area Under the Curve (AUC) value of 0.88 in a test on an independent dataset. Pathogenic repeat expansions are thus located such that they exert a synergistic influence on the gene expression pathway involving inter-molecular connections at the DNA, RNA and protein levels.
Subject(s)
DNA Repeat Expansion , DNA , Humans , Introns/genetics , RNA , Trinucleotide Repeat Expansion
ABSTRACT
Human genome stability requires efficient repair of oxidized bases, which is initiated via damage recognition and excision by NEIL1 and other base excision repair (BER) pathway DNA glycosylases (DGs). However, the biological mechanisms underlying detection of damaged bases among the million-fold excess of undamaged bases remain enigmatic. Indeed, mutation rates vary greatly within individual genomes, and lesion recognition by purified DGs in the chromatin context is inefficient. Employing super-resolution microscopy and co-immunoprecipitation assays, we find that acetylated NEIL1 (AcNEIL1), but not its non-acetylated form, is predominantly localized in the nucleus in association with epigenetic marks of uncondensed chromatin. Furthermore, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) revealed non-random AcNEIL1 binding near transcription start sites of weakly transcribed genes and along highly transcribed chromatin domains. Bioinformatic analyses revealed a striking correspondence between AcNEIL1 occupancy along the genome and mutation rates, with AcNEIL1-occupied sites exhibiting fewer mutations compared to AcNEIL1-free domains, both in cancer genomes and in population variation. Intriguingly, from the evolutionarily conserved unstructured domain that targets NEIL1 to open chromatin, its damage surveillance of highly oxidation-susceptible sites to preserve essential gene function and to limit instability and cancer likely originated ~500 million years ago during the buildup of free atmospheric oxygen.
Subject(s)
Chromatin/physiology , DNA Glycosylases/metabolism , DNA Repair , Protein Processing, Post-Translational , Acetylation , Animals , Cell Line, Tumor , Cell Nucleus/metabolism , Chromatin/ultrastructure , DNA Glycosylases/chemistry , DNA Glycosylases/physiology , DNA Repair/genetics , Datasets as Topic , Evolution, Molecular , Genes, Helminth , Genes, Homeobox , HEK293 Cells , Helminth Proteins/genetics , Humans , Invertebrates/genetics , Invertebrates/metabolism , Lysine/chemistry , Mutation , Neoplasm Proteins/metabolism , Neoplasms/genetics , Neoplasms/metabolism , Neoplasms/mortality , Oxidation-Reduction , Proteome , Sequence Alignment , Sequence Homology, Amino Acid , Transcription Initiation Site , Vertebrates/genetics , Vertebrates/metabolism
ABSTRACT
The Human Gene Mutation Database (HGMD®) constitutes a comprehensive collection of published germline mutations in nuclear genes that are thought to underlie, or are closely associated with, human inherited disease. At the time of writing (June 2020), the database contains in excess of 289,000 different gene lesions identified in over 11,100 genes manually curated from 72,987 articles published in over 3100 peer-reviewed journals. Two main groups of users utilise HGMD on a regular basis: research scientists and clinical diagnosticians. This review aims to highlight how to make the most out of HGMD data in each setting.
Subject(s)
Databases, Genetic , Genome, Human , Germ-Line Mutation , Polymorphism, Genetic , Bibliometrics , Biomedical Research/methods , Genetic Predisposition to Disease , Humans , Public-Private Sector Partnerships
ABSTRACT
Differentiation between phenotypically neutral and disease-causing genetic variation remains an open and relevant problem. Among different types of variation, non-frameshifting insertions and deletions (indels) represent an understudied group with widespread phenotypic consequences. To address this challenge, we present a machine learning method, MutPred-Indel, that predicts pathogenicity and identifies types of functional residues impacted by non-frameshifting insertion/deletion variation. The model shows good predictive performance as well as the ability to identify impacted structural and functional residues including secondary structure, intrinsic disorder, metal and macromolecular binding, post-translational modifications, allosteric sites, and catalytic residues. We identify structural and functional mechanisms impacted preferentially by germline variation from the Human Gene Mutation Database, recurrent somatic variation from COSMIC in the context of different cancers, as well as de novo variants from families with autism spectrum disorder. Further, the distributions of pathogenicity prediction scores generated by MutPred-Indel are shown to differentiate highly recurrent from non-recurrent somatic variation. Collectively, we present a framework to facilitate the interrogation of both pathogenicity and the functional effects of non-frameshifting insertion/deletion variants. The MutPred-Indel webserver is available at http://mutpred.mutdb.org/.
Subject(s)
Genetic Predisposition to Disease/genetics , Genome, Human , INDEL Mutation , Autism Spectrum Disorder/genetics , Autism Spectrum Disorder/physiopathology , Computational Biology , Databases, Genetic , Genome, Human/genetics , Genome, Human/physiology , Humans , INDEL Mutation/genetics , INDEL Mutation/physiology , Machine Learning , ROC Curve
ABSTRACT
It has long been known that canonical 5' splice site (5'SS) GT>GC variants may be compatible with normal splicing. However, to date, the actual scale of canonical 5'SSs capable of generating wild-type transcripts in the case of GT>GC substitutions remains unknown. Herein, combining data derived from a meta-analysis of 45 human disease-causing 5'SS GT>GC variants and a cell culture-based full-length gene splicing assay of 103 5'SS GT>GC substitutions, we estimate that ~15-18% of canonical GT 5'SSs retain their capacity to generate between 1% and 84% normal transcripts when GT is substituted by GC. We further demonstrate that the canonical 5'SSs in which substitution of GT by GC generated normal transcripts exhibit stronger complementarity to the 5' end of U1 snRNA than those sites whose substitutions of GT by GC did not lead to the generation of normal transcripts. We also observed a correlation between the generation of wild-type transcripts and a milder than expected clinical phenotype but found that none of the available splicing prediction tools were capable of reliably distinguishing 5'SS GT>GC variants that generated wild-type transcripts from those that did not. Our findings imply that 5'SS GT>GC variants in human disease genes may not invariably be pathogenic.
Subject(s)
Alternative Splicing , Base Sequence , Gene Expression Regulation , Genetic Variation , RNA Splice Sites , Cells, Cultured , Computational Biology/methods , Databases, Nucleic Acid , Exons , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Humans , Introns , Nucleotide Motifs , Position-Specific Scoring Matrices , Sequence Analysis, DNA
ABSTRACT
Summary: We present FATHMM-XF, a method for predicting pathogenic point mutations in the human genome. Drawing on an extensive feature set, FATHMM-XF outperforms competitors on benchmark tests, particularly in non-coding regions where the majority of pathogenic mutations are likely to be found. Availability and implementation: The FATHMM-XF web server is available at http://fathmm.biocompute.org.uk/fathmm-xf/, and as tracks on the Genome Tolerance Browser: http://gtb.biocompute.org.uk. Predictions are provided for human genome version GRCh37/hg19. The data used for this project can be downloaded from: http://fathmm.biocompute.org.uk/fathmm-xf/. Contact: mark.rogers@bristol.ac.uk or c.campbell@bristol.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Genomics/methods , Point Mutation , Sequence Analysis, DNA/methods , Software , Genome, Human , Humans
ABSTRACT
The in silico prediction of the functional consequences of mutations is an important goal of human pathogenetics. However, bioinformatic tools that classify mutations according to their functionality employ different algorithms so that predictions may vary markedly between tools. We therefore integrated nine popular prediction tools (PolyPhen-2, SNPs&GO, MutPred, SIFT, MutationTaster2, Mutation Assessor and FATHMM as well as conservation-based Grantham Score and PhyloP) into a single predictor. The optimal combination of these tools was selected by means of a wide range of statistical modeling techniques, drawing upon 10 029 disease-causing single nucleotide variants (SNVs) from Human Gene Mutation Database and 10 002 putatively 'benign' non-synonymous SNVs from UCSC. Predictive performance was found to be markedly improved by model-based integration, whilst maximum predictive capability was obtained with either random forest, decision tree or logistic regression analysis. A combination of PolyPhen-2, SNPs&GO, MutPred, MutationTaster2 and FATHMM was found to perform as well as all tools combined. Comparison of our approach with other integrative approaches such as Condel, CoVEC, CAROL, CADD, MetaSVM and MetaLR using an independent validation dataset, revealed the superiority of our newly proposed integrative approach. An online implementation of this approach, IMHOTEP ('Integrating Molecular Heuristics and Other Tools for Effect Prediction'), is provided at http://www.uni-kiel.de/medinfo/cgi-bin/predictor/.
Subject(s)
Genetic Variation , Software , Algorithms , Computational Biology/methods , Computer Simulation , Humans , Mutation , Polymorphism, Single Nucleotide
ABSTRACT
Many genetic diseases exhibit considerable epidemiological comorbidity and common symptoms, which provokes debate about the extent of their etiological overlap. The rapid growth in the number of known disease-causing mutations in the Human Gene Mutation Database (HGMD) has allowed us to characterize genetic similarities between diseases by ascertaining the extent to which identical genetic mutations are shared between diseases. Using this approach, we show that 41.6% of all possible disease pairs (42,083) exhibit a significant sharing of mutations (P value < 0.05). These mutation-related disease pairs are in agreement with heritability-based disease-disease relations in 48 neurological and psychiatric disease pairs (Spearman's correlation coefficient = 0.50; P value = 3.4 × 10⁻⁵), and share over-expressed genes significantly more often than unrelated disease pairs (1.5-1.8-fold higher; P value ≤ 1.6 × 10⁻⁴). The usefulness of mutation-related disease pairs was further demonstrated for predicting novel mutations and identifying individuals susceptible to Crohn disease. Moreover, the mutation-based disease network concurs closely with that based on phenotypes.
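The abstract does not name the test used to call mutation sharing between two diseases "significant". One conventional choice for this kind of overlap question is a one-sided hypergeometric test, sketched below as an assumption rather than as the authors' method. The function name `shared_mutation_pvalue` and the example counts are hypothetical.

```python
from math import comb

# Hypothetical sketch: one-sided hypergeometric test for mutation sharing.
# Model (an assumption): disease B's n_b mutations are drawn at random from a
# pool of `total` mutations, n_a of which are also annotated to disease A.

def shared_mutation_pvalue(total, n_a, n_b, shared):
    """Probability of observing at least `shared` common mutations by chance."""
    upper = min(n_a, n_b)  # the overlap cannot exceed the smaller set
    tail = sum(
        comb(n_a, k) * comb(total - n_a, n_b - k)
        for k in range(shared, upper + 1)
    )
    return tail / comb(total, n_b)


# Made-up numbers for illustration only (not HGMD data).
p_value = shared_mutation_pvalue(total=1000, n_a=40, n_b=50, shared=8)
print(f"P(overlap >= 8) = {p_value:.3g}")
```

Applied to every disease pair, a test of this kind would also need a multiple-testing correction; whether and how the study corrected for this is not stated in the abstract.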
Subject(s)
Mutation/genetics , Genetic Predisposition to Disease/genetics , Humans , Phenotype , RNA, Messenger/genetics
ABSTRACT
Our goal is to answer the question: compared with experimental structures, how useful are predicted models for functional annotation? We assessed the functional utility of predicted models by comparing the performances of a suite of methods for functional characterization on the predictions and the experimental structures. We identified 28 sites in 25 protein targets to perform functional assessment. These 28 sites included nine sites with known ligand binding (holo-sites), nine sites that are expected or suggested by experimental authors for small molecule binding (apo-sites), and ten sites containing important motifs, loops, or key residues with important disease-associated mutations. We evaluated the utility of the predictions by comparing their microenvironments to the experimental structures. Overall structural quality correlates with functional utility. However, the best-ranked predictions (global) may not have the best functional quality (local). Our assessment provides an ability to discriminate between predictions with high structural quality. When assessing ligand-binding sites, most prediction methods have higher performance on apo-sites than holo-sites. Some servers show consistently high performance for certain types of functional sites. Finally, many functional sites are associated with protein-protein interaction. We also analyzed biologically relevant features from the protein assemblies of two targets where the active site spanned the protein-protein interface. For the assembly targets, we find that the features in the models are mainly determined by the choice of template.
Subject(s)
Biological Products/metabolism , Computational Biology/methods , Models, Molecular , Models, Statistical , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Binding Sites , Catalytic Domain , Humans , Ligands , Protein Binding
ABSTRACT
MOTIVATION: Loss-of-function genetic variants are frequently associated with severe clinical phenotypes, yet many are present in the genomes of healthy individuals. The available methods to assess the impact of these variants rely primarily upon evolutionary conservation with little to no consideration of the structural and functional implications for the protein. They further do not provide information to the user regarding specific molecular alterations potentially causative of disease. RESULTS: To address this, we investigate protein features underlying loss-of-function genetic variation and develop a machine learning method, MutPred-LOF, for the discrimination of pathogenic and tolerated variants that can also generate hypotheses on specific molecular events disrupted by the variant. We investigate a large set of human variants derived from the Human Gene Mutation Database, ClinVar and the Exome Aggregation Consortium. Our prediction method shows an area under the Receiver Operating Characteristic curve of 0.85 for all loss-of-function variants and 0.75 for proteins in which both pathogenic and neutral variants have been observed. We applied MutPred-LOF to a set of 1142 de novo variants from neurodevelopmental disorders and find enrichment of pathogenic variants in affected individuals. Overall, our results highlight the potential of computational tools to elucidate causal mechanisms underlying loss of protein function in loss-of-function variants. AVAILABILITY AND IMPLEMENTATION: http://mutpred.mutdb.org. CONTACT: predrag@indiana.edu.
Subject(s)
Loss of Function Mutation , Machine Learning , Proteins/genetics , Sequence Analysis, Protein/methods , Software , Computational Biology/methods , Humans , Protein Conformation , Proteins/metabolism , Proteins/physiology
ABSTRACT
BACKGROUND: Mucopolysaccharidosis-IVA (Morquio A disease) is a lysosomal disorder in which the abnormal accumulation of keratan sulfate and chondroitin-6-sulfate is consequent to mutations in the galactosamine-6-sulfatase (GALNS) gene. Since standard DNA sequencing analysis fails to detect about 16% of GALNS mutant alleles, gross DNA rearrangement screening and uniparental disomy evaluation are required to complete the molecular diagnosis. Despite this, the second pathogenic GALNS allele generally remains unidentified in ~ 5% of Morquio-A disease patients. METHODS: In an attempt to bridge the residual gap between clinical and molecular diagnosis, we performed an mRNA-based evaluation of three Morquio-A disease patients in whom the second mutant GALNS allele had not been identified. We also performed sequence analysis of the entire GALNS gene in two patients. RESULTS: Different aberrant GALNS mRNA transcripts were characterized in each patient. Analysis of these transcripts then allowed the identification, in one patient, of a disease-causing deep intronic GALNS mutation. The aberrant mRNA products identified in the other two individuals resulted in partial exon loss. Despite sequencing the entire GALNS gene region in these patients, the identity of a single underlying pathological lesion could not be unequivocally determined. We postulate that a combination of multiple variants, acting in cis, may synergise in terms of their impact on the splicing machinery. CONCLUSIONS: We have identified GALNS variants located within deep intronic regions that have the potential to impact splicing. These findings have prompted us to incorporate mRNA analysis into our diagnostic flow procedure for the molecular analysis of Morquio A disease.
Subject(s)
Chondroitinsulfatases/genetics , Mucopolysaccharidosis IV/genetics , Mutation , RNA Splicing , RNA, Messenger/genetics , Adolescent , Base Sequence , Chondroitinsulfatases/metabolism , DNA Mutational Analysis , Decision Trees , Exons , Female , Genotype , Humans , Introns , Male , Mucopolysaccharidosis IV/diagnosis , Mucopolysaccharidosis IV/metabolism , Mucopolysaccharidosis IV/physiopathology , RNA, Messenger/metabolism
ABSTRACT
BACKGROUND AND AIMS: Duodenal polyposis and cancer have become a key issue for patients with familial adenomatous polyposis (FAP) and MUTYH-associated polyposis (MAP). Almost all patients with FAP will develop duodenal adenomas, and 5% will develop cancer. The incidence of duodenal adenomas in MAP appears to be lower than in FAP, but the limited available data suggest a comparable increase in the relative risk and lifetime risk of duodenal cancer. Current surveillance recommendations, however, are the same for FAP and MAP, using the Spigelman score (incorporating polyp number, size, dysplasia, and histology) for risk stratification and determination of surveillance intervals. Previous studies have demonstrated a benefit of enhanced detection rates of adenomas by use of chromoendoscopy both in sporadic colorectal disease and in groups at high risk of colorectal cancer. We aimed to assess the effect of chromoendoscopy on duodenal adenoma detection, to determine the impact on Spigelman stage and to compare this in individuals with known pathogenic mutations in order to determine the difference in duodenal involvement between MAP and FAP. METHODS: A prospective study examined the impact of chromoendoscopy on the assessment of the duodenum in 51 consecutive patients with MAP and FAP in 2 academic centers in the United Kingdom (University Hospital Llandough, Cardiff, and St Mark's Hospital, London) from 2011 to 2014. RESULTS: Enhanced adenoma detection of 3 times the number of adenomas after chromoendoscopy was demonstrated in both MAP (P = .013) and FAP (P = .002), but did not affect adenoma size. In both conditions, there was a significant increase in Spigelman stage after chromoendoscopy compared with endoscopy without dye spray. Spigelman scores and overall adenoma detection were significantly lower in MAP compared with FAP. CONCLUSIONS: Chromoendoscopy improved the diagnostic yield of adenomas in MAP and FAP 3-fold, and in both MAP and FAP this resulted in a clinically significant upstaging in Spigelman score. Further studies are required to determine the impact of improved adenoma detection on the management and outcome of duodenal polyposis.
Subject(s)
Adenomatous Polyposis Coli/diagnostic imaging , Duodenal Neoplasms/diagnostic imaging , Endoscopy, Gastrointestinal/methods , Population Surveillance/methods , Adenomatous Polyposis Coli/genetics , Adenomatous Polyposis Coli/pathology , Adult , Aged , Aged, 80 and over , Coloring Agents , DNA Glycosylases/genetics , Duodenal Neoplasms/genetics , Duodenal Neoplasms/pathology , Female , Humans , Indigo Carmine , Male , Middle Aged , Neoplasm Staging , Prospective Studies , Tumor Burden
ABSTRACT
Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
Subject(s)
Evolution, Molecular , Genetic Speciation , Genome/genetics , Gorilla gorilla/genetics , Animals , Female , Gene Expression Regulation , Genetic Variation/genetics , Genomics , Humans , Macaca mulatta/genetics , Molecular Sequence Data , Pan troglodytes/genetics , Phylogeny , Pongo/genetics , Proteins/genetics , Sequence Alignment , Species Specificity , Transcription, Genetic
ABSTRACT
BACKGROUND: Small insertions and deletions (indels) have a significant influence in human disease and, in terms of frequency, they are second only to single nucleotide variants as pathogenic mutations. As the majority of mutations associated with complex traits are located outside the exome, it is crucial to investigate the potential pathogenic impact of indels in non-coding regions of the human genome. RESULTS: We present FATHMM-indel, an integrative approach to predict the functional effect, pathogenic or neutral, of indels in non-coding regions of the human genome. Our method exploits various genomic annotations in addition to sequence data. When validated on benchmark data, FATHMM-indel significantly outperforms CADD and GAVIN, state of the art models in assessing the pathogenic impact of non-coding variants. FATHMM-indel is available via a web server at indels.biocompute.org.uk. CONCLUSIONS: FATHMM-indel can accurately predict the functional impact and prioritise small indels throughout the whole non-coding genome.
Subject(s)
Computational Biology/methods , DNA, Intergenic/genetics , Genome, Human , INDEL Mutation/genetics , Genetics, Population , Humans , Phenotype , ROC Curve , Reproducibility of Results , Software
ABSTRACT
Synonymous single-nucleotide variants (SNVs), although they do not alter the encoded protein sequences, have been implicated in many genetic diseases. Experimental studies indicate that synonymous SNVs can lead to changes in the secondary and tertiary structures of DNA and RNA, thereby affecting translational efficiency, cotranslational protein folding as well as the binding of DNA-/RNA-binding proteins. However, the importance of these various features in disease phenotypes is not clearly understood. Here, we have built a support vector machine (SVM) model (termed DDIG-SN) as a means to discriminate disease-causing synonymous variants. The model was trained and evaluated on nearly 900 disease-causing variants. The method achieves robust performance with the area under the receiver operating characteristic curve of 0.84 and 0.85 for protein-stratified 10-fold cross-validation and independent testing, respectively. We were able to show that the disease-causing effects in the immediate proximity to exon-intron junctions (1-3 bp) are driven by the loss of splicing motif strength, whereas the gain of splicing motif strength is the primary cause in regions further away from the splice site (4-69 bp). The method is available as a part of the DDIG server at http://sparks-lab.org/ddig.
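Several abstracts in this listing, including the one above, report performance as the area under the receiver operating characteristic curve (AUC). As a reminder of what that number measures, the sketch below computes AUC as the probability that a randomly chosen disease-causing variant receives a higher score than a randomly chosen neutral one, with ties counted as one half. This is the standard rank-based definition and not code from any of the cited tools; the function name `roc_auc` and the toy labels and scores are illustrative.

```python
# Minimal sketch: AUC as the fraction of (positive, negative) pairs that the
# classifier orders correctly, counting ties as half a correct pair.

def roc_auc(labels, scores):
    """labels: 1 for disease-causing, 0 for neutral; scores: higher = more pathogenic."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): three of the four pairs are ordered correctly.
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
print(toy_auc)  # 0.75
```

The pairwise loop is quadratic in the number of variants, which is acceptable for a few thousand examples; a sort-based rank formula gives the same value more efficiently on larger sets.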
Subject(s)
DNA-Binding Proteins/genetics , DNA/genetics , Proteins/genetics , Silent Mutation/genetics , DNA/chemistry , DNA-Binding Proteins/chemistry , Genetic Predisposition to Disease , Humans , Nucleic Acid Conformation , Polymorphism, Single Nucleotide/genetics , Protein Folding , Proteins/chemistry , RNA/chemistry , RNA/genetics
ABSTRACT
Alternative splicing (AS) is a closely regulated process that allows a single gene to encode multiple protein isoforms, thereby contributing to the diversity of the proteome. Dysregulation of the splicing process has been found to be associated with many inherited diseases. However, among the pathogenic AS events, there are numerous "passenger" events whose inclusion or exclusion does not lead to significant changes with respect to protein function. In this study, we evaluate the secondary and tertiary structural features of proteins associated with disease-causing and neutral AS events, and show that several structural features are strongly associated with the pathological impact of exon inclusion. We further develop a machine-learning-based computational model, ExonImpact, for prioritizing and evaluating the functional consequences of hitherto uncharacterized AS events. We evaluated our model using several strategies including cross-validation, and data from the Gene-Tissue Expression (GTEx) and ClinVar databases. ExonImpact is freely available at http://watson.compbio.iupui.edu/ExonImpact.
Subject(s)
Alternative Splicing , Computational Biology/methods , Exons , Genetic Association Studies/methods , Software , Algorithms , Brain/metabolism , Databases, Nucleic Acid , Genetic Predisposition to Disease , Humans , Machine Learning , Protein Domains , Protein Isoforms/chemistry , Protein Isoforms/genetics , Protein Isoforms/metabolism , Structure-Activity Relationship , Web Browser
ABSTRACT
While synonymous single-nucleotide variants (sSNVs) have largely been unstudied, since they do not alter protein sequence, mounting evidence suggests that they may affect RNA conformation, splicing, and the stability of nascent mRNAs to promote various diseases. Accurately prioritizing deleterious sSNVs from a pool of neutral ones can significantly improve our ability to select functional genetic variants identified from various genome-sequencing projects, and, therefore, advance our understanding of disease etiology. In this study, we develop a computational algorithm to prioritize sSNVs based on their impact on mRNA splicing and protein function. In addition to genomic features that potentially affect splicing regulation, our proposed algorithm also includes dozens of structural features that characterize the impact of alternatively spliced exons on protein function. Our systematic evaluation on thousands of sSNVs suggests that several structural features, including protein intrinsic disorder scores, solvent accessible surface areas, protein secondary structures, and known and predicted protein family domains, show significant differences between disease-causing and neutral sSNVs. Our results suggest that protein structure features offer an added dimension of information for distinguishing disease-causing from neutral synonymous variants. The inclusion of structural features increases the predictive accuracy for functional sSNV prioritization.