ABSTRACT
Most heritable diseases are polygenic. To comprehend the underlying genetic architecture, it is crucial to discover the clinically relevant epistatic interactions (EIs) between genomic single nucleotide polymorphisms (SNPs) (1-3). Existing statistical computational methods for EI detection are mostly limited to pairs of SNPs due to the combinatorial explosion of higher-order EIs. With NeEDL (network-based epistasis detection via local search), we leverage network medicine to inform the selection of EIs that are an order of magnitude more statistically significant compared to existing tools and consist, on average, of five SNPs. We further show that this computationally demanding task can be substantially accelerated once quantum computing hardware becomes available. We apply NeEDL to eight different diseases and discover genes (affected by EIs of SNPs) that are partly known to affect the disease. Additionally, these results are reproducible across independent cohorts. EIs for these eight diseases can be interactively explored in the Epistasis Disease Atlas (https://epistasis-disease-atlas.com). In summary, NeEDL demonstrates the potential of seamlessly integrated quantum computing techniques to accelerate biomedical research. Our network medicine approach detects higher-order EIs with unprecedented statistical and biological evidence, yielding unique insights into polygenic diseases and providing a basis for the development of improved risk scores and combination therapies.
Subject(s)
Genetic Epistasis , Single Nucleotide Polymorphism , Humans , Quantum Theory , Multifactorial Inheritance/genetics , Disease/genetics , Computational Biology/methods , Algorithms , Genetic Predisposition to Disease
ABSTRACT
Genome-wide association studies (GWAS) explore the genetic causes of complex diseases. However, classical approaches ignore the biological context of the genetic variants and genes under study. To address this shortcoming, one can use biological networks, which model functional relationships, to search for functionally related susceptibility loci. Many such network methods exist, each arising from different mathematical frameworks, pre-processing steps, and assumptions about the network properties of the susceptibility mechanism. Unsurprisingly, this results in disparate solutions. To explore how to exploit these heterogeneous approaches, we selected six network methods and applied them to GENESIS, a nationwide French study on familial breast cancer. First, we verified that network methods recovered more interpretable results than a standard GWAS. We addressed the heterogeneity of their solutions by studying their overlap, computing what we called the consensus. The key gene in this consensus solution was COPS5, a gene related to multiple cancer hallmarks. Another issue we observed was that network methods were unstable, selecting very different genes on different subsamples of GENESIS. Therefore, we proposed a stable consensus solution formed by the 68 genes most consistently selected across multiple subsamples. This solution was also enriched in genes known to be associated with breast cancer susceptibility (BLM, CASP8, CASP10, DNAJC1, FGFR2, MRPS30, and SLC4A7, P-value = 3 × 10⁻⁴). The most connected gene was CUL3, a regulator of several genes linked to cancer progression. Lastly, we evaluated the biases of each method and the impact of their parameters on the outcome. In general, network methods preferred highly connected genes, even after random rewirings that stripped the connections of any biological meaning. In conclusion, we present the advantages of network-guided GWAS, characterize their shortcomings, and provide strategies to address them.
Implementations of all six methods, which can be used to compute the consensus networks, are available at https://github.com/hclimente/gwas-tools.
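The abstract above describes two aggregation ideas: a consensus over the solutions of several network methods, and a stable consensus over repeated subsamples. The following is a minimal sketch of both, not the implementation in gwas-tools. The function names, the thresholds `min_methods` and `min_fraction`, and the toy method and gene labels are all assumptions made for illustration; the abstract does not state how many methods or subsamples a gene must appear in.

```python
from collections import Counter


def consensus(selections, min_methods=2):
    """Genes selected by at least `min_methods` methods.

    `selections` maps a method name to the genes it selected.
    The threshold is an assumed parameter, not a value from the study.
    """
    counts = Counter(gene for genes in selections.values() for gene in set(genes))
    return {gene for gene, n in counts.items() if n >= min_methods}


def stable_consensus(subsample_runs, min_fraction=0.5, min_methods=2):
    """Genes found in the consensus of at least `min_fraction` of the subsamples."""
    hits = Counter()
    for selections in subsample_runs:
        hits.update(consensus(selections, min_methods))
    cutoff = min_fraction * len(subsample_runs)
    return {gene for gene, n in hits.items() if n >= cutoff}


# Toy input: hypothetical method names and placeholder gene labels.
run_a = {
    "method_1": ["GENE_A", "GENE_B", "GENE_C"],
    "method_2": ["GENE_A", "GENE_B"],
    "method_3": ["GENE_A", "GENE_D"],
}
run_b = {
    "method_1": ["GENE_B", "GENE_E"],
    "method_2": ["GENE_B", "GENE_A"],
    "method_3": ["GENE_E", "GENE_D"],
}

print(sorted(consensus(run_a)))
print(sorted(stable_consensus([run_a, run_b], min_fraction=1.0)))
```

Requiring a gene to recur across subsamples trades sensitivity for reproducibility: a gene that only one subsample supports is dropped even if several methods agree on it there.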
Subject(s)
Breast Neoplasms , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Algorithms , Breast Neoplasms/epidemiology , Breast Neoplasms/genetics , Genetic Databases , Female , Humans , Single Nucleotide Polymorphism/genetics
ABSTRACT
MOTIVATION: Finding non-linear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have important drawbacks, including, among others, lack of parsimony, non-convexity, and computational overhead. Here we propose block HSIC Lasso, a non-linear feature selector that avoids these drawbacks. RESULTS: We compare block HSIC Lasso to other state-of-the-art feature selection techniques on both synthetic and real data, including experiments on three common types of genomic data: gene-expression microarrays, single-cell RNA sequencing, and genome-wide association studies. In all cases, we observe that features selected by block HSIC Lasso retain more information about the underlying biology than those selected by other techniques. As a proof of concept, we applied block HSIC Lasso to a single-cell RNA sequencing experiment on mouse hippocampus. We discovered that many genes previously linked to brain development and function are involved in the biological differences between the types of neurons. AVAILABILITY AND IMPLEMENTATION: Block HSIC Lasso is implemented in the Python 2/3 package pyHSICLasso, available on PyPI. Source code is available on GitHub (https://github.com/riken-aip/pyHSICLasso). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
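HSIC stands for the Hilbert-Schmidt Independence Criterion, a kernel-based dependence measure that can detect non-linear relationships. The sketch below computes the standard biased empirical estimate, tr(KHLH)/n², with Gaussian kernels in plain Python. It illustrates only the dependence measure, not block HSIC Lasso itself and not the pyHSICLasso API. The function names, the kernel width `sigma`, and the synthetic data are assumptions made for illustration.

```python
import math
import random


def gaussian_gram(values, sigma=1.0):
    """Gram matrix of a Gaussian kernel over a list of scalars."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values] for a in values]


def center(gram):
    """Double-center a symmetric Gram matrix, i.e. compute H K H."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    total_mean = sum(row_means) / n
    # The matrix is symmetric, so column means equal row means.
    return [
        [gram[i][j] - row_means[i] - row_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / n^2."""
    n = len(x)
    k_centered = center(gaussian_gram(x, sigma))
    l_gram = gaussian_gram(y, sigma)
    # tr((H K H) L) equals the elementwise sum because L is symmetric.
    return sum(k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n)) / n ** 2


# Synthetic example: y depends on x quadratically, so linear correlation is weak.
rng = random.Random(0)
x = [rng.uniform(-2.0, 2.0) for _ in range(150)]
y_dependent = [xi ** 2 + rng.gauss(0.0, 0.1) for xi in x]
y_independent = [rng.uniform(0.0, 4.0) for _ in range(150)]

print(hsic(x, y_dependent), hsic(x, y_independent))
```

A feature selector built on this measure would favor features with high HSIC against the outcome and low HSIC against each other.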
Subject(s)
Biomarkers , Genome-Wide Association Study , Software , Animals , Genome , Genomics , Mice
ABSTRACT
The speed at which new genomes are being sequenced highlights the need for genome-wide methods capable of predicting protein-DNA interactions. Here, we present PADA1, a generic algorithm that accurately models structural complexes and predicts the DNA-binding regions of resolved protein structures. PADA1 relies on a library of protein and double-stranded DNA fragment pairs obtained from a training set of 2103 DNA-protein complexes. It includes a fast statistical force field, computed from atom-atom distances, to evaluate and filter the 3D docking models. Using published benchmark validation sets and 212 DNA-protein structures published after 2016, we predicted the DNA-binding regions with an RMSD of <1.8 Å per residue in >95% of the cases. We show that the quality of the docked templates is compatible with the FoldX protein design tool suite, which identifies the crystallized DNA molecule sequence as the most energetically favorable in 80% of the cases. We highlight the biological potential of PADA1 by reconstituting DNA and protein conformational changes upon protein mutagenesis of a meganuclease and its variants, and by predicting DNA-binding regions and nucleotide sequences in proteins crystallized without DNA. These results open up new perspectives for the engineering of DNA-protein interfaces.
Subject(s)
Algorithms , DNA-Binding Proteins/chemistry , DNA-Binding Proteins/metabolism , DNA/chemistry , DNA/metabolism , Binding Sites , Computational Biology/methods , Computer Simulation , DNA-Binding Proteins/genetics , Nucleic Acid Databases/statistics & numerical data , Protein Databases/statistics & numerical data , Knowledge Bases , Molecular Models , Molecular Docking Simulation , Protein Binding , Protein Conformation , Protein Engineering , Software
ABSTRACT
A common problem encountered when performing large-scale MS proteome analysis is the loss of information due to the high percentage of unassigned spectra. To determine the causes behind this loss, we analyzed the proteome of one of the smallest living bacteria that can be grown axenically, Mycoplasma pneumoniae (729 ORFs). The proteome of M. pneumoniae cells, grown in defined media, was analyzed by MS. An initial search with Mascot against a species-specific NCBInr database with common contaminants (NCBImpn) left around 79% of the acquired spectra without an assignment. The percentage of non-assigned spectra was reduced to 27% after re-analysis of the data with the PEAKS software, thereby increasing the proteome coverage of M. pneumoniae from the initial 60% to over 76%. Nonetheless, 33,413 spectra with assigned amino acid sequences could not be mapped to any NCBInr database protein sequence. Approximately 1% of these unassigned peptides corresponded to PTMs and 4% to M. pneumoniae protein variants (deamidation and translation inaccuracies). The most abundant peptide sequence variants (Phe-Tyr and Ala-Ser) could be explained by alterations in the editing capacity of the corresponding tRNA synthetases. About another 1% of the peptides not associated with any protein had repetitions of the same aromatic/hydrophobic amino acid at the N-terminus, or had Arg/Lys at the C-terminus. Thus, in a model system, we have maximized the number of assigned spectra to 73% (51,453 out of the 70,040 initially acquired spectra). All MS data have been deposited in the ProteomeXchange with identifier PXD002779 (http://proteomecentral.proteomexchange.org/dataset/PXD002779).
Subject(s)
Bacterial Proteins/analysis , Bacterial Proteins/genetics , Mycoplasma pneumoniae/chemistry , Mycoplasma pneumoniae/genetics , Protein Databases , Bacterial Genome , Humans , Mycoplasma pneumoniae/growth & development , Mycoplasma Pneumonia/microbiology , Post-Translational Protein Processing , Proteome/analysis , Proteome/genetics , Proteomics , Tandem Mass Spectrometry , Transcriptome
ABSTRACT
We present a network-based protocol to discover susceptibility genes in case-control genome-wide association studies (GWASs). In short, this protocol looks for biomarkers that are informative of disease status and interconnected in an underlying biological network. This boosts discovery and interpretability. Moreover, the protocol tackles the instability of network methods, producing a stable set of genes most likely to replicate in external cohorts. To apply the procedure to a provided GWAS dataset, install the required software and execute our command-line tool. For complete details on the use and execution of this protocol, please refer to Climente-González et al. (1).
Subject(s)
Genome-Wide Association Study , Software , Genome-Wide Association Study/methods
ABSTRACT
Most heritable diseases are polygenic. To comprehend the underlying genetic architecture, it is crucial to discover the clinically relevant epistatic interactions (EIs) between genomic single nucleotide polymorphisms (SNPs) (1-3). Existing statistical computational methods for EI detection are mostly limited to pairs of SNPs due to the combinatorial explosion of higher-order EIs. With NeEDL (network-based epistasis detection via local search), we leverage network medicine to inform the selection of EIs that are an order of magnitude more statistically significant compared to existing tools and consist, on average, of five SNPs. We further show that this computationally demanding task can be substantially accelerated once quantum computing hardware becomes available. We apply NeEDL to eight different diseases and discover genes (affected by EIs of SNPs) that are partly known to affect the disease. Additionally, these results are reproducible across independent cohorts. EIs for these eight diseases can be interactively explored in the Epistasis Disease Atlas (https://epistasis-disease-atlas.com). In summary, NeEDL is the first application that demonstrates the potential of seamlessly integrated quantum computing techniques to accelerate biomedical research. Our network medicine approach detects higher-order EIs with unprecedented statistical and biological evidence, yielding unique insights into polygenic diseases and providing a basis for the development of improved risk scores and combination therapies.
ABSTRACT
BACKGROUND: Detecting epistatic interactions at the gene level is essential to understanding the biological mechanisms of complex diseases. Unfortunately, genome-wide interaction association studies involve many statistical challenges that make such detection hard. We propose a multi-step protocol for epistasis detection along the edges of a gene-gene co-function network. Such an approach reduces the number of tests performed and provides interpretable interactions while keeping type I error controlled. Yet, mapping gene interactions into testable single-nucleotide polymorphism (SNP)-interaction hypotheses, as well as computing gene pair association scores from SNP pair ones, is not trivial. RESULTS: Here we compare 3 SNP-gene mappings (positional overlap, expression quantitative trait loci, and proximity in 3D structure) and use the adaptive truncated product method to compute gene pair scores. This method is non-parametric, does not require a known null distribution, and is fast to compute. We apply multiple variants of this protocol to a genome-wide association study dataset on inflammatory bowel disease. Different configurations produced different results, highlighting that various mechanisms are implicated in inflammatory bowel disease, while at the same time, results overlapped with known disease characteristics. Importantly, the proposed pipeline also differs from a conventional approach where no network is used, showing the potential for additional discoveries when prior biological knowledge is incorporated into epistasis detection.
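The adaptive truncated product method mentioned above combines several P-values into one score by multiplying only those below a truncation point, trying several truncation points, and then correcting for that choice. The sketch below illustrates the idea for the SNP pair P-values of a single gene pair. It is not the pipeline used in the study: the null distribution is simulated from independent uniform P-values, which is a simplifying assumption (SNP pair tests in real data are correlated, and a phenotype-permutation scheme would normally provide the null). The function names, truncation points, and example P-values are hypothetical.

```python
import math
import random
from bisect import bisect_right


def log_truncated_product(pvalues, tau):
    """Log of the product of all P-values that are <= tau (0.0 if none qualify)."""
    return sum(math.log(p) for p in pvalues if p <= tau)


def atpm_pvalue(pvalues, taus=(0.01, 0.05, 0.1), n_sim=2000, seed=0):
    """Adaptive truncated product P-value against a simulated uniform null."""
    rng = random.Random(seed)
    m = len(pvalues)
    # Index 0 holds the observed P-values; the rest are null draws in (0, 1].
    pool = [list(pvalues)] + [[1.0 - rng.random() for _ in range(m)] for _ in range(n_sim)]
    size = len(pool)
    min_p = [1.0] * size
    for tau in taus:
        stats = [log_truncated_product(vec, tau) for vec in pool]
        ordered = sorted(stats)
        for i, stat in enumerate(stats):
            # Smaller statistics are more extreme; count pool members at least as extreme.
            p_tau = bisect_right(ordered, stat) / size
            min_p[i] = min(min_p[i], p_tau)
    # Second layer: how often is a pool member's best P-value as small as the observed one?
    return sum(1 for p in min_p if p <= min_p[0]) / size


# Hypothetical SNP pair P-values mapped to two gene pairs.
strong_pair = [1e-6, 1e-5, 0.2, 0.5]
weak_pair = [0.4, 0.6, 0.8, 0.9]

print(atpm_pvalue(strong_pair), atpm_pvalue(weak_pair))
```

Pooling the observed vector with the null draws keeps the estimate strictly positive, so the smallest attainable P-value is 1/(n_sim + 1).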
Subject(s)
Genetic Epistasis , Genome-Wide Association Study , Genome-Wide Association Study/methods , Phenotype , Single Nucleotide Polymorphism , Quantitative Trait Loci
ABSTRACT
For the past decade, cancer genomic studies have focused on mutations leading to splice-site disruption, overlooking those having splice-creating potential. Here, we applied a bioinformatic tool, MiSplice, for the large-scale discovery of splice-site-creating mutations (SCMs) across 8,656 TCGA tumors. We report 1,964 originally mis-annotated mutations having clear evidence of creating alternative splice junctions. TP53 and GATA3 have 26 and 18 SCMs, respectively, and ATRX has 5 from lower-grade gliomas. Mutations in 11 genes, including PARP1, BRCA1, and BAP1, were experimentally validated for splice-site-creating function. Notably, we found that neoantigens induced by SCMs are likely several-fold more immunogenic than those induced by missense mutations, as exemplified by the recurrent GATA3 SCM. Further, high expression of PD-1 and PD-L1 was observed in tumors with SCMs, suggesting candidates for immune blockade therapy. Our work highlights the importance of integrating DNA and RNA data for understanding the functional and clinical implications of mutations in human diseases.
Subject(s)
Mutation , Neoplasms/genetics , RNA Splice Sites , BRCA1 Protein/genetics , GATA3 Transcription Factor/genetics , HEK293 Cells , Humans , Poly(ADP-ribose) Polymerase-1/genetics , Programmed Cell Death 1 Receptor/genetics , Tumor Suppressor Protein p53/genetics , X-linked Nuclear Protein/genetics
ABSTRACT
Alternative splicing changes are frequently observed in cancer and are starting to be recognized as important signatures for tumor progression and therapy. However, their functional impact and relevance to tumorigenesis remain mostly unknown. We carried out a systematic analysis to characterize the potential functional consequences of alternative splicing changes in thousands of tumor samples. This analysis revealed that a subset of alternative splicing changes affect protein domain families that are frequently mutated in tumors and potentially disrupt protein-protein interactions in cancer-related pathways. Moreover, there was a negative correlation between the number of these alternative splicing changes in a sample and the number of somatic mutations in drivers. We propose that a subset of the alternative splicing changes observed in tumors may represent independent oncogenic processes that could be relevant to explain the functional transformations in cancer, and some of them could potentially be considered alternative splicing drivers (AS drivers).