ABSTRACT
Proteins exhibit cell-type-specific functions and interactions, yet most ways of representing proteins lack any biological or environmental context. To address this gap, recent work by Li et al.1 introduces PINNACLE, a geometric deep learning approach that generates contextualized representations of proteins by combined analysis of protein interactions and multiorgan single-cell transcriptomics.
Subject(s)
Proteins; Proteins/metabolism; Humans; Single-Cell Analysis/methods; Deep Learning; Transcriptome/genetics
ABSTRACT
The data deluge in biology calls for computational approaches that can integrate multiple datasets of different types to build a holistic view of biological processes or structures of interest. An emerging paradigm in this domain is the unsupervised learning of data embeddings that can be used for downstream clustering and classification tasks. While such approaches for integrating data of similar types are becoming common, there is scarcer work on consolidating different data modalities such as network and image information. Here, we introduce DICE (Data Integration through Contrastive Embedding), a contrastive learning model for multi-modal data integration. We apply this model to study the subcellular organization of proteins by integrating protein-protein interaction data and protein image data measured in HEK293 cells. We demonstrate the advantage of data integration over any single modality and show that our framework outperforms previous integration approaches. Availability: https://github.com/raminass/protein-contrastive Contact: raminass@gmail.com.
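The abstract does not state which contrastive objective DICE uses. As a minimal sketch of the general idea, the following NumPy code computes a symmetric InfoNCE-style loss between two sets of embeddings whose rows are paired by protein. The function name `info_nce`, the temperature value, and the random toy embeddings are all assumptions made for illustration, and both modalities are assumed to have already been projected into a shared space.

```python
import numpy as np

def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """Symmetric InfoNCE loss for two (n, d) embedding matrices with paired rows."""
    # L2-normalise rows so that dot products are cosine similarities.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # (n, n); diagonal holds the true pairs

    def cross_entropy(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_prob)))

    # Average the a->b and b->a directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy data (hypothetical): 8 proteins, 16-dimensional embeddings per modality.
rng = np.random.default_rng(0)
network_emb = rng.normal(size=(8, 16))
image_emb = network_emb + 0.01 * rng.normal(size=(8, 16))  # well-aligned pairs
shuffled_emb = image_emb[::-1]                              # pairing destroyed

loss_aligned = info_nce(network_emb, image_emb)
loss_shuffled = info_nce(network_emb, shuffled_emb)
```

The loss is low when each protein's two embeddings are closer to each other than to any other protein's embedding, which is the property a contrastive integration model trains its encoders to achieve.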
Subject(s)
Computational Biology; Humans; HEK293 Cells; Computational Biology/methods; Protein Interaction Mapping/methods; Proteins/metabolism; Proteins/chemistry; Unsupervised Machine Learning
ABSTRACT
While immune checkpoint inhibitors have revolutionized cancer therapy, many patients exhibit poor outcomes. Here, we show immunotherapy responses in bladder and non-small cell lung cancers are effectively predicted by factoring tumor mutation burden (TMB) into burdens on specific protein assemblies. This approach identifies 13 protein assemblies for which the assembly-level mutation burden (AMB) predicts treatment outcomes, which can be combined to powerfully separate responders from nonresponders in multiple cohorts (e.g., 76% versus 37% bladder cancer 1-year survival). These results are corroborated by (i) engineered disruptions in the predictive assemblies, which modulate immunotherapy response in mice, and (ii) histochemistry showing that predicted responders have elevated inflammation. The 13 assemblies have diverse roles in DNA damage checkpoints, oxidative stress, or Janus kinase/signal transducers and activators of transcription signaling and include unexpected genes (e.g., PIK3CG and FOXP1) for which mutation affects treatment response. This study provides a roadmap for using tumor cell biology to factor mutational effects on immune response.
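To illustrate what factoring TMB into assembly-level burdens could look like, here is a minimal sketch that counts a tumor's mutated genes inside each protein assembly. The exact AMB definition is not given in the abstract. The simple count used here, the function name, and the placeholder gene and assembly names are assumptions for illustration only.

```python
def assembly_mutation_burden(mutated_genes, assemblies):
    """Count how many of a tumor's mutated genes fall inside each assembly."""
    mutated = set(mutated_genes)
    return {name: len(mutated & set(genes)) for name, genes in assemblies.items()}

# Hypothetical assemblies and tumor profile (placeholder gene names).
assemblies = {
    "assembly_A": ["GENE1", "GENE2", "GENE3"],
    "assembly_B": ["GENE3", "GENE4"],
    "assembly_C": ["GENE5", "GENE6"],
}
tumor_mutations = ["GENE1", "GENE3", "GENE7", "GENE9"]

tmb = len(set(tumor_mutations))  # gene-level TMB for this toy tumor
amb = assembly_mutation_burden(tumor_mutations, assemblies)
```

In this toy case the tumor has a TMB of 4 but burdens of 2, 1, and 0 on the three assemblies. The burdens need not sum to the TMB, because a gene can belong to several assemblies or to none.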
Subject(s)
Immunotherapy; Mutation; Humans; Immunotherapy/methods; Animals; Mice; Carcinoma, Non-Small-Cell Lung/genetics; Carcinoma, Non-Small-Cell Lung/immunology; Carcinoma, Non-Small-Cell Lung/drug therapy; Treatment Outcome; Urinary Bladder Neoplasms/genetics; Urinary Bladder Neoplasms/immunology; Urinary Bladder Neoplasms/drug therapy; Urinary Bladder Neoplasms/therapy; Neoplasms/genetics; Neoplasms/immunology; Neoplasms/therapy; Neoplasms/drug therapy; Lung Neoplasms/genetics; Lung Neoplasms/immunology; Lung Neoplasms/drug therapy; Lung Neoplasms/pathology; Neoplasm Proteins/genetics; Neoplasm Proteins/immunology; Immune Checkpoint Inhibitors/therapeutic use; Immune Checkpoint Inhibitors/pharmacology
ABSTRACT
Reactive changes of glial cells during neuroinflammation impact brain disorders and disease progression. Elucidating the mechanisms that control reactive gliosis may help us to understand brain pathophysiology and improve outcomes. Here, we report that adult ablation of autism spectrum disorder (ASD)-associated CHD8 in astrocytes attenuates reactive gliosis via remodeling chromatin accessibility, changing gene expression. Conditional Chd8 deletion in astrocytes, but not microglia, suppresses reactive gliosis by impeding astrocyte proliferation and morphological elaboration. Astrocyte Chd8 ablation alleviates lipopolysaccharide-induced neuroinflammation and septic-associated hypothermia in mice. Astrocytic CHD8 plays an important role in neuroinflammation by altering the chromatin landscape, regulating metabolic and lipid-associated pathways, and astrocyte-microglia crosstalk. Moreover, we show that reactive gliosis can be directly mitigated in vivo using an adeno-associated virus (AAV)-mediated Chd8 gene editing strategy. These findings uncover a role of ASD-associated CHD8 in the adult brain, which may warrant future exploration of targeting chromatin remodelers in reactive gliosis and neuroinflammation in injury and neurological diseases.
Subject(s)
Astrocytes; Gliosis; Animals; Gliosis/pathology; Gliosis/metabolism; Astrocytes/metabolism; Astrocytes/pathology; Mice; Chromatin/metabolism; Autism Spectrum Disorder/metabolism; Autism Spectrum Disorder/genetics; Autism Spectrum Disorder/pathology; Neuroinflammatory Diseases/metabolism; Neuroinflammatory Diseases/pathology; Chromatin Assembly and Disassembly; Microglia/metabolism; Microglia/pathology; DNA-Binding Proteins/metabolism; DNA-Binding Proteins/genetics; Mice, Inbred C57BL; Lipopolysaccharides/pharmacology; Humans; Mice, Knockout; Male; Cell Proliferation
ABSTRACT
Defining the subset of cellular factors governing SARS-CoV-2 replication can provide critical insights into viral pathogenesis and identify targets for host-directed antiviral therapies. While a number of genetic screens have previously reported SARS-CoV-2 host dependency factors, these approaches relied on pooled genome-scale CRISPR libraries, which are biased towards the discovery of host proteins impacting early stages of viral replication. To identify host factors involved throughout the SARS-CoV-2 infectious cycle, we conducted an arrayed genome-scale siRNA screen. Resulting data were integrated with published datasets to reveal pathways supported by orthogonal datasets, including transcriptional regulation, epigenetic modifications, and MAPK signalling. The identified proviral host factors were mapped into the SARS-CoV-2 infectious cycle, including 27 proteins that were determined to impact assembly and release. Additionally, a subset of proteins was tested across other coronaviruses, revealing 17 potential pan-coronavirus targets. Further studies illuminated a role for the heparan sulfate proteoglycan perlecan in SARS-CoV-2 viral entry, and found that inhibition of the non-canonical NF-κB pathway through targeting of BIRC2 restricts SARS-CoV-2 replication both in vitro and in vivo. These studies provide critical insight into the landscape of virus-host interactions driving SARS-CoV-2 replication as well as valuable targets for host-directed antivirals.
ABSTRACT
BACKGROUND: Genome-wide association studies (GWAS) have identified hundreds of common variants associated with alcohol consumption. In contrast, genetic studies of alcohol consumption that use rare variants are still in their early stages. No prior studies of alcohol consumption have examined whether common and rare variants implicate the same genes and molecular networks, leaving open the possibility that the two approaches might identify distinct biology. METHODS: To address this knowledge gap, we used publicly available alcohol consumption GWAS summary statistics (GSCAN, N = 666,978) and whole exome sequencing data (Genebass, N = 393,099) to identify a set of common and rare variants for alcohol consumption. We used gene-based analysis to implicate genes from common and rare variant analyses, which we then propagated onto a shared molecular network using a network colocalization procedure. RESULTS: Gene-based analysis of each dataset implicated 294 (common variants) and 35 (rare variants) genes, including ethanol metabolizing genes ADH1B and ADH1C, which were identified by both analyses, and ANKRD12, GIGYF1, KIF21B, and STK31, which were identified in only the rare variant analysis, but have been associated with other neuropsychiatric traits. Network colocalization revealed significant network overlap between the genes identified via common and rare variants. The shared network identified gene families that function in alcohol metabolism, including ADH, ALDH, CYP, and UGT. Seventy-one of the genes in the shared network were previously implicated in neuropsychiatric or substance use disorders but not alcohol-related behaviors (e.g. EXOC2, EPM2A, and CACNG4). Differential gene expression analysis showed enrichment in the liver and several brain regions. 
CONCLUSIONS: Genes implicated by network colocalization identify shared biology relevant to alcohol consumption, which also underlie neuropsychiatric traits and substance use disorders that are comorbid with alcohol use, providing a more holistic understanding of two disparate sources of genetic information.
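The propagation step behind network colocalization can be sketched on a toy graph. The code below runs a random walk with restart from each seed set and scores each node by the product of the two propagated heats. This is a simplified stand-in and not the authors' procedure: published colocalization methods typically compare heats against randomized seed sets, which is omitted here. The five-node graph, the restart probability, and the function name are assumptions for illustration.

```python
import numpy as np

def propagate(adj, seed_idx, restart=0.5, tol=1e-10, max_iter=1000):
    """Random walk with restart from the seed nodes; returns steady-state heat."""
    # Column-normalise so each node splits its heat evenly among its neighbours.
    # Assumes every node has at least one edge (no zero-degree columns).
    w = adj / adj.sum(axis=0, keepdims=True)
    p0 = np.zeros(adj.shape[0])
    p0[list(seed_idx)] = 1.0 / len(seed_idx)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (w @ p) + restart * p0
        converged = np.abs(p_next - p).sum() < tol
        p = p_next
        if converged:
            break
    return p

# Toy network: nodes 0 and 1 both touch node 2; nodes 3 and 4 trail away from it.
edges = [(0, 2), (1, 2), (2, 3), (3, 4)]
adj = np.zeros((5, 5))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

heat_common = propagate(adj, seed_idx=[0])  # hypothetical common-variant seed
heat_rare = propagate(adj, seed_idx=[1])    # hypothetical rare-variant seed
coloc = heat_common * heat_rare             # high only where both heats are high
```

Node 2 receives the highest combined score even though neither analysis seeded it, which is the sense in which a shared network can implicate genes that neither variant class identifies on its own.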
ABSTRACT
Single-gene missense mutations remain challenging to interpret. Here, we deploy scalable functional screening by sequencing (SEUSS), a Perturb-seq method, to generate mutations at protein interfaces of RUNX1 and quantify their effect on activities of downstream cellular programs. We evaluate single-cell RNA profiles of 115 mutations in myelogenous leukemia cells and categorize them into three functionally distinct groups, wild-type (WT)-like, loss-of-function (LoF)-like, and hypomorphic, that we validate in orthogonal assays. LoF-like variants dominate the DNA-binding site and are recurrent in cancer; however, recurrence alone does not predict functional impact. Hypomorphic variants share characteristics with LoF-like but favor protein interactions, promoting gene expression indicative of nerve growth factor (NGF) response and cytokine recruitment of neutrophils. Accessible DNA near differentially expressed genes frequently contains RUNX1-binding motifs. Finally, we reclassify 16 variants of uncertain significance and train a classifier to predict 103 more. Our work demonstrates the potential of targeting protein interactions to better define the landscape of phenotypes reachable by missense mutations.
Subject(s)
Core Binding Factor Alpha 2 Subunit; Humans; Binding Sites; Cell Line, Tumor; Core Binding Factor Alpha 2 Subunit/metabolism; Core Binding Factor Alpha 2 Subunit/genetics; Mutation/genetics; Mutation, Missense; Phenotype; Single-Cell Analysis/methods
ABSTRACT
MOTIVATION: Predicting cancer drug response requires a comprehensive assessment of many mutations present across a tumor genome. While current drug response models generally use a binary mutated/unmutated indicator for each gene, not all mutations in a gene are equivalent. RESULTS: Here, we construct and evaluate a series of predictive models based on leading methods for quantitative mutation scoring. Such methods include VEST4 and CADD, which score the impact of a mutation on gene function, and CHASMplus, which scores the likelihood a mutation drives cancer. The resulting predictive models capture cellular responses to dabrafenib, which targets BRAF-V600 mutations, whereas models based on binary mutation status do not. Performance improvements generalize to other drugs, extending genetic indications for PIK3CA, ERBB2, EGFR, PARP1, and ABL1 inhibitors. Introducing quantitative mutation features in drug response models increases performance and mechanistic understanding. AVAILABILITY AND IMPLEMENTATION: Code and example datasets are available at https://github.com/pgwall/qms.
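The difference between binary and quantitative mutation features can be shown with a small sketch. The abstract does not say how per-mutation scores are aggregated to the gene level, so taking the maximum score per gene is an assumption here, as are the function name and the made-up scores (they are not real VEST4, CADD, or CHASMplus outputs).

```python
def gene_features(mutations, genes):
    """Build binary and quantitative per-gene features from (gene, score) pairs."""
    binary = {g: 0 for g in genes}
    quantitative = {g: 0.0 for g in genes}
    for gene, score in mutations:
        if gene in binary:
            binary[gene] = 1
            # Assumption: a gene's feature is its highest-scoring mutation.
            quantitative[gene] = max(quantitative[gene], score)
    return binary, quantitative

genes = ["BRAF", "PIK3CA"]
# Two hypothetical cell lines, each with one BRAF mutation of different impact.
line_1 = [("BRAF", 0.95)]  # illustrative high-impact score
line_2 = [("BRAF", 0.10)]  # illustrative low-impact score

bin_1, quant_1 = gene_features(line_1, genes)
bin_2, quant_2 = gene_features(line_2, genes)
```

Both cell lines get identical binary features, so a model using them cannot tell the two apart, whereas the quantitative features preserve the difference in predicted impact.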
Subject(s)
Antineoplastic Agents; Mutation; Neoplasms; Humans; Neoplasms/genetics; Neoplasms/drug therapy; Antineoplastic Agents/pharmacology; Antineoplastic Agents/therapeutic use; Imidazoles/pharmacology; Oximes/pharmacology; Computational Biology/methods
ABSTRACT
This article describes the Cell Maps for Artificial Intelligence (CM4AI) project and its goals, methods, standards, current datasets, software tools, status, and future directions. CM4AI is the Functional Genomics Data Generation Project in the U.S. National Institutes of Health's (NIH) Bridge2AI program. Its overarching mission is to produce ethical, AI-ready datasets of cell architecture, inferred from multimodal data collected for human cell lines, to enable transformative biomedical AI research.
ABSTRACT
In vitro evolution and whole genome analysis have proven to be a powerful method for studying the mechanism of action (MOA) of small molecules in many haploid microbes but have generally not been applied to human cell lines, in part because their diploid state complicates the identification of variants that confer drug resistance. To determine whether haploid human cells could be used in MOA studies, we evolved resistance to five different anticancer drugs (doxorubicin, gemcitabine, etoposide, topotecan, and paclitaxel) using a near-haploid cell line (HAP1) and then analyzed the genomes of the drug-resistant clones. We developed a bioinformatic pipeline that filtered for high-frequency alleles predicted to change protein sequence, or alleles that appeared in the same gene in multiple independent selections with the same compound. Applying the filter to sequences from 28 drug-resistant clones identified a set of 21 genes that was strongly enriched for known resistance genes or known drug targets (TOP1, TOP2A, DCK, WDR33, SLCO3A1). In addition, some lines carried structural variants that encompassed additional known resistance genes (ABCB1, WWOX, and RRM1). Gene expression knockdown and knockout experiments on 10 validation targets showed a high degree of specificity and accuracy in our calls and demonstrated that the same drug resistance mechanisms found in diverse clinical samples can be evolved, discovered, and studied in an isogenic background.
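The filtering rule described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' pipeline: the 0.85 allele-frequency threshold, the `Variant` fields, the treatment of each clone as one independent selection, and the placeholder gene and compound names are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    clone: str              # resistant clone (treated as one independent selection)
    compound: str           # drug used for selection
    gene: str
    allele_freq: float
    protein_changing: bool  # predicted to alter the protein sequence

def filter_variants(variants, min_freq=0.85):
    """Keep high-frequency protein-changing alleles, or alleles in genes hit
    in two or more independent selections with the same compound."""
    clones_by_gene = defaultdict(set)
    for v in variants:
        clones_by_gene[(v.compound, v.gene)].add(v.clone)
    kept = []
    for v in variants:
        high_impact = v.allele_freq >= min_freq and v.protein_changing
        recurrent = len(clones_by_gene[(v.compound, v.gene)]) >= 2
        if high_impact or recurrent:
            kept.append(v)
    return kept

variants = [
    Variant("c1", "drugX", "GENE_A", 0.95, True),   # kept: high frequency, coding
    Variant("c1", "drugX", "GENE_B", 0.30, True),   # dropped: low frequency
    Variant("c2", "drugX", "GENE_C", 0.40, False),  # kept: recurrent with c3
    Variant("c3", "drugX", "GENE_C", 0.35, False),  # kept: recurrent with c2
    Variant("c4", "drugY", "GENE_C", 0.40, False),  # dropped: different compound
    Variant("c2", "drugX", "GENE_D", 0.99, False),  # dropped: not protein-changing
]
kept = filter_variants(variants)
```

Grouping by the (compound, gene) pair means recurrence counts only when independent clones selected with the same drug hit the same gene, which matches the second criterion in the text.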
Subject(s)
Antineoplastic Agents; Drug Resistance, Neoplasm; Haploidy; Humans; Drug Resistance, Neoplasm/genetics; Antineoplastic Agents/pharmacology; Genome, Human; Whole Genome Sequencing/methods; Cell Line
ABSTRACT
Polypharmacology drugs (compounds that inhibit multiple proteins) have many applications but are difficult to design. To address this challenge, we have developed POLYGON, an approach to polypharmacology based on generative reinforcement learning. POLYGON embeds chemical space and iteratively samples it to generate new molecular structures; these are rewarded by the predicted ability to inhibit each of two protein targets and by drug-likeness and ease of synthesis. In binding data for >100,000 compounds, POLYGON correctly recognizes polypharmacology interactions with 82.5% accuracy. We subsequently generate de novo compounds targeting ten pairs of proteins with documented co-dependency. Docking analysis indicates that top structures bind their two targets with low free energies and similar 3D orientations to canonical single-protein inhibitors. We synthesize 32 compounds targeting MEK1 and mTOR, with most yielding >50% reduction in each protein activity and in cell viability when dosed at 1-10 µM. These results support the potential of generative modeling for polypharmacology.
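The abstract lists four reward components but not how they are combined. As one possible sketch, the function below takes the geometric mean of four scores in [0, 1]. The geometric mean, the function name, and the example values are assumptions for illustration, not POLYGON's actual reward.

```python
import math

def reward(p_inhibit_a, p_inhibit_b, drug_likeness, synth_ease):
    """Geometric mean of four scores in [0, 1] (hypothetical combination rule)."""
    terms = (p_inhibit_a, p_inhibit_b, drug_likeness, synth_ease)
    if not all(0.0 <= t <= 1.0 for t in terms):
        raise ValueError("all terms must lie in [0, 1]")
    return math.prod(terms) ** (1.0 / len(terms))

dual_inhibitor = reward(0.9, 0.9, 0.8, 0.8)    # good on both targets
single_inhibitor = reward(1.0, 0.0, 1.0, 1.0)  # perfect except for target B
```

A multiplicative combination sends the reward to zero whenever any single component is zero, so a compound predicted to miss one of the two targets cannot compensate with a strong score elsewhere. A weighted sum would not have this property.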
Subject(s)
Molecular Docking Simulation; Humans; TOR Serine-Threonine Kinases/metabolism; Polypharmacology; MAP Kinase Kinase 1/antagonists & inhibitors; MAP Kinase Kinase 1/metabolism; MAP Kinase Kinase 1/chemistry; Protein Kinase Inhibitors/pharmacology; Protein Kinase Inhibitors/chemistry; Protein Binding; Drug Discovery/methods; Drug Design; Cell Survival/drug effects
ABSTRACT
Advancements in genomic and proteomic technologies have powered the use of gene and protein networks ("interactomes") for understanding genotype-phenotype translation. However, the proliferation of interactomes complicates the selection of networks for specific applications. Here, we present a comprehensive evaluation of 46 current human interactomes, encompassing protein-protein interactions as well as gene regulatory, signaling, colocalization, and genetic interaction networks. Our analysis shows that large composite networks such as HumanNet, STRING, and FunCoup are most effective for identifying disease genes, while smaller networks such as DIP and SIGNOR demonstrate strong interaction prediction performance. These findings provide a benchmark for interactomes across diverse network biology applications and clarify factors that influence network performance. Furthermore, our evaluation pipeline paves the way for continued assessment of emerging and updated interaction networks in the future.
ABSTRACT
While the primary sequences of human proteins have been cataloged for over a decade, determining how these are organized into a dynamic collection of multiprotein assemblies, with structures and functions spanning biological scales, is an ongoing venture. Systematic and data-driven analyses of these higher-order structures are emerging, facilitating the discovery and understanding of cellular phenotypes. At present, knowledge of protein localization and function has been primarily derived from manual annotation and curation in resources such as the Gene Ontology, which are biased toward richly annotated genes in the literature. Here, we envision a future powered by data-driven mapping of protein assemblies. These maps can capture and decode cellular functions through the integration of protein expression, localization, and interaction data across length scales and timescales. In this review, we focus on progress toward constructing integrated cell maps that accelerate the life sciences and translational research.
Subject(s)
Phenotype; Proteomics; Humans; Proteomics/methods
ABSTRACT
Cyclin-dependent kinase 4 and 6 inhibitors (CDK4/6is) have revolutionized breast cancer therapy. However, <50% of patients have an objective response, and nearly all patients develop resistance during therapy. To elucidate the underlying mechanisms, we constructed an interpretable deep learning model of the response to palbociclib, a CDK4/6i, based on a reference map of multiprotein assemblies in cancer. The model identifies eight core assemblies that integrate rare and common alterations across 90 genes to stratify palbociclib-sensitive versus palbociclib-resistant cell lines. Predictions translate to patients and patient-derived xenografts, whereas single-gene biomarkers do not. Most predictive assemblies can be shown by CRISPR-Cas9 genetic disruption to regulate the CDK4/6i response. Validated assemblies relate to cell-cycle control, growth factor signaling and a histone regulatory complex that we show promotes S-phase entry through the activation of the histone modifiers KAT6A and TBL1XR1 and the transcription factor RUNX1. This study enables an integrated assessment of how a tumor's genetic profile modulates CDK4/6i resistance.
Subject(s)
Cyclin-Dependent Kinase 4; Cyclin-Dependent Kinase 6; Deep Learning; Drug Resistance, Neoplasm; Piperazines; Protein Kinase Inhibitors; Pyridines; Humans; Cyclin-Dependent Kinase 4/antagonists & inhibitors; Cyclin-Dependent Kinase 6/antagonists & inhibitors; Drug Resistance, Neoplasm/genetics; Animals; Protein Kinase Inhibitors/pharmacology; Protein Kinase Inhibitors/therapeutic use; Pyridines/pharmacology; Pyridines/therapeutic use; Female; Piperazines/pharmacology; Piperazines/therapeutic use; Mice; Cell Line, Tumor; Breast Neoplasms/drug therapy; Breast Neoplasms/genetics; Breast Neoplasms/pathology; Xenograft Model Antitumor Assays
ABSTRACT
Genome-wide association studies (GWAS) have identified hundreds of common variants associated with alcohol consumption. In contrast, rare variants have only begun to be studied for their role in alcohol consumption. No studies have examined whether common and rare variants implicate the same genes and molecular networks. To address this knowledge gap, we used publicly available alcohol consumption GWAS summary statistics (GSCAN, N=666,978) and whole exome sequencing data (Genebass, N=393,099) to identify a set of common and rare variants for alcohol consumption. Gene-based analysis of each dataset implicated 294 (common variants) and 35 (rare variants) genes, including ethanol metabolizing genes ADH1B and ADH1C, which were identified by both analyses, and ANKRD12, GIGYF1, KIF21B, and STK31, which were identified only by rare variant analysis but have been associated with related psychiatric traits. We then used a network colocalization procedure to propagate the common and rare gene sets onto a shared molecular network, revealing significant overlap. The shared network identified gene families that function in alcohol metabolism, including ADH, ALDH, CYP, and UGT. Seventy-four of the genes in the network were previously implicated in comorbid psychiatric or substance use disorders but had not previously been identified for alcohol-related behaviors, including EXOC2, EPM2A, CACNB3, and CACNG4. Differential gene expression analysis showed enrichment in the liver and several brain regions, supporting the role of network genes in alcohol consumption. Thus, genes implicated by common and rare variants identify shared functions relevant to alcohol consumption, which also underlie psychiatric traits and substance use disorders that are comorbid with alcohol use.
ABSTRACT
The data-intensive fields of genomics and machine learning (ML) are in an early stage of convergence. Genomics researchers increasingly seek to harness the power of ML methods to extract knowledge from their data; conversely, ML scientists recognize that genomics offers a wealth of large, complex, and well-annotated datasets that can be used as a substrate for developing biologically relevant algorithms and applications. The National Human Genome Research Institute (NHGRI) inquired with researchers working in these two fields to identify common challenges and receive recommendations to better support genomic research efforts using ML approaches. Those included increasing the amount and variety of training datasets by integrating genomic with multiomics, context-specific (e.g., by cell type), and social determinants of health datasets; reducing the inherent biases of training datasets; prioritizing transparency and interpretability of ML methods; and developing privacy-preserving technologies for research participants' data.
Subject(s)
Bioethics; Genomics; Humans; Algorithms; Privacy; Machine Learning
ABSTRACT
Rapid proliferation is a hallmark of cancer associated with sensitivity to therapeutics that cause DNA replication stress (RS). Many tumors exhibit drug resistance, however, via molecular pathways that are incompletely understood. Here, we develop an ensemble of predictive models that elucidate how cancer mutations impact the response to common RS-inducing (RSi) agents. The models implement recent advances in deep learning to facilitate multidrug prediction and mechanistic interpretation. Initial studies in tumor cells identify 41 molecular assemblies that integrate alterations in hundreds of genes for accurate drug response prediction. These cover roles in transcription, repair, cell-cycle checkpoints, and growth signaling, of which 30 are shown by loss-of-function genetic screens to regulate drug sensitivity or replication restart. The model translates to cisplatin-treated cervical cancer patients, highlighting an RTK-JAK-STAT assembly governing resistance. This study defines a compendium of mechanisms by which mutations affect therapeutic responses, with implications for precision medicine. SIGNIFICANCE: Zhao and colleagues use recent advances in machine learning to study the effects of tumor mutations on the response to common therapeutics that cause RS. The resulting predictive models integrate numerous genetic alterations distributed across a constellation of molecular assemblies, facilitating a quantitative and interpretable assessment of drug response. This article is featured in Selected Articles from This Issue, p. 384.
Subject(s)
Uterine Cervical Neoplasms; Humans; Female; Mutation; Signal Transduction; Cisplatin/pharmacology; Cisplatin/therapeutic use; Machine Learning
ABSTRACT
Translating high-confidence (hc) autism spectrum disorder (ASD) genes into viable treatment targets remains elusive. We constructed a foundational protein-protein interaction (PPI) network in HEK293T cells involving 100 hcASD risk genes, revealing over 1,800 PPIs (87% novel). Interactors, expressed in the human brain and enriched for ASD but not schizophrenia genetic risk, converged on protein complexes involved in neurogenesis, tubulin biology, transcriptional regulation, and chromatin modification. A PPI map of 54 patient-derived missense variants identified differential physical interactions, and we leveraged AlphaFold-Multimer predictions to prioritize direct PPIs and specific variants for interrogation in Xenopus tropicalis and human forebrain organoids. A mutation in the transcription factor FOXP1 led to reconfiguration of DNA binding sites and altered development of deep cortical layer neurons in forebrain organoids. This work offers new insights into molecular mechanisms underlying ASD and describes a powerful platform to develop and test therapeutic strategies for many genetically-defined conditions.
ABSTRACT
Gene set analysis is a mainstay of functional genomics, but it relies on curated databases of gene functions that are incomplete. Here we evaluate five Large Language Models (LLMs) for their ability to discover the common biological functions represented by a gene set, substantiated by supporting rationale, citations and a confidence assessment. Benchmarking against canonical gene sets from the Gene Ontology, GPT-4 confidently recovered the curated name or a more general concept (73% of cases), while benchmarking against random gene sets correctly yielded zero confidence. Gemini-Pro and Mixtral-Instruct showed ability in naming but were falsely confident for random sets, whereas Llama2-70b had poor performance overall. In gene sets derived from 'omics data, GPT-4 identified novel functions not reported by classical functional enrichment (32% of cases), which independent review indicated were largely verifiable and not hallucinations. The ability to rapidly synthesize common gene functions positions LLMs as valuable 'omics assistants.
ABSTRACT
Cells consist of large components, such as organelles, that recursively factor into smaller systems, such as condensates and protein complexes, forming a dynamic multi-scale structure of the cell. Recent technological innovations have paved the way for systematic interrogation of subcellular structures, yielding unprecedented insights into their roles and interactions. In this workshop, we discuss progress, challenges, and collaboration to marshal various computational approaches toward assembling an integrated structural map of the human cell.