ABSTRACT
It is well established that cells sense chemical signals from their local microenvironment and transduce them to the nucleus to regulate gene expression programmes. Although a number of experiments have shown that mechanical cues can also modulate gene expression, the underlying mechanisms are far from clear. Nevertheless, we are now beginning to understand how mechanical cues are transduced to the nucleus and how they influence nuclear mechanics, genome organization and transcription. In particular, recent progress in super-resolution imaging, in genome-wide application of RNA sequencing, chromatin immunoprecipitation and chromosome conformation capture and in theoretical modelling of 3D genome organization enables the exploration of the relationship between cell mechanics, 3D chromatin configurations and transcription, thereby shedding new light on how mechanical forces regulate gene expression.
Subject(s)
Chromatin Assembly and Disassembly/physiology; Chromatin/physiology; Genome, Human/physiology; Mechanotransduction, Cellular/physiology; Models, Genetic; Animals; Humans
Subject(s)
Aging/physiology; Betacoronavirus/physiology; Cytoskeleton/virology; Host-Pathogen Interactions/physiology; Betacoronavirus/pathogenicity; COVID-19; Coronavirus Infections/pathology; Cytoskeleton/metabolism; Genome, Viral; Humans; NF-kappa B/metabolism; Pandemics; Pneumonia, Viral/pathology; SARS-CoV-2; Signal Transduction; Virus Replication/physiology
ABSTRACT
While neural networks are used for classification tasks across domains, a long-standing open problem in machine learning is determining whether neural networks trained using standard procedures are consistent for classification, i.e., whether such models minimize the probability of misclassification for arbitrary data distributions. In this work, we identify and construct an explicit set of neural network classifiers that are consistent. Since effective neural networks in practice are typically both wide and deep, we analyze infinitely wide networks that are also infinitely deep. In particular, using the recent connection between infinitely wide neural networks and neural tangent kernels, we provide explicit activation functions that can be used to construct networks that achieve consistency. Interestingly, these activation functions are simple and easy to implement, yet differ from commonly used activations such as ReLU or sigmoid. More generally, we create a taxonomy of infinitely wide and deep networks and show that these models implement one of three well-known classifiers depending on the activation function used: 1) 1-nearest neighbor (model predictions are given by the label of the nearest training example); 2) majority vote (model predictions are given by the label of the class with the greatest representation in the training set); or 3) singular kernel classifiers (a set of classifiers containing those that achieve consistency). Our results highlight the benefit of using deep networks for classification tasks, in contrast to regression tasks, where excessive depth is harmful.
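The three classifier families named in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' construction: the singular kernel form 1/‖x − x_i‖^α, the value of `alpha`, the ±1 label encoding, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def one_nearest_neighbor(X_train, y_train, x):
    """Return the label of the training point closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

def majority_vote(y_train):
    """Return the most frequent training label; the input x is ignored."""
    labels, counts = np.unique(y_train, return_counts=True)
    return labels[np.argmax(counts)]

def singular_kernel_classifier(X_train, y_train, x, alpha=2.0):
    """Weighted vote with weights 1 / ||x - x_i||^alpha (assumed kernel form).

    Labels are assumed to be encoded as -1 / +1.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    if np.any(dists == 0):
        # The kernel blows up at a training point, so that point's label wins.
        return y_train[np.argmin(dists)]
    weights = dists ** (-alpha)
    return 1 if np.dot(weights, y_train) >= 0 else -1

# Toy one-dimensional data set (hypothetical).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([-1, 1, 1, 1])

query = np.array([0.1])
print(one_nearest_neighbor(X_train, y_train, query))
print(majority_vote(y_train))
print(singular_kernel_classifier(X_train, y_train, query))
```

Near the single negative example, the nearest-neighbor and singular kernel rules follow the local label, while majority vote always returns the dominant class.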
Subject(s)
Machine Learning; Neural Networks, Computer
ABSTRACT
Matrix completion problems arise in many applications including recommendation systems, computer vision, and genomics. Increasingly larger neural networks have been successful in many of these applications but at considerable computational costs. Remarkably, taking the width of a neural network to infinity allows for improved computational performance. In this work, we develop an infinite width neural network framework for matrix completion that is simple, fast, and flexible. Simplicity and speed come from the connection between the infinite width limit of neural networks and kernels known as neural tangent kernels (NTK). In particular, we derive the NTK for fully connected and convolutional neural networks for matrix completion. The flexibility stems from a feature prior, which allows encoding relationships between coordinates of the target matrix, akin to semisupervised learning. The effectiveness of our framework is demonstrated through competitive results for virtual drug screening and image inpainting/reconstruction. We also provide an implementation in Python to make our framework accessible on standard hardware to a broad audience.
Subject(s)
Image Processing, Computer-Assisted; Neural Networks, Computer; Computers; Image Processing, Computer-Assisted/methods; Machine Learning; Supervised Machine Learning
ABSTRACT
A complete mitochondrial (mt) genome sequence was reconstructed from a 38,000-year-old Neandertal individual with 8341 mtDNA sequences identified among 4.8 Gb of DNA generated from approximately 0.3 g of bone. Analysis of the assembled sequence unequivocally establishes that the Neandertal mtDNA falls outside the variation of extant human mtDNAs, and allows an estimate of the divergence date between the two mtDNA lineages of 660,000 ± 140,000 years. Of the 13 proteins encoded in the mtDNA, subunit 2 of cytochrome c oxidase of the mitochondrial electron transport chain has experienced the largest number of amino acid substitutions in human ancestors since the separation from Neandertals. There is evidence that purifying selection in the Neandertal mtDNA was reduced compared with other primate lineages, suggesting that the effective population size of Neandertals was small.
Subject(s)
Evolution, Molecular; Fossils; Hominidae/genetics; Sequence Analysis, DNA/methods; Animals; Base Sequence; Bone and Bones/metabolism; Croatia; Cyclooxygenase 2/chemistry; DNA, Mitochondrial/genetics; Genome, Mitochondrial; Humans; Models, Molecular; Molecular Sequence Data
ABSTRACT
Identifying computational mechanisms for memorization and retrieval of data is a long-standing problem at the intersection of machine learning and neuroscience. Our main finding is that standard overparameterized deep neural networks trained using standard optimization methods implement such a mechanism for real-valued data. We provide empirical evidence that 1) overparameterized autoencoders store training samples as attractors and thus iterating the learned map leads to sample recovery, and that 2) the same mechanism allows for encoding sequences of examples and serves as an even more efficient mechanism for memory than autoencoding. Theoretically, we prove that when trained on a single example, autoencoders store the example as an attractor. Lastly, by treating a sequence encoder as a composition of maps, we prove that sequence encoding provides a more efficient mechanism for memory than autoencoding.
Subject(s)
Computational Biology/methods; Memory/physiology; Neural Networks, Computer; Machine Learning; Nonlinear Dynamics
ABSTRACT
SUMMARY: Designing interventions to control gene regulation necessitates modeling a gene regulatory network by a causal graph. Currently, large-scale gene expression datasets from different conditions, cell types, disease states, and developmental time points are being collected. However, application of classical causal inference algorithms to infer gene regulatory networks based on such data is still challenging, requiring high sample sizes and computational resources. Here, we describe an algorithm that efficiently learns the differences in gene regulatory mechanisms between different conditions. Our difference causal inference (DCI) algorithm infers changes (i.e. edges that appeared, disappeared, or changed weight) between two causal graphs given gene expression data from the two conditions. This algorithm is efficient in its use of samples and computation since it infers the differences between causal graphs directly without estimating each possibly large causal graph separately. We provide a user-friendly Python implementation of DCI and also enable the user to learn the most robust difference causal graph across different tuning parameters via stability selection. Finally, we show how to apply DCI to single-cell RNA-seq data from different conditions and cell states, and we also validate our algorithm by predicting the effects of interventions. AVAILABILITY AND IMPLEMENTATION: Python package freely available at http://uhlerlab.github.io/causaldag/dci. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
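To make the three kinds of change concrete (edges that appeared, disappeared, or changed weight), here is a small sketch of the object DCI aims to recover. It is not the DCI algorithm and does not use the package's API: it assumes both weighted adjacency matrices are already known, whereas DCI estimates the difference directly from expression data. The function name, the tolerance, and the toy matrices are hypothetical.

```python
import numpy as np

def edge_differences(B1, B2, tol=1e-8):
    """Classify directed edges i -> j by comparing two weighted adjacency matrices.

    B1[i, j] is the weight of edge i -> j under condition 1, B2 under condition 2.
    """
    in1 = np.abs(B1) > tol
    in2 = np.abs(B2) > tol

    def to_edges(mask):
        return sorted((int(i), int(j)) for i, j in zip(*np.nonzero(mask)))

    return {
        "appeared": to_edges(in2 & ~in1),
        "disappeared": to_edges(in1 & ~in2),
        "changed_weight": to_edges(in1 & in2 & (np.abs(B1 - B2) > tol)),
    }

# Hypothetical three-gene example.
B1 = np.array([[0.0, 1.0, 0.0],
               [0.0, 0.0, 0.5],
               [0.0, 0.0, 0.0]])
B2 = np.array([[0.0, 2.0, 0.3],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]])

diff = edge_differences(B1, B2)
print(diff)
```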
Subject(s)
Algorithms; Gene Regulatory Networks; Gene Expression Regulation
ABSTRACT
Lineage tracing involves the identification of all ancestors and descendants of a given cell, and is an important tool for studying biological processes such as development and disease progression. However, in many settings, controlled time-course experiments are not feasible, for example when working with tissue samples from patients. Here we present ImageAEOT, a computational pipeline based on autoencoders and optimal transport for predicting the lineages of cells using time-labeled datasets from different stages of a cellular process. Given a single-cell image from one of the stages, ImageAEOT generates an artificial lineage of this cell based on the population characteristics of the other stages. These lineages can be used to connect subpopulations of cells through the different stages and identify image-based features and biomarkers underlying the biological process. To validate our method, we apply ImageAEOT to a benchmark task based on nuclear and chromatin images during the activation of fibroblasts by tumor cells in engineered 3D tissues. We further validate ImageAEOT on chromatin images of various breast cancer cell lines and human tissue samples, thereby linking alterations in chromatin condensation patterns to different stages of tumor progression. Our results demonstrate the promise of computational methods based on autoencoding and optimal transport principles for lineage tracing in settings where existing experimental strategies cannot be used.
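The optimal transport step can be illustrated with a minimal sketch that couples two populations of latent codes using entropy-regularized transport (Sinkhorn iterations). This is an assumption-laden illustration, not the ImageAEOT pipeline: the random "latent codes" stand in for autoencoder embeddings, and the Sinkhorn solver, the squared Euclidean cost, and the regularization value are choices made here.

```python
import numpy as np

def sinkhorn_plan(a, b, C, reg=0.5, n_iter=500):
    """Entropy-regularized transport plan between histograms a and b with cost C."""
    K = np.exp(-C / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)             # rescale to match column marginals
        u = a / (K @ v)               # rescale to match row marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
early = rng.normal(0.0, 1.0, size=(5, 2))   # hypothetical latent codes, earlier stage
late = rng.normal(1.0, 1.0, size=(6, 2))    # hypothetical latent codes, later stage

# Squared Euclidean cost between every early/late pair, scaled to [0, 1].
C = ((early[:, None, :] - late[None, :, :]) ** 2).sum(axis=2)
C = C / C.max()

a = np.full(5, 1 / 5)                 # uniform mass on early cells
b = np.full(6, 1 / 6)                 # uniform mass on late cells
plan = sinkhorn_plan(a, b, C)

# For each early cell, the late cell receiving the most transported mass.
most_likely_successor = plan.argmax(axis=1)
print(most_likely_successor)
```

Each row of `plan` spreads one early cell's mass over the later population, which is one way to read off a predicted continuation of that cell.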
Subject(s)
Cell Lineage; Computational Biology/methods; Single-Cell Analysis/methods; Breast Neoplasms; Cell Differentiation/physiology; Cell Line, Tumor; Cell Nucleus/physiology; Chromatin/physiology; Coculture Techniques; Female; Humans; Image Processing, Computer-Assisted; Reproducibility of Results
ABSTRACT
In this Current Opinion, we highlight the importance of the material properties of tissues and how alterations therein, which influence epithelial-to-mesenchymal transitions, represent an important layer of regulation in a number of diseases and potentially also play a critical role in host-pathogen interactions. In light of the current SARS-CoV-2 pandemic, we here highlight the possible role of lung tissue stiffening with ageing and how this might facilitate increased SARS-CoV-2 replication through matrix-stiffness dependent epithelial-to-mesenchymal transitions of the lung epithelium. This emphasizes the need for integrating material properties of tissues in drug discovery programs.
ABSTRACT
The 3D structure of the genome plays a key role in regulatory control of the cell. Experimental methods such as high-throughput chromosome conformation capture (Hi-C) have been developed to probe the 3D structure of the genome. However, it remains a challenge to deduce from these data chromosome regions that are colocalized and coregulated. Here, we present an integrative approach that leverages 1D functional genomic features (e.g., epigenetic marks) with 3D interactions from Hi-C data to identify functional interchromosomal interactions. We construct a weighted network with 250-kb genomic regions as nodes and Hi-C interactions as edges, where the edge weights are given by the correlation between 1D genomic features. Individual interacting clusters are determined using weighted correlation clustering on the network. We show that intermingling regions generally fall into either active or inactive clusters based on the enrichment for RNA polymerase II (RNAPII) and H3K9me3, respectively. We show that active clusters are hotspots for transcription factor binding sites. We also validate our predictions experimentally by 3D fluorescence in situ hybridization (FISH) experiments and show that active RNAPII is enriched in predicted active clusters. Our method provides a general quantitative framework that couples 1D genomic features with 3D interactions from Hi-C to probe the guiding principles that link the spatial organization of the genome with regulatory control.
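The network construction described here can be sketched as follows: nodes are genomic regions, edges are Hi-C interactions, and each edge is weighted by the correlation between the 1D feature profiles of its two endpoints. The use of Pearson correlation, the function name, and the toy feature values are assumptions; the abstract does not specify them.

```python
import numpy as np

def correlation_edge_weights(features, hic_edges):
    """Weight each Hi-C edge by the correlation of its endpoints' feature profiles.

    features:  array of shape (n_regions, n_features), one row per genomic region
    hic_edges: iterable of (i, j) region index pairs with a Hi-C interaction
    """
    corr = np.corrcoef(features)      # rows are treated as variables: region x region
    return {(i, j): float(corr[i, j]) for i, j in hic_edges}

# Hypothetical feature profiles (e.g. three epigenetic marks) for three regions.
features = np.array([[1.0, 2.0, 3.0],
                     [2.0, 4.0, 6.0],
                     [3.0, 2.0, 1.0]])
hic_edges = [(0, 1), (0, 2)]

weights = correlation_edge_weights(features, hic_edges)
print(weights)
```

A weighted correlation clustering step would then run on these signed weights; that step is not shown.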
Subject(s)
Chromosomes, Human; Sequence Analysis, DNA/methods; Transcription, Genetic/physiology; Animals; Chromosomes, Human/genetics; Chromosomes, Human/metabolism; Humans
ABSTRACT
The protection of privacy of individual-level information in genome-wide association study (GWAS) databases has been a major concern of researchers following the publication of "an attack" on GWAS data by Homer et al. (2008). Traditional statistical methods for confidentiality and privacy protection of statistical databases do not scale well to deal with GWAS data, especially in terms of guarantees regarding protection from linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, is an approach that provides a rigorous definition of privacy with meaningful privacy guarantees in the presence of arbitrary external information, although the guarantees may come at a serious price in terms of data utility. Building on such notions, Uhler et al. (2013) proposed new methods to release aggregate GWAS data without compromising an individual's privacy. We extend the methods developed in Uhler et al. (2013) for releasing differentially-private χ²-statistics by allowing for an arbitrary number of cases and controls, and for releasing differentially-private allelic test statistics. We also provide a new interpretation by assuming the controls' data are known, which is a realistic assumption because some GWAS use publicly available data as controls. We assess the performance of the proposed methods through a risk-utility analysis on a real data set consisting of DNA samples collected by the Wellcome Trust Case Control Consortium and compare the methods with the differentially-private release mechanism proposed by Johnson and Shmatikov (2013).
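One standard way to release a statistic under ε-differential privacy is the Laplace mechanism, which adds noise with scale equal to the sensitivity divided by ε. The sketch below shows only that generic mechanism; it is not claimed to be the exact release procedure of the paper. The sensitivity value and the χ² values are placeholders: the true sensitivity must be derived for the specific statistic and study design.

```python
import numpy as np

def laplace_release(stats, sensitivity, epsilon, rng):
    """Return stats perturbed by Laplace noise with scale sensitivity / epsilon.

    sensitivity is the L1 sensitivity of the released vector of statistics,
    i.e. the largest change one individual's data can cause.
    """
    stats = np.asarray(stats, dtype=float)
    scale = sensitivity / epsilon
    return stats + rng.laplace(loc=0.0, scale=scale, size=stats.shape)

rng = np.random.default_rng(0)
chi2_stats = np.array([12.3, 0.8, 5.1])   # hypothetical chi-square statistics
placeholder_sensitivity = 4.0             # placeholder, NOT a derived bound
noisy = laplace_release(chi2_stats, placeholder_sensitivity, epsilon=1.0, rng=rng)
print(noisy)
```

Smaller ε gives stronger privacy but larger noise, which is the risk-utility trade-off the abstract refers to.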
Subject(s)
Genome-Wide Association Study; Privacy; Crohn Disease/genetics; Humans
ABSTRACT
The subcellular localization of a protein is important for its function and interaction with other molecules, and its mislocalization is linked to numerous diseases. While atlas-scale efforts have been made to profile protein localization across various cell lines, existing datasets only contain limited pairs of proteins and cell lines which do not cover all human proteins. We present a method that uses both protein sequences and cellular landmark images to perform Predictions of Unseen Proteins' Subcellular localization (PUPS), which can generalize to both proteins and cell lines not used for model training. PUPS combines a protein language model and an image inpainting model to utilize both protein sequence and cellular images for protein localization prediction. The protein sequence input enables generalization to unseen proteins and the cellular image input enables cell type specific prediction that captures single-cell variability. PUPS' ability to generalize to unseen proteins and cell lines enables us to assess the variability in protein localization across cell lines as well as across single cells within a cell line and to identify the biological processes associated with the proteins that have variable localization. Experimental validation shows that PUPS can be used to predict protein localization in newly performed experiments outside of the Human Protein Atlas used for training. Collectively, PUPS utilizes both protein sequences and cellular images to predict protein localization in unseen proteins and cell lines with the ability to capture single-cell variability.
ABSTRACT
Human life expectancy is constantly increasing and aging has become a major risk factor for many diseases, although the underlying gene regulatory mechanisms are still unclear. Using transcriptomic and chromosomal conformation capture (Hi-C) data from human skin fibroblasts from individuals across different age groups, we identified a tight coupling between the changes in co-regulation and co-localization of genes. We obtained transcription factors, cofactors, and chromatin regulators that could drive the cellular aging process by developing a time-course prize-collecting Steiner tree algorithm. In particular, by combining RNA-Seq data from different age groups and protein-protein interaction data we determined the key transcription regulators and gene regulatory changes at different life stage transitions. We then mapped these transcription regulators to the 3D reorganization of chromatin in young and old skin fibroblasts. Collectively, we identified key transcription regulators whose target genes are spatially rearranged and correlate with changes in their expression, thereby providing potential targets for reverting cellular aging.
Subject(s)
Chromatin; Transcription Factors; Humans; Chromatin/genetics; Transcription Factors/metabolism; Gene Expression Regulation; Cellular Senescence/genetics; Gene Expression Profiling
ABSTRACT
Ductal carcinoma in situ (DCIS) is a pre-invasive tumor that can progress to invasive breast cancer, a leading cause of cancer death. We generate a large-scale tissue microarray dataset of chromatin images from 560 samples from 122 female patients in 3 disease stages and 11 phenotypic categories. Using representation learning on chromatin images alone, without multiplexed staining or high-throughput sequencing, we identify eight morphological cell states and tissue features marking DCIS. All cell states are observed in all disease stages in different proportions, indicating that cell states enriched in invasive cancer exist in small fractions in normal breast tissue. Tissue-level analysis reveals significant changes in the spatial organization of cell states across disease stages, which is predictive of disease stage and phenotypic category. Taken together, we show that chromatin imaging represents a powerful measure of cell state and disease stage of DCIS, providing a simple and effective tumor biomarker.
Subject(s)
Breast Neoplasms; Carcinoma, Intraductal, Noninfiltrating; Chromatin; Humans; Female; Carcinoma, Intraductal, Noninfiltrating/pathology; Carcinoma, Intraductal, Noninfiltrating/genetics; Carcinoma, Intraductal, Noninfiltrating/metabolism; Chromatin/metabolism; Breast Neoplasms/pathology; Breast Neoplasms/genetics; Breast Neoplasms/metabolism; Biomarkers, Tumor/metabolism; Biomarkers, Tumor/genetics; Unsupervised Machine Learning; Image Processing, Computer-Assisted/methods; Tissue Array Analysis; Neoplasm Staging
ABSTRACT
Ebola virus (EBOV) is a high-consequence filovirus that gives rise to frequent epidemics with high case fatality rates and few therapeutic options. Here, we applied image-based screening of a genome-wide CRISPR library to systematically identify host cell regulators of Ebola virus infection in 39,085,093 single cells. Measuring viral RNA and protein levels together with their localization in cells identified over 998 related host factors and provided detailed information about the role of each gene across the virus replication cycle. We trained a deep learning model on single-cell images to associate each host factor with predicted replication steps, and confirmed the predicted relationship for select host factors. Among the findings, we showed that the mitochondrial complex III subunit UQCRB is a post-entry regulator of Ebola virus RNA replication, and demonstrated that UQCRB inhibition with a small molecule reduced overall Ebola virus infection with an IC50 of 5 µM. Using a random forest model, we also identified perturbations that reduced infection by disrupting the equilibrium between viral RNA and protein. One such protein, STRAP, is a spliceosome-associated factor that was found to be closely associated with VP35, a viral protein required for RNA processing. Loss of STRAP expression resulted in a reduction in full-length viral genome production and subsequent production of non-infectious virus particles. Overall, the data produced in this genome-wide high-content single-cell screen and secondary screens in additional cell lines and related filoviruses (MARV and SUDV) revealed new insights about the role of host factors in virus replication and potential new targets for therapeutic intervention.
ABSTRACT
Synthetic lethality refers to a genetic interaction where the simultaneous perturbation of gene pairs leads to cell death. Synthetically lethal gene pairs (SL pairs) provide a potential avenue for selectively targeting cancer cells based on genetic vulnerabilities. The rise of large-scale gene perturbation screens such as the Cancer Dependency Map (DepMap) offers the opportunity to identify SL pairs automatically using machine learning. We build on a recently developed class of feature learning kernel machines known as Recursive Feature Machines (RFMs) to develop a pipeline for identifying SL pairs based on CRISPR viability data from DepMap. In particular, we first train RFMs to predict viability scores for a given CRISPR gene knockout from cell line embeddings consisting of gene expression and mutation features. After training, RFMs use a statistical operator known as the average gradient outer product to provide a weight for each feature, indicating its importance in predicting cellular viability. We subsequently apply correlation-based filters to re-weight RFM feature importances and identify those features that are most indicative of low cellular viability. Our resulting pipeline is computationally efficient, taking under 3 minutes to analyze all 17,453 knockouts from DepMap for candidate SL pairs. We show that our pipeline more accurately recovers experimentally verified SL pairs than prior approaches. Moreover, our pipeline finds new candidate SL pairs, thereby opening novel avenues for identifying genetic vulnerabilities in cancer.
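The average gradient outer product (AGOP) mentioned here is the average, over data points, of the outer product of a predictor's gradient with itself. The sketch below computes it for a toy predictor using finite differences. The predictor, the data, and the reading of the diagonal as per-feature importance are assumptions for illustration; a trained RFM would supply its own predictor and gradients.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference gradient of scalar function f at point x."""
    grad = np.zeros_like(x, dtype=float)
    for k in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[k] = h
        grad[k] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

def average_gradient_outer_product(f, X):
    """AGOP: mean over rows x of X of grad f(x) grad f(x)^T."""
    grads = np.array([numerical_gradient(f, x) for x in X])
    return grads.T @ grads / len(X)

def toy_predictor(x):
    """Hypothetical predictor: depends on features 0 and 2, ignores feature 1."""
    return 3.0 * x[0] - 1.0 * x[2]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # hypothetical cell line embeddings

M = average_gradient_outer_product(toy_predictor, X)
feature_importance = np.diag(M)       # larger value = more influential feature
print(feature_importance)
```

The ignored feature receives essentially zero weight, which is the behaviour that makes the AGOP diagonal usable as a feature ranking.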
ABSTRACT
High-throughput drug screening (using cell imaging or gene expression measurements as readouts of drug effect) is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweights samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE's superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show how InfoCORE offers a versatile framework that resolves general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes. The code is available at https://github.com/uhlerlab/InfoCORE.
ABSTRACT
Transfer learning refers to the process of adapting a model trained on a source task to a target task. While kernel methods are conceptually and computationally simple models that are competitive on a variety of tasks, it has been unclear how to develop scalable kernel-based transfer learning methods across general source and target tasks with possibly differing label dimensions. In this work, we propose a transfer learning framework for kernel methods by projecting and translating the source model to the target task. We demonstrate the effectiveness of our framework in applications to image classification and virtual drug screening. For both applications, we identify simple scaling laws that characterize the performance of transfer-learned kernels as a function of the number of target examples. We explain this phenomenon in a simplified linear setting, where we are able to derive the exact scaling laws.
ABSTRACT
Proteins on the cell membrane cluster to respond to extracellular signals; for example, adhesion proteins cluster to enhance extracellular matrix sensing, and T-cell receptors cluster to enhance antigen sensing. Importantly, the maturation of such receptor clusters requires transcriptional control to adapt and reinforce the extracellular signal sensing. However, it has been unclear how such efficient clustering mechanisms are encoded at the level of the genes that code for these receptor proteins. Using the adhesome as an example, we show that genes that code for adhesome receptor proteins are spatially co-localized and co-regulated within the cell nucleus. To this end, we use Hi-C maps combined with RNA-seq data of adherent cells to map the correspondence between adhesome receptor proteins and their associated genes. Interestingly, we find that the transcription factors that regulate these genes are also co-localized with the adhesome gene loci, thereby potentially facilitating a transcriptional reinforcement of the extracellular matrix sensing machinery. Collectively, our results highlight an important layer of transcriptional control of cellular signal sensing.
ABSTRACT
Protein-ligand binding prediction is a fundamental problem in AI-driven drug discovery. Prior work focused on supervised learning methods using a large set of binding affinity data for small molecules, but it is hard to apply the same strategy to other drug classes like antibodies as labelled data is limited. In this paper, we explore unsupervised approaches and reformulate binding energy prediction as a generative modeling task. Specifically, we train an energy-based model on a set of unlabelled protein-ligand complexes using SE(3) denoising score matching and interpret its log-likelihood as binding affinity. Our key contribution is a new equivariant rotation prediction network called Neural Euler's Rotation Equations (NERE) for SE(3) score matching. It predicts a rotation by modeling the force and torque between protein and ligand atoms, where the force is defined as the gradient of an energy function with respect to atom coordinates. We evaluate NERE on protein-ligand and antibody-antigen binding affinity prediction benchmarks. Our model outperforms all unsupervised baselines (physics-based and statistical potentials) and matches supervised learning methods in the antibody case.