Results 1 - 20 of 28
1.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36549922

ABSTRACT

MOTIVATION: Single-cell assay for transposase accessible chromatin using sequencing (scATAC-seq) is a valuable resource to learn cis-regulatory elements such as cell-type specific enhancers and transcription factor binding sites. However, cell-type identification of scATAC-seq data is known to be challenging due to the heterogeneity derived from different protocols and the high dropout rate. RESULTS: In this study, we perform a systematic comparison of seven scATAC-seq datasets of mouse brain to benchmark the efficacy of neuronal cell-type annotation from gene sets. We find that redundant marker genes give a dramatic improvement for sparse scATAC-seq annotation across the data collected from different studies. Interestingly, simple aggregation of such marker genes achieves performance comparable to or higher than that of machine-learning classifiers, suggesting its potential for downstream applications. Based on our results, we reannotated all scATAC-seq data for detailed cell types using robust marker genes. Their meta scATAC-seq profiles are publicly available at https://gillisweb.cshl.edu/Meta_scATAC. Furthermore, we trained a deep neural network to predict chromatin accessibility from only DNA sequence and identified key motifs enriched for each neuronal subtype. Those predicted profiles are visualized together in our database as a valuable resource to explore cell-type specific epigenetic regulation in a sequence-dependent and -independent manner.
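
As a rough illustration of the "simple aggregation of marker genes" idea in this abstract, the sketch below scores each cell by the mean signal over each cell type's marker genes and assigns the highest-scoring type. The function name, the cells-by-genes matrix layout, and the toy marker sets are assumptions made for illustration only; the paper may aggregate and normalize differently.

```python
import numpy as np

def annotate_by_marker_aggregation(activity, gene_names, marker_sets):
    """Assign each cell the type whose marker genes have the highest mean signal.

    activity    : (n_cells, n_genes) array of gene-level accessibility scores
    gene_names  : list of gene names matching the columns of `activity`
    marker_sets : dict mapping cell-type label -> list of marker gene names
    """
    gene_index = {g: i for i, g in enumerate(gene_names)}
    labels = list(marker_sets)
    scores = np.zeros((activity.shape[0], len(labels)))
    for j, label in enumerate(labels):
        # Markers missing from the matrix are skipped rather than treated as zero.
        cols = [gene_index[g] for g in marker_sets[label] if g in gene_index]
        if cols:
            scores[:, j] = activity[:, cols].mean(axis=1)
    assigned = [labels[k] for k in scores.argmax(axis=1)]
    return assigned, scores

# Toy input (hypothetical values): two cells, four genes.
gene_names = ["Gad1", "Gad2", "Slc17a7", "Slc17a6"]
marker_sets = {"inhibitory": ["Gad1", "Gad2"], "excitatory": ["Slc17a7", "Slc17a6"]}
activity = np.array([[1.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 1.0]])
assigned, scores = annotate_by_marker_aggregation(activity, gene_names, marker_sets)
```

Averaging over several redundant markers is what makes this kind of score tolerant of dropout: a single missing gene lowers the mean but does not zero it.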


Subjects
Chromatin; Epigenesis, Genetic; Animals; Mice; Chromatin/genetics; Regulatory Sequences, Nucleic Acid; Neural Networks, Computer
2.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38366935

ABSTRACT

SUMMARY: Deep neural networks (DNNs) have been widely applied to predict the molecular functions of the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. AVAILABILITY AND IMPLEMENTATION: EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at https://github.com/p-koo/evoaug-tf_analysis.


Subjects
Deep Learning; Genomics/methods; Genome; Software; Neural Networks, Computer
3.
Proc Natl Acad Sci U S A ; 117(21): 11471-11482, 2020 05 26.
Article in English | MEDLINE | ID: mdl-32385160

ABSTRACT

Lineage plasticity is a prominent feature of pancreatic ductal adenocarcinoma (PDA) cells, which can occur via deregulation of lineage-specifying transcription factors. Here, we show that the zinc finger protein ZBED2 is aberrantly expressed in PDA and alters tumor cell identity in this disease. Unexpectedly, our epigenomic experiments reveal that ZBED2 is a sequence-specific transcriptional repressor of IFN-stimulated genes, which occurs through antagonism of IFN regulatory factor 1 (IRF1)-mediated transcriptional activation at cooccupied promoter elements. Consequently, ZBED2 attenuates the transcriptional output and growth arrest phenotypes downstream of IFN signaling in multiple PDA cell line models. We also found that ZBED2 is preferentially expressed in the squamous molecular subtype of human PDA, in association with inferior patient survival outcomes. Consistent with this observation, we show that ZBED2 can repress the pancreatic progenitor transcriptional program, enhance motility, and promote invasion in PDA cells. Collectively, our findings suggest that high ZBED2 expression is acquired during PDA progression to suppress the IFN response pathway and to promote lineage plasticity in this disease.


Subjects
Carcinoma, Pancreatic Ductal/pathology; DNA-Binding Proteins/metabolism; Interferon Regulatory Factor-1/metabolism; Pancreatic Neoplasms/pathology; Transcription Factors/metabolism; Animals; Carcinoma, Pancreatic Ductal/genetics; Carcinoma, Pancreatic Ductal/metabolism; Carcinoma, Pancreatic Ductal/mortality; Cell Line, Tumor; Cell Proliferation/drug effects; Chromatin Immunoprecipitation; DNA-Binding Proteins/genetics; Gene Expression Regulation, Neoplastic; Humans; Interferon Regulatory Factor-1/genetics; Interferon-gamma/pharmacology; Mice; Pancreatic Neoplasms/genetics; Pancreatic Neoplasms/metabolism; Pancreatic Neoplasms/mortality; Promoter Regions, Genetic; Survival Analysis; Transcription Factors/genetics
4.
PLoS Comput Biol ; 17(5): e1008925, 2021 05.
Article in English | MEDLINE | ID: mdl-33983921

ABSTRACT

Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network we call ResidualBind and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
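
The population-level effect size described in this abstract can be sketched as the average change in a model's prediction when a pattern is embedded into a set of background sequences. The code below is a minimal sketch under that reading: `global_importance`, `toy_model`, the `ACGU` column order, and the all-A backgrounds are hypothetical stand-ins, and a real analysis would use a trained network and backgrounds sampled to resemble the data.

```python
import numpy as np

ALPHABET = "ACGU"  # RNA alphabet; the column order is an assumption of this sketch

def one_hot(seq):
    """Encode an RNA string as an (L, 4) one-hot array."""
    idx = [ALPHABET.index(ch) for ch in seq]
    return np.eye(4)[idx]

def global_importance(model, backgrounds, pattern, position):
    """Mean change in model output when `pattern` is embedded at `position`.

    model       : callable mapping an (N, L, 4) array to N predictions
    backgrounds : (N, L, 4) one-hot background sequences
    pattern     : (P, 4) one-hot pattern to embed
    """
    embedded = backgrounds.copy()
    embedded[:, position:position + pattern.shape[0], :] = pattern
    return float(np.mean(model(embedded) - model(backgrounds)))

def toy_model(x):
    """Stand-in for a trained network: counts G nucleotides per sequence."""
    return x[:, :, ALPHABET.index("G")].sum(axis=1)

# Toy backgrounds (hypothetical): three identical poly-A sequences of length 10.
backgrounds = np.stack([one_hot("AAAAAAAAAA")] * 3)
effect = global_importance(toy_model, backgrounds, one_hot("GG"), position=4)
```

Because the difference is averaged over many backgrounds, the result reflects the pattern's effect across the population rather than in one sequence context. Varying `position`, or embedding two patterns at once, is one way to probe the spacing and interaction effects the abstract mentions.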


Subjects
Deep Learning; Genomics; Neural Networks, Computer; Computational Biology/methods; Humans
5.
PLoS Comput Biol ; 15(12): e1007560, 2019 12.
Article in English | MEDLINE | ID: mdl-31856220

ABSTRACT

Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they build representations of regulatory genomic sequences. Here we perform systematic experiments on synthetic sequences to reveal how CNN architecture, specifically convolutional filter size and max-pooling, influences the extent to which sequence motif representations are learned by first layer filters. We find that CNNs designed to foster hierarchical representation learning of sequence motifs (assembling partial features into whole features in deeper layers) tend to learn distributed representations, i.e. partial motifs. On the other hand, CNNs that are designed to limit the ability to hierarchically build sequence motif representations in deeper layers tend to learn more interpretable localist representations, i.e. whole motifs. We then validate that this representation learning principle established from synthetic sequences generalizes to in vivo sequences.


Subjects
Genomics/statistics & numerical data; Neural Networks, Computer; Amino Acid Motifs; Binding Sites/genetics; Computational Biology; Computer Simulation; DNA/genetics; Databases, Genetic/statistics & numerical data; Deep Learning/statistics & numerical data; Genome, Human; Humans; Transcription Factors/chemistry; Transcription Factors/genetics; Transcription Factors/metabolism
6.
Mol Phylogenet Evol ; 139: 106562, 2019 10.
Article in English | MEDLINE | ID: mdl-31323334

ABSTRACT

One major challenge to delimiting species with genetic data is successfully differentiating population structure from species-level divergence, an issue exacerbated in taxa inhabiting naturally fragmented habitats. Many fields of science are now using machine learning, and in evolutionary biology supervised machine learning has recently been used to infer species boundaries. These supervised methods require training data with associated labels. Conversely, unsupervised machine learning (UML) uses inherent data structure and does not require user-specified training labels, potentially providing more objectivity in species delimitation. In the context of integrative taxonomy, we demonstrate the utility of three UML approaches (random forests, variational autoencoders, t-distributed stochastic neighbor embedding) for species delimitation in an arachnid taxon with high population genetic structure (Opiliones, Laniatores, Metanonychus). We find that UML approaches successfully cluster samples according to species-level divergences and not high levels of population structure, while model-based validation methods severely over-split putative species. UML offers intuitive data visualization in two-dimensional space, the ability to accommodate various data types, and has potential in many areas of systematic and evolutionary biology. We argue that machine learning methods are ideally suited for species delimitation and may perform well in many natural systems and across taxa with diverse biological characteristics.


Subjects
Unsupervised Machine Learning; Animals; Arachnida/classification; Arachnida/genetics; Cluster Analysis; Phylogeny; Polymorphism, Single Nucleotide; Principal Component Analysis
7.
Biophys J ; 111(1): 19-24, 2016 Jul 12.
Article in English | MEDLINE | ID: mdl-27410730

ABSTRACT

Many aspects of chromatin biology are influenced by the nuclear compartment in which a locus resides, from transcriptional regulation to DNA repair. Further, the dynamic and variable localization of a particular locus across cell populations and over time makes analysis of a large number of cells critical. As a consequence, robust and automatable methods to measure the position of individual loci within the nuclear volume in populations of cells are necessary to support quantitative analysis of nuclear position. Here, we describe a three-dimensional membrane reconstruction approach that uses fluorescently tagged nuclear envelope or endoplasmic reticulum membrane marker proteins to precisely map the nuclear volume. This approach is robust to a variety of nuclear shapes, providing greater biological accuracy than alternative methods that enforce nuclear circularity, while also describing nuclear position in all three dimensions. By combining this method with established approaches to reconstruct the position of diffraction-limited chromatin markers (in this case, lac Operator arrays bound by lacI-GFP), the distribution of loci positions within the nuclear volume with respect to the nuclear periphery can be quantitatively obtained. This stand-alone image analysis pipeline should be of broad practical utility for individuals interested in various aspects of chromatin biology, while also providing, to our knowledge, a new conceptual framework for investigators who study organelle shape.


Subjects
Imaging, Three-Dimensional; Nuclear Envelope/metabolism; Animals; Endoplasmic Reticulum/metabolism; Fluorescent Dyes/metabolism; Mice; Models, Biological; NIH 3T3 Cells; Schizosaccharomyces/cytology
8.
PLoS Comput Biol ; 11(10): e1004297, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26512894

ABSTRACT

Resolving distinct biochemical interaction states when analyzing the trajectories of diffusing proteins in live cells on an individual basis remains challenging because of the limited statistics provided by the relatively short trajectories available experimentally. Here, we introduce a novel, machine-learning based classification methodology, which we call perturbation expectation-maximization (pEM), that simultaneously analyzes a population of protein trajectories to uncover the system of diffusive behaviors which collectively result from distinct biochemical interactions. We validate the performance of pEM in silico and demonstrate that pEM is capable of uncovering the proper number of underlying diffusive states with an accurate characterization of their diffusion properties. We then apply pEM to experimental protein trajectories of Rho GTPases, an integral regulator of cytoskeletal dynamics and cellular homeostasis, in vivo via single particle tracking photo-activated localization microscopy. Remarkably, pEM uncovers 6 distinct diffusive states conserved across various Rho GTPase family members. The variability across family members in the propensities for each diffusive state reveals non-redundant roles in the activation states of RhoA and RhoC. In a resting cell, our results support a model where RhoA is constantly cycling between activation states, with an imbalance of rates favoring an inactive state. RhoC, on the other hand, remains predominantly inactive.


Subjects
Diffusion; Models, Biological; Models, Chemical; Molecular Imaging/methods; Subcellular Fractions/chemistry; rho GTP-Binding Proteins/chemistry; Computer Simulation; Machine Learning; Models, Statistical
9.
bioRxiv ; 2024 Mar 20.
Article in English | MEDLINE | ID: mdl-37461616

ABSTRACT

The rise of large-scale, sequence-based deep neural networks (DNNs) for predicting gene expression has introduced challenges in their evaluation and interpretation. Current evaluations align DNN predictions with experimental perturbation assays, which provides insights into the generalization capabilities within the studied loci but offers a limited perspective of what drives their predictions. Moreover, existing model explainability tools focus mainly on motif analysis, which becomes complex when interpreting longer sequences. Here we introduce CREME, an in silico perturbation toolkit that interrogates large-scale DNNs to uncover the rules of gene regulation that they learn. Using CREME, we investigate Enformer, a prominent DNN in gene expression prediction, revealing cis-regulatory elements (CREs) that directly enhance or silence target genes. We explore the intricate complexity of higher-order CRE interactions, the effect of CRE distance from transcription start sites on gene expression, as well as the biochemical features of enhancers and silencers learned by Enformer. Moreover, we demonstrate the flexibility of CREME to efficiently uncover a higher-resolution view of functional sequence elements within CREs. This work demonstrates how CREME can be employed to translate the powerful predictions of large-scale DNNs to study open questions in gene regulation.

10.
bioRxiv ; 2024 Mar 03.
Article in English | MEDLINE | ID: mdl-38464101

ABSTRACT

The emergence of genomic language models (gLMs) offers an unsupervised approach to learn a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown pre-trained gLMs can be leveraged to improve prediction performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that current gLMs do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major limitation with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

11.
bioRxiv ; 2024 Jan 18.
Article in English | MEDLINE | ID: mdl-38293144

ABSTRACT

Deep neural networks (DNNs) have been widely applied to predict the molecular functions of regulatory regions in the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. Availability: EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at (https://github.com/p-koo/evoaug-tf_analysis).

12.
bioRxiv ; 2024 Mar 02.
Article in English | MEDLINE | ID: mdl-38013993

ABSTRACT

Deep neural networks (DNNs) have greatly advanced the ability to predict genome function from sequence. Interpreting genomic DNNs in terms of biological mechanisms, however, remains difficult. Here we introduce SQUID, a genomic DNN interpretability framework based on surrogate modeling. SQUID approximates genomic DNNs in user-specified regions of sequence space using surrogate models, i.e., simpler models that are mechanistically interpretable. Importantly, SQUID removes the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation. Benchmarking analysis on multiple genomic DNNs shows that SQUID, when compared to established interpretability methods, identifies motifs that are more consistent across genomic loci and yields improved single-nucleotide variant-effect predictions. SQUID also supports surrogate models that quantify epistatic interactions within and between cis-regulatory elements. SQUID thus advances the ability to mechanistically interpret genomic DNNs.
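
The surrogate-modeling idea in this abstract can be sketched in a much simplified form: sample mutagenized variants around a sequence of interest, query the DNN on them, and fit a simpler additive model to those predictions. The code below is not the SQUID API. The function names, the mutation rate, the `toy_dnn` stand-in, and the plain least-squares fit are all assumptions of this sketch, and it does not model the nonlinearities or heteroscedastic noise that the abstract says SQUID accounts for.

```python
import numpy as np

def mutagenize(x, rate, n, rng):
    """Sample n variants of the one-hot sequence x (L, 4).

    Each position is resampled uniformly with probability `rate`; a resampled
    base can equal the original, so the effective mutation rate is lower.
    """
    L = x.shape[0]
    idx = np.tile(x.argmax(axis=1), (n, 1))          # (n, L) base indices
    mask = rng.random((n, L)) < rate
    random_bases = rng.integers(0, 4, size=(n, L))
    idx = np.where(mask, random_bases, idx)
    return np.eye(4)[idx]                            # (n, L, 4)

def fit_additive_surrogate(library, predictions):
    """Least-squares fit of an additive model: intercept plus per-base effects."""
    n, L, A = library.shape
    X = np.concatenate([np.ones((n, 1)), library.reshape(n, L * A)], axis=1)
    coef, *_ = np.linalg.lstsq(X, predictions, rcond=None)
    return coef[0], coef[1:].reshape(L, A)

def toy_dnn(batch):
    """Stand-in for a trained DNN: responds only to a C (column 1) at position 2."""
    return 3.0 * batch[:, 2, 1]

rng = np.random.default_rng(0)
wild_type = np.eye(4)[[0, 1, 2, 3, 0]]               # toy sequence "ACGTA"
library = mutagenize(wild_type, rate=0.3, n=500, rng=rng)
intercept, theta = fit_additive_surrogate(library, toy_dnn(library))
```

One-hot parameters are only determined up to a per-position constant, so differences such as `theta[2, 1] - theta[2, 0]` are the meaningful quantities to read off, not the raw values.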

13.
Genome Biol ; 24(1): 109, 2023 05 09.
Article in English | MEDLINE | ID: mdl-37161475

ABSTRACT

Post hoc attribution methods can provide insights into the learned patterns from deep neural networks (DNNs) trained on high-throughput functional genomics data. However, in practice, their resultant attribution maps can be challenging to interpret due to spurious importance scores for seemingly arbitrary nucleotides. Here, we identify a previously overlooked attribution noise source that arises from how DNNs handle one-hot encoded DNA. We demonstrate this noise is pervasive across various genomic DNNs and introduce a statistical correction that effectively reduces it, leading to more reliable attribution maps. Our approach represents a promising step towards gaining meaningful insights from DNNs in regulatory genomics.
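
The abstract does not spell out the statistical correction. One simple form, assumed here for illustration, is to subtract at each position the mean of the gradient across the four nucleotide channels, which removes the component of the gradient that points off the space of valid one-hot sequences. The function name and the toy gradient values below are hypothetical.

```python
import numpy as np

def correct_attribution(grad):
    """Subtract the per-position mean across nucleotide channels.

    grad : array of shape (L, 4) or (N, L, 4), gradient of a model output
           with respect to one-hot encoded DNA.
    """
    return grad - grad.mean(axis=-1, keepdims=True)

# Toy gradient (hypothetical values) for a sequence of length 2.
grad = np.array([[1.0, 2.0, 3.0, 6.0],
                 [0.5, 0.5, 0.5, 0.5]])
corrected = correct_attribution(grad)
```

In this toy case the second position carries the same gradient for all four nucleotides, so it is reduced to zero: a uniform shift says nothing about which base matters.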


Subjects
Genomics; Learning; Neural Networks, Computer; Nucleotides
14.
bioRxiv ; 2023 Jan 17.
Article in English | MEDLINE | ID: mdl-36711495

ABSTRACT

N6-methyladenosine (m6A) is a highly dynamic, abundant mRNA modification which is an excellent potential mechanism for fine tuning gene expression. Plants adapt to their surrounding light and temperature environment using complex gene regulatory networks. The role of m6A in controlling gene expression in response to variable environmental conditions has so far been unexplored. Here, we map the transcriptome-wide m6A landscape under various light and temperature environments. Identified m6A modifications show a highly specific spatial distribution along transcripts with enrichment occurring in 5'UTR regions and around transcriptional end sites. We show that the position of m6A modifications on transcripts might influence cellular transcript localization and the presence of m6A modifications is associated with alternative polyadenylation, a process which results in multiple RNA isoforms with varying 3'UTR lengths. RNAs with m6A modifications exhibit a higher preference for shorter 3'UTRs. These shorter 3'UTR regions might directly influence transcript abundance and localization by including or excluding cis-regulatory elements. We propose that environmental stimuli might change the m6A landscape of plants as one possible way of fine tuning gene regulation through alternative polyadenylation and transcript localization.

15.
Methods Mol Biol ; 2586: 197-215, 2023.
Article in English | MEDLINE | ID: mdl-36705906

ABSTRACT

Deep neural networks have demonstrated improved performance at predicting sequence specificities of DNA- and RNA-binding proteins. However, it remains unclear why they perform better than previous methods that rely on k-mers and position weight matrices. Here, we highlight a recent deep learning-based software package, called ResidualBind, that analyzes RNA-protein interactions using only RNA sequence as an input feature and performs global importance analysis for model interpretability. We discuss practical considerations for model interpretability to uncover learned sequence motifs and their secondary structure preferences.


Subjects
Neural Networks, Computer; RNA; RNA/genetics; RNA-Binding Proteins/metabolism; DNA/metabolism; Position-Specific Scoring Matrices; Protein Binding
16.
Genome Biol ; 24(1): 105, 2023 05 05.
Article in English | MEDLINE | ID: mdl-37143118

ABSTRACT

Deep neural networks (DNNs) hold promise for functional genomics prediction, but their generalization capability may be limited by the amount of available data. To address this, we propose EvoAug, a suite of evolution-inspired augmentations that enhance the training of genomic DNNs by increasing genetic variation. Random transformation of DNA sequences can potentially alter their function in unknown ways, so we employ a fine-tuning procedure using the original non-transformed data to preserve functional integrity. Our results demonstrate that EvoAug substantially improves the generalization and interpretability of established DNNs across prominent regulatory genomics prediction tasks, offering a robust solution for genomic DNNs.
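
To make the idea of sequence augmentation concrete, here is a minimal sketch of two transformations applied to one-hot encoded DNA: random point mutation and reverse complement. This is not the EvoAug API. The abstract does not list the augmentations, so these two are illustrative choices, and the function names, the `ACGT` column order, and the mutation rate are assumptions.

```python
import numpy as np

ALPHABET = "ACGT"  # column order assumed by this sketch

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot array."""
    return np.eye(4)[[ALPHABET.index(ch) for ch in seq]]

def decode(x):
    """Convert an (L, 4) one-hot array back to a DNA string."""
    return "".join(ALPHABET[i] for i in x.argmax(axis=1))

def random_mutation(x, rate, rng):
    """Replace round(rate * L) randomly chosen positions with a random base.

    The new base may equal the old one, so fewer positions may actually change.
    """
    x = x.copy()
    L = x.shape[0]
    n_mut = int(round(rate * L))
    pos = rng.choice(L, size=n_mut, replace=False)
    new_bases = rng.integers(0, 4, size=n_mut)
    x[pos] = 0
    x[pos, new_bases] = 1
    return x

def reverse_complement(x):
    """With A, C, G, T column order, flipping both axes gives the reverse complement."""
    return x[::-1, ::-1].copy()

rng = np.random.default_rng(0)
original = one_hot("ACGTACGTAC")
mutated = random_mutation(original, rate=0.3, rng=rng)
flipped = decode(reverse_complement(one_hot("AACG")))
```

In a training loop, transformations like these would be applied on the fly to each mini-batch. The abstract's second stage, fine-tuning on the original non-transformed data, would then follow with augmentation switched off.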


Subjects
Genomics; Neural Networks, Computer; Genomics/methods
17.
Comput Methods Programs Biomed ; 239: 107631, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37271050

ABSTRACT

BACKGROUND AND OBJECTIVE: Histopathology is the gold standard for diagnosis of many cancers. Recent advances in computer vision, specifically deep learning, have facilitated the analysis of histopathology images for many tasks, including the detection of immune cells and microsatellite instability. However, it remains difficult to identify optimal models and training configurations for different histopathology classification tasks due to the abundance of available architectures and the lack of systematic evaluations. Our objective in this work is to present a software tool that addresses this need and enables robust, systematic evaluation of neural network models for patch classification in histology in a light-weight, easy-to-use package for both algorithm developers and biomedical researchers. METHODS: Here we present ChampKit (Comprehensive Histopathology Assessment of Model Predictions toolKit): an extensible, fully reproducible evaluation toolkit that is a one-stop-shop to train and evaluate deep neural networks for patch classification. ChampKit curates a broad range of public datasets. It enables training and evaluation of models supported by timm directly from the command line, without the need for users to write any code. External models are enabled through a straightforward API and minimal coding. As a result, ChampKit facilitates the evaluation of existing and new models and deep learning architectures on pathology datasets, making it more accessible to the broader scientific community. To demonstrate the utility of ChampKit, we establish baseline performance for a subset of possible models that could be employed with ChampKit, focusing on several popular deep learning models, namely ResNet18, ResNet50, and R26-ViT, a hybrid vision transformer. In addition, we compare each model trained either from random weight initialization or with transfer learning from ImageNet pretrained models. For ResNet18, we also consider transfer learning from a self-supervised pretrained model. RESULTS: The main result of this paper is the ChampKit software. Using ChampKit, we were able to systematically evaluate multiple neural networks across six datasets. We observed mixed results when evaluating the benefits of pretraining versus random initialization, with no clear benefit except in the low data regime, where transfer learning was found to be beneficial. Surprisingly, we found that transfer learning from self-supervised weights rarely improved performance, which is counter to other areas of computer vision. CONCLUSIONS: Choosing the right model for a given digital pathology dataset is nontrivial. ChampKit provides a valuable tool to fill this gap by enabling the evaluation of hundreds of existing (or user-defined) deep learning models across a variety of pathology tasks. Source code and data for the tool are freely accessible at https://github.com/SBU-BMI/champkit.


Subjects
Neoplasms; Neural Networks, Computer; Humans; Algorithms; Software; Histological Techniques
18.
Nat Cell Biol ; 25(2): 298-308, 2023 02.
Article in English | MEDLINE | ID: mdl-36658219

ABSTRACT

The EWS-FLI1 fusion oncoprotein deregulates transcription to initiate the paediatric cancer Ewing sarcoma. Here we used a domain-focused CRISPR screen to implicate the transcriptional repressor ETV6 as a unique dependency in this tumour. Using biochemical assays and epigenomics, we show that ETV6 competes with EWS-FLI1 for binding to select DNA elements enriched for short GGAA repeat sequences. Upon inactivating ETV6, EWS-FLI1 overtakes and hyper-activates these cis-elements to promote mesenchymal differentiation, with SOX11 being a key downstream target. We show that squelching of ETV6 with a dominant-interfering peptide phenocopies these effects and suppresses Ewing sarcoma growth in vivo. These findings reveal targeting of ETV6 as a strategy for neutralizing the EWS-FLI1 oncoprotein by reprogramming of genomic occupancy.


Subjects
Sarcoma, Ewing; Child; Humans; Sarcoma, Ewing/genetics; Sarcoma, Ewing/metabolism; Sarcoma, Ewing/pathology; Cell Line, Tumor; Gene Expression Regulation, Neoplastic; RNA-Binding Protein EWS/genetics; RNA-Binding Protein EWS/metabolism; Proto-Oncogene Protein c-fli-1/genetics; Proto-Oncogene Protein c-fli-1/metabolism; Oncogene Proteins, Fusion/genetics; Oncogene Proteins, Fusion/metabolism
19.
Nat Mach Intell ; 4(12): 1088-1100, 2022 Dec.
Article in English | MEDLINE | ID: mdl-37324054

ABSTRACT

Deep learning has been successful at predicting epigenomic profiles from DNA sequences. Most approaches frame this task as a binary classification relying on peak callers to define functional activity. Recently, quantitative models have emerged to directly predict the experimental coverage values as a regression. As new models continue to emerge with different architectures and training configurations, a major bottleneck is forming due to the lack of ability to fairly assess the novelty of proposed models and their utility for downstream biological discovery. Here we introduce a unified evaluation framework and use it to compare various binary and quantitative models trained to predict chromatin accessibility data. We highlight various modeling choices that affect generalization performance, including a downstream application of predicting variant effects. In addition, we introduce a robustness metric that can be used to enhance model selection and improve variant effect predictions. Our empirical study largely supports that quantitative modeling of epigenomic profiles leads to better generalizability and interpretability.

20.
Proc Mach Learn Res ; 200: 131-149, 2022 Nov.
Article in English | MEDLINE | ID: mdl-37205975

ABSTRACT

Deep neural networks (DNNs) have advanced our ability to take DNA primary sequence as input and predict a myriad of molecular activities measured via high-throughput functional genomic assays. Post hoc attribution analysis has been employed to provide insights into the importance of features learned by DNNs, often revealing patterns such as sequence motifs. However, attribution maps typically harbor spurious importance scores to an extent that varies from model to model, even for DNNs whose predictions generalize well. Thus, the standard approach for model selection, which relies on performance of a held-out validation set, does not guarantee that a high-performing DNN will provide reliable explanations. Here we introduce two approaches that quantify the consistency of important features across a population of attribution maps; consistency reflects a qualitative property of human interpretable attribution maps. We employ the consistency metrics as part of a multivariate model selection framework to identify models that yield high generalization performance and interpretable attribution analysis. We demonstrate the efficacy of this approach across various DNNs quantitatively with synthetic data and qualitatively with chromatin accessibility data.
