1.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36549922

ABSTRACT

MOTIVATION: Single-cell assay for transposase accessible chromatin using sequencing (scATAC-seq) is a valuable resource to learn cis-regulatory elements such as cell-type specific enhancers and transcription factor binding sites. However, cell-type identification of scATAC-seq data is known to be challenging due to the heterogeneity derived from different protocols and the high dropout rate. RESULTS: In this study, we perform a systematic comparison of seven scATAC-seq datasets of mouse brain to benchmark the efficacy of neuronal cell-type annotation from gene sets. We find that redundant marker genes dramatically improve the annotation of sparse scATAC-seq data collected across different studies. Interestingly, simple aggregation of such marker genes achieves performance comparable to or higher than that of machine-learning classifiers, suggesting its potential for downstream applications. Based on our results, we reannotated all scATAC-seq data for detailed cell types using robust marker genes. Their meta scATAC-seq profiles are publicly available at https://gillisweb.cshl.edu/Meta_scATAC. Furthermore, we trained a deep neural network to predict chromatin accessibility from only DNA sequence and identified key motifs enriched for each neuronal subtype. Those predicted profiles are visualized together in our database as a valuable resource to explore cell-type specific epigenetic regulation in a sequence-dependent and -independent manner.


Subjects
Chromatin , Epigenesis, Genetic , Animals , Mice , Chromatin/genetics , Regulatory Sequences, Nucleic Acid , Neural Networks, Computer
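The "simple aggregation of marker genes" described in the abstract above can be illustrated with a minimal sketch. It assumes each cell is a sparse mapping from gene name to a gene-activity score and that the label is the cell type whose marker set has the highest mean activity. The marker sets, the scores, and the function name `annotate_cell` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def annotate_cell(gene_activity: Dict[str, float],
                  marker_sets: Dict[str, List[str]]) -> str:
    """Return the cell type whose marker genes have the highest mean activity."""
    scores = {}
    for cell_type, genes in marker_sets.items():
        # Genes absent from the sparse profile count as zero (dropout).
        total = sum(gene_activity.get(gene, 0.0) for gene in genes)
        scores[cell_type] = total / len(genes)
    return max(scores, key=scores.get)


# Hypothetical marker sets and one sparse gene-activity profile.
MARKERS = {
    "excitatory": ["Slc17a7", "Satb2", "Neurod6"],
    "inhibitory": ["Gad1", "Gad2", "Slc32a1"],
}
cell = {"Gad2": 1.0, "Slc32a1": 2.0, "Satb2": 1.0}
label = annotate_cell(cell, MARKERS)
print(label)
```

Because the score averages over several redundant markers, a dropout in one of them (here `Gad1`) does not by itself change the call.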
2.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38366935

ABSTRACT

SUMMARY: Deep neural networks (DNNs) have been widely applied to predict the molecular functions of the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields performance comparable to that of the original EvoAug package. AVAILABILITY AND IMPLEMENTATION: EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at https://github.com/p-koo/evoaug-tf_analysis.


Subjects
Deep Learning , Genomics/methods , Genome , Software , Neural Networks, Computer
3.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36355460

ABSTRACT

MOTIVATION: Multiple sequence alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. RESULTS: Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold2 and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of optimizing predictions of protein sequences with methods that are not fully understood. AVAILABILITY AND IMPLEMENTATION: Our code and examples are available at: https://github.com/spetti/SMURF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Proteins , Humans , Sequence Alignment , Proteins/chemistry , Neural Networks, Computer , Amino Acid Sequence
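The idea of a "smooth and differentiable" Smith-Waterman in the abstract above can be sketched by replacing every hard maximum in the dynamic program with a temperature-scaled log-sum-exp. The sketch below is a generic illustration under stated assumptions (linear gap penalty, simple match/mismatch scores, hypothetical function names) and is not the SMURF implementation.

```python
import math
from typing import List


def logsumexp(values: List[float], temperature: float) -> float:
    """Smooth maximum: temperature * log(sum(exp(v / temperature)))."""
    m = max(values)
    total = sum(math.exp((v - m) / temperature) for v in values)
    return m + temperature * math.log(total)


def smooth_sw(a: str, b: str, temperature: float = 1.0,
              match: float = 1.0, mismatch: float = -1.0,
              gap: float = 1.0) -> float:
    """Local alignment score with each max replaced by a smooth maximum."""
    h = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    cells = [0.0]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            options = [
                0.0,                      # start a new local alignment
                h[i - 1][j - 1] + s,      # align a[i-1] with b[j-1]
                h[i - 1][j] - gap,        # gap in b
                h[i][j - 1] - gap,        # gap in a
            ]
            h[i][j] = logsumexp(options, temperature)
            cells.append(h[i][j])
    # The final score is a smooth maximum over all cells.
    return logsumexp(cells, temperature)


print(smooth_sw("ACGT", "ACGT", temperature=0.01))
print(smooth_sw("ACGT", "ACGT", temperature=1.0))
```

As the temperature approaches zero the score approaches the classical Smith-Waterman score; at higher temperatures the score is a smooth function of the substitution and gap scores, which is what allows gradient-based training through the alignment step.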
4.
Proc Natl Acad Sci U S A ; 117(21): 11471-11482, 2020 05 26.
Article in English | MEDLINE | ID: mdl-32385160

ABSTRACT

Lineage plasticity is a prominent feature of pancreatic ductal adenocarcinoma (PDA) cells, which can occur via deregulation of lineage-specifying transcription factors. Here, we show that the zinc finger protein ZBED2 is aberrantly expressed in PDA and alters tumor cell identity in this disease. Unexpectedly, our epigenomic experiments reveal that ZBED2 is a sequence-specific transcriptional repressor of IFN-stimulated genes, which occurs through antagonism of IFN regulatory factor 1 (IRF1)-mediated transcriptional activation at cooccupied promoter elements. Consequently, ZBED2 attenuates the transcriptional output and growth arrest phenotypes downstream of IFN signaling in multiple PDA cell line models. We also found that ZBED2 is preferentially expressed in the squamous molecular subtype of human PDA, in association with inferior patient survival outcomes. Consistent with this observation, we show that ZBED2 can repress the pancreatic progenitor transcriptional program, enhance motility, and promote invasion in PDA cells. Collectively, our findings suggest that high ZBED2 expression is acquired during PDA progression to suppress the IFN response pathway and to promote lineage plasticity in this disease.


Subjects
Carcinoma, Pancreatic Ductal/pathology , DNA-Binding Proteins/metabolism , Interferon Regulatory Factor-1/metabolism , Pancreatic Neoplasms/pathology , Transcription Factors/metabolism , Animals , Carcinoma, Pancreatic Ductal/genetics , Carcinoma, Pancreatic Ductal/metabolism , Carcinoma, Pancreatic Ductal/mortality , Cell Line, Tumor , Cell Proliferation/drug effects , Chromatin Immunoprecipitation , DNA-Binding Proteins/genetics , Gene Expression Regulation, Neoplastic , Humans , Interferon Regulatory Factor-1/genetics , Interferon-gamma/pharmacology , Mice , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/metabolism , Pancreatic Neoplasms/mortality , Promoter Regions, Genetic , Survival Analysis , Transcription Factors/genetics
5.
PLoS Comput Biol ; 17(5): e1008925, 2021 05.
Article in English | MEDLINE | ID: mdl-33983921

ABSTRACT

Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network, which we call ResidualBind, and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.


Subjects
Deep Learning , Genomics , Neural Networks, Computer , Computational Biology/methods , Humans
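The population-level effect size that GIA measures, as described in the abstract above, can be sketched as the mean model prediction over background sequences with a pattern embedded, minus the mean prediction over the same sequences without it. Everything concrete below is an assumption for illustration: `toy_model` is a stand-in for a trained network, the backgrounds are uniformly random RNA sequences, and the motif, position, and function names are hypothetical.

```python
import random
from typing import Callable, List


def embed(seq: str, pattern: str, position: int) -> str:
    """Overwrite part of seq with pattern, keeping the length unchanged."""
    return seq[:position] + pattern + seq[position + len(pattern):]


def global_importance(model: Callable[[str], float], backgrounds: List[str],
                      pattern: str, position: int) -> float:
    """Mean prediction with the pattern embedded minus mean without it."""
    with_pattern = [model(embed(s, pattern, position)) for s in backgrounds]
    without = [model(s) for s in backgrounds]
    return sum(with_pattern) / len(with_pattern) - sum(without) / len(without)


def toy_model(seq: str) -> float:
    """Stand-in for a trained network: responds only to one hypothetical motif."""
    return 1.0 if "UGCAUG" in seq else 0.0


rng = random.Random(0)
backgrounds = ["".join(rng.choice("ACGU") for _ in range(40)) for _ in range(200)]

effect_motif = global_importance(toy_model, backgrounds, "UGCAUG", 10)
effect_control = global_importance(toy_model, backgrounds, "AAAAAA", 10)
print(effect_motif, effect_control)
```

Averaging over many backgrounds is the point of the design: it marginalizes out the surrounding sequence context, so the difference reflects the pattern itself rather than any single sequence.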
6.
PLoS Comput Biol ; 15(12): e1007560, 2019 12.
Article in English | MEDLINE | ID: mdl-31856220

ABSTRACT

Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they build representations of regulatory genomic sequences. Here we perform systematic experiments on synthetic sequences to reveal how CNN architecture, specifically convolutional filter size and max-pooling, influences the extent to which sequence motif representations are learned by first layer filters. We find that CNNs designed to foster hierarchical representation learning of sequence motifs (assembling partial features into whole features in deeper layers) tend to learn distributed representations, i.e. partial motifs. On the other hand, CNNs that are designed to limit the ability to hierarchically build sequence motif representations in deeper layers tend to learn more interpretable localist representations, i.e. whole motifs. We then validate that this representation learning principle established from synthetic sequences generalizes to in vivo sequences.


Subjects
Genomics/statistics & numerical data , Neural Networks, Computer , Amino Acid Motifs , Binding Sites/genetics , Computational Biology , Computer Simulation , DNA/genetics , Databases, Genetic/statistics & numerical data , Deep Learning/statistics & numerical data , Genome, Human , Humans , Transcription Factors/chemistry , Transcription Factors/genetics , Transcription Factors/metabolism
7.
Int J Mol Sci ; 21(16)2020 Aug 13.
Article in English | MEDLINE | ID: mdl-32823614

ABSTRACT

BACKGROUND: Despite the recent research implicating E2F8 (E2F Transcription Factor 8) in cancer, the role of E2F8 in the progression of ovarian cancer has remained unclear. Hence, we explored the bio-functional effects of E2F8 knockdown on ovarian cancer cell lines in vitro and in vivo. METHODS: The expression of E2F8 was compared between ovarian cancer and noncancer tissues, and its association with the progression-free survival of ovarian cancer patients was analyzed. To demonstrate the function of E2F8 in cell proliferation, migration, and invasion, we employed RNA interference to suppress E2F8 expression in ovarian cancer cell lines. Finally, the effect of E2F8 knockdown was investigated in a xenograft mouse model of ovarian cancer. RESULTS: Ovarian cancer tissue exhibited significantly higher E2F8 expression compared to that of normal ovarian tissue. Clinical data showed that E2F8 was a significant predictor of progression-free survival. Moreover, the prognosis of the ovarian cancer patients with high E2F8 expression was poorer than that of the patients with low E2F8 expression. In vitro experiments using E2F8-knockdown ovarian cancer cell lines demonstrated that E2F8 knockdown inhibited cell proliferation, migration, and tumor invasion. Additionally, E2F8 was a potent inducer and modulator of the expression of epithelial-mesenchymal transition and Notch signaling pathway-related markers. We confirmed the function of E2F8 in vivo, signifying that E2F8 knockdown was significantly correlated with reduced tumor size and weight. CONCLUSIONS: Our findings indicate that E2F8 is highly correlated with ovarian cancer progression. Hence, E2F8 can be utilized as a prognostic marker and therapeutic target against ovarian malignancy.


Subjects
Epithelial-Mesenchymal Transition , Ovarian Neoplasms/metabolism , Ovarian Neoplasms/pathology , Receptors, Notch/metabolism , Repressor Proteins/metabolism , Signal Transduction , Animals , Cell Line, Tumor , Cell Movement , Cell Proliferation , Female , Gene Knockdown Techniques , Humans , Mice, Nude , Multivariate Analysis , Neoplasm Invasiveness , Prognosis , Progression-Free Survival , Tumor Burden , Xenograft Model Antitumor Assays
8.
Mol Phylogenet Evol ; 139: 106562, 2019 10.
Article in English | MEDLINE | ID: mdl-31323334

ABSTRACT

One major challenge to delimiting species with genetic data is successfully differentiating population structure from species-level divergence, an issue exacerbated in taxa inhabiting naturally fragmented habitats. Many fields of science are now using machine learning, and in evolutionary biology supervised machine learning has recently been used to infer species boundaries. These supervised methods require training data with associated labels. Conversely, unsupervised machine learning (UML) uses inherent data structure and does not require user-specified training labels, potentially providing more objectivity in species delimitation. In the context of integrative taxonomy, we demonstrate the utility of three UML approaches (random forests, variational autoencoders, t-distributed stochastic neighbor embedding) for species delimitation in an arachnid taxon with high population genetic structure (Opiliones, Laniatores, Metanonychus). We find that UML approaches successfully cluster samples according to species-level divergences and not high levels of population structure, while model-based validation methods severely over-split putative species. UML offers intuitive data visualization in two-dimensional space, the ability to accommodate various data types, and has potential in many areas of systematic and evolutionary biology. We argue that machine learning methods are ideally suited for species delimitation and may perform well in many natural systems and across taxa with diverse biological characteristics.


Subjects
Unsupervised Machine Learning , Animals , Arachnida/classification , Arachnida/genetics , Cluster Analysis , Phylogeny , Polymorphism, Single Nucleotide , Principal Component Analysis
9.
Nucleic Acids Res ; 43(2): 917-31, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25550426

ABSTRACT

V(D)J recombination is initiated by RAG1 and RAG2, which together with HMGB1 bind to a recombination signal sequence (12RSS or 23RSS) to form the signal complex (SC) and then capture a complementary partner RSS, yielding the paired complex (PC). Little is known regarding the structural changes that accompany the SC to PC transition or the structural features that allow RAG to distinguish its two asymmetric substrates. To address these issues, we analyzed the structure of the 12RSS in the SC and PC using fluorescence resonance energy transfer (FRET) and molecular dynamics modeling. The resulting models indicate that the 12RSS adopts a strongly bent V-shaped structure upon RAG/HMGB1 binding and reveal structural differences, particularly near the heptamer, between the 12RSS in the SC and PC. Comparison of models of the 12RSS and 23RSS in the PC reveals broadly similar shapes but a distinct number and location of DNA bends as well as a smaller central cavity for the 12RSS. These findings provide the most detailed view yet of the 12RSS in RAG-DNA complexes and highlight structural features of the RSS that might underlie activation of RAG-mediated cleavage and substrate asymmetry important for the 12/23 rule of V(D)J recombination.


Subjects
DNA/chemistry , Homeodomain Proteins/metabolism , V(D)J Recombination , DNA/metabolism , DNA Cleavage , HMGB1 Protein/metabolism , Models, Molecular , Nucleic Acid Conformation
10.
Biophys J ; 111(1): 19-24, 2016 Jul 12.
Article in English | MEDLINE | ID: mdl-27410730

ABSTRACT

Many aspects of chromatin biology are influenced by the nuclear compartment in which a locus resides, from transcriptional regulation to DNA repair. Further, the dynamic and variable localization of a particular locus across cell populations and over time makes analysis of a large number of cells critical. As a consequence, robust and automatable methods to measure the position of individual loci within the nuclear volume in populations of cells are necessary to support quantitative analysis of nuclear position. Here, we describe a three-dimensional membrane reconstruction approach that uses fluorescently tagged nuclear envelope or endoplasmic reticulum membrane marker proteins to precisely map the nuclear volume. This approach is robust to a variety of nuclear shapes, providing greater biological accuracy than alternative methods that enforce nuclear circularity, while also describing nuclear position in all three dimensions. By combining this method with established approaches to reconstruct the position of diffraction-limited chromatin markers (in this case, lac Operator arrays bound by lacI-GFP), the distribution of loci positions within the nuclear volume with respect to the nuclear periphery can be quantitatively obtained. This stand-alone image analysis pipeline should be of broad practical utility for individuals interested in various aspects of chromatin biology, while also providing, to our knowledge, a new conceptual framework for investigators who study organelle shape.


Subjects
Imaging, Three-Dimensional , Nuclear Envelope/metabolism , Animals , Endoplasmic Reticulum/metabolism , Fluorescent Dyes/metabolism , Mice , Models, Biological , NIH 3T3 Cells , Schizosaccharomyces/cytology
11.
PLoS Comput Biol ; 11(10): e1004297, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26512894

ABSTRACT

Resolving distinct biochemical interaction states when analyzing the trajectories of diffusing proteins in live cells on an individual basis remains challenging because of the limited statistics provided by the relatively short trajectories available experimentally. Here, we introduce a novel, machine-learning based classification methodology, which we call perturbation expectation-maximization (pEM), that simultaneously analyzes a population of protein trajectories to uncover the system of diffusive behaviors which collectively result from distinct biochemical interactions. We validate the performance of pEM in silico and demonstrate that pEM is capable of uncovering the proper number of underlying diffusive states with an accurate characterization of their diffusion properties. We then apply pEM to experimental protein trajectories of Rho GTPases, an integral regulator of cytoskeletal dynamics and cellular homeostasis, in vivo via single particle tracking photo-activated localization microscopy. Remarkably, pEM uncovers 6 distinct diffusive states conserved across various Rho GTPase family members. The variability across family members in the propensities for each diffusive state reveals non-redundant roles in the activation states of RhoA and RhoC. In a resting cell, our results support a model where RhoA is constantly cycling between activation states, with an imbalance of rates favoring an inactive state. RhoC, on the other hand, remains predominantly inactive.


Subjects
Diffusion , Models, Biological , Models, Chemical , Molecular Imaging/methods , Subcellular Fractions/chemistry , rho GTP-Binding Proteins/chemistry , Computer Simulation , Machine Learning , Models, Statistical
12.
bioRxiv ; 2024 Mar 20.
Article in English | MEDLINE | ID: mdl-37461616

ABSTRACT

The rise of large-scale, sequence-based deep neural networks (DNNs) for predicting gene expression has introduced challenges in their evaluation and interpretation. Current evaluations align DNN predictions with experimental perturbation assays, which provides insights into the generalization capabilities within the studied loci but offers a limited perspective of what drives their predictions. Moreover, existing model explainability tools focus mainly on motif analysis, which becomes complex when interpreting longer sequences. Here we introduce CREME, an in silico perturbation toolkit that interrogates large-scale DNNs to uncover rules of gene regulation that it learns. Using CREME, we investigate Enformer, a prominent DNN in gene expression prediction, revealing cis-regulatory elements (CREs) that directly enhance or silence target genes. We explore the intricate complexity of higher-order CRE interactions, the relationship between CRE distance from transcription start sites on gene expression, as well as the biochemical features of enhancers and silencers learned by Enformer. Moreover, we demonstrate the flexibility of CREME to efficiently uncover a higher-resolution view of functional sequence elements within CREs. This work demonstrates how CREME can be employed to translate the powerful predictions of large-scale DNNs to study open questions in gene regulation.

13.
bioRxiv ; 2024 Mar 03.
Article in English | MEDLINE | ID: mdl-38464101

ABSTRACT

The emergence of genomic language models (gLMs) offers an unsupervised approach to learn a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown pre-trained gLMs can be leveraged to improve prediction performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that current gLMs do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major limitation with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

14.
bioRxiv ; 2024 Jan 18.
Article in English | MEDLINE | ID: mdl-38293144

ABSTRACT

Deep neural networks (DNNs) have been widely applied to predict the molecular functions of regulatory regions in the non-coding genome. DNNs are data hungry and thus require many training examples to fit data well. However, functional genomics experiments typically generate limited amounts of data, constrained by the activity levels of the molecular function under study inside the cell. Recently, EvoAug was introduced to train a genomic DNN with evolution-inspired augmentations. EvoAug-trained DNNs have demonstrated improved generalization and interpretability with attribution analysis. However, EvoAug only supports PyTorch-based models, which limits its applications to a broad class of genomic DNNs based in TensorFlow. Here, we extend EvoAug's functionality to TensorFlow in a new package we call EvoAug-TF. Through a systematic benchmark, we find that EvoAug-TF yields comparable performance with the original EvoAug package. Availability: EvoAug-TF is freely available for users and is distributed under an open-source MIT license. Researchers can access the open-source code on GitHub (https://github.com/p-koo/evoaug-tf). The pre-compiled package is provided via PyPI (https://pypi.org/project/evoaug-tf) with in-depth documentation on ReadTheDocs (https://evoaug-tf.readthedocs.io). The scripts for reproducing the results are available at (https://github.com/p-koo/evoaug-tf_analysis).

15.
bioRxiv ; 2024 Mar 02.
Article in English | MEDLINE | ID: mdl-38013993

ABSTRACT

Deep neural networks (DNNs) have greatly advanced the ability to predict genome function from sequence. Interpreting genomic DNNs in terms of biological mechanisms, however, remains difficult. Here we introduce SQUID, a genomic DNN interpretability framework based on surrogate modeling. SQUID approximates genomic DNNs in user-specified regions of sequence space using surrogate models, i.e., simpler models that are mechanistically interpretable. Importantly, SQUID removes the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation. Benchmarking analysis on multiple genomic DNNs shows that SQUID, when compared to established interpretability methods, identifies motifs that are more consistent across genomic loci and yields improved single-nucleotide variant-effect predictions. SQUID also supports surrogate models that quantify epistatic interactions within and between cis-regulatory elements. SQUID thus advances the ability to mechanistically interpret genomic DNNs.

16.
Genome Biol ; 24(1): 109, 2023 05 09.
Article in English | MEDLINE | ID: mdl-37161475

ABSTRACT

Post hoc attribution methods can provide insights into the learned patterns from deep neural networks (DNNs) trained on high-throughput functional genomics data. However, in practice, their resultant attribution maps can be challenging to interpret due to spurious importance scores for seemingly arbitrary nucleotides. Here, we identify a previously overlooked attribution noise source that arises from how DNNs handle one-hot encoded DNA. We demonstrate this noise is pervasive across various genomic DNNs and introduce a statistical correction that effectively reduces it, leading to more reliable attribution maps. Our approach represents a promising step towards gaining meaningful insights from DNNs in regulatory genomics.


Subjects
Genomics , Learning , Neural Networks, Computer , Nucleotides
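The abstract above does not spell out the statistical correction. One form consistent with its description, assumed here purely for illustration, is to remove from each position's gradient the component shared equally by all four nucleotide channels, since one-hot encoded DNA never varies in that direction. The function name and the example values are hypothetical.

```python
from typing import List


def correct_attribution(grads: List[List[float]]) -> List[List[float]]:
    """Subtract, at each position, the mean gradient across the four channels."""
    corrected = []
    for position in grads:
        mean = sum(position) / len(position)
        corrected.append([g - mean for g in position])
    return corrected


# Hypothetical raw gradients for two positions, channels ordered A, C, G, T.
raw = [
    [0.9, 0.5, 0.5, 0.5],
    [0.2, -0.2, 0.1, 0.3],
]
fixed = correct_attribution(raw)
for row in fixed:
    print([round(value, 3) for value in row])
```

After the correction the four scores at each position sum to zero, so a constant offset shared by all nucleotides no longer shows up as apparent importance.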
17.
bioRxiv ; 2023 Jan 17.
Article in English | MEDLINE | ID: mdl-36711495

ABSTRACT

N6-methyladenosine is a highly dynamic, abundant mRNA modification which is an excellent potential mechanism for fine tuning gene expression. Plants adapt to their surrounding light and temperature environment using complex gene regulatory networks. The role of m6A in controlling gene expression in response to variable environmental conditions has so far been unexplored. Here, we map the transcriptome-wide m6A landscape under various light and temperature environments. Identified m6A-modifications show a highly specific spatial distribution along transcripts with enrichment occurring in 5'UTR regions and around transcriptional end sites. We show that the position of m6A modifications on transcripts might influence cellular transcript localization and the presence of m6A-modifications is associated with alternative polyadenylation, a process which results in multiple RNA isoforms with varying 3'UTR lengths. RNA with m6A-modifications exhibit a higher preference for shorter 3'UTRs. These shorter 3'UTR regions might directly influence transcript abundance and localization by including or excluding cis-regulatory elements. We propose that environmental stimuli might change the m6A landscape of plants as one possible way of fine tuning gene regulation through alternative polyadenylation and transcript localization.

18.
Methods Mol Biol ; 2586: 197-215, 2023.
Article in English | MEDLINE | ID: mdl-36705906

ABSTRACT

Deep neural networks have demonstrated improved performance at predicting sequence specificities of DNA- and RNA-binding proteins. However, it remains unclear why they perform better than previous methods that rely on k-mers and position weight matrices. Here, we highlight a recent deep learning-based software package, called ResidualBind, that analyzes RNA-protein interactions using only RNA sequence as an input feature and performs global importance analysis for model interpretability. We discuss practical considerations for model interpretability to uncover learned sequence motifs and their secondary structure preferences.


Subjects
Neural Networks, Computer , RNA , RNA/genetics , RNA-Binding Proteins/metabolism , DNA/metabolism , Position-Specific Scoring Matrices , Protein Binding
19.
Genome Biol ; 24(1): 105, 2023 05 05.
Article in English | MEDLINE | ID: mdl-37143118

ABSTRACT

Deep neural networks (DNNs) hold promise for functional genomics prediction, but their generalization capability may be limited by the amount of available data. To address this, we propose EvoAug, a suite of evolution-inspired augmentations that enhance the training of genomic DNNs by increasing genetic variation. Random transformation of DNA sequences can potentially alter their function in unknown ways, so we employ a fine-tuning procedure using the original non-transformed data to preserve functional integrity. Our results demonstrate that EvoAug substantially improves the generalization and interpretability of established DNNs across prominent regulatory genomics prediction tasks, offering a robust solution for genomic DNNs.


Subjects
Genomics , Neural Networks, Computer , Genomics/methods
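The abstract above describes random, evolution-inspired transformations of DNA sequences applied during training, followed by fine-tuning on the original data. A minimal sketch of two such transformations follows. It is illustrative only and does not use the EvoAug API: the function names, the choice of point mutation and reverse-complement, and the default rates are all assumptions.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse-complement strand of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def random_mutation(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each base, with probability rate, by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def augment(seq: str, rng: random.Random,
            mutation_rate: float = 0.05, rc_prob: float = 0.5) -> str:
    """Apply point mutations, then reverse-complement with probability rc_prob."""
    seq = random_mutation(seq, mutation_rate, rng)
    if rng.random() < rc_prob:
        seq = reverse_complement(seq)
    return seq


rng = random.Random(0)
print(augment("ACGTACGTACGTACGTACGT", rng))
```

In a two-stage scheme like the one the abstract outlines, a function such as `augment` would be applied to each training sequence in the first stage only; the second stage would train on the unmodified sequences.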
20.
Comput Methods Programs Biomed ; 239: 107631, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37271050

ABSTRACT

BACKGROUND AND OBJECTIVE: Histopathology is the gold standard for diagnosis of many cancers. Recent advances in computer vision, specifically deep learning, have facilitated the analysis of histopathology images for many tasks, including the detection of immune cells and microsatellite instability. However, it remains difficult to identify optimal models and training configurations for different histopathology classification tasks due to the abundance of available architectures and the lack of systematic evaluations. Our objective in this work is to present a software tool that addresses this need and enables robust, systematic evaluation of neural network models for patch classification in histology in a light-weight, easy-to-use package for both algorithm developers and biomedical researchers. METHODS: Here we present ChampKit (Comprehensive Histopathology Assessment of Model Predictions toolKit): an extensible, fully reproducible evaluation toolkit that is a one-stop-shop to train and evaluate deep neural networks for patch classification. ChampKit curates a broad range of public datasets. It enables training and evaluation of models supported by timm directly from the command line, without the need for users to write any code. External models are enabled through a straightforward API and minimal coding. As a result, ChampKit facilitates the evaluation of existing and new models and deep learning architectures on pathology datasets, making it more accessible to the broader scientific community. To demonstrate the utility of ChampKit, we establish baseline performance for a subset of possible models that could be employed with ChampKit, focusing on several popular deep learning models, namely ResNet18, ResNet50, and R26-ViT, a hybrid vision transformer. In addition, we compare each model trained either from random weight initialization or with transfer learning from ImageNet pretrained models. For ResNet18, we also consider transfer learning from a self-supervised pretrained model. RESULTS: The main result of this paper is the ChampKit software. Using ChampKit, we were able to systematically evaluate multiple neural networks across six datasets. We observed mixed results when evaluating the benefits of pretraining versus random initialization, with no clear benefit except in the low data regime, where transfer learning was found to be beneficial. Surprisingly, we found that transfer learning from self-supervised weights rarely improved performance, which runs counter to findings in other areas of computer vision. CONCLUSIONS: Choosing the right model for a given digital pathology dataset is nontrivial. ChampKit provides a valuable tool to fill this gap by enabling the evaluation of hundreds of existing (or user-defined) deep learning models across a variety of pathology tasks. Source code and data for the tool are freely accessible at https://github.com/SBU-BMI/champkit.


Subjects
Neoplasms , Neural Networks, Computer , Humans , Algorithms , Software , Histological Techniques