Results 1 - 18 of 18
1.
Genome Res ; 34(3): 469-483, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38514204

ABSTRACT

With the goal of mapping genomic activity, international projects have recently measured epigenetic activity in hundreds of cell and tissue types. Chromatin state annotations produced by segmentation and genome annotation (SAGA) methods have emerged as the predominant way to summarize these epigenomic data sets in order to annotate the genome. These chromatin state annotations are essential for many genomic tasks, including identifying active regulatory elements and interpreting disease-associated genetic variation. However, despite the widespread applications of SAGA methods, no principled approach exists to evaluate the statistical significance of chromatin state assignments. Here, we propose the first method for assigning calibrated confidence scores to chromatin state annotations. Toward this goal, we performed a comprehensive evaluation of the reproducibility of the two most widely used existing SAGA methods, ChromHMM and Segway. We found that their predictions are frequently irreproducible. For example, when applying the same SAGA method on two sets of experimental replicates, 27%-69% of predicted enhancers fail to replicate. This suggests that a substantial fraction of predicted elements in existing chromatin state annotations cannot be relied upon. To remedy this problem, we introduce SAGAconf, a method for assigning a measure of confidence (r-value) to chromatin state annotations. SAGAconf works with any SAGA method and assigns an r-value to each genomic bin of a chromatin state annotation that represents the probability that the label of this bin will be reproduced in a replicated experiment. Thus, SAGAconf allows a researcher to select only the reliable predictions from a chromatin annotation for use in downstream analyses.
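
The reproducibility figure quoted above (the share of predicted elements that fail to replicate) can be illustrated with a naive per-label overlap between two replicate annotations. This is only a sketch, not the SAGAconf r-value, which the abstract describes as a calibrated probability. The function name, the toy labels, and the assumption that labels are already matched between the two runs are all hypothetical.

```python
from collections import defaultdict


def label_reproducibility(rep1, rep2):
    """Per-label fraction of bins in rep1 that carry the same label in rep2.

    rep1 and rep2 are equal-length sequences of chromatin state labels,
    one per genomic bin. Labels are assumed to be already matched between
    the two runs (an assumption; unsupervised labels usually need matching).
    """
    if len(rep1) != len(rep2):
        raise ValueError("annotations must cover the same bins")
    total = defaultdict(int)
    agree = defaultdict(int)
    for a, b in zip(rep1, rep2):
        total[a] += 1
        if a == b:
            agree[a] += 1
    return {label: agree[label] / total[label] for label in total}


# Toy annotations for six bins in two replicates (hypothetical data).
rep1 = ["Enh", "Enh", "Prom", "Quies", "Enh", "Quies"]
rep2 = ["Enh", "Quies", "Prom", "Quies", "Quies", "Quies"]
scores = label_reproducibility(rep1, rep2)
print(scores)
```

In this toy input only one of the three enhancer bins keeps its label, so the enhancer score is one third; a raw overlap like this is uncalibrated and is meant only to make the notion of "fails to replicate" concrete.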


Subject(s)
Chromatin , Molecular Sequence Annotation , Chromatin/genetics , Chromatin/metabolism , Humans , Reproducibility of Results , Genomics/methods
2.
Nat Rev Genet ; 24(2): 125-137, 2023 02.
Article in English | MEDLINE | ID: mdl-36192604

ABSTRACT

Artificial intelligence (AI) models based on deep learning now represent the state of the art for making functional predictions in genomics research. However, the underlying basis on which predictive models make such predictions is often unknown. For genomics researchers, this missing explanatory information would frequently be of greater value than the predictions themselves, as it can enable new insights into genetic processes. We review progress in the emerging area of explainable AI (xAI), a field with the potential to empower life science researchers to gain mechanistic insights into complex deep learning models. We discuss and categorize approaches for model interpretation, including an intuitive understanding of how each approach works and their underlying assumptions and limitations in the context of typical high-throughput biological datasets.


Subject(s)
Artificial Intelligence , Deep Learning , Genomics
3.
Nat Commun ; 13(1): 3704, 2022 06 28.
Article in English | MEDLINE | ID: mdl-35764630

ABSTRACT

Despite the availability of chromatin conformation capture experiments, discerning the relationship between the 1D genome and 3D conformation remains a challenge, which limits our understanding of their effect on gene expression and disease. We propose Hi-C-LSTM, a method that produces low-dimensional latent representations that summarize intra-chromosomal Hi-C contacts via a recurrent long short-term memory neural network model. We find that these representations contain all the information needed to recreate the observed Hi-C matrix with high accuracy, outperforming existing methods. These representations enable the identification of a variety of conformation-defining genomic elements, including nuclear compartments and conformation-related transcription factors. They furthermore enable in-silico perturbation experiments that measure the influence of cis-regulatory elements on conformation.


Subject(s)
Chromatin , Genomics , Chromatin/genetics , Learning , Molecular Conformation , Neural Networks, Computer
4.
IEEE Trans Med Imaging ; 41(11): 3128-3145, 2022 11.
Article in English | MEDLINE | ID: mdl-35622798

ABSTRACT

Drug repurposing can accelerate the identification of effective compounds for clinical use against SARS-CoV-2, with the advantage of pre-existing clinical safety data and an established supply chain. RNA viruses such as SARS-CoV-2 manipulate cellular pathways and induce reorganization of subcellular structures to support their life cycle. These morphological changes can be quantified using bioimaging techniques. In this work, we developed DEEMD: a computational pipeline using deep neural network models within a multiple instance learning framework, to identify putative treatments effective against SARS-CoV-2 based on morphological analysis of the publicly available RxRx19a dataset. This dataset consists of fluorescence microscopy images of SARS-CoV-2 non-infected cells and infected cells, with and without drug treatment. DEEMD first extracts discriminative morphological features to generate cell morphological profiles from the non-infected and infected cells. These morphological profiles are then used in a statistical model to estimate the applied treatment efficacy on infected cells based on similarities to non-infected cells. DEEMD is capable of localizing infected cells via weak supervision without any expensive pixel-level annotations. DEEMD identifies known SARS-CoV-2 inhibitors, such as Remdesivir and Aloxistatin, supporting the validity of our approach. DEEMD can be explored for use on other emerging viruses and datasets to rapidly identify candidate antiviral treatments in the future. Our implementation is available online at https://www.github.com/Sadegh-Saberian/DEEMD.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , Antiviral Agents/pharmacology , Antiviral Agents/chemistry , Antiviral Agents/metabolism
5.
Bioinformatics ; 38(11): 3029-3036, 2022 05 26.
Article in English | MEDLINE | ID: mdl-35451453

ABSTRACT

MOTIVATION: Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These methods take as input a set of sequencing-based assays of epigenomic activity, such as ChIP-seq measurements of histone modification and transcription factor binding. They output an annotation of the genome that assigns a chromatin state label to each genomic position. Existing SAGA methods have several limitations caused by the discrete annotation framework: such annotations cannot easily represent varying strengths of genomic elements, and they cannot easily represent combinatorial elements that simultaneously exhibit multiple types of activity. To remedy these limitations, we propose an annotation strategy that instead outputs a vector of chromatin state features at each position rather than a single discrete label. Continuous modeling is common in other fields, such as in topic modeling of text documents. We propose a method, epigenome-ssm-nonneg, that uses a non-negative state space model to efficiently annotate the genome with chromatin state features. We also propose several measures of the quality of a chromatin state feature annotation and we compare the performance of several alternative methods according to these quality measures. RESULTS: We show that chromatin state features from epigenome-ssm-nonneg are more useful for several downstream applications than both continuous and discrete alternatives, including their ability to identify expressed genes and enhancers. Therefore, we expect that these continuous chromatin state features will be valuable reference annotations to be used in visualization and downstream analysis. AVAILABILITY AND IMPLEMENTATION: Source code for epigenome-ssm is available at https://github.com/habibdanesh/epigenome-ssm and Zenodo (DOI: 10.5281/zenodo.6507585). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Chromatin , Epigenome , Humans , Epigenomics/methods , Genomics/methods , Software
6.
Cytometry A ; 101(2): 177-184, 2022 02.
Article in English | MEDLINE | ID: mdl-34559446

ABSTRACT

We introduce a new cell population score called SpecEnr (specific enrichment) and describe a method that discovers robust and accurate candidate biomarkers from flow cytometry data. Our approach identifies a new class of candidate biomarkers we define as driver cell populations, whose abundance is associated with a sample class (e.g., disease), but not as a result of a change in a related population. We show that the driver cell populations we find are also easily interpretable using a lattice-based visualization tool. Our method is implemented in the R package flowGraph, freely available on GitHub (github.com/aya49/flowGraph) and on BioConductor.


Subject(s)
Software , Biomarkers , Flow Cytometry/methods
7.
Bioinformatics ; 38(4): 1126-1128, 2022 01 27.
Article in English | MEDLINE | ID: mdl-34718413

ABSTRACT

MOTIVATION: With the advancement of sequencing technologies, genomic data sets are constantly being expanded by high volumes of different data types. One recently introduced data type in genomic science is genomic signals, which are usually short-read coverage measurements over the genome. To understand and evaluate the results of such studies, one needs to understand and analyze the characteristics of the input data. RESULTS: SigTools is an R-based genomic signals visualization package developed with two objectives: (i) to facilitate genomic signals exploration in order to uncover insights for later model training, refinement and development by including distribution and autocorrelation plots; (ii) to enable genomic signals interpretation by including correlation and aggregation plots. In addition, our corresponding web application, SigTools-Shiny, extends the accessibility scope of these modules to people who are more comfortable working with graphical user interfaces instead of command-line tools. AVAILABILITY AND IMPLEMENTATION: SigTools source code, installation guide and manual are freely available on http://github.com/shohre73.


Subject(s)
Genome , Genomics , Humans , Genomics/methods , Software , Sequence Analysis
8.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 2313-2323, 2022.
Article in English | MEDLINE | ID: mdl-34043510

ABSTRACT

The availability of thousands of assays of epigenetic activity necessitates compressed representations of these data sets that summarize the epigenetic landscape of the genome. Until recently, most such representations were cell type-specific, applying to a single tissue or cell state. Recently, neural networks have made it possible to summarize data across tissues to produce a pan-cell type representation. In this work, we propose Epi-LSTM, a deep long short-term memory (LSTM) recurrent neural network autoencoder to capture the long-term dependencies in the epigenomic data. The latent representations from Epi-LSTM capture a variety of genomic phenomena, including gene expression, promoter-enhancer interactions, replication timing, frequently interacting regions, and evolutionary conservation. These representations outperform existing methods in a majority of cell types while yielding smoother representations along the genomic axis due to their sequential nature.


Subject(s)
Epigenome , Neural Networks, Computer , Humans
9.
PLoS Comput Biol ; 17(10): e1009423, 2021 10.
Article in English | MEDLINE | ID: mdl-34648491

ABSTRACT

Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These algorithms take as input epigenomic datasets, such as chromatin immunoprecipitation-sequencing (ChIP-seq) measurements of histone modifications or transcription factor binding. They partition the genome and assign a label to each segment such that positions with the same label exhibit similar patterns of input data. SAGA algorithms discover categories of activity such as promoters, enhancers, or parts of genes without prior knowledge of known genomic elements. In this sense, they generally act in an unsupervised fashion like clustering algorithms, but with the additional simultaneous function of segmenting the genome. Here, we review the common methodological framework that underlies these methods, review variants of and improvements upon this basic framework, and discuss the outlook for future work. This review is intended for those interested in applying SAGA methods and for computational researchers interested in improving upon them.


Subject(s)
Algorithms , Chromatin/genetics , Genome/genetics , Genomics/methods , Molecular Sequence Annotation/methods , Chromatin Immunoprecipitation Sequencing , Histone Code , Humans , Protein Binding
10.
Genome Biol ; 20(1): 180, 2019 08 28.
Article in English | MEDLINE | ID: mdl-31462275

ABSTRACT

Semi-automated genome annotation methods such as Segway take as input a set of genome-wide measurements such as of histone modification or DNA accessibility and output an annotation of genomic activity in the target cell type. Here we present annotations of 164 human cell types using 1615 data sets. To produce these annotations, we automated the label interpretation step to produce a fully automated annotation strategy. Using these annotations, we developed a measure of the importance of each genomic position called the "conservation-associated activity score." We further combined all annotations into a single, cell type-agnostic encyclopedia that catalogs all human regulatory elements.


Subject(s)
DNA/genetics , Databases, Genetic , Molecular Sequence Annotation , Algorithms , Automation , Cell Line , Humans , Machine Learning , Phenotype , Transcription, Genetic
11.
Nat Commun ; 9(1): 1402, 2018 04 11.
Article in English | MEDLINE | ID: mdl-29643364

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) and the Roadmap Epigenomics Project seek to characterize the epigenome in diverse cell types using assays that identify, for example, genomic regions with modified histones or accessible chromatin. These efforts have produced thousands of datasets but cannot possibly measure each epigenomic factor in all cell types. To address this, we present a method, PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition (PREDICTD), to computationally impute missing experiments. PREDICTD leverages an elegant model called "tensor decomposition" to impute many experiments simultaneously. Compared with the current state-of-the-art method, ChromImpute, PREDICTD produces lower overall mean squared error, and combining the two methods yields further improvement. We show that PREDICTD data captures enhancer activity at noncoding human accelerated regions. PREDICTD provides reference imputed data and open-source software for investigating new cell types, and demonstrates the utility of tensor decomposition and cloud computing, both promising technologies for bioinformatics.


Subject(s)
Cloud Computing/statistics & numerical data , Epigenesis, Genetic , Genome, Human , Histones/genetics , Software , Chromatin/chemistry , Chromatin/metabolism , Datasets as Topic , Epigenomics/statistics & numerical data , Histones/metabolism , Humans
12.
Proteins ; 86(4): 454-466, 2018 04.
Article in English | MEDLINE | ID: mdl-29345009

ABSTRACT

Selecting a non-redundant representative subset of sequences is a common step in many bioinformatics workflows, such as the creation of non-redundant training sets for sequence and structural models or selection of "operational taxonomic units" from metagenomics data. Previous methods for this task, such as CD-HIT, PISCES, and UCLUST, apply a heuristic threshold-based algorithm that has no theoretical guarantees. We propose a new approach based on submodular optimization. Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success for other representative set selection problems. We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is the best possible in polynomial time (under some assumptions), and it is flexible and intuitive because it applies a suite of generic methods to optimize one of a variety of objective functions.
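
Representative subset selection by submodular optimization can be sketched with the standard greedy algorithm. The sketch below assumes a facility-location objective (the sum, over all items, of each item's best similarity to the selected set); the abstract does not say which objective functions the authors used, so the objective, the function name, and the toy similarity matrix are all illustrative assumptions. For a monotone submodular objective such as this one with non-negative similarities, greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal value.

```python
def greedy_facility_location(similarity, k):
    """Greedily pick up to k representatives maximizing facility location.

    similarity[i][j] is a non-negative similarity between items i and j.
    The objective is sum_i max_{j in selected} similarity[i][j].
    """
    n = len(similarity)
    selected = []
    # best[i] = highest similarity of item i to any selected representative
    best = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_item = -1.0, None
        for c in range(n):
            if c in selected:
                continue
            # Marginal gain in the objective from adding candidate c.
            gain = sum(max(similarity[i][c] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_item = gain, c
        selected.append(best_item)
        best = [max(best[i], similarity[i][best_item]) for i in range(n)]
    return selected


# Toy similarity matrix with two clusters: items {0, 1} and items {2, 3}.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
selected = greedy_facility_location(sim, 2)
print(selected)
```

With two representatives requested, the greedy choice takes one item from each cluster, because a second item from an already covered cluster adds almost nothing to the objective. This is the diversity behaviour that a fixed similarity threshold does not guarantee.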


Subject(s)
Algorithms , Proteins/chemistry , Sequence Analysis, Protein/methods , Cluster Analysis , Proteomics/methods
13.
Bioinformatics ; 34(4): 669-671, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29028889

ABSTRACT

Summary: Segway performs semi-automated genome annotation, discovering joint patterns across multiple genomic signal datasets. We discuss a major new version of Segway and highlight its ability to model data with substantially greater accuracy. Major enhancements in Segway 2.0 include the ability to model data with a mixture of Gaussians, enabling capture of arbitrarily complex signal distributions, and minibatch training, leading to better learned parameters. Availability and implementation: Segway and its source code are freely available for download at http://segway.hoffmanlab.org. We have made available scripts (https://doi.org/10.5281/zenodo.802939) and datasets (https://doi.org/10.5281/zenodo.802906) for this paper's analysis. Contact: michael.hoffman@utoronto.ca. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics/methods , Molecular Sequence Annotation/methods , Sequence Analysis, DNA/methods , Software , Eukaryota/genetics
14.
Genome Biol ; 17(1): 229, 2016 11 15.
Article in English | MEDLINE | ID: mdl-27846892

ABSTRACT

Due to the high cost of sequencing-based genomics assays such as ChIP-seq and DNase-seq, the epigenomic characterization of a cell type is typically carried out using a small panel of assay types. Deciding a priori which assays to perform is, thus, a critical step in many studies. We present the submodular selection of assays (SSA), a method for choosing a diverse panel of genomic assays that leverages methods from submodular optimization. More generally, this application serves as a model for how submodular optimization can be applied to other discrete problems in biology.


Subject(s)
Genome , Genomics/methods , Binding Sites , Chromatin Immunoprecipitation , Databases, Nucleic Acid , Epigenomics/methods , Genomics/standards , High-Throughput Nucleotide Sequencing , Histones/metabolism , Humans , Protein Binding , Transcription Factors/metabolism
15.
Article in English | MEDLINE | ID: mdl-27168766

ABSTRACT

BACKGROUND: Eukaryotic genome duplication starts at discrete sequences (replication origins) that coordinate cell cycle progression, ensure genomic stability and modulate gene expression. Origins share some sequence features, but their activity also responds to changes in transcription and cellular differentiation status. RESULTS: To identify chromatin states and histone modifications that locally mark replication origins, we profiled origin distributions in eight human cell lines representing embryonic and differentiated cell types. Consistent with a role of chromatin structure in determining origin activity, we found that cancer and non-cancer cells of similar lineages exhibited highly similar replication origin distributions. Surprisingly, our study revealed that DNase hypersensitivity, which often correlates with early replication at large-scale chromatin domains, did not emerge as a strong local determinant of origin activity. Instead, we found that two distinct sets of chromatin modifications exhibited strong local associations with two discrete groups of replication origins. The first origin group consisted of about 40,000 regions that actively initiated replication in all cell types and preferentially colocalized with unmethylated CpGs and with the euchromatin markers, H3K4me3 and H3K9Ac. The second group included origins that were consistently active in cells of a single type or lineage and preferentially colocalized with the heterochromatin marker, H3K9me3. Shared origins replicated throughout the S-phase of the cell cycle, whereas cell-type-specific origins preferentially replicated during late S-phase. CONCLUSIONS: These observations are in line with the hypothesis that differentiation-associated changes in chromatin and gene expression affect the activation of specific replication origins.

16.
Nat Rev Genet ; 16(6): 321-32, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25948244

ABSTRACT

The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.


Subject(s)
Artificial Intelligence , Models, Genetic , Amino Acid Sequence , Animals , Base Sequence , Computer Simulation , Discriminant Analysis , Genetics, Medical , Genomics , Humans , Molecular Sequence Annotation
17.
Genome Res ; 25(4): 544-57, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25677182

ABSTRACT

The genomic neighborhood of a gene influences its activity, a behavior that is attributable in part to domain-scale regulation. Previous genomic studies have identified many types of regulatory domains. However, due to the difficulty of integrating genomics data sets, the relationships among these domain types are poorly understood. Semi-automated genome annotation (SAGA) algorithms facilitate human interpretation of heterogeneous collections of genomics data by simultaneously partitioning the human genome and assigning labels to the resulting genomic segments. However, existing SAGA methods cannot integrate inherently pairwise chromatin conformation data. We developed a new computational method, called graph-based regularization (GBR), for expressing a pairwise prior that encourages certain pairs of genomic loci to receive the same label in a genome annotation. We used GBR to exploit chromatin conformation information during genome annotation by encouraging positions that are close in 3D to occupy the same type of domain. Using this approach, we produced a model of chromatin domains in eight human cell types, thereby revealing the relationships among known domain types. Through this model, we identified clusters of tightly regulated genes expressed in only a small number of cell types, which we term "specific expression domains." We found that domain boundaries marked by promoters and CTCF motifs are consistent between cell types even when domain activity changes. Finally, we showed that GBR can be used to transfer information from well-studied cell types to less well-characterized cell types during genome annotation, making it possible to produce high-quality annotations of the hundreds of cell types with limited available data.


Subject(s)
Chromatin/genetics , Computational Biology/methods , Genomics/methods , Molecular Conformation , Molecular Sequence Annotation/methods , Algorithms , Amino Acid Motifs/genetics , Cell Line, Tumor , Chromatin/metabolism , Chromosome Structures , Genome, Human/genetics , HeLa Cells , Hep G2 Cells , Human Umbilical Vein Endothelial Cells , Humans , Promoter Regions, Genetic/genetics
18.
Nature ; 512(7515): 449-52, 2014 Aug 28.
Article in English | MEDLINE | ID: mdl-25164756

ABSTRACT

Genome function is dynamically regulated in part by chromatin, which consists of the histones, non-histone proteins and RNA molecules that package DNA. Studies in Caenorhabditis elegans and Drosophila melanogaster have contributed substantially to our understanding of molecular mechanisms of genome function in humans, and have revealed conservation of chromatin components and mechanisms. Nevertheless, the three organisms have markedly different genome sizes, chromosome architecture and gene organization. On human and fly chromosomes, for example, pericentric heterochromatin flanks single centromeres, whereas worm chromosomes have dispersed heterochromatin-like regions enriched in the distal chromosomal 'arms', and centromeres distributed along their lengths. To systematically investigate chromatin organization and associated gene regulation across species, we generated and analysed a large collection of genome-wide chromatin data sets from cell lines and developmental stages in worm, fly and human. Here we present over 800 new data sets from our ENCODE and modENCODE consortia, bringing the total to over 1,400. Comparison of combinatorial patterns of histone modifications, nuclear lamina-associated domains, organization of large-scale topological domains, chromatin environment at promoters and enhancers, nucleosome positioning, and DNA replication patterns reveals many conserved features of chromatin organization among the three organisms. We also find notable differences in the composition and locations of repressive chromatin. These data sets and analyses provide a rich resource for comparative and species-specific investigations of chromatin composition, organization and function.


Subject(s)
Caenorhabditis elegans/cytology , Caenorhabditis elegans/genetics , Chromatin/genetics , Chromatin/metabolism , Drosophila melanogaster/cytology , Drosophila melanogaster/genetics , Animals , Cell Line , Centromere/genetics , Centromere/metabolism , Chromatin/chemistry , Chromatin Assembly and Disassembly/genetics , DNA Replication/genetics , Enhancer Elements, Genetic/genetics , Epigenesis, Genetic , Heterochromatin/chemistry , Heterochromatin/genetics , Heterochromatin/metabolism , Histones/chemistry , Histones/metabolism , Humans , Molecular Sequence Annotation , Nuclear Lamina/metabolism , Nucleosomes/chemistry , Nucleosomes/genetics , Nucleosomes/metabolism , Promoter Regions, Genetic/genetics , Species Specificity