Search | BVS CLAP/SMR-OPS/OMS

Wide-Open: Accelerating public data release by automating detection of overdue datasets.

Grechkin, Maxim; Poon, Hoifung; Howe, Bill.

PLoS Biol ; 15(6): e2002477, 2017 Jun.

Article in English | MEDLINE | ID: mdl-28594819

ABSTRACT

Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
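The text-mining step described above can be illustrated with a minimal sketch. It is not the Wide-Open implementation: it assumes only that GEO series accessions look like "GSE" followed by digits and SRA study accessions like "SRP" followed by digits, and the function name `find_accessions` and the sample sentence are hypothetical. The repository query that checks whether a dataset is still private is omitted here.

```python
import re

# Assumed accession formats: GEO series ("GSE" + digits), SRA studies ("SRP" + digits).
ACCESSION_RE = re.compile(r"\b(GSE\d+|SRP\d+)\b")


def find_accessions(text):
    """Return unique accession-like strings in order of first appearance."""
    seen = []
    for match in ACCESSION_RE.finditer(text):
        accession = match.group(1)
        if accession not in seen:
            seen.append(accession)
    return seen


# Hypothetical article sentence, for illustration only.
sample = (
    "Expression data are available under GSE12345; "
    "raw reads are in SRP067890 (see also GSE12345)."
)
print(find_accessions(sample))
```

Each extracted accession would then be looked up in the corresponding repository, and any that is cited in a published article but still not publicly visible would be flagged as a candidate overdue dataset.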

Subject(s)

Access to Information; Biomedical Research/methods; Databases, Genetic; Animals; Biomedical Research/trends; Biotechnology/trends; Computational Biology/trends; Data Mining; Databases, Bibliographic; Databases, Genetic/standards; Databases, Genetic/trends; Gene Expression Regulation; Humans; Library Automation; Molecular Sequence Data; National Library of Medicine (U.S.); Periodicals as Topic; Reproducibility of Results; Time Factors; United States

Identifying Network Perturbation in Cancer.

Grechkin, Maxim; Logsdon, Benjamin A; Gentles, Andrew J; Lee, Su-In.

PLoS Comput Biol ; 12(5): e1004888, 2016 May.

Article in English | MEDLINE | ID: mdl-27145341

ABSTRACT

We present a computational framework, called DISCERN (DIfferential SparsE Regulatory Network), to identify informative topological changes in gene-regulator dependence networks inferred on the basis of mRNA expression datasets within distinct biological states. DISCERN takes two expression datasets as input: an expression dataset of diseased tissues from patients with a disease of interest and another expression dataset from matching normal tissues. DISCERN estimates the extent to which each gene is perturbed, that is, has distinct regulator connectivity in the inferred gene-regulator dependencies between the disease and normal conditions. This approach has distinct advantages over existing methods. First, DISCERN infers conditional dependencies between candidate regulators and genes, where conditional dependence relationships discriminate the evidence for direct interactions from indirect interactions more precisely than pairwise correlation. Second, DISCERN uses a new likelihood-based scoring function to alleviate concerns about accuracy of the specific edges inferred in a particular network. In synthetic data, DISCERN identifies perturbed genes more accurately than existing methods for identifying perturbed genes between distinct states. In expression datasets from patients with acute myeloid leukemia (AML), breast cancer and lung cancer, genes with high DISCERN scores in each cancer are enriched for known tumor drivers, genes associated with the biological processes known to be important in the disease, and genes associated with patient prognosis, in the respective cancer. Finally, we show that DISCERN can uncover potential mechanisms underlying network perturbation by explaining observed epigenomic activity patterns in cancer and normal tissue types more accurately than alternative methods, based on the available epigenomic data from the ENCODE project.

Subject(s)

Gene Regulatory Networks; Models, Genetic; Neoplasms/genetics; Breast Neoplasms/genetics; Computational Biology; Computer Simulation; Databases, Genetic; Epigenesis, Genetic; Female; Gene Expression Regulation, Neoplastic; Humans; Leukemia, Myeloid, Acute/genetics; Likelihood Functions; Lung Neoplasms/genetics; Prognosis

Pathway Graphical Lasso.

Grechkin, Maxim; Fazel, Maryam; Witten, Daniela; Lee, Su-In.

Proc AAAI Conf Artif Intell ; 2015: 2617-2623, 2015 Jan.

Article in English | MEDLINE | ID: mdl-26167394

ABSTRACT

Graphical models provide a rich framework for summarizing the dependencies among variables. The graphical lasso approach attempts to learn the structure of a Gaussian graphical model (GGM) by maximizing the log likelihood of the data, subject to an ℓ1 penalty on the elements of the inverse covariance matrix. Most algorithms for solving the graphical lasso problem do not scale to a very large number of variables. Furthermore, the learned network structure is hard to interpret. To overcome these challenges, we propose a novel GGM structure learning method that exploits the fact that for many real-world problems we have prior knowledge that certain edges are unlikely to be present. For example, in gene regulatory networks, a pair of genes that does not participate together in any of the cellular processes, typically referred to as pathways, is less likely to be connected. In computer vision applications in which each variable corresponds to a pixel, each variable is likely to be connected to the nearby variables. In this paper, we propose the pathway graphical lasso, which learns the structure of a GGM subject to pathway-based constraints. To solve this problem, we decompose the network into smaller parts and use a message-passing algorithm to communicate among the subnetworks. Our algorithm improves run time by orders of magnitude compared to the state-of-the-art optimization methods for the graphical lasso problem that were modified to handle pathway-based constraints.
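For reference, the penalized likelihood described in this abstract is commonly written as below, where S is the sample covariance matrix, Θ the inverse covariance (precision) matrix, and λ the penalty weight. The constraint on the second line is one plausible reading of "pathway-based constraints" and is an assumption, not something the abstract states: entries of Θ are fixed at zero for variable pairs that share no pathway. The symbols S, Θ, λ, and P_k are notation introduced for this note.

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0} \;
  \log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \lambda \lVert \Theta \rVert_1
\quad \text{subject to} \quad
  \Theta_{ij} = 0 \ \text{ whenever no pathway } P_k \text{ contains both } i \text{ and } j .
```

Under this reading, the zero pattern lets the problem be split into overlapping pathway-sized blocks, which fits the abstract's description of decomposing the network into smaller parts.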
