1.
Algorithms Mol Biol ; 14: 10, 2019.
Article in English | MEDLINE | ID: mdl-30976291

ABSTRACT

PROBLEM: We study the problem of identifying differentially mutated subnetworks of a large gene-gene interaction network, that is, subnetworks that display a significant difference in mutation frequency in two sets of cancer samples. We formally define the associated computational problem and show that the problem is NP-hard. ALGORITHM: We propose a novel and efficient algorithm, called DAMOKLE, to identify differentially mutated subnetworks given genome-wide mutation data for two sets of cancer samples. We prove that DAMOKLE identifies subnetworks with statistically significant difference in mutation frequency when the data comes from a reasonable generative model, provided enough samples are available. EXPERIMENTAL RESULTS: We test DAMOKLE on simulated and real data, showing that DAMOKLE does indeed find subnetworks with significant differences in mutation frequency and that it provides novel insights into the molecular mechanisms of the disease not revealed by standard methods.
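
To make "difference in mutation frequency in two sets of cancer samples" concrete, here is a minimal sketch of how a candidate subnetwork could be scored. It is not DAMOKLE itself: the abstract does not give the scoring formula, so the function names, the "at least one mutated gene" rule, and the toy data are all assumptions, and the sketch neither searches the interaction network nor assesses statistical significance.

```python
from typing import Dict, Iterable, Set


def coverage(genes: Iterable[str], mutations: Dict[str, Set[str]]) -> float:
    """Fraction of samples with at least one mutated gene in `genes`.

    `mutations` maps a sample ID to the set of genes mutated in that sample.
    """
    gene_set = set(genes)
    if not mutations:
        return 0.0
    hit = sum(1 for mutated in mutations.values() if mutated & gene_set)
    return hit / len(mutations)


def differential_coverage(
    genes: Iterable[str],
    class_c: Dict[str, Set[str]],
    class_d: Dict[str, Set[str]],
) -> float:
    """Coverage in class C minus coverage in class D (assumed scoring rule)."""
    gene_set = set(genes)
    return coverage(gene_set, class_c) - coverage(gene_set, class_d)


# Hypothetical toy data: two sample sets and one candidate subnetwork.
responders = {"s1": {"TP53", "KRAS"}, "s2": {"KRAS"}, "s3": {"EGFR"}, "s4": set()}
non_responders = {"t1": {"TP53"}, "t2": set(), "t3": set(), "t4": {"PTEN"}}
subnetwork = {"KRAS", "EGFR"}

score = differential_coverage(subnetwork, responders, non_responders)
print(f"differential coverage: {score:.2f}")
```

A score near zero means the subnetwork is mutated about equally often in both sets. A score far from zero marks a candidate that would still need a significance test.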

2.
J Comput Biol ; 23(1): 30-41, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26645471

ABSTRACT

Advances in DNA sequencing technologies have enabled large cancer sequencing studies, collecting somatic mutation data from a large number of cancer patients. One of the main goals of these studies is the identification of all cancer genes--genes associated with cancer. Its achievement is complicated by the extensive mutational heterogeneity of cancer, due to the fact that important mutations in cancer target combinations of genes (i.e., pathways). Recently, the pattern of mutual exclusivity among mutations in a cancer pathway has been observed, and methods that find significant combinations of cancer genes by detecting mutual exclusivity have been proposed. A key question in the analysis of mutual exclusivity is the computation of the minimum number of samples required to reliably find a meaningful set of mutually exclusive mutations in the data, or conclude that there is no such set. In general, the problem of determining the sample complexity, or the number of samples required to identify significant combinations of features, of genomic problems is largely unexplored. In this work we propose a framework to analyze the sample complexity of problems that arise in the study of genomic datasets. Our framework is based on tools from combinatorial analysis and statistical learning theory that have been used for the analysis of machine learning and probably approximately correct (PAC) learning. We use our framework to analyze the problem of the identification of cancer pathways through mutual exclusivity analysis. We analytically derive matching upper and lower bounds on the sample complexity of the problem, showing that sample sizes much larger than currently available may be required to identify all the cancer genes in a pathway. We also provide two algorithms to find a cancer pathway from a large genomic dataset. On simulated and cancer data, we show that our algorithms can be used to identify cancer pathways from large genomic datasets.


Subjects
Genomics/methods; Neoplasms/genetics; Algorithms; Computer Simulation; Databases, Genetic; Gene Regulatory Networks; Humans; Machine Learning; Models, Genetic; Mutation; Neoplasms/metabolism; Signal Transduction
3.
PLoS Comput Biol ; 11(5): e1004071, 2015 May.
Article in English | MEDLINE | ID: mdl-25950620

ABSTRACT

A key challenge in genomics is to identify genetic variants that distinguish patients with different survival time following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations of the log-rank test rely on an asymptotic approximation that is not appropriate in many genomics applications. This is because (1) the two populations determined by a genetic variant may have very different sizes, and (2) the evaluation of many possible variants demands highly accurate computation of very small p-values. We demonstrate this problem for cancer genomics data where the standard log-rank test leads to many false positive associations between somatic mutations and survival time. We develop and analyze a novel algorithm, Exact Log-rank Test (ExaLT), that accurately computes the p-value of the log-rank statistic under an exact distribution that is appropriate for populations of any size. We demonstrate the advantages of ExaLT on data from published cancer genomics studies, finding significant differences from the reported p-values. We analyze somatic mutations in six cancer types from The Cancer Genome Atlas (TCGA), finding mutations with known association to survival as well as several novel associations. In contrast, standard implementations of the log-rank test report dozens to hundreds of likely false positive associations as more significant than these known associations.


Subjects
Genome-Wide Association Study/statistics & numerical data; Survival Analysis; Algorithms; Computational Biology; Databases, Genetic/statistics & numerical data; Female; Genetic Variation; Genomics/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; Male; Models, Statistical; Mutation; Neoplasms/genetics; Neoplasms/mortality
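
The contrast between an asymptotic and an exact p-value for the log-rank statistic can be illustrated on a very small cohort. The sketch below is not the ExaLT algorithm. It enumerates every possible assignment of patients to the smaller group, which is only feasible for toy inputs, and it assumes a permutational null distribution and an unnormalized observed-minus-expected statistic. All names and data are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Sequence, Tuple

# (follow-up time, True if the event was observed, False if censored)
Sample = Tuple[float, bool]


def logrank_score(samples: Sequence[Sample], group1: FrozenSet[int]) -> float:
    """Observed minus expected events in group 1, summed over event times."""
    event_times = sorted({t for t, observed in samples if observed})
    score = 0.0
    for t in event_times:
        at_risk = [i for i, (ti, _) in enumerate(samples) if ti >= t]
        events = [i for i in at_risk if samples[i][0] == t and samples[i][1]]
        n1 = sum(1 for i in at_risk if i in group1)
        d1 = sum(1 for i in events if i in group1)
        score += d1 - len(events) * n1 / len(at_risk)
    return score


def exact_permutation_pvalue(samples: Sequence[Sample], group1: Iterable[int]) -> float:
    """Two-sided p-value by exhaustive enumeration of group labels."""
    members = frozenset(group1)
    observed = abs(logrank_score(samples, members))
    total = 0
    extreme = 0
    for subset in combinations(range(len(samples)), len(members)):
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(logrank_score(samples, frozenset(subset))) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical cohort: 8 patients, all events observed, and a rare variant
# carried by the two patients with the shortest survival (unbalanced groups).
cohort = [(float(t), True) for t in range(1, 9)]
carriers = {0, 1}

p_value = exact_permutation_pvalue(cohort, carriers)
print(f"exact permutation p-value: {p_value:.4f}")
```

With only two carriers there are just 28 possible labelings, so the exact p-value cannot fall below 1/28. A normal approximation has no such floor, which is one way small, unbalanced groups can yield spuriously small asymptotic p-values.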
4.
Algorithms Mol Biol ; 7(1): 23, 2012 Sep 06.
Article in English | MEDLINE | ID: mdl-22954134

ABSTRACT

BACKGROUND: Cancer sequencing projects are now measuring somatic mutations in large numbers of cancer genomes. A key challenge in interpreting these data is to distinguish driver mutations, mutations important for cancer development, from passenger mutations that have accumulated in somatic cells but without functional consequences. A common approach to identify genes harboring driver mutations is a single gene test that identifies individual genes that are recurrently mutated in a significant number of cancer genomes. However, the power of this test is reduced by: (1) the necessity of estimating the background mutation rate (BMR) for each gene; (2) the mutational heterogeneity in most cancers meaning that groups of genes (e.g. pathways), rather than single genes, are the primary target of mutations. RESULTS: We investigate the problem of discovering driver pathways, groups of genes containing driver mutations, directly from cancer mutation data and without prior knowledge of pathways or other interactions between genes. We introduce two generative models of somatic mutations in cancer and study the algorithmic complexity of discovering driver pathways in both models. We show that a single gene test for driver genes is highly sensitive to the estimate of the BMR. In contrast, we show that an algorithmic approach that maximizes a straightforward measure of the mutational properties of a driver pathway successfully discovers these groups of genes without an estimate of the BMR. Moreover, this approach is also successful in the case when the observed frequencies of passenger and driver mutations are indistinguishable, a situation where single gene tests fail. CONCLUSIONS: Accurate estimation of the BMR is a challenging task. Thus, methods that do not require an estimate of the BMR, such as the ones we provide here, can give increased power for the discovery of driver genes.
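
The sensitivity of a single gene test to the BMR estimate can be seen in a small calculation. The abstract does not specify the test, so the sketch below assumes a common textbook formulation: each of the gene's positions mutates independently at the BMR, and the number of samples carrying a mutated gene is compared with a binomial tail. The function names, gene length, and counts are hypothetical.

```python
from math import comb


def prob_gene_mutated(bmr: float, gene_length: int) -> float:
    """Probability that a gene carries at least one passenger mutation in a sample."""
    return 1.0 - (1.0 - bmr) ** gene_length


def single_gene_pvalue(n_samples: int, n_mutated: int, bmr: float, gene_length: int) -> float:
    """P(X >= n_mutated) for X ~ Binomial(n_samples, p), with p derived from the BMR."""
    p = prob_gene_mutated(bmr, gene_length)
    return sum(
        comb(n_samples, k) * p**k * (1.0 - p) ** (n_samples - k)
        for k in range(n_mutated, n_samples + 1)
    )


# Hypothetical gene: 1,500 positions, mutated in 8 of 200 samples.
p_low_bmr = single_gene_pvalue(200, 8, bmr=1e-6, gene_length=1500)
p_high_bmr = single_gene_pvalue(200, 8, bmr=3e-6, gene_length=1500)

print(f"p-value with BMR 1e-6: {p_low_bmr:.3e}")
print(f"p-value with BMR 3e-6: {p_high_bmr:.3e}")
```

Under these assumptions, a threefold change in the BMR estimate shifts the p-value by several orders of magnitude. This illustrates why a method that does not need a BMR estimate is attractive.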

5.
Pac Symp Biocomput ; : 55-66, 2012.
Article in English | MEDLINE | ID: mdl-22174262

ABSTRACT

A major goal of cancer sequencing projects is to identify genetic alterations that determine clinical phenotypes, such as survival time or drug response. Somatic mutations in cancer are typically very diverse, and are found in different sets of genes in different patients. This mutational heterogeneity complicates the discovery of associations between individual mutations and a clinical phenotype. This mutational heterogeneity is explained in part by the fact that driver mutations, the somatic mutations that drive cancer development, target genes in cellular pathways, and only a subset of pathway genes is mutated in a given patient. Thus, pathway-based analysis of associations between mutations and phenotype is warranted. Here, we introduce an algorithm to find groups of genes, or pathways, whose mutational status is associated with a clinical phenotype without prior definition of the pathways. Rather, we find subnetworks of genes in a gene interaction network with the property that the mutational status of the genes in the subnetwork is significantly associated with a clinical phenotype. This new algorithm is built upon HotNet, an algorithm that finds groups of mutated genes using a heat diffusion model and a two-stage statistical test. We focus here on discovery of statistically significant correlations between mutated subnetworks and patient survival data. A similar approach can be used for correlations with other types of clinical data, through use of an appropriate statistical test. We apply our method to simulated data as well as to mutation and survival data from ovarian cancer samples from The Cancer Genome Atlas. In the TCGA data, we discover nine subnetworks containing genes whose mutational status is correlated with survival. Genes in four of these subnetworks overlap known pathways, including the focal adhesion and cell adhesion pathways, while other subnetworks are novel.


Subjects
Gene Regulatory Networks; Mutation; Neoplasms/genetics; Algorithms; Computational Biology; Databases, Genetic/statistics & numerical data; Female; Humans; Models, Genetic; Ovarian Neoplasms/genetics; Phenotype
6.
Genome Res ; 22(2): 375-85, 2012 Feb.
Article in English | MEDLINE | ID: mdl-21653252

ABSTRACT

Next-generation DNA sequencing technologies are enabling genome-wide measurements of somatic mutations in large numbers of cancer patients. A major challenge in the interpretation of these data is to distinguish functional "driver mutations" important for cancer development from random "passenger mutations." A common approach for identifying driver mutations is to find genes that are mutated at significant frequency in a large cohort of cancer genomes. This approach is confounded by the observation that driver mutations target multiple cellular signaling and regulatory pathways. Thus, each cancer patient may exhibit a different combination of mutations that are sufficient to perturb these pathways. This mutational heterogeneity presents a problem for predicting driver mutations solely from their frequency of occurrence. We introduce two combinatorial properties, coverage and exclusivity, that distinguish driver pathways, or groups of genes containing driver mutations, from groups of genes with passenger mutations. We derive two algorithms, called Dendrix, to find driver pathways de novo from somatic mutation data. We apply Dendrix to analyze somatic mutation data from 623 genes in 188 lung adenocarcinoma patients, 601 genes in 84 glioblastoma patients, and 238 known mutations in 1000 patients with various cancers. In all data sets, we find groups of genes that are mutated in large subsets of patients and whose mutations are approximately exclusive. Our Dendrix algorithms scale to whole-genome analysis of thousands of patients and thus will prove useful for larger data sets to come from The Cancer Genome Atlas (TCGA) and other large-scale cancer genome sequencing projects.


Subjects
Computational Biology/methods; Mutation; Neoplasms/genetics; Signal Transduction; Algorithms; Computer Simulation; Humans; Internet; Models, Genetic; Software
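
The two properties named in the abstract above, coverage and exclusivity, can be computed directly from a gene-by-patient mutation table. The sketch below is a minimal illustration and not the Dendrix algorithms. The combined `weight` is one simple way to reward coverage and penalize overlap and is an assumption here, since the abstract gives no formula. All names and data are hypothetical.

```python
from typing import Dict, Iterable, Set


def covered_patients(gene_set: Iterable[str], mutated_in: Dict[str, Set[str]]) -> Set[str]:
    """Patients with a mutation in at least one gene of the set."""
    covered: Set[str] = set()
    for gene in gene_set:
        covered |= mutated_in.get(gene, set())
    return covered


def coverage_overlap(gene_set: Iterable[str], mutated_in: Dict[str, Set[str]]) -> int:
    """Number of 'extra' mutations: zero when mutations are perfectly exclusive."""
    genes = list(gene_set)
    total = sum(len(mutated_in.get(gene, set())) for gene in genes)
    return total - len(covered_patients(genes, mutated_in))


def weight(gene_set: Iterable[str], mutated_in: Dict[str, Set[str]]) -> int:
    """Assumed score: number of covered patients minus the coverage overlap."""
    genes = list(gene_set)
    return len(covered_patients(genes, mutated_in)) - coverage_overlap(genes, mutated_in)


# Hypothetical mutation table: gene -> patients in which it is mutated.
mutated_in = {
    "A": {"p1", "p2"},
    "B": {"p3", "p4"},
    "C": {"p1", "p2", "p3"},
}

for candidate in (["A", "B"], ["A", "C"]):
    print(candidate, len(covered_patients(candidate, mutated_in)),
          coverage_overlap(candidate, mutated_in), weight(candidate, mutated_in))
```

In this toy table, genes A and B cover four patients with no overlap. A and C cover only three patients, and two of those carry mutations in both genes, so the first pair scores higher.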
7.
J Comput Biol ; 18(3): 507-22, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21385051

ABSTRACT

Recent genome sequencing studies have shown that the somatic mutations that drive cancer development are distributed across a large number of genes. This mutational heterogeneity complicates efforts to distinguish functional mutations from sporadic, passenger mutations. Since cancer mutations are hypothesized to target a relatively small number of cellular signaling and regulatory pathways, a common practice is to assess whether known pathways are enriched for mutated genes. We introduce an alternative approach that examines mutated genes in the context of a genome-scale gene interaction network. We present a computationally efficient strategy for de novo identification of subnetworks in an interaction network that are mutated in a statistically significant number of patients. This framework includes two major components. First, we use a diffusion process on the interaction network to define a local neighborhood of "influence" for each mutated gene in the network. Second, we derive a two-stage multiple hypothesis test to bound the false discovery rate (FDR) associated with the identified subnetworks. We test these algorithms on a large human protein-protein interaction network using somatic mutation data from glioblastoma and lung adenocarcinoma samples. We successfully recover pathways that are known to be important in these cancers and also identify additional pathways that have been implicated in other cancers but not previously reported as mutated in these samples. We anticipate that our approach will find increasing use as cancer genome studies increase in size and scope.


Subjects
Algorithms; Mutation; Neoplasms/genetics; Humans; Models, Genetic
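
The idea of a diffusion-based "influence" neighborhood can be sketched with a generic heat diffusion on a small graph. The abstract does not give the exact diffusion formulation or its parameters, so everything below is an assumption for illustration: unit heat is placed on one mutated gene and spread along edges with simple Euler steps, and the heat that reaches another gene is read as the influence of the source on it. The graph, the diffusion time, and the function name are hypothetical.

```python
from typing import Dict, List


def diffuse_heat(
    adjacency: Dict[str, List[str]],
    source: str,
    time: float = 0.5,
    steps: int = 1000,
) -> Dict[str, float]:
    """Spread unit heat from `source` over an undirected graph for `time` units.

    Uses explicit Euler steps; `steps` must be large enough that
    time / steps stays well below 1 / (maximum node degree).
    """
    heat = {node: 0.0 for node in adjacency}
    heat[source] = 1.0
    dt = time / steps
    for _ in range(steps):
        # Net flow into each node is the sum of differences with its neighbors.
        flow = {
            node: sum(heat[nb] - heat[node] for nb in neighbors)
            for node, neighbors in adjacency.items()
        }
        for node in adjacency:
            heat[node] += dt * flow[node]
    return heat


# Hypothetical interaction network: a simple path A - B - C - D.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

influence = diffuse_heat(network, source="A")
for gene, value in influence.items():
    print(f"{gene}: {value:.4f}")
```

Genes close to the source in the network receive more heat than distant ones, which gives each mutated gene a graded neighborhood rather than a hard cutoff at its direct interaction partners.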
8.
J Comput Biol ; 18(4): 535-45, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21417937

ABSTRACT

We develop, analyze, and experiment with a new tool, called MADMX, which extracts frequent motifs from biological sequences. We introduce the notion of density to single out the "significant" motifs. The density is a simple and flexible measure for bounding the number of don't cares in a motif, defined as the fraction of solid (i.e., different from don't care) characters in the motif. A maximal dense motif has density above a certain threshold, and any further specialization of a don't care symbol in it or any extension of its boundaries decreases its number of occurrences in the input sequence. By extracting only maximal dense motifs, MADMX reduces the output size and improves performance, while enhancing the quality of the discoveries. The efficiency of our approach relies on a newly defined combining operation, dubbed fusion, which allows for the construction of maximal dense motifs in a bottom-up fashion, while avoiding the generation of nonmaximal ones. We provide experimental evidence of the efficiency and the quality of the motifs returned by MADMX.


Subjects
Algorithms; Computational Biology/methods; Sequence Analysis/methods
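
The density measure defined in the abstract above, the fraction of solid characters in a motif, is easy to compute. In the sketch below the don't-care symbol is assumed to be a period and the threshold is treated as inclusive. Both choices and the function names are assumptions, and the maximality condition is not checked.

```python
DONT_CARE = "."  # assumed symbol for a don't-care position


def density(motif: str) -> float:
    """Fraction of solid (non-don't-care) characters in the motif."""
    if not motif:
        raise ValueError("motif must not be empty")
    solid = sum(1 for ch in motif if ch != DONT_CARE)
    return solid / len(motif)


def is_dense(motif: str, threshold: float) -> bool:
    """True if the motif's density reaches the threshold (inclusive by assumption)."""
    return density(motif) >= threshold


for motif in ("ACGT", "A.C.G", "A...T"):
    print(motif, density(motif), is_dense(motif, 0.5))
```

A motif made mostly of don't-care positions matches almost anywhere and says little, so the threshold filters out such motifs before any frequency counting.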
9.
Article in English | MEDLINE | ID: mdl-16447972

ABSTRACT

Perhaps the most common question that a microarray study can ask is, "Between two given biological conditions, which genes exhibit changed expression levels?" Existing methods for answering this question either generate a comparative measure based upon a static model, or take an indirect approach, first estimating absolute expression levels and then comparing the estimated levels to one another. We present a method for detecting changes in gene expression between two samples based on data from Affymetrix GeneChips. Using a library of over 200,000 known cases of differential expression, we create a learned comparative expression measure (LCEM) based on classification of probe-level data patterns as changed or unchanged. LCEM uses perfect match probe data only; mismatch probe values did not prove to be useful in this context. LCEM is particularly powerful in the case of small microarray studies, in which a regression-based method such as RMA cannot generalize, and in detecting small expression changes. At the levels of selectivity that are typical in microarray analysis, the LCEM shows a lower false discovery rate than either MAS5 or RMA trained from a single chip. When many chips are available to RMA, LCEM performs better on two out of the three data sets, and nearly as well on the third. Performance of the MAS5 log ratio statistic was notably bad on all datasets.


Subjects
Algorithms; Artificial Intelligence; Gene Expression Profiling/methods; Gene Expression/physiology; Oligonucleotide Array Sequence Analysis/methods; Pattern Recognition, Automated/methods; Data Interpretation, Statistical; Gene Expression Profiling/instrumentation; Oligonucleotide Array Sequence Analysis/instrumentation; Reproducibility of Results; Sensitivity and Specificity