Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 34
Filtrar
1.
Brief Bioinform ; 22(1): 88-95, 2021 01 18.
Artigo em Inglês | MEDLINE | ID: mdl-32577746

RESUMO

The study of microbial communities crucially relies on the comparison of metagenomic next-generation sequencing data sets, for which several methods have been designed in recent years. Here, we review three key challenges in the comparison of such data sets: species identification and quantification, the efficient computation of distances between metagenomic samples and the identification of metagenomic features associated with a phenotype such as disease status. We present current solutions for such challenges, considering both reference-based methods relying on a database of reference genomes and reference-free methods working directly on all sequencing reads from the samples.


Assuntos
Metagenômica/métodos , Microbiota/genética , Animais , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Sequenciamento de Nucleotídeos em Larga Escala/normas , Humanos , Metagenômica/normas
2.
Bioinformatics ; 38(Suppl_2): ii49-ii55, 2022 09 16.
Artigo em Inglês | MEDLINE | ID: mdl-36124798

RESUMO

MOTIVATION: Tumors are the result of a somatic evolutionary process leading to substantial intra-tumor heterogeneity. Single-cell and multi-region sequencing enable the detailed characterization of the clonal architecture of tumors and have highlighted its extensive diversity across tumors. While several computational methods have been developed to characterize the clonal composition and the evolutionary history of tumors, the identification of significantly conserved evolutionary trajectories across tumors is still a major challenge. RESULTS: We present a new algorithm, MAximal tumor treeS TRajectOries (MASTRO), to discover significantly conserved evolutionary trajectories in cancer. MASTRO discovers all conserved trajectories in a collection of phylogenetic trees describing the evolution of a cohort of tumors, allowing the discovery of conserved complex relations between alterations. MASTRO assesses the significance of the trajectories using a conditional statistical test that captures the coherence in the order in which alterations are observed in different tumors. We apply MASTRO to data from nonsmall-cell lung cancer bulk sequencing and to acute myeloid leukemia data from single-cell panel sequencing, and find significant evolutionary trajectories recapitulating and extending the results reported in the original studies. AVAILABILITY AND IMPLEMENTATION: MASTRO is available at https://github.com/VandinLab/MASTRO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Carcinoma Pulmonar de Células não Pequenas , Neoplasias Pulmonares , Evolução Clonal , Humanos , Filogenia , Software
3.
Bioinformatics ; 38(13): 3343-3350, 2022 06 27.
Artigo em Inglês | MEDLINE | ID: mdl-35583271

RESUMO

MOTIVATION: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. RESULTS: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful reads sampling scheme, which allows to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset. AVAILABILITY AND IMPLEMENTATION: SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Software , Análise de Sequência de DNA , Algoritmos , Genômica
4.
Genome Res ; 27(9): 1573-1588, 2017 09.
Artigo em Inglês | MEDLINE | ID: mdl-28768687

RESUMO

Prioritizing molecular alterations that act as drivers of cancer remains a crucial bottleneck in therapeutic development. Here we introduce HIT'nDRIVE, a computational method that integrates genomic and transcriptomic data to identify a set of patient-specific, sequence-altered genes, with sufficient collective influence over dysregulated transcripts. HIT'nDRIVE aims to solve the "random walk facility location" (RWFL) problem in a gene (or protein) interaction network, which differs from the standard facility location problem by its use of an alternative distance measure: "multihitting time," the expected length of the shortest random walk from any one of the set of sequence-altered genes to an expression-altered target gene. When applied to 2200 tumors from four major cancer types, HIT'nDRIVE revealed many potentially clinically actionable driver genes. We also demonstrated that it is possible to perform accurate phenotype prediction for tumor samples by only using HIT'nDRIVE-seeded driver gene modules from gene interaction networks. In addition, we identified a number of breast cancer subtype-specific driver modules that are associated with patients' survival outcome. Furthermore, HIT'nDRIVE, when applied to a large panel of pan-cancer cell lines, accurately predicted drug efficacy using the driver genes and their seeded gene modules. Overall, HIT'nDRIVE may help clinicians contextualize massive multiomics data in therapeutic decision making, enabling widespread implementation of precision oncology.


Assuntos
Neoplasias da Mama/genética , Variações do Número de Cópias de DNA/genética , Software , Transcriptoma/genética , Neoplasias da Mama/patologia , Biologia Computacional , Feminino , Genômica , Humanos , Mutação , Mapas de Interação de Proteínas/genética
5.
PLoS Comput Biol ; 15(5): e1006802, 2019 05.
Artigo em Inglês | MEDLINE | ID: mdl-31120875

RESUMO

Recent large cancer studies have measured somatic alterations in an unprecedented number of tumours. These large datasets allow the identification of cancer-related sets of genetic alterations by identifying relevant combinatorial patterns. Among such patterns, mutual exclusivity has been employed by several recent methods that have shown its effectiveness in characterizing gene sets associated to cancer. Mutual exclusivity arises because of the complementarity, at the functional level, of alterations in genes which are part of a group (e.g., a pathway) performing a given function. The availability of quantitative target profiles, from genetic perturbations or from clinical phenotypes, provides additional information that can be leveraged to improve the identification of cancer related gene sets by discovering groups with complementary functional associations with such targets. In this work we study the problem of finding groups of mutually exclusive alterations associated with a quantitative (functional) target. We propose a combinatorial formulation for the problem, and prove that the associated computational problem is computationally hard. We design two algorithms to solve the problem and implement them in our tool UNCOVER. We provide analytic evidence of the effectiveness of UNCOVER in finding high-quality solutions and show experimentally that UNCOVER finds sets of alterations significantly associated with functional targets in a variety of scenarios. In particular, we show that our algorithms find sets which are better than the ones obtained by the state-of-the-art method, even when sets are evaluated using the statistical score employed by the latter. In addition, our algorithms are much faster than the state-of-the-art, allowing the analysis of large datasets of thousands of target profiles from cancer cell lines. We show that on two such datasets, one from project Achilles and one from the Genomics of Drug Sensitivity in Cancer project, UNCOVER identifies several significant gene sets with complementary functional associations with targets. Software available at: https://github.com/VandinLab/UNCOVER.


Assuntos
Biologia Computacional/métodos , Neoplasias/genética , Análise de Sequência de DNA/métodos , Algoritmos , Regulação Neoplásica da Expressão Gênica/genética , Redes Reguladoras de Genes/genética , Teste de Complementação Genética/métodos , Genômica/métodos , Humanos , Mutação , Software
6.
BMC Bioinformatics ; 20(1): 17, 2019 Jan 09.
Artigo em Inglês | MEDLINE | ID: mdl-30626316

RESUMO

BACKGROUND: Translational and post-translational control mechanisms in the cell result in widely observable differences between measured gene transcription and protein abundances. Herein, protein complexes are among the most tightly controlled entities by selective degradation of their individual proteins. They furthermore act as control hubs that regulate highly important processes in the cell and exhibit a high functional diversity due to their ability to change their composition and their structure. Better understanding and prediction of these functional states demands methods for the characterization of complex composition, behavior, and abundance across multiple cell states. Mass spectrometry provides an unbiased approach to directly determine protein abundances across different cell populations and thus to profile a comprehensive abundance map of proteins. RESULTS: We provide a tool to investigate the behavior of protein subunits in known complexes by comparing their abundance profiles across up to 140 cell types available in ProteomicsDB. Thorough assessment of different randomization methods and statistical scoring algorithms allows determining the significance of concurrent profiles within a complex, therefore providing insights into the conservation of their composition across human cell types as well as the identification of intrinsic structures in complex behavior to determine which proteins orchestrate complex function. This analysis can be extended to investigate common profiles within arbitrary protein groups. CoExpresso can be accessed through http://computproteomics.bmb.sdu.dk/Apps/CoExpresso . CONCLUSIONS: With the CoExpresso web service, we offer a potent scoring scheme to assess proteins for their co-regulation and thereby offer insight into their potential for forming functional groups like protein complexes.


Assuntos
Proteínas/metabolismo , Proteômica/métodos , Algoritmos , Humanos
7.
Nature ; 502(7471): 333-339, 2013 Oct 17.
Artigo em Inglês | MEDLINE | ID: mdl-24132290

RESUMO

The Cancer Genome Atlas (TCGA) has used the latest sequencing and analysis methods to identify somatic variants across thousands of tumours. Here we present data and analytical results for point mutations and small insertions/deletions from 3,281 tumours across 12 tumour types as part of the TCGA Pan-Cancer effort. We illustrate the distributions of mutation frequencies, types and contexts across tumour types, and establish their links to tissues of origin, environmental/carcinogen influences, and DNA repair defects. Using the integrated data sets, we identified 127 significantly mutated genes from well-known (for example, mitogen-activated protein kinase, phosphatidylinositol-3-OH kinase, Wnt/ß-catenin and receptor tyrosine kinase signalling pathways, and cell cycle control) and emerging (for example, histone, histone modification, splicing, metabolism and proteolysis) cellular processes in cancer. The average number of mutations in these significantly mutated genes varies across tumour types; most tumours have two to six, indicating that the number of driver mutations required during oncogenesis is relatively small. Mutations in transcriptional factors/regulators show tissue specificity, whereas histone modifiers are often mutated across several cancer types. Clinical association analysis identifies genes having a significant effect on survival, and investigations of mutations with respect to clonal/subclonal architecture delineate their temporal orders during tumorigenesis. Taken together, these results lay the groundwork for developing new diagnostics and individualizing cancer treatment.


Assuntos
Carcinogênese/genética , Mutação/genética , Neoplasias/classificação , Neoplasias/genética , Ciclo Celular/genética , Células Clonais/metabolismo , Células Clonais/patologia , Estudos de Coortes , Reparo do DNA/genética , Humanos , Mutação INDEL/genética , Proteínas Quinases Ativadas por Mitógeno/genética , Modelos Genéticos , Neoplasias/metabolismo , Neoplasias/patologia , Oncogenes/genética , Fosfatidilinositol 3-Quinases/genética , Mutação Puntual/genética , Receptores Proteína Tirosina Quinases/metabolismo , Análise de Sobrevida , Fatores de Tempo
8.
Nucleic Acids Res ; 45(16): e151, 2017 Sep 19.
Artigo em Inglês | MEDLINE | ID: mdl-28934488

RESUMO

Gene expression profiles have been extensively discussed as an aid to guide the therapy by predicting disease outcome for the patients suffering from complex diseases, such as cancer. However, prediction models built upon single-gene (SG) features show poor stability and performance on independent datasets. Attempts to mitigate these drawbacks have led to the development of network-based approaches that integrate pathway information to produce meta-gene (MG) features. Also, MG approaches have only dealt with the two-class problem of good versus poor outcome prediction. Stratifying patients based on their molecular subtypes can provide a detailed view of the disease and lead to more personalized therapies. We propose and discuss a novel MG approach based on de novo pathways, which for the first time have been used as features in a multi-class setting to predict cancer subtypes. Comprehensive evaluation in a large cohort of breast cancer samples from The Cancer Genome Atlas (TCGA) revealed that MGs are considerably more stable than SG models, while also providing valuable insight into the cancer hallmarks that drive them. In addition, when tested on an independent benchmark non-TCGA dataset, MG features consistently outperformed SG models. We provide an easy-to-use web service at http://pathclass.compbio.sdu.dk where users can upload their own gene expression datasets from breast cancer studies and obtain the subtype predictions from all the classifiers.


Assuntos
Biomarcadores Tumorais/genética , Neoplasias da Mama/genética , Perfilação da Expressão Gênica/métodos , Biomarcadores Tumorais/metabolismo , Neoplasias da Mama/classificação , Neoplasias da Mama/metabolismo , Metilação de DNA , Feminino , Genes Neoplásicos , Humanos
9.
Bioinformatics ; 33(4): 549-551, 2017 02 15.
Artigo em Inglês | MEDLINE | ID: mdl-27794558

RESUMO

Motivation: Epigenome-wide association studies (EWAS) generate big epidemiological datasets. They aim for detecting differentially methylated DNA regions that are likely to influence transcriptional gene activity and, thus, the regulation of metabolic processes. The by far most widely used technology is the Illumina Methylation BeadChip, which measures the methylation levels of 450 (850) thousand cytosines, in the CpG dinucleotide context in a set of patients compared to a control group. Many bioinformatics tools exist for raw data analysis. However, most of them require some knowledge in the programming language R, have no user interface, and do not offer all necessary steps to guide users from raw data all the way down to statistically significant differentially methylated regions (DMRs) and the associated genes. Results: Here, we present DiMmeR (Discovery of Multiple Differentially Methylated Regions), the first free standalone software that interactively guides with a user-friendly graphical user interface (GUI) scientists the whole way through EWAS data analysis. It offers parallelized statistical methods for efficiently identifying DMRs in both Illumina 450K and 850K EPIC chip data. DiMmeR computes empirical P -values through randomization tests, even for big datasets of hundreds of patients and thousands of permutations within a few minutes on a standard desktop PC. It is independent of any third-party libraries, computes regression coefficients, P -values and empirical P -values, and it corrects for multiple testing. Availability and Implementation: DiMmeR is publicly available at http://dimmer.compbio.sdu.dk . Contact: diogoma@bmb.sdu.dk. Supplementary information: Supplementary data are available at Bioinformatics online.


Assuntos
Ilhas de CpG , Metilação de DNA , Epigenômica/métodos , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Software , Humanos
10.
Nature ; 487(7406): 239-43, 2012 Jul 12.
Artigo em Inglês | MEDLINE | ID: mdl-22722839

RESUMO

Characterization of the prostate cancer transcriptome and genome has identified chromosomal rearrangements and copy number gains and losses, including ETS gene family fusions, PTEN loss and androgen receptor (AR) amplification, which drive prostate cancer development and progression to lethal, metastatic castration-resistant prostate cancer (CRPC). However, less is known about the role of mutations. Here we sequenced the exomes of 50 lethal, heavily pre-treated metastatic CRPCs obtained at rapid autopsy (including three different foci from the same patient) and 11 treatment-naive, high-grade localized prostate cancers. We identified low overall mutation rates even in heavily treated CRPCs (2.00 per megabase) and confirmed the monoclonal origin of lethal CRPC. Integrating exome copy number analysis identified disruptions of CHD1 that define a subtype of ETS gene family fusion-negative prostate cancer. Similarly, we demonstrate that ETS2, which is deleted in approximately one-third of CRPCs (commonly through TMPRSS2:ERG fusions), is also deregulated through mutation. Furthermore, we identified recurrent mutations in multiple chromatin- and histone-modifying genes, including MLL2 (mutated in 8.6% of prostate cancers), and demonstrate interaction of the MLL complex with the AR, which is required for AR-mediated signalling. We also identified novel recurrent mutations in the AR collaborating factor FOXA1, which is mutated in 5 of 147 (3.4%) prostate cancers (both untreated localized prostate cancer and CRPC), and showed that mutated FOXA1 represses androgen signalling and increases tumour growth. Proteins that physically interact with the AR, such as the ERG gene fusion product, FOXA1, MLL2, UTX (also known as KDM6A) and ASXL1 were found to be mutated in CRPC. In summary, we describe the mutational landscape of a heavily treated metastatic cancer, identify novel mechanisms of AR signalling deregulated in prostate cancer, and prioritize candidates for future study.


Assuntos
Neoplasias da Próstata/genética , Proliferação de Células , Células Cultivadas , Fator 3-alfa Nuclear de Hepatócito/genética , Humanos , Masculino , Dados de Sequência Molecular , Mutação , Orquiectomia , Neoplasias da Próstata/patologia , Receptores Androgênicos/metabolismo , Alinhamento de Sequência , Transdução de Sinais
11.
Ann Hum Genet ; 81(1): 20-26, 2017 Jan.
Artigo em Inglês | MEDLINE | ID: mdl-28009044

RESUMO

Genome-wide association studies with moderate sample sizes are underpowered, especially when testing SNP alleles with low allele counts, a situation that may lead to high frequency of false-positive results and lack of replication in independent studies. Related individuals, such as twin pairs concordant for a disease, should confer increased power in genetic association analysis because of their genetic relatedness. We conducted a computer simulation study to explore the power advantage of the disease-concordant twin design, which uses singletons from disease-concordant twin pairs as cases and ordinary healthy samples as controls. We examined the power gain of the twin-based design for various scenarios (i.e., cases from monozygotic and dizygotic twin pairs concordant for a disease) and compared the power with the ordinary case-control design with cases collected from the unrelated patient population. Simulation was done by assigning various allele frequencies and allelic relative risks for different mode of genetic inheritance. In general, for achieving a power estimate of 80%, the sample sizes needed for dizygotic and monozygotic twin cases were one half and one fourth of the sample size of an ordinary case-control design, with variations depending on genetic mode. Importantly, the enriched power for dizygotic twins also applies to disease-concordant sibling pairs, which largely extends the application of the concordant twin design. Overall, our simulation revealed a high value of disease-concordant twins in genetic association studies and encourages the use of genetically related individuals for highly efficiently identifying both common and rare genetic variants underlying human complex diseases without increasing laboratory cost.


Assuntos
Doenças em Gêmeos/genética , Estudo de Associação Genômica Ampla , Simulação por Computador , Frequência do Gene , Predisposição Genética para Doença , Humanos , Modelos Genéticos , Risco , Gêmeos Dizigóticos/genética , Gêmeos Monozigóticos/genética
12.
Ann Hum Genet ; 80(2): 81-7, 2016 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-26831219

RESUMO

Poor nutrition during critical growth phases may alter the structural and physiologic development of vital organs thus "programming" the susceptibility to adult-onset diseases and disease-related health conditions. Epigenome-wide association studies have been performed in birth-weight discordant twin pairs to find evidence for such "programming" effects, but no significant results emerged. We further investigated this issue using a new computational approach: Instead of probing single genomic sites for significant alterations in epigenetic marks, we scan for differentially methylated genomic regions. Whole genome DNA methylation levels were measured in whole blood from 150 pairs of adult identical twins discordant for birth-weight. Intrapair differential DNA methylation was associated with qualitative (large or small) and quantitative (percentage) birth-weight discordance at each genomic site using regression models adjusting for age and sex. Based on the regression results, genomic regions with consistent alteration patterns of DNA methylation were located and tested for significant robustness using computational permutation tests. This yielded an interesting genomic region on chromosome 1, which is significantly differentially methylated for quantitative birth-weight discordance. The region covers two genes (TYW3 and CRYZ) both reportedly associated with metabolism. We conclude that prenatal conditions for birth-weight discordance may result in persistent epigenetic modifications potentially affecting even adult health.


Assuntos
Peso ao Nascer , Metilação de DNA , Epigênese Genética , Adulto , Idoso , Feminino , Genoma Humano , Genômica , Humanos , Modelos Lineares , Masculino , Pessoa de Meia-Idade , Gêmeos Monozigóticos
13.
PLoS Comput Biol ; 11(5): e1004071, 2015 May.
Artigo em Inglês | MEDLINE | ID: mdl-25950620

RESUMO

A key challenge in genomics is to identify genetic variants that distinguish patients with different survival time following diagnosis or treatment. While the log-rank test is widely used for this purpose, nearly all implementations of the log-rank test rely on an asymptotic approximation that is not appropriate in many genomics applications. This is because: the two populations determined by a genetic variant may have very different sizes; and the evaluation of many possible variants demands highly accurate computation of very small p-values. We demonstrate this problem for cancer genomics data where the standard log-rank test leads to many false positive associations between somatic mutations and survival time. We develop and analyze a novel algorithm, Exact Log-rank Test (ExaLT), that accurately computes the p-value of the log-rank statistic under an exact distribution that is appropriate for any size populations. We demonstrate the advantages of ExaLT on data from published cancer genomics studies, finding significant differences from the reported p-values. We analyze somatic mutations in six cancer types from The Cancer Genome Atlas (TCGA), finding mutations with known association to survival as well as several novel associations. In contrast, standard implementations of the log-rank test report dozens-hundreds of likely false positive associations as more significant than these known associations.


Assuntos
Estudo de Associação Genômica Ampla/estatística & dados numéricos , Análise de Sobrevida , Algoritmos , Biologia Computacional , Bases de Dados Genéticas/estatística & dados numéricos , Feminino , Variação Genética , Genômica/estatística & dados numéricos , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Humanos , Masculino , Modelos Estatísticos , Mutação , Neoplasias/genética , Neoplasias/mortalidade
14.
Genome Res ; 22(2): 375-85, 2012 Feb.
Artigo em Inglês | MEDLINE | ID: mdl-21653252

RESUMO

Next-generation DNA sequencing technologies are enabling genome-wide measurements of somatic mutations in large numbers of cancer patients. A major challenge in the interpretation of these data is to distinguish functional "driver mutations" important for cancer development from random "passenger mutations." A common approach for identifying driver mutations is to find genes that are mutated at significant frequency in a large cohort of cancer genomes. This approach is confounded by the observation that driver mutations target multiple cellular signaling and regulatory pathways. Thus, each cancer patient may exhibit a different combination of mutations that are sufficient to perturb these pathways. This mutational heterogeneity presents a problem for predicting driver mutations solely from their frequency of occurrence. We introduce two combinatorial properties, coverage and exclusivity, that distinguish driver pathways, or groups of genes containing driver mutations, from groups of genes with passenger mutations. We derive two algorithms, called Dendrix, to find driver pathways de novo from somatic mutation data. We apply Dendrix to analyze somatic mutation data from 623 genes in 188 lung adenocarcinoma patients, 601 genes in 84 glioblastoma patients, and 238 known mutations in 1000 patients with various cancers. In all data sets, we find groups of genes that are mutated in large subsets of patients and whose mutations are approximately exclusive. Our Dendrix algorithms scale to whole-genome analysis of thousands of patients and thus will prove useful for larger data sets to come from The Cancer Genome Atlas (TCGA) and other large-scale cancer genome sequencing projects.


Assuntos
Biologia Computacional/métodos , Mutação , Neoplasias/genética , Transdução de Sinais , Algoritmos , Simulação por Computador , Humanos , Internet , Modelos Genéticos , Software
15.
J Comput Biol ; 27(4): 534-549, 2020 04.
Artigo em Inglês | MEDLINE | ID: mdl-31891535

RESUMO

Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. Although several methods have been designed for the exact or approximate solution of this problem, they all require to process the entire data set, which can be extremely expensive for high-throughput sequencing data sets. Although in some applications it is crucial to estimate all k-mers and their abundances, in other situations it may be sufficient to report only frequent k-mers, which appear with relatively high frequency in a data set. This is the case, for example, in the computation of k-mers' abundance-based distances among data sets of reads, commonly used in metagenomic analyses. In this study, we develop, analyze, and test a sampling-based approach, called Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to approximate the frequent k-mers and their frequencies in a high-throughput sequencing data set while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme and we show how the characterization of the Vapnik-Chervonenkis dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA allows to rigorously approximate frequent k-mers by processing only a fraction of a data set and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing data sets. Overall, SAKEIMA is an efficient and rigorous tool to estimate k-mers' abundances providing significant speedups in the analysis of large sequencing data sets.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Metagenômica/métodos , Análise de Sequência de DNA/métodos , Software , Algoritmos , Biologia Computacional , Metagenoma/genética , Tamanho da Amostra
16.
iScience ; 23(10): 101619, 2020 Oct 23.
Artigo em Inglês | MEDLINE | ID: mdl-33089107

RESUMO

Phenotypic heterogeneity in cancer is often caused by different patterns of genetic alterations. Understanding such phenotype-genotype relationships is fundamental for the advance of personalized medicine. We develop a computational method, named NETPHIX (NETwork-to-PHenotype association with eXclusivity) to identify subnetworks of genes whose genetic alterations are associated with drug response or other continuous cancer phenotypes. Leveraging interaction information among genes and properties of cancer mutations such as mutual exclusivity, we formulate the problem as an integer linear program and solve it optimally to obtain a subnetwork of associated genes. Applied to a large-scale drug screening dataset, NETPHIX uncovered gene modules significantly associated with drug responses. Utilizing interaction information, NETPHIX modules are functionally coherent and can thus provide important insights into drug action. In addition, we show that modules identified by NETPHIX together with their association patterns can be leveraged to suggest drug combinations.

17.
Algorithms Mol Biol ; 14: 10, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-30976291

RESUMO

PROBLEM: We study the problem of identifying differentially mutated subnetworks of a large gene-gene interaction network, that is, subnetworks that display a significant difference in mutation frequency in two sets of cancer samples. We formally define the associated computational problem and show that the problem is NP-hard. ALGORITHM: We propose a novel and efficient algorithm, called DAMOKLE, to identify differentially mutated subnetworks given genome-wide mutation data for two sets of cancer samples. We prove that DAMOKLE identifies subnetworks with statistically significant difference in mutation frequency when the data comes from a reasonable generative model, provided enough samples are available. EXPERIMENTAL RESULTS: We test DAMOKLE on simulated and real data, showing that DAMOKLE does indeed find subnetworks with significant differences in mutation frequency and that it provides novel insights into the molecular mechanisms of the disease not revealed by standard methods.

18.
Front Genet ; 10: 265, 2019.
Artigo em Inglês | MEDLINE | ID: mdl-31024613

RESUMO

Next-generation sequencing technologies allow to measure somatic mutations in a large number of patients from the same cancer type: one of the main goals in their analysis is the identification of mutations associated with clinical parameters. The identification of such relationships is hindered by extensive genetic heterogeneity in tumors, with different genes mutated in different patients, due, in part, to the fact that genes and mutations act in the context of pathways: it is therefore crucial to study mutations in the context of interactions among genes. In this work we study the problem of identifying subnetworks of a large gene-gene interaction network with mutations associated with survival time. We formally define the associated computational problem by using a score for subnetworks based on the log-rank statistical test to compare the survival of two given populations. We propose a novel approach, based on a new algorithm, called Network of Mutations Associated with Survival (NoMAS) to find subnetworks of a large interaction network whose mutations are associated with survival time. NoMAS is based on the color-coding technique, that has been previously employed in other applications to find the highest scoring subnetwork with high probability when the subnetwork score is additive. In our case the score is not additive, so our algorithm cannot identify the optimal solution with the same guarantees associated to additive scores. Nonetheless, we prove that, under a reasonable model for mutations in cancer, NoMAS identifies the optimal solution with high probability. We also design a holdout approach to identify subnetworks significantly associated with survival time. We test NoMAS on simulated and cancer data, comparing it to approaches based on single gene tests and to various greedy approaches. We show that our method does indeed find the optimal solution and performs better than the other approaches. Moreover, on three cancer datasets our method identifies subnetworks with significant association to survival when none of the genes has significant association with survival when considered in isolation.

19.
Eur J Hum Genet ; 27(4): 631-636, 2019 04.
Artigo em Inglês | MEDLINE | ID: mdl-30659261

RESUMO

Genetic interaction is a crucial issue in the understanding of functional pathways underlying complex diseases. However, detecting such interaction effects is challenging in terms of both methodology and statistical power. We address this issue by introducing a disease-concordant twin-case-only design, which applies to both monozygotic and dizygotic twins. To investigate the power, we conducted a computer simulation study by setting a series of parameter schemes with different minor allele frequencies and relative risks. Results from the simulation study reveals that the disease-concordant twin-case-only design largely reduces sample size required for sufficient power compared to the ordinary case-only design for detecting gene-gene interaction using unrelated individuals. Sample sizes for dizygotic and monozygotic twins were roughly 1/2 and 1/4 of sample sizes in the ordinary case-only design. Since dizygotic twins are genetically similar as siblings, the enriched power for dizygotic twins also applies to affected siblings, which could help to largely extend the application of the powerful twin-case-only design. In summary, our simulation reveals high value of disease-concordant twins and siblings in efficiently detecting gene-by-gene interactions.


Assuntos
Doenças em Gêmeos/genética , Doenças Genéticas Inatas/genética , Predisposição Genética para Doença , Estudo de Associação Genômica Ampla/estatística & dados numéricos , Simulação por Computador/estatística & dados numéricos , Doenças em Gêmeos/patologia , Feminino , Frequência do Gene/genética , Doenças Genéticas Inatas/patologia , Humanos , Masculino , Risco , Tamanho da Amostra , Irmãos , Gêmeos Dizigóticos/genética , Gêmeos Monozigóticos/genética
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA