Results 1 - 20 of 42
1.
bioRxiv ; 2023 Apr 07.
Article in English | MEDLINE | ID: mdl-37383947

ABSTRACT

Accurate identification of cell classes across the tissues of living organisms is central in the analysis of growing atlases of single-cell RNA sequencing (scRNA-seq) data across biomedicine. Such analyses are often based on the existence of highly discriminating "marker genes" for specific cell classes which enables a deeper functional understanding of these classes as well as their identification in new, related datasets. Currently, marker genes are defined by methods that serially assess the level of differential expression (DE) of individual genes across landscapes of diverse cells. This serial approach has been extremely useful, but is limited because it ignores possible redundancy or complementarity across genes, that can only be captured by analyzing several genes at the same time. We wish to identify discriminating panels of genes. To efficiently explore the vast space of possible marker panels, leverage the large number of cells often sequenced, and overcome zero-inflation in scRNA-seq data, we propose viewing panel selection as a variation of the "minimal set-covering problem" in combinatorial optimization which can be solved with integer programming. In this formulation, the covering elements are genes, and the objects to be covered are cells of a particular class, where a cell is covered by a gene if that gene is expressed in that cell. Our method, CellCover, identifies a panel of marker genes in scRNA-seq data that covers one class of cells within a population. We apply this method to generate covering marker gene panels which characterize cells of the developing mouse neocortex as postmitotic neurons are generated from neural progenitor cells (NPCs). 
We show that CellCover captures cell class-specific signals distinct from those defined by DE methods and that CellCover's compact gene panels can be expanded to explore cell-type-specific function. Transfer learning experiments exploring these covering panels across in vivo mouse, primate, and human scRNA-seq datasets demonstrate that CellCover identifies markers of conserved cell classes in neurogenesis, as well as markers of temporal progression in the molecular identity of these cell types across development of the mammalian neocortex. The gene covering panels we identify across cell types and developmental time can be freely explored in visualizations across all the public data we use in this report with NeMo Analytics [1] through https://nemoanalytics.org/p?l=CellCover . The code for CellCover is written in R using the Gurobi R interface and is available at [2].
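
The covering formulation described in this abstract can be illustrated with a small sketch. It is not CellCover itself: it solves the plain minimal set-covering problem by exhaustive search rather than the integer-programming variation the authors describe, and the gene names, cell names, and the function `minimal_cover` are hypothetical.

```python
from itertools import combinations

def minimal_cover(expressed, cells):
    """Smallest set of genes such that every cell expresses at least one.

    expressed maps gene -> set of cells in which the gene is detected.
    Exhaustive search: suitable for toy inputs only, not integer programming.
    """
    genes = sorted(expressed)
    for size in range(1, len(genes) + 1):
        for panel in combinations(genes, size):
            covered = set().union(*(expressed[g] for g in panel))
            if cells <= covered:
                return list(panel)
    return None  # no panel covers every cell

# Hypothetical toy data: gene -> cells of the target class that express it.
expressed = {
    "geneA": {"c1", "c2"},
    "geneB": {"c2", "c3"},
    "geneC": {"c3", "c4"},
    "geneD": {"c1", "c4"},
}
cells = {"c1", "c2", "c3", "c4"}
panel = minimal_cover(expressed, cells)
```

No single gene covers all four cells here, so the smallest panel has two genes. On real data the search space is far too large for enumeration, which is presumably why the authors turn to an integer-programming solver.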

2.
iScience ; 26(3): 106108, 2023 Mar 17.
Article in English | MEDLINE | ID: mdl-36852282

ABSTRACT

Many gene signatures have been developed by applying machine learning (ML) on omics profiles; however, their clinical utility is often hindered by limited interpretability and unstable performance. Here, we show the importance of embedding prior biological knowledge in the decision rules yielded by ML approaches to build robust classifiers. We tested this by applying different ML algorithms on gene expression data to predict three difficult cancer phenotypes: bladder cancer progression to muscle-invasive disease, response to neoadjuvant chemotherapy in triple-negative breast cancer, and prostate cancer metastatic progression. We developed two sets of classifiers: mechanistic, by restricting the training to features capturing specific biological mechanisms; and agnostic, in which the training did not use any a priori biological information. Mechanistic models had a similar or better testing performance than their agnostic counterparts, with enhanced interpretability. Our findings support the use of biological constraints to develop robust gene signatures with high translational potential.

3.
Mol Psychiatry ; 28(5): 2018-2029, 2023 May.
Article in English | MEDLINE | ID: mdl-36732587

ABSTRACT

Seven Tesla magnetic resonance spectroscopy (7T MRS) offers a precise measurement of metabolic levels in the human brain via a non-invasive approach. Studying longitudinal changes in brain metabolites could help evaluate the characteristics of disease over time. This approach may also shed light on how the age of study participants and duration of illness may influence these metabolites. This study used 7T MRS to investigate longitudinal patterns of brain metabolites in young adulthood in both healthy controls and patients. A four-year longitudinal cohort with 38 patients with first episode psychosis (onset within 2 years) and 48 healthy controls was used to examine 10 brain metabolites in 5 brain regions associated with the pathophysiology of psychosis in a comprehensive manner. Both patients and controls were found to have significant longitudinal reductions in glutamate in the anterior cingulate cortex (ACC). Only patients were found to have a significant decrease over time in γ-aminobutyric acid, N-acetyl aspartate, myo-inositol, total choline, and total creatine in the ACC. Together we highlight the ACC with dynamic changes in several metabolites in early-stage psychosis, in contrast to the other 4 brain regions that also are known to play roles in psychosis. Meanwhile, glutathione was uniquely found to have a near zero annual percentage change in both patients and controls in all 5 brain regions during a four-year follow-up in young adulthood. Given that a reduction of the glutathione in the ACC has been reported as a feature of treatment-refractory psychosis, this observation further supports the potential of glutathione as a biomarker for this subset of patients with psychosis.


Subject(s)
Glutamine; Psychotic Disorders; Humans; Young Adult; Adult; Glutamine/metabolism; Psychotic Disorders/metabolism; Brain/metabolism; Glutamic Acid/metabolism; Gyrus Cinguli/metabolism; Aspartic Acid/metabolism; Glutathione/metabolism

4.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7430-7443, 2023 Jun.
Article in English | MEDLINE | ID: mdl-36441893

ABSTRACT

There is a growing concern about typically opaque decision-making with high-performance machine learning algorithms. Providing an explanation of the reasoning process in domain-specific terms can be crucial for adoption in risk-sensitive domains such as healthcare. We argue that machine learning algorithms should be interpretable by design and that the language in which these interpretations are expressed should be domain- and task-dependent. Consequently, we base our model's prediction on a family of user-defined and task-specific binary functions of the data, each having a clear interpretation to the end-user. We then minimize the expected number of queries needed for accurate prediction on any given input. As the solution is generally intractable, following prior work, we choose the queries sequentially based on information gain. However, in contrast to previous work, we need not assume the queries are conditionally independent. Instead, we leverage a stochastic generative model (VAE) and an MCMC algorithm (Unadjusted Langevin) to select the most informative query about the input based on previous query-answers. This enables the online determination of a query chain of whatever depth is required to resolve prediction ambiguities. Finally, experiments on vision and NLP tasks demonstrate the efficacy of our approach and its superiority over post-hoc explanations.
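
The sequential, information-gain-based choice of queries can be sketched on a toy problem. This is an illustration under strong assumptions, not the authors' method: it estimates information gain by counting over a small explicit sample set instead of using a VAE and Unadjusted Langevin sampling, and the names `information_gain`, `next_query`, and the two toy queries are hypothetical.

```python
from collections import defaultdict
from math import log2

def entropy(weights):
    """Shannon entropy in bits of a list of nonnegative counts."""
    total = sum(weights)
    return -sum(w / total * log2(w / total) for w in weights if w > 0)

def information_gain(samples, query):
    """I(Y; q(X)) = H(Y) - H(Y | q(X)), estimated from (x, y) samples."""
    label_counts = defaultdict(int)
    by_answer = defaultdict(lambda: defaultdict(int))
    for x, y in samples:
        label_counts[y] += 1
        by_answer[query(x)][y] += 1
    n = len(samples)
    h_cond = sum(
        sum(c.values()) / n * entropy(list(c.values()))
        for c in by_answer.values()
    )
    return entropy(list(label_counts.values())) - h_cond

def next_query(samples, queries, history):
    """Pick the most informative unasked query given (name, answer) history."""
    consistent = [
        (x, y) for x, y in samples
        if all(queries[name](x) == ans for name, ans in history)
    ]
    asked = {name for name, _ in history}
    return max(
        (name for name in queries if name not in asked),
        key=lambda name: information_gain(consistent, queries[name]),
    )

# Hypothetical toy task: the label equals the first bit of the input.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
queries = {"first_bit": lambda x: x[0], "second_bit": lambda x: x[1]}
best = next_query(samples, queries, history=[])
```

Conditioning on the history by filtering samples is only feasible for tiny discrete problems. The sketch assumes at least one sample is consistent with the history and at least one query remains unasked.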

5.
PLoS Comput Biol ; 17(6): e1008944, 2021 06.
Article in English | MEDLINE | ID: mdl-34115745

ABSTRACT

Cancer cells display massive dysregulation of key regulatory pathways due to now well-catalogued mutations and other DNA-related aberrations. Moreover, enormous heterogeneity has been commonly observed in the identity, frequency and location of these aberrations across individuals with the same cancer type or subtype, and this variation naturally propagates to the transcriptome, resulting in myriad types of dysregulated gene expression programs. Many have argued that a more integrative and quantitative analysis of heterogeneity of DNA and RNA molecular profiles may be necessary for designing more systematic explorations of alternative therapies and improving predictive accuracy. We introduce a representation of multi-omics profiles which is sufficiently rich to account for observed heterogeneity and support the construction of quantitative, integrated metrics of variation. Starting from the network of interactions existing in Reactome, we build a library of "paired DNA-RNA aberrations" that represent prototypical and recurrent patterns of dysregulation in cancer; each two-gene "Source-Target Pair" (STP) consists of a "source" regulatory gene and a "target" gene whose expression is plausibly "controlled" by the source gene. The STP is then "aberrant" in a joint DNA-RNA profile if the source gene is DNA-aberrant (e.g., mutated, deleted, or duplicated), and the downstream target gene is "RNA-aberrant", meaning its expression level is outside the normal, baseline range. With M STPs, each sample profile has exactly one of the 2^M possible configurations. We concentrate on subsets of STPs, and the corresponding reduced configurations, by selecting tissue-dependent minimal coverings, defined as the smallest family of STPs with the property that every sample in the considered population displays at least one aberrant STP within that family. These minimal coverings can be computed with integer programming. 
Given such a covering, a natural measure of cross-sample diversity is the extent to which the particular aberrant STPs composing a covering vary from sample to sample; this variability is captured by the entropy of the distribution over configurations. We apply this program to data from TCGA for six distinct tumor types (breast, prostate, lung, colon, liver, and kidney cancer). This enables an efficient simplification of the complex landscape observed in cancer populations, resulting in the identification of novel signatures of molecular alterations which are not detected with frequency-based criteria. Estimates of cancer heterogeneity across tumor phenotypes reveal a stable pattern: entropy increases with disease severity. This framework is then well-suited to accommodate the expanding complexity of cancer genomes and epigenomes emerging from large consortia projects.
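
The entropy measure mentioned in this abstract can be made concrete with a short sketch. It assumes each sample is summarized by the set of STPs that are aberrant in it and computes the Shannon entropy of the empirical distribution over binary configurations; the STP names, sample names, and the function `configuration_entropy` are hypothetical, and the paper's own estimator may differ.

```python
from collections import Counter
from math import log2

def configuration_entropy(aberrant, covering):
    """Shannon entropy (bits) of per-sample configurations over a covering.

    aberrant maps sample -> set of STP names that are aberrant in that sample.
    covering is the list of STPs retained in the covering.
    """
    configs = Counter(
        tuple(stp in stps for stp in covering) for stps in aberrant.values()
    )
    n = sum(configs.values())
    return -sum(c / n * log2(c / n) for c in configs.values())

# Hypothetical toy population with a two-STP covering.
covering = ["stp1", "stp2"]
aberrant = {
    "s1": {"stp1"},
    "s2": {"stp2"},
    "s3": {"stp1", "stp2"},
    "s4": {"stp1"},
}
h = configuration_entropy(aberrant, covering)
```

Three distinct configurations occur with frequencies 1/2, 1/4, and 1/4, giving 1.5 bits. A population in which every sample showed the same configuration would give zero.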


Subject(s)
DNA, Neoplasm/genetics; Neoplasms/genetics; RNA, Neoplasm/genetics; Computational Biology/methods; Gene Regulatory Networks; Humans; Mutation

6.
PLoS One ; 16(4): e0249002, 2021.
Article in English | MEDLINE | ID: mdl-33819273

ABSTRACT

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with data from the Cancer Genome Atlas.
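
The digitization step can be sketched at the univariate level. This is a simplified stand-in and not the API of the divergence package (which is an R package): it uses the observed baseline minimum and maximum of each feature as a crude proxy for the estimated baseline range, and the function `ternary_code` and the feature names are hypothetical.

```python
def ternary_code(profile, baseline):
    """Code each feature as -1, 0, or +1 relative to a baseline range.

    profile: dict feature -> value for one sample.
    baseline: dict feature -> list of values from the baseline population.
    The [min, max] interval is an assumed stand-in for an estimated
    baseline support.
    """
    code = {}
    for feature, value in profile.items():
        low, high = min(baseline[feature]), max(baseline[feature])
        code[feature] = -1 if value < low else (1 if value > high else 0)
    return code

# Hypothetical baseline (e.g., normal tissue) and one sample to digitize.
baseline = {"g1": [1.0, 1.2, 0.9], "g2": [5.0, 5.5, 4.8]}
sample = {"g1": 2.0, "g2": 5.1}
coded = ternary_code(sample, baseline)
```

A binary coding follows by taking the absolute value of each entry, so that a feature is marked only as inside or outside the baseline range.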


Subject(s)
Genomics/methods; Software; Databases, Genetic; Humans; Neoplasms/genetics

7.
Metabolites ; 11(1), 2020 Dec 30.
Article in English | MEDLINE | ID: mdl-33396819

ABSTRACT

Cancer cells are adept at reprogramming energy metabolism, and the precise manifestation of this metabolic reprogramming exhibits heterogeneity across individuals (and from cell to cell). In this study, we analyzed the metabolic differences between interpersonal heterogeneous cancer phenotypes. We used divergence analysis on gene expression data of 1156 breast normal and tumor samples from The Cancer Genome Atlas (TCGA) and integrated this information with a genome-scale reconstruction of human metabolism to generate personalized, context-specific metabolic networks. Using this approach, we classified the samples into four distinct groups based on their metabolic profiles. Enrichment analysis of the subsystems indicated that amino acid metabolism, fatty acid oxidation, citric acid cycle, androgen and estrogen metabolism, and reactive oxygen species (ROS) detoxification distinguished these four groups. Additionally, we developed a workflow to identify potential drugs that can selectively target genes associated with the reactions of interest. MG-132 (a proteasome inhibitor) and OSU-03012 (a celecoxib derivative) were the top-ranking drugs identified from our analysis and known to have anti-tumor activity. Our approach has the potential to provide mechanistic insights into cancer-specific metabolic dependencies, ultimately enabling the identification of potential drug targets for each patient independently, contributing to a rational personalized medicine approach.

8.
Proc Natl Acad Sci U S A ; 117(2): 857-864, 2020 01 14.
Article in English | MEDLINE | ID: mdl-31882448

ABSTRACT

Cancer is driven by the sequential accumulation of genetic and epigenetic changes in oncogenes and tumor suppressor genes. The timing of these events is not well understood. Moreover, it is currently unknown why the same driver gene change appears as an early event in some cancer types and as a later event, or not at all, in others. These questions have become even more topical with the recent progress brought by genome-wide sequencing studies of cancer. Focusing on mutational events, we provide a mathematical model of the full process of tumor evolution that includes different types of fitness advantages for driver genes and carrying-capacity considerations. The model is able to recapitulate a substantial proportion of the observed cancer incidence in several cancer types (colorectal, pancreatic, and leukemia) and inherited conditions (Lynch and familial adenomatous polyposis), by changing only 2 tissue-specific parameters: the number of stem cells in a tissue and its cell division frequency. The model sheds light on the evolutionary dynamics of cancer by suggesting a generalized early onset of tumorigenesis followed by slow mutational waves, in contrast to previous conclusions. Formulas and estimates are provided for the fitness increases induced by driver mutations, often much larger than previously described, and highly tissue dependent. Our results suggest a mechanistic explanation for why the selective fitness advantage introduced by specific driver genes is tissue dependent.


Subject(s)
Carcinogenesis/genetics; Models, Genetic; Neoplasms/classification; Adenomatous Polyposis Coli/genetics; Aged; Cell Division; Colorectal Neoplasms/genetics; Colorectal Neoplasms, Hereditary Nonpolyposis; Humans; Middle Aged; Mutation; Neoplasms/genetics; Oncogenes/genetics

9.
Mol Omics ; 14(6): 424-436, 2018 12 03.
Article in English | MEDLINE | ID: mdl-30259924

ABSTRACT

Label-free shotgun mass spectrometry enables the detection of significant changes in protein abundance between different conditions. Due to often limited cohort sizes or replication, large ratios of potential protein markers to number of samples, as well as multiple null measurements pose important technical challenges to conventional parametric models. From a statistical perspective, a scenario similar to that of unlabeled proteomics is encountered in genomics when looking for differentially expressed genes. Still, the difficulty of detecting a large fraction of the true positives without a high false discovery rate is arguably greater in proteomics due to even smaller sample sizes and peptide-to-peptide variability in detectability. These constraints argue for nonparametric (or distribution-free) tests on normalized peptide values, thus minimizing the number of free parameters, as well as for measuring significance with permutation testing. We propose such a procedure with a class-based statistic, no parametric assumptions, and no parameters to select other than a nominal false discovery rate. Our method was tested on a new dataset which is available via ProteomeXchange with identifier PXD006447. The dataset was prepared using a standard proteolytic digest of a human protein mixture at 1.5-fold to 3-fold protein concentration changes and diluted into a constant background of yeast proteins. We demonstrate its superiority relative to other approaches in terms of the realized sensitivity and realized false discovery rates determined by ground truth, and recommend it for detecting differentially abundant proteins from MS data.
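
Permutation testing, which this abstract recommends for measuring significance, can be sketched generically. The abstract does not specify the class-based statistic, so the sketch assumes an absolute difference in group means and enumerates every relabeling exactly; the function `permutation_p_value` and the toy intensities are hypothetical.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)

    def stat(idx_a):
        # Absolute difference in means for one assignment of labels.
        a = [pooled[i] for i in idx_a]
        b = [pooled[i] for i in range(len(pooled)) if i not in idx_a]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(tuple(range(n_a)))
    splits = list(combinations(range(len(pooled)), n_a))
    # Small tolerance guards against floating-point ties.
    extreme = sum(stat(s) >= observed - 1e-12 for s in splits)
    return extreme / len(splits)

# Hypothetical normalized peptide values under two conditions.
condition_1 = [1.0, 1.1, 0.9]
condition_2 = [2.0, 2.1, 1.9]
p = permutation_p_value(condition_1, condition_2)
```

With three values per group there are 20 relabelings, and only the observed split and its mirror image are as extreme, so the smallest attainable p-value is 0.1. This illustrates why small cohorts limit the power of any test, parametric or not.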


Subject(s)
Proteome; Proteomics/methods; Statistics, Nonparametric; Humans; Tandem Mass Spectrometry/methods

10.
Proc Natl Acad Sci U S A ; 115(18): 4545-4552, 2018 05 01.
Article in English | MEDLINE | ID: mdl-29666255

ABSTRACT

Data collected from omics technologies have revealed pervasive heterogeneity and stochasticity of molecular states within and between phenotypes. A prominent example of such heterogeneity occurs between genome-wide mRNA, microRNA, and methylation profiles from one individual tumor to another, even within a cancer subtype. However, current methods in bioinformatics, such as detecting differentially expressed genes or CpG sites, are population-based and therefore do not effectively model intersample diversity. Here we introduce a unified theory to quantify sample-level heterogeneity that is applicable to a single omics profile. Specifically, we simplify an omics profile to a digital representation based on the omics profiles from a set of samples from a reference or baseline population (e.g., normal tissues). The state of any subprofile (e.g., expression vector for a subset of genes) is said to be "divergent" if it lies outside the estimated support of the baseline distribution and is consequently interpreted as "dysregulated" relative to that baseline. We focus on two cases: single features (e.g., individual genes) and distinguished subsets (e.g., regulatory pathways). Notably, since the divergence analysis is at the individual sample level, dysregulation can be analyzed probabilistically; for example, one can estimate the probability that a gene or pathway is divergent in some population. Finally, the reduction in complexity facilitates a more "personalized" and biologically interpretable analysis of variation, as illustrated by experiments involving tissue characterization, disease detection and progression, and disease-pathway associations.


Subject(s)
Computational Biology/methods; Gene Expression Profiling/methods; Precision Medicine/methods; Computational Biology/statistics & numerical data; Data Interpretation, Statistical; Databases, Genetic; Gene Expression Profiling/statistics & numerical data; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans; MicroRNAs/genetics; Neoplasms/genetics; Proteomics/methods

11.
Bioinformatics ; 34(11): 1859-1867, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29342249

ABSTRACT

Motivation: Current bioinformatics methods to detect changes in gene isoform usage in distinct phenotypes compare the relative expected isoform usage in phenotypes. These statistics model differences in isoform usage in normal tissues, which have stable regulation of gene splicing. Pathological conditions, such as cancer, can have broken regulation of splicing that increases the heterogeneity of the expression of splice variants. Inferring events with such differential heterogeneity in gene isoform usage requires new statistical approaches. Results: We introduce Splice Expression Variability Analysis (SEVA) to model increased heterogeneity of splice variant usage between conditions (e.g. tumor and normal samples). SEVA uses a rank-based multivariate statistic that compares the variability of junction expression profiles within one condition to the variability within another. Simulated data show that SEVA is unique in modeling heterogeneity of gene isoform usage, and benchmark SEVA's performance against EBSeq, DiffSplice and rMATS that model differential isoform usage instead of heterogeneity. We confirm the accuracy of SEVA in identifying known splice variants in head and neck cancer and perform cross-study validation of novel splice variants. A novel comparison of splice variant heterogeneity between subtypes of head and neck cancer demonstrated unanticipated similarity between the heterogeneity of gene isoform usage in HPV-positive and HPV-negative subtypes and anticipated increased heterogeneity among HPV-negative samples with mutations in genes that regulate the splice variant machinery. These results show that SEVA accurately models differential heterogeneity of gene isoform usage from RNA-seq data. Availability and implementation: SEVA is implemented in the R/Bioconductor package GSReg. Contact: bahman@jhu.edu or favorov@sensi.org or ejfertig@jhmi.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Alternative Splicing; Neoplasms/genetics; Protein Isoforms/genetics; Sequence Analysis, RNA/methods; Software; Computational Biology/methods; Gene Expression Regulation, Neoplastic; Head and Neck Neoplasms/genetics; Humans; Models, Genetic

12.
Hum Mol Genet ; 26(5): 913-922, 2017 03 01.
Article in English | MEDLINE | ID: mdl-28334820

ABSTRACT

Huntington's disease is a dominantly inherited neurodegenerative disease caused by the expansion of a CAG repeat in the HTT gene. In addition to the length of the CAG expansion, factors such as genetic background have been shown to contribute to the age at onset of neurological symptoms. A central challenge in understanding the disease progression that leads from the HD mutation to massive cell death in the striatum is the ability to characterize the subtle and early functional consequences of the CAG expansion longitudinally. We used dense time course sampling between 4 and 20 postnatal weeks to characterize early transcriptomic, molecular and cellular phenotypes in the striatum of six distinct knock-in mouse models of the HD mutation. We studied the effects of the HttQ111 allele on the C57BL/6J, CD-1, FVB/NCr1, and 129S2/SvPasCrl genetic backgrounds, and of two additional alleles, HttQ92 and HttQ50, on the C57BL/6J background. We describe the emergence of a transcriptomic signature in HttQ111/+ mice involving hundreds of differentially expressed genes and changes in diverse molecular pathways. We also show that this time course spanned the onset of mutant huntingtin nuclear localization phenotypes and somatic CAG-length instability in the striatum. Genetic background strongly influenced the magnitude and age at onset of these effects. This work provides a foundation for understanding the earliest transcriptional and molecular changes contributing to HD pathogenesis.


Subject(s)
Corpus Striatum/metabolism; Huntingtin Protein/genetics; Huntington Disease/genetics; Trinucleotide Repeat Expansion/genetics; Animals; Corpus Striatum/pathology; Disease Models, Animal; Gene Expression Regulation, Developmental; Gene Knock-In Techniques; Genetic Background; Genomic Instability/genetics; Humans; Huntingtin Protein/biosynthesis; Huntington Disease/pathology; Mice; Mutation/genetics; Neurons/metabolism; Neurons/pathology; Phenotype; Transcriptome/genetics

13.
Proc Natl Acad Sci U S A ; 113(34): 9384-7, 2016 08 23.
Article in English | MEDLINE | ID: mdl-27555443

14.
Proc Natl Acad Sci U S A ; 112(12): 3618-23, 2015 Mar 24.
Article in English | MEDLINE | ID: mdl-25755262

ABSTRACT

Today, computer vision systems are tested by their accuracy in detecting and localizing instances of objects. As an alternative, and motivated by the ability of humans to provide far richer descriptions and even tell a story about an image, we construct a "visual Turing test": an operator-assisted device that produces a stochastic sequence of binary questions from a given test image. The query engine proposes a question; the operator either provides the correct answer or rejects the question as ambiguous; the engine proposes the next question ("just-in-time truthing"). The test is then administered to the computer-vision system, one question at a time. After the system's answer is recorded, the system is provided the correct answer and the next question. Parsing is trivial and deterministic; the system being tested requires no natural language processing. The query engine employs statistical constraints, learned from a training set, to produce questions with essentially unpredictable answers: the answer to a question, given the history of questions and their correct answers, is nearly equally likely to be positive or negative. In this sense, the test is only about vision. The system is designed to produce streams of questions that follow natural story lines, from the instantiation of a unique object, through an exploration of its properties, and on to its relationships with other uniquely instantiated objects.


Subject(s)
Image Processing, Computer-Assisted/methods; Pattern Recognition, Automated; Algorithms; Artificial Intelligence; Humans; Imaging, Three-Dimensional; Models, Statistical; Reproducibility of Results; Software

15.
Bioinformatics ; 31(2): 273-4, 2015 Jan 15.
Article in English | MEDLINE | ID: mdl-25262153

ABSTRACT

k-Top Scoring Pairs (kTSP) is a classification method for prediction from high-throughput data based on a set of paired measurements. Each of the two possible orderings of a pair of measurements (e.g. a reversal in the expression of two genes) is associated with one of two classes. The kTSP prediction rule is the aggregation of voting among such individual two-feature decision rules based on order switching. kTSP, like its predecessor, Top Scoring Pair (TSP), is a parameter-free classifier relying only on ranking of a small subset of features, rendering it robust to noise and potentially easy to interpret in biological terms. In contrast to TSP, kTSP has comparable accuracy to standard genomics classification techniques, including Support Vector Machines and Prediction Analysis for Microarrays. Here, we describe 'switchBox', an R package for kTSP-based prediction. AVAILABILITY: The 'switchBox' package is freely available from Bioconductor: http://www.bioconductor.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
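
The voting rule described in this abstract can be sketched in a few lines. This is not the switchBox API (switchBox is an R package): the function `ktsp_predict`, the feature names, and the convention that a pair votes for class 1 when its first feature is lower are assumptions, and the step that selects the top-scoring pairs from training data is not shown.

```python
def ktsp_predict(sample, pairs):
    """Majority vote over pairwise order rules.

    sample: dict feature -> measurement.
    pairs: list of (feature_i, feature_j); a pair votes for class 1 when
    feature_i < feature_j in the sample, otherwise for class 0.
    """
    votes = sum(sample[i] < sample[j] for i, j in pairs)
    return 1 if 2 * votes > len(pairs) else 0

# Hypothetical expression values and three preselected gene pairs.
sample = {"g1": 1, "g2": 3, "g3": 5, "g4": 2, "g5": 4, "g6": 6}
pairs = [("g1", "g2"), ("g3", "g4"), ("g5", "g6")]
label = ktsp_predict(sample, pairs)
```

Because only the within-sample ordering of each pair matters, the prediction is unchanged by any monotone rescaling of the sample's measurements. An odd number of pairs avoids tied votes.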


Subject(s)
Algorithms; Biomarkers, Tumor/genetics; Breast Neoplasms/classification; Computational Biology/methods; Gene Expression Profiling/methods; Neoplasm Recurrence, Local/diagnosis; Breast Neoplasms/genetics; Female; Gene Expression Regulation, Neoplastic; Humans; Neoplasm Recurrence, Local/genetics; Support Vector Machine

16.
Hum Genet ; 134(5): 479-95, 2015 May.
Article in English | MEDLINE | ID: mdl-25381197

ABSTRACT

Cancer is perhaps the prototypical systems disease, and as such has been the focus of extensive study in quantitative systems biology. However, translating these programs into personalized clinical care remains elusive and incomplete. In this perspective, we argue that realizing this agenda (in particular, predicting disease phenotypes, progression and treatment response for individuals) requires going well beyond standard computational and bioinformatics tools and algorithms. It entails designing global mathematical models over network-scale configurations of genomic states and molecular concentrations, and learning the model parameters from limited available samples of high-dimensional and integrative omics data. As such, any plausible design should accommodate: biological mechanism, necessary for both feasible learning and interpretable decision making; stochasticity, to deal with uncertainty and observed variation at many scales; and a capacity for statistical inference at the patient level. This program, which requires a close, sustained collaboration between mathematicians and biologists, is illustrated in several contexts, including learning biomarkers, metabolism, cell signaling, network inference and tumorigenesis.


Subject(s)
Computational Biology/methods; Data Interpretation, Statistical; Gene Regulatory Networks/genetics; Neoplasms/genetics; Phenotype; Systems Biology/methods; Translational Research, Biomedical/methods; Biomarkers, Tumor; Carcinogenesis/genetics; Humans; Metabolic Networks and Pathways/genetics; Metabolic Networks and Pathways/physiology; Mutation/genetics; Neoplasms/pathology; Signal Transduction/genetics; Signal Transduction/physiology; Translational Research, Biomedical/trends

17.
Cancer Inform ; 13(Suppl 5): 61-7, 2014.
Article in English | MEDLINE | ID: mdl-25392694

ABSTRACT

Analysis of gene sets can implicate activity in signaling pathways that is responsible for cancer initiation and progression, but is not discernible from the analysis of individual genes. Multiple methods and software packages have been developed to infer pathway activity from expression measurements for the set of genes targeted by that pathway. Broadly, three major methodologies have been proposed: over-representation, enrichment, and differential variability. Both over-representation and enrichment analyses are effective techniques to infer differentially regulated pathways from gene sets with relatively consistent differentially expressed (DE) genes. Specifically, these algorithms aggregate statistics from each gene in the pathway. However, they overlook multivariate patterns related to gene interactions and variations in expression. Therefore, the analysis of differential variability of multigene expression patterns can be essential to pathway inference in cancers. The corresponding methodologies and software packages for such multivariate variability analysis of pathways are reviewed here. We also introduce a new, computationally efficient algorithm, expression variation analysis (EVA), which has been implemented along with a previously proposed algorithm, Differential Rank Conservation (DIRAC), in an open source R package, gene set regulation (GSReg). EVA inferred pathways similar to those inferred by DIRAC at reduced computational cost. Moreover, EVA also inferred dysregulated pathways different from those identified by enrichment analysis.

18.
PLoS One ; 9(10): e110840, 2014.
Article in English | MEDLINE | ID: mdl-25330348

ABSTRACT

BACKGROUND: The biomarker discovery field is replete with molecular signatures that have not translated into the clinic despite ostensibly promising performance in predicting disease phenotypes. One widely cited reason is lack of classification consistency, largely due to failure to maintain performance from study to study. This failure is widely attributed to variability in data collected for the same phenotype among disparate studies, due to technical factors unrelated to phenotypes (e.g., laboratory settings resulting in "batch-effects") and non-phenotype-associated biological variation in the underlying populations. These sources of variability persist in new data collection technologies. METHODS: Here we quantify the impact of these combined "study-effects" on a disease signature's predictive performance by comparing two types of validation methods: ordinary randomized cross-validation (RCV), which extracts random subsets of samples for testing, and inter-study validation (ISV), which excludes an entire study for testing. Whereas RCV hardwires an assumption of training and testing on identically distributed data, this key property is lost in ISV, yielding systematic decreases in performance estimates relative to RCV. Measuring the RCV-ISV difference as a function of the number of studies quantifies the influence of study-effects on performance. RESULTS: As a case study, we gathered publicly available gene expression data from 1,470 microarray samples of 6 lung phenotypes from 26 independent experimental studies and 769 RNA-seq samples of 2 lung phenotypes from 4 independent studies. We find that the RCV-ISV performance discrepancy is greater in phenotypes with few studies, and that the ISV performance converges toward RCV performance as data from additional studies are incorporated into classification. CONCLUSIONS: We show that by examining how fast ISV performance approaches RCV as the number of studies is increased, one can estimate when "sufficient" diversity has been achieved for learning a molecular signature likely to translate without significant loss of accuracy to new clinical settings.
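
The difference between the two validation schemes described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `rcv_splits` and `isv_splits`, the fold count, and the toy study labels are all assumptions made for illustration. RCV deals shuffled samples into folds regardless of origin, while ISV holds out every sample from one study at a time.

```python
import random
from collections import defaultdict


def rcv_splits(n_samples, n_folds, seed=0):
    """Randomized cross-validation: shuffle sample indices, then deal them into folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test


def isv_splits(study_ids):
    """Inter-study validation: hold out all samples of one study at a time."""
    by_study = defaultdict(list)
    for i, s in enumerate(study_ids):
        by_study[s].append(i)
    for held_out in sorted(by_study):
        test = by_study[held_out]
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        yield held_out, train, test


# Hypothetical toy labels: six samples drawn from three studies.
studies = ["A", "A", "B", "B", "C", "C"]
isv = list(isv_splits(studies))
rcv = list(rcv_splits(len(studies), n_folds=3))
```

In the ISV splits no study contributes samples to both the training and the test side, which is why the identically-distributed assumption built into RCV no longer holds.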


Subject(s)
Tumor Biomarkers/biosynthesis , Gene Expression Profiling/methods , Neoplastic Gene Expression Regulation , Oligonucleotide Array Sequence Analysis/methods , Adenocarcinoma/genetics , Adenocarcinoma/pathology , Adenocarcinoma of Lung , Squamous Cell Carcinoma/genetics , Squamous Cell Carcinoma/pathology , Humans , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Phenotype , Chronic Obstructive Pulmonary Disease/genetics , Chronic Obstructive Pulmonary Disease/pathology , RNA Sequence Analysis , Support Vector Machine
19.
BMC Syst Biol ; 7: 118, 2013 Nov 01.
Article in English | MEDLINE | ID: mdl-24182195

ABSTRACT

BACKGROUND: Reverse-engineering gene regulatory networks from expression data is difficult, especially without temporal measurements or interventional experiments. In particular, the causal direction of an edge is generally not statistically identifiable, i.e., cannot be inferred as a statistical parameter, even from an unlimited amount of non-time series observational mRNA expression data. Some additional evidence is required, and high-throughput methylation data can be viewed as a natural multifactorial gene perturbation experiment. RESULTS: We introduce IDEM (Identifying Direction from Expression and Methylation), a method for identifying the causal direction of edges by combining DNA methylation and mRNA transcription data. We describe the circumstances under which edge directions become identifiable, and experiments with both real and synthetic data demonstrate that the accuracy of IDEM for inferring both edge placement and edge direction in gene regulatory networks is significantly improved relative to other methods. CONCLUSION: Reverse-engineering directed gene regulatory networks from static observational data becomes feasible by exploiting the context provided by high-throughput DNA methylation data. An implementation of the algorithm described is available at http://code.google.com/p/idem/.


Subject(s)
Computational Biology/methods , DNA Methylation , Gene Expression Profiling , Gene Regulatory Networks , Bayes Theorem , Gene Knockdown Techniques , Likelihood Functions , Markov Chains , Messenger RNA/genetics , Messenger RNA/metabolism , Reproducibility of Results
20.
PLoS Comput Biol ; 9(7): e1003148, 2013.
Article in English | MEDLINE | ID: mdl-23935471

ABSTRACT

We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. These signatures were based on a new method reported herein--Identification of Structured Signatures and Classifiers (ISSAC)--that resulted in a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, with others having known roles in cancer biology. Analyses on large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between different published studies, for it was observed that the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences, as is typical. For this reason, we restricted ourselves to studying only cases where we had at least two independent studies performed for each phenotype, and also reprocessed all the raw data from the studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, 90% phenotype prediction accuracy, or the accuracy of identifying a particular brain cancer from the background of all phenotypes, was found. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood.


Subject(s)
Brain Neoplasms/genetics , Transcriptome , Tumor Biomarkers/metabolism , Brain Neoplasms/metabolism , Computational Biology , Humans , Reproducibility of Results