Results 1 - 20 of 42
1.
Mol Psychiatry ; 28(5): 2018-2029, 2023 May.
Article in English | MEDLINE | ID: mdl-36732587

ABSTRACT

Seven Tesla magnetic resonance spectroscopy (7T MRS) offers a precise measurement of metabolic levels in the human brain via a non-invasive approach. Studying longitudinal changes in brain metabolites could help evaluate the characteristics of disease over time. This approach may also shed light on how the age of study participants and duration of illness may influence these metabolites. This study used 7T MRS to investigate longitudinal patterns of brain metabolites in young adulthood in both healthy controls and patients. A four-year longitudinal cohort with 38 patients with first episode psychosis (onset within 2 years) and 48 healthy controls was used to examine 10 brain metabolites in 5 brain regions associated with the pathophysiology of psychosis in a comprehensive manner. Both patients and controls were found to have significant longitudinal reductions in glutamate in the anterior cingulate cortex (ACC). Only patients were found to have a significant decrease over time in γ-aminobutyric acid, N-acetyl aspartate, myo-inositol, total choline, and total creatine in the ACC. Together we highlight the ACC with dynamic changes in several metabolites in early-stage psychosis, in contrast to the other 4 brain regions that also are known to play roles in psychosis. Meanwhile, glutathione was uniquely found to have a near zero annual percentage change in both patients and controls in all 5 brain regions during a four-year follow-up in young adulthood. Given that a reduction of the glutathione in the ACC has been reported as a feature of treatment-refractory psychosis, this observation further supports the potential of glutathione as a biomarker for this subset of patients with psychosis.


Subjects
Glutamine , Psychotic Disorders , Humans , Young Adult , Adult , Glutamine/metabolism , Psychotic Disorders/metabolism , Brain/metabolism , Glutamic Acid/metabolism , Gyrus Cinguli/metabolism , Aspartic Acid/metabolism , Glutathione/metabolism
2.
Proc Natl Acad Sci U S A ; 117(2): 857-864, 2020 01 14.
Article in English | MEDLINE | ID: mdl-31882448

ABSTRACT

Cancer is driven by the sequential accumulation of genetic and epigenetic changes in oncogenes and tumor suppressor genes. The timing of these events is not well understood. Moreover, it is currently unknown why the same driver gene change appears as an early event in some cancer types and as a later event, or not at all, in others. These questions have become even more topical with the recent progress brought by genome-wide sequencing studies of cancer. Focusing on mutational events, we provide a mathematical model of the full process of tumor evolution that includes different types of fitness advantages for driver genes and carrying-capacity considerations. The model is able to recapitulate a substantial proportion of the observed cancer incidence in several cancer types (colorectal, pancreatic, and leukemia) and inherited conditions (Lynch and familial adenomatous polyposis), by changing only 2 tissue-specific parameters: the number of stem cells in a tissue and its cell division frequency. The model sheds light on the evolutionary dynamics of cancer by suggesting a generalized early onset of tumorigenesis followed by slow mutational waves, in contrast to previous conclusions. Formulas and estimates are provided for the fitness increases induced by driver mutations, often much larger than previously described, and highly tissue dependent. Our results suggest a mechanistic explanation for why the selective fitness advantage introduced by specific driver genes is tissue dependent.


Subjects
Carcinogenesis/genetics , Models, Genetic , Neoplasms/classification , Adenomatous Polyposis Coli/genetics , Aged , Cell Division , Colorectal Neoplasms/genetics , Colorectal Neoplasms, Hereditary Nonpolyposis , Humans , Middle Aged , Mutation , Neoplasms/genetics , Oncogenes/genetics
3.
PLoS Comput Biol ; 17(6): e1008944, 2021 06.
Article in English | MEDLINE | ID: mdl-34115745

ABSTRACT

Cancer cells display massive dysregulation of key regulatory pathways due to now well-catalogued mutations and other DNA-related aberrations. Moreover, enormous heterogeneity has been commonly observed in the identity, frequency and location of these aberrations across individuals with the same cancer type or subtype, and this variation naturally propagates to the transcriptome, resulting in myriad types of dysregulated gene expression programs. Many have argued that a more integrative and quantitative analysis of heterogeneity of DNA and RNA molecular profiles may be necessary for designing more systematic explorations of alternative therapies and improving predictive accuracy. We introduce a representation of multi-omics profiles which is sufficiently rich to account for observed heterogeneity and support the construction of quantitative, integrated, metrics of variation. Starting from the network of interactions existing in Reactome, we build a library of "paired DNA-RNA aberrations" that represent prototypical and recurrent patterns of dysregulation in cancer; each two-gene "Source-Target Pair" (STP) consists of a "source" regulatory gene and a "target" gene whose expression is plausibly "controlled" by the source gene. The STP is then "aberrant" in a joint DNA-RNA profile if the source gene is DNA-aberrant (e.g., mutated, deleted, or duplicated), and the downstream target gene is "RNA-aberrant", meaning its expression level is outside the normal, baseline range. With M STPs, each sample profile has exactly one of the 2^M possible configurations. We concentrate on subsets of STPs, and the corresponding reduced configurations, by selecting tissue-dependent minimal coverings, defined as the smallest family of STPs with the property that every sample in the considered population displays at least one aberrant STP within that family. These minimal coverings can be computed with integer programming.
Given such a covering, a natural measure of cross-sample diversity is the extent to which the particular aberrant STPs composing a covering vary from sample to sample; this variability is captured by the entropy of the distribution over configurations. We apply this program to data from TCGA for six distinct tumor types (breast, prostate, lung, colon, liver, and kidney cancer). This enables an efficient simplification of the complex landscape observed in cancer populations, resulting in the identification of novel signatures of molecular alterations which are not detected with frequency-based criteria. Estimates of cancer heterogeneity across tumor phenotypes reveal a stable pattern: entropy increases with disease severity. This framework is then well-suited to accommodate the expanding complexity of cancer genomes and epigenomes emerging from large consortia projects.


Subjects
DNA, Neoplasm/genetics , Neoplasms/genetics , RNA, Neoplasm/genetics , Computational Biology/methods , Gene Regulatory Networks , Humans , Mutation
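
The heterogeneity measure described in the abstract of this record (the entropy of the distribution over STP configurations) can be illustrated with a minimal Python sketch. The function name `configuration_entropy` and the toy samples are assumptions made for illustration and are not taken from the paper; the covering is treated as given rather than computed by integer programming.

```python
from collections import Counter
from math import log2


def configuration_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over configurations.

    Each sample is a sequence of 0/1 flags, one per STP in the covering
    (1 means that STP is aberrant in the sample). With M STPs there are
    at most 2**M distinct configurations, so the entropy is at most M bits.
    """
    counts = Counter(tuple(s) for s in samples)
    n = len(samples)
    # p * log2(1/p) summed over observed configurations, with p = c / n
    return sum((c / n) * log2(n / c) for c in counts.values())


# Hypothetical toy populations with M = 2 STPs.
homogeneous = [(0, 1), (0, 1), (0, 1), (0, 1)]    # every sample identical
heterogeneous = [(0, 0), (0, 1), (1, 0), (1, 1)]  # every configuration once

print(configuration_entropy(homogeneous))
print(configuration_entropy(heterogeneous))
```

In this toy setting the homogeneous population has zero entropy and the heterogeneous one reaches the 2-bit maximum for M = 2, which is the sense in which higher entropy indicates greater cross-sample diversity.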
4.
Proc Natl Acad Sci U S A ; 115(18): 4545-4552, 2018 05 01.
Article in English | MEDLINE | ID: mdl-29666255

ABSTRACT

Data collected from omics technologies have revealed pervasive heterogeneity and stochasticity of molecular states within and between phenotypes. A prominent example of such heterogeneity occurs between genome-wide mRNA, microRNA, and methylation profiles from one individual tumor to another, even within a cancer subtype. However, current methods in bioinformatics, such as detecting differentially expressed genes or CpG sites, are population-based and therefore do not effectively model intersample diversity. Here we introduce a unified theory to quantify sample-level heterogeneity that is applicable to a single omics profile. Specifically, we simplify an omics profile to a digital representation based on the omics profiles from a set of samples from a reference or baseline population (e.g., normal tissues). The state of any subprofile (e.g., expression vector for a subset of genes) is said to be "divergent" if it lies outside the estimated support of the baseline distribution and is consequently interpreted as "dysregulated" relative to that baseline. We focus on two cases: single features (e.g., individual genes) and distinguished subsets (e.g., regulatory pathways). Notably, since the divergence analysis is at the individual sample level, dysregulation can be analyzed probabilistically; for example, one can estimate the probability that a gene or pathway is divergent in some population. Finally, the reduction in complexity facilitates a more "personalized" and biologically interpretable analysis of variation, as illustrated by experiments involving tissue characterization, disease detection and progression, and disease-pathway associations.


Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Precision Medicine/methods , Computational Biology/statistics & numerical data , Data Interpretation, Statistical , Databases, Genetic , Gene Expression Profiling/statistics & numerical data , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , MicroRNAs/genetics , Neoplasms/genetics , Proteomics/methods
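
The single-feature case of the divergence idea in this record's abstract can be sketched as follows. This is only an illustration under a strong simplifying assumption: the support of the baseline distribution is estimated as the plain minimum-to-maximum range of the baseline values, which may differ from the estimator used in the paper. The function names and the numbers are hypothetical.

```python
def digitize(value, baseline):
    """Reduce one measurement to a digital state relative to a baseline.

    Assumption for illustration: the baseline support is the interval
    [min(baseline), max(baseline)]. Returns -1 (below the support),
    0 (inside it, i.e. not divergent) or +1 (above it).
    """
    low, high = min(baseline), max(baseline)
    if value < low:
        return -1
    if value > high:
        return 1
    return 0


def divergence_probability(values, baseline):
    """Fraction of samples in which the feature is divergent."""
    return sum(digitize(v, baseline) != 0 for v in values) / len(values)


# Hypothetical expression values for one gene.
normal_tissue = [2.0, 2.5, 3.0, 3.5]  # reference (baseline) population
tumors = [1.0, 3.2, 4.1, 2.9]         # samples to be digitized

print([digitize(v, normal_tissue) for v in tumors])
print(divergence_probability(tumors, normal_tissue))
```

Because each sample is digitized on its own, the per-population divergence probability mentioned in the abstract is just an average of per-sample indicators.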
5.
Hum Mol Genet ; 26(5): 913-922, 2017 03 01.
Article in English | MEDLINE | ID: mdl-28334820

ABSTRACT

Huntington's disease is a dominantly inherited neurodegenerative disease caused by the expansion of a CAG repeat in the HTT gene. In addition to the length of the CAG expansion, factors such as genetic background have been shown to contribute to the age at onset of neurological symptoms. A central challenge in understanding the disease progression that leads from the HD mutation to massive cell death in the striatum is the ability to characterize the subtle and early functional consequences of the CAG expansion longitudinally. We used dense time course sampling between 4 and 20 postnatal weeks to characterize early transcriptomic, molecular and cellular phenotypes in the striatum of six distinct knock-in mouse models of the HD mutation. We studied the effects of the HttQ111 allele on the C57BL/6J, CD-1, FVB/NCrl, and 129S2/SvPasCrl genetic backgrounds, and of two additional alleles, HttQ92 and HttQ50, on the C57BL/6J background. We describe the emergence of a transcriptomic signature in HttQ111/+ mice involving hundreds of differentially expressed genes and changes in diverse molecular pathways. We also show that this time course spanned the onset of mutant huntingtin nuclear localization phenotypes and somatic CAG-length instability in the striatum. Genetic background strongly influenced the magnitude and age at onset of these effects. This work provides a foundation for understanding the earliest transcriptional and molecular changes contributing to HD pathogenesis.


Subjects
Corpus Striatum/metabolism , Huntingtin Protein/genetics , Huntington Disease/genetics , Trinucleotide Repeat Expansion/genetics , Animals , Corpus Striatum/pathology , Disease Models, Animal , Gene Expression Regulation, Developmental , Gene Knock-In Techniques , Genetic Background , Genomic Instability/genetics , Humans , Huntingtin Protein/biosynthesis , Huntington Disease/pathology , Mice , Mutation/genetics , Neurons/metabolism , Neurons/pathology , Phenotype , Transcriptome/genetics
6.
Bioinformatics ; 34(11): 1859-1867, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29342249

ABSTRACT

Motivation: Current bioinformatics methods to detect changes in gene isoform usage in distinct phenotypes compare the relative expected isoform usage in phenotypes. These statistics model differences in isoform usage in normal tissues, which have stable regulation of gene splicing. Pathological conditions, such as cancer, can have broken regulation of splicing that increases the heterogeneity of the expression of splice variants. Inferring events with such differential heterogeneity in gene isoform usage requires new statistical approaches. Results: We introduce Splice Expression Variability Analysis (SEVA) to model increased heterogeneity of splice variant usage between conditions (e.g. tumor and normal samples). SEVA uses a rank-based multivariate statistic that compares the variability of junction expression profiles within one condition to the variability within another. Simulated data show that SEVA is unique in modeling heterogeneity of gene isoform usage, and benchmark SEVA's performance against EBSeq, DiffSplice and rMATS that model differential isoform usage instead of heterogeneity. We confirm the accuracy of SEVA in identifying known splice variants in head and neck cancer and perform cross-study validation of novel splice variants. A novel comparison of splice variant heterogeneity between subtypes of head and neck cancer demonstrated unanticipated similarity between the heterogeneity of gene isoform usage in HPV-positive and HPV-negative subtypes and anticipated increased heterogeneity among HPV-negative samples with mutations in genes that regulate the splice variant machinery. These results show that SEVA accurately models differential heterogeneity of gene isoform usage from RNA-seq data. Availability and implementation: SEVA is implemented in the R/Bioconductor package GSReg. Contact: bahman@jhu.edu or favorov@sensi.org or ejfertig@jhmi.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Alternative Splicing , Neoplasms/genetics , Protein Isoforms/genetics , Sequence Analysis, RNA/methods , Software , Computational Biology/methods , Gene Expression Regulation, Neoplastic , Head and Neck Neoplasms/genetics , Humans , Models, Genetic
7.
Proc Natl Acad Sci U S A ; 112(12): 3618-23, 2015 Mar 24.
Article in English | MEDLINE | ID: mdl-25755262

ABSTRACT

Today, computer vision systems are tested by their accuracy in detecting and localizing instances of objects. As an alternative, and motivated by the ability of humans to provide far richer descriptions and even tell a story about an image, we construct a "visual Turing test": an operator-assisted device that produces a stochastic sequence of binary questions from a given test image. The query engine proposes a question; the operator either provides the correct answer or rejects the question as ambiguous; the engine proposes the next question ("just-in-time truthing"). The test is then administered to the computer-vision system, one question at a time. After the system's answer is recorded, the system is provided the correct answer and the next question. Parsing is trivial and deterministic; the system being tested requires no natural language processing. The query engine employs statistical constraints, learned from a training set, to produce questions with essentially unpredictable answers: the answer to a question, given the history of questions and their correct answers, is nearly equally likely to be positive or negative. In this sense, the test is only about vision. The system is designed to produce streams of questions that follow natural story lines, from the instantiation of a unique object, through an exploration of its properties, and on to its relationships with other uniquely instantiated objects.


Subjects
Image Processing, Computer-Assisted/methods , Pattern Recognition, Automated , Algorithms , Artificial Intelligence , Humans , Imaging, Three-Dimensional , Models, Statistical , Reproducibility of Results , Software
8.
Bioinformatics ; 31(2): 273-4, 2015 Jan 15.
Article in English | MEDLINE | ID: mdl-25262153

ABSTRACT

k-Top Scoring Pairs (kTSP) is a classification method for prediction from high-throughput data based on a set of paired measurements. Each of the two possible orderings of a pair of measurements (e.g. a reversal in the expression of two genes) is associated with one of two classes. The kTSP prediction rule is the aggregation of voting among such individual two-feature decision rules based on order switching. kTSP, like its predecessor, Top Scoring Pair (TSP), is a parameter-free classifier relying only on ranking of a small subset of features, rendering it robust to noise and potentially easy to interpret in biological terms. In contrast to TSP, kTSP has comparable accuracy to standard genomics classification techniques, including Support Vector Machines and Prediction Analysis for Microarrays. Here, we describe 'switchBox', an R package for kTSP-based prediction. AVAILABILITY: The 'switchBox' package is freely available from Bioconductor: http://www.bioconductor.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Biomarkers, Tumor/genetics , Breast Neoplasms/classification , Computational Biology/methods , Gene Expression Profiling/methods , Neoplasm Recurrence, Local/diagnosis , Breast Neoplasms/genetics , Female , Gene Expression Regulation, Neoplastic , Humans , Neoplasm Recurrence, Local/genetics , Support Vector Machine
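
The voting rule described in this record's abstract can be sketched in a few lines of Python. This is a hedged illustration of the prediction step only: the selection and scoring of pairs during training is not shown, the gene names and values are invented, and the function `ktsp_predict` is hypothetical and does not reproduce the API of the 'switchBox' R package.

```python
def ktsp_predict(profile, pairs):
    """Predict a class (0 or 1) by majority vote over k feature pairs.

    profile: mapping from feature name to measured value for one sample.
    pairs:   list of (a, b) feature pairs; a pair votes for class 1 when
             profile[a] < profile[b], and for class 0 otherwise.
    An odd number of pairs is assumed so that ties cannot occur.
    """
    votes_for_one = sum(1 for a, b in pairs if profile[a] < profile[b])
    return 1 if 2 * votes_for_one > len(pairs) else 0


# Hypothetical pairs, as if chosen beforehand on training data (k = 3).
pairs = [("A", "B"), ("C", "D"), ("E", "F")]

sample_x = {"A": 5.0, "B": 2.0, "C": 1.0, "D": 3.0, "E": 4.0, "F": 0.5}
sample_y = {"A": 1.0, "B": 2.0, "C": 1.0, "D": 3.0, "E": 4.0, "F": 0.5}

print(ktsp_predict(sample_x, pairs))  # only the (C, D) pair votes for class 1
print(ktsp_predict(sample_y, pairs))  # (A, B) and (C, D) vote for class 1
```

Since only within-sample orderings are compared, any monotone rescaling of a sample's values leaves the prediction unchanged, which is one reason rank-based rules of this kind are described as robust to noise.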
9.
Nat Rev Genet ; 11(10): 733-9, 2010 10.
Article in English | MEDLINE | ID: mdl-20838408

ABSTRACT

High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.


Subjects
Biotechnology/methods , Genomics/methods , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, DNA/methods , Biotechnology/standards , Biotechnology/statistics & numerical data , Computational Biology/methods , Genomics/standards , Genomics/statistics & numerical data , Oligonucleotide Array Sequence Analysis/standards , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Periodicals as Topic/standards , Research Design/standards , Research Design/statistics & numerical data , Sequence Analysis, DNA/standards , Sequence Analysis, DNA/statistics & numerical data
10.
Hum Genet ; 134(5): 479-95, 2015 May.
Article in English | MEDLINE | ID: mdl-25381197

ABSTRACT

Cancer is perhaps the prototypical systems disease, and as such has been the focus of extensive study in quantitative systems biology. However, translating these programs into personalized clinical care remains elusive and incomplete. In this perspective, we argue that realizing this agenda (in particular, predicting disease phenotypes, progression and treatment response for individuals) requires going well beyond standard computational and bioinformatics tools and algorithms. It entails designing global mathematical models over network-scale configurations of genomic states and molecular concentrations, and learning the model parameters from limited available samples of high-dimensional and integrative omics data. As such, any plausible design should accommodate: biological mechanism, necessary for both feasible learning and interpretable decision making; stochasticity, to deal with uncertainty and observed variation at many scales; and a capacity for statistical inference at the patient level. This program, which requires a close, sustained collaboration between mathematicians and biologists, is illustrated in several contexts, including learning biomarkers, metabolism, cell signaling, network inference and tumorigenesis.


Subjects
Computational Biology/methods , Data Interpretation, Statistical , Gene Regulatory Networks/genetics , Neoplasms/genetics , Phenotype , Systems Biology/methods , Translational Research, Biomedical/methods , Biomarkers, Tumor , Carcinogenesis/genetics , Humans , Metabolic Networks and Pathways/genetics , Metabolic Networks and Pathways/physiology , Mutation/genetics , Neoplasms/pathology , Signal Transduction/genetics , Signal Transduction/physiology , Translational Research, Biomedical/trends
11.
PLoS Comput Biol ; 9(7): e1003148, 2013.
Article in English | MEDLINE | ID: mdl-23935471

ABSTRACT

We utilized abundant transcriptomic data for the primary classes of brain cancers to study the feasibility of separating all of these diseases simultaneously based on molecular data alone. These signatures were based on a new method reported herein, Identification of Structured Signatures and Classifiers (ISSAC), that resulted in a brain cancer marker panel of 44 unique genes. Many of these genes have established relevance to the brain cancers examined herein, with others having known roles in cancer biology. Analyses on large-scale data from multiple sources must deal with significant challenges associated with heterogeneity between different published studies, for it was observed that the variation among individual studies often had a larger effect on the transcriptome than did phenotype differences, as is typical. For this reason, we restricted ourselves to studying only cases where we had at least two independent studies performed for each phenotype, and also reprocessed all the raw data from the studies using a unified pre-processing pipeline. We found that learning signatures across multiple datasets greatly enhanced reproducibility and accuracy in predictive performance on truly independent validation sets, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while amplifying signal from the repeated global characteristics of the phenotype. When molecular signatures of brain cancers were constructed from all currently available microarray data, 90% phenotype prediction accuracy, or the accuracy of identifying a particular brain cancer from the background of all phenotypes, was found. Looking forward, we discuss our approach in the context of the eventual development of organ-specific molecular signatures from peripheral fluids such as the blood.


Subjects
Brain Neoplasms/genetics , Transcriptome , Biomarkers, Tumor/metabolism , Brain Neoplasms/metabolism , Computational Biology , Humans , Reproducibility of Results
12.
Proc Natl Acad Sci U S A ; 113(34): 9384-7, 2016 08 23.
Article in English | MEDLINE | ID: mdl-27555443
13.
Proc Natl Acad Sci U S A ; 108(43): 17621-5, 2011 Oct 25.
Article in English | MEDLINE | ID: mdl-22006295

ABSTRACT

Automated scene interpretation has benefited from advances in machine learning, and restricted tasks, such as face detection, have been solved with sufficient accuracy for restricted settings. However, the performance of machines in providing rich semantic descriptions of natural scenes from digital images remains highly limited and hugely inferior to that of humans. Here we quantify this "semantic gap" in a particular setting: We compare the efficiency of human and machine learning in assigning an image to one of two categories determined by the spatial arrangement of constituent parts. The images are not real, but the category-defining rules reflect the compositional structure of real images and the type of "reasoning" that appears to be necessary for semantic parsing. Experiments demonstrate that human subjects grasp the separating principles from a handful of examples, whereas the error rates of computer programs fluctuate wildly and remain far behind that of humans even after exposure to thousands of examples. These observations lend support to current trends in computer vision such as integrating machine learning with parts-based modeling.


Subjects
Algorithms , Artificial Intelligence , Pattern Recognition, Automated/methods , Pattern Recognition, Visual/physiology , Problem Solving , Humans
14.
BMC Genomics ; 14: 336, 2013 May 17.
Article in English | MEDLINE | ID: mdl-23682826

ABSTRACT

BACKGROUND: A small number of prognostic and predictive tests based on gene expression are currently offered as reference laboratory tests. In contrast to such success stories, a number of flaws and errors have recently been identified in other genomic-based predictors and the success rate for developing clinically useful genomic signatures is low. These errors have led to widespread concerns about the protocols for conducting and reporting of computational research. As a result, a need has emerged for a template for reproducible development of genomic signatures that incorporates full transparency, data sharing and statistical robustness. RESULTS: Here we present the first fully reproducible analysis of the data used to train and test MammaPrint, an FDA-cleared prognostic test for breast cancer based on a 70-gene expression signature. We provide all the software and documentation necessary for researchers to build and evaluate genomic classifiers based on these data. As an example of the utility of this reproducible research resource, we develop a simple prognostic classifier that uses only 16 genes from the MammaPrint signature and is equally accurate in predicting 5-year disease free survival. CONCLUSIONS: Our study provides a prototypic example for reproducible development of computational algorithms for learning prognostic biomarkers in the era of personalized medicine.


Subjects
Breast Neoplasms/diagnosis , Breast Neoplasms/genetics , Computational Biology/methods , Gene Expression Profiling , Cohort Studies , Humans , Prognosis , Reproducibility of Results , Software
15.
Nucleic Acids Res ; 39(5): 1666-79, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21059680

ABSTRACT

Histone modifications are fundamental to chromatin structure and transcriptional regulation, and are recognized by a limited number of protein folds. Among these folds are PHD fingers, which are present in most chromatin modification complexes. To date, about 15 PHD finger domains have been structurally characterized, whereas hundreds of different sequences have been identified. Consequently, an important open problem is to predict structural features of a PHD finger knowing only its sequence. Here, we classify PHD fingers into different groups based on the analysis of residue-residue co-evolution in their sequences. We measure the degree to which fixing the amino acid type at one position modifies the frequencies of amino acids at other positions. We then detect those position/amino acid combinations, or 'conditions', which have the strongest impact on other sequence positions. Clustering these strong conditions yields four families, providing informative labels for PHD finger sequences. Existing experimental results, as well as docking calculations performed here, reveal that these families indeed show discrepancies at the functional level. Our method should facilitate the functional characterization of new PHD fingers, as well as other protein families, solely based on sequence information.


Subjects
Sequence Analysis, Protein , Transcription Factors/chemistry , Amino Acid Motifs , Amino Acid Sequence , Amino Acids , Cluster Analysis , Entropy , Evolution, Molecular , Histones/chemistry , Molecular Sequence Data , Nuclear Proteins/chemistry , Protein Structure, Tertiary , Sequence Homology, Amino Acid , Transcription Factors/classification
16.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7430-7443, 2023 Jun.
Article in English | MEDLINE | ID: mdl-36441893

ABSTRACT

There is a growing concern about typically opaque decision-making with high-performance machine learning algorithms. Providing an explanation of the reasoning process in domain-specific terms can be crucial for adoption in risk-sensitive domains such as healthcare. We argue that machine learning algorithms should be interpretable by design and that the language in which these interpretations are expressed should be domain- and task-dependent. Consequently, we base our model's prediction on a family of user-defined and task-specific binary functions of the data, each having a clear interpretation to the end-user. We then minimize the expected number of queries needed for accurate prediction on any given input. As the solution is generally intractable, following prior work, we choose the queries sequentially based on information gain. However, in contrast to previous work, we need not assume the queries are conditionally independent. Instead, we leverage a stochastic generative model (VAE) and an MCMC algorithm (Unadjusted Langevin) to select the most informative query about the input based on previous query-answers. This enables the online determination of a query chain of whatever depth is required to resolve prediction ambiguities. Finally, experiments on vision and NLP tasks demonstrate the efficacy of our approach and its superiority over post-hoc explanations.

17.
bioRxiv ; 2023 Apr 07.
Article in English | MEDLINE | ID: mdl-37383947

ABSTRACT

Accurate identification of cell classes across the tissues of living organisms is central in the analysis of growing atlases of single-cell RNA sequencing (scRNA-seq) data across biomedicine. Such analyses are often based on the existence of highly discriminating "marker genes" for specific cell classes which enables a deeper functional understanding of these classes as well as their identification in new, related datasets. Currently, marker genes are defined by methods that serially assess the level of differential expression (DE) of individual genes across landscapes of diverse cells. This serial approach has been extremely useful, but is limited because it ignores possible redundancy or complementarity across genes, that can only be captured by analyzing several genes at the same time. We wish to identify discriminating panels of genes. To efficiently explore the vast space of possible marker panels, leverage the large number of cells often sequenced, and overcome zero-inflation in scRNA-seq data, we propose viewing panel selection as a variation of the "minimal set-covering problem" in combinatorial optimization which can be solved with integer programming. In this formulation, the covering elements are genes, and the objects to be covered are cells of a particular class, where a cell is covered by a gene if that gene is expressed in that cell. Our method, CellCover, identifies a panel of marker genes in scRNA-seq data that covers one class of cells within a population. We apply this method to generate covering marker gene panels which characterize cells of the developing mouse neocortex as postmitotic neurons are generated from neural progenitor cells (NPCs). 
We show that CellCover captures cell class-specific signals distinct from those defined by DE methods and that CellCover's compact gene panels can be expanded to explore cell type specific function. Transfer learning experiments exploring these covering panels across in vivo mouse, primate, and human scRNA-seq datasets demonstrate that CellCover identifies markers of conserved cell classes in neurogenesis, as well as markers of temporal progression in the molecular identity of these cell types across development of the mammalian neocortex. The gene covering panels we identify across cell types and developmental time can be freely explored in visualizations across all the public data we use in this report with NeMo Analytics [1] through https://nemoanalytics.org/p?l=CellCover . The code for CellCover is written in R and the Gurobi R interface and is available at [2].
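
The covering formulation in this abstract (genes as covering elements, cells of one class as the objects to be covered) can be illustrated with a small Python sketch. Note the substitution: the abstract solves a variation of minimal set covering with integer programming, whereas the sketch below uses the standard greedy heuristic for plain set cover, which is simpler but not guaranteed to find the smallest panel. The function name `greedy_cover`, the gene names and the cell labels are all hypothetical, and this is not the CellCover code.

```python
def greedy_cover(expressed_in, cells):
    """Greedy approximation to a minimal covering panel of genes.

    expressed_in: mapping from gene to the set of cells expressing it.
    cells:        the cells of the target class that should be covered.
    Returns (panel, uncovered): the chosen genes in selection order and
    any cells that no candidate gene covers.
    """
    uncovered = set(cells)
    panel = []
    while uncovered:
        # Pick the gene covering the most still-uncovered cells;
        # ties are broken alphabetically to keep the result deterministic.
        gene = max(sorted(expressed_in),
                   key=lambda g: len(expressed_in[g] & uncovered))
        gained = expressed_in[gene] & uncovered
        if not gained:
            break  # the remaining cells express none of the candidate genes
        panel.append(gene)
        uncovered -= gained
    return panel, uncovered


# Hypothetical binarized expression for five cells of one class.
expressed_in = {
    "g1": {"c1", "c2", "c3"},
    "g2": {"c3", "c4"},
    "g3": {"c4"},
}
panel, uncovered = greedy_cover(expressed_in, ["c1", "c2", "c3", "c4", "c5"])
print(panel, uncovered)
```

Here the cell `c5` expresses none of the candidate genes and is reported as uncovered, a toy version of the zero-inflation issue the abstract mentions.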

18.
iScience ; 26(3): 106108, 2023 Mar 17.
Article in English | MEDLINE | ID: mdl-36852282

ABSTRACT

Many gene signatures have been developed by applying machine learning (ML) to omics profiles; however, their clinical utility is often hindered by limited interpretability and unstable performance. Here, we show the importance of embedding prior biological knowledge in the decision rules yielded by ML approaches to build robust classifiers. We tested this by applying different ML algorithms to gene expression data to predict three difficult cancer phenotypes: bladder cancer progression to muscle-invasive disease, response to neoadjuvant chemotherapy in triple-negative breast cancer, and prostate cancer metastatic progression. We developed two sets of classifiers: mechanistic, by restricting the training to features capturing specific biological mechanisms; and agnostic, in which the training did not use any a priori biological information. Mechanistic models had similar or better testing performance than their agnostic counterparts, with enhanced interpretability. Our findings support the use of biological constraints to develop robust gene signatures with high translational potential.

19.
PLoS Comput Biol ; 6(5): e1000792, 2010 May 27.
Article in English | MEDLINE | ID: mdl-20523739

ABSTRACT

A powerful way to separate signal from noise in biology is to convert the molecular data from individual genes or proteins into an analysis of comparative biological network behaviors. One of the limitations of previous network analyses is that they do not take into account the combinatorial nature of gene interactions within the network. We report here a new technique, Differential Rank Conservation (DIRAC), which permits one to assess these combinatorial interactions to quantify various biological pathways or networks in a comparative sense, and to determine how they change in different individuals experiencing the same disease process. This approach is based on the relative expression values of participating genes, i.e., the ordering of expression within network profiles. DIRAC provides quantitative measures of how network rankings differ either among networks for a selected phenotype or among phenotypes for a selected network. We examined disease phenotypes including cancer subtypes and neurological disorders and identified networks that are tightly regulated, as defined by high conservation of transcript ordering. Interestingly, we observed a strong trend toward looser network regulation in more malignant phenotypes and later stages of disease. At a sample level, DIRAC can detect a change in ranking between phenotypes for any selected network. Variably expressed networks represent statistically robust differences between disease states and serve as signatures for accurate molecular classification, validating the information about expression patterns captured by DIRAC. Importantly, DIRAC can be applied not only to transcriptomic data, but to any ordinal data type.
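
One way to make "conservation of transcript ordering" concrete is sketched below. It is a simplified reading of the abstract, not the authors' code. It assumes that the ordering of a network is summarized by pairwise comparisons between its genes, that a phenotype's consensus ordering is a per-pair majority vote, and that conservation is the average agreement of samples with that consensus. The function names and the toy expression values are hypothetical.

```python
from itertools import combinations


def pair_orderings(profile):
    """For each gene pair (i, j) with i < j, record whether expr[i] < expr[j]."""
    return tuple(profile[i] < profile[j]
                 for i, j in combinations(range(len(profile)), 2))


def rank_template(profiles):
    """Consensus ordering: per-pair strict majority vote across samples."""
    orderings = [pair_orderings(p) for p in profiles]
    n = len(orderings)
    return tuple(2 * sum(column) > n for column in zip(*orderings))


def matching_score(profile, template):
    """Fraction of gene pairs in one sample that agree with the template."""
    observed = pair_orderings(profile)
    return sum(o == t for o, t in zip(observed, template)) / len(template)


def conservation_index(profiles):
    """Average matching score of a phenotype's samples against its own template."""
    template = rank_template(profiles)
    return sum(matching_score(p, template) for p in profiles) / len(profiles)


# Toy 3-gene network: each inner list is one sample's expression (hypothetical).
tight = [[1, 2, 3], [2, 3, 4], [1, 3, 5]]             # same ordering in every sample
loose = [[1, 2, 3], [3, 2, 1], [2, 3, 1], [1, 3, 2]]  # ordering varies by sample
print(conservation_index(tight), conservation_index(loose))
```

Under these assumptions a tightly regulated network scores close to 1, while a network whose gene ordering varies from sample to sample scores lower. Because only orderings are compared, the same sketch applies to any ordinal measurement.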


Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Gene Regulatory Networks , Cluster Analysis , Databases, Factual , Humans , Neoplasms/genetics , Phenotype , Reproducibility of Results , Signal Transduction
20.
PLoS One ; 16(4): e0249002, 2021.
Article in English | MEDLINE | ID: mdl-33819273

ABSTRACT

Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This framework differs from existing omics data analysis methods in that it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample-level analysis, and is applicable to many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with data from The Cancer Genome Atlas.
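
The univariate ternary coding described here can be sketched in a few lines. This is not the divergence package's API or its method of estimating the baseline. As a loudly simplified assumption, the baseline range of each feature is taken to be the minimum and maximum observed in the baseline population. The function names `baseline_ranges` and `ternary_code` and the toy values are hypothetical.

```python
def baseline_ranges(baseline):
    """Per-feature (low, high) range from a baseline population.

    baseline: list of profiles, each a list of values of equal length.
    Assumption: the range is simply the observed min and max per feature.
    """
    return [(min(column), max(column)) for column in zip(*baseline)]


def ternary_code(profile, ranges):
    """Code each entry as -1 (below baseline), 0 (within), or +1 (above)."""
    return [-1 if x < low else (1 if x > high else 0)
            for x, (low, high) in zip(profile, ranges)]


# Toy data: three baseline samples and one test sample over three features.
baseline = [
    [1.0, 5.0, 0.2],
    [1.2, 5.5, 0.1],
    [0.9, 4.8, 0.3],
]
ranges = baseline_ranges(baseline)
codes = ternary_code([2.0, 5.1, 0.0], ranges)
print(codes)
```

Taking the absolute value of each code would give a binary version that only marks whether an entry diverges from the baseline at all.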


Subjects
Genomics/methods , Software , Databases, Genetic , Humans , Neoplasms/genetics