ABSTRACT
MHC-I molecules expose the intracellular protein content on the cell surface, allowing T cells to detect foreign or mutated peptides. The combination of six MHC-I alleles each individual carries defines the sub-peptidome that can be effectively presented. We applied this concept to human cancer, hypothesizing that oncogenic mutations could arise in gaps in personal MHC-I presentation. To validate this hypothesis, we developed and applied a residue-centric patient presentation score to 9,176 cancer patients across 1,018 recurrent oncogenic mutations. We found that patient MHC-I genotype-based scores could predict which mutations were more likely to emerge in their tumor. Accordingly, poor presentation of a mutation across patients was correlated with higher frequency among tumors. These results support that MHC-I genotype-restricted immunoediting during tumor formation shapes the landscape of oncogenic mutations observed in clinically diagnosed tumors and paves the way for predicting personal cancer susceptibilities from knowledge of MHC-I genotype.
Subjects
Antigen Presentation , Histocompatibility Antigens Class I/genetics , Histocompatibility Antigens Class I/immunology , Mutation , Neoplasms/immunology , Tumor Cell Line , Computer Simulation , Female , HeLa Cells , Humans , Male , Immunologic Monitoring , Proteome
ABSTRACT
Compensatory proliferation triggered by hepatocyte loss is required for liver regeneration and maintenance but also promotes development of hepatocellular carcinoma (HCC). Despite extensive investigation, the cells responsible for hepatocyte restoration or HCC development remain poorly characterized. We used genetic lineage tracing to identify cells responsible for hepatocyte replenishment following chronic liver injury and queried their roles in three distinct HCC models. We found that a pre-existing population of periportal hepatocytes, located in the portal triads of healthy livers and expressing low amounts of Sox9 and other bile-duct-enriched genes, undergoes extensive proliferation and replenishes liver mass after chronic hepatocyte-depleting injuries. Despite their high regenerative potential, these so-called hybrid hepatocytes do not give rise to HCC in chronically injured livers and thus represent a unique way to restore tissue function and avoid tumorigenesis. This specialized set of pre-existing differentiated cells may be highly suitable for cell-based therapy of chronic hepatocyte-depleting disorders.
Subjects
Hepatocytes/transplantation , Liver/cytology , Liver/physiology , Animals , Bile Ducts/cytology , Cell Proliferation , Cell Transplantation/methods , Hepatocytes/classification , Hepatocytes/cytology , Liver/injuries , Liver Neoplasms , Mice , Regeneration , SOX9 Transcription Factor/genetics , Transcriptome
ABSTRACT
Bio-oils are precursors for biofuels but are highly corrosive, necessitating further upgrading. Furthermore, bio-oil samples are highly complex and represent a broad range of chemistries. They are complex mixtures not simply because of the large number of poly-oxygenated compounds but because each composition can comprise many isomers with multiple functional groups. The use of hyphenated ultrahigh-resolution mass spectrometry affords the ability to separate isomeric species of complex mixtures. Here, we present, for the first time, the use of this powerful analytical technique combined with chemical reactivity to gain greater insights into the reactivity of the individual isomeric species of bio-oils. A pyrolysis bio-oil and its esterified counterpart were analyzed using gas chromatography coupled to Fourier transform ion cyclotron resonance mass spectrometry, and in-house software (KairosMS) was used for fast comparison of the hyphenated data sets. The data revealed a total of 10,368 isomers in the pyrolysis bio-oil and an increase to 18,827 isomers after esterification. Furthermore, the comparison of the isomeric distribution before and after esterification sheds new light on the reactivities within these complex mixtures; these reactivities would be expected to correspond with carboxylic acid, aldehyde, and ketone functional groups. Using this approach, it was possible to reveal the increased chemical complexity of bio-oils after upgrading and to target the detection of valuable compounds within the bio-oils. The combination of chemical reactions with in-depth molecular characterization opens a new window for understanding the chemistry and reactivity of complex mixtures.
Subjects
Plant Oils , Polyphenols , Biofuels/analysis , Biomass , Complex Mixtures , Gas Chromatography-Mass Spectrometry , Hot Temperature , Plant Oils/chemistry , Polyphenols/chemistry
ABSTRACT
The use of hyphenated Fourier transform mass spectrometry (FTMS) methods affords additional information about complex chemical mixtures. Coeluted components can be resolved thanks to the ultrahigh resolving power, which also allows extracted ion chromatograms (EICs) to be used for the observation of isomers. As such data sets can be large and data analyses laborious, improved tools are needed for data analyses and extraction of key information. The typical workflow for this type of data is based upon manually dividing the total ion chromatogram (TIC) into several windows of usually equal retention time, averaging the signal of each window to create a single mass spectrum, extracting a peak list, performing the compositional assignments, visualizing the results, and repeating the process for each window. Through removal of the need to manually divide a data set into many time windows and analyze each one, a time-consuming workflow has been significantly simplified. An environmental sample from the oil sands region of Alberta, Canada, and dissolved organic matter samples from the Suwannee River Fulvic Acid (SRFA) and marine waters (Marine DOM) were used as a test bed for the new method. A complete solution named KairosMS was developed in the R language utilizing the Tidyverse packages and Shiny for the user interface. KairosMS imports raw data from common file types, processes it, and exports a mass list for compositional assignments. KairosMS then incorporates those assignments for analysis and visualization. The present method increases the computational speed while reducing the manual work of the analysis when compared to other current methods. The algorithm subsequently incorporates the assignments into the processed data set, generating a series of interactive plots, EICs for individual components or entire compound classes, and can export raw data or graphics for off-line use. 
Using petroleum-related data as an example, the results are then visualized according to heteroatom class, carbon number, double bond equivalents, and retention time. The algorithm also makes it possible to screen for isomeric contributions and to follow homologous series or compound classes, instead of individual components, as a function of time.
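The manual workflow described in this abstract (dividing the chromatogram into equal retention-time windows and averaging each window into one spectrum) can be illustrated with a minimal Python sketch. This is not KairosMS code, which is written in R; the function name `average_windows` and the scan layout `(retention_time, {mz: intensity})` are assumptions made purely for illustration.

```python
from collections import defaultdict


def average_windows(scans, n_windows):
    """Split scans into equal retention-time windows and average each one.

    scans: list of (retention_time, {mz: intensity}) tuples (hypothetical layout).
    Returns a list of (window_start, window_end, {mz: mean_intensity}).
    """
    times = [rt for rt, _ in scans]
    t_min, t_max = min(times), max(times)
    # Fall back to a width of 1.0 if every scan shares the same time.
    width = (t_max - t_min) / n_windows or 1.0
    sums = [defaultdict(float) for _ in range(n_windows)]
    counts = [0] * n_windows
    for rt, spectrum in scans:
        # Clamp the index so the final scan lands in the last window.
        idx = min(int((rt - t_min) / width), n_windows - 1)
        counts[idx] += 1
        for mz, intensity in spectrum.items():
            sums[idx][mz] += intensity
    result = []
    for i in range(n_windows):
        start = t_min + i * width
        # Average over all scans in the window; an absent peak counts as zero.
        averaged = {mz: total / counts[i] for mz, total in sums[i].items()}
        result.append((start, start + width, averaged))
    return result


scans = [
    (0.0, {100.0: 2.0}),
    (1.0, {100.0: 4.0}),
    (9.0, {100.0: 10.0, 200.0: 6.0}),
    (10.0, {200.0: 2.0}),
]
windows = average_windows(scans, n_windows=2)
```

Each averaged spectrum would then go through peak picking and compositional assignment, which is the repeated per-window step the abstract says the new method removes.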
ABSTRACT
Fourier transform ion cyclotron resonance mass spectrometry (FTICR MS) provides the resolution and mass accuracy needed to analyze complex mixtures such as crude oil. When mixtures contain many different components, a competitive effect within the ICR cell takes place that hampers the detection of a potentially large fraction of the components. Recently, a new data collection technique, which consists of acquiring several spectra of small mass ranges and assembling a complete spectrum afterward, enabled the observation of a record number of peaks with greater accuracy compared to broadband methods. There is a need for statistical methods to combine and preprocess segmented acquisition data. A particular challenge of quadrupole isolation is that near the window edges there is a drop in intensity, hampering the stitching of consecutive windows. We developed an algorithm called Rhapso to stitch peak lists corresponding to multiple different m/z regions from crude oil samples. Rhapso corrects potential edge effects to enable the use of smaller windows and reduce the required overlap between windows, corrects mass shifts between windows, and generates a single peak list for the full spectrum. Relative to a stitching performed manually, Rhapso increased the data processing speed and avoided potential human errors, simplifying the subsequent chemical analysis of the sample. Relative to a broadband spectrum, the stitched output showed an over 2-fold increase in assigned peaks and reduced mass error by a factor of 2. Rhapso is expected to enable routine use of this spectral stitching method for ultracomplex samples, giving a more detailed characterization of existing samples and enabling the characterization of samples that were previously too complex to analyze.
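The idea of stitching peak lists from overlapping m/z windows while avoiding the low-intensity window edges can be sketched as follows. This is only an illustration under simple assumptions, not the Rhapso algorithm: it cuts each overlap at its midpoint and does not correct edge intensities or mass shifts. The function name `stitch_segments` and the segment layout are hypothetical.

```python
def stitch_segments(segments):
    """Merge peak lists from overlapping m/z windows into one list.

    segments: list of (mz_low, mz_high, [(mz, intensity), ...]) tuples
    (hypothetical layout). Each overlap is cut at its midpoint, so peaks
    near a window edge are taken from the neighbouring window instead.
    """
    ordered = sorted(segments, key=lambda seg: seg[0])
    # One cut point per pair of neighbouring windows.
    cuts = [(prev[1] + nxt[0]) / 2 for prev, nxt in zip(ordered, ordered[1:])]
    stitched = []
    for i, (low, high, peaks) in enumerate(ordered):
        keep_from = cuts[i - 1] if i > 0 else low
        keep_to = cuts[i] if i < len(cuts) else high
        is_last = i == len(ordered) - 1
        for mz, intensity in peaks:
            # Half-open interval, except the last window keeps its upper bound.
            if keep_from <= mz < keep_to or (is_last and mz == keep_to):
                stitched.append((mz, intensity))
    return sorted(stitched)


segments = [
    (100.0, 110.0, [(101.0, 5.0), (104.0, 7.0), (109.0, 1.0)]),
    (105.0, 115.0, [(106.0, 0.5), (109.0, 6.0), (114.0, 3.0)]),
]
full_spectrum = stitch_segments(segments)
```

In this toy input the peak at m/z 109 is weak in the first window (near its upper edge) and strong in the second, so the stitched list keeps the second window's value.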
ABSTRACT
Fourier transform ion cyclotron resonance mass spectrometry affords the resolving power to determine an unprecedented number of components in complex mixtures, such as petroleum. The software tools required to analyze these data struggle to keep pace with advancing instrument capabilities and increasing quantities of data, particularly in terms of combining information efficiently across multiple replicates. Improved confidence in data and the use of replicates is particularly important where strategic decisions will be based upon the analysis. We present a new algorithm named Themis, developed using R, to jointly preprocess replicate measurements of a sample with the aim of improving consistency as a preliminary step to assigning peaks to chemical compositions. The main features of the algorithm are quality control criteria to detect failed runs, ensuring comparable magnitudes across replicates, peak alignment, and the use of an adaptive mixture model-based strategy to help distinguish true peaks from noise. The algorithm outputs a list of peaks reliably observed across replicates and facilitates data handling by preprocessing all replicates in a single step. The processed data produced by our algorithm can subsequently be analyzed by use of relevant specialized software. While Themis has been demonstrated with petroleum as an example of a complex mixture, its basic framework will be useful for complex samples arising from a variety of other applications.
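As a rough illustration of what "peaks reliably observed across replicates" can mean, the sketch below pools peaks from all replicates, groups neighbours that lie within a ppm tolerance, and keeps groups seen in a minimum number of runs. It is an assumption-laden stand-in, not the Themis method, which is written in R and uses quality control and a mixture model; the function name `consensus_peaks` and its parameters are hypothetical.

```python
def consensus_peaks(replicates, tol_ppm=1.0, min_replicates=2):
    """Return mean m/z of peak groups observed in enough replicates.

    replicates: list of lists of m/z values, one list per replicate run.
    Peaks are chained into a group while the gap to the previous peak
    stays within tol_ppm (a simple greedy rule, chosen for illustration).
    """
    pooled = sorted(
        (mz, rep) for rep, peaks in enumerate(replicates) for mz in peaks
    )
    clusters, current = [], []
    for mz, rep in pooled:
        if current:
            prev_mz = current[-1][0]
            gap_ppm = (mz - prev_mz) / prev_mz * 1e6
            if gap_ppm > tol_ppm:
                clusters.append(current)
                current = []
        current.append((mz, rep))
    if current:
        clusters.append(current)
    consensus = []
    for cluster in clusters:
        # Count distinct replicates, not peaks, contributing to the group.
        if len({rep for _, rep in cluster}) >= min_replicates:
            consensus.append(sum(mz for mz, _ in cluster) / len(cluster))
    return consensus


replicates = [
    [200.0000, 300.0000],
    [200.0001, 350.0000],
    [200.00005, 300.0002],
]
reliable = consensus_peaks(replicates, tol_ppm=1.0, min_replicates=2)
```

Here the peak near m/z 350 appears in a single replicate and is dropped, while the groups near m/z 200 and 300 are kept.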
ABSTRACT
MOTIVATION: Designing an RNA-seq study depends critically on its specific goals, technology and underlying biology, which renders general guidelines inadequate. We propose a Bayesian framework to customize experiments so that goals can be attained and resources are not wasted, with a focus on alternative splicing. RESULTS: We studied how read length, sequencing depth, library preparation and the number of replicates affect the cost-effectiveness of single-sample and group comparison studies. Optimal settings varied strongly according to the target organism or tissue (potential 50-500% cost cuts) and, interestingly, short reads outperformed long reads for standard analyses. Our framework learns key characteristics for study design from the data, and predicts if and how to continue experimentation. These predictions matched several follow-up experimental datasets that were used for validation. We provide default pipelines, but the framework can be combined with other data analysis methods and can help assess their relative merits. AVAILABILITY AND IMPLEMENTATION: casper package at www.bioconductor.org/packages/release/bioc/html/casper.html, Supplementary Manual by typing casperDesign() at the R prompt. CONTACT: rosselldavid@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Alternative Splicing/genetics , Guidelines as Topic , Research Design , RNA Sequence Analysis/methods , Algorithms , Animals , Bayes Theorem , Computer Simulation , Humans , Mice , Software
ABSTRACT
Development of tools to jointly visualize the genome and the epigenome remains a challenge. chroGPS is a computational approach that addresses this question. chroGPS uses multidimensional scaling techniques to represent similarity between epigenetic factors, or between genetic elements on the basis of their epigenetic state, in 2D/3D reference maps. We emphasize biological interpretability, statistical robustness, integration of genetic and epigenetic data from heterogeneous sources, and computational feasibility. Although chroGPS is a general methodology to create reference maps and study the epigenetic state of any class of genetic element or genomic region, we focus on two specific kinds of maps: chroGPS(factors), which visualizes functional similarities between epigenetic factors, and chroGPS(genes), which describes the epigenetic state of genes and integrates gene expression and other functional data. We use data from the modENCODE project on the genomic distribution of a large collection of epigenetic factors in Drosophila, a model system extensively used to study genome organization and function. Our results show that the maps allow straightforward visualization of relationships between factors and elements, capturing relevant information about their functional properties that helps to interpret epigenetic information in a functional context and derive testable hypotheses.
Subjects
Chromatin/metabolism , Genetic Epigenesis , Epigenomics/methods , Software , Animals , Cell Line , Computer Graphics , Drosophila/genetics , Gene Expression , Insect Genes , Signal Transduction/genetics
ABSTRACT
In high-throughput experiments, the sample size is typically chosen informally. Most formal sample-size calculations depend critically on prior knowledge. We propose a sequential strategy that, by updating knowledge when new data are available, depends less critically on prior assumptions. Experiments are stopped or continued based on the potential benefits in obtaining additional data. The underlying decision-theoretic framework guarantees the design to proceed in a coherent fashion. We propose intuitively appealing, easy-to-implement utility functions. As in most sequential design problems, an exact solution is prohibitive. We propose a simulation-based approximation that uses decision boundaries. We apply the method to RNA-seq, microarray, and reverse-phase protein array studies and show its potential advantages. The approach has been added to the Bioconductor package gaga.
Subjects
Statistical Data Interpretation , High-Throughput Nucleotide Sequencing/methods , Research Design , Sample Size , Computer Simulation , Humans , Microarray Analysis/methods , Protein Array Analysis , RNA Sequence Analysis/methods
ABSTRACT
H3K4me3 is a histone modification that accumulates at the transcription-start site (TSS) of active genes and is known to be important for transcription activation. The way in which H3K4me3 is regulated at TSS and the actual molecular basis of its contribution to transcription remain largely unanswered. To address these questions, we have analyzed the contribution of dKDM5/LID, the main H3K4me3 demethylase in Drosophila, to the regulation of the pattern of H3K4me3. ChIP-seq results show that, at developmental genes, dKDM5/LID localizes at TSS and regulates H3K4me3. dKDM5/LID target genes are highly transcribed and enriched in active RNApol II and H3K36me3, suggesting a positive contribution to transcription. Expression profiling shows that dKDM5/LID target genes are significantly, though weakly, downregulated upon dKDM5/LID depletion. Furthermore, dKDM5/LID depletion results in decreased RNApol II occupancy, particularly by the promoter-proximal Pol IIo(ser5) form. Our results also show that ASH2, an evolutionarily conserved factor that locates at TSS and is required for H3K4me3, binds and positively regulates dKDM5/LID target genes. However, dKDM5/LID and ASH2 do not bind simultaneously and recognize different chromatin states, enriched in H3K4me3 and not, respectively. These results indicate that, at developmental genes, dKDM5/LID and ASH2 coordinately regulate H3K4me3 at TSS and that this dynamic regulation contributes to transcription.
Subjects
Drosophila Proteins/metabolism , Developmental Gene Expression Regulation , Histone-Lysine N-Methyltransferase/metabolism , Histones/metabolism , Transcription Initiation Site , Genetic Transcription , Animals , Cell Line , Drosophila/enzymology , Drosophila/genetics , Drosophila/metabolism , Histone Demethylases , Nuclear Proteins/metabolism , Transcription Factors/metabolism
ABSTRACT
UNLABELLED: We provide a Bioconductor package with quality assessment, processing and visualization tools for high-throughput sequencing data, with emphasis on ChIP-seq and RNA-seq studies. It includes detection of outliers and biases, inefficient immuno-precipitation and overamplification artifacts, de novo identification of read-rich genomic regions and visualization of the location and coverage of genomic region lists. AVAILABILITY: www.bioconductor.org.
Subjects
High-Throughput Nucleotide Sequencing/methods , Software , Genomics , Humans , Internet , Oligonucleotide Array Sequence Analysis , Quality Control , Saccharomyces cerevisiae/genetics
ABSTRACT
Misfolded proteins are caused by genomic mutations, aberrant splicing events, translation errors or environmental factors. The accumulation of misfolded proteins is a phenomenon connected to several human disorders, and is managed by stress responses specific to the cellular compartments being affected. In wild-type cells these mechanisms of stress response can be experimentally induced by expressing recombinant misfolded proteins or by incubating cells with large concentrations of amino acid analogues. Here, we report a novel approach for the induction of stress responses to protein aggregation. Our method is based on engineered transfer RNAs that can be expressed in cells or tissues, where they actively integrate in the translation machinery causing general proteome substitutions. This strategy allows for the introduction of mutations of increasing severity randomly in the proteome, without exposing cells to unnatural compounds. Here, we show that this approach can be used for the differential activation of the stress response in the Endoplasmic Reticulum (ER). As an example of the applications of this method, we have applied it to the identification of human microRNAs activated or repressed during unfolded protein stress.
Subjects
Proteome/genetics , Serine Transfer RNA/chemistry , Unfolded Protein Response/genetics , Animals , Cell Growth Processes , Cell Line , Cell Survival , Chick Embryo , Statistical Data Interpretation , Humans , MicroRNAs/classification , MicroRNAs/metabolism , Site-Directed Mutagenesis , Mutation , Protein Biosynthesis , Serine Transfer RNA/metabolism
ABSTRACT
MHC class I proteins present intracellular peptides on the cell's surface, enabling the immune system to recognize tumor-specific neoantigens of early neoplastic cells and eliminate them before the tumor develops further. However, variability in peptide-MHC-I affinity results in variable presentation of oncogenic peptides, leading to variable likelihood of immune evasion across both individuals and mutations. Since the major determinant of peptide-MHC-I affinity in patients is individual MHC-I genotype, we developed a residue-centric presentation score taking both mutated residues and MHC-I genotype into account and hypothesized that high scores (which correspond to poor presentation) would correlate to high mutation frequencies within tumors. We applied our scoring system to 9176 tumor samples from TCGA across 1018 recurrent mutations and found that, indeed, presentation scores predicted mutation probability. These findings open the door to more personalized treatment plans based on simple genotyping. Here, we outline the computational tools and statistical methods used to arrive at this conclusion.
Subjects
Computational Biology/methods , Histocompatibility Antigens Class II/genetics , Mutation , Neoplasms/genetics , Genetic Databases , Genetic Predisposition to Disease , Genotyping Techniques , Humans , Likelihood Functions , Mutation Rate , Precision Medicine , Tumor Escape , Exome Sequencing
ABSTRACT
We develop an approach for microarray differential expression analysis, i.e. identifying genes whose expression levels differ between two or more groups. Current approaches to inference rely either on full parametric assumptions or on permutation-based techniques for sampling under the null distribution. In some situations, however, a full parametric model cannot be justified, or the sample size per group is too small for permutation methods to be valid. We propose a semi-parametric framework based on partial mixture estimation which only requires a parametric assumption for the null (equally expressed) distribution and can handle small sample sizes where permutation methods break down. We develop two novel improvements of Scott's minimum integrated square error criterion for partial mixture estimation [Scott, 2004a,b]. As a side benefit, we obtain interpretable and closed-form estimates for the proportion of EE genes. Pseudo-Bayesian and frequentist procedures for controlling the false discovery rate are given. Results from simulations and real datasets indicate that our approach can provide substantial advantages for small sample sizes over the SAM method of Tusher et al. [2001], the empirical Bayes procedure of Efron and Tibshirani [2002], the mixture of normals of Pan et al. [2003] and a t-test with p-value adjustment [Dudoit et al., 2003] to control the FDR [Benjamini and Hochberg, 1995].
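One of the comparison methods named in this abstract is a t-test with p-value adjustment to control the FDR [Benjamini and Hochberg, 1995]. The step-up rule behind that adjustment can be sketched in a few lines of Python. The sketch covers only this baseline, not the semi-parametric partial mixture approach the abstract proposes; the function name `benjamini_hochberg` is chosen here for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag hypotheses rejected by the Benjamini-Hochberg step-up rule.

    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= k/m * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


p_values = [0.041, 0.001, 0.205, 0.008]
flags = benjamini_hochberg(p_values, alpha=0.05)
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the ranking meets its threshold.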
Subjects
Algorithms , Statistical Models , Oligonucleotide Array Sequence Analysis , Bayes Theorem , Computer Simulation , Gene Expression , Sample Size
ABSTRACT
A new strategy has been developed for characterization of the most challenging complex mixtures to date, using a combination of custom-designed experiments and a new data pre-processing algorithm. In contrast to traditional methods, the approach enables operation of Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR MS) with constant ultrahigh resolution at hitherto inaccessible levels (approximately 3 million FWHM, independent of m/z). The approach, referred to as OCULAR, makes it possible to analyze samples that were previously too complex, even for high field FT-ICR MS instrumentation. Previous FT-ICR MS studies have typically spanned a broad mass range with decreasing resolving power (inversely proportional to m/z) or have used a single, very narrow m/z range to produce data of enhanced resolving power; both methods are of limited effectiveness for complex mixtures spanning a broad mass range, however. To illustrate the enhanced performance due to OCULAR, we show how a record number of unique molecular formulae (244 779 elemental compositions) can be assigned in a single, non-distillable petroleum fraction without the aid of chromatography or dissociation (MS/MS) experiments. The method is equally applicable to other areas of research, can be used with both high field and low field FT-ICR MS instruments to enhance their performance, and represents a step-change in the ability to analyze highly complex samples.
ABSTRACT
Bayesian variable selection often assumes normality, but the effects of model misspecification are not sufficiently understood. There are sound reasons behind this assumption, particularly for large p: ease of interpretation, analytical and computational convenience. More flexible frameworks exist, including semi- or non-parametric models, often at the cost of some tractability. We propose a simple extension that allows for skewness and thicker-than-normal tails but preserves tractability. It leads to easy interpretation and a log-concave likelihood that facilitates optimization and integration. We characterize asymptotically parameter estimation and Bayes factor rates, under certain model misspecification. Under suitable conditions misspecified Bayes factors induce sparsity at the same rates as under the correct model. However, the rates to detect signal change by an exponential factor, often reducing sensitivity. These deficiencies can be ameliorated by inferring the error distribution, a simple strategy that can improve inference substantially. Our work focuses on the likelihood and can be combined with any likelihood penalty or prior, but here we focus on non-local priors to induce extra sparsity and ameliorate finite-sample effects caused by misspecification. We show the importance of considering the likelihood rather than solely the prior, for Bayesian variable selection. The methodology is in R package 'mombf'.
ABSTRACT
Jointly achieving parsimony and good predictive power in high dimensions is a main challenge in statistics. Non-local priors (NLPs) possess appealing properties for model choice, but their use for estimation has not been studied in detail. We show that for regular models NLP-based Bayesian model averaging (BMA) shrinks spurious parameters either at fast polynomial or quasi-exponential rates as the sample size n increases, while non-spurious parameter estimates are not shrunk. We extend some results to linear models with dimension p growing with n. Coupled with our theoretical investigations, we outline the constructive representation of NLPs as mixtures of truncated distributions that enables simple posterior sampling and extending NLPs beyond previous proposals. Our results show notable high-dimensional estimation for linear models with p ≫ n at low computational cost. NLPs provided lower estimation error than benchmark and hyper-g priors, SCAD and LASSO in simulations, and in gene expression data achieved higher cross-validated R² with fewer predictors. Remarkably, these results were obtained without pre-screening variables. Our findings contribute to the debate of whether different priors should be used for estimation and model selection, showing that selection priors may actually be desirable for high-dimensional estimation.
ABSTRACT
Big Data brings unprecedented power to address scientific, economic and societal issues, but also amplifies the possibility of certain pitfalls. These include using purely data-driven approaches that disregard understanding the phenomenon under study, aiming at a dynamically moving target, ignoring critical data collection issues, summarizing or preprocessing the data inadequately and mistaking noise for signal. We review some success stories and illustrate how statistical principles can help obtain more reliable information from data. We also touch upon current challenges that require active methodological research, such as strategies for efficient computation, integration of heterogeneous data, extending the underlying theory to increasingly complex questions and, perhaps most importantly, training a new generation of scientists to develop and deploy these strategies.
ABSTRACT
Recent molecular classifications of colorectal cancer (CRC) based on global gene expression profiles have defined subtypes displaying resistance to therapy and poor prognosis. Upon evaluation of these classification systems, we discovered that their predictive power arises from genes expressed by stromal cells rather than epithelial tumor cells. Bioinformatic and immunohistochemical analyses identify stromal markers that associate robustly with disease relapse across the various classifications. Functional studies indicate that cancer-associated fibroblasts (CAFs) increase the frequency of tumor-initiating cells, an effect that is dramatically enhanced by transforming growth factor (TGF)-β signaling. Likewise, we find that all poor-prognosis CRC subtypes share a gene program induced by TGF-β in tumor stromal cells. Using patient-derived tumor organoids and xenografts, we show that the use of TGF-β signaling inhibitors to block the cross-talk between cancer cells and the microenvironment halts disease progression.
Subjects
Colorectal Neoplasms/diagnosis , Colorectal Neoplasms/genetics , Fibroblasts/metabolism , Neoplastic Stem Cells/metabolism , Animals , Cluster Analysis , Colorectal Neoplasms/classification , Colorectal Neoplasms/pathology , Fibroblasts/pathology , Neoplastic Gene Expression Regulation , HT29 Cells , Humans , Mice , Nude Mice , Microarray Analysis , Neoplasm Invasiveness , Neoplasm Metastasis , Neoplastic Stem Cells/pathology , Prognosis , Stromal Cells/metabolism , Stromal Cells/pathology , Transcriptome
ABSTRACT
Efforts to compile the phenotypic effects of drugs and environmental chemicals offer the opportunity to adopt a chemo-centric view of human health that does not require detailed mechanistic information. Here we consider thousands of chemicals and analyse the relationship of their structures with adverse and therapeutic responses. Our study includes molecules related to the aetiology of 934 health-threatening conditions and used to treat 835 diseases. We first identify chemical moieties that could be independently associated with each phenotypic effect. Using these fragments, we build accurate predictors for approximately 400 clinical phenotypes, finding many privileged and liable structures. Finally, we connect two diseases if they relate to similar chemical structures. The resulting networks of human conditions are able to predict disease comorbidities, as well as identifying potential drug side effects and opportunities for drug repositioning, and show a remarkable coincidence with clinical observations.