ABSTRACT
Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
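The residualize-then-reintegrate idea can be conveyed with a short sketch. The code below is not the authors' CRC implementation (in particular, the cross-residualization and ensembling details are omitted); it assumes a samples-by-genes matrix Y and a binary label y, uses top principal components as the latent factors, and uses glmnet for the two component classifiers.

```r
# Minimal sketch of residualize-then-reintegrate classification (not the
# authors' CRC; cross-residualization and tuning details are omitted).
library(glmnet)

crc_sketch <- function(Y, y, k = 5) {
  Yc <- scale(Y, scale = FALSE)                  # center each gene
  W  <- svd(Yc)$u[, 1:k, drop = FALSE]           # dense latent variation (top PCs)
  R  <- Yc - W %*% crossprod(W, Yc)              # residuals: latent part removed
  fit_r <- cv.glmnet(R, y, family = "binomial")  # classifier on residuals
  fit_w <- cv.glmnet(W, y, family = "binomial")  # classifier on latent factors
  # Reintegrate: ensemble the two predicted class probabilities.
  p <- (predict(fit_r, R, type = "response") +
        predict(fit_w, W, type = "response")) / 2
  as.numeric(p)
}
```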
Subjects
Algorithms, Genomics, Genomics/methods, Humans
ABSTRACT
MOTIVATION: High-throughput fluorescence microscopy is a popular class of techniques for studying tissues and cells through automated imaging and feature extraction of hundreds to thousands of samples. Like other high-throughput assays, these approaches can suffer from unwanted noise and technical artifacts that obscure the biological signal. In this work, we consider how an experimental design incorporating multiple levels of replication enables the removal of technical artifacts from such image-based platforms. RESULTS: We develop a general approach to remove technical artifacts from high-throughput image data that leverages an experimental design with multiple levels of replication. To illustrate the methods, we consider microenvironment microarrays (MEMAs), a high-throughput platform designed to study cellular responses to microenvironmental perturbations. In application to MEMAs, our approach removes unwanted spatial artifacts and thereby enhances the biological signal. This approach has broad applicability to diverse biological assays. AVAILABILITY AND IMPLEMENTATION: Raw data are on Synapse (syn2862345), analysis code is on GitHub (gjhunt/mema_norm), and a reproducible Docker image is available on Docker Hub (gjhunt/mema_norm). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
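To give the flavor of replication-based artifact removal, here is a toy sketch of one such idea: when replicate arrays share a print layout, averaging the same spot position across replicates estimates a shared spatial artifact that can then be subtracted. This is an illustration only, not the paper's method.

```r
# Toy sketch: estimate a spatial artifact shared across replicate arrays and
# subtract it (illustrative only; the paper's approach is more general).
set.seed(1)
Z <- matrix(rnorm(5 * 700), 5, 700)            # 5 replicate arrays x 700 spots
Z <- sweep(Z, 2, sin(seq_len(700) / 50), "+")  # inject a shared spatial trend
artifact <- colMeans(Z)                        # per-spot estimate of the trend
Z_corrected <- sweep(Z, 2, artifact, "-")      # remove it from every array
```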
Subjects
Artifacts, High-Throughput Screening Assays, Microarray Analysis, Research Design
ABSTRACT
Concerted examination of multiple collections of single-cell RNA sequencing (RNA-seq) data promises further biological insights that cannot be uncovered with individual datasets. Here we present scMerge, an algorithm that integrates multiple single-cell RNA-seq datasets using factor analysis of stably expressed genes and pseudoreplicates across datasets. Using a large collection of public datasets, we benchmark scMerge against published methods and demonstrate that it consistently provides improved cell type separation by removing unwanted factors; scMerge can also enhance biological discovery through robust data integration, which we show through the inference of development trajectory in a liver dataset collection.
Subjects
Meta-Analysis as Topic, RNA Sequence Analysis, Single-Cell Analysis, Software, Algorithms, Animals, Embryonic Development, Factor Analysis, Gene Expression, Humans, Mice
ABSTRACT
The NanoString nCounter gene expression assay uses molecular barcodes and single molecule imaging to detect and count hundreds of unique transcripts in a single reaction. These counts need to be normalized to adjust for the amount of sample, variations in assay efficiency, and other factors. Most users adopt the normalization approach described in the nSolver analysis software, which involves background correction based on the observed values of negative control probes, a within-sample normalization using the observed values of positive control probes, and normalization across samples using reference (housekeeping) genes. Here we present a new normalization method, Removing Unwanted Variation-III (RUV-III), which makes vital use of technical replicates and suitable control genes. We also propose an approach using pseudo-replicates when technical replicates are not available. The effectiveness of RUV-III is illustrated on four different datasets. We also offer suggestions on the design and analysis of studies involving this technology.
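A hedged sketch of the replicate-based workflow follows, assuming the RUVIII() interface of the CRAN ruv package (argument names may differ across versions): Y is a samples-by-genes matrix, M records which rows are technical replicates, and ctl flags the control genes.

```r
# Hedged sketch of RUV-III on toy data (assumes the RUVIII() interface of the
# CRAN 'ruv' package; check the package documentation for exact arguments).
library(ruv)
set.seed(1)
p <- 200
batch <- rep(1:2, each = 3)                 # unwanted batch effect
Y <- matrix(rnorm(6 * p), 6, p) + outer(batch, rnorm(p))
M <- diag(3)[rep(1:3, times = 2), ]         # rows 1&4, 2&5, 3&6 are replicates
ctl <- rep(TRUE, p)                         # toy: treat all genes as controls
Y_norm <- RUVIII(Y, M, ctl, k = 1)          # remove one unwanted factor
```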
Subjects
Gene Expression Profiling/methods, Lung Adenocarcinoma/genetics, Lung Adenocarcinoma/metabolism, Dendritic Cells/metabolism, Humans, Inflammatory Bowel Diseases/genetics, Inflammatory Bowel Diseases/metabolism, Lung Neoplasms/genetics, Lung Neoplasms/metabolism, Lymphocyte Activation/genetics, Single Molecule Imaging
ABSTRACT
MOTIVATION: Cell type composition of tissues is important in many biological processes. To help understand cell type composition using gene expression data, methods of estimating (deconvolving) cell type proportions have been developed. Such estimates are often used to adjust for confounding effects of cell type in differential expression analysis (DEA). RESULTS: We propose dtangle, a new cell type deconvolution method. dtangle works on a range of DNA microarray and bulk RNA-seq platforms. It estimates cell type proportions using publicly available, often cross-platform, reference data. We evaluate dtangle on 11 benchmark datasets, showing that dtangle is competitive with published deconvolution methods, is robust to outliers and to the selection of tuning parameters, and is fast. As a case study, we investigate the human immune response to Lyme disease. dtangle's estimates reveal a temporal trend consistent with previous findings and are important covariates for DEA across disease status. AVAILABILITY AND IMPLEMENTATION: dtangle is available on CRAN (cran.r-project.org/package=dtangle) and on GitHub (dtangle.github.io). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
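To make the estimation problem concrete, the sketch below performs a generic reference-based deconvolution by clipped least squares. This is not dtangle's algorithm (which is marker-gene based); it only illustrates what deconvolving a mixed sample means.

```r
# Generic deconvolution sketch (NOT dtangle's algorithm): express one mixed
# expression profile y as a nonnegative combination of reference profiles R.
deconvolve_ls <- function(y, R) {
  w <- pmax(coef(lm(y ~ R - 1)), 0)  # least squares, negatives clipped to 0
  w / sum(w)                         # renormalize to proportions
}
set.seed(1)
R <- matrix(rexp(100 * 3), 100, 3)                       # 100 genes x 3 cell types
y <- as.numeric(R %*% c(0.6, 0.3, 0.1)) + rnorm(100, sd = 0.1)
round(deconvolve_ls(y, R), 2)      # recovers roughly (0.6, 0.3, 0.1)
```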
Subjects
Software, Humans, Oligonucleotide Array Sequence Analysis
ABSTRACT
BACKGROUND: The vast majority of pediatric distal-third tibial shaft fractures can be treated with closed reduction and casting. If conservative measures fail, then these fractures are usually treated with 2 antegrade flexible intramedullary nails. A postoperative cast is usually applied because of the tenuous fixation of the 2 nails. Recent studies have described the use of 4 nails to increase the stability of the fixation, a technique that may preclude the need for postoperative casting. The purpose of this biomechanical study is to quantify the relative increase in stiffness and load to failure when using 4 versus 2 nails to surgically stabilize these fractures. METHODS: Short, oblique osteotomies were created in the distal third of small fourth-generation tibial sawbones and stabilized with 2 (double) or 4 (quadruple) flexible intramedullary nails. After pilot testing, 5 models per fixation method were tested cyclically in axial compression, torsion, and 4-point bending in valgus and recurvatum. At the end of the study, each model was loaded to failure in valgus. Stiffness values were calculated, and yield points were recorded. The data were compared using Student's t tests. Results are presented as mean±SD. The level of significance was set at P≤0.05. RESULTS: Stiffness in valgus 4-point bending was 624±231 and 336±162 N/mm in the quadruple-nail and double-nail groups, respectively (P=0.04). There were no statistically significant differences in any other mode of testing. CONCLUSIONS: The quadruple-nail construct was almost 2 times as stiff as the double-nail construct in resisting valgus deformation. This provides biomechanical support for a previously published study describing the clinical success of this fixation construct.
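For readers who want to check the headline comparison, a two-sample t-test can be recomputed from the reported summaries (mean±SD, n=5 per group). The authors' exact test variant is assumed rather than known, and rounded summaries mean only approximate agreement with the reported P=0.04 should be expected.

```r
# Welch two-sample t-test recomputed from the published summaries (mean, SD,
# n = 5 per group) for valgus four-point bending; rounding of the summaries
# means this only approximates the reported P = 0.04.
m1 <- 624; s1 <- 231; m2 <- 336; s2 <- 162; n <- 5
se <- sqrt(s1^2 / n + s2^2 / n)
t_stat <- (m1 - m2) / se                            # about 2.28
df <- se^4 / ((s1^2 / n)^2 + (s2^2 / n)^2) * (n - 1)
2 * pt(-abs(t_stat), df)                            # two-sided p, about 0.05
```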
Subjects
Bone Nails, Intramedullary Fracture Fixation/instrumentation, Tibia/injuries, Tibial Fractures/surgery, Biomechanical Phenomena, Child, Diaphyses/injuries, Diaphyses/surgery, Intramedullary Fracture Fixation/methods, Humans, Male, Prosthesis Design, Tibia/surgery, Tibial Fractures/physiopathology
ABSTRACT
When dealing with large-scale gene expression studies, observations are commonly contaminated by sources of unwanted variation such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g., when the goal is to cluster the samples or to build a corrected version of the dataset (as opposed to studying an observed factor of interest), taking unwanted variation into account can become a difficult task. The factors driving unwanted variation may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data. The proposed methods are then evaluated on synthetic data and three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state-of-the-art corrections. All proposed methods are implemented in the Bioconductor package RUVnormalize.
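The way replicate samples reveal unwanted variation can be sketched directly: the difference between two replicates of the same sample cancels the biology and leaves only unwanted variation. The sketch below is illustrative, not the package's estimator, which handles regularization and partial replication more carefully.

```r
# Sketch: estimate unwanted-variation directions from replicate differences
# and project them out (illustrative; see RUVnormalize for the real method).
# Y: samples x genes; pairs: 2-column matrix of replicate row indices.
replicate_correct <- function(Y, pairs, k) {
  D <- Y[pairs[, 1], , drop = FALSE] - Y[pairs[, 2], , drop = FALSE]
  a <- svd(D)$v[, 1:k, drop = FALSE]  # gene-space directions of unwanted variation
  Y - (Y %*% a) %*% t(a)              # project every sample off those directions
}
```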
Subjects
Statistical Data Interpretation, Gene Expression/genetics, Genetic Variation/genetics, Microarray Analysis/methods, Humans
ABSTRACT
Due to their relatively low cost per sample and broad, gene-centric coverage of CpGs across the human genome, Illumina's 450k arrays are widely used in large-scale differential methylation studies. However, by their very nature, large studies are particularly susceptible to the effects of unwanted variation. The effects of unwanted variation have been extensively documented in gene expression array studies, and numerous methods have been developed to mitigate these effects. However, there has been much less research on appropriate methodology for accounting for unwanted variation in methylation array studies. Here we present a novel two-stage approach using RUV-inverse in a differential methylation analysis of 450k data and show that it outperforms existing methods.
Subjects
DNA Methylation, Oligonucleotide Array Sequence Analysis/methods, Adolescent, Aged 80 and over, Aging/genetics, Female, Genetic Variation, Humans, Infant, Newborn, Male, Neoplasms/genetics, Smoking/genetics
ABSTRACT
OBJECTIVE: Differences in gastric cancer (GC) clinical outcomes between patients in Asian and non-Asian countries have been historically attributed to variability in clinical management. However, recent international Phase III trials suggest that even with standardised treatments, GC outcomes differ by geography. Here, we investigated gene expression differences between Asian and non-Asian GCs, and whether these molecular differences might influence clinical outcome. DESIGN: We compared gene expression profiles of 1016 GCs from six Asian and three non-Asian GC cohorts, using a two-stage meta-analysis design and a novel biostatistical method (RUV-4) to adjust for technical variation between cohorts. We further validated our findings by computerised immunohistochemical analysis on two independent tissue microarray (TMA) cohorts from Asian and non-Asian localities (n=665). RESULTS: Gene signatures differentially expressed between Asian and non-Asian GCs were related to immune function and inflammation. Non-Asian GCs were significantly enriched in signatures related to T-cell biology, including CTLA-4 signalling. Similarly, in the TMA cohorts, non-Asian GCs showed significantly higher expression of T-cell markers (CD3, CD45R0, CD8) and lower expression of the immunosuppressive T-regulatory cell marker FOXP3 compared with Asian GCs (p<0.05). Inflammatory cell markers CD66b and CD68 also exhibited significant cohort differences (p<0.05). Exploratory analyses revealed a significant relationship between tumour immunity factors, geographic locality-specific prognosis and post-chemotherapy outcomes. CONCLUSIONS: Analyses of >1600 GCs suggest that Asian and non-Asian GCs exhibit distinct tumour immunity signatures related to T-cell function. These differences may contribute to geographical differences in clinical outcome and should inform the design of future trials, particularly in immuno-oncology.
Subjects
Adenocarcinoma/genetics, Adenocarcinoma/immunology, Stomach Neoplasms/genetics, Stomach Neoplasms/immunology, Transcriptome, Adenocarcinoma/drug therapy, Adult, Aged, Aged 80 and over, Asian People/genetics, Female, Humans, Male, Middle Aged, Retrospective Studies, Stomach Neoplasms/drug therapy, Treatment Outcome, Young Adult
ABSTRACT
Metabolomics experiments are inevitably subject to a component of unwanted variation, due to factors such as batch effects, long runs of samples, and confounding biological variation. Although the removal of this unwanted variation is a vital step in the analysis of metabolomics data, it remains a gray area: there is a recognized need for a better understanding of the procedures and statistical methods required to obtain statistically sound and biologically meaningful results. In this paper, we discuss the causes of unwanted variation in metabolomics experiments, review commonly used metabolomics approaches for handling this unwanted variation, and present a statistical approach for the removal of unwanted variation to obtain normalized metabolomics data. The advantages and performance of the approach relative to several widely used metabolomics normalization approaches are illustrated through two metabolomics studies, and recommendations are provided for choosing and assessing the most suitable normalization method for a given metabolomics experiment. Software for the approach is made freely available.
Subjects
Mass Spectrometry/methods, Metabolomics/methods, Software, Humans, Principal Component Analysis
ABSTRACT
Microarray expression studies suffer from the problem of batch effects and other unwanted variation. Many methods have been proposed to adjust microarray data to mitigate the problems of unwanted variation. Several of these methods rely on factor analysis to infer the unwanted variation from the data. A central problem with this approach is the difficulty in discerning the unwanted variation from the biological variation that is of interest to the researcher. We present a new method, intended for use in differential expression studies, that attempts to overcome this problem by restricting the factor analysis to negative control genes. Negative control genes are genes known a priori not to be differentially expressed with respect to the biological factor of interest. Variation in the expression levels of these genes can therefore be assumed to be unwanted variation. We name this method "Remove Unwanted Variation, 2-step" (RUV-2). We discuss various techniques for assessing the performance of an adjustment method and compare the performance of RUV-2 with that of other commonly used adjustment methods such as ComBat and Surrogate Variable Analysis (SVA). We present several example studies, each concerning genes differentially expressed with respect to gender in the brain, and find that RUV-2 performs as well as or better than other methods. Finally, we discuss the possibility of adapting RUV-2 for use in studies not concerned with differential expression and conclude that there may be promise but substantial challenges remain.
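In outline, RUV-2's two steps are: (1) factor-analyze the negative control genes to estimate the unwanted factors W, and (2) include W as covariates in the gene-wise regressions on the factor of interest. A minimal sketch follows, with SVD standing in for a general factor analysis (the published method and the ruv package are more careful).

```r
# Minimal two-step sketch of the RUV-2 idea (SVD stands in for the factor
# analysis; the published method and the 'ruv' package are more careful).
# Y: samples x genes; X: factor of interest; ctl: logical control-gene flags.
ruv2_sketch <- function(Y, X, ctl, k) {
  W <- svd(scale(Y[, ctl], scale = FALSE))$u[, 1:k, drop = FALSE]  # step 1
  fit <- lm(Y ~ X + W)                                             # step 2
  coef(fit)["X", ]   # per-gene effects of X, adjusted for W
}
```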
Subjects
Statistical Data Interpretation, Gene Expression Profiling/methods, Oligonucleotide Array Sequence Analysis/methods, Female, Humans, Male
ABSTRACT
Accurate identification and effective removal of unwanted variation are essential to derive meaningful biological results from RNA sequencing (RNA-seq) data, especially when the data come from large and complex studies. Using RNA-seq data from The Cancer Genome Atlas (TCGA), we examined several sources of unwanted variation and demonstrate here how these can significantly compromise various downstream analyses, including cancer subtype identification, associations between gene expression and survival outcomes, and gene co-expression analysis. We propose a strategy, called pseudo-replicates of pseudo-samples (PRPS), for deploying our recently developed normalization method, Removing Unwanted Variation III (RUV-III), to remove the variation caused by library size, tumor purity and batch effects in TCGA RNA-seq data. We illustrate the value of our approach by comparing it to the standard TCGA normalizations on several TCGA RNA-seq datasets. RUV-III with PRPS can be used to integrate and normalize other large transcriptomic datasets coming from multiple laboratories or platforms.
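The PRPS construction can be sketched simply: within each batch, average the samples that share the same biology; pseudo-samples with the same biology are then declared replicates of one another across batches and passed to RUV-III. The sketch below is illustrative only; the paper's construction also accounts for factors such as tumor purity and library size.

```r
# Sketch of pseudo-replicates of pseudo-samples (PRPS): average samples with
# shared biology within each batch; pseudo-samples sharing the same biology
# are then treated as replicates across batches (illustrative only).
make_prps <- function(Y, biology, batch) {
  g <- interaction(biology, batch, drop = TRUE)
  ps <- t(sapply(split(seq_len(nrow(Y)), g),
                 function(i) colMeans(Y[i, , drop = FALSE])))
  ps  # rows named "biology.batch"; same-biology rows form replicate sets
}
```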
Subjects
Neoplasms, RNA, Humans, Gene Expression Profiling/methods, RNA Sequence Analysis, Neoplasms/genetics
ABSTRACT
Transcriptome deconvolution aims to estimate the cellular composition of an RNA sample from its gene expression data, which in turn can be used to correct for composition differences across samples. The human brain is unique in its transcriptomic diversity and comprises a complex mixture of cell types, including transcriptionally similar subtypes of neurons. Here, we carry out a comprehensive evaluation of deconvolution methods for human brain transcriptome data, and assess the tissue-specificity of our key observations by comparison with human pancreas and heart. We evaluate eight transcriptome deconvolution approaches and nine cell-type signatures, testing the accuracy of deconvolution using in silico mixtures of single-cell RNA-seq data, RNA mixtures, as well as nearly 2000 human brain samples. Our results identify the main factors that drive deconvolution accuracy for brain data, and highlight the importance of biological factors influencing cell-type signatures, such as brain region and in vitro cell culturing.
Subjects
RNA, Transcriptome, Brain, Gene Expression Profiling/methods, Humans, Organ Specificity, RNA Sequence Analysis/methods, Transcriptome/genetics
ABSTRACT
Previous work has characterized the properties of neurotransmitter release at excitatory and inhibitory synapses, but we know remarkably little about the properties of monoamine release, because these neuromodulators do not generally produce a fast ionotropic response. Since dopamine and serotonin neurons can also release glutamate in vitro and in vivo, we have used the vesicular monoamine transporter VMAT2 and the vesicular glutamate transporter VGLUT1 to compare the localization and recycling of synaptic vesicles that store, respectively, monoamines and glutamate. First, VMAT2 segregates partially from VGLUT1 in the boutons of midbrain dopamine neurons, indicating the potential for distinct release sites. Second, endocytosis after stimulation is slower for VMAT2 than VGLUT1. During the stimulus, however, the endocytosis of VMAT2 (but not VGLUT1) accelerates dramatically in midbrain dopamine but not hippocampal neurons, indicating a novel, cell-specific mechanism to sustain high rates of release. On the other hand, we find that in both midbrain dopamine and hippocampal neurons, a substantially smaller proportion of VMAT2 than VGLUT1 is available for evoked release, and VMAT2 shows considerably more dispersion along the axon after exocytosis than VGLUT1. Even when expressed in the same neuron, the two vesicular transporters thus target to distinct populations of synaptic vesicles, presumably due to their selection of distinct recycling pathways.
Subjects
Dopamine/metabolism, Neurons/metabolism, Synaptic Vesicles/metabolism, Vesicular Glutamate Transport Protein 1/metabolism, Vesicular Monoamine Transport Proteins/metabolism, Animals, Newborn Animals, Western Blotting, Cultured Cells, Electrophysiology, Endocytosis/physiology, Exocytosis/physiology, Hippocampus/cytology, Hippocampus/metabolism, Mesencephalon/cytology, Mesencephalon/metabolism, Rats
ABSTRACT
ABSTRACT: As severe acute respiratory syndrome coronavirus 2 continues to spread, easy-to-use risk models that predict hospital mortality can assist in clinical decision making and triage. We aimed to develop a risk score model for in-hospital mortality in patients hospitalized with coronavirus disease 2019 (COVID-19) that was robust across hospitals and used clinical factors that are readily available and measured standardly across hospitals. In this retrospective observational study, we developed a risk score model using data collected by trained abstractors for patients in 20 diverse hospitals across the state of Michigan (Mi-COVID19) who were discharged between March 5, 2020 and August 14, 2020. Patients who tested positive for severe acute respiratory syndrome coronavirus 2 during hospitalization or were discharged with an ICD-10 code for COVID-19 (U07.1) were included. We employed an iterative forward selection approach to consider the inclusion of 145 potential risk factors available at hospital presentation. Model performance was externally validated with patients from 19 hospitals in the Mi-COVID19 registry not used in model development. We shared the model in an easy-to-use online application that allows the user to predict in-hospital mortality risk for a patient given any subset of the variables in the final model. Two thousand one hundred ninety-three patients in the Mi-COVID19 registry met our inclusion criteria. The derivation and validation sets ultimately included 1690 and 398 patients, with mortality rates of 19.6% and 18.6%, respectively. The average age of participants after exclusions was 64 years, and participants were 48% female, 49% Black, and 87% non-Hispanic. Our final model includes the patient's age, first recorded respiratory rate, first recorded pulse oximetry, highest creatinine level on the day of presentation, and the hospital's COVID-19 mortality rate. No other factors showed sufficient incremental model improvement to warrant inclusion. The areas under the receiver operating characteristic curve for the derivation and validation sets were 0.796 (95% confidence interval, 0.767-0.826) and 0.829 (95% confidence interval, 0.782-0.876), respectively. We conclude that the risk of in-hospital mortality in COVID-19 patients can be reliably estimated using a few factors that are standardly measured and available to physicians very early in a hospital encounter.
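A model of the described form is an ordinary logistic regression. The toy sketch below uses simulated data and illustrative variable names (not the published Mi-COVID19 model or its coefficients) to show the fitting and a rank-based AUC computation.

```r
# Toy sketch of a logistic risk model of the described form (simulated data;
# variable names are illustrative, not the published Mi-COVID19 model).
set.seed(1)
n <- 500
d <- data.frame(age = rnorm(n, 64, 15), resp_rate = rnorm(n, 20, 5),
                spo2 = pmin(rnorm(n, 93, 4), 100), creat = rlnorm(n, 0, 0.4))
d$died <- rbinom(n, 1, plogis(-1.5 + 0.05 * (d$age - 64) +
                              0.1 * (d$resp_rate - 20) - 0.1 * (d$spo2 - 93)))
fit <- glm(died ~ age + resp_rate + spo2 + creat, family = binomial, data = d)
p <- predict(fit, type = "response")
mean(outer(p[d$died == 1], p[d$died == 0], ">"))  # concordance = in-sample AUC
```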
Subjects
COVID-19/mortality, Hospital Mortality/trends, Age Factors, Aged, Aged 80 and over, Body Mass Index, Comorbidity, Creatinine/blood, Female, Health Behavior, Humans, Logistic Models, Male, Michigan/epidemiology, Middle Aged, Oximetry, Prognosis, ROC Curve, Racial Groups, Retrospective Studies, Risk Assessment, Risk Factors, SARS-CoV-2, Severity of Illness Index, Sex Factors, Socioeconomic Factors
ABSTRACT
MOTIVATION: In single-cell RNA-sequencing (scRNA-seq) experiments, RNA transcripts are extracted and measured from isolated cells to understand gene expression at the cellular level. Measurements from this technology are affected by many technical artifacts, including batch effects. In analogous bulk gene expression experiments, external references, e.g., synthetic gene spike-ins often from the External RNA Controls Consortium (ERCC), may be incorporated into the experimental protocol for use in adjusting measurements for technical artifacts. In scRNA-seq experiments, the use of external spike-ins is controversial due to dissimilarities with endogenous genes and uncertainty about sufficient precision of their introduction. Instead, endogenous genes with highly stable expression could be used as references within scRNA-seq to help normalize the data. First, however, a specific notion of stable expression at the single-cell level needs to be formulated; genes could be stable in absolute expression, in proportion to cell volume, or in proportion to total gene expression. Different types of stable genes will be useful for different normalizations and will need different methods for discovery. RESULTS: We compile gene sets whose products are associated with cellular structures and record these gene sets for future reuse and analysis. We find that genes whose final products are associated with the cytosolic ribosome have expression that is highly stable with respect to the total RNA content. Notably, these genes appear to be stable in bulk measurements as well. SUPPLEMENTARY INFORMATION: Supplementary data are available through GitHub (johanngb/sc-stable).
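The notion of stability in proportion to total gene expression suggests a direct screen: compute each gene's share of its cell's total counts and rank genes by the variability of that share. A hedged sketch (not the authors' exact criterion) follows.

```r
# Sketch: rank genes by stability of their share of total counts across cells
# (one of the stability notions discussed; not the authors' exact screen).
stability_order <- function(counts) {   # counts: cells x genes
  share <- counts / rowSums(counts)     # per-cell gene proportions
  cv2 <- apply(share, 2, var) / colMeans(share)^2  # squared coeff. of variation
  cv2[!is.finite(cv2)] <- Inf           # unexpressed genes rank last
  order(cv2)                            # most stable genes first
}
```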
Subjects
Gene Expression Profiling/methods, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Animals, Computational Biology/methods, Factual Databases, Humans, Mice
ABSTRACT
Proper data transformation is an essential part of analysis. Choosing appropriate transformations for variables can enhance visualization, improve the efficacy of analytical methods, and increase data interpretability. However, determining appropriate transformations of variables from high-content imaging data poses new challenges. Imaging data produces hundreds of covariates from each of thousands of images in a corpus, and each covariate has a different distribution and needs a potentially different transformation. Determining an appropriate transformation for each of hundreds of covariates by hand is infeasible. In this paper we explore simple, robust, and automatic transformations of high-content image data. A central application of our work is to microenvironment microarray bio-imaging data from the NIH LINCS program. We show that our robust transformations enhance visualization and improve the discovery of substantively relevant latent effects. These transformations enhance analysis of image features individually and also improve data integration approaches when combining multiple features. We anticipate that the advantages of this work will also be realized in the analysis of data from other high-content and highly multiplexed technologies like Cell Painting or Cyclic Immunofluorescence. Software and further analysis can be found at gjhunt.github.io/rr.
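As a flavor of automatic per-feature transformation, a crude rule might log-transform any positive, strongly right-skewed feature. The toy rule below is a stand-in, far simpler than the paper's robust procedure, but it shows the shape of the problem: one decision per feature, applied over hundreds of features.

```r
# Crude automatic per-feature transformation (toy stand-in for the paper's
# robust procedure): log-transform positive, right-skewed features.
auto_transform <- function(x) {
  skew <- mean((x - mean(x))^3) / sd(x)^3  # sample skewness
  if (min(x) > 0 && skew > 1) log(x) else x
}
feats <- data.frame(a = rexp(100), b = rnorm(100))       # toy feature table
feats_t <- as.data.frame(lapply(feats, auto_transform))  # one decision per feature
```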
ABSTRACT
BACKGROUND: When conducting a randomized controlled trial, it is common to specify in advance the statistical analyses that will be used to analyze the data. Typically, these analyses will involve adjusting for small imbalances in baseline covariates. However, this poses a dilemma, as adjusting for too many covariates can hurt precision more than it helps, and it is often unclear which covariates are predictive of outcome prior to conducting the experiment. OBJECTIVES: This article aims to produce a covariate adjustment method that allows for automatic variable selection, so that practitioners need not commit to any specific set of covariates prior to seeing the data. RESULTS: In this article, we propose the "leave-one-out potential outcomes" estimator. We leave out each observation and then impute that observation's treatment and control potential outcomes using a prediction algorithm such as a random forest. In addition to allowing for automatic variable selection, this estimator is unbiased under the Neyman-Rubin model, generally performs at least as well as the unadjusted estimator, and the experimental randomization largely justifies the statistical assumptions made.
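A minimal sketch of the leave-one-out potential outcomes idea for a Bernoulli(1/2) design follows, with random forests imputing each left-out unit's two potential outcomes (the paper's variance estimation and efficiency refinements are omitted).

```r
# Minimal sketch of the leave-one-out potential outcomes (LOOP) idea for a
# Bernoulli(1/2) randomized design (illustrative; refinements omitted).
library(randomForest)

# X: covariate matrix (with column names); Tr: 0/1 treatment; y: outcomes.
loop_sketch <- function(X, Tr, y) {
  n <- length(y)
  tau <- numeric(n)
  for (i in seq_len(n)) {
    tr <- setdiff(seq_len(n), i)  # leave unit i out
    f1 <- randomForest(X[tr[Tr[tr] == 1], , drop = FALSE], y[tr[Tr[tr] == 1]])
    f0 <- randomForest(X[tr[Tr[tr] == 0], , drop = FALSE], y[tr[Tr[tr] == 0]])
    m  <- (predict(f1, X[i, , drop = FALSE]) +    # imputed treated outcome
           predict(f0, X[i, , drop = FALSE])) / 2 # ...averaged with control
    tau[i] <- 2 * (2 * Tr[i] - 1) * (y[i] - m)    # unit-level estimate (p = 1/2)
  }
  mean(tau)  # LOOP estimate of the average treatment effect
}
```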