RESUMO
MOTIVATION: Matched case-control analysis is widely used in biomedical studies to identify exposure variables associated with health conditions. The matching is used to improve the efficiency. Existing variable selection methods for matched case-control studies are challenged in high-dimensional settings where interactions among variables are also important. We describe a quite different method for high-dimensional matched case-control data, based on the potential outcome model, which is not only flexible regarding the number of matching and exposure variables but also able to detect interaction effects. RESULTS: We present Matched Forest (MF), an algorithm for variable selection in matched case-control data. The method preserves the case and control values in each instance but transforms the matched case-control data with added counterfactuals. A modified variable importance score from a supervised learner is used to detect important variables. The method is conceptually simple and can be applied with widely available software tools. Simulation studies show the effectiveness of MF in identifying important variables. MF is also applied to data from the biomedical domain and its performance is compared with alternative approaches. AVAILABILITY AND IMPLEMENTATION: R code for implementing MF is available at https://github.com/NooshinSh/Matched_Forest. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Software , Estudos de Casos e Controles , Florestas , Aprendizado de Máquina SupervisionadoRESUMO
BACKGROUND: Copy Number Alternations (CNAs) is defined as somatic gain or loss of DNA regions. The profiles of CNAs may provide a fingerprint specific to a tumor type or tumor grade. Low-coverage sequencing for reporting CNAs has recently gained interest since successfully translated into clinical applications. Ovarian serous carcinomas can be classified into two largely mutually exclusive grades, low grade and high grade, based on their histologic features. The grade classification based on the genomics may provide valuable clue on how to best manage these patients in clinic. Based on the study of ovarian serous carcinomas, we explore the methodology of combining CNAs reporting from low-coverage sequencing with machine learning techniques to stratify tumor biospecimens of different grades. RESULTS: We have developed a data-driven methodology for tumor classification using the profiles of CNAs reported by low-coverage sequencing. The proposed method called Bag-of-Segments is used to summarize fixed-length CNA features predictive of tumor grades. These features are further processed by machine learning techniques to obtain classification models. High accuracy is obtained for classifying ovarian serous carcinoma into high and low grades based on leave-one-out cross-validation experiments. The models that are weakly influenced by the sequence coverage and the purity of the sample can also be built, which would be of higher relevance for clinical applications. The patterns captured by Bag-of-Segments features correlate with current clinical knowledge: low grade ovarian tumors being related to aneuploidy events associated to mitotic errors while high grade ovarian tumors are induced by DNA repair gene malfunction. CONCLUSIONS: The proposed data-driven method obtains high accuracy with various parametrizations for the ovarian serous carcinoma study, indicating that it has good generalization potential towards other CNA classification problems. This method could be applied to the more difficult task of classifying ovarian serous carcinomas with ambiguous histology or in those with low grade tumor co-existing with high grade tumor. The closer genomic relationship of these tumor samples to low or high grade may provide important clinical value.
Assuntos
Cistadenocarcinoma Seroso/classificação , Variações do Número de Cópias de DNA , Ciência de Dados/métodos , Genoma Humano , Neoplasias Ovarianas/classificação , Cistadenocarcinoma Seroso/genética , Cistadenocarcinoma Seroso/patologia , Feminino , Humanos , Gradação de Tumores , Neoplasias Ovarianas/genética , Neoplasias Ovarianas/patologia , Sequenciamento Completo do GenomaRESUMO
In this paper, we propose a new end-to-end deep neural network model for time-series classification (TSC) with emphasis on both the accuracy and the interpretation. The proposed model contains a convolutional network component to extract high-level features and a recurrent network component to enhance the modeling of the temporal characteristics of TS data. In addition, a feedforward fully connected network with the sparse group lasso (SGL) regularization is used to generate the final classification. The proposed architecture not only achieves satisfying classification accuracy, but also obtains good interpretability through the SGL regularization. All these networks are connected and jointly trained in an end-to-end framework, and it can be generally applied to TSC tasks across different domains without the efforts of feature engineering. Our experiments in various TS data sets show that the proposed model outperforms the traditional convolutional neural network model for the classification accuracy, and also demonstrate how the SGL contributes to a better model interpretation.
RESUMO
[This corrects the article DOI: 10.1371/journal.pone.0196556.].
RESUMO
BACKGROUND: Next generation sequencing tests (NGS) are usually performed on relatively small core biopsy or fine needle aspiration (FNA) samples. Data is limited on what amount of tumor by volume or minimum number of FNA passes are needed to yield sufficient material for running NGS. We sought to identify the amount of tumor for running the PCDx NGS platform. METHODS: 2,723 consecutive tumor tissues of all cancer types were queried and reviewed for inclusion. Information on tumor volume, success of performing NGS, and results of NGS were compiled. Assessment of sequence analysis, mutation calling and sensitivity, quality control, drug associations, and data aggregation and analysis were performed. RESULTS: 6.4% of samples were rejected from all testing due to insufficient tumor quantity. The number of genes with insufficient sensitivity make definitive mutation calls increased as the percentage of tumor decreased, reaching statistical significance below 5% tumor content. The number of drug associations also decreased with a lower percentage of tumor, but this difference only became significant between 1-3%. The number of drug associations did decrease with smaller tissue size as expected. Neither specimen size or percentage of tumor affected the ability to pass mRNA quality control. A tumor area of 10 mm2 provides a good margin of error for specimens to yield adequate drug association results. CONCLUSIONS: Specimen suitability remains a major obstacle to clinical NGS testing. We determined that PCR-based library creation methods allow the use of smaller specimens, and those with a lower percentage of tumor cells to be run on the PCDx NGS platform.
Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Neoplasias/diagnóstico , Neoplasias/genética , Biópsia por Agulha Fina/métodos , Análise Mutacional de DNA , DNA Complementar/metabolismo , Feminino , Biblioteca Gênica , Humanos , Masculino , Mutação , Reação em Cadeia da Polimerase , RNA Mensageiro/metabolismo , Reprodutibilidade dos Testes , Estudos Retrospectivos , Sensibilidade e EspecificidadeRESUMO
Phenotypic characterization of individual cells provides crucial insights into intercellular heterogeneity and enables access to information that is unavailable from ensemble averaged, bulk cell analyses. Single-cell studies have attracted significant interest in recent years and spurred the development of a variety of commercially available and research-grade technologies. To quantify cell-to-cell variability of cell populations, we have developed an experimental platform for real-time measurements of oxygen consumption (OC) kinetics at the single-cell level. Unique challenges inherent to these single-cell measurements arise, and no existing data analysis methodology is available to address them. Here we present a data processing and analysis method that addresses challenges encountered with this unique type of data in order to extract biologically relevant information. We applied the method to analyze OC profiles obtained with single cells of two different cell lines derived from metaplastic and dysplastic human Barrett's esophageal epithelium. In terms of method development, three main challenges were considered for this heterogeneous dynamic system: (i) high levels of noise, (ii) the lack of a priori knowledge of single-cell dynamics, and (iii) the role of intercellular variability within and across cell types. Several strategies and solutions to address each of these three challenges are presented. The features such as slopes, intercepts, breakpoint or change-point were extracted for every OC profile and compared across individual cells and cell types. The results demonstrated that the extracted features facilitated exposition of subtle differences between individual cells and their responses to cell-cell interactions. With minor modifications, this method can be used to process and analyze data from other acquisition and experimental modalities at the single-cell level, providing a valuable statistical framework for single-cell analysis.
Assuntos
Oxigênio/metabolismo , Análise de Célula Única/métodos , Esôfago de Barrett/metabolismo , Linhagem Celular , Esôfago/metabolismo , Humanos , Modelos LinearesRESUMO
Kernel principal component analysis (KPCA) is a method widely used for denoising multivariate data. Using geometric arguments, we investigate why a projection operation inherent to all existing KPCA denoising algorithms can sometimes cause very poor denoising. Based on this, we propose a modification to the projection operation that remedies this problem and can be incorporated into any of the existing KPCA algorithms. Using toy examples and real datasets, we show that the proposed algorithm can substantially improve denoising performance and is more robust to misspecification of an important tuning parameter.
RESUMO
Despite significant improvements in recent years, proteomic datasets currently available still suffer from large number of missing values. Integrative analyses based upon incomplete proteomic and transcriptomic datasets could seriously bias the biological interpretation. In this study, we applied a non-linear data-driven stochastic gradient boosted trees (GBT) model to impute missing proteomic values using a temporal transcriptomic and proteomic dataset of Shewanella oneidensis. In this dataset, genes' expression was measured after the cells were exposed to 1 mM potassium chromate for 5, 30, 60, and 90 min, while protein abundance was measured for 45 and 90 min. With the ultimate objective to impute protein values for experimentally undetected samples at 45 and 90 min, we applied a serial set of algorithms to capture relationships between temporal gene and protein expression. This work follows four main steps: (1) a quality control step for gene expression reliability, (2) mRNA imputation, (3) protein prediction, and (4) validation. Initially, an S control chart approach is performed on gene expression replicates to remove unwanted variability. Then, we focused on the missing measurements of gene expression through a nonlinear Smoothing Splines Curve Fitting. This method identifies temporal relationships among transcriptomic data at different time points and enables imputation of mRNA abundance at 45 min. After mRNA imputation was validated by biological constrains (i.e. operons), we used a data-driven GBT model to impute protein abundance for the proteins experimentally undetected in the 45 and 90 min samples, based on relevant predictors such as temporal mRNA gene expression data and cellular functional roles. The imputed protein values were validated using biological constraints such as operon and pathway information through a permutation test to investigate whether dispersion measures are indeed smaller for known biological groups than for any set of random genes. Finally, we demonstrated that such missing value imputation improved characterization of the temporal response of S. oneidensis to chromate.
Assuntos
Perfilação da Expressão Gênica , Proteômica , Shewanella/genética , Shewanella/metabolismo , Algoritmos , Cromatos/farmacologia , Biologia Computacional , Poluentes Ambientais/farmacologia , Regulação Bacteriana da Expressão Gênica/efeitos dos fármacos , Modelos Estatísticos , Compostos de Potássio/farmacologia , Controle de Qualidade , Shewanella/efeitos dos fármacos , Fatores de TempoRESUMO
This paper proposes a new feature selection methodology. The methodology is based on the stepwise variable selection procedure, but, instead of using the traditional discriminant metrics such as Wilks' Lambda, it uses an estimation of the misclassification error as the figure of merit to evaluate the introduction of new features. The expected misclassification error rate (MER) is obtained by using the densities of a constructed function of random variables, which is the stochastic representation of the conditional distribution of the quadratic discriminant function estimate. The application of the proposed methodology results in significant savings of computational time in the estimation of classification error over the traditional simulation and cross-validation methods. One of the main advantages of the proposed method is that it provides a direct estimation of the expected misclassification error at the time of feature selection, which provides an immediate assessment of the benefits of introducing an additional feature into an inspection/classification algorithm.
Assuntos
Algoritmos , Inteligência Artificial , Análise de Falha de Equipamento/métodos , Modelos Teóricos , Reconhecimento Automatizado de Padrão/métodos , Simulação por ComputadorRESUMO
MOTIVATION: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. Using partial proteomic data, it is likely that integrative transcriptomic and proteomic analysis may introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into metabolic mechanisms underlying complex biological systems. RESULTS: In this study, we present a non-linear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic gradient boosted trees (GBT) to uncover possible non-linear relationships between transcriptomic and proteomic data, and to predict protein abundance for the proteins not experimentally detected based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, guanine-cytosine (GC) content and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. A strong plateau effect in the regions of high mRNA values and sparse data occurred in this model. Hence, we removed genes in those areas based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, main cellular functional categories and few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new tuned GBT model using the five most significant predictors. The construction of our non-linear model consists of a set of serial regression trees models with implicit strength in variable selection. The model provides variable relative importance measures using as a criterion mean square error. The results showed that coefficients of determination for our nonlinear models ranged from 0.393 to 0.582 in both datasets, providing better results than linear regression used in the past. We evaluated the validity of this non-linear model using biological information of operons, regulons and pathways, and the results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.