ABSTRACT
MOTIVATION: Supervised machine learning is commonly applied in genomic research to construct a classifier from the training data that is generalizable to predict independent testing data. When test datasets are not available, cross-validation is commonly used to estimate the error rate. Many machine learning methods are available, and it is well known that no universally best method exists in general. It has been a common practice to apply many machine learning methods and report the method that produces the smallest cross-validation error rate. Theoretically, such a procedure produces a selection bias. Consequently, many clinical studies with moderate sample sizes (e.g. n = 30-60) risk reporting a falsely small cross-validation error rate that could not be validated later in independent cohorts. RESULTS: In this article, we illustrated the probabilistic framework of the problem and explored the statistical and asymptotic properties. We proposed a new bias correction method based on learning curve fitting by inverse power law (IPL) and compared it with three existing methods: nested cross-validation, weighted mean correction and Tibshirani-Tibshirani procedure. All methods were compared in simulation datasets, five moderate size real datasets and two large breast cancer datasets. The result showed that IPL outperforms the other methods in bias correction with smaller variance, and it has an additional advantage to extrapolate error estimates for larger sample sizes, a practical feature to recommend whether more samples should be recruited to improve the classifier and accuracy. An R package 'MLbias' and all source files are publicly available. AVAILABILITY AND IMPLEMENTATION: tsenglab.biostat.pitt.edu/software.htm. CONTACT: ctseng@pitt.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
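The selection bias described in this abstract can be illustrated with a small simulation. The sketch below is not the IPL correction from the paper. It assumes a null setting in which every classifier has a true error rate of 0.5, and it assumes, as a simplification, that the cross-validation errors of the competing methods are independent (on a real dataset they would be correlated). The function name and all parameter values are hypothetical.

```python
import random


def simulate_min_cv_error(n_samples=40, n_methods=10, n_reps=2000,
                          true_error=0.5, seed=0):
    """Compare the mean error of one fixed method with the mean of the
    smallest error among several methods (all with the same true error)."""
    rng = random.Random(seed)
    single_total = 0.0
    min_total = 0.0
    for _ in range(n_reps):
        errors = []
        for _ in range(n_methods):
            # Assumption: each method's CV error is an independent
            # binomial proportion with the same true error rate.
            mistakes = sum(rng.random() < true_error for _ in range(n_samples))
            errors.append(mistakes / n_samples)
        single_total += errors[0]   # error of a method fixed in advance
        min_total += min(errors)    # error of the "best" reported method
    return single_total / n_reps, min_total / n_reps


single_mean, min_mean = simulate_min_cv_error()
print(f"mean error, fixed method:    {single_mean:.3f}")
print(f"mean error, minimum over 10: {min_mean:.3f}")
```

Under these assumptions the fixed method averages close to the true error of 0.5, while the reported minimum is noticeably lower even though no method is actually better than chance.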
Subject(s)
Artificial Intelligence , Genomics/methods , Breast Neoplasms/genetics , Statistical Data Interpretation , Female , Gene Expression Profiling , Humans , Theoretical Models , Sample Size
ABSTRACT
BACKGROUND: In modern biomedical research of complex diseases, a large number of demographic and clinical variables, herein called phenomic data, are often collected, and missing values (MVs) are inevitable in the data collection process. Since many downstream statistical and bioinformatics methods require a complete data matrix, imputation is a common and practical solution. In high-throughput experiments such as microarray experiments, continuous intensities are measured, and many mature missing value imputation methods have been developed and widely applied. Large phenomic data, however, contain continuous, nominal, binary and ordinal data types, which precludes the application of most of these methods. Though several methods have been developed in the past few years, no complete guideline has been proposed for phenomic missing data imputation. RESULTS: In this paper, we investigated existing imputation methods for phenomic data, proposed a self-training selection (STS) scheme to select the best imputation method and provided a practical guideline for general applications. We introduced a novel concept of "imputability measure" (IM) to identify missing values that are fundamentally inadequate to impute. In addition, we developed four variations of K-nearest-neighbor (KNN) methods and compared them with two existing methods, multivariate imputation by chained equations (MICE) and missForest. The four variations are imputation by variables (KNN-V), by subjects (KNN-S), their weighted hybrid (KNN-H) and an adaptively weighted hybrid (KNN-A). We performed simulations and applied different imputation methods and the STS scheme to three lung disease phenomic datasets to evaluate the methods. An R package "phenomeImpute" is made publicly available.
CONCLUSIONS: Simulations and applications to real datasets showed that MICE often did not perform well; KNN-A, KNN-H and random forest were among the top performers although no method universally performed the best. Imputation of missing values with low imputability measures increased imputation errors greatly and could potentially deteriorate downstream analyses. The STS scheme was accurate in selecting the optimal method by evaluating methods in a second layer of missingness simulation. All source files for the simulation and the real data analyses are available on the author's publication website.
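As a rough illustration of imputation by subjects, the sketch below fills each missing entry with the mean of that variable over the k most similar subjects. It is not the phenomeImpute implementation: it assumes purely numeric data (the abstract's methods also cover nominal, binary and ordinal variables), uses a root-mean-square distance over the variables both subjects have observed, and the function name and toy data are hypothetical.

```python
import math


def knn_impute_by_subjects(data, k=2):
    """Return a copy of `data` (rows = subjects, None = missing) with each
    missing entry replaced by the mean over the k nearest subjects."""
    filled = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(data):
                # A neighbour must be another subject with variable j observed.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Assumption: root-mean-square distance over shared variables.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            neighbours = candidates[:k]
            if neighbours:
                filled[i][j] = sum(v for _, v in neighbours) / len(neighbours)
    return filled


rows = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],   # third variable is missing for this subject
    [0.9, 1.9, 2.9],
    [5.0, 6.0, 7.0],
]
filled = knn_impute_by_subjects(rows, k=2)
print(filled[1])
```

In the toy data the two closest subjects to the incomplete row are the first and third, so the distant fourth subject does not influence the imputed value. Entries with no usable neighbour are left as None rather than guessed.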
Subject(s)
Epidemiologic Methods , Software , Algorithms , Cluster Analysis , Computational Biology , Computer Simulation , Datasets as Topic , Humans , Research Design
ABSTRACT
DNA methylation is one of the most important epigenetic mechanisms in regulating gene expression. Genome hypermethylation has been proposed as a critical mechanism in human malignancies. However, whole-genome quantification of DNA methylation of human malignancies has rarely been investigated, and the significance of the genome distribution of CpG methylation is unclear. We performed whole-genome methylation sequencing to investigate the methylation profiles of 13 prostate samples: 5 prostate cancers, 4 matched benign prostate tissues adjacent to tumor, and 4 age-matched organ-donor prostate tissues. Alterations of methylation patterns occurred in prostate cancer and in benign prostate tissues adjacent to tumor, in comparison with age-matched organ-donor prostates. More than 95% of the alterations of genome methylation occurred in sequences outside CpG islands. Only a small fraction of the methylated CpG islands had any effect on RNA expression. Both intragene and promoter CpG island methylations negatively affected gene expression. However, suppression of RNA expression did not correlate with levels of CpG island methylation, suggesting that CpG island methylation alone might not be sufficient to shut down gene expression. Motif analysis revealed a consensus sequence containing an Sp1 binding motif that was significantly enriched in the effective CpG islands.
Subject(s)
CpG Islands , DNA Methylation , Human Genome , Prostatic Neoplasms/metabolism , Genetic Transcription , Aged , Genome-Wide Association Study , Humans , Male , Middle Aged , Prostatic Neoplasms/genetics , Prostatic Neoplasms/pathology
ABSTRACT
SUMMARY: With the rapid advances and prevalence of high-throughput genomic technologies, integrating information from multiple relevant genomic studies has brought new challenges. Microarray meta-analysis has become a frequently used tool in biomedical research. Little effort, however, has been made to develop a systematic pipeline and user-friendly software. In this article, we present MetaOmics, a suite of three R packages, MetaQC, MetaDE and MetaPath, for quality control, differentially expressed gene identification and enriched pathway detection for microarray meta-analysis. MetaQC provides a quantitative and objective tool to assist study inclusion/exclusion criteria for meta-analysis. MetaDE and MetaPath were developed for candidate marker and pathway detection, and provide choices of marker detection, meta-analysis and pathway analysis methods. The system allows flexible input of experimental data, clinical outcome (case-control, multi-class, continuous or survival) and pathway databases. It allows missing values in experimental data and utilizes multi-core parallel computing for fast implementation. It generates informative summary output and visualization plots, operates on different operating systems and can be expanded to include new algorithms or combine different types of genomic data. This software suite provides a comprehensive tool to conveniently implement and compare various genomic meta-analysis pipelines. AVAILABILITY: http://www.biostat.pitt.edu/bioinfo/software.htm CONTACT: ctseng@pitt.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Gene Expression Profiling/methods , Genomics/methods , Microarray Analysis/methods , Software , Algorithms , Computational Biology/methods , Humans , Male , Meta-Analysis as Topic , Prostatic Neoplasms/genetics , Quality Control
ABSTRACT
Massively parallel sequencing (a.k.a. next-generation sequencing, NGS) technology has emerged as a powerful tool in characterizing genomic profiles. Among many NGS applications, RNA sequencing (RNA-Seq) has gradually become a standard tool for global transcriptomic monitoring. Although the cost of NGS experiments has dropped constantly, the high sequencing cost and bioinformatic complexity are still obstacles for many biomedical projects. Unlike earlier fluorescence-based technologies such as microarray, modelling of NGS data should consider discrete count data. In addition to sample size, sequencing depth also directly relates to the experimental cost. Consequently, given total budget and pre-specified unit experimental cost, the study design issue in RNA-Seq is conceptually a more complex multi-dimensional constrained optimization problem rather than one-dimensional sample size calculation in traditional hypothesis setting. In this paper, we propose a statistical framework, namely "RNASeqDesign", to utilize pilot data for power calculation and study design of RNA-Seq experiments. The approach is based on mixture model fitting of p-value distribution from pilot data and a parametric bootstrap procedure based on approximated Wald test statistics to infer genome-wide power for optimal sample size and sequencing depth. We further illustrate five practical study design tasks for practitioners. We perform simulations and three real applications to evaluate the performance and compare to existing methods.
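The budget constraint that links sample size and sequencing depth can be made concrete with a small enumeration. The sketch below is not the RNASeqDesign method and performs no power calculation. It assumes a hypothetical linear cost model (a fixed cost per sample plus a cost per million reads) and lists, for each candidate depth, the largest number of samples the budget allows. All names and figures are illustrative.

```python
def feasible_designs(budget, cost_per_sample, cost_per_million_reads,
                     depths_millions):
    """For each candidate depth (millions of reads per sample), return the
    largest total number of samples that fits within the budget."""
    designs = []
    for depth in depths_millions:
        # Assumed cost model: fixed per-sample cost plus depth-dependent cost.
        per_sample = cost_per_sample + cost_per_million_reads * depth
        n_max = int(budget // per_sample)
        if n_max >= 1:
            designs.append((depth, n_max))
    return designs


designs = feasible_designs(
    budget=10_000,
    cost_per_sample=200,
    cost_per_million_reads=10,
    depths_millions=[10, 20, 50],
)
for depth, n_max in designs:
    print(f"depth {depth}M reads -> up to {n_max} samples")
```

Each (depth, sample size) pair on this frontier would then be scored by its estimated genome-wide power, which is the part that requires pilot data and the modelling described in the abstract.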
ABSTRACT
The most effective natural prevention against breast cancer is an early first full-term pregnancy. Understanding how the protective effect is elicited will inform the development of new prevention strategies. To better understand the role of epigenetics in long-term protection, we investigated parity-induced DNA methylation in the mammary gland. FVB mice were bred or remained nulliparous, and mammary glands were harvested immediately after involution (early) or 6.5 months following involution (late), allowing identification of both transient and persistent changes. Targeted DNA methylation (109 Mb of Ensembl regulatory features) analysis was performed using the SureSelectXT Mouse Methyl-seq assay and massively parallel sequencing. Two hundred sixty-nine genes were hypermethylated and 128 hypomethylated persistently at both the early and late time points. Pathway analysis of the persistently differentially methylated genes revealed Igf1r to be central to one of the top identified signaling networks, and Igf1r itself was one of the most significantly hypermethylated genes. Hypermethylation of Igf1r in the parous mammary gland was associated with a reduction of Igf1r mRNA expression. These data suggest that the IGF pathway is regulated at multiple levels during pregnancy and that its modification might be critical in the protective role of pregnancy. This supports the approach of lowering IGF action for prevention of breast cancer, a concept that is currently being tested clinically.