ABSTRACT
In excess of 12% of human cancer incidents have a viral cofactor. Epidemiological studies of idiopathic human cancers indicate that additional tumor viruses remain to be discovered. Recent advances in sequencing technology have enabled systematic screenings of human tumor transcriptomes for viral transcripts. However, technical problems such as low abundances of viral transcripts in large volumes of sequencing data, viral sequence divergence, and homology between viral and human factors significantly confound identification of tumor viruses. We have developed a novel computational approach for detecting viral transcripts in human cancers that takes the aforementioned confounding factors into account and is applicable to a wide variety of viruses and tumors. We apply the approach to conducting the first systematic search for viruses in neuroblastoma, the most common cancer in infancy. The diverse clinical progression of this disease as well as related epidemiological and virological findings are highly suggestive of a pathogenic cofactor. However, a viral etiology of neuroblastoma is currently contested. We mapped 14 transcriptomes of neuroblastoma as well as positive and negative controls to the human and all known viral genomes in order to detect both known and unknown viruses. Analysis of controls, comparisons with related methods, and statistical estimates demonstrate the high sensitivity of our approach. Detailed investigation of putative viral transcripts within neuroblastoma samples did not provide evidence for the existence of any known human viruses. Likewise, de-novo assembly and analysis of chimeric transcripts did not result in expression signatures associated with novel human pathogens. While confounding factors such as sample dilution or viral clearance in progressed tumors may mask viral cofactors in the data, in principle, this is rendered less likely by the high sensitivity of our approach and the number of biological replicates analyzed. Therefore, our results suggest that frequent viral cofactors of metastatic neuroblastoma are unlikely.
Subject(s)
Computational Biology/methods; Neoplasms/genetics; Neoplasms/virology; Transcriptome/genetics; Viruses/isolation & purification; Cell Line, Tumor; High-Throughput Nucleotide Sequencing/methods; Humans; Neoplasms/metabolism; Neuroblastoma; Phylogeny; RNA/analysis; RNA/classification; RNA/genetics; RNA, Viral/analysis; RNA, Viral/genetics; Sequence Analysis, RNA/methods; Sequence Homology, Nucleic Acid; Viruses/genetics; Viruses/metabolism
ABSTRACT
MOTIVATION: Recurrent DNA breakpoints in cancer genomes indicate the presence of critical functional elements for tumor development. Identifying them can help determine new therapeutic targets. High-dimensional DNA microarray experiments like arrayCGH afford the identification of DNA copy number breakpoints with high precision, offering a solid basis for computational estimation of recurrent breakpoint locations. RESULTS: We introduce a method for identification of recurrent breakpoints (consensus breakpoints) from copy number aberration datasets. The method is based on weighted kernel counting of breakpoints around genomic locations. Counts larger than expected by chance are considered significant. We show that the consensus breakpoints facilitate consensus segmentation of the samples. We apply our method to three arrayCGH datasets and show that by using consensus segmentation we achieve significant dimension reduction, which is useful for the task of prediction of tumor phenotype based on copy number data. We use our approach for classification of neuroblastoma tumors from different age groups and confirm the recent recommendation for the choice of age cut-off for differential treatment of 18 months. We also investigate the (epi)genetic properties at consensus breakpoint locations for seven datasets and show enrichment in overlap with important functional genomic regions. AVAILABILITY: Implementation in R of our approach can be found at http://www.mpi-inf.mpg.de/~laura/FeatureGrouping.html. CONTACT: laura@mpi-inf.mpg.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
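The weighted kernel counting described in this abstract can be sketched roughly as follows. This is a minimal Python illustration, not the authors' R implementation. The Gaussian kernel, the null model (the same number of breakpoints placed uniformly at random), the max-over-grid significance threshold, and all names (kernel_count, consensus_breakpoints, bandwidth, grid_step) are assumptions made for the example.

```python
import math
import random


def kernel_count(location, breakpoints, bandwidth):
    """Gaussian-weighted count of breakpoints around one genomic location."""
    return sum(
        math.exp(-0.5 * ((location - b) / bandwidth) ** 2) for b in breakpoints
    )


def consensus_breakpoints(samples, genome_length, grid_step, bandwidth,
                          n_null=200, alpha=0.05, seed=0):
    """Return (location, score) pairs whose kernel count exceeds chance.

    samples: one list of breakpoint positions per tumor sample.
    The null distribution is simulated (an assumption of this sketch) by
    placing the same number of breakpoints uniformly along the genome and
    recording the maximum kernel count over the grid.
    """
    rng = random.Random(seed)
    pooled = [b for sample in samples for b in sample]
    grid = list(range(0, genome_length + 1, grid_step))
    observed = [kernel_count(x, pooled, bandwidth) for x in grid]

    null_max = []
    for _ in range(n_null):
        fake = [rng.uniform(0, genome_length) for _ in pooled]
        null_max.append(max(kernel_count(x, fake, bandwidth) for x in grid))
    null_max.sort()

    # Empirical (1 - alpha) quantile of the null maxima.
    idx = min(n_null - 1, math.ceil((1 - alpha) * n_null) - 1)
    threshold = null_max[idx]
    return [(x, s) for x, s in zip(grid, observed) if s > threshold]
```

Calling consensus_breakpoints on a list of per-sample breakpoint positions returns the grid locations whose weighted count is larger than the simulated chance level. Neighbouring significant locations could then be merged into one consensus breakpoint, which is a further design choice not covered by this sketch.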
Subject(s)
Algorithms; Chromosome Breakpoints; DNA Copy Number Variations; Neoplasms/genetics; Genome, Human; Genomics/methods; Humans; Neuroblastoma/genetics; Oligonucleotide Array Sequence Analysis; Software
ABSTRACT
MOTIVATION: Classification and feature selection of genomics or transcriptomics data is often hampered by the large number of features as compared with the small number of samples available. Moreover, features represented by probes that either have similar molecular functions (gene expression analysis) or genomic locations (DNA copy number analysis) are highly correlated. Classical model selection methods such as penalized logistic regression or random forest become unstable in the presence of high feature correlations. Sophisticated penalties such as group Lasso or fused Lasso can force the models to assign similar weights to correlated features and thus improve model stability and interpretability. In this article, we show that the measures of feature relevance corresponding to the above-mentioned methods are biased such that the weights of the features belonging to groups of correlated features decrease as the sizes of the groups increase, which leads to incorrect model interpretation and misleading feature ranking. RESULTS: With simulation experiments, we demonstrate that Lasso logistic regression, fused support vector machine, group Lasso and random forest models suffer from correlation bias. Using simulations, we show that two related methods for group selection based on feature clustering can be used for correcting the correlation bias. These techniques also improve the stability and the accuracy of the baseline models. We apply all methods investigated to a breast cancer and a bladder cancer arrayCGH dataset in order to identify copy number aberrations predictive of tumor phenotype. AVAILABILITY: R code can be found at: http://www.mpi-inf.mpg.de/~laura/Clustering.r.
Subject(s)
Genomics/methods; Statistics as Topic; Breast Neoplasms/genetics; Cluster Analysis; Comparative Genomic Hybridization; Female; Humans; Logistic Models; Models, Biological; Models, Molecular; Neoplasms; Solutions; Urinary Bladder Neoplasms/genetics
ABSTRACT
MOTIVATION: In life sciences, interpretability of machine learning models is as important as their prediction accuracy. Linear models are probably the most frequently used methods for assessing feature relevance, despite their relative inflexibility. However, in the past years effective estimators of feature relevance have been derived for highly complex or non-parametric models such as support vector machines and RandomForest (RF) models. Recently, it has been observed that RF models are biased in such a way that categorical variables with a large number of categories are preferred. RESULTS: In this work, we introduce a heuristic for normalizing feature importance measures that can correct the feature importance bias. The method is based on repeated permutations of the outcome vector for estimating the distribution of measured importance for each variable in a non-informative setting. The P-value of the observed importance provides a corrected measure of feature importance. We apply our method to simulated data and demonstrate that (i) non-informative predictors do not receive significant P-values, (ii) informative variables can successfully be recovered among non-informative variables and (iii) P-values computed with permutation importance (PIMP) are very helpful for deciding the significance of variables, and therefore improve model interpretability. Furthermore, PIMP was used to correct RF-based importance measures for two real-world case studies. We propose an improved RF model that uses the significant variables with respect to the PIMP measure and show that its prediction accuracy is superior to that of other existing models. AVAILABILITY: R code for the method presented in this article is available at http://www.mpi-inf.mpg.de/~altmann/download/PIMP.R. CONTACT: altmann@mpi-inf.mpg.de, laura.tolosi@mpi-inf.mpg.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
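The permutation scheme summarized in this abstract can be illustrated with a short Python sketch. It is not the authors' PIMP.R code. Two things are assumptions made to keep the example self-contained: the importance measure is the absolute Pearson correlation, swapped in for an RF-based importance, and the P-value is a plain empirical P-value with no distribution fitted to the null importances. All names (abs_correlation, pimp_p_values, n_permutations) are hypothetical.

```python
import random


def abs_correlation(x, y):
    """Stand-in importance measure: absolute Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def pimp_p_values(features, outcome, importance=abs_correlation,
                  n_permutations=500, seed=0):
    """Empirical P-value of each feature's importance.

    features: list of columns, one list of values per variable.
    The outcome vector is shuffled n_permutations times; for every feature
    we count how often the importance under a shuffled (non-informative)
    outcome reaches the observed importance.
    """
    rng = random.Random(seed)
    observed = [importance(col, outcome) for col in features]
    exceed = [0] * len(features)
    permuted = list(outcome)
    for _ in range(n_permutations):
        rng.shuffle(permuted)
        for j, col in enumerate(features):
            if importance(col, permuted) >= observed[j]:
                exceed[j] += 1
    # Add-one correction keeps the estimate strictly above zero.
    return [(1 + e) / (1 + n_permutations) for e in exceed]
```

Any importance function with the same (column, outcome) signature could be passed in through the importance argument. With a model-based importance such as an RF's, the model would need to be refit on each permuted outcome, which is the costly part of the procedure.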
Subject(s)
Artificial Intelligence; Computational Biology/methods; Algorithms; Linear Models; Pattern Recognition, Automated
ABSTRACT
Somatic genetic alterations in cancers have been linked with response to targeted therapeutics by creation of specific dependency on activated oncogenic signaling pathways. However, no tools currently exist to systematically connect such genetic lesions to therapeutic vulnerability. We have therefore developed a genomics approach to identify lesions associated with therapeutically relevant oncogene dependency. Using integrated genomic profiling, we have demonstrated that the genomes of a large panel of human non-small cell lung cancer (NSCLC) cell lines are highly representative of those of primary NSCLC tumors. Using cell-based compound screening coupled with diverse computational approaches to integrate orthogonal genomic and biochemical data sets, we identified molecular and genomic predictors of therapeutic response to clinically relevant compounds. Using this approach, we showed that v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog (KRAS) mutations confer enhanced Hsp90 dependency and validated this finding in mice with KRAS-driven lung adenocarcinoma, as these mice exhibited dramatic tumor regression when treated with an Hsp90 inhibitor. In addition, we found that cells with copy number enhancement of v-abl Abelson murine leukemia viral oncogene homolog 2 (ABL2) and ephrin receptor kinase and v-src sarcoma (Schmidt-Ruppin A-2) viral oncogene homolog (avian) (SRC) kinase family genes were exquisitely sensitive to treatment with the SRC/ABL inhibitor dasatinib, both in vitro and when xenografted into mice. Thus, genomically annotated cell-line collections may help translate cancer genomics information into clinical practice by defining critical pathway dependencies amenable to therapeutic inhibition.
Subject(s)
Antineoplastic Agents/therapeutic use; Carcinoma, Non-Small-Cell Lung/drug therapy; Carcinoma, Non-Small-Cell Lung/genetics; Animals; Antineoplastic Agents/pharmacology; Carcinoma, Non-Small-Cell Lung/pathology; Cell Line, Tumor; Drug Evaluation, Preclinical; ErbB Receptors/chemistry; ErbB Receptors/genetics; ErbB Receptors/metabolism; Gene Expression Profiling; Humans; Magnetic Resonance Imaging; Mice; Models, Molecular; Mutation/genetics; Phenotype; Protein Structure, Tertiary; Substrate Specificity
ABSTRACT
BACKGROUND: The purpose of this study was to prove the feasibility of a longmer oligonucleotide microarray platform to profile gene copy number alterations in prostate cancer cell lines and to quickly indicate novel candidate genes, which may play a role in carcinogenesis. METHODS/RESULTS AND FINDINGS: Genome-wide screening for regions of genetic gains and losses on nine prostate cancer cell lines (PC3, DU145, LNCaP, CWR22, and derived sublines) was carried out using comparative genomic hybridization on a 35,000 feature oligonucleotide microarray (arrayCGH). Compared to conventional chromosomal CGH, more deletions and small regions of gains, particularly in pericentromeric regions and regions next to the telomeres, were detected. As validation of the high resolution of arrayCGH we further analyzed a small amplicon of 1.7 Mb at 9p13.3, which was found in CWR22 and CWR22-Rv1. Increased copy number was confirmed by fluorescence in situ hybridization using the BAC clone RP11-165H19 from the amplified region comprising the two genes interleukin 11 receptor alpha (IL11-RA) and dynactin 3 (DCTN3). Using quantitative real time PCR (qPCR) we could demonstrate that IL11-RA is the gene with the highest copy number gain in the cell lines compared to DCTN3, suggesting IL11-RA to be the amplification target. Screening of 20 primary prostate carcinomas by qPCR revealed an IL11-RA copy number gain in 75% of the tumors analyzed. Gain of DCTN3 was only found in two cases together with a gain of IL11-RA. CONCLUSIONS/SIGNIFICANCE: ArrayCGH using longmer oligonucleotide microarrays is feasible for high-resolution analysis of chromosomal imbalances. Characterization of a small gained region at 9p13.3 in prostate cancer cell lines and primary prostate cancer samples by fluorescence in situ hybridization and quantitative PCR has revealed the interleukin 11 receptor alpha gene as a candidate target of amplification with an amplification frequency of 75% in prostate carcinomas. Frequent amplification of IL11-RA in prostate cancer is a potential mechanism of IL11-RA overexpression in this tumor type.
Subject(s)
Comparative Genomic Hybridization/methods; DNA Copy Number Variations; Gene Expression Profiling/methods; Oligonucleotide Array Sequence Analysis/methods; Prostatic Neoplasms/genetics; Cell Line, Tumor; Chromosomes, Human; Gene Expression Regulation, Neoplastic; Genetic Predisposition to Disease; Genome, Human; Humans; In Situ Hybridization, Fluorescence; Male
ABSTRACT
OBJECTIVE: We evaluated the effects of a preschool nutrition education and food service intervention, "Healthy Start," on two-to-five-year-old children in nine Head Start Centers in upstate New York. The primary objective was to reduce the saturated fat (sat-fat) content of preschool meals to <10% daily energy (E) and to reduce consumption of sat-fat by preschoolers to <10% E. METHODS: Six centers were assigned to the food service intervention and three to the control condition. The food service intervention included training workshops for cooks and monthly site visits to review progress towards goals. Child dietary intake at preschool was assessed by direct observation and plate waste measurement. Dietary intake at home was assessed by parental food record and telephone interviews. Dietary data were collected each Fall/Spring over two years, including five days of menus and recipes from each center. Dietary data were analyzed with the Minnesota NDS software. RESULTS: Consumption of saturated fat from school meals decreased significantly from 1.0%E to 10.4%E after one year of intervention and to 8.0%E after the second year, compared with an increase of 10.2% to 13.0% to 11.4%E, respectively, for control schools (p < 0.001). Total caloric intake was adequately maintained for both groups. Analysis of preschool menus and recipes over the two-year period of intervention showed a significant decrease in sat-fat content in intervention preschools (from 12.5%E at baseline to 8.0%E, compared with a change of 12.1%E to 11.6%E in control preschools; p < 0.001). Total fat content of menus also decreased significantly in intervention schools (31.0% to 25.0%E) compared with controls (29.9% to 28.4%E). CONCLUSIONS: The Healthy Start food service intervention was effective in reducing the fat and saturated fat content of preschool meals and reducing children's consumption of saturated fat at preschool without compromising energy intake or intake of essential nutrients. These goals are consistent with current U.S. Dietary Guidelines for children older than two years of age.