ABSTRACT
A mass spectrometry-based plasma biomarker discovery workflow was developed to facilitate biomarker discovery. Plasma from either healthy volunteers or patients with pancreatic cancer was 8-plex iTRAQ labeled, fractionated by 2-dimensional reversed phase chromatography and subjected to MALDI ToF/ToF mass spectrometry. Data were processed using a q-value based statistical approach to maximize protein quantification and identification. Technical (between duplicate samples) and biological variance (between and within individuals) were calculated and power analysis was thereby enabled. An a priori power analysis was carried out using samples from healthy volunteers to define sample sizes required for robust biomarker identification. The result was subsequently validated with a post hoc power analysis using a real clinical setting involving pancreatic cancer patients. This demonstrated that six samples per group (e.g., pre- vs post-treatment) may provide sufficient statistical power for most proteins with changes >2-fold. A reference standard allowed direct comparison of protein expression changes between multiple experiments. Analysis of patient plasma prior to treatment identified 29 proteins with significant changes within individual patients. Changes in Peroxiredoxin II levels were confirmed by Western blot. This q-value based statistical approach in combination with reference standard samples can be applied with confidence in the design and execution of clinical studies for predictive, prognostic, and/or pharmacodynamic biomarker discovery. The power analysis provides information required prior to study initiation.
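The a priori sample-size reasoning described in this abstract can be illustrated with a generic sketch. This is not the authors' q-value based procedure: it assumes a simple two-sided, two-sample comparison of log2 protein ratios under a normal approximation, and the standard deviations used are hypothetical values, not figures from the study. The function name `samples_per_group` is likewise illustrative.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(log2_fold_change, sd_log2, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_(1-alpha/2) + z_power)^2 * sd^2 / delta^2,
    where delta is the difference to detect on the log2 scale.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sd_log2 ** 2 / log2_fold_change ** 2
    return ceil(n)


# A 2-fold change corresponds to a log2 difference of 1.0.
# The SD values below are hypothetical, chosen only to show the trend.
for sd in (0.3, 0.5, 0.8):
    print(sd, samples_per_group(log2_fold_change=1.0, sd_log2=sd))
```

The required group size grows with the square of the combined technical and biological standard deviation, which is why an estimate of both variance components is needed before a study starts.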
Subjects
Biomarkers, Tumor/blood , Blood Proteins/analysis , Neoplasm Proteins/blood , Proteomics/methods , Blood Proteins/chemistry , Case-Control Studies , Factor XIII , Humans , Neoplasm Proteins/chemistry , Pancreatic Neoplasms/blood , Peroxiredoxins , Proteome/analysis , Proteome/chemistry , Reproducibility of Results , Statistics as Topic
ABSTRACT
The development of informative composite circulating biomarkers predicting cancer presence or therapy response is clinically attractive but optimal approaches to modeling are as yet unclear. This study investigated multidimensional relationships within an example panel of serum insulin-like growth factor (IGF) peptides using logistic regression (LR), fractional polynomial (FP) regression, artificial neural networks (ANNs) and support vector machines (SVMs) to derive predictive models for colorectal cancer (CRC). Two phase 2 biomarker validation analyses were performed: controls were ambulant adults (n = 722); cases were: (i) CRC patients (n = 100) and (ii) patients with acromegaly (n = 52), the latter as "positive" discriminators. Serum IGF-I, IGF-II, IGF binding protein (IGFBP)-2 and -3 were measured. Discriminatory characteristics were compared within and between models. For the LR, FP and ANN models, and to a lesser extent SVMs, the addition of covariates at several steps improved discrimination characteristics. The optimum biomarker combination discriminating CRC vs. controls was achieved using ANN models [sensitivity, 94%; specificity, 90%; accuracy, 0.975 (95% CIs: 0.948, 1.000)]. ANN modeling significantly outperformed LR, FP and SVM in terms of discrimination (p < 0.0001) and calibration. The acromegaly analysis demonstrated expected high performance characteristics in the ANN model [accuracy, 0.993 (95% CIs: 0.977, 1.000)]. Curved decision surfaces generated from the ANNs revealed the potential clinical utility. This example demonstrated improved discriminatory characteristics within the composite biomarker ANN model and a final model that outperformed the three other models. This modeling approach forms the basis to evaluate composite biomarkers as pharmacological and predictive biomarkers in future clinical trials.
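The discrimination measures quoted in this abstract can be made concrete with a small sketch. It assumes binary labels (1 = case, 0 = control) and a model score per subject. The data are invented toy values, and the area under the ROC curve is included only as one common summary of discrimination; whether it corresponds to the "accuracy" figure reported here is an assumption.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """Probability that a random case outscores a random control (ties count half)."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy data, purely illustrative (not from the study).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical 0.5 cut-off

print(sensitivity_specificity(labels, predicted))
print(roc_auc(labels, scores))
```

Sensitivity and specificity depend on the chosen cut-off, whereas the area under the curve summarizes ranking quality across all cut-offs, which is why both kinds of figure are typically reported when models are compared.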
Subjects
Biomarkers, Tumor/blood , Colorectal Neoplasms/diagnosis , Insulin-Like Growth Factor Binding Protein 2/blood , Insulin-Like Growth Factor Binding Protein 3/blood , Insulin-Like Growth Factor II/metabolism , Insulin-Like Growth Factor I/metabolism , Models, Statistical , Adult , Aged , Case-Control Studies , Colorectal Neoplasms/blood , Female , Humans , Male , Radioimmunoassay , Retrospective Studies
ABSTRACT
Applications of genomic and proteomic technologies have seen a major increase, resulting in an explosion in the amount of highly dimensional and complex data being generated. Subsequently this has increased the effort by the bioinformatics community to develop novel computational approaches that allow for meaningful information to be extracted. This information must be of biological relevance and thus correlate to disease phenotypes of interest. Artificial neural networks are a form of machine learning from the field of artificial intelligence with proven pattern recognition capabilities and have been utilized in many areas of bioinformatics. This is due to their ability to cope with highly dimensional complex datasets such as those developed by protein mass spectrometry and DNA microarray experiments. As such, neural networks have been applied to problems such as disease classification and identification of biomarkers. This review introduces and describes the concepts related to neural networks, the advantages and caveats to their use, examples of their applications in mass spectrometry and microarray research (with a particular focus on cancer studies), and illustrations from recent literature showing where neural networks have performed well in comparison to other machine learning methods. This should form the necessary background knowledge and information enabling researchers with an interest in these methodologies, but not necessarily from a machine learning background, to apply the concepts to their own datasets, thus maximizing the information gain from these complex biological systems.
Subjects
Computational Biology/methods , Databases, Genetic , Databases, Protein , Mass Spectrometry , Microarray Analysis , Neoplasms , Neural Networks, Computer , Artificial Intelligence , Bayes Theorem , Genomics , Humans , Neoplasms/genetics , Neoplasms/metabolism , Proteomics , Reproducibility of Results
ABSTRACT
OBJECTIVE: The advent of microarrays has attracted considerable interest from biologists due to the potential for high throughput analysis of hundreds of thousands of gene transcripts. Subsequent analysis of the data may identify specific features which correspond to characteristics of interest within the population, for example, analysis of gene expression profiles in cancer patients to identify molecular signatures corresponding with prognostic outcome. These high throughput technologies have resulted in an unprecedented rate of data generation, often of high complexity, highlighting the need for novel data analysis methodologies that will cope with data of this nature. METHODS: Stepwise methods using artificial neural networks (ANNs) have been developed to identify an optimal subset of predictive gene transcripts from highly dimensional microarray data. Here these methods have been applied to a gene microarray dataset to identify and validate gene signatures corresponding with estrogen receptor and lymph node status in breast cancer. RESULTS: Many gene transcripts were identified whose expression could differentiate patients to very high accuracies based upon firstly whether they were positive or negative for estrogen receptor, and secondly whether metastasis to the axillary lymph node had occurred. A number of these genes had been previously reported to have a role in cancer. Significantly fewer genes were used compared to other previous studies. The models using the optimal gene subsets were internally validated using an extensive random sample cross-validation procedure and externally validated using a follow up dataset from a different cohort of patients on a newer array chip containing the same and additional probe sets. Here, the models retained high accuracies, emphasising the potential power of this approach in analysing complex systems. 
CONCLUSION: These findings show how the proposed method allows for the rapid analysis and subsequent detailed interrogation of gene expression signatures to provide a further understanding of the underlying molecular mechanisms that could be important in determining novel prognostic markers associated with cancer.
Subjects
Breast Neoplasms/genetics , Breast Neoplasms/pathology , Neural Networks, Computer , Receptors, Estrogen/physiology , Transcription, Genetic/physiology , Female , Gene Expression Profiling , Humans , Lymphatic Metastasis , Oligonucleotide Array Sequence Analysis , Predictive Value of Tests , Reproducibility of Results
ABSTRACT
BACKGROUND: Gene expression profiling is being widely applied in cancer research to identify biomarkers for clinical endpoint prediction. Since RNA-seq provides a powerful tool for transcriptome-based applications beyond the limitations of microarrays, we sought to systematically evaluate the performance of RNA-seq-based and microarray-based classifiers in this MAQC-III/SEQC study for clinical endpoint prediction using neuroblastoma as a model. RESULTS: We generate gene expression profiles from 498 primary neuroblastomas using both RNA-seq and 44 k microarrays. Characterization of the neuroblastoma transcriptome by RNA-seq reveals that more than 48,000 genes and 200,000 transcripts are being expressed in this malignancy. We also find that RNA-seq provides much more detailed information on specific transcript expression patterns in clinico-genetic neuroblastoma subgroups than microarrays. To systematically compare the power of RNA-seq and microarray-based models in predicting clinical endpoints, we divide the cohort randomly into training and validation sets and develop 360 predictive models on six clinical endpoints of varying predictability. Evaluation of factors potentially affecting model performances reveals that prediction accuracies are most strongly influenced by the nature of the clinical endpoint, whereas technological platforms (RNA-seq vs. microarrays), RNA-seq data analysis pipelines, and feature levels (gene vs. transcript vs. exon-junction level) do not significantly affect performances of the models. CONCLUSIONS: We demonstrate that RNA-seq outperforms microarrays in determining the transcriptomic characteristics of cancer, while RNA-seq and microarray-based models perform similarly in clinical endpoint prediction. Our findings may be valuable to guide future studies on the development of gene expression-based predictive models and their implementation in clinical practice.
Subjects
Gene Expression Profiling , Neuroblastoma/genetics , Oligonucleotide Array Sequence Analysis , Sequence Analysis, RNA , Adolescent , Adult , Child , Child, Preschool , Endpoint Determination , Female , Humans , Infant , Infant, Newborn , Male , Models, Genetic , Neuroblastoma/classification , Neuroblastoma/diagnosis , Tumor Cells, Cultured , Young Adult
ABSTRACT
BACKGROUND: Gene expression microarray has been the primary biomarker platform ubiquitously applied in biomedical research, resulting in enormous data, predictive models, and biomarkers accrued. Recently, RNA-seq has looked likely to replace microarrays, but there will be a period where both technologies co-exist. This raises two important questions: Can microarray-based models and biomarkers be directly applied to RNA-seq data? Can future RNA-seq-based predictive models and biomarkers be applied to microarray data to leverage past investment? RESULTS: We systematically evaluated the transferability of predictive models and signature genes between microarray and RNA-seq using two large clinical data sets. The complexity of cross-platform sequence correspondence was considered in the analysis and examined using three human and two rat data sets, and three levels of mapping complexity were revealed. Three algorithms representing different modeling complexity were applied to the three levels of mappings for each of the eight binary endpoints and Cox regression was used to model survival times with expression data. In total, 240,096 predictive models were examined. CONCLUSIONS: Signature genes of predictive models are reciprocally transferable between microarray and RNA-seq data for model development, and microarray-based models can accurately predict RNA-seq-profiled samples; while RNA-seq-based models are less accurate in predicting microarray-profiled samples and are affected both by the choice of modeling algorithm and the gene mapping complexity. The results suggest continued usefulness of legacy microarray data and established microarray biomarkers and predictive models in the forthcoming RNA-seq era.
Subjects
Gene Expression Profiling/methods , Genetic Markers , RNA/analysis , Sequence Analysis, RNA , Algorithms , Animals , Computational Biology/methods , Humans , Models, Genetic , Oligonucleotide Array Sequence Analysis , Rats
ABSTRACT
The concordance of RNA-sequencing (RNA-seq) with microarrays for genome-wide analysis of differential gene expression has not been rigorously assessed using a range of chemical treatment conditions. Here we use a comprehensive study design to generate Illumina RNA-seq and Affymetrix microarray data from the same liver samples of rats exposed in triplicate to varying degrees of perturbation by 27 chemicals representing multiple modes of action (MOAs). The cross-platform concordance in terms of differentially expressed genes (DEGs) or enriched pathways is linearly correlated with treatment effect size (R² ≈ 0.8). Furthermore, the concordance is also affected by transcript abundance and biological complexity of the MOA. RNA-seq outperforms microarray (93% versus 75%) in DEG verification as assessed by quantitative PCR, with the gain mainly due to its improved accuracy for low-abundance transcripts. Nonetheless, classifiers to predict MOAs perform similarly when developed using data from either platform. Therefore, the endpoint studied and its biological complexity, transcript abundance and the genomic application are important factors in transcriptomic research and for clinical and regulatory decision making.
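The linear relationship between concordance and treatment effect size reported in this abstract is summarized by a coefficient of determination. A minimal sketch of that calculation for a simple linear fit follows. The input values are invented placeholders, not data from the study, and the function name `r_squared` is illustrative.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple linear fit of y on x.

    For one predictor this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Placeholder values only: effect size per treatment vs. cross-platform concordance.
effect_size = [1, 2, 3, 4]
concordance = [1, 3, 2, 4]
print(r_squared(effect_size, concordance))
```

A value near 1 means that most of the variation in concordance across treatments is explained by effect size alone; the remainder is what the abstract attributes to transcript abundance and the biological complexity of the mode of action.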
Subjects
Oligonucleotide Array Sequence Analysis , RNA, Messenger/genetics , Sequence Analysis, RNA , Animals , Rats
ABSTRACT
Circulating full-length and caspase-cleaved cytokeratin 18 (CK18) are considered biomarkers of chemotherapy-induced cell death measured using a combination of the M30 and M65 ELISAs. M30 measures caspase-cleaved CK18 produced during apoptosis and M65 measures the levels of both caspase-cleaved and intact CK18, the latter of which is released from cells undergoing necrosis. Previous studies have highlighted their potential as prognostic, predictive, and pharmacological tools in the treatment of cancer. Disseminated testicular germ cell cancer (TC) is a paradigm for a chemosensitive solid malignancy of epithelial origin and has a cure rate of 80% to 90%. We conducted M30/M65 analyses on 34 patients with TC before and during treatment with bleomycin, etoposide, and cisplatin and showed that prechemotherapy serum levels of M65 and M30 antigens are correlated with established TC tumor markers lactate dehydrogenase, alpha-fetoprotein, and beta-human chorionic gonadotropin, probably reflecting tumor load. Cumulative percentage change of M65 and M30 from baseline to end of study was highest in poor prognosis patients (P < .05). Moreover, area under the curve profiles of M65 and M30 during chemotherapy mirrored dynamic profiles for lactate dehydrogenase, alpha-fetoprotein, and beta-human chorionic gonadotropin. Consequently, M65 and M30 levels appear to reflect chemotherapy-induced changes that correlate with changes in markers routinely used in the clinic for management of patients with TC. This is the first clinical study where M65 and M30 antigen levels correlate with established prognostic markers and provides impetus for their exploration in other epithelial cancers where there is a pressing need for informative circulating biomarkers.