ABSTRACT
The complexity of breast cancer biology makes it challenging to analyze large datasets of clinicopathologic and molecular attributes, toward identifying the key prognostic features and producing systems capable of predicting which patients are likely to relapse. We applied machine-learning techniques to analyze a set of well-characterized primary breast cancers, which specified the abundance and localization of various junctional proteins. We hypothesized that disruption of junctional complexes would lead to the cytoplasmic/nuclear redistribution of the protein components and their potential interactions with growth-regulating molecules, which would promote relapse, and that machine-learning techniques could use the subcellular locations of these proteins, together with standard clinicopathological data, to produce an efficient prognostic classifier. We used immunohistochemistry to assess the expression and subcellular distribution of six junctional proteins, in addition to a panel of eight standard clinical features and concentrations of four "growth-regulating" proteins, to produce a database involving 36 features, over 66 primary invasive ductal breast carcinomas. A machine-learning system was applied to this clinicopathologic dataset to produce a decision-tree classifier that could predict whether a novel breast cancer patient would relapse. We show that this decision-tree classifier, which incorporates a combination of only four features (nuclear alpha- and beta-catenin levels, the total level of PTEN and the number of involved axillary lymph nodes), is able to correctly classify patient outcomes essentially 80% of the time. Further, this classifier is significantly better than classifiers based on any subgroup of these 36 features. 
This study demonstrates that autonomous machine-learning techniques are able to generate simple and efficient decision-tree prognostic classifiers from a wide variety of clinical, pathologic and biomarker data, and unlike other analytic methods, suggest testable biologic relationships among explicitly identified key variables. The decision-tree classifier resulting from these analytic methods is sufficiently simple and should be widely applicable to a spectrum of clinical cancer settings. Further, the subcellular distribution of junctional proteins, which influences growth regulatory pathways involved in locoregional and metastatic relapse of breast cancer, helped to identify which patients would relapse while their total concentration did not. This emphasizes the need to evaluate the subcellular distribution of junctional proteins in assessing their contribution to tumor progression.
Subjects
Tumor Biomarkers/analysis , Breast Neoplasms/metabolism , Ductal Breast Carcinoma/metabolism , Connexins/metabolism , Local Neoplasm Recurrence/metabolism , Adult , Aged , Aged 80 and Over , Breast Neoplasms/pathology , Ductal Breast Carcinoma/pathology , Decision Trees , Female , Humans , Immunohistochemistry , Middle Aged , Local Neoplasm Recurrence/pathology , Prognosis
ABSTRACT
Developing predictive modeling frameworks for the potential cytotoxicity of engineered nanoparticles is critical for environmental and health risk analysis. The complexity and heterogeneity of the available data on potential risks of nanoparticles, together with the interdependency of the relevant influential attributes, make it challenging to generalize nanoparticle toxicity behavior. The lack of systematic approaches to investigating these risks adds further uncertainty and variability to the body of literature and limits the generalizability of existing studies. Here, we developed a rigorous approach for assembling published evidence on the cytotoxicity of several organic and inorganic nanoparticles and unraveled hidden relationships that were not targeted in the original publications. We used a machine learning approach that employs decision trees together with feature selection algorithms (e.g., gain ratio) to analyze a set of published nanoparticle cytotoxicity sample data (2896 samples). The specific studies were selected because they specified nanoparticle-, cell-, and screening method-related attributes. The resultant decision-tree classifiers are simple and accurate, have high predictive power, and should be widely applicable to a spectrum of nanoparticle cytotoxicity settings. Among several influential attributes, we show that the cytotoxicity of nanoparticles is primarily predicted from the nanoparticle material chemistry, followed by nanoparticle concentration and size, cell type, and cytotoxicity screening indicator. Overall, our study indicates that following rigorous and transparent experimental methodology, in parallel with continuous addition to the data set developed using our approach, will offer higher predictive power and accuracy and uncover hidden relationships. Results obtained in this study help focus future studies to develop nanoparticles that are safe by design.
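The gain ratio criterion named in this abstract can be illustrated with a minimal sketch. It assumes categorical attributes and uses the textbook definition (information gain divided by split information). The toy attribute values and labels below are hypothetical and are not taken from the study's 2896 samples, and the function names are illustrative only.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of a categorical attribute with respect to class labels.

    gain ratio = information gain / split information, where split
    information is the entropy of the attribute's own value distribution.
    """
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the attribute.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: attribute columns and a binary "toxic" label.
material = ["Ag", "Ag", "SiO2", "SiO2", "Ag", "SiO2"]
cell_type = ["A", "B", "A", "B", "A", "B"]
toxic = [1, 1, 0, 0, 1, 0]

ranking = sorted(
    {"material": gain_ratio(material, toxic),
     "cell_type": gain_ratio(cell_type, toxic)}.items(),
    key=lambda kv: kv[1],
    reverse=True,
)
print(ranking)
```

In this toy example the material attribute separates the labels perfectly and therefore ranks first; that outcome is a property of the invented data, not a reproduction of the reported ranking.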
Subjects
Data Mining , Nanoparticles/chemistry , Animals , Cell Survival/physiology , Humans , Machine Learning
ABSTRACT
BACKGROUND: Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER) status of the tumor. However, the standard for determining this status, immunohistochemical analysis of formalin-fixed, paraffin-embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results. METHODS: To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin-fixed tumor. RESULTS: This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17 ± 2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status. CONCLUSIONS: Our efficient and parsimonious classifier lends itself to highly accurate, low-cost RNA-based assessment of ER-status, suitable for routine high-throughput clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions.
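A cross-validation accuracy reported as a mean with a spread can be computed along the following lines. This is a minimal sketch under stated assumptions: the abstract does not say how many folds were used or whether the ± value is a standard deviation, so the 5-fold split, the sample standard deviation, the single-feature threshold classifier, and the toy data are all hypothetical stand-ins rather than the authors' three-gene method.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy classifier: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mean_pos, mean_neg = statistics.mean(pos), statistics.mean(neg)
    return (mean_pos + mean_neg) / 2, mean_pos > mean_neg


def predict_threshold(model, x):
    """Predict 1 if x falls on the positive-class side of the threshold."""
    threshold, pos_is_high = model
    return int((x > threshold) == pos_is_high)


def cross_validate(xs, ys, k=5, seed=0):
    """Return (mean, sample standard deviation) of per-fold accuracy."""
    folds = k_fold_indices(len(xs), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit_threshold([xs[j] for j in train_idx],
                              [ys[j] for j in train_idx])
        correct = sum(predict_threshold(model, xs[j]) == ys[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Hypothetical, perfectly separable expression values for 20 tumors.
expression = [1.0 + 0.1 * i for i in range(10)] + [5.0 + 0.1 * i for i in range(10)]
er_status = [0] * 10 + [1] * 10

mean_acc, std_acc = cross_validate(expression, er_status)
print(f"cross-validation accuracy: {100 * mean_acc:.2f} ± {100 * std_acc:.2f}%")
```

The model is refit inside every fold, so each held-out sample is scored by a classifier that never saw it during training.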
Subjects
Artificial Intelligence , Computational Biology/methods , Gene Expression Profiling , Estrogen Receptors/metabolism , Biostatistics , Breast Neoplasms/metabolism , Humans
ABSTRACT
Top differentially expressed gene lists are often inconsistent between studies, and it has been suggested that small sample sizes contribute to lack of reproducibility and poor prediction accuracy in discriminative models. We considered sex differences (69 ♂, 65 ♀) in 134 human skeletal muscle biopsies using DNA microarray. The full dataset and subsamples thereof (n = 10 (5 ♂, 5 ♀) to n = 120 (60 ♂, 60 ♀)) were used to assess the effect of sample size on the differential expression of single genes, gene rank order and prediction accuracy. Using our full dataset (n = 134), we identified 717 differentially expressed transcripts (p<0.0001) and were able to predict sex with ~90% accuracy, both within our dataset and on external datasets. Both p-values and rank order of top differentially expressed genes became more variable using smaller subsamples. For example, at n = 10 (5 ♂, 5 ♀), no gene was considered differentially expressed at p<0.0001 and prediction accuracy was ~50% (no better than chance). We found that sample size clearly affects microarray analysis results; small sample sizes result in unstable gene lists and poor prediction accuracy. We anticipate this will apply to other phenotypes, in addition to sex.
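The balanced subsampling described in this abstract (equal numbers per sex at each subsample size) could be drawn as in the sketch below. It assumes simple random sampling without replacement within each group. The group labels "A" and "B" are placeholders for the two sexes, and the seed and the intermediate size of 60 are arbitrary choices, not details from the study.

```python
import random


def balanced_subsample(groups, n, rng):
    """Return indices of a random subsample with n/2 members from each group.

    `groups` holds one group label per sample; exactly two distinct labels
    are expected, and n must be even.
    """
    if n % 2:
        raise ValueError("n must be even for a balanced two-group subsample")
    by_group = {}
    for index, label in enumerate(groups):
        by_group.setdefault(label, []).append(index)
    if len(by_group) != 2:
        raise ValueError("expected exactly two groups")
    half = n // 2
    chosen = []
    for label in sorted(by_group):
        # rng.sample draws without replacement, so no biopsy is reused.
        chosen.extend(rng.sample(by_group[label], half))
    return chosen


# 134 samples split 69/65, matching the group sizes in the abstract.
groups = ["A"] * 69 + ["B"] * 65
rng = random.Random(0)
subsamples = {n: balanced_subsample(groups, n, rng) for n in (10, 60, 120)}

for n, idx in subsamples.items():
    print(n, sum(groups[i] == "A" for i in idx), sum(groups[i] == "B" for i in idx))
```

Repeating the draw many times per size and rerunning the differential expression analysis on each draw is one way to observe how gene lists vary with n; that analysis step is not shown here.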
Subjects
Oligonucleotide Array Sequence Analysis/statistics & numerical data , Messenger RNA/analysis , Rectus Abdominis/chemistry , Transcriptome , Aged , Female , Genetic Variation , Humans , Male , Middle Aged , Neoplasms/genetics , Neoplasms/pathology , Neoplasms/surgery , Oligonucleotide Array Sequence Analysis/standards , Predictive Value of Tests , Messenger RNA/genetics , Rectus Abdominis/metabolism , Reproducibility of Results , Sample Size
ABSTRACT
This study explored various feature extraction methods for use in automated diagnosis of Attention-Deficit Hyperactivity Disorder (ADHD) from functional Magnetic Resonance Image (fMRI) data. Each participant's data consisted of a resting-state fMRI scan as well as phenotypic data (age, gender, handedness, IQ, and site of scanning) from the ADHD-200 dataset. We used machine learning techniques to produce support vector machine (SVM) classifiers that attempted to differentiate between (1) all ADHD patients vs. healthy controls and (2) ADHD combined (ADHD-c) type vs. ADHD inattentive (ADHD-i) type vs. controls. In different tests, we used only the phenotypic data, only the imaging data, or both the phenotypic and imaging data. For feature extraction on fMRI data, we tested the Fast Fourier Transform (FFT), different variants of Principal Component Analysis (PCA), and combinations of FFT and PCA. PCA variants included PCA over time (PCA-t), PCA over space and time (PCA-st), and kernelized PCA (kPCA-st). Baseline chance accuracy was 64.2%, produced by guessing healthy control (the majority class) for all participants. Using only phenotypic data produced 72.9% accuracy on two-class diagnosis and 66.8% on three-class diagnosis. Diagnosis using only imaging data did not perform as well as phenotypic-only approaches. Using both phenotypic and imaging data with combined FFT and kPCA-st feature extraction yielded accuracies of 76.0% on two-class diagnosis and 68.6% on three-class diagnosis, better than phenotypic-only approaches. Our results demonstrate the potential of using FFT and kPCA-st with resting-state fMRI data as well as phenotypic data for automated diagnosis of ADHD. These results are encouraging given known challenges of learning ADHD diagnostic classifiers using the ADHD-200 dataset (see Brown et al., 2012).
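The baseline chance accuracy mentioned in this abstract is the accuracy obtained by always predicting the majority class, which a short sketch makes concrete. The label counts below are hypothetical, chosen only so that the majority proportion reproduces the reported 64.2%; they are not the actual ADHD-200 counts.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (majority label, accuracy of always predicting that label)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical labels: 642 controls and 358 ADHD participants out of 1000.
labels = ["control"] * 642 + ["adhd"] * 358

majority_label, baseline_accuracy = majority_baseline(labels)
print(f"always guessing '{majority_label}': {100 * baseline_accuracy:.1f}% accuracy")
```

Any classifier evaluated on imbalanced data should be compared against this figure rather than against 50%, which is why the reported 72.9% and 76.0% two-class accuracies are read relative to 64.2%.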
ABSTRACT
Neuroimaging-based diagnostics could potentially help clinicians make more accurate diagnoses, resulting in faster, more effective treatment. We participated in the 2011 ADHD-200 Global Competition, which involved analyzing a large dataset of 973 participants including attention-deficit hyperactivity disorder (ADHD) patients and healthy controls. Each participant's data included a resting-state functional magnetic resonance imaging (fMRI) scan as well as personal characteristic and diagnostic data. The goal was to learn a machine learning classifier that used a participant's resting-state fMRI scan to diagnose (classify) that individual into one of three categories: healthy control, ADHD combined (ADHD-C) type, or ADHD inattentive (ADHD-I) type. We used participants' personal characteristic data (site of data collection, age, gender, handedness, performance IQ, verbal IQ, and full scale IQ), without any fMRI data, as input to a logistic classifier to generate diagnostic predictions. Surprisingly, this approach achieved the highest diagnostic accuracy (62.52%) as well as the highest score (124 of 195) of any of the 21 teams participating in the competition. These results demonstrate the importance of accounting for differences in age, gender, and other personal characteristics in imaging diagnostics research. We discuss further implications of these results for fMRI-based diagnosis as well as fMRI-based clinical research. We also document our tests with a variety of imaging-based diagnostic methods, none of which performed as well as the logistic classifier using only personal characteristic data.
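A logistic classifier over personal characteristic data can be sketched as follows. This is a simplified illustration, not the competition entry: it is binary (ADHD vs. control) where the competition task had three classes, it uses plain batch gradient descent, and the two features (a scaled age and a binary gender indicator), the toy data, the learning rate and the epoch count are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=1000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log-loss for one sample is (p - y) * x.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    w, b = model
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Hypothetical participants: [scaled age, gender indicator]; 1 = ADHD, 0 = control.
X = [[0.10, 1], [0.20, 1], [0.30, 1], [0.25, 1],
     [0.80, 0], [0.90, 0], [0.70, 0], [0.85, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_logistic(X, y)
training_accuracy = sum(predict(model, xi) == yi for xi, yi in zip(X, y)) / len(y)
print(f"training accuracy on toy data: {training_accuracy:.2f}")
```

The toy data are linearly separable by construction, so the perfect training accuracy says nothing about real diagnostic performance; the sketch only shows the mechanics of fitting and applying such a model.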