RESUMO
Deep learning (DL) models find increasing application in mental state decoding, where researchers seek to understand the mapping between mental states (e.g., experiencing anger or joy) and brain activity by identifying those spatial and temporal features of brain activity that allow to accurately identify (i.e., decode) these states. Once a DL model has been trained to accurately decode a set of mental states, neuroimaging researchers often make use of methods from explainable artificial intelligence research to understand the model's learned mappings between mental states and brain activity. Here, we benchmark prominent explanation methods in a mental state decoding analysis of multiple functional Magnetic Resonance Imaging (fMRI) datasets. Our findings demonstrate a gradient between two key characteristics of an explanation in mental state decoding, namely, its faithfulness and its alignment with other empirical evidence on the mapping between brain activity and decoded mental state: explanation methods with high explanation faithfulness, which capture the model's decision process well, generally provide explanations that align less well with other empirical evidence than the explanations of methods with less faithfulness. Based on our findings, we provide guidance for neuroimaging researchers on how to choose an explanation method to gain insight into the mental state decoding decisions of DL models.
Assuntos
Encéfalo , Aprendizado Profundo , Humanos , Encéfalo/diagnóstico por imagem , Mapeamento Encefálico/métodos , Inteligência Artificial , Benchmarking , Imageamento por Ressonância Magnética/métodosRESUMO
PURPOSE: To develop a method for building MRI reconstruction neural networks robust to changes in signal-to-noise ratio (SNR) and trainable with a limited number of fully sampled scans. METHODS: We propose Noise2Recon, a consistency training method for SNR-robust accelerated MRI reconstruction that can use both fully sampled (labeled) and undersampled (unlabeled) scans. Noise2Recon uses unlabeled data by enforcing consistency between model reconstructions of undersampled scans and their noise-augmented counterparts. Noise2Recon was compared to compressed sensing and both supervised and self-supervised deep learning baselines. Experiments were conducted using retrospectively accelerated data from the mridata three-dimensional fast-spin-echo knee and two-dimensional fastMRI brain datasets. All methods were evaluated in label-limited settings and among out-of-distribution (OOD) shifts, including changes in SNR, acceleration factors, and datasets. An extensive ablation study was conducted to characterize the sensitivity of Noise2Recon to hyperparameter choices. RESULTS: In label-limited settings, Noise2Recon achieved better structural similarity, peak signal-to-noise ratio, and normalized-RMS error than all baselines and matched performance of supervised models, which were trained with 14 × $$ 14\times $$ more fully sampled scans. Noise2Recon outperformed all baselines, including state-of-the-art fine-tuning and augmentation techniques, among low-SNR scans and when generalizing to OOD acceleration factors. Augmentation extent and loss weighting hyperparameters had negligible impact on Noise2Recon compared to supervised methods, which may indicate increased training stability. CONCLUSION: Noise2Recon is a label-efficient reconstruction method that is robust to distribution shifts, such as changes in SNR, acceleration factors, and others, with limited or no fully sampled training data.
Assuntos
Aprendizado Profundo , Processamento de Imagem Assistida por Computador , Humanos , Processamento de Imagem Assistida por Computador/métodos , Razão Sinal-Ruído , Estudos Retrospectivos , Imageamento por Ressonância Magnética/métodos , Aprendizado de Máquina SupervisionadoRESUMO
PURPOSE: To compare machine learning methods for classifying mass lesions on mammography images that use predefined image features computed over lesion segmentations to those that leverage segmentation-free representation learning on a standard, public evaluation dataset. METHODS: We apply several classification algorithms to the public Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM), in which each image contains a mass lesion. Segmentation-free representation learning techniques for classifying lesions as benign or malignant include both a Bag-of-Visual-Words (BoVW) method and a Convolutional Neural Network (CNN). We compare classification performance of these techniques to that obtained using two different segmentation-dependent approaches from the literature that rely on specific combinations of end classifiers (e.g. linear discriminant analysis, neural networks) and predefined features computed over the lesion segmentation (e.g. spiculation measure, morphological characteristics, intensity metrics). RESULTS: We report area under the receiver operating characteristic curve (AZ) values for malignancy classification on CBIS-DDSM for each technique. We find average AZ values of 0.73 for a segmentation-free BoVW method, 0.86 for a segmentation-free CNN method, 0.75 for a segmentation-dependent linear discriminant analysis of Rubber-Band Straightening Transform features, and 0.58 for a hybrid rule-based neural network classification using a small number of hand-designed features. CONCLUSIONS: We find that malignancy classification performance on the CBIS-DDSM dataset using segmentation-free BoVW features is comparable to that of the best segmentation-dependent methods we study, but also observe that a common segmentation-free CNN model substantially and significantly outperforms each of these (p < 0.05). These results reinforce recent findings suggesting that representation learning techniques such as BoVW and CNNs are advantageous for mammogram analysis because they do not require lesion segmentation, the quality and specific characteristics of which can vary substantially across datasets. We further observe that segmentation-dependent methods achieve performance levels on CBIS-DDSM inferior to those achieved on the original evaluation datasets reported in the literature. Each of these findings reinforces the need for standardization of datasets, segmentation techniques, and model implementations in performance assessments of automated classifiers for medical imaging.
Assuntos
Neoplasias da Mama , Mamografia , Mama/diagnóstico por imagem , Neoplasias da Mama/diagnóstico por imagem , Computadores , Detecção Precoce de Câncer , Feminino , HumanosRESUMO
Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models 2.8 × faster and increase predictive performance an average 45.5 % versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to 1.8 × speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132 % average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60 % of the predictive performance of large hand-curated training sets.
RESUMO
Purpose To assess the ability of convolutional neural networks (CNNs) to enable high-performance automated binary classification of chest radiographs. Materials and Methods In a retrospective study, 216 431 frontal chest radiographs obtained between 1998 and 2012 were procured, along with associated text reports and a prospective label from the attending radiologist. This data set was used to train CNNs to classify chest radiographs as normal or abnormal before evaluation on a held-out set of 533 images hand-labeled by expert radiologists. The effects of development set size, training set size, initialization strategy, and network architecture on end performance were assessed by using standard binary classification metrics; detailed error analysis, including visualization of CNN activations, was also performed. Results Average area under the receiver operating characteristic curve (AUC) was 0.96 for a CNN trained with 200 000 images. This AUC value was greater than that observed when the same model was trained with 2000 images (AUC = 0.84, P < .005) but was not significantly different from that observed when the model was trained with 20 000 images (AUC = 0.95, P > .05). Averaging the CNN output score with the binary prospective label yielded the best-performing classifier, with an AUC of 0.98 (P < .005). Analysis of specific radiographs revealed that the model was heavily influenced by clinically relevant spatial regions but did not reliably generalize beyond thoracic disease. Conclusion CNNs trained with a modestly sized collection of prospectively labeled chest radiographs achieved high diagnostic performance in the classification of chest radiographs as normal or abnormal; this function may be useful for automated prioritization of abnormal chest radiographs. © RSNA, 2018 Online supplemental material is available for this article. See also the editorial by van Ginneken in this issue.
Assuntos
Redes Neurais de Computação , Interpretação de Imagem Radiográfica Assistida por Computador/métodos , Radiografia Torácica/métodos , Feminino , Humanos , Pulmão/diagnóstico por imagem , Masculino , Curva ROC , Radiologistas , Estudos RetrospectivosRESUMO
There are more than 3.7 million published articles on the biological functions or disease implications of proteins, constituting an important resource of proteomics knowledge. However, it is difficult to summarize the millions of proteomics findings in the literature manually and quantify their relevance to the biology and diseases of interest. We developed a fully automated bioinformatics framework to identify and prioritize proteins associated with any biological entity. We used the 22 targeted areas of the Biology/Disease-driven (B/D)-Human Proteome Project (HPP) as examples, prioritized the relevant proteins through their Protein Universal Reference Publication-Originated Search Engine (PURPOSE) scores, validated the relevance of the score by comparing the protein prioritization results with a curated database, computed the scores of proteins across the topics of B/D-HPP, and characterized the top proteins in the common model organisms. We further extended the bioinformatics workflow to identify the relevant proteins in all organ systems and human diseases and deployed a cloud-based tool to prioritize proteins related to any custom search terms in real time. Our tool can facilitate the prioritization of proteins for any organ system or disease of interest and can contribute to the development of targeted proteomic studies for precision medicine.
Assuntos
Biologia Computacional/métodos , Proteômica/métodos , Animais , Projeto Genoma Humano , Humanos , Medicina de Precisão/métodos , Pesquisa , Ferramenta de BuscaRESUMO
Targeted metabolomics and biochemical studies complement the ongoing investigations led by the Human Proteome Organization (HUPO) Biology/Disease-Driven Human Proteome Project (B/D-HPP). However, it is challenging to identify and prioritize metabolite and chemical targets. Literature-mining-based approaches have been proposed for target proteomics studies, but text mining methods for metabolite and chemical prioritization are hindered by a large number of synonyms and nonstandardized names of each entity. In this study, we developed a cloud-based literature mining and summarization platform that maps metabolites and chemicals in the literature to unique identifiers and summarizes the copublication trends of metabolites/chemicals and B/D-HPP topics using Protein Universal Reference Publication-Originated Search Engine (PURPOSE) scores. We successfully prioritized metabolites and chemicals associated with the B/D-HPP targeted fields and validated the results by checking against expert-curated associations and enrichment analyses. Compared with existing algorithms, our system achieved better precision and recall in retrieving chemicals related to B/D-HPP focused areas. Our cloud-based platform enables queries on all biological terms in multiple species, which will contribute to B/D-HPP and targeted metabolomics/chemical studies.
Assuntos
Computação em Nuvem , Metabolômica , Proteoma , Algoritmos , Mineração de Dados/métodos , Humanos , Ferramenta de BuscaRESUMO
MOTIVATION: A complete repository of gene-gene interactions is key for understanding cellular processes, human disease and drug response. These gene-gene interactions include both protein-protein interactions and transcription factor interactions. The majority of known interactions are found in the biomedical literature. Interaction databases, such as BioGRID and ChEA, annotate these gene-gene interactions; however, curation becomes difficult as the literature grows exponentially. DeepDive is a trained system for extracting information from a variety of sources, including text. In this work, we used DeepDive to extract both protein-protein and transcription factor interactions from over 100,000 full-text PLOS articles. METHODS: We built an extractor for gene-gene interactions that identified candidate gene-gene relations within an input sentence. For each candidate relation, DeepDive computed a probability that the relation was a correct interaction. We evaluated this system against the Database of Interacting Proteins and against randomly curated extractions. RESULTS: Our system achieved 76% precision and 49% recall in extracting direct and indirect interactions involving gene symbols co-occurring in a sentence. For randomly curated extractions, the system achieved between 62% and 83% precision based on direct or indirect interactions, as well as sentence-level and document-level precision. Overall, our system extracted 3356 unique gene pairs using 724 features from over 100,000 full-text articles. AVAILABILITY AND IMPLEMENTATION: Application source code is publicly available at https://github.com/edoughty/deepdive_genegene_app CONTACT: russ.altman@stanford.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Mineração de Dados , Epistasia Genética , Armazenamento e Recuperação da Informação , Publicações , Software , Curadoria de Dados , Bases de Dados Genéticas , HumanosRESUMO
The dark data extraction or knowledge base construction (KBC) problem is to populate a SQL database with information from unstructured data sources including emails, webpages, and pdf reports. KBC is a long-standing problem in industry and research that encompasses problems of data extraction, cleaning, and integration. We describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems. The key idea in DeepDive is that statistical inference and machine learning are key tools to attack classical data problems in extraction, cleaning, and integration in a unified and more effective manner. DeepDive programs are declarative in that one cannot write probabilistic inference algorithms; instead, one interacts by defining features or rules about the domain. A key reason for this design choice is to enable domain experts to build their own KBC systems. We present the applications, abstractions, and techniques of DeepDive employed to accelerate construction of KBC systems.
RESUMO
A major barrier to deploying healthcare AI is trustworthiness. One form of trustworthiness is a model's robustness across subgroups: while models may exhibit expert-level performance on aggregate metrics, they often rely on non-causal features, leading to errors in hidden subgroups. To take a step closer towards trustworthy seizure onset detection from EEG, we propose to leverage annotations that are produced by healthcare personnel in routine clinical workflows-which we refer to as workflow notes-that include multiple event descriptions beyond seizures. Using workflow notes, we first show that by scaling training data to 68,920 EEG hours, seizure onset detection performance significantly improves by 12.3 AUROC (Area Under the Receiver Operating Characteristic) points compared to relying on smaller training sets with gold-standard labels. Second, we reveal that our binary seizure onset detection model underperforms on clinically relevant subgroups (e.g., up to a margin of 6.5 AUROC points between pediatrics and adults), while having significantly higher FPRs (False Positive Rates) on EEG clips showing non-epileptiform abnormalities (+19 FPR points). To improve model robustness to hidden subgroups, we train a multilabel model that classifies 26 attributes other than seizures (e.g., spikes and movement artifacts) and significantly improve overall performance (+5.9 AUROC points) while greatly improving performance among subgroups (up to +8.3 AUROC points) and decreasing false positives on non-epileptiform abnormalities (by 8 FPR points). Finally, we find that our multilabel model improves clinical utility (false positives per 24 EEG hours) by a factor of 2×.
RESUMO
Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for ML models in scientific and biomedical domains. Current methods for feature attribution, which rely on "explaining" the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based attribution methods that can be applied to any encoder and any data modality. Prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 26.3 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for ML models in complex domains.
RESUMO
Purpose: Neural networks have potential to automate medical image segmentation but require expensive labeling efforts. While methods have been proposed to reduce the labeling burden, most have not been thoroughly evaluated on large, clinical datasets or clinical tasks. We propose a method to train segmentation networks with limited labeled data and focus on thorough network evaluation. Approach: We propose a semi-supervised method that leverages data augmentation, consistency regularization, and pseudolabeling and train four cardiac magnetic resonance (MR) segmentation networks. We evaluate the models on multiinstitutional, multiscanner, multidisease cardiac MR datasets using five cardiac functional biomarkers, which are compared to an expert's measurements using Lin's concordance correlation coefficient (CCC), the within-subject coefficient of variation (CV), and the Dice coefficient. Results: The semi-supervised networks achieve strong agreement using Lin's CCC ( > 0.8 ), CV similar to an expert, and strong generalization performance. We compare the error modes of the semi-supervised networks against fully supervised networks. We evaluate semi-supervised model performance as a function of labeled training data and with different types of model supervision, showing that a model trained with 100 labeled image slices can achieve a Dice coefficient within 1.10% of a network trained with 16,000+ labeled image slices. Conclusion: We evaluate semi-supervision for medical image segmentation using heterogeneous datasets and clinical metrics. As methods for training models with little labeled data become more common, knowledge about how they perform on clinical tasks, how they fail, and how they perform with different amounts of labeled data is useful to model developers and users.
RESUMO
INTRODUCTION: Esophageal 24-hour pH/impedance testing is routinely performed to diagnose gastroesophageal reflux disease. Interpretation of these studies is time-intensive for expert physicians and has high inter-reader variability. There are no commercially available machine learning tools to assist with automated identification of reflux events in these studies. METHODS: A machine learning system to identify reflux events in 24-hour pH/impedance studies was developed, which included an initial signal processing step and a machine learning model. Gold-standard reflux events were defined by a group of expert physicians. Performance metrics were computed to compare the machine learning system, current automated detection software (Reflux Reader v6.1), and an expert physician reader. RESULTS: The study cohort included 45 patients (20/5/20 patients in the training/validation/test sets, respectively). The mean age was 51 (standard deviation 14.5) years, 47% of patients were male, and 78% of studies were performed off proton-pump inhibitor. Comparing the machine learning system vs current automated software vs expert physician reader, area under the curve was 0.87 (95% confidence interval [CI] 0.85-0.89) vs 0.40 (95% CI 0.37-0.42) vs 0.83 (95% CI 0.81-0.86), respectively; sensitivity was 68.7% vs 61.1% vs 79.4%, respectively; and specificity was 80.8% vs 18.6% vs 87.3%, respectively. DISCUSSION: We trained and validated a novel machine learning system to successfully identify reflux events in 24-hour pH/impedance studies. Our model performance was superior to that of existing software and comparable to that of a human reader. Machine learning tools could significantly improve automated interpretation of pH/impedance studies.
Assuntos
Monitoramento do pH Esofágico , Refluxo Gastroesofágico , Humanos , Masculino , Pessoa de Meia-Idade , Feminino , Impedância Elétrica , Refluxo Gastroesofágico/diagnóstico , Concentração de Íons de HidrogênioRESUMO
In mental state decoding, researchers aim to identify the set of mental states (e.g., experiencing happiness or fear) that can be reliably identified from the activity patterns of a brain region (or network). Deep learning (DL) models are highly promising for mental state decoding because of their unmatched ability to learn versatile representations of complex data. However, their widespread application in mental state decoding is hindered by their lack of interpretability, difficulties in applying them to small datasets, and in ensuring their reproducibility and robustness. We recommend approaching these challenges by leveraging recent advances in explainable artificial intelligence (XAI) and transfer learning, and also provide recommendations on how to improve the reproducibility and robustness of DL models in mental state decoding.
Assuntos
Inteligência Artificial , Mapeamento Encefálico , Aprendizado Profundo , Encéfalo , Humanos , Aprendizado de Máquina , Neuroimagem , Reprodutibilidade dos TestesRESUMO
OBJECTIVES: To determine recent trends in: (1) human immunodeficiency virus (HIV) diagnoses, (2) the proportion of patients newly diagnosed with HIV with injection drug use (IDU) and (3) patients' patterns of healthcare utilization in the year before diagnosis at an urban, academic medical center. METHODS: We performed a cross sectional study of patients newly diagnosed with HIV at a healthcare system in southern New Jersey between January 1st, 2014 and December 31st, 2019. Patients 18 years or older with HIV diagnosed during the study period were included. Demographics, comorbidities, HIV test results, and healthcare utilization data were collected from the electronic medical record. RESULTS: Of 192 patients newly diagnosed with HIV, 36 (19%) had documented IDU. New HIV diagnoses doubled from 22 to 47 annual cases between 2014 and 2019. The proportion of patients with newly diagnosed HIV and documented IDU increased from 9% in 2014 to 32% in 2019, chi-square test for linear trend P value = 0.001. Eighty-nine percent of patients with IDU had at least one contact with the healthcare system in the year before diagnosis compared to 63% of patients without IDU, P value 0.003. The median (interquartile range IQR) number of healthcare visits was 7 [2 - 16] for patients with IDU versus 1 [0 - 3] for patients without IDU, P < 0.001. CONCLUSIONS: We observed an increase in new HIV diagnoses with an increase in the proportion of newly diagnosed patients with IDU. Patients with newly diagnosed HIV and IDU had high rates of health care utilization in the year before diagnosis presenting an opportunity for intervention.
Assuntos
Infecções por HIV , Abuso de Substâncias por Via Intravenosa , Estudos Transversais , Infecções por HIV/diagnóstico , Infecções por HIV/epidemiologia , Humanos , Aceitação pelo Paciente de Cuidados de Saúde , Abuso de Substâncias por Via Intravenosa/epidemiologiaRESUMO
PURPOSE: To develop a convolutional neural network (CNN) to triage head CT (HCT) studies and investigate the effect of upstream medical image processing on the CNN's performance. MATERIALS AND METHODS: A total of 9776 HCT studies were retrospectively collected from 2001 through 2014, and a CNN was trained to triage them as normal or abnormal. CNN performance was evaluated on a held-out test set, assessing triage performance and sensitivity to 20 disorders to assess differential model performance, with 7856 CT studies in the training set, 936 in the validation set, and 984 in the test set. This CNN was used to understand how the upstream imaging chain affects CNN performance by evaluating performance after altering three variables: image acquisition by reducing the number of x-ray projections, image reconstruction by inputting sinogram data into the CNN, and image preprocessing. To evaluate performance, the DeLong test was used to assess differences in the area under the receiver operating characteristic curve (AUROC), and the McNemar test was used to compare sensitivities. RESULTS: The CNN achieved a mean AUROC of 0.84 (95% CI: 0.83, 0.84) in discriminating normal and abnormal HCT studies. The number of x-ray projections could be reduced by 16 times and the raw sensor data could be input into the CNN with no statistically significant difference in classification performance. Additionally, CT windowing consistently improved CNN performance, increasing the mean triage AUROC by 0.07 points. CONCLUSION: A CNN was developed to triage HCT studies, which may help streamline image evaluation, and the means by which upstream image acquisition, reconstruction, and preprocessing affect downstream CNN performance was investigated, bringing focus to this important part of the imaging chain.Keywords Head CT, Automated Triage, Deep Learning, Sinogram, DatasetSupplemental material is available for this article.© RSNA, 2021.
RESUMO
Importance: Implant registries provide valuable information on the performance of implants in a real-world setting, yet they have traditionally been expensive to establish and maintain. Electronic health records (EHRs) are widely used and may include the information needed to generate clinically meaningful reports similar to a formal implant registry. Objectives: To quantify the extractability and accuracy of registry-relevant data from the EHR and to assess the ability of these data to track trends in implant use and the durability of implants (hereafter referred to as implant survivorship), using data stored since 2000 in the EHR of the largest integrated health care system in the United States. Design, Setting, and Participants: Retrospective cohort study of a large EHR of veterans who had 45â¯351 total hip arthroplasty procedures in Veterans Health Administration hospitals from 2000 to 2017. Data analysis was performed from January 1, 2000, to December 31, 2017. Exposures: Total hip arthroplasty. Main Outcomes and Measures: Number of total hip arthroplasty procedures extracted from the EHR, trends in implant use, and relative survivorship of implants. Results: A total of 45â¯351 total hip arthroplasty procedures were identified from 2000 to 2017 with 192â¯805 implant parts. Data completeness improved over the time. After 2014, 85% of prosthetic heads, 91% of shells, 81% of stems, and 85% of liners used in the Veterans Health Administration health care system were identified by part number. Revision burden and trends in metal vs ceramic prosthetic femoral head use were found to reflect data from the American Joint Replacement Registry. Recalled implants were obvious negative outliers in implant survivorship using Kaplan-Meier curves. Conclusions and Relevance: Although loss to follow-up remains a challenge that requires additional attention to improve the quantitative nature of calculated implant survivorship, we conclude that data collected during routine clinical care and stored in the EHR of a large health system over 18 years were sufficient to provide clinically meaningful data on trends in implant use and to identify poor implants that were subsequently recalled. This automated approach was low cost and had no reporting burden. This low-cost, low-overhead method to assess implant use and performance within a large health care setting may be useful to internal quality assurance programs and, on a larger scale, to postmarket surveillance of implant performance.
Assuntos
Artroplastia de Quadril/estatística & dados numéricos , Registros Eletrônicos de Saúde/estatística & dados numéricos , Adulto , Idoso , Idoso de 80 Anos ou mais , Estudos de Coortes , Feminino , Humanos , Masculino , Pessoa de Meia-Idade , Sistema de Registros , Reprodutibilidade dos Testes , Estudos Retrospectivos , Adulto JovemRESUMO
Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model may still consistently miss a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring hidden stratification effects, and characterize these effects both via synthetic experiments on the CIFAR-100 benchmark dataset and on multiple real-world medical imaging datasets. Using these measurement techniques, we find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we discuss the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.
RESUMO
Automated seizure detection from electroencephalography (EEG) would improve the quality of patient care while reducing medical costs, but achieving reliably high performance across patients has proven difficult. Convolutional Neural Networks (CNNs) show promise in addressing this problem, but they are limited by a lack of large labeled training datasets. We propose using imperfect but plentiful archived annotations to train CNNs for automated, real-time EEG seizure detection across patients. While these weak annotations indicate possible seizures with precision scores as low as 0.37, they are commonly produced in large volumes within existing clinical workflows by a mixed group of technicians, fellows, students, and board-certified epileptologists. We find that CNNs trained using such weak annotations achieve Area Under the Receiver Operating Characteristic curve (AUROC) values of 0.93 and 0.94 for pediatric and adult seizure onset detection, respectively. Compared to currently deployed clinical software, our model provides a 31% increase (18 points) in F1-score for pediatric patients and a 17% increase (11 points) for adult patients. These results demonstrate that weak annotations, which are sustainably collected via existing clinical workflows, can be leveraged to produce clinically useful seizure detection models.
RESUMO
OBJECTIVE: Non-small cell lung cancer is a leading cause of cancer death worldwide, and histopathological evaluation plays the primary role in its diagnosis. However, the morphological patterns associated with the molecular subtypes have not been systematically studied. To bridge this gap, we developed a quantitative histopathology analytic framework to identify the types and gene expression subtypes of non-small cell lung cancer objectively. MATERIALS AND METHODS: We processed whole-slide histopathology images of lung adenocarcinoma (n = 427) and lung squamous cell carcinoma patients (n = 457) in the Cancer Genome Atlas. We built convolutional neural networks to classify histopathology images, evaluated their performance by the areas under the receiver-operating characteristic curves (AUCs), and validated the results in an independent cohort (n = 125). RESULTS: To establish neural networks for quantitative image analyses, we first built convolutional neural network models to identify tumor regions from adjacent dense benign tissues (AUCs > 0.935) and recapitulated expert pathologists' diagnosis (AUCs > 0.877), with the results validated in an independent cohort (AUCs = 0.726-0.864). We further demonstrated that quantitative histopathology morphology features identified the major transcriptomic subtypes of both adenocarcinoma and squamous cell carcinoma (P < .01). DISCUSSION: Our study is the first to classify the transcriptomic subtypes of non-small cell lung cancer using fully automated machine learning methods. Our approach does not rely on prior pathology knowledge and can discover novel clinically relevant histopathology patterns objectively. The developed procedure is generalizable to other tumor types or diseases.