Results 1 - 17 of 17
1.
Article in English | MEDLINE | ID: mdl-37018684

ABSTRACT

Reduction in 30-day readmission rate is an important quality factor for hospitals, as it can reduce the overall cost of care and improve patient post-discharge outcomes. While deep-learning-based studies have shown promising empirical results, prior models for hospital readmission prediction share several limitations: (a) they consider only patients with certain conditions, (b) they do not leverage data temporality, (c) they assume individual admissions are independent of one another, which ignores patient similarity, and (d) they are limited to single-modality or single-center data. In this study, we propose a multimodal, spatiotemporal graph neural network (MM-STGNN) for prediction of 30-day all-cause hospital readmission, which fuses in-patient multimodal, longitudinal data and models patient similarity using a graph. Using longitudinal chest radiographs and electronic health records from two independent centers, we show that MM-STGNN achieved an area under the receiver operating characteristic curve (AUROC) of 0.79 on both datasets. Furthermore, MM-STGNN significantly outperformed the current clinical reference standard, LACE+ (AUROC=0.61), on the internal dataset. For subset populations of patients with heart disease, our model significantly outperformed baselines such as gradient boosting and Long Short-Term Memory models (e.g., AUROC improved by 3.7 points in patients with heart disease). Qualitative interpretability analysis indicated that, while patients' primary diagnoses were not explicitly used to train the model, features crucial for model prediction may reflect patients' diagnoses. Our model could be used as an additional clinical decision aid during discharge disposition and for triaging high-risk patients toward closer post-discharge follow-up and potential preventive measures.
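
As a rough illustration of the patient-similarity idea above (a hypothetical sketch, not the authors' exact construction), one common recipe connects each admission to its nearest neighbors in a fused-embedding space and hands the resulting graph to a GNN:

```python
# Hypothetical sketch: build a patient-similarity graph by linking each
# admission to its k nearest neighbors under cosine distance on fused
# multimodal embeddings. The paper's actual graph construction may differ.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_patient_graph(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return a symmetric binary adjacency matrix over admissions."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, idx = nn.kneighbors(embeddings)        # (n, k+1); first hit is self
    n = embeddings.shape[0]
    adj = np.zeros((n, n), dtype=np.float32)
    for i, neighbors in enumerate(idx):
        for j in neighbors[1:]:               # skip the self-match
            adj[i, j] = adj[j, i] = 1.0       # undirected edge
    return adj
```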

2.
J Pathol Inform ; 13: 100142, 2022.
Article in English | MEDLINE | ID: mdl-36605116

ABSTRACT

Several machine learning algorithms have demonstrated high predictive capability in the identification of cancer within digitized pathology slides. The Augmented Reality Microscope (ARM) has allowed these algorithms to be seamlessly integrated within the pathology workflow by overlaying their inferences onto its microscopic field of view in real time. We present an independent assessment of the LYmph Node Assistant (LYNA) models, state-of-the-art algorithms for the identification of breast cancer metastases in lymph node biopsies, optimized for usage on the ARM. We assessed the models on 40 whole slide images at the commonly used objective magnifications of 10×, 20×, and 40×. We analyzed their performance across clinically relevant subclasses of tissue, including breast cancer, lymphocytes, histiocytes, blood, and fat. Each model obtained overall AUC values of approximately 0.98, accuracy values of approximately 0.94, and sensitivity values above 0.88 at classifying small regions of a field of view as benign or cancerous. Across tissue subclasses, the models performed most accurately on fat and blood, and least accurately on histiocytes, germinal centers, and sinus. The models also struggled with the identification of isolated tumor cells, especially at lower magnifications. After testing, we reviewed the discrepancies between model predictions and ground truth to understand the causes of error. We introduce a distinction between proper and improper ground truth for analysis in cases of uncertain annotations. Taken together, these methods comprise a novel approach for exploratory model analysis over complex anatomic pathology data in which precise ground truth is difficult to establish.
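
A minimal sketch of the per-subclass evaluation described above, assuming arrays of region-level model scores, binary ground truth, and a tissue-subclass tag per region (names are illustrative, not taken from the released models):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

def per_subclass_metrics(scores, labels, subclasses):
    """AUC and accuracy per tissue subclass (e.g., fat, blood, histiocytes)."""
    scores, labels, subclasses = map(np.asarray, (scores, labels, subclasses))
    results = {}
    for sub in np.unique(subclasses):
        m = subclasses == sub
        if len(np.unique(labels[m])) < 2:
            continue  # AUC is undefined when only one class is present
        results[sub] = {
            "auc": roc_auc_score(labels[m], scores[m]),
            "accuracy": accuracy_score(labels[m], scores[m] >= 0.5),
        }
    return results
```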

3.
Radiol Clin North Am ; 59(6): 1063-1074, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34689874

ABSTRACT

Although recent scientific studies suggest that artificial intelligence (AI) could provide value in many radiology applications, much of the hard engineering work required to consistently realize this value in practice remains to be done. In this article, we summarize the various ways in which AI can benefit radiology practice, identify key challenges that must be overcome for those benefits to be delivered, and discuss promising avenues by which these challenges can be addressed.


Subject(s)
Artificial Intelligence/standards; Diagnostic Imaging/methods; Image Interpretation, Computer-Assisted/methods; Radiology/methods; Radiology/standards; Diagnostic Imaging/standards; Humans; Image Interpretation, Computer-Assisted/standards; Reproducibility of Results; Software
4.
Radiol Artif Intell ; 3(4): e200229, 2021 Jul.
Article in English | MEDLINE | ID: mdl-34350412

ABSTRACT

PURPOSE: To develop a convolutional neural network (CNN) to triage head CT (HCT) studies and investigate the effect of upstream medical image processing on the CNN's performance. MATERIALS AND METHODS: A total of 9776 HCT studies were retrospectively collected from 2001 through 2014, and a CNN was trained to triage them as normal or abnormal. CNN performance was evaluated on a held-out test set, assessing triage performance and sensitivity to 20 disorders to assess differential model performance, with 7856 CT studies in the training set, 936 in the validation set, and 984 in the test set. This CNN was used to understand how the upstream imaging chain affects CNN performance by evaluating performance after altering three variables: image acquisition by reducing the number of x-ray projections, image reconstruction by inputting sinogram data into the CNN, and image preprocessing. To evaluate performance, the DeLong test was used to assess differences in the area under the receiver operating characteristic curve (AUROC), and the McNemar test was used to compare sensitivities. RESULTS: The CNN achieved a mean AUROC of 0.84 (95% CI: 0.83, 0.84) in discriminating normal and abnormal HCT studies. The number of x-ray projections could be reduced by 16 times and the raw sensor data could be input into the CNN with no statistically significant difference in classification performance. Additionally, CT windowing consistently improved CNN performance, increasing the mean triage AUROC by 0.07 points. CONCLUSION: A CNN was developed to triage HCT studies, which may help streamline image evaluation, and the means by which upstream image acquisition, reconstruction, and preprocessing affect downstream CNN performance was investigated, bringing focus to this important part of the imaging chain. Keywords: Head CT, Automated Triage, Deep Learning, Sinogram, Dataset. Supplemental material is available for this article. © RSNA, 2021.
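
The CT windowing step that improved triage AUROC is easy to make concrete; a minimal sketch follows (the brain-window values shown are common defaults, not necessarily those used in the study):

```python
import numpy as np

def window_ct(hu: np.ndarray, center: float = 40.0, width: float = 80.0):
    """Clip Hounsfield units to a (center, width) window and rescale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)
```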

5.
Sci Rep ; 11(1): 8366, 2021 04 16.
Article in English | MEDLINE | ID: mdl-33863957

ABSTRACT

The reliability of machine learning models can be compromised when trained on low quality data. Many large-scale medical imaging datasets contain low quality labels extracted from sources such as medical reports. Moreover, images within a dataset may have heterogeneous quality due to artifacts and biases arising from equipment or measurement errors. Therefore, algorithms that can automatically identify low quality data are highly desired. In this study, we used data Shapley, a data valuation metric, to quantify the value of training data to the performance of a pneumonia detection algorithm in a large chest X-ray dataset. We characterized the effectiveness of data Shapley in identifying low quality versus valuable data for pneumonia detection. We found that removing training data with high Shapley values decreased the pneumonia detection performance, whereas removing data with low Shapley values improved the model performance. Furthermore, there were more mislabeled examples in low Shapley value data and more true pneumonia cases in high Shapley value data. Our results suggest that low Shapley value indicates mislabeled or poor quality images, whereas high Shapley value indicates data that are valuable for pneumonia detection. Our method can serve as a framework for using data Shapley to denoise large-scale medical imaging datasets.
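
Data Shapley is typically approximated by truncated Monte Carlo sampling over permutations of the training set; a compact sketch under that standard formulation (score_fn, the baseline, and the tolerance are placeholders):

```python
import numpy as np

def mc_data_shapley(n_points, score_fn, n_perm=50, baseline=0.5, tol=1e-3):
    """Truncated Monte Carlo data Shapley.

    score_fn(idx) trains a model on the training points indexed by `idx`
    and returns validation performance; `baseline` is the empty-set score.
    """
    values = np.zeros(n_points)
    full_score = score_fn(np.arange(n_points))
    for _ in range(n_perm):
        perm = np.random.permutation(n_points)
        prev = baseline
        for t in range(1, n_points + 1):
            if abs(full_score - prev) < tol:
                break                         # truncate: marginal gains ~ 0
            cur = score_fn(perm[:t])
            values[perm[t - 1]] += (cur - prev) / n_perm
            prev = cur
    return values                             # low values flag suspect data
```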


Subject(s)
Algorithms; Diagnostic Imaging/methods; Image Processing, Computer-Assisted/methods; Machine Learning; Neural Networks, Computer; Pneumonia/diagnosis; Radiography, Thoracic/methods; Datasets as Topic; Humans
6.
Nat Commun ; 12(1): 1880, 2021 03 25.
Article in English | MEDLINE | ID: mdl-33767174

ABSTRACT

Computational decision support systems could provide clinical value in whole-body FDG-PET/CT workflows. However, limited availability of labeled data combined with the large size of PET/CT imaging exams make it challenging to apply existing supervised machine learning systems. Leveraging recent advancements in natural language processing, we describe a weak supervision framework that extracts imperfect, yet highly granular, regional abnormality labels from free-text radiology reports. Our framework automatically labels each region in a custom ontology of anatomical regions, providing a structured profile of the pathologies in each imaging exam. Using these generated labels, we then train an attention-based, multi-task CNN architecture to detect and estimate the location of abnormalities in whole-body scans. We demonstrate empirically that our multi-task representation is critical for strong performance on rare abnormalities with limited training data. The representation also contributes to more accurate mortality prediction from imaging data, suggesting the potential utility of our framework beyond abnormality detection and location estimation.
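
The flavor of the label-extraction step can be sketched with a toy rule set (the paper's anatomical ontology and NLP are far richer than this illustrative regex):

```python
import re

REGIONS = ["lung", "liver", "spine", "pelvis", "mediastinum"]   # toy ontology
ABNORMAL = re.compile(r"\b(uptake|lesion|mass|hypermetabolic|abnormal)\b", re.I)

def label_regions(report: str) -> dict:
    """Flag each region mentioned alongside abnormality language."""
    labels = {}
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        for region in REGIONS:
            if region in sentence.lower() and ABNORMAL.search(sentence):
                labels[region] = 1
    return {r: labels.get(r, 0) for r in REGIONS}
```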


Subject(s)
Decision Support Systems, Clinical; Image Interpretation, Computer-Assisted/methods; Machine Learning; Positron Emission Tomography Computed Tomography/methods; Whole Body Imaging/methods; Datasets as Topic; Fluorodeoxyglucose F18; Humans; Multitasking Behavior; Natural Language Processing
7.
J Biomed Inform ; 113: 103656, 2021 01.
Article in English | MEDLINE | ID: mdl-33309994

ABSTRACT

PURPOSE: To compare machine learning methods for classifying mass lesions on mammography images that use predefined image features computed over lesion segmentations to those that leverage segmentation-free representation learning on a standard, public evaluation dataset. METHODS: We apply several classification algorithms to the public Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM), in which each image contains a mass lesion. Segmentation-free representation learning techniques for classifying lesions as benign or malignant include both a Bag-of-Visual-Words (BoVW) method and a Convolutional Neural Network (CNN). We compare classification performance of these techniques to that obtained using two different segmentation-dependent approaches from the literature that rely on specific combinations of end classifiers (e.g. linear discriminant analysis, neural networks) and predefined features computed over the lesion segmentation (e.g. spiculation measure, morphological characteristics, intensity metrics). RESULTS: We report area under the receiver operating characteristic curve (A_z) values for malignancy classification on CBIS-DDSM for each technique. We find average A_z values of 0.73 for a segmentation-free BoVW method, 0.86 for a segmentation-free CNN method, 0.75 for a segmentation-dependent linear discriminant analysis of Rubber-Band Straightening Transform features, and 0.58 for a hybrid rule-based neural network classification using a small number of hand-designed features. CONCLUSIONS: We find that malignancy classification performance on the CBIS-DDSM dataset using segmentation-free BoVW features is comparable to that of the best segmentation-dependent methods we study, but also observe that a common segmentation-free CNN model substantially and significantly outperforms each of these (p < 0.05). These results reinforce recent findings suggesting that representation learning techniques such as BoVW and CNNs are advantageous for mammogram analysis because they do not require lesion segmentation, the quality and specific characteristics of which can vary substantially across datasets. We further observe that segmentation-dependent methods achieve performance levels on CBIS-DDSM inferior to those achieved on the original evaluation datasets reported in the literature. Each of these findings reinforces the need for standardization of datasets, segmentation techniques, and model implementations in performance assessments of automated classifiers for medical imaging.
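
A minimal segmentation-free BoVW pipeline in the spirit described above; the detector, codebook size, and classifier here are illustrative choices, not the paper's exact configuration:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def bovw_histograms(images, n_words=256):
    """Map each grayscale uint8 image to a normalized visual-word histogram."""
    orb = cv2.ORB_create()
    per_image = []
    for img in images:
        _, desc = orb.detectAndCompute(img, None)
        desc = np.zeros((0, 32), np.uint8) if desc is None else desc
        per_image.append(desc.astype(np.float32))
    codebook = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(per_image))
    hists = []
    for desc in per_image:
        h = np.zeros(n_words, np.float32)
        if len(desc):
            for w in codebook.predict(desc):
                h[w] += 1
            h /= h.sum()
        hists.append(h)
    return np.stack(hists)

# Downstream malignancy classifier, e.g.:
# clf = LogisticRegression(max_iter=1000).fit(bovw_histograms(train_imgs), y)
```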


Subject(s)
Breast Neoplasms; Mammography; Breast/diagnostic imaging; Breast Neoplasms/diagnostic imaging; Computers; Early Detection of Cancer; Female; Humans
8.
Article in English | MEDLINE | ID: mdl-33196064

ABSTRACT

Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model may still consistently miss a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring hidden stratification effects, and characterize these effects both via synthetic experiments on the CIFAR-100 benchmark dataset and on multiple real-world medical imaging datasets. Using these measurement techniques, we find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we discuss the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.
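
One of the measurement strategies examined is algorithmic: cluster within-class model embeddings and look for clusters with anomalous error rates. A hedged sketch of that idea (the cluster count and gap threshold are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def flag_hidden_strata(embeddings, correct, n_clusters=8, gap=0.2):
    """Return (cluster id, size, accuracy) for clusters whose accuracy
    trails the overall mean by more than `gap`."""
    correct = np.asarray(correct, dtype=float)   # 1 = model was right
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    overall = correct.mean()
    flagged = []
    for c in range(n_clusters):
        m = clusters == c
        if m.any() and overall - correct[m].mean() > gap:
            flagged.append((c, int(m.sum()), float(correct[m].mean())))
    return flagged
```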

10.
Patterns (N Y) ; 1(2)2020 May 08.
Article in English | MEDLINE | ID: mdl-32776018

ABSTRACT

A major bottleneck in developing clinically impactful machine learning models is a lack of labeled training data for model supervision. Thus, medical researchers increasingly turn to weaker, noisier sources of supervision, such as leveraging extractions from unstructured text reports to supervise image classification. A key challenge in weak supervision is combining sources of information that may differ in quality and have correlated errors. Recently, a statistical theory of weak supervision called data programming has shown promise in addressing this challenge. Data programming now underpins many deployed machine-learning systems in the technology industry, even for critical applications. We propose a new technique for applying data programming to the problem of cross-modal weak supervision in medicine, wherein weak labels derived from an auxiliary modality (e.g., text) are used to train models over a different target modality (e.g., images). We evaluate our approach on diverse clinical tasks via direct comparison to institution-scale, hand-labeled datasets. We find that our supervision technique increases model performance by up to 6 points area under the receiver operating characteristic curve (ROC-AUC) over baseline methods by improving both coverage and quality of the weak labels. Our approach yields models that on average perform within 1.75 points ROC-AUC of those supervised with physician-years of hand labeling and outperform those supervised with physician-months of hand labeling by 10.25 points ROC-AUC, while using only person-days of developer time and clinician work, a time saving of 96%. Our results suggest that modern weak supervision techniques such as data programming may enable more rapid development and deployment of clinically useful machine-learning models.
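
The cross-modal pattern can be sketched with the open-source Snorkel API: text-based labeling functions vote on reports, a label model denoises the votes, and the resulting probabilistic labels supervise an image model. The report field and rules below are illustrative, not the paper's:

```python
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1

@labeling_function()
def lf_mentions_opacity(x):
    return ABNORMAL if "opacity" in x.report.lower() else ABSTAIN

@labeling_function()
def lf_no_acute_findings(x):
    return NORMAL if "no acute" in x.report.lower() else ABSTAIN

# L = PandasLFApplier([lf_mentions_opacity, lf_no_acute_findings]).apply(df)
# label_model = LabelModel(cardinality=2)
# label_model.fit(L)
# probs = label_model.predict_proba(L)   # soft labels for training the CNN
```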

11.
NPJ Digit Med ; 3: 59, 2020.
Article in English | MEDLINE | ID: mdl-32352037

ABSTRACT

Automated seizure detection from electroencephalography (EEG) would improve the quality of patient care while reducing medical costs, but achieving reliably high performance across patients has proven difficult. Convolutional Neural Networks (CNNs) show promise in addressing this problem, but they are limited by a lack of large labeled training datasets. We propose using imperfect but plentiful archived annotations to train CNNs for automated, real-time EEG seizure detection across patients. While these weak annotations indicate possible seizures with precision scores as low as 0.37, they are commonly produced in large volumes within existing clinical workflows by a mixed group of technicians, fellows, students, and board-certified epileptologists. We find that CNNs trained using such weak annotations achieve Area Under the Receiver Operating Characteristic curve (AUROC) values of 0.93 and 0.94 for pediatric and adult seizure onset detection, respectively. Compared to currently deployed clinical software, our model provides a 31% increase (18 points) in F1-score for pediatric patients and a 17% increase (11 points) for adult patients. These results demonstrate that weak annotations, which are sustainably collected via existing clinical workflows, can be leveraged to produce clinically useful seizure detection models.
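
The quoted relative and absolute gains can be cross-checked with a line of arithmetic (rounded figures; a back-of-envelope reading, not reported data): both groups land near F1 = 76.

```python
# Back-of-envelope: recover the implied baseline F1 from (relative, absolute) gains.
for pct, pts, group in [(0.31, 18, "pediatric"), (0.17, 11, "adult")]:
    baseline = pts / pct
    print(f"{group}: baseline F1 ~ {baseline:.0f}, new F1 ~ {baseline + pts:.0f}")
```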

12.
NPJ Digit Med ; 3: 61, 2020.
Article in English | MEDLINE | ID: mdl-32352039

ABSTRACT

Pulmonary embolism (PE) is a life-threatening clinical problem, and computed tomography pulmonary angiography (CTPA) is the gold standard for diagnosis. Prompt diagnosis and immediate treatment are critical to avoid high morbidity and mortality rates, yet PE remains among the diagnoses most frequently missed or delayed. In this study, we developed a deep learning model, PENet, to automatically detect PE on volumetric CTPA scans as an end-to-end solution for this purpose. PENet is a 77-layer 3D convolutional neural network (CNN) pretrained on the Kinetics-600 dataset and fine-tuned on a retrospective CTPA dataset collected from a single academic institution. Model performance was evaluated on data from two different institutions: a hold-out dataset from the same institution as the training data, and a second dataset collected from an external institution to evaluate generalizability to an unrelated population. PENet achieved an AUROC of 0.84 [0.82-0.87] for detecting PE on the internal hold-out test set and 0.85 [0.81-0.88] on the external dataset, and outperformed current state-of-the-art 3D CNN models. These results represent a successful application of an end-to-end 3D CNN to the complex task of PE diagnosis without computationally intensive and time-consuming preprocessing, and demonstrate sustained performance on data from an external institution. Our model could be applied as a triage tool to automatically identify clinically important PEs, allowing prioritization for diagnostic radiology interpretation and improved care pathways via more efficient diagnosis.
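
PENet itself is a custom 77-layer 3D CNN; as a rough stand-in, this sketch fine-tunes torchvision's off-the-shelf r3d_18 (pretrained on Kinetics-400, not Kinetics-600) to emit a single binary PE logit:

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

model = r3d_18(weights="KINETICS400_V1")         # video-pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 1)    # single PE-vs-no-PE logit

# CTPA input as (batch, 3, frames, H, W); windowed CT slices are replicated
# across the 3 channels to match the video-pretrained stem.
x = torch.randn(2, 3, 24, 112, 112)
logits = model(x).squeeze(1)
loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([1.0, 0.0]))
```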

13.
Proc AAAI Conf Artif Intell ; 33: 4763-4771, 2019.
Article in English | MEDLINE | ID: mdl-31565535

ABSTRACT

As machine learning models continue to increase in complexity, collecting large hand-labeled training sets has become one of the biggest roadblocks in practice. Instead, weaker forms of supervision that provide noisier but cheaper labels are often used. However, these weak supervision sources have diverse and unknown accuracies, may output correlated labels, and may label different tasks or apply at different levels of granularity. We propose a framework for integrating and modeling such weak supervision sources by viewing them as labeling different related sub-tasks of a problem, which we refer to as the multi-task weak supervision setting. We show that by solving a matrix completion-style problem, we can recover the accuracies of these multi-task sources given their dependency structure, but without any labeled data, leading to higher-quality supervision for training an end model. Theoretically, we show that the generalization error of models trained with this approach improves with the number of unlabeled data points, and characterize the scaling with respect to the task and dependency structures. On three fine-grained classification problems, we show that our approach leads to average gains of 20.2 points in accuracy over a traditional supervised approach, 6.8 points over a majority vote baseline, and 4.1 points over a previously proposed weak supervision method that models tasks separately.
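
A simplified, single-task illustration of the core trick of recovering source accuracies without labels: for three conditionally independent sources voting in {-1, +1} on a balanced binary label, E[l_i * l_j] = a_i * a_j, so each accuracy parameter a_i = E[l_i * y] falls out of pairwise agreement rates alone (the paper generalizes this to correlated, multi-task sources via a matrix-completion formulation):

```python
import numpy as np

def triplet_accuracies(L):
    """L: (n, 3) votes in {-1, +1} from three conditionally independent
    sources on a balanced binary task. Returns a_i = E[l_i * y]."""
    m01 = np.mean(L[:, 0] * L[:, 1])
    m02 = np.mean(L[:, 0] * L[:, 2])
    m12 = np.mean(L[:, 1] * L[:, 2])
    a0 = np.sqrt(m01 * m02 / m12)
    a1 = np.sqrt(m01 * m12 / m02)
    a2 = np.sqrt(m02 * m12 / m01)
    return a0, a1, a2   # source accuracy = (1 + a_i) / 2 under these assumptions
```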

14.
Nat Commun ; 10(1): 3111, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31308376

ABSTRACT

Biomedical repositories such as the UK Biobank provide increasing access to prospectively collected cardiac imaging; however, these data are unlabeled, which creates barriers to their use in supervised machine learning. We develop a weakly supervised deep learning model for classification of aortic valve malformations using up to 4,000 unlabeled cardiac MRI sequences. Instead of requiring highly curated training data, weak supervision relies on noisy heuristics defined by domain experts to programmatically generate large-scale, imperfect training labels. For aortic valve classification, models trained with imperfect labels substantially outperform a supervised model trained on hand-labeled MRIs. In an orthogonal validation experiment using health outcomes data, our model identifies individuals with a 1.8-fold increase in risk of a major adverse cardiac event. This work formalizes a deep learning baseline for aortic valve classification and outlines a general strategy for using weak supervision to train machine learning models using unlabeled medical images at scale.
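
The "noisy heuristics" are labeling functions written by domain experts; a hypothetical example in that spirit (the feature name and threshold below are invented for illustration, not taken from the paper):

```python
ABSTAIN, NORMAL, BICUSPID = -1, 0, 1

def lf_valve_opening_ratio(mri_features: dict) -> int:
    """Vote BICUSPID when the systolic valve-opening area ratio looks low;
    abstain when the measurement is unavailable."""
    ratio = mri_features.get("valve_area_ratio")   # hypothetical feature
    if ratio is None:
        return ABSTAIN
    return BICUSPID if ratio < 0.65 else NORMAL
```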


Subject(s)
Aortic Valve/abnormalities; Heart Valve Diseases/pathology; Machine Learning; Aortic Valve/diagnostic imaging; Aortic Valve/pathology; Heart Diseases/pathology; Heart Valve Diseases/diagnostic imaging; Humans; Magnetic Resonance Imaging; Neural Networks, Computer; Supervised Machine Learning
15.
Radiology ; 290(2): 537-544, 2019 02.
Article in English | MEDLINE | ID: mdl-30422093

ABSTRACT

Purpose To assess the ability of convolutional neural networks (CNNs) to enable high-performance automated binary classification of chest radiographs. Materials and Methods In a retrospective study, 216 431 frontal chest radiographs obtained between 1998 and 2012 were procured, along with associated text reports and a prospective label from the attending radiologist. This data set was used to train CNNs to classify chest radiographs as normal or abnormal before evaluation on a held-out set of 533 images hand-labeled by expert radiologists. The effects of development set size, training set size, initialization strategy, and network architecture on end performance were assessed by using standard binary classification metrics; detailed error analysis, including visualization of CNN activations, was also performed. Results Average area under the receiver operating characteristic curve (AUC) was 0.96 for a CNN trained with 200 000 images. This AUC value was greater than that observed when the same model was trained with 2000 images (AUC = 0.84, P < .005) but was not significantly different from that observed when the model was trained with 20 000 images (AUC = 0.95, P > .05). Averaging the CNN output score with the binary prospective label yielded the best-performing classifier, with an AUC of 0.98 (P < .005). Analysis of specific radiographs revealed that the model was heavily influenced by clinically relevant spatial regions but did not reliably generalize beyond thoracic disease. Conclusion CNNs trained with a modestly sized collection of prospectively labeled chest radiographs achieved high diagnostic performance in the classification of chest radiographs as normal or abnormal; this function may be useful for automated prioritization of abnormal chest radiographs. © RSNA, 2018 Online supplemental material is available for this article. See also the editorial by van Ginneken in this issue.
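
The best-performing classifier above is a one-line ensemble: average the CNN's output probability with the radiologist's prospective binary label. A minimal sketch of that combination and its evaluation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ensemble_auc(cnn_probs, prospective_labels, truth):
    """AUC of the mean of CNN probability and the prospective 0/1 label."""
    combined = (np.asarray(cnn_probs) + np.asarray(prospective_labels)) / 2
    return roc_auc_score(truth, combined)
```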


Subject(s)
Neural Networks, Computer; Radiographic Image Interpretation, Computer-Assisted/methods; Radiography, Thoracic/methods; Female; Humans; Lung/diagnostic imaging; Male; ROC Curve; Radiologists; Retrospective Studies
16.
Article in English | MEDLINE | ID: mdl-30931438

ABSTRACT

Many real-world machine learning problems are challenging to tackle for two reasons: (i) they involve multiple sub-tasks at different levels of granularity; and (ii) they require large volumes of labeled training data. We propose Snorkel MeTaL, an end-to-end system for multi-task learning that leverages weak supervision provided at multiple levels of granularity by domain expert users. In MeTaL, a user specifies a problem consisting of multiple, hierarchically related sub-tasks (for example, classifying a document at multiple levels of granularity) and then provides labeling functions for each sub-task as weak supervision. MeTaL learns a re-weighted model of these labeling functions and uses the combined signal to train a hierarchical multi-task network that is automatically compiled from the structure of the sub-tasks. Using MeTaL on a radiology report triage task and a fine-grained news classification task, we achieve average gains of 11.2 accuracy points over a baseline supervised approach and 9.5 accuracy points over the predictions of the user-provided labeling functions.
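
The kind of hierarchical task specification MeTaL consumes can be sketched as a small data structure (illustrative only, not the library's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    labels: list
    label_fns: list = field(default_factory=list)   # weak supervision per task
    subtasks: dict = field(default_factory=dict)    # parent label -> Task

# Coarse triage first, with a finer urgency class conditioned on "abnormal".
urgency = Task("report_urgency", ["routine", "urgent"])
triage = Task("is_abnormal", ["normal", "abnormal"],
              subtasks={"abnormal": urgency})
```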

17.
Adv Neural Inf Process Syst ; 30: 3239-3249, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29375240

ABSTRACT

Data augmentation is a ubiquitous technique for increasing the size of labeled training sets by leveraging task-specific data transformations that preserve class labels. While it is often easy for domain experts to specify individual transformations, constructing and tuning the more sophisticated compositions typically needed to achieve state-of-the-art results is a time-consuming manual task in practice. We propose a method for automating this process by learning a generative sequence model over user-specified transformation functions using a generative adversarial approach. Our method can make use of arbitrary, non-deterministic transformation functions, is robust to misspecified user input, and is trained on unlabeled data. The learned transformation model can then be used to perform data augmentation for any end discriminative model. In our experiments, we show the efficacy of our approach on both image and text datasets, achieving improvements of 4.0 accuracy points on CIFAR-10, 1.4 F1 points on the ACE relation extraction task, and 3.4 accuracy points when using domain-specific transformation operations on a medical imaging dataset as compared to standard heuristic augmentation approaches.
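
The user-facing contract is a set of label-preserving transformation functions (TFs); the method then learns a generative model over TF sequences. This sketch shows the TF interface with a uniform-random policy standing in for the learned sequence model:

```python
import random
import numpy as np

def shift(img):  return np.roll(img, random.choice([-2, 2]), axis=1)
def flip(img):   return img[:, ::-1]
def noise(img):  return img + np.random.normal(0.0, 0.01, img.shape)

TFS = [shift, flip, noise]   # user-specified, label-preserving TFs

def augment(img, seq_len=3):
    """Apply a TF sequence; the paper replaces this uniform sampler with a
    sequence generator trained adversarially on unlabeled data."""
    for tf in random.choices(TFS, k=seq_len):
        img = tf(img)
    return img
```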
