Results 1 - 20 of 69
1.
ArXiv ; 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38947933

ABSTRACT

Feature attribution, the ability to localize regions of the input data that are relevant for classification, is an important capability for ML models in scientific and biomedical domains. Current methods for feature attribution, which rely on "explaining" the predictions of end-to-end classifiers, suffer from imprecise feature localization and are inadequate for use with small sample sizes and high-dimensional datasets due to computational challenges. We introduce prospector heads, an efficient and interpretable alternative to explanation-based attribution methods that can be applied to any encoder and any data modality. We show that prospector heads generalize across modalities through experiments on sequences (text), images (pathology), and graphs (protein structures), outperforming baseline attribution methods by up to 26.3 points in mean localization AUPRC. We also demonstrate how prospector heads enable improved interpretation and discovery of class-specific patterns in input data. Through their high performance, flexibility, and generalizability, prospectors provide a framework for improving trust and transparency for ML models in complex domains.
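The abstract does not spell out the head's architecture, but the general idea of scoring per-region encoder embeddings for class relevance, and judging the result by localization AUPRC, can be sketched as below. The linear scorer, the region-level training signal, and the synthetic embeddings are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a lightweight "prospector-style" head that scores
# per-region encoder embeddings for class relevance. The architecture and the
# training signal here are assumptions, not the paper's actual method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Pretend a frozen encoder produced one 64-d embedding per region
# (tokens, image patches, or graph nodes) for 200 samples x 50 regions.
n_samples, n_regions, dim = 200, 50, 64
embeddings = rng.normal(size=(n_samples, n_regions, dim))
region_relevant = rng.random((n_samples, n_regions)) < 0.1   # ground-truth relevant regions
embeddings[region_relevant] += 0.8                           # relevant regions carry weak signal

# A minimal "head": a linear scorer over region embeddings, trained here on
# region-level labels (in practice the supervision could be far weaker).
X = embeddings.reshape(-1, dim)
y = region_relevant.reshape(-1).astype(int)
head = LogisticRegression(max_iter=1000).fit(X[:5000], y[:5000])

# Feature attribution = per-region relevance scores; evaluate localization AUPRC.
scores = head.predict_proba(X[5000:])[:, 1]
print("localization AUPRC:", round(average_precision_score(y[5000:], scores), 3))
```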

2.
NPJ Digit Med ; 7(1): 42, 2024 Feb 21.
Article in English | MEDLINE | ID: mdl-38383884

ABSTRACT

A major barrier to deploying healthcare AI is trustworthiness. One form of trustworthiness is a model's robustness across subgroups: while models may exhibit expert-level performance on aggregate metrics, they often rely on non-causal features, leading to errors in hidden subgroups. To take a step closer towards trustworthy seizure onset detection from EEG, we propose to leverage annotations that are produced by healthcare personnel in routine clinical workflows, which we refer to as workflow notes and which include multiple event descriptions beyond seizures. Using workflow notes, we first show that by scaling training data to 68,920 EEG hours, seizure onset detection performance significantly improves by 12.3 AUROC (Area Under the Receiver Operating Characteristic) points compared to relying on smaller training sets with gold-standard labels. Second, we reveal that our binary seizure onset detection model underperforms on clinically relevant subgroups (e.g., up to a margin of 6.5 AUROC points between pediatrics and adults), while having significantly higher FPRs (False Positive Rates) on EEG clips showing non-epileptiform abnormalities (+19 FPR points). To improve model robustness to hidden subgroups, we train a multilabel model that classifies 26 attributes other than seizures (e.g., spikes and movement artifacts) and significantly improve overall performance (+5.9 AUROC points) while greatly improving performance among subgroups (up to +8.3 AUROC points) and decreasing false positives on non-epileptiform abnormalities (by 8 FPR points). Finally, we find that our multilabel model improves clinical utility (false positives per 24 EEG hours) by a factor of 2.
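A minimal sketch of the multilabel idea described above, assuming a small 1-D CNN encoder with 27 sigmoid outputs (seizure onset plus 26 workflow-note attributes); the architecture, channel count, and clip length are placeholders rather than the paper's configuration.

```python
# Minimal multilabel sketch: one shared EEG encoder, 27 sigmoid outputs
# (seizure onset + 26 workflow-note attributes such as spikes or movement
# artifact). The 1-D CNN encoder and input shapes are assumptions.
import torch
import torch.nn as nn

n_channels, clip_len, n_labels = 19, 2000, 27    # e.g. 19-channel EEG, 10 s at 200 Hz

model = nn.Sequential(
    nn.Conv1d(n_channels, 32, kernel_size=7, stride=2), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=7, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, n_labels),                      # logits, one per attribute
)
criterion = nn.BCEWithLogitsLoss()                # independent multilabel targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy training step on random data standing in for annotated EEG clips.
x = torch.randn(8, n_channels, clip_len)
y = (torch.rand(8, n_labels) < 0.1).float()       # sparse multilabel annotations
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print("multilabel BCE loss:", float(loss))
```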

3.
Clin Transl Gastroenterol ; 14(10): e00634, 2023 10 01.
Article in English | MEDLINE | ID: mdl-37578060

ABSTRACT

INTRODUCTION: Esophageal 24-hour pH/impedance testing is routinely performed to diagnose gastroesophageal reflux disease. Interpretation of these studies is time-intensive for expert physicians and has high inter-reader variability. There are no commercially available machine learning tools to assist with automated identification of reflux events in these studies. METHODS: A machine learning system to identify reflux events in 24-hour pH/impedance studies was developed, which included an initial signal processing step and a machine learning model. Gold-standard reflux events were defined by a group of expert physicians. Performance metrics were computed to compare the machine learning system, current automated detection software (Reflux Reader v6.1), and an expert physician reader. RESULTS: The study cohort included 45 patients (20/5/20 patients in the training/validation/test sets, respectively). The mean age was 51 (standard deviation 14.5) years, 47% of patients were male, and 78% of studies were performed off proton-pump inhibitor. Comparing the machine learning system vs current automated software vs expert physician reader, area under the curve was 0.87 (95% confidence interval [CI] 0.85-0.89) vs 0.40 (95% CI 0.37-0.42) vs 0.83 (95% CI 0.81-0.86), respectively; sensitivity was 68.7% vs 61.1% vs 79.4%, respectively; and specificity was 80.8% vs 18.6% vs 87.3%, respectively. DISCUSSION: We trained and validated a novel machine learning system to successfully identify reflux events in 24-hour pH/impedance studies. Our model performance was superior to that of existing software and comparable to that of a human reader. Machine learning tools could significantly improve automated interpretation of pH/impedance studies.


Subject(s)
Esophageal pH Monitoring , Gastroesophageal Reflux , Humans , Male , Middle Aged , Female , Electric Impedance , Gastroesophageal Reflux/diagnosis , Hydrogen-Ion Concentration
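The head-to-head comparison above rests on standard event-level metrics; the sketch below shows how AUC, sensitivity, and specificity would be computed for any reflux-event detector, on synthetic scores rather than the study's data.

```python
# Generic evaluation sketch: AUC on per-event scores, plus sensitivity and
# specificity at a fixed operating point. Labels and scores are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)                               # expert-consensus reflux events
scores = np.clip(y_true * 0.4 + rng.normal(0.4, 0.25, 500), 0, 1)   # detector scores
y_pred = (scores >= 0.5).astype(int)                                # chosen decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("AUC:        ", round(roc_auc_score(y_true, scores), 3))
print("Sensitivity:", round(tp / (tp + fn), 3))
print("Specificity:", round(tn / (tn + fp), 3))
```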
4.
Magn Reson Med ; 90(5): 2052-2070, 2023 11.
Article in English | MEDLINE | ID: mdl-37427449

ABSTRACT

PURPOSE: To develop a method for building MRI reconstruction neural networks robust to changes in signal-to-noise ratio (SNR) and trainable with a limited number of fully sampled scans. METHODS: We propose Noise2Recon, a consistency training method for SNR-robust accelerated MRI reconstruction that can use both fully sampled (labeled) and undersampled (unlabeled) scans. Noise2Recon uses unlabeled data by enforcing consistency between model reconstructions of undersampled scans and their noise-augmented counterparts. Noise2Recon was compared to compressed sensing and both supervised and self-supervised deep learning baselines. Experiments were conducted using retrospectively accelerated data from the mridata three-dimensional fast-spin-echo knee and two-dimensional fastMRI brain datasets. All methods were evaluated in label-limited settings and under out-of-distribution (OOD) shifts, including changes in SNR, acceleration factors, and datasets. An extensive ablation study was conducted to characterize the sensitivity of Noise2Recon to hyperparameter choices. RESULTS: In label-limited settings, Noise2Recon achieved better structural similarity, peak signal-to-noise ratio, and normalized-RMS error than all baselines and matched the performance of supervised models, which were trained with 14× more fully sampled scans. Noise2Recon outperformed all baselines, including state-of-the-art fine-tuning and augmentation techniques, on low-SNR scans and when generalizing to OOD acceleration factors. Augmentation extent and loss weighting hyperparameters had negligible impact on Noise2Recon compared to supervised methods, which may indicate increased training stability. CONCLUSION: Noise2Recon is a label-efficient reconstruction method that is robust to distribution shifts, such as changes in SNR, acceleration factors, and others, with limited or no fully sampled training data.


Subject(s)
Deep Learning , Image Processing, Computer-Assisted , Humans , Image Processing, Computer-Assisted/methods , Signal-To-Noise Ratio , Retrospective Studies , Magnetic Resonance Imaging/methods , Supervised Machine Learning
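A rough sketch of the Noise2Recon-style objective as described in the abstract above: a supervised loss on fully sampled scans plus a consistency loss tying the reconstruction of an undersampled scan to that of its noise-augmented copy. The tiny CNN stand-in, noise level, and loss weight are placeholders, not the paper's settings.

```python
# Sketch of a consistency-training objective in the spirit of Noise2Recon:
# supervised loss on labeled scans + consistency between reconstructions of an
# undersampled scan and its noise-augmented counterpart. All components here
# (network, noise model, weights) are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

recon = nn.Sequential(                       # stand-in for an unrolled recon network
    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 3, padding=1),
)
opt = torch.optim.Adam(recon.parameters(), lr=1e-3)
lambda_consistency, sigma = 1.0, 0.05        # assumed hyperparameters

# Labeled batch: undersampled input + fully sampled target (real/imag channels).
x_lab = torch.randn(4, 2, 64, 64)
target = torch.randn(4, 2, 64, 64)
# Unlabeled batch: undersampled input only.
x_unlab = torch.randn(4, 2, 64, 64)

sup_loss = F.l1_loss(recon(x_lab), target)                       # supervised term
noisy = x_unlab + sigma * torch.randn_like(x_unlab)              # noise augmentation
cons_loss = F.mse_loss(recon(noisy), recon(x_unlab).detach())    # consistency term

loss = sup_loss + lambda_consistency * cons_loss
loss.backward()
opt.step()
print("supervised:", float(sup_loss), "consistency:", float(cons_loss))
```

Detaching the clean-branch reconstruction is one common way to keep the consistency term from collapsing both branches; the paper may weight or structure this term differently.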
5.
J Med Imaging (Bellingham) ; 10(2): 024007, 2023 Mar.
Article in English | MEDLINE | ID: mdl-37009059

ABSTRACT

Purpose: Neural networks have potential to automate medical image segmentation but require expensive labeling efforts. While methods have been proposed to reduce the labeling burden, most have not been thoroughly evaluated on large, clinical datasets or clinical tasks. We propose a method to train segmentation networks with limited labeled data and focus on thorough network evaluation. Approach: We propose a semi-supervised method that leverages data augmentation, consistency regularization, and pseudolabeling and train four cardiac magnetic resonance (MR) segmentation networks. We evaluate the models on multiinstitutional, multiscanner, multidisease cardiac MR datasets using five cardiac functional biomarkers, which are compared to an expert's measurements using Lin's concordance correlation coefficient (CCC), the within-subject coefficient of variation (CV), and the Dice coefficient. Results: The semi-supervised networks achieve strong agreement using Lin's CCC (>0.8), CV similar to an expert, and strong generalization performance. We compare the error modes of the semi-supervised networks against fully supervised networks. We evaluate semi-supervised model performance as a function of labeled training data and with different types of model supervision, showing that a model trained with 100 labeled image slices can achieve a Dice coefficient within 1.10% of a network trained with 16,000+ labeled image slices. Conclusion: We evaluate semi-supervision for medical image segmentation using heterogeneous datasets and clinical metrics. As methods for training models with little labeled data become more common, knowledge about how they perform on clinical tasks, how they fail, and how they perform with different amounts of labeled data is useful to model developers and users.
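A toy sketch of the semi-supervised recipe named above (augmentation, consistency with pseudolabels, and a Dice helper for evaluation); the stand-in network, augmentation, and thresholds are assumptions, not the paper's configuration.

```python
# Toy semi-supervised segmentation step: supervised loss on labeled slices plus
# a pseudolabel loss on augmented unlabeled slices, with a Dice helper for
# evaluation. Network, augmentation, and thresholds are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dice_coefficient(pred_mask, true_mask, eps=1e-6):
    """Dice overlap between two binary masks."""
    inter = (pred_mask * true_mask).sum()
    return (2 * inter + eps) / (pred_mask.sum() + true_mask.sum() + eps)

seg_net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 1, 3, padding=1))      # stand-in segmenter
opt = torch.optim.Adam(seg_net.parameters(), lr=1e-3)

x_lab, y_lab = torch.randn(2, 1, 64, 64), (torch.rand(2, 1, 64, 64) > 0.7).float()
x_unlab = torch.randn(2, 1, 64, 64)

sup = F.binary_cross_entropy_with_logits(seg_net(x_lab), y_lab)

with torch.no_grad():                                        # pseudolabel pass
    pseudo = (torch.sigmoid(seg_net(x_unlab)) > 0.5).float()
x_aug = torch.flip(x_unlab, dims=[-1])                       # simple augmentation
pseudo_aug = torch.flip(pseudo, dims=[-1])                   # keep labels consistent
unsup = F.binary_cross_entropy_with_logits(seg_net(x_aug), pseudo_aug)

(sup + 0.5 * unsup).backward()
opt.step()
print("Dice on labeled batch:",
      float(dice_coefficient((torch.sigmoid(seg_net(x_lab)) > 0.5).float(), y_lab)))
```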

6.
Neuroimage ; 273: 120109, 2023 06.
Article in English | MEDLINE | ID: mdl-37059157

ABSTRACT

Deep learning (DL) models find increasing application in mental state decoding, where researchers seek to understand the mapping between mental states (e.g., experiencing anger or joy) and brain activity by identifying those spatial and temporal features of brain activity that allow these states to be accurately identified (i.e., decoded). Once a DL model has been trained to accurately decode a set of mental states, neuroimaging researchers often make use of methods from explainable artificial intelligence research to understand the model's learned mappings between mental states and brain activity. Here, we benchmark prominent explanation methods in a mental state decoding analysis of multiple functional Magnetic Resonance Imaging (fMRI) datasets. Our findings demonstrate a gradient between two key characteristics of an explanation in mental state decoding, namely, its faithfulness and its alignment with other empirical evidence on the mapping between brain activity and decoded mental state: explanation methods with high faithfulness, which capture the model's decision process well, generally provide explanations that align less well with other empirical evidence than the explanations of less faithful methods. Based on our findings, we provide guidance for neuroimaging researchers on how to choose an explanation method to gain insight into the mental state decoding decisions of DL models.


Subject(s)
Brain , Deep Learning , Humans , Brain/diagnostic imaging , Brain Mapping/methods , Artificial Intelligence , Benchmarking , Magnetic Resonance Imaging/methods
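One common way to make "faithfulness" concrete is a deletion-style check: occlude the features an explanation ranks highest and measure the drop in the decoder's confidence. The sketch below is generic, with a synthetic linear decoder and coefficient-based attribution; it is not the specific benchmark protocol used in the paper.

```python
# Generic deletion-style faithfulness check: occlude the most-attributed
# features and measure the drop in decoding confidence. Decoder, attribution,
# and data are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 100))                  # stand-in for voxel features
w_true = np.zeros(100)
w_true[:10] = 2.0                                # only 10 features matter
y = (X @ w_true + rng.normal(size=400) > 0).astype(int)

decoder = LogisticRegression(max_iter=1000).fit(X, y)
attribution = np.abs(decoder.coef_[0])           # a simple, faithful explanation

def mean_confidence(features):
    p = decoder.predict_proba(features)[:, 1]
    return float(np.mean(np.where(y == 1, p, 1 - p)))

top_k = np.argsort(attribution)[::-1][:10]       # most-attributed features
X_occluded = X.copy()
X_occluded[:, top_k] = 0

print("confidence, original:              ", round(mean_confidence(X), 3))
print("confidence, top features occluded:", round(mean_confidence(X_occluded), 3))
```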
7.
Trends Cogn Sci ; 26(11): 972-986, 2022 11.
Article in English | MEDLINE | ID: mdl-36223760

ABSTRACT

In mental state decoding, researchers aim to identify the set of mental states (e.g., experiencing happiness or fear) that can be reliably identified from the activity patterns of a brain region (or network). Deep learning (DL) models are highly promising for mental state decoding because of their unmatched ability to learn versatile representations of complex data. However, their widespread application in mental state decoding is hindered by their lack of interpretability, difficulties in applying them to small datasets, and in ensuring their reproducibility and robustness. We recommend approaching these challenges by leveraging recent advances in explainable artificial intelligence (XAI) and transfer learning, and also provide recommendations on how to improve the reproducibility and robustness of DL models in mental state decoding.


Subject(s)
Artificial Intelligence , Brain Mapping , Deep Learning , Brain , Humans , Machine Learning , Neuroimaging , Reproducibility of Results
8.
J Addict Med ; 16(3): 340-345, 2022.
Article in English | MEDLINE | ID: mdl-34510089

ABSTRACT

OBJECTIVES: To determine recent trends in: (1) human immunodeficiency virus (HIV) diagnoses, (2) the proportion of patients newly diagnosed with HIV with injection drug use (IDU), and (3) patients' patterns of healthcare utilization in the year before diagnosis at an urban, academic medical center. METHODS: We performed a cross-sectional study of patients newly diagnosed with HIV at a healthcare system in southern New Jersey between January 1st, 2014 and December 31st, 2019. Patients 18 years or older with HIV diagnosed during the study period were included. Demographics, comorbidities, HIV test results, and healthcare utilization data were collected from the electronic medical record. RESULTS: Of 192 patients newly diagnosed with HIV, 36 (19%) had documented IDU. New HIV diagnoses doubled from 22 to 47 annual cases between 2014 and 2019. The proportion of patients with newly diagnosed HIV and documented IDU increased from 9% in 2014 to 32% in 2019 (chi-square test for linear trend, P value = 0.001). Eighty-nine percent of patients with IDU had at least one contact with the healthcare system in the year before diagnosis compared to 63% of patients without IDU (P value = 0.003). The median (interquartile range [IQR]) number of healthcare visits was 7 [2-16] for patients with IDU versus 1 [0-3] for patients without IDU (P < 0.001). CONCLUSIONS: We observed an increase in new HIV diagnoses with an increase in the proportion of newly diagnosed patients with IDU. Patients with newly diagnosed HIV and IDU had high rates of healthcare utilization in the year before diagnosis, presenting an opportunity for intervention.


Subject(s)
HIV Infections , Substance Abuse, Intravenous , Cross-Sectional Studies , HIV Infections/diagnosis , HIV Infections/epidemiology , Humans , Patient Acceptance of Health Care , Substance Abuse, Intravenous/epidemiology
9.
Radiol Artif Intell ; 3(4): e200229, 2021 Jul.
Article in English | MEDLINE | ID: mdl-34350412

ABSTRACT

PURPOSE: To develop a convolutional neural network (CNN) to triage head CT (HCT) studies and investigate the effect of upstream medical image processing on the CNN's performance. MATERIALS AND METHODS: A total of 9776 HCT studies were retrospectively collected from 2001 through 2014, and a CNN was trained to triage them as normal or abnormal. CNN performance was evaluated on a held-out test set by assessing triage performance and sensitivity to 20 disorders to characterize differential model performance, with 7856 CT studies in the training set, 936 in the validation set, and 984 in the test set. This CNN was used to understand how the upstream imaging chain affects CNN performance by evaluating performance after altering three variables: image acquisition by reducing the number of x-ray projections, image reconstruction by inputting sinogram data into the CNN, and image preprocessing. To evaluate performance, the DeLong test was used to assess differences in the area under the receiver operating characteristic curve (AUROC), and the McNemar test was used to compare sensitivities. RESULTS: The CNN achieved a mean AUROC of 0.84 (95% CI: 0.83, 0.84) in discriminating normal and abnormal HCT studies. The number of x-ray projections could be reduced by 16 times and the raw sensor data could be input into the CNN with no statistically significant difference in classification performance. Additionally, CT windowing consistently improved CNN performance, increasing the mean triage AUROC by 0.07 points. CONCLUSION: A CNN was developed to triage HCT studies, which may help streamline image evaluation, and the means by which upstream image acquisition, reconstruction, and preprocessing affect downstream CNN performance were investigated, bringing focus to this important part of the imaging chain. Keywords: Head CT, Automated Triage, Deep Learning, Sinogram, Dataset. Supplemental material is available for this article. © RSNA, 2021.
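The windowing step reported above as consistently helpful maps Hounsfield units into a display range set by a window level and width; a minimal sketch follows, using conventional brain-window values (the helper name and defaults are ours, not the paper's pipeline).

```python
# Sketch of CT windowing preprocessing: clip Hounsfield units to a window
# defined by level and width, then rescale to [0, 1]. The brain-window values
# (level 40 HU, width 80 HU) are conventional defaults.
import numpy as np

def window_ct(hu_image, level=40.0, width=80.0):
    """Apply an intensity window to a CT image given in Hounsfield units."""
    lo, hi = level - width / 2.0, level + width / 2.0
    return (np.clip(hu_image, lo, hi) - lo) / (hi - lo)

hu_slice = np.random.default_rng(0).uniform(-1000, 1000, size=(512, 512))
windowed = window_ct(hu_slice)            # values in [0, 1], brain window
print(windowed.min(), windowed.max())
```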

10.
JAMA Netw Open ; 4(3): e211728, 2021 03 01.
Article in English | MEDLINE | ID: mdl-33720372

ABSTRACT

Importance: Implant registries provide valuable information on the performance of implants in a real-world setting, yet they have traditionally been expensive to establish and maintain. Electronic health records (EHRs) are widely used and may include the information needed to generate clinically meaningful reports similar to a formal implant registry. Objectives: To quantify the extractability and accuracy of registry-relevant data from the EHR and to assess the ability of these data to track trends in implant use and the durability of implants (hereafter referred to as implant survivorship), using data stored since 2000 in the EHR of the largest integrated health care system in the United States. Design, Setting, and Participants: Retrospective cohort study of a large EHR of veterans who had 45 351 total hip arthroplasty procedures in Veterans Health Administration hospitals from 2000 to 2017. Data analysis was performed from January 1, 2000, to December 31, 2017. Exposures: Total hip arthroplasty. Main Outcomes and Measures: Number of total hip arthroplasty procedures extracted from the EHR, trends in implant use, and relative survivorship of implants. Results: A total of 45 351 total hip arthroplasty procedures were identified from 2000 to 2017, with 192 805 implant parts. Data completeness improved over time. After 2014, 85% of prosthetic heads, 91% of shells, 81% of stems, and 85% of liners used in the Veterans Health Administration health care system were identified by part number. Revision burden and trends in metal vs ceramic prosthetic femoral head use were found to reflect data from the American Joint Replacement Registry. Recalled implants were obvious negative outliers in implant survivorship using Kaplan-Meier curves. Conclusions and Relevance: Although loss to follow-up remains a challenge that requires additional attention to improve the quantitative nature of calculated implant survivorship, we conclude that data collected during routine clinical care and stored in the EHR of a large health system over 18 years were sufficient to provide clinically meaningful data on trends in implant use and to identify poor implants that were subsequently recalled. This automated approach was low cost and had no reporting burden. This low-cost, low-overhead method to assess implant use and performance within a large health care setting may be useful to internal quality assurance programs and, on a larger scale, to postmarket surveillance of implant performance.


Subject(s)
Arthroplasty, Replacement, Hip/statistics & numerical data , Electronic Health Records/statistics & numerical data , Adult , Aged , Aged, 80 and over , Cohort Studies , Female , Humans , Male , Middle Aged , Registries , Reproducibility of Results , Retrospective Studies , Young Adult
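The survivorship analysis described above boils down to Kaplan-Meier estimation with loss to follow-up treated as censoring; the sketch below uses the lifelines package on synthetic follow-up data, not the study's cohort or code.

```python
# Sketch of implant-survivorship estimation: a Kaplan-Meier curve with loss to
# follow-up handled as censoring. Data here are synthetic.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
years_followed = rng.uniform(0, 18, size=1000)     # time on study per implant
revised = rng.random(1000) < 0.08                  # event = revision surgery

kmf = KaplanMeierFitter()
kmf.fit(durations=years_followed, event_observed=revised, label="implant A")
print(kmf.survival_function_.tail())               # estimated survivorship over time
# Fitting one curve per implant part number and plotting them together is how
# a recalled implant would appear as a negative outlier.
```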
11.
J Biomed Inform ; 113: 103656, 2021 01.
Article in English | MEDLINE | ID: mdl-33309994

ABSTRACT

PURPOSE: To compare machine learning methods for classifying mass lesions on mammography images that use predefined image features computed over lesion segmentations to those that leverage segmentation-free representation learning on a standard, public evaluation dataset. METHODS: We apply several classification algorithms to the public Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM), in which each image contains a mass lesion. Segmentation-free representation learning techniques for classifying lesions as benign or malignant include both a Bag-of-Visual-Words (BoVW) method and a Convolutional Neural Network (CNN). We compare classification performance of these techniques to that obtained using two different segmentation-dependent approaches from the literature that rely on specific combinations of end classifiers (e.g. linear discriminant analysis, neural networks) and predefined features computed over the lesion segmentation (e.g. spiculation measure, morphological characteristics, intensity metrics). RESULTS: We report area under the receiver operating characteristic curve (AZ) values for malignancy classification on CBIS-DDSM for each technique. We find average AZ values of 0.73 for a segmentation-free BoVW method, 0.86 for a segmentation-free CNN method, 0.75 for a segmentation-dependent linear discriminant analysis of Rubber-Band Straightening Transform features, and 0.58 for a hybrid rule-based neural network classification using a small number of hand-designed features. CONCLUSIONS: We find that malignancy classification performance on the CBIS-DDSM dataset using segmentation-free BoVW features is comparable to that of the best segmentation-dependent methods we study, but also observe that a common segmentation-free CNN model substantially and significantly outperforms each of these (p < 0.05). These results reinforce recent findings suggesting that representation learning techniques such as BoVW and CNNs are advantageous for mammogram analysis because they do not require lesion segmentation, the quality and specific characteristics of which can vary substantially across datasets. We further observe that segmentation-dependent methods achieve performance levels on CBIS-DDSM inferior to those achieved on the original evaluation datasets reported in the literature. Each of these findings reinforces the need for standardization of datasets, segmentation techniques, and model implementations in performance assessments of automated classifiers for medical imaging.


Subject(s)
Breast Neoplasms , Mammography , Breast/diagnostic imaging , Breast Neoplasms/diagnostic imaging , Computers , Early Detection of Cancer , Female , Humans
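A minimal segmentation-free bag-of-visual-words pipeline of the kind compared above: sample local patches, build a k-means codebook, represent each image as a histogram of visual words, and fit a linear classifier. Patch size, codebook size, and the classifier are illustrative choices, not those used in the study.

```python
# Minimal segmentation-free BoVW sketch: random image patches -> k-means
# codebook -> per-image histogram of visual words -> linear classifier.
# All parameters and data here are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
images = rng.random((60, 128, 128))                 # stand-ins for mammogram ROIs
labels = np.tile([0, 1], 30)                        # benign (0) vs malignant (1)

def sample_patches(img, n=50, size=8):
    ys = rng.integers(0, img.shape[0] - size, n)
    xs = rng.integers(0, img.shape[1] - size, n)
    return np.stack([img[y:y+size, x:x+size].ravel() for y, x in zip(ys, xs)])

all_patches = np.vstack([sample_patches(im) for im in images])
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(all_patches)

def bovw_histogram(img):
    words = codebook.predict(sample_patches(img))
    return np.bincount(words, minlength=32) / 50.0  # normalized word counts

X = np.stack([bovw_histogram(im) for im in images])
clf = LogisticRegression(max_iter=1000).fit(X[:40], labels[:40])
print("toy AUC:", roc_auc_score(labels[40:], clf.predict_proba(X[40:])[:, 1]))
```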
12.
Article in English | MEDLINE | ID: mdl-33196064

ABSTRACT

Machine learning models for medical image analysis often suffer from poor performance on important subsets of a population that are not identified during training or testing. For example, overall performance of a cancer detection model may be high, but the model may still consistently miss a rare but aggressive cancer subtype. We refer to this problem as hidden stratification, and observe that it results from incompletely describing the meaningful variation in a dataset. While hidden stratification can substantially reduce the clinical efficacy of machine learning models, its effects remain difficult to measure. In this work, we assess the utility of several possible techniques for measuring hidden stratification effects, and characterize these effects both via synthetic experiments on the CIFAR-100 benchmark dataset and on multiple real-world medical imaging datasets. Using these measurement techniques, we find evidence that hidden stratification can occur in unidentified imaging subsets with low prevalence, low label quality, subtle distinguishing features, or spurious correlates, and that it can result in relative performance differences of over 20% on clinically important subsets. Finally, we discuss the clinical implications of our findings, and suggest that evaluation of hidden stratification should be a critical component of any machine learning deployment in medical imaging.
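The simplest of the measurement strategies referenced above reduces to comparing per-subset performance against the aggregate metric whenever subset annotations are available or can be approximated; the sketch below uses synthetic predictions to show how a large aggregate-versus-subset gap surfaces.

```python
# Basic hidden-stratification audit: compare per-subset performance with the
# aggregate metric. Predictions and subset labels here are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)                    # cancer vs. no cancer
subset = rng.random(n) < 0.05                     # rare, aggressive subtype
scores = y * 0.8 + rng.normal(0, 0.3, n)          # model does well overall...
scores[subset & (y == 1)] -= 0.7                  # ...but misses the rare subtype

print("aggregate AUROC:   ", round(roc_auc_score(y, scores), 3))
mask = subset | (y == 0)                          # rare positives vs. all negatives
print("rare-subtype AUROC:", round(roc_auc_score(y[mask], scores[mask]), 3))
```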

13.
Circ Genom Precis Med ; 13(6): e003014, 2020 12.
Article in English | MEDLINE | ID: mdl-33125279

ABSTRACT

BACKGROUND: The aortic valve is an important determinant of cardiovascular physiology and an anatomic location of common human diseases. METHODS: From a sample of 34 287 white British ancestry participants, we estimated functional aortic valve area by planimetry from prospectively obtained cardiac magnetic resonance imaging sequences of the aortic valve. Aortic valve area measurements were submitted to genome-wide association testing, followed by polygenic risk scoring and phenome-wide screening, to identify genetic comorbidities. RESULTS: A genome-wide association study of aortic valve area in these UK Biobank participants showed 3 significant associations, indexed by rs71190365 (chr13:50764607, DLEU1, P=1.8×10⁻⁹), rs35991305 (chr12:94191968, CRADD, P=3.4×10⁻⁸), and chr17:45013271:C:T (GOSR2, P=5.6×10⁻⁸). Replication on an independent set of 8145 unrelated European ancestry participants showed consistent effect sizes in all 3 loci, although rs35991305 did not meet nominal significance. We constructed a polygenic risk score for aortic valve area, which in a separate cohort of 311 728 individuals without imaging demonstrated that smaller aortic valve area is predictive of increased risk for aortic valve disease (odds ratio, 1.14; P=2.3×10⁻⁶). After excluding subjects with a medical diagnosis of aortic valve stenosis (remaining n=308 683 individuals), phenome-wide association of >10 000 traits showed multiple links between the polygenic score for aortic valve disease and key health-related comorbidities involving the cardiovascular system and autoimmune disease. Genetic correlation analysis supports a shared genetic etiology between aortic valve area and birth weight, along with other cardiovascular conditions. CONCLUSIONS: These results illustrate the use of automated phenotyping of cardiac imaging data from the general population to investigate the genetic etiology of aortic valve disease, perform clinical prediction, and uncover new clinical and genetic correlates of cardiac anatomy.


Subject(s)
Aortic Valve/diagnostic imaging , Biological Specimen Banks , Cardiovascular Diseases/diagnostic imaging , Cardiovascular Diseases/genetics , Genome-Wide Association Study , Magnetic Resonance Imaging , Adult , Aged , Aortic Valve/pathology , Aortic Valve Stenosis/diagnostic imaging , Aortic Valve Stenosis/genetics , Comorbidity , Female , Genome, Human , Humans , Male , Middle Aged , Multifactorial Inheritance/genetics , Phenomics , Phenotype , Survival Analysis , United Kingdom
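A polygenic risk score of the kind used above is a weighted sum of risk-allele dosages, with GWAS effect sizes as weights; the numpy sketch below uses made-up effect sizes and genotypes rather than the study's weights.

```python
# Generic polygenic-score sketch: score = sum over variants of
# (effect size x allele dosage), with invented effect sizes and genotypes.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_variants = 1000, 3                      # e.g. the three lead SNPs above
effect_sizes = np.array([-0.12, 0.08, 0.05])        # per-allele effects (assumed)
dosages = rng.integers(0, 3, size=(n_people, n_variants)).astype(float)  # 0/1/2 copies

prs = dosages @ effect_sizes                        # one score per person
prs_z = (prs - prs.mean()) / prs.std()              # standardize for downstream models
print("first five standardized scores:", np.round(prs_z[:5], 2))
# In the study, a (much larger) score like this was then tested for association
# with aortic valve disease and thousands of phenome-wide traits.
```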
14.
Patterns (N Y) ; 1(2), 2020 May 08.
Article in English | MEDLINE | ID: mdl-32776018

ABSTRACT

A major bottleneck in developing clinically impactful machine learning models is a lack of labeled training data for model supervision. Thus, medical researchers increasingly turn to weaker, noisier sources of supervision, such as leveraging extractions from unstructured text reports to supervise image classification. A key challenge in weak supervision is combining sources of information that may differ in quality and have correlated errors. Recently, a statistical theory of weak supervision called data programming has shown promise in addressing this challenge. Data programming now underpins many deployed machine-learning systems in the technology industry, even for critical applications. We propose a new technique for applying data programming to the problem of cross-modal weak supervision in medicine, wherein weak labels derived from an auxiliary modality (e.g., text) are used to train models over a different target modality (e.g., images). We evaluate our approach on diverse clinical tasks via direct comparison to institution-scale, hand-labeled datasets. We find that our supervision technique increases model performance by up to 6 points in area under the receiver operating characteristic curve (ROC-AUC) over baseline methods by improving both the coverage and the quality of the weak labels. Our approach yields models that on average perform within 1.75 points ROC-AUC of those supervised with physician-years of hand labeling and outperform those supervised with physician-months of hand labeling by 10.25 points ROC-AUC, while using only person-days of developer time and clinician work, a time saving of 96%. Our results suggest that modern weak supervision techniques such as data programming may enable more rapid development and deployment of clinically useful machine-learning models.
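A compressed sketch of the cross-modal setup: heuristics ("labeling functions") over report text emit noisy labels, a simple aggregator combines them, and the aggregated labels train a model over the paired images. The heuristics, the majority-vote aggregation (a learned label model, as in data programming, would also estimate labeling-function accuracies), and the stand-in image features are illustrative only.

```python
# Sketch of cross-modal weak supervision: text heuristics -> noisy labels ->
# simple aggregation -> train a model on the paired target modality (images).
# Labeling functions, aggregation, and image features are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

ABSTAIN = -1

def lf_mentions_fracture(report):             # crude keyword heuristic
    return 1 if "fracture" in report.lower() else ABSTAIN

def lf_negation(report):                      # crude negation heuristic
    return 0 if "no acute" in report.lower() else ABSTAIN

reports = ["No acute abnormality.", "Distal radius fracture.",
           "Possible fracture of the ulna.", "No acute fracture identified."]
votes = np.array([[lf(r) for lf in (lf_mentions_fracture, lf_negation)] for r in reports])

def aggregate(row):                           # majority vote over non-abstaining LFs
    valid = row[row != ABSTAIN]
    return int(round(valid.mean())) if len(valid) else 0

weak_labels = np.array([aggregate(row) for row in votes])
print("weak labels:", weak_labels)

# Train the target-modality model (stand-in image features) on the weak labels.
image_features = np.random.default_rng(0).normal(size=(4, 16))
img_model = LogisticRegression(max_iter=1000).fit(image_features, weak_labels)
```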

15.
Front Neurol ; 11: 520, 2020.
Article in English | MEDLINE | ID: mdl-32714261

ABSTRACT

Seizure patterns observed in patients with epilepsy suggest that circadian rhythms and sleep/wake mechanisms play some role in the disease. This review addresses key topics in the relationship between circadian rhythms and seizures in epilepsy. We present basic information on circadian biology, but focus on research studying the influence of both the time of day and the sleep/wake cycle as independent but related factors on the expression of seizures in epilepsy. We review studies investigating how seizures and epilepsy disrupt expression of core clock genes, and how disruption of clock mechanisms impacts seizures and the development of epilepsy. We focus on the overlap between mechanisms of circadian-associated changes in suprachiasmatic nucleus (SCN) neuronal excitability and mechanisms of epileptogenesis as a means of identifying key pathways and molecules that could represent new targets or strategies for epilepsy therapy. Finally, we review the concept of chronotherapy and provide a perspective regarding its application to patients with epilepsy based on their individual characteristics (i.e., being a "morning person" or a "night owl"). We conclude that a better understanding of the relationship between circadian rhythms, neuronal excitability, and seizures will allow both the identification of new therapeutic targets for treating epilepsy and more effective treatment regimens using currently available pharmacological and non-pharmacological strategies.

16.
NPJ Digit Med ; 3: 59, 2020.
Article in English | MEDLINE | ID: mdl-32352037

ABSTRACT

Automated seizure detection from electroencephalography (EEG) would improve the quality of patient care while reducing medical costs, but achieving reliably high performance across patients has proven difficult. Convolutional Neural Networks (CNNs) show promise in addressing this problem, but they are limited by a lack of large labeled training datasets. We propose using imperfect but plentiful archived annotations to train CNNs for automated, real-time EEG seizure detection across patients. While these weak annotations indicate possible seizures with precision scores as low as 0.37, they are commonly produced in large volumes within existing clinical workflows by a mixed group of technicians, fellows, students, and board-certified epileptologists. We find that CNNs trained using such weak annotations achieve Area Under the Receiver Operating Characteristic curve (AUROC) values of 0.93 and 0.94 for pediatric and adult seizure onset detection, respectively. Compared to currently deployed clinical software, our model provides a 31% increase (18 points) in F1-score for pediatric patients and a 17% increase (11 points) for adult patients. These results demonstrate that weak annotations, which are sustainably collected via existing clinical workflows, can be leveraged to produce clinically useful seizure detection models.

17.
J Am Med Inform Assoc ; 27(5): 757-769, 2020 05 01.
Article in English | MEDLINE | ID: mdl-32364237

ABSTRACT

OBJECTIVE: Non-small cell lung cancer is a leading cause of cancer death worldwide, and histopathological evaluation plays the primary role in its diagnosis. However, the morphological patterns associated with the molecular subtypes have not been systematically studied. To bridge this gap, we developed a quantitative histopathology analytic framework to identify the types and gene expression subtypes of non-small cell lung cancer objectively. MATERIALS AND METHODS: We processed whole-slide histopathology images of lung adenocarcinoma (n = 427) and lung squamous cell carcinoma patients (n = 457) in the Cancer Genome Atlas. We built convolutional neural networks to classify histopathology images, evaluated their performance by the areas under the receiver-operating characteristic curves (AUCs), and validated the results in an independent cohort (n = 125). RESULTS: To establish neural networks for quantitative image analyses, we first built convolutional neural network models to identify tumor regions from adjacent dense benign tissues (AUCs > 0.935) and recapitulated expert pathologists' diagnosis (AUCs > 0.877), with the results validated in an independent cohort (AUCs = 0.726-0.864). We further demonstrated that quantitative histopathology morphology features identified the major transcriptomic subtypes of both adenocarcinoma and squamous cell carcinoma (P < .01). DISCUSSION: Our study is the first to classify the transcriptomic subtypes of non-small cell lung cancer using fully automated machine learning methods. Our approach does not rely on prior pathology knowledge and can discover novel clinically relevant histopathology patterns objectively. The developed procedure is generalizable to other tumor types or diseases.


Subject(s)
Adenocarcinoma of Lung/pathology , Carcinoma, Non-Small-Cell Lung/pathology , Carcinoma, Squamous Cell/pathology , Lung Neoplasms/pathology , Machine Learning , Neural Networks, Computer , Transcriptome , Adenocarcinoma of Lung/genetics , Carcinoma, Non-Small-Cell Lung/genetics , Carcinoma, Squamous Cell/genetics , Humans , Lung Neoplasms/genetics , ROC Curve
18.
Sci Transl Med ; 12(544)2020 05 20.
Article in English | MEDLINE | ID: mdl-32434849

ABSTRACT

The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient's disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient's given set of phenotypes. Diagnosis of singleton patients (without relatives' exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database-based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children's Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu.


Subject(s)
Exome , Child , Genotype , Humans , Phenotype , Probability , Retrospective Studies
19.
VLDB J ; 29(2): 709-730, 2020.
Article in English | MEDLINE | ID: mdl-32214778

ABSTRACT

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models 2.8× faster and increase predictive performance by an average of 45.5% versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to a 1.8× speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
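A toy example in the style of Snorkel's v0.9 labeling API: write labeling functions, apply them to unlabeled examples, and let the label model denoise their votes without ground truth. The task, labeling functions, and data below are invented for illustration; only the library calls follow Snorkel's documented interface.

```python
# Toy Snorkel-style workflow: labeling functions -> label matrix -> LabelModel
# that denoises the votes without ground truth. Task and data are invented.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEG, POS = -1, 0, 1

@labeling_function()
def lf_contains_wire(x):
    return POS if "wire transfer" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_urgent(x):
    return POS if "urgent" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return NEG if len(x.text.split()) < 4 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "URGENT wire transfer needed today",
    "lunch at noon?",
    "please send the urgent wire transfer details",
    "meeting notes attached",
    "ok thanks",
]})

lfs = [lf_contains_wire, lf_contains_urgent, lf_short_message]
L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)     # matrix of LF votes

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=123)
print(label_model.predict(L=L_train))                     # denoised training labels
```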

20.
Adv Neural Inf Process Syst ; 32: 9392-9402, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31871391

ABSTRACT

In real-world machine learning applications, data subsets correspond to especially critical outcomes: vulnerable cyclist detections are safety-critical in an autonomous driving task, and "question" sentences might be important to a dialogue agent's language understanding for product purposes. While machine learning models can achieve high-quality performance on coarse-grained metrics like F1-score and overall accuracy, they may underperform on critical subsets, which we define as slices, the key abstraction in our approach. To address slice-level performance, practitioners often train separate "expert" models on slice subsets or use multi-task hard parameter sharing. We propose Slice-based Learning, a new programming model in which the slicing function (SF), a programming interface, specifies critical data subsets for which the model should commit additional capacity. Any model can leverage SFs to learn slice expert representations, which are combined with an attention mechanism to make slice-aware predictions. We show that our approach maintains a parameter-efficient representation while improving over baselines by up to 19.0 F1 points on slices and 4.6 F1 points overall on datasets spanning language understanding (e.g. SuperGLUE), computer vision, and production-scale industrial systems.
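The slicing-function abstraction can be illustrated in a few lines: an SF is a predicate that flags a critical subset, and metrics are then reported per slice alongside the overall score. The SFs and data below are toy stand-ins, and the full approach's slice-expert representations and attention mechanism are not shown.

```python
# Sketch of the slicing-function (SF) abstraction: a predicate flags a critical
# subset, and metrics are reported per slice as well as overall. Toy data only.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
sentences = ["is this covered by insurance?", "the wheel is damaged",
             "when does it ship?", "great product"] * 50
y_true = rng.integers(0, 2, size=len(sentences))
y_pred = np.where(rng.random(len(sentences)) < 0.85, y_true, 1 - y_true)

def sf_question(example):                     # slicing function: "question" sentences
    return example.rstrip().endswith("?")

slice_mask = np.array([sf_question(s) for s in sentences])
print("overall F1:          ", round(f1_score(y_true, y_pred), 3))
print("slice F1 (questions):",
      round(f1_score(y_true[slice_mask], y_pred[slice_mask]), 3))
```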
