RESUMEN
BACKGROUND AND AIMS: The American Society for Gastrointestinal Endoscopy (ASGE) AI Task Force along with experts in endoscopy, technology space, regulatory authorities, and other medical subspecialties initiated a consensus process that analyzed the current literature, highlighted potential areas, and outlined the necessary research in artificial intelligence (AI) to allow a clearer understanding of AI as it pertains to endoscopy currently. METHODS: A modified Delphi process was used to develop these consensus statements. RESULTS: Statement 1: Current advances in AI allow for the development of AI-based algorithms that can be applied to endoscopy to augment endoscopist performance in detection and characterization of endoscopic lesions. Statement 2: Computer vision-based algorithms provide opportunities to redefine quality metrics in endoscopy using AI, which can be standardized and can reduce subjectivity in reporting quality metrics. Natural language processing-based algorithms can help with the data abstraction needed for reporting current quality metrics in GI endoscopy effortlessly. Statement 3: AI technologies can support smart endoscopy suites, which may help optimize workflows in the endoscopy suite, including automated documentation. Statement 4: Using AI and machine learning helps in predictive modeling, diagnosis, and prognostication. High-quality data with multidimensionality are needed for risk prediction, prognostication of specific clinical conditions, and their outcomes when using machine learning methods. Statement 5: Big data and cloud-based tools can help advance clinical research in gastroenterology. Multimodal data are key to understanding the maximal extent of the disease state and unlocking treatment options. Statement 6: Understanding how to evaluate AI algorithms in the gastroenterology literature and clinical trials is important for gastroenterologists, trainees, and researchers, and hence education efforts by GI societies are needed. Statement 7: Several challenges regarding integrating AI solutions into the clinical practice of endoscopy exist, including understanding the role of human-AI interaction. Transparency, interpretability, and explainability of AI algorithms play a key role in their clinical adoption in GI endoscopy. Developing appropriate AI governance, data procurement, and tools needed for the AI lifecycle are critical for the successful implementation of AI into clinical practice. Statement 8: For payment of AI in endoscopy, a thorough evaluation of the potential value proposition for AI systems may help guide purchasing decisions in endoscopy. Reliable cost-effectiveness studies to guide reimbursement are needed. Statement 9: Relevant clinical outcomes and performance metrics for AI in gastroenterology are currently not well defined. To improve the quality and interpretability of research in the field, steps need to be taken to define these evidence standards. Statement 10: A balanced view of AI technologies and active collaboration between the medical technology industry, computer scientists, gastroenterologists, and researchers are critical for the meaningful advancement of AI in gastroenterology. CONCLUSIONS: The consensus process led by the ASGE AI Task Force and experts from various disciplines has shed light on the potential of AI in endoscopy and gastroenterology. AI-based algorithms have shown promise in augmenting endoscopist performance, redefining quality metrics, optimizing workflows, and aiding in predictive modeling and diagnosis. However, challenges remain in evaluating AI algorithms, ensuring transparency and interpretability, addressing governance and data procurement, determining payment models, defining relevant clinical outcomes, and fostering collaboration between stakeholders. Addressing these challenges while maintaining a balanced perspective is crucial for the meaningful advancement of AI in gastroenterology.
RESUMEN
BACKGROUND: Positive and negative likelihood ratios (PLR and NLR) are important metrics of accuracy for diagnostic devices with a binary output. However, the properties of Bayesian and frequentist interval estimators of PLR/NLR have not been extensively studied and compared. In this study, we explore the potential use of the Bayesian method for interval estimation of PLR/NLR, and, more broadly, for interval estimation of the ratio of two independent proportions. METHODS: We develop a Bayesian-based approach for interval estimation of PLR/NLR for use as a part of a diagnostic device performance evaluation. Our approach is applicable to a broader setting for interval estimation of any ratio of two independent proportions. We compare score and Bayesian interval estimators for the ratio of two proportions in terms of the coverage probability (CP) and expected interval width (EW) via extensive experiments and applications to two case studies. A supplementary experiment was also conducted to assess the performance of the proposed exact Bayesian method under different priors. RESULTS: Our experimental results show that the overall mean CP for Bayesian interval estimation is consistent with that for the score method (0.950 vs. 0.952), and the overall mean EW for Bayesian is shorter than that for score method (15.929 vs. 19.724). Application to two case studies showed that the intervals estimated using the Bayesian and frequentist approaches are very similar. DISCUSSION: Our numerical results indicate that the proposed Bayesian approach has a comparable CP performance with the score method while yielding higher precision (i.e. a shorter EW).
RESUMEN
BACKGROUND: The Basic Local Alignment Search Tool (BLAST) is a suite of commonly used algorithms for identifying matches between biological sequences. The user supplies a database file and query file of sequences for BLAST to find identical sequences between the two. The typical millions of database and query sequences make BLAST computationally challenging but also well suited for parallelization on high-performance computing clusters. The efficacy of parallelization depends on the data partitioning, where the optimal data partitioning relies on an accurate performance model. In previous studies, a BLAST job was sped up by 27 times by partitioning the database and query among thousands of processor nodes. However, the optimality of the partitioning method was not studied. Unlike BLAST performance models proposed in the literature that usually have problem size and hardware configuration as the only variables, the execution time of a BLAST job is a function of database size, query size, and hardware capability. In this work, the nucleotide BLAST application BLASTN was profiled using three methods: shell-level profiling with the Unix "time" command, code-level profiling with the built-in "profiler" module, and system-level profiling with the Unix "gprof" program. The runtimes were measured for six node types, using six different database files and 15 query files, on a heterogeneous HPC cluster with 500+ nodes. The empirical measurement data were fitted with quadratic functions to develop performance models that were used to guide the data parallelization for BLASTN jobs. RESULTS: Profiling results showed that BLASTN contains more than 34,500 different functions, but a single function, RunMTBySplitDB, takes 99.12% of the total runtime. Among its 53 child functions, five core functions were identified to make up 92.12% of the overall BLASTN runtime. Based on the performance models, static load balancing algorithms can be applied to the BLASTN input data to minimize the runtime of the longest job on an HPC cluster. Four test cases being run on homogeneous and heterogeneous clusters were tested. Experiment results showed that the runtime can be reduced by 81% on a homogeneous cluster and by 20% on a heterogeneous cluster by re-distributing the workload. DISCUSSION: Optimal data partitioning can improve BLASTN's overall runtime 5.4-fold in comparison with dividing the database and query into the same number of fragments. The proposed methodology can be used in the other applications in the BLAST+ suite or any other application as long as source code is available.
Asunto(s)
Metodologías Computacionales , Programas Informáticos , Algoritmos , Biología Computacional/métodos , Alineación de SecuenciaRESUMEN
BACKGROUND: Recent breakthroughs in molecular biology and next generation sequencing technologies have led to the expenential growh of the sequence databases. Researchrs use BLAST for processing these sequences. However traditional software parallelization techniques (threads, message passing interface) applied in newer versios of BLAST are not adequate for processing these sequences in timely manner. METHODS: A new method for array job parallelization has been developed which offers O(T) theoretical speed-up in comparison to multi-threading and MPI techniques. Here T is the number of array job tasks. (The number of CPUs that will be used to complete the job equals the product of T multiplied by the number of CPUs used by a single task.) The approach is based on segmentation of both input datasets to the BLAST process, combining partial solutions published earlier (Dhanker and Gupta, Int J Comput Sci Inf Technol_5:4818-4820, 2014), (Grant et al., Bioinformatics_18:765-766, 2002), (Mathog, Bioinformatics_19:1865-1866, 2003). It is accordingly referred to as a "dual segmentation" method. In order to implement the new method, the BLAST source code was modified to allow the researcher to pass to the program the number of records (effective number of sequences) in the original database. The team also developed methods to manage and consolidate the large number of partial results that get produced. Dual segmentation allows for massive parallelization, which lifts the scaling ceiling in exciting ways. RESULTS: BLAST jobs that hitherto failed or slogged inefficiently to completion now finish with speeds that characteristically reduce wallclock time from 27 days on 40 CPUs to a single day using 4104 tasks, each task utilizing eight CPUs and taking less than 7 minutes to complete. CONCLUSIONS: The massive increase in the number of tasks when running an analysis job with dual segmentation reduces the size, scope and execution time of each task. Besides significant speed of completion, additional benefits include fine-grained checkpointing and increased flexibility of job submission. "Trickling in" a swarm of individual small tasks tempers competition for CPU time in the shared HPC environment, and jobs submitted during quiet periods can complete in extraordinarily short time frames. The smaller task size also allows the use of older and less powerful hardware. The CDRH workhorse cluster was commissioned in 2010, yet its eight-core CPUs with only 24GB RAM work well in 2017 for these dual segmentation jobs. Finally, these techniques are excitingly friendly to budget conscious scientific research organizations where probabilistic algorithms such as BLAST might discourage attempts at greater certainty because single runs represent a major resource drain. If a job that used to take 24 days can now be completed in less than an hour or on a space available basis (which is the case at CDRH), repeated runs for more exhaustive analyses can be usefully contemplated.
Asunto(s)
Algoritmos , Biología Computacional/métodos , Bases de Datos de Ácidos Nucleicos , Humanos , Motor de Búsqueda , Programas InformáticosRESUMEN
The area under the receiver operating characteristic curve is often used as a summary index of the diagnostic ability in evaluating biomarkers when the clinical outcome (truth) is binary. When the clinical outcome is right-censored survival time, the C index, motivated as an extension of area under the receiver operating characteristic curve, has been proposed by Harrell as a measure of concordance between a predictive biomarker and the right-censored survival outcome. In this work, we investigate methods for statistical comparison of two diagnostic or predictive systems, of which they could either be two biomarkers or two fixed algorithms, in terms of their C indices. We adopt a U-statistics-based C estimator that is asymptotically normal and develop a nonparametric analytical approach to estimate the variance of the C estimator and the covariance of two C estimators. A z-score test is then constructed to compare the two C indices. We validate our one-shot nonparametric method via simulation studies in terms of the type I error rate and power. We also compare our one-shot method with resampling methods including the jackknife and the bootstrap. Simulation results show that the proposed one-shot method provides almost unbiased variance estimations and has satisfactory type I error control and power. Finally, we illustrate the use of the proposed method with an example from the Framingham Heart Study.
Asunto(s)
Bioestadística/métodos , Estadísticas no Paramétricas , Algoritmos , Área Bajo la Curva , Biomarcadores , Enfermedades Cardiovasculares/etiología , Simulación por Computador , Humanos , Modelos Estadísticos , Análisis Multivariante , Estudios Prospectivos , Curva ROC , Análisis de SupervivenciaRESUMEN
Receiver operating characteristic (ROC) analysis is a standard methodology to evaluate the performance of a binary classification system. The area under the ROC curve (AUC) is a performance metric that summarizes how well a classifier separates two classes. Traditional AUC optimization techniques are supervised learning methods that utilize only labeled data (i.e., the true class is known for all data) to train the classifiers. In this work, inspired by semi-supervised and transductive learning, we propose two new AUC optimization algorithms hereby referred to as semi-supervised learning receiver operating characteristic (SSLROC) algorithms, which utilize unlabeled test samples in classifier training to maximize AUC. Unlabeled samples are incorporated into the AUC optimization process, and their ranking relationships to labeled positive and negative training samples are considered as optimization constraints. The introduced test samples will cause the learned decision boundary in a multidimensional feature space to adapt not only to the distribution of labeled training data, but also to the distribution of unlabeled test data. We formulate the semi-supervised AUC optimization problem as a semi-definite programming problem based on the margin maximization theory. The proposed methods SSLROC1 (1-norm) and SSLROC2 (2-norm) were evaluated using 34 (determined by power analysis) randomly selected datasets from the University of California, Irvine machine learning repository. Wilcoxon signed rank tests showed that the proposed methods achieved significant improvement compared with state-of-the-art methods. The proposed methods were also applied to a CT colonography dataset for colonic polyp classification and showed promising results.
RESUMEN
Purpose: Understanding an artificial intelligence (AI) model's ability to generalize to its target population is critical to ensuring the safe and effective usage of AI in medical devices. A traditional generalizability assessment relies on the availability of large, diverse datasets, which are difficult to obtain in many medical imaging applications. We present an approach for enhanced generalizability assessment by examining the decision space beyond the available testing data distribution. Approach: Vicinal distributions of virtual samples are generated by interpolating between triplets of test images. The generated virtual samples leverage the characteristics already in the test set, increasing the sample diversity while remaining close to the AI model's data manifold. We demonstrate the generalizability assessment approach on the non-clinical tasks of classifying patient sex, race, COVID status, and age group from chest x-rays. Results: Decision region composition analysis for generalizability indicated that a disproportionately large portion of the decision space belonged to a single "preferred" class for each task, despite comparable performance on the evaluation dataset. Evaluation using cross-reactivity and population shift strategies indicated a tendency to overpredict samples as belonging to the preferred class (e.g., COVID negative) for patients whose subgroup was not represented in the model development data. Conclusions: An analysis of an AI model's decision space has the potential to provide insight into model generalizability. Our approach uses the analysis of composition of the decision space to obtain an improved assessment of model generalizability in the case of limited test data.
RESUMEN
Purpose: Synthetic datasets hold the potential to offer cost-effective alternatives to clinical data, ensuring privacy protections and potentially addressing biases in clinical data. We present a method leveraging such datasets to train a machine learning algorithm applied as part of a computer-aided detection (CADe) system. Approach: Our proposed approach utilizes clinically acquired computed tomography (CT) scans of a physical anthropomorphic phantom into which manufactured lesions were inserted to train a machine learning algorithm. We treated the training database obtained from the anthropomorphic phantom as a simplified representation of clinical data and increased the variability in this dataset using a set of randomized and parameterized augmentations. Furthermore, to mitigate the inherent differences between phantom and clinical datasets, we investigated adding unlabeled clinical data into the training pipeline. Results: We apply our proposed method to the false positive reduction stage of a lung nodule CADe system in CT scans, in which regions of interest containing potential lesions are classified as nodule or non-nodule regions. Experimental results demonstrate the effectiveness of the proposed method; the system trained on labeled data from physical phantom scans and unlabeled clinical data achieves a sensitivity of 90% at eight false positives per scan. Furthermore, the experimental results demonstrate the benefit of the physical phantom in which the performance in terms of competitive performance metric increased by 6% when a training set consisting of 50 clinical CT scans was enlarged by the scans obtained from the physical phantom. Conclusions: The scalability of synthetic datasets can lead to improved CADe performance, particularly in scenarios in which the size of the labeled clinical data is limited or subject to inherent bias. Our proposed approach demonstrates an effective utilization of synthetic datasets for training machine learning algorithms.
RESUMEN
Purpose: Endometrial cancer (EC) is the most common gynecologic malignancy in the United States, and atypical endometrial hyperplasia (AEH) is considered a high-risk precursor to EC. Hormone therapies and hysterectomy are practical treatment options for AEH and early-stage EC. Some patients prefer hormone therapies for reasons such as fertility preservation or being poor surgical candidates. However, accurate prediction of an individual patient's response to hormonal treatment would allow for personalized and potentially improved recommendations for these conditions. This study aims to explore the feasibility of using deep learning models on whole slide images (WSI) of endometrial tissue samples to predict the patient's response to hormonal treatment. Approach: We curated a clinical WSI dataset of 112 patients from two clinical sites. An expert pathologist annotated these images by outlining AEH/EC regions. We developed an end-to-end machine learning model with mixed supervision. The model is based on image patches extracted from pathologist-annotated AEH/EC regions. Either an unsupervised deep learning architecture (Autoencoder or ResNet50), or non-deep learning (radiomics feature extraction) is used to embed the images into a low-dimensional space, followed by fully connected layers for binary prediction, which was trained with binary responder/non-responder labels established by pathologists. We used stratified sampling to partition the dataset into a development set and a test set for internal validation of the performance of our models. Results: The autoencoder model yielded an AUROC of 0.80 with 95% CI [0.63, 0.95] on the independent test set for the task of predicting a patient with AEH/EC as a responder vs non-responder to hormonal treatment. Conclusions: These findings demonstrate the potential of using mixed supervised machine learning models on WSIs for predicting the response to hormonal treatment in AEH/EC patients.
RESUMEN
In the past decade, artificial intelligence (AI) algorithms have made promising impacts in many areas of healthcare. One application is AI-enabled prioritization software known as computer-aided triage and notification (CADt). This type of software as a medical device is intended to prioritize reviews of radiological images with time-sensitive findings, thus shortening the waiting time for patients with these findings. While many CADt devices have been deployed into clinical workflows and have been shown to improve patient treatment and clinical outcomes, quantitative methods to evaluate the wait-time-savings from their deployment are not yet available. In this paper, we apply queueing theory methods to evaluate the wait-time-savings of a CADt by calculating the average waiting time per patient image without and with a CADt device being deployed. We study two workflow models with one or multiple radiologists (servers) for a range of AI diagnostic performances, radiologist's reading rates, and patient image (customer) arrival rates. To evaluate the time-saving performance of a CADt, we use the difference in the mean waiting time between the diseased patient images in the with-CADt scenario and that in the without-CADt scenario as our performance metric. As part of this effort, we have developed and also share a software tool to simulate the radiology workflow around medical image interpretation, to verify theoretical results, and to provide confidence intervals for the performance metric we defined. We show quantitatively that a CADt triage device is more effective in a busy, short-staffed reading setting, which is consistent with our clinical intuition and simulation results. Although this work is motivated by the need for evaluating CADt devices, the evaluation methodology presented in this paper can be applied to assess the time-saving performance of other types of algorithms that prioritize a subset of customers based on binary outputs.
RESUMEN
Radiomics, the science of extracting quantifiable data from routine medical images, is a powerful tool that has many potential applications in oncology. The Response Evaluation Criteria in Solid Tumors Working Group (RWG) held a workshop in May 2022, which brought together various stakeholders to discuss the potential role of radiomics in oncology drug development and clinical trials, particularly with respect to response assessment. This article summarizes the results of that workshop, reviewing radiomics for the practicing oncologist and highlighting the work that needs to be done to move forward the incorporation of radiomics into clinical trials.
Asunto(s)
Neoplasias , Medicina de Precisión , Humanos , Medicina de Precisión/métodos , Criterios de Evaluación de Respuesta en Tumores Sólidos , Radiómica , Oncología Médica , Neoplasias/diagnóstico por imagen , Neoplasias/tratamiento farmacológicoRESUMEN
The adoption of artificial intelligence (AI) tools in medicine poses challenges to existing clinical workflows. This commentary discusses the necessity of context-specific quality assurance (QA), emphasizing the need for robust QA measures with quality control (QC) procedures that encompass (1) acceptance testing (AT) before clinical use, (2) continuous QC monitoring, and (3) adequate user training. The discussion also covers essential components of AT and QA, illustrated with real-world examples. We also highlight what we see as the shared responsibility of manufacturers or vendors, regulators, healthcare systems, medical physicists, and clinicians to enact appropriate testing and oversight to ensure a safe and equitable transformation of medicine through AI.
RESUMEN
Innovation in medical imaging artificial intelligence (AI)/machine learning (ML) demands extensive data collection, algorithmic advancements, and rigorous performance assessments encompassing aspects such as generalizability, uncertainty, bias, fairness, trustworthiness, and interpretability. Achieving widespread integration of AI/ML algorithms into diverse clinical tasks will demand a steadfast commitment to overcoming issues in model design, development, and performance assessment. The complexities of AI/ML clinical translation present substantial challenges, requiring engagement with relevant stakeholders, assessment of cost-effectiveness for user and patient benefit, timely dissemination of information relevant to robust functioning throughout the AI/ML lifecycle, consideration of regulatory compliance, and feedback loops for real-world performance evidence. This commentary addresses several hurdles for the development and adoption of AI/ML technologies in medical imaging. Comprehensive attention to these underlying and often subtle factors is critical not only for tackling the challenges but also for exploring novel opportunities for the advancement of AI in radiology.
RESUMEN
In 2020, Novartis Pharmaceuticals Corporation and the U.S. Food and Drug Administration (FDA) started a 4-year scientific collaboration to approach complex new data modalities and advanced analytics. The scientific question was to find novel radio-genomics-based prognostic and predictive factors for HR+/HER- metastatic breast cancer under a Research Collaboration Agreement. This collaboration has been providing valuable insights to help successfully implement future scientific projects, particularly using artificial intelligence and machine learning. This tutorial aims to provide tangible guidelines for a multi-omics project that includes multidisciplinary expert teams, spanning across different institutions. We cover key ideas, such as "maintaining effective communication" and "following good data science practices," followed by the four steps of exploratory projects, namely (1) plan, (2) design, (3) develop, and (4) disseminate. We break each step into smaller concepts with strategies for implementation and provide illustrations from our collaboration to further give the readers actionable guidance.
Asunto(s)
Inteligencia Artificial , Multiómica , Humanos , Aprendizaje Automático , GenómicaRESUMEN
BACKGROUND: The surge in biomarker development calls for research on statistical evaluation methodology to rigorously assess emerging biomarkers and classification models. Recently, several authors reported the puzzling observation that, in assessing the added value of new biomarkers to existing ones in a logistic regression model, statistical significance of new predictor variables does not necessarily translate into a statistically significant increase in the area under the ROC curve (AUC). Vickers et al. concluded that this inconsistency is because AUC "has vastly inferior statistical properties," i.e., it is extremely conservative. This statement is based on simulations that misuse the DeLong et al. method. Our purpose is to provide a fair comparison of the likelihood ratio (LR) test and the Wald test versus diagnostic accuracy (AUC) tests. DISCUSSION: We present a test to compare ideal AUCs of nested linear discriminant functions via an F test. We compare it with the LR test and the Wald test for the logistic regression model. The null hypotheses of these three tests are equivalent; however, the F test is an exact test whereas the LR test and the Wald test are asymptotic tests. Our simulation shows that the F test has the nominal type I error even with a small sample size. Our results also indicate that the LR test and the Wald test have inflated type I errors when the sample size is small, while the type I error converges to the nominal value asymptotically with increasing sample size as expected. We further show that the DeLong et al. method tests a different hypothesis and has the nominal type I error when it is used within its designed scope. Finally, we summarize the pros and cons of all four methods we consider in this paper. SUMMARY: We show that there is nothing inherently less powerful or disagreeable about ROC analysis for showing the usefulness of new biomarkers or characterizing the performance of classification models. Each statistical method for assessing biomarkers and classification models has its own strengths and weaknesses. Investigators need to choose methods based on the assessment purpose, the biomarker development phase at which the assessment is being performed, the available patient data, and the validity of assumptions behind the methodologies.
Asunto(s)
Biomarcadores , Modelos Estadísticos , Valor Predictivo de las Pruebas , Área Bajo la Curva , Humanos , Funciones de Verosimilitud , Modelos LogísticosRESUMEN
Data drift refers to differences between the data used in training a machine learning (ML) model and that applied to the model in real-world operation. Medical ML systems can be exposed to various forms of data drift, including differences between the data sampled for training and used in clinical operation, differences between medical practices or context of use between training and clinical use, and time-related changes in patient populations, disease patterns, and data acquisition, to name a few. In this article, we first review the terminology used in ML literature related to data drift, define distinct types of drift, and discuss in detail potential causes within the context of medical applications with an emphasis on medical imaging. We then review the recent literature regarding the effects of data drift on medical ML systems, which overwhelmingly show that data drift can be a major cause for performance deterioration. We then discuss methods for monitoring data drift and mitigating its effects with an emphasis on pre- and post-deployment techniques. Some of the potential methods for drift detection and issues around model retraining when drift is detected are included. Based on our review, we find that data drift is a major concern in medical ML deployment and that more research is needed so that ML models can identify drift early, incorporate effective mitigation strategies and resist performance decay.
Asunto(s)
Aprendizaje Automático , Computación en Informática MédicaRESUMEN
BACKGROUND: Bone health and fracture risk are known to be correlated with stiffness. Both micro-finite element analysis (µFEA) and mechanical testing of additive manufactured phantoms are useful approaches for estimating mechanical properties of trabecular bone-like structures. However, it is unclear if measurements from the two approaches are consistent. The purpose of this work is to evaluate the agreement between stiffness measurements obtained from mechanical testing of additive manufactured trabecular bone phantoms and µFEA modeling. Agreement between the two methods would suggest 3D printing is a viable method for validation of µFEA modeling. METHODS: A set of 20 lumbar vertebrae regions of interests were segmented and the corresponding trabecular bone phantoms were produced using selective laser sintering. The phantoms were mechanically tested in uniaxial compression to derive their stiffness values. The stiffness values were also derived from in silico simulation, where linear elastic µFEA was applied to simulate the same compression and boundary conditions. Bland-Altman analysis was used to evaluate agreement between the mechanical testing and µFEA simulation values. Additionally, we evaluated the fidelity of the 3D printed phantoms as well as the repeatability of the 3D printing and mechanical testing process. RESULTS: We observed good agreement between the mechanically tested stiffness and µFEA stiffness, with R2 of 0.84 and normalized root mean square deviation of 8.1%. We demonstrate that the overall trabecular bone structures are printed in high fidelity (Dice score of 0.97 (95% CI, [0.96,0.98]) and that mechanical testing is repeatable (coefficient of variation less than 5% for stiffness values from testing of duplicated phantoms). However, we noticed some defects in the resin microstructure of the 3D printed phantoms, which may account for the discrepancy between the stiffness values from simulation and mechanical testing. CONCLUSION: Overall, the level of agreement achieved between the mechanical stiffness and µFEA indicates that our µFEA methods may be acceptable for assessing bone mechanics of complex trabecular structures as part of an analysis of overall bone health.
RESUMEN
Endometrial cancer (EC) is the most common gynecologic malignancy in the US and complex atypical hyperplasia (CAH) is considered a high-risk precursor to EC. Treatment options for CAH and early-stage EC include hormone therapies and hysterectomy with the former preferred by certain patients, e.g., for fertility preservation or poor surgical candidates. Accurate prediction of response to hormonal treatment would allow for personalized and potentially improved recommendations for the treatment of these conditions. In this study, we investigate the feasibility of utilizing weakly supervised deep learning models on whole slide images of endometrial tissue samples for the prediction of patient response to hormonal treatment. We curated a clinical whole-slide-image (WSI) dataset of 112 patients from two clinical sites. We developed an end-to-end machine learning model using WSIs of endometrial specimens for the prediction of hormonal treatment response among women with CAH/EC. The model takes patches extracted from pathologist-annotated CAH/EC regions as input and utilizes an unsupervised deep learning architecture (Autoencoder or ResNet50) to embed the images into a low-dimensional space, followed by fully connected layers for binary prediction. Our autoencoder model yielded an AUC of 0.79 with 95% CI [0.61, 0.98] on a hold-out test set in the task of predicting a patient with CAH/EC as a responder vs non-responder to hormonal treatment. Our results, demonstrate the potential for using weakly supervised machine learning models on WSIs for predicting response to hormonal treatment of CAH/EC patients.
RESUMEN
PURPOSE: Machine learning algorithms are best trained with large quantities of accurately annotated samples. While natural scene images can often be labeled relatively cheaply and at large scale, obtaining accurate annotations for medical images is both time consuming and expensive. In this study, we propose a cooperative labeling method that allows us to make use of weakly annotated medical imaging data for the training of a machine learning algorithm. As most clinically produced data are weakly-annotated - produced for use by humans rather than machines and lacking information machine learning depends upon - this approach allows us to incorporate a wider range of clinical data and thereby increase the training set size. METHODS: Our pseudo-labeling method consists of multiple stages. In the first stage, a previously established network is trained using a limited number of samples with high-quality expert-produced annotations. This network is used to generate annotations for a separate larger dataset that contains only weakly annotated scans. In the second stage, by cross-checking the two types of annotations against each other, we obtain higher-fidelity annotations. In the third stage, we extract training data from the weakly annotated scans, and combine it with the fully annotated data, producing a larger training dataset. We use this larger dataset to develop a computer-aided detection (CADe) system for nodule detection in chest CT. RESULTS: We evaluated the proposed approach by presenting the network with different numbers of expert-annotated scans in training and then testing the CADe using an independent expert-annotated dataset. We demonstrate that when availability of expert annotations is severely limited, the inclusion of weakly-labeled data leads to a 5% improvement in the competitive performance metric (CPM), defined as the average of sensitivities at different false-positive rates. CONCLUSIONS: Our proposed approach can effectively merge a weakly-annotated dataset with a small, well-annotated dataset for algorithm training. This approach can help enlarge limited training data by leveraging the large amount of weakly labeled data typically generated in clinical image interpretation.