RESUMO
Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles1-3. Prior models have often resorted to subsampling a small portion of tiles for each slide, thus missing the important slide-level context4. Here we present Prov-GigaPath, a whole-slide pathology foundation model pretrained on 1.3 billion 256 × 256 pathology image tiles in 171,189 whole slides from Providence, a large US health network comprising 28 cancer centres. The slides originated from more than 30,000 patients covering 31 major tissue types. To pretrain Prov-GigaPath, we propose GigaPath, a novel vision transformer architecture for pretraining gigapixel pathology slides. To scale GigaPath for slide-level learning with tens of thousands of image tiles, GigaPath adapts the newly developed LongNet5 method to digital pathology. To evaluate Prov-GigaPath, we construct a digital pathology benchmark comprising 9 cancer subtyping tasks and 17 pathomics tasks, using both Providence and TCGA data6. With large-scale pretraining and ultra-large-context modelling, Prov-GigaPath attains state-of-the-art performance on 25 out of 26 tasks, with significant improvement over the second-best method on 18 tasks. We further demonstrate the potential of Prov-GigaPath on vision-language pretraining for pathology7,8 by incorporating the pathology reports. In sum, Prov-GigaPath is an open-weight foundation model that achieves state-of-the-art performance on various digital pathology tasks, demonstrating the importance of real-world data and whole-slide modelling.
Assuntos
Conjuntos de Dados como Assunto , Processamento de Imagem Assistida por Computador , Aprendizado de Máquina , Patologia Clínica , Humanos , Benchmarking , Processamento de Imagem Assistida por Computador/métodos , Neoplasias/classificação , Neoplasias/diagnóstico , Neoplasias/patologia , Patologia Clínica/métodos , Masculino , FemininoRESUMO
Most detailed patient information in real-world data (RWD) is only consistently available in free-text clinical documents. Manual curation is expensive and time consuming. Developing natural language processing (NLP) methods for structuring RWD is thus essential for scaling real-world evidence generation. We propose leveraging patient-level supervision from medical registries, which are often readily available and capture key patient information, for general RWD applications. We conduct an extensive study on 135,107 patients from the cancer registry of a large integrated delivery network (IDN) comprising healthcare systems in five western US states. Our deep-learning methods attain test area under the receiver operating characteristic curve (AUROC) values of 94%-99% for key tumor attributes and comparable performance on held-out data from separate health systems and states. Ablation results demonstrate the superiority of these advanced deep-learning methods. Error analysis shows that our NLP system sometimes even corrects errors in registrar labels.
RESUMO
Importance: The use of machine learning applications related to health is rapidly increasing and may have the potential to profoundly affect the field of health care. Objective: To analyze submissions to a popular machine learning for health venue to assess the current state of research, including areas of methodologic and clinical focus, limitations, and underexplored areas. Design, Setting, and Participants: In this data-driven qualitative analysis, 166 accepted manuscript submissions to the Third Annual Machine Learning for Health workshop at the 32nd Conference on Neural Information Processing Systems on December 8, 2018, were analyzed to understand research focus, progress, and trends. Experts reviewed each submission against a rubric to identify key data points, statistical modeling and analysis of submitting authors was performed, and research topics were quantitatively modeled. Finally, an iterative discussion of topics common in submissions and invited speakers at the workshop was held to identify key trends. Main Outcomes and Measures: Frequency and statistical measures of methods, topics, goals, and author attributes were derived from an expert review of submissions guided by a rubric. Results: Of the 166 accepted submissions, 58 (34.9%) had clinician involvement and 83 submissions (50.0%) that focused on clinical practice included clinical collaborators. A total of 97 data sets (58.4%) used in submissions were publicly available or required a standard registration process. Clinical practice was the most common application area (70 manuscripts [42.2%]), with brain and mental health (25 [15.1%]), oncology (21 [12.7%]), and cardiovascular (19 [11.4%]) being the most common specialties. Conclusions and Relevance: Trends in machine learning for health research indicate the importance of well-annotated, easily accessed data and the benefit from greater clinician involvement in the development of translational applications.