Improving reference standards for validation of AI-based radiography.
Duggan, Gavin E; Reicher, Joshua J; Liu, Yun; Tse, Daniel; Shetty, Shravya.
Affiliation
  • Google Health (G.E.D., Y.L., D.T., S.S.); Stanford Health Care and Palo Alto Veterans Affairs (J.J.R.), California, USA.
Br J Radiol; 94(1123): 20210435, 2021 Jul 01.
Article in English | MEDLINE | ID: mdl-34142868
ABSTRACT

OBJECTIVE:

To demonstrate the importance of combining multiple readers' opinions, in a context-aware manner, when establishing the reference standard for validation of artificial intelligence (AI) applications, e.g. for chest radiographs. By comparing individual readers, the majority vote of a panel, and panel-based discussion, we identify methods that maximize interobserver agreement and label reproducibility.

METHODS:

1100 frontal chest radiographs were evaluated for six findings: airspace opacity, cardiomegaly, pulmonary edema, fracture, nodules, and pneumothorax. Each image was reviewed by six radiologists, first individually and then via asynchronous adjudication (web-based discussion) in two panels of three readers to resolve disagreements within each panel. We quantified the reproducibility of each method by measuring interreader agreement.
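
As a rough, hypothetical sketch of the label-aggregation scheme described above (the function names and data layout are assumptions, not taken from the paper), majority voting and flagging non-unanimous reads for adjudication might look like this in Python:

    from collections import Counter

    def majority_vote(labels):
        """Reference label = most common read across an odd-sized panel."""
        return Counter(labels).most_common(1)[0][0]

    def needs_adjudication(labels):
        """Flag images where the panel disagreed; in the study these cases
        went to asynchronous (web-based) discussion rounds."""
        return len(set(labels)) > 1

    # Hypothetical reads for one image and one finding: 1 = present, 0 = absent
    panel_reads = {"reader_a": 1, "reader_b": 0, "reader_c": 1}
    labels = list(panel_reads.values())

    reference = majority_vote(labels)        # 1 -> finding labeled present
    adjudicate = needs_adjudication(labels)  # True -> discuss and re-vote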

RESULTS:

Panel-based majority vote improved agreement relative to individual readers for all findings. Most disagreements were resolved with two rounds of adjudication, which further improved reproducibility for some findings, particularly reducing misses. Improvements varied across finding categories, with adjudication improving agreement for cardiomegaly, fractures, and pneumothorax.
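
The abstract does not state which agreement statistic was used; as one illustrative possibility (not necessarily the authors' choice), chance-corrected agreement between the two panels' adjudicated labels could be computed with Cohen's kappa:

    def cohen_kappa(a, b):
        """Chance-corrected agreement between two binary label sequences."""
        assert len(a) == len(b)
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        # Expected agreement under independence, from each panel's marginal rates.
        p_a1 = sum(a) / n
        p_b1 = sum(b) / n
        expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
        return (observed - expected) / (1 - expected)

    # Hypothetical adjudicated labels from the two three-reader panels
    panel_1 = [1, 0, 1, 1, 0, 0, 1, 0]
    panel_2 = [1, 0, 1, 0, 0, 0, 1, 0]
    print(cohen_kappa(panel_1, panel_2))  # 0.75 for this toy example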

CONCLUSION:

The likelihood of interreader agreement, even within panels of US board-certified radiologists, must be considered before reads can be used as a reference standard for validation of proposed AI tools. Agreement and, by extension, reproducibility can be improved by applying majority vote, maximum sensitivity, or asynchronous adjudication for different findings, which supports the development of higher quality clinical research.

ADVANCES IN KNOWLEDGE:

A panel of three experts is a common technique for establishing reference standards when ground truth is not available for use in AI validation. The manner in which differing opinions are resolved is shown to be important, and has not been previously explored.
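
The abstract does not define the "maximum sensitivity" rule; a plausible reading, sketched below as an assumption, is that a finding is labeled present whenever any panel member called it present:

    def max_sensitivity_vote(labels):
        """Label the finding present if any reader marked it; this reduces
        missed findings at the cost of more false positives than a majority vote."""
        return int(any(labels))

    print(max_sensitivity_vote([0, 0, 1]))  # 1: a single positive read wins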

Full text: 1 Collection: 01-international Database: MEDLINE Main subject: Artificial Intelligence / Radiography, Thoracic Study type: Guideline / Prognostic_studies Limit: Humans Language: English Journal: Br J Radiol Year: 2021 Document type: Article Country of affiliation: United States

...