Snowball 2.0: Generic Material Data Parser for ChemDataExtractor.
Dong, Qingyang; Cole, Jacqueline M.
Affiliation
  • Dong Q; Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, U.K.
  • Cole JM; Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, U.K.
J Chem Inf Model ; 63(22): 7045-7055, 2023 Nov 27.
Article in English | MEDLINE | ID: mdl-37934697
ABSTRACT
The ever-growing amount of chemical data found in the scientific literature has led to the emergence of data-driven materials discovery. The first step in the pipeline, to automatically extract chemical information from plain text, has been driven by the development of software toolkits such as ChemDataExtractor. Such data extraction processes have created a demand for parsers that efficiently enable text mining. Here, we present Snowball 2.0, a sentence parser based on a semisupervised machine-learning algorithm. It can be used to extract any chemical property without additional training. We validate its precision, recall, and F-score by training and testing a model with sentences of semiconductor band gap information curated from journal articles. Snowball 2.0 builds on two previously developed Snowball algorithms. Evaluation of Snowball 2.0 shows a 15-20% increase in recall with marginally reduced precision over the previous version, which has been incorporated into ChemDataExtractor 2.0, giving Snowball 2.0 better performance in most configurations. Snowball 2.0 offers more and better parsing options for ChemDataExtractor, and it is more capable in the pipeline of automated data extraction. Snowball 2.0 also features better generalizability, performance, learning efficiency, and user-friendliness.
Subject(s)

Full text: 1 Databases: MEDLINE Main subject: Algorithms / Software Language: En Journal: J Chem Inf Model Journal subject: MEDICAL INFORMATICS / CHEMISTRY Year: 2023 Document type: Article Affiliation country: United Kingdom