ABSTRACT
Citation data have remained hidden behind proprietary, restrictive licensing agreements, which raises barriers to entry for analysts wishing to use the data, increases the expense of performing large-scale analyses, and reduces the robustness and reproducibility of the conclusions. For the past several years, the National Institutes of Health (NIH) Office of Portfolio Analysis (OPA) has been aggregating and enhancing citation data that can be shared publicly. Here, we describe the NIH Open Citation Collection (NIH-OCC), a public access database for biomedical research that is made freely available to the community. This dataset, which has been carefully generated from unrestricted data sources such as MedLine, PubMed Central (PMC), and CrossRef, now underlies the citation statistics delivered in the NIH iCite analytic platform. We have also included data from a machine learning pipeline that identifies, extracts, resolves, and disambiguates references from full-text articles available on the internet. Open citation links are available to the public in a major update of iCite (https://icite.od.nih.gov).
Subjects
Information Dissemination/ethics , National Institutes of Health (U.S.)/legislation & jurisprudence , Open Access Publishing/legislation & jurisprudence , Organizational Policy , Bibliometrics , Biomedical Research , Humans , Machine Learning , Manuscripts as Topic , National Institutes of Health (U.S.)/economics , Open Access Publishing/economics , United States
ABSTRACT
INTRODUCTION: To fulfill its mission, the NIH Office of Disease Prevention systematically monitors NIH investments in applied prevention research. Specifically, the Office focuses on research in humans involving primary and secondary prevention, and prevention-related methods. Currently, the NIH uses the Research, Condition, and Disease Categorization system to report agency funding in prevention research. However, this system defines prevention research broadly to include primary and secondary prevention, studies on prevention methods, and basic and preclinical studies for prevention. A new methodology was needed to quantify NIH funding in applied prevention research. METHODS: A novel machine learning approach was developed and evaluated for its ability to characterize NIH-funded applied prevention research during fiscal years 2012-2015. The sensitivity, specificity, positive predictive value, accuracy, and F1 score of the machine learning method; the Research, Condition, and Disease Categorization system; and a combined approach were estimated. Analyses were completed during June-August 2017. RESULTS: Because the machine learning method was trained to recognize applied prevention research, it more accurately identified applied prevention grants (F1 = 72.7%) than the Research, Condition, and Disease Categorization system (F1 = 54.4%) and a combined approach (F1 = 63.5%) with p<0.001. CONCLUSIONS: This analysis demonstrated the use of machine learning as an efficient method to classify NIH-funded research grants in disease prevention.
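The evaluation measures named in METHODS (sensitivity, specificity, positive predictive value, accuracy, and F1 score) all follow from the four cells of a binary confusion matrix. The Python sketch below illustrates the standard definitions only; it is not the authors' code. The names `ConfusionCounts` and `classification_metrics` are hypothetical, and the counts are invented for illustration, not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix (hypothetical helper type)."""
    tp: int  # grants correctly labeled as applied prevention research
    fp: int  # grants incorrectly labeled as applied prevention research
    tn: int  # grants correctly labeled as not applied prevention research
    fn: int  # applied prevention grants the classifier missed


def classification_metrics(c: ConfusionCounts) -> dict:
    """Compute the five measures; assumes every denominator is nonzero."""
    sensitivity = c.tp / (c.tp + c.fn)   # recall, true positive rate
    specificity = c.tn / (c.tn + c.fp)   # true negative rate
    ppv = c.tp / (c.tp + c.fp)           # positive predictive value, precision
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)
    # F1 is the harmonic mean of precision (PPV) and recall (sensitivity).
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Illustrative counts only; these are not results from the study.
counts = ConfusionCounts(tp=80, fp=20, tn=880, fn=20)
metrics = classification_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because F1 ignores true negatives, it can differ sharply from accuracy when the positive class is rare, which is one reason it is commonly reported alongside accuracy for classification tasks such as this one.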