Automatic generation of investigator bibliographies for institutional research networking systems.

Johnson, Stephen B; Bales, Michael E; Dine, Daniel; Bakken, Suzanne; Albert, Paul J; Weng, Chunhua

Affiliation

Johnson SB; Department of Public Health, Weill Cornell Medical College, New York, United States. Electronic address: johnsos@med.cornell.edu.
Bales ME; Department of Biomedical Informatics, Columbia University, New York, United States.
Dine D; Department of Biomedical Informatics, Columbia University, New York, United States; The Irving Institute for Clinical and Translational Research, Columbia University, New York, United States.
Bakken S; Department of Biomedical Informatics, Columbia University, New York, United States; The Irving Institute for Clinical and Translational Research, Columbia University, New York, United States.
Albert PJ; Samuel J. Wood Library, Weill Cornell Medical College, New York, United States.
Weng C; Department of Biomedical Informatics, Columbia University, New York, United States; The Irving Institute for Clinical and Translational Research, Columbia University, New York, United States.

J Biomed Inform ; 51: 8-14, 2014 Oct.

Article in En | MEDLINE | ID: mdl-24694772

ABSTRACT

OBJECTIVE: Publications are a key data source for investigator profiles and research networking systems. We developed ReCiter, an algorithm that automatically extracts bibliographies from PubMed using institutional information about the target investigators. METHODS: ReCiter executes a broad query against PubMed, groups the results into clusters that appear to constitute distinct author identities and selects the cluster that best matches the target investigator. Using information about investigators from one of our institutions, we compared ReCiter results to queries based on author name and institution and to citations extracted manually from the Scopus database. Five judges created a gold standard using citations of a random sample of 200 investigators. RESULTS: About half of the 10,471 potential investigators had no matching citations in PubMed, and about 45% had fewer than 70 citations. Interrater agreement (Fleiss' kappa) for the gold standard was 0.81. Scopus achieved the best recall (sensitivity) of 0.81, while name-based queries had 0.78 and ReCiter had 0.69. ReCiter attained the best precision (positive predictive value) of 0.93 while Scopus had 0.85 and name-based queries had 0.31. DISCUSSION: ReCiter accesses the most current citation data, uses limited computational resources and minimizes manual entry by investigators. Generation of bibliographies using name-based queries will not yield high accuracy. Proprietary databases can perform well but require manual effort. Automated generation with higher recall is possible but requires additional knowledge about investigators.
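
The steps named in METHODS (broad query, clustering into author identities, selection of the best-matching cluster) and the precision and recall figures in RESULTS can be illustrated with a minimal Python sketch. This is not ReCiter's implementation: the abstract does not state which features or matching rules are used, so the `Citation` record, the co-author-overlap grouping rule and the affiliation-keyword score below are assumptions made purely for illustration, and the broad PubMed query is assumed to have already returned the candidate list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical candidate record returned by the broad query."""
    pmid: str
    coauthors: frozenset
    affiliation: str


def cluster_citations(citations):
    """Greedy one-pass grouping (an assumed rule, not the published one):
    a citation joins the first cluster containing a citation with which
    it shares at least one co-author; otherwise it starts a new cluster."""
    clusters = []
    for cit in citations:
        for cluster in clusters:
            if any(cit.coauthors & other.coauthors for other in cluster):
                cluster.append(cit)
                break
        else:
            clusters.append([cit])
    return clusters


def select_cluster(clusters, institution_terms):
    """Pick the cluster with the most citations whose affiliation text
    mentions any of the institution terms (assumed scoring rule)."""
    terms = [t.lower() for t in institution_terms]

    def score(cluster):
        return sum(
            any(t in cit.affiliation.lower() for t in terms)
            for cit in cluster
        )

    return max(clusters, key=score, default=[])


def precision_recall(predicted_pmids, gold_pmids):
    """Precision (positive predictive value) and recall (sensitivity)
    of a predicted bibliography against a gold-standard one."""
    predicted, gold = set(predicted_pmids), set(gold_pmids)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

In this sketch a cluster stands for one presumed author identity, so choosing a single cluster trades recall for precision: citations that land in a different cluster are lost even when they belong to the target investigator.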

Subjects

Abstracting and Indexing/statistics & numerical data; Algorithms; Authorship; Data Mining/methods; Natural Language Processing; Pattern Recognition, Automated/methods; PubMed/organization & administration; Artificial Intelligence; Bibliographies as Topic; Biomedical Research/organization & administration; Social Networking; Vocabulary, Controlled

Keywords

Authorship; Automated; Bibliography as topic; MEDLINE; Natural language processing; Pattern recognition

Full text: 1 | Collections: 01-internacional | Database: MEDLINE | Main subject: Authorship / Algorithms / Natural Language Processing / Pattern Recognition, Automated / PubMed / Abstracting and Indexing / Data Mining | Language: En | Journal: J Biomed Inform | Journal subject: MEDICAL INFORMATICS | Year of publication: 2014 | Document type: Article
