RESUMO
Bibliographic records often contain author affiliations as free-form text strings. Ideally one would be able to automatically identify all affiliations referring to any particular country or city such as Saint Petersburg, Russia. That introduces several major linguistic challenges. For example, Saint Petersburg is ambiguous (it refers to multiple cities worldwide and can be part of a street address) and it has spelling variants (e.g., St. Petersburg, Sankt-Peterburg, and Leningrad, USSR). We have designed an algorithm that attempts to solve these types of problems. Key components of the algorithm include a set of 24,000 extracted city, state, and country names (and their variants plus geocodes) for candidate look-up, and a set of 1.1 million extracted word n-grams, each pointing to a unique country (or a US state) for disambiguation. When applied to a collection of 12.7 million affiliation strings listed in PubMed, ambiguity remained unresolved for only 0.1%. For the 4.2 million mappings to the USA, 97.7% were complete (included a city), 1.8% included a state but not a city, and 0.4% did not include a state. A random sample of 300 manually inspected cases yielded six incompletes, none incorrect, and one unresolved ambiguity. The remaining 293 (97.7%) cases were unambiguously mapped to the correct cities, better than all of the existing tools tested: GoPubMed got 279 (93.0%) and GeoMaker got 274 (91.3%) while MediaMeter CLIFF and Google Maps did worse. In summary, we find that incorrect assignments and unresolved ambiguities are rare (< 1%). The incompleteness rate is about 2%, mostly due to a lack of information, e.g. the affiliation simply says "University of Illinois" which can refer to one of five different campuses. A search interface called MapAffil has been developed at the University of Illinois in which the longitude and latitude of the geographical city-center is displayed when a city is identified. This not only helps improve geographic information retrieval but also enables global bibliometric studies of proximity, mobility, and other geo-linked data.
RESUMO
The perennial problem of author name ambiguity has attracted increasing attention in the academic community. Drawing on the literature, this article first highlights the pervasiveness of the problem and discusses its adverse consequences. It then analyzes the behavioral causes of the problem in the Chinese context and attributes them to personal, cultural, and institutional factors. Informed by this analysis and recognizing ORCID as a promising solution, we propose an ORCID-based "Prevention plus Cure" campaign against author name ambiguity. The prevention objective relies on researchers' consistent use of ORCID, while the cure objective involves retrospectively integrating ORCIDs into backfile publications. We also outline the responsibilities of various stakeholders to ensure the success of the campaign. Furthermore, we argue that universal adoption of ORCID can help curb authorship-related misconduct, discern predatory journals and publishers, and track researchers' undesirable records of academic publishing. We then analyze the current status of ORCID adoption in China, identify potential challenges, propose tentative solutions to address them, and highlight ORCID as a tool that can be utilized to empower China's combat against research misconduct. In conclusion, we emphasize the importance of conducting empirical research to inform more effective promotion of ORCID adoption in China.
RESUMO
Genome-scale metabolic models (GEMs) are manually curated repositories describing the metabolic capabilities of an organism. GEMs have been successfully used in different research areas, ranging from systems medicine to biotechnology. However, the different naming conventions (namespaces) of databases used to build GEMs limit model reusability and prevent the integration of existing models. This problem is known in the GEM community, but its extent has not been analyzed in depth. In this study, we investigate the name ambiguity and the multiplicity of non-systematic identifiers and we highlight the (in)consistency in their use in 11 biochemical databases of biochemical reactions and the problems that arise when mapping between different namespaces and databases. We found that such inconsistencies can be as high as 83.1%, thus emphasizing the need for strategies to deal with these issues. Currently, manual verification of the mappings appears to be the only solution to remove inconsistencies when combining models. Finally, we discuss several possible approaches to facilitate (future) unambiguous mapping.
RESUMO
BACKGROUND: A wide range of chemical compound databases are currently available for pharmaceutical research. To retrieve compound information, including structures, researchers can query these chemical databases using non-systematic identifiers. These are source-dependent identifiers (e.g., brand names, generic names), which are usually assigned to the compound at the point of registration. The correctness of non-systematic identifiers (i.e., whether an identifier matches the associated structure) can only be assessed manually, which is cumbersome, but it is possible to automatically check their ambiguity (i.e., whether an identifier matches more than one structure). In this study we have quantified the ambiguity of non-systematic identifiers within and between eight widely used chemical databases. We also studied the effect of chemical structure standardization on reducing the ambiguity of non-systematic identifiers. RESULTS: The ambiguity of non-systematic identifiers within databases varied from 0.1 to 15.2 % (median 2.5 %). Standardization reduced the ambiguity only to a small extent for most databases. A wide range of ambiguity existed for non-systematic identifiers that are shared between databases (17.7-60.2 %, median of 40.3 %). Removing stereochemistry information provided the largest reduction in ambiguity across databases (median reduction 13.7 percentage points). CONCLUSIONS: Ambiguity of non-systematic identifiers within chemical databases is generally low, but ambiguity of non-systematic identifiers that are shared between databases, is high. Chemical structure standardization reduces the ambiguity to a limited extent. Our findings can help to improve database integration, curation, and maintenance.