ABSTRACT
BACKGROUND: Scientific discovery progresses by exploring new and uncharted territory. More specifically, it advances by a process of transforming unknown unknowns first into known unknowns, and then into knowns. Over the last few decades, researchers have developed many knowledge bases to capture and connect the knowns, which has enabled topic exploration and contextualization of experimental results. But recognizing the unknowns is also critical for finding the most pertinent questions and their answers. Prior work on known unknowns has sought to understand them, annotate them, and automate their identification. However, no knowledge-bases yet exist to capture these unknowns, and little work has focused on how scientists might use them to trace a given topic or experimental result in search of open questions and new avenues for exploration. We show here that a knowledge base of unknowns can be connected to ontologically grounded biomedical knowledge to accelerate research in the field of prenatal nutrition. RESULTS: We present the first ignorance-base, a knowledge-base created by combining classifiers to recognize ignorance statements (statements of missing or incomplete knowledge that imply a goal for knowledge) and biomedical concepts over the prenatal nutrition literature. This knowledge-base places biomedical concepts mentioned in the literature in context with the ignorance statements authors have made about them. Using our system, researchers interested in the topic of vitamin D and prenatal health were able to uncover three new avenues for exploration (immune system, respiratory system, and brain development) by searching for concepts enriched in ignorance statements. These were buried among the many standard enriched concepts. Additionally, we used the ignorance-base to enrich concepts connected to a gene list associated with vitamin D and spontaneous preterm birth and found an emerging topic of study (brain development) in an implied field (neuroscience). 
The researchers could look to the field of neuroscience for potential answers to the ignorance statements. CONCLUSION: Our goal is to help students, researchers, funders, and publishers better understand the state of our collective scientific ignorance (known unknowns) in order to help accelerate research through the continued illumination of and focus on the known unknowns and their respective goals for scientific knowledge.
Subject(s)
Knowledge Bases; Knowledge; Natural Language Processing; Female; Humans; Infant, Newborn; Premature Birth; Publications; Vitamin D
ABSTRACT
BACKGROUND: Automated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data have impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models have the potential to outperform multi-class classification approaches. METHODS: We systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning through extensive studies of alternative methods and hyperparameter selections. We not only identify the best-performing systems and parameters across a wide variety of ontologies but also provide insights into the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggests promising avenues for future improvements as well as design choices that can increase computational efficiency with small costs in performance. RESULTS: Bidirectional encoder representations from transformers for biomedical text mining (BioBERT) for span detection along with the open-source toolkit for neural machine translation (OpenNMT) for concept normalization achieve state-of-the-art performance for most ontologies annotated in the CRAFT Corpus. This approach uses substantially fewer computational resources, including hardware, memory, and time, than several alternative approaches.
CONCLUSIONS: Machine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via a direct comparison to previous results from the 2019 CRAFT shared task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation .
ABSTRACT
MOTIVATION: Science progresses by posing good questions, yet work in biomedical text mining has not focused on them much. We propose a novel idea for biomedical natural language processing: identifying and characterizing the questions stated in the biomedical literature. Formally, the task is to identify and characterize statements of ignorance, statements where scientific knowledge is missing or incomplete. The creation of such technology could have many significant impacts, from the training of PhD students to ranking publications and prioritizing funding based on particular questions of interest. The work presented here is intended as the first step towards these goals. RESULTS: We present a novel ignorance taxonomy driven by the role statements of ignorance play in research, identifying specific goals for future scientific knowledge. Using this taxonomy and reliable annotation guidelines (inter-annotator agreement above 80%), we created a gold standard ignorance corpus of 60 full-text documents from the prenatal nutrition literature with over 10 000 annotations and used it to train classifiers that achieved over 0.80 F1 scores. AVAILABILITY AND IMPLEMENTATION: Corpus and source code freely available for download at https://github.com/UCDenver-ccp/Ignorance-Question-Work. The source code is implemented in Python.
ABSTRACT
How has the focus of research papers on a given disease changed over time? Identifying the papers at the cusps of change can help highlight the emergence of a new topic or a change in the direction of research. We present a generally applicable unsupervised approach to this question based on semantic changepoints within a given collection of research papers. We illustrate the approach by a range of examples based on a nascent corpus of literature on COVID-19 as well as subsets of papers from PubMed on the World Health Organization list of neglected tropical diseases. The software is freely available at: https://github.com/pdddinakar/SemanticChangepointDetection.
Subject(s)
COVID-19; Semantics; Computational Biology; Humans; PubMed; SARS-CoV-2
ABSTRACT
BACKGROUND: Certain biological processes, such as the development of cancer and immune activation, can be controlled by rare cellular events that are difficult to capture computationally through simulations of individual cells. Information about such rare events can be gleaned from an attractor analysis, for which a variety of methods exist (in particular for Boolean models). However, explicitly simulating a defined mixed population of cells in a way that tracks even the rarest subpopulations remains an open challenge. RESULTS: Here we show that when cellular states are described using a Boolean network model, one can exactly simulate the dynamics of non-interacting, highly heterogeneous populations directly, without having to model the various subpopulations. This strategy captures even the rarest outcomes of the model with no sampling error. Our method can incorporate heterogeneity in both cell state and, by augmenting the model, the underlying rules of the network as well (e.g., introducing loss-of-function genetic alterations). We demonstrate our method by using it to simulate a heterogeneous population of Boolean networks modeling the T-cell receptor, spanning ∼ 10^20 distinct cellular states and mutational profiles. CONCLUSIONS: We have developed a method for using Boolean models to perform a population-level simulation, in which the population consists of non-interacting individuals existing in different states. This approach can be used even when there are far too many distinct subpopulations to model individually.
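For orientation, the sketch below shows the explicit-subpopulation baseline that the abstract contrasts itself with, not the authors' method. It assumes a hypothetical three-node network (the update rules in `step` are invented for illustration) with synchronous updates, and it tracks one weight per distinct state. The number of weights grows as 2^n in the number of nodes, which is why this direct bookkeeping becomes infeasible for large networks.

```python
from collections import defaultdict
from itertools import product


def step(state):
    """Synchronous update of a hypothetical 3-node network (A, B, C).

    Assumed rules, for illustration only:
      A' = A,  B' = A AND NOT C,  C' = B
    """
    a, b, c = state
    return (a, int(a and not c), b)


def evolve(population, n_steps):
    """Push each subpopulation's weight through the update rule n_steps times.

    `population` maps a state tuple to the fraction of cells in that state.
    No sampling is involved, so even very small weights are tracked exactly.
    """
    for _ in range(n_steps):
        nxt = defaultdict(float)
        for state, weight in population.items():
            nxt[step(state)] += weight
        population = dict(nxt)
    return population


# Heterogeneous starting population: a uniform mixture over all 2**3 states.
population = {s: 1 / 8 for s in product((0, 1), repeat=3)}
final = evolve(population, 4)

# Fraction of the population with node B active after four updates.
mean_b = sum(w for (a, b, c), w in final.items() if b)
```

The dictionary `population` has at most 2^n entries, so the same bookkeeping for a network with dozens of nodes would need an impractically large table.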
Subject(s)
Computer Simulation; Algorithms; Models, Biological; Monte Carlo Method; Mutation; Probability; Receptors, Antigen, T-Cell/genetics; Receptors, Antigen, T-Cell/metabolism
ABSTRACT
Most natural language processing applications exhibit a trade-off between precision and recall. In some use cases for natural language processing, there are reasons to prefer to tilt that trade-off toward high precision. Relying on the Zipfian distribution of false positive results, we describe a strategy for increasing precision, using a variety of both pre-processing and post-processing methods. They draw on both knowledge-based and frequentist approaches to modeling language. Based on an existing high-performance biomedical concept recognition pipeline and a previously published manually annotated corpus, we apply this hybrid rationalist/empiricist strategy to concept normalization for eight different ontologies. Which approaches did and did not improve precision varied widely between the ontologies.
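A minimal sketch of why a Zipfian distribution of false positives matters, assuming a post-processing step that blocks the most frequent false-positive strings. The toy predictions and the helper names (`precision_of`, `filter_top_false_positives`) are hypothetical and are not taken from the pipeline described above. In practice, the blocklist would be derived from a development set rather than from the data being evaluated.

```python
from collections import Counter


def precision_of(predictions):
    """Precision = true positives / all predictions made."""
    if not predictions:
        return 0.0
    return sum(ok for _, ok in predictions) / len(predictions)


def filter_top_false_positives(predictions, k):
    """Drop every prediction whose text is among the k most frequent
    false-positive strings. Correct predictions sharing a blocked
    string are dropped too, which is the recall cost of this filter."""
    fp_counts = Counter(text for text, ok in predictions if not ok)
    blocked = {text for text, _ in fp_counts.most_common(k)}
    return [(text, ok) for text, ok in predictions if text not in blocked]


# Toy system output: (matched text, whether the assigned concept was correct).
predictions = (
    [("cell", False)] * 6 + [("cell", True)] * 1
    + [("protein", True)] * 5
    + [("lead", False)] * 2
    + [("nucleus", True)] * 4
    + [("spot", False)] * 1
)

before = precision_of(predictions)            # 10 correct out of 19
filtered = filter_top_false_positives(predictions, k=1)
after = precision_of(filtered)                # 9 correct out of 12
```

In this toy data a single string accounts for six of the nine false positives, so blocking it raises precision from 10/19 to 9/12 at the cost of one true positive.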
Subject(s)
Natural Language Processing; Biological Ontologies/statistics & numerical data; Computational Biology/methods; Data Mining/methods; Electronic Health Records/statistics & numerical data; False Positive Reactions; Humans; Precision Medicine/statistics & numerical data; PubMed/statistics & numerical data; Reproducibility of Results
ABSTRACT
Human-annotated data is a fundamental part of natural language processing system development and evaluation. The quality of that data is typically assessed by calculating the agreement between the annotators. It is widely assumed that this agreement between annotators is the upper limit on system performance in natural language processing: if humans can't agree with each other about the classification more than some percentage of the time, we don't expect a computer to do any better. We trace the logical positivist roots of the motivation for measuring inter-annotator agreement, demonstrate the prevalence of the widely-held assumption about the relationship between inter-annotator agreement and system performance, and present data that suggest that inter-annotator agreement is not, in fact, an upper bound on language processing system performance.
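As an illustration of how agreement between two annotators is commonly quantified, the sketch below computes Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The abstract does not say which agreement measure is used, so the choice of kappa is an assumption, and the labels are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of ten items by two annotators.
annotator_1 = ["Y", "Y", "N", "N", "Y", "N", "Y", "Y", "N", "N"]
annotator_2 = ["Y", "N", "N", "N", "Y", "N", "Y", "Y", "Y", "N"]

kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 items (0.8 raw agreement), chance agreement is 0.5, and kappa is therefore 0.6.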
Subject(s)
Data Curation; Natural Language Processing; Humans; Observer Variation
ABSTRACT
An unusual H/D kinetic isotope effect (KIE) is described, in which isotopic selectivity arises primarily from nonstatistical dynamics in the product. In DFT-based quasiclassical trajectories of Bergman cyclization of (Z)-3-hexen-1,5-diyne (1) at 470 K, the new CC bond retains its energy, and 28% of nascent p-benzyne recrosses back to the enediyne on a vibrational time scale. The competing process of intramolecular vibrational redistribution (IVR) in p-benzyne is too slow to prevent this. Deuteration increases the rate of IVR, which decreases the fraction of recrossing and increases the yield of statistical (trapable) p-benzyne, 2. Trapable yields for three isotopomers of 2 range from 72% to 86%. The resulting KIEs for Bergman cyclization differ substantially from KIEs predicted by transition state theory, which suggests that IVR in this reaction can be studied by conventional KIEs. Leakage of vibrational zero point energy (ZPE) into the reaction coordinate was probed by trajectories in which initial ZPE in the CH/CD stretching modes was reduced by 25%. This did not change the predicted KIEs.