1.
Stud Health Technol Inform ; 264: 1433-1434, 2019 Aug 21.
Article in English | MEDLINE | ID: mdl-31438167

ABSTRACT

"P-hacking" is the repeated analysis of data until a statistically significant result is achieved. We show that p-hacking can also occur during data generation, sometimes unintentionally. We use the type-token ratio to demonstrate that differences in the definitions of "type" and "token" can produce significantly different results. Since these terms are rarely defined in the biomedical literature, the result is an inability to meaningfully interpret the body of literature that makes use of this measure.
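
The dependence of the type-token ratio on how "type" and "token" are defined can be illustrated with a minimal sketch. The two tokenizers, the lowercasing step, and the sample sentence below are assumptions chosen for illustration; they are not the definitions or data used in the paper.

```python
import re

def type_token_ratio(text, tokenize, normalize=lambda t: t):
    """Return (number of distinct types) / (number of tokens)."""
    tokens = [normalize(t) for t in tokenize(text)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def whitespace_tokens(text):
    # Definition A (assumed): a token is any whitespace-delimited string,
    # so punctuation stays attached and case is preserved.
    return text.split()

def word_tokens(text):
    # Definition B (assumed): a token is a run of letters or digits.
    return re.findall(r"[A-Za-z0-9]+", text)

# Hypothetical sample text, not taken from any corpus in the paper.
sample = "The cell divides. The cells divide, and the cell dies."

ttr_a = type_token_ratio(sample, whitespace_tokens)
ttr_b = type_token_ratio(sample, word_tokens, normalize=str.lower)
print(ttr_a, ttr_b)
```

Both definitions yield ten tokens for this sentence, but they disagree on the number of types because "The" and "the" are merged only under the second definition, so the same text produces two different ratios.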


Subjects
Computer Security , Vocabulary
2.
Bioinformatics ; 35(21): 4372-4380, 2019 Nov 01.
Article in English | MEDLINE | ID: mdl-30937439

ABSTRACT

MOTIVATION: Most currently available text mining tools share two characteristics that make them less than optimal for use by biomedical researchers: they require extensive specialist skills in natural language processing and they were built on the assumption that they should optimize global performance metrics on representative datasets. This is a problem because most end-users are not natural language processing specialists and because biomedical researchers often care less about global metrics like F-measure or representative datasets than they do about more granular metrics such as precision and recall on their own specialized datasets. Thus, there are fundamental mismatches between the assumptions of much text mining work and the preferences of potential end-users. RESULTS: This article introduces the concept of Agile text mining, and presents the PubAnnotation ecosystem as an example implementation. The system approaches the problems from two perspectives: it allows the reformulation of text mining by biomedical researchers from the task of assembling a complete system to the task of retrieving warehoused annotations, and it makes it possible to do very targeted customization of the pre-existing system to address specific end-user requirements. Two use cases are presented: assisted curation of the GlycoEpitope database, and assessing coverage in the literature of pre-eclampsia-associated genes. AVAILABILITY AND IMPLEMENTATION: The three tools that make up the ecosystem, PubAnnotation, PubDictionaries and TextAE are publicly available as web services, and also as open source projects. The dictionaries and the annotation datasets associated with the use cases are all publicly available through PubDictionaries and PubAnnotation, respectively.


Subjects
Computational Biology , Ecosystem , Data Mining , Female , Humans , Natural Language Processing , Pregnancy , PubMed
3.
LREC Int Conf Lang Resour Eval ; 2018: 156-165, 2018 May.
Article in English | MEDLINE | ID: mdl-29911205

ABSTRACT

Despite considerable recent attention to problems with reproducibility of scientific research, there is a striking lack of agreement about the definition of the term. That is a problem, because the lack of a consensus definition makes it difficult to compare studies of reproducibility, and thus to have even a broad overview of the state of the issue in natural language processing. This paper proposes an ontology of reproducibility in that field. Its goal is to enhance both future research and communication about the topic, and retrospective meta-analyses. We show that three dimensions of reproducibility, corresponding to three kinds of claims in natural language processing papers, can account for a variety of types of research reports. These dimensions are reproducibility of a conclusion, of a finding, and of a value. Three biomedical natural language processing papers by the authors of this paper are analyzed with respect to these dimensions.

4.
Stud Health Technol Inform ; 247: 890-894, 2018.
Article in English | MEDLINE | ID: mdl-29678089

ABSTRACT

This paper presents a modular ontology of health care in the context of Amyotrophic Lateral Sclerosis. Four modules cover socio-environmental, medical, and care coordination aspects of the domain. They are organized by a core module. The goal of the ontology is to understand interruptions in health care provision in the context of a neurodegenerative disease.


Subjects
Amyotrophic Lateral Sclerosis/therapy , Communication , Disease Management , Humans
5.
Pac Symp Biocomput ; 23: 566-577, 2018.
Article in English | MEDLINE | ID: mdl-29218915

ABSTRACT

Most natural language processing applications exhibit a trade-off between precision and recall. In some use cases for natural language processing, there are reasons to prefer to tilt that trade-off toward high precision. Relying on the Zipfian distribution of false positive results, we describe a strategy for increasing precision, using a variety of both pre-processing and post-processing methods. They draw on both knowledge-based and frequentist approaches to modeling language. Based on an existing high-performance biomedical concept recognition pipeline and a previously published manually annotated corpus, we apply this hybrid rationalist/empiricist strategy to concept normalization for eight different ontologies. Which approaches did and did not improve precision varied widely between the ontologies.


Subjects
Natural Language Processing , Biological Ontologies/statistics & numerical data , Computational Biology/methods , Data Mining/methods , Electronic Health Records/statistics & numerical data , False Positive Reactions , Humans , Precision Medicine/statistics & numerical data , PubMed/statistics & numerical data , Reproducibility of Results
6.
BMC Bioinformatics ; 18(1): 361, 2017 Aug 07.
Article in English | MEDLINE | ID: mdl-28784111

ABSTRACT

BACKGROUND: Probabilistic assessments of clinical care are essential for quality care. Yet machine learning, which supports this care process, has been limited to categorical results. To maximize its usefulness, it is important to find novel approaches that calibrate the ML output with a likelihood scale. Current state-of-the-art calibration methods are generally accurate and applicable to many ML models, but improved granularity and accuracy of such methods would increase the information available for clinical decision making. This novel non-parametric Bayesian approach is demonstrated on a variety of data sets, including simulated classifier outputs, biomedical data sets from the University of California, Irvine (UCI) Machine Learning Repository, and a clinical data set built to determine suicide risk from the language of emergency department patients. RESULTS: The method is first demonstrated on support-vector machine (SVM) models, which generally produce well-behaved, well-understood scores. The method produces calibrations that are comparable to the state-of-the-art Bayesian Binning in Quantiles (BBQ) method when the SVM models are able to effectively separate cases and controls. However, as the SVM models' ability to discriminate classes decreases, our approach yields more granular and dynamic calibrated probabilities compared to the BBQ method. Improvements in granularity and range are even more dramatic when the discrimination between the classes is artificially degraded by replacing the SVM model with an ad hoc k-means classifier. CONCLUSIONS: The method allows both clinicians and patients to have a more nuanced view of the output of an ML model, allowing better decision making. The method is demonstrated on simulated data, various biomedical data sets and a clinical data set, to which diverse ML methods are applied. Trivially extending the method to (non-ML) clinical scores is also discussed.


Subjects
Decision Support Systems, Clinical , Machine Learning , Adolescent , Bayes Theorem , Calibration , Decision Support Systems, Clinical/standards , Humans , Statistics, Nonparametric , Suicide , Support Vector Machine
7.
BMC Bioinformatics ; 18(1): 372, 2017 Aug 17.
Article in English | MEDLINE | ID: mdl-28818042

ABSTRACT

BACKGROUND: Coreference resolution is the task of finding strings in text that have the same referent as other strings. Failures of coreference resolution are a common cause of false negatives in information extraction from the scientific literature. In order to better understand the nature of the phenomenon of coreference in biomedical publications and to increase performance on the task, we annotated the Colorado Richly Annotated Full Text (CRAFT) corpus with coreference relations. RESULTS: The corpus was manually annotated with coreference relations, including identity and appositives for all coreferring base noun phrases. The OntoNotes annotation guidelines, with minor adaptations, were used. Interannotator agreement ranges from 0.480 (entity-based CEAF) to 0.858 (Class-B3), depending on the metric that is used to assess it. The resulting corpus adds nearly 30,000 annotations to the previous release of the CRAFT corpus. Differences from related projects include a much broader definition of markables, connection to extensive annotation of several domain-relevant semantic classes, and connection to complete syntactic annotation. Tool performance was benchmarked on the data. A publicly available out-of-the-box, general-domain coreference resolution system achieved an F-measure of 0.14 (B3), while a simple domain-adapted rule-based system achieved an F-measure of 0.42. An ensemble of the two reached an F-measure of 0.46. Following the IDENTITY chains in the data would add 106,263 additional named entities in the full 97-paper corpus, for an increase of 76% in the semantic classes of the eight ontologies that have been annotated in earlier versions of the CRAFT corpus. CONCLUSIONS: The project produced a large data set for further investigation of coreference and coreference resolution in the scientific literature. The work raised issues in the phenomenon of reference in this domain and genre, and the paper proposes that many mentions that would be considered generic in the general domain are not generic in the biomedical domain because they refer to specific classes in domain-specific ontologies. The comparison of the performance of a publicly available and well-understood coreference resolution system with a domain-adapted system produced results that are consistent with the notion that the requirements for successful coreference resolution in this genre are quite different from those of the general domain, and also suggest that the baseline performance difference is quite large.


Subjects
Data Mining/methods , Periodicals as Topic , Semantics
8.
EGEMS (Wash DC) ; 5(1): 12, 2017 Jun 14.
Article in English | MEDLINE | ID: mdl-29930960

ABSTRACT

INTRODUCTION AND BACKGROUND: The US Food and Drug Administration (FDA)'s Manufacturer and User Facility Device Experience (MAUDE) database is a publicly available resource providing over 4 million records relating to medical device safety. Using downloadable MAUDE files avoids limitations of the online MAUDE search interface. However, naive file usage can result in errors, while independent discovery of the nuances required to correctly work with the database can be time-consuming. Practical information is provided to shorten this learning curve and obtain accurate results when using the MAUDE database files. MAUDE FILE DESCRIPTIONS: The MAUDE database consists of 135 fields in four primary (Master Event, Device, Patient, Text) and two supplemental (Device Problems and Problem Code Descriptions) file types. When combined, these six files provide a detailed account of an adverse event or product problem report. Website instructions for joining the files are incomplete. Comprehensive details are provided to enable precise file linking. LESSONS LEARNED: MAUDE files have irregularities that must be understood to download and work with the data efficiently. Accurate results depend upon combining the files correctly and understanding the difference between report and event denominators. Appreciating data availability can facilitate successful MAUDE investigations. CONCLUSION: The MAUDE database can provide key insights about medical device safety. Detailed information is provided about the structure, content and interrelationships of the MAUDE database files to enable investigators to use this valuable resource more quickly and accurately.

9.
Stud Health Technol Inform ; 245: 346-350, 2017.
Article in English | MEDLINE | ID: mdl-29295113

ABSTRACT

Prior knowledge of the distributional characteristics of linguistic phenomena can be useful for a variety of language processing tasks. This paper describes the distribution of negation in two types of biomedical texts: scientific journal articles and progress notes. Two types of negation are examined: explicit negation at the syntactic level and affixal negation at the sub-word level. The data show that the distribution of negation is significantly different in the two document types, with explicit negation more frequent in the clinical documents than in the scientific publications and affixal negation more frequent in the journal articles at both the type and token levels. All code is available on GitHub at https://github.com/KevinBretonnelCohen/NegationDistribution .


Subjects
Linguistics , Natural Language Processing , Data Mining , Electronic Health Records , Humans , Language , Publishing
10.
Stud Health Technol Inform ; 245: 644-648, 2017.
Article in English | MEDLINE | ID: mdl-29295175

ABSTRACT

Semantic relations have been studied for decades without yet reaching consensus on the set of these relations. However, biomedical language processing and ontologies rely on these relations, so it is important to be able to evaluate their suitability. In this paper we examine the role of inter-annotator agreement in choosing between competing proposals regarding the set of such relations. The experiments consisted of labeling the semantic relations between two elements of noun-noun compounds (e.g. cell migration). Two judges annotated a dataset of terms from the biomedical domain using two competing sets of relations and analyzed the inter-annotator agreement. With no training and little documentation, agreement on this task was fairly high and disagreements were consistent. The results support the utility of the relation-based approach to semantic representation.


Subjects
Documentation , Natural Language Processing , Semantics , Health Occupations
11.
J Biomed Semantics ; 7: 52, 2016 Sep 09.
Article in English | MEDLINE | ID: mdl-27613112

ABSTRACT

BACKGROUND: Gene Ontology (GO) terms represent the standard for annotation and representation of molecular functions, biological processes and cellular compartments, but a large gap exists between the way concepts are represented in the ontology and how they are expressed in natural language text. The construction of highly specific GO terms is formulaic, consisting of parts and pieces from simpler terms. RESULTS: We present two different types of manually generated rules to help capture the variation of how GO terms can appear in natural language text. The first set of rules takes into account the compositional nature of GO and recursively decomposes the terms into their smallest constituent parts. The second set of rules generates derivational variations of these smaller terms and compositionally combines all generated variants to form the original term. By applying both types of rules, new synonyms are generated for two-thirds of all GO terms and an increase in F-measure performance for recognition of GO on the CRAFT corpus from 0.498 to 0.636 is observed. Additionally, we evaluated the combination of both types of rules over one million full text documents from Elsevier; manual validation and error analysis show we are able to recognize GO concepts with reasonable accuracy (88%) based on random sampling of annotations. CONCLUSIONS: In this work we present a set of simple synonym generation rules that utilize the highly compositional and formulaic nature of the Gene Ontology concepts. We illustrate how the generated synonyms aid in improving recognition of GO concepts on two different biomedical corpora. We discuss other applications of our rules for GO ontology quality assurance, explore the issue of overgeneration, and provide examples of how similar methodologies could be applied to other biomedical terminologies. Additionally, we provide all generated synonyms for use by the text-mining community.


Subjects
Data Mining/methods , Gene Ontology , Semantics , Natural Language Processing , Pattern Recognition, Automated
12.
LREC Int Conf Lang Resour Eval ; 2016: 2784-2788, 2016 May.
Article in English | MEDLINE | ID: mdl-29568820

ABSTRACT

This paper reports SuperCAT, a corpus analysis toolkit. It is a radical extension of SubCAT, the Sublanguage Corpus Analysis Toolkit, from sublanguage analysis to corpus analysis in general. The idea behind SuperCAT is that representative corpora have no tendency towards closure; that is, they tend towards infinity. In contrast, non-representative corpora have a tendency towards closure (roughly, finiteness). SuperCAT focuses on general techniques for the quantitative description of the characteristics of any corpus (or other language sample), particularly concerning the characteristics of lexical distributions. Additionally, SuperCAT features a complete re-engineering of the previous SubCAT architecture.
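
The notion of closure can be made concrete with a type-growth curve: the number of distinct word types observed as more tokens are read. The sketch below is an illustration only and is not SuperCAT's implementation; the function name `type_growth`, the `step` parameter, and both toy token lists are hypothetical.

```python
def type_growth(tokens, step):
    """Return (tokens_read, distinct_types_seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Record a point at every `step` tokens and at the final token.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve

# Toy data (hypothetical): a sample that keeps reusing three words,
# and a sample in which every token is a new word.
closed_sample = ["a", "b", "c"] * 4
open_sample = [f"w{i}" for i in range(12)]

print(type_growth(closed_sample, 3))  # the curve flattens: tendency towards closure
print(type_growth(open_sample, 3))    # the curve keeps rising: no closure
```

A curve that levels off suggests a lexically closed sample, while one that keeps climbing suggests that new vocabulary would continue to appear with more data.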

13.
LREC Int Conf Lang Resour Eval ; 2016(W23): 6-12, 2016 May.
Article in English | MEDLINE | ID: mdl-29568821

ABSTRACT

There is currently a crisis in science related to highly publicized failures to reproduce large numbers of published studies. The current work proposes, by way of case studies, a methodology for moving the study of reproducibility in computational work to a full stage beyond that of earlier work. Specifically, it presents a case study in attempting to reproduce the reports of two R libraries for doing text mining of the PubMed/MEDLINE repository of scientific publications. The main findings are that a rational paradigm for reproduction of natural language processing papers can be established; the advertised functionality was difficult, but not impossible, to reproduce; and reproducibility studies can produce additional insights into the functioning of the published system. Additionally, the work on reproducibility led to the production of novel user-centered documentation that has been accessed 260 times since its publication, an average of once a day per library.

14.
LREC Int Conf Lang Resour Eval ; 2016(W40): 8-12, 2016 May.
Article in English | MEDLINE | ID: mdl-29568822

ABSTRACT

Ethical issues reported with paid crowdsourcing include unfairly low wages. It is assumed that such issues are under the control of the task requester. Can one control the amount that a worker earns by controlling the amount that one pays? 412 linguistic data development tasks were submitted to Amazon Mechanical Turk. The pay per HIT was manipulated through a range of values. We examined the relationship between the pay that is offered per HIT and the effective pay rate. There is no such relationship. Paying more per HIT does not cause workers to earn more: the higher the pay per HIT, the more time workers spend on them (R = 0.92). So, the effective hourly rate stays roughly the same. The finding has clear implications for language resource builders who want to behave ethically: other means must be found in order to compensate workers fairly. The findings of this paper should not be taken as an endorsement of unfairly low pay rates for crowdsourcing workers. Rather, the intention is to point out that additional measures, such as pre-calculating and communicating to the workers an average hourly, rather than per-task, rate must be found in order to ensure an ethical rate of pay.
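
The distinction between pay per HIT and effective pay rate comes down to simple arithmetic, sketched below. The function name and all dollar and time values are hypothetical and are not data from the study.

```python
def effective_hourly_rate(pay_per_hit, seconds_per_hit):
    """Convert a per-task payment into the hourly rate a worker actually earns."""
    if seconds_per_hit <= 0:
        raise ValueError("seconds_per_hit must be positive")
    return pay_per_hit * 3600 / seconds_per_hit

# Hypothetical values: doubling the pay per HIT has no effect on hourly
# earnings if workers also take twice as long on each HIT.
low = effective_hourly_rate(pay_per_hit=0.05, seconds_per_hit=30)
high = effective_hourly_rate(pay_per_hit=0.10, seconds_per_hit=60)
print(low, high)
```

Under these assumed numbers both conditions work out to the same hourly rate, which is the pattern the abstract describes when time spent rises with pay per HIT.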

15.
CEUR Workshop Proc ; 1609: 28-42, 2016 Sep.
Article in English | MEDLINE | ID: mdl-29308065

ABSTRACT

This paper reports on Task 2 of the 2016 CLEF eHealth evaluation lab, which extended the previous information extraction tasks of ShARe/CLEF eHealth evaluation labs. The task continued with named entity recognition and normalization in French narratives, as offered in CLEF eHealth 2015. Named entity recognition involved ten types of entities including disorders that were defined according to Semantic Groups in the Unified Medical Language System® (UMLS®), which was also used for normalizing the entities. In addition, we introduced a large-scale classification task in French death certificates, which consisted of extracting causes of death as coded in the International Classification of Diseases, tenth revision (ICD10). Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE, 4 drug monographs published by the European Medicines Agency (EMEA) and 27,850 death certificates using Precision, Recall and F-measure. In total, seven teams participated, including five in the entity recognition and normalization task, and five in the death certificate coding task. Three teams submitted their systems to our newly offered reproducibility track. For entity recognition, the highest performance was achieved on the EMEA corpus, with an overall F-measure of 0.702 for plain entity recognition and 0.529 for normalized entity recognition. For entity normalization, the highest performance was achieved on the MEDLINE corpus, with an overall F-measure of 0.552. For death certificate coding, the highest performance was 0.848 F-measure.
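
For readers unfamiliar with the three metrics named above, the following minimal sketch computes precision, recall, and the balanced F-measure (F1) from counts of true positives, false positives, and false negatives. It uses the standard definitions; the task's official scoring tool may differ in its matching rules, and the counts shown are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 correct entities, 2 spurious, 8 missed.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 3), round(r, 3), round(f, 3))
```

Because the F-measure is a harmonic mean, it sits closer to the lower of the two component scores than an arithmetic average would.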

16.
J Biomed Semantics ; 5(1): 5, 2014 Feb 05.
Article in English | MEDLINE | ID: mdl-24495517

ABSTRACT

The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.

17.
BMC Bioinformatics ; 15: 59, 2014 Feb 26.
Article in English | MEDLINE | ID: mdl-24571547

ABSTRACT

BACKGROUND: Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem. RESULTS: Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and best-performing parameters for each system-ontology pair are presented. CONCLUSIONS: Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14 to 0.83). Of the three systems we tested, ConceptMapper is generally the best-performing system; it produces the highest F-measure on seven of the eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters, F-measure can be increased by up to 0.4. In addition to the best-performing parameters, suggestions for choosing parameters based on ontology characteristics are presented.


Subjects
Biological Ontologies , Data Mining/methods , Databases, Factual , Reproducibility of Results
18.
Pac Symp Biocomput ; : 328-39, 2014.
Article in English | MEDLINE | ID: mdl-24297559

ABSTRACT

Identifying genetic variants that affect drug response or play a role in disease is an important task for clinicians and researchers. Before individual variants can be explored efficiently for effect on drug response or disease relationships, specific candidate genes must be identified. While many methods rank candidate genes through the use of sequence features and network topology, only a few exploit the information contained in the biomedical literature. In this work, we train and test a classifier on known pharmacogenes from PharmGKB and present a classifier that predicts pharmacogenes on a genome-wide scale using only Gene Ontology annotations and simple features mined from the biomedical literature. Performance of F=0.86, AUC=0.860 is achieved. The top 10 predicted genes are analyzed. Additionally, a set of enriched pharmacogenic Gene Ontology concepts is produced.


Subjects
Pharmacogenetics/statistics & numerical data , Artificial Intelligence , Computational Biology , Data Mining/statistics & numerical data , Databases, Genetic , Databases, Pharmaceutical , Gene Ontology/statistics & numerical data , Genetic Variation , Humans , Knowledge Bases , Natural Language Processing
19.
LREC Int Conf Lang Resour Eval ; 2014: 1714-1718, 2014 May.
Article in English | MEDLINE | ID: mdl-29568819

ABSTRACT

Sublanguages are varieties of language that form "subsets" of the general language, typically exhibiting particular types of lexical, semantic, and other restrictions and deviance. SubCAT, the Sublanguage Corpus Analysis Toolkit, assesses the representativeness and closure properties of corpora to analyze the extent to which they are either sublanguages or representative samples of the general language. The current version of SubCAT contains scripts and applications for assessing lexical closure, morphological closure, sentence type closure, over-represented words, and syntactic deviance. Its operation is illustrated with three case studies concerning scientific journal articles, patents, and clinical records. Materials from two language families are analyzed: English (Germanic) and Bulgarian (Slavic). The software is available at sublanguage.sourceforge.net under a liberal Open Source license.

20.
Nat Lang Process Inf Syst ; 8455: 33-38, 2014 Jun.
Article in English | MEDLINE | ID: mdl-29780975

ABSTRACT

For many researchers, the purpose of ontologies is sharing data. This sharing is facilitated when ontologies are available in multiple languages, but inhibited when an ontology is only available in a single language. Ontologies should be accessible to people in multiple languages, since multilingualism is inevitable in any scientific work. Due to resource scarcity, most ontologies of the biomedical domain are available only in English at present. We present techniques to translate Gene Ontology terms from English to German using DBPedia, the Google Translate API for isolated terms, and the Google Translate API for terms in sentential context. Average fluency scores for the three methods were 4.0, 4.4, and 4.5, respectively. Average adequacy scores were 4.0, 4.9, and 4.9.
