Results 1 - 20 of 32
1.
PLoS One ; 19(6): e0305475, 2024.
Article in English | MEDLINE | ID: mdl-38870159

ABSTRACT

Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. A growing number of plant molecular information networks provide interlinked, interoperable data to support the discovery of gene-phenotype interactions. A large body of scientific literature and observational data obtained in the field and under controlled conditions documents wheat breeding experiments, and cross-referencing this complementary information is essential. Text from databases and scientific publications was identified early on as a relevant source of information. However, the wide variety of terms used to refer to traits and phenotype values makes the textual information difficult to find and cross-reference; simple dictionary lookup methods, for example, miss relevant terms. Corpora with manually annotated examples are thus needed to evaluate and train textual information extraction methods. While several corpora contain annotations of human and animal phenotypes, no corpus is available for plant traits. This hinders the evaluation of text mining-based crop knowledge graphs (e.g. AgroLD, KnetMiner, WheatIS-FAIDARE) and limits the ability to train machine learning methods and improve the quality of the extracted information. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 528 PubMed references fully annotated for traits, phenotypes, and species. We address the interoperability challenge of crossing sparse assay data and publications by using the Wheat Trait and Phenotype Ontology to normalize trait mentions and the species taxonomy of the National Center for Biotechnology Information to normalize species. The paper describes the construction of the corpus. A study of state-of-the-art language models trained on the corpus for both named entity recognition and entity linking shows that the corpus is suitable for training and evaluation. It is currently the most comprehensive manually annotated corpus for natural language processing studies on crop phenotype information from the literature.
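To make the evaluation setting concrete, here is a minimal sketch of exact-match span scoring for named entity recognition, assuming gold and predicted entities are represented as (start, end, type) offset tuples; the tuple encoding and the example spans are our assumptions, not the corpus's actual format.

```python
# Minimal sketch: exact-match span evaluation for NER, assuming gold and
# predicted entities are (start_offset, end_offset, entity_type) tuples.
# The example spans below are illustrative, not taken from the corpus.

def span_f1(gold: set, predicted: set) -> dict:
    """Exact-match precision/recall/F1 over annotated entity spans."""
    tp = len(gold & predicted)           # spans with correct offsets and type
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

gold = {(0, 12, "Trait"), (30, 38, "Species")}
predicted = {(0, 12, "Trait"), (50, 58, "Phenotype")}
print(span_f1(gold, predicted))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```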


Subject(s)
Data Mining, Phenotype, Plant Breeding, Triticum, Triticum/genetics, Plant Breeding/methods, Data Mining/methods
2.
Data Brief ; 54: 110404, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38665156

ABSTRACT

There is growing interest in milk oligosaccharides (MOs) because of their numerous benefits for newborn and long-term health. A large number of MO structures have been identified in mammalian milk. Although mostly described in human milk, this oligosaccharide richness, while less broad, has also been reported for a wide range of mammalian species. The structure of MOs is particularly difficult to report because it results from combinations of 5 monosaccharides linked by various glycosidic bonds, forming structurally diverse and complex linear and branched oligosaccharides. Exploring the literature and extracting relevant information on MO diversity within or across species is a promising way to elucidate the structure-function roles of MOs. Given the complexity of these molecules, however, the main obstacle to such extraction is the heterogeneity in the way authors refer to them. Herein, we provide a thesaurus (MilkOligoThesaurus) including the names and synonyms of MOs collected from key selected articles on mammalian milk analyses. MilkOligoThesaurus gathers the names of the MOs with a complete description of their monosaccharide compositions and structures. When available, each unique MO molecule is linked to its ID in the NCBI PubChem and ChEBI databases. MilkOligoThesaurus is provided in a tabular format. It gathers 245 unique oligosaccharide structures described by 22 features (columns), including the name of the molecule, its abbreviation, the chemical database IDs if available, the monosaccharide composition, chemical information (molecular formula, monoisotopic mass), synonyms, its formula in condensed and abbreviated condensed form, the abbreviated systematic name, the systematic name, the isomer group, and the scientific article sources. MilkOligoThesaurus is also provided in SKOS (Simple Knowledge Organization System) format. This thesaurus is a valuable resource, gathering MO naming variations not found elsewhere, for (i) text and data mining, to enable automatic annotation and rapid extraction of milk oligosaccharide data from scientific papers; and (ii) biology researchers aiming to search for or decipher the structure of milk oligosaccharides based on any of their names, abbreviations, or monosaccharide compositions and linkages.
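Since the thesaurus is distributed in SKOS, it can be queried with a standard RDF library. Below is a minimal sketch using rdflib, assuming a local copy of the file; the filename and Turtle serialization are assumptions, not the published distribution details.

```python
# Minimal sketch, assuming MilkOligoThesaurus is available locally as a SKOS
# file (filename and serialization are assumptions). Looks up the concept(s)
# whose preferred or alternative label matches a queried oligosaccharide name.
from rdflib import Graph
from rdflib.namespace import SKOS

g = Graph()
g.parse("MilkOligoThesaurus.ttl", format="turtle")  # hypothetical local copy

def find_concepts(name: str):
    """Return concept URIs whose skos:prefLabel or skos:altLabel matches."""
    query = name.strip().lower()
    hits = set()
    for predicate in (SKOS.prefLabel, SKOS.altLabel):
        for concept, _, label in g.triples((None, predicate, None)):
            if str(label).lower() == query:
                hits.add(concept)
    return hits

for uri in find_concepts("2'-fucosyllactose"):
    print(uri, g.value(uri, SKOS.prefLabel))
```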

3.
PLoS One ; 18(1): e0272473, 2023.
Article in English | MEDLINE | ID: mdl-36662691

ABSTRACT

The dramatic increase in the number of microbe descriptions in databases, reports, and papers presents a two-fold challenge for accessing the information: integrating heterogeneous data into a standard ontology-based representation and normalizing the textual descriptions by semantic analysis. Recent text mining methods offer powerful ways to extract textual information and generate ontology-based representations. This paper describes the design of the Omnicrobe application, which gathers comprehensive information on the habitats, phenotypes, and usages of microbes from scientific sources of high interest to the microbiology community. The Omnicrobe database contains around 1 million descriptions of microbe properties. These descriptions are created by analyzing and combining six information sources of various kinds, i.e. biological resource catalogs, sequence databases, and the scientific literature. The microbe properties are indexed by the OntoBiotope ontology and their taxa are indexed by an extended version of the taxonomy maintained by the National Center for Biotechnology Information. The Omnicrobe application covers all domains of microbiology. With simple or rich ontology-based queries, it provides easy-to-use support for resolving scientific questions related to the habitats, phenotypes, and uses of microbes. We illustrate the potential of Omnicrobe with a use case from the food innovation domain.
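To illustrate what an ontology-based query buys over plain keyword search, here is a minimal sketch in which a query concept is expanded to its descendants before matching habitat annotations; the toy hierarchy, concept names, and records are invented, not OntoBiotope or Omnicrobe data.

```python
# Minimal sketch of an ontology-based habitat query, with a toy is-a fragment
# standing in for OntoBiotope (concepts and records are illustrative only).

TOY_ONTOLOGY = {  # parent concept -> direct children
    "food": ["dairy product", "fermented food"],
    "dairy product": ["cheese"],
    "fermented food": ["cheese"],  # multiple parents are allowed
    "cheese": [],
}

RECORDS = [  # (taxon, habitat concept) pairs as produced by text mining
    ("Lactococcus lactis", "cheese"),
    ("Penicillium roqueforti", "dairy product"),
    ("Escherichia coli", "human gut"),
]

def descendants(concept: str) -> set:
    """Concept plus everything below it in the is-a hierarchy."""
    found, stack = {concept}, [concept]
    while stack:
        for child in TOY_ONTOLOGY.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

# A query for "dairy product" also returns taxa annotated with "cheese".
wanted = descendants("dairy product")
print([taxon for taxon, habitat in RECORDS if habitat in wanted])
```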


Subject(s)
Data Mining, Ecosystem, Data Mining/methods, Databases, Factual, Publications, Phenotype
4.
Database (Oxford) ; 2022, 2022 08 25.
Article in English | MEDLINE | ID: mdl-36006843

ABSTRACT

Collecting relations between chemicals and drugs is crucial in biomedical research. Pre-trained transformer models such as Bidirectional Encoder Representations from Transformers (BERT) have shown limitations on biomedical texts; more specifically, the lack of annotated data makes relation extraction (RE) from biomedical texts very challenging. In this paper, we hypothesize that enriching a pre-trained transformer model with syntactic information may help improve its performance on chemical-drug RE tasks. For this purpose, we propose three syntax-enhanced models based on the domain-specific BioBERT model: Chunking-Enhanced-BioBERT and Constituency-Tree-BioBERT, in which constituency information is integrated, and Multi-Task-Syntactic (MTS)-BioBERT, a multi-task learning framework in which syntactic information is injected implicitly by adding syntax-related tasks as training objectives. In addition, we test an existing model, Late-Fusion, which is enhanced with syntactic dependency information, and build ensemble systems combining syntax-enhanced and non-syntax-enhanced models. Experiments are conducted on the BioCreative VII DrugProt corpus, a manually annotated corpus for the development and evaluation of RE systems. Our results reveal that syntax-enhanced models generally degrade the performance of BioBERT on biomedical RE but improve it when the subject-object distance of the candidate semantic relation is long. We also explore the impact of the quality of dependency parses. Our code is available at https://github.com/Maple177/syntax-enhanced-RE/tree/drugprot (MTS-BioBERT only) and https://github.com/Maple177/drugprot-relation-extraction (all other experiments). Database URL: https://github.com/Maple177/drugprot-relation-extraction
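The observation that syntax helps mainly for long subject-object distances suggests a simple routing heuristic between models. The sketch below is our illustration of that idea, not the paper's ensemble method; the toy dependency encoding (each token index mapped to its head's index) and the threshold are assumptions.

```python
# Illustrative sketch (not the paper's method): route a candidate relation to
# a syntax-enhanced model only when the subject-object path in the dependency
# tree is long. The toy tree maps each token index to its head's index.

def tree_distance(heads: list, i: int, j: int) -> int:
    """Number of dependency edges between tokens i and j."""
    def path_to_root(k):
        path = [k]
        while heads[k] != k:          # the root points to itself
            k = heads[k]
            path.append(k)
        return path
    pi, pj = path_to_root(i), path_to_root(j)
    shared = set(pi) & set(pj)
    # distance = steps from i and j down to their lowest common ancestor
    return min(pi.index(a) + pj.index(a) for a in shared)

def choose_model(heads, subj, obj, threshold=4):
    return ("syntax-enhanced" if tree_distance(heads, subj, obj) > threshold
            else "plain BioBERT")

# "Aspirin(0) inhibits(1) the(2) enzyme(3)": tokens 0 and 3 attach to the verb.
print(choose_model([1, 1, 3, 1], 0, 3))  # short path -> plain BioBERT
```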


Subject(s)
Biomedical Research, Data Mining, Data Mining/methods, Databases, Factual, Natural Language Processing, Semantics
5.
BMC Bioinformatics ; 21(Suppl 23): 579, 2020 Dec 29.
Article in English | MEDLINE | ID: mdl-33372606

ABSTRACT

BACKGROUND: Entity normalization is an important information extraction task which has gained renewed attention in the last decade, particularly in the biomedical and life science domains. In these domains, and more generally in all specialized domains, this task is still challenging for the latest machine learning-based approaches, which have difficulty handling highly multi-class and few-shot learning problems. To address this issue, we propose C-Norm, a new neural approach which synergistically combines standard and weak supervision, ontological knowledge integration and distributional semantics. RESULTS: Our approach greatly outperforms all methods evaluated on the Bacteria Biotope datasets of BioNLP Open Shared Tasks 2019, without integrating any manually-designed domain-specific rules. CONCLUSIONS: Our results show that relatively shallow neural network methods can perform well in domains that present highly multi-class and few-shot learning problems.
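One core ingredient of such approaches can be sketched compactly: link a mention to the ontology concept whose label embedding is most similar under cosine similarity. The vectors below are random stand-ins for trained embeddings, and this is a generic illustration, not the actual C-Norm architecture.

```python
# Minimal sketch of distributional normalization, assuming we already have
# dense vectors for a mention and for each ontology concept label (the
# vectors below are random stand-ins, not trained embeddings).
import numpy as np

rng = np.random.default_rng(0)
concept_labels = ["rumen", "fermented milk", "hospital environment"]
concept_vecs = rng.normal(size=(len(concept_labels), 50))
mention_vec = rng.normal(size=50)

def normalize(mention, concepts, labels):
    """Link a mention to the concept with the highest cosine similarity."""
    sims = concepts @ mention / (
        np.linalg.norm(concepts, axis=1) * np.linalg.norm(mention))
    best = int(np.argmax(sims))
    return labels[best], float(sims[best])

print(normalize(mention_vec, concept_vecs, concept_labels))
```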


Subject(s)
Algorithms, Neural Networks, Computer, Bacteria/metabolism, Confidence Intervals, Databases as Topic, Ecosystem, Humans, Knowledge, Machine Learning, Phenotype
6.
Food Microbiol ; 81: 63-75, 2019 Aug.
Article in English | MEDLINE | ID: mdl-30910089

ABSTRACT

Information on food microbial diversity is scattered across millions of scientific papers, and researchers need tools to assist their bibliographic searches in such large collections. Text mining and knowledge engineering methods are useful for automatically and efficiently finding relevant information in the life sciences. This work describes how the Alvis text mining platform has been applied to a large collection of PubMed abstracts of scientific papers in the food microbiology domain. Our work targets information about microorganisms, their habitats, and their phenotypes. Two knowledge resources, the NCBI taxonomy and the OntoBiotope ontology, were used to detect this information in texts. The result of the text mining process was indexed and is presented through the AlvisIR Food on-line semantic search engine. We also show, through two illustrative examples, the great potential of this new tool to assist studies on ecological diversity and on the origin of microbial presence in food.
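A minimal sketch of the dictionary-based detection step, with an invented lexicon mapping surface forms to OntoBiotope-style concept IDs; it uses greedy longest match with no word-boundary handling, whereas real pipelines like Alvis are far more sophisticated.

```python
# Minimal sketch of dictionary-based habitat detection. The lexicon and the
# OntoBiotope-style IDs are made up for illustration.
LEXICON = {
    "raw milk": "OBT:001",
    "milk": "OBT:002",
    "soft cheese": "OBT:003",
}

def tag(text: str):
    """Greedy longest-match lookup; returns (start, end, surface, concept).
    No word-boundary handling in this sketch."""
    lowered, spans = text.lower(), []
    terms = sorted(LEXICON, key=len, reverse=True)   # prefer longer terms
    i = 0
    while i < len(lowered):
        for term in terms:
            if lowered.startswith(term, i):
                spans.append((i, i + len(term), term, LEXICON[term]))
                i += len(term) - 1
                break
        i += 1
    return spans

print(tag("Listeria was isolated from raw milk and soft cheese."))
```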


Subject(s)
Biodiversity, Computational Biology/methods, Data Mining/methods, Food Microbiology, Algorithms, Biological Ontologies, Databases, Bibliographic, Databases, Factual, Ecosystem, Humans, Information Services, Information Storage and Retrieval, Internet, Literature, MEDLINE, National Library of Medicine (U.S.), Phenotype, Phylogeny, PubMed, Software, United States
7.
J Biomed Semantics ; 8(1): 37, 2017 Sep 11.
Article in English | MEDLINE | ID: mdl-28893314

ABSTRACT

BACKGROUND: Knowledge representation frameworks are essential to the understanding of complex biomedical processes and to the analysis of biomedical texts that describe them. Combined with natural language processing (NLP), they have the potential to contribute to retrospective studies by unlocking important phenotyping information contained in the narrative content of electronic health records (EHRs). This work aims to develop an extensive information representation scheme for clinical information contained in EHR narratives and to support secondary use of EHR narrative data to answer clinical questions. METHODS: We review recent work that proposed information representation schemes and applied them to the analysis of clinical narratives. We then propose a unifying scheme that supports the extraction of information to address a large variety of clinical questions. RESULTS: We devised a new information representation scheme for clinical narratives that comprises 13 entities, 11 attributes, and 37 relations. The associated annotation guidelines can be used to apply the scheme consistently to clinical narratives and are available at https://cabernet.limsi.fr/annotation_guide_for_the_merlot_french_clinical_corpus-Sept2016.pdf . CONCLUSION: The information scheme includes many elements of the major schemes described in the clinical natural language processing literature, as well as a uniquely detailed set of relations.


Subject(s)
Biological Ontologies, Data Mining/methods, Electronic Health Records, Natural Language Processing, Humans
8.
Stud Health Technol Inform ; 221: 59-63, 2016.
Article in English | MEDLINE | ID: mdl-27071877

ABSTRACT

The number of patients who benefit from remote monitoring of cardiac implantable electronic devices, such as pacemakers and defibrillators, is growing rapidly. Consequently, the huge number of alerts generated and transmitted to physicians is challenging to handle. We have developed a system based on a formal ontology that integrates alert information with patient data extracted from the electronic health record in order to better classify the importance of alerts. A pilot study was conducted on atrial fibrillation alerts. We show some examples of alert processing. The results suggest that this approach has the potential to significantly reduce the alert burden in telecardiology. The methods may be extended to other types of connected devices.


Subject(s)
Atrial Fibrillation/diagnosis, Clinical Alarms, Decision Support Systems, Clinical/organization & administration, Electrocardiography, Ambulatory/methods, Electronic Health Records/organization & administration, Telemedicine/methods, Atrial Fibrillation/prevention & control, Biological Ontologies, Defibrillators, Implantable, Diagnosis, Computer-Assisted/methods, Humans, Natural Language Processing, Pacemaker, Artificial, Pilot Projects, Reproducibility of Results, Sensitivity and Specificity, Therapy, Computer-Assisted/methods
9.
Europace ; 18(3): 347-52, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26487670

ABSTRACT

AIMS: Remote monitoring of cardiac implantable electronic devices is a growing standard; yet remote follow-up and management of alerts represent a time-consuming task for physicians or trained staff. This study evaluates an automatic mechanism based on artificial intelligence tools to filter atrial fibrillation (AF) alerts based on their medical significance. METHODS AND RESULTS: We evaluated this method on alerts for AF episodes that occurred in 60 pacemaker recipients. The AKENATON prototype workflow includes two steps: natural language processing algorithms abstract the patient health record to a digital version, then a knowledge-based algorithm based on an applied formal ontology calculates the CHA2DS2-VASc score and evaluates the anticoagulation status of the patient. Each alert is then automatically classified by importance, from low to critical, by mimicking medical reasoning. The final classification was compared with expert analysis by two physicians. A total of 1783 alerts about AF episodes >5 min in 60 patients were processed. Of these, 1749 (98%) were adequately classified, and there was no underestimation of alert importance among the remaining 34 misclassified alerts. CONCLUSION: This work demonstrates the ability of a pilot system to classify alerts and improve personalized remote monitoring of patients. In particular, our method allows integration of the patient's medical history with device alert notifications, which is useful from both medical and resource-management perspectives. The system automatically classified the importance of 1783 AF alerts in 60 patients, which resulted in an 84% reduction in notification workload while preserving patient safety.
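Once the criteria are extracted from the record, the CHA2DS2-VASc computation itself is deterministic. Below is a straightforward implementation of the standard clinical definition, not the AKENATON code.

```python
# CHA2DS2-VASc score from extracted criteria (standard clinical definition).

def cha2ds2_vasc(age: int, female: bool, chf: bool, hypertension: bool,
                 diabetes: bool, stroke_or_tia: bool, vascular: bool) -> int:
    score = 0
    score += 1 if chf else 0                              # C: congestive heart failure
    score += 1 if hypertension else 0                     # H: hypertension
    score += 2 if age >= 75 else (1 if age >= 65 else 0)  # A2 / A: age bands
    score += 1 if diabetes else 0                         # D: diabetes mellitus
    score += 2 if stroke_or_tia else 0                    # S2: prior stroke/TIA/embolism
    score += 1 if vascular else 0                         # V: vascular disease
    score += 1 if female else 0                           # Sc: sex category (female)
    return score

# Example: 78-year-old woman with hypertension -> 2 (age) + 1 (HTN) + 1 (sex) = 4
print(cha2ds2_vasc(78, True, False, True, False, False, False))  # 4
```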


Subject(s)
Atrial Fibrillation/diagnosis, Electrocardiography/instrumentation, Heart Conduction System/physiopathology, Heart Rate, Pacemaker, Artificial, Telemetry/instrumentation, Action Potentials, Algorithms, Anticoagulants/therapeutic use, Artificial Intelligence, Atrial Fibrillation/physiopathology, Atrial Fibrillation/therapy, Automation, Decision Support Techniques, France, Humans, Pilot Projects, Predictive Value of Tests, Reproducibility of Results, Retrospective Studies, Risk Assessment, Signal Processing, Computer-Assisted, Workflow, Workload
10.
BMC Med Inform Decis Mak ; 15: 28, 2015 Apr 14.
Article in English | MEDLINE | ID: mdl-25881112

ABSTRACT

BACKGROUND: Manual eligibility screening (ES) for a clinical trial typically requires a labor-intensive review of patient records that utilizes many resources. Leveraging state-of-the-art natural language processing (NLP) and information extraction (IE) technologies, we sought to improve the efficiency of physician decision-making in clinical trial enrollment. To markedly reduce the pool of potential candidates for staff screening, we developed an automated ES algorithm to identify patients who meet the core eligibility characteristics of an oncology clinical trial. METHODS: We collected narrative eligibility criteria from ClinicalTrials.gov for 55 clinical trials actively enrolling oncology patients in our institution between 12/01/2009 and 10/31/2011. In parallel, our ES algorithm extracted clinical and demographic information from Electronic Health Record (EHR) data fields to build profiles of all 215 oncology patients admitted to cancer treatment during the same period. The automated ES algorithm then matched the trial criteria with the patient profiles to identify potential trial-patient matches. Matching performance was validated on a reference set of 169 historical trial-patient enrollment decisions, and workload, precision, recall, negative predictive value (NPV), and specificity were calculated. RESULTS: Without automation, an oncologist would need to review 163 patients per trial on average to replicate the historical patient enrollment for each trial. This workload is reduced by 85% to 24 patients when using automated ES (precision/recall/NPV/specificity: 12.6%/100.0%/100.0%/89.9%). Without automation, an oncologist would need to review 42 trials per patient on average to replicate the patient-trial matches that occur in the retrospective data set. With automated ES, this workload is reduced by 90% to four trials (precision/recall/NPV/specificity: 35.7%/100.0%/100.0%/95.5%). CONCLUSION: By leveraging NLP and IE technologies, automated ES could dramatically increase the trial screening efficiency of oncologists and enable participation of small practices, which are often left out of trial enrollment. The algorithm has the potential to significantly reduce the effort of executing clinical research at a time when new initiatives of the cancer care community intend to greatly expand both access to trials and the number of available trials.
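For reference, the reported metrics all derive from the confusion counts of the screening decision. A minimal sketch follows; the counts below are placeholders, not the study's data.

```python
# Minimal sketch of the screening metrics computed from confusion counts
# (tp/fp/tn/fn values below are placeholders, not the study's data).

def screening_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # sensitivity
    npv = tn / (tn + fn)                     # negative predictive value
    specificity = tn / (tn + fp)
    flagged = tp + fp                        # patients the staff must review
    total = tp + fp + tn + fn
    workload_reduction = 1 - flagged / total
    return precision, recall, npv, specificity, workload_reduction

print(screening_metrics(tp=3, fp=21, tn=187, fn=0))
```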


Subject(s)
Clinical Trials as Topic/methods, Eligibility Determination/methods, Information Storage and Retrieval/methods, Natural Language Processing, Neoplasms/therapy, Patient Selection, Child, Humans
11.
J Biomed Inform ; 50: 173-183, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24556292

ABSTRACT

OBJECTIVE: The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset with realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus under a Data Use Agreement, we would like to encourage other computational linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifications required by the Institutional Review Board before sharing the de-identification gold standard corpus; (2) our efforts to keep the PHI as realistic as possible; and (3) the tests showing the effectiveness of these efforts in preserving the value of the modified data set for machine learning model development. MATERIALS AND METHODS: In a previous study we built an original de-identification gold standard corpus annotated with true PHI from 3503 randomly selected clinical notes covering the 22 most frequent clinical note types of our institution. In the current study we modified the original gold standard corpus to make it suitable for external sharing by replacing HIPAA-specified PHI with newly generated realistic PHI. Finally, we evaluated the research value of this new dataset by comparing the performance of an existing published in-house de-identification system trained on the new gold standard corpus with the performance of the same system trained on the original corpus. We assessed the potential benefits of using the new de-identification gold standard corpus to identify PHI in the i2b2 and PhysioNet datasets that were released by other groups for de-identification research. We also measured the effectiveness of the i2b2 and PhysioNet de-identification gold standard corpora in identifying PHI in our original clinical notes. RESULTS: Performance of the de-identification system using the new gold standard corpus as a training set was very close to that obtained by training on the original corpus (92.56 vs. 93.48 overall F-measure). The best i2b2/PhysioNet/CCHMC cross-training performances were obtained when training on the new shared CCHMC gold standard corpus, although performances were still lower than corpus-specific training. DISCUSSION AND CONCLUSION: We successfully modified a de-identification dataset for external sharing while preserving its research value, with only a limited drop in machine learning de-identification performance.
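A minimal sketch of the surrogate-substitution idea: replace each annotated PHI span with a realistic surrogate while recomputing the annotation offsets so the corpus remains usable for training. The note text, spans, and surrogate pools are invented for illustration.

```python
# Minimal sketch: replace annotated PHI spans with realistic surrogates while
# keeping downstream annotation offsets consistent. All values are invented.
import random

SURROGATES = {"NAME": ["Jane Harper", "Tom Ellis"], "DATE": ["03/14/2010"]}

def resynthesize(text: str, phi_spans):
    """phi_spans: list of non-overlapping (start, end, type) tuples."""
    rng = random.Random(42)                  # fixed seed for reproducibility
    out, new_spans, cursor, shift = [], [], 0, 0
    for start, end, phi_type in sorted(phi_spans):
        out.append(text[cursor:start])
        surrogate = rng.choice(SURROGATES[phi_type])
        new_spans.append((start + shift, start + shift + len(surrogate), phi_type))
        shift += len(surrogate) - (end - start)
        out.append(surrogate)
        cursor = end
    out.append(text[cursor:])
    return "".join(out), new_spans

note = "Seen by Dr. John Smith on 01/02/2003 for follow-up."
print(resynthesize(note, [(12, 22, "NAME"), (26, 36, "DATE")]))
```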


Subject(s)
Medical Informatics, Computer Security, Electronic Health Records, Health Insurance Portability and Accountability Act, United States
12.
J Am Med Inform Assoc ; 21(3): 406-13, 2014.
Article in English | MEDLINE | ID: mdl-24001514

ABSTRACT

OBJECTIVE: To present a series of experiments: (1) to evaluate the impact of pre-annotation on the speed of manual annotation of clinical trial announcements; and (2) to test for potential bias if pre-annotation is utilized. METHODS: To build the gold standard, 1400 clinical trial announcements from the clinicaltrials.gov website were randomly selected and double annotated for diagnoses, signs, symptoms, Unified Medical Language System (UMLS) Concept Unique Identifiers, and SNOMED CT codes. We used two dictionary-based methods to pre-annotate the text. We evaluated annotation time and potential bias through F-measures and ANOVA tests, applying Bonferroni correction. RESULTS: Time savings ranged from 13.85% to 21.5% per entity. Inter-annotator agreement (IAA) ranged from 93.4% to 95.5%. There was no statistically significant difference in IAA or annotator performance with pre-annotation. CONCLUSIONS: In every experiment pair, the annotator with the pre-annotated text needed less time than the annotator with unlabeled text, and the time savings were statistically significant. Moreover, pre-annotation did not reduce IAA or annotator performance. Dictionary-based pre-annotation is a feasible and practical method to reduce the cost of annotation for clinical named entity recognition in the eligibility sections of clinical trial announcements without introducing bias into the annotation process.


Subject(s)
Clinical Trials as Topic, Information Storage and Retrieval/methods, Natural Language Processing, Analysis of Variance, Humans, Time and Motion Studies
13.
J Am Med Inform Assoc ; 20(e2): e212-20, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24130231

ABSTRACT

OBJECTIVE: To evaluate a proposed natural language processing (NLP) and machine-learning based automated method to risk stratify abdominal pain patients by analyzing the content of the electronic health record (EHR). METHODS: We analyzed the EHRs of a random sample of 2100 pediatric emergency department (ED) patients with abdominal pain, including all with a final diagnosis of appendicitis. We developed an automated system to extract relevant elements from ED physician notes and lab values and to automatically assign a risk category for acute appendicitis (high, equivocal, or low), based on the Pediatric Appendicitis Score. We evaluated the performance of the system against a manually created gold standard (chart reviews by ED physicians) for recall, specificity, and precision. RESULTS: The system achieved an average F-measure of 0.867 (0.869 recall and 0.863 precision) for risk classification, which was comparable to physician experts. Recall/precision were 0.897/0.952 in the low-risk category, 0.855/0.886 in the high-risk category, and 0.854/0.766 in the equivocal-risk category. The information that the system required as input to achieve high F-measure was available within the first 4 h of the ED visit. CONCLUSIONS: Automated appendicitis risk categorization based on EHR content, including information from clinical notes, shows comparable performance to physician chart reviewers as measured by their inter-annotator agreement and represents a promising new approach for computerized decision support to promote application of evidence-based medicine at the point of care.


Subject(s)
Abdominal Pain/etiology, Algorithms, Appendicitis/diagnosis, Electronic Health Records, Natural Language Processing, Artificial Intelligence, Child, Emergency Service, Hospital, Humans, Risk Assessment/methods
14.
Stud Health Technol Inform ; 192: 677-81, 2013.
Article in English | MEDLINE | ID: mdl-23920642

ABSTRACT

We translated an existing English negation lexicon (NegEx) to Swedish, French, and German and compared the lexicons on corpora from each language. We observed Zipf's law in all languages: a few phrases occur very frequently, while a large number of phrases occur only rarely. The negation triggers "no" and "not" were common to all languages; other triggers, however, varied considerably. The lexicon is available in OWL and RDF format and can be extended to other languages. We discuss the challenges of translating negation triggers to other languages and the issues in representing multilingual lexical knowledge.
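A minimal sketch of how such a multilingual trigger lexicon can drive NegEx-style negation detection; the tiny trigger lists and the fixed token window are simplifications of the real lexicon and of NegEx's scope rules.

```python
# Minimal sketch of NegEx-style trigger matching with a tiny multilingual
# lexicon (a handful of triggers; the real lexicon is far larger).
TRIGGERS = {
    "en": ["no", "not", "denies"],
    "sv": ["ej", "inte"],
    "fr": ["pas de", "aucun"],
    "de": ["kein", "nicht"],
}

def is_negated(sentence: str, concept: str, lang: str, window: int = 5) -> bool:
    """True if a negation trigger occurs within `window` tokens before the
    concept mention (a rough simplification of NegEx's scope rules)."""
    tokens = sentence.lower().split()
    c_toks = concept.lower().split()
    for i in range(len(tokens) - len(c_toks) + 1):
        if tokens[i:i + len(c_toks)] == c_toks:
            before = " ".join(tokens[max(0, i - window):i])
            # substring matching is approximate (e.g. "no" also hits "nose")
            return any(trigger in before for trigger in TRIGGERS[lang])
    return False

print(is_negated("Patient denies chest pain", "chest pain", "en"))  # True
print(is_negated("Pas de fièvre ce matin", "fièvre", "fr"))         # True
```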


Subject(s)
Artificial Intelligence, Medical Records Systems, Computerized, Natural Language Processing, Semantics, Terminology as Topic, Translating, Vocabulary, Controlled, France, Germany, Sweden, United States
15.
J Med Internet Res ; 15(4): e73, 2013 Apr 02.
Article in English | MEDLINE | ID: mdl-23548263

ABSTRACT

BACKGROUND: A high-quality gold standard is vital for supervised, machine learning-based, clinical natural language processing (NLP) systems. In clinical NLP projects, expert annotators traditionally create the gold standard. However, traditional annotation is expensive and time-consuming. To reduce the cost of annotation, general NLP projects have turned to crowdsourcing based on Web 2.0 technology, which involves submitting smaller subtasks to a coordinated marketplace of workers on the Internet. Many studies have been conducted in the area of crowdsourcing, but only a few have focused on tasks in the general NLP field and only a handful in the biomedical domain, usually based upon very small pilot sample sizes. In addition, the quality of crowdsourced biomedical NLP corpora was never exceptional when compared to traditionally developed gold standards. Previously reported results on a medical named entity annotation task showed a 0.68 F-measure-based agreement between crowdsourced and traditionally developed corpora. OBJECTIVE: Building upon previous work from general crowdsourcing research, this study investigated the usability of crowdsourcing in the clinical NLP domain, with special emphasis on achieving high agreement between crowdsourced and traditionally developed corpora. METHODS: To build the gold standard for evaluating the crowdsourcing workers' performance, 1042 clinical trial announcements (CTAs) from the ClinicalTrials.gov website were randomly selected and double annotated for medication names, medication types, and linked attributes. For the experiments, we used CrowdFlower, an Amazon Mechanical Turk-based crowdsourcing platform. We calculated sensitivity, precision, and F-measure to evaluate the quality of the crowd's work and tested for statistical significance (P<.001, chi-square test) to detect differences between the crowdsourced and traditionally developed annotations. RESULTS: The agreement between the crowd's annotations and the traditionally generated corpora was high for (1) annotations (0.87 F-measure for medication names; 0.73 for medication types), (2) correction of previous annotations (0.90 for medication names; 0.76 for medication types), and excellent for (3) linking medications with their attributes (0.96). Simple voting provided the best judgment aggregation approach. There was no statistically significant difference between the crowd-generated and traditionally generated corpora. Our results showed a 27.9% improvement over previously reported results on the medication named entity annotation task. CONCLUSIONS: This study offers three contributions. First, we show that crowdsourcing is a feasible, inexpensive, fast, and practical approach for collecting high-quality annotations for clinical text (when protected health information is excluded). We believe that well-designed user interfaces and a rigorous quality control strategy for entity annotation and linking were critical to the success of this work. Second, as a further contribution to the Internet-based crowdsourcing field, we will publicly release the JavaScript and CrowdFlower Markup Language infrastructure code that is necessary to utilize CrowdFlower's quality control and crowdsourcing interfaces for named entity annotations. Finally, to spur future research, we will release the CTA annotations generated by both the traditional and crowdsourced approaches.
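The winning aggregation strategy, simple voting, is easy to state precisely. A minimal sketch with invented worker judgments:

```python
# Minimal sketch of simple-voting aggregation: each worker proposes a label
# for a candidate unit; the majority label wins (judgments are invented).
from collections import Counter

def aggregate(judgments):
    """judgments: list of labels from different workers for one unit."""
    label, count = Counter(judgments).most_common(1)[0]
    return label, count / len(judgments)     # winning label and its support

print(aggregate(["medication", "medication", "dosage"]))
# ('medication', 0.666...)
```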


Subject(s)
Crowdsourcing/standards, Natural Language Processing, Social Media, Telemedicine/standards, Clinical Trials as Topic/statistics & numerical data, Crowdsourcing/statistics & numerical data, Humans, Internet, Pilot Projects, Quality Control, Telemedicine/statistics & numerical data
16.
BMC Med Inform Decis Mak ; 13: 53, 2013 Apr 24.
Article in English | MEDLINE | ID: mdl-23617267

ABSTRACT

BACKGROUND: Cincinnati Children's Hospital Medical Center (CCHMC) has built the initial Natural Language Processing (NLP) component to extract medications with their corresponding medical conditions (Indications, Contraindications, Overdosage, and Adverse Reactions) as triples of medication-related information ([(1) drug name]-[(2) medical condition]-[(3) LOINC section header]) for an intelligent database system, in order to improve patient safety and the quality of health care. The Food and Drug Administration's (FDA) drug labels are used to demonstrate the feasibility of building the triples as an intelligent database system task. METHODS: This paper discusses a hybrid NLP system, called AutoMCExtractor, that collects medical conditions (including diseases/disorders and signs/symptoms) from drug labels published by the FDA. Altogether, 6611 medical conditions in a manually annotated gold standard were used for the system evaluation. The pre-processing step extracted the plain text from the XML files and detected the eight related LOINC sections (e.g. Adverse Reactions, Warnings and Precautions) for medical condition extraction. Conditional Random Field (CRF) classifiers, trained on token, linguistic, and semantic features, were then used for medical condition extraction. Lastly, dictionary-based post-processing corrected boundary-detection errors of the CRF step. We evaluated AutoMCExtractor on manually annotated FDA drug labels and report the results at both the token and span levels. RESULTS: Precision, recall, and F-measure were 0.90, 0.81, and 0.85, respectively, for span-level exact match; for the token-level evaluation, precision, recall, and F-measure were 0.92, 0.73, and 0.82, respectively. CONCLUSIONS: The results demonstrate that (1) medical conditions can be extracted from FDA drug labels with high performance and (2) it is feasible to develop a framework for an intelligent database system.
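A minimal sketch of CRF training on token-level features using the sklearn-crfsuite interface; the feature template and the toy example are our assumptions, not AutoMCExtractor's actual feature set.

```python
# Minimal sketch of a CRF tagger with token/linguistic features, in the style
# described above (feature template and toy example are invented).
import sklearn_crfsuite

def token_features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        "suffix3": tok[-3:],
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sentence = "may cause severe hepatic necrosis".split()
labels = ["O", "O", "B-COND", "I-COND", "I-COND"]   # invented gold labels
X = [[token_features(sentence, i) for i in range(len(sentence))]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])
```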


Subject(s)
Adverse Drug Reaction Reporting Systems, Data Mining/methods, Drug Labeling, United States Food and Drug Administration, Humans, Medication Systems, Natural Language Processing, Ohio, United States
17.
J Am Med Inform Assoc ; 20(5): 915-21, 2013.
Article in English | MEDLINE | ID: mdl-23268488

ABSTRACT

OBJECTIVE: The goal of this work was to evaluate machine learning methods, binary classification and sequence labeling, for medication-attribute linkage detection in two clinical corpora. DATA AND METHODS: We double annotated 3000 clinical trial announcements (CTA) and 1655 clinical notes (CN) for medication named entities and their attributes. A binary support vector machine (SVM) classification method with parsimonious feature sets and a conditional random fields (CRF)-based multi-layered sequence labeling (MLSL) model were proposed to identify the linkages between the entities and their corresponding attributes. We evaluated the systems' performance against the human-generated gold standard. RESULTS: The experiments showed that the two machine learning approaches performed statistically significantly better than the baseline rule-based approach. The binary SVM classification achieved 0.94 F-measure with individual tokens as features. The SVM model trained on a parsimonious feature set achieved 0.81 F-measure for CN and 0.87 for CTA. The CRF MLSL method achieved 0.80 F-measure on both corpora. DISCUSSION AND CONCLUSIONS: We compared the novel MLSL method with a binary classification and a rule-based method. The MLSL method performed statistically significantly better than the rule-based method. However, the SVM-based binary classification method was statistically significantly better than the MLSL method on both the CTA and CN corpora. Using parsimonious feature sets, both the SVM-based binary classification and the CRF-based MLSL methods achieved high performance in detecting medication name and attribute linkages in CTA and CN.
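A minimal sketch of the binary-classification framing: each (medication, attribute) candidate pair becomes a feature dictionary, and a linear SVM decides whether the pair is linked. The features and training pairs are invented, not the paper's parsimonious feature set.

```python
# Minimal sketch: binary SVM linkage classification over candidate
# (medication, attribute) pairs (features and examples are invented).
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

pairs = [
    {"token_distance": 1, "attr_type": "DOSE", "same_sentence": True},
    {"token_distance": 12, "attr_type": "DOSE", "same_sentence": False},
    {"token_distance": 2, "attr_type": "FREQ", "same_sentence": True},
    {"token_distance": 15, "attr_type": "FREQ", "same_sentence": False},
]
linked = [1, 0, 1, 0]   # 1 = the attribute belongs to the medication

vec = DictVectorizer()
X = vec.fit_transform(pairs)
clf = LinearSVC(C=1.0).fit(X, linked)

candidate = {"token_distance": 3, "attr_type": "DOSE", "same_sentence": True}
print(clf.predict(vec.transform([candidate])))   # expected: [1] (linked)
```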


Subject(s)
Artificial Intelligence, Information Storage and Retrieval/methods, Medical Records, Pharmaceutical Preparations, Support Vector Machine, Clinical Trials as Topic, Humans
18.
J Am Med Inform Assoc ; 20(1): 84-94, 2013 Jan 01.
Article in English | MEDLINE | ID: mdl-22859645

ABSTRACT

OBJECTIVE: (1) To evaluate a state-of-the-art natural language processing (NLP)-based approach to automatically de-identify a large set of diverse clinical notes. (2) To measure the impact of de-identification on the performance of information extraction algorithms on the de-identified documents. MATERIAL AND METHODS: A cross-sectional study that included 3503 stratified, randomly selected clinical notes (over 22 note types) from five million documents produced at one of the largest US pediatric hospitals. Sensitivity, precision, and F value of two automated de-identification systems for removing all 18 HIPAA-defined protected health information elements were computed. Performance was assessed against a manually generated gold standard, and statistical significance was tested. The automated de-identification performance was also compared with that of two humans on a 10% subsample of the gold standard. The effect of de-identification on the performance of subsequent medication extraction was measured. RESULTS: The gold standard included 30,815 protected health information elements and more than one million tokens. The most accurate NLP method had 91.92% sensitivity (R) and 95.08% precision (P) overall. The performance of the system was indistinguishable from that of human annotators (the annotators' performance was 92.15% (R)/93.95% (P) and 94.55% (R)/88.45% (P) overall, while the best system obtained 92.91% (R)/95.73% (P) on the same text). The impact of automated de-identification on the utility of the narrative notes for subsequent information extraction was minimal, as measured by the sensitivity and precision of medication name extraction. DISCUSSION AND CONCLUSION: NLP-based de-identification shows excellent performance that rivals that of human annotators. Furthermore, unlike manual de-identification, the automated approach scales up to millions of documents quickly and inexpensively.


Subject(s)
Confidentiality, Data Mining, Electronic Health Records, Information Dissemination, Natural Language Processing, Algorithms, Cross-Sectional Studies, Hospitals, Pediatric, Humans, Observer Variation, Reproducibility of Results, Technology Assessment, Biomedical, United States
19.
AMIA Annu Symp Proc ; 2012: 144-53, 2012.
Article in English | MEDLINE | ID: mdl-23304283

ABSTRACT

We present the construction of three annotated corpora to serve as gold standards for medical natural language processing (NLP) tasks. Clinical notes from the medical record, clinical trial announcements, and FDA drug labels are annotated. We report high inter-annotator agreement (overall F-measures between 0.8467 and 0.9176) for the annotation of Protected Health Information (PHI) elements for a de-identification task and of medications, diseases/disorders, and signs/symptoms for an information extraction (IE) task. The annotated corpora of clinical trials and FDA labels will be publicly released; to facilitate translational NLP tasks that require cross-corpora interoperability (e.g. clinical trial eligibility screening), their annotation schemas are aligned with a large-scale, NIH-funded clinical text annotation project.


Subject(s)
Natural Language Processing, Clinical Trials as Topic, Drug Labeling, Medical Records, Software, United States, United States Food and Drug Administration
20.
AMIA Annu Symp Proc ; 2011: 501-10, 2011.
Article in English | MEDLINE | ID: mdl-22195104

ABSTRACT

The CHA2DS2-VASc score is a 10-point scale that allows cardiologists to easily identify the potential stroke risk of patients with non-valvular atrial fibrillation. In this article, we present a system based on natural language processing (lexicon and linguistic modules), including negation and speculation handling, which extracts medical concepts from French clinical records and uses them as criteria to compute the CHA2DS2-VASc score. We evaluate this system by comparing its computed criteria with those obtained by human reading of the same clinical texts, and by assessing the impact of the observed differences on the resulting CHA2DS2-VASc scores. Given 21 patient records, 168 instances of criteria were computed, with an accuracy of 97.6%, and the accuracy of the 21 CHA2DS2-VASc scores was 85.7%. All differences in scores triggered the same alert, meaning that, on this test set, system performance yields results similar to human reading of the texts.


Subject(s)
Atrial Fibrillation/complications, Electronic Health Records, Natural Language Processing, Risk Assessment/methods, Stroke, Thromboembolism, Cardiology, Humans, Language, Stroke/etiology, Thromboembolism/etiology