Pesquisa | BVS - MINISTÉRIO DA SAÚDE

1.

Characterizing pituitary adenomas in clinical notes: Corpus construction and its application in LLMs.

Hu, Jiahui; Fu, Jin; Zhao, Wanqing; Lou, Pei; Feng, Ming; Ren, Huiling; Feng, Shanshan; Li, Yansheng; Fang, An.

Health Informatics J ; 30(4): 14604582241291442, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-39379071

RESUMO

Objective: Faced with the challenges of differential diagnosis caused by the complex clinical manifestations and high pathological heterogeneity of pituitary adenomas, this study aims to construct a high-quality annotated corpus to characterize pituitary adenomas in clinical notes containing rich diagnosis and treatment information. Methods: A dataset from a pituitary adenomas neurosurgery treatment center of a tertiary first-class hospital in China was retrospectively collected. A semi-automatic corpus construction framework was designed. A total of 2000 documents containing 9430 sentences and 524,232 words were annotated, and the text corpus of pituitary adenomas (TCPA) was constructed and analyzed. Its potential application in large language models (LLMs) was explored through fine-tuning and prompting experiments. Results: TCPA had 4782 medical entities and 28,998 tokens, achieving good quality with the inter-annotator agreement value of 0.862-0.986. The LLMs experiments showed that TCPA can be used to automatically identify clinical information from free texts, and introducing instances with clinical characteristics can effectively reduce the need for training data, thereby reducing labor costs. Conclusion: This study characterized pituitary adenomas in clinical notes, and the proposed method were able to serve as references for relevant research in medical natural language scenarios with highly specialized language structure and terminology.

Assuntos

Processamento de Linguagem Natural , Neoplasias Hipofisárias , Humanos , Neoplasias Hipofisárias/diagnóstico , China , Estudos Retrospectivos , Adenoma/diagnóstico , Registros Eletrônicos de Saúde/estatística & dados numéricos

2.

Is Boundary Annotation Necessary? Evaluating Boundary-Free Approaches to Improve Clinical Named Entity Annotation Efficiency: Case Study.

Herman Bernardim Andrade, Gabriel; Yada, Shuntaro; Aramaki, Eiji.

JMIR Med Inform ; 12: e59680, 2024 Jul 02.

Artigo em Inglês | MEDLINE | ID: mdl-38954456

RESUMO

BACKGROUND: Named entity recognition (NER) is a fundamental task in natural language processing. However, it is typically preceded by named entity annotation, which poses several challenges, especially in the clinical domain. For instance, determining entity boundaries is one of the most common sources of disagreements between annotators due to questions such as whether modifiers or peripheral words should be annotated. If unresolved, these can induce inconsistency in the produced corpora, yet, on the other hand, strict guidelines or adjudication sessions can further prolong an already slow and convoluted process. OBJECTIVE: The aim of this study is to address these challenges by evaluating 2 novel annotation methodologies, lenient span and point annotation, aiming to mitigate the difficulty of precisely determining entity boundaries. METHODS: We evaluate their effects through an annotation case study on a Japanese medical case report data set. We compare annotation time, annotator agreement, and the quality of the produced labeling and assess the impact on the performance of an NER system trained on the annotated corpus. RESULTS: We saw significant improvements in the labeling process efficiency, with up to a 25% reduction in overall annotation time and even a 10% improvement in annotator agreement compared to the traditional boundary-strict approach. However, even the best-achieved NER model presented some drop in performance compared to the traditional annotation methodology. CONCLUSIONS: Our findings demonstrate a balance between annotation speed and model performance. Although disregarding boundary information affects model performance to some extent, this is counterbalanced by significant reductions in the annotator's workload and notable improvements in the speed of the annotation process. These benefits may prove valuable in various applications, offering an attractive compromise for developers and researchers.

3.

AI-assisted literature exploration of innovative Chinese medicine formulas.

Chung, Meng-Chi; Su, Li-Jen; Chen, Chien-Lin; Wu, Li-Ching.

Front Pharmacol ; 15: 1347882, 2024.

Artigo em Inglês | MEDLINE | ID: mdl-38584602

RESUMO

Objective: Our study provides an innovative approach to exploring herbal formulas that contribute to the promotion of sustainability and biodiversity conservation. We employ data mining, integrating keyword extraction, association rules, and LSTM-based generative models to analyze classical Traditional Chinese Medicine (TCM) texts. We systematically decode classical Chinese medical literature, conduct statistical analyses, and link these historical texts with modern pharmacogenomic references to explore potential alternatives. Methods: We present a novel iterative keyword extraction approach for discerning diverse herbs in historical TCM texts from the Pu-Ji Fang copies. Utilizing association rules, we uncover previously unexplored herb pairs. To bridge classical TCM herbal pairs with modern genetic relationships, we conduct gene-herb searches in PubMed and statistically validate this genetic literature as supporting evidence. We have expanded on the present work by developing a generative language model for suggesting innovative TCM formulations based on textual herb combinations. Results: We collected associations with 7,664 PubMed cross-search entries for gene-herb and 934 for Shenqifuzheng Injection as a positive control. We analyzed 16,384 keyword combinations from Pu-Ji Fang's 426 volumes, employing statistical methods to probe gene-herb associations, focusing on examining differences among the target genes and Pu-Ji Fang herbs. Conclusion: Analyzing Pu-Ji Fang reveals a historical focus on flavor over medicinal aspects in TCM. We extend our work on developing a generative model from classical textual keywords to rapidly produces novel herbal compositions or TCM formulations. This integrated approach enhances our comprehension of TCM by merging ancient text analysis, modern genetic research, and generative modeling.

4.

Dataset construction to detect human behavior with the help of emotions, sentiments and mood for Roman Urdu.

Samreen, Asia; Ali, Syed Asif.

Data Brief ; 52: 109906, 2024 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-38304385

RESUMO

Roman Urdu and English are often used together as a hybrid language for communication on social media. Because writers don't worry about spelling when utilizing the English alphabet to write Urdu during texting, it becomes challenging to interpret mixed codes for emotions. There are over 14,000 emotion lexicons in this dataset, each of which lists nine different emotions and their polarities. The NRC emotion lexicons [8] provided in Urdu have been transliterated into Roman Urdu. To verify that the provided translation is accurate, we used three online dictionaries of Urdu. A Python script that transliterates words from Urdu to Roman Urdu has been used to develop Roman Urdu transliteration. Sentiment and mood, depending on the emotion lexicon, are also provided. The textual data has been annotated using the unigram feature and distance estimation among strings and lexicons. Approximately 10,000 sentences from the baseline sample have been automatically annotated.

5.

The unreasonable effectiveness of large language models in zero-shot semantic annotation of legal texts.

Savelka, Jaromir; Ashley, Kevin D.

Front Artif Intell ; 6: 1279794, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-38045764

RESUMO

The emergence of ChatGPT has sensitized the general public, including the legal profession, to large language models' (LLMs) potential uses (e.g., document drafting, question answering, and summarization). Although recent studies have shown how well the technology performs in diverse semantic annotation tasks focused on legal texts, an influx of newer, more capable (GPT-4) or cost-effective (GPT-3.5-turbo) models requires another analysis. This paper addresses recent developments in the ability of LLMs to semantically annotate legal texts in zero-shot learning settings. Given the transition to mature generative AI systems, we examine the performance of GPT-4 and GPT-3.5-turbo(-16k), comparing it to the previous generation of GPT models, on three legal text annotation tasks involving diverse documents such as adjudicatory opinions, contractual clauses, or statutory provisions. We also compare the models' performance and cost to better understand the trade-offs. We found that the GPT-4 model clearly outperforms the GPT-3.5 models on two of the three tasks. The cost-effective GPT-3.5-turbo matches the performance of the 20× more expensive text-davinci-003 model. While one can annotate multiple data points within a single prompt, the performance degrades as the size of the batch increases. This work provides valuable information relevant for many practical applications (e.g., in contract review) and research projects (e.g., in empirical legal studies). Legal scholars and practicing lawyers alike can leverage these findings to guide their decisions in integrating LLMs in a wide range of workflows involving semantic annotation of legal texts.

6.

A BERT-Span model for Chinese named entity recognition in rehabilitation medicine.

Zhong, Jinhong; Xuan, Zhanxiang; Wang, Kang; Cheng, Zhou.

PeerJ Comput Sci ; 9: e1535, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37705622

RESUMO

Background: Due to various factors such as the increasing aging of the population and the upgrading of people's health consumption needs, the demand group for rehabilitation medical care is expanding. Currently, China's rehabilitation medical care encounters several challenges, such as inadequate awareness and a scarcity of skilled professionals. Enhancing public awareness about rehabilitation and improving the quality of rehabilitation services are particularly crucial. Named entity recognition is an essential first step in information processing as it enables the automated extraction of rehabilitation medical entities. These entities play a crucial role in subsequent tasks, including information decision systems and the construction of medical knowledge graphs. Methods: In order to accomplish this objective, we construct the BERT-Span model to complete the Chinese rehabilitation medicine named entity recognition task. First, we collect rehabilitation information from multiple sources to build a corpus in the field of rehabilitation medicine, and fine-tune Bidirectional Encoder Representation from Transformers (BERT) with the rehabilitation medicine corpus. For the rehabilitation medicine corpus, we use BERT to extract the feature vectors of rehabilitation medicine entities in the text, and use the span model to complete the annotation of rehabilitation medicine entities. Result: Compared to existing baseline models, our model achieved the highest F1 value for the named entity recognition task in the rehabilitation medicine corpus. The experimental results demonstrate that our method outperforms in recognizing both long medical entities and nested medical entities in rehabilitation medical texts. Conclusion: The BERT-Span model can effectively identify and extract entity knowledge in the field of rehabilitation medicine in China, which supports the construction of the knowledge graph of rehabilitation medicine and the development of the decision-making system of rehabilitation medicine.

7.

NeuroBridge ontology: computable provenance metadata to give the long tail of neuroimaging data a FAIR chance for secondary use.

Sahoo, Satya S; Turner, Matthew D; Wang, Lei; Ambite, Jose Luis; Appaji, Abhishek; Rajasekar, Arcot; Lander, Howard M; Wang, Yue; Turner, Jessica A.

Front Neuroinform ; 17: 1216443, 2023.

Artigo em Inglês | MEDLINE | ID: mdl-37554248

RESUMO

Background: Despite the efforts of the neuroscience community, there are many published neuroimaging studies with data that are still not findable or accessible. Users face significant challenges in reusing neuroimaging data due to the lack of provenance metadata, such as experimental protocols, study instruments, and details about the study participants, which is also required for interoperability. To implement the FAIR guidelines for neuroimaging data, we have developed an iterative ontology engineering process and used it to create the NeuroBridge ontology. The NeuroBridge ontology is a computable model of provenance terms to implement FAIR principles and together with an international effort to annotate full text articles with ontology terms, the ontology enables users to locate relevant neuroimaging datasets. Methods: Building on our previous work in metadata modeling, and in concert with an initial annotation of a representative corpus, we modeled diagnosis terms (e.g., schizophrenia, alcohol usage disorder), magnetic resonance imaging (MRI) scan types (T1-weighted, task-based, etc.), clinical symptom assessments (PANSS, AUDIT), and a variety of other assessments. We used the feedback of the annotation team to identify missing metadata terms, which were added to the NeuroBridge ontology, and we restructured the ontology to support both the final annotation of the corpus of neuroimaging articles by a second, independent set of annotators, as well as the functionalities of the NeuroBridge search portal for neuroimaging datasets. Results: The NeuroBridge ontology consists of 660 classes with 49 properties with 3,200 axioms. The ontology includes mappings to existing ontologies, enabling the NeuroBridge ontology to be interoperable with other domain specific terminological systems. Using the ontology, we annotated 186 neuroimaging full-text articles describing the participant types, scanning, clinical and cognitive assessments. Conclusion: The NeuroBridge ontology is the first computable metadata model that represents the types of data available in recent neuroimaging studies in schizophrenia and substance use disorders research; it can be extended to include more granular terms as needed. This metadata ontology is expected to form the computational foundation to help both investigators to make their data FAIR compliant and support users to conduct reproducible neuroimaging research.

8.

Performance and Agreement When Annotating Chest X-ray Text Reports-A Preliminary Step in the Development of a Deep Learning-Based Prioritization and Detection System.

Li, Dana; Pehrson, Lea Marie; Bonnevie, Rasmus; Fraccaro, Marco; Thrane, Jakob; Tøttrup, Lea; Lauridsen, Carsten Ammitzbøl; Butt Balaganeshan, Sedrah; Jankovic, Jelena; Andersen, Tobias Thostrup; Mayar, Alyas; Hansen, Kristoffer Lindskov; Carlsen, Jonathan Frederik; Darkner, Sune; Nielsen, Michael Bachmann.

Diagnostics (Basel) ; 13(6)2023 Mar 11.

Artigo em Inglês | MEDLINE | ID: mdl-36980376

RESUMO

A chest X-ray report is a communicative tool and can be used as data for developing artificial intelligence-based decision support systems. For both, consistent understanding and labeling is important. Our aim was to investigate how readers would comprehend and annotate 200 chest X-ray reports. Reports written between 1 January 2015 and 11 March 2022 were selected based on search words. Annotators included three board-certified radiologists, two trained radiologists (physicians), two radiographers (radiological technicians), a non-radiological physician, and a medical student. Consensus labels by two or more of the experienced radiologists were considered "gold standard". Matthew's correlation coefficient (MCC) was calculated to assess annotation performance, and descriptive statistics were used to assess agreement between individual annotators and labels. The intermediate radiologist had the best correlation to "gold standard" (MCC 0.77). This was followed by the novice radiologist and medical student (MCC 0.71 for both), the novice radiographer (MCC 0.65), non-radiological physician (MCC 0.64), and experienced radiographer (MCC 0.57). Our findings showed that for developing an artificial intelligence-based support system, if trained radiologists are not available, annotations from non-radiological annotators with basic and general knowledge may be more aligned with radiologists compared to annotations from sub-specialized medical staff, if their sub-specialization is outside of diagnostic radiology.

9.

Lost in transmission? Self- and other-annotation of emotional words.

Braun, Nadine; Goudbeek, Martijn; Krahmer, Emiel.

Acta Psychol (Amst) ; 229: 103713, 2022 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-35988301

RESUMO

It is generally assumed that someone's affective state can be correctly detected and interpreted by other people, nowadays even by computer algorithms, in their writing. However, it is unclear whether these perceptions match the actual experience and communicative intention of the author. Therefore, we investigated the relation between affect expression and perception in text in a two-part study. In Part 1, participants (authors) wrote about emotional experiences according to four combinations of two appraisals (High/Low Pleasantness, High/Low Control), rated the valence of each text, and annotated words using 22 emotions. In Part 2, another group of participants (readers) rated and annotated the same texts. We also compare the human evaluations to those provided by computerized text analysis. Results show that valence differed across conditions and that authors rated and annotated their texts differently than readers. Although the automatic analysis detected levels of positivity and negativity across conditions similar to human valence ratings, it relied on fewer and different words to do so. We discuss implications for affective science and automatic sentiment analysis.

Assuntos

Emoções , Idioma , Humanos

10.

Prediction of standard cell types and functional markers from textual descriptions of flow cytometry gating definitions using machine learning.

Rodriguez-Esteban, Raul; Duarte, José; Teixeira, Priscila C; Richard, Fabien; Koltsova, Svetlana; So, W Venus.

Cytometry B Clin Cytom ; 102(3): 220-227, 2022 05.

Artigo em Inglês | MEDLINE | ID: mdl-35253974

RESUMO

BACKGROUND: A key step in clinical flow cytometry data analysis is gating, which involves the identification of cell populations. The process of gating produces a set of reportable results, which are typically described by gating definitions. The non-standardized, non-interpreted nature of gating definitions represents a hurdle for data interpretation and data sharing across and within organizations. Interpreting and standardizing gating definitions for subsequent analysis of gating results requires a curation effort from experts. Machine learning approaches have the potential to help in this process by predicting expert annotations associated with gating definitions. METHODS: We created a gold-standard dataset by manually annotating thousands of gating definitions with cell type and functional marker annotations. We used this dataset to train and test a machine learning pipeline able to predict standard cell types and functional marker genes associated with gating definitions. RESULTS: The machine learning pipeline predicted annotations with high accuracy for both cell types and functional marker genes. Accuracy was lower for gating definitions from assays belonging to laboratories from which limited or no prior data was available in the training. Manual error review ensured that resulting predicted annotations could be reused subsequently as additional gold-standard training data. CONCLUSIONS: Machine learning methods are able to consistently predict annotations associated with gating definitions from flow cytometry assays. However, a hybrid automatic and manual annotation workflow would be recommended to achieve optimal results.

Assuntos

Aprendizado de Máquina , Citometria de Fluxo , Humanos , Fluxo de Trabalho

11.

Extending TextAE for annotation of non-contiguous entities.

Lever, Jake; Altman, Russ; Kim, Jin-Dong.

Genomics Inform ; 18(2): e15, 2020 Jun.

Artigo em Inglês | MEDLINE | ID: mdl-32634869

RESUMO

Named entity recognition tools are used to identify mentions of biomedical entities in free text and are essential components of high-quality information retrieval and extraction systems. Without good entity recognition, methods will mislabel searched text and will miss important information or identify spurious text that will frustrate users. Most tools do not capture non-contiguous entities which are separate spans of text that together refer to an entity, e.g., the entity "type 1 diabetes" in the phrase "type 1 and type 2 diabetes." This type is commonly found in biomedical texts, especially in lists, where multiple biomedical entities are named in shortened form to avoid repeating words. Most text annotation systems, that enable users to view and edit entity annotations, do not support non-contiguous entities. Therefore, experts cannot even visualize non-contiguous entities, let alone annotate them to build valuable datasets for machine learning methods. To combat this problem and as part of the BLAH6 hackathon, we extended the TextAE platform to allow visualization and annotation of non-contiguous entities. This enables users to add new subspans to existing entities by selecting additional text. We integrate this new functionality with TextAE's existing editing functionality to allow easy changes to entity annotation and editing of relation annotations involving non-contiguous entities, with importing and exporting to the PubAnnotation format. Finally, we roughly quantify the problem across the entire accessible biomedical literature to highlight that there are a substantial number of non-contiguous entities that appear in lists that would be missed by most text mining systems.

12.

Elucidating high-dimensional cancer hallmark annotation via enriched ontology.

Yan, Shankai; Wong, Ka-Chun.

J Biomed Inform ; 73: 84-94, 2017 09.

Artigo em Inglês | MEDLINE | ID: mdl-28723579

RESUMO

MOTIVATION: Cancer hallmark annotation is a promising technique that could discover novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as a complementary approach that can retrieve knowledge from massive text information, advancing numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation imposes a unique challenge. RESULTS: To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights, a novel approach, UDT-RF, which makes use of ontological features is proposed. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and utilizes novel feature selections for elucidating the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, state-of-the-art methods are compared and evaluated by a multitude of performance metrics, revealing the full performance spectrum on the full set of cancer hallmarks. Several case studies are conducted, demonstrating how the proposed approach could reveal novel insights into cancers. AVAILABILITY: https://github.com/cskyan/chmannot.

Assuntos

Curadoria de Dados , Medical Subject Headings , Neoplasias , PubMed , Ontologias Biológicas , Humanos

13.

Dense Annotation of Free-Text Critical Care Discharge Summaries from an Indian Hospital and Associated Performance of a Clinical NLP Annotator.

Ramanan, S V; Radhakrishna, Kedar; Waghmare, Abijeet; Raj, Tony; Nathan, Senthil P; Sreerama, Sai Madhukar; Sampath, Sriram.

J Med Syst ; 40(8): 187, 2016 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-27342107

RESUMO

Electronic Health Record (EHR) use in India is generally poor, and structured clinical information is mostly lacking. This work is the first attempt aimed at evaluating unstructured text mining for extracting relevant clinical information from Indian clinical records. We annotated a corpus of 250 discharge summaries from an Intensive Care Unit (ICU) in India, with markups for diseases, procedures, and lab parameters, their attributes, as well as key demographic information and administrative variables such as patient outcomes. In this process, we have constructed guidelines for an annotation scheme useful to clinicians in the Indian context. We evaluated the performance of an NLP engine, Cocoa, on a cohort of these Indian clinical records. We have produced an annotated corpus of roughly 90 thousand words, which to our knowledge is the first tagged clinical corpus from India. Cocoa was evaluated on a test corpus of 50 documents. The overlap F-scores across the major categories, namely disease/symptoms, procedures, laboratory parameters and outcomes, are 0.856, 0.834, 0.961 and 0.872 respectively. These results are competitive with results from recent shared tasks based on US records. The annotated corpus and associated results from the Cocoa engine indicate that unstructured text mining is a viable method for cohort analysis in the Indian clinical context, where structured EHR records are largely absent.

Assuntos

Cuidados Críticos/estatística & dados numéricos , Mineração de Dados/métodos , Registros Eletrônicos de Saúde/estatística & dados numéricos , Processamento de Linguagem Natural , Alta do Paciente/estatística & dados numéricos , Fatores Etários , Técnicas e Procedimentos Diagnósticos , Humanos , Índia , Fatores Sexuais , Fatores Socioeconômicos

14.

Metodologias e ferramentas para anotação de narrativas clínicas / Methodologies and tools for clinical narratives annotation

Andrade, Gabriel Herman Bernardim; Oliveira, Lucas Emanuel Silva e; Moro, Claudia Maria Cabral.

J. health inform ; 8(supl.I): 1031-1040, 2016. ilus

Artigo em Português | LILACS | ID: biblio-906773

RESUMO

A utilização de técnicas de Processamento de Linguagem Natural (PLN) em textos clínicos é amplamente dependente de grandes quantidades de dados textuais anotados, denominados corpus ou padrão ouro. Sendo essenciais para a modelagem da linguagem durante a fase de treinamento de diversos algoritmos de PLN. Porém, para a criação de um padrão ouro é necessário um extenso e custoso trabalho manual de anotação, que demanda um grande esforço de especialistas. OBJETIVO: Realizar uma revisão da literatura, visando o estudo de metodologias e ferramentas utilizadas em procedimentos de anotação de textos. MÉTODO: Levantamento em bases científicas referentes à elaboração de corpus morfológicos, sintáticos e morfossintáticos foi realizado, analisando 32 estudos de anotação e mais 12 ferramentas. RESULTADOS: Foram levantados os principais aspectos nos processos de anotação, bem como realizada uma avaliação dentre critérios pré-definidos de cada das ferramentas de suporte encontradas.

The use of natural language processing techniques (NLP) in clinical texts is dependent on large amounts of annotated text data, called corpus or gold standard. Are essential for the modelling language during the training phase of NLP algorithms. However, for the creation of a gold standard is required extensive and costly manual annotation task, that demands a great deal of experts. OBJECTIVES: To review the literature to identify methodologies and tools applied to text annotation. METHODS: Scientifics databases search regarding the development of morphological, syntactic and morphosyntactic corpus was performed by analyzing 32 annotation studies and 12 tools. RESULTS: Main aspects of the annotation process description, as well as an assessment from pre-defined criteria for each one of the annotation tools identified.

Assuntos

Humanos , Processamento de Linguagem Natural , Software , Armazenamento e Recuperação da Informação , Congressos como Assunto

15.

A multi-ontology approach to annotate scientific documents based on a modularization technique.

Gomes, Priscilla Corrêa E Castro; Moura, Ana Maria de Carvalho; Cavalcanti, Maria Cláudia.

J Biomed Inform ; 58: 208-219, 2015 Dec.

Artigo em Inglês | MEDLINE | ID: mdl-26483172

RESUMO

Scientific text annotation has become an important task for biomedical scientists. Nowadays, there is an increasing need for the development of intelligent systems to support new scientific findings. Public databases available on the Web provide useful data, but much more useful information is only accessible in scientific texts. Text annotation may help as it relies on the use of ontologies to maintain annotations based on a uniform vocabulary. However, it is difficult to use an ontology, especially those that cover a large domain. In addition, since scientific texts explore multiple domains, which are covered by distinct ontologies, it becomes even more difficult to deal with such task. Moreover, there are dozens of ontologies in the biomedical area, and they are usually big in terms of the number of concepts. It is in this context that ontology modularization can be useful. This work presents an approach to annotate scientific documents using modules of different ontologies, which are built according to a module extraction technique. The main idea is to analyze a set of single-ontology annotations on a text to find out the user interests. Based on these annotations a set of modules are extracted from a set of distinct ontologies, and are made available for the user, for complementary annotation. The reduced size and focus of the extracted modules tend to facilitate the annotation task. An experiment was conducted to evaluate this approach, with the participation of a bioinformatician specialist of the Laboratory of Peptides and Proteins of the IOC/Fiocruz, who was interested in discovering new drug targets aiming at the combat of tropical diseases.

Assuntos

Vocabulário Controlado , Armazenamento e Recuperação da Informação

16.

Enhancing medical named entity recognition with an extended segment representation technique.

Keretna, Sara; Lim, Chee Peng; Creighton, Doug; Shaban, Khaled Bashir.

Comput Methods Programs Biomed ; 119(2): 88-100, 2015 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-25791277

RESUMO

OBJECTIVE: The objective of this paper is to formulate an extended segment representation (SR) technique to enhance named entity recognition (NER) in medical applications. METHODS: An extension to the IOBES (Inside/Outside/Begin/End/Single) SR technique is formulated. In the proposed extension, a new class is assigned to words that do not belong to a named entity (NE) in one context but appear as an NE in other contexts. Ambiguity in such cases can negatively affect the results of classification-based NER techniques. Assigning a separate class to words that can potentially cause ambiguity in NER allows a classifier to detect NEs more accurately; therefore increasing classification accuracy. RESULTS: The proposed SR technique is evaluated using the i2b2 2010 medical challenge data set with eight different classifiers. Each classifier is trained separately to extract three different medical NEs, namely treatment, problem, and test. From the three experimental results, the extended SR technique is able to improve the average F1-measure results pertaining to seven out of eight classifiers. The kNN classifier shows an average reduction of 0.18% across three experiments, while the C4.5 classifier records an average improvement of 9.33%.

Assuntos

Sistemas Computadorizados de Registros Médicos , Doença/classificação , Humanos

17.

Development and evaluation of RapTAT: a machine learning system for concept mapping of phrases from medical narratives.

Gobbel, Glenn T; Reeves, Ruth; Jayaramaraja, Shrimalini; Giuse, Dario; Speroff, Theodore; Brown, Steven H; Elkin, Peter L; Matheny, Michael E.

J Biomed Inform ; 48: 54-65, 2014 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-24316051

RESUMO

Rapid, automated determination of the mapping of free text phrases to pre-defined concepts could assist in the annotation of clinical notes and increase the speed of natural language processing systems. The aim of this study was to design and evaluate a token-order-specific naïve Bayes-based machine learning system (RapTAT) to predict associations between phrases and concepts. Performance was assessed using a reference standard generated from 2860 VA discharge summaries containing 567,520 phrases that had been mapped to 12,056 distinct Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) concepts by the MCVS natural language processing system. It was also assessed on the manually annotated, 2010 i2b2 challenge data. Performance was established with regard to precision, recall, and F-measure for each of the concepts within the VA documents using bootstrapping. Within that corpus, concepts identified by MCVS were broadly distributed throughout SNOMED CT, and the token-order-specific language model achieved better performance based on precision, recall, and F-measure (0.95±0.15, 0.96±0.16, and 0.95±0.16, respectively; mean±SD) than the bag-of-words based, naïve Bayes model (0.64±0.45, 0.61±0.46, and 0.60±0.45, respectively) that has previously been used for concept mapping. Precision, recall, and F-measure on the i2b2 test set were 92.9%, 85.9%, and 89.2% respectively, using the token-order-specific model. RapTAT required just 7.2ms to map all phrases within a single discharge summary, and mapping rate did not decrease as the number of processed documents increased. The high performance attained by the tool in terms of both accuracy and speed was encouraging, and the mapping rate should be sufficient to support near-real-time, interactive annotation of medical narratives. These results demonstrate the feasibility of rapidly and accurately mapping phrases to a wide range of medical concepts based on a token-order-specific naïve Bayes model and machine learning.

Assuntos

Inteligência Artificial , Processamento de Linguagem Natural , Algoritmos , Automação , Teorema de Bayes , Bases de Dados Factuais , Registros Eletrônicos de Saúde , Hospitais de Veteranos , Humanos , Modelos Estatísticos , Reprodutibilidade dos Testes , Software , Systematized Nomenclature of Medicine , Tennessee , Terminologia como Assunto , Unified Medical Language System , Vocabulário Controlado

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA