ABSTRACT
BACKGROUND: Extracting concepts from biomedical texts is key to supporting many advanced applications such as biomedical information retrieval. However, in clinical notes, Named Entity Recognition (NER) has to deal with various types of errors such as spelling errors, grammatical errors, truncated sentences, and non-standard abbreviations. Moreover, in numerous countries, NER is hampered because many of the available resources were originally developed for, and are only suitable for, English texts. This paper presents the Cimind system, a multilingual system dedicated to named entity recognition in medical texts based on a phonetic similarity measure. METHODS: Cimind performs entity recognition by combining phonetic matching, which uses the DM phonetic algorithm to deal with spelling errors, with string similarity measures. Three main steps are performed to identify terms in a controlled vocabulary: normalization, candidate selection by phonetic similarity, and candidate ranking. RESULTS: Cimind was evaluated in the 2016 and 2017 editions of the CLEF eHealth challenge in the CépiDC/CDC tasks. In 2017, it obtained the following precision (P), recall (R), and F1 scores on each corpus: English dataset: 83.9% P, 78.3% R, 81.0% F1; French raw dataset: 85.7% P, 68.9% R, 76.4% F1; French aligned dataset: 83.5% P, 77.5% R, 80.4% F1. It ranked first in French and fourth in English in the official runs.
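The three steps named in the METHODS part (normalization, candidate selection by phonetic similarity, candidate ranking) can be illustrated with a minimal sketch. This is not the Cimind implementation: the function names are hypothetical, the phonetic key is a simplified Soundex-style consonant code used as a stand-in for the DM algorithm, and ranking uses the standard-library `difflib.SequenceMatcher` as one possible string similarity measure.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Step 1: lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.split())


# Simplified consonant classes (an assumption; a stand-in for the DM algorithm).
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def phonetic_key(text):
    """Encode each word by its consonant classes; vowels are dropped and
    identical codes that become adjacent are collapsed into one."""
    keys = []
    for word in normalize(text).split():
        digits = []
        for ch in word:
            code = _CODES.get(ch)
            if code and (not digits or digits[-1] != code):
                digits.append(code)
        keys.append("".join(digits))
    return " ".join(keys)


def recognize(mention, vocabulary):
    """Steps 2 and 3: keep vocabulary terms that share the mention's phonetic
    key, then rank them by string similarity to the normalized mention."""
    norm = normalize(mention)
    key = phonetic_key(mention)
    candidates = [term for term in vocabulary if phonetic_key(term) == key]
    return sorted(
        candidates,
        key=lambda term: SequenceMatcher(None, norm, normalize(term)).ratio(),
        reverse=True,
    )
```

With this sketch, a misspelled mention such as "Pnemonia" shares a phonetic key with "pneumonia" and is matched despite the missing letter; the similarity ranking then orders any terms that collide on the same key.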
Subjects
Natural Language Processing , Phonetics , Controlled Vocabulary , Algorithms , Humans
ABSTRACT
BACKGROUND: Whole exome sequencing (WES) has become the strategy of choice to identify a coding allelic variant for a rare human monogenic disorder. This approach is a revolution in the history of medical genetics, impacting both fundamental research and diagnostic methods leading to personalized medicine. A plethora of efficient algorithms has been developed for variant discovery. They generally yield ~20,000 variations that have to be narrowed down to find the potential pathogenic allelic variant(s) and the affected gene(s). For this purpose, commonly adopted procedures involving various filtering strategies have emerged: exclusion of common variations, type of allelic variant, pathogenicity effect prediction, mode of inheritance, and comparison of exomes from multiple individuals. To cope with the expansion of WES in individual medical genomics laboratories, new user-friendly and versatile software tools have to implement these filtering steps. Non-programmer biologists need to be able to combine different filtering criteria on their own and to pursue a personal strategy depending on their assumptions and study design. RESULTS: We describe EVA (Exome Variation Analyzer), a user-friendly web-interfaced software tool dedicated to filtering strategies for medical WES. Through its different modules, EVA (i) integrates and stores annotated exome variation data as strictly confidential to the project owner, (ii) allows users to combine the main filters dealing with common variations, molecular types, inheritance mode and multiple samples, (iii) offers browsing of annotated data and filtered results in various interactive tables, graphical visualizations and statistical charts, and (iv) offers export files and cross-links to useful external databases and software for further prioritization of the small subset of sorted candidate variations and genes. 
We report a demonstrative case study that allowed us to identify a new candidate gene related to a rare form of Alzheimer disease. CONCLUSIONS: EVA was developed as a user-friendly, versatile, and efficient filtering-assistance tool for WES. It constitutes a platform for data storage and for stringent screening of clinically relevant genetic variations by non-programmer geneticists. It thereby responds to new needs in the expanding era of medical genomics investigated by WES, for both fundamental research and clinical diagnostics.
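The way such filters combine (common variations, molecular type, inheritance mode, multiple samples) can be sketched as follows. This is only an illustration under stated assumptions, not EVA's code or data model: the `Variant` class, the `filter_variants` function, the 1% frequency cut-off, and the list of retained consequence types are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    position: int
    consequence: str        # molecular type, e.g. "missense" or "synonymous"
    population_freq: float  # allele frequency in a reference population
    genotypes: dict         # sample name -> "ref", "het" or "hom"


def filter_variants(
    variants,
    affected,
    max_freq=0.01,  # assumed threshold for "common" variations
    consequences=("missense", "nonsense", "frameshift", "splice"),
    inheritance="dominant",
):
    """Keep variants that pass all four filters at once."""
    # Recessive model: every affected sample must be homozygous.
    # Dominant model: every affected sample must carry the variant.
    required = {"hom"} if inheritance == "recessive" else {"het", "hom"}
    kept = []
    for v in variants:
        if v.population_freq > max_freq:
            continue  # exclude common variations
        if v.consequence not in consequences:
            continue  # exclude unwanted molecular types
        if not all(v.genotypes.get(s, "ref") in required for s in affected):
            continue  # must be shared by all affected samples
        kept.append(v)
    return kept
```

Calling `filter_variants(variants, affected=["p1", "p2"], inheritance="recessive")` would, under these assumptions, return only rare, protein-altering variants that are homozygous in both samples.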
Subjects
Alzheimer Disease/genetics , Exome , Single Nucleotide Polymorphism , DNA Sequence Analysis/methods , Software , Algorithms , Genetic Databases , Humans , DNA Sequence Analysis/instrumentation
ABSTRACT
Extracting concepts from medical texts is key to supporting many advanced applications in medical information retrieval. Entity recognition in French texts is further hampered because many of the available resources were originally developed for English texts. This paper proposes an evaluation of terminology coverage in a corpus of 50,000 French articles extracted from the bibliographic database LiSSa. This corpus was automatically indexed with 32 health terminologies, published in French or translated. Then, the terminologies providing the best coverage of these documents were determined. The results show that major resources such as the NCI and SNOMED CT thesauri provide the most extensive annotation of the corpus, while specific French resources prove to be valuable assets.
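One simple way to compare terminologies by coverage is sketched below. It is an assumption-laden illustration, not the indexing method used in the study: the `coverage` function is hypothetical, matching is naive case-insensitive substring search, and coverage is defined here as the fraction of documents that receive at least one annotation.

```python
def coverage(documents, terminologies):
    """Return {terminology name: fraction of documents with at least one
    matching term}, ordered from best to worst coverage."""
    results = {}
    for name, terms in terminologies.items():
        lowered = [term.lower() for term in terms]
        annotated = sum(
            1 for doc in documents if any(term in doc.lower() for term in lowered)
        )
        results[name] = annotated / len(documents) if documents else 0.0
    # Sort so that the best-covering terminology comes first.
    return dict(sorted(results.items(), key=lambda item: item[1], reverse=True))
```

A real evaluation would need proper tokenization and concept normalization; substring matching is used only to keep the sketch short.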