1.
Front Digit Health ; 5: 1195017, 2023.
Article in English | MEDLINE | ID: mdl-37388252

ABSTRACT

Objectives: This study explores Artificial Intelligence and Natural Language Processing techniques to support the automatic assignment of the four Response Evaluation Criteria in Solid Tumors (RECIST) scales based on radiology reports. We also aim to evaluate how the languages and institutional specificities of Swiss teaching hospitals are likely to affect the quality of the classification in French and German. Methods: Seven machine learning methods were evaluated to establish a strong baseline. Then, robust models were built, fine-tuned according to the language (French and German), and compared with the expert annotation. Results: The best strategies yield average F1-scores of 90% and 86% respectively for the 2-class (Progressive/Non-progressive) and the 4-class (Progressive Disease, Stable Disease, Partial Response, Complete Response) RECIST classification tasks. Conclusions: These results are competitive with manual labeling as measured by the Matthews correlation coefficient and Cohen's Kappa (79% and 76%). On this basis, we confirm the capacity of specific models to generalize to new unseen data and we assess the impact of using Pre-trained Language Models (PLMs) on the accuracy of the classifiers.
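
The agreement measures named in this abstract can be illustrated for the 2-class (Progressive/Non-progressive) case. The sketch below is not the authors' evaluation code. It assumes a 2x2 confusion matrix, and the function names and counts are hypothetical and not taken from the study.

```python
import math


def f1_score(tp, fp, fn):
    """F1 for the positive class: harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of both raters.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Hypothetical counts (model vs. expert labels), for illustration only.
counts = dict(tp=40, fp=10, fn=10, tn=40)
scores = {
    "f1": f1_score(counts["tp"], counts["fp"], counts["fn"]),
    "mcc": matthews_corrcoef(**counts),
    "kappa": cohen_kappa(**counts),
}
```

Unlike F1, both MCC and kappa take the true negatives into account, which is one reason they are commonly used to compare a classifier against manual labeling.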

2.
Database (Oxford) ; 2023, 2023 03 31.
Article in English | MEDLINE | ID: mdl-37002680

ABSTRACT

The curation of genomic variants requires collecting evidence not only in variant knowledge bases but also in the literature. However, some variants result in no match when searched in the scientific literature. Indeed, it has been reported that a significant subset of information related to genomic variants is not reported in the full text, but only in the supplementary materials associated with a publication. In this study, we present an evaluation of the use of supplementary data (SD) to improve the retrieval of relevant scientific publications for variant curation. Our experiments show that searching SD significantly increases the volume of documents retrieved for a variant, thus reducing by ∼63% the number of variants for which no match is found in the scientific literature. SD thus represent a paramount source of information for curating variants of unknown significance and should receive more attention from global research infrastructures that maintain literature search engines. Database URL: https://www.expasy.org/resources/variomes.


Subject(s)
Genomics , Search Engine , Databases, Factual
3.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36511598

ABSTRACT

MOTIVATION: Since early 2020, the coronavirus disease 2019 (COVID-19) pandemic has confronted the biomedical community with an unprecedented challenge. The rapid spread of COVID-19 and the ease of transmission seen worldwide are due to increased population flow and international trade. Front-line medical care, treatment research and vaccine development also require rapid and informative interpretation of the literature and COVID-19 data produced around the world, with 177 500 papers published between January 2020 and November 2021, i.e. almost 8500 papers per month. To extract knowledge and enable interoperability across resources, we developed the COVID-19 Vocabulary (COVoc), an application ontology related to the research on this pandemic. The main objective of COVoc development was to enable seamless navigation from biomedical literature to core databases and tools of ELIXIR, a European-wide intergovernmental organization for life sciences. RESULTS: This collaborative work provided data integration into SIB Literature services, an application ontology (COVoc) and a triage service named COVTriage, which is based on annotation processing and searches for COVID-related information across pre-defined aspects with daily updates. Thanks to its interoperability potential, COVoc lends itself to wider applications, hopefully through further connections with other novel COVID-19 ontologies, as has been established with the Coronavirus Infectious Disease Ontology. AVAILABILITY AND IMPLEMENTATION: The data are available at https://github.com/EBISPOT/covoc and the service at https://candy.hesge.ch/COVTriage.


Subject(s)
COVID-19 , Humans , COVID-19/diagnosis , Triage , Commerce , Internationality
4.
Stud Health Technol Inform ; 294: 839-843, 2022 May 25.
Article in English | MEDLINE | ID: mdl-35612222

ABSTRACT

The importance of genomic data for health is rapidly growing, but accessing and gathering information about variants from different sources is hindered by highly heterogeneous representations of variants, as outlined by clinical associations (AMP/ASCO/CAP) in their recommendations. To enable a smooth and effective retrieval of variant-containing documents from different resources, we developed a tool (https://goldorak.hesge.ch/synvar/) that generates for any given SNP - including variants not present in existing databases - its corresponding description at the genome, transcript and protein levels. It provides variant descriptions in the HGVS format as well as in many non-standard formats found in the literature, along with database identifiers. We present the SynVar service and evaluate its impact on the recall of a genomic variant curation-support service. Using SynVar to search for variants in the literature increases recall by 133.8% without a strong impact on precision (i.e. 93%).


Subject(s)
Genomics , Databases, Factual
5.
Stud Health Technol Inform ; 294: 849-853, 2022 May 25.
Article in English | MEDLINE | ID: mdl-35612224

ABSTRACT

The present study reports first attempts to automatically classify oncology treatment responses on the basis of the textual conclusion sections of radiology reports according to the RECIST classification. After a robust and extended manual annotation of 543 conclusion sections (5 to 50 words long), and after the training of several machine learning techniques (from traditional machine learning to deep learning), the best results show an accuracy score of 0.90 for a two-class classification (non-progressive vs. progressive disease) and of 0.82 for a four-class classification (complete response, partial response, stable disease, progressive disease), both with a Logistic Regression approach. Some innovative solutions are further suggested to improve these scores in the future.


Subject(s)
Radiology , Machine Learning , Natural Language Processing , Radiography , Research Report , Supervised Machine Learning
6.
Stud Health Technol Inform ; 294: 876-877, 2022 May 25.
Article in English | MEDLINE | ID: mdl-35612233

ABSTRACT

We present an analysis of supplementary materials of PubMed Central (PMC) articles and show their importance in indexing and searching biomedical literature, in particular for the emerging genomic medicine field. On a subset of articles from PubMed Central, we use text mining methods to extract MeSH terms from abstracts, full texts, and text-based supplementary materials. We find that the recall of MeSH annotations increases by about 5.9 percentage points (+20% in relative terms) when considering supplementary materials compared to using only abstracts. We further compare the supplementary material annotations with full-text annotations and find that the recall of MeSH terms increases by 1.5 percentage points (+3% in relative terms). Additionally, we analyze genetic variant mentions in abstracts and full texts and compare them with mentions found in supplementary text-based files. We find that the majority (about 99%) of variants are found in text-based supplementary files. In conclusion, we suggest that supplementary data should receive more attention from the information retrieval community, in particular in life and health sciences.
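
The abstract reports each gain both in percentage points and as a relative increase. The small sketch below shows how the two relate. The helper names are hypothetical, and the baseline it derives is a back-of-the-envelope inference from the rounded figures, not a value reported by the authors.

```python
def relative_gain(baseline, improved):
    """Relative increase of `improved` over `baseline` (0.20 means +20%)."""
    return (improved - baseline) / baseline


def implied_baseline(point_gain, relative):
    """Baseline implied by an absolute gain (in points) and a relative gain."""
    return point_gain / relative


# Abstracts only vs. abstracts plus supplementary materials:
# a gain of 5.9 points that is also a +20% relative gain.
baseline = implied_baseline(5.9, 0.20)  # about 29.5 (percent recall)
improved = baseline + 5.9               # about 35.4 (percent recall)
check = relative_gain(baseline, improved)
```

Under these assumptions the abstract-only recall would be roughly 29.5%, which shows why a gain of a few percentage points can correspond to a large relative increase when the baseline is low.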


Subject(s)
Medical Subject Headings , Text Messaging , Data Mining/methods , PubMed , Records
7.
Bioinformatics ; 38(9): 2595-2601, 2022 04 28.
Article in English | MEDLINE | ID: mdl-35274687

ABSTRACT

MOTIVATION: Identification and interpretation of clinically actionable variants is a critical bottleneck. Searching for evidence in the literature is mandatory according to ASCO/AMP/CAP practice guidelines; however, it is both labor-intensive and error-prone. We developed a system to perform triage of publications relevant to support an evidence-based decision. The system is also able to prioritize variants. Our system searches within pre-annotated collections such as MEDLINE and PubMed Central. RESULTS: We assess the search effectiveness of the system using three different experimental settings: literature triage, variant prioritization, and comparison of Variomes with LitVar. Almost two-thirds of the publications returned in the top-5 are relevant for clinical decision-support. Our approach enabled identifying 81.8% of clinically actionable variants in the top-3. Variomes retrieves on average 21.3% more articles than LitVar and returns the same number of results or more results than LitVar for 90% of the queries when tested on a set of 803 queries, thus establishing a new baseline for searching the literature about variants. AVAILABILITY AND IMPLEMENTATION: Variomes is publicly available at https://candy.hesge.ch/Variomes. Source code is freely available at https://github.com/variomes/sibtm-variomes. SynVar is publicly available at https://goldorak.hesge.ch/synvar. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genomics , Search Engine , Genomics/methods , Genome , PubMed , Software
8.
Front Res Metr Anal ; 6: 689803, 2021.
Article in English | MEDLINE | ID: mdl-34870074

ABSTRACT

The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and ensembles of them perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
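
A minimal sketch of token-level majority voting over several NER models follows. The abstract only says that "classical majority voting strategies" were used, so this is not the authors' implementation. It assumes that each model emits one label per token, and the tie-breaking rule (prefer the earliest model in the list) is an arbitrary choice made for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token label sequences from several models.

    `predictions` holds one label sequence per model, all the same length.
    Ties are broken in favour of the earliest model in the list, which is
    an assumption of this sketch rather than part of the cited method.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(seq) != length for seq in predictions):
        raise ValueError("all models must label the same tokens")
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # First label (in model order) that reaches the top vote count.
        voted.append(next(lab for lab in labels if counts[lab] == best))
    return voted


# Hypothetical outputs of three fine-tuned models on a three-token sentence.
model_a = ["B-CHEM", "O", "O"]
model_b = ["B-CHEM", "I-CHEM", "O"]
model_c = ["O", "I-CHEM", "B-DIS"]
ensemble = majority_vote([model_a, model_b, model_c])
```

Voting on each token independently can yield label sequences that are invalid under the BIO scheme, so a practical system would likely add a repair step or vote on whole entity spans instead.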

9.
Anal Chem ; 93(50): 16853-16861, 2021 12 21.
Article in English | MEDLINE | ID: mdl-34890188

ABSTRACT

The direct quantification of multiple ions in aqueous mixtures is achieved by combining an automated machine learning pipeline with transient potentiometric data obtained from a single miniaturized array of polymeric sensors electrodeposited on a conventional printed circuit board (PCB) substrate. A proof-of-concept system was demonstrated by employing 16 polymeric sensors in combination with features extracted from the transient differential voltages produced by these sensors when transitioning from a reference solution to a test solution, thereby obviating the need for a conventional reference electrode. A tree-based regression model enabled concentrations of various metal cations in pure solutions to be determined in less than 2 min. In a model mixture comprising Al3+, Cu2+, Na+, and Fe3+, the mean relative error was found to depend on the type of ion and varied between 1% for Fe3+ and 44% for Na+ in the concentration range 1-10 mg/L. Overall, a mean relative error of 16% was obtained for quantification of these four ions across a total of 124 tests in different solutions spanning concentrations between 2 and 360 mg/L. These results demonstrate how the analytical capability of a multiselective sensor array can leverage data-driven approaches through training by examples for accelerated testing and can be proposed to complement traditional analytical tools to meet industrial demands, including traceability of chemicals.


Subject(s)
Machine Learning , Cations
10.
Stud Health Technol Inform ; 270: 312-316, 2020 Jun 16.
Article in English | MEDLINE | ID: mdl-32570397

ABSTRACT

The encoding of Electronic Medical Records is a complex and time-consuming task. We report on a machine learning model for proposing diagnosis and procedure codes, based on a large realistic dataset of 245 000 electronic medical records from the University Hospitals of Geneva. Our study particularly focuses on the impact of training data quantity on the model's performance. We show that the performance of the models does not increase when encoded instances from previous years are used as training data. Furthermore, supervised models are shown to be highly perishable: we show a potential drop in performance of around 10% per year. Consequently, great and constant care must be exercised in designing and updating the content of such knowledge bases exploited by machine learning.


Subject(s)
Electronic Health Records , Machine Learning
11.
Stud Health Technol Inform ; 270: 884-888, 2020 Jun 16.
Article in English | MEDLINE | ID: mdl-32570509

ABSTRACT

The Swiss Variant Interpretation Platform for Oncology is a centralized, joint and curated database for clinical somatic variants piloted by a board of Swiss healthcare institutions and operated by the SIB Swiss Institute of Bioinformatics. To support this effort, SIB Text Mining designed a set of text analytics services. This report focuses on three of those services. First, the literature has been automatically annotated with a set of terminologies, resulting in a large annotated version of MEDLINE and PMC. Second, a generator of variant synonyms for single nucleotide variants has been developed using publicly available data resources, as well as patterns of non-standard formats often found in the literature. Third, a literature ranking service retrieves a ranked set of MEDLINE abstracts given a variant and optionally a diagnosis. The annotation of MEDLINE and PMC resulted in totals of 785,181,199 and 1,156,060,212 annotations, respectively, i.e. an average of 26 annotations per abstract and 425 per full-text article. The generator of variant synonyms retrieves up to 42 synonyms for a variant. The literature ranking service reaches a precision (P10) of 63%, which means that almost two-thirds of the top-10 returned abstracts are judged relevant. Further services will be implemented to complete this set, such as a service to retrieve relevant clinical trials for a patient and a literature ranking service for full-text articles.
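
The P10 figure quoted above is precision at rank 10. A minimal sketch of that metric is given below. The function name, document identifiers and relevance judgments are hypothetical, and the convention of dividing by k even when fewer than k results are returned is an assumption of this sketch.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are judged relevant.

    Divides by k even if fewer than k documents were returned, so missing
    results count as non-relevant (one common convention among several).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking for one variant query, with six relevant abstracts.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d2", "d4", "d5", "d7", "d9"}
p10 = precision_at_k(ranking, relevant, k=10)
```

A reported P10 is typically the mean of this per-query value over all evaluation queries.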


Subject(s)
Computational Biology , Data Mining , Abstracting and Indexing , Humans , MEDLINE , Switzerland
12.
Nucleic Acids Res ; 48(W1): W12-W16, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32379317

ABSTRACT

Thanks to recent efforts by the text mining community, biocurators now have access to plenty of good tools and Web interfaces for identifying and visualizing biomedical entities in literature. Yet, many of these systems start with a PubMed query, which is limited by strong Boolean constraints. Some semantic search engines exploit entities for Information Retrieval, and/or deliver relevance-based ranked results. Yet, they are not designed for supporting a specific curation workflow, and allow very limited control over the search process. The Swiss Institute of Bioinformatics Literature Services (SIBiLS) provide personalized Information Retrieval in the biological literature. Indeed, SIBiLS allow fully customizable search in semantically enriched contents, based on keywords and/or mapped biomedical entities from a growing set of standardized and legacy vocabularies. The services have been used and favourably evaluated to assist the curation of genes and gene products, by delivering customized literature triage engines to different curation teams. SIBiLS (https://candy.hesge.ch/SIBiLS) are freely accessible via REST APIs and are ready to empower any curation workflow, built on modern technologies scalable with big data: MongoDB and Elasticsearch. They cover MEDLINE and PubMed Central Open Access enriched by nearly 2 billion mapped biomedical entities, and are updated daily.


Subject(s)
Data Mining/methods , Search Engine , MEDLINE , Precision Medicine
13.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32367111

ABSTRACT

In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
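
The gap between the micro and macro F1-scores reported above (0.72 vs. 0.62) is typical when categories are unevenly represented. The sketch below illustrates how the two averages are computed. It is not the authors' evaluation code: the category names are borrowed from the abstract, while the per-category counts are invented for illustration.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """Return (micro_f1, macro_f1) for a dict of category -> (tp, fp, fn)."""
    # Macro: average the per-category F1, so every category weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # Micro: pool the counts first, so frequent categories dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# Hypothetical counts: one frequent, well-predicted category and one rare,
# poorly predicted category.
example = {"function": (90, 10, 10), "expression": (1, 4, 4)}
micro, macro = micro_macro_f1(example)
```

With these counts the macro average is pulled down by the rare category while the micro average stays close to the score of the frequent one.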


Subject(s)
Deep Learning , Databases, Protein , Knowledge Bases , Molecular Sequence Annotation , Proteins/genetics
14.
F1000Res ; 8, 2019.
Article in English | MEDLINE | ID: mdl-31824649

ABSTRACT

Intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) are now recognised as major determinants in cellular regulation. This white paper presents a roadmap for future e-infrastructure developments in the field of IDP research within the ELIXIR framework. The goal of these developments is to drive the creation of high-quality tools and resources to support the identification, analysis and functional characterisation of IDPs. The roadmap is the result of a workshop titled "An intrinsically disordered protein user community proposal for ELIXIR" held at the University of Padua. The workshop, and further consultation with the members of the wider IDP community, identified the key priority areas for the roadmap including the development of standards for data annotation, storage and dissemination; integration of IDP data into the ELIXIR Core Data Resources; and the creation of benchmarking criteria for IDP-related software. Here, we discuss these areas of priority, how they can be implemented in cooperation with the ELIXIR platforms, and their connections to existing ELIXIR Communities and international consortia. The article provides a preliminary blueprint for an IDP Community in ELIXIR and is an appeal to identify and involve new stakeholders.


Subject(s)
Intrinsically Disordered Proteins/metabolism
15.
ACS Appl Mater Interfaces ; 11(27): 24037-24046, 2019 Jul 10.
Article in English | MEDLINE | ID: mdl-31251575

ABSTRACT

Adsorption heat pumps offer a clean, zero-emission technology for universally applicable cooling or heating utilizing water as a refrigerant and waste or renewable heat as driving energy instead of electricity. Despite their attractive environmentally friendly prospects, the broader application of such classes of heat pumps has not yet been possible, mainly because of the low power density of adsorption heat exchangers and the corresponding large size and high cost of the adsorption heat pumps. We report an inexpensive route for the fabrication of zeolite coatings with high adsorption power density based on the bottom-up assembly of colloids directed by magnetic and capillary forces. Such an assembly process relies on the chaining of oil droplets under an external magnetic field during deposition of the coating, followed by the formation of a percolating network of bridged adsorbent particles upon drying. This results in vertically open channels and thermal bridges that facilitate directed mass and heat transport across the structured zeolite coating during sorption cycles. By reaching up to 3.3-fold higher performance than their unstructured counterparts using readily available zeolite as an adsorbent material, the architectured coatings produced through this facile, upscalable approach hold great potential for next-generation adsorption heat pumps.

16.
Arch Suicide Res ; 23(4): 576-589, 2019.
Article in English | MEDLINE | ID: mdl-29883272

ABSTRACT

Among the different research methods on suicide notes, the theoretical conceptual approach allows a particularly thorough understanding of the suicidal act. The present study focuses on 78 suicide notes collected in Geneva, Switzerland, between 2006 and 2014. The socio-demographic and medical data of the note writers were collected. The conceptual content of the notes was analyzed by two independent raters using the Leenaars method. The results showed that the concepts that appeared most frequently in the notes were: Inability to adjust, Rejection-aggression, Unbearable pain, and Ego. Very few differences were found in the conceptual content when stratified for age, gender, socio-economic status, or religion. This study confirms and complements the findings of similar studies on the content of suicide notes.


Subject(s)
Interpersonal Relations , Mental Disorders , Motivation , Narration , Suicidal Ideation , Suicide Prevention , Suicide , Age Factors , Demography , Female , Health Status , Humans , Male , Mental Disorders/drug therapy , Mental Disorders/epidemiology , Middle Aged , Sex Factors , Socioeconomic Factors , Suicide/psychology , Suicide/statistics & numerical data , Switzerland/epidemiology
17.
Database (Oxford) ; 2018, 2018 01 01.
Article in English | MEDLINE | ID: mdl-30576492

ABSTRACT

The development of efficient text-mining tools promises to boost the curation workflow by significantly reducing the time needed to process the literature into biological databases. We have developed a curation support tool, neXtA5, that provides a search engine coupled with an annotation system directly integrated into a biocuration workflow. neXtA5 assists curation with modules optimized for the various curation tasks: document triage, entity recognition and information extraction. Here, we describe the evaluation of neXtA5 by expert curators. We first assessed the annotations of two independent curators to provide a baseline for comparison. To evaluate the performance of neXtA5, we submitted requests and compared the neXtA5 results with the manual curation. The analysis focuses on the usability of neXtA5 to support the curation of two types of data: biological processes (BPs) and diseases (Ds). We evaluated the relevance of the papers proposed as well as the recall and precision of the suggested annotations. The evaluation of the precision of document triage by neXtA5 showed that both curators agree with neXtA5 for 67% (BP) and 63% (D) of abstracts, while curators agree on accepting or rejecting an abstract ~80% of the time. Hence, the precision of the triage system is satisfactory. For concept extraction, curators approved 35% (BP) and 25% (D) of the neXtA5 annotations. Conversely, neXtA5 successfully annotated up to 36% (BP) and 68% (D) of the terms identified by curators. The user feedback obtained in these tests highlighted the need for improvement in the ranking function of neXtA5 annotations. Therefore, we transformed the information extraction component into an annotation ranking system. This improvement results in a top precision (precision at first rank) of 59% (D) and 63% (BP). These results suggest that when considering only the first extracted entity, the current system achieves a precision comparable with expert biocurators.


Subject(s)
Data Curation/methods , Data Mining/methods , Databases, Factual , Software , Humans
18.
Database (Oxford) ; 2018, 2018 01 01.
Article in English | MEDLINE | ID: mdl-30329035

ABSTRACT

The text-mining services for kinome curation track, part of BioCreative VI, proposed a competition to assess the effectiveness of text mining to perform literature triage. The track exploited an unpublished curated data set from the neXtProt database. This data set contained comprehensive annotations for 300 human protein kinases. For a given protein and a given curation axis [diseases or gene ontology (GO) biological processes], participants' systems had to identify and rank relevant articles in a collection of 5.2 M MEDLINE citations (task 1) or 530 000 full-text articles (task 2). Explored strategies comprised named-entity recognition and machine-learning frameworks. For the latter approach, participants developed methods to derive a set of negative instances, as the databases typically do not store articles that were judged irrelevant by curators. The supervised approaches proposed by the participating groups achieved significant improvements compared to the baseline established in a previous study and compared to a basic PubMed search.


Subject(s)
Data Mining , Protein Kinases/metabolism , Databases, Factual , Humans , Periodicals as Topic
19.
PLoS One ; 13(1): e0190028, 2018.
Article in English | MEDLINE | ID: mdl-29293556

ABSTRACT

The openEHR specifications are designed to support implementation of flexible and interoperable Electronic Health Record (EHR) systems. Despite the increasing number of solutions based on the openEHR specifications, it is difficult to find publicly available healthcare datasets in the openEHR format that can be used to test, compare and validate different data persistence mechanisms for openEHR. To foster research on openEHR servers, we present the openEHR Benchmark Dataset, ORBDA, a very large healthcare benchmark dataset encoded using the openEHR formalism. To construct ORBDA, we extracted and cleaned a de-identified dataset from the Brazilian National Healthcare System (SUS) containing information on hospitalisations and high-complexity procedures, and formalised it using a set of openEHR archetypes and templates. Then, we implemented a tool to enrich the raw relational data and convert it into the openEHR model using the openEHR Java reference model library. The ORBDA dataset is available in composition, versioned composition and EHR openEHR representations in XML and JSON formats. In total, the dataset contains more than 150 million composition records. We describe the dataset and provide means to access it. Additionally, we demonstrate the usage of ORBDA for evaluating the insertion throughput and query latency of some NoSQL database management systems. We believe that ORBDA is a valuable asset for assessing storage models for openEHR-based information systems during the software engineering process. It may also be a suitable component in future standardised benchmarking of available openEHR storage platforms.


Subject(s)
Benchmarking , Database Management Systems , Datasets as Topic , Electronic Health Records , Brazil