Results 1 - 20 of 38
1.
Bioinformatics ; 39(1), 2023 01 01.
Article in English | MEDLINE | ID: mdl-36511598

ABSTRACT

MOTIVATION: Since early 2020, the coronavirus disease 2019 (COVID-19) pandemic has confronted the biomedical community with an unprecedented challenge. The rapid spread of COVID-19 and ease of transmission seen worldwide is due to increased population flow and international trade. Front-line medical care, treatment research and vaccine development also require rapid and informative interpretation of the literature and COVID-19 data produced around the world, with 177 500 papers published between January 2020 and November 2021, i.e. almost 8500 papers per month. To extract knowledge and enable interoperability across resources, we developed the COVID-19 Vocabulary (COVoc), an application ontology related to the research on this pandemic. The main objective of COVoc development was to enable seamless navigation from biomedical literature to core databases and tools of ELIXIR, a European-wide intergovernmental organization for life sciences. RESULTS: This collaborative work provided data integration into SIB Literature services, an application ontology (COVoc) and a triage service named COVTriage and based on annotation processing to search for COVID-related information across pre-defined aspects with daily updates. Thanks to its interoperability potential, COVoc lends itself to wider applications, hopefully through further connections with other novel COVID-19 ontologies as has been established with Coronavirus Infectious Disease Ontology. AVAILABILITY AND IMPLEMENTATION: The data at https://github.com/EBISPOT/covoc and the service at https://candy.hesge.ch/COVTriage.


Subject(s)
COVID-19 , Humans , COVID-19/diagnosis , Triage , Commerce , Internationality
2.
Bioinformatics ; 38(9): 2595-2601, 2022 04 28.
Article in English | MEDLINE | ID: mdl-35274687

ABSTRACT

MOTIVATION: Identification and interpretation of clinically actionable variants is a critical bottleneck. Searching for evidence in the literature is mandatory according to ASCO/AMP/CAP practice guidelines; however, it is both labor-intensive and error-prone. We developed a system to perform triage of publications relevant to support an evidence-based decision. The system is also able to prioritize variants. Our system searches within pre-annotated collections such as MEDLINE and PubMed Central. RESULTS: We assess the search effectiveness of the system using three different experimental settings: literature triage; variant prioritization and comparison of Variomes with LitVar. Almost two-thirds of the publications returned in the top-5 are relevant for clinical decision-support. Our approach enabled identifying 81.8% of clinically actionable variants in the top-3. Variomes retrieves on average +21.3% more articles than LitVar and returns the same number of results or more results than LitVar for 90% of the queries when tested on a set of 803 queries; thus, establishing a new baseline for searching the literature about variants. AVAILABILITY AND IMPLEMENTATION: Variomes is publicly available at https://candy.hesge.ch/Variomes. Source code is freely available at https://github.com/variomes/sibtm-variomes. SynVar is publicly available at https://goldorak.hesge.ch/synvar. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
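The top-5 relevance figure reported above is a precision-at-k style measure. The following is a minimal sketch of how such a figure can be computed, not the authors' evaluation code; the function name `precision_at_k`, the document identifiers and the relevance judgments are all made up for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return the fraction of the top-k ranked documents that are relevant.

    The denominator is always k, so a ranking that returns fewer than k
    documents is penalised (a common convention, assumed here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Hypothetical ranking returned for one variant query, best hit first.
ranked = ["pmid_a", "pmid_b", "pmid_c", "pmid_d", "pmid_e", "pmid_f"]
# Hypothetical set of publications judged relevant by a curator.
relevant = {"pmid_a", "pmid_c", "pmid_d", "pmid_x"}

p5 = precision_at_k(ranked, relevant, 5)
print(f"P@5 = {p5:.2f}")
```

Averaging this value over all test queries gives a collection-level figure comparable in kind to the one quoted in the abstract.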


Subject(s)
Genomics , Search Engine , Genomics/methods , Genome , PubMed , Software
3.
Nucleic Acids Res ; 48(W1): W12-W16, 2020 07 02.
Article in English | MEDLINE | ID: mdl-32379317

ABSTRACT

Thanks to recent efforts by the text mining community, biocurators now have access to plenty of good tools and Web interfaces for identifying and visualizing biomedical entities in literature. Yet, many of these systems start with a PubMed query, which is limited by strong Boolean constraints. Some semantic search engines exploit entities for Information Retrieval, and/or deliver relevance-based ranked results. Yet, they are not designed for supporting a specific curation workflow, and allow very limited control on the search process. The Swiss Institute of Bioinformatics Literature Services (SIBiLS) provide personalized Information Retrieval in the biological literature. Indeed, SIBiLS allow fully customizable search in semantically enriched contents, based on keywords and/or mapped biomedical entities from a growing set of standardized and legacy vocabularies. The services have been used and favourably evaluated to assist the curation of genes and gene products, by delivering customized literature triage engines to different curation teams. SIBiLS (https://candy.hesge.ch/SIBiLS) are freely accessible via REST APIs and are ready to empower any curation workflow, built on modern technologies scalable with big data: MongoDB and Elasticsearch. They cover MEDLINE and PubMed Central Open Access enriched by nearly 2 billion mapped biomedical entities, and are updated daily.


Subject(s)
Data Mining/methods , Search Engine , MEDLINE , Precision Medicine
4.
Nucleic Acids Res ; 48(D1): D269-D276, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31713636

ABSTRACT

The Database of Protein Disorder (DisProt, URL: https://disprot.org) provides manually curated annotations of intrinsically disordered proteins from the literature. Here we report recent developments with DisProt (version 8), including the doubling of protein entries, a new disorder ontology, improvements of the annotation format and a completely new website. The website includes a redesigned graphical interface, a better search engine, a clearer API for programmatic access and a new annotation interface that integrates text mining technologies. The new entry format provides a greater flexibility, simplifies maintenance and allows the capture of more information from the literature. The new disorder ontology has been formalized and made interoperable by adopting the OWL format, as well as its structure and term definitions have been improved. The new annotation interface has made the curation process faster and more effective. We recently showed that new DisProt annotations can be effectively used to train and validate disorder predictors. We believe the growth of DisProt will accelerate, contributing to the improvement of function and disorder predictors and therefore to illuminate the 'dark' proteome.


Subject(s)
Databases, Protein , Intrinsically Disordered Proteins/chemistry , Biological Ontologies , Data Curation , Molecular Sequence Annotation
5.
BMC Bioinformatics ; 15 Suppl 1: S15, 2014.
Article in English | MEDLINE | ID: mdl-24564220

ABSTRACT

BACKGROUND: The large increase in the size of patent collections has led to the need of efficient search strategies. But the development of advanced text-mining applications dedicated to patents of the biomedical field remains rare, in particular to address the needs of the pharmaceutical & biotech industry, which intensively uses patent libraries for competitive intelligence and drug development. METHODS: We describe here the development of an advanced retrieval engine to search information in patent collections in the field of medicinal chemistry. We investigate and combine different strategies and evaluate their respective impact on the performance of the search engine applied to various search tasks, which covers the putatively most frequent search behaviours of intellectual property officers in medical chemistry: 1) a prior art search task; 2) a technical survey task; and 3) a variant of the technical survey task, sometimes called known-item search task, where a single patent is targeted. RESULTS: The optimal tuning of our engine resulted in a top-precision of 6.76% for the prior art search task, 23.28% for the technical survey task and 46.02% for the variant of the technical survey task. We observed that co-citation boosting was an appropriate strategy to improve prior art search tasks, while IPC classification of queries was improving retrieval effectiveness for technical survey tasks. Surprisingly, the use of the full body of the patent was always detrimental for search effectiveness. It was also observed that normalizing biomedical entities using curated dictionaries had simply no impact on the search tasks we evaluate. The search engine was finally implemented as a web-application within Novartis Pharma. The application is briefly described in the report. CONCLUSIONS: We have presented the development of a search engine dedicated to patent search, based on state of the art methods applied to patent corpora. 
We have shown that a proper tuning of the system to adapt to the various search tasks clearly increases the effectiveness of the system. We conclude that different search tasks demand different information retrieval engines' settings in order to yield optimal end-user retrieval.


Subject(s)
Chemistry, Pharmaceutical , Patents as Topic , Search Engine/methods , Algorithms , Information Storage and Retrieval , Internet , Small Molecule Libraries
6.
BMC Bioinformatics ; 14: 104, 2013 Mar 22.
Article in English | MEDLINE | ID: mdl-23517090

ABSTRACT

BACKGROUND: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. RESULTS: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large scale proteomics experiments. CONCLUSIONS: The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, providing that a thorough understanding of the working process and requirements are first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.
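The pattern-matching, rule-based extraction described above can be pictured with a small sketch. This is an assumed illustration of the general technique, not the rules used by the published system: the two regular expressions, the function name `extract_phospho_sites` and the sample text are hypothetical, and only phosphorylation on Ser/Thr/Tyr written in three-letter form is covered.

```python
import re
from typing import List, Tuple

# Hypothetical site pattern: residue in three-letter code followed by a position,
# e.g. "Ser473" or "Thr-308".
SITE_PATTERN = re.compile(r"\b(Ser|Thr|Tyr)[- ]?(\d+)\b")
# Hypothetical trigger pattern: any word starting with "phosphorylat".
TRIGGER_PATTERN = re.compile(r"\bphosphorylat\w*", re.IGNORECASE)


def extract_phospho_sites(text: str) -> List[Tuple[str, str, int]]:
    """Return (sentence, residue, position) for sentences that mention a site."""
    results = []
    # Naive sentence split on end-of-sentence punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not TRIGGER_PATTERN.search(sentence):
            continue  # rule: a sentence without a PTM trigger word is ignored
        for match in SITE_PATTERN.finditer(sentence):
            results.append((sentence, match.group(1), int(match.group(2))))
    return results


# Illustrative input text written for this example.
sample = (
    "Akt is phosphorylated at Ser473 by mTORC2. "
    "The protein localizes to the membrane. "
    "Phosphorylation of Thr-308 requires PDK1."
)

for sentence, residue, position in extract_phospho_sites(sample):
    print(residue, position, "<-", sentence)
```

A real curation tool would need many more patterns (one-letter codes, other modification types, ranges of sites) and a step that links each site to a protein candidate.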


Subject(s)
Data Mining/methods , Databases, Protein , Knowledge Bases , Protein Processing, Post-Translational , Humans , Molecular Sequence Annotation , Proteomics
7.
Database (Oxford) ; 2023, 2023 03 31.
Article in English | MEDLINE | ID: mdl-37002680

ABSTRACT

The curation of genomic variants requires collecting evidence not only in variant knowledge bases but also in the literature. However, some variants result in no match when searched in the scientific literature. Indeed, it has been reported that a significant subset of information related to genomic variants are not reported in the full text, but only in the supplementary materials associated with a publication. In the study, we present an evaluation of the use of supplementary data (SD) to improve the retrieval of relevant scientific publications for variant curation. Our experiments show that searching SD enables to significantly increase the volume of documents retrieved for a variant, thus reducing by ∼63% the number of variants for which no match is found in the scientific literature. SD thus represent a paramount source of information for curating variants of unknown significance and should receive more attention by global research infrastructures, which maintain literature search engines. Database URL https://www.expasy.org/resources/variomes.


Subject(s)
Genomics , Search Engine , Databases, Factual
8.
Front Digit Health ; 5: 1195017, 2023.
Article in English | MEDLINE | ID: mdl-37388252

ABSTRACT

Objectives: This study explores Artificial Intelligence and Natural Language Processing techniques to support the automatic assignment of the four Response Evaluation Criteria in Solid Tumors (RECIST) scales based on radiology reports. We also evaluate how languages and institutional specificities of Swiss teaching hospitals are likely to affect the quality of the classification in French and German. Methods: In our approach, 7 machine learning methods were evaluated to establish a strong baseline. Then, robust models were built, fine-tuned according to the language (French and German), and compared with the expert annotation. Results: The best strategies yield average F1-scores of 90% and 86% respectively for the 2-class (Progressive/Non-progressive) and the 4-class (Progressive Disease, Stable Disease, Partial Response, Complete Response) RECIST classification tasks. Conclusions: These results are competitive with manual labeling as measured by the Matthews correlation coefficient and Cohen's Kappa (79% and 76%). On this basis, we confirm the capacity of specific models to generalize on new unseen data and we assess the impact of using Pre-trained Language Models (PLMs) on the accuracy of the classifiers.
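The three measures named in this abstract (F1-score, the Matthews correlation coefficient and Cohen's Kappa) can all be derived from a confusion matrix. The sketch below covers only the binary (Progressive/Non-progressive) case with made-up counts; the function name `binary_agreement_metrics` is hypothetical and this is not the study's evaluation code.

```python
import math
from typing import Dict


def binary_agreement_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute F1, Matthews correlation coefficient and Cohen's kappa.

    Assumes a non-degenerate confusion matrix (no empty row or column),
    otherwise some denominators below would be zero.
    """
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (observed - expected) / (1 - expected)

    return {"f1": f1, "mcc": mcc, "kappa": kappa}


# Made-up counts: model prediction versus expert annotation on 100 reports.
metrics = binary_agreement_metrics(tp=40, fp=10, fn=10, tn=40)
print(metrics)
```

With these balanced counts the Matthews coefficient and kappa coincide; on skewed class distributions they generally differ, which is one reason to report both.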

9.
J Med Internet Res ; 14(3): e73, 2012 May 29.
Article in English | MEDLINE | ID: mdl-22642960

ABSTRACT

BACKGROUND: Antimicrobial resistance has reached globally alarming levels and is becoming a major public health threat. Lack of efficacious antimicrobial resistance surveillance systems was identified as one of the causes of increasing resistance, due to the lag time between new resistances and alerts to care providers. Several initiatives to track drug resistance evolution have been developed. However, no effective real-time and source-independent antimicrobial resistance monitoring system is available publicly. OBJECTIVE: To design and implement an architecture that can provide real-time and source-independent antimicrobial resistance monitoring to support transnational resistance surveillance. In particular, we investigated the use of a Semantic Web-based model to foster integration and interoperability of interinstitutional and cross-border microbiology laboratory databases. METHODS: Following the agile software development methodology, we derived the main requirements needed for effective antimicrobial resistance monitoring, from which we proposed a decentralized monitoring architecture based on the Semantic Web stack. The architecture uses an ontology-driven approach to promote the integration of a network of sentinel hospitals or laboratories. Local databases are wrapped into semantic data repositories that automatically expose local computing-formalized laboratory information in the Web. A central source mediator, based on local reasoning, coordinates the access to the semantic end points. On the user side, a user-friendly Web interface provides access and graphical visualization to the integrated views. RESULTS: We designed and implemented the online Antimicrobial Resistance Trend Monitoring System (ARTEMIS) in a pilot network of seven European health care institutions sharing 70+ million triples of information about drug resistance and consumption. 
Evaluation of the computing performance of the mediator demonstrated that, on average, query response time was a few seconds (mean 4.3, SD 0.1 × 10^2 seconds). Clinical pertinence assessment showed that resistance trends automatically calculated by ARTEMIS had a strong positive correlation with the European Antimicrobial Resistance Surveillance Network (EARS-Net) (ρ = .86, P < .001) and the Sentinel Surveillance of Antibiotic Resistance in Switzerland (SEARCH) (ρ = .84, P < .001) systems. Furthermore, mean resistance rates extracted by ARTEMIS were not significantly different from those of either EARS-Net (∆ = ±0.130; 95% confidence interval -0 to 0.030; P < .001) or SEARCH (∆ = ±0.042; 95% confidence interval -0.004 to 0.028; P = .004). CONCLUSIONS: We introduce a distributed monitoring architecture that can be used to build transnational antimicrobial resistance surveillance networks. Results indicated that the Semantic Web-based approach provided an efficient and reliable solution for development of eHealth architectures that enable online antimicrobial resistance monitoring from heterogeneous data sources. In future, we expect that more health care institutions can join the ARTEMIS network so that it can provide a large European and wider biosurveillance network that can be used to detect emerging bacterial resistance in a multinational context and support public health actions.


Subject(s)
International Cooperation , Internet , Population Surveillance , Computer Simulation , Drug Resistance, Microbial , Software , User-Computer Interface
10.
Stud Health Technol Inform ; 180: 204-9, 2012.
Article in English | MEDLINE | ID: mdl-22874181

ABSTRACT

Patent collections contain a substantial amount of medical knowledge, but existing tools have been reported to lack useful functionalities. We present here the development of TWINC, an advanced search engine dedicated to patent retrieval in the domain of health and life sciences. Our tool embeds two search modes: an ad hoc search to retrieve relevant patents given a short query and a related patent search to retrieve similar patents given a patent. Both search modes rely on tuning experiments performed during several patent retrieval competitions. Moreover, TWINC is enhanced with interactive modules, such as chemical query expansion, which is of primary importance to cope with the various ways of naming biomedical entities. While the related patent search showed promising performance, the ad hoc search produced fairly mixed results. Nonetheless, TWINC performed well during the Chemathlon task of the PatOlympics competition and experts appreciated its usability.


Subject(s)
Chemistry, Pharmaceutical/methods , Data Mining/methods , Database Management Systems , Databases, Pharmaceutical , Internet , Patents as Topic , Search Engine/methods , User-Computer Interface
11.
Stud Health Technol Inform ; 180: 210-4, 2012.
Article in English | MEDLINE | ID: mdl-22874182

ABSTRACT

We present a new approach to perform biomedical documents classification and prioritization for the Comparative Toxicogenomics Database (CTD). This approach is motivated by needs such as literature curation, in particular applied to the human health environment domain. The unique integration of chemical, genes/proteins and disease data in the biomedical literature may advance the identification of exposure and disease biomarkers, mechanisms of chemical actions, and the complex aetiologies of chronic diseases. Our approach aims to assist biomedical researchers when searching for relevant articles for CTD. The task is functionally defined as a binary classification task, where selected articles must also be ranked by order of relevance. We design a SVM classifier, which combines three main feature sets: an information retrieval system (EAGLi), a biomedical named-entity recognizer (MeSH term extraction), a gene normalization (GN) service (NormaGene) and an ad-hoc keyword recognizer for diseases and chemicals. The evaluation of the gene identification module was done on BioCreativeIII test data. Disease normalization is achieved with 95% precision and 93% of recall. The evaluation of the classification was done on the corpus provided by BioCreative organizers in 2012. The approach showed promising performance on the test data.


Subject(s)
Abstracting and Indexing/methods , Data Mining/methods , Databases, Chemical , Databases, Genetic , Drug-Related Side Effects and Adverse Reactions/classification , Periodicals as Topic/classification , Toxicogenetics/methods , Database Management Systems , Humans , User-Computer Interface
12.
Stud Health Technol Inform ; 174: 121-5, 2012.
Article in English | MEDLINE | ID: mdl-22491124

ABSTRACT

Health-related information retrieval is complicated by the variety of nomenclatures available to name entities, since different communities of users will use different ways to name the same entity. We present in this report the development and evaluation of a user-friendly interactive Web application aiming at facilitating health-related patent search. Our tool, called TWINC, relies on a search engine tuned during several patent retrieval competitions, enhanced with intelligent interaction modules, such as chemical query, normalization and expansion. While the functionality of related article search showed promising performance, the ad hoc search produced fairly mixed results. Nonetheless, TWINC performed well during the PatOlympics competition and was appreciated by intellectual property experts. This result should be balanced by the limited evaluation sample. We can also assume that it can be customized to be applied in corporate search environments to process domain and company-specific vocabularies, including non-English literature and patent reports.


Subject(s)
Information Storage and Retrieval/methods , Internet , Patents as Topic , Search Engine/methods , User-Computer Interface , Artificial Intelligence , Humans
14.
Stud Health Technol Inform ; 169: 477-81, 2011.
Article in English | MEDLINE | ID: mdl-21893795

ABSTRACT

We present exploratory investigations of multimodal mining to help designing clinical guidelines for antibiotherapy. Our approach is based on the assumption that combining various sources of data, such as the literature, a clinical datawarehouse, as well as information regarding costs will result in better recommendations. Compared to our baseline recommendation system based on a question-answering engine built on top of PubMed, an improvement of +16% is observed when clinical data (i.e. resistance profiles) are injected into the model. In complement to PubMed, an alternative search strategy is reported, which is significantly improved by the use of the combined multimodal approach. These results suggest that combining literature-based discovery with structured data mining can significantly improve effectiveness of decision-support systems for authors of clinical practice guidelines.


Subject(s)
Anti-Bacterial Agents/therapeutic use , Practice Guidelines as Topic , Statistics as Topic/methods , Algorithms , Anti-Bacterial Agents/economics , Computer Systems , Decision Support Systems, Clinical , Drug Costs , Humans , National Institutes of Health (U.S.) , PubMed , Staphylococcus aureus/metabolism , Staphylococcus epidermidis/metabolism , United States
15.
Stud Health Technol Inform ; 270: 312-316, 2020 Jun 16.
Article in English | MEDLINE | ID: mdl-32570397

ABSTRACT

The encoding of Electronic Medical Records is a complex and time-consuming task. We report on a machine learning model for proposing diagnoses and procedures codes, from a large realistic dataset of 245 000 electronic medical records at the University Hospitals of Geneva. Our study particularly focuses on the impact of training data quantity on the model's performances. We show that the performances of the models do not increase while encoded instances from previous years are exploited for learning data. Furthermore, supervised models are shown to be highly perishable: we show a potential drop in performances of around -10% per year. Consequently, great and constant care must be exercised for designing and updating the content of such knowledge bases exploited by machine learning.


Subject(s)
Electronic Health Records , Machine Learning
16.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32367111

ABSTRACT

In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
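The gap between the micro and macro F1-scores quoted above comes from how per-category results are aggregated. The sketch below illustrates the two averages under assumed inputs: the category names are taken from the abstract, but the counts and the helper names `f1_from_counts` and `micro_macro_f1` are invented for illustration and do not reproduce the paper's results.

```python
from typing import Dict, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in terms of true/false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) from per-category (tp, fp, fn) counts."""
    # Micro: pool the counts over all categories, then compute one F1.
    total_tp = sum(tp for tp, _, _ in counts.values())
    total_fp = sum(fp for _, fp, _ in counts.values())
    total_fn = sum(fn for _, _, fn in counts.values())
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Made-up (tp, fp, fn) counts for three categories of differing frequency.
counts = {
    "function": (90, 10, 10),
    "interaction": (5, 5, 15),
    "expression": (20, 10, 10),
}

micro, macro = micro_macro_f1(counts)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Because the micro average is dominated by frequent categories, a model that does well on common categories and poorly on rare ones shows a higher micro than macro score, the same direction as the figures in the abstract.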


Subject(s)
Deep Learning , Databases, Protein , Knowledge Bases , Molecular Sequence Annotation , Proteins/genetics
17.
Stud Health Technol Inform ; 270: 884-888, 2020 Jun 16.
Article in English | MEDLINE | ID: mdl-32570509

ABSTRACT

The Swiss Variant Interpretation Platform for Oncology is a centralized, joint and curated database for clinical somatic variants piloted by a board of Swiss healthcare institutions and operated by the SIB Swiss Institute of Bioinformatics. To support this effort, SIB Text Mining designed a set of text analytics services. This report focuses on three of those services. First, the automatic annotations of the literature with a set of terminologies have been performed, resulting in a large annotated version of MEDLINE and PMC. Second, a generator of variant synonyms for single nucleotide variants has been developed using publicly available data resources, as well as patterns of non-standard formats, often found in the literature. Third, a literature ranking service enables to retrieve a ranked set of MEDLINE abstracts given a variant and optionally a diagnosis. The annotation of MEDLINE and PMC resulted in a total of respectively 785,181,199 and 1,156,060,212 annotations, which means an average of 26 and 425 annotations per abstract and full-text article. The generator of variant synonyms enables to retrieve up to 42 synonyms for a variant. The literature ranking service reaches a precision (P10) of 63%, which means that almost two-thirds of the top-10 returned abstracts are judged relevant. Further services will be implemented to complete this set of services, such as a service to retrieve relevant clinical trials for a patient and a literature ranking service for full-text articles.
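The idea of a variant synonym generator can be sketched for the simplest case, a protein-level amino acid substitution. This is an assumed illustration only: the function name `protein_variant_synonyms` and the five output patterns are hypothetical, and the published generator draws on public data resources and many more non-standard formats than shown here.

```python
from typing import List

# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def protein_variant_synonyms(ref: str, position: int, alt: str) -> List[str]:
    """Return a few textual forms of a substitution such as V600E.

    `ref` and `alt` are one-letter amino acid codes; a KeyError is raised
    for anything outside the 20 standard residues.
    """
    ref3, alt3 = THREE_LETTER[ref], THREE_LETTER[alt]
    candidates = [
        f"{ref}{position}{alt}",          # short form common in the literature
        f"p.{ref}{position}{alt}",        # one-letter form with protein prefix
        f"{ref3}{position}{alt3}",        # three-letter form without prefix
        f"p.{ref3}{position}{alt3}",      # HGVS-style three-letter form
        f"p.({ref3}{position}{alt3})",    # HGVS-style predicted consequence
    ]
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))


for synonym in protein_variant_synonyms("V", 600, "E"):
    print(synonym)
```

A literature search can then query for any of the generated forms, which is what lets a ranking service match articles that do not use the standard notation.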


Subject(s)
Computational Biology , Data Mining , Abstracting and Indexing , Humans , MEDLINE , Switzerland
18.
Stud Health Technol Inform ; 150: 569-73, 2009.
Article in English | MEDLINE | ID: mdl-19745375

ABSTRACT

We propose a rule generation approach to automatically acquire structured rules that can be used in decision support systems for drug prescription. We apply a question-answering engine to answer specific information requests. The rule generation is seen as an equation problem, where the factors are known items of the rule (e.g., an infectious disease, caused by a given bacterium) and solutions are answered by the engine (e.g., some antibiotics). A top precision of 0.64 is reported, which means that, for about two thirds of the knowledge rules of the benchmark, one of the recommended antibiotics was automatically acquired by the rule generation method. These results suggest that a significant fraction of the medical knowledge can be obtained by such an automatic text mining approach.


Subject(s)
Anti-Bacterial Agents/therapeutic use , Decision Support Systems, Clinical , Electronic Data Processing , Communicable Diseases , Humans , Practice Guidelines as Topic
19.
Stud Health Technol Inform ; 150: 175-9, 2009.
Article in English | MEDLINE | ID: mdl-19745292

ABSTRACT

The expansion of biomedical knowledge, reductions in computing costs and spread of IT facilities have led to an explosion of the biomedical electronic data. However, these data are rarely integrated and analysed because of lack of tools. The integration process is complex due to technical and semantic heterogeneity as well as lack of reliability in such distributed system. In addition, for the specific case of biomedical data, privacy is a crucial constraint. This paper presents a pilot system that will be used in the European FP7 DebugIT project to integrate biomedical data from several healthcare centres across Europe.


Subject(s)
Information Management , Medical Informatics/organization & administration , Systems Integration , Europe
20.
BMC Bioinformatics ; 9 Suppl 3: S9, 2008 Apr 11.
Article in English | MEDLINE | ID: mdl-18426554

ABSTRACT

BACKGROUND: This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Functions) as defined in ENTREZ-Gene based on a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. In the suggested approach we merge two independent sentence extraction strategies. The first proposed strategy (LASt) uses argumentative features, inspired by discourse-analysis models. The second extraction scheme (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence; thus providing a full ranking of all possible candidate GeneRiFs. A combination of the two approaches is proposed, which also aims at reducing the size of the selected segment by filtering out non-content bearing rhetorical phrases. RESULTS: Based on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy is already competitive (52.78%). When used in a combined approach, the extraction task clearly shows improvement, achieving a Dice score of over 57% (+10%). CONCLUSIONS: Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.
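The Dice score used to compare an extracted sentence with a reference GeneRiF can be illustrated as follows. This sketch uses the plain set-based Dice coefficient over lowercased whitespace tokens, which is an assumption; the evaluation in the paper may rely on a different Dice variant. The function name `dice_score` and the example sentences are made up.

```python
def dice_score(candidate: str, reference: str) -> float:
    """Set-based Dice coefficient between the word sets of two strings."""
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(reference.lower().split())
    if not cand_tokens and not ref_tokens:
        return 1.0  # two empty strings are treated as identical
    overlap = len(cand_tokens & ref_tokens)
    return 2 * overlap / (len(cand_tokens) + len(ref_tokens))


# Hypothetical extracted sentence and hypothetical reference annotation.
extracted = "p53 regulates apoptosis in tumor cells"
reference = "p53 regulates apoptosis"

print(f"Dice = {dice_score(extracted, reference):.3f}")
```

The example also shows why trimming non-content-bearing phrases helps: every extra word in the extracted segment enlarges the denominator without adding overlap.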


Subject(s)
Algorithms , Genes/genetics , MEDLINE , Natural Language Processing , Pattern Recognition, Automated/methods , Proteins/classification , Proteins/genetics , Artificial Intelligence , Sensitivity and Specificity , Terminology as Topic , Vocabulary, Controlled