Results 1 - 20 of 24
2.
Nat Commun; 13(1): 6793, 2022 Nov 10.
Article in English | MEDLINE | ID: mdl-36357391

ABSTRACT

Benchmarks are crucial to measuring and steering progress in artificial intelligence (AI). However, recent studies raised concerns over the state of AI benchmarking, reporting issues such as benchmark overfitting, benchmark saturation and increasing centralization of benchmark dataset creation. To facilitate monitoring of the health of the AI benchmarking ecosystem, we introduce methodologies for creating condensed maps of the global dynamics of benchmark creation and saturation. We curate data for 3765 benchmarks covering the entire domains of computer vision and natural language processing, and show that a large fraction of benchmarks quickly trends towards near-saturation, that many benchmarks fail to find widespread utilization, and that benchmark performance gains for different AI tasks are prone to unforeseen bursts. We analyze attributes associated with benchmark popularity, and conclude that future benchmarks should emphasize versatility, breadth and real-world utility.


Subjects
Artificial Intelligence, Benchmarking, Benchmarking/methods, Ecosystem, Physical Phenomena
3.
Sci Data; 9(1): 322, 2022 Jun 17.
Article in English | MEDLINE | ID: mdl-35715466

ABSTRACT

Research in artificial intelligence (AI) is addressing a growing number of tasks through a rapidly growing number of models and methodologies. This makes it difficult to keep track of where novel AI methods are successfully - or still unsuccessfully - applied, how progress is measured, how different advances might synergize with each other, and how future research should be prioritized. To help address these issues, we created the Intelligence Task Ontology and Knowledge Graph (ITO), a comprehensive, richly structured and manually curated resource on artificial intelligence tasks, benchmark results and performance metrics. The current version of ITO contains 685,560 edges, 1,100 classes representing AI processes and 1,995 properties representing performance metrics. The primary goal of ITO is to enable analyses of the global landscape of AI tasks and capabilities. ITO is based on technologies that allow for easy integration and enrichment with external data, automated inference and continuous, collaborative expert curation of underlying ontological models. We make the ITO dataset and a collection of Jupyter notebooks utilizing ITO openly available.

4.
Cancers (Basel); 14(9), 2022 May 07.
Article in English | MEDLINE | ID: mdl-35565454

ABSTRACT

The main hallmarks of cancer include sustaining proliferative signaling and resisting cell death. We analyzed the genes of the WNT pathway and seven cross-linked pathways that may explain the differences in aggressiveness among cancer types. We divided six cancer types (liver, lung, stomach, kidney, prostate, and thyroid) into classes of high (H) and low (L) aggressiveness considering the TCGA data, and their correlations between Shannon entropy and 5-year overall survival (OS). Then, we used principal component analysis (PCA), a random forest classifier (RFC), and protein-protein interactions (PPI) to find the genes that correlated with aggressiveness. Using PCA, we found GRB2, CTNNB1, SKP1, CSNK2A1, PRKDC, HDAC1, YWHAZ, YWHAB, and PSMD2. Except for PSMD2, the RFC analysis showed a different list, which was CAD, PSMD14, APH1A, PSMD2, SHC1, TMEFF2, PSMD11, H2AFZ, PSMB5, and NOTCH1. Both methods use different algorithmic approaches and have different purposes, which explains the discrepancy between the two gene lists. The key genes of aggressiveness found by PCA were those that maximized the separation of H and L classes according to its third component, which represented 19% of the total variance. By contrast, RFC classified whether the RNA-seq of a tumor sample was of the H or L type. Interestingly, PPIs showed that the genes of PCA and RFC lists were connected neighbors in the PPI signaling network of WNT and cross-linked pathways.
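
The abstract above correlates Shannon entropy with 5-year overall survival. As a rough illustration only, the sketch below computes Shannon entropy for a vector of non-negative values (for example, expression levels) by treating them as proportions of their sum. The function name `shannon_entropy` and the normalization step are assumptions made here; the abstract does not state how the authors defined or computed entropy.

```python
import math


def shannon_entropy(values):
    """Shannon entropy (in bits) of non-negative values treated as a distribution.

    Assumption: values are normalized by their sum to obtain probabilities.
    """
    total = sum(values)
    if total <= 0:
        raise ValueError("values must sum to a positive number")
    entropy = 0.0
    for value in values:
        if value < 0:
            raise ValueError("values must be non-negative")
        if value > 0:  # zero entries contribute nothing (0 * log 0 is taken as 0)
            p = value / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical expression profiles: an even one and a skewed one.
even_profile = shannon_entropy([5, 5, 5, 5])
skewed_profile = shannon_entropy([17, 1, 1, 1])
print(even_profile, skewed_profile)
```

An evenly spread profile gives the maximum entropy for its length, while a profile dominated by one value gives a lower entropy.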

5.
Bioinformatics; 38(8): 2371-2373, 2022 Apr 12.
Article in English | MEDLINE | ID: mdl-35139158

ABSTRACT

SUMMARY: Machine learning algorithms for link prediction can be valuable tools for hypothesis generation. However, many current algorithms are black boxes or lack good user interfaces that could facilitate insight into why predictions are made. We present LinkExplorer, a software suite for predicting, explaining and exploring links in large biomedical knowledge graphs. LinkExplorer integrates our novel, rule-based link prediction engine SAFRAN, which was recently shown to outcompete other explainable algorithms and established black-box algorithms. Here, we demonstrate highly competitive evaluation results of our algorithm on multiple large biomedical knowledge graphs, and release a web interface that allows for interactive and intuitive exploration of predicted links and their explanations. AVAILABILITY AND IMPLEMENTATION: A publicly hosted instance, source code and further documentation can be found at https://github.com/OpenBioLink/Explorer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms, Automated Pattern Recognition, Software, Machine Learning, Documentation
6.
BMC Bioinformatics; 21(1): 365, 2020 Aug 24.
Article in English | MEDLINE | ID: mdl-32838742

ABSTRACT

BACKGROUND: The number of published full-text articles has increased dramatically. Text mining tools are an essential approach to building biological networks, updating databases and providing annotation for new pathways. PESCADOR is an online web server based on the LAITOR and NLProt text mining tools, which retrieves protein-protein co-occurrences in a tabular format and adds a network schema. Here we present an HPC-oriented version of PESCADOR's native text mining tool, renamed LAITOR4HPC, which aims to process an unlimited number of abstracts in a short time to enrich available networks, build new ones and possibly highlight whether fields of research have been exhaustively studied. RESULTS: By taking advantage of parallel computing HPC infrastructure, the full collection of MEDLINE abstracts available until June 2017 was analyzed in a much shorter period (6 days) than the original online implementation would need (an estimated 2 years for the same data). Additionally, three case studies were presented to illustrate possible uses of LAITOR4HPC. The first case study targeted soybean and was used to retrieve an overview of published co-occurrences in a single organism, retrieving 15,788 proteins in 7,894 co-occurrences. In the second case study, a target gene family was searched in many organisms, by analyzing 15 species under biotic stress. Most co-occurrences involved Arabidopsis thaliana and Zea mays. The third case study concerned the construction and enrichment of an available pathway. Choosing A. thaliana for further analysis, the defensin pathway was enriched, showing additional signaling and regulation molecules, and how they respond to each other in the modulation of this complex plant defense response. CONCLUSIONS: LAITOR4HPC can be used for efficient text-mining-based construction of biological networks derived from big data sources, such as MEDLINE abstracts. Time consumption and data input limitations will depend on the available resources at the HPC facility. LAITOR4HPC offers enough flexibility for different approaches and data amounts targeted to an organism, a subject, or a specific pathway. Additionally, it can deliver comprehensive results where interactions are classified into four types, according to their reliability.


Subjects
Software, Arabidopsis/metabolism, Arabidopsis Proteins/metabolism, Factual Databases, Plant Proteins/metabolism, Protein Interaction Maps, Zea mays/metabolism
7.
Sci Rep; 9(1): 10573, 2019 Jul 22.
Article in English | MEDLINE | ID: mdl-31332206

ABSTRACT

Rice is a staple food for nearly half the world's population. Rice yields must therefore increase to feed ever larger populations. By colonising rice and other plants, Herbaspirillum spp. stimulate plant growth and productivity. However, the molecular factors involved are largely unknown. To further explore this interaction, the transcription profiles of Nipponbare rice roots inoculated with Herbaspirillum seropedicae were determined by RNA-seq. Mapping the 104 million reads against the Oryza sativa cv. Nipponbare genome produced 65 million unique mapped reads that represented 13,840 transcripts, each with at least two-times coverage. About 7.4% (1,014) of the genes were differentially regulated and, of these, 255 changed expression levels more than two times. Several of the repressed genes encoded proteins related to plant defence (e.g. a putative probenazole inducible protein) and plant disease resistance, as well as enzymes involved in flavonoid and isoprenoid synthesis. Genes related to the synthesis and efflux of phytosiderophores (PS) and transport of PS-iron complexes were induced by the bacteria. These data suggest that the bacterium represses the rice defence system while concomitantly activating iron uptake. Transcripts of H. seropedicae were also detected, amongst which transcripts of genes involved in nitrogen fixation, cell motility and cell wall synthesis were the most expressed.


Subjects
Plant Genes, Herbaspirillum/metabolism, Iron/metabolism, Oryza/microbiology, Plant Roots/microbiology, Disease Resistance/genetics, Gene Expression Profiling, Plant Gene Expression Regulation/genetics, Homeostasis, Oryza/genetics, Oryza/metabolism, Plant Roots/metabolism
8.
BMC Bioinformatics; 20(1): 164, 2019 Apr 01.
Article in English | MEDLINE | ID: mdl-30935364

ABSTRACT

BACKGROUND: For large international research consortia, such as those funded by the European Union's Horizon 2020 programme or the Innovative Medicines Initiative, good data coordination practices and tools are essential for the successful collection, organization and analysis of the resulting data. Research consortia are attempting ever more ambitious science to better understand disease, by leveraging technologies such as whole genome sequencing, proteomics, patient-derived biological models and computer-based systems biology simulations. RESULTS: The IMI eTRIKS consortium is charged with the task of developing an integrated knowledge management platform capable of supporting the complexity of the data generated by such research programmes. In this paper, using the example of the OncoTrack consortium, we describe a typical use case in translational medicine. The tranSMART knowledge management platform was implemented to support data from observational clinical cohorts, drug response data from cell culture models and drug response data from mouse xenograft tumour models. The high dimensional (omics) data from the molecular analyses of the corresponding biological materials were linked to these collections, so that users could browse and analyse these to derive candidate biomarkers. CONCLUSIONS: In all these steps, data mapping, linking and preparation are handled automatically by the tranSMART integration platform. Therefore, researchers without specialist data handling skills can focus directly on the scientific questions, without spending undue effort on processing the data and data integration, which are otherwise a burden and the most time-consuming part of translational research data analysis.


Subjects
Factual Databases, Knowledge Management, Systems Biology, Translational Biomedical Research/methods, Animals, Cultured Cells, Computer Simulation, Animal Disease Models, Humans, Biological Models, Proteomics, Software, Whole Genome Sequencing, Xenograft Model Antitumor Assays
9.
Bioinformatics; 35(9): 1562-1565, 2019 May 01.
Article in English | MEDLINE | ID: mdl-30256906

ABSTRACT

MOTIVATION: Standardization and semantic alignment have been considered one of the major challenges for data integration in clinical research. The inclusion of the CDISC SDTM clinical data standard into the tranSMART i2b2 via a guiding master ontology tree positively impacts and supports the efficacy of data sharing, visualization and exploration across datasets. RESULTS: We present here a schema for the organization of SDTM variables into the tranSMART i2b2 tree along with a script and test dataset to exemplify the mapping strategy. The eTRIKS master tree concept is demonstrated by making use of fictitious data generated for four patients, including 16 SDTM clinical domains. We describe how the usage of correct visit names and data labels can help to integrate multiple readouts per patient and avoid ETL crashes when running a tranSMART loading routine. AVAILABILITY AND IMPLEMENTATION: The eTRIKS Master Tree package and test datasets are publicly available at https://doi.org/10.5281/zenodo.1009098 and a functional demo installation at https://public.etriks.org/transmart/datasetExplorer/ under eTRIKS-Master Tree branch, where the discussed examples can be visualized.


Subjects
Information Storage and Retrieval, Data Accuracy, Data Collection, Humans, Information Dissemination
10.
Bioinformatics; 33(14): 2229-2231, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28334291

ABSTRACT

SUMMARY: In translational research, efficient knowledge exchange between the different fields of expertise is crucial. An open platform that is capable of storing a multitude of data types such as clinical, pre-clinical or OMICS data combined with strong visual analytical capabilities will significantly accelerate the scientific progress by making data more accessible and hypothesis generation easier. The open data warehouse tranSMART is capable of storing a variety of data types and has a growing user community including both academic institutions and pharmaceutical companies. tranSMART, however, currently lacks interactive and dynamic visual analytics and does not permit any post-processing interaction or exploration. For this reason, we developed SmartR, a plugin for tranSMART, that equips the platform not only with several dynamic visual analytical workflows, but also provides its own framework for the addition of new custom workflows. Modern web technologies such as D3.js or AngularJS were used to build a set of standard visualizations that were heavily improved with dynamic elements. AVAILABILITY AND IMPLEMENTATION: The source code is licensed under the Apache 2.0 License and is freely available on GitHub: https://github.com/transmart/SmartR. CONTACT: reinhard.schneider@uni.lu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Software, Translational Biomedical Research/methods, Breast Neoplasms/genetics, Female, Neoplastic Gene Expression Regulation, Humans
11.
Big Data; 4(2): 97-108, 2016 Jun.
Article in English | MEDLINE | ID: mdl-27441714

ABSTRACT

Translational medicine is a domain turning results of basic life science research into new tools and methods in a clinical environment, for example, as new diagnostics or therapies. Nowadays, the process of translation is supported by large amounts of heterogeneous data ranging from medical data to a whole range of -omics data. It is not only a great opportunity but also a great challenge, as translational medicine big data is difficult to integrate and analyze, and requires the involvement of biomedical experts for the data processing. We show here that visualization and interoperable workflows, combining multiple complex steps, can address at least parts of the challenge. In this article, we present an integrated workflow for exploration, analysis, and interpretation of translational medicine data in the context of human health. Three Web services (tranSMART, a Galaxy Server, and a MINERVA platform) are combined into one big data pipeline. Native visualization capabilities enable the biomedical experts to get a comprehensive overview and control over separate steps of the workflow. The capabilities of tranSMART enable a flexible filtering of multidimensional integrated data sets to create subsets suitable for downstream processing. A Galaxy Server offers visually aided construction of analytical pipelines, with the use of existing or custom components. A MINERVA platform supports the exploration of health and disease-related mechanisms in a contextualized analytical visualization system. We demonstrate the utility of our workflow by illustrating its subsequent steps using an existing data set, for which we propose a filtering scheme, an analytical pipeline, and a corresponding visualization of analytical results. The workflow is available as a sandbox environment, where readers can work with the described setup themselves. Overall, our work shows how visualization and interfacing of big data processing services facilitate exploration, analysis, and interpretation of translational medicine data.


Subjects
Data Mining, Disease, Translational Biomedical Research, Humans, Systems Integration
12.
Methods; 74: 16-35, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25449898

ABSTRACT

Genomic information is increasingly represented in the form of biological pathways. Building these pathways is an ongoing demand and benefits from methods for extracting information from the biomedical literature with the aid of text-mining tools. Here we guide the reader through building a customized pathway or chart representation of a system. Our manual is based on a set of software tools designed to look at biointeractions in a set of abstracts retrieved from PubMed. These tools aim to support the work of someone with a biological background, who does not need to be an expert on the subject and who will play the role of manual curator while designing the representation of the system, the pathway. We illustrate the approach with two challenging case studies: hair and breast development. They were chosen because they focus on recent acquisitions of human evolution. We produced sub-pathways for each study, representing different phases of development. Unlike most charts present in current databases, ours come with detailed descriptions, which will additionally guide PESCADOR users through the process. The implementation as a web interface makes PESCADOR a unique tool for guiding the user along the biointeractions that will constitute a novel pathway.


Subjects
Breast/growth & development, Data Mining/methods, Genetic Databases, Hair/growth & development, PubMed, Data Mining/trends, Genetic Databases/trends, Female, Humans, PubMed/trends
13.
Nucleic Acids Res; 42(Database issue): D60-7, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24163100

ABSTRACT

Approximately half of all human transcripts contain at least one upstream translational initiation site that precedes the main coding sequence (CDS) and gives rise to an upstream open reading frame (uORF). We generated uORFdb, publicly available at http://cbdm.mdc-berlin.de/tools/uorfdb, to serve as a comprehensive literature database on eukaryotic uORF biology. Upstream ORFs affect downstream translation by interfering with the unrestrained progression of ribosomes across the transcript leader sequence. Although the first uORF-related translational activity was observed >30 years ago, and an increasing number of studies link defective uORF-mediated translational control to the development of human diseases, the features that determine uORF-mediated regulation of downstream translation are not well understood. The uORFdb was manually curated from all uORF-related literature listed at the PubMed database. It categorizes individual publications by a variety of denominators including taxon, gene and type of study. Furthermore, the database can be filtered for multiple structural and functional uORF-related properties to allow convenient and targeted access to the complex field of eukaryotic uORF biology.
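
The definition of a uORF given above (an upstream initiation site that precedes the main CDS and opens a reading frame) can be sketched in a few lines. This is only an illustration under stated assumptions: it considers ATG start codons only, requires an in-frame stop codon somewhere in the transcript, and uses 0-based coordinates. The function `find_uorfs` and the example sequence are hypothetical and are not taken from uORFdb.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(transcript, cds_start):
    """Return (start, end) pairs of ORFs whose ATG lies upstream of cds_start.

    Coordinates are 0-based and `end` is exclusive (it includes the stop codon).
    Assumption: only ATG starts are considered, and an ORF is reported only
    if an in-frame stop codon is found in the transcript.
    """
    seq = transcript.upper().replace("U", "T")  # accept RNA or DNA letters
    uorfs = []
    for start in range(0, min(cds_start, len(seq) - 2)):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, in frame with this ATG, until a stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                uorfs.append((start, pos + 3))
                break
    return uorfs


# Hypothetical transcript: the main CDS starts at index 16 (ATGGCTTAA).
example = "GGATGAAATAGCCACCATGGCTTAA"
print(find_uorfs(example, 16))
```

In this toy sequence, the only upstream ATG sits at index 2 and reaches an in-frame TAG, so a single uORF is reported.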


Subjects
Nucleic Acid Databases, Open Reading Frames, Protein Biosynthesis, Alternative Splicing, Animals, Disease/genetics, Genetic Variation, Humans, Internet, Mice, Nonsense-Mediated mRNA Decay, Genetic Promoter Regions, RNA Stability, Messenger RNA/metabolism, Ribosomes/metabolism
14.
BioData Min; 5(1): 1, 2012 Feb 01.
Article in English | MEDLINE | ID: mdl-22297131

ABSTRACT

BACKGROUND: Keeping up-to-date with bioscience literature is becoming increasingly challenging. Several recent methods help meet this challenge by allowing literature search to be launched based on lists of abstracts that the user judges to be 'interesting'. Some methods go further by allowing the user to provide a second input set of 'uninteresting' abstracts; these two input sets are then used to search and rank literature by relevance. In this work we present the service 'Caipirini' (http://caipirini.org) that also allows two input sets, but takes the novel approach of allowing ranking of literature based on one or more sets of genes. RESULTS: To evaluate the usefulness of Caipirini, we used two test cases, one related to the human cell cycle, and a second related to disease defense mechanisms in Arabidopsis thaliana. In both cases, the new method achieved high precision in finding literature related to the biological mechanisms underlying the input data sets. CONCLUSIONS: To our knowledge Caipirini is the first service enabling literature search directly based on biological relevance to gene sets; thus, Caipirini gives the research community a new way to unlock hidden knowledge from gene sets derived via high-throughput experiments.

15.
BMC Bioinformatics; 12: 435, 2011 Nov 09.
Article in English | MEDLINE | ID: mdl-22070195

ABSTRACT

BACKGROUND: Biological function is greatly dependent on the interactions of proteins with other proteins and genes. Abstracts from the biomedical literature stored in the NCBI's PubMed database can be used for the derivation of interactions between genes and proteins by identifying the co-occurrences of their terms. Often, the number of interactions obtained through such an approach is large and may mix processes occurring in different contexts. Current tools do not allow studying these data with a focus on concepts of relevance to a user, for example, interactions related to a disease or to a biological mechanism such as protein aggregation. RESULTS: To support the concept-oriented exploration of such data we developed PESCADOR, a web tool that extracts a network of interactions from a set of PubMed abstracts given by a user, and allows filtering the interaction network according to user-defined concepts. We illustrate its use in exploring protein aggregation in neurodegenerative disease and in the expansion of pathways associated with colon cancer. CONCLUSIONS: PESCADOR is a platform-independent web resource available at: http://cbdm.mdc-berlin.de/tools/pescador/


Subjects
Data Mining, PubMed, Software, Colorectal Neoplasms/genetics, Colorectal Neoplasms/metabolism, Humans, Internet, Neurodegenerative Diseases/genetics, Neurodegenerative Diseases/metabolism, Proteins/genetics, Proteins/metabolism
16.
Nucleic Acids Res; 39(Web Server issue): W455-61, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21609954

ABSTRACT

Biomedical literature is traditionally used as a way to inform scientists of the relevance of genes in relation to a research topic. However, many genes, especially from poorly studied organisms, are not discussed in the literature. Moreover, a manual and comprehensive summarization of the literature attached to the genes of an organism is in general impossible due to the high number of genes and abstracts involved. We introduce the novel Génie algorithm that overcomes these problems by evaluating the literature attached to all genes in a genome and to their orthologs according to a selected topic. Génie showed high precision (up to 100%) and the best performance in comparison to other algorithms in most of the benchmarks, especially when high sensitivity was required. Moreover, the prioritization of zebrafish genes involved in heart development, using human and mouse orthologs, showed high enrichment in differentially expressed genes from microarray experiments. The Génie web server supports hundreds of species, millions of genes and offers novel functionalities. Common run times below a minute, even when analyzing the human genome with hundreds of thousands of literature records, allow the use of Génie in routine lab work. AVAILABILITY: http://cbdm.mdc-berlin.de/tools/genie/.


Subjects
Genes, Software, Algorithms, Animals, Gene Expression Profiling, Genomics, Heart/embryology, Humans, Internet, MEDLINE, Mice, Animal Models, Zebrafish/embryology, Zebrafish/genetics
17.
BMC Genomics; 12 Suppl 4: S3, 2011 Dec 22.
Article in English | MEDLINE | ID: mdl-22369103

ABSTRACT

BACKGROUND: The integration of sequencing and gene interaction data and subsequent generation of pathways and networks contained in databases such as KEGG Pathway is essential for the comprehension of complex biological processes. We noticed the absence of a chart or pathway describing the well-studied preimplantation development stages; furthermore, not all genes involved in the process have entries in KEGG Orthology, which is important information for applying this knowledge to other organisms. RESULTS: In this work we sought to develop the regulatory pathway for the preimplantation development stage using text-mining tools such as Medline Ranker and PESCADOR to reveal biointeractions among the genes involved in this process. The genes present in the resulting pathway were also used as seeds for software developed by our group called SeedServer to create clusters of homologous genes. These homologues allowed the determination of the last common ancestor for each gene and revealed that the preimplantation development pathway consists of a conserved ancient core of genes with the addition of modern elements. CONCLUSIONS: The generation of regulatory pathways through text-mining tools allows the integration of data generated by several studies for a more complete visualization of complex biological processes. Using the genes in this pathway as "seeds" for the generation of clusters of homologues, the pathway can be visualized for other organisms. The clustering of homologous genes together with determination of the ancestry leads to a better understanding of the evolution of such processes.


Subjects
Data Mining, Software, Animals, Cluster Analysis, Factual Databases, Embryonic Development, Gene Regulatory Networks, Humans, Information Storage and Retrieval, Mice, Stem Cell Transplantation
18.
BioData Min; 3(1): 1, 2010 Feb 22.
Article in English | MEDLINE | ID: mdl-20175922

ABSTRACT

The quantities of data obtained by the new high-throughput technologies, such as microarrays or ChIP-Chip arrays, and the large-scale OMICS-approaches, such as genomics, proteomics and transcriptomics, are becoming vast. Sequencing technologies are becoming cheaper and easier to use and, thus, large-scale evolutionary studies of the origins of life for all species and their evolution are becoming more and more challenging. Databases holding information about how data are related and how they are hierarchically organized are expanding rapidly. Clustering analysis is becoming more and more difficult to apply to very large amounts of data since the results of these algorithms cannot be efficiently visualized. Most of the available visualization tools that are able to represent such hierarchies project data in 2D and often lack the necessary user friendliness and interactivity. For example, the current phylogenetic tree visualization tools are not able to display large-scale trees with more than a few thousand nodes in an easy-to-understand way. In this study, we review tools that are currently available for the visualization and analysis of biological trees, mainly developed during the last decade. We describe the uniform and standard computer-readable formats used to represent tree hierarchies and we comment on the functionality and the limitations of these tools. We also discuss how these tools can be developed further and should become integrated with various data sources. Here we focus on freely available software that offers users various tree-representation methodologies for biological data analysis.

19.
BMC Bioinformatics; 11: 70, 2010 Feb 01.
Article in English | MEDLINE | ID: mdl-20122157

ABSTRACT

BACKGROUND: Biological knowledge is represented in scientific literature that often describes the function of genes/proteins (bioentities) in terms of their interactions (biointeractions). Such bioentities are often related to biological concepts of interest that are specific to a particular research field. Therefore, the study of the current literature about a selected topic deposited in public databases facilitates the generation of novel hypotheses associating a set of bioentities to a common context. RESULTS: We created a text mining system (LAITOR: Literature Assistant for Identification of Terms co-Occurrences and Relationships) that analyses co-occurrences of bioentities, biointeractions, and other biological terms in MEDLINE abstracts. The method accounts for the position of the co-occurring terms within sentences or abstracts. The system detected abstracts mentioning protein-protein interactions in a standard test (BioCreative II IAS test data) with a precision of 0.82-0.89 and a recall of 0.48-0.70. We illustrate the application of LAITOR to the detection of plant response genes in a dataset of 1000 abstracts relevant to the topic. CONCLUSIONS: Text mining tools combining the extraction of interacting bioentities and biological concepts with network displays can be helpful in developing reasonable hypotheses in different scientific backgrounds.
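
To make the idea of sentence-level co-occurrence and the reported precision and recall more concrete, here is a minimal sketch. It is not LAITOR's implementation: the naive sentence splitter, the whole-token dictionary matching, the function names and the example gene names are all assumptions chosen for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrences(abstract, terms):
    """Count pairs of dictionary terms that appear in the same sentence."""
    lookup = {term.lower(): term for term in terms}
    counts = Counter()
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence.lower()))
        found = sorted(lookup[key] for key in lookup if key in tokens)
        for pair in combinations(found, 2):
            counts[pair] += 1
    return counts


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against a reference set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical abstract and term dictionary.
text = ("PR1 interacts with NPR1 in leaves. "
        "NPR1 is regulated by WRKY70. PR1 was also measured.")
print(sentence_cooccurrences(text, ["PR1", "NPR1", "WRKY70"]))
print(precision_recall({1, 2, 3, 4}, {2, 3, 5}))
```

Matching on whole tokens keeps "PR1" from being counted inside "NPR1"; a real system would also need synonym handling and a proper sentence tokenizer.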


Subjects
Data Mining/methods, Information Storage and Retrieval/methods, Software, Computational Biology/methods, MEDLINE, Publications, United States
20.
Nucleic Acids Res; 38(1): 26-38, 2010 Jan.
Article in English | MEDLINE | ID: mdl-19858102

ABSTRACT

Life scientists are often interested in comparing two gene sets to gain insight into differences between two distinct, but related, phenotypes or conditions. Several tools have been developed for comparing gene sets, most of which find Gene Ontology (GO) terms that are significantly over-represented in one gene set. However, such tools often return GO terms that are too generic or too few to be informative. Here, we present Martini, an easy-to-use tool for comparing gene sets. Martini is based, not on GO, but on keywords extracted from Medline abstracts; Martini also supports a much wider range of species than comparable tools. To evaluate Martini we created a benchmark based on the human cell cycle, and we tested several comparable tools (CoPub, FatiGO, Marmite and ProfCom). Martini had the best benchmark performance, delivering a more detailed and accurate description of function. Martini also gave best or equal performance with three other datasets (related to Arabidopsis, melanoma and ovarian cancer), suggesting that Martini represents an advance in the automated comparison of gene sets. In agreement with previous studies, our results further suggest that literature-derived keywords are a richer source of gene-function information than GO annotations. Martini is freely available at http://martini.embl.de.
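
The notion of a term being "significantly over-represented in one gene set" is commonly assessed with a hypergeometric tail probability. The sketch below shows that general technique only; it is not a description of the statistic used by Martini or by the other tools named above, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: number of genes in the background
    annotated:  background genes carrying the term (keyword or GO term)
    sample:     number of genes in the gene set of interest
    hits:       genes in the gene set that carry the term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 40 of 1000 background genes carry a keyword,
# and 8 of the 50 genes in the set of interest carry it.
print(enrichment_p_value(1000, 40, 50, 8))
```

A small probability suggests the term occurs in the gene set more often than expected by chance; in practice a correction for testing many terms would also be needed.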


Subjects
Genes, Software, Terminology as Topic, Arabidopsis/genetics, Cell Cycle/genetics, Dictionaries as Topic, Neoplasm Genes, Plant Genes, Humans, MEDLINE, Melanoma/genetics