Results 1 - 20 of 37
1.
Bioinformatics ; 37(19): 3343-3348, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33964129

ABSTRACT

MOTIVATION: Gene Ontology Causal Activity Models (GO-CAMs) assemble individual associations of gene products with cellular components, molecular functions and biological processes into causally linked activity flow models. Pathway databases such as the Reactome Knowledgebase create detailed molecular process descriptions of reactions and assemble them into pathway descriptions based on the entities shared between individual reactions. RESULTS: To convert the rich content of Reactome into GO-CAMs, we developed a software tool, Pathways2GO, which converts the entire set of normal human Reactome pathways into GO-CAMs. This conversion yields standard GO annotations from Reactome content and supports enhanced quality control for both Reactome and GO, providing a nearly seamless mapping between these two resources for the bioinformatics community. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

2.
Bioinformatics ; 36(4): 1226-1233, 2020 02 15.
Article in English | MEDLINE | ID: mdl-31504205

ABSTRACT

MOTIVATION: Biomedical literature is growing at a rate that outpaces our ability to harness the knowledge contained therein. To mine valuable inferences from the large volume of literature, many researchers use information extraction algorithms to harvest information in biomedical texts. Information extraction is usually accomplished via a combination of manual expert curation and computational methods. Advances in computational methods usually depend on the time-consuming generation of gold standards by a limited number of expert curators. Citizen science is public participation in scientific research. We previously found that citizen scientists are willing and capable of performing named entity recognition of disease mentions in biomedical abstracts, but did not know if this was true with relationship extraction (RE). RESULTS: In this article, we introduce the Relationship Extraction Module of the web-based application Mark2Cure (M2C) and demonstrate that citizen scientists can perform RE. We confirm the importance of accurate named entity recognition on user performance of RE and identify design issues that impacted data quality. We find that the data generated by citizen scientists can be used to identify relationship types not currently available in the M2C Relationship Extraction Module. We compare the citizen science-generated data with algorithm-mined data and identify ways in which the two approaches may complement one another. We also discuss opportunities for future improvement of this system, as well as the potential synergies between citizen science, manual biocuration and natural language processing. AVAILABILITY AND IMPLEMENTATION: Mark2Cure platform: https://mark2cure.org; Mark2Cure source code: https://github.com/sulab/mark2cure; and data and analysis code for this article: https://github.com/gtsueng/M2C_rel_nb. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Citizen Science, Natural Language Processing, Information Storage and Retrieval, Research Design, Software
3.
Brief Bioinform ; 17(1): 23-32, 2016 Jan.
Article in English | MEDLINE | ID: mdl-25888696

ABSTRACT

The use of crowdsourcing to solve important but complex problems in biomedical and clinical sciences is growing and encompasses a wide variety of approaches. The crowd is diverse and includes online marketplace workers, health information seekers, science enthusiasts and domain experts. In this article, we review and highlight recent studies that use crowdsourcing to advance biomedicine. We classify these studies into two broad categories: (i) mining big data generated from a crowd (e.g. search logs) and (ii) active crowdsourcing via specific technical platforms, e.g. labor markets, wikis, scientific games and community challenges. Through describing each study in detail, we demonstrate the applicability of different methods in a variety of domains in biomedical research, including genomics, biocuration and clinical research. Furthermore, we discuss and highlight the strengths and limitations of different crowdsourcing platforms. Finally, we identify important emerging trends, opportunities and remaining challenges for future crowdsourcing research in biomedicine.


Subject(s)
Crowdsourcing/trends, Computational Biology/trends, Data Mining, Humans, Internet, Search Engine, Smartphone, Social Media, Video Games
4.
Bioinformatics ; 32(13): 2072-2074, 2016 07 01.
Article in English | MEDLINE | ID: mdl-27153723

ABSTRACT

UNLABELLED: Branch is a web application that provides users with the ability to interact directly with large biomedical datasets. The interaction is mediated through a collaborative graphical user interface for building and evaluating decision trees. These trees can be used to compose and test sophisticated hypotheses and to develop predictive models. Decision trees are built and evaluated based on a library of imported datasets and can be stored in a collective area for sharing and re-use. AVAILABILITY AND IMPLEMENTATION: Branch is hosted at http://biobranch.org/ and the open source code is available at http://bitbucket.org/sulab/biobranch/. CONTACTS: asu@scripps.edu or bgood@scripps.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Biomedical Research, Decision Trees, Software, Datasets as Topic, Humans, Internet, Models, Theoretical
5.
Bioinformatics ; 29(16): 1925-33, 2013 Aug 15.
Article in English | MEDLINE | ID: mdl-23782614

ABSTRACT

MOTIVATION: Bioinformatics is faced with a variety of problems that require human involvement. Tasks like genome annotation, image analysis, knowledge-base population and protein structure determination all benefit from human input. In some cases, people are needed in vast quantities, whereas in others, we need just a few with rare abilities. Crowdsourcing encompasses an emerging collection of approaches for harnessing such distributed human intelligence. Recently, the bioinformatics community has begun to apply crowdsourcing in a variety of contexts, yet few resources are available that describe how these human-powered systems work and how to use them effectively in scientific domains. RESULTS: Here, we provide a framework for understanding and applying several different types of crowdsourcing. The framework considers two broad classes: systems for solving large-volume 'microtasks' and systems for solving high-difficulty 'megatasks'. Within these classes, we discuss system types, including volunteer labor, games with a purpose, microtask markets and open innovation contests. We illustrate each system type with successful examples in bioinformatics and conclude with a guide for matching problems to crowdsourcing solutions that highlights the positives and negatives of different approaches.


Subject(s)
Computational Biology, Crowdsourcing, Games, Experimental, Sequence Alignment, Volunteers
6.
Nucleic Acids Res ; 40(Database issue): D1255-61, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22075991

ABSTRACT

The Gene Wiki is an open-access and openly editable collection of Wikipedia articles about human genes. Initiated in 2008, it has grown to include articles about more than 10,000 genes that, collectively, contain more than 1.4 million words of gene-centric text with extensive citations back to the primary scientific literature. This growing body of useful, gene-centric content is the result of the work of thousands of individuals throughout the scientific community. Here, we describe recent improvements to the automated system that keeps the structured data presented on Gene Wiki articles in sync with the data from trusted primary databases. We also describe the expanding contents, editors and users of the Gene Wiki. Finally, we introduce a new automated system, called WikiTrust, which can effectively compute the quality of Wikipedia articles, including Gene Wiki articles, at the word level. All articles in the Gene Wiki can be freely accessed and edited at Wikipedia, and additional links and information can be found at the project's Wikipedia portal page: http://en.wikipedia.org/wiki/Portal:Gene_Wiki.


Subject(s)
Databases, Nucleic Acid, Genes, Molecular Sequence Annotation, Encyclopedias as Topic, Humans, Internet, Proteins/genetics, Software
7.
BMC Genomics ; 12: 603, 2011 Dec 13.
Article in English | MEDLINE | ID: mdl-22165947

ABSTRACT

BACKGROUND: Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki is comprised of more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology. RESULTS: Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses. CONCLUSIONS: The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses.


Subject(s)
Genomics, Information Storage and Retrieval, Internet
8.
PLoS One ; 16(3): e0231916, 2021.
Article in English | MEDLINE | ID: mdl-33755673

ABSTRACT

AVAILABILITY: The API and associated software are open source and currently available at https://github.com/NCATS-Tangerine/translator-knowledge-beacon.


Subject(s)
Knowledge, Software, Databases, Factual, Internet
9.
Patterns (N Y) ; 2(1): 100155, 2021 Jan 08.
Article in English | MEDLINE | ID: mdl-33196056

ABSTRACT

Integrated, up-to-date data about SARS-CoV-2 and COVID-19 is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. While rich biological knowledge exists for SARS-CoV-2 and related viruses (SARS-CoV, MERS-CoV), integrating this knowledge is difficult and time-consuming, since much of it is in siloed databases or in textual format. Furthermore, the data required by the research community vary drastically for different tasks; the optimal data for a machine learning task, for example, is much different from the data used to populate a browsable user interface for clinicians. To address these challenges, we created KG-COVID-19, a flexible framework that ingests and integrates heterogeneous biomedical data to produce knowledge graphs (KGs), and applied it to create a KG for COVID-19 response. This KG framework also can be applied to other problems in which siloed biomedical data must be quickly integrated for different research applications, including future pandemics.

10.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32283553

ABSTRACT

Hypothesis generation is a critical step in research and a cornerstone in the rare disease field. Research is most efficient when those hypotheses are based on the entirety of knowledge known to date. Systematic review articles are commonly used in biomedicine to summarize existing knowledge and contextualize experimental data. But the information contained within review articles is typically only expressed as free-text, which is difficult to use computationally. Researchers struggle to navigate, collect and remix prior knowledge as it is scattered in several silos without seamless integration and access. This lack of a structured information framework hinders research by both experimental and computational scientists. To better organize knowledge and data, we built a structured review article that is specifically focused on NGLY1 Deficiency, an ultra-rare genetic disease first reported in 2012. We represented this structured review as a knowledge graph and then stored this knowledge graph in a Neo4j database to simplify dissemination, querying and visualization of the network. Relative to free-text, this structured review better promotes the principles of findability, accessibility, interoperability and reusability (FAIR). In collaboration with domain experts in NGLY1 Deficiency, we demonstrate how this resource can improve the efficiency and comprehensiveness of hypothesis generation. We also developed a read-write interface that allows domain experts to contribute FAIR structured knowledge to this community resource. In contrast to traditional free-text review articles, this structured review exists as a living knowledge graph that is curated by humans and accessible to computational analyses. Finally, we have generalized this workflow into modular and repurposable components that can be applied to other domain areas. This NGLY1 Deficiency-focused network is publicly available at http://ngly1graph.org/. 
AVAILABILITY AND IMPLEMENTATION: Database URL: http://ngly1graph.org/. Network data files are at: https://github.com/SuLab/ngly1-graph and source code at: https://github.com/SuLab/bioknowledge-reviewer. CONTACT: asu@scripps.edu.


Subject(s)
Biomedical Research/methods, Computational Biology/methods, Databases, Factual, Knowledge Bases, Animals, Biomedical Research/statistics & numerical data, Computational Biology/statistics & numerical data, Congenital Disorders of Glycosylation/genetics, Congenital Disorders of Glycosylation/metabolism, Data Curation/methods, Data Mining/methods, Humans, Internet, Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase/deficiency, Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase/genetics, Peptide-N4-(N-acetyl-beta-glucosaminyl) Asparagine Amidase/metabolism, Systematic Reviews as Topic
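
A structured review of the kind described above stores knowledge as a graph and supports hypothesis generation by traversing it. The actual resource is a Neo4j database at ngly1graph.org; the sketch below instead uses a small in-memory triple list with hypothetical relations, purely to illustrate path-based hypothesis generation over such a graph.

```python
from collections import deque

# Hypothetical (subject, relation, object) statements of the kind a structured
# review might hold; these triples are invented for illustration only.
triples = [
    ("NGLY1", "has_phenotype", "alacrima"),
    ("NGLY1", "interacts_with", "ENGASE"),
    ("ENGASE", "inhibited_by", "compound_X"),
    ("alacrima", "treated_by", "artificial_tears"),
]

def neighbors(node):
    """Yield (next_node, edge) pairs for outgoing edges of a node."""
    for s, r, o in triples:
        if s == node:
            yield o, (s, r, o)

def find_paths(start, goal):
    """Breadth-first search for directed paths from start to goal."""
    queue = deque([(start, [])])
    paths = []
    while queue:
        node, path = queue.popleft()
        if node == goal:
            paths.append(path)
            continue
        for nxt, edge in neighbors(node):
            if all(edge != e for e in path):  # avoid revisiting an edge
                queue.append((nxt, path + [edge]))
    return paths

# Hypothesis generation: is there a mechanistic path from the gene to a compound?
paths = find_paths("NGLY1", "compound_X")
```

In the real resource, the equivalent traversal would be expressed as a Cypher query against Neo4j rather than a Python loop.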
11.
Elife ; 9, 2020 03 17.
Article in English | MEDLINE | ID: mdl-32180547

ABSTRACT

Wikidata is a community-maintained knowledge base that has been assembled from repositories in the fields of genomics, proteomics, genetic variants, pathways, chemical compounds, and diseases, and that adheres to the FAIR principles of findability, accessibility, interoperability and reusability. Here we describe the breadth and depth of the biomedical knowledge contained within Wikidata, and discuss the open-source tools we have built to add information to Wikidata and to synchronize it with source databases. We also demonstrate several use cases for Wikidata, including the crowdsourced curation of biomedical ontologies, phenotype-based diagnosis of disease, and drug repurposing.


Subject(s)
Biological Science Disciplines, Computational Biology, Databases, Factual, Genomics, Proteomics, Humans, Pattern Recognition, Automated
12.
BMC Bioinformatics ; 10: 313, 2009 Sep 25.
Article in English | MEDLINE | ID: mdl-19781082

ABSTRACT

BACKGROUND: Academic social tagging systems, such as Connotea and CiteULike, provide researchers with a means to organize personal collections of online references with keywords (tags) and to share these collections with others. One of the side-effects of the operation of these systems is the generation of large, publicly accessible metadata repositories describing the resources in the collections. In light of the well-known expansion of information in the life sciences and the need for metadata to enhance its value, these repositories present a potentially valuable new resource for application developers. Here we characterize the current contents of two scientifically relevant metadata repositories created through social tagging. This investigation helps to establish how such socially constructed metadata might be used as it stands currently and to suggest ways that new social tagging systems might be designed that would yield better aggregate products. RESULTS: We assessed the metadata that users of CiteULike and Connotea associated with citations in PubMed with the following metrics: coverage of the document space, density of metadata (tags) per document, rates of inter-annotator agreement, and rates of agreement with MeSH indexing. CiteULike and Connotea were very similar on all of the measurements. In comparison to PubMed, document coverage and per-document metadata density were much lower for the social tagging systems. Inter-annotator agreement within the social tagging systems and the agreement between the aggregated social tagging metadata and MeSH indexing was low, though the latter could be increased through voting. CONCLUSION: The most promising uses of metadata from current academic social tagging repositories will be those that find ways to utilize the novel relationships between users, tags, and documents exposed through these systems. For more traditional kinds of indexing-based applications (such as keyword-based search) to benefit substantially from socially generated metadata in the life sciences, more documents need to be tagged and more tags are needed for each document. These issues may be addressed both by finding ways to attract more users to current systems and by creating new user interfaces that encourage more collectively useful individual tagging behaviour.


Subject(s)
Computational Biology/methods, Information Storage and Retrieval/methods, Databases, Factual, Information Systems, Internet, Medical Subject Headings, PubMed
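
The inter-annotator agreement and voting aggregation described in the abstract above can be illustrated with a small sketch. The paper's exact metrics are not reproduced here, so this uses simple Jaccard overlap between taggers' tag sets; the example tags and the vote threshold are assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap between two taggers' tag sets for the same document."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def mean_pairwise_agreement(tag_sets):
    """Average Jaccard agreement over all annotator pairs for one document."""
    pairs = [(i, j) for i in range(len(tag_sets)) for j in range(i + 1, len(tag_sets))]
    return sum(jaccard(tag_sets[i], tag_sets[j]) for i, j in pairs) / len(pairs)

def voted_tags(tag_sets, min_votes=2):
    """Aggregate tags by voting, the mechanism the abstract says raises
    agreement with MeSH indexing."""
    counts = {}
    for tags in tag_sets:
        for t in set(tags):
            counts[t] = counts.get(t, 0) + 1
    return {t for t, n in counts.items() if n >= min_votes}

# Hypothetical tags three users assigned to the same PubMed citation.
users = [
    {"genomics", "p53", "cancer"},
    {"p53", "cancer", "review"},
    {"p53", "apoptosis"},
]

agreement = mean_pairwise_agreement(users)
consensus = voted_tags(users)
```

With these invented tag sets, raw pairwise agreement is low, while voting filters the aggregate down to the tags most annotators share.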
14.
Brain ; 129(Pt 7): 1872-83, 2006 Jul.
Article in English | MEDLINE | ID: mdl-16735456

ABSTRACT

HIV-1 often invades the CNS during primary infection, eventually resulting in neurological disorders in up to 50% of untreated patients. The CNS is a distinct viral reservoir, differing from peripheral tissues in immunological surveillance, target cell characteristics and antiretroviral penetration. Neurotropic HIV-1 likely develops distinct genotypic characteristics in response to this unique selective environment. We sought to catalogue the genetic features of CNS-derived HIV-1 by analysing 456 clonal RNA sequences of the C2-V3 env subregion generated from CSF and plasma of 18 chronically infected individuals. Neuropsychological performance of all subjects was evaluated and summarized as a global deficit score. A battery of phylogenetic, statistical and machine learning tools was applied to these data to identify genetic features associated with HIV-1 neurotropism and neurovirulence. Eleven of 18 individuals exhibited significant viral compartmentalization between blood and CSF (P < 0.01, Slatkin-Maddison test). A CSF-specific genetic signature was identified, comprising positions 9, 13 and 19 of the V3 loop. The residue at position 5 of the V3 loop was highly correlated with neurocognitive deficit (P < 0.0025, Fisher's exact test). Antibody-mediated HIV-1 neutralizing activity was significantly reduced in CSF with respect to autologous blood plasma (P < 0.042, Student's t-test). Accordingly, CSF-derived sequences exhibited constrained diversity and contained fewer glycosylated and positively selected sites. Our results suggest that there are several genetic features that distinguish CSF- and plasma-derived HIV-1 populations, probably reflecting altered cellular entry requirements and decreased immune pressure in the CNS. Furthermore, neurological impairment may be influenced by mutations within the viral V3 loop sequence.


Subject(s)
Cognition Disorders/virology, Genes, env, HIV Infections/virology, HIV-1/genetics, Amino Acid Sequence, CD4 Lymphocyte Count, Cerebrospinal Fluid/virology, Evolution, Molecular, Genetic Variation, Glycosylation, HIV Antibodies/immunology, HIV Infections/psychology, HIV-1/immunology, HIV-1/pathogenicity, Humans, Molecular Sequence Data, Neuropsychological Tests, Phylogeny, RNA, Viral/blood, RNA, Viral/cerebrospinal fluid, Sequence Homology, Amino Acid, Virulence
15.
Database (Oxford) ; 2017(1), 2017 01 01.
Article in English | MEDLINE | ID: mdl-28365742

ABSTRACT

With the advancement of genome-sequencing technologies, new genomes are being sequenced daily. Although these sequences are deposited in publicly available data warehouses, their functional and genomic annotations (beyond automatically predicted genes) mostly reside in the text of primary publications. Professional curators are hard at work extracting those annotations from the literature for the most studied organisms and depositing them in structured databases. However, the resources do not exist to fund the comprehensive curation of the thousands of newly sequenced organisms in this manner. Here, we describe WikiGenomes (wikigenomes.org), a web application that facilitates the consumption and curation of genomic data by the entire scientific community. WikiGenomes is based on Wikidata, an openly editable knowledge graph with the goal of aggregating published knowledge into a free and open database. WikiGenomes empowers the individual genomic researcher to contribute their expertise to the curation effort and integrates the knowledge into Wikidata, enabling it to be accessed by anyone without restriction. Database URL: www.wikigenomes.org.


Subject(s)
Databases, Nucleic Acid, Genome, Internet, Molecular Sequence Annotation/methods, Molecular Sequence Annotation/standards
16.
BMC Bioinformatics ; 7: 534, 2006 Dec 15.
Article in English | MEDLINE | ID: mdl-17173692

ABSTRACT

BACKGROUND: User-scripts are programs stored in Web browsers that can manipulate the content of websites prior to display in the browser. They provide a novel mechanism by which users can conveniently gain increased control over the content and the display of the information presented to them on the Web. As the Web is the primary medium by which scientists retrieve biological information, any improvements in the mechanisms that govern the utility or accessibility of this information may have profound effects. GreaseMonkey is a Mozilla Firefox extension that facilitates the development and deployment of user-scripts for the Firefox web-browser. We utilize this to enhance the content and the presentation of the iHOP (information Hyperlinked Over Proteins) website. RESULTS: The iHOPerator is a GreaseMonkey user-script that augments the gene-centred pages on iHOP by providing a compact, configurable visualization of the defining information for each gene and by enabling additional data, such as biochemical pathway diagrams, to be collected automatically from third party resources and displayed in the same browsing context. CONCLUSION: This open-source script provides an extension to the iHOP website, demonstrating how user-scripts can personalize and enhance the Web browsing experience in a relevant biological setting. The novel, user-driven controls over the content and the display of Web resources made possible by user-scripts, such as the iHOPerator, herald the beginning of a transition from a resource-centric to a user-centric Web experience. We believe that this transition is a necessary step in the development of Web technology that will eventually result in profound improvements in the way life scientists interact with information.


Subject(s)
Computational Biology/methods, Internet, User-Computer Interface, Computational Biology/trends, Databases, Factual/trends, Internet/trends, Software/trends
17.
Article in English | MEDLINE | ID: mdl-27307137

ABSTRACT

Drug toxicity is a major concern for both regulatory agencies and the pharmaceutical industry. In this context, text-mining methods for the identification of drug side effects from free text are key for the development of up-to-date knowledge sources on drug adverse reactions. We present a new system for identification of drug side effects from the literature that combines three approaches: machine learning, rule- and knowledge-based approaches. This system was developed to address Task 3.B of the BioCreative V challenge (BC5) dealing with Chemical-induced Disease (CID) relations. The first two approaches focus on identifying relations at the sentence level, while the knowledge-based approach is applied at both the sentence and abstract levels. The machine learning method is based on the BeFree system using two corpora as training data: the annotated data provided by the CID task organizers and a new CID corpus developed by crowdsourcing. Different combinations of results from the three strategies were selected for each run of the challenge. In the final evaluation setting, the system achieved the highest recall of the challenge (63%). By performing an error analysis, we identified the main causes of misclassification and areas for improvement of our system, and highlighted the need for consistent gold standard data sets for advancing the state of the art in text mining of drug side effects. Database URL: https://zenodo.org/record/29887?ln=en#.VsL3yDLWR_V.


Subject(s)
Chemically-Induced Disorders, Crowdsourcing, Databases, Factual/standards, Machine Learning/standards, Animals, Chemically-Induced Disorders/genetics, Chemically-Induced Disorders/metabolism, Crowdsourcing/methods, Crowdsourcing/standards, Data Mining/methods, Data Mining/standards, Humans
18.
Article in English | MEDLINE | ID: mdl-27022157

ABSTRACT

The last 20 years of advancement in sequencing technologies have led to sequencing thousands of microbial genomes, creating mountains of genetic data. While efficiency in generating the data improves almost daily, applying meaningful relationships between taxonomic and genetic entities on this scale requires a structured and integrative approach. Currently, knowledge is distributed across a fragmented landscape of resources, from government-funded institutions such as the National Center for Biotechnology Information (NCBI) and UniProt, to topic-focused databases like the ODB3 database of prokaryotic operons, to the supplemental table of a primary publication. A major drawback to large-scale, expert-curated databases is the expense of maintaining and extending them over time. No entity apart from a major institution with stable long-term funding can consider this, and their scope is limited considering the magnitude of microbial data being generated daily. Wikidata is an openly editable, semantic web compatible framework for knowledge representation. It is a project of the Wikimedia Foundation and offers knowledge integration capabilities ideally suited to the challenge of representing the exploding body of information about microbial genomics. We are developing a microbe-specific data model, based on Wikidata's semantic web compatibility, which represents bacterial species, strains and the genes and gene products that define them. Currently, we have loaded 43,694 gene and 37,966 protein items for 21 species of bacteria, including the human pathogenic bacterium Chlamydia trachomatis. Using this pathogen as an example, we explore complex interactions between the pathogen, its host, associated genes, other microbes, disease and drugs using the Wikidata SPARQL endpoint. In our next phase of development, we will add another 99 bacterial genomes and their gene and gene products, totaling ∼900,000 additional entities. This aggregation of knowledge will be a platform for community-driven collaboration, allowing the networking of microbial genetic data through the sharing of knowledge by both the data and domain expert.


Subject(s)
Data Curation, Genome, Microbial, Models, Theoretical, Female, Gene Ontology, Genes, Bacterial, Humans, Molecular Sequence Annotation, Operon/genetics, Search Engine
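
Explorations like the host-pathogen queries mentioned in the abstract above go through the Wikidata SPARQL endpoint. The sketch below only composes a query string: P31 ("instance of"), Q7187 ("gene") and P703 ("found in taxon") are real Wikidata identifiers, while TAXON_QID is a hypothetical placeholder for the organism of interest, not an actual item ID.

```python
# Placeholder item ID for the organism of interest (hypothetical).
TAXON_QID = "Q000000"

def gene_query(taxon_qid, limit=10):
    """Compose a SPARQL query for genes annotated to a given taxon in Wikidata."""
    return f"""
SELECT ?gene ?geneLabel WHERE {{
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene
        wdt:P703 wd:{taxon_qid} . # found in taxon
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
"""

query = gene_query(TAXON_QID)
```

The resulting string would be sent to https://query.wikidata.org/sparql with an HTTP client; composing the query separately keeps the identifiers easy to audit before anything touches the live endpoint.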
19.
Article in English | MEDLINE | ID: mdl-27087308

ABSTRACT

Relations between chemicals and diseases are one of the most queried biomedical interactions. Although expert manual curation is the standard method for extracting these relations from the literature, it is expensive and impractical to apply to large numbers of documents, and therefore alternative methods are required. We describe here a crowdsourcing workflow for extracting chemical-induced disease relations from free text as part of the BioCreative V Chemical Disease Relation challenge. Five non-expert workers on the CrowdFlower platform were shown each potential chemical-induced disease relation highlighted in the original source text and asked to make binary judgments about whether the text supported the relation. Worker responses were aggregated through voting, and relations receiving four or more votes were predicted as true. On the official evaluation dataset of 500 PubMed abstracts, the crowd attained a 0.505 F-score (0.475 precision, 0.540 recall), with a maximum theoretical recall of 0.751 due to errors with named entity recognition. The total crowdsourcing cost was $1290.67 ($2.58 per abstract) and took a total of 7 h. A qualitative error analysis revealed that 46.66% of sampled errors were due to task limitations and gold standard errors, indicating that performance can still be improved. All code and results are publicly available at https://github.com/SuLab/crowd_cid_relex. Database URL: https://github.com/SuLab/crowd_cid_relex.


Subject(s)
Crowdsourcing, Data Curation/methods, Data Mining/methods, Databases, Factual, Disease/etiology, Hazardous Substances/toxicity, Humans, Workflow
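
The aggregation scheme in the abstract above (five workers per candidate relation, four or more "true" votes to accept) and the precision/recall/F-score arithmetic can be sketched in a few lines; the candidate relations, worker votes and gold labels below are invented purely for illustration.

```python
def aggregate_votes(votes_per_relation, threshold=4):
    """Predict a relation as true when it receives >= threshold 'true' votes."""
    return {rel: sum(votes) >= threshold for rel, votes in votes_per_relation.items()}

def precision_recall_f1(predicted, gold):
    """Standard P/R/F1 over binary predictions against gold labels."""
    tp = sum(1 for r, p in predicted.items() if p and gold.get(r, False))
    fp = sum(1 for r, p in predicted.items() if p and not gold.get(r, False))
    fn = sum(1 for r, g in gold.items() if g and not predicted.get(r, False))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical worker votes (1 = judged true) for four candidate relations.
votes = {
    "cisplatin->nephrotoxicity": [1, 1, 1, 1, 1],
    "aspirin->tinnitus":         [1, 1, 1, 1, 0],
    "caffeine->insomnia":        [1, 1, 0, 0, 0],
    "ibuprofen->alopecia":       [0, 0, 0, 0, 1],
}
gold = {
    "cisplatin->nephrotoxicity": True,
    "aspirin->tinnitus":         True,
    "caffeine->insomnia":        True,
    "ibuprofen->alopecia":       False,
}

predicted = aggregate_votes(votes)
p, r, f = precision_recall_f1(predicted, gold)
```

The high vote threshold trades recall for precision: a relation that only a minority of workers endorse is dropped even when it is in the gold standard.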
20.
Citiz Sci ; 1(2), 2016.
Article in English | MEDLINE | ID: mdl-30416754

ABSTRACT

Biomedical literature represents one of the largest and fastest growing collections of unstructured biomedical knowledge. Finding critical information buried in the literature can be challenging. To extract information from free-flowing text, researchers need to: 1. identify the entities in the text (named entity recognition), 2. apply a standardized vocabulary to these entities (normalization), and 3. identify how entities in the text are related to one another (relationship extraction). Researchers have primarily approached these information extraction tasks through manual expert curation and computational methods. We have previously demonstrated that named entity recognition (NER) tasks can be crowdsourced to a group of non-experts via the paid microtask platform, Amazon Mechanical Turk (AMT), and can dramatically reduce the cost and increase the throughput of biocuration efforts. However, given the size of the biomedical literature, even information extraction via paid microtask platforms is not scalable. With our web-based application Mark2Cure (http://mark2cure.org), we demonstrate that NER tasks also can be performed by volunteer citizen scientists with high accuracy. We apply metrics from the Zooniverse Matrices of Citizen Science Success and provide the results here to serve as a basis of comparison for other citizen science projects. Further, we discuss design considerations, issues, and the application of analytics for successfully moving a crowdsourcing workflow from a paid microtask platform to a citizen science platform. To our knowledge, this study is the first application of citizen science to a natural language processing task.
