Results 1 - 20 of 68
1.
J Biomed Inform ; 75S: S120-S128, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28694118

ABSTRACT

OBJECTIVE: Our objective was to develop a machine learning-based system to determine the severity of Positive Valence symptoms for a patient, based on information included in their initial psychiatric evaluation. Severity was rated by experts on an ordinal scale of 0-3: 0 (absent = no symptoms), 1 (mild = modest significance), 2 (moderate = requires treatment), 3 (severe = causes substantial impairment). MATERIALS AND METHODS: We treated the task of assigning Positive Valence severity as a text classification problem. During development, we experimented with regularized multinomial logistic regression classifiers, gradient boosted trees, and feedforward, fully-connected neural networks. We found both regularization and feature selection via mutual information to be very important in preventing models from overfitting the data. Our best configuration was a neural network with three fully connected hidden layers with rectified linear unit activations. RESULTS: Our best performing system achieved a score of 77.86%. The evaluation metric is an inverse normalization of the Mean Absolute Error, presented as a percentage between 0 and 100, where 100 is the highest performance. Error analysis showed that 90% of the system errors involved neighboring severity categories. CONCLUSION: Machine learning text classification techniques with feature selection can be trained to recognize broad differences in Positive Valence symptom severity with a modest amount of training data (in this case 600 documents, 167 of which were unannotated). An increase in the amount of annotated data can increase the accuracy of symptom severity classification by several percentage points. Additional features and/or a larger training corpus may further improve accuracy.
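The scoring metric can be made concrete with a minimal sketch, assuming the normalization divides MAE by the worst attainable error on the 0-3 scale (an assumption; the shared task may have normalized differently):

```python
def inverse_normalized_mae(y_true, y_pred, max_error=3):
    # Mean absolute error between gold and predicted severities (0-3).
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    # Invert and rescale so 100 = perfect agreement, 0 = worst possible.
    return 100.0 * (1.0 - mae / max_error)

# Example: two of four notes predicted one category off.
print(inverse_normalized_mae([0, 1, 2, 3], [0, 1, 3, 2]))  # 83.33...
```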


Subject(s)
Automation, Neural Networks, Computer, Humans, Machine Learning
2.
Bioinformatics ; 31(2): 297-8, 2015 Jan 15.
Article in English | MEDLINE | ID: mdl-25024289

ABSTRACT

UNLABELLED: The ISMB Special Interest Group on Linking Literature, Information and Knowledge for Biology (BioLINK) organized a one-day workshop at ISMB/ECCB 2013 in Berlin, Germany. The theme of the workshop was 'Roles for text mining in biomedical knowledge discovery and translational medicine'. This summary reviews the outcomes of the workshop. Meeting themes included concept annotation methods and applications, extraction of biological relationships and the use of text-mined data for biological data analysis. AVAILABILITY AND IMPLEMENTATION: All articles are available at http://biolinksig.org/proceedings-online/.


Subject(s)
Computational Biology/methods, Data Mining, Periodicals as Topic, Research Report, Computational Biology/standards, Congresses as Topic, Humans
3.
J Biomed Inform ; 61: 97-109, 2016 06.
Article in English | MEDLINE | ID: mdl-27020263

ABSTRACT

OBJECTIVE: Electronic medical records (EMRs) are increasingly repurposed for activities beyond clinical care, such as to support translational research and public policy analysis. To mitigate privacy risks, healthcare organizations (HCOs) aim to remove potentially identifying patient information. A substantial quantity of EMR data is in natural language form, and there are concerns that automated tools for detecting identifiers are imperfect and leak information that can be exploited by ill-intentioned data recipients. Thus, HCOs have been encouraged to invest as much effort as possible in finding and detecting potential identifiers, but such a strategy assumes the recipients are sufficiently incentivized and capable of exploiting leaked identifiers. In practice, this assumption may not hold true, and HCOs may overinvest in de-identification technology. The goal of this study is to design a natural language de-identification framework, rooted in game theory, that enables an HCO to optimize its investments given the expected capabilities of an adversarial recipient. METHODS: We introduce a Stackelberg game to balance risk and utility in natural language de-identification. This game represents a cost-benefit model that enables an HCO with a fixed budget to minimize its investment in the de-identification process. We evaluate this model by assessing the overall payoff to the HCO and the adversary using 2100 clinical notes from Vanderbilt University Medical Center. We simulate several policy alternatives using a range of parameters, including the cost of training a de-identification model and the loss in data utility due to the removal of terms that are not identifiers. In addition, we compare policy options where, when an attacker is fined for misuse, a monetary penalty is paid to the publishing HCO as opposed to a third party (e.g., a federal regulator). RESULTS: Our results show that when an HCO is forced to exhaust a limited budget (set to $2000 in the study), the de-identification achieves a precision of 0.86 and a recall of 0.80. A game-based approach enables a more refined cost-benefit tradeoff, improving both privacy and utility for the HCO. For example, our investigation shows that it is possible for an HCO to release the data without spending its entire budget on de-identification and still deter the attacker, with a de-identification precision of 0.77 and recall of 0.61. There also exist scenarios in which the model indicates an HCO should not release any data because the risk is too great. In addition, we find that the practice of paying fines back to an HCO (an artifact of suing for breach of contract), as opposed to a third party such as a federal regulator, can induce an elevated level of data sharing risk, where the HCO is incentivized to bait the attacker to elicit compensation. CONCLUSIONS: A game-theoretic framework can lead HCOs to optimized decisions about natural language de-identification investments before sharing EMR data.
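A toy sketch of the leader/follower structure described above; all payoff numbers and the leak-rate curve are invented for illustration and are far simpler than the paper's actual cost-benefit model:

```python
def attacker_payoff(leak_rate, gain_per_record, attack_cost, expected_fine):
    # Follower: attacking pays off only if the expected gain from
    # exploitable leaks beats the attack cost plus the expected penalty.
    return leak_rate * gain_per_record - attack_cost - expected_fine

def best_defender_move(budget=2000, data_utility=5000, breach_loss=3000):
    # Leader: pick the de-identification investment that maximizes payoff,
    # anticipating the attacker's best response at each investment level.
    best = (float("-inf"), None)
    for investment in range(0, budget + 1, 250):
        leak_rate = 1.0 / (1.0 + investment / 500.0)  # toy: more spend, fewer leaks
        attacked = attacker_payoff(leak_rate, 800, 300, 400) > 0
        payoff = data_utility - investment - (breach_loss * leak_rate if attacked else 0.0)
        if payoff > best[0]:
            best = (payoff, investment)
    return best  # (defender payoff, optimal investment)

# With these toy numbers the optimum deters the attacker without
# exhausting the budget, mirroring the qualitative finding above.
print(best_defender_move())  # (4750.0, 250)
```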


Subject(s)
Confidentiality, Electronic Health Records, Natural Language Processing, Humans, Language, Risk
4.
PLoS Biol ; 9(6): e1001088, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21713030

ABSTRACT

A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe this information and make it available so as to maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities. Here we provide a short history of the GSC, an overview of its range of current activities, and a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.


Subject(s)
Databases, Genetic, Genomics/standards, International Cooperation, Metagenome
5.
J Mol Recognit ; 26(9): 426-31, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23836470

ABSTRACT

SugarBindDB lists pathogen and biotoxin lectins and their carbohydrate ligands in a searchable format. Information is collected from articles published in peer-reviewed scientific journals. Help files guide the user through the search process and provide a review of structures and names of sugars that appear in human oligosaccharides. Glycans are written in the condensed form of the carbohydrate nomenclature system developed by the International Union of Pure and Applied Chemistry (IUPAC). Since its online publication by The MITRE Corporation in 2005, the database has served as a resource for research on the glycobiology of infectious disease. SugarBindDB is currently hosted by the Swiss Institute of Bioinformatics on the ExPASy server and will be enhanced and linked to related resources as part of the wider UniCarbKB initiative. Enhancements will include the option to display glycans in a variety of formats, including modified 2D condensed IUPAC and symbolic nomenclature.


Subject(s)
Databases, Protein, Lectins/chemistry, Polysaccharides/chemistry, Bacterial Proteins, Carbohydrate Conformation, Carbohydrate Sequence, Glycolipids/chemistry, Glycoproteins/chemistry, Humans, Molecular Sequence Data, Protein Binding, User-Computer Interface
6.
Database (Oxford) ; 2022, 2022 Sep 02.
Article in English | MEDLINE | ID: mdl-36050787

ABSTRACT

Monitoring drug safety is a central concern throughout the drug life cycle. Information about toxicity and adverse events is generated at every stage of this life cycle, and stakeholders have a strong interest in applying text mining and artificial intelligence (AI) methods to manage the ever-increasing volume of this information. Recognizing the importance of these applications and the role of challenge evaluations to drive progress in text mining, the organizers of BioCreative VII (Critical Assessment of Information Extraction in Biology) convened a panel of experts to explore 'Challenges in Mining Drug Adverse Reactions'. This article is an outgrowth of the panel; each panelist has highlighted specific text mining application(s), based on their research and their experiences in organizing text mining challenge evaluations. While these highlighted applications only sample the complexity of this problem space, they reveal both opportunities and challenges for text mining to aid in the complex process of drug discovery, testing, marketing and post-market surveillance. Stakeholders are eager to embrace natural language processing and AI tools to help in this process, provided that these tools can be demonstrated to add value to stakeholder workflows. This creates an opportunity for the BioCreative community to work in partnership with regulatory agencies, pharma and the text mining community to identify next steps for future challenge evaluations.


Subject(s)
Artificial Intelligence, Computational Biology, Computational Biology/methods, Data Mining/methods, Health Personnel, Humans, Natural Language Processing
7.
Database (Oxford) ; 2022, 2022 Oct 05.
Article in English | MEDLINE | ID: mdl-36197453

ABSTRACT

The coronavirus disease 2019 (COVID-19) pandemic has compelled biomedical researchers to communicate data in real time to establish more effective medical treatments and public health policies. Nontraditional sources such as preprint publications, i.e. articles not yet validated by peer review, have become crucial hubs for the dissemination of scientific results. Natural language processing (NLP) systems have been recently developed to extract and organize COVID-19 data in reasoning systems. Given this scenario, the BioCreative COVID-19 text mining tool interactive demonstration track was created to assess the landscape of the available tools and to gauge user interest, thereby providing a two-way communication channel between NLP system developers and potential end users. The goal was to inform system designers about the performance and usability of their products and to suggest new additional features. Considering the exploratory nature of this track, the call for participation solicited teams to apply for the track, based on their system's ability to perform COVID-19-related tasks and interest in receiving user feedback. We also recruited volunteer users to test systems. Seven teams registered systems for the track, and >30 individuals volunteered as test users; these volunteer users covered a broad range of specialties, including bench scientists, bioinformaticians and biocurators. The users, who had the option to participate anonymously, were provided with written and video documentation to familiarize themselves with the NLP tools and completed a survey to record their evaluation. Additional feedback was also provided by NLP system developers. The track was well received as shown by the overall positive feedback from the participating teams and the users. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-4/.


Subject(s)
COVID-19, COVID-19/epidemiology, Data Mining/methods, Databases, Factual, Documentation, Humans, Natural Language Processing
8.
BMC Bioinformatics ; 12 Suppl 8: S1, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151647

ABSTRACT

BACKGROUND: The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools that are useful to the communities of researchers and database curators in the biological sciences. To this end, BioCreative I was held in 2004, BioCreative II in 2007, and BioCreative II.5 in 2009. Each of these workshops involved human-annotated test data for several basic tasks in text mining applied to the biomedical literature. Participants in the workshops were invited to compete in the tasks by constructing software systems to perform the tasks automatically and were given scores based on their performance. The results of these workshops have benefited the community in several ways: they have 1) provided evidence for the most effective methods currently available to solve specific problems, 2) revealed the current state of the art for performance on those problems, and 3) provided gold-standard data, and results on that data, by which future advances can be gauged. This special issue contains overview papers for the three tasks of BioCreative III. RESULTS: The BioCreative III Workshop was held in September 2010 and continued the tradition of a challenge evaluation on several tasks judged basic to effective text mining in biology, including a gene normalization (GN) task and two protein-protein interaction (PPI) tasks. In total, the Workshop involved the work of twenty-three teams. Thirteen teams participated in the GN task, which required the assignment of EntrezGene IDs to all named genes in full-text papers without any species information being provided to a system. Ten teams participated in the PPI article classification task (ACT), which required a system to classify and rank a PubMed® record as belonging to an article that either has or does not have "PPI-relevant" information. Eight teams participated in the PPI interaction method task (IMT), where systems were given full-text documents and were required to extract the experimental methods used to establish PPIs and a text segment supporting each such method. Gold-standard data were compiled for each of these tasks, and participants competed in developing systems to perform the tasks automatically. BioCreative III also introduced a new interactive task (IAT), run as a demonstration task. The goal was to develop an interactive system to facilitate a user's annotation of the unique database identifiers for all the genes appearing in an article. This task included ranking genes by importance (based preferably on the amount of described experimental information regarding the genes). There was also an optional task to assist the user in finding the most relevant articles about a given gene. For BioCreative III, a user advisory group (UAG) was assembled and played an important role 1) in producing some of the gold-standard annotations for the GN task, 2) in critiquing IAT systems, and 3) in providing guidance for a future, more rigorous evaluation of IAT systems. Six teams participated in the IAT demonstration task and received feedback on their systems from the UAG. Besides innovations that made the GN and PPI tasks more realistic and practical, and the introduction of the IAT task, discussions began on community data standards to promote interoperability and on user requirements and evaluation metrics to address the utility and usability of systems. CONCLUSIONS: In this paper we give a brief history of the BioCreative Workshops and how they relate to other text mining competitions in biology. This is followed by a synopsis of the three BioCreative III tasks (GN, PPI, and IAT), with figures for the best participant performance on the GN and PPI tasks. These results are discussed and compared with results from previous BioCreative Workshops, and we conclude that the best-performing systems for GN, PPI-ACT, and PPI-IMT are not sufficient for fully automatic use in realistic settings. This provides evidence for the importance of interactive systems, and in the remainder of the paper we present our vision of how best to construct an interactive system for a GN- or PPI-like task.


Subject(s)
Computational Biology/methods, Data Mining, Genes, Proteins/metabolism, Software, Animals, Computational Biology/standards, Humans, Periodicals as Topic, Proteins/genetics
9.
BMC Bioinformatics ; 12 Suppl 8: S4, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151968

ABSTRACT

BACKGROUND: The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end-user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow, and adoption by curators. Thus, in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested. RESULTS: A User Advisory Group (UAG) actively participated in the IAT design and assessment. The task focused on gene normalization (identifying gene mentions in the article and linking these genes to standard database identifiers), gene ranking based on the overall importance of each gene mentioned in the article, and gene-oriented document retrieval (identifying full-text papers relevant to a selected gene). Six systems participated, and all processed and displayed the same set of articles. The articles were selected based on content known to be problematic for curation, such as ambiguity of gene names, coverage of multiple genes and species, or introduction of a new gene name. Members of the UAG curated three articles for training and assessment purposes, and each member was assigned a system to review. After using the systems to curate articles, participants answered a questionnaire on interface usability and task performance (as measured by precision and recall). Although the limited number of articles analyzed and users involved in the IAT experiment precluded rigorous quantitative analysis of the results, a qualitative analysis provided valuable insight into some of the problems users encountered with the systems. The overall assessment indicates that the system usability features appealed to most users, but system performance was suboptimal (mainly due to low accuracy in gene normalization). Issues included failures of species identification and gene name ambiguity in the gene normalization task, leading to extensive lists of gene identifiers to review that, in some cases, did not contain the relevant genes. The document retrieval suffered from the same shortfalls. The UAG favored achieving high performance (measured by precision and recall) but strongly recommended the addition of features that facilitate the identification of the correct gene and its identifier, such as contextual information to assist in disambiguation. DISCUSSION: The IAT was an informative exercise that advanced the dialog between curators and developers and increased each group's appreciation of the challenges the other faces. A major conclusion was that the intended users should be actively involved in every phase of software development, and this will be strongly encouraged in future tasks. The IAT provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.


Subject(s)
Data Mining/methods, Genes, Animals, Computational Biology/methods, Periodicals as Topic, Plants/genetics, Plants/metabolism
10.
Front Artif Intell ; 4: 670009, 2021.
Article in English | MEDLINE | ID: mdl-34738081

ABSTRACT

There is a growing desire to create computer systems that can collaborate with humans on complex, open-ended activities. These activities typically have no set completion criteria and frequently involve multimodal communication, extensive world knowledge, creativity, and building structures or compositions through multiple steps. Because these systems differ from question and answer (Q&A) systems, chatbots, and simple task-oriented assistants, new methods for evaluating such collaborative computer systems are needed. Here, we present a set of criteria for evaluating these systems, called Hallmarks of Human-Machine Collaboration. The Hallmarks build on the success of heuristic evaluation used by the user interface community and past evaluation techniques used in the spoken language and chatbot communities. They consist of observable characteristics indicative of successful collaborative communication, grouped into eight high-level properties: robustness; habitability; mutual contribution of meaningful content; context-awareness; consistent human engagement; provision of rationale; use of elementary concepts to teach and learn new concepts; and successful collaboration. We present examples of how we used these Hallmarks in the DARPA Communicating with Computers (CwC) program to evaluate diverse activities, including story and music generation, interactive building with blocks, and exploration of molecular mechanisms in cancer. We used the Hallmarks as guides for developers and as diagnostics, assessing systems with the Hallmarks to identify strengths and opportunities for improvement using logs from user studies, surveying the human partner, third-party review of creative products, and direct tests. Informal feedback from CwC technology developers indicates that the use of the Hallmarks for program evaluation helped guide development. The Hallmarks also made it possible to identify areas of progress and major gaps in developing systems where the machine is an equal, creative partner.

11.
Drug Saf ; 44(1): 83-94, 2021 01.
Article in English | MEDLINE | ID: mdl-33006728

ABSTRACT

INTRODUCTION: The US FDA is interested in a tool that would enable pharmacovigilance safety evaluators to automate the identification of adverse drug events (ADEs) mentioned in FDA prescribing information. The MITRE Corporation (MITRE) and the FDA organized a shared task, Adverse Drug Event Evaluation (ADE Eval), to determine whether the performance of algorithms currently used for natural language processing (NLP) might be good enough for real-world use. OBJECTIVE: ADE Eval was conducted to evaluate a range of NLP techniques for identifying ADEs mentioned in publicly available FDA-approved drug labels (package inserts). It was designed specifically to reflect pharmacovigilance practices within the FDA and to model possible pharmacovigilance use cases. METHODS: Pharmacovigilance-specific annotation guidelines and annotated corpora were created. Two metrics modeled the experiences of FDA safety evaluators: one measured the ability of an algorithm to identify the correct Medical Dictionary for Regulatory Activities (MedDRA®) terms for the text from the annotated corpora, and the other assessed the quality of the evidence extracted from the corpora to support the selected MedDRA® term by measuring the portion of annotated text an algorithm correctly identified. A third metric assessed the cost of correcting system output for subsequent training (averaged, weighted F1-measure for mention finding). RESULTS: In total, 13 teams submitted 23 runs: the top MedDRA® coding F1-measure was 0.79, the top quality score was 0.96, and the top mention-finding F1-measure was 0.89. CONCLUSION: While NLP techniques do not yet perform at levels that would allow them to be used without intervention, it is now worthwhile to explore making NLP outputs available in human pharmacovigilance workflows.
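As one simplified reading of the evidence-quality metric (the task's exact definition may differ), the score for a single mention can be taken as the fraction of the gold-annotated span that the system's span covers:

```python
def evidence_quality(gold_span, system_span):
    # Spans are (start, end) character offsets; the score is the portion
    # of the gold annotation that the system correctly identified.
    g0, g1 = gold_span
    s0, s1 = system_span
    overlap = max(0, min(g1, s1) - max(g0, s0))
    return overlap / (g1 - g0) if g1 > g0 else 0.0

print(evidence_quality((10, 30), (15, 40)))  # 0.75
```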


Subject(s)
Drug-Related Side Effects and Adverse Reactions, Pharmaceutical Preparations, Adverse Drug Reaction Reporting Systems, Algorithms, Drug-Related Side Effects and Adverse Reactions/diagnosis, Drug-Related Side Effects and Adverse Reactions/epidemiology, Humans, Natural Language Processing, Pharmacovigilance
12.
J Am Med Inform Assoc ; 27(9): 1374-1382, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32930712

ABSTRACT

OBJECTIVE: Effective, scalable de-identification of personally identifying information (PII) in information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect the remaining unredacted PII. MATERIALS AND METHODS: Using 2000 representative clinical documents from each of 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers. RESULTS: Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal readers and 22% (precision = 33%) for external readers. DISCUSSION AND CONCLUSIONS: Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario. This is more than double the rate reported in previous studies but less than the rate reported for an attack assisted by machine learning methods.
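A minimal sketch of HIPS resynthesis, assuming a tagger that returns (start, end, type) spans and using a toy surrogate table rather than the study's resynthesis engine:

```python
import random

SURROGATES = {  # hypothetical lookup; real systems generate realistic values
    "NAME": ["Alice Barnes", "John Otero"],
    "DATE": ["03/14/2012", "11/02/2015"],
}

def hips_resynthesize(text, spans):
    # Replace each tagged PII span with a fictitious surrogate, so any
    # PII the tagger missed blends in with the resynthesized content.
    out, last = [], 0
    for start, end, pii_type in sorted(spans):
        out.append(text[last:start])
        out.append(random.choice(SURROGATES.get(pii_type, ["[REDACTED]"])))
        last = end
    out.append(text[last:])
    return "".join(out)

note = "Seen by Dr. Marcus Webb on 06/01/2013."
print(hips_resynthesize(note, [(12, 23, "NAME"), (27, 37, "DATE")]))
```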


Subject(s)
Confidentiality, Data Anonymization, Electronic Health Records, Computer Security, Humans, Natural Language Processing
13.
BMC Bioinformatics ; 10: 326, 2009 Oct 08.
Article in English | MEDLINE | ID: mdl-19814812

ABSTRACT

BACKGROUND: The Comparative Toxicogenomics Database (CTD) is a publicly available resource that promotes understanding about the etiology of environmental diseases. It provides manually curated chemical-gene/protein interactions and chemical- and gene-disease relationships from the peer-reviewed, published literature. The goals of the research reported here were to establish a baseline analysis of current CTD curation, develop a text-mining prototype from readily available open source components, and evaluate its potential value in augmenting curation efficiency and increasing data coverage. RESULTS: Prototype text-mining applications were developed and evaluated using a CTD data set consisting of manually curated molecular interactions and relationships from 1,600 documents. Preliminary results indicated that the prototype found 80% of the gene, chemical, and disease terms appearing in curated interactions. These terms were used to re-rank documents for curation, resulting in increases in mean average precision (63% for the baseline vs. 73% for a rule-based re-ranking), and in the correlation coefficient of rank vs. number of curatable interactions per document (baseline 0.14 vs. 0.38 for the rule-based re-ranking). CONCLUSION: This text-mining project is unique in its integration of existing tools into a single workflow with direct application to CTD. We performed a baseline assessment of the inter-curator consistency and coverage in CTD, which allowed us to measure the potential of these integrated tools to improve prioritization of journal articles for manual curation. Our study presents a feasible and cost-effective approach for developing a text mining solution to enhance manual curation throughput and efficiency.
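For reference, mean average precision over document rankings can be computed with the standard IR formula sketched below (not code from the study):

```python
def average_precision(ranked_flags):
    # ranked_flags[k] is True when the document at rank k+1 contains
    # curatable interactions; AP averages precision@k over those hits.
    hits, ap = 0, 0.0
    for k, flag in enumerate(ranked_flags, start=1):
        if flag:
            hits += 1
            ap += hits / k
    return ap / hits if hits else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(r) for r in rankings) / len(rankings)

print(average_precision([True, False, True, True]))  # ~0.806
```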


Subject(s)
Computational Biology/methods, Databases, Factual, Gene Regulatory Networks, Information Storage and Retrieval/methods, Toxicogenetics
14.
J Am Med Inform Assoc ; 26(12): 1536-1544, 2019 12 01.
Article in English | MEDLINE | ID: mdl-31390016

ABSTRACT

OBJECTIVE: Clinical corpora can be deidentified using a combination of machine-learned automated taggers and hiding in plain sight (HIPS) resynthesis. The latter replaces detected personally identifiable information (PII) with random surrogates, allowing leaked PII to blend in or "hide in plain sight." We evaluated the extent to which a malicious attacker could expose leaked PII in such a corpus. MATERIALS AND METHODS: We modeled a scenario where an institution (the defender) externally shared an 800-note corpus of actual outpatient clinical encounter notes from a large, integrated health care delivery system in Washington State. These notes were deidentified by a machine-learned PII tagger and HIPS resynthesis. A malicious attacker obtained and performed a parrot attack intending to expose leaked PII in this corpus. Specifically, the attacker mimicked the defender's process by manually annotating all PII-like content in half of the released corpus, training a PII tagger on these data, and using the trained model to tag the remaining encounter notes. The attacker hypothesized that untagged identifiers would be leaked PII, discoverable by manual review. We evaluated the attacker's success using measures of leak-detection rate and accuracy. RESULTS: The attacker correctly hypothesized that 211 (68%) of 310 actual PII leaks in the corpus were leaks, and wrongly hypothesized that 191 resynthesized PII instances were also leaks. One-third of actual leaks remained undetected. DISCUSSION AND CONCLUSION: A malicious parrot attack to reveal leaked PII in clinical text deidentified by machine-learned HIPS resynthesis can attenuate but not eliminate the protective effect of HIPS deidentification.
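The attack's core hypothesis (identifier-like strings left untagged are real leaks) reduces to a set difference; in this sketch the regex is a toy stand-in for the attacker's manual annotation, and the real study trained a full machine-learned PII tagger:

```python
import re

def identifier_like(text):
    # Toy proxy for "PII-like content": capitalized name pairs and dates.
    pattern = r"\b[A-Z][a-z]+ [A-Z][a-z]+\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b"
    return set(re.findall(pattern, text))

def suspected_leaks(note, tagger_predict):
    # Anything identifier-like that the mimicking tagger did NOT tag is
    # hypothesized to be leaked (real, unredacted) PII.
    return identifier_like(note) - set(tagger_predict(note))

# Example with a stub tagger that only catches the date:
print(suspected_leaks("Seen by Marcus Webb on 06/01/2013.",
                      lambda n: ["06/01/2013"]))  # {'Marcus Webb'}
```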


Subject(s)
Computer Security, Confidentiality, Data Anonymization, Electronic Health Records, Machine Learning, Personal Information, Ambulatory Care Facilities, Delivery of Health Care, Humans, Washington
15.
OMICS ; 12(2): 129-36, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18416669

ABSTRACT

There is an urgent need to capture metadata on the rapidly growing number of genomic, metagenomic and related sequences, such as 16S ribosomal genes. This need is a major focus within the Genomic Standards Consortium (GSC), and Habitat is a key metadata descriptor in the proposed "Minimum Information about a Genome Sequence" (MIGS) specification. The goal of the work described here is to provide a light-weight, easy-to-use (small) set of terms ("Habitat-Lite") that captures high-level information about habitat while preserving a mapping to the recently launched Environment Ontology (EnvO). Our motivation for building Habitat-Lite is to meet the needs of multiple users, such as annotators curating these data, database providers hosting the data, and biologists and bioinformaticians alike who need to search and employ such data in comparative analyses. Here, we report a case study based on semiautomated identification of terms from GenBank and GOLD. We estimate that the terms in the initial version of Habitat-Lite would provide useful labels for over 60% of the kinds of information found in the GenBank isolation_source field, and around 85% of the terms in the GOLD habitat field. We present a revised version of Habitat-Lite defined within the EnvO Environmental Ontology through a new category, EnvO-Lite-GSC. We invite the community's feedback on its further development to provide a minimum list of terms to capture high-level habitat information and to provide classification bins needed for future studies.


Subject(s)
Genomics, Databases, Genetic, Reference Standards
16.
OMICS ; 12(2): 157-60, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18564916

ABSTRACT

Increasingly, we are aware as a community of the growing need to manage the avalanche of genomic and metagenomic data, in addition to related data types like ribosomal RNA and barcode sequences, in a way that tightly integrates contextual data with traditional literature in a machine-readable way. It is for this reason that the Genomic Standards Consortium (GSC) formed in 2005. Here we suggest that we move beyond the development of standards and tackle standards compliance and improved data capture at the level of the scientific publication. We are supported in this goal by the fact that the scientific community is in the midst of a publishing revolution. This revolution is marked by a growing shift away from a traditional dichotomy between "journal articles" and "database entries" and an increasing adoption of hybrid models of collecting and disseminating scientific information. With respect to genomes and metagenomes and related data types, we feel the scientific community would be best served by the immediate launch of a central repository of short, highly structured "Genome Notes" that must be standards compliant. This could be done in the context of an existing journal, but we also suggest the more radical solution of launching a new journal. Such a journal could be designed to cater to a wide range of standards-related content types that are not currently centralized in the published literature. It could also support the demand for centralizing aspects of the "gray literature" (documents developed by institutions or communities) such as the call by the GSC for a central repository of Standard Operating Procedures describing the genomic annotation pipelines of the major sequencing centers. We argue that such an "eJournal," published under the Open Access paradigm by the GSC, could be an attractive publishing forum for a broader range of standardization initiatives within, and beyond, the GSC and thereby fill an unoccupied yet increasingly important niche within the current research landscape.


Subject(s)
Genomics/standards, Guideline Adherence, Publications
17.
OMICS ; 12(2): 101-8, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18564914

ABSTRACT

This meeting report summarizes the proceedings of the "eGenomics: Cataloguing our Complete Genome Collection IV" workshop held June 6-8, 2007, at the National Institute for Environmental eScience (NIEeS), Cambridge, United Kingdom. This fourth workshop of the Genomic Standards Consortium (GSC) was a mix of short presentations, strategy discussions, and technical sessions. Speakers provided progress reports on the development of the "Minimum Information about a Genome Sequence" (MIGS) specification and the closely integrated "Minimum Information about a Metagenome Sequence" (MIMS) specification. The key outcome of the workshop was consensus on the next version of the MIGS/MIMS specification (v1.2). This drove further definition and restructuring of the MIGS/MIMS XML schema (syntax). With respect to semantics, a term vetting group was established to ensure that terms are properly defined and submitted to the appropriate ontology projects. Perhaps the single most important outcome of the workshop was a proposal to move beyond the concept of "minimum" to create a far richer XML schema that would define a "Genomic Contextual Data Markup Language" (GCDML) suitable for wider semantic integration across databases. GCDML will contain not only curated information (e.g., compliant with MIGS/MIMS), but also be extended to include a variety of data processing and calculations. Further information about the Genomic Standards Consortium and its range of activities can be found at http://gensc.org.


Subject(s)
Databases, Genetic, Genomics, Education, Programming Languages, Reference Standards
18.
OMICS ; 12(2): 109-13, 2008 Jun.
Article in English | MEDLINE | ID: mdl-18564915

ABSTRACT

This meeting report summarizes the proceedings of the fifth Genomic Standards Consortium (GSC) workshop held December 12-14, 2007, at the European Bioinformatics Institute (EBI), Cambridge, UK. This fifth workshop served as a milestone event in the evolution of the GSC (launched in September 2005); the key outcome of the workshop was the finalization of a stable version of the MIGS specification (v2.0) for publication. This accomplishment enables, and also in some cases necessitates, downstream activities, which are described in the multiauthor, consensus-driven articles in this special issue of OMICS produced as a direct result of the workshop. This report briefly summarizes the workshop and overviews the special issue. In particular, it aims to explain how the various GSC-led projects are working together to help this community achieve its stated mission of further standardizing the descriptions of genomes and metagenomes and implementing improved mechanisms of data exchange and integration to enable more accurate comparative analyses. Further information about the GSC and its range of activities can be found at http://gensc.org.


Subject(s)
Genomics, Education, Reference Standards
19.
J Am Med Inform Assoc ; 14(5): 564-73, 2007.
Article in English | MEDLINE | ID: mdl-17600096

ABSTRACT

OBJECTIVE: This paper describes a successful approach to de-identification that was developed to participate in a recent AMIA-sponsored challenge evaluation. METHOD: Our approach focused on rapid adaptation of two existing toolkits for named entity recognition, Carafe and LingPipe. RESULTS: The "out of the box" Carafe system achieved a very good score (phrase F-measure of 0.9664) with only four hours of work to adapt it to the de-identification task. With further tuning, we were able to reduce the token-level error term by over 36% through task-specific feature engineering and the introduction of a lexicon, achieving a phrase F-measure of 0.9736. CONCLUSIONS: We were able to achieve good performance on the de-identification task by rapidly retargeting existing toolkits. For the Carafe system, we developed a method for tuning the balance of recall vs. precision, as well as a confidence score that correlated well with the measured F-score.
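The abstract does not say how the recall/precision balance was tuned for Carafe; a generic way to do this, sketched under that caveat, is to sweep a confidence threshold and keep the one maximizing a recall-weighted F-beta:

```python
def fbeta(p, r, beta):
    return (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0

def best_threshold(scored_tokens, beta=2.0):
    # scored_tokens: (confidence, is_pii) pairs; beta > 1 favors recall
    # (safer for de-identification), beta < 1 favors precision.
    total_pos = sum(1 for _, y in scored_tokens if y)
    best_score, best_t = -1.0, None
    for t in sorted({s for s, _ in scored_tokens}):
        tp = sum(1 for s, y in scored_tokens if s >= t and y)
        fp = sum(1 for s, y in scored_tokens if s >= t and not y)
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / total_pos if total_pos else 0.0
        score = fbeta(p, r, beta)
        if score > best_score:
            best_score, best_t = score, t
    return best_t, best_score
```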


Subject(s)
Confidentiality, Medical Records Systems, Computerized, Natural Language Processing, Evaluation Studies as Topic, Humans
20.
Methods Inf Med ; 55(4): 356-64, 2016 Aug 05.
Article in English | MEDLINE | ID: mdl-27405787

ABSTRACT

BACKGROUND: Clinical text contains valuable information but must be de-identified before it can be used for secondary purposes. Accurate annotation of personally identifiable information (PII) is essential to the development of automated de-identification systems and to manual redaction of PII. Yet the accuracy of annotations may vary considerably across individual annotators, and annotation is costly. As such, the marginal benefit of incorporating additional annotators has not been well characterized. OBJECTIVES: This study models the costs and benefits of incorporating increasing numbers of independent human annotators to identify the instances of PII in a corpus. We used a corpus with gold-standard annotations to evaluate the performance of teams of annotators of increasing size. METHODS: Four annotators independently identified PII in a 100-document corpus consisting of randomly selected clinical notes from Family Practice clinics in a large integrated health care system. These annotations were pooled and validated to generate a gold-standard corpus for evaluation. RESULTS: Recall rates for all PII types ranged from 0.90 to 0.98 for individual annotators and from 0.998 to 1.0 for teams of three, when measured against the gold standard. Median cost per PII instance discovered during corpus annotation ranged from $0.71 for an individual annotator to $377 for annotations discovered only by a fourth annotator. CONCLUSIONS: Incorporating a second annotator into a PII annotation process reduces unredacted PII and improves the quality of annotations to 0.99 recall, yielding clear benefit at reasonable cost; the cost advantages of annotation teams larger than two diminish rapidly.
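The team-recall numbers are consistent with a simple pooling model: assuming annotators miss instances independently (a strong assumption, since real annotators share blind spots), a pooled team misses an instance only if every member does:

```python
def pooled_recall(individual_recalls):
    # P(team misses) = product of individual miss rates under independence.
    miss = 1.0
    for r in individual_recalls:
        miss *= (1.0 - r)
    return 1.0 - miss

# Three annotators at 0.90 recall each -> 0.999, in the same range as
# the 0.998-1.0 reported above for teams of three.
print(pooled_recall([0.90, 0.90, 0.90]))
```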


Subject(s)
Cost-Benefit Analysis/economics, Data Mining/economics, Patient Identification Systems/economics, Electronic Health Records, Humans