Results 1 - 20 of 129
1.
Am J Hum Genet ; 103(1): 58-73, 2018 07 05.
Article in English | MEDLINE | ID: mdl-29961570

ABSTRACT

Integration of detailed phenotype information with genetic data is well established to facilitate accurate diagnosis of hereditary disorders. As a rich source of phenotype information, electronic health records (EHRs) promise to empower diagnostic variant interpretation. However, how to accurately and efficiently extract phenotypes from heterogeneous EHR narratives remains a challenge. Here, we present EHR-Phenolyzer, a high-throughput EHR framework for extracting and analyzing phenotypes. EHR-Phenolyzer extracts and normalizes Human Phenotype Ontology (HPO) concepts from EHR narratives and then prioritizes genes with causal variants on the basis of the HPO-coded phenotype manifestations. We assessed EHR-Phenolyzer on 28 pediatric individuals with confirmed diagnoses of monogenic diseases and found that the genes with causal variants were ranked among the top 100 genes selected by EHR-Phenolyzer for 16/28 individuals (p < 2.2 × 10^-16), supporting the value of phenotype-driven gene prioritization in diagnostic sequence interpretation. To assess the generalizability, we replicated this finding on an independent EHR dataset of ten individuals with a positive diagnosis from a different institution. We then assessed the broader utility by examining two additional EHR datasets, including 31 individuals who were suspected of having a Mendelian disease and underwent different types of genetic testing and 20 individuals with positive diagnoses of specific Mendelian etiologies of chronic kidney disease from exome sequencing. Finally, through several retrospective case studies, we demonstrated how combined analyses of genotype data and deep phenotype data from EHRs can expedite genetic diagnoses. In summary, EHR-Phenolyzer leverages EHR narratives to automate phenotype-driven analysis of clinical exomes or genomes, facilitating the broader implementation of genomic medicine.


Subject(s)
Exome/genetics , Adolescent , Child , Child, Preschool , Electronic Health Records , Female , Genetic Testing/methods , Genomics/methods , Genotype , Humans , Infant , Infant, Newborn , Male , Phenotype , Renal Insufficiency, Chronic/genetics , Retrospective Studies
2.
Brief Bioinform ; 19(5): 863-877, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28334070

ABSTRACT

Drug-drug interactions (DDIs) constitute an important concern in drug development and postmarketing pharmacovigilance. They are considered the cause of many adverse drug effects exposing patients to higher risks and increasing public health system costs. Methods to follow-up and discover possible DDIs causing harm to the population are a primary aim of drug safety researchers. Here, we review different methodologies and recent advances using data mining to detect DDIs with impact on patients. We focus on data mining of different pharmacovigilance sources, such as the US Food and Drug Administration Adverse Event Reporting System and electronic health records from medical institutions, as well as on the diverse data mining studies that use narrative text available in the scientific biomedical literature and social media. We pay attention to the strengths but also further explain challenges related to these methods. Data mining has important applications in the analysis of DDIs showing the impact of the interactions as a cause of adverse effects, extracting interactions to create knowledge data sets and gold standards and in the discovery of novel and dangerous DDIs.


Subject(s)
Data Mining/methods , Drug Interactions , Computational Biology/methods , Drug-Related Side Effects and Adverse Reactions , Electronic Health Records/statistics & numerical data , Humans , Pharmacovigilance , Publications/statistics & numerical data , Social Media/statistics & numerical data , United States , United States Food and Drug Administration
3.
J Biomed Inform ; 100: 103318, 2019 12.
Article in English | MEDLINE | ID: mdl-31655273

ABSTRACT

BACKGROUND: Manually curating standardized phenotypic concepts such as Human Phenotype Ontology (HPO) terms from narrative text in electronic health records (EHRs) is time consuming and error prone. Natural language processing (NLP) techniques can facilitate automated phenotype extraction and thus improve the efficiency of curating clinical phenotypes from clinical texts. While individual NLP systems can perform well for a single cohort, an ensemble-based method might shed light on increasing the portability of NLP pipelines across different cohorts. METHODS: We compared four NLP systems, MetaMapLite, MedLEE, ClinPhen and cTAKES, and four ensemble techniques, including intersection, union, majority-voting and machine learning, for extracting generic phenotypic concepts. We addressed two important research questions regarding automated phenotype recognition. First, we evaluated the performance of different approaches in identifying generic phenotypic concepts. Second, we compared the performance of different methods to identify patient-specific phenotypic concepts. To better quantify the effects caused by concept granularity differences on performance, we developed a novel evaluation metric that considered concept hierarchies and frequencies. Each of the approaches was evaluated on a gold standard set of clinical documents annotated by clinical experts. One dataset containing 1,609 concepts derived from 50 clinical notes from two different institutions was used in both evaluations, and an additional dataset of 608 concepts derived from 50 case report abstracts obtained from PubMed was used for evaluation of identifying generic phenotypic concepts only. RESULTS: For generic phenotypic concept recognition, the top three performers in the NYP/CUIMC dataset are union ensemble (F1, 0.634), training-based ensemble (F1, 0.632), and majority vote-based ensemble (F1, 0.622). 
In the Mayo dataset, the top three are majority vote-based ensemble (F1, 0.642), cTAKES (F1, 0.615), and MedLEE (F1, 0.559). In the PubMed dataset, the top three are majority vote-based ensemble (F1, 0.719), training-based ensemble (F1, 0.696) and MetaMapLite (F1, 0.694). For identifying patient-specific phenotypes, the top three performers in the NYP/CUIMC dataset are majority vote-based ensemble (F1, 0.610), MedLEE (F1, 0.609), and training-based ensemble (F1, 0.585). In the Mayo dataset, the top three are majority vote-based ensemble (F1, 0.604), cTAKES (F1, 0.531) and MedLEE (F1, 0.527). CONCLUSIONS: Our study demonstrates that ensembles of natural language processing systems can improve both generic phenotypic concept recognition and patient-specific phenotypic concept identification over individual systems. Among the individual NLP systems, each performed best when applied to the dataset it was primarily designed for. However, combining multiple NLP systems to create an ensemble can generally improve the performance. Specifically, the ensemble can improve the reproducibility of results across different cohorts and tasks, and thus provide a more portable phenotyping solution compared to individual NLP systems.
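
The three rule-based ensemble techniques named in this abstract (intersection, union, majority voting) can be illustrated with a minimal sketch. It assumes each NLP system returns a set of concept labels per document; the function names, the toy concept labels, and the strict "more than half" voting threshold are illustrative assumptions, not details taken from the study.

```python
from collections import Counter


def intersection_ensemble(outputs):
    """Keep only the concepts that every system extracted."""
    return set.intersection(*outputs) if outputs else set()


def union_ensemble(outputs):
    """Keep every concept that at least one system extracted."""
    return set().union(*outputs)


def majority_vote_ensemble(outputs):
    """Keep concepts extracted by more than half of the systems (assumed threshold)."""
    votes = Counter(concept for out in outputs for concept in out)
    return {concept for concept, n in votes.items() if n > len(outputs) / 2}


def f1_score(predicted, gold):
    """Harmonic mean of precision and recall over sets of concepts."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy outputs from four hypothetical systems; labels are placeholders.
system_outputs = [
    {"seizure", "hypotonia", "microcephaly"},
    {"seizure", "hypotonia", "fever"},
    {"seizure", "microcephaly", "cough"},
    {"seizure", "hypotonia"},
]
gold = {"seizure", "hypotonia", "microcephaly"}

for name, ensemble in [
    ("intersection", intersection_ensemble),
    ("union", union_ensemble),
    ("majority vote", majority_vote_ensemble),
]:
    print(name, round(f1_score(ensemble(system_outputs), gold), 3))
```

In this toy case the intersection favors precision and the union favors recall, with majority voting in between. The machine-learning ensemble mentioned in the abstract is not sketched here.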


Subject(s)
Natural Language Processing , Phenotype , Datasets as Topic , Electronic Health Records , Humans , Reproducibility of Results
4.
BMC Med Inform Decis Mak ; 19(Suppl 3): 70, 2019 04 04.
Article in English | MEDLINE | ID: mdl-30943963

ABSTRACT

BACKGROUND: A shareable repository of clinical notes is critical for advancing natural language processing (NLP) research, and therefore a goal of many NLP researchers is to create such a repository with breadth (from multiple institutions) as well as depth (as much individual data as possible). METHODS: We aimed to assess the degree to which individuals would be willing to contribute their health data to such a repository. A compact e-survey probed willingness to share demographic and clinical data categories. Participants were faculty, staff, and students in two geographically diverse major medical centers (Utah and New York). Such a sample could be expected to respond like a typical potential participant from the general public who is given complete and fully informed consent about the pros and cons of participating in a research study. RESULTS: Two thousand one hundred forty respondents completed the surveys. 56% of respondents were "somewhat/definitely willing" to share clinical data with identifiers, while 89% of respondents were "somewhat (17%)/definitely willing (72%)" to share without identifiers. Results were consistent across gender, age, and education, but there were some differences by geographical region. Individuals were most reluctant (50-74%) to share mental health, substance abuse, and domestic violence data. CONCLUSIONS: We conclude that a substantial fraction of potential patient participants, once educated about risks and benefits, would be willing to donate de-identified clinical data to a shared research repository. A slight majority would even be willing to share absent de-identification, suggesting that perceptions about data misuse are not a major concern. Such a repository of clinical notes should be invaluable for clinical NLP research and advancement.


Subject(s)
Academic Medical Centers , Biomedical Research , Health Personnel , Information Dissemination , Natural Language Processing , Adolescent , Adult , Confidentiality , Female , Humans , Informed Consent , Male , Middle Aged , New York , Patient Participation , Surveys and Questionnaires , Young Adult
5.
J Biomed Inform ; 76: 41-49, 2017 Dec.
Article in English | MEDLINE | ID: mdl-29081385

ABSTRACT

OBJECTIVE: Improving mechanisms to detect adverse drug reactions (ADRs) is key to strengthening post-marketing drug safety surveillance. Signal detection is presently unimodal, relying on a single information source. Multimodal signal detection is based on jointly analyzing multiple information sources. Building on and expanding the work done in prior studies, the aim of the article is to further research on multimodal signal detection, explore its potential benefits, and propose methods for its construction and evaluation. MATERIAL AND METHODS: Four data sources are investigated: FDA's adverse event reporting system, insurance claims, the MEDLINE citation database, and the logs of major Web search engines. Published methods are used to generate and combine signals from each data source. Two distinct reference benchmarks corresponding to well-established and recently labeled ADRs respectively are used to evaluate the performance of multimodal signal detection in terms of area under the ROC curve (AUC) and lead-time-to-detection, with the latter relative to labeling revision dates. RESULTS: Limited to our reference benchmarks, multimodal signal detection provides AUC improvements ranging from 0.04 to 0.09 based on a widely used evaluation benchmark, and a comparative added lead-time of 7-22 months relative to labeling revision dates from a time-indexed benchmark. CONCLUSIONS: The results support the notion that utilizing and jointly analyzing multiple data sources may lead to improved signal detection. Given certain data and benchmark limitations, the early stage of development, and the complexity of ADRs, it is currently not possible to make definitive statements about the ultimate utility of the concept. Continued development of multimodal signal detection requires a deeper understanding of the data sources used, additional benchmarks, and further research on methods to generate and synthesize signals.


Subject(s)
Adverse Drug Reaction Reporting Systems , Databases, Factual , Humans , United States , United States Food and Drug Administration
6.
BMC Med Inform Decis Mak ; 17(1): 175, 2017 Dec 19.
Article in English | MEDLINE | ID: mdl-29258594

ABSTRACT

BACKGROUND: It is beneficial for health care institutions to monitor physician prescribing patterns to ensure that high-quality and cost-effective care is being provided to patients. However, detecting treatment patterns within an institution is challenging, given that medications and conditions are often not explicitly linked in the health record. Here we demonstrate the use of statistical methods together with data from the electronic health care record (EHR) to analyze prescribing patterns at an institution. METHODS: As a demonstration of our method, which is based on regression, we collect EHR data from outpatient notes and use a case/control study design to determine the medications that are associated with hypertension. We also use regression to determine which conditions are associated with a preferential use of one or more classes of hypertension agents. Finally, we compare our method to methods based on tabulation. RESULTS: Our results show that regression methods provide more reasonable and useful results than tabulation, and successfully distinguish between medications that treat hypertension and medications that do not. These methods also provide insight into the circumstances in which certain drugs are preferred over others. CONCLUSIONS: Our method can be used by health care institutions to monitor physician prescribing patterns and ensure the appropriateness of treatment.


Subject(s)
Drug Prescriptions/standards , Electronic Health Records , Practice Patterns, Physicians' , Quality of Health Care , Case-Control Studies , Humans , Practice Patterns, Physicians'/standards , Quality of Health Care/standards , Regression Analysis
7.
Am J Gastroenterol ; 108(11): 1794-801, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24060760

ABSTRACT

OBJECTIVES: Observational studies suggest that proton pump inhibitors (PPIs) are a risk factor for incident Clostridium difficile infection (CDI). Data also suggest an association between PPIs and recurrent CDI, although large-scale studies focusing solely on hospitalized patients are lacking. We therefore performed a retrospective cohort analysis of inpatients with incident CDI to assess receipt of PPIs as a risk factor for CDI recurrence in this population. METHODS: Using electronic medical records, we identified hospitalized adult patients between 1 December 2009 and 30 June 2012 with incident CDI, defined as a first positive stool test for C. difficile toxin B and who received appropriate treatment. Electronic records were parsed for clinical factors including receipt of PPIs, other acid suppression, non-CDI antibiotics, and comorbidities. The primary exposure was in-hospital PPIs given concurrently with C. difficile treatment. Recurrence was defined as a second positive stool test 15-90 days after the initial positive test. C. difficile recurrence rates in the PPI exposed and unexposed groups were compared with the log-rank test. Multivariable Cox proportional hazards modeling was performed to control for demographics, comorbidities, and other clinical factors. RESULTS: We identified 894 inpatients with incident CDI. The cumulative incidence of CDI recurrence in the cohort was 23%. Receipt of PPIs concurrent with CDI treatment was not associated with C. difficile recurrence (hazard ratio (HR)=0.82; 95% confidence interval (CI)=0.58-1.16). Black race (HR=1.66, 95% CI=1.05-2.63), increased age (HR=1.02, 95% CI=1.01-1.03), and increased comorbidities (HR=1.09, 95% CI=1.04-1.14) were associated with CDI recurrence. In light of a higher 90-day mortality seen among those who received PPIs (log-rank P=0.02), we also analyzed the subset of patients who survived to 90 days of follow-up. 
Again, there was no association between PPIs and CDI recurrence (HR=0.87; 95% CI=0.60-1.28). Finally, there was no association between recurrent CDI and increased duration or dose of PPIs. CONCLUSIONS: Among hospitalized adults with C. difficile, receipt of PPIs concurrent with C. difficile treatment was not associated with CDI recurrence. Black race, increased age, and increased comorbidities significantly predicted recurrence. Future studies should test interventions to prevent CDI recurrence among high-risk inpatients.


Subject(s)
Clostridioides difficile , Clostridium Infections/epidemiology , Clostridium Infections/etiology , Inpatients , Proton Pump Inhibitors/adverse effects , Adult , Age Factors , Aged , Aged, 80 and over , Female , Humans , Incidence , Male , Middle Aged , Recurrence , Risk Factors
8.
J Biomed Inform ; 46(5): 765-73, 2013 Oct.
Article in English | MEDLINE | ID: mdl-23810857

ABSTRACT

Natural language processing (NLP) is crucial for advancing healthcare because it is needed to transform relevant information locked in text into structured data that can be used by computer processes aimed at improving patient care and advancing medicine. In light of the importance of NLP to health, the National Library of Medicine (NLM) recently sponsored a workshop to review the state of the art in NLP focusing on text in English, both in biomedicine and in the general language domain. Specific goals of the NLM-sponsored workshop were to identify the current state of the art, grand challenges and specific roadblocks, and to identify effective use and best practices. This paper reports on the main outcomes of the workshop, including an overview of the state of the art, strategies for advancing the field, and obstacles that need to be addressed, resulting in recommendations for a research agenda intended to advance the field.


Subject(s)
Education , National Library of Medicine (U.S.) , Natural Language Processing , United States
9.
Pharmacoepidemiol Drug Saf ; 22(2): 183-9, 2013 Feb.
Article in English | MEDLINE | ID: mdl-23233423

ABSTRACT

PURPOSE: Medication overuse is a serious concern in healthcare as it leads to increased expenditures, side effects, and morbidities. Identifying overuse is only possible through excluding appropriate indications that are primarily mentioned in unstructured notes. We developed a framework for automatic identification of medication overuse and applied it to proton pump inhibitors (PPIs). METHODS: We first created an indications knowledge base using data from drug labels, clinical guidelines, expert opinion, and other sources. We also obtained the list of current problems for 200 randomly selected inpatients who received PPIs using a natural language processing system and the discharge summaries of those patients. These problems were checked against the indications knowledge base to identify overuse candidates. Two gastroenterologists manually reviewed the notes and identified cases of overuse. Results from the automated framework were compared with the manual review. RESULTS: Reviewers had high interrater reliability in finding indications (agreement = 92.1%, Cohen's κ = 0.773). In 137 notes included in the final analysis, our system identified indications with a sensitivity of 74% (95%CI = 59-86) and specificity of 95% (95%CI = 87-98). In cases of appropriate use where the automated system also found one or more indications, it always included the correct indication. CONCLUSIONS: We created an automated system that can identify established indications of medication use in electronic health records with high accuracy. It can provide clinical decision support for identifying potential overuse of PPIs and could be useful for reducing overuse and encouraging better documentation of indications.
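
The evaluation measures reported in this abstract (sensitivity, specificity, and Cohen's κ for interrater agreement) follow standard definitions, which can be sketched as below. The abstract does not give the underlying counts, so every count in the example is a made-up placeholder and the printed values are not expected to match the reported figures.

```python
def sensitivity(tp, fn):
    """Fraction of true positives among all actual positives."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true negatives among all actual negatives."""
    return tn / (tn + fp)


def cohens_kappa(both_yes, a_only, b_only, both_no):
    """Agreement between two raters on a yes/no judgment, corrected for chance."""
    n = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_only) / n
    b_yes = (both_yes + b_only) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (observed - expected) / (1 - expected)


# Hypothetical counts, chosen only to exercise the formulas.
print(sensitivity(tp=30, fn=10))
print(specificity(tn=90, fp=10))
print(cohens_kappa(both_yes=40, a_only=5, b_only=5, both_no=50))
```

Raw agreement in the hypothetical example is 90%, while κ is lower because part of that agreement would be expected by chance. This is why an agreement percentage and a κ value are usually reported together.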


Subject(s)
Documentation/standards , Electronic Health Records/standards , Proton Pump Inhibitors/therapeutic use , Aged , Aged, 80 and over , Documentation/methods , Drug Prescriptions , Drug Utilization Review/methods , Drug Utilization Review/standards , Female , Humans , Male , Medical Records Systems, Computerized/standards , Middle Aged , Pilot Projects , Proton Pump Inhibitors/adverse effects
10.
J Biomed Inform ; 45(6): 1075-83, 2012 Dec.
Article in English | MEDLINE | ID: mdl-22742938

ABSTRACT

Abbreviations are widely used in clinical documents and they are often ambiguous. Building a list of possible senses (also called a sense inventory) for each ambiguous abbreviation is the first step toward automatically identifying the correct meanings of abbreviations in given contexts. Clustering-based methods have been used to detect senses of abbreviations from a clinical corpus [1]. However, rare senses remain challenging and existing algorithms are not good enough to detect them. In this study, we developed a new two-phase clustering algorithm called Tight Clustering for Rare Senses (TCRS) and applied it to sense generation of abbreviations in clinical text. Using manually annotated sense inventories from a set of 13 ambiguous clinical abbreviations, we evaluated and compared TCRS with the existing Expectation Maximization (EM) clustering algorithm for sense generation, at two different levels of annotation cost (10 vs. 20 instances for each abbreviation). Our results showed that the TCRS-based method could detect 85% of senses on average, whereas the EM-based method found only 75% of senses, when similar annotation effort (about 20 instances) was used. Further analysis demonstrated that the improvement by the TCRS method was mainly from additionally detected rare senses, thus indicating its usefulness for building more complete sense inventories of clinical abbreviations.


Subject(s)
Abbreviations as Topic , Algorithms , Cluster Analysis , Medical Records , Natural Language Processing , Unified Medical Language System
11.
J Biomed Inform ; 44(5): 805-14, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21549857

ABSTRACT

Biomedical natural language processing (BioNLP) is a useful technique that unlocks valuable information stored in textual data for practice and/or research. Syntactic parsing is a critical component of BioNLP applications that rely on correctly determining the sentence and phrase structure of free text. In addition to dealing with the vast amount of domain-specific terms, a robust biomedical parser needs to model the semantic grammar to obtain viable syntactic structures. With either a rule-based or corpus-based approach, the grammar engineering process requires substantial time and knowledge from experts, and does not always yield a semantically transferable grammar. To reduce the human effort and to promote semantic transferability, we propose an automated method for deriving a probabilistic grammar based on a training corpus consisting of concept strings and semantic classes from the Unified Medical Language System (UMLS), a comprehensive terminology resource widely used by the community. The grammar is designed to specify noun phrases only, owing to the nominal nature of the majority of biomedical terminological concepts. Evaluated on manually parsed clinical notes, the derived grammar achieved a recall of 0.644, precision of 0.737, and average cross-bracketing of 0.61, which demonstrated better performance than a control grammar with the semantic information removed. Error analysis revealed shortcomings that could be addressed to improve performance. The results indicated the feasibility of an approach which automatically incorporates terminology semantics in the building of an operational grammar. Although the current performance of the unsupervised solution does not adequately replace manual engineering, we believe that once the performance issues are addressed, it could serve as an aid in a semi-supervised solution.


Subject(s)
Semantics , Terminology as Topic , Information Storage and Retrieval , Unified Medical Language System , Vocabulary, Controlled
12.
BMC Bioinformatics ; 11 Suppl 9: S7, 2010 Oct 28.
Article in English | MEDLINE | ID: mdl-21044365

ABSTRACT

BACKGROUND: Multi-item adverse drug event (ADE) associations are associations relating multiple drugs to possibly multiple adverse events. The current standard in pharmacovigilance is bivariate association analysis, where each single drug-adverse effect combination is studied separately. The importance and difficulty of detecting multi-item ADE associations was noted in several prominent pharmacovigilance studies. In this paper we examine the application of a well-established data mining method known as association rule mining, which we tailored to the above problem, and demonstrate its value. The method was applied to the FDA's spontaneous adverse event reporting system (AERS) with minimal restrictions and expectations on its output, an experiment that has not been previously done on the scale and generality proposed in this work. RESULTS: Based on a set of 162,744 reports of suspected ADEs reported to AERS and published in the year 2008, our method identified 1167 multi-item ADE associations. A taxonomy that characterizes the associations was developed based on a representative sample. A significant number (67% of the total) of potential multi-item ADE associations identified were characterized and clinically validated by a domain expert as previously recognized ADE associations. Several potentially novel ADEs were also identified. A smaller proportion (4%) of associations were characterized and validated as known drug-drug interactions. CONCLUSIONS: Our findings demonstrate that multi-item ADEs are present and can be extracted from the FDA's adverse effect reporting system using our methodology, suggesting that our method is a valid approach for the initial identification of multi-item ADEs. The study also revealed several limitations and challenges that can be attributed to both the method and quality of data.
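
Association rule mining, the method named in this abstract, scores rules of the form "drug set → adverse event" by support and confidence. The sketch below is a brute-force illustration on toy data rather than the tailored method the authors used (which the abstract does not specify). The function name, thresholds, report format, and all drug and event labels are assumptions made for illustration.

```python
from itertools import combinations


def mine_multi_item_rules(reports, min_support, min_confidence, max_drugs=2):
    """Enumerate rules {drug set of size >= 2} -> event over (drugs, events) reports.

    support    = fraction of all reports containing the drug set and the event
    confidence = fraction of reports with the drug set that also contain the event
    """
    n = len(reports)
    all_drugs = sorted(set().union(*(drugs for drugs, _ in reports)))
    all_events = sorted(set().union(*(events for _, events in reports)))
    rules = []
    for size in range(2, max_drugs + 1):
        for combo in combinations(all_drugs, size):
            drug_set = set(combo)
            # Event sets of the reports that mention every drug in the combination.
            matching = [events for drugs, events in reports if drug_set <= drugs]
            if not matching:
                continue
            for event in all_events:
                both = sum(1 for events in matching if event in events)
                support = both / n
                confidence = both / len(matching)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((combo, event, support, confidence))
    return rules


# Toy spontaneous reports: (drugs mentioned, adverse events mentioned).
reports = [
    ({"drug_a", "drug_b"}, {"event_x"}),
    ({"drug_a", "drug_b"}, {"event_x"}),
    ({"drug_a", "drug_b"}, {"event_y"}),
    ({"drug_a"}, {"event_y"}),
    ({"drug_b", "drug_c"}, {"event_x"}),
]

rules = mine_multi_item_rules(reports, min_support=0.4, min_confidence=0.6)
for drugs, event, support, confidence in rules:
    print(drugs, "->", event, round(support, 2), round(confidence, 2))
```

Exhaustive enumeration like this is only workable for tiny inputs. On data at the scale described in the abstract, one would normally use a pruning algorithm such as Apriori, which skips any itemset whose subsets already fall below the support threshold.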


Subject(s)
Adverse Drug Reaction Reporting Systems , Data Mining/methods , Drug-Related Side Effects and Adverse Reactions , Adverse Drug Reaction Reporting Systems/statistics & numerical data , Algorithms , Databases, Factual , Drug Synergism , United States , United States Food and Drug Administration
13.
J Biomed Inform ; 43(4): 595-601, 2010 Aug.
Article in English | MEDLINE | ID: mdl-20362071

ABSTRACT

Knowledge acquisition of relations between biomedical entities is critical for many automated biomedical applications, including pharmacovigilance and decision support. Automated acquisition of statistical associations from biomedical and clinical documents has shown some promise. However, acquisition of clinically meaningful relations (i.e. specific associations) remains challenging because textual information is noisy and co-occurrence does not typically determine specific relations. In this work, we focus on acquisition of two types of relations from clinical reports: disease-manifestation related symptom (MRS) and drug-adverse drug event (ADE), and explore the use of filtering by sections of the reports to improve performance. Evaluation indicated that applying the filters improved recall (disease-MRS: from 0.85 to 0.90; drug-ADE: from 0.43 to 0.75) and precision (disease-MRS: from 0.82 to 0.92; drug-ADE: from 0.16 to 0.31). This preliminary study demonstrates that selecting information in narrative electronic reports based on the sections improves the detection of disease-MRS and drug-ADE types of relations. Further investigation of complementary methods, such as more sophisticated statistical methods, more complex temporal models and use of information from other knowledge sources, is needed.
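
The idea of restricting which report sections contribute entities before pairing them can be sketched as follows. The section names, the mention tuple format, and the choice of allowed sections are hypothetical, and the sketch pairs entities by simple co-occurrence within a note rather than by the statistical association step the study relied on, so it is not expected to reproduce the reported recall and precision changes.

```python
def candidate_pairs(mentions, allowed_sections=None):
    """Build (drug, event) pairs that co-occur in a note.

    mentions: iterable of (note_id, section, entity_type, term) tuples.
    allowed_sections: optional dict mapping entity_type -> set of section names;
    when given, mentions from other sections are dropped before pairing.
    """
    by_note = {}
    for note_id, section, entity_type, term in mentions:
        if allowed_sections is not None:
            if section not in allowed_sections.get(entity_type, ()):
                continue
        note = by_note.setdefault(note_id, {"drug": set(), "event": set()})
        note[entity_type].add(term)
    pairs = set()
    for note in by_note.values():
        pairs |= {(drug, event) for drug in note["drug"] for event in note["event"]}
    return pairs


def precision_recall(predicted, gold):
    """Precision and recall of predicted pairs against a gold set of pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# Toy mentions; every term and section name is a placeholder.
mentions = [
    ("n1", "medications", "drug", "drug_a"),
    ("n1", "hospital_course", "event", "rash"),
    ("n1", "family_history", "event", "diabetes"),
    ("n2", "medications", "drug", "drug_b"),
    ("n2", "hospital_course", "event", "nausea"),
    ("n2", "past_history", "event", "fracture"),
]
gold = {("drug_a", "rash"), ("drug_b", "nausea")}
section_filter = {"drug": {"medications"}, "event": {"hospital_course"}}

print(precision_recall(candidate_pairs(mentions), gold))
print(precision_recall(candidate_pairs(mentions, section_filter), gold))
```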


Subject(s)
Electronic Health Records/standards , Drug-Related Side Effects and Adverse Reactions/metabolism , Information Storage and Retrieval/methods , Models, Statistical , Natural Language Processing
14.
BMC Bioinformatics ; 10 Suppl 9: S13, 2009 Sep 17.
Article in English | MEDLINE | ID: mdl-19761567

ABSTRACT

BACKGROUND: The availability of up-to-date, executable, evidence-based medical knowledge is essential for many clinical applications, such as pharmacovigilance, but executable knowledge is costly to obtain and update. Automated acquisition of environmental and phenotypic associations in biomedical and clinical documents using text mining has shown some success. The usefulness of the association knowledge is limited, however, because the specific relationships between clinical entities remain unknown. In particular, some associations are indirect relations due to interdependencies among the data. RESULTS: In this work, we develop methods using mutual information (MI) and its property, the data processing inequality (DPI), to help characterize associations that were generated based on use of natural language processing to encode clinical information in narrative patient records followed by statistical methods. Evaluation based on a random sample consisting of two drugs and two diseases indicates an overall precision of 81%. CONCLUSION: This preliminary study demonstrates that the proposed method is effective for helping to characterize phenotypic and environmental associations obtained from clinical reports.
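
Mutual information and the data processing inequality can be illustrated with a small sketch. The DPI states that if A, B, and C form a chain (A influences C only through B), then I(A;C) ≤ min(I(A;B), I(B;C)). One common way to use this, assumed here because the abstract does not describe the authors' exact procedure, is to treat the weakest edge in a fully connected triplet as the candidate indirect association. The variable roles and the binary toy data are illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two paired sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def weakest_edge(mi_ab, mi_bc, mi_ac):
    """Return the label of the lowest-MI edge in the triplet A, B, C.

    Under the DPI, this edge is the candidate indirect association.
    """
    edges = {"A-B": mi_ab, "B-C": mi_bc, "A-C": mi_ac}
    return min(edges, key=edges.get)


# Toy presence/absence indicators across eight hypothetical patient records:
# A = a drug, B = the condition it treats, C = a symptom of that condition.
a = [1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 1, 0, 0, 0, 0, 1]
c = [1, 1, 0, 0, 0, 0, 1, 1]

mi_ab = mutual_information(a, b)
mi_bc = mutual_information(b, c)
mi_ac = mutual_information(a, c)
print(round(mi_ab, 4), round(mi_bc, 4), round(mi_ac, 4))
print(weakest_edge(mi_ab, mi_bc, mi_ac))
```

In the toy data the drug and the symptom are each associated with the condition but not with each other beyond that, so the A-C edge has the lowest mutual information and is flagged as the likely indirect relation.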


Subject(s)
Computational Biology/methods , Information Storage and Retrieval/methods , Information Theory , Medical Records Systems, Computerized , Database Management Systems , Natural Language Processing
15.
BMC Bioinformatics ; 10 Suppl 2: S8, 2009 Feb 05.
Article in English | MEDLINE | ID: mdl-19208196

ABSTRACT

The evolving complexity of genome-scale experiments has increasingly centralized the role of a highly computable, accurate, and comprehensive resource spanning multiple biological scales and viewpoints. To provide a resource to meet this need, we have significantly extended the PhenoGO database with gene-disease specific annotations and included an additional ten species. This computationally derived resource is primarily intended to provide phenotypic context (cell type, tissue, organ, and disease) for mining existing associations between gene products and GO terms specified in the Gene Ontology Databases. Automated natural language processing (BioMedLEE) and computational ontology (PhenOS) methods were used to derive these relationships from the literature, expanding the database with information from ten additional species to include over 600,000 phenotypic contexts spanning eleven species from five GO annotation databases. A comprehensive evaluation of the mappings (n = 300) found precision (positive predictive value) at 85%, and recall (sensitivity) at 76%. Phenotypes are encoded in general purpose ontologies such as Cell Ontology, the Unified Medical Language System, and in specialized ontologies such as the Mouse Anatomy and the Mammalian Phenotype Ontology. A web portal has also been developed, allowing for advanced filtering and querying of the database as well as download of the entire dataset: http://www.phenogo.org.


Subject(s)
Computational Biology/methods , Databases, Genetic , Phenotype , Software , Animals , Humans , Information Storage and Retrieval , Mice , Natural Language Processing , Unified Medical Language System
18.
Bioinformatics ; 24(17): 1971-3, 2008 Sep 01.
Article in English | MEDLINE | ID: mdl-18625612

ABSTRACT

UNLABELLED: Accurate semantic classification is valuable for text mining and knowledge-based tasks that perform inference based on semantic classes. To benefit applications using the semantic classification of the Unified Medical Language System (UMLS) concepts, we automatically reclassified the concepts based on their lexical and contextual features. The new classification is useful for auditing the original UMLS semantic classification and for building biomedical text mining applications. AVAILABILITY: http://www.dbmi.columbia.edu/~juf7002/reclassify_production


Subject(s)
Database Management Systems , Information Storage and Retrieval/methods , Natural Language Processing , Terminology as Topic , Unified Medical Language System , Databases, Factual , Semantics
19.
J Am Med Inform Assoc ; 16(1): 103-8, 2009.
Article in English | MEDLINE | ID: mdl-18952935

ABSTRACT

OBJECTIVE: To develop methods for building corpus-specific sense inventories of abbreviations occurring in clinical documents. DESIGN: A corpus of internal medicine admission notes was collected and instances of each clinical abbreviation in the corpus were clustered into different sense clusters. One instance from each cluster was manually annotated to generate a final list of senses. Two clustering-based methods (Expectation Maximization--EM and Farthest First--FF) and one random sampling method for sense detection were evaluated using a set of 12 clinical abbreviations. MEASUREMENTS: The clustering-based sense detection methods were evaluated using a set of clinical abbreviations that were manually sense annotated. "Sense Completeness" and "Annotation Cost" were used to measure the performance of different methods. Clustering error rates were also reported for different clustering algorithms. RESULTS: A clustering-based semi-automated method was developed to build corpus-specific sense inventories for abbreviations in hospital admission notes. Evaluation demonstrated that this method could largely reduce manual annotation cost and increase the completeness of sense inventories when compared with a manual annotation method using random samples. CONCLUSION: The authors developed an effective clustering-based method for building corpus-specific sense inventories for abbreviations in a clinical corpus. To the best of the authors' knowledge, this is the first time clustering technologies have been used to help build sense inventories of abbreviations in clinical text. The results demonstrated that the clustering-based method performed better than the manual annotation method using random samples for the task of building sense inventories of clinical abbreviations.


Subject(s)
Abbreviations as Topic , Medical Records , Cluster Analysis , Humans , Internal Medicine , Patient Admission
20.
J Am Med Inform Assoc ; 16(3): 328-37, 2009.
Article in English | MEDLINE | ID: mdl-19261932

ABSTRACT

OBJECTIVE: It is vital to detect the full safety profile of a drug throughout its market life. Current pharmacovigilance systems still have substantial limitations, however. The objective of our work is to demonstrate the feasibility of using natural language processing (NLP), the comprehensive Electronic Health Record (EHR), and association statistics for pharmacovigilance purposes. DESIGN: Narrative discharge summaries were collected from the Clinical Information System at New York Presbyterian Hospital (NYPH). MedLEE, an NLP system, was applied to the collection to identify medication events and entities that could be potential adverse drug events (ADEs). Co-occurrence statistics with adjusted volume tests were used to detect associations between the two types of entities, to calculate the strengths of the associations, and to determine their cutoff thresholds. Seven drugs/drug classes (ibuprofen, morphine, warfarin, bupropion, paroxetine, rosiglitazone, ACE inhibitors) with known ADEs were selected to evaluate the system. RESULTS: One hundred thirty-two potential ADEs were found to be associated with the 7 drugs. For known ADEs, overall recall and precision were 0.75 and 0.31, respectively. Importantly, qualitative evaluation using a historic rollback design suggested that novel ADEs could be detected using our system. CONCLUSIONS: This study provides a framework for the development of active, high-throughput, and prospective systems that could potentially unveil drug safety profiles throughout their entire market life. Our results demonstrate that the framework is feasible, although some challenging issues remain. To the best of our knowledge, this is the first study using comprehensive unstructured data from the EHR for pharmacovigilance.


Subject(s)
Adverse Drug Reaction Reporting Systems , Drug-Related Side Effects and Adverse Reactions , Medical Records Systems, Computerized , Natural Language Processing , Algorithms , Feasibility Studies , Humans , Information Storage and Retrieval , Software , Statistics as Topic