Search | BVS IEC

1.

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Smalheiser, Neil R; Cohen, Aaron M; Bonifield, Gary.

J Biomed Inform ; 90: 103096, 2019 02.

Article in English | MEDLINE | ID: mdl-30654030

ABSTRACT

Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.

Subjects

Data Mining , PubMed , Semantics

2.

Aggregator: a machine learning approach to identifying MEDLINE articles that derive from the same underlying clinical trial.

Shao, Weixiang; Adams, Clive E; Cohen, Aaron M; Davis, John M; McDonagh, Marian S; Thakurta, Sujata; Yu, Philip S; Smalheiser, Neil R.

Methods ; 74: 65-70, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25461812

ABSTRACT

OBJECTIVE: It is important to identify separate publications that report outcomes from the same underlying clinical trial, in order to avoid over-counting these as independent pieces of evidence. METHODS: We created positive and negative training sets (comprised of pairs of articles reporting on the same condition and intervention) that were, or were not, linked to the same clinicaltrials.gov trial registry number. Features were extracted from MEDLINE and PubMed metadata; pairwise similarity scores were modeled using logistic regression. RESULTS: Article pairs from the same trial were identified with high accuracy (F1 score=0.843). We also created a clustering tool, Aggregator, that takes as input a PubMed user query for RCTs on a given topic, and returns article clusters predicted to arise from the same clinical trial. DISCUSSION: Although painstaking examination of full-text may be needed to be conclusive, metadata are surprisingly accurate in predicting when two articles derive from the same underlying clinical trial.
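The F1 score reported above is, by its standard definition, the harmonic mean of precision and recall for the positive class (here, article pairs predicted to come from the same trial). A minimal sketch follows; the function name and the counts are hypothetical and chosen only for illustration, not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts of same-trial pairs (illustration only).
example = f1_score(tp=80, fp=15, fn=15)
print(round(example, 3))
```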

Subjects

Clinical Trials as Topic/statistics & numerical data , MEDLINE/statistics & numerical data , Machine Learning , Cluster Analysis , Humans

3.

Automatically pre-screening patients for the rare disease aromatic l-amino acid decarboxylase deficiency using knowledge engineering, natural language processing, and machine learning on a large EHR population.

Cohen, Aaron M; Kaner, Jolie; Miller, Ryan; Kopesky, Jeffrey W; Hersh, William.

J Am Med Inform Assoc ; 31(3): 692-704, 2024 Feb 16.

Article in English | MEDLINE | ID: mdl-38134953

ABSTRACT

OBJECTIVES: Electronic health record (EHR) data may facilitate the identification of rare diseases in patients, such as aromatic l-amino acid decarboxylase deficiency (AADCd), an autosomal recessive disease caused by pathogenic variants in the dopa decarboxylase gene. Deficiency of the AADC enzyme results in combined severe reductions in monoamine neurotransmitters: dopamine, serotonin, epinephrine, and norepinephrine. This leads to widespread neurological complications affecting motor, behavioral, and autonomic function. The goal of this study was to use EHR data to identify previously undiagnosed patients who may have AADCd without available training cases for the disease. MATERIALS AND METHODS: A multiple symptom and related disease annotated dataset was created and used to train individual concept classifiers on annotated sentence data. A multistep algorithm was then used to combine concept predictions into a single patient rank value. RESULTS: Using an 8000-patient dataset that the algorithms had not seen before ranking, the top and bottom 200 ranked patients were manually reviewed for clinical indications of performing an AADCd diagnostic screening test. The top-ranked patients were 22.5% positively assessed for diagnostic screening, with 0% for the bottom-ranked patients. This result is statistically significant at P < .0001. CONCLUSION: This work validates the approach that large-scale rare-disease screening can be accomplished by combining predictions for relevant individual symptoms and related conditions which are much more common and for which training data is easier to create.

Subjects

Amino Acid Metabolism, Inborn Errors , Aromatic-L-Amino-Acid Decarboxylases/deficiency , Natural Language Processing , Rare Diseases , Humans , Dopamine , Machine Learning

4.

Studying the potential impact of automated document classification on scheduling a systematic review update.

Cohen, Aaron M; Ambert, Kyle; McDonagh, Marian.

BMC Med Inform Decis Mak ; 12: 33, 2012 Apr 19.

Article in English | MEDLINE | ID: mdl-22515596

ABSTRACT

BACKGROUND: Systematic Reviews (SRs) are an essential part of evidence-based medicine, providing support for clinical practice and policy on a wide range of medical topics. However, producing SRs is resource-intensive, and progress in the research they review leads to SRs becoming outdated, requiring updates. Although the question of how and when to update SRs has been studied, the best method for determining when to update is still unclear, necessitating further research. METHODS: In this work we study the potential impact of a machine learning-based automated system for providing alerts when new publications become available within an SR topic. Some of these new publications are especially important, as they report findings that are more likely to initiate a review update. To this end, we have designed a classification algorithm to identify articles that are likely to be included in an SR update, along with an annotation scheme designed to identify the most important publications in a topic area. Using an SR database containing over 70,000 articles, we annotated articles from 9 topics that had received an update during the study period. The algorithm was then evaluated in terms of the overall correct and incorrect alert rate for publications meeting the topic inclusion criteria, as well as in terms of its ability to identify important, update-motivating publications in a topic area. RESULTS: Our initial approach, based on our previous work in topic-specific SR publication classification, identifies over 70% of the most important new publications, while maintaining a low overall alert rate. CONCLUSIONS: We performed an initial analysis of the opportunities and challenges in aiding the SR update planning process with an informatics-based machine learning approach. 
Alerts could be a useful tool in the planning, scheduling, and allocation of resources for SR updates, providing an improvement in timeliness and coverage for the large number of medical topics needing SRs. While the performance of this initial method is not perfect, it could be a useful supplement to current approaches to scheduling an SR update. Approaches specifically targeting the types of important publications identified by this work are likely to improve results.

Subjects

Algorithms , Artificial Intelligence , Evidence-Based Medicine , Humans , Artificial Intelligence/standards , Databases, Factual , Research Design , Systematic Reviews as Topic

5.

Evaluation of publication type tagging as a strategy to screen randomized controlled trial articles in preparing systematic reviews.

Schneider, Jodi; Hoang, Linh; Kansara, Yogeshwar; Cohen, Aaron M; Smalheiser, Neil R.

JAMIA Open ; 5(1): ooac015, 2022 Apr.

Article in English | MEDLINE | ID: mdl-35571360

ABSTRACT

Objectives: To produce a systematic review (SR), reviewers typically screen thousands of titles and abstracts of articles manually to find a small number which are read in full text to find relevant articles included in the final SR. Here, we evaluate a proposed automated probabilistic publication type screening strategy applied to the randomized controlled trial (RCT) articles (i.e., those which present clinical outcome results of RCT studies) included in a corpus of previously published Cochrane reviews. Materials and Methods: We selected a random subset of 558 published Cochrane reviews that specified RCT study only inclusion criteria, containing 7113 included articles which could be matched to PubMed identifiers. These were processed by our automated RCT Tagger tool to estimate the probability that each article reports clinical outcomes of a RCT. Results: Removing articles with low predictive scores P < 0.01 eliminated 288 included articles, of which only 22 were actually typical RCT articles, and only 18 were actually typical RCT articles that MEDLINE indexed as such. Based on our sample set, this screening strategy led to fewer than 0.05 relevant RCT articles being missed on average per Cochrane SR. Discussion: This scenario, based on real SRs, demonstrates that automated tagging can identify RCT articles accurately while maintaining very high recall. However, we also found that even SRs whose inclusion criteria are restricted to RCT studies include not only clinical outcome articles per se, but a variety of ancillary article types as well. Conclusions: This encourages further studies learning how best to incorporate automated tagging of additional publication types into SR triage workflows.
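The per-review miss rate quoted above can be checked with simple arithmetic from the counts given in the abstract. The sketch below uses only those counts; the variable names are illustrative. The abstract does not say which of the two counts (22 or 18) the "fewer than 0.05" figure refers to, and both satisfy it.

```python
# Counts as stated in the abstract.
reviews = 558                      # Cochrane reviews sampled
included_articles = 7113           # included articles matched to PubMed IDs
removed = 288                      # included articles with score below 0.01
removed_typical_rct = 22           # of those, typical RCT articles
removed_typical_rct_indexed = 18   # of those, also indexed as RCT by MEDLINE

# Average number of relevant RCT articles missed per review.
missed_per_review = removed_typical_rct / reviews
missed_per_review_indexed = removed_typical_rct_indexed / reviews

# Share of all included articles removed by the threshold.
share_removed = removed / included_articles

print(round(missed_per_review, 4), round(missed_per_review_indexed, 4))
```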

6.

Clinical study applying machine learning to detect a rare disease: results and lessons learned.

Hersh, William R; Cohen, Aaron M; Nguyen, Michelle M; Bensching, Katherine L; Deloughery, Thomas G.

JAMIA Open ; 5(2): ooac053, 2022 Jul.

Article in English | MEDLINE | ID: mdl-35783073

ABSTRACT

Machine learning has the potential to improve identification of patients for appropriate diagnostic testing and treatment, including those who have rare diseases for which effective treatments are available, such as acute hepatic porphyria (AHP). We trained a machine learning model on 205 571 complete electronic health records from a single medical center based on 30 known cases to identify 22 patients with classic symptoms of AHP that had neither been diagnosed nor tested for AHP. We offered urine porphobilinogen testing to these patients via their clinicians. Of the 7 who agreed to testing, none were positive for AHP. We explore the reasons for this and provide lessons learned for further work evaluating machine learning to detect AHP and other rare diseases.

7.

Testing a filtering strategy for systematic reviews: evaluating work savings and recall.

Proescholdt, Randi; Hsiao, Tzu-Kun; Schneider, Jodi; Cohen, Aaron M; McDonagh, Marian S; Smalheiser, Neil R.

AMIA Jt Summits Transl Sci Proc ; 2022: 406-413, 2022.

Article in English | MEDLINE | ID: mdl-35854734

ABSTRACT

Systematic reviews are extremely time-consuming. The goal of this work is to assess work savings and recall for a publication type filtering strategy that uses the output of two machine learning models, Multi-Tagger and web RCT Tagger, applied retrospectively to 10 systematic reviews on drug effectiveness. Our filtering strategy resulted in mean work savings of 33.6% and recall of 98.3%. Of 363 articles finally included in any of the systematic reviews, 7 were filtered out by our strategy, but 1 "error" was actually an article using a publication type that the SR team had not pre-specified as relevant for inclusion. Our analysis suggests that automated publication type filtering can potentially provide substantial work savings with minimal loss of included articles. Publication type filtering should be personalized for each systematic review and might be combined with other filtering or ranking methods to provide additional work savings for manual triage.
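The two measures named above can be sketched as follows. The definitions are assumptions, since the abstract does not spell them out: work savings is taken as the fraction of retrieved articles that the filter removes from manual screening, and recall as the fraction of finally included articles that survive the filter. Pooling the abstract's counts (7 of 363 included articles filtered out) gives a recall slightly different from the reported 98.3%, which is a mean over the 10 reviews.

```python
def work_savings(n_retrieved: int, n_filtered_out: int) -> float:
    """Fraction of retrieved articles that no longer need manual screening."""
    return n_filtered_out / n_retrieved


def recall(n_included: int, n_included_filtered_out: int) -> float:
    """Fraction of finally included articles retained by the filter."""
    return (n_included - n_included_filtered_out) / n_included


# Pooled recall from the counts in the abstract.
pooled_recall = recall(363, 7)

# Hypothetical retrieval size, for illustration only.
example_savings = work_savings(n_retrieved=1000, n_filtered_out=336)

print(round(pooled_recall, 4), example_savings)
```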

8.

Integrative analysis of drug response and clinical outcome in acute myeloid leukemia.

Bottomly, Daniel; Long, Nicola; Schultz, Anna Reister; Kurtz, Stephen E; Tognon, Cristina E; Johnson, Kara; Abel, Melissa; Agarwal, Anupriya; Avaylon, Sammantha; Benton, Erik; Blucher, Aurora; Borate, Uma; Braun, Theodore P; Brown, Jordana; Bryant, Jade; Burke, Russell; Carlos, Amy; Chang, Bill H; Cho, Hyun Jun; Christy, Stephen; Coblentz, Cody; Cohen, Aaron M; d'Almeida, Amanda; Cook, Rachel; Danilov, Alexey; Dao, Kim-Hien T; Degnin, Michie; Dibb, James; Eide, Christopher A; English, Isabel; Hagler, Stuart; Harrelson, Heath; Henson, Rachel; Ho, Hibery; Joshi, Sunil K; Junio, Brian; Kaempf, Andy; Kosaka, Yoko; Laderas, Ted; Lawhead, Matt; Lee, Hyunjung; Leonard, Jessica T; Lin, Chenwei; Lind, Evan F; Liu, Selina Qiuying; Lo, Pierrette; Loriaux, Marc M; Luty, Samuel; Maxson, Julia E; Macey, Tara.

Cancer Cell ; 40(8): 850-864.e9, 2022 08 08.

Article in English | MEDLINE | ID: mdl-35868306

ABSTRACT

Acute myeloid leukemia (AML) is a cancer of myeloid-lineage cells with limited therapeutic options. We previously combined ex vivo drug sensitivity with genomic, transcriptomic, and clinical annotations for a large cohort of AML patients, which facilitated discovery of functional genomic correlates. Here, we present a dataset that has been harmonized with our initial report to yield a cumulative cohort of 805 patients (942 specimens). We show strong cross-cohort concordance and identify features of drug response. Further, deconvoluting transcriptomic data shows that drug sensitivity is governed broadly by AML cell differentiation state, sometimes conditionally affecting other correlates of response. Finally, modeling of clinical outcome reveals a single gene, PEAR1, to be among the strongest predictors of patient survival, especially for young patients. Collectively, this report expands a large functional genomic resource, offers avenues for mechanistic exploration and drug development, and reveals tools for predicting outcome in AML.

Subjects

Leukemia, Myeloid, Acute , Cell Differentiation , Cohort Studies , Humans , Leukemia, Myeloid, Acute/drug therapy , Leukemia, Myeloid, Acute/genetics , Receptors, Cell Surface/genetics , Transcriptome

9.

An Analysis of Two Sources of Cardiology Patient Data to Measure Medication Agreement.

Goueth, Rose C; Cohen, Aaron M; Weiskopf, Nicole G.

AMIA Jt Summits Transl Sci Proc ; 2021: 267-275, 2021.

Article in English | MEDLINE | ID: mdl-34457141

ABSTRACT

Errors and incompleteness in electronic health record (EHR) medication lists can result in medical errors. To reduce errors in these medication lists, clinicians use patient self-reported data to reconcile EHR data. We assessed the agreement between patient self-reported medications and medications recorded in the EHR for six medication classes related to cardiovascular care and used logistic regression models to determine which patient-related factors were associated with the disagreement between these two information sources. From our 297 patients, we found self-reported medications had an overall above-average agreement with the EHR (κ = .727). We observed the highest agreement level for statins (κ = .831) and the lowest for other antihypertensives (κ = .465). Agreement was less likely for Hispanic and male patients. We also performed an in-depth error analysis of different types of disagreement beyond medication names, which revealed that the most frequent type of disagreement was mismatched dosages.
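Assuming the agreement statistic above is Cohen's kappa, it could be computed for one medication class roughly as sketched below. The function name and the example vectors are hypothetical; each position stands for one patient, with 1 meaning the medication class is present in that source.

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed proportion of items on which the two sources agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each source's label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both sources use a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical data for eight patients (illustration only).
self_report = [1, 1, 1, 0, 0, 0, 1, 0]
ehr_record = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(self_report, ehr_record)
print(kappa)
```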

Subjects

Cardiology , Electronic Health Records , Antihypertensive Agents , Humans , Male

10.

Identifying main finding sentences in clinical case reports.

Luo, Mengqi; Cohen, Aaron M; Addepalli, Sidharth; Smalheiser, Neil R.

Database (Oxford) ; 2020, 2020 01 01.

Article in English | MEDLINE | ID: mdl-32525207

ABSTRACT

Clinical case reports are the 'eyewitness reports' of medicine and provide a valuable, unique, albeit noisy and underutilized type of evidence. Generally, a case report has a single main finding that represents the reason for writing up the report in the first place. However, no one has previously created an automatic way of identifying main finding sentences in case reports. We previously created a manual corpus of main finding sentences extracted from the abstracts and full text of clinical case reports. Here, we have utilized the corpus to create a machine learning-based model that automatically predicts which sentence(s) from abstracts state the main finding. The model has been evaluated on a separate manual corpus of clinical case reports and found to have good performance. This is a step toward setting up a retrieval system in which, given one case report, one can find other case reports that report the same or very similar main findings. The code and necessary files to run the main finding model can be downloaded from https://github.com/qi29/main_finding_recognition, released under the Apache License, Version 2.0.

Subjects

Data Mining/methods , Machine Learning , Medical Records/classification , Humans , Natural Language Processing , Software

11.

Evaluation of patient-level retrieval from electronic health record data for a cohort discovery task.

Chamberlin, Steven R; Bedrick, Steven D; Cohen, Aaron M; Wang, Yanshan; Wen, Andrew; Liu, Sijia; Liu, Hongfang; Hersh, William R.

JAMIA Open ; 3(3): 395-404, 2020 Oct.

Article in English | MEDLINE | ID: mdl-33215074

ABSTRACT

OBJECTIVE: Growing numbers of academic medical centers offer patient cohort discovery tools to their researchers, yet the performance of systems for this use case is not well understood. The objective of this research was to assess patient-level information retrieval methods using electronic health records for different types of cohort definition retrieval. MATERIALS AND METHODS: We developed a test collection consisting of about 100 000 patient records and 56 test topics that characterized patient cohort requests for various clinical studies. Automated information retrieval tasks using word-based approaches were performed, varying 4 different parameters for a total of 48 permutations, with performance measured using B-Pref. We subsequently created structured Boolean queries for the 56 topics for performance comparisons. In addition, we performed a more detailed analysis of 10 topics. RESULTS: The best-performing word-based automated query parameter settings achieved a mean B-Pref of 0.167 across all 56 topics. The way a topic was structured (topic representation) had the largest impact on performance. Performance not only varied widely across topics, but there was also a large variance in sensitivity to parameter settings across the topics. Structured queries generally performed better than automated queries on measures of recall and precision but were still not able to recall all relevant patients found by the automated queries. CONCLUSION: While word-based automated methods of cohort retrieval offer an attractive solution to the labor-intensive nature of this task currently used at many medical centers, we generally found suboptimal performance in those approaches, with better performance obtained from structured Boolean queries. Future work will focus on using the test collection to develop and evaluate new approaches to query structure, weighting algorithms, and application of semantic methods.
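B-Pref, the measure used above, scores a ranking using only judged items, so unjudged patients are ignored. The sketch below follows the commonly used trec_eval-style formulation; whether the study used exactly this variant is an assumption, and the patient identifiers are hypothetical.

```python
def bpref(ranking: list[str], relevant: set[str], nonrelevant: set[str]) -> float:
    """Binary preference over judged items; unjudged items are skipped."""
    r, n = len(relevant), len(nonrelevant)
    if r == 0:
        return 0.0
    denom = min(r, n)
    nonrel_seen = 0  # judged non-relevant items ranked so far
    total = 0.0
    for item in ranking:
        if item in relevant:
            if nonrel_seen == 0:
                total += 1.0
            else:
                # Penalize by non-relevant items ranked above, capped at R.
                total += 1.0 - min(nonrel_seen, r) / denom
        elif item in nonrelevant:
            nonrel_seen += 1
    return total / r


# Hypothetical topic: "p9" is unjudged and therefore ignored.
ranking = ["p1", "p9", "p2", "p7", "p3"]
score = bpref(ranking, relevant={"p1", "p2", "p3"}, nonrelevant={"p7", "p8"})
print(round(score, 4))
```

Because relevant items that are never retrieved contribute nothing to the sum but still count in the denominator, low recall pulls the score down.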

12.

Correction: Detecting rare diseases in electronic health records using machine learning and knowledge engineering: Case study of acute hepatic porphyria.

Cohen, Aaron M; Chamberlin, Steven; Deloughery, Thomas; Nguyen, Michelle; Bedrick, Steven; Meninger, Stephen; Ko, John J; Amin, Jigar J; Wei, Alex H; Hersh, William.

PLoS One ; 15(8): e0238277, 2020.

Article in English | MEDLINE | ID: mdl-32817711

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pone.0235574.].

13.

Detecting rare diseases in electronic health records using machine learning and knowledge engineering: Case study of acute hepatic porphyria.

Cohen, Aaron M; Chamberlin, Steven; Deloughery, Thomas; Nguyen, Michelle; Bedrick, Steven; Meninger, Stephen; Ko, John J; Amin, Jigar J; Wei, Alex J; Hersh, William.

PLoS One ; 15(7): e0235574, 2020.

Article in English | MEDLINE | ID: mdl-32614911

ABSTRACT

BACKGROUND: With the growing adoption of the electronic health record (EHR) worldwide over the last decade, new opportunities exist for leveraging EHR data for detection of rare diseases. Rare diseases are often not diagnosed or delayed in diagnosis by clinicians who encounter them infrequently. One such rare disease that may be amenable to EHR-based detection is acute hepatic porphyria (AHP). AHP consists of a family of rare, metabolic diseases characterized by potentially life-threatening acute attacks and chronic debilitating symptoms. The goal of this study was to apply machine learning and knowledge engineering to a large extract of EHR data to determine whether they could be effective in identifying patients not previously tested for AHP who should receive a proper diagnostic workup for AHP. METHODS AND FINDINGS: We used an extract of the complete EHR data of 200,000 patients from an academic medical center and enriched it with records from an additional 5,571 patients containing any mention of porphyria in the record. After manually reviewing the records of all 47 unique patients with the ICD-10-CM code E80.21 (Acute intermittent [hepatic] porphyria), we identified 30 patients who were positive cases for our machine learning models, with the rest of the patients used as negative cases. We parsed the record into features, which were scored by frequency of appearance and filtered using univariate feature analysis. We manually chose features not directly tied to provider attributes or suspicion of the patient having AHP. We trained on the full dataset, with the best cross-validation performance coming from support vector machine (SVM) algorithm using a radial basis function (RBF) kernel. The trained model was applied back to the full data set and patients were ranked by margin distance. 
The top 100 ranked negative cases were manually reviewed for symptom complexes similar to AHP, finding four patients where AHP diagnostic testing was likely indicated and 18 patients where AHP diagnostic testing was possibly indicated. From the top 100 ranked cases of patients with mention of porphyria in their record, we identified four patients for whom AHP diagnostic testing was possibly indicated and had not been previously performed. Based solely on the reported prevalence of AHP, we would have expected only 0.002 cases out of the 200 patients manually reviewed. CONCLUSIONS: The application of machine learning and knowledge engineering to EHR data may facilitate the diagnosis of rare diseases such as AHP. Further work will recommend clinical investigation to identified patients' clinicians, evaluate more patients, assess additional feature selection and machine learning algorithms, and apply this methodology to other rare diseases. This work provides strong evidence that population-level informatics can be applied to rare diseases, greatly improving our ability to identify undiagnosed patients, and in the future improve the care of these patients and our ability to study these diseases. The next step is to learn how best to apply these EHR-based machine learning approaches to benefit individual patients with a clinical study that provides diagnostic testing and clinical follow up for those identified as possibly having undiagnosed AHP.

Subjects

Knowledge , Machine Learning , Porphobilinogen Synthase/deficiency , Porphyrias, Hepatic/diagnosis , Databases, Factual , Electronic Health Records , Female , Humans , Male , Porphyrias, Hepatic/pathology

14.

A system for classifying disease comorbidity status from medical discharge summaries using automated hotspot and negated concept detection.

Ambert, Kyle H; Cohen, Aaron M.

J Am Med Inform Assoc ; 16(4): 590-5, 2009.

Article in English | MEDLINE | ID: mdl-19390099

ABSTRACT

OBJECTIVE: Free-text clinical reports serve as an important part of patient care management and clinical documentation of patient disease and treatment status. Free-text notes are commonplace in medical practice, but remain an under-used source of information for clinical and epidemiological research, as well as personalized medicine. The authors explore the challenges associated with automatically extracting information from clinical reports using their submission to the Integrating Informatics with Biology and the Bedside (i2b2) 2008 Natural Language Processing Obesity Challenge Task. DESIGN: A text mining system for classifying patient comorbidity status, based on the information contained in clinical reports. The approach of the authors incorporates a variety of automated techniques, including hot-spot filtering, negated concept identification, zero-vector filtering, weighting by inverse class-frequency, and error-correcting of output codes with linear support vector machines. MEASUREMENTS: Performance was evaluated in terms of the macroaveraged F1 measure. RESULTS: The automated system performed well against manual expert rule-based systems, finishing fifth in the Challenge's intuitive task, and 13th in the textual task. CONCLUSIONS: The system demonstrates that effective comorbidity status classification by an automated system is possible.

Subjects

Classification/methods , Medical Records Systems, Computerized , Natural Language Processing , Obesity , Patient Discharge , Comorbidity , Disease/classification , Humans , Information Storage and Retrieval/methods

15.

Towards augmenting structured EHR data: a comparison of manual chart review and patient self-report.

Weiskopf, Nicole G; Cohen, Aaron M; Hannan, Joely; Jarmon, Thad; Dorr, David A.

AMIA Annu Symp Proc ; 2019: 903-912, 2019.

Article in English | MEDLINE | ID: mdl-32308887

ABSTRACT

Structured electronic health record (EHR) data are often used for quality measurement and improvement, clinical research, and other secondary uses. These data, however, are known to suffer from quality problems. There may be value in augmenting structured EHR data to improve data quality, thereby improving the reliability and validity of the conclusions drawn from those data. Focusing on five diagnoses related to cardiovascular care, this paper considers the added value of two alternative data sources: manual chart abstraction and patient self-report. We assess the overall agreement between structured EHR problem list data, abstracted EHR data, and patient self-report; and explore possible causes of disagreement between those sources. Our findings suggest that both chart abstraction and patient self-report contain significantly more diagnoses than the problem list, but that the information they capture is different. Methods for collecting and validating self-reported medical data require further consideration and exploration.

Subjects

Electronic Health Records , Information Storage and Retrieval , Self Report , Adult , Aged , Aged, 80 and over , Data Accuracy , Female , Humans , Male , Medical Records, Problem-Oriented , Middle Aged , Reproducibility of Results , Young Adult

16.

Modelling disease risk for amyloid A (AA) amyloidosis in non-human primates using machine learning.

Leung, Eric T; Raboin, Michael J; McKelvey, Jessica; Graham, Adam; Lewis, Anne; Prongay, Kamm; Cohen, Aaron M; Vinson, Amanda.

Amyloid ; 26(3): 139-147, 2019 Sep.

Article in English | MEDLINE | ID: mdl-31210531

ABSTRACT

Objective: Amyloid A (AA) amyloidosis is found in humans and non-human primates, but quantifying disease risk prior to clinical symptoms is challenging. We applied machine learning to identify the best predictors of amyloidosis in rhesus macaques from available clinical and pathology records. To explore potential biomarkers, we also assessed whether changes in circulating serum amyloid A (SAA) or lipoprotein profiles accompany the disease. Methods: We conducted a retrospective study using 86 cases and 163 controls matched for age and sex. We performed data reduction on 62 clinical, pathological and demographic variables, and applied multivariate modelling and model selection with cross-validation. To test the performance of our final model, we applied it to a replication cohort of 2,775 macaques. Results: The strongest predictors of disease were colitis, gastrointestinal adenocarcinoma, endometriosis, arthritis, trauma, diarrhoea and number of pregnancies. Sensitivity and specificity of the risk model were predicted to be 82%, and were assessed at 79 and 72%, respectively. Total, low density lipoprotein and high density lipoprotein cholesterol levels were significantly lower, and SAA levels and triglyceride-to-HDL ratios were significantly higher in cases versus controls. Conclusion: Machine learning is a powerful approach to identifying macaques at risk of AA amyloidosis, which is accompanied by increased circulating SAA and altered lipoprotein profiles.

Subjects

Amyloidosis/diagnosis , Machine Learning/statistics & numerical data , Models, Statistical , Serum Amyloid A Protein/metabolism , Adenocarcinoma/diagnosis , Adenocarcinoma/physiopathology , Amyloidosis/blood , Amyloidosis/physiopathology , Animals , Arthritis/diagnosis , Arthritis/physiopathology , Biomarkers/blood , Case-Control Studies , Cholesterol, HDL/blood , Cholesterol, LDL/blood , Colitis/diagnosis , Colitis/physiopathology , Diarrhea/diagnosis , Diarrhea/physiopathology , Disease Models, Animal , Endometriosis/diagnosis , Endometriosis/physiopathology , Female , Gastrointestinal Neoplasms/diagnosis , Gastrointestinal Neoplasms/physiopathology , Humans , Macaca mulatta , Male , Retrospective Studies , Risk Factors , Triglycerides/blood , Wounds and Injuries/diagnosis , Wounds and Injuries/physiopathology

17.

Five-way smoking status classification using text hot-spot identification and error-correcting output codes.

Cohen, Aaron M.

J Am Med Inform Assoc ; 15(1): 32-5, 2008.

Article in English | MEDLINE | ID: mdl-17947623

ABSTRACT

We participated in the i2b2 smoking status classification challenge task. The purpose of this task was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries. Our submission included several techniques that we compared and studied, including hot-spot identification, zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules. We evaluated our approaches using the same methods as the i2b2 task organizers, using micro- and macro-averaged F1 as the primary performance metric. Our best performing system achieved a micro-F1 of 0.9000 on the test collection, equivalent to the best performing system submitted to the i2b2 challenge. Hot-spot identification, zero-vector filtering, classifier weighting, and error correcting output coding contributed additively to increased performance, with hot-spot identification having by far the largest positive effect. High performance on automatic identification of patient smoking status from discharge summaries is achievable with the efficient and straightforward machine learning techniques studied here.
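The micro- and macro-averaged F1 metrics named above differ in how they pool per-class results: micro-averaging sums the counts over all classes before computing F1, while macro-averaging computes F1 per class and then takes the unweighted mean. A minimal sketch for single-label classification follows; the function name and the class labels "a", "b", "c" are placeholders, not the task's actual categories.

```python
from collections import Counter


def micro_macro_f1(gold: list[str], pred: list[str]) -> tuple[float, float]:
    """Return (micro-F1, macro-F1) for single-label multi-class predictions."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # true class gains a false negative

    def f1(t: int, f_pos: int, f_neg: int) -> float:
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(gold) | set(pred))
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Placeholder labels and predictions (illustration only).
gold = ["a", "a", "b", "b", "c", "c"]
pred = ["a", "b", "b", "b", "c", "a"]
micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 4), round(macro, 4))
```

With exactly one label per document, micro-F1 equals plain accuracy, so macro-F1 is the more informative of the two when classes are imbalanced.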

Subjects

Classification/methods , Natural Language Processing , Smoking , Algorithms , Artificial Intelligence , Humans , Medical Records Systems, Computerized
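
The abstract above names error-correcting output codes as one of the techniques used for the five-way task. As a rough illustration of the general idea only (not the author's implementation), each class is assigned a binary codeword, one binary classifier predicts each bit, and the predicted bit string is decoded to the class with the nearest codeword. The code matrix, the class indices, and the function names below are all hypothetical; the codewords were chosen so that any two differ in at least three positions, which lets a single wrong bit be corrected.

```python
# Hypothetical 5-class, 7-bit code matrix (rows are class codewords).
# Every pair of rows differs in at least 3 positions.
CODE_MATRIX = [
    (0, 0, 0, 0, 0, 0, 0),  # class 0
    (1, 0, 0, 0, 1, 1, 0),  # class 1
    (0, 1, 0, 0, 1, 0, 1),  # class 2
    (0, 0, 1, 0, 0, 1, 1),  # class 3
    (0, 0, 0, 1, 1, 1, 1),  # class 4
]


def hamming(a, b):
    """Number of positions at which two equal-length bit sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(matrix):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(
        hamming(matrix[i], matrix[j])
        for i in range(len(matrix))
        for j in range(i + 1, len(matrix))
    )


def decode(bits):
    """Return the index of the class whose codeword is closest to `bits`."""
    return min(range(len(CODE_MATRIX)), key=lambda k: hamming(bits, CODE_MATRIX[k]))


# Simulate one binary classifier making a mistake on the first bit of class 3.
noisy = list(CODE_MATRIX[3])
noisy[0] ^= 1
decoded = decode(noisy)
```

With a minimum pairwise distance of 3, a bit string containing one error is still closer to its true codeword than to any other, so the decoded class is unchanged.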

18.

Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database.

Smalheiser, Neil R; Cohen, Aaron M.

Data Inf Manag ; 2(1): 27-36, 2018 Jun.

Article in English | MEDLINE | ID: mdl-30766970

ABSTRACT

Many investigators have carried out text mining of the biomedical literature for a variety of purposes, ranging from the assignment of indexing terms to the disambiguation of author names. A common approach is to define positive and negative training examples, extract features from article metadata, and employ machine learning algorithms. At present, each research group tackles each problem from scratch, and in isolation of other projects, which causes redundancy and great waste of effort. Here, we propose and describe the design of a generic platform for biomedical text mining, which can serve as a shared resource for machine learning projects, and can serve as a public repository for their outputs. We will initially focus on a specific goal, namely, classifying articles according to Publication Type, and emphasize how feature sets can be made more powerful and robust through the use of multiple, heterogeneous similarity measures as input to machine learning models. We then discuss how the generic platform can be extended to include a wide variety of other machine learning based goals and projects, and can be used as a public platform for disseminating the results of NLP tools to end-users as well.

19.

A probabilistic automated tagger to identify human-related publications.

Cohen, Aaron M; Dunivin, Zackary O; Smalheiser, Neil R.

Database (Oxford) ; 2018: 1-8, 2018 01 01.

Article in English | MEDLINE | ID: mdl-30184195

ABSTRACT

The Medical Subject Heading 'Humans' is manually curated and indicates human-related studies within MEDLINE. However, newly published MEDLINE articles may take months to be indexed and non-MEDLINE articles lack consistent, transparent indexing of this feature. Therefore, for up to date and broad literature searches, there is a need for an independent automated system to identify whether a given publication is human-related, particularly when they lack Medical Subject Headings. One million MEDLINE records published in 1987-2014 were randomly selected. Text-based features from the title, abstract, author name and journal fields were extracted. A linear support vector machine was trained to estimate the probability that a given article should be indexed as Humans and was evaluated on records from 2015 to 2016. Overall accuracy was high: area under the receiver operating curve = 0.976, F1 = 95% relative to MeSH indexing. Manual review of cases of extreme disagreement with MEDLINE showed 73.5% agreement with the automated prediction. We have tagged all articles indexed in PubMed with predictive scores and have made the information publicly available at http://arrowsmith.psych.uic.edu/evidence_based_medicine/index.html. We have also made available a web-based interface to allow users to obtain predictive scores for non-MEDLINE articles. This will assist in the triage of clinical evidence for writing systematic reviews.

Subjects

Automation , Probability , Publications , Calibration , Databases as Topic , Humans , Reproducibility of Results
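
The abstract above reports an area under the ROC curve as its main accuracy figure. For readers unfamiliar with the metric, the sketch below computes it by its pairwise definition: the fraction of (positive, negative) pairs in which the positive item receives the higher score, with ties counted as half. The function name and the four example scores are made up for illustration and have no connection to the study's data.

```python
def auc_pairwise(scores, labels):
    """AUC as the share of positive/negative pairs ranked correctly.

    `labels` holds 1 for positive items and 0 for negative items.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores: three of the four positive/negative pairs are ordered correctly.
example_auc = auc_pairwise([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The double loop is quadratic in the number of items, so it suits small illustrations; a rank-based formulation would be the usual choice at the scale of a million records.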

20.

Evaluation of Clinical Text Segmentation to Facilitate Cohort Retrieval.

Edinger, Tracy; Demner-Fushman, Dina; Cohen, Aaron M; Bedrick, Steven; Hersh, William.

AMIA Annu Symp Proc ; 2017: 660-669, 2017.

Article in English | MEDLINE | ID: mdl-29854131

ABSTRACT

Objective: Secondary use of electronic health record (EHR) data is enabled by accurate and complete retrieval of the relevant patient cohort, which requires searching both structured and unstructured data. Clinical text poses difficulties to searching, although chart notes incorporate structure that may facilitate accurate retrieval. Methods: We developed rules identifying clinical document sections, which can be indexed in search engines that allow faceted searches, such as Lucene or Essie, an NLM search engine. We developed 22 clinical cohorts and two queries for each cohort, one utilizing section headings and the other searching the whole document. We manually evaluated a subset of retrieved documents to compare query performance. Results: Querying by section had lower recall than whole-document queries (0.83 vs 0.95), higher precision (0.73 vs 0.54), and higher F1 (0.78 vs 0.69). Conclusion: This evaluation suggests that searching specific sections may improve precision under certain conditions and often with loss of recall.

Subjects

Electronic Health Records , Information Storage and Retrieval/methods , Search Engine , Abstracting and Indexing , Humans
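
The F1 values quoted in the abstract above are consistent with the reported precision and recall under the standard definition of F1 as their harmonic mean. The short check below recomputes them from the figures in the abstract; the function and variable names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1_section = f1_score(precision=0.73, recall=0.83)  # section-based queries
f1_whole = f1_score(precision=0.54, recall=0.95)    # whole-document queries

print(round(f1_section, 2), round(f1_whole, 2))  # prints 0.78 0.69
```

Because the harmonic mean is pulled toward the smaller of the two values, the whole-document queries score lower on F1 despite their higher recall.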
