Results 1 - 20 of 4,160
1.
J Korean Med Sci ; 35(12): e78, 2020 Mar 30.
Article in English | MEDLINE | ID: mdl-32233158

ABSTRACT

BACKGROUND: Human leukocyte antigen (HLA) typing is important for transplant patients to prevent a severe mismatch reaction, and the result can also support the diagnosis of various diseases or the prediction of drug side effects. However, such secondary applications of HLA typing results are limited because they are typically provided as free text or PDFs in electronic medical records. Here we propose a method to convert HLA genotype information stored in an unstructured format into a reusable structured format by extracting serotype/allele information. METHODS: We queried HLA typing reports from the clinical data warehouse of Seoul National University Hospital (SUPPREME) from 2000 to 2018 as a rule-development data set (64,024 reports) and from the most recent year (6,181 reports) as a test set. We used a rule-based natural language processing approach built on Python regex functions to extract 1) the number of patients in the report, 2) clinical characteristics such as the indication for HLA testing, and 3) the precise HLA genotypes. The performance of the rules and code was evaluated by comparing the extracted results from the test set against a validation set generated by manual curation. RESULTS: Among the reports describing HLA typing for a single patient (11,287 in the development set and 1,107 in the test set), iterative rule generation produced 124 extraction rules and 8 cleaning rules for HLA genotypes. Applying these rules extracted HLA genotypes with 0.892-0.999 precision and 0.795-0.998 recall for the five HLA genes. The precision and recall of the extraction rules were 0.997 and 0.994 for the number of patients in a report, and 0.997 and 0.992 for the clinical variables, respectively. All extracted HLA alleles and serotypes were transformed into formal HLA nomenclature by the cleaning rules. CONCLUSION: The rule-based HLA genotype extraction method shows reliable accuracy. We believe a significant number of patients would benefit if this under-used genetic information were returned to them.


Subjects
HLA Antigens/genetics; Histocompatibility Testing; Information Storage and Retrieval; Natural Language Processing; Algorithms; Data Warehousing; Electronic Health Records; Genotype; Humans; Seoul
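
A minimal sketch of the rule-based extraction idea described above, using a single Python regex in place of the study's 124 extraction and 8 cleaning rules; the pattern and sample report text are hypothetical:

```python
import re

# Hypothetical example of one extraction rule plus one cleaning rule;
# the study developed 124 extraction and 8 cleaning rules in total.
HLA_PATTERN = re.compile(
    r"(?:HLA[- ]?)?(?P<gene>A|B|C|DRB1|DQB1)\s*\*?\s*"
    r"(?P<allele>\d{2,3}(?::\d{2,3})?)",
    re.IGNORECASE,
)

def extract_hla(text: str) -> list[str]:
    """Extract HLA allele mentions and normalize them to HLA-GENE*NN:NN form."""
    results = []
    for m in HLA_PATTERN.finditer(text):
        gene = m.group("gene").upper()
        allele = m.group("allele")
        results.append(f"HLA-{gene}*{allele}")
    return results

# Hypothetical free-text fragment from a typing report.
print(extract_hla("HLA typing: HLA-A 02:01, hla-b*57:01, DRB1 15:01 detected"))
# ['HLA-A*02:01', 'HLA-B*57:01', 'HLA-DRB1*15:01']
```
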
2.
J Med Syst ; 44(5): 96, 2020 Mar 20.
Article in English | MEDLINE | ID: mdl-32193703

ABSTRACT

Optic disc (OD) and optic cup (OC) segmentation are important steps in the automatic screening and diagnosis of optic nerve head abnormalities such as glaucoma. Many recent works formulate OD and OC segmentation as a pixel classification task. However, it is hard for these methods to explicitly model the spatial relations between the labels in the output mask. Furthermore, the proportions of background, OD, and OC pixels are unbalanced, which may result in a biased model and introduce additional noise. To address these problems, we developed an approach that follows a coarse-to-fine segmentation process. We start with a U-Net to obtain a rough segmentation boundary and then crop the area around that boundary to form a boundary-contour-centered image. Next, inspired by sequence labeling tasks in natural language processing, we regard OD and OC segmentation as a sequence labeling task and propose a novel fully convolutional network, SU-Net, combining it with the Viterbi algorithm to jointly decode the segmentation boundary. We also introduce a geometric parameter-based data augmentation method that generates more training samples in order to minimize the differences between training and test sets and reduce overfitting. Experimental results show that our method achieved state-of-the-art results on 2 datasets for both OD and OC segmentation, and that it outperforms most of the 6 ophthalmologists in terms of agreement on the MESSIDOR dataset for both OD and OC segmentation. In terms of glaucoma screening, we achieved the best cup-to-disc ratio (CDR) error and area under the ROC curve (AUC) for glaucoma classification on the Drishti-GS dataset.


Subjects
Glaucoma; Image Processing, Computer-Assisted; Neural Networks, Computer; Optic Disk/diagnostic imaging; Fundus Oculi; Glaucoma/diagnosis; Humans; Image Processing, Computer-Assisted/methods; Natural Language Processing
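
The abstract treats boundary decoding as sequence labeling; below is a generic Viterbi decoder over per-position label scores, a standard-algorithm sketch rather than the authors' SU-Net pipeline:

```python
import numpy as np

def viterbi(log_probs: np.ndarray, log_trans: np.ndarray) -> list[int]:
    """Decode the most likely label sequence.

    log_probs: (T, K) per-position log-probabilities emitted by a network,
               e.g. one column of the boundary-centered image per position.
    log_trans: (K, K) log transition scores enforcing spatial consistency
               between adjacent labels (background / OD / OC).
    """
    T, K = log_probs.shape
    score = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = log_probs[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans  # (K, K): prev -> current
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_probs[t]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example with 3 labels; a strong diagonal discourages label flips.
rng = np.random.default_rng(0)
emissions = np.log(rng.dirichlet(np.ones(3), size=8))
transitions = np.log(np.eye(3) * 0.8 + 0.1)
print(viterbi(emissions, transitions))
```
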
3.
PLoS One ; 15(1): e0226634, 2020.
Article in English | MEDLINE | ID: mdl-31923222

ABSTRACT

PURPOSE: The objective of this study was to assess the ability of Breast Imaging Reporting and Data System (BI-RADS) ultrasound feature descriptors to classify patients with respect to established commercial transcriptomic gene signatures that guide the management of breast cancer. MATERIALS AND METHODS: This retrospective, single-institution analysis of 219 patients involved two cohorts, each using one of two FDA-approved transcriptome-based tests performed as part of the clinical care of breast cancer patients at Harbor-UCLA Medical Center between April 2008 and January 2013. BI-RADS descriptive terminology was collected from the corresponding ultrasound reports for each patient in conjunction with the transcriptomic test results. Recursive partitioning and regression trees were used to test and validate classification of the two cohorts. RESULTS: The areas under the curve (AUC) of the receiver operating characteristic (ROC) curves for the regression classifiers of the two FDA-approved tests against ultrasound features were 0.77 and 0.65, respectively; the classifiers employed the 'margins', 'retrotumoral', and 'internal echoes' feature descriptors. Notably, the 'retrotumoral' and mass 'margins' features were used in both classification trees. The identification of sonographic correlates of gene tests adds value to the ultrasound exam without incurring additional procedures or testing. CONCLUSIONS: The predictive capability of structured language from diagnostic ultrasound reports (BI-RADS) was moderate for the two tests, providing added value from ultrasound imaging without any additional cost. Incorporating additional measures, such as ultrasound contrast enhancement, and validating in larger, prospective studies may further substantiate these results and potentially demonstrate even greater predictive utility.


Subjects
Breast Neoplasms/diagnostic imaging; Breast Neoplasms/genetics; Gene Expression Profiling; Natural Language Processing; Research Design; Ultrasonography, Mammary/classification; Area Under Curve; Cohort Studies; Female; Humans; Middle Aged; ROC Curve; Retrospective Studies
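
As an illustration of recursive partitioning over categorical BI-RADS descriptors, here is a hedged scikit-learn sketch; the descriptor values, labels, and tiny dataset are invented for demonstration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import roc_auc_score

# Hypothetical BI-RADS descriptors per lesion: margins, retrotumoral
# findings, internal echoes. Labels stand in for a binary gene-test result.
X_raw = np.array([
    ["circumscribed", "none", "homogeneous"],
    ["spiculated", "shadowing", "heterogeneous"],
    ["indistinct", "enhancement", "heterogeneous"],
    ["circumscribed", "none", "homogeneous"],
    ["spiculated", "shadowing", "heterogeneous"],
    ["indistinct", "none", "homogeneous"],
])
y = np.array([0, 1, 1, 0, 1, 0])

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(X_raw)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
scores = tree.predict_proba(X)[:, 1]
print("AUC:", roc_auc_score(y, scores))  # in-sample; use CV in practice
```
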
4.
PLoS One ; 15(1): e0227730, 2020.
Article in English | MEDLINE | ID: mdl-31945115

ABSTRACT

BACKGROUND: Chronic obstructive pulmonary disease (COPD) is associated with poor quality of life, hospitalization, and mortality. COPD phenotyping includes using pulmonary function tests to determine airflow obstruction from the ratio of forced expiratory volume in one second (FEV1) to forced vital capacity (FVC). FEV1 is a commonly used measure of severity but is difficult to identify in structured electronic health record (EHR) data. DATA SOURCE AND METHODS: Using Microsoft SQL Server's full-text search feature and string functions supporting regular-expression-like operations, we developed an automated tool to extract FEV1 values from progress notes to improve ascertainment of FEV1 in the EHR for the Veterans Aging Cohort Study (VACS). RESULTS: The automated tool increased the number of quantifiable FEV1 values from 12,425 to 16,274 (a 24% increase in numeric FEV1). Using chart review as the reference, the positive predictive value of the tool was 99% (95% confidence interval: 98.2-100.0%) for identifying quantifiable FEV1 values, with a recall of 100%, yielding an F-measure of 0.99. The tool correctly identified FEV1 measurements in 95% of cases. CONCLUSION: A SQL-based full-text search of clinical notes for quantifiable FEV1 is efficient and increases the number of values available in VA data. Future work will examine how these methods can improve phenotyping of patients with COPD in the VA.


Subjects
Data Mining/methods; Electronic Health Records/statistics & numerical data; Forced Expiratory Volume/physiology; Pulmonary Disease, Chronic Obstructive/diagnosis; Vital Capacity/physiology; Cohort Studies; Health Information Systems/statistics & numerical data; Hospitalization/statistics & numerical data; Humans; Lung/physiopathology; Natural Language Processing; Pulmonary Disease, Chronic Obstructive/physiopathology; Severity of Illness Index; Software; United States; United States Department of Veterans Affairs/statistics & numerical data; Veterans/statistics & numerical data
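
The study used SQL Server full-text search and string functions; the sketch below shows the same kind of quantifiable-FEV1 extraction with a Python regex, as a rough stand-in rather than the authors' actual queries:

```python
import re

# The study used SQL Server full-text search; this is a rough Python
# equivalent for pulling quantifiable FEV1 values out of progress notes.
FEV1_RE = re.compile(
    r"FEV-?1\s*(?:of|was|[:=])?\s*(?P<value>\d\.\d{1,2})\s*L?",
    re.IGNORECASE,
)

def extract_fev1(note: str) -> list[float]:
    """Return FEV1 values (in liters) found in a clinical note."""
    return [float(m.group("value")) for m in FEV1_RE.finditer(note)]

print(extract_fev1("Spirometry today: FEV1 1.85 L, down from FEV1 of 2.10."))
# [1.85, 2.1]
```
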
5.
BMC Bioinformatics ; 21(1): 29, 2020 Jan 28.
Article in English | MEDLINE | ID: mdl-31992184

ABSTRACT

BACKGROUND: Event extraction from the biomedical literature is one of the most actively researched areas in biomedical text mining and natural language processing. However, most approaches have focused on events within single sentence boundaries and have thus paid much less attention to events spanning multiple sentences. The Bacteria-Biotope event (BB-event) subtask presented in BioNLP Shared Task 2016 is one such example; a significant number of relations between bacteria and biotopes span more than one sentence, but existing systems have treated them as false negatives because the labeled data is not large enough to model such a complex reasoning process within supervised learning frameworks. RESULTS: We present an unsupervised method for inferring cross-sentence events by propagating intra-sentence information to adjacent sentences using context trigger expressions that strongly signal the implicit presence of entities of interest. Such expressions can be collected from a large amount of unlabeled plain text based on simple syntactic constraints, helping to overcome the limitation of relying on only a small number of available training examples. The experimental results demonstrate that our unsupervised system extracts cross-sentence events quite well and outperforms all the state-of-the-art supervised systems when combined with existing methods for intra-sentence event extraction. Moreover, our system is also effective at detecting long-distance intra-sentence events, comparing favorably with existing high-dimensional models such as deep neural networks, without any supervised learning techniques. CONCLUSIONS: Our linguistically motivated inference model is shown to be effective at detecting implicit events that have not been covered by previous work, without relying on training data or curated knowledge bases. Moreover, it also helps to boost the performance of existing systems by allowing them to detect additional cross-sentence events. We believe that the proposed model offers an effective way to infer implicit information beyond sentence boundaries, especially when human-annotated data is insufficient for training a robust supervised system.


Subjects
Data Mining/methods; Bacteria; Natural Language Processing; Publications
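
A toy sketch of the propagation idea: when a context trigger expression signals that a previously mentioned bacterium is still under discussion, intra-sentence pairs are extended across the sentence boundary. Triggers and mentions are illustrative:

```python
# Context trigger expressions signal that an entity from a previous sentence
# is still under discussion, so intra-sentence pairs can be extended across
# sentence boundaries. Triggers, entities, and sentences are illustrative.
TRIGGERS = ("this bacterium", "this strain", "the organism")

def cross_sentence_pairs(sentences, bacteria_mentions, habitat_mentions):
    """sentences: list of str; *_mentions: list of entity lists per sentence."""
    pairs = []
    last_bacterium = None
    for i, sent in enumerate(sentences):
        if bacteria_mentions[i]:
            last_bacterium = bacteria_mentions[i][-1]
        elif last_bacterium and any(t in sent.lower() for t in TRIGGERS):
            # Propagate the previous bacterium into this sentence.
            pairs += [(last_bacterium, h) for h in habitat_mentions[i]]
        pairs += [(b, h) for b in bacteria_mentions[i] for h in habitat_mentions[i]]
    return pairs

sents = ["Bifidobacterium longum was isolated.",
         "This bacterium colonizes the human gut."]
print(cross_sentence_pairs(sents, [["Bifidobacterium longum"], []],
                           [[], ["human gut"]]))
# [('Bifidobacterium longum', 'human gut')]
```
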
6.
Int J Med Inform ; 133: 104016, 2020 01.
Article in English | MEDLINE | ID: mdl-31707264

ABSTRACT

OBJECTIVES: Nurse concerns documented in nursing notes are important predictors of patient risk of deterioration. Using a standard nursing terminology and input from subject-matter experts (SMEs), we aimed to identify and define nurse concern concepts and terms about patient deterioration, which can then support automated tasks such as natural language processing and risk prediction. METHODS: Group consensus meetings with nurse SMEs were held to identify nursing concerns by grading Clinical Care Classification (CCC) system concepts based on clinical knowledge. Next, a fundamental lexicon was built by placing the selected CCC concepts into a framework of entities and seed terms to extend CCC granularity. RESULTS: A total of 29 CCC concepts were selected as reflecting nurse concerns. From these, 111 entities and 586 seed terms were generated into a fundamental lexicon. Nursing concern concepts differed across settings (intensive care units versus non-intensive care units) and unit types (medicine versus surgery units). CONCLUSIONS: The CCC concepts were useful for representing nursing concerns, as they encompass a nursing-centric conceptual framework and are practical for lexicon construction. They enabled the codification of nursing concerns for deteriorating patients at a standardized conceptual level. The boundaries of the selected CCC concepts and lexicons were determined by the SMEs. The fundamental lexicon offers more granular terms that can be identified and processed in an automated fashion.


Subjects
Standardized Nursing Terminology; Humans; Intensive Care Units; Natural Language Processing; Nurses
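
A minimal sketch of what a fundamental lexicon of concepts, entities, and seed terms might look like in code, with simple term matching; all concepts and terms below are invented examples, not the study's 29 CCC concepts:

```python
# CCC-style concepts broken down into finer-grained entities, each with seed
# terms that can be matched in notes. Concepts, entities, and terms here are
# illustrative only.
LEXICON = {
    "Respiratory Alteration": {
        "dyspnea": ["short of breath", "sob", "labored breathing"],
        "abnormal breath sounds": ["crackles", "rales", "wheezing"],
    },
    "Cardiac Alteration": {
        "arrhythmia": ["irregular rhythm", "afib", "tachycardic"],
    },
}

def find_concerns(note: str):
    """Return (concept, entity, term) triples found in a nursing note."""
    text = note.lower()
    return [(concept, entity, term)
            for concept, entities in LEXICON.items()
            for entity, terms in entities.items()
            for term in terms if term in text]

print(find_concerns("Pt short of breath overnight, crackles at bases."))
```
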
7.
Neural Netw ; 121: 132-139, 2020 Jan.
Article in English | MEDLINE | ID: mdl-31541881

ABSTRACT

Neural networks (NNs) have become the state of the art in many machine learning applications, such as image, sound (LeCun et al., 2015), and natural language processing (Young et al., 2017; Linggard et al., 2012). However, the success of NNs remains dependent on the availability of large labelled datasets, which are often lacking in the case of electronic health records (EHRs). With scarce data, NNs are unlikely to extract the information hidden in such records with practical accuracy. In this study, we develop an approach that addresses these problems for named entity recognition, obtaining an F1 score of 94.6 on the I2B2 2009 Medical Extraction Challenge (Uzuner et al., 2010), 4.3 points above the architecture that won the competition. To achieve this, we bootstrap our NN models through transfer learning: we pretrain word embeddings on a secondary task performed on a large pool of unannotated EHRs and use the output embeddings as the foundation of a range of NN architectures. Beyond the official I2B2 challenge, we further achieve an F1 of 82.4 on extracting relationships between medical terms using attention-based seq2seq models bootstrapped in the same manner.


Subjects
Electronic Health Records/classification; Machine Learning/classification; Natural Language Processing; Neural Networks, Computer; Data Collection/classification; Data Collection/methods; Humans
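
A hedged sketch of the bootstrap step: pretrain word vectors on unannotated notes (gensim Word2Vec stands in for the paper's secondary pretraining task) and build an embedding matrix to seed a downstream NER network:

```python
# A minimal sketch, not the authors' pipeline: pretrain word vectors on a
# large pool of unannotated notes, then use them to initialize an embedding
# matrix for a downstream NER network.
import numpy as np
from gensim.models import Word2Vec

unannotated_notes = [
    ["patient", "started", "on", "metformin", "500", "mg"],
    ["metformin", "dose", "increased", "for", "glycemic", "control"],
]  # in practice: millions of tokenized EHR sentences

w2v = Word2Vec(sentences=unannotated_notes, vector_size=50,
               window=5, min_count=1, sg=1, epochs=20)

vocab = ["<pad>", "<unk>"] + list(w2v.wv.key_to_index)
emb = np.zeros((len(vocab), 50), dtype=np.float32)
for i, word in enumerate(vocab[2:], start=2):
    emb[i] = w2v.wv[word]  # copy pretrained vectors; rows 0-1 stay zero

# `emb` can now seed the embedding layer of any NER architecture.
print(emb.shape)
```
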
8.
Int J Med Inform ; 135: 104053, 2020 03.
Article in English | MEDLINE | ID: mdl-31884312

ABSTRACT

OBJECTIVE: Early identification and treatment of patient deterioration is crucial to improving clinical outcomes. To act, hospital rapid response (RR) teams often rely on nurses' clinical judgement, typically documented narratively in the electronic health record (EHR). We developed a data-driven, unsupervised method to discover potential risk factors for RR events from nursing notes. METHODS: We applied multiple natural language processing methods, including language modelling, word embeddings, and two phrase mining methods (TextRank and NC-Value), to identify quality phrases representing clinical entities in unannotated nursing notes. TextRank was used to determine the important word sequences in each note, and NC-Value was then used to globally rank the locally important sequences across the whole corpus. We evaluated our method both on its accuracy compared to human judgement and on the ability of the mined phrases to predict a clinical outcome, RR event hazard. RESULTS: When applied to 61,740 hospital encounters with 1,067 RR events and 778,955 notes, our method achieved an average precision of 0.590 to 0.764 (when excluding numeric tokens). A Cox model with time-dependent covariates built on the phrases achieved a concordance index of 0.739. Clustering the phrases revealed clinical concepts significantly associated with RR event hazard. DISCUSSION: Our findings demonstrate that this minimal-annotation, unsupervised method can rapidly mine quality phrases from a large corpus of nursing notes, and that the identified phrases are useful for downstream tasks such as clinical outcome prediction and risk factor identification.


Subjects
Data Mining; Electronic Health Records; Adult; Aged; Electronic Health Records/statistics & numerical data; Female; Humans; Male; Middle Aged; Natural Language Processing; Nurses; Risk Factors
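
A minimal TextRank sketch using PageRank over a word co-occurrence graph, illustrating the local ranking step; the corpus-level NC-Value re-ranking is omitted and the note text is invented:

```python
# Rank words by PageRank over a co-occurrence graph; top-ranked adjacent
# word sequences would then become candidate phrases.
import networkx as nx

def textrank_words(tokens, window=2, top_k=5):
    g = nx.Graph()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            g.add_edge(w, tokens[j])  # edge = co-occurrence within window
    ranks = nx.pagerank(g)
    return sorted(ranks, key=ranks.get, reverse=True)[:top_k]

note = ("patient remains hypotensive despite fluid bolus rapid response "
        "called for worsening hypotension and altered mental status").split()
print(textrank_words(note))
```
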
9.
Int J Med Inform ; 133: 103970, 2020 01.
Article in English | MEDLINE | ID: mdl-31704490

ABSTRACT

BACKGROUND: The drug information most commonly requested by patients concerns the potential adverse drug reactions (ADRs) of their drugs. Such information should be customizable to individual information needs. While approaches to automatically aggregate ADRs through text-mining processes and to establish corresponding databases are well known, further efforts to map additional ADR information are sparse, yet crucial for customization. In a proof-of-principle (PoP) study, we developed a database format demonstrating that natural language processing can further structure ADR information in a way that facilitates customization. METHODS: We developed the database in a 3-step process: (1) initial ADR extraction, (2) mapping of additional ADR information, and (3) review. ADRs of 10 frequently prescribed active ingredients were initially extracted from their Summary of Product Characteristics (SmPC) by text-mining processes and mapped to Medical Dictionary for Regulatory Activities (MedDRA) terms. To further structure the ADR information, we mapped 7 additional ADR characteristics (i.e., frequency, organ class, seriousness, lay perceptibility, onset, duration, and management strategies) to the individual ADRs. In the PoP study, the process steps were assessed and tested. Initial ADR extraction was assessed by measuring precision, recall, and F1-scores (i.e., the harmonic mean of precision and recall). Mapping of additional ADR information was assessed using pre-defined parameters (i.e., correctness, errors, and misses) for the mapped ADR characteristics. RESULTS: Overall, the SmPCs listed 393 ADRs, with an average of 39.3 ± 18.1 ADRs per SmPC. For initial ADR extraction, precision was 97.9% and recall was 93.2%, yielding an F1-score of 95.5%. Regarding the mapping of additional ADR information, the frequency information of 28.6 ± 18.4 ADRs per SmPC was correctly mapped (72.8%). Overall, 77 (20.6%) of the correctly extracted ADRs did not have a concise frequency stated in the SmPC and were consequently mapped as 'frequency not known'. Mapping of the remaining ADR characteristics did not result in noteworthy errors or misses. CONCLUSION: ADR information can be automatically extracted and mapped to corresponding MedDRA terms. Additionally, ADR information can be further structured along additional ADR characteristics to facilitate customization to individual patient needs.


Subjects
Adverse Drug Reaction Reporting Systems; Data Collection; Data Mining; Databases, Factual; Natural Language Processing
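
The evaluation above reports precision, recall, and their harmonic mean (F1); a small helper for computing these from extracted versus gold ADR term sets, with illustrative term lists:

```python
# Precision, recall, and F1 for extracted ADR terms against a manually
# reviewed gold standard. The term lists are illustrative.
def prf1(extracted: set[str], gold: set[str]) -> tuple[float, float, float]:
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {"nausea", "headache", "dizziness", "rash"}
extracted = {"nausea", "headache", "rash", "fatigue"}
print(prf1(extracted, gold))  # (0.75, 0.75, 0.75)
```
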
10.
BMC Bioinformatics ; 20(Suppl 16): 502, 2019 Dec 02.
Article in English | MEDLINE | ID: mdl-31787096

ABSTRACT

BACKGROUND: In recent years, deep learning methods have been applied to many natural language processing tasks, achieving state-of-the-art performance. However, in the biomedical domain they have not outperformed supervised word sense disambiguation (WSD) methods based on support vector machines or random forests, possibly due to inherent similarities among medical word senses. RESULTS: In this paper, we propose two deep-learning-based models for supervised WSD: a model based on a bi-directional long short-term memory (BiLSTM) network, and an attention model based on the self-attention architecture. Our results show that the BiLSTM neural network model with a suitable upper layer structure performs even better than the existing state-of-the-art models on the MSH WSD dataset, while our attention model is 3 to 4 times faster than the BiLSTM model with good accuracy. In addition, we trained "universal" models to disambiguate all ambiguous words together; in these models, the embedding of the target ambiguous word is concatenated to the max-pooled vector, acting as a "hint". The results show that our universal BiLSTM neural network model yielded about 90 percent accuracy. CONCLUSION: Deep contextual models based on sequential information processing methods are able to capture relative contextual information from pre-trained input word embeddings and provide state-of-the-art results for supervised biomedical WSD tasks.


Subjects
Algorithms; Vocabulary; Humans; Natural Language Processing; Support Vector Machine
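
A PyTorch sketch of the "universal" model idea: BiLSTM outputs are max-pooled and the target word's embedding is concatenated as a "hint" before classification. Dimensions and vocabulary are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class UniversalWSD(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=128, n_senses=50):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden + emb_dim, n_senses)

    def forward(self, context_ids, target_id):
        h, _ = self.lstm(self.emb(context_ids))  # (B, T, 2*hidden)
        pooled = h.max(dim=1).values             # max-pool over time
        hint = self.emb(target_id)               # (B, emb_dim) "hint"
        return self.out(torch.cat([pooled, hint], dim=-1))

model = UniversalWSD(vocab_size=10_000)
context = torch.randint(0, 10_000, (2, 20))  # batch of 2 sentences
target = torch.randint(0, 10_000, (2,))      # the ambiguous word ids
print(model(context, target).shape)          # torch.Size([2, 50])
```
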
11.
BMC Med Inform Decis Mak ; 19(Suppl 10): 257, 2019 12 27.
Article in English | MEDLINE | ID: mdl-31881965

ABSTRACT

BACKGROUND: Family history information (FHI) described in unstructured electronic health records (EHRs) is a valuable information source for patient care and scientific research. Since FHI is usually described as free text, the entire FHI extraction process consists of various steps, including section segmentation, family member and clinical observation extraction, and relation discovery between the extracted members and their observations. The extraction step involves recognizing FHI concepts along with their properties, such as the family-side attribute of the family member concept. METHODS: This study focuses on the extraction step and formulates it as a sequence labeling problem. We employed a neural sequence labeling model along with different tag schemes to distinguish family members and their observations. Corresponding to the different tag schemes, the identified entities were aggregated and processed by different algorithms to determine the required properties. RESULTS: We studied the effectiveness of encoding the required properties in the tag schemes by evaluating their performance on the dataset released by the BioCreative/OHNLP challenge 2018. The proposed side scheme, along with the developed features and neural network architecture, achieved an overall F1-score of 0.849 on the test set, which ranked second in the FHI entity recognition subtask. CONCLUSIONS: Compared with conditional random field models, the developed neural network-based models performed significantly better. However, our error analysis revealed two challenging issues with the current approach. One is that some properties require cross-sentence inference. The other is that the current model cannot distinguish between narratives describing the family members of the patient and those describing the relatives of the patient's family members.


Subjects
Electronic Health Records; Information Storage and Retrieval/methods; Medical History Taking; Natural Language Processing; Neural Networks, Computer; Algorithms; Humans
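
A small sketch of encoding a property in the tag scheme: the family-side attribute is folded into the BIO entity label so the labeler predicts it jointly with the span. The tags below illustrate the idea, not the paper's exact scheme:

```python
def decode(tokens, tags):
    """Aggregate BIO tags like B-Member-Maternal into (mention, type, side)."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            _, etype, side = (tag.split("-") + ["NA"])[:3]
            current = ([tok], etype, side)
            entities.append(current)
        elif tag.startswith("I-") and current:
            current[0].append(tok)
        else:
            current = None
    return [(" ".join(toks), etype, side) for toks, etype, side in entities]

tokens = "his maternal grandmother had breast cancer".split()
tags = ["O", "B-Member-Maternal", "I-Member-Maternal",
        "O", "B-Observation-NA", "I-Observation-NA"]
print(decode(tokens, tags))
# [('maternal grandmother', 'Member', 'Maternal'),
#  ('breast cancer', 'Observation', 'NA')]
```
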
12.
BMC Med Inform Decis Mak ; 19(Suppl 10): 277, 2019 12 27.
Article in English | MEDLINE | ID: mdl-31881967

ABSTRACT

BACKGROUND: Family history (FH) information, including family members, the side of family of each member (i.e., maternal or paternal), the living status of family members, and the observations (diseases) of family members, is very important in the decision-making process of disorder diagnosis and treatment. However, FH information cannot be used directly by computers, as it is usually embedded in unstructured text in electronic health records (EHRs). Extracting FH information from clinical text therefore requires natural language processing (NLP). The BioCreative/OHNLP2018 challenge includes a task on FH extraction (i.e., task 1) with two subtasks: (1) entity identification, identifying family members and their observations (diseases) mentioned in clinical text; and (2) family history extraction, extracting the side of family, living status, and observations of family members. For this task, we propose a system based on deep joint learning methods to extract FH information. Our system achieves the highest F1-scores of 0.8901 on subtask 1 and 0.6359 on subtask 2.


Subjects
Deep Learning; Electronic Health Records; Information Storage and Retrieval/methods; Medical History Taking; Natural Language Processing; Algorithms; Clinical Decision-Making; Computational Biology; Humans
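
A minimal sketch of the joint-learning idea: a shared encoder feeds two task heads whose losses would be summed during training; the architecture is illustrative, not the authors' model:

```python
import torch
import torch.nn as nn

class JointFH(nn.Module):
    def __init__(self, vocab=10_000, emb=100, hidden=128,
                 n_entity_tags=9, n_relations=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True,
                            bidirectional=True)  # shared encoder
        self.entity_head = nn.Linear(2 * hidden, n_entity_tags)
        self.relation_head = nn.Linear(2 * hidden, n_relations)

    def forward(self, ids):
        h, _ = self.lstm(self.emb(ids))  # (B, T, 2*hidden)
        # Per-token entity tags; sentence-level relation/attribute logits.
        return self.entity_head(h), self.relation_head(h.mean(dim=1))

model = JointFH()
ids = torch.randint(0, 10_000, (2, 30))
entity_logits, relation_logits = model(ids)
# Training would sum the two losses:
#   loss = ce(entity_logits, entity_gold) + ce(relation_logits, relation_gold)
print(entity_logits.shape, relation_logits.shape)
```
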
15.
Medicine (Baltimore) ; 98(52): e18493, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31876736

ABSTRACT

Bronchopulmonary dysplasia (BPD) is a common disease of premature infants with very low birth weight, and its mechanism remains inconclusive. The aim of this study was to systematically explore BPD-related genes and characterize their functions. Natural language processing analysis was used to identify BPD-related genes, with gene data extracted from the PubMed database. Gene ontology, pathway, and network analyses were carried out, and the results were integrated with the corresponding databases. In this study, 216 genes were identified as BPD-related with P < .05, and 30 pathways were identified as significant. A network of BPD-related genes was also constructed, with 17 hub genes identified. In particular, the phosphatidylinositol 3-kinase/serine-threonine kinase signaling pathway involved the largest number of genes. Insulin was found to be a promising candidate gene related to BPD, suggesting that it may serve as an effective therapeutic target. Our data may help to better understand the molecular mechanisms underlying BPD. However, the mechanisms of BPD remain elusive, and further studies are needed.


Subjects
Bronchopulmonary Dysplasia/genetics; Data Mining; Algorithms; Bronchopulmonary Dysplasia/etiology; Bronchopulmonary Dysplasia/metabolism; Computational Biology/methods; Data Mining/methods; Gene Ontology; Genes/genetics; Genes/physiology; Genetic Predisposition to Disease/genetics; Humans; Infant, Newborn; Metabolic Networks and Pathways/genetics; Natural Language Processing; Signal Transduction/genetics
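
A toy sketch of the network step: build a gene-interaction graph and treat the highest-degree nodes as hubs; the edge list is invented, standing in for the study's 216-gene network:

```python
# Build a gene-interaction graph and call the highest-degree nodes hubs.
# The edge list is illustrative; the study built its network from curated
# interaction data for the 216 identified genes.
import networkx as nx

edges = [("INS", "IGF1"), ("INS", "AKT1"), ("INS", "PIK3CA"),
         ("AKT1", "PIK3CA"), ("AKT1", "MTOR"), ("TNF", "IL6"),
         ("IL6", "INS"), ("VEGFA", "AKT1")]

g = nx.Graph(edges)
hubs = sorted(g.degree, key=lambda kv: kv[1], reverse=True)[:3]
print(hubs)  # e.g. [('INS', 4), ('AKT1', 4), ('PIK3CA', 2)]
```
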
16.
BMC Bioinformatics ; 20(1): 723, 2019 Dec 17.
Article in English | MEDLINE | ID: mdl-31847804

ABSTRACT

BACKGROUND: Predicting protein function and structure from sequence is an important challenge for computational biology. For 26 years, most state-of-the-art approaches have combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g., for proteins from the Dark Proteome. Both of these problems are addressed by the new methodology introduced here. RESULTS: We introduce a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound proteins were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information, and for some proteins even beat the best. Thus, SeqVec embeddings prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings in 0.03 s on average. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, e.g., microbiome or metaproteome analysis. CONCLUSION: Transfer learning succeeded in extracting information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available at the level of a single sequence.


Subjects
Machine Learning; Amino Acid Sequence; Computational Biology/methods; Databases, Nucleic Acid; Databases, Protein; Natural Language Processing; Neural Networks, Computer; Proteins/chemistry; Proteomics/methods; Sequence Analysis
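
A hedged sketch of how per-residue embeddings become per-protein features by mean-pooling; `embed_residues` is a random-placeholder stand-in for an ELMo-style embedder such as SeqVec, not its actual API:

```python
import numpy as np

def embed_residues(sequence: str, dim: int = 1024) -> np.ndarray:
    """Placeholder embedder: returns one vector per residue, shape (L, dim).
    A real system would call an ELMo-style model here."""
    rng = np.random.default_rng(abs(hash(sequence)) % (2**32))
    return rng.standard_normal((len(sequence), dim))

def protein_vector(sequence: str) -> np.ndarray:
    """Mean-pool residue embeddings into a per-protein representation."""
    return embed_residues(sequence).mean(axis=0)

vec = protein_vector("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(vec.shape)  # (1024,) -> input for localization / membrane classifiers
```
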
17.
Nat Protoc ; 14(12): 3426-3444, 2019 12.
Article in English | MEDLINE | ID: mdl-31748751

ABSTRACT

Phenotypes are the foundation for clinical and genetic studies of disease risk and outcomes. The growth of biobanks linked to electronic medical record (EMR) data has both facilitated and increased the demand for efficient, accurate, and robust approaches for phenotyping millions of patients. Challenges to phenotyping with EMR data include variation in the accuracy of codes, as well as the high level of manual input required to identify features for the algorithm and to obtain gold standard labels. To address these challenges, we developed PheCAP, a high-throughput semi-supervised phenotyping pipeline. PheCAP begins with data from the EMR, including structured data and information extracted from the narrative notes using natural language processing (NLP). The standardized steps integrate automated procedures, which reduce the level of manual input, and machine learning approaches for algorithm training. PheCAP itself can be executed in 1-2 d if all data are available; however, the timing is largely dependent on the chart review stage, which typically requires at least 2 weeks. The final products of PheCAP include a phenotype algorithm, the probability of the phenotype for all patients, and a phenotype classification (yes or no).


Subjects
Data Analysis; Electronic Health Records/statistics & numerical data; High-Throughput Screening Assays/methods; Algorithms; Data Interpretation, Statistical; Humans; Machine Learning; Natural Language Processing; Phenotype
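
A hedged sketch of the final PheCAP-style step: a supervised model trained on labeled charts combines structured features with NLP-derived mention counts, outputs a phenotype probability for every patient, and thresholds it into a yes/no classification. Features, data, and the 0.5 cutoff are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: log(1 + billing-code count), log(1 + NLP mention count)
X_labeled = np.log1p([[12, 8], [0, 0], [5, 3], [1, 0], [9, 11], [0, 1]])
y_labeled = [1, 0, 1, 0, 1, 0]  # gold labels from chart review

clf = LogisticRegression().fit(X_labeled, y_labeled)

X_all = np.log1p([[7, 4], [0, 2]])     # the full patient cohort
prob = clf.predict_proba(X_all)[:, 1]  # phenotype probability per patient
label = prob >= 0.5                    # yes/no classification
print(prob.round(3), label)
```
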
18.
J Am Dent Assoc ; 150(11): 933-939.e2, 2019 11.
Article in English | MEDLINE | ID: mdl-31668172

ABSTRACT

BACKGROUND: A significant amount of clinical information captured as free-text narratives could be better used for several applications, such as clinical decision support, ontology development, evidence-based practice, and research. The Human Phenotype Ontology (HPO) is specifically used for semantic comparisons for diagnostic purposes. All of these functions require quality coverage of the domain of interest. The authors used natural language processing to capture craniofacial and oral phenotype signatures from electronic health records and then used these signatures to evaluate the coverage of existing oral phenotype ontologies. METHODS: The authors applied a text-processing pipeline based on the clinical Text Analysis and Knowledge Extraction System to annotate the clinical notes with Unified Medical Language System codes. The authors extracted the disease and disorder phenotype terms, which were then compared with HPO terms and their synonyms. RESULTS: The authors retrieved 2,153 deidentified clinical notes from 558 patients. In all, 2,416 unique disease or disorder phenotype terms were extracted, including 210 craniofacial or oral phenotype terms. Twenty-six of these phenotypes were not found in the HPO. CONCLUSIONS: The authors demonstrated that natural language processing tools can extract relevant phenotype terms from clinical narratives, which could help identify gaps in existing ontologies and enhance craniofacial and dental phenotyping vocabularies. PRACTICAL IMPLICATIONS: The expansion of terms in the dental, oral, and craniofacial domains in the HPO is particularly important as the dental community moves toward electronic health records.


Subjects
Natural Language Processing; Vocabulary; Electronic Health Records; Humans; Narration; Phenotype
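
A small sketch of the coverage check: compare normalized extracted phenotype terms against HPO labels and synonyms and report what is missing; the terms below are illustrative:

```python
# Normalize extracted phenotype terms and ontology labels plus synonyms,
# then report extracted terms missing from the ontology. Terms are
# illustrative, not real HPO coverage results.
hpo_terms = {
    "micrognathia": {"small jaw", "hypognathia"},
    "cleft palate": {"palatoschisis"},
}

def missing_from_hpo(extracted: list[str]) -> set[str]:
    covered = {t.lower() for t in hpo_terms}
    for syns in hpo_terms.values():
        covered |= {s.lower() for s in syns}
    return {t.lower() for t in extracted} - covered

found = ["Cleft palate", "small jaw", "ankyloglossia"]
print(missing_from_hpo(found))  # {'ankyloglossia'} -> candidate ontology gap
```
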
19.
Int J Med Inform ; 132: 103971, 2019 12.
Article in English | MEDLINE | ID: mdl-31630063

ABSTRACT

CONTEXT: Adverse events in healthcare are often collated in incident reports which contain unstructured free text. Learning from these events may improve patient safety. Natural language processing (NLP) uses computational techniques to interrogate free text, reducing the human workload associated with its analysis. There is growing interest in applying NLP to patient safety, but the evidence in the field has not been summarised and evaluated to date. OBJECTIVE: To perform a systematic literature review and narrative synthesis to describe and evaluate NLP methods for the classification of incident reports and adverse events in healthcare. METHODS: Data sources included Medline, Embase, The Cochrane Library, CINAHL, MIDIRS, ISI Web of Science, SciELO, Google Scholar, PROSPERO, hand searching of key articles, and OpenGrey. Data items were manually abstracted to a standardised extraction form. RESULTS: Of 428 articles screened for eligibility, 35 met the inclusion criteria of using NLP to perform a classification task on incident reports or to detect adverse events. The majority of studies used free text from incident reporting systems or electronic health records. Models were typically designed to classify by type of incident, type of medication error, or harm severity. A broad range of NLP techniques are demonstrated to perform these classification tasks with favourable performance outcomes. There are methodological challenges in how these results can be interpreted in a broader context. CONCLUSION: NLP can generate meaningful information from unstructured data in the specific domain of the classification of incident reports and adverse events. Understanding what incidents are occurring and why is important in adverse event analysis. If NLP enables these insights to be drawn from larger datasets, it may improve the learning from adverse events in healthcare.


Subjects
Adverse Drug Reaction Reporting Systems/standards; Drug-Related Side Effects and Adverse Reactions/classification; Electronic Health Records/trends; Natural Language Processing; Risk Management/classification; Risk Management/standards; Drug-Related Side Effects and Adverse Reactions/diagnosis; Electronic Health Records/standards; Humans
20.
Int J Med Inform ; 132: 103985, 2019 12.
Article in English | MEDLINE | ID: mdl-31627032

ABSTRACT

OBJECTIVE: Breast cancer is the most common malignant tumor among women. The diagnosis and treatment information of breast cancer patients is abundant across multiple types of clinical records, including clinicopathological data, genotype and phenotype information, treatment information, and prognosis information. However, current studies mainly focus on extracting information from one specific type of record. This study defines a comprehensive information model to represent the whole-course clinical information of patients. Furthermore, deep learning approaches are used to extract the concepts and their attributes from clinical breast cancer documents by fine-tuning pretrained Bidirectional Encoder Representations from Transformers (BERT) language models. MATERIALS AND METHODS: The clinical corpus used in this study was from a 3A cancer hospital in China and consists of the encounter notes, operation records, pathology notes, radiology notes, progress notes, and discharge summaries of 100 breast cancer patients. Our system consists of two components: a named entity recognition (NER) component and a relation recognition component. For each component, we implemented deep learning-based approaches by fine-tuning BERT, which has outperformed other state-of-the-art methods on multiple natural language processing (NLP) tasks. A clinical language model was first pretrained using BERT on a large-scale unlabeled corpus of Chinese clinical text. For NER, the context embeddings pretrained using BERT were used as input features of a Bi-LSTM-CRF (bidirectional long short-term memory conditional random field) model and fine-tuned using the annotated breast cancer notes. Furthermore, we propose an approach to fine-tune BERT for relation extraction, treating it as a classification problem in which the two entities mentioned in the input sentence are replaced with their semantic types. RESULTS: Our best-performing system achieved F1 scores of 93.53% for NER and 96.73% for relation extraction. Additional evaluations showed that the deep learning-based approaches that fine-tuned BERT outperformed the traditional Bi-LSTM-CRF and CRF machine learning algorithms in NER, and the attention-Bi-LSTM and SVM (support vector machine) algorithms in relation recognition. CONCLUSION: In this study, we developed a deep learning approach that fine-tunes BERT to extract breast cancer concepts and their attributes. It demonstrated superior performance compared to traditional machine learning algorithms, supporting its use in broader NER and relation extraction tasks in the medical domain.


Subjects
Algorithms; Breast Neoplasms/diagnosis; Breast Neoplasms/therapy; Deep Learning; Natural Language Processing; Breast Neoplasms/epidemiology; China/epidemiology; Female; Humans; Support Vector Machine
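
A minimal sketch of the relation-extraction preprocessing described above: the two entity mentions are replaced with their semantic types before the sentence is classified; the types and sentence are invented, and the BERT fine-tuning itself is not shown:

```python
def mask_entities(sentence: str, e1: tuple[str, str], e2: tuple[str, str]) -> str:
    """e1/e2 are (mention, semantic_type) pairs; each mention is replaced
    with its semantic type so the classifier sees the entity roles."""
    for mention, sem_type in (e1, e2):
        sentence = sentence.replace(mention, f"[{sem_type}]", 1)
    return sentence

sent = "Invasive ductal carcinoma was found in the left breast."
print(mask_entities(sent,
                    ("Invasive ductal carcinoma", "TUMOR_TYPE"),
                    ("left breast", "BODY_SITE")))
# "[TUMOR_TYPE] was found in the [BODY_SITE]."
```
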