Results 1 - 20 of 39
1.
Comput Inform Nurs ; 42(3): 184-192, 2024 Mar 01.
Article in English | MEDLINE | ID: mdl-37607706

ABSTRACT

Incidence of hospital-acquired pressure injury, a key indicator of nursing quality, is directly proportional to adverse outcomes, increased hospital stays, and economic burdens on patients, caregivers, and society. Thus, predicting hospital-acquired pressure injury is important. Prediction models use structured data more often than unstructured notes, although the latter often contain useful patient information. We hypothesize that unstructured notes, such as nursing notes, can predict hospital-acquired pressure injury. We evaluate the impact of using various natural language processing packages to identify salient patient information from unstructured text. We use named entity recognition to identify keywords, which comprise the feature space of our classifier for hospital-acquired pressure injury prediction. We compare scispaCy and Stanza, two different named entity recognition models, using unstructured notes in Medical Information Mart for Intensive Care III, a publicly available ICU data set. To assess the impact of vocabulary size reduction, we compare the use of all clinical notes with only nursing notes. Our results suggest that named entity recognition extraction using nursing notes can yield accurate models. Moreover, the extracted keywords play a significant role in the prediction of hospital-acquired pressure injury.


Subjects
Natural Language Processing , Pressure Ulcer , Humans , Pressure Ulcer/diagnosis , Critical Care , Hospitals
2.
Comput Biol Med ; 168: 107754, 2024 01.
Article in English | MEDLINE | ID: mdl-38016372

ABSTRACT

Hospital-acquired pressure injury is one of the most harmful events in clinical settings. Patients who do not receive early prevention and treatment can experience a significant financial burden and physical trauma. Several hospital-acquired pressure injury prediction algorithms have been developed to tackle this problem, but these models assume a consensus, gold-standard label (i.e., presence of pressure injury or not) is present for all training data. Existing definitions for identifying hospital-acquired pressure injuries are inconsistent due to the lack of high-quality documentation surrounding pressure injuries. To address this issue, we propose in this paper an ensemble-based algorithm that leverages truth inference methods to resolve label inconsistencies between various case definitions and the level of disagreements in annotations. Application of our method to MIMIC-III, a publicly available intensive care unit dataset, gives empirical results that illustrate the promise of learning a prediction model using truth inference-based labels and observed conflict among annotators.


Subjects
Pressure Ulcer , Humans , Pressure Ulcer/diagnosis , Algorithms , Intensive Care Units , Hospitals
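
The idea of resolving conflicting labels from several case definitions while also tracking how much they disagree can be illustrated with a minimal sketch. Plain majority voting is used here as a simple stand-in; it is not the ensemble-based truth inference algorithm from the abstract, and the stay identifiers and votes are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return (consensus_label, disagreement) for one hospital stay.

    labels: list of 0/1 labels, one per case definition (hypothetical input).
    disagreement: fraction of definitions that differ from the consensus.
    """
    counts = Counter(labels)
    # Ties are broken toward the positive label; this is an arbitrary choice.
    consensus = 1 if counts[1] >= counts[0] else 0
    disagreement = 1 - counts[consensus] / len(labels)
    return consensus, disagreement


# Hypothetical votes from four case definitions for three stays.
stays = {
    "stay_a": [1, 1, 1, 1],
    "stay_b": [1, 0, 1, 0],
    "stay_c": [0, 0, 1, 0],
}
results = {stay: majority_vote(votes) for stay, votes in stays.items()}
```

The disagreement value could serve as an extra feature or a sample weight when training a downstream classifier, which is one plausible way to use "observed conflict among annotators".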
3.
AMIA Jt Summits Transl Sci Proc ; 2023: 582-591, 2023.
Article in English | MEDLINE | ID: mdl-37350881

ABSTRACT

Electronic health records (EHR) data contain rich information about patients' health conditions, including diagnoses, procedures, medications, etc., which have been widely used to facilitate digital medicine. Despite its importance, it is often non-trivial to learn useful representations for patients' visits that support downstream clinical predictions, as each visit contains massive and diverse medical codes. As a result, the complex interactions among medical codes are often not captured, which leads to substandard predictions. To better model these complex relations, we leverage hypergraphs, which go beyond pairwise relations to jointly learn the representations for visits and medical codes. We also propose to use the self-attention mechanism to automatically identify the most relevant medical codes for each visit based on the downstream clinical predictions with better generalization power. Experiments on two EHR datasets show that our proposed method not only yields superior performance, but also provides reasonable insights towards the target tasks.
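
To make the hypergraph idea concrete, the sketch below builds a visit-by-code incidence matrix in which each visit acts as a hyperedge joining all of its medical codes. Treating visits as hyperedges is an assumption made for illustration, as are the visit identifiers and codes; the abstract does not spell out the construction, and no learning or attention is performed here.

```python
def build_incidence(visits):
    """Build a binary visit-by-code incidence matrix.

    visits: dict mapping a visit id to the list of codes recorded in it.
    Returns (visit_ids, codes, matrix), where matrix[i][j] is 1 when
    visit i contains code j.
    """
    codes = sorted({code for recorded in visits.values() for code in recorded})
    column = {code: j for j, code in enumerate(codes)}
    visit_ids = sorted(visits)
    matrix = [[0] * len(codes) for _ in visit_ids]
    for i, visit_id in enumerate(visit_ids):
        for code in visits[visit_id]:
            matrix[i][column[code]] = 1
    return visit_ids, codes, matrix


# Hypothetical visits: each one joins several codes at once,
# which a pairwise graph edge could not express directly.
visits = {
    "v1": ["E11.9", "I10", "metformin"],
    "v2": ["I10", "lisinopril"],
}
visit_ids, codes, matrix = build_incidence(visits)
```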

4.
Article in English | MEDLINE | ID: mdl-37332899

ABSTRACT

Aims: Various cardiovascular risk prediction models have been developed for patients with type 2 diabetes mellitus. Yet few models have been validated externally. We perform a comprehensive validation of existing risk models on a heterogeneous population of patients with type 2 diabetes using secondary analysis of electronic health record data. Methods: Electronic health records of 47,988 patients with type 2 diabetes between 2013 and 2017 were used to validate 16 cardiovascular risk models, including 5 that had not been compared previously, to estimate the 1-year risk of various cardiovascular outcomes. Discrimination and calibration were assessed by the c-statistic and the Hosmer-Lemeshow goodness-of-fit statistic, respectively. Each model was also evaluated based on the missing measurement rate. Sub-analysis was performed to determine the impact of race on discrimination performance. Results: There was limited discrimination (c-statistics ranged from 0.51 to 0.67) across the cardiovascular risk models. Discrimination generally improved when the model was tailored towards the individual outcome. After recalibration of the models, the Hosmer-Lemeshow statistic yielded p-values above 0.05. However, several of the models with the best discrimination relied on measurements that were often imputed (up to 39% missing). Conclusion: No single prediction model achieved the best performance on a full range of cardiovascular endpoints. Moreover, several of the highest-scoring models relied on variables with high missingness frequencies such as HbA1c and cholesterol that necessitated data imputation and may not be as useful in practice. An open-source version of our developed Python package, cvdm, is available for comparisons using other data sources.
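
For readers unfamiliar with the discrimination measure reported above, the c-statistic is the probability that a randomly chosen patient with the outcome receives a higher predicted risk than a randomly chosen patient without it. The sketch below computes it by direct pairwise comparison; it is a generic illustration with made-up scores and does not use the cvdm package or its API.

```python
def c_statistic(y_true, y_score):
    """Concordance statistic for binary outcomes.

    Counts, over all (event, non-event) pairs, how often the event case has
    the higher score; tied scores count as half a concordant pair.
    """
    concordant = 0.0
    pairs = 0
    for label_i, score_i in zip(y_true, y_score):
        if label_i != 1:
            continue
        for label_j, score_j in zip(y_true, y_score):
            if label_j != 0:
                continue
            pairs += 1
            if score_i > score_j:
                concordant += 1.0
            elif score_i == score_j:
                concordant += 0.5
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return concordant / pairs


# Hypothetical outcomes and predicted 1-year risks.
outcomes = [1, 0, 1, 0, 0]
risks = [0.9, 0.3, 0.4, 0.4, 0.1]
c = c_statistic(outcomes, risks)
```

A value of 0.5 corresponds to chance-level ranking, which puts the reported range of 0.51 to 0.67 in context.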

5.
Medicine (Baltimore) ; 102(10): e32859, 2023 Mar 10.
Article in English | MEDLINE | ID: mdl-36897716

ABSTRACT

The objective was to determine the hepatitis C virus (HCV) care cascade among persons who were born during 1945 to 1965 and received outpatient care on or after January 2014 at a large academic healthcare system. Deidentified electronic health record data in an existing research database were analyzed for this study. Laboratory test results for HCV antibody and HCV ribonucleic acid (RNA) indicated seropositivity and confirmatory testing. HCV genotyping was used as a proxy for linkage to care. A direct-acting antiviral (DAA) prescription indicated treatment initiation; an undetectable HCV RNA at least 20 weeks after initiation of antiviral treatment indicated a sustained virologic response. Of the 121,807 patients in the 1945 to 1965 birth cohort who received outpatient care between January 1, 2014 and June 30, 2017, 3399 (3%) patients were screened for HCV; 540 (16%) were seropositive. Among the seropositive, 442 (82%) had detectable HCV RNA, 68 (13%) had undetectable HCV RNA, and 30 (6%) lacked HCV RNA testing. Of the 442 viremic patients, 237 (54%) were linked to care, 65 (15%) initiated DAA treatment, and 32 (7%) achieved sustained virologic response. While only 3% were screened for HCV, the seroprevalence was high in the screened sample. Despite the established safety and efficacy of DAAs, only 15% initiated treatment during the study period. To achieve HCV elimination, improved HCV screening and linkage to HCV care and DAA treatment are needed.


Subjects
Chronic Hepatitis C , Hepatitis C , Humans , Hepacivirus/genetics , Antiviral Agents/therapeutic use , Seroepidemiologic Studies , Chronic Hepatitis C/drug therapy , Hepatitis C/drug therapy , Delivery of Health Care , Sustained Virologic Response , Viral RNA
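
The cascade percentages in this abstract change denominator from step to step (birth cohort, screened, seropositive, viremic), which is easy to misread. The short sketch below recomputes each rounded percentage from the counts given in the abstract; only the step names are introduced here for illustration.

```python
# (step name, numerator, denominator), with counts taken from the abstract.
cascade = [
    ("screened", 3399, 121807),          # of the birth cohort
    ("seropositive", 540, 3399),         # of those screened
    ("detectable_rna", 442, 540),        # of the seropositive
    ("undetectable_rna", 68, 540),
    ("no_rna_test", 30, 540),
    ("linked_to_care", 237, 442),        # of the viremic
    ("initiated_daa", 65, 442),
    ("sustained_response", 32, 442),
]

# Percentages rounded to whole numbers, as reported in the abstract.
percentages = {name: round(100 * num / den) for name, num, den in cascade}
```

Note that the three RNA categories partition the 540 seropositive patients (442 + 68 + 30 = 540).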
6.
JMIR Med Inform ; 11: e40672, 2023 Feb 23.
Article in English | MEDLINE | ID: mdl-36649481

ABSTRACT

BACKGROUND: Patients develop pressure injuries (PIs) in the hospital owing to low mobility, exposure to localized pressure, circulatory conditions, and other predisposing factors. Over 2.5 million Americans develop PIs annually. The Center for Medicare and Medicaid considers hospital-acquired PIs (HAPIs) as the most frequent preventable event, and they are the second most common claim in lawsuits. With the growing use of electronic health records (EHRs) in hospitals, an opportunity exists to build machine learning models to identify and predict HAPI rather than relying on occasional manual assessments by human experts. However, accurate computational models rely on high-quality HAPI data labels. Unfortunately, the different data sources within EHRs can provide conflicting information on HAPI occurrence in the same patient. Furthermore, the existing definitions of HAPI disagree with each other, even within the same patient population. The inconsistent criteria make it impossible to benchmark machine learning methods to predict HAPI. OBJECTIVE: The objective of this project was threefold. We aimed to identify discrepancies in HAPI sources within EHRs, to develop a comprehensive definition for HAPI classification using data from all EHR sources, and to illustrate the importance of an improved HAPI definition. METHODS: We assessed the congruence among HAPI occurrences documented in clinical notes, diagnosis codes, procedure codes, and chart events from the Medical Information Mart for Intensive Care III database. We analyzed the criteria used for the 3 existing HAPI definitions and their adherence to the regulatory guidelines. We proposed the Emory HAPI (EHAPI), which is an improved and more comprehensive HAPI definition. We then evaluated the importance of the labels in training a HAPI classification model using tree-based and sequential neural network classifiers. 
RESULTS: We illustrate the complexity of defining HAPI, with <13% of hospital stays having at least 3 PI indications documented across 4 data sources. Although chart events were the most common indicator, it was the only PI documentation for >49% of the stays. We demonstrate a lack of congruence across existing HAPI definitions and EHAPI, with only 219 stays having a consensus positive label. Our analysis highlights the importance of our improved HAPI definition, with classifiers trained using our labels outperforming others on a small manually labeled set from nurse annotators and a consensus set in which all definitions agreed on the label. CONCLUSIONS: Standardized HAPI definitions are important for accurately assessing HAPI nursing quality metric and determining HAPI incidence for preventive measures. We demonstrate the complexity of defining an occurrence of HAPI, given the conflicting and incomplete EHR data. Our EHAPI definition has favorable properties, making it a suitable candidate for HAPI classification tasks.
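
The congruence analysis described above amounts to asking, for each hospital stay, which of the four data sources document a pressure injury. A minimal sketch of that bookkeeping follows; the source names, stay identifiers, and toy data are hypothetical and do not reproduce the MIMIC-III extraction or the EHAPI definition.

```python
SOURCES = ("notes", "diagnosis_codes", "procedure_codes", "chart_events")


def summarize(stays):
    """Return two fractions over all stays.

    stays: dict mapping a stay id to the set of sources with a PI indication.
    First value: share of stays documented in at least 3 sources.
    Second value: share of stays documented only in chart events.
    """
    total = len(stays)
    at_least_three = sum(1 for found in stays.values() if len(found) >= 3)
    chart_only = sum(1 for found in stays.values() if found == {"chart_events"})
    return at_least_three / total, chart_only / total


# Hypothetical stays with the sources in which a PI was documented.
stays = {
    "s1": {"chart_events"},
    "s2": {"chart_events", "notes"},
    "s3": {"notes", "diagnosis_codes", "chart_events"},
    "s4": {"chart_events"},
}
share_three_plus, share_chart_only = summarize(stays)
```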

7.
Proc ACM Int Conf Inf Knowl Manag ; 2022: 4470-4474, 2022 Oct.
Article in English | MEDLINE | ID: mdl-36382341

ABSTRACT

With the ever-increasing abundance of biomedical articles, improving the accuracy of keyword search results becomes crucial for ensuring reproducible research. However, keyword extraction for biomedical articles is hard due to the existence of obscure keywords and the lack of a comprehensive benchmark. PubMedAKE is an author-assigned keyword extraction dataset that contains the title, abstract, and keywords of over 843,269 articles from the PubMed open access subset database. This dataset, publicly available on Zenodo, is the largest keyword extraction benchmark with sufficient samples to train neural networks. Experimental results using state-of-the-art baseline methods illustrate the need for developing automatic keyword extraction methods for biomedical literature.

8.
Proc Mach Learn Res ; 193: 259-278, 2022 Nov.
Article in English | MEDLINE | ID: mdl-37255863

ABSTRACT

Electronic Health Record modeling is crucial for digital medicine. However, existing models ignore higher-order interactions among medical codes and their causal relations towards downstream clinical predictions. To address such limitations, we propose a novel framework, CACHE, to provide effective and insightful clinical predictions based on hypergraph representation learning and counterfactual and factual reasoning techniques. Experiments on two real EHR datasets show the superior performance of CACHE. Case studies with a domain expert illustrate a preferred capability of CACHE in generating clinically meaningful interpretations towards the correct predictions.

9.
Diabetol Metab Syndr ; 13(1): 146, 2021 Dec 18.
Article in English | MEDLINE | ID: mdl-34922618

ABSTRACT

BACKGROUND: Diabetes and hypertension disparities are pronounced among South Asians. There is regional variation in the prevalence of diabetes and hypertension in the US, but it is unknown whether there is variation among South Asians living in the US. The objective of this study was to compare the burden of diabetes and hypertension between South Asian patients receiving care in the health systems of two US cities. METHODS: Cross-sectional analyses were performed using electronic health records (EHR) for 90,137 South Asians receiving care at New York University Langone in New York City (NYC) and 28,868 South Asians receiving care at Emory University (Atlanta). Diabetes was defined as having 2 + encounters with a diagnosis of diabetes, having a diabetes medication prescribed (excluding Acarbose/Metformin), or having 2 + abnormal A1C levels (≥ 6.5%) and 1 + encounter with a diagnosis of diabetes. Hypertension was defined as having 3 + BP readings of systolic BP ≥ 130 mmHg or diastolic BP ≥ 80 mmHg, 2 + encounters with a diagnosis of hypertension, or having an anti-hypertensive medication prescribed. RESULTS: Among South Asian patients at these two large, private health systems, age-adjusted diabetes burden was 10.7% in NYC compared to 6.7% in Atlanta. Age-adjusted hypertension burden was 20.9% in NYC compared to 24.7% in Atlanta. In Atlanta, 75.6% of those with diabetes had comorbid hypertension compared to 46.2% in NYC. CONCLUSIONS: These findings suggest differences by region and sex in diabetes and hypertension risk. Additionally, these results call for better characterization of race/ethnicity in EHRs to identify ethnic subgroup variation, as well as intervention studies to reduce lifestyle exposures that underlie the elevated risk for type 2 diabetes and hypertension development in South Asians.
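
The diabetes and hypertension definitions in the METHODS are rule-based and can be written down directly. The sketch below encodes one reading of those rules; the function and parameter names are hypothetical, a blood pressure reading is treated as elevated when either threshold is met, and the caller is assumed to have already excluded acarbose and metformin from the diabetes medication flag.

```python
def has_diabetes(dx_encounters, on_diabetes_med, a1c_values):
    """Apply the abstract's diabetes rule to one patient.

    dx_encounters: number of encounters with a diabetes diagnosis.
    on_diabetes_med: True if a qualifying diabetes medication was prescribed
        (acarbose and metformin are assumed to be excluded upstream).
    a1c_values: list of A1C results in percent.
    """
    abnormal_a1c = sum(1 for value in a1c_values if value >= 6.5)
    return (
        dx_encounters >= 2
        or on_diabetes_med
        or (abnormal_a1c >= 2 and dx_encounters >= 1)
    )


def has_hypertension(bp_readings, dx_encounters, on_bp_med):
    """Apply the abstract's hypertension rule to one patient.

    bp_readings: list of (systolic, diastolic) pairs in mmHg.
    """
    elevated = sum(1 for sbp, dbp in bp_readings if sbp >= 130 or dbp >= 80)
    return elevated >= 3 or dx_encounters >= 2 or on_bp_med


# Hypothetical patient: one diagnosis encounter plus two abnormal A1C values.
diabetic = has_diabetes(1, False, [6.4, 6.6, 7.1])
# Hypothetical patient: three of four readings meet a threshold.
hypertensive = has_hypertension([(128, 82), (131, 70), (120, 75), (135, 85)], 0, False)
```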

10.
Adv Databases Inf Syst ; 1450: 50-60, 2021 Aug.
Article in English | MEDLINE | ID: mdl-34604867

ABSTRACT

Sequential pattern mining can be used to extract meaningful sequences from electronic health records. However, conventional sequential pattern mining algorithms that discover all frequent sequential patterns can incur a high computational cost and be susceptible to noise in the observations. Approximate sequential pattern mining techniques have been introduced to address these shortcomings, yet existing approximate methods fail to reflect the true frequent sequential patterns or only target single-item event sequences. Multi-item event sequences are prominent in healthcare, as a patient can have multiple interventions for a single visit. To alleviate these issues, we propose GASP, a graph-based approximate sequential pattern mining method that discovers frequent patterns for multi-item event sequences. Our approach compresses the sequential information into a concise graph structure, which has computational benefits. The empirical results on two healthcare datasets suggest that GASP outperforms existing approximate models by improving recoverability and extracts better predictive patterns.
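
To clarify what a frequent pattern means for multi-item event sequences, the sketch below checks whether a patient's sequence of visits contains a pattern and computes the pattern's support. This is the exact, conventional notion of support, not GASP's graph-based approximation; the event names and data are hypothetical.

```python
def contains(sequence, pattern):
    """Check whether a sequence of itemsets contains a pattern of itemsets.

    Each pattern itemset must be a subset of some visit's itemset, and the
    matched visits must appear in the same order as the pattern.
    """
    position = 0
    for itemset in pattern:
        # Greedily advance to the earliest visit that covers this itemset.
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sequences) / len(sequences)


# Hypothetical patients; each visit is a set of events recorded together.
patients = [
    [{"a", "b"}, {"c"}],
    [{"a"}, {"b", "c"}],
    [{"c"}, {"a", "b"}],
]
pattern = [{"a"}, {"c"}]
pattern_support = support(patients, pattern)
```

The third patient has both events but in the opposite order, so only two of the three sequences support the pattern.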

11.
Adv Databases Inf Syst ; 12843: 260-274, 2021 Aug.
Article in English | MEDLINE | ID: mdl-34608464

ABSTRACT

Schema matching aims to identify the correspondences among attributes of database schemas. It is frequently considered the most challenging and decisive stage in many contemporary web semantics and database systems. Low-quality algorithmic matchers fail to provide improvement, while manual annotation consumes extensive human effort. Further complications arise from data privacy in certain domains such as healthcare, where only schema-level matching should be used to prevent data leakage. For this problem, we propose SMAT, a new deep learning model based on state-of-the-art natural language processing techniques to obtain semantic mappings between source and target schemas using only the attribute name and description. SMAT avoids directly encoding domain knowledge about the source and target systems, which allows it to be more easily deployed across different sites. We also introduce a new benchmark dataset, OMAP, based on real-world schema-level mappings from the healthcare domain. Our extensive evaluation of various benchmark datasets demonstrates the potential of SMAT to help automate schema-level matching tasks.

12.
Proc Int World Wide Web Conf ; 2021: 171-182, 2021 Apr.
Article in English | MEDLINE | ID: mdl-34467367

ABSTRACT

Modern healthcare systems, knitted together by a web of entities (e.g., hospitals, clinics, pharmacy companies), are collecting a huge volume of healthcare data from a large number of individuals with various medical procedures, medications, diagnoses, and lab tests. To extract meaningful medical concepts (i.e., phenotypes) from such higher-arity relational healthcare data, tensor factorization has been proven to be an effective approach and has received increasing research attention, due to its intrinsic capability to represent high-dimensional data. Recently, federated learning has offered a privacy-preserving paradigm for collaborative learning among different entities, which seemingly provides an ideal opportunity to further enhance tensor factorization-based collaborative phenotyping to handle sensitive personal health data. However, existing attempts at federated tensor factorization come with various limitations, including restrictions to the classic tensor factorization, high communication cost, and reduced accuracy. We propose a communication-efficient federated generalized tensor factorization, which is flexible enough to choose from a variety of losses to best suit different types of data in practice. We design a three-level communication reduction strategy tailored to the generalized tensor factorization, which is able to reduce the uplink communication cost by up to 99.90%. In addition, we theoretically prove that our algorithm does not compromise convergence speed despite the aggressive communication compression. Extensive experiments on two real-world electronic health record datasets demonstrate the efficiency improvements in terms of computation and communication cost.

13.
Appl Clin Inform ; 12(4): 897-909, 2021 08.
Article in English | MEDLINE | ID: mdl-34587637

ABSTRACT

OBJECTIVES: This study aimed to compare the concordance of pressure injury (PI) site, stage, and count documented in electronic health records (EHRs); explore if PI count during each patient hospitalization is consistent based on PI site or stage count in the diagnosis or chart event records; and examine if discrepancies in PI count were associated with patient characteristics. METHODS: Hospitalization records with the International Classification of Diseases ninth edition (ICD-9) codes, chart events from two systems (CareVue, MetaVision), and clinical notes on PI were extracted from the Medical Information Mart for Intensive Care (MIMIC)-III database. PI site and stage counts from individual hospitalization were computed. Hospitalizations with the same or different counts of site and stage according to ICD-9 codes (site and stage), CareVue (site and stage), or MetaVision (stage) charts were defined as consistent or discrepant reporting. Chi-squared, independent t-, and Kruskal-Wallis tests were used to examine whether the count discrepancy was associated with patient characteristics. ICD-9 codes and charts were also compared for people with one site or stage. RESULTS: A total of 31,918 hospitalizations had PI data. Within hospitalizations with ICD-9-coded sites and stages, 55.9% reported different counts. Within hospitalizations with CareVue charts on PI, 99.3% reported the same count. For hospitalizations with stages based on ICD-9 codes or MetaVision chart data, only 42.9% reported the same count. Discrepancies in counts were consistently and significantly associated with variables including PI recording in clinical notes, dead/hospice at discharge, more caregivers, longer hospitalization or intensive care unit stays, and more days to first transfer. Discrepancies between ICD-9 code and chart values on the site and stage were also reported. 
CONCLUSION: Patient characteristics associated with PI count discrepancies identified patients at risk of having discrepant PI counts or worse outcomes. PI documentation quality could be improved with better communication, care continuity, and integrity. Clinical research using EHRs should adopt systematic data quality analysis to inform limitations.


Subjects
Hospitalization , International Classification of Diseases , Pressure Ulcer , Humans , Critical Care , Factual Databases , Patient Discharge
14.
AMIA Jt Summits Transl Sci Proc ; 2021: 384-393, 2021.
Article in English | MEDLINE | ID: mdl-34457153

ABSTRACT

From electronic health records (EHRs), the relationship between patients' conditions, treatments, and outcomes can be discovered and used in various healthcare research tasks such as risk prediction. In practice, EHRs can be stored in one or more data warehouses, and mining from distributed data sources becomes challenging. Another challenge arises from privacy laws because patient data cannot be used without some patient privacy guarantees. Thus, in this paper, we propose a privacy-preserving framework using sequential pattern mining in distributed data sources. Our framework extracts patterns from each source and shares patterns with other sources to discover discriminative and representative patterns that can be used for risk prediction while preserving privacy. We demonstrate our framework using a case study of predicting Cardiovascular Disease in patients with type 2 diabetes and show the effectiveness of our framework with several sources and by applying differential privacy mechanisms.


Subjects
Cardiovascular Diseases , Type 2 Diabetes Mellitus , Cardiovascular Diseases/diagnosis , Confidentiality , Electronic Health Records , Humans , Privacy
15.
Diabetes Technol Ther ; 23(8): 555-564, 2021 08.
Article in English | MEDLINE | ID: mdl-33720761

ABSTRACT

Aims: To identify profiles of type 2 diabetes from continuous glucose monitoring (CGM) data using ambulatory glucose profile (AGP) indicators and examine the association with prevalent complications. Methods: Two weeks of CGM data, collected between 2015 and 2019, from 5901 adult type 2 diabetes patients were retrieved from a clinical database in Chennai, India. Non-negative matrix factorization was used to identify profiles as per AGP indicators. The association of profiles with existing complications was examined using multinomial and logistic regressions adjusted for glycated hemoglobin (HbA1c; %), sex, age at onset, and duration of diabetes. Results: Three profiles of glycemic variability (GV) were identified based on CGM data: Profile 1 ["TIR Profile"] (n = 2271), Profile 2 ["Hypo"] (n = 1471), and Profile 3 ["Hyper"] (n = 2159). Compared with the time in range (TIR) profile, those belonging to Hyper had higher mean fasting plasma glucose (202.9 vs. 167.1, mg/dL), 2-h postprandial plasma glucose (302.1 vs. 255.6, mg/dL), and HbA1c (9.7 vs. 8.6; %). Both "Hypo profile" and "Hyper profile" had higher odds of nonproliferative diabetic retinopathy ("Hypo": 1.44, 1.20-1.73; "Hyper": 1.33, 1.11-1.58), macroalbuminuria ("Hypo": 1.58, 1.25-1.98; "Hyper": 1.37, 1.10-1.71), and diabetic kidney disease (DKD; "Hypo": 1.65, 1.18-2.31; "Hyper": 1.88, 1.37-2.58), compared with "TIR profile." Those in "Hypo profile" (vs. "TIR profile") had higher odds of proliferative diabetic retinopathy (PDR; 2.84, 1.65-2.88). Conclusions: We have identified three profiles of GV from CGM data. While both "Hypo profile" and "Hyper profile" had higher odds of prevalent DKD compared with "TIR profile," "Hypo profile" had higher odds of PDR. Our study emphasizes the clinical importance of recognizing and treating hypoglycemia (which is often unrecognized without CGM) in patients with type 2 diabetes mellitus.


Subjects
Blood Glucose , Type 2 Diabetes Mellitus , Adult , Blood Glucose Self-Monitoring , Type 2 Diabetes Mellitus/drug therapy , Glucose , Glycated Hemoglobin/analysis , Humans , India
16.
Proc Conf ; 2021: 155-161, 2021 Jun.
Article in English | MEDLINE | ID: mdl-35748887

ABSTRACT

To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local context to predict the label for each token and perform much better than the unsupervised counterparts. Unfortunately, this method fails for short documents where the context is unclear. Moreover, keyphrases, which are usually the gist of a document, need to be the central theme. We propose a new extraction model that introduces a centrality constraint to enrich the word representation of a bidirectional long short-term memory network. Performance evaluation on two publicly available datasets demonstrates that our model outperforms existing state-of-the-art approaches. Our model is publicly available at https://github.com/ZHgero/keyphrases_centrality.git.

17.
Proc ACM Int Conf Inf Knowl Manag ; 2021: 3313-3317, 2021 Oct.
Article in English | MEDLINE | ID: mdl-36380815

ABSTRACT

Representation learning on static graph-structured data has shown a significant impact on many real-world applications. However, less attention has been paid to the evolving nature of temporal networks, in which the edges are often changing over time. The embeddings of such temporal networks should encode both graph-structured information and the temporally evolving pattern. Existing approaches in learning temporally evolving network representations fail to capture the temporal interdependence. In this paper, we propose Toffee, a novel approach for temporal network representation learning based on tensor decomposition. Our method exploits the tensor-tensor product operator to encode the cross-time information, so that the periodic changes in the evolving networks can be captured. Experimental results demonstrate that Toffee outperforms existing methods on multiple real-world temporal networks in generating effective embeddings for the link prediction tasks.

18.
Proc IEEE Int Conf Data Min ; 2021: 1216-1221, 2021 Dec.
Article in English | MEDLINE | ID: mdl-36382085

ABSTRACT

Tensor factorization has proven to be an efficient unsupervised learning approach for health data analysis, especially for computational phenotyping, where the high-dimensional Electronic Health Records (EHRs) with patients' history of medical procedures, medications, diagnoses, lab tests, etc., are converted to meaningful and interpretable medical concepts. Federated tensor factorization distributes the tensor computation to multiple workers under the coordination of a central server, which enables jointly learning the phenotypes across multiple hospitals while preserving the privacy of the patient information. However, existing federated tensor factorization algorithms encounter the single-point-failure issue with the involvement of the central server, which is not only easily exposed to external attacks, but also limits the number of clients sharing information with the server under restricted uplink bandwidth. In this paper, we propose CiderTF, a communication-efficient decentralized generalized tensor factorization, which reduces the uplink communication cost by leveraging a four-level communication reduction strategy designed for a generalized tensor factorization, which has the flexibility of modeling different tensor distributions with multiple kinds of loss functions. Experiments on two real-world EHR datasets demonstrate that CiderTF achieves comparable convergence with communication reduction of up to 99.99%.

19.
Article in English | MEDLINE | ID: mdl-35775029

ABSTRACT

To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local context to predict the label for each token and perform much better than the unsupervised counterparts. However, existing supervised datasets have limited annotated examples to train better deep learning models. In contrast, many domains have large amounts of unannotated data that can be leveraged to improve model performance in keyphrase extraction. We introduce a self-learning based model that incorporates uncertainty estimates to select instances from large-scale unlabeled data to augment the small labeled training set. Performance evaluation on a publicly available biomedical dataset demonstrates that our method improves performance of keyphrase extraction over state-of-the-art models.

20.
ACM CHIL 2021 (2021) ; 2021: 146-153, 2021 Apr.
Article in English | MEDLINE | ID: mdl-35194593

ABSTRACT

Generating a novel and optimized molecule with desired chemical properties is an essential part of the drug discovery process. Failure to meet one of the required properties can frequently lead to failure in a clinical test which is costly. In addition, optimizing these multiple properties is a challenging task because the optimization of one property is prone to changing other properties. In this paper, we pose this multi-property optimization problem as a sequence translation process and propose a new optimized molecule generator model based on the Transformer with two constraint networks: property prediction and similarity prediction. We further improve the model by incorporating score predictions from these constraint networks in a modified beam search algorithm. The experiments demonstrate that our proposed model, Controlled Molecule Generator (CMG), outperforms state-of-the-art models by a significant margin for optimizing multiple properties simultaneously.
