Results 1 - 3 of 3

1.
JMIR Form Res; 7: e50998, 2023 Nov 15.
Article in English | MEDLINE | ID: mdl-37966892

ABSTRACT

BACKGROUND: Schizophrenia is a serious mental illness. With increased research funding, it has become a key area of focus in the medical field. Searching for associations between diseases and genes is an effective approach to studying complex diseases; it may enhance research on schizophrenia pathology and lead to the identification of new treatment targets.

OBJECTIVE: The aim of this study was to identify potential schizophrenia risk genes by employing machine learning methods to extract the topological characteristics of proteins and their functional roles in a protein-protein interaction (PPI)-keywords (PPIK) network and to better understand the complex disease-causing properties involved. To this end, a PPIK-based metagraph representation approach is proposed.

METHODS: To enrich the PPI network, we integrated keywords describing protein properties and constructed a PPIK network. We extracted features describing the topology of this network through metagraphs, transformed the metagraphs into vectors, and represented each protein with a series of vectors. We then trained and optimized our model using random forest (RF), extreme gradient boosting, light gradient boosting machine, and logistic regression models.

RESULTS: Comprehensive experiments demonstrated the good performance of the proposed method, with an area under the receiver operating characteristic curve (AUC) between 0.72 and 0.76. Our model also outperformed baseline methods for overall disease protein prediction, including the random walk with restart, average commute time, and Katz models. Compared with the PPI network used by the baseline models, adding keywords to the PPIK network improved the AUC by 0.08 on average, and the metagraph-based method improved the AUC by 0.30 on average over the baseline methods. Based on the overall performance of the four models, RF was selected as the best model for disease protein prediction, with precision, recall, F1-score, and AUC values of 0.76, 0.73, 0.72, and 0.76, respectively. We mapped the predicted proteins to their encoding gene IDs and identified the top 20 genes as the most probable schizophrenia risk genes, including EYA3, CNTN4, HSPA8, LRRK2, and AFP. We further validated these outcomes against the metagraph features and evidence from the literature, performed a feature analysis, and used the literature evidence to interpret the correlation between the predicted genes and the disease.

CONCLUSIONS: The metagraph representation based on the PPIK network framework was found to be effective for identifying potential schizophrenia risk genes. The results are reliable, as supporting evidence can be found in the literature. Our approach can provide more biological insight into the pathogenesis of schizophrenia.
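The abstract does not include code; as a minimal sketch of the final classification step it describes, assuming each protein has already been reduced to a fixed-length vector of metagraph-derived features (synthetic arrays stand in for those features and for the known disease-protein labels), a random forest can be trained and scored roughly as follows:

```python
# Minimal sketch of the disease-protein classification step described above.
# Synthetic data stands in for the metagraph feature vectors and labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n_proteins, n_features = 1000, 64                 # placeholder sizes
X = rng.normal(size=(n_proteins, n_features))     # metagraph feature vectors (stand-in)
y = rng.integers(0, 2, size=n_proteins)           # 1 = known disease protein, 0 = other

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
pred = clf.predict(X_test)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, pred, average="binary"
)
print(f"AUC={roc_auc_score(y_test, proba):.2f} "
      f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")

# Candidate proteins outside the known-disease set can then be scored with
# clf.predict_proba and the top-ranked ones mapped to their encoding gene IDs.
```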

2.
Quant Imaging Med Surg; 13(4): 2183-2196, 2023 Apr 01.
Article in English | MEDLINE | ID: mdl-37064382

ABSTRACT

Background: When users search for knowledge in a given field on the internet, an intelligent question-answering system based on frequently asked questions (FAQs) can provide concise, accurate answers that have been manually verified. However, there are few question-answering systems for chronic diseases such as rheumatoid arthritis, and the technology for constructing such systems is not yet sufficiently mature.

Methods: We embedded the classification information of each question into a sentence vector based on the bidirectional encoder representations from transformers (BERT) language model. First, we calculated similarity using edit distance to recall a candidate set of similar questions. Then, we used the pretrained BERT model to map the sentence information to its embedding representation. Finally, each dimensional feature of the sentence was obtained by passing the sentence vector through a multihead attention layer and a fully connected feedforward layer. The stitched and fused features were used for the semantic similarity calculation.

Results: Our improved model achieved a top-1 precision of 0.551, a top-3 precision of 0.767, and a top-5 precision of 0.813 on 176 test questions. In an analysis of the model's practical performance, we found that it captured users' actual intentions well.

Conclusions: Our deep learning model takes into account the background and classification of questions and combines the efficiency of deep learning with the comprehensibility of semantics. It enables an intelligent question-answering system to better understand the deeper meaning of a user's question and to provide answers that are more relevant to the original query.
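As a minimal sketch of the two-stage matching flow described above (edit-distance recall, then semantic reranking), under the assumption that a small FAQ dictionary is available, the outline below uses a hand-rolled Levenshtein distance for recall and a TF-IDF vectorizer purely as a stand-in for the BERT sentence encoder and attention layers reported in the paper; the FAQ entries are hypothetical.

```python
# Minimal sketch of the two-stage FAQ matching flow described above:
# (1) recall candidate questions by edit distance, (2) rerank by vector
# similarity. TfidfVectorizer is only a stand-in for the BERT-based
# sentence encoder + attention layers used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (one-row version)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

faq = {  # hypothetical FAQ entries with manually verified answers
    "What are the early symptoms of rheumatoid arthritis?": "Joint pain ...",
    "How is rheumatoid arthritis diagnosed?": "Blood tests and imaging ...",
    "Can rheumatoid arthritis be cured?": "There is no cure, but ...",
}

def answer(query: str, recall_k: int = 2) -> str:
    # Stage 1: recall the k closest FAQ questions by edit distance.
    candidates = sorted(faq, key=lambda q: edit_distance(query, q))[:recall_k]
    # Stage 2: rerank candidates by cosine similarity of sentence vectors.
    vec = TfidfVectorizer().fit(list(faq) + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(candidates))[0]
    return faq[candidates[int(sims.argmax())]]

print(answer("early signs of rheumatoid arthritis"))
```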

3.
JMIR Med Inform; 10(4): e35606, 2022 Apr 21.
Article in English | MEDLINE | ID: mdl-35451969

ABSTRACT

BACKGROUND: With the prevalence of online consultation, a large number of patient-doctor dialogues have accumulated. As authentic language data, they are of significant value to research on intelligent question answering and automated triage in recent natural language processing studies.

OBJECTIVE: The purpose of this study was to design a front-end task module for the online inquiry component of intelligent medical services. By studying the automatic labeling of real doctor-patient dialogue text from the internet, we explored a method for identifying negative and positive entities in dialogues with higher accuracy.

METHODS: The data set used in this study came from the Spring Rain Doctor online consultation service and was downloaded from the official data set of Alibaba Tianchi Lab. We proposed a composite abutting joint model that automatically classifies clinical finding entities into 4 attributes: positive, negative, other, and empty. We adapted a downstream architecture combining Chinese Robustly Optimized Bidirectional Encoder Representations from Transformers Pretraining Approach (RoBERTa) with whole word masking (WWM), extended (RoBERTa-WWM-ext), with a text convolutional neural network (CNN). RoBERTa-WWM-ext expresses sentence semantics as a text vector, and the CNN then extracts the local features of the sentence; this constitutes our new fusion model. To verify its knowledge learning ability, we applied Enhanced Representation through Knowledge Integration (ERNIE), the original Bidirectional Encoder Representations from Transformers (BERT), and Chinese BERT with WWM to the same task and compared the results. Precision, recall, and macro-F1 were used to evaluate the performance of the methods.

RESULTS: We found that the ERNIE model, which was trained on a large Chinese corpus, had a total score (macro-F1) of 65.78290014, while BERT and BERT-WWM had scores of 53.18247117 and 69.2795315, respectively. Our composite abutting joint model (RoBERTa-WWM-ext + CNN) had a macro-F1 value of 70.55936311, showing that it outperformed the other models on this task.

CONCLUSIONS: The accuracy of the original model can be greatly improved by giving priority to WWM, that is, replacing character-level masking with whole-word units when classifying and labeling medical entities. Better results can be obtained by effectively optimizing the model's downstream tasks and by integrating multiple models later on. The study findings contribute to translating online consultation information into machine-readable information.
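The paper itself does not publish its code here; as a minimal sketch of the fusion idea it describes, the snippet below shows a TextCNN head that takes token-level hidden states and classifies an entity mention into the 4 attributes. A random tensor stands in for the RoBERTa-WWM-ext encoder output so the sketch stays self-contained; all layer sizes are illustrative assumptions.

```python
# Minimal sketch of the fusion idea described above: a TextCNN head over the
# encoder's token-level hidden states, classifying into 4 attributes
# (positive, negative, other, empty). A random tensor stands in for the
# RoBERTa-WWM-ext output; in practice it would come from the pretrained encoder.
import torch
import torch.nn as nn

class TextCNNHead(nn.Module):
    def __init__(self, hidden_size=768, num_filters=128,
                 kernel_sizes=(2, 3, 4), num_classes=4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden_size, num_filters, k) for k in kernel_sizes
        )
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, hidden_states):            # (batch, seq_len, hidden)
        x = hidden_states.transpose(1, 2)         # (batch, hidden, seq_len)
        # Convolve over the token dimension, then global max-pool each filter map.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)       # concatenated local features
        return self.classifier(features)          # logits over the 4 attributes

# Stand-in for RoBERTa-WWM-ext last_hidden_state: 8 sentences, 64 tokens each.
hidden = torch.randn(8, 64, 768)
logits = TextCNNHead()(hidden)
print(logits.shape)  # torch.Size([8, 4])
```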
