Search | Biblioteca Virtual en Salud (Virtual Health Library)

1.

Model fusion for predicting unconventional proteins secreted by exosomes using deep learning.

Zhang, Yonglin; Yu, Lezheng; Yang, Ming; Han, Bin; Luo, Jiesi; Jing, Runyu.

Proteomics ; : e2300184, 2024 Apr 21.

Article in English | MEDLINE | ID: mdl-38643383

ABSTRACT

Unconventional secretory proteins (USPs) are vital for cell-to-cell communication and are necessary for proper physiological processes. Unlike classical proteins that follow the conventional secretory pathway via the Golgi apparatus, these proteins are released using unconventional pathways. The primary modes of secretion for USPs are exosomes and ectosomes, which originate from the endoplasmic reticulum. Accurate and rapid identification of exosome-mediated secretory proteins is crucial for gaining valuable insights into the regulation of non-classical protein secretion and intercellular communication, as well as for the advancement of novel therapeutic approaches. Although computational methods based on amino acid sequence prediction exist for predicting unconventional proteins secreted by exosomes (UPSEs), they suffer from significant limitations in terms of algorithmic accuracy. In this study, we propose a novel approach to predict UPSEs by combining multiple deep learning models that incorporate both protein sequences and evolutionary information. Our approach utilizes a convolutional neural network (CNN) to extract protein sequence information, while various densely connected neural networks (DNNs) are employed to capture evolutionary conservation patterns. By combining six distinct deep learning models, we have created a framework that surpasses previous approaches, achieving an accuracy (ACC) of 77.46% and a Matthews correlation coefficient (MCC) of 0.5406 on an independent test dataset.
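
Note: ACC and MCC, as reported above, are standard binary-classification metrics. A minimal sketch of how they are usually computed from confusion-matrix counts follows; the function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    tp, tn, fp, fn = 40, 30, 10, 20
    print(f"ACC = {accuracy(tp, tn, fp, fn):.4f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside ACC for imbalanced test sets.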

2.

Multi-model predictive analysis of RNA solvent accessibility based on modified residual attention mechanism.

Huang, Yuyao; Luo, Jiesi; Jing, Runyu; Li, Menglong.

Brief Bioinform ; 23(6), 2022 11 19.

Article in English | MEDLINE | ID: mdl-36305428

ABSTRACT

Predicting RNA solvent accessibility using only primary sequence data can be regarded as sequence-based prediction work. Currently, studies of sequence-based RNA solvent accessibility prediction are limited by the small number of available datasets and by black-box prediction. To address these issues, we first expanded the set of available RNA structures and then developed a sequence-based model using modified attention layers with different receptive fields to conform to the stem-loop structure of RNA chains. We measured the improvement with an extended dataset and further explored the model's interpretability by analysing the model structures, attention values and hyperparameters. Finally, we found that the developed model regarded pieces of a sequence as templates during the training process. This work will be helpful for researchers who would like to build RNA attribute prediction models using deep learning in the future.

Subject(s)

RNA ; Solvents/chemistry ; RNA/genetics

3.

Prediction of disease-associated functional variants in noncoding regions through a comprehensive analysis by integrating datasets and features.

Lu, Yu; Wu, Yiming; Liu, Yuan; Li, Yizhou; Jing, Runyu; Li, Menglong.

Hum Mutat ; 42(6): 667-684, 2021 06.

Article in English | MEDLINE | ID: mdl-33822436

ABSTRACT

One of the greatest challenges in human genetics is deciphering the link between functional variants in noncoding sequences and the pathophysiology of complex diseases. To address this issue, many methods have been developed to distinguish functional single-nucleotide variants (SNVs) from neutral SNVs in noncoding regions. In this study, we integrated well-established features and commonly used datasets, merged them into large-scale datasets, and built a random forest model on them, which yielded promising performance and outperformed some cutting-edge approaches. Our analyses of feature importance and data coverage also provide clues for future research on improving the prediction of functional noncoding SNVs.

Subject(s)

Algorithms ; Computational Biology/methods ; Disease/genetics ; RNA, Untranslated/genetics ; Computer Simulation ; Databases, Genetic ; Datasets as Topic ; Genetic Predisposition to Disease/genetics ; Genetic Testing/methods ; Humans ; Polymorphism, Single Nucleotide ; Reproducibility of Results ; Sensitivity and Specificity ; Software Design

4.

autoBioSeqpy: A Deep Learning Tool for the Classification of Biological Sequences.

Jing, Runyu; Li, Yizhou; Xue, Li; Liu, Fengjuan; Li, Menglong; Luo, Jiesi.

J Chem Inf Model ; 60(8): 3755-3764, 2020 08 24.

Article in English | MEDLINE | ID: mdl-32786512

ABSTRACT

Deep learning has proven to be a powerful method with applications in various fields including image, language, and biomedical data. Thanks to libraries and toolkits such as TensorFlow, PyTorch, and Keras, researchers can use different deep learning architectures and data sets for rapid modeling. However, the available implementations of neural networks using these toolkits are usually designed for a specific research problem and are difficult to transfer to other work. Here, we present autoBioSeqpy, a tool that uses deep learning for biological sequence classification. The advantage of this tool is its simplicity. Users only need to prepare the input data set and then use a command line interface. Then, autoBioSeqpy automatically executes a series of customizable steps including text reading, parameter initialization, sequence encoding, model loading, training, and evaluation. In addition, the tool provides various ready-to-apply and adaptable model templates to improve the usability of these networks. We introduce the application of autoBioSeqpy on three biological sequence problems: the prediction of type III secreted proteins, protein subcellular localization, and CRISPR/Cas9 sgRNA activity. autoBioSeqpy is freely available with examples at https://github.com/jingry/autoBioSeqpy.

Subject(s)

Deep Learning ; Neural Networks, Computer ; Protein Transport

5.

Domain position prediction based on sequence information by using fuzzy mean operator.

Jing, Runyu; Sun, Jing; Wang, Yuelong; Li, Menglong.

Proteins ; 83(8): 1462-9, 2015 Aug.

Article in English | MEDLINE | ID: mdl-26009844

ABSTRACT

The prediction of protein domain regions is useful for the study of protein structure and function. In this study, we proposed a new method, composed of a fuzzy mean operator and region division, to predict the positions of domains in a target protein based on its sequence. The whole sequence is aligned and scored by using the fuzzy mean operator, and the final determination of domain region positions is realized by region division. A published benchmark is used for comparison with previous studies. In addition, we generate two extra datasets to examine the stability of this method. Finally, the prediction accuracy achieved by our method on the independent test dataset was up to 84.13%. We hope that this method will be useful for related research.

Subject(s)

Protein Structure, Tertiary ; Proteins/chemistry ; Sequence Alignment/methods ; Sequence Analysis, Protein/methods ; Amino Acid Sequence ; Databases, Protein ; Machine Learning

6.

Combination use of protein-protein interaction network topological features improves the predictive scores of deleterious non-synonymous single-nucleotide polymorphisms.

Wu, Yiming; Jing, Runyu; Jiang, Lin; Jiang, Yanping; Kuang, Qifan; Ye, Ling; Yang, Lijun; Li, Yizhou; Li, Menglong.

Amino Acids ; 46(8): 2025-35, 2014 Aug.

Article in English | MEDLINE | ID: mdl-24849655

ABSTRACT

Single-nucleotide polymorphisms (SNPs) are the most frequent form of genetic variation. Non-synonymous SNPs (nsSNPs) occurring in coding regions result in single amino acid substitutions that are associated with human hereditary diseases. Many approaches have been designed for distinguishing deleterious from neutral nsSNPs based on sequence-level information. As a novel contribution of this work, combinations of protein-protein interaction (PPI) network topological features were introduced for predicting disease-related nsSNPs. Based on a dataset compiled from Swiss-Prot, a random forest model was constructed with an average accuracy of 80.43% and an MCC of 0.60 in a rigorous tenfold cross-validation test. For an independent dataset, our model achieved an accuracy of 88.05% and an MCC of 0.67. Compared with previous studies, our approach showed superior prediction ability. Results showed that the incorporated PPI network topological features outperform conventional features. Our further analysis indicated that disease-related proteins are topologically different from other proteins. This study suggested that nsSNPs may share some topological information of proteins and that changes in topological attributes could provide clues for explaining functional shifts due to nsSNPs.

Subject(s)

Amino Acid Substitution/genetics ; Genetic Diseases, Inborn/genetics ; Polymorphism, Single Nucleotide/genetics ; Protein Interaction Maps ; Computational Biology ; Databases, Protein ; Humans ; Proteins/chemistry ; Sequence Analysis, Protein

7.

Identifying diagnostic indicators for type 2 diabetes mellitus from physical examination using interpretable machine learning approach.

Lv, Xiang; Luo, Jiesi; Huang, Wei; Guo, Hui; Bai, Xue; Yan, Pijun; Jiang, Zongzhe; Zhang, Yonglin; Jing, Runyu; Chen, Qi; Li, Menglong.

Front Endocrinol (Lausanne) ; 15: 1376220, 2024.

Article in English | MEDLINE | ID: mdl-38562414

ABSTRACT

Background: Identification of patients at risk for type 2 diabetes mellitus (T2DM) can not only prevent complications and reduce suffering but also ease the health care burden. While routine physical examination can provide useful information for diagnosis, manual exploration of routine physical examination records is not feasible due to the high prevalence of T2DM. Objectives: We aim to build interpretable machine learning models for T2DM diagnosis and uncover important diagnostic indicators from physical examination, including age- and sex-related indicators. Methods: In this study, we present three weighted diversity density (WDD)-based algorithms for T2DM screening that use physical examination indicators. The algorithms are highly transparent and interpretable, and two of them tolerate missing values. Patients: Regarding the dataset, we collected data on 43 physical examination indicators from 11,071 T2DM patients and 126,622 healthy controls at the Affiliated Hospital of Southwest Medical University. After data processing, we used a data matrix containing 16,004 EHRs and 43 clinical indicators for modelling. Results: The indicators were ranked according to their model weights, and the top 25% of indicators were found to be directly or indirectly related to T2DM. We further investigated the clinical characteristics of different age and sex groups, and found that the algorithms can detect relevant indicators specific to these groups. The algorithms performed well in T2DM screening, with the highest area under the receiver operating characteristic curve (AUC) reaching 0.9185. Conclusion: This work utilized the interpretable WDD-based algorithms to construct T2DM diagnostic models based on physical examination indicators. By modeling data grouped by age and sex, we identified several predictive markers related to age and sex, uncovering characteristic differences among various groups of T2DM patients.

Subject(s)

Diabetes Mellitus, Type 2 ; Humans ; Diabetes Mellitus, Type 2/diagnosis ; Diabetes Mellitus, Type 2/epidemiology ; Machine Learning ; Algorithms ; ROC Curve ; Biomarkers

8.

Evaluation and development of deep neural networks for RNA 5-Methyluridine classifications using autoBioSeqpy.

Yu, Lezheng; Zhang, Yonglin; Xue, Li; Liu, Fengjuan; Jing, Runyu; Luo, Jiesi.

Front Microbiol ; 14: 1175925, 2023.

Article in English | MEDLINE | ID: mdl-37275146

ABSTRACT

Post-transcriptional RNA modifications, also known as the epitranscriptome, play crucial roles in the regulation of gene expression during development. Recently, deep learning (DL) has been employed for RNA modification site prediction and has shown promising results. However, due to the lack of relevant studies, it is unclear which DL architecture is best suited for some pyrimidine modifications, such as 5-methyluridine (m5U). To fill this knowledge gap, we first performed a comparative evaluation of various commonly used DL models for epigenetic studies with the help of autoBioSeqpy. We identified optimal architectural variations for m5U site classification, optimizing the layer depth and neuron width. Second, we used this knowledge to develop Deepm5U, an improved convolutional-recurrent neural network that accurately predicts m5U sites from RNA sequences. We successfully applied Deepm5U to transcriptome-wide m5U profiling data across different sequencing technologies and cell types. Third, we showed that techniques for interpreting deep neural networks, including LayerUMAP and DeepSHAP, can provide important insights into the internal operation and behavior of models. Overall, we offered practical guidance for the development, benchmarking, and analysis of deep learning models when designing new algorithms for RNA modifications.

9.

Fast and Efficient Design of Deep Neural Networks for Predicting N⁷-Methylguanosine Sites Using autoBioSeqpy.

Zhang, Yonglin; Yu, Lezheng; Jing, Runyu; Han, Bin; Luo, Jiesi.

ACS Omega ; 8(22): 19728-19740, 2023 Jun 06.

Article in English | MEDLINE | ID: mdl-37305295

ABSTRACT

N7-Methylguanosine (m7G) is a crucial post-transcriptional RNA modification that plays a pivotal role in regulating gene expression. Accurately identifying m7G sites is a fundamental step in understanding the biological functions and regulatory mechanisms associated with this modification. While whole-genome sequencing is the gold standard for RNA modification site detection, it is a time-consuming, expensive, and intricate process. Recently, computational approaches, especially deep learning (DL) techniques, have gained popularity in achieving this objective. Convolutional neural networks and recurrent neural networks are examples of DL algorithms that have emerged as versatile tools for modeling biological sequence data. However, developing an efficient network architecture with superior performance remains a challenging task, requiring significant expertise, time, and effort. To address this, we previously introduced a tool called autoBioSeqpy, which streamlines the design and implementation of DL networks for biological sequence classification. In this study, we utilized autoBioSeqpy to develop, train, evaluate, and fine-tune sequence-level DL models for predicting m7G sites. We provided detailed descriptions of these models, along with a step-by-step guide on their execution. The same methodology can be applied to other systems dealing with similar biological questions. The benchmark data and code utilized in this study can be accessed for free at http://github.com/jingry/autoBioSeqpy/tree/2.0/examples/m7G.

10.

EnsembleDL-ATG: Identifying autophagy proteins by integrating their sequence and evolutionary information using an ensemble deep learning framework.

Yu, Lezheng; Zhang, Yonglin; Xue, Li; Liu, Fengjuan; Jing, Runyu; Luo, Jiesi.

Comput Struct Biotechnol J ; 21: 4836-4848, 2023.

Article in English | MEDLINE | ID: mdl-37854634

ABSTRACT

Autophagy is a primary mechanism for maintaining cellular homeostasis. The synergistic actions of autophagy-related (ATG) proteins strictly regulate the whole autophagic process. Therefore, accurate identification of ATGs is a first and critical step to reveal the molecular mechanism underlying the regulation of autophagy. Current computational methods can predict ATGs from primary protein sequences, but owing to the limitations of algorithms, significant room for improvement still exists. In this research, we propose EnsembleDL-ATG, an ensemble deep learning framework that aggregates multiple deep learning models to predict ATGs from protein sequence and evolutionary information. We first evaluated the performance of individual networks for various feature descriptors to identify the most promising models. Then, we explored all possible combinations of independent models to select the most effective ensemble architecture. The final framework was built from a combination of four different deep learning models. Experimental results show that our proposed method achieves a prediction accuracy of 94.5% and an MCC of 0.890, which are nearly 4% and 0.08 higher than those of ATGPred-FL, respectively. Overall, EnsembleDL-ATG is the first ATG machine learning predictor based on ensemble deep learning. The benchmark data and code utilized in this study can be accessed for free at https://github.com/jingry/autoBioSeqpy/tree/2.0/examples/EnsembleDL-ATG.
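
Note: the step of exploring all possible combinations of independent models can be sketched as an exhaustive search over model subsets. The sketch below assumes, hypothetically, that each model outputs a positive-class probability per sample and that an ensemble simply averages those probabilities; the abstract does not state the fusion rule, and the model names and scores are invented for illustration.

```python
from itertools import combinations


def ensemble_accuracy(prob_lists, labels, threshold=0.5):
    """Accuracy of an ensemble that averages per-model probabilities."""
    correct = 0
    for i, label in enumerate(labels):
        avg = sum(probs[i] for probs in prob_lists) / len(prob_lists)
        predicted = 1 if avg >= threshold else 0
        correct += int(predicted == label)
    return correct / len(labels)


def best_combination(model_probs, labels):
    """Try every non-empty subset of models and keep the most accurate one.

    Ties are resolved in favour of the subset found first, i.e. the
    smaller subset, then alphabetical order of model names.
    """
    names = sorted(model_probs)
    best_combo, best_acc = None, -1.0
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            acc = ensemble_accuracy([model_probs[m] for m in combo], labels)
            if acc > best_acc:
                best_combo, best_acc = combo, acc
    return best_combo, best_acc


if __name__ == "__main__":
    # Hypothetical validation labels and per-model probabilities.
    labels = [1, 0, 1, 0]
    model_probs = {
        "a": [0.9, 0.6, 0.4, 0.1],
        "b": [0.6, 0.2, 0.9, 0.7],
        "c": [0.4, 0.1, 0.6, 0.2],
    }
    print(best_combination(model_probs, labels))
```

With n candidate models this evaluates 2^n - 1 subsets, which is practical only for a small pool. The selection should be made on validation data kept separate from the final test set.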

11.

layerUMAP: A tool for visualizing and understanding deep learning models in biological sequence classification using UMAP.

Jing, Runyu; Xue, Li; Li, Menglong; Yu, Lezheng; Luo, Jiesi.

iScience ; 25(12): 105530, 2022 Dec 22.

Article in English | MEDLINE | ID: mdl-36425757

ABSTRACT

Despite the impressive success of deep learning techniques in various types of classification and prediction tasks, interpreting these models and explaining their predictions are still major challenges. In this article, we present an easy-to-use command line tool capable of visualizing and analyzing alternative representations of biological observations learned by deep learning models. This new tool, namely, layerUMAP, integrates autoBioSeqpy software and the UMAP library to address learned high-level representations. An important advantage of the tool is that it provides an interactive option that enables users to visualize the outputs of hidden layers along the depth of the model. We use two different classes of examples to illustrate the potential power of layerUMAP, and the results demonstrate that layerUMAP can provide insightful visual feedback about models and further guide us to develop better models.

12.

Systematic Analysis and Accurate Identification of DNA N4-Methylcytosine Sites by Deep Learning.

Yu, Lezheng; Zhang, Yonglin; Xue, Li; Liu, Fengjuan; Chen, Qi; Luo, Jiesi; Jing, Runyu.

Front Microbiol ; 13: 843425, 2022.

Article in English | MEDLINE | ID: mdl-35401453

ABSTRACT

DNA N4-methylcytosine (4mC) is a pivotal epigenetic modification that plays an essential role in DNA replication, repair, expression and differentiation. To gain insight into the biological functions of 4mC, it is critical to identify its modification sites in the genome. In recent years, deep learning has become increasingly popular and is frequently employed for 4mC site identification. However, a systematic analysis of how to build predictive models using deep learning techniques is still lacking. In this work, we first summarized all existing deep learning-based predictors and systematically analyzed their models, features and datasets. Then, using a typical standard dataset with three species (A. thaliana, C. elegans, and D. melanogaster), we assessed the contribution of different model architectures, encoding methods and the attention mechanism in establishing a deep learning-based model for 4mC site prediction. After a series of optimizations, a convolutional-recurrent neural network architecture using one-hot encoding and the attention mechanism achieved the best overall prediction performance. Extensive comparison experiments were conducted based on the same dataset. This work will be helpful for researchers who would like to build 4mC prediction models using deep learning in the future.
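
Note: one-hot encoding, as mentioned above, maps each nucleotide to a four-element indicator vector. A minimal sketch follows; the A, C, G, T column order and the all-zero row for ambiguous bases such as N are assumptions made here, not details taken from the paper.

```python
ALPHABET = "ACGT"  # assumed column order


def one_hot(sequence):
    """Encode a DNA sequence as a list of 4-element indicator rows.

    Bases outside ALPHABET (for example N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for base, row in zip("ACGTN", one_hot("ACGTN")):
        print(base, row)
```

The result is a length-by-4 matrix, which is the usual input shape for a convolutional or recurrent layer operating on fixed-length sequence windows.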

13.

The applications of deep learning algorithms on in silico druggable proteins identification.

Yu, Lezheng; Xue, Li; Liu, Fengjuan; Li, Yizhou; Jing, Runyu; Luo, Jiesi.

J Adv Res ; 41: 219-231, 2022 11.

Article in English | MEDLINE | ID: mdl-36328750

ABSTRACT

INTRODUCTION: The top priority in drug development is to identify novel and effective drug targets. In vitro assays are frequently used for this purpose; however, traditional experimental approaches are insufficient for large-scale exploration of novel drug targets, as they are expensive, time-consuming and laborious. Therefore, computational methods have emerged in recent decades as an alternative to aid experimental drug discovery studies by developing sophisticated predictive models to estimate unknown drugs/compounds and their targets. The recent success of deep learning (DL) techniques in machine learning and artificial intelligence has further attracted a great deal of attention in the biomedicine field, including computational drug discovery. OBJECTIVES: This study focuses on the practical applications of deep learning algorithms for predicting druggable proteins and proposes a powerful predictor for fast and accurate identification of potential drug targets. METHODS: Using a gold-standard dataset, we explored several typical protein features and different deep learning algorithms and evaluated their performance in a comprehensive way. We provide an overview of the entire experimental process, including protein features and descriptors, neural network architectures, libraries and toolkits for deep learning modelling, performance evaluation metrics, model interpretation and visualization. RESULTS: Experimental results show that the hybrid model (architecture: CNN-RNN (BiLSTM) + DNN; feature: dictionary encoding + DC_TC_CTD) performed better than the other models on the benchmark dataset. This hybrid model achieved 90.0% accuracy and 0.800 MCC on the test dataset and 84.8% and 0.703 on a nonredundant independent test dataset, which are comparable to those of existing methods. CONCLUSION: We developed the first deep learning-based classifier for fast and accurate identification of potential druggable proteins. We hope that this study will be helpful for future researchers who would like to use deep learning techniques to develop relevant predictive models.

Subject(s)

Deep Learning ; Artificial Intelligence ; Neural Networks, Computer ; Algorithms ; Machine Learning ; Proteins

14.

DeepT3_4: A Hybrid Deep Neural Network Model for the Distinction Between Bacterial Type III and IV Secreted Effectors.

Yu, Lezheng; Liu, Fengjuan; Li, Yizhou; Luo, Jiesi; Jing, Runyu.

Front Microbiol ; 12: 605782, 2021.

Article in English | MEDLINE | ID: mdl-33552038

ABSTRACT

Gram-negative bacteria can deliver secreted proteins (also known as secreted effectors) directly into host cells through the type III secretion system (T3SS), type IV secretion system (T4SS), and type VI secretion system (T6SS) and cause various diseases. These secreted effectors are heavily involved in the interactions between bacteria and host cells, so their identification is crucial for the discovery and development of novel anti-bacterial drugs. It is currently challenging to accurately distinguish type III secreted effectors (T3SEs) and type IV secreted effectors (T4SEs) because neither T3SEs nor T4SEs contain N-terminal signal peptides, and some of these effectors have similar evolutionarily conserved profiles and sequence motifs. To address this challenge, we develop a deep learning (DL) approach called DeepT3_4 to correctly classify T3SEs and T4SEs. We generate an amino-acid character dictionary and sequence-based features extracted from effector proteins and subsequently feed these features into a hybrid model that integrates recurrent neural networks (RNNs) and deep neural networks (DNNs). After training, the hybrid neural network classifies secreted effectors into two different classes with an accuracy, F-value, and recall of over 80.0%. Our approach represents the first DL approach for the classification of T3SEs and T4SEs, providing a promising supplementary tool for further secretome studies.

15.

DeepT3 2.0: improving type III secreted effector predictions by an integrative deep learning framework.

Jing, Runyu; Wen, Tingke; Liao, Chengxiang; Xue, Li; Liu, Fengjuan; Yu, Lezheng; Luo, Jiesi.

NAR Genom Bioinform ; 3(4): lqab086, 2021 Dec.

Article in English | MEDLINE | ID: mdl-34617013

ABSTRACT

Type III secretion systems (T3SSs) are bacterial membrane-embedded nanomachines that allow a number of human, plant and animal pathogens to inject virulence factors directly into the cytoplasm of eukaryotic cells. Export of effectors through T3SSs is critical for motility and virulence of most Gram-negative pathogens. Current computational methods can predict type III secreted effectors (T3SEs) from amino acid sequences, but due to algorithmic constraints, reliable and large-scale prediction of T3SEs in Gram-negative bacteria remains a challenge. Here, we present DeepT3 2.0 (http://advintbioinforlab.com/deept3/), a novel web server that integrates different deep learning models for genome-wide prediction of T3SEs from a bacterium of interest. DeepT3 2.0 combines various deep learning architectures including convolutional, recurrent, convolutional-recurrent and multilayer neural networks to learn N-terminal representations of proteins specifically for T3SE prediction. Outcomes from the different models are processed and integrated for discriminating T3SEs and non-T3SEs. Because it leverages diverse models and an integrative deep learning framework, DeepT3 2.0 outperforms existing methods in validation datasets. In addition, the features learned from networks are analyzed and visualized to explain how models make their predictions. We propose DeepT3 2.0 as an integrated and accurate tool for the discovery of T3SEs.

16.

DeepACP: A Novel Computational Approach for Accurate Identification of Anticancer Peptides by Deep Learning Algorithm.

Yu, Lezheng; Jing, Runyu; Liu, Fengjuan; Luo, Jiesi; Li, Yizhou.

Mol Ther Nucleic Acids ; 22: 862-870, 2020 Dec 04.

Article in English | MEDLINE | ID: mdl-33230481

ABSTRACT

Cancer is one of the most dangerous diseases to human health. The accurate prediction of anticancer peptides (ACPs) would be valuable for the development and design of novel anticancer agents. Current deep neural network models have obtained state-of-the-art prediction accuracy for the ACP classification task. However, based on existing studies, it remains unclear which deep learning architecture achieves the best performance. Thus, in this study, we first present a systematic exploration of three important deep learning architectures: convolutional, recurrent, and convolutional-recurrent networks for distinguishing ACPs from non-ACPs. We find that the recurrent neural network with bidirectional long short-term memory cells is superior to other architectures. By utilizing the proposed model, we implement a sequence-based deep learning tool (DeepACP) to accurately predict the likelihood of a peptide exhibiting anticancer activity. The results indicate that DeepACP outperforms several existing methods and can be used as an effective tool for the prediction of anticancer peptides. Furthermore, we visualize and understand the deep learning model. We hope that our strategy can be extended to identify other types of peptides and may provide more assistance to the development of proteomics and new drugs.

17.

Improving Model Performance on the Stratification of Breast Cancer Patients by Integrating Multiscale Genomic Features.

Hao, Yingyi; He, Li; Zhou, Yifan; Zhao, Yiru; Li, Menglong; Jing, Runyu; Wen, Zhining.

Biomed Res Int ; 2020: 1475368, 2020.

Article in English | MEDLINE | ID: mdl-32908867

ABSTRACT

In clinical cancer research, how to accurately stratify patients based on genomic data is a hot topic. With the development of next-generation sequencing technology, more and more types of genomic features, such as mRNA expression level, can be used to distinguish cancer patients. Previous studies commonly stratified patients by using a single type of genomic feature, which can only reflect one aspect of the cancer. In fact, multiscale genomic features will provide more information and may be helpful for clinical prediction. In addition, most conventional machine learning algorithms use a handcrafted gene set as features to construct models, which is generally selected by a statistical method with an arbitrary cut-off, e.g., p value < 0.05. The genes in the gene set are not necessarily related to the cancer and will make the model unreliable. Therefore, in our study, we thoroughly investigated the performance of different machine learning methods on stratifying breast cancer patients with a single type of genomic feature. Then, we proposed a strategy, which can take into account the degree of correlation between genes and cancer patients, to identify the features from mRNAs and microRNAs, and evaluated the performance of the models with the new combined multiscale genomic features. The results showed that, compared with the models constructed with a single type of feature, the models with the multiscale genomic features generated by our proposed method achieved better performance on stratifying the ER status of breast cancer patients. Moreover, we found by gene set enrichment analysis that the identified multiscale genomic features were closely related to the cancer, indicating that our proposed strategy can well reflect the biological relevance of the genes to breast cancer. In conclusion, modelling with multiscale genomic features closely related to the cancer not only can guarantee the prediction performance of the models but also can effectively provide candidate genes for interpreting the mechanisms of cancer.

Subject(s)

Breast Neoplasms/genetics ; Models, Genetic ; Algorithms ; Carcinoma, Renal Cell/genetics ; Databases, Genetic ; Female ; Gene Expression Regulation, Neoplastic ; Gene Ontology ; Genomics/methods ; Humans ; Kidney Neoplasms/genetics ; Machine Learning ; MicroRNAs/genetics ; RNA, Messenger/genetics ; Receptors, Estrogen/genetics ; Receptors, Estrogen/metabolism ; Thyroid Neoplasms/genetics

18.

Narrowing the Gap Between In Vitro and In Vivo Genetic Profiles by Deconvoluting Toxicogenomic Data In Silico.

Liu, Yuan; Jing, Runyu; Wen, Zhining; Li, Menglong.

Front Pharmacol ; 10: 1489, 2019.

Article in English | MEDLINE | ID: mdl-31992983

ABSTRACT

Toxicogenomics (TGx) is a powerful method to evaluate toxicity and is widely used in both in vivo and in vitro assays. For in vivo TGx, reduction, refinement, and replacement represent the unremitting pursuit of live-animal tests, but in vitro assays, as alternatives, usually demonstrate poor correlation with real in vivo assays. In living subjects, in addition to drug effects, inner-environmental reactions also affect genetic variation, and these two factors are further jointly reflected in gene abundance. Thus, finding a strategy to factorize inner-environmental factor from in vivo assays based on gene expression levels and to further utilize in vitro data to better simulate in vivo data is needed. We proposed a strategy based on post-modified non-negative matrix factorization, which can estimate the gene expression profiles and contents of major factors in samples. The applicability of the strategy was first verified, and the strategy was then utilized to simulate in vivo data by correcting in vitro data. The similarities between real in vivo data and simulated data (single-dose 0.72, repeat-doses 0.75) were higher than those observed when directly comparing real in vivo data with in vitro data (single-dose 0.56, repeat-doses 0.70). Moreover, by keeping environment-related factor, a simulation can always be generated by using in vitro data to provide potential substitutions for in vivo TGx and to reduce the launch of live-animal tests.

19.

Ensemble Methods with Voting Protocols Exhibit Superior Performance for Predicting Cancer Clinical Endpoints and Providing More Complete Coverage of Disease-Related Genes.

Jing, Runyu; Liang, Yu; Ran, Yi; Feng, Shengzhong; Wei, Yanjie; He, Li.

Int J Genomics ; 2018: 8124950, 2018.

Article in English | MEDLINE | ID: mdl-29546047

ABSTRACT

In genetic data modeling, building predictive models from a limited number of samples, especially when the sample count is well below the number of attributes, is difficult because of the enormous number of genes detected by a sequencing platform. In addition, many studies use machine learning methods to evaluate genetic datasets and identify potential disease-related genes and drug targets, but to the best of our knowledge, the information associated with the selected gene set has not been thoroughly elucidated in previous studies. To identify a relatively stable scheme for modeling limited samples in gene datasets and to reveal the information they contain, the present study first evaluated the performance of a series of modeling approaches for predicting clinical endpoints of cancer and then integrated the results using various voting protocols. As a result, we proposed a relatively stable scheme that uses a set of methods with an ensemble algorithm. Our findings indicated that ensemble methodologies are more reliable than single machine learning algorithms for predicting cancer prognoses as well as for evaluating gene function. The ensemble methodologies provide more complete coverage of relevant genes, which can facilitate the exploration of cancer mechanisms and the identification of potential drug targets.
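As a rough illustration of integrating model outputs by voting, the sketch below combines per-sample class labels from several classifiers. The abstract does not specify which voting protocols were used, so simple majority voting and weighted voting are assumptions here, and the model outputs and weights are made-up toy values.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label per sample across all models.

    `predictions` is a list with one label list per model, all of equal length.
    """
    combined = []
    for sample_votes in zip(*predictions):
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined

def weighted_vote(predictions, weights):
    """Return, per sample, the label with the largest total model weight."""
    combined = []
    for sample_votes in zip(*predictions):
        totals = {}
        for label, weight in zip(sample_votes, weights):
            totals[label] = totals.get(label, 0.0) + weight
        combined.append(max(totals, key=totals.get))
    return combined

# Hypothetical endpoint predictions (1 = event, 0 = no event) from three models.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]

majority = majority_vote([model_a, model_b, model_c])
weighted = weighted_vote([model_a, model_b, model_c], weights=[0.2, 0.7, 0.1])
print(majority)  # [1, 0, 1, 1]
print(weighted)  # [1, 1, 0, 1]
```

With equal trust in every model, the majority label wins. The weighted variant lets a model judged more reliable, for instance by cross-validation, override the other two, which is why the second and third samples flip.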

20.

Functional annotation of sixty-five type-2 diabetes risk SNPs and its application in risk prediction.

Wu, Yiming; Jing, Runyu; Dong, Yongcheng; Kuang, Qifan; Li, Yan; Huang, Ziyan; Gan, Wei; Xue, Yue; Li, Yizhou; Li, Menglong.

Sci Rep ; 7: 43709, 2017 03 06.

Article in English | MEDLINE | ID: mdl-28262806

ABSTRACT

Genome-wide association studies (GWAS) have identified more than sixty single nucleotide polymorphisms (SNPs) associated with increased risk for type 2 diabetes (T2D). However, the identification of causal risk SNPs for T2D pathogenesis has been complicated by the fact that each risk SNP is a surrogate for hundreds of SNPs, most of which reside in non-coding regions. Here we provide a comprehensive annotation of 65 known T2D-related SNPs and examine putative functional SNPs that likely cause protein dysfunction, disruption of response elements of known transcription factors related to T2D genes, and disruption of regulatory response elements of four histone marks in pancreas and pancreatic islets. Among the newly identified risk SNPs, some were reported as T2D-related SNPs in recent studies. Further, we found that accumulating the modest effects of single sites markedly enhanced risk prediction based on 1989 T2D samples and 3000 healthy controls. The AROC value increased from 0.58 to 0.62 using only the genotype score when putative risk SNPs were added. In addition, the net reclassification improvement was 10.03% on the addition of the new risk SNPs. Taken together, functional annotation could provide a prioritized list of potential risk SNPs for further estimation of individual T2D susceptibility.
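The genotype-score evaluation mentioned in this abstract can be sketched as follows. The abstract does not define the score, so an unweighted count of risk alleles per individual is assumed, and the area under the ROC curve is computed by its rank interpretation. All genotypes below are invented toy values, not study data.

```python
def genotype_score(genotype, weights=None):
    """Sum risk-allele counts (0, 1 or 2 per SNP), optionally weighted per SNP.

    Assumption: an unweighted count is used when no weights are given.
    """
    if weights is None:
        weights = [1.0] * len(genotype)
    return sum(count * weight for count, weight in zip(genotype, weights))

def auc(case_scores, control_scores):
    """Area under the ROC curve as the probability that a random case
    scores higher than a random control (ties count as one half)."""
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical risk-allele counts at three SNPs for three cases and three controls.
cases = [[2, 1, 1], [1, 1, 0], [2, 2, 1]]
controls = [[0, 1, 0], [1, 1, 0], [1, 0, 0]]

case_scores = [genotype_score(g) for g in cases]
control_scores = [genotype_score(g) for g in controls]
print("AUC:", round(auc(case_scores, control_scores), 3))  # AUC: 0.944
```

Comparing the AUC obtained with and without an extra set of SNP columns is one way to express the kind of gain the abstract reports (0.58 to 0.62), although the study's exact procedure is not described here.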

Subject(s)

Type 2 Diabetes Mellitus/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study , Single Nucleotide Polymorphism , Computational Biology/methods , Type 2 Diabetes Mellitus/metabolism , Genetic Epigenesis , Exons , Genomics/methods , Histones/metabolism , Humans , Linkage Disequilibrium , Molecular Sequence Annotation , Odds Ratio , Genetic Promoter Regions , ROC Curve , Nucleic Acid Regulatory Sequences , Risk Assessment , Transcription Factors/metabolism
