Search | BVS Regional Portal

1.

Mining Alzheimer's disease clinical data: reducing effects of natural aging for predicting progression and identifying subtypes.

Han, Tian; Peng, Yunhua; Du, Ying; Li, Yunbo; Wang, Ying; Sun, Wentong; Cui, Lanxin; Peng, Qinke.

Front Neurosci ; 18: 1388391, 2024.

Article in English | MEDLINE | ID: mdl-39206114

ABSTRACT

Introduction: Because Alzheimer's disease (AD) has significant heterogeneity in encephalatrophy and clinical manifestations, AD research faces two critical challenges: eliminating the impact of natural aging and extracting valuable clinical data for patients with AD. Methods: This study attempted to address these challenges by developing a novel machine-learning model called tensorized contrastive principal component analysis (T-cPCA). The objectives of this study were to predict AD progression and identify clinical subtypes while minimizing the influence of natural aging. Results: We leveraged a clinical variable space of 872 features, including almost all AD clinical examinations, which is the most comprehensive AD feature description in current research. T-cPCA yielded the highest accuracy in predicting AD progression by effectively minimizing the confounding effects of natural aging. Discussion: The representative features and pathogenic circuits of the four primary AD clinical subtypes were discovered. As confirmed by clinicians at Tangdu Hospital, the plaque (18F-AV45) distributions of typical patients in the four clinical subtypes are consistent with the representative brain regions found in the four AD subtypes, which offers further novel insights into the underlying mechanisms of AD pathogenesis.
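
The abstract above does not describe T-cPCA in detail. As a rough illustration of the underlying idea only, the following sketch implements plain (non-tensorized) contrastive PCA in two dimensions: it takes the top eigenvector of C_target - alpha * C_background, so that variation shared with a background group is discounted. The toy data, the function names, and the reading of the background group as standing in for natural aging are assumptions made for this example, not details taken from the paper.

```python
import math

def covariance_2d(points):
    """Return (var_x, cov_xy, var_y) of 2-D points after centering."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / n
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / n
    cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    return var_x, cov_xy, var_y

def contrastive_direction(target, background, alpha):
    """Top eigenvector of C_target - alpha * C_background (2-D case only)."""
    ta, tb, td = covariance_2d(target)
    ba, bb, bd = covariance_2d(background)
    a = ta - alpha * ba
    b = tb - alpha * bb
    d = td - alpha * bd
    # Closed-form principal axis of the symmetric matrix [[a, b], [b, d]].
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    return math.cos(theta), math.sin(theta)

# Hypothetical toy data: axis 0 varies in both groups (shared variation),
# axis 1 varies almost only in the target group.
target = [(-3.0, -1.0), (3.0, 1.0), (-3.0, 1.0), (3.0, -1.0)]
background = [(-3.0, 0.1), (3.0, 0.1), (-3.0, -0.1), (3.0, -0.1)]

pca_dir = contrastive_direction(target, background, alpha=0.0)   # ordinary PCA
cpca_dir = contrastive_direction(target, background, alpha=1.0)  # contrastive PCA
```

With alpha = 0 the direction follows the shared axis; with alpha = 1 the shared variance cancels and the direction follows the axis that is specific to the target group.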

2.

KGRACDA: A Model Based on Knowledge Graph from Recursion and Attention Aggregation for CircRNA-disease Association Prediction.

Wang, Ying; Ma, Maoyuan; Xie, Yanxin; Peng, Qinke; Lyu, Hongqiang; Sun, Hequan; Fu, Laiyi.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2024 Aug 21.

Article in English | MEDLINE | ID: mdl-39167510

ABSTRACT

CircRNA is closely related to human disease, so it is important to predict circRNA-disease association (CDA). However, traditional biological detection methods are difficult and have low accuracy, and computational methods, typified by deep learning, neglect the model's ability to explicitly extract local depth information about CDA. We propose a model based on a knowledge graph from recursion and attention aggregation for circRNA-disease association prediction (KGRACDA). This model combines explicit structural features and implicit embedding information of graphs, optimizing graph embedding vectors. First, we build large-scale, multi-source heterogeneous datasets and construct a knowledge graph of multiple RNAs and diseases. Next, we use a recursive method to build multi-hop subgraphs and optimize the graph attention mechanism with a gating mechanism to mine local depth information. At the same time, the model uses a multi-head attention mechanism to balance the global and local depth features of graphs and generate CDA prediction scores. KGRACDA surpasses other methods by capturing local and global depth information related to CDA. We also update the interactive web platform HNRBase to v2.0, which visualizes circRNA data and allows users to download data and predict CDA using the model.

3.

BertNDA: A Model Based on Graph-Bert and Multi-Scale Information Fusion for ncRNA-Disease Association Prediction.

Ning, Zhiwei; Wu, Jinyang; Ding, Yidong; Wang, Ying; Peng, Qinke; Fu, Laiyi.

IEEE J Biomed Health Inform ; 27(11): 5655-5664, 2023 11.

Article in English | MEDLINE | ID: mdl-37669210

ABSTRACT

Non-coding RNAs (ncRNAs) are a class of RNA molecules that lack the ability to encode proteins in human cells but play crucial roles in various biological processes. Understanding the interactions between different ncRNAs and their impact on diseases can significantly contribute to the diagnosis, prevention, and treatment of diseases. However, predicting tertiary interactions between ncRNAs and diseases based on structural information at multiple scales remains a challenging task. To address this challenge, we propose a method called BertNDA, which aims to predict potential relationships between miRNAs, lncRNAs, and diseases. The framework identifies local information through connectionless subgraphs, which aggregate the features of neighboring nodes. Global information is extracted by leveraging the Laplace transform of graph structures and WL (Weisfeiler-Lehman) absolute role coding. Additionally, an EMLP (Element-wise MLP) structure is designed to fuse pairwise global information. A transformer encoder is employed as the backbone of our approach, followed by a prediction layer that outputs the final correlation score. Extensive experiments demonstrate that BertNDA outperforms state-of-the-art methods on the prediction task and exhibits significant potential for various biological applications. Moreover, we develop an online prediction platform that incorporates the prediction model, providing users with an intuitive and interactive experience. Overall, our model offers an efficient, accurate, and comprehensive tool for predicting tertiary associations between ncRNAs and diseases.

Subject(s)

MicroRNAs , RNA, Long Noncoding , Humans , Electric Power Supplies

4.

KGETCDA: an efficient representation learning framework based on knowledge graph encoder from transformer for predicting circRNA-disease associations.

Wu, Jinyang; Ning, Zhiwei; Ding, Yidong; Wang, Ying; Peng, Qinke; Fu, Laiyi.

Brief Bioinform ; 24(5), 2023 09 20.

Article in English | MEDLINE | ID: mdl-37587836

ABSTRACT

Recent studies have demonstrated the significant role that circRNA plays in the progression of human diseases. Identifying circRNA-disease associations (CDA) in an efficient manner can offer crucial insights into disease diagnosis. While traditional biological experiments can be time-consuming and labor-intensive, computational methods have emerged as a viable alternative in recent years. However, these methods are often limited by data sparsity and their inability to explore high-order information. In this paper, we introduce a novel method named Knowledge Graph Encoder from Transformer for predicting CDA (KGETCDA). Specifically, KGETCDA first integrates more than 10 databases to construct a large heterogeneous non-coding RNA dataset, which contains multiple relationships between circRNA, miRNA, lncRNA and disease. Then, a biological knowledge graph is created from this dataset, and Transformer-based knowledge representation learning and attentive propagation layers are applied to obtain high-quality embeddings that accurately capture high-order interaction information. Finally, a multilayer perceptron is used to predict the matching scores of CDA based on their embeddings. Our empirical results demonstrate that KGETCDA significantly outperforms other state-of-the-art models. To enhance user experience, we have developed an interactive web-based platform named HNRBase that allows users to visualize and download data and make predictions using KGETCDA with ease. The code and datasets are publicly available at https://github.com/jinyangwu/KGETCDA.

Subject(s)

RNA, Circular , RNA, Long Noncoding , Humans , Pattern Recognition, Automated , Learning , Databases, Factual , Knowledge Bases , Computational Biology

5.

Graph Neural Networks With High-Order Polynomial Spectral Filters.

Zeng, Zeyuan; Peng, Qinke; Mou, Xu; Wang, Ying; Li, Ruimeng.

IEEE Trans Neural Netw Learn Syst ; PP, 2023 Apr 11.

Article in English | MEDLINE | ID: mdl-37040244

ABSTRACT

General graph neural networks (GNNs) implement convolution operations on graphs based on polynomial spectral filters. Existing filters with high-order polynomial approximations can detect more structural information when reaching high-order neighborhoods but produce indistinguishable representations of nodes, which indicates that they process information in high-order neighborhoods inefficiently and results in performance degradation. In this article, we theoretically identify the feasibility of avoiding this problem and attribute it to overfitting of the polynomial coefficients. To cope with it, the coefficients are restricted in two steps: dimensionality reduction of the coefficients' domain and sequential assignment of the forgetting factor. We transform the optimization of coefficients into the tuning of a hyperparameter and propose a flexible spectral-domain graph filter, which significantly reduces the memory demand and the adverse impacts on message transmission under large receptive fields. With our filter, the performance of GNNs improves significantly in large receptive fields, and the receptive fields of GNNs are multiplied as well. Meanwhile, the superiority of applying a high-order approximation is verified across various datasets, notably in strongly hyperbolic datasets. Codes are publicly available at: https://github.com/cengzeyuan/TNNLS-FFKSF.
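
To make the idea of replacing many free polynomial coefficients with a single hyperparameter concrete, here is a minimal sketch. It assumes, purely for illustration, that the k-th coefficient decays geometrically as gamma**k, so the filter output is the sum over k of gamma**k * P^k x for a propagation matrix P. The abstract does not state the exact parameterization, so the decay rule, the names `polynomial_filter` and `gamma`, and the toy graph are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def polynomial_filter(prop, signal, order, gamma):
    """Compute sum_{k=0..order} gamma**k * prop^k @ signal."""
    output = list(signal)        # k = 0 term
    hop = list(signal)
    weight = 1.0
    for _ in range(order):
        hop = matvec(prop, hop)  # signal propagated one more hop
        weight *= gamma          # assumed geometric "forgetting" of distant hops
        output = [o + weight * h for o, h in zip(output, hop)]
    return output

# Hypothetical 3-node path graph 0 - 1 - 2 with a row-normalized adjacency matrix.
prop = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
signal = [1.0, 0.0, 0.0]
filtered = polynomial_filter(prop, signal, order=2, gamma=0.5)
```

In this form, raising `order` enlarges the receptive field without adding trainable coefficients; only `gamma` needs tuning.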

6.

LncDLSM: Identification of Long Non-coding RNAs with Deep Learning-based Sequence Model.

Wang, Ying; Zhao, Pengfei; Du, Hongkai; Cao, Yingxin; Peng, Qinke; Fu, Laiyi.

IEEE J Biomed Health Inform ; PP, 2023 Feb 22.

Article in English | MEDLINE | ID: mdl-37027676

ABSTRACT

Long non-coding RNAs (lncRNAs) play a vital role in regulating gene expression and other biological processes. Differentiating lncRNAs from protein-coding transcripts helps researchers investigate the mechanism of lncRNA formation and its downstream regulation in relation to various diseases. Previous approaches to identifying lncRNAs include traditional bio-sequencing and machine learning. Considering the tedious work of biological characteristic-based feature extraction procedures and the inevitable artifacts of bio-sequencing processes, those lncRNA detection methods are not always satisfactory. Hence, in this work, we present lncDLSM, a deep learning-based framework that differentiates lncRNAs from protein-coding transcripts without depending on prior biological knowledge. lncDLSM is a helpful tool for identifying lncRNAs compared with other biological feature-based machine learning methods, and it can be applied to other species through transfer learning with satisfactory results. Further experiments showed that different species display distinct boundaries among distributions corresponding to the homology and the specificity among species, respectively. An online web server is provided to the community for easy use and efficient identification of lncRNA, available at http://39.106.16.168/lncDLSM.

7.

A Deep Learning Framework for News Readers' Emotion Prediction Based on Features From News Article and Pseudo Comments.

Mou, Xu; Peng, Qinke; Sun, Zhao; Wang, Ying; Li, Xintong; Bashir, Muhammad Fiaz.

IEEE Trans Cybern ; 53(4): 2186-2199, 2023 Apr.

Article in English | MEDLINE | ID: mdl-34587108

ABSTRACT

With the rapid development of the Internet, readers tend to share their views and emotions about news events. Predicting these emotions plays a vital role in social media applications (e.g., sentiment retrieval, opinion summary, and election prediction). However, news articles usually consist of objective texts that lack emotion words, making emotion prediction challenging. From prior studies, we know that comments that come directly from readers are full of emotions. Therefore, in this article, we propose a deep learning framework that first merges article and comment information to predict readers' emotions. At the same time, in the prediction process, we design a pseudo comment representation for unpublished news articles from the comments on published news. In addition, a better model is required to encode articles that contain implicit emotions. To solve this problem, we propose a block emotion attention network (BEAN) to encode news articles better. It includes an emotion attention mechanism and a hierarchical structure to capture emotion words and generate structural information during encoding. Experiments performed on three public datasets show that BEAN achieves state-of-the-art average Pearson (AP) and accuracy (Acc@1). Moreover, results on four self-collected datasets show that both the introduction of emotional comments and BEAN in our framework improve the ability to predict readers' emotions.

Subject(s)

Deep Learning , Social Media , Humans , Emotions

8.

Integrated analysis of multimodal single-cell data with structural similarity.

Cao, Yingxin; Fu, Laiyi; Wu, Jie; Peng, Qinke; Nie, Qing; Zhang, Jing; Xie, Xiaohui.

Nucleic Acids Res ; 50(21): e121, 2022 11 28.

Article in English | MEDLINE | ID: mdl-36130281

ABSTRACT

Multimodal single-cell sequencing technologies provide unprecedented information on cellular heterogeneity from multiple layers of genomic readouts. However, joint analysis of two modalities without properly handling the noise often leads to overfitting of one modality by the other and worse clustering results than vanilla single-modality analysis. How to efficiently utilize the extra information from single-cell multi-omics to delineate cell states and identify meaningful signals remains a significant computational challenge. In this work, we propose a deep learning framework, named SAILERX, for efficient, robust, and flexible analysis of multi-modal single-cell data. SAILERX consists of a variational autoencoder with invariant representation learning to correct technical noise from the sequencing process, and a multimodal data alignment mechanism to integrate information from different modalities. Instead of performing hard alignment by projecting both modalities to a shared latent space, SAILERX encourages the local structures of two modalities measured by pairwise similarities to be similar. This strategy is more robust against overfitting to noise, which facilitates various downstream analyses such as clustering, imputation, and marker gene detection. Furthermore, the invariant representation learning part enables SAILERX to perform integrative analysis on both multi- and single-modal datasets, making it an applicable and scalable tool for more general scenarios.

Subject(s)

Genomics , Multiomics , Cluster Analysis , Single-Cell Analysis

9.

TADfit is a multivariate linear regression model for profiling hierarchical chromatin domains on replicate Hi-C data.

Liu, Erhu; Lyu, Hongqiang; Peng, Qinke; Liu, Yuan; Wang, Tian; Han, Jiuqiang.

Commun Biol ; 5(1): 608, 2022 06 20.

Article in English | MEDLINE | ID: mdl-35725901

ABSTRACT

Topologically associating domains (TADs) are fundamental building blocks of the three-dimensional genome and are organized into complex hierarchies. Identifying hierarchical TADs in Hi-C data helps to understand the relationship between genome architecture and gene regulation. Herein we propose TADfit, a multivariate linear regression model for profiling hierarchical chromatin domains. It fits the interaction frequencies in Hi-C contact matrices, with or without replicates, using all possible hierarchical TADs; the significant ones are determined by the regression coefficients, which are obtained with an online learning solver called Follow-The-Regularized-Leader (FTRL). Unlike existing methods, TADfit can handle multiple contact matrix replicates and find partially overlapping TADs in them, which helps to recover the comprehensive underlying TADs across replicates from different experiments. The comparative results show that TADfit has better accuracy and reproducibility, and that the hierarchical TADs it calls exhibit reasonable biological relevance.

Subject(s)

Chromatin , Chromosomes , Chromatin/genetics , Genome , Linear Models , Reproducibility of Results
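
The abstract above names Follow-The-Regularized-Leader (FTRL) as the solver for the regression coefficients. The following is a minimal sketch of the widely used per-coordinate FTRL-Proximal update for squared loss, not TADfit's actual implementation. The hyperparameter names (`alpha`, `beta`, `l1`, `l2`), the toy data, and the interpretation of columns as candidate-domain indicators are assumptions made for this example.

```python
import math

def ftrl_fit(samples, targets, alpha=0.5, beta=1.0, l1=0.01, l2=0.0, epochs=500):
    """Per-coordinate FTRL-Proximal for squared loss; returns the weights."""
    dim = len(samples[0])
    z = [0.0] * dim  # accumulated adjusted gradients
    n = [0.0] * dim  # accumulated squared gradients

    def weights():
        w = []
        for zi, ni in zip(z, n):
            if abs(zi) <= l1:
                w.append(0.0)  # L1 threshold keeps this coefficient at exactly zero
            else:
                sign = 1.0 if zi > 0 else -1.0
                w.append(-(zi - sign * l1) / ((beta + math.sqrt(ni)) / alpha + l2))
        return w

    for _ in range(epochs):
        for x, y in zip(samples, targets):
            w = weights()
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                g = error * xi  # gradient of 0.5 * error**2 with respect to w[i]
                sigma = (math.sqrt(n[i] + g * g) - math.sqrt(n[i])) / alpha
                z[i] += g - sigma * w[i]
                n[i] += g * g
    return weights()

# Hypothetical toy problem: column 0 explains the target, column 1 does not.
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
targets = [2.0, 0.0, 2.0, 0.0]
coef = ftrl_fit(samples, targets)
```

Coefficients that stay at exactly zero correspond to features the solver never found useful, which is one plausible way a regression of this kind could separate significant candidates from the rest.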

10.

A successful hybrid deep learning model aiming at promoter identification.

Wang, Ying; Peng, Qinke; Mou, Xu; Wang, Xinyuan; Li, Haozhou; Han, Tian; Sun, Zhao; Wang, Xiao.

BMC Bioinformatics ; 23(Suppl 1): 206, 2022 May 31.

Article in English | MEDLINE | ID: mdl-35641900

ABSTRACT

BACKGROUND: The zone adjacent to a transcription start site (TSS), namely, the promoter, is primarily involved in the process of DNA transcription initiation and regulation. As a result, proper promoter identification is critical for further understanding the mechanism of the networks controlling genomic regulation. A number of methodologies for the identification of promoters have been proposed. Nonetheless, due to the great heterogeneity existing in promoters, the results of these procedures are still unsatisfactory. In order to establish additional discriminative characteristics and properly recognize promoters, we developed the hybrid model for promoter identification (HMPI), a hybrid deep learning model that can characterize both the native sequences of promoters and the morphological outline of promoters at the same time. We developed the HMPI to combine a method called the PSFN (promoter sequence features network), which characterizes native promoter sequences and deduces sequence features, with a technique referred to as the DSPN (deep structural profiles network), which is specially structured to model the promoters in terms of their structural profile and to deduce their structural attributes. RESULTS: The HMPI was applied to human, plant and Escherichia coli K-12 strain datasets, and the findings showed that the HMPI was successful at extracting the features of the promoter while greatly enhancing the promoter identification performance. In addition, after the improvements of synthetic sampling, transfer learning and label smoothing regularization, the improved HMPI models achieved good results in identifying subtypes of promoters on prokaryotic promoter datasets. 
CONCLUSIONS: The results showed that the HMPI was successful at extracting the features of promoters while greatly enhancing the performance of identifying promoters on both eukaryotic and prokaryotic datasets, and the improved HMPI models are good at identifying subtypes of promoters on prokaryotic promoter datasets. The HMPI is additionally adaptable to different biological functional sequences, allowing for the addition of new features or models.

Subject(s)

Deep Learning , Escherichia coli K12 , Escherichia coli/genetics , Escherichia coli K12/genetics , Humans , Promoter Regions, Genetic , Sequence Analysis, DNA , Transcription Initiation Site

11.

UFold: fast and accurate RNA secondary structure prediction with deep learning.

Fu, Laiyi; Cao, Yingxin; Wu, Jie; Peng, Qinke; Nie, Qing; Xie, Xiaohui.

Nucleic Acids Res ; 50(3): e14, 2022 02 22.

Article in English | MEDLINE | ID: mdl-34792173

ABSTRACT

For many RNA molecules, the secondary structure is essential for the correct function of the RNA. Predicting RNA secondary structure from nucleotide sequences is a long-standing problem in genomics, but the prediction performance has reached a plateau over time. Traditional RNA secondary structure prediction algorithms are primarily based on thermodynamic models through free energy minimization, which imposes strong prior assumptions and is slow to run. Here, we propose a deep learning-based method, called UFold, for RNA secondary structure prediction, trained directly on annotated data and base-pairing rules. UFold proposes a novel image-like representation of RNA sequences, which can be efficiently processed by Fully Convolutional Networks (FCNs). We benchmark the performance of UFold on both within- and cross-family RNA datasets. It significantly outperforms previous methods on within-family datasets, while achieving a similar performance as the traditional methods when trained and tested on distinct RNA families. UFold is also able to predict pseudoknots accurately. Its prediction is fast with an inference time of about 160 ms per sequence up to 1500 bp in length. An online web server running UFold is available at https://ufold.ics.uci.edu. Code is available at https://github.com/uci-cbcl/UFold.

Subject(s)

Deep Learning , RNA , Algorithms , Base Pairing , Humans , Nucleic Acid Conformation , RNA/chemistry , RNA/genetics
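
The UFold abstract above mentions an image-like representation of RNA sequences without spelling it out. One simple way to obtain such an input, sketched here under the assumption that each channel marks one ordered pair of bases, is to take outer products of the columns of a one-hot encoding, which yields a 16 x L x L tensor that a convolutional network could consume. The channel ordering and function names are hypothetical.

```python
BASES = "AUCG"

def one_hot(sequence):
    """Return an L x 4 one-hot matrix; unknown characters map to all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]

def pairwise_image(sequence):
    """Return a 16 x L x L tensor; channel 4*p + q is the outer product of
    one-hot column p with one-hot column q."""
    encoded = one_hot(sequence)
    length = len(sequence)
    channels = []
    for p in range(4):
        for q in range(4):
            channels.append([
                [encoded[i][p] * encoded[j][q] for j in range(length)]
                for i in range(length)
            ])
    return channels

image = pairwise_image("GAUC")
# Channel for the ordered pair (A, U): 1 where position i is A and position j is U.
au_channel = image[BASES.index("A") * 4 + BASES.index("U")]
```

Every position pair (i, j) is active in exactly one channel, so the tensor records which two bases would face each other if positions i and j were paired.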

12.

EdClust: A heuristic sequence clustering method with higher sensitivity.

Cao, Ming; Peng, Qinke; Wei, Ze-Gang; Liu, Fei; Hou, Yi-Fan.

J Bioinform Comput Biol ; 20(1): 2150036, 2022 02.

Article in English | MEDLINE | ID: mdl-34939905

ABSTRACT

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment, to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.

Subject(s)

Heuristics , Software , Algorithms , Cluster Analysis , Sequence Alignment
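
As a rough picture of how a seed-based heuristic clustering method operates, the sketch below runs a greedy incremental scheme: sequences are visited longest first, each one joins the first seed whose identity meets a threshold, and otherwise it becomes a new seed. It is an assumption that edClust follows this general pattern. The sketch also substitutes a plain global Levenshtein distance for Edlib's semi-global alignment, and the identity formula and threshold are illustrative choices.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, keeping two rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]

def greedy_cluster(sequences, min_identity):
    """Assign each sequence to the first matching seed, else open a new cluster."""
    clusters = []  # list of (seed, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for seed, members in clusters:
            identity = 1.0 - edit_distance(seq, seed) / max(len(seq), len(seed))
            if identity >= min_identity:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))  # no seed matched: seq becomes a seed
    return clusters

# Hypothetical reads forming two obvious groups.
reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTACGT"]
clusters = greedy_cluster(reads, min_identity=0.75)
```

Because each sequence is compared only with seeds, the number of alignments stays far below all-against-all, which is the source of the low computational cost the abstract attributes to heuristic methods.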

13.

Bayesian Gene Selection Based on Pathway Information and Network-Constrained Regularization.

Cao, Ming; Fan, Yue; Peng, Qinke.

Comput Math Methods Med ; 2021: 7471516, 2021.

Article in English | MEDLINE | ID: mdl-34394707

ABSTRACT

High-throughput data make it possible to study expression levels of thousands of genes simultaneously under a particular condition. However, only a few of the genes are discriminatively expressed. Identifying these biomarkers precisely is important for disease diagnosis, prognosis, and therapy. Many studies have used pathway information to identify the biomarkers. However, most of these studies incorporate only the group information, while the pathway structural information is ignored. In this paper, we propose a Bayesian gene selection method with network-constrained regularization, which can incorporate the pathway structural information as priors to perform gene selection. All the priors are conjugate; thus, the parameters can be estimated effectively through Gibbs sampling. We present the application of our method to 6 microarray datasets, comparing it with Bayesian Lasso, Bayesian Elastic Net, and Bayesian Fused Lasso. The results show that our method performs better than the other Bayesian methods and that pathway structural information can improve the result.

Subject(s)

Bayes Theorem , Gene Regulatory Networks , Genetic Markers , Biomarkers, Tumor/genetics , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Female , Gene Expression Profiling , Genetic Predisposition to Disease , Humans , Male , Models, Genetic , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis/statistics & numerical data
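
The abstract above does not give the form of its network-constrained term. One common form in the literature is the quadratic penalty b^T L b, where L is the normalized Laplacian of the pathway graph; it equals the sum over edges of (b_u / sqrt(d_u) - b_v / sqrt(d_v))^2 and is small when linked genes have similar degree-scaled coefficients. The sketch below computes that quantity. Whether the paper's prior uses exactly this form is an assumption, and the gene names and coefficients are invented.

```python
import math

def network_penalty(coefficients, edges):
    """Sum over edges of (b_u / sqrt(d_u) - b_v / sqrt(d_v))**2, which equals
    b^T L b for the normalized Laplacian L of an unweighted graph."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    total = 0.0
    for u, v in edges:
        diff = (coefficients[u] / math.sqrt(degree[u])
                - coefficients[v] / math.sqrt(degree[v]))
        total += diff * diff
    return total

# Hypothetical pathway in which gene A is linked to genes B and C.
edges = [("A", "B"), ("A", "C")]
smooth = network_penalty({"A": math.sqrt(2), "B": 1.0, "C": 1.0}, edges)
rough = network_penalty({"A": math.sqrt(2), "B": -1.0, "C": 1.0}, edges)
```

The `smooth` assignment incurs no penalty, while flipping the sign of one neighbor in `rough` is penalized, which is how structural information can steer selection toward connected genes.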

14.

SAILER: scalable and accurate invariant representation learning for single-cell ATAC-seq processing and integration.

Cao, Yingxin; Fu, Laiyi; Wu, Jie; Peng, Qinke; Nie, Qing; Zhang, Jing; Xie, Xiaohui.

Bioinformatics ; 37(Suppl_1): i317-i326, 2021 07 12.

Article in English | MEDLINE | ID: mdl-34252968

ABSTRACT

MOTIVATION: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) provides new opportunities to dissect epigenomic heterogeneity and elucidate transcriptional regulatory mechanisms. However, computational modeling of scATAC-seq data is challenging due to its high dimension, extreme sparsity, complex dependencies and high sensitivity to confounding factors from various sources. RESULTS: Here, we propose a new deep generative model framework, named SAILER, for analyzing scATAC-seq data. SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors like read depth and batch effects. SAILER adopts the conventional encoder-decoder framework to learn the latent representation but imposes additional constraints to ensure the independence of the learned representations from the confounding factors. Experimental results on both simulated and real scATAC-seq datasets demonstrate that SAILER learns better and biologically more meaningful representations of cells than other methods. Its noise-free cell embeddings bring in significant benefits in downstream analyses: clustering and imputation based on SAILER result in 6.9% and 18.5% improvements over existing methods, respectively. Moreover, because no matrix factorization is involved, SAILER can easily scale to process millions of cells. We implemented SAILER into a software package, freely available to all for large-scale scATAC-seq data analysis. AVAILABILITY AND IMPLEMENTATION: The software is publicly available at https://github.com/uci-cbcl/SAILER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Chromatin Immunoprecipitation Sequencing , Single-Cell Analysis , Epigenomics , Sequence Analysis, RNA , Software , Transposases

15.

Efficient and effective control of confounding in eQTL mapping studies through joint differential expression and Mendelian randomization analyses.

Fan, Yue; Zhu, Huanhuan; Song, Yanyi; Peng, Qinke; Zhou, Xiang.

Bioinformatics ; 37(3): 296-302, 2021 04 20.

Article in English | MEDLINE | ID: mdl-32790868

ABSTRACT

MOTIVATION: Identifying cis-acting genetic variants associated with gene expression levels, an analysis commonly referred to as expression quantitative trait loci (eQTL) mapping, is an important first step toward understanding the genetic determinant of gene expression variation. Successful eQTL mapping requires effective control of confounding factors. A common method for controlling confounding effects in eQTL mapping studies is the probabilistic estimation of expression residual (PEER) analysis. PEER analysis extracts PEER factors to serve as surrogates for confounding factors, which are then included in the subsequent eQTL mapping analysis. However, it is computationally challenging to determine the optimal number of PEER factors used for eQTL mapping. In particular, the standard approach to determine the optimal number of PEER factors examines one number at a time and chooses a number that optimizes eQTL discovery. Unfortunately, this standard approach involves multiple repetitive eQTL mapping procedures that are computationally expensive, restricting its use in the large-scale eQTL mapping studies that are being collected today. RESULTS: Here, we present a simple and computationally scalable alternative, Effect size Correlation for COnfounding determination (ECCO), to determine the optimal number of PEER factors used for eQTL mapping studies. Instead of performing repetitive eQTL mapping, ECCO jointly applies differential expression analysis and Mendelian randomization analysis, leading to substantial computational savings. In simulations and real data applications, we show that ECCO identifies a similar number of PEER factors required for eQTL mapping analysis as the standard approach but is two orders of magnitude faster. The computational scalability of ECCO allows for optimized eQTL discovery across 48 GTEx tissues for the first time, yielding an overall 5.89% power gain in the number of eQTL-harboring genes (eGenes) discovered as compared to the previous GTEx recommendation, which does not attempt to determine the tissue-specific optimal number of PEER factors. AVAILABILITY AND IMPLEMENTATION: Our method is implemented in the ECCO software, which, along with its GTEx mapping results, is freely available at www.xzlab.org/software.html. All R scripts used in this study are also available at this site. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Mendelian Randomization Analysis , Quantitative Trait Loci , Gene Expression , Gene Expression Profiling , Genome-Wide Association Study , Polymorphism, Single Nucleotide , Software

16.

Predicting transcription factor binding in single cells through deep learning.

Fu, Laiyi; Zhang, Lihua; Dollinger, Emmanuel; Peng, Qinke; Nie, Qing; Xie, Xiaohui.

Sci Adv ; 6(51), 2020 12.

Article in English | MEDLINE | ID: mdl-33355120

ABSTRACT

Characterizing genome-wide binding profiles of transcription factors (TFs) is essential for understanding biological processes. Although techniques have been developed to assess binding profiles within a population of cells, determining them at a single-cell level remains elusive. Here, we report scFAN (single-cell factor analysis network), a deep learning model that predicts genome-wide TF binding profiles in individual cells. scFAN is pretrained on genome-wide bulk assay for transposase-accessible chromatin sequencing (ATAC-seq), DNA sequence, and chromatin immunoprecipitation sequencing (ChIP-seq) data and uses single-cell ATAC-seq to predict TF binding in individual cells. We demonstrate the efficacy of scFAN by both studying sequence motifs enriched within predicted binding peaks and using predicted TFs for discovering cell types. We develop a new metric "TF activity score" to characterize each cell and show that activity scores can reliably capture cell identities. scFAN allows us to discover and study cellular identities and heterogeneity based on chromatin accessibility profiles.

17.

Predicting DNA Methylation States with Hybrid Information Based Deep-Learning Model.

Fu, Laiyi; Peng, Qinke; Chai, Ling.

IEEE/ACM Trans Comput Biol Bioinform ; 17(5): 1721-1728, 2020.

Article in English | MEDLINE | ID: mdl-30951477

ABSTRACT

DNA methylation plays an important role in the regulation of some biological processes. With the development of machine learning models, several sequence-based deep learning models have been designed to predict DNA methylation states, and they perform better than traditional methods such as random forest and SVM. However, convolutional network-based deep learning models that use one-hot encoded DNA sequences as input may discover limited information and yield unsatisfactory prediction performance, so more data and model structures from diverse angles should be considered. In this work, we propose a hybrid sequence-based deep learning model that uses both MeDIP-seq data and histone information to predict DNA methylated CpG states (MHCpG). We combined both MeDIP-seq data and histone modification data with sequence information and implemented a convolutional network to discover sequence patterns. In addition, we used statistical data derived from the previous three inputs and adopted a 3-layer feedforward neural network to extract more high-level features. We compared our method with traditional prediction methods using random forest and with previous methods such as CpGenie and DeepCpG; the results showed that MHCpG outperformed the other approaches.

Subject(s)

DNA Methylation/genetics , Deep Learning , High-Throughput Nucleotide Sequencing/methods , Histone Code/genetics , Sequence Analysis, DNA/methods , Cell Line, Tumor , Computational Biology/methods , DNA/genetics , Humans

18.

IMAGE: high-powered detection of genetic effects on DNA methylation using integrated methylation QTL mapping and allele-specific analysis.

Fan, Yue; Vilgalys, Tauras P; Sun, Shiquan; Peng, Qinke; Tung, Jenny; Zhou, Xiang.

Genome Biol ; 20(1): 220, 2019 10 24.

Article in English | MEDLINE | ID: mdl-31651351

ABSTRACT

Identifying genetic variants that are associated with methylation variation, an analysis commonly referred to as methylation quantitative trait locus (mQTL) mapping, is important for understanding the epigenetic mechanisms underlying genotype-trait associations. Here, we develop a statistical method, IMAGE, for mQTL mapping in sequencing-based methylation studies. IMAGE properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery. We compare IMAGE with existing approaches through extensive simulation. We also apply IMAGE to analyze two bisulfite sequencing studies, in which IMAGE identifies more mQTL than existing approaches.
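Why the count nature of bisulfite sequencing data matters can be shown with a small sketch. It is not the IMAGE method, which according to the abstract also uses allele-specific patterns; it only assumes, for illustration, that methylated read counts at a site follow a binomial distribution, and all function names are hypothetical. Two sites with the same methylated proportion (75%) but different read depth carry very different amounts of evidence against a null methylation level of 0.5.

```python
import math


def _x_log_y(x, y):
    """Return x * log(y), treating 0 * log(0) as 0."""
    return 0.0 if x == 0 else x * math.log(y)


def binomial_log_likelihood(methylated, total, level):
    """Log-probability of `methylated` out of `total` reads at methylation `level`."""
    unmethylated = total - methylated
    return (math.log(math.comb(total, methylated))
            + _x_log_y(methylated, level)
            + _x_log_y(unmethylated, 1.0 - level))


def likelihood_ratio_statistic(methylated, total, null_level=0.5):
    """Twice the log-likelihood gap between the observed proportion and null_level."""
    observed_level = methylated / total
    return 2.0 * (binomial_log_likelihood(methylated, total, observed_level)
                  - binomial_log_likelihood(methylated, total, null_level))


shallow = likelihood_ratio_statistic(3, 4)     # 75% methylated, 4 reads
deep = likelihood_ratio_statistic(75, 100)     # 75% methylated, 100 reads
print(shallow, deep)
```

A method that reduces each site to a proportion would treat these two sites identically, whereas a count-based likelihood weights the deeply covered site far more heavily.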

Subject(s)

Chromosome Mapping/methods , DNA Methylation , Genomics/methods , Quantitative Trait Loci , Animals , Papio/genetics , Wolves/genetics

19.

A deep ensemble model to predict miRNA-disease association.

Fu, Laiyi; Peng, Qinke.

Sci Rep ; 7(1): 14482, 2017 11 03.

Article in English | MEDLINE | ID: mdl-29101378

ABSTRACT

Cumulative evidence from biological experiments has confirmed that microRNAs (miRNAs) are related to many types of human diseases through different biological processes. It is anticipated that precise miRNA-disease association prediction could not only help infer potential disease-related miRNAs but also improve the diagnosis and prevention of human disease. Considering the limitations of previous computational models, a more effective computational model needs to be implemented to predict miRNA-disease associations. In this work, we first constructed a human miRNA-miRNA similarity network utilizing miRNA-miRNA functional similarity data and heterogeneous miRNA Gaussian interaction profile kernel similarities, based on the assumption that miRNAs with similar functions tend to be associated with similar diseases, and vice versa. Then, we constructed disease-disease similarity using disease semantic information and heterogeneous disease-related interaction data. We proposed a deep ensemble model called DeepMDA that extracts high-level features from similarity information using stacked autoencoders and then predicts miRNA-disease associations by adopting a 3-layer neural network. In addition to five-fold cross-validation, we also proposed another cross-validation method to evaluate the performance of the model. The results show that the proposed model is superior to previous methods with high robustness.
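The abstract does not spell out how the Gaussian interaction profile kernel similarity is computed. The sketch below assumes the commonly used definition, K(i, j) = exp(-gamma * ||IP(i) - IP(j)||^2) with gamma equal to a bandwidth parameter divided by the mean squared norm of the interaction profiles; the function name, the `gamma_prime` parameter, and the toy association matrix are hypothetical and are not taken from DeepMDA.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel over the rows of a binary association matrix.

    profiles[i] is the interaction profile of miRNA i (1 = known disease association).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in row) for row in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("at least one known association is required")
    gamma = gamma_prime / mean_sq_norm  # normalise bandwidth by the average profile norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


# Toy matrix: 3 miRNAs (rows) x 3 diseases (columns).
associations = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
similarity = gip_kernel(associations)
```

In this toy case the first two miRNAs share an identical profile and therefore receive similarity 1, while the third, which shares no disease with them, receives a much lower value. The same function applied to the transposed matrix would give a disease-disease kernel.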

Subject(s)

Disease , MicroRNAs/metabolism , Models, Biological , Area Under Curve , Humans , Machine Learning , ROC Curve

20.

Differential expression analysis for RNAseq using Poisson mixed models.

Sun, Shiquan; Hood, Michelle; Scott, Laura; Peng, Qinke; Mukherjee, Sayan; Tung, Jenny; Zhou, Xiang.

Nucleic Acids Res ; 45(11): e106, 2017 Jun 20.

Article in English | MEDLINE | ID: mdl-28369632

ABSTRACT

Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls type I error and is generally more powerful than other widely used approaches, except in small samples (n < 15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three datasets compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html.
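The notion of over-dispersion, and why the negative binomial counts as a simple hierarchical Poisson model, can be illustrated with a short simulation. This is a sketch under stated assumptions and not the MACAU algorithm: the mean of 10, the gamma shape of 2, the sample size, and all function names are illustrative choices. Drawing a gamma-distributed rate for each sample and then a Poisson count at that rate gives negative binomial counts whose variance exceeds their mean.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_gamma_poisson(mean, shape, rng):
    """Poisson count whose rate is gamma distributed (a negative binomial draw)."""
    rate = rng.gammavariate(shape, mean / shape)  # gamma with expectation `mean`
    return sample_poisson(rate, rng)


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


rng = random.Random(0)
plain = [sample_poisson(10.0, rng) for _ in range(5000)]
mixed = [sample_gamma_poisson(10.0, 2.0, rng) for _ in range(5000)]
plain_mean, plain_var = mean_and_variance(plain)
mixed_mean, mixed_var = mean_and_variance(mixed)
print("Poisson:       mean %.2f variance %.2f" % (plain_mean, plain_var))
print("gamma-Poisson: mean %.2f variance %.2f" % (mixed_mean, mixed_var))
```

Here each sample receives its own independent rate, which corresponds to the independent over-dispersion described in the abstract. Non-independence between samples would require correlated random effects, which this sketch does not model.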

Subject(s)

Gene Expression Profiling/methods , Sequence Analysis, RNA , Algorithms , Bayes Theorem , Computer Simulation , Humans , Linear Models , Markov Chains , Models, Genetic , Monte Carlo Method , Poisson Distribution , Software
