Results 1 - 20 of 25
1.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37587836

ABSTRACT

Recent studies have demonstrated the significant role that circRNA plays in the progression of human diseases. Identifying circRNA-disease associations (CDA) in an efficient manner can offer crucial insights into disease diagnosis. While traditional biological experiments can be time-consuming and labor-intensive, computational methods have emerged as a viable alternative in recent years. However, these methods are often limited by data sparsity and their inability to explore high-order information. In this paper, we introduce a novel method named Knowledge Graph Encoder from Transformer for predicting CDA (KGETCDA). Specifically, KGETCDA first integrates more than 10 databases to construct a large heterogeneous non-coding RNA dataset, which contains multiple relationships between circRNA, miRNA, lncRNA and disease. Then, a biological knowledge graph is created based on this dataset and Transformer-based knowledge representation learning and attentive propagation layers are applied to obtain high-quality embeddings with accurately captured high-order interaction information. Finally, multilayer perceptron is utilized to predict the matching scores of CDA based on their embeddings. Our empirical results demonstrate that KGETCDA significantly outperforms other state-of-the-art models. To enhance user experience, we have developed an interactive web-based platform named HNRBase that allows users to visualize, download data and make predictions using KGETCDA with ease. The code and datasets are publicly available at https://github.com/jinyangwu/KGETCDA.


Subjects
RNA, Circular , RNA, Long Noncoding , Humans , Pattern Recognition, Automated , Learning , Databases, Factual , Knowledge Bases , Computational Biology
2.
Nucleic Acids Res ; 50(3): e14, 2022 02 22.
Article in English | MEDLINE | ID: mdl-34792173

ABSTRACT

For many RNA molecules, the secondary structure is essential for the correct function of the RNA. Predicting RNA secondary structure from nucleotide sequences is a long-standing problem in genomics, but the prediction performance has reached a plateau over time. Traditional RNA secondary structure prediction algorithms are primarily based on thermodynamic models through free energy minimization, which imposes strong prior assumptions and is slow to run. Here, we propose a deep learning-based method, called UFold, for RNA secondary structure prediction, trained directly on annotated data and base-pairing rules. UFold proposes a novel image-like representation of RNA sequences, which can be efficiently processed by Fully Convolutional Networks (FCNs). We benchmark the performance of UFold on both within- and cross-family RNA datasets. It significantly outperforms previous methods on within-family datasets, while achieving a similar performance as the traditional methods when trained and tested on distinct RNA families. UFold is also able to predict pseudoknots accurately. Its prediction is fast with an inference time of about 160 ms per sequence up to 1500 bp in length. An online web server running UFold is available at https://ufold.ics.uci.edu. Code is available at https://github.com/uci-cbcl/UFold.


Subjects
Deep Learning , RNA , Algorithms , Base Pairing , Humans , Nucleic Acid Conformation , RNA/chemistry , RNA/genetics
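
The UFold abstract mentions an image-like representation of RNA sequences but does not spell it out. The following is a minimal sketch under an assumed construction, not one confirmed by the abstract: each base is one-hot encoded over A, U, C and G, and each of 16 channels is the outer product of one pair of base indicator vectors, giving a 16 x L x L array that a convolutional network could consume. The names `BASES`, `one_hot` and `pair_channels` are hypothetical.

```python
BASES = "AUCG"  # assumed base ordering; any fixed order works


def one_hot(seq):
    """Return an L x 4 list of 0/1 indicators; unknown bases map to all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def pair_channels(seq):
    """Build a 16 x L x L nested list.

    Channel 4*a + b holds a 1 at (i, j) when position i carries base a
    and position j carries base b, i.e. the outer product of two
    indicator vectors.
    """
    x = one_hot(seq)
    n = len(x)
    return [
        [[x[i][a] * x[j][b] for j in range(n)] for i in range(n)]
        for a in range(4)
        for b in range(4)
    ]


channels = pair_channels("GCAU")
print(len(channels), len(channels[0]), len(channels[0][0]))
```

For a sequence containing only A, U, C and G, exactly one channel is active at every (i, j) position, so the representation records which pair of bases sits at each candidate pairing site.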
3.
Nucleic Acids Res ; 50(21): e121, 2022 11 28.
Article in English | MEDLINE | ID: mdl-36130281

ABSTRACT

Multimodal single-cell sequencing technologies provide unprecedented information on cellular heterogeneity from multiple layers of genomic readouts. However, joint analysis of two modalities without properly handling the noise often leads to overfitting of one modality by the other and worse clustering results than vanilla single-modality analysis. How to efficiently utilize the extra information from single cell multi-omics to delineate cell states and identify meaningful signal remains as a significant computational challenge. In this work, we propose a deep learning framework, named SAILERX, for efficient, robust, and flexible analysis of multi-modal single-cell data. SAILERX consists of a variational autoencoder with invariant representation learning to correct technical noises from sequencing process, and a multimodal data alignment mechanism to integrate information from different modalities. Instead of performing hard alignment by projecting both modalities to a shared latent space, SAILERX encourages the local structures of two modalities measured by pairwise similarities to be similar. This strategy is more robust against overfitting of noises, which facilitates various downstream analysis such as clustering, imputation, and marker gene detection. Furthermore, the invariant representation learning part enables SAILERX to perform integrative analysis on both multi- and single-modal datasets, making it an applicable and scalable tool for more general scenarios.


Subjects
Genomics , Multiomics , Cluster Analysis , Single-Cell Analysis
4.
BMC Bioinformatics ; 23(Suppl 1): 206, 2022 May 31.
Article in English | MEDLINE | ID: mdl-35641900

ABSTRACT

BACKGROUND: The zone adjacent to a transcription start site (TSS), namely, the promoter, is primarily involved in the process of DNA transcription initiation and regulation. As a result, proper promoter identification is critical for further understanding the mechanism of the networks controlling genomic regulation. A number of methodologies for the identification of promoters have been proposed. Nonetheless, due to the great heterogeneity existing in promoters, the results of these procedures are still unsatisfactory. In order to establish additional discriminative characteristics and properly recognize promoters, we developed the hybrid model for promoter identification (HMPI), a hybrid deep learning model that can characterize both the native sequences of promoters and the morphological outline of promoters at the same time. We developed the HMPI to combine a method called the PSFN (promoter sequence features network), which characterizes native promoter sequences and deduces sequence features, with a technique referred to as the DSPN (deep structural profiles network), which is specially structured to model the promoters in terms of their structural profile and to deduce their structural attributes. RESULTS: The HMPI was applied to human, plant and Escherichia coli K-12 strain datasets, and the findings showed that the HMPI was successful at extracting the features of the promoter while greatly enhancing the promoter identification performance. In addition, after the improvements of synthetic sampling, transfer learning and label smoothing regularization, the improved HMPI models achieved good results in identifying subtypes of promoters on prokaryotic promoter datasets. 
CONCLUSIONS: The results showed that the HMPI was successful at extracting the features of promoters while greatly enhancing the performance of identifying promoters on both eukaryotic and prokaryotic datasets, and the improved HMPI models are good at identifying subtypes of promoters on prokaryotic promoter datasets. The HMPI is additionally adaptable to different biological functional sequences, allowing for the addition of new features or models.


Subjects
Deep Learning , Escherichia coli K12 , Escherichia coli/genetics , Escherichia coli K12/genetics , Humans , Promoter Regions, Genetic , Sequence Analysis, DNA , Transcription Initiation Site
5.
Bioinformatics ; 37(3): 296-302, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32790868

ABSTRACT

MOTIVATION: Identifying cis-acting genetic variants associated with gene expression levels, an analysis commonly referred to as expression quantitative trait loci (eQTL) mapping, is an important first step toward understanding the genetic determinants of gene expression variation. Successful eQTL mapping requires effective control of confounding factors. A common method for controlling confounding effects in eQTL mapping studies is probabilistic estimation of expression residuals (PEER) analysis. PEER analysis extracts PEER factors to serve as surrogates for confounding factors, which are then included in the subsequent eQTL mapping analysis. However, it is computationally challenging to determine the optimal number of PEER factors to use for eQTL mapping. In particular, the standard approach examines one candidate number at a time and chooses the number that optimizes eQTL discovery. Unfortunately, this standard approach involves multiple repetitive eQTL mapping procedures that are computationally expensive, restricting its use in the large-scale eQTL mapping studies that are being collected today. RESULTS: Here, we present a simple and computationally scalable alternative, Effect size Correlation for COnfounding determination (ECCO), to determine the optimal number of PEER factors used for eQTL mapping studies. Instead of performing repetitive eQTL mapping, ECCO jointly applies differential expression analysis and Mendelian randomization analysis, leading to substantial computational savings. In simulations and real data applications, we show that ECCO identifies a similar number of PEER factors required for eQTL mapping analysis as the standard approach but is two orders of magnitude faster. The computational scalability of ECCO allows for optimized eQTL discovery across 48 GTEx tissues for the first time, yielding an overall 5.89% power gain in the number of eQTL-harboring genes (eGenes) discovered compared with the previous GTEx recommendation, which does not attempt to determine the tissue-specific optimal number of PEER factors. AVAILABILITY AND IMPLEMENTATION: Our method is implemented in the ECCO software, which, along with its GTEx mapping results, is freely available at www.xzlab.org/software.html. All R scripts used in this study are also available at this site. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Mendelian Randomization Analysis , Quantitative Trait Loci , Gene Expression , Gene Expression Profiling , Genome-Wide Association Study , Polymorphism, Single Nucleotide , Software
6.
Bioinformatics ; 37(Suppl_1): i317-i326, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252968

ABSTRACT

MOTIVATION: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) provides new opportunities to dissect epigenomic heterogeneity and elucidate transcriptional regulatory mechanisms. However, computational modeling of scATAC-seq data is challenging due to its high dimension, extreme sparsity, complex dependencies and high sensitivity to confounding factors from various sources. RESULTS: Here, we propose a new deep generative model framework, named SAILER, for analyzing scATAC-seq data. SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors like read depth and batch effects. SAILER adopts the conventional encoder-decoder framework to learn the latent representation but imposes additional constraints to ensure the independence of the learned representations from the confounding factors. Experimental results on both simulated and real scATAC-seq datasets demonstrate that SAILER learns better and biologically more meaningful representations of cells than other methods. Its noise-free cell embeddings bring in significant benefits in downstream analyses: clustering and imputation based on SAILER result in 6.9% and 18.5% improvements over existing methods, respectively. Moreover, because no matrix factorization is involved, SAILER can easily scale to process millions of cells. We implemented SAILER into a software package, freely available to all for large-scale scATAC-seq data analysis. AVAILABILITY AND IMPLEMENTATION: The software is publicly available at https://github.com/uci-cbcl/SAILER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Chromatin Immunoprecipitation Sequencing , Single-Cell Analysis , Epigenomics , Sequence Analysis, RNA , Software , Transposases
7.
Nucleic Acids Res ; 45(11): e106, 2017 Jun 20.
Article in English | MEDLINE | ID: mdl-28369632

ABSTRACT

Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls for type I error and is generally more powerful than other widely used approaches, except in small samples (n <15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three data compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html.


Subjects
Gene Expression Profiling/methods , Sequence Analysis, RNA , Algorithms , Bayes Theorem , Computer Simulation , Humans , Linear Models , Markov Chains , Models, Genetic , Monte Carlo Method , Poisson Distribution , Software
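
The MACAU abstract describes a Poisson mixed model with two random effects terms but gives no formula. One way such a model could be written is sketched below; all notation is an assumption introduced here for illustration, and the authors' parameterisation may differ. Here $y_i$ is the read count of a gene in sample $i$, $N_i$ the total read count of that sample, $\mathbf{w}_i$ a covariate vector, $x_i$ the predictor of interest, $\mathbf{K}$ a known sample relatedness matrix, and $h^2$ the share of random-effect variance attributed to $\mathbf{K}$.

```latex
\begin{aligned}
y_i \mid N_i, \lambda_i &\sim \mathrm{Poisson}(N_i \lambda_i), \\
\log \lambda_i &= \mathbf{w}_i^{\top} \boldsymbol{\alpha} + x_i \beta + g_i + e_i, \\
\mathbf{g} &\sim \mathcal{N}\!\left(\mathbf{0},\, \sigma^2 h^2 \mathbf{K}\right), \\
\mathbf{e} &\sim \mathcal{N}\!\left(\mathbf{0},\, \sigma^2 (1 - h^2) \mathbf{I}\right).
\end{aligned}
```

In this form $\mathbf{g}$ captures sample non-independence, $\mathbf{e}$ captures independent over-dispersion, and a differential expression test corresponds to testing $\beta = 0$.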
8.
Biochem Biophys Res Commun ; 471(3): 368-72, 2016 Mar 11.
Article in English | MEDLINE | ID: mdl-26869516

ABSTRACT

Alternative splicing (AS) is an important mechanism of gene regulation that contributes to protein diversity. Accurately recognizing the different kinds of AS is of great significance for understanding the mechanism of gene regulation. Many in silico methods using a vast range of features have been applied to detecting AS, but the results are far from satisfactory. In this paper, we used features shown to be useful for recognizing AS in the previous literature and proposed a hybrid method combining Gene Expression Programming (GEP) and Random Forests (RF) to distinguish constitutive exons from cassette exons, the most common AS phenomenon. GEP first makes predictions for the samples with a strong signal, and the remaining samples with a weak signal are then classified by a more complex RF-based classifier. The experimental results indicate that this method can greatly improve recognition performance on this problem.


Subjects
Alternative Splicing/genetics , DNA/genetics , Exons/genetics , Models, Genetic , Pattern Recognition, Automated/methods , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Computer Simulation , Models, Statistical , Molecular Sequence Data
9.
J Theor Biol ; 358: 85-92, 2014 Oct 07.
Article in English | MEDLINE | ID: mdl-24954019

ABSTRACT

The haplotype assembly problem has been proven to be complex. Heuristic algorithms are the main methods that are used to solve the problem. These algorithms perform well when the SNP fragments are error-free, but they are less accurate when the error rate increases. The complex relationships caused by fragment errors present a major barrier to assembling accurate haplotypes. Therefore, modeling the complex relationships is the key to solve the problem. In this study, we model the haplotype assembly problem using hypergraph partitioning formulations and propose a novel method termed HGHap (Hypergraph-based Haplotype assembly method). HGHap approaches the haplotype assembly problem in two phases. In the first phase, a hypergraph is constructed in which each vertex corresponds to a fragment and vertices are multiply connected to form hyperedges. In the second phase, a hypergraph partitioning algorithm is employed to obtain two groups of fragments to construct the haplotypes. The hyperedges capture higher-order relationships among fragments that facilitate the subsequent partitioning. Our results demonstrate that the method performs better than other methods in most cases, especially in cases with a high error rate.


Subjects
Algorithms , Haplotypes , Human Genome Project , Humans
10.
IEEE J Biomed Health Inform ; 27(11): 5655-5664, 2023 11.
Article in English | MEDLINE | ID: mdl-37669210

ABSTRACT

Non-coding RNAs (ncRNAs) are a class of RNA molecules that lack the ability to encode proteins in human cells but play crucial roles in various biological processes. Understanding the interactions between different ncRNAs and their impact on diseases can significantly contribute to the diagnosis, prevention, and treatment of diseases. However, predicting tertiary interactions between ncRNAs and diseases based on structural information at multiple scales remains a challenging task. To address this challenge, we propose a method called BertNDA, aiming to predict potential relationships between miRNAs, lncRNAs, and diseases. The framework identifies local information through connectionless subgraphs, which aggregate the features of neighbor nodes. Global information is extracted by leveraging the Laplace transform of graph structures and WL (Weisfeiler-Lehman) absolute role coding. Additionally, an EMLP (Element-wise MLP) structure is designed to fuse pairwise global information. A transformer encoder is employed as the backbone of our approach, followed by a prediction layer that outputs the final correlation score. Extensive experiments demonstrate that BertNDA outperforms state-of-the-art methods in prediction assignment and exhibits significant potential for various biological applications. Moreover, we develop an online prediction platform that incorporates the prediction model, providing users with an intuitive and interactive experience. Overall, our model offers an efficient, accurate, and comprehensive tool for predicting tertiary associations between ncRNAs and diseases.


Subjects
MicroRNAs , RNA, Long Noncoding , Humans , Electric Power Supplies
11.
Article in English | MEDLINE | ID: mdl-37040244

ABSTRACT

General graph neural networks (GNNs) implement convolution operations on graphs based on polynomial spectral filters. Existing filters with high-order polynomial approximations can detect more structural information when reaching high-order neighborhoods but produce indistinguishable representations of nodes, which indicates their inefficiency of processing information in high-order neighborhoods, resulting in performance degradation. In this article, we theoretically identify the feasibility of avoiding this problem and attribute it to overfitting polynomial coefficients. To cope with it, the coefficients are restricted in two steps, dimensionality reduction of the coefficients' domain and sequential assignment of the forgetting factor. We transform the optimization of coefficients to the tuning of a hyperparameter and propose a flexible spectral-domain graph filter, which significantly reduces the memory demand and the adverse impacts on message transmission under large receptive fields. Utilizing our filter, the performance of GNNs is improved significantly in large receptive fields and the receptive fields of GNNs are multiplied as well. Meanwhile, the superiority of applying a high-order approximation is verified across various datasets, notably in strongly hyperbolic datasets. Codes are publicly available at: https://github.com/cengzeyuan/TNNLS-FFKSF.

12.
Article in English | MEDLINE | ID: mdl-37027676

ABSTRACT

Long non-coding RNAs (LncRNAs) serve a vital role in regulating gene expressions and other biological processes. Differentiation of lncRNAs from protein-coding transcripts helps researchers dig into the mechanism of lncRNA formation and its downstream regulations related to various diseases. Previous works have been proposed to identify lncRNAs, including traditional bio-sequencing and machine learning approaches. Considering the tedious work of biological characteristic-based feature extraction procedures and inevitable artifacts during bio-sequencing processes, those lncRNA detection methods are not always satisfactory. Hence, in this work, we presented lncDLSM, a deep learning-based framework differentiating lncRNA from other protein-coding transcripts without dependencies on prior biological knowledge. lncDLSM is a helpful tool for identifying lncRNAs compared with other biological feature-based machine learning methods and can be applied to other species by transfer learning achieving satisfactory results. Further experiments showed that different species display distinct boundaries among distributions corresponding to the homology and the specificity among species, respectively. An online web server is provided to the community for easy use and efficient identification of lncRNA, available at http://39.106.16.168/lncDLSM.

13.
IEEE Trans Cybern ; 53(4): 2186-2199, 2023 Apr.
Article in English | MEDLINE | ID: mdl-34587108

ABSTRACT

With the rapid development of the Internet, readers tend to share their views and emotions about news events. Predicting these emotions provides a vital role in social media applications (e.g., sentiment retrieval, opinion summary, and election prediction). However, news articles usually consist of objective texts that lack emotion words, making emotion prediction challenging. From prior studies, we know that comments that come directly from readers are full of emotions. Therefore, in this article, we propose a deep learning framework that first merges article and comment information to predict readers' emotions. At the same time, in the prediction process, we design a pseudo comment representation for unpublished news articles by the comments of published news. In addition, a better model is required to encode articles that contain implicit emotions. To solve this problem, we propose a block emotion attention network (BEAN) to encode news articles better. It includes an emotion attention mechanism and a hierarchical structure to capture emotion words and generate structural information during encoding. Experiments performed on three public datasets show that BEAN achieves the state-of-the-art average Pearson (AP) and accuracy (Acc@1). Moreover, results on four self-collected datasets show that both the introduction of emotional comments and BEAN in our framework improve the ability to predict readers' emotions.


Subjects
Deep Learning , Social Media , Humans , Emotions
14.
Commun Biol ; 5(1): 608, 2022 06 20.
Article in English | MEDLINE | ID: mdl-35725901

ABSTRACT

Topologically associating domains (TADs) are fundamental building blocks of the three-dimensional genome and are organized into complex hierarchies. Identifying hierarchical TADs in Hi-C data helps to understand the relationship between genome architecture and gene regulation. Herein we propose TADfit, a multivariate linear regression model for profiling hierarchical chromatin domains. TADfit fits the interaction frequencies in a Hi-C contact matrix, with or without replicates, using all possible hierarchical TADs; the significant ones are then determined from the regression coefficients, which are obtained with an online learning solver called Follow-The-Regularized-Leader (FTRL). Unlike existing methods, TADfit can handle multiple contact matrix replicates and find partially overlapping TADs in them, which helps to recover the comprehensive set of underlying TADs across replicates from different experiments. The comparative results show that TADfit has better accuracy and reproducibility, and that the hierarchical TADs it calls exhibit reasonable biological relevance.


Subjects
Chromatin , Chromosomes , Chromatin/genetics , Genome , Linear Models , Reproducibility of Results
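
The TADfit abstract names Follow-The-Regularized-Leader (FTRL) as the solver for the regression coefficients but gives no details. The sketch below is one common per-coordinate FTRL-Proximal update applied to squared loss; it is an illustration under that assumption and may differ from the authors' solver. The class name `FTRLRegressor` and the hyperparameters `alpha`, `beta`, `l1` and `l2` are hypothetical choices made here.

```python
import math


class FTRLRegressor:
    """Per-coordinate FTRL-Proximal for linear regression with squared loss."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated (adjusted) gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weights(self):
        """Closed-form weights; the L1 term zeroes out weak coordinates."""
        w = []
        for z, n in zip(self.z, self.n):
            if abs(z) <= self.l1:
                w.append(0.0)
            else:
                sign = 1.0 if z > 0 else -1.0
                denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
                w.append(-(z - sign * self.l1) / denom)
        return w

    def update(self, x, y):
        """Consume one example (feature list x, target y)."""
        w = self.weights()
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g = err * xi  # gradient of 0.5 * err**2 with respect to w[i]
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w[i]
            self.n[i] += g * g


model = FTRLRegressor(dim=1)
model.update([1.0], 1.0)
print(model.weights())
```

With a non-zero `l1`, coordinates whose accumulated gradient stays small keep a weight of exactly zero, which is the property that would let a regression over many candidate domains retain only the significant ones.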
15.
J Bioinform Comput Biol ; 20(1): 2150036, 2022 02.
Article in English | MEDLINE | ID: mdl-34939905

ABSTRACT

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment, to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.


Subjects
Heuristics , Software , Algorithms , Cluster Analysis , Sequence Alignment
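
To illustrate the kind of heuristic seed-based clustering the edClust abstract refers to, here is a minimal sketch of greedy incremental clustering. It makes two simplifying assumptions that are not taken from the abstract: it uses a plain global Levenshtein distance written in pure Python rather than Edlib's semi-global alignment, and it processes sequences longest first. The functions `edit_distance` and `greedy_cluster` and the threshold `max_dist` are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, max_dist):
    """Assign each sequence to the first seed within max_dist edits.

    A sequence that matches no existing seed becomes a new seed.
    Returns a list of (seed, members) tuples.
    """
    clusters = []
    for s in sorted(seqs, key=len, reverse=True):
        for seed, members in clusters:
            if edit_distance(seed, s) <= max_dist:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters


clusters = greedy_cluster(
    ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC", "ACGTACG"], max_dist=1
)
for seed, members in clusters:
    print(seed, members)
```

Because each sequence is compared only against seeds, the number of clusters depends on input order and on the threshold, which is one reason heuristic methods can over-split the data.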
16.
Comput Math Methods Med ; 2021: 7471516, 2021.
Article in English | MEDLINE | ID: mdl-34394707

ABSTRACT

High-throughput data make it possible to study expression levels of thousands of genes simultaneously under a particular condition. However, only few of the genes are discriminatively expressed. How to identify these biomarkers precisely is significant for disease diagnosis, prognosis, and therapy. Many studies utilized pathway information to identify the biomarkers. However, most of these studies only incorporate the group information while the pathway structural information is ignored. In this paper, we proposed a Bayesian gene selection with a network-constrained regularization method, which can incorporate the pathway structural information as priors to perform gene selection. All the priors are conjugated; thus, the parameters can be estimated effectively through Gibbs sampling. We present the application of our method on 6 microarray datasets, comparing with Bayesian Lasso, Bayesian Elastic Net, and Bayesian Fused Lasso. The results show that our method performs better than other Bayesian methods and pathway structural information can improve the result.


Subjects
Bayes Theorem , Gene Regulatory Networks , Genetic Markers , Biomarkers, Tumor/genetics , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Female , Gene Expression Profiling , Genetic Predisposition to Disease , Humans , Male , Models, Genetic , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis/statistics & numerical data
17.
IEEE/ACM Trans Comput Biol Bioinform ; 17(5): 1721-1728, 2020.
Article in English | MEDLINE | ID: mdl-30951477

ABSTRACT

DNA methylation plays an important role in the regulation of some biological processes. With the development of machine learning, several sequence-based deep learning models have been designed to predict DNA methylation states, and they achieve better performance than traditional methods such as random forests and SVMs. However, convolutional deep learning models that take one-hot encoded DNA sequences as their only input may discover limited information and yield unsatisfactory prediction performance, so more data types and more diverse model structures should be considered. In this work, we propose a hybrid sequence-based deep learning model that uses both MeDIP-seq data and histone information to predict the methylation states of CpG sites (MHCpG). We combine MeDIP-seq data and histone modification data with sequence information and apply a convolutional network to discover sequence patterns. In addition, we use statistical data derived from these three inputs and adopt a 3-layer feedforward neural network to extract more high-level features. We compared our method with a traditional random forest predictor and with previous methods such as CpGenie and DeepCpG; the results show that MHCpG outperforms the other approaches.


Subjects
DNA Methylation/genetics , Deep Learning , High-Throughput Nucleotide Sequencing/methods , Histone Code/genetics , Sequence Analysis, DNA/methods , Cell Line, Tumor , Computational Biology/methods , DNA/genetics , Humans
18.
Sci Adv ; 6(51)2020 12.
Article in English | MEDLINE | ID: mdl-33355120

ABSTRACT

Characterizing genome-wide binding profiles of transcription factors (TFs) is essential for understanding biological processes. Although techniques have been developed to assess binding profiles within a population of cells, determining them at a single-cell level remains elusive. Here, we report scFAN (single-cell factor analysis network), a deep learning model that predicts genome-wide TF binding profiles in individual cells. scFAN is pretrained on genome-wide bulk assay for transposase-accessible chromatin sequencing (ATAC-seq), DNA sequence, and chromatin immunoprecipitation sequencing (ChIP-seq) data and uses single-cell ATAC-seq to predict TF binding in individual cells. We demonstrate the efficacy of scFAN by both studying sequence motifs enriched within predicted binding peaks and using predicted TFs for discovering cell types. We develop a new metric "TF activity score" to characterize each cell and show that activity scores can reliably capture cell identities. scFAN allows us to discover and study cellular identities and heterogeneity based on chromatin accessibility profiles.

19.
Genome Biol ; 20(1): 220, 2019 10 24.
Article in English | MEDLINE | ID: mdl-31651351

ABSTRACT

Identifying genetic variants that are associated with methylation variation, an analysis commonly referred to as methylation quantitative trait locus (mQTL) mapping, is important for understanding the epigenetic mechanisms underlying genotype-trait associations. Here, we develop a statistical method, IMAGE, for mQTL mapping in sequencing-based methylation studies. IMAGE properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery. We compare IMAGE with existing approaches through extensive simulation. We also apply IMAGE to analyze two bisulfite sequencing studies, in which IMAGE identifies more mQTL than existing approaches.


Subjects
Chromosome Mapping/methods , DNA Methylation , Genomics/methods , Quantitative Trait Loci , Animals , Papio/genetics , Wolves/genetics
20.
Sci Rep ; 7(1): 14482, 2017 11 03.
Article in English | MEDLINE | ID: mdl-29101378

ABSTRACT

Cumulative evidence from biological experiments has confirmed that microRNAs (miRNAs) are related to many types of human diseases through different biological processes. It is anticipated that precise miRNA-disease association prediction could not only help infer potential disease-related miRNA but also boost human diagnosis and disease prevention. Considering the limitations of previous computational models, a more effective computational model needs to be implemented to predict miRNA-disease associations. In this work, we first constructed a human miRNA-miRNA similarity network utilizing miRNA-miRNA functional similarity data and heterogeneous miRNA Gaussian interaction profile kernel similarities based on the assumption that similar miRNAs with similar functions tend to be associated with similar diseases, and vice versa. Then, we constructed disease-disease similarity using disease semantic information and heterogeneous disease-related interaction data. We proposed a deep ensemble model called DeepMDA that extracts high-level features from similarity information using stacked autoencoders and then predicts miRNA-disease associations by adopting a 3-layer neural network. In addition to five-fold cross-validation, we also proposed another cross-validation method to evaluate the performance of the model. The results show that the proposed model is superior to previous methods with high robustness.


Subjects
Disease , MicroRNAs/metabolism , Models, Biological , Area Under Curve , Humans , Machine Learning , Neural Networks, Computer , ROC Curve
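
The DeepMDA abstract refers to Gaussian interaction profile (GIP) kernel similarities without defining them. A minimal sketch of the commonly used form follows, assuming that each miRNA (or disease) is described by a binary row of a known association matrix and that the kernel bandwidth is normalised by the mean squared norm of those rows. The function name `gip_kernel` and the parameter `gamma_prime` are hypothetical, and the abstract does not state which bandwidth the authors used.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between rows of an association matrix.

    profiles: list of equal-length 0/1 lists, one per entity.
    Assumes at least one profile is non-zero, otherwise the bandwidth
    normalisation would divide by zero.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq_norm  # bandwidth scaled by the average profile size

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [[math.exp(-gamma * sq_dist(p, q)) for q in profiles] for p in profiles]


# Three toy entities; the first two share an identical interaction profile.
K = gip_kernel([[1, 0, 1], [1, 0, 1], [0, 1, 0]])
for row in K:
    print([round(v, 4) for v in row])
```

Entities with identical profiles get similarity 1, and similarity decays with the number of associations on which two profiles disagree.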