Results 1 - 20 of 25
1.
IEEE J Biomed Health Inform ; 27(11): 5655-5664, 2023 11.
Article in English | MEDLINE | ID: mdl-37669210

ABSTRACT

Non-coding RNAs (ncRNAs) are a class of RNA molecules that lack the ability to encode proteins in human cells, but play crucial roles in various biological processes. Understanding the interactions between different ncRNAs and their impact on diseases can significantly contribute to the diagnosis, prevention, and treatment of diseases. However, predicting tertiary interactions between ncRNAs and diseases based on structural information at multiple scales remains a challenging task. To address this challenge, we propose a method called BertNDA, aiming to predict potential relationships between miRNAs, lncRNAs, and diseases. The framework identifies local information through a connectionless subgraph, which aggregates the features of neighbor nodes. Global information is extracted by leveraging the Laplace transform of graph structures and WL (Weisfeiler-Lehman) absolute role coding. Additionally, an EMLP (Element-wise MLP) structure is designed to fuse pairwise global information. The transformer encoder is employed as the backbone of our approach, followed by a prediction layer that outputs the final correlation score. Extensive experiments demonstrate that BertNDA outperforms state-of-the-art methods in prediction assignment and exhibits significant potential for various biological applications. Moreover, we develop an online prediction platform that incorporates the prediction model, providing users with an intuitive and interactive experience. Overall, our model offers an efficient, accurate, and comprehensive tool for predicting tertiary associations between ncRNAs and diseases.


Subject(s)
MicroRNAs , Long Noncoding RNA , Humans , Electric Power Supplies
2.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37587836

ABSTRACT

Recent studies have demonstrated the significant role that circRNA plays in the progression of human diseases. Identifying circRNA-disease associations (CDA) in an efficient manner can offer crucial insights into disease diagnosis. While traditional biological experiments can be time-consuming and labor-intensive, computational methods have emerged as a viable alternative in recent years. However, these methods are often limited by data sparsity and their inability to explore high-order information. In this paper, we introduce a novel method named Knowledge Graph Encoder from Transformer for predicting CDA (KGETCDA). Specifically, KGETCDA first integrates more than 10 databases to construct a large heterogeneous non-coding RNA dataset, which contains multiple relationships between circRNA, miRNA, lncRNA and disease. Then, a biological knowledge graph is created based on this dataset and Transformer-based knowledge representation learning and attentive propagation layers are applied to obtain high-quality embeddings with accurately captured high-order interaction information. Finally, multilayer perceptron is utilized to predict the matching scores of CDA based on their embeddings. Our empirical results demonstrate that KGETCDA significantly outperforms other state-of-the-art models. To enhance user experience, we have developed an interactive web-based platform named HNRBase that allows users to visualize, download data and make predictions using KGETCDA with ease. The code and datasets are publicly available at https://github.com/jinyangwu/KGETCDA.


Subject(s)
Circular RNA , Long Noncoding RNA , Humans , Automated Pattern Recognition , Learning , Factual Databases , Knowledge Bases , Computational Biology
3.
Article in English | MEDLINE | ID: mdl-37040244

ABSTRACT

General graph neural networks (GNNs) implement convolution operations on graphs based on polynomial spectral filters. Existing filters with high-order polynomial approximations can detect more structural information when reaching high-order neighborhoods but produce indistinguishable representations of nodes, which indicates their inefficiency of processing information in high-order neighborhoods, resulting in performance degradation. In this article, we theoretically identify the feasibility of avoiding this problem and attribute it to overfitting polynomial coefficients. To cope with it, the coefficients are restricted in two steps, dimensionality reduction of the coefficients' domain and sequential assignment of the forgetting factor. We transform the optimization of coefficients to the tuning of a hyperparameter and propose a flexible spectral-domain graph filter, which significantly reduces the memory demand and the adverse impacts on message transmission under large receptive fields. Utilizing our filter, the performance of GNNs is improved significantly in large receptive fields and the receptive fields of GNNs are multiplied as well. Meanwhile, the superiority of applying a high-order approximation is verified across various datasets, notably in strongly hyperbolic datasets. Codes are publicly available at: https://github.com/cengzeyuan/TNNLS-FFKSF.

4.
Article in English | MEDLINE | ID: mdl-37027676

ABSTRACT

Long non-coding RNAs (LncRNAs) serve a vital role in regulating gene expressions and other biological processes. Differentiation of lncRNAs from protein-coding transcripts helps researchers dig into the mechanism of lncRNA formation and its downstream regulations related to various diseases. Previous works have been proposed to identify lncRNAs, including traditional bio-sequencing and machine learning approaches. Considering the tedious work of biological characteristic-based feature extraction procedures and inevitable artifacts during bio-sequencing processes, those lncRNA detection methods are not always satisfactory. Hence, in this work, we presented lncDLSM, a deep learning-based framework differentiating lncRNA from other protein-coding transcripts without dependencies on prior biological knowledge. lncDLSM is a helpful tool for identifying lncRNAs compared with other biological feature-based machine learning methods and can be applied to other species by transfer learning achieving satisfactory results. Further experiments showed that different species display distinct boundaries among distributions corresponding to the homology and the specificity among species, respectively. An online web server is provided to the community for easy use and efficient identification of lncRNA, available at http://39.106.16.168/lncDLSM.

5.
IEEE Trans Cybern ; 53(4): 2186-2199, 2023 Apr.
Article in English | MEDLINE | ID: mdl-34587108

ABSTRACT

With the rapid development of the Internet, readers tend to share their views and emotions about news events. Predicting these emotions plays a vital role in social media applications (e.g., sentiment retrieval, opinion summary, and election prediction). However, news articles usually consist of objective texts that lack emotion words, making emotion prediction challenging. From prior studies, we know that comments that come directly from readers are full of emotions. Therefore, in this article, we propose a deep learning framework that first merges article and comment information to predict readers' emotions. At the same time, in the prediction process, we design a pseudo comment representation for unpublished news articles from the comments of published news. In addition, a better model is required to encode articles that contain implicit emotions. To solve this problem, we propose a block emotion attention network (BEAN) to encode news articles better. It includes an emotion attention mechanism and a hierarchical structure to capture emotion words and generate structural information during encoding. Experiments performed on three public datasets show that BEAN achieves state-of-the-art average Pearson (AP) and accuracy (Acc@1). Moreover, results on four self-collected datasets show that both the introduction of emotional comments and BEAN in our framework improve the ability to predict readers' emotions.


Subject(s)
Deep Learning , Social Media , Humans , Emotions
6.
Nucleic Acids Res ; 50(21): e121, 2022 11 28.
Article in English | MEDLINE | ID: mdl-36130281

ABSTRACT

Multimodal single-cell sequencing technologies provide unprecedented information on cellular heterogeneity from multiple layers of genomic readouts. However, joint analysis of two modalities without properly handling the noise often leads to overfitting of one modality by the other and worse clustering results than vanilla single-modality analysis. How to efficiently utilize the extra information from single cell multi-omics to delineate cell states and identify meaningful signal remains as a significant computational challenge. In this work, we propose a deep learning framework, named SAILERX, for efficient, robust, and flexible analysis of multi-modal single-cell data. SAILERX consists of a variational autoencoder with invariant representation learning to correct technical noises from sequencing process, and a multimodal data alignment mechanism to integrate information from different modalities. Instead of performing hard alignment by projecting both modalities to a shared latent space, SAILERX encourages the local structures of two modalities measured by pairwise similarities to be similar. This strategy is more robust against overfitting of noises, which facilitates various downstream analysis such as clustering, imputation, and marker gene detection. Furthermore, the invariant representation learning part enables SAILERX to perform integrative analysis on both multi- and single-modal datasets, making it an applicable and scalable tool for more general scenarios.


Subject(s)
Genomics , Multiomics , Cluster Analysis , Single-Cell Analysis
7.
BMC Bioinformatics ; 23(Suppl 1): 206, 2022 May 31.
Article in English | MEDLINE | ID: mdl-35641900

ABSTRACT

BACKGROUND: The zone adjacent to a transcription start site (TSS), namely, the promoter, is primarily involved in the process of DNA transcription initiation and regulation. As a result, proper promoter identification is critical for further understanding the mechanism of the networks controlling genomic regulation. A number of methodologies for the identification of promoters have been proposed. Nonetheless, due to the great heterogeneity existing in promoters, the results of these procedures are still unsatisfactory. In order to establish additional discriminative characteristics and properly recognize promoters, we developed the hybrid model for promoter identification (HMPI), a hybrid deep learning model that can characterize both the native sequences of promoters and the morphological outline of promoters at the same time. We developed the HMPI to combine a method called the PSFN (promoter sequence features network), which characterizes native promoter sequences and deduces sequence features, with a technique referred to as the DSPN (deep structural profiles network), which is specially structured to model the promoters in terms of their structural profile and to deduce their structural attributes. RESULTS: The HMPI was applied to human, plant and Escherichia coli K-12 strain datasets, and the findings showed that the HMPI was successful at extracting the features of the promoter while greatly enhancing the promoter identification performance. In addition, after the improvements of synthetic sampling, transfer learning and label smoothing regularization, the improved HMPI models achieved good results in identifying subtypes of promoters on prokaryotic promoter datasets. 
CONCLUSIONS: The results showed that the HMPI was successful at extracting the features of promoters while greatly enhancing the performance of identifying promoters on both eukaryotic and prokaryotic datasets, and the improved HMPI models are good at identifying subtypes of promoters on prokaryotic promoter datasets. The HMPI is additionally adaptable to different biological functional sequences, allowing for the addition of new features or models.


Subject(s)
Deep Learning , Escherichia coli K12 , Escherichia coli/genetics , Escherichia coli K12/genetics , Humans , Genetic Promoter Regions , DNA Sequence Analysis , Transcription Initiation Site
8.
Commun Biol ; 5(1): 608, 2022 06 20.
Article in English | MEDLINE | ID: mdl-35725901

ABSTRACT

Topologically associating domains (TADs) are fundamental building blocks of the three-dimensional genome and are organized into complex hierarchies. Identifying hierarchical TADs in Hi-C data helps to understand the relationship between genome architecture and gene regulation. Herein we propose TADfit, a multivariate linear regression model for profiling hierarchical chromatin domains. It fits the interaction frequencies in a Hi-C contact matrix, with and without replicates, using all possible hierarchical TADs; the significant ones are determined from the regression coefficients obtained with an online learning solver called Follow-The-Regularized-Leader (FTRL). Unlike existing methods, TADfit can handle multiple contact matrix replicates and find partially overlapping TADs in them, which helps to find the comprehensive underlying TADs across replicates from different experiments. The comparative results show that TADfit has better accuracy and reproducibility, and that the hierarchical TADs it calls exhibit reasonable biological relevance.


Subject(s)
Chromatin , Chromosomes , Chromatin/genetics , Genome , Linear Models , Reproducibility of Results
9.
Nucleic Acids Res ; 50(3): e14, 2022 02 22.
Article in English | MEDLINE | ID: mdl-34792173

ABSTRACT

For many RNA molecules, the secondary structure is essential for the correct function of the RNA. Predicting RNA secondary structure from nucleotide sequences is a long-standing problem in genomics, but the prediction performance has reached a plateau over time. Traditional RNA secondary structure prediction algorithms are primarily based on thermodynamic models through free energy minimization, which imposes strong prior assumptions and is slow to run. Here, we propose a deep learning-based method, called UFold, for RNA secondary structure prediction, trained directly on annotated data and base-pairing rules. UFold proposes a novel image-like representation of RNA sequences, which can be efficiently processed by Fully Convolutional Networks (FCNs). We benchmark the performance of UFold on both within- and cross-family RNA datasets. It significantly outperforms previous methods on within-family datasets, while achieving a similar performance as the traditional methods when trained and tested on distinct RNA families. UFold is also able to predict pseudoknots accurately. Its prediction is fast with an inference time of about 160 ms per sequence up to 1500 bp in length. An online web server running UFold is available at https://ufold.ics.uci.edu. Code is available at https://github.com/uci-cbcl/UFold.
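The abstract above does not spell out the image-like representation, so the following is only a minimal sketch of one plausible construction: every pair of positions (i, j) is encoded by the outer product of the one-hot vectors of the two bases, giving 16 binary channels over an L x L grid. The function name `pairwise_image`, the base ordering, and the channel layout are assumptions made for illustration, not details taken from the abstract.

```python
BASES = "AUCG"  # assumed base ordering; any fixed order works


def pairwise_image(seq: str):
    """Return a 16 x L x L nested list of 0/1 values.

    Channel 4*a + b is 1 at (i, j) exactly when seq[i] is base a and
    seq[j] is base b. Positions holding an unknown symbol stay all-zero.
    """
    length = len(seq)
    idx = {base: k for k, base in enumerate(BASES)}
    image = [[[0] * length for _ in range(length)] for _ in range(16)]
    upper = seq.upper()
    for i, bi in enumerate(upper):
        for j, bj in enumerate(upper):
            if bi in idx and bj in idx:
                image[4 * idx[bi] + idx[bj]][i][j] = 1
    return image
```

A tensor of this shape can be fed to a 2D convolutional network in the same way as a multi-channel image, which is the property the abstract emphasizes.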


Subject(s)
Deep Learning , RNA , Algorithms , Base Pairing , Humans , Nucleic Acid Conformation , RNA/chemistry , RNA/genetics
10.
J Bioinform Comput Biol ; 20(1): 2150036, 2022 02.
Article in English | MEDLINE | ID: mdl-34939905

ABSTRACT

The development of high-throughput technologies has produced increasing amounts of sequence data and an increasing need for efficient clustering algorithms that can process massive volumes of sequencing data for downstream analysis. Heuristic clustering methods are widely applied for sequence clustering because of their low computational complexity. Although numerous heuristic clustering methods have been developed, they suffer from two limitations: overestimation of inferred clusters and low clustering sensitivity. To address these issues, we present a new sequence clustering method (edClust) based on Edlib, a C/C++ library for fast, exact semi-global sequence alignment to group similar sequences. The new method edClust was tested on three large-scale sequence databases, and we compared edClust to several classic heuristic clustering methods, such as UCLUST, CD-HIT, and VSEARCH. Evaluations based on the metrics of cluster number and seed sensitivity (SS) demonstrate that edClust can produce fewer clusters than other methods and that its SS is higher than that of other methods. The source code of edClust is available from https://github.com/zhang134/EdClust.git under the GNU GPL license.
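As a rough illustration of the greedy, seed-based heuristic clustering that tools in this family rely on, the sketch below assigns each sequence to the first seed within a distance threshold and otherwise promotes it to a new seed. It is not the edClust implementation: the function names, the `max_dist` parameter, the longest-first ordering, and the plain global Levenshtein distance (standing in for Edlib's semi-global alignment) are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Global Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def greedy_cluster(seqs, max_dist):
    """Greedy seed-based clustering.

    Each sequence joins the first existing seed within max_dist edits;
    if none qualifies, the sequence becomes the seed of a new cluster.
    Returns a list of (seed, members) tuples.
    """
    clusters = []
    # Longest sequences first, so that they tend to become seeds (assumed heuristic).
    for s in sorted(seqs, key=len, reverse=True):
        for seed, members in clusters:
            if edit_distance(seed, s) <= max_dist:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters
```

Because every sequence is compared only against seeds, the cost grows with the number of clusters rather than the number of sequence pairs, which is why methods of this kind stay cheap on large inputs.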


Subject(s)
Heuristics , Software , Algorithms , Cluster Analysis , Sequence Alignment
11.
Comput Math Methods Med ; 2021: 7471516, 2021.
Article in English | MEDLINE | ID: mdl-34394707

ABSTRACT

High-throughput data make it possible to study expression levels of thousands of genes simultaneously under a particular condition. However, only few of the genes are discriminatively expressed. How to identify these biomarkers precisely is significant for disease diagnosis, prognosis, and therapy. Many studies utilized pathway information to identify the biomarkers. However, most of these studies only incorporate the group information while the pathway structural information is ignored. In this paper, we proposed a Bayesian gene selection with a network-constrained regularization method, which can incorporate the pathway structural information as priors to perform gene selection. All the priors are conjugated; thus, the parameters can be estimated effectively through Gibbs sampling. We present the application of our method on 6 microarray datasets, comparing with Bayesian Lasso, Bayesian Elastic Net, and Bayesian Fused Lasso. The results show that our method performs better than other Bayesian methods and pathway structural information can improve the result.


Subject(s)
Bayes Theorem , Gene Regulatory Networks , Genetic Markers , Tumor Biomarkers/genetics , Computational Biology , Computer Simulation , Genetic Databases/statistics & numerical data , Female , Gene Expression Profiling , Genetic Predisposition to Disease , Humans , Male , Genetic Models , Neoplasms/genetics , Oligonucleotide Array Sequence Analysis/statistics & numerical data
12.
Bioinformatics ; 37(Suppl_1): i317-i326, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252968

ABSTRACT

MOTIVATION: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) provides new opportunities to dissect epigenomic heterogeneity and elucidate transcriptional regulatory mechanisms. However, computational modeling of scATAC-seq data is challenging due to its high dimension, extreme sparsity, complex dependencies and high sensitivity to confounding factors from various sources. RESULTS: Here, we propose a new deep generative model framework, named SAILER, for analyzing scATAC-seq data. SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors like read depth and batch effects. SAILER adopts the conventional encoder-decoder framework to learn the latent representation but imposes additional constraints to ensure the independence of the learned representations from the confounding factors. Experimental results on both simulated and real scATAC-seq datasets demonstrate that SAILER learns better and biologically more meaningful representations of cells than other methods. Its noise-free cell embeddings bring in significant benefits in downstream analyses: clustering and imputation based on SAILER result in 6.9% and 18.5% improvements over existing methods, respectively. Moreover, because no matrix factorization is involved, SAILER can easily scale to process millions of cells. We implemented SAILER into a software package, freely available to all for large-scale scATAC-seq data analysis. AVAILABILITY AND IMPLEMENTATION: The software is publicly available at https://github.com/uci-cbcl/SAILER. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Chromatin Immunoprecipitation Sequencing , Single-Cell Analysis , Epigenomics , RNA Sequence Analysis , Software , Transposases
13.
Bioinformatics ; 37(3): 296-302, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32790868

ABSTRACT

MOTIVATION: Identifying cis-acting genetic variants associated with gene expression levels, an analysis commonly referred to as expression quantitative trait loci (eQTL) mapping, is an important first step toward understanding the genetic determinants of gene expression variation. Successful eQTL mapping requires effective control of confounding factors. A common method for controlling confounding effects in eQTL mapping studies is the probabilistic estimation of expression residual (PEER) analysis. PEER analysis extracts PEER factors to serve as surrogates for confounding factors, which are then included in the subsequent eQTL mapping analysis. However, it is computationally challenging to determine the optimal number of PEER factors used for eQTL mapping. In particular, the standard approach to determine the optimal number of PEER factors examines one number at a time and chooses a number that optimizes eQTL discovery. Unfortunately, this standard approach involves multiple repetitive eQTL mapping procedures that are computationally expensive, restricting its use in the large-scale eQTL mapping studies that are being collected today. RESULTS: Here, we present a simple and computationally scalable alternative, Effect size Correlation for COnfounding determination (ECCO), to determine the optimal number of PEER factors used for eQTL mapping studies. Instead of performing repetitive eQTL mapping, ECCO jointly applies differential expression analysis and Mendelian randomization analysis, leading to substantial computational savings. In simulations and real data applications, we show that ECCO identifies a similar number of PEER factors required for eQTL mapping analysis as the standard approach but is two orders of magnitude faster. The computational scalability of ECCO allows for optimized eQTL discovery across 48 GTEx tissues for the first time, yielding an overall 5.89% power gain in the number of eQTL-harboring genes (eGenes) discovered as compared to the previous GTEx recommendation, which does not attempt to determine the tissue-specific optimal number of PEER factors. AVAILABILITY AND IMPLEMENTATION: Our method is implemented in the ECCO software, which, along with its GTEx mapping results, is freely available at www.xzlab.org/software.html. All R scripts used in this study are also available at this site. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Mendelian Randomization Analysis , Quantitative Trait Loci , Gene Expression , Gene Expression Profiling , Genome-Wide Association Study , Single Nucleotide Polymorphism , Software
14.
Sci Adv ; 6(51)2020 12.
Article in English | MEDLINE | ID: mdl-33355120

ABSTRACT

Characterizing genome-wide binding profiles of transcription factors (TFs) is essential for understanding biological processes. Although techniques have been developed to assess binding profiles within a population of cells, determining them at a single-cell level remains elusive. Here, we report scFAN (single-cell factor analysis network), a deep learning model that predicts genome-wide TF binding profiles in individual cells. scFAN is pretrained on genome-wide bulk assay for transposase-accessible chromatin sequencing (ATAC-seq), DNA sequence, and chromatin immunoprecipitation sequencing (ChIP-seq) data and uses single-cell ATAC-seq to predict TF binding in individual cells. We demonstrate the efficacy of scFAN by both studying sequence motifs enriched within predicted binding peaks and using predicted TFs for discovering cell types. We develop a new metric "TF activity score" to characterize each cell and show that activity scores can reliably capture cell identities. scFAN allows us to discover and study cellular identities and heterogeneity based on chromatin accessibility profiles.

15.
IEEE/ACM Trans Comput Biol Bioinform ; 17(5): 1721-1728, 2020.
Article in English | MEDLINE | ID: mdl-30951477

ABSTRACT

DNA methylation plays an important role in the regulation of some biological processes. With the development of machine learning, several sequence-based deep learning models have been designed to predict DNA methylation states, and they achieve better performance than traditional methods such as random forest and SVM. However, convolutional network-based deep learning models that use one-hot encoded DNA sequences as input may discover only limited information and yield unsatisfactory prediction performance, so more data and model structures from diverse angles should be considered. In this work, we proposed a hybrid sequence-based deep learning model that uses both MeDIP-seq data and histone information to predict DNA methylated CpG states (MHCpG). We combined MeDIP-seq data and histone modification data with sequence information and applied a convolutional network to discover sequence patterns. In addition, we used statistical data derived from the previous three inputs and adopted a 3-layer feedforward neural network to extract more high-level features. We compared our method with traditional prediction methods using random forest and with previous methods such as CpGenie and DeepCpG; the results showed that MHCpG outperformed the other approaches.
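For readers unfamiliar with the one-hot input mentioned above, here is a minimal sketch of how a DNA sequence is typically turned into a binary matrix before being passed to a convolutional network. The function name `one_hot`, the A/C/G/T column order, and the all-zero treatment of ambiguous bases are assumptions for illustration; the abstract does not describe the authors' preprocessing.

```python
BASES = "ACGT"  # assumed column order


def one_hot(seq: str):
    """Return a len(seq) x 4 matrix as nested lists.

    Each row has a single 1 in the column of its base; symbols outside
    A/C/G/T (for example N) are encoded as an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows
```

The encoding carries only base identity per position, which is consistent with the abstract's point that sequence alone may provide limited information and that additional signals can help.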


Subject(s)
DNA Methylation/genetics , Deep Learning , High-Throughput Nucleotide Sequencing/methods , Histone Code/genetics , DNA Sequence Analysis/methods , Tumor Cell Line , Computational Biology/methods , DNA/genetics , Humans
16.
Genome Biol ; 20(1): 220, 2019 10 24.
Article in English | MEDLINE | ID: mdl-31651351

ABSTRACT

Identifying genetic variants that are associated with methylation variation, an analysis commonly referred to as methylation quantitative trait locus (mQTL) mapping, is important for understanding the epigenetic mechanisms underlying genotype-trait associations. Here, we develop a statistical method, IMAGE, for mQTL mapping in sequencing-based methylation studies. IMAGE properly accounts for the count nature of bisulfite sequencing data and incorporates allele-specific methylation patterns from heterozygous individuals to enable more powerful mQTL discovery. We compare IMAGE with existing approaches through extensive simulation. We also apply IMAGE to analyze two bisulfite sequencing studies, in which IMAGE identifies more mQTL than existing approaches.


Subject(s)
Chromosome Mapping/methods , DNA Methylation , Genomics/methods , Quantitative Trait Loci , Animals , Papio/genetics , Wolves/genetics
17.
Sci Rep ; 7(1): 14482, 2017 11 03.
Article in English | MEDLINE | ID: mdl-29101378

ABSTRACT

Cumulative evidence from biological experiments has confirmed that microRNAs (miRNAs) are related to many types of human diseases through different biological processes. It is anticipated that precise miRNA-disease association prediction could not only help infer potential disease-related miRNA but also boost human diagnosis and disease prevention. Considering the limitations of previous computational models, a more effective computational model needs to be implemented to predict miRNA-disease associations. In this work, we first constructed a human miRNA-miRNA similarity network utilizing miRNA-miRNA functional similarity data and heterogeneous miRNA Gaussian interaction profile kernel similarities based on the assumption that similar miRNAs with similar functions tend to be associated with similar diseases, and vice versa. Then, we constructed disease-disease similarity using disease semantic information and heterogeneous disease-related interaction data. We proposed a deep ensemble model called DeepMDA that extracts high-level features from similarity information using stacked autoencoders and then predicts miRNA-disease associations by adopting a 3-layer neural network. In addition to five-fold cross-validation, we also proposed another cross-validation method to evaluate the performance of the model. The results show that the proposed model is superior to previous methods with high robustness.
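The Gaussian interaction profile kernel similarity mentioned above is commonly defined as K(i, j) = exp(-gamma * ||IP(i) - IP(j)||^2), where IP(i) is the binary row of the association matrix for miRNA i and gamma is a bandwidth normalized by the mean squared norm of all profiles. The sketch below follows that common definition; the function name `gip_kernel` and the default `gamma_prime=1.0` are assumptions, and the abstract does not confirm the exact settings the authors used.

```python
import math


def gip_kernel(adj, gamma_prime=1.0):
    """Gaussian interaction profile kernel over the rows of a binary matrix.

    adj: list of equal-length 0/1 lists (rows = miRNAs, columns = diseases).
    Returns an n x n similarity matrix as nested lists.
    """
    n = len(adj)
    # Bandwidth normalization: divide by the average squared profile norm.
    mean_sq_norm = sum(sum(v * v for v in row) for row in adj) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [[math.exp(-gamma * sq_dist(adj[i], adj[j])) for j in range(n)]
            for i in range(n)]
```

Applying the same function to the transposed matrix gives the corresponding disease-disease kernel.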


Subject(s)
Disease , MicroRNAs/metabolism , Biological Models , Area Under Curve , Humans , Machine Learning , Computer Neural Networks , ROC Curve
18.
Artif Intell Med ; 75: 16-23, 2017 01.
Article in English | MEDLINE | ID: mdl-28363453

ABSTRACT

BACKGROUND: Identifying transcription factors binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The underlying mechanism of the specific binding for transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-sample rate is large (i.e., number of dimensions/number of samples), concatenating different physicochemical properties to a one-dimensional vector not only is likely to lose some structural information, but also poses significant challenges to recognition methods. MATERIALS AND METHOD: In this paper, we introduce a purely geometric representation method, tensor (also called multidimensional array), to represent TFs using their physicochemical properties. Accompanying the multidimensional array representation, we also develop a tensor-based recognition method, tensor partial least squares classifier (abbreviated as TPLSC). Intuitively, multidimensional arrays enable borrowing more information than one-dimensional arrays. The performance of each method is evaluated by average F-measure on 51 Escherichia coli TFs from RegulonDB database. RESULTS: In our first experiment, the results show that multiple nucleotide properties can obtain more power than dinucleotide properties. In the second experiment, the results demonstrate that our method can gain increased prediction power, roughly 33% improvements more than the best result from existing methods. CONCLUSION: The representation method for TFs is an important step in TFBSs recognition. We illustrate the benefits of this representation on real data application via a series of experiments. This method can gain further insights into the mechanism of TF binding and be of great use for metabolic engineering applications.


Subject(s)
Computational Biology , Escherichia coli , Protein Binding , Transcription Factors , Algorithms , Binding Sites , Humans
19.
Nucleic Acids Res ; 45(11): e106, 2017 Jun 20.
Article in English | MEDLINE | ID: mdl-28369632

ABSTRACT

Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls for type I error and is generally more powerful than other widely used approaches, except in small samples (n <15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three data compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html.


Subject(s)
Gene Expression Profiling/methods , RNA Sequence Analysis , Algorithms , Bayes Theorem , Computer Simulation , Humans , Linear Models , Markov Chains , Genetic Models , Monte Carlo Method , Poisson Distribution , Software
20.
Comput Math Methods Med ; 2017: 8307530, 2017.
Article in English | MEDLINE | ID: mdl-28133490

ABSTRACT

Gene regulatory networks (GRNs) play an important role in cellular systems and are important for understanding biological processes. Many algorithms have been developed to infer the GRNs. However, most algorithms only pay attention to the gene expression data but do not consider the topology information in their inference process, while incorporating this information can partially compensate for the lack of reliable expression data. Here we develop a Bayesian group lasso with spike and slab priors to perform gene selection and estimation for nonparametric models. B-spline basis functions are used to capture the nonlinear relationships flexibly and penalties are used to avoid overfitting. Further, we incorporate the topology information into the Bayesian method as a prior. We present the application of our method on DREAM3 and DREAM4 datasets and two real biological datasets. The results show that our method performs better than existing methods and the topology information prior can improve the result.


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Algorithms , Area Under Curve , Bayes Theorem , HeLa Cells , Humans , Genetic Models , Statistical Models , Monte Carlo Method , Probability , Regression Analysis