1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38271483

ABSTRACT

The advent of single-cell sequencing technologies has revolutionized cell biology studies. However, integrative analyses of diverse single-cell data face serious challenges, including technological noise, sample heterogeneity, and different modalities and species. To address these problems, we propose scCorrector, a variational autoencoder-based model that can integrate single-cell data from different studies and map them into a common space. Specifically, we designed a Study Specific Adaptive Normalization for each study in the decoder to implement these features. scCorrector achieves competitive and robust performance compared with state-of-the-art methods and brings novel insights under various circumstances (e.g. various batches, multi-omics, cross-species, and development stages). In addition, the integration of single-cell data and spatial data makes it possible to transfer information between different studies, which greatly expands the narrow range of genes covered by MERFISH technology. In summary, scCorrector can efficiently integrate multi-study single-cell datasets, thereby providing broad opportunities to tackle challenges emerging from noisy resources.

2.
Brief Bioinform ; 25(6)2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39417321

ABSTRACT

The gene regulatory network (GRN) plays a vital role in understanding the structure and dynamics of cellular systems, revealing complex regulatory relationships, and exploring disease mechanisms. Recently, deep learning (DL)-based methods have been proposed to infer GRNs from single-cell transcriptomic data and achieved impressive performance. However, these methods do not fully utilize graph topological information and high-order neighbor information from multiple receptive fields. To overcome those limitations, we propose a novel model based on multiview graph attention network, namely, scMGATGRN, to infer GRNs. scMGATGRN mainly consists of GAT, multiview, and view-level attention mechanism. GAT can extract essential features of the gene regulatory network. The multiview model can simultaneously utilize local feature information and high-order neighbor feature information of nodes in the gene regulatory network. The view-level attention mechanism dynamically adjusts the relative importance of node embedding representations and efficiently aggregates node embedding representations from two views. To verify the effectiveness of scMGATGRN, we compared its performance with 10 methods (five shallow learning algorithms and five state-of-the-art DL-based methods) on seven benchmark single-cell RNA sequencing (scRNA-seq) datasets from five cell lines (two in human and three in mouse) with four different kinds of ground-truth networks. The experimental results not only show that scMGATGRN outperforms competing methods but also demonstrate the potential of this model in inferring GRNs. The code and data of scMGATGRN are made freely available on GitHub (https://github.com/nathanyl/scMGATGRN).


Subject(s)
Gene Regulatory Networks , Single-Cell Analysis , Transcriptome , Single-Cell Analysis/methods , Humans , Computational Biology/methods , Algorithms , Deep Learning , Gene Expression Profiling/methods , Mice
3.
PLoS Comput Biol ; 20(8): e1012399, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39173070

ABSTRACT

Circular RNAs (circRNAs) play vital roles in transcription and translation. Identification of circRNA-RBP (RNA-binding protein) interaction sites has become a fundamental step in molecular and cell biology. Deep learning (DL)-based methods have been proposed to predict circRNA-RBP interaction sites and achieved impressive identification performance. However, those methods cannot effectively capture long-distance dependencies or effectively utilize the interaction information of multiple features. To overcome those limitations, we propose a DL-based model, iCRBP-LKHA, that uses deep hybrid networks to identify circRNA-RBP interaction sites. iCRBP-LKHA adopts five encoding schemes. Its neural network architecture, which consists of a large kernel convolutional neural network (LKCNN), a convolutional block attention module with one-dimensional convolution (CBAM-1D) and a bidirectional gated recurrent unit (BiGRU), can automatically explore local information, global context information and multi-feature interaction information. To verify the effectiveness of iCRBP-LKHA, we compared its performance with shallow learning algorithms on 37 circRNA datasets and 37 stringent circRNA datasets. We also compared its performance with state-of-the-art DL-based methods on 37 circRNA datasets, 37 stringent circRNA datasets and 31 linear RNA datasets. The experimental results not only show that iCRBP-LKHA outperforms other competing methods, but also demonstrate the potential of this model in identifying other RNA-RBP interaction sites.


Subject(s)
Algorithms , Computational Biology , Deep Learning , Neural Networks, Computer , RNA, Circular , RNA-Binding Proteins , RNA, Circular/genetics , RNA, Circular/metabolism , Computational Biology/methods , RNA-Binding Proteins/metabolism , RNA-Binding Proteins/genetics , Humans , Binding Sites/genetics
4.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34471921

ABSTRACT

Graph is a natural data structure for describing complex systems, which contains a set of objects and relationships. Ubiquitous real-life biomedical problems can be modeled as graph analytics tasks. Machine learning, especially deep learning, succeeds in vast bioinformatics scenarios with data represented in Euclidean domain. However, rich relational information between biological elements is retained in the non-Euclidean biomedical graphs, which is not learning friendly to classic machine learning methods. Graph representation learning aims to embed graph into a low-dimensional space while preserving graph topology and node properties. It bridges biomedical graphs and modern machine learning methods and has recently raised widespread interest in both machine learning and bioinformatics communities. In this work, we summarize the advances of graph representation learning and its representative applications in bioinformatics. To provide a comprehensive and structured analysis and perspective, we first categorize and analyze both graph embedding methods (homogeneous graph embedding, heterogeneous graph embedding, attribute graph embedding) and graph neural networks. Furthermore, we summarize their representative applications from molecular level to genomics, pharmaceutical and healthcare systems level. Moreover, we provide open resource platforms and libraries for implementing these graph representation learning methods and discuss the challenges and opportunities of graph representation learning in bioinformatics. This work provides a comprehensive survey of emerging graph representation learning algorithms and their applications in bioinformatics. It is anticipated that it could bring valuable insights for researchers to contribute their knowledge to graph representation learning and future-oriented bioinformatics studies.


Subject(s)
Computational Biology , Neural Networks, Computer , Algorithms , Computational Biology/methods , Knowledge , Machine Learning
5.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36484687

ABSTRACT

MOTIVATION: Cell-type-specific gene expression is maintained in large part by transcription factors (TFs) selectively binding to distinct sets of sites in different cell types. Recent research works have provided evidence that such cell-type-specific binding is determined by TF's intrinsic sequence preferences, cooperative interactions with co-factors, cell-type-specific chromatin landscapes and 3D chromatin interactions. However, computational prediction and characterization of cell-type-specific and shared binding sites is rarely studied. RESULTS: In this article, we propose two computational approaches for predicting and characterizing cell-type-specific and shared binding sites by integrating multiple types of features, in which one is based on XGBoost and another is based on convolutional neural network (CNN). To validate the performance of our proposed approaches, ChIP-seq datasets of 10 binding factors were collected from the GM12878 (lymphoblastoid) and K562 (erythroleukemic) human hematopoietic cell lines, each of which was further categorized into cell-type-specific (GM12878- and K562-specific) and shared binding sites. Then, multiple types of features for these binding sites were integrated to train the XGBoost- and CNN-based models. Experimental results show that our proposed approaches significantly outperform other competing methods on three classification tasks. Moreover, we identified independent feature contributions for cell-type-specific and shared sites through SHAP values and explored the ability of the CNN-based model to predict cell-type-specific and shared binding sites by excluding or including DNase signals. Furthermore, we investigated the generalization ability of our proposed approaches to different binding factors in the same cellular environment. AVAILABILITY AND IMPLEMENTATION: The source code is available at: https://github.com/turningpoint1988/CSSBS. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Chromatin , Transcription Factors , Humans , Protein Binding/genetics , Binding Sites/genetics , Transcription Factors/metabolism , Chromatin Immunoprecipitation Sequencing , Computational Biology/methods
6.
PLoS Comput Biol ; 19(8): e1011344, 2023 08.
Article in English | MEDLINE | ID: mdl-37651321

ABSTRACT

Accumulating evidence suggests that circRNAs play crucial roles in human diseases. CircRNA-disease association prediction is extremely helpful in understanding pathogenesis, diagnosis, and prevention, as well as identifying relevant biomarkers. During the past few years, a large number of deep learning (DL) based methods have been proposed for predicting circRNA-disease associations and have achieved impressive prediction performance. However, these methods have two main drawbacks. First, they underutilize biometric information in the data. Second, the features they extract do not adequately represent the association characteristics between circRNAs and diseases. In this study, we developed a novel deep learning model, named iCircDA-NEAE, to predict circRNA-disease associations. In particular, we use disease semantic similarity, Gaussian interaction profile kernel, circRNA expression profile similarity, and Jaccard similarity simultaneously for the first time, and extract hidden features based on accelerated attribute network embedding (AANE) and dynamic convolutional autoencoder (DCAE). Experimental results on the circR2Disease dataset show that iCircDA-NEAE outperforms other competing methods significantly. Besides, 16 of the top 20 circRNA-disease pairs with the highest prediction scores were validated by relevant literature. Furthermore, we observe that iCircDA-NEAE can effectively predict new potential circRNA-disease associations.


Subject(s)
Algorithms , RNA, Circular , Humans , RNA, Circular/genetics , Semantics
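Two of the similarity measures named in the abstract above, the Gaussian interaction profile (GIP) kernel and Jaccard similarity, can both be computed from a binary circRNA-disease association matrix. The following is a minimal sketch under stated assumptions, not the iCircDA-NEAE implementation: the function names, the toy matrix, and the bandwidth parameter `gamma_prime` are hypothetical, and the kernel uses the commonly seen form exp(-gamma * ||a_i - a_j||^2) with gamma normalized by the mean squared profile norm.

```python
import numpy as np


def gip_kernel(profiles: np.ndarray, gamma_prime: float = 1.0) -> np.ndarray:
    """GIP kernel between the rows of a binary association matrix.

    gamma_prime is a hypothetical bandwidth parameter for this sketch.
    """
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = (profiles ** 2).sum(axis=1)
    mean_sq = sq_norms.mean()
    # Normalize the bandwidth by the average squared norm of the profiles.
    gamma = gamma_prime / mean_sq if mean_sq > 0 else gamma_prime
    # Squared Euclidean distance between every pair of rows.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (profiles @ profiles.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-gamma * sq_dists)


def jaccard_similarity(profiles: np.ndarray) -> np.ndarray:
    """Jaccard similarity between the rows of a binary association matrix."""
    b = np.asarray(profiles).astype(bool).astype(float)
    inter = b @ b.T                      # size of each pairwise intersection
    sizes = b.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    # Rows with no associations at all get similarity 0 by convention here.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


# Toy matrix (illustrative only): 3 circRNAs (rows) x 3 diseases (columns).
associations = np.array([[1, 0, 1],
                         [1, 0, 1],
                         [0, 1, 0]])
gip = gip_kernel(associations)
jac = jaccard_similarity(associations)
```

Rows 0 and 1 share the same interaction profile, so both measures give them a similarity of 1, while rows 0 and 2 share no disease and get a Jaccard similarity of 0.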
7.
World J Surg Oncol ; 22(1): 49, 2024 Feb 09.
Article in English | MEDLINE | ID: mdl-38331878

ABSTRACT

BACKGROUND: TMPRSS2-ERG (T2E) fusion is highly related to aggressive clinical features in prostate cancer (PC), which guides individual therapy. However, current fusion prediction tools lack sufficient accuracy, and biomarkers cannot be applied to individuals across different platforms because of their quantitative nature. This study aims to identify a transcriptome signature to detect the T2E fusion status of PC at the individual level. METHODS: Based on 272 high-throughput mRNA expression profiles from the Sboner dataset, we developed a rank-based algorithm to identify a qualitative signature to detect T2E fusion in PC. The signature was validated in 1223 samples from three external datasets (Setlur, Clarissa, and TCGA). RESULTS: A signature, composed of five mRNAs coupled to ERG (five ERG-mRNA pairs, 5-ERG-mRPs), was developed to distinguish T2E fusion status in PC. 5-ERG-mRPs reached 84.56% accuracy in the Sboner dataset, which was verified in the Setlur dataset (n = 455, accuracy = 82.20%) and the Clarissa dataset (n = 118, accuracy = 81.36%). Besides, for 495 samples from TCGA, two subtypes classified by 5-ERG-mRPs showed a higher level of significance in various T2E fusion features than subtypes obtained through current fusion prediction tools, such as STAR-Fusion. CONCLUSIONS: Overall, 5-ERG-mRPs can robustly detect T2E fusion in PC at the individual level and can be used on any gene measurement platform without specific normalization procedures. Hence, 5-ERG-mRPs may serve as an auxiliary tool for PC patient management.


Subject(s)
Prostatic Neoplasms , Transcriptome , Male , Humans , Oncogene Proteins, Fusion/genetics , Oncogene Proteins, Fusion/metabolism , Oncogene Proteins, Fusion/therapeutic use , Prostatic Neoplasms/drug therapy , RNA, Messenger/genetics , Transcriptional Regulator ERG/genetics , Transcriptional Regulator ERG/metabolism , Serine Endopeptidases/genetics , Serine Endopeptidases/metabolism , Serine Endopeptidases/therapeutic use
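The abstract above describes a qualitative, rank-based signature built from ERG-mRNA pairs. A sketch of how such a pair-based rule could work is shown below. Everything specific in it is an assumption for illustration: the partner gene names (`GENE_A` to `GENE_E`), the direction of each comparison, and the majority-vote threshold are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical partner genes; the real 5-ERG-mRPs partners are not listed in the abstract.
PARTNERS: List[str] = ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"]


def predict_fusion(expression: Dict[str, float],
                   partners: List[str] = PARTNERS,
                   anchor: str = "ERG",
                   min_votes: int = 3) -> bool:
    """Call a sample fusion-positive if the anchor gene outranks enough partners.

    Only the within-sample ordering of each pair is used, never the raw values.
    The comparison direction and the vote threshold are assumptions of this sketch.
    """
    votes = sum(1 for gene in partners if expression[anchor] > expression[gene])
    return votes >= min_votes


# Illustrative expression values for one sample (made up for this sketch).
sample = {"ERG": 9.1, "GENE_A": 3.0, "GENE_B": 4.2,
          "GENE_C": 10.5, "GENE_D": 2.2, "GENE_E": 8.0}
call = predict_fusion(sample)

# Any monotonically increasing rescaling leaves the call unchanged.
rescaled = {gene: 100.0 * value + 7.0 for gene, value in sample.items()}
call_rescaled = predict_fusion(rescaled)
```

Because the rule depends only on which gene of a pair is higher within one sample, rescaling that sample's measurements does not change the call, which is consistent with the abstract's point about use across platforms without specific normalization.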
8.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33498086

ABSTRACT

Transcription factors (TFs) play an important role in regulating gene expression, thus identification of the regions bound by them has become a fundamental step for molecular and cellular biology. In recent years, an increasing number of deep learning (DL) based methods have been proposed for predicting TF binding sites (TFBSs) and achieved impressive prediction performance. However, these methods mainly focus on predicting the sequence specificity of TF-DNA binding, which is equivalent to a sequence-level binary classification task, and fail to identify motifs and TFBSs accurately. In this paper, we developed a fully convolutional network coupled with global average pooling (FCNA), which by contrast is equivalent to a nucleotide-level binary classification task, to roughly locate TFBSs and accurately identify motifs. Experimental results on human ChIP-seq datasets show that FCNA outperforms other competing methods significantly. Besides, we find that the regions located by FCNA can be used by motif discovery tools to further refine the prediction performance. Furthermore, we observe that FCNA can accurately identify TF-DNA binding motifs across different cell lines and infer indirect TF-DNA bindings.


Subject(s)
Chromatin Immunoprecipitation Sequencing , Neural Networks, Computer , Response Elements , Sequence Analysis, DNA , Sequence Analysis, Protein , Transcription Factors , A549 Cells , Amino Acid Motifs , Humans , MCF-7 Cells , Transcription Factors/genetics , Transcription Factors/metabolism
9.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33005921

ABSTRACT

DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- or RNA-protein binding sites, which helps to understand the mechanism of gene regulation and management. For the past few decades, researchers have been working on designing new efficient and accurate algorithms for motif mining. These algorithms can be roughly divided into two categories: the enumeration approach and the probabilistic method. In recent years, machine learning methods have made great progress, and deep learning algorithms in particular have achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN-RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing, the features of existing deep learning architectures, and the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that the current methods are relatively simple compared with those in other fields such as computer vision, natural language processing (NLP), computer games, etc. Therefore, it is necessary to summarize deep learning-based motif mining, which can help researchers understand this field.


Subject(s)
DNA/genetics , Neural Networks, Computer , Nucleotide Motifs , RNA/genetics , Sequence Analysis, DNA , Sequence Analysis, RNA
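The abstract above mentions data preprocessing for CNN- and RNN-based motif mining without naming a specific encoding. One widely used choice, assumed here for illustration, is one-hot encoding, where each nucleotide becomes a 4-dimensional indicator vector. The function name and the handling of ambiguous bases (left as all-zero rows) are assumptions of this sketch.

```python
import numpy as np

ALPHABET = "ACGT"  # column order of the encoding


def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (length, 4) matrix of 0/1 indicators.

    Bases outside ACGT (for example N) are left as all-zero rows,
    which is one possible convention among several.
    """
    index = {base: col for col, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        col = index.get(base)
        if col is not None:
            encoded[pos, col] = 1.0
    return encoded


matrix = one_hot_encode("ACGTN")
```

A convolutional filter of width w applied to this matrix sees w consecutive positions at once, which is why first-layer filters in such models are often interpreted as motif detectors.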
10.
Brief Bioinform ; 22(2): 2085-2095, 2021 03 22.
Article in English | MEDLINE | ID: mdl-32232320

ABSTRACT

Effectively representing Medical Subject Headings (MeSH) headings (terms) such as disease and drug as discriminative vectors could greatly improve the performance of downstream computational prediction models. However, these terms are often abstract and difficult to quantify. In this paper, we converted the MeSH tree structure into a relationship network and applied several graph embedding algorithms to it to represent these terms. Specifically, the relationship network consists of nodes (MeSH headings) and edges (relationships), and can be constructed from the tree numbers. Then, five graph embedding algorithms, namely DeepWalk, LINE, SDNE, LAP and HOPE, were applied to the relationship network to represent MeSH headings as vectors. To evaluate the performance of the proposed methods, we carried out node classification and relationship prediction tasks. The results show that the MeSH headings characterized by graph embedding algorithms can not only be treated as an independent carrier for representation, but can also be used as additional information to enhance the representation ability of vectors. Thus, they can serve as an input and play a significant role in any computational model related to disease, drug, microbe, etc. We also hope that our method will inspire researchers to study the representation of terms from this network perspective.


Subject(s)
Algorithms , Medical Subject Headings , Computer Simulation , Drug Delivery Systems , Genetic Predisposition to Disease , Humans , MicroRNAs/genetics , Semantics
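The abstract above says the relationship network is constructed from MeSH tree numbers. Tree numbers are dot-separated hierarchical codes, so the parent of a code can be found by dropping its last segment. The sketch below uses that property to derive parent-child edges. The function name and the toy headings and codes (`Term A`, `X01.100`, and so on) are hypothetical and chosen only for illustration; the paper's exact construction may differ.

```python
from typing import Dict, List, Set, Tuple


def build_edges(tree_numbers: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Derive (parent heading, child heading) edges from tree numbers.

    A heading may carry several tree numbers, so it can have several parents.
    """
    number_to_heading = {
        number: heading
        for heading, numbers in tree_numbers.items()
        for number in numbers
    }
    edges: Set[Tuple[str, str]] = set()
    for heading, numbers in tree_numbers.items():
        for number in numbers:
            if "." not in number:
                continue  # top-level code, no parent inside this toy set
            parent_number = number.rsplit(".", 1)[0]
            parent = number_to_heading.get(parent_number)
            if parent is not None and parent != heading:
                edges.add((parent, heading))
    return edges


# Made-up headings and codes that only mimic the dot-separated format.
toy_tree = {
    "Term A": ["X01"],
    "Term B": ["X01.100"],
    "Term C": ["X01.100.200", "Y02.300"],
    "Term D": ["Y02"],
}
edges = build_edges(toy_tree)
```

The resulting edge set can be passed to any graph embedding method; `Term C` ends up with two parents because it appears under two branches.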
11.
PLoS Comput Biol ; 18(3): e1009941, 2022 03.
Article in English | MEDLINE | ID: mdl-35263332

ABSTRACT

Transcription factors (TFs) play an important role in regulating gene expression, thus the identification of the sites bound by them has become a fundamental step for molecular and cellular biology. In this paper, we developed a deep learning framework leveraging existing fully convolutional neural networks (FCN) to predict TF-DNA binding signals at the base-resolution level (named as FCNsignal). The proposed FCNsignal can simultaneously achieve the following tasks: (i) modeling the base-resolution signals of binding regions; (ii) discriminating binding or non-binding regions; (iii) locating TF-DNA binding regions; (iv) predicting binding motifs. Besides, FCNsignal can also be used to predict opening regions across the whole genome. The experimental results on 53 TF ChIP-seq datasets and 6 chromatin accessibility ATAC-seq datasets show that our proposed framework outperforms some existing state-of-the-art methods. In addition, we explored to use the trained FCNsignal to locate all potential TF-DNA binding regions on a whole chromosome and predict DNA sequences of arbitrary length, and the results show that our framework can find most of the known binding regions and accept sequences of arbitrary length. Furthermore, we demonstrated the potential ability of our framework in discovering causal disease-associated single-nucleotide polymorphisms (SNPs) through a series of experiments.


Subject(s)
Deep Learning , Binding Sites , Chromatin Immunoprecipitation Sequencing , Protein Binding , Transcription Factors/metabolism
12.
PLoS Comput Biol ; 18(10): e1010572, 2022 10.
Article in English | MEDLINE | ID: mdl-36206320

ABSTRACT

In recent years, major advances have been made in various chromosome conformation capture technologies to further satisfy the needs of researchers for high-quality, high-resolution contact interactions. Discriminating the loops from genome-wide contact interactions is crucial for dissecting three-dimensional (3D) genome structure and function. Here, we present a deep learning method to predict genome-wide chromatin loops, called DLoopCaller, by combining accessible chromatin landscapes and raw Hi-C contact maps. Available orthogonal data (ChIA-PET/HiChIP and Capture Hi-C) were used to generate positive samples with a wider contact matrix, which makes it possible to find more potential genome-wide chromatin loops. The experimental results demonstrate that DLoopCaller effectively improves the accuracy of predicting genome-wide chromatin loops compared to the state-of-the-art method Peakachu. Moreover, compared to two of the most popular loop callers, HiCCUPS and Fit-Hi-C, DLoopCaller identifies some unique interactions. We conclude that a combination of chromatin landscapes on the one-dimensional genome contributes to understanding the 3D genome organization, and the identified chromatin loops reveal cell-type specificity and transcription factor motif co-enrichment across different cell lines and species.


Subject(s)
Chromatin , Deep Learning , Chromatin/genetics , Genome/genetics , Chromosomes , Transcription Factors/genetics
13.
Mol Ther ; 30(4): 1775-1786, 2022 04 06.
Article in English | MEDLINE | ID: mdl-35121109

ABSTRACT

Many biological studies show that the mutation and abnormal expression of microRNAs (miRNAs) could cause a variety of diseases. As important biomarkers for disease diagnosis, miRNAs help to understand pathogenesis and could promote the identification, diagnosis and treatment of diseases. However, the pathogenic mechanisms by which miRNAs affect these diseases are not fully understood. Therefore, predicting potential miRNA-disease associations is of great importance for the development of clinical medicine and drug research. In this study, we proposed a novel deep learning model based on a hierarchical graph attention network for predicting miRNA-disease associations (HGANMDA). Firstly, we constructed a miRNA-disease-lncRNA heterogeneous graph based on known miRNA-disease associations, miRNA-lncRNA associations and disease-lncRNA associations. Secondly, node-layer attention was applied to learn the importance of neighbor nodes based on different meta-paths. Thirdly, semantic-layer attention was applied to learn the importance of different meta-paths. Finally, a bilinear decoder was employed to reconstruct the connections between miRNAs and diseases. The extensive experimental results indicated that our model achieved good performance and satisfactory results in predicting miRNA-disease associations.


Subject(s)
MicroRNAs , RNA, Long Noncoding , Algorithms , Computational Biology/methods , MicroRNAs/genetics , RNA, Long Noncoding/genetics
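The last step in the abstract above is a bilinear decoder that reconstructs miRNA-disease links from learned embeddings. A generic bilinear decoder scores a pair as sigmoid(z_m^T W z_d). The sketch below shows that form with random stand-in embeddings; the dimensions, the variable names, and the random weight matrix are assumptions for illustration, and in a real model `weight` would be learned and the embeddings would come from the attention layers.

```python
import numpy as np


def bilinear_decode(mirna_emb: np.ndarray,
                    disease_emb: np.ndarray,
                    weight: np.ndarray) -> np.ndarray:
    """Score every miRNA-disease pair as sigmoid(z_m^T W z_d)."""
    logits = mirna_emb @ weight @ disease_emb.T
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
dim = 8                                    # hypothetical embedding size
mirna_emb = rng.normal(size=(4, dim))      # stand-in for 4 miRNA embeddings
disease_emb = rng.normal(size=(3, dim))    # stand-in for 3 disease embeddings
weight = rng.normal(size=(dim, dim))       # would be a learned parameter

scores = bilinear_decode(mirna_emb, disease_emb, weight)  # shape (4, 3)
```

Each entry of `scores` lies between 0 and 1 and can be read as a predicted association probability for one miRNA-disease pair.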
14.
Bioinformatics ; 36(13): 4038-4046, 2020 07 01.
Article in English | MEDLINE | ID: mdl-31793982

ABSTRACT

MOTIVATION: Emerging evidence indicates that circular RNA (circRNA) plays a crucial role in human disease. Using circRNA as biomarker gives rise to a new perspective regarding our diagnosing of diseases and understanding of disease pathogenesis. However, detection of circRNA-disease associations by biological experiments alone is often blind, limited to small scale, high cost and time consuming. Therefore, there is an urgent need for reliable computational methods to rapidly infer the potential circRNA-disease associations on a large scale and to provide the most promising candidates for biological experiments. RESULTS: In this article, we propose an efficient computational method based on multi-source information combined with deep convolutional neural network (CNN) to predict circRNA-disease associations. The method first fuses multi-source information including disease semantic similarity, disease Gaussian interaction profile kernel similarity and circRNA Gaussian interaction profile kernel similarity, and then extracts its hidden deep feature through the CNN and finally sends them to the extreme learning machine classifier for prediction. The 5-fold cross-validation results show that the proposed method achieves 87.21% prediction accuracy with 88.50% sensitivity at the area under the curve of 86.67% on the CIRCR2Disease dataset. In comparison with the state-of-the-art SVM classifier and other feature extraction methods on the same dataset, the proposed model achieves the best results. In addition, we also obtained experimental support for prediction results by searching published literature. As a result, 7 of the top 15 circRNA-disease pairs with the highest scores were confirmed by literature. These results demonstrate that the proposed model is a suitable method for predicting circRNA-disease associations and can provide reliable candidates for biological experiments. 
AVAILABILITY AND IMPLEMENTATION: The source code and datasets explored in this work are available at https://github.com/look0012/circRNA-Disease-association. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks, Computer , RNA, Circular , Algorithms , Humans
15.
Cancer Cell Int ; 21(1): 47, 2021 Jan 12.
Article in English | MEDLINE | ID: mdl-33514366

ABSTRACT

BACKGROUND: The incidence of multiple primary malignant tumors (MPMTs) is rising due to the development of screening technologies, significant treatment advances and increased aging of the population. For patients with a prior cancer history, identifying the tumor origin of the second malignant lesion has important prognostic and therapeutic implications and still represents a difficult problem in clinical practice. METHODS: In this study, we evaluated the performance of a 90-gene expression assay and explored its potential diagnostic utility for MPMTs across a broad spectrum of tumor types. Thirty-five MPMT patients from Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University and Fudan University Shanghai Cancer Center were enrolled; 73 MPMT specimens met all quality control criteria and were analyzed by the 90-gene expression assay. RESULTS: For each clinical specimen, the tumor type predicted by the 90-gene expression assay was compared with its pathological diagnosis, with an overall accuracy of 93.2% (68 of 73, 95% confidence interval 0.84-0.97). For histopathological subgroup analysis, the 90-gene expression assay achieved an overall accuracy of 95.0% (38 of 40; 95% CI 0.82-0.99) for well-moderately differentiated tumors and 92.0% (23 of 25; 95% CI 0.82-0.99) for poorly or undifferentiated tumors, with no statistically significant difference (p-value > 0.5). For squamous cell carcinoma specimens, the overall accuracy of gene expression assay also reached 87.5% (7 of 8; 95% CI 0.47-0.99) for identifying the tumor origins. CONCLUSIONS: The 90-gene expression assay provides flexibility and accuracy in identifying the tumor origin of MPMTs. Future incorporation of the 90-gene expression assay in pathological diagnosis will assist oncologists in applying precise treatments, leading to improved care and outcomes for MPMT patients.
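The overall accuracy reported above (68 of 73 specimens) can be reproduced with a short calculation, together with an approximate confidence interval. The abstract does not say which interval method was used, so the Wilson score interval below is an assumption, and its bounds may differ slightly in the second decimal from the published ones. The function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z defaults to about 95%)."""
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return center - half_width, center + half_width


accuracy = 68 / 73                     # overall accuracy from the abstract
lower, upper = wilson_interval(68, 73)
```

The same function can be applied to the subgroup counts in the abstract (for example 38 of 40) to see how the interval widens as the number of specimens shrinks.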

16.
Bioinformatics ; 34(1): 33-40, 2018 01 01.
Article in English | MEDLINE | ID: mdl-28968797

ABSTRACT

Motivation: A promoter is a short region of DNA responsible for initiating transcription of a particular gene in the genome. Promoters have various types with different functions. Owing to their importance in biological processes, it is highly desirable to develop computational tools for timely identification of promoters and their types. This challenge has become particularly critical and urgent given the avalanche of DNA sequences discovered in the postgenomic age. Although some prediction methods have been developed, they can only be used to discriminate a specific type of promoter from non-promoters. None of them can identify the types of promoters. This is because different types of promoters may share quite similar consensus sequence patterns, and promoters of the same type may have considerably different consensus sequences. Results: To overcome this difficulty, using the multi-window-based PseKNC (pseudo K-tuple nucleotide composition) approach to incorporate the short-, middle-, and long-range sequence information, we have developed a two-layer seamless predictor named 'iPromoter-2L'. The first layer serves to identify a query DNA sequence as a promoter or non-promoter, and the second layer to predict which of the following six types the identified promoter belongs to: σ24, σ28, σ32, σ38, σ54 and σ70. Availability and implementation: For the convenience of most experimental scientists, a user-friendly and publicly accessible web-server for the powerful new predictor has been established at http://bioinformatics.hitsz.edu.cn/iPromoter-2L/. It is anticipated that iPromoter-2L will become a very useful high throughput tool for genome analysis. Contact: bliu@hit.edu.cn or dshuang@tongji.edu.cn or kcchou@gordonlifescience.org. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Escherichia coli/genetics , Genomics/methods , Promoter Regions, Genetic , Sequence Analysis, DNA/methods , Software , DNA, Bacterial/metabolism , DNA-Directed RNA Polymerases/metabolism , Escherichia coli/enzymology , Escherichia coli Proteins/metabolism , Genome, Bacterial
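The PseKNC approach named in the abstract above builds on K-tuple nucleotide composition, that is, the normalized frequencies of all k-mers in a sequence. The sketch below computes only that base composition. It does not include the pseudo (sequence-order correlation) components or the multi-window scheme of the paper, and the function name and default `k` are assumptions for illustration.

```python
from itertools import product
from typing import List


def kmer_composition(seq: str, k: int = 2) -> List[float]:
    """Normalized k-mer frequencies, ordered lexicographically over ACGT.

    Windows containing characters outside ACGT are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


features = kmer_composition("ACGTACGT", k=2)  # 16-dimensional feature vector
```

For a given k the vector has 4^k entries, so k = 2 gives 16 features and k = 3 gives 64. Computing it on several windows of a sequence is one way to capture information at different ranges.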
17.
Bioinformatics ; 34(18): 3086-3093, 2018 09 15.
Article in English | MEDLINE | ID: mdl-29684124

ABSTRACT

Motivation: DNA replication is the key to the transmission of genetic information, and it is initiated from the replication origins. Identifying the replication origins is crucial for understanding the mechanism of DNA replication. Although several discriminative computational predictors were proposed to identify DNA replication origins of yeast species, they could only be used to identify very small parts (250 or 300 bp) of the replication origins. Besides, none of the existing predictors could successfully capture the 'GC asymmetry bias' of yeast species reported by experimental observations. It is therefore not surprising that their power is so limited. To capture the GC asymmetry feature and enable prediction over the entire replication regions of yeast species, we developed a new predictor called 'iRO-3wPseKNC'. Results: Rigorous cross validations on the benchmark datasets from four yeast species (Saccharomyces cerevisiae, Schizosaccharomyces pombe, Kluyveromyces lactis and Pichia pastoris) have indicated that the proposed predictor is very powerful for predicting entire DNA replication origins. Availability and implementation: The web-server for the iRO-3wPseKNC predictor is available at http://bioinformatics.hitsz.edu.cn/iRO-3wPseKNC/, by which users can easily get their desired results without the need to go through the mathematical details. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
DNA/genetics ; Replication Origin ; Ascomycota/genetics ; DNA Replication ; Fungal Proteins/genetics ; Software
18.
Bioinformatics ; 34(22): 3835-3842, 2018 11 15.
Article in English | MEDLINE | ID: mdl-29878118

ABSTRACT

Motivation: Identifying enhancers and their strength is important because enhancers play a critical role in controlling gene expression. Although some bioinformatics tools have been developed, they can only discriminate enhancers from non-enhancers. Recently, a two-layer predictor called 'iEnhancer-2L' was developed that can also predict an enhancer's strength. However, its prediction quality needs further improvement to increase its practical value. Results: We propose a new predictor called 'iEnhancer-EL' that contains two layers: the first (for identifying enhancers) is formed by fusing an array of six key individual classifiers, and the second (for predicting their strength) is formed by fusing an array of ten key individual classifiers. All these key classifiers were selected from 171 elementary classifiers built with SVM (Support Vector Machine) on kmer, subsequence profile and PseKNC (Pseudo K-tuple Nucleotide Composition) features, respectively. Rigorous cross-validations indicate that the proposed predictor is remarkably superior to the existing state-of-the-art predictor in this area. Availability and implementation: A web server for iEnhancer-EL has been established at http://bioinformatics.hitsz.edu.cn/iEnhancer-EL/, by which users can easily get their desired results without the need to go through the mathematical details. Supplementary information: Supplementary data are available at Bioinformatics online.
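
The idea of fusing several individual classifiers into one decision can be sketched as follows. The abstract does not say which fusion rule iEnhancer-EL uses, so averaging the classifier scores and applying a threshold is an assumption chosen for simplicity; `Classifier`, `fuse_scores` and `ensemble_predict` are hypothetical names.

```python
from typing import Callable, Sequence

# A hypothetical classifier interface: features in, score in [0, 1] out.
Classifier = Callable[[Sequence[float]], float]


def fuse_scores(scores: Sequence[float]) -> float:
    """Fuse individual classifier scores by taking their arithmetic mean."""
    if not scores:
        raise ValueError("need at least one classifier score")
    return sum(scores) / len(scores)


def ensemble_predict(
    features: Sequence[float],
    classifiers: Sequence[Classifier],
    threshold: float = 0.5,
) -> bool:
    """Return True when the fused score reaches the decision threshold."""
    fused = fuse_scores([clf(features) for clf in classifiers])
    return fused >= threshold
```

Averaging scores avoids the ties that a simple majority vote can produce with an even number of classifiers, such as the six and ten mentioned in the abstract.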


Subject(s)
Regulatory Sequences, Nucleic Acid ; Software ; Nucleotides
19.
Bioinformatics ; 33(14): i243-i251, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881989

ABSTRACT

MOTIVATION: The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the large number of computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have proven promising for harnessing the discovery power of the large amount of accumulated high-throughput binding data. However, they have to sacrifice accuracy for speed and can fail to fully utilize the information in the input sequences. RESULTS: We propose a novel algorithm called CDAUC for optimizing DML-learned motifs based on the area under the receiver-operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate-wise manner, the cost function of each resultant sub-problem is a piece-wise constant function whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be efficiently solved as a computational geometry problem. Experimental results on real-world high-throughput datasets show that CDAUC outperforms competing methods for refining DML motifs while being one order of magnitude faster. Preliminary results also show that CDAUC may be useful for improving the interpretability of convolutional kernels generated by emerging deep learning approaches for predicting TF sequence specificities. AVAILABILITY AND IMPLEMENTATION: CDAUC is available at: https://drive.google.com/drive/folders/0BxOW5MtIZbJjNFpCeHlBVWJHeW8 . CONTACT: dshuang@tongji.edu.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
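
The piece-wise constant behavior mentioned in this abstract follows from how the AUC is defined: it depends only on the relative ordering of positive and negative scores. The sketch below computes the AUC with the standard pairwise (Mann-Whitney) formulation; it illustrates the criterion only and is not the CDAUC algorithm, and the function name `auc` is chosen for this example.

```python
from typing import Sequence


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Changing a score without moving it past any score of the other class leaves every pairwise comparison, and therefore the AUC, unchanged. The value jumps only when an ordering flips, which is why each one-dimensional sub-problem has a finite set of candidate values.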


Subject(s)
Algorithms ; Computational Biology/methods ; Nucleotide Motifs ; Promoter Regions, Genetic ; Transcription Factors/metabolism ; Area Under Curve ; Binding Sites ; Humans ; K562 Cells ; Protein Binding ; ROC Curve
20.
Int J Mol Sci ; 19(10)2018 Oct 15.
Article in English | MEDLINE | ID: mdl-30326663

ABSTRACT

Gene regulatory network (GRN) inference helps explain the growth and development of animals and plants. Many computational approaches have been proposed to infer GRNs. However, these approaches hardly meet modeling needs, and redundancy-reduction methods based on a single information-theoretic measure generalize poorly and are unstable. To overcome these limitations, this paper proposes a novel algorithm, named HSCVFNT, that infers gene regulatory networks with time-delayed regulations using a hybrid scoring method and a complex-valued flexible neural tree (CVFNT). The regulations of each target gene are obtained by iteratively performing HSCVFNT. For each target gene, the algorithm uses a novel scoring method based on time-delayed mutual information (TDMI), the time-delayed maximum information coefficient (TDMIC) and the time-delayed correlation coefficient (TDCC) to reduce the redundancy of regulatory relationships and obtain the candidate regulatory factor set. Then, the TDCC method is used to create a time-delayed gene expression time-series matrix. Finally, a complex-valued flexible neural tree model is used to infer the time-delayed regulations of each target gene from the time-delayed time-series matrix. Three real time-series expression datasets from the SOS (Save Our Soul) DNA repair system in E. coli and from Saccharomyces cerevisiae are used to evaluate the performance of the HSCVFNT algorithm. HSCVFNT obtains F-scores of 0.923, 0.8 and 0.625 for SOS network and IRMA (In vivo Reverse-Engineering and Modeling Assessment) network inference, respectively, which are 5.5%, 14.3% and 72.2% higher than the best performance of other state-of-the-art GRN inference methods and time-delayed methods.
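
A time-delayed correlation score of the kind named in this abstract can be sketched as the Pearson correlation between a regulator's expression at time t and a target's expression at time t + delay. This is an assumed, simplified reading of TDCC; the paper's exact definition and its combination with TDMI and TDMIC are not given in the abstract, and all function names below are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def time_delayed_correlation(
    regulator: Sequence[float], target: Sequence[float], delay: int
) -> float:
    """Correlate regulator[t] with target[t + delay]."""
    if len(regulator) != len(target):
        raise ValueError("series must have equal length")
    if delay < 0 or delay >= len(regulator) - 1:
        raise ValueError("delay must leave at least two overlapping points")
    return pearson(regulator[: len(regulator) - delay], target[delay:])


def best_delay(
    regulator: Sequence[float], target: Sequence[float], max_delay: int
) -> int:
    """Delay in 0..max_delay with the largest absolute correlation."""
    return max(
        range(max_delay + 1),
        key=lambda d: abs(time_delayed_correlation(regulator, target, d)),
    )
```

Shifting the target by the chosen delay for each candidate regulator is one way to assemble a time-delayed expression matrix like the one the abstract describes.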


Subject(s)
Algorithms ; Computational Biology ; Gene Regulatory Networks ; Bayes Theorem ; Computational Biology/methods ; DNA Repair ; Escherichia coli/genetics ; Neural Networks, Computer ; Reproducibility of Results ; Saccharomyces cerevisiae/genetics ; Sensitivity and Specificity