Search | VHL Regional Portal

Tumor origin detection with tissue-specific miRNA and DNA methylation markers.

Tang, Wei; Wan, Shixiang; Yang, Zhen; Teschendorff, Andrew E; Zou, Quan.

Bioinformatics ; 34(3): 398-406, 2018 02 01.

Article in English | MEDLINE | ID: mdl-29028927

ABSTRACT

Motivation: Clearly identifying the primary site of a tumor is of great importance for subsequent site-specific targeted treatment and could improve patients' overall survival. Although many classifiers based on gene expression have been proposed to predict the tumor primary site, only a few studies have used DNA methylation (DNAm) profiles to develop classifiers, and none of them has compared the performance of classifiers based on different profiles. Results: We introduced novel selection strategies to identify highly tissue-specific CpG sites and then used the random forest approach to construct classifiers that predict the origin of tumors. We also compared prediction performance by applying a similar strategy to miRNA expression profiles. Our analysis indicated that these classifiers had an accuracy of 96.05% (Maximum-Relevance-Maximum-Distance: 90.02-99.99%) or 95.31% (principal component analysis: 79.82-99.91%) on independent DNAm datasets, and an overall accuracy of 91.30% (range 79.33-98.74%) on independent miRNA test sets for predicting tumor origin. This suggests that our feature selection methods are very effective at identifying tissue-specific biomarkers and that the classifiers we developed can efficiently predict the origin of tumors. We also developed a user-friendly webserver that helps users predict tumor origin by uploading the miRNA expression or DNAm profiles of interest. Availability and implementation: The webserver and the related data and code are accessible at http://server.malab.cn/MMCOP/. Contact: zouquan@nclab.net or a.teschendorff@ucl.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
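The abstract reports an overall accuracy together with a range. One plausible reading, assumed here, is that the range spans per-tissue accuracies. The sketch below shows how the two figures relate on a small made-up example; the function name `accuracy_summary` and the toy labels are hypothetical and not taken from the paper or its webserver.

```python
from collections import defaultdict

def accuracy_summary(true_labels, predicted_labels):
    """Return (overall accuracy, {tissue: accuracy}) as fractions in [0, 1]."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    total = defaultdict(int)    # samples seen per true tissue
    correct = defaultdict(int)  # correctly predicted samples per true tissue
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    overall = sum(correct.values()) / len(true_labels)
    per_tissue = {tissue: correct[tissue] / total[tissue] for tissue in total}
    return overall, per_tissue

# Hypothetical toy data (assumption for illustration only).
truth = ["lung", "lung", "breast", "breast", "colon", "colon", "colon", "colon"]
pred = ["lung", "breast", "breast", "breast", "colon", "colon", "colon", "lung"]
overall, per_tissue = accuracy_summary(truth, pred)
lowest, highest = min(per_tissue.values()), max(per_tissue.values())
```

The overall figure is weighted by how many samples each tissue contributes, so it can sit anywhere between the lowest and highest per-tissue values.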

Subject(s)

Computational Biology/methods , DNA Methylation , Genes, Neoplasm , MicroRNAs/genetics , Neoplasms/diagnosis , CpG Islands , DNA, Neoplasm , Female , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Humans , Male , Neoplasms/genetics , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods

HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing.

Wan, Shixiang; Zou, Quan.

Algorithms Mol Biol ; 12: 25, 2017.

Article in English | MEDLINE | ID: mdl-29026435

ABSTRACT

BACKGROUND: Multiple sequence alignment (MSA) plays a key role in biological sequence analyses, especially in phylogenetic tree construction. The rapid growth of next-generation sequencing data has left a shortage of efficient ultra-large biological sequence alignment approaches that can cope with different sequence types. METHODS: Distributed and parallel computing is a crucial technique for accelerating ultra-large (e.g. files larger than 1 GB) sequence analyses. Based on HAlign and the Spark distributed computing system, we implement a highly cost-efficient and time-efficient tool, HAlign-II, to address ultra-large multiple biological sequence alignment and phylogenetic tree construction. RESULTS: Experiments on large-scale DNA and protein data sets, with files larger than 1 GB, showed that HAlign-II saves time and space and outperforms current software tools. HAlign-II can efficiently carry out MSA and construct phylogenetic trees with ultra-large numbers of biological sequences. HAlign-II shows extremely high memory efficiency and scales well with increases in computing resources. CONCLUSIONS: HAlign-II provides a user-friendly web server based on our distributed computing infrastructure. HAlign-II, with open-source code and datasets, is available at http://lab.malab.cn/soft/halign.

HPSLPred: An Ensemble Multi-Label Classifier for Human Protein Subcellular Location Prediction with Imbalanced Source.

Wan, Shixiang; Duan, Yucong; Zou, Quan.

Proteomics ; 17(17-18), 2017 Sep.

Article in English | MEDLINE | ID: mdl-28776938

ABSTRACT

Predicting the subcellular localization of proteins is an important and challenging problem. Traditional experimental approaches are often expensive and time-consuming. Consequently, a growing number of research efforts employ machine learning approaches to predict the subcellular location of proteins. State-of-the-art prediction methods face two main challenges. First, most existing techniques are designed for multi-class rather than multi-label classification, which ignores connections between multiple labels. In reality, the presence of a protein at multiple locations carries distinct biological significance that deserves special attention and cannot be ignored. Second, techniques for handling imbalanced data in multi-label classification problems are necessary, but have not been employed. To solve these two issues, we have developed an ensemble multi-label classifier called HPSLPred, which can be applied to multi-label classification with an imbalanced protein source. For convenience, a user-friendly webserver has been established at http://server.malab.cn/HPSLPred.
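To make the two challenges concrete, the following minimal sketch encodes multi-location annotations as a 0/1 indicator matrix (the usual starting point for multi-label learning) and measures how imbalanced each label is. It is an illustration only and does not reproduce HPSLPred; the function names, location names, and toy annotations are assumptions.

```python
def to_indicator_matrix(annotations, labels):
    """Encode each protein's set of locations as one 0/1 column per label."""
    return [[1 if label in locations else 0 for label in labels]
            for locations in annotations]

def imbalance_ratios(matrix, labels):
    """Per-label ratio of negative to positive examples.

    A label with no positive example gets float('inf').
    """
    ratios = {}
    for column, label in enumerate(labels):
        positives = sum(row[column] for row in matrix)
        negatives = len(matrix) - positives
        ratios[label] = negatives / positives if positives else float("inf")
    return ratios

# Hypothetical annotations: each protein may sit in several locations at once.
labels = ["nucleus", "cytoplasm", "membrane"]
annotations = [
    {"nucleus"},
    {"nucleus", "cytoplasm"},
    {"cytoplasm"},
    {"nucleus"},
    {"membrane", "nucleus"},
]
matrix = to_indicator_matrix(annotations, labels)
ratios = imbalance_ratios(matrix, labels)
```

A multi-class encoding would force each protein into a single column, losing the second and fifth proteins' dual locations; the per-label ratios show which columns would need rebalancing before training.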

Subject(s)

Computational Biology/methods , Machine Learning , Proteins/classification , Proteins/metabolism , Databases, Protein , Humans , Intracellular Space , Protein Transport , Subcellular Fractions

A novel hierarchical selective ensemble classifier with bioinformatics application.

Wei, Leyi; Wan, Shixiang; Guo, Jiasheng; Wong, Kelvin Kl.

Artif Intell Med ; 83: 82-90, 2017 Nov.

Article in English | MEDLINE | ID: mdl-28245947

ABSTRACT

Selective ensemble learning is a technique that selects a subset of diverse and accurate basic models in order to achieve stronger generalization ability. In this paper, we propose a novel learning algorithm based on parallel optimization and hierarchical selection (PTHS). Our novel feature selection method is based on maximizing the sum of relevance and distance (MSRD) to address the problem of high dimensionality. Specifically, the PTHS algorithm employs parallel optimization, candidate model pruning based on k-means, and a hierarchical selection framework. We combine the prediction results of the basic models by majority voting, and employ a divide-and-conquer strategy to save computing time. In addition, the PT algorithm is capable of transforming a multi-class problem into a binary classification problem, thereby allowing our ensemble model to address multi-class problems. Empirical study shows that MSRD is efficient in solving the high dimensionality problem, and that PTHS exhibits better performance than other existing classification algorithms. Most importantly, our classifier achieved high-level performance on several bioinformatics problems (e.g. tRNA identification and protein-protein interaction prediction), demonstrating efficiency and robustness.
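The abstract does not give the MSRD formula, so the sketch below is only one plausible reading of "sum of relevance and distance": each feature is scored by its absolute Pearson correlation with the class label plus its mean Euclidean distance to the other features. That scoring rule, the assumption that features are already scaled to a comparable range, and every function name here are illustrative choices, not the authors' method. A small majority-voting helper is included for the ensemble combination step.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def rank_features(columns, labels):
    """Rank features by relevance + mean distance to other features (assumed rule).

    columns: dict mapping feature name -> list of values (one per sample).
    labels:  numeric class labels, one per sample.
    Returns (names sorted best first, {name: score}).
    """
    names = list(columns)
    scores = {}
    for name in names:
        relevance = abs(pearson(columns[name], labels))
        distances = [euclidean(columns[name], columns[other])
                     for other in names if other != name]
        mean_distance = sum(distances) / len(distances) if distances else 0.0
        scores[name] = relevance + mean_distance
    return sorted(names, key=scores.get, reverse=True), scores

def majority_vote(predictions):
    """Return the most frequent prediction among the basic models."""
    return Counter(predictions).most_common(1)[0][0]
```

Because relevance is bounded by 1 while distance is not, a practical version would normalize the two terms before summing; that step is omitted here for brevity.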

Subject(s)

Computational Biology/methods , Data Mining/methods , Machine Learning , Proteins/classification , RNA, Transfer/classification , Area Under Curve , Databases, Genetic , Protein Interaction Maps , Proteins/metabolism , RNA, Transfer/genetics , RNA, Transfer/metabolism , ROC Curve , Reproducibility of Results

Reconstructing evolutionary trees in parallel for massive sequences.

Zou, Quan; Wan, Shixiang; Zeng, Xiangxiang; Ma, Zhanshan Sam.

BMC Syst Biol ; 11(Suppl 6): 100, 2017 12 14.

Article in English | MEDLINE | ID: mdl-29297337

ABSTRACT

BACKGROUND: Building evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing an evolutionary tree for ultra-large sequence sets is hard. Massive multiple sequence alignment is also challenging and consumes considerable time and space. Hadoop and Spark have been developed recently and offer new promise for classical computational biology problems. In this paper, we tried to solve multiple sequence alignment and evolutionary reconstruction in parallel. RESULTS: HPTree, which is developed in this paper, can process big DNA sequence files quickly. It works well on files larger than 1 GB and performs better than other evolutionary reconstruction tools. Users can use HPTree to reconstruct evolutionary trees on computer clusters or a cloud platform (e.g. Amazon Cloud). HPTree could help with population evolution research and metagenomics analysis. CONCLUSIONS: In this paper, we employ the Hadoop and Spark platforms and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. The neighbour-joining model was employed for evolutionary tree building. We have released our software together with its source code at http://lab.malab.cn/soft/HPtree/ .
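For readers unfamiliar with the neighbour-joining step named in the conclusions, here is a minimal single-machine sketch of the textbook algorithm. It is not HPTree's implementation and ignores the Hadoop/Spark parallelism entirely; the function name, the input format (a list of names plus a symmetric distance matrix), and the naming of internal nodes are assumptions made for illustration.

```python
from itertools import combinations

def neighbor_joining(names, matrix):
    """Textbook neighbour-joining on a symmetric distance matrix.

    names:  unique leaf names (at least two).
    matrix: matrix[i][j] is the distance between names[i] and names[j].
    Returns the unrooted tree as a list of (node, node, branch_length) edges.
    """
    d = {}
    for i, a in enumerate(names):
        for j, b in enumerate(names):
            if i < j:
                d[frozenset((a, b))] = float(matrix[i][j])

    def dist(a, b):
        return d[frozenset((a, b))]

    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {a: sum(dist(a, k) for k in nodes if k != a) for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * dist(*p) - r[p[0]] - r[p[1]])
        d_ab = dist(a, b)
        len_a = d_ab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        parent = f"({a},{b})"  # internal node named after its two children
        edges += [(parent, a, len_a), (parent, b, len_b)]
        # Distances from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((parent, k))] = (dist(a, k) + dist(b, k) - d_ab) / 2
        nodes = [k for k in nodes if k not in (a, b)] + [parent]
    # Connect the last two nodes directly.
    edges.append((nodes[0], nodes[1], dist(nodes[0], nodes[1])))
    return edges
```

Each iteration scans all pairs, so this sketch costs O(n^3) time overall, which is one reason tools for massive data sets cluster the sequences and distribute the work first.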

Subject(s)

Evolution, Molecular , Phylogeny , Sequence Analysis, DNA/methods , Software , Algorithms , Classification/methods , Computational Biology/methods , Sequence Alignment/methods

Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy.

Zou, Quan; Wan, Shixiang; Ju, Ying; Tang, Jijun; Zeng, Xiangxiang.

BMC Syst Biol ; 10(Suppl 4): 114, 2016 Dec 23.

Article in English | MEDLINE | ID: mdl-28155714

ABSTRACT

BACKGROUND: It is essential to discover protein function from novel primary sequences. Wet lab experimental procedures are not only time-consuming but also costly, so predicting protein structure and function reliably based only on amino acid sequence has significant value. TATA-binding protein (TBP) is a kind of DNA-binding protein that plays a key role in transcription regulation. Our study proposes an automatic approach for identifying TATA-binding proteins efficiently, accurately, and conveniently. This method can guide the identification of specific proteins with computational intelligence strategies. RESULTS: First, we proposed novel fingerprint features for TBP based on pseudo amino acid composition, physicochemical properties, and secondary structure. Second, hierarchical feature dimensionality reduction strategies were employed to further improve performance. Currently, Pretata achieves 92.92% TATA-binding protein prediction accuracy, which is better than all other existing methods. CONCLUSIONS: The experiments demonstrate that our method could greatly improve prediction accuracy and speed, thus making large-scale NGS data prediction practical. A web server has been developed to assist other researchers and can be accessed at http://server.malab.cn/preTata/ .
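As a rough illustration of sequence-derived features, the sketch below computes plain amino acid composition, the 20-dimensional frequency vector that pseudo amino acid composition extends with sequence-order terms. It is a deliberately simpler relative of the features named in the abstract, not Pretata's feature set; the function name and the choice to ignore non-standard residues are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    The result is a 20-element list ordered as AMINO_ACIDS. Characters
    outside the standard alphabet (e.g. X, gaps) are ignored, which is an
    assumption of this sketch.
    """
    seq = sequence.upper()
    counts = [seq.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [count / total for count in counts]

# Hypothetical toy peptide, not a real TBP sequence.
features = amino_acid_composition("AACD")
```

A fixed-length vector like this lets proteins of different lengths be fed to the same classifier, which is the reason composition-style features are a common starting point.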

Subject(s)

Computational Biology/methods , TATA-Box Binding Protein/metabolism , Amino Acid Sequence , Chemical Phenomena , Protein Binding , Protein Structure, Secondary , Software , Support Vector Machine , TATA-Box Binding Protein/chemistry
