1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701413

ABSTRACT

With the emergence of large amounts of single-cell RNA sequencing (scRNA-seq) data, computational methods have become critical for revealing biological mechanisms. Clustering is a representative approach for deciphering the cellular heterogeneity embedded in scRNA-seq data. However, because datasets are so diverse, none of the existing single-cell clustering methods performs best on all datasets. Weighted ensemble methods have been proposed to integrate multiple clustering results and improve heterogeneity analysis. These methods usually weight each base clustering by its overall reliability, ignoring differences in how well the same base clustering performs on different cells. In this paper, we propose scEWE, a self-representative ensemble learning framework based on a high-order element-wise weighting strategy. By assigning different base clustering weights to individual cells, we construct and optimize the consensus matrix at the level of individual cells. In addition, we extract high-order information between cells, which enhances the ability to represent similarity relationships between cells. scEWE is experimentally shown to significantly outperform state-of-the-art methods, which demonstrates the effectiveness of the method and supports its potential application to complex single-cell data analysis problems.
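
The element-wise weighting idea described above can be sketched as a weighted co-association (consensus) matrix. This is only an illustration under stated assumptions: the function name `weighted_consensus`, the toy labels, the toy weights, and the rule of multiplying the two cells' weights are all hypothetical and are not taken from scEWE itself.

```python
def weighted_consensus(labelings, weights):
    """Build a consensus matrix from several base clusterings.

    labelings[b][i] is the cluster label of cell i in base clustering b.
    weights[b][i] is a hypothetical per-cell reliability of clustering b.
    Assumption: a pair (i, j) is weighted by weights[b][i] * weights[b][j].
    """
    n_cells = len(labelings[0])
    consensus = [[0.0] * n_cells for _ in range(n_cells)]
    for i in range(n_cells):
        for j in range(n_cells):
            agree = 0.0   # weighted votes that i and j share a cluster
            total = 0.0   # total weight of all votes for this pair
            for labels, w in zip(labelings, weights):
                pair_weight = w[i] * w[j]
                total += pair_weight
                if labels[i] == labels[j]:
                    agree += pair_weight
            consensus[i][j] = agree / total if total > 0 else 0.0
    return consensus


# Toy input: two base clusterings of four cells (made-up values).
labelings = [[0, 0, 1, 1], [0, 1, 1, 1]]
weights = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 1.0, 1.0]]
C = weighted_consensus(labelings, weights)
```

With uniform weights this reduces to the ordinary co-association matrix; giving a base clustering a low weight on particular cells lowers its influence only on pairs involving those cells.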


Subjects
RNA Sequence Analysis, Single-Cell Analysis, Single-Cell Analysis/methods, Cluster Analysis, RNA Sequence Analysis/methods, Algorithms, Computational Biology/methods, Humans, RNA-Seq/methods
2.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37864293

ABSTRACT

Inference of gene regulatory networks (GRNs) from gene expression profiles has been a central problem in systems biology and bioinformatics in the past decades. The tremendous emergence of single-cell RNA sequencing (scRNA-seq) data brings new opportunities and challenges for GRN inference: the extensive dropouts and complicated noise structure may also degrade the performance of contemporary gene regulatory models. Thus, there is an urgent need to develop more accurate methods for gene regulatory network inference in single-cell data that also account for the noise structure. In this paper, we extend the traditional structural equation modeling (SEM) framework with a flexible noise modeling strategy: we use Gaussian mixtures to approximate the complex stochastic nature of a biological system, since the Gaussian mixture framework can arguably serve as a universal approximator for any continuous distribution. The proposed non-Gaussian SEM framework is called NG-SEM, which can be optimized by iteratively performing the Expectation-Maximization algorithm and the weighted least-squares method. Moreover, the Akaike Information Criterion is adopted to select the number of components of the Gaussian mixture. To probe the accuracy and stability of our proposed method, we design a comprehensive variety of controlled experiments to systematically investigate the performance of NG-SEM under various conditions, including simulations and real biological data sets. Results on synthetic data demonstrate that this strategy can improve the performance of the traditional Gaussian SEM model, and results on real biological data sets verify that NG-SEM outperforms five other state-of-the-art methods.


Subjects
Gene Regulatory Networks, Single-Cell Gene Expression Analysis, Latent Class Analysis, Algorithms, Computational Biology/methods
3.
Bioinformatics ; 39(7)2023 07 01.
Article in English | MEDLINE | ID: mdl-37382572

ABSTRACT

MOTIVATION: Simultaneous profiling of multi-omics single-cell data represents an exciting technological advancement for understanding cellular states and heterogeneity. Cellular indexing of transcriptomes and epitopes by sequencing allows for parallel quantification of cell-surface protein expression and transcriptome profiling in the same cells; methylome and transcriptome sequencing from single cells allows for analysis of transcriptomic and epigenomic profiles in the same individual cells. However, effective integration methods for mining the heterogeneity of cells from noisy, sparse, and complex multi-modal data are increasingly needed. RESULTS: In this article, we propose scHoML, a multi-modal high-order neighborhood Laplacian matrix optimization framework for integrating multi-omics single-cell data. A hierarchical clustering method is presented for analyzing the optimal embedding representation and identifying cell clusters in a robust manner. By integrating high-order and multi-modal Laplacian matrices, this novel method robustly represents the complex data structures and allows for systematic analysis at the multi-omics single-cell level, thus promoting further biological discoveries. AVAILABILITY AND IMPLEMENTATION: Matlab code is available at https://github.com/jianghruc/scHoML.


Subjects
Algorithms, Multiomics, Gene Expression Profiling, Transcriptome, Cluster Analysis, Single-Cell Analysis
4.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37079737

ABSTRACT

MOTIVATION: From a systematic perspective, it is crucial to infer and analyze gene regulatory networks (GRNs) from high-throughput single-cell RNA sequencing data. However, most existing GRN inference methods focus mainly on the network topology, and only a few of them consider how to explicitly describe the logic rules by which regulation in GRNs is updated in order to obtain their dynamics. Moreover, some inference methods also fail to deal with the over-fitting problem caused by noise in time series data. RESULTS: In this article, we propose a novel embedded Boolean threshold network method called LogBTF, which effectively infers GRNs by integrating regularized logistic regression and Boolean threshold functions. First, the continuous gene expression values are converted into Boolean values and the elastic net regression model is adopted to fit the binarized time series data. Then, the estimated regression coefficients are used to represent the unknown Boolean threshold functions of the candidate Boolean threshold network as the dynamical equations. To overcome the multi-collinearity and over-fitting problems, a new and effective approach is designed to optimize the network topology by adding a perturbation design matrix to the input data and thereafter setting sufficiently small elements of the output coefficient vector to zero. In addition, a cross-validation procedure is incorporated into the Boolean threshold network model framework to strengthen the inference capability. Finally, extensive experiments on one simulated Boolean value dataset, dozens of simulation datasets, and three real single-cell RNA sequencing datasets demonstrate that the LogBTF method can infer GRNs from time series data more accurately than some alternative methods for GRN inference. AVAILABILITY AND IMPLEMENTATION: The source data and code are available at https://github.com/zpliulab/LogBTF.
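
How a coefficient matrix can act as the dynamical equations of a Boolean threshold network can be sketched as follows. This is a toy illustration, not the LogBTF code: the helper names `sparsify`, `step` and `trajectory`, the 3-gene coefficients, and the convention that a gene is ON when its weighted input sum plus bias is strictly positive are all assumptions made for the example.

```python
def sparsify(weights, eps):
    """Set coefficients with magnitude below eps to zero (assumed pruning rule)."""
    return [[w if abs(w) >= eps else 0.0 for w in row] for row in weights]


def step(state, weights, biases):
    """One synchronous update: gene i is ON iff bias_i + sum_j w_ij * x_j > 0."""
    nxt = []
    for w_row, b in zip(weights, biases):
        total = b + sum(w * x for w, x in zip(w_row, state))
        nxt.append(1 if total > 0 else 0)
    return nxt


def trajectory(state, weights, biases, n_steps):
    """Return the list of states visited, starting from the initial state."""
    states = [list(state)]
    for _ in range(n_steps):
        states.append(step(states[-1], weights, biases))
    return states


# Hypothetical 3-gene network: gene 2 activates gene 0, gene 0 activates
# gene 1, and gene 1 represses gene 2. The values are made up.
weights = [[0.0, 0.0, 1.2],
           [0.9, 0.0, 0.0],
           [0.0, -1.5, 0.0]]
biases = [-0.5, -0.5, 0.5]
traj = trajectory([1, 0, 0], weights, biases, 4)
```

In this toy network the state alternates between [1, 0, 0] and [0, 1, 1], which shows how signed coefficients encode activation and repression in the update rule.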


Subjects
Algorithms, Gene Regulatory Networks, Time Factors, Computer Simulation, Gene Expression
5.
PLoS Comput Biol ; 19(3): e1010939, 2023 03.
Article in English | MEDLINE | ID: mdl-36930678

ABSTRACT

During breast cancer metastasis, the developmental process epithelial-mesenchymal (EM) transition is abnormally activated. Transcriptional regulatory networks controlling EM transition are well-studied; however, alternative RNA splicing also plays a critical regulatory role during this process. Alternative splicing was proved to control the EM transition process, and RNA-binding proteins were determined to regulate alternative splicing. A comprehensive understanding of alternative splicing and the RNA-binding proteins that regulate it during EM transition and their dynamic impact on breast cancer remains largely unknown. To accurately study the dynamic regulatory relationships, time-series data of the EM transition process are essential. However, only cross-sectional data of epithelial and mesenchymal specimens are available. Therefore, we developed a pseudotemporal causality-based Bayesian (PCB) approach to infer the dynamic regulatory relationships between alternative splicing events and RNA-binding proteins. Our study sheds light on facilitating the regulatory network-based approach to identify key RNA-binding proteins or target alternative splicing events for the diagnosis or treatment of cancers. The data and code for PCB are available at: http://hkumath.hku.hk/~wkc/PCB(data+code).zip.


Subjects
Breast Neoplasms, Humans, Female, Breast Neoplasms/metabolism, Bayes Theorem, Cross-Sectional Studies, Tumor Cell Line, Neoplastic Processes, RNA-Binding Proteins/genetics, RNA-Binding Proteins/metabolism, Alternative Splicing/genetics, Epithelial-Mesenchymal Transition/genetics
6.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34410342

ABSTRACT

MOTIVATION: The epithelial-mesenchymal transition (EMT) is a cellular-developmental process activated during tumor metastasis. Transcriptional regulatory networks controlling EMT are well studied; however, alternative RNA splicing also plays a critical regulatory role during this process. Unfortunately, a comprehensive understanding of alternative splicing (AS) and the RNA-binding proteins (RBPs) that regulate it during EMT remains largely unknown. Therefore, a great need exists to develop effective computational methods for predicting associations of RBPs and AS events. Dramatically increasing data sources that have direct and indirect information associated with RBPs and AS events have provided an ideal platform for inferring these associations. RESULTS: In this study, we propose a novel method for RBP-AS target prediction based on weighted data fusion with sparse matrix tri-factorization (WDFSMF for short) that simultaneously decomposes heterogeneous data source matrices into low-rank matrices to reveal hidden associations. WDFSMF can select and integrate data sources by assigning different weights to those sources, and these weights can be assigned automatically. In addition, WDFSMF can identify significant RBP complexes regulating AS events and eliminate noise and outliers from the data. Our proposed method achieves an area under the receiver operating characteristic curve (AUC) of 90.78%, which shows that WDFSMF can effectively predict RBP-AS event associations with higher accuracy compared with previous methods. Furthermore, this study identifies significant RBPs as complexes for AS events during EMT and provides solid ground for further investigation into RNA regulation during EMT and metastasis. WDFSMF is a general data fusion framework, and as such it can also be adapted to predict associations between other biological entities.


Subjects
Alternative Splicing, Computational Biology/methods, Epithelial-Mesenchymal Transition/genetics, Neoplastic Gene Expression Regulation, RNA-Binding Proteins/metabolism, Algorithms, Computational Biology/standards, Humans, ROC Curve, Reproducibility of Results, Software
7.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33517359

ABSTRACT

MOTIVATION: The developmental process of epithelial-mesenchymal transition (EMT) is abnormally activated during breast cancer metastasis. Transcriptional regulatory networks that control EMT have been well studied; however, alternative RNA splicing plays a vital regulatory role during this process and the regulating mechanism needs further exploration. Because of the huge cost and complexity of biological experiments, the underlying mechanisms of alternative splicing (AS) and associated RNA-binding proteins (RBPs) that regulate the EMT process remain largely unknown. Thus, there is an urgent need to develop computational methods for predicting potential RBP-AS event associations during EMT. RESULTS: We developed a novel model for RBP-AS target prediction during EMT that is based on inductive matrix completion (RAIMC). Integrated RBP similarities were calculated based on RBP regulating similarity, and RBP Gaussian interaction profile (GIP) kernel similarity, while integrated AS event similarities were computed based on AS event module similarity and AS event GIP kernel similarity. Our primary objective was to complete missing or unknown RBP-AS event associations based on known associations and on integrated RBP and AS event similarities. In this paper, we identify significant RBPs for AS events during EMT and discuss potential regulating mechanisms. Our computational results confirm the effectiveness and superiority of our model over other state-of-the-art methods. Our RAIMC model achieved AUC values of 0.9587 and 0.9765 based on leave-one-out cross-validation (CV) and 5-fold CV, respectively, which are larger than the AUC values from the previous models. RAIMC is a general matrix completion framework that can be adopted to predict associations between other biological entities. We further validated the prediction performance of RAIMC on the genes CD44 and MAP3K7. RAIMC can identify the related regulating RBPs for isoforms of these two genes. 
AVAILABILITY AND IMPLEMENTATION: The source code for RAIMC is available at https://github.com/yushanqiu/RAIMC. CONTACT: zouquan@nclab.net online.
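
The Gaussian interaction profile (GIP) kernel similarity mentioned above can be sketched in a few lines. The sketch uses the commonly used form K(i, j) = exp(-gamma * ||p_i - p_j||^2), with gamma set to a bandwidth parameter divided by the mean squared norm of the profiles; whether RAIMC uses exactly this normalization is an assumption, and the function name `gip_kernel` and the toy profiles are hypothetical.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """GIP kernel between binary interaction profiles.

    profiles[i] is the 0/1 association vector of entity i (for example,
    one RBP against all AS events). gamma_prime is the bandwidth
    parameter; the default of 1.0 is a common choice, assumed here.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(v * v for v in p) for p in profiles) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2
                          for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel


# Toy profiles: three entities against three targets (made-up values).
profiles = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
K = gip_kernel(profiles)
```

Entities with identical profiles get similarity 1, and the similarity decays as the profiles differ in more positions.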


Subjects
Alternative Splicing, Breast Neoplasms, Epithelial-Mesenchymal Transition/genetics, Neoplastic Gene Expression Regulation, Gene Regulatory Networks, Neoplasm Proteins, RNA-Binding Proteins, Breast Neoplasms/genetics, Breast Neoplasms/metabolism, Female, Humans, Neoplasm Proteins/genetics, Neoplasm Proteins/metabolism, RNA-Binding Proteins/genetics, RNA-Binding Proteins/metabolism
8.
Sensors (Basel) ; 21(18)2021 Sep 21.
Article in English | MEDLINE | ID: mdl-34577516

ABSTRACT

Pallet management as a backbone of logistics and supply chain activities is essential to supply chain parties, while a number of regulations, standards and operational constraints are considered in daily operations. In recent years, pallet pooling has been unconventionally advocated to manage pallets in a closed-loop system to enhance the sustainability and operational effectiveness, but pitfalls in terms of service reliability, quality compliance and pallet limitation when using a single service provider may occur. Therefore, this study incorporates a decentralisation mechanism into the pallet management to formulate a technological eco-system for pallet pooling, namely Pallet as a Service (PalletaaS), raised by the foundation of consortium blockchain and Internet of things (IoT). Consortium blockchain is regarded as the blockchain 3.0 to facilitate more industrial applications, except cryptocurrency, and the synergy of integrating a consortium blockchain and IoT is thus investigated. The corresponding layered architecture is proposed to structure the system deployment in the industry, in which the location-inventory-routing problem for pallet pooling is formulated. To demonstrate the values of this study, a case analysis to illustrate the human-computer interaction and pallet pooling operations is conducted. Overall, this study standardises the decentralised pallet management in the closed-loop mechanism, resulting in a constructive impact to sustainable development in the logistics industry.


Subjects
Blockchain, Internet of Things, Humans, Reproducibility of Results
9.
J Theor Biol ; 463: 1-11, 2019 02 21.
Article in English | MEDLINE | ID: mdl-30543810

ABSTRACT

It is known that many driver nodes are required to control complex biological networks. Previous studies imply that O(N) driver nodes are required in both linear complex network and Boolean network models with N nodes if an arbitrary state is specified as the target. In order to cope with this intrinsic difficulty, we consider a special case of the control problem in which the targets are restricted to attractors. For this special case, we mathematically prove under the uniform distribution of states in basins that the expected number of driver nodes is only O(log₂ N + log₂ M) for controlling Boolean networks, where M is the number of attractors. Since it is expected that M is not very large in many practical networks, the new model requires a much smaller number of driver nodes. This result is based on discovery of novel relationships between control problems on Boolean networks and the coupon collector's problem, a well-known concept in combinatorics. We also provide lower bounds of the number of driver nodes as well as simulation results using artificial and realistic network data, which support our theoretical findings.
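
To give a feel for the quantities named above, the following sketch evaluates the textbook coupon collector expectation and the sum log₂ N + log₂ M for chosen inputs. It does not reproduce the paper's proof, and the constant hidden in the O(·) bound is not given in the abstract; the function names and example values are hypothetical.

```python
import math
from fractions import Fraction


def coupon_collector_expectation(n):
    """Expected number of uniform draws to see all n coupon types: n * H_n."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))


def log_bound_terms(n_nodes, n_attractors):
    """The quantity log2(N) + log2(M) that appears inside the O(.) bound."""
    return math.log2(n_nodes) + math.log2(n_attractors)


# Example values chosen only for illustration.
expected_draws = coupon_collector_expectation(3)   # 3 * (1 + 1/2 + 1/3)
scale = log_bound_terms(1024, 8)                   # log2(1024) + log2(8)
```

For a network with 1024 nodes and 8 attractors the logarithmic term is 13, compared with the O(N) scale of 1024 for arbitrary target states.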


Subjects
Biological Models, Theoretical Models, Algorithms, Systems Biology/methods
10.
BMC Bioinformatics ; 17 Suppl 7: 240, 2016 Jul 25.
Article in English | MEDLINE | ID: mdl-27454116

ABSTRACT

BACKGROUND: Abnormalities in glycan biosynthesis have been conclusively related to various diseases, whereas the complexity of the glycosylation process has impeded the quantitative analysis of biochemical experimental data for the identification of glycoforms contributing to disease. To overcome this limitation, the automatic construction of glycosylation reaction networks in silico is a critical step. RESULTS: In this paper, a framework K2014 is developed to automatically construct N-glycosylation networks in MATLAB with the involvement of the 27 most-known enzyme reaction rules of 22 enzymes, as an extension of previous model KB2005. A toolbox named Glycosylation Network Analysis Toolbox (GNAT) is applied to define network properties systematically, including linkages, stereochemical specificity and reaction conditions of enzymes. Our network shows a strong ability to predict a wider range of glycans produced by the enzymes encountered in the Golgi Apparatus in human cell expression systems. CONCLUSIONS: Our results demonstrate a better understanding of the underlying glycosylation process and the potential of systems glycobiology tools for analyzing conventional biochemical or mass spectrometry-based experimental data quantitatively in a more realistic and practical way.


Subjects
Biosynthetic Pathways, Computer Simulation, Glycomics/methods, Biological Models, Polysaccharides/biosynthesis, Glycosylation, Humans, Hydrolases/metabolism, Mass Spectrometry, Transferases/metabolism
11.
IEEE Trans Neural Netw Learn Syst ; 34(2): 921-931, 2023 Feb.
Article in English | MEDLINE | ID: mdl-34428155

ABSTRACT

An autoencoder is a layered neural network whose structure can be viewed as consisting of an encoder, which compresses an input vector to a lower dimensional vector, and a decoder, which transforms the low-dimensional vector back to the original input vector (or one that is very similar). In this article, we explore the compressive power of autoencoders that are Boolean threshold networks by studying the numbers of nodes and layers that are required to ensure that each vector in a given set of distinct input binary vectors is transformed back to its original. We show that for any set of n distinct vectors there exists a seven-layer autoencoder with the optimal compression ratio (i.e., the size of the middle layer is logarithmic in n), but that there is a set of n vectors for which there is no three-layer autoencoder with a middle layer of logarithmic size. In addition, we present a kind of tradeoff: if the compression ratio is allowed to be considerably larger than the optimal, then there is a five-layer autoencoder. We also study the numbers of nodes and layers required only for encoding, and the results suggest that the decoding part is the bottleneck of autoencoding. For example, there always is a three-layer Boolean threshold encoder that compresses n vectors into a dimension that is twice the logarithm of n.

12.
IEEE Trans Neural Netw Learn Syst ; 33(9): 4147-4159, 2022 Sep.
Article in English | MEDLINE | ID: mdl-33587712

ABSTRACT

We study the distribution of successor states in Boolean networks (BNs). The state vector y is called a successor of x if y = F(x) holds, where x, y ∈ {0,1}^n are state vectors and F is an ordered set of Boolean functions describing the state transitions. This problem is motivated by analyzing how information propagates via hidden layers in Boolean threshold networks (a discrete model of neural networks) and is kept or lost during time evolution in BNs. In this article, we measure the distribution via entropy and study how entropy changes via the transition from x to y, assuming that x is given uniformly at random. We focus on BNs consisting of exclusive OR (XOR) functions, canalyzing functions, and threshold functions. As a main result, we show that there exists a BN consisting of d-ary XOR functions, which preserves the entropy if d is odd and [Formula: see text], whereas there does not exist such a BN if d is even. We also show that there exists a specific BN consisting of d-ary threshold functions, which preserves the entropy if [Formula: see text]. Furthermore, we theoretically analyze the upper and lower bounds of the entropy for BNs consisting of canalyzing functions and perform computational experiments using BN models of real biological networks.
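
The entropy of the successor-state distribution can be computed by brute force for small networks. The sketch below enumerates all 2^n inputs with equal probability, as the abstract assumes, and measures the Shannon entropy of y = F(x). The function name `successor_entropy` and the two toy 2-node networks are hypothetical and are not examples from the paper.

```python
import math
from collections import Counter
from itertools import product


def successor_entropy(functions, n):
    """Shannon entropy (in bits) of y = F(x) when x is uniform on {0,1}^n.

    functions is an ordered list of Boolean functions, one per node, each
    taking the full state tuple x and returning 0 or 1.
    """
    counts = Counter()
    for x in product((0, 1), repeat=n):
        y = tuple(f(x) for f in functions)
        counts[y] += 1
    total = 2 ** n
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy network A: both nodes compute x0 XOR x1, so distinct inputs collide.
xor_both = [lambda x: x[0] ^ x[1], lambda x: x[0] ^ x[1]]
# Toy network B: the map is a bijection, so no information is lost.
xor_mixed = [lambda x: x[0] ^ x[1], lambda x: x[1]]

h_a = successor_entropy(xor_both, 2)
h_b = successor_entropy(xor_mixed, 2)
```

The input entropy is n = 2 bits in both cases. Network A maps four inputs onto two successors and keeps 1 bit, while network B is a bijection on the state space and keeps all 2 bits.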

13.
Comput Biol Chem ; 100: 107747, 2022 Oct.
Article in English | MEDLINE | ID: mdl-35932551

ABSTRACT

Recently, identifying robust biomarkers or signatures from gene expression profiling data has attracted much attention in computational biomedicine. The successful discovery of biomarkers for complex diseases such as spontaneous preterm birth (SPTB) and high-grade serous ovarian cancer (HGSOC) will be beneficial to reduce the risk of preterm birth and ovarian cancer among women for early detection and intervention. In this paper, we propose a stable machine learning-recursive feature elimination (StabML-RFE for short) strategy for screening robust biomarkers from high-throughput gene expression data. We employ eight popular machine learning methods, namely AdaBoost (AB), Decision Tree (DT), Gradient Boosted Decision Trees (GBDT), Naive Bayes (NB), Neural Network (NNET), Random Forest (RF), Support Vector Machine (SVM) and XGBoost (XGB), to train on all feature genes of training data, apply recursive feature elimination (RFE) to remove the least important features sequentially, and obtain eight gene subsets with feature importance ranking. Then we select the top-ranking features in each ranked subset as the optimal feature subset. We establish a stability metric aggregated with classification performance on test data to assess the robustness of the eight different feature selection techniques. Finally, StabML-RFE chooses the high-frequent features in the subsets of the combination with maximum stability value as robust biomarkers. Particularly, we verify the screened biomarkers not only via internal validation, functional enrichment analysis and literature check, but also via external validation on two real-world SPTB and HGSOC datasets respectively. Obviously, the proposed StabML-RFE biomarker discovery pipeline easily serves as a model for identifying diagnostic biomarkers for other complex diseases from omics data. The source code and data can be found at https://github.com/zpliulab/StabML-RFE.


Subjects
Ovarian Neoplasms, Premature Birth, Algorithms, Bayes Theorem, Biomarkers/metabolism, Female, Gene Expression, Humans, Newborn Infant, Machine Learning, Ovarian Neoplasms/diagnosis, Ovarian Neoplasms/genetics, Support Vector Machine
14.
IEEE/ACM Trans Comput Biol Bioinform ; 18(6): 2714-2723, 2021.
Article in English | MEDLINE | ID: mdl-32386162

ABSTRACT

Clustering tumor metastasis samples from gene expression data at the whole genome level remains an arduous challenge, in particular, when the number of experimental samples is small and the number of genes is huge. We focus on the prediction of the epithelial-mesenchymal transition (EMT), which is an underlying mechanism of tumor metastasis, here, rather than tumor metastasis itself, to avoid confounding effects of uncertainties derived from various factors. In this paper, we propose a novel model in predicting EMT based on multidimensional scaling (MDS) strategies and integrating entropy and random matrix detection strategies to determine the optimal reduced number of dimension in low dimensional space. We verified our proposed model with the gene expression data for EMT samples of breast cancer and the experimental results demonstrated the superiority over state-of-the-art clustering methods. Furthermore, we developed a novel feature extraction method for selecting the significant genes and predicting the tumor metastasis. The source code is available at "https://github.com/yushanqiu/yushan.qiu-szu.edu.cn".


Subjects
Computational Biology/methods, Epithelial-Mesenchymal Transition/genetics, Multidimensional Scaling Analysis, Unsupervised Machine Learning, Breast Neoplasms/genetics, Breast Neoplasms/pathology, Cluster Analysis, Female, Humans, Neoplasm Metastasis/genetics, Transcriptome/genetics
15.
BMC Bioinformatics ; 11: 501, 2010 Oct 08.
Article in English | MEDLINE | ID: mdl-20932284

ABSTRACT

BACKGROUND: Drugs can influence the whole metabolic system by targeting enzymes which catalyze metabolic reactions. The existence of interactions between drugs and metabolic reactions suggests a potential way to discover drug targets. RESULTS: In this paper, we present a computational method to predict new targets for approved anti-cancer drugs by exploring drug-reaction interactions. We construct a Drug-Reaction Network to provide a global view of drug-reaction interactions and drug-pathway interactions. The recent reconstruction of the human metabolic network and development of flux analysis approaches make it possible to predict each metabolic reaction's cell line-specific flux state based on the cell line-specific gene expressions. We first profile each reaction by its flux states in NCI-60 cancer cell lines, and then propose a kernel k-nearest neighbor model to predict related metabolic reactions and enzyme targets for approved cancer drugs. We also integrate the target structure data with reaction flux profiles to predict drug targets and the area under curves can reach 0.92. CONCLUSIONS: The cross validations using the methods with and without metabolic network indicate that the former method is significantly better than the latter. Further experiments show the synergism of reaction flux profiles and target structure for drug target prediction. It also implies the significant contribution of metabolic network to predict drug targets. Finally, we apply our method to predict new reactions and possible enzyme targets for cancer drugs.


Subjects
Antineoplastic Agents/pharmacology, Computational Biology/methods, Enzyme Inhibitors/pharmacology, Antineoplastic Agents/therapeutic use, Tumor Cell Line, Enzyme Inhibitors/therapeutic use, Humans, Metabolic Networks and Pathways, Neoplasms/drug therapy, Neoplasms/enzymology
16.
BMC Bioinformatics ; 11 Suppl 1: S33, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122206

ABSTRACT

BACKGROUND: Glycobiology pertains to the study of carbohydrate sugar chains, or glycans, in a particular cell or organism. Many computational approaches have been proposed for analyzing these complex glycan structures, which are chains of monosaccharides. The monosaccharides are linked to one another by glycosidic bonds, which can take on a variety of conformations, thus forming branches and resulting in complex tree structures. The q-gram method is one of the recent methods used to understand glycan function based on the classification of their tree structures. This q-gram method assumes that, for a certain q, different q-grams share no similarity among themselves. That is, if two structures have completely different components, then they are completely different. However, from a biological standpoint, this is not the case. In this paper, we propose a weighted q-gram method to measure the similarity among glycans by incorporating the similarity of the geometric structures, monosaccharides and glycosidic bonds among q-grams. In contrast to the traditional q-gram method, our weighted q-gram method admits similarity among q-grams for a certain q. New kernels for glycan structure were thus developed and then applied in SVMs to classify glycans. RESULTS: Two glycan datasets were used to compare the weighted q-gram method and the original q-gram method. The results show that the incorporation of q-gram similarity improves the classification performance for all of the important glycan classes tested. CONCLUSION: The results in this paper indicate that similarity among q-grams obtained from geometric structure, monosaccharides and glycosidic linkage contributes to glycan function classification. This is a big step towards the understanding of glycan function based on their complex structures.


Subjects
Algorithms, Carbohydrate Conformation, Polysaccharides/chemistry, Glycomics
17.
Genome Inform ; 22: 95-120, 2010 Jan.
Article in English | MEDLINE | ID: mdl-20238422

ABSTRACT

Annotating genes is a fundamental issue in the post-genomic era. A typical procedure is to first cluster genes by their features and then assign functions to unknown genes by using known genes in the same cluster. A lot of genomic information is available for this task, but the two major types of data that can be measured for any gene are microarray expressions and sequences, each of which has its own flaws. Thus a natural and promising approach for gene annotation is to integrate these two data sources, especially in terms of their costs to be optimized in clustering. We develop an efficient three-step gene annotation method that includes spectral clustering over the integrated cost, based on the idea of network modularity. We rigorously examined the performance of our proposed method from three different viewpoints. All experimental results indicate the performance advantage of our method over possible clustering/classification-based approaches to gene function annotation, using expressions and/or sequences.


Subjects
Gene Expression Profiling/methods, Gene Expression/physiology, Genes/physiology, Automated Pattern Recognition, Signal Transduction/physiology, Systems Integration, Algorithms, Humans
18.
Article in English | MEDLINE | ID: mdl-29994681

ABSTRACT

The identification of drug side-effects is considered to be an important step in drug design, which could not only shorten the time but also reduce the cost of drug development. In this paper, we investigate the relationship between the potential side-effects of drug candidates and their chemical structures. The preliminary Regularized Regression (RR) model for drug side-effect prediction has promising features: efficient model training and the existence of a closed-form solution. It performs better than other state-of-the-art methods in terms of minimum accuracy and average accuracy. To examine how drug structure is associated with side-effects, we further propose a weighted GTS (Generalized T-Student kernel; WGTS) SVM model from a structural risk minimization perspective. The SVM model proposed in this paper provides a better understanding of drug side-effects in the process of drug development. The usefulness of the WGTS model lies in its superior performance in a cross-validation setting on 888 approved drugs with 1385 side-effect profiles from the SIDER database. This work is expected to shed light on studies that predict potential unidentified side-effects and to suggest how drug side-effects can be avoided by removing certain distinctive chemical structures.


Assuntos
Biologia Computacional/métodos , Efeitos Colaterais e Reações Adversas Relacionados a Medicamentos , Modelos Estatísticos , Preparações Farmacêuticas/química , Humanos , Estrutura Molecular , Análise de Regressão , Máquina de Vetores de Suporte
19.
Proteomics ; 9(15): 3833-42, 2009 Aug.
Artigo em Inglês | MEDLINE | ID: mdl-19681055

RESUMO

Peak detection is a pivotal first step in biomarker discovery from MS data and can significantly influence the results of downstream data analysis steps. We developed a novel automatic peak detection method for prOTOF MS data that does not require a priori knowledge of protein masses. Random noise is removed by an undecimated wavelet transform, and chemical noise is attenuated by an adaptive short-time discrete Fourier transform. Isotopic peaks corresponding to a single protein are combined by extracting an envelope over them. Depending on the S/N, the desired peaks in each individual spectrum are detected, and those with the highest intensity among their peak clusters are recorded. The common peaks among all the spectra are identified by choosing an appropriate cut-off threshold in complete-linkage hierarchical clustering. To remove the 1 Da shifting of the peaks, the peak corresponding to the same protein is determined as the detected peak with the largest number among its neighborhood. We validated this method using a data set of serial peptide and protein calibration standards. Compared with the MoverZ program, our new method detects more peaks and significantly enhances the S/N of the peaks after chemical noise removal. We then successfully applied this method to a data set of prOTOF MS spectra of albumin and albumin-bound proteins from serum samples of 59 patients with carotid artery disease compared to vascular disease-free patients to detect peaks with S/N ≥ 2. Our method is easily implemented and is highly effective at defining peaks that will be used for disease classification or to highlight potential biomarkers.
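
The step that groups common peaks across spectra can be sketched as complete-linkage clustering of peak positions with a distance cut-off. This is only an illustration of that one step under stated assumptions: the function name `complete_linkage_1d`, the m/z values, and the 0.5 Da cut-off are hypothetical, and the wavelet and Fourier denoising stages are not shown.

```python
def complete_linkage_1d(positions, cutoff):
    """Complete-linkage agglomerative clustering of 1-D peak positions.

    Merging stops once the closest pair of clusters is farther apart than
    `cutoff`, so every returned cluster spans at most `cutoff`.
    """
    # Start from sorted singletons. In one dimension, clusters built this way
    # stay contiguous, so the closest pair under complete linkage is always
    # a pair of neighbours in sorted order.
    clusters = [[p] for p in sorted(positions)]
    while len(clusters) > 1:
        best_i = 0
        best_d = None
        for i in range(len(clusters) - 1):
            # Complete linkage: the largest pairwise distance between two
            # adjacent clusters, i.e. max of the right one minus min of the left.
            d = clusters[i + 1][-1] - clusters[i][0]
            if best_d is None or d < best_d:
                best_i, best_d = i, d
        if best_d > cutoff:
            break
        # Merge the closest adjacent pair into one cluster.
        clusters[best_i:best_i + 2] = [clusters[best_i] + clusters[best_i + 1]]
    return clusters

# Hypothetical peak positions (m/z) pooled from several spectra.
mz = [1000.1, 1000.3, 1000.2, 1500.0, 1500.4, 2000.0]
groups = complete_linkage_1d(mz, cutoff=0.5)
sizes = [len(g) for g in groups]
print(sizes)
```

A tighter cut-off splits more peaks into separate groups, while a looser one risks merging peaks from different proteins, which is why the abstract stresses choosing an appropriate threshold.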


Assuntos
Proteínas Sanguíneas/análise , Espectrometria de Massas/métodos , Software , Doenças das Artérias Carótidas/sangue , Doenças das Artérias Carótidas/diagnóstico , Análise por Conglomerados , Análise de Fourier , Humanos , Proteômica/métodos , Reprodutibilidade dos Testes , Albumina Sérica/análise
20.
Artif Intell Med ; 95: 96-103, 2019 04.
Artigo em Inglês | MEDLINE | ID: mdl-30352711

RESUMO

Identifying tumor metastasis signatures from gene expression data at the whole-genome level remains an arduous challenge, particularly when the number of genes is huge and the number of experimental samples is small. We focus on predicting the epithelial-mesenchymal transition (EMT), an underlying mechanism of tumor metastasis, rather than tumor metastasis itself, to avoid confounding effects of uncertainties derived from various factors. We apply an extended LASSO model, the L1/2-regularization model, as a feature selector to identify significant RNA-binding proteins (RBPs) that contribute to regulating the EMT. We find that the L1/2-regularization model significantly outperforms LASSO in the EMT regulation problem. Furthermore, a remarkable improvement in L1/2-regularization model classification performance can be achieved by incorporating extra information, specifically correlation values. We demonstrate that the L1/2-regularization model is applicable for identifying significant RBPs in biological research. The identified RBPs will facilitate study of the underlying mechanisms of the EMT.
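
For orientation, the two penalties can be written side by side in their generic textbook form. The loss $\mathcal{L}$, the coefficient vector $\beta$, the number of features $p$, and the tuning parameter $\lambda$ are notation assumed here; the abstract does not give the exact objective used.

```latex
% LASSO: L1 penalty on the coefficients
\hat{\beta}_{\mathrm{LASSO}} = \arg\min_{\beta}\; \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|

% L1/2 regularization: square-root penalty (non-convex, tends to give sparser solutions)
\hat{\beta}_{L_{1/2}} = \arg\min_{\beta}\; \mathcal{L}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|^{1/2}
```

In both cases the features whose coefficients remain non-zero are the selected ones, here the candidate RBPs.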


Assuntos
Transição Epitelial-Mesenquimal , Proteínas de Ligação a RNA/fisiologia , Algoritmos , Linhagem Celular Tumoral , Humanos , Modelos Biológicos