1.
BMC Genomics ; 21(1): 324, 2020 Apr 25.
Article in English | MEDLINE | ID: mdl-32334531

ABSTRACT

BACKGROUND: Post-database search is a key procedure in peptide identification with tandem mass spectrometry (MS/MS) strategies; it refines the peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, large-scale datasets and datasets with an unbalanced distribution of PSMs remain challenging. A more efficient learning strategy is required to improve the accuracy of peptide identification on such datasets. While complex learning models have greater classification power, they may cause overfitting and add computational complexity on large-scale datasets. Kernel methods map data from the sample space to high-dimensional spaces where data relationships can be simplified for modeling. RESULTS: To tackle the computational challenge of using a kernel-based learning model for practical peptide identification problems, we present an online learning algorithm, OLCS-Ranker, which feeds only one training sample into the learning model at each round; as a result, the memory requirement for computation is significantly reduced. We also propose a cost-sensitive learning model for OLCS-Ranker that assigns a larger loss to decoy PSMs than to target PSMs in the loss function. CONCLUSIONS: The new model can reduce the false discovery rate on datasets with an unbalanced distribution of PSMs. Experimental studies show that OLCS-Ranker outperforms other methods in terms of accuracy and stability, especially on datasets with an unbalanced distribution of PSMs. Furthermore, OLCS-Ranker is 15-85 times faster than CRanker.


Subjects
Algorithms, Computational Biology/methods, Protein Databases, Peptides/analysis, Proteomics/methods, Tandem Mass Spectrometry/methods, Peptides/chemistry, Reproducibility of Results, Search Engine/methods, Software
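
The idea in entry 1 of feeding one training sample per round into a kernel model, with a larger loss for decoy PSMs than for target PSMs, can be illustrated with a minimal sketch. This is not the OLCS-Ranker implementation: the class name, the RBF kernel, the hinge loss, the cost values, and the toy two-feature PSMs are all assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


class OnlineCostSensitiveScorer:
    """Online kernel scorer trained with a cost-weighted hinge loss.

    Hypothetical illustration only; this is not the OLCS-Ranker code.
    """

    def __init__(self, decoy_cost=2.0, target_cost=1.0, step=0.5):
        self.decoy_cost = decoy_cost    # assumed: errors on decoys cost more
        self.target_cost = target_cost
        self.step = step
        self.support = []               # (coefficient, feature vector) pairs

    def score(self, x):
        return sum(c * rbf_kernel(v, x) for c, v in self.support)

    def partial_fit(self, x, label):
        """Update on a single PSM; label is +1 (target) or -1 (decoy)."""
        if label * self.score(x) < 1.0:  # hinge loss is active for this sample
            cost = self.decoy_cost if label < 0 else self.target_cost
            self.support.append((self.step * cost * label, x))


# Toy PSM feature vectors (made up): targets near (2, 2), decoys near (0, 0).
psms = [((2.0, 2.0), 1), ((0.0, 0.0), -1), ((2.2, 1.9), 1), ((0.1, -0.2), -1)]
model = OnlineCostSensitiveScorer()
for features, label in psms:
    model.partial_fit(features, label)  # one sample per round
```

Only the current sample and the stored support vectors are needed at each round, so no full kernel matrix is ever built. In this sketch the support list can still grow with every update, so a practical version would need some way to bound it.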
2.
BMC Genomics ; 16 Suppl 11: S1, 2015.
Article in English | MEDLINE | ID: mdl-26578406

ABSTRACT

BACKGROUND: Peptide sequence assignment is the central task in protein identification with MS/MS-based strategies. Although a number of post-database search algorithms for filtering target peptide-spectrum matches (PSMs) have been developed, the discrepancy among their output PSMs is usually significant, leaving a few disputable PSMs. Current studies show that a number of target PSMs that are close to decoy PSMs can hardly be separated from those decoys using the discrimination function alone. RESULTS: In this paper, we assign each target PSM a weight indicating its possibility of being correct. We employ an SVM-based learning model to search for the optimal weight of each target PSM and develop a new scoring system, CRanker, to rank all target PSMs. Because routine database searches generate large PSM datasets, we use the Cholesky factorization technique to store the kernel matrix and reduce the memory requirement. CONCLUSIONS: Compared with PeptideProphet and Percolator, CRanker identified more PSMs at similar false discovery rates over different datasets. CRanker showed consistent performance on different test sets, which validates the reasonableness of the proposed model.


Subjects
Computational Biology/methods, Peptides/analysis, Support Vector Machine, Algorithms, Humans, Peptides/chemistry
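
Entry 2 mentions using Cholesky factorization to store a kernel matrix with less memory. One common way to do this is a low-rank pivoted (incomplete) Cholesky factorization, sketched below. The abstract does not say whether CRanker uses this exact variant, so the function name, the RBF kernel, the tolerance, and the toy points are all assumptions.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def pivoted_cholesky(points, kernel, max_rank, tol=1e-10):
    """Return columns g_1..g_r with K[a][b] ~= sum_k g_k[a] * g_k[b].

    Kernel entries are computed on demand, so the full n x n matrix
    is never stored; only r columns of length n are kept.
    """
    n = len(points)
    residual_diag = [kernel(p, p) for p in points]
    columns = []
    while len(columns) < max_rank:
        # Pick the sample whose diagonal entry is currently worst approximated.
        pivot = max(range(n), key=residual_diag.__getitem__)
        if residual_diag[pivot] <= tol:
            break  # remaining approximation error is negligible
        root = math.sqrt(residual_diag[pivot])
        column = []
        for j in range(n):
            value = kernel(points[j], points[pivot])
            value -= sum(c[j] * c[pivot] for c in columns)
            column.append(value / root)
        columns.append(column)
        for j in range(n):
            residual_diag[j] -= column[j] ** 2
    return columns


# Toy feature vectors (made up for illustration).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
columns = pivoted_cholesky(points, rbf_kernel, max_rank=4)
```

With `max_rank` much smaller than the number of PSMs, storage drops from n x n kernel entries to n x r factor entries, at the cost of an approximation error that the residual diagonal tracks.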
3.
J Proteome Res ; 12(3): 1108-19, 2013 Mar 01.
Article in English | MEDLINE | ID: mdl-23402659

ABSTRACT

Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has revolutionized the proteomic analysis of complexes, cells, and tissues. In a typical proteomic analysis, each tandem mass spectrum from an LC-MS/MS experiment is assigned to a peptide by a search engine that compares the experimental MS/MS peptide data to theoretical peptide sequences in a protein database. The peptide-spectrum matches are then used to infer a list of identified proteins in the original sample. However, search engines often fail to distinguish between correct and incorrect peptide assignments. In this study, we designed and implemented a novel algorithm called De-Noise to reduce the number of incorrect peptide matches and maximize the number of correct peptides at a fixed false discovery rate, using a minimal number of scoring outputs from the SEQUEST search engine. The algorithm uses a three-step process: data cleaning, data refining through an SVM-based decision function, and a final data refining step based on proteolytic peptide patterns. Using proteomics data generated on different types of mass spectrometers, we optimized the De-Noise algorithm on the basis of the resolution and mass accuracy of the mass spectrometer employed in the LC-MS/MS experiment. Our results demonstrate that De-Noise improves peptide identification compared with other methods used to process the peptide sequence matches assigned by SEQUEST. Because De-Noise uses a limited number of scoring attributes, it can be easily implemented with other search engines.


Subjects
Algorithms, Proteomics, Liquid Chromatography, Protein Databases, Humans, Tandem Mass Spectrometry
4.
Proteome Sci ; 11(Suppl 1): S10, 2013 Nov 07.
Article in English | MEDLINE | ID: mdl-24564935

ABSTRACT

BACKGROUND: Sequence database searching has been the dominant method for peptide identification: a large number of peptide spectra generated from LC/MS/MS experiments are searched by a search engine against theoretical fragmentation spectra derived from a protein sequence database or a spectral library. Selecting trustworthy peptide-spectrum matches (PSMs) remains a challenge. RESULTS: A novel scoring method named FC-Ranker is developed to assign a nonnegative weight to each target PSM based on the possibility of its being correct. In particular, the scores of PSMs are updated iteratively using a fuzzy SVM classification model and a fuzzy silhouette index. Trustworthy PSMs are assigned high scores when the algorithm stops. CONCLUSIONS: Our experimental studies show that FC-Ranker outperforms other post-database search algorithms over a variety of datasets, and that it can be extended to solve a general classification problem with uncertain labels.

5.
BMC Bioinformatics ; 13 Suppl 9: S3, 2012 Jun 11.
Article in English | MEDLINE | ID: mdl-22901088

ABSTRACT

BACKGROUND: Identifying gene regulatory networks (GRNs) from time-course gene expression data has attracted increasing attention. Because of the computational complexity, most approaches for GRN reconstruction are limited to a small number of genes and to low connectivity of the underlying networks. These approaches can only identify a single network for a given set of genes. However, a large-scale gene network might contain multiple potential sub-networks, in which genes are functionally related only to other genes in the same sub-network. RESULTS: We propose the network and community identification (NCI) method for identifying multiple sub-networks from gene expression data by incorporating community structure information into GRN inference. The proposed algorithm iteratively solves two optimization problems and shows promise for application to large-scale GRNs. Furthermore, we present the efficient Block PCA method for searching for communities in GRNs. CONCLUSIONS: The NCI method is effective in identifying multiple sub-networks in a large-scale GRN. With the splitting algorithm, the Block PCA method is a promising approach for exploring communities in a large-scale GRN.


Subjects
Algorithms, Computational Biology/methods, Gene Regulatory Networks, Gene Expression
6.
Article in English | MEDLINE | ID: mdl-26394437

ABSTRACT

SEQUEST is a database-searching engine that calculates the correlation score between an observed spectrum and a theoretical spectrum deduced from protein sequences stored in a flat text file, which is neither a relational nor an object-oriented repository. Nevertheless, the SEQUEST score functions fail to discriminate accurately between true and false PSMs. Some approaches, such as PeptideProphet and Percolator, have been proposed to address the task of distinguishing true from false PSMs. However, most of these methods employ time-consuming learning algorithms to validate peptide assignments [1]. In this paper, we propose a fast algorithm for validating peptide identification by incorporating heterogeneous information from SEQUEST scores and knowledge of peptide digestion. To automate the peptide identification process and incorporate additional information, we employ l2 multiple kernel learning (MKL) to carry out the peptide identification task. Results on experimental datasets indicate that, compared with the state-of-the-art methods PeptideProphet and Percolator, our data fusion strategy has comparable performance but significantly reduces the running time.


Subjects
Algorithms, Fuzzy Logic, Mass Spectrometry/methods, Peptides/analysis, Proteomics/methods, Protein Databases, Peptides/chemistry, Software
7.
Neural Netw ; 24(5): 476-83, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21441012

ABSTRACT

As a kernel-based method, the least squares support vector machine (LS-SVM) has a performance that depends on the selection of the kernel as well as the regularization parameter (Duan, Keerthi, & Poo, 2003). Cross-validation is efficient in selecting a single kernel and the regularization parameter; however, it suffers from heavy computational cost and is not flexible enough to deal with multiple kernels. In this paper, we address the issue of multiple kernel learning for LS-SVM by formulating it as a semidefinite programming (SDP) problem. Furthermore, we show that the regularization parameter can be optimized in a unified framework with the kernel, which leads to an automatic process for model selection. Extensive experimental validations are performed and analyzed.


Subjects
Algorithms, Artificial Intelligence, Computer Neural Networks, Software Design, Computer Simulation, Humans, Mathematical Concepts, Theoretical Models, Software/standards
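
To see why LS-SVM performance in entry 7 hinges on the kernel and the regularization parameter, it helps to recall that training a single-kernel LS-SVM classifier reduces to solving one linear system in which both appear. The sketch below covers only this standard single-kernel case, not the multiple kernel SDP formulation of the paper. The function names, the RBF kernel, the default parameter values, and the toy data are assumptions.

```python
import math


def rbf_kernel(x, y, width=1.0):
    """Gaussian kernel; the width is one of the choices that must be tuned."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * width ** 2))


def solve_linear_system(matrix, rhs):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(rhs)
    rows = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(rows[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (rows[r][n] - tail) / rows[r][r]
    return solution


def train_ls_svm(samples, labels, regularization=10.0, width=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = len(samples)
    system = [[0.0] + [float(y) for y in labels]]
    for i in range(n):
        row = [float(labels[i])]
        for j in range(n):
            omega = labels[i] * labels[j] * rbf_kernel(samples[i], samples[j], width)
            row.append(omega + (1.0 / regularization if i == j else 0.0))
        system.append(row)
    solution = solve_linear_system(system, [0.0] + [1.0] * n)
    return solution[0], solution[1:]  # bias b, multipliers alpha


def decision_value(x, samples, labels, bias, alphas, width=1.0):
    return bias + sum(a * y * rbf_kernel(x, s, width)
                      for a, y, s in zip(alphas, labels, samples))


# Toy two-class data (made up for illustration).
samples = [(2.0, 2.0), (2.2, 1.9), (0.0, 0.0), (0.1, -0.2)]
labels = [1, 1, -1, -1]
bias, alphas = train_ls_svm(samples, labels)
```

Changing `regularization` or `width` changes the system matrix and therefore the solution, which is why both normally have to be selected, for example by cross-validation.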
8.
Int J Bioinform Res Appl ; 5(1): 50-63, 2009.
Article in English | MEDLINE | ID: mdl-19136364

ABSTRACT

This paper proposes a statistical method for finding Significantly Expressed (SE) genes from time-course expression data. SE genes are time-dependent, while non-SE genes are time-independent. The method models time-dependent gene expression profiles by autoregressive equations plus Gaussian noise, and time-independent profiles by Gaussian noise alone. A statistical F-test is used to calculate the probability (p-value) that a profile is time-independent. Both a synthetic dataset and a biological dataset were employed to evaluate the performance of this method, measured by the False Discovery Rate (FDR) and the False Non-discovery Rate (FNR). Results show that the proposed method outperforms traditional methods.


Subjects
Computational Biology/methods, Gene Expression Profiling/methods, Algorithms, False Positive Reactions, Gene Expression, Oligonucleotide Array Sequence Analysis/methods, Reproducibility of Results, Sensitivity and Specificity
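
The comparison described in entry 8, an autoregressive model against a pure-noise model judged by an F-test, can be sketched for the simplest case. The abstract does not give the autoregressive order or the exact null model, so the sketch assumes a first-order model with an intercept and a constant-mean null. The function name and both toy profiles are made up.

```python
def ar1_f_statistic(profile):
    """F statistic for x_t = a + b*x_{t-1} + noise versus x_t = a + noise.

    Assumes the profile is not constant and the AR fit is not perfect
    (otherwise a division by zero occurs).
    """
    previous, current = profile[:-1], profile[1:]
    n = len(current)
    mean_prev = sum(previous) / n
    mean_curr = sum(current) / n
    sxx = sum((x - mean_prev) ** 2 for x in previous)
    sxy = sum((x - mean_prev) * (y - mean_curr)
              for x, y in zip(previous, current))
    slope = sxy / sxx
    intercept = mean_curr - slope * mean_prev
    # Residual sums of squares under the null and the AR(1) model.
    rss_null = sum((y - mean_curr) ** 2 for y in current)
    rss_ar = sum((y - intercept - slope * x) ** 2
                 for x, y in zip(previous, current))
    # One extra parameter (the slope); n - 2 residual degrees of freedom.
    return (rss_null - rss_ar) / (rss_ar / (n - 2))


# Toy expression profiles (made up): one trending, one noise-like.
time_dependent = [1.0, 2.1, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0]
noise_like = [1.0, -0.5, 0.3, 0.9, -1.1, 0.2, -0.4, 0.6]
f_dependent = ar1_f_statistic(time_dependent)
f_noise = ar1_f_statistic(noise_like)
```

Under the null model, this statistic follows an F distribution with 1 and n - 2 degrees of freedom, and the p-value is its upper-tail probability. A large statistic, as for the trending profile, argues against time independence.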