Results 1 - 20 of 24
1.
Comput Math Methods Med ; 2021: 5770981, 2021.
Article in English | MEDLINE | ID: mdl-34413898

ABSTRACT

Antioxidant proteins (AOPs) play important roles in the management and prevention of several human diseases due to their ability to neutralize excess free radicals. However, the identification of AOPs by using wet-lab experimental techniques is often time-consuming and expensive. In this study, we proposed an accurate computational model, called AOP-HMM, to predict AOPs by extracting discriminatory evolutionary features from hidden Markov model (HMM) profiles. First, auto cross-covariance (ACC) variables were applied to transform the HMM profiles into fixed-length feature vectors. Then, we performed the analysis of variance (ANOVA) method to reduce the dimensionality of the raw feature space. Finally, a support vector machine (SVM) classifier was adopted to conduct the prediction of AOPs. To comprehensively evaluate the performance of the proposed AOP-HMM model, the 10-fold cross-validation (CV), the jackknife CV, and the independent test were carried out on two widely used benchmark datasets. The experimental results demonstrated that AOP-HMM outperformed most of the existing methods and could be used to quickly annotate AOPs and guide the experimental process.
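The auto cross-covariance (ACC) step mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming the common definition of ACC as the lagged covariance between mean-centred profile columns; the function name `acc_transform`, the lag range, and the toy profile are hypothetical and not taken from the paper, and the ANOVA and SVM steps are not shown.

```python
from typing import List


def acc_transform(profile: List[List[float]], max_lag: int) -> List[float]:
    """Auto cross-covariance features of an L x N profile matrix.

    For every lag 1..max_lag and every ordered column pair (a, b), average
    the product of mean-centred scores at positions j and j + lag.
    Pairs with a == b are the auto-covariance terms.
    """
    length = len(profile)
    n_cols = len(profile[0])
    if not 1 <= max_lag < length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")
    # Column means over the whole sequence.
    means = [sum(row[c] for row in profile) / length for c in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for a in range(n_cols):
            for b in range(n_cols):
                total = sum(
                    (profile[j][a] - means[a]) * (profile[j + lag][b] - means[b])
                    for j in range(length - lag)
                )
                features.append(total / (length - lag))
    return features


# Toy 4-residue, 2-column profile (made-up numbers, not real HMM output).
toy_profile = [[0.1, 0.9], [0.4, 0.6], [0.7, 0.3], [0.2, 0.8]]
toy_features = acc_transform(toy_profile, max_lag=2)
print(len(toy_features))  # 2 lags * 2 * 2 column pairs
```

The output length depends only on the number of columns and the maximum lag, which is what makes the vectors fixed-length: a 20-column profile yields 400 values per lag whatever the sequence length.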


Subjects
Antioxidants/chemistry; Machine Learning; Peroxiredoxins/chemistry; Proteins/chemistry; Algorithms; Amino Acids/analysis; Antioxidants/classification; Computational Biology; Databases, Protein/statistics & numerical data; Evolution, Molecular; Humans; Markov Chains; Peroxiredoxins/classification; Proteins/classification
2.
Sci Rep ; 11(1): 12439, 2021 06 14.
Article in English | MEDLINE | ID: mdl-34127723

ABSTRACT

Coiled-coil regions were among the first protein motifs described structurally and theoretically. The simplicity of the motif promises that coiled-coil regions can be detected with reasonable accuracy and precision in any protein sequence. Here, we re-evaluated the most commonly used coiled-coil prediction tools with respect to the most comprehensive reference data set available, the entire Protein Data Bank, down to each amino acid and its secondary structure. Apart from the 30-fold difference between the minimum and maximum number of coiled coils predicted, the tools vary strongly in where they predict coiled-coil regions. Accordingly, there is a high number of false predictions and of missed true coiled-coil regions. The evaluation of the binary classification metrics in comparison with naïve coin-flip models and the calculation of the Matthews correlation coefficient, the most reliable performance metric for imbalanced data sets, suggest that the tested tools' performance is close to random. This implies that the tools' predictions have only limited informative value. Coiled-coil predictions are often used to interpret biochemical data and are part of in-silico functional genome annotation. Our results indicate that these predictions should be treated very cautiously and need to be supported and validated by experimental evidence.
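To make the role of the Matthews correlation coefficient (MCC) concrete, the sketch below computes it from a confusion matrix using the standard formula. The counts are invented for illustration (1000 residues, 10% in coiled coils) and are not the study's data.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts; 0.0 when a marginal is empty."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced data: 100 coiled-coil residues, 900 others.
coin_flip = dict(tp=50, tn=450, fp=450, fn=50)     # half labelled positive at random
all_negative = dict(tp=0, tn=900, fp=0, fn=100)    # never predicts a coiled coil
perfect = dict(tp=100, tn=900, fp=0, fn=0)

for name, counts in [("coin flip", coin_flip),
                     ("all negative", all_negative),
                     ("perfect", perfect)]:
    print(name, accuracy(**counts), matthews_corrcoef(**counts))
```

In this toy setup the all-negative predictor reaches 90% accuracy yet scores an MCC of 0, the same as the coin flip, which is why MCC is preferred when positives are rare.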


Subjects
Amino Acid Motifs; Models, Molecular; Protein Structure, Secondary; Amino Acid Sequence; Databases, Protein/statistics & numerical data; Software
3.
Comput Math Methods Med ; 2020: 1384749, 2020.
Article in English | MEDLINE | ID: mdl-32300371

ABSTRACT

Prediction of DNA-binding proteins (DBPs) has become a popular research topic in protein science due to its crucial role in all aspects of biological activities. Even though considerable efforts have been devoted to developing powerful computational methods to solve this problem, it is still a challenging task in the field of bioinformatics. A hidden Markov model (HMM) profile has been proved to provide important clues for improving the prediction performance of DBPs. In this paper, we propose a method, called HMMPred, which extracts the features of amino acid composition and auto- and cross-covariance transformation from the HMM profiles, to help train a machine learning model for identification of DBPs. Then, a feature selection technique is performed based on the extreme gradient boosting (XGBoost) algorithm. Finally, the selected optimal features are fed into a support vector machine (SVM) classifier to predict DBPs. The experimental results tested on two benchmark datasets show that the proposed method is superior to most of the existing methods and could serve as an alternative tool to identify DBPs.


Subjects
Algorithms; DNA-Binding Proteins/chemistry; Machine Learning; Amino Acid Sequence; Amino Acids/analysis; Computational Biology; DNA-Binding Proteins/genetics; Databases, Protein/statistics & numerical data; Humans; Markov Chains; ROC Curve; Support Vector Machine
4.
Brief Bioinform ; 20(1): 274-287, 2019 01 18.
Article in English | MEDLINE | ID: mdl-29028906

ABSTRACT

The identification of plant-pathogen protein-protein interactions (PPIs) is an attractive and challenging research topic for deciphering the complex molecular mechanism of plant immunity and pathogen infection. Considering that the experimental identification of plant-pathogen PPIs is time-consuming and labor-intensive, computational methods are emerging as an important strategy to complement the experimental methods. In this work, we first evaluated the performance of traditional computational methods such as interolog, domain-domain interaction and domain-motif interaction in predicting known plant-pathogen PPIs. Owing to the low sensitivity of the traditional methods, we utilized Random Forest to build an inter-species PPI prediction model based on multiple sequence encodings and novel network attributes in the established plant PPI network. Critical assessment of the features demonstrated that the integration of sequence information and network attributes resulted in significant and robust performance improvement. Additionally, we also discussed the influence of Gene Ontology and gene expression information on the prediction performance. The Web server implementing the integrated prediction method, named InterSPPI, has been made freely available at http://systbio.cau.edu.cn/intersppi/index.php. InterSPPI could achieve a reasonably high accuracy with a precision of 73.8% and a recall of 76.6% in the independent test. To examine the applicability of InterSPPI, we also conducted cross-species and proteome-wide plant-pathogen PPI prediction tests. Taken together, we hope this work can provide a comprehensive understanding of the current status of plant-pathogen PPI predictions, and the proposed InterSPPI can become a useful tool to accelerate the exploration of plant-pathogen interactions.


Subjects
Plant Proteins/metabolism; Plants/metabolism; Plants/microbiology; Protein Interaction Mapping/methods; Algorithms; Arabidopsis/genetics; Arabidopsis/metabolism; Arabidopsis/microbiology; Arabidopsis Proteins/genetics; Arabidopsis Proteins/immunology; Arabidopsis Proteins/metabolism; Computational Biology/methods; Databases, Protein/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Gene Ontology; Host-Pathogen Interactions/genetics; Host-Pathogen Interactions/immunology; Machine Learning; Models, Biological; Plant Diseases/genetics; Plant Diseases/immunology; Plant Diseases/microbiology; Plant Immunity/genetics; Plant Proteins/genetics; Plant Proteins/immunology; Plants/genetics; Protein Interaction Mapping/statistics & numerical data
5.
PLoS One ; 13(1): e0190938, 2018.
Article in English | MEDLINE | ID: mdl-29351302

ABSTRACT

When human samples are sequenced, many assembled contigs are "unknown", as conventional alignments find no similarity to known sequences. Hidden Markov models (HMM) exploit the positions of specific nucleotides in protein-encoding codons in various microbes. The algorithm HMMER3 implements HMM using a reference set of sequences encoding viral proteins, "vFam". We used HMMER3 analysis of "unknown" human sample-derived sequences and identified 510 contigs distantly related to viruses (Anelloviridae (n = 1), Baculoviridae (n = 34), Circoviridae (n = 35), Caulimoviridae (n = 3), Closteroviridae (n = 5), Geminiviridae (n = 21), Herpesviridae (n = 10), Iridoviridae (n = 12), Marseillevirus (n = 26), Mimiviridae (n = 80), Phycodnaviridae (n = 165), Poxviridae (n = 23), Retroviridae (n = 6) and 89 contigs related to described viruses not yet assigned to any taxonomic family). In summary, we find that analysis using the HMMER3 algorithm and the "vFam" database greatly extended the detection of viruses in biospecimens from humans.


Subjects
Microbiota; Viruses/genetics; Viruses/isolation & purification; Algorithms; Computational Biology; Contig Mapping/statistics & numerical data; Databases, Protein/statistics & numerical data; Humans; Markov Chains; Metagenomics/statistics & numerical data; Phylogeny; Viral Proteins/genetics; Viruses/classification
6.
Brief Bioinform ; 19(5): 821-837, 2018 09 28.
Article in English | MEDLINE | ID: mdl-28334258

ABSTRACT

Understanding of molecular mechanisms that govern protein-protein interactions and accurate modeling of protein-protein docking rely on accurate identification and prediction of protein-binding partners and protein-binding residues. We review over 40 methods that predict protein-protein interactions from protein sequences including methods that predict interacting protein pairs, protein-binding residues for a pair of interacting sequences and protein-binding residues in a single protein chain. We focus on the latter methods that provide residue-level annotations and that can be broadly applied to all protein sequences. We compare their architectures, inputs and outputs, and we discuss aspects related to their assessment and availability. We also perform a first-of-its-kind comprehensive empirical comparison of representative predictors of protein-binding residues using a novel and high-quality benchmark data set. We show that the selected predictors accurately discriminate protein-binding and non-binding residues and that newer methods outperform older designs. However, these methods are unable to accurately separate residues that bind other molecules, such as DNA, RNA and small ligands, from the protein-binding residues. This cross-prediction, defined as the incorrect prediction of nucleic-acid- and small-ligand-binding residues as protein binding, is substantial for all evaluated methods and is not driven by the proximity to the native protein-binding residues. We discuss reasons for this drawback and we offer several recommendations. In particular, we postulate the need for a new generation of more accurate predictors and data sets, inclusion of a comprehensive assessment of the cross-predictions in future studies and higher standards of availability of the published methods.


Subjects
Protein Binding/genetics; Amino Acid Sequence; Binding Sites/genetics; Computational Biology/methods; Databases, Protein/statistics & numerical data; Ligands; Nucleic Acids/metabolism; Protein Interaction Domains and Motifs/genetics; Proteins/chemistry; Proteins/genetics; Proteins/metabolism; Sequence Analysis, Protein/statistics & numerical data; Software; Structural Homology, Protein
7.
J Bioinform Comput Biol ; 13(5): 1543001, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26493682

ABSTRACT

Automated assignment of protein function has received considerable attention in recent years for genome-wide study. With the rapid accumulation of genome sequencing data produced by high-throughput experimental techniques, the process of manually predicting functional properties of proteins has become increasingly cumbersome. Such large genomics data sets can only be annotated computationally. However, automated assignment of functions to unknown proteins is challenging due to its inherent difficulty and complexity. Previous studies have revealed that solving problems involving complicated objects with multiple semantic meanings using the multi-instance multi-label (MIML) framework is effective. For the protein function prediction problems, each protein object in nature may associate with distinct structural units (instances) and multiple functional properties (class labels) where each unit is described by an instance and each functional property is considered as a class label. Thus, it is convenient and natural to tackle the protein function prediction problem by using the MIML framework. In this paper, we propose a sparse Markov chain-based semi-supervised MIML method, called Sparse-Markov. A sparse transductive probability graph is constructed to encode the affinity information of the data based on an ensemble of Hausdorff distance metrics. Our goal is to exploit the affinity between protein objects in the sparse transductive probability graph to seek a sparse steady state probability of the Markov chain model to perform protein function prediction, such that two proteins are given similar functional labels if they are close to each other in terms of an ensemble Hausdorff distance in the graph. Experimental results on seven real-world organism data sets covering three biological domains show that our proposed Sparse-Markov method is able to achieve better performance than four state-of-the-art MIML learning algorithms.
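The affinity between two proteins, each represented as a bag of instance vectors, rests on a Hausdorff distance. The sketch below shows only the classic (maximal) Hausdorff distance with Euclidean ground distance. The abstract does not say which variants make up the ensemble or how they are combined, so the helper names and toy bags here are illustrative assumptions.

```python
import math
from typing import Sequence

Bag = Sequence[Sequence[float]]  # one protein = a bag of instance vectors


def directed_hausdorff(bag_a: Bag, bag_b: Bag) -> float:
    """Largest distance from any instance in bag_a to its nearest one in bag_b."""
    return max(min(math.dist(a, b) for b in bag_b) for a in bag_a)


def hausdorff(bag_a: Bag, bag_b: Bag) -> float:
    """Symmetric (maximal) Hausdorff distance between two bags."""
    return max(directed_hausdorff(bag_a, bag_b), directed_hausdorff(bag_b, bag_a))


# Toy bags of 2-D instances (hypothetical feature vectors).
protein_x = [(0.0, 0.0), (1.0, 0.0)]
protein_y = [(0.0, 0.0), (4.0, 0.0)]
print(hausdorff(protein_x, protein_y))
```

Two bags are close under this distance only if every instance in each bag has a near neighbour in the other, which matches the intuition that proteins with similar structural units should receive similar labels.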


Subjects
Markov Chains; Proteins/chemistry; Proteins/physiology; Supervised Machine Learning; Algorithms; Animals; Computational Biology; Databases, Protein/statistics & numerical data; Genome-Wide Association Study/statistics & numerical data; Proteins/genetics
8.
J Bioinform Comput Biol ; 13(5): 1550024, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26434573

ABSTRACT

Most of the algorithms used for information extraction and for processing the amino acid chains that make up proteins treat them as symbolic chains. Fewer algorithms exploit signal processing techniques that require a numerical representation of amino acid chains. However, these algorithms are very powerful for extracting regularities that cannot be detected when working with a symbolic chain, which may be important for understanding the biological meaning of a sequence or in classification tasks. In this study, a new mathematical representation of amino acid chains is proposed, which is derived using a similarity measure based on the PAM250 amino acid substitution matrix and that generates 20 signals for each protein sequence. Using this representation 20 consensus spectra for a protein family are determined and the relevance of the frequency peaks is established, obtaining a group of significant frequency peaks that manifest common periodicities of the amino acid sequences that belong to a protein family. We also show that the proposed representation in 20 signals can be integrated into Chou's pseudo amino acid composition (PseAAC) and constitute a useful alternative to amino acid physicochemical properties in Chou's PseAAC.


Subjects
Algorithms; Proteins/chemistry; Amino Acid Sequence; Amino Acid Substitution; Amino Acids/chemistry; Computational Biology/methods; Consensus Sequence; Databases, Protein/statistics & numerical data; Markov Chains
9.
J Bioinform Comput Biol ; 13(2): 1550005, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25524475

ABSTRACT

The physicochemical properties of a protein help determine the quality of its structure and have therefore been used extensively to distinguish native or native-like structures from other predicted structures. In this work, we explore nine machine learning methods with six physicochemical properties to predict the Root Mean Square Deviation (RMSD), Template Modeling score (TM-score), and Global Distance Test score (GDT_TS-score) of a modeled protein structure in the absence of its true native state. The physicochemical properties used are total surface area, Euclidean distance (ED), total empirical energy, secondary structure penalty (SS), sequence length (SL), and pair number (PN). There are a total of 95,091 modeled structures of 4896 native targets. A real-coded Self-adaptive Differential Evolution algorithm (SaDE) is used to determine the feature importance. K-fold cross-validation is used to measure the robustness of the best predictive method. Through intensive experiments, it is found that the Random Forest method outperforms the other machine learning methods. This work makes the prediction faster and less expensive. On the testing data set, the predictions of RMSD, TM-score, and GDT_TS-score achieve Root Mean Square Errors (RMSE) of 1.20, 0.06, and 0.06, respectively; correlation scores of 0.96, 0.92, and 0.91, respectively; R² values of 0.92, 0.85, and 0.84, respectively; and accuracies of 78.82% (with ± 0.1 err), 86.56% (with ± 0.1 err), and 87.37% (with ± 0.1 err), respectively. The data set used in the study is available as a supplement at http://bit.ly/RF-PCP-DataSets.
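For readers unfamiliar with the reported regression metrics, the sketch below computes RMSE and R² with their standard definitions. The third helper reflects one possible reading of "accuracy (with ± 0.1 err)", namely the share of predictions within a tolerance of the true value. That reading is an assumption, as the abstract does not define the term, and the numbers are toy values.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def within_tolerance(y_true: Sequence[float], y_pred: Sequence[float], tol: float) -> float:
    """ASSUMED meaning of 'accuracy with +/- tol err': share of |error| <= tol."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return hits / len(y_true)


# Toy values, not data from the study.
true_scores = [1.0, 2.0, 3.0, 4.0]
predicted_scores = [1.0, 2.0, 3.0, 5.0]
print(rmse(true_scores, predicted_scores))
print(r_squared(true_scores, predicted_scores))
print(within_tolerance(true_scores, predicted_scores, tol=0.1))
```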


Subjects
Models, Molecular; Proteins/chemistry; Algorithms; Chemical Phenomena; Computational Biology; Computer Simulation; Databases, Protein/statistics & numerical data; Machine Learning; Protein Conformation; Quality Control
11.
Mol Cell Proteomics ; 13(2): 666-77, 2014 Feb.
Article in English | MEDLINE | ID: mdl-24255132

ABSTRACT

A major goal in proteomics is the comprehensive and accurate description of a proteome. This task includes not only the identification of proteins in a sample, but also the accurate quantification of their abundance. Although mass spectrometry typically provides information on peptide identity and abundance in a sample, it does not directly measure the concentration of the corresponding proteins. Specifically, most mass-spectrometry-based approaches (e.g. shotgun proteomics or selected reaction monitoring) allow one to quantify peptides using chromatographic peak intensities or spectral counting information. Ultimately, based on these measurements, one wants to infer the concentrations of the corresponding proteins. Inferring properties of the proteins based on experimental peptide evidence is often a complex problem because of the ambiguity of peptide assignments and different chemical properties of the peptides that affect the observed concentrations. We present SCAMPI, a novel generic and statistically sound framework for computing protein abundance scores based on quantified peptides. In contrast to most previous approaches, our model explicitly includes information from shared peptides to improve protein quantitation, especially in eukaryotes with many homologous sequences. The model accounts for uncertainty in the input data, leading to statistical prediction intervals for the protein scores. Furthermore, peptides with extreme abundances can be reassessed and classified as either regular data points or actual outliers. We used the proposed model with several datasets and compared its performance to that of other, previously used approaches for protein quantification in bottom-up mass spectrometry.


Subjects
Computational Biology/methods; Data Interpretation, Statistical; Proteins/analysis; Proteomics/statistics & numerical data; Cell Line, Tumor; Databases, Protein/statistics & numerical data; Humans; Isotope Labeling/methods; Leptospira interrogans/metabolism; Leukemia, Myeloid, Acute/metabolism; Markov Chains; Proteomics/methods; Research Design; Software
12.
J Bioinform Comput Biol ; 11(2): 1230002, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23600810

ABSTRACT

Complexes of physically interacting proteins are one of the fundamental functional units responsible for driving key biological mechanisms within the cell. Their identification is therefore necessary to understand not only complex formation but also the higher level organization of the cell. With the advent of "high-throughput" techniques in molecular biology, a significant amount of physical interaction data has been cataloged from organisms such as yeast, which has in turn fueled computational approaches to systematically mine complexes from the network of physical interactions among proteins (PPI network). In this survey, we review, classify and evaluate some of the key computational methods developed to date for the identification of protein complexes from PPI networks. We present two insightful taxonomies that reflect how these methods have evolved over the years toward improving automated complex prediction. We also discuss some open challenges facing accurate reconstruction of complexes, the crucial ones being the presence of a high proportion of errors and noise in current high-throughput datasets and some key aspects overlooked by current complex detection methods. We hope this review will not only help to condense the history of computational complex detection for easy reference but also provide valuable insights to drive further research in this area.


Subjects
Multiprotein Complexes/chemistry; Protein Interaction Maps; Algorithms; Animals; Cluster Analysis; Computational Biology; Databases, Protein/statistics & numerical data; Evolution, Molecular; Humans; Markov Chains; Membrane Proteins/chemistry; Multiprotein Complexes/classification; Multiprotein Complexes/genetics; Protein Interaction Mapping/statistics & numerical data
13.
J Proteomics ; 75(17): 5516-22, 2012 Sep 18.
Article in English | MEDLINE | ID: mdl-22789671

ABSTRACT

Isobaric tagging techniques such as iTRAQ and TMT are widely used in quantitative proteomics and especially useful for samples that demand in vitro labeling. Due to diversity in choices of MS acquisition approaches, identification algorithms, and relative abundance deduction strategies, researchers are faced with a plethora of possibilities when it comes to data analysis. However, the lack of a generic and flexible software tool often makes it cumbersome for researchers to perform the analysis entirely as desired. In this paper, we present MilQuant, mzXML-based isobaric labeling quantitator, a pipeline of freely available programs that supports native acquisition files produced by all mass spectrometer types and collection approaches currently used in isobaric tagging based MS data collection. Moreover, aside from effective normalization and abundance ratio deduction algorithms, MilQuant exports various intermediate results along each step of the pipeline, making it easy for researchers to customize the analysis. The functionality of MilQuant was demonstrated by four distinct datasets from different laboratories. The compatibility and extendibility of MilQuant make it a generic and flexible tool that can serve as a full solution to data analysis of isobaric tagging-based quantitation.


Subjects
Isotope Labeling/methods; Proteomics/methods; Software; Algorithms; Databases, Protein/statistics & numerical data; Efficiency; Electronic Data Processing/methods; Humans; Isotope Labeling/economics; Mass Spectrometry/methods; Models, Biological; Proteomics/economics; Software/economics; Time Factors; User-Computer Interface
14.
Pac Symp Biocomput ; : 271-82, 2012.
Article in English | MEDLINE | ID: mdl-22174282

ABSTRACT

Protein homology search provides a basis for functional profiling in metagenomic annotation. Profile HMM-based methods classify reads into annotated protein domain families and can achieve better sensitivity for remote protein homology search than pairwise sequence alignment. However, their sensitivity deteriorates with the decrease of read length. As a result, a large number of short reads cannot be classified into their native domain families. In this work, we introduce MetaDomain, a protein domain classification tool designed for short reads generated by next-generation sequencing technologies. MetaDomain uses relaxed position-specific score thresholds to align more reads to a profile HMM while using the distribution of alignment positions as an additional constraint to control false positive matches. In this work MetaDomain is applied to the transcriptomic data of a bacterial genome and a soil metagenomic data set. The experimental results show that it can achieve better sensitivity than the state-of-the-art profile HMM alignment tool in identifying encoded domains from short sequences. The source code of MetaDomain is available at http://sourceforge.net/projects/metadomain/.


Subjects
Metagenomics/statistics & numerical data; Microbiota/genetics; Protein Structure, Tertiary; Software; Algorithms; Amino Acid Sequence; Computational Biology; Data Mining/statistics & numerical data; Databases, Protein/statistics & numerical data; Markov Chains; Protein Structure, Tertiary/genetics; Sequence Alignment/statistics & numerical data; Sequence Analysis, Protein/statistics & numerical data; Soil Microbiology; Transcriptome/genetics
15.
Proteomics ; 11(6): 1114-24, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21298787

ABSTRACT

As high-resolution instruments are becoming standard in proteomics laboratories, label-free quantification using precursor measurements is becoming a viable option, and is consequently rapidly gaining popularity. Several software solutions have been presented for label-free analysis, but to our knowledge no conclusive studies regarding the sensitivity and reliability of each step of the analysis procedure have been described. Here, we use real complex samples to assess the reliability of label-free quantification using four different software solutions. A generic approach to quality test quantitative label-free LC-MS is introduced. Measures for evaluation are defined for feature detection, alignment and quantification. All steps of the analysis could be considered adequately performed by the utilized software solutions, although differences and possibilities for improvement could be identified. The described method provides an effective testing procedure, which can help the user to quickly pinpoint where in the workflow changes are needed.


Subjects
Proteomics/statistics & numerical data; Proteomics/standards; Software; Tandem Mass Spectrometry/statistics & numerical data; Tandem Mass Spectrometry/standards; Algorithms; Chromatography, Liquid/standards; Chromatography, Liquid/statistics & numerical data; Computational Biology; Data Interpretation, Statistical; Databases, Protein/statistics & numerical data; Humans; Proteins/isolation & purification; Quality Control; Reproducibility of Results; Sequence Alignment/standards; Sequence Alignment/statistics & numerical data; Workflow
16.
J Bioinform Comput Biol ; 9(1): 179-206, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21328712

ABSTRACT

E-value guided extrapolation of protein domain annotation from libraries such as Pfam with the HMMER suite is indispensable for hypothesizing about the function of experimentally uncharacterized protein sequences. Since the recent release of HMMER3 does not supersede all functions of HMMER2, the latter will remain relevant for ongoing research as well as for the evaluation of annotations that reside in databases and in the literature. In HMMER2, the E-value is computed from the score via a logistic function or via a domain model-specific extreme value distribution (EVD); the lower of the two is returned as the E-value for the domain hit in the query sequence. We find that, for thousands of domain models, this treatment results in switching from the EVD to the statistical model with the logistic function when scores grow (for Pfam release 23, 99% in the global mode and 75% in the fragment mode). If the score corresponding to the breakpoint results in an E-value above a user-defined threshold (e.g. 0.1), a critical score region with conflicting E-values from the logistic function (below the threshold) and from EVD (above the threshold) does exist. Thus, this switch will affect E-value guided annotation decisions in an automated mode. To emphasize, switching in the fragment mode is of no practical relevance since it occurs only at E-values far below 0.1. Unfortunately, a critical score region does exist for 185 domain models in the hmmpfam and 1,748 domain models in the hmmsearch global-search mode. For 145 out of the respective 185 models, the critical score region is indeed populated by actual sequences. In total, 24.4% of their hits have a logistic function-derived E-value < 0.1 when the EVD provides an E-value > 0.1. We provide examples of false annotations and critically discuss the appropriateness of a logistic function as an alternative to the EVD.
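The switching behaviour described in the abstract can be reproduced with a small sketch. The functional forms below, a logistic bound 1 / (1 + 2^score) and a Gumbel-type EVD tail 1 - exp(-exp(-λ(score - μ))), are assumptions about how the two P-values are computed. The abstract states only that the lower of the two is returned. The parameters μ = -5, λ = 0.3 and the database size of 1000 are invented for illustration.

```python
import math
from typing import Tuple


def logistic_pvalue(score: float) -> float:
    """Assumed score-only bound: 1 / (1 + 2**score), score in bits."""
    return 1.0 / (1.0 + 2.0 ** min(score, 1000.0))  # cap avoids float overflow


def evd_pvalue(score: float, mu: float, lam: float) -> float:
    """Assumed Gumbel tail: P(S > score) = 1 - exp(-exp(-lam * (score - mu)))."""
    exponent = -lam * (score - mu)
    if exponent > 700.0:  # exp() would overflow; the tail probability is 1
        return 1.0
    return -math.expm1(-math.exp(exponent))


def lower_of_two_evalue(score: float, mu: float, lam: float,
                        n_sequences: int) -> Tuple[float, str]:
    """E-value from whichever model gives the lower P-value, plus its name."""
    p_logistic = logistic_pvalue(score)
    p_evd = evd_pvalue(score, mu, lam)
    model = "logistic" if p_logistic < p_evd else "EVD"
    return n_sequences * min(p_logistic, p_evd), model


# Hypothetical domain model and database size.
MU, LAM, N = -5.0, 0.3, 1000
for score in (0.0, 20.0):
    evalue, model = lower_of_two_evalue(score, MU, LAM, N)
    print(score, model, evalue, N * evd_pvalue(score, MU, LAM))
```

With these made-up parameters a score of 20 bits falls in a critical region of the kind the authors describe: the logistic model yields an E-value below 0.1 while the EVD alone would give one above 0.1, so an automated pipeline with a 0.1 cutoff accepts a hit that the EVD would reject.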


Subjects
Proteins/chemistry; Computational Biology; Computer Simulation; Databases, Protein/statistics & numerical data; Logistic Models; Markov Chains; Models, Molecular; Molecular Sequence Annotation/statistics & numerical data; Peptide Library; Protein Structure, Tertiary; Sequence Alignment/statistics & numerical data; Software
17.
Comput Biol Chem ; 32(1): 38-46, 2008 Feb.
Article in English | MEDLINE | ID: mdl-17988949

ABSTRACT

The combination of results from different large-scale datasets of multidimensional biological signals (such as gene expression profiling) presents a major challenge. Methodologies are needed that can efficiently combine diverse datasets, but can also test the extent of diversity (heterogeneity) across the combined studies. We developed METa-analysis of RAnked DISCovery datasets (METRADISC), a generalized meta-analysis method for combining information across discovery-oriented datasets and for testing between-study heterogeneity for each biological variable of interest. The method is based on non-parametric Monte Carlo permutation testing. The tested biological variables are ranked in each study according to the level of statistical significance. METRADISC tests for each biological variable of interest its average rank and the between-study heterogeneity of the study-specific ranks. After accounting for ties and differences in tested variables across studies, we randomly permute the ranks of each study and the simulated metrics of average rank and heterogeneity are calculated. The procedure is repeated to generate null distributions for the metrics. The use of METRADISC is demonstrated empirically using gene expression data from seven studies comparing prostate cancer cases and normal controls. We offer a new tool for combining complex datasets derived from massive testing, discovery-oriented research and for examining the diversity of results across the combined studies.
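The permutation idea behind the average-rank test can be sketched as below. This is a simplified illustration, not the METRADISC implementation: it assumes every study ranks the same variables with no ties, it omits the heterogeneity metric (which the abstract does not define), and the function name, rank tables, and permutation count are hypothetical.

```python
import random
from typing import List


def average_rank_pvalue(ranks_by_study: List[List[int]], variable: int,
                        n_permutations: int = 2000, seed: int = 0) -> float:
    """One-sided Monte Carlo p-value for a small average rank.

    ranks_by_study[s][v] is the rank of variable v in study s (1 = most
    significant). Rank sums are compared instead of averages; with a fixed
    number of studies the two are equivalent.
    """
    rng = random.Random(seed)
    observed = sum(study[variable] for study in ranks_by_study)
    hits = 0
    for _ in range(n_permutations):
        simulated = 0
        for study in ranks_by_study:
            shuffled = study[:]      # permute the ranks within each study
            rng.shuffle(shuffled)
            simulated += shuffled[variable]
        if simulated <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Three hypothetical studies ranking five variables.
ranks = [
    [1, 2, 3, 4, 5],
    [1, 3, 2, 5, 4],
    [1, 4, 5, 2, 3],
]
print(average_rank_pvalue(ranks, variable=0))  # top-ranked everywhere
print(average_rank_pvalue(ranks, variable=2))  # middling ranks
```

Repeating the shuffle many times builds the null distribution of the metric, so no parametric assumption about the ranks is needed.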


Subjects
Algorithms; Databases, Protein/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Meta-Analysis as Topic; Models, Statistical; Oligonucleotide Array Sequence Analysis; Humans; Male; Monte Carlo Method; Prostatic Neoplasms/metabolism
19.
Protein Sci ; 14(7): 1800-10, 2005 Jul.
Article in English | MEDLINE | ID: mdl-15937274

ABSTRACT

There are more than 200 completed genomes and over 1 million nonredundant sequences in public repositories. Although the structural data are more sparse (approximately 13,000 nonredundant structures solved to date), several powerful sequence-based methodologies now allow these structures to be mapped onto related regions in a significant proportion of genome sequences. We review a number of publicly available strategies for providing structural annotations for genome sequences, and we describe the protocol adopted to provide CATH structural annotations for completed genomes. In particular, we assess the performance of several sequence-based protocols employing Hidden Markov model (HMM) technologies for superfamily recognition, including a new approach (SAMOSA [sequence augmented models of structure alignments]) that exploits multiple structural alignments from the CATH domain structure database when building the models. Using a data set of remote homologs detected by structure comparison and manually validated in CATH, a single-seed HMM library was able to recognize 76% of the data set. Including the SAMOSA models in the HMM library showed little gain in homolog recognition, although a slight improvement in alignment quality was observed for very remote homologs. However, using an expanded 1D-HMM library, CATH-ISL increased the coverage to 86%. The single-seed HMM library has been used to annotate the protein sequences of 120 genomes from all three major kingdoms, allowing up to 70% of the genes or partial genes to be assigned to CATH superfamilies. It has also been used to recruit sequences from Swiss-Prot and TrEMBL into CATH domain superfamilies, expanding the CATH database eightfold.


Subjects
Databases, Protein; Genome; Markov Chains; Protein Structure, Tertiary; Proteins/chemistry; Proteins/genetics; Sequence Analysis, Protein; Databases, Protein/statistics & numerical data; Proteins/classification; Sequence Homology
20.
Nucleic Acids Res ; 33(Database issue): D284-8, 2005 Jan 01.
Article in English | MEDLINE | ID: mdl-15608197

ABSTRACT

PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function (ontology terms and pathways), as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences. The latest version, 5.0, contains 6683 protein families, divided into 31,705 subfamilies, covering approximately 90% of mammalian protein-coding genes. PANTHER 5.0 includes a number of significant improvements over previous versions, most notably (i) representation of pathways (primarily signaling pathways) and association with subfamilies and individual protein sequences; (ii) an improved methodology for defining the PANTHER families and subfamilies, and for building the HMMs; (iii) resources for scoring sequences against PANTHER HMMs both over the web and locally; and (iv) a number of new web resources to facilitate analysis of large gene lists, including data generated from high-throughput expression experiments. Efforts are underway to add PANTHER to the InterPro suite of databases, and to make PANTHER consistent with the PIRSF database. PANTHER is now publicly available without restriction at http://panther.appliedbiosystems.com.


Subjects
Databases, Protein; Proteins/classification; Sequence Analysis, Protein; Animals; Databases, Protein/statistics & numerical data; Gene Expression Profiling; Humans; Internet; Markov Chains; Mice; Proteins/chemistry; Proteins/physiology; Rats; Signal Transduction; Systems Integration; User-Computer Interface