Results 1 - 17 of 17
1.
BMC Genomics ; 21(1): 86, 2020 Jan 28.
Article in English | MEDLINE | ID: mdl-31992191

ABSTRACT

BACKGROUND: Branch points (BPs) map within short motifs upstream of acceptor splice sites (3'ss) and are essential for splicing of pre-mature mRNA. Several BP-dedicated bioinformatics tools, including HSF, SVM-BPfinder, BPP, Branchpointer, LaBranchoR and RNABPS were developed during the last decade. Here, we evaluated their capability to detect the position of BPs, and also to predict the impact on splicing of variants occurring upstream of 3'ss. RESULTS: We used a large set of constitutive and alternative human 3'ss collected from Ensembl (n = 264,787 3'ss) and from in-house RNAseq experiments (n = 51,986 3'ss). We also gathered an unprecedented collection of functional splicing data for 120 variants (62 unpublished) occurring in BP areas of disease-causing genes. Branchpointer showed the best performance to detect the relevant BPs upstream of constitutive and alternative 3'ss (99.48 and 65.84% accuracies, respectively). For variants occurring in a BP area, BPP emerged as having the best performance to predict effects on mRNA splicing, with an accuracy of 89.17%. CONCLUSIONS: Our investigations revealed that Branchpointer was optimal to detect BPs upstream of 3'ss, and that BPP was most relevant to predict splicing alteration due to variants in the BP area.


Subjects
Introns; RNA Precursors; RNA Splice Sites; RNA Splicing; Alternative Splicing; Computational Biology/methods; Humans; Nucleotide Motifs; Position-Specific Scoring Matrices; RNA Processing, Post-Transcriptional; ROC Curve; Reproducibility of Results
2.
Bioinformatics ; 36(9): 2690-2696, 2020 05 01.
Article in English | MEDLINE | ID: mdl-31999322

ABSTRACT

MOTIVATION: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. RESULTS: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. AVAILABILITY AND IMPLEMENTATION: Software implementation is available from https://github.com/jttoivon/moder2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
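The difference between a PPM and an ADM described in this abstract can be illustrated with a minimal Python sketch. It is not the MODER2 implementation: the function names, the toy width-2 motif and all probabilities are assumptions made up for illustration. The PPM multiplies independent per-position probabilities, while the ADM conditions each position on the preceding nucleotide.

```python
import math

def ppm_log_prob(seq, ppm):
    """Log-probability under a PPM: every position is scored independently."""
    return sum(math.log(ppm[i][base]) for i, base in enumerate(seq))

def adm_log_prob(seq, first, trans):
    """Log-probability under a first-order (adjacent dinucleotide) model.

    first[base] is the distribution at position 0;
    trans[i][prev][base] is P(base at position i+1 | prev at position i).
    """
    logp = math.log(first[seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(trans[i - 1][seq[i - 1]][seq[i]])
    return logp

# Hypothetical toy motif of width 2 whose sites are almost always "AA" or "CC".
column = {"A": 0.49, "C": 0.49, "G": 0.01, "T": 0.01}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_ppm = [column, column]
toy_first = column
toy_trans = [{
    "A": {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    "C": {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
    "G": uniform,
    "T": uniform,
}]

for site in ("AA", "AC"):
    print(site, ppm_log_prob(site, toy_ppm),
          adm_log_prob(site, toy_first, toy_trans))
```

In this toy setting the PPM cannot distinguish "AA" from "AC", whereas the ADM assigns "AC" a much lower probability because it models the dependency between adjacent positions.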


Subjects
Software; Transcription Factors; Algorithms; Binding Sites; Nucleotide Motifs; Position-Specific Scoring Matrices; Protein Binding; Transcription Factors/genetics
3.
Int J Mol Sci ; 19(4)2018 Apr 13.
Article in English | MEDLINE | ID: mdl-29652843

ABSTRACT

Apoptosis proteins (APs) control normal tissue homeostasis by regulating the balance between cell proliferation and death. The function of APs is strongly related to their subcellular location. Computational methods that reliably identify the subcellular location of APs have been reported, but there is still room to improve prediction accuracy. In this study, we developed a novel method named iAPSL-IF (identification of apoptosis protein subcellular location-integrative features), which is based on integrative features captured from Markov chains, physicochemical property matrices, and position-specific score matrices (PSSMs) of amino acid sequences. The matrices with different lengths were transformed into fixed-length feature vectors using an auto cross-covariance (ACC) method. An optimal subset of the features was chosen using a recursive feature elimination (RFE) algorithm, and a support vector machine (SVM) classifier was trained on the sequences represented by these features. Based on three datasets, ZD98, CL317, and ZW225, iAPSL-IF was examined using a jackknife cross-validation test. The results showed that iAPSL-IF outperformed the known predictors reported in the literature: its overall accuracy on the three datasets was 98.98% (ZD98), 94.95% (CL317), and 97.33% (ZW225), respectively; the Matthews correlation coefficient, sensitivity, and specificity for several classes of subcellular location proteins (e.g., membrane proteins, cytoplasmic proteins, endoplasmic reticulum proteins, nuclear proteins, and secreted proteins) in the datasets were 0.92-1.0, 94.23-100%, and 97.07-100%, respectively. Overall, the results of this study provide a high-throughput, sequence-based method for better identification of the subcellular location of APs and facilitate further understanding of programmed cell death in organisms.
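The step in which matrices of different lengths become fixed-length vectors can be sketched as follows. This is a generic auto cross-covariance transform written under assumptions, not the iAPSL-IF code: the function name `acc_transform`, the `max_lag` parameter and the 3-column toy matrices are hypothetical (a real PSSM has 20 columns).

```python
def acc_transform(matrix, max_lag):
    """Auto cross-covariance of a (sequence length x n_columns) matrix.

    For every lag 1..max_lag and every ordered column pair (j1, j2), average
    the product of mean-centred values taken `lag` rows apart. The output has
    max_lag * n_columns**2 entries regardless of the number of rows.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if max_lag >= n_rows:
        raise ValueError("max_lag must be smaller than the sequence length")
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j1 in range(n_cols):
            for j2 in range(n_cols):
                total = sum(
                    (matrix[i][j1] - means[j1]) * (matrix[i + lag][j2] - means[j2])
                    for i in range(n_rows - lag)
                )
                features.append(total / (n_rows - lag))
    return features

# Two toy "proteins" of different length, three score columns each.
short_protein = [[1, 0, 2], [0, 1, 1], [2, 2, 0], [1, 1, 1], [0, 2, 2]]
long_protein = short_protein + [[2, 0, 1], [1, 2, 0], [0, 0, 2]]

short_features = acc_transform(short_protein, max_lag=2)
long_features = acc_transform(long_protein, max_lag=2)
print(len(short_features), len(long_features))
```

Pairs with j1 equal to j2 give the auto-covariance terms and the remaining pairs give the cross-covariance terms; both proteins end up with vectors of the same length, which is what a downstream classifier such as an SVM requires.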


Subjects
Apoptosis Regulatory Proteins/genetics; Apoptosis Regulatory Proteins/metabolism; Computational Biology/methods; Algorithms; Amino Acid Sequence; Databases, Protein; Humans; Markov Chains; Position-Specific Scoring Matrices; Protein Transport; Support Vector Machine
4.
BMC Bioinformatics ; 19(1): 65, 2018 02 27.
Article in English | MEDLINE | ID: mdl-29482494

ABSTRACT

BACKGROUND: Crm1-dependent Nuclear Export Signals (NESs) are clusters of alternating hydrophobic and non-hydrophobic amino acid residues, 10 to 15 amino acids in length. NESs were largely thought to follow simple consensus patterns, based on which they were categorized into 6-10 classes. However, newly discovered NESs often deviate from the established consensus patterns. Thus, identifying NESs within protein sequences remains a bioinformatics challenge. RESULTS: We describe a probabilistic representation of NESs using a new generative model we call NoLogo that can account for a large diversity of NESs. Using this model to predict NESs, we demonstrate improved performance over PSSM and GLAM2 models, but do not achieve the performance of the state-of-the-art NES predictor LocNES. Our findings illustrate that over 30% of NESs are best described by novel NES classes rather than the 6-10 classes proposed by existing models. Finally, many NESs have additional hydrophobic residues either upstream or downstream of the canonical four residues, suggesting possible functionality. CONCLUSION: Applying the NoLogo model highlights the observation that NESs are more diverse than previously appreciated. Our work questions the practice of assigning each NES to one of several predefined NES classes. Finally, our analysis suggests a novel and testable biophysical perspective on the interaction between the Crm1 receptor and Crm1-dependent NESs.


Subjects
Karyopherins/metabolism; Models, Statistical; Nuclear Export Signals; Receptors, Cytoplasmic and Nuclear/metabolism; Software; Amino Acid Sequence; Cluster Analysis; Humans; Hydrophobic and Hydrophilic Interactions; Karyopherins/chemistry; Markov Chains; Position-Specific Scoring Matrices; Probability; Receptors, Cytoplasmic and Nuclear/chemistry; Saccharomyces cerevisiae/metabolism; Exportin 1 Protein
5.
Brief Bioinform ; 19(1): 148-161, 2018 01 01.
Article in English | MEDLINE | ID: mdl-27777222

ABSTRACT

Bacterial effector proteins secreted by various protein secretion systems play crucial roles in host-pathogen interactions. In this context, computational tools capable of accurately predicting effector proteins of the various types of bacterial secretion systems are highly desirable. Existing computational approaches use different machine learning (ML) techniques and heterogeneous features derived from protein sequences and/or structural information. These predictors differ not only in terms of the used ML methods but also with respect to the used curated data sets, the features selection and their prediction performance. Here, we provide a comprehensive survey and benchmarking of currently available tools for the prediction of effector proteins of bacterial types III, IV and VI secretion systems (T3SS, T4SS and T6SS, respectively). We review core algorithms, feature selection techniques, tool availability and applicability and evaluate the prediction performance based on carefully curated independent test data sets. In an effort to improve predictive performance, we constructed three ensemble models based on ML algorithms by integrating the output of all individual predictors reviewed. Our benchmarks demonstrate that these ensemble models outperform all the reviewed tools for the prediction of effector proteins of T3SS and T4SS. The webserver of the proposed ensemble methods for T3SS and T4SS effector protein prediction is freely available at http://tbooster.erc.monash.edu/index.jsp. We anticipate that this survey will serve as a useful guide for interested users and that the new ensemble predictors will stimulate research into host-pathogen relationships and inspiration for the development of new bioinformatics tools for predicting effector proteins of T3SS, T4SS and T6SS.


Subjects
Bacteria/metabolism; Bacterial Proteins/metabolism; Bacterial Secretion Systems/genetics; Genome, Bacterial; Position-Specific Scoring Matrices; Algorithms; Amino Acid Sequence; Amino Acids/metabolism; Bacteria/growth & development; Bacterial Proteins/classification; Bacterial Proteins/genetics; Gene Expression Regulation, Bacterial; Host-Pathogen Interactions; Humans; Software
6.
J Mol Graph Model ; 76: 379-402, 2017 09.
Article in English | MEDLINE | ID: mdl-28763690

ABSTRACT

Protein secondary structure prediction (PSSP) is a fundamental task in protein science and computational biology; it can be used to understand protein 3-dimensional (3-D) structures and, further, to learn their biological functions. In the past decade, a large number of methods have been proposed for PSSP. To capture the latest progress in PSSP, this paper provides a survey of the development of this field. It first introduces the background and related knowledge of PSSP, including basic concepts, data sets, input data features and prediction accuracy assessment. Then, it reviews recent algorithmic developments in PSSP, focusing mainly on the latest decade. Finally, it summarizes the corresponding trends and challenges in this field. This survey concludes that although various PSSP methods have been proposed, several further improvements and potential research directions remain. We hope that the presented guidelines will help nonspecialists and specialists to learn the critical progress in PSSP in recent years.


Subjects
Computational Biology; Models, Molecular; Protein Structure, Secondary; Proteins/chemistry; Algorithms; Amino Acid Sequence; Computational Biology/methods; Databases, Protein; Fuzzy Logic; Markov Chains; Neural Networks, Computer; Position-Specific Scoring Matrices; Reproducibility of Results; Sequence Analysis, Protein; Support Vector Machine
7.
Nucleic Acids Res ; 44(13): 6055-69, 2016 07 27.
Article in English | MEDLINE | ID: mdl-27288444

ABSTRACT

Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k - 1 act as priors for those of order k. This Bayesian Markov model (BaMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BaMMs achieve significantly (P = 1/16) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BaMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26-101%. BaMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BaMMs.
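The idea that order k - 1 probabilities act as priors for order k can be sketched with a pseudocount-style interpolation. This is an illustrative sketch under assumptions, not the BaMM software: it uses a homogeneous (position-independent) model, a single hypothetical prior strength `alpha` for all orders, and made-up training sequences.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_kmers(seqs, k_max):
    """Count every substring of length 1 .. k_max + 1 in the training set."""
    counts = Counter()
    for s in seqs:
        for k in range(1, k_max + 2):
            for i in range(len(s) - k + 1):
                counts[s[i:i + k]] += 1
    return counts

def cond_prob(x, context, counts, alpha):
    """P(x | context), with the shorter-context estimate acting as the prior.

    With many observations of `context` the counts dominate; with few or none
    the estimate falls back towards the lower-order probability.
    """
    if context == "":
        total = sum(counts[b] for b in ALPHABET)
        return (counts[x] + alpha / len(ALPHABET)) / (total + alpha)
    prior = cond_prob(x, context[1:], counts, alpha)
    context_total = sum(counts[context + b] for b in ALPHABET)
    return (counts[context + x] + alpha * prior) / (context_total + alpha)

# Hypothetical training sequences for a second-order model.
toy_counts = count_kmers(["ACGTACGT", "ACGAACGA"], k_max=2)
print(cond_prob("G", "AC", toy_counts, alpha=2.0))  # well-observed context
print(cond_prob("A", "TT", toy_counts, alpha=2.0))  # unseen context
```

Because the context "TT" never occurs in the toy data, its estimate equals the first-order estimate for context "T", which is how this kind of model adapts its effective complexity to the amount of available data.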


Subjects
DNA-Binding Proteins/genetics; DNA/genetics; Nucleotide Motifs/genetics; Regulatory Sequences, Nucleic Acid/genetics; Algorithms; Bayes Theorem; Binding Sites; Computational Biology; Markov Chains; Position-Specific Scoring Matrices; Software
8.
Obesity (Silver Spring) ; 23(6): 1151-8, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25959516

ABSTRACT

OBJECTIVE: To determine whether there are differences in baseline psychological and behavioral characteristics between individuals with severe obesity who chose a surgical or nonsurgical intervention for weight loss. METHODS: The current study utilized data from a larger study funded by a state insurance company and is unique in that the insurance company funded the weight loss interventions. Participants indicated their preferred method of weight loss, and completed several self-report psychological questionnaires, as well as demographic information. RESULTS: Participants (N = 605) were 58.8% Caucasian and mostly (86%) female. Logistic regression results indicated that an increased number of weight loss attempts, and select other measures of eating behavior and quality of life may influence individuals' selection for either surgical or nonsurgical treatments for weight loss. CONCLUSIONS: Practitioners should pay particular attention to these baseline characteristics that influence choice to examine potential characteristics that may influence the success of these weight loss treatments.


Subjects
Choice Behavior; Insurance Coverage; Insurance, Health; Obesity, Morbid/therapy; Position-Specific Scoring Matrices; Adult; Aged; Feeding Behavior/psychology; Female; Humans; Male; Middle Aged; Quality of Life; Surveys and Questionnaires; Young Adult
9.
BMC Genomics ; 15: 946, 2014 Dec 05.
Article in English | MEDLINE | ID: mdl-25475368

ABSTRACT

BACKGROUND: The reliable identification of proteins containing 50 or fewer amino acids is difficult due to the limited information content in short sequences. The 37 amino acid CydX protein in Escherichia coli is a member of the cytochrome bd oxidase complex, an enzyme found throughout Eubacteria. To investigate the extent of CydX conservation and prevalence and evaluate different methods of small protein homologue identification, we surveyed 1095 Eubacteria species for the presence of the small protein. RESULTS: Over 300 homologues were identified, including 80 unannotated genes. The ability of both closely-related and divergent homologues to complement the E. coli ΔcydX mutant supports our identification techniques, and suggests that CydX homologues retain similar function among divergent species. However, sequence analysis of these proteins shows a great degree of variability, with only a few highly-conserved residues. An analysis of the co-variation between CydX homologues and their corresponding cydA and cydB genes shows a close synteny of the small protein with the CydA long Q-loop. Phylogenetic analysis suggests that the cydABX operon has undergone horizontal gene transfer, although the cydX gene likely evolved in a progenitor of the Alpha, Beta, and Gammaproteobacteria. Further investigation of cydAB operons identified two additional conserved hypothetical small proteins: CydY encoded in CydAQlong operons that lack cydX, and CydZ encoded in more than 150 CydAQshort operons. CONCLUSIONS: This study provides a systematic analysis of bioinformatics techniques required for the unique challenges present in small protein identification and phylogenetic analyses. 
These results elucidate the prevalence of CydX throughout the Proteobacteria, provide insight into the selection pressure and sequence requirements for CydX function, and suggest a potential functional interaction between the small protein and the CydA Q-loop, an enigmatic domain of the cytochrome bd oxidase complex. Finally, these results identify other conserved small proteins encoded in cytochrome bd oxidase operons, suggesting that small protein subunits may be a more common component of these enzymes than previously thought.


Subjects
Cytochromes/genetics; Electron Transport Chain Complex Proteins/genetics; Escherichia coli Proteins/genetics; Evolution, Molecular; Oxidoreductases/genetics; Alleles; Amino Acid Sequence; Computational Biology/methods; Conserved Sequence; Cytochrome b Group; Cytochromes/chemistry; Cytochromes/metabolism; Electron Transport Chain Complex Proteins/chemistry; Electron Transport Chain Complex Proteins/metabolism; Escherichia coli Proteins/chemistry; Escherichia coli Proteins/metabolism; Gene Order; Gene Transfer, Horizontal; Genetic Complementation Test; Genome, Bacterial; Genomics; Hydrophobic and Hydrophilic Interactions; Markov Chains; Molecular Sequence Annotation; Molecular Sequence Data; Mutation; Operon; Oxidoreductases/chemistry; Oxidoreductases/metabolism; Phylogeny; Position-Specific Scoring Matrices; Protein Interaction Domains and Motifs; Proteobacteria/genetics; Proteobacteria/metabolism; Sequence Alignment; Sequence Analysis, DNA
10.
Proteins ; 82(5): 858-66, 2014 May.
Article in English | MEDLINE | ID: mdl-24265170

ABSTRACT

In the design of new enzymes and binding proteins, human intuition is often used to modify computationally designed amino acid sequences prior to experimental characterization. The manual sequence changes involve both reversions of amino acid mutations back to the identity present in the parent scaffold and the introduction of residues making additional interactions with the binding partner or backing up first shell interactions. Automation of this manual sequence refinement process would allow more systematic evaluation and considerably reduce the amount of human designer effort involved. Here we introduce a benchmark for evaluating the ability of automated methods to recapitulate the sequence changes made to computer-generated models by human designers, and use it to assess alternative computational methods. We find the best performance for a greedy one-position-at-a-time optimization protocol that utilizes metrics (such as shape complementarity) and local refinement methods too computationally expensive for global Monte Carlo (MC) sequence optimization. This protocol should be broadly useful for improving the stability and function of designed binding proteins.


Subjects
Automation; Intuition; Proteins/chemistry; Algorithms; Databases, Protein; Humans; Monte Carlo Method; Position-Specific Scoring Matrices; Sequence Analysis, DNA; Thermodynamics
11.
Article in English | MEDLINE | ID: mdl-24091407

ABSTRACT

The recognition of microRNA (miRNA)-binding residues in proteins is helpful to understand how miRNAs silence their target genes. It is difficult to use existing computational methods to predict miRNA-binding residues in proteins due to the lack of training examples. To address this issue, unlabeled data may be exploited to help construct a computational model. Semisupervised learning deals with methods that automatically exploit unlabeled data in addition to labeled data to improve learning performance, with no human intervention assumed. In addition, miRNA-binding proteins almost always contain a much smaller number of binding than nonbinding residues, and cost-sensitive learning is regarded as a good solution to the class imbalance problem. In this work, a novel model is proposed for recognizing miRNA-binding residues in proteins from sequences using a cost-sensitive extension of Laplacian support vector machines (CS-LapSVM) with a hybrid feature. The hybrid feature consists of evolutionary information of the amino acid sequence (position-specific scoring matrices), the conservation information about three biochemical properties (HKM) and mutual interaction propensities in protein-miRNA complex structures. The CS-LapSVM achieves good performance, with an F1 score of 26.23 ± 2.55% and an AUC value of 0.805 ± 0.020, superior to existing approaches for the recognition of RNA-binding residues. A web server called SARS has been built and is freely available for academic use.


Subjects
Binding Sites; MicroRNAs/metabolism; Proteins/metabolism; Sequence Analysis, Protein/methods; Support Vector Machine; Algorithms; Databases, Protein; MicroRNAs/chemistry; Position-Specific Scoring Matrices; Protein Binding; Proteins/chemistry
12.
PLoS One ; 7(11): e47836, 2012.
Article in English | MEDLINE | ID: mdl-23144830

ABSTRACT

A major challenge in molecular biology is reverse-engineering the cis-regulatory logic that plays a major role in the control of gene expression. This program includes searching through DNA sequences to identify "motifs" that serve as the binding sites for transcription factors or, more generally, are predictive of gene expression across cellular conditions. Several approaches have been proposed for de novo motif discovery-searching sequences without prior knowledge of binding sites or nucleotide patterns. However, unbiased validation is not straightforward. We consider two approaches to unbiased validation of discovered motifs: testing the statistical significance of a motif using a DNA "background" sequence model to represent the null hypothesis and measuring performance in predicting membership in gene clusters. We demonstrate that the background models typically used are "too null," resulting in overly optimistic assessments of significance, and argue that performance in predicting TF binding or expression patterns from DNA motifs should be assessed by held-out data, as in predictive learning. Applying this criterion to common motif discovery methods resulted in universally poor performance, although there is a marked improvement when motifs are statistically significant against real background sequences. Moreover, on synthetic data where "ground truth" is known, discriminative performance of all algorithms is far below the theoretical upper bound, with pronounced "over-fitting" in training. A key conclusion from this work is that the failure of de novo discovery approaches to accurately identify motifs is basically due to statistical intractability resulting from the fixed size of co-regulated gene clusters, and thus such failures do not necessarily provide evidence that unfound motifs are not active biologically. Consequently, the use of prior knowledge to enhance motif discovery is not just advantageous but necessary. 
An implementation of the LR and ALR algorithms is available at http://code.google.com/p/likelihood-ratio-motifs/.


Subjects
Models, Genetic; Regulatory Sequences, Nucleic Acid; Sequence Analysis, DNA/methods; Algorithms; Area Under Curve; Bayes Theorem; Gene Expression Profiling; Gene Expression Regulation; Humans; Logistic Models; Markov Chains; Monte Carlo Method; Position-Specific Scoring Matrices; ROC Curve; Saccharomyces cerevisiae/genetics; Transcriptome
13.
BMC Bioinformatics ; 13: 89, 2012 May 10.
Article in English | MEDLINE | ID: mdl-22574904

ABSTRACT

BACKGROUND: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. RESULTS: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. 
Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. CONCLUSIONS: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. 
These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.
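The trade-off between Specificity and Sensitivity discussed in this abstract, and its summary by the Matthews Correlation Coefficient, can be made concrete with a short Python sketch. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and do not come from the study.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and MCC from a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # fraction of true interface residues found
    specificity = tn / (tn + fp)   # fraction of non-interface residues rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}

# Made-up residue-level counts for one hypothetical classifier.
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=20)
print(metrics)
```

Because MCC uses all four cells of the confusion matrix, two predictors with very different sensitivity and specificity can still reach comparable MCC values, which is why the abstract reports several measures side by side.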


Subjects
Artificial Intelligence; RNA-Binding Proteins/chemistry; RNA/chemistry; Algorithms; Amino Acids/chemistry; Bayes Theorem; Humans; Position-Specific Scoring Matrices; Protein Conformation; RNA/metabolism; RNA-Binding Proteins/metabolism; Sequence Analysis, Protein; Support Vector Machine
14.
Nucleic Acids Res ; 39(3): 808-24, 2011 Feb.
Article in English | MEDLINE | ID: mdl-20923783

ABSTRACT

Position-specific scoring matrices (PSSMs) are routinely used to predict transcription factor (TF)-binding sites in genome sequences. However, their reliability to predict novel binding sites can be far from optimum, due to the use of a small number of training sites or the inappropriate choice of parameters when building the matrix or when scanning sequences with it. Measures of matrix quality such as E-value and information content rely on theoretical models, and may fail in the context of full genome sequences. We propose a method, implemented in the program 'matrix-quality', that combines theoretical and empirical score distributions to assess reliability of PSSMs for predicting TF-binding sites. We applied 'matrix-quality' to estimate the predictive capacity of matrices for bacterial, yeast and mouse TFs. The evaluation of matrices from RegulonDB revealed some poorly predictive motifs, and allowed us to quantify the improvements obtained by applying multi-genome motif discovery. Interestingly, the method reveals differences between global and specific regulators. It also highlights the enrichment of binding sites in sequence sets obtained from high-throughput ChIP-chip (bacterial and yeast TFs) and ChIP-seq experiments (mouse TFs). The method presented here has many applications, including: selecting reliable motifs before scanning sequences; improving motif collections in TF databases; evaluating motifs discovered using high-throughput data sets.
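For readers unfamiliar with what building a matrix and scanning sequences with it involve, here is a minimal Python sketch. It is not the 'matrix-quality' program: the function names, the pseudocount scheme, the training sites and the scanned sequence are all assumptions chosen for illustration, and only one strand is scanned.

```python
import math

def build_pssm(sites, background, pseudocount=1.0):
    """Log-odds matrix (base 2) from aligned sites of equal length."""
    pssm = []
    for i in range(len(sites[0])):
        column = {}
        for base, bg in background.items():
            count = sum(1 for site in sites if site[i] == base)
            # The pseudocount is distributed according to the background.
            freq = (count + pseudocount * bg) / (len(sites) + pseudocount)
            column[base] = math.log2(freq / bg)
        pssm.append(column)
    return pssm

def scan(sequence, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (start, sum(pssm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Hypothetical training sites and a uniform background model.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_sites = ["TGACA", "TGACT", "TGACA", "TTACA"]
toy_pssm = build_pssm(toy_sites, background)

hits = scan("CCTGACAGG", toy_pssm)
best = max(hits, key=lambda hit: hit[1])
print(best)
```

With only four training sites, the scores depend strongly on the pseudocount and the background, which is the kind of parameter sensitivity the abstract identifies as a source of unreliable predictions.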


Subjects
Position-Specific Scoring Matrices; Promoter Regions, Genetic; Sequence Analysis, DNA; Transcription Factors/metabolism; Animals; Bacterial Proteins/metabolism; Binding Sites; Chromatin Immunoprecipitation; Genomics; Mice; Oligonucleotide Array Sequence Analysis; ROC Curve; Repressor Proteins/metabolism; Serine Endopeptidases/metabolism; Software
15.
J Comput Biol ; 17(12): 1621-38, 2010 Dec.
Article in English | MEDLINE | ID: mdl-21128853

ABSTRACT

The problem of motif detection can be formulated as the construction of a discriminant function to separate sequences of a specific pattern from background. In computational biology, motif detection is used to predict DNA binding sites of a transcription factor (TF), mostly based on the weight matrix (WM) model or the Gibbs free energy (FE) model. However, despite the wide applications, theoretical analysis of these two models and their predictions is still lacking. We derive asymptotic error rates of prediction procedures based on these models under different data generation assumptions. This allows a theoretical comparison between the WM-based and the FE-based predictions in terms of asymptotic efficiency. Applications of the theoretical results are demonstrated with empirical studies on ChIP-seq data and protein binding microarray data. We find that, irrespective of underlying data generation mechanisms, the FE approach shows higher or comparable predictive power relative to the WM approach when the number of observed binding sites used for constructing a discriminant decision is not too small.


Subjects
Base Sequence; Models, Biological; Position-Specific Scoring Matrices; Regulatory Sequences, Nucleic Acid/genetics; Computational Biology; Markov Chains; Numerical Analysis, Computer-Assisted; Thermodynamics
16.
J Comput Biol ; 17(12): 1697-709, 2010 Dec.
Article in English | MEDLINE | ID: mdl-21128856

ABSTRACT

Monte Carlo methods can provide accurate p-value estimates of word counting test statistics and are easy to implement. They are especially attractive when an asymptotic theory is absent or when either the search sequence or the word pattern is too short for the application of asymptotic formulae. Naive direct Monte Carlo is undesirable for the estimation of small probabilities because the associated rare events of interest are seldom generated. We propose instead efficient importance sampling algorithms that use controlled insertion of the desired word patterns on randomly generated sequences. The implementation is illustrated on word patterns of biological interest: palindromes and inverted repeats, patterns arising from position-specific weight matrices (PSWMs), and co-occurrences of pairs of motifs.
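The naive direct Monte Carlo estimator that this abstract argues against can be sketched in a few lines of Python. This is the baseline only, not the importance sampling algorithms the authors propose; the function names, the uniform i.i.d. background and the example word are assumptions.

```python
import random

def count_occurrences(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq[i:i + len(word)] == word)

def naive_mc_p_value(word, seq_len, threshold, n_samples, rng):
    """Estimate P(count >= threshold) on random uniform DNA sequences."""
    hits = 0
    for _ in range(n_samples):
        seq = "".join(rng.choice("ACGT") for _ in range(seq_len))
        if count_occurrences(seq, word) >= threshold:
            hits += 1
    return hits / n_samples

rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = naive_mc_p_value("GAATTC", seq_len=200, threshold=1,
                            n_samples=2000, rng=rng)
print(estimate)
```

The weakness is visible in the arithmetic: if the true p-value were around 1e-6, a run of 2000 samples would almost always contain no hits at all and return an estimate of zero, which motivates sampling schemes that generate the rare event more often and reweight accordingly.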


Subjects
Amino Acid Motifs; Pattern Recognition, Automated; Regulatory Sequences, Nucleic Acid; Sequence Analysis/methods; Amino Acid Sequence; Base Sequence; Inverted Repeat Sequences; Monte Carlo Method; Position-Specific Scoring Matrices
17.
Bioinformatics ; 25(24): 3251-8, 2009 Dec 15.
Article in English | MEDLINE | ID: mdl-19828575

ABSTRACT

MOTIVATION: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive. RESULTS: We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrices family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores and fast fragment chaining. The resulting method is well suited to prefilter the set of sequences to be searched for subsequent database searches with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet, results could be obtained in a fraction of runtime with a speedup of >64-fold. In experiments addressing the method's ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduces the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup of factor 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup of factor 92. AVAILABILITY: The presented algorithms are implemented in the program PoSSuMsearch2, available for download at http://bibiserv.techfak.uni-bielefeld.de/possumsearch2/. CONTACT: beckstette@zbh.uni-hamburg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
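Exact p-value computation for PSSM match scores can be illustrated with the textbook dynamic program over integer scores. This sketch is not the PoSSuMsearch2 algorithm, which may use a more efficient scheme; the function names, the width-2 integer matrix and the uniform background are assumptions for illustration.

```python
from collections import defaultdict

def score_distribution(int_pssm, background):
    """Exact distribution of the total match score under an i.i.d. background.

    int_pssm is a list of columns mapping each symbol to an integer score.
    Columns are processed one at a time, convolving the running distribution
    with the score distribution of the next column.
    """
    dist = {0: 1.0}
    for column in int_pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for symbol, s in column.items():
                nxt[score + s] += prob * background[symbol]
        dist = dict(nxt)
    return dist

def p_value(dist, threshold):
    """Probability that a random window scores at least `threshold`."""
    return sum(prob for score, prob in dist.items() if score >= threshold)

# Hypothetical width-2 matrix over DNA with a uniform background.
uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_int_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
]
toy_dist = score_distribution(toy_int_pssm, uniform_bg)
print(p_value(toy_dist, 4), p_value(toy_dist, 1))
```

The work grows with the matrix width times the number of distinct reachable scores rather than with the number of possible sequences, so a score threshold for a desired p-value can be derived before any database is searched.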


Subjects
Computational Biology/methods; Databases, Protein; Markov Chains; Position-Specific Scoring Matrices; Proteins/classification; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Algorithms; Pattern Recognition, Automated/methods; Software