Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 55
Filtrar
Mais filtros

Base de dados
Tipo de documento
Intervalo de ano de publicação
1.
Nature ; 619(7970): 526-532, 2023 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-37407824

RESUMO

Extreme precipitation is a considerable contributor to meteorological disasters and there is a great need to mitigate its socioeconomic effects through skilful nowcasting that has high resolution, long lead times and local details1-3. Current methods are subject to blur, dissipation, intensity or location errors, with physics-based numerical methods struggling to capture pivotal chaotic dynamics such as convective initiation4 and data-driven learning methods failing to obey intrinsic physical laws such as advective conservation5. We present NowcastNet, a nonlinear nowcasting model for extreme precipitation that unifies physical-evolution schemes and conditional-learning methods into a neural-network framework with end-to-end forecast error optimization. On the basis of radar observations from the USA and China, our model produces physically plausible precipitation nowcasts with sharp multiscale patterns over regions of 2,048 km × 2,048 km and with lead times of up to 3 h. In a systematic evaluation by 62 professional meteorologists from across China, our model ranks first in 71% of cases against the leading methods. NowcastNet provides skilful forecasts at light-to-heavy rain rates, particularly for extreme-precipitation events accompanied by advective or convective processes that were previously considered intractable.

2.
Nat Methods ; 20(8): 1222-1231, 2023 08.
Artigo em Inglês | MEDLINE | ID: mdl-37386189

RESUMO

Jointly profiling the transcriptome, chromatin accessibility and other molecular properties of single cells offers a powerful way to study cellular diversity. Here we present MultiVI, a probabilistic model to analyze such multiomic data and leverage it to enhance single-modality datasets. MultiVI creates a joint representation that allows an analysis of all modalities included in the multiomic input data, even for cells for which one or more modalities are missing. It is available at scvi-tools.org .


Assuntos
Modelos Estatísticos , Transcriptoma
3.
Proc Natl Acad Sci U S A ; 120(21): e2209124120, 2023 05 23.
Artigo em Inglês | MEDLINE | ID: mdl-37192164

RESUMO

Detecting differentially expressed genes is important for characterizing subpopulations of cells. In scRNA-seq data, however, nuisance variation due to technical factors like sequencing depth and RNA capture efficiency obscures the underlying biological signal. Deep generative models have been extensively applied to scRNA-seq data, with a special focus on embedding cells into a low-dimensional latent space and correcting for batch effects. However, little attention has been paid to the problem of utilizing the uncertainty from the deep generative model for differential expression (DE). Furthermore, the existing approaches do not allow for controlling for effect size or the false discovery rate (FDR). Here, we present lvm-DE, a generic Bayesian approach for performing DE predictions from a fitted deep generative model, while controlling the FDR. We apply the lvm-DE framework to scVI and scSphere, two deep generative models. The resulting approaches outperform state-of-the-art methods at estimating the log fold change in gene expression levels as well as detecting differentially expressed genes between subpopulations of cells.


Assuntos
RNA , Análise de Célula Única , Teorema de Bayes , Análise de Sequência de RNA/métodos , Análise de Célula Única/métodos , Perfilação da Expressão Gênica/métodos
4.
Proc Natl Acad Sci U S A ; 119(43): e2204569119, 2022 10 25.
Artigo em Inglês | MEDLINE | ID: mdl-36256807

RESUMO

Many applications of machine-learning methods involve an iterative protocol in which data are collected, a model is trained, and then outputs of that model are used to choose what data to consider next. For example, a data-driven approach for designing proteins is to train a regression model to predict the fitness of protein sequences and then use it to propose new sequences believed to exhibit greater fitness than observed in the training data. Since validating designed sequences in the wet laboratory is typically costly, it is important to quantify the uncertainty in the model's predictions. This is challenging because of a characteristic type of distribution shift between the training and test data that arises in the design setting-one in which the training and test data are statistically dependent, as the latter is chosen based on the former. Consequently, the model's error on the test data-that is, the designed sequences-has an unknown and possibly complex relationship with its error on the training data. We introduce a method to construct confidence sets for predictions in such settings, which account for the dependence between the training and test data. The confidence sets we construct have finite-sample guarantees that hold for any regression model, even when it is used to choose the test-time input distribution. As a motivating use case, we use real datasets to demonstrate how our method quantifies uncertainty for the predicted fitness of designed proteins and can therefore be used to select design algorithms that achieve acceptable tradeoffs between high predicted fitness and low predictive uncertainty.


Assuntos
Algoritmos , Aprendizado de Máquina , Retroalimentação , Incerteza , Conformação Molecular
5.
Mol Syst Biol ; 17(1): e9620, 2021 01.
Artigo em Inglês | MEDLINE | ID: mdl-33491336

RESUMO

As the number of single-cell transcriptomics datasets grows, the natural next step is to integrate the accumulating data to achieve a common ontology of cell types and states. However, it is not straightforward to compare gene expression levels across datasets and to automatically assign cell type labels in a new dataset based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of scRNA-seq data, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage existing cell state annotations. We demonstrate that scVI and scANVI compare favorably to state-of-the-art methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings. In contrast to existing methods, scVI and scANVI integrate multiple datasets with a single generative model that can be directly used for downstream tasks, such as differential expression. Both methods are easily accessible through scvi-tools.


Assuntos
Biologia Computacional/métodos , Análise de Célula Única/métodos , Bases de Dados Genéticas , Perfilação da Expressão Gênica , Humanos , Anotação de Sequência Molecular , Análise de Sequência de RNA , Aprendizado de Máquina Supervisionado
6.
Proc Natl Acad Sci U S A ; 116(42): 20881-20885, 2019 10 15.
Artigo em Inglês | MEDLINE | ID: mdl-31570618

RESUMO

Optimization algorithms and Monte Carlo sampling algorithms have provided the computational foundations for the rapid growth in applications of statistical machine learning in recent years. There is, however, limited theoretical understanding of the relationships between these 2 kinds of methodology, and limited understanding of relative strengths and weaknesses. Moreover, existing results have been obtained primarily in the setting of convex functions (for optimization) and log-concave functions (for sampling). In this setting, where local properties determine global properties, optimization algorithms are unsurprisingly more efficient computationally than sampling algorithms. We instead examine a class of nonconvex objective functions that arise in mixture modeling and multistable systems. In this nonconvex setting, we find that the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.

7.
Nat Methods ; 15(12): 1053-1058, 2018 12.
Artigo em Inglês | MEDLINE | ID: mdl-30504886

RESUMO

Single-cell transcriptome measurements can reveal unexplored biological diversity, but they suffer from technical noise and bias that must be modeled to account for the resulting uncertainty in downstream analyses. Here we introduce single-cell variational inference (scVI), a ready-to-use scalable framework for the probabilistic representation and analysis of gene expression in single cells ( https://github.com/YosefLab/scVI ). scVI uses stochastic optimization and deep neural networks to aggregate information across similar cells and genes and to approximate the distributions that underlie observed expression values, while accounting for batch effects and limited sensitivity. We used scVI for a range of fundamental analysis tasks including batch correction, visualization, clustering, and differential expression, and achieved high accuracy for each task.


Assuntos
Biologia Computacional/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Modelos Biológicos , Análise de Sequência de RNA/métodos , Análise de Célula Única/métodos , Transcriptoma , Algoritmos , Animais , Encéfalo/citologia , Encéfalo/metabolismo , Análise por Conglomerados , Variação Genética , Células-Tronco Hematopoéticas/citologia , Células-Tronco Hematopoéticas/metabolismo , Humanos , Leucócitos Mononucleares/citologia , Leucócitos Mononucleares/metabolismo , Camundongos
8.
Proc Natl Acad Sci U S A ; 113(47): E7351-E7358, 2016 11 22.
Artigo em Inglês | MEDLINE | ID: mdl-27834219

RESUMO

Accelerated gradient methods play a central role in optimization, achieving optimal rates in many settings. Although many generalizations and extensions of Nesterov's original acceleration method have been proposed, it is not yet clear what is the natural scope of the acceleration concept. In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the Bregman Lagrangian, which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods corresponds to traveling the same curve in spacetime at different speeds. From this perspective, Nesterov's technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms.

9.
Proc Natl Acad Sci U S A ; 110(13): E1181-90, 2013 Mar 26.
Artigo em Inglês | MEDLINE | ID: mdl-23479655

RESUMO

Modern massive datasets create a fundamental problem at the intersection of the computational and statistical sciences: how to provide guarantees on the quality of statistical inference given bounds on computational resources, such as time or space. Our approach to this problem is to define a notion of "algorithmic weakening," in which a hierarchy of algorithms is ordered by both computational efficiency and statistical efficiency, allowing the growing strength of the data at scale to be traded off against the need for sophisticated processing. We illustrate this approach in the setting of denoising problems, using convex relaxation as the core inferential tool. Hierarchies of convex relaxations have been widely used in theoretical computer science to yield tractable approximation algorithms to many computationally intractable tasks. In the current paper, we show how to endow such hierarchies with a statistical characterization and thereby obtain concrete tradeoffs relating algorithmic runtime to amount of data.

10.
Proc Natl Acad Sci U S A ; 110(4): 1160-6, 2013 Jan 22.
Artigo em Inglês | MEDLINE | ID: mdl-23275296

RESUMO

We address the problem of the joint statistical inference of phylogenetic trees and multiple sequence alignments from unaligned molecular sequences. This problem is generally formulated in terms of string-valued evolutionary processes along the branches of a phylogenetic tree. The classic evolutionary process, the TKF91 model [Thorne JL, Kishino H, Felsenstein J (1991) J Mol Evol 33(2):114-124] is a continuous-time Markov chain model composed of insertion, deletion, and substitution events. Unfortunately, this model gives rise to an intractable computational problem: The computation of the marginal likelihood under the TKF91 model is exponential in the number of taxa. In this work, we present a stochastic process, the Poisson Indel Process (PIP), in which the complexity of this computation is reduced to linear. The Poisson Indel Process is closely related to the TKF91 model, differing only in its treatment of insertions, but it has a global characterization as a Poisson process on the phylogeny. Standard results for Poisson processes allow key computations to be decoupled, which yields the favorable computational profile of inference under the PIP model. We present illustrative experiments in which Bayesian inference under the PIP model is compared with separate inference of phylogenies and alignments.


Assuntos
Evolução Molecular , Mutação INDEL , Modelos Genéticos , Modelos Estatísticos , Teorema de Bayes , Bioestatística , Funções Verossimilhança , Cadeias de Markov , Filogenia , Distribuição de Poisson , Alinhamento de Sequência/estatística & dados numéricos
11.
Bioinformatics ; 30(19): 2787-95, 2014 Oct.
Artigo em Inglês | MEDLINE | ID: mdl-24894505

RESUMO

MOTIVATION: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. RESULTS: We propose SMaSH, a benchmarking methodology for evaluating germline variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on these benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single-nucleotide polymorphism, indel and structural variant calling algorithms. AVAILABILITY AND IMPLEMENTATION: We provide free and open access online to the SMaSH tool kit, along with detailed documentation, at smash.cs.berkeley.edu


Assuntos
Biologia Computacional/métodos , Genoma Humano , Genômica/métodos , Mutação INDEL , Algoritmos , Interpretação Estatística de Dados , Bases de Dados Genéticas , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Polimorfismo de Nucleotídeo Único , Software
12.
Genome Res ; 21(11): 1969-80, 2011 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-21784873

RESUMO

The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces equivalently precise predictions compared to the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of the 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and protein family data are available at http://sifter.berkeley.edu.


Assuntos
Genoma , Anotação de Sequência Molecular , Filogenia , Proteínas/classificação , Proteínas/genética , Algoritmos , Arilsulfotransferase/genética , Biologia Computacional , Bases de Dados de Proteínas , Genoma Fúngico , Modelos Estatísticos , Pirofosfatases/genética , Proteínas de Schizosaccharomyces pombe/genética , Nudix Hidrolases
13.
Proteins ; 81(9): 1593-609, 2013 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-23671031

RESUMO

The subfamily Iα aminotransferases are typically categorized as having narrow specificity toward carboxylic amino acids (AATases), or broad specificity that includes aromatic amino acid substrates (TATases). Because of their general role in central metabolism and, more specifically, their association with liver-related diseases in humans, this subfamily is biologically interesting. The substrate specificities for only a few members of this subfamily have been reported, and the reliable prediction of substrate specificity from protein sequence has remained elusive. In this study, a diverse set of aminotransferases was chosen for characterization based on a scoring system that measures the sequence divergence of the active site. The enzymes that were experimentally characterized include both narrow-specificity AATases and broad-specificity TATases, as well as AATases with broader-specificity and TATases with narrower-specificity than the previously known family members. Molecular function and phylogenetic analyses underscored the complexity of this family's evolution as the TATase function does not follow a single evolutionary thread, but rather appears independently multiple times during the evolution of the subfamily. The additional functional characterizations described in this article, alongside a detailed sequence and phylogenetic analysis, provide some novel clues to understanding the evolutionary mechanisms at work in this family.


Assuntos
Transaminases/química , Transaminases/metabolismo , Sequência de Aminoácidos , Animais , Proteínas de Bactérias , Proteínas Fúngicas , Cinética , Dados de Sequência Molecular , Filogenia , Alinhamento de Sequência , Especificidade por Substrato , Transaminases/classificação , Transaminases/genética
14.
Syst Biol ; 61(4): 579-93, 2012 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-22223445

RESUMO

Bayesian inference provides an appealing general framework for phylogenetic analysis, able to incorporate a wide variety of modeling assumptions and to provide a coherent treatment of uncertainty. Existing computational approaches to bayesian inference based on Markov chain Monte Carlo (MCMC) have not, however, kept pace with the scale of the data analysis problems in phylogenetics, and this has hindered the adoption of bayesian methods. In this paper, we present an alternative to MCMC based on Sequential Monte Carlo (SMC). We develop an extension of classical SMC based on partially ordered sets and show how to apply this framework--which we refer to as PosetSMC--to phylogenetic analysis. We provide a theoretical treatment of PosetSMC and also present experimental evaluation of PosetSMC on both synthetic and real data. The empirical results demonstrate that PosetSMC is a very promising alternative to MCMC, providing up to two orders of magnitude faster convergence. We discuss other factors favorable to the adoption of PosetSMC in phylogenetics, including its ability to estimate marginal likelihoods, its ready implementability on parallel and distributed computing platforms, and the possibility of combining with MCMC in hybrid MCMC-SMC schemes. Software for PosetSMC is available at http://www.stat.ubc.ca/ bouchard/PosetSMC.


Assuntos
Modelos Genéticos , Método de Monte Carlo , Filogenia , Algoritmos , Teorema de Bayes , Frequência do Gene , Humanos , Cadeias de Markov , RNA Ribossômico 16S/genética , Software
15.
Science ; 382(6671): 669-674, 2023 Nov 10.
Artigo em Inglês | MEDLINE | ID: mdl-37943906

RESUMO

Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system. The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients without making any assumptions about the machine-learning algorithm that supplies the predictions. Furthermore, more accurate predictions translate to smaller confidence intervals. Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning. The benefits of prediction-powered inference were demonstrated with datasets from proteomics, astronomy, genomics, remote sensing, census analysis, and ecology.

16.
Nat Biotechnol ; 40(9): 1360-1369, 2022 09.
Artigo em Inglês | MEDLINE | ID: mdl-35449415

RESUMO

Most spatial transcriptomics technologies are limited by their resolution, with spot sizes larger than that of a single cell. Although joint analysis with single-cell RNA sequencing can alleviate this problem, current methods are limited to assessing discrete cell types, revealing the proportion of cell types inside each spot. To identify continuous variation of the transcriptome within cells of the same type, we developed Deconvolution of Spatial Transcriptomics profiles using Variational Inference (DestVI). Using simulations, we demonstrate that DestVI outperforms existing methods for estimating gene expression for every cell type inside every spot. Applied to a study of infected lymph nodes and of a mouse tumor model, DestVI provides high-resolution, accurate spatial characterization of the cellular organization of these tissues and identifies cell-type-specific changes in gene expression between different tissue regions or between conditions. DestVI is available as part of the open-source software package scvi-tools ( https://scvi-tools.org ).


Assuntos
Neoplasias , Transcriptoma , Animais , Perfilação da Expressão Gênica/métodos , Camundongos , Neoplasias/genética , Análise de Célula Única/métodos , Software , Transcriptoma/genética , Sequenciamento do Exoma
17.
Am J Hum Genet ; 83(6): 675-83, 2008 Dec.
Artigo em Inglês | MEDLINE | ID: mdl-19026399

RESUMO

The central questions asked in whole-genome association studies are how to locate associated regions in the genome and how to estimate the significance of these findings. Researchers usually do this by testing each SNP separately for association and then applying a suitable correction for multiple-hypothesis testing. However, SNPs are correlated by the unobserved genealogy of the population, and a more powerful statistical methodology would attempt to take this genealogy into account. Leveraging the genealogy in association studies is challenging, however, because the inference of the genealogy from the genotypes is a computationally intensive task, in particular when recombination is modeled, as in ancestral recombination graphs. Furthermore, if large numbers of genealogies are imputed from the genotypes, the power of the study might decrease if these imputed genealogies create an additional multiple-hypothesis testing burden. Indeed, we show in this paper that several existing methods that aim to address this problem suffer either from low power or from a very high false-positive rate; their performance is generally not better than the standard approach of separate testing of SNPs. We suggest a new genealogy-based approach, CAMP (coalescent-based association mapping), that takes into account the trade-off between the complexity of the genealogy and the power lost due to the additional multiple hypotheses. Our experiments show that CAMP yields a significant increase in power relative to that of previous methods and that it can more accurately locate the associated region.


Assuntos
Mapeamento Cromossômico/métodos , Predisposição Genética para Doença , Genoma Humano , Estudo de Associação Genômica Ampla , Polimorfismo de Nucleotídeo Único , Algoritmos , Estudos de Casos e Controles , Distribuição de Qui-Quadrado , Simulação por Computador , Genética Populacional , Genótipo , Haplótipos , Humanos , Modelos Genéticos , Mutação , Linhagem , Filogenia , Recombinação Genética
18.
Bioinformatics ; 26(5): 617-24, 2010 Mar 01.
Artigo em Inglês | MEDLINE | ID: mdl-20080507

RESUMO

MOTIVATION: The identification of catalytic residues is a key step in understanding the function of enzymes. While a variety of computational methods have been developed for this task, accuracies have remained fairly low. The best existing method exploits information from sequence and structure to achieve a precision (the fraction of predicted catalytic residues that are catalytic) of 18.5% at a corresponding recall (the fraction of catalytic residues identified) of 57% on a standard benchmark. Here we present a new method, Discern, which provides a significant improvement over the state-of-the-art through the use of statistical techniques to derive a model with a small set of features that are jointly predictive of enzyme active sites. RESULTS: In cross-validation experiments on two benchmark datasets from the Catalytic Site Atlas and CATRES resources containing a total of 437 manually curated enzymes spanning 487 SCOP families, Discern increases catalytic site recall between 12% and 20% over methods that combine information from both sequence and structure, and by >or=50% over methods that make use of sequence conservation signal only. Controlled experiments show that Discern's improvement in catalytic residue prediction is derived from the combination of three ingredients: the use of the INTREPID phylogenomic method to extract conservation information; the use of 3D structure data, including features computed for residues that are proximal in the structure; and a statistical regularization procedure to prevent overfitting.


Assuntos
Domínio Catalítico/genética , Evolução Molecular , Conformação Proteica , Proteínas/química , Proteômica/métodos , Sítios de Ligação , Catálise , Bases de Dados de Proteínas , Modelos Moleculares , Dobramento de Proteína , Análise de Sequência de Proteína
19.
PLoS Comput Biol ; 6(4): e1000763, 2010 Apr 29.
Artigo em Inglês | MEDLINE | ID: mdl-20442867

RESUMO

Distributions of the backbone dihedral angles of proteins have been studied for over 40 years. While many statistical analyses have been presented, only a handful of probability densities are publicly available for use in structure validation and structure prediction methods. The available distributions differ in a number of important ways, which determine their usefulness for various purposes. These include: 1) input data size and criteria for structure inclusion (resolution, R-factor, etc.); 2) filtering of suspect conformations and outliers using B-factors or other features; 3) secondary structure of input data (e.g., whether helix and sheet are included; whether beta turns are included); 4) the method used for determining probability densities ranging from simple histograms to modern nonparametric density estimation; and 5) whether they include nearest neighbor effects on the distribution of conformations in different regions of the Ramachandran map. In this work, Ramachandran probability distributions are presented for residues in protein loops from a high-resolution data set with filtering based on calculated electron densities. Distributions for all 20 amino acids (with cis and trans proline treated separately) have been determined, as well as 420 left-neighbor and 420 right-neighbor dependent distributions. The neighbor-independent and neighbor-dependent probability densities have been accurately estimated using Bayesian nonparametric statistical analysis based on the Dirichlet process. In particular, we used hierarchical Dirichlet process priors, which allow sharing of information between densities for a particular residue type and different neighbor residue types. The resulting distributions are tested in a loop modeling benchmark with the program Rosetta, and are shown to improve protein loop conformation prediction significantly. The distributions are available at http://dunbrack.fccc.edu/hdp.


Assuntos
Aminoácidos/química , Teorema de Bayes , Modelos Biológicos , Modelos Estatísticos , Proteínas/química , Algoritmos , Sequência de Aminoácidos , Bases de Dados de Proteínas , Conformação Proteica , Proteínas/genética
20.
Proteins ; 78(6): 1583-93, 2010 May 01.
Artigo em Inglês | MEDLINE | ID: mdl-20131376

RESUMO

De novo protein structure prediction requires location of the lowest energy state of the polypeptide chain among a vast set of possible conformations. Powerful approaches include conformational space annealing, in which search progressively focuses on the most promising regions of conformational space, and genetic algorithms, in which features of the best conformations thus far identified are recombined. We describe a new approach that combines the strengths of these two approaches. Protein conformations are projected onto a discrete feature space which includes backbone torsion angles, secondary structure, and beta pairings. For each of these there is one "native" value: the one found in the native structure. We begin with a large number of conformations generated in independent Monte Carlo structure prediction trajectories from Rosetta. Native values for each feature are predicted from the frequencies of feature value occurrences and the energy distribution in conformations containing them. A second round of structure prediction trajectories are then guided by the predicted native feature distributions. We show that native features can be predicted at much higher than background rates, and that using the predicted feature distributions improves structure prediction in a benchmark of 28 proteins. The advantages of our approach are that features from many different input structures can be combined simultaneously without producing atomic clashes or otherwise physically inviable models, and that the features being recombined have a relatively high chance of being correct.


Assuntos
Algoritmos , Estrutura Secundária de Proteína , Proteínas/química , Bases de Dados de Proteínas
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA