1.
Trends Pharmacol Sci ; 45(3): 255-267, 2024 03.
Article in English | MEDLINE | ID: mdl-38378385

ABSTRACT

Generative biology combines artificial intelligence (AI), advanced life sciences technologies, and automation to revolutionize the process of designing novel biomolecules with prescribed properties, giving drug discoverers the ability to escape the limitations of biology during the design of next-generation protein therapeutics. Significant hurdles remain, namely: (i) the inherently complex nature of drug discovery, (ii) the bewildering number of promising computational and experimental techniques that have emerged in the past several years, and (iii) the limited availability of relevant protein sequence-function data for drug-like molecules. There is a need to focus on computational methods that will be most practically effective for protein drug discovery and on building experimental platforms to generate the data most appropriate for these methods. Here, we discuss recent advances in computational and experimental life sciences that are most crucial for impacting the pace and success of protein drug discovery.


Subject(s)
Artificial Intelligence, Drug Discovery, Humans, Drug Discovery/methods, Biology
2.
MAbs ; 15(1): 2256745, 2023.
Article in English | MEDLINE | ID: mdl-37698932

ABSTRACT

Biologic drug discovery pipelines are designed to deliver protein therapeutics that have exquisite functional potency and selectivity while also manifesting biophysical characteristics suitable for manufacturing, storage, and convenient administration to patients. The ability to use computational methods to predict biophysical properties from protein sequence, potentially in combination with high throughput assays, could decrease timelines and increase the success rates for therapeutic developability engineering by eliminating lengthy and expensive cycles of recombinant protein production and testing. To support development of high-quality predictive models for antibody developability, we designed a sequence-diverse panel of 83 effector functionless IgG1 antibodies displaying a range of biophysical properties, produced and formulated each protein under standard platform conditions, and collected a comprehensive package of analytical data, including in vitro assays and in vivo mouse pharmacokinetics. We used this robust training data set to build machine learning classifier models that can predict complex protein behavior from these data and features derived from predicted and/or experimental structures. Our models predict with 87% accuracy whether viscosity at 150 mg/mL is above or below a threshold of 15 centipoise (cP) and with 75% accuracy whether the area under the plasma drug concentration-time curve (AUC, 0-672 h) in normal mouse is above or below a threshold of 3.9 × 10(6) h × ng/mL.


Subject(s)
Monoclonal Antibodies, Drug Discovery, Animals, Mice, Monoclonal Antibodies/chemistry, Computer Simulation, Recombinant Proteins, Viscosity
3.
Proteins ; 91(11): 1471-1486, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37337902

ABSTRACT

Protein engineers aim to discover and design novel sequences with targeted, desirable properties. Given the near limitless size of the protein sequence landscape, it is no surprise that these desirable sequences are often a relative rarity. This makes identifying such sequences a costly and time-consuming endeavor. In this work, we show how to use a deep transformer protein language model to identify sequences that have the most promise. Specifically, we use the model's self-attention map to calculate a Promise Score that weights the relative importance of a given sequence according to predicted interactions with a specified binding partner. This Promise Score can then be used to identify strong binders worthy of further study and experimentation. We use the Promise Score within two protein engineering contexts: Nanobody (Nb) discovery and protein optimization. With Nb discovery, we show how the Promise Score provides an effective way to select lead sequences from Nb repertoires. With protein optimization, we show how to use the Promise Score to select site-specific mutagenesis experiments that identify a high percentage of improved sequences. In both cases, we also show how the self-attention map used to calculate the Promise Score can indicate which regions of a protein are involved in intermolecular interactions that drive the targeted property. Finally, we describe how to fine-tune the transformer protein language model to learn a predictive model for the targeted property, and discuss the capabilities and limitations of fine-tuning with and without knowledge transfer within the context of protein engineering.


Subject(s)
Language, Protein Engineering, Site-Directed Mutagenesis, Amino Acid Sequence, Research Design
4.
Expert Rev Hematol ; 16(sup1): 107-127, 2023 03.
Article in English | MEDLINE | ID: mdl-36920855

ABSTRACT

BACKGROUND: The National Hemophilia Foundation (NHF) conducted extensive, inclusive community consultations to guide prioritization of research in coming decades in alignment with its mission to find cures and address and prevent complications enabling people and families with blood disorders to thrive. RESEARCH DESIGN AND METHODS: With the American Thrombosis and Hemostasis Network, NHF recruited multidisciplinary expert working groups (WG) to distill the community-identified priorities into concrete research questions and score their feasibility, impact, and risk. WG6 was charged with identifying the infrastructure, workforce development, and funding and resources to facilitate the prioritized research. Community input on conclusions was gathered at the NHF State of the Science Research Summit. RESULTS: WG6 detailed a minimal research capacity infrastructure threshold, and opportunities to enable its attainment, for bleeding disorders centers to participate in prospective, multicenter national registries. They identified challenges and opportunities to recruit, retain, and train the diverse multidisciplinary care and research workforce required into the future. Innovative collaborative approaches to trial design, resource networking, and funding to surmount obstacles facing research in rare disorders were elucidated. CONCLUSIONS: The innovations in infrastructure, workforce development, and resources and funding proposed herein may contribute to facilitating a National Research Blueprint for Inherited Bleeding Disorders.


Research is critical to advancing the diagnosis and care of people with inherited bleeding disorders (PWIBD). This research requires significant infrastructure, including people and resources. Hemophilia treatment centers (HTC) need many different skilled care professionals including doctors, nurses, and other providers; also statisticians, data managers, and other experts to process patients' clinical information into research. Attracting diverse qualified professionals to the clinical and research work requires long-term planning, recruiting individuals in training programs and retaining them as they become experts. Research infrastructure includes physical servers running database software, networks that link them, and the environment in which these components function. The US Centers for Disease Control and Prevention (CDC) and the American Thrombosis and Hemostasis Network (ATHN) coordinate and fund data collection at HTCs on the health and well-being of thousands of PWIBD into a registry used in research studies.

The National Hemophilia Foundation (NHF) and ATHN asked our group of health care professionals, technology experts, and lived experience experts (LEE) to identify the infrastructure, workforce, and resources needed to do the research most important to PWIBD. We identified the types of CDC/ATHN studies all HTCs should be able to perform, and the physical and human infrastructure this requires. We prioritized finding the best clinical trial designs to study inherited bleeding disorders, identifying ways to share personnel and tools between HTCs, and innovating how research is governed and funded. Involving LEEs in designing, managing, and carrying out research will be key in conducting research to improve the lives of PWIBD.


Subject(s)
Hemophilia A, Thrombosis, Humans, United States, Prospective Studies, Hemostasis, Workforce
5.
PLOS Digit Health ; 1(2): e0000012, 2022 Feb.
Article in English | MEDLINE | ID: mdl-36812511

ABSTRACT

Sepsis is a potentially life-threatening inflammatory response to infection or severe tissue damage. It has a highly variable clinical course, requiring constant monitoring of the patient's state to guide the management of intravenous fluids and vasopressors, among other interventions. Despite decades of research, there is still debate among experts on optimal treatment. Here, we combine, for the first time, distributional deep reinforcement learning with mechanistic physiological models to find personalized sepsis treatment strategies. Our method handles partial observability by leveraging known cardiovascular physiology, introducing a novel physiology-driven recurrent autoencoder, and quantifies the uncertainty of its own results. Moreover, we introduce a framework for uncertainty-aware decision support with humans in the loop. We show that our method learns physiologically explainable, robust policies that are consistent with clinical knowledge. Further, our method consistently identifies high-risk states that lead to death, which could potentially benefit from more frequent vasopressor administration, providing valuable guidance for future research.

6.
Bioinformatics ; 37(Suppl_1): i451-i459, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252975

ABSTRACT

MOTIVATION: The recent emergence of cloud laboratories, collections of automated wet-lab instruments that are accessed remotely, presents new opportunities to apply Artificial Intelligence and Machine Learning in scientific research. Among these is the challenge of automating the process of optimizing experimental protocols to maximize data quality. RESULTS: We introduce a new deterministic algorithm, called PaRallel OptimizaTiOn for ClOud Laboratories (PROTOCOL), that improves experimental protocols via asynchronous, parallel Bayesian optimization. The algorithm achieves exponential convergence with respect to simple regret. We demonstrate PROTOCOL in both simulated and real-world cloud labs. In the simulated lab, it outperforms alternative approaches to Bayesian optimization in terms of its ability to find optimal configurations, and the number of experiments required to find the optimum. In the real-world lab, the algorithm makes progress toward the optimal setting. DATA AVAILABILITY AND IMPLEMENTATION: PROTOCOL is available as both a stand-alone Python library, and as part of an R Shiny application at https://github.com/clangmead/PROTOCOL. Data are available at the same repository. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Artificial Intelligence, Software, Algorithms, Bayes Theorem, Laboratories
7.
Algorithms Mol Biol ; 16(1): 13, 2021 Jul 01.
Article in English | MEDLINE | ID: mdl-34210336

ABSTRACT

BACKGROUND: Directed evolution (DE) is a technique for protein engineering that involves iterative rounds of mutagenesis and screening to search for sequences that optimize a given property, such as binding affinity to a specified target. Unfortunately, the underlying optimization problem is under-determined, and so mutations introduced to improve the specified property may come at the expense of unmeasured, but nevertheless important properties (e.g., solubility, thermostability). We address this issue by formulating DE as a regularized Bayesian optimization problem where the regularization term reflects evolutionary or structure-based constraints. RESULTS: We applied our approach to DE to three representative proteins, GB1, BRCA1, and SARS-CoV-2 Spike, and evaluated both evolutionary and structure-based regularization terms. The results of these experiments demonstrate that: (i) structure-based regularization usually leads to better designs (and never hurts), compared to the unregularized setting; (ii) evolutionary-based regularization tends to be least effective; and (iii) regularization leads to better designs because it effectively focuses the search in certain areas of sequence space, making better use of the experimental budget. Additionally, like previous work in machine learning-assisted DE, we find that our approach significantly reduces the experimental burden of DE, relative to model-free methods. CONCLUSION: Introducing regularization into a Bayesian ML-assisted DE framework alters the exploratory patterns of the underlying optimization routine, and can shift variant selections towards those with a range of targeted and desirable properties. In particular, we find that structure-based regularization often improves variant selection compared to unregularized approaches, and never hurts.

8.
J Comput Biol ; 22(6): 474-86, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25973864

ABSTRACT

In studying the strength and specificity of interaction between members of two protein families, key questions center on which pairs of possible partners actually interact, how well they interact, and why they interact while others do not. The advent of large-scale experimental studies of interactions between members of a target family and a diverse set of possible interaction partners offers the opportunity to address these questions. We develop here a method, DgSpi (data-driven graphical models of specificity in protein:protein interactions), for learning and using graphical models that explicitly represent the amino acid basis for interaction specificity (why) and extend earlier classification-oriented approaches (which) to predict the ΔG of binding (how well). We demonstrate the effectiveness of our approach in analyzing and predicting interactions between a set of 82 PDZ recognition modules against a panel of 217 possible peptide partners, based on data from MacBeath and colleagues. Our predicted ΔG values are highly predictive of the experimentally measured ones, reaching correlation coefficients of 0.69 in 10-fold cross-validation and 0.63 in leave-one-PDZ-out cross-validation. Furthermore, the model serves as a compact representation of amino acid constraints underlying the interactions, enabling protein-level ΔG predictions to be naturally understood in terms of residue-level constraints. Finally, the model DgSpi readily enables the design of new interacting partners, and we demonstrate that designed ligands are novel and diverse.


Subject(s)
Protein Binding/genetics, Proteins/genetics, Amino Acid Sequence, Amino Acids/genetics, Ligands, Molecular Models, Sensitivity and Specificity
9.
Res Comput Mol Biol ; 8394: 129-143, 2014.
Article in English | MEDLINE | ID: mdl-25414914

ABSTRACT

In studying the strength and specificity of interaction between members of two protein families, key questions center on which pairs of possible partners actually interact, how well they interact, and why they interact while others do not. The advent of large-scale experimental studies of interactions between members of a target family and a diverse set of possible interaction partners offers the opportunity to address these questions. We develop here a method, DgSpi (Data-driven Graphical models of Specificity in Protein:protein Interactions), for learning and using graphical models that explicitly represent the amino acid basis for interaction specificity (why) and extend earlier classification-oriented approaches (which) to predict the ΔG of binding (how well). We demonstrate the effectiveness of our approach in analyzing and predicting interactions between a set of 82 PDZ recognition modules, against a panel of 217 possible peptide partners, based on data from MacBeath and colleagues. Our predicted ΔG values are highly predictive of the experimentally measured ones, reaching correlation coefficients of 0.69 in 10-fold cross-validation and 0.63 in leave-one-PDZ-out cross-validation. Furthermore, the model serves as a compact representation of amino acid constraints underlying the interactions, enabling protein-level ΔG predictions to be naturally understood in terms of residue-level constraints. Finally, as a generative model, DgSpi readily enables the design of new interacting partners, and we demonstrate that designed ligands are novel and diverse.

10.
Adv Exp Med Biol ; 805: 87-105, 2014.
Article in English | MEDLINE | ID: mdl-24446358

ABSTRACT

Atomistic simulations of the conformational dynamics of proteins can be performed using either Molecular Dynamics or Monte Carlo procedures. The ensembles of three-dimensional structures produced during simulation can be analyzed in a number of ways to elucidate the thermodynamic and kinetic properties of the system. The goal of this chapter is to review both traditional and emerging methods for learning generative models from atomistic simulation data. Here, the term 'generative' refers to a model of the joint probability distribution over the behaviors of the constituent atoms. In the context of molecular modeling, generative models reveal the correlation structure between the atoms, and may be used to predict how the system will respond to structural perturbations. We begin by discussing traditional methods, which produce multivariate Gaussian models. We then discuss GAMELAN (GRAPHICAL MODELS OF ENERGY LANDSCAPES), which produces generative models of complex, non-Gaussian conformational dynamics (e.g., allostery, binding, folding, etc.) from long timescale simulation data.


Subject(s)
Statistical Models, Molecular Dynamics Simulation, Monte Carlo Method, Allosteric Regulation, Monoclonal Antibodies/chemistry, CD4 Antigens/chemistry, HIV Envelope Protein gp120/chemistry, HIV Fusion Inhibitors/chemistry, HIV-1/chemistry, Homeodomain Proteins/chemistry, Humans, Normal Distribution, Protein Binding, Protein Conformation, Protein Folding
11.
BMC Biophys ; 5: 13, 2012 Jun 29.
Article in English | MEDLINE | ID: mdl-22748306

ABSTRACT

BACKGROUND: G protein coupled receptors (GPCRs) are seven helical transmembrane proteins that function as signal transducers. They bind ligands in their extracellular and transmembrane regions and activate cognate G proteins at their intracellular surface at the other side of the membrane. The relay of allosteric communication between the ligand binding site and the distant G protein binding site is poorly understood. In this study, GREMLIN [1], a recently developed method that identifies networks of co-evolving residues from multiple sequence alignments, was used to identify those that may be involved in communicating the activation signal across the membrane. The GREMLIN-predicted long-range interactions between amino acids were analyzed with respect to the seven GPCR structures that had been crystallized at the time this study was undertaken. RESULTS: GREMLIN significantly enriches the edges containing residues that are part of the ligand binding pocket, when compared to a control distribution of edges drawn from a random graph. An analysis of these edges reveals a minimal GPCR binding pocket containing four residues (T118(3.33), M207(5.42), Y268(6.51) and A292(7.39)). Additionally, of the ten residues predicted to have the most long-range interactions (A117(3.32), A272(6.55), E113(3.28), H211(5.46), S186(EC2), A292(7.39), E122(3.37), G90(2.57), G114(3.29) and M207(5.42)), nine are part of the ligand binding pocket. CONCLUSIONS: We demonstrate the use of GREMLIN to reveal a network of statistically correlated and functionally important residues in class A GPCRs. GREMLIN identified that ligand binding pocket residues are extensively correlated with distal residues. An analysis of the GREMLIN edges across multiple structures suggests that there may be a minimal binding pocket common to the seven known GPCRs. Further, the activation of rhodopsin involves these long-range interactions between extracellular and intracellular domain residues mediated by the retinal domain.

12.
BMC Bioinformatics ; 13 Suppl 5: S8, 2012 Apr 12.
Article in English | MEDLINE | ID: mdl-22537012

ABSTRACT

Stochastic Differential Equations (SDE) are often used to model the stochastic dynamics of biological systems. Unfortunately, rare but biologically interesting behaviors (e.g., oncogenesis) can be difficult to observe in stochastic models. Consequently, the analysis of behaviors of SDE models using numerical simulations can be challenging. We introduce a method for solving the following problem: given an SDE model and a high-level behavioral specification about the dynamics of the model, algorithmically decide whether the model satisfies the specification. While there are a number of techniques for addressing this problem for discrete-state stochastic models, the analysis of SDE and other continuous-state models has received less attention. Our proposed solution uses a combination of Bayesian sequential hypothesis testing, non-identically distributed samples, and Girsanov's theorem for change of measures to examine rare behaviors. We use our algorithm to analyze two SDE models of tumor dynamics. Our use of non-identically distributed samples contributes to the state of the art in statistical verification and model checking of stochastic models by providing an effective means for exposing rare events in SDEs, while retaining the ability to compute bounds on the probability that those events occur.


Subject(s)
Algorithms, Neoplastic Cell Transformation, Biological Models, Stochastic Processes, Bayes Theorem, Humans, Probability
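To make the difficulty described in the abstract above concrete, here is a minimal sketch of plain Monte Carlo estimation for an SDE, using the standard Euler-Maruyama scheme. It is not the authors' algorithm: it uses identically distributed samples and no change of measure. The growth model, its coefficients, and the names `euler_maruyama` and `estimate_hit_probability` are assumptions chosen for illustration only.

```python
import math
import random


def euler_maruyama(x0, drift, diffusion, dt, n_steps, rng):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW; return all states."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path


def estimate_hit_probability(threshold, n_paths, seed=0):
    """Fraction of simulated paths whose maximum reaches the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        path = euler_maruyama(
            x0=1.0,
            drift=lambda x: 0.1 * x,      # hypothetical growth rate
            diffusion=lambda x: 0.3 * x,  # hypothetical noise level
            dt=0.01,
            n_steps=200,
            rng=rng,
        )
        if max(path) >= threshold:
            hits += 1
    return hits / n_paths


if __name__ == "__main__":
    print(estimate_hit_probability(threshold=2.0, n_paths=1000))
```

When the threshold is raised so that hits become rare, most simulated paths contribute nothing to the estimate, which is the situation that motivates biased sampling schemes such as the one the abstract describes.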
13.
Proteins ; 79(4): 1061-78, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21268112

ABSTRACT

We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method out-performs an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly out-perform Hidden Markov Models in terms of predictive accuracy.


Subject(s)
Chemical Models, Protein Folding, Proteins/chemistry, Sequence Alignment/methods, Amino Acid Sequence, Area Under Curve, Computational Biology, Computer Graphics, Computer Simulation, Markov Chains, Molecular Models, Statistical Models, PDZ Domains, Protein Sequence Analysis, Structure-Activity Relationship
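As an illustration of the kind of model the abstract above describes, the sketch below scores a sequence under a generic pairwise undirected graphical model: one term per position (conservation) plus one term per coupled pair of positions (correlated mutations). It shows only the general form of such a model. The function name `log_potential`, the parameter layout, and all weights are hypothetical, and no learning step is shown.

```python
def log_potential(seq, site_weights, pair_weights):
    """Unnormalized log-probability of seq under a pairwise undirected model.

    site_weights[i][a]           : weight for amino acid a at position i
    pair_weights[(i, j)][(a, b)] : weight for a at i together with b at j
    Missing entries count as zero.
    """
    total = 0.0
    for i, a in enumerate(seq):
        total += site_weights[i].get(a, 0.0)
    for (i, j), table in pair_weights.items():
        total += table.get((seq[i], seq[j]), 0.0)
    return total


# Hypothetical parameters for a length-3 toy alignment.
site = [{"A": 1.0, "G": 0.2}, {"K": 0.5, "E": 0.5}, {"D": 0.3, "R": 0.3}]
# One coupling between positions 1 and 2 that favors oppositely charged pairs.
pair = {(1, 2): {("K", "D"): 0.8, ("E", "R"): 0.8}}

if __name__ == "__main__":
    print(log_potential("AKD", site, pair))  # coupled pair present
    print(log_potential("AKR", site, pair))  # coupled pair absent
```

The two toy sequences have identical per-position terms, so any difference in their scores comes from the pairwise term alone; a model without couplings could not tell them apart.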
14.
Proteins ; 79(2): 444-62, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21120864

ABSTRACT

Protein-protein interactions are governed by the change in free energy upon binding, ΔG = ΔH - TΔS. These interactions are often marginally stable, so one must examine the balance between the change in enthalpy, ΔH, and the change in entropy, ΔS, when investigating known complexes, characterizing the effects of mutations, or designing optimized variants. To perform a large-scale study into the contribution of conformational entropy to binding free energy, we developed a technique called GOBLIN (Graphical mOdel for BiomoLecular INteractions) that performs physics-based free energy calculations for protein-protein complexes under both side-chain and backbone flexibility. GOBLIN uses a probabilistic graphical model that exploits conditional independencies in the Boltzmann distribution and employs variational inference techniques that approximate the free energy of binding in only a few minutes. We examined the role of conformational entropy on a benchmark set of more than 700 mutants in eight large, well-studied complexes. Our findings suggest that conformational entropy is important in protein-protein interactions: the root mean square error (RMSE) between calculated and experimentally measured ΔΔGs decreases by 12% when explicit entropic contributions are incorporated. GOBLIN models all atoms of the protein complex and detects changes to the binding entropy along the interface as well as positions distal to the binding interface. Our results also suggest that a variational approach to entropy calculations may be quantitatively more accurate than the knowledge-based approaches used by the well-known programs FOLDX and Rosetta: GOBLIN's RMSEs are 10% and 36% lower than those of these programs, respectively.


Subject(s)
Proteins/chemistry, Algorithms, Amino Acids/chemistry, Animals, Computer Simulation, Entropy, Humans, Markov Chains, Molecular Models, Molecular Dynamics Simulation, Mutation, Protein Binding, Protein Interaction Domains and Motifs, Quaternary Protein Structure, Proteins/genetics, Software
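The quantities named in the abstract above (ΔG = ΔH - TΔS, ΔΔG, and RMSE) can be written out as a short sketch. This is only the arithmetic, not the GOBLIN method. The function names and the default temperature are assumptions made for illustration.

```python
import math


def binding_free_energy(delta_h, delta_s, temperature=298.15):
    """Return dG = dH - T*dS. Units must agree, e.g. kcal/mol, kcal/(mol*K), K."""
    return delta_h - temperature * delta_s


def ddg(dg_mutant, dg_wild_type):
    """Change in binding free energy caused by a mutation."""
    return dg_mutant - dg_wild_type


def rmse(predicted, measured):
    """Root mean square error between two equal-length sequences of values."""
    if len(predicted) != len(measured):
        raise ValueError("length mismatch")
    squared = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    return math.sqrt(squared / len(predicted))


if __name__ == "__main__":
    # Illustrative numbers only: a favorable enthalpy partly offset by an entropy loss.
    print(binding_free_energy(delta_h=-20.0, delta_s=-0.03, temperature=300.0))
```

With these illustrative inputs the entropic term TΔS offsets almost half of the enthalpic gain, which is the sense in which such interactions are described as marginally stable.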
15.
J Bioinform Comput Biol ; 7(2): 323-38, 2009 Apr.
Article in English | MEDLINE | ID: mdl-19340918

ABSTRACT

We present an exact algorithm, based on techniques from the field of Model Checking, for finding control policies for Boolean Networks (BN) with control nodes. Given a BN, a set of starting states, I, a set of goal states, F, and a target time, t, our algorithm automatically finds a sequence of control signals that deterministically drives the BN from I to F at or before time t, or else guarantees that no such policy exists. Despite recent hardness results for finding control policies for BNs, we show that, in practice, our algorithm runs in seconds to minutes on over 13,400 BNs of varying sizes and topologies, including a BN model of embryogenesis in Drosophila melanogaster with 15,360 Boolean variables. We then extend our method to automatically identify a set of Boolean transfer functions that reproduce the qualitative behavior of gene regulatory networks. Specifically, we automatically learn a BN model of D. melanogaster embryogenesis in 5.3 seconds, from a space containing 6.9 × 10(10) possible models.


Subject(s)
Drosophila Proteins/metabolism, Drosophila melanogaster/embryology, Drosophila melanogaster/physiology, Embryonic Development/physiology, Biological Models, Protein Interaction Mapping/methods, Signal Transduction/physiology, Animals, Feedback/physiology
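The control problem stated in the abstract above can be illustrated on a toy network with an explicit breadth-first search. This is a brute-force sketch, not the model-checking algorithm of the paper, and it would not scale to networks of the size reported there. It also simplifies the problem by requiring all starting states to be inside the goal set at the same step. The names `find_control_sequence` and `toy_step`, and the two-gene network itself, are hypothetical.

```python
from collections import deque
from itertools import product


def find_control_sequence(step, n_controls, initial, goal, t_max):
    """Breadth-first search for control inputs that move every initial state into goal.

    step(state, control) -> next state, with states and controls as tuples of 0/1.
    Returns the shortest list of control tuples of length <= t_max, or None.
    """
    controls = list(product((0, 1), repeat=n_controls))
    start = frozenset(initial)
    goal = frozenset(goal)
    if start <= goal:
        return []
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        states, plan = queue.popleft()
        if len(plan) == t_max:
            continue  # time budget exhausted on this branch
        for u in controls:
            nxt = frozenset(step(s, u) for s in states)
            if nxt <= goal:
                return plan + [u]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [u]))
    return None


def toy_step(state, control):
    """Hypothetical 2-gene network: a copies the control; b becomes (a AND NOT b)."""
    a, b = state
    (u,) = control
    return (u, int(a and not b))


if __name__ == "__main__":
    print(find_control_sequence(toy_step, 1, [(0, 0)], [(1, 1)], t_max=2))
```

Because the search is exhaustive over reachable state sets within the time budget, a `None` result means no control sequence of length at most `t_max` exists for this simplified formulation.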
16.
J Comput Biol ; 11(2-3): 277-98, 2004.
Article in English | MEDLINE | ID: mdl-15285893

ABSTRACT

High-throughput NMR structural biology can play an important role in structural genomics. We report an automated procedure for high-throughput NMR resonance assignment for a protein of known structure, or of a homologous structure. These assignments are a prerequisite for probing protein-protein interactions, protein-ligand binding, and dynamics by NMR. Assignments are also the starting point for structure determination and refinement. A new algorithm, called Nuclear Vector Replacement (NVR), is introduced to compute assignments that optimally correlate experimentally measured NH residual dipolar couplings (RDCs) to a given a priori whole-protein 3D structural model. The algorithm requires only uniform (15)N-labeling of the protein and processes unassigned H(N)-(15)N HSQC spectra, H(N)-(15)N RDCs, and sparse H(N)-H(N) NOEs (d(NN)s), all of which can be acquired in a fraction of the time needed to record the traditional suite of experiments used to perform resonance assignments. NVR runs in minutes and efficiently assigns the (H(N),(15)N) backbone resonances as well as the d(NN)s of the 3D (15)N-NOESY spectrum, in O(n(3)) time. The algorithm is demonstrated on NMR data from a 76-residue protein, human ubiquitin, matched to four structures, including one mutant (homolog), determined either by X-ray crystallography or by different NMR experiments (without RDCs). NVR achieves an assignment accuracy of 92-100%. We further demonstrate the feasibility of our algorithm for different and larger proteins, using NMR data for hen lysozyme (129 residues, 97-100% accuracy) and streptococcal protein G (56 residues, 100% accuracy), matched to a variety of 3D structural models. Finally, we extend NVR to a second application, 3D structural homology detection, and demonstrate that NVR is able to identify structural homologies between proteins with remote amino acid sequences using a database of structural models.


Subject(s)
Algorithms, Computational Biology, Magnetic Resonance Spectroscopy/statistics & numerical data, Tertiary Protein Structure
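The abstract above relies on comparing measured RDCs with values computed from a structural model. A common textbook form of that relationship is D = D_max · vᵀ S v, where v is the unit bond vector and S is a symmetric, traceless alignment (Saupe) matrix. The sketch below assumes that form. The nearest-value matching in `closest_bond` is a deliberately naive stand-in and is not the NVR assignment algorithm; all names and tensor values are hypothetical.

```python
import math


def unit(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def back_computed_rdc(bond_vector, saupe, d_max=1.0):
    """Return d_max * v^T S v for the unit vector v along bond_vector."""
    v = unit(bond_vector)
    return d_max * sum(v[i] * saupe[i][j] * v[j] for i in range(3) for j in range(3))


def closest_bond(observed_rdc, bond_vectors, saupe, d_max=1.0):
    """Index of the bond vector whose back-computed RDC is nearest the observed value."""
    errors = [abs(back_computed_rdc(b, saupe, d_max) - observed_rdc) for b in bond_vectors]
    return errors.index(min(errors))


# Hypothetical axially symmetric, traceless alignment tensor.
S = [[0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0],
     [0.0, 0.0, -1.0]]

if __name__ == "__main__":
    bonds = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
    print(closest_bond(-0.9, bonds, S))
```

Matching each peak independently, as done here, ignores the one-to-one constraint between peaks and residues that an actual assignment procedure has to respect.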
17.
J Biomol NMR ; 29(2): 111-38, 2004 Jun.
Article in English | MEDLINE | ID: mdl-15014227

ABSTRACT

We report an automated procedure for high-throughput NMR resonance assignment for a protein of known structure, or of a homologous structure. Our algorithm performs Nuclear Vector Replacement (NVR) by Expectation/Maximization (EM) to compute assignments. NVR correlates experimentally measured NH residual dipolar couplings (RDCs) and chemical shifts to a given a priori whole-protein 3D structural model. The algorithm requires only uniform (15)N-labelling of the protein, and processes unassigned H(N)-(15)N HSQC spectra, H(N)-(15)N RDCs, and sparse H(N)-H(N) NOEs (d(NN)s). NVR runs in minutes and efficiently assigns the (H(N),(15)N) backbone resonances as well as the sparse d(NN)s from the 3D (15)N-NOESY spectrum, in O(n(3)) time. The algorithm is demonstrated on NMR data from a 76-residue protein, human ubiquitin, matched to four structures, including one mutant (homolog), determined either by X-ray crystallography or by different NMR experiments (without RDCs). NVR achieves an average assignment accuracy of over 99%. We further demonstrate the feasibility of our algorithm for different and larger proteins, using different combinations of real and simulated NMR data for hen lysozyme (129 residues) and streptococcal protein G (56 residues), matched to a variety of 3D structural models.


Subject(s)
Algorithms, Computer Simulation, Magnetic Resonance Spectroscopy/methods, Animals, Carbon Isotopes/chemistry, X-Ray Crystallography, Humans, Molecular Structure, Muramidase/chemistry, Nitrogen Isotopes/chemistry, Ubiquitin/chemistry
18.
Article in English | MEDLINE | ID: mdl-16448021

ABSTRACT

One goal of the structural genomics initiative is the identification of new protein folds. Sequence-based structural homology prediction methods are an important means for prioritizing unknown proteins for structure determination. However, an important challenge remains: two highly dissimilar sequences can have similar folds; how can we detect this rapidly, in the context of structural genomics? High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this challenge. We report an automated procedure, called HD, for detecting 3D structural homologies from sparse, unassigned protein NMR data. Our method identifies 3D models in a protein structural database whose geometries best fit the unassigned experimental NMR data. HD does not use, and is thus not limited by, sequence homology. The method can also be used to confirm or refute structural predictions made by other techniques such as protein threading or homology modelling. The algorithm runs in O(pn + pn(5/2) log(cn) + p log p) time, where p is the number of proteins in the database, n is the number of residues in the target protein and c is the maximum edge weight in an integer-weighted bipartite graph. Our experiments on real NMR data from 3 different proteins against a database of 4,500 representative folds demonstrate that the method identifies closely related protein folds, including sub-domains of larger proteins, with as little as 10-30% sequence homology between the target protein (or sub-domain) and the computed model. In particular, we report no false negatives or false positives despite significant percentages of missing experimental data.


Subject(s)
Crystallography, X-Ray/methods , Models, Chemical , Models, Molecular , Peptide Mapping/methods , Proteins/chemistry , Proteins/ultrastructure , Sequence Analysis, Protein/methods , Algorithms , Amino Acid Sequence , Artificial Intelligence , Computer Simulation , Imaging, Three-Dimensional/methods , Molecular Sequence Data , Protein Conformation , Proteins/analysis , Sequence Homology, Amino Acid
19.
J Comput Biol ; 10(3-4): 521-36, 2003.
Article in English | MEDLINE | ID: mdl-12935342

ABSTRACT

We introduce a model-based analysis technique for extracting and characterizing rhythmic expression profiles from genome-wide DNA microarray hybridization data. These patterns are clues to discovering rhythmic genes implicated in cell-cycle, circadian, or other biological processes. The algorithm, implemented in a program called RAGE (Rhythmic Analysis of Gene Expression), decouples the problems of estimating a pattern's wavelength and phase. Our algorithm is linear-time in frequency and phase resolution, an improvement over previous quadratic-time approaches. Unlike previous approaches, RAGE uses a true distance metric for measuring expression profile similarity, based on the Hausdorff distance. This results in better clustering of expression profiles for rhythmic analysis. The confidence of each frequency estimate is computed using Z-scores. We demonstrate that RAGE is superior to other techniques on synthetic and actual DNA microarray hybridization data. We also show how to replace the discretized phase search in our method with an exact (combinatorially precise) phase search, resulting in a faster algorithm with no complexity dependence on phase resolution.
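
The abstract states that RAGE measures profile similarity with a true distance metric based on the Hausdorff distance, but it does not say how profiles are represented. A minimal sketch of the standard Hausdorff distance between two finite point sets follows. Treating each profile as a set of (time, expression level) points is an assumption made here for illustration, and the function names and sample values are hypothetical.

```python
from math import dist


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical profiles sampled at four time points (not data from the paper).
profile_a = [(0, 0.0), (1, 1.0), (2, 0.0), (3, -1.0)]
profile_b = [(0, 0.1), (1, 0.9), (2, 0.0), (3, -1.0)]  # slightly perturbed copy of a
profile_c = [(0, 1.0), (1, 0.0), (2, -1.0), (3, 0.0)]  # phase-shifted version of a

print(hausdorff(profile_a, profile_b))
print(hausdorff(profile_a, profile_c))
```

On finite point sets this quantity is symmetric, is zero only for identical sets, and obeys the triangle inequality, which is what makes it a metric in the mathematical sense.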


Subject(s)
Computational Biology/methods , Data Interpretation, Statistical , Gene Expression Profiling/methods , Algorithms , Animals , CDC28 Protein Kinase, S cerevisiae/genetics , Cell Cycle Proteins/genetics , Circadian Rhythm/genetics , Drosophila/genetics , GTP-Binding Proteins/genetics , Genome , Humans , Oligonucleotide Array Sequence Analysis/statistics & numerical data
20.
Article in English | MEDLINE | ID: mdl-16452795

ABSTRACT

Recognition of a protein's fold provides valuable information about its function. While many sequence-based homology prediction methods exist, an important challenge remains: two highly dissimilar sequences can have similar folds. How can we detect this rapidly, in the context of structural genomics? High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this challenge. We report an automated procedure for detecting 3D structural homologies from sparse, unassigned protein NMR data. Our method identifies the 3D structural models in a protein structural database whose geometries best fit the unassigned experimental NMR data. It does not use sequence information and is thus not limited by sequence homology. The method can also be used to confirm or refute structural predictions made by other techniques such as protein threading or sequence homology. The algorithm runs in O(pnk^3) time, where p is the number of proteins in the database, n is the number of residues in the target protein, and k is the resolution of a rotation search. The method requires only uniform 15N-labelling of the protein and processes unassigned HN-15N residual dipolar couplings, which can be acquired in a couple of hours. Our experiments on NMR data from 5 different proteins demonstrate that the method identifies closely related protein folds, despite low sequence homology between the target protein and the computed model.


Subject(s)
Algorithms , Artificial Intelligence , Magnetic Resonance Spectroscopy/methods , Peptide Mapping/methods , Proteins/analysis , Proteins/chemistry , Sequence Analysis, Protein/methods , Binding Sites , Databases, Protein , Pattern Recognition, Automated/methods , Protein Binding , Protein Conformation , Sequence Homology, Amino Acid