Results 1 - 20 of 52
1.
PLoS Comput Biol ; 20(2): e1011812, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38377054

ABSTRACT

The design of proteins with specific tasks is a major challenge in molecular biology with important diagnostic and therapeutic applications. High-throughput screening methods have been developed to systematically evaluate protein activity, but only a small fraction of possible protein variants can be tested using these techniques. Computational models that explore the sequence space in-silico to identify the fittest molecules for a given function are needed to overcome this limitation. In this article, we propose AnnealDCA, a machine-learning framework to learn the protein fitness landscape from sequencing data derived from a broad range of experiments that use selection and sequencing to quantify protein activity. We demonstrate the effectiveness of our method by applying it to antibody Rep-Seq data of immunized mice and screening experiments, assessing the quality of the fitness landscape reconstructions. Our method can be applied to several experimental cases where a population of protein variants undergoes various rounds of selection and sequencing, without relying on the computation of variants enrichment ratios, and thus can be used even in cases of disjoint sequence samples.


Subject(s)
Genetic Fitness, Machine Learning, Animals, Mice, Mutation, Genetic Fitness/genetics
2.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37647658

ABSTRACT

SUMMARY: DCAlign is a new alignment method able to cope with the conservation and the co-evolution signals that characterize the columns of multiple sequence alignments of homologous sequences. However, the pre-processing steps required to align a candidate sequence are computationally demanding. We show in v1.0 how to dramatically reduce the overall computing time by including an empirical prior over an informative set of variables mirroring the presence of insertions and deletions. AVAILABILITY AND IMPLEMENTATION: DCAlign v1.0 is implemented in Julia and it is fully available at https://github.com/infernet-h2020/DCAlign.


Subject(s)
Sequence Alignment, Computational Biology
3.
Bioinformatics ; 39(7)2023 07 01.
Article in English | MEDLINE | ID: mdl-37399105

ABSTRACT

MOTIVATION: Being able to artificially design novel proteins of desired function is pivotal in many biological and biomedical applications. Generative statistical modeling has recently emerged as a new paradigm for designing amino acid sequences, including in particular models and embedding methods borrowed from natural language processing (NLP). However, most approaches target single proteins or protein domains, and do not take into account any functional specificity or interaction with the context. To extend beyond current computational strategies, we develop a method for generating protein domain sequences intended to interact with another protein domain. Using data from natural multidomain proteins, we cast the problem as a translation problem from a given interactor domain to the new domain to be generated, i.e. we generate artificial partner sequences conditional on an input sequence. We also show in an example that the same procedure can be applied to interactions between distinct proteins. RESULTS: Evaluating our model's quality using diverse metrics, in part related to distinct biological questions, we show that our method outperforms state-of-the-art shallow autoregressive strategies. We also explore the possibility of fine-tuning pretrained large language models for the same task and of using Alphafold 2 for assessing the quality of sampled sequences. AVAILABILITY AND IMPLEMENTATION: Data and code on https://github.com/barthelemymp/Domain2DomainProteinTranslation.


Subject(s)
Language, Proteins, Amino Acid Sequence, Proteins/chemistry, Protein Domains
4.
Phys Rev E ; 107(4-1): 044125, 2023 Apr.
Article in English | MEDLINE | ID: mdl-37198812

ABSTRACT

The alignment of biological sequences such as DNA, RNA, and proteins is one of the basic tools that make it possible to detect evolutionary patterns, as well as functional or structural characterizations between homologous sequences in different organisms. Typically, state-of-the-art bioinformatics tools are based on profile models that assume the statistical independence of the different sites of the sequences. Over the last years, it has become increasingly clear that homologous sequences show complex patterns of long-range correlations over the primary sequence as a consequence of the natural evolution process that selects genetic variants under the constraint of preserving the functional or structural determinants of the sequence. Here, we present an alignment algorithm based on message passing techniques that overcomes the limitations of profile models. Our method is based on a perturbative small-coupling expansion of the free energy of the model that assumes a linear chain approximation as the zeroth-order of the expansion. We test the potential of the algorithm against standard competing strategies on several biological sequences.


Subject(s)
Algorithms, Software, Sequence Alignment, Computational Biology/methods, Proteins/chemistry
5.
Bioinformatics ; 38(10): 2734-2741, 2022 05 13.
Article in English | MEDLINE | ID: mdl-35561171

ABSTRACT

SUMMARY: Topology determination is one of the most important intermediate steps toward building the atomic structure of proteins from their medium-resolution cryo-electron microscopy (cryo-EM) map. The main goal in topology determination is to identify correct matches (i.e. assignment and direction) between secondary structure elements (SSEs) (α-helices and β-sheets) detected in a protein sequence and in the cryo-EM density map. Despite many recent advances in molecular biology technologies, the problem remains challenging. To overcome it, this article proposes a linear programming-based topology determination (LPTD) method to solve the secondary structure topology problem in three-dimensional geometrical space. Through modeling of the protein's sequence with the aid of extracting highly reliable features and a distance-based scoring function, the secondary structure matching problem is transformed into a complete weighted bipartite graph matching problem. Subsequently, an algorithm based on linear programming is developed as a decision-making strategy to extract the true topology (native topology) from among all possible topologies. The proposed automatic framework is verified using 12 experimental and 15 simulated α-β proteins. Results demonstrate that LPTD is highly efficient and extremely fast: for 77% of cases in the dataset, the native topology is detected as the first-ranked topology in <2 s. In addition, the method is able to successfully handle large complex proteins with as many as 65 SSEs. Such a large number of SSEs has never been solved with current tools/methods. AVAILABILITY AND IMPLEMENTATION: The LPTD package (source code and data) is publicly available at https://github.com/B-Behkamal/LPTD. Moreover, two test samples as well as instructions for using the graphical user interface are provided in the shared readme file. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Linear Programming, Proteins, Cryoelectron Microscopy/methods, Molecular Models, Protein Conformation, Protein Secondary Structure, Proteins/chemistry
7.
Biophys J ; 121(10): 1919-1930, 2022 05 17.
Article in English | MEDLINE | ID: mdl-35422414

ABSTRACT

Despite major environmental and genetic differences, microbial metabolic networks are known to generate consistent physiological outcomes across vastly different organisms. This remarkable robustness suggests that, at least in bacteria, metabolic activity may be guided by universal principles. The constrained optimization of evolutionarily motivated objective functions, such as the growth rate, has emerged as the key theoretical assumption for the study of bacterial metabolism. While conceptually and practically useful in many situations, the idea that certain functions are optimized is hard to validate in data. Moreover, it is not always clear how optimality can be reconciled with the high degree of single-cell variability observed in experiments within microbial populations. To shed light on these issues, we develop an inverse modeling framework that connects the fitness of a population of cells (represented by the mean single-cell growth rate) to the underlying metabolic variability through the maximum entropy inference of the distribution of metabolic phenotypes from data. While no clear objective function emerges, we find that, as the medium gets richer, the fitness and inferred variability for Escherichia coli populations follow and slowly approach the theoretically optimal bound defined by minimal reduction of variability at given fitness. These results suggest that bacterial metabolism may be crucially shaped by a population-level trade-off between growth and heterogeneity.


Subject(s)
Escherichia coli, Metabolic Networks and Pathways, Bacteria/metabolism, Entropy, Escherichia coli/metabolism, Phenotype
8.
Biomolecules ; 11(12)2021 11 26.
Article in English | MEDLINE | ID: mdl-34944417

ABSTRACT

Cryo-electron microscopy (cryo-EM) is a structural technique that has played a significant role in protein structure determination in recent years. Compared to the traditional methods of X-ray crystallography and NMR spectroscopy, cryo-EM is capable of producing images of much larger protein complexes. However, cryo-EM reconstructions are limited to medium resolution (~4-10 Å) in some cases. In this resolution range, a cryo-EM density map can hardly be used to directly determine the structure of proteins at atomic-level resolution, or even at the level of their amino acid residue backbones. At such a resolution, only the position and orientation of secondary structure elements (SSEs) such as α-helices and β-sheets are observable. Consequently, finding the mapping of the secondary structures of the modeled structure (SSEs-A) to the cryo-EM map (SSEs-C) is one of the primary concerns in cryo-EM modeling. To address this issue, this study proposes a novel automatic computational method to identify SSEs correspondence in three-dimensional (3D) space. Initially, through a modeling of the target sequence with the aid of extracting highly reliable features from a generated 3D model and map, the SSEs matching problem is formulated as a 3D vector matching problem. Afterward, the 3D vector matching problem is transformed into a 3D graph matching problem. Finally, a similarity-based voting algorithm combined with the principle of least conflict (PLC) concept is developed to obtain the SSEs correspondence. To evaluate the accuracy of the method, a testing set of 25 experimental and simulated maps with a maximum of 65 SSEs is selected. Comparative studies are also conducted to demonstrate the superiority of the proposed method over some state-of-the-art techniques. The results demonstrate that the method is efficient, robust, and works well in the presence of errors in the predicted secondary structures of the cryo-EM images.


Subject(s)
Computational Biology/methods, Proteins/chemistry, Cryoelectron Microscopy, X-Ray Crystallography, Molecular Models, Protein Secondary Structure, Support Vector Machine
9.
BMC Bioinformatics ; 22(1): 528, 2021 Oct 29.
Article in English | MEDLINE | ID: mdl-34715775

ABSTRACT

BACKGROUND: Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionary-related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has been also assessed in terms of their ability in predicting mutational effects and generating in silico functional sequences. RESULTS: Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and accomplishes several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA . As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and TPP-riboswitch RNA domain. CONCLUSIONS: The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.


Subject(s)
Machine Learning, Proteins, Humans, Proteins/genetics, RNA
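As an illustration of the parametrization described in the abstract of entry 9 (local biases for conservation, pairwise terms for coevolution), the following minimal sketch evaluates the energy of a sequence under a pairwise Potts-type model. The function name, the layout of `h` and `J`, the sign convention, and the toy parameter values are all assumptions made for illustration; this is not the adabmDCA code or its interface.

```python
from itertools import combinations

def potts_energy(seq, h, J):
    """Energy E = -sum_i h[i][a_i] - sum_{i<j} J[(i, j)][(a_i, a_j)].

    h: list of dicts, h[i][a] is the local bias for symbol a at site i.
    J: dict keyed by site pairs (i, j) with i < j; each value maps a
       symbol pair (a, b) to a coupling. Missing entries count as zero.
    """
    energy = 0.0
    # Single-site terms: one bias per position.
    for i, a in enumerate(seq):
        energy -= h[i].get(a, 0.0)
    # Pairwise terms: one coupling per pair of positions.
    for i, j in combinations(range(len(seq)), 2):
        energy -= J.get((i, j), {}).get((seq[i], seq[j]), 0.0)
    return energy

# Toy three-site model over a two-letter alphabet (values are made up).
h = [{"A": 0.5, "C": -0.5}, {"A": 0.0, "C": 0.2}, {"A": -0.1, "C": 0.1}]
J = {(0, 2): {("A", "C"): 1.0, ("C", "A"): 1.0}}

e_coupled = potts_energy("AAC", h, J)    # sites 0 and 2 hit a coupling
e_uncoupled = potts_energy("AAA", h, J)  # no coupling contributes
```

With the Boltzmann weight taken as proportional to exp(-E), the sequence with the lower energy is the more probable one under this toy model.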
10.
Nat Commun ; 12(1): 5800, 2021 10 04.
Article in English | MEDLINE | ID: mdl-34608136

ABSTRACT

Generative models emerge as promising candidates for novel sequence-data driven approaches to protein design, and for the extraction of structural and functional information about proteins deeply hidden in rapidly growing sequence databases. Here we propose simple autoregressive models as highly accurate but computationally efficient generative sequence models. We show that they perform similarly to existing approaches based on Boltzmann machines or deep generative models, but at a substantially lower computational cost (by a factor between 10^2 and 10^3). Furthermore, the simple structure of our models has distinctive mathematical advantages, which translate into an improved applicability in sequence generation and evaluation. Within these models, we can easily estimate both the probability of a given sequence, and, using the model's entropy, the size of the functional sequence space related to a specific protein family. In the example of response regulators, we find a huge number of ca. 10^68 possible sequences, which nevertheless constitute only the astronomically small fraction 10^-80 of all amino-acid sequences of the same length. These findings illustrate the potential and the difficulty in exploring sequence space via generative sequence models.


Subject(s)
Statistical Models, Proteins/chemistry, Amino Acid Sequence, Computational Biology, Protein Databases, Genetic Epistasis, Molecular Evolution, Machine Learning, Mutation, Proteins/classification, Proteins/genetics, Sequence Alignment
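The two figures quoted in entry 10 (about 10 to the power 68 functional sequences, making up a fraction of 10 to the power -80 of all sequences of the same length) imply a total sequence space of 10 to the power 148. The sketch below turns that into an implied sequence length for an assumed alphabet size. The alphabet sizes (21 symbols, i.e. 20 amino acids plus an alignment gap, or 20 without it) are assumptions and are not stated in the abstract; the function name is hypothetical.

```python
import math

def implied_length(log10_functional, log10_fraction, q):
    """Length L such that q**L equals the total number of sequences.

    fraction = functional / total, so
    log10(total) = log10(functional) - log10(fraction)
    and L = log10(total) / log10(q).
    """
    log10_total = log10_functional - log10_fraction
    return log10_total / math.log10(q)

# Figures from the abstract: ~10^68 functional sequences, fraction 10^-80.
log10_total = 68 - (-80)  # orders of magnitude of the full sequence space

length_q21 = implied_length(68, -80, q=21)  # assumed: alphabet with a gap symbol
length_q20 = implied_length(68, -80, q=20)  # assumed: alphabet without gaps
```

Under either assumption the implied length comes out a little above 110 positions, which gives a sense of the domain size behind the quoted numbers.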
11.
Int J Mol Sci ; 22(20)2021 Oct 09.
Article in English | MEDLINE | ID: mdl-34681569

ABSTRACT

We present Annealed Mutational approximated Landscape (AMaLa), a new method to infer fitness landscapes from sequencing data of Directed Evolution experiments. Such experiments typically start from a single wild-type sequence, which undergoes Darwinian in vitro evolution via multiple rounds of mutation and selection for a target phenotype. In recent years, Directed Evolution has emerged as a powerful instrument to probe fitness landscapes under controlled experimental conditions and, thanks to high-throughput screening and sequencing, as a relevant testing ground to develop accurate statistical models and inference algorithms. Fitness landscape modeling either uses the enrichment of variant abundances as input, thus requiring the observation of the same variants at different rounds, or assumes that the last sequenced round is sampled from an equilibrium distribution. AMaLa aims at effectively leveraging the information encoded in the whole time evolution. To do so, while assuming statistical sampling independence between sequenced rounds, the possible trajectories in sequence space are gauged with a time-dependent statistical weight consisting of two contributions: (i) an energy term accounting for the selection process and (ii) a generalized Jukes-Cantor model for the purely mutational step. This simple scheme accurately describes the Directed Evolution dynamics and infers a fitness landscape that correctly reproduces the measurements of the phenotype under selection (e.g., antibiotic drug resistance), notably outperforming widely used inference strategies. In addition, we assess the reliability of AMaLa by showing how the inferred statistical model could be used to predict relevant structural properties of the wild-type sequence.


Subject(s)
Computational Biology/methods, Directed Molecular Evolution/methods, Mutation, Algorithms, Molecular Evolution, Genetic Fitness, High-Throughput Nucleotide Sequencing, Genetic Models, DNA Sequence Analysis
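For the purely mutational step mentioned in entry 11, the textbook Jukes-Cantor model extended to an alphabet of q symbols has closed-form transition probabilities. The sketch below implements that textbook form only. The parametrization (`mu` as the total substitution rate per site, `t` as elapsed time), the default `q`, and the function name are assumptions, and the exact generalization used in AMaLa may differ.

```python
import math

def jukes_cantor_probs(mu, t, q=20):
    """Return (p_same, p_other) for a q-state Jukes-Cantor process.

    mu is the total rate of leaving the current symbol, shared equally
    among the q - 1 alternatives. p_same is the probability of observing
    the starting symbol after time t; p_other is the probability of
    observing one specific different symbol.
    """
    decay = math.exp(-q * mu * t / (q - 1))
    p_same = 1.0 / q + (q - 1) / q * decay
    p_other = 1.0 / q - decay / q
    return p_same, p_other

p_same, p_other = jukes_cantor_probs(mu=0.1, t=5.0, q=20)
# Summing over the starting symbol and the 19 alternatives gives 1.
total = p_same + (20 - 1) * p_other
```

At `t = 0` the starting symbol is kept with probability 1, and for long times every symbol approaches probability 1/q, which is the uniform stationary distribution of this model.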
12.
Phys Rev E ; 103(4-1): 043301, 2021 Apr.
Article in English | MEDLINE | ID: mdl-34005851

ABSTRACT

Efficient feature selection from high-dimensional datasets is a very important challenge in many data-driven fields of science and engineering. We introduce a statistical mechanics inspired strategy that addresses the problem of sparse feature selection in the context of binary classification by leveraging a computational scheme known as expectation propagation (EP). The algorithm is used in order to train a continuous-weights perceptron learning a classification rule from a set of (possibly partly mislabeled) examples provided by a teacher perceptron with diluted continuous weights. We test the method in the Bayes optimal setting under a variety of conditions and compare it to other state-of-the-art algorithms based on message passing and on expectation maximization approximate inference schemes. Overall, our simulations show that EP is a robust and competitive algorithm in terms of variable selection properties, estimation accuracy, and computational complexity, especially when the student perceptron is trained from correlated patterns that prevent other iterative methods from converging. Furthermore, our numerical tests demonstrate that the algorithm is capable of learning online the unknown values of prior parameters, such as the dilution level of the weights of the teacher perceptron and the fraction of mislabeled examples, quite accurately. This is achieved by means of a simple maximum likelihood strategy that consists in minimizing the free energy associated with the EP algorithm.

13.
Mol Biol Evol ; 38(1): 318-328, 2021 01 04.
Article in English | MEDLINE | ID: mdl-32770229

ABSTRACT

The recent technological advances underlying the screening of large combinatorial libraries in high-throughput mutational scans deepen our understanding of adaptive protein evolution and boost its applications in protein design. Nevertheless, the large number of possible genotypes requires suitable computational methods for data analysis, the prediction of mutational effects, and the generation of optimized sequences. We describe a computational method that, trained on sequencing samples from multiple rounds of a screening experiment, provides a model of the genotype-fitness relationship. We tested the method on five large-scale mutational scans, yielding accurate predictions of the mutational effects on fitness. The inferred fitness landscape is robust to experimental and sampling noise and exhibits high generalization power in terms of broader sequence space exploration and higher fitness variant predictions. We investigate the role of epistasis and show that the inferred model provides structural information about the 3D contacts in the molecular fold.


Subject(s)
Molecular Evolution, Genetic Fitness, Genetic Epistasis, Mutation, Unsupervised Machine Learning
14.
J Mol Graph Model ; 103: 107815, 2021 03.
Article in English | MEDLINE | ID: mdl-33338845

ABSTRACT

Cryo-electron microscopy (cryo-EM) has recently emerged as a prominent biophysical method for macromolecular structure determination. Many research efforts have been devoted to producing cryo-EM images (density maps) at near-atomic resolution. Despite many advances in technology, the resolution of the generated density maps may not be sufficiently adequate and informative to directly construct the atomic structure of proteins. At medium resolution (∼4-10 Å), secondary structure elements (α-helices and β-sheets) are discernible, whereas finding the correspondence of secondary structure elements detected in the density map with those on the sequence remains a challenging problem. In this paper, an automatic framework is proposed to solve the α-helix correspondence problem in three-dimensional space. Through modeling of the sequence with the aid of a novel strategy, the α-helix correspondence problem is initially transformed into a complete weighted bipartite graph matching problem. An innovative correlation-based scoring function based on a well-known and robust statistical method is proposed for weighting the graph. Moreover, two local optimization algorithms, Greedy and Improved Greedy, are presented to find the α-helix correspondence. A widely used data set including 16 reconstructed and 4 experimental cryo-EM maps was chosen to verify the accuracy and reliability of the proposed automatic method. The experimental results demonstrate that the automatic method is highly efficient (86.25% accuracy), robust (11.3% error rate), fast (∼1.4 s), and works independently of the cryo-EM skeleton.


Subject(s)
Algorithms, Proteins, Cryoelectron Microscopy, Molecular Models, Alpha-Helical Protein Conformation, Reproducibility of Results
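Entry 14 casts helix correspondence as matching on a complete weighted bipartite graph and solves it with greedy algorithms. The sketch below shows a generic greedy matching on a score matrix. It is not the paper's Greedy or Improved Greedy algorithm; the score values, the "higher is better" convention, and the function name are assumptions for illustration.

```python
def greedy_matching(scores):
    """Greedily match rows to columns of a score matrix.

    scores[i][j] is the score for pairing row item i (e.g. a helix
    predicted from the sequence) with column item j (e.g. a helix
    detected in the density map). Pairs are taken in decreasing score
    order, skipping any pair whose row or column is already used.
    Greedy selection is a local heuristic and need not be optimal.
    """
    candidates = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda item: -item[0],
    )
    used_rows, used_cols, matching = set(), set(), {}
    for _, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            matching[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return matching

# Made-up 3 x 3 score matrix.
scores = [
    [0.9, 0.8, 0.1],
    [0.85, 0.2, 0.3],
    [0.2, 0.4, 0.7],
]
matching = greedy_matching(scores)
```

On this matrix the greedy pass locks in the single best pair (row 0, column 0) first and ends with the diagonal pairing, whose total score is lower than that of pairing row 0 with column 1 and row 1 with column 0. This gap between a local heuristic and the best global assignment is the usual motivation for improved or exact matching strategies.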
15.
J Proteomics ; 216: 103667, 2020 03 30.
Article in English | MEDLINE | ID: mdl-31982546

ABSTRACT

Clostridium cellulovorans is among the most promising candidates for consolidated bioprocessing (CBP) of cellulosic biomass to liquid biofuels (ethanol, butanol). C. cellulovorans metabolizes all the main plant polysaccharides and mainly produces butyrate. Since most butyrate and butanol biosynthetic reactions from acetyl-CoA are common, introduction of a single heterologous alcohol/aldehyde dehydrogenase can divert the branching-point intermediate (butyryl-CoA) towards butanol production in this strain. However, engineering C. cellulovorans metabolic pathways towards industrial utilization requires better understanding of its metabolism. The present study aimed at improving comprehension of cellulose metabolism in C. cellulovorans by comparing growth kinetics, substrate consumption/product accumulation and whole-cell soluble proteome (data available via ProteomeXchange, identifier PXD015487) with those of the same strain grown on a soluble carbohydrate, glucose, as the main carbon source. Growth substrate-dependent modulations of the central metabolism were detected, including regulation of several glycolytic enzymes, fermentation pathways (e.g. hydrogenase, pyruvate formate lyase, phosphate transacetylase) and nitrogen assimilation (e.g. glutamate dehydrogenase). Overexpression of hydrogenase and increased ethanol production by glucose-grown bacteria suggest a more reduced redox state. Higher energy expenditure seems to occur in cellulose-grown C. cellulovorans (likely related to overexpression and secretion of (hemi-)cellulases), which induces up-regulation of ATP synthetic pathways, e.g. acetate production and ATP synthase. SIGNIFICANCE: C. cellulovorans can metabolize all the main plant polysaccharides (cellulose, hemicelluloses and pectins) and, unlike other well established cellulolytic microorganisms, can produce butyrate. C. cellulovorans is therefore among the most attractive candidates for direct fermentation of lignocellulose to high-value chemicals and, especially, n-butanol, i.e. one of the most promising liquid biofuels for the future. Recent studies aimed at engineering n-butanol production in C. cellulovorans represent milestones towards production of biofuels through one-step fermentation of lignocellulose but also indicated that more detailed understanding of the C. cellulovorans central carbon metabolism is essential to refine metabolic engineering strategies towards improved n-butanol production in this strain. The present study helped identify key genes associated with specific catabolic reactions and indicated modulations of central carbon metabolism (including redox and energy balance) associated with cellulose consumption. This information will be useful to determine key enzymes and possible metabolic bottlenecks to be addressed towards improved metabolic engineering of this strain.


Subject(s)
Clostridium cellulovorans, 1-Butanol, Butanols, Cellulose, Clostridium, Clostridium cellulovorans/genetics, Clostridium cellulovorans/metabolism, Fermentation, Metabolic Engineering, Proteomics
16.
Phys Rev E ; 102(6-1): 062409, 2020 Dec.
Article in English | MEDLINE | ID: mdl-33465950

ABSTRACT

Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. Among the most important computational problems is that of sequence alignment, i.e., arranging sequences from different organisms in such a way to identify similar regions, to detect evolutionary relationships between sequences, and to predict biomolecular structure and function. This is typically addressed through profile models, which capture position specificities like conservation in sequences but assume an independent evolution of different positions. Over recent years, it has been well established that coevolution of different amino-acid positions is essential for maintaining three-dimensional structure and function. Modeling approaches based on inverse statistical physics can catch the coevolution signal in sequence ensembles, and they are now widely used in predicting protein structure, protein-protein interactions, and mutational landscapes. Here, we present DCAlign, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and to be therefore universally applicable to protein- and RNA-sequence alignment without the need of using complementary structural information. The potential of DCAlign is carefully explored using well-controlled simulated data, as well as real protein and RNA sequences.


Subject(s)
Conserved Sequence, Molecular Evolution, Genetic Models
17.
Methods Mol Biol ; 2074: 57-65, 2020.
Article in English | MEDLINE | ID: mdl-31583630

ABSTRACT

Even if we know that two families of homologous proteins interact, we do not necessarily know which specific proteins interact inside each species. The reason is that most families contain paralogs, i.e., more than one homologous sequence per species. We have developed a tool to predict interacting paralogs between the two protein families, which is based on the idea of inter-protein coevolution: our algorithm matches those members of the two protein families that belong to the same species and collectively maximize the detectable coevolutionary signal. It is applicable even in cases where simpler methods, based e.g. on genomic co-localization of genes coding for interacting proteins, or orthology-based methods fail. In this method paper, we present an efficient implementation of this idea based on freely available software.


Subject(s)
Computational Biology/methods, Proteins/chemistry, Proteins/metabolism, Protein Binding, Software
18.
Sci Rep ; 9(1): 18032, 2019 12 02.
Article in English | MEDLINE | ID: mdl-31792239

ABSTRACT

We introduce a simple model that describes the average occurrence of point variations in a generic protein sequence. This model is based on the idea that mutations are more likely to be fixed at sites in contact with others that have mutated in the recent past. Therefore, we extend the usual assumptions made in protein coevolution by introducing a time damping on the effect of a substitution on its surroundings, which makes correlated substitutions happen in avalanches localized in space and time. The model correctly predicts the average correlation of substitutions as a function of their distance along the sequence. At the same time, it predicts an among-site distribution of the number of substitutions per site highly compatible with a negative binomial, consistent with experimental data. The promising outcomes achieved with this model encourage the application of the same ideas in the field of pairwise and multiple sequence alignment.


Subject(s)
Amino Acid Sequence/genetics, Molecular Evolution, Genetic Models, Amino Acid Substitution, Codon/genetics, Humans, Point Mutation, Sequence Alignment
19.
Phys Rev E ; 100(3-1): 032134, 2019 Sep.
Article in English | MEDLINE | ID: mdl-31639925

ABSTRACT

The problem of efficiently reconstructing tomographic images can be mapped into a Bayesian inference problem over the space of pixel densities. Solutions to this problem are given by pixel assignments that are compatible with tomographic measurements and maximize a posterior probability density. This maximization can be performed with standard local optimization tools when the log-posterior is a convex function, but it is generally intractable when introducing realistic nonconcave priors that reflect typical image features such as smoothness or sharpness. We introduce a new method to reconstruct images obtained from Radon projections by using expectation propagation, which allows us to approximate the intractable posterior. We show, by means of extensive simulations, that, compared to state-of-the-art algorithms for this task, expectation propagation paired with very simple but non-log-concave priors is often able to reconstruct images up to a smaller error while using a lower amount of information per pixel. We provide estimates for the critical rate of information per pixel above which recovery is error-free by means of simulations on ensembles of phantom and real images.

20.
PLoS Comput Biol ; 15(4): e1006767, 2019 04.
Article in English | MEDLINE | ID: mdl-30958823

ABSTRACT

It is well known that, in order to preserve its structure and function, a protein cannot change its sequence at random, but only by mutations occurring preferentially at specific locations. We here investigate quantitatively the amount of variability that is allowed in protein sequence evolution, by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID is a measure of the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family, and moreover it is very similar in different families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations in different sites. However, we demonstrate that correlations are not sufficient to explain the small value of the ID we observe in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach in which correlations are accounted for, is typically significantly larger than the value observed in natural protein families. We further prove that a critical factor to reproduce the natural ID is to take into consideration the phylogeny of sequences.


Subject(s)
Molecular Evolution, Proteins/chemistry, Proteins/genetics, Amino Acid Sequence, Computational Biology, Protein Databases/statistics & numerical data, Molecular Models, Mutation, Phylogeny, Protein Conformation, Protein Folding, Proteins/classification, Amino Acid Sequence Homology, Protein Structural Homology
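Entry 20 rests on estimating the intrinsic dimension (ID) of a set of sequences. The abstract does not name the estimator, so the sketch below uses one common choice purely as an assumption: a two-nearest-neighbor maximum-likelihood estimate, ID = N / sum_i ln(r2_i / r1_i), where r1_i and r2_i are the distances from point i to its first and second nearest neighbors. Euclidean distance on toy points stands in for a distance between sequences, and all names are hypothetical.

```python
import math
import random

def two_nn_dimension(points):
    """Estimate intrinsic dimension from ratios of neighbor distances.

    Assumes the points are distinct, so that the nearest-neighbor
    distance r1 is never zero, and that not all ratios r2 / r1 equal 1.
    """
    log_ratio_sum = 0.0
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        log_ratio_sum += math.log(r2 / r1)
    return len(points) / log_ratio_sum

# Toy data: 300 points on a straight line embedded in three dimensions.
rng = random.Random(0)
line_points = [(x, 2.0 * x, -x) for x in (rng.random() for _ in range(300))]
estimated_id = two_nn_dimension(line_points)
```

The toy points have three coordinates but vary along a single direction, so an estimate close to 1 is expected. This mirrors the abstract's point that the ID can be much smaller than the raw number of coordinates.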