ABSTRACT
We propose a sampling algorithm that relies on a collective variable (CV) of moderate dimension modeled by a normalizing flow: the flow proposes a refreshed value of the CV, and nonequilibrium dynamics turns this proposal into a full configurational move. The algorithm takes the form of a Markov chain with nonlocal updates, allowing jumps through energy barriers across metastable states. The flow is trained throughout the algorithm to reproduce the free energy landscape of the CV. The output is a sample of thermalized configurations together with the trained network, which can be used to efficiently produce further configurations. We first demonstrate the algorithm on a test case with a mixture of Gaussians, and then apply it successfully to a higher-dimensional system: a polymer in solution with a compact state and an extended stable state separated by a high free energy barrier.
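A minimal sketch of the core Monte Carlo step may help. The snippet below implements an independence-Metropolis update in which a generative model proposes a refreshed CV value; `flow_sample` and `flow_logprob` are hypothetical stand-ins for a trained normalizing flow, and the nonequilibrium reconstruction of the remaining degrees of freedom is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_logprob(s):
    # Toy free energy landscape for the CV: a double-well mixture of Gaussians.
    return np.logaddexp(-0.5 * (s - 2.0) ** 2, -0.5 * (s + 2.0) ** 2)

# Hypothetical stand-ins for a trained normalizing flow over the CV.
def flow_sample():
    return rng.normal(0.0, 3.0)

def flow_logprob(s):
    return -0.5 * (s / 3.0) ** 2 - np.log(3.0 * np.sqrt(2 * np.pi))

def metropolis_flow_step(s):
    # Independence proposal: the new CV value does not depend on the old one,
    # so the chain can jump across free energy barriers in a single move.
    s_new = flow_sample()
    log_alpha = (target_logprob(s_new) - target_logprob(s)
                 + flow_logprob(s) - flow_logprob(s_new))
    return s_new if np.log(rng.random()) < log_alpha else s

s, samples = 0.0, []
for _ in range(10000):
    s = metropolis_flow_step(s)
    samples.append(s)
```

In the full algorithm the flow is retrained on the growing sample, so the proposal distribution progressively approaches the free energy landscape of the CV.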
ABSTRACT
Complex networks are powerful mathematical tools for modeling and understanding the behavior of highly interconnected systems. However, existing methods for analyzing these networks focus either on local properties (e.g., degree distribution, clustering coefficient) or on global properties (e.g., diameter, modularity) and fail to characterize the network structure across multiple scales. In this paper, we introduce a rigorous method for calculating the intrinsic dimension of unweighted networks, a feature that describes the network structure at all scales, from local to global. We propose using this measure as a summary statistic within an Approximate Bayesian Computation framework to infer the parameters of flexible, multi-purpose mechanistic models that generate complex networks. Furthermore, we present a new mechanistic model that can reproduce the intrinsic dimension of networks with large diameters, a task that has been challenging for existing models.
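As an illustration of the inference scheme, here is a minimal ABC-rejection sketch. The mechanistic model is stood in by networkx's Watts-Strogatz generator, and `graph_id` is a crude distance-scaling proxy for the intrinsic dimension, not the authors' estimator.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)

def graph_id(G):
    # Crude proxy for the network intrinsic dimension: fit the scaling
    # exponent of the average number of nodes within shortest-path radius r.
    lengths = dict(nx.all_pairs_shortest_path_length(G))
    radii = np.arange(1, 5)
    counts = [np.mean([sum(1 for d in dist.values() if 0 < d <= r)
                       for dist in lengths.values()]) for r in radii]
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

# "Observed" network and its summary statistic.
s_obs = graph_id(nx.watts_strogatz_graph(200, 6, 0.1, seed=1))

# ABC rejection: keep rewiring probabilities whose simulated networks
# reproduce the observed summary statistic within a tolerance eps.
posterior, eps = [], 0.1
for _ in range(500):
    p = rng.uniform(0.0, 1.0)                      # draw from the prior
    s_sim = graph_id(nx.watts_strogatz_graph(200, 6, p))
    if abs(s_sim - s_obs) < eps:
        posterior.append(p)
```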
ABSTRACT
Clinical databases typically include, for each patient, many heterogeneous features: blood exams, the clinical history before the onset of the disease, the evolution of the symptoms, the results of imaging exams, and many others. We here propose to exploit a recently developed statistical approach, the Information Imbalance, to compare different subsets of patient features and automatically select the set of features that is maximally informative for a given clinical purpose, especially in minority classes. We adapt the Information Imbalance approach to work in a clinical framework, where patient features are often categorical and are generally available only for a fraction of the patients. We apply this algorithm to a dataset of ∼1300 patients treated for COVID-19 in the Udine hospital before October 2021. Using this approach, we find sets of features which, used in combination, are maximally informative of the clinical fate and of the severity of the disease. The optimal number of features, which is determined automatically, turns out to be between 10 and 15, and these features can be measured at admission. The approach can be used even if the features are available only for a fraction of the patients, does not require imputation and, importantly, is able to automatically select features with small inter-feature correlation. Clinical insights deriving from this study are also discussed.
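A minimal sketch of the underlying selection loop follows, assuming the standard rank-based definition of the Information Imbalance; the handling of categorical features and missing values, which the adapted method addresses, is omitted here, and the helper names are ours.

```python
import numpy as np
from scipy.spatial.distance import cdist

def information_imbalance(XA, XB):
    """Delta(A -> B): ~0 if distances in A predict distances in B, ~1 if not."""
    N = len(XA)
    dA, dB = cdist(XA, XA), cdist(XB, XB)
    np.fill_diagonal(dA, np.inf)
    nnA = dA.argmin(axis=1)                      # nearest neighbor in space A
    ranksB = dB.argsort(axis=1).argsort(axis=1)  # distance ranks in space B
    return 2.0 / N * ranksB[np.arange(N), nnA].mean()

def greedy_select(X, target_space, n_keep):
    """Greedily add the feature whose inclusion best predicts the target space."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_keep):
        best = min(remaining, key=lambda j: information_imbalance(
            X[:, selected + [j]], target_space))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: 200 patients, 8 features, outcome space built from features 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
outcome = X[:, [0, 3]] + 0.1 * rng.normal(size=(200, 2))
print(greedy_select(X, outcome, n_keep=3))
```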
Subject(s)
Algorithms , COVID-19 , SARS-CoV-2 , Severity of Illness Index , Humans , COVID-19/diagnosis , COVID-19/epidemiology , SARS-CoV-2/isolation & purification , Databases, Factual , Male , Female
ABSTRACT
We introduce an approach that allows detecting causal relationships between variables for which the time evolution is available. Causality is assessed by a variational scheme based on the Information Imbalance of distance ranks, a statistical test capable of inferring the relative information content of different distance measures. We test whether the predictability of a putative driven system Y can be improved by incorporating information from a potential driver system X, without explicitly modeling the underlying dynamics and without the need to compute probability densities of the dynamic variables. This framework makes causality detection possible even between high-dimensional systems in which only a few of the variables are known or measured. Benchmark tests on coupled chaotic dynamical systems demonstrate that our approach outperforms other model-free causality detection methods, successfully handling both unidirectional and bidirectional couplings. We also show that the method can robustly detect causality in human electroencephalography data.
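The core of the test can be sketched as follows: distances on the present state of the driven system are optionally augmented with a scaled copy of the putative driver, and one checks whether this lowers the imbalance toward the driven system's future. This is a minimal reconstruction from the abstract, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def imbalance(A, B):
    # Rank-based Information Imbalance Delta(A -> B).
    N = len(A)
    dA, dB = cdist(A, A), cdist(B, B)
    np.fill_diagonal(dA, np.inf)
    nnA = dA.argmin(axis=1)
    ranksB = dB.argsort(axis=1).argsort(axis=1)
    return 2.0 / N * ranksB[np.arange(N), nnA].mean()

# Toy unidirectional coupling: x drives y.
rng = np.random.default_rng(0)
T, tau = 1000, 5
x = rng.normal(size=T + tau)
y = np.zeros(T + tau)
for t in range(T + tau - 1):
    y[t + 1] = 0.5 * y[t] + 0.8 * x[t] + 0.1 * rng.normal()

Y0, Yf, X0 = y[:T, None], y[tau:T + tau, None], x[:T, None]

# Scan the scaling alpha of the putative driver: if including x lowers
# Delta((y_t, alpha * x_t) -> y_{t+tau}) below the alpha = 0 value,
# x carries predictive information about y's future, i.e. x drives y.
for alpha in [0.0, 0.5, 1.0, 2.0]:
    print(alpha, imbalance(np.hstack([Y0, alpha * X0]), Yf))
```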
ABSTRACT
BACKGROUND AND OBJECTIVES: ASPECTS is a widely used marker to identify early stroke signs on non-enhanced computed tomography (NECT), yet it suffers from interindividual variability and may be hard to use for non-experts. We introduce an algorithm capable of automatically estimating, in 3D, the volumetric extension of early acute ischemic changes on NECT, and we compared the power of this marker in predicting the clinical outcome with that of the ASPECTS evaluated by experienced practitioners. METHODS: We analyzed and processed neuroimaging data of 153 patients admitted with acute ischemic stroke. All patients underwent a NECT at admission and on follow-up. The developed algorithm identifies the early ischemic hypodense region based on an automatic comparison of the gray levels in the images of the two hemispheres, which in healthy patients are assumed to be approximate mirror images of each other. RESULTS: In the two standard axial slices used to estimate the ASPECTS, the regions identified by the algorithm overlap significantly with those identified by experienced practitioners. However, in many patients, the regions identified automatically extend significantly to other slices. In these cases, the volume marker provides supplementary and independent information. Indeed, the clinical outcome of patients with volume marker = 0 can be distinguished with higher statistical confidence than that of patients with ASPECTS = 10. CONCLUSION: The volumetric extension and the location of the acute ischemic region in 3D space, automatically identified by our algorithm, provide data that are mostly in agreement with the ASPECTS values estimated by expert practitioners, and in some cases complementary and independent.
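A schematic version of the hemispheric comparison on a single axial slice might look as follows; midline registration, thresholds, and the mapping to ASPECTS regions are simplified away, and all numeric values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label

def hypodense_mask(slice_hu, smooth_sigma=2.0, drop_hu=4.0):
    """Flag voxels noticeably darker than their mirrored counterpart.

    Assumes the slice is already aligned so that the midsagittal plane
    is the vertical image axis (an illustrative simplification).
    """
    smoothed = gaussian_filter(slice_hu.astype(float), smooth_sigma)
    mirrored = smoothed[:, ::-1]                 # reflect across the midline
    asymmetry = mirrored - smoothed              # > 0 where this side is darker
    mask = asymmetry > drop_hu
    # Keep only contiguous regions, discarding isolated voxels.
    labels, _ = label(mask)
    sizes = np.bincount(labels.ravel())
    return np.isin(labels, np.flatnonzero(sizes > 50)[1:])

# Toy slice: uniform brain tissue with a darker (hypodense) patch on one side.
slice_hu = np.full((128, 128), 35.0) + np.random.default_rng(0).normal(0, 1, (128, 128))
slice_hu[40:60, 20:40] -= 8.0                    # simulated early ischemic change
volume_voxels = hypodense_mask(slice_hu).sum()
```

Summing such masks over all slices gives the 3D volume marker the abstract compares with the ASPECTS.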
Subject(s)
Algorithms , Ischemic Stroke , Tomography, X-Ray Computed , Humans , Male , Tomography, X-Ray Computed/standards , Tomography, X-Ray Computed/methods , Female , Aged , Ischemic Stroke/diagnostic imaging , Middle Aged , Brain Ischemia/diagnostic imaging , Aged, 80 and over , Brain/diagnostic imaging , Brain/pathology , Stroke/diagnostic imaging
ABSTRACT
According to common physical chemistry wisdom, the solvent cavities hosting a solute are tightly sewn around it, practically coinciding with its van der Waals surface. Solvation entropy is primarily determined by the surface and the volume of the cavity, while enthalpy is determined by the solute-solvent interaction. In this work, we challenge this picture, demonstrating by molecular dynamics simulations that the cavities surrounding the 20 amino acids deviate significantly from the molecular surface. Strikingly, the shape of the cavity alone can be used to predict the solvation free energy, entropy, enthalpy, and hydrophobicity. Solute-solvent interactions involving the different chemical moieties of the amino acid indirectly determine the cavity shape and the properties of its branches, but do not have to be taken into account explicitly in the prediction model.
ABSTRACT
Machine learning (ML) has become a key workhorse in molecular simulations. Building an ML model in this context involves encoding the information on chemical environments using local atomic descriptors. In this work, we focus on the Smooth Overlap of Atomic Positions (SOAP) descriptors and their application to studying the properties of liquid water, both in the bulk and at the hydrophobic air-water interface. Using a statistical test aimed at assessing the relative information content of different distance measures defined on the same data space, we investigate whether these descriptors provide the same information as some of the common order parameters used to characterize local water structure, such as hydrogen bonding, density, or tetrahedrality, to name a few. Our analysis suggests that the ML description and the standard order parameters of the local water structure are not equivalent. In particular, a combination of these order parameters probing local water environments can predict SOAP similarity only approximately and, vice versa, environments that are similar according to SOAP are not necessarily similar according to the standard order parameters. We also elucidate the role of some of the metaparameters in the SOAP definition in encoding chemical information.
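For readers unfamiliar with SOAP, a sketch of how such descriptors are typically computed with the DScribe library is given below. The parameter names follow recent DScribe releases and may differ across versions, and the cutoff values are illustrative rather than those used in the work.

```python
# Sketch of computing SOAP vectors for a water molecule with DScribe + ASE.
# Parameter names (r_cut, n_max, l_max) follow recent DScribe releases and
# may differ in older versions; the values are illustrative only.
import numpy as np
from ase.build import molecule
from dscribe.descriptors import SOAP

soap = SOAP(species=["H", "O"], r_cut=5.0, n_max=6, l_max=4, periodic=False)

water = molecule("H2O")
vectors = soap.create(water)          # one descriptor vector per atom

# Distances between SOAP vectors define the ML similarity between local
# environments that the abstract compares with the standard order parameters.
d = np.linalg.norm(vectors[0] - vectors[1])
```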
ABSTRACT
Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensionality reduction methods are designed for continuous spaces, and their use in discrete spaces can lead to errors and biases. In this Letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
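The flavor of the problem can be conveyed by an illustrative ratio-of-volumes estimator in the spirit of the method (not the exact estimator of the Letter): in a discrete space the number of points within Hamming radius t grows as the volume of a Hamming ball in d dimensions, so comparing neighbor counts at two radii pins down d.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import brentq
from scipy.spatial.distance import pdist, squareform

def ball_volume(t, d):
    # Continuous extension of sum_{j <= t} C(d, j), the Hamming-ball volume.
    j = np.arange(int(t) + 1)
    return np.exp(gammaln(d + 1) - gammaln(j + 1) - gammaln(d - j + 1)).sum()

def discrete_id(X, t1=2, t2=4):
    """Match the observed ratio of neighbors within radii t1 < t2 to the
    ratio of Hamming-ball volumes in d dimensions, and solve for d."""
    D = squareform(pdist(X, metric="hamming") * X.shape[1])  # integer distances
    n1 = (D <= t1).sum(axis=1).mean()
    n2 = (D <= t2).sum(axis=1).mean()
    f = lambda d: ball_volume(t1, d) / ball_volume(t2, d) - n1 / n2
    return brentq(f, t2 + 0.1, 1000.0)

# Toy data: binary vectors varying only in the first 5 of 50 coordinates,
# so the intrinsic dimension is ~5 despite the 50-dimensional embedding.
rng = np.random.default_rng(0)
X = np.zeros((500, 50), dtype=int)
X[:, :5] = rng.integers(0, 2, size=(500, 5))
print(discrete_id(X))
```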
ABSTRACT
Modern datasets are characterized by numerous features related by complex dependency structures. To deal with these data, dimensionality reduction techniques are essential. Many of these techniques rely on the concept of intrinsic dimension (id), a measure of the complexity of the dataset. However, the estimation of this quantity is not trivial: often, the id depends rather dramatically on the scale of the distances among data points. At short distances, the id can be grossly overestimated due to the presence of noise, becoming smaller and approximately scale-independent only at large distances. An immediate approach to examining the scale dependence consists in decimating the dataset, which unavoidably induces non-negligible statistical errors at large scale. This article introduces a novel statistical method, Gride, that allows estimating the id as an explicit function of the scale without performing any decimation. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among data points. Through simulation studies, we show that Gride is asymptotically unbiased, provides comparable estimates to other state-of-the-art methods, and is more robust to short-scale noise than other likelihood-based approaches.
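A minimal numerical version of such an estimator can be built from the distributional result that, for locally uniform data, (r_n1/r_n2)^d follows a Beta(n1, n2 - n1) law for the distances r to the n1-th and n2-th neighbors; maximizing the corresponding likelihood in d at increasing n1 (with n2 = 2 n1) probes increasing scales without decimation. This is a sketch of that idea, not the reference Gride implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.spatial import cKDTree

def gride_like_id(X, n1):
    """MLE of the id from the neighbor-distance ratios mu = r_{2*n1} / r_{n1}.

    Uses the fact that for locally uniform data (r_{n1}/r_{n2})^d is
    Beta(n1, n2 - n1) distributed, with n2 = 2 * n1; larger n1 probes
    larger scales. Sketch only, not the reference Gride implementation.
    """
    n2 = 2 * n1
    dists, _ = cKDTree(X).query(X, k=n2 + 1)   # column 0 is the point itself
    mu = dists[:, n2] / dists[:, n1]
    def nll(d):                                # negative log-likelihood in d
        return -np.sum(np.log(d) - (d * n1 + 1) * np.log(mu)
                       + (n2 - n1 - 1) * np.log(1 - mu ** (-d)))
    return minimize_scalar(nll, bounds=(0.1, 100), method="bounded").x

# Toy data: a 3D Gaussian embedded in 10 dimensions, plus short-scale noise.
rng = np.random.default_rng(0)
X = np.zeros((2000, 10))
X[:, :3] = rng.normal(size=(2000, 3))
X += 0.01 * rng.normal(size=X.shape)
for n1 in (1, 2, 4, 8, 16):                    # increasing scale
    print(n1, round(gride_like_id(X, n1), 2))
```

At small n1 the noise inflates the estimate; at larger n1 it approaches the manifold dimension, reproducing the scale dependence described above.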
Subject(s)
Likelihood Functions , Computer Simulation
ABSTRACT
Proteins that are known only at the sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for as-yet unannotated sequences. Existing domain family resources typically rely on at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe the automatic clustering, by Density Peak Clustering, of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to handle millions of sequences and data volumes on the order of 3 terabytes. The modified pipeline, which we call DPCfam, finds ∼45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence with those of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters composed of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
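The clustering engine at the heart of the pipeline can be sketched compactly. Below is a minimal version of the classic density-peak step (local density rho, and distance delta to the nearest denser point), applied to generic vectors; DPCfam itself operates on distances derived from local pairwise sequence alignments and adds considerable machinery around this core.

```python
import numpy as np
from scipy.spatial.distance import cdist

def density_peaks(X, dc=0.5, n_clusters=2):
    """Minimal Density Peak Clustering (Rodriguez & Laio, Science 2014)."""
    D = cdist(X, X)
    rho = (D < dc).sum(axis=1)            # local density: neighbors within dc
    order = np.argsort(-rho)              # points by decreasing density
    delta = np.empty(len(X))              # distance to nearest denser point
    delta[order[0]] = D.max()             # convention for the global density peak
    for k, i in enumerate(order[1:], 1):
        delta[i] = D[i, order[:k]].min()
    centers = np.argsort(rho * delta)[-n_clusters:]   # high rho AND high delta
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_clusters)
    for k, i in enumerate(order):
        if labels[i] == -1:               # inherit the label of the nearest
            j = order[:k][D[i, order[:k]].argmin()]   # already-ranked denser point
            labels[i] = labels[j]
    return labels

# Toy usage on two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(100, 2)), rng.normal(3, 0.3, size=(100, 2))])
labels = density_peaks(X)
```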
Subject(s)
Proteins , Databases, Protein , Proteins/genetics , Cluster Analysis , Amino Acid Sequence , Protein Domains
ABSTRACT
DADApy is a Python software package for analyzing and characterizing high-dimensional data manifolds. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering, and for comparing different distance metrics. We review the main functionalities of the package and exemplify its usage on a synthetic dataset and in a real-world application. DADApy is freely available under the open-source Apache 2.0 license.
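A minimal usage sketch is shown below; the method names follow the examples in the DADApy paper and may change between releases, so the package documentation should be treated as authoritative.

```python
import numpy as np
from dadapy import Data

# Toy dataset: a noisy 2D manifold embedded in 10 dimensions.
rng = np.random.default_rng(0)
X = np.zeros((1000, 10))
X[:, :2] = rng.normal(size=(1000, 2))
X += 0.01 * rng.normal(size=X.shape)

data = Data(X)
id_est = data.compute_id_2NN()        # intrinsic dimension via the 2NN estimator
data.compute_density_PAk()            # point-adaptive kNN density estimate
data.compute_clustering_ADP()         # density-based (Advanced Density Peaks) clustering
```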
ABSTRACT
Single-molecule force spectroscopy (SMFS) uses the cantilever tip of an atomic force microscope (AFM) to apply a force able to unfold a single protein. The obtained force-distance curve encodes the unfolding pathway, and from its analysis it is possible to characterize the folded domains. SMFS has mostly been used to study the unfolding of purified proteins, in solution or reconstituted in a lipid bilayer. Here, we describe a pipeline for analyzing membrane proteins based on SMFS, which involves isolating the plasma membrane of single cells and harvesting force-distance curves directly from it. We characterized and identified the embedded membrane proteins by combining, within a Bayesian framework, the shape of the obtained curves with information from mass spectrometry and proteomic databases. The pipeline was tested with purified/reconstituted proteins and applied to five cell types, in which we classified the unfolding of their most abundant membrane proteins. We validated our pipeline by overexpressing four constructs, which allowed us to gather structural insights into the identified proteins, revealing variable elements in the loop regions. Our results set the basis for investigating the unfolding of membrane proteins in situ and for performing proteomics from a membrane fragment.
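The Bayesian combination step can be sketched in a few lines: for each candidate protein, a likelihood from the match between measured and expected curve features is multiplied by a prior from mass-spectrometry abundance. The choice of feature and the Gaussian likelihood below are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def posterior_protein_id(measured, expected, ms_prior, sigma=5.0):
    """Posterior over candidate proteins given one unfolding-curve feature.

    measured : scalar feature of a force-distance curve (e.g. a contour-length
               increment in nm) -- an illustrative choice of feature.
    expected : (n_candidates,) expected feature value per candidate protein.
    ms_prior : (n_candidates,) prior weights from mass-spectrometry abundance.
    """
    log_like = -0.5 * ((measured - expected) / sigma) ** 2   # Gaussian likelihood
    log_post = log_like + np.log(ms_prior)
    log_post -= log_post.max()                               # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

# Toy usage: three candidate membrane proteins.
expected = np.array([28.0, 36.0, 65.0])      # expected curve feature per candidate
ms_prior = np.array([0.5, 0.3, 0.2])         # relative abundance from proteomics
print(posterior_protein_id(34.0, expected, ms_prior))
```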
Subject(s)
Lipid Bilayers , Membrane Proteins , Bayes Theorem , Membrane Proteins/chemistry , Microscopy, Atomic Force/methods , Protein Unfolding , Proteomics
ABSTRACT
Epitopes that bind simultaneously to all human alleles of Major Histocompatibility Complex class II (MHC II) are considered one of the key factors for the development of improved vaccines and cancer immunotherapies. To engineer MHC II multiple-allele binders, we developed a protocol called PanMHC-PARCE, based on the unsupervised optimization of the epitope sequence by single-point mutations, parallel explicit-solvent molecular dynamics simulations, and scoring of the MHC II-epitope complexes. The key idea is to accept mutations that not only improve the affinity but also reduce the affinity gap between the alleles. We applied this methodology to enhance a Plasmodium vivax epitope for multiple-allele binding. In vitro rate-binding assays showed that four engineered peptides were able to bind with improved affinity toward multiple human MHC II alleles. Moreover, we demonstrated that mice immunized with the peptides exhibited an interferon-gamma cellular immune response. Overall, the method enables the engineering of peptides with improved binding properties that can be used for the generation of new immunotherapies.
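The acceptance rule at the core of such a protocol can be sketched as follows; `score_complex` is a hypothetical stand-in for the MD-based scoring of one MHC II-epitope complex, the starting sequence is illustrative, and the exact criterion in PanMHC-PARCE may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def score_complex(seq, allele):
    # Hypothetical stand-in for the MD-based score of one MHC II-epitope
    # complex (lower = better binding); a real run would use simulations.
    rng_a = np.random.default_rng(sum(ord(c) for c in allele))
    weights = {aa: rng_a.normal() for aa in AMINO_ACIDS}
    return sum(weights[aa] for aa in seq)

def accept(old_scores, new_scores):
    """Accept a mutation only if it improves the worst-allele affinity
    and does not widen the affinity gap between alleles."""
    improves_worst = max(new_scores) < max(old_scores)
    shrinks_gap = (max(new_scores) - min(new_scores)
                   <= max(old_scores) - min(old_scores))
    return improves_worst and shrinks_gap

alleles = ["DRB1*01:01", "DRB1*04:01", "DRB1*07:01"]   # example human MHC II alleles
seq = list("AKFVAAWTLKAAA")                            # illustrative starting epitope
scores = [score_complex(seq, a) for a in alleles]
for _ in range(200):                                   # single-point mutation loop
    trial = seq.copy()
    trial[rng.integers(len(seq))] = rng.choice(AMINO_ACIDS)
    trial_scores = [score_complex(trial, a) for a in alleles]
    if accept(scores, trial_scores):
        seq, scores = trial, trial_scores
```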
Subject(s)
HLA-D Antigens , Molecular Dynamics Simulation , Alleles , Animals , Epitopes , HLA-D Antigens/genetics , Mice , Peptides
ABSTRACT
Computational peptide design is useful for therapeutics, diagnostics, and vaccine development. To select the most promising peptide candidates, the key is describing accurately the peptide-target interactions at the molecular level. We here review a computational peptide design protocol whose key feature is the use of all-atom explicit solvent molecular dynamics for describing the different peptide-target complexes explored during the optimization. We describe the milestones behind the development of this protocol, which is now implemented in an open-source code called PARCE. We provide a basic tutorial to run the code for an antibody fragment design example. Finally, we describe three additional applications of the method to design peptides for different targets, illustrating the broad scope of the proposed approach.
Subject(s)
Molecular Dynamics Simulation , Peptides , Peptides/chemistry , Solvents
ABSTRACT
By using advanced data analysis techniques, we characterize the shape of the voids surrounding model polymers of different sizes in water, observed in molecular dynamics simulations. We find that even when the model polymer is folded, the voids are extremely rough, with branches that can extend to over 1 nm away from the polymer. Water molecules in contact with the void retain close-to-bulk properties in terms of local structure. The branches disappear, and the voids start resembling the quasispherical shape predicted by dewetting theory only when they surround particles with a radius ∼1 nm, well above the size occupied by a folded hydrophobic polymer. Our results provide fresh insights into the microscopic origins of the vapor-like interfaces underlying dewetting and drying transitions.
ABSTRACT
Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for the successful application of many statistical learning approaches. We introduce a statistical test that can assess the relative information retained when using two different distance measures, and determine whether they are equivalent, independent, or whether one is more informative than the other. This ranking can in turn be used to identify the most informative distance measure and, therefore, the most informative set of features, out of a pool of candidates. To illustrate the general applicability of our approach, we show that it reproduces the known importance ranking of policy variables for COVID-19 control, and also identifies compact yet informative descriptors for atomic structures. We further provide initial evidence that the information asymmetry measured by the proposed test can be used to infer relationships of causality between the features of a dataset. The method is general and should be applicable to many branches of science.
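A compact version of the test: compute the imbalance in both directions from distance ranks, then read off the regimes (equivalent, independent, or one direction more informative). This follows the rank-based definition described above; the snippet is a sketch, not the reference implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def information_imbalance(XA, XB):
    """Delta(A -> B): ~0 when A's nearest neighbors are also close in B,
    ~1 when distances in A carry no information about distances in B."""
    N = len(XA)
    dA, dB = cdist(XA, XA), cdist(XB, XB)
    np.fill_diagonal(dA, np.inf)
    nnA = dA.argmin(axis=1)                      # nearest neighbor in space A
    ranksB = dB.argsort(axis=1).argsort(axis=1)  # distance ranks in space B
    return 2.0 / N * ranksB[np.arange(N), nnA].mean()

# Toy example: space A contains all features, space B a noisy subset of them,
# and space C is unrelated, giving the regimes described in the text.
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 3))
B = A[:, :2] + 0.05 * rng.normal(size=(500, 2))
C = rng.normal(size=(500, 3))
print(information_imbalance(A, B), information_imbalance(B, A))  # A more informative
print(information_imbalance(A, C), information_imbalance(C, A))  # both ~1: independent
```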
ABSTRACT
Unsupervised learning is becoming an essential tool for analyzing the increasingly large amounts of data produced by atomistic and molecular simulations in materials science, solid-state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data, and indicate likely directions for further developments in the field. In particular, we discuss feature representations of molecular systems and present state-of-the-art algorithms for dimensionality reduction, density estimation, clustering, and kinetic modeling. We divide our discussion into self-contained sections, each devoted to a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used, or can be used, to analyze molecular simulation data.
ABSTRACT
BACKGROUND: The identification of protein families is of outstanding practical importance for in silico protein annotation and is at the basis of several bioinformatic resources. Pfam is possibly the best known protein family database, built over many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time-consuming and may suffer from a bias generated by the hand-curation itself, which is often guided by the available experimental evidence. RESULTS: We introduce a procedure that aims to identify putative protein families automatically. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single- or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence-based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results. CONCLUSIONS: The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
Subject(s)
Proteins , Sequence Alignment , Amino Acid Sequence , Cluster Analysis , Databases, Protein , Humans , Proteins/genetics
ABSTRACT
Computational protein design has emerged as a powerful tool capable of identifying sequences compatible with pre-defined protein structures. The sequence design protocols implemented in the Rosetta suite have become widely used in the protein engineering community. To understand the strengths and limitations of the Rosetta design framework, we tested several design protocols on two distinct folds (SH3-1 and Ubiquitin). The sequence optimization, when started from native structures with either natural or polyvaline sequences, converges to sequences that are not recognized as belonging to the fold family of the target protein by standard bioinformatic tools, such as BLAST and HMMER. The sequences generated from both starting conditions (native and polyvaline) are instead very similar to each other and recognized by HMMER as belonging to the same "family." This demonstrates the capability of Rosetta to converge to similar sequences even when sampling from distinct starting conditions but, on the other hand, reveals an intrinsic inaccuracy of the scoring function, which drifts toward sequences that lack identifiable natural sequence signatures. To address this problem, we developed a protocol embedding Rosetta Design simulations in a genetic algorithm, in which the sequence search is biased to converge to sequences that exist in nature. This protocol allows us to obtain sequences that have recognizable natural sequence signatures and, experimentally, the designed proteins are biochemically well behaved and thermodynamically stable.
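The genetic-algorithm wrapper can be sketched generically as follows; `rosetta_score` and `family_profile_score` are hypothetical stand-ins for the Rosetta design energy and for a natural-sequence signature (e.g. an HMM match score), since the actual protocol calls the Rosetta suite.

```python
import numpy as np

rng = np.random.default_rng(0)
AA = list("ACDEFGHIKLMNPQRSTVWY")

def rosetta_score(seq):
    # Hypothetical stand-in for the Rosetta design energy (lower = better).
    return sum((ord(a) % 7) * 0.1 for a in seq)

def family_profile_score(seq):
    # Hypothetical stand-in for a natural-sequence signature, e.g. an HMM
    # match score against the target family (higher = more natural).
    return sum(1.0 for a in seq if a in "ILVFYW")  # crude hydrophobic-core proxy

def fitness(seq, w_nat=0.5):
    # Bias the search toward sequences that both score well energetically
    # and retain identifiable natural signatures.
    return -rosetta_score(seq) + w_nat * family_profile_score(seq)

def mutate(seq):
    s = list(seq)
    s[rng.integers(len(s))] = rng.choice(AA)
    return "".join(s)

def crossover(a, b):
    cut = rng.integers(1, len(a))
    return a[:cut] + b[cut:]

pop = ["".join(rng.choice(AA, size=60)) for _ in range(40)]   # random 60-mers
for gen in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                                        # elitist selection
    children = [mutate(crossover(*rng.choice(parents, 2))) for _ in range(30)]
    pop = parents + children
best = max(pop, key=fitness)
```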