ABSTRACT
DNA and histone modifications combine into characteristic patterns that demarcate functional regions of the genome [1,2]. While many 'readers' of individual modifications have been described [3-5], how chromatin states comprising composite modification signatures, histone variants and internucleosomal linker DNA are interpreted is a major open question. Here we use a multidimensional proteomics strategy to systematically examine the interaction of around 2,000 nuclear proteins with over 80 modified dinucleosomes representing promoter, enhancer and heterochromatin states. By deconvoluting complex nucleosome-binding profiles into networks of co-regulated proteins and distinct nucleosomal features driving protein recruitment or exclusion, we show comprehensively how chromatin states are decoded by chromatin readers. We find highly distinctive binding responses to different features, many factors that recognize multiple features, and that nucleosomal modifications and linker DNA operate largely independently in regulating protein binding to chromatin. Our online resource, the Modification Atlas of Regulation by Chromatin States (MARCS), provides in-depth analysis tools to engage with our results and advance the discovery of fundamental principles of genome regulation by chromatin states.
Subject(s)
Chromatin Assembly and Disassembly , Chromatin , Nuclear Proteins , Nucleosomes , Proteomics , Humans , Binding Sites , Chromatin/chemistry , Chromatin/genetics , Chromatin/metabolism , DNA/genetics , DNA/metabolism , Enhancer Elements, Genetic , Heterochromatin/genetics , Heterochromatin/metabolism , Histones/metabolism , Nuclear Proteins/analysis , Nuclear Proteins/metabolism , Nucleosomes/chemistry , Nucleosomes/genetics , Nucleosomes/metabolism , Promoter Regions, Genetic , Protein Binding , Proteomics/methods
ABSTRACT
Chromatin, the nucleoprotein complex consisting of DNA and histone proteins, plays a crucial role in regulating gene expression by controlling access to DNA. Chromatin modifications are key players in this regulation, as they help to orchestrate DNA transcription, replication, and repair. These modifications recruit epigenetic 'reader' proteins, which mediate downstream events. Most modifications occur in distinctive combinations within a nucleosome, suggesting that epigenetic information can be encoded in combinatorial chromatin modifications. A detailed understanding of how multiple modifications cooperate in recruiting such proteins has, however, remained largely elusive. Here, we integrate nucleosome affinity purification data with high-throughput quantitative proteomics and hierarchical interaction modeling to estimate combinatorial effects of chromatin modifications on protein recruitment. This is facilitated by the computational workflow asteRIa which combines hierarchical interaction modeling, stability-based model selection, and replicate-consistency checks for a stable estimation of Robust Interactions among chromatin modifications. asteRIa identifies several epigenetic reader candidates responding to specific interactions between chromatin modifications. For the polycomb protein CBX8, we independently validate our results using genome-wide ChIP-Seq and bisulphite sequencing datasets. We provide the first quantitative framework for identifying cooperative effects of chromatin modifications on protein binding.
Subject(s)
Chromatin , Epigenesis, Genetic , Software , Humans , Chromatin/metabolism , Chromatin/genetics , Histones/metabolism , Nucleosomes/metabolism , Nucleosomes/genetics , Polycomb-Group Proteins/metabolism , Polycomb-Group Proteins/genetics , Protein Binding , Protein Processing, Post-Translational , Proteomics/methods
ABSTRACT
The design of protein interaction inhibitors is a promising approach to address aberrant protein interactions that cause disease. One strategy in designing inhibitors is to use peptidomimetic scaffolds that mimic the natural interaction interface. A central challenge in using peptidomimetics as protein interaction inhibitors, however, is determining how best the molecular scaffold aligns to the residues of the interface it is attempting to mimic. Here we present the Scaffold Matcher algorithm that aligns a given molecular scaffold onto hotspot residues from a protein interaction interface. To optimize the degrees of freedom of the molecular scaffold we implement the covariance matrix adaptation evolution strategy (CMA-ES), a state-of-the-art derivative-free optimization algorithm, in Rosetta. To evaluate the performance of the CMA-ES, we used 26 peptides from the FlexPepDock Benchmark and compared it with three other algorithms in Rosetta: Rosetta's default minimizer, a Monte Carlo protocol of small backbone perturbations, and a genetic algorithm. We test the algorithms' performance on their ability to align a molecular scaffold to a series of hotspot residues (i.e., constraints) along native peptides. Of the four methods, CMA-ES was able to find the lowest energy conformation for all 26 benchmark peptides. Additionally, as a proof of concept, we apply the Scaffold Matcher algorithm with CMA-ES to align a peptidomimetic oligooxopiperazine scaffold to the hotspot residues of the substrate of the main protease of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Our implementation of CMA-ES in Rosetta allows for an alternative optimization method to be used on macromolecular modeling problems with rough energy landscapes. Finally, our Scaffold Matcher algorithm allows for the identification of initial conformations of interaction inhibitors that can be further designed and optimized as high-affinity reagents.
Subject(s)
Peptidomimetics , Algorithms , Peptides/chemistry , Molecular Conformation , Benchmarking
ABSTRACT
In recent years, unsupervised analysis of microbiome data, such as microbial network analysis and clustering, has increased in popularity. Many new statistical and computational methods have been proposed for these tasks. This multiplicity of analysis strategies poses a challenge for researchers, who are often unsure which method(s) to use and might be tempted to try different methods on their dataset to look for the "best" ones. However, if only the best results are selectively reported, this may cause over-optimism: the "best" method is overly fitted to the specific dataset, and the results might be non-replicable on validation data. Such effects will ultimately hinder research progress. Yet so far, these topics have been given little attention in the context of unsupervised microbiome analysis. In our illustrative study, we aim to quantify over-optimism effects in this context. We model the approach of a hypothetical microbiome researcher who undertakes four unsupervised research tasks: clustering of bacterial genera, hub detection in microbial networks, differential microbial network analysis, and clustering of samples. While these tasks are unsupervised, the researcher might still have certain expectations as to what constitutes interesting results. We translate these expectations into concrete evaluation criteria that the hypothetical researcher might want to optimize. We then randomly split an exemplary dataset from the American Gut Project into discovery and validation sets multiple times. For each research task, multiple method combinations (e.g., methods for data normalization, network generation, and/or clustering) are tried on the discovery data, and the combination that yields the best result according to the evaluation criterion is chosen. While the hypothetical researcher might only report this result, we also apply the "best" method combination to the validation dataset. The results are then compared between discovery and validation data. 
In all four research tasks, there are notable over-optimism effects; the results on the validation data set are worse compared to the discovery data, averaged over multiple random splits into discovery/validation data. Our study thus highlights the importance of validation and replication in microbiome analysis to obtain reliable results and demonstrates that the issue of over-optimism goes beyond the context of statistical testing and fishing for significance.
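The selection effect described above can be illustrated with a small simulation. This is a minimal sketch, not the study's actual analysis: it assumes a hypothetical set of method combinations that all have the same true quality, with Gaussian noise standing in for the evaluation criterion on the discovery and validation sets. The function name and all parameters are illustrative.

```python
import random
import statistics


def simulate_over_optimism(n_methods=20, n_splits=200, seed=0):
    """Return mean discovery and validation scores of the selected method.

    Every hypothetical method has the same true quality (0.0), so any
    advantage of the winner on the discovery set is pure selection noise.
    """
    rng = random.Random(seed)
    discovery_scores, validation_scores = [], []
    for _ in range(n_splits):
        # Noisy evaluation criterion of each method on the discovery set.
        disc = [rng.gauss(0.0, 1.0) for _ in range(n_methods)]
        # Independent noisy evaluation of the same methods on validation data.
        val = [rng.gauss(0.0, 1.0) for _ in range(n_methods)]
        # The hypothetical researcher keeps only the discovery-set winner.
        best = max(range(n_methods), key=lambda m: disc[m])
        discovery_scores.append(disc[best])
        validation_scores.append(val[best])
    return statistics.mean(discovery_scores), statistics.mean(validation_scores)


mean_disc, mean_val = simulate_over_optimism()
gap = mean_disc - mean_val
print(f"discovery={mean_disc:.2f} validation={mean_val:.2f} gap={gap:.2f}")
```

Because the winner is chosen on the discovery scores, its discovery score is biased upward, while its validation score is an unbiased draw; the gap between the two is the over-optimism.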
Subject(s)
Microbiota , Machine Learning , Microbial Consortia , Bacteria , Cluster Analysis
ABSTRACT
BACKGROUND: Because its early stages are asymptomatic, chronic kidney disease (CKD) is usually diagnosed at late stages and lacks targeted therapy, highlighting the need for new biomarkers that improve understanding of its pathophysiology and can serve as early diagnostic markers and therapeutic targets. Given the close relationship between CKD and cardiovascular disease (CVD), we investigated the associations of 233 CVD- and inflammation-related plasma proteins with kidney function decline and aimed to assess whether the observed associations are causal. METHODS: We included 1140 participants, aged 55-74 years at baseline, from the Cooperative Health Research in the Region of Augsburg (KORA) cohort study, with a median follow-up time of 13.4 years and 2 follow-up visits. We measured 233 plasma proteins using a proximity extension assay at baseline. In the discovery analysis, linear regression models were used to estimate the associations of 233 proteins with the annual rate of change in creatinine-based estimated glomerular filtration rate (eGFRcr). We further investigated the association of eGFRcr-associated proteins with the annual rate of change in cystatin C-based eGFR (eGFRcys) and eGFRcr-based incident CKD. Two-sample Mendelian randomization was used to infer causality. RESULTS: In the fully adjusted model, 66 out of 233 proteins were inversely associated with the annual rate of change in eGFRcr, indicating that higher baseline protein levels were associated with faster eGFRcr decline. Among these 66 proteins, 21 proteins were associated with both the annual rate of change in eGFRcys and incident CKD. Mendelian randomization analyses on these 21 proteins suggest a potential causal association of higher tumor necrosis factor receptor superfamily member 11A (TNFRSF11A) level with eGFR decline. 
CONCLUSIONS: We reported 21 proteins associated with kidney function decline and incident CKD and provided preliminary evidence suggesting a potential causal association between TNFRSF11A and kidney function decline. Further Mendelian randomization studies are needed to establish a conclusive causal association.
Subject(s)
Cardiovascular Diseases , Renal Insufficiency, Chronic , Middle Aged , Male , Humans , Female , Aged , Cohort Studies , Proteomics , Renal Insufficiency, Chronic/genetics , Glomerular Filtration Rate , Kidney , Creatinine
ABSTRACT
Echocardiography, a rapid and cost-effective imaging technique, assesses cardiac function and structure. Despite its popularity in cardiovascular medicine and clinical research, image-derived phenotypic measurements are performed manually, requiring expert knowledge and training. Notwithstanding great progress in deep-learning applications in small animal echocardiography, the focus has so far only been on images of anesthetized rodents. We present Echo2Pheno, a new algorithm specifically designed for echocardiograms acquired in conscious mice. Echo2Pheno is an automatic statistical learning workflow for analyzing and interpreting high-throughput non-anesthetized transthoracic murine echocardiographic images in the presence of genetic knockouts. It comprises a neural network module for echocardiographic image analysis and phenotypic measurements, including a statistical hypothesis-testing framework for assessing phenotypic differences between populations. Using 2159 images of 16 different knockout mouse strains of the German Mouse Clinic, Echo2Pheno accurately confirms known cardiovascular genotype-phenotype relationships (e.g., Dystrophin) and discovers novel genes (e.g., CCR4-NOT transcription complex subunit 6-like, Cnot6l, and synaptotagmin-like protein 4, Sytl4), which cause altered cardiovascular phenotypes, as verified by H&E-stained histological images. Echo2Pheno provides an important step toward automatic end-to-end learning for linking echocardiographic readouts to cardiovascular phenotypes of interest in conscious mice.
Subject(s)
Deep Learning , Mice , Animals , Echocardiography/methods , Heart , Algorithms , Phenotype , Ribonucleases
ABSTRACT
MOTIVATION: Estimating microbial association networks from high-throughput sequencing data is a common exploratory data analysis approach aiming at understanding the complex interplay of microbial communities in their natural habitat. Statistical network estimation workflows comprise several analysis steps, including methods for zero handling, data normalization and computing microbial associations. Since microbial interactions are likely to change between conditions, e.g. between healthy individuals and patients, identifying network differences between groups is often an integral secondary analysis step. Thus far, however, no unifying computational tool is available that facilitates the whole analysis workflow of constructing, analysing and comparing microbial association networks from high-throughput sequencing data. RESULTS: Here, we introduce NetCoMi (Network Construction and comparison for Microbiome data), an R package that integrates existing methods for each analysis step in a single reproducible computational workflow. The package offers functionality for constructing and analysing single microbial association networks as well as quantifying network differences. This enables insights into whether single taxa, groups of taxa or the overall network structure change between groups. NetCoMi also contains functionality for constructing differential networks, making it possible to assess whether single pairs of taxa are differentially associated between two groups. Furthermore, NetCoMi facilitates the construction and analysis of dissimilarity networks of microbiome samples, enabling a high-level graphical summary of the heterogeneity of an entire microbiome sample collection. We illustrate NetCoMi's wide applicability using data sets from the GABRIELA study to compare microbial associations in settled dust from children's rooms between samples from two study centers (Ulm and Munich). 
AVAILABILITY: R scripts used for producing the examples shown in this manuscript are provided as supplementary data. The NetCoMi package, together with a tutorial, is available at https://github.com/stefpeschel/NetCoMi. CONTACT: Tel:+49 89 3187 43258; stefanie.peschel@mail.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Briefings in Bioinformatics online.
Subject(s)
Databases, Nucleic Acid , High-Throughput Nucleotide Sequencing , Microbiota/genetics , Software , Humans
ABSTRACT
Statistical analysis of microbial genomic data within epidemiological cohort studies holds the promise to assess the influence of environmental exposures on both the host and the host-associated microbiome. However, the observational character of prospective cohort data and the intricate characteristics of microbiome data make it challenging to discover causal associations between environment and microbiome. Here, we introduce a causal inference framework based on the Rubin Causal Model that can help scientists to investigate such environment-host microbiome relationships, to capitalize on existing, possibly powerful, test statistics, and test plausible sharp null hypotheses. Using data from the German KORA cohort study, we illustrate our framework by designing two hypothetical randomized experiments with interventions of (i) air pollution reduction and (ii) smoking prevention. We study the effects of these interventions on the human gut microbiome by testing shifts in microbial diversity, changes in individual microbial abundances, and microbial network wiring between groups of matched subjects via randomization-based inference. In the smoking prevention scenario, we identify a small interconnected group of taxa worth further scrutiny, including Christensenellaceae and Ruminococcaceae genera, that have been previously associated with blood metabolite changes. These findings demonstrate that our framework may uncover potentially causal links between environmental exposure and the gut microbiome from observational data. We anticipate the present statistical framework to be a good starting point for further discoveries on the role of the gut microbiome in environmental health.
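Randomization-based inference under a sharp null hypothesis can be sketched in a few lines. The example below is an illustration only, not the authors' pipeline: it assumes a matched-pair design, uses the mean pair difference as the test statistic, and runs on made-up diversity values. The function name and data are hypothetical.

```python
import itertools
import statistics


def paired_randomization_p_value(treated, control):
    """Exact two-sided p-value under the sharp null of no effect for any pair.

    Under that null, swapping the labels within a matched pair only flips the
    sign of the pair difference, so all 2**n sign patterns are enumerated.
    """
    diffs = [t - c for t, c in zip(treated, control)]
    observed = abs(statistics.mean(diffs))
    n_extreme = 0
    n_total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        stat = abs(statistics.mean([s * d for s, d in zip(signs, diffs)]))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Hypothetical diversity values for 8 matched pairs (illustrative numbers only).
treated = [3.1, 2.9, 3.4, 3.0, 3.3, 2.8, 3.2, 3.5]
control = [2.7, 2.8, 3.0, 2.9, 2.9, 2.6, 3.1, 3.0]
p_value = paired_randomization_p_value(treated, control)
print(p_value)
```

Exact enumeration is only feasible for a small number of pairs; for larger cohorts the same logic is usually applied to a random sample of assignments.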
Subject(s)
Gastrointestinal Microbiome , Cohort Studies , Environmental Exposure/adverse effects , Gastrointestinal Microbiome/genetics , Humans , Prospective Studies , Random Allocation
ABSTRACT
Hearing loss is a major health problem and psychological burden in humans. Mouse models offer a possibility to elucidate genes involved in the underlying developmental and pathophysiological mechanisms of hearing impairment. To this end, large-scale mouse phenotyping programs include auditory phenotyping of single-gene knockout mouse lines. Using the auditory brainstem response (ABR) procedure, the German Mouse Clinic and similar facilities worldwide have produced large, uniform data sets of averaged ABR raw data of mutant and wildtype mice. In the course of standard ABR analysis, hearing thresholds are assessed visually by trained staff from series of signal curves of increasing sound pressure level. This is time-consuming and prone to be biased by the reader as well as the graphical display quality and scale. In an attempt to reduce workload and improve quality and reproducibility, we developed and compared two methods for automated hearing threshold identification from averaged ABR raw data: a supervised approach involving two combined neural networks trained on human-generated labels and a self-supervised approach, which exploits the signal power spectrum and combines random forest sound level estimation with a piece-wise curve fitting algorithm for threshold finding. We show that both models work well and are suitable for fast, reliable, and unbiased hearing threshold detection and quality control. In a high-throughput mouse phenotyping environment, both methods perform well as part of an automated end-to-end screening pipeline to detect candidate genes for hearing involvement. Code for both models as well as data used for this work are freely available.
Subject(s)
Deafness , Evoked Potentials, Auditory, Brain Stem , Humans , Animals , Mice , Evoked Potentials, Auditory, Brain Stem/physiology , Reproducibility of Results , Auditory Threshold/physiology , Hearing/physiology , Acoustic Stimulation/methods
ABSTRACT
The human microbiome provides essential physiological functions and helps maintain host homeostasis via the formation of intricate ecological host-microbiome relationships. While it is well established that the lifestyle of the host, dietary preferences, demographic background, and health status can influence microbial community composition and dynamics, robust generalizable associations between specific host-associated factors and specific microbial taxa have remained largely elusive. Here, we propose factor regression models that allow the estimation of structured parsimonious associations between host-related features and amplicon-derived microbial taxa. To account for the overdispersed nature of the amplicon sequencing count data, we propose negative binomial reduced rank regression (NB-RRR) and negative binomial co-sparse factor regression (NB-FAR). While NB-RRR encodes the underlying dependency among the microbial abundances as outcomes and the host-associated features as predictors through a rank-constrained coefficient matrix, NB-FAR uses a sparse singular value decomposition of the coefficient matrix. The latter approach avoids the notoriously difficult joint parameter estimation by extracting sparse unit-rank components of the coefficient matrix sequentially, effectively delivering interpretable bi-clusters of taxa and host-associated factors. To solve the nonconvex optimization problems associated with these factor regression models, we present a novel iterative block-wise majorization procedure. Extensive simulation studies and an application to the microbial abundance data from the American Gut Project (AGP) demonstrate the efficacy of the proposed procedure. In the AGP data, we identify several factors that strongly link dietary habits and host lifestyle to specific microbial families.
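The two model structures named above can be written compactly. The following is a sketch under assumptions the abstract does not state: a log link, a taxon-specific dispersion parameter, and no sample-specific offset. All symbols (n samples, p host features, q taxa, rank r) are introduced here for illustration.

```latex
\begin{aligned}
Y_{ij} \mid x_i &\sim \mathrm{NB}(\mu_{ij}, \phi_j), \qquad i = 1,\dots,n,\; j = 1,\dots,q,\\
\log \mu_{ij} &= x_i^{\top} c_j, \qquad C = [c_1, \dots, c_q] \in \mathbb{R}^{p \times q},\\
\text{NB-RRR:}\quad & \operatorname{rank}(C) \le r,\\
\text{NB-FAR:}\quad & C = \sum_{k=1}^{r} d_k\, u_k v_k^{\top}, \quad u_k \in \mathbb{R}^{p},\; v_k \in \mathbb{R}^{q} \text{ sparse}.
\end{aligned}
```

In this notation, the nonzero entries of each pair (u_k, v_k) select a subset of host features and a subset of taxa, which is what makes every unit-rank component readable as a bi-cluster.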
Subject(s)
Data Analysis , Microbiota , Factor Analysis, Statistical , Feeding Behavior , Gastrointestinal Microbiome , Humans , Life Style , Regression Analysis , United States
ABSTRACT
Motivation: The number of microbial and metagenomic studies has increased drastically due to advancements in next-generation sequencing-based measurement techniques. Statistical analysis and the validity of conclusions drawn from (time series) 16S rRNA and other metagenomic sequencing data are hampered by the presence of a significant amount of noise and missing data (sampling zeros). Accounting for uncertainty in microbiome data is often challenging due to the difficulty of obtaining biological replicates. Additionally, the compositional nature of current amplicon and metagenomic data differs from many other biological data types, adding another challenge to the data analysis. Results: To address these challenges in human microbiome research, we introduce a novel probabilistic approach to explicitly model overdispersion and sampling zeros by considering the temporal correlation between nearby time points using Gaussian Processes. The proposed Temporal Gaussian Process Model for Compositional Data Analysis (TGP-CODA) shows superior modeling performance compared to commonly used Dirichlet-multinomial, multinomial and non-parametric regression models on real and synthetic data. We demonstrate that the nonreplicative nature of human gut microbiota studies can be partially overcome by our method with proper experimental design of dense temporal sampling. We also show that different modeling approaches have a strong impact on ecological interpretation of the data, such as stationarity, persistence and environmental noise models. Availability and implementation: A Stan implementation of the proposed method is available under MIT license at https://github.com/tare/GPMicrobiome. Contact: taijo@flatironinstitute.org or rb113@nyu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Subject(s)
Bacteria/isolation & purification , Gastrointestinal Microbiome/genetics , Genome, Bacterial , Metagenomics/methods , Models, Statistical , RNA, Ribosomal, 16S/analysis , Bacteria/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , Sequence Analysis, DNA/methods
ABSTRACT
Determining the three-dimensional arrangement of proteins in a complex is highly beneficial for uncovering mechanistic function and interpreting genetic variation in coding genes comprising protein complexes. There are several methods for determining co-complex interactions between proteins, among them co-fractionation/mass spectrometry (CF-MS), but it remains difficult to identify directly contacting subunits within a multi-protein complex. Correlation analysis of CF-MS profiles shows promise in detecting protein complexes as a whole but is limited in its ability to infer direct physical contacts among proteins in sub-complexes. To identify direct protein-protein contacts within human protein complexes we learn a sparse conditional dependency graph from approximately 3,000 CF-MS experiments on human cell lines. We show substantial performance gains in estimating direct interactions compared to correlation analysis on a benchmark of large protein complexes with solved three-dimensional structures. We demonstrate the method's value in determining the three-dimensional arrangement of proteins by making predictions for complexes without known structure (the exocyst and tRNA multi-synthetase complex) and by establishing evidence for the structural position of a recently discovered component of the core human EKC/KEOPS complex, GON7/C14ORF142, providing a more complete 3D model of the complex. Direct contact prediction provides easily calculable additional structural information for large-scale protein complex mapping studies and should be broadly applicable across organisms as more CF-MS datasets become available.
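Why conditional dependence separates direct from indirect contacts can be shown with three simulated profiles. This is a simplified sketch, not the sparse graph estimator used in the study: it uses the closed-form partial correlation for three variables, and the profiles for hypothetical proteins A, B and C are synthetic.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on a single third variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


rng = random.Random(1)
n_fractions = 2000
# Synthetic profiles: A contacts B, B contacts C, but A and C never touch.
b = [rng.gauss(0, 1) for _ in range(n_fractions)]
a = [v + rng.gauss(0, 0.5) for v in b]
c = [v + rng.gauss(0, 0.5) for v in b]

marginal_ac = pearson(a, c)            # high: A and C co-elute through B
conditional_ac = partial_corr(a, c, b)  # near zero: no direct A-C dependence
print(f"corr(A,C)={marginal_ac:.2f}  partial corr(A,C|B)={conditional_ac:.2f}")
```

Plain correlation would link A and C, whereas the partial correlation removes the edge once B is accounted for; a sparse conditional dependency graph generalizes this idea to many proteins at once.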
Subject(s)
Multiprotein Complexes/chemistry , Multiprotein Complexes/metabolism , Protein Subunits/chemistry , Protein Subunits/metabolism , Proteomics/methods , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Cell Line , Databases, Protein , Humans , Mass Spectrometry , Models, Molecular , Multiprotein Complexes/genetics , Protein Conformation , Protein Subunits/genetics
ABSTRACT
Existing methods for interpreting protein variation focus on annotating mutation pathogenicity rather than detailed interpretation of variant deleteriousness and frequently use only sequence-based or structure-based information. We present VIPUR, a computational framework that seamlessly integrates sequence analysis and structural modelling (using the Rosetta protein modelling suite) to identify and interpret deleterious protein variants. To train VIPUR, we collected 9477 protein variants with known effects on protein function from multiple organisms and curated structural models for each variant from crystal structures and homology models. VIPUR can be applied to mutations in any organism's proteome with improved generalized accuracy (AUROC .83) and interpretability (AUPR .87) compared to other methods. We demonstrate that VIPUR's predictions of deleteriousness match the biological phenotypes in ClinVar and provide a clear ranking of prediction confidence. We use VIPUR to interpret known mutations associated with inflammation and diabetes, demonstrating the structural diversity of disrupted functional sites and improved interpretation of mutations associated with human diseases. Lastly, we demonstrate VIPUR's ability to highlight candidate variants associated with human diseases by applying VIPUR to de novo variants associated with autism spectrum disorders.
Subject(s)
Autism Spectrum Disorder/genetics , Celiac Disease/genetics , Crohn Disease/genetics , Diabetes Mellitus/genetics , Mutation , Proteins/genetics , Software , Animals , Autism Spectrum Disorder/metabolism , Autism Spectrum Disorder/pathology , Benchmarking , Celiac Disease/metabolism , Celiac Disease/pathology , Crohn Disease/metabolism , Crohn Disease/pathology , Data Mining , Databases, Protein , Diabetes Mellitus/metabolism , Diabetes Mellitus/pathology , Humans , Inflammation , Models, Molecular , Molecular Sequence Annotation , Proteins/chemistry , Proteins/metabolism
ABSTRACT
Understanding gene regulatory networks is critical to understanding cellular differentiation and response to external stimuli. Methods for global network inference have been developed and applied to a variety of species. Most approaches consider the problem of network inference independently in each species, despite evidence that gene regulation can be conserved even in distantly related species. Further, network inference is often confined to single data-types (single platforms) and single cell types. We introduce a method for multi-source network inference that allows simultaneous estimation of gene regulatory networks in multiple species or biological processes through the introduction of priors based on known gene relationships such as orthology incorporated using fused regression. This approach improves network inference performance even when orthology mapping and conservation are incomplete. We refine this method by presenting an algorithm that extracts the true conserved subnetwork from a larger set of potentially conserved interactions and demonstrate the utility of our method in cross species network inference. Last, we demonstrate our method's utility in learning from data collected on different experimental platforms.
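One way to read "fused regression" with orthology-based priors is as a penalized least-squares problem. The objective below is a sketch under assumptions not given in the abstract: two species, squared (ridge-type) penalties, and illustrative symbols throughout.

```latex
\min_{\beta^{(1)},\,\beta^{(2)}}\;
\sum_{s=1}^{2} \bigl\lVert y^{(s)} - X^{(s)} \beta^{(s)} \bigr\rVert_2^2
\;+\; \lambda_R \sum_{s=1}^{2} \bigl\lVert \beta^{(s)} \bigr\rVert_2^2
\;+\; \lambda_S \sum_{(i,j) \in \mathcal{O}} \bigl( \beta^{(1)}_{i} - \beta^{(2)}_{j} \bigr)^2
```

Here \beta^{(s)} holds the regulatory weights for species s, and \mathcal{O} is the set of interaction pairs linked by orthology. Setting \lambda_S = 0 recovers independent inference per species, while a large \lambda_S forces orthologous interactions toward equal weights.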
Subject(s)
Computational Biology/methods , Gene Expression Regulation, Bacterial/genetics , Gene Regulatory Networks/genetics , Models, Genetic , Algorithms , Bacillus/genetics , Bacillus/metabolism , Computer Simulation , Gene Expression Profiling , Regression Analysis
ABSTRACT
4C-Seq has proven to be a powerful technique to identify genome-wide interactions with a single locus of interest (or "bait") that can be important for gene regulation. However, analysis of 4C-Seq data is complicated by the many biases inherent to the technique. An important consideration when dealing with 4C-Seq data is the differences in resolution of signal across the genome that result from differences in 3D distance separation from the bait. This leads to the highest signal in the region immediately surrounding the bait and increasingly lower signals in far-cis and trans. Another important aspect of 4C-Seq experiments is the resolution, which is greatly influenced by the choice of restriction enzyme and the frequency at which it can cut the genome. Thus, it is important that a 4C-Seq analysis method is flexible enough to analyze data generated using different enzymes and to identify interactions across the entire genome. Current methods for 4C-Seq analysis only identify interactions in regions near the bait or in regions located in far-cis and trans, but no method comprehensively analyzes 4C signals of different length scales. In addition, some methods also fail in experiments where chromatin fragments are generated using frequent cutter restriction enzymes. Here, we describe 4C-ker, a Hidden-Markov Model based pipeline that identifies regions throughout the genome that interact with the 4C bait locus. In addition, we incorporate methods for the identification of differential interactions in multiple 4C-seq datasets collected from different genotypes or experimental conditions. Adaptive window sizes are used to correct for differences in signal coverage in near-bait regions, far-cis and trans chromosomes. Using several datasets, we demonstrate that 4C-ker outperforms all existing 4C-Seq pipelines in its ability to reproducibly identify interaction domains at all genomic ranges with different resolution enzymes.
Subject(s)
DNA, Catalytic/chemistry , DNA, Catalytic/genetics , Genome/physiology , Restriction Mapping/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Base Sequence , Binding Sites , Molecular Sequence Data , Protein Binding
ABSTRACT
16S ribosomal RNA (rRNA) gene and other environmental sequencing techniques provide snapshots of microbial communities, revealing phylogeny and the abundances of microbial populations across diverse ecosystems. While changes in microbial community structure are demonstrably associated with certain environmental conditions (from metabolic and immunological health in mammals to ecological stability in soils and oceans), identification of underlying mechanisms requires new statistical tools, as these datasets present several technical challenges. First, the abundances of microbial operational taxonomic units (OTUs) from amplicon-based datasets are compositional. Counts are normalized to the total number of counts in the sample. Thus, microbial abundances are not independent, and traditional statistical metrics (e.g., correlation) for the detection of OTU-OTU relationships can lead to spurious results. Secondly, microbial sequencing-based studies typically measure hundreds of OTUs on only tens to hundreds of samples; thus, inference of OTU-OTU association networks is severely under-powered, and additional information (or assumptions) are required for accurate inference. Here, we present SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference), a statistical method for the inference of microbial ecological networks from amplicon sequencing datasets that addresses both of these issues. SPIEC-EASI combines data transformations developed for compositional data analysis with a graphical model inference framework that assumes the underlying ecological association network is sparse. To reconstruct the network, SPIEC-EASI relies on algorithms for sparse neighborhood and inverse covariance selection. To provide a synthetic benchmark in the absence of an experimentally validated gold-standard network, SPIEC-EASI is accompanied by a set of computational tools to generate OTU count data from a set of diverse underlying network topologies. 
SPIEC-EASI outperforms state-of-the-art methods in recovering edges and network properties on synthetic data under a variety of scenarios. It also reproducibly predicts previously unknown microbial associations using data from the American Gut Project.
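The compositionality problem described in this abstract can be illustrated with a small sketch. The abstract does not name the specific data transformation, so the centered log-ratio (clr) transform used here is an assumption, chosen because it is a standard tool of compositional data analysis. The sample counts and the `pseudocount` parameter are hypothetical, and the sketch covers only the transformation step, not the sparse network inference.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's OTU counts.

    A pseudocount is added so that zero counts do not break the logarithm;
    with pseudocount=0.0 every count must be strictly positive.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical OTU counts: same relative composition, tenfold different
# sequencing depth.
sample_a = [10, 20, 70]
sample_b = [100, 200, 700]

clr_a = clr(sample_a, pseudocount=0.0)
clr_b = clr(sample_b, pseudocount=0.0)

# The transform removes the dependence on total read count, and each
# transformed sample sums to zero.
print(clr_a)
print(clr_b)
```

Without a pseudocount, the two transformed samples agree up to floating-point error, which is why such a transform is a reasonable preprocessing step before estimating associations between OTUs.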
Subject(s)
Biota, Microbiota/physiology, Biological Models, Algorithms, Computational Biology/methods, Gastrointestinal Microbiome, Humans, Metagenomics/methods, Microbiota/genetics, 16S Ribosomal RNA/genetics
ABSTRACT
BACKGROUND: Hypertension, a complex condition, is primarily defined by blood pressure readings, without reference to its pathophysiological mechanisms. We aimed to identify biomarkers through a proteomic approach, so that future definitions of hypertension can draw on insights into its molecular mechanisms. METHODS: The discovery analysis included 1560 participants, aged 55 to 74 years at baseline, from the KORA (Cooperative Health Research in the Region of Augsburg) S4/F4/FF4 cohort study, with 3332 observations over a median of 13.4 years of follow-up. Generalized estimating equations were used to estimate the associations of 233 plasma proteins with hypertension and systolic blood pressure (SBP). Proteins significantly associated with hypertension or SBP in the discovery analysis were then tested for replication in the KORA Age1/Age2 cohort study (1024 participants, 1810 observations). A 2-sample Mendelian randomization analysis was conducted to assess whether the validated proteins are causally related to SBP. RESULTS: Discovery analysis identified 49 proteins associated with hypertension and 99 associated with SBP. Validation in the KORA Age1/Age2 study replicated 7 proteins associated with hypertension and 23 associated with SBP. Three proteins, NT-proBNP (N-terminal pro-B-type natriuretic peptide), KIM1 (kidney injury molecule 1), and OPG (osteoprotegerin), consistently showed positive associations with both outcomes. Five proteins demonstrated potential causal associations with SBP in Mendelian randomization analysis, including NT-proBNP and OPG. CONCLUSIONS: We identified and validated 7 hypertension-associated and 23 SBP-associated proteins across 2 cohort studies. KIM1, NT-proBNP, and OPG demonstrated robust associations, and OPG was identified for the first time as associated with blood pressure. For NT-proBNP (protective) and OPG, causal associations with SBP were suggested.
Subject(s)
Hypertension, Proteomics, Humans, Blood Pressure/physiology, Cohort Studies, Biomarkers, Brain Natriuretic Peptide, Peptide Fragments
ABSTRACT
The essential molecular chaperonin GroEL is an example of a functionally highly versatile cellular machine with a wide variety of in vitro applications ranging from protein folding to drug release. Directed evolution of new functions for GroEL is considered difficult, due to its structure as a complex homomultimeric double ring and the absence of obvious molecular engineering strategies. In order to investigate the potential to establish an orthogonal GroEL system in Escherichia coli, which might serve as a basis for GroEL evolution, we first successfully individualised groEL genes by inserting different functional peptide tags into a robustly permissive site identified by transposon mutagenesis. These peptides allowed fundamental aspects of the intracellular GroEL complex stoichiometry to be studied and revealed that GroEL single-ring complexes, which assembled in the presence of several functionally equivalent but biochemically distinct monomers, each consist almost exclusively of only one type of monomer. At least in the case of GroEL, individualisation of monomers thus leads to individualisation of homomultimeric protein complexes, effectively providing the prerequisites for evolving an orthogonal intracellular GroEL folding machine.
Subject(s)
Chaperonin 60/chemistry, Chaperonin 60/genetics, Chaperonin 60/metabolism, Escherichia coli/genetics, Molecular Models, Protein Folding
ABSTRACT
We present an artificial neural network architecture, termed STENCIL-NET, for equation-free forecasting of spatiotemporal dynamics from data. STENCIL-NET works by learning a discrete propagator that is able to reproduce the spatiotemporal dynamics of the training data. This data-driven propagator can then be used to forecast or extrapolate dynamics without needing to know a governing equation. STENCIL-NET does not learn a governing equation, nor an approximation to the data themselves. It instead learns a discrete propagator that reproduces the data. It therefore generalizes well to different dynamics and different grid resolutions. By analogy with classic numerical methods, we show that the discrete forecasting operators learned by STENCIL-NET are numerically stable and accurate for data represented on regular Cartesian grids. A once-trained STENCIL-NET model can be used for equation-free forecasting on larger spatial domains and for longer times than it was trained for, as an autonomous predictor of chaotic dynamics, as a coarse-graining method, and as a data-adaptive de-noising method, as we illustrate in numerical experiments. In all tests, STENCIL-NET generalizes better and is computationally more efficient, both in training and inference, than neural network architectures based on local (CNN) or global (FNO) nonlinear convolutions.
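The idea of a discrete propagator acting through a local stencil can be sketched with a classic numerical scheme. The code below is not STENCIL-NET: the stencil weights are fixed by hand (a three-point diffusion stencil with forward-Euler time stepping on a periodic 1D grid), whereas the abstract describes such an operator being learned from data. The function names, grid sizes, and step size are all hypothetical.

```python
def step(u, weights, dt):
    """Advance a periodic 1D field by one time step with a 3-point stencil."""
    n = len(u)
    w_left, w_center, w_right = weights
    return [
        u[i] + dt * (w_left * u[(i - 1) % n]
                     + w_center * u[i]
                     + w_right * u[(i + 1) % n])
        for i in range(n)
    ]

def forecast(u0, weights, dt, n_steps):
    """Apply the same discrete propagator repeatedly to roll the field forward."""
    u = list(u0)
    for _ in range(n_steps):
        u = step(u, weights, dt)
    return u

# Hand-written diffusion stencil (unit diffusivity, unit grid spacing).
# In a learned model these weights would come from training instead.
diffusion_stencil = (1.0, -2.0, 1.0)

# Small grid with a single spike.
u_small = forecast([0.0, 0.0, 1.0, 0.0, 0.0, 0.0], diffusion_stencil, dt=0.1, n_steps=50)

# The same local rule applies unchanged to a grid twice as large.
u_large = forecast([0.0] * 11 + [1.0], diffusion_stencil, dt=0.1, n_steps=50)

print(u_small)
print(u_large)
```

Because the update rule only looks at a fixed local neighborhood, it does not depend on the grid size, which gives some intuition for why a propagator of this kind can be reused on larger domains than the one it was fitted on.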