RESUMEN
Noonan syndrome patients harboring causative variants in LZTR1 are particularly at risk to develop severe and early-onset hypertrophic cardiomyopathy. In this study, we investigate the mechanistic consequences of a homozygous variant LZTR1L580P by using patient-specific and CRISPR-Cas9-corrected induced pluripotent stem cell (iPSC) cardiomyocytes. Molecular, cellular, and functional phenotyping in combination with in silico prediction identify an LZTR1L580P-specific disease mechanism provoking cardiac hypertrophy. The variant is predicted to alter the binding affinity of the dimerization domains facilitating the formation of linear LZTR1 polymers. LZTR1 complex dysfunction results in the accumulation of RAS GTPases, thereby provoking global pathological changes of the proteomic landscape ultimately leading to cellular hypertrophy. Furthermore, our data show that cardiomyocyte-specific MRAS degradation is mediated by LZTR1 via non-proteasomal pathways, whereas RIT1 degradation is mediated by both LZTR1-dependent and LZTR1-independent pathways. Uni- or biallelic genetic correction of the LZTR1L580P missense variant rescues the molecular and cellular disease phenotype, providing proof of concept for CRISPR-based therapies.
RESUMEN
Animal venoms, distinguished by their unique structural features and potent bioactivities, represent a vast and relatively untapped reservoir of therapeutic molecules. However, limitations associated with extracting or expressing large numbers of individual venoms and venom-like molecules have precluded their therapeutic evaluation via high throughput screening. Here, we developed an innovative computational approach to design a highly diverse library of animal venoms and "metavenoms". We employed programmable M13 hyperphage display to preserve critical disulfide-bonded structures for highly parallelized single-round biopanning with quantitation via high-throughput DNA sequencing. Our approach led to the discovery of Kunitz type domain containing proteins that target the human itch receptor Mas-related G protein-coupled receptor X4 (MRGPRX4), which plays a crucial role in itch perception. Deep learning-based structural homology mining identified two endogenous human homologs, tissue factor pathway inhibitor (TFPI) and serine peptidase inhibitor, Kunitz type 2 (SPINT2), which exhibit agonist-dependent potentiation of MRGPRX4. Highly multiplexed screening of animal venoms and metavenoms is therefore a promising approach to uncover new drug candidates.
RESUMEN
Metagenomic taxonomic classifiers analyze either DNA or amino acid (AA) sequences. Metabuli ( https://metabuli.steineggerlab.com ), however, jointly analyzes both DNA and AA to leverage AA conservation for sensitive homology detection and DNA mutations for specific differentiation of closely related taxa. In the Critical Assessment of Metagenome Interpretation 2 plant-associated dataset, Metabuli covered 99% and 98% of classifications of state-of-the-art DNA- and AA-based classifiers, respectively.
Asunto(s)
Aminoácidos , Metagenoma , Metagenómica , Metagenómica/métodos , Aminoácidos/genética , ADN/genética , Programas Informáticos , Plantas/clasificación , Análisis de Secuencia de ADN/métodos , Secuencia de AminoácidosRESUMEN
The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.
Asunto(s)
Biología Computacional , Proteínas , Biología Computacional/métodos , Proteínas/química , Alineación de Secuencia , Conformación Proteica , Programas Informáticos , Algoritmos , Análisis de Secuencia de Proteína/métodosRESUMEN
The DescribePROT database of amino acid-level descriptors of protein structures and functions was substantially expanded since its release in 2020. This expansion includes substantial increase in the size, scope, and quality of the underlying data, the addition of experimental structural information, the inclusion of new data download options, and an upgraded graphical interface. DescribePROT currently covers 19 structural and functional descriptors for proteins in 273 reference proteomes generated by 11 accurate and complementary predictive tools. Users can search our resource in multiple ways, interact with the data using the graphical interface, and download data at various scales including individual proteins, entire proteomes, and whole database. The annotations in DescribePROT are useful for a broad spectrum of studies that include investigations of protein structure and function, development and validation of predictive tools, and to support efforts in understanding molecular underpinnings of diseases and development of therapeutics. DescribePROT can be freely accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.
Asunto(s)
Aminoácidos , Proteoma , Proteoma/química , Bases de Datos FactualesRESUMEN
The AlphaFold Database Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements in data archiving, covering successive releases encompassing model organisms, global health proteomes, Swiss-Prot integration, and a host of curated protein datasets. We detail the data access mechanisms of AlphaFold DB, from direct file access via FTP to advanced queries using Google Cloud Public Datasets and the programmatic access endpoints of the database. We also discuss the improvements and services added since its initial release, including enhancements to the Predicted Aligned Error viewer, customisation options for the 3D viewer, and improvements in the search engine of AlphaFold DB.
The AlphaFold Protein Structure Database (AlphaFold DB) is a massive digital library of predicted protein structures, with over 214 million entries, marking a 500-times expansion in size since its initial release in 2021. The structures are predicted using Google DeepMind's AlphaFold 2 artificial intelligence (AI) system. Our new report highlights the latest updates we have made to this database. We have added more data on specific organisms and proteins related to global health and expanded to cover almost the complete UniProt database, a primary data resource of protein sequences. We also made it easier for our users to access the data by directly downloading files or using advanced cloud-based tools. Finally, we have also improved how users view and search through these protein structures, making the user experience smoother and more informative. In short, AlphaFold DB has been growing rapidly and has become more user-friendly and robust to support the broader scientific community.
Asunto(s)
Inteligencia Artificial , Estructura Secundaria de Proteína , Proteoma , Secuencia de Aminoácidos , Bases de Datos de Proteínas , Motor de Búsqueda , Proteínas/químicaRESUMEN
As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of Dali, TM-align and CE, respectively.
Asunto(s)
Algoritmos , Proteínas , Bases de Datos de Proteínas , Proteínas/química , Aminoácidos , Programas InformáticosRESUMEN
The sequence-structure-function relationships that ultimately generate the diversity of extant observed proteins is complex, as proteins bridge the gap between multiple informational and physical scales involved in nearly all cellular processes. One limitation of existing protein annotation databases such as UniProt is that less than 1% of proteins have experimentally verified functions, and computational methods are needed to fill in the missing information. Here, we demonstrate that a multi-aspect framework based on protein language models can learn sequence-structure-function representations of amino acid sequences, and can provide the foundation for sensitive sequence-structure-function aware protein sequence search and annotation. Based on this model, we introduce a multi-aspect information retrieval system for proteins, Protein-Vec, covering sequence, structure, and function aspects, that enables computational protein annotation and function prediction at tree-of-life scales.
RESUMEN
CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess .
Asunto(s)
Genoma Humano , Proteínas , Humanos , Filogenia , Proteínas/genética , Algoritmos , Programas Informáticos , Anotación de Secuencia MolecularRESUMEN
Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy1, and over 214 million predicted structures are available in the AlphaFold database2. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm-Foldseek cluster-that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations representing probable previously undescribed structures. Clusters without annotation tend to have few representatives covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem to be species specific, representing lower-quality predictions or examples of de novo gene birth. We also show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote structural similarity. On the basis of these analyses, we identify several examples of human immune-related proteins with putative remote homology in prokaryotic species, illustrating the value of this resource for studying protein function and evolution across the tree of life.
Asunto(s)
Algoritmos , Análisis por Conglomerados , Proteínas , Homología Estructural de Proteína , Humanos , Bases de Datos de Proteínas , Proteínas/química , Proteínas/clasificación , Proteínas/metabolismo , Alineación de Secuencia , Anotación de Secuencia Molecular , Células Procariotas/química , Filogenia , Especificidad de la Especie , Evolución MolecularRESUMEN
MOTIVATION: Efficiently aligning sequences is a fundamental problem in bioinformatics. Many recent algorithms for computing alignments through Smith-Waterman-Gotoh dynamic programming (DP) exploit Single Instruction Multiple Data (SIMD) operations on modern CPUs for speed. However, these advances have largely ignored difficulties associated with efficiently handling complex scoring matrices or large gaps (insertions or deletions). RESULTS: We propose a new SIMD-accelerated algorithm called Block Aligner for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. We introduce a new paradigm that uses blocks in the DP matrix that greedily shift, grow, and shrink. This approach allows regions of the DP matrix to be adaptively computed. Our algorithm reaches over 5-10 times faster than some previous methods while incurring an error rate of less than 3% on protein and long read datasets, despite large gaps and low sequence identities. AVAILABILITY AND IMPLEMENTATION: Our algorithm is implemented for global, local, and X-drop alignments. It is available as a Rust library (with C bindings) at https://github.com/Daniel-Liu-c0deb0t/block-aligner.
Asunto(s)
Algoritmos , Proteínas , Posición Específica de Matrices de Puntuación , Alineación de Secuencia , Análisis de Secuencia , Programas InformáticosRESUMEN
The recent CASP15 competition highlighted the critical role of multiple sequence alignments (MSAs) in protein structure prediction, as demonstrated by the success of the top AlphaFold2-based prediction methods. To push the boundaries of MSA utilization, we conducted a petabase-scale search of the Sequence Read Archive (SRA), resulting in gigabytes of aligned homologs for CASP15 targets. These were merged with default MSAs produced by ColabFold-search and provided to ColabFold-predict. By using SRA data, we achieved highly accurate predictions (GDT_TS > 70) for 66% of the non-easy targets, whereas using ColabFold-search default MSAs scored highly in only 52%. Next, we tested the effect of deep homology search and ColabFold's advanced features, such as more recycles, on prediction accuracy. While SRA homologs were most significant for improving ColabFold's CASP15 ranking from 11th to 3rd place, other strategies contributed too. We analyze these in the context of existing strategies to improve prediction.
RESUMEN
Wood-decaying fungi are the major decomposers of plant litter. Heavy sequencing efforts on genomes of wood-decaying fungi have recently been made due to the interest in their lignocellulolytic enzymes; however, most parts of their proteomes remain uncharted. We hypothesized that wood-decaying fungi would possess promiscuous enzymes for detoxifying antifungal phytochemicals remaining in the dead plant bodies, which can be useful biocatalysts. We designed a computational mass spectrometry-based untargeted metabolomics pipeline for the phenotyping of biotransformation and applied it to 264 fungal cultures supplemented with antifungal plant phenolics. The analysis identified the occurrence of diverse reactivities by the tested fungal species. Among those, we focused on O-xylosylation of multiple phenolics by one of the species tested, Lentinus brumalis. By integrating the metabolic phenotyping results with publicly available genome sequences and transcriptome analysis, a UDP-glycosyltransferase designated UGT66A1 was identified and validated as an enzyme catalyzing O-xylosylation with broad substrate specificity. We anticipate that our analytical workflow will accelerate the further characterization of fungal enzymes as promising biocatalysts.
Asunto(s)
Glucosiltransferasas , Lentinula , Metabolómica , Metabolómica/métodos , Lentinula/enzimología , Glucosiltransferasas/química , Glucosiltransferasas/aislamiento & purificación , Glucosiltransferasas/metabolismo , Fitoquímicos/metabolismo , Xilosa/metabolismo , Genoma Fúngico , Cromatografía Líquida con Espectrometría de MasasRESUMEN
BACKGROUND: Protein annotation is a major goal in molecular biology, yet experimentally determined knowledge is typically limited to a few model organisms. In non-model species, the sequence-based prediction of gene orthology can be used to infer protein identity; however, this approach loses predictive power at longer evolutionary distances. Here we propose a workflow for protein annotation using structural similarity, exploiting the fact that similar protein structures often reflect homology and are more conserved than protein sequences. RESULTS: We propose a workflow of openly available tools for the functional annotation of proteins via structural similarity (MorF: MorphologFinder) and use it to annotate the complete proteome of a sponge. Sponges are highly relevant for inferring the early history of animals, yet their proteomes remain sparsely annotated. MorF accurately predicts the functions of proteins with known homology in [Formula: see text] cases and annotates an additional [Formula: see text] of the proteome beyond standard sequence-based methods. We uncover new functions for sponge cell types, including extensive FGF, TGF, and Ephrin signaling in sponge epithelia, and redox metabolism and control in myopeptidocytes. Notably, we also annotate genes specific to the enigmatic sponge mesocytes, proposing they function to digest cell walls. CONCLUSIONS: Our work demonstrates that structural similarity is a powerful approach that complements and extends sequence similarity searches to identify homologous proteins over long evolutionary distances. We anticipate this will be a powerful approach that boosts discovery in numerous -omics datasets, especially for non-model organisms.
Asunto(s)
Proteoma , Animales , Anotación de Secuencia Molecular , Secuencia de AminoácidosRESUMEN
SUMMARY: Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm, and indexing system to address this challenge. By using a combination of internal and Cartesian coordinates and a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of three compared to the next best method. Its reconstruction error of 0.08 Å is comparable to the best lossy compressor. It is five times faster than the next fastest compressor and competes with the fastest decompressors. With its multi-threading implementation and a Python interface that allows for easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analysing large collections of protein structures. AVAILABILITY AND IMPLEMENTATION: Foldcomp is a free open-source software (GPLv3) and available for Linux, macOS, and Windows at https://foldcomp.foldseek.com. Foldcomp provides the AlphaFold Swiss-Prot (2.9GB), TrEMBL (1.1TB), and ESMatlas HQ (114GB) database ready-for-download.
Asunto(s)
Compresión de Datos , Programas Informáticos , Algoritmos , Compresión de Datos/métodos , Proteínas , Biblioteca de GenesRESUMEN
Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of these, having at least one human relative, reveal extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increases the number of unique 'global' folds by 36% and will provide valuable insights on structure function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.
Asunto(s)
Furilfuramida , Proteínas , Humanos , Bases de Datos de Proteínas , Proteínas/químicaRESUMEN
The availability of accurate and fast artificial intelligence (AI) solutions predicting aspects of proteins are revolutionizing experimental and computational molecular biology. The webserver LambdaPP aspires to supersede PredictProtein, the first internet server making AI protein predictions available in 1992. Given a protein sequence as input, LambdaPP provides easily accessible visualizations of protein 3D structure, along with predictions at the protein level (GeneOntology, subcellular location), and the residue level (binding to metal ions, small molecules, and nucleotides; conservation; intrinsic disorder; secondary structure; alpha-helical and beta-barrel transmembrane segments; signal-peptides; variant effect) in seconds. The structure prediction provided by LambdaPP-leveraging ColabFold and computed in minutes-is based on MMseqs2 multiple sequence alignments. All other feature prediction methods are based on the pLM ProtT5. Queried by a protein sequence, LambdaPP computes protein and residue predictions almost instantly for various phenotypes, including 3D structure and aspects of protein function. LambdaPP is freely available for everyone to use under embed.predictprotein.org, the interactive results for the case study can be found under https://embed.predictprotein.org/o/Q9NZC2. The frontend of LambdaPP can be found on GitHub (github.com/sacdallago/embed.predictprotein.org), and can be freely used and distributed under the academic free use license (AFL-2). For high-throughput applications, all methods can be executed locally via the bio-embeddings (bioembeddings.com) python package, or docker image at ghcr.io/bioembeddings/bio_embeddings, which also includes the backend of LambdaPP.
Asunto(s)
Inteligencia Artificial , Proteínas , Proteínas/química , Secuencia de Aminoácidos , Estructura Secundaria de Proteína , Alineación de Secuencia , Programas InformáticosRESUMEN
Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.
Asunto(s)
Aprendizaje Automático , Proteínas , Proteínas/química , Biología Computacional/métodos , Conformación ProteicaRESUMEN
In phylogenomics the evolutionary relationship of organisms is studied by their genomic information. A common approach to phylogenomics is to extract related genes from each organism, build a multiple sequence alignment and then reconstruct evolution relations through a phylogenetic tree. Often a set of highly conserved genes occurring in single-copy, called core genes, are used for this analysis, as they allow efficient automation within a taxonomic clade. Here we introduce the Universal Fungal Core Genes (UFCG) database and pipeline for genome-wide phylogenetic analysis of fungi. The UFCG database consists of 61 curated fungal marker genes, including a novel set of 41 computationally derived core genes and 20 canonical genes derived from literature, as well as marker gene sequences extracted from publicly available fungal genomes. Furthermore, we provide an easy-to-use, fully automated and open-source pipeline for marker gene extraction, training and phylogenetic tree reconstruction. The UFCG pipeline can identify marker genes from genomic, proteomic and transcriptomic data, while producing phylogenies consistent with those previously reported, and is publicly available together with the UFCG database at https://ufcg.steineggerlab.com.