Results 1 - 20 of 233
1.
Nat Commun ; 15(1): 5511, 2024 Jun 29.
Article in English | MEDLINE | ID: mdl-38951555

ABSTRACT

Accurately building 3D atomic structures from cryo-EM density maps is a crucial step in cryo-EM-based protein structure determination. Converting density maps into 3D atomic structures for proteins lacking accurate homologous or predicted structures as templates remains a significant challenge. Here, we introduce Cryo2Struct, a fully automated de novo cryo-EM structure modeling method. Cryo2Struct utilizes a 3D transformer to identify atoms and amino acid types in cryo-EM density maps, followed by an innovative Hidden Markov Model (HMM) to connect predicted atoms and build protein backbone structures. Cryo2Struct produces substantially more accurate and complete protein structural models than the widely used ab initio method Phenix. Additionally, its performance in building atomic structural models is robust against changes in the resolution of density maps and the size of protein structures.


Subject(s)
Cryoelectron Microscopy; Markov Chains; Models, Molecular; Protein Conformation; Proteins; Cryoelectron Microscopy/methods; Proteins/chemistry; Proteins/ultrastructure; Algorithms; Software
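
The Cryo2Struct abstract above describes using a Hidden Markov Model to connect predicted atoms into a backbone. One standard way to decode the most likely hidden-state path of an HMM is the Viterbi algorithm; the abstract does not say which decoder Cryo2Struct uses, so the sketch below is a generic illustration only. The state names (`c1`, `c2`, standing in for candidate atom positions), the observation symbols (residue letters), and all probabilities are hypothetical toy values.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observations."""
    # Log probabilities avoid numerical underflow on long sequences.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        scores.append({})
        back.append({})
        for s in states:
            # Best predecessor state for landing in state s at step t.
            prev, best = max(
                ((p, scores[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[t][s] = best + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: two candidate positions, residues as observations.
states = ("c1", "c2")
start_p = {"c1": 0.6, "c2": 0.4}
trans_p = {"c1": {"c1": 0.1, "c2": 0.9}, "c2": {"c1": 0.9, "c2": 0.1}}
emit_p = {"c1": {"G": 0.8, "A": 0.2}, "c2": {"G": 0.3, "A": 0.7}}

path = viterbi(["G", "A", "G"], states, start_p, trans_p, emit_p)
print(path)  # ['c1', 'c2', 'c1']
```

All probabilities must be strictly positive here because the sketch takes logarithms directly; a fuller version would handle zero-probability transitions explicitly.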
2.
Commun Chem ; 7(1): 150, 2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38961141

ABSTRACT

Generative deep learning methods have recently been proposed for generating 3D molecules using equivariant graph neural networks (GNNs) within a denoising diffusion framework. However, such methods are unable to learn important geometric properties of 3D molecules, as they adopt molecule-agnostic and non-geometric GNNs as their 3D graph denoising networks, which notably hinders their ability to generate valid large 3D molecules. In this work, we address these gaps by introducing the Geometry-Complete Diffusion Model (GCDM) for 3D molecule generation, which outperforms existing 3D molecular diffusion models by significant margins across conditional and unconditional settings for the QM9 dataset and the larger GEOM-Drugs dataset, respectively. Importantly, we demonstrate that GCDM's generative denoising process enables the model to generate a significant proportion of valid and energetically stable large molecules at the scale of GEOM-Drugs, whereas previous methods fail to do so with the features they learn. Additionally, we show that extensions of GCDM can not only effectively design 3D molecules for specific protein pockets but can also be repurposed to consistently optimize the geometry and chemical composition of existing 3D molecules for molecular stability and property specificity, demonstrating new versatility of molecular diffusion models. Code and data are freely available on GitHub.

3.
ArXiv ; 2024 Jun 19.
Article in English | MEDLINE | ID: mdl-38947934

ABSTRACT

We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representation and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFoldDB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB. ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop.

4.
bioRxiv ; 2024 Jun 15.
Article in English | MEDLINE | ID: mdl-38915688

ABSTRACT

The oviduct is the site of fertilization and preimplantation embryo development in mammals. Evidence suggests that gametes alter oviductal gene expression. To delineate the adaptive interactions between the oviduct and gamete/embryo, we performed a multi-omics characterization of oviductal tissues utilizing bulk RNA-sequencing (RNA-seq), single-cell RNA-sequencing (scRNA-seq), and proteomics collected from distal and proximal regions at various stages after mating in mice. We observed robust region-specific transcriptional signatures. Specifically, the presence of sperm induces genes involved in pro-inflammatory responses in the proximal region at 0.5 days post-coitus (dpc). Genes involved in inflammatory responses were expressed specifically by secretory epithelial cells in the oviduct. At 1.5 and 2.5 dpc, genes involved in pyruvate metabolism and glycolysis were enriched in the proximal region, potentially providing metabolic support for developing embryos. Abundant proteins in the oviductal fluid were differentially observed between naturally fertilized and superovulated samples. RNA-seq data were used to identify transcription factors predicted to influence protein abundance in the proteomic data via a novel transformer-based machine learning model that integrates transcriptomics and proteomics data. The transformer model identified influential transcription factors, and its predicted protein expression correlated with the in vivo-derived data. In conclusion, our multi-omics characterization and subsequent in vivo confirmation of proteins/RNAs indicate that the oviduct is adaptive and responsive to the presence of sperm and embryos in a spatiotemporal manner.

5.
ArXiv ; 2024 Jun 06.
Article in English | MEDLINE | ID: mdl-38827451

ABSTRACT

The effects of ligand binding on protein structures and their in vivo functions carry numerous implications for modern biomedical research and biotechnology development efforts such as drug discovery. Although several deep learning (DL) methods and benchmarks designed for protein-ligand docking have recently been introduced, to date no prior works have systematically studied the behavior of docking methods within the practical context of (1) using predicted (apo) protein structures for docking (e.g., for broad applicability); (2) docking multiple ligands concurrently to a given target protein (e.g., for enzyme design); and (3) having no prior knowledge of binding pockets (e.g., for pocket generalization). To enable a deeper understanding of docking methods' real-world utility, we introduce PoseBench, the first comprehensive benchmark for practical protein-ligand docking. PoseBench enables researchers to rigorously and systematically evaluate DL docking methods for apo-to-holo protein-ligand docking and protein-ligand structure generation using both single and multi-ligand benchmark datasets, the latter of which we introduce for the first time to the DL community. Empirically, using PoseBench, we find that all recent DL docking methods but one fail to generalize to multi-ligand protein targets and also that template-based docking algorithms perform equally well or better for multi-ligand docking as recent single-ligand DL docking methods, suggesting areas of improvement for future work. Code, data, tutorials, and benchmark results are available at https://github.com/BioinfoMachineLearning/PoseBench.

6.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-38856167

ABSTRACT

The genome-wide single-cell chromosome conformation capture technique, i.e. single-cell Hi-C (scHi-C), was recently developed to interrogate the conformation of the genome of individual cells. However, single-cell Hi-C data are much sparser than bulk Hi-C data of a population of cells, and noise in single-cell Hi-C makes it difficult to apply and analyze them in biological research. Here, we developed the first generative diffusion models (HiCDiff) to denoise single-cell Hi-C data in the form of chromosomal contact matrices. HiCDiff uses a deep residual network to remove the noise in the reverse process of diffusion and can be trained in both unsupervised and supervised learning modes. Benchmarked on several single-cell Hi-C test datasets, the diffusion models substantially remove the noise in single-cell Hi-C data. The unsupervised HiCDiff outperforms most supervised non-diffusion deep learning methods and achieves performance comparable to the state-of-the-art supervised deep learning method in terms of multiple metrics, demonstrating that diffusion models are a useful approach to denoising single-cell Hi-C data. Moreover, its good performance holds on denoising bulk Hi-C data.


Subject(s)
Single-Cell Analysis; Single-Cell Analysis/methods; Humans; Computational Biology/methods; Deep Learning; Algorithms
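
The HiCDiff abstract above refers to removing noise "in the reverse process of diffusion". For orientation, the forward (noising) process that such a reverse process learns to undo can be sketched as follows. This is the standard closed-form sample from the denoising diffusion probabilistic model literature, not HiCDiff's code; the linear noise schedule, its default values, and the 3x3 contact matrix are assumptions chosen purely for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 1..t (t is 1-indexed)."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def q_sample(x0, betas, t, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in one shot."""
    abar = alpha_bar(betas, t)
    signal, noise = math.sqrt(abar), math.sqrt(1.0 - abar)
    # Noise is added independently per entry, so matrix symmetry is not kept.
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in x0]


# Hypothetical normalized contact matrix (toy values, not real Hi-C data).
contacts = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
betas = linear_beta_schedule(1000)
rng = random.Random(0)

slightly_noisy = q_sample(contacts, betas, 10, rng)   # still close to contacts
mostly_noise = q_sample(contacts, betas, 1000, rng)   # almost pure Gaussian noise
```

A denoising network is then trained to reverse these steps; the abstract states that HiCDiff uses a deep residual network for that role.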
7.
Nat Methods ; 21(7): 1340-1348, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38918604

ABSTRACT

The EMDataResource Ligand Model Challenge aimed to assess the reliability and reproducibility of modeling ligands bound to protein and protein-nucleic acid complexes in cryogenic electron microscopy (cryo-EM) maps determined at near-atomic (1.9-2.5 Å) resolution. Three published maps were selected as targets: Escherichia coli beta-galactosidase with inhibitor, SARS-CoV-2 virus RNA-dependent RNA polymerase with covalently bound nucleotide analog and SARS-CoV-2 virus ion channel ORF3a with bound lipid. Sixty-one models were submitted from 17 independent research groups, each with supporting workflow details. The quality of submitted ligand models and surrounding atoms was analyzed by visual inspection and quantification of local map quality, model-to-map fit, geometry, energetics and contact scores. A composite rather than a single score was needed to assess macromolecule+ligand model quality. These observations lead us to recommend best practices for assessing cryo-EM structures of liganded macromolecules reported at near-atomic resolution.


Subject(s)
Cryoelectron Microscopy; Models, Molecular; Cryoelectron Microscopy/methods; Ligands; SARS-CoV-2; COVID-19/virology; Escherichia coli; beta-Galactosidase/chemistry; beta-Galactosidase/metabolism; Protein Conformation; Reproducibility of Results
8.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-38860738

ABSTRACT

Picking protein particles in cryo-electron microscopy (cryo-EM) micrographs is a crucial step in cryo-EM-based structure determination. However, existing methods trained on a limited amount of cryo-EM data still cannot accurately pick protein particles from noisy cryo-EM images. General foundational artificial intelligence-based image segmentation models such as Meta's Segment Anything Model (SAM) cannot segment protein particles well because their training data do not include cryo-EM images. Here, we present a novel approach (CryoSegNet) that integrates SAM with an attention-gated U-shape network (U-Net) specially designed and trained for cryo-EM particle picking. The U-Net is first trained on a large cryo-EM image dataset and then used to generate input from original cryo-EM images for SAM to pick particles. CryoSegNet shows both high precision and recall in segmenting protein particles from cryo-EM micrographs, irrespective of protein type, shape and size. On several independent datasets of various protein types, CryoSegNet outperforms two top machine learning particle pickers, crYOLO and Topaz, as well as SAM itself. The average resolution of density maps reconstructed from the particles picked by CryoSegNet is 3.33 Å, 7% better than 3.58 Å of Topaz and 14% better than 3.87 Å of crYOLO. It is publicly available at https://github.com/jianlin-cheng/CryoSegNet.


Subject(s)
Cryoelectron Microscopy; Image Processing, Computer-Assisted; Cryoelectron Microscopy/methods; Image Processing, Computer-Assisted/methods; Proteins/chemistry; Artificial Intelligence; Algorithms; Databases, Protein
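
The percentages quoted in the CryoSegNet abstract above can be reproduced from the three reported resolutions. The short check below assumes "better" means the relative reduction in the resolution value with respect to the competing method (a lower value in Å is better); the function name is illustrative.

```python
def relative_improvement(baseline, new):
    """Fractional reduction of `new` relative to `baseline` (lower is better)."""
    return (baseline - new) / baseline


# Average resolutions in angstroms, as reported in the abstract.
cryosegnet, topaz, cryolo = 3.33, 3.58, 3.87

vs_topaz = relative_improvement(topaz, cryosegnet)    # 0.25 / 3.58
vs_cryolo = relative_improvement(cryolo, cryosegnet)  # 0.54 / 3.87

print(f"{vs_topaz:.1%} {vs_cryolo:.1%}")  # 7.0% 14.0%
```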
9.
Biomolecules ; 14(5), 2024 May 13.
Article in English | MEDLINE | ID: mdl-38785981

ABSTRACT

The quality prediction of quaternary structure models of a protein complex, in the absence of its true structure, is known as the Estimation of Model Accuracy (EMA). EMA is useful for ranking predicted protein complex structures and using them appropriately in biomedical research, such as protein-protein interaction studies, protein design, and drug discovery. With the advent of more accurate protein complex (multimer) prediction tools, such as AlphaFold2-Multimer and ESMFold, the estimation of the accuracy of protein complex structures has attracted increasing attention. Many deep learning methods have been developed to tackle this problem; however, there is a noticeable absence of a comprehensive overview of these methods to facilitate future development. Addressing this gap, we present a review of deep learning EMA methods for protein complex structures developed in the past several years, analyzing their methodologies, data and feature construction. We also provide a prospective summary of some potential new developments for further improving the accuracy of the EMA methods.


Subject(s)
Deep Learning; Protein Structure, Quaternary; Proteins; Proteins/chemistry; Models, Molecular; Humans
10.
Sci Data ; 11(1): 458, 2024 May 06.
Article in English | MEDLINE | ID: mdl-38710720

ABSTRACT

The advent of single-particle cryo-electron microscopy (cryo-EM) has brought forth a new era of structural biology, enabling the routine determination of large biological molecules and their complexes at atomic resolution. The high-resolution structures of biological macromolecules and their complexes significantly expedite biomedical research and drug discovery. However, automatically and accurately building atomic models from high-resolution cryo-EM density maps is still time-consuming and challenging when template-based models are unavailable. Artificial intelligence (AI) methods such as deep learning trained on a limited amount of labeled cryo-EM density maps generate inaccurate atomic models. To address this issue, we created a dataset called Cryo2StructData consisting of 7,600 preprocessed cryo-EM density maps whose voxels are labelled according to their corresponding known atomic structures for training and testing AI methods to build atomic models from cryo-EM density maps. Cryo2StructData is larger than existing, publicly available datasets for training AI methods to build atomic protein structures from cryo-EM density maps. We trained and tested deep learning models on Cryo2StructData to validate its quality, showing that it is ready to be used to train and test AI methods for building atomic models.


Subject(s)
Artificial Intelligence; Cryoelectron Microscopy; Proteins; Cryoelectron Microscopy/methods; Proteins/chemistry; Proteins/ultrastructure; Models, Molecular; Protein Conformation
11.
J Hazard Mater ; 470: 134208, 2024 May 15.
Article in English | MEDLINE | ID: mdl-38593663

ABSTRACT

This study introduces an innovative strategy for the rapid and accurate identification of pesticide residues in agricultural products by combining surface-enhanced Raman spectroscopy (SERS) with a state-of-the-art transformer model, termed SERSFormer. Gold-silver core-shell nanoparticles were synthesized and served as high-performance SERS substrates, which possess well-defined structures, uniform dispersion, and a core-shell composition with an average diameter of 21.44 ± 4.02 nm, as characterized by TEM-EDS. SERSFormer employs sophisticated, task-specific data processing techniques and CNN embedders, powered by an architecture featuring weight-shared multi-head self-attention transformer encoder layers. The SERSFormer model demonstrated exceptional proficiency in qualitative analysis, successfully classifying six categories, including five pesticides (coumaphos, oxamyl, carbophenothion, thiabendazole, and phosmet) and a control group of spinach data, with 98.4% accuracy. For quantitative analysis, the model accurately predicted pesticide concentrations with a mean absolute error of 0.966, a mean squared error of 1.826, and an R² score of 0.849. This novel approach, which combines SERS with machine learning and is supported by robust transformer models, showcases the potential for real-time pesticide detection to improve food safety in the agricultural and food industries.


Subject(s)
Gold; Machine Learning; Metal Nanoparticles; Pesticides; Silver; Spectrum Analysis, Raman; Spinacia oleracea; Spectrum Analysis, Raman/methods; Spinacia oleracea/chemistry; Metal Nanoparticles/chemistry; Silver/chemistry; Gold/chemistry; Pesticides/analysis; Food Contamination/analysis; Pesticide Residues/analysis
12.
Nat Rev Bioeng ; 2(2): 136-154, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38576453

ABSTRACT

Denoising diffusion models embody a type of generative artificial intelligence that can be applied in computer vision, natural language processing and bioinformatics. In this Review, we introduce the key concepts and theoretical foundations of three diffusion modelling frameworks (denoising diffusion probabilistic models, noise-conditioned scoring networks and score stochastic differential equations). We then explore their applications in bioinformatics and computational biology, including protein design and generation, drug and small-molecule design, protein-ligand interaction modelling, cryo-electron microscopy image data analysis and single-cell data analysis. Finally, we highlight open-source diffusion model tools and consider the future applications of diffusion models in bioinformatics.

13.
Chem Biol Interact ; 394: 110993, 2024 May 01.
Article in English | MEDLINE | ID: mdl-38604394

ABSTRACT

Aldehyde dehydrogenase 7A1 (ALDH7A1) catalyzes a step of lysine catabolism. Certain missense mutations in the ALDH7A1 gene cause pyridoxine dependent epilepsy (PDE), a rare autosomal neurometabolic disorder with recessive inheritance that affects almost 1:65,000 live births and is classically characterized by recurrent seizures from the neonatal period. We report a biochemical, structural, and computational study of two novel ALDH7A1 missense mutations that were identified in a child with rare recurrent seizures from the third month of life. The mutations affect two residues in the oligomer interfaces of ALDH7A1, Arg134 and Arg441 (Arg162 and Arg469 in the HGVS nomenclature). The corresponding enzyme variants R134S and R441C (p.Arg162Ser and p.Arg469Cys in the HGVS nomenclature) were expressed in Escherichia coli and purified. R134S and R441C have 10,000- and 50-fold lower catalytic efficiency than wild-type ALDH7A1, respectively. Sedimentation velocity analytical ultracentrifugation shows that R134S is defective in tetramerization, remaining locked in a dimeric state even in the presence of the tetramer-inducing coenzyme NAD+. Because the tetramer is the active form of ALDH7A1, the defect in oligomerization explains the very low catalytic activity of R134S. In contrast, R441C exhibits wild-type oligomerization behavior, and the 2.0 Å resolution crystal structure of R441C complexed with NAD+ revealed no obvious structural perturbations when compared to the wild-type enzyme structure. Molecular dynamics simulations suggest that the mutation of Arg441 to Cys may increase intersubunit ion pairs and alter the dynamics of the active site gate. Our biochemical, structural, and computational data on two novel clinical variants of ALDH7A1 add to the complexity of the molecular determinants underlying pyridoxine dependent epilepsy.


Subject(s)
Aldehyde Dehydrogenase; Mutation, Missense; Aldehyde Dehydrogenase/genetics; Aldehyde Dehydrogenase/chemistry; Aldehyde Dehydrogenase/metabolism; Humans; Molecular Dynamics Simulation; Crystallography, X-Ray; Models, Molecular; Epilepsy/genetics; Infant; Male
14.
World J Oncol ; 15(2): 149-168, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38545477

ABSTRACT

Pigs are playing an increasingly vital role as translational biomedical models for studying human pathophysiology. The annotation of the pig genome was a huge step forward in translatability of pigs as a biomedical model for various human diseases. Similarities between humans and pigs in terms of anatomy, physiology, genetics, and immunology have allowed pigs to become a comprehensive preclinical model for human diseases. With a diverse range, from craniofacial and ophthalmology to reproduction, wound healing, musculoskeletal, and cancer, pigs have provided a seminal understanding of human pathophysiology. This review focuses on the current research using pigs as preclinical models for cancer research and highlights the strengths and opportunities for studying various human cancers.

15.
Res Sq ; 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38343795

ABSTRACT

The EMDataResource Ligand Model Challenge aimed to assess the reliability and reproducibility of modeling ligands bound to protein and protein/nucleic-acid complexes in cryogenic electron microscopy (cryo-EM) maps determined at near-atomic (1.9-2.5 Å) resolution. Three published maps were selected as targets: E. coli beta-galactosidase with inhibitor, SARS-CoV-2 RNA-dependent RNA polymerase with covalently bound nucleotide analog, and SARS-CoV-2 ion channel ORF3a with bound lipid. Sixty-one models were submitted from 17 independent research groups, each with supporting workflow details. We found that (1) the quality of submitted ligand models and surrounding atoms varied, as judged by visual inspection and quantification of local map quality, model-to-map fit, geometry, energetics, and contact scores, and (2) a composite rather than a single score was needed to assess macromolecule+ligand model quality. These observations lead us to recommend best practices for assessing cryo-EM structures of liganded macromolecules reported at near-atomic resolution.

16.
Bioinformatics ; 40(2), 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38373819

ABSTRACT

MOTIVATION: The field of geometric deep learning has recently had a profound impact on several scientific domains such as protein structure prediction and design, leading to methodological advancements within and outside of the realm of traditional machine learning. Within this spirit, in this work, we introduce GCPNet, a new chirality-aware SE(3)-equivariant graph neural network designed for representation learning of 3D biomolecular graphs. We show that GCPNet, unlike previous representation learning methods for 3D biomolecules, is widely applicable to a variety of invariant or equivariant node-level, edge-level, and graph-level tasks on biomolecular structures while being able to (1) learn important chiral properties of 3D molecules and (2) detect external force fields. RESULTS: Across four distinct molecular-geometric tasks, we demonstrate that GCPNet's predictions (1) for protein-ligand binding affinity achieve a statistically significant correlation of 0.608, more than 5% greater than current state-of-the-art methods; (2) for protein structure ranking achieve statistically significant target-local and dataset-global correlations of 0.616 and 0.871, respectively; (3) for Newtonian many-body systems modeling achieve a task-averaged mean squared error less than 0.01, more than 15% better than current methods; and (4) for molecular chirality recognition achieve a state-of-the-art prediction accuracy of 98.7%, better than any other machine learning method to date. AVAILABILITY AND IMPLEMENTATION: The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/GCPNet.


Subject(s)
Machine Learning; Neural Networks, Computer; Software
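
The GCPNet abstract above stresses being "chirality-aware" while remaining SE(3)-equivariant. A small geometric illustration of why that distinction matters: the scalar triple product of three bond vectors is unchanged by rotation but changes sign under reflection, so it can tell a molecule from its mirror image, whereas pairwise distances cannot. This is a textbook property and not a description of GCPNet's internals; the coordinates and helper names are hypothetical.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def signed_volume(center, p1, p2, p3):
    """Scalar triple product of the three vectors from center to p1, p2, p3."""
    a, b, c = sub(p1, center), sub(p2, center), sub(p3, center)
    return dot(a, cross(b, c))


def rotate_z_90(p):
    """Proper rotation by 90 degrees about the z axis."""
    return (-p[1], p[0], p[2])


def reflect_x(p):
    """Mirror through the yz plane (an improper transformation)."""
    return (-p[0], p[1], p[2])


# Hypothetical stereocenter at the origin with three neighbours.
center = (0, 0, 0)
neighbours = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

original = signed_volume(center, *neighbours)
rotated = signed_volume(rotate_z_90(center), *map(rotate_z_90, neighbours))
mirrored = signed_volume(reflect_x(center), *map(reflect_x, neighbours))

print(original, rotated, mirrored)  # 1 1 -1
```

A model that uses only reflection-invariant features such as interatomic distances would assign the original and mirrored configurations identical representations.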
17.
Protein Sci ; 33(3): e4932, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38380738

ABSTRACT

Estimating the accuracy of protein structural models is a critical task in protein bioinformatics. The need for robust methods in the estimation of protein model accuracy (EMA) is prevalent in the field of protein structure prediction, where computationally-predicted structures need to be screened rapidly for the reliability of the positions predicted for each of their amino acid residues and their overall quality. Current methods proposed for EMA are either coupled tightly to existing protein structure prediction methods or evaluate protein structures without sufficiently leveraging the rich, geometric information available in such structures to guide accuracy estimation. In this work, we propose a geometric message passing neural network referred to as the geometry-complete perceptron network for protein structure EMA (GCPNet-EMA), where we demonstrate through rigorous computational benchmarks that GCPNet-EMA's accuracy estimations are 47% faster and more than 10% (6%) more correlated with ground-truth measures of per-residue (per-target) structural accuracy compared to baseline state-of-the-art methods for tertiary (multimer) structure EMA including AlphaFold 2. The source code and data for GCPNet-EMA are available on GitHub, and a public web server implementation is freely available.


Subject(s)
Neural Networks, Computer; Proteins; Reproducibility of Results; Proteins/chemistry; Software; Amino Acids; Computational Biology/methods
18.
Bioinformatics ; 40(3), 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38407301

ABSTRACT

MOTIVATION: Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structures of large protein complexes. Picking single protein particles from cryo-EM micrographs (images) is a crucial step in reconstructing protein structures from them. However, the widely used template-based particle picking process requires some manual particle picking and is labor-intensive and time-consuming. Though machine learning and artificial intelligence (AI) can potentially automate particle picking, the current AI methods pick particles with low precision or low recall. The erroneously picked particles can severely reduce the quality of reconstructed protein structures, especially for micrographs with a low signal-to-noise ratio. RESULTS: To address these shortcomings, we devised CryoTransformer based on transformers, residual networks, and image processing techniques to accurately pick protein particles from cryo-EM micrographs. CryoTransformer was trained and tested on the largest labeled cryo-EM protein particle dataset, CryoPPP. It outperforms the current state-of-the-art machine learning methods of particle picking in terms of the resolution of 3D density maps reconstructed from the picked particles as well as F1-score, and is poised to facilitate the automation of cryo-EM protein particle picking. AVAILABILITY AND IMPLEMENTATION: The source code and data for CryoTransformer are openly available at: https://github.com/jianlin-cheng/CryoTransformer.


Subject(s)
Artificial Intelligence; Software; Cryoelectron Microscopy/methods; Machine Learning; Image Processing, Computer-Assisted/methods; Proteins
19.
bioRxiv ; 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38260535

ABSTRACT

Accurately building three-dimensional (3D) atomic structures from 3D cryo-electron microscopy (cryo-EM) density maps is a crucial step in the cryo-EM-based determination of the structures of protein complexes. Despite improvements in the resolution of 3D cryo-EM density maps, the de novo conversion of density maps into 3D atomic structures for protein complexes that do not have accurate homologous or predicted structures to be used as templates remains a significant challenge. Here, we introduce Cryo2Struct, a fully automated ab initio cryo-EM structure modeling method that first utilizes a 3D transformer to identify atoms and amino acid types in cryo-EM density maps, and then employs a novel Hidden Markov Model (HMM) to connect predicted atoms to build backbone structures of proteins. Tested on a standard test dataset of 128 cryo-EM density maps with varying resolutions (2.1-5.6 Å) and different numbers of residues (730-8,416), Cryo2Struct built substantially more accurate and complete protein structural models than the widely used ab initio method, Phenix, in terms of multiple evaluation metrics. Moreover, on a new test dataset of 500 recently released density maps with varying resolutions (1.9-4.0 Å) and different numbers of residues (234-8,828), it built more accurate models than on the standard dataset. Its performance is also robust to changes in the resolution of density maps and the size of protein structures.

20.
ArXiv ; 2024 Feb 05.
Article in English | MEDLINE | ID: mdl-36798459

ABSTRACT

Motivation: Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods have recently been proposed for generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. However, such methods are unable to learn important geometric and physical properties of 3D molecules during molecular graph generation, as they adopt molecule-agnostic and non-geometric GNNs as their 3D graph denoising networks, which negatively impacts their ability to effectively scale to datasets of large 3D molecules. Results: In this work, we address these gaps by introducing the Geometry-Complete Diffusion Model (GCDM) for 3D molecule generation, which outperforms existing 3D molecular diffusion models by significant margins across conditional and unconditional settings for the QM9 dataset as well as for the larger GEOM-Drugs dataset. Importantly, we demonstrate that the geometry-complete denoising process GCDM learns for 3D molecule generation allows the model to generate realistic and stable large molecules at the scale of GEOM-Drugs, whereas previous methods fail to do so with the features they learn. Additionally, we show that extensions of GCDM can not only effectively design 3D molecules for specific protein pockets but also that GCDM's geometric features can effectively be repurposed to directly optimize the geometry and chemical composition of existing 3D molecules for specific molecular properties, demonstrating new, real-world versatility of molecular diffusion models. Availability: Our source code and data are freely available on GitHub.
