Search | VHL Regional Portal

1.

Harnessing the 3D-Beacons Network: A Comprehensive Guide to Accessing and Displaying Protein Structure Data.

Magaña, Paulyna; Nair, Sreenath; Varadi, Mihaly; Velankar, Sameer.

Curr Protoc ; 4(5): e1047, 2024 May.

Article in English | MEDLINE | ID: mdl-38720559

ABSTRACT

Recent advancements in protein structure determination and especially in protein structure prediction techniques have led to the availability of vast amounts of macromolecular structures. However, the accessibility and integration of these structures into scientific workflows are hindered by the lack of standardization among publicly available data resources. To address this issue, we introduced the 3D-Beacons Network, a unified platform that aims to establish a standardized framework for accessing and displaying protein structure data. In this article, we highlight the importance of standardized approaches for accessing protein structure data and showcase the capabilities of 3D-Beacons. We describe four protocols for finding and accessing macromolecular structures from various specialist data resources via 3D-Beacons. First, we describe three scenarios for programmatically accessing and retrieving data using the 3D-Beacons API. Next, we show how to perform sequence-based searches to find structures from model providers. Then, we demonstrate how to search for structures and fetch them directly into a workflow using JalView. Finally, we outline the process of facilitating access to data from providers interested in contributing their structures to the 3D-Beacons Network. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC. Basic Protocol 1: Programmatic access to the 3D-Beacons API Basic Protocol 2: Sequence-based search using the 3D-Beacons API Basic Protocol 3: Accessing macromolecules from 3D-Beacons with JalView Basic Protocol 4: Enhancing data accessibility through 3D-Beacons.
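As a rough illustration of the programmatic access described in Basic Protocol 1, the following Python sketch builds a summary request for a UniProt accession and extracts the models listed in the response. The base URL, the `/uniprot/summary/{accession}.json` path, and the `structures`/`summary`/`provider`/`model_identifier`/`model_url` field names are assumptions about the 3D-Beacons hub API and should be checked against its current documentation; the function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the 3D-Beacons hub API; verify before use.
BASE_URL = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api"


def summary_url(uniprot_accession: str) -> str:
    """Build the (assumed) summary endpoint URL for one UniProt accession."""
    return f"{BASE_URL}/uniprot/summary/{quote(uniprot_accession)}.json"


def list_models(payload: dict) -> list:
    """Extract (provider, model identifier, model URL) tuples from a response.

    The field names used here are assumptions about the response layout.
    """
    models = []
    for entry in payload.get("structures", []):
        summary = entry.get("summary", {})
        models.append((
            summary.get("provider", ""),
            summary.get("model_identifier", ""),
            summary.get("model_url", ""),
        ))
    return models


def fetch_models(uniprot_accession: str) -> list:
    """Query the hub over the network and list the models it reports."""
    with urlopen(summary_url(uniprot_accession), timeout=30) as response:
        return list_models(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes the parsing logic easy to check against a saved response before running it inside a larger workflow.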

Subject(s)

Protein Conformation , Proteins , Proteins/chemistry , Databases, Protein , Software

2.

PDB NextGen Archive: centralizing access to integrated annotations and enriched structural information by the Worldwide Protein Data Bank.

Choudhary, Preeti; Feng, Zukang; Berrisford, John; Chao, Henry; Ikegawa, Yasuyo; Peisach, Ezra; Piehl, Dennis W; Smith, James; Tanweer, Ahsan; Varadi, Mihaly; Westbrook, John D; Young, Jasmine Y; Patwardhan, Ardan; Morris, Kyle L; Hoch, Jeffrey C; Kurisu, Genji; Velankar, Sameer; Burley, Stephen K.

Database (Oxford) ; 2024, 2024 May 27.

Article in English | MEDLINE | ID: mdl-38803272

ABSTRACT

The Protein Data Bank (PDB) is the global repository for public-domain experimentally determined 3D biomolecular structural information. The archival nature of the PDB presents certain challenges pertaining to updating or adding associated annotations from trusted external biodata resources. While each Worldwide PDB (wwPDB) partner has made best efforts to provide up-to-date external annotations, accessing and integrating information from disparate wwPDB data centers can be an involved process. To address this issue, the wwPDB has established the PDB Next Generation (or NextGen) Archive, developed to centralize and streamline access to enriched structural annotations from wwPDB partners and trusted external sources. At present, the NextGen Archive provides mappings between experimentally determined 3D structures of proteins and UniProt amino acid sequences, domain annotations from Pfam, SCOP2 and CATH databases and intra-molecular connectivity information. Since launch, the PDB NextGen Archive has seen substantial user engagement with over 3.5 million data file downloads, ensuring researchers have access to accurate, up-to-date and easily accessible structural annotations. Database URL: http://www.wwpdb.org/ftp/pdb-nextgen-archive-site.

Subject(s)

Databases, Protein , Molecular Sequence Annotation , Proteins/chemistry

3.

Identifying protein conformational states in the Protein Data Bank: Toward unlocking the potential of integrative dynamics studies.

Ellaway, Joseph I J; Anyango, Stephen; Nair, Sreenath; Zaki, Hossam A; Nadzirin, Nurul; Powell, Harold R; Gutmanas, Aleksandras; Varadi, Mihaly; Velankar, Sameer.

Struct Dyn ; 11(3): 034701, 2024 May.

Article in English | MEDLINE | ID: mdl-38774441

ABSTRACT

Studying protein dynamics and conformational heterogeneity is crucial for understanding biomolecular systems and treating disease. Despite the deposition of over 215 000 macromolecular structures in the Protein Data Bank and the advent of AI-based structure prediction tools such as AlphaFold2, RoseTTAFold, and ESMFold, static representations are typically produced, which fail to fully capture macromolecular motion. Here, we discuss the importance of integrating experimental structures with computational clustering to explore the conformational landscapes that manifest protein function. We describe the method developed by the Protein Data Bank in Europe - Knowledge Base to identify distinct conformational states, demonstrate the resource's primary use cases, through examples, and discuss the need for further efforts to annotate protein conformations with functional information. Such initiatives will be crucial in unlocking the potential of protein dynamics data, expediting drug discovery research, and deepening our understanding of macromolecular mechanisms.

4.

IHMCIF: An Extension of the PDBx/mmCIF Data Standard for Integrative Structure Determination Methods.

Vallat, Brinda; Webb, Benjamin M; Westbrook, John D; Goddard, Thomas D; Hanke, Christian A; Graziadei, Andrea; Peisach, Ezra; Zalevsky, Arthur; Sagendorf, Jared; Tangmunarunkit, Hongsuda; Voinea, Serban; Sekharan, Monica; Yu, Jian; Bonvin, Alexander A M J J; DiMaio, Frank; Hummer, Gerhard; Meiler, Jens; Tajkhorshid, Emad; Ferrin, Thomas E; Lawson, Catherine L; Leitner, Alexander; Rappsilber, Juri; Seidel, Claus A M; Jeffries, Cy M; Burley, Stephen K; Hoch, Jeffrey C; Kurisu, Genji; Morris, Kyle; Patwardhan, Ardan; Velankar, Sameer; Schwede, Torsten; Trewhella, Jill; Kesselman, Carl; Berman, Helen M; Sali, Andrej.

J Mol Biol ; : 168546, 2024 Mar 18.

Article in English | MEDLINE | ID: mdl-38508301

ABSTRACT

IHMCIF (github.com/ihmwg/IHMCIF) is a data information framework that supports archiving and disseminating macromolecular structures determined by integrative or hybrid modeling (IHM), and making them Findable, Accessible, Interoperable, and Reusable (FAIR). IHMCIF is an extension of the Protein Data Bank Exchange/macromolecular Crystallographic Information Framework (PDBx/mmCIF) that serves as the framework for the Protein Data Bank (PDB) to archive experimentally determined atomic structures of biological macromolecules and their complexes with one another and small molecule ligands (e.g., enzyme cofactors and drugs). IHMCIF serves as the foundational data standard for the PDB-Dev prototype system, developed for archiving and disseminating integrative structures. It utilizes a flexible data representation to describe integrative structures that span multiple spatiotemporal scales and structural states with definitions for restraints from a variety of experimental methods contributing to integrative structural biology. The IHMCIF extension was created with the benefit of considerable community input and recommendations gathered by the Worldwide Protein Data Bank (wwPDB) Task Force for Integrative or Hybrid Methods (wwpdb.org/task/hybrid). Herein, we describe the development of IHMCIF to support evolving methodologies and ongoing advancements in integrative structural biology. Ultimately, IHMCIF will facilitate the unification of PDB-Dev data and tools with the PDB archive so that integrative structures can be archived and disseminated through PDB.

5.

Restraint validation of biomolecular structures determined by NMR in the Protein Data Bank.

Baskaran, Kumaran; Ploskon, Eliza; Tejero, Roberto; Yokochi, Masashi; Harrus, Deborah; Liang, Yuhe; Peisach, Ezra; Persikova, Irina; Ramelot, Theresa A; Sekharan, Monica; Tolchard, James; Westbrook, John D; Bardiaux, Benjamin; Schwieters, Charles D; Patwardhan, Ardan; Velankar, Sameer; Burley, Stephen K; Kurisu, Genji; Hoch, Jeffrey C; Montelione, Gaetano T; Vuister, Geerten W; Young, Jasmine Y.

Structure ; 32(6): 824-837.e1, 2024 Jun 06.

Article in English | MEDLINE | ID: mdl-38490206

ABSTRACT

Biomolecular structure analysis from experimental NMR studies generally relies on restraints derived from a combination of experimental and knowledge-based data. A challenge for the structural biology community has been a lack of standards for representing these restraints, preventing the establishment of uniform methods of model-vs-data structure validation against restraints and limiting interoperability between restraint-based structure modeling programs. The NEF and NMR-STAR formats provide a standardized approach for representing commonly used NMR restraints. Using these restraint formats, a standardized validation system for assessing structural models of biopolymers against restraints has been developed and implemented in the wwPDB OneDep data deposition-validation-biocuration system. The resulting wwPDB restraint violation report provides a model vs. data assessment of biomolecule structures determined using distance and dihedral restraints, with extensions to other restraint types currently being implemented. These tools are useful for assessing NMR models, as well as for assessing biomolecular structure predictions based on distance restraints.
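To make the idea of a model-vs-data assessment against distance restraints concrete, here is a minimal Python sketch that measures how far an interatomic distance in a model falls outside a restraint's allowed range. It is not the wwPDB validation implementation: the `DistanceRestraint` class, the atom labels, and the 0.1 Å reporting tolerance are hypothetical choices made for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DistanceRestraint:
    """A hypothetical distance restraint between two labelled atoms."""
    atom_a: str
    atom_b: str
    lower: float  # lower bound in angstroms
    upper: float  # upper bound in angstroms


def violation(restraint: DistanceRestraint, coords: dict) -> float:
    """Return how far the model distance lies outside [lower, upper]."""
    distance = math.dist(coords[restraint.atom_a], coords[restraint.atom_b])
    if distance > restraint.upper:
        return distance - restraint.upper
    if distance < restraint.lower:
        return restraint.lower - distance
    return 0.0


def violated(restraints: list, coords: dict, tolerance: float = 0.1) -> list:
    """List (restraint, violation) pairs whose violation exceeds the tolerance."""
    report = []
    for restraint in restraints:
        excess = violation(restraint, coords)
        if excess > tolerance:
            report.append((restraint, excess))
    return report
```

For an ensemble of models, the same check could be run per model and the violations summarized per restraint; dihedral restraints would additionally need periodic handling of angle differences, which this sketch does not attempt.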

Subject(s)

Databases, Protein , Models, Molecular , Nuclear Magnetic Resonance, Biomolecular , Protein Conformation , Proteins , Nuclear Magnetic Resonance, Biomolecular/methods , Proteins/chemistry , Software

6.

Community recommendations on cryoEM data archiving and validation.

Kleywegt, Gerard J; Adams, Paul D; Butcher, Sarah J; Lawson, Catherine L; Rohou, Alexis; Rosenthal, Peter B; Subramaniam, Sriram; Topf, Maya; Abbott, Sanja; Baldwin, Philip R; Berrisford, John M; Bricogne, Gérard; Choudhary, Preeti; Croll, Tristan I; Danev, Radostin; Ganesan, Sai J; Grant, Timothy; Gutmanas, Aleksandras; Henderson, Richard; Heymann, J Bernard; Huiskonen, Juha T; Istrate, Andrei; Kato, Takayuki; Lander, Gabriel C; Lok, Shee Mei; Ludtke, Steven J; Murshudov, Garib N; Pye, Ryan; Pintilie, Grigore D; Richardson, Jane S; Sachse, Carsten; Salih, Osman; Scheres, Sjors H W; Schroeder, Gunnar F; Sorzano, Carlos Oscar S; Stagg, Scott M; Wang, Zhe; Warshamanage, Rangana; Westbrook, John D; Winn, Martyn D; Young, Jasmine Y; Burley, Stephen K; Hoch, Jeffrey C; Kurisu, Genji; Morris, Kyle; Patwardhan, Ardan; Velankar, Sameer.

IUCrJ ; 11(Pt 2): 140-151, 2024 Mar 01.

Article in English | MEDLINE | ID: mdl-38358351

ABSTRACT

In January 2020, a workshop was held at EMBL-EBI (Hinxton, UK) to discuss data requirements for the deposition and validation of cryoEM structures, with a focus on single-particle analysis. The meeting was attended by 47 experts in data processing, model building and refinement, validation, and archiving of such structures. This report describes the workshop's motivation and history, the topics discussed, and the resulting consensus recommendations. Some challenges for future methods-development efforts in this area are also highlighted, as is the implementation to date of some of the recommendations.

Subject(s)

Data Curation , Cryoelectron Microscopy/methods

7.

Restraint Validation of Biomolecular Structures Determined by NMR in the Protein Data Bank.

Baskaran, Kumaran; Ploskon, Eliza; Tejero, Roberto; Yokochi, Masashi; Harrus, Deborah; Liang, Yuhe; Peisach, Ezra; Persikova, Irina; Ramelot, Theresa A; Sekharan, Monica; Tolchard, James; Westbrook, John D; Bardiaux, Benjamin; Schwieters, Charles D; Patwardhan, Ardan; Velankar, Sameer; Burley, Stephen K; Kurisu, Genji; Hoch, Jeffrey C; Montelione, Gaetano T; Vuister, Geerten W; Young, Jasmine Y.

bioRxiv ; 2024 Jan 22.

Article in English | MEDLINE | ID: mdl-38328042

ABSTRACT

Biomolecular structure analysis from experimental NMR studies generally relies on restraints derived from a combination of experimental and knowledge-based data. A challenge for the structural biology community has been a lack of standards for representing these restraints, preventing the establishment of uniform methods of model-vs-data structure validation against restraints and limiting interoperability between restraint-based structure modeling programs. The NMR exchange (NEF) and NMR-STAR formats provide a standardized approach for representing commonly used NMR restraints. Using these restraint formats, a standardized validation system for assessing structural models of biopolymers against restraints has been developed and implemented in the wwPDB OneDep data deposition-validation-biocuration system. The resulting wwPDB Restraint Violation Report provides a model vs. data assessment of biomolecule structures determined using distance and dihedral restraints, with extensions to other restraint types currently being implemented. These tools are useful for assessing NMR models, as well as for assessing biomolecular structure predictions based on distance restraints.

8.

EMBL's European Bioinformatics Institute (EMBL-EBI) in 2023.

Thakur, Matthew; Buniello, Annalisa; Brooksbank, Catherine; Gurwitz, Kim T; Hall, Matthew; Hartley, Matthew; Hulcoop, David G; Leach, Andrew R; Marques, Diana; Martin, Maria; Mithani, Aziz; McDonagh, Ellen M; Mutasa-Gottgens, Euphemia; Ochoa, David; Perez-Riverol, Yasset; Stephenson, James; Varadi, Mihaly; Velankar, Sameer; Vizcaino, Juan Antonio; Witham, Rick; McEntyre, Johanna.

Nucleic Acids Res ; 52(D1): D10-D17, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-38015445

ABSTRACT

The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) is one of the world's leading sources of public biomolecular data. Based at the Wellcome Genome Campus in Hinxton, UK, EMBL-EBI is one of six sites of the European Molecular Biology Laboratory (EMBL), Europe's only intergovernmental life sciences organisation. This overview summarises the latest developments in the services provided by EMBL-EBI data resources to scientific communities globally. These developments aim to ensure EMBL-EBI resources meet the current and future needs of these scientific communities, accelerating the impact of open biological data for all.

Subject(s)

Academies and Institutes , Computational Biology , Computational Biology/organization & administration , Computational Biology/trends , Academies and Institutes/organization & administration , Academies and Institutes/trends , Databases, Nucleic Acid , Europe

9.

Community recommendations on cryoEM data archiving and validation: Outcomes of a wwPDB/EMDB workshop on cryoEM data management, deposition and validation.

Kleywegt, Gerard J; Adams, Paul D; Butcher, Sarah J; Lawson, Catherine L; Rohou, Alexis; Rosenthal, Peter B; Subramaniam, Sriram; Topf, Maya; Abbott, Sanja; Baldwin, Philip R; Berrisford, John M; Bricogne, Gérard; Choudhary, Preeti; Croll, Tristan I; Danev, Radostin; Ganesan, Sai J; Grant, Timothy; Gutmanas, Aleksandras; Henderson, Richard; Heymann, J Bernard; Huiskonen, Juha T; Istrate, Andrei; Kato, Takayuki; Lander, Gabriel C; Lok, Shee-Mei; Ludtke, Steven J; Murshudov, Garib N; Pye, Ryan; Pintilie, Grigore D; Richardson, Jane S; Sachse, Carsten; Salih, Osman; Scheres, Sjors H W; Schroeder, Gunnar F; Sorzano, Carlos Oscar S; Stagg, Scott M; Wang, Zhe; Warshamanage, Rangana; Westbrook, John D; Winn, Martyn D; Young, Jasmine Y; Burley, Stephen K; Hoch, Jeffrey C; Kurisu, Genji; Morris, Kyle; Patwardhan, Ardan; Velankar, Sameer.

ArXiv ; 2024 Feb 02.

Article in English | MEDLINE | ID: mdl-38076521

ABSTRACT

In January 2020, a workshop was held at EMBL-EBI (Hinxton, UK) to discuss data requirements for deposition and validation of cryoEM structures, with a focus on single-particle analysis. The meeting was attended by 47 experts in data processing, model building and refinement, validation, and archiving of such structures. This report describes the workshop's motivation and history, the topics discussed, and consensus recommendations resulting from the workshop. Some challenges for future methods-development efforts in this area are also highlighted, as is the implementation to date of some of the recommendations.

10.

AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences.

Varadi, Mihaly; Bertoni, Damian; Magana, Paulyna; Paramval, Urmila; Pidruchna, Ivanna; Radhakrishnan, Malarvizhi; Tsenkov, Maxim; Nair, Sreenath; Mirdita, Milot; Yeo, Jingi; Kovalevskiy, Oleg; Tunyasuvunakool, Kathryn; Laydon, Agata; Zídek, Augustin; Tomlinson, Hamish; Hariharan, Dhavanthi; Abrahamson, Josh; Green, Tim; Jumper, John; Birney, Ewan; Steinegger, Martin; Hassabis, Demis; Velankar, Sameer.

Nucleic Acids Res ; 52(D1): D368-D375, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37933859

ABSTRACT

The AlphaFold Protein Structure Database (AlphaFold DB, https://alphafold.ebi.ac.uk) has significantly impacted structural biology by amassing over 214 million predicted protein structures, expanding from the initial 300k structures released in 2021. Enabled by the groundbreaking AlphaFold2 artificial intelligence (AI) system, the predictions archived in AlphaFold DB have been integrated into primary data resources such as PDB, UniProt, Ensembl, InterPro and MobiDB. Our manuscript details subsequent enhancements in data archiving, covering successive releases encompassing model organisms, global health proteomes, Swiss-Prot integration, and a host of curated protein datasets. We detail the data access mechanisms of AlphaFold DB, from direct file access via FTP to advanced queries using Google Cloud Public Datasets and the programmatic access endpoints of the database. We also discuss the improvements and services added since its initial release, including enhancements to the Predicted Aligned Error viewer, customisation options for the 3D viewer, and improvements in the search engine of AlphaFold DB.

The AlphaFold Protein Structure Database (AlphaFold DB) is a massive digital library of predicted protein structures, with over 214 million entries, marking a 500-times expansion in size since its initial release in 2021. The structures are predicted using Google DeepMind's AlphaFold 2 artificial intelligence (AI) system. Our new report highlights the latest updates we have made to this database. We have added more data on specific organisms and proteins related to global health and expanded to cover almost the complete UniProt database, a primary data resource of protein sequences. We also made it easier for our users to access the data by directly downloading files or using advanced cloud-based tools. Finally, we have also improved how users view and search through these protein structures, making the user experience smoother and more informative. In short, AlphaFold DB has been growing rapidly and has become more user-friendly and robust to support the broader scientific community.
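The programmatic access mentioned in the abstract can be sketched in Python as follows. This only builds request URLs; the `/api/prediction/{accession}` path, the `AF-{accession}-F{fragment}-model_v{version}.cif` file naming pattern, and the default version number are assumptions that may not match the current release, and the function names are hypothetical.

```python
from urllib.parse import quote

# Assumed roots for the AlphaFold DB API and file downloads; verify before use.
API_ROOT = "https://alphafold.ebi.ac.uk/api"
FILES_ROOT = "https://alphafold.ebi.ac.uk/files"


def prediction_url(uniprot_accession: str) -> str:
    """Build the (assumed) metadata endpoint URL for one UniProt accession."""
    return f"{API_ROOT}/prediction/{quote(uniprot_accession)}"


def model_file_url(uniprot_accession: str, version: int = 4, fragment: int = 1) -> str:
    """Build the (assumed) mmCIF download URL for one predicted model.

    The version default is a guess and changes between database releases.
    """
    name = f"AF-{uniprot_accession}-F{fragment}-model_v{version}.cif"
    return f"{FILES_ROOT}/{quote(name)}"
```

Because file versions change between releases, it is probably safer in practice to request the metadata record first and use the download links it reports than to hard-code a version number.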

Subject(s)

Artificial Intelligence , Protein Structure, Secondary , Proteome , Amino Acid Sequence , Databases, Protein , Search Engine , Proteins/chemistry

11.

PDBImages: a command-line tool for automated macromolecular structure visualization.

Midlik, Adam; Nair, Sreenath; Anyango, Stephen; Deshpande, Mandar; Sehnal, David; Varadi, Mihaly; Velankar, Sameer.

Bioinformatics ; 39(12), 2023 Dec 01.

Article in English | MEDLINE | ID: mdl-38085238

ABSTRACT

SUMMARY: PDBImages is an innovative, open-source Node.js package that harnesses the power of the popular macromolecule structure visualization software Mol*. Designed for use by the scientific community, PDBImages provides a means to generate high-quality images for PDB and AlphaFold DB models. Its unique ability to render and save images directly to files in a browserless mode sets it apart, offering users a streamlined, automated process for macromolecular structure visualization. Here, we detail the implementation of PDBImages, enumerating its diverse image types, and elaborating on its user-friendly setup. This powerful tool opens a new gateway for researchers to visualize, analyse, and share their work, fostering a deeper understanding of bioinformatics. AVAILABILITY AND IMPLEMENTATION: PDBImages is available as an npm package from https://www.npmjs.com/package/pdb-images. The source code is available from https://github.com/PDBeurope/pdb-images.

Subject(s)

Computational Biology , Software , Molecular Structure , Computational Biology/methods

12.

PDBe CCDUtils: an RDKit-based toolkit for handling and analysing small molecules in the Protein Data Bank.

Kunnakkattu, Ibrahim Roshan; Choudhary, Preeti; Pravda, Lukas; Nadzirin, Nurul; Smart, Oliver S; Yuan, Qi; Anyango, Stephen; Nair, Sreenath; Varadi, Mihaly; Velankar, Sameer.

J Cheminform ; 15(1): 117, 2023 Dec 02.

Article in English | MEDLINE | ID: mdl-38042830

ABSTRACT

While the Protein Data Bank (PDB) contains a wealth of structural information on ligands bound to macromolecules, their analysis can be challenging due to the large amount and diversity of data. Here, we present PDBe CCDUtils, a versatile toolkit for processing and analysing small molecules from the PDB in PDBx/mmCIF format. PDBe CCDUtils provides streamlined access to all the metadata for small molecules in the PDB and offers a set of convenient methods to compute various properties using RDKit, such as 2D depictions, 3D conformers, physicochemical properties, scaffolds, common fragments, and cross-references to small molecule databases using UniChem. The toolkit also provides methods for identifying all the covalently attached chemical components in a macromolecular structure and calculating similarity among small molecules. By providing a broad range of functionality, PDBe CCDUtils caters to the needs of researchers in cheminformatics, structural biology, bioinformatics and computational chemistry.

13.

Annotating Macromolecular Complexes in the Protein Data Bank: Improving the FAIRness of Structure Data.

Appasamy, Sri Devan; Berrisford, John; Gaborova, Romana; Nair, Sreenath; Anyango, Stephen; Grudinin, Sergei; Deshpande, Mandar; Armstrong, David; Pidruchna, Ivanna; Ellaway, Joseph I J; Leines, Grisell Díaz; Gupta, Deepti; Harrus, Deborah; Varadi, Mihaly; Velankar, Sameer.

Sci Data ; 10(1): 853, 2023 Dec 01.

Article in English | MEDLINE | ID: mdl-38040737

ABSTRACT

Macromolecular complexes are essential functional units in nearly all cellular processes, and their atomic-level understanding is critical for elucidating and modulating molecular mechanisms. The Protein Data Bank (PDB) serves as the global repository for experimentally determined structures of macromolecules. Structural data in the PDB offer valuable insights into the dynamics, conformation, and functional states of biological assemblies. However, the current annotation practices lack standardised naming conventions for assemblies in the PDB, complicating the identification of instances representing the same assembly. In this study, we introduce a method leveraging resources external to PDB, such as the Complex Portal, UniProt and Gene Ontology, to describe assemblies and contextualise them within their biological settings accurately. Employing the proposed approach, we assigned standard names to over 90% of unique assemblies in the PDB and provided persistent identifiers for each assembly. This standardisation of assembly data enhances the PDB, facilitating a deeper understanding of macromolecular complexes. Furthermore, the data standardisation improves the PDB's FAIR attributes, fostering more effective basic and translational research and scientific education.

Subject(s)

Translational Research, Biomedical , Molecular Conformation , Databases, Protein , Macromolecular Substances , Protein Conformation

14.

Challenges in bridging the gap between protein structure prediction and functional interpretation.

Varadi, Mihaly; Tsenkov, Maxim; Velankar, Sameer.

Proteins ; 2023 Oct 18.

Article in English | MEDLINE | ID: mdl-37850517

ABSTRACT

The rapid evolution of protein structure prediction tools has significantly broadened access to protein structural data. Although predicted structure models have the potential to accelerate and impact fundamental and translational research significantly, it is essential to note that they are not validated and cannot be considered the ground truth. Thus, challenges persist, particularly in capturing protein dynamics, predicting multi-chain structures, interpreting protein function, and assessing model quality. Interdisciplinary collaborations are crucial to overcoming these obstacles. Databases like the AlphaFold Protein Structure Database, the ESM Metagenomic Atlas, and initiatives like the 3D-Beacons Network provide FAIR access to these data, enabling their interpretation and application across a broader scientific community. Whilst substantial advancements have been made in protein structure prediction, further progress is required to address the remaining challenges. Developing training materials, nurturing collaborations, and ensuring open data sharing will be paramount in this pursuit. The continued evolution of these tools and methodologies will deepen our understanding of protein function and accelerate disease pathogenesis and drug development discoveries.

15.

Structural biology: The transformational era.

Wodak, Shoshana J; Velankar, Sameer.

Proteomics ; 23(17): e2200084, 2023 Sep.

Article in English | MEDLINE | ID: mdl-37667815

16.

Clustering predicted structures at the scale of the known protein universe.

Barrio-Hernandez, Inigo; Yeo, Jingi; Jänes, Jürgen; Mirdita, Milot; Gilchrist, Cameron L M; Wein, Tanita; Varadi, Mihaly; Velankar, Sameer; Beltrao, Pedro; Steinegger, Martin.

Nature ; 622(7983): 637-645, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37704730

ABSTRACT

Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy, and over 214 million predicted structures are available in the AlphaFold database. However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm, Foldseek cluster, that can cluster hundreds of millions of structures. Using this method, we have clustered all of the structures in the AlphaFold database, identifying 2.30 million non-singleton structural clusters, of which 31% lack annotations, representing probable previously undescribed structures. Clusters without annotation tend to have few representatives, covering only 4% of all proteins in the AlphaFold database. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem to be species specific, representing lower-quality predictions or examples of de novo gene birth. We also show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote structural similarity. On the basis of these analyses, we identify several examples of human immune-related proteins with putative remote homology in prokaryotic species, illustrating the value of this resource for studying protein function and evolution across the tree of life.

Subject(s)

Algorithms , Cluster Analysis , Proteins , Structural Homology, Protein , Humans , Databases, Protein , Proteins/chemistry , Proteins/classification , Proteins/metabolism , Sequence Alignment , Molecular Sequence Annotation , Prokaryotic Cells/chemistry , Phylogeny , Species Specificity , Evolution, Molecular

17.

Announcing the launch of Protein Data Bank China as an Associate Member of the Worldwide Protein Data Bank Partnership.

Xu, Wenqing; Velankar, Sameer; Patwardhan, Ardan; Hoch, Jeffrey C; Burley, Stephen K; Kurisu, Genji.

Acta Crystallogr D Struct Biol ; 79(Pt 9): 792-795, 2023 Sep 01.

Article in English | MEDLINE | ID: mdl-37561405

ABSTRACT

The Protein Data Bank (PDB) is the single global archive of atomic-level, three-dimensional structures of biological macromolecules experimentally determined by macromolecular crystallography, nuclear magnetic resonance spectroscopy or three-dimensional cryo-electron microscopy. The PDB is growing continuously, with a recent rapid increase in new structure depositions from Asia. In 2022, the Worldwide Protein Data Bank (wwPDB; https://www.wwpdb.org/) partners welcomed Protein Data Bank China (PDBc; https://www.pdbc.org.cn) to the organization as an Associate Member. PDBc is based in the National Facility for Protein Science in Shanghai which is associated with the Shanghai Advanced Research Institute of Chinese Academy of Sciences, the Shanghai Institute for Advanced Immunochemical Studies and the iHuman Institute of ShanghaiTech University. This letter describes the history of the wwPDB, recently established mechanisms for adding new wwPDB data centers and the processes developed to bring PDBc into the partnership.

Subject(s)

Proteins , Humans , Protein Conformation , Cryoelectron Microscopy , China , Proteins/chemistry , Magnetic Resonance Spectroscopy , Databases, Protein

18.

Unified access to up-to-date residue-level annotations from UniProtKB and other biological databases for PDB data.

Choudhary, Preeti; Anyango, Stephen; Berrisford, John; Tolchard, James; Varadi, Mihaly; Velankar, Sameer.

Sci Data ; 10(1): 204, 2023 Apr 12.

Article in English | MEDLINE | ID: mdl-37045837

ABSTRACT

More than 61,000 proteins have up-to-date correspondence between their amino acid sequence (UniProtKB) and their 3D structures (PDB), enabled by the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource. SIFTS incorporates residue-level annotations from many other biological resources. SIFTS data are available in XML, CSV and TSV formats and via the PDBe REST API, but have always been maintained separately from the structure data (PDBx/mmCIF files) in the PDB archive. Here, we extended the wwPDB PDBx/mmCIF data dictionary with additional categories to accommodate SIFTS data and added the UniProtKB, Pfam, SCOP2, and CATH residue-level annotations directly into the PDBx/mmCIF files from the PDB archive. With the integrated UniProtKB annotations, these files now provide consistent numbering of residues in different PDB entries, allowing easy comparison of structure models. The extended dictionary yields a more consistent, standardised metadata description without altering the core PDB information. This development enables up-to-date cross-reference information at the residue level, resulting in better data interoperability and supporting improved data analysis and visualisation.
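The benefit of consistent residue numbering can be illustrated with a small Python sketch that translates residue numbers from a PDB chain into UniProtKB numbering through aligned segments. The `Segment` layout and the example values are hypothetical and do not reproduce the SIFTS or PDBx/mmCIF data model, which records mappings per residue and also handles insertion codes and unobserved residues.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    """A hypothetical gap-free stretch aligned between PDB and UniProtKB."""
    pdb_start: int
    pdb_end: int
    uniprot_start: int


def to_uniprot(pdb_residue: int, segments: list) -> Optional[int]:
    """Map a PDB residue number to UniProtKB numbering, or None if unmapped."""
    for segment in segments:
        if segment.pdb_start <= pdb_residue <= segment.pdb_end:
            # Within a gap-free segment the two numberings differ by a fixed offset.
            return segment.uniprot_start + (pdb_residue - segment.pdb_start)
    return None
```

Once residues from two PDB entries of the same protein are both expressed in UniProtKB numbering, equivalent positions can be compared directly even when the entries use different author numbering.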

19.

ModelCIF: An Extension of PDBx/mmCIF Data Representation for Computed Structure Models.

Vallat, Brinda; Tauriello, Gerardo; Bienert, Stefan; Haas, Juergen; Webb, Benjamin M; Zídek, Augustin; Zheng, Wei; Peisach, Ezra; Piehl, Dennis W; Anischanka, Ivan; Sillitoe, Ian; Tolchard, James; Varadi, Mihaly; Baker, David; Orengo, Christine; Zhang, Yang; Hoch, Jeffrey C; Kurisu, Genji; Patwardhan, Ardan; Velankar, Sameer; Burley, Stephen K; Sali, Andrej; Schwede, Torsten; Berman, Helen M; Westbrook, John D.

J Mol Biol ; 435(14): 168021, 2023 Jul 15.

Article in English | MEDLINE | ID: mdl-36828268

ABSTRACT

ModelCIF (github.com/ihmwg/ModelCIF) is a data information framework developed for and by computational structural biologists to enable delivery of Findable, Accessible, Interoperable, and Reusable (FAIR) data to users worldwide. ModelCIF describes the specific set of attributes and metadata associated with macromolecular structures modeled by solely computational methods and provides an extensible data representation for deposition, archiving, and public dissemination of predicted three-dimensional (3D) models of macromolecules. It is an extension of the Protein Data Bank Exchange / macromolecular Crystallographic Information Framework (PDBx/mmCIF), which is the global data standard for representing experimentally-determined 3D structures of macromolecules and associated metadata. The PDBx/mmCIF framework and its extensions (e.g., ModelCIF) are managed by the Worldwide Protein Data Bank partnership (wwPDB, wwpdb.org) in collaboration with relevant community stakeholders such as the wwPDB ModelCIF Working Group (wwpdb.org/task/modelcif). This semantically rich and extensible data framework for representing computed structure models (CSMs) accelerates the pace of scientific discovery. Herein, we describe the architecture, contents, and governance of ModelCIF, and tools and processes for maintaining and extending the data standard. Community tools and software libraries that support ModelCIF are also described.

Subject(s)

Databases, Protein , Macromolecular Substances/chemistry , Protein Conformation , Software

20.

The opportunities and challenges posed by the new generation of deep learning-based protein structure predictors.

Varadi, Mihaly; Bordin, Nicola; Orengo, Christine; Velankar, Sameer.

Curr Opin Struct Biol ; 79: 102543, 2023 Apr.

Article in English | MEDLINE | ID: mdl-36807079

ABSTRACT

The function of proteins can often be inferred from their three-dimensional structures. Experimental structural biologists spent decades studying these structures, but the accelerated pace of protein sequencing continuously increases the gaps between sequences and structures. The early 2020s saw the advent of a new generation of deep learning-based protein structure prediction tools that offer the potential to predict structures based on any number of protein sequences. In this review, we give an overview of the impact of this new generation of structure prediction tools, with examples of the impacted fields in the life sciences. We discuss the novel opportunities and new scientific and technical challenges these tools present to the broader scientific community. Finally, we highlight some potential directions for the future of computational protein structure prediction.

Subject(s)

Deep Learning , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence
