ABSTRACT
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro and provide more user-friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.
Subjects
Protein Databases, Humans, Amino Acid Sequence, Artificial Intelligence, Internet, Proteins/chemistry, Software
ABSTRACT
MOTIVATION: Wikipedia is one of the most important channels for the public communication of science and is frequently accessed as an educational resource in computational biology. Joint efforts between the International Society for Computational Biology (ISCB) and the Computational Biology taskforce of WikiProject Molecular Biology (a group of expert Wikipedia editors) have considerably improved computational biology representation on Wikipedia in recent years. However, there is still an urgent need for further improvement in quality, especially when compared to related scientific fields such as genetics and medicine. Facilitating involvement of members from ISCB Communities of Special Interest (COSIs) would improve a vital open education resource in computational biology, additionally allowing COSIs to provide a quality educational resource highly specific to their subfield. RESULTS: We generate a list of around 1500 English Wikipedia articles relating to computational biology and describe the development of a binary COSI-Article matrix, linking COSIs to relevant articles and thereby defining domain-specific open educational resources. Our analysis of the COSI-Article matrix data provides a quantitative assessment of computational biology representation on Wikipedia against other fields and at a COSI-specific level. Furthermore, we conducted similarity analysis and subsequent clustering of COSI-Article data to provide insight into potential relationships between COSIs. Finally, based on our analysis, we suggest courses of action to improve the quality of computational biology representation on Wikipedia.
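The binary COSI-Article matrix and the pairwise similarity analysis described in the RESULTS can be illustrated with a minimal sketch. The COSI names, the article titles and the choice of Jaccard similarity are all assumptions made for illustration; the abstract does not state which articles were linked to which COSI, nor which similarity measure was used.

```python
from itertools import combinations

# Hypothetical COSI -> article links (illustrative only, not data from the study).
cosi_articles = {
    "COSI_A": {"Sequence alignment", "Hidden Markov model", "BLAST"},
    "COSI_B": {"Sequence alignment", "Protein structure prediction"},
    "COSI_C": {"Gene regulatory network", "Systems biology"},
}


def build_binary_matrix(links):
    """Return (cosis, articles, matrix); matrix[i][j] is 1 if COSI i is linked to article j."""
    cosis = sorted(links)
    articles = sorted(set().union(*links.values()))
    matrix = [[1 if a in links[c] else 0 for a in articles] for c in cosis]
    return cosis, articles, matrix


def jaccard(row_a, row_b):
    """Jaccard similarity of two binary rows: shared articles / articles in either row."""
    shared = sum(1 for x, y in zip(row_a, row_b) if x and y)
    either = sum(1 for x, y in zip(row_a, row_b) if x or y)
    return shared / either if either else 0.0


cosis, articles, matrix = build_binary_matrix(cosi_articles)

# Pairwise similarity between COSIs; a clustering step could take this as input.
similarity = {
    (cosis[i], cosis[j]): jaccard(matrix[i], matrix[j])
    for i, j in combinations(range(len(cosis)), 2)
}

for pair, score in sorted(similarity.items()):
    print(pair, round(score, 2))
```

In this toy data only COSI_A and COSI_B share an article, so they are the only pair with a non-zero score. With real data, the same pairwise scores could be passed to any standard clustering routine.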
Subjects
Computational Biology, Cluster Analysis
ABSTRACT
CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains, and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with the addition of 65,351 fully classified domain structures (+15%), giving 500,238 structural domains, and 151 million predicted sequence domains (+59%) assigned to 5481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies: (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH, including SCOP, InterPro, Aquaria and 2DProt.
Subjects
Computational Biology/statistics & numerical data, Protein Databases/statistics & numerical data, Protein Domains, Proteins/chemistry, Amino Acid Sequence, COVID-19/epidemiology, COVID-19/prevention & control, COVID-19/virology, Computational Biology/methods, Epidemics, Humans, Internet, Molecular Sequence Annotation, Proteins/genetics, Proteins/metabolism, SARS-CoV-2/genetics, SARS-CoV-2/metabolism, SARS-CoV-2/physiology, Protein Sequence Analysis/methods, Amino Acid Sequence Homology, Viral Proteins/chemistry, Viral Proteins/genetics, Viral Proteins/metabolism
ABSTRACT
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
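As a rough illustration of how a REST API of this kind might be queried, the sketch below builds a URL for a single entry and fetches it as JSON. The base URL and the `/entry/{database}/{accession}` path layout are assumptions, not taken from the abstract, and should be checked against the InterPro API documentation. `IPR000001` is used only as a placeholder accession, and `entry_url` and `fetch_entry` are hypothetical helper names.

```python
import json
import urllib.request

# Assumed base URL of the InterPro REST API; verify against the official documentation.
API_BASE = "https://www.ebi.ac.uk/interpro/api"


def entry_url(accession, source_db="interpro"):
    """Build the (assumed) URL for one entry, e.g. IPR000001 in the 'interpro' database."""
    return f"{API_BASE}/entry/{source_db}/{accession}"


def fetch_entry(accession, source_db="interpro", timeout=30):
    """Fetch one entry and return the parsed JSON. Requires network access."""
    request = urllib.request.Request(
        entry_url(accession, source_db),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only the URL is printed here, so the example runs without a network connection.
    print(entry_url("IPR000001"))
```

Calling `fetch_entry("IPR000001")` would perform the actual request. The shape of the returned JSON is not assumed here and should be inspected before any fields are relied on.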
Subjects
Protein Databases, Proteins/chemistry, Amino Acid Sequence, COVID-19/metabolism, Internet, Molecular Sequence Annotation, Protein Domains, Protein Interaction Maps, SARS-CoV-2/metabolism, Sequence Alignment
ABSTRACT
Alternative splicing can expand the diversity of proteomes. Homologous mutually exclusive exons (MXEs) originate from the same ancestral exon and result in polypeptides with similar structural properties but altered sequence. Why would some genes switch homologous exons, and what is the biological impact of doing so? Here, we analyse the extent of sequence, structural and functional variability in MXEs and report the first large-scale, structure-based analysis of the biological impact of MXE events from different genomes. MXE-specific residues tend to map to single domains, are highly enriched in surface-exposed residues and cluster at or near protein functional sites. Thus, MXE events are likely to maintain the protein fold, but alter specificity and selectivity of protein function. This comprehensive resource of MXE events and their annotations is available at: http://gene3d.biochem.ucl.ac.uk/mxemod/. These findings highlight how small but significant changes at critical positions on a protein surface are exploited in evolution to alter function.
Subjects
Alternative Splicing/genetics, Exons/genetics, Genome/genetics, Proteins, Animals, Molecular Evolution, Genomics, Humans, Proteins/genetics, Proteins/physiology
ABSTRACT
This article provides an update of the latest data and developments within the CATH protein structure classification database (http://www.cathdb.info). The resource provides two levels of release: CATH-B, a daily snapshot of the latest structural domain boundaries and superfamily assignments, and CATH+, which adds layers of derived data, such as predicted sequence domains, functional annotations and functional clustering (known as Functional Families or FunFams). The most recent CATH+ release (version 4.2) provides a huge update in the coverage of structural data. This release increases the number of fully classified domains by over 40% (from 308 999 to 434 857 structural domains), corresponding to an almost two-fold increase in sequence data (from 53 million to over 95 million predicted domains) organised into 6119 superfamilies. The coverage of high-resolution protein PDB chains that contain at least one assigned CATH domain is now 90.2% (increased from 82.3% in the previous release). A number of highly requested features have also been implemented in our web pages: allowing the user to view an alignment between their query sequence and a representative FunFam structure and providing tools that make it easier to view the full structural context (multi-domain architecture) of domains and chains.
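The growth figures quoted for the 4.2 release can be checked with a short calculation. This is only an arithmetic sketch using the counts stated in the abstract; the function names are illustrative, and "over 95 million" is treated as exactly 95 million, so the fold change is a lower bound.

```python
def percent_increase(old, new):
    """Relative growth from old to new, as a percentage."""
    return (new - old) / old * 100.0


def fold_change(old, new):
    """Ratio of the new value to the old value."""
    return new / old


# Counts as stated in the abstract.
domains_prev, domains_now = 308_999, 434_857
seq_prev_millions, seq_now_millions = 53, 95  # "over 95 million" taken as 95

print(f"structural domains: +{percent_increase(domains_prev, domains_now):.1f}%")   # +40.7%
print(f"predicted domains: {fold_change(seq_prev_millions, seq_now_millions):.2f}-fold")  # 1.79-fold
```

Both results agree with the wording of the abstract: an increase of "over 40%" in classified domains and an "almost two-fold" increase in predicted sequence domains.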
Subjects
Protein Databases, Genome, Amino Acid Sequence, Animals, Conserved Sequence, Gene Ontology, Humans, Molecular Models, Molecular Sequence Annotation, Multigene Family/genetics, Protein Conformation, Protein Domains/genetics, Sequence Alignment, Amino Acid Sequence Homology, Structure-Activity Relationship
ABSTRACT
An important area of modern biology consists of understanding the relationship between genotype and phenotype. However, to understand this relationship it is essential to investigate one of the principal links between them: the proteome. With the development of recent mass-spectrometry approaches, it is now possible to quantify entire proteomes and thus relate them to different phenotypes. Here, we present a comparison of the proteome of two extreme developmental states in the well-established model organism Drosophila melanogaster: adult and embryo. Protein modules such as ribosome, proteasome, tricarboxylic acid (TCA) cycle, glycolysis, or oxidative phosphorylation were found to be differentially expressed between the two developmental stages. Analysis of post-translational modifications of the proteins identified in this study indicates that they generally follow the same trend as their corresponding protein. Comparison between changes in the proteome and the transcriptome highlighted patterns of post-transcriptional regulation for the subunits of protein complexes such as the ribosome and the proteasome, whereas proteins from modules such as TCA cycle, glycolysis, and oxidative phosphorylation seem to be coregulated at the transcriptional level. Finally, the impact of the endosymbiont Wolbachia pipientis on the proteome of both developmental states was also investigated.
Subjects
Drosophila melanogaster/genetics, Protein Biosynthesis/genetics, Proteome/genetics, Transcriptome/genetics, Animals, Drosophila melanogaster/growth & development, Drosophila melanogaster/metabolism, Drosophila melanogaster/microbiology, Nonmammalian Embryo/metabolism, Nonmammalian Embryo/microbiology, Developmental Gene Expression Regulation/genetics, Proteolysis, Proteome/metabolism, Proteomics/methods, Wolbachia/pathogenicity
ABSTRACT
The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.1, http://www.cathdb.info). The resource comprises over 300 000 domain structures and over 53 million protein domains classified into 2737 homologous superfamilies, doubling the number of predicted protein domains in the previous version. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 100 000 additional protein domains. This article describes developments to the CATH-Gene3D resource over the last two years since the publication in 2015, including: significant increases to our structural and sequence coverage; expansion of the functional families in CATH; building a support vector machine (SVM) to automatically assign domains to superfamilies; improved search facilities to return alignments of query sequences against multiple sequence alignments; the redesign of the web pages and download site.
Subjects
Computational Biology/methods, Protein Databases, Molecular Models, Proteins/chemistry, Proteins/metabolism, Software, Structure-Activity Relationship, Web Browser
ABSTRACT
InterPro (http://www.ebi.ac.uk/interpro/) is a freely available database used to classify protein sequences into families and to predict the presence of important domains and sites. InterProScan is the underlying software that allows both protein and nucleic acid sequences to be searched against InterPro's predictive models, which are provided by its member databases. Here, we report recent developments with InterPro and its associated software, including the addition of two new databases (SFLD and CDD), and the functionality to include residue-level annotation and prediction of intrinsic disorder. These developments enrich the annotations provided by InterPro, increase the overall number of residues annotated and allow more specific functional inferences.
Subjects
Computational Biology/methods, Protein Databases, Protein Interaction Domains and Motifs, Software, Humans, Molecular Sequence Annotation, Phylogeny
ABSTRACT
Accurate gene or protein function prediction is a key challenge in the post-genome era. Most current methods perform well on molecular function prediction, but struggle to provide useful annotations relating to biological process functions due to the limited power of sequence-based features in that functional domain. In this work, we systematically evaluate the predictive power of temporal transcription expression profiles for protein function prediction in Drosophila melanogaster. Our results show significantly better performance on predicting protein function when transcription expression profile-based features are integrated with sequence-derived features, compared with the sequence-derived features alone. We also observe that the combination of expression-based and sequence-based features leads to further improvement of accuracy on predicting all three domains of gene function. Based on the optimal feature combinations, we then propose a novel multi-classifier-based function prediction method for Drosophila melanogaster proteins, FFPred-fly+. Interpreting our machine learning models also allows us to identify some of the underlying links between biological processes and developmental stages of Drosophila melanogaster.
Subjects
Computational Biology/methods, Drosophila Proteins/genetics, Drosophila melanogaster/growth & development, Gene Expression Profiling/methods, Transcriptome/genetics, Animals, Cluster Analysis, Computer Simulation, Drosophila Proteins/analysis, Drosophila Proteins/metabolism, Drosophila melanogaster/genetics, Drosophila melanogaster/metabolism, Statistical Models, Phenotype, Transcriptome/physiology
ABSTRACT
Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of domain annotations of Ensembl and UniProtKB protein sequences. Domains are predicted using a library of profile HMMs representing 2737 CATH superfamilies. Gene3D has previously featured in the Database issue of NAR and here we report updates to the website and database. The current Gene3D (v14) release has expanded its domain assignments to ~20,000 cellular genomes and over 43 million unique protein sequences, more than doubling the number of protein sequences since our last publication. Amongst other updates, we have improved our Functional Family annotation method. We have also improved the quality and coverage of our 3D homology modelling pipeline of predicted CATH domains. Additionally, the structural models have been expanded to include an extra model organism (Drosophila melanogaster). We also document a number of additional visualization tools in the Gene3D website.
Subjects
Protein Databases, Tertiary Protein Structure, Humans, Internet, Molecular Models, Molecular Sequence Annotation, Protein Interaction Domains and Motifs, Tertiary Protein Structure/genetics
ABSTRACT
As a result of the genome sequencing and structural genomics initiatives, we have a wealth of protein sequence and structural data. However, only about 1% of these proteins have experimental functional annotations. Consequently, computational approaches that can predict protein functions are essential in bridging this widening annotation gap. This article reviews the current approaches to protein function prediction using structure- and sequence-based classification of protein domain family resources, with a special focus on functional families in the CATH-Gene3D resource.
Subjects
Molecular Sequence Annotation/methods, Tertiary Protein Structure/genetics, Proteins/chemistry, Proteins/genetics, Amino Acid Sequence, Animals, Humans, Molecular Sequence Data, Secondary Protein Structure
ABSTRACT
The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced present new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence-structure-function relationship further, as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins, where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.
Subjects
Molecular Sequence Annotation, Tertiary Protein Structure, Software, Gene Ontology, Internet, Proteins/classification, Proteins/genetics, Proteins/physiology
ABSTRACT
The latest version of the CATH-Gene3D protein structure classification database (4.0, http://www.cathdb.info) provides annotations for over 235,000 protein domain structures and includes 25 million domain predictions. This article provides an update on the major developments in the 2 years since the last publication in this journal including: significant improvements to the predictive power of our functional families (FunFams); the release of our 'current' putative domain assignments (CATH-B); a new, strictly non-redundant data set of CATH domains suitable for homology benchmarking experiments (CATH-40) and a number of improvements to the web pages.
Subjects
Protein Databases, Molecular Sequence Annotation, Tertiary Protein Structure, Genomics, Internet, Tertiary Protein Structure/genetics, Proteins/classification
ABSTRACT
A well-known case of evolutionary adaptation is that of ribulose-1,5-bisphosphate carboxylase (RubisCO), the enzyme responsible for fixation of CO2 during photosynthesis. Although the majority of plants use the ancestral C3 photosynthetic pathway, many flowering plants have evolved a derived pathway named C4 photosynthesis. The latter concentrates CO2, and C4 RubisCOs consequently have lower specificity for, and faster turnover of, CO2. The C4 forms result from convergent evolution in multiple clades, with substitutions at a small number of sites under positive selection. To understand the physical constraints on these evolutionary changes, we reconstructed in silico ancestral sequences and 3D structures of RubisCO from a large group of related C3 and C4 species. We were able to precisely track their past evolutionary trajectories, identify mutations on each branch of the phylogeny, and evaluate their stability effect. We show that RubisCO evolution has been constrained by stability-activity tradeoffs similar in character to those previously identified in laboratory-based experiments. The C4 properties require a subset of several ancestral destabilizing mutations, which from their location in the structure are inferred to mainly be involved in enhancing conformational flexibility of the open-closed transition in the catalytic cycle. These mutations are near, but not in, the active site or at intersubunit interfaces. The C3 to C4 transition is preceded by a sustained period in which stability of the enzyme is increased, creating the capacity to accept the functionally necessary destabilizing mutations, and is immediately followed by compensatory mutations that restore global stability.
Subjects
Biological Evolution, Ribulose-Bisphosphate Carboxylase/physiology, Physiological Adaptation, Carbon Dioxide/metabolism, Enzyme Stability, Molecular Models, Mutation, Photosynthesis, Plant Physiological Phenomena, Ribulose-Bisphosphate Carboxylase/chemistry, Ribulose-Bisphosphate Carboxylase/genetics
ABSTRACT
MOTIVATION: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap, especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer. RESULTS: FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications. This has been validated using known functional information. The conserved positions predicted by the FunFams are also found to be enriched in known functional residues. Moreover, the functional annotations provided by the FunFams are found to be more precise than those of other domain-based resources. FunFHMMer currently identifies 110,439 FunFams in 2735 superfamilies, which can be used to functionally annotate >16 million domain sequences. AVAILABILITY AND IMPLEMENTATION: All FunFam annotation data are made available through the CATH webpages (http://www.cathdb.info). The FunFHMMer webserver (http://www.cathdb.info/search/by_funfhmmer) allows users to submit query sequences for assignment to a CATH FunFam. CONTACT: sayoni.das.12@ucl.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Algorithms, Protein Databases, Molecular Sequence Annotation, Tertiary Protein Structure, Proteins/chemistry, Proteins/classification, Amino Acid Sequence, Humans, Molecular Sequence Data, Proteins/genetics, Proteins/metabolism, Protein Sequence Analysis, Amino Acid Sequence Homology, Structural Protein Homology
ABSTRACT
Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year.
Subjects
Protein Databases, Molecular Sequence Annotation, Tertiary Protein Structure, Genome, Genomics, Internet, Molecular Models, Tertiary Protein Structure/genetics, Protein Sequence Analysis
ABSTRACT
BACKGROUND: In complex Metazoans a given gene frequently codes for multiple protein isoforms, through processes such as alternative splicing. Large-scale functional annotation of these isoforms is a key challenge for functional genomics. This annotation gap is increasing with the large numbers of multi-transcript genes being identified by technologies such as RNASeq. Furthermore, attempts to characterise the functions of splicing in an organism are complicated by the difficulty in distinguishing functional isoforms from those produced by splicing errors or transcription noise. Tools to help prioritise candidate isoforms for testing are largely absent. RESULTS: In this study we implement a Time-course Switch (TS) score for ranking isoforms by their likelihood of producing additional functions based on their developmental expression profiles, as reported by modENCODE. The TS score allows us to better investigate functional roles of different isoforms expressed in multi-transcript genes. From this analysis, we find that isoforms with high TS scores have sequence feature changes consistent with more deterministic splicing and functional changes, and tend to gain domains or whole exons which could carry additional functions. Furthermore, these functions appear to be particularly important for essential regulatory roles, establishing functional isoform switching as key for regulatory processes. Based on the TS score we develop a Transcript Annotations Pipeline for Alternative Splicing (TAPAS) that identifies functional neighbourhoods of potentially interesting isoforms. CONCLUSIONS: We have identified a subset of protein isoforms which appear to have high functional significance, particularly in regulation. This has been made possible through the development of novel methods that make use of transcript expression profiles. The methods and analyses we present here represent important first steps in the development of tools to address the near-complete lack of isoform-specific function annotation. In turn, the tools allow us to characterise the regulatory functions of alternative splicing in more detail.
Subjects
Alternative Splicing, Drosophila melanogaster/growth & development, Protein Isoforms/metabolism, Algorithms, Animals, Computational Biology/methods, Genetic Databases, Drosophila melanogaster/genetics, Developmental Gene Expression Regulation, Messenger RNA/metabolism
ABSTRACT
Blood coagulation occurs through a cascade of enzymes and cofactors that produces a fibrin clot, while otherwise maintaining hemostasis. The 11 human coagulation factors (FG, FII-FXIII) have been identified across all vertebrates, suggesting that they emerged with the first vertebrates around 500 Ma. Human FVIII, FIX, and FXI are associated with thousands of disease-causing mutations. Here, we evaluated the strength of selective pressures on the 14 genes coding for the 11 factors during vertebrate evolution, and compared these with human mutations in FVIII, FIX, and FXI. Positive selection was identified for fibrinogen (FG), FIII, FVIII, FIX, and FX in the mammalian Primates and Laurasiatheria and the Sauropsida (reptiles and birds). This showed that the coagulation system in vertebrates was under strong selective pressures, perhaps to adapt against blood-invading pathogens. The comparison of these results with disease-causing mutations reported in FVIII, FIX, and FXI showed that the number of disease-causing mutations, and the probability of positive selection were inversely related to each other. It was concluded that when a site was under positive selection, it was less likely to be associated with disease-causing mutations. In contrast, sites under negative selection were more likely to be associated with disease-causing mutations and be destabilizing. A residue-by-residue comparison of the FVIII, FIX, and FXI sequence alignments confirmed this. This improved understanding of evolutionary changes in FVIII, FIX, and FXI provided greater insight into disease-causing mutations, and better assessments of the codon sites that may be mutated in applications of gene therapy.