ABSTRACT
In the last two years the Pfam database (http://pfam.xfam.org) has undergone a substantial reorganisation to reduce the effort involved in making a release, thereby permitting more frequent releases. Arguably the most significant of these changes is that Pfam is now primarily based on the UniProtKB reference proteomes, with the counts of matched sequences and species reported on the website restricted to this smaller set. Building families on reference proteome sequences brings greater stability, which decreases the amount of manual curation required to maintain them. It also reduces the number of sequences displayed on the website, whilst still providing access to many important model organisms. Matches to the full UniProtKB database are, however, still available and Pfam annotations for individual UniProtKB sequences can still be retrieved. Some Pfam entries (1.6%) which have no matches to reference proteomes remain; we are working with UniProt to see if sequences from them can be incorporated into reference proteomes. Pfam-B, the automatically generated supplement to Pfam, has been removed. The current release (Pfam 29.0) includes 16 295 entries and 559 clans. The facility to view the relationship between families within a clan has been improved by the introduction of a new tool.
Subjects
Databases, Protein; Proteins/classification; Proteome/chemistry; Sequence Alignment; Sequence Analysis, Protein; Molecular Sequence Annotation
ABSTRACT
Transport systems comprise roughly 10% of all proteins in a cell, playing critical roles in many processes. Improving and expanding their classification is an important goal that can affect studies ranging from comparative genomics to potential drug target searches. It is not surprising that different classification systems for transport proteins have arisen, be it within a specialized database focused on this functional class of proteins, or as part of a broader classification system for all proteins. Two such databases are the Transporter Classification Database (TCDB) and the Protein family (Pfam) database. As part of a long-term endeavor to improve consistency between the two classification systems, we have compared transporter annotations in the two databases to understand the rationale for differences and to improve both systems. Differences sometimes reflect the fact that one database has a particular transporter family while the other does not. Differing family definitions and hierarchical organizations were reconciled, resulting in the recognition of 69 Pfam 'Domains of Unknown Function' that proved to be transport protein families and are to be renamed using TCDB annotations. Of over 400 potential new Pfam families identified from TCDB, 10% have already been added to Pfam, and TCDB has created 60 new entries based on Pfam data. This work, for the first time, reveals the benefits of comprehensive database comparisons and explains the differences between Pfam and TCDB.
Subjects
Databases, Protein; Proteins/chemistry
ABSTRACT
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.
Subjects
Databases, Protein; Sequence Alignment; Sequence Analysis, Protein; Internet; Intrinsically Disordered Proteins/chemistry; Protein Conformation; Proteins/chemistry; Proteins/classification; Proteins/genetics; Proteome/chemistry; Sequence Analysis, DNA
ABSTRACT
BACKGROUND: The Acel_2062 protein from Acidothermus cellulolyticus is a protein of unknown function. Initial sequence analysis predicted that it was a metallopeptidase from the presence of a motif conserved amongst the Asp-zincins, which are peptidases that contain a single, catalytic zinc ion ligated by the histidines and aspartic acid within the motif (HEXXHXXGXXD). The Acel_2062 protein was chosen by the Joint Center for Structural Genomics for crystal structure determination to explore novel protein sequence space and structure-based function annotation. RESULTS: The crystal structure confirmed that the Acel_2062 protein consisted of a single, zincin-like metallopeptidase-like domain. The Met-turn, a structural feature thought to be important for a Met-zincin because it stabilizes the active site, is absent, and its stabilizing role may have been conferred to the C-terminal Tyr113. In our crystallographic model there are two molecules in the asymmetric unit, and size-exclusion chromatography indicates that the protein dimerizes in solution. A water molecule is present in the putative zinc-binding site in one monomer, which is replaced by one of two observed conformations of His95 in the other. CONCLUSIONS: The Acel_2062 protein is structurally related to the zincins. It contains the minimum structural features of a member of this protein superfamily, and can be described as a "mini-zincin". There is a striking parallel with the structure of a mini-Glu-zincin, which represents the minimum structure of a Glu-zincin (a metallopeptidase in which the third zinc ligand is a glutamic acid). Phylogenetic analysis suggests that the mini-zincins, rather than representing an ancestral state, are derived from larger proteins.
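The HEXXHXXGXXD motif named in the abstract above can be read as a simple sequence pattern in which X stands for any residue. The following is a minimal sketch of how such a motif scan might look, not the method used by the authors; the function name `find_asp_zincin_motif` and the toy sequence are hypothetical and do not represent the real Acel_2062 sequence.

```python
import re

# HEXXHXXGXXD with X = any residue; the lookahead allows overlapping hits.
# Assumption: sequences use one-letter amino acid codes.
ASP_ZINCIN_MOTIF = re.compile(r"(?=HE[A-Z]{2}H[A-Z]{2}G[A-Z]{2}D)")


def find_asp_zincin_motif(sequence: str) -> list[int]:
    """Return the 0-based start position of every motif occurrence."""
    return [m.start() for m in ASP_ZINCIN_MOTIF.finditer(sequence.upper())]


# Hypothetical toy sequence built only to contain one motif instance.
toy_sequence = "MKTAHEAAHLLGLSDPRK"
positions = find_asp_zincin_motif(toy_sequence)
print(positions)
```

A pattern match of this kind only flags candidates; as the abstract notes, the metallopeptidase assignment was a prediction that the crystal structure then tested.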
Subjects
Bacterial Proteins/chemistry; Metalloproteases/chemistry; Zinc/chemistry; Actinomycetales/chemistry; Actinomycetales/enzymology; Amino Acid Motifs; Amino Acid Sequence; Bacterial Proteins/metabolism; Dimerization; Metalloproteases/metabolism; Models, Molecular; Molecular Sequence Data; Phylogeny; Protein Subunits; Sequence Alignment; Zinc/metabolism
ABSTRACT
BACKGROUND: CA_C2195 from Clostridium acetobutylicum is a protein of unknown function. Sequence analysis predicted that part of the protein contained a metallopeptidase-related domain. There are over 200 homologs of similar size in large sequence databases such as UniProt, with pairwise sequence identities in the range of ~40-60%. CA_C2195 was chosen for crystal structure determination for structure-based function annotation of novel protein sequence space. RESULTS: The structure confirmed that CA_C2195 contained an N-terminal metallopeptidase-like domain. The structure revealed two extra domains: an α+β domain inserted in the metallopeptidase-like domain and a C-terminal circularly permuted winged-helix-turn-helix domain. CONCLUSIONS: Based on our sequence and structural analyses using the crystal structure of CA_C2195 we provide a view into the possible functions of the protein. From contextual information from gene-neighborhood analysis, we propose that rather than being a peptidase, CA_C2195 and its homologs might play a role in biosynthesis of a modified cell-surface carbohydrate in conjunction with several sugar-modification enzymes. These results provide the groundwork for the experimental verification of the function.
Subjects
Bacterial Proteins/chemistry; Clostridium acetobutylicum/enzymology; Metalloproteases/chemistry; Bacterial Proteins/genetics; Bacterial Proteins/metabolism; Clostridium acetobutylicum/genetics; Crystallography, X-Ray; Metalloproteases/genetics; Metalloproteases/metabolism; Models, Molecular; Protein Structure, Tertiary
ABSTRACT
BACKGROUND: A novel, highly conserved protein domain, DUF162 [Pfam: PF02589], can be mapped to two proteins: LutB and LutC. Both proteins are encoded by a highly conserved LutABC operon, which has been implicated in lactate utilization in bacteria. Based on our analysis of its sequence, structure, and recent experimental evidence reported by other groups, we hereby redefine DUF162 as the LUD domain family. RESULTS: JCSG solved the first crystal structure [PDB:2G40] from the LUD domain family: LutC protein, encoded by ORF DR_1909, of Deinococcus radiodurans. LutC shares features with domains in the functionally diverse ISOCOT superfamily. We have observed that the LUD domain has an increased abundance in the human gut microbiome. CONCLUSIONS: We propose a model for substrate and cofactor binding and regulation in the LUD domain. The significance of LUD-containing proteins in the human gut microbiome and the implication of lactate metabolism in the radiation resistance of Deinococcus radiodurans are discussed.
Subjects
Bacterial Proteins/metabolism; Deinococcus/chemistry; Deinococcus/metabolism; Lactic Acid/metabolism; Amino Acid Sequence; Bacterial Proteins/chemistry; Bacterial Proteins/genetics; Crystallography, X-Ray; Deinococcus/genetics; Humans; Microbiota/radiation effects; Molecular Sequence Data; Protein Structure, Tertiary
ABSTRACT
BACKGROUND: Every genome contains a large number of uncharacterized proteins that may encode entirely novel biological systems. Many of these uncharacterized proteins fall into related sequence families. By applying sequence and structural analysis we hope to provide insight into novel biology. RESULTS: We analyze a previously uncharacterized Pfam protein family called DUF4424 [Pfam:PF14415]. The recently solved three-dimensional structure of the protein lpg2210 from Legionella pneumophila provides the first structural information pertaining to this family. This protein additionally includes the first representative structure of another Pfam family called the YARHG domain [Pfam:PF13308]. The Pfam family DUF4424 adopts a 19-stranded beta-sandwich fold that shows similarity to the N-terminal domain of leukotriene A-4 hydrolase. The YARHG domain forms an all-helical domain at the C-terminus. Structure analysis allows us to recognize distant similarities between the DUF4424 domain and individual domains of M1 aminopeptidases and tricorn proteases, which form massive proteasome-like capsids in both archaea and bacteria. CONCLUSIONS: Based on our analyses we hypothesize that the DUF4424 domain may have a role in forming large, multi-component enzyme complexes. We suggest that the YARHG domain may play a role in binding a moiety in proximity with peptidoglycan, such as a hydrophobic outer membrane lipid or lipopolysaccharide.
Subjects
Bacterial Proteins/chemistry; Databases, Protein; Legionella pneumophila/chemistry; Amino Acid Sequence; Bacterial Proteins/genetics; Legionella pneumophila/genetics; Molecular Sequence Data; Protein Structure, Tertiary; Sequence Alignment; Sequence Analysis, Protein
ABSTRACT
Crystal structures of three members (BACOVA_00364 from Bacteroides ovatus, BACUNI_03039 from Bacteroides uniformis and BACEGG_00036 from Bacteroides eggerthii) of the Pfam domain of unknown function (DUF4488) were determined to 1.95, 1.66, and 1.81 Å resolutions, respectively. The protein structures adopt an eight-stranded, calycin-like, β-barrel fold and bind an endogenous unknown ligand at one end of the β-barrel. The amino acids interacting with the ligand are not conserved in any other protein of known structure with this particular fold. The size and chemical environment of the bound ligand suggest binding or transport of a small polar molecule(s) as a potential function for these proteins. These are the first structural representatives of a newly defined PF14869 (DUF4488) Pfam family.