1.
Brief Bioinform ; 13(6): 656-68, 2012 Nov.
Article in English | MEDLINE | ID: mdl-22772836

ABSTRACT

The rapid advances of high-throughput sequencing technologies have dramatically promoted metagenomic studies of microbial communities that exist in various environments. Fundamental questions in metagenomics include the identities, composition and dynamics of microbial populations and their functions and interactions. However, the massive quantity and the comprehensive complexity of these sequence data pose tremendous challenges in data analysis. These challenges include but are not limited to ever-increasing computational demand, biased sequence sampling, sequence errors, sequence artifacts and novel sequences. Sequence clustering methods can directly answer many of the fundamental questions by grouping similar sequences into families. In addition, clustering analysis also addresses the challenges in metagenomics. Thus, a large redundant data set can be represented with a small non-redundant set, where each cluster can be represented by a single entry or a consensus. Artifacts can be rapidly detected through clustering. Errors can be identified, filtered or corrected by using consensus from sequences within clusters.


Subject(s)
Algorithms , Metagenome , Cluster Analysis , Metagenomics , Sequence Analysis, DNA
2.
PLoS Biol ; 9(6): e1001088, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21713030

ABSTRACT

A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe and make available this information that maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities. Here we provide a short history of the GSC, provide an overview of its range of current activities, and make a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.


Subject(s)
Databases, Genetic , Genomics/standards , International Cooperation , Metagenome
3.
Nucleic Acids Res ; 39(Database issue): D494-6, 2011 Jan.
Article in English | MEDLINE | ID: mdl-20961957

ABSTRACT

The Open Protein Structure Annotation Network (TOPSAN) is a web-based collaboration platform for exploring and annotating structures determined by structural genomics efforts. Characterization of those structures presents a challenge since the majority of the proteins themselves have not yet been characterized. Responding to this challenge, the TOPSAN platform facilitates collaborative annotation and investigation via a user-friendly web-based interface pre-populated with automatically generated information. Semantic web technologies expand and enrich TOPSAN's content through links to larger sets of related databases, and thus, enable data integration from disparate sources and data mining via conventional query languages. TOPSAN can be found at http://www.topsan.org.


Subject(s)
Databases, Protein , Protein Conformation , Genomics , Proteins/chemistry , Proteins/genetics , User-Computer Interface
4.
Nucleic Acids Res ; 39(Database issue): D546-51, 2011 Jan.
Article in English | MEDLINE | ID: mdl-21045053

ABSTRACT

The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA, http://camera.calit2.net/) is a database and associated computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing and sharing data about microbial biology through an advanced web-based analysis portal. CAMERA collects and links metadata relevant to environmental metagenome data sets with annotation in a semantically-aware environment allowing users to write expressive semantic queries against the database. To meet the needs of the research community, users are able to query metadata categories such as habitat, sample type, time, location and other environmental physicochemical parameters. CAMERA is compliant with the standards promulgated by the Genomic Standards Consortium (GSC), and sustains a role within the GSC in extending standards for content and format of the metagenomic data and metadata and its submission to the CAMERA repository. To ensure wide, ready access to data and annotation, CAMERA also provides data submission tools to allow researchers to share and forward data to other metagenomics sites and community data archives such as GenBank. It has multiple interfaces for easy submission of large or complex data sets, and supports pre-registration of samples for sequencing. CAMERA integrates a growing list of tools and viewers for querying, analyzing, annotating and comparing metagenome and genome data.


Subject(s)
Databases, Genetic , Metagenome , Environment , Metagenomics , Software
5.
PLoS Biol ; 7(9): e1000205, 2009 Sep.
Article in English | MEDLINE | ID: mdl-19787035

ABSTRACT

The genome projects have unearthed an enormous diversity of genes of unknown function that are still awaiting biological and biochemical characterization. These genes, as most others, can be grouped into families based on sequence similarity. The PFAM database currently contains over 2,200 such families, referred to as domains of unknown function (DUF). In a coordinated effort, the four large-scale centers of the NIH Protein Structure Initiative have determined the first three-dimensional structures for more than 250 of these DUF families. Analysis of the first 248 reveals that about two thirds of the DUF families likely represent very divergent branches of already known and well-characterized families, which allows hypotheses to be formulated about their biological function. The remainder can be formally categorized as new folds, although about one third of these show significant substructure similarity to previously characterized folds. These results imply that, despite the enormous increase in the number and the diversity of new genes being uncovered, the fold space of the proteins they encode is gradually becoming saturated. The previously unexplored sectors of the protein universe appear to be primarily shaped by extreme diversification of known protein families, which then enables organisms to evolve new functions and adapt to particular niches and habitats. Notwithstanding, these DUF families still constitute the richest source for discovery of the remaining protein folds and topologies.


Subject(s)
Proteins/chemistry , Animals , Databases, Protein , Humans , Models, Molecular , Multigene Family , Protein Structure, Secondary , Protein Structure, Tertiary , Structural Homology, Protein , Time Factors
6.
Proteins ; 79(8): 2389-402, 2011 Aug.
Article in English | MEDLINE | ID: mdl-21671455

ABSTRACT

The protein universe can be organized in families that group proteins sharing common ancestry. Such families display variable levels of structural and functional divergence, from homogeneous families, where all members have the same function and very similar structure, to very divergent families, where large variations in function and structure are observed. For practical purposes of structure and function prediction, it would be beneficial to identify sub-groups of proteins with highly similar structures (iso-structural) and/or functions (iso-functional) within divergent protein families. We compared three algorithms in their ability to cluster large protein families and discuss whether any of these methods could reliably identify such iso-structural or iso-functional groups. We show that clustering using profile-sequence and profile-profile comparison methods closely reproduces clusters based on similarities between 3D structures or clusters of proteins with similar biological functions. In contrast, the still commonly used sequence-based methods with fixed thresholds result in vast overestimates of structural and functional diversity in protein families. As a result, these methods also overestimate the number of protein structures that have to be determined to fully characterize the structural space of such families. The fact that one can build reliable models based on apparently distantly related templates is crucial for extracting the maximal amount of information from new sequencing projects.


Subject(s)
Proteins/chemistry , Cluster Analysis , Databases, Protein
7.
PLoS Comput Biol ; 6(2): e1000667, 2010 Feb 26.
Article in English | MEDLINE | ID: mdl-20195499

ABSTRACT

Metagenomics is a discipline that enables the genomic study of uncultured microorganisms. Faster, cheaper sequencing technologies and the ability to sequence uncultured microbes sampled directly from their habitats are expanding and transforming our view of the microbial world. Distilling meaningful information from the millions of new genomic sequences presents a serious challenge to bioinformaticians. In cultured microbes, the genomic data come from a single clone, making sequence assembly and annotation tractable. In metagenomics, the data come from heterogeneous microbial communities, sometimes containing more than 10,000 species, with the sequence data being noisy and partial. From sampling, to assembly, to gene calling and function prediction, bioinformatics faces new demands in interpreting voluminous, noisy, and often partial sequence data. Although metagenomics is a relative newcomer to science, the past few years have seen an explosion in computational methods applied to metagenomic-based research. It is therefore not within the scope of this article to provide an exhaustive review. Rather, we provide here a concise yet comprehensive introduction to the current computational requirements presented by metagenomics, and review the recent progress made. We also note whether there is software that implements any of the methods presented here, and briefly review its utility. Nevertheless, it would be useful if readers of this article would avail themselves of the comment section provided by this journal, and relate their own experiences. Finally, the last section of this article provides a few representative studies illustrating different facets of recent scientific discoveries made using metagenomics.


Subject(s)
Metagenomics , Computational Biology , Sequence Analysis, DNA
8.
PLoS Comput Biol ; 6(6): e1000798, 2010 Jun 03.
Article in English | MEDLINE | ID: mdl-20532204

ABSTRACT

The microbes that inhabit particular environments must be able to perform molecular functions that provide them with a competitive advantage to thrive in those environments. As most molecular functions are performed by proteins and are conserved between related proteins, we can expect that organisms successful in a given environmental niche would contain protein families that are specific for functions that are important in that environment. For instance, the human gut is rich in polysaccharides from the diet or secreted by the host, and is dominated by Bacteroides, whose genomes contain a highly expanded repertoire of protein families involved in carbohydrate metabolism. To identify other protein families that are specific to this environment, we investigated the distribution of protein families in the currently available human gut genomic and metagenomic data. Using an automated procedure, we identified a group of protein families strongly overrepresented in the human gut. These not only include many families described previously but also, interestingly, a large group of previously unrecognized protein families, which suggests that we still have much to discover about this environment. The identification and analysis of these families could provide us with new information about an environment critical to our health and well being.


Subject(s)
Bacterial Proteins/genetics , Computational Biology/methods , Gastrointestinal Tract/microbiology , Genome, Bacterial , Metagenome , Cluster Analysis , Databases, Protein , Humans
9.
Structure ; 17(2): 303-13, 2009 Feb 13.
Article in English | MEDLINE | ID: mdl-19217401

ABSTRACT

The crystal structures of two homologous endopeptidases from cyanobacteria Anabaena variabilis and Nostoc punctiforme were determined at 1.05 and 1.60 Å resolution, respectively, and contain a bacterial SH3-like domain (SH3b) and a ubiquitous cell-wall-associated NlpC/P60 (or CHAP) cysteine peptidase domain. The NlpC/P60 domain is a primitive, papain-like peptidase in the CA clan of cysteine peptidases with a Cys126/His176/His188 catalytic triad and a conserved catalytic core. We deduced from structure and sequence analysis, and then experimentally, that these two proteins act as gamma-D-glutamyl-L-diamino acid endopeptidases (EC 3.4.22.-). The active site is located near the interface between the SH3b and NlpC/P60 domains, where the SH3b domain may help define substrate specificity, instead of functioning as a targeting domain, so that only muropeptides with an N-terminal L-alanine can bind to the active site.


Subject(s)
Endopeptidases/chemistry , Endopeptidases/metabolism , Peptidoglycan/chemistry , Peptidoglycan/metabolism , Amino Acid Sequence , Anabaena variabilis/chemistry , Anabaena variabilis/enzymology , Catalytic Domain , Cysteine Endopeptidases/chemistry , Cysteine Endopeptidases/metabolism , Cysteine Endopeptidases/physiology , Endopeptidases/physiology , Models, Biological , Models, Molecular , Molecular Sequence Data , Nostoc/chemistry , Nostoc/enzymology , Peptide Fragments/chemistry , Peptide Fragments/metabolism , Protein Structure, Tertiary , Sequence Homology, Amino Acid , Substrate Specificity , src Homology Domains
10.
BMC Bioinformatics ; 11: 426, 2010 Aug 17.
Article in English | MEDLINE | ID: mdl-20716366

ABSTRACT

BACKGROUND: Many protein structures determined in high-throughput structural genomics centers, despite their significant novelty and importance, are available only as PDB depositions and are not accompanied by a peer-reviewed manuscript. Because of this they are not accessible by the standard tools of literature searches, remaining underutilized by the broad biological community. RESULTS: To address this issue we have developed TOPSAN, The Open Protein Structure Annotation Network, a web-based platform that combines the openness of the wiki model with the quality control of scientific communication. TOPSAN enables research collaborations and scientific dialogue among globally distributed participants, the results of which are reviewed by experts and eventually validated by peer review. The immediate goal of TOPSAN is to harness the combined experience, knowledge, and data from such collaborations in order to enhance the impact of the astonishing number and diversity of structures being determined by structural genomics centers and high-throughput structural biology. CONCLUSIONS: TOPSAN combines features of automated annotation databases and formal, peer-reviewed scientific research literature, providing an ideal vehicle to bridge a gap between rapidly accumulating data from high-throughput technologies and a much slower pace for its analysis and integration with other, relevant research.


Subject(s)
Databases, Genetic , Genomics/methods , Proteins/chemistry , Computational Biology/methods , Cooperative Behavior , Internet , Microarray Analysis , Proteins/genetics
11.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1137-42, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944202

ABSTRACT

The Joint Center for Structural Genomics high-throughput structural biology pipeline has delivered more than 1000 structures to the community over the past ten years. The JCSG has made a significant contribution to the overall goal of the NIH Protein Structure Initiative (PSI) of expanding structural coverage of the protein universe, as well as making substantial inroads into structural coverage of an entire organism. Targets are processed through an extensive combination of bioinformatics and biophysical analyses to efficiently characterize and optimize each target prior to selection for structure determination. The pipeline uses parallel processing methods at almost every step in the process and can adapt to a wide range of protein targets from bacterial to human. The construction, expansion and optimization of the JCSG gene-to-structure pipeline over the years have resulted in many technological and methodological advances and developments. The vast number of targets and the enormous amounts of associated data processed through the multiple stages of the experimental pipeline required the development of a variety of valuable resources that, wherever feasible, have been converted to free-access web-based tools and applications.


Subject(s)
Databases, Genetic , Genomics , Humans , Protein Conformation
12.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1143-7, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944203

ABSTRACT

The NIH Protein Structure Initiative centers, such as the Joint Center for Structural Genomics (JCSG), have developed highly efficient technological platforms that are capable of experimentally determining the three-dimensional structures of hundreds of proteins per year. However, the overwhelming majority of the almost 5000 protein structures determined by these centers have yet to be described in the peer-reviewed literature. In a high-throughput structural genomics environment, the process of structure determination occurs independently of any associated experimental characterization of function, which creates a challenge for the annotation and analysis of structures and the publication of these results. This challenge has been addressed by developing TOPSAN ('The Open Protein Structure Annotation Network'), which enables the generation of knowledge via collaborations among globally distributed contributors supported by automated amalgamation of available information. TOPSAN currently provides annotations for all protein structures determined by the JCSG in addition to preliminary annotations on a large number of structures from the other PSI production centers. TOPSAN-enabled collaborations have resulted in insightful structure-function analysis for many proteins and have led to numerous peer-reviewed publications, as exemplified by the articles included in this issue of Acta Crystallographica Section F.


Subject(s)
Databases, Genetic , Genomics , Humans , Internet , Protein Conformation
13.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1153-9, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944205

ABSTRACT

The first structural representative of the domain of unknown function DUF2006 family, also known as Pfam family PF09410, comprises a lipocalin-like fold with domain duplication. The finding of the calycin signature in the N-terminal domain, combined with remote sequence similarity to two other protein families (PF07143 and PF08622) implicated in isoprenoid metabolism and the oxidative stress response, supports an involvement in lipid metabolism. Clusters of conserved residues that interact with ligand mimetics suggest that the binding and regulation sites map to the N-terminal domain and to the interdomain interface, respectively.


Subject(s)
Bacterial Proteins/chemistry , Databases, Genetic , Lipid Metabolism , Nitrosomonas europaea/chemistry , Amino Acid Sequence , Crystallography, X-Ray , Models, Molecular , Molecular Sequence Data , Nitrosomonas europaea/metabolism , Oxidative Stress , Protein Structure, Tertiary , Sequence Alignment , Sequence Homology, Amino Acid
14.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1160-6, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944206

ABSTRACT

SSO2064 is the first structural representative of PF01796 (DUF35), a large prokaryotic family with a wide phylogenetic distribution. The structure reveals a novel two-domain architecture comprising an N-terminal, rubredoxin-like, zinc ribbon and a C-terminal, oligonucleotide/oligosaccharide-binding (OB) fold domain. Additional N-terminal helical segments may be involved in protein-protein interactions. Domain architectures, genomic context analysis and functional evidence from certain bacterial representatives of this family suggest that these proteins form a novel fatty-acid-binding component that is involved in the biosynthesis of lipids and polyketide antibiotics and that they possibly function as acyl-CoA-binding proteins. This structure has led to a re-evaluation of the DUF35 family, which has now been split into two entries in the latest Pfam release (v.24.0).


Subject(s)
Acyl Coenzyme A/chemistry , Archaeal Proteins/chemistry , Protein Folding , Sulfolobus solfataricus/chemistry , Zinc/chemistry , Amino Acid Sequence , Archaeal Proteins/genetics , Archaeal Proteins/metabolism , Crystallography, X-Ray , Genome, Archaeal , Models, Molecular , Molecular Sequence Data , Protein Binding , Protein Structure, Tertiary , Sulfolobus solfataricus/genetics , Sulfolobus solfataricus/metabolism
15.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1167-73, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944207

ABSTRACT

The crystal structure of Dhaf4260 from Desulfitobacterium hafniense DCB-2 was determined by single-wavelength anomalous diffraction (SAD) to a resolution of 2.01 Å using the semi-automated high-throughput pipeline of the Joint Center for Structural Genomics (JCSG) as part of the NIGMS Protein Structure Initiative (PSI). This protein structure is the first representative of the PF04016 (DUF364) Pfam family and reveals a novel combination of two well known domains (an enolase N-terminal-like fold followed by a Rossmann-like domain). Structural and bioinformatic analyses reveal partial similarities to Rossmann-like methyltransferases, with residues from the enolase-like fold combining to form a unique active site that is likely to be involved in the condensation or hydrolysis of molecules implicated in the synthesis of flavins, pterins or other siderophores. The genome context of Dhaf4260 and homologs additionally supports a role in heavy-metal chelation.


Subject(s)
Bacterial Proteins/chemistry , Desulfitobacterium/chemistry , Metals, Heavy/chemistry , Phosphopyruvate Hydratase/chemistry , Protein Folding , Amino Acid Sequence , Bacterial Proteins/metabolism , Catalytic Domain , Crystallography, X-Ray , Desulfitobacterium/metabolism , Metals, Heavy/metabolism , Models, Molecular , Molecular Sequence Data , Protein Binding , Protein Structure, Tertiary
16.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1198-204, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944211

ABSTRACT

The crystal structure of Jann_2411 from Jannaschia sp. strain CCS1, a member of the Pfam PF07336 family classified as a domain of unknown function (DUF1470), was solved to a resolution of 1.45 Å by multiple-wavelength anomalous dispersion (MAD). This protein is the first structural representative of the DUF1470 Pfam family. Structural analysis revealed a two-domain organization, with the N-terminal domain presenting a new fold called the ABATE domain that may bind an as yet unknown ligand. The C-terminal domain forms a treble-clef zinc finger that is likely to be involved in DNA binding. Analysis of the Jann_2411 protein and the broader ABATE-domain family suggests a role as stress-induced transcriptional regulators.


Subject(s)
Bacterial Proteins/chemistry , Rhodobacteraceae/chemistry , Amino Acid Sequence , Crystallography, X-Ray , Models, Molecular , Molecular Sequence Data , Protein Structure, Quaternary , Protein Structure, Tertiary , Sequence Alignment , Zinc Fingers
17.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1205-10, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944212

ABSTRACT

The structure of LP2179, a member of the PF08866 (DUF1831) family, suggests a novel α+β fold comprising two β-sheets packed against a single helix. A remote structural similarity to two other uncharacterized protein families specific to the Bacillus genus (PF08868 and PF08968), as well as to prokaryotic S-adenosylmethionine decarboxylases, is consistent with a role in amino-acid metabolism. Genomic neighborhood analysis of LP2179 supports this functional assignment, which might also then be extended to PF08868 and PF08968.


Subject(s)
Amino Acids/metabolism , Bacterial Proteins/chemistry , Lactobacillus plantarum/chemistry , Protein Folding , Amino Acid Sequence , Bacterial Proteins/metabolism , Crystallography, X-Ray , Lactobacillus plantarum/metabolism , Models, Molecular , Molecular Sequence Data , Protein Structure, Tertiary , Sequence Alignment , Structural Homology, Protein
18.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1211-7, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944213

ABSTRACT

The crystal structure of PA1994 from Pseudomonas aeruginosa, a member of the Pfam PF06475 family classified as a domain of unknown function (DUF1089), reveals a novel fold comprising a 15-stranded β-sheet wrapped around a single α-helix that assembles into a tight dimeric arrangement. The remote structural similarity to lipoprotein localization factors, in addition to the presence of an acidic pocket that is conserved in DUF1089 homologs, phospholipid-binding and sugar-binding proteins, indicates a role for PA1994 and the DUF1089 family in glycolipid metabolism. Genome-context analysis lends further support to the involvement of this family of proteins in glycolipid metabolism and indicates possible activation of DUF1089 homologs under conditions of bacterial cell-wall stress or host-pathogen interactions.


Subject(s)
Bacterial Proteins/chemistry , Glycolipids/metabolism , Protein Folding , Pseudomonas aeruginosa/chemistry , Amino Acid Sequence , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Crystallography, X-Ray , Genome, Bacterial , Models, Molecular , Molecular Sequence Data , Protein Structure, Quaternary , Protein Structure, Tertiary , Pseudomonas aeruginosa/genetics , Pseudomonas aeruginosa/metabolism
19.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1218-25, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944214

ABSTRACT

The crystal structures of SPO0140 and Sbal_2486 were determined using the semiautomated high-throughput pipeline of the Joint Center for Structural Genomics (JCSG) as part of the NIGMS Protein Structure Initiative (PSI). The structures revealed a conserved core with domain duplication and a superficial similarity of the C-terminal domain to pleckstrin homology-like folds. The conservation of the domain interface indicates a potential binding site that is likely to involve a nucleotide-based ligand, with genome-context and gene-fusion analyses additionally supporting a role for this family in signal transduction, possibly during oxidative stress.


Subject(s)
Bacterial Proteins/chemistry , Protein Folding , Rhodobacteraceae/chemistry , Shewanella/chemistry , Signal Transduction , Amino Acid Sequence , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Crystallography, X-Ray , Genome, Bacterial , Models, Molecular , Molecular Sequence Data , Protein Structure, Secondary , Protein Structure, Tertiary , Rhodobacteraceae/genetics , Rhodobacteraceae/metabolism , Shewanella/genetics , Shewanella/metabolism , Structural Homology, Protein
20.
Acta Crystallogr Sect F Struct Biol Cryst Commun ; 66(Pt 10): 1230-6, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20944216

ABSTRACT

YeaZ is involved in a protein network that is essential for bacteria. The crystal structure of YeaZ from Thermotoga maritima was determined to 2.5 Å resolution. Although this protein belongs to a family of ancient actin-like ATPases, it appears that it has lost the ability to bind ATP since it lacks some key structural features that are important for interaction with ATP. A conserved surface was identified, supporting its role in the formation of protein complexes.


Subject(s)
Bacterial Proteins/chemistry , Thermotoga maritima/chemistry , Amino Acid Sequence , Crystallography, X-Ray , Models, Molecular , Molecular Sequence Data , Protein Structure, Quaternary , Protein Structure, Tertiary , Sequence Alignment