1.
ISME J ; 7(12): 2330-9, 2013 Dec.
Article in English | MEDLINE | ID: mdl-23949665

ABSTRACT

Microbial community samples can be efficiently surveyed in high throughput by sequencing markers such as the 16S ribosomal RNA gene. Often, a collection of samples is then selected for subsequent metagenomic, metabolomic or other follow-up. Two-stage study design has long been used in ecology but has not yet been studied in-depth for high-throughput microbial community investigations. To avoid ad hoc sample selection, we developed and validated several purposive sample selection methods for two-stage studies (that is, biological criteria) targeting differing types of microbial communities. These methods select follow-up samples from large community surveys, with criteria including samples typical of the initially surveyed population, targeting specific microbial clades or rare species, maximizing diversity, representing extreme or deviant communities, or identifying communities distinct or discriminating among environment or host phenotypes. The accuracies of each sampling technique and their influences on the characteristics of the resulting selected microbial community were evaluated using both simulated and experimental data. Specifically, all criteria were able to identify samples whose properties were accurately retained in 318 paired 16S amplicon and whole-community metagenomic (follow-up) samples from the Human Microbiome Project. Some selection criteria resulted in follow-up samples that were strongly non-representative of the original survey population; diversity maximization particularly undersampled community configurations. Only selection of intentionally representative samples minimized differences in the selected sample set from the original microbial survey. An implementation is provided as the microPITA (Microbiomes: Picking Interesting Taxa for Analysis) software for two-stage study design of microbial communities.
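One of the selection criteria named in the abstract, maximizing diversity, can be illustrated with a minimal sketch. This is not the microPITA implementation: the Shannon index as the diversity measure, the function names, and the toy survey counts are all assumptions made for illustration.

```python
import math


def shannon_diversity(counts):
    """Shannon index H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def select_most_diverse(samples, k):
    """Return the names of the k samples with the highest Shannon diversity.

    `samples` maps a sample name to a list of per-taxon counts
    (a hypothetical stand-in for a 16S survey abundance table).
    """
    ranked = sorted(
        samples,
        key=lambda name: shannon_diversity(samples[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy first-stage survey (made-up counts, not data from the study).
survey = {
    "sample_a": [10, 10, 10, 10],  # even community
    "sample_b": [37, 1, 1, 1],     # dominated by one taxon
    "sample_c": [20, 20, 0, 0],    # two taxa only
}
selected = select_most_diverse(survey, 2)
```

Ranking purely on a diversity score favors even, taxon-rich samples, which is consistent with the abstract's caution that diversity maximization can yield a follow-up set that is not representative of the original survey.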


Subject(s)
Ecology/methods , Microbiological Techniques/methods , Research Design/standards , Bias , Ecology/standards , Humans , Metagenomics , Microbiological Techniques/standards , RNA, Ribosomal, 16S/genetics , Reproducibility of Results
2.
BMC Genomics ; 13: 65, 2012 Feb 10.
Article in English | MEDLINE | ID: mdl-22325056

ABSTRACT

BACKGROUND: Taxa counting is a major problem faced by analysis of metagenomic data. The most popular method relies on analysis of 16S rRNA sequences, but some studies employ also protein based analyses. It would be advantageous to have a method that is applicable directly to short sequences, of the kind extracted from samples in modern metagenomic research. This is achieved by the technique proposed here. RESULTS: We employ specific peptides, deduced from aminoacyl tRNA synthetases, as markers for the occurrence of single genes in data. Sequences carrying these markers are aligned and compared with each other to provide a lower limit for taxa counts in metagenomic data. The method is compared with 16S rRNA searches on a set of known genomes. The taxa counting problem is analyzed mathematically and a heuristic algorithm is proposed. When applied to genomic contigs of a recent human gut microbiome study, the taxa counting method provides information on numbers of different species and strains. We then apply our method to short read data and demonstrate how it can be calibrated to cope with errors. Comparison to known databases leads to estimates of the percentage of novelties, and the type of phyla involved. CONCLUSIONS: A major advantage of our method is its simplicity: it relies on searching sequences for the occurrence of just 4000 specific peptides belonging to the S61 subgroup of aaRS enzymes. When compared to other methods, it provides additional insight into the taxonomic contents of metagenomic data. Furthermore, it can be directly applied to short read data, avoiding the need for genomic contig reconstruction, and taking into account short reads that are otherwise discarded as singletons. Hence it is very suitable for a fast analysis of next generation sequencing data.


Subject(s)
Amino Acyl-tRNA Synthetases/chemistry , Amino Acyl-tRNA Synthetases/genetics , Metagenomics/methods , Peptides/genetics , Algorithms , Amino Acid Sequence , Bacteria/classification , Bacteria/genetics , Genome, Bacterial , Humans , Intestines/microbiology , Metagenome/genetics , Molecular Sequence Data , Peptides/chemistry , Phylogeny , RNA, Ribosomal, 16S/genetics , Sequence Analysis, DNA/methods
3.
BMC Bioinformatics ; 11: 390, 2010 Jul 22.
Article in English | MEDLINE | ID: mdl-20649951

ABSTRACT

BACKGROUND: We propose a method for deriving enzymatic signatures from short read metagenomic data of unknown species. The short read data are converted to six pseudo-peptide candidates. We search for occurrences of Specific Peptides (SPs) on the latter. SPs are peptides that are indicative of enzymatic function as defined by the Enzyme Commission (EC) nomenclature. The number of SP hits on an ensemble of short reads is counted and then converted to estimates of numbers of enzymatic genes associated with different EC categories in the studied metagenome. Relative amounts of different EC categories define the enzymatic spectrum, without the need to perform genomic assemblies of short reads. RESULTS: The method is developed and tested on 22 bacteria for which there exist many EC annotations in Uniprot. Enzymatic signatures are derived for 3 metagenomes, and their functional profiles are explored. We extend the SP methodology to taxon-specific SPs (TSPs), allowing us to estimate taxonomic features of metagenomic data from short reads. Using recent Swiss-Prot data we obtain TSPs for different phyla of bacteria, and different classes of proteobacteria. These allow us to analyze the major taxonomic content of 4 different metagenomic data-sets. CONCLUSIONS: The SP methodology can be successfully extended to applications on short read genomic and metagenomic data. This leads to direct derivation of enzymatic signatures from raw short reads. Furthermore, by employing TSPs, one obtains valuable taxonomic information.
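The conversion of a short read into six pseudo-peptide candidates can be sketched as a six-frame translation (three reading frames on each strand), followed by a plain substring search for peptides. This is an illustrative sketch only, not the authors' code: it assumes the standard genetic code, the peptide strings and reads are made up, and the abstract's further step of converting hit counts into gene-number estimates is not shown.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(read):
    """Return the six pseudo-peptides: three forward frames, then three reverse."""
    forward = read.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (forward, reverse)
        for offset in range(3)
    ]


def count_sp_hits(reads, specific_peptides):
    """Count, for each peptide, the reads carrying it in at least one frame."""
    hits = {sp: 0 for sp in specific_peptides}
    for read in reads:
        frames = six_frame_translations(read)
        for sp in specific_peptides:
            if any(sp in frame for frame in frames):
                hits[sp] += 1
    return hits


# Hypothetical reads and peptides, chosen only to exercise the code.
reads = ["ATGGCCAAGTTT", "AAACTTGGCCAT", "GGGGGGGGGGGG"]
sp_hits = count_sp_hits(reads, ["AKF", "KLG"])
```

The second read is the reverse complement of the first, so both carry the same peptides on opposite strands; searching all six frames is what lets the count ignore read orientation.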


Subject(s)
Bacteria/classification , Bacteria/genetics , Metagenome , Metagenomics/methods , Bacteria/enzymology , Bacterial Proteins/analysis , Databases, Protein , Escherichia coli/enzymology , Escherichia coli/genetics , Genome, Bacterial , Peptides/analysis
4.
BMC Bioinformatics ; 10: 446, 2009 Dec 24.
Article in English | MEDLINE | ID: mdl-20034383

ABSTRACT

BACKGROUND: Predicting the function of a protein from its sequence is a long-standing challenge of bioinformatic research, typically addressed using either sequence-similarity or sequence-motifs. We employ the novel motif method that consists of Specific Peptides (SPs) that are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology that allows for searching SPs on arbitrary proteins, determining from its sequence whether a protein is an enzyme and what the enzyme's EC classification is. RESULTS: We extract novel SP sets from Swiss-Prot enzyme data. Using a training set of July 2006, and test sets of July 2008, we find that the predictive power of SPs, both for true-positives (enzymes) and true-negatives (non-enzymes), depends on the coverage length of all SP matches (the number of amino-acids matched on the protein sequence). DME is quite different from BLAST. Comparing the two on an enzyme test set of July 2008, we find that DME has lower recall. On the other hand, DME can provide predictions for proteins regarded by BLAST as having low homologies with known enzymes, thus supplying complementary information. We test our method on a set of proteins belonging to 10 bacteria, dated July 2008, establishing the usefulness of the coverage-length cutoff to determine true-negatives. Moreover, sifting through our predictions we find that some of them have been substantiated by Swiss-Prot annotations by July 2009. Finally we extract, for production purposes, a novel SP set trained on all Swiss-Prot enzymes as of July 2009. This new set increases considerably the recall of DME. The new SP set is being applied to three metagenomes: Sargasso Sea with over 1,000,000 proteins, producing predictions of over 220,000 enzymes, and two human gut metagenomes. The outcome of these analyses can be characterized by the enzymatic profile of the metagenomes, describing the relative numbers of enzymes observed for different EC categories. CONCLUSIONS: Employing SPs for predicting enzymatic activity of proteins works well once one utilizes coverage-length criteria. In our analysis, L ≥ 7 has led to highly accurate results.
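The coverage-length criterion can be made concrete with a short sketch. It assumes that L denotes the coverage length described in the abstract (the number of distinct amino-acid positions covered by all SP matches) and that overlapping matches are counted once; the function names, the protein string, and the peptides are hypothetical and this is not the DME implementation.

```python
def coverage_length(protein, specific_peptides):
    """Number of distinct residue positions covered by any peptide match."""
    covered = set()
    for sp in specific_peptides:
        if not sp:
            continue  # ignore empty peptides
        start = protein.find(sp)
        while start != -1:
            covered.update(range(start, start + len(sp)))
            # Continue one position later so overlapping repeats are found.
            start = protein.find(sp, start + 1)
    return len(covered)


def predicted_enzyme(protein, specific_peptides, min_coverage=7):
    """Apply the cutoff L >= min_coverage (7 is the value quoted in the abstract)."""
    return coverage_length(protein, specific_peptides) >= min_coverage


# Hypothetical protein and peptides, chosen only to show overlap handling.
protein = "MKTAYIAKQRQISFVK"
peptides = ["TAYIA", "IAKQR", "WWW"]
length = coverage_length(protein, peptides)
```

Here the two matching peptides overlap on two residues, so they cover 8 positions rather than 10; counting the union of positions keeps overlapping matches from inflating the score.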


Subject(s)
Data Mining/methods , Peptides/chemistry , Proteins/chemistry , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/classification , Sequence Alignment/methods
5.
ISME J ; 1(6): 492-501, 2007 Oct.
Article in English | MEDLINE | ID: mdl-18043651

ABSTRACT

Cyanobacteria of the genera Synechococcus and Prochlorococcus are important contributors to photosynthetic productivity in the open ocean. The discovery of genes (psbA, psbD) that encode key photosystem II proteins (D1, D2) in the genomes of phages that infect these cyanobacteria suggests new paradigms for the regulation, function and evolution of photosynthesis in the vast pelagic ecosystem. Reports on the prevalence and expression of phage photosynthesis genes, and evolutionary data showing a potential recombination of phage and host genes, suggest a model in which phage photosynthesis genes help support photosynthetic activity in their hosts during the infection process. Here, using metagenomic data in natural ocean samples, we show that about 60% of the psbA genes in surface water along the global ocean sampling transect are of phage origin, and that the phage genes are undergoing an independent selection for distinct D1 proteins. Furthermore, we show that different viral psbA genes are expressed in the environment.


Subject(s)
Bacteriophages/genetics , Photosynthetic Reaction Center Complex Proteins/genetics , Prochlorococcus/virology , Seawater/microbiology , Synechococcus/virology , Amino Acid Sequence , Cluster Analysis , DNA, Viral/chemistry , DNA, Viral/genetics , Genomics , Molecular Sequence Data , Photosystem II Protein Complex/genetics , Sequence Alignment , Sequence Analysis, DNA
6.
PLoS Comput Biol ; 3(8): e167, 2007 Aug.
Article in English | MEDLINE | ID: mdl-17722976

ABSTRACT

Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also subserve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 +/- 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.
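The idea of keeping only motifs that appear on a single branch of the four-level EC hierarchy can be sketched as follows. This is a simplified illustration, not the MEX-based pipeline of the paper: it assumes each candidate motif arrives with the EC numbers of the enzymes that contain it, and the motif strings and EC assignments below are toy values.

```python
def common_ec_branch(ec_numbers):
    """Deepest EC prefix shared by all given EC numbers ('' if none)."""
    split = [ec.split(".") for ec in ec_numbers]
    branch = []
    for levels in zip(*split):
        if len(set(levels)) != 1:
            break
        branch.append(levels[0])
    return ".".join(branch)


def filter_specific_peptides(motif_to_ecs):
    """Keep motifs whose enzymes all share an EC branch; map motif -> branch."""
    specific = {}
    for motif, ecs in motif_to_ecs.items():
        branch = common_ec_branch(ecs)
        if branch:
            specific[motif] = branch
    return specific


# Toy candidates: the first sits on one branch, the second spans two classes.
candidates = {
    "GHSMG": ["3.4.21.4", "3.4.21.5"],
    "AAGAA": ["1.1.1.1", "2.7.1.1"],
}
specific = filter_specific_peptides(candidates)
```

A motif confined to a deep branch (here the third EC level) supports a more specific functional assignment than one confined only to a top-level class.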


Subject(s)
Enzymes/chemistry , Enzymes/metabolism , Peptides/chemistry , Peptides/metabolism , Sequence Analysis, Protein/methods , Amino Acid Motifs , Amino Acid Sequence , Molecular Sequence Data , Protein Structure, Tertiary , Structure-Activity Relationship