Results 1 - 20 of 38
1.
Nat Methods ; 14(11): 1063-1071, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28967888

ABSTRACT

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.


Subjects
Metagenomics , Software , Algorithms , Benchmarking , DNA Sequence Analysis
2.
Nature ; 478(7369): 395-8, 2011 Oct 05.
Article in English | MEDLINE | ID: mdl-21976020

ABSTRACT

Maize smut caused by the fungus Ustilago maydis is a widespread disease characterized by the development of large plant tumours. U. maydis is a biotrophic pathogen that requires living plant tissue for its development and establishes an intimate interaction zone between fungal hyphae and the plant plasma membrane. U. maydis actively suppresses plant defence responses by secreted protein effectors. Its effector repertoire comprises at least 386 genes mostly encoding proteins of unknown function and expressed exclusively during the biotrophic stage. The U. maydis secretome also contains about 150 proteins with probable roles in fungal nutrition, fungal cell wall modification and host penetration as well as proteins unlikely to act in the fungal-host interface like a chorismate mutase. Chorismate mutases are key enzymes of the shikimate pathway and catalyse the conversion of chorismate to prephenate, the precursor for tyrosine and phenylalanine synthesis. Root-knot nematodes inject a secreted chorismate mutase into plant cells likely to affect development. Here we show that the chorismate mutase Cmu1 secreted by U. maydis is a virulence factor. The enzyme is taken up by plant cells, can spread to neighbouring cells and changes the metabolic status of these cells through metabolic priming. Secreted chorismate mutases are found in many plant-associated microbes and might serve as general tools for host manipulation.


Subjects
Chorismate Mutase/metabolism , Ustilago/enzymology , Ustilago/pathogenicity , Virulence Factors/metabolism , Zea mays/metabolism , Zea mays/microbiology , Cytoplasm/enzymology , Plant Gene Expression Regulation , Genetic Complementation Test , Host-Pathogen Interactions , Metabolome , Biological Models , Plant Proteins/metabolism , Plastids/enzymology , Protein Multimerization , Saccharomyces cerevisiae/genetics , Salicylic Acid/metabolism , Two-Hybrid System Techniques , Virulence Factors/genetics
3.
Bioinformatics ; 31(9): 1382-8, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25540185

ABSTRACT

MOTIVATION: With rapidly increasing volumes of biological sequence data, the functional analysis of new sequences in terms of similarities to known protein families challenges classical bioinformatics. RESULTS: The ultrafast protein classification (UProC) toolbox implements a novel algorithm ('Mosaic Matching') for large-scale sequence analysis. UProC is three orders of magnitude faster than profile-based methods and in a metagenome simulation study achieved up to 80% higher sensitivity on unassembled 100 bp reads. AVAILABILITY AND IMPLEMENTATION: UProC is available as open-source software at https://github.com/gobics/uproc. Precompiled databases (Pfam) are linked on the UProC homepage: http://uproc.gobics.de/. CONTACT: peter@gobics.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Metagenomics/methods , Protein Tertiary Structure , Software , Algorithms , Metagenome , Open Reading Frames
4.
Bioinformatics ; 31(17): 2882-4, 2015 Sep 01.
Article in English | MEDLINE | ID: mdl-25957349

ABSTRACT

MOTIVATION: The characterization of phylogenetic and functional diversity is a key element in the analysis of microbial communities. Amplicon-based sequencing of marker genes, such as 16S rRNA, is a powerful tool for assessing and comparing the structure of microbial communities at a high phylogenetic resolution. Because 16S rRNA sequencing is more cost-effective than whole metagenome shotgun sequencing, marker gene analysis is frequently used for broad studies that involve a large number of different samples. However, in comparison to shotgun sequencing approaches, insights into the functional capabilities of the community get lost when restricting the analysis to taxonomic assignment of 16S rRNA data. RESULTS: Tax4Fun is a software package that predicts the functional capabilities of microbial communities based on 16S rRNA datasets. We evaluated Tax4Fun on a range of paired metagenome/16S rRNA datasets to assess its performance. Our results indicate that Tax4Fun provides a good approximation to functional profiles obtained from metagenomic shotgun sequencing approaches. AVAILABILITY AND IMPLEMENTATION: Tax4Fun is an open-source R package and applicable to output as obtained from the SILVAngs web server or the application of QIIME with a SILVA database extension. Tax4Fun is freely available for download at http://tax4fun.gobics.de/. CONTACT: kasshau@gwdg.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Bacteria/genetics , Computational Biology/methods , Bacterial Genes/genetics , Metagenomics/methods , 16S Ribosomal RNA/genetics , RNA Sequence Analysis/methods , Software , Bacteria/classification , Factual Databases , Nucleic Acid Databases , Genetic Markers , Metagenome , Phylogeny , Bacterial RNA/genetics
5.
BMC Genomics ; 15: 1003, 2014 Nov 20.
Article in English | MEDLINE | ID: mdl-25409897

ABSTRACT

BACKGROUND: The annotation of biomolecular functions is an essential step in the analysis of newly sequenced organisms. Usually, the functions are inferred from predicted genes on the genome using homology search techniques. A high quality genomic sequence is an important prerequisite which, however, is difficult to achieve for certain organisms, such as hybrids or organisms with a large genome. For functional analysis it is also possible to use a de novo transcriptome assembly but the computational requirements can be demanding. Up to now, it is unclear how much of the functional repertoire of an organism can be reliably predicted from unassembled RNA-seq short reads alone. RESULTS: We have conducted a study to investigate to what degree it is possible to reconstruct the functional profile of an organism from unassembled transcriptome data. We simulated the de novo prediction of biomolecular functions for Arabidopsis thaliana using a comprehensive RNA-seq data set. We evaluated the prediction performance using several homology search methods in combination with different evidence measures. For the decision on the presence or absence of a particular function under noisy conditions we propose a statistical mixture model enabling unsupervised estimation of a detection threshold. Our results indicate that the prediction of the biomolecular functions from the KEGG database is possible with a high sensitivity up to 94 percent. In this setting, the application of the mixture model for automatic threshold calibration allowed the reduction of the falsely predicted functions down to 4 percent. Furthermore, we found that our statistical approach even outperforms the prediction from a de novo transcriptome assembly. CONCLUSION: The analysis of an organism's transcriptome can provide a solid basis for the prediction of biomolecular functions. 
Using RNA-seq short reads directly, the functional profile of an organism can be reconstructed in a computationally efficient way to provide a draft annotation in cases where the classical genome-based approaches cannot be applied.


Subjects
Arabidopsis/genetics , Genetic Databases , RNA Sequence Analysis/methods , Calibration , Genetic Models , Normal Distribution , Transcriptome/genetics
6.
Bioinformatics ; 29(8): 973-80, 2013 Apr 15.
Article in English | MEDLINE | ID: mdl-23418187

ABSTRACT

MOTIVATION: Metagenome analysis requires tools that can estimate the taxonomic abundances in anonymous sequence data over the whole range of biological entities. Because there is usually no prior knowledge about the data composition, not only all domains of life but also viruses have to be included in taxonomic profiling. Such a full-range approach, however, is difficult to realize owing to the limited coverage of available reference data. In particular, archaea and viruses are generally not well represented by current genome databases. RESULTS: We introduce a novel approach to taxonomic profiling of metagenomes that is based on mixture model analysis of protein signatures. Our results on simulated and real data reveal the difficulties of the existing methods when measuring archaeal or viral abundances and show the overall good profiling performance of the protein-based mixture model. As an application example, we provide a large-scale analysis of data from the Human Microbiome Project. This demonstrates the utility of our method as a first instance profiling tool for a fast estimate of the community structure. AVAILABILITY: http://gobics.de/TaxyPro. SUPPLEMENTARY INFORMATION: Supplementary Material is available at Bioinformatics online.


Subjects
Metagenomics/methods , Protein Tertiary Structure , Archaeal DNA/analysis , Viral DNA/analysis , Humans , Metagenome , Phylogeny
7.
Plant Cell ; 23(4): 1556-72, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21487095

ABSTRACT

In the postgenomic era, accurate prediction tools are essential for identification of the proteomes of cell organelles. Prediction methods have been developed for peroxisome-targeted proteins in animals and fungi but are missing specifically for plants. For development of a predictor for plant proteins carrying peroxisome targeting signals type 1 (PTS1), we assembled more than 2500 homologous plant sequences, mainly from EST databases. We applied a discriminative machine learning approach to derive two different prediction methods, both of which showed high prediction accuracy and recognized specific targeting-enhancing patterns in the regions upstream of the PTS1 tripeptides. Upon application of these methods to the Arabidopsis thaliana genome, 392 gene models were predicted to be peroxisome targeted. These predictions were extensively tested in vivo, resulting in a high experimental verification rate of Arabidopsis proteins previously not known to be peroxisomal. The prediction methods were able to correctly infer novel PTS1 tripeptides, which even included novel residues. Twenty-three newly predicted PTS1 tripeptides were experimentally confirmed, and a high variability of the plant PTS1 motif was discovered. These prediction methods will be instrumental in identifying low-abundance and stress-inducible peroxisomal proteins and defining the entire peroxisomal proteome of Arabidopsis and agronomically important crop plants.


Subjects
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Artificial Intelligence , Computational Biology/methods , Peroxisomes/metabolism , Protein Sorting Signals , Amino Acid Sequence , Arabidopsis/genetics , Arabidopsis Proteins/chemistry , Protein Databases , Plant Genome/genetics , Biological Models , Molecular Sequence Data , Peptides , Protein Transport , Reproducibility of Results , Subcellular Fractions/metabolism
8.
Microb Ecol ; 67(4): 919-30, 2014 May.
Article in English | MEDLINE | ID: mdl-24553913

ABSTRACT

Soil microorganisms play an essential role in sustaining biogeochemical processes and cycling of nutrients across different land use types. To gain insights into microbial gene transcription in forest and grassland soil, we isolated mRNA from 32 sampling sites. After sequencing of generated complementary DNA (cDNA), a total of 5,824,229 sequences could be further analyzed. We were able to assign nonribosomal cDNA sequences to all three domains of life. A dominance of bacterial sequences, which were affiliated to 25 different phyla, was found. Bacterial groups capable of aromatic compound degradation such as Phenylobacterium and Burkholderia were detected in significantly higher relative abundance in forest soil than in grassland soil. Accordingly, KEGG pathway categories related to degradation of aromatic ring-containing molecules (e.g., benzoate degradation) were identified in high abundance within forest soil-derived metatranscriptomic datasets. The impact of land use type forest on community composition and activity is evidently to a high degree caused by the presence of wood breakdown products. Correspondingly, bacterial groups known to be involved in lignin degradation and containing ligninolytic genes such as Burkholderia, Bradyrhizobium, and Azospirillum exhibited increased transcriptional activity in forest soil. Higher solar radiation in grassland presumably induced increased transcription of photosynthesis-related genes within this land use type. This is in accordance with high abundance of photosynthetic organisms and plant-infecting viruses in grassland.


Subjects
Forests , Microbiota , Soil Microbiology , Transcriptome , Archaea/classification , Archaea/genetics , Archaea/isolation & purification , Bacteria/classification , Bacteria/genetics , Bacteria/isolation & purification , Eukaryota/classification , Eukaryota/genetics , Eukaryota/isolation & purification , Grassland , Molecular Sequence Data , Phylogeny , Messenger RNA/genetics , DNA Sequence Analysis
9.
Int J Mol Sci ; 15(7): 12364-78, 2014 Jul 14.
Article in English | MEDLINE | ID: mdl-25026170

ABSTRACT

The variety of metagenomes in current databases provides a rapidly growing source of information for comparative studies. However, the quantity and quality of supplementary metadata are still lagging behind. It is therefore important to be able to identify related metagenomes by means of the available sequence data alone. We have studied efficient sequence-based methods for large-scale identification of similar metagenomes within a database retrieval context. In a broad comparison of different profiling methods we found that vector-based distance measures are well suited for the detection of metagenomic neighbors. Our evaluation on more than 1700 publicly available metagenomes indicates that for a query metagenome from a particular habitat on average nine out of ten nearest neighbors represent the same habitat category independent of the utilized profiling method or distance measure. While for well-defined labels a neighborhood accuracy of 100% can be achieved, in general the neighbor detection is severely affected by a natural overlap of manually annotated categories. In addition, we present results of a novel visualization method that is able to reflect the similarity of metagenomes in a 2D scatter plot. The visualization method shows a similarly high accuracy in the reduced space as compared with the high-dimensional profile space. Our study suggests that for inspection of metagenome neighborhoods the profiling methods and distance measures can be chosen to provide a convenient interpretation of results in terms of the underlying features. Furthermore, supplementary metadata of metagenome samples in the future needs to comply with readily available ontologies for fine-grained and standardized annotation. To make profile-based k-nearest-neighbor search and the 2D-visualization of the metagenome universe available to the research community, we included the proposed methods in our CoMet-Universe server for comparative metagenome analysis.


Subjects
Genomics/methods , Metagenome , DNA Sequence Analysis/methods , Human Genome , Humans , Microbiota/genetics
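
The profile-based k-nearest-neighbor search described in the abstract of record 9 can be illustrated with a minimal sketch. It assumes each metagenome is summarized as a fixed-length relative-abundance profile and uses Euclidean distance as one possible vector-based measure; the abstract does not say which measures or profile features the authors used. The function names, sample names and numbers below are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbors(query, database, k=3):
    """Return the k (name, distance) pairs closest to the query profile."""
    ranked = sorted(
        ((name, euclidean(query, profile)) for name, profile in database.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]

# Hypothetical three-feature profiles (e.g. relative abundances of three domain families)
database = {
    "soil_1": [0.50, 0.30, 0.20],
    "soil_2": [0.45, 0.35, 0.20],
    "gut_1": [0.10, 0.20, 0.70],
    "gut_2": [0.15, 0.15, 0.70],
    "marine_1": [0.30, 0.60, 0.10],
}
query = [0.48, 0.32, 0.20]
hits = nearest_neighbors(query, database, k=2)
```

With these toy profiles the two nearest neighbors of the query are the two soil samples, which mirrors the habitat-consistency check reported in the abstract on a much smaller scale.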
10.
Stat Appl Genet Mol Biol ; 11(1): Article 1, 2012 Jan 06.
Article in English | MEDLINE | ID: mdl-22499688

ABSTRACT

Profile Hidden Markov Models (pHMMs) are widely used to model nucleotide or protein sequence families. In many applications, a sequence family classified into several subfamilies is given and each subfamily is modeled separately by one pHMM. A major drawback of this approach is the difficulty of coping with subfamilies composed of very few sequences. Correct subtyping of human immunodeficiency virus-1 (HIV-1) sequences is one of the most crucial bioinformatic tasks affected by this problem of small subfamilies, i.e., HIV-1 subtypes with a small number of known sequences. To deal with small samples for particular subfamilies of HIV-1, we employ a machine learning approach. More precisely, we make use of an existing HMM architecture and its associated inference engine, while replacing the unsupervised estimation of emission probabilities by a supervised method. For that purpose, we use regularized linear discriminant learning together with a balancing scheme to account for the widely varying sample size. After training the multiclass linear discriminants, the corresponding weights are transformed to valid probabilities using a softmax function. We apply this modified algorithm to classify HIV-1 sequence data (in the form of partial-length HIV-1 sequences and semi-artificial recombinants) and show that the performance of pHMMs can be significantly improved by the proposed technique.


Subjects
Algorithms , HIV-1/genetics , Markov Chains , Artificial Intelligence , Factual Databases , Humans , Automated Pattern Recognition/methods
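
The softmax step mentioned in the abstract of record 10 turns arbitrary real-valued weights into a valid probability distribution. The sketch below shows only that transformation; the weight values and the idea of one weight per nucleotide at a single match state are illustrative assumptions, not details taken from the paper.

```python
import math

def softmax(scores):
    """Map real-valued scores to positive values that sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical discriminant weights for the four nucleotides at one match state
weights = {"A": 2.0, "C": 0.5, "G": -1.0, "T": 0.5}
probs = dict(zip(weights, softmax(list(weights.values()))))
```

Larger weights receive larger probabilities, equal weights receive equal probabilities, and the result can be used wherever emission probabilities are expected.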
11.
Nucleic Acids Res ; 39(Web Server issue): W518-23, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21622656

ABSTRACT

Analyzing the functional potential of newly sequenced genomes and metagenomes has become a common task in biomedical and biological research. With the advent of high-throughput sequencing technologies, comparative metagenomics opens the way to elucidate the genetically determined similarities and differences of complex microbial communities. We developed the web server 'CoMet' (http://comet.gobics.de), which provides an easy-to-use comparative metagenomics platform that is well suited for the analysis of large collections of metagenomic short read data. CoMet combines the ORF finding and subsequent assignment of protein sequences to Pfam domain families with a comparative statistical analysis. Besides comprehensive tabular data files, the CoMet server also provides visually interpretable output in terms of hierarchical clustering and multi-dimensional scaling plots and thus allows a quick overview of a given set of metagenomic samples.


Subjects
Metagenomics/methods , Software , Cluster Analysis , Statistical Data Interpretation , Internet , Metagenome , Protein Tertiary Structure , DNA Sequence Analysis , Protein Sequence Analysis
12.
Bioinformatics ; 27(12): 1618-24, 2011 Jun 15.
Article in English | MEDLINE | ID: mdl-21546400

ABSTRACT

MOTIVATION: Inferring the taxonomic profile of a microbial community from a large collection of anonymous DNA sequencing reads is a challenging task in metagenomics. Because existing methods for taxonomic profiling of metagenomes are all based on the assignment of fragmentary sequences to phylogenetic categories, the accuracy of results largely depends on fragment length. This dependence complicates comparative analysis of data originating from different sequencing platforms or resulting from different preprocessing pipelines. RESULTS: We here introduce a new method for taxonomic profiling based on mixture modeling of the overall oligonucleotide distribution of a sample. Our results indicate that the mixture-based profiles compare well with taxonomic profiles obtained with other methods. However, in contrast to the existing methods, our approach shows a nearly constant profiling accuracy across all kinds of read lengths and it operates at an unrivaled speed. AVAILABILITY: A platform-independent implementation of the mixture modeling approach is available in terms of a MATLAB/Octave toolbox at http://gobics.de/peter/taxy. In addition, a prototypical implementation within an easy-to-use interactive tool for Windows can be downloaded.


Subjects
Metagenomics/methods , Phylogeny , Algorithms , Metagenome , Genetic Models , DNA Sequence Analysis
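
The idea in the abstract of record 12, modeling the overall oligonucleotide distribution of a sample as a mixture of reference distributions, can be sketched as follows. This is not the published method: it assumes dinucleotide frequencies as the oligonucleotide distribution and uses a plain expectation-maximization update to estimate the mixture weights, whereas the abstract does not state which estimator or oligonucleotide length the authors used. All sequences and function names are hypothetical.

```python
from collections import Counter
from itertools import product

def kmer_distribution(seqs, k=2, pseudocount=1.0):
    """Relative k-mer frequencies over the DNA alphabet, smoothed by a pseudocount."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts[km] + pseudocount for km in kmers)
    return [(counts[km] + pseudocount) / total for km in kmers]

def estimate_mixture_weights(sample, references, iterations=200):
    """EM estimate of weights w such that sample ~ sum_j w[j] * references[j]."""
    n = len(references)
    w = [1.0 / n] * n  # start from a uniform mixture
    for _ in range(iterations):
        new_w = [0.0] * n
        for i, s_i in enumerate(sample):
            mix = sum(w[j] * references[j][i] for j in range(n))
            for j in range(n):
                # share of k-mer i attributed to reference j under the current weights
                new_w[j] += s_i * w[j] * references[j][i] / mix
        w = new_w
    return w

# Two toy reference "genomes" and a sample mixed from them at 70% / 30%
ref_a = kmer_distribution(["ATATTAATTTAAATAT"])
ref_b = kmer_distribution(["GCGGCCGCGCCGGGCC"])
sample = [0.7 * a + 0.3 * b for a, b in zip(ref_a, ref_b)]
weights = estimate_mixture_weights(sample, [ref_a, ref_b])
```

Because the estimate depends only on the pooled k-mer distribution and not on classifying individual reads, a scheme of this kind is largely insensitive to read length, which is the property the abstract emphasizes.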
13.
J Biomed Biotechnol ; 2012: 263910, 2012.
Article in English | MEDLINE | ID: mdl-22550397

ABSTRACT

Statistical ranking, filtering, adduct detection, isotope correction, and molecular formula calculation are essential tasks in processing mass spectrometry data in metabolomics studies. In order to obtain high-quality data sets, a framework which incorporates all these methods is required. We present the MarVis-Filter software, which provides well-established and specialized methods for processing mass spectrometry data. For the task of ranking and filtering multivariate intensity profiles, MarVis-Filter provides the ANOVA and Kruskal-Wallis tests with adjustment for multiple hypothesis testing. Adduct and isotope correction are based on a novel algorithm which takes the similarity of intensity profiles into account and allows user-defined ionization rules. The molecular formula calculation utilizes the results of the adduct and isotope correction. For a comprehensive analysis, MarVis-Filter provides an interactive interface to combine data sets deriving from positive and negative ionization mode. The software is exemplarily applied in a metabolic case study, where octadecanoids could be identified as markers for wounding in plants.


Subjects
Algorithms , Computational Biology/methods , Mass Spectrometry/methods , Metabolomics/methods , Software , Arabidopsis/metabolism , Carbon Isotopes , Cyclopentanes/metabolism , Factual Databases , Metabolome , Biological Models , Oxylipins/metabolism
14.
Nucleic Acids Res ; 38(Web Server issue): W19-22, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20497995

ABSTRACT

We introduce web interfaces for two recent extensions of the multiple-alignment program DIALIGN. DIALIGN-TX combines the greedy heuristic previously used in DIALIGN with a more traditional 'progressive' approach for improved performance on locally and globally related sequence sets. In addition, we offer a version of DIALIGN that uses predicted protein secondary structures together with primary sequence information to construct multiple protein alignments. Both programs are available through 'Göttingen Bioinformatics Compute Server' (GOBICS).


Subjects
Protein Secondary Structure , Sequence Alignment/methods , Protein Sequence Analysis , Software , Internet
15.
Mol Microbiol ; 78(4): 964-79, 2010 Nov.
Article in English | MEDLINE | ID: mdl-21062371

ABSTRACT

The COP9 signalosome complex (CSN) is a crucial regulator of ubiquitin ligases. Defects in CSN result in embryonic impairment and death in higher eukaryotes, whereas the filamentous fungus Aspergillus nidulans survives without CSN, but is unable to complete sexual development. We investigated overall impact of CSN activity on A. nidulans cells by combined transcriptome, proteome and metabolome analysis. Absence of csn5/csnE affects transcription of at least 15% of genes during development, including numerous oxidoreductases. csnE deletion leads to changes in the fungal proteome indicating impaired redox regulation and hypersensitivity to oxidative stress. CSN promotes the formation of asexual spores by regulating developmental hormones produced by PpoA and PpoC dioxygenases. We identify more than 100 metabolites, including orsellinic acid derivatives, accumulating preferentially in the csnE mutant. We also show that CSN is required to activate glucanases and other cell wall recycling enzymes during development. These findings suggest a dual role for CSN during development: it is required early for protection against oxidative stress and hormone regulation and is later essential for control of the secondary metabolism and cell wall rearrangement.


Subjects
Aspergillus nidulans/growth & development , Aspergillus nidulans/metabolism , Cell Wall/metabolism , Fungal Gene Expression Regulation , Hormones/metabolism , Multiprotein Complexes/metabolism , Oxidative Stress , Peptide Hydrolases/metabolism , Signal Transduction , Aspergillus nidulans/genetics , COP9 Signalosome Complex , Fungal Proteins/genetics , Gene Deletion , Gene Expression Profiling , Metabolome , Multiprotein Complexes/genetics , Peptide Hydrolases/genetics , Proteome
16.
Bioinformatics ; 26(7): 960-1, 2010 Apr 01.
Article in English | MEDLINE | ID: mdl-20172941

ABSTRACT

SUMMARY: Assessment of phylogenetic diversity is a key element in the analysis of microbial communities. Tools are needed to handle next-generation sequencing data and to cope with the computational complexity of large-scale studies. Here, we present Treephyler, a tool for fast taxonomic profiling of metagenomes. Treephyler was evaluated on real metagenome data to assess its performance in comparison to previous approaches for taxonomic profiling. Results indicate that, in terms of speed and accuracy, Treephyler is prepared for next-generation sequencing techniques and large-scale analysis. AVAILABILITY: Treephyler is implemented in Perl; it is portable to all platforms and applicable to both nucleotide and protein input data. Treephyler is freely available for download at http://www.gobics.de/fabian/treephyler.php.


Subjects
Genomics/methods , Metagenome , Phylogeny , Software , Factual Databases
17.
Nucleic Acids Res ; 37(Web Server issue): W101-5, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19429689

ABSTRACT

Metagenomic sequencing projects yield numerous sequencing reads of a diverse range of uncultivated and mostly yet unknown microorganisms. In many cases, these sequencing reads cannot be assembled into longer contigs. Thus, gene prediction tools that were originally developed for whole-genome analysis are not suitable for processing metagenomes. Orphelia is a program for predicting genes in short DNA sequences that is available through a web server application (http://orphelia.gobics.de). Orphelia utilizes prediction models that were created with machine learning techniques on the basis of a wide range of annotated genomes. In contrast to other methods for metagenomic gene prediction, Orphelia has fragment length-specific prediction models for the two most popular sequencing techniques in metagenomics, chain termination sequencing and pyrosequencing. These models ensure highly specific gene predictions.


Subjects
Environmental Microbiology , Genes , Genomics , Software , Internet , Open Reading Frames , DNA Sequence Analysis , User-Computer Interface
18.
BMC Bioinformatics ; 11: 481, 2010 Sep 24.
Article in English | MEDLINE | ID: mdl-20868492

ABSTRACT

BACKGROUND: Establishing the relationship between an organism's genome sequence and its phenotype is a fundamental challenge that remains largely unsolved. Accurately predicting microbial phenotypes solely based on genomic features will allow us to infer relevant phenotypic characteristics when the availability of a genome sequence precedes experimental characterization, a scenario that is favored by the advent of novel high-throughput and single cell sequencing techniques. RESULTS: We present a novel approach to predict the phenotype of prokaryotes directly from their protein domain frequencies. Our discriminative machine learning approach provides high prediction accuracy of relevant phenotypes such as motility, oxygen requirement or spore formation. Moreover, the set of discriminative domains provides biological insight into the underlying phenotype-genotype relationship and enables deriving hypotheses on the possible functions of uncharacterized domains. CONCLUSIONS: Fast and accurate prediction of microbial phenotypes based on genomic protein domain content is feasible and has the potential to provide novel biological insights. First results of a systematic check for annotation errors indicate that our approach may also be applied to semi-automatic correction and completion of the existing phenotype annotation.


Subjects
Bacterial Proteins/chemistry , Phenotype , Algorithms , Archaeal Genome , Bacterial Genome , Molecular Sequence Annotation , Protein Tertiary Structure
19.
Environ Microbiome ; 15(1): 11, 2020 May 18.
Article in English | MEDLINE | ID: mdl-33902725

ABSTRACT

BACKGROUND: Sequencing of 16S rRNA genes has become a powerful technique to study microbial communities and their responses towards changing environmental conditions in various ecosystems. Several tools have been developed for the prediction of functional profiles from 16S rRNA gene sequencing data, because numerous questions in ecosystem ecology require knowledge of community functions in addition to taxonomic composition. However, the accuracy of these tools relies on functional information derived from genomes available in public databases, which are often not representative of the microorganisms present in the studied ecosystem. In addition, there is also a lack of tools to predict functional gene redundancy in microbial communities. RESULTS: To address these challenges, we developed Tax4Fun2, an R package for the prediction of functional profiles and functional gene redundancies of prokaryotic communities from 16S rRNA gene sequences. We demonstrate that functional profiles predicted by Tax4Fun2 are highly correlated to functional profiles derived from metagenomes of the same samples. We further show that Tax4Fun2 has higher accuracies than PICRUSt and Tax4Fun. By incorporating user-defined, habitat-specific genomic information, the accuracy and robustness of predicted functional profiles is substantially enhanced. In addition, functional gene redundancies predicted with Tax4Fun2 are highly correlated to functional gene redundancies determined for simulated microbial communities. CONCLUSIONS: Tax4Fun2 provides researchers with a unique tool to predict and investigate functional profiles of prokaryotic communities based on 16S rRNA gene sequencing data. It is easy-to-use, platform-independent and highly memory-efficient, thus enabling researchers without extensive bioinformatics knowledge or access to high-performance clusters to predict functional profiles. 
Another unique feature of Tax4Fun2 is that it allows researchers to calculate the redundancy of specific functions, which is a potentially important measure of how resilient a community will be to environmental perturbation. Tax4Fun2 is implemented in R and freely available at https://github.com/bwemheu/Tax4Fun2.

20.
BMC Bioinformatics ; 10: 92, 2009 Mar 20.
Article in English | MEDLINE | ID: mdl-19302701

ABSTRACT

BACKGROUND: A central goal of experimental studies in systems biology is to identify meaningful markers that are hidden within a diffuse background of data originating from large-scale analytical intensity measurements as obtained from metabolomic experiments. Intensity-based clustering is an unsupervised approach to the identification of metabolic markers based on the grouping of similar intensity profiles. A major problem of this basic approach is that in general there is no prior information about an adequate number of biologically relevant clusters. RESULTS: We present the tool MarVis (Marker Visualization) for data mining on intensity-based profiles using one-dimensional self-organizing maps (1D-SOMs). MarVis can import and export customizable CSV (Comma Separated Values) files and provides aggregation and normalization routines for preprocessing of intensity profiles that contain repeated measurements for a number of different experimental conditions. Robust clustering is then achieved by training of a 1D-SOM model, which introduces a similarity-based ordering of the intensity profiles. The ordering allows a convenient visualization of the intensity variations within the data and facilitates an interactive aggregation of clusters into larger blocks. The intensity-based visualization is combined with the presentation of additional data attributes, which can further support the analysis of experimental data. CONCLUSION: MarVis is a user-friendly and interactive tool for exploration of complex pattern variation in a large set of experimental intensity profiles. The application of 1D-SOMs gives a convenient overview of relevant profiles and groups of profiles. The specialized visualization effectively supports researchers in analyzing a large number of putative clusters, even though the true number of biologically meaningful groups is unknown. 
Although MarVis has been developed for the analysis of metabolomic data, the tool may be applied to gene expression data as well.


Subjects
Biomarkers/metabolism , Software , Cluster Analysis , Computational Biology/methods , Automated Pattern Recognition/methods , Systems Biology , User-Computer Interface