Results 1 - 20 of 38
1.
Nat Methods ; 14(11): 1063-1071, 2017 Nov.
Article in English | MEDLINE | ID: mdl-28967888

ABSTRACT

Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.


Subject(s)
Metagenomics , Software , Algorithms , Benchmarking , Sequence Analysis, DNA
2.
Nature ; 478(7369): 395-8, 2011 Oct 05.
Article in English | MEDLINE | ID: mdl-21976020

ABSTRACT

Maize smut caused by the fungus Ustilago maydis is a widespread disease characterized by the development of large plant tumours. U. maydis is a biotrophic pathogen that requires living plant tissue for its development and establishes an intimate interaction zone between fungal hyphae and the plant plasma membrane. U. maydis actively suppresses plant defence responses by secreted protein effectors. Its effector repertoire comprises at least 386 genes mostly encoding proteins of unknown function and expressed exclusively during the biotrophic stage. The U. maydis secretome also contains about 150 proteins with probable roles in fungal nutrition, fungal cell wall modification and host penetration as well as proteins unlikely to act in the fungal-host interface like a chorismate mutase. Chorismate mutases are key enzymes of the shikimate pathway and catalyse the conversion of chorismate to prephenate, the precursor for tyrosine and phenylalanine synthesis. Root-knot nematodes inject a secreted chorismate mutase into plant cells likely to affect development. Here we show that the chorismate mutase Cmu1 secreted by U. maydis is a virulence factor. The enzyme is taken up by plant cells, can spread to neighbouring cells and changes the metabolic status of these cells through metabolic priming. Secreted chorismate mutases are found in many plant-associated microbes and might serve as general tools for host manipulation.


Subject(s)
Chorismate Mutase/metabolism , Ustilago/enzymology , Ustilago/pathogenicity , Virulence Factors/metabolism , Zea mays/metabolism , Zea mays/microbiology , Cytoplasm/enzymology , Gene Expression Regulation, Plant , Genetic Complementation Test , Host-Pathogen Interactions , Metabolome , Models, Biological , Plant Proteins/metabolism , Plastids/enzymology , Protein Multimerization , Saccharomyces cerevisiae/genetics , Salicylic Acid/metabolism , Two-Hybrid System Techniques , Virulence Factors/genetics
3.
Bioinformatics ; 31(9): 1382-8, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25540185

ABSTRACT

MOTIVATION: With rapidly increasing volumes of biological sequence data, the functional analysis of new sequences in terms of similarities to known protein families challenges classical bioinformatics. RESULTS: The ultrafast protein classification (UProC) toolbox implements a novel algorithm ('Mosaic Matching') for large-scale sequence analysis. UProC is three orders of magnitude faster than profile-based methods and in a metagenome simulation study achieved up to 80% higher sensitivity on unassembled 100 bp reads. AVAILABILITY AND IMPLEMENTATION: UProC is available as open-source software at https://github.com/gobics/uproc. Precompiled databases (Pfam) are linked on the UProC homepage: http://uproc.gobics.de/. CONTACT: peter@gobics.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Metagenomics/methods , Protein Structure, Tertiary , Software , Algorithms , Metagenome , Open Reading Frames
4.
Bioinformatics ; 31(17): 2882-4, 2015 Sep 01.
Article in English | MEDLINE | ID: mdl-25957349

ABSTRACT

MOTIVATION: The characterization of phylogenetic and functional diversity is a key element in the analysis of microbial communities. Amplicon-based sequencing of marker genes, such as 16S rRNA, is a powerful tool for assessing and comparing the structure of microbial communities at a high phylogenetic resolution. Because 16S rRNA sequencing is more cost-effective than whole metagenome shotgun sequencing, marker gene analysis is frequently used for broad studies that involve a large number of different samples. However, in comparison to shotgun sequencing approaches, insights into the functional capabilities of the community get lost when restricting the analysis to taxonomic assignment of 16S rRNA data. RESULTS: Tax4Fun is a software package that predicts the functional capabilities of microbial communities based on 16S rRNA datasets. We evaluated Tax4Fun on a range of paired metagenome/16S rRNA datasets to assess its performance. Our results indicate that Tax4Fun provides a good approximation to functional profiles obtained from metagenomic shotgun sequencing approaches. AVAILABILITY AND IMPLEMENTATION: Tax4Fun is an open-source R package and applicable to output as obtained from the SILVAngs web server or the application of QIIME with a SILVA database extension. Tax4Fun is freely available for download at http://tax4fun.gobics.de/. CONTACT: kasshau@gwdg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Bacteria/genetics , Computational Biology/methods , Genes, Bacterial/genetics , Metagenomics/methods , RNA, Ribosomal, 16S/genetics , Sequence Analysis, RNA/methods , Software , Bacteria/classification , Databases, Factual , Databases, Nucleic Acid , Genetic Markers , Metagenome , Phylogeny , RNA, Bacterial/genetics
5.
BMC Genomics ; 15: 1003, 2014 Nov 20.
Article in English | MEDLINE | ID: mdl-25409897

ABSTRACT

BACKGROUND: The annotation of biomolecular functions is an essential step in the analysis of newly sequenced organisms. Usually, the functions are inferred from predicted genes on the genome using homology search techniques. A high quality genomic sequence is an important prerequisite which, however, is difficult to achieve for certain organisms, such as hybrids or organisms with a large genome. For functional analysis it is also possible to use a de novo transcriptome assembly but the computational requirements can be demanding. Up to now, it is unclear how much of the functional repertoire of an organism can be reliably predicted from unassembled RNA-seq short reads alone. RESULTS: We have conducted a study to investigate to what degree it is possible to reconstruct the functional profile of an organism from unassembled transcriptome data. We simulated the de novo prediction of biomolecular functions for Arabidopsis thaliana using a comprehensive RNA-seq data set. We evaluated the prediction performance using several homology search methods in combination with different evidence measures. For the decision on the presence or absence of a particular function under noisy conditions we propose a statistical mixture model enabling unsupervised estimation of a detection threshold. Our results indicate that the prediction of the biomolecular functions from the KEGG database is possible with a high sensitivity up to 94 percent. In this setting, the application of the mixture model for automatic threshold calibration allowed the reduction of the falsely predicted functions down to 4 percent. Furthermore, we found that our statistical approach even outperforms the prediction from a de novo transcriptome assembly. CONCLUSION: The analysis of an organism's transcriptome can provide a solid basis for the prediction of biomolecular functions. Using RNA-seq short reads directly, the functional profile of an organism can be reconstructed in a computationally efficient way to provide a draft annotation in cases where the classical genome-based approaches cannot be applied.


Subject(s)
Arabidopsis/genetics , Databases, Genetic , Sequence Analysis, RNA/methods , Calibration , Models, Genetic , Normal Distribution , Transcriptome/genetics
6.
Bioinformatics ; 29(8): 973-80, 2013 Apr 15.
Article in English | MEDLINE | ID: mdl-23418187

ABSTRACT

MOTIVATION: Metagenome analysis requires tools that can estimate the taxonomic abundances in anonymous sequence data over the whole range of biological entities. Because there is usually no prior knowledge about the data composition, not only all domains of life but also viruses have to be included in taxonomic profiling. Such a full-range approach, however, is difficult to realize owing to the limited coverage of available reference data. In particular, archaea and viruses are generally not well represented by current genome databases. RESULTS: We introduce a novel approach to taxonomic profiling of metagenomes that is based on mixture model analysis of protein signatures. Our results on simulated and real data reveal the difficulties of the existing methods when measuring archaeal or viral abundances and show the overall good profiling performance of the protein-based mixture model. As an application example, we provide a large-scale analysis of data from the Human Microbiome Project. This demonstrates the utility of our method as a first instance profiling tool for a fast estimate of the community structure. AVAILABILITY: http://gobics.de/TaxyPro. SUPPLEMENTARY INFORMATION: Supplementary Material is available at Bioinformatics online.


Subject(s)
Metagenomics/methods , Protein Structure, Tertiary , DNA, Archaeal/analysis , DNA, Viral/analysis , Humans , Metagenome , Phylogeny
7.
Plant Cell ; 23(4): 1556-72, 2011 Apr.
Article in English | MEDLINE | ID: mdl-21487095

ABSTRACT

In the postgenomic era, accurate prediction tools are essential for identification of the proteomes of cell organelles. Prediction methods have been developed for peroxisome-targeted proteins in animals and fungi but are missing specifically for plants. For development of a predictor for plant proteins carrying peroxisome targeting signals type 1 (PTS1), we assembled more than 2500 homologous plant sequences, mainly from EST databases. We applied a discriminative machine learning approach to derive two different prediction methods, both of which showed high prediction accuracy and recognized specific targeting-enhancing patterns in the regions upstream of the PTS1 tripeptides. Upon application of these methods to the Arabidopsis thaliana genome, 392 gene models were predicted to be peroxisome targeted. These predictions were extensively tested in vivo, resulting in a high experimental verification rate of Arabidopsis proteins previously not known to be peroxisomal. The prediction methods were able to correctly infer novel PTS1 tripeptides, which even included novel residues. Twenty-three newly predicted PTS1 tripeptides were experimentally confirmed, and a high variability of the plant PTS1 motif was discovered. These prediction methods will be instrumental in identifying low-abundance and stress-inducible peroxisomal proteins and defining the entire peroxisomal proteome of Arabidopsis and agronomically important crop plants.


Subject(s)
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Artificial Intelligence , Computational Biology/methods , Peroxisomes/metabolism , Protein Sorting Signals , Amino Acid Sequence , Arabidopsis/genetics , Arabidopsis Proteins/chemistry , Databases, Protein , Genome, Plant/genetics , Models, Biological , Molecular Sequence Data , Peptides , Protein Transport , Reproducibility of Results , Subcellular Fractions/metabolism
8.
Microb Ecol ; 67(4): 919-30, 2014 May.
Article in English | MEDLINE | ID: mdl-24553913

ABSTRACT

Soil microorganisms play an essential role in sustaining biogeochemical processes and cycling of nutrients across different land use types. To gain insights into microbial gene transcription in forest and grassland soil, we isolated mRNA from 32 sampling sites. After sequencing of generated complementary DNA (cDNA), a total of 5,824,229 sequences could be further analyzed. We were able to assign nonribosomal cDNA sequences to all three domains of life. A dominance of bacterial sequences, which were affiliated to 25 different phyla, was found. Bacterial groups capable of aromatic compound degradation such as Phenylobacterium and Burkholderia were detected in significantly higher relative abundance in forest soil than in grassland soil. Accordingly, KEGG pathway categories related to degradation of aromatic ring-containing molecules (e.g., benzoate degradation) were identified in high abundance within forest soil-derived metatranscriptomic datasets. The impact of land use type forest on community composition and activity is evidently to a high degree caused by the presence of wood breakdown products. Correspondingly, bacterial groups known to be involved in lignin degradation and containing ligninolytic genes such as Burkholderia, Bradyrhizobium, and Azospirillum exhibited increased transcriptional activity in forest soil. Higher solar radiation in grassland presumably induced increased transcription of photosynthesis-related genes within this land use type. This is in accordance with high abundance of photosynthetic organisms and plant-infecting viruses in grassland.


Subject(s)
Forests , Microbiota , Soil Microbiology , Transcriptome , Archaea/classification , Archaea/genetics , Archaea/isolation & purification , Bacteria/classification , Bacteria/genetics , Bacteria/isolation & purification , Eukaryota/classification , Eukaryota/genetics , Eukaryota/isolation & purification , Grassland , Molecular Sequence Data , Phylogeny , RNA, Messenger/genetics , Sequence Analysis, DNA
9.
Int J Mol Sci ; 15(7): 12364-78, 2014 Jul 14.
Article in English | MEDLINE | ID: mdl-25026170

ABSTRACT

The variety of metagenomes in current databases provides a rapidly growing source of information for comparative studies. However, the quantity and quality of supplementary metadata is still lagging behind. It is therefore important to be able to identify related metagenomes by means of the available sequence data alone. We have studied efficient sequence-based methods for large-scale identification of similar metagenomes within a database retrieval context. In a broad comparison of different profiling methods we found that vector-based distance measures are well suited for the detection of metagenomic neighbors. Our evaluation on more than 1700 publicly available metagenomes indicates that for a query metagenome from a particular habitat on average nine out of ten nearest neighbors represent the same habitat category independent of the utilized profiling method or distance measure. While for well-defined labels a neighborhood accuracy of 100% can be achieved, in general the neighbor detection is severely affected by a natural overlap of manually annotated categories. In addition, we present results of a novel visualization method that is able to reflect the similarity of metagenomes in a 2D scatter plot. The visualization method shows a similarly high accuracy in the reduced space as compared with the high-dimensional profile space. Our study suggests that for inspection of metagenome neighborhoods the profiling methods and distance measures can be chosen to provide a convenient interpretation of results in terms of the underlying features. Furthermore, supplementary metadata of metagenome samples in the future needs to comply with readily available ontologies for fine-grained and standardized annotation. To make profile-based k-nearest-neighbor search and the 2D-visualization of the metagenome universe available to the research community, we included the proposed methods in our CoMet-Universe server for comparative metagenome analysis.


Subject(s)
Genomics/methods , Metagenome , Sequence Analysis, DNA/methods , Genome, Human , Humans , Microbiota/genetics
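The profile-based k-nearest-neighbor search described in entry 9 can be illustrated with a minimal sketch. It assumes each metagenome is already summarized as a numeric profile vector and uses cosine distance as one possible vector-based measure; the abstract does not say which measures were compared, and the profile names and values below are hypothetical toy data.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # treat an all-zero profile as maximally distant
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbors(query, profiles, k=10):
    """Return the names of the k profiles closest to the query."""
    ranked = sorted(profiles, key=lambda name: cosine_distance(query, profiles[name]))
    return ranked[:k]

# Hypothetical toy profiles (e.g. relative abundances of three features).
profiles = {
    "soil_A": [0.50, 0.30, 0.20],
    "soil_B": [0.45, 0.35, 0.20],
    "gut_A": [0.05, 0.15, 0.80],
}
query = [0.48, 0.32, 0.20]
neighbors = nearest_neighbors(query, profiles, k=2)
print(neighbors)
```

With habitat labels attached to each profile, the fraction of neighbors sharing the query's label would give a neighborhood accuracy of the kind the abstract reports.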
10.
Stat Appl Genet Mol Biol ; 11(1): Article 1, 2012 Jan 06.
Article in English | MEDLINE | ID: mdl-22499688

ABSTRACT

Profile Hidden Markov Models (pHMMs) are widely used to model nucleotide or protein sequence families. In many applications, a sequence family classified into several subfamilies is given and each subfamily is modeled separately by one pHMM. A major drawback of this approach is the difficulty of coping with subfamilies composed of very few sequences. Correct subtyping of human immunodeficiency virus-1 (HIV-1) sequences is one of the most crucial bioinformatic tasks affected by this problem of small subfamilies, i.e., HIV-1 subtypes with a small number of known sequences. To deal with small samples for particular subfamilies of HIV-1, we employ a machine learning approach. More precisely, we make use of an existing HMM architecture and its associated inference engine, while replacing the unsupervised estimation of emission probabilities by a supervised method. For that purpose, we use regularized linear discriminant learning together with a balancing scheme to account for the widely varying sample size. After training the multiclass linear discriminants, the corresponding weights are transformed to valid probabilities using a softmax function. We apply this modified algorithm to classify HIV-1 sequence data (in the form of partial-length HIV-1 sequences and semi-artificial recombinants) and show that the performance of pHMMs can be significantly improved by the proposed technique.


Subject(s)
Algorithms , HIV-1/genetics , Markov Chains , Artificial Intelligence , Databases, Factual , Humans , Pattern Recognition, Automated/methods
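The softmax step mentioned in entry 10 maps arbitrary real-valued weights to a valid probability distribution. The sketch below shows only that transformation; the four weights are hypothetical values standing in for trained discriminant weights at a single model position, not data from the paper.

```python
import math

def softmax(weights):
    """Map real-valued weights to probabilities that sum to 1."""
    shift = max(weights)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights for the symbols A, C, G, T at one position.
weights = [1.2, -0.4, 0.3, -1.1]
probs = softmax(weights)
print(probs)
```

The ordering of the weights is preserved, so the symbol with the largest weight also receives the largest emission probability.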
11.
Nucleic Acids Res ; 39(Web Server issue): W518-23, 2011 Jul.
Article in English | MEDLINE | ID: mdl-21622656

ABSTRACT

Analyzing the functional potential of newly sequenced genomes and metagenomes has become a common task in biomedical and biological research. With the advent of high-throughput sequencing technologies, comparative metagenomics opens the way to elucidate the genetically determined similarities and differences of complex microbial communities. We developed the web server 'CoMet' (http://comet.gobics.de), which provides an easy-to-use comparative metagenomics platform that is well suited for the analysis of large collections of metagenomic short read data. CoMet combines the ORF finding and subsequent assignment of protein sequences to Pfam domain families with a comparative statistical analysis. Besides comprehensive tabular data files, the CoMet server also provides visually interpretable output in terms of hierarchical clustering and multi-dimensional scaling plots and thus allows a quick overview of a given set of metagenomic samples.


Subject(s)
Metagenomics/methods , Software , Cluster Analysis , Data Interpretation, Statistical , Internet , Metagenome , Protein Structure, Tertiary , Sequence Analysis, DNA , Sequence Analysis, Protein
12.
Bioinformatics ; 27(12): 1618-24, 2011 Jun 15.
Article in English | MEDLINE | ID: mdl-21546400

ABSTRACT

MOTIVATION: Inferring the taxonomic profile of a microbial community from a large collection of anonymous DNA sequencing reads is a challenging task in metagenomics. Because existing methods for taxonomic profiling of metagenomes are all based on the assignment of fragmentary sequences to phylogenetic categories, the accuracy of results largely depends on fragment length. This dependence complicates comparative analysis of data originating from different sequencing platforms or resulting from different preprocessing pipelines. RESULTS: We here introduce a new method for taxonomic profiling based on mixture modeling of the overall oligonucleotide distribution of a sample. Our results indicate that the mixture-based profiles compare well with taxonomic profiles obtained with other methods. However, in contrast to the existing methods, our approach shows a nearly constant profiling accuracy across all kinds of read lengths and it operates at an unrivaled speed. AVAILABILITY: A platform-independent implementation of the mixture modeling approach is available in terms of a MATLAB/Octave toolbox at http://gobics.de/peter/taxy. In addition, a prototypical implementation within an easy-to-use interactive tool for Windows can be downloaded.


Subject(s)
Metagenomics/methods , Phylogeny , Algorithms , Metagenome , Models, Genetic , Sequence Analysis, DNA
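Entry 12 builds on the overall oligonucleotide distribution of a sample. As a rough illustration of that input only, the sketch below pools k-mer counts over all reads and normalizes them to relative frequencies. The value of k, the handling of ambiguous bases, and the example reads are assumptions for illustration; the mixture estimation itself is not shown and this is not the Taxy implementation.

```python
from collections import Counter
from itertools import product

def kmer_profile(reads, k=2):
    """Return relative frequencies of all DNA k-mers pooled over the reads."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}

# Hypothetical reads of different lengths.
reads = ["ACGTAC", "GTACNN", "ACG"]
profile = kmer_profile(reads, k=2)
print(profile["AC"])
```

Because the counts are pooled before normalization, the profile describes the sample as a whole rather than any single read.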
13.
J Biomed Biotechnol ; 2012: 263910, 2012.
Article in English | MEDLINE | ID: mdl-22550397

ABSTRACT

Statistical ranking, filtering, adduct detection, isotope correction, and molecular formula calculation are essential tasks in processing mass spectrometry data in metabolomics studies. In order to obtain high-quality data sets, a framework which incorporates all these methods is required. We present the MarVis-Filter software, which provides well-established and specialized methods for processing mass spectrometry data. For the task of ranking and filtering multivariate intensity profiles, MarVis-Filter provides the ANOVA and Kruskal-Wallis tests with adjustment for multiple hypothesis testing. Adduct and isotope correction are based on a novel algorithm which takes the similarity of intensity profiles into account and allows user-defined ionization rules. The molecular formula calculation utilizes the results of the adduct and isotope correction. For a comprehensive analysis, MarVis-Filter provides an interactive interface to combine data sets deriving from positive and negative ionization mode. The software is exemplarily applied in a metabolic case study, where octadecanoids could be identified as markers for wounding in plants.


Subject(s)
Algorithms , Computational Biology/methods , Mass Spectrometry/methods , Metabolomics/methods , Software , Arabidopsis/metabolism , Carbon Isotopes , Cyclopentanes/metabolism , Databases, Factual , Metabolome , Models, Biological , Oxylipins/metabolism
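Entry 13 mentions adjusting test results for multiple hypothesis testing without naming the procedure. As one common possibility, the sketch below applies the Benjamini-Hochberg adjustment to a list of p-values such as those from per-marker ANOVA or Kruskal-Wallis tests. The choice of this procedure and the example p-values are assumptions, not details taken from MarVis-Filter.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone in the rank of the raw p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four intensity profiles.
pvalues = [0.001, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(pvalues)
print(adjusted)
```

Profiles could then be ranked or filtered by the adjusted values instead of the raw ones.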
14.
Nucleic Acids Res ; 38(Web Server issue): W19-22, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20497995

ABSTRACT

We introduce web interfaces for two recent extensions of the multiple-alignment program DIALIGN. DIALIGN-TX combines the greedy heuristic previously used in DIALIGN with a more traditional 'progressive' approach for improved performance on locally and globally related sequence sets. In addition, we offer a version of DIALIGN that uses predicted protein secondary structures together with primary sequence information to construct multiple protein alignments. Both programs are available through 'Göttingen Bioinformatics Compute Server' (GOBICS).


Subject(s)
Protein Structure, Secondary , Sequence Alignment/methods , Sequence Analysis, Protein , Software , Internet
15.
Mol Microbiol ; 78(4): 964-79, 2010 Nov.
Article in English | MEDLINE | ID: mdl-21062371

ABSTRACT

The COP9 signalosome complex (CSN) is a crucial regulator of ubiquitin ligases. Defects in CSN result in embryonic impairment and death in higher eukaryotes, whereas the filamentous fungus Aspergillus nidulans survives without CSN, but is unable to complete sexual development. We investigated overall impact of CSN activity on A. nidulans cells by combined transcriptome, proteome and metabolome analysis. Absence of csn5/csnE affects transcription of at least 15% of genes during development, including numerous oxidoreductases. csnE deletion leads to changes in the fungal proteome indicating impaired redox regulation and hypersensitivity to oxidative stress. CSN promotes the formation of asexual spores by regulating developmental hormones produced by PpoA and PpoC dioxygenases. We identify more than 100 metabolites, including orsellinic acid derivatives, accumulating preferentially in the csnE mutant. We also show that CSN is required to activate glucanases and other cell wall recycling enzymes during development. These findings suggest a dual role for CSN during development: it is required early for protection against oxidative stress and hormone regulation and is later essential for control of the secondary metabolism and cell wall rearrangement.


Subject(s)
Aspergillus nidulans/growth & development , Aspergillus nidulans/metabolism , Cell Wall/metabolism , Gene Expression Regulation, Fungal , Hormones/metabolism , Multiprotein Complexes/metabolism , Oxidative Stress , Peptide Hydrolases/metabolism , Signal Transduction , Aspergillus nidulans/genetics , COP9 Signalosome Complex , Fungal Proteins/genetics , Gene Deletion , Gene Expression Profiling , Metabolome , Multiprotein Complexes/genetics , Peptide Hydrolases/genetics , Proteome
16.
Bioinformatics ; 26(7): 960-1, 2010 Apr 01.
Article in English | MEDLINE | ID: mdl-20172941

ABSTRACT

SUMMARY: Assessment of phylogenetic diversity is a key element to the analysis of microbial communities. Tools are needed to handle next-generation sequencing data and to cope with the computational complexity of large-scale studies. Here, we present Treephyler, a tool for fast taxonomic profiling of metagenomes. Treephyler was evaluated on a real metagenome to assess its performance in comparison to previous approaches for taxonomic profiling. Results indicate that Treephyler is, in terms of speed and accuracy, prepared for next-generation sequencing techniques and large-scale analysis. AVAILABILITY: Treephyler is implemented in Perl; it is portable to all platforms and applicable to both nucleotide and protein input data. Treephyler is freely available for download at http://www.gobics.de/fabian/treephyler.php.


Subject(s)
Genomics/methods , Metagenome , Phylogeny , Software , Databases, Factual
17.
Nucleic Acids Res ; 37(Web Server issue): W101-5, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19429689

ABSTRACT

Metagenomic sequencing projects yield numerous sequencing reads of a diverse range of uncultivated and mostly yet unknown microorganisms. In many cases, these sequencing reads cannot be assembled into longer contigs. Thus, gene prediction tools that were originally developed for whole-genome analysis are not suitable for processing metagenomes. Orphelia is a program for predicting genes in short DNA sequences that is available through a web server application (http://orphelia.gobics.de). Orphelia utilizes prediction models that were created with machine learning techniques on the basis of a wide range of annotated genomes. In contrast to other methods for metagenomic gene prediction, Orphelia has fragment length-specific prediction models for the two most popular sequencing techniques in metagenomics, chain termination sequencing and pyrosequencing. These models ensure highly specific gene predictions.


Subject(s)
Environmental Microbiology , Genes , Genomics , Software , Internet , Open Reading Frames , Sequence Analysis, DNA , User-Computer Interface
18.
BMC Bioinformatics ; 11: 481, 2010 Sep 24.
Article in English | MEDLINE | ID: mdl-20868492

ABSTRACT

BACKGROUND: Establishing the relationship between an organism's genome sequence and its phenotype is a fundamental challenge that remains largely unsolved. Accurately predicting microbial phenotypes solely based on genomic features will allow us to infer relevant phenotypic characteristics when the availability of a genome sequence precedes experimental characterization, a scenario that is favored by the advent of novel high-throughput and single cell sequencing techniques. RESULTS: We present a novel approach to predict the phenotype of prokaryotes directly from their protein domain frequencies. Our discriminative machine learning approach provides high prediction accuracy of relevant phenotypes such as motility, oxygen requirement or spore formation. Moreover, the set of discriminative domains provides biological insight into the underlying phenotype-genotype relationship and enables deriving hypotheses on the possible functions of uncharacterized domains. CONCLUSIONS: Fast and accurate prediction of microbial phenotypes based on genomic protein domain content is feasible and has the potential to provide novel biological insights. First results of a systematic check for annotation errors indicate that our approach may also be applied to semi-automatic correction and completion of the existing phenotype annotation.


Subject(s)
Bacterial Proteins/chemistry , Phenotype , Algorithms , Genome, Archaeal , Genome, Bacterial , Molecular Sequence Annotation , Protein Structure, Tertiary
19.
Environ Microbiome ; 15(1): 11, 2020 May 18.
Article in English | MEDLINE | ID: mdl-33902725

ABSTRACT

BACKGROUND: Sequencing of 16S rRNA genes has become a powerful technique to study microbial communities and their responses towards changing environmental conditions in various ecosystems. Several tools have been developed for the prediction of functional profiles from 16S rRNA gene sequencing data, because numerous questions in ecosystem ecology require knowledge of community functions in addition to taxonomic composition. However, the accuracy of these tools relies on functional information derived from genomes available in public databases, which are often not representative of the microorganisms present in the studied ecosystem. In addition, there is also a lack of tools to predict functional gene redundancy in microbial communities. RESULTS: To address these challenges, we developed Tax4Fun2, an R package for the prediction of functional profiles and functional gene redundancies of prokaryotic communities from 16S rRNA gene sequences. We demonstrate that functional profiles predicted by Tax4Fun2 are highly correlated to functional profiles derived from metagenomes of the same samples. We further show that Tax4Fun2 has higher accuracies than PICRUSt and Tax4Fun. By incorporating user-defined, habitat-specific genomic information, the accuracy and robustness of predicted functional profiles is substantially enhanced. In addition, functional gene redundancies predicted with Tax4Fun2 are highly correlated to functional gene redundancies determined for simulated microbial communities. CONCLUSIONS: Tax4Fun2 provides researchers with a unique tool to predict and investigate functional profiles of prokaryotic communities based on 16S rRNA gene sequencing data. It is easy-to-use, platform-independent and highly memory-efficient, thus enabling researchers without extensive bioinformatics knowledge or access to high-performance clusters to predict functional profiles. Another unique feature of Tax4Fun2 is that it allows researchers to calculate the redundancy of specific functions, which is a potentially important measure of how resilient a community will be to environmental perturbation. Tax4Fun2 is implemented in R and freely available at https://github.com/bwemheu/Tax4Fun2.

20.
BMC Bioinformatics ; 10: 92, 2009 Mar 20.
Article in English | MEDLINE | ID: mdl-19302701

ABSTRACT

BACKGROUND: A central goal of experimental studies in systems biology is to identify meaningful markers that are hidden within a diffuse background of data originating from large-scale analytical intensity measurements as obtained from metabolomic experiments. Intensity-based clustering is an unsupervised approach to the identification of metabolic markers based on the grouping of similar intensity profiles. A major problem of this basic approach is that in general there is no prior information about an adequate number of biologically relevant clusters. RESULTS: We present the tool MarVis (Marker Visualization) for data mining on intensity-based profiles using one-dimensional self-organizing maps (1D-SOMs). MarVis can import and export customizable CSV (Comma Separated Values) files and provides aggregation and normalization routines for preprocessing of intensity profiles that contain repeated measurements for a number of different experimental conditions. Robust clustering is then achieved by training of a 1D-SOM model, which introduces a similarity-based ordering of the intensity profiles. The ordering allows a convenient visualization of the intensity variations within the data and facilitates an interactive aggregation of clusters into larger blocks. The intensity-based visualization is combined with the presentation of additional data attributes, which can further support the analysis of experimental data. CONCLUSION: MarVis is a user-friendly and interactive tool for exploration of complex pattern variation in a large set of experimental intensity profiles. The application of 1D-SOMs gives a convenient overview of relevant profiles and groups of profiles. The specialized visualization effectively supports researchers in analyzing a large number of putative clusters, even though the true number of biologically meaningful groups is unknown. Although MarVis has been developed for the analysis of metabolomic data, the tool may be applied to gene expression data as well.


Subject(s)
Biomarkers/metabolism , Software , Cluster Analysis , Computational Biology/methods , Pattern Recognition, Automated/methods , Systems Biology , User-Computer Interface