ABSTRACT
The advent of rapid whole-genome sequencing has created new opportunities for computational prediction of antimicrobial resistance (AMR) phenotypes from genomic data. Both rule-based and machine learning (ML) approaches have been explored for this task, but systematic benchmarking is still needed. Here, we evaluated four state-of-the-art ML methods (Kover, PhenotypeSeeker, Seq2Geno2Pheno and Aytan-Aktug), an ML baseline and the rule-based ResFinder by training and testing each of them across 78 species-antibiotic datasets, using a rigorous benchmarking workflow that integrates three evaluation approaches, each paired with three distinct sample splitting methods. Our analysis revealed considerable variation in the performance across techniques and datasets. Whereas ML methods generally excelled for closely related strains, ResFinder excelled at handling divergent genomes. Overall, Kover most frequently ranked top among the ML approaches, followed by PhenotypeSeeker and Seq2Geno2Pheno. AMR phenotypes for antibiotic classes such as macrolides and sulfonamides were predicted with the highest accuracies. The quality of predictions varied substantially across species-antibiotic combinations, particularly for beta-lactams; across species, resistance phenotyping of the beta-lactam compounds aztreonam, amoxicillin/clavulanic acid, cefoxitin, ceftazidime and piperacillin/tazobactam, alongside tetracyclines, demonstrated more variable performance than the other benchmarked antibiotics. By organism, Campylobacter jejuni and Enterococcus faecium phenotypes were more robustly predicted than those of Escherichia coli, Staphylococcus aureus, Salmonella enterica, Neisseria gonorrhoeae, Klebsiella pneumoniae, Pseudomonas aeruginosa, Acinetobacter baumannii, Streptococcus pneumoniae and Mycobacterium tuberculosis. In addition, our study provides software recommendations for each species-antibiotic combination. It furthermore highlights the need for optimization for robust clinical applications, particularly for strains that diverge substantially from those used for training.
Subject(s)
Anti-Bacterial Agents , Phenotype , Anti-Bacterial Agents/pharmacology , Machine Learning , Bacterial Drug Resistance/genetics , Computational Biology/methods , Bacterial Genome , Microbial Genome , Humans , Bacteria/genetics , Bacteria/drug effects
ABSTRACT
MOTIVATION: Gene annotation is the problem of mapping proteins to their functions represented as Gene Ontology (GO) terms, typically inferred based on the primary sequences. Gene annotation is a multi-label multi-class classification problem, which has generated growing interest for its uses in the characterization of millions of proteins with unknown functions. However, there is no standard GO dataset used for benchmarking newly developed machine learning models within the bioinformatics community. Thus, the significance of improvements for these models remains unclear. RESULTS: The Gene Benchmarking database is the first effort to provide an easy-to-use and configurable hub for the learning and evaluation of gene annotation models. It provides easy access to pre-specified datasets and takes the non-trivial steps of preprocessing and filtering all data according to custom presets using a web interface. The GO bench web application can also be used to evaluate and display any trained model on leaderboards for annotation tasks. AVAILABILITY AND IMPLEMENTATION: The GO Benchmarking dataset is freely available at www.gobench.org. Code is hosted at github.com/mofradlab, with repositories for website code, core utilities and examples of usage (Supplementary Section S.7). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Benchmarking , Software , Molecular Sequence Annotation , Gene Ontology , Machine Learning , Proteins/metabolism
ABSTRACT
Advances in transcriptomic and translatomic techniques enable in-depth studies of RNA activity profiles and RNA-based regulatory mechanisms. Ribosomal RNA (rRNA) sequences are highly abundant among cellular RNA, but if the target sequences do not include polyadenylation, these cannot be easily removed in library preparation, requiring their post-hoc removal with computational techniques to accelerate and improve downstream analyses. Here, we describe RiboDetector, a novel software based on a Bi-directional Long Short-Term Memory (BiLSTM) neural network, which rapidly and accurately identifies rRNA reads from transcriptomic, metagenomic, metatranscriptomic, noncoding RNA, and ribosome profiling sequence data. Compared with state-of-the-art approaches, RiboDetector produced at least six times fewer misclassifications on the benchmark datasets. Importantly, the few false positives of RiboDetector were not enriched in certain Gene Ontology (GO) terms, suggesting a low bias for downstream functional profiling. RiboDetector also demonstrated a remarkable generalizability for detecting novel rRNA sequences that are divergent from the training data with sequence identities of <90%. On a personal computer, RiboDetector processed 40M reads in less than 6 min, which was ~50 times faster in GPU mode and ~15 times in CPU mode than other methods. RiboDetector is available under a GPL v3.0 license at https://github.com/hzi-bifo/RiboDetector.
Subject(s)
Deep Learning , Ribosomal RNA , Metagenomics/methods , RNA , Ribosomal RNA/genetics , Software
ABSTRACT
Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive variant calls, which were strongly enriched in T to G changes in a 'G.G' context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.
Subject(s)
Cytomegalovirus/genetics , Genetic Variation , Viral Genome , Software , Humans
ABSTRACT
SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to gain insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.
Subject(s)
COVID-19/prevention & control , Computational Biology , SARS-CoV-2/isolation & purification , Biomedical Research , COVID-19/epidemiology , COVID-19/virology , Viral Genome , Humans , Pandemics , SARS-CoV-2/genetics
ABSTRACT
Single-cell genome sequencing provides a highly granular view of biological systems but is affected by high error rates, allelic amplification bias, and uneven genome coverage. This creates a need for data-specific computational methods, for purposes such as cell lineage tree inference. The objective of cell lineage tree reconstruction is to infer the evolutionary process that generated a set of observed cell genomes. Lineage trees may enable a better understanding of tumor formation and growth, as well as of organ development for healthy body cells. We describe a method, Scelestial, for lineage tree reconstruction from single-cell data, which is based on an approximation algorithm for the Steiner tree problem and is a generalization of the neighbor-joining method. We adapt the algorithm to efficiently select a limited subset of potential sequences as internal nodes, in the presence of missing values, and to minimize cost by lineage tree-based missing value imputation. In a comparison against seven state-of-the-art single-cell lineage tree reconstruction algorithms (BitPhylogeny, OncoNEM, SCITE, SiFit, SASC, SCIPhI, and SiCloneFit) on simulated and real single-cell tumor samples, Scelestial performed best at reconstructing trees in terms of accuracy and run time. Scelestial has been implemented in C++. It is also available as an R package named RScelestial.
Subject(s)
Algorithms , Neoplasms , Biological Evolution , Cell Lineage/genetics , Humans , Genetic Models , Phylogeny
ABSTRACT
BACKGROUND: Selection of optimal computational strategies for analyzing metagenomics data is a decisive step in determining the microbial composition of a sample, and this procedure is complex because of the numerous tools currently available. The aim of this research was to summarize the results of the crowdsourced sbv IMPROVER Microbiomics Challenge, designed to evaluate the performance of off-the-shelf metagenomics software, as well as to investigate the robustness of these results through an extended post-challenge analysis. In total 21 off-the-shelf taxonomic metagenome profiling pipelines were benchmarked for their capacity to identify the microbiome composition at various taxon levels across 104 shotgun metagenomics datasets of bacterial genomes (representative of various microbiome samples) from public databases. Performance was determined by comparing predicted taxonomy profiles with the gold standard. RESULTS: Most taxonomic profilers performed homogeneously well at the phylum level but generated intermediate and heterogeneous scores at the genus and species levels, respectively. kmer-based pipelines using Kraken with and without Bracken or using CLARK-S performed best overall, but they exhibited lower precision than the two marker-gene-based methods MetaPhlAn and mOTU. Filtering out the 1% least abundant species, which were not reliably predicted, helped increase the performance of most profilers by increasing precision but at the cost of recall. However, the use of adaptive filtering thresholds determined from the sample's Shannon index increased the performance of most kmer-based profilers while mitigating the tradeoff between precision and recall. CONCLUSIONS: kmer-based metagenomic pipelines using Kraken/Bracken or CLARK-S performed most robustly across a large variety of microbiome datasets. Removing non-reliably predicted low-abundance species by using diversity-dependent adaptive filtering thresholds further enhanced the performance of these tools. This work demonstrates the applicability of computational pipelines for accurately determining taxonomic profiles in clinical and environmental contexts and exemplifies the power of crowdsourcing for unbiased evaluation.
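To illustrate the idea of a diversity-dependent filtering threshold mentioned in the abstract above, the following is a minimal sketch only. The abstract does not state how the threshold is derived from the Shannon index, so the rule used here (`base_threshold / (1 + H)`) and the function names `shannon_index` and `adaptive_filter` are hypothetical choices for illustration, not the challenge's actual procedure.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln p_i) over non-zero relative abundances."""
    total = sum(abundances.values())
    if total == 0:
        return 0.0
    props = [v / total for v in abundances.values() if v > 0]
    return -sum(p * math.log(p) for p in props)


def adaptive_filter(abundances, base_threshold=0.01):
    """Drop taxa whose relative abundance falls below a diversity-dependent cutoff.

    Hypothetical rule (an assumption, not taken from the source): the cutoff is
    base_threshold / (1 + H), so more diverse samples get a lower cutoff.
    """
    total = sum(abundances.values())
    if total == 0:
        return {}
    cutoff = base_threshold / (1.0 + shannon_index(abundances))
    return {taxon: v for taxon, v in abundances.items() if v / total >= cutoff}


# Toy profile with one very rare taxon.
profile = {"Taxon A": 50.0, "Taxon B": 49.5, "Taxon C": 0.5}
filtered = adaptive_filter(profile)
```

With this toy profile the rare taxon falls below the cutoff and is removed, while the two abundant taxa are kept.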
Subject(s)
Crowdsourcing , Metagenome , Benchmarking , Metagenomics/methods , Software
ABSTRACT
MOTIVATION: B-cell epitopes (BCEs) play a pivotal role in the development of peptide vaccines, immuno-diagnostic reagents and antibody production, and thus in infectious disease prevention and diagnostics in general. Experimental methods used to determine BCEs are costly and time-consuming. Therefore, it is essential to develop computational methods for the rapid identification of BCEs. Although several computational methods have been developed for this task, generalizability is still a major concern, where cross-testing of the classifiers trained and tested on different datasets has revealed accuracies of 51-53%. RESULTS: We describe a new method called EpitopeVec, which uses a combination of residue properties, modified antigenicity scales, and protein language model-based representations (protein vectors) as features of peptides for linear BCE predictions. Extensive benchmarking of EpitopeVec and other state-of-the-art methods for linear BCE prediction on several large and small datasets, as well as cross-testing, demonstrated an improvement in the performance of EpitopeVec over other methods in terms of accuracy and area under the curve. As the predictive performance depended on the species origin of the respective antigens (viral, bacterial and eukaryotic), we also trained our method on a large viral dataset to create a dedicated linear viral BCE predictor with improved cross-testing performance. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/hzi-bifo/epitope-prediction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Antigens , Peptides , Amino Acid Sequence , Peptides/chemistry , Antigens/chemistry , Software , B-Lymphocyte Epitopes/chemistry
ABSTRACT
OBJECTIVE: Neutralising antibodies are key effectors of infection-induced and vaccine-induced immunity. Quantification of antibodies' breadth and potency is critical for understanding the mechanisms of protection and for prioritisation of vaccines. Here, we used a unique collection of human specimens and HCV strains to develop HCV reference viruses for quantification of neutralising antibodies, and to investigate viral functional diversity. DESIGN: We profiled neutralisation potency of polyclonal immunoglobulins from 104 patients infected with HCV genotype (GT) 1-6 across 13 HCV strains representing five viral GTs. Using metric multidimensional scaling, we plotted HCV neutralisation onto neutralisation maps. We employed K-means clustering to guide virus clustering and the selection of representative strains. RESULTS: Viruses differed greatly in neutralisation sensitivity, with J6 (GT2a) being most resistant and SA13 (GT5a) being most sensitive. They mapped to six distinct neutralisation clusters, in part composed of viruses from different GTs. There was no correlation between viral neutralisation and genetic distance, indicating functional neutralisation clustering differs from sequence-based clustering. Calibrating reference viruses representing these clusters against purified antibodies from 496 patients infected by GT1 to GT6 viruses readily identified individuals with extraordinarily potent and broadly neutralising antibodies. It revealed comparable antibody cross-neutralisation and diversity between specimens from diverse viral GTs, confirming well-balanced reporting of HCV cross-neutralisation across highly diverse human samples. CONCLUSION: Representative isolates from six neutralisation clusters broadly reconstruct the functional HCV neutralisation space. They enable high resolution profiling of HCV neutralisation and they may reflect viral functional and antigenic properties important to consider in HCV vaccine design.
Subject(s)
Neutralizing Antibodies/blood , Hepacivirus/immunology , Hepatitis C Antibodies/blood , Hepatitis C/immunology , Amino Acid Sequence , Neutralizing Antibodies/immunology , Hepacivirus/genetics , Hepatitis C/virology , Humans , Immunoglobulin G/blood , Immunoglobulin G/immunology
ABSTRACT
Influenza A viruses cause seasonal epidemics and occasional pandemics in the human population. While the worldwide circulation of seasonal influenza is at least partly understood, the exact migration patterns between countries, states or cities are not well studied. Here, we use the Sankoff algorithm for parsimonious phylogeographic reconstruction together with effective distances based on a worldwide air transportation network. By first simulating geographic spread and then phylogenetic trees and genetic sequences, we confirmed that reconstructions with effective distances inferred phylogeographic spread more accurately than reconstructions with geographic distances and Bayesian reconstructions with BEAST that do not use any distance information, and led to comparable results to the Bayesian reconstruction using distance information via a generalized linear model. Our method extends Bayesian methods that estimate rates from the data by using fine-grained locations like airports and inferring intermediate locations not observed among sampled isolates. When applied to sequence data of the pandemic H1N1 influenza A virus in 2009, our approach correctly inferred the origin and proposed airports mainly involved in the spread of the virus. In case of a novel outbreak, this approach allows rapid analysis of sequence data and inference of origin and spread routes to improve disease surveillance and control.
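As a rough illustration of the Sankoff algorithm named in the abstract above, the sketch below computes the minimum-cost assignment of locations to the internal nodes of a rooted tree, given a pairwise cost matrix between locations. The tree, the three location codes and the cost values are made-up toy inputs standing in for effective distances; the function name `sankoff` and its signature are assumptions for this example and are not taken from the authors' software.

```python
def sankoff(tree, root, leaf_states, states, cost):
    """Bottom-up Sankoff pass on a rooted tree.

    tree: dict mapping each internal node to a list of its children.
    leaf_states: dict mapping each leaf to its observed location.
    cost[s][t]: cost of a move from location s (parent) to t (child).
    Returns (minimum total cost, per-node cost tables).
    """
    tables = {}

    def visit(node):
        children = tree.get(node, [])
        if not children:
            # A leaf can only be in its observed location.
            tables[node] = {
                s: (0.0 if s == leaf_states[node] else float("inf")) for s in states
            }
            return
        for child in children:
            visit(child)
        # For each candidate location s, add the cheapest option for every child.
        tables[node] = {
            s: sum(
                min(cost[s][t] + tables[child][t] for t in states)
                for child in children
            )
            for s in states
        }

    visit(root)
    return min(tables[root].values()), tables


# Toy example: two isolates sampled at FRA, one at MEX; JFK is never sampled.
states = ["FRA", "MEX", "JFK"]
cost = {
    "FRA": {"FRA": 0, "MEX": 3, "JFK": 1},
    "MEX": {"FRA": 3, "MEX": 0, "JFK": 1},
    "JFK": {"FRA": 1, "MEX": 1, "JFK": 0},
}
tree = {"root": ["n1", "C"], "n1": ["A", "B"]}
leaf_states = {"A": "FRA", "B": "FRA", "C": "MEX"}

best, tables = sankoff(tree, "root", leaf_states, states, cost)
```

In this toy input the cheapest root location is the unsampled hub JFK, which mirrors the abstract's point that intermediate locations not observed among sampled isolates can be inferred. A full reconstruction would add a top-down traceback to assign a location to every internal node.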
Subject(s)
Aviation , Influenza A Virus H1N1 Subtype/isolation & purification , Human Influenza/epidemiology , Phylogeography , Transportation , Algorithms , Bayes Theorem , Computer Simulation , Disease Outbreaks , Humans , Human Influenza/virology
ABSTRACT
Roots and leaves of healthy plants host taxonomically structured bacterial assemblies, and members of these communities contribute to plant growth and health. We established Arabidopsis leaf- and root-derived microbiota culture collections representing the majority of bacterial species that are reproducibly detectable by culture-independent community sequencing. We found an extensive taxonomic overlap between the leaf and root microbiota. Genome drafts of 400 isolates revealed a large overlap of genome-encoded functional capabilities between leaf- and root-derived bacteria with few significant differences at the level of individual functional categories. Using defined bacterial communities and a gnotobiotic Arabidopsis plant system we show that the isolates form assemblies resembling natural microbiota on their cognate host organs, but are also capable of ectopic leaf or root colonization. While this raises the possibility of reciprocal relocation between root and leaf microbiota members, genome information and recolonization experiments also provide evidence for microbiota specialization to their respective niche.
Subject(s)
Arabidopsis/microbiology , Microbiota/physiology , Plant Leaves/microbiology , Plant Roots/microbiology , Bacteria/classification , Bacteria/genetics , Bacteria/isolation & purification , Bacterial Genome/genetics , Germ-Free Life , Microbiota/genetics , DNA Sequence Analysis , Soil Microbiology
ABSTRACT
SUMMARY: Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free, subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard operational taxonomic unit (OTU)-clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively to the k-mer based state-of-the-art approach in phenotype prediction while outperforming the OTU-based state-of-the-art approach in finding biomarkers in both resolution and coverage evaluated over known links from literature and synthetic benchmark datasets. AVAILABILITY AND IMPLEMENTATION: DiTaxa is available under the Apache 2 license at http://llp.berkeley.edu/ditaxa. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Algorithms , 16S Ribosomal RNA/genetics , Biomarkers , Humans , Nucleotides , Phenotype , DNA Sequence Analysis , Software
ABSTRACT
Advances in genome-based studies on plant-associated microorganisms have transformed our understanding of many plant pathogens and are beginning to greatly widen our knowledge of plant interactions with mutualistic and commensal microorganisms. Pathogenomics has revealed how pathogenic microorganisms adapt to particular hosts, subvert innate immune responses and change host range, as well as how new pathogen species emerge. Similarly, culture-independent community profiling methods, coupled with metagenomic and metatranscriptomic studies, have provided the first insights into the emerging field of research on plant-associated microbial communities. Together, these approaches have the potential to bridge the gap between plant microbial ecology and plant pathology, which have traditionally been two distinct research fields.
Subject(s)
Plants/microbiology , Bacteria/classification , Bacterial Physiological Phenomena , Agricultural Crops/microbiology , Fungi/classification , Fungi/physiology , Microbial Genome , Host Specificity , Host-Pathogen Interactions , Plant Diseases/immunology , Plant Diseases/microbiology , Plants/classification
ABSTRACT
Medulloblastomas arise from undifferentiated precursor cells in the cerebellum and account for about 20% of all solid brain tumors during childhood; standard therapies include radiation and chemotherapy, which oftentimes come with severe impairment of the cognitive development of the young patients. Here, we show that the posttranscriptional regulator Y-box binding protein 1 (YBX1), a DNA- and RNA-binding protein, acts as an oncogene in medulloblastomas by regulating cellular survival and apoptosis. We observed different cellular responses upon YBX1 knockdown in several medulloblastoma cell lines, with significantly altered transcription and subsequent apoptosis rates. Mechanistically, PAR-CLIP for YBX1 and integration with RNA-Seq data uncovered direct posttranscriptional control of the heterochromatin-associated gene CBX5; upon YBX1 knockdown and subsequent CBX5 mRNA instability, heterochromatin-regulated genes involved in inflammatory response, apoptosis and death receptor signaling were de-repressed. Thus, YBX1 acts as an oncogene in medulloblastoma through indirect transcriptional regulation of inflammatory genes regulating apoptosis and represents a promising novel therapeutic target in this tumor entity.
Subject(s)
Non-Histone Chromosomal Proteins/metabolism , Neoplastic Gene Expression Regulation , Heterochromatin/genetics , Inflammation/pathology , Medulloblastoma/pathology , Messenger RNA/metabolism , Y-Box-Binding Protein 1/metabolism , Apoptosis , Tumor Biomarkers/genetics , Tumor Biomarkers/metabolism , Cell Proliferation , Cerebellar Neoplasms/genetics , Cerebellar Neoplasms/immunology , Cerebellar Neoplasms/metabolism , Cerebellar Neoplasms/pathology , Chromobox Protein Homolog 5 , Non-Histone Chromosomal Proteins/genetics , Humans , Inflammation/genetics , Inflammation/immunology , Inflammation/metabolism , Medulloblastoma/genetics , Medulloblastoma/immunology , Medulloblastoma/metabolism , Messenger RNA/genetics , Cultured Tumor Cells , Y-Box-Binding Protein 1/genetics
ABSTRACT
The RNA-binding protein Musashi 2 (MSI2) has emerged as an important regulator in cancer initiation, progression, and drug resistance. Translocations and deregulation of the MSI2 gene are diagnostic of certain cancers, including chronic myeloid leukemia (CML) with translocation t(7;17), acute myeloid leukemia (AML) with translocation t(10;17), and some cases of B-precursor acute lymphoblastic leukemia (pB-ALL). To better understand the function of MSI2 in leukemia, the mRNA targets that are bound and regulated by MSI2 and their MSI2-binding motifs need to be identified. To this end, using photoactivatable ribonucleoside cross-linking and immunoprecipitation (PAR-CLIP) and the multiple EM for motif elicitation (MEME) analysis tool, here we identified MSI2's mRNA targets and the consensus RNA-recognition element (RRE) motif recognized by MSI2 (UUAG). Of note, MSI2 knockdown altered the expression of several genes with roles in eukaryotic initiation factor 2 (eIF2), hepatocyte growth factor (HGF), and epidermal growth factor (EGF) signaling pathways. We also show that MSI2 regulates classic interleukin-6 (IL-6) signaling by promoting the degradation of the mRNA of IL-6 signal transducer (IL6ST or GP130), which, in turn, affected the phosphorylation statuses of signal transducer and activator of transcription 3 (STAT3) and the mitogen-activated protein kinase ERK. In summary, we have identified multiple MSI2-regulated mRNAs and provided evidence that MSI2 controls IL6ST activity that controls oncogenic signaling networks. Our findings may help inform strategies for unraveling the role of MSI2 in leukemia to pave the way for the development of targeted therapies.
Subject(s)
Cytokine Receptor gp130/genetics , Interleukin-6/genetics , Messenger RNA/genetics , RNA-Binding Proteins/genetics , Transcriptome , Base Sequence , Binding Sites , Cytokine Receptor gp130/metabolism , Epidermal Growth Factor/genetics , Epidermal Growth Factor/metabolism , Eukaryotic Initiation Factor-2/genetics , Eukaryotic Initiation Factor-2/metabolism , Gene Expression Profiling , Gene Expression Regulation , HEK293 Cells , Hepatocyte Growth Factor/genetics , Hepatocyte Growth Factor/metabolism , Humans , Immunoprecipitation , Interleukin-6/metabolism , Leukemia/genetics , Leukemia/metabolism , Leukemia/pathology , Light , Mitogen-Activated Protein Kinase 1/genetics , Mitogen-Activated Protein Kinase 1/metabolism , Mitogen-Activated Protein Kinase 3/genetics , Mitogen-Activated Protein Kinase 3/metabolism , Biological Models , Protein Binding , Messenger RNA/metabolism , RNA-Binding Proteins/metabolism , STAT3 Transcription Factor/genetics , STAT3 Transcription Factor/metabolism , Signal Transduction
ABSTRACT
Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proved applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes. Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping computationally costly sequence alignments required in OTU-picking and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments, and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine. Availability and implementation: The software and datasets are available at https://llp.berkeley.edu/micropheno. Supplementary information: Supplementary data are available at Bioinformatics online.
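The k-mer representation described in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming a plain normalized k-mer frequency vector over the alphabet ACGT; the function name `kmer_profile` and the choice to ignore k-mers containing other characters are assumptions for this example and are not taken from the published software.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=3):
    """Return normalized k-mer frequencies over the DNA alphabet ACGT.

    K-mers containing other characters (for example N) are ignored,
    which is a simplifying assumption of this sketch.
    """
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Fixed feature order so that profiles from different reads are comparable.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return {kmer: 0.0 for kmer in vocabulary}
    return {kmer: counts[kmer] / total for kmer in vocabulary}


# Toy read; real input would be 16S rRNA gene sequences.
profile = kmer_profile("ACGTACGT", k=2)
```

The resulting fixed-length vector (4^k entries) can be fed to any standard classifier without aligning reads or picking OTUs first.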
Subject(s)
Nucleic Acid Databases , Microbiota/genetics , Phenotype , 16S Ribosomal RNA/genetics , DNA Sequence Analysis/methods , Software , rRNA Genes , Humans , Sequence Alignment/methods
ABSTRACT
SUMMARY: Metagenomics revolutionized the field of microbial ecology, giving access to Gb-sized datasets of microbial communities under natural conditions. This enables fine-grained analyses of the functions of community members, studies of their association with phenotypes and environments, as well as of their microevolution and adaptation to changing environmental conditions. However, phylogenetic methods for studying adaptation and evolutionary dynamics are not able to cope with big data. EDEN is the first software for the rapid detection of protein families and regions under positive selection, as well as their associated biological processes, from meta- and pangenome data. It provides an interactive result visualization for detailed comparative analyses. AVAILABILITY AND IMPLEMENTATION: EDEN is available as a Docker installation under the GPL 3.0 license, allowing its use on common operating systems, at http://www.github.com/hzi-bifo/eden. CONTACT: alice.mchardy@helmholtz-hzi.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Biological Evolution , Metagenomics/methods , Phylogeny , DNA Sequence Analysis/methods , Software , Bacteria/genetics , Phenotype
ABSTRACT
The Thaumarchaeota is an abundant and ubiquitous phylum of archaea that plays a major role in the global nitrogen cycle. Previous analyses of the ammonia monooxygenase gene amoA suggest that pH is an important driver of niche specialization in these organisms. Although the ecological distribution and ecophysiology of extant Thaumarchaeota have been studied extensively, the evolutionary rise of these prokaryotes to ecological dominance in many habitats remains poorly understood. To characterize processes leading to their diversification, we investigated coevolutionary relationships between amoA, a conserved marker gene for Thaumarchaeota, and soil characteristics, by using deep sequencing and comprehensive environmental data in Bayesian comparative phylogenetics. These analyses reveal a large and rapid increase in diversification rates during early thaumarchaeotal evolution; this finding was verified by independent analyses of 16S rRNA. Our findings suggest that the entire Thaumarchaeota diversification regime was strikingly coupled to pH adaptation but less clearly correlated with several other tested environmental factors. Interestingly, the early radiation event coincided with a period of pH adaptation that enabled the terrestrial Thaumarchaeota ancestor to initially move from neutral to more acidic and alkaline conditions. In contrast to classic evolutionary models, whereby niches become rapidly filled after adaptive radiation, global diversification rates have remained stably high in Thaumarchaeota during the past 400-700 million years, suggesting an ongoing high rate of niche formation or switching for these microbes. Our study highlights the enduring importance of environmental adaptation during thaumarchaeotal evolution and, to our knowledge, is the first to link evolutionary diversification to environmental adaptation in a prokaryotic phylum.
Subject(s)
Archaea/physiology , Biological Evolution , Oxidoreductases/genetics , Soil/chemistry , Ammonia/chemistry , Archaea/enzymology , Archaea/genetics , Bayes Theorem , Cluster Analysis , Molecular Evolution , Hydrogen-Ion Concentration , Molecular Conformation , Nitrogen/chemistry , Oxidoreductases/metabolism , Oxygen/chemistry , Phylogeny , 16S Ribosomal RNA/metabolism , Recombinant Proteins/chemistry
ABSTRACT
For reasons not yet understood, nearly all infants with acute lymphoblastic leukemia (ALL) are diagnosed with the B-cell type, with T-ALL in infancy representing a very rare exception. Clinical and molecular knowledge about infant T-ALL is still nearly completely lacking and it is also still unclear whether it represents a distinct disease compared to childhood T-ALL. To address this, we performed exome sequencing of three infant cases, which enabled the detection of mutations in NOTCH2, NOTCH3, PTEN, and KRAS. When analyzing the transcriptomes and miRNomes of the three infant and an additional six childhood T-ALL samples, we found 760 differentially expressed mRNAs and 58 differentially expressed miRNAs between these two cohorts. Correlation analysis for differentially expressed miRNA-mRNA target pairs revealed 47 miRNA-mRNA pairs, with many of them previously described to be aberrantly expressed in leukemia and cancer. Pathway analysis revealed differentially expressed pathways and upstream regulators related to the immune system or cancerogenesis such as the ERK5 pathway, which was activated in infant T-ALL. In summary, there are distinct molecular features in infant compared to childhood T-ALL on a transcriptomic and epigenetic level, which potentially have an impact on the development and course of the disease. © 2016 Wiley Periodicals, Inc.