ABSTRACT
Studies of bacterial adaptation and evolution are hampered by the difficulty of measuring traits such as virulence, drug resistance, and transmissibility in large populations. In contrast, it is now feasible to obtain high-quality complete assemblies of many bacterial genomes thanks to scalable high-accuracy long-read sequencing technologies. To exploit this opportunity, we introduce a phenotype- and alignment-free method for discovering coselected and epistatically interacting genomic variation from genome assemblies covering both core and accessory parts of genomes. Our approach uses a compact colored de Bruijn graph to approximate the intragenome distances between pairs of loci for a collection of bacterial genomes to account for the impacts of linkage disequilibrium (LD). We demonstrate the versatility of our approach to efficiently identify associations between loci linked with drug resistance and adaptation to the hospital niche in the major human bacterial pathogens Streptococcus pneumoniae and Enterococcus faecalis.
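The idea of approximating the distance between two loci through a de Bruijn graph can be illustrated with a small sketch. This is a simplified illustration, not the authors' implementation: it builds a plain (uncompacted, uncolored) k-mer graph and measures distance as the number of edges on the shortest path, whereas the abstract describes a compact colored graph and per-genome distances. The function names `build_de_bruijn` and `graph_distance` and the toy sequences are assumptions made for this example.

```python
from collections import defaultdict, deque


def build_de_bruijn(sequences, k):
    """Map each k-mer to the set of k-mers adjacent to it in any sequence."""
    graph = defaultdict(set)
    for seq in sequences:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
            graph[right].add(left)  # undirected, so distances are symmetric
    return graph


def graph_distance(graph, source, target):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Two toy "genomes" that differ by a single base (hypothetical data).
toy_graph = build_de_bruijn(["ACGTTGCA", "ACGTAGCA"], k=3)
print(graph_distance(toy_graph, "ACG", "GCA"))  # 5
```

Pairs of variants separated by a short graph distance are expected to be correlated through linkage alone, so a distance of this kind lets an analysis discount nearby pairs and focus on long-range associations.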
Subjects
Enterococcus faecalis, Genetic Epistasis, Bacterial Genome, Streptococcus pneumoniae, Streptococcus pneumoniae/genetics, Enterococcus faecalis/genetics, Linkage Disequilibrium, Humans, Genomics/methods
ABSTRACT
The soil bacterium Burkholderia pseudomallei is the causative agent of melioidosis and a significant cause of human morbidity and mortality in many tropical and subtropical countries. The species notoriously survives harsh environmental conditions but the genetic architecture for these adaptations remains unclear. Here we employed a powerful combination of genome-wide epistasis and co-selection studies (2,011 genomes), condition-wide transcriptome analyses (82 diverse conditions), and a gene knockout assay to uncover signals of "co-selection", that is, a combination of genetic markers that have been repeatedly selected together through B. pseudomallei evolution. These enabled us to identify 13,061 mutation pairs under co-selection in distinct genes and noncoding RNA. Genes under co-selection displayed marked expression correlation when B. pseudomallei was subjected to physical stress conditions, highlighting the conditions as one of the major evolutionary driving forces for this bacterium. We identified a putative adhesin (BPSL1661) as a hub of co-selection signals, experimentally confirmed a role for BPSL1661 under nutrient deprivation, and explored the functional basis of the co-selection gene network surrounding BPSL1661 in facilitating bacterial survival under nutrient depletion. Our findings suggest that nutrient-limited conditions have been the common selection pressure acting on this species, and allelic variation of BPSL1661 may have promoted B. pseudomallei survival during harsh environmental conditions by facilitating bacterial adherence to different surfaces, cells, or living hosts.
Subjects
Biological Evolution, Burkholderia pseudomallei, Bacterial Adhesins, Alleles, Burkholderia pseudomallei/genetics, Burkholderia pseudomallei/physiology, Genetic Selection, Physiological Stress
ABSTRACT
Covariance-based discovery of polymorphisms under co-selective pressure or epistasis has received considerable recent attention in population genomics. Both statistical modeling of the population level covariation of alleles across the chromosome and model-free testing of dependencies between pairs of polymorphisms have been shown to successfully uncover patterns of selection in bacterial populations. Here we introduce a model-free method, SpydrPick, whose computational efficiency enables analysis at the scale of pan-genomes of many bacteria. SpydrPick incorporates an efficient correction for population structure, which adjusts for the phylogenetic signal in the data without requiring an explicit phylogenetic tree. We also introduce a new type of visualization of the results similar to the Manhattan plots used in genome-wide association studies, which enables rapid exploration of the identified signals of co-evolution. Simulations demonstrate the usefulness of our method and give some insight to when this type of analysis is most likely to be successful. Application of the method to large population genomic datasets of two major human pathogens, Streptococcus pneumoniae and Neisseria meningitidis, revealed both previously identified and novel putative targets of co-selection related to virulence and antibiotic resistance, highlighting the potential of this approach to drive molecular discoveries, even in the absence of phenotypic data.
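A model-free test of dependency between two polymorphisms can be sketched with mutual information (MI) computed from allele counts. This is a minimal illustration that assumes MI as the dependency measure; it does not reproduce SpydrPick itself and omits the population-structure correction and any filtering the abstract mentions. The function name `mutual_information` and the toy allele columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(locus_a, locus_b):
    """Mutual information (in bits) between two aligned allele columns.

    Each argument holds one allele per genome, in the same genome order.
    """
    n = len(locus_a)
    if n == 0 or n != len(locus_b):
        raise ValueError("loci must be non-empty and of equal length")
    joint = Counter(zip(locus_a, locus_b))
    marginal_a = Counter(locus_a)
    marginal_b = Counter(locus_b)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marginal_a[a] / n
        p_b = marginal_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical allele columns across four genomes.
print(mutual_information("AATT", "CCGG"))  # perfectly linked: 1.0 bit
print(mutual_information("AATT", "CGCG"))  # independent: 0.0 bits
```

In a genome-wide scan, a score like this would be computed for every pair of loci and the highest-scoring pairs that are not physically close would be candidates for co-selection.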
Subjects
Computational Biology/methods, Genetic Epistasis, Bacterial Genome/genetics, Genomics, Microbial Drug Resistance/genetics, Humans, Metagenomics/methods, Neisseria meningitidis/genetics, Neisseria meningitidis/pathogenicity, Streptococcus pneumoniae/genetics, Virulence/genetics
ABSTRACT
In assessments of child sexual abuse (CSA) allegations, informative background information is often overlooked or not used properly. We therefore created and tested an instrument that uses accessible background information to calculate the probability of a child being a CSA victim, which can be used as a starting point in the subsequent investigation. Studying 903 demographic and socioeconomic variables from over 11,000 Finnish children, we identified 42 features related to CSA. Using Bayesian logic to calculate the probability of abuse, our instrument, the Finnish Investigative Instrument of Child Sexual Abuse (FICSA), has two separate profiles for boys and girls. A cross-validation procedure suggested excellent diagnostic utility (area under the curve [AUC] = 0.97 for boys and AUC = 0.88 for girls). We conclude that the presented method can be useful in forensic assessments of CSA allegations by adding a reliable statistical approach to considering background information, and can support clinical decision making and guide investigative efforts.
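The general form of a Bayesian probability update from several background features can be sketched as follows. This is a generic illustration, not the FICSA instrument: it assumes the features are conditionally independent (a naive Bayes simplification the abstract does not state), and the prior and likelihood ratios are invented numbers. The function name `posterior_probability` is hypothetical.

```python
def posterior_probability(prior, likelihood_ratios):
    """Update a prior probability with one likelihood ratio per observed feature.

    Assumes the features are conditionally independent, so the posterior
    odds equal the prior odds multiplied by every likelihood ratio.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        if ratio <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        odds *= ratio
    return odds / (1.0 + odds)


# Invented numbers: a 20% prior and two features, each twice as likely
# to be observed when the hypothesis is true as when it is false.
print(posterior_probability(0.2, [2.0, 2.0]))
```

With no features observed the function returns the prior unchanged, which matches the role of a base rate as the starting point for an assessment.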
Subjects
Child Sexual Abuse/diagnosis, Adolescent, Bayes Theorem, Child, Decision Support Techniques, Finland, Humans
ABSTRACT
OBJECTIVES: Integrating pathogen genomic surveillance with bioinformatics can enhance public health responses by identifying risk and guiding interventions. This study focusses on the two predominant Campylobacter species, which are commonly found in the gut of birds and mammals and often infect humans via contaminated food. Rising incidence and antimicrobial resistance (AMR) are a global concern, and there is an urgent need to quantify the main routes to human infection. METHODS: During routine US national surveillance (2009-2019), 8856 Campylobacter genomes from human infections and 16,703 from possible sources were sequenced. Using machine learning and probabilistic models, we target genetic variation associated with host adaptation to attribute the source of human infections and estimate the importance of different disease reservoirs. RESULTS: Poultry was identified as the primary source of human infections, responsible for an estimated 68% of cases, followed by cattle (28%), and only a small contribution from wild birds (3%) and pork sources (1%). There was also evidence of an increase in multidrug resistance, particularly among isolates attributed to chickens. CONCLUSIONS: National surveillance and source attribution can guide policy, and our study suggests that interventions targeting poultry will yield the greatest reductions in campylobacteriosis and spread of AMR in the US. DATA AVAILABILITY: All sequence reads were uploaded and shared on NCBI's Sequence Read Archive (SRA) associated with BioProjects: PRJNA239251 (CDC / PulseNet surveillance), PRJNA287430 (FSIS surveillance), PRJNA292668 & PRJNA292664 (NARMS) and PRJNA258022 (FDA surveillance). Publicly available genomes, including reference genomes and isolates sampled worldwide from wild birds, are associated with BioProject accessions: PRJNA176480, PRJNA177352, PRJNA342755, PRJNA345429, PRJNA312235, PRJNA415188, PRJNA524300, PRJNA528879, PRJNA529798, PRJNA575343, PRJNA524315 and PRJNA689604. Contiguous assemblies of all genome sequences compared are available at Mendeley data (assembled C. coli genomes doi: 10.17632/gxswjvxyh3.1; assembled C. jejuni genomes doi: 10.17632/6ngsz3dtbd.1) and individual project and accession numbers can be found in Supplementary tables S1 and S2, which also include pubMLST identifiers for assembled genomes. Figshare (10.6084/m9.figshare.20279928). Interactive phylogenies are hosted on microreact separately for C. jejuni (https://microreact.org/project/pascoe-us-cjejuni) and C. coli (https://microreact.org/project/pascoe-us-ccoli).
Subjects
Campylobacter Infections, Campylobacter, Machine Learning, Campylobacter Infections/epidemiology, Campylobacter Infections/microbiology, Campylobacter Infections/veterinary, Animals, United States/epidemiology, Humans, Campylobacter/genetics, Campylobacter/classification, Campylobacter/isolation & purification, Cattle, Retrospective Studies, Chickens/microbiology, Epidemiological Monitoring, Swine, Poultry/microbiology, Bacterial Genome
ABSTRACT
The Genetics of Sexuality and Aggression (GSA) project was launched at the Åbo Akademi University in Turku, Finland in 2005 and has so far undertaken two major population-based data collections involving twins and siblings of twins. To date, it consists of about 14,000 individuals (including 1,147 informative monozygotic twin pairs, 1,042 informative same-sex dizygotic twin pairs, and 741 informative opposite-sex dizygotic twin pairs). Participants have been recruited through the Central Population Registry of Finland and were 18-49 years of age at the time of the data collections. Saliva samples for DNA genotyping (n = 4,278) and testosterone analyses (n = 1,168) were collected in 2006. The primary focus of the data collections has been on sexuality (both sexual functioning and sexual behavior) and aggressive behavior. This paper provides an overview of the data collections as well as an outline of the phenotypes and biological data assembled within the project. A detailed overview of publications can be found at the project's Web site: http://www.cebg.fi/.
Subjects
Aggression/psychology, Registries, Sexuality/psychology, Dizygotic Twins/genetics, Monozygotic Twins/genetics, Adolescent, Adult, Cohort Studies, Female, Finland/epidemiology, Humans, Male, Middle Aged, Phenotype, Psychosexual Development, Surveys and Questionnaires, Dizygotic Twins/psychology, Monozygotic Twins/psychology, Young Adult
ABSTRACT
Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied to data of an increasingly complex nature, DAG-based simulation frameworks are still confined to settings with relatively simple variable types and functional forms. We here present DagSim, a Python-based framework for DAG-based data simulation without any constraints on variable types or functional relations. A succinct YAML format for defining the simulation model structure promotes transparency, while separate user-provided functions for generating each variable based on its parents ensure simulation code modularization. We illustrate the capabilities of DagSim through use cases where metadata variables control shapes in an image and patterns in bio-sequences. DagSim is available as a Python package at PyPI. Source code and documentation are available at: https://github.com/uio-bmi/dagsim.
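The core idea of DAG-based simulation, generating each variable from its parents in topological order with a separate user-provided function per variable, can be sketched in plain Python. This sketch deliberately does not use the DagSim package or its YAML format; the `simulate` function, the `model` dictionary layout, and the variables in it are assumptions made for illustration only.

```python
import random
from graphlib import TopologicalSorter


def simulate(nodes, n_samples, seed=0):
    """Draw samples from a DAG model.

    nodes maps a variable name to (parent_names, function), where the
    function receives a random generator followed by the parent values.
    """
    rng = random.Random(seed)
    dependencies = {name: parents for name, (parents, _) in nodes.items()}
    # static_order() yields every variable after all of its parents.
    order = list(TopologicalSorter(dependencies).static_order())
    samples = []
    for _ in range(n_samples):
        row = {}
        for name in order:
            parents, func = nodes[name]
            row[name] = func(rng, *(row[p] for p in parents))
        samples.append(row)
    return samples


# Hypothetical three-variable model mixing numeric and categorical types.
model = {
    "age": ((), lambda rng: rng.randint(20, 60)),
    "dose": (("age",), lambda rng, age: 0.5 * age + rng.gauss(0, 1)),
    "label": (("age", "dose"), lambda rng, age, dose: "high" if dose > 20 else "low"),
}

samples = simulate(model, n_samples=5, seed=1)
print(samples[0])
```

Because each variable is produced by an arbitrary function of its parents, nothing restricts the value to a number: the same pattern could return an image array or a biological sequence, which is the kind of flexibility the abstract describes.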
Subjects
Software, Computer Simulation
ABSTRACT
Measurement and manipulation of the microbiome is generally considered to have great potential for understanding the causes of complex diseases in humans, developing new therapies, and finding preventive measures. Many studies have found significant associations between the microbiome and various diseases; however, Koch's classical postulates remind us about the importance of causative reasoning when considering the relationship between microbes and a disease manifestation. Although causal discovery in observational microbiome data faces many challenges, methodological advances in causal structure learning have improved the potential of data-driven prediction of causal effects in large-scale biological systems. In this Personal View, we show the capability of existing methods for inferring causal effects from metagenomic data, and we highlight ways in which the introduction of causal structures that are more flexible than existing structures offers new opportunities for causal reasoning. Our observations suggest that microbiome research can further benefit from tools developed in the past 5 years in causal discovery and learn from their applications elsewhere.
Subjects
Microbiota, Humans, Metagenomics/methods, Causality, Metagenome
ABSTRACT
Chickens are the most common birds on Earth and colibacillosis is among the most common diseases affecting them. This major threat to animal welfare and safe sustainable food production is difficult to combat because the etiological agent, avian pathogenic Escherichia coli (APEC), emerges from ubiquitous commensal gut bacteria, with no single virulence gene present in all disease-causing isolates. Here, we address the underlying evolutionary mechanisms of extraintestinal spread and systemic infection in poultry. Combining population scale comparative genomics and pangenome-wide association studies, we compare E. coli from commensal carriage and systemic infections. We identify phylogroup-specific and species-wide genetic elements that are enriched in APEC, including pathogenicity-associated variation in 143 genes that have diverse functions, including genes involved in metabolism, lipopolysaccharide synthesis, heat shock response, antimicrobial resistance and toxicity. We find that horizontal gene transfer spreads pathogenicity elements, allowing divergent clones to cause infection. Finally, a Random Forest model prediction of disease status (carriage vs. disease) identifies pathogenic strains in the emergent ST-117 poultry-associated lineage with 73% accuracy, demonstrating the potential for early identification of emergent APEC in healthy flocks.
Subjects
Escherichia coli Infections/prevention & control, Escherichia coli/genetics, Molecular Evolution, Bacterial Genome/genetics, Poultry Diseases/prevention & control, Animals, Chickens, Escherichia coli/classification, Escherichia coli/pathogenicity, Escherichia coli Infections/diagnosis, Escherichia coli Infections/microbiology, Bacterial Genes, Genetic Variation, Genome-Wide Association Study/methods, Genotype, Humans, Phylogeny, Poultry Diseases/diagnosis, Poultry Diseases/microbiology, Virulence/genetics
ABSTRACT
Adaptive immune receptor repertoires (AIRR) are key targets for biomedical research as they record past and ongoing adaptive immune responses. The capacity of machine learning (ML) to identify complex discriminative sequence patterns renders it an ideal approach for AIRR-based diagnostic and therapeutic discovery. To date, widespread adoption of AIRR ML has been inhibited by a lack of reproducibility, transparency, and interoperability. immuneML (immuneml.uio.no) addresses these concerns by implementing each step of the AIRR ML process in an extensible, open-source software ecosystem that is based on fully specified and shareable workflows. To facilitate widespread user adoption, immuneML is available as a command-line tool and through an intuitive Galaxy web interface, and extensive documentation of workflows is provided. We demonstrate the broad applicability of immuneML by (i) reproducing a large-scale study on immune state prediction, (ii) developing, integrating, and applying a novel deep learning method for antigen specificity prediction, and (iii) showcasing streamlined interpretability-focused benchmarking of AIRR ML.
ABSTRACT
Enterococcus faecium is a commensal of the gastrointestinal tract but is also known as a nosocomial pathogen among hospitalized patients. Population genetics based on whole-genome sequencing has revealed that E. faecium strains from hospitalized patients form a distinct clade, designated clade A1, and that plasmids are major contributors to the emergence of nosocomial E. faecium. Here we further explored the adaptive evolution of E. faecium using a genome-wide co-evolution study (GWES) to identify co-evolving single-nucleotide polymorphisms (SNPs). We identified three genomic regions harbouring large numbers of SNPs in tight linkage that are not proximal to each other based on the completely assembled chromosome of the clade A1 reference hospital isolate AUS0004. Close examination of these regions revealed that they are located at the borders of four different types of large-scale genomic rearrangements, insertion sites of two different genomic islands and an IS30-like transposon. In non-clade A1 isolates, these regions are adjacent to each other and they lack the insertions of the genomic islands and IS30-like transposon. Additionally, among the clade A1 isolates there is one group of pet isolates lacking the genomic rearrangement and insertion of the genomic islands, suggesting a distinct evolutionary trajectory. In silico analysis of the biological functions of the genes encoded in the three regions revealed a common link to a stress response. This suggests that these rearrangements may reflect adaptation to the stringent conditions in the hospital environment, such as antibiotics and detergents, to which bacteria are exposed. In conclusion, to our knowledge, this is the first study using GWES to identify genomic rearrangements, suggesting that there is considerable untapped potential to unravel hidden evolutionary signals from population genomic data.
Subjects
Enterococcus faecium/classification, Gram-Positive Bacterial Infections/microbiology, Single Nucleotide Polymorphism, Whole Genome Sequencing/methods, Cross Infection/microbiology, DNA Transposable Elements, Enterococcus faecium/genetics, Molecular Evolution, Genomic Islands, High-Throughput Nucleotide Sequencing, Humans, Phylogeny, Plasmids/genetics
ABSTRACT
A fundamental goal of contemporary biomedical research is to understand the molecular basis of disease pathogenesis and exploit this information to develop targeted and more-effective therapies. Necrotizing myositis caused by the bacterial pathogen Streptococcus pyogenes is a devastating human infection with a high mortality rate and few successful therapeutic options. We used dual transcriptome sequencing (RNA-seq) to analyze the transcriptomes of S. pyogenes and host skeletal muscle recovered contemporaneously from infected nonhuman primates. The in vivo bacterial transcriptome was strikingly remodeled compared to organisms grown in vitro, with significant upregulation of genes contributing to virulence and altered regulation of metabolic genes. The transcriptome of muscle tissue from infected nonhuman primates (NHPs) differed significantly from that of mock-infected animals, due in part to substantial changes in genes contributing to inflammation and host defense processes. We discovered significant positive correlations between group A streptococcus (GAS) virulence factor transcripts and genes involved in the host immune response and inflammation. We also discovered significant correlations between the magnitude of bacterial virulence gene expression in vivo and pathogen fitness, as assessed by previously conducted genome-wide transposon-directed insertion site sequencing (TraDIS). By integrating the bacterial RNA-seq data with the fitness data generated by TraDIS, we discovered five new pathogen genes, namely, S. pyogenes 0281 (Spy0281 [dahA]), ihk-irr, slr, isp, and ciaH, that contribute to necrotizing myositis and confirmed these findings using isogenic deletion-mutant strains. 
Taken together, our study results provide rich new information about the molecular events occurring in severe invasive infection of primate skeletal muscle that has extensive translational research implications.
IMPORTANCE: Necrotizing myositis caused by Streptococcus pyogenes has high morbidity and mortality rates and relatively few successful therapeutic options. In addition, there is no licensed human S. pyogenes vaccine. To gain enhanced understanding of the molecular basis of this infection, we employed a multidimensional analysis strategy that included dual RNA-seq and other data derived from experimental infection of nonhuman primates. The data were used to target five streptococcal genes for pathogenesis research, resulting in the unambiguous demonstration that these genes contribute to pathogen-host molecular interactions in necrotizing infections. We exploited fitness data derived from a recently conducted genome-wide transposon mutagenesis study to discover significant correlation between the magnitude of bacterial virulence gene expression in vivo and pathogen fitness. Collectively, our findings have significant implications for translational research, potentially including vaccine efforts.
Subjects
Necrotizing Fasciitis/microbiology, Myositis/microbiology, Streptococcal Infections/microbiology, Streptococcus pyogenes/genetics, Streptococcus pyogenes/metabolism, Transcriptome, Virulence Factors/genetics, Animals, Bacterial Proteins/metabolism, Bacterial Gene Expression Regulation, Host-Pathogen Interactions/genetics, Host-Pathogen Interactions/physiology, Skeletal Muscle/microbiology, Skeletal Muscle/pathology, Myositis/genetics, Myositis/metabolism, Primates, Bacterial RNA/genetics, Bacterial RNA/metabolism, Streptococcus pyogenes/pathogenicity, Virulence/genetics, Virulence Factors/metabolism
ABSTRACT
Streptococcus pyogenes causes 700 million human infections annually worldwide, yet, despite a century of intensive effort, there is no licensed vaccine against this bacterium. Although a number of large-scale genomic studies of bacterial pathogens have been published, the relationships among the genome, transcriptome, and virulence in large bacterial populations remain poorly understood. We sequenced the genomes of 2,101 emm28 S. pyogenes invasive strains, from which we selected 492 phylogenetically diverse strains for transcriptome analysis and 50 strains for virulence assessment. Data integration provided a novel understanding of the virulence mechanisms of this model organism. Genome-wide association study, expression quantitative trait loci analysis, machine learning, and isogenic mutant strains identified and confirmed a one-nucleotide indel in an intergenic region that significantly alters global transcript profiles and ultimately virulence. The integrative strategy that we used is generally applicable to any microbe and may lead to new therapeutics for many human pathogens.
Subjects
Bacterial Genome/genetics, Streptococcus pyogenes/genetics, Transcriptome/genetics, Virulence/genetics, Bacterial Gene Expression Regulation/genetics, Genome-Wide Association Study/methods, Genomics/methods, Phylogeny, Quantitative Trait Loci/genetics
ABSTRACT
The potential for genome-wide modelling of epistasis has recently surfaced given the possibility of sequencing densely sampled populations and the emerging families of statistical interaction models. Direct coupling analysis (DCA) has previously been shown to yield valuable predictions for single protein structures, and has recently been extended to genome-wide analysis of bacteria, identifying novel interactions in the co-evolution between resistance, virulence and core genome elements. However, earlier computational DCA methods have not been scalable to enable model fitting simultaneously to 10^4 to 10^5 polymorphisms, representing the amount of core genomic variation observed in analyses of many bacterial species. Here, we introduce a novel inference method (SuperDCA) that employs a new scoring principle, efficient parallelization, optimization and filtering on phylogenetic information to achieve scalability for up to 10^5 polymorphisms. Using two large population samples of Streptococcus pneumoniae, we demonstrate the ability of SuperDCA to make additional significant biological findings about this major human pathogen. We also show that our method can uncover signals of selection that are not detectable by genome-wide association analysis, even though our analysis does not require phenotypic measurements. SuperDCA, thus, holds considerable potential in building understanding about numerous organisms at a systems biological level.
Subjects
Genetic Epistasis, Bacterial Genome, Genetic Association Studies, Genetic Loci, Genomics, Humans, Genetic Models, Phylogeny, Single Nucleotide Polymorphism, Protein Conformation, Streptococcus pneumoniae/genetics
ABSTRACT
Some of the most common infectious diseases are caused by bacteria that naturally colonise humans asymptomatically. Combating these opportunistic pathogens requires an understanding of the traits that differentiate infecting strains from harmless relatives. Staphylococcus epidermidis is carried asymptomatically on the skin and mucous membranes of virtually all humans but is a major cause of nosocomial infection associated with invasive procedures. Here we address the underlying evolutionary mechanisms of opportunistic pathogenicity by combining pangenome-wide association studies and laboratory microbiology to compare S. epidermidis from bloodstream and wound infections and asymptomatic carriage. We identify 61 genes containing infection-associated genetic elements (k-mers) that correlate with in vitro variation in known pathogenicity traits (biofilm formation, cell toxicity, interleukin-8 production, methicillin resistance). Horizontal gene transfer spreads these elements, allowing divergent clones to cause infection. Finally, Random Forest model prediction of disease status (carriage vs. infection) identifies pathogenicity elements in 415 S. epidermidis isolates with 80% accuracy, demonstrating the potential for identifying risk genotypes pre-operatively.