ABSTRACT
Hemagglutinin (HA) and neuraminidase (NA) proteins are the primary antigenic targets of influenza A virus (IAV) infections. IAV infections are generally classified into subtypes of HA and NA proteins, e.g. H3N2. Most of the known subtypes were originally defined by a lack of antibody cross-reactivity. However, genetic sequencing has played an increasingly important role in characterizing the evolving diversity of IAV. Novel subtypes have recently been described solely by their genetic sequences, and IAV infections are routinely subtyped by molecular assays, or the comparison of sequences to references. In this study, I carry out a comparative analysis of all available IAV protein sequences in the Genbank database (over 1.1 million, reduced to 272,292 unique sequences prior to phylogenetic reconstruction) to determine whether the serologically defined subtypes can be reproduced with sequence-based criteria. I show that a robust genetic taxonomy of HA and NA subtypes can be obtained using a simple clustering method, namely, by progressively partitioning the phylogeny on its longest internal branches. However, this taxonomy also requires some amendments to the current nomenclature. For example, two IAV isolates from bats previously characterized as a divergent lineage of H9N2 should be separated into their own subtype. With the exception of these small and highly divergent lineages, the phylogenies relating each of the other six genomic segments do not support partitions into major subtypes.
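The clustering idea described above (partitioning a phylogeny on its longest internal branches) can be sketched as follows. This is a minimal illustration, not the study's implementation: the tree encoding (`parent` and `length` dictionaries), the function name, and the `n_cuts` parameter are all assumptions made for the example.

```python
from collections import defaultdict

def partition_by_longest_branches(parent, length, n_cuts):
    """Group tips after cutting the n_cuts longest internal branches.

    parent: dict mapping each non-root node to its parent (hypothetical encoding)
    length: dict mapping each non-root node to the length of the branch above it
    """
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)

    # Internal branches are those leading to a node that has children.
    internal = [node for node in parent if children[node]]
    cuts = set(sorted(internal, key=lambda n: length[n], reverse=True)[:n_cuts])

    def group_root(node):
        # Walk towards the root until a cut branch (or the root) is reached.
        while node not in cuts and node in parent:
            node = parent[node]
        return node

    groups = defaultdict(list)
    for tip in (node for node in parent if not children[node]):
        groups[group_root(tip)].append(tip)
    return sorted(sorted(tips) for tips in groups.values())


# Toy rooted tree: R -> (A, B), A -> (t1, t2), B -> (t3, C), C -> (t4, t5)
parent = {"A": "R", "B": "R", "t1": "A", "t2": "A",
          "t3": "B", "C": "B", "t4": "C", "t5": "C"}
length = {"A": 0.5, "B": 0.1, "C": 0.05,
          "t1": 0.1, "t2": 0.1, "t3": 0.1, "t4": 0.1, "t5": 0.1}
print(partition_by_longest_branches(parent, length, n_cuts=1))
```

Note that in this rooted encoding, cutting both branches below the root leaves a root-side group with no tips, so the number of groups can be smaller than `n_cuts + 1`.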
ABSTRACT
During the COVID-19 pandemic, the Province of Ontario, Canada, launched a wastewater surveillance program to monitor SARS-CoV-2, inspired by the early work and successful forecasts of COVID-19 waves in the city of Ottawa, Ontario. This manuscript presents a dataset from January 1, 2021, to March 31, 2023, with RT-qPCR results for SARS-CoV-2 genes and PMMoV from 107 sites across all 34 public health units in Ontario, covering 72% of the province's and 26.2% of Canada's population. Samples were collected 2-7 times weekly, and the dataset includes geographical coordinates, serviced populations, physico-chemical water characteristics, and flow rates. In doing so, this manuscript ensures data availability and metadata preservation to support future research and epidemic preparedness through detailed analyses and modeling. The dataset has been crucial for public health in tracking disease locally, especially with the rise of the Omicron variant and the decline in clinical testing, highlighting wastewater-based surveillance's role in estimating disease incidence in Ontario.
Subjects
COVID-19 , SARS-CoV-2 , Wastewater , Ontario/epidemiology , COVID-19/epidemiology , Wastewater/virology , Humans , Pandemics , Viral Load
ABSTRACT
Wastewater-based surveillance (WBS) is an important epidemiological and public health tool for tracking pathogens across the scale of a building, neighbourhood, city, or region. WBS gained widespread adoption globally during the SARS-CoV-2 pandemic for estimating community infection levels by qPCR. Sequencing pathogen genes or genomes from wastewater adds information about pathogen genetic diversity, which can be used to identify viral lineages (including variants of concern) that are circulating in a local population. Capturing the genetic diversity by WBS sequencing is not trivial, as wastewater samples often contain a diverse mixture of viral lineages with real mutations and sequencing errors, which must be deconvoluted computationally from short sequencing reads. In this study we assess nine different computational tools that have recently been developed to address this challenge. We simulated 100 wastewater sequence samples consisting of SARS-CoV-2 BA.1, BA.2, and Delta lineages, in various mixtures, as well as a Delta-Omicron recombinant and a synthetic 'novel' lineage. Most tools performed well in identifying the true lineages present and estimating their relative abundances and were generally robust to variation in sequencing depth and read length. While many tools identified lineages present down to 1% frequency, results were more reliable above a 5% threshold. The presence of an unknown synthetic lineage, which represents an unclassified SARS-CoV-2 lineage, increases the error in relative abundance estimates of other lineages, but the magnitude of this effect was small for most tools. The tools also varied in how they labelled novel synthetic lineages and recombinants. While our simulated dataset represents just one of many possible use cases for these methods, we hope it helps users understand potential sources of error or bias in wastewater sequencing analysis and to appreciate the commonalities and differences across methods.
Subjects
COVID-19 , Viral Genome , SARS-CoV-2 , Wastewater , Wastewater/virology , SARS-CoV-2/genetics , SARS-CoV-2/classification , COVID-19/virology , COVID-19/epidemiology , Humans , Computational Biology/methods , Genomics/methods , Wastewater-Based Epidemiological Monitoring , Phylogeny
ABSTRACT
Timing of human immunodeficiency virus-1 (HIV-1) reservoir formation is important for informing HIV cure efforts. It is unclear how much of the variability seen in dating reservoir formation is due to sampling and gene-specific differences. We used a Bayesian extension of root to tip regression (bayroot) to reestimate formation date distributions in participants from Swedish and South African cohorts, and assessed the impact of variable timing, frequency, and depth of sampling on these estimates. Significant shifts in formation date distributions were only observed with use of faster-evolving genes, while timing, frequency, and depth of sampling had minor or no significant effect on estimates.
Subjects
Bayes Theorem , HIV Infections , HIV-1 , HIV-1/genetics , Humans , HIV Infections/virology , South Africa/epidemiology , Sweden/epidemiology , Male , Female , Cohort Studies , Virus Latency/genetics , Adult , Time Factors
ABSTRACT
BACKGROUND: The principal barrier to an HIV cure is the presence of the latent viral reservoir (LVR), which has been understudied in African populations. From 2018 to 2019, Uganda instituted a nationwide rollout of ART consisting of Dolutegravir (DTG) with two NRTI, which replaced the previous regimen of one NNRTI and the same two NRTI. METHODS: Changes in the inducible replication-competent LVR (RC-LVR) of ART-suppressed Ugandans with HIV (n = 88) from 2015 to 2020 were examined using the quantitative viral outgrowth assay. Outgrowth viruses were examined for viral evolution. Changes in the RC-LVR were analyzed using three versions of a Bayesian model that estimated the decay rate over time as a single, linear rate (model A), or allowing for a change at time of DTG initiation (models B and C). FINDINGS: Model A estimated the slope of RC-LVR change as a non-significant positive increase, which was due to a temporary spike in the RC-LVR that occurred 0-12 months post-DTG initiation (p < 0.005). This was confirmed with models B and C; for instance, model B estimated a significant decay pre-DTG initiation with a half-life of 6.9 years, and an ~1.7-fold increase in the size of the RC-LVR post-DTG initiation. There was no evidence of viral failure or consistent evolution in the cohort. INTERPRETATION: These data suggest that the change from NNRTI- to DTG-based ART is associated with a significant temporary increase in the circulating RC-LVR. FUNDING: Supported by the NIH (grant 1-UM1AI164565); Gilead HIV Cure Grants Program (90072171); Canadian Institutes of Health Research (PJT-155990); and Ontario Genomics-Canadian Statistical Sciences Institute.
Subjects
East African People , HIV Infections , HIV Integrase Inhibitors , HIV-1 , Humans , Anti-Retroviral Agents/therapeutic use , Bayes Theorem , CD4-Positive T-Lymphocytes , HIV Infections/drug therapy , HIV Integrase Inhibitors/pharmacology , HIV Integrase Inhibitors/therapeutic use , Viral Load , Virus Latency
ABSTRACT
Wastewater surveillance of coronavirus disease 2019 (COVID-19) commonly applies reverse transcription-quantitative polymerase chain reaction (RT-qPCR) to quantify severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) RNA concentrations in wastewater over time. In most applications worldwide, maximal sensitivity and specificity of RT-qPCR has been achieved, in part, by monitoring two or more genomic loci of SARS-CoV-2. In Ontario, Canada, the provincial Wastewater Surveillance Initiative reports the average copies of the CDC N1 and N2 loci normalized to the fecal biomarker pepper mild mottle virus. In November 2021, the emergence of the Omicron variant of concern, harboring a C28311T mutation within the CDC N1 probe region, challenged the accuracy of the consensus between the RT-qPCR measurements of the N1 and N2 loci of SARS-CoV-2. In this study, we developed and applied a novel real-time dual loci quality assurance and control framework based on the relative difference between the loci measurements to the City of Ottawa dataset to identify a loss of sensitivity of the N1 assay in the period from July 10, 2022 to January 31, 2023. Further analysis via sequencing and allele-specific RT-qPCR revealed a high proportion of mutations C28312T and A28330G during the study period, both in the City of Ottawa and across the province. It is hypothesized that nucleotide mutations in the probe region, especially A28330G, led to inefficient annealing, resulting in reduction in sensitivity and accuracy of the N1 assay. This study highlights the importance of implementing quality assurance and control criteria to continually evaluate, in near real-time, the accuracy of the signal produced in wastewater surveillance applications that rely on detection of pathogens whose genomes undergo high rates of mutation.
Subjects
Wastewater-Based Epidemiological Monitoring , Wastewater , Alleles , Mutation , Ontario/epidemiology , SARS-CoV-2/genetics , Viral RNA/genetics
ABSTRACT
The timing of the establishment of the HIV latent viral reservoir (LVR) is of particular interest, as there is evidence that proviruses are preferentially archived at the time of antiretroviral therapy (ART) initiation. Quantitative viral outgrowth assays (QVOAs) were performed using Peripheral Blood Mononuclear Cells (PBMC) collected from Ugandans living with HIV who were virally suppressed on ART for >1 year, had known seroconversion windows, and at least two archived ART-naïve plasma samples. QVOA outgrowth populations and pre-ART plasma samples were deep sequenced for the pol and gp41 genes. The bayroot program was used to estimate the date that each outgrowth virus was incorporated into the reservoir. Bayroot was also applied to previously published data from a South African cohort. In the Ugandan cohort (n = 11), 87.9 per cent pre-ART and 56.3 per cent viral outgrowth sequences were unique. Integration dates were estimated to be relatively evenly distributed throughout viremia in 9/11 participants. In contrast, sequences from the South African cohort (n = 9) were more commonly estimated to have entered the LVR close to ART initiation, as previously reported. Timing of LVR establishment is variable between populations and potentially viral subtypes, which could limit the effectiveness of interventions that target the LVR only at ART initiation.
ABSTRACT
Nef is an accessory protein unique to the primate HIV-1, HIV-2, and SIV lentiviruses. During infection, Nef functions by interacting with multiple host proteins within infected cells to evade the immune response and enhance virion infectivity. Notably, Nef can counter immune regulators such as CD4 and MHC-I, as well as the SERINC5 restriction factor in infected cells. In this study, we generated a posterior sample of time-scaled phylogenies relating SIV and HIV Nef sequences, followed by reconstruction of ancestral sequences at the root and internal nodes of the sampled trees up to the HIV-1 Group M ancestor. Upon expression of the ancestral primate lentivirus Nef protein within CD4+ HeLa cells, flow cytometry analysis revealed that the primate lentivirus Nef ancestor robustly downregulated cell-surface SERINC5, yet only partially downregulated CD4 from the cell surface. Further analysis revealed that the Nef-mediated CD4 downregulation ability evolved gradually, while Nef-mediated SERINC5 downregulation was recovered abruptly in the HIV-1/M ancestor. Overall, this study provides a framework to reconstruct ancestral viral proteins and enable the functional characterization of these proteins to delineate how functions could have changed throughout evolutionary history.
Subjects
Primate Lentiviruses , Simian Immunodeficiency Virus , Humans , Animals , Primate Lentiviruses/genetics , Primate Lentiviruses/metabolism , Phylogeny , HeLa Cells , Simian Immunodeficiency Virus/metabolism , Human Immunodeficiency Virus nef Gene Products/genetics , Human Immunodeficiency Virus nef Gene Products/metabolism , Primates/genetics , Primates/metabolism , Membrane Proteins/genetics
ABSTRACT
The principal barrier to an HIV cure is the presence of a latent viral reservoir (LVR) made up primarily of latently infected resting CD4+ (rCD4) T-cells. Studies in the United States have shown that the LVR decays slowly (half-life=3.8 years), but this rate in African populations has been understudied. This study examined longitudinal changes in the inducible replication competent LVR (RC-LVR) of ART-suppressed Ugandans living with HIV (n=88) from 2015-2020 using the quantitative viral outgrowth assay, which measures infectious units per million (IUPM) rCD4 T-cells. In addition, outgrowth viruses were examined with site-directed next-generation sequencing to assess for possible ongoing viral evolution. During the study period (2018-19), Uganda instituted a nationwide rollout of first-line ART consisting of Dolutegravir (DTG) with two NRTI, which replaced the previous regimen that consisted of one NNRTI and the same two NRTI. Changes in the RC-LVR were analyzed using two versions of a novel Bayesian model that estimated the decay rate over time on ART as a single, linear rate (model A) or allowing for an inflection at time of DTG initiation (model B). Model A estimated the population-level slope of RC-LVR change as a non-significant positive increase. This positive slope was due to a temporary increase in the RC-LVR that occurred 0-12 months post-DTG initiation (p<0.0001). This was confirmed with model B, which estimated a significant decay pre-DTG initiation with a half-life of 7.7 years, but a significant positive slope post-DTG initiation leading to a transient estimated doubling-time of 8.1 years. There was no evidence of viral failure in the cohort, or consistent evolution in the outgrowth sequences associated with DTG initiation. These data suggest that either the initiation of DTG, or cessation of NNRTI use, is associated with a significant temporary increase in the circulating RC-LVR.
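As a reading aid, the half-life and doubling time quoted above can be converted to rates. This assumes simple exponential change in reservoir size, which is an assumption of this note and not a description of the Bayesian model itself; the symbols N, λ, r and t₂ are introduced here for illustration only.

```latex
\[
N(t) = N_0 e^{-\lambda t}, \qquad
t_{1/2} = \frac{\ln 2}{\lambda}
\;\Longrightarrow\;
\lambda = \frac{\ln 2}{7.7\ \text{years}} \approx 0.090\ \text{year}^{-1}
\]
\[
N(t) = N_0 e^{r t}, \qquad
t_{2} = \frac{\ln 2}{r}
\;\Longrightarrow\;
r = \frac{\ln 2}{8.1\ \text{years}} \approx 0.086\ \text{year}^{-1}
\]
```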
ABSTRACT
Defining clusters of epidemiologically related infections is a common problem in the surveillance of infectious disease. A popular method for generating clusters is pairwise distance clustering, which assigns pairs of sequences to the same cluster if their genetic distance falls below some threshold. The result is often represented as a network or graph of nodes. A connected component is a set of interconnected nodes in a graph that are not connected to any other node. The prevailing approach to pairwise clustering is to map clusters to the connected components of the graph on a one-to-one basis. We propose that this definition of clusters is unnecessarily rigid. For instance, the connected components can collapse into one cluster by the addition of a single sequence that bridges nodes in the respective components. Moreover, the distance thresholds typically used for viruses like HIV-1 tend to exclude a large proportion of new sequences, making it difficult to train models for predicting cluster growth. These issues may be resolved by revisiting how we define clusters from genetic distances. Community detection is a promising class of clustering methods from the field of network science. A community is a set of nodes that are more densely inter-connected relative to the number of their connections to external nodes. Thus, a connected component may be partitioned into two or more communities. Here we describe community detection methods in the context of genetic clustering for epidemiology, demonstrate how a popular method (Markov clustering) enables us to resolve variation in transmission rates within a giant connected component of HIV-1 sequences, and identify current challenges and directions for further work.
ABSTRACT
Genetic sequencing is subject to many different types of errors, but most analyses treat the resultant sequences as if they are known without error. Next generation sequencing methods rely on significantly larger numbers of reads than previous sequencing methods in exchange for a loss of accuracy in each individual read. Still, the coverage of such machines is imperfect and leaves uncertainty in many of the base calls. In this work, we demonstrate that the uncertainty in sequencing techniques will affect downstream analysis and propose a straightforward method to propagate the uncertainty. Our method (which we have dubbed Sequence Uncertainty Propagation, or SUP) uses a probabilistic matrix representation of individual sequences which incorporates base quality scores as a measure of uncertainty that naturally lead to resampling and replication as a framework for uncertainty propagation. With the matrix representation, resampling possible base calls according to quality scores provides a bootstrap- or prior distribution-like first step towards genetic analysis. Analyses based on these re-sampled sequences will include a more complete evaluation of the error involved in such analyses. We demonstrate our resampling method on SARS-CoV-2 data. The resampling procedures add a linear computational cost to the analyses, but the large impact on the variance in downstream estimates makes it clear that ignoring this uncertainty may lead to overly confident conclusions. We show that SARS-CoV-2 lineage designations via Pangolin are much less certain than the bootstrap support reported by Pangolin would imply and the clock rate estimates for SARS-CoV-2 are much more variable than reported.
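The resampling step described above can be illustrated with a short sketch. It uses the standard Phred relationship (error probability = 10^(-Q/10)) and assumes, for simplicity, that an erroneous call is equally likely to be any of the other three nucleotides; that uniform choice and the function names are assumptions of this example, not details taken from SUP.

```python
import random

def phred_to_error(quality):
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-quality / 10)

def resample_read(bases, qualities, rng):
    """Draw one plausible version of a read given its per-base quality scores."""
    resampled = []
    for base, quality in zip(bases, qualities):
        if rng.random() < phred_to_error(quality):
            # Assumed error model: replace with one of the other three bases.
            base = rng.choice([b for b in "ACGT" if b != base])
        resampled.append(base)
    return "".join(resampled)


rng = random.Random(1)
read = "ACGTACGTAC"
qualities = [30, 30, 10, 30, 5, 30, 30, 2, 30, 30]  # hypothetical scores
replicates = [resample_read(read, qualities, rng) for _ in range(5)]
for replicate in replicates:
    print(replicate)
```

Running a downstream analysis once per replicate, then summarizing the spread of the results, is the propagation step; the cost grows linearly with the number of replicates.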
ABSTRACT
The comparative analysis of amino acid sequences is an important tool in molecular biology that often requires multiple sequence alignments. In comparisons between less closely related genomes, however, it becomes more difficult to accurately align protein-coding sequences, or even to identify homologous regions in different genomes. In this article, we describe an alignment-free method for the classification of homologous protein-coding regions from different genomes. This methodology was originally developed for comparing genomes within virus families, but may be adapted for other organisms. We quantify sequence homology from the overlap (intersection distance) of the k-mer (word) frequency distributions for different protein sequences. Next, we extract groups of homologous sequences from the resulting distance matrix using a combination of dimensionality reduction and hierarchical clustering methods. Finally, we demonstrate how to generate visualizations of the composition of clusters with respect to protein annotations, and by coloring protein-coding regions of genomes by cluster assignments. These provide a useful means to quickly assess the reliability of the clustering results based on the distribution of homologous genes among genomes. © 2023 Wiley Periodicals LLC. Basic Protocol 1: Data collection and processing Basic Protocol 2: Calculating k-mer distances Basic Protocol 3: Extracting clusters of homology Support Protocol: Genome plot based on clustering results.
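The k-mer comparison described above can be sketched as follows. The abstract does not give the exact formula, so this example assumes one common definition: the distance is one minus the summed minimum of the two normalized k-mer frequency distributions. Function names and the default k = 3 are illustrative choices, and both sequences are assumed to be at least k residues long.

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Return the normalized frequency of every k-mer (word) in a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}

def intersection_distance(seq1, seq2, k=3):
    """One minus the overlap of two k-mer frequency distributions (assumed form)."""
    freq1 = kmer_frequencies(seq1, k)
    freq2 = kmer_frequencies(seq2, k)
    shared = freq1.keys() & freq2.keys()
    overlap = sum(min(freq1[kmer], freq2[kmer]) for kmer in shared)
    return 1.0 - overlap


# Hypothetical short peptide fragments
print(intersection_distance("MKTAYIAKQR", "MKTAYIAKQR"))
print(intersection_distance("MKTAYIAKQR", "MKTAYLAKQR"))
print(intersection_distance("AAAAAA", "CCCCCC"))
```

Computing this value for every pair of protein sequences yields the distance matrix that the later dimensionality reduction and hierarchical clustering steps operate on.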
Subjects
Algorithms , Reproducibility of Results , Sequence Alignment , Amino Acid Sequence , Cluster Analysis
ABSTRACT
Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may provide a mechanism to increase the information content of compact genomes. The presence of overlapping reading frames (OvRFs) can skew estimates of selection based on the rates of non-synonymous and synonymous substitutions, since a substitution that is synonymous in one reading frame may be non-synonymous in another and vice versa. To understand the impact of OvRFs on molecular evolution, we implemented a versatile simulation model of nucleotide sequence evolution along a phylogeny with any distribution of open reading frames in linear or circular genomes. We use a custom data structure to track the substitution rates at every nucleotide site, which are determined by the stationary nucleotide frequencies, transition bias and the distribution of selection biases (dN/dS) in the respective reading frames. Our simulation model is implemented in the Python scripting language. All source code is released under the GNU General Public License version 3 and is available at https://github.com/PoonLab/HexSE.
ABSTRACT
The composition of the latent human immunodeficiency virus 1 (HIV-1) reservoir is shaped by when proviruses integrated into host genomes. These integration dates can be estimated by phylogenetic methods like root-to-tip (RTT) regression. However, RTT does not accommodate variation in the number of mutations over time, uncertainty in estimating the molecular clock, or the position of the root in the tree. To address these limitations, we implemented a Bayesian extension of RTT as an R package (bayroot), which enables the user to incorporate prior information about the time of infection and start of antiretroviral therapy. Taking an unrooted maximum likelihood tree as input, we use a Metropolis-Hastings algorithm to sample from the joint posterior distribution of three parameters (the rate of sequence evolution, i.e., molecular clock; the location of the root; and the time associated with the root). Next, we apply rejection sampling to this posterior sample of model parameters to simulate integration dates for HIV proviral sequences. To validate this method, we use the R package treeswithintrees (twt) to simulate time-scaled trees relating samples of actively and latently infected T cells from a single host. We find that bayroot yields significantly more accurate estimates of integration dates than conventional RTT under a range of model settings.
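For context, the conventional root-to-tip (RTT) regression that bayroot extends can be sketched in a few lines. This is the simple least-squares baseline, not the Bayesian method: divergence from the root is regressed on sampling time for dated sequences, and the fitted line is inverted to date a proviral sequence. The function names and the toy numbers are assumptions for illustration.

```python
def fit_root_to_tip(times, divergences):
    """Ordinary least-squares fit of root-to-tip divergence against sampling time.

    Returns (rate, intercept), where rate is the molecular clock estimate.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(divergences) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, divergences))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    return rate, intercept

def integration_date(divergence, rate, intercept):
    """Invert the fitted line to estimate when a provirus stopped evolving."""
    return (divergence - intercept) / rate


# Hypothetical plasma samples: years since infection and divergence from the root
times = [0.0, 1.0, 2.0, 3.0]
divergences = [0.01, 0.02, 0.03, 0.04]
rate, intercept = fit_root_to_tip(times, divergences)
print(rate, intercept)
print(integration_date(0.025, rate, intercept))
```

This point estimate ignores uncertainty in the clock rate and root position, which is the limitation the Bayesian extension is designed to address.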
ABSTRACT
Clusters of genetically similar infections suggest rapid transmission and may indicate priorities for public health action or reveal underlying epidemiological processes. However, clusters often require user-defined thresholds and are sensitive to non-epidemiological factors, such as non-random sampling. Consequently, the ideal threshold for public health applications varies substantially across settings. Here, we show a method which selects optimal thresholds for phylogenetic (subset tree) clustering based on population. We evaluated this method on HIV-1 pol datasets (n = 14,221 sequences) from four sites in the USA (Tennessee, Washington), Canada (Northern Alberta) and China (Beijing). Clusters were defined by tips descending from an ancestral node (with a minimum bootstrap support of 95%) through a series of branches, each with a length below a given threshold. Next, we used pplacer to graft new cases to the fixed tree by maximum likelihood. We evaluated the effect of varying branch-length thresholds on cluster growth as a count outcome by fitting two Poisson regression models: a null model that predicts growth from cluster size, and an alternative model that includes mean collection date as an additional covariate. The alternative model was favoured by AIC across most thresholds, with optimal (greatest difference in AIC) thresholds ranging from 0.007 to 0.013 across sites. The range of optimal thresholds was more variable when re-sampling 80% of the data by location (IQR 0.008-0.016, n = 100 replicates). Our results use prospective phylogenetic cluster growth and suggest that there is more variation in effective thresholds for public health than those typically used in clustering studies.
Subjects
HIV Infections , HIV-1 , Humans , HIV-1/genetics , Phylogeny , Prospective Studies , Public Health , HIV Infections/epidemiology , Cluster Analysis
ABSTRACT
Combining clinical and genetic data can improve the effectiveness of virus tracking with the aim of reducing the number of HIV cases by 2030.
Subjects
HIV Infections , Viruses , HIV Infections/epidemiology , Humans , Molecular Epidemiology , Phylogeny
ABSTRACT
Tracking the emergence and spread of SARS-CoV-2 lineages using phylogenetics has proven critical to inform the timing and stringency of COVID-19 public health interventions. We investigated the effectiveness of international travel restrictions at reducing SARS-CoV-2 importations and transmission in Canada in the first two waves of 2020 and early 2021. Maximum likelihood phylogenetic trees were used to infer viruses' geographic origins, enabling identification of 2263 (95% confidence interval: 2159-2366) introductions, including 680 (658-703) Canadian sublineages, which are international introductions resulting in sampled Canadian descendants, and 1582 (1501-1663) singletons, introductions with no sampled descendants. Of the sublineages seeded during the first wave, 49% (46-52%) originated from the USA and were primarily introduced into Quebec (39%) and Ontario (36%), while in the second wave, the USA was still the predominant source (43%), alongside a larger contribution from India (16%) and the UK (7%). Following implementation of restrictions on the entry of foreign nationals on 21 March 2020, importations declined from 58.5 (50.4-66.5) sublineages per week to 10.3-fold (8.3-15.0) lower within 4 weeks. Despite the drastic reduction in viral importations following travel restrictions, newly seeded sublineages in summer and fall 2020 contributed to the persistence of COVID-19 cases in the second wave, highlighting the importance of sustained interventions to reduce transmission. Importations rebounded further in November, bringing newly emergent variants of concern (VOCs). 
By the end of February 2021, there had been an estimated 30 (19-41) B.1.1.7 sublineages imported into Canada, which increasingly displaced previously circulating sublineages by the end of the second wave. Although viral importations are nearly inevitable when global prevalence is high, with fewer importations there are fewer opportunities for novel variants to spark outbreaks or outcompete previously circulating lineages.
Subjects
COVID-19 , SARS-CoV-2 , COVID-19/epidemiology , Genomics/methods , Humans , Ontario , Phylogeny , SARS-CoV-2/genetics
ABSTRACT
The prevailing abundance of full-length HIV type 1 (HIV-1) genome sequences provides an opportunity to revisit the standard model of HIV-1 group M (HIV-1/M) diversity that clusters genomes into largely nonrecombinant subtypes, which is not consistent with recent evidence of deep recombinant histories for simian immunodeficiency virus (SIV) and other HIV-1 groups. Here we develop an unsupervised nonparametric clustering approach, which does not rely on predefined nonrecombinant genomes, by adapting a community detection method developed for dynamic social network analysis. We show that this method (dynamic stochastic block model [DSBM]) attains a significantly lower mean error rate in detecting recombinant breakpoints in simulated data (quasibinomial generalized linear model (GLM), P<8×10−8), compared to other reference-free recombination detection programs (genetic algorithm for recombination detection [GARD], recombination detection program 4 [RDP4], and RDP5). When this method was applied to a representative sample of n = 525 actual HIV-1 genomes, we determined k = 29 as the optimal number of DSBM clusters and used change-point detection to estimate that at least 95% of these genomes are recombinant. Further, we identified both known and undocumented recombination hotspots in the HIV-1 genome and evidence of intersubtype recombination in HIV-1 subtype reference genomes. We propose that clusters generated by DSBM can provide an informative framework for HIV-1 classification.
Subjects
HIV-1 , HIV-1/genetics , Genetic Recombination
ABSTRACT
Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may increase the information content of compact genomes or influence the creation of new genes. Here we report a global comparative study of overlapping open reading frames (OvRFs) of 12,609 virus reference genomes in the NCBI database. We retrieved metadata associated with all annotated open reading frames (ORFs) in each genome record to calculate the number, length, and frameshift of OvRFs. Our results show that while the number of OvRFs increases with genome length, they tend to be shorter in longer genomes. The majority of overlaps involve +2 frameshifts, predominantly found in dsDNA viruses. Antisense overlaps in which one of the ORFs was encoded in the same frame on the opposite strand (-0) tend to be longer. Next, we develop a new graph-based representation of the distribution of overlaps among the ORFs of genomes in a given virus family. In the absence of an unambiguous partition of ORFs by homology at this taxonomic level, we used an alignment-free k-mer based approach to cluster protein coding sequences by similarity. We connect these clusters with two types of directed edges to indicate (1) that constituent ORFs are adjacent in one or more genomes, and (2) that these ORFs overlap. These adjacency graphs not only provide a natural visualization scheme, but also a novel statistical framework for analyzing the effects of gene- and genome-level attributes on the frequencies of overlaps.