ABSTRACT
The chronology and phylogeny of bacterial evolution are difficult to reconstruct due to a scarce fossil record. The analysis of bacterial genomes remains challenging because of large sequence divergence, the plasticity of bacterial genomes due to frequent gene loss, horizontal gene transfer, and differences in selective pressure from one locus to another. Therefore, taking advantage of the rich and rapidly accumulating genomic data requires accurate modeling of genome evolution. An important technical consideration is that loci with high effective mutation rates may diverge beyond the detection limit of the alignment algorithms used, biasing the genome-wide divergence estimates toward smaller divergences. In this article, we propose a novel method to gain insight into bacterial evolution based on statistical properties of genome comparisons. We find that the length distribution of sequence matches is shaped by the effective mutation rates of different loci, by horizontal transfers, and by the aligner sensitivity. Based on these inputs, we build a model and show that it accounts for the empirically observed distributions, taking the Enterobacteriaceae family as an example. Our method makes it possible to distinguish segments of vertical and horizontal origin and to estimate the divergence time and exchange rate between any pair of taxa from genome-wide alignments. Based on the estimated divergence times, we construct a time-calibrated phylogenetic tree to demonstrate the accuracy of the method.
Subject(s)
Bacterial Genome , Genetic Models , Phylogeny , Bacterial Genome/genetics , Genomics/methods , Bacteria/genetics , Molecular Evolution
ABSTRACT
BACKGROUND: Segmental duplications (SDs) are long DNA sequences that are repeated in a genome and have high sequence identity. In contrast to repetitive elements, they are often unique and only sometimes have multiple copies in a genome. There are several well-studied mechanisms responsible for segmental duplications: non-allelic homologous recombination, non-homologous end joining and replication slippage. Such duplications play an important role in evolution; however, we do not have a full understanding of the dynamic properties of the duplication process. RESULTS: We study segmental duplications through a graph representation where nodes represent genomic regions and edges represent duplications between them. The resulting network (the SD network) is quite complex and has distinct features that allow us to make inferences about the evolution of segmental duplications. We propose a network growth model that explains the features of the SD network, thus giving us insights into the dynamics of segmental duplications in the human genome. Based on our analysis of genomes of other species, the network growth model seems to be applicable to multiple mammalian genomes. CONCLUSIONS: Our analysis suggests that duplication rates of genomic loci grow linearly with the number of copies of a duplicated region. Several scenarios that could explain such preferential duplication rates are suggested.
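The stated conclusion, that a region's duplication rate grows linearly with its copy number, can be illustrated with a toy simulation. This is a minimal sketch and not the authors' network growth model: it ignores the graph structure entirely, and the function name, the starting state of one copy per region, and the fixed number of duplication events are assumptions made only for illustration.

```python
import random


def simulate_preferential_duplication(n_regions, n_events, seed=0):
    """Toy model: each event duplicates one region, chosen with
    probability proportional to its current copy number.

    All parameters are hypothetical; nothing here is taken from the paper.
    """
    if n_regions <= 0 or n_events < 0:
        raise ValueError("n_regions must be positive and n_events non-negative")
    rng = random.Random(seed)      # seeded for reproducibility
    copies = [1] * n_regions       # assumption: every region starts with one copy
    for _ in range(n_events):
        # Rate linear in copy number: the weights are the copy numbers themselves.
        region = rng.choices(range(n_regions), weights=copies, k=1)[0]
        copies[region] += 1
    return copies


if __name__ == "__main__":
    counts = simulate_preferential_duplication(n_regions=100, n_events=1000)
    print(sorted(counts, reverse=True)[:10])  # a few regions tend to dominate
```

Because already-duplicated regions are more likely to be picked again, the resulting copy-number distribution is typically strongly skewed, which is the qualitative behaviour that a rate proportional to copy number would be expected to produce.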
Subject(s)
Human Genome , Genomic Segmental Duplications , Animals , Molecular Evolution , Gene Duplication , Genomics , Humans
ABSTRACT
It has long been suspected that the rate of mutation varies across the human genome at a large scale based on the divergence between humans and other species. However, it is now possible to directly investigate this question using the large number of de novo mutations (DNMs) that have been discovered in humans through the sequencing of trios. We investigate a number of questions pertaining to the distribution of mutations using more than 130,000 DNMs from three large datasets. We demonstrate that the amount and pattern of variation differ between datasets at the 1MB and 100KB scales, probably as a consequence of differences in sequencing technology and processing. In particular, datasets show different patterns of correlation to genomic variables such as replication time. Nevertheless, there are many commonalities between datasets, which likely represent true patterns. We show that there is variation in the mutation rate at the 100KB, 1MB and 10MB scales that cannot be explained by variation at smaller scales; however, the level of this variation is modest at large scales: at the 1MB scale we infer that ~90% of regions have a mutation rate within 50% of the mean. Different types of mutation show similar levels of variation and appear to vary in concert, which suggests the pattern of mutation is relatively constant across the genome. We demonstrate that variation in the mutation rate does not generate large-scale variation in GC-content, and hence that mutation bias does not maintain the isochore structure of the human genome. We find that genomic features explain less than 40% of the explainable variance in the rate of DNM. As expected, the rate of divergence between species is correlated to the rate of DNM. However, the correlations are weaker than expected if all the variation in divergence was due to variation in the mutation rate. We provide evidence that this is due to the effect of biased gene conversion on the probability that a mutation will become fixed. In contrast to divergence, we find that most of the variation in diversity can be explained by variation in the mutation rate. Finally, we show that the correlation between divergence and DNM density declines as increasingly divergent species are considered.
Subject(s)
Genetic Variation , Animals , Base Composition , Datasets as Topic , Gene Conversion , Human Genome , Germ-Line Mutation , Humans
ABSTRACT
Events in primate evolution are often dated by assuming a constant rate of substitution per unit time, but the validity of this assumption remains unclear. Among mammals, it is well known that there exists substantial variation in yearly substitution rates. Such variation is to be expected from differences in life history traits, suggesting it should also be found among primates. Motivated by these considerations, we analyze whole genomes from 10 primate species, including Old World Monkeys (OWMs), New World Monkeys (NWMs), and apes, focusing on putatively neutral autosomal sites and controlling for possible effects of biased gene conversion and methylation at CpG sites. We find that substitution rates are up to 64% higher in lineages leading from the hominoid-NWM ancestor to NWMs than to apes. Within apes, rates are ~2% higher in chimpanzees and ~7% higher in the gorilla than in humans. Substitution types subject to biased gene conversion show no more variation among species than those not subject to it. Not all mutation types behave similarly, however; in particular, transitions at CpG sites exhibit a more clocklike behavior than do other types, presumably because of their nonreplicative origin. Thus, not only the total rate, but also the mutational spectrum, varies among primates. This finding suggests that events in primate evolution are most reliably dated using CpG transitions. Taking this approach, we estimate the human and chimpanzee divergence time is 12.1 million years, and the human and gorilla divergence time is 15.1 million years.
Subject(s)
Molecular Evolution , Genetic Variation , Genome/genetics , Primates/genetics , Amino Acid Substitution/genetics , Animals , Biological Evolution , DNA Methylation/genetics , Gene Conversion/genetics , Gorilla gorilla/genetics , Humans , Pan troglodytes/genetics
ABSTRACT
Much evidence indicates that GC-biased gene conversion (gBGC) has a major impact on the evolution of mammalian genomes. However, a detailed quantification of the process is still lacking. The strength of gBGC can be measured from the analysis of derived allele frequency spectra (DAF), but this approach is sensitive to a number of confounding factors. In particular, we show by simulations that the inference is pervasively affected by polymorphism polarization errors and by spatial heterogeneity in gBGC strength. We propose a new general method to quantify gBGC from DAF spectra, incorporating polarization errors, taking spatial heterogeneity into account, and jointly estimating mutation bias. Applying it to human polymorphism data from the 1000 Genomes Project, we show that the strength of gBGC does not differ between hypermutable CpG sites and non-CpG sites, suggesting that in humans gBGC is not caused by the base-excision repair machinery. Genome-wide, the intensity of gBGC is in the nearly neutral area. However, given that recombination occurs primarily within recombination hotspots, 1%-2% of the human genome is subject to strong gBGC. On average, gBGC is stronger in African than in non-African populations, reflecting differences in effective population sizes. However, due to more heterogeneous recombination landscapes, the fraction of the genome affected by strong gBGC is larger in non-African than in African populations. Given that the location of recombination hotspots evolves very rapidly, our analysis predicts that, in the long term, a large fraction of the genome is affected by short episodes of strong gBGC.
Subject(s)
Base Composition , Gene Conversion , Human Genome , Racial Groups/genetics , CpG Islands , Gene Frequency , Humans , Genetic Models , Genetic Polymorphism
ABSTRACT
We envision the molecular evolution process as an information transfer process and provide a quantitative measure for information preservation in terms of the channel capacity according to the channel coding theorem of Shannon. We calculate information capacities of DNA at the nucleotide (for non-coding DNA) and the amino acid (for coding DNA) level using various substitution models. We extend our results on coding DNA to a discussion about the optimality of the natural codon-amino acid code. We provide the results of an adaptive search algorithm in the code domain and demonstrate the existence of a large number of genetic codes with higher information capacity. Our results support the hypothesis of an ancient extension from a 2-nucleotide codon to the current 3-nucleotide codon code to encode the various amino acids.
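To make the notion of channel capacity concrete, the sketch below computes it for the simplest case one might consider: a symmetric nucleotide channel in which a base is replaced with total probability p_sub, spread evenly over the three alternatives (a Jukes-Cantor-like assumption). The abstract does not say which substitution models were used, so this choice and the function name are assumptions for illustration only. For a symmetric channel the capacity is log2(k) minus the entropy of one row of the transition matrix.

```python
import math


def symmetric_channel_capacity(p_sub, alphabet_size=4):
    """Capacity in bits per symbol of a symmetric substitution channel.

    Assumption (not from the source): each symbol is kept with probability
    1 - p_sub and replaced by each of the other symbols with equal
    probability p_sub / (alphabet_size - 1).
    """
    if not 0.0 <= p_sub <= 1.0:
        raise ValueError("p_sub must lie in [0, 1]")
    k = alphabet_size
    row = [1.0 - p_sub] + [p_sub / (k - 1)] * (k - 1)
    # Entropy of one row of the transition matrix (0 * log 0 is taken as 0).
    row_entropy = -sum(q * math.log2(q) for q in row if q > 0.0)
    # For a symmetric channel the uniform input distribution achieves capacity.
    return math.log2(k) - row_entropy


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.1, 0.75):
        print(p, symmetric_channel_capacity(p))
```

With no substitutions the channel carries the full 2 bits per nucleotide, and at p_sub = 0.75 the output is independent of the input, so the capacity drops to zero.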
Subject(s)
Algorithms , Codon/genetics , Genetic Code/genetics , Genetic Models , Amino Acids/genetics , Base Sequence , Molecular Evolution
ABSTRACT
BACKGROUND: The sequencing of immunoglobulin (Ig) transcripts from single B cells yields essential information about Ig heavy:light chain pairing, which is lost in conventional bulk sequencing experiments. The previously limited throughput of single-cell approaches has recently been overcome by the introduction of multiple next-generation sequencing (NGS)-based platforms. Furthermore, single-cell techniques allow the assignment of additional data types (e.g. cell surface marker expression), which are crucial for biological interpretation. However, the currently available computational tools are not designed to handle single-cell data and do not provide integral solutions for linking of sequence data to other biological data. RESULTS: Here we introduce sciReptor, a flexible toolkit for the processing and analysis of antigen receptor repertoire sequencing data at single-cell level. The software combines bioinformatics tools for immunoglobulin sequence annotation with a relational database, where raw data and analysis results are stored and linked. sciReptor supports attribution of additional data categories such as cell surface marker expression or immunological metadata. Furthermore, it comprises a quality control module as well as basic repertoire visualization tools. CONCLUSION: sciReptor is a flexible framework for standardized sequence analysis of antigen receptor repertoires on single-cell level. The relational database allows easy data sharing and downstream analyses as well as immediate comparisons between different data sets.
Subject(s)
Computational Biology/methods , Immunoglobulin Genes , High-Throughput Nucleotide Sequencing/methods , Immunoglobulins/genetics , Single-Cell Analysis/methods , Software , Humans , Molecular Sequence Annotation , Immunologic Receptors/genetics
ABSTRACT
Genome evolution is shaped by a multitude of mutational processes, including point mutations, insertions, and deletions of DNA sequences, as well as segmental duplications. These mutational processes can leave distinctive qualitative marks in the statistical features of genomic DNA sequences. One such feature is the match length distribution (MLD) of exactly matching sequence segments within an individual genome or between the genomes of related species. These have been observed to exhibit characteristic power law decays in many species. Here, we show that simple dynamical models consisting solely of duplication and mutation processes can already explain the characteristic features of MLDs observed in genomic sequences. Surprisingly, we find that these features are largely insensitive to details of the underlying mutational processes and do not necessarily rely on the action of natural selection. Our results demonstrate how analyzing statistical features of DNA sequences can help us reveal and quantify the different mutational processes that underlie genome evolution.
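A match length distribution can be tabulated in a few lines once two sequences are aligned. The sketch below is a deliberately simplified illustration: it assumes the two sequences have already been aligned to equal length (with "-" for gaps) and counts runs of identical positions, whereas the study concerns exact matches within and between whole genomes. The function names are hypothetical.

```python
from collections import Counter


def match_lengths(aligned_a, aligned_b):
    """Lengths of maximal runs of identical, non-gap positions
    in two pre-aligned sequences of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    lengths = []
    run = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == b and a != "-":
            run += 1                # extend the current exact match
        else:
            if run:
                lengths.append(run)  # a mismatch or gap ends the run
            run = 0
    if run:
        lengths.append(run)          # close a run that reaches the end
    return lengths


def match_length_distribution(aligned_a, aligned_b):
    """Histogram mapping match length -> number of matches of that length."""
    return Counter(match_lengths(aligned_a, aligned_b))


if __name__ == "__main__":
    print(match_length_distribution("ACGTACGT", "ACCTACGA"))
```

Plotting such a histogram on log-log axes is the usual way to check whether the tail looks like a power law rather than an exponential decay.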
Subject(s)
Genome/genetics , Genomics/methods , Animals , Biological Evolution , Molecular Evolution , Gene Duplication/genetics , Humans , Genomic Segmental Duplications/genetics , Genetic Selection
ABSTRACT
Single-cell PCR and sequencing of full-length Ig heavy (Igh) and Igk and Igl light chain genes is a powerful tool to measure the diversity of antibody repertoires and allows the functional assessment of B-cell responses through direct Ig gene cloning and the generation of recombinant mAbs. However, the current methodology is not high-throughput compatible. Here we developed a two-dimensional bar-coded primer matrix to combine Igh and Igk/Igl chain gene single-cell PCR with next-generation sequencing for the parallel analysis of the antibody repertoire of over 46 000 individual B cells. Our approach provides full-length Igh and corresponding Igk/Igl chain gene-sequence information and permits the accurate correction of sequencing errors by consensus building. The use of indexed cell sorting for the isolation of single B cells enables the integration of flow cytometry and Ig gene sequence information. The strategy is fully compatible with established protocols for direct antibody gene cloning and expression and therefore advances over previously described high-throughput approaches to assess antibody repertoires at the single-cell level.
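Error correction by consensus building can be pictured as a per-position majority vote over the reads that share a cell barcode. The sketch below assumes the reads are already grouped by barcode and aligned to the same length; the actual consensus procedure is not described in the abstract, so the function and its behaviour are an illustrative assumption rather than the authors' pipeline.

```python
from collections import Counter


def consensus_sequence(reads):
    """Majority-vote consensus of equal-length reads from one barcode group.

    Ties are resolved in favour of the base seen first in that column.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    consensus = []
    for column in zip(*reads):       # iterate over alignment columns
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    reads = ["ACGTTGCA", "ACGTTGCA", "ACGATGCA", "ACGTTGGA"]
    print(consensus_sequence(reads))
```

An isolated sequencing error appears in only a minority of reads at a given position, so it is outvoted as long as enough reads cover each cell.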
Subject(s)
Immunoglobulin Heavy Chains/genetics , Immunoglobulin Light Chains/genetics , DNA Sequence Analysis/methods , Single-Cell Analysis/methods , Animals , Molecular Cloning/methods , DNA Primers/genetics , Female , Flow Cytometry/methods , Immunoglobulin Genes/genetics , Mice , Inbred C57BL Mice , Polymerase Chain Reaction/methods
ABSTRACT
BACKGROUND: Segmental duplications (SDs) are not evenly distributed along chromosomes. The reasons for this biased susceptibility to SD insertion are poorly understood. Accumulation of SDs is associated with increased genomic instability, which can lead to structural variants and genomic disorders such as the Williams-Beuren syndrome. Despite these adverse effects, SDs have become fixed in the human genome. Focusing on chromosome 7, which is particularly rich in interstitial SDs, we have investigated the distribution of SDs in the context of evolution and the three dimensional organisation of the chromosome in order to gain insights into the mutual relationship of SDs and chromatin topology. RESULTS: Intrachromosomal SDs preferentially accumulate in those segments of chromosome 7 that are homologous to marmoset chromosome 2. Although this formerly compact segment has been re-distributed to three different sites during primate evolution, we can show by means of public data on long distance chromatin interactions that these three intervals, and consequently the paralogous SDs mapping to them, have retained their spatial proximity in the nucleus. Focusing on SD clusters implicated in the aetiology of the Williams-Beuren syndrome locus we demonstrate by cross-species comparison that these SDs have inserted at the borders of a topological domain and that they flank regions with distinct DNA conformation. CONCLUSIONS: Our study suggests a link of nuclear architecture and the propagation of SDs across chromosome 7, either by promoting regional SD insertion or by contributing to the establishment of higher order chromatin organisation themselves. The latter could compensate for the high risk of structural rearrangements and thus may have contributed to their evolutionary fixation in the human genome.
Subject(s)
Chromatin/genetics , Human Chromosome Pair 7 , Genomic Segmental Duplications , Acetylation , Chromatin/metabolism , Human Chromosome Pair 2 , Genetic Epistasis , Molecular Evolution , Genetic Loci , Genomics , Histones/metabolism , Humans , Genetic Transcription , Williams Syndrome/genetics
ABSTRACT
Meiotic recombination is known to influence GC-content evolution in large regions of mammalian genomes by favoring the fixation of G and C alleles and increasing the rate of A/T to G/C substitutions. This process is known as GC-biased gene conversion (gBGC). Until recently, genome-wide measures of fine-scale recombination activity were unavailable in mice. Additionally, comparative studies focusing on mouse were limited as the closest organism with its genome fully sequenced was rat. Here, we make use of the recent mapping of double strand breaks (DSBs), the first step of meiotic recombination, in the mouse genome and of the sequencing of closely related mouse subspecies to analyze the fine-scale evolutionary signature of meiotic recombination on GC-content evolution in recombination hotspots, short regions that undergo extreme rates of recombination. We measure substitution rates around DSB hotspots and observe that gBGC affects only a very short region (≈ 1 kbp in length) around these hotspots. Furthermore, we can infer that the locations of hotspots evolved rapidly during mouse evolution.
Subject(s)
Base Composition , Gene Conversion , Meiosis/genetics , Genetic Recombination , Alleles , Amino Acid Substitution , Animals , Base Sequence , Double-Stranded DNA Breaks , Molecular Evolution , Genome , Mice , Genetic Models , Mutation Rate , Phylogeny , Rats
ABSTRACT
The genomes of many vertebrates show a characteristic heterogeneous distribution of GC content, the so-called GC isochore structure. The origin of isochores has been explained via the mechanism of GC-biased gene conversion (gBGC). However, although the isochore structure is declining in many mammalian genomes, the heterogeneity in GC content is being reinforced in the avian genome. Despite this discrepancy, which remains unexplained, examinations of individual substitution frequencies in mammals and birds are both consistent with the gBGC model of isochore evolution. On the other hand, a negative correlation between substitution and recombination rate found in the chicken genome is inconsistent with the gBGC model. It should therefore be important to consider along with gBGC other consequences of recombination on the origin and fate of mutations, as well as to account for relationships between recombination rate and other genomic features. We therefore developed an analytical model to describe the substitution patterns found in the chicken genome, and further investigated the relationships between substitution patterns and several genomic features in a rigorous statistical framework. Our analysis indicates that GC content itself, either directly or indirectly via interrelations to other genomic features, has an impact on the substitution pattern. Further, we suggest that this phenomenon is particularly visible in avian genomes due to their unusually low rate of chromosomal evolution. Because of this, interrelations between GC content and other genomic features are being reinforced, and are as such more pronounced in avian genomes as compared with other vertebrate genomes with a less stable karyotype.
Subject(s)
Chromosomes/genetics , Molecular Evolution , Gene Conversion , Karyotype , Animals , Base Composition , Chickens , Genome , Isochores/genetics , Mammals/genetics , Genetic Recombination , Vertebrates/genetics
ABSTRACT
Nonsense Mediated Decay (NMD) degrades transcripts that contain a premature STOP codon resulting from mistranscription or missplicing. However, NMD's surveillance of gene expression varies in efficiency both among and within human genes. Previous work has shown that the intron content of human genes is influenced by missplicing events invisible to NMD. Given the high rate of transcriptional errors in eukaryotes, we hypothesized that natural selection has promoted a dual strategy of "prevention and cure" to alleviate the problem of nonsense transcriptional errors. A prediction of this hypothesis is that NMD's inefficiency should leave a signature of "transcriptional robustness" in human gene sequences that reduces the frequency of nonsense transcriptional errors. For human genes we determined the usage of "fragile" codons, prone to mistranscription into STOP codons, relative to the usage of "robust" codons that do not generate nonsense errors. We observe that single-exon genes have evolved to become robust to mistranscription, because they show a significant tendency to avoid fragile codons relative to robust codons when compared to multi-exon genes. A similar depletion is evident in last exons of multi-exon genes. Histone genes are particularly depleted of fragile codons and thus highly robust to transcriptional errors. Finally, the protein products of single-exon genes show a strong tendency to avoid those amino acids that can only be encoded using fragile codons. Each of these observations can be attributed to NMD deficiency. Thus, in the human genome, wherever the "cure" for nonsense (i.e. NMD) is inefficient, there is increased reliance on the strategy of nonsense "prevention" (i.e. transcriptional robustness). This study shows that human genes are exposed to the deleterious influence of transcriptional errors. Moreover, it suggests that gene expression errors are an underestimated phenomenon in molecular evolution in general and in selection for genomic robustness in particular.
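The distinction between fragile and robust codons can be sketched in code. The version below assumes that "fragile" means a sense codon that a single-nucleotide substitution can turn into one of the three standard STOP codons; the study's exact definition may differ, and the names and the per-sequence summary statistic are hypothetical. Input is assumed to be an in-frame DNA coding sequence over A, C, G and T.

```python
from itertools import product

BASES = "ACGT"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, DNA alphabet


def single_substitution_neighbors(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                yield codon[:pos] + base + codon[pos + 1:]


# Assumed definition: sense codons one substitution away from a STOP codon.
FRAGILE_CODONS = {
    codon
    for codon in ("".join(triplet) for triplet in product(BASES, repeat=3))
    if codon not in STOP_CODONS
    and any(n in STOP_CODONS for n in single_substitution_neighbors(codon))
}


def fragile_fraction(cds):
    """Fraction of sense codons in an in-frame coding sequence that are fragile."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    sense = [c for c in codons if c not in STOP_CODONS]
    if not sense:
        raise ValueError("sequence contains no sense codons")
    return sum(c in FRAGILE_CODONS for c in sense) / len(sense)


if __name__ == "__main__":
    print(sorted(FRAGILE_CODONS))
    print(fragile_fraction("ATGAAACTGTGGTAA"))
```

Comparing this fraction between gene classes, for example single-exon versus multi-exon genes, is the kind of contrast the abstract describes, although a real analysis would also need to control for amino acid composition and codon usage.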
Subject(s)
Nonsense Codon/genetics , Nonsense Codon/metabolism , Histones/genetics , Introns/genetics , Nonsense-Mediated mRNA Decay/genetics , Amino Acids/genetics , Amino Acids/metabolism , Animals , Codon/genetics , Drosophila/genetics , Molecular Evolution , Exons/genetics , Gene Expression , Genes , Genome , Human Genome , Histones/metabolism , Humans , Mice , RNA Stability/genetics , Genetic Transcription
ABSTRACT
During cellular processes such as differentiation or response to external stimuli, cells exhibit dynamic changes in their gene expression profiles. Single-cell RNA sequencing (scRNA-seq) can be used to investigate these dynamic changes. To this end, cells are typically ordered along a pseudotemporal trajectory which recapitulates the progression of cells as they transition from one cell state to another. We infer transcriptional dynamics by modeling the gene expression profiles in pseudotemporally ordered cells using a Bayesian inference approach. This enables ordering genes along transcriptional cascades, estimating differences in the timing of gene expression dynamics, and deducing regulatory gene interactions. Here, we apply this approach to scRNA-seq datasets derived from mouse embryonic forebrain and pancreas samples. This analysis demonstrates the utility of the method to derive the ordering of gene dynamics and regulatory relationships critical for proper cellular differentiation and maturation across a variety of developmental contexts.
ABSTRACT
Recently, an enrichment of identical matching sequences has been found in many eukaryotic genomes. Their length distribution exhibits a power law tail raising the question of what evolutionary mechanism or functional constraints would be able to shape this distribution. Here we introduce a simple and evolutionarily neutral model, which involves only point mutations and segmental duplications, and produces the same statistical features as observed for genomic data. Further, we extend a mathematical model for random stick breaking to analytically show that the exponent of the power law tail is -3 and universal as it does not depend on the microscopic details of the model.
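The exponent of -3 can be motivated by a short heuristic calculation. This is a sketch under stated assumptions and not the authors' derivation: duplicated segments all have the same length K, they arise at a constant rate (so segment ages t are uniformly distributed), mutations hit a duplicated pair at a combined rate \mu per site, and sequence positions are treated as continuous. Mutations then act as Poisson-distributed breakpoints with density \mu t, and m(r, t) denotes the expected number of exact matches of length r in a segment of age t.

```latex
% Heuristic sketch. The symbols K, \mu, t, m(r,t) and N(r) are introduced
% here for illustration and are not taken from the abstract.
\begin{align}
  m(r, t) &\approx (\mu t)^{2}\, K\, e^{-\mu t r},
  \qquad 1 \ll r \ll K, \\
  N(r) &\propto \int_{0}^{\infty} m(r, t)\,\mathrm{d}t
        = K \mu^{2} \int_{0}^{\infty} t^{2} e^{-\mu r t}\,\mathrm{d}t
        = \frac{2K}{\mu\, r^{3}} .
\end{align}
```

The first line counts pairs of neighbouring breakpoints a distance r apart with no breakpoint in between. Summing over segments of all ages gives N(r) proportional to r^{-3}, and the parameters K and \mu enter only through the prefactor, which is consistent with the exponent being insensitive to microscopic details.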
Subject(s)
DNA/genetics , Molecular Evolution , Genetic Models , Eukaryota , Gene Duplication , Genome , Human Genome , Humans , Point Mutation
ABSTRACT
The formation of transcription-factor-binding sites is an important evolutionary process. Here, we show that methylation and deamination of CpG dinucleotides generate in vivo p53-binding sites in numerous Alu elements and in non-repetitive DNA in a species-specific manner. In light of this, we propose that the deamination of methylated CpGs constitutes a universal mechanism for de novo generation of various transcription-factor-binding sites in Alus.
Subject(s)
CpG Islands/physiology , DNA Methylation , Genome , Tumor Suppressor Protein p53/metabolism , Animals , Base Sequence , Binding Sites , Deamination , Humans , Molecular Sequence Data
ABSTRACT
We have determined diversities exceeding 10^12 different sequences in an annealing and melting assay using synthetic randomized oligonucleotides as a standard. For such high diversities, the annealing kinetics differ from those observed for low diversities, favouring the remelting curve after annealing as the best indicator of complexity. Direct comparisons of nucleic acid pools obtained from an aptamer selection demonstrate that even highly complex populations can be evaluated by using DiStRO, without the need for complicated calculations.
Subject(s)
Gene Library , Oligodeoxyribonucleotides/standards , Calibration , DNA/standards , Kinetics , Nucleic Acid Denaturation , Oligodeoxyribonucleotides/chemical synthesis , Reference Standards , SELEX Aptamer Technique , Temperature
ABSTRACT
Cerebral organoids exhibit broad regional heterogeneity accompanied by limited cortical cellular diversity despite the tremendous upsurge in derivation methods, suggesting inadequate patterning of early neural stem cells (NSCs). Here we show that a short and early Dual SMAD and WNT inhibition course is necessary and sufficient to establish robust and lasting cortical organoid NSC identity, efficiently suppressing non-cortical NSC fates, while other widely used methods are inconsistent in their cortical NSC-specification capacity. Accordingly, this method selectively enriches for outer radial glia NSCs, which cyto-architecturally demarcate well-defined outer sub-ventricular-like regions propagating from superiorly radially organized, apical cortical rosette NSCs. Finally, this method culminates in the emergence of molecularly distinct deep and upper cortical layer neurons, and reliably uncovers cortex-specific microcephaly defects. Thus, a short SMAD and WNT inhibition is critical for establishing a rich cortical cell repertoire that enables mirroring of fundamental molecular and cyto-architectural features of cortical development and meaningful disease modelling.
Subject(s)
Neural Stem Cells , Organoids , Cell Differentiation , Cerebral Cortex , Ependymoglial Cells , Humans , Neurogenesis , Neurons
ABSTRACT
Unraveling the evolutionary forces responsible for variations of neutral substitution patterns among taxa or along genomes is a major issue for detecting selection within sequences. Mammalian genomes show large-scale regional variations of GC-content (the isochores), but the substitution processes at the origin of this structure are poorly understood. We analyzed the pattern of neutral substitutions in 1 Gb of primate non-coding regions. We show that the GC-content toward which sequences are evolving is strongly negatively correlated to the distance to telomeres and positively correlated to the rate of crossovers (R² = 47%). This demonstrates that recombination has a major impact on substitution patterns in human, driving the evolution of GC-content. The evolution of GC-content correlates much more strongly with male than with female crossover rate, which rules out selectionist models for the evolution of isochores. This effect of recombination is most probably a consequence of the neutral process of biased gene conversion (BGC) occurring within recombination hotspots. We show that the predictions of this model fit very well with the observed substitution patterns in the human genome. This model notably explains the positive correlation between substitution rate and recombination rate. Theoretical calculations indicate that variations in population size or density in recombination hotspots can have a very strong impact on the evolution of base composition. Furthermore, recombination hotspots can create strong substitution hotspots. This molecular drive affects both coding and non-coding regions. We therefore conclude that along with mutation, selection and drift, BGC is one of the major factors driving genome evolution. Our results also shed light on variations in the rate of crossover relative to non-crossover events, along chromosomes and according to sex, and also on the conservation of hotspot density between human and chimp.
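The "GC-content toward which sequences are evolving" is commonly summarised as an equilibrium GC-content. Under a simple two-state model, with one aggregate rate for AT to GC substitutions and one for GC to AT substitutions, the stationary GC fraction is the ratio of the first rate to the sum of both. The sketch below implements only this textbook two-state approximation; the substitution model actually fitted in the study is not given in the abstract, and the function and parameter names are assumptions.

```python
def equilibrium_gc(rate_at_to_gc, rate_gc_to_at):
    """Stationary GC fraction of a two-state (AT <-> GC) substitution model.

    Assumption: substitutions are summarised by two aggregate per-site rates,
    which is a simplification of any realistic substitution model.
    """
    if rate_at_to_gc < 0 or rate_gc_to_at < 0:
        raise ValueError("rates must be non-negative")
    total = rate_at_to_gc + rate_gc_to_at
    if total == 0:
        raise ValueError("at least one rate must be positive")
    return rate_at_to_gc / total


if __name__ == "__main__":
    # Equal rates give 50% GC; a bias toward GC -> AT changes lowers it.
    print(equilibrium_gc(1.0, 1.0))
    print(equilibrium_gc(1.0, 3.0))
```

A process that raises the effective AT to GC rate in highly recombining regions, as biased gene conversion is proposed to do, pushes this equilibrium value upward there.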
Subject(s)
Molecular Evolution , Human Genome , Mutation , Animals , Base Composition , Human Chromosomes , Genetic Crosses , Female , Gene Conversion , Humans , Isochores/genetics , Macaca/genetics , Male , Genetic Models , Pan troglodytes/genetics , Species Specificity , Telomere
ABSTRACT
Horizontal gene transfer (HGT) is an essential force in microbial evolution. Despite detailed studies on a variety of systems, a global picture of HGT in the microbial world is still missing. Here, we exploit the fact that HGT creates long identical DNA sequences in the genomes of distant species, which can be found efficiently using alignment-free methods. Our pairwise analysis of 93,481 bacterial genomes identified 138,273 HGT events. We developed a model to explain their statistical properties as well as to estimate the transfer rate between pairs of taxa. This reveals that long-distance HGT is frequent: our results indicate that HGT between species from different phyla has occurred in at least 8% of the species. Finally, our results confirm that the function of sequences strongly impacts their transfer rate, which varies by more than three orders of magnitude between different functional categories. Overall, we provide a comprehensive view of HGT, illuminating a fundamental process driving bacterial evolution.
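Finding long identical sequences without alignment can be sketched with a k-mer index: shared k-mers act as seeds, and each seed is extended to a maximal exact match. The abstract does not name the tool or algorithm used, so the code below is a generic illustration under simplifying assumptions (forward strand only, everything held in memory, hypothetical function name) and would not scale to tens of thousands of genomes as written.

```python
from collections import defaultdict


def maximal_exact_matches(seq_a, seq_b, k):
    """Return sorted (start_a, start_b, length) tuples for all maximal
    exact matches of length >= k between two sequences (forward strand only)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Index every k-mer of seq_a by its start positions.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)
    matches = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], ()):
            # Skip seeds that can be extended to the left: the same match
            # is reported once, from its leftmost seed.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            length = k
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1          # extend to the right while bases agree
            matches.append((i, j, length))
    return sorted(matches)


if __name__ == "__main__":
    print(maximal_exact_matches("TTTACGTACGGG", "CCACGTACGAA", k=5))
```

Between distantly related species, vertically inherited sequence has usually accumulated enough substitutions that long exact matches are rare, which is why unusually long identical stretches are taken as candidate transfer events.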