ABSTRACT
BACKGROUND: Genomes are inherently inhomogeneous, with features such as base composition, recombination, gene density, and gene expression varying along chromosomes. Evolutionary, biological, and biomedical analyses aim to quantify this variation, account for it during inference procedures, and ultimately determine the causal processes behind it. Since sequential observations along chromosomes are not independent, it is unsurprising that autocorrelation patterns have been observed, for example in human base composition. In this article, we develop a class of Hidden Markov Models (HMMs) called oHMMed (ordered HMM with emission densities; the corresponding R package of the same name is available on CRAN). They identify the number of comparably homogeneous regions within autocorrelated observed sequences. These regions are modelled as discrete hidden states; the observed data points are realisations of continuous probability distributions with state-specific means that enable ordering of these distributions. The observed sequence is labelled according to the hidden states, permitting only neighbouring states that are also neighbours within the ordering of their associated distributions. The parameters that characterise these state-specific distributions are inferred.
RESULTS: We apply our oHMMed algorithms to the proportion of G and C bases (modelled as a mixture of normal distributions) and the number of genes (modelled as a mixture of Poisson-gamma distributions) in windows along the human, mouse, and fruit fly genomes. This results in a partitioning of the genomes into regions by statistically distinguishable averages of these features, and in a characterisation of their continuous patterns of variation. In regard to the genomic G and C proportion, this latter result distinguishes oHMMed from segmentation algorithms based on isochore or compositional domain theory. We further use oHMMed to conduct a detailed analysis of variation of chromatin accessibility (ATAC-seq) and epigenetic markers H3K27ac and H3K27me3 (modelled as a mixture of Poisson-gamma distributions) along the human chromosome 1 and their correlations.
CONCLUSIONS: Our algorithms provide a biologically assumption-free approach to characterising genomic landscapes shaped by continuous, autocorrelated patterns of variation. Despite this, the resulting genome segmentation enables extraction of compositionally distinct regions for further downstream analyses.
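The ordering constraint described in this abstract (hidden states may only switch to a neighbour in the ordering of their emission means) can be illustrated with a small Python sketch. This is not the oHMMed package or its inference procedure: the function names, the shared standard deviation, the uniform initial distribution, and the use of Viterbi decoding with fixed, known parameters are all assumptions made for illustration only.

```python
import math

def tridiagonal_transitions(n_states, p_move):
    """Transition matrix where a state can only stay or move to an adjacent state.

    p_move is a hypothetical per-neighbour switching probability.
    """
    T = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n_states]
        for j in neighbours:
            T[i][j] = p_move
        T[i][i] = 1.0 - p_move * len(neighbours)
    return T

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(obs, means, sd, T):
    """Most likely state path for normal emissions with ordered means (assumed known)."""
    n = len(means)
    log_T = [[math.log(p) if p > 0 else -math.inf for p in row] for row in T]
    # Uniform initial distribution over states (an assumption of this sketch).
    score = [math.log(1.0 / n) + normal_logpdf(obs[0], means[k], sd) for k in range(n)]
    back = []
    for x in obs[1:]:
        prev = score
        pointers, score = [], []
        for k in range(n):
            best = max(range(n), key=lambda j: prev[j] + log_T[j][k])
            pointers.append(best)
            score.append(prev[best] + log_T[best][k] + normal_logpdf(x, means[k], sd))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy GC proportions in consecutive windows (made-up values, not real data).
gc_windows = [0.35, 0.36, 0.34, 0.45, 0.46, 0.55, 0.56, 0.54]
T = tridiagonal_transitions(3, p_move=0.1)
path = viterbi(gc_windows, means=[0.35, 0.45, 0.55], sd=0.02, T=T)
print(path)
```

Because the transition matrix is tridiagonal, any decoded path changes by at most one state between adjacent windows, which is the property the abstract describes. Estimating the number of states and the emission parameters, as oHMMed does, is outside the scope of this sketch.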
Subject(s)
Genome, Genomics, Animals, Humans, Mice, Markov Chains, Base Composition, Probability, Algorithms
ABSTRACT
Population genetic inference of selection on the nucleotide sequence level often proceeds by comparison to a reference sequence evolving only under mutation and population demography. Among the few candidates for such a reference sequence is the 5' part of short introns (5SI) in Drosophila. In addition to mutation and population demography, however, there is evidence for a weak force favouring GC bases, likely due to GC-biased gene conversion (gBGC), and for the effect of linked selection. Here, we use polymorphism and divergence data of Drosophila melanogaster to detect and describe the forces affecting the evolution of the 5SI. We separately analyse mutation classes, compare them between chromosomes, and relate them to recombination rate frequencies. GC-conservative mutations seem to be mainly influenced by mutation and drift, with linked selection mostly causing differences between the central and the peripheral (i.e., telomeric and centromeric) regions of the chromosome arms. Comparing GC-conservative mutation patterns between autosomes and the X chromosome showed differences in mutation rates, rather than linked selection, in the central chromosomal regions after accounting for differences in effective population sizes. On the other hand, GC-changing mutations show asymmetric site frequency spectra, indicating the presence of gBGC, varying among mutation classes and in intensity along chromosomes, but approximately equal in strength in autosomes and the X chromosome.
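As a rough illustration of what an asymmetric site frequency spectrum means for GC-changing sites, the Python sketch below tabulates a spectrum from per-site counts of GC alleles and summarises its skew. The function names and the skew statistic are hypothetical choices for this example and are not the estimators used in the study; the input counts are made up.

```python
def site_frequency_spectrum(gc_counts, sample_size):
    """Count polymorphic sites by the number of sampled chromosomes carrying G or C.

    Entry i-1 of the result holds the number of sites with i GC alleles,
    for i = 1 .. sample_size - 1. Monomorphic sites are ignored.
    """
    sfs = [0] * (sample_size - 1)
    for c in gc_counts:
        if 0 < c < sample_size:
            sfs[c - 1] += 1
    return sfs

def spectrum_asymmetry(sfs):
    """Illustrative skew in [-1, 1]: positive when GC alleles tend to be at high frequency."""
    n = len(sfs) + 1
    low = sum(c for i, c in enumerate(sfs, start=1) if 2 * i < n)
    high = sum(c for i, c in enumerate(sfs, start=1) if 2 * i > n)
    total = low + high
    return (high - low) / total if total else 0.0

# Made-up GC allele counts at eight sites in a sample of six chromosomes.
counts = [1, 2, 5, 5, 5, 4, 0, 6]
sfs = site_frequency_spectrum(counts, sample_size=6)
skew = spectrum_asymmetry(sfs)
print(sfs, round(skew, 3))
```

A value near zero corresponds to a roughly symmetric spectrum, whereas an excess of sites with GC alleles at high frequency gives a positive value, the direction expected under a force favouring GC.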
Subject(s)
Drosophila melanogaster, Gene Conversion, Animals, Drosophila melanogaster/genetics, Introns, Molecular Evolution, Mutation, Drosophila/genetics, X Chromosome/genetics, Genetic Selection
ABSTRACT
Among eukaryotes, the major spliceosomal pathway is highly conserved. While long introns may contain additional regulatory sequences, the ones in short introns seem to be nearly exclusively related to splicing. Although these regulatory sequences involved in splicing are well-characterized, little is known about their evolution. At the 3' end of introns, the splice signal nearly universally contains the dimer AG, which consists of purines, and the polypyrimidine tract upstream of this 3' splice signal is characterized by over-representation of pyrimidines. If the over-representation of pyrimidines in the polypyrimidine tract is also due to avoidance of a premature splicing signal, we hypothesize that AG should be the most under-represented dimer. Through the use of DNA-strand asymmetry patterns, we confirm this prediction in fruit flies of the genus Drosophila and by comparing the asymmetry patterns to a presumably neutrally evolving region, we quantify the selection strength acting on each motif. Moreover, our inference and simulation method revealed that the best explanation for the base composition evolution of the polypyrimidine tract is the joint action of purifying selection against a spurious 3' splice signal and the selection for pyrimidines. Patterns of asymmetry in other eukaryotes indicate that avoidance of premature splicing similarly affects the nucleotide composition in their polypyrimidine tracts.
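The idea of a strand asymmetry pattern can be made concrete with a short Python sketch that compares how often a dimer occurs on one strand with how often its reverse complement occurs on the same strand. The statistic below is an assumed, simple skew measure chosen for illustration; the abstract does not specify the exact measure or the inference method used, and the example sequence is invented.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(motif):
    """Reverse complement of a DNA motif, e.g. AG -> CT."""
    return "".join(COMPLEMENT[base] for base in reversed(motif))

def dimer_counts(seq):
    """Count overlapping dimers, skipping any that contain non-ACGT characters."""
    seq = seq.upper()
    dimers = (seq[i:i + 2] for i in range(len(seq) - 1))
    return Counter(d for d in dimers if set(d) <= set(COMPLEMENT))

def strand_asymmetry(seq, motif):
    """(n_motif - n_revcomp) / (n_motif + n_revcomp); negative means the motif is depleted."""
    counts = dimer_counts(seq)
    fwd = counts[motif]
    rev = counts[reverse_complement(motif)]
    total = fwd + rev
    return (fwd - rev) / total if total else 0.0

# Invented pyrimidine-rich stretch ending in the AG splice signal.
tract = "TTTCTCTTTCTAG"
print(strand_asymmetry(tract, "AG"))
```

In this toy sequence AG occurs once and its reverse complement CT three times, so the score is negative. Dimers that are their own reverse complement (AT, TA, CG, GC) always score zero under this measure, so it is only informative for the remaining motifs.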
Subject(s)
Pyrimidines, RNA Splicing, Base Sequence, Base Composition, Mutation, Introns, Pyrimidines/metabolism
ABSTRACT
Theoretical and empirical studies have shown that species radiations are facilitated when a trait under divergent natural selection is also involved in sexual selection. It remains unclear how quick and effective radiations are where assortative mating is unrelated to the ecological environment and primarily results from sexual selection. We address this question using sympatric grasshopper species of the genus Chorthippus, which have evolved strong behavioural isolation while lacking noticeable ecomorphological divergence. Mitochondrial genomes suggest that the radiation is relatively recent, dating to the mid-Pleistocene, which leads to extensive incomplete lineage sorting throughout the mitochondrial and nuclear genomes. Nuclear data show that hybrids are absent in sympatric localities but that all species have experienced gene flow, confirming that reproductive isolation is strong but remains incomplete. Demographic modelling is most consistent with a long period of geographic isolation, followed by secondary contact and extensive introgression. Such initial periods of geographic isolation might facilitate the association between male signalling and female preference, permitting the coexistence of sympatric species that are genetically, morphologically, and ecologically similar, but otherwise behave mostly as good biological species.
Subject(s)
Grasshoppers, Animals, Female, Gene Flow, Genetic Speciation, Grasshoppers/genetics, Male, Reproductive Isolation, Genetic Selection, Sympatry
ABSTRACT
It is generally assumed that new genes arise through duplication and/or recombination of existing genes. The probability that a new functional gene could arise out of random non-coding DNA is so far considered to be negligible, since it seems unlikely that such an RNA or protein sequence could have an initial function that influences the fitness of an organism. We have here tested this question systematically, by expressing clones with random sequences in E. coli and subjecting them to competitive growth. Contrary to expectations, we find that random sequences with bioactivity are not rare. In our experiments we find that up to 25% of the evaluated clones enhance the growth rate of their cells and up to 52% inhibit growth. Testing of individual clones in competition assays confirms their activity and provides an indication that their activity could be exerted either by the transcribed RNA or the translated peptide. This suggests that transcribed and translated random parts of the genome could indeed have a high potential to become functional. The results also suggest that random sequences may become an effective new source of molecules for studying cellular functions, as well as for pharmacological activity screening.