ABSTRACT
Using a generalization of the level statistics analysis of quantum disordered systems, we present an approach that automatically extracts keywords from literary texts. Our approach takes into account not only the frequencies of the words present in the text but also their spatial distribution along the text, and is based on the fact that relevant words are significantly clustered (i.e., they self-attract each other), while irrelevant words are distributed randomly in the text. Since a reference corpus is not needed, our approach is especially suitable for single documents for which no a priori information is available. In addition, we show that our method also works on generic symbolic sequences (continuous texts without spaces), thus suggesting its general applicability.
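The clustering idea in this abstract can be illustrated with a small sketch. It is not the authors' exact statistic: as a simplifying assumption, it scores each word by the standard deviation of the gaps between its successive occurrences divided by the mean gap. Randomly placed words tend to score near 1, clustered words score higher, and evenly spaced words score lower. The function name `spacing_sigma` and the `min_count` threshold are hypothetical.

```python
import math
import re
from collections import defaultdict


def spacing_sigma(text, min_count=5):
    """Map each frequent word to std(gaps) / mean(gaps) of its occurrences."""
    words = re.findall(r"[a-z']+", text.lower())
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)

    scores = {}
    for word, pos in positions.items():
        if len(pos) < min_count:
            continue  # too few occurrences for a meaningful spacing statistic
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        mean = sum(gaps) / len(gaps)
        variance = sum((g - mean) ** 2 for g in gaps) / len(gaps)
        scores[word] = math.sqrt(variance) / mean
    return scores
```

Sorting the resulting dictionary by score in descending order gives a rough keyword ranking for a single document, with no reference corpus involved.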
ABSTRACT
The scale-free, long-range correlations detected in DNA sequences contrast with characteristic lengths of genomic elements, being particularly incompatible with the isochores (long, homogeneous DNA segments). By computing the local behavior of the scaling exponent alpha of detrended fluctuation analysis (DFA), we discriminate between sequences with and without true scaling, and we find that no single scaling exists in the human genome. Instead, human chromosomes show a common compositional structure with two characteristic scales, the large one corresponding to the isochores and the other to small and medium scale genomic elements.
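As a rough illustration of what a local scaling exponent means, the sketch below runs first-order DFA (linear detrending) on a numeric series and returns the slope of log F(n) versus log n between consecutive window sizes. This is a minimal sketch under assumptions: the abstract does not state the detrending order or the scales used, and the function names are hypothetical.

```python
import math


def dfa_fluctuation(series, n):
    """Root-mean-square fluctuation F(n) of the linearly detrended profile."""
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for value in series:
        running += value - mean
        profile.append(running)

    n_boxes = len(profile) // n
    t_mean = (n - 1) / 2
    t_var = sum((t - t_mean) ** 2 for t in range(n))
    total = 0.0
    for b in range(n_boxes):
        box = profile[b * n:(b + 1) * n]
        y_mean = sum(box) / n
        # Least-squares slope of the linear trend inside this box.
        slope = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(box)) / t_var
        total += sum((y - y_mean - slope * (t - t_mean)) ** 2
                     for t, y in enumerate(box))
    return math.sqrt(total / (n_boxes * n))


def local_alpha(series, scales):
    """Slope of log F(n) vs log n between each pair of consecutive scales."""
    pts = [(math.log(n), math.log(dfa_fluctuation(series, n))) for n in scales]
    return [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
```

A series with a single scaling regime gives roughly constant local slopes, whereas a crossover between characteristic scales shows up as a systematic change in the slope across scales.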
Subjects
Chromosome Mapping/methods, DNA Mutational Analysis/methods, Genetic Code/genetics, Human Genome/genetics, Genetic Models, DNA Sequence Analysis/methods, Base Sequence, Computer Simulation, Contig Mapping, Genetic Variation/genetics, Humans, Molecular Sequence Data, Quantitative Trait Loci/genetics
ABSTRACT
We present a coding measure that is based on the statistical properties of the stop codons and that is able to accurately estimate the variation of coding content along an anonymous sequence. As the stop codons play the same role in all genomes (with very few exceptions), the measure turns out to be species-independent. We show results for both prokaryotic and eukaryotic genomes, indicating, first, the accuracy of the measure, and, second, that better prediction is achieved if the measure is applied to homogeneous, isochore-like sequences than if it is applied following the standard moving window approach. Finally, we discuss some of the possible applications of the measure.
Subjects
Terminator Codon/genetics, Open Reading Frames/genetics, Animals, Bacillus subtilis/genetics, Base Composition, Nucleic Acid Databases, Drosophila melanogaster/genetics, Human Genome, Humans, Isochores/genetics, Species Specificity, Statistics as Topic
ABSTRACT
Here we present a study of statistical correlations among different positions in DNA sequences and their implications by directly using the autocorrelation function. Such an analysis is now possible because of the availability of large sequences or even complete genomes of many organisms. After describing the way in which the autocorrelation function can be applied to DNA-sequence analysis, we show that long-range correlations, implying scale independence, appear in several bacterial genomes as well as in long human chromosome contigs. The source of such correlations in bacteria, which may extend up to 60 kb in Bacillus subtilis, may be related to massive lateral transfer of compositionally biased genes from other genomes. In the human genome, correlations extend for more than five decades and may be related to the evolution of the 'neogenome', a modern evolutionary acquisition composed of GC-rich isochores displaying long-range correlations and scale invariance.
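One simple way to apply an autocorrelation function to a DNA sequence is sketched below. It assumes a binary G+C indicator (1 for G or C, 0 otherwise) as the numeric mapping, which the abstract does not specify, and computes C(l) = <x_i x_(i+l)> - <x>^2 for each lag l. The function name is hypothetical.

```python
def gc_autocorrelation(seq, max_lag):
    """Return [C(1), ..., C(max_lag)] for the G+C indicator of `seq`."""
    x = [1.0 if base in "GCgc" else 0.0 for base in seq]
    n = len(x)
    mean = sum(x) / n
    result = []
    for lag in range(1, max_lag + 1):
        pairs = n - lag
        # Average product of positions separated by `lag`, minus the squared mean.
        product = sum(x[i] * x[i + lag] for i in range(pairs)) / pairs
        result.append(product - mean * mean)
    return result
```

For an uncorrelated sequence C(l) fluctuates around zero at every lag, while long-range correlations appear as a slow, power-law-like decay of C(l) with l.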
Subjects
DNA/genetics, DNA Sequence Analysis/statistics & numerical data, Bacterial DNA/genetics, Bacterial Genome, Human Genome, Humans, DNA Sequence Analysis/methods, Statistics as Topic
ABSTRACT
Analytical DNA ultracentrifugation revealed that eukaryotic genomes are mosaics of isochores: long DNA segments (>>300 kb on average) relatively homogeneous in G+C. Important genome features are dependent on this isochore structure, e.g. genes are found predominantly in the GC-richest isochore classes. However, no reliable method is available to rigorously partition the genome sequence into relatively homogeneous regions of different composition, thereby revealing the isochore structure of chromosomes at the sequence level. Homogeneous regions are currently ascertained by plain statistics on moving windows of arbitrary length, or simply by eye on G+C plots. In contrast, the entropic segmentation method is able to divide a DNA sequence into relatively homogeneous, statistically significant domains. An early version of this algorithm only produced domains having an average length far below the typical isochore size. Here we show that an improved segmentation method, specifically intended to determine the most statistically significant partition of the sequence at each scale, is able to identify the boundaries between long homogeneous genome regions displaying the typical features of isochores. The algorithm precisely locates classes II and III of the human major histocompatibility complex region, two well-characterized isochores at the sequence level, the boundary between them being the first isochore boundary experimentally characterized at the sequence level. The analysis is then extended to a collection of large human contigs. The relatively homogeneous regions we find show many of the features (G+C range, relative proportion of isochore classes, size distribution, and relationship with gene density) of the isochores identified through DNA centrifugation. Isochore chromosome maps, with many potential applications in genomics, are then drawn for all the completely sequenced eukaryotic genomes available.
Subjects
DNA/genetics, Eukaryotic Cells/metabolism, Genome, Animals, Base Composition, Chromosome Mapping, Fungal DNA/genetics, Plant DNA/genetics, GC Rich Sequence/genetics, Genes/genetics, Genetic Variation, Fungal Genome, Human Genome, Plant Genome, Humans, Major Histocompatibility Complex/genetics
ABSTRACT
Segmentation is a standard method of data analysis to identify change-points dividing a nonstationary time series into homogeneous segments. However, for long-range fractal correlated series, most of the segmentation techniques detect spurious change-points which are simply due to the heterogeneities induced by the correlations and not to real nonstationarities. To avoid this oversegmentation, we present a segmentation algorithm which takes as a reference for homogeneity, instead of a random i.i.d. series, a correlated series modeled by a fractional noise with the same degree of correlations as the series to be segmented. We apply our algorithm to artificial series with long-range correlations and show that it systematically detects only the change-points produced by real nonstationarities and not those created by the correlations of the signal. Further, we apply the method to the sequence of the long arm of human chromosome 21, which is known to have long-range fractal correlations. We obtain only three segments that clearly correspond to the three regions of different G + C composition revealed by means of a multi-scale wavelet plot. Similar results have been obtained when segmenting all human chromosome sequences, showing the existence of previously unknown huge compositional superstructures in the human genome.
ABSTRACT
Human DNA shows a complex structure with compositional features at many scales; the isochores--long DNA segments (~10^5 bp) of relatively homogeneous guanine-cytosine (G + C) content--are the largest well-documented and well-analyzed compositional structures. However, we report here on the existence of a high-level compositional organization of isochores in the human genome. By using a segmentation algorithm incorporating the long-range correlations existing in human DNA, we find that every chromosome is composed of a few huge segments (~10^7 bp) of relatively homogeneous G + C content, which become the largest compositional organization of the genome. Finally, we show evidence of the biological relevance of these superstructures, pointing to a large-scale functional organization of the human genome.
Subjects
DNA/chemistry, Human Genome, Algorithms, Base Composition, Chromosome Mapping, Human Chromosomes/ultrastructure, CpG Islands, Cytosine/chemistry, GC Rich Sequence, Guanine/chemistry, Humans, Statistical Models, Nucleic Acid Conformation, Nucleic Acid Repetitive Sequences, DNA Sequence Analysis
ABSTRACT
We introduce a segmentation algorithm to probe the temporal organization of heterogeneities in human heartbeat interval time series. We find that the lengths of segments with different local mean heart rates follow a power-law distribution and show that this scale-invariant structure is not a simple consequence of the long-range correlations present in the data. The differences in mean heart rates between consecutive segments display a common functional form, but with different parameters for healthy individuals and for heart-failure patients. These findings suggest that there is relevant physiological information hidden in the heterogeneities of the heartbeat time series.
Subjects
Heart Rate/physiology, Heart/physiology, Algorithms, Astronauts, Heart Diseases/physiopathology, Humans, Monte Carlo Method
ABSTRACT
The heterogeneity within, and similarities between, yeast chromosomes are studied. For the former, we show by the size distribution of domains, coding density, size distribution of open reading frames, spatial power spectra, and deviation from binomial distribution for C + G% in large moving windows that there is a strong deviation of the yeast sequences from random sequences. For the latter, not only do we graphically illustrate the similarity for the above mentioned statistics, but we also carry out a rigorous analysis of variance (ANOVA) test. The hypothesis that all yeast chromosomes are similar cannot be rejected by this test. We examine the two possible explanations of this interchromosomal uniformity: a common origin, such as genome-wide duplication (polyploidization), and a concerted evolutionary process.
Subjects
Base Composition, Fungal Chromosomes/chemistry, Saccharomyces cerevisiae/genetics, Analysis of Variance, Cytosine/analysis, Molecular Evolution, Guanine/analysis, Open Reading Frames, DNA Sequence Analysis
ABSTRACT
A new method to determine entropic profiles in DNA sequences is presented. It is based on the chaos-game representation (CGR) of gene structure, a technique which produces a fractal-like picture of DNA sequences. First, the CGR image was divided into squares 4^-m in size (m being the desired resolution), and the point density counted. Second, appropriate intervals were adjusted, and then a histogram of densities was prepared. Third, Shannon's formula was applied to the probability-distribution histogram, thus obtaining a new entropic estimate for DNA sequences, the histogram entropy, a measure that reflects the level of constraints on the DNA sequence. Lastly, the entropic profile for the sequence was drawn by considering the entropies at each resolution level, thus providing a way to summarize the complexity of large genomic regions or even entire genomes at different resolution levels. The application of the method to DNA sequences reveals that entropic profiles obtained in this way, as opposed to previously published ones, clearly discriminate between random and natural DNA sequences. Entropic profiles also show a different degree of variability within and between genomes. The results of these analyses are discussed in relation both to the genome compartmentalization in vertebrates and to the differential action of compositional and/or functional constraints on DNA sequences.
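The steps described in this abstract can be sketched as follows. Several details are assumptions, since the abstract does not give them: the corner assignment of the four bases, the starting point at the centre of the unit square, and the use of `n_bins` equal-width intervals for the density histogram. The function names are hypothetical.

```python
import math
from collections import Counter

# Assumed corner assignment for the chaos-game representation.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_counts(seq, m):
    """Point counts in a 2^m x 2^m grid (cells of area 4^-m) of the CGR."""
    side = 2 ** m
    counts = Counter()
    x, y = 0.5, 0.5
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the base's corner
        counts[(min(int(x * side), side - 1), min(int(y * side), side - 1))] += 1
    return [counts.get((i, j), 0) for i in range(side) for j in range(side)]


def histogram_entropy(seq, m, n_bins=10):
    """Shannon entropy (bits) of the histogram of CGR cell densities."""
    densities = cgr_counts(seq, m)
    top = max(densities)
    if top == 0:
        return 0.0
    # Bin each cell count into one of n_bins equal-width intervals.
    hist = Counter(d * n_bins // (top + 1) for d in densities)
    total = len(densities)
    return -sum((c / total) * math.log2(c / total) for c in hist.values())
```

Evaluating `histogram_entropy(seq, m)` for m = 1, 2, 3, ... gives one value per resolution level, which together form an entropic profile in the sense described above.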
Subjects
Computer Simulation, Game Theory, Information Theory, DNA Sequence Analysis, Thermodynamics, Animals, Base Sequence, Humans, Vertebrates/genetics
ABSTRACT
MOTIVATION: DNA sequences are formed by patches or domains of different nucleotide composition. In a few simple sequences, domains can simply be identified by eye; however, most DNA sequences show a complex compositional heterogeneity (fractal structure), which cannot be properly detected by current methods. Recently, a computationally efficient segmentation method to analyse such nonstationary sequence structures, based on the Jensen-Shannon entropic divergence, has been described. Specific algorithms implementing this method are now needed. RESULTS: Here we describe a heuristic segmentation algorithm for DNA sequences, which was implemented in a Windows program (SEGMENT). The program divides a DNA sequence into compositionally homogeneous domains by iterating a local optimization procedure at a given statistical significance. Once a sequence is partitioned into domains, a global measure of sequence compositional complexity (SCC), accounting for both the sizes and compositional biases of all the domains in the sequence, is derived. SEGMENT computes SCC as a function of the significance level, which provides a multiscale view of sequence complexity.
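The core step of a segmentation based on the Jensen-Shannon divergence can be sketched as below: scan every candidate cut point and keep the one that maximizes the divergence between the compositions of the left and right parts. This is only an illustration of that single step, not the SEGMENT program; it omits the statistical significance test and the recursive iteration over the resulting domains. The names `best_cut` and `min_len` are hypothetical.

```python
import math
from collections import Counter


def shannon(counts, total):
    """Shannon entropy (bits) of a symbol-count table."""
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)


def best_cut(seq, min_len=10):
    """Return (max JS divergence, cut index), or (0.0, None) if no cut helps."""
    n = len(seq)
    total = Counter(seq)
    h_all = shannon(total, n)
    left = Counter()
    best = (0.0, None)
    for i in range(1, n):
        left[seq[i - 1]] += 1
        if i < min_len or n - i < min_len:
            continue  # keep both parts at least min_len symbols long
        right = total - left
        # JS divergence: entropy of the whole minus weighted entropies of parts.
        js = h_all - (i / n) * shannon(left, i) - ((n - i) / n) * shannon(right, n - i)
        if js > best[0]:
            best = (js, i)
    return best
```

In a full segmentation, a cut would be accepted only if its divergence passes the chosen significance level, and the procedure would then be repeated on each of the two resulting subsequences.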
Subjects
Algorithms, DNA Sequence Analysis/methods, Data Display, Escherichia coli/genetics, Mathematical Computing, Genetic Models, Molecular Structure, Tertiary Protein Structure/genetics, Software, User-Computer Interface
ABSTRACT
We present a new computational approach to finding borders between coding and noncoding DNA. This approach has two features: (i) DNA sequences are described by a 12-letter alphabet that captures the differential base composition at each codon position, and (ii) the search for the borders is carried out by means of an entropic segmentation method which uses only the general statistical properties of coding DNA. We find that this method is highly accurate in finding borders between coding and noncoding regions and requires no "prior training" on known data sets. Our results appear to be more accurate than those obtained with moving windows in the discrimination of coding from noncoding DNA.
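A 12-letter alphabet of the kind mentioned in feature (i) can be obtained by tagging each base with its position modulo 3, so that the same nucleotide at different codon positions becomes a different symbol. The sketch below assumes labels such as `A1`, `A2`, `A3`; the label format and the `frame` parameter are hypothetical, since the abstract does not describe the encoding in detail.

```python
def to_phase_alphabet(seq, frame=0):
    """Tag each base with its phase (1, 2 or 3) relative to `frame`."""
    return [f"{base}{(i - frame) % 3 + 1}" for i, base in enumerate(seq.upper())]
```

The resulting symbol sequence, with 4 bases times 3 phases, can then be passed to an entropic segmentation routine in place of the plain 4-letter sequence.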