Results 1 - 14 of 14
1.
Phys Rev E Stat Nonlin Soft Matter Phys ; 79(3 Pt 2): 035102, 2009 Mar.
Article in English | MEDLINE | ID: mdl-19392005

ABSTRACT

Using a generalization of the level statistics analysis of quantum disordered systems, we present an approach that automatically extracts keywords from literary texts. Our approach takes into account not only the frequencies of the words present in the text but also their spatial distribution along the text, and is based on the fact that relevant words are significantly clustered (i.e., they self-attract each other), while irrelevant words are distributed randomly in the text. Since a reference corpus is not needed, our approach is especially suitable for single documents for which no a priori information is available. In addition, we show that our method also works on generic symbolic sequences (continuous texts without spaces), thus suggesting its general applicability.
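
The clustering idea in this abstract can be illustrated with a small Python sketch. It is not the authors' statistic, which the abstract does not spell out. It assumes a simple proxy: the standard deviation of the gaps between consecutive occurrences of a word, divided by the mean gap, and then divided by sqrt(1 - p), the value expected for a word placed at random with probability p per position (geometric gaps). The function name `clustering_scores` and the `min_count` cutoff are hypothetical.

```python
import math
import re
from collections import defaultdict


def clustering_scores(text, min_count=5):
    """Return {word: score}; scores well above 1 suggest clustered words.

    Assumption: score = (std of gaps / mean gap) / sqrt(1 - p), where p is
    the relative frequency of the word. A randomly placed word scores near 1.
    """
    words = re.findall(r"[a-z']+", text.lower())
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)

    scores = {}
    for word, pos in positions.items():
        p = len(pos) / len(words)
        if len(pos) < min_count or p >= 1.0:
            continue  # too rare to judge, or the text is a single word
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        mean = sum(gaps) / len(gaps)
        variance = sum((g - mean) ** 2 for g in gaps) / len(gaps)
        sigma = math.sqrt(variance) / mean
        scores[word] = sigma / math.sqrt(1.0 - p)
    return scores
```

Sorting the returned dictionary by score in descending order gives a rough keyword ranking for a single document, with no reference corpus involved.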

2.
Phys Rev E Stat Nonlin Soft Matter Phys ; 75(3 Pt 1): 032903, 2007 Mar.
Article in English | MEDLINE | ID: mdl-17500745

ABSTRACT

The scale-free, long-range correlations detected in DNA sequences contrast with characteristic lengths of genomic elements, being particularly incompatible with the isochores (long, homogeneous DNA segments). By computing the local behavior of the scaling exponent alpha of detrended fluctuation analysis (DFA), we discriminate between sequences with and without true scaling, and we find that no single scaling exists in the human genome. Instead, human chromosomes show a common compositional structure with two characteristic scales, the large one corresponding to the isochores and the other to small and medium scale genomic elements.


Subjects
Chromosome Mapping/methods; DNA Mutational Analysis/methods; Genetic Code/genetics; Genome, Human/genetics; Models, Genetic; Sequence Analysis, DNA/methods; Base Sequence; Computer Simulation; Contig Mapping; Genetic Variation/genetics; Humans; Molecular Sequence Data; Quantitative Trait Loci/genetics
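
The local scaling exponent mentioned in this entry's abstract can be sketched in Python as follows. This is a minimal first-order DFA under stated assumptions, not the authors' code: non-overlapping boxes, linear detrending, and the local alpha taken as the slope of log F(n) against log n between consecutive box sizes. The names `dfa_fluctuation` and `local_alpha` are hypothetical.

```python
import math


def _linear_fit(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dfa_fluctuation(series, window):
    """RMS fluctuation F(window) of the integrated, linearly detrended series."""
    mean = sum(series) / len(series)
    profile, total = [], 0.0
    for value in series:
        total += value - mean
        profile.append(total)

    xs = list(range(window))
    n_boxes = len(profile) // window
    squared = 0.0
    for b in range(n_boxes):
        box = profile[b * window:(b + 1) * window]
        slope, intercept = _linear_fit(xs, box)
        squared += sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, box))
    return math.sqrt(squared / (n_boxes * window))


def local_alpha(series, windows):
    """Slopes of log F(n) versus log n between consecutive window sizes."""
    logs = [(math.log(w), math.log(dfa_fluctuation(series, w))) for w in windows]
    return [(f2 - f1) / (l2 - l1) for (l1, f1), (l2, f2) in zip(logs, logs[1:])]
```

A series with a single scaling regime gives roughly constant local slopes across window sizes; a systematic drift of the slope with scale is the kind of signature the abstract uses to argue against a single scaling.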
3.
Gene ; 300(1-2): 97-104, 2002 Oct 30.
Article in English | MEDLINE | ID: mdl-12468091

ABSTRACT

We present a coding measure that is based on the statistical properties of the stop codons and that is able to estimate accurately the variation of coding content along an anonymous sequence. As the stop codons play the same role in all genomes (with very few exceptions), the measure turns out to be species-independent. We show results for both prokaryotic and eukaryotic genomes, indicating, first, the accuracy of the measure, and, second, that better prediction is achieved if the measure is applied to homogeneous, isochore-like sequences than if it is applied following the standard moving window approach. Finally, we discuss some of the possible applications of the measure.


Subjects
Codon, Terminator/genetics; Open Reading Frames/genetics; Animals; Bacillus subtilis/genetics; Base Composition; Databases, Nucleic Acid; Drosophila melanogaster/genetics; Genome, Human; Humans; Isochores/genetics; Species Specificity; Statistics as Topic
4.
Gene ; 300(1-2): 105-15, 2002 Oct 30.
Article in English | MEDLINE | ID: mdl-12468092

ABSTRACT

Here we present a study of statistical correlations among different positions in DNA sequences and their implications by directly using the autocorrelation function. Such an analysis is now possible because of the availability of large sequences or even complete genomes of many organisms. After describing the way in which the autocorrelation function can be applied to DNA-sequence analysis, we show that long-range correlations, implying scale independence, appear in several bacterial genomes as well as in long human chromosome contigs. The source of such correlations in bacteria, which may extend up to 60 kb in Bacillus subtilis, may be related to massive lateral transfer of compositionally biased genes from other genomes. In the human genome, correlations extend for more than five decades and may be related to the evolution of the 'neogenome', a modern evolutionary acquisition composed of GC-rich isochores displaying long-range correlations and scale invariance.


Subjects
DNA/genetics; Sequence Analysis, DNA/statistics & numerical data; DNA, Bacterial/genetics; Genome, Bacterial; Genome, Human; Humans; Sequence Analysis, DNA/methods; Statistics as Topic
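
As a rough illustration of applying an autocorrelation function to a DNA sequence, as described in this entry's abstract, the Python sketch below maps each base to a G+C indicator and computes the normalized autocorrelation at a given lag. The numeric mapping is an assumption made for illustration (the abstract does not state which one the authors use), and `gc_indicator` and `autocorrelation` are hypothetical names.

```python
def gc_indicator(seq):
    """Map a DNA string to 1.0 for G or C and 0.0 for any other symbol."""
    return [1.0 if base in "GCgc" else 0.0 for base in seq]


def autocorrelation(x, lag):
    """Normalized autocorrelation C(lag) = <(x_i - m)(x_{i+lag} - m)> / var(x).

    Assumes 0 <= lag < len(x) and a non-constant series (var > 0).
    """
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    covariance = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag)) / (n - lag)
    return covariance / variance
```

Plotting C(lag) against lag on log-log axes is a common way to look for power-law decay, which is what long-range correlation means in this context.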
5.
Gene ; 276(1-2): 47-56, 2001 Oct 03.
Article in English | MEDLINE | ID: mdl-11591471

ABSTRACT

Analytical DNA ultracentrifugation revealed that eukaryotic genomes are mosaics of isochores: long DNA segments (>>300 kb on average) relatively homogeneous in G+C. Important genome features are dependent on this isochore structure, e.g. genes are found predominantly in the GC-richest isochore classes. However, no reliable method is available to rigorously partition the genome sequence into relatively homogeneous regions of different composition, thereby revealing the isochore structure of chromosomes at the sequence level. Homogeneous regions are currently ascertained by plain statistics on moving windows of arbitrary length, or simply by eye on G+C plots. By contrast, the entropic segmentation method is able to divide a DNA sequence into relatively homogeneous, statistically significant domains. An early version of this algorithm only produced domains having an average length far below the typical isochore size. Here we show that an improved segmentation method, specifically intended to determine the most statistically significant partition of the sequence at each scale, is able to identify the boundaries between long homogeneous genome regions displaying the typical features of isochores. The algorithm precisely locates classes II and III of the human major histocompatibility complex region, two well-characterized isochores at the sequence level, the boundary between them being the first isochore boundary experimentally characterized at the sequence level. The analysis is then extended to a collection of large human contigs. The relatively homogeneous regions we find show many of the features (G+C range, relative proportion of isochore classes, size distribution, and relationship with gene density) of the isochores identified through DNA centrifugation. Isochore chromosome maps, with many potential applications in genomics, are then drawn for all the completely sequenced eukaryotic genomes available.


Subjects
DNA/genetics; Eukaryotic Cells/metabolism; Genome; Animals; Base Composition; Chromosome Mapping; DNA, Fungal/genetics; DNA, Plant/genetics; GC Rich Sequence/genetics; Genes/genetics; Genetic Variation; Genome, Fungal; Genome, Human; Genome, Plant; Humans; Major Histocompatibility Complex/genetics
6.
Eur Phys J B ; 85(6), 2012 Jun 01.
Article in English | MEDLINE | ID: mdl-23645997

ABSTRACT

Segmentation is a standard method of data analysis to identify change-points dividing a nonstationary time series into homogeneous segments. However, for long-range fractal correlated series, most of the segmentation techniques detect spurious change-points which are simply due to the heterogeneities induced by the correlations and not to real nonstationarities. To avoid this oversegmentation, we present a segmentation algorithm which takes as a reference for homogeneity, instead of a random i.i.d. series, a correlated series modeled by a fractional noise with the same degree of correlations as the series to be segmented. We apply our algorithm to artificial series with long-range correlations and show that it systematically detects only the change-points produced by real nonstationarities and not those created by the correlations of the signal. Further, we apply the method to the sequence of the long arm of human chromosome 21, which is known to have long-range fractal correlations. We obtain only three segments that clearly correspond to the three regions of different G + C composition revealed by means of a multi-scale wavelet plot. Similar results have been obtained when segmenting all human chromosome sequences, showing the existence of previously unknown huge compositional superstructures in the human genome.

7.
Phys Rev E Stat Nonlin Soft Matter Phys ; 83(3 Pt 1): 031908, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21517526

ABSTRACT

Human DNA shows a complex structure with compositional features at many scales; the isochores, long DNA segments (~10^5 bp) of relatively homogeneous guanine-cytosine (G + C) content, are the largest well-documented and well-analyzed compositional structures. However, we report here on the existence of a high-level compositional organization of isochores in the human genome. By using a segmentation algorithm incorporating the long-range correlations existing in human DNA, we find that every chromosome is composed of a few huge segments (~10^7 bp) of relatively homogeneous G + C content, which become the largest compositional organization of the genome. Finally, we show evidence of the biological relevance of these superstructures, pointing to a large-scale functional organization of the human genome.


Subjects
DNA/chemistry; Genome, Human; Algorithms; Base Composition; Chromosome Mapping; Chromosomes, Human/ultrastructure; CpG Islands; Cytosine/chemistry; GC Rich Sequence; Guanine/chemistry; Humans; Models, Statistical; Nucleic Acid Conformation; Repetitive Sequences, Nucleic Acid; Sequence Analysis, DNA
8.
Phys Rev Lett ; 87(16): 168105, 2001 Oct 15.
Article in English | MEDLINE | ID: mdl-11690251

ABSTRACT

We introduce a segmentation algorithm to probe the temporal organization of heterogeneities in human heartbeat interval time series. We find that the lengths of segments with different local mean heart rates follow a power-law distribution and show that this scale-invariant structure is not a simple consequence of the long-range correlations present in the data. The differences in mean heart rates between consecutive segments display a common functional form, but with different parameters for healthy individuals and for heart-failure patients. These findings suggest that there is relevant physiological information hidden in the heterogeneities of the heartbeat time series.


Subjects
Heart Rate/physiology; Heart/physiology; Algorithms; Astronauts; Heart Diseases/physiopathology; Humans; Monte Carlo Method
9.
Genome Res ; 8(9): 916-28, 1998 Sep.
Article in English | MEDLINE | ID: mdl-9750191

ABSTRACT

The heterogeneity within, and similarities between, yeast chromosomes are studied. For the former, we show by the size distribution of domains, coding density, size distribution of open reading frames, spatial power spectra, and deviation from binomial distribution for C + G% in large moving windows that there is a strong deviation of the yeast sequences from random sequences. For the latter, not only do we graphically illustrate the similarity for the above mentioned statistics, but we also carry out a rigorous analysis of variance (ANOVA) test. The hypothesis that all yeast chromosomes are similar cannot be rejected by this test. We examine the two possible explanations of this interchromosomal uniformity: a common origin, such as genome-wide duplication (polyploidization), and a concerted evolutionary process.


Subjects
Base Composition; Chromosomes, Fungal/chemistry; Saccharomyces cerevisiae/genetics; Analysis of Variance; Cytosine/analysis; Evolution, Molecular; Guanine/analysis; Open Reading Frames; Sequence Analysis, DNA
10.
J Theor Biol ; 160(4): 457-70, 1993 Feb 21.
Article in English | MEDLINE | ID: mdl-8501918

ABSTRACT

A new method to determine entropic profiles in DNA sequences is presented. It is based on the chaos-game representation (CGR) of gene structure, a technique which produces a fractal-like picture of DNA sequences. First, the CGR image was divided into squares 4^(-m) in size (m being the desired resolution), and the point density counted. Second, appropriate intervals were adjusted, and then a histogram of densities was prepared. Third, Shannon's formula was applied to the probability-distribution histogram, thus obtaining a new entropic estimate for DNA sequences, the histogram entropy, a measurement that varies with the level of constraints on the DNA sequence. Lastly, the entropic profile for the sequence was drawn, by considering the entropies at each resolution level, thus providing a way to summarize the complexity of large genomic regions or even entire genomes at different resolution levels. The application of the method to DNA sequences reveals that entropic profiles obtained in this way, as opposed to previously published ones, clearly discriminate between random and natural DNA sequences. Entropic profiles also show a different degree of variability within and between genomes. The results of these analyses are discussed in relation both to the genome compartmentalization in vertebrates and to the differential action of compositional and/or functional constraints on DNA sequences.


Subjects
Computer Simulation; Game Theory; Information Theory; Sequence Analysis, DNA; Thermodynamics; Animals; Base Sequence; Humans; Vertebrates/genetics
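
The steps listed in this entry's abstract (CGR, grid counting, density histogram, Shannon's formula) can be sketched in Python as below. Several details are assumptions because the abstract leaves them open: the corner assignment for A, C, G and T, the starting point at the centre of the square, and equal-width density bins standing in for the "appropriate intervals". The names `cgr_counts` and `histogram_entropy` are hypothetical.

```python
import math
from collections import Counter

# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_counts(seq, m):
    """Count CGR points in each cell of a 2^m x 2^m grid (cell area 4^(-m))."""
    side = 2 ** m
    counts = [0] * (side * side)
    x, y = 0.5, 0.5
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the corner
        col = min(int(x * side), side - 1)
        row = min(int(y * side), side - 1)
        counts[row * side + col] += 1
    return counts


def histogram_entropy(seq, m, n_bins=10):
    """Shannon entropy (bits) of the histogram of cell densities at resolution m."""
    counts = cgr_counts(seq, m)
    top = max(counts)
    if top == 0:
        return 0.0
    # Equal-width bins over the observed range of counts (an assumption).
    bins = Counter(c * n_bins // (top + 1) for c in counts)
    n_cells = len(counts)
    return -sum(k / n_cells * math.log2(k / n_cells) for k in bins.values())
```

Calling `histogram_entropy` for m = 1, 2, 3, ... and collecting the values gives one entropic profile per sequence, in the sense of one entropy per resolution level.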
11.
Bioinformatics ; 15(12): 974-9, 1999 Dec.
Article in English | MEDLINE | ID: mdl-10745986

ABSTRACT

MOTIVATION: DNA sequences are formed by patches or domains of different nucleotide composition. In a few simple sequences, domains can simply be identified by eye; however, most DNA sequences show a complex compositional heterogeneity (fractal structure), which cannot be properly detected by current methods. Recently, a computationally efficient segmentation method to analyse such nonstationary sequence structures, based on the Jensen-Shannon entropic divergence, has been described. Specific algorithms implementing this method are now needed. RESULTS: Here we describe a heuristic segmentation algorithm for DNA sequences, which was implemented in a Windows program (SEGMENT). The program divides a DNA sequence into compositionally homogeneous domains by iterating a local optimization procedure at a given statistical significance. Once a sequence is partitioned into domains, a global measure of sequence compositional complexity (SCC), accounting for both the sizes and compositional biases of all the domains in the sequence, is derived. SEGMENT computes SCC as a function of the significance level, which provides a multiscale view of sequence complexity.


Subjects
Algorithms; Sequence Analysis, DNA/methods; Data Display; Escherichia coli/genetics; Mathematical Computing; Models, Genetic; Molecular Structure; Protein Structure, Tertiary/genetics; Software; User-Computer Interface
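
The core step of a segmentation based on the Jensen-Shannon divergence, as named in this entry's abstract, can be sketched in Python. This is only the single-cut search: the significance test and the iteration over the resulting domains that SEGMENT performs are not reproduced here, and the function names `shannon` and `best_cut` are hypothetical.

```python
import math
from collections import Counter


def shannon(counts, total):
    """Shannon entropy in bits of a symbol-count table."""
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)


def best_cut(seq):
    """Return (position, divergence) of the cut that maximizes the
    length-weighted Jensen-Shannon divergence between left and right parts."""
    n = len(seq)
    whole = Counter(seq)
    h_whole = shannon(whole, n)
    left, right = Counter(), Counter(whole)
    best_pos, best_js = None, -1.0
    for i in range(1, n):
        symbol = seq[i - 1]
        left[symbol] += 1   # symbol moves from the right part to the left part
        right[symbol] -= 1
        js = h_whole - (i / n) * shannon(left, i) - ((n - i) / n) * shannon(right, n - i)
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js
```

In a full segmentation one would accept the cut only if its divergence is statistically significant and then repeat the search inside each of the two resulting segments.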
12.
Phys Rev Lett ; 85(6): 1342-5, 2000 Aug 07.
Article in English | MEDLINE | ID: mdl-10991547

ABSTRACT

We present a new computational approach to finding borders between coding and noncoding DNA. This approach has two features: (i) DNA sequences are described by a 12-letter alphabet that captures the differential base composition at each codon position, and (ii) the search for the borders is carried out by means of an entropic segmentation method which uses only the general statistical properties of coding DNA. We find that this method is highly accurate in finding borders between coding and noncoding regions and requires no "prior training" on known data sets. Our results appear to be more accurate than those obtained with moving windows in the discrimination of coding from noncoding DNA.


Subjects
DNA/chemistry; DNA/genetics; Entropy; Genetic Code; Models, Theoretical
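
One plausible reading of the 12-letter alphabet in this entry's abstract is that each nucleotide is labelled with its position modulo 3, so that the four bases in three phases give twelve symbols. The Python sketch below encodes a sequence that way; the label format and the `frame` parameter are assumptions made for illustration, and `to_12_letter` is a hypothetical name.

```python
def to_12_letter(seq, frame=0):
    """Label each base with its phase (index - frame) mod 3, e.g. 'A0', 'T1', 'G2'."""
    return [f"{base}{(i - frame) % 3}" for i, base in enumerate(seq.upper())]
```

Within a coding region the base composition differs between the three phases, so the symbol frequencies of this encoding change at a coding/noncoding border, which is what an entropic segmentation can then pick up.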