ABSTRACT
Detrended Fluctuation Analysis (DFA) has become a standard method to quantify the correlations and scaling properties of real-world complex time series. For a given scale ℓ of observation, DFA provides the function F(ℓ), which quantifies the fluctuations of the time series around the local trend, which is subtracted (detrended). If the time series exhibits scaling properties, then F(ℓ) ∼ ℓ^α asymptotically, and the scaling exponent α is typically estimated as the slope of a linear fit in the log F(ℓ) vs. log(ℓ) plot. In this way, α measures the strength of the correlations and characterizes the underlying dynamical system. However, in many cases, and especially in physiological time series, the scaling behavior is different at short and long scales, resulting in log F(ℓ) vs. log(ℓ) plots with two different slopes, α1 at short scales and α2 at large scales of observation. These two exponents are usually associated with the existence of different mechanisms that work at distinct time scales acting on the underlying dynamical system. Here, however, and since the power-law behavior of F(ℓ) is asymptotic, we question the use of α1 to characterize the correlations at short scales. To this end, we show first that, even for artificial time series with perfect scaling, i.e., with a single exponent α valid for all scales, DFA provides an α1 value that systematically overestimates the true exponent α. Second, when artificial time series with two different scaling exponents at short and large scales are considered, the α1 value provided by DFA not only can severely underestimate or overestimate the true short-scale exponent, but also depends on the value of the large-scale exponent.
This behavior should prevent the use of α1 to describe the scaling properties at short scales: if DFA is used in two time series with the same scaling behavior at short scales but very different scaling properties at large scales, very different values of α1 will be obtained, although the short scale properties are identical. These artifacts may lead to wrong interpretations when analyzing real-world time series: on the one hand, for time series with truly perfect scaling, the spurious value of α1 could lead to wrongly thinking that there exists some specific mechanism acting only at short time scales in the dynamical system. On the other hand, for time series with true different scaling at short and large scales, the incorrect α1 value would not characterize properly the short scale behavior of the dynamical system.
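The procedure described in this abstract can be illustrated with a minimal sketch of standard DFA, written under stated assumptions: the function names `dfa` and `scaling_exponent`, the choice of scales, and the white-noise test signal are illustrative and do not come from the paper. The sketch fits separate slopes on the short and long halves of the scale range, which is how α1 and α2 are usually read off the log-log plot.

```python
import numpy as np

def dfa(x, scales, order=1):
    """Fluctuation function F(l) of standard DFA for each window size in `scales`."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())            # integrated, mean-subtracted series
    fluct = []
    for l in scales:
        n_win = len(profile) // l                # non-overlapping windows of size l
        t = np.arange(l)
        sq_resid = 0.0
        for w in range(n_win):
            seg = profile[w * l:(w + 1) * l]
            trend = np.polyval(np.polyfit(t, seg, order), t)  # local polynomial trend
            sq_resid += np.mean((seg - trend) ** 2)
        fluct.append(np.sqrt(sq_resid / n_win))
    return np.array(fluct)

def scaling_exponent(scales, fluct):
    """Slope of log F(l) versus log l, i.e. the usual estimate of alpha."""
    return np.polyfit(np.log(scales), np.log(fluct), 1)[0]

# Illustrative input: white noise, whose true exponent is alpha = 0.5.
rng = np.random.default_rng(0)
x = rng.standard_normal(2 ** 14)
scales = np.array([16, 32, 64, 128, 256, 512, 1024])
F = dfa(x, scales)
alpha_short = scaling_exponent(scales[:4], F[:4])   # plays the role of alpha1
alpha_long = scaling_exponent(scales[3:], F[3:])    # plays the role of alpha2
print(alpha_short, alpha_long)
```

Extending `scales` toward very small windows is one way to explore the short-scale bias discussed in the abstract; the exact size of that bias is not asserted here.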
ABSTRACT
The observable outputs of many complex dynamical systems consist of time series whose autocorrelation functions exhibit a great diversity of behaviors, including long-range power-law autocorrelation functions, as a signature of interactions operating at many temporal or spatial scales. Often, numerical algorithms able to generate correlated noises reproducing the properties of real time series are used to study and characterize such systems. Typically, many of those algorithms produce Gaussian time series. However, the real, experimentally observed time series are often non-Gaussian and may follow distributions with a diversity of behaviors concerning the support, the symmetry, or the tail properties. It is always possible to transform a correlated Gaussian time series into a time series with a different marginal distribution, but the question is how this transformation affects the behavior of the autocorrelation function. Here, we study analytically and numerically how the Pearson correlation of two Gaussian variables changes when the variables are transformed to follow a different destination distribution. Specifically, we consider bounded and unbounded distributions, symmetric and non-symmetric distributions, and distributions with different tail properties, from decays faster than exponential to heavy-tail cases including power laws, and we find how these properties affect the correlation of the final variables. We extend these results to Gaussian time series that are transformed to have a different marginal distribution, and show how the autocorrelation function of the final non-Gaussian time series depends on the Gaussian correlations and on the final marginal distribution. As an application of our results, we propose how to generalize standard algorithms producing Gaussian power-law correlated time series in order to create synthetic time series with an arbitrary distribution and controlled power-law correlations.
Finally, we show a practical example of this algorithm by generating time series mimicking the marginal distribution and the power-law tail of the autocorrelation function of real time series: the absolute returns of stock prices.
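The effect described in this abstract can be illustrated with a small numerical sketch. It assumes one common way of changing the marginal distribution, namely applying the destination inverse CDF to the Gaussian CDF value; the exponential destination, the function name `gaussian_to_exponential`, and the value ρ = 0.8 are illustrative choices, not taken from the paper.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)   # element-wise complementary error function

def gaussian_to_exponential(g, rate=1.0):
    """Map standard Gaussian values to Exponential(rate) values, preserving ranks."""
    g = np.asarray(g, dtype=float)
    survival = 0.5 * _erfc(g / math.sqrt(2.0))   # 1 - Phi(g), accurate in the upper tail
    return -np.log(survival) / rate              # exponential inverse CDF applied to Phi(g)

# Two standard Gaussian variables with Pearson correlation rho (illustrative value).
rng = np.random.default_rng(0)
rho, n = 0.8, 100_000
g1 = rng.standard_normal(n)
g2 = rho * g1 + math.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)

y1, y2 = gaussian_to_exponential(g1), gaussian_to_exponential(g2)
r_gauss = np.corrcoef(g1, g2)[0, 1]
r_exp = np.corrcoef(y1, y2)[0, 1]
print(r_gauss, r_exp)   # the transformed pair is less correlated than the Gaussian pair
```

For a positive Gaussian correlation, a nonlinear monotonic transform of this kind lowers the Pearson correlation, so an algorithm that targets a given correlation in the final series has to start from a somewhat stronger Gaussian correlation.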
ABSTRACT
The 2017 update of NGSmethDB stores whole genome methylomes generated from short-read data sets obtained by whole-genome bisulfite sequencing (WGBS) technology. To generate high-quality methylomes, stringent quality controls were integrated with third-party software, and a two-step mapping process was added to exploit the advantages of the new genome assembly models. The samples were all profiled under constant parameter settings, thus enabling comparative downstream analyses. Besides a significant increase in the number of samples, NGSmethDB now includes two additional data types, which are a valuable resource for the discovery of methylation epigenetic biomarkers: (i) differentially methylated single cytosines; and (ii) methylation segments (i.e. genome regions of homogeneous methylation). The NGSmethDB back-end is now based on MongoDB, a NoSQL hierarchical database using JSON-formatted documents and dynamic schemas, thus accelerating sample comparative analyses. Besides conventional database dumps, track hubs were implemented, which improved database access, visualization in genome browsers and comparative analyses against third-party annotations. In addition, the database can also be accessed through a RESTful API. Lastly, a Python client and a multiplatform virtual machine allow for program-driven access from the user's desktop. This way, private methylation data can be compared to NGSmethDB without the need to upload them to public servers. Database website: http://bioinfo2.ugr.es/NGSmethDB.
Subjects
DNA Methylation , Databases, Nucleic Acid , Animals , Cytosine/metabolism , Genome , Humans
ABSTRACT
Despite the widespread diffusion of nonlinear methods for heart rate variability (HRV) analysis, the presence and the extent to which nonlinear dynamics contribute to short-term HRV are still controversial. This work aims at testing the hypothesis that different types of nonlinearity can be observed in HRV depending on the method adopted and on the physiopathological state. Two entropy-based measures of time series complexity (normalized complexity index, NCI) and regularity (information storage, IS), and a measure quantifying deviations from linear correlations in a time series (Gaussian linear contrast, GLC), are applied to short HRV recordings obtained in young (Y) and old (O) healthy subjects and in myocardial infarction (MI) patients monitored in the resting supine position and in the upright position reached through head-up tilt. The method of surrogate data is employed to detect the presence and quantify the contribution of nonlinear dynamics to HRV. We find that the three measures differ both in their variations across groups and conditions and in the percentage and strength of nonlinear HRV dynamics. NCI and IS display opposite variations, suggesting more complex dynamics in O and MI compared to Y and less complex dynamics during tilt. The strength of nonlinear dynamics is reduced by tilt using all measures in Y, while only GLC detects a significant strengthening of such dynamics in MI. A large percentage of detected nonlinear dynamics is revealed only by the IS measure in the Y group at rest, with a decrease in O and MI and during tilt, while NCI and GLC detect lower percentages in all groups and conditions. While these results suggest that distinct dynamic structures may lie beneath short-term HRV in different physiological states and pathological conditions, the strong dependence on the measure adopted and on its implementation suggests that physiological interpretations should be provided with caution.
Subjects
Heart Rate/physiology , Nonlinear Dynamics , Adult , Entropy , Female , Humans , Male , Middle Aged , Time Factors
ABSTRACT
As the genome carries the historical information of a species' biotic and environmental interactions, analyzing changes in genome structure over time by using powerful statistical physics methods (such as entropic segmentation algorithms, fluctuation analysis in DNA walks, or measures of compositional complexity) provides valuable insights into genome evolution. Nucleotide frequencies tend to vary along the DNA chain, resulting in a hierarchically patchy chromosome structure with heterogeneities at different length scales that range from a few nucleotides to tens of millions of them. Fluctuation analysis reveals that these compositional structures can be classified into three main categories: (1) short-range heterogeneities (below a few kilobase pairs (Kbp)) primarily attributed to the alternation of coding and noncoding regions, interspersed or tandem repeats densities, etc.; (2) isochores, spanning tens to hundreds of tens of Kbp; and (3) superstructures, reaching sizes of tens of megabase pairs (Mbp) or even larger. The obtained isochore and superstructure coordinates in the first complete T2T human sequence are now shared in a public database. In this way, interested researchers can use T2T isochore data, as well as the annotations for different genome elements, to check a specific hypothesis about genome structure. Similarly to other levels of biological organization, a hierarchical compositional structure is prevalent in the genome. Once the compositional structure of a genome is identified, various measures can be derived to quantify the heterogeneity of such structure. The distribution of segment G+C content has recently been proposed as a new genome signature that proves to be useful for comparing complete genomes. Another meaningful measure is the sequence compositional complexity (SCC), which has been used for genome structure comparisons. 
Lastly, we review the recent genome comparisons in species of the ancient phylum Cyanobacteria, conducted by phylogenetic regression of SCC against time, which have revealed positive trends towards higher genome complexity. These findings provide the first evidence for a driven progressive evolution of genome compositional structure.
ABSTRACT
Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text par excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k = 2-9 bp) in human and mouse chromosome sequences, and then check whether clustered words are enriched in the functional part of the genome. First, we found a positive general trend relating clustering level and word enrichment within exons and Transcription Factor Binding Sites (TFBSs), while a much weaker relation exists for repeats, and no relation at all exists for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (or 72.26%) in human and 1385 (or 72.97%) in mouse of the top-clustered 8-mers showed a statistically significant association to either exons or TFBSs, thus strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns, and therefore might help to discriminate between these two gene regions. The clustering of DNA words thus appears as a novel principle to detect functionality in genome sequences.
As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and the improvement of gene and promoter predictions, thus contributing to the quest for function in the genome.
Subjects
DNA/genetics , Models, Genetic , Algorithms , Animals , Base Sequence , Binding Sites/genetics , Cluster Analysis , Exons/genetics , Humans , Introns/genetics , Linguistics , Mice , Species Specificity , Transcription Factors/genetics
ABSTRACT
The present study proposes to measure and quantify the heart rate variability (HRV) changes during effort as a function of the heart rate and to test the capacity of the resulting indices to predict cardiorespiratory fitness measures. To this end, the beat-to-beat cardiac time interval series of 18 adolescent athletes (15.2 ± 2.0 years) measured during a maximal graded effort test were detrended using a dynamical first-order differential equation model. HRV was then calculated as the standard deviation of the detrended RR intervals (SDRR) within successive windows of one minute. The variation of this measure of HRV during exercise is well fitted by an exponential decrease as a function of the heart rate: the SDRR halves for every 20 beats/min increase in heart rate. The HR increase necessary to halve the HRV is linearly and inversely correlated with the maximum oxygen consumption (r = -0.60, p = 0.006), the maximal aerobic power (r = -0.62, p = 0.006), and, to a lesser extent, with the power at the ventilatory thresholds (r = -0.53, p = 0.02 and r = -0.47, p = 0.05 for the first and second threshold). This indicates that the decrease of the HRV when the heart rate increases is faster among athletes with better fitness. This analysis, based only on cardiac measurements, provides a promising tool for the study of cardiac measurements generated by portable devices.
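The exponential relation stated in this abstract can be written out explicitly. The symbols below are assumptions introduced for illustration: HR_0 is a reference heart rate, SDRR_0 is the SDRR at that heart rate, and Δ_{1/2} is the heart rate increase needed to halve the SDRR (about 20 beats/min according to the abstract).

```latex
\mathrm{SDRR}(\mathrm{HR})
  = \mathrm{SDRR}_0 \, 2^{-(\mathrm{HR}-\mathrm{HR}_0)/\Delta_{1/2}}
  = \mathrm{SDRR}_0 \, \exp\!\left[-\frac{\ln 2}{\Delta_{1/2}}\,(\mathrm{HR}-\mathrm{HR}_0)\right],
\qquad \Delta_{1/2} \approx 20\ \text{beats/min}
```

With Δ_{1/2} = 20 beats/min the decay rate is ln 2 / 20 ≈ 0.035 per beat/min, so an increase of 60 beats/min reduces the SDRR by a factor 2^3 = 8. Under this notation, the fitness index discussed in the abstract is the per-athlete value of Δ_{1/2}: a smaller value means a faster loss of HRV as the heart rate rises.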
Subjects
Cardiorespiratory Fitness , Adolescent , Exercise/physiology , Exercise Test , Heart Rate/physiology , Humans , Oxygen Consumption/physiology
ABSTRACT
BACKGROUND: Unmethylated stretches of CpG dinucleotides (CpG islands) are an outstanding property of mammal genomes. Conventionally, these regions are detected by sliding window approaches using %G + C, CpG observed/expected ratio and length thresholds as main parameters. Recently, clustering methods directly detect clusters of CpG dinucleotides as a statistical property of the genome sequence. RESULTS: We compare sliding-window to clustering (i.e. CpGcluster) predictions by applying new ways to detect putative functionality of CpG islands. Analyzing the co-localization with several genomic regions as a function of window size vs. statistical significance (p-value), CpGcluster shows a higher overlap with promoter regions and highly conserved elements, at the same time showing less overlap with Alu retrotransposons. The major difference in the prediction was found for short islands (CpG islets), often exclusively predicted by CpGcluster. Many of these islets seem to be functional, as they are unmethylated, highly conserved and/or located within the promoter region. Finally, we show that window-based islands can spuriously overlap several, differentially regulated promoters as well as different methylation domains, which might indicate a wrong merge of several CpG islands into a single, very long island. The shorter CpGcluster islands seem to be much more specific when concerning the overlap with alternative transcription start sites or the detection of homogenous methylation domains. CONCLUSIONS: The main difference between sliding-window approaches and clustering methods is the length of the predicted islands. Short islands, often differentially methylated, are almost exclusively predicted by CpGcluster. This suggests that CpGcluster may be the algorithm of choice to explore the function of these short, but putatively functional CpG islands.
Subjects
Algorithms , CpG Islands , Alu Elements/genetics , Cluster Analysis , Conserved Sequence/genetics , DNA Methylation/genetics , Evolution, Molecular , Humans , Promoter Regions, Genetic/genetics
ABSTRACT
In a recent work [Phys. Rev. E 97, 030202(R) (2018), doi:10.1103/PhysRevE.97.030202], Sakhr and Nieminen (SN) resolved a hypothesis formulated two decades ago, according to which the local box-counting dimension D_{box}(r) of a given energy spectrum, or more generally of a discrete set, should depend exclusively on the nearest-neighbor spacing distribution P(s) of the spectrum (set). SN found this dependence analytically, which led them to obtain closed formulas for the local box-counting dimension of Poisson spectra and of spectra belonging to Gaussian orthogonal, unitary, and symplectic ensembles. Here, first, we present a different derivation of the equation establishing the connection between D_{box}(r) and P(s) using the concept of surrogate spectrum. Although our equation is formally different from the SN result, we prove that both are equivalent. Second, we apply our equation to solve the inverse problem of determining the functional form of P(s) for spectra with real fractal structure and constant box-counting dimension D_{box}, and we find that P(s) should behave as a power law of the spacing, with an exponent given by -(1+D_{box}). Finally, we present four applications or consequences of this last result: First, we provide a simple algorithm able to generate random fractal spectra with prescribed and constant D_{box}. Second, we calculate D_{box} for the sets given by the zeros of fractional Brownian motions, whose P(s) is known to have a power-law tail. Third, we also study D_{box}(r) for the zeros of fractional Gaussian noises, whose P(s) is known to present fat (but not power-law) tails, and which could be misinterpreted as real fractals. And finally, we present the calculation of D_{box} for the spectra of Fibonacci Hamiltonians, known to have fractal properties, simply by fitting their corresponding P(s) to a power law, without the need to apply a box-counting algorithm.
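The power-law result for P(s) suggests a simple way to build a random spectrum, sketched below under explicit assumptions: spacings are drawn independently from a Pareto law P(s) ∝ s^-(1+D_box) for s ≥ s_min, with 0 < D_box < 1, and the levels are their cumulative sum. The function name, the lower cutoff `s_min`, and the independence of the spacings are illustrative choices; the abstract does not spell out the authors' algorithm, and the box-counting dimension of the output is not verified here.

```python
import numpy as np

def random_fractal_spectrum(n_levels, d_box, s_min=1.0, seed=None):
    """Levels whose nearest-neighbor spacings follow P(s) ~ s^-(1 + d_box) for s >= s_min."""
    rng = np.random.default_rng(seed)
    u = 1.0 - rng.random(n_levels - 1)           # uniform on (0, 1]
    spacings = s_min * u ** (-1.0 / d_box)       # inverse-CDF sampling of a Pareto law
    return np.concatenate(([0.0], np.cumsum(spacings)))

levels = random_fractal_spectrum(50_001, d_box=0.5, seed=0)
spacings = np.diff(levels)
# For a Pareto law, P(S > 2 * s_min) = 2 ** (-d_box); compare with the sample fraction.
print(np.mean(spacings > 2.0), 2.0 ** -0.5)
```

Fitting the tail of the empirical spacing distribution of a given spectrum to a power law, as the abstract proposes for Fibonacci Hamiltonians, would be the inverse use of the same relation.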
ABSTRACT
BACKGROUND: The phylogenetic distribution of large-scale genome structure (i.e. mosaic compositional patchiness) has been explored mainly by analytical ultracentrifugation of bulk DNA. However, with the availability of large, good-quality chromosome sequences, and the recently developed computational methods to directly analyze patchiness on the genome sequence, an evolutionary comparative analysis can be carried out at the sequence level. RESULTS: The local variations in the scaling exponent of the Detrended Fluctuation Analysis are used here to analyze large-scale genome structure and directly uncover the characteristic scales present in genome sequences. Furthermore, through shuffling experiments of selected genome regions, computationally-identified, isochore-like regions were identified as the biological source for the uncovered large-scale genome structure. The phylogenetic distribution of short- and large-scale patchiness was determined in the best-sequenced genome assemblies from eleven eukaryotic genomes: mammals (Homo sapiens, Pan troglodytes, Mus musculus, Rattus norvegicus, and Canis familiaris), birds (Gallus gallus), fishes (Danio rerio), invertebrates (Drosophila melanogaster and Caenorhabditis elegans), plants (Arabidopsis thaliana) and yeasts (Saccharomyces cerevisiae). We found large-scale patchiness of genome structure, associated with in silico determined, isochore-like regions, throughout this wide phylogenetic range. CONCLUSION: Large-scale genome structure is detected by directly analyzing DNA sequences in a wide range of eukaryotic chromosome sequences, from human to yeast. In all these genomes, large-scale patchiness can be associated with the isochore-like regions, as directly detected in silico at the sequence level.
Subjects
Genome/genetics , Isochores/genetics , Phylogeny , Animals , Arabidopsis/genetics , Computational Biology , Dogs , Genome, Fungal/genetics , Genome, Human/genetics , Genome, Plant/genetics , Humans , Mice , Pan troglodytes/genetics , Rats , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA , Species Specificity
ABSTRACT
OBJECTIVE: In this work we want to analyze differences in nonlinear properties between rest and exercise and also to study the permanent effects of physical exercise on heart rate dynamics. APPROACH: It has been shown that physical exercise alters heart dynamics by increasing heart rate and decreasing variability, modifying spectral power and linear correlations, etc. We hypothesize that physical exercise should also reduce nonlinearity in the heartbeat time series. To quantify nonlinearity in the heartbeat time series, we use an index of nonlinearity recently proposed by Bernaola et al based on correlations of the magnitude time series. MAIN RESULTS: Our results confirm our initial hypothesis of loss of nonlinearity during physical exercise. Moreover, regarding the permanent effects of physical exercise on heart rate dynamics, we also obtain that aerobic physical training tends to increase nonlinearity in heart dynamics during rest. SIGNIFICANCE: It is well-known that heart dynamics are controlled by complex interactions between the sympathetic and parasympathetic branches of the autonomic nervous system. Moreover, these two branches act in a competing way, resulting in a clear parasympathetic withdrawal and sympathetic activation during physical exercise. We associate these interactions during physical exercise with a drastic loss of nonlinear properties in the heartbeat time series, revealing the importance of nonlinearity measures in the study of complex systems.
Subjects
Exercise/physiology , Heart/physiology , Nonlinear Dynamics , Rest/physiology , Adult , Heart Rate , Humans , Male
ABSTRACT
The correlation properties of the magnitude of a time series are associated with nonlinear and multifractal properties and have been applied in a great variety of fields. Here we have obtained the analytical expression of the autocorrelation of the magnitude series (C_{|x|}) of a linear Gaussian noise as a function of its autocorrelation (C_{x}). For both models and natural signals, the deviation of C_{|x|} from its expectation in linear Gaussian noises can be used as an index of nonlinearity that can be applied to relatively short records and does not require the presence of scaling in the time series under study. In a model of artificial Gaussian multifractal signal we use this approach to analyze the relation between nonlinearity and multifractality and show that the former implies the latter but the reverse is not true. We also apply this approach to analyze experimental data: heart-beat records during rest and moderate exercise. For each individual subject, we observe higher nonlinearities during rest. This behavior is also observed on average for the analyzed set of 10 semiprofessional soccer players. This result agrees with the fact that other measures of complexity are dramatically reduced during exercise and can shed light on its relationship with the withdrawal of parasympathetic tone and/or the activation of sympathetic activity during physical activity.
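As an illustration of the relation between C_{|x|} and C_{x}, the sketch below uses the standard closed form for the correlation between the absolute values of two jointly Gaussian, zero-mean variables with correlation c. This is a textbook result; the abstract does not give the paper's expression, which may be written differently. The function name and the Monte Carlo check are illustrative assumptions.

```python
import numpy as np

def magnitude_correlation(c):
    """Correlation of |x1| and |x2| for zero-mean jointly Gaussian x1, x2 with correlation c."""
    c = np.asarray(c, dtype=float)
    return 2.0 * (np.sqrt(1.0 - c ** 2) + c * np.arcsin(c) - 1.0) / (np.pi - 2.0)

# Monte Carlo check with an illustrative correlation value.
rng = np.random.default_rng(0)
c, n = 0.6, 200_000
x1 = rng.standard_normal(n)
x2 = c * x1 + np.sqrt(1.0 - c ** 2) * rng.standard_normal(n)
empirical = np.corrcoef(np.abs(x1), np.abs(x2))[0, 1]
print(float(magnitude_correlation(c)), empirical)
```

In the spirit of the abstract, a signal whose measured magnitude autocorrelation departs clearly from `magnitude_correlation` evaluated at its measured C_{x} would be a candidate for nonlinearity.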
Subjects
Fractals , Models, Theoretical , Nonlinear Dynamics , Athletes , Heart Rate , Humans , Male , Rest/physiology , Running/physiology , Soccer , Time Factors , Young Adult
ABSTRACT
BACKGROUND: Despite their involvement in the regulation of gene expression and their importance as genomic markers for promoter prediction, no objective standard exists for defining CpG islands (CGIs), since all current approaches rely on a large parameter space formed by the thresholds of length, CpG fraction and G+C content. RESULTS: Given the higher frequency of CpG dinucleotides at CGIs, as compared to bulk DNA, the distance distributions between neighboring CpGs should differ for bulk and island CpGs. A new algorithm (CpGcluster) is presented, based on the physical distance between neighboring CpGs on the chromosome and able to predict directly clusters of CpGs, while not depending on the subjective criteria mentioned above. By assigning a p-value to each of these clusters, the most statistically significant ones can be predicted as CGIs. CpGcluster was benchmarked against five other CGI finders by using a test sequence set assembled from an experimental CGI library. CpGcluster reached the highest overall accuracy values, while showing the lowest rate of false-positive predictions. Since a minimum-length threshold is not required, CpGcluster can find short but fully functional CGIs usually missed by other algorithms. The CGIs predicted by CpGcluster present the lowest degree of overlap with Alu retrotransposons and, simultaneously, the highest overlap with vertebrate Phylogenetic Conserved Elements (PhastCons). CpGcluster's CGIs overlapping with the Transcription Start Site (TSS) show the highest statistical significance, as compared to the islands in other genome locations, thus qualifying CpGcluster as a valuable tool in discriminating functional CGIs from the remaining islands in the bulk genome. CONCLUSION: CpGcluster uses only integer arithmetic, thus being a fast and computationally efficient algorithm able to predict statistically significant clusters of CpG dinucleotides. 
Another outstanding feature is that all predicted CGIs start and end with a CpG dinucleotide, which should be appropriate for a genomic feature whose functionality is based precisely on CpG dinucleotides. The only search parameter in CpGcluster is the distance between two consecutive CpGs, in contrast to previous algorithms. Therefore, none of the main statistical properties of CpG islands (neither G+C content, CpG fraction nor length threshold) are needed as search parameters, which may lead to the high specificity and low overlap with spurious Alu elements observed for CpGcluster predictions.
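The distance-based idea can be sketched in a few lines. This is an illustrative simplification, not the CpGcluster implementation: the function names, the `max_dist` and `min_cpgs` parameters, and the way the distance is measured (difference between CpG start positions) are assumptions, and the p-value assignment mentioned in the abstract is left out.

```python
def cpg_positions(seq):
    """0-based start positions of every CpG dinucleotide in `seq`."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

def cluster_cpgs(seq, max_dist, min_cpgs=2):
    """Group consecutive CpGs separated by at most `max_dist` bases.

    Returns (start, end, n_cpgs) tuples with half-open coordinates, so every
    cluster starts and ends with a CpG dinucleotide.
    """
    clusters, current = [], []
    for pos in cpg_positions(seq):
        if current and pos - current[-1] > max_dist:
            if len(current) >= min_cpgs:
                clusters.append((current[0], current[-1] + 2, len(current)))
            current = []                      # start a new candidate cluster
        current.append(pos)
    if len(current) >= min_cpgs:
        clusters.append((current[0], current[-1] + 2, len(current)))
    return clusters

seq = "CGACGTTCG" + "A" * 20 + "CGCG"
print(cluster_cpgs(seq, max_dist=5))
```

Because the only search parameter is the inter-CpG distance, no window size, G+C content or minimum length enters the search, which is the design choice the abstract emphasizes.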
Subjects
Algorithms , CpG Islands/genetics , Animals , Genome/genetics , Humans , Mice
ABSTRACT
The specific heat corresponding to systems with deterministic fractal energy spectra is known to present logarithmic-periodic oscillations as a function of the temperature T in the low T region around a mean value given by a characteristic dimension of the energy spectrum. In general, it is considered that the presence of disorder does not affect strongly these results, and that the fractal structure of the energy spectrum dominates. In this paper, we study the properties of the specific heat derived from random fractal energy spectra as a function of the degree of disorder present in the spectra. To study the influence of the disorder, we analyze the specific heat using three different properties: the specific heat mean value and the periods and amplitudes of the oscillations of the specific heat around its mean value. By studying the distributions and the mean values of these three properties, we obtain that the disorder does not influence very much the mean value of the specific heat. However, concerning the behavior of periods and amplitudes, we obtain a critical value of the disorder present in the energy spectra. Below this critical value, we find a low effect of the disorder and quasideterministic behavior indicating that the fractal structure is the dominant effect, but above the critical value, the disorder dominates and the behavior of the specific heat is practically chaotic.
ABSTRACT
Isochores are long genome segments homogeneous in G+C. Here, we describe an algorithm (IsoFinder) running on the web (http://bioinfo2.ugr.es/IsoF/isofinder.html) able to predict isochores at the sequence level. We move a sliding pointer from left to right along the DNA sequence. At each position of the pointer, we compute the mean G+C values to the left and to the right of the pointer. We then determine the position of the pointer for which the difference between left and right mean values (as measured by the t-statistic) reaches its maximum. Next, we determine the statistical significance of this potential cutting point, after filtering out short-scale heterogeneities below 3 kb by applying a coarse-graining technique. Finally, the program checks whether this significance exceeds a probability threshold. If so, the sequence is cut at this point into two subsequences; otherwise, the sequence remains undivided. The procedure continues recursively for each of the two resulting subsequences created by each cut. This leads to the decomposition of a chromosome sequence into long homogeneous genome regions (LHGRs) with well-defined mean G+C contents, each significantly different from the G+C contents of the adjacent LHGRs. Most LHGRs can be identified with Bernardi's isochores, given their correlation with biological features such as gene density, SINE and LINE (short, long interspersed repetitive elements) densities, recombination rate or single nucleotide polymorphism variability. The resulting isochore maps are available at our web site (http://bioinfo2.ugr.es/isochores/), and also at the UCSC Genome Browser (http://genome.cse.ucsc.edu/).
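The sliding-pointer step can be sketched as follows. This is a minimal illustration under stated assumptions, not the IsoFinder code: it uses a pooled-variance two-sample t-statistic on a 0/1 G+C indicator, the `min_len` margin and the function name `best_cut` are hypothetical, and the coarse-graining filter, the significance test and the recursion described in the abstract are omitted.

```python
import numpy as np

def best_cut(seq, min_len=10):
    """Return (cut, t_max): the pointer position maximizing the left/right t-statistic.

    `cut` is the number of bases to the left of the pointer.
    """
    gc = np.array([1.0 if base in "GC" else 0.0 for base in seq.upper()])
    n = len(gc)
    cum = np.cumsum(gc)
    cuts = np.arange(min_len, n - min_len + 1)
    n1, n2 = cuts, n - cuts
    sum1 = cum[cuts - 1]                      # G+C count left of the pointer
    sum2 = cum[-1] - sum1                     # G+C count right of the pointer
    m1, m2 = sum1 / n1, sum2 / n2
    # gc is 0/1, so the sum of squares equals the sum; clip rounding noise at zero
    ss1 = np.maximum(sum1 - n1 * m1 ** 2, 0.0)
    ss2 = np.maximum(sum2 - n2 * m2 ** 2, 0.0)
    pooled_var = (ss1 + ss2) / (n - 2)
    se = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    diff = np.abs(m1 - m2)
    t = np.zeros_like(diff)
    ok = se > 0
    t[ok] = diff[ok] / se[ok]
    t[~ok & (diff > 0)] = np.inf              # perfectly homogeneous halves that differ
    best = int(np.argmax(t))
    return int(cuts[best]), float(t[best])

print(best_cut("A" * 50 + "G" * 50))
```

A full segmentation would test the significance of the returned cut and, if it passes, call the same function recursively on the two resulting subsequences.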
Subjects
Computational Biology , Genomics , Isochores/chemistry , Software , Algorithms , Computer Graphics , Internet , Major Histocompatibility Complex , User-Computer Interface
ABSTRACT
Symbolic sequences have been extensively investigated in the past few years within the framework of statistical physics. Paradigmatic examples of such sequences are written texts, and deoxyribonucleic acid (DNA) and protein sequences. In these examples, the spatial distribution of a given symbol (a word, a DNA motif, an amino acid) is a key property usually related to the symbol importance in the sequence: The more uneven and far from random the symbol distribution, the higher the relevance of the symbol to the sequence. Thus, many techniques of analysis measure in some way the deviation of the symbol spatial distribution with respect to the random expectation. The problem is then to know the spatial distribution corresponding to randomness, which is typically considered to be either the geometric or the exponential distribution. However, these distributions are only valid for very large symbolic sequences and for many occurrences of the analyzed symbol. Here, we obtain analytically the exact, randomly expected spatial distribution valid for any sequence length and any symbol frequency, and we study its main properties. The knowledge of the distribution allows us to define a measure able to properly quantify the deviation from randomness of the symbol distribution, especially for short sequences and low symbol frequency. We apply the measure to the problem of keyword detection in written texts and to study amino acid clustering in protein sequences. In texts, we show how the results improve with respect to previous methods when short texts are analyzed. In proteins, which are typically short, we show how the measure quantifies unambiguously the amino acid clustering and characterize its spatial distribution.
Subjects
Amino Acids/chemistry , Computational Biology/methods , Models, Theoretical , Probability , Algorithms , Amino Acid Sequence , Cluster Analysis , Periodicity , Proteins/chemistry , Sequence Analysis
ABSTRACT
We systematically study the scaling properties of the magnitude and sign of the fluctuations in correlated time series, which is a simple and useful approach to distinguish between systems with different dynamical properties but the same linear correlations. First, we decompose artificial long-range power-law linearly correlated time series into magnitude and sign series derived from the consecutive increments in the original series, and we study their correlation properties. We find analytical expressions for the correlation exponent of the sign series as a function of the exponent of the original series. Such expressions are necessary for modeling surrogate time series with desired scaling properties. Next, we study linear and nonlinear correlation properties of series composed as products of independent magnitude and sign series. These surrogate series can be considered as a zero-order approximation to the analysis of the coupling of magnitude and sign in real data, a problem still open in many fields. We find analytical results for the scaling behavior of the composed series as a function of the correlation exponents of the magnitude and sign series used in the composition, and we determine the ranges of magnitude and sign correlation exponents leading to either single scaling or to crossover behaviors. Finally, we obtain how the linear and nonlinear properties of the composed series depend on the correlation exponents of their magnitude and sign series. Based on this information we propose a method to generate surrogate series with controlled correlation exponent and multifractal spectrum.
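The decomposition and composition steps mentioned in this abstract can be sketched briefly. The function names and the random-walk test signal are illustrative assumptions; the shuffled-sign surrogate is only one simple way to pair a magnitude series with an independent sign series and does not reproduce the controlled correlation exponents studied in the paper.

```python
import numpy as np

def magnitude_sign(x):
    """Split the consecutive increments of `x` into magnitude and sign series."""
    inc = np.diff(np.asarray(x, dtype=float))
    return np.abs(inc), np.sign(inc)

def compose(magnitude, sign):
    """Increment series built as the element-wise product of magnitude and sign."""
    return np.asarray(magnitude) * np.asarray(sign)

rng = np.random.default_rng(0)
x = np.cumsum(rng.standard_normal(1000))      # illustrative signal
mag, sgn = magnitude_sign(x)

# Surrogate increments: original magnitudes paired with a shuffled sign series.
surrogate = compose(mag, rng.permutation(sgn))
print(np.allclose(compose(mag, sgn), np.diff(x)))
```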
Subjects
Linear Models, Nonlinear Dynamics, Algorithms, Fourier Analysis, Fractals, Time Factors
ABSTRACT
When investigating the dynamical properties of complex multiple-component physical and physiological systems, it is often the case that the measurable system's output does not directly represent the quantity we want to probe in order to understand the underlying mechanisms. Instead, the output signal is often a linear or nonlinear function of the quantity of interest. Here, we investigate how various linear and nonlinear transformations affect the correlation and scaling properties of a signal, using the detrended fluctuation analysis (DFA), which has been shown to accurately quantify power-law correlations in nonstationary signals. Specifically, we study the effect of three types of transforms: (i) linear ($y_i = a x_i + b$), (ii) nonlinear polynomial ($y_i = a x_i^k$), and (iii) nonlinear logarithmic [$y_i = \log(x_i + \Delta)$] filters. We compare the correlation and scaling properties of signals before and after the transform. We find that linear filters do not change the correlation properties, while the effect of nonlinear polynomial and logarithmic filters strongly depends on (a) the strength of correlations in the original signal, (b) the power $k$ of the polynomial filter, and (c) the offset $\Delta$ in the logarithmic filter. We further apply the DFA method to investigate the "apparent" scaling of three analytic functions: (i) exponential [$\exp(\pm x + a)$], (ii) logarithmic [$\log(x + a)$], and (iii) power law [$(x + a)^{\lambda}$], which are often encountered as trends in physical and biological processes. While these three functions have different characteristics, we find that there is a broad range of values for parameter $a$ common for all three functions, where the slope of the DFA curves is identical. We further note that the DFA results obtained for a class of other analytic functions can be reduced to these three typical cases. We systematically test the performance of the DFA method when estimating long-range power-law correlations in the output signals for different parameter values in the three types of filters and the three analytic functions we consider.
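The statement that linear filters leave the correlation properties unchanged can be checked numerically. The sketch below is a plain-Python version of first-order DFA with non-overlapping windows, written for illustration only: the function names, the choice of scales, the white-noise test signal, and the filter parameters (a = 2.5, b = 7) are all assumptions, not details taken from the paper.

```python
import math
import random


def linear_fit(xs, ys):
    """Least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def dfa_fluctuation(signal, scale):
    """RMS fluctuation F(scale) of the integrated signal around local linear trends."""
    mean = sum(signal) / len(signal)
    profile, total = [], 0.0
    for v in signal:  # integrate the mean-subtracted signal
        total += v - mean
        profile.append(total)
    n_windows = len(profile) // scale
    xs = list(range(scale))
    squared = 0.0
    for w in range(n_windows):  # detrend each non-overlapping window
        seg = profile[w * scale:(w + 1) * scale]
        slope, intercept = linear_fit(xs, seg)
        squared += sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, seg))
    return math.sqrt(squared / (n_windows * scale))


def dfa_exponent(signal, scales):
    """Slope of log F(scale) versus log(scale)."""
    log_s = [math.log(s) for s in scales]
    log_f = [math.log(dfa_fluctuation(signal, s)) for s in scales]
    return linear_fit(log_s, log_f)[0]


if __name__ == "__main__":
    random.seed(0)
    x = [random.gauss(0.0, 1.0) for _ in range(4096)]  # uncorrelated test signal
    y = [2.5 * v + 7.0 for v in x]                     # linear filter y_i = a x_i + b
    scales = [8, 16, 32, 64, 128]
    print(dfa_exponent(x, scales), dfa_exponent(y, scales))
```

The two printed exponents should agree up to rounding error, because the offset b is removed with the mean and the factor a only shifts log F by a constant. Swapping the linear filter for a polynomial or logarithmic one is a direct way to explore the other cases the abstract mentions.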
Subjects
Algorithms, Biological Models, Statistical Models, Nonlinear Dynamics, Animals, Computer Simulation, Humans, Statistics as Topic
ABSTRACT
The sequencing of prokaryotic genomes covering a wide taxonomic range has sparked renewed interest in intrachromosomal compositional (GC) heterogeneity, largely in view of lateral transfers. We present here a brief overview of some methods for visualizing and quantifying GC variation in prokaryotes. We used these methods to examine heterogeneity levels in sequenced prokaryotes, for a range of scales or stringencies. Some species are consistently homogeneous, whereas others are markedly heterogeneous in comparison, in particular Aeropyrum pernix, Xylella fastidiosa, Mycoplasma genitalium, Enterococcus faecalis, Bacillus subtilis, Pyrobaculum aerophilum, Vibrio vulnificus chromosome I, Deinococcus radiodurans chromosome II and Halobacterium. As we discuss here, the wide range of heterogeneities calls for reexamination of an accepted belief, namely that the endogenous DNA of bacteria and archaea should typically exhibit low intrachromosomal GC contrasts. Supplementary results for all species analyzed are available at our website: http://bioinfo2.ugr.es/prok.
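One simple way to look at intrachromosomal GC variation at a chosen scale is to compute the GC fraction in consecutive windows. The Python sketch below is an illustration only and is not claimed to be one of the methods reviewed in the paper; the function names and the use of non-overlapping windows are assumptions.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of `seq`."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else float("nan")


def windowed_gc(seq, window):
    """GC fraction in consecutive non-overlapping windows of length `window`."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


if __name__ == "__main__":
    toy = "GGCCAATTGCAT"  # toy sequence, not real genomic data
    profile = windowed_gc(toy, 4)
    print(profile)
    print(max(profile) - min(profile))  # crude spread of GC across windows
```

Repeating the calculation for several window sizes gives a rough picture of how the apparent heterogeneity depends on the scale of observation.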
Subjects
Base Composition/genetics, Bacterial DNA/genetics, Bacterial Genome, Algorithms, Base Pairing/genetics, Density Gradient Centrifugation, Cesium, Chlorides, Archaeal Chromosomes/genetics, Bacterial Chromosomes/genetics, Codon/genetics, Archaeal DNA/chemistry, Archaeal DNA/genetics, Bacterial DNA/chemistry, Archaeal Genome, Isochores/genetics
ABSTRACT
The human genome is a mosaic of isochores, which are long DNA segments (≫300 kbp) relatively homogeneous in G+C. Human isochores were first identified by density-gradient ultracentrifugation of bulk DNA, and differ in important features, e.g. genes are found predominantly in the GC-richest isochores. Here, we use a reliable segmentation method to partition the longest contigs in the human genome draft sequence into long homogeneous genome regions (LHGRs), thereby revealing the isochore structure of the human genome. The advantages of the isochore maps presented here are: (1) sequence heterogeneities at different scales are shown in the same plot; (2) pair-wise compositional differences between adjacent regions are all statistically significant; (3) isochore boundaries are accurately defined to single base pair resolution; and (4) both gradual and abrupt isochore boundaries are simultaneously revealed. Taking advantage of the wide sample of genome sequence analyzed, we investigate the correspondence between LHGRs and true human isochores revealed through DNA centrifugation. LHGRs show many of the typical isochore features, mainly size distribution, G+C range, and proportions of the isochore classes. The relative density of genes, Alu and long interspersed nuclear element repeats and the different types of single nucleotide polymorphisms on LHGRs also coincide with expectations in true isochores. Potential applications of isochore maps range from the improvement of gene-finding algorithms to the prediction of linkage disequilibrium levels in association studies between marker genes and complex traits. The coordinates for the LHGRs identified in all the contigs longer than 2 Mb in the human genome sequence are available at the online resource on isochore mapping: http://bioinfo2.ugr.es/isochores.