1.
BMC Bioinformatics ; 17: 59, 2016 Feb 03.
Article in English | MEDLINE | ID: mdl-26842742

ABSTRACT

BACKGROUND: Chargaff's second parity rule and its extensions are recognized as universal phenomena in DNA sequences. However, parity of the frequencies of reverse complementary oligonucleotides could be a mere consequence of the single nucleotide parity rule, if nucleotide independence is assumed. Exceptional symmetry (symmetry beyond that expected under an independent nucleotide assumption) was proposed previously as a meaningful measure of the extension of the second parity rule to oligonucleotides. Global exceptional symmetry was detected in long and short genomes. RESULTS: To explore the exceptional genomic word symmetry along the genome sequences, we propose a sliding window method to extract the values of exceptional symmetry (for all words or by word groups). We compare the exceptional symmetry effect size distribution in all human chromosomes against control scenarios (positive and negative controls), testing the differences and performing a residual analysis. We explore local exceptional symmetry in equivalent composition word groups, and find that the behaviour of the local exceptional symmetry depends on the word group. CONCLUSIONS: We conclude that exceptional symmetry is a local phenomenon in genome sequences, with distinct characteristics along the sequence of each chromosome. The local exceptional symmetry along the genomic sequences shows outlying segments, and those segments have high biological annotation density.


Subjects
Chromosomes, Human/genetics ; DNA/genetics ; Genome, Human ; Models, Genetic ; Models, Statistical ; Genomics ; Humans ; Transcriptome
2.
Biostatistics ; 16(2): 209-21, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25190514

ABSTRACT

Some previous studies suggest the extension of Chargaff's second rule (the phenomenon of symmetry in a single DNA strand) to long DNA words. However, in random sequences generated under an independent symbol model where complementary nucleotides have equal occurrence probabilities, we expect the phenomenon of symmetry to hold for any word length. In this work, we develop new statistical methods to measure the exceptional symmetry. Exceptional symmetry is a refinement of Chargaff's second parity rule that highlights the words whose frequency of occurrence is similar to that of its reversed complement but dissimilar to the frequencies of occurrence of other words which contain the same number of nucleotides A or T. We analyze words of lengths up to 12 in the complete human genome and in each chromosome separately. We assess exceptional symmetry globally, by word group, and by word. We conclude that the global symmetry present in the human genome is clearly exceptional and significant. The chromosomes present distinct exceptional symmetry profiles. There are several exceptional word groups and exceptional words with a strong exceptional symmetry.


Subjects
DNA/genetics ; Genome, Human/genetics ; Models, Genetic ; Models, Statistical ; Humans
3.
J Theor Biol ; 335: 153-9, 2013 Oct 21.
Article in English | MEDLINE | ID: mdl-23831271

ABSTRACT

Previous studies have suggested that Chargaff's second rule may hold for relatively long words (above 10 nucleotides), but this has not been conclusively shown. In particular, the following questions remain open: Is the phenomenon of symmetry statistically significant? If so, what is the word length above which significance is lost? Can deviations in symmetry due to the finite size of the data be identified? This work addresses these questions by studying word symmetries in the human genome, chromosomes and transcriptome. To rule out finite-length effects, the results are compared with those obtained from random control sequences built to satisfy Chargaff's second parity rule. We use several techniques to evaluate the phenomenon of symmetry, including Pearson's correlation coefficient, total variation distance, a novel word symmetry distance, as well as traditional and equivalence statistical tests. We conclude that word symmetries are statistically significant in the human genome for word lengths up to 6 nucleotides. For longer words, we present evidence that the phenomenon may not be as prevalent as previously thought.


Subjects
Chromosomes, Human/genetics ; Genome, Human/physiology ; Models, Genetic ; Chromosomes, Human/metabolism ; Humans ; Transcriptome/physiology
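
One of the measures named in the abstract above, the total variation distance, can be illustrated with a minimal sketch. It compares the frequency of every word of length k with the frequency of its reverse complement. The function names and the normalisation by the total word count are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter

# Translation table for the complement of each nucleotide.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Complement every base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def word_counts(seq: str, k: int) -> Counter:
    """Count all overlapping words of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def symmetry_tvd(seq: str, k: int) -> float:
    """Total variation distance between the word frequencies and the
    frequencies of the corresponding reverse complements (0 = perfect symmetry)."""
    counts = word_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Include reverse complements that never occur, so their zero count is compared too.
    words = set(counts) | {reverse_complement(w) for w in counts}
    diff = sum(abs(counts[w] - counts[reverse_complement(w)]) for w in words)
    return 0.5 * diff / total
```

A sequence made only of `A` has distance 1 for k = 1, because its reverse complement `T` never occurs; a sequence such as `ACGT` has distance 0 for k = 2.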
4.
J Integr Bioinform ; 20(2), 2023 Jun 01.
Article in English | MEDLINE | ID: mdl-37486620

ABSTRACT

This work aims to describe the observed enrichment of inverted repeats in the human genome, and to identify and describe, with detailed length profiles, the regions with a significant and relevant enrichment of inverted repeats. The enrichment is assessed and tested with a recently proposed z-score-based measure. We simulate a genome using an order-7 Markov model trained on the real genome. The simulated genome is used to establish the critical values that serve as decision thresholds for identifying regions with significantly enriched concentrations. Several human genome regions are highly enriched in inverted repeats, and this is observed in all human chromosomes. The distribution of inverted repeat lengths varies along the genome. Most regions with severely exaggerated enrichment contain mainly short inverted repeats. There are also regions with regular peaks along the inverted repeat length distribution (periodic regularities) and, less frequently, regions with exaggerated enrichment for long lengths. However, adjacent regions tend to have similar distributions.
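
The simulation step mentioned in the abstract above (training an order-k Markov model on a real sequence and sampling a synthetic one from it) might look roughly like the following sketch. The function names, the uniform fallback for unseen contexts, and the explicit start context are assumptions for illustration; the abstract does not describe these details.

```python
import random
from collections import Counter, defaultdict


def train_markov(seq: str, k: int):
    """Count, for every context of length k, how often each symbol follows it."""
    model = defaultdict(Counter)
    for i in range(k, len(seq)):
        model[seq[i - k:i]][seq[i]] += 1
    return model


def simulate(model, k: int, length: int, start: str, rng: random.Random,
             alphabet: str = "ACGT") -> str:
    """Sample a sequence of the given length, beginning with the context `start`."""
    if len(start) != k:
        raise ValueError("start must have exactly k symbols")
    out = list(start)
    while len(out) < length:
        context = "".join(out[-k:]) if k else ""
        followers = model.get(context)
        if followers:
            symbols, weights = zip(*followers.items())
            out.append(rng.choices(symbols, weights=weights)[0])
        else:
            # Assumption: contexts never seen in training fall back to a uniform draw.
            out.append(rng.choice(alphabet))
    return "".join(out[:length])
```

For the setting described in the abstract one would call `train_markov(genome, 7)`; passing a seeded `random.Random` keeps the simulated sequence reproducible.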

5.
J Theor Biol ; 275(1): 52-8, 2011 Apr 21.
Article in English | MEDLINE | ID: mdl-21295040

ABSTRACT

DNA may be represented by sequences of four symbols, but it is often useful to convert those symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but most of them seem to be unrelated to any intrinsic characteristic of DNA. The objective of this work was to study a mapping scheme that is directly related to DNA characteristics, and that could be useful in discriminating between different species. Recently, we have proposed a methodology based on the inter-nucleotide distance, which proved to contribute to the discrimination among species. In this paper, we introduce a new distance, the distance to the nearest dissimilar nucleotide, which is the distance from a nucleotide to the first occurrence of a different nucleotide. This distance is related to the repetition structure of single nucleotides. Using the information resulting from the concatenation of the distance to the nearest dissimilar nucleotide and the inter-nucleotide distance, we found that this new distance brings additional discriminative capabilities. This suggests that the distance to the nearest dissimilar nucleotide might contribute useful information about the evolution of the species.


Subjects
Genome/genetics ; Models, Genetic ; Nucleotides/genetics ; Animals ; Base Sequence ; Humans ; Molecular Sequence Data ; Phylogeny ; Sequence Alignment ; Species Specificity
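
The distance to the nearest dissimilar nucleotide defined in the abstract above can be sketched in a few lines. The forward scanning direction and the use of 0 to mark positions with no later dissimilar symbol are assumptions made for this illustration; the abstract does not fix either choice.

```python
def nearest_dissimilar_distances(seq: str) -> list[int]:
    """For each position, the distance to the next position holding a
    different symbol (0 when no such position exists; an assumed convention)."""
    n = len(seq)
    distances = [0] * n
    next_diff = n  # index of the nearest later position with a different symbol
    for i in range(n - 1, -1, -1):
        if i + 1 < n and seq[i + 1] != seq[i]:
            next_diff = i + 1
        # Inside a run of equal symbols, next_diff is inherited from position i + 1.
        distances[i] = next_diff - i if next_diff < n else 0
    return distances
```

For `AAACGG` the result is `[3, 2, 1, 1, 0, 0]`: each `A` counts down towards the `C`, which shows how the measure reflects the run length of single nucleotides.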
6.
Bioinformatics ; 25(23): 3064-70, 2009 Dec 01.
Article in English | MEDLINE | ID: mdl-19759198

ABSTRACT

MOTIVATION: DNA sequences can be represented by sequences of four symbols, but it is often useful to convert the symbols into real or complex numbers for further analysis. Several mapping schemes have been used in the past, but they seem unrelated to any intrinsic characteristic of DNA. The objective of this work was to find a mapping scheme directly related to DNA characteristics and that would be useful in discriminating between different species. Mathematical models to explore DNA correlation structures may contribute to a better knowledge of the DNA and to find a concise DNA description. RESULTS: We developed a methodology to process DNA sequences based on inter-nucleotide distances. Our main contribution is a method to obtain genomic signatures for complete genomes, based on the inter-nucleotide distances, that are able to discriminate between different species. Using these signatures and hierarchical clustering, it is possible to build phylogenetic trees. Phylogenetic trees lead to genome differentiation and allow the inference of phylogenetic relations. The phylogenetic trees generated in this work display related species close to each other, suggesting that the inter-nucleotide distances are able to capture essential information about the genomes. To create the genomic signature, we construct a vector which describes the inter-nucleotide distance distribution of a complete genome and compare it with the reference distance distribution, which is the distribution of a sequence where the nucleotides are placed randomly and independently. It is the residual or relative error between the data and the reference distribution that is used to compare the DNA sequences of different organisms.


Subjects
DNA/chemistry ; Genome ; Genomics/methods ; Nucleotides/chemistry ; Sequence Analysis, DNA/methods ; Algorithms ; Base Sequence ; Phylogeny
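
The comparison described in the abstract above, between the observed inter-nucleotide distance distribution and a reference in which nucleotides are placed randomly and independently, can be sketched as follows. Under independence, the distance to the next occurrence of a symbol with probability p is geometric, P(d = k) = p(1 - p)^(k - 1). The function names and the truncation at `max_distance` are assumptions for illustration only.

```python
from collections import Counter


def inter_nucleotide_distances(seq: str, symbol: str) -> list[int]:
    """Distances between consecutive occurrences of the same symbol."""
    positions = [i for i, s in enumerate(seq) if s == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]


def relative_error_profile(seq: str, symbol: str, max_distance: int) -> list[float]:
    """Relative error between the observed distance distribution and the
    geometric reference expected under random, independent placement."""
    p = seq.count(symbol) / len(seq)
    if not 0.0 < p < 1.0:
        raise ValueError("symbol frequency must be strictly between 0 and 1")
    distances = inter_nucleotide_distances(seq, symbol)
    observed = Counter(distances)
    total = len(distances)
    profile = []
    for k in range(1, max_distance + 1):
        reference = p * (1.0 - p) ** (k - 1)
        empirical = observed[k] / total if total else 0.0
        profile.append((empirical - reference) / reference)
    return profile
```

Concatenating the profiles of the four nucleotides would give one vector per genome, which could then be fed to a hierarchical clustering routine; that last step is not shown here.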
7.
Interdiscip Sci ; 11(3): 367-372, 2019 Sep.
Article in English | MEDLINE | ID: mdl-30911903

ABSTRACT

Finding DNA sites with high potential for the formation of hairpin/cruciform structures is an important task. Previous works studied the distances between adjacent reversed complement words (symmetric word pairs) and also for non-adjacent words. It was observed that for some words a few distances were favoured (peaks) and that in some distributions there was strong peak regularity. The present work extends previous studies, by improving the detection and characterization of peak regularities in the symmetric word pairs distance distributions of the human genome. This work also analyzes the location of the sequences that originate the observed strong peak periodicity in the distance distribution. The results obtained in this work may indicate genomic sites with potential for the formation of hairpin/cruciform structures.


Subjects
DNA/chemistry ; Genome, Human ; Algorithms ; Chromosomes, Human ; Databases, Genetic ; Genomics ; Humans ; Models, Genetic ; Nucleic Acid Conformation ; Sequence Analysis, DNA/methods ; Software
8.
Interdiscip Sci ; 10(1): 1-11, 2018 Mar.
Article in English | MEDLINE | ID: mdl-29214497

ABSTRACT

In this work, we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance distributions are considered, and it is found that the peak dissimilarity works best in this setting. We report the existence of reverse complementary word pairs with very dissimilar distance distributions, as well as word pairs with very similar distance distributions even when both distributions are irregular and contain strong peaks. The association between distribution dissimilarity and frequency discrepancy is also explored, and it is speculated that symmetric pairs combining low and high values of each measure may uncover features of interest. Taken together, our results suggest that some asymmetries in the human genome go far beyond Chargaff's rules. This study uses both the complete human genome and its repeat-masked version.


Subjects
DNA, Complementary/genetics ; Genomics ; Genome, Human ; Humans ; Molecular Sequence Annotation
9.
Interdiscip Sci ; 9(1): 14-23, 2017 Mar.
Article in English | MEDLINE | ID: mdl-27866321

ABSTRACT

Single-strand DNA symmetry has been described as a universal law observed in the genomes of all living organisms. It is a somewhat broadly defined concept, which has been refined into some more specific measurable effects. Here we discuss the exceptional symmetry effect. Exceptional symmetry is the symmetry effect beyond that expected in independence contexts, and it can be measured for each word, for each equivalent composition group, or globally, combining the effects of all possible words of a given length. Global exceptional symmetry was found in several species, but there are genomic words with no exceptional symmetry effect, whereas others show a very high exceptional symmetry effect. In this work, we discuss a measure to evaluate the exceptional symmetry effect by symmetric word pair, and compare it with others. We present a detailed study of the exceptional symmetry by symmetric pairs and take the CG content into account. We also introduce and discuss the exceptional symmetry profile for the DNA of each organism, and we perform a multiple comparison for 31 genomes: 7 viruses, 5 archaea, 5 bacteria, and 14 eukaryotes.


Subjects
Genomics/methods ; Models, Genetic ; Statistics as Topic/methods ; DNA, Single-Stranded/genetics
10.
Sci Rep ; 7(1): 728, 2017 Apr 07.
Article in English | MEDLINE | ID: mdl-28389642

ABSTRACT

We address the problem of discovering pairs of symmetric genomic words (i.e., words and the corresponding reversed complements) occurring at distances that are overrepresented. For this purpose, we developed new procedures to identify symmetric word pairs with uncommon empirical distance distributions and with clusters of overrepresented short distances. We speculate that patterns of overrepresentation of short distances between symmetric word pairs may allow the occurrence of non-standard DNA conformations, such as hairpin/cruciform structures. We focused on the human genome, and analysed both the complete genome and a version with known repetitive sequences masked out. We reported several well-defined features in the distributions of distances, which can be classified into three different profiles, showing enrichment in distinct distance ranges. We analysed in greater detail certain pairs of symmetric words of length seven, found by our procedure, characterised by the surprising fact that they occur at single distances more frequently than expected.


Subjects
DNA ; Genome, Human ; Genomics ; Sequence Analysis, DNA ; Algorithms ; Chromosomes, Human ; DNA/chemistry ; DNA/genetics ; Databases, Genetic ; Genomics/methods ; Humans ; Markov Chains ; Models, Genetic ; Nucleic Acid Conformation ; Sequence Analysis, DNA/methods ; Structure-Activity Relationship
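
To make the notion of distances between symmetric word pairs in the abstract above concrete, here is a minimal sketch that records, for each occurrence of a word, the distance to the nearest later occurrence of its reverse complement. This particular definition (start-to-start distance, nearest following occurrence only) is an assumption for illustration; the abstract does not specify how distances are measured.

```python
from bisect import bisect_right

# Translation table for the complement of each nucleotide.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Complement every base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def occurrences(seq: str, word: str) -> list[int]:
    """Start positions of all (possibly overlapping) occurrences of word."""
    return [i for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)]


def symmetric_pair_distances(seq: str, word: str) -> list[int]:
    """Distance from each occurrence of word to the nearest later
    occurrence of its reverse complement (assumed definition)."""
    rc_positions = occurrences(seq, reverse_complement(word))
    distances = []
    for pos in occurrences(seq, word):
        j = bisect_right(rc_positions, pos)  # first reverse complement strictly after pos
        if j < len(rc_positions):
            distances.append(rc_positions[j] - pos)
    return distances
```

A histogram of the returned values (for example with `collections.Counter`) gives an empirical distance distribution in which favoured distances would appear as peaks.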
11.
IEEE Trans Biomed Eng ; 53(11): 2148-55, 2006 Nov.
Article in English | MEDLINE | ID: mdl-17073319

ABSTRACT

It is known that the protein-coding regions of DNA are usually characterized by a three-base periodicity. In this paper, we exploit this property, studying a DNA model based on three deterministic states, where each state implements a finite-context model. The experimental results obtained confirm the appropriateness of the proposed approach, showing compression gains in relation to the single finite-context model counterpart. Additionally, and potentially more interesting than the compression gain on its own, is the observation that the entropy associated with each of the three base positions of a codon differs and that this variation is not the same among the organisms analyzed.


Subjects
Algorithms ; DNA/genetics ; Models, Genetic ; Open Reading Frames/genetics ; Proteins/genetics ; Sequence Alignment/methods ; Sequence Analysis, DNA/methods ; Base Sequence ; Computer Simulation ; Molecular Sequence Data
12.
J Integr Bioinform ; 11(3): 250, 2014 Oct 23.
Article in English | MEDLINE | ID: mdl-25339084

ABSTRACT

Some previous studies point to the extension of Chargaff’s second rule (the phenomenon of symmetry) to long words. However, in random sequences generated by an independent symbol model where the probability of occurrence of complementary nucleotides is the same, we expect the phenomenon of symmetry to hold for all word lengths. In this work, we measure the symmetry above that expected in independence contexts (exceptional symmetry) for several organisms: viruses, archaea, bacteria and eukaryotes. We also create 27 control scenarios with the same length as each genome under study. The results for each organism were compared to those obtained in control scenarios. We created a new organism genomic signature consisting of a vector of the measures of exceptional symmetry for words of lengths 1 through 12. We show that the proposed signature is able to capture essential relationships between organisms.


Subjects
Base Sequence ; DNA, Single-Stranded/genetics ; Evolution, Molecular ; Animals ; Archaea/genetics ; Bacteria/genetics ; Genome ; Humans ; Phylogeny ; Viruses/genetics
13.
J Integr Bioinform ; 10(3): 230, 2013 Nov 14.
Article in English | MEDLINE | ID: mdl-24231144

ABSTRACT

In this study we explore the potential of inter-STOP symbol distances for finding coding regions in DNA sequences. We use the distance between STOP symbols in the DNA sequence and a chi-square statistic to evaluate the nonhomogeneity of the three possible reading frames and the occurrence of one long distance in one of the frames. The results of this exploratory study suggest that inter-STOP symbol distances have a strong ability to discriminate coding regions in prokaryotes and simple eukaryotes.


Subjects
DNA/chemistry ; Sequence Analysis, DNA ; Base Sequence ; Saccharomyces cerevisiae/genetics
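
The inter-STOP distances used in the abstract above can be extracted per reading frame with a short sketch like the one below. It uses the three standard stop codons (TAA, TAG, TGA) and measures distances in codons; the unit of measurement and the function name are assumptions, and the chi-square step mentioned in the abstract is not reproduced.

```python
# The three standard stop codons, written on the DNA alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def inter_stop_distances(seq: str, frame: int) -> list[int]:
    """Distances, in codons, between consecutive stop codons in one
    reading frame (frame is 0, 1 or 2)."""
    stops = [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]
    return [(b - a) // 3 for a, b in zip(stops, stops[1:])]
```

Calling the function for frames 0, 1 and 2 yields three distance lists; a frame containing one unusually long distance is the kind of signal the abstract associates with a coding region.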
14.
PLoS One ; 6(6): e21588, 2011.
Article in English | MEDLINE | ID: mdl-21738720

ABSTRACT

A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study about the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast with the statistical models underlying other methods, which exploit the extensive data repetitions in DNA sequences and therefore have a non-local character.


Subjects
Computational Biology/methods ; Genome/genetics ; Animals ; Humans ; Markov Chains
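
The information profile and the bits-per-base measure described in the abstract above can be illustrated with a single finite-context model of order k. This sketch assumes an adaptive model (counts are updated after each symbol is scored) and an additive-smoothing estimator with parameter `alpha`; both are illustrative choices, and the competing models and inverted repeat handling of the paper are not reproduced.

```python
import math
from collections import defaultdict


def information_profile(seq: str, k: int, alpha: float = 1.0,
                        alphabet: str = "ACGT") -> list[float]:
    """Per-position cost, in bits, of each symbol given its k preceding symbols."""
    counts = defaultdict(lambda: defaultdict(int))
    profile = []
    for i in range(k, len(seq)):
        context, symbol = seq[i - k:i], seq[i]
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        # Assumed estimator: additive smoothing over the alphabet.
        prob = (ctx_counts[symbol] + alpha) / (total + alpha * len(alphabet))
        profile.append(-math.log2(prob))
        ctx_counts[symbol] += 1  # update only after scoring, as a sequential coder would
    return profile


def bits_per_base(seq: str, k: int, alpha: float = 1.0) -> float:
    """Average of the information profile over the sequence."""
    profile = information_profile(seq, k, alpha)
    return sum(profile) / len(profile) if profile else 0.0
```

With a four-letter alphabet, a model that has learned nothing costs 2 bits per base, so values below 2 indicate structure that the chosen order captures.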
15.
PLoS One ; 6(1): e16065, 2011 Jan 31.
Article in English | MEDLINE | ID: mdl-21386877

ABSTRACT

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeon, thirteen bacteria and eight eukaryotes). We investigate whether the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, as well as in other organisms. We find that the compositional biases observed for the shortest absent words in vertebrates are not uniform throughout different sets of minimal absent words. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry from the similarity in dinucleotide relative abundances of different sets of minimal absent words, and find that this inheritance may be exclusive to vertebrates.


Subjects
Eukaryotic Cells/metabolism ; Genome/genetics ; Prokaryotic Cells/metabolism ; Animals ; Base Composition/genetics ; Base Sequence ; Inheritance Patterns/genetics ; Molecular Sequence Data ; Nucleotides/genetics ; Phylogeny ; Vertebrates/genetics
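
A minimal absent word is a word that does not occur in a sequence although both its longest proper prefix and its longest proper suffix do. The brute-force sketch below enumerates the minimal absent words of one length k under that definition. It is meant only to make the concept in the abstract above concrete: the function name is hypothetical, the search is exponential in k, and it is not the algorithm used for whole genomes.

```python
from itertools import product


def minimal_absent_words(seq: str, k: int, alphabet: str = "ACGT") -> list[str]:
    """All words of length k that are absent from seq while their
    longest proper prefix and suffix are both present."""
    def substrings(n: int) -> set[str]:
        return {seq[i:i + n] for i in range(len(seq) - n + 1)}

    present_k = substrings(k)
    present_shorter = substrings(k - 1)
    result = []
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        if (word not in present_k
                and word[:-1] in present_shorter
                and word[1:] in present_shorter):
            result.append(word)
    return result
```

For the toy sequence `AACA` over the alphabet `AC`, the only minimal absent word of length 2 is `CC`.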
16.
J Integr Bioinform ; 8(3): 172, 2011 Sep 15.
Article in English | MEDLINE | ID: mdl-21926435

ABSTRACT

We study the inter-dinucleotide distance distributions in the human genome, both in the whole-genome and protein-coding regions. The inter-dinucleotide distance is defined as the distance to the next occurrence of the same dinucleotide. We consider the 16 sequences of inter-dinucleotide distances and two reading frames. Our results show a period-3 oscillation in the protein-coding inter-dinucleotide distance distributions that is absent from the whole-genome distributions. We also compare the distance distribution of each dinucleotide to a reference distribution, that of a random sequence generated with the same dinucleotide abundances, revealing the CG dinucleotide as the one with the highest cumulative relative error for the first 60 distances. Moreover, the distance distribution of each dinucleotide is compared to the distance distribution of all other dinucleotides using the Kullback-Leibler divergence. We find that the distance distribution of a dinucleotide and that of its reversed complement are very similar, hence, the divergence between them is very small. This is an interesting finding that may give evidence of a stronger parity rule than Chargaff's second parity rule.


Subjects
Genetic Variation/physiology ; Genome, Human/physiology ; Reading Frames/physiology ; Sequence Analysis, DNA/methods ; Animals ; Humans