1.
Nucleus ; 4(5): 390-8, 2013.
Article in English | MEDLINE | ID: mdl-24051548

ABSTRACT

Since Jacob and Monod's characterization of the role of DNA elements in gene control, it has been recognized that the linear organization of genome structure is important for the regulation of gene transcription and hence the manifestation of phenotypes. Similarly, it has long been hypothesized that the spatial organization (in three dimensions evolving through time), as part of the epigenome, makes a significant contribution to the genotype-phenotype transition. Proximity ligation assays, commonly known as chromosome conformation capture (3C), and 3C-based methodologies (e.g., GCC, Hi-C and ChIA-PET) are increasingly being incorporated into empirical studies to investigate the role that three-dimensional genome structure plays in the regulation of phenotype. The apparent simplicity of these methodologies (crosslink chromatin, digest, dilute, ligate, detect interactions) belies the complexity of the data and the considerations that should be taken into account to ensure the generation and accurate interpretation of reliable data. Here we discuss the probabilistic nature of these methodologies and how this contributes to their endogenous limitations.


Subject(s)
Chromosomes/chemistry , Models, Statistical , Molecular Conformation , Animals , Chromosomes/genetics , Chromosomes/metabolism , Genomics , Humans
2.
Genome Biol Evol ; 4(12): 1316-23, 2012.
Article in English | MEDLINE | ID: mdl-23204304

ABSTRACT

A characteristic feature of eukaryote and prokaryote genomes is the co-occurrence of nucleotide substitution and insertion/deletion (indel) mutations. Although similar observations have also been made for chloroplast DNA, genome-wide associations have not been reported. We determined the chloroplast genome sequences for two morphotypes of taro (Colocasia esculenta; family Araceae) and compared these with four publicly available aroid chloroplast genomes. Here, we report the extent of genome-wide association between direct and inverted repeats, indels, and substitutions in these aroid chloroplast genomes. We suggest that alternative but not mutually exclusive hypotheses explain the mutational dynamics of chloroplast genome evolution.


Subject(s)
Araceae/genetics , Genome, Chloroplast , INDEL Mutation , Point Mutation , Repetitive Sequences, Nucleic Acid , Databases, Nucleic Acid , Genome-Wide Association Study , Inverted Repeat Sequences , Mutation Rate , Phylogeny
3.
Mol Biol Evol ; 29(12): 3703-9, 2012 Dec.
Article in English | MEDLINE | ID: mdl-22790671

ABSTRACT

At macroevolutionary time scales, and for a constant mutation rate, there is an expected linear relationship between time and the number of inferred neutral mutations (the "molecular clock"). However, at shorter time scales, a number of recent studies have observed an apparent acceleration in the rate of molecular evolution. We study this apparent acceleration under a Jukes-Cantor model applied to a randomly mating population, and show that, under the model, it arises as a consequence of ignoring short-term effects due to existing diversity within the population. The acceleration can be accounted for by adding the correction term $h_0 e^{-4\mu t/3}$ to the usual Jukes-Cantor formula $p(t) = \frac{3}{4}\left(1 - e^{-4\mu t/3}\right)$, where $h_0$ is the expected heterozygosity in the population at time $t = 0$. The true mutation rate $\mu$ may then be recovered, even if $h_0$ is not known, by estimating $\mu$ and $h_0$ simultaneously using least squares. Rate estimates made without the correction term (i.e., incorrectly assuming the population to be homogeneous) will result in a divergent rate curve of the form $\mu_{\mathrm{div}} = \mu + C/t$, so that the mutation rate appears to approach infinity as the time scale approaches zero. Although our quantitative results apply only to the Jukes-Cantor model, it is reasonable to suppose that the qualitative picture that emerges also applies to more complex models. Our study, therefore, demonstrates the importance of properly accounting for any ancestral diversity, because it may otherwise play a dominant role in rate overestimation.
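
The relationship stated in this abstract can be checked numerically. The sketch below is not taken from the paper: the function names, the parameter values, and the closed form used for the constant C (obtained here by inverting the uncorrected Jukes-Cantor formula against the corrected one) are illustrative assumptions.

```python
import math

def jc_expected_difference(t, mu, h0=0.0):
    """Expected proportion of differing sites at time t.

    Usual Jukes-Cantor term 3/4 * (1 - exp(-4*mu*t/3)) plus the
    ancestral-diversity correction h0 * exp(-4*mu*t/3) from the abstract.
    """
    decay = math.exp(-4.0 * mu * t / 3.0)
    return 0.75 * (1.0 - decay) + h0 * decay

def apparent_rate(p, t):
    """Rate obtained by inverting the UNCORRECTED formula, i.e. by
    wrongly assuming the starting population was homogeneous (h0 = 0)."""
    return -3.0 / (4.0 * t) * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical values chosen only for illustration.
mu_true = 1e-2   # true mutation rate per site per unit time
h0 = 0.05        # expected heterozygosity at t = 0

# Constant derived here from the two formulas above (an assumption of
# this sketch, not a value quoted from the paper).
C = -0.75 * math.log(1.0 - 4.0 * h0 / 3.0)

for t in (0.1, 1.0, 10.0, 100.0):
    p = jc_expected_difference(t, mu_true, h0)
    print(f"t={t:>6}: apparent rate={apparent_rate(p, t):.5f}, "
          f"mu + C/t={mu_true + C / t:.5f}")
```

Under these assumptions the apparent rate follows mu + C/t, so it grows without bound as t shrinks and approaches the true rate at long time scales, which is the qualitative behaviour the abstract describes.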


Subject(s)
Evolution, Molecular , Genetic Variation , Genetics, Population/methods , Models, Genetic , Mutation Rate , Heterozygote , Time Factors
4.
J Comput Biol ; 18(9): 1211-8, 2011 Sep.
Article in English | MEDLINE | ID: mdl-21899426

ABSTRACT

An approximate nested tandem repeat (NTR) in a string T is a complex repetitive structure consisting of many approximate copies of two substrings x and X ("motifs") interspersed with one another. NTRs fall into a class of repetitive structures broadly known as subrepeats. NTRs have been found in real DNA sequences and are expected to be important in evolutionary biology, both in understanding evolution of the ribosomal DNA (where NTRs can occur), and as a potential marker in population genetic and phylogenetic studies. This article describes an alignment algorithm for the verification phase of the software tool NTRFinder developed for database searches for NTRs. When the search algorithm has located a subsequence containing a possible NTR, with motifs X and x, a verification step aligns this subsequence against an exact NTR built from the templates X and x, to determine whether the subsequence contains an approximate NTR and its extent. This article describes an algorithm to solve this alignment problem in O(|T|(|X| + |x|)) space and time. The algorithm is based on Fischetti et al.'s wrap-around dynamic programming.


Subject(s)
Algorithms , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Tandem Repeat Sequences , Data Mining/methods
5.
Syst Biol ; 59(3): 288-97, 2010 May.
Article in English | MEDLINE | ID: mdl-20525636

ABSTRACT

Commonly used phylogenetic models assume a homogeneous process through time in all parts of the tree. However, it is known that these models can be too simplistic as they do not account for nonhomogeneous lineage-specific properties. In particular, it is now widely recognized that as constraints on sequences evolve, the proportion and positions of variable sites can vary between lineages, causing heterotachy. The extent to which this model misspecification affects tree reconstruction is still unknown. Here, we evaluate the effect of changes in the proportions and positions of variable sites on model fit and tree estimation. We consider 5 current models of nucleotide sequence evolution in a Bayesian Markov chain Monte Carlo framework as well as maximum parsimony (MP). We show that for a tree with 4 lineages where 2 nonsister taxa undergo a change in the proportion of variable sites, tree reconstruction under the best-fitting model, which is chosen using a relative test, often results in the wrong tree. In this case, we found that an absolute test of model fit is a better predictor of tree estimation accuracy. We also found further evidence that MP is not immune to heterotachy. In addition, we show that increased sampling of taxa that have undergone a change in proportion and positions of variable sites is critical for accurate tree reconstruction.


Subject(s)
Classification/methods , Evolution, Molecular , Models, Genetic , Phylogeny , Bayes Theorem , Data Interpretation, Statistical , Markov Chains , Monte Carlo Method , Sample Size
6.
Biol Lett ; 5(3): 397-400, 2009 Jun 23.
Article in English | MEDLINE | ID: mdl-19324622

ABSTRACT

We present a mathematical model of mitochondrial inheritance evolving under neutral evolution to interpret the heteroplasmies observed at some sites. A comparison of the levels of heteroplasmies transmitted from a mother to her offspring allows us to estimate the number $N_x$ of inherited mitochondrial genomes (segregating units). The model demonstrates the necessity of accounting for both the multiplicity of an unknown number $N_x$ and the threshold below which heteroplasmy cannot be detected reliably, in order to estimate the mitochondrial mutation rate $\mu_m$ in the maternal line of descent. Our model is applicable to pedigree studies of any eukaryotic species where site heteroplasmies are observed in regions of the mitochondria, provided neutrality can be assumed. The model is illustrated with an analysis of site heteroplasmies in the first hypervariable region of mitochondrial sequence data sampled from Adélie penguin families, providing estimates of $N_x$ and $\mu_m$. This estimate of $\mu_m$ was found to be consistent with earlier estimates from ancient DNA analysis.


Subject(s)
DNA, Mitochondrial/genetics , Mitochondria/genetics , Models, Genetic , Polymorphism, Genetic , Evolution, Molecular
7.
Mol Biol Evol ; 26(2): 313-26, 2009 Feb.
Article in English | MEDLINE | ID: mdl-18981298

ABSTRACT

We report three developments toward resolving the challenge of the apparent basal polytomy of neoavian birds. First, we describe improved conditional down-weighting techniques to reduce noise relative to signal for deeper divergences and find increased agreement between data sets. Second, we present formulae for calculating the probabilities of finding predefined groupings in the optimal tree. Finally, we report a significant increase in data: nine new mitochondrial (mt) genomes (the dollarbird, New Zealand kingfisher, great potoo, Australian owlet-nightjar, white-tailed trogon, barn owl, a roadrunner [a ground cuckoo], New Zealand long-tailed cuckoo, and the peach-faced lovebird), which together provide data for each of the six main groups of Neoaves proposed by Cracraft J (2001). We use his six main groups of modern birds as priors for evaluation of results. These include passerines, cuckoos, parrots, and three other groups termed "WoodKing" (woodpeckers/rollers/kingfishers), "SCA" (owls/potoos/owlet-nightjars/hummingbirds/swifts), and "Conglomerati." In general, the support is highly significant, with just two exceptions: the owls move from the "SCA" group to the raptors, particularly accipitrids (buzzards/eagles) and the osprey, and the shorebirds may be an independent group from the rest of the "Conglomerati." Molecular dating with mt genomes supports a major diversification of at least 12 neoavian lineages in the Late Cretaceous. Our results form a basis for further testing with both nuclear-coding sequences and rare genomic changes.


Subject(s)
Birds/classification , Animals , Birds/genetics , DNA, Mitochondrial/genetics , Phylogeny , Sequence Analysis, DNA
8.
PLoS Genet ; 4(10): e1000209, 2008 Oct 03.
Article in English | MEDLINE | ID: mdl-18833304

ABSTRACT

Precise estimations of molecular rates are fundamental to our understanding of the processes of evolution. In principle, mutation and evolutionary rates for neutral regions of the same species are expected to be equal. However, a number of recent studies have shown that mutation rates estimated from pedigree material are much faster than evolutionary rates measured over longer time periods. To resolve this apparent contradiction, we have examined the hypervariable region (HVR I) of the mitochondrial genome using families of Adélie penguins (Pygoscelis adeliae) from the Antarctic. We sequenced 344 bp of the HVR I from penguins comprising 508 families with 915 chicks, together with both their parents. All of the 62 germline heteroplasmies that we detected in mothers were also detected in their offspring, consistent with maternal inheritance. These data give an estimated mutation rate (µ) of 0.55 mutations/site/Myrs (HPD 95% confidence interval of 0.29-0.88 mutations/site/Myrs) after accounting for the persistence of these heteroplasmies and the sensitivity of current detection methods. In comparison, the rate of evolution (k) of the same HVR I region, determined using DNA sequences from 162 known-age sub-fossil bones spanning a 37,000-year period, was 0.86 substitutions/site/Myrs (HPD 95% confidence interval of 0.53-1.17). Importantly, the latter rate is not statistically different from our estimate of the mutation rate. These results are in contrast to the view that molecular rates are time dependent.


Subject(s)
Evolution, Molecular , Mutation , Spheniscidae/genetics , Animals , Antarctic Regions , DNA, Mitochondrial/genetics , Genetic Drift , Genetics, Population , Haplotypes , Pedigree
9.
Article in English | MEDLINE | ID: mdl-18670048

ABSTRACT

Under a stochastic model of molecular sequence evolution, the probability of each possible pattern of characters is well defined. Kimura's three-substitution-types (K3ST) model of evolution allows an analytical expression for these probabilities by means of the Hadamard conjugation, as a function of the phylogeny T and the substitution probabilities on each edge of TM. In this paper we produce a direct combinatorial proof of these results, using pathset distances which generalise pairwise distances between sequences. This interpretation provides us with tools that have proved useful in related problems in the mathematical analysis of sequence evolution.


Subject(s)
Algorithms , DNA Mutational Analysis/methods , Evolution, Molecular , Models, Genetic , Sequence Analysis, DNA/methods , Computer Simulation , Phylogeny
10.
BMC Bioinformatics ; 9: 242, 2008 May 20.
Article in English | MEDLINE | ID: mdl-18489794

ABSTRACT

BACKGROUND: Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. RESULTS: We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. CONCLUSION: coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.


Subject(s)
Data Compression/methods , Database Management Systems , Databases, Nucleic Acid , Animals , Evolution, Molecular , Expressed Sequence Tags , Humans , Neural Networks, Computer , Phylogeny , Point Mutation , Sequence Analysis, DNA , Species Specificity
11.
Mol Biol Evol ; 24(11): 2400-11, 2007 Nov.
Article in English | MEDLINE | ID: mdl-17720690

ABSTRACT

There are many examples of groups (such as birds, bees, mammals, multicellular animals, and flowering plants) that have undergone a rapid radiation. In such cases, where there is a combination of short internal and long external branches, correctly estimating and rooting phylogenetic trees is known to be a difficult problem. In this simulation study, we tested the performances of different phylogenetic methods at estimating a tree that models a rapid radiation. We found that maximum likelihood, corrected and uncorrected neighbor-joining, and corrected and uncorrected parsimony, all suffer from biases toward specific tree topologies. In addition, we found that using a single-taxon outgroup to root a tree frequently disrupts an otherwise correct ingroup phylogeny. Moreover, for uncorrected parsimony, we found cases where several individual trees (in which the outgroup was placed incorrectly) were selected more frequently than the correct tree. Even for parameter settings where the correct tree was selected most frequently when using extremely long sequences, for sequences of up to 60,000 nucleotides the incorrectly rooted trees were each selected more frequently than the correct tree. For all the cases tested here, tree estimation using a two-taxon outgroup was more accurate than when using a single-taxon outgroup. However, the ingroup was most accurately recovered when no outgroup was used.


Subject(s)
Models, Genetic , Phylogeny , Computer Simulation , Evolution, Molecular , Likelihood Functions
12.
J Med Virol ; 78(10): 1354-64, 2006 Oct.
Article in English | MEDLINE | ID: mdl-16927286

ABSTRACT

Respiratory syncytial virus (RSV) is the most important cause of viral lower respiratory tract infections in infants and children worldwide. In New Zealand, infants with RSV disease are hospitalized at a higher rate than in other industrialized countries, without a proportionate increase in known risk factors. The molecular epidemiology of RSV in New Zealand has never been described. Therefore, we analyzed viral attachment glycoprotein (G) gene sequences from 106 RSV subgroup A isolates collected in New Zealand between 1967 and 2003, and 38 subgroup B viruses collected between 1984 and 2004. Subgroup A and B sequences were aligned separately, and compared to sequences of viruses isolated from other countries during a similar period. Genotyping and clustering analyses showed RSV in New Zealand is similar and temporally related to viruses found in other countries. By quantifying temporal clustering, we found subgroup B viruses clustered more strongly than subgroup A viruses. RSV B sequences displayed more variability in stop codon usage and predicted protein length, and had a higher degree of predicted O-glycosylation site changes than RSV A. The mutation rate calculated for the RSV B G gene was significantly higher than for RSV A. Together, these data reveal that RSV subgroups exhibit different patterns of evolution, with subgroup B viruses evolving faster than A.


Subject(s)
Molecular Epidemiology , Respiratory Syncytial Virus Infections/epidemiology , Respiratory Syncytial Viruses/genetics , Biological Evolution , Child, Preschool , Codon, Terminator/genetics , Female , Glycoproteins/genetics , Glycosylation , Humans , Infant , Male , New Zealand/epidemiology , Prospective Studies , Respiratory Syncytial Viruses/classification , Retrospective Studies , Viral Envelope Proteins/genetics
13.
Mol Biol Evol ; 23(3): 626-32, 2006 Mar.
Article in English | MEDLINE | ID: mdl-16319091

ABSTRACT

Maximum likelihood (ML) is a popular method for inferring a phylogenetic tree of the evolutionary relationship of a set of taxa from observed homologous aligned genetic sequences of the taxa. Generally, the computation of the ML tree is based on numerical methods, which in a few cases are known to converge to a local maximum on a tree, which is suboptimal. The extent of this problem is unknown; one approach is to attempt to derive algebraic equations for the likelihood equation and find the maximum points analytically. This approach has so far only been successful in the very simplest cases, of three or four taxa under the Neyman model of evolution of two-state characters. In this paper we extend this approach, for the first time, to four-state characters: the Jukes-Cantor model under a molecular clock, on a tree T on three taxa, a rooted triple. We employ spectral methods (Hadamard conjugation) to express the likelihood function parameterized by the path-length spectrum. Taking partial derivatives, we derive a set of polynomial equations whose simultaneous solution contains all critical points of the likelihood function. Using tools of algebraic geometry (the resultant of two polynomials) in a computer algebra package (Maple), we are able to find all turning points analytically. We then employ this method on real sequence data and obtain realistic results on the primate-rodent divergence time.


Subject(s)
Biological Evolution , Likelihood Functions , Models, Genetic , Phylogeny , Classification , Mathematics
14.
Bioinformatics ; 19 Suppl 2: ii66-72, 2003 Oct.
Article in English | MEDLINE | ID: mdl-14534174

ABSTRACT

We introduce a mechanism for analytically deriving upper bounds on the maximum likelihood for genetic sequence data on sets of phylogenies. A simple 'partition' bound is introduced for general models. Tighter bounds are developed for the simplest model of evolution, the two-state symmetric model of nucleotide substitution under the molecular clock. This follows earlier theoretical work which has been restricted to this model by analytic complexity. A weakness of current numerical computation is that reported 'maximum likelihood' results cannot be guaranteed, either for a specified tree (because of the possibility of multiple maxima) or over the full tree space (as the computation is intractable for large sets of trees). The bounds we develop here can be used to conclusively eliminate large proportions of tree space in the search for the maximum likelihood tree. This is vital in the development of a branch and bound search strategy for identifying the maximum likelihood tree. We report the results from a simulation study of approximately $10^6$ data sets generated on clock-like trees of five leaves. In each trial a likelihood value of one specific instance of a parameterised tree is compared to the bound determined for each of the 105 possible rooted binary trees. The proportion of trees that are eliminated from the search for the maximum likelihood tree ranged from 92% to almost 98%, indicating a computational speed-up factor of between 12 and 44.
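
The two figures quoted in this abstract can be reproduced with a few lines of arithmetic. This is a hedged sketch rather than the authors' procedure: it uses the standard double-factorial count of rooted binary trees, and it assumes that the speed-up is simply the reciprocal of the fraction of trees left to evaluate, with every tree costing the same. The function names are hypothetical.

```python
def rooted_binary_tree_count(n_leaves):
    """Number of rooted binary leaf-labelled trees: (2n - 3)!! for n >= 2."""
    count = 1
    for odd in range(3, 2 * n_leaves - 2, 2):  # factors 3, 5, ..., 2n - 3
        count *= odd
    return count

def speedup_factor(fraction_eliminated):
    """Assumed speed-up if only the surviving trees are evaluated,
    each at equal cost: 1 / (1 - fraction eliminated)."""
    return 1.0 / (1.0 - fraction_eliminated)

print(rooted_binary_tree_count(5))    # five leaves, as in the simulation study
print(round(speedup_factor(0.92), 1)) # lower end of the reported range
```

Under the same assumption, a speed-up of 44 corresponds to eliminating a fraction 1 - 1/44 of the trees, which is a little under 98%.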


Subject(s)
Algorithms , Evolution, Molecular , Models, Genetic , Phylogeny , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Base Sequence , Computer Simulation , Data Interpretation, Statistical , Likelihood Functions , Models, Statistical , Molecular Sequence Data
15.
J Theor Biol ; 223(3): 377-85, 2003 Aug 07.
Article in English | MEDLINE | ID: mdl-12850457

ABSTRACT

Sober and Steel (J. Theor. Biol. 218, 395-408) give important limits on the use of current models with sequence data for studying ancient aspects of evolution; but they go too far in suggesting that several fundamental aspects of evolutionary theory cannot be tested in a normal scientific manner. To the contrary, we show examples of how some alternatives to the theory of descent can be formulated in such a way that they lead to predictions that can be evaluated (and rejected). The critical factor is a logical formulation of the alternatives, even though not all possible alternatives can be tested simultaneously. Similarly, some of the limits of using DNA sequence data can be overcome by other types of sequence-derived characters. The uniqueness (or not) of the origin of life, though still difficult, is similarly amenable to the testing of alternative hypotheses.


Subject(s)
Base Sequence , Biological Evolution , Models, Genetic , Origin of Life , Animals , Computer Simulation
16.
Bioinformatics ; 19(7): 865-73, 2003 May 01.
Article in English | MEDLINE | ID: mdl-12724297

ABSTRACT

MOTIVATION: Noncoding RNA genes produce functional RNA molecules rather than coding for proteins. One such family is the H/ACA snoRNAs. Unlike the related C/D snoRNAs these have resisted automated detection to date. RESULTS: We develop an algorithm to screen the yeast genome for novel H/ACA snoRNAs. To achieve this, we introduce some new methods for facilitating the search for noncoding RNAs in genomic sequences which are based on properties of predicted minimum free-energy (MFE) secondary structures. The algorithm has been implemented and can be generalized to enable screening of other eukaryote genomes. We find that use of primary sequence alone is insufficient for identifying novel H/ACA snoRNAs. Only the use of secondary structure filters reduces the number of candidates to a manageable size. From genomic context, we identify three strong H/ACA snoRNA candidates. These together with a further 47 candidates obtained by our analysis are being experimentally screened.


Subject(s)
Algorithms , Energy Transfer , Genome, Fungal , RNA, Small Nucleolar/chemistry , RNA, Small Nucleolar/genetics , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Base Sequence , Conserved Sequence , Databases, Nucleic Acid , Molecular Sequence Data , Nucleic Acid Conformation , RNA, Small Nucleolar/analysis , RNA, Small Nucleolar/classification , Saccharomyces cerevisiae/chemistry , Saccharomyces cerevisiae/genetics
17.
Syst Biol ; 52(1): 110-8, 2003 Feb.
Article in English | MEDLINE | ID: mdl-12554444

ABSTRACT

We developed a recurrence relation that counts the number of tandem duplication trees (either rooted or unrooted) that are consistent with a set of n tandemly repeated sequences generated under the standard unequal recombination (or crossover) model of tandem duplications. The number of rooted duplication trees is exactly twice the number of unrooted trees, which means that on average only two positions for a root on a duplication tree are possible. Using the recurrence, we tabulated these numbers for small values of n. We also developed an asymptotic formula that for large n provides estimates for these numbers. These numbers give a priori probabilities for phylogenies of the repeated sequences to be duplication trees. This work extends earlier studies where exhaustive counts of the numbers for small n were obtained. One application showed the significance of finding that most maximum-parsimony trees constructed from repeat sequences from human immunoglobulins and T-cell receptors were tandem duplication trees. Those findings provided strong support to the proposed mechanisms of tandem gene duplication. The recurrence relation also suggests efficient algorithms to recognize duplication trees and to generate random duplication trees for simulation. We present a linear-time recognition algorithm.


Subject(s)
Phylogeny , Algorithms , Gene Duplication , Models, Statistical
18.
Mol Biol Evol ; 19(12): 2060-70, 2002 Dec.
Article in English | MEDLINE | ID: mdl-12446798

ABSTRACT

We have sequenced four new mitochondrial genomes to improve the stability of the tree for placental mammals; they are two insectivores (a gymnure, Echinosorex gymnurus and Formosan shrew Soriculus fumidus); a Formosan lesser horseshoe bat (Rhinolophus monoceros); and the New Zealand fur seal (Arctocephalus forsteri). A revision to the hedgehog sequence (Erinaceus europaeus) is also reported. All five are from the Laurasiatheria grouping of eutherian mammals. On this new data set there is a strong tendency for the hedgehog and its relative, the gymnure, to join with the other Laurasiatherian insectivores (mole and shrews). To quantify the stability of trees from this data we define, based on nuclear sequences, a major four-way split in Laurasiatherians. This ([Xenarthra, Afrotheria], [Laurasiatheria, Supraprimates]) split is also found from mitochondrial genomes using either protein-coding or RNA (rRNA and tRNA) data sets. The high similarity of the mitochondrial and nuclear-derived trees allows a quantitative estimate of the stability of trees from independent data sets, as detected from a triplet Markov analysis. There are significant changes in the mutational processes within placental mammals that are ignored by current tree programs. On the basis of our quantitative results, we expect the evolutionary tree for mammals to be resolved quickly, and this will allow other problems to be solved.


Subject(s)
Biological Evolution , DNA, Mitochondrial/genetics , Mammals/classification , Animals , Likelihood Functions , Mammals/genetics , Species Specificity