1.
Bioinformatics ; 39(12), 2023 12 01.
Article in English | MEDLINE | ID: mdl-37975872

ABSTRACT

MOTIVATION: Phylogenetic placement enables phylogenetic analysis of massive collections of newly sequenced DNA when de novo tree inference is too unreliable or inefficient. Assuming that a high-quality reference tree is available, the idea is to seek the correct placement of the new sequences in that tree. Recently, alignment-free approaches to phylogenetic placement have emerged, both to circumvent the need to align the new sequences and to avoid the calculations that typically follow the alignment step. A promising approach is based on the inference of k-mers that can be potentially related to the reference sequences, also called phylo-k-mers. However, its usage is limited by the time- and memory-consuming stage of reference data preprocessing and the large number of k-mers to consider. RESULTS: We suggest a filtering method for selecting informative phylo-k-mers based on mutual information, which can significantly improve the efficiency of placement at the cost of a small loss in placement accuracy. This method is implemented in IPK, a new tool for computing phylo-k-mers that significantly outperforms the software previously available. We also present EPIK, a new program for phylogenetic placement that supports filtered phylo-k-mer databases. Our experiments on real-world data show that EPIK is the fastest phylogenetic placement tool available when placing hundreds of thousands and millions of queries, while still providing accurate placements. AVAILABILITY AND IMPLEMENTATION: IPK and EPIK are freely available at https://github.com/phylo42/IPK and https://github.com/phylo42/EPIK. Both are implemented in C++ and Python and supported on Linux and macOS.


Subject(s)
Algorithms , Software , Phylogeny , Sequence Analysis, DNA , Base Sequence
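
The abstract of entry 1 mentions selecting informative phylo-k-mers by mutual information. As a rough illustration of that quantity only, the sketch below estimates the mutual information between two discrete variables from observed pairs. The pairing of "k-mer present (1) or absent (0)" with a branch label, the function name and the toy data are all assumptions made for this example; this is not IPK's actual scoring scheme.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate I(X;Y) in bits from a list of observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: (k-mer present?, branch of origin)
informative = [(1, "b1"), (1, "b1"), (0, "b2"), (0, "b2")]
uninformative = [(1, "b1"), (0, "b1"), (1, "b2"), (0, "b2")]

print(mutual_information(informative))
print(mutual_information(uninformative))
```

In this toy setting the first k-mer fully determines the branch (one bit of information), while the second tells us nothing about it, so a filter of this kind would keep the first and drop the second.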
2.
IEEE/ACM Trans Comput Biol Bioinform ; 20(5): 2889-2897, 2023.
Article in English | MEDLINE | ID: mdl-37204943

ABSTRACT

Finding the correct position of new sequences within an established phylogenetic tree is an increasingly relevant problem in evolutionary bioinformatics and metagenomics. Recently, alignment-free approaches for this task have been proposed. One such approach is based on the concept of phylogenetically informative k-mers, or phylo-k-mers for short. In practice, phylo-k-mers are inferred from a set of related reference sequences and are equipped with scores expressing the probability of their appearance in different locations within the input reference phylogeny. Computing phylo-k-mers, however, represents a computational bottleneck to their applicability in real-world problems such as the phylogenetic analysis of metabarcoding reads and the detection of novel recombinant viruses. Here we consider the problem of phylo-k-mer computation: how can we efficiently find all k-mers whose probability lies above a given threshold for a given tree node? We describe and analyze algorithms for this problem, relying on branch-and-bound and divide-and-conquer techniques. We exploit the redundancy of adjacent windows of the alignment to save on computation. Besides computational complexity analyses, we provide an empirical evaluation of the relative performance of their implementations on simulated and real-world data. The divide-and-conquer algorithms are found to surpass the branch-and-bound approach, especially when many phylo-k-mers are found.
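
To make the problem stated in the abstract of entry 2 concrete, here is a minimal branch-and-bound sketch. It assumes, for illustration only, that a window of the alignment is given as one probability table per position and that the probability of a k-mer is the product of its per-position probabilities. The function name and the data are hypothetical, and the pruning rule shown is a generic one, not necessarily the one analyzed in the paper.

```python
ALPHABET = "ACGT"


def kmers_above_threshold(columns, threshold):
    """Return {k-mer: probability} for all k-mers with probability > threshold.

    columns[j][c] is the probability of character c at position j.
    """
    k = len(columns)
    # best_suffix[j]: largest probability achievable by positions j..k-1
    best_suffix = [1.0] * (k + 1)
    for j in range(k - 1, -1, -1):
        best_suffix[j] = best_suffix[j + 1] * max(columns[j].values())

    results = {}

    def extend(prefix, prob, j):
        if j == k:
            results[prefix] = prob
            return
        for c in ALPHABET:
            p = prob * columns[j][c]
            # Bound: even the best possible completion stays at or below
            # the threshold, so the whole subtree can be skipped.
            if p * best_suffix[j + 1] <= threshold:
                continue
            extend(prefix + c, p, j + 1)

    extend("", 1.0, 0)
    return results


# Hypothetical window of three alignment positions
window = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.05, "C": 0.05, "G": 0.1, "T": 0.8},
]
found = kmers_above_threshold(window, 0.1)
print(sorted(found))
```

Because probabilities never exceed 1, a prefix whose best possible completion already falls at or below the threshold can be discarded without enumerating the k-mers that extend it.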

3.
PLoS Comput Biol ; 17(9): e1008380, 2021 09.
Article in English | MEDLINE | ID: mdl-34478440

ABSTRACT

For various species, high-quality sequences and complete genomes are nowadays available for many individuals. This makes data analysis challenging, as methods need to be not only accurate but also time-efficient, given the tremendous amount of data to process. In this article, we introduce an efficient method to infer the evolutionary history of individuals under the multispecies coalescent model in networks (MSNC). Phylogenetic networks are an extension of phylogenetic trees that can contain reticulate nodes, which make it possible to model complex biological events such as horizontal gene transfer, hybridization and introgression. We present a novel way to compute the likelihood of biallelic markers sampled along genomes whose evolution involved such events. This likelihood computation is at the heart of a Bayesian network inference method called SnappNet, as it extends the Snapp method, which infers evolutionary trees under the multispecies coalescent model, to networks. SnappNet is available as a package of the well-known BEAST 2 software. Recently, the MCMC_BiMarkers method, implemented in PhyloNet, also extended Snapp to networks. Both methods take biallelic markers as input, rely on the same model of evolution and sample networks in a Bayesian framework, though using different methods for computing priors. However, SnappNet relies on algorithms that are exponentially more time-efficient on non-trivial networks. Using simulations, we compare the performance of SnappNet and MCMC_BiMarkers. We show that both methods have similar abilities to recover simple networks, but SnappNet is more accurate than MCMC_BiMarkers on more complex network scenarios. Also, on complex networks, SnappNet is found to be far faster than MCMC_BiMarkers in terms of the time required for the likelihood computation. We finally illustrate SnappNet's performance on a rice data set. SnappNet infers a scenario that is consistent with previous results and provides additional understanding of rice evolution.


Subject(s)
Markov Chains , Monte Carlo Method , Phylogeny , Algorithms , Bayes Theorem , Computational Biology/methods , Evolution, Molecular , Genes, Plant , Likelihood Functions , Oryza/classification , Oryza/genetics
4.
BMC Bioinformatics ; 22(1): 285, 2021 May 28.
Article in English | MEDLINE | ID: mdl-34049487

ABSTRACT

BACKGROUND: Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are 'novel' compared to the others in the same dataset, and low weights to sequences that are over-represented. RESULTS: We formalise this principle by rigorously defining the evolutionary 'novelty' of a sequence within an alignment. This results in new sequence weights that we call 'phylogenetic novelty scores'. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column, which is important, for example, in protein family profiling. We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes. CONCLUSIONS: Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy.


Subject(s)
Algorithms , Computational Biology , Phylogeny , Sequence Alignment
5.
Theor Popul Biol ; 137: 22-31, 2021 02.
Article in English | MEDLINE | ID: mdl-33333117

ABSTRACT

The multispecies coalescent process models the genealogical relationships of genes sampled from several species, enabling useful predictions about phenomena such as the discordance between a gene tree and the species phylogeny due to incomplete lineage sorting. Conversely, knowledge of large collections of gene trees can inform us about several aspects of the species phylogeny, such as its topology and ancestral population sizes. A fundamental open problem in this context is how to efficiently compute the probability of a gene tree topology, given the species phylogeny. Although a number of algorithms for this task have been proposed, they either produce approximate results, or, when they are exact, they do not scale to large data sets. In this paper, we present some progress towards exact and efficient computation of the probability of a gene tree topology. We provide a new algorithm that, given a species tree and the number of genes sampled for each species, calculates the probability that the gene tree topology will be concordant with the species tree. Moreover, we provide an algorithm that computes the probability of any specific gene tree topology concordant with the species tree. Both algorithms run in polynomial time and have been implemented in Python. Experiments show that they are able to analyze data sets where thousands of genes are sampled in a matter of minutes to hours.


Subject(s)
Algorithms , Models, Genetic , Genetic Speciation , Phylogeny , Probability
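
The quantity studied in entry 5, the probability of a gene tree topology given a species tree, has a classical closed form in the smallest non-trivial case: three species with one gene copy each. The sketch below evaluates that textbook formula for a species tree ((A,B),C) whose internal branch has length t in coalescent units. It is offered only as a worked special case; it is not the polynomial-time algorithm described in the abstract, and the function name is made up for this example.

```python
import math


def three_taxon_topology_probs(t):
    """Gene tree topology probabilities for species tree ((A,B),C).

    t is the internal branch length in coalescent units, with one
    sampled lineage per species.
    """
    # The A and B lineages fail to coalesce on the internal branch with
    # probability exp(-t); the three topologies are then equally likely.
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # concordant topology
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


for t in (0.0, 0.5, 2.0):
    print(t, three_taxon_topology_probs(t))
```

With t = 0 all three topologies are equally likely, and the concordant topology approaches probability 1 as the internal branch grows, which is the incomplete lineage sorting effect mentioned in the abstract.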
6.
Bioinformatics ; 36(22-23): 5351-5360, 2021 Apr 01.
Article in English | MEDLINE | ID: mdl-33331849

ABSTRACT

MOTIVATION: Novel recombinant viruses may have important medical and evolutionary significance, as they sometimes display new traits not present in the parental strains. This is particularly concerning when the new viruses combine fragments coming from phylogenetically distinct viral types. Here, we consider the task of screening large collections of sequences for such novel recombinants. A number of methods already exist for this task. However, these methods rely on complex models and heavy computations that are not always practical for a quick scan of a large number of sequences. RESULTS: We have developed SHERPAS, a new program to detect novel recombinants and provide a first estimate of their parental composition. Our approach is based on the precomputation of a large database of 'phylogenetically-informed k-mers', an idea recently introduced in the context of phylogenetic placement in metagenomics. Our experiments show that SHERPAS is hundreds to thousands of times faster than existing software, and enables the analysis of thousands of whole genomes, or long-sequencing reads, within minutes or seconds, and with limited loss of accuracy. AVAILABILITY AND IMPLEMENTATION: The source code is freely available for download at https://github.com/phylo42/sherpas. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

7.
Bioinformatics ; 36(21): 5264-5266, 2021 01 29.
Article in English | MEDLINE | ID: mdl-32697844

ABSTRACT

MOTIVATION: Phylogenetic placement (PP) is a process of taxonomic identification for which several tools are now available. However, it remains difficult to assess which tool is best suited to particular genomic data or a particular reference taxonomy. We developed Placement Evaluation WOrkflows (PEWO), the first benchmarking tool dedicated to PP assessment. Its automated workflows can evaluate PP at many levels, from parameter optimization for a particular tool to the selection of the most appropriate genetic marker when PP-based species identifications are targeted. Our goal is that PEWO will become a community effort and a standard support for future developments and applications of PP. AVAILABILITY AND IMPLEMENTATION: https://github.com/phylo42/PEWO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Benchmarking , Software , Genome , Phylogeny , Workflow
8.
Bioinformatics ; 35(18): 3303-3312, 2019 09 15.
Article in English | MEDLINE | ID: mdl-30698645

ABSTRACT

MOTIVATION: Taxonomic classification is at the core of environmental DNA analysis. When a phylogenetic tree can be built as a prior hypothesis to such classification, phylogenetic placement (PP) provides the most informative type of classification because each query sequence is assigned to its putative origin in the tree. This is useful whenever precision is sought (e.g. in diagnostics). However, likelihood-based PP algorithms struggle to scale with the ever-increasing throughput of DNA sequencing. RESULTS: We have developed RAPPAS (Rapid Alignment-free Phylogenetic Placement via Ancestral Sequences) which uses an alignment-free approach, removing the hurdle of query sequence alignment as a preliminary step to PP. Our approach relies on the precomputation of a database of k-mers that may be present with non-negligible probability in relatives of the reference sequences. The placement is performed by inspecting the stored phylogenetic origins of the k-mers in the query, and their probabilities. The database can be reused for the analysis of several different metagenomes. Experiments show that the first implementation of RAPPAS is already faster than competing likelihood-based PP algorithms, while keeping similar accuracy for short reads. RAPPAS scales PP for the era of routine metagenomic diagnostics. AVAILABILITY AND IMPLEMENTATION: Program and sources freely available for download at https://github.com/blinard-BIOINFO/RAPPAS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Metagenome , Software , Algorithms , Likelihood Functions , Phylogeny , Sequence Alignment , Sequence Analysis, DNA
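
The abstract of entry 8 describes placement as a lookup of the query's k-mers in a precomputed database of branch probabilities. The toy sketch below shows one way such a lookup could be scored: each branch accumulates the log-probabilities of the query k-mers, with a fixed penalty when a k-mer is not stored for that branch. The database layout, the penalty value and all names are assumptions made for illustration and do not reflect the actual RAPPAS data structures or scoring.

```python
def place_query(query, k, db, branches, default_logp):
    """Score every branch for a query and return (best_branch, scores).

    db maps a k-mer to {branch: log10 probability}; default_logp is used
    when a k-mer is not stored for a branch.
    """
    scores = {b: 0.0 for b in branches}
    for i in range(len(query) - k + 1):
        hits = db.get(query[i:i + k], {})
        for b in branches:
            scores[b] += hits.get(b, default_logp)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical database with three branches and k = 3
toy_db = {
    "ACG": {"b1": -0.1, "b2": -1.0},
    "CGT": {"b1": -0.2},
}
best_branch, branch_scores = place_query(
    "ACGT", 3, toy_db, ["b1", "b2", "b3"], default_logp=-3.0
)
print(best_branch, branch_scores)
```

No alignment of the query is needed: the only per-query work is splitting it into k-mers and summing stored values, which is where the speed of this family of methods comes from.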
9.
J Math Biol ; 78(1-2): 527-547, 2019 01.
Article in English | MEDLINE | ID: mdl-30121824

ABSTRACT

Phylogenetic networks are often constructed by merging multiple conflicting phylogenetic signals into a directed acyclic graph. It is interesting to explore whether a network constructed in this way induces biologically relevant phylogenetic signals that were not present in the input. Here we show that, given a multiple alignment A for a set of taxa X and a rooted phylogenetic network N whose leaves are labelled by X, it is NP-hard to locate a most parsimonious phylogenetic tree displayed by N (with respect to A) even when the level of N (the maximum number of reticulation nodes within a biconnected component) is 1 and A contains only 2 distinct states. (If, additionally, gaps are allowed, the problem becomes APX-hard.) We also show that under the same conditions, and assuming a simple binary symmetric model of character evolution, finding a most likely tree displayed by the network is NP-hard. These negative results contrast with earlier work on parsimony in which it is shown that if A consists of a single column the problem is fixed parameter tractable in the level. We conclude with a discussion of why, despite the NP-hardness, both the parsimony and likelihood problem can likely be well-solved in practice.


Subject(s)
Models, Genetic , Phylogeny , Algorithms , Animals , Computational Biology , Evolution, Molecular , Genetic Speciation , Humans , Mathematical Concepts
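
Entry 9 concerns finding a most parsimonious tree among those displayed by a network. Scoring a single candidate tree is the easy part, and the sketch below does it with Fitch's classical algorithm on a rooted binary tree written as nested tuples. The hardness result in the abstract is about choosing among the displayed trees, which this sketch does not attempt; the three candidate trees and the two-state alignment are made up for illustration.

```python
def fitch_column_score(tree, states):
    """Minimum number of state changes for one alignment column (Fitch)."""
    def visit(node):
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:                         # children agree: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return visit(tree)[1]


def parsimony_score(tree, alignment):
    """Sum of Fitch scores over all columns of the alignment."""
    n_cols = len(next(iter(alignment.values())))
    return sum(
        fitch_column_score(tree, {taxon: seq[col] for taxon, seq in alignment.items()})
        for col in range(n_cols)
    )


# Hypothetical binary alignment and three candidate trees
alignment = {"a": "00", "b": "01", "c": "10", "d": "11"}
candidates = [
    (("a", "b"), ("c", "d")),
    (("a", "c"), ("b", "d")),
    (("a", "d"), ("b", "c")),
]
scores = [parsimony_score(t, alignment) for t in candidates]
print(scores)
```

When the trees displayed by a network can be listed explicitly, as here, the most parsimonious one is found by comparing such scores; the difficulty shown in the paper arises because the number of displayed trees can grow exponentially with the number of reticulations.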
10.
PLoS Comput Biol ; 13(8): e1005611, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28763439

ABSTRACT

Phylogenetic tree reconstruction is usually done by local search heuristics that explore the space of the possible tree topologies via simple rearrangements of their structure. Tree rearrangement heuristics have been used in combination with practically all optimization criteria in use, from maximum likelihood and parsimony to distance-based principles, and in a Bayesian context. Their basic components are rearrangement moves that specify all possible ways of generating alternative phylogenies from a given one, and whose fundamental property is to be able to transform, by repeated application, any phylogeny into any other phylogeny. Despite their long tradition in tree-based phylogenetics, very little research has gone into studying similar rearrangement operations for phylogenetic networks, that is, phylogenies explicitly representing scenarios that include reticulate events such as hybridization, horizontal gene transfer, population admixture, and recombination. To fill this gap, we propose "horizontal" moves that ensure that every network of a certain complexity can be reached from any other network of the same complexity, and "vertical" moves that ensure reachability between networks of different complexities. When applied to phylogenetic trees, our horizontal moves (named rNNI and rSPR) reduce to the best-known moves on rooted phylogenetic trees, nearest-neighbor interchange and rooted subtree pruning and regrafting. Besides a number of reachability results (separating the contributions of horizontal and vertical moves), we prove that rNNI moves are local versions of rSPR moves, and provide bounds on the sizes of the rNNI neighborhoods. The paper focuses on the most biologically meaningful versions of phylogenetic networks, where edges are oriented and reticulation events clearly identified. Moreover, our rearrangement moves are robust to the fact that networks with higher complexity usually allow a better fit with the data. Our goal is to provide a solid basis for practical phylogenetic network reconstruction.


Subject(s)
Computational Biology/methods , Gene Rearrangement/genetics , Models, Genetic , Phylogeny , Animals , Hominidae/genetics , Humans
11.
Bull Math Biol ; 78(9): 1773-1795, 2016 09.
Article in English | MEDLINE | ID: mdl-27659024

ABSTRACT

Phylogenetic networks are increasingly used in evolutionary biology to represent the history of species that have undergone reticulate events such as horizontal gene transfer, hybrid speciation and recombination. One of the most fundamental questions that arise in this context is whether the evolution of a gene with one copy in all species can be explained by a given network. In mathematical terms, this is often translated in the following way: is a given phylogenetic tree contained in a given phylogenetic network? Recently this tree containment problem has been widely investigated from a computational perspective, but most studies have only focused on the topology of the phylogenies, ignoring a piece of information that, in the case of phylogenetic trees, is routinely inferred by evolutionary analyses: branch lengths. These measure the amount of change (e.g., nucleotide substitutions) that has occurred along each branch of the phylogeny. Here, we study a number of versions of the tree containment problem that explicitly account for branch lengths. We show that, although length information has the potential to locate more precisely a tree within a network, the problem is computationally hard in its most general form. On a positive note, for a number of special cases of biological relevance, we provide algorithms that solve this problem efficiently. This includes the case of networks of limited complexity, for which it is possible to recover, among the trees contained by the network with the same topology as the input tree, the closest one in terms of branch lengths.


Subject(s)
Models, Genetic , Phylogeny , Algorithms , Evolution, Molecular , Mathematical Concepts
12.
BMC Bioinformatics ; 17: 23, 2016 Jan 07.
Article in English | MEDLINE | ID: mdl-26744021

ABSTRACT

BACKGROUND: Branch lengths are an important attribute of phylogenetic trees, providing essential information for many studies in evolutionary biology. Yet, part of the current methodology to reconstruct a phylogeny from genomic information (namely supertree methods) focuses on the topology or structure of the phylogenetic tree, rather than the evolutionary divergences associated with it. Moreover, accurate methods to estimate branch lengths (typically based on probabilistic analysis of a concatenated alignment) are limited by large demands in memory and computing time, and may become impractical when the data sets are too large. RESULTS: Here, we present a novel phylogenomic distance-based method, named ERaBLE (Evolutionary Rates and Branch Length Estimation), to estimate the branch lengths of a given reference topology, and the relative evolutionary rates of the genes employed in the analysis. ERaBLE uses as input data a potentially very large collection of distance matrices, where each matrix is obtained from a different genomic region, either directly from its sequence alignment or indirectly from a gene tree inferred from the alignment. Our experiments show that ERaBLE is very fast and fairly accurate when compared to other possible approaches for the same tasks. Specifically, it efficiently and accurately deals with large data sets, such as the OrthoMaM v8 database, composed of 6,953 exons from up to 40 mammals. CONCLUSIONS: ERaBLE may be used as a complement to supertree methods, or it may provide an efficient alternative to maximum likelihood analysis of concatenated alignments, to estimate branch lengths from phylogenomic data sets.


Subject(s)
Genomics/methods , Phylogeny , Animals , Computer Simulation , Databases, Factual , Mammals/genetics , Models, Molecular , Sequence Alignment
13.
PLoS Comput Biol ; 11(4): e1004135, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25849429

ABSTRACT

Phylogenetic networks represent the evolution of organisms that have undergone reticulate events, such as recombination, hybrid speciation or lateral gene transfer. An important way to interpret a phylogenetic network is in terms of the trees it displays, which represent all the possible histories of the characters carried by the organisms in the network. Interestingly, however, different networks may display exactly the same set of trees, an observation that poses a problem for network reconstruction: from the perspective of many inference methods such networks are "indistinguishable". This is true for all methods that evaluate a phylogenetic network solely on the basis of how well the displayed trees fit the available data, including all methods based on input data consisting of clades, triples, quartets, or trees with any number of taxa, and also sequence-based approaches such as popular formalisations of maximum parsimony and maximum likelihood for networks. This identifiability problem is partially solved by accounting for branch lengths, although this merely reduces the frequency of the problem. Here we propose that network inference methods should only attempt to reconstruct what they can uniquely identify. To this end, we introduce a novel definition of what constitutes a uniquely reconstructible network. For any given set of indistinguishable networks, we define a canonical network that, under mild assumptions, is unique and thus representative of the entire set. Given data that underwent reticulate evolution, only the canonical form of the underlying phylogenetic network can be uniquely reconstructed. While on the methodological side this will imply a drastic reduction of the solution space in network inference, for the study of reticulate evolution this is a fundamental limitation that will require an important change of perspective when interpreting phylogenetic networks.


Subject(s)
Biological Evolution , Evolution, Molecular , Gene Expression Regulation/genetics , Gene Regulatory Networks/genetics , Models, Genetic , Phylogeny , Computer Simulation
14.
Proc Natl Acad Sci U S A ; 109(41): 16443-8, 2012 Oct 09.
Article in English | MEDLINE | ID: mdl-23012403

ABSTRACT

Several popular methods for phylogenetic inference (or hierarchical clustering) are based on a matrix of pairwise distances between taxa (or any kind of objects): The objective is to construct a tree with branch lengths so that the distances between the leaves in that tree are as close as possible to the input distances. If we hold the structure (topology) of the tree fixed, in some relevant cases (e.g., ordinary least squares) the optimal values for the branch lengths can be expressed using simple combinatorial formulae. Here we define a general form for these formulae and show that they all have two desirable properties: First, the common tree reconstruction approaches (least squares, minimum evolution), when used in combination with these formulae, are guaranteed to infer the correct tree when given enough data (consistency); second, the branch lengths of all the simple (nearest neighbor interchange) rearrangements of a tree can be calculated, optimally, in quadratic time in the size of the tree, thus allowing the efficient application of hill climbing heuristics. The study presented here is a continuation of that by Mihaescu and Pachter on branch length estimation [Mihaescu R, Pachter L (2008) Proc Natl Acad Sci USA 105:13206-13211]. The focus here is on the inference of the tree itself and on providing a basis for novel algorithms to reconstruct trees from distances.


Subject(s)
Algorithms , Computational Biology/methods , Models, Genetic , Phylogeny , Animals , Cluster Analysis , Computer Simulation , Evolution, Molecular , Humans , Reproducibility of Results
15.
Bull Math Biol ; 72(7): 1820-39, 2010 Oct.
Article in English | MEDLINE | ID: mdl-20449671

ABSTRACT

Minimum evolution is the guiding principle of an important class of distance-based phylogeny reconstruction methods, including neighbor-joining (NJ), which is the most cited tree inference algorithm to date. The minimum evolution principle involves searching for the tree with minimum length, where the length is estimated using various least-squares criteria. Since evolutionary distances cannot be known precisely but only estimated, it is important to investigate the robustness of phylogenetic reconstruction to imprecise estimates for these distances. The safety radius is a measure of this robustness: it consists of the maximum relative deviation that the input distances can have from the correct distances, without compromising the reconstruction of the correct tree structure. Answering some open questions, we here derive the safety radius of two popular minimum evolution criteria: balanced minimum evolution (BME) and minimum evolution based on ordinary least squares (OLS + ME). Whereas BME has a radius of 1/2, which is the best achievable, OLS + ME has a radius tending to 0 as the number of taxa increases. This difference may explain the gap in reconstruction accuracy observed in practice between OLS + ME and BME (which forms the basis of popular programs such as NJ and FastME).


Subject(s)
Evolution, Molecular , Models, Genetic , Phylogeny , Least-Squares Analysis
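
For the balanced minimum evolution (BME) criterion discussed in entry 15, the length of a fully resolved unrooted tree can be written with Pauplin's formula: the sum over leaf pairs of 2^(1 - tau_ij) * d_ij, where tau_ij is the number of edges on the path between leaves i and j. The sketch below evaluates that formula for two quartet topologies. The adjacency-list encoding, the node names and the distance values are assumptions chosen for this example.

```python
from collections import deque
from itertools import combinations


def edge_counts_from(tree, start):
    """Number of edges from `start` to every node (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in tree[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def bme_length(tree, d):
    """Balanced tree length: sum of 2^(1 - tau_ij) * d_ij over leaf pairs."""
    leaves = sorted(n for n, nbrs in tree.items() if len(nbrs) == 1)
    total = 0.0
    for i, j in combinations(leaves, 2):
        tau = edge_counts_from(tree, i)[j]
        total += 2.0 ** (1 - tau) * d[frozenset((i, j))]
    return total


# Distances generated by the quartet ab|cd with all five edges of length 1
d = {frozenset(p): v for p, v in [
    (("a", "b"), 2.0), (("c", "d"), 2.0),
    (("a", "c"), 3.0), (("a", "d"), 3.0), (("b", "c"), 3.0), (("b", "d"), 3.0),
]}
true_tree = {"a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"],
             "u": ["a", "b", "v"], "v": ["c", "d", "u"]}
wrong_tree = {"a": ["u"], "c": ["u"], "b": ["v"], "d": ["v"],
              "u": ["a", "c", "v"], "v": ["b", "d", "u"]}

print(bme_length(true_tree, d), bme_length(wrong_tree, d))
```

On these error-free distances the generating topology receives the smaller length, equal to the sum of its five edge lengths, which is the behaviour that minimum evolution relies on. The safety radius in the abstract quantifies how much the input distances may deviate before this ordering can fail.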
16.
Article in English | MEDLINE | ID: mdl-20150680

ABSTRACT

We explore the maximum parsimony (MP) and ancestral maximum likelihood (AML) criteria in phylogenetic tree reconstruction. Both problems are NP-hard, so we seek approximate solutions. We formulate the two problems as Steiner tree problems under appropriate distances. The gist of our approach is the succinct characterization of Steiner trees for a small number of leaves for the two distances. This enables the use of known Steiner tree approximation algorithms. The approach leads to a 16/9 approximation ratio for AML and asymptotically to a 1.55 approximation ratio for MP.


Subject(s)
Algorithms , DNA Mutational Analysis/methods , Evolution, Molecular , Models, Genetic , Sequence Analysis, DNA/methods , Base Sequence , Computer Simulation , Data Interpretation, Statistical , Likelihood Functions , Models, Statistical , Molecular Sequence Data
17.
Article in English | MEDLINE | ID: mdl-19179696

ABSTRACT

In the last 15 years, Phylogenetic Diversity (PD) has gained interest in the community of conservation biologists as a surrogate measure for assessing biodiversity. We have recently proposed two approaches to select taxa for maximizing PD, namely PD with budget constraints and PD on split systems. In this paper, we will unify these two strategies and present a dynamic programming algorithm to solve the unified framework of selecting taxa with maximal PD under budget constraints on circular split systems. An improved algorithm will also be given if the underlying split system is a tree.


Subject(s)
Biodiversity , Computational Biology/methods , Conservation of Natural Resources , Phylogeny , Software , Algorithms , Animals , Conservation of Natural Resources/economics , Conservation of Natural Resources/methods , Extinction, Biological , Systems Integration
18.
J Theor Biol ; 251(2): 286-96, 2008 Mar 21.
Article in English | MEDLINE | ID: mdl-18199459

ABSTRACT

Phylogenetic diversity is a measure for describing how much of an evolutionary tree is spanned by a subset of species. If one applies this to the unknown subset of current species that will still be present at some future time, then this 'future phylogenetic diversity' provides a measure of the impact of various extinction scenarios in biodiversity conservation. In this paper, we study the distribution of future phylogenetic diversity under a simple model of extinction (a generalized 'field of bullets' model). We show that the distribution of future phylogenetic diversity converges to a normal distribution as the number of species grows, under mild conditions, which are necessary. We also describe an algorithm to compute the distribution efficiently, provided the edge lengths are integral, and briefly outline the significance of our findings for biodiversity conservation.


Subject(s)
Biodiversity , Extinction, Biological , Phylogeny , Algorithms , Animals , Conservation of Natural Resources , Models, Biological
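
Entry 18 studies the full distribution of future phylogenetic diversity; its expectation is simple to compute when species survive independently, as in a field-of-bullets model. An edge of a rooted tree contributes its length whenever at least one species below it survives, so the expected value is the sum over edges of length * (1 - product of extinction probabilities below the edge). The sketch below implements only this expectation, not the distribution algorithm from the paper; the edge-list encoding and the toy tree are assumptions.

```python
import math


def expected_future_pd(edges, survival):
    """Expected future PD of a rooted tree under independent extinctions.

    edges: list of (length, leaves_below_edge);
    survival: {leaf: probability that the species survives}.
    """
    total = 0.0
    for length, leaves_below in edges:
        # The edge is lost only if every species below it goes extinct.
        p_all_lost = math.prod(1.0 - survival[leaf] for leaf in leaves_below)
        total += length * (1.0 - p_all_lost)
    return total


# Hypothetical rooted tree ((a:1,b:1):2,c:3)
edges = [
    (1.0, {"a"}),
    (1.0, {"b"}),
    (2.0, {"a", "b"}),
    (3.0, {"c"}),
]
print(expected_future_pd(edges, {"a": 0.5, "b": 0.5, "c": 0.5}))
print(expected_future_pd(edges, {"a": 1.0, "b": 1.0, "c": 1.0}))
```

The second call, where every species is certain to survive, returns the present-day phylogenetic diversity of the whole tree, which gives a quick sanity check for the edge list.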
19.
Bioinformatics ; 24(1): 11-7, 2008 Jan 01.
Article in English | MEDLINE | ID: mdl-18006548

ABSTRACT

MOTIVATION: Alternative splicing has the potential to generate a wide range of protein isoforms. For many computational applications and for experimental research, it is important to be able to concentrate on the isoform that retains the core biological function. For many genes this is far from clear. RESULTS: We have combined five methods into a pipeline that allows us to detect the principal variant for a gene. Most of the methods were based on conservation between species, at the level of both gene and protein. The five methods used were the conservation of exonic structure, the detection of non-neutral evolution, the conservation of functional residues, the existence of a known protein structure and the abundance of vertebrate orthologues. The pipeline was able to determine a principal isoform for 83% of a set of well-annotated genes with multiple variants.


Subject(s)
Alternative Splicing/genetics , Evolution, Molecular , Gene Expression Profiling/methods , Protein Isoforms/genetics , Sequence Alignment/methods , Sequence Analysis/methods , Algorithms , Sequence Homology, Nucleic Acid
20.
Genome Res ; 17(6): 760-74, 2007 Jun.
Article in English | MEDLINE | ID: mdl-17567995

ABSTRACT

A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.


Subject(s)
Evolution, Molecular , Genome, Human , Mammals/genetics , Open Reading Frames , Phylogeny , Sequence Alignment , Animals , Human Genome Project , Humans