1.
Mol Biol Evol ; 29(10): 2921-36, 2012 Oct.
Article in English | MEDLINE | ID: mdl-22491036

ABSTRACT

Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper, we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared with LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. Our models, data, and software are available from http://www.atgc-montpellier.fr/models/lg4x.
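To illustrate how a per-category mixture differs from the standard gamma model described above, the following is a minimal sketch that computes a site likelihood for two sequences over a hypothetical two-state alphabet. The two-state simplification, the function names, and all parameter values are assumptions made for illustration; this is not the LG4M or LG4X parameterization, nor the authors' software.

```python
import math

def transition_prob(pi, i, j, t):
    """P(i -> j | t) for a two-state reversible model with equilibrium pi,
    normalized to one expected substitution per unit time."""
    decay = math.exp(-t / (2.0 * pi[0] * pi[1]))
    same = 1.0 if i == j else 0.0
    return pi[j] + (same - pi[j]) * decay

def site_likelihood(x, y, t, categories):
    """Mixture likelihood of observing states (x, y) at distance t.

    categories: list of (weight, rate, pi) tuples, one per rate category.
    """
    total = 0.0
    for weight, rate, pi in categories:
        total += weight * pi[x] * transition_prob(pi, x, y, rate * t)
    return total

# Gamma-like model: one shared equilibrium distribution and equal weights;
# only the rate differs between categories (illustrative values).
shared_pi = (0.5, 0.5)
single_matrix = [(0.25, r, shared_pi) for r in (0.1, 0.5, 1.0, 2.4)]

# Distribution-free, four-matrix scheme: each category has its own
# equilibrium distribution, rate and weight (illustrative values).
four_matrix = [
    (0.4, 0.2, (0.8, 0.2)),
    (0.3, 0.7, (0.6, 0.4)),
    (0.2, 1.5, (0.5, 0.5)),
    (0.1, 4.1, (0.3, 0.7)),
]

if __name__ == "__main__":
    for name, cats in (("single", single_matrix), ("four", four_matrix)):
        print(name, site_likelihood(0, 1, 0.3, cats))
```

In both schemes the cost per site is one likelihood evaluation per category, which is consistent with the abstract's remark that running times are comparable to standard models.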


Subjects
Amino Acid Substitution/genetics, Molecular Evolution, Genetic Models, Mutation Rate, Proteins/genetics, Algorithms, Protein Databases, Likelihood Functions, Time Factors
2.
Genome Res ; 21(6): 952-60, 2011 Jun.
Article in English | MEDLINE | ID: mdl-20980557

ABSTRACT

Reductions in the cost of sequencing have enabled whole-genome sequencing to identify sequence variants segregating in a population. An efficient approach is to sequence many samples at low coverage, then to combine data across samples to detect shared variants. Here, we present methods to discover and genotype single-nucleotide polymorphism (SNP) sites from low-coverage sequencing data, making use of shared haplotype (linkage disequilibrium) information. For each population, we first collect SNP candidates based on independent sequence calls per site. We then use MARGARITA with genotype or phased haplotype data from the same samples to collect 20 ancestral recombination graphs (ARGs). We refine the posterior probability of SNP candidates by considering possible mutations at internal branches of the 40 marginal ancestral trees inferred from the 20 ARGs at the left and right flanking genotype sites. Using a population genetic prior distribution on tree-branch length and Bayesian inference, we determine a posterior probability of the SNP being real and also the most probable phased genotype call for each individual. We present experiments on both simulation data and real data from the 1000 Genomes Project to prove the applicability of the methods. We also explore the relative tradeoff between sequencing depth and the number of sequenced samples.
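The final Bayesian step can be illustrated generically. The sketch below applies Bayes' rule to a single candidate site, given a prior probability that the site is a real SNP and the likelihood of the observed reads under each hypothesis. In the method described above these quantities are derived from marginal trees of the ARGs; here they are simply passed in, and the function name and arguments are hypothetical.

```python
def snp_posterior(prior, lik_if_snp, lik_if_error):
    """Posterior probability that a candidate site is a real SNP (Bayes' rule).

    prior        -- prior probability that the site is a real SNP
    lik_if_snp   -- probability of the observed data if the SNP is real
    lik_if_error -- probability of the observed data if it is an artifact
    """
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must be in [0, 1]")
    evidence = prior * lik_if_snp + (1.0 - prior) * lik_if_error
    if evidence == 0.0:
        raise ValueError("data have zero probability under both hypotheses")
    return prior * lik_if_snp / evidence

if __name__ == "__main__":
    # Illustrative numbers only: a rare-variant prior and data that are
    # far more probable under the SNP hypothesis.
    print(snp_posterior(0.001, 0.9, 0.01))
```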


Subjects
Algorithms, Phylogeny, Single Nucleotide Polymorphism/genetics, DNA Sequence Analysis/methods, Software, Bayes Theorem, Computer Simulation, Genotype, Haplotypes/genetics, Humans, Likelihood Functions, Linkage Disequilibrium, Genetic Models
3.
Syst Biol ; 59(3): 277-87, 2010 May.
Article in English | MEDLINE | ID: mdl-20525635

ABSTRACT

Amino acid substitution models are essential to most methods to infer phylogenies from protein data. These models represent the ways in which proteins evolve and substitutions accumulate along the course of time. It is widely accepted that the substitution processes vary depending on the structural configuration of the protein residues. However, this information is very rarely used in phylogenetic studies, though the 3-dimensional structure of tens of thousands of proteins has been elucidated. Here, we reinvestigate the question in order to fill this gap. We use an improved estimation methodology and a very large database comprising 1471 nonredundant globular protein alignments with structural annotations to estimate new amino acid substitution models accounting for the secondary structure and solvent accessibility of the residues. These models incorporate a confidence coefficient that is estimated from the data and reflects the reliability and usefulness of structural annotations in the analyzed sequences. Our results with 300 independent test alignments show an impressive likelihood gain compared with standard models such as JTT or WAG. Moreover, the use of these models induces significant topological changes in the inferred trees, which should be of primary interest to phylogeneticists. Our data, models, and software are available for download from http://atgc.lirmm.fr/phyml-structure/.


Subjects
Classification/methods, Molecular Evolution, Genetic Models, Phylogeny, Protein Conformation, Proteins/genetics, Amino Acid Sequence, Genetic Databases, Likelihood Functions, Sequence Alignment, Software
4.
Philos Trans R Soc Lond B Biol Sci ; 363(1512): 3965-76, 2008 Dec 27.
Article in English | MEDLINE | ID: mdl-18852096

ABSTRACT

Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These impact the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from TREEBASE. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where in estimations we use the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TREEBASE test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307-1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.


Subjects
Amino Acid Substitution/genetics, Molecular Evolution, Genetic Models, Phylogeny, Proteins/genetics, Likelihood Functions
5.
Mol Biol Evol ; 25(7): 1307-20, 2008 Jul.
Article in English | MEDLINE | ID: mdl-18367465

ABSTRACT

Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches and thus the likelihood of the data. They are also essential in protein alignment. A number of replacement matrices and methods to estimate these matrices from protein alignments have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001) and their WAG matrix, thanks to an efficient maximum likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation and using a much larger and diverse database than BRKALN, which was used to estimate WAG. To estimate our new matrix (called LG after the authors), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall. To evaluate the LG performance, we use an independent sample consisting of 59 alignments from TreeBase and randomly divide Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that 1) the average Akaike information criterion gain per site is 0.25 and 0.42, when compared with WAG and JTT, respectively; 2) LG is significantly better than WAG for 38 alignments (among 59), and significantly worse with 2 alignments only; and 3) tree topologies inferred with LG, WAG, and JTT frequently differ, indicating that using LG impacts not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous. LG and a PHYML implementation can be downloaded from http://atgc.lirmm.fr/LG.
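The "AIC gain per site" reported above can be made concrete with a short sketch. It uses the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood. How the authors average over alignments is not stated in the abstract, so the per-alignment computation below, the function names, and the numbers in the example are assumptions for illustration.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def aic_gain_per_site(lnl_new, k_new, lnl_ref, k_ref, n_sites):
    """AIC improvement of a new model over a reference model, per site.

    Positive values mean the new model fits better after penalizing
    for its number of free parameters.
    """
    if n_sites <= 0:
        raise ValueError("n_sites must be positive")
    return (aic(lnl_ref, k_ref) - aic(lnl_new, k_new)) / n_sites

if __name__ == "__main__":
    # Hypothetical alignment of 400 sites. Both models have the same number
    # of free parameters, so the gain reduces to 2 * (lnL_new - lnL_ref) / sites.
    print(aic_gain_per_site(-10000.0, 10, -10060.0, 10, 400))  # 0.3
```

When two fixed replacement matrices are compared on the same tree and rate model, the parameter counts cancel and the gain depends only on the log-likelihood difference.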


Subjects
Amino Acid Sequence, Computational Biology/methods, Genetic Models, Sequence Alignment/methods, Protein Sequence Analysis/methods, Nucleic Acid Databases, Molecular Evolution, Molecular Sequence Data, Phylogeny, Software
6.
Stud Health Technol Inform ; 129(Pt 2): 1304-8, 2007.
Article in English | MEDLINE | ID: mdl-17911925

ABSTRACT

User-centered, general-purpose tools for analyzing laboratory data by data mining have not been available in medicine. We analyzed 1,565,877 laboratory test results from 771 patients with viral hepatitis to find differences in the temporal changes of laboratory test data between hepatitis B and hepatitis C, combining temporal abstraction with data mining. The data for a single patient span more than 5 years. After preprocessing, the data were converted to abstract patterns, from which combinations of data and rules for identifying hepatitis B or C were selected using D2MS and LUPC, two systems that we developed ourselves. The rules took into account not only data patterns but also temporal relations. When domain experts evaluated the results, no especially remarkable hypotheses emerged, but the visualization tools made it easier for them to understand the relations among the complicated rules.


Subjects
Data Display, Hepatitis B/diagnosis, Hepatitis C/diagnosis, Information Storage and Retrieval/methods, Liver Function Tests, Humans, Time Factors
7.
Genome Inform ; 15(2): 82-91, 2004.
Article in English | MEDLINE | ID: mdl-15706494

ABSTRACT

In this paper, we propose a graph-based method to measure the similarity between chemical compounds described in 2D form. Our main idea is to measure the similarity between two compounds based on the edges, nodes, and connectivity of their common subgraphs. We applied the proposed similarity measure, in combination with a clustering method, to more than eleven thousand compounds in the chemical compound database KEGG/LIGAND and discovered clusters of structurally highly similar compounds that share common names, take part in the same pathways, and have the same enzyme requirements in reactions. Furthermore, we discovered a surprising correspondence between the pathway modules identified by clusters of structurally similar compounds and those identified by genomic contexts, namely, operon structures of enzyme genes.
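The abstract does not give the similarity formula, so the following is only a simplified stand-in for the idea of comparing compounds through shared nodes and edges. It computes a Tanimoto coefficient over multisets of atom labels and labeled bonds; unlike the measure described above, it ignores connectivity and does not search for common subgraphs. The molecule representation and all names are assumptions for illustration.

```python
from collections import Counter

def labeled_edges(atoms, bonds):
    """Multiset of bonds, each keyed by the sorted pair of its atom labels."""
    return Counter(tuple(sorted((atoms[a], atoms[b]))) for a, b in bonds)

def features(molecule):
    """Multiset of node labels plus labeled edges for one molecule.

    molecule: (atoms, bonds) with atoms a dict {index: element symbol}
    and bonds a list of (index, index) pairs.
    """
    atoms, bonds = molecule
    return Counter(atoms.values()) + labeled_edges(atoms, bonds)

def similarity(mol1, mol2):
    """Tanimoto coefficient on node and edge multisets, in [0, 1]."""
    f1, f2 = features(mol1), features(mol2)
    shared = sum((f1 & f2).values())   # multiset intersection
    total = sum((f1 | f2).values())    # multiset union
    return shared / total if total else 1.0

# Heavy-atom skeletons of two isomers (hydrogens omitted).
ethanol = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
dimethyl_ether = ({0: "C", 1: "O", 2: "C"}, [(0, 1), (1, 2)])

if __name__ == "__main__":
    print(similarity(ethanol, dimethyl_ether))
```

A pairwise matrix of such scores could then be passed to any standard clustering method, which is the role the similarity measure plays in the study.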


Subjects
Algorithms, Computational Biology/methods, Chemical Models, Theoretical Models, Ligands, Molecular Conformation, Molecular Structure