Results 1 - 20 of 66
1.
Mol Biol Evol ; 41(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38842253

ABSTRACT

Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modeling. Our indel-aware approach harnesses the phylogenetic signal from indels, including them in all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Due to its remarkable speed, our method is well suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by considering gaps as crucial evolutionary signals rather than mere artefacts.


Subject(s)
INDEL Mutation; Phylogeny; Sequence Alignment; Sequence Alignment/methods; Evolution, Molecular; Models, Genetic; Humans
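
The abstract above mentions affine gap penalties for modeling long indels. As a rough illustration of that general idea only (not indelMaP's implementation), the Python sketch below charges one opening cost per contiguous gap run plus a smaller cost for each additional position, so a single long indel is cheaper than several separate short ones. The function names and the penalty values 2.5 and 0.5 are assumptions chosen for the example, and the sketch scans a single alignment row rather than placing events on a tree.

```python
def affine_gap_cost(gap_length: int, gap_open: float = 2.5, gap_extend: float = 0.5) -> float:
    """Cost of one contiguous gap of `gap_length` positions (hypothetical penalties)."""
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0.0
    # One opening charge, then a smaller charge for every further position.
    return gap_open + (gap_length - 1) * gap_extend


def alignment_row_gap_cost(row: str, gap_open: float = 2.5, gap_extend: float = 0.5) -> float:
    """Sum the affine cost over every maximal run of '-' in one aligned row."""
    total = 0.0
    run = 0
    for ch in row:
        if ch == "-":
            run += 1
        else:
            total += affine_gap_cost(run, gap_open, gap_extend)
            run = 0
    # Account for a gap run that reaches the end of the row.
    return total + affine_gap_cost(run, gap_open, gap_extend)
```

For instance, `alignment_row_gap_cost("AC---GT-A")` treats the three-position run as one event and the single gap as another.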
2.
Syst Biol ; 72(2): 307-318, 2023 Jun 16.
Article in English | MEDLINE | ID: mdl-35866991

ABSTRACT

Modern phylogenetic methods allow inference of ancestral molecular sequences given an alignment and phylogeny relating present-day sequences. This provides insight into the evolutionary history of molecules, helping to understand gene function and to study biological processes such as adaptation and convergent evolution across a variety of applications. Here, we propose a dynamic programming algorithm for fast joint likelihood-based reconstruction of ancestral sequences under the Poisson Indel Process (PIP). Unlike previous approaches, our method, named ARPIP, enables the reconstruction with insertions and deletions based on an explicit indel model. Consequently, inferred indel events have an explicit biological interpretation. Likelihood computation is achieved in linear time with respect to the number of sequences. Our method consists of two steps, namely finding the most probable indel points and reconstructing ancestral sequences. First, we find the most likely indel points and prune the phylogeny to reflect the insertion and deletion events per site. Second, we infer the ancestral states on the pruned subtree in a manner similar to FastML. We applied ARPIP (Ancestral Reconstruction under PIP) on simulated data sets and on real data from the Betacoronavirus genus. ARPIP reconstructs both the indel events and substitutions with a high degree of accuracy. Our method fares well when compared to established state-of-the-art methods such as FastML and PAML. Moreover, the method can be extended to explore both optimal and suboptimal reconstructions, include rate heterogeneity through time and more. We believe it will expand the range of novel applications of ancestral sequence reconstruction. [Ancestral sequences; dynamic programming; evolutionary stochastic process; indel; joint ancestral sequence reconstruction; maximum likelihood; Poisson Indel Process; phylogeny; SARS-CoV.].


Subject(s)
Algorithms; INDEL Mutation; Phylogeny; Likelihood Functions; Sequence Alignment; INDEL Mutation/genetics; Evolution, Molecular
3.
J Evol Biol ; 36(2): 321-336, 2023 02.
Article in English | MEDLINE | ID: mdl-36289560

ABSTRACT

Short tandem repeats (STRs) are units of 1-6 bp that repeat in a tandem fashion in DNA. Along with single nucleotide polymorphisms and large structural variations, they are among the major genomic variants underlying genetic, and likely phenotypic, divergence. STRs experience mutation rates that are orders of magnitude higher than other well-studied genotypic variants. Frequent copy number changes result in a wide range of alleles, and provide unique opportunities for modulating complex phenotypes through variation in repeat length. While classical studies have identified key roles of individual STR loci, the advent of improved sequencing technology, high-quality genome assemblies for diverse species, and bioinformatics methods for genome-wide STR analysis now enable more systematic study of STR variation across wide evolutionary ranges. In this review, we explore mutation and selection processes that affect STR copy number evolution, and how these processes give rise to varying STR patterns both within and across species. Finally, we review recent examples of functional and adaptive changes linked to STRs.


Subject(s)
Genome; Microsatellite Repeats; Mutation; Genotype; Phenotype
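
The definition at the start of the abstract above (units of 1-6 bp repeated in tandem) can be made concrete with a small Python sketch. It is a naive regular-expression scan for perfect repeats only, not one of the genome-wide STR tools the review refers to; the function name `find_strs` and the default threshold of three copies are assumptions for illustration.

```python
import re


def find_strs(seq: str, min_copies: int = 3):
    """Return (start, unit, copies) for perfect tandem repeats with 1-6 bp units.

    The lazy quantifier makes the regex prefer the shortest unit that repeats,
    so "CACACACA" is reported with unit "CA" rather than "CACA".
    """
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")
    # Group 1 is the unit; the backreference demands min_copies - 1 more copies.
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits
```

Real STR loci often contain interruptions and impure copies, which a perfect-match scan like this one will split or miss.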
4.
Mod Pathol ; 35(2): 240-248, 2022 02.
Article in English | MEDLINE | ID: mdl-34475526

ABSTRACT

The backbone of all colorectal cancer classifications including the consensus molecular subtypes (CMS) highlights microsatellite instability (MSI) as a key molecular pathway. Although mucinous histology (generally defined as >50% extracellular mucin-to-tumor area) is a "typical" feature of MSI, it is not limited to this subgroup. Here, we investigate the association of CMS classification and mucin-to-tumor area quantified using a deep learning algorithm, and the expression of specific mucins in predicting CMS groups and clinical outcome. A weakly supervised segmentation method was developed to quantify extracellular mucin-to-tumor area in H&E images. Performance was compared to two pathologists' scores, then applied to two cohorts: (1) TCGA (n = 871 slides/412 patients) used for mucin-CMS group correlation and (2) Bern (n = 775 slides/517 patients) for histopathological correlations and next-generation Tissue Microarray construction. TCGA and CPTAC (n = 85 patients) were used to further validate mucin detection and CMS classification by gene and protein expression analysis for MUC2, MUC4, MUC5AC and MUC5B. An excellent inter-observer agreement between pathologists' scores and the algorithm was obtained (ICC = 0.92). In TCGA, mucinous tumors were predominantly CMS1 (25.7%), CMS3 (24.6%) and CMS4 (16.2%). Average mucin in CMS2 was 1.8%, indicating negligible amounts. RNA and protein expression of MUC2, MUC4, MUC5AC and MUC5B were low-to-absent in CMS2. MUC5AC protein expression correlated with aggressive tumor features (e.g., distant metastases (p = 0.0334), BRAF mutation (p < 0.0001), mismatch repair-deficiency (p < 0.0001)), and unfavorable 5-year overall survival (44% versus 65% for positive/negative staining). MUC2 expression showed the opposite trend, correlating with less lymphatic (p = 0.0096) and venous vessel invasion (p = 0.0023), but had no impact on survival. The absence of mucin-expressing tumors in CMS2 provides an important phenotype-genotype correlation. Together with MSI, mucinous histology may help predict CMS classification using only histopathology and should be considered in future image classifiers of molecular subtypes.


Subject(s)
Brain Neoplasms; Colorectal Neoplasms; Biomarkers, Tumor/analysis; Biomarkers, Tumor/genetics; Colorectal Neoplasms/pathology; Humans; Microsatellite Instability; Mucin-2/analysis; Mucin-2/genetics; Mutation
5.
Distrib Parallel Databases ; 40(2-3): 409-440, 2022.
Article in English | MEDLINE | ID: mdl-36097541

ABSTRACT

The problem of natural language processing over structured data has become a growing research field, both within the relational database and the Semantic Web community, with significant efforts involved in question answering over knowledge graphs (KGQA). However, many of these approaches are either specifically targeted at open-domain question answering using DBpedia, or require large training datasets to translate a natural language question to SPARQL in order to query the knowledge graph. Hence, these approaches often cannot be applied directly to complex scientific datasets where no prior training data is available. In this paper, we focus on the challenges of natural language processing over knowledge graphs of scientific datasets. In particular, we introduce Bio-SODA, a natural language processing engine that does not require training data in the form of question-answer pairs for generating SPARQL queries. Bio-SODA uses a generic graph-based approach for translating user questions to a ranked list of SPARQL candidate queries. Furthermore, Bio-SODA uses a novel ranking algorithm that includes node centrality as a measure of relevance for selecting the best SPARQL candidate query. Our experiments with real-world datasets across several scientific domains, including the official bioinformatics Question Answering over Linked Data (QALD) challenge, as well as the CORDIS dataset of European projects, show that Bio-SODA outperforms publicly available KGQA systems by an F1-score of at least 20% and by an even higher factor on more complex bioinformatics datasets. Finally, we introduce Bio-SODA UX, a graphical user interface designed to assist users in the exploration of large knowledge graphs and in dynamically disambiguating natural language questions that target the data available in these graphs.

6.
BMC Bioinformatics ; 22(1): 518, 2021 Oct 24.
Article in English | MEDLINE | ID: mdl-34689750

ABSTRACT

BACKGROUND: Current alignment tools typically lack an explicit model of indel evolution, leading to artificially short inferred alignments (i.e., over-alignment) due to inconsistencies between the indel history and the phylogeny relating the input sequences. RESULTS: We present a new progressive multiple sequence alignment tool, ProPIP. The process of insertions and deletions is described using an explicit evolutionary model, the Poisson Indel Process (PIP). The method is based on dynamic programming and is implemented in a frequentist framework. The source code can be compiled on Linux, macOS and Microsoft Windows platforms. The algorithm is implemented in C++ as a standalone program. The source code is freely available on GitHub at https://github.com/acg-team/ProPIP and is distributed under the terms of the GNU GPL v3 license. CONCLUSIONS: The use of an explicit indel evolution model makes it possible to avoid over-alignment, to infer gaps in a phylogenetically consistent way and to make inferences about the rates of insertions and deletions. Instead of arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have a biological interpretation and are contextualized in a probabilistic environment. As a result, indel rate settings may be optimised in order to infer phylogenetically meaningful gap patterns.


Subject(s)
Evolution, Molecular; INDEL Mutation; Algorithms; Phylogeny; Sequence Alignment; Software
7.
Nucleic Acids Res ; 47(21): 10994-11006, 2019 12 02.
Article in English | MEDLINE | ID: mdl-31584084

ABSTRACT

The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.


Subject(s)
DNA/genetics; Databases, Nucleic Acid; Databases, Protein; Scientific Experimental Error; Tandem Repeat Sequences/genetics; Animals; Gadus morhua/genetics; Sequence Analysis, DNA
8.
Indian J Microbiol ; 61(1): 24-30, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33505089

ABSTRACT

Streptomycetes, Gram-positive bacteria with huge and GC-rich genomes, provide an ample example of codon usage bias taken to the extreme. In particular, in all streptomycete genomes sequenced to date, the leucyl codon TTA is the rarest one. It is present (usually once or twice) in 70-200 out of the 7000-8000 coding sequences that make up a typical streptomycete genome. tRNALeu UAA of streptomycetes, encoded by the bldA gene, has been shown to be present in mature form only after the onset of morphological differentiation and activation of secondary metabolism. Consequently, during the early stages of cell growth, the translation of genes carrying the TTA codon can be interrupted due to the absence of tRNALeu UAA. Several reports show that mutations of TTA to synonymous codons in certain genes indeed relieve their expression from bldA dependence. However, the deletion of bldA does not always arrest the expression of TTA-containing genes. The nucleotides T/C downstream of TTA were suggested, in 2002, to favor TTA mistranslation. We tested this hypothesis using sizable datasets derived from individual Streptomyces genomes and a subset of TTA+ genes for secondary metabolism known for their active expression. Our results revealed nucleotide biases downstream of the NNA codon family, such as the preference for C and the avoidance of A. Yet, none of the observed biases was sufficient to claim a special case for the TTA codon. Hence, the issue of codon context and TTA codon mistranslation in Streptomyces deserves further elaboration.
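
The codon-context question in the abstract above amounts to tallying which nucleotide follows each in-frame occurrence of a codon. The Python sketch below shows that counting step under simple assumptions: the input is one coding sequence already in reading frame, and the function name `downstream_of_codon` is hypothetical. It is not the authors' analysis pipeline, which the abstract does not describe in detail.

```python
from collections import Counter


def downstream_of_codon(cds: str, codon: str = "TTA") -> Counter:
    """Count the nucleotide immediately 3' of each in-frame `codon` in `cds`.

    Only codons starting at positions 0, 3, 6, ... are considered, and a codon
    at the very end of the sequence is skipped because nothing follows it.
    """
    cds = cds.upper()
    counts = Counter()
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] == codon:
            counts[cds[i + 3]] += 1
    return counts
```

Summing such counters over all coding sequences of a genome, and comparing TTA with the other NNA codons, would give the kind of downstream-bias table the abstract discusses.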

9.
Breast Cancer Res Treat ; 179(3): 731-742, 2020 Feb.
Article in English | MEDLINE | ID: mdl-31754952

ABSTRACT

PURPOSE: Germline variants in known breast cancer (BC) predisposing genes explain less than half of hereditary BC cases. This study aimed to identify missing genetic determinants of BC. METHODS: Whole exome sequencing (WES) of lymphocyte DNA was performed for 49 Russian patients with clinical signs of genetic BC predisposition, who lacked Slavic founder mutations in BRCA1, BRCA2, CHEK2, and NBS1 genes. RESULTS: Bioinformatic analysis of WES data yielded a list of 229 candidate mutations. 79 of these mutations were subjected to a three-stage case-control analysis. The initial two stages, which involved up to 797 high-risk BC patients, 1504 consecutive BC cases, and 1081 healthy women, indicated a potentially BC-predisposing role for 6 candidates, i.e., USP39 c.*208G > C, PZP p.Arg680Ter, LEPREL1 p.Pro636Ser, SLIT3 p.Arg154Cys, CREB3 p.Lys157Glu, and ING1 p.Pro319Leu. USP39 c.*208G > C was strongly associated with triple-negative breast tumors (p = 0.0001). In the third replication stage, we genotyped the truncating variant of PZP (rs145240281) and the potential splice variant of USP39 (rs112653307) in three independent cohorts of Russian, Byelorussian, and German ancestry, comprising a total of 3216 cases and 2525 controls. The data obtained for USP39 rs112653307 supported the association identified in the initial stages (combined OR 1.72, p = 0.035). CONCLUSIONS: This study suggests a role for a rare splicing variant in BC susceptibility. USP39 encodes a ubiquitin-specific peptidase that regulates cancer-relevant tumor suppressors including CHEK2. Further epidemiological and functional studies involving these gene variants are warranted.


Subject(s)
Breast Neoplasms/genetics; Exome Sequencing; Genetic Predisposition to Disease; Ubiquitin-Specific Proteases/genetics; Alleles; Alternative Splicing; Biomarkers, Tumor; Breast Neoplasms/diagnosis; Breast Neoplasms/etiology; Computational Biology; Female; Genetic Association Studies; Genotype; Germ-Line Mutation; Humans; Odds Ratio; Reproducibility of Results; Russia
10.
BMC Bioinformatics ; 19(1): 331, 2018 Sep 21.
Article in English | MEDLINE | ID: mdl-30241460

ABSTRACT

BACKGROUND: Sequence alignment is crucial in genomics studies. However, optimal multiple sequence alignment (MSA) is NP-hard. Thus, modern MSA methods employ progressive heuristics, breaking the problem into a series of pairwise alignments guided by a phylogeny. Changes between homologous characters are typically modelled by a Markov substitution model. In contrast, the dynamics of indels are not modelled explicitly, because the computation of the marginal likelihood under such models has exponential time complexity in the number of taxa. But the failure to model indel evolution may lead to artificially short alignments due to biased indel placement, inconsistent with the phylogenetic relationship. RESULTS: Recently, the classical indel model TKF91 was modified to describe indel evolution on a phylogeny via a Poisson process, termed PIP. PIP allows the joint marginal probability of an MSA and a tree to be computed in linear time. We present a new dynamic programming algorithm to align two MSAs (represented by the underlying homology paths) by full maximum likelihood under PIP in polynomial time, and apply it progressively along a guide tree. We have corroborated the correctness of our method by simulation, and compared it with competitive methods on an illustrative real dataset. CONCLUSIONS: Our MSA method is the first polynomial time progressive aligner with a rigorous mathematical formulation of indel evolution. The new method infers phylogenetically meaningful gap patterns alternative to the popular PRANK, while producing alignments of similar length. Moreover, the inferred gap patterns agree with what was predicted qualitatively by previous studies. The algorithm is implemented in a standalone C++ program: https://github.com/acg-team/ProPIP . Supplementary data are available at BMC Bioinformatics online.


Subject(s)
Algorithms; Evolution, Molecular; INDEL Mutation; Phylogeny; Humans; Probability; Sequence Alignment
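
The abstract above notes that progressive methods reduce multiple alignment to a series of pairwise alignments solved by dynamic programming. As background only, the Python sketch below computes the classic score-based global alignment (Needleman-Wunsch) with a linear gap penalty. It is deliberately not the maximum-likelihood recursion under PIP that the paper presents; the function name `nw_score` and the scoring values (+1, -1, -2) are assumptions for illustration.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of `a` and `b` under a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of `a` with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Keeping only two rows gives the score in memory linear in the length of `b`; recovering the alignment itself would require storing the full table or traceback pointers.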
11.
J Mol Evol ; 86(3-4): 204-215, 2018 04.
Article in English | MEDLINE | ID: mdl-29536136

ABSTRACT

The AdpA protein from a streptomycin producer Streptomyces griseus is a founding member of the AdpA family of pleiotropic regulators, known to be ubiquitously present in streptomycetes. Functional genomic approaches revealed a huge number of AdpA targets, leading to the claim that the AdpA regulon is the largest one in bacteria. The expression of adpA is limited at the level of translation of the rare leucyl UUA codon. All known properties of AdpA regulators were discovered on a few streptomycete strains. There are open questions about the true abundance and diversity of AdpA across actinobacterial taxa (and beyond) and about the possible evolutionary forces that shape the AdpA orthologous group in Streptomyces. Here we show that, with respect to the TTA codon, streptomycete adpA is more diverse than has been previously thought, as the genes differ in presence/position of this codon. Reciprocal best hits to AdpA can be found in many actinobacterial orders, with a domain organization resembling that of the prototypical AdpA, but other configurations also exist. Diversifying positive selection was detected within the DNA-binding (AraC) domain in adpA of Streptomyces origin, most likely affecting residues enabling AdpA to recognize a degenerate operator. Sequence coding for putative glutamine amidotransferase (GATase-1) domain also shows signs of positive selection. The two-domain organization of AdpA most likely arose from a fusion of genes encoding separate GATase-1 and AraC domains. Indeed, we show that the AraC domain retains a biological function in the absence of the GATase-1 part. We suggest that acquisition of the regulatory role by TTA codon is a relatively recent event in the evolution of AdpA, which coincided with the rise of the Streptomycetales clade and, at present, is under relaxed selective constraints. 
Further experimental scrutiny of our findings is invited, which should provide new insights into the evolution and prospects for engineering of an AdpA-centered regulatory network.


Subject(s)
Bacterial Proteins/genetics; Gene Expression Regulation, Bacterial; Regulon; Secondary Metabolism/genetics; Streptomyces/genetics; Amino Acid Sequence; Codon; DNA-Binding Proteins/genetics; Phylogeny; Streptomyces/classification
12.
Mol Biol Evol ; 32(8): 2208-16, 2015 Aug.
Article in English | MEDLINE | ID: mdl-25911229

ABSTRACT

Many protein sequences have distinct domains that evolve with different rates, different selective pressures, or may differ in codon bias. Instead of modeling these differences by more and more complex models of molecular evolution, we present a multipartition approach that allows maximum-likelihood phylogeny inference using different codon models at predefined partitions in the data. Partition models can, but do not have to, share free parameters in the estimation process. We test this approach with simulated data as well as in a phylogenetic study of the origin of the leucine-rich repeat regions in the type III effector proteins of the phytopathogenic bacterium Ralstonia solanacearum. Our study not only shows that a simple two-partition model resolves the phylogeny better than a one-partition model but also gives more evidence supporting the hypothesis of lateral gene transfer events between the bacterial pathogens and their eukaryotic hosts.


Subject(s)
Bacterial Proteins/genetics; Codon; Models, Genetic; Ralstonia solanacearum/genetics
13.
Mol Biol Evol ; 32(3): 806-19, 2015 Mar.
Article in English | MEDLINE | ID: mdl-25534034

ABSTRACT

Antibodies are glycoproteins produced by the immune system as a dynamically adaptive line of defense against invading pathogens. Very elegant and specific mutational mechanisms allow B lymphocytes to produce a large and diversified repertoire of antibodies, which is modified and enhanced throughout adulthood. One of these mechanisms is somatic hypermutation, which stochastically mutates nucleotides in the antibody genes, forming new sequences with different properties and, eventually, higher affinity and selectivity to the pathogenic target. As somatic hypermutation involves fast mutation of antibody sequences, this process can be described using a Markov substitution model of molecular evolution. Here, using large sets of antibody sequences from mice and humans, we infer an empirical amino acid substitution model AB, which is specific to antibody sequences. Compared with existing general amino acid models, we show that the AB model provides a significantly better description of the somatic evolution of mouse and human antibody sequences, as demonstrated on large next generation sequencing (NGS) antibody data. General amino acid models are reflective of conservation at the protein level due to functional constraints, with the most frequent amino acid exchanges taking place between residues with the same or similar physicochemical properties. In contrast, within the variable part of antibody sequences we observed an elevated frequency of exchanges between amino acids with distinct physicochemical properties. This is indicative of a sui generis mutational mechanism, specific to antibody somatic hypermutation. We illustrate this property of antibody sequences by a comparative analysis of the network modularity implied by the AB model and general amino acid substitution models. 
We recommend using the new model for computational studies of antibody sequence maturation, including inference of alignments and phylogenetic trees describing antibody somatic hypermutation in large NGS data sets. The AB model is implemented in the open-source software CodonPhyML (http://sourceforge.net/projects/codonphyml) and can be downloaded and supplied by the user to ProGraphMSA (http://sourceforge.net/projects/prographmsa) or other alignment and phylogeny reconstruction programs that allow for user-defined substitution models.


Subject(s)
Amino Acid Substitution/genetics; Antibodies/genetics; Evolution, Molecular; Sequence Alignment/methods; Amino Acid Sequence; Animals; Antibodies/chemistry; Databases, Protein; Humans; Markov Chains; Mice; Molecular Sequence Data; Mutation
14.
Bioinformatics ; 31(18): 3051-3, 2015 Sep 15.
Article in English | MEDLINE | ID: mdl-25987568

ABSTRACT

MOTIVATION: Currently, more than 40 sequence tandem repeat detectors are published, providing heterogeneous, partly complementary, partly conflicting results. RESULTS: We present TRAL, a tandem repeat annotation library that allows running and parsing of various detection outputs, clustering of redundant or overlapping annotations, several statistical frameworks for filtering false positive annotations, and importantly a tandem repeat annotation and refinement module based on circular profile hidden Markov models (cpHMMs). Using TRAL, we evaluated the performance of a multi-step tandem repeat annotation workflow on 547 085 sequences in UniProtKB/Swiss-Prot. The researcher can use these results to predict run-times for specific datasets, and to choose annotation complexity accordingly. AVAILABILITY AND IMPLEMENTATION: TRAL is an open-source Python 3 library and is available, together with documentation and tutorials via http://www.vital-it.ch/software/tral. CONTACT: elke.schaper@isb-sib.ch.


Subject(s)
Databases, Protein; Knowledge Bases; Molecular Sequence Annotation; Software; Tandem Repeat Sequences/genetics; Amino Acid Sequence; Cluster Analysis; Documentation; Gene Library; Humans; Molecular Sequence Data
15.
BMC Genet ; 17(1): 66, 2016 05 12.
Article in English | MEDLINE | ID: mdl-27176219

ABSTRACT

BACKGROUND: The sperm gene bindin encodes a gamete recognition protein, which plays an important role in conspecific fertilization and reproductive isolation of sea urchins. Molecular evolution of the gene has been extensively investigated with the attention focused on the protein coding regions. Intron evolution has been investigated to a much lesser extent. We have studied nucleotide variability in the complete bindin locus, including two exons and one intron, in the sea urchin Strongylocentrotus intermedius represented by two morphological forms. We have also analyzed all available bindin sequences for two other sea urchin species, S. pallidus and S. droebachiensis. RESULTS: The results show that the bindin sequences from the two forms of S. intermedius are intermingled with no evidence of genetic divergence; however, the forms exhibit slightly different patterns in bindin variability. The level of the bindin nucleotide diversity is close for S. intermedius and S. droebachiensis, but noticeably higher for S. pallidus. The distribution of variability is non-uniform along the gene; however there are striking similarities among the species, indicating similar evolutionary trends in this gene engaged in reproductive function. The patterns of nucleotide variability and divergence are radically different in the bindin coding and intron regions. Positive selection is detected in the bindin coding region. The neutrality tests as well as the maximum likelihood approaches suggest the action of diversifying selection in the bindin intron. CONCLUSIONS: Significant deviation from neutrality has been detected in the bindin coding region and suggested in the intron, indicating the possible functional importance of the bindin intron variability. To clarify the question concerning possible involvement of diversifying selection in the bindin intron evolution more data combining population genetic and functional approaches are necessary.


Subject(s)
Polymorphism, Genetic; Receptors, Cell Surface/genetics; Sequence Analysis, DNA/methods; Strongylocentrotus/genetics; Animals; Evolution, Molecular; Likelihood Functions; Male; Phylogeny; Selection, Genetic; Strongylocentrotus/classification
16.
BMC Evol Biol ; 15: 76, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25928234

ABSTRACT

BACKGROUND: Today computational molecular evolution is a vibrant research field that benefits from the availability of large and complex new generation sequencing data - ranging from full genomes and proteomes to microbiomes, metabolomes and epigenomes. The grounds for this progress were established long before the discovery of the DNA structure. Specifically, Darwin's theory of evolution by means of natural selection not only remains relevant today, but also provides a solid basis for computational research with a variety of applications. But long-term progress in biology was ensured by the mathematical sciences, as exemplified by Sir R. Fisher in the early 20th century. Now this is true more than ever: the data size and its complexity require biologists to work in close collaboration with experts in computational sciences, modeling and statistics. RESULTS: Natural selection drives function conservation and adaptation to emerging pathogens or new environments; selection plays a key role in immune and resistance systems. Here I focus on computational methods for evaluating selection in molecular sequences, and argue that they have a high potential for applications. Pharma and biotech industries can successfully use this potential, and should take the initiative to enhance their research and development with state-of-the-art bioinformatics approaches. CONCLUSIONS: This review provides a quick guide to the current computational approaches that apply the evolutionary principles of natural selection to real life problems - from drug target validation, vaccine design and protein engineering to applications in agriculture, ecology and conservation.


Subject(s)
Computational Biology/methods; Evolution, Molecular; Genomics; Animals; Biotechnology; Humans; Metabolome; Proteins/genetics; Selection, Genetic
17.
Mol Biol Evol ; 31(5): 1132-48, 2014 May.
Article in English | MEDLINE | ID: mdl-24497029

ABSTRACT

Tandem repeats (TRs) are a major element of protein sequences in all domains of life. They are particularly abundant in mammals, where by conservative estimates one in three proteins contain a TR. High generation-scale duplication and deletion rates were reported for nucleic TR units. However, it is not known whether protein TR units can also be frequently lost or gained providing a source of variation for rapid adaptation of protein function, or alternatively, tend to have conserved TR unit configurations over long evolutionary times. To obtain a systematic picture, we performed a proteome-wide analysis of the mode of evolution for human protein TRs. For this purpose, we propose a novel method for the detection of orthologous TRs based on circular profile hidden Markov models. For all detected TRs, we reconstructed bispecies TR unit phylogenies across 61 eukaryotes ranging from human to yeast. Moreover, we performed additional analyses to correlate functional and structural annotations of human TRs with their mode of evolution. Surprisingly, we find that the vast majority of human TRs are ancient, with TR unit number and order preserved intact since distant speciation events. For example, ≥ 61% of all human TRs have been strongly conserved at least since the root of all mammals, approximately 300 Ma. Further, we find no human protein TR that shows evidence for strong recent duplications and deletions. The results are in contrast to the high generation-scale mutability of nucleic TRs. Presumably, most protein TRs fold into stable and conserved structures that are indispensable for the function of the TR-containing protein. All of our data and results are available for download from http://www.atgc-montpellier.fr/TRE.


Subject(s)
Eukaryota/chemistry , Eukaryota/genetics , Molecular Evolution , Proteins/chemistry , Proteins/genetics , Tandem Repeat Sequences , Amino Acid Substitution , Animals , Conserved Sequence , Exons , Human Genome , Humans , Markov Chains , Genetic Models , Phylogeny , Proteome/chemistry , Proteome/genetics , Time Factors
18.
New Phytol ; 206(1): 397-410, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25420631

ABSTRACT

Sequence tandem repeats (TRs) are abundant in proteomes across all domains of life. For plants, little is known about their distribution or contribution to protein function. We exhaustively annotated TRs and studied the evolution of TR unit variations for all Ensembl plants. Using phylogenetic patterns of TR units, we detected conserved TRs with unit number and order preserved during evolution, and those TRs that have diverged via recent TR unit gains/losses. We correlated the mode of evolution of TRs to protein function. TR number was strongly correlated with proteome size, with about one-half of all TRs recognized as common protein domains. The majority of TRs have been highly conserved over long evolutionary distances, some since the separation of red algae and green plants c. 1.6 billion yr ago. Conversely, recurrent recent TR unit mutations were rare. Our results suggest that the first TRs by far predate the first plants, and that TR appearance is an ongoing process with similar rates across the plant kingdom. Interestingly, the few detected highly mutable TRs might provide a source of variation for rapid adaptation. In particular, such TRs are enriched in leucine-rich repeats (LRRs) commonly found in R genes, where TR unit gain/loss may facilitate resistance to emerging pathogens.


Subject(s)
Plant Proteins/genetics , Plants/genetics , Proteome , Tandem Repeat Sequences/genetics , Molecular Evolution , Phylogeny , Plants/metabolism
19.
Nucleic Acids Res ; 41(17): e162, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23877246

ABSTRACT

Tandem repeats (TRs) are often present in proteins with crucial functions, responsible for resistance, pathogenicity and associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, requiring accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination, but current methods assume fixed unit boundaries and are nevertheless of high complexity. We present a new global graph-based alignment method that does not restrict TR unit indels by unit boundaries. TR indels are modeled separately and penalized using the phylogeny-aware alignment algorithm. This ensures enhanced accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events but also all changes in TR regions owing to recombination, strand slippage and other events inserting or deleting TR units. We evaluate our method by simulation incorporating TR evolution, by either sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. The new method is illustrated on a family of type III effectors, a pathogenicity determinant in the agriculturally important bacterium Ralstonia solanacearum. We show that TR indel rate variation contributes to the diversification of this protein family.


Subject(s)
Sequence Alignment/methods , Tandem Repeat Sequences , Algorithms , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , INDEL Mutation , Genetic Models , Phylogeny , Ralstonia solanacearum/genetics , DNA Sequence Analysis , Protein Sequence Analysis
20.
Mol Biol Evol ; 30(6): 1270-80, 2013 Jun.
Article in English | MEDLINE | ID: mdl-23436912

ABSTRACT

Markov models of codon substitution naturally incorporate the structure of the genetic code and the selection intensity at the protein level, providing a more realistic representation of protein-coding sequences compared with nucleotide or amino acid models. Thus, for protein-coding genes, phylogenetic inference is expected to be more accurate under codon models. So far, phylogeny reconstruction under codon models has been elusive due to the computational difficulties of dealing with high-dimensional matrices. Here, we present CodonPhyML, a fast maximum likelihood (ML) package for phylogenetic inference that offers hundreds of different codon models, the largest variety to date. CodonPhyML is tested on simulated and real data and is shown to offer excellent speed and convergence properties. In addition, CodonPhyML includes the most recent fast methods for estimating phylogenetic branch supports and provides an integral framework for model selection, including amino acid and DNA models.


Subject(s)
Codon , Computational Biology/methods , Genetic Models , Phylogeny , Animals , Bacteria , Computer Simulation , Eukaryota , Molecular Evolution , Markov Chains , Proteins/genetics , Genetic Selection