Search | BVS Search Portal

1.

The Tree Reconstruction Game: Phylogenetic Reconstruction Using Reinforcement Learning.

Azouri, Dana; Granit, Oz; Alburquerque, Michael; Mansour, Yishay; Pupko, Tal; Mayrose, Itay.

Mol Biol Evol ; 41(6)2024 Jun 01.

Article in English | MEDLINE | ID: mdl-38829798

ABSTRACT

The computational search for the maximum-likelihood phylogenetic tree is an NP-hard problem. As such, current tree search algorithms might result in a tree that is a local optimum rather than the global one. Here, we introduce a paradigm shift for predicting the maximum-likelihood tree, by approximating long-term gains of likelihood rather than maximizing likelihood gain at each step of the search. Our proposed approach harnesses the power of reinforcement learning to learn an optimal search strategy, aiming at the global optimum of the search space. We show that when analyzing empirical data containing dozens of sequences, the log-likelihood improvement from the starting tree obtained by the reinforcement learning-based agent was 0.969 or higher compared to that achieved by current state-of-the-art techniques. Notably, this performance is attained without the need to perform costly likelihood optimizations apart from the training process, thus potentially allowing for an exponential decrease in runtime. We exemplify this for data sets containing 15 sequences of length 18,000 bp and demonstrate that the reinforcement learning-based method is roughly three times faster than the state-of-the-art software. This study illustrates the potential of reinforcement learning in addressing the challenges of phylogenetic tree reconstruction.

Subject(s)

Algorithms , Phylogeny , Likelihood Functions , Genetic Models , Computational Biology/methods , Software
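
The contrast drawn in the abstract of entry 1, between maximizing the likelihood gain at each step and approximating long-term gains, can be written out as follows. This is only a sketch in standard reinforcement-learning notation, which the abstract itself does not give: the log-likelihood function \ell, the tree T_t \oplus a obtained by applying rearrangement move a to tree T_t, the discount factor \gamma, and the action-value function Q are all assumptions made for illustration.

```latex
% Greedy search: pick the move with the largest immediate log-likelihood gain.
a_t^{\text{greedy}} = \arg\max_{a} \bigl[ \ell(T_t \oplus a) - \ell(T_t) \bigr]

% Reinforcement-learning view: pick the move with the largest expected long-term gain.
Q(T_t, a) = \mathbb{E}\Bigl[ \sum_{k \ge 0} \gamma^{k} \, r_{t+k} \Bigm| T_t, a \Bigr],
\qquad r_{t} = \ell(T_{t+1}) - \ell(T_t),
\qquad a_t^{\text{RL}} = \arg\max_{a} Q(T_t, a)
```

Under this reading, a move with a small or even negative immediate gain can still be preferred if it tends to lead to better trees later in the search.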

2.

Effect of tokenization on transformers for biological sequences.

Dotan, Edo; Jaschek, Gal; Pupko, Tal; Belinkov, Yonatan.

Bioinformatics ; 40(4)2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38608190

ABSTRACT

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English and French, in which segmentation of the text into separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text into a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA into single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins into specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and, at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.

Subject(s)

Algorithms , Computational Biology , Deep Learning , Natural Language Processing , Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Protein Sequence Analysis/methods
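
To make the idea in entry 2 concrete, here is a minimal sketch of how a learned tokenizer can shorten a protein sequence relative to the trivial one-character-per-token tokenizer. It uses a toy byte-pair-encoding (BPE) procedure, chosen here only as one common example of an "alternative tokenization algorithm"; the abstract does not say which algorithms the authors used. The function names, the three toy sequences, and the number of merges are all hypothetical.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn an ordered list of BPE merges from a list of sequences (toy version)."""
    # Start from the trivial tokenizer: every character is its own token.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a sequence by applying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Hypothetical toy "protein" sequences, for illustration only.
toy_proteins = ["MKTAYIAKQR", "MKTAYLAKQR", "MKVAYIAKQG"]
merges = learn_bpe(toy_proteins, num_merges=5)
tokens = tokenize("MKTAYIAKQR", merges)
print(len("MKTAYIAKQR"), "characters ->", len(tokens), "tokens:", tokens)
```

Joining the tokens always reproduces the original sequence, so the shorter input loses no information; only the segmentation changes.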

3.

A machine-learning-based alternative to phylogenetic bootstrap.

Ecker, Noa; Huchon, Dorothée; Mansour, Yishay; Mayrose, Itay; Pupko, Tal.

Bioinformatics ; 40(Supplement_1): i208-i217, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940166

ABSTRACT

MOTIVATION: Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance. RESULTS: Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets. AVAILABILITY AND IMPLEMENTATION: The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.

Subject(s)

Algorithms , Machine Learning , Phylogeny , Software , Sequence Alignment/methods , Computational Biology/methods , Likelihood Functions

4.

Statistical framework to determine indel-length distribution.

Wygoda, Elya; Loewenthal, Gil; Moshe, Asher; Alburquerque, Michael; Mayrose, Itay; Pupko, Tal.

Bioinformatics ; 40(2)2024 02 01.

Article in English | MEDLINE | ID: mdl-38269647

ABSTRACT

MOTIVATION: Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased due to the assumption of alignment methods that indel lengths follow a geometric distribution. RESULTS: We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available on GitHub at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.

Subject(s)

Algorithms , Software , Bayes Theorem , Sequence Alignment , INDEL Mutation , Molecular Evolution
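
Entry 4 contrasts the geometric indel-length distribution assumed by alignment methods with the Zipf distribution that the authors report fits most alignments. The sketch below shows why the choice matters, comparing the two probability mass functions on a truncated length range. The parameter values (p = 0.5, a = 1.5), the maximum length of 50, and the function names are arbitrary assumptions for illustration and are not taken from the paper.

```python
def geometric_pmf(p, max_len):
    """Truncated geometric PMF over lengths 1..max_len, renormalized to sum to 1."""
    weights = [(1 - p) ** (k - 1) * p for k in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def zipf_pmf(a, max_len):
    """Truncated Zipf (power-law) PMF over lengths 1..max_len: P(k) proportional to k^(-a)."""
    weights = [k ** (-a) for k in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]


MAX_LEN = 50  # hypothetical cap on indel length
geo = geometric_pmf(p=0.5, max_len=MAX_LEN)
zipf = zipf_pmf(a=1.5, max_len=MAX_LEN)

# Probability that an indel is at least 10 positions long under each model.
tail_geo = sum(geo[9:])
tail_zipf = sum(zipf[9:])
print(f"P(length >= 10): geometric = {tail_geo:.4f}, Zipf = {tail_zipf:.4f}")
```

With these illustrative parameters the power-law tail gives long indels far more probability than the geometric model does, which is the kind of difference a length-distribution model-selection procedure has to resolve.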

5.

Evaluation of the Ability of AlphaFold to Predict the Three-Dimensional Structures of Antibodies and Epitopes.

Polonsky, Ksenia; Pupko, Tal; Freund, Natalia T.

J Immunol ; 211(10): 1578-1588, 2023 11 15.

Article in English | MEDLINE | ID: mdl-37782047

ABSTRACT

Being able to accurately predict the three-dimensional structure of an Ab can facilitate Ab characterization and epitope prediction, with important diagnostic and clinical implications. In this study, we evaluated the ability of AlphaFold to predict the structures of 222 recently published, high-resolution Fab H and L chain structures of Abs from different species directed against different Ags. We show that although the overall Ab prediction quality is in line with the results of CASP14, regions such as the complementarity-determining regions (CDRs) of the H chain, which are prone to higher variation, are predicted less accurately. Moreover, we discovered that AlphaFold mispredicts the bending angles between the variable and constant domains. To evaluate the ability of AlphaFold to model Ab-Ag interactions based only on sequence, we used AlphaFold-Multimer in combination with ZDOCK to predict the structures of 26 known Ab-Ag complexes. ZDOCK, which was applied on bound components of both the Ab and the Ag, succeeded in assembling 11 complexes, whereas AlphaFold succeeded in predicting only 2 of 26 models, with significant deviations in the docking contacts predicted in the rest of the molecules. Within the 11 complexes that were successfully predicted by ZDOCK, 9 involved short-peptide Ags (18-mer or less), whereas only 2 were complexes of Ab with a full-length protein. Docking of modeled unbound Ab and Ag was unsuccessful. In summary, our study provides important information about the abilities and limitations of using AlphaFold to predict Ab-Ag interactions and suggests areas for possible improvement.

Subject(s)

Antibodies , Complementarity Determining Regions , Epitopes , Peptides/chemistry

6.

GenomeFLTR: filtering reads made easy.

Dotan, Edo; Alburquerque, Michael; Wygoda, Elya; Huchon, Dorothée; Pupko, Tal.

Nucleic Acids Res ; 51(W1): W232-W236, 2023 07 05.

Article in English | MEDLINE | ID: mdl-37177997

ABSTRACT

In the last decade, advances in sequencing technology have led to an exponential increase in genomic data. These new data have dramatically changed our understanding of the evolution and function of genes and genomes. Despite improvements in sequencing technologies, identifying contaminated reads remains a complex task for many research groups. Here, we introduce GenomeFLTR, a new web server to filter contaminated reads. Reads are compared against existing sequence databases from various representative organisms to detect potential contaminants. The main features implemented in GenomeFLTR are: (i) automated updating of the relevant databases; (ii) fast comparison of each read against the database; (iii) the ability to create user-specified databases; (iv) a user-friendly interactive dashboard to investigate the origin and frequency of the contaminations; (v) the generation of a contamination-free file. Availability: https://genomefltr.tau.ac.il/.

Subject(s)

Genomics , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis , Genome/genetics , Nucleic Acid Databases , Software

7.

Evolution of myxozoan mitochondrial genomes: insights from myxobolids.

Sandberg, Tatiana Orli Milkewitz; Yahalomi, Dayana; Bracha, Noam; Haddas-Sasson, Michal; Pupko, Tal; Atkinson, Stephen D; Bartholomew, Jerri L; Zhang, Jin Yong; Huchon, Dorothée.

BMC Genomics ; 25(1): 388, 2024 Apr 22.

Article in English | MEDLINE | ID: mdl-38649808

ABSTRACT

BACKGROUND: Myxozoa is a class of cnidarian parasites that encompasses over 2,400 species. Phylogenetic relationships among myxozoans remain highly debated, owing to both a lack of informative morphological characters and a shortage of molecular markers. Mitochondrial (mt) genomes are a common marker in phylogeny and biogeography. However, only five complete myxozoan mt genomes have been sequenced: four belonging to two closely related genera, Enteromyxum and Kudoa, and one from the genus Myxobolus. Interestingly, while cytochrome oxidase genes could be identified in Enteromyxum and Kudoa, no such genes were found in Myxobolus squamalis, and another member of the Myxobolidae (Henneguya salminicola) was found to have lost its entire mt genome. To evaluate the utility of mt genomes to reconstruct myxozoan relationships and to understand if the loss of cytochrome oxidase genes is a characteristic of myxobolids, we sequenced the mt genome of five myxozoans (Myxobolus wulii, M. honghuensis, M. shantungensis, Thelohanellus kitauei, and Sphaeromyxa zaharoni) using Illumina and Oxford Nanopore platforms. RESULTS: Unlike Enteromyxum, which possesses a partitioned mt genome, the five mt genomes were encoded on single circular chromosomes. An mt plasmid was found in M. wulii, as described previously in Kudoa iwatai. In all new myxozoan genomes, five protein-coding genes (cob, cox1, cox2, nad1, and nad5) and two rRNAs (rnl and rns) were recognized, but no tRNA. We found that Myxobolus and Thelohanellus species shared unidentified reading frames, supporting the view that these mt open reading frames are functional. Our phylogenetic reconstructions based on the five conserved mt genes agree with previously published trees based on the 18S rRNA gene.
CONCLUSIONS: Our results suggest that the loss of cytochrome oxidase genes is not a characteristic of all myxobolids, the ancestral myxozoan mt genome was likely encoded on a single circular chromosome, and mt plasmids exist in a few lineages. Our findings indicate that myxozoan mt sequences are poor markers for reconstructing myxozoan phylogenetic relationships because of their fast evolutionary rates and the abundance of repeated elements, which complicates assembly.

Subject(s)

Molecular Evolution , Mitochondrial Genome , Myxozoa , Phylogeny , Animals , Myxozoa/genetics , Myxozoa/classification , Electron Transport Complex IV/genetics

8.

A phage mechanism for selective nicking of dUMP-containing DNA.

Mahata, Tridib; Molshanski-Mor, Shahar; Goren, Moran G; Jana, Biswanath; Kohen-Manor, Miriam; Yosef, Ido; Avram, Oren; Pupko, Tal; Salomon, Dor; Qimron, Udi.

Proc Natl Acad Sci U S A ; 118(23)2021 06 08.

Article in English | MEDLINE | ID: mdl-34074772

ABSTRACT

Bacteriophages (phages) have evolved efficient means to take over the machinery of the bacterial host. The molecular tools at their disposal may be applied to manipulate bacteria and to divert molecular pathways at will. Here, we describe a bacterial growth inhibitor, gene product T5.015, encoded by the T5 phage. High-throughput sequencing of genomic DNA of bacterial mutants, resistant to this inhibitor, revealed disruptive mutations in the Escherichia coli ung gene, suggesting that growth inhibition mediated by T5.015 depends on the uracil-excision activity of Ung. We validated that growth inhibition is abrogated in the absence of ung and confirmed physical binding of Ung by T5.015. In addition, biochemical assays with T5.015 and Ung indicated that T5.015 mediates endonucleolytic activity at abasic sites generated by the base-excision activity of Ung. Importantly, the growth inhibition resulting from the endonucleolytic activity is manifested by DNA replication and cell division arrest. We speculate that the phage uses this protein to selectively cause cleavage of the host DNA, which possesses more misincorporated uracils than that of the phage. This protein may also enhance phage utilization of the available resources in the infected cell, since halting replication saves nucleotides, and stopping cell division maintains both daughters of a dividing cell.

Subject(s)

Bacteriophages/genetics , Bacteriophages/physiology , DNA/metabolism , Deoxyuracil Nucleotides/metabolism , Cell Cycle Checkpoints , Cell Division , Endonucleases , Escherichia coli/genetics , High-Throughput Nucleotide Sequencing , Mutation , Uracil/metabolism

9.

An Approximate Bayesian Computation Approach for Modeling Genome Rearrangements.

Moshe, Asher; Wygoda, Elya; Ecker, Noa; Loewenthal, Gil; Avram, Oren; Israeli, Omer; Hazkani-Covo, Einat; Pe'er, Itsik; Pupko, Tal.

Mol Biol Evol ; 39(11)2022 11 03.

Article in English | MEDLINE | ID: mdl-36282896

ABSTRACT

The inference of genome rearrangement events has been extensively studied, as they play a major role in molecular evolution. However, probabilistic evolutionary models that explicitly imitate the evolutionary dynamics of such events, as well as methods to infer model parameters, are yet to be fully utilized. Here, we developed a probabilistic approach to infer genome rearrangement rate parameters using an Approximate Bayesian Computation (ABC) framework. We developed two genome rearrangement models, a basic model, which accounts for genomic changes in gene order, and a more sophisticated one which also accounts for changes in chromosome number. We characterized the ABC inference accuracy using simulations and applied our methodology to both prokaryotic and eukaryotic empirical datasets. Knowledge of genome-rearrangement rates can help elucidate their role in evolution as well as help simulate genomes with evolutionary dynamics that reflect empirical genomes.

Subject(s)

Molecular Evolution , Genome , Bayes Theorem , Computer Simulation , Genomics
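
Entry 9 infers rate parameters with an Approximate Bayesian Computation (ABC) framework. The general logic of ABC can be sketched with the simplest variant, rejection sampling: draw a parameter from the prior, simulate data, and keep the draw only if a summary statistic of the simulation is close to the observed one. Everything specific below is an assumption made for illustration: the toy simulator (independent per-site events rather than genome rearrangements), the uniform prior, the single summary statistic, the tolerance, and all function names. The abstract does not describe the authors' simulator, summary statistics, or ABC variant.

```python
import random


def simulate_event_count(rate, n_sites, rng):
    """Toy simulator: each of n_sites independently undergoes an event with probability `rate`."""
    return sum(rng.random() < rate for _ in range(n_sites))


def abc_rejection(observed, n_sites, n_draws, tolerance, rng):
    """Return the prior draws whose simulated summary lies within `tolerance` of `observed`."""
    accepted = []
    for _ in range(n_draws):
        rate = rng.uniform(0.0, 1.0)  # draw a candidate rate from a uniform prior
        simulated = simulate_event_count(rate, n_sites, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(rate)
    return accepted


rng = random.Random(0)  # fixed seed so the sketch is reproducible
N_SITES = 200
true_rate = 0.3  # hypothetical "unknown" parameter used to generate the observed data
observed = simulate_event_count(true_rate, N_SITES, rng)

posterior = abc_rejection(observed, N_SITES, n_draws=5000, tolerance=5, rng=rng)
estimate = sum(posterior) / len(posterior)
print(f"accepted {len(posterior)} draws; posterior mean rate = {estimate:.3f}")
```

The accepted draws approximate the posterior distribution of the rate without ever evaluating a likelihood function, which is what makes ABC attractive for models whose likelihood is intractable.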

10.

Effectidor: an automated machine-learning-based web server for the prediction of type-III secretion system effectors.

Wagner, Naama; Avram, Oren; Gold-Binshtok, Dafna; Zerah, Ben; Teper, Doron; Pupko, Tal.

Bioinformatics ; 38(8): 2341-2343, 2022 04 12.

Article in English | MEDLINE | ID: mdl-35157036

ABSTRACT

MOTIVATION: Type-III secretion systems are utilized by many Gram-negative bacteria to inject type-3 effectors (T3Es) into eukaryotic cells. These effectors manipulate host processes for the benefit of the bacteria and thus promote disease. They can also function as host-specificity determinants through their recognition as avirulence proteins that elicit an immune response. Identifying the full effector repertoire within a set of bacterial genomes is of great importance to develop appropriate treatments against the associated pathogens. RESULTS: We present Effectidor, a user-friendly web server that harnesses several machine-learning techniques to predict T3Es within bacterial genomes. We compared the performance of Effectidor to other available tools for the same task on three pathogenic bacteria. Effectidor outperformed these tools in terms of classification accuracy (area under the precision-recall curve above 0.98 in all cases). AVAILABILITY AND IMPLEMENTATION: Effectidor is available at: https://effectidor.tau.ac.il, and the source code is available at: https://github.com/naamawagner/Effectidor. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Bacterial Proteins , Type III Secretion Systems , Type III Secretion Systems/metabolism , Bacterial Proteins/metabolism , Software , Machine Learning , Gram-Negative Bacteria/metabolism

11.

A LASSO-based approach to sample sites for phylogenetic tree search.

Ecker, Noa; Azouri, Dana; Bettisworth, Ben; Stamatakis, Alexandros; Mansour, Yishay; Mayrose, Itay; Pupko, Tal.

Bioinformatics ; 38(Suppl 1): i118-i124, 2022 06 24.

Article in English | MEDLINE | ID: mdl-35758778

ABSTRACT

MOTIVATION: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree. RESULTS: Here, we propose an artificial-intelligence-based approach, which provides a means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides an accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running time substantially while retaining the same tree-search performance. AVAILABILITY AND IMPLEMENTATION: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in section 3. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Artificial Intelligence , Software , Likelihood Functions , Phylogeny
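
The Lasso-based site selection described in entry 11 can be read as the following regression problem. This is a sketch of the textbook Lasso objective applied to per-site log-likelihoods, not necessarily the authors' exact formulation: the symbols (training trees T_j, per-site log-likelihoods \ell_i, weights w_i, intercept w_0, penalty strength \lambda) and the 1/(2m) scaling are assumptions introduced here.

```latex
% Training data: trees T_1..T_m; \ell_i(T_j) is the log-likelihood of site i on tree T_j.
\mathrm{LL}(T_j) = \sum_{i=1}^{n} \ell_i(T_j)

% Lasso fit: the L1 penalty drives most site weights w_i to exactly zero.
\min_{w_0,\, w} \;
\frac{1}{2m} \sum_{j=1}^{m}
\Bigl( \mathrm{LL}(T_j) - w_0 - \sum_{i=1}^{n} w_i \, \ell_i(T_j) \Bigr)^{2}
+ \lambda \sum_{i=1}^{n} \lvert w_i \rvert

% Approximation on a new tree T, using only the selected sites S = \{ i : w_i \neq 0 \}.
\widehat{\mathrm{LL}}(T) = w_0 + \sum_{i \in S} w_i \, \ell_i(T)
```

Increasing \lambda shrinks the selected set S, which is how a constraint on the number of sites (for example, roughly 5% of the alignment, as reported in the abstract) would be imposed under this formulation.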

12.

A Probabilistic Model for Indel Evolution: Differentiating Insertions from Deletions.

Loewenthal, Gil; Rapoport, Dana; Avram, Oren; Moshe, Asher; Wygoda, Elya; Itzkovitch, Alon; Israeli, Omer; Azouri, Dana; Cartwright, Reed A; Mayrose, Itay; Pupko, Tal.

Mol Biol Evol ; 38(12): 5769-5781, 2021 12 09.

Article in English | MEDLINE | ID: mdl-34469521

ABSTRACT

Insertions and deletions (indels) are common molecular evolutionary events. However, probabilistic models for indel evolution are under-developed due to their computational complexity. Here, we introduce several improvements to indel modeling: 1) While previous models for indel evolution assumed that the rates and length distributions of insertions and deletions are equal, here we propose a richer model that explicitly distinguishes between the two; 2) we introduce numerous summary statistics that allow approximate Bayesian computation-based parameter estimation; 3) we develop a method to correct for biases introduced by alignment programs, when inferring indel parameters from empirical data sets; and 4) using a model-selection scheme, we test whether the richer model better fits biological data compared with the simpler model. Our analyses suggest that both our inference scheme and the model-selection procedure achieve high accuracy on simulated data. We further demonstrate that our proposed richer model better fits a large number of empirical data sets and that, for the majority of these data sets, the deletion rate is higher than the insertion rate.

Subject(s)

Molecular Evolution , INDEL Mutation , Bayes Theorem , Statistical Models , Phylogeny

13.

PASA: Proteomic analysis of serum antibodies web server.

Avram, Oren; Kigel, Aya; Vaisman-Mentesh, Anna; Kligsberg, Sharon; Rosenstein, Shai; Dror, Yael; Pupko, Tal; Wine, Yariv.

PLoS Comput Biol ; 17(1): e1008607, 2021 01.

Article in English | MEDLINE | ID: mdl-33493161

ABSTRACT

MOTIVATION: A comprehensive characterization of the humoral response towards a specific antigen requires quantification of the B-cell receptor repertoire by next-generation sequencing (BCR-Seq), as well as the analysis of serum antibodies against this antigen, using proteomics. The proteomic analysis is challenging since it necessitates the mapping of antigen-specific peptides to individual B-cell clones. RESULTS: The PASA web server provides a robust computational platform for the analysis and integration of data obtained from proteomics of serum antibodies. PASA maps peptides derived from antibodies raised against a specific antigen to corresponding antibody sequences. It then analyzes and integrates proteomics and BCR-Seq data, thus providing a comprehensive characterization of the humoral response. The PASA web server is freely available at https://pasa.tau.ac.il and open to all users without a login requirement.

Subject(s)

Antibodies , Internet , Proteomics/methods , Software , Animals , Antibodies/blood , Antibodies/immunology , Antigens/immunology , B-Lymphocytes/immunology , Protein Databases , High-Throughput Nucleotide Sequencing , Humans , Mice

14.

ModelTeller: Model Selection for Optimal Phylogenetic Reconstruction Using Machine Learning.

Abadi, Shiran; Avram, Oren; Rosset, Saharon; Pupko, Tal; Mayrose, Itay.

Mol Biol Evol ; 37(11): 3338-3352, 2020 11 01.

Article in English | MEDLINE | ID: mdl-32585030

ABSTRACT

Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task consume substantial computational resources and processing time, are not always feasible, and sometimes depend on preliminary assumptions that do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as using existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model and thus the prediction according to features extracted from the sequence data results in a substantial decrease in running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features that are related to estimates of the model parameters that are important for the selection made by current criteria.

Subject(s)

Machine Learning , Genetic Models , Phylogeny

15.

M1CR0B1AL1Z3R-a user-friendly web server for the analysis of large-scale microbial genomics data.

Avram, Oren; Rapoport, Dana; Portugez, Shir; Pupko, Tal.

Nucleic Acids Res ; 47(W1): W88-W92, 2019 07 02.

Article in English | MEDLINE | ID: mdl-31114912

ABSTRACT

Large-scale mining and analysis of bacterial datasets contribute to the comprehensive characterization of complex microbial dynamics within a microbiome and among different bacterial strains, e.g., during disease outbreaks. The study of large-scale bacterial evolutionary dynamics poses many challenges. These include data-mining steps, such as gene annotation, ortholog detection, sequence alignment and phylogeny reconstruction. These steps require the use of multiple bioinformatics tools and ad-hoc programming scripts, making the entire process cumbersome, tedious and error-prone due to manual handling. This motivated us to develop the M1CR0B1AL1Z3R web server, a 'one-stop shop' for conducting microbial genomics data analyses via a simple graphical user interface. Some of the features implemented in M1CR0B1AL1Z3R are: (i) extracting putative open reading frames and comparative genomics analysis of gene content; (ii) extracting orthologous sets and analyzing their size distribution; (iii) analyzing gene presence-absence patterns; (iv) reconstructing a phylogenetic tree based on the extracted orthologous set; (v) inferring GC-content variation among lineages. M1CR0B1AL1Z3R facilitates the mining and analysis of dozens of bacterial genomes using advanced techniques, with the click of a button. M1CR0B1AL1Z3R is freely available at https://microbializer.tau.ac.il/.

Subject(s)

Bacterial Genome/genetics , Genomics , Software , Computational Biology , Internet , Phylogeny , Sequence Alignment/methods , User-Computer Interface

16.

Ancestral sequence reconstruction: accounting for structural information by averaging over replacement matrices.

Moshe, Asher; Pupko, Tal.

Bioinformatics ; 35(15): 2562-2568, 2019 08 01.

Article in English | MEDLINE | ID: mdl-30590382

ABSTRACT

MOTIVATION: Ancestral sequence reconstruction (ASR) is widely used to understand protein evolution, structure and function. Current ASR methodologies do not fully consider differences in evolutionary constraints among positions imposed by the three-dimensional (3D) structure of the protein. Here, we developed an ASR algorithm that allows different protein sites to evolve according to different mixtures of replacement matrices. We show that assigning replacement matrices to protein positions based on their solvent accessibility leads to ASR with higher log-likelihoods compared to naïve models that assume a single replacement matrix for all sites. Improved ASR log-likelihoods are also demonstrated when solvent accessibility is predicted from protein sequences rather than inferred from a known 3D structure. Finally, we show that using such structure-aware mixture models results in substantial differences in the inferred ancestral sequences. AVAILABILITY AND IMPLEMENTATION: http://fastml.tau.ac.il. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Molecular Evolution , Algorithms , Amino Acid Sequence , Phylogeny , Proteins

17.

Multiple Sequence Alignment Averaging Improves Phylogeny Reconstruction.

Ashkenazy, Haim; Sela, Itamar; Levy Karin, Eli; Landan, Giddy; Pupko, Tal.

Syst Biol ; 68(1): 117-130, 2019 01 01.

Article in English | MEDLINE | ID: mdl-29771363

ABSTRACT

The classic methodology of inferring a phylogenetic tree from sequence data is composed of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. Yet, inferred MSAs were shown to be inaccurate and alignment errors reduce tree inference accuracy. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal. In this work, we explore an approach in which, instead of relying on a single MSA, we generate a large set of alternative MSAs and concatenate them into a single SuperMSA. By doing so, we account for phylogenetic signals contained in columns that are not present in the single MSA computed by alignment algorithms. Using simulations, we demonstrate that this approach results, on average, in more accurate trees compared to 1) using an unfiltered MSA and 2) using a single MSA with weights assigned to columns according to their reliability. Next, we explore in which regions of the MSA space our approach is expected to be beneficial. Finally, we provide a simple criterion for deciding whether or not the extra effort of computing a SuperMSA and inferring a tree from it is beneficial. Based on these assessments, we expect our methodology to be useful for many cases in which diverged sequences are analyzed. The option to generate such a SuperMSA is available at http://guidance.tau.ac.il.

Subject(s)

Classification/methods , Phylogeny , Sequence Alignment , Software , Computer Simulation , Reproducibility of Results

18.

A Simulation-Based Approach to Statistical Alignment.

Levy Karin, Eli; Ashkenazy, Haim; Hein, Jotun; Pupko, Tal.

Syst Biol ; 68(2): 252-266, 2019 03 01.

Article in English | MEDLINE | ID: mdl-30239957

ABSTRACT

Classic alignment algorithms utilize scoring functions which maximize similarity or minimize edit distances. These scoring functions account for both insertion-deletion (indel) and substitution events. In contrast, alignments based on stochastic models aim to explicitly describe the evolutionary dynamics of sequences by inferring relevant probabilistic parameters from input sequences. Despite advances in stochastic modeling during the last two decades, scoring-based methods are still dominant, partially due to slow running times of probabilistic approaches. Alignment inference using stochastic models involves estimating the probability of events, such as the insertion or deletion of a specific number of characters. In this work, we present SimBa-SAl, a simulation-based approach to statistical alignment inference, which relies on an explicit continuous time Markov model for both indels and substitutions. SimBa-SAl has several advantages. First, using simulations, it decouples the estimation of event probabilities from the inference stage, which allows the introduction of accelerations to the alignment inference procedure. Second, it is general and can accommodate various stochastic models of indel formation. Finally, it allows computing the maximum-likelihood alignment, the probability of a given pair of sequences integrated over all possible alignments, and sampling alternative alignments according to their probability. We first show that SimBa-SAl allows accurate estimation of parameters of the long-indel model previously developed by Miklós et al. (2004). We next show that SimBa-SAl is more accurate than previously developed pairwise alignment algorithms, when analyzing simulated as well as empirical data sets. Finally, we study the goodness-of-fit of the long-indel and TKF91 models. We show that although the long-indel model fits the data sets better than TKF91, there is still room for improvement concerning the realistic modeling of evolutionary sequence dynamics.
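The general idea of estimating event probabilities by simulating a continuous time Markov model can be shown on a much simpler case than the one treated in the paper. The sketch below is not SimBa-SAl and does not model indels: it assumes a Jukes-Cantor substitution process with a total rate of one substitution per unit time, estimates by simulation the probability that a site differs from its starting nucleotide after time t, and compares the estimate with the known closed form. All function names are hypothetical.

```python
import math
import random

def simulate_differs(t, rng):
    """Simulate one site under Jukes-Cantor for time t.

    Returns True if the final nucleotide differs from the starting one.
    """
    nucleotides = "ACGT"
    start = state = "A"
    elapsed = rng.expovariate(1.0)  # waiting time to the first substitution
    while elapsed <= t:
        # Jump to one of the three other nucleotides with equal probability.
        state = rng.choice([n for n in nucleotides if n != state])
        elapsed += rng.expovariate(1.0)
    return state != start

def estimate_p_differs(t, n_sims=20000, seed=0):
    """Monte Carlo estimate of P(final nucleotide != starting nucleotide)."""
    rng = random.Random(seed)
    hits = sum(simulate_differs(t, rng) for _ in range(n_sims))
    return hits / n_sims

def analytic_p_differs(t):
    """Closed-form Jukes-Cantor probability for the same event."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))

t = 0.5
estimate = estimate_p_differs(t)
exact = analytic_p_differs(t)
print(f"simulated: {estimate:.4f}, analytic: {exact:.4f}")
```

Because the simulated probabilities can be tabulated once and reused, the estimation stage is separate from any later inference that consumes them, which is the kind of decoupling the abstract refers to.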

Subject(s)

Classification/methods , Statistical Models , Phylogeny , Computer Simulation , Molecular Evolution , INDEL Mutation/genetics

19.

Phage display peptide libraries: deviations from randomness and correctives.

Ryvkin, Arie; Ashkenazy, Haim; Weiss-Ottolenghi, Yael; Piller, Chen; Pupko, Tal; Gershoni, Jonathan M.

Nucleic Acids Res ; 46(9): e52, 2018 05 18.

Article in English | MEDLINE | ID: mdl-29420788

ABSTRACT

Peptide-expressing phage display libraries are widely used for the interrogation of antibodies. Affinity selected peptides are then analyzed to discover epitope mimetics, or are subjected to computational algorithms for epitope prediction. A critical assumption for these applications is the random representation of amino acids in the initial naïve peptide library. In a previous study, we implemented next generation sequencing to evaluate a naïve library and discovered severe deviations from randomness in UAG codon over-representation as well as in high G phosphoramidite abundance causing amino acid distribution biases. In this study, we demonstrate that the UAG over-representation can be attributed to the burden imposed on the phage upon the assembly of the recombinant Protein 8 subunits. This was corrected by constructing the libraries using supE44-containing bacteria which suppress the UAG driven abortive termination. We also demonstrate that the overabundance of G stems from variant synthesis-efficiency and can be corrected using compensating oligonucleotide-mixtures calibrated by mass spectroscopy. Construction of libraries implementing these correctives results in markedly improved libraries that display random distribution of amino acids, thus ensuring that enriched peptides obtained in biopanning represent a genuine selection event, a fundamental assumption for phage display applications.

Subject(s)

Peptide Library , Amino Acids , Cell Surface Display Techniques

20.

TraitRateProp: a web server for the detection of trait-dependent evolutionary rate shifts in sequence sites.

Levy Karin, Eli; Ashkenazy, Haim; Wicke, Susann; Pupko, Tal; Mayrose, Itay.

Nucleic Acids Res ; 45(W1): W260-W264, 2017 07 03.

Article in English | MEDLINE | ID: mdl-28453644

ABSTRACT

Understanding species adaptation at the molecular level has been a central goal of evolutionary biology and genomics research. This important task becomes increasingly relevant with the constant rise in the availability of both genotypic and phenotypic data. The TraitRateProp web server offers a unique perspective into this task by allowing the detection of associations between sequence evolution rate and whole-organism phenotypes. By analyzing sequences and phenotypes of extant species in the context of their phylogeny, it identifies sequence sites in a gene/protein whose evolutionary rate is associated with shifts in the phenotype. To this end, it considers alternative histories of whole-organism phenotypic changes, which result in the extant phenotypic states. Its joint likelihood framework that combines models of sequence and phenotype evolution allows testing whether an association between these processes exists. In addition to predicting sequence sites most likely to be associated with the phenotypic trait, the server can optionally integrate structural 3D information. This integration allows a visual detection of trait-associated sequence sites that are juxtaposed in 3D space, thereby suggesting a common functional role. We used TraitRateProp to study the shifts in sequence evolution rate of the RPS8 protein upon transitions into heterotrophy in Orchidaceae. TraitRateProp is available at http://traitrate.tau.ac.il/prop.

Subject(s)

Molecular Evolution , Sequence Analysis , Software , Algorithms , Internet , Orchidaceae/genetics , Phenotype , Phylogeny , Ribosomal Proteins/chemistry , Ribosomal Proteins/genetics
