RESUMO
Since its public release in 2021, AlphaFold2 (AF2) has made investigating biological questions, by using predicted protein structures of single monomers or full complexes, a common practice. ColabFold-AF2 is an open-source Jupyter Notebook inside Google Colaboratory and a command-line tool that makes it easy to use AF2 while exposing its advanced options. ColabFold-AF2 shortens turnaround times of experiments because of its optimized usage of AF2's models. In this protocol, we guide the reader through ColabFold best practices by using three scenarios: (i) monomer prediction, (ii) complex prediction and (iii) conformation sampling. The first two scenarios cover classic static structure prediction and are demonstrated on the human glycosylphosphatidylinositol transamidase protein. The third scenario demonstrates an alternative use case of the AF2 models by predicting two conformations of the human alanine serine transporter 2. Users can run the protocol without computational expertise via Google Colaboratory or in a command-line environment for advanced users. Using Google Colaboratory, it takes <2 h to run each procedure. The data and code for this protocol are available at https://protocol.colabfold.com .
RESUMO
SUMMARY: SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships via identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes. AVAILABILITY AND IMPLEMENTATION: SpacePHARER is available as an open-source (GPLv3), user-friendly command-line software for Linux and macOS: https://github.com/soedinglab/spacepharer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
RESUMO
BACKGROUND: Metagenomics is revolutionizing the study of microorganisms and their involvement in biological, biomedical, and geochemical processes, allowing us to investigate by direct sequencing a tremendous diversity of organisms without the need for prior cultivation. Unicellular eukaryotes play essential roles in most microbial communities as chief predators, decomposers, phototrophs, bacterial hosts, symbionts, and parasites to plants and animals. Investigating their roles is therefore of great interest to ecology, biotechnology, human health, and evolution. However, the generally lower sequencing coverage, their more complex gene and genome architectures, and a lack of eukaryote-specific experimental and computational procedures have kept them on the sidelines of metagenomics. RESULTS: MetaEuk is a toolkit for high-throughput, reference-based discovery, and annotation of protein-coding genes in eukaryotic metagenomic contigs. It performs fast searches with 6-frame-translated fragments covering all possible exons and optimally combines matches into multi-exon proteins. We used a benchmark of seven diverse, annotated genomes to show that MetaEuk is highly sensitive even under conditions of low sequence similarity to the reference database. To demonstrate MetaEuk's power to discover novel eukaryotic proteins in large-scale metagenomic data, we assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted >12,000,000 protein-coding genes in 8 days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups. CONCLUSION: The open-source (GPLv3) MetaEuk software (https://github.com/soedinglab/metaeuk) enables large-scale eukaryotic metagenomics through reference-based, sensitive taxonomic and functional annotation. Video abstract.
Assuntos
Algoritmos , Eucariotos/genética , Metagenômica/métodos , Microbiota , Anotação de Sequência Molecular/métodos , Biologia Computacional/métodos , Bases de Dados Genéticas , Ensaios de Triagem em Larga Escala , Metagenoma , Metagenômica/instrumentação , Análise de Sequência de DNA/métodosRESUMO
The classic methodology of inferring a phylogenetic tree from sequence data is composed of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. Yet, inferred MSAs were shown to be inaccurate and alignment errors reduce tree inference accuracy. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal. In this work we explore an approach, in which instead of relying on a single MSA, we generate a large set of alternative MSAs and concatenate them into a single SuperMSA. By doing so, we account for phylogenetic signals contained in columns that are not present in the single MSA computed by alignment algorithms. Using simulations, we demonstrate that this approach results, on average, in more accurate trees compared to 1) using an unfiltered MSA and 2) using a single MSA with weights assigned to columns according to their reliability. Next, we explore in which regions of the MSA space our approach is expected to be beneficial. Finally, we provide a simple criterion for deciding whether or not the extra effort of computing a SuperMSA and inferring a tree from it is beneficial. Based on these assessments, we expect our methodology to be useful for many cases in which diverged sequences are analyzed. The option to generate such a SuperMSA is available at http://guidance.tau.ac.il.
Assuntos
Classificação/métodos , Filogenia , Alinhamento de Sequência , Software , Simulação por Computador , Reprodutibilidade dos TestesRESUMO
Classic alignment algorithms utilize scoring functions which maximize similarity or minimize edit distances. These scoring functions account for both insertion-deletion (indel) and substitution events. In contrast, alignments based on stochastic models aim to explicitly describe the evolutionary dynamics of sequences by inferring relevant probabilistic parameters from input sequences. Despite advances in stochastic modeling during the last two decades, scoring-based methods are still dominant, partially due to slow running times of probabilistic approaches. Alignment inference using stochastic models involves estimating the probability of events, such as the insertion or deletion of a specific number of characters. In this work, we present SimBa-SAl, a simulation-based approach to statistical alignment inference, which relies on an explicit continuous time Markov model for both indels and substitutions. SimBa-SAl has several advantages. First, using simulations, it decouples the estimation of event probabilities from the inference stage, which allows the introduction of accelerations to the alignment inference procedure. Second, it is general and can accommodate various stochastic models of indel formation. Finally, it allows computing the maximum-likelihood alignment, the probability of a given pair of sequences integrated over all possible alignments, and sampling alternative alignments according to their probability. We first show that SimBa-SAl allows accurate estimation of parameters of the long-indel model previously developed by Miklós et al. (2004). We next show that SimBa-SAl is more accurate than previously developed pairwise alignment algorithms, when analyzing simulated as well as empirical data sets. Finally, we study the goodness-of-fit of the long-indel and TKF91 models. We show that although the long-indel model fits the data sets better than TKF91, there is still room for improvement concerning the realistic modeling of evolutionary sequence dynamics.
Assuntos
Classificação/métodos , Modelos Estatísticos , Filogenia , Simulação por Computador , Evolução Molecular , Mutação INDEL/genéticaRESUMO
Perfect short inverted repeats (IRs) are known to be enriched in a variety of bacterial and eukaryotic genomes. Currently, it is unclear whether perfect IRs are conserved over evolutionary time scales. In this study, we aimed to characterize the prevalence and evolutionary conservation of IRs across 20 proteobacterial strains. We first identified IRs in Escherichia coli K-12 substr MG1655 and showed that they are overabundant. We next aimed to test whether this overabundance is reflected in the conservation of IRs over evolutionary time scales. To this end, for each perfect IR identified in E. coli MG1655, we collected orthologous sequences from related proteobacterial genomes. We next quantified the evolutionary conservation of these IRs, that is, the presence of the exact same IR across orthologous regions. We observed high conservation of perfect IRs: out of the 234 examined orthologous regions, 145 were more conserved than expected, which is statistically significant even after correcting for multiple testing. Our results together with previous experimental findings support a model in which imperfect IRs are corrected to perfect IRs in a preferential manner via a template switching mechanism.
Assuntos
Sequência Conservada/genética , Evolução Molecular , Sequências Repetidas Invertidas/genética , Escherichia coli/genética , Genoma Bacteriano/genética , Proteobactérias/genéticaRESUMO
Many analyses for the detection of biological phenomena rely on a multiple sequence alignment as input. The results of such analyses are often further studied through parametric bootstrap procedures, using sequence simulators. One of the problems with conducting such simulation studies is that users currently have no means to decide which insertion and deletion (indel) parameters to choose, so that the resulting sequences mimic biological data. Here, we present SpartaABC, a web server that aims to solve this issue. SpartaABC implements an approximate-Bayesian-computation rejection algorithm to infer indel parameters from sequence data. It does so by extracting summary statistics from the input. It then performs numerous sequence simulations under randomly sampled indel parameters. By computing a distance between the summary statistics extracted from the input and each simulation, SpartaABC retains only parameters behind simulations close to the real data. As output, SpartaABC provides point estimates and approximate posterior distributions of the indel parameters. In addition, SpartaABC allows simulating sequences with the inferred indel parameters. To this end, the sequence simulators, Dawg 2.0 and INDELible were integrated. Using SpartaABC we demonstrate the differences in indel dynamics among three protein-coding genes across mammalian orthologs. SpartaABC is freely available for use at http://spartaabc.tau.ac.il/webserver.
Assuntos
Algoritmos , Mutação INDEL , Análise de Sequência/métodos , Software , Teorema de Bayes , Regulador de Condutância Transmembrana em Fibrose Cística/genética , Internet , Receptor Tipo 1 de Hormônio Paratireóideo/genética , Globulina de Ligação a Tiroxina/genéticaRESUMO
The most common evolutionary events at the molecular level are single-base substitutions, as well as insertions and deletions (indels) of short DNA segments. A large body of research has been devoted to develop probabilistic substitution models and to infer their parameters using likelihood and Bayesian approaches. In contrast, relatively little has been done to model indel dynamics, probably due to the difficulty in writing explicit likelihood functions. Here, we contribute to the effort of modeling indel dynamics by presenting SpartaABC, an approximate Bayesian computation (ABC) approach to infer indel parameters from sequence data (either aligned or unaligned). SpartaABC circumvents the need to use an explicit likelihood function by extracting summary statistics from simulated sequences. First, summary statistics are extracted from the input sequence data. Second, SpartaABC samples indel parameters from a prior distribution and uses them to simulate sequences. Third, it computes summary statistics from the simulated sets of sequences. By computing a distance between the summary statistics extracted from the input and each simulation, SpartaABC can provide an approximation to the posterior distribution of indel parameters as well as point estimates. We study the performance of our methodology and show that it provides accurate estimates of indel parameters in simulations. We next demonstrate the utility of SpartaABC by studying the impact of alignment errors on the inference of positive selection. A C ++ program implementing SpartaABC is freely available in http://spartaabc.tau.ac.il.
Assuntos
Teorema de Bayes , Mutação INDEL , Modelos Estatísticos , Algoritmos , Biologia Computacional/métodos , Simulação por Computador , Evolução Molecular , Humanos , Modelos Genéticos , Taxa de Mutação , SoftwareRESUMO
Understanding species adaptation at the molecular level has been a central goal of evolutionary biology and genomics research. This important task becomes increasingly relevant with the constant rise in both genotypic and phenotypic data availabilities. The TraitRateProp web server offers a unique perspective into this task by allowing the detection of associations between sequence evolution rate and whole-organism phenotypes. By analyzing sequences and phenotypes of extant species in the context of their phylogeny, it identifies sequence sites in a gene/protein whose evolutionary rate is associated with shifts in the phenotype. To this end, it considers alternative histories of whole-organism phenotypic changes, which result in the extant phenotypic states. Its joint likelihood framework that combines models of sequence and phenotype evolution allows testing whether an association between these processes exists. In addition to predicting sequence sites most likely to be associated with the phenotypic trait, the server can optionally integrate structural 3D information. This integration allows a visual detection of trait-associated sequence sites that are juxtapose in 3D space, thereby suggesting a common functional role. We used TraitRateProp to study the shifts in sequence evolution rate of the RPS8 protein upon transitions into heterotrophy in Orchidaceae. TraitRateProp is available at http://traitrate.tau.ac.il/prop.
Assuntos
Evolução Molecular , Análise de Sequência , Software , Algoritmos , Internet , Orchidaceae/genética , Fenótipo , Filogenia , Proteínas Ribossômicas/química , Proteínas Ribossômicas/genéticaRESUMO
Recent years have seen a constant rise in the availability of trait data, including morphological features, ecological preferences, and life history characteristics. These phenotypic data provide means to associate genomic regions with phenotypic attributes, thus allowing the identification of phenotypic traits associated with the rate of genome and sequence evolution. However, inference methodologies that analyze sequence and phenotypic data in a unified statistical framework are still scarce. Here, we present TraitRateProp, a probabilistic method that allows testing whether the rate of sequence evolution is associated with a binary phenotypic character trait. The method further allows the detection of specific sequence sites whose evolutionary rate is most noticeably affected following the character transition, suggesting a shift in functional/structural constraints. TraitRateProp is first evaluated in simulations and then applied to study the evolutionary process of plastid plant genomes upon a transition to a heterotrophic lifestyle. To this end, we analyze 20 plastid genes across 85 orchid species, spanning different lifestyles and representing different genera in this large family of flowering plants. Our results indicate higher evolutionary rates following repeated transitions to a heterotrophic lifestyle in all but four of the loci analyzed. [Evolutionary models; evolutionary rate; genotype-phenotype; orchids; plastome; rate shift.].
Assuntos
Classificação/métodos , Evolução Molecular , Genoma de Planta/genética , Modelos Genéticos , Simulação por Computador , Genomas de Plastídeos/genética , FenótipoRESUMO
In this study, we present a novel methodology to infer indel parameters from multiple sequence alignments (MSAs) based on simulations. Our algorithm searches for the set of evolutionary parameters describing indel dynamics which best fits a given input MSA. In each step of the search, we use parametric bootstraps and the Mahalanobis distance to estimate how well a proposed set of parameters fits input data. Using simulations, we demonstrate that our methodology can accurately infer the indel parameters for a large variety of plausible settings. Moreover, using our methodology, we show that indel parameters substantially vary between three genomic data sets: Mammals, bacteria, and retroviruses. Finally, we demonstrate how our methodology can be used to simulate MSAs based on indel parameters inferred from real data sets.
Assuntos
Mutação INDEL , Alinhamento de Sequência/métodos , Software , Animais , Genoma Bacteriano , Genoma Viral , MamíferosRESUMO
Estimating phylogenetic trees from sequence data is an extremely challenging and important statistical task. Within the maximum-likelihood paradigm, the best tree is a point estimate. To determine how strongly the data support such an evolutionary scenario, a hypothesis testing methodology is required. To this end, the Kishino-Hasegawa (KH) test was developed to determine whether one topology is significantly more supported by the sequence data than another one. This test and its derivatives are widely used in phylogenetics and phylogenomics. Here, we show that the KH test is biased in the presence of alignment error and can lead to erroneous conclusions. Using simulations we demonstrated that due to alignment errors the KH test often rejects one of the competing topologies, even though both topologies are equally supported by the data. Specifically, we show that the KH test favors the guide tree used to align the analyzed sequences. Further, branch length optimization renders the test too conservative. We propose two possible corrections for these biases. First, we evaluated the impact of removing unreliable alignment columns and found out that it decreases the bias at the cost of substantially reducing the test's power. Second, we developed a parametric test that entirely abolishes the biases without data filtering. This test incorporates the alignment construction step into the test's hypothesis, thus removing the above guide tree effect. We extend this methodology for the case of multiple-topology comparisons and demonstrate the applicability of the new methodology on an exemplary data set.
Assuntos
Filogenia , Primatas/genética , Alinhamento de Sequência/estatística & dados numéricos , Sequência de Aminoácidos , Animais , Viés , Humanos , Funções Verossimilhança , Dados de Sequência Molecular , Primatas/classificação , Alinhamento de Sequência/métodos , Alinhamento de Sequência/normas , IncertezaRESUMO
Evolutionary analysis of phyletic patterns (phylogenetic profiles) is widely used in biology, representing presence or absence of characters such as genes, restriction sites, introns, indels and methylation sites. The phyletic pattern observed in extant genomes is the result of ancestral gain and loss events along the phylogenetic tree. Here we present CoPAP (coevolution of presence-absence patterns), a user-friendly web server, which performs accurate inference of coevolving characters as manifested by co-occurring gains and losses. CoPAP uses state-of-the-art probabilistic methodologies to infer coevolution and allows for advanced network analysis and visualization. We developed a platform for comparing different algorithms that detect coevolution, which includes simulated data with pairs of coevolving sites and independent sites. Using these simulated data we demonstrate that CoPAP performance is higher than alternative methods. We exemplify CoPAP utility by analyzing coevolution among thousands of bacterial genes across 681 genomes. Clusters of coevolving genes that were detected using our method largely coincide with known biosynthesis pathways and cellular modules, thus exhibiting the capability of CoPAP to infer biologically meaningful interactions. CoPAP is freely available for use at http://copap.tau.ac.il/.