Results 1 - 20 of 131
1.
Bioinformatics ; 40(Supplement_1): i208-i217, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940166

ABSTRACT

MOTIVATION: Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance. RESULTS: Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets. AVAILABILITY AND IMPLEMENTATION: The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.


Subjects
Algorithms, Machine Learning, Phylogeny, Software, Sequence Alignment/methods, Computational Biology/methods, Likelihood Functions
2.
Mol Biol Evol ; 41(6)2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38829798

ABSTRACT

The computational search for the maximum-likelihood phylogenetic tree is an NP-hard problem. As such, current tree search algorithms might result in a tree that is a local optimum rather than the global one. Here, we introduce a paradigm shift for predicting the maximum-likelihood tree, by approximating long-term gains of likelihood rather than maximizing likelihood gain at each step of the search. Our proposed approach harnesses the power of reinforcement learning to learn an optimal search strategy, aiming at the global optimum of the search space. We show that when analyzing empirical data containing dozens of sequences, the log-likelihood improvement from the starting tree obtained by the reinforcement learning-based agent was 0.969 or higher compared to that achieved by current state-of-the-art techniques. Notably, this performance is attained without the need to perform costly likelihood optimizations apart from the training process, thus potentially allowing for an exponential reduction in runtime. We exemplify this for data sets containing 15 sequences of length 18,000 bp and demonstrate that the reinforcement learning-based method is roughly three times faster than the state-of-the-art software. This study illustrates the potential of reinforcement learning in addressing the challenges of phylogenetic tree reconstruction.


Subjects
Algorithms, Phylogeny, Likelihood Functions, Genetic Models, Computational Biology/methods, Software
3.
BMC Genomics ; 25(1): 388, 2024 Apr 22.
Article in English | MEDLINE | ID: mdl-38649808

ABSTRACT

BACKGROUND: Myxozoa is a class of cnidarian parasites that encompasses over 2,400 species. Phylogenetic relationships among myxozoans remain highly debated, owing to both a lack of informative morphological characters and a shortage of molecular markers. Mitochondrial (mt) genomes are a common marker in phylogeny and biogeography. However, only five complete myxozoan mt genomes have been sequenced: four belonging to two closely related genera, Enteromyxum and Kudoa, and one from the genus Myxobolus. Interestingly, while cytochrome oxidase genes could be identified in Enteromyxum and Kudoa, no such genes were found in Myxobolus squamalis, and another member of the Myxobolidae (Henneguya salminicola) was found to have lost its entire mt genome. To evaluate the utility of mt genomes to reconstruct myxozoan relationships and to understand if the loss of cytochrome oxidase genes is a characteristic of myxobolids, we sequenced the mt genome of five myxozoans (Myxobolus wulii, M. honghuensis, M. shantungensis, Thelohanellus kitauei, and Sphaeromyxa zaharoni) using Illumina and Oxford Nanopore platforms. RESULTS: Unlike Enteromyxum, which possesses a partitioned mt genome, the five mt genomes were encoded on single circular chromosomes. An mt plasmid was found in M. wulii, as described previously in Kudoa iwatai. In all new myxozoan genomes, five protein-coding genes (cob, cox1, cox2, nad1, and nad5) and two rRNAs (rnl and rns) were recognized, but no tRNA. We found that Myxobolus and Thelohanellus species shared unidentified reading frames, supporting the view that these mt open reading frames are functional. Our phylogenetic reconstructions based on the five conserved mt genes agree with previously published trees based on the 18S rRNA gene.
CONCLUSIONS: Our results suggest that the loss of cytochrome oxidase genes is not a characteristic of all myxobolids, the ancestral myxozoan mt genome was likely encoded on a single circular chromosome, and mt plasmids exist in a few lineages. Our findings indicate that myxozoan mt sequences are poor markers for reconstructing myxozoan phylogenetic relationships because of their fast evolutionary rates and the abundance of repeated elements, which complicates assembly.


Subjects
Molecular Evolution, Mitochondrial Genome, Myxozoa, Phylogeny, Animals, Myxozoa/genetics, Myxozoa/classification, Electron Transport Complex IV/genetics
4.
Genome Biol Evol ; 16(4)2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38648506

ABSTRACT

The genus Xanthomonas has been primarily studied for pathogenic interactions with plants. However, besides host and tissue-specific pathogenic strains, this genus also comprises nonpathogenic strains isolated from a broad range of hosts, sometimes in association with pathogenic strains, and other environments, including rainwater. Based on their incapacity or limited capacity to cause symptoms on the host of isolation, nonpathogenic xanthomonads can be further characterized as commensal and weakly pathogenic. This study aimed to understand the diversity and evolution of nonpathogenic xanthomonads compared to their pathogenic counterparts based on their cooccurrence and phylogenetic relationship and to identify genomic traits that form the basis of a life history framework that groups xanthomonads by ecological strategies. We sequenced genomes of 83 strains spanning the genus phylogeny and identified eight novel species, indicating unexplored diversity. While some nonpathogenic species have experienced a recent loss of a type III secretion system, specifically the hrp2 cluster, we observed an apparent lack of association of the hrp2 cluster with lifestyles of diverse species. We performed association analysis on a large data set of 337 Xanthomonas strains to explain how xanthomonads may have established associations with plants across the continuum of lifestyles from commensals to weak pathogens to pathogens. The presence of distinct transcriptional regulators, distinct nutrient utilization and assimilation genes, and chemotaxis genes may explain lifestyle-specific adaptations of xanthomonads.


Subjects
Bacterial Genome, Phylogeny, Xanthomonas, Xanthomonas/genetics, Xanthomonas/pathogenicity, Xanthomonas/classification, Genetic Variation, Symbiosis
5.
Bioinformatics ; 40(4)2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38608190

ABSTRACT

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and, at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers.
Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.
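
To illustrate how a learned tokenizer can shorten the input relative to one token per character, here is a minimal byte-pair-encoding (BPE) sketch. BPE is used here only as one plausible example of an "alternative tokenization algorithm"; the abstract does not say which algorithms were studied. The function names and the toy protein sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    corpus = [list(seq) for seq in sequences]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing is repeated, so merging no longer helps
            break
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a sequence by applying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy training set (made-up sequences, not real proteins).
toy_proteins = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQL"]
merges = train_bpe(toy_proteins, num_merges=5)

query = "MKTAYIAKQR"
tokens = tokenize(query, merges)
print(len(query), "characters ->", len(tokens), "tokens:", tokens)
```

The token list always concatenates back to the original sequence, so the shorter input loses no information; only the segmentation changes.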


Subjects
Algorithms, Computational Biology, Deep Learning, Natural Language Processing, Computational Biology/methods, Proteins/chemistry, Sequence Alignment/methods, Protein Sequence Analysis/methods
6.
Bioinformatics ; 40(2)2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38269647

ABSTRACT

MOTIVATION: Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel-length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best-fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased due to the assumption of alignment methods that indel lengths follow a geometric distribution. RESULTS: We aimed to determine which indel-length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best-fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available on GitHub, at https://github.com/elyawy/SpartaSim and https://github.com/elyawy/SpartaPipeline.
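
The model-selection idea can be sketched with a toy Approximate Bayesian Computation (ABC) rejection scheme that compares a geometric and a truncated Zipf length distribution. This is a heavily simplified illustration, not the authors' pipeline: it simulates indel lengths directly rather than full alignments, omits the machine-learning bias correction, and the priors, summary statistics, truncation point, and function names are all assumptions.

```python
import random
import statistics


def sample_geometric(p, n, rng):
    """Draw n indel lengths (>= 1) from a geometric distribution."""
    lengths = []
    for _ in range(n):
        k = 1
        while rng.random() > p:  # continue with probability 1 - p
            k += 1
        lengths.append(k)
    return lengths


def sample_zipf(a, n, rng, max_len=50):
    """Draw n indel lengths from a Zipf distribution truncated at max_len."""
    support = list(range(1, max_len + 1))
    weights = [k ** -a for k in support]
    return rng.choices(support, weights=weights, k=n)


def summarize(lengths):
    """Toy summary statistics: mean, share of length 1, share of length >= 10."""
    n = len(lengths)
    return (
        statistics.fmean(lengths),
        sum(1 for x in lengths if x == 1) / n,
        sum(1 for x in lengths if x >= 10) / n,
    )


def distance(s, t):
    """Unscaled Euclidean distance between two summary vectors."""
    return sum((a - b) ** 2 for a, b in zip(s, t)) ** 0.5


def abc_model_choice(observed, n_sims=1000, keep=50, seed=0):
    """Approximate posterior model probabilities from the `keep` nearest simulations."""
    rng = random.Random(seed)
    target = summarize(observed)
    sims = []
    for _ in range(n_sims):
        model = rng.choice(["geometric", "zipf"])  # uniform prior over models
        if model == "geometric":
            simulated = sample_geometric(rng.uniform(0.1, 0.9), len(observed), rng)
        else:
            simulated = sample_zipf(rng.uniform(1.0, 3.0), len(observed), rng)
        sims.append((distance(summarize(simulated), target), model))
    sims.sort(key=lambda item: item[0])
    accepted = [model for _, model in sims[:keep]]
    return {m: accepted.count(m) / keep for m in ("geometric", "zipf")}


# Stand-in "observed" lengths, themselves simulated for this toy example.
observed = sample_zipf(1.5, 200, random.Random(1))
posterior = abc_model_choice(observed)
print(posterior)
```

In a realistic setting the summary statistics would be rescaled so that no single statistic dominates the distance, and the simulations would produce alignments rather than bare length lists.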


Subjects
Algorithms, Software, Bayes Theorem, Sequence Alignment, INDEL Mutation, Molecular Evolution
7.
J Immunol ; 211(10): 1578-1588, 2023 Nov 15.
Article in English | MEDLINE | ID: mdl-37782047

ABSTRACT

Being able to accurately predict the three-dimensional structure of an Ab can facilitate Ab characterization and epitope prediction, with important diagnostic and clinical implications. In this study, we evaluated the ability of AlphaFold to predict the structures of 222 recently published, high-resolution Fab H and L chain structures of Abs from different species directed against different Ags. We show that although the overall Ab prediction quality is in line with the results of CASP14, regions such as the complementarity-determining regions (CDRs) of the H chain, which are prone to higher variation, are predicted less accurately. Moreover, we discovered that AlphaFold mispredicts the bending angles between the variable and constant domains. To evaluate the ability of AlphaFold to model Ab-Ag interactions based only on sequence, we used AlphaFold-Multimer in combination with ZDOCK to predict the structures of 26 known Ab-Ag complexes. ZDOCK, which was applied on bound components of both the Ab and the Ag, succeeded in assembling 11 complexes, whereas AlphaFold succeeded in predicting only 2 of 26 models, with significant deviations in the docking contacts predicted in the rest of the molecules. Within the 11 complexes that were successfully predicted by ZDOCK, 9 involved short-peptide Ags (18-mer or less), whereas only 2 were complexes of Ab with a full-length protein. Docking of modeled unbound Ab and Ag was unsuccessful. In summary, our study provides important information about the abilities and limitations of using AlphaFold to predict Ab-Ag interactions and suggests areas for possible improvement.


Subjects
Antibodies, Complementarity Determining Regions, Epitopes, Peptides/chemistry
8.
Front Plant Sci ; 14: 1198160, 2023.
Article in English | MEDLINE | ID: mdl-37583594

ABSTRACT

Acquisition of the pathogenicity plasmid pPATH that encodes a type III secretion system (T3SS) and effectors (T3Es) has likely led to the transition of a non-pathogenic bacterium into the tumorigenic pathogen Pantoea agglomerans. P. agglomerans pv. gypsophilae (Pag) forms galls on gypsophila (Gypsophila paniculata) and triggers immunity on sugar beet (Beta vulgaris), while P. agglomerans pv. betae (Pab) causes galls on both gypsophila and sugar beet. Draft sequences of the Pag and Pab genomes were previously generated using the Illumina MiSeq technology and used to determine partial T3E inventories of Pab and Pag. Here, we fully assembled the Pab and Pag genomes following sequencing with PacBio technology and carried out a comparative sequence analysis of the Pab and Pag pathogenicity plasmids pPATHpag and pPATHpab. Assembly of Pab and Pag genomes revealed a ~4 Mbp chromosome with a 55% GC content, and three and four plasmids in Pab and Pag, respectively. pPATHpag and pPATHpab share 97% identity within a 74% coverage, and a similar GC content (51%); they are ~156 kb and ~131 kb in size and consist of 198 and 155 coding sequences (CDSs), respectively. In both plasmids, we confirmed the presence of highly similar gene clusters encoding a T3SS, as well as auxin and cytokinin biosynthetic enzymes. Three putative novel T3Es were identified in Pab and one in Pag. Among T3SS-associated proteins encoded by Pag and Pab, we identified two novel chaperones of the ShcV and CesT families that are present in both pathovars with high similarity. We also identified insertion sequences (ISs) and transposons (Tns) that may have contributed to the evolution of the two pathovars. These include seven shared IS elements, and three ISs and two transposons unique to Pab. Finally, comparative sequence analysis revealed plasmid regions and CDSs that are present only in pPATHpab or in pPATHpag.
The high similarity and common features of the pPATH plasmids support the hypothesis that the two strains recently evolved into host-specific pathogens.

9.
Front Plant Sci ; 14: 1155341, 2023.
Article in English | MEDLINE | ID: mdl-37332699

ABSTRACT

Xanthomonas hortorum pv. pelargonii is the causative agent of bacterial blight in geranium ornamental plants, the most threatening bacterial disease of this plant worldwide. Xanthomonas fragariae is the causative agent of angular leaf spot in strawberries, where it poses a significant threat to the strawberry industry. Both pathogens rely on the type III secretion system and the translocation of effector proteins into the plant cells for their pathogenicity. Effectidor is a freely available web server we have previously developed for the prediction of type III effectors in bacterial genomes. Following complete genome sequencing and assembly of an Israeli isolate of Xanthomonas hortorum pv. pelargonii, strain 305, we used Effectidor to predict effector-encoding genes both in this newly sequenced genome and in X. fragariae strain Fap21, and validated its predictions experimentally. Four and two genes in X. hortorum and X. fragariae, respectively, contained an active translocation signal that allowed the translocation of the reporter AvrBs2 that induced the hypersensitive response in pepper leaves, and are thus considered validated novel effectors. These newly validated effectors are XopBB, XopBC, XopBD, XopBE, XopBF, and XopBG.

10.
J Mol Biol ; 435(14): 168155, 2023 Jul 15.
Article in English | MEDLINE | ID: mdl-37356902

ABSTRACT

Multiple sequence alignments (MSAs) are the workhorse of molecular evolution and structural biology research. From MSAs, the amino acids that are tolerated at each site during protein evolution can be inferred. However, little is known regarding the repertoire of tolerated amino acids in proteins when only a few or no sequence homologs are available, such as orphan and de novo designed proteins. Here we present EvoRator2, a deep-learning algorithm trained on over 15,000 protein structures that can predict which amino acids are tolerated at any given site, based exclusively on protein structural information mined from atomic coordinate files. We show that EvoRator2 obtained satisfactory results for the prediction of position-specific scoring matrices (PSSMs). We further show that EvoRator2 obtained near state-of-the-art performance on proteins with high-quality structures in predicting the effect of mutations in deep mutational scanning (DMS) experiments and that for certain DMS targets, EvoRator2 outperformed state-of-the-art methods. We also show that by combining EvoRator2's predictions with those obtained by a state-of-the-art deep-learning method that accounts for the information in the MSA, the prediction of the effect of mutation in DMS experiments was improved in terms of both accuracy and stability. EvoRator2 is designed to predict which amino-acid substitutions are tolerated in proteins without many homologous sequences, including orphan or de novo designed proteins. We implemented our approach in the EvoRator web server (https://evorator.tau.ac.il).


Subjects
Amino Acid Substitution, Deep Learning, Algorithms, Amino Acids/genetics, Computational Biology/methods, Proteins/chemistry, Proteins/genetics, Protein Conformation
11.
Nucleic Acids Res ; 51(W1): W232-W236, 2023 Jul 05.
Article in English | MEDLINE | ID: mdl-37177997

ABSTRACT

In the last decade, advances in sequencing technology have led to an exponential increase in genomic data. These new data have dramatically changed our understanding of the evolution and function of genes and genomes. Despite improvements in sequencing technologies, identifying contaminated reads remains a complex task for many research groups. Here, we introduce GenomeFLTR, a new web server to filter contaminated reads. Reads are compared against existing sequence databases from various representative organisms to detect potential contaminants. The main features implemented in GenomeFLTR are: (i) automated updating of the relevant databases; (ii) fast comparison of each read against the database; (iii) the ability to create user-specified databases; (iv) a user-friendly interactive dashboard to investigate the origin and frequency of the contaminations; (v) the generation of a contamination-free file. Availability: https://genomefltr.tau.ac.il/.


Subjects
Genomics, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis, Genome/genetics, Nucleic Acid Databases, Software
12.
Protein Sci ; 32(3): e4582, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36718848

ABSTRACT

The ConSurf web server for the analysis of proteins, RNA, and DNA provides a quick and accurate estimate of the per-site evolutionary rate among homologues. The analysis reveals functionally important regions, such as catalytic and ligand-binding sites, which often evolve slowly. Since the last report in 2016, ConSurf has been improved in multiple ways. It now has a user-friendly interface that makes it easier to perform the analysis and to visualize the results. Evolutionary rates are calculated based on a set of homologous sequences, collected using hidden Markov model-based search tools, recently embedded in the pipeline. Using these, and following the removal of redundancy, ConSurf assembles a representative set of effective homologues for protein and nucleic acid queries to enable informative analysis of the evolutionary patterns. The analysis is particularly insightful when the evolutionary rates are mapped on the macromolecule structure. In this respect, the availability of AlphaFold model structures of essentially all UniProt proteins makes ConSurf particularly relevant to the research community. The UniProt ID of a query protein with an available AlphaFold model can now be used to start a calculation. Another important improvement is the Python re-implementation of the entire computational pipeline, making it easier to maintain. This Python pipeline is now available for download as a standalone version. We demonstrate some of ConSurf's key capabilities by the analysis of caveolin-1, the main protein of membrane invaginations called caveolae.


Subjects
Biological Evolution, Molecular Evolution, Protein Conformation, Conserved Sequence/genetics, Proteins/chemistry, Software
13.
Open Biol ; 12(12): 220223, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36514983

ABSTRACT

Insertions and deletions (indels) of short DNA segments are common evolutionary events. Numerous studies showed that deletions occur more often than insertions in both prokaryotes and eukaryotes. This raises the question of why neutral sequences are not eradicated from the genome. We suggest that this is due to a phenomenon we term border-induced selection. Under this view, a neutral sequence is bordered by conserved regions. Deletions occurring near the borders occasionally protrude into the conserved region and are thereby subject to strong purifying selection. Thus, for short neutral sequences, an insertion bias is expected. Here, we develop a set of increasingly complex models of indel dynamics that incorporate border-induced selection. Furthermore, we show that short conserved sequences within the neutrally evolving sequence help explain: (i) the presence of very long sequences; (ii) the high variance of sequence lengths; and (iii) the possible emergence of multimodality in sequence length distributions. Finally, we fitted our models to the human intron length distribution, as introns are thought to be mostly neutral and bordered by conserved exons. We show that when accounting for the occurrence of short conserved sequences within introns, we reproduce the main features, including the presence of long introns and the multimodality of intron distribution.
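
The intuition behind border-induced selection can be sketched with a small calculation. This is a toy model, not one of the models developed in the paper: it assumes a single fixed deletion length, uniformly distributed deletion start positions, complete rejection of any deletion that touches a conserved border, and insertions that are always tolerated. The function names and the rates are hypothetical.

```python
def deletion_survival(neutral_len, del_len):
    """Fraction of deletions overlapping the neutral region that lie
    entirely inside it, and therefore escape purifying selection."""
    if del_len > neutral_len:
        return 0.0
    inside = neutral_len - del_len + 1       # starts fully within the region
    overlapping = neutral_len + del_len - 1  # starts touching the region at all
    return inside / overlapping


def effective_indel_rates(neutral_len, ins_rate, del_rate, del_len):
    """Insertion rate is unchanged; deletion rate is scaled by survival."""
    return ins_rate, del_rate * deletion_survival(neutral_len, del_len)


# Raw deletions are twice as frequent as insertions in this toy setting.
for neutral_len in (6, 20, 1000):
    ins, dele = effective_indel_rates(
        neutral_len, ins_rate=1.0, del_rate=2.0, del_len=5
    )
    print(neutral_len, "effective deletion/insertion ratio:", round(dele / ins, 3))
```

Under these assumptions the effective ratio falls below 1 for the shortest region, giving a net insertion bias, and approaches the raw ratio of 2 as the neutral region grows.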


Subjects
Molecular Evolution, INDEL Mutation, Humans, Introns, Genome, Genomics
14.
Front Plant Sci ; 13: 1024405, 2022.
Article in English | MEDLINE | ID: mdl-36388586

ABSTRACT

Type III effectors are proteins injected by Gram-negative bacteria into eukaryotic hosts. In many plant and animal pathogens, these effectors manipulate host cellular processes to the benefit of the bacteria. Type III effectors are secreted by a type III secretion system that must "classify" each bacterial protein into one of two categories: either the protein should be translocated or not. It was previously shown that type III effectors have a secretion signal within their N-terminus; however, despite numerous efforts, the exact biochemical identity of this secretion signal is generally unknown. Computational characterization of the secretion signal is important for the identification of novel effectors and for better understanding the molecular translocation mechanism. In this work we developed novel machine-learning algorithms for characterizing the secretion signal in both plant and animal pathogens. Specifically, we represented each protein as a vector in high-dimensional space using Facebook's protein language model. Classification algorithms were next used to separate effectors from non-effector proteins. We subsequently curated a benchmark dataset of hundreds of effectors and thousands of non-effector proteins. We showed that on this curated dataset, our novel approach yielded substantially better classification accuracy compared to previously developed methodologies. We have also tested the hypothesis that plant and animal pathogen effectors are characterized by different secretion signals. Finally, we integrated the novel approach in Effectidor, a web server for predicting type III effector proteins, leading to a more accurate classification of effectors from non-effectors.

15.
Mol Biol Evol ; 39(11)2022 Nov 03.
Article in English | MEDLINE | ID: mdl-36282896

ABSTRACT

The inference of genome rearrangement events has been extensively studied, as they play a major role in molecular evolution. However, probabilistic evolutionary models that explicitly imitate the evolutionary dynamics of such events, as well as methods to infer model parameters, are yet to be fully utilized. Here, we developed a probabilistic approach to infer genome rearrangement rate parameters using an Approximate Bayesian Computation (ABC) framework. We developed two genome rearrangement models, a basic model, which accounts for genomic changes in gene order, and a more sophisticated one which also accounts for changes in chromosome number. We characterized the ABC inference accuracy using simulations and applied our methodology to both prokaryotic and eukaryotic empirical datasets. Knowledge of genome-rearrangement rates can help elucidate their role in evolution as well as help simulate genomes with evolutionary dynamics that reflect empirical genomes.


Subjects
Molecular Evolution, Genome, Bayes Theorem, Computer Simulation, Genomics
16.
J Mol Biol ; 434(11): 167538, 2022 Jun 15.
Article in English | MEDLINE | ID: mdl-35662466

ABSTRACT

Measuring evolutionary rates at the residue level is indispensable for gaining structural and functional insights into proteins. State-of-the-art tools for estimating rates take as input a large set of homologous proteins, a probabilistic model of evolution and a phylogenetic tree. However, a gap exists when only a few or no homologous proteins can be found, e.g., orphan proteins. In addition, such tools do not take the three-dimensional (3D) structure of the protein into account. The association between the 3D structure and site-specific rates can be learned using machine-learning regression tools from a cohort of proteins for which both the structure and a large set of homologs exist. Here we present EvoRator, a user-friendly web server that implements a machine-learning regression algorithm to predict site-specific evolutionary rates from protein structures. We show that EvoRator outperforms predictions obtained using traditional physicochemical features, such as relative solvent accessibility and weighted contact number. We also demonstrate the application of EvoRator in three common scenarios that arise in protein evolution research: (1) orphan proteins for which no (or few) homologs exist; (2) when homologous sequences exist, our algorithm contrasts structure-based estimates of the evolutionary rates with the phylogeny-based estimates, which allows detecting sites that are likely conserved due to functional rather than structural constraints; (3) algorithms that rely only on homologous sequences often fail to accurately measure the evolutionary rates of positions in gapped sequence alignments, which frequently occur as a result of a clade-specific insertion, whereas our algorithm makes use of training data and the known 3D structure of such gapped positions to predict their evolutionary rates. EvoRator is freely available for all users at: https://evorator.tau.ac.il/.


Subjects
Internet Use, Machine Learning, Protein Conformation, Proteins, Software, Algorithms, Humans, Phylogeny, Proteins/chemistry, Proteins/genetics, Sequence Alignment
17.
Bioinformatics ; 38(Suppl 1): i118-i124, 2022 Jun 24.
Article in English | MEDLINE | ID: mdl-35758778

ABSTRACT

MOTIVATION: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree. RESULTS: Here, we propose an artificial-intelligence-based approach, which provides means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running-time substantially while retaining the same tree-search performance. AVAILABILITY AND IMPLEMENTATION: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in section 3. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Artificial Intelligence, Software, Likelihood Functions, Phylogeny
18.
Methods Mol Biol ; 2427: 25-36, 2022.
Article in English | MEDLINE | ID: mdl-35619022

ABSTRACT

Various Gram-negative bacteria use secretion systems to secrete effector proteins that manipulate host biochemical pathways to their benefit. We and others have previously developed machine-learning algorithms to predict novel effectors. Specifically, given a set of known effectors and a set of known non-effectors, the machine-learning algorithm extracts features that distinguish these two protein groups. In the training phase, the algorithm learns how to best combine the features to separate the two groups. The trained model is then applied to open reading frames (ORFs) with unknown functions, resulting in a score for each ORF, which is its likelihood to be an effector. We developed Effectidor, a web server for predicting type III effectors. In this book chapter, we provide a step-by-step introduction to the application of Effectidor, from selecting input data to analyzing the obtained predictions.


Subjects
Bacterial Proteins, Machine Learning, Algorithms, Bacterial Proteins/metabolism, Gram-Negative Bacteria/metabolism
19.
Front Microbiol ; 13: 840308, 2022.
Article in English | MEDLINE | ID: mdl-35495725

ABSTRACT

The type VI secretion system (T6SS) present in many Gram-negative bacteria is a contact-dependent apparatus that can directly deliver secreted effectors or toxins into diverse neighboring cellular targets including both prokaryotic and eukaryotic organisms. Recent reverse genetics studies with T6 core gene loci have indicated the importance of functional T6SS toward overall competitive fitness in various pathogenic Xanthomonas spp. To understand the contribution of T6SS toward ecology and evolution of Xanthomonas spp., we explored the distribution of the three distinguishable T6SS clusters, i3*, i3***, and i4, in approximately 1,740 Xanthomonas genomes, along with their conservation, genetic organization, and their evolutionary patterns in this genus. Screening genomes for core genes of each T6 cluster indicated that 40% of the sequenced strains possess two T6 clusters, with combinations of i3*** and i3* or i3*** and i4. A few strains of Xanthomonas citri, Xanthomonas phaseoli, and Xanthomonas cissicola were the exception, possessing a unique combination of i3* and i4. The findings also indicated clade-specific distribution of T6SS clusters. Phylogenetic analysis demonstrated that T6SS clusters i3* and i3*** were probably acquired by the ancestor of the genus Xanthomonas, followed by gain or loss of individual clusters upon diversification into subsequent clades. T6 i4 cluster has been acquired in recent independent events by group 2 xanthomonads followed by its spread via horizontal dissemination across distinct clades across groups 1 and 2 xanthomonads. We also noted reshuffling of the entire core T6 loci, as well as T6SS spike complex components, hcp and vgrG, among different species. Our findings indicate that gain or loss events of specific T6SS clusters across Xanthomonas phylogeny have not been random.

20.
NAR Genom Bioinform ; 4(2): lqac025, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35402908

ABSTRACT

Conservation is a strong predictor for the pathogenicity of single-nucleotide variants (SNVs). However, some positions that present complex conservation patterns across vertebrates stray from this paradigm. Here, we analyzed the association between complex conservation patterns and the pathogenicity of SNVs in the 115 disease genes that had sufficient variant data. We show that conservation is not a one-rule-fits-all solution since its accuracy highly depends on the analyzed set of species and genes. For example, pairwise comparisons between the human and 99 vertebrate species showed that species differ in their ability to predict the clinical outcomes of variants among different genes using conservation. Furthermore, certain genes were less amenable to conservation-based variant prediction, while others demonstrated species that optimize prediction. These insights led to developing EvoDiagnostics, which uses the conservation against each species as a feature within a random-forest machine-learning classification algorithm. EvoDiagnostics outperformed traditional conservation algorithms, deep-learning-based methods and most ensemble tools in every prediction task, highlighting the strength of optimizing conservation analysis per species and per gene. Overall, we suggest a new and more biologically relevant approach for analyzing conservation, which improves prediction of variant pathogenicity.
