Results 1 - 20 of 5,690
1.
Genome Biol ; 25(1): 212, 2024 Aug 09.
Article in English | MEDLINE | ID: mdl-39123269

ABSTRACT

BACKGROUND: Spatial transcriptomics (ST) is advancing our understanding of complex tissues and organisms. However, building a robust clustering algorithm to define spatially coherent regions in a single tissue slice and aligning or integrating multiple tissue slices originating from diverse sources for essential downstream analyses remains challenging. Numerous clustering, alignment, and integration methods have been specifically designed for ST data by leveraging its spatial information. The absence of comprehensive benchmark studies complicates the selection of methods and future method development. RESULTS: In this study, we systematically benchmark a variety of state-of-the-art algorithms with a wide range of real and simulated datasets of varying sizes, technologies, species, and complexity. We analyze the strengths and weaknesses of each method using diverse quantitative and qualitative metrics and analyses, including eight metrics for spatial clustering accuracy and contiguity, uniform manifold approximation and projection visualization, layer-wise and spot-to-spot alignment accuracy, and 3D reconstruction, which are designed to assess method performance as well as data quality. The code used for evaluation is available on our GitHub. Additionally, we provide online notebook tutorials and documentation to facilitate the reproduction of all benchmarking results and to support the study of new methods and new datasets. CONCLUSIONS: Our analyses lead to comprehensive recommendations that cover multiple aspects, helping users to select optimal tools for their specific needs and guide future method development.


Subject(s)
Algorithms , Benchmarking , Cluster Analysis , Animals , Gene Expression Profiling/methods , Transcriptome , Humans , Software , Sequence Alignment/methods
2.
PLoS Comput Biol ; 20(8): e1012337, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39102450

ABSTRACT

A phylogenetic tree represents hypothesized evolutionary history for a set of taxa. Besides the branching patterns (i.e., tree topology), phylogenies contain information about the evolutionary distances (i.e., branch lengths) between all taxa in the tree, which include extant taxa (external nodes) and their last common ancestors (internal nodes). During phylogenetic tree inference, the branch lengths are typically co-estimated along with other phylogenetic parameters during tree topology space exploration. There are well-known regions of the branch length parameter space where accurate estimation of phylogenetic trees is especially difficult. Several novel studies have recently demonstrated that machine learning approaches have the potential to help solve phylogenetic problems with greater accuracy and computational efficiency. In this study, as a proof of concept, we sought to explore the possibility of machine learning models to predict branch lengths. To that end, we designed several deep learning frameworks to estimate branch lengths on fixed tree topologies from multiple sequence alignments or their representations. Our results show that deep learning methods can exhibit superior performance in some difficult regions of branch length parameter space. For example, in contrast to maximum likelihood inference, which is typically used for estimating branch lengths, deep learning methods are more efficient and accurate. In general, we find that our neural networks achieve similar accuracy to a Bayesian approach and are the best-performing methods when inferring long branches that are associated with distantly related taxa. Together, our findings represent a next step toward accurate, fast, and reliable phylogenetic inference with machine learning approaches.


Subject(s)
Computational Biology , Deep Learning , Phylogeny , Computational Biology/methods , Algorithms , Machine Learning , Models, Genetic , Sequence Alignment/methods , Evolution, Molecular , Likelihood Functions
3.
Bioinformatics ; 40(8)2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39137136

ABSTRACT

MOTIVATION: Nanopore sequencing current signal data can be 'basecalled' into sequence information or analysed directly, with the capacity to identify diverse molecular features, such as DNA/RNA base modifications and secondary structures. However, raw signal data is large and complex, and there is a need for improved visualization strategies to facilitate signal analysis, exploration and tool development. RESULTS: Squigualiser (Squiggle visualiser) is a toolkit for intuitive, interactive visualization of sequence-aligned signal data, which currently supports both DNA and RNA sequencing data from Oxford Nanopore Technologies instruments. Squigualiser is compatible with a wide range of alternative signal-alignment software packages and enables visualization of both signal-to-read and signal-to-reference aligned data at single-base resolution. Squigualiser generates an interactive signal browser view (HTML file), in which the user can navigate across a genome/transcriptome region and customize the display. Multiple independent reads are integrated into a 'signal pileup' format and different datasets can be displayed as parallel tracks. Although other methods exist, Squigualiser provides the community with a software package purpose-built for raw signal data visualization, incorporating a range of new and existing features into a unified platform. AVAILABILITY AND IMPLEMENTATION: Squigualiser is an open-source package under an MIT licence: https://github.com/hiruna72/squigualiser. The software was developed using Python 3.8 and can be installed with pip or bioconda or executed directly using prebuilt binaries provided with each release.


Subject(s)
Nanopore Sequencing , Software , Nanopore Sequencing/methods , Sequence Analysis, DNA/methods , Sequence Alignment/methods , Sequence Analysis, RNA/methods
4.
Bioinformatics ; 40(8)2024 Aug 02.
Article in English | MEDLINE | ID: mdl-39152995

ABSTRACT

MOTIVATION: Spaln is the earliest practical tool for self-sufficient genome mapping and spliced alignment of protein query sequences onto a mammalian-sized eukaryotic genomic sequence. However, its computational speed has become inadequate for the analysis of rapidly growing genomic and transcript sequence data. RESULTS: The dynamic programming calculation of Spaln has been sped up in two ways: (i) the introduction of the multi-intermediate unidirectional Hirschberg method and (ii) SIMD-based vectorization. The new version, Spaln3, is ∼7 times faster than the latest Spaln version 2, and its gene prediction accuracy is consistently higher than that of Miniprot. AVAILABILITY AND IMPLEMENTATION: https://github.com/ogotoh/spaln.


Subject(s)
Chromosome Mapping , Software , Chromosome Mapping/methods , Sequence Alignment/methods , RNA Splicing , Algorithms , Animals , Humans , Genome , Proteins/genetics , Proteins/chemistry , Genomics/methods
5.
J Comput Biol ; 31(7): 597-615, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38980804

ABSTRACT

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. 
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
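
The defining property described above, that removing a decycling set leaves the de Bruijn graph without cycles, can be checked by brute force for small k. The Python sketch below is illustrative only and is not taken from the mdsscope repository; the function name `is_decycling` and the exhaustive graph construction are assumptions made for this example, and the approach is only practical for small alphabets and small k.

```python
from itertools import product


def is_decycling(removed_kmers: set[str], k: int, alphabet: str = "ACGT") -> bool:
    """Return True if deleting `removed_kmers` leaves the order-k de Bruijn graph acyclic.

    Hypothetical helper for illustration. Nodes are all k-mers over `alphabet`;
    each node v has an edge to v[1:] + c for every symbol c.
    """
    nodes = {"".join(p) for p in product(alphabet, repeat=k)} - removed_kmers

    # Count incoming edges that stay inside the remaining node set.
    indegree = {v: 0 for v in nodes}
    for v in nodes:
        for c in alphabet:
            successor = v[1:] + c
            if successor in nodes:
                indegree[successor] += 1

    # Kahn's algorithm: peel off nodes with no incoming edge. If every node
    # can be peeled off, the remaining graph has no cycle.
    stack = [v for v, d in indegree.items() if d == 0]
    peeled = 0
    while stack:
        v = stack.pop()
        peeled += 1
        for c in alphabet:
            successor = v[1:] + c
            if successor in nodes:
                indegree[successor] -= 1
                if indegree[successor] == 0:
                    stack.append(successor)
    return peeled == len(nodes)
```

For the binary alphabet and k = 2, the graph has three cycles (the self-loops at 00 and 11, and the cycle 01 -> 10 -> 01), so `is_decycling({"00", "01", "11"}, 2, "01")` holds while `is_decycling({"00", "11"}, 2, "01")` does not.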


Subject(s)
Algorithms , Computational Biology , Software , Computational Biology/methods , Sequence Alignment/methods , Humans , Sequence Analysis, DNA/methods
6.
Bioinformatics ; 40(7)2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38960861

ABSTRACT

MOTIVATION: The alignment of sequencing reads is a critical step in the characterization of ancient genomes. However, reference bias and spurious mappings pose a significant challenge, particularly as cutting-edge wet lab methods generate datasets that push the boundaries of alignment tools. Reference bias occurs when reference alleles are favoured over alternative alleles during mapping, whereas spurious mappings stem from either contamination or when endogenous reads fail to align to their correct position. Previous work has shown that these phenomena are correlated with read length but a more thorough investigation of reference bias and spurious mappings for ancient DNA has been lacking. Here, we use a range of empirical and simulated palaeogenomic datasets to investigate the impacts of mapping tools, quality thresholds, and reference genome on mismatch rates across read lengths. RESULTS: For these analyses, we introduce AMBER, a new bioinformatics tool for assessing the quality of ancient DNA mapping directly from BAM-files and informing on reference bias, read length cut-offs and reference selection. AMBER rapidly and simultaneously computes the sequence read mapping bias in the form of the mismatch rates per read length, cytosine deamination profiles at both CpG and non-CpG sites, fragment length distributions, and genomic breadth and depth of coverage. Using AMBER, we find that mapping algorithms and quality threshold choices dictate reference bias and rates of spurious alignment at different read lengths in a predictable manner, suggesting that optimized mapping parameters for each read length will be a key step in alleviating reference bias and spurious mappings. AVAILABILITY AND IMPLEMENTATION: AMBER is available for noncommercial use on GitHub (https://github.com/tvandervalk/AMBER.git). Scripts used to generate and analyse simulated datasets are available on Github (https://github.com/sdolenz/refbias_scripts).
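
A mismatch rate per read length, of the kind described above, can be sketched in a few lines of Python. This is not AMBER's code: the input format (pairs of read length and mismatch count, however they were extracted from the alignments) and the function name `mismatch_rate_by_length` are hypothetical and serve only to illustrate the statistic.

```python
from collections import defaultdict


def mismatch_rate_by_length(reads):
    """Aggregate mismatches per aligned base, grouped by read length.

    `reads` is an iterable of (read_length, n_mismatches) pairs; this input
    format is an assumption made for the example.
    """
    total_bases = defaultdict(int)
    total_mismatches = defaultdict(int)
    for read_length, n_mismatches in reads:
        total_bases[read_length] += read_length
        total_mismatches[read_length] += n_mismatches
    # One rate per observed read length, in increasing length order.
    return {
        length: total_mismatches[length] / total_bases[length]
        for length in sorted(total_bases)
    }
```

Plotting these rates against read length is one way to look for the length-dependent bias the abstract refers to; which lengths to discard remains a judgment made on the resulting curve.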


Subject(s)
DNA, Ancient , Sequence Analysis, DNA , DNA, Ancient/analysis , Humans , Sequence Analysis, DNA/methods , Software , Animals , Sequence Alignment/methods , Computational Biology/methods , Algorithms
7.
Mol Biol Evol ; 41(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-39041199

ABSTRACT

The current trend in phylogenetic and evolutionary analyses predominantly relies on omic data. However, prior to core analyses, traditional methods typically involve intricate and time-consuming procedures, including assembly from high-throughput reads, decontamination, gene prediction, homology search, orthology assignment, multiple sequence alignment, and matrix trimming. Such processes significantly impede the efficiency of research when dealing with extensive data sets. In this study, we develop PhyloAln, a convenient reference-based tool capable of directly aligning high-throughput reads or complete sequences with existing alignments as a reference for phylogenetic and evolutionary analyses. Through testing with simulated data sets of species spanning the tree of life, PhyloAln demonstrates consistently robust performance compared with other reference-based tools across different data types, sequencing technologies, coverages, and species, with percent completeness and identity at least 50 percentage points higher in the alignments. Additionally, we validate the efficacy of PhyloAln in removing a minimum of 90% foreign and 70% cross-contamination issues, which are prevalent in sequencing data but often overlooked by other tools. Moreover, we showcase the broad applicability of PhyloAln by generating alignments (completeness mostly larger than 80%, identity larger than 90%) and reconstructing robust phylogenies using real data sets of transcriptomes of ladybird beetles, plastid genes of peppers, or ultraconserved elements of turtles. With these advantages, PhyloAln is expected to facilitate phylogenetic and evolutionary analyses in the omic era. The tool is accessible at https://github.com/huangyh45/PhyloAln.


Subject(s)
Phylogeny , Sequence Alignment , Software , Sequence Alignment/methods , High-Throughput Nucleotide Sequencing/methods , Animals , Evolution, Molecular
8.
BMC Bioinformatics ; 25(1): 247, 2024 Jul 29.
Article in English | MEDLINE | ID: mdl-39075359

ABSTRACT

BACKGROUND: Sequence alignment lies at the heart of genome sequence annotation. While the BLAST suite of alignment tools has long held an important role in alignment-based sequence database search, greater sensitivity is achieved through the use of profile hidden Markov models (pHMMs). Here, we describe an FPGA hardware accelerator, called HAVAC, that targets a key bottleneck step (SSV) in the analysis pipeline of the popular pHMM alignment tool, HMMER. RESULTS: The HAVAC kernel calculates the SSV matrix at 1739 GCUPS on a ∼$3000 Xilinx Alveo U50 FPGA accelerator card, ∼227× faster than the optimized SSV implementation in nhmmer. Accounting for PCI-e data transfer and data processing, HAVAC is 65× faster than nhmmer's SSV with one thread and 35× faster than nhmmer with four threads, and uses ∼31% of the energy of a traditional high-end Intel CPU. CONCLUSIONS: HAVAC demonstrates the potential offered by FPGA hardware accelerators to produce dramatic speed gains in sequence annotation and related bioinformatics applications. Because these computations are performed on a co-processor, the host CPU remains free to simultaneously compute other aspects of the analysis pipeline.
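
To make the SSV step and the GCUPS unit concrete, the Python sketch below scores the best single ungapped segment along any diagonal of a score matrix and converts a cell count into giga cell updates per second. It is a simplified illustration, not HAVAC's or HMMER's implementation: the function names are hypothetical, and the real filter also involves model entry and exit costs and saturating small-integer arithmetic, which are left out here.

```python
def best_ungapped_segment(scores):
    """Best score of one contiguous ungapped segment along any diagonal.

    `scores[i][j]` is the score for aligning model position i to sequence
    position j (an assumed input layout). Each cell extends the diagonal
    predecessor or restarts from zero.
    """
    n_cols = len(scores[0]) if scores else 0
    best = 0
    previous_row = [0] * (n_cols + 1)
    for row in scores:
        current_row = [0] * (n_cols + 1)
        for j, score in enumerate(row):
            # previous_row[j] holds the diagonal predecessor of cell (i, j).
            current_row[j + 1] = max(0, previous_row[j] + score)
            best = max(best, current_row[j + 1])
        previous_row = current_row
    return best


def gcups(model_length, sequence_length, seconds):
    """Giga cell updates per second for a full model-by-sequence matrix."""
    return model_length * sequence_length / seconds / 1e9
```

Every cell depends only on its diagonal neighbour, so cells within a row are independent of each other, which is what makes this recurrence attractive for wide parallel hardware.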


Subject(s)
Markov Chains , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Sequence Homology , Algorithms , Software
9.
BMC Bioinformatics ; 25(1): 238, 2024 Jul 13.
Article in English | MEDLINE | ID: mdl-39003441

ABSTRACT

MOTIVATION: Alignment of reads to a reference genome sequence is one of the key steps in the analysis of human whole-genome sequencing data obtained through Next-generation sequencing (NGS) technologies. The quality of the subsequent steps of the analysis, such as the results of clinical interpretation of genetic variants or the results of a genome-wide association study, depends on the correct identification of the position of the read as a result of its alignment. The amount of human NGS whole-genome sequencing data is constantly growing. There are a number of human genome sequencing projects worldwide that have resulted in the creation of large-scale databases of genetic variants of sequenced human genomes. Such information about known genetic variants can be used to improve the quality of alignment at the read alignment stage when analysing sequencing data obtained for a new individual, for example, by creating a genomic graph. While existing methods for aligning reads to a linear reference genome have high alignment speed, methods for aligning reads to a genomic graph have greater accuracy in variable regions of the genome. The development of a read alignment method that takes into account known genetic variants in the linear reference sequence index allows combining the advantages of both sets of methods. RESULTS: In this paper, we present the minimap2_index_modifier tool, which enables the construction of a modified index of a reference genome using known single nucleotide variants and insertions/deletions (indels) specific to a given human population. The use of the modified minimap2 index improves variant calling quality without modifying the bioinformatics pipeline and without significant additional computational overhead. 
Using the PrecisionFDA Truth Challenge V2 benchmark data (for HG002 short-read data aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14) it was demonstrated that the number of false negative genetic variants decreased by more than 9500, and the number of false positives decreased by more than 7000 when modifying the index with genetic variants from the Human Pangenome Reference Consortium.


Subject(s)
Genetic Variation , Genome, Human , Whole Genome Sequencing , Humans , Whole Genome Sequencing/methods , Genetic Variation/genetics , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide/genetics , Sequence Alignment/methods , Software , Algorithms , Genome-Wide Association Study/methods
10.
Nucleic Acids Res ; 52(15): 8717-8733, 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39011889

ABSTRACT

In biological sequence alignment, prevailing heuristic aligners achieve high-throughput by several approximation techniques, but at the cost of sacrificing the clarity of output criteria and creating complex parameter spaces. To surmount these challenges, we introduce 'SigAlign', a novel alignment algorithm that employs two explicit cutoffs for the results: minimum length and maximum penalty per length, alongside three affine gap penalties. Comparative analyses of SigAlign against leading database search tools (BLASTn, MMseqs2) and read mappers (BWA-MEM, bowtie2, HISAT2, minimap2) highlight its performance in read mapping and database searches. Our research demonstrates that SigAlign not only provides high sensitivity with a non-heuristic approach, but also surpasses the throughput of existing heuristic aligners, particularly for high-accuracy reads or genomes with few repetitive regions. As an open-source library, SigAlign is poised to become a foundational component to provide a transparent and customizable alignment process to new analytical algorithms, tools and pipelines in bioinformatics.
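
The two explicit cutoffs mentioned above (minimum length and maximum penalty per length) can be illustrated with a small Python sketch. It does not use the SigAlign library or its API. The operation string format ('=' match, 'X' mismatch, 'I' insertion, 'D' deletion), the default penalty values, the convention that a gap of n bases costs gap_open + n * gap_extend, and the choice to measure length as the number of alignment columns are all assumptions made for this example.

```python
from itertools import groupby


def alignment_penalty(ops, mismatch, gap_open, gap_extend):
    """Total penalty of an alignment written as a string of =, X, I, D."""
    penalty = 0
    for op, run in groupby(ops):
        run_length = sum(1 for _ in run)
        if op == "X":
            penalty += mismatch * run_length
        elif op in ("I", "D"):
            # Affine gap: one opening cost per run plus a per-base extension.
            penalty += gap_open + gap_extend * run_length
        elif op != "=":
            raise ValueError(f"unknown alignment operation: {op!r}")
    return penalty


def passes_cutoffs(ops, min_length, max_penalty_per_length,
                   mismatch=4, gap_open=6, gap_extend=2):
    """Accept an alignment only if it meets both explicit cutoffs."""
    length = len(ops)
    if length < min_length:
        return False
    penalty = alignment_penalty(ops, mismatch, gap_open, gap_extend)
    return penalty <= max_penalty_per_length * length
```

With the assumed penalties, the 16-column alignment "=====X====II====" has a penalty of 4 + (6 + 2 * 2) = 14, so it passes a cutoff of 1.0 per column and fails a cutoff of 0.5.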


Subject(s)
Algorithms , Sequence Alignment , Software , Sequence Alignment/methods , Humans , Computational Biology/methods , Sequence Analysis, DNA/methods
12.
Mol Biol Evol ; 41(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38869090

ABSTRACT

Sequence alignment is an essential method in bioinformatics and the basis of many analyses, including phylogenetic inference, ancestral sequence reconstruction, and gene annotation. Sequencing artifacts and errors made during genome assembly, such as abiological frameshifts and incorrect early stop codons, can impact downstream analyses leading to erroneous conclusions in comparative and functional genomic studies. More significantly, while indels can occur both within and between codons in natural sequences, most amino-acid- and codon-based aligners assume that indels only occur between codons. This mismatch between biology and alignment algorithms produces suboptimal alignments and errors in downstream analyses. To address these issues, we present COATi, a statistical, codon-aware pairwise aligner that supports complex insertion-deletion models and can handle artifacts present in genomic data. COATi allows users to reduce the amount of discarded data while generating more accurate sequence alignments. COATi can infer indels both within and between codons, leading to improved sequence alignments. We applied COATi to a dataset containing orthologous protein-coding sequences from humans and gorillas and conclude that 41% of indels occurred between codons, agreeing with previous work in other species. We also applied COATi to semiempirical benchmark alignments and find that it outperforms several popular alignment programs on several measures of alignment quality and accuracy.
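
The distinction drawn above between indels that fall within a codon and those that fall between codons comes down to the position of the event relative to the reading frame. The Python sketch below is not part of COATi; `classify_indel` is a hypothetical helper, and it assumes a 0-based coordinate in a coding sequence whose first nucleotide is in frame, where `start` is the first deleted base or the base before which an insertion occurs.

```python
def classify_indel(start: int, length: int) -> tuple[str, str]:
    """Describe an indel by codon placement and by its effect on the frame.

    `start` and `length` are in nucleotides; the coordinate convention is an
    assumption of this example.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # A codon boundary lies before every position divisible by three.
    placement = "between codons" if start % 3 == 0 else "within a codon"
    # Lengths that are multiples of three preserve the downstream frame.
    frame = "in-frame" if length % 3 == 0 else "frameshift"
    return placement, frame
```

For example, a 3-base deletion starting at position 7 is in-frame but begins inside a codon, a case that an aligner restricted to whole-codon gaps cannot represent exactly.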


Subject(s)
INDEL Mutation , Sequence Alignment , Sequence Alignment/methods , Humans , Animals , Software , Algorithms , Codon , Gorilla gorilla/genetics , Computational Biology/methods , Open Reading Frames , Phylogeny
13.
Bioinformatics ; 40(Supplement_1): i208-i217, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940166

ABSTRACT

MOTIVATION: Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance. RESULTS: Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets. AVAILABILITY AND IMPLEMENTATION: The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.


Subject(s)
Algorithms , Machine Learning , Phylogeny , Software , Sequence Alignment/methods , Computational Biology/methods , Likelihood Functions
14.
Bioinformatics ; 40(Supplement_1): i328-i336, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940160

ABSTRACT

SUMMARY: Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences. AVAILABILITY AND IMPLEMENTATION: The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.


Subject(s)
Algorithms , Markov Chains , Sequence Alignment , Software , Sequence Alignment/methods , Computational Biology/methods , Sequence Analysis, Protein/methods , Phylogeny , Proteins/chemistry
15.
Bioinformatics ; 40(Supplement_1): i337-i346, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940164

ABSTRACT

MOTIVATION: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCA's alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLA's runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. 
AVAILABILITY AND IMPLEMENTATION: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.


Subject(s)
Algorithms , Sequence Alignment , Sequence Alignment/methods , Software , Computational Biology/methods , Sequence Analysis, DNA/methods , Databases, Genetic
16.
BMC Bioinformatics ; 25(1): 219, 2024 Jun 19.
Article in English | MEDLINE | ID: mdl-38898394

ABSTRACT

BACKGROUND: With the surge in genomic data driven by advancements in sequencing technologies, the demand for efficient bioinformatics tools for sequence analysis has become paramount. BLAST-like alignment tool (BLAT), a sequence alignment tool, faces limitations in performance efficiency and integration with modern programming environments, particularly Python. This study introduces PxBLAT, a Python-based framework designed to enhance the capabilities of BLAT, focusing on usability, computational efficiency, and seamless integration within the Python ecosystem. RESULTS: PxBLAT demonstrates significant improvements over BLAT in execution speed and data handling, as evidenced by comprehensive benchmarks conducted across various sample groups ranging from 50 to 600 samples. These experiments highlight a notable speedup, reducing execution time compared to BLAT. The framework also introduces user-friendly features such as improved server management, data conversion utilities, and shell completion, enhancing the overall user experience. Additionally, the provision of extensive documentation and comprehensive testing supports community engagement and facilitates the adoption of PxBLAT. CONCLUSIONS: PxBLAT stands out as a robust alternative to BLAT, offering performance and user interaction enhancements. Its development underscores the potential for modern programming languages to improve bioinformatics tools, aligning with the needs of contemporary genomic research. By providing a more efficient, user-friendly tool, PxBLAT has the potential to impact genomic data analysis workflows, supporting faster and more accurate sequence analysis in a Python environment.


Subject(s)
Computational Biology , Sequence Alignment , Software , Computational Biology/methods , Sequence Alignment/methods , Programming Languages , Genomics/methods
17.
Mol Biol Evol ; 41(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38842253

ABSTRACT

Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modeling. Our indel-aware approach harnesses the phylogenetic signal from indels, including them in all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Due to its remarkable speed, our method is well suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by considering gaps as crucial evolutionary signals rather than mere artefacts.


Subject(s)
INDEL Mutation , Phylogeny , Sequence Alignment , Sequence Alignment/methods , Evolution, Molecular , Models, Genetic , Humans
18.
Proc Natl Acad Sci U S A ; 121(27): e2311887121, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38913900

ABSTRACT

Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves performance competitive with orthology-based pairing.


Subject(s)
Proteins , Sequence Alignment , Sequence Alignment/methods , Proteins/chemistry , Proteins/metabolism , Amino Acid Sequence , Algorithms , Protein Sequence Analysis/methods , Computational Biology/methods , Protein Databases
19.
Mol Biol Evol ; 41(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38917277

ABSTRACT

Phylogenetic methods are widely used to reconstruct the evolutionary relationships among species and individuals. However, recombination can obscure ancestral relationships as individuals may inherit different regions of their genome from different ancestors. It is, therefore, often necessary to detect recombination events, locate recombination breakpoints, and select recombination-free alignments prior to reconstructing phylogenetic trees. While many earlier studies have examined the power of different methods to detect recombination, very few have examined the ability of these methods to accurately locate recombination breakpoints. In this study, we simulated genome sequences based on ancestral recombination graphs and explored the accuracy of three popular recombination detection methods: MaxChi, 3SEQ, and Genetic Algorithm Recombination Detection. The accuracy of inferred breakpoint locations was evaluated along with the key factors contributing to variation in accuracy across datasets. While many different genomic features contribute to the variation in performance across methods, the number of informative sites consistent with the pattern of inheritance between parent and recombinant child sequences always has the greatest contribution to accuracy. While partitioning sequence alignments based on identified recombination breakpoints can greatly decrease phylogenetic error, the quality of phylogenetic reconstructions depends very little on how breakpoints are chosen to partition the alignment. Our work sheds light on how different features of recombinant genomes affect the performance of recombination detection methods and suggests best practices for reconstructing phylogenies based on recombination-free alignments.


Subject(s)
Algorithms , Phylogeny , Genetic Recombination , Chromosome Breakpoints , Sequence Alignment/methods , Genetic Models
20.
Mol Biol Evol ; 41(6)2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38860506

ABSTRACT

Phylogenetic inference based on protein sequence alignment is a widely used procedure. Numerous phylogenetic algorithms have been developed, most of which have many parameters and options. Choosing a program, options, and parameters can be a nontrivial task. No benchmark for comparison of phylogenetic programs on real protein sequences was publicly available. We have developed PhyloBench, a benchmark for evaluating the quality of phylogenetic inference, and used it to test a number of popular phylogenetic programs. PhyloBench is based on natural, not simulated, protein sequences of orthologous evolutionary domains. The measure of accuracy of an inferred tree is its distance to the corresponding species tree. A number of tree-to-tree distance measures were tested. The most reliable results were obtained using the Robinson-Foulds distance. Our results confirmed recent findings that distance methods are more accurate than maximum likelihood (ML) and maximum parsimony. We tested the Bayesian program MrBayes on natural protein sequences and found that, on our datasets, it performs better than ML, but worse than distance methods. Of the methods we tested, the Balanced Minimum Evolution method implemented in FastME yielded the best results on our material. Alignments and reference species trees are available at https://mouse.belozersky.msu.ru/tools/phylobench/ together with a web interface that allows for a semi-automatic comparison of a user's method with a number of popular programs.
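
The Robinson-Foulds distance used in the abstract above as the accuracy measure can be sketched in a few lines of Python. This is a minimal illustration, not PhyloBench's code: trees are assumed to be given as nested tuples of leaf-name strings over the same leaf set, and the helper names `leaf_set`, `splits`, and `rf_distance` are hypothetical. The distance is taken here as the unnormalized count of nontrivial bipartitions present in exactly one of the two trees.

```python
def leaf_set(node):
    """Collect the leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree, all_leaves):
    """Return the nontrivial bipartitions of a tree, treated as unrooted.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always gets the same representation.
    """
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = below if ref not in below else all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Count bipartitions found in exactly one of the two trees."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(tree_a, leaves) ^ splits(tree_b, leaves))


if __name__ == "__main__":
    inferred = (("A", "B"), ("C", "D"), "E")
    species = (("A", "B"), ("C", "E"), "D")
    print(rf_distance(inferred, species))
```

A distance of zero means the inferred tree has the same topology as the reference species tree; larger values indicate more conflicting splits.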


Subject(s)
Algorithms , Phylogeny , Software , Benchmarking , Sequence Alignment/methods , Bayes Theorem , Molecular Evolution , Computational Biology/methods