Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 5.674
Filtrar
Mais filtros

Intervalo de ano de publicação
1.
Cell ; 181(6): 1380-1394.e18, 2020 06 11.
Artigo em Inglês | MEDLINE | ID: mdl-32502392

RESUMO

Homologous recombination (HR) helps maintain genome integrity, and HR defects give rise to disease, especially cancer. During HR, damaged DNA must be aligned with an undamaged template through a process referred to as the homology search. Despite decades of study, key aspects of this search remain undefined. Here, we use single-molecule imaging to demonstrate that Rad54, a conserved Snf2-like protein found in all eukaryotes, switches the search from the diffusion-based pathways characteristic of the basal HR machinery to an active process in which DNA sequences are aligned via an ATP-dependent molecular motor-driven mechanism. We further demonstrate that Rad54 disrupts the donor template strands, enabling the search to take place within a migrating DNA bubble-like structure that is bound by replication protein A (RPA). Our results reveal that Rad54, working together with RPA, fundamentally alters how DNA sequences are aligned during HR.


Assuntos
Trifosfato de Adenosina/genética , DNA Helicases/genética , Enzimas Reparadoras do DNA/genética , DNA/genética , Recombinação Homóloga/genética , Proteínas de Saccharomyces cerevisiae/genética , Adenosina Trifosfatases/genética , Dano ao DNA/genética , Reparo do DNA/genética , Proteínas de Ligação a DNA/genética , Hidrólise , Saccharomyces cerevisiae/genética , Alinhamento de Sequência/métodos
2.
Nature ; 602(7895): 142-147, 2022 02.
Artigo em Inglês | MEDLINE | ID: mdl-35082445

RESUMO

Public databases contain a planetary collection of nucleic acid sequences, but their systematic exploration has been inhibited by a lack of efficient methods for searching this corpus, which (at the time of writing) exceeds 20 petabases and is growing exponentially1. Here we developed a cloud computing infrastructure, Serratus, to enable ultra-high-throughput sequence alignment at the petabase scale. We searched 5.7 million biologically diverse samples (10.2 petabases) for the hallmark gene RNA-dependent RNA polymerase and identified well over 105 novel RNA viruses, thereby expanding the number of known species by roughly an order of magnitude. We characterized novel viruses related to coronaviruses, hepatitis delta virus and huge phages, respectively, and analysed their environmental reservoirs. To catalyse the ongoing revolution of viral discovery, we established a free and comprehensive database of these data and tools. Expanding the known sequence diversity of viruses can reveal the evolutionary origins of emerging pathogens and improve pathogen surveillance for the anticipation and mitigation of future pandemics.


Assuntos
Computação em Nuvem , Bases de Dados Genéticas , Vírus de RNA/genética , Vírus de RNA/isolamento & purificação , Alinhamento de Sequência/métodos , Virologia/métodos , Viroma/genética , Animais , Arquivos , Bacteriófagos/enzimologia , Bacteriófagos/genética , Biodiversidade , Coronavirus/classificação , Coronavirus/enzimologia , Coronavirus/genética , Evolução Molecular , Vírus Delta da Hepatite/enzimologia , Vírus Delta da Hepatite/genética , Humanos , Modelos Moleculares , Vírus de RNA/classificação , Vírus de RNA/enzimologia , RNA Polimerase Dependente de RNA/química , RNA Polimerase Dependente de RNA/genética , Software
3.
PLoS Biol ; 22(5): e3002405, 2024 May.
Artigo em Inglês | MEDLINE | ID: mdl-38713717

RESUMO

We report a new visualization tool for analysis of whole-genome assembly-assembly alignments, the Comparative Genome Viewer (CGV) (https://ncbi.nlm.nih.gov/genome/cgv/). CGV visualizes pairwise same-species and cross-species alignments provided by National Center for Biotechnology Information (NCBI) using assembly alignment algorithms developed by us and others. Researchers can examine large structural differences spanning chromosomes, such as inversions or translocations. Users can also navigate to regions of interest, where they can detect and analyze smaller-scale deletions and rearrangements within specific chromosome or gene regions. RefSeq or user-provided gene annotation is displayed where available. CGV currently provides approximately 800 alignments from over 350 animal, plant, and fungal species. CGV and related NCBI viewers are undergoing active development to further meet needs of the research community in comparative genome visualization.


Assuntos
Genoma , Software , Animais , Genoma/genética , Alinhamento de Sequência/métodos , Genômica/métodos , Algoritmos , Estados Unidos , Humanos , Eucariotos/genética , Bases de Dados Genéticas , National Library of Medicine (U.S.) , Anotação de Sequência Molecular/métodos
4.
Proc Natl Acad Sci U S A ; 121(27): e2311887121, 2024 Jul 02.
Artigo em Inglês | MEDLINE | ID: mdl-38913900

RESUMO

Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves competitive performance with using orthology-based pairing.


Assuntos
Proteínas , Alinhamento de Sequência , Alinhamento de Sequência/métodos , Proteínas/química , Proteínas/metabolismo , Sequência de Aminoácidos , Algoritmos , Análise de Sequência de Proteína/métodos , Biologia Computacional/métodos , Bases de Dados de Proteínas
5.
PLoS Genet ; 20(4): e1011184, 2024 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-38683871

RESUMO

By decomposing genome sequences into k-mers, it is possible to estimate genome differences without alignment. Techniques such as k-mer minimisers, for example MinHash, have been developed and are often accurate approximations of distances based on full k-mer sets. These and other alignment-free methods avoid the large temporal and computational expense of alignment. However, these k-mer set comparisons are not entirely accurate within-species and can be completely inaccurate within-lineage. This is due, in part, to their inability to distinguish core polymorphism from accessory differences. Here we present a new approach, KmerAperture, which uses information on the k-mer relative genomic positions to determine the type of polymorphism causing differences in k-mer presence and absence between pairs of genomes. Single SNPs are expected to result in k unique contiguous k-mers per genome. On the other hand, contiguous series > k may be caused by accessory differences of length S-k+1; when the start and end of the sequence are contiguous with homologous sequence. Alternatively, they may be caused by multiple SNPs within k bp from each other and KmerAperture can determine whether that is the case. To demonstrate use cases KmerAperture was benchmarked using datasets including a very low diversity simulated population with accessory content independent from the number of SNPs, a simulated population where SNPs are spatially dense, a moderately diverse real cluster of genomes (Escherichia coli ST1193) with a large accessory genome and a low diversity real genome cluster (Salmonella Typhimurium ST34). We show that KmerAperture can accurately distinguish both core and accessory sequence diversity without alignment, outperforming other k-mer based tools.


Assuntos
Genoma Bacteriano , Polimorfismo de Nucleotídeo Único , Polimorfismo de Nucleotídeo Único/genética , Sintenia , Genômica/métodos , Algoritmos , Escherichia coli/genética , Software , Alinhamento de Sequência/métodos , Filogenia
6.
Brief Bioinform ; 25(3)2024 Mar 27.
Artigo em Inglês | MEDLINE | ID: mdl-38701418

RESUMO

Coverage quantification is required in many sequencing datasets within the field of genomics research. However, most existing tools fail to provide comprehensive statistical results and exhibit limited performance gains from multithreading. Here, we present PanDepth, an ultra-fast and efficient tool for calculating coverage and depth from sequencing alignments. PanDepth outperforms other tools in computation time and memory efficiency for both BAM and CRAM-format alignment files from sequencing data, regardless of read length. It employs chromosome parallel computation and optimized data structures, resulting in ultrafast computation speeds and memory efficiency. It accepts sorted or unsorted BAM and CRAM-format alignment files as well as GTF, GFF and BED-formatted interval files or a specific window size. When provided with a reference genome sequence and the option to enable GC content calculation, PanDepth includes GC content statistics, enhancing the accuracy and reliability of copy number variation analysis. Overall, PanDepth is a powerful tool that accelerates scientific discovery in genomics research.


Assuntos
Genômica , Software , Genômica/métodos , Humanos , Análise de Sequência de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Composição de Bases , Variações do Número de Cópias de DNA , Biologia Computacional/métodos , Algoritmos , Alinhamento de Sequência/métodos
7.
Brief Bioinform ; 25(3)2024 Mar 27.
Artigo em Inglês | MEDLINE | ID: mdl-38695119

RESUMO

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicate that the alignments produced using the new $E$-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various $E$-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.


Assuntos
Algoritmos , Biologia Computacional , Alinhamento de Sequência , Alinhamento de Sequência/métodos , Biologia Computacional/métodos , Software , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Proteínas/química , Proteínas/genética , Aprendizado Profundo , Bases de Dados de Proteínas
8.
Nature ; 587(7833): 246-251, 2020 11.
Artigo em Inglês | MEDLINE | ID: mdl-33177663

RESUMO

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies1-3. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database4 increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies5 are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus6, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.


Assuntos
Genoma/genética , Genômica/métodos , Alinhamento de Sequência/métodos , Software , Vertebrados/genética , Âmnio , Animais , Simulação por Computador , Genômica/normas , Haplótipos , Humanos , Controle de Qualidade , Alinhamento de Sequência/normas , Software/normas
9.
Mol Biol Evol ; 41(6)2024 Jun 01.
Artigo em Inglês | MEDLINE | ID: mdl-38860506

RESUMO

Phylogenetic inference based on protein sequence alignment is a widely used procedure. Numerous phylogenetic algorithms have been developed, most of which have many parameters and options. Choosing a program, options, and parameters can be a nontrivial task. No benchmark for comparison of phylogenetic programs on real protein sequences was publicly available. We have developed PhyloBench, a benchmark for evaluating the quality of phylogenetic inference, and used it to test a number of popular phylogenetic programs. PhyloBench is based on natural, not simulated, protein sequences of orthologous evolutionary domains. The measure of accuracy of an inferred tree is its distance to the corresponding species tree. A number of tree-to-tree distance measures were tested. The most reliable results were obtained using the Robinson-Foulds distance. Our results confirmed recent findings that distance methods are more accurate than maximum likelihood (ML) and maximum parsimony. We tested the bayesian program MrBayes on natural protein sequences and found that, on our datasets, it performs better than ML, but worse than distance methods. Of the methods we tested, the Balanced Minimum Evolution method implemented in FastME yielded the best results on our material. Alignments and reference species trees are available at https://mouse.belozersky.msu.ru/tools/phylobench/ together with a web-interface that allows for a semi-automatic comparison of a user's method with a number of popular programs.


Assuntos
Algoritmos , Filogenia , Software , Benchmarking , Alinhamento de Sequência/métodos , Teorema de Bayes , Evolução Molecular , Biologia Computacional/métodos
10.
Bioinformatics ; 40(5)2024 May 02.
Artigo em Inglês | MEDLINE | ID: mdl-38724243

RESUMO

MOTIVATION: Since 2016, the number of microbial species with available reference genomes in NCBI has more than tripled. Multiple genome alignment, the process of identifying nucleotides across multiple genomes which share a common ancestor, is used as the input to numerous downstream comparative analysis methods. Parsnp is one of the few multiple genome alignment methods able to scale to the current era of genomic data; however, there has been no major release since its initial release in 2014. RESULTS: To address this gap, we developed Parsnp v2, which significantly improves on its original release. Parsnp v2 provides users with more control over executions of the program, allowing Parsnp to be better tailored for different use-cases. We introduce a partitioning option to Parsnp, which allows the input to be broken up into multiple parallel alignment processes which are then combined into a final alignment. The partitioning option can reduce memory usage by over 4× and reduce runtime by over 2×, all while maintaining a precise core-genome alignment. The partitioning workflow is also less susceptible to complications caused by assembly artifacts and minor variation, as alignment anchors only need to be conserved within their partition and not across the entire input set. We highlight the performance on datasets involving thousands of bacterial and viral genomes. AVAILABILITY AND IMPLEMENTATION: Parsnp v2 is available at https://github.com/marbl/parsnp.


Assuntos
Genoma Bacteriano , Alinhamento de Sequência , Software , Alinhamento de Sequência/métodos , Genômica/métodos , Algoritmos
11.
Bioinformatics ; 40(4)2024 Mar 29.
Artigo em Inglês | MEDLINE | ID: mdl-38597887

RESUMO

MOTIVATION: Discovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data. RESULTS: We developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e. 0.76, 0.93, and 0.98 for 150-350 (Illumina short reads), 850-950 (Sanger sequencing data), and 2000-5000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to >0.98 when query sequences increased from 150-350 to >850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g. ∼1000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions. AVAILABILITY AND IMPLEMENTATION: www.dllab.org/software/VirusPredictor.html.


Assuntos
Genoma Viral , Software , Humanos , Vírus/genética , Análise de Sequência de DNA/métodos , Alinhamento de Sequência/métodos , Aprendizado de Máquina
12.
Bioinformatics ; 40(4)2024 Mar 29.
Artigo em Inglês | MEDLINE | ID: mdl-38608190

RESUMO

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.


Assuntos
Algoritmos , Biologia Computacional , Aprendizado Profundo , Processamento de Linguagem Natural , Biologia Computacional/métodos , Proteínas/química , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos
13.
Bioinformatics ; 40(5)2024 May 02.
Artigo em Inglês | MEDLINE | ID: mdl-38676570

RESUMO

MOTIVATION: Bacterial genomes present more variability than human genomes, which requires important adjustments in computational tools that are developed for human data. In particular, bacteria exhibit a mosaic structure due to homologous recombinations, but this fact is not sufficiently captured by standard read mappers that align against linear reference genomes. The recent introduction of pangenomics provides some insights in that context, as a pangenome graph can represent the variability within a species. However, the concept of sequence-to-graph alignment that captures the presence of recombinations has not been previously investigated. RESULTS: In this paper, we present the extension of the notion of sequence-to-graph alignment to a variation graph that incorporates a recombination, so that the latter are explicitly represented and evaluated in an alignment. Moreover, we present a dynamic programming approach for the special case where there is at most a recombination-we implement this case as RecGraph. From a modelling point of view, a recombination corresponds to identifying a new path of the variation graph, where the new arc is composed of two halves, each extracted from an original path, possibly joined by a new arc. Our experiments show that RecGraph accurately aligns simulated recombinant bacterial sequences that have at most a recombination, providing evidence for the presence of recombination events. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/AlgoLab/RecGraph.


Assuntos
Algoritmos , Genoma Bacteriano , Recombinação Genética , Alinhamento de Sequência , Alinhamento de Sequência/métodos , Humanos , Software , Análise de Sequência de DNA/métodos , Genômica/métodos
14.
Bioinformatics ; 40(6)2024 Jun 03.
Artigo em Inglês | MEDLINE | ID: mdl-38870521

RESUMO

MOTIVATION: Tools for pairwise alignments between 3D structures of proteins are of fundamental importance for structural biology and bioinformatics, enabling visual exploration of evolutionary and functional relationships. However, the absence of a user-friendly, browser-based tool for creating alignments and visualizing them at both 1D sequence and 3D structural levels makes this process unnecessarily cumbersome. RESULTS: We introduce a novel pairwise structure alignment tool (rcsb.org/alignment) that seamlessly integrates into the RCSB Protein Data Bank (RCSB PDB) research-focused RCSB.org web portal. Our tool and its underlying application programming interface (alignment.rcsb.org) empowers users to align several protein chains with a reference structure by providing access to established alignment algorithms (FATCAT, CE, TM-align, or Smith-Waterman 3D). The user-friendly interface simplifies parameter setup and input selection. Within seconds, our tool enables visualization of results in both sequence (1D) and structural (3D) perspectives through the RCSB PDB RCSB.org Sequence Annotations viewer and Mol* 3D viewer, respectively. Users can effortlessly compare structures deposited in the PDB archive alongside more than a million incorporated Computed Structure Models coming from the ModelArchive and AlphaFold DB. Moreover, this tool can be used to align custom structure data by providing a link/URL or uploading atomic coordinate files directly. Importantly, alignment results can be bookmarked and shared with collaborators. By bridging the gap between 1D sequence and 3D structures of proteins, our tool facilitates deeper understanding of complex evolutionary relationships among proteins through comprehensive sequence and structural analyses. AVAILABILITY AND IMPLEMENTATION: The alignment tool is part of the RCSB PDB research-focused RCSB.org web portal and available at rcsb.org/alignment. Programmatic access is available via alignment.rcsb.org. Frontend code has been published at github.com/rcsb/rcsb-pecos-app. Visualization is powered by the open-source Mol* viewer (github.com/molstar/molstar and github.com/molstar/rcsb-molstar) plus the Sequence Annotations in 3D Viewer (github.com/rcsb/rcsb-saguaro-3d).


Assuntos
Algoritmos , Bases de Dados de Proteínas , Proteínas , Alinhamento de Sequência , Software , Proteínas/química , Alinhamento de Sequência/métodos , Conformação Proteica , Interface Usuário-Computador , Biologia Computacional/métodos
15.
Bioinformatics ; 40(Supplement_1): i208-i217, 2024 Jun 28.
Artigo em Inglês | MEDLINE | ID: mdl-38940166

RESUMO

MOTIVATION: Currently used methods for estimating branch support in phylogenetic analyses often rely on the classic Felsenstein's bootstrap, parametric tests, or their approximations. As these branch support scores are widely used in phylogenetic analyses, having accurate, fast, and interpretable scores is of high importance. RESULTS: Here, we employed a data-driven approach to estimate branch support values with a probabilistic interpretation. To this end, we simulated thousands of realistic phylogenetic trees and the corresponding multiple sequence alignments. Each of the obtained alignments was used to infer the phylogeny using state-of-the-art phylogenetic inference software, which was then compared to the true tree. Using these extensive data, we trained machine-learning algorithms to estimate branch support values for each bipartition within the maximum-likelihood trees obtained by each software. Our results demonstrate that our model provides fast and more accurate probability-based branch support values than commonly used procedures. We demonstrate the applicability of our approach on empirical datasets. AVAILABILITY AND IMPLEMENTATION: The data supporting this work are available in the Figshare repository at https://doi.org/10.6084/m9.figshare.25050554.v1, and the underlying code is accessible via GitHub at https://github.com/noaeker/bootstrap_repo.


Assuntos
Algoritmos , Aprendizado de Máquina , Filogenia , Software , Alinhamento de Sequência/métodos , Biologia Computacional/métodos , Funções Verossimilhança
16.
Bioinformatics ; 40(Supplement_1): i337-i346, 2024 Jun 28.
Artigo em Inglês | MEDLINE | ID: mdl-38940164

RESUMO

MOTIVATION: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS: We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. AVAILABILITY AND IMPLEMENTATION: The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.


Assuntos
Algoritmos , Alinhamento de Sequência , Alinhamento de Sequência/métodos , Software , Biologia Computacional/métodos , Análise de Sequência de DNA/métodos , Bases de Dados Genéticas
17.
Bioinformatics ; 40(5)2024 May 02.
Artigo em Inglês | MEDLINE | ID: mdl-38648741

RESUMO

SUMMARY: SIMSApiper is a Nextflow pipeline that creates reliable, structure-informed MSAs of thousands of protein sequences faster than standard structure-based alignment methods. Structural information can be provided by the user or collected by the pipeline from online resources. Parallelization with sequence identity-based subsets can be activated to significantly speed up the alignment process. Finally, the number of gaps in the final alignment can be reduced by leveraging the position of conserved secondary structure elements. AVAILABILITY AND IMPLEMENTATION: The pipeline is implemented using Nextflow, Python3, and Bash. It is publicly available on github.com/Bio2Byte/simsapiper.


Assuntos
Proteínas , Alinhamento de Sequência , Análise de Sequência de Proteína , Software , Proteínas/química , Alinhamento de Sequência/métodos , Análise de Sequência de Proteína/métodos , Algoritmos , Sequência de Aminoácidos , Biologia Computacional/métodos , Bases de Dados de Proteínas
18.
Bioinformatics ; 40(Supplement_1): i328-i336, 2024 Jun 28.
Artigo em Inglês | MEDLINE | ID: mdl-38940160

RESUMO

SUMMARY: Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences. AVAILABILITY AND IMPLEMENTATION: The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.


Assuntos
Algoritmos , Cadeias de Markov , Alinhamento de Sequência , Software , Alinhamento de Sequência/métodos , Biologia Computacional/métodos , Análise de Sequência de Proteína/métodos , Filogenia , Proteínas/química
19.
PLoS Comput Biol ; 20(4): e1011995, 2024 Apr.
Artigo em Inglês | MEDLINE | ID: mdl-38656999

RESUMO

Genomes contain conserved non-coding sequences that perform important biological functions, such as gene regulation. We present a phylogenetic method, PhyloAcc-C, that associates nucleotide substitution rates with changes in a continuous trait of interest. The method takes as input a multiple sequence alignment of conserved elements, continuous trait data observed in extant species, and a background phylogeny and substitution process. Gibbs sampling is used to assign rate categories (background, conserved, accelerated) to lineages and explore whether the assigned rate categories are associated with increases or decreases in the rate of trait evolution. We test our method using simulations and then illustrate its application using mammalian body size and lifespan data previously analyzed with respect to protein coding genes. Like other studies, we find processes such as tumor suppression, telomere maintenance, and p53 regulation to be related to changes in longevity and body size. In addition, we also find that skeletal genes, and developmental processes, such as sprouting angiogenesis, are relevant.


Assuntos
Evolução Molecular , Modelos Genéticos , Filogenia , Animais , Longevidade/genética , Humanos , Biologia Computacional/métodos , Simulação por Computador , Tamanho Corporal/genética , Nucleotídeos/genética , Alinhamento de Sequência/métodos
20.
Nucleic Acids Res ; 51(9): e53, 2023 05 22.
Artigo em Inglês | MEDLINE | ID: mdl-36987885

RESUMO

The functions of non-coding RNAs usually depend on their 3D structures. Therefore, comparing RNA 3D structures is critical in analyzing their functions. We noticed an interesting phenomenon that two non-coding RNAs may share similar substructures when rotating their sequence order. To the best of our knowledge, no existing RNA 3D structural alignment tools can detect this type of matching. In this article, we defined the RNA 3D structure circular matching problem and developed a software tool named CircularSTAR3D to solve this problem. CircularSTAR3D first uses the conserved stacks (consecutive base pairs with similar 3D structures) in the input RNAs to identify the circular matched internal loops and multiloops. Then it performs a local extension iteratively to obtain the whole circular matched substructures. The computational experiments conducted on a non-redundant RNA structure dataset show that circular matching is ubiquitous. Furthermore, we demonstrated the utility of CircularSTAR3D by detecting the conserved substructures missed by regular alignment tools, including structural motifs and conserved structures between riboswitches and ribozymes from different classes. We anticipate CircularSTAR3D to be a valuable supplement to the existing RNA 3D structural analysis techniques.


Assuntos
Conformação de Ácido Nucleico , RNA , Alinhamento de Sequência , Análise de Sequência de RNA , Software , Algoritmos , Pareamento de Bases , RNA/genética , RNA/química , Alinhamento de Sequência/métodos , Análise de Sequência de RNA/métodos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA