Results 1 - 20 of 5,695
1.
Bioinformatics ; 40(Suppl 2): ii79-ii86, 2024 09 01.
Article in English | MEDLINE | ID: mdl-39230690

ABSTRACT

MOTIVATION: For aligning large numbers of protein sequences, the predominant tools decide whether to align two residues using only simple prior knowledge, e.g. amino acid substitution matrices, and only part of the available data. The accuracy of state-of-the-art programs declines with decreasing sequence identity and as increasingly large numbers of sequences are aligned. Recently, transformer-based deep-learning models have started to harness the vast amount of protein sequence data, resulting in powerful pretrained language models whose main purpose is to generate high-dimensional numerical representations (embeddings) of individual sites that agglomerate evolutionary, structural, and biophysical information. RESULTS: We extend the traditional profile hidden Markov model so that it takes as inputs unaligned protein sequences and the corresponding embeddings. We fit the model with gradient descent using our existing differentiable hidden Markov layer. All sequences and their embeddings are jointly aligned to a model of the protein family. We report that our upgraded HMM-based aligner, learnMSA2, combined with the ProtT5-XL protein language model, aligns on average almost 6 percentage points more columns correctly than the best amino acid-based competitor and scales well with sequence number. The relative advantage of learnMSA2 over other programs tends to be greater when the sequence identity is lower and when the number of sequences is larger. Our results strengthen the evidence on the rich information contained in protein language models' embeddings and their potential downstream impact on the field of bioinformatics. AVAILABILITY AND IMPLEMENTATION: https://github.com/Gaius-Augustus/learnMSA, PyPI and Bioconda; evaluation: https://github.com/felbecker/snakeMSA.


Subject(s)
Markov Chains, Proteins, Sequence Alignment, Protein Sequence Analysis, Sequence Alignment/methods, Proteins/chemistry, Protein Sequence Analysis/methods, Software, Deep Learning, Algorithms, Computational Biology/methods, Amino Acid Sequence
2.
PLoS One ; 19(9): e0306480, 2024.
Article in English | MEDLINE | ID: mdl-39264950

ABSTRACT

With the rapid development of biotechnology, gene sequencing methods have steadily improved, and the structure of the resulting gene sequences has become more complex. Traditional sequence alignment methods struggle with such complex gene sequences. To improve the efficiency of gene sequence analysis, this study builds a sequence alignment analysis model on the D2 family of k-mer statistics. According to the structure of the foreground sequence, the sequence to be aligned is cut at different lengths and divided into multiple subsequences. For the selected subsequences, the maximum dissimilarity among the alignment results is then taken as the statistical result. The study also designed an application system for sequence alignment analysis based on the model. Experimental results showed that the statistical power of the model was directly proportional to the sequence coverage and cutting length, and inversely proportional to the K value and module length. When the model was deployed in the designed system, the maximum storage capacity was 71 GB, the maximum disk capacity was 135 GB, and the running time was less than 2.0 s. The k-mer statistic sequence alignment model and system proposed in this study therefore have considerable application value in gene alignment analysis.
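
To make the k-mer statistic concrete, here is a minimal Python sketch of the basic D2 statistic (the sum, over all k-mers, of the product of their counts in the two sequences), plus a hypothetical helper that cuts one sequence into fixed-length pieces and scores each piece. The function names, the fixed-length cutting rule, and the toy sequences are assumptions for illustration; the abstract does not specify the study's exact variant or cutting scheme.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Basic D2 statistic: sum over k-mers of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers that are absent from seq_b.
    return sum(n * counts_b[w] for w, n in counts_a.items())


def chunked_d2(query, target, k, chunk_len):
    """Hypothetical helper: cut target into fixed-length pieces and
    score each piece against the query (an assumed cutting rule)."""
    chunks = [target[i:i + chunk_len] for i in range(0, len(target), chunk_len)]
    return [d2(query, chunk, k) for chunk in chunks]


scores = chunked_d2("ACG", "ACGTTTTT", k=2, chunk_len=4)
```

A higher D2 value means more shared k-mer content, so a dissimilarity measure derived from it would be lowest for the piece that best matches the query.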


Subject(s)
Sequence Alignment, Sequence Alignment/methods, Algorithms, DNA Sequence Analysis/methods, Genetic Models, Statistical Models, Computational Biology/methods
3.
Sci Rep ; 14(1): 20692, 2024 09 05.
Article in English | MEDLINE | ID: mdl-39237735

ABSTRACT

Embeddings from protein Language Models (pLMs) are replacing evolutionary information from multiple sequence alignments (MSAs) as the most successful input for protein prediction. Is this because embeddings capture evolutionary information? We tested various approaches to explicitly incorporate evolutionary information into embeddings on various protein prediction tasks. While older pLMs (SeqVec, ProtBert) significantly improved through MSAs, the more recent pLM ProtT5 did not benefit. For most tasks, pLM-based outperformed MSA-based methods, and the combination of both even decreased performance for some (intrinsic disorder). We highlight the effectiveness of pLM-based methods and find limited benefits from integrating MSAs.


Subject(s)
Molecular Evolution, Proteins, Sequence Alignment, Proteins/metabolism, Proteins/genetics, Proteins/chemistry, Sequence Alignment/methods, Computational Biology/methods, Algorithms, Software, Protein Sequence Analysis/methods
4.
Genome Biol ; 25(1): 212, 2024 Aug 09.
Article in English | MEDLINE | ID: mdl-39123269

ABSTRACT

BACKGROUND: Spatial transcriptomics (ST) is advancing our understanding of complex tissues and organisms. However, building a robust clustering algorithm to define spatially coherent regions in a single tissue slice and aligning or integrating multiple tissue slices originating from diverse sources for essential downstream analyses remains challenging. Numerous clustering, alignment, and integration methods have been specifically designed for ST data by leveraging its spatial information. The absence of comprehensive benchmark studies complicates the selection of methods and future method development. RESULTS: In this study, we systematically benchmark a variety of state-of-the-art algorithms with a wide range of real and simulated datasets of varying sizes, technologies, species, and complexity. We analyze the strengths and weaknesses of each method using diverse quantitative and qualitative metrics and analyses, including eight metrics for spatial clustering accuracy and contiguity, uniform manifold approximation and projection visualization, layer-wise and spot-to-spot alignment accuracy, and 3D reconstruction, which are designed to assess method performance as well as data quality. The code used for evaluation is available on our GitHub. Additionally, we provide online notebook tutorials and documentation to facilitate the reproduction of all benchmarking results and to support the study of new methods and new datasets. CONCLUSIONS: Our analyses lead to comprehensive recommendations that cover multiple aspects, helping users to select optimal tools for their specific needs and guide future method development.


Subject(s)
Algorithms, Benchmarking, Cluster Analysis, Animals, Gene Expression Profiling/methods, Transcriptome, Humans, Software, Sequence Alignment/methods
5.
PLoS Comput Biol ; 20(8): e1012337, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39102450

ABSTRACT

A phylogenetic tree represents hypothesized evolutionary history for a set of taxa. Besides the branching patterns (i.e., tree topology), phylogenies contain information about the evolutionary distances (i.e., branch lengths) between all taxa in the tree, which include extant taxa (external nodes) and their last common ancestors (internal nodes). During phylogenetic tree inference, the branch lengths are typically co-estimated along with other phylogenetic parameters during tree topology space exploration. There are well-known regions of the branch length parameter space where accurate estimation of phylogenetic trees is especially difficult. Several novel studies have recently demonstrated that machine learning approaches have the potential to help solve phylogenetic problems with greater accuracy and computational efficiency. In this study, as a proof of concept, we sought to explore the possibility of machine learning models to predict branch lengths. To that end, we designed several deep learning frameworks to estimate branch lengths on fixed tree topologies from multiple sequence alignments or their representations. Our results show that deep learning methods can exhibit superior performance in some difficult regions of branch length parameter space. For example, in contrast to maximum likelihood inference, which is typically used for estimating branch lengths, deep learning methods are more efficient and accurate. In general, we find that our neural networks achieve similar accuracy to a Bayesian approach and are the best-performing methods when inferring long branches that are associated with distantly related taxa. Together, our findings represent a next step toward accurate, fast, and reliable phylogenetic inference with machine learning approaches.


Subject(s)
Computational Biology, Deep Learning, Computer Neural Networks, Phylogeny, Computational Biology/methods, Algorithms, Machine Learning, Genetic Models, Sequence Alignment/methods, Molecular Evolution, Likelihood Functions
6.
Bioinformatics ; 40(8)2024 08 02.
Article in English | MEDLINE | ID: mdl-39152995

ABSTRACT

MOTIVATION: Spaln is the earliest practical tool for self-sufficient genome mapping and spliced alignment of protein query sequences onto a mammalian-sized eukaryotic genomic sequence. However, its computational speed has become inadequate for the analysis of rapidly growing genomic and transcript sequence data. RESULTS: The dynamic programming calculation of Spaln has been sped up in two ways: (i) the introduction of the multi-intermediate unidirectional Hirschberg method and (ii) SIMD-based vectorization. The new version, Spaln3, is ∼7 times faster than the latest Spaln version 2, and its gene prediction accuracy is consistently higher than that of Miniprot. AVAILABILITY AND IMPLEMENTATION: https://github.com/ogotoh/spaln.


Subject(s)
Chromosome Mapping, Software, Chromosome Mapping/methods, Sequence Alignment/methods, RNA Splicing, Algorithms, Animals, Humans, Genome, Proteins/genetics, Proteins/chemistry, Genomics/methods
7.
Article in English | MEDLINE | ID: mdl-39209796

ABSTRACT

Increasing the accuracy of nucleotide sequence alignment is an essential issue in genomics research. Although classic dynamic programming (DP) algorithms (e.g., Smith-Waterman and Needleman-Wunsch) are guaranteed to produce the optimal result, their time complexity hinders their application to large-scale sequence alignment. Efforts to accelerate the alignment process generally take one of three perspectives: redesigning data structures [e.g., diagonal or striped Single Instruction Multiple Data (SIMD) implementations], increasing the degree of parallelism in SIMD operations (e.g., difference recurrence relation), or reducing the search space (e.g., banded DP). However, no existing method combines all three to build an ultra-fast algorithm. In this study, we developed a Banded Striped Aligner (BSAlign) library that delivers accurate alignment results at ultra-fast speed by combining a series of novel methods that exploit all three perspectives, with highlights such as an active F-loop in striped vectorization and a striped move in banded DP. We applied our new acceleration design to both regular and edit distance pairwise alignment. BSAlign achieved a 2-fold speed-up over other SIMD-based implementations for regular pairwise alignment, and a 1.5-fold to 4-fold speed-up in edit distance-based implementations for long reads. BSAlign is implemented in the C programming language and is available at https://github.com/ruanjue/bsalign.
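
Of the three acceleration ideas listed in this abstract, search-space reduction is the easiest to show in a few lines. The following is a minimal scalar Python sketch of banded edit distance, in which only cells within `band` of the main diagonal are filled. It is not BSAlign's implementation (which is written in C and also uses striped SIMD vectorization); the function name and the choice to return `None` when the length difference exceeds the band are assumptions made for this example.

```python
def banded_edit_distance(a, b, band):
    """Edit distance restricted to cells with |i - j| <= band.

    The result is exact whenever it is <= band; otherwise it is only
    the cost of the best alignment that stays inside the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None  # the end cell lies outside the band
    inf = float("inf")
    # Row 0: only columns inside the band are reachable.
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        if lo == 0:
            cur[0] = i  # first column: i deletions
        for j in range(max(1, lo), hi + 1):
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]


distance = banded_edit_distance("kitten", "sitting", band=3)
```

The work per row drops from m cells to at most 2 * band + 1 cells, which is where the speed-up of banded DP comes from.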


Subject(s)
Algorithms, Sequence Alignment, Software, Sequence Alignment/methods, Sequence Alignment/statistics & numerical data, DNA Sequence Analysis/methods, Gene Library, Computational Biology/methods, Base Sequence/genetics
8.
J Bioinform Comput Biol ; 22(4): 2450019, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39215522

ABSTRACT

A sequence graph represents the genetic variation of a pan-genome more concisely and space-efficiently than multiple linear reference genomes. To accelerate the alignment of reads to the graph, an index of the graph-based reference genome is used to obtain candidate locations. However, the potential combinatorial explosion of nodes in the sequence graph considerably increases the index size and the peak memory usage of the alignment process, especially for large-scale datasets. Existing methods typically address this by pruning complex regions or extending the seed length, which sacrifices alignment recall while reducing space usage only slightly. We present the Sparse-index of Graph (SIG) and the alignment algorithm SIG-Aligner, which index and align at a lower memory cost. SIG builds a non-overlapping minimizer index inside the nodes of the sequence graph, and SIG-Aligner filters out most false-positive matches with a method based on the pigeonhole principle. In computational experiments on human pan-genome graphs, SIG reduces index memory by 50% to 75% compared to Giraffe, while preserving superior or comparable alignment accuracy and faster alignment time.
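
The pigeonhole principle mentioned in this abstract can be illustrated on a plain linear reference: if a read aligns with at most e errors, then cutting it into e + 1 non-overlapping pieces guarantees that at least one piece matches exactly. The Python sketch below is a generic illustration under that assumption, not SIG-Aligner's graph-based filter; all function names and the toy sequences are hypothetical.

```python
def pigeonhole_seeds(read, max_errors):
    """Split read into max_errors + 1 non-overlapping pieces.

    Returns (offset_in_read, piece) pairs.
    """
    parts = max_errors + 1
    if len(read) < parts:
        raise ValueError("read is too short to split into non-empty pieces")
    base, extra = divmod(len(read), parts)
    seeds, start = [], 0
    for p in range(parts):
        length = base + (1 if p < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def candidate_positions(read, reference, max_errors):
    """Reference start positions supported by at least one exact seed hit."""
    candidates = set()
    for offset, seed in pigeonhole_seeds(read, max_errors):
        pos = reference.find(seed)
        while pos != -1:
            # Project the seed hit back to the implied start of the read.
            candidates.add(max(0, pos - offset))
            pos = reference.find(seed, pos + 1)
    return sorted(candidates)


reference = "GGGGACGTTGCAGGGG"
read = "ACGTTGGA"  # one mismatch against reference[4:12]
hits = candidate_positions(read, reference, max_errors=1)
```

Positions with no exact seed hit cannot host an alignment within the error budget, so they can be discarded before any expensive alignment is attempted.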


Subject(s)
Algorithms, Sequence Alignment, DNA Sequence Analysis, Humans, Sequence Alignment/methods, Sequence Alignment/statistics & numerical data, DNA Sequence Analysis/methods, DNA Sequence Analysis/statistics & numerical data, Human Genome, Software, Genomics/methods, Genomics/statistics & numerical data, Genome
9.
Bioinformatics ; 40(8)2024 08 02.
Article in English | MEDLINE | ID: mdl-39137136

ABSTRACT

MOTIVATION: Nanopore sequencing current signal data can be 'basecalled' into sequence information or analysed directly, with the capacity to identify diverse molecular features, such as DNA/RNA base modifications and secondary structures. However, raw signal data is large and complex, and there is a need for improved visualization strategies to facilitate signal analysis, exploration and tool development. RESULTS: Squigualiser (Squiggle visualiser) is a toolkit for intuitive, interactive visualization of sequence-aligned signal data, which currently supports both DNA and RNA sequencing data from Oxford Nanopore Technologies instruments. Squigualiser is compatible with a wide range of alternative signal-alignment software packages and enables visualization of both signal-to-read and signal-to-reference aligned data at single-base resolution. Squigualiser generates an interactive signal browser view (HTML file), in which the user can navigate across a genome/transcriptome region and customize the display. Multiple independent reads are integrated into a 'signal pileup' format and different datasets can be displayed as parallel tracks. Although other methods exist, Squigualiser provides the community with a software package purpose-built for raw signal data visualization, incorporating a range of new and existing features into a unified platform. AVAILABILITY AND IMPLEMENTATION: Squigualiser is an open-source package under an MIT licence: https://github.com/hiruna72/squigualiser. The software was developed using Python 3.8 and can be installed with pip or bioconda or executed directly using prebuilt binaries provided with each release.


Subject(s)
Nanopore Sequencing, Software, Nanopore Sequencing/methods, DNA Sequence Analysis/methods, Sequence Alignment/methods, RNA Sequence Analysis/methods
10.
J Mol Biol ; 436(17): 168694, 2024 Sep 01.
Article in English | MEDLINE | ID: mdl-38971557

ABSTRACT

Predicting the consensus structure of a set of aligned RNA homologs is a convenient method to find conserved structures in an RNA genome, which has many applications including viral diagnostics and therapeutics. However, the most commonly used tool for this task, RNAalifold, is prohibitively slow for long sequences, due to a cubic scaling with the sequence length, taking over a day on 400 SARS-CoV-2 and SARS-related genomes (∼30,000nt). We present LinearAlifold, a much faster alternative that scales linearly with both the sequence length and the number of sequences, based on our work LinearFold that folds a single RNA in linear time. Our work is orders of magnitude faster than RNAalifold (0.7 h on the above 400 genomes, or ∼36× speedup) and achieves higher accuracies when compared to a database of known structures. More interestingly, LinearAlifold's prediction on SARS-CoV-2 correlates well with experimentally determined structures, substantially outperforming RNAalifold. Finally, LinearAlifold supports two energy models (Vienna and BL*) and four modes: minimum free energy (MFE), maximum expected accuracy (MEA), ThreshKnot, and stochastic sampling, each of which takes under an hour for hundreds of SARS-CoV variants. Our resource is at: https://github.com/LinearFold/LinearAlifold (code) and http://linearfold.org/linear-alifold (server).


Subject(s)
COVID-19, Nucleic Acid Conformation, Viral RNA, SARS-CoV-2, Sequence Alignment, SARS-CoV-2/genetics, Viral RNA/genetics, Viral RNA/chemistry, COVID-19/virology, Sequence Alignment/methods, Viral Genome, Software, Computational Biology/methods, Humans, RNA Sequence Analysis/methods, Algorithms, RNA/chemistry
11.
Bioinformatics ; 40(7)2024 07 01.
Article in English | MEDLINE | ID: mdl-38960861

ABSTRACT

MOTIVATION: The alignment of sequencing reads is a critical step in the characterization of ancient genomes. However, reference bias and spurious mappings pose a significant challenge, particularly as cutting-edge wet lab methods generate datasets that push the boundaries of alignment tools. Reference bias occurs when reference alleles are favoured over alternative alleles during mapping, whereas spurious mappings stem from either contamination or when endogenous reads fail to align to their correct position. Previous work has shown that these phenomena are correlated with read length but a more thorough investigation of reference bias and spurious mappings for ancient DNA has been lacking. Here, we use a range of empirical and simulated palaeogenomic datasets to investigate the impacts of mapping tools, quality thresholds, and reference genome on mismatch rates across read lengths. RESULTS: For these analyses, we introduce AMBER, a new bioinformatics tool for assessing the quality of ancient DNA mapping directly from BAM-files and informing on reference bias, read length cut-offs and reference selection. AMBER rapidly and simultaneously computes the sequence read mapping bias in the form of the mismatch rates per read length, cytosine deamination profiles at both CpG and non-CpG sites, fragment length distributions, and genomic breadth and depth of coverage. Using AMBER, we find that mapping algorithms and quality threshold choices dictate reference bias and rates of spurious alignment at different read lengths in a predictable manner, suggesting that optimized mapping parameters for each read length will be a key step in alleviating reference bias and spurious mappings. AVAILABILITY AND IMPLEMENTATION: AMBER is available for noncommercial use on GitHub (https://github.com/tvandervalk/AMBER.git). Scripts used to generate and analyse simulated datasets are available on Github (https://github.com/sdolenz/refbias_scripts).


Subject(s)
Ancient DNA, DNA Sequence Analysis, Ancient DNA/analysis, Humans, DNA Sequence Analysis/methods, Software, Animals, Sequence Alignment/methods, Computational Biology/methods, Algorithms
12.
BMC Bioinformatics ; 25(1): 247, 2024 Jul 29.
Article in English | MEDLINE | ID: mdl-39075359

ABSTRACT

BACKGROUND: Sequence alignment lies at the heart of genome sequence annotation. While the BLAST suite of alignment tools has long held an important role in alignment-based sequence database search, greater sensitivity is achieved through the use of profile hidden Markov models (pHMMs). Here, we describe an FPGA hardware accelerator, called HAVAC, that targets a key bottleneck step (SSV) in the analysis pipeline of the popular pHMM alignment tool, HMMER. RESULTS: The HAVAC kernel calculates the SSV matrix at 1739 GCUPS on a ∼$3000 Xilinx Alveo U50 FPGA accelerator card, ∼227× faster than the optimized SSV implementation in nhmmer. Accounting for PCI-e data transfer and data processing, HAVAC is 65× faster than nhmmer's SSV with one thread and 35× faster than nhmmer with four threads, and uses ∼31% of the energy of a traditional high-end Intel CPU. CONCLUSIONS: HAVAC demonstrates the potential offered by FPGA hardware accelerators to produce dramatic speed gains in sequence annotation and related bioinformatics applications. Because these computations are performed on a co-processor, the host CPU remains free to simultaneously compute other aspects of the analysis pipeline.
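
For readers unfamiliar with the SSV step, the Python sketch below shows the underlying idea in its simplest form: scoring the best single ungapped local segment between a profile and a sequence by filling a matrix along diagonals. It is a simplified illustration, not HMMER's or HAVAC's implementation, which work on reduced-precision log-odds scores with SIMD or FPGA parallelism. The function name, the dictionary-based score layout, and the toy scores are assumptions. Each inner-loop step is one "cell update", the unit counted by the GCUPS figure in the abstract.

```python
def best_ungapped_segment(profile_scores, seq):
    """Best score of a single ungapped local segment.

    profile_scores[i][residue] is the (assumed) match score of residue
    at profile position i. Filling the matrix costs
    len(profile_scores) * len(seq) cell updates.
    """
    best = 0
    prev = [0] * (len(seq) + 1)
    for row in profile_scores:
        cur = [0] * (len(seq) + 1)
        for j, residue in enumerate(seq, start=1):
            # Extend the segment along the diagonal, or restart at this cell.
            cur[j] = max(0, prev[j - 1]) + row[residue]
            best = max(best, cur[j])
        prev = cur
    return best


# Toy profile that prefers the motif "ACA".
toy_profile = [
    {"A": 2, "C": -1},
    {"A": -1, "C": 2},
    {"A": 2, "C": -1},
]
score = best_ungapped_segment(toy_profile, "CCACAC")
```

Because each cell depends only on its upper-left neighbour, cells along different diagonals are independent, which is what makes this step amenable to hardware parallelism.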


Subject(s)
Markov Chains, Sequence Alignment, Sequence Alignment/methods, Computational Biology/methods, Sequence Homology, Algorithms, Software
13.
Nucleic Acids Res ; 52(15): 8717-8733, 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39011889

ABSTRACT

In biological sequence alignment, prevailing heuristic aligners achieve high-throughput by several approximation techniques, but at the cost of sacrificing the clarity of output criteria and creating complex parameter spaces. To surmount these challenges, we introduce 'SigAlign', a novel alignment algorithm that employs two explicit cutoffs for the results: minimum length and maximum penalty per length, alongside three affine gap penalties. Comparative analyses of SigAlign against leading database search tools (BLASTn, MMseqs2) and read mappers (BWA-MEM, bowtie2, HISAT2, minimap2) highlight its performance in read mapping and database searches. Our research demonstrates that SigAlign not only provides high sensitivity with a non-heuristic approach, but also surpasses the throughput of existing heuristic aligners, particularly for high-accuracy reads or genomes with few repetitive regions. As an open-source library, SigAlign is poised to become a foundational component to provide a transparent and customizable alignment process to new analytical algorithms, tools and pipelines in bioinformatics.
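
The two explicit cutoffs described in this abstract can be read as a simple acceptance test on a finished alignment. The Python sketch below scores an alignment given as a string of operations and checks both cutoffs. It is not SigAlign's algorithm or API: the operation alphabet ("=" match, "X" mismatch, "I" insertion, "D" deletion), the convention that a gap of length n costs gap_open + n * gap_extend, the use of the full operation count as the alignment length, and all function names are assumptions made for illustration.

```python
def alignment_penalty(ops, mismatch, gap_open, gap_extend):
    """Total penalty of an alignment under affine gap costs."""
    penalty = 0
    prev = None
    for op in ops:
        if op == "X":
            penalty += mismatch
        elif op in ("I", "D"):
            # Pay gap_open only on the first position of each gap run.
            penalty += gap_extend + (gap_open if op != prev else 0)
        elif op != "=":
            raise ValueError(f"unknown alignment operation: {op!r}")
        prev = op
    return penalty


def passes_cutoffs(ops, mismatch, gap_open, gap_extend,
                   min_length, max_penalty_per_length):
    """Accept only if the alignment is long enough and cheap enough."""
    length = len(ops)
    penalty = alignment_penalty(ops, mismatch, gap_open, gap_extend)
    return length >= min_length and penalty <= max_penalty_per_length * length


ops = "=====X====II==="  # 15 columns: one mismatch, one gap of length 2
accepted = passes_cutoffs(ops, mismatch=4, gap_open=6, gap_extend=2,
                          min_length=10, max_penalty_per_length=1.0)
```

With these toy values the penalty is 4 + (6 + 2) + 2 = 14 over 15 columns, so the alignment passes a per-length cutoff of 1.0 but would fail a stricter one such as 0.5.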


Subject(s)
Algorithms, Sequence Alignment, Software, Sequence Alignment/methods, Humans, Computational Biology/methods, DNA Sequence Analysis/methods
14.
J Comput Biol ; 31(7): 597-615, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38980804

ABSTRACT

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. 
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
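
The definition of a decycling set used in this abstract can be checked directly on small cases: remove the chosen k-mers from the de Bruijn graph and test whether any cycle remains. The Python sketch below does this with an iterative depth-first search. It only verifies a given set; it is not the enumeration method of the paper, nor the Mykkeltveit or Champarnaud constructions. The function name and the binary toy alphabet are assumptions for illustration.

```python
from itertools import product


def is_decycling_set(kmers, k, alphabet="ACGT"):
    """True if removing kmers from the order-k de Bruijn graph leaves no cycle."""
    removed = set(kmers)
    nodes = ("".join(p) for p in product(alphabet, repeat=k))
    white, gray, black = 0, 1, 2
    color = {node: white for node in nodes if node not in removed}
    for root in list(color):
        if color[root] != white:
            continue
        color[root] = gray
        stack = [(root, iter(alphabet))]
        while stack:
            node, letters = stack[-1]
            for letter in letters:
                nxt = node[1:] + letter  # successor in the de Bruijn graph
                if nxt not in color:
                    continue  # successor was removed with the set
                if color[nxt] == gray:
                    return False  # back edge: a cycle survives
                if color[nxt] == white:
                    color[nxt] = gray
                    stack.append((nxt, iter(alphabet)))
                    break
            else:
                color[node] = black
                stack.pop()
    return True


# Binary alphabet, k = 2: the cycles are 00 -> 00, 11 -> 11 and 01 <-> 10.
full = is_decycling_set({"00", "11", "01"}, k=2, alphabet="01")
partial = is_decycling_set({"00", "11"}, k=2, alphabet="01")
```

In the toy case, removing only "00" and "11" leaves the cycle between "01" and "10", so a third k-mer is needed to hit every cycle.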


Subject(s)
Algorithms, Computational Biology, Software, Computational Biology/methods, Sequence Alignment/methods, Humans, DNA Sequence Analysis/methods
15.
BMC Bioinformatics ; 25(1): 238, 2024 Jul 13.
Article in English | MEDLINE | ID: mdl-39003441

ABSTRACT

MOTIVATION: Alignment of reads to a reference genome sequence is one of the key steps in the analysis of human whole-genome sequencing data obtained through Next-generation sequencing (NGS) technologies. The quality of the subsequent steps of the analysis, such as the results of clinical interpretation of genetic variants or the results of a genome-wide association study, depends on the correct identification of the position of the read as a result of its alignment. The amount of human NGS whole-genome sequencing data is constantly growing. There are a number of human genome sequencing projects worldwide that have resulted in the creation of large-scale databases of genetic variants of sequenced human genomes. Such information about known genetic variants can be used to improve the quality of alignment at the read alignment stage when analysing sequencing data obtained for a new individual, for example, by creating a genomic graph. While existing methods for aligning reads to a linear reference genome have high alignment speed, methods for aligning reads to a genomic graph have greater accuracy in variable regions of the genome. The development of a read alignment method that takes into account known genetic variants in the linear reference sequence index allows combining the advantages of both sets of methods. RESULTS: In this paper, we present the minimap2_index_modifier tool, which enables the construction of a modified index of a reference genome using known single nucleotide variants and insertions/deletions (indels) specific to a given human population. The use of the modified minimap2 index improves variant calling quality without modifying the bioinformatics pipeline and without significant additional computational overhead. 
Using the PrecisionFDA Truth Challenge V2 benchmark data (for HG002 short-read data aligned to the GRCh38 linear reference (GCA_000001405.15) with parameters k = 27 and w = 14) it was demonstrated that the number of false negative genetic variants decreased by more than 9500, and the number of false positives decreased by more than 7000 when modifying the index with genetic variants from the Human Pangenome Reference Consortium.


Subject(s)
Genetic Variation, Human Genome, Whole Genome Sequencing, Humans, Whole Genome Sequencing/methods, Genetic Variation/genetics, High-Throughput Nucleotide Sequencing/methods, Single Nucleotide Polymorphism/genetics, Sequence Alignment/methods, Software, Algorithms, Genome-Wide Association Study/methods
16.
Mol Biol Evol ; 41(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-39041199

ABSTRACT

The current trend in phylogenetic and evolutionary analyses predominantly relies on omic data. However, prior to core analyses, traditional methods typically involve intricate and time-consuming procedures, including assembly from high-throughput reads, decontamination, gene prediction, homology search, orthology assignment, multiple sequence alignment, and matrix trimming. Such processes significantly impede the efficiency of research when dealing with extensive data sets. In this study, we develop PhyloAln, a convenient reference-based tool capable of directly aligning high-throughput reads or complete sequences with existing alignments as a reference for phylogenetic and evolutionary analyses. Through testing with simulated data sets of species spanning the tree of life, PhyloAln demonstrates consistently robust performance compared with other reference-based tools across different data types, sequencing technologies, coverages, and species, with percent completeness and identity at least 50 percentage points higher in the alignments. Additionally, we validate the efficacy of PhyloAln in removing a minimum of 90% foreign and 70% cross-contamination issues, which are prevalent in sequencing data but often overlooked by other tools. Moreover, we showcase the broad applicability of PhyloAln by generating alignments (completeness mostly larger than 80%, identity larger than 90%) and reconstructing robust phylogenies using real data sets of transcriptomes of ladybird beetles, plastid genes of peppers, or ultraconserved elements of turtles. With these advantages, PhyloAln is expected to facilitate phylogenetic and evolutionary analyses in the omic era. The tool is accessible at https://github.com/huangyh45/PhyloAln.


Subject(s)
Phylogeny, Sequence Alignment, Software, Sequence Alignment/methods, High-Throughput Nucleotide Sequencing/methods, Animals, Molecular Evolution
17.
Bioinformatics ; 40(6)2024 06 03.
Article in English | MEDLINE | ID: mdl-38870521

ABSTRACT

MOTIVATION: Tools for pairwise alignments between 3D structures of proteins are of fundamental importance for structural biology and bioinformatics, enabling visual exploration of evolutionary and functional relationships. However, the absence of a user-friendly, browser-based tool for creating alignments and visualizing them at both 1D sequence and 3D structural levels makes this process unnecessarily cumbersome. RESULTS: We introduce a novel pairwise structure alignment tool (rcsb.org/alignment) that integrates into the research-focused RCSB.org web portal of the RCSB Protein Data Bank (RCSB PDB). Our tool and its underlying application programming interface (alignment.rcsb.org) let users align several protein chains with a reference structure by providing access to established alignment algorithms (FATCAT, CE, TM-align, or Smith-Waterman 3D). The user-friendly interface simplifies parameter setup and input selection. Within seconds, our tool enables visualization of results in both sequence (1D) and structural (3D) perspectives through the RCSB.org Sequence Annotations viewer and the Mol* 3D viewer, respectively. Users can compare structures deposited in the PDB archive alongside more than a million incorporated Computed Structure Models from the ModelArchive and AlphaFold DB. Moreover, this tool can be used to align custom structure data by providing a link/URL or uploading atomic coordinate files directly. Importantly, alignment results can be bookmarked and shared with collaborators. By bridging the gap between 1D sequence and 3D structures of proteins, our tool facilitates deeper understanding of complex evolutionary relationships among proteins through comprehensive sequence and structural analyses. AVAILABILITY AND IMPLEMENTATION: The alignment tool is part of the RCSB PDB research-focused RCSB.org web portal and available at rcsb.org/alignment. Programmatic access is available via alignment.rcsb.org. Frontend code has been published at github.com/rcsb/rcsb-pecos-app. Visualization is powered by the open-source Mol* viewer (github.com/molstar/molstar and github.com/molstar/rcsb-molstar) plus the Sequence Annotations in 3D Viewer (github.com/rcsb/rcsb-saguaro-3d).


Subject(s)
Algorithms , Protein Databases , Proteins , Sequence Alignment , Software , Proteins/chemistry , Sequence Alignment/methods , Protein Conformation , User-Computer Interface , Computational Biology/methods
19.
BMC Bioinformatics ; 25(1): 207, 2024 Jun 06.
Article in English | MEDLINE | ID: mdl-38844845

ABSTRACT

BACKGROUND: Gene families are groups of homologous genes that often have similar biological functions. These families are formed by gene duplication events throughout evolution, resulting in multiple copies of an ancestral gene. Over time, these copies can acquire mutations and structural variations, resulting in members that may vary in size, motif ordering and sequence. Multigene families have been described in a broad range of organisms, from single-celled bacteria to complex multicellular organisms, and have been linked to an array of phenomena, such as host-pathogen interactions, immune evasion and embryonic development. Despite the importance of gene families, few approaches have been developed for estimating and graphically visualizing their diversity patterns and expression profiles in genome-wide studies. RESULTS: Here, we introduce an R package named dgfr, which estimates and enables the visualization of sequence divergence within gene families, as well as the visualization of secondary data such as gene expression. The package takes as input a multi-fasta file containing the coding sequences (CDS) or amino acid sequences from a multigene family, performs pairwise alignments among all sequences, and estimates their distances, which are then subjected to dimension reduction, optimal cluster determination, and assignment of each gene to a cluster. The result is a dataset that allows for the visualization of sequence divergence and expression within the gene family, together with an approximation of the number of clusters present in the family. CONCLUSIONS: dgfr provides a way to estimate and study the diversity of gene families, as well as visualize the dispersion and secondary profile of the sequences. The dgfr package is available at https://github.com/lailaviana/dgfr under the GPL-3 license.
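The pipeline outlined in this abstract (pairwise distances, choice of a cluster count, gene-to-cluster assignment) can be illustrated with a minimal Python sketch. This is not the dgfr implementation, which is an R package: it assumes normalized Levenshtein distance as a stand-in for alignment-based distance, k-medoids with a mean silhouette score as one possible way to pick the number of clusters, and it skips the dimension-reduction step entirely. All function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance; a stand-in for an alignment-based distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of edit distances normalized by the longer sequence."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        longest = max(len(seqs[i]), len(seqs[j]), 1)
        d[i][j] = d[j][i] = edit_distance(seqs[i], seqs[j]) / longest
    return d


def k_medoids(d, k, max_iter=50):
    """Deterministic k-medoids: farthest-point start, then alternate updates."""
    n = len(d)
    medoids = [min(range(n), key=lambda i: sum(d[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(d[i][m] for m in medoids)))

    def assign(meds):
        return [min(range(k), key=lambda c: d[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                updated.append(min(members, key=lambda i: sum(d[i][j] for j in members)))
            else:
                updated.append(medoids[c])  # keep the old medoid for an empty cluster
        if updated == medoids:
            break
        medoids = updated
    return assign(medoids)


def silhouette(d, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != i]
        if not own or len(groups) < 2:
            continue
        a = sum(d[i][j] for j in own) / len(own)
        b = min(sum(d[i][j] for j in g) / len(g)
                for other, g in groups.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(d)


def choose_k(d, max_k=5):
    """Try k = 2..max_k and keep the clustering with the best silhouette."""
    best = (-2.0, None, None)
    for k in range(2, min(max_k, len(d) - 1) + 1):
        labels = k_medoids(d, k)
        score = silhouette(d, labels)
        if score > best[0]:
            best = (score, k, labels)
    return best[1], best[2]


# Toy "gene family" with two obvious subgroups (invented sequences).
family = ["ACACACACAC", "ACACACACAA", "ACACCCACAC",
          "GTGTGTGTGT", "GTGTGTGTGG", "GTGTTTGTGT"]
best_k, labels = choose_k(distance_matrix(family))
```

On real data, a proper pairwise aligner and a dimension-reduction step before clustering would replace the simplifications made here.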


Subject(s)
Genetic Variation , Multigene Family , Software , Genetic Variation/genetics , Sequence Alignment/methods
20.
BMC Bioinformatics ; 25(1): 219, 2024 Jun 19.
Article in English | MEDLINE | ID: mdl-38898394

ABSTRACT

BACKGROUND: With the surge in genomic data driven by advancements in sequencing technologies, the demand for efficient bioinformatics tools for sequence analysis has become paramount. BLAST-like alignment tool (BLAT), a sequence alignment tool, faces limitations in performance efficiency and integration with modern programming environments, particularly Python. This study introduces PxBLAT, a Python-based framework designed to enhance the capabilities of BLAT, focusing on usability, computational efficiency, and seamless integration within the Python ecosystem. RESULTS: PxBLAT demonstrates significant improvements over BLAT in execution speed and data handling, as evidenced by comprehensive benchmarks conducted across various sample groups ranging from 50 to 600 samples. These experiments highlight a notable speedup, reducing execution time compared to BLAT. The framework also introduces user-friendly features such as improved server management, data conversion utilities, and shell completion, enhancing the overall user experience. Additionally, the provision of extensive documentation and comprehensive testing supports community engagement and facilitates the adoption of PxBLAT. CONCLUSIONS: PxBLAT stands out as a robust alternative to BLAT, offering performance and user interaction enhancements. Its development underscores the potential for modern programming languages to improve bioinformatics tools, aligning with the needs of contemporary genomic research. By providing a more efficient, user-friendly tool, PxBLAT has the potential to impact genomic data analysis workflows, supporting faster and more accurate sequence analysis in a Python environment.


Subject(s)
Computational Biology , Sequence Alignment , Software , Computational Biology/methods , Sequence Alignment/methods , Programming Languages , Genomics/methods