Búsqueda | Portal de Búsqueda de la BVS España

A survey and evaluations of histogram-based statistics in alignment-free sequence comparison.

Luczak, Brian B; James, Benjamin T; Girgis, Hani Z.

Brief Bioinform ; 20(4): 1222-1237, 2019 07 19.

Artículo en Inglés | MEDLINE | ID: mdl-29220512

RESUMEN

MOTIVATION: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. RESULTS: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. AVAILABILITY: The source code of the benchmarking tool is available as Supplementary Materials.

Asunto(s)

Biología Computacional/métodos , Modelos Estadísticos , Análisis de Secuencia de ADN/estadística & datos numéricos , Algoritmos , Bases de Datos de Ácidos Nucleicos/estadística & datos numéricos , Secuenciación de Nucleótidos de Alto Rendimiento/estadística & datos numéricos , Humanos , Cadenas de Markov , Alineación de Secuencia/estadística & datos numéricos

MeShClust: an intelligent tool for clustering DNA sequences.

James, Benjamin T; Luczak, Brian B; Girgis, Hani Z.

Nucleic Acids Res ; 46(14): e83, 2018 08 21.

Artículo en Inglés | MEDLINE | ID: mdl-29718317

RESUMEN

Sequence clustering is a fundamental step in analyzing DNA sequences. Widely-used software tools for sequence clustering utilize greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a cluster. Often times, a biologist may not know the exact sequence similarity. Therefore, clusters produced by these tools do not likely match the real clusters comprising the data if the provided parameter is inaccurate. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm, which has been used successfully thousands of times in fields such as image processing and computer vision. The theory behind the mean shift algorithm, unlike the greedy approaches, guarantees convergence to the modes, e.g. cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict the identity score produced by global alignment using alignment-free methods. We demonstrate MeShClust's ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate.

Asunto(s)

Análisis de Secuencia de ADN/métodos , Programas Informáticos , Algoritmos , Análisis por Conglomerados , Genoma Viral , Microbiota/genética

Identity: rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models.

Girgis, Hani Z; James, Benjamin T; Luczak, Brian B.

NAR Genom Bioinform ; 3(1): lqab001, 2021 Mar.

Artículo en Inglés | MEDLINE | ID: mdl-33554117

RESUMEN

Pairwise global alignment is a fundamental step in sequence analysis. Optimal alignment algorithms are quadratic-slow especially on long sequences. In many applications that involve large sequence datasets, all what is needed is calculating the identity scores (percentage of identical nucleotides in an optimal alignment-including gaps-of two sequences); there is no need for visualizing how every two sequences are aligned. For these applications, we propose Identity, which produces global identity scores for a large number of pairs of DNA sequences using alignment-free methods and self-supervised general linear models. For the first time, the new tool can predict pairwise identity scores in linear time and space. On two large-scale sequence databases, Identity provided the best compromise between sensitivity and precision while being faster than BLAST, Mash, MUMmer4 and USEARCH by 2-80 times. Identity was the best performing tool when searching for low-identity matches. While constructing phylogenetic trees from about 6000 transcripts, the tree due to the scores reported by Identity was the closest to the reference tree (in contrast to andi, FSWM and Mash). Identity is capable of producing pairwise identity scores of millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any global-alignment-based tool. Availability: https://github.com/BioinformaticsToolsmith/Identity.

RESUMEN

Asunto(s)

RESUMEN

Asunto(s)

RESUMEN

ENVIAR RESULTADO:

SELECCIÓN DE REFERENCIAS

DETALLE DE LA BÚSQUEDA