Search | BVS Enfermagem Research Portal

Towards the accurate alignment of over a million protein sequences: Current state of the art.

Santus, Luisa; Garriga, Edgar; Deorowicz, Sebastian; Gudys, Adam; Notredame, Cedric.

Curr Opin Struct Biol ; 80: 102577, 2023 06.

Article in English | MEDLINE | ID: mdl-37012200

ABSTRACT

Large-scale genomics requires highly scalable and accurate multiple sequence alignment methods. Results collected over the last decade suggest that accuracy declines when scaling beyond a few thousand sequences. This issue has been actively addressed with a number of innovative algorithmic solutions that combine low-level hardware optimization with novel higher-level heuristics. This review provides an extensive critical overview of these recent methods. Using established reference datasets, we conclude that although significant progress has been achieved, a unified framework able to consistently and efficiently produce high-accuracy large-scale multiple alignments is still lacking.

Subjects

Algorithms, Genomics, Genomics/methods, Amino Acid Sequence, Sequence Alignment, Software

Multiple Sequence Alignment Computation Using the T-Coffee Regressive Algorithm Implementation.

Garriga, Edgar; Di Tommaso, Paolo; Magis, Cedrik; Erb, Ionas; Mansouri, Leila; Baltzis, Athanasios; Floden, Evan; Notredame, Cedric.

Methods Mol Biol ; 2231: 89-97, 2021.

Article in English | MEDLINE | ID: mdl-33289888

ABSTRACT

Many fields of biology rely on the inference of accurate multiple sequence alignments (MSAs) of biological sequences. Unfortunately, the problem of assembling an MSA is NP-complete, which limits computation to approximate, heuristic solutions. The progressive algorithm is one of the most popular frameworks for computing MSAs. It involves pre-clustering the sequences and aligning them starting with the most similar ones. The scalability of this framework is limited, especially with respect to accuracy. We present here an alternative approach named the regressive algorithm. In this framework, sequences are first clustered and then aligned starting with the most distantly related ones. This approach has been shown to greatly improve accuracy during scale-up, especially on datasets featuring 10,000 sequences or more. Another benefit is the possibility of integrating third-party clustering methods and third-party MSA aligners. The regressive algorithm has been tested on up to 1.5 million sequences; its implementation is available in the T-Coffee package.
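The difference in ordering described above can be illustrated with a toy guide tree. The sketch below is not the T-Coffee implementation; it only lists which groups of sequences each strategy would consider first, assuming a guide tree in which similar sequences sit close together near the leaves. The tree, the sequence names and the function names are all hypothetical, and no alignment is actually computed.

```python
from collections import deque

# Hypothetical guide tree written as nested tuples; leaves are sequence names.
GUIDE_TREE = ((("A", "B"), "C"), ("D", "E"))


def leaves(node):
    """Return the sequence names found under a node, left to right."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def progressive_order(node):
    """Visit internal nodes leaf-to-root (post-order): closest groups first."""
    if isinstance(node, str):
        return []
    steps = []
    for child in node:
        steps.extend(progressive_order(child))
    steps.append(leaves(node))
    return steps


def regressive_order(node):
    """Visit internal nodes root-to-leaf (breadth-first): deepest splits first."""
    steps = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if isinstance(current, str):
            continue  # a single sequence needs no alignment step
        steps.append(leaves(current))
        queue.extend(current)
    return steps


if __name__ == "__main__":
    print("progressive:", progressive_order(GUIDE_TREE))
    print("regressive: ", regressive_order(GUIDE_TREE))
```

In this toy tree the progressive order starts with the pair A, B and ends at the root, while the regressive order starts at the root split and reaches the pair A, B last.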

Subjects

Computational Biology/methods, Sequence Alignment/methods, Software, Algorithms, Cluster Analysis, Computational Biology/instrumentation, Sequence Alignment/instrumentation

Large multiple sequence alignments with a root-to-leaf regressive method.

Garriga, Edgar; Di Tommaso, Paolo; Magis, Cedrik; Erb, Ionas; Mansouri, Leila; Baltzis, Athanasios; Laayouni, Hafid; Kondrashov, Fyodor; Floden, Evan; Notredame, Cedric.

Nat Biotechnol ; 37(12): 1466-1470, 2019 12.

Article in English | MEDLINE | ID: mdl-31792410

ABSTRACT

Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works the other way around from the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes [6].
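One way to picture the divide-and-conquer idea is to cut the guide tree, from the root down, into sub-alignment jobs that never exceed a fixed size, so that any third-party aligner is only ever handed small inputs. The sketch below is a rough illustration under stated assumptions, not the published method: the rule of representing a subtree by its longest sequence, the `max_size` parameter, the toy tree and all function names are hypothetical, and the step that merges the sub-alignments back into one MSA is left out.

```python
# Hypothetical inputs: a guide tree as nested tuples and a name -> sequence map.
GUIDE_TREE = ((("A", "B"), "C"), ("D", "E"))
SEQS = {"A": "MKV", "B": "MKVL", "C": "MK", "D": "MKVLA", "E": "M"}


def leaves(node):
    """Return the sequence names found under a node, left to right."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def representative(node, seqs):
    """Assumption: the longest sequence under a node stands in for that node."""
    return max(leaves(node), key=lambda name: (len(seqs[name]), name))


def collect_frontier(node, max_size):
    """Expand internal nodes, root first, while the frontier fits in max_size."""
    frontier = [node]
    while True:
        for i, item in enumerate(frontier):
            fits = len(frontier) - 1 + len(item) <= max_size
            if not isinstance(item, str) and fits:
                frontier[i:i + 1] = list(item)  # replace node by its children
                break
        else:
            return frontier  # nothing more can be expanded


def subproblems(node, seqs, max_size):
    """List the alignment jobs; each job holds at most max_size sequence names."""
    if isinstance(node, str):
        return []
    frontier = collect_frontier(node, max_size)
    if len(frontier) == 1:
        raise ValueError("max_size is smaller than this node's number of children")
    jobs = [[representative(child, seqs) for child in frontier]]
    for child in frontier:
        jobs.extend(subproblems(child, seqs, max_size))
    return jobs


if __name__ == "__main__":
    for job in subproblems(GUIDE_TREE, SEQS, max_size=3):
        print(job)  # each list would be passed to an external aligner
```

Because every job is capped at `max_size` sequences, the cost of a single aligner call is bounded regardless of the aligner's own complexity, and the total work depends on the number of jobs rather than on one large alignment.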

Subjects

Algorithms, Sequence Alignment/methods, Genetic Databases, Eukaryota/genetics, Genomics/methods, Regression Analysis
