UPP2: fast and accurate alignment of datasets with fragmentary sequences.

Park, Minhyuk; Ivanovic, Stefan; Chu, Gillian; Shen, Chengze; Warnow, Tandy

Affiliation

Park M; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA.
Ivanovic S; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA.
Chu G; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA.
Shen C; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA.
Warnow T; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA.

Bioinformatics; 39(1), 2023 Jan 01.

Article in En | MEDLINE | ID: mdl-36625535

ABSTRACT

MOTIVATION: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware Profiles (UPP) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full-length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets.

RESULTS: We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble, which allows UPP2 to achieve the same accuracy as UPP with greatly reduced runtime. We show that UPP2 produces more accurate alignments than leading MSA methods on datasets exhibiting substantial sequence length heterogeneity and is among the most accurate otherwise.

AVAILABILITY AND IMPLEMENTATION: https://github.com/gillichu/sepp.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
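
The two-stage idea in the abstract (model the full-length sequences with an ensemble, then place each remaining sequence using a selected ensemble member) can be illustrated with a toy sketch. This is not the UPP or UPP2 implementation. The length tolerance, the hand-made sequence groups, and the k-mer overlap score are all assumptions chosen for illustration; the real methods use profile HMMs, and every function name here is hypothetical.

```python
from statistics import median


def split_by_length(seqs, tolerance=0.25):
    """Partition sequences into 'full-length' and 'remaining' sets.

    Assumption: a sequence counts as full-length if its length lies
    within `tolerance` of the median length. The threshold is illustrative.
    """
    med = median(len(s) for s in seqs.values())
    lo, hi = (1 - tolerance) * med, (1 + tolerance) * med
    full = {name: s for name, s in seqs.items() if lo <= len(s) <= hi}
    rest = {name: s for name, s in seqs.items() if name not in full}
    return full, rest


def kmers(seq, k=3):
    """Return the set of k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_score(model, query):
    """Stand-in for an HMM score: number of k-mers shared with the model."""
    return len(model & kmers(query))


def select_model(query, ensemble):
    """Pick the ensemble member that scores the query highest."""
    return max(ensemble, key=lambda name: toy_score(ensemble[name], query))


seqs = {
    "a": "ACGTACGTAC",
    "b": "ACGTACGTAA",
    "c": "TTGCATGCAT",
    "d": "TTGCATGCAA",
    "frag": "GCATG",  # fragmentary sequence
}
full, rest = split_by_length(seqs)

# Hypothetical ensemble: one toy "model" per hand-picked group of
# full-length sequences (a k-mer set instead of a profile HMM).
ensemble = {
    "model_ab": kmers(full["a"]) | kmers(full["b"]),
    "model_cd": kmers(full["c"]) | kmers(full["d"]),
}
chosen = {name: select_model(seq, ensemble) for name, seq in rest.items()}
```

In this toy setup the fragment is assigned to the model built from the sequences it most resembles. The abstract's point is that making this selection step fast is what reduces runtime in UPP2.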

Subject(s)

Algorithms; Software; Sequence Alignment; Phylogeny

Full text: 1; Collection: 01-internacional; Database: MEDLINE; Main subject: Algorithms / Software; Language: En; Journal: Bioinformatics; Journal subject: Medical Informatics; Year: 2023; Type: Article; Affiliation country: United States
