Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments.

Neuwald, Andrew F; Lanczycki, Christopher J; Hodges, Theresa K; Marchler-Bauer, Aron

Affiliation

Neuwald AF; Institute for Genome Sciences.
Lanczycki CJ; Department of Biochemistry & Molecular Biology, University of Maryland School of Medicine, 670 W. Baltimore Street, Baltimore, MD 21201, USA.
Hodges TK; National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38 A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Marchler-Bauer A; Institute for Genome Sciences.

Database (Oxford); 2020; 2020 01 01.

Article in English | MEDLINE | ID: mdl-32500917

ABSTRACT

For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect or may tend to misalign sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease-endonuclease-phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, the auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA, and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
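
The hierarchical arrangement described in the abstract can be illustrated with a minimal Python sketch. It is not the MAPGAPS file format or algorithm: the `Node` class, the `to_parent` column map (standing in for a template MSA) and the `project_to_root` function are hypothetical names introduced here, and residues in columns with no counterpart in the parent are simply dropped, which is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One MSA in a toy hierarchy (hypothetical structure, not the MAPGAPS format)."""
    name: str
    rows: List[str]                       # aligned sequences, equal length, '-' marks a gap
    parent: Optional["Node"] = None
    # Stand-in for a template MSA: child column index -> parent column index,
    # or None when the child column is an insertion relative to the parent.
    to_parent: List[Optional[int]] = field(default_factory=list)

    def width(self) -> int:
        return len(self.rows[0]) if self.rows else 0


def project_to_root(node: Node) -> List[str]:
    """Re-express a subgroup's rows in the column coordinates of the root MSA."""
    rows = node.rows
    current = node
    while current.parent is not None:
        parent = current.parent
        projected = []
        for row in rows:
            out = ["-"] * parent.width()
            for col, residue in enumerate(row):
                target = current.to_parent[col]
                if target is not None:    # insertion columns are dropped (assumption)
                    out[target] = residue
            projected.append("".join(out))
        rows = projected
        current = parent
    return rows


def merged_alignment(nodes: List[Node]) -> List[str]:
    """Stack every node's rows after projecting them onto the root columns."""
    merged: List[str] = []
    for node in nodes:
        merged.extend(project_to_root(node))
    return merged


# Toy data: a root MSA of width 5 and a subgroup MSA of width 6
# whose fourth column is an insertion relative to the root.
root = Node("superfamily", ["MKVLA", "MRV-A"])
child = Node("subgroup", ["MKIGLA", "MRI-LA"], parent=root,
             to_parent=[0, 1, 2, None, 3, 4])

for line in merged_alignment([root, child]):
    print(line)
```

Composing the column maps level by level is what lets a deeply nested subgroup be placed in the same coordinate system as the rest of the superfamily. A real implementation would also need to handle insertions and to score database sequences against each subgroup, neither of which this sketch attempts.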

Subjects

Databases, Protein; Proteins; Sequence Alignment/methods; Machine Learning; Proteins/chemistry; Proteins/genetics; Sequence Analysis, Protein; Software

Full text: 1 Collections: 01-international Database: MEDLINE Main subject: Proteins / Sequence Alignment / Databases, Protein Language: En Publication year: 2020 Document type: Article
