
Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.

Kolker, Natali; Higdon, Roger; Broomall, William; Stanberry, Larissa; Welch, Dean; Lu, Wei; Haynes, Winston; Barga, Roger; Kolker, Eugene.

OMICS ; 15(7-8): 513-21, 2011.

Article in English | MEDLINE | ID: mdl-21809957

ABSTRACT

To address the monumental challenge of assigning function to millions of sequenced proteins, we completed a first-of-its-kind all-versus-all sequence alignment using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length-normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.
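The single-linkage step mentioned in the abstract can be illustrated with a small in-memory sketch. This is not the authors' Hadoop implementation: the function name `single_linkage`, the `(protein_a, protein_b, score)` input format, and the threshold value are all assumptions made for illustration, and the scores are treated as already-computed similarity values (the abstract does not give the LNBS formula). The sketch uses a union-find structure so that any two proteins linked by a chain of above-threshold pairs end up in the same group.

```python
from collections import defaultdict


def single_linkage(pairs, threshold):
    """Group items connected by any chain of pairs scoring >= threshold.

    `pairs` is an iterable of (item_a, item_b, score) tuples. Both the
    tuple layout and the threshold are illustrative assumptions.
    """
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # registers both items
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical similarity records, not data from the paper.
example_pairs = [
    ("P1", "P2", 0.9),
    ("P2", "P3", 0.8),
    ("P3", "P4", 0.2),
    ("P4", "P5", 0.7),
]
clusters = single_linkage(example_pairs, threshold=0.5)
```

With these made-up records, P1, P2 and P3 are chained together, while the weak P3-P4 link keeps P4 and P5 in a separate group. At the scale described in the abstract (billions of records), the pair list would not fit in memory, which is presumably why a distributed implementation was needed.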

Subjects

Proteins/classification; Databases, Protein; Proteins/chemistry; Proteins/metabolism

Bioinformatics and data-intensive scientific discovery in the beginning of the 21st century.

Barga, Roger; Howe, Bill; Beck, David; Bowers, Stuart; Dobyns, William; Haynes, Winston; Higdon, Roger; Howard, Chris; Roth, Christian; Stewart, Elizabeth; Welch, Dean; Kolker, Eugene.

OMICS ; 15(4): 199-201, 2011 Apr.

Article in English | MEDLINE | ID: mdl-21476840

ABSTRACT

This article summarizes the bioinformatics issues and challenges of data-intensive science discussed at the NSF-funded Data-Intensive Science (DIS) workshop held in Seattle on September 19-20, 2010.

Subjects

Biological Science Disciplines/methods; Computational Biology/methods; Computational Biology/trends
