GODoc: high-throughput protein function prediction using novel k-nearest-neighbor and voting algorithms.

Liu, Yi-Wei; Hsu, Tz-Wei; Chang, Che-Yu; Liao, Wen-Hung; Chang, Jia-Ming

Affiliation

Liu YW; Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan.
Hsu TW; Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan.
Chang CY; Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan.
Liao WH; Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan. whliao@gmail.com.
Chang JM; Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan. chang.jiaming@gmail.com.

BMC Bioinformatics; 21(Suppl 6): 276, 2020 Nov 18.

Article in En | MEDLINE | ID: mdl-33203348

ABSTRACT

BACKGROUND: Biological data has grown explosively with the advance of next-generation sequencing. However, annotating protein function with wet lab experiments is time-consuming. Fortunately, computational function prediction can help wet labs formulate biological hypotheses and prioritize experiments. Gene Ontology (GO) is a framework for unifying the representation of protein function in a hierarchical tree composed of GO terms. RESULTS: We propose GODoc, a general protein GO prediction framework based on sequence information which combines feature engineering, feature reduction, and a novel k-nearest-neighbor algorithm to resolve the multiple GO prediction problem. Comprehensive evaluation on CAFA2 shows that GODoc performs better than two baseline models. In the CAFA3 competition (68 teams), GODoc ranks 10th in Cellular Component Ontology. Regarding the species-specific task, the proposed method ranks 10th and 8th in the eukaryotic Cellular Component Ontology and the prokaryotic Molecular Function Ontology, respectively. In the term-centric task, GODoc ranks third and is tied for first for the biofilm formation of Pseudomonas aeruginosa and the long-term memory of Drosophila melanogaster, respectively. CONCLUSIONS: We have developed a novel and effective strategy to incorporate a training procedure into the k-nearest neighbor algorithm (instance-based learning) which is capable of solving the Gene Ontology multiple-label prediction problem, which is especially notable given the thousands of Gene Ontology terms.
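
To make the idea of multi-label prediction by nearest-neighbor voting concrete, the following is a minimal sketch of a generic similarity-weighted k-nearest-neighbor vote over GO terms. It is not GODoc's algorithm, whose feature engineering, feature reduction, and training procedure are not detailed in this abstract. The function names (`cosine`, `knn_vote`), the toy feature vectors, and the choice of cosine similarity are all assumptions made for illustration; the GO identifiers are used only as example labels.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_vote(query, training, k=3):
    """Score GO terms for `query` by a similarity-weighted vote of its k nearest neighbors.

    `training` is a list of (feature_vector, set_of_go_terms) pairs.
    Returns a dict mapping each GO term seen among the neighbors to a
    score in [0, 1]. This is a generic illustration, not the GODoc method.
    """
    # Rank all training proteins by similarity to the query and keep the top k.
    scored = [(cosine(query, vec), terms) for vec, terms in training]
    neighbors = sorted(scored, key=lambda item: item[0], reverse=True)[:k]

    votes = defaultdict(float)
    total = 0.0
    for sim, terms in neighbors:
        weight = max(sim, 0.0)  # ignore negatively correlated neighbors
        total += weight
        for term in terms:
            votes[term] += weight

    if total == 0.0:
        return {}
    # Normalize so a term carried by every neighbor scores 1.0.
    return {term: weight / total for term, weight in votes.items()}


# Hypothetical toy data: 3-dimensional features with example GO labels.
training = [
    ([1.0, 0.0, 0.0], {"GO:0005737", "GO:0005634"}),
    ([0.9, 0.1, 0.0], {"GO:0005737"}),
    ([0.0, 1.0, 0.0], {"GO:0016020"}),
    ([0.0, 0.0, 1.0], {"GO:0005576"}),
]
scores = knn_vote([1.0, 0.0, 0.0], training, k=2)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

Because each protein can carry many GO terms, the vote yields a score per term rather than a single class, and a threshold on those scores would then decide which terms to report.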

Subject(s)

Algorithms; Drosophila melanogaster; Gene Ontology; Proteins; Animals; Computational Biology; Drosophila melanogaster/genetics; Politics; Proteins/genetics

Key words

Data science; Gene ontology; Homology extension; Machine learning; Protein function prediction

Full text: 1
Collection: 01-internacional
Database: MEDLINE
Main subject: Algorithms / Proteins / Drosophila melanogaster / Gene Ontology
Type of study: Prognostic_studies / Risk_factors_studies
Limits: Animals
Language: En
Journal: BMC Bioinformatics
Journal subject: Medical Informatics
Year: 2020
Document type: Article
Affiliation country: Taiwan
Country of publication: United Kingdom