Search | VHL Search Portal

Novel comparison of evaluation metrics for gene ontology classifiers reveals drastic performance differences.

Plyusnin, Ilya; Holm, Liisa; Törönen, Petri.

PLoS Comput Biol ; 15(11): e1007419, 2019 11.

Article in English | MEDLINE | ID: mdl-31682632

ABSTRACT

Automated protein annotation using the Gene Ontology (GO) plays an important role in the biosciences. Evaluation has always been considered central to developing novel annotation methods, but little attention has been paid to the evaluation metrics themselves. Evaluation metrics define how well an annotation method performs and allow methods to be ranked against one another. Unfortunately, most of these metrics were adopted from the machine learning literature without establishing whether they were appropriate for GO annotations. We propose a novel approach for comparing GO evaluation metrics called Artificial Dilution Series (ADS). Our approach uses existing annotation data to generate a series of annotation sets with different levels of correctness (referred to as their signal level). We calculate the evaluation metric being tested for each annotation set in the series, allowing us to identify whether it can separate different signal levels. Finally, we contrast these results with several false positive annotation sets, which are designed to expose systematic weaknesses in GO assessment. We compared 37 evaluation metrics for GO annotation using ADS and identified drastic differences between metrics. We show that some metrics struggle to differentiate between different signal levels, while others give erroneously high scores to the false positive data sets. Based on our findings, we provide guidelines on which evaluation metrics perform well with the Gene Ontology and propose improvements to several well-known evaluation metrics. In general, we argue that evaluation metrics should be tested for their performance and we provide software for this purpose (https://bitbucket.org/plyusnin/ads/). ADS is applicable to other areas of science where the evaluation of prediction results is non-trivial.
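
The dilution idea described in the abstract can be illustrated with a toy sketch. This is not the authors' ADS software: the function names, the toy annotation data, the random-replacement noise model and the choice of a mean Jaccard score as the metric under test are all assumptions made for illustration only.

```python
import random

def dilute(true_annotations, noise_terms, signal, rng):
    """Keep each (gene, term) pair with probability `signal`;
    otherwise swap the term for one drawn at random from `noise_terms`.
    Hypothetical noise model, not the one used by the ADS software."""
    diluted = {}
    for gene, terms in true_annotations.items():
        new_terms = set()
        for term in sorted(terms):  # sorted for reproducible draws
            if rng.random() < signal:
                new_terms.add(term)
            else:
                new_terms.add(rng.choice(noise_terms))
        diluted[gene] = new_terms
    return diluted

def mean_jaccard(truth, predicted):
    """Example metric under test: per-gene Jaccard index, averaged."""
    scores = []
    for gene, true_terms in truth.items():
        pred = predicted.get(gene, set())
        union = true_terms | pred
        scores.append(len(true_terms & pred) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# Toy data with made-up term labels (not real GO identifiers).
truth = {
    "geneA": {"T1", "T2", "T3"},
    "geneB": {"T2", "T4"},
    "geneC": {"T5"},
}
noise_terms = ["N1", "N2", "N3", "N4"]  # disjoint from the true terms

# Score the metric at each signal level of the series.
series = {}
for signal in (1.0, 0.75, 0.5, 0.25, 0.0):
    diluted = dilute(truth, noise_terms, signal, random.Random(0))
    series[signal] = mean_jaccard(truth, diluted)

for signal, score in series.items():
    print(f"signal={signal:.2f}  mean Jaccard={score:.3f}")
```

A metric that separates signal levels well should give clearly different scores along the series; the false positive sets mentioned in the abstract would be scored with the same metric function and compared against these values.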

Subject(s)

Computational Biology/methods , Molecular Sequence Annotation/classification , Molecular Sequence Annotation/methods , Algorithms , Benchmarking/methods , Databases, Genetic , Databases, Protein , Gene Ontology/trends , Reproducibility of Results , Software

Comprehensive functional annotation of 77 prostate cancer risk loci.

Hazelett, Dennis J; Rhie, Suhn Kyong; Gaddis, Malaina; Yan, Chunli; Lakeland, Daniel L; Coetzee, Simon G; Henderson, Brian E; Noushmehr, Houtan; Cozen, Wendy; Kote-Jarai, Zsofia; Eeles, Rosalind A; Easton, Douglas F; Haiman, Christopher A; Lu, Wange; Farnham, Peggy J; Coetzee, Gerhard A.

PLoS Genet ; 10(1): e1004102, 2014 Jan.

Article in English | MEDLINE | ID: mdl-24497837

ABSTRACT

Genome-wide association studies (GWAS) have revolutionized the field of cancer genetics, but the causal links between increased genetic risk and onset/progression of disease processes remain to be identified. Here we report the first step in such an endeavor for prostate cancer. We provide a comprehensive annotation of the 77 known risk loci, based upon highly correlated variants in biologically relevant chromatin annotations--we identified 727 such potentially functional SNPs. We also provide a detailed account of possible protein disruption, microRNA target sequence disruption and regulatory response element disruption of all correlated SNPs at r² ≥ 0.5. 88% of the 727 SNPs fall within putative enhancers, and many alter critical residues in the response elements of transcription factors known to be involved in prostate biology. We define as risk enhancers those regions with enhancer chromatin biofeatures in prostate-derived cell lines with prostate-cancer correlated SNPs. To aid the identification of these enhancers, we performed genome-wide ChIP-seq for H3K27-acetylation, a mark of actively engaged enhancers, as well as the transcription factor TCF7L2. We analyzed in depth three variants in risk enhancers, two of which show significantly altered androgen sensitivity in LNCaP cells. These include rs4907792, which is in linkage disequilibrium (r² = 0.91) with an eQTL for NUDT11 (on the X chromosome) in prostate tissue, and rs10486567, the index SNP in intron 3 of the JAZF1 gene on chromosome 7. Rs4907792 is within a critical residue of a strong consensus androgen response element that is interrupted in the protective allele, resulting in a 56% decrease in its androgen sensitivity, whereas rs10486567 affects both NKX3-1 and FOXA-AR motifs where the risk allele results in a 39% increase in basal activity and a 28-fold increase in androgen stimulated enhancer activity. Identification of such enhancer variants and their potential target genes represents a preliminary step in connecting risk to disease process.

Subject(s)

Enhancer Elements, Genetic , Molecular Sequence Annotation/classification , Prostatic Neoplasms/genetics , Response Elements/genetics , Alleles , Chromatin/genetics , Gene Expression Regulation, Neoplastic , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Male , Oligonucleotide Array Sequence Analysis , Polymorphism, Single Nucleotide/genetics , Prostatic Neoplasms/metabolism , Prostatic Neoplasms/pathology , Risk Factors , Transcription Factors/genetics

GOParGenPy: a high throughput method to generate gene ontology data matrices.

Kumar, Ajay Anand; Holm, Liisa; Toronen, Petri.

BMC Bioinformatics ; 14: 242, 2013 Aug 08.

Article in English | MEDLINE | ID: mdl-23927037

ABSTRACT

BACKGROUND: Gene Ontology (GO) is a popular standard in the annotation of gene products and provides information related to genes across all species. The structure of GO is dynamic and is updated on a daily basis. However, the popular existing methods use outdated versions of GO. Moreover, these tools are slow to process large datasets consisting of more than 20,000 genes. RESULTS: We have developed GOParGenPy, a platform independent software tool to generate the binary data matrix showing the GO class membership, including parental classes, of a set of GO annotated genes. GOParGenPy is at least an order of magnitude faster than popular tools for Gene Ontology analysis and it can handle larger datasets than the existing tools. It can use any available version of the GO structure and allows the user to select the source of GO annotation. GO structure selection is critical for analysis, as we show that GO classes have rapid turnover between different GO structure releases. CONCLUSIONS: GOParGenPy is an easy to use software tool which can generate sparse or full binary matrices from GO annotated gene sets. The obtained binary matrix can then be used with any analysis environment and with any analysis methods.
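
The binary class-membership matrix described in the abstract, with parental classes included, can be sketched as follows. This is a minimal illustration and not GOParGenPy's code or interface: the function names, the toy parent map and the made-up term labels are assumptions.

```python
def ancestors(term, parents):
    """Return the term itself plus every class reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def membership_matrix(annotations, parents):
    """Build a full binary gene-by-class matrix, propagating each
    direct annotation to all of its parental classes."""
    expanded = {}
    for gene, terms in annotations.items():
        classes = set()
        for term in terms:
            classes |= ancestors(term, parents)
        expanded[gene] = classes
    genes = sorted(expanded)
    columns = sorted(set().union(*expanded.values()))
    matrix = [[1 if c in expanded[g] else 0 for c in columns] for g in genes]
    return genes, columns, matrix

# Toy ontology with made-up labels: C -> B -> A and D -> A.
parents = {"T:C": ["T:B"], "T:B": ["T:A"], "T:D": ["T:A"]}
annotations = {"gene1": {"T:C"}, "gene2": {"T:D"}}

genes, columns, matrix = membership_matrix(annotations, parents)
print(columns)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

A sparse variant would store only the (gene, class) pairs in `expanded` instead of materialising every zero, which matters for the large gene sets the abstract mentions.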

Subject(s)

Gene Ontology , Genes , Molecular Sequence Annotation/methods , Proteins/genetics , Software , Artificial Intelligence , Molecular Sequence Annotation/classification , Proteins/chemistry , Proteins/classification , Search Engine/methods , Software/classification , Vocabulary, Controlled

Image-level and group-level models for Drosophila gene expression pattern annotation.

Sun, Qian; Muckatira, Sherin; Yuan, Lei; Ji, Shuiwang; Newfeld, Stuart; Kumar, Sudhir; Ye, Jieping.

BMC Bioinformatics ; 14: 350, 2013 Dec 03.

Article in English | MEDLINE | ID: mdl-24299119

ABSTRACT

BACKGROUND: Drosophila melanogaster has been established as a model organism for investigating the developmental gene interactions. The spatio-temporal gene expression patterns of Drosophila melanogaster can be visualized by in situ hybridization and documented as digital images. Automated and efficient tools for analyzing these expression images will provide biological insights into the gene functions, interactions, and networks. To facilitate pattern recognition and comparison, many web-based resources have been created to conduct comparative analysis based on the body part keywords and the associated images. With the fast accumulation of images from high-throughput techniques, manual inspection of images will impose a serious impediment on the pace of biological discovery. It is thus imperative to design an automated system for efficient image annotation and comparison. RESULTS: We present a computational framework to perform anatomical keywords annotation for Drosophila gene expression images. The spatial sparse coding approach is used to represent local patches of images in comparison with the well-known bag-of-words (BoW) method. Three pooling functions including max pooling, average pooling and Sqrt (square root of mean squared statistics) pooling are employed to transform the sparse codes to image features. Based on the constructed features, we develop both an image-level scheme and a group-level scheme to tackle the key challenges in annotating Drosophila gene expression pattern images automatically. To deal with the imbalanced data distribution inherent in image annotation tasks, the undersampling method is applied together with majority vote. Results on Drosophila embryonic expression pattern images verify the efficacy of our approach. CONCLUSION: In our experiment, the three pooling functions perform comparably well in feature dimension reduction. The undersampling with majority vote is shown to be effective in tackling the problem of imbalanced data. Moreover, combining sparse coding and image-level scheme leads to consistent performance improvement in keywords annotation.
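
The three pooling functions named in the abstract can be sketched as below. This is an illustrative reading of the definitions given there (Sqrt pooling taken as the square root of the mean of squared values), not the authors' code; the function names and the toy sparse codes are assumptions, and the sketch pools raw values rather than absolute values.

```python
import math

def max_pool(codes):
    """Largest coefficient per dictionary atom across all patches."""
    return [max(column) for column in zip(*codes)]

def average_pool(codes):
    """Mean coefficient per dictionary atom across all patches."""
    return [sum(column) / len(column) for column in zip(*codes)]

def sqrt_pool(codes):
    """Square root of the mean squared coefficient per dictionary atom."""
    return [math.sqrt(sum(v * v for v in column) / len(column))
            for column in zip(*codes)]

# Toy sparse codes: one row per image patch, one column per dictionary atom.
codes = [
    [0.0, 3.0],
    [4.0, 0.0],
]

print("max    :", max_pool(codes))
print("average:", average_pool(codes))
print("sqrt   :", sqrt_pool(codes))
```

Each function collapses a variable number of patch-level codes into one fixed-length feature vector per image, which is what makes the result usable as classifier input.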

Subject(s)

Drosophila melanogaster/cytology , Drosophila melanogaster/genetics , Gene Expression Regulation, Developmental , Genome, Insect/genetics , Models, Genetic , Molecular Sequence Annotation/methods , Animals , Cell Differentiation/genetics , Cell Division/genetics , Computational Biology/classification , Computational Biology/methods , Drosophila melanogaster/embryology , Gene Expression Profiling/classification , Gene Expression Profiling/methods , High-Throughput Screening Assays , Molecular Sequence Annotation/classification , Predictive Value of Tests , Support Vector Machine

The articles.ELM resource: simplifying access to protein linear motif literature by annotation, text-mining and classification.

Palopoli, N; Iserte, J A; Chemes, L B; Marino-Buslje, C; Parisi, G; Gibson, T J; Davey, N E.

Database (Oxford) ; 2020, 2020 01 01.

Article in English | MEDLINE | ID: mdl-32507889

ABSTRACT

Modern biology produces data at a staggering rate. Yet much of this biological data is still isolated in the text, figures, tables and supplementary materials of articles. As a result, biological information created at great expense is significantly underutilised. The protein motif biology field does not have sufficient resources to curate the corpus of motif-related literature and, to date, only a fraction of the available articles have been curated. In this study, we develop a set of tools and a web resource, 'articles.ELM', to rapidly identify the motif literature articles pertinent to a researcher's interest. At the core of the resource is a manually curated set of about 8000 motif-related articles. These articles are automatically annotated with a range of relevant biological data allowing in-depth search functionality. Machine-learning article classification is used to group articles based on their similarity to manually curated motif classes in the Eukaryotic Linear Motif resource. Articles can also be manually classified within the resource. The 'articles.ELM' resource permits the rapid and accurate discovery of relevant motif articles thereby improving the visibility of motif literature and simplifying the recovery of valuable biological insights sequestered within scientific articles. Consequently, this web resource removes a critical bottleneck in scientific productivity for the motif biology field. Database URL: http://slim.icr.ac.uk/articles/.

Subject(s)

Amino Acid Motifs , Data Mining/methods , Databases, Protein , Molecular Sequence Annotation , Molecular Sequence Annotation/classification , Molecular Sequence Annotation/methods , Publications/classification

Gene Name Disambiguation Using Multi-Scope Species Detection.

Hsiao, Jui-Chen; Wei, Chih-Hsuan; Kao, Hung-Yu.

IEEE/ACM Trans Comput Biol Bioinform ; 11(1): 55-62, 2014.

Article in English | MEDLINE | ID: mdl-26355507

ABSTRACT

Species detection is an important topic in the text mining field. Because individual research topics (e.g., species assignment to genes and document focus species detection) are important in their own right, some studies are dedicated to a single topic. However, no researcher to date has discussed species detection as a general problem. Therefore, we developed a multi-scope species detection model to identify the focus species for different scopes (i.e., gene mention, sentence, paragraph, and global scope of the entire article). Species assignment is one of the bottlenecks of gene name disambiguation. In our evaluation, recognizing the focus species of a gene mention in four different scopes improved the gene name disambiguation. We used the species cue words extracted from articles to estimate the relevance between an article and a species. The relevance score was calculated by our proposed entities frequency-augmented invert species frequency (EF-AISF) formula, which represents the importance of an entity to a species. We also defined a relation guide factor (RGF) to normalize the relevance score. Our method not only achieved better performance than previous methods but can also handle articles that do not specifically mention a species. In the DECA corpus, we outperformed previous studies and obtained an accuracy of 88.22 percent.

Subject(s)

Computational Biology/methods , Data Mining/methods , Genes/genetics , Molecular Sequence Annotation/classification , Animals , Humans , Support Vector Machine

Automatic assignment of prokaryotic genes to functional categories using literature profiling.

Torrieri, Raul; Oliveira, Francislon S; Oliveira, Guilherme; Coimbra, Roney S.

PLoS One ; 7(10): e47436, 2012.

Article in English | MEDLINE | ID: mdl-23077617

ABSTRACT

In recent years, there has been an exponential increase in the number of publicly available genomes. Once finished, most genome projects lack financial support to review annotations. A few of these gene annotations are based on a combination of bioinformatics evidence; however, in most cases, annotations are based solely on sequence similarity to a previously known gene, which was most probably annotated in the same way. As a result, a large number of predicted genes remain unassigned to any functional category despite the fact that there is enough evidence in the literature to predict their function. We developed a classifier trained with term-frequency vectors automatically disclosed from text corpora of an ensemble of genes representative of each functional category of the J. Craig Venter Institute Comprehensive Microbial Resource (JCVI-CMR) ontology. The classifier achieved up to 84% precision with 68% recall (for confidence≥0.4), F-measure 0.76 (recall and precision equally weighted) in an independent set of 2,220 genes, from 13 bacterial species, previously classified by JCVI-CMR into unambiguous categories of its ontology. Finally, the classifier assigned (confidence≥0.7) to functional categories a total of 5,235 out of the ∼24 thousand genes previously in categories "Unknown function" or "Unclassified" for which there is literature in MEDLINE. Two biologists reviewed the literature of 100 of these genes, randomly picked, and assigned them to the same functional categories predicted by the automatic classifier. Our results confirmed the hypothesis that it is possible to confidently assign genes of a real world repository to functional categories, based exclusively on the automatic profiling of its associated literature. The LitProf--Gene Classifier web server is accessible at: www.cebio.org/litprofGC.
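
The F-measure with precision and recall equally weighted is the harmonic mean of the two. The short sketch below applies the standard F-beta formula to the rounded figures quoted in the abstract; the function name is an assumption, and the result from rounded inputs (about 0.75) sits slightly below the reported 0.76, presumably because the published precision and recall are themselves rounded.

```python
def f_measure(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1 weights precision and recall equally."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Rounded values quoted in the abstract (confidence >= 0.4).
precision = 0.84
recall = 0.68

f1 = f_measure(precision, recall)
print(f"F-measure from rounded inputs: {f1:.4f}")
```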

Subject(s)

Computational Biology , Databases, Genetic , MEDLINE , Molecular Sequence Annotation , Humans , Internet , Molecular Sequence Annotation/classification , Molecular Sequence Annotation/methods
