Search | VHL Regional Portal

Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields.

Barutcuoglu, Zafer; Airoldi, Edoardo M; Dumeaux, Vanessa; Schapire, Robert E; Troyanskaya, Olga G.

Bioinformatics ; 25(10): 1307-13, 2009 May 15.

Article in English | MEDLINE | ID: mdl-19052061

ABSTRACT

MOTIVATION: The heterogeneity of cancer cannot always be recognized by tumor morphology, but may be reflected by the underlying genetic aberrations. Array comparative genome hybridization (array-CGH) methods provide high-throughput data on genetic copy numbers, but determining the clinically relevant copy number changes remains a challenge. Conventional classification methods for linking recurrent alterations to clinical outcome ignore sequential correlations in selecting relevant features. Conversely, existing sequence classification methods can only model overall copy number instability, without regard to any particular position in the genome.

RESULTS: Here, we present the heterogeneous hidden conditional random field, a new integrated array-CGH analysis method for jointly classifying tumors, inferring copy numbers and identifying clinically relevant positions in recurrent alteration regions. By capturing the sequentiality as well as the locality of changes, our integrated model provides better noise reduction, and achieves more relevant gene retrieval and more accurate classification than existing methods. We provide an efficient L1-regularized discriminative training algorithm, which notably selects a small set of candidate genes most likely to be clinically relevant and driving the recurrent amplicons of importance. Our method thus provides unbiased starting points in deciding which genomic regions and which genes in particular to pursue for further examination. Our experiments on synthetic data and real genomic cancer prediction data show that our method is superior, both in prediction accuracy and relevant feature discovery, to existing methods. We also demonstrate that it can be used to generate novel biological hypotheses for breast cancer.

Subject(s)

Algorithms; Aneuploidy; Comparative Genomic Hybridization/methods; Computational Biology/methods; Neoplasms/classification; Oligonucleotide Array Sequence Analysis/methods; Gene Dosage

A critical assessment of Mus musculus gene function prediction using integrated genomic evidence.

Peña-Castillo, Lourdes; Tasan, Murat; Myers, Chad L; Lee, Hyunju; Joshi, Trupti; Zhang, Chao; Guan, Yuanfang; Leone, Michele; Pagnani, Andrea; Kim, Wan Kyu; Krumpelman, Chase; Tian, Weidong; Obozinski, Guillaume; Qi, Yanjun; Mostafavi, Sara; Lin, Guan Ning; Berriz, Gabriel F; Gibbons, Francis D; Lanckriet, Gert; Qiu, Jian; Grant, Charles; Barutcuoglu, Zafer; Hill, David P; Warde-Farley, David; Grouios, Chris; Ray, Debajyoti; Blake, Judith A; Deng, Minghua; Jordan, Michael I; Noble, William S; Morris, Quaid; Klein-Seetharaman, Judith; Bar-Joseph, Ziv; Chen, Ting; Sun, Fengzhu; Troyanskaya, Olga G; Marcotte, Edward M; Xu, Dong; Hughes, Timothy R; Roth, Frederick P.

Genome Biol ; 9 Suppl 1: S2, 2008.

Article in English | MEDLINE | ID: mdl-18613946

ABSTRACT

BACKGROUND: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated.

RESULTS: In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%.

CONCLUSION: We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.

Subject(s)

Algorithms; Mice/genetics; Proteins/genetics; Proteins/metabolism; Animals; Mice/metabolism
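
The abstract above reports precision at a fixed recall rate (41% precision at 20% recall). As a reading aid, here is a minimal sketch of how such a figure can be computed for one GO term from ranked classifier scores. The function name `precision_at_recall` and the toy scores and labels are hypothetical, and the sketch ignores tied scores; the exact evaluation protocol used in the study is not given in the abstract and may differ.

```python
from typing import List


def precision_at_recall(scores: List[float], labels: List[int],
                        target_recall: float) -> float:
    """Precision of the shortest ranked prefix whose recall reaches target_recall.

    scores: classifier outputs, higher means more confident (hypothetical input).
    labels: 1 if the gene truly has the function, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank genes from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            return true_pos / k
    # Reached only if target_recall exceeds 1; fall back to overall precision.
    return true_pos / len(ranked)


# Toy data, invented for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 0, 1, 0, 1]
p20 = precision_at_recall(toy_scores, toy_labels, 0.2)
p60 = precision_at_recall(toy_scores, toy_labels, 0.6)
```

Averaging this quantity over GO terms would give a summary of the same kind as the one quoted in the abstract, under the assumptions stated above.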

Predicting gene function in a hierarchical context with an ensemble of classifiers.

Guan, Yuanfang; Myers, Chad L; Hess, David C; Barutcuoglu, Zafer; Caudy, Amy A; Troyanskaya, Olga G.

Genome Biol ; 9 Suppl 1: S3, 2008.

Article in English | MEDLINE | ID: mdl-18613947

ABSTRACT

BACKGROUND: The wide availability of genome-scale data for several organisms has stimulated interest in computational approaches to gene function prediction. Diverse machine learning methods have been applied to unicellular organisms with some success, but few have been extensively tested on higher level, multicellular organisms. A recent mouse function prediction project (MouseFunc) brought together nine bioinformatics teams applying a diverse array of methodologies to mount the first large-scale effort to predict gene function in the laboratory mouse.

RESULTS: In this paper, we describe our contribution to this project, an ensemble framework based on the support vector machine that integrates diverse datasets in the context of the Gene Ontology hierarchy. We carry out a detailed analysis of the performance of our ensemble and provide insights into which methods work best under a variety of prediction scenarios. In addition, we applied our method to Saccharomyces cerevisiae and have experimentally confirmed functions for a novel mitochondrial protein.

CONCLUSION: Our method consistently performs among the top methods in the MouseFunc evaluation. Furthermore, it exhibits good classification performance across a variety of cellular processes and functions in both a multicellular organism and a unicellular organism, indicating its ability to discover novel biology in diverse settings.

Subject(s)

Algorithms; Mice/genetics; Proteins/genetics; Proteins/metabolism; Animals; Bayes Theorem; Mice/metabolism; Mitochondrial Proteins/genetics; Mitochondrial Proteins/metabolism; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/metabolism

Hierarchical multi-label prediction of gene function.

Barutcuoglu, Zafer; Schapire, Robert E; Troyanskaya, Olga G.

Bioinformatics ; 22(7): 830-6, 2006 Apr 01.

Article in English | MEDLINE | ID: mdl-16410319

ABSTRACT

MOTIVATION: Assigning functions for unknown genes based on diverse large-scale data is a key task in functional genomics. Previous work on gene function prediction has addressed this problem using independent classifiers for each function. However, such an approach ignores the structure of functional class taxonomies, such as the Gene Ontology (GO). Over a hierarchy of functional classes, a group of independent classifiers where each one predicts gene membership to a particular class can produce a hierarchically inconsistent set of predictions, where for a given gene a specific class may be predicted positive while its inclusive parent class is predicted negative. Taking the hierarchical structure into account resolves such inconsistencies and provides an opportunity for leveraging all classifiers in the hierarchy to achieve higher specificity of predictions.

RESULTS: We developed a Bayesian framework for combining multiple classifiers based on the functional taxonomy constraints. Using a hierarchy of support vector machine (SVM) classifiers trained on multiple data types, we combined predictions in our Bayesian framework to obtain the most probable consistent set of predictions. Experiments show that over a 105-node subhierarchy of the GO, our Bayesian framework improves predictions for 93 nodes. As an additional benefit, our method also provides implicit calibration of SVM margin outputs to probabilities. Using this method, we make function predictions for multiple proteins, and experimentally confirm predictions for proteins involved in mitosis.

SUPPLEMENTARY INFORMATION: Results for the 105 selected GO classes and predictions for 1059 unknown genes are available at: http://function.princeton.edu/genesite/

CONTACT: ogt@cs.princeton.edu

Subject(s)

Algorithms; Bayes Theorem; Computational Biology/methods; Databases, Protein; Mitosis; Oligonucleotide Array Sequence Analysis; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae Proteins/genetics
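
The hierarchical inconsistency described in the abstract above (a specific class predicted positive while its parent class is predicted negative) can be illustrated with a small sketch. The three-node hierarchy, the prediction values, and the function names `find_inconsistencies` and `propagate_up` are all hypothetical. The repair shown is a naive upward propagation chosen for brevity; it is not the Bayesian framework of the paper, which instead searches for the most probable consistent set of predictions.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: class -> parent class (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "cell cycle": None,
    "mitosis": "cell cycle",
    "spindle assembly": "mitosis",
}


def find_inconsistencies(pred: Dict[str, bool]) -> List[str]:
    """Return classes predicted positive whose parent is predicted negative."""
    return sorted(
        child
        for child, parent in PARENT.items()
        if parent is not None and pred[child] and not pred[parent]
    )


def propagate_up(pred: Dict[str, bool]) -> Dict[str, bool]:
    """Naive repair: mark every ancestor of a positive class as positive."""
    fixed = dict(pred)
    for node, is_positive in pred.items():
        if is_positive:
            parent = PARENT[node]
            while parent is not None:
                fixed[parent] = True
                parent = PARENT[parent]
    return fixed


# Outputs of independent per-class classifiers for one gene (invented values).
independent = {"cell cycle": True, "mitosis": False, "spindle assembly": True}
violations = find_inconsistencies(independent)
repaired = propagate_up(independent)
```

Here `violations` flags "spindle assembly", because its parent "mitosis" was predicted negative. Upward propagation removes the violation but trusts every positive call blindly, which is one reason a probabilistic combination of all classifiers in the hierarchy, as the abstract describes, may be preferable.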
