Search | VHL Regional Portal

Applying interpretable machine learning in computational biology-pitfalls, recommendations and opportunities for new developments.

Chen, Valerie; Yang, Muyu; Cui, Wenbo; Kim, Joon Sik; Talwalkar, Ameet; Ma, Jian.

Nat Methods ; 21(8): 1454-1461, 2024 Aug.

Article in English | MEDLINE | ID: mdl-39122941

ABSTRACT

Recent advances in machine learning have enabled the development of next-generation predictive models for complex computational biology problems, thereby spurring the use of interpretable machine learning (IML) to unveil biological insights. However, guidelines for using IML in computational biology are generally underdeveloped. We provide an overview of IML methods and evaluation techniques and discuss common pitfalls encountered when applying IML methods to computational biology problems. We also highlight open questions, especially in the era of large language models, and call for collaboration between IML and computational biology researchers.

Subject(s)

Computational Biology , Machine Learning , Computational Biology/methods , Humans , Algorithms

Perspectives on incorporating expert feedback into model updates.

Chen, Valerie; Bhatt, Umang; Heidari, Hoda; Weller, Adrian; Talwalkar, Ameet.

Patterns (N Y) ; 4(7): 100780, 2023 Jul 14.

Article in English | MEDLINE | ID: mdl-37521050

ABSTRACT

Machine learning (ML) practitioners are increasingly tasked with developing models that are aligned with non-technical experts' values and goals. However, there has been insufficient consideration of how practitioners should translate domain expertise into ML updates. In this review, we consider how to capture interactions between practitioners and experts systematically. We devise a taxonomy to match expert feedback types with practitioner updates. A practitioner may receive feedback from an expert at the observation or domain level and then convert this feedback into updates to the dataset, loss function, or parameter space. We review existing work from ML and human-computer interaction to describe this feedback-update taxonomy and highlight the insufficient consideration given to incorporating feedback from non-technical experts. We end with a set of open questions that naturally arise from our proposed taxonomy and subsequent survey.

Inferring population structure in biobank-scale genomic data.

Chiu, Alec M; Molloy, Erin K; Tan, Zilong; Talwalkar, Ameet; Sankararaman, Sriram.

Am J Hum Genet ; 109(4): 727-737, 2022 04 07.

Article in English | MEDLINE | ID: mdl-35298920

ABSTRACT

Inferring the structure of human populations from genetic variation data is a key task in population and medical genomic studies. Although a number of methods for population structure inference have been proposed, current methods are impractical to run on biobank-scale genomic datasets containing millions of individuals and genetic variants. We introduce SCOPE, a method for population structure inference that is orders of magnitude faster than existing methods while achieving comparable accuracy. SCOPE infers population structure in about a day on a dataset containing one million individuals and variants as well as on the UK Biobank dataset containing 488,363 individuals and 569,346 variants. Furthermore, SCOPE can leverage allele frequencies from previous studies to improve the interpretability of population structure estimates.

Subject(s)

Biological Specimen Banks , Genetics, Population , Gene Frequency/genetics , Genomics , Humans

Explaining Groups of Points in Low-Dimensional Representations.

Plumb, Gregory; Terhorst, Jonathan; Sankararaman, Sriram; Talwalkar, Ameet.

Proc Mach Learn Res ; 119: 7762-7771, 2020 Jul.

Article in English | MEDLINE | ID: mdl-34532709

ABSTRACT

A common workflow in data exploration is to learn a low-dimensional representation of the data, identify groups of points in that representation, and examine the differences between the groups to determine what they represent. We treat this workflow as an interpretable machine learning problem by leveraging the model that learned the low-dimensional representation to help identify the key differences between the groups. To solve this problem, we introduce a new type of explanation, a Global Counterfactual Explanation (GCE), and our algorithm, Transitive Global Translations (TGT), for computing GCEs. TGT identifies the differences between each pair of groups using compressed sensing but constrains those pairwise differences to be consistent among all of the groups. Empirically, we demonstrate that TGT is able to identify explanations that accurately explain the model while being relatively sparse, and that these explanations match real patterns in the data.

SMaSH: a benchmarking toolkit for human genome variant calling.

Talwalkar, Ameet; Liptrap, Jesse; Newcomb, Julie; Hartl, Christopher; Terhorst, Jonathan; Curtis, Kristal; Bresler, Ma'ayan; Song, Yun S; Jordan, Michael I; Patterson, David.

Bioinformatics ; 30(19): 2787-95, 2014 Oct.

Article in English | MEDLINE | ID: mdl-24894505

ABSTRACT

MOTIVATION: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. RESULTS: We propose SMaSH, a benchmarking methodology for evaluating germline variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on these benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single-nucleotide polymorphism, indel and structural variant calling algorithms. AVAILABILITY AND IMPLEMENTATION: We provide free and open access online to the SMaSH tool kit, along with detailed documentation, at smash.cs.berkeley.edu

Subject(s)

Computational Biology/methods , Genome, Human , Genomics/methods , INDEL Mutation , Algorithms , Data Interpretation, Statistical , Databases, Genetic , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide , Software

A large-scale evaluation of computational protein function prediction.

Radivojac, Predrag; Clark, Wyatt T; Oron, Tal Ronnen; Schnoes, Alexandra M; Wittkop, Tobias; Sokolov, Artem; Graim, Kiley; Funk, Christopher; Verspoor, Karin; Ben-Hur, Asa; Pandey, Gaurav; Yunes, Jeffrey M; Talwalkar, Ameet S; Repo, Susanna; Souza, Michael L; Piovesan, Damiano; Casadio, Rita; Wang, Zheng; Cheng, Jianlin; Fang, Hai; Gough, Julian; Koskinen, Patrik; Törönen, Petri; Nokso-Koivisto, Jussi; Holm, Liisa; Cozzetto, Domenico; Buchan, Daniel W A; Bryson, Kevin; Jones, David T; Limaye, Bhakti; Inamdar, Harshal; Datta, Avik; Manjari, Sunitha K; Joshi, Rajendra; Chitale, Meghana; Kihara, Daisuke; Lisewski, Andreas M; Erdin, Serkan; Venner, Eric; Lichtarge, Olivier; Rentzsch, Robert; Yang, Haixuan; Romero, Alfonso E; Bhat, Prajwal; Paccanaro, Alberto; Hamp, Tobias; Kaßner, Rebecca; Seemayer, Stefan; Vicedo, Esmeralda; Schaefer, Christian.

Nat Methods ; 10(3): 221-7, 2013 Mar.

Article in English | MEDLINE | ID: mdl-23353650

ABSTRACT

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.

Subject(s)

Computational Biology/methods , Molecular Biology/methods , Molecular Sequence Annotation , Proteins/physiology , Algorithms , Animals , Databases, Protein , Exoribonucleases/classification , Exoribonucleases/genetics , Exoribonucleases/physiology , Forecasting , Humans , Proteins/chemistry , Proteins/classification , Proteins/genetics , Species Specificity

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL