1.
bioRxiv ; 2024 Jun 17.
Article in English | MEDLINE | ID: mdl-38798479

ABSTRACT

Continued advances in variant effect prediction are necessary to demonstrate the ability of machine learning methods to accurately determine the clinical impact of variants of unknown significance (VUS). Towards this goal, the ARSA Critical Assessment of Genome Interpretation (CAGI) challenge was designed to characterize progress by utilizing 219 experimentally assayed missense VUS in the Arylsulfatase A (ARSA) gene to assess the performance of community-submitted predictions of variant functional effects. The challenge involved 15 teams, and evaluated additional predictions from established and recently released models. Notably, a model developed by participants of a genetics and coding bootcamp, trained with standard machine-learning tools in Python, demonstrated superior performance among submissions. Furthermore, the study observed that state-of-the-art deep learning methods provided small but statistically significant improvement in predictive performance compared to less elaborate techniques. These findings underscore the utility of variant effect prediction, and the potential for models trained with modest resources to accurately classify VUS in genetic and clinical research.

2.
bioRxiv ; 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38496411

ABSTRACT

Therapeutic antibodies have become one of the most influential therapeutics in modern medicine to fight against infectious pathogens, cancer, and many other diseases. However, experimental screening for highly efficacious targeting antibodies is labor-intensive and costly, a problem exacerbated by antigen targets that evolve under selective pressure, such as fast-mutating viral variants. As a proof of concept, we developed a machine learning-assisted antibody generation pipeline that greatly accelerates the screening and re-design of immunoglobulin G antibodies (IgGs) against a broad spectrum of SARS-CoV-2 coronavirus variant strains. These viruses infect human host cells via the viral spike protein binding to the host cell receptor angiotensin-converting enzyme 2 (ACE2). Using over 1300 IgG sequences derived from convalescent patient B cells that bind to the spike's receptor binding domain (RBD), we first established protein structural docking models to assess the RBD-IgG-ACE2 interaction interfaces and to predict the virus-neutralizing activity of each IgG with a confidence score. Additionally, employing Gaussian process regression (also known as Kriging) in the latent space of an antibody language model, we predicted the landscape of IgGs' activity profiles against individual coronaviral variants of concern. With functional analyses and experimental validations, we efficiently prioritized IgG candidates for neutralizing a broad spectrum of viral variants (wildtype, Delta, and Omicron) to prevent the infection of host cells in vitro and hACE2 transgenic mice in vivo. Furthermore, the computational analyses enabled rational redesigns of selective IgG clones with single amino acid substitutions at the RBD-binding interface to improve the IgG blockade efficacy for one of the severe, therapy-resistant strains, Delta (B.1.617).
Our work expedites applications of artificial intelligence in antibody screening and re-design, even in low-data regimes, by combining protein language models and Kriging for antibody sequence analysis, activity prediction, and efficacy improvement, in synergy with physics-driven protein docking models for antibody-antigen interface structure analyses and functional optimization.
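
The Gaussian process regression (Kriging) step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a zero-mean prior and a squared-exponential kernel over embedding vectors; the kernel choice, the toy two-dimensional "embeddings", and the names `rbf_kernel`, `solve`, and `gp_predict` are illustrative assumptions, not the authors' pipeline.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two embedding vectors (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def gp_predict(train_x, train_y, query, noise=1e-6, length_scale=1.0):
    """Posterior mean and variance of a zero-mean GP at a single query point."""
    n = len(train_x)
    # Gram matrix of the training embeddings, with observation noise on the diagonal
    gram = [[rbf_kernel(train_x[i], train_x[j], length_scale)
             + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    k_star = [rbf_kernel(x, query, length_scale) for x in train_x]
    alpha = solve(gram, train_y)   # K^{-1} y
    v = solve(gram, k_star)        # K^{-1} k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf_kernel(query, query, length_scale) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)

# Hypothetical 2-D latent embeddings of three antibodies and their measured activities
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
train_y = [0.2, 0.9, 0.4]
mean_near, var_near = gp_predict(train_x, train_y, [1.0, 0.0])
mean_far, var_far = gp_predict(train_x, train_y, [10.0, 10.0])
```

In this sketch the prediction at a measured point reproduces its label with near-zero variance, while a query far from all training embeddings falls back to the prior (mean near 0, variance near 1). That variance is what makes the approach attractive in low-data regimes, since it flags where the predicted activity landscape is unsupported by data.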

3.
Res Sq ; 2023 Aug 03.
Article in English | MEDLINE | ID: mdl-37577664

ABSTRACT

Predicting protein variant effects through machine learning is often challenged by the scarcity of experimentally measured effect labels. Recently, protein language models (pLMs) have emerged as zero-shot predictors that need no effect labels, because they model the evolutionary distribution of functional protein sequences. However, biological contexts important to variant effects are only implicitly modeled and are effectively marginalized. By assessing the sequence awareness and the structure awareness of pLMs, we find that improvements in either often correlate with better variant effect prediction, but the tradeoff between them can present a barrier, as observed in over-finetuning to specific family sequences. We introduce a framework of structure-informed pLMs (SI-pLMs) to inject protein structural contexts purposely and controllably, by extending masked sequence denoising in conventional pLMs to cross-modality denoising. Our SI-pLMs are applicable to revising any sequence-only pLM through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and use structures only as a context provider and model regularizer during training. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, despite relatively compact sizes, are robustly top performers against competing methods including other pLMs, regardless of the target protein family's evolutionary information content or the tendency toward overfitting / over-finetuning. Learned distributions in structural contexts could enhance sequence distributions in predicting variant effects. Ablation studies reveal major contributing factors, and analyses of sequence embeddings provide further insights. The data and scripts are available at https://github.com/Stephen2526/Structure-informed_PLM.git.
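
As a rough illustration of zero-shot scoring with a masked language model, the sketch below uses a log-odds score: the log-probability of the mutant residue minus that of the wildtype residue at a masked position. This is a common convention for pLMs and is assumed here; the abstract does not state the scoring rule, and the probabilities and the name `log_odds_score` are hypothetical.

```python
import math

def log_odds_score(position_probs, wildtype, mutant):
    """Zero-shot variant score: log p(mutant) - log p(wildtype) at a masked position."""
    return math.log(position_probs[mutant]) - math.log(position_probs[wildtype])

# Hypothetical model output at one masked position (not from any real pLM)
masked_probs = {"C": 0.90, "S": 0.04, "F": 0.01, "Y": 0.05}

# Score a hypothetical C-to-F substitution; negative values mean the model
# finds the mutant residue less plausible than the wildtype residue.
score = log_odds_score(masked_probs, wildtype="C", mutant="F")
```

No effect labels enter the computation, which is what "zero-shot" refers to: the score depends only on the sequence distribution the model has learned.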

4.
Genome Biol ; 24(1): 79, 2023 04 18.
Article in English | MEDLINE | ID: mdl-37072822

ABSTRACT

A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.


Subject(s)
Algorithms, Epigenomics, Genomics/methods

5.
Hum Mutat ; 40(9): 1546-1556, 2019 09.
Article in English | MEDLINE | ID: mdl-31294896

ABSTRACT

Testing for variation in BRCA1 and BRCA2 (commonly referred to as BRCA1/2) has emerged as a standard clinical practice and is helping countless women better understand and manage their heritable risk of breast and ovarian cancer. Yet the increased rate of BRCA1/2 testing has led to an increasing number of Variants of Uncertain Significance (VUS), and the rate of VUS discovery currently outpaces the rate of clinical variant interpretation. Computational prediction is a key component of the variant interpretation pipeline. In the CAGI5 ENIGMA Challenge, six prediction teams submitted predictions on 326 newly interpreted variants from the ENIGMA Consortium. By evaluating these predictions against the new interpretations, we have gained a number of insights into the state of the art of variant prediction and into specific steps to further advance it.


Subject(s)
BRCA1 Protein/genetics, BRCA2 Protein/genetics, Breast Neoplasms/diagnosis, Computational Biology/methods, Ovarian Neoplasms/diagnosis, Breast Neoplasms/genetics, Early Detection of Cancer, Female, Genetic Predisposition to Disease, Genetic Testing, Genetic Variation, Humans, Models, Genetic, Ovarian Neoplasms/genetics

6.
Hum Mutat ; 40(9): 1612-1622, 2019 09.
Article in English | MEDLINE | ID: mdl-31241222

ABSTRACT

The availability of disease-specific genomic data is critical for developing new computational methods that predict the pathogenicity of human variants and advance the field of precision medicine. However, the lack of gold standards to properly train and benchmark such methods is one of the greatest challenges in the field. In response to this challenge, the scientific community is invited to participate in the Critical Assessment for Genome Interpretation (CAGI), where unpublished disease variants are available for classification by in silico methods. As part of the CAGI-5 challenge, we evaluated the performance of 18 submissions and three additional methods in predicting the pathogenicity of single nucleotide variants (SNVs) in checkpoint kinase 2 (CHEK2) for cases of breast cancer in Hispanic females. As part of the assessment, the efficacy of the analysis method and the setup of the challenge were also considered. The results indicated that though the challenge could benefit from additional participant data, the combined generalized linear model analysis and odds of pathogenicity analysis provided a framework to evaluate the methods submitted for SNV pathogenicity identification and for comparison to other available methods. The outcome of this challenge and the approaches used can help guide further advancements in identifying SNV-disease relationships.


Subject(s)
Breast Neoplasms/genetics, Checkpoint Kinase 2/genetics, Computational Biology/methods, Hispanic or Latino/genetics, Polymorphism, Single Nucleotide, Adult, Aged, Breast Neoplasms/ethnology, Case-Control Studies, Computer Simulation, Female, Genetic Predisposition to Disease, Humans, Linear Models, Middle Aged, United States/ethnology, Exome Sequencing

7.
Hum Mutat ; 40(9): 1579-1592, 2019 09.
Article in English | MEDLINE | ID: mdl-31144781

ABSTRACT

Quickly growing genetic variation data of unknown clinical significance demand computational methods that can reliably predict clinical phenotypes and deeply unravel molecular mechanisms. On the platform enabled by the Critical Assessment of Genome Interpretation (CAGI), we develop a novel "weakly supervised" regression (WSR) model that not only predicts precise clinical significance (probability of pathogenicity) from inexact training annotations (class of pathogenicity) but also infers underlying molecular mechanisms in a variant-specific manner. Compared to multiclass logistic regression, a representative multiclass classifier, our kernelized WSR improves the performance for the ENIGMA Challenge set from 0.72 to 0.97 in binary area under the receiver operating characteristic curve (AUC) and from 0.64 to 0.80 in ordinal multiclass AUC. WSR model interpretation and protein structural interpretation reach consensus in corroborating the most probable molecular mechanisms by which some pathogenic BRCA1 variants confer clinical significance, namely metal-binding disruption for p.C44F and p.C47Y, protein-binding disruption for p.M18T, and structure destabilization for p.S1715N.
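
For readers unfamiliar with the binary AUC figures quoted in this abstract, the metric can be computed as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The sketch below uses that pairwise definition; the labels and scores are made-up toy values and `binary_auc` is an illustrative name, not code from the study.

```python
def binary_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 1 = pathogenic, 0 = benign; scores are predicted probabilities
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = binary_auc(toy_labels, toy_scores)  # 3 of 4 positive-negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported change from 0.72 to 0.97 in context.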


Subject(s)
BRCA1 Protein/genetics, Computational Biology/methods, Mutation, Missense, Area Under Curve, Genetic Predisposition to Disease, Humans, Logistic Models, Machine Learning, Models, Genetic, Phenotype

8.
Proteins ; 85(3): 544-556, 2017 03.
Article in English | MEDLINE | ID: mdl-27862345

ABSTRACT

Predicting protein conformational changes from unbound structures or even homology models to bound structures remains a critical challenge for protein docking. Here we present a study directly addressing the challenge by reducing the dimensionality and narrowing the range of the corresponding conformational space. The study builds on cNMA, our new framework of partner- and contact-specific normal mode analysis that exploits encounter complexes and considers both intrinsic and induced flexibility. First, we established over a CAPRI (Critical Assessment of PRedicted Interactions) target set that the direction of conformational changes from unbound structures and homology models can be reproduced to a great extent by a small set of cNMA modes. In particular, homology-to-bound interface root-mean-square deviation (iRMSD) can be reduced by 40% on average with the slowest 30 modes. Second, we developed novel and interpretable features from cNMA and used various machine learning approaches to predict the extent of conformational changes. The models learned from a set of unbound-to-bound conformational changes could predict the actual extent of iRMSD with errors around 0.6 Å for unbound proteins in a held-out benchmark subset, around 0.8 Å for unbound proteins in the CAPRI set, and around 1 Å even for homology models in the CAPRI set. Our results provide new insights into the origins of conformational differences between homology models and bound structures and provide new support for the low dimensionality of conformational adjustment during protein associations. The results also provide new tools for ensemble generation and conformational sampling in unbound and homology docking. Proteins 2017; 85:544-556. © 2016 Wiley Periodicals, Inc.
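
The idea that a few modes can account for part of a conformational change can be sketched as a projection: subtract from the displacement vector its components along a set of orthonormal modes and compare the RMSD before and after. This is a simplified illustration only. It assumes the structures are already superimposed, the modes are orthonormal, and all atoms count equally; it is not the cNMA method, and the two-atom example is invented.

```python
import math

def rmsd(displacement):
    """RMSD of a flattened displacement vector (x1, y1, z1, x2, y2, z2, ...)."""
    n_atoms = len(displacement) // 3
    return math.sqrt(sum(d * d for d in displacement) / n_atoms)

def residual_after_modes(displacement, modes):
    """Remove the part of the displacement spanned by orthonormal modes."""
    residual = list(displacement)
    for mode in modes:
        # Coefficient of the displacement along this mode
        coeff = sum(m * d for m, d in zip(mode, displacement))
        residual = [r - coeff * m for r, m in zip(residual, mode)]
    return residual

# Invented two-atom example: atom 1 moves 3 units along x, atom 2 moves 4 along y
displacement = [3.0, 0.0, 0.0, 0.0, 4.0, 0.0]
modes = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0]]  # one hypothetical mode: atom 1 along x

before = rmsd(displacement)
after = rmsd(residual_after_modes(displacement, modes))
reduction = 1.0 - after / before  # fractional RMSD reduction from the chosen modes
```

Here the single mode removes the motion of the first atom and the RMSD falls by a fifth. The abstract's 40% average figure refers to the analogous reduction in interface RMSD with the slowest 30 cNMA modes on real complexes.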


Subject(s)
Computational Biology/methods, Machine Learning, Models, Statistical, Molecular Docking Simulation/methods, Proteins/chemistry, Software, Benchmarking, Binding Sites, Dimensional Measurement Accuracy, Protein Binding, Protein Conformation, Protein Multimerization, Research Design, Structural Homology, Protein, Thermodynamics