Results 1 - 18 of 18
1.
bioRxiv ; 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38106161

ABSTRACT

Autosomal dominant polycystic kidney disease (ADPKD) is the leading monogenic cause of kidney failure and affects millions of people worldwide. Despite the prevalence of this monogenic disorder, our limited mechanistic understanding of ADPKD has hindered therapeutic development. Here, we successfully developed bioassays that functionally classify missense variants in polycystin-1 (PC1). Strikingly, ADPKD pathogenic missense variants cluster into two major categories: 1) those that disrupt polycystin cell surface localization or 2) those that attenuate polycystin ion channel activity. We found that polycystin channels with defective surface localization could be rescued with a small molecule. We propose that small-molecule-based strategies to improve polycystin cell surface localization and channel function will be effective therapies for ADPKD patients.

2.
Cell ; 186(21): 4567-4582.e20, 2023 10 12.
Article in English | MEDLINE | ID: mdl-37794590

ABSTRACT

CRISPR-Cas9 genome editing has enabled advanced T cell therapies, but occasional loss of the targeted chromosome remains a safety concern. To investigate whether Cas9-induced chromosome loss is a universal phenomenon and evaluate its clinical significance, we conducted a systematic analysis in primary human T cells. Arrayed and pooled CRISPR screens revealed that chromosome loss was generalizable across the genome and resulted in partial and entire loss of the targeted chromosome, including in preclinical chimeric antigen receptor T cells. T cells with chromosome loss persisted for weeks in culture, implying the potential to interfere with clinical use. A modified cell manufacturing process, employed in our first-in-human clinical trial of Cas9-engineered T cells (NCT03399448), reduced chromosome loss while largely preserving genome editing efficacy. Expression of p53 correlated with protection from chromosome loss observed in this protocol, suggesting both a mechanism and strategy for T cell engineering that mitigates this genotoxicity in the clinic.


Subject(s)
CRISPR-Cas Systems , Chromosome Aberrations , Gene Editing , T-Lymphocytes , Humans , Chromosomes , CRISPR-Cas Systems/genetics , DNA Damage , Gene Editing/methods , Clinical Trials as Topic
3.
medRxiv ; 2023 Sep 18.
Article in English | MEDLINE | ID: mdl-37745486

ABSTRACT

Over three percent of people carry a dominant pathogenic mutation, yet only a fraction of carriers develop disease (incomplete penetrance), and phenotypes from mutations in the same gene range from mild to severe (variable expressivity). Here, we investigate underlying mechanisms for this heterogeneity: variable variant effect sizes, carrier polygenic backgrounds, and modulation of carrier effect by genetic background (epistasis). We leveraged exomes and clinical phenotypes from the UK Biobank and the Mt. Sinai BioMe Biobank to identify carriers of pathogenic variants affecting cardiometabolic traits. We employed recently developed methods to study these cohorts, observing strong statistical support and clinical translational potential for all three mechanisms of variable penetrance and expressivity. For example, scores from our recent model of variant pathogenicity were tightly correlated with phenotype amongst clinical variant carriers, they predicted effects of variants of unknown significance, and they distinguished gain- from loss-of-function variants. We also found that polygenic scores predicted phenotypes amongst pathogenic carriers and that epistatic effects can exceed main carrier effects by an order of magnitude.

4.
Nat Genet ; 55(9): 1512-1522, 2023 09.
Article in English | MEDLINE | ID: mdl-37563329

ABSTRACT

Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.


Subject(s)
Computational Biology , Software , Humans , Computational Biology/methods , Mutation, Missense/genetics , Proteins/genetics , Genome, Human/genetics
5.
bioRxiv ; 2023 Mar 22.
Article in English | MEDLINE | ID: mdl-36993359

ABSTRACT

CRISPR-Cas9 genome editing has enabled advanced T cell therapies, but occasional loss of the targeted chromosome remains a safety concern. To investigate whether Cas9-induced chromosome loss is a universal phenomenon and evaluate its clinical significance, we conducted a systematic analysis in primary human T cells. Arrayed and pooled CRISPR screens revealed that chromosome loss was generalizable across the genome and resulted in partial and entire loss of the chromosome, including in pre-clinical chimeric antigen receptor T cells. T cells with chromosome loss persisted for weeks in culture, implying the potential to interfere with clinical use. A modified cell manufacturing process, employed in our first-in-human clinical trial of Cas9-engineered T cells [1], dramatically reduced chromosome loss while largely preserving genome editing efficacy. Expression of p53 correlated with protection from chromosome loss observed in this protocol, suggesting both a mechanism and strategy for T cell engineering that mitigates this genotoxicity in the clinic.

6.
Genome Biol ; 23(1): 131, 2022 06 20.
Article in English | MEDLINE | ID: mdl-35725481

ABSTRACT

Genetic studies of human traits have revolutionized our understanding of the variation between individuals, and yet, the genetics of most traits is still poorly understood. In this review, we highlight the major open problems that need to be solved, and by discussing these challenges provide a primer to the field. We cover general issues such as population structure, epistasis and gene-environment interactions, data-related issues such as ancestry diversity and rare genetic variants, and specific challenges related to heritability estimates, genetic association studies, and polygenic risk scores. We emphasize the interconnectedness of these problems and suggest promising avenues to address them.


Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance , Gene-Environment Interaction , Humans , Phenotype , Polymorphism, Single Nucleotide
7.
Bioinformatics ; 38(8): 2102-2110, 2022 04 12.
Article in English | MEDLINE | ID: mdl-35020807

ABSTRACT

SUMMARY: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. AVAILABILITY AND IMPLEMENTATION: Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Amino Acid Sequence , Proteins/chemistry , Language , Natural Language Processing
8.
NAR Genom Bioinform ; 3(3): lqab079, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34541526

ABSTRACT

Human genetic variation in coding regions is fundamental to the study of protein structure and function. Most methods for interpreting missense variants consider substitution measures derived from homologous proteins across different species. In this study, we introduce human-specific amino acid (AA) substitution matrices that are based on genetic variations in the modern human population. We analyzed the frequencies of >4.8M single nucleotide variants (SNVs) at codon and AA resolution and compiled human-centric substitution matrices that are fundamentally different from classic cross-species matrices (e.g. BLOSUM, PAM). Our matrices are asymmetric, with some AA replacements showing significant directional preference. Moreover, these AA matrices are only partly predicted by nucleotide substitution rates. We further test the utility of our matrices in exposing functional signals of experimentally-validated protein annotations. A significant reduction in AA transition frequencies was observed across nine post-translational modification (PTM) types and four ion-binding sites. Our results propose a purifying selection signal in the human proteome across a diverse set of functional protein annotations and provide an empirical baseline for interpreting human genetic variation in coding regions.
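The idea of an asymmetric, direction-aware substitution matrix can be illustrated with a minimal sketch. It assumes variants arrive as (reference, alternate) amino-acid pairs; the function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def substitution_counts(variants):
    """Count directed ref->alt substitutions; (A, V) and (V, A) are kept apart."""
    counts = Counter()
    for ref, alt in variants:
        # Skip synonymous pairs and anything outside the 20 standard residues.
        if ref in AMINO_ACIDS and alt in AMINO_ACIDS and ref != alt:
            counts[(ref, alt)] += 1
    return counts

def directional_ratio(counts, a, b):
    """Ratio of a->b to b->a counts; None if the reverse direction is unobserved."""
    reverse = counts[(b, a)]
    return counts[(a, b)] / reverse if reverse else None

# Toy data only (an assumption for illustration, not real variant frequencies).
toy_variants = [("A", "V"), ("A", "V"), ("A", "V"), ("V", "A"), ("R", "Q")]
toy_counts = substitution_counts(toy_variants)
```

A ratio far from 1 for a pair of residues would indicate the kind of directional preference the abstract describes; classic symmetric matrices cannot express it because they store a single value per unordered pair.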

9.
Sci Rep ; 11(1): 14901, 2021 07 21.
Article in English | MEDLINE | ID: mdl-34290314

ABSTRACT

The characterization of germline genetic variation affecting cancer risk, known as cancer predisposition, is fundamental to preventive and personalized medicine. Studies of genetic cancer predisposition typically identify significant genomic regions based on family-based cohorts or genome-wide association studies (GWAS). However, the results of such studies rarely provide biological insight or functional interpretation. In this study, we conducted a comprehensive analysis of cancer predisposition in the UK Biobank cohort using a new gene-based method for detecting protein-coding genes that are functionally interpretable. Specifically, we conducted proteome-wide association studies (PWAS) to identify genetic associations mediated by alterations to protein function. With PWAS, we identified 110 significant gene-cancer associations in 70 unique genomic regions across nine cancer types and pan-cancer. In 48 of the 110 PWAS associations (44%), estimated gene damage is associated with reduced rather than elevated cancer risk, suggesting a protective effect. Together with standard GWAS, we implicated 145 unique genomic loci with cancer risk. While most of these genomic regions are supported by external evidence, our results also highlight many novel loci. Based on the capacity of PWAS to detect non-additive genetic effects, we found that 46% of the PWAS-significant cancer regions exhibited exclusive recessive inheritance. These results highlight the importance of recessive genetic effects, without relying on familial studies. Finally, we show that many of the detected genes exert substantial cancer risk in the studied cohort determined by a quantitative functional description, suggesting their relevance for diagnosis and genetic consulting.


Subject(s)
Genes, Recessive/genetics , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Neoplasm Proteins/genetics , Neoplasm Proteins/physiology , Neoplasms/genetics , Proteome/genetics , Cohort Studies , Female , Genetic Counseling , Genetic Loci/genetics , Germ-Line Mutation , Humans , Male , Neoplasms/diagnosis , Risk , United Kingdom
10.
J Pers Med ; 11(6), 2021 Jun 21.
Article in English | MEDLINE | ID: mdl-34205563

ABSTRACT

One of the major challenges in the post-genomic era is elucidating the genetic basis of human diseases. In recent years, studies have shown that polygenic risk scores (PRS), based on aggregated information from millions of variants across the human genome, can estimate individual risk for common diseases. However, current medical practice still relies predominantly on physiological and clinical indicators to assess personal disease risk. For example, caregivers mark individuals with high body mass index (BMI) as having an increased risk of developing type 2 diabetes (T2D). An important question is whether combining PRS with clinical metrics can increase the power of disease prediction, particularly from early life. In this work we examined this question, focusing on T2D. We present here a sex-specific integrated approach that combines PRS with additional measurements and age to define a new risk score. We show that this approach, combining adult BMI and PRS, achieves considerably better prediction than either measure alone on unrelated Caucasians in the UK Biobank (UKB, n = 290,584). Likewise, integrating PRS with self-reports on birth weight (n = 172,239) and comparative body size at age ten (n = 287,203) also substantially enhances prediction compared with each component alone. While the integration of PRS with BMI achieved better results than the other measurements, the latter are early-life measurements that can be integrated already in childhood, allowing preemptive intervention for those at high risk of developing T2D. Our integrated approach can be easily generalized to other diseases, with the relevant early-life measurements.

11.
Comput Struct Biotechnol J ; 19: 1750-1758, 2021.
Article in English | MEDLINE | ID: mdl-33897979

ABSTRACT

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
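As a small illustration of the classic k-mer (n-gram) bag-of-words encoding mentioned in the abstract, the sketch below counts overlapping k-mers in an amino-acid string. The function name and the toy sequence are assumptions made for this example only.

```python
from collections import Counter

def kmer_counts(sequence: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in an amino-acid sequence (bag-of-words view)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Slide a window of width k; sequences shorter than k yield an empty Counter.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Toy sequence (hypothetical, not a real protein).
counts = kmer_counts("MKVLMKV", k=3)
```

The resulting counts discard residue order beyond the window width, which is the main limitation that contextualized embeddings and attention-based models are meant to address.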

12.
Cancer Res ; 81(4): 1178-1185, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33277365

ABSTRACT

Contemporary catalogues of cancer driver genes rely primarily on high mutation rates as evidence for gene selection in tumors. Here, we present The Functional Alteration Bias Recovery In Coding-regions Cancer Portal, a comprehensive catalogue of gene selection in cancer based purely on the biochemical functional effects of mutations at the protein level. Gene selection in the portal is quantified by combining genomics data with rich proteomic annotations. Genes are ranked according to the strength of evidence for selection in tumor, based on rigorous and robust statistics. The portal covers the entire human coding genome (∼18,000 protein-coding genes) across 33 cancer types and pan-cancer. It includes a selected set of cross-references to the most relevant resources providing genomics, proteomics, and cancer-related information. We showcase the portal with known and overlooked cancer genes, demonstrating the utility of the portal via its simple visual interface, which allows users to pivot between gene-centric and cancer type views. The portal is available at fabric-cancer.huji.ac.il. SIGNIFICANCE: A new cancer portal quantifies and presents gene selection in tumor over the entire human coding genome across 33 cancer types and pan-cancer.


Subject(s)
Databases, Genetic , Genomics/methods , Neoplasms/genetics , Oncogenes/genetics , Open Reading Frames/genetics , Algorithms , Female , Genome, Human/genetics , Humans , Internet , Male , Mutation Rate , Neoplasms/classification , Precancerous Conditions/classification , Precancerous Conditions/genetics , Selection, Genetic , Software , User-Computer Interface
13.
Genome Biol ; 21(1): 173, 2020 07 14.
Article in English | MEDLINE | ID: mdl-32665031

ABSTRACT

We introduce Proteome-Wide Association Study (PWAS), a new method for detecting gene-phenotype associations mediated by protein function alterations. PWAS aggregates the signal of all variants jointly affecting a protein-coding gene and assesses their overall impact on the protein's function using machine learning and probabilistic models. Subsequently, it tests whether the gene exhibits functional variability between individuals that correlates with the phenotype of interest. PWAS can capture complex modes of heritability, including recessive inheritance. A comparison with GWAS and other existing methods proves its capacity to recover causal protein-coding genes and highlight new associations. PWAS is available as a command-line tool.


Subject(s)
Phenotype , Proteome , Proteomics/methods , Software , Colorectal Neoplasms/genetics , Genome-Wide Association Study , Humans
14.
BMC Cancer ; 19(1): 783, 2019 Aug 07.
Article in English | MEDLINE | ID: mdl-31391007

ABSTRACT

BACKGROUND: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identity of somatic mutations is based on a reliable mapping of the patient germline variants. In addition, the statistics of germline variant frequencies in healthy individuals and cancer patients is the basis for seeking candidates for cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from > 10,000 patients. METHODS: Our hypothesis in this study is that whole exome sequences from blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity. RESULTS: We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants. CONCLUSION: TCGA germline data is subject to strong batch effects, with substantial variability among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability and the power to detect new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.


Subject(s)
Exome , Genome, Human , Genomics , Germ-Line Mutation , Neoplasms/genetics , Genetic Association Studies , Genetic Predisposition to Disease , Humans , Kaplan-Meier Estimate , Neoplasms/diagnosis , Neoplasms/mortality , Neoplasms/therapy , Precision Medicine/methods
15.
Nucleic Acids Res ; 47(13): 6642-6655, 2019 07 26.
Article in English | MEDLINE | ID: mdl-31334812

ABSTRACT

Compiling the catalogue of genes actively involved in cancer is an ongoing endeavor, with profound implications for the understanding and treatment of the disease. An abundance of computational methods have been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the selection of genes in cancer by assessing the effects of de-novo somatic mutations on protein-coding genes. Using a machine-learning model, we quantified the functional effects of ∼3M somatic mutations extracted from over 10 000 human cancerous samples, and compared them against the effects of all possible single-nucleotide mutations in the coding human genome. We detected 593 protein-coding genes showing statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with known cancer genes, but also include many overlooked genes. FABRIC is designed to avoid false discoveries by comparing each gene to its own background model using rigorous statistics, making minimal assumptions about the distribution of random somatic mutations. The framework is an open-source project with a simple command-line interface.
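The general idea of comparing a gene's observed mutations with that gene's own background can be sketched as follows. This is not FABRIC's actual statistic: a plain z-test is swapped in as a stand-in, and the score convention (values in [0, 1], lower meaning more damaging), the function name, and the toy numbers are all assumptions.

```python
import math
from statistics import mean, pstdev

def selection_z_score(observed, background):
    """z-score of the observed mean effect score against the gene's own background.

    observed:   effect scores of mutations actually seen in tumors for one gene.
    background: effect scores of all possible single-nucleotide mutations in
                the same gene (treated here as the full population).
    """
    if not observed or len(background) < 2:
        raise ValueError("need at least one observed and two background scores")
    sigma = pstdev(background)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    # Standard error of the mean for a sample of len(observed) drawn from the background.
    standard_error = sigma / math.sqrt(len(observed))
    return (mean(observed) - mean(background)) / standard_error

# Toy scores (hypothetical): observed mutations skew towards damaging values.
z = selection_z_score([0.1, 0.2, 0.1, 0.2], [0.2, 0.4, 0.6, 0.8])
```

Under the assumed convention, a strongly negative z-score would mean the observed mutations are more damaging than expected by chance for that gene. Because each gene is compared only with its own background, differences in gene length or composition do not enter the comparison.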


Subject(s)
Computational Biology/methods , Genes, Neoplasm , Mutation , Neoplasm Proteins/genetics , Neoplasms/genetics , Datasets as Topic , Humans , Models, Genetic , Mutation, Missense , Neoplasm Proteins/chemistry , Neoplasm Proteins/physiology
16.
Viruses ; 11(5), 2019 04 30.
Article in English | MEDLINE | ID: mdl-31052218

ABSTRACT

Viruses are the most prevalent infectious agents, populating almost every ecosystem on earth. Most viruses carry only a handful of genes supporting their replication and the production of capsids. It came as a great surprise in 2003 when the first giant virus was discovered and found to have a >1 Mbp genome encoding almost a thousand proteins. Following this first discovery, dozens of giant virus strains across several viral families have been reported. Here, we provide an updated quantitative and qualitative view on giant viruses and elaborate on their shared and variable features. We review the complexity of giant viral proteomes, which include functions traditionally associated only with cellular organisms. These unprecedented functions include components of the translation machinery, DNA maintenance, and metabolic enzymes. We discuss the possible underlying evolutionary processes and mechanisms that might have shaped the diversity of giant viruses and their genomes, highlighting their remarkable capacity to hijack genes and genomic sequences from their hosts and environments. This leads us to examine prominent theories regarding the origin of giant viruses. Finally, we present the emerging ecological view of giant viruses, found across widespread habitats and ecological systems, with respect to the environment and human health.


Subject(s)
Giant Viruses/classification , Giant Viruses/physiology , Biological Evolution , Ecosystem , Genome, Viral , Genomics/methods , Host-Pathogen Interactions , Humans
17.
Article in English | MEDLINE | ID: mdl-27694209

ABSTRACT

Determining residue-level protein properties, such as sites of post-translational modifications (PTMs), is vital to understanding protein function. Experimental methods are costly and time-consuming, while traditional rule-based computational methods fail to annotate sites lacking substantial similarity. Machine Learning (ML) methods are becoming fundamental in annotating unknown proteins and their heterogeneous properties. We present ASAP (Amino-acid Sequence Annotation Prediction), a universal ML framework for predicting residue-level properties. ASAP extracts numerous features from raw sequences, and supports easy integration of external features such as secondary structure, solvent accessibility, intrinsic disorder or PSSM profiles. Features are then used to train ML classifiers. ASAP can create new classifiers within minutes for a variety of tasks, including PTM prediction (e.g. cleavage sites by convertase, phosphoserine modification). We present a detailed case study for ASAP: CleavePred, an ASAP-based model to predict protein precursor cleavage sites, with state-of-the-art results. Protein cleavage is a PTM shared by a wide variety of proteins sharing minimal sequence similarity. Current rule-based methods suffer from high false positive rates, making them suboptimal. The high performance of CleavePred makes it suitable for analyzing new proteomes at a genomic scale. The tool is attractive for protein design, mass spectrometry search engines and the discovery of new bioactive peptides from precursors. ASAP functions as a baseline approach for residue-level protein sequence prediction. CleavePred is freely accessible as a web-based application. Both ASAP and CleavePred are open-source with a flexible Python API. Database URL: The ASAP and CleavePred source code, web tool and tutorials are available at: https://github.com/ddofer/asap; http://protonet.cs.huji.ac.il/cleavepred.


Subject(s)
Machine Learning , Molecular Sequence Annotation/methods , Peptides/genetics , Peptides/metabolism , Sequence Analysis, Protein/methods , Internet
18.
Biol Direct ; 11: 26, 2016 05 21.
Article in English | MEDLINE | ID: mdl-27209091

ABSTRACT

BACKGROUND: Viruses are the simplest replicating units, characterized by a limited number of coding genes and an exceptionally high rate of overlapping genes. We sought a unified evolutionary explanation that accounts for their genome sizes, gene overlapping and capsid properties. RESULTS: We performed an unbiased statistical analysis of ~100 families within ~400 genera that comprise the currently known viral world. We found that the volume utilization of capsids is often low, and greatly varies among viral families. Furthermore, although viruses span three orders of magnitude in genome length, they almost never have over 1500 overlapping nucleotides, or over four significantly overlapping genes per virus. CONCLUSIONS: Our findings undermine the generality of the compression theory, which emphasizes optimal packing and length dependency to explain overlapping genes and capsid size in viral genomes. Instead, we propose that gene novelty and evolution exploration offer better explanations to size constraints and gene overlapping in all viruses. REVIEWERS: This article was reviewed by Arne Elofsson and David Kreil.


Subject(s)
Capsid/physiology , Evolution, Molecular , Genes, Overlapping , Genome Size , Genome, Viral , Viruses/genetics