Results 1 - 18 of 18
1.
Cell ; 186(21): 4567-4582.e20, 2023 10 12.
Article in English | MEDLINE | ID: mdl-37794590

ABSTRACT

CRISPR-Cas9 genome editing has enabled advanced T cell therapies, but occasional loss of the targeted chromosome remains a safety concern. To investigate whether Cas9-induced chromosome loss is a universal phenomenon and evaluate its clinical significance, we conducted a systematic analysis in primary human T cells. Arrayed and pooled CRISPR screens revealed that chromosome loss was generalizable across the genome and resulted in partial and entire loss of the targeted chromosome, including in preclinical chimeric antigen receptor T cells. T cells with chromosome loss persisted for weeks in culture, implying the potential to interfere with clinical use. A modified cell manufacturing process, employed in our first-in-human clinical trial of Cas9-engineered T cells (NCT03399448), reduced chromosome loss while largely preserving genome editing efficacy. Expression of p53 correlated with protection from chromosome loss observed in this protocol, suggesting both a mechanism and strategy for T cell engineering that mitigates this genotoxicity in the clinic.


Subjects
CRISPR-Cas Systems , Chromosome Aberrations , Gene Editing , T-Lymphocytes , Humans , Chromosomes , CRISPR-Cas Systems/genetics , DNA Damage , Gene Editing/methods , Clinical Trials as Topic
2.
Bioinformatics ; 38(8): 2102-2110, 2022 04 12.
Article in English | MEDLINE | ID: mdl-35020807

ABSTRACT

SUMMARY: Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data. AVAILABILITY AND IMPLEMENTATION: Code and pretrained model weights are available at https://github.com/nadavbra/protein_bert. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning , Amino Acid Sequence , Proteins/chemistry , Language , Natural Language Processing
3.
Nucleic Acids Res ; 47(13): 6642-6655, 2019 07 26.
Article in English | MEDLINE | ID: mdl-31334812

ABSTRACT

Compiling the catalogue of genes actively involved in cancer is an ongoing endeavor, with profound implications for the understanding and treatment of the disease. An abundance of computational methods have been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the selection of genes in cancer by assessing the effects of de-novo somatic mutations on protein-coding genes. Using a machine-learning model, we quantified the functional effects of ∼3M somatic mutations extracted from over 10 000 human cancerous samples, and compared them against the effects of all possible single-nucleotide mutations in the coding human genome. We detected 593 protein-coding genes showing statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with known cancer genes, but also include many overlooked genes. FABRIC is designed to avoid false discoveries by comparing each gene to its own background model using rigorous statistics, making minimal assumptions about the distribution of random somatic mutations. The framework is an open-source project with a simple command-line interface.
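
The idea of comparing a gene's observed mutations to its own background can be sketched as follows. This is only a minimal illustration, not the FABRIC implementation: the function name, the damage scores (assumed here to lie in [0, 1], higher meaning more harmful) and the uniform weighting of background mutations are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def selection_z_score(observed_scores, background_scores):
    """Z-score of the observed mean damage score against the gene's own background.

    observed_scores: hypothetical damage scores of mutations seen in tumors.
    background_scores: hypothetical damage scores of all possible
    single-nucleotide mutations in the same gene (uniformly weighted here,
    which is a simplification).
    """
    n = len(observed_scores)
    sigma = pstdev(background_scores)
    if n == 0 or sigma == 0:
        raise ValueError("need at least one observation and a non-constant background")
    mu = mean(background_scores)
    # Standard error of the mean of n draws from the background distribution.
    return (mean(observed_scores) - mu) / (sigma / math.sqrt(n))


# Toy data: the observed mutations are skewed towards harmful scores.
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
observed = [0.8, 0.9, 0.7, 0.9]
z = selection_z_score(observed, background)
```

A large positive z in this toy setup would point to a bias towards harmful mutations; a real analysis would also need multiple-testing correction across genes.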


Subjects
Computational Biology/methods , Genes, Neoplasm , Mutation , Neoplasm Proteins/genetics , Neoplasms/genetics , Datasets as Topic , Humans , Models, Genetic , Mutation, Missense , Neoplasm Proteins/chemistry , Neoplasm Proteins/physiology
4.
BMC Cancer ; 19(1): 783, 2019 Aug 07.
Article in English | MEDLINE | ID: mdl-31391007

ABSTRACT

BACKGROUND: In recent years, research on cancer predisposition germline variants has emerged as a prominent field. The identification of somatic mutations is based on a reliable mapping of the patient's germline variants. In addition, the statistics of germline variant frequencies in healthy individuals and cancer patients are the basis for seeking candidate cancer predisposition genes. The Cancer Genome Atlas (TCGA) is one of the main sources of such data, providing a diverse collection of molecular data including deep sequencing for more than 30 types of cancer from >10,000 patients. METHODS: Our hypothesis in this study is that whole exome sequences from blood samples of cancer patients are not expected to show systematic differences among cancer types. To test this hypothesis, we analyzed common and rare germline variants across six cancer types, covering 2241 samples from TCGA. In our analysis we accounted for inherent variables in the data including the different variant calling protocols, sequencing platforms, and ethnicity. RESULTS: We report on substantial batch effects in germline variants associated with cancer types. We attribute the effect to the specific sequencing centers that produced the data. Specifically, we measured 30% variability in the number of reported germline variants per sample across sequencing centers. The batch effect is further expressed in nucleotide composition and variant frequencies. Importantly, the batch effect causes substantial differences in germline variant distribution patterns across numerous genes, including prominent cancer predisposition genes such as BRCA1, RET, MAX, and KRAS. For most known cancer predisposition genes, we found a distinct batch-dependent difference in germline variants. CONCLUSION: TCGA germline data are exposed to strong batch effects with substantial variability among TCGA sequencing centers. We claim that those batch effects are consequential for numerous TCGA pan-cancer studies. In particular, these effects may compromise the reliability of, and the power to detect, new cancer predisposition genes. Furthermore, interpretation of pan-cancer analyses should be revisited in view of the source of the genomic data after accounting for the reported batch effects.


Subjects
Exome , Genome, Human , Genomics , Germ-Line Mutation , Neoplasms/genetics , Genetic Association Studies , Genetic Predisposition to Disease , Humans , Kaplan-Meier Estimate , Neoplasms/diagnosis , Neoplasms/mortality , Neoplasms/therapy , Precision Medicine/methods
5.
Nat Genet ; 55(9): 1512-1522, 2023 09.
Article in English | MEDLINE | ID: mdl-37563329

ABSTRACT

Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.
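
The abstract does not spell out the scoring rule. One common way to score a missense variant with a protein language model is a log-likelihood ratio between the mutant and wild-type amino acid at the mutated position, and the sketch below assumes that rule. No real model is called: the probability table is made up for illustration, and the function name is hypothetical.

```python
import math


def llr_score(position_probs, wt_aa, mut_aa):
    """Log-likelihood ratio of the mutant vs. the wild-type amino acid.

    position_probs: hypothetical per-position amino-acid probabilities,
    standing in for the output of a protein language model.
    More negative values suggest a more damaging substitution.
    """
    return math.log(position_probs[mut_aa]) - math.log(position_probs[wt_aa])


# Made-up (partial) probability table for a single position whose
# wild-type residue is leucine.
toy_probs = {"L": 0.70, "I": 0.20, "V": 0.08, "P": 0.01}

conservative = llr_score(toy_probs, "L", "I")  # similar residue, mild penalty
disruptive = llr_score(toy_probs, "L", "P")    # unlikely residue, strong penalty
```

Under this assumed rule, a variant would be called damaging when its score falls below a threshold calibrated on labeled variants.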


Subjects
Computational Biology , Software , Humans , Computational Biology/methods , Mutation, Missense/genetics , Proteins/genetics , Genome, Human/genetics
6.
bioRxiv ; 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38106161

ABSTRACT

Autosomal dominant polycystic kidney disease (ADPKD) is the leading monogenic cause of kidney failure and affects millions of people worldwide. Despite the prevalence of this monogenic disorder, our limited mechanistic understanding of ADPKD has hindered therapeutic development. Here, we successfully developed bioassays that functionally classify missense variants in polycystin-1 (PC1). Strikingly, ADPKD pathogenic missense variants cluster into two major categories: 1) those that disrupt polycystin cell surface localization or 2) those that attenuate polycystin ion channel activity. We found that polycystin channels with defective surface localization could be rescued with a small molecule. We propose that small-molecule-based strategies to improve polycystin cell surface localization and channel function will be effective therapies for ADPKD patients.

7.
medRxiv ; 2023 Sep 18.
Article in English | MEDLINE | ID: mdl-37745486

ABSTRACT

Over three percent of people carry a dominant pathogenic mutation, yet only a fraction of carriers develop disease (incomplete penetrance), and phenotypes from mutations in the same gene range from mild to severe (variable expressivity). Here, we investigate underlying mechanisms for this heterogeneity: variable variant effect sizes, carrier polygenic backgrounds, and modulation of carrier effect by genetic background (epistasis). We leveraged exomes and clinical phenotypes from the UK Biobank and the Mt. Sinai BioMe Biobank to identify carriers of pathogenic variants affecting cardiometabolic traits. We employed recently developed methods to study these cohorts, observing strong statistical support and clinical translational potential for all three mechanisms of variable penetrance and expressivity. For example, scores from our recent model of variant pathogenicity were tightly correlated with phenotype amongst clinical variant carriers, they predicted effects of variants of unknown significance, and they distinguished gain- from loss-of-function variants. We also found that polygenic scores predicted phenotypes amongst pathogenic carriers and that epistatic effects can exceed main carrier effects by an order of magnitude.

8.
bioRxiv ; 2023 Mar 22.
Article in English | MEDLINE | ID: mdl-36993359

ABSTRACT

CRISPR-Cas9 genome editing has enabled advanced T cell therapies, but occasional loss of the targeted chromosome remains a safety concern. To investigate whether Cas9-induced chromosome loss is a universal phenomenon and evaluate its clinical significance, we conducted a systematic analysis in primary human T cells. Arrayed and pooled CRISPR screens revealed that chromosome loss was generalizable across the genome and resulted in partial and entire loss of the chromosome, including in pre-clinical chimeric antigen receptor T cells. T cells with chromosome loss persisted for weeks in culture, implying the potential to interfere with clinical use. A modified cell manufacturing process, employed in our first-in-human clinical trial of Cas9-engineered T cells, dramatically reduced chromosome loss while largely preserving genome editing efficacy. Expression of p53 correlated with protection from chromosome loss observed in this protocol, suggesting both a mechanism and strategy for T cell engineering that mitigates this genotoxicity in the clinic.

9.
Genome Biol ; 23(1): 131, 2022 06 20.
Article in English | MEDLINE | ID: mdl-35725481

ABSTRACT

Genetic studies of human traits have revolutionized our understanding of the variation between individuals, and yet, the genetics of most traits is still poorly understood. In this review, we highlight the major open problems that need to be solved, and by discussing these challenges provide a primer to the field. We cover general issues such as population structure, epistasis and gene-environment interactions, data-related issues such as ancestry diversity and rare genetic variants, and specific challenges related to heritability estimates, genetic association studies, and polygenic risk scores. We emphasize the interconnectedness of these problems and suggest promising avenues to address them.


Subjects
Genome-Wide Association Study , Multifactorial Inheritance , Gene-Environment Interaction , Humans , Phenotype , Polymorphism, Single Nucleotide
10.
NAR Genom Bioinform ; 3(3): lqab079, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34541526

ABSTRACT

Human genetic variation in coding regions is fundamental to the study of protein structure and function. Most methods for interpreting missense variants consider substitution measures derived from homologous proteins across different species. In this study, we introduce human-specific amino acid (AA) substitution matrices that are based on genetic variations in the modern human population. We analyzed the frequencies of >4.8M single nucleotide variants (SNVs) at codon and AA resolution and compiled human-centric substitution matrices that are fundamentally different from classic cross-species matrices (e.g. BLOSUM, PAM). Our matrices are asymmetric, with some AA replacements showing significant directional preference. Moreover, these AA matrices are only partly predicted by nucleotide substitution rates. We further test the utility of our matrices in exposing functional signals of experimentally-validated protein annotations. A significant reduction in AA transition frequencies was observed across nine post-translational modification (PTM) types and four ion-binding sites. Our results propose a purifying selection signal in the human proteome across a diverse set of functional protein annotations and provide an empirical baseline for interpreting human genetic variation in coding regions.
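
To make the notion of an asymmetric substitution matrix concrete, the sketch below tallies directional amino-acid substitutions from a toy list of variants. It is an illustration only: the variant list is invented, and normalizing by the number of variants per reference amino acid is an assumption, not necessarily the normalization used in the study.

```python
from collections import Counter


def substitution_frequencies(variants):
    """Directional substitution frequencies from (ref_aa, alt_aa) pairs.

    Each frequency is the share of variants starting at ref_aa that end
    at alt_aa, so the (ref, alt) and (alt, ref) entries can differ.
    """
    counts = Counter(variants)
    ref_totals = Counter(ref for ref, _ in variants)
    return {
        (ref, alt): count / ref_totals[ref]
        for (ref, alt), count in counts.items()
    }


# Invented variants: arginine-to-histidine is seen more often than the reverse.
toy_variants = [("R", "H")] * 3 + [("R", "C")] + [("H", "R")] + [("H", "Y")]
freqs = substitution_frequencies(toy_variants)
```

In this toy data the R-to-H entry (0.75) differs from the H-to-R entry (0.5), which is the kind of directional preference the abstract describes.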

11.
Cancer Res ; 81(4): 1178-1185, 2021 02 15.
Article in English | MEDLINE | ID: mdl-33277365

ABSTRACT

Contemporary catalogues of cancer driver genes rely primarily on high mutation rates as evidence for gene selection in tumors. Here, we present The Functional Alteration Bias Recovery In Coding-regions Cancer Portal, a comprehensive catalogue of gene selection in cancer based purely on the biochemical functional effects of mutations at the protein level. Gene selection in the portal is quantified by combining genomics data with rich proteomic annotations. Genes are ranked according to the strength of evidence for selection in tumor, based on rigorous and robust statistics. The portal covers the entire human coding genome (∼18,000 protein-coding genes) across 33 cancer types and pan-cancer. It includes a selected set of cross-references to the most relevant resources providing genomics, proteomics, and cancer-related information. We showcase the portal with known and overlooked cancer genes, demonstrating the utility of the portal via its simple visual interface, which allows users to pivot between gene-centric and cancer type views. The portal is available at fabric-cancer.huji.ac.il. SIGNIFICANCE: A new cancer portal quantifies and presents gene selection in tumor over the entire human coding genome across 33 cancer types and pan-cancer.


Subjects
Databases, Genetic , Genomics/methods , Neoplasms/genetics , Oncogenes/genetics , Open Reading Frames/genetics , Algorithms , Female , Genome, Human/genetics , Humans , Internet , Male , Mutation Rate , Neoplasms/classification , Precancerous Conditions/classification , Precancerous Conditions/genetics , Selection, Genetic , Software , User-Computer Interface
12.
Sci Rep ; 11(1): 14901, 2021 07 21.
Article in English | MEDLINE | ID: mdl-34290314

ABSTRACT

The characterization of germline genetic variation affecting cancer risk, known as cancer predisposition, is fundamental to preventive and personalized medicine. Studies of genetic cancer predisposition typically identify significant genomic regions based on family-based cohorts or genome-wide association studies (GWAS). However, the results of such studies rarely provide biological insight or functional interpretation. In this study, we conducted a comprehensive analysis of cancer predisposition in the UK Biobank cohort using a new gene-based method for detecting protein-coding genes that are functionally interpretable. Specifically, we conducted proteome-wide association studies (PWAS) to identify genetic associations mediated by alterations to protein function. With PWAS, we identified 110 significant gene-cancer associations in 70 unique genomic regions across nine cancer types and pan-cancer. In 48 of the 110 PWAS associations (44%), estimated gene damage is associated with reduced rather than elevated cancer risk, suggesting a protective effect. Together with standard GWAS, we implicated 145 unique genomic loci with cancer risk. While most of these genomic regions are supported by external evidence, our results also highlight many novel loci. Based on the capacity of PWAS to detect non-additive genetic effects, we found that 46% of the PWAS-significant cancer regions exhibited exclusive recessive inheritance. These results highlight the importance of recessive genetic effects, without relying on familial studies. Finally, we show that many of the detected genes exert substantial cancer risk in the studied cohort determined by a quantitative functional description, suggesting their relevance for diagnosis and genetic consulting.


Subjects
Genes, Recessive/genetics , Genetic Predisposition to Disease/genetics , Genome-Wide Association Study/methods , Neoplasm Proteins/genetics , Neoplasm Proteins/physiology , Neoplasms/genetics , Proteome/genetics , Cohort Studies , Female , Genetic Counseling , Genetic Loci/genetics , Germ-Line Mutation , Humans , Male , Neoplasms/diagnosis , Risk , United Kingdom
13.
Comput Struct Biotechnol J ; 19: 1750-1758, 2021.
Article in English | MEDLINE | ID: mdl-33897979

ABSTRACT

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
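
As a small illustration of treating a protein as text, the sketch below builds a bag-of-words representation from overlapping k-mers (the protein analogue of n-grams). The function name and the toy sequence are assumptions for the example.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in an amino-acid sequence (bag-of-k-mers)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence; a real protein would be hundreds of residues long.
counts = kmer_counts("MKVLKVL", k=3)
```

The resulting counts can serve as a fixed vocabulary feature vector for classic machine-learning models, in contrast to the learned contextual embeddings discussed in the review.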

14.
J Pers Med ; 11(6), 2021 Jun 21.
Article in English | MEDLINE | ID: mdl-34205563

ABSTRACT

One of the major challenges in the post-genomic era is elucidating the genetic basis of human diseases. In recent years, studies have shown that polygenic risk scores (PRS), based on aggregated information from millions of variants across the human genome, can estimate individual risk for common diseases. However, current medical practice still relies predominantly on physiological and clinical indicators to assess personal disease risk. For example, caregivers mark individuals with high body mass index (BMI) as having an increased risk of developing type 2 diabetes (T2D). An important question is whether combining PRS with clinical metrics can increase the power of disease prediction, in particular from early life. In this work we examined this question, focusing on T2D. We present here a sex-specific integrated approach that combines PRS with additional measurements and age to define a new risk score. We show that such an approach, combining adult BMI and PRS, achieves considerably better prediction than either measure alone on unrelated Caucasians in the UK Biobank (UKB, n = 290,584). Likewise, integrating PRS with self-reports on birth weight (n = 172,239) and comparative body size at age ten (n = 287,203) also substantially enhances prediction compared with each of its components. While the integration of PRS with BMI achieved better results than the other measurements, the latter are early-life measurements that can be integrated as early as childhood, allowing preemptive intervention for those at high risk of developing T2D. Our integrated approach can be easily generalized to other diseases, given the relevant early-life measurements.

15.
Genome Biol ; 21(1): 173, 2020 07 14.
Article in English | MEDLINE | ID: mdl-32665031

ABSTRACT

We introduce Proteome-Wide Association Study (PWAS), a new method for detecting gene-phenotype associations mediated by protein function alterations. PWAS aggregates the signal of all variants jointly affecting a protein-coding gene and assesses their overall impact on the protein's function using machine learning and probabilistic models. Subsequently, it tests whether the gene exhibits functional variability between individuals that correlates with the phenotype of interest. PWAS can capture complex modes of heritability, including recessive inheritance. A comparison with GWAS and other existing methods proves its capacity to recover causal protein-coding genes and highlight new associations. PWAS is available as a command-line tool.
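
The aggregation of variants into gene-level scores under dominant and recessive modes can be sketched as below. This is not the PWAS implementation. It is a simplified illustration that assumes each alternate allele damages the protein independently with a hypothetical probability, and it defines the dominant score as the probability of at least one damaging hit and the recessive score as the probability of at least two.

```python
def gene_damage_scores(variants):
    """Aggregate per-variant damage probabilities into gene-level scores.

    variants: list of (p_damage, allele_count) pairs for one individual,
    where p_damage is a hypothetical probability that one copy of the
    variant damages the protein and allele_count is 0, 1 or 2.
    Returns (dominant, recessive) = (P(>=1 hit), P(>=2 hits)).
    """
    p_zero = 1.0  # probability of no damaging hit so far
    p_one = 0.0   # probability of exactly one damaging hit so far
    for p_damage, allele_count in variants:
        for _ in range(allele_count):
            # Update "exactly one" before "zero", since it uses the old p_zero.
            p_one = p_one * (1 - p_damage) + p_zero * p_damage
            p_zero = p_zero * (1 - p_damage)
    dominant = 1 - p_zero
    recessive = 1 - p_zero - p_one
    return dominant, recessive


# Individual carrying two different heterozygous variants in the same gene.
dominant, recessive = gene_damage_scores([(0.9, 1), (0.5, 1)])
```

Scores like these, computed per individual, could then be tested for association with a phenotype, once per inheritance mode.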


Subjects
Phenotype , Proteome , Proteomics/methods , Software , Colorectal Neoplasms/genetics , Genome-Wide Association Study , Humans
16.
Viruses ; 11(5), 2019 04 30.
Article in English | MEDLINE | ID: mdl-31052218

ABSTRACT

Viruses are the most prevalent infectious agents, populating almost every ecosystem on earth. Most viruses carry only a handful of genes supporting their replication and the production of capsids. It came as a great surprise in 2003 when the first giant virus was discovered and found to have a >1 Mbp genome encoding almost a thousand proteins. Following this first discovery, dozens of giant virus strains across several viral families have been reported. Here, we provide an updated quantitative and qualitative view on giant viruses and elaborate on their shared and variable features. We review the complexity of giant viral proteomes, which include functions traditionally associated only with cellular organisms. These unprecedented functions include components of the translation machinery, DNA maintenance, and metabolic enzymes. We discuss the possible underlying evolutionary processes and mechanisms that might have shaped the diversity of giant viruses and their genomes, highlighting their remarkable capacity to hijack genes and genomic sequences from their hosts and environments. This leads us to examine prominent theories regarding the origin of giant viruses. Finally, we present the emerging ecological view of giant viruses, found across widespread habitats and ecological systems, with respect to the environment and human health.


Subjects
Giant Viruses/classification , Giant Viruses/physiology , Biological Evolution , Ecosystem , Genome, Viral , Genomics/methods , Host-Pathogen Interactions , Humans
17.
Biol Direct ; 11: 26, 2016 05 21.
Article in English | MEDLINE | ID: mdl-27209091

ABSTRACT

BACKGROUND: Viruses are the simplest replicating units, characterized by a limited number of coding genes and an exceptionally high rate of overlapping genes. We sought a unified evolutionary explanation that accounts for their genome sizes, gene overlapping and capsid properties. RESULTS: We performed an unbiased statistical analysis of ~100 families within ~400 genera that comprise the currently known viral world. We found that the volume utilization of capsids is often low, and greatly varies among viral families. Furthermore, although viruses span three orders of magnitude in genome length, they almost never have over 1500 overlapping nucleotides, or over four significantly overlapping genes per virus. CONCLUSIONS: Our findings undermine the generality of the compression theory, which emphasizes optimal packing and length dependency to explain overlapping genes and capsid size in viral genomes. Instead, we propose that gene novelty and evolution exploration offer better explanations to size constraints and gene overlapping in all viruses. REVIEWERS: This article was reviewed by Arne Elofsson and David Kreil.
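
The quantity "overlapping nucleotides" can be made concrete with a short sketch. It counts genome positions covered by two or more genes, given gene coordinates. The function name, the 1-based inclusive coordinate convention and the toy genome are assumptions for the example; the study's own definition of overlap may differ (for instance in how strands are handled).

```python
def overlapping_nucleotides(genes):
    """Count positions covered by at least two genes.

    genes: list of (start, end) coordinates, 1-based and inclusive.
    Uses a sweep over interval boundaries, tracking coverage depth.
    """
    events = []
    for start, end in genes:
        events.append((start, 1))     # coverage rises at the gene start
        events.append((end + 1, -1))  # and falls just past the gene end
    events.sort()

    depth = 0
    overlap = 0
    previous = None
    for position, delta in events:
        if previous is not None and depth >= 2:
            overlap += position - previous
        depth += delta
        previous = position
    return overlap


# Toy genome: the first two genes share 10 nt, the third lies inside the second.
toy_genes = [(1, 100), (91, 200), (150, 160)]
total_overlap = overlapping_nucleotides(toy_genes)
```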


Subjects
Capsid/physiology , Evolution, Molecular , Genes, Overlapping , Genome Size , Genome, Viral , Viruses/genetics
18.
Article in English | MEDLINE | ID: mdl-27694209

ABSTRACT

Determining residue-level protein properties, such as sites of post-translational modifications (PTMs), is vital to understanding protein function. Experimental methods are costly and time-consuming, while traditional rule-based computational methods fail to annotate sites lacking substantial similarity. Machine Learning (ML) methods are becoming fundamental in annotating unknown proteins and their heterogeneous properties. We present ASAP (Amino-acid Sequence Annotation Prediction), a universal ML framework for predicting residue-level properties. ASAP extracts numerous features from raw sequences, and supports easy integration of external features such as secondary structure, solvent accessibility, intrinsic disorder or PSSM profiles. Features are then used to train ML classifiers. ASAP can create new classifiers within minutes for a variety of tasks, including PTM prediction (e.g. cleavage sites by convertase, phosphoserine modification). We present a detailed case study for ASAP: CleavePred, an ASAP-based model to predict protein precursor cleavage sites, with state-of-the-art results. Protein cleavage is a PTM shared by a wide variety of proteins sharing minimal sequence similarity. Current rule-based methods suffer from high false positive rates, making them suboptimal. The high performance of CleavePred makes it suitable for analyzing new proteomes at a genomic scale. The tool is attractive to protein design, mass spectrometry search engines and the discovery of new bioactive peptides from precursors. ASAP functions as a baseline approach for residue-level protein sequence prediction. CleavePred is freely accessible as a web-based application. Both ASAP and CleavePred are open-source with a flexible Python API. Database URL: ASAP and CleavePred source code, webtool and tutorials are available at: https://github.com/ddofer/asap; http://protonet.cs.huji.ac.il/cleavepred.
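
Residue-level prediction usually starts by describing each residue through its local sequence context. The sketch below shows one generic way to do that with fixed-size windows. It does not use or reproduce the ASAP API: the function name, the padding character and the window size are assumptions for illustration.

```python
def residue_windows(sequence, flank=2, pad="_"):
    """Return one window per residue, centered on it, with `flank` neighbors per side.

    Positions beyond the sequence ends are filled with the `pad` character,
    so every residue gets a window of length 2 * flank + 1.
    """
    padded = pad * flank + sequence + pad * flank
    width = 2 * flank + 1
    return [padded[i:i + width] for i in range(len(sequence))]


# Each window could be turned into features (e.g. one-hot encoding) and paired
# with a per-residue label such as "cleavage site" or "not a cleavage site".
windows = residue_windows("MKRS", flank=1)
```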


Subjects
Machine Learning , Molecular Sequence Annotation/methods , Peptides/genetics , Peptides/metabolism , Sequence Analysis, Protein/methods , Internet