Results 1 - 20 of 3,553
1.
Cell ; 2024 Sep 19.
Article in English | MEDLINE | ID: mdl-39326416

ABSTRACT

Interpretation of disease-causing genetic variants remains a challenge in human genetics. Current costs and complexity of deep mutational scanning methods are obstacles for achieving genome-wide resolution of variants in disease-related genes. Our framework, saturation mutagenesis-reinforced functional assays (SMuRF), offers simple and cost-effective saturation mutagenesis paired with streamlined functional assays to enhance the interpretation of unresolved variants. Applying SMuRF to neuromuscular disease genes FKRP and LARGE1, we generated functional scores for all possible coding single-nucleotide variants, which aid in resolving clinically reported variants of uncertain significance. SMuRF also demonstrates utility in predicting disease severity, resolving critical structural regions, and providing training datasets for the development of computational predictors. Overall, our approach enables variant-to-function insights for disease genes in a cost-effective manner that can be broadly implemented by standard research laboratories.

2.
Cell ; 172(3): 478-490.e15, 2018 01 25.
Article in English | MEDLINE | ID: mdl-29373829

ABSTRACT

Understanding the sequence determinants that give rise to diversity among individuals and species is the central challenge of genetics. However, despite ever greater numbers of sequenced genomes, most genome-wide association studies cannot distinguish causal variants from linked passenger mutations spanning many genes. We report that this inherent challenge can be overcome in model organisms. By pushing the advantages of inbred crossing to its practical limit in Saccharomyces cerevisiae, we improved the statistical resolution of linkage analysis to single nucleotides. This "super-resolution" approach allowed us to map 370 causal variants across 26 quantitative traits. Missense, synonymous, and cis-regulatory mutations collectively gave rise to phenotypic diversity, providing mechanistic insight into the basis of evolutionary divergence. Our data also systematically unmasked complex genetic architectures, revealing that multiple closely linked driver mutations frequently act on the same quantitative trait. Single-nucleotide mapping thus complements traditional deletion and overexpression screening paradigms and opens new frontiers in quantitative genetics.


Subject(s)
Genetic Linkage , Mutation , Phenotype , Polymorphism, Genetic , Chromosome Mapping/methods , Genome-Wide Association Study/methods , Quantitative Trait, Heritable , Saccharomyces cerevisiae/genetics
3.
Trends Genet ; 40(7): 587-600, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38658256

ABSTRACT

Population-scale sequencing efforts have catalogued substantial genetic variation in humans such that variant discovery dramatically outpaces interpretation. We discuss how single-cell sequencing is poised to reveal genetic mechanisms at a rate that may soon approach that of variant discovery. The functional genomics toolkit is sufficiently modular to systematically profile almost any type of variation within increasingly diverse contexts and with molecularly comprehensive and unbiased readouts. As a result, we can construct deep phenotypic atlases of variant effects that span the entire regulatory cascade. The same conceptual approach to interpreting genetic variation should be applied to engineering therapeutic cell states. In this way, variant mechanism discovery and cell state engineering will become reciprocating and iterative processes towards genomic medicine.


Subject(s)
Genetic Variation , Single-Cell Analysis , Humans , Single-Cell Analysis/methods , Genomics/methods , Genome, Human/genetics , Phenotype
4.
Am J Hum Genet ; 111(9): 2031-2043, 2024 Sep 05.
Article in English | MEDLINE | ID: mdl-39173626

ABSTRACT

In silico variant effect predictions are available for nearly all missense variants but played a minimal role in clinical variant classification because they were deemed to provide only supporting evidence. Recently, the ClinGen Sequence Variant Interpretation (SVI) Working Group updated recommendations for variant effect prediction use. By analyzing control pathogenic and benign variants across all genes, they were able to compute evidence strength for predictor score intervals with some intervals generating moderate, strong, or even very strong evidence. However, this genome-wide approach could obscure heterogeneous predictor performance in different genes. We quantified the gene-by-gene performance of two top predictors, REVEL and BayesDel, by analyzing control variants in each predictor score interval in 3,668 disease-relevant genes. Approximately 10% of intervals had sufficient control variants for analysis, and ∼70% of these intervals exceeded the maximum number of incorrect predictions implied by the SVI recommendations. These trending discordant intervals arose owing to the divergence of the gene-specific distribution of predictions from the genome-wide distribution, suggesting that gene-specific calibration is needed in many cases. Approximately 22% of ClinVar missense variants of uncertain significance in genes we analyzed (REVEL = 100,629, BayesDel = 71,928) had predictions in trending discordant intervals. Thus, genome-wide calibrations could result in many variants receiving inappropriate evidence strength. To facilitate a review of the SVI's calibrations, we developed a web application enabling visualization of gene-specific predictions and trending concordant and discordant intervals.


Subject(s)
Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Genome, Human , Mutation, Missense , Genetic Variation , Calibration , Software , Databases, Genetic
5.
Am J Hum Genet ; 111(2): 350-363, 2024 02 01.
Article in English | MEDLINE | ID: mdl-38237594

ABSTRACT

Our ability to determine the clinical impact of variants in 3' untranslated regions (UTRs) of genes remains poor. We provide a thorough analysis of 3' UTR variants from several datasets. Variants in putative regulatory elements, including RNA-binding protein motifs, eCLIP peaks, and microRNA sites, are up to 16 times more likely than variants not in these elements to have gene expression and phenotype associations. Variants in regulatory motifs result in allele-specific protein binding in cell lines and allele-specific gene expression differences in population studies. In addition, variants in shared regions of alternatively polyadenylated isoforms and those proximal to polyA sites are more likely to affect gene expression and phenotype. Finally, pathogenic 3' UTR variants in ClinVar are up to 20 times more likely than benign variants to fall in a regulatory site. We incorporated these findings into RegVar, a software tool that interprets regulatory elements and annotations for any 3' UTR variant and predicts whether the variant is likely to affect gene expression or phenotype. This tool will help prioritize variants for experimental studies and identify pathogenic variants in individuals.


Subject(s)
MicroRNAs , Humans , 3' Untranslated Regions/genetics , MicroRNAs/genetics , Regulatory Sequences, Nucleic Acid/genetics , Cell Line , Protein Binding
6.
Am J Hum Genet ; 2024 Aug 28.
Article in English | MEDLINE | ID: mdl-39226898

ABSTRACT

Variants that alter gene splicing are estimated to comprise up to a third of all disease-causing variants, yet they are hard to predict from DNA sequencing data alone. To overcome this, many groups are incorporating RNA-based analyses, which are resource intensive, particularly for diagnostic laboratories. There are thousands of functionally validated variants that induce mis-splicing; however, this information is not consolidated, and they are under-represented in ClinVar, which presents a barrier to variant interpretation and can result in duplication of validation efforts. To address this issue, we developed SpliceVarDB, an online database consolidating over 50,000 variants assayed for their effects on splicing in over 8,000 human genes. We evaluated over 500 published data sources and established a spliceogenicity scale to standardize, harmonize, and consolidate variant validation data generated by a range of experimental protocols. According to the strength of their supporting evidence, variants were classified as "splice-altering" (∼25%), "not splice-altering" (∼25%), and "low-frequency splice-altering" (∼50%), which correspond to weak or indeterminate evidence of spliceogenicity. Importantly, 55% of the splice-altering variants in SpliceVarDB are outside the canonical splice sites (5.6% are deep intronic). These variants can support the variant curation diagnostic pathway and can be used to provide the high-quality data necessary to develop more accurate in silico splicing predictors. The variants are accessible through an online platform, SpliceVarDB, with additional features for visualization, variant information, in silico predictions, and validation metrics. SpliceVarDB is a very large collection of splice-altering variants and is available at https://splicevardb.org.

7.
Am J Hum Genet ; 2024 Sep 17.
Article in English | MEDLINE | ID: mdl-39317201

ABSTRACT

The ClinGen Hereditary Breast, Ovarian, and Pancreatic Cancer (HBOP) Variant Curation Expert Panel (VCEP) is composed of internationally recognized experts in clinical genetics, molecular biology, and variant interpretation. This VCEP made specifications for the American College of Medical Genetics and Association for Molecular Pathology (ACMG/AMP) guidelines for the ataxia telangiectasia mutated (ATM) gene according to the ClinGen protocol. These gene-specific rules for ATM were modified from the ACMG/AMP guidelines and were tested against 33 ATM variants of various types and classifications in a pilot curation phase. The pilot revealed a majority agreement between the HBOP VCEP classifications and the ClinVar-deposited classifications. Six pilot variants had conflicting interpretations in ClinVar, and re-evaluation with the VCEP's ATM-specific rules resulted in four that were classified as benign, one as likely pathogenic, and one as a variant of uncertain significance (VUS) by the VCEP, improving the certainty of interpretations in the public domain. Overall, 28 of the 33 pilot variants were not VUS, leading to an 85% classification rate. The ClinGen-approved, modified rules demonstrated value for improved interpretation of variants in ATM.

8.
Hum Mol Genet ; 33(5): 426-434, 2024 Feb 18.
Article in English | MEDLINE | ID: mdl-37956408

ABSTRACT

BACKGROUND: Pathogenic germline variants in BRCA1-Associated Protein 1 (BAP1) cause BAP1 tumor predisposition syndrome (BAP1-TPDS). Carriers are at particular risk of uveal (UM) and cutaneous melanoma, malignant mesothelioma, and clear cell renal carcinoma. Approximately half of the increasingly reported BAP1 variants lack accurate classification. Correct interpretation of pathogenicity can improve the prognosis of patients through tumor screening and a better understanding of BAP1-TPDS. METHODS: We edited five rare BAP1 variants with differing functional characteristics, identified from patients with UM, in HAP1 cells using CRISPR-Cas9 and assayed their effect on cell adhesion/spreading (at 4 h) and proliferation (at 48 h), measured as cell index (CI), using the xCELLigence real-time analysis system. RESULTS: In BAP1 knockout HAP1 cultures, cell number was half that of wild type (WT) cultures at 48 h (p = 0.00021), reaching confluence later, and CI was reduced by 78% (p < 0.0001). BAP1-TPDS-associated null variants c.67+1G>T and c.1780_1781insT, and a likely pathogenic missense variant c.281A>G, reduced adhesion (all p ≤ 0.015) and proliferation by 74%-83% (all p ≤ 0.032). Another likely pathogenic missense variant, c.680G>A, reduced both by at least 50% (all p ≤ 0.032), whereas cells edited with the likely benign variant c.1526C>T grew similarly to WT. CONCLUSIONS: BAP1 is essential for optimal fitness of HAP1 cells. Pathogenic and likely pathogenic BAP1 variants reduced cell fitness, reflected in adhesion/spreading and proliferation properties. Further, moderate effects were quantifiable. Variant modelling in HAP1 with CRISPR-Cas9 enabled functional analysis of coding and non-coding region variants in an endogenous expression system.


Subject(s)
Kidney Neoplasms , Melanoma , Skin Neoplasms , Uveal Neoplasms , Humans , Melanoma/pathology , Virulence , Genetic Predisposition to Disease , Germ-Line Mutation/genetics , Ubiquitin Thiolesterase/genetics , Ubiquitin Thiolesterase/metabolism , Tumor Suppressor Proteins/genetics
9.
Trends Genet ; 39(6): 442-450, 2023 06.
Article in English | MEDLINE | ID: mdl-36858880

ABSTRACT

Genomic studies of human disorders are often performed by distinct research communities (i.e., focused on rare diseases, common diseases, or cancer). Despite underlying differences in the mechanistic origin of different disease categories, these studies share the goal of identifying causal genomic events that are critical for the clinical manifestation of the disease phenotype. Moreover, these studies face common challenges, including understanding the complex genetic architecture of the disease, deciphering the impact of variants on multiple scales, and interpreting noncoding mutations. Here, we highlight these challenges in depth and argue that properly addressing them will require a more unified vocabulary and approach across disease communities. Toward this goal, we present a unified perspective on relating variant impact to various genomic disorders.


Subject(s)
Genome , Genomics , Humans , Mutation , Phenotype
10.
Annu Rev Genomics Hum Genet ; 24: 151-176, 2023 08 25.
Article in English | MEDLINE | ID: mdl-37285546

ABSTRACT

DECIPHER (Database of Genomic Variation and Phenotype in Humans Using Ensembl Resources) shares candidate diagnostic variants and phenotypic data from patients with genetic disorders to facilitate research and improve the diagnosis, management, and therapy of rare diseases. The platform sits at the boundary between genomic research and the clinical community. DECIPHER aims to ensure that the most up-to-date data are made rapidly available within its interpretation interfaces to improve clinical care. Newly integrated cardiac case-control data that provide evidence of gene-disease associations and inform variant interpretation exemplify this mission. New research resources are presented in a format optimized for use by a broad range of professionals supporting the delivery of genomic medicine. The interfaces within DECIPHER integrate and contextualize variant and phenotypic data, helping to determine a robust clinico-molecular diagnosis for rare-disease patients, which combines both variant classification and clinical fit. DECIPHER supports discovery research, connecting individuals within the rare-disease community to pursue hypothesis-driven research.


Subject(s)
Genomics , Genomics/methods , Humans , Rare Diseases/genetics , Alleles , Practice Guidelines as Topic , DNA Copy Number Variations , Databases, Genetic
11.
Am J Hum Genet ; 110(9): 1496-1508, 2023 09 07.
Article in English | MEDLINE | ID: mdl-37633279

ABSTRACT

Predicted loss of function (pLoF) variants are often highly deleterious and play an important role in disease biology, but many pLoF variants may not result in loss of function (LoF). Here we present a framework that advances interpretation of pLoF variants in research and clinical settings by considering three categories of LoF evasion: (1) predicted rescue by secondary sequence properties, (2) uncertain biological relevance, and (3) potential technical artifacts. We also provide recommendations on adjustments to ACMG/AMP guidelines' PVS1 criterion. Applying this framework to all high-confidence pLoF variants in 22 genes associated with autosomal-recessive disease from the Genome Aggregation Database (gnomAD v.2.1.1) revealed predicted LoF evasion or potential artifacts in 27.3% (304/1,113) of variants. The major reasons were location in the last exon, in a homopolymer repeat, in a low proportion expressed across transcripts (pext) scored region, or the presence of cryptic in-frame splice rescues. Variants predicted to evade LoF or to be potential artifacts were enriched for ClinVar benign variants. PVS1 was downgraded in 99.4% (162/163) of pLoF variants predicted as likely not LoF/not LoF, with 17.2% (28/163) downgraded as a result of our framework, adding to previous guidelines. Variant pathogenicity was affected (mostly from likely pathogenic to VUS) in 20 (71.4%) of these 28 variants. This framework guides assessment of pLoF variants beyond standard annotation pipelines and substantially reduces false positive rates, which is key to ensure accurate LoF variant prediction in both a research and clinical setting.


Subject(s)
Inheritance Patterns , Humans , Exons , Uncertainty
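
The percentages quoted in the abstract above follow directly from the counts it states. A minimal Python sketch that recomputes them is given here as a reader's check; the helper name `pct` and the dictionary labels are illustrative and do not come from the article.

```python
def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, rounded to one decimal."""
    return round(100 * numerator / denominator, 1)

# (numerator, denominator, percentage as quoted in the abstract)
reported = {
    "pLoF variants with predicted LoF evasion or artifacts": (304, 1113, 27.3),
    "likely not LoF/not LoF variants with PVS1 downgraded": (162, 163, 99.4),
    "downgrades attributable to the new framework": (28, 163, 17.2),
    "of those, variants whose pathogenicity call changed": (20, 28, 71.4),
}

for label, (num, den, quoted) in reported.items():
    computed = pct(num, den)
    status = "matches" if computed == quoted else "differs from"
    print(f"{label}: {num}/{den} = {computed}% ({status} quoted {quoted}%)")
```

All four recomputed values agree with the figures in the abstract to one decimal place.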
12.
Am J Hum Genet ; 110(10): 1769-1786, 2023 10 05.
Article in English | MEDLINE | ID: mdl-37729906

ABSTRACT

Defects in hydroxymethylbilane synthase (HMBS) can cause acute intermittent porphyria (AIP), an acute neurological disease. Although sequencing-based diagnosis can be definitive, ∼⅓ of clinical HMBS variants are missense variants, and most clinically reported HMBS missense variants are designated as "variants of uncertain significance" (VUSs). Using saturation mutagenesis, en masse selection, and sequencing, we applied a multiplexed validated assay to both the erythroid-specific and ubiquitous isoforms of HMBS, obtaining confident functional impact scores for >84% of all possible amino acid substitutions. The resulting variant effect maps generally agreed with biochemical expectations and provide further evidence that HMBS can function as a monomer. Additionally, the maps implicated specific residues as having roles in active site dynamics, which was further supported by molecular dynamics simulations. Most importantly, these maps can help discriminate pathogenic from benign HMBS variants, proactively providing evidence even for yet-to-be-observed clinical missense variants.


Subject(s)
Hydroxymethylbilane Synthase , Porphyria, Acute Intermittent , Humans , Hydroxymethylbilane Synthase/chemistry , Hydroxymethylbilane Synthase/genetics , Hydroxymethylbilane Synthase/metabolism , Mutation, Missense/genetics , Porphyria, Acute Intermittent/diagnosis , Porphyria, Acute Intermittent/genetics , Amino Acid Substitution , Molecular Dynamics Simulation
13.
Am J Hum Genet ; 110(1): 92-104, 2023 01 05.
Article in English | MEDLINE | ID: mdl-36563679

ABSTRACT

Variant interpretation remains a major challenge in medical genetics. We developed Meta-Domain HotSpot (MDHS) to identify mutational hotspots across homologous protein domains. We applied MDHS to a dataset of 45,221 de novo mutations (DNMs) from 31,058 individuals with neurodevelopmental disorders (NDDs) and identified three significantly enriched missense DNM hotspots in the ion transport protein domain family (PF00520). The 37 unique missense DNMs that drive enrichment affect 25 genes, 19 of which were previously associated with NDDs. 3D protein structure modeling supports the hypothesis of function-altering effects of these mutations. Hotspot genes have a unique expression pattern in tissue, and we used this pattern alongside in silico predictors and population constraint information to identify candidate NDD-associated genes. We also propose a lenient version of our method, which identifies 32 hotspot positions across 16 different protein domains. These positions are enriched for likely pathogenic variation in clinical databases and DNMs in other genetic disorders.


Subject(s)
Neurodevelopmental Disorders , Humans , Protein Domains/genetics , Mutation/genetics , Neurodevelopmental Disorders/genetics
14.
Am J Hum Genet ; 110(6): 940-949, 2023 06 01.
Article in English | MEDLINE | ID: mdl-37236177

ABSTRACT

While pathogenic variants can significantly increase disease risk, it is still challenging to estimate the clinical impact of rare missense variants more generally. Even in genes such as BRCA2 or PALB2, large cohort studies find no significant association between breast cancer and rare missense variants collectively. Here, we introduce REGatta, a method to estimate clinical risk from variants in smaller segments of individual genes. We first define these regions by using the density of pathogenic diagnostic reports and then calculate the relative risk in each region by using over 200,000 exome sequences in the UK Biobank. We apply this method in 13 genes with established roles across several monogenic disorders. In genes with no significant difference at the gene level, this approach significantly separates disease risk for individuals with rare missense variants at higher or lower risk (BRCA2 regional model OR = 1.46 [1.12, 1.79], p = 0.0036 vs. BRCA2 gene model OR = 0.96 [0.85, 1.07] p = 0.4171). We find high concordance between these regional risk estimates and high-throughput functional assays of variant impact. We compare our method with existing methods and the use of protein domains (Pfam) as regions and find REGatta better identifies individuals at elevated or reduced risk. These regions provide useful priors and are potentially useful for improving risk assessment for genes associated with monogenic diseases.


Subject(s)
Breast Neoplasms , Genetic Predisposition to Disease , Humans , Female , BRCA2 Protein/genetics , Mutation, Missense , Sequence Analysis, DNA , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Cohort Studies
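
The abstract above reports odds ratios (OR) with intervals but does not say how they were estimated. As a generic illustration only, the sketch below computes an odds ratio and a Wald 95% confidence interval from a 2×2 table. This is a textbook calculation and an assumption on our part, not REGatta's method; the function name and the counts in the usage lines are hypothetical.

```python
import math

def odds_ratio_wald(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: carriers with disease       b: carriers without disease
    c: non-carriers with disease   d: non-carriers without disease
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts, chosen only to exercise the function.
or_, lo, hi = odds_ratio_wald(20, 80, 10, 90)
print(f"OR = {or_:.2f} [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes 1 is conventionally read as a significant difference in risk, which is the sense in which the regional and gene-level BRCA2 estimates in the abstract differ.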
15.
Am J Hum Genet ; 110(5): 863-879, 2023 05 04.
Article in English | MEDLINE | ID: mdl-37146589

ABSTRACT

Deleterious mutations in the X-linked gene encoding ornithine transcarbamylase (OTC) cause the most common urea cycle disorder, OTC deficiency. This rare but highly actionable disease can present with severe neonatal onset in males or with later onset in either sex. Individuals with neonatal onset appear normal at birth but rapidly develop hyperammonemia, which can progress to cerebral edema, coma, and death, outcomes ameliorated by rapid diagnosis and treatment. Here, we develop a high-throughput functional assay for human OTC and individually measure the impact of 1,570 variants, 84% of all SNV-accessible missense mutations. Comparison to existing clinical significance calls demonstrated that our assay distinguishes known benign from pathogenic variants and variants with neonatal onset from late-onset disease presentation. This functional stratification allowed us to identify score ranges corresponding to clinically relevant levels of impairment of OTC activity. Examining the results of our assay in the context of protein structure further allowed us to identify a 13-amino-acid domain, the SMG loop, whose function appears to be required in human cells but not in yeast. Finally, inclusion of our data as PS3 evidence under the current ACMG guidelines, in a pilot reclassification of 34 variants with complete loss of activity, would change the classification of 22 from variants of unknown significance to clinically actionable likely pathogenic variants. These results illustrate how large-scale functional assays are especially powerful when applied to rare genetic diseases.


Subject(s)
Hyperammonemia , Ornithine Carbamoyltransferase Deficiency Disease , Ornithine Carbamoyltransferase , Humans , Amino Acid Substitution , Hyperammonemia/etiology , Hyperammonemia/genetics , Mutation, Missense/genetics , Ornithine Carbamoyltransferase/genetics , Ornithine Carbamoyltransferase Deficiency Disease/genetics , Ornithine Carbamoyltransferase Deficiency Disease/diagnosis , Ornithine Carbamoyltransferase Deficiency Disease/therapy
16.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38388680

ABSTRACT

CRISPR-Cas9 is a groundbreaking genome-editing tool that harnesses bacterial defense systems to alter DNA sequences accurately. This innovative technology holds vast promise in multiple domains like biotechnology, agriculture and medicine. However, such power does not come without its own peril, and one such issue is the potential for unintended modifications (Off-Target), which highlights the need for accurate prediction and mitigation strategies. Though previous studies have demonstrated improvement in Off-Target prediction capability with the application of deep learning, they often struggle with the precision-recall trade-off, which limits their effectiveness, and they do not provide proper interpretation of the complex decision-making process of their models. To address these limitations, we have thoroughly explored deep learning networks, particularly the recurrent neural network based models, leveraging their established success in handling sequence data. Furthermore, we have employed a genetic algorithm for hyperparameter tuning to optimize these models' performance. The results from our experiments demonstrate significant performance improvement compared with the current state-of-the-art in Off-Target prediction, highlighting the efficacy of our approach. Furthermore, leveraging the power of the integrated gradient method, we make an effort to interpret our models, resulting in a detailed analysis and understanding of the underlying factors that contribute to Off-Target predictions, in particular the presence of two sub-regions in the seed region of single guide RNA, which extends the established biological hypothesis of Off-Target effects. To the best of our knowledge, our model can be considered as the first model combining high efficacy, interpretability and a desirable balance between precision and recall.


Subject(s)
CRISPR-Cas Systems , Deep Learning , Gene Editing/methods , RNA, Guide, CRISPR-Cas Systems , Neural Networks, Computer
17.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39234953

ABSTRACT

The internal ribosome entry site (IRES) is a cis-regulatory element that can initiate translation in a cap-independent manner. It is often related to cellular processes and many diseases. Thus, identifying IRES elements is important for understanding their mechanism and finding potential therapeutic strategies for relevant diseases, but identifying them by experimental methods is time-consuming and laborious. Many bioinformatics tools have been developed to predict IRES, but all these tools are based on structure similarity or machine learning algorithms. Here, we introduced a deep learning model named DeepIRES for precisely identifying IRES elements in messenger RNA (mRNA) sequences. DeepIRES is a hybrid model incorporating dilated 1D convolutional neural network blocks, bidirectional gated recurrent units, and a self-attention module. Tenfold cross-validation results suggest that DeepIRES can capture deeper relationships between sequence features and prediction results than other baseline models. Further comparison on independent test sets illustrates that DeepIRES has superior and more robust prediction capability than other existing methods. Moreover, DeepIRES achieves high accuracy in predicting experimentally validated IRESs that were collected in recent studies. With the application of a deep learning interpretable analysis, we discover some potential consensus motifs that are related to IRES activities. In summary, DeepIRES is a reliable tool for IRES prediction and gives insights into the mechanism of IRES elements.


Subject(s)
Deep Learning , Internal Ribosome Entry Sites , RNA, Messenger , RNA, Messenger/genetics , RNA, Messenger/metabolism , Computational Biology/methods , RNA, Viral/genetics , RNA, Viral/metabolism , Humans , Neural Networks, Computer , Algorithms
18.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38279650

ABSTRACT

As the application of large language models (LLMs) has broadened into the realm of biological predictions, leveraging their capacity for self-supervised learning to create feature representations of amino acid sequences, these models have set a new benchmark in tackling downstream challenges, such as subcellular localization. However, previous studies have primarily focused on either the structural design of models or differing strategies for fine-tuning, largely overlooking investigations into the nature of the features derived from LLMs. In this research, we propose different ESM2 representation extraction strategies, considering both the character type and position within the ESM2 input sequence. Using model dimensionality reduction, predictive analysis and interpretability techniques, we have illuminated potential associations between diverse feature types and specific subcellular localizations. In particular, prediction of mitochondrion and Golgi apparatus localization favors segment features closer to the N-terminus, and phosphorylation site-based features could mirror phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of Random Forest and Deep Neural Networks with varied feature inputs. This work offers novel insights into maximizing LLMs' utility, understanding their mechanisms, and extracting biological domain knowledge. Furthermore, we have made the code, feature extraction API, and all relevant materials available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.


Subject(s)
Computational Biology , Neural Networks, Computer , Computational Biology/methods , Amino Acid Sequence , Protein Transport
19.
Bioessays ; 46(9): e2400026, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38991978

ABSTRACT

Receptor tyrosine kinases exhibit ligand-induced activity and uptake into cells via endocytosis. In the case of epidermal growth factor (EGF) receptor (EGFR), the resulting endosomes are trafficked to the perinuclear region, where dephosphorylation of receptors occurs, which are subsequently directed to degradation. Traveling endosomes bearing phosphorylated EGFRs are subjected to the activity of cytoplasmic phosphatases as well as interactions with the endoplasmic reticulum (ER). The peri-nuclear region harbors ER-embedded phosphatases, a component of the EGFR-bearing endosome-ER contact site. The ER is also emerging as a central player in spatiotemporal control of endosomal motility, positioning, tubulation, and fission. Past studies strongly suggest that the physical interaction between the ER and endosomes forms a reaction "unit" for EGFR dephosphorylation. Independently, endosomes have been implicated to enable quantization of EGFR signals by modulation of the phosphorylation levels. Here, we review the distinct mechanisms by which endosomes form the logistical means for signal quantization and speculate on the role of the ER.


Subject(s)
Endoplasmic Reticulum , Endosomes , ErbB Receptors , Signal Transduction , Animals , Humans , Endocytosis , Endoplasmic Reticulum/metabolism , Endosomes/metabolism , ErbB Receptors/metabolism , Phosphorylation
20.
Proc Natl Acad Sci U S A ; 120(15): e2216698120, 2023 04 11.
Article in English | MEDLINE | ID: mdl-37023129

ABSTRACT

Discovering DNA regulatory sequence motifs and their relative positions is vital to understanding the mechanisms of gene expression regulation. Although deep convolutional neural networks (CNNs) have achieved great success in predicting cis-regulatory elements, the discovery of motifs and their combinatorial patterns from these CNN models has remained difficult. We show that the main difficulty is due to the problem of multifaceted neurons which respond to multiple types of sequence patterns. Since existing interpretation methods were mainly designed to visualize the class of sequences that can activate the neuron, the resulting visualization will correspond to a mixture of patterns. Such a mixture is usually difficult to interpret without resolving the mixed patterns. We propose the NeuronMotif algorithm to interpret such neurons. Given any convolutional neuron (CN) in the network, NeuronMotif first generates a large sample of sequences capable of activating the CN, which typically consists of a mixture of patterns. Then, the sequences are "demixed" in a layer-wise manner by backward clustering of the feature maps of the involved convolutional layers. NeuronMotif can output the sequence motifs, and the syntax rules governing their combinations are depicted by position weight matrices organized in tree structures. Compared to existing methods, the motifs found by NeuronMotif have more matches to known motifs in the JASPAR database. The higher-order patterns uncovered for deep CNs are supported by the literature and ATAC-seq footprinting. Overall, NeuronMotif enables the deciphering of cis-regulatory codes from deep CNs and enhances the utility of CNN in genome interpretation.


Subject(s)
Algorithms , Neural Networks, Computer , Nucleotide Motifs/genetics , Regulatory Sequences, Nucleic Acid/genetics , Databases, Factual
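
The abstract above says that motifs are reported as position weight matrices (PWMs). For readers unfamiliar with the representation, the sketch below builds a log-odds PWM from a set of aligned, equal-length DNA sequences. It illustrates the data structure only and is not the NeuronMotif algorithm, which additionally samples activating sequences and demixes them layer by layer. The function names, the pseudocount, the uniform background and the example sequences are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"

def position_weight_matrix(seqs, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from aligned sequences of equal length.

    Returns a list with one dict per position, mapping each base to
    log2(frequency / background). The pseudocount avoids log(0).
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for s in seqs:
            counts[s[pos]] += 1
        total = sum(counts.values())
        pwm.append({
            base: math.log2((counts[base] / total) / background)
            for base in ALPHABET
        })
    return pwm

def consensus(pwm):
    """Return the highest-scoring base at each position."""
    return "".join(max(column, key=column.get) for column in pwm)

# Hypothetical aligned sequences, not taken from the article.
example = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = position_weight_matrix(example)
print(consensus(pwm))
```

A single PWM assumes that one pattern generated the sequences. When a neuron responds to several patterns, pooling all activating sequences into one matrix blurs them together, which is the mixture problem the abstract describes.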