Results 1 - 15 of 15
1.
bioRxiv ; 2024 Jun 23.
Article in English | MEDLINE | ID: mdl-38948874

ABSTRACT

Gene therapies have the potential to treat disease by delivering therapeutic genetic cargo to disease-associated cells. One limitation to their widespread use is the lack of short regulatory sequences, or promoters, that differentially induce the expression of delivered genetic cargo in target cells, minimizing side effects in other cell types. Such cell-type-specific promoters are difficult to discover using existing methods, requiring either manual curation or access to large datasets of promoter-driven expression from both targeted and untargeted cells. Model-based optimization (MBO) has emerged as an effective method to design biological sequences in an automated manner, and has recently been used in promoter design methods. However, these methods have only been tested using large training datasets that are expensive to collect, and focus on designing promoters for markedly different cell types, overlooking the complexities associated with designing promoters for closely related cell types that share similar regulatory features. Therefore, we introduce a comprehensive framework for utilizing MBO to design promoters in a data-efficient manner, with an emphasis on discovering promoters for similar cell types. We use conservative objective models (COMs) for MBO and highlight practical considerations such as best practices for improving sequence diversity, getting estimates of model uncertainty, and choosing the optimal set of sequences for experimental validation. Using three relatively similar blood cancer cell lines (Jurkat, K562, and THP1), we show that our approach discovers many novel cell-type-specific promoters after experimentally validating the designed sequences. For K562 cells, in particular, we discover a promoter that has 75.85% higher cell-type-specificity than the best promoter from the initial dataset used to train our models.

2.
bioRxiv ; 2024 Jun 08.
Article in English | MEDLINE | ID: mdl-38895200

ABSTRACT

Regular, systematic, and independent assessment of computational tools used to predict the pathogenicity of missense variants is necessary to evaluate their clinical and research utility and suggest directions for future improvement. Here, as part of the sixth edition of the Critical Assessment of Genome Interpretation (CAGI) challenge, we assess missense variant effect predictors (or variant impact predictors) on an evaluation dataset of rare missense variants from disease-relevant databases. Our assessment evaluates predictors submitted to the CAGI6 Annotate-All-Missense challenge, predictors commonly used by the clinical genetics community, and recently developed deep learning methods for variant effect prediction. To explore a variety of settings that are relevant for different clinical and research applications, we assess performance within different subsets of the evaluation data and within high-specificity and high-sensitivity regimes. We find strong performance of many predictors across multiple settings. Meta-predictors tend to outperform their constituent individual predictors; however, several individual predictors have performance similar to that of commonly used meta-predictors. The relative performance of predictors differs in high-specificity and high-sensitivity regimes, suggesting that different methods may be best suited to different use cases. We also characterize two potential sources of bias. Predictors that incorporate allele frequency as a predictive feature tend to have reduced performance when distinguishing pathogenic variants from very rare benign variants, and predictors supervised on pathogenicity labels from curated variant databases often learn label imbalances within genes. 
Overall, we find notable advances over the oldest and most cited missense variant effect predictors and continued improvements among the most recently developed tools, and the CAGI Annotate-All-Missense challenge (also termed the Missense Marathon) will continue to assess state-of-the-art methods as the field progresses. Together, our results help illuminate the current clinical and research utility of missense variant effect predictors and identify potential areas for future development.

3.
bioRxiv ; 2024 Mar 07.
Article in English | MEDLINE | ID: mdl-37904945

ABSTRACT

Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields - including benchmarking, auditing, and algorithmic fairness - are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.

4.
Nat Genet ; 55(12): 2056-2059, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38036790

ABSTRACT

Genomic deep learning models can predict genome-wide epigenetic features and gene expression levels directly from DNA sequence. While current models perform well at predicting gene expression levels across genes in different cell types from the reference genome, their ability to explain expression variation between individuals due to cis-regulatory genetic variants remains largely unexplored. Here, we evaluate four state-of-the-art models on paired personal genome and transcriptome data and find limited performance when explaining variation in expression across individuals. In addition, models often fail to predict the correct direction of effect of cis-regulatory genetic variation on expression.


Subject(s)
Deep Learning , Transcriptome , Humans , Transcriptome/genetics , Genetic Variation/genetics , Genome , Genomics
5.
bioRxiv ; 2023 Feb 27.
Article in English | MEDLINE | ID: mdl-36909524

ABSTRACT

Advances in gene delivery technologies are enabling rapid progress in molecular medicine, but require precise expression of genetic cargo in desired cell types, which is predominantly achieved via a regulatory DNA sequence called a promoter; however, only a handful of cell type-specific promoters are known. Efficiently designing compact promoter sequences with a high density of regulatory information by leveraging machine learning models would therefore be broadly impactful for fundamental research and direct therapeutic applications. However, models of expression from such compact promoter sequences are lacking, despite the recent success of deep learning in modelling expression from endogenous regulatory sequences. Despite the lack of large datasets measuring promoter-driven expression in many cell types, data from a few well-studied cell types or from endogenous gene expression may provide relevant information for transfer learning, which has not yet been explored in this setting. Here, we evaluate a variety of pretraining tasks and transfer strategies for modelling cell type-specific expression from compact promoters and demonstrate the effectiveness of pretraining on existing promoter-driven expression datasets from other cell types. Our approach is broadly applicable for modelling promoter-driven expression in any data-limited cell type of interest, and will enable the use of model-based optimization techniques for promoter design for gene delivery applications. Our code and data are available at https://github.com/anikethjr/promoter_models.

6.
bioRxiv ; 2023 Dec 23.
Article in English | MEDLINE | ID: mdl-38187742

ABSTRACT

Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.
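The abstract does not state how an "inconsistent" prediction is defined. As an illustration only, the sketch below assumes one plausible reading: the replicates of an ensemble are inconsistent on a variant when they disagree about the sign (direction) of its predicted effect. The function names and the agreement threshold are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def direction_consistency(replicate_effects: Sequence[float]) -> float:
    """Fraction of replicates that share the majority sign of the predicted effect.

    Assumption: each value is one replicate's predicted effect of a variant
    (alternate minus reference). Exact zeros carry no direction and are ignored.
    """
    signs = [1 if e > 0 else -1 for e in replicate_effects if e != 0]
    if not signs:
        return 1.0  # no replicate predicts a direction, so none disagree
    positives = sum(1 for s in signs if s > 0)
    majority = max(positives, len(signs) - positives)
    return majority / len(signs)


def is_consistent(replicate_effects: Sequence[float], threshold: float = 1.0) -> bool:
    """Call a variant consistent when agreement reaches the (hypothetical) threshold."""
    return direction_consistency(replicate_effects) >= threshold


# Hypothetical example: three of four replicates predict an increase.
example_effects = [0.2, 0.1, -0.3, 0.4]
example_score = direction_consistency(example_effects)
```

With the default threshold of 1.0, a single dissenting replicate is enough to mark a variant as inconsistent; a looser threshold would tolerate a minority of disagreeing replicates.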

7.
Nat Commun ; 13(1): 5803, 2022 Oct 03.
Article in English | MEDLINE | ID: mdl-36192477

ABSTRACT

Age is the primary risk factor for many common human diseases. Here, we quantify the relative contributions of genetics and aging to gene expression patterns across 27 tissues from 948 humans. We show that the predictive power of expression quantitative trait loci is impacted by age in many tissues. Jointly modelling the contributions of age and genetics to transcript level variation we find expression heritability (h²) is consistent among tissues while the contribution of aging varies by >20-fold with [Formula: see text] in 5 tissues. We find that while the force of purifying selection is stronger on genes expressed early versus late in life (Medawar's hypothesis), several highly proliferative tissues exhibit the opposite pattern. These non-Medawarian tissues exhibit high rates of cancer and age-of-expression-associated somatic mutations. In contrast, genes under genetic control are under relaxed constraint. Together, we demonstrate the distinct roles of aging and genetics on expression phenotypes.


Subject(s)
Aging , Quantitative Trait Loci , Aging/genetics , Gene Expression , Gene Expression Regulation , Humans , Phenotype , Quantitative Trait Loci/genetics
8.
Bioinformatics ; 36(16): 4440-4448, 2020 Aug 15.
Article in English | MEDLINE | ID: mdl-32330225

ABSTRACT

SUMMARY: Interpreting genetic variants of unknown significance (VUS) is essential in clinical applications of genome sequencing for diagnosis and personalized care. Non-coding variants remain particularly difficult to interpret, despite making up a large majority of trait associations identified in genome-wide association studies (GWAS) analyses. Predicting the regulatory effects of non-coding variants on candidate genes is a key step in evaluating their clinical significance. Here, we develop a machine-learning algorithm, Inference of Connected expression quantitative trait loci (eQTLs) (IRT), to predict the regulatory targets of non-coding variants identified in studies of eQTLs. We assemble datasets using eQTL results from the Genotype-Tissue Expression (GTEx) project and learn to separate positive and negative pairs based on annotations characterizing the variant, gene and the intermediate sequence. IRT achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.799 using random cross-validation, and 0.700 for a more stringent position-based cross-validation. Further evaluation on rare variants and experimentally validated regulatory variants shows a significant enrichment in IRT identifying the true target genes versus negative controls. In gene-ranking experiments, IRT achieves a top-1 accuracy of 50% and top-3 accuracy of 90%. Salient features, including GC-content, histone modifications and Hi-C interactions are further analyzed and visualized to illustrate their influences on predictions. IRT can be applied to any VUS of interest and each candidate nearby gene to output a score reflecting the likelihood of regulatory effect on the expression level. These scores can be used to prioritize variants and genes to assist in patient diagnosis and GWAS follow-up studies. AVAILABILITY AND IMPLEMENTATION: Codes and data used in this work are available at https://github.com/miaecle/eQTL_Trees. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
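The top-1 and top-3 accuracies quoted in the abstract are ranking metrics. The sketch below shows how a top-k accuracy can be computed in general; it is not the IRT code, and the gene names and rankings are made up for illustration.

```python
from typing import Sequence


def top_k_accuracy(rankings: Sequence[Sequence[str]], true_genes: Sequence[str], k: int) -> float:
    """Fraction of variants whose true target gene is among the k best-scored candidates.

    rankings[i] lists the candidate genes for variant i, best score first;
    true_genes[i] is the known target gene for that variant.
    """
    if len(rankings) != len(true_genes):
        raise ValueError("rankings and true_genes must have the same length")
    hits = sum(1 for ranked, true in zip(rankings, true_genes) if true in ranked[:k])
    return hits / len(true_genes)


# Hypothetical rankings for two variants whose true target is GENE_A.
example_rankings = [["GENE_A", "GENE_B", "GENE_C"], ["GENE_B", "GENE_A", "GENE_C"]]
example_truth = ["GENE_A", "GENE_A"]
top1 = top_k_accuracy(example_rankings, example_truth, k=1)
top3 = top_k_accuracy(example_rankings, example_truth, k=3)
```

In this toy case the true gene is ranked first for one variant and second for the other, so top-1 accuracy is lower than top-3 accuracy, mirroring the gap between the two figures reported in the abstract.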


Subject(s)
Genome-Wide Association Study , Quantitative Trait Loci , Chromosome Mapping , Histone Code , Humans , Phenotype , Polymorphism, Single Nucleotide , Quantitative Trait Loci/genetics
9.
J Invest Dermatol ; 138(12): 2589-2594, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30472995

ABSTRACT

Cutaneous squamous cell cancers (cSCCs) present an under-recognized health issue among non-Hispanic whites, one that is likely to increase as populations age. cSCC risks vary considerably among non-Hispanic whites, and this heterogeneity indicates the need for risk-stratified screening strategies that are guided by patients' personal characteristics and clinical histories. Here we describe cSCCscore, a prediction tool that uses patients' covariates and clinical histories to assign them personal probabilities of developing cSCCs within 3 years after risk assessment. cSCCscore uses a statistical model for the occurrence and timing of a patient's cSCCs, whose parameters we estimated using cohort data from 66,995 patients in the Kaiser Permanente Northern California healthcare system. We found that patients' covariates and histories explained approximately 75% of their interpersonal cSCC risk variation. Using cross-validated performance measures, we also found cSCCscore's predictions to be moderately well calibrated to the patients' observed cSCC incidence. Moreover, cSCCscore discriminated well between patients who subsequently did and did not develop a new primary cSCC within 3 years after risk assignment, with area under the receiver operating characteristic curve of approximately 85%. Thus, cSCCscore can facilitate more informed management of non-Hispanic white patients at cSCC risk. cSCCscore's predictions are available at https://researchapps.github.io/cSCCscore/.


Subject(s)
Carcinoma, Squamous Cell/diagnosis , Early Detection of Cancer/methods , Models, Statistical , Skin Neoplasms/diagnosis , White People , Aged , California/epidemiology , Carcinoma, Squamous Cell/epidemiology , Cohort Studies , Delivery of Health Care , Female , Humans , Incidence , Male , Middle Aged , Neoplasm Recurrence, Local , Prognosis , Research Design , Risk Factors , Skin Neoplasms/epidemiology
10.
Nat Commun ; 9(1): 4264, 2018 Oct 15.
Article in English | MEDLINE | ID: mdl-30323283

ABSTRACT

Cutaneous squamous cell carcinoma (cSCC) is a common skin cancer with genetic susceptibility loci identified in recent genome-wide association studies (GWAS). Transcriptome-wide association studies (TWAS) using imputed gene expression levels can identify additional gene-level associations. Here we impute gene expression levels in 6891 cSCC cases and 54,566 controls in the Kaiser Permanente Genetic Epidemiology Research in Adult Health and Aging (GERA) cohort and 25,558 self-reported cSCC cases and 673,788 controls from 23andMe. In a discovery-validation study, we identify 19 loci containing 33 genes whose imputed expression levels are associated with cSCC at false discovery rate < 10% in the GERA cohort and validate 15 of these candidate genes at Bonferroni significance in the 23andMe dataset, including eight genes in five novel susceptibility loci and seven genes in four previously associated loci. These results suggest genetic mechanisms contributing to cSCC risk and illustrate advantages and disadvantages of TWAS as a supplement to traditional GWAS analyses.


Subject(s)
Carcinoma, Squamous Cell/genetics , Gene Expression Regulation, Neoplastic , Genetic Loci , Genetic Predisposition to Disease , Skin Neoplasms/genetics , Databases, Genetic , Humans , Polymorphism, Single Nucleotide/genetics , Reproducibility of Results
11.
Cancer Immunol Immunother ; 67(7): 1123-1133, 2018 Jul.
Article in English | MEDLINE | ID: mdl-29754218

ABSTRACT

BACKGROUND: The immune system has been implicated in the pathophysiology of cutaneous squamous cell carcinoma (cSCC) as evidenced by the substantially increased risk of cSCC in immunosuppressed individuals. Associations between cSCC risk and single nucleotide polymorphisms (SNPs) in the HLA region have been identified by genome-wide association studies (GWAS). The translation of the associated HLA SNPs to structural amino acid changes in HLA molecules has not been previously elucidated. METHODS: Using data from a GWAS that included 7238 cSCC cases and 56,961 controls of non-Hispanic white ancestry, we imputed classical alleles and corresponding amino acid changes in HLA genes. Logistic regression models were used to examine associations between cSCC risk and genotyped or imputed SNPs, classical HLA alleles, and amino acid changes. RESULTS: Among the genotyped SNPs, cSCC risk was associated with rs28535317 (OR = 1.20, p = 9.88 × 10⁻¹¹) corresponding to an amino-acid change from phenylalanine to leucine at codon 26 of HLA-DRB1 (OR = 1.17, p = 2.48 × 10⁻¹⁰). An additional independent association was observed for a threonine to isoleucine change at codon 107 of HLA-DQA1 (OR = 1.14, p = 2.34 × 10⁻⁹). Among the classical HLA alleles, cSCC was associated with DRB1*01 (OR = 1.18, p = 5.86 × 10⁻¹⁰). Conditional analyses revealed additional independent cSCC associations with DQA1*05:01 and DQA1*05:05. Extended haplotype analysis was used to complement the imputed haplotypes, which identified three extended haplotypes in the HLA-DR and HLA-DQ regions. CONCLUSIONS: Associations with specific HLA-DR and -DQ alleles are likely to explain previously observed GWAS signals in the HLA region associated with cSCC risk.


Subject(s)
Carcinoma, Squamous Cell/genetics , Carcinoma, Squamous Cell/pathology , Genes, MHC Class II , Polymorphism, Single Nucleotide , Skin Neoplasms/genetics , Skin Neoplasms/pathology , Genome-Wide Association Study , Genotype , Humans , Risk Factors
12.
Bioinformatics ; 33(24): 3895-3901, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-28961785

ABSTRACT

MOTIVATION: Interpreting genetic variation in noncoding regions of the genome is an important challenge for personal genome analysis. One mechanism by which noncoding single nucleotide variants (SNVs) influence downstream phenotypes is through the regulation of gene expression. Methods to predict whether or not individual SNVs are likely to regulate gene expression would aid interpretation of variants of unknown significance identified in whole-genome sequencing studies. RESULTS: We developed FIRE (Functional Inference of Regulators of Expression), a tool to score both noncoding and coding SNVs based on their potential to regulate the expression levels of nearby genes. FIRE consists of 23 random forests trained to recognize SNVs in cis-expression quantitative trait loci (cis-eQTLs) using a set of 92 genomic annotations as predictive features. FIRE scores discriminate cis-eQTL SNVs from non-eQTL SNVs in the training set with a cross-validated area under the receiver operating characteristic curve (AUC) of 0.807, and discriminate cis-eQTL SNVs shared across six populations of different ancestry from non-eQTL SNVs with an AUC of 0.939. FIRE scores are also predictive of cis-eQTL SNVs across a variety of tissue types. AVAILABILITY AND IMPLEMENTATION: FIRE scores for genome-wide SNVs in hg19/GRCh37 are available for download at https://sites.google.com/site/fireregulatoryvariation/. CONTACT: nilah@stanford.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Regulation , Genetic Variation , Software , Genomics , Humans , Quantitative Trait Loci
13.
Hum Immunol ; 78(4): 327-335, 2017 Apr.
Article in English | MEDLINE | ID: mdl-28185865

ABSTRACT

Cutaneous squamous cell carcinoma (cSCC) is the second most common cancer among Caucasians in the United States, with rising incidence over the past decade. Treatment for non-melanoma skin cancer, including cSCC, in the United States was estimated to cost $4.8 billion in 2014. Thus, an understanding of cSCC pathogenesis could have important public health implications. Immune function impacts cSCC risk, given that cSCC incidence rates are substantially higher in patients with compromised immune systems. We report a systematic review of published associations between cSCC risk and the human leukocyte antigen (HLA) system. This review includes studies that analyze germline class I and class II HLA allelic variation as well as HLA cell-surface protein expression levels associated with cSCC risk. We propose biological mechanisms for these HLA-cSCC associations based on known mechanisms of HLA involvement in other diseases. The review suggests that immunity regulates the development of cSCC and that HLA-cSCC associations differ between immunocompetent and immunosuppressed patients. This difference may reflect the presence of viral co-factors that affect tumorigenesis in immunosuppressed patients. Finally, we highlight limitations in the literature on HLA-cSCC associations, and suggest directions for future research aimed at understanding, preventing and treating cSCC.


Subject(s)
Carcinoma, Squamous Cell/epidemiology , DNA Virus Infections/epidemiology , HLA Antigens/genetics , Papillomaviridae/physiology , Skin Neoplasms/epidemiology , Carcinoma, Squamous Cell/genetics , Carcinoma, Squamous Cell/immunology , Gene Frequency , Genome-Wide Association Study , Humans , Immunity , Immunosuppression Therapy , Polymorphism, Genetic , Risk Factors , Skin Neoplasms/genetics , Skin Neoplasms/immunology , United States , White People
14.
Am J Hum Genet ; 99(4): 877-885, 2016 Oct 06.
Article in English | MEDLINE | ID: mdl-27666373

ABSTRACT

The vast majority of coding variants are rare, and assessment of the contribution of rare variants to complex traits is hampered by low statistical power and limited functional data. Improved methods for predicting the pathogenicity of rare coding variants are needed to facilitate the discovery of disease variants from exome sequencing studies. We developed REVEL (rare exome variant ensemble learner), an ensemble method for predicting the pathogenicity of missense variants on the basis of individual tools: MutPred, FATHMM, VEST, PolyPhen, SIFT, PROVEAN, MutationAssessor, MutationTaster, LRT, GERP, SiPhy, phyloP, and phastCons. REVEL was trained with recently discovered pathogenic and rare neutral missense variants, excluding those previously used to train its constituent tools. When applied to two independent test sets, REVEL had the best overall performance (p < 10⁻¹²) as compared to any individual tool and seven ensemble methods: MetaSVM, MetaLR, KGGSeq, Condel, CADD, DANN, and Eigen. Importantly, REVEL also had the best performance for distinguishing pathogenic from rare neutral variants with allele frequencies <0.5%. The area under the receiver operating characteristic curve (AUC) for REVEL was 0.046-0.182 higher in an independent test set of 935 recent SwissVar disease variants and 123,935 putatively neutral exome sequencing variants and 0.027-0.143 higher in an independent test set of 1,953 pathogenic and 2,406 benign variants recently reported in ClinVar than the AUCs for other ensemble methods. We provide pre-computed REVEL scores for all possible human missense variants to facilitate the identification of pathogenic variants in the sea of rare variants discovered as sequencing studies expand in scale.


Subject(s)
Disease/genetics , Mutation, Missense/genetics , Software , Area Under Curve , DNA Mutational Analysis , Exome/genetics , Gene Frequency , Humans , ROC Curve
15.
J Invest Dermatol ; 136(5): 930-937, 2016 May.
Article in English | MEDLINE | ID: mdl-26829030

ABSTRACT

We report a genome-wide association study of cutaneous squamous cell carcinoma conducted among non-Hispanic white members of the Kaiser Permanente Northern California health care system. The study includes a genome-wide screen of 61,457 members (6,891 cases and 54,566 controls) genotyped on the Affymetrix Axiom European array and a replication phase involving an independent set of 6,410 additional members (810 cases and 5,600 controls). Combined analysis of screening and replication phases identified 10 loci containing single-nucleotide polymorphisms (SNPs) with P-values < 5 × 10⁻⁸. Six loci contain genes in the pigmentation pathway; SNPs at these loci appear to modulate squamous cell carcinoma risk independently of the pigmentation phenotypes. Another locus contains HLA class II genes studied in relation to elevated squamous cell carcinoma risk following immunosuppression. SNPs at the remaining three loci include an intronic SNP in FOXP1 at locus 3p13, an intergenic SNP at 3q28 near TP63, and an intergenic SNP at 9p22 near BNC2. These findings provide insights into the genetic factors accounting for inherited squamous cell carcinoma susceptibility.


Subject(s)
Carcinoma, Squamous Cell/genetics , Genetic Loci , Genetic Predisposition to Disease/epidemiology , Genome-Wide Association Study , Skin Neoplasms/genetics , Adult , California , Carcinoma, Squamous Cell/epidemiology , Carcinoma, Squamous Cell/pathology , Case-Control Studies , Cohort Studies , Female , Genotype , Humans , Male , Middle Aged , Polymorphism, Single Nucleotide , Skin Neoplasms/epidemiology , Skin Neoplasms/pathology , White People/genetics