Tracing cancer evolution and heterogeneity using Hi-C.

Erdmann-Pham, Dan Daniel; Batra, Sanjit Singh; Turkalo, Timothy K; Durbin, James; Blanchette, Marco; Yeh, Iwei; Shain, Hunter; Bastian, Boris C; Song, Yun S; Rokhsar, Daniel S; Hockemeyer, Dirk.

Nat Commun ; 14(1): 7111, 2023 11 06.

Article in English | MEDLINE | ID: mdl-37932252

ABSTRACT

Chromosomal rearrangements can initiate and drive cancer progression, yet it has been challenging to evaluate their impact, especially in genetically heterogeneous solid cancers. To address this problem we developed HiDENSEC, a new computational framework for analyzing chromatin conformation capture in heterogeneous samples that can infer somatic copy number alterations, characterize large-scale chromosomal rearrangements, and estimate cancer cell fractions. After validating HiDENSEC with in silico and in vitro controls, we used it to characterize chromosome-scale evolution during melanoma progression in formalin-fixed tumor samples from three patients. The resulting comprehensive annotation of the genomic events includes copy number neutral translocations that disrupt tumor suppressor genes such as NF1, whole chromosome arm exchanges that result in loss of CDKN2A, and whole-arm copy-number neutral loss of homozygosity involving PTEN. These findings show that large-scale chromosomal rearrangements occur throughout cancer evolution and that characterizing these events yields insights into drivers of melanoma progression.

Subject(s)

Chromosome Aberrations , Melanoma , Humans , DNA Copy Number Variations , Chromosomes , Translocation, Genetic , Melanoma/genetics

Predicting the effect of CRISPR-Cas9-based epigenome editing.

Batra, Sanjit Singh; Cabrera, Alan; Spence, Jeffrey P; Hilton, Isaac B; Song, Yun S.

bioRxiv ; 2023 Oct 03.

Article in English | MEDLINE | ID: mdl-37873127

ABSTRACT

Epigenetic regulation orchestrates mammalian transcription, but functional links between them remain elusive. To tackle this problem, we here use epigenomic and transcriptomic data from 13 ENCODE cell types to train machine learning models to predict gene expression from histone post-translational modifications (PTMs), achieving transcriptome-wide correlations of ~ 0.70 - 0.79 for most samples. In addition to recapitulating known associations between histone PTMs and expression patterns, our models predict that acetylation of histone subunit H3 lysine residue 27 (H3K27ac) near the transcription start site (TSS) significantly increases expression levels. To validate this prediction experimentally and investigate how engineered vs. natural deposition of H3K27ac might differentially affect expression, we apply the synthetic dCas9-p300 histone acetyltransferase system to 8 genes in the HEK293T cell line. Further, to facilitate model building, we perform MNase-seq to map genome-wide nucleosome occupancy levels in HEK293T. We observe that our models perform well in accurately ranking relative fold changes among genes in response to the dCas9-p300 system; however, their ability to rank fold changes within individual genes is noticeably diminished compared to predicting expression across cell types from their native epigenetic signatures. Our findings highlight the need for more comprehensive genome-scale epigenome editing datasets, better understanding of the actual modifications made by epigenome editing tools, and improved causal models that transfer better from endogenous cellular measurements to perturbation experiments. Together these improvements would facilitate the ability to understand and predictably control the dynamic human epigenome with consequences for human health.

DNA language models are powerful predictors of genome-wide variant effects.

Benegas, Gonzalo; Batra, Sanjit Singh; Song, Yun S.

Proc Natl Acad Sci U S A ; 120(44): e2311219120, 2023 Oct 31.

Article in English | MEDLINE | ID: mdl-37883436

ABSTRACT

The expanding catalog of genome-wide association studies (GWAS) provides biological insights across a variety of species, but identifying the causal variants behind these associations remains a significant challenge. Experimental validation is both labor-intensive and costly, highlighting the need for accurate, scalable computational methods to predict the effects of genetic variants across the entire genome. Inspired by recent progress in natural language processing, unsupervised pretraining on large protein sequence databases has proven successful in extracting complex information related to proteins. These models showcase their ability to learn variant effects in coding regions using an unsupervised approach. Expanding on this idea, we here introduce the Genomic Pre-trained Network (GPN), a model designed to learn genome-wide variant effects through unsupervised pretraining on genomic DNA sequences. Our model also successfully learns gene structure and DNA motifs without any supervision. To demonstrate its utility, we train GPN on unaligned reference genomes of Arabidopsis thaliana and seven related species within the Brassicales order and evaluate its ability to predict the functional impact of genetic variants in A. thaliana by utilizing allele frequencies from the 1001 Genomes Project and a comprehensive database of GWAS. Notably, GPN outperforms predictors based on popular conservation scores such as phyloP and phastCons. Our predictions for A. thaliana can be visualized as sequence logos in the UCSC Genome Browser (https://genome.ucsc.edu/s/gbenegas/gpn-arabidopsis). We provide code (https://github.com/songlab-cal/gpn) to train GPN for any given species using its DNA sequence alone, enabling unsupervised prediction of variant effects across the entire genome.

Subject(s)

Arabidopsis , Arabidopsis/genetics , Genome-Wide Association Study , Genomics , Genome , DNA

The ENCODE Imputation Challenge: a critical assessment of methods for cross-cell type imputation of epigenomic profiles.

Schreiber, Jacob; Boix, Carles; Wook Lee, Jin; Li, Hongyang; Guan, Yuanfang; Chang, Chun-Chieh; Chang, Jen-Chien; Hawkins-Hooker, Alex; Schölkopf, Bernhard; Schweikert, Gabriele; Carulla, Mateo Rojas; Canakoglu, Arif; Guzzo, Francesco; Nanni, Luca; Masseroli, Marco; Carman, Mark James; Pinoli, Pietro; Hong, Chenyang; Yip, Kevin Y; Spence, Jeffrey P; Batra, Sanjit Singh; Song, Yun S; Mahony, Shaun; Zhang, Zheng; Tan, Wuwei; Shen, Yang; Sun, Yuanfei; Shi, Minyi; Adrian, Jessika; Sandstrom, Richard; Farrell, Nina; Halow, Jessica; Lee, Kristen; Jiang, Lixia; Yang, Xinqiong; Epstein, Charles; Strattan, J Seth; Bernstein, Bradley; Snyder, Michael; Kellis, Manolis; Stafford, William; Kundaje, Anshul.

Genome Biol ; 24(1): 79, 2023 04 18.

Article in English | MEDLINE | ID: mdl-37072822

ABSTRACT

A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.

Subject(s)

Algorithms , Epigenomics , Genomics/methods

Accurate assembly of the olive baboon (Papio anubis) genome using long-read and Hi-C data.

Batra, Sanjit Singh; Levy-Sakin, Michal; Robinson, Jacqueline; Guillory, Joseph; Durinck, Steffen; Vilgalys, Tauras P; Kwok, Pui-Yan; Cox, Laura A; Seshagiri, Somasekar; Song, Yun S; Wall, Jeffrey D.

Gigascience ; 9(12), 2020 12 07.

Article in English | MEDLINE | ID: mdl-33283855

ABSTRACT

BACKGROUND: Baboons are a widely used nonhuman primate model for biomedical, evolutionary, and basic genetics research. Despite this importance, the genomic resources for baboons are limited. In particular, the current baboon reference genome Panu_3.0 is a highly fragmented, reference-guided (i.e., not fully de novo) assembly, and its poor quality inhibits our ability to conduct downstream genomic analyses. FINDINGS: Here we present a de novo genome assembly of the olive baboon (Papio anubis) that uses data from several recently developed single-molecule technologies. Our assembly, Panubis1.0, has an N50 contig size of ~1.46 Mb (as opposed to 139 kb for Panu_3.0) and has single scaffolds that span each of the 20 autosomes and the X chromosome. CONCLUSIONS: We highlight multiple lines of evidence (including Bionano Genomics data, pedigree linkage information, and linkage disequilibrium data) suggesting that there are several large assembly errors in Panu_3.0, which have been corrected in Panubis1.0.

Subject(s)

Genome , Papio anubis , Animals , Biological Evolution , Chromosomes , Genomics , Papio anubis/genetics
