Search | VHL Search Portal

1.

Visible Machine Learning for Biomedicine.

Yu, Michael K; Ma, Jianzhu; Fisher, Jasmin; Kreisberg, Jason F; Raphael, Benjamin J; Ideker, Trey.

Cell ; 173(7): 1562-1565, 2018 06 14.

Article in English | MEDLINE | ID: mdl-29906441

ABSTRACT

A major ambition of artificial intelligence lies in translating patient data to successful therapies. Machine learning models face particular challenges in biomedicine, however, including handling of extreme data heterogeneity and lack of mechanistic insight into predictions. Here, we argue for "visible" approaches that guide model structure with experimental biology.

Subject(s)

Computational Biology/methods , Machine Learning , Algorithms , Biomedical Research

2.

Pathogenic Germline Variants in 10,389 Adult Cancers.

Huang, Kuan-Lin; Mashl, R Jay; Wu, Yige; Ritter, Deborah I; Wang, Jiayin; Oh, Clara; Paczkowska, Marta; Reynolds, Sheila; Wyczalkowski, Matthew A; Oak, Ninad; Scott, Adam D; Krassowski, Michal; Cherniack, Andrew D; Houlahan, Kathleen E; Jayasinghe, Reyka; Wang, Liang-Bo; Zhou, Daniel Cui; Liu, Di; Cao, Song; Kim, Young Won; Koire, Amanda; McMichael, Joshua F; Hucthagowder, Vishwanathan; Kim, Tae-Beom; Hahn, Abigail; Wang, Chen; McLellan, Michael D; Al-Mulla, Fahd; Johnson, Kimberly J; Lichtarge, Olivier; Boutros, Paul C; Raphael, Benjamin; Lazar, Alexander J; Zhang, Wei; Wendl, Michael C; Govindan, Ramaswamy; Jain, Sanjay; Wheeler, David; Kulkarni, Shashikant; Dipersio, John F; Reimand, Jüri; Meric-Bernstam, Funda; Chen, Ken; Shmulevich, Ilya; Plon, Sharon E; Chen, Feng; Ding, Li.

Cell ; 173(2): 355-370.e14, 2018 04 05.

Article in English | MEDLINE | ID: mdl-29625052

ABSTRACT

We conducted the largest investigation of predisposition variants in cancer to date, discovering 853 pathogenic or likely pathogenic variants in 8% of 10,389 cases from 33 cancer types. Twenty-one genes showed single or cross-cancer associations, including novel associations of SDHA in melanoma and PALB2 in stomach adenocarcinoma. The 659 predisposition variants and 18 additional large deletions in tumor suppressors, including ATM, BRCA1, and NF1, showed low gene expression and frequent (43%) loss of heterozygosity or biallelic two-hit events. We also discovered 33 such variants in oncogenes, including missense variants in MET, RET, and PTPN11 associated with high gene expression. We nominated 47 additional predisposition variants from prioritized VUSs supported by multiple lines of evidence involving case-control frequency, loss of heterozygosity, expression effect, and co-localization with mutations and modified residues. Our integrative approach links rare predisposition variants to functional consequences, informing future guidelines of variant classification and germline genetic testing in cancer.

Subject(s)

Germ Cells/metabolism , Neoplasms/pathology , DNA Copy Number Variations , Databases, Genetic , Gene Deletion , Gene Frequency , Genetic Predisposition to Disease , Genotype , Germ Cells/cytology , Germ-Line Mutation , Humans , Loss of Heterozygosity/genetics , Mutation, Missense , Neoplasms/genetics , Polymorphism, Single Nucleotide , Proto-Oncogene Proteins c-met/genetics , Proto-Oncogene Proteins c-ret/genetics , Tumor Suppressor Proteins/genetics

3.

Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin.

Hoadley, Katherine A; Yau, Christina; Wolf, Denise M; Cherniack, Andrew D; Tamborero, David; Ng, Sam; Leiserson, Max D M; Niu, Beifang; McLellan, Michael D; Uzunangelov, Vladislav; Zhang, Jiashan; Kandoth, Cyriac; Akbani, Rehan; Shen, Hui; Omberg, Larsson; Chu, Andy; Margolin, Adam A; Van't Veer, Laura J; Lopez-Bigas, Nuria; Laird, Peter W; Raphael, Benjamin J; Ding, Li; Robertson, A Gordon; Byers, Lauren A; Mills, Gordon B; Weinstein, John N; Van Waes, Carter; Chen, Zhong; Collisson, Eric A; Benz, Christopher C; Perou, Charles M; Stuart, Joshua M.

Cell ; 158(4): 929-944, 2014 Aug 14.

Article in English | MEDLINE | ID: mdl-25109877

ABSTRACT

Recent genomic analyses of pathologically defined tumor types identify "within-a-tissue" disease subtypes. However, the extent to which genomic signatures are shared across tissues is still unclear. We performed an integrative analysis using five genome-wide platforms and one proteomic platform on 3,527 specimens from 12 cancer types, revealing a unified classification into 11 major subtypes. Five subtypes were nearly identical to their tissue-of-origin counterparts, but several distinct cancer types were found to converge into common subtypes. Lung squamous, head and neck, and a subset of bladder cancers coalesced into one subtype typified by TP53 alterations, TP63 amplifications, and high expression of immune and proliferation pathway genes. Of note, bladder cancers split into three pan-cancer subtypes. The multiplatform classification, while correlated with tissue-of-origin, provides independent information for predicting clinical outcomes. All data sets are available for data-mining from a unified resource to support further biological discoveries and insights into novel therapeutic strategies.

Subject(s)

Neoplasms/classification , Neoplasms/genetics , Cluster Analysis , Humans , Neoplasms/pathology , Transcriptome

4.

Spatial epigenome-transcriptome co-profiling of mammalian tissues.

Zhang, Di; Deng, Yanxiang; Kukanja, Petra; Agirre, Eneritz; Bartosovic, Marek; Dong, Mingze; Ma, Cong; Ma, Sai; Su, Graham; Bao, Shuozhen; Liu, Yang; Xiao, Yang; Rosoklija, Gorazd B; Dwork, Andrew J; Mann, J John; Leong, Kam W; Boldrini, Maura; Wang, Liya; Haeussler, Maximilian; Raphael, Benjamin J; Kluger, Yuval; Castelo-Branco, Gonçalo; Fan, Rong.

Nature ; 616(7955): 113-122, 2023 04.

Article in English | MEDLINE | ID: mdl-36922587

ABSTRACT

Emerging spatial technologies, including spatial transcriptomics and spatial epigenomics, are becoming powerful tools for profiling of cellular states in the tissue context [1-5]. However, current methods capture only one layer of omics information at a time, precluding the possibility of examining the mechanistic relationship across the central dogma of molecular biology. Here, we present two technologies for spatially resolved, genome-wide, joint profiling of the epigenome and transcriptome by cosequencing chromatin accessibility and gene expression, or histone modifications (H3K27me3, H3K27ac or H3K4me3) and gene expression on the same tissue section at near-single-cell resolution. These were applied to embryonic and juvenile mouse brain, as well as adult human brain, to map how epigenetic mechanisms control transcriptional phenotype and cell dynamics in tissue. Although highly concordant tissue features were identified by either spatial epigenome or spatial transcriptome, we also observed distinct patterns, suggesting their differential roles in defining cell states. Linking epigenome to transcriptome pixel by pixel allows the uncovering of new insights in spatial epigenetic priming, differentiation and gene regulation within the tissue architecture. These technologies are of great interest in life science and biomedical research.

Subject(s)

Chromatin , Epigenome , Mammals , Transcriptome , Animals , Humans , Mice , Chromatin/genetics , Chromatin/metabolism , Epigenesis, Genetic , Epigenomics , Gene Expression Profiling , Gene Expression Regulation , Mammals/genetics , Histones/chemistry , Histones/metabolism , Single-Cell Analysis , Organ Specificity , Brain/embryology , Brain/metabolism , Aging/genetics

5.

Epigenetic regulation during cancer transitions across 11 tumour types.

Terekhanova, Nadezhda V; Karpova, Alla; Liang, Wen-Wei; Strzalkowski, Alexander; Chen, Siqi; Li, Yize; Southard-Smith, Austin N; Iglesia, Michael D; Wendl, Michael C; Jayasinghe, Reyka G; Liu, Jingxian; Song, Yizhe; Cao, Song; Houston, Andrew; Liu, Xiuting; Wyczalkowski, Matthew A; Lu, Rita Jui-Hsien; Caravan, Wagma; Shinkle, Andrew; Naser Al Deen, Nataly; Herndon, John M; Mudd, Jacqueline; Ma, Cong; Sarkar, Hirak; Sato, Kazuhito; Ibrahim, Omar M; Mo, Chia-Kuei; Chasnoff, Sara E; Porta-Pardo, Eduard; Held, Jason M; Pachynski, Russell; Schwarz, Julie K; Gillanders, William E; Kim, Albert H; Vij, Ravi; DiPersio, John F; Puram, Sidharth V; Chheda, Milan G; Fuh, Katherine C; DeNardo, David G; Fields, Ryan C; Chen, Feng; Raphael, Benjamin J; Ding, Li.

Nature ; 623(7986): 432-441, 2023 Nov.

Article in English | MEDLINE | ID: mdl-37914932

ABSTRACT

Chromatin accessibility is essential in regulating gene expression and cellular identity, and alterations in accessibility have been implicated in driving cancer initiation, progression and metastasis [1-4]. Although the genetic contributions to oncogenic transitions have been investigated, epigenetic drivers remain less understood. Here we constructed a pan-cancer epigenetic and transcriptomic atlas using single-nucleus chromatin accessibility data (using single-nucleus assay for transposase-accessible chromatin) from 225 samples and matched single-cell or single-nucleus RNA-sequencing expression data from 206 samples. With over 1 million cells from each platform analysed through the enrichment of accessible chromatin regions, transcription factor motifs and regulons, we identified epigenetic drivers associated with cancer transitions. Some epigenetic drivers appeared in multiple cancers (for example, regulatory regions of ABCC1 and VEGFA; GATA6 and FOX-family motifs), whereas others were cancer specific (for example, regulatory regions of FGF19, ASAP2 and EN1, and the PBX3 motif). Among epigenetically altered pathways, TP53, hypoxia and TNF signalling were linked to cancer initiation, whereas oestrogen response, epithelial-mesenchymal transition and apical junction were tied to metastatic transition. Furthermore, we revealed a marked correlation between enhancer accessibility and gene expression and uncovered cooperation between epigenetic and genetic drivers. This atlas provides a foundation for further investigation of epigenetic dynamics in cancer transitions.

Subject(s)

Epigenesis, Genetic , Gene Expression Regulation, Neoplastic , Neoplasms , Humans , Cell Hypoxia , Cell Nucleus , Chromatin/genetics , Chromatin/metabolism , Enhancer Elements, Genetic/genetics , Epigenesis, Genetic/genetics , Epithelial-Mesenchymal Transition , Estrogens/metabolism , Gene Expression Profiling , GTPase-Activating Proteins/metabolism , Neoplasm Metastasis , Neoplasms/classification , Neoplasms/genetics , Neoplasms/pathology , Regulatory Sequences, Nucleic Acid/genetics , Single-Cell Analysis , Transcription Factors/metabolism

6.

Partial alignment of multislice spatially resolved transcriptomics data.

Liu, Xinhao; Zeira, Ron; Raphael, Benjamin J.

Genome Res ; 33(7): 1124-1132, 2023 07.

Article in English | MEDLINE | ID: mdl-37553263

ABSTRACT

Spatially resolved transcriptomics (SRT) technologies measure messenger RNA (mRNA) expression at thousands of locations in a tissue slice. However, nearly all SRT technologies measure expression in two-dimensional (2D) slices extracted from a 3D tissue, thus losing information that is shared across multiple slices from the same tissue. Integrating SRT data across multiple slices can help recover this information and improve downstream expression analyses, but multislice alignment and integration remains a challenging task. Existing methods for integrating SRT data either do not use spatial information or assume that the morphology of the tissue is largely preserved across slices, an assumption that is often violated because of biological or technical reasons. We introduce PASTE2, a method for partial alignment and 3D reconstruction of multislice SRT data sets, allowing only partial overlap between aligned slices and/or slice-specific cell types. PASTE2 formulates a novel partial fused Gromov-Wasserstein optimal transport problem, which we solve using a conditional gradient algorithm. PASTE2 includes a model selection procedure to estimate the fraction of overlap between slices, and optionally uses information from histological images that accompany some SRT experiments. We show on both simulated and real data that PASTE2 obtains more accurate alignments than existing methods. We further use PASTE2 to reconstruct a 3D map of gene expression in a Drosophila embryo from a 16 slice Stereo-seq data set. PASTE2 produces accurate alignments of multislice data sets from multiple SRT technologies, enabling detailed studies of spatial gene expression across a wide range of biological applications.
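The "partial" aspect of the alignment described above can be illustrated with the marginal constraints of a partial optimal transport plan. The following is a minimal sketch, not the PASTE2 implementation: the function name, the uniform spot weights, and the tolerance are assumptions made for illustration, and the abstract does not state how PASTE2 represents its couplings.

```python
def is_partial_coupling(coupling, weights_a, weights_b, overlap, tol=1e-9):
    """Check the marginal constraints of a partial optimal transport plan.

    Unlike a full coupling, rows and columns may carry less than their full
    weight, and only a fraction `overlap` of the total mass is transported.
    """
    row_sums = [sum(row) for row in coupling]
    col_sums = [sum(col) for col in zip(*coupling)]
    nonnegative = all(x >= -tol for row in coupling for x in row)
    rows_ok = all(r <= w + tol for r, w in zip(row_sums, weights_a))
    cols_ok = all(c <= w + tol for c, w in zip(col_sums, weights_b))
    mass_ok = abs(sum(row_sums) - overlap) <= tol
    return nonnegative and rows_ok and cols_ok and mass_ok


# Hypothetical toy case: slice A has 3 spots, slice B has 2 spots.
# Spots 0 and 1 of A are matched; spot 2 of A lies outside the overlap.
weights_a = [1 / 3, 1 / 3, 1 / 3]
weights_b = [1 / 2, 1 / 2]
plan = [[1 / 3, 0.0], [0.0, 1 / 3], [0.0, 0.0]]
partial_ok = is_partial_coupling(plan, weights_a, weights_b, overlap=2 / 3)
full_ok = is_partial_coupling(plan, weights_a, weights_b, overlap=1.0)
```

Here the plan is valid when two thirds of the mass is expected to overlap, but not when full overlap is required, which is the situation a model selection step for the overlap fraction would have to resolve.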

Subject(s)

Algorithms , Transcriptome

7.

Alignment and integration of spatial transcriptomics data.

Zeira, Ron; Land, Max; Strzalkowski, Alexander; Raphael, Benjamin J.

Nat Methods ; 19(5): 567-575, 2022 05.

Article in English | MEDLINE | ID: mdl-35577957

ABSTRACT

Spatial transcriptomics (ST) measures mRNA expression across thousands of spots from a tissue slice while recording the two-dimensional (2D) coordinates of each spot. We introduce probabilistic alignment of ST experiments (PASTE), a method to align and integrate ST data from multiple adjacent tissue slices. PASTE computes pairwise alignments of slices using an optimal transport formulation that models both transcriptional similarity and physical distances between spots. PASTE further combines pairwise alignments to construct a stacked 3D alignment of a tissue. Alternatively, PASTE can integrate multiple ST slices into a single consensus slice. We show that PASTE accurately aligns spots across adjacent slices in both simulated and real ST data, demonstrating the advantages of using both transcriptional similarity and spatial information. We further show that the PASTE integrated slice improves the identification of cell types and differentially expressed genes compared with existing approaches that either analyze single ST slices or ignore spatial information.
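An objective that "models both transcriptional similarity and physical distances" can be sketched as a fused Gromov-Wasserstein style cost evaluated for a given soft alignment. This is an illustrative sketch under assumptions: the function name, the trade-off parameter `alpha`, and the squared-difference form of the spatial term are choices made here, and the code only scores a coupling rather than optimizing one as PASTE does.

```python
def fgw_objective(coupling, expr_cost, dist_a, dist_b, alpha=0.1):
    """Fused Gromov-Wasserstein style cost of a soft spot-to-spot alignment.

    coupling[i][j]: mass matching spot i of slice A to spot j of slice B.
    expr_cost[i][j]: expression dissimilarity between those two spots.
    dist_a, dist_b: within-slice pairwise spatial distances.
    alpha: hypothetical trade-off between expression and spatial terms.
    """
    n, m = len(coupling), len(coupling[0])
    # Linear term: cost of matching spots with dissimilar expression.
    expression = sum(expr_cost[i][j] * coupling[i][j]
                     for i in range(n) for j in range(m))
    # Quadratic term: penalize pairs of matches that distort distances.
    spatial = sum((dist_a[i][k] - dist_b[j][l]) ** 2
                  * coupling[i][j] * coupling[k][l]
                  for i in range(n) for j in range(m)
                  for k in range(n) for l in range(m))
    return (1 - alpha) * expression + alpha * spatial


# Two slices with two spots each and identical geometry (toy values).
distances = [[0.0, 1.0], [1.0, 0.0]]
expression_cost = [[0.0, 1.0], [1.0, 0.0]]
identity_plan = [[0.5, 0.0], [0.0, 0.5]]
swapped_plan = [[0.0, 0.5], [0.5, 0.0]]
cost_identity = fgw_objective(identity_plan, expression_cost, distances, distances)
cost_swapped = fgw_objective(swapped_plan, expression_cost, distances, distances)
```

In this toy case the identity alignment costs nothing, while the swapped alignment is penalized through the expression term only, because swapping two spots preserves their mutual distance.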

Subject(s)

Algorithms , Transcriptome

8.

Joint inference of cell lineage and mitochondrial evolution from single-cell sequencing data.

Sashittal, Palash; Chen, Viola; Pasarkar, Amey; Raphael, Benjamin J.

Bioinformatics ; 40(Supplement_1): i218-i227, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940122

ABSTRACT

MOTIVATION: Eukaryotic cells contain organelles called mitochondria that have their own genome. Most cells contain thousands of mitochondria which replicate, even in nondividing cells, by means of a relatively error-prone process resulting in somatic mutations in their genome. Because of the higher mutation rate compared to the nuclear genome, mitochondrial mutations have been used to track cellular lineage, particularly using single-cell sequencing that measures mitochondrial mutations in individual cells. However, existing methods to infer the cell lineage tree from mitochondrial mutations do not model "heteroplasmy," which is the presence of multiple mitochondrial clones with distinct sets of mutations in an individual cell. Single-cell sequencing data thus provide a mixture of the mitochondrial clones in individual cells, with the ancestral relationships between these clones described by a mitochondrial clone tree. While deconvolution of somatic mutations from a mixture of evolutionarily related genomes has been extensively studied in the context of bulk sequencing of cancer tumor samples, the problem of mitochondrial deconvolution has the additional constraint that the mitochondrial clone tree must be concordant with the cell lineage tree. RESULTS: We formalize the problem of inferring a concordant pair of a mitochondrial clone tree and a cell lineage tree from single-cell sequencing data as the Nested Perfect Phylogeny Mixture (NPPM) problem. We derive a combinatorial characterization of the solutions to the NPPM problem, and formulate an algorithm, MERLIN, to solve this problem exactly using a mixed integer linear program. We show on simulated data that MERLIN outperforms existing methods that model neither mitochondrial heteroplasmy nor the concordance between the mitochondrial clone tree and the cell lineage tree. We use MERLIN to analyze single-cell whole-genome sequencing data of 5220 cells of a gastric cancer cell line and show that MERLIN infers a more biologically plausible cell lineage tree and mitochondrial clone tree compared to existing methods. AVAILABILITY AND IMPLEMENTATION: https://github.com/raphael-group/MERLIN.
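The perfect phylogeny notion underlying the problem statement above can be illustrated on the simplest case, a binary cell-by-mutation matrix without mixtures. The sketch below uses the classic pairwise condition for a perfect phylogeny rooted at a mutation-free ancestor. It is not MERLIN, which per the abstract handles heteroplasmic mixtures with a mixed integer linear program; the function name and toy matrices are assumptions for illustration.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Return True if a binary cell-by-mutation matrix fits a perfect phylogeny.

    Classic condition (root without mutations): for every pair of mutations,
    the sets of cells carrying them are either disjoint or nested.
    """
    n_mutations = len(matrix[0]) if matrix else 0
    carriers = [{cell for cell, row in enumerate(matrix) if row[m]}
                for m in range(n_mutations)]
    for a, b in combinations(carriers, 2):
        # Overlapping but non-nested carrier sets cannot sit on one tree.
        if a & b and not (a <= b or b <= a):
            return False
    return True


# Rows are cells, columns are mutations (toy data).
tree_like = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
conflicting = [[1, 0], [1, 1], [0, 1]]
tree_like_ok = admits_perfect_phylogeny(tree_like)
conflicting_ok = admits_perfect_phylogeny(conflicting)
```

The second matrix fails because the two mutations share one cell while each also occurs in a cell lacking the other, so no tree with each mutation arising once can explain it.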

Subject(s)

Cell Lineage , Mitochondria , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Cell Lineage/genetics , Mitochondria/genetics , Mutation , Genome, Mitochondrial , Algorithms , Evolution, Molecular

9.

A count-based model for delineating cell-cell interactions in spatial transcriptomics data.

Sarkar, Hirak; Chitra, Uthsav; Gold, Julian; Raphael, Benjamin J.

Bioinformatics ; 40(Supplement_1): i481-i489, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940134

ABSTRACT

MOTIVATION: Cell-cell interactions (CCIs) consist of cells exchanging signals with themselves and neighboring cells by expressing ligand and receptor molecules and play a key role in cellular development, tissue homeostasis, and other critical biological functions. Since direct measurement of CCIs is challenging, multiple methods have been developed to infer CCIs by quantifying correlations between the gene expression of the ligands and receptors that mediate CCIs, originally from bulk RNA-sequencing data and more recently from single-cell or spatially resolved transcriptomics (SRT) data. SRT has a particular advantage over single-cell approaches, since ligand-receptor correlations can be computed between cells or spots that are physically close in the tissue. However, the transcript counts of individual ligands and receptors in SRT data are generally low, complicating the inference of CCIs from expression correlations. RESULTS: We introduce Copulacci, a count-based model for inferring CCIs from SRT data. Copulacci uses a Gaussian copula to model dependencies between the expression of ligands and receptors from nearby spatial locations even when the transcript counts are low. On simulated data, Copulacci outperforms existing CCI inference methods based on the standard Spearman and Pearson correlation coefficients. Using several real SRT datasets, we show that Copulacci discovers biologically meaningful ligand-receptor interactions that are lowly expressed and undiscoverable by existing CCI inference methods. AVAILABILITY AND IMPLEMENTATION: Copulacci is implemented in Python and available at https://github.com/raphael-group/copulacci.
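The idea of coupling low counts through a Gaussian copula can be sketched generatively: draw correlated normal variables, map them to uniforms, then push the uniforms through the quantile function of a count distribution. The sketch below assumes Poisson marginals and fixed means, neither of which is stated in the abstract, and it only simulates data; it does not reproduce Copulacci's inference procedure. All function names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, mean, max_count=1000):
    """Smallest k with Poisson CDF(k) >= u, by direct summation."""
    pmf = math.exp(-mean)
    cdf = pmf
    k = 0
    while cdf < u and k < max_count:
        k += 1
        pmf *= mean / k
        cdf += pmf
    return k


def sample_copula_counts(n, rho, mean_ligand, mean_receptor, seed=0):
    """Draw n (ligand, receptor) count pairs coupled by a Gaussian copula."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # Second normal variable with correlation rho to the first.
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        pairs.append((poisson_quantile(u1, mean_ligand),
                      poisson_quantile(u2, mean_receptor)))
    return pairs


def pearson(pairs):
    """Plain Pearson correlation of the observed count pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / math.sqrt(vx * vy)


positive = pearson(sample_copula_counts(5000, 0.8, 1.0, 1.0))
negative = pearson(sample_copula_counts(5000, -0.8, 1.0, 1.0))
```

With low means, the correlation observed on the counts is attenuated relative to the latent `rho`, which is one way to see why correlation coefficients computed directly on sparse counts can understate the underlying dependence.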

Subject(s)

Cell Communication , Transcriptome , Transcriptome/genetics , Humans , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Algorithms , Computational Biology/methods , Ligands

10.

Maximum likelihood phylogeographic inference of cell motility and cell division from spatial lineage tracing data.

Mai, Uyen; Hu, Gary; Raphael, Benjamin J.

Bioinformatics ; 40(Supplement_1): i228-i236, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940146

ABSTRACT

MOTIVATION: Recently developed spatial lineage tracing technologies induce somatic mutations at specific genomic loci in a population of growing cells and then measure these mutations in the sampled cells along with the physical locations of the cells. These technologies enable high-throughput studies of developmental processes over space and time. However, these applications rely on accurate reconstruction of a spatial cell lineage tree describing both past cell divisions and cell locations. Spatial lineage trees are related to phylogeographic models that have been well-studied in the phylogenetics literature. We demonstrate that standard phylogeographic models based on Brownian motion are inadequate to describe the spatial symmetric displacement (SD) of cells during cell division. RESULTS: We introduce a new model, the SD model for cell motility, that includes symmetric displacements of daughter cells from the parental cell followed by independent diffusion of daughter cells. We show that this model more accurately describes the locations of cells in a real spatial lineage tracing of mouse embryonic stem cells. Combining the spatial SD model with an evolutionary model of DNA mutations, we obtain a phylogeographic model for spatial lineage tracing. Using this model, we devise a maximum likelihood framework, MOLLUSC (Maximum Likelihood Estimation Of Lineage and Location Using Single-Cell Spatial Lineage tracing Data), to co-estimate time-resolved branch lengths, spatial diffusion rate, and mutation rate. On both simulated and real data, we show that MOLLUSC accurately estimates all parameters. In contrast, the Brownian motion model overestimates spatial diffusion rate in all test cases. In addition, the inclusion of spatial information improves accuracy of branch length estimation compared to sequence data alone. On real data, we show that spatial information has more signal than sequence data for branch length estimation, suggesting augmenting lineage tracing technologies with spatial information is useful to overcome the limitations of genome editing in developmental systems. AVAILABILITY AND IMPLEMENTATION: The Python implementation of MOLLUSC is available at https://github.com/raphael-group/MOLLUSC.
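The symmetric displacement idea described above (daughters pushed apart from the parent, then diffusing independently) can be sketched as a one-step simulation. This is an assumption-laden illustration rather than MOLLUSC's model: the abstract does not give the distribution of the displacement or how diffusion scales with time, so the fixed displacement vector, the Gaussian noise, and the function name are all hypothetical.

```python
import random


def divide(parent, displacement, diffusion_sd, rng):
    """Place two daughter cells in 2D after one division.

    The daughters are first pushed in opposite directions by the same
    displacement vector, then each diffuses independently (Gaussian noise).
    """
    px, py = parent
    dx, dy = displacement
    daughters = []
    for sign in (+1, -1):
        x = px + sign * dx + rng.gauss(0, diffusion_sd)
        y = py + sign * dy + rng.gauss(0, diffusion_sd)
        daughters.append((x, y))
    return daughters


rng = random.Random(0)
# Without diffusion the daughters are mirror images around the parent.
still = divide((1.0, 2.0), (0.5, 0.0), diffusion_sd=0.0, rng=rng)
# With diffusion each daughter drifts independently from its mirrored start.
noisy = divide((1.0, 2.0), (0.5, 0.0), diffusion_sd=0.2, rng=rng)
```

Under pure Brownian motion the two daughters would start at the parent's position, so the separation created by the displacement step would have to be absorbed by the diffusion term, which is consistent with the abstract's report that a Brownian model overestimates the diffusion rate.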

Subject(s)

Cell Division , Cell Lineage , Cell Movement , Animals , Mice , Likelihood Functions , Phylogeography , Mutation , Phylogeny

11.

A zero-agnostic model for copy number evolution in cancer.

Schmidt, Henri; Sashittal, Palash; Raphael, Benjamin J.

PLoS Comput Biol ; 19(11): e1011590, 2023 Nov.

Article in English | MEDLINE | ID: mdl-37943952

ABSTRACT

MOTIVATION: New low-coverage single-cell DNA sequencing technologies enable the measurement of copy number profiles from thousands of individual cells within tumors. From these data, one can infer the evolutionary history of the tumor by modeling transformations of the genome via copy number aberrations. Copy number aberrations alter multiple adjacent genomic loci, violating the standard phylogenetic assumption that loci evolve independently. Thus, specialized models to infer copy number phylogenies have been introduced. A widely used model is the copy number transformation (CNT) model in which a genome is represented by an integer vector and a copy number aberration is an event that either increases or decreases the number of copies of a contiguous segment of the genome. The CNT distance between a pair of copy number profiles is the minimum number of events required to transform one profile to another. While this distance can be computed efficiently, no efficient algorithm has been developed to find the most parsimonious phylogeny under the CNT model. RESULTS: We introduce the zero-agnostic copy number transformation (ZCNT) model, a simplification of the CNT model that allows the amplification or deletion of regions with zero copies. We derive a closed-form expression for the ZCNT distance between two copy number profiles and show that, unlike the CNT distance, the ZCNT distance forms a metric. We leverage the closed-form expression for the ZCNT distance and an alternative characterization of copy number profiles to derive polynomial time algorithms for two natural relaxations of the small parsimony problem on copy number profiles. While the alteration of zero copy number regions allowed under the ZCNT model is not biologically realistic, we show on both simulated and real datasets that the ZCNT distance is a close approximation to the CNT distance. Extending our polynomial time algorithm for the ZCNT small parsimony problem, we develop an algorithm, Lazac, for solving the large parsimony problem on copy number profiles. We demonstrate that Lazac outperforms existing methods for inferring copy number phylogenies on both simulated and real data.
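Once zero-copy loci may be altered freely, the minimum number of unit events on contiguous segments has a simple closed form that can be derived by counting breakpoints. The sketch below is that derivation turned into code, assuming a single chromosome and total copy numbers. It is not taken from the paper and is not the Lazac implementation; the expression given by the authors may be stated differently, and the function name is hypothetical.

```python
def interval_event_distance(profile_u, profile_v):
    """Minimum number of +1/-1 events on contiguous segments turning u into v,
    assuming zero-copy loci may be altered freely (zero-agnostic relaxation).
    """
    if len(profile_u) != len(profile_v):
        raise ValueError("profiles must cover the same loci")
    # Pad the difference with zeros so boundaries at both ends are counted.
    delta = [0] + [v - u for u, v in zip(profile_u, profile_v)] + [0]
    # Each event changes exactly two adjacent differences by one unit each,
    # so the total breakpoint magnitude is twice the number of events needed.
    breakpoints = sum(abs(delta[i] - delta[i - 1]) for i in range(1, len(delta)))
    return breakpoints // 2


diploid = [2, 2, 2, 2, 2]
tumor = [2, 3, 3, 1, 2]
distance = interval_event_distance(diploid, tumor)
reverse = interval_event_distance(tumor, diploid)
```

In this toy example one amplification of the second and third loci and one deletion of the fourth locus suffice, and the value is the same in both directions, in line with the abstract's statement that the zero-agnostic distance is a metric.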

Subject(s)

DNA Copy Number Variations , Neoplasms , Humans , Phylogeny , DNA Copy Number Variations/genetics , Neoplasms/genetics , Genomics/methods , Genome , Algorithms

12.

Reconstruction of clone- and haplotype-specific cancer genome karyotypes from bulk tumor samples.

Aganezov, Sergey; Raphael, Benjamin J.

Genome Res ; 30(9): 1274-1290, 2020 09.

Article in English | MEDLINE | ID: mdl-32887685

ABSTRACT

Many cancer genomes are extensively rearranged with aberrant chromosomal karyotypes. Deriving these karyotypes from high-throughput DNA sequencing of bulk tumor samples is complicated because most tumors are a heterogeneous mixture of normal cells and subpopulations of cancer cells, or clones, that harbor distinct somatic mutations. We introduce a new algorithm, Reconstructing Cancer Karyotypes (RCK), to reconstruct haplotype-specific karyotypes of one or more rearranged cancer genomes from DNA sequencing data from a bulk tumor sample. RCK leverages evolutionary constraints on the somatic mutational process in cancer to reduce ambiguity in the deconvolution of admixed sequencing data into multiple haplotype-specific cancer karyotypes. RCK models mixtures containing an arbitrary number of derived genomes and allows the incorporation of information both from short-read and long-read DNA sequencing technologies. We compare RCK to existing approaches on 17 primary and metastatic prostate cancer samples. We find that RCK infers cancer karyotypes that better explain the DNA sequencing data and conform to a reasonable evolutionary model. RCK's reconstructions of clone- and haplotype-specific karyotypes will aid further studies of the role of intra-tumor heterogeneity in cancer development and response to treatment. RCK is freely available as open source software.

Subject(s)

Algorithms , Haplotypes , Karyotyping/methods , Neoplasms/genetics , Chromosome Aberrations , Clone Cells , Computer Simulation , Diploidy , Gene Dosage , Gene Rearrangement , Genome, Human , Humans , Male , Prostatic Neoplasms/genetics , Telomere

13.

netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis.

Elyanow, Rebecca; Dumitrascu, Bianca; Engelhardt, Barbara E; Raphael, Benjamin J.

Genome Res ; 30(2): 195-204, 2020 02.

Article in English | MEDLINE | ID: mdl-31992614

ABSTRACT

Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in single cells. However, because of technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells in a lower-dimensional space, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization. The network regularization takes advantage of prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be nearby each other in the low-dimensional representation. The resulting matrix factorization imputes gene abundance for both zero and nonzero counts and can be used to cluster cells into meaningful subpopulations. We show that netNMF-sc outperforms existing methods at clustering cells and estimating gene-gene covariance using both simulated and real scRNA-seq data, with increasing advantages at higher dropout rates (e.g., >60%). We also show that the results from netNMF-sc are robust to variation in the input network, with more representative networks leading to greater performance gains.
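The network regularization described above, which encourages interacting genes to lie near each other in the low-dimensional representation, can be written as a graph Laplacian penalty on the gene factors. The sketch below computes only that penalty for an unweighted network. How netNMF-sc weights or normalizes the network and which reconstruction loss it pairs with the penalty are not given in the abstract, so the function name and toy factors are assumptions.

```python
def network_penalty(gene_factors, edges):
    """Sum of squared distances between factor vectors of interacting genes.

    gene_factors: dict gene -> list of latent coordinates (rows of W).
    edges: iterable of (gene_a, gene_b) pairs from a prior interaction network.
    Equals trace(W^T L W) for the unweighted graph Laplacian L.
    """
    total = 0.0
    for a, b in edges:
        total += sum((x - y) ** 2
                     for x, y in zip(gene_factors[a], gene_factors[b]))
    return total


# Hypothetical two-dimensional gene factors.
factors = {"g1": [1.0, 0.0], "g2": [1.0, 0.0], "g3": [0.0, 1.0]}
close_pair = network_penalty(factors, [("g1", "g2")])
distant_pair = network_penalty(factors, [("g1", "g3")])
```

A factorization would typically minimize a reconstruction error plus a multiple of this penalty, so a known interaction between `g1` and `g3` would pull their factor vectors together unless the expression data strongly disagree.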

Subject(s)

Epistasis, Genetic/genetics , RNA-Seq , Single-Cell Analysis/methods , Software , Cluster Analysis , Gene Expression Profiling , Humans , Exome Sequencing

14.

Network propagation: a universal amplifier of genetic associations.

Cowen, Lenore; Ideker, Trey; Raphael, Benjamin J; Sharan, Roded.

Nat Rev Genet ; 18(9): 551-562, 2017 09.

Article in English | MEDLINE | ID: mdl-28607512

ABSTRACT

Biological networks are powerful resources for the discovery of genes and genetic modules that drive disease. Fundamental to network analysis is the concept that genes underlying the same phenotype tend to interact; this principle can be used to combine and to amplify signals from individual genes. Recently, numerous bioinformatic techniques have been proposed for genetic analysis using networks, based on random walks, information diffusion and electrical resistance. These approaches have been applied successfully to identify disease genes, genetic modules and drug targets. In fact, all these approaches are variations of a unifying mathematical machinery - network propagation - suggesting that it is a powerful data transformation method of broad utility in genetic research.
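One common instance of the propagation idea summarized above is a random walk with restart, in which prior gene scores are repeatedly spread to network neighbours while a fraction of the mass is returned to the starting scores. The sketch below is a minimal illustration on a toy graph; the function name, the restart probability of 0.5, and the fixed iteration count are assumptions, and the review covers several related formulations rather than this specific one.

```python
def propagate(adjacency, seed_scores, restart=0.5, iterations=100):
    """Random walk with restart on an undirected graph.

    adjacency: dict mapping each node to a list of neighbours.
    seed_scores: dict mapping nodes to prior scores (e.g. association signal).
    restart: probability of returning to the seed distribution at each step.
    """
    nodes = list(adjacency)
    total = sum(seed_scores.get(n, 0.0) for n in nodes)
    prior = {n: seed_scores.get(n, 0.0) / total for n in nodes}
    scores = dict(prior)
    for _ in range(iterations):
        updated = {n: restart * prior[n] for n in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                # Isolated node: its mass has nowhere to go, keep it in place.
                updated[node] += (1 - restart) * scores[node]
                continue
            share = (1 - restart) * scores[node] / len(neighbours)
            for nb in neighbours:
                updated[nb] += share
        scores = updated
    return scores


# Toy path graph A - B - C - D with all prior signal on gene A.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
result = propagate(graph, {"A": 1.0})
```

Genes closer to the seed in the network end up with higher scores, which is how a weak signal at one gene can raise the rank of its interaction partners.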

Subject(s)

Computational Biology , Disease/genetics , Gene Regulatory Networks , Genetic Association Studies , Software , Algorithms , Humans , Protein Interaction Maps , Proteins/metabolism

15.

Therapy-induced mutations drive the genomic landscape of relapsed acute lymphoblastic leukemia.

Li, Benshang; Brady, Samuel W; Ma, Xiaotu; Shen, Shuhong; Zhang, Yingchi; Li, Yongjin; Szlachta, Karol; Dong, Li; Liu, Yu; Yang, Fan; Wang, Ningling; Flasch, Diane A; Myers, Matthew A; Mulder, Heather L; Ding, Lixia; Liu, Yanling; Tian, Liqing; Hagiwara, Kohei; Xu, Ke; Zhou, Xin; Sioson, Edgar; Wang, Tianyi; Yang, Liu; Zhao, Jie; Zhang, Hui; Shao, Ying; Sun, Hongye; Sun, Lele; Cai, Jiaoyang; Sun, Hui-Ying; Lin, Ting-Nien; Du, Lijuan; Li, Hui; Rusch, Michael; Edmonson, Michael N; Easton, John; Zhu, Xiaofan; Zhang, Jingliao; Cheng, Cheng; Raphael, Benjamin J; Tang, Jingyan; Downing, James R; Alexandrov, Ludmil B; Zhou, Bin-Bing S; Pui, Ching-Hon; Yang, Jun J; Zhang, Jinghui.

Blood ; 135(1): 41-55, 2020 01 02.

Article in English | MEDLINE | ID: mdl-31697823

ABSTRACT

To study the mechanisms of relapse in acute lymphoblastic leukemia (ALL), we performed whole-genome sequencing of 103 diagnosis-relapse-germline trios and ultra-deep sequencing of 208 serial samples in 16 patients. Relapse-specific somatic alterations were enriched in 12 genes (NR3C1, NR3C2, TP53, NT5C2, FPGS, CREBBP, MSH2, MSH6, PMS2, WHSC1, PRPS1, and PRPS2) involved in drug response. Their prevalence was 17% in very early relapse (<9 months from diagnosis), 65% in early relapse (9-36 months), and 32% in late relapse (>36 months) groups. Convergent evolution, in which multiple subclones harbor mutations in the same drug resistance gene, was observed in 6 relapses and confirmed by single-cell sequencing in 1 case. Mathematical modeling and mutational signature analysis indicated that early relapse resistance acquisition was frequently a 2-step process in which a persistent clone survived initial therapy and later acquired bona fide resistance mutations during therapy. In contrast, very early relapses arose from preexisting resistant clone(s). Two novel relapse-specific mutational signatures, one of which was caused by thiopurine treatment based on in vitro drug exposure experiments, were identified in early and late relapses but were absent from 2540 pan-cancer diagnosis samples and 129 non-ALL relapses. The novel signatures were detected in 27% of relapsed ALLs and were responsible for 46% of acquired resistance mutations in NT5C2, PRPS1, NR3C1, and TP53. These results suggest that chemotherapy-induced drug resistance mutations facilitate a subset of pediatric ALL relapses.

Subject(s)

Biomarkers, Tumor/genetics , Methotrexate/therapeutic use , Mutagenesis/drug effects , Mutation , Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics , Precursor Cell Lymphoblastic Leukemia-Lymphoma/pathology , 5'-Nucleotidase/genetics , Antimetabolites, Antineoplastic/therapeutic use , Child , DNA Mutational Analysis , Female , Follow-Up Studies , Genomics , High-Throughput Nucleotide Sequencing , Humans , Male , Precursor Cell Lymphoblastic Leukemia-Lymphoma/drug therapy , Prognosis , Receptors, Glucocorticoid/genetics , Survival Rate , Tumor Suppressor Protein p53/genetics

16.

Copy number evolution with weighted aberrations in cancer.

Zeira, Ron; Raphael, Benjamin J.

Bioinformatics ; 36(Suppl_1): i344-i352, 2020 07 01.

Article in English | MEDLINE | ID: mdl-32657354

ABSTRACT

MOTIVATION: Copy number aberrations (CNAs), which delete or amplify large contiguous segments of the genome, are a common type of somatic mutation in cancer. Copy number profiles, representing the number of copies of each region of a genome, are readily obtained from whole-genome sequencing or microarrays. However, modeling copy number evolution is a substantial challenge, because different CNAs may overlap with one another on the genome. A recent popular model for copy number evolution is the copy number distance (CND), defined as the length of a shortest sequence of deletions and amplifications of contiguous segments that transforms one profile into the other. In the CND, all events contribute equally; however, it is well known that rates of CNAs vary by length, genomic position and type (amplification versus deletion). RESULTS: We introduce a weighted CND that allows events to have varying weights, or probabilities, based on their length, position and type. We derive an efficient algorithm to compute the weighted CND as well as the associated transformation. This algorithm is based on the observation that the constraint matrix of the underlying optimization problem is totally unimodular. We show that the weighted CND improves phylogenetic reconstruction on simulated data where CNAs occur with varying probabilities, aids in the derivation of phylogenies from ultra-low-coverage single-cell DNA sequencing data and helps estimate CNA rates in a large pan-cancer dataset. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/raphael-group/WCND. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

DNA Copy Number Variations , Neoplasms , Humans , Neoplasms/genetics , Phylogeny , Sequence Analysis, DNA , Whole Genome Sequencing

17.

Identifying tumor clones in sparse single-cell mutation data.

Myers, Matthew A; Zaccaria, Simone; Raphael, Benjamin J.

Bioinformatics ; 36(Suppl_1): i186-i193, 2020 07 01.

Article in English | MEDLINE | ID: mdl-32657385

ABSTRACT

MOTIVATION: Recent single-cell DNA sequencing technologies enable whole-genome sequencing of hundreds to thousands of individual cells. However, these technologies have ultra-low sequencing coverage (<0.5× per cell) which has limited their use to the analysis of large copy-number aberrations (CNAs) in individual cells. While CNAs are useful markers in cancer studies, single-nucleotide mutations are equally important, both in cancer studies and in other applications. However, ultra-low coverage sequencing yields single-nucleotide mutation data that are too sparse for current single-cell analysis methods. RESULTS: We introduce SBMClone, a method to infer clusters of cells, or clones, that share groups of somatic single-nucleotide mutations. SBMClone uses a stochastic block model to overcome sparsity in ultra-low coverage single-cell sequencing data, and we show that SBMClone accurately infers the true clonal composition on simulated datasets with coverage as low as 0.2×. We applied SBMClone to single-cell whole-genome sequencing data from two breast cancer patients obtained using two different sequencing technologies. On the first patient, sequenced using the 10X Genomics CNV solution with sequencing coverage ≈0.03×, SBMClone recovers the major clonal composition when incorporating a small amount of additional information. On the second patient, where pre- and post-treatment tumor samples were sequenced using DOP-PCR with sequencing coverage ≈0.5×, SBMClone shows that tumor cells are present in the post-treatment sample, contrary to published analysis of this dataset. AVAILABILITY AND IMPLEMENTATION: SBMClone is available on the GitHub repository https://github.com/raphael-group/SBMClone. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genomics , Software , Algorithms , Clone Cells , High-Throughput Nucleotide Sequencing , Humans , Mutation , Whole Genome Sequencing

18.

STARCH: copy number and clone inference from spatial transcriptomics data.

Elyanow, Rebecca; Zeira, Ron; Land, Max; Raphael, Benjamin J.

Phys Biol ; 18(3): 035001, 2021 03 09.

Article in English | MEDLINE | ID: mdl-33022659

ABSTRACT

Tumors are highly heterogeneous, consisting of cell populations with both transcriptional and genetic diversity. These diverse cell populations are spatially organized within a tumor, creating a distinct tumor microenvironment. A new technology called spatial transcriptomics can measure spatial patterns of gene expression within a tissue by sequencing RNA transcripts from a grid of spots, each containing a small number of cells. In tumor cells, these gene expression patterns represent the combined contribution of regulatory mechanisms, which alter the rate at which a gene is transcribed, and genetic diversity, particularly copy number aberrations (CNAs) which alter the number of copies of a gene in the genome. CNAs are common in tumors and often promote cancer growth through upregulation of oncogenes or downregulation of tumor-suppressor genes. We introduce a new method STARCH (spatial transcriptomics algorithm reconstructing copy-number heterogeneity) to infer CNAs from spatial transcriptomics data. STARCH overcomes challenges in inferring CNAs from RNA-sequencing data by leveraging the observation that cells located nearby in a tumor are likely to share similar CNAs. We find that STARCH outperforms existing methods for inferring CNAs from RNA-sequencing data without incorporating spatial information.

Subject(s)

Clone Cells , DNA Copy Number Variations , Gene Expression Profiling/instrumentation , Tumor Microenvironment/genetics , Algorithms

19.

Single-cell sequencing data reveal widespread recurrence and loss of mutational hits in the life histories of tumors.

Kuipers, Jack; Jahn, Katharina; Raphael, Benjamin J; Beerenwinkel, Niko.

Genome Res ; 27(11): 1885-1894, 2017 11.

Article in English | MEDLINE | ID: mdl-29030470

ABSTRACT

Intra-tumor heterogeneity poses substantial challenges for cancer treatment. A tumor's composition can be deduced by reconstructing its mutational history. Central to current approaches is the infinite sites assumption that every genomic position can only mutate once over the lifetime of a tumor. The validity of this assumption has never been quantitatively assessed. We developed a rigorous statistical framework to test the infinite sites assumption with single-cell sequencing data. Our framework accounts for the high noise and contamination present in such data. We found strong evidence for the same genomic position being mutationally affected multiple times in individual tumors for 11 of 12 single-cell sequencing data sets from a variety of human cancers. Seven cases involved the loss of earlier mutations, five of which occurred at sites unaffected by large-scale genomic deletions. Four cases exhibited a parallel mutation, potentially indicating convergent evolution at the base pair level. Our results refute the general validity of the infinite sites assumption and indicate that more complex models are needed to adequately quantify intra-tumor heterogeneity for more effective cancer treatment.

Subject(s)

Exome Sequencing/methods , Mutation , Neoplasms/genetics , Single-Cell Analysis/methods , Evolution, Molecular , Genetic Heterogeneity , Humans , Models, Statistical

20.

GenomeVIP: a cloud platform for genomic variant discovery and interpretation.

Mashl, R Jay; Scott, Adam D; Huang, Kuan-Lin; Wyczalkowski, Matthew A; Yoon, Christopher J; Niu, Beifang; DeNardo, Erin; Yellapantula, Venkata D; Handsaker, Robert E; Chen, Ken; Koboldt, Daniel C; Ye, Kai; Fenyö, David; Raphael, Benjamin J; Wendl, Michael C; Ding, Li.

Genome Res ; 27(8): 1450-1459, 2017 08.

Article in English | MEDLINE | ID: mdl-28522612

ABSTRACT

Identifying genomic variants is a fundamental first step toward the understanding of the role of inherited and acquired variation in disease. The accelerating growth in the corpus of sequencing data that underpins such analysis is making the data-download bottleneck more evident, placing substantial burdens on the research community to keep pace. As a result, the search for alternative approaches to the traditional "download and analyze" paradigm on local computing resources has led to a rapidly growing demand for cloud-computing solutions for genomics analysis. Here, we introduce the Genome Variant Investigation Platform (GenomeVIP), an open-source framework for performing genomics variant discovery and annotation using cloud- or local high-performance computing infrastructure. GenomeVIP orchestrates the analysis of whole-genome and exome sequence data using a set of robust and popular task-specific tools, including VarScan, GATK, Pindel, BreakDancer, Strelka, and Genome STRiP, through a web interface. GenomeVIP has been used for genomic analysis in large-data projects such as the TCGA PanCanAtlas and in other projects, such as the ICGC Pilots, CPTAC, ICGC-TCGA DREAM Challenges, and the 1000 Genomes SV Project. Here, we demonstrate GenomeVIP's ability to provide high-confidence annotated somatic, germline, and de novo variants of potential biological significance using publicly available data sets.

Subject(s)

Cloud Computing , Genetic Variation , Genome, Human , Genomics/methods , Neoplasms/genetics , Software , Databases, Genetic , High-Throughput Nucleotide Sequencing/methods , Humans
