ABSTRACT
Investigating tumor heterogeneity with single-cell sequencing technologies is imperative for understanding how tumors evolve, since each cell subpopulation harbors a unique set of genomic features that yields a unique phenotype, which is bound to have clinical relevance. Clustering cells based on copy number data obtained from single-cell DNA sequencing provides an opportunity to identify different tumor cell subpopulations. Accordingly, computational methods have emerged for single-cell copy number profiling and clustering; however, these two tasks have been handled sequentially, with various ad hoc pre- and post-processing steps, a procedure that is vulnerable to introducing clustering artifacts. Our method, CopyMix, avoids these artifacts by using variational inference in a novel mixture model to jointly infer cell clusters and their underlying copy number profiles. Our probabilistic graphical model is an improved version of the mixture of hidden Markov models, designed specifically for joint single-cell copy number profiling and clustering. For the evaluation, we used the likelihood-ratio test, the CH index, Silhouette, V-measure, and total variation scores. CopyMix performs well on both biological and simulated data. Our favorable results indicate considerable potential for clinical impact from using CopyMix in studies of tumor heterogeneity in cancer.
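Of the evaluation scores listed in this abstract, V-measure has a compact definition: it is the harmonic mean of homogeneity and completeness, both computed from label entropies. The sketch below is a minimal plain-Python illustration of that definition, not CopyMix's evaluation code; the function names and the toy labels are assumptions made for this example.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equally long label lists."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(true_labels, cluster_labels):
    """Harmonic mean of homogeneity and completeness (beta = 1)."""
    h_true = entropy(true_labels)
    h_clu = entropy(cluster_labels)
    # By convention both scores are 1 when the corresponding entropy is zero.
    homogeneity = 1.0 if h_true == 0 else (
        1.0 - conditional_entropy(true_labels, cluster_labels) / h_true)
    completeness = 1.0 if h_clu == 0 else (
        1.0 - conditional_entropy(cluster_labels, true_labels) / h_clu)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy example: the clustering matches the truth up to a renaming of clusters.
score = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
```

Because the score depends only on how the two partitions overlap, renaming the clusters does not change it, which is why it suits comparisons against ground-truth cell groupings.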
ABSTRACT
Spatial and genomic heterogeneity of tumors are crucial factors influencing cancer progression, treatment, and survival. However, a technology for directly mapping clones in the tumor tissue based on somatic point mutations is lacking. Here, we propose Tumoroscope, the first probabilistic model that accurately infers cancer clones and their localization at close to single-cell resolution by integrating pathological images, whole exome sequencing, and spatial transcriptomics data. In contrast to previous methods, Tumoroscope explicitly addresses the problem of deconvoluting the proportions of clones in spatial transcriptomics spots. Applied to a reference prostate cancer dataset and a newly generated breast cancer dataset, Tumoroscope reveals spatial patterns of clone colocalization and mutual exclusion in sub-areas of the tumor tissue. We further infer clone-specific gene expression levels and the most highly expressed genes for each clone. In summary, Tumoroscope enables an integrated study of the spatial, genomic, and phenotypic organization of tumors.
Subject(s)
Breast Neoplasms , Genomics , Prostatic Neoplasms , Humans , Prostatic Neoplasms/genetics , Prostatic Neoplasms/pathology , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Genomics/methods , Female , Male , Genetic Heterogeneity , Exome Sequencing , Neoplasms/genetics , Neoplasms/pathology , Gene Expression Profiling/methods , Single-Cell Analysis/methods , Gene Expression Regulation, Neoplastic , Transcriptome
ABSTRACT
Cell lineage tree reconstruction methods are developed for various tasks, such as investigating development, differentiation, and cancer progression. Single-cell sequencing technologies enable more thorough analysis at higher resolution. We present Scuphr, a distance-based cell lineage tree reconstruction method that uses bulk and single-cell DNA sequencing data from healthy tissues. Scuphr accounts for common challenges of single-cell DNA sequencing, such as allelic dropouts and amplification errors. It computes the distance between cell pairs and reconstructs the lineage tree using the neighbor-joining algorithm. With its embarrassingly parallel design, Scuphr runs faster than state-of-the-art methods while obtaining better accuracy. The method's robustness is investigated using various synthetic datasets and a biological dataset of 18 cells.
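The neighbor-joining step mentioned in this abstract can be sketched as follows. Scuphr's own distance computation is not reproduced here; the sketch assumes a precomputed pairwise distance table and shows only the generic neighbor-joining procedure. The function name, the frozenset-keyed table, the internal node names, and the toy five-leaf distances are all illustrative assumptions.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Generic neighbor joining on a pairwise distance table.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Returns (joins, last_edge): joins is a list of
    (new_node, child_a, length_a, child_b, length_b) and last_edge is the
    final (node_a, node_b, length) connecting the two remaining nodes.
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums r_i = sum over k of d(i, k).
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair minimising Q(i, j) = (n - 2) d(i, j) - r_i - r_j.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        d_ab = d[frozenset((a, b))]
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        u = f"node{len(joins)}"  # hypothetical name for the new internal node
        joins.append((u, a, len_a, b, len_b))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab
                )
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    a, b = nodes
    return joins, (a, b, d[frozenset((a, b))])


# Toy distances between five cells (made-up numbers for illustration).
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
toy_dist = {frozenset(p): v for p, v in pairs.items()}
joins, last_edge = neighbor_joining(["a", "b", "c", "d", "e"], toy_dist)
```

Each pairwise distance can be computed independently of the others, which is consistent with the embarrassingly parallel design the abstract describes; only the joining loop itself is sequential.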
Subject(s)
Algorithms , Cell Lineage , Computational Biology , Single-Cell Analysis , Cell Lineage/genetics , Computational Biology/methods , Models, Statistical , Sequence Analysis, DNA/methods , Single-Cell Analysis/methods , Software
ABSTRACT
MOTIVATION: Copy number variations (CNVs) are common genetic alterations in tumour cells. The delineation of CNVs holds promise for enhancing our comprehension of cancer progression. Moreover, accurate inference of CNVs from single-cell sequencing data is essential for unravelling intratumoral heterogeneity. However, existing inference methods face limitations in resolution and sensitivity. RESULTS: To address these challenges, we present CopyVAE, a deep learning framework based on a variational autoencoder architecture. Through experiments, we demonstrate that CopyVAE can accurately and reliably detect CNVs from data obtained using single-cell RNA sequencing. CopyVAE surpasses existing methods in terms of sensitivity and specificity. We also discuss CopyVAE's potential to advance our understanding of genetic alterations and their impact on disease advancement. AVAILABILITY AND IMPLEMENTATION: CopyVAE is implemented and freely available under the MIT license at https://github.com/kurtsemih/copyVAE.
Subject(s)
DNA Copy Number Variations , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Deep Learning , Software , Transcriptome/genetics , Sequence Analysis, RNA/methods , Neoplasms/genetics
ABSTRACT
Cell types can be classified according to shared patterns of transcription. Non-genetic variability among individual cells of the same type has been ascribed to stochastic transcriptional bursting and transient cell states. Using high-coverage single-cell RNA profiling, we asked whether long-term, heritable differences in gene expression can impart diversity within cells of the same type. Studying clonal human lymphocytes and mouse brain cells, we uncovered a vast diversity of heritable gene expression patterns among different clones of cells of the same type in vivo. We combined chromatin accessibility and RNA profiling on different lymphocyte clones to reveal thousands of regulatory regions exhibiting interclonal variation, which could be directly linked to interclonal variation in gene expression. Our findings identify a source of cellular diversity, which may have important implications for how cellular populations are shaped by selective processes in development, aging, and disease. A record of this paper's transparent peer review process is included in the supplemental information.
Subject(s)
Chromatin , RNA , Humans , Mice , Animals , Aging , Gene Expression
ABSTRACT
The spatial distribution of lymphocyte clones within tissues is critical to their development, selection, and expansion. We have developed spatial transcriptomics of variable, diversity, and joining (VDJ) sequences (Spatial VDJ), a method that maps B cell and T cell receptor sequences in human tissue sections. Spatial VDJ captures lymphocyte clones that match canonical B and T cell distributions and amplifies clonal sequences confirmed by orthogonal methods. We found spatial congruency between paired receptor chains, developed a computational framework to predict receptor pairs, and linked the expansion of distinct B cell clones to different tumor-associated gene expression programs. Spatial VDJ delineates B cell clonal diversity and lineage trajectories within their anatomical niche. Thus, Spatial VDJ captures lymphocyte spatial clonal architecture across tissues, providing a platform to harness clonal sequences for therapy.
Subject(s)
B-Lymphocytes , Pre-B Cell Receptors , Receptors, Antigen, T-Cell , T-Lymphocytes , Humans , B-Lymphocytes/metabolism , Clone Cells/metabolism , Gene Expression Profiling/methods , Pre-B Cell Receptors/genetics , Receptors, Antigen, T-Cell/genetics , T-Lymphocytes/metabolism
ABSTRACT
Spatial transcriptomics maps gene expression across tissues, posing the challenge of determining the spatial arrangement of different cell types. However, spatial transcriptomics spots contain multiple cells, so the observed signal comes from mixtures of cells of different types. Here, we propose an innovative probabilistic model, Celloscope, that utilizes established prior knowledge on marker genes for cell type deconvolution from spatial transcriptomics data. Celloscope outperforms other methods on simulated data, successfully indicates known brain structures and spatially distinguishes between inhibitory and excitatory neuron types in mouse brain tissue, and dissects the large heterogeneity of immune infiltrate composition in prostate gland tissue.
Subject(s)
Gene Expression Profiling , Transcriptome , Male , Animals , Mice , Neurons , Brain , Models, Statistical
ABSTRACT
Functional characterization of cancer clones can shed light on the evolutionary mechanisms driving cancer proliferation and relapse. Single-cell RNA sequencing data provide grounds for understanding the functional state of cancer as a whole; however, much research remains to identify and reconstruct clonal relationships toward characterizing the changes in functions of individual clones. We present PhylEx, which integrates bulk genomics data with co-occurrences of mutations from single-cell RNA sequencing data to reconstruct high-fidelity clonal trees. We evaluate PhylEx on synthetic and well-characterized high-grade serous ovarian cancer cell line datasets. PhylEx outperforms state-of-the-art methods both in clonal tree reconstruction and in identifying clones. We analyze high-grade serous ovarian cancer and breast cancer data to show that PhylEx exploits clonal expression profiles beyond what is possible with expression-based clustering methods and clears the way for accurate inference of clonal trees and robust phylo-phenotypic analysis of cancer.
Subject(s)
Ovarian Neoplasms , Trees , Female , Humans , Trees/genetics , Transcriptome , Clonal Evolution , Neoplasm Recurrence, Local , Ovarian Neoplasms/genetics , Clone Cells , Single-Cell Analysis/methods
ABSTRACT
Breast cancer (BC) is a complex disease comprising multiple distinct subtypes with different genetic features and pathological characteristics. Although a large number of antineoplastic compounds have been approved for clinical use, patient-to-patient variability in drug response is frequently observed, highlighting the need for efficient treatment prediction for individualized therapy. Several patient-derived models have been established lately for the prediction of drug response. However, each of these models has its limitations that impede their clinical application. Here, we report that the whole-tumor cell culture (WTC) ex vivo model could be stably established from all breast tumors with a high success rate (98 out of 116), and it could reassemble the parental tumors with the endogenous microenvironment. We observed strong clinical associations and predictive values from the investigation of a broad range of BC therapies with WTCs derived from a patient cohort. The accuracy was further supported by the correlation between WTC-based test results and patients' clinical responses in a separate validation study, where the neoadjuvant treatment regimens of 15 BC patients were mimicked. Collectively, the WTC model allows us to accomplish personalized drug testing within 10 d, even for small-sized tumors, highlighting its potential for individualized BC therapy. Furthermore, coupled with genomic and transcriptomic analyses, WTC-based testing can also help to stratify specific patient groups for assignment into appropriate clinical trials, as well as validate potential biomarkers during drug development.
Subject(s)
Antineoplastic Agents , Breast Neoplasms , Humans , Female , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Breast Neoplasms/pathology , Antineoplastic Agents/pharmacology , Antineoplastic Agents/therapeutic use , Gene Expression Profiling , Biomarkers , Cell Culture Techniques , Tumor Microenvironment
ABSTRACT
Identifying the interrelations among cancer driver genes and the patterns in which the driver genes get mutated is critical for understanding cancer. In this paper, we study cross-sectional data from cohorts of tumors to identify the cancer-type (or subtype) specific process in which the cancer driver genes accumulate critical mutations. We model this mutation accumulation process using a tree, where each node includes a driver gene or a set of driver genes. A mutation in each node enables its children to have a chance of mutating. This model simultaneously explains the mutual exclusivity patterns observed in mutations in specific cancer genes (by its nodes) and the temporal order of events (by its edges). We introduce a computationally efficient dynamic programming procedure for calculating the likelihood of our noisy datasets and use it to build our Markov Chain Monte Carlo (MCMC) inference algorithm, ToMExO. Together with a set of engineered MCMC moves, our fast likelihood calculations enable us to work with datasets with hundreds of genes and thousands of tumors, which cannot be dealt with using available cancer progression analysis methods. We demonstrate our method's performance on several synthetic datasets covering various scenarios for cancer progression dynamics. Then, a comparison against two state-of-the-art methods on a moderate-size biological dataset shows the merits of our algorithm in identifying significant and valid patterns. Finally, we present our analyses of several large biological datasets, including colorectal cancer, glioblastoma, and pancreatic cancer. In all the analyses, we validate the results using a set of method-independent metrics testing the causality and significance of the relations identified by ToMExO or competing methods.
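The tree model described in this abstract can be illustrated with a small generative sketch. This is not the ToMExO likelihood or its MCMC algorithm: it only simulates noise-free tumors from a hand-written tree, under the assumptions that each node "fires" with some probability once its parent has fired and that a firing node mutates exactly one of its genes, chosen uniformly. The tree, the gene names `g1` to `g5`, and the probabilities are hypothetical.

```python
import random

# Hypothetical progression tree: node -> (parent, genes, firing probability).
# A parent of None means the node hangs directly under the (gene-free) root.
TREE = {
    "n1": (None, ["g1"], 0.8),
    "n2": ("n1", ["g2", "g3"], 0.6),
    "n3": ("n2", ["g4"], 0.5),
    "n4": ("n1", ["g5"], 0.3),
}


def simulate_tumor(tree, rng):
    """Sample the set of mutated genes for one tumor (no observation noise)."""
    fired = set()
    mutated = set()
    # Parents must be listed before their children, as in TREE above.
    for node, (parent, genes, p_fire) in tree.items():
        parent_fired = parent is None or parent in fired
        if parent_fired and rng.random() < p_fire:
            fired.add(node)
            # One gene per fired node: genes sharing a node are mutually exclusive.
            mutated.add(rng.choice(genes))
    return mutated


rng = random.Random(0)
tumors = [simulate_tumor(TREE, rng) for _ in range(1000)]
```

In data simulated this way, `g2` and `g3` never co-occur (mutual exclusivity within a node), and `g4` appears only in tumors that already carry `g1` and one of `g2` or `g3` (temporal order along edges), which are the two kinds of pattern the abstract says the model explains.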
Subject(s)
Glioblastoma , Neoplasms , Child , Humans , Cross-Sectional Studies , Neoplasms/genetics , Neoplasms/pathology , Neoplastic Processes , Algorithms , Monte Carlo Method , Mutation , Glioblastoma/genetics
ABSTRACT
MOTIVATION: DNA methylation plays a key role in a variety of biological processes. Recently, Nanopore long-read sequencing has enabled direct detection of these modifications. As a consequence, a range of computational methods have been developed to exploit Nanopore data for methylation detection. However, current approaches rely on a human-defined threshold to detect the methylation status of a genomic position and are not optimized to detect sites methylated at low frequency. Furthermore, most methods use either the Nanopore signals or the basecalling errors as the model input and do not take advantage of their combination. RESULTS: Here, we present DeepMP, a convolutional neural network-based model that takes information from Nanopore signals and basecalling errors to detect whether a given motif in a read is methylated or not. Besides, DeepMP introduces a threshold-free position modification calling model sensitive to sites methylated at low frequency across cells. We comprehensively benchmarked DeepMP against state-of-the-art methods on Escherichia coli, human and pUC19 datasets. DeepMP outperforms current approaches at read-based and position-based methylation detection across sites methylated at different frequencies in the three datasets. AVAILABILITY AND IMPLEMENTATION: DeepMP is implemented and freely available under MIT license at https://github.com/pepebonet/DeepMP. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Deep Learning , Nanopore Sequencing , Nanopores , Humans , Software , Sequence Analysis, DNA , High-Throughput Nucleotide Sequencing/methods , Escherichia coli/genetics , DNA/genetics
ABSTRACT
Identifying mutations in the genes that give cancer a selective advantage is an important step towards research and clinical objectives. As such, there has been growing interest in developing methods for identifying driver genes and their temporal order within a single patient (intra-tumor) as well as across a cohort of patients (inter-tumor). In this paper, we develop a probabilistic model for tumor progression in which the driver genes are clustered into several ordered driver pathways. We develop an efficient inference algorithm that scales favorably with the number of genes and samples compared to a previously introduced ILP-based method. Adopting a probabilistic approach also allows principled approaches to model selection and uncertainty quantification. Using a large set of experiments on synthetic datasets, we demonstrate superior performance compared to the ILP-based method. We also analyze two biological datasets, of colorectal cancer and glioblastoma. We emphasize that while the ILP-based method puts many seemingly passenger genes in the driver pathways, our algorithm stays focused on true driver genes and outputs more accurate models of cancer progression.
Subject(s)
Genes, Neoplasm/genetics , Models, Statistical , Neoplasms/genetics , Neoplasms/pathology , Algorithms , Computational Biology , Databases, Genetic , Disease Progression , Humans , Mutation/genetics
ABSTRACT
Intra-tumor heterogeneity is one of the biggest challenges in cancer treatment today. Here we investigate tissue-wide gene expression heterogeneity throughout a multifocal prostate cancer using the spatial transcriptomics (ST) technology. Utilizing a novel approach for deconvolution, we analyze the transcriptomes of nearly 6750 tissue regions and extract distinct expression profiles for the different tissue components, such as stroma, normal and PIN glands, immune cells and cancer. We distinguish healthy and diseased areas and thereby provide insight into gene expression changes during the progression of prostate cancer. Compared to pathologist annotations, we delineate the extent of cancer foci more accurately, interestingly without link to histological changes. We identify gene expression gradients in stroma adjacent to tumor regions that allow for re-stratification of the tumor microenvironment. The establishment of these profiles is the first step towards an unbiased view of prostate cancer and can serve as a dictionary for future studies.
Subject(s)
Adenocarcinoma/genetics , Gene Expression Regulation, Neoplastic , Prostatic Neoplasms/genetics , Transcriptome/genetics , Adenocarcinoma/pathology , Adenocarcinoma/surgery , Computational Biology , Disease Progression , Gene Expression Profiling , Humans , Male , Prostate/cytology , Prostate/pathology , Prostate/surgery , Prostatectomy , Prostatic Neoplasms/pathology , Prostatic Neoplasms/surgery , RNA, Messenger/genetics , Stromal Cells/pathology , Tumor Microenvironment/genetics
ABSTRACT
Metastatic breast cancers are still incurable. Characterizing the evolutionary landscape of these cancers, including the role of metastatic axillary lymph nodes (ALNs) in seeding distant organ metastasis, can provide a rational basis for effective treatments. Here, we have described the genomic analyses of the primary tumors and metastatic lesions from 99 samples obtained from 20 patients with breast cancer. Our evolutionary analyses revealed diverse spreading and seeding patterns that govern tumor progression. Although linear evolution to successive metastatic sites was common, parallel evolution from the primary tumor to multiple distant sites was also evident. Metastatic spreading was frequently coupled with polyclonal seeding, in which multiple metastatic subclones originated from the primary tumor and/or other distant metastases. Synchronous ALN metastasis, a well-established prognosticator of breast cancer, was not involved in seeding the distant metastasis, suggesting a hematogenous route for cancer dissemination. Clonal evolution coincided frequently with emerging driver alterations and evolving mutational processes, notably an increase in apolipoprotein B mRNA-editing enzyme, catalytic polypeptide-like-associated (APOBEC-associated) mutagenesis. Our data provide genomic evidence for a role of ALN metastasis in seeding distant organ metastasis and elucidate the evolving mutational landscape during cancer progression.
Subject(s)
Breast Neoplasms/genetics , Evolution, Molecular , Mutation , Breast Neoplasms/mortality , Breast Neoplasms/pathology , Female , Humans , Lymph Nodes/metabolism , Lymph Nodes/pathology , Lymphatic Metastasis , Neoplasm Metastasis
ABSTRACT
Cancer arises when pathways that control cell functions such as proliferation and migration are dysregulated to such an extent that cells start to divide uncontrollably and eventually spread throughout the body, ultimately endangering the survival of the affected individual. It is well established that somatic mutations are important in cancer initiation and progression as well as in the creation of tumor diversity. Modifications of the transcriptome are now also emerging as a significant force during the transition from normal cell to malignant tumor. Editing of adenosine (A) to inosine (I) in double-stranded RNA, catalyzed by adenosine deaminases acting on RNA (ADARs), is one dynamic modification that, in a combinatorial manner, can give rise to a very diverse transcriptome. Since the cell interprets inosine as guanosine (G), editing can result in non-synonymous codon changes in transcripts and in alternative splicing, and can also affect targeting and disrupt maturation of microRNA. ADAR editing is essential for survival in mammals, but its dysregulation can lead to cancer. ADAR1 is overexpressed in, for instance, lung cancer, liver cancer, esophageal cancer, and chronic myelogenous leukemia, which with few exceptions promotes cancer progression. In contrast, ADAR2 is expressed at low levels in, e.g., glioblastoma, where the lower levels of ADAR2 editing lead to malignant phenotypes. Altogether, RNA editing by the ADAR enzymes is a powerful regulatory mechanism during tumorigenesis. Depending on the cell type, cancer progression seems to be induced mainly by ADAR1 upregulation or ADAR2 downregulation, although in a few cases ADAR1 is instead downregulated. In this review, we discuss how aberrant editing of specific substrates contributes to malignancy.
Subject(s)
Adenosine Deaminase/metabolism , Neoplasms/genetics , RNA Editing , RNA, Double-Stranded/genetics , RNA-Binding Proteins/metabolism , Animals , Disease Progression , Gene Expression Regulation, Neoplastic , Humans , Neoplasms/metabolism , Neoplasms/pathology , RNA Isoforms/genetics , RNA Isoforms/metabolism , RNA, Double-Stranded/metabolism
ABSTRACT
A complex disease has, by definition, multiple genetic causes. In theory, these causes could be identified individually, but their identification will likely benefit from informed use of anticipated interactions between causes. In addition, characterizing and understanding interactions must be considered key to revealing the etiology of any complex disease. Large-scale collaborative efforts are now paving the way for comprehensive studies of interaction. As a consequence, there is a need for methods with a computational efficiency sufficient for modern data sets as well as for improvements of statistical accuracy and power. Another issue is that, currently, the relation between different methods for interaction inference is in many cases not transparent, complicating the comparison and interpretation of results between different interaction studies. In this paper we present computationally efficient tests of interaction for the complete family of generalized linear models (GLMs). The tests can be applied for inference of single or multiple interaction parameters, but we show, by simulation, that jointly testing the full set of interaction parameters yields superior power and control of false positive rate. Based on these tests we also describe how to combine results from multiple independent studies of interaction in a meta-analysis. We investigate the impact of several assumptions commonly made when modeling interactions. We also show that, across the important class of models with a full set of interaction parameters, jointly testing the interaction parameters yields identical results. Further, we apply our method to genetic data for cardiovascular disease. This allowed us to identify a putative interaction involved in Lp(a) plasma levels between two 'tag' variants in the LPA locus (p = 2.42 × 10^-9) as well as replicate the interaction (p = 6.97 × 10^-7). Finally, our meta-analysis method is used in a small (N = 16,181) study of interactions in myocardial infarction.
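To make "a full set of interaction parameters" concrete, here is a sketch in notation of our own; the symbols g, α, β, γ, and δ are assumptions for illustration and are not taken from the paper. For two variants with genotypes i, j in {0, 1, 2} and a GLM link function g, one common parameterization is:

```latex
\begin{aligned}
g\bigl(\mathbb{E}[\,Y \mid G_1 = i,\ G_2 = j\,]\bigr) &= \alpha + \beta_i + \gamma_j + \delta_{ij}, \\
\beta_0 = \gamma_0 &= 0, \qquad \delta_{0j} = \delta_{i0} = 0, \\
H_0 &:\ \delta_{11} = \delta_{12} = \delta_{21} = \delta_{22} = 0 .
\end{aligned}
```

Under this parameterization the four δ terms are the interaction parameters, and a joint test evaluates all four at once (four degrees of freedom) instead of testing a single product term.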
Subject(s)
Chromosome Mapping/methods , Epistasis, Genetic/genetics , Genetic Association Studies/methods , Genome-Wide Association Study/methods , Linear Models , Models, Genetic , Algorithms , Animals , Humans , Models, Theoretical
ABSTRACT
BACKGROUND: Lateral gene transfer (LGT) is an evolutionary process that has an important role in biology. It challenges the traditional binary tree-like evolution of species and is attracting increasing attention from molecular biologists due to its involvement in antibiotic resistance. A number of attempts have been made to model LGT in the presence of gene duplication and loss, but reliably placing LGT events in the species tree has remained a challenge. RESULTS: In this paper, we propose probabilistic methods that sample reconciliations of the gene tree with a dated species tree and compute maximum a posteriori probabilities. The MCMC-based method uses the probabilistic model DLTRS, which integrates LGT, gene duplication, gene loss, and sequence evolution under a relaxed molecular clock for substitution rates. We can estimate posterior distributions on gene trees and, in contrast to previous work, the actual placement of potential LGT events, which can be used to, e.g., identify "highways" of LGT. CONCLUSIONS: Based on a simulation study, we conclude that the method is able to infer the true LGT events on the gene tree and reconcile them to the correct edges of the species tree in most cases. Applied to two biological datasets, containing gene families from Cyanobacteria and Mollicutes, we find potential LGT highways that corroborate other studies as well as previously undetected examples.
Subject(s)
Gene Transfer, Horizontal/genetics , Models, Genetic , Biological Evolution , Entomoplasmataceae/classification , Entomoplasmataceae/genetics , Phylogeny
ABSTRACT
Over the last decade, methods have been developed for the reconstruction of gene trees that take the species tree into account. Many of these methods are based on the probabilistic duplication-loss model, which describes how a gene tree evolves over a species tree with respect to duplications and losses, or on extensions of this model, e.g., the DLRS (Duplication, Loss, Rate and Sequence evolution) model, which also includes sequence evolution under a relaxed molecular clock. A separate, almost as recent, and very important line of research has focused on DNA that is non-protein-coding yet functional. For instance, DNA sequences that are pseudogenes, in the sense that they are not translated, may still be transcribed, and the RNA thereby produced may be functional.
Subject(s)
DNA/genetics , Evolution, Molecular , Phylogeny , Pseudogenes/genetics , Gene Duplication
ABSTRACT
Despite the success of genome-wide association studies in medical genetics, the underlying genetics of many complex diseases remains enigmatic. One plausible reason for this could be the failure to account for the presence of genetic interactions in current analyses. Exhaustive investigations of interactions are typically infeasible because the vast number of possible interactions imposes hard statistical and computational challenges. There is, therefore, a need for computationally efficient methods that build on models appropriately capturing interaction. We introduce a new methodology where we augment the interaction hypothesis with a set of simpler hypotheses that are tested, in order of their complexity, against a saturated alternative hypothesis representing interaction. This sequential testing provides an efficient way to reduce the number of non-interacting variant pairs before the final interaction test. We devise two different methods, one that relies on a priori estimated numbers of marginally associated variants to correct for multiple tests, and a second that does this adaptively. We show that our methodology in general has improved statistical power in comparison to seven other methods, and, using the idea of closed testing, that it controls the family-wise error rate. We apply our methodology to genetic data from the PROCARDIS coronary artery disease case/control cohort and discover three distinct interactions. While analyses on simulated data suggest that the statistical power may suffice for an exhaustive search of all variant pairs in ideal cases, we explore strategies for a priori selecting subsets of variant pairs to test. Our new methodology facilitates identification of new disease-relevant interactions from existing and future genome-wide association data, which may involve genes with previously unknown association to the disease. Moreover, it enables construction of interaction networks that provide a systems biology view of complex diseases, serving as a basis for more comprehensive understanding of disease pathophysiology and its clinical consequences.