ABSTRACT
Gene expression data holds the potential to shed light on multiple biological processes at once. However, data analysis methods for single cell sequencing mostly focus on finding cell clusters or the principal progression line of the data. Data analysis for spatial transcriptomics mostly addresses clustering and finding spatially variable genes. Existing data analysis methods are effective in finding the main data features, but they might miss less pronounced, albeit significant, processes, possibly involving a subset of the samples. In this work we present SPIRAL: Significant Process InfeRence ALgorithm. SPIRAL uses Gaussian statistics to detect all statistically significant biological processes in single cell, bulk and spatial transcriptomics data. The algorithm outputs a list of structures, each defined by a set of genes working simultaneously in a specific population of cells. SPIRAL is unique in its flexibility: the structures are constructed by selecting subsets of genes and cells based on statistically significant and consistent differential expression. Every gene and every cell may belong to one structure, to several, or to none. SPIRAL also provides several visual representations of structures and pathway enrichment information. We validated the statistical soundness of SPIRAL on synthetic datasets and applied it to single cell, spatial and bulk RNA-sequencing datasets. SPIRAL is available at https://spiral.technion.ac.il/.
Subject(s)
Algorithms, Computational Biology, Gene Expression Profiling, Single-Cell Analysis, Transcriptome, Single-Cell Analysis/methods, Gene Expression Profiling/methods, Computational Biology/methods, Humans, Cluster Analysis
ABSTRACT
Solid tumors are characterized by complex interactions between the tumor, the immune system and the microenvironment. These interactions and intra-tumor variations have both diagnostic and prognostic significance and implications. However, quantifying the underlying processes in patient samples requires expensive and complicated molecular experiments. In contrast, H&E staining is typically performed as part of the routine standard process, and is very cheap. Here we present HIPI (H&E Image Interpretation and Protein Expression Inference) for predicting cell marker expression from tumor H&E images. We process paired H&E and CyCIF images taken from serial sections of colorectal cancers to train our model. We show that our model accurately predicts the spatial distribution of several important cell markers, on both held-out tumor regions as well as new tumor samples taken from different patients. Moreover, using only the tissue image morphology, HIPI is able to colocalize the interactions between different cell types, further demonstrating its potential clinical significance.
Subject(s)
Tumor Biomarkers, Colorectal Neoplasms, Computational Biology, Humans, Colorectal Neoplasms/metabolism, Colorectal Neoplasms/pathology, Computational Biology/methods, Tumor Biomarkers/metabolism, Tumor Microenvironment, Computer-Assisted Image Processing/methods, Algorithms
ABSTRACT
Off-target effects present a significant impediment to the safe and efficient use of CRISPR-Cas genome editing. Since off-target activity is influenced by the genomic sequence, the presence of sequence variants leads to varying on- and off-target profiles among different alleles or individuals. However, a reliable tool that quantifies genome editing activity in an allelic context is not available. Here, we introduce CRISPECTOR2.0, an extended version of our previously published software tool CRISPECTOR, with an allele-specific editing activity quantification option. CRISPECTOR2.0 enables reference-free, allele-aware, precise quantification of on- and off-target activity, by using de novo sample-specific single nucleotide variant (SNV) detection and statistical-based allele-calling algorithms. We demonstrate CRISPECTOR2.0 efficacy in analyzing samples containing multiple alleles and quantifying allele-specific editing activity, using data from diverse cell types, including primary human cells, plants, and an original extensive human cell line database. We identified instances where an SNV induced changes in the protospacer adjacent motif sequence, resulting in allele-specific editing. Intriguingly, differential allelic editing was also observed in regions carrying distal SNVs, hinting at the involvement of additional epigenetic factors. Our findings highlight the importance of allele-specific editing measurement as a milestone in the adaptation of efficient, accurate, and safe personalized genome editing.
Subject(s)
Alleles, CRISPR-Cas Systems, Gene Editing, Software, Gene Editing/methods, Humans, Single Nucleotide Polymorphism, Algorithms
ABSTRACT
BACKGROUND: The incidence rates of cutaneous squamous cell carcinoma (cSCC) and basal cell carcinoma (BCC) skin cancers are rising, while the current diagnostic process is time-consuming. We describe the development of a novel approach to high-throughput sampling of tissue lipids using electroporation-based biopsy, termed e-biopsy. We report on the ability of the e-biopsy technique to harvest large amounts of lipids from human skin samples. MATERIALS AND METHODS: Here, 168 lipids were reliably identified from 12 patients providing a total of 13 samples. The extracted lipids were profiled with ultra-performance liquid chromatography and tandem mass spectrometry (UPLC-MS-MS) providing cSCC, BCC, and healthy skin lipidomic profiles. RESULTS: Comparative analysis identified 27 differentially expressed lipids (p < 0.05). The general profile trend is low diglycerides in both cSCC and BCC, high phospholipids in BCC, and high lyso-phospholipids in cSCC compared to healthy skin tissue samples. CONCLUSION: The results contribute to the growing body of knowledge that can potentially lead to novel insights into these skin cancers and demonstrate the potential of the e-biopsy technique for the analysis of lipidomic profiles of human skin tissues.
Subject(s)
Basal Cell Carcinoma, Squamous Cell Carcinoma, Electroporation, Lipidomics, Skin Neoplasms, Skin, Humans, Basal Cell Carcinoma/pathology, Basal Cell Carcinoma/metabolism, Basal Cell Carcinoma/diagnosis, Skin Neoplasms/pathology, Skin Neoplasms/metabolism, Squamous Cell Carcinoma/pathology, Squamous Cell Carcinoma/metabolism, Squamous Cell Carcinoma/chemistry, Lipidomics/methods, Biopsy, Skin/pathology, Skin/metabolism, Skin/chemistry, Female, Male, Electroporation/methods, Middle Aged, Aged, Lipids/analysis, Tandem Mass Spectrometry/methods
ABSTRACT
Data storage in DNA has recently emerged as a promising archival solution, offering space-efficient and long-lasting digital storage solutions. Recent studies suggest leveraging the inherent redundancy of synthesis and sequencing technologies by using composite DNA alphabets. A major challenge of this approach involves the noisy inference process, obstructing large composite alphabets. This paper introduces a novel approach for DNA-based data storage, offering, in some implementations, a 6.5-fold increase in logical density over standard DNA-based storage systems, with near-zero reconstruction error. Combinatorial DNA encoding uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter consists of a subset of shortmers. We formally define various combinatorial encoding schemes and investigate their theoretical properties. These include information density and reconstruction probabilities, as well as required synthesis and sequencing multiplicities. We then propose an end-to-end design for a combinatorial DNA-based data storage system, including encoding schemes, two-dimensional (2D) error correction codes, and reconstruction algorithms, under different error regimes. We performed simulations and show, for example, that the use of 2D Reed-Solomon error correction has significantly improved reconstruction rates. We validated our approach by constructing two combinatorial sequences using Gibson assembly, imitating a 4-cycle combinatorial synthesis process. We confirmed the successful reconstruction, and established the robustness of our approach for different error types. Subsampling experiments supported the important role of sampling rate and its effect on the overall performance. Our work demonstrates the potential of combinatorial shortmer encoding for DNA-based data storage and describes some theoretical research questions and technical challenges. 
Combining combinatorial principles with error-correcting strategies, and investing in the development of DNA synthesis technologies that efficiently support combinatorial synthesis, can pave the way to efficient, error-resilient DNA-based storage solutions.
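The density argument above can be illustrated with a short calculation. This is a minimal sketch, assuming an encoding in which each letter is a k-subset of N distinguishable shortmers; the function name and the values N = 16, k = 5 are hypothetical and are not taken from the paper.

```python
from math import comb, log2


def bits_per_letter(n_shortmers: int, k: int) -> float:
    """Bits carried by one combinatorial letter.

    Assumes every k-subset of the n_shortmers available shortmers is a
    valid letter, so the alphabet has C(n_shortmers, k) letters.
    """
    return log2(comb(n_shortmers, k))


# Hypothetical parameters chosen only for illustration.
N, K = 16, 5
print("alphabet size:", comb(N, K))
print(f"bits per combinatorial letter: {bits_per_letter(N, K):.2f}")
# A standard DNA position chooses one of four nucleotides.
print("bits per standard nucleotide:", log2(4))
```

A per-letter figure is not by itself a density comparison: a combinatorial letter occupies one synthesis cycle of shortmers, so the fold increase reported in the abstract depends on how positions are counted and on the encoding details given in the paper.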
Subject(s)
DNA Replication, DNA, DNA Sequence Analysis/methods, DNA/genetics, Algorithms, Information Storage and Retrieval
ABSTRACT
Traditional gene set enrichment analysis falters when applied to large genomic domains, where neighboring genes often share functions. This spatial dependency creates misleading enrichments, mistaking mere physical proximity for genuine biological connections. Here we present Spatial Adjusted Gene Ontology (SAGO), a novel cyclic permutation-based approach, to tackle this challenge. SAGO separates enrichments due to spatial proximity from genuine biological links by incorporating the genes' spatial arrangement into the analysis. We applied SAGO to various datasets in which the identified genomic intervals are large, including replication timing domains, large H3K9me3 and H3K27me3 domains, HiC compartments and lamina-associated domains (LADs). Intriguingly, applying SAGO to prostate cancer samples with large copy number alteration (CNA) domains eliminated most of the enriched GO terms, thus helping to accurately identify biologically relevant gene sets linked to oncogenic processes, free from spatial bias.
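The idea of a cyclic permutation null can be sketched as follows. This is an illustrative toy, not the SAGO implementation: it assumes genes are sorted by genomic position, treats the whole genome as a single circle, and uses the raw overlap count as the test statistic. The function name is hypothetical.

```python
def cyclic_permutation_pvalue(in_domain, in_gene_set):
    """Empirical p-value for the overlap of domain genes with a gene set.

    Both arguments are equal-length sequences of booleans over genes
    sorted by genomic position. The null distribution is built by
    cyclically shifting the domain labels, which preserves the way
    domain genes cluster along the genome.
    """
    n = len(in_domain)

    def overlap(shift):
        # Count gene-set members that fall in the shifted domain.
        return sum(
            1 for i in range(n) if in_domain[(i - shift) % n] and in_gene_set[i]
        )

    observed = overlap(0)
    null = [overlap(s) for s in range(1, n)]
    exceed = sum(1 for v in null if v >= observed)
    # Add-one correction so the p-value is never exactly zero.
    return (1 + exceed) / (1 + len(null))


# Toy genome of 200 genes: one domain covering genes 0..39 and a
# gene set whose 20 members (genes 10..29) are all adjacent.
domain = [i < 40 for i in range(200)]
gene_set = [10 <= i < 30 for i in range(200)]
print(cyclic_permutation_pvalue(domain, gene_set))
```

Because the gene-set members are neighbors, many shifts of the domain still cover all of them, so the cyclic null yields a far less extreme p-value than a test that treats genes as exchangeable would.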
ABSTRACT
Basal cell carcinoma (BCC) is the most common type of skin cancer. Due to multiple potential underlying molecular tumor aberrations, clinical treatment protocols are not well-defined. This study presents multisite molecular heterogeneity profiles of human BCC based on RNA and proteome profiling. Three areas from lesions excised from 9 patients were analyzed. The focus was gene expression profiles based on proteome and RNA measurements of intra-tumor heterogeneity from the same patient and inter-tumor heterogeneity in nodular, infiltrative, and superficial BCC tumor subtypes from different patients. We observed significant overlap in intra- and inter-tumor variability of proteome and RNA expression profiles, showing significant multisite heterogeneity of protein expression in the BCC tumors. Inter-subtype analysis has also identified unique proteins for each BCC subtype. This profiling leads to a deeper understanding of BCC molecular heterogeneity and potentially contributes to developing new sampling tools for personalized diagnostic and therapeutic approaches to BCC.
Subject(s)
Basal Cell Carcinoma, Skin Neoplasms, Humans, Transcriptome, Proteome/genetics, Basal Cell Carcinoma/pathology, Skin Neoplasms/pathology, RNA
ABSTRACT
Digital analysis of pathology whole-slide images has been recently gaining interest in the context of cancer diagnosis and treatment. In particular, deep learning methods have demonstrated significant potential in supporting pathology analysis, recently detecting molecular traits never before recognized in pathology H&E whole-slide images (WSIs). Alongside these advancements in the digital analysis of WSIs, it is becoming increasingly evident that both spatial and overall tumor heterogeneity may be significant determinants of cancer prognosis and treatment outcome. In this chapter, we describe methods that aim to support these two elements. We describe both an end-to-end deep learning pipeline for producing limited spatial transcriptomics from WSIs with associated bulk gene expression data, as well as an algorithm for quantifying spatial tumor heterogeneity based on the results of this pipeline.
Subject(s)
Neoplasms, Humans, Neoplasms/diagnosis, Neoplasms/genetics, Phenotype, Algorithms, Microscopy/methods
ABSTRACT
Excision tissue biopsy, while central to cancer treatment and precision medicine, presents risks to the patient and does not provide a sufficiently broad and faithful representation of the heterogeneity of solid tumors. Here we introduce e-biopsy, a novel concept for molecular profiling of solid tumors using molecular sampling with electroporation. As e-biopsy provides access to the molecular composition of a solid tumor by permeabilization of the cell membrane, it facilitates tumor diagnostics without tissue resection. Furthermore, because it does not destroy tissue, e-biopsy enables probing the solid tumor multiple times in several distinct locations in the same procedure, thereby enabling the spatial profiling of tumor molecular heterogeneity. We demonstrate e-biopsy in vivo, using the 4T1 breast cancer model in mice to assess its performance, as well as the inferred spatial differential protein expression. In particular, we show that proteomic profiles obtained via e-biopsy in vivo distinguish the tumors from healthy breast tissue and reflect spatial tumor differential protein expression. E-biopsy provides a completely new molecular sampling modality for molecular cartography of solid tumors, providing information that potentially enables more rapid and sensitive detection at lesser risk, as well as more precise personalized medicine.
Subject(s)
Neoplasms, Proteomics, Animals, Electroporation, Mice, Neoplasm Proteins, Neoplasms/pathology, Precision Medicine
ABSTRACT
Analysing human physiological data allows access to the health state and the state of mind of the subject individual. Whenever a person is sick, having a panic attack, happy or scared, physiological signals will be different. In terms of physiological signals, we focus, in this manuscript, on monitoring breathing patterns. The scope can be extended to also address heart rate and other variables. We describe an analysis of breathing rate patterns during activities including resting, walking, running and watching a movie. We model normal breathing behaviours by statistically analysing signals, processed to represent quantities of interest. We consider the moving maximum/minimum, the amplitude and the Fourier transform of the respiration signal, working with different window sizes. We then learn a statistical model for the basal behaviour, per individual, and detect outliers. When outliers are detected, a system that incorporates our approach would send a visible signal through a smart garment or through other means. We describe alert generation performance in two datasets: one literature dataset and one collected as a field study for this work. In particular, when learning personal rest distributions for the breathing signals of 14 subjects, we see alerts generated more often when the same individual is running than when they are tested in rest conditions.
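The per-individual baseline and outlier step can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' model: it uses only one feature (window maximum minus window minimum), non-overlapping windows, a Gaussian-style z-score rule, and synthetic sine-wave signals. All function names and the threshold of 3 are hypothetical.

```python
from math import pi, sin
from statistics import mean, stdev


def window_amplitudes(signal, window):
    """Maximum minus minimum of the signal in consecutive windows."""
    return [
        max(signal[i:i + window]) - min(signal[i:i + window])
        for i in range(0, len(signal) - window + 1, window)
    ]


def fit_baseline(rest_signal, window):
    """Mean and standard deviation of window amplitudes at rest."""
    amps = window_amplitudes(rest_signal, window)
    return mean(amps), stdev(amps)


def alerts(signal, baseline, window, z_threshold=3.0):
    """Flag windows whose amplitude is far from the personal baseline."""
    mu, sigma = baseline
    return [
        abs(a - mu) / sigma > z_threshold
        for a in window_amplitudes(signal, window)
    ]


# Synthetic signals (assumed shapes, for illustration only):
# slow, slightly varying breathing at rest; faster, deeper breathing when running.
rest = [(1 + 0.1 * sin(i / 50)) * sin(2 * pi * i / 20) for i in range(2000)]
run = [3 * sin(2 * pi * i / 10) for i in range(200)]

baseline = fit_baseline(rest, window=20)
print("alerts at rest:", sum(alerts(rest, baseline, window=20)))
print("alerts while running:", sum(alerts(run, baseline, window=20)))
```

A fuller version would add the other features named in the abstract, such as Fourier-domain quantities, and would compare several window sizes.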
Subject(s)
Respiration, Respiratory Rate, Humans, Statistical Models, Rest
ABSTRACT
A major concern in tissue biopsies with a needle is missing the most lethal clone of a tumor, leading to a false negative result. This concern is well justified, since needle-based biopsies gather tissue information limited to needle size. In this work, we show that molecular harvesting with electroporation, e-biopsy, could increase the sampled tissue volume in comparison to tissue sampling by a needle alone. As suggested by numerical models of the electric field distribution, the increased sampled volume is achieved by electroporation-driven permeabilization of cellular membranes in the tissue around the sampling needle. We show that proteomic profiles, sampled by e-biopsy from brain tissue, ex vivo, at 0.5 mm distance outside the visible margins of mouse brain melanoma metastasis, have protein patterns similar to the melanoma tumor center and different from healthy brain tissue. In addition, we show that the proteome signature probed by e-biopsy differentiates between melanoma tumor center and healthy brain in mice. This study suggests that e-biopsy could provide a novel tool for minimally invasive sampling of molecules in tissue in larger volumes than achieved with traditional needle biopsies.
Subject(s)
Melanoma, Proteome, Animals, Brain/pathology, Electroporation, Margins of Excision, Melanoma/pathology, Mice, Proteomics
ABSTRACT
RNA splicing is a key process in eukaryotic gene expression, in which an intron is spliced out of a pre-mRNA molecule to eventually produce a mature mRNA. Most intron-containing genes are constitutively spliced, hence efficient splicing of an intron is crucial for efficient regulation of gene expression. Here we use a large synthetic oligo library of ~20,000 variants to explore how different intronic sequence features affect splicing efficiency and mRNA expression levels in S. cerevisiae. Introns are defined by three functional sites: the 5' donor site, the branch site, and the 3' acceptor site. Using a combinatorial design of synthetic introns, we demonstrate how non-consensus splice site sequences in each of these sites affect splicing efficiency. We then show that S. cerevisiae splicing machinery tends to select alternative 3' splice sites downstream of the original site, and we suggest that this tendency created a selective pressure, leading to the avoidance of cryptic splice site motifs near introns' 3' ends. We further use natural intronic sequences from other yeast species, whose splicing machineries have diverged to various extents, to show how intron architectures in the various species have been adapted to the organism's splicing machinery. We suggest that the observed tendency for cryptic splicing is a result of a loss of a specific splicing factor, U2AF1. Lastly, we show that synthetic sequences containing two introns give rise to alternative RNA isoforms in S. cerevisiae, demonstrating that merely a synthetic fusion of two introns might suffice to facilitate alternative splicing in yeast. Our study reveals novel mechanisms by which introns are shaped in evolution to allow cells to regulate their transcriptome. In addition, it provides a valuable resource to study the regulation of constitutive and alternative splicing in a model organism.
Subject(s)
RNA Splicing, Saccharomyces cerevisiae/genetics, Computational Biology/methods, Molecular Evolution, Fungal Genes, High-Throughput Nucleotide Sequencing, Introns, Messenger RNA/genetics
ABSTRACT
MOTIVATION: Tumour heterogeneity is being increasingly recognized as an important characteristic of cancer and as a determinant of prognosis and treatment outcome. Emerging spatial transcriptomics data hold the potential to further our understanding of tumour heterogeneity and its implications. However, existing statistical tools are not sufficiently powerful to capture heterogeneity in the complex setting of spatial molecular biology. RESULTS: We provide a statistical solution, the HeTerogeneity Average index (HTA), specifically designed to handle the multivariate nature of spatial transcriptomics. We prove that HTA has an approximately normal distribution, therefore lending itself to efficient statistical assessment and inference. We first demonstrate that HTA accurately reflects the level of heterogeneity in simulated data. We then use HTA to analyze heterogeneity in two cancer spatial transcriptomics datasets: spatial RNA sequencing by 10x Genomics and spatial transcriptomics inferred from H&E. Finally, we demonstrate that HTA also applies to 3D spatial data using brain MRI. In spatial RNA sequencing, we use a known combination of molecular traits to assert that HTA aligns with the expected outcome for this combination. We also show that HTA captures immune-cell infiltration at multiple resolutions. In digital pathology, we show how HTA can be used in survival analysis and demonstrate that high levels of heterogeneity may be linked to poor survival. In brain MRI, we show that HTA differentiates between normal ageing, Alzheimer's disease and two tumours. HTA also extends beyond molecular biology and medical imaging, and can be applied to many domains, including GIS. AVAILABILITY AND IMPLEMENTATION: Python package and source code are available at: https://github.com/alonalj/hta. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Neoplasms, Transcriptome, Humans, Biomedical Technology Assessment, Genomics, Neuroimaging
ABSTRACT
MOTIVATION: The log-rank test is widely used to assess the statistical significance of observed differences in survival when comparing two or more groups. The log-rank test is based on several assumptions that support the validity of the calculations. It is naturally assumed, implicitly, that no errors occur in the labeling of the samples. That is, the mapping between samples and groups is perfectly correct. In this work, we investigate how test results may be affected when considering some errors in the original labeling. RESULTS: We introduce and define the uncertainty that arises from labeling errors in the log-rank test. In order to deal with this uncertainty, we develop a novel algorithm for efficiently calculating a stability interval around the original log-rank P-value and prove its correctness. We demonstrate our algorithm on several datasets. AVAILABILITY AND IMPLEMENTATION: We provide a Python implementation, called LoRSI, for calculating the stability interval using our algorithm: https://github.com/YakhiniGroup/LoRSI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
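To make the notion of a stability interval concrete, the sketch below computes a two-group log-rank P-value and then, by brute force, the smallest and largest P-values reachable by flipping a single group label. This is not the LoRSI algorithm, which the abstract describes as efficient; it is a naive enumeration, restricted to one flip, with hypothetical function names and toy data.

```python
from math import erfc, sqrt


def logrank_pvalue(times, events, groups):
    """Two-group log-rank test.

    times: follow-up times; events: 1 for an event, 0 for censoring;
    groups: 0/1 group labels. Returns the chi-square (1 d.f.) P-value.
    """
    observed = expected = variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n, n1 = len(at_risk), sum(at_risk)
        d = sum(1 for ti, e in zip(times, events) if e and ti == t)
        d1 = sum(
            1 for ti, e, g in zip(times, events, groups)
            if e and ti == t and g == 1
        )
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    return erfc(sqrt(chi2 / 2))


def single_flip_interval(times, events, groups):
    """Min and max P-value over the original labels and all single flips."""
    pvals = [logrank_pvalue(times, events, groups)]
    for i in range(len(groups)):
        flipped = list(groups)
        flipped[i] = 1 - flipped[i]
        if 0 < sum(flipped) < len(flipped):  # keep both groups non-empty
            pvals.append(logrank_pvalue(times, events, flipped))
    return min(pvals), max(pvals)


# Toy data, invented for illustration.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print("P-value:", logrank_pvalue(times, events, groups))
print("single-flip interval:", single_flip_interval(times, events, groups))
```

Enumerating flips this way grows combinatorially with the number of allowed label errors, which is the cost an efficient algorithm needs to avoid.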
Subject(s)
Algorithms, Uncertainty
ABSTRACT
Controlling off-target editing activity is one of the central challenges in making CRISPR technology accurate and applicable in medical practice. Current algorithms for analyzing off-target activity do not provide statistical quantification, are not sufficiently sensitive in separating signal from noise in experiments with low editing rates, and do not address the detection of translocations. Here we present CRISPECTOR, a software tool that supports the detection and quantification of on- and off-target genome-editing activity from NGS data using paired treatment/control CRISPR experiments. In particular, CRISPECTOR facilitates the statistical analysis of NGS data from multiplex-PCR comparative experiments to detect and quantify adverse translocation events. We validate the observed results and show independent evidence of the occurrence of translocations in human cell lines, after genome editing. Our methodology is based on a statistical model comparison approach leading to better false-negative rates in sites with weak yet significant off-target activity.
Subject(s)
CRISPR-Cas Systems, Computational Biology/methods, Gene Editing/methods, Algorithms, DNA-Binding Proteins/genetics, HEK293 Cells, Homeodomain Proteins/genetics, Humans, Nuclear Proteins/genetics, Software, Transcription Factors/genetics
ABSTRACT
We apply an oligo-library and machine-learning approach to characterize the sequence and structural determinants of binding of the phage coat proteins (CPs) of bacteriophages MS2 (MCP), PP7 (PCP), and Qβ (QCP) to RNA. Using the oligo library, we generate thousands of candidate binding sites for each CP, and screen for binding using a high-throughput dose-response Sort-seq assay (iSort-seq). We then apply a neural network to expand this space of binding sites, which allows us to identify the critical structural and sequence features for binding of each CP. To verify our model and experimental findings, we design several non-repetitive binding site cassettes and validate their functionality in mammalian cells. We find that the binding of each CP to RNA is characterized by a unique space of sequence and structural determinants, thus providing a more complete description of CP-RNA interaction as compared with previous low-throughput findings. Finally, based on the binding spaces we demonstrate a computational tool for the successful design and rapid synthesis of functional non-repetitive binding-site cassettes.
Subject(s)
Allolevivirus/genetics, Capsid Proteins/metabolism, Escherichia coli/virology, Levivirus/genetics, RNA/metabolism, Microbiological Attachment Sites/genetics, Binding Sites/genetics, Tumor Cell Line, Escherichia coli/genetics, Gene Library, Humans, Machine Learning, Plasmids/genetics
ABSTRACT
Different miRNA profiling protocols and technologies introduce differences in the resulting quantitative expression profiles. These include differences in the presence (and measurability) of certain miRNAs. We present and examine a method based on quantile normalization, Adjusted Quantile Normalization (AQuN), to combine miRNA expression data from multiple studies in breast cancer into a single joint dataset for integrative analysis. By pooling multiple datasets, we obtain increased statistical power, surfacing patterns that do not emerge as statistically significant when separately analyzing these datasets. To merge several datasets, as we do here, one needs to overcome both technical and batch differences between these datasets. We compare several approaches for merging and jointly analyzing miRNA datasets. We investigate the statistical confidence for known results and highlight potential new findings that resulted from the joint analysis using AQuN. In particular, we detect several miRNAs to be differentially expressed in estrogen receptor (ER) positive versus ER negative samples. In addition, we identify new potential biomarkers and therapeutic targets for both clinical groups. As a specific example, using the AQuN-derived dataset we detect hsa-miR-193b-5p to have a statistically significant over-expression in the ER positive group, a phenomenon that was not previously reported. Furthermore, as demonstrated by functional assays in breast cancer cell lines, overexpression of hsa-miR-193b-5p in breast cancer cell lines resulted in decreased cell viability in addition to inducing apoptosis. Together, these observations suggest a novel functional role for this miRNA in breast cancer. Packages implementing AQuN are provided for Python and Matlab: https://github.com/YakhiniGroup/PyAQN.
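For readers unfamiliar with the starting point of the method, the sketch below shows plain quantile normalization, which forces every sample to share the same distribution of values. It does not reproduce the adjustment that distinguishes AQuN (for example, its handling of miRNAs that are absent from some datasets); for that, see the PyAQN package linked above. The function name is hypothetical and ties are broken by original order, a simplification.

```python
def quantile_normalize(samples):
    """Plain quantile normalization of equal-length samples.

    samples: list of lists, one list of expression values per sample,
    with features in the same order in every sample. Each value is
    replaced by the mean, across samples, of the values sharing its rank.
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the r-th smallest value across samples.
    reference = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n)
    ]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy samples measured over the same three features.
print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
# → [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization both samples contain exactly the same set of values, so differences between platforms in overall scale or shape no longer dominate a joint analysis.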
Subject(s)
Breast Neoplasms/genetics, Breast Neoplasms/metabolism, Gene Expression Profiling, Neoplastic Gene Expression Regulation, MicroRNAs/metabolism, Algorithms, Biomarkers/metabolism, Tumor Biomarkers/genetics, Tumor Cell Line, Computer Simulation, Estrogen Receptor alpha/metabolism, Female, Humans, MCF-7 Cells, Oligonucleotide Array Sequence Analysis, Programming Languages, Messenger RNA/genetics
ABSTRACT
MOTIVATION AND BACKGROUND: The patient's immune system plays an important role in cancer pathogenesis, prognosis and susceptibility to treatment. Recent work introduced an immune-related breast cancer subtyping. This subtyping is based on the expression profiles of the tumor samples. Specifically, one study showed that analyzing 658 genes can lead to a signature for subtyping tumors. Furthermore, this classification is independent of other known molecular and clinical breast cancer subtyping. Finally, that study shows that the suggested subtyping has significant prognostic implications. RESULTS: In this work we develop an efficient signature associated with survival in breast cancer. We begin by developing a more efficient signature for the above-mentioned breast cancer immune-based subtyping. This signature, a set of 579 genes, achieves better performance, with an improved area under the curve (AUC). We then determine a set of 193 genes and an associated classification rule that yield subtypes with a much stronger statistically significant (log-rank p-value < 2 × 10^-4 in a test cohort) difference in survival. To obtain these improved results we develop a feature selection process that matches the high-dimensionality character of the data and the dual performance objectives, driven by survival and anchored by the literature subtyping.