ABSTRACT
Off-target effects present a significant impediment to the safe and efficient use of CRISPR-Cas genome editing. Since off-target activity is influenced by the genomic sequence, the presence of sequence variants leads to varying on- and off-target profiles among different alleles or individuals. However, a reliable tool that quantifies genome editing activity in an allelic context is not available. Here, we introduce CRISPECTOR2.0, an extended version of our previously published software tool CRISPECTOR, with an allele-specific editing activity quantification option. CRISPECTOR2.0 enables reference-free, allele-aware, precise quantification of on- and off-target activity, by using de novo sample-specific single nucleotide variant (SNV) detection and statistical-based allele-calling algorithms. We demonstrate CRISPECTOR2.0 efficacy in analyzing samples containing multiple alleles and quantifying allele-specific editing activity, using data from diverse cell types, including primary human cells, plants, and an original extensive human cell line database. We identified instances where an SNV induced changes in the protospacer adjacent motif sequence, resulting in allele-specific editing. Intriguingly, differential allelic editing was also observed in regions carrying distal SNVs, hinting at the involvement of additional epigenetic factors. Our findings highlight the importance of allele-specific editing measurement as a milestone in the adaptation of efficient, accurate, and safe personalized genome editing.
Subjects
Alleles; CRISPR-Cas Systems; Gene Editing; Software; Gene Editing/methods; Humans; Polymorphism, Single Nucleotide; Algorithms
ABSTRACT
Solid tumors are characterized by complex interactions between the tumor, the immune system and the microenvironment. These interactions and intra-tumor variations have both diagnostic and prognostic significance and implications. However, quantifying the underlying processes in patient samples requires expensive and complicated molecular experiments. In contrast, H&E staining is typically performed as part of the routine standard process, and is very cheap. Here we present HIPI (H&E Image Interpretation and Protein Expression Inference) for predicting cell marker expression from tumor H&E images. We process paired H&E and CyCIF images taken from serial sections of colorectal cancers to train our model. We show that our model accurately predicts the spatial distribution of several important cell markers, on both held-out tumor regions as well as new tumor samples taken from different patients. Moreover, using only the tissue image morphology, HIPI is able to colocalize the interactions between different cell types, further demonstrating its potential clinical significance.
Subjects
Biomarkers, Tumor; Colorectal Neoplasms; Computational Biology; Humans; Colorectal Neoplasms/metabolism; Colorectal Neoplasms/pathology; Computational Biology/methods; Biomarkers, Tumor/metabolism; Tumor Microenvironment; Image Processing, Computer-Assisted/methods; Algorithms
ABSTRACT
Precise gene expression patterns are established by transcription factors (TFs) binding to regulatory sequences. While these events occur in the context of chromatin, our understanding of how TF-nucleosome interplay affects gene expression is highly limited. Here, we present an assay for high-resolution measurements of both DNA occupancy and gene expression on large-scale libraries of systematically designed regulatory sequences. Our assay reveals occupancy patterns at the single-cell level. It provides an accurate quantification of the fraction of the population bound by a nucleosome and captures distinct, even adjacent, TF binding events. By applying this assay to over 1,500 promoter variants in yeast, we reveal pronounced differences in the dependency of TF activity on chromatin and classify TFs by their differential capacity to alter chromatin and promote expression. We further demonstrate how different regulatory sequences give rise to nucleosome-mediated TF collaborations that quantitatively account for the resulting expression.
Subjects
Chromatin/metabolism; DNA, Fungal/metabolism; Nucleosomes/metabolism; Promoter Regions, Genetic; Saccharomyces cerevisiae Proteins/metabolism; Saccharomyces cerevisiae/metabolism; Transcription Factors/metabolism; Binding Sites; Chromatin/genetics; Computational Biology; DNA, Fungal/genetics; Databases, Genetic; Gene Expression Regulation, Fungal; Gene Library; High-Throughput Screening Assays; Nucleosomes/genetics; Protein Binding; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae Proteins/genetics; Transcription Factors/genetics
ABSTRACT
BACKGROUND: The incidence rates of cutaneous squamous cell carcinoma (cSCC) and basal cell carcinoma (BCC) skin cancers are rising, while the current diagnostic process is time-consuming. We describe the development of a novel approach to high-throughput sampling of tissue lipids using electroporation-based biopsy, termed e-biopsy. We report on the ability of the e-biopsy technique to harvest large amounts of lipids from human skin samples. MATERIALS AND METHODS: Here, 168 lipids were reliably identified from 12 patients providing a total of 13 samples. The extracted lipids were profiled with ultra-performance liquid chromatography and tandem mass spectrometry (UPLC-MS-MS) providing cSCC, BCC, and healthy skin lipidomic profiles. RESULTS: Comparative analysis identified 27 differentially expressed lipids (p < 0.05). The general profile trend is low diglycerides in both cSCC and BCC, high phospholipids in BCC, and high lyso-phospholipids in cSCC compared to healthy skin tissue samples. CONCLUSION: The results contribute to the growing body of knowledge that can potentially lead to novel insights into these skin cancers and demonstrate the potential of the e-biopsy technique for the analysis of lipidomic profiles of human skin tissues.
Subjects
Carcinoma, Basal Cell; Carcinoma, Squamous Cell; Electroporation; Lipidomics; Skin Neoplasms; Skin; Humans; Carcinoma, Basal Cell/pathology; Carcinoma, Basal Cell/metabolism; Carcinoma, Basal Cell/diagnosis; Skin Neoplasms/pathology; Skin Neoplasms/metabolism; Carcinoma, Squamous Cell/pathology; Carcinoma, Squamous Cell/metabolism; Carcinoma, Squamous Cell/chemistry; Lipidomics/methods; Biopsy; Skin/pathology; Skin/metabolism; Skin/chemistry; Female; Male; Electroporation/methods; Middle Aged; Aged; Lipids/analysis; Tandem Mass Spectrometry/methods
ABSTRACT
RNA splicing is a key process in eukaryotic gene expression, in which an intron is spliced out of a pre-mRNA molecule to eventually produce a mature mRNA. Most intron-containing genes are constitutively spliced, hence efficient splicing of an intron is crucial for efficient regulation of gene expression. Here we use a large synthetic oligo library of ~20,000 variants to explore how different intronic sequence features affect splicing efficiency and mRNA expression levels in S. cerevisiae. Introns are defined by three functional sites: the 5' donor site, the branch site, and the 3' acceptor site. Using a combinatorial design of synthetic introns, we demonstrate how non-consensus splice site sequences in each of these sites affect splicing efficiency. We then show that S. cerevisiae splicing machinery tends to select alternative 3' splice sites downstream of the original site, and we suggest that this tendency created a selective pressure, leading to the avoidance of cryptic splice site motifs near introns' 3' ends. We further use natural intronic sequences from other yeast species, whose splicing machineries have diverged to various extents, to show how intron architectures in the various species have been adapted to the organism's splicing machinery. We suggest that the observed tendency for cryptic splicing is a result of a loss of a specific splicing factor, U2AF1. Lastly, we show that synthetic sequences containing two introns give rise to alternative RNA isoforms in S. cerevisiae, demonstrating that merely a synthetic fusion of two introns might suffice to facilitate alternative splicing in yeast. Our study reveals novel mechanisms by which introns are shaped in evolution to allow cells to regulate their transcriptome. In addition, it provides a valuable resource to study the regulation of constitutive and alternative splicing in a model organism.
Subjects
RNA Splicing; Saccharomyces cerevisiae/genetics; Computational Biology/methods; Evolution, Molecular; Genes, Fungal; High-Throughput Nucleotide Sequencing; Introns; RNA, Messenger/genetics
ABSTRACT
MOTIVATION: The log-rank test is widely used to assess the statistical significance of observed differences in survival when comparing two or more groups. The test rests on several assumptions that support the validity of the calculations. In particular, it is implicitly assumed that no errors occur in the labeling of the samples, that is, that the mapping between samples and groups is perfectly correct. In this work, we investigate how test results may be affected when the original labeling contains some errors. RESULTS: We introduce and define the uncertainty that arises from labeling errors in the log-rank test. To deal with this uncertainty, we develop a novel algorithm for efficiently calculating a stability interval around the original log-rank P-value and prove its correctness. We demonstrate our algorithm on several datasets. AVAILABILITY AND IMPLEMENTATION: We provide a Python implementation, called LoRSI, for calculating the stability interval using our algorithm: https://github.com/YakhiniGroup/LoRSI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
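To make the notion of a stability interval concrete, the following is a minimal sketch and not the LoRSI algorithm itself. It assumes two groups coded 0/1, computes the standard two-group log-rank P-value, and then finds the smallest and largest P-value over all relabelings that flip at most `k` group labels by exhaustive enumeration. The function names `logrank_p` and `brute_force_interval` are hypothetical, and the enumeration is only practical for small samples and small `k`.

```python
import math
from itertools import combinations


def logrank_p(times, events, groups):
    """Two-group log-rank test; returns the chi-square (1 df) P-value.

    times  : observed time per sample
    events : 1 if the event was observed, 0 if censored
    groups : 0 or 1 group label per sample
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)  # at risk, both groups
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e and g == 1)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the group-1 event count at time t.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0  # degenerate case, e.g. all samples in one group
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


def brute_force_interval(times, events, groups, k=1):
    """Min and max log-rank P-value over all relabelings with <= k flips."""
    p_values = [logrank_p(times, events, groups)]
    for r in range(1, k + 1):
        for idx in combinations(range(len(groups)), r):
            flipped = list(groups)
            for i in idx:
                flipped[i] = 1 - flipped[i]
            p_values.append(logrank_p(times, events, flipped))
    return min(p_values), max(p_values)
```

A call such as `brute_force_interval(times, events, groups, k=1)` returns a pair that brackets the original P-value; the abstract's contribution is an efficient algorithm for this kind of interval, which the exhaustive search above does not attempt to reproduce.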
Subjects
Algorithms; Uncertainty
ABSTRACT
MOTIVATION: Tumour heterogeneity is being increasingly recognized as an important characteristic of cancer and as a determinant of prognosis and treatment outcome. Emerging spatial transcriptomics data hold the potential to further our understanding of tumour heterogeneity and its implications. However, existing statistical tools are not sufficiently powerful to capture heterogeneity in the complex setting of spatial molecular biology. RESULTS: We provide a statistical solution, the HeTerogeneity Average index (HTA), specifically designed to handle the multivariate nature of spatial transcriptomics. We prove that HTA has an approximately normal distribution, therefore lending itself to efficient statistical assessment and inference. We first demonstrate that HTA accurately reflects the level of heterogeneity in simulated data. We then use HTA to analyze heterogeneity in two cancer spatial transcriptomics datasets: spatial RNA sequencing by 10x Genomics and spatial transcriptomics inferred from H&E. Finally, we demonstrate that HTA also applies to 3D spatial data using brain MRI. In spatial RNA sequencing, we use a known combination of molecular traits to assert that HTA aligns with the expected outcome for this combination. We also show that HTA captures immune-cell infiltration at multiple resolutions. In digital pathology, we show how HTA can be used in survival analysis and demonstrate that high levels of heterogeneity may be linked to poor survival. In brain MRI, we show that HTA differentiates between normal ageing, Alzheimer's disease and two tumours. HTA also extends beyond molecular biology and medical imaging, and can be applied to many domains, including GIS. AVAILABILITY AND IMPLEMENTATION: Python package and source code are available at: https://github.com/alonalj/hta. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Neoplasms; Transcriptome; Humans; Technology Assessment, Biomedical; Genomics; Neuroimaging
ABSTRACT
MOTIVATION: Recent years have seen a growing number and an expanding scope of studies using synthetic oligo libraries for a range of applications in synthetic biology. As experiments grow in number and complexity, analysis tools can facilitate quality control and support better assessment and inference. RESULTS: We present a novel analysis tool, called SOLQC, which enables fast and comprehensive analysis of synthetic oligo libraries, based on NGS analysis performed by the user. SOLQC provides statistical information such as the distribution of variant representation, different error rates and their dependence on sequence or library properties. SOLQC produces graphical reports from the analysis, in a flexible format. We demonstrate SOLQC by analyzing literature libraries. We also discuss the potential benefits and relevance of the different components of the analysis. AVAILABILITY AND IMPLEMENTATION: SOLQC is free software for non-commercial use, available at https://app.gitbook.com/@yoav-orlev/s/solqc/. For commercial use please contact the authors. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Libraries; Software; Gene Library; Quality Control; Synthetic Biology
ABSTRACT
Different miRNA profiling protocols and technologies introduce differences in the resulting quantitative expression profiles. These include differences in the presence (and measurability) of certain miRNAs. We present and examine a method based on quantile normalization, Adjusted Quantile Normalization (AQuN), to combine miRNA expression data from multiple studies in breast cancer into a single joint dataset for integrative analysis. By pooling multiple datasets, we obtain increased statistical power, surfacing patterns that do not emerge as statistically significant when separately analyzing these datasets. To merge several datasets, as we do here, one needs to overcome both technical and batch differences between these datasets. We compare several approaches for merging and jointly analyzing miRNA datasets. We investigate the statistical confidence for known results and highlight potential new findings that resulted from the joint analysis using AQuN. In particular, we detect several miRNAs to be differentially expressed in estrogen receptor (ER) positive versus ER negative samples. In addition, we identify new potential biomarkers and therapeutic targets for both clinical groups. As a specific example, using the AQuN-derived dataset we detect hsa-miR-193b-5p to have a statistically significant over-expression in the ER positive group, a phenomenon that was not previously reported. Furthermore, as demonstrated by functional assays in breast cancer cell lines, overexpression of hsa-miR-193b-5p in breast cancer cell lines resulted in decreased cell viability in addition to inducing apoptosis. Together, these observations suggest a novel functional role for this miRNA in breast cancer. Packages implementing AQuN are provided for Python and Matlab: https://github.com/YakhiniGroup/PyAQN.
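For orientation, the sketch below shows plain quantile normalization, the technique on which AQuN is based; it is not the AQuN adjustment itself, whose details are not given in the abstract. It assumes every sample measures the same set of features with no missing values, and ties are broken by position. The function name `quantile_normalize` is hypothetical.

```python
def quantile_normalize(samples):
    """Plain quantile normalization.

    samples : list of equal-length lists, one list of expression values
              per sample, with features in the same order in every list.
    Returns new lists in which all samples share one value distribution.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same number of features")

    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the r-th smallest value across samples.
    reference = [
        sum(s[r] for s in sorted_samples) / len(samples) for r in range(n)
    ]

    normalized = []
    for s in samples:
        # Feature indices ordered from smallest to largest value.
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # replace value by reference quantile
        normalized.append(out)
    return normalized
```

After this step each sample has identical sorted values, so between-sample differences in overall scale are removed while within-sample ranks are preserved. Handling miRNAs that are measurable in only some studies, which the abstract identifies as a key difficulty, would need additional logic beyond this sketch.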
Subjects
Breast Neoplasms/genetics; Breast Neoplasms/metabolism; Gene Expression Profiling; Gene Expression Regulation, Neoplastic; MicroRNAs/metabolism; Algorithms; Biomarkers/metabolism; Biomarkers, Tumor/genetics; Cell Line, Tumor; Computer Simulation; Estrogen Receptor alpha/metabolism; Female; Humans; MCF-7 Cells; Oligonucleotide Array Sequence Analysis; Programming Languages; RNA, Messenger/genetics
ABSTRACT
Analysing human physiological data allows access to the health state and the state of mind of the subject individual. Whenever a person is sick, having a panic attack, happy or scared, physiological signals will be different. In terms of physiological signals, we focus, in this manuscript, on monitoring breathing patterns. The scope can be extended to also address heart rate and other variables. We describe an analysis of breathing rate patterns during activities including resting, walking, running and watching a movie. We model normal breathing behaviours by statistically analysing signals, processed to represent quantities of interest. We consider the moving maximum/minimum, the amplitude and the Fourier transform of the respiration signal, working with different window sizes. We then learn a statistical model for the basal behaviour, per individual, and detect outliers. When outliers are detected, a system that incorporates our approach would send a visible signal through a smart garment or through other means. We describe alert generation performance in two datasets: one literature dataset and one collected as a field study for this work. In particular, when learning personal rest distributions for the breathing signals of 14 subjects, we see alerts generated more often when the same individual is running than when they are tested in rest conditions.
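The per-individual baseline and outlier detection described above can be illustrated with a minimal sketch. It uses only one of the quantities mentioned, the windowed amplitude (moving maximum minus moving minimum), and it assumes a simple Gaussian model with a z-score threshold; the abstract does not state which statistical model the authors use. The names `window_amplitudes`, `fit_baseline`, `alerts` and the default threshold of 3 are hypothetical.

```python
import statistics


def window_amplitudes(signal, window):
    """Amplitude (moving max minus moving min) for each sliding window."""
    return [
        max(signal[i:i + window]) - min(signal[i:i + window])
        for i in range(len(signal) - window + 1)
    ]


def fit_baseline(rest_signal, window):
    """Learn a personal rest model: mean and standard deviation of amplitude."""
    amps = window_amplitudes(rest_signal, window)
    return statistics.mean(amps), statistics.stdev(amps)


def alerts(signal, window, baseline, z_threshold=3.0):
    """Flag each window whose amplitude deviates from the rest baseline."""
    mu, sigma = baseline
    flags = []
    for amp in window_amplitudes(signal, window):
        if sigma == 0:
            flags.append(amp != mu)  # degenerate baseline: any change alerts
        else:
            flags.append(abs(amp - mu) / sigma > z_threshold)
    return flags
```

In use, `fit_baseline` would be run once per subject on a resting recording, and `alerts` would then be applied to new recordings from the same subject. Repeating this for several window sizes, and for the Fourier-based features, follows the same pattern.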
Subjects
Respiration; Respiratory Rate; Humans; Models, Statistical; Rest
ABSTRACT
The interactions of cancer stem cells (CSCs) within the tumor microenvironment (TME), which also involve CSC interactions with noncancer stromal cells, contribute to the overall phenomenon of intratumoral heterogeneity. Comprehensive understanding of the tumorigenesis process requires elucidating the coordinated gene expression between cancer and tumor stromal cells for each tumor. We show that human gastric cancer cells (GSC1) subvert gene expression and cytokine production by mesenchymal stem cells (GSC-MSC), thus promoting tumor progression. Using mixed composition of human tumor xenografts, organotypic culture, and in vitro assays, we demonstrate GSC1-mediated specific reprogramming of "naïve" MSC into specialized tumor-associated MSC equipped with a tumor-promoting phenotype. Although the paracrine effect of GSC-MSC or primed-MSC is sufficient to enable 2D growth of GSC1, cell-cell interaction with GSC-MSC is necessary for 3D growth and in vivo tumor formation. At the transcriptional and protein levels, RNA-Seq and proteome analyses, respectively, revealed increased R-spondin expression in primed-MSC, and paracrine- and juxtacrine-mediated elevation of Lgr5 expression in GSC1, suggesting GSC-MSC-mediated support of cancer stemness in GSC1. CSC properties are sustained in vivo through the interplay between GSC1 and GSC-MSC, activating the R-spondin/Lgr5 axis and the WNT/β-catenin signaling pathway. β-Catenin+ cell clusters show β-catenin nuclear localization, indicating the activation of the WNT/β-catenin signaling pathway in these cells. The β-catenin+ cluster of cells overlaps the Lgr5+ cells; however, not all Lgr5+ cells express β-catenin. A predominant means to sustain the CSC contribution to tumor progression appears to be subversion of MSC in the TME by cancer cells. Stem Cells 2019;37:176-189.
Subjects
Cellular Reprogramming/genetics; Mesenchymal Stem Cells/metabolism; Stomach Neoplasms/genetics; Humans; Stomach Neoplasms/metabolism; Tumor Microenvironment
ABSTRACT
Gene expression regulation is highly dependent on binding of RNA-binding proteins (RBPs) to their RNA targets. Growing evidence supports the notion that both RNA primary sequence and its local secondary structure play a role in specific Protein-RNA recognition and binding. Despite the great advance in high-throughput experimental methods for identifying sequence targets of RBPs, predicting the specific sequence and structure binding preferences of RBPs remains a major challenge. We present a novel webserver, SMARTIV, designed for discovering and visualizing combined RNA sequence and structure motifs from high-throughput RNA-binding data, generated from in-vivo experiments. The uniqueness of SMARTIV is that it predicts motifs from enriched k-mers that combine information from ranked RNA sequences and their predicted secondary structure, obtained using various folding methods. Consequently, SMARTIV generates Position Weight Matrices (PWMs) in a combined sequence and structure alphabet with assigned P-values. SMARTIV concisely represents the sequence and structure motif content as a single graphical logo, which is informative and easy for visual perception. SMARTIV was examined extensively on a variety of high-throughput binding experiments for RBPs from different families, generated from different technologies, showing consistent and accurate results. Finally, SMARTIV is a user-friendly webserver, highly efficient in run-time and freely accessible via http://smartiv.technion.ac.il/.
Subjects
RNA-Binding Proteins/metabolism; RNA/chemistry; Software; Binding Sites; Internet; Nucleic Acid Conformation; Nucleotide Motifs; Position-Specific Scoring Matrices; Sequence Analysis, RNA
ABSTRACT
BACKGROUND: Synthetic biology and related techniques enable genome-scale, high-throughput investigation of the effect on organism fitness of different gene knock-downs/outs and of other modifications of genomic sequence. RESULTS: We develop statistical and computational pipelines and frameworks for analyzing high-throughput fitness data over a genome-scale set of sequence variants. Analyzing data from a high-throughput knock-down/knock-out bacterial study, we investigate differences and determinants of the effect on fitness in different conditions. Comparing fitness vectors of genes, across tens of conditions, we observe that fitness consequences strongly depend on genomic location and more weakly depend on gene sequence similarity and on functional relationships. In analyzing promoter sequences, we identified motifs associated with conditions studied in bacterial media such as Casaminos, D-glucose, Sucrose, and other sugars and amino-acid sources. We also use fitness data to infer genes associated with orphan metabolic reactions in the iJO1366 E. coli metabolic model. To do this, we developed a new computational method that integrates gene fitness and gene expression profiles within a given reaction network neighborhood to associate this reaction with a set of genes that potentially encode the catalyzing proteins. We then apply this approach to predict candidate genes for 107 orphan reactions in iJO1366. Furthermore, we validate our methodology with known reactions using a leave-one-out approach. Specifically, using top-20 candidates selected based on combined fitness and expression datasets, we correctly reconstruct 39.7% of the reactions, as compared to 33% based on fitness and to 26% based on expression separately, and to 4.02% as a random baseline. Our model improvement results include a novel association of a gene to an orphan cytosine nucleosidation reaction. CONCLUSION: Our pipeline for metabolic modeling shows a clear benefit of using fitness data for predicting genes of orphan reactions. Along with the analysis pipelines we developed, it can be used to analyze similar high-throughput data.
Subjects
Exercise Test/methods; Genome-Wide Association Study/methods; Genomics/methods; Humans; Models, Biological
ABSTRACT
Binding of transcription factors (TFs) to regulatory sequences is a pivotal step in the control of gene expression. Despite many advances in the characterization of sequence motifs recognized by TFs, our ability to quantitatively predict TF binding to different regulatory sequences is still limited. Here, we present a novel experimental assay termed BunDLE-seq that provides quantitative measurements of TF binding to thousands of fully designed sequences of 200 bp in length within a single experiment. Applying this binding assay to two yeast TFs, we demonstrate that sequences outside the core TF binding site profoundly affect TF binding. We show that TF-specific models based on the sequence or DNA shape of the regions flanking the core binding site are highly predictive of the measured differential TF binding. We further characterize the dependence of TF binding, accounting for measurements of single and co-occurring binding events, on the number and location of binding sites and on the TF concentration. Finally, by coupling our in vitro TF binding measurements, and another application of our method probing nucleosome formation, to in vivo expression measurements carried out with the same template sequences serving as promoters, we offer insights into mechanisms that may determine the different expression outcomes observed. Our assay thus paves the way to a more comprehensive understanding of TF binding to regulatory sequences and allows the characterization of TF binding determinants within and outside of core binding sites.
Subjects
Binding Sites; Transcription Factors/metabolism; Computational Biology/methods; Nucleosomes/metabolism; Poly A; Poly T; Protein Binding; Regulatory Sequences, Nucleic Acid; Saccharomyces cerevisiae/genetics; Saccharomyces cerevisiae/metabolism; Thermodynamics
ABSTRACT
RNA binding proteins (RBPs) play an important role in regulating many processes in the cell. RBPs often recognize their RNA targets in a specific manner. In addition to the RNA primary sequence, the structure of the RNA has been shown to play a central role in RNA recognition by RBPs. In recent years, many experimental approaches, both in vitro and in vivo, were developed and employed to identify and characterize RBP targets and extract their binding specificities. In vivo binding techniques, such as CrossLinking and ImmunoPrecipitation (CLIP)-based methods, enable the characterization of protein binding sites on RNA targets. However, these methods do not provide information regarding the structural preferences of the protein. While methods to obtain the structure of RNA are available, inferring both the sequence and the structure preferences of RBPs remains a challenge. Here we present SMARTIV, a novel computational tool for discovering combined sequence and structure binding motifs from in vivo RNA binding data relying on the sequences of the target sites, the ranking of their binding scores and their predicted secondary structure. The combined motifs are provided in a unified representation that is informative and easy for visual perception. We tested the method on CLIP-seq data from different platforms for a variety of RBPs. Overall, we show that our results are highly consistent with known binding motifs of RBPs, offering additional information on their structural preferences.
Subjects
Algorithms; High-Throughput Nucleotide Sequencing/methods; RNA-Binding Proteins/genetics; RNA/chemistry; Sequence Analysis, RNA/statistics & numerical data; Software; Base Sequence; Binding Sites; Cell Line; Datasets as Topic; Humans; Immunoprecipitation; Nucleic Acid Conformation; Protein Binding; RNA/genetics; RNA/metabolism; RNA-Binding Proteins/metabolism; Sequence Analysis, RNA/methods; Transcriptome
ABSTRACT
The 3' end genomic region encodes a wide range of regulatory processes, including mRNA stability, 3' end processing and translation. Here, we systematically investigate the sequence determinants of 3' end mediated expression control by measuring the effect of 13,000 designed 3' end sequence variants on constitutive expression levels in yeast. By including a high-resolution scanning mutagenesis of more than 200 native 3' end sequences in this designed set, we found that most mutations had only a mild effect on expression, and that the vast majority (~90%) of strongly affecting mutations localized to a single positive TA-rich element, similar to a previously described 3' end processing efficiency element, and resulted in up to a ten-fold decrease in expression. Measurements of 3' UTR lengths revealed that these mutations result in mRNAs with aberrantly long 3' UTRs, confirming the role for this element in 3' end processing. Interestingly, we found that other sequence elements that were previously described in the literature to be part of the polyadenylation signal had a minor effect on expression. We further characterize the sequence specificities of the TA-rich element using additional synthetic 3' end sequences and show that its activity is sensitive to single base pair mutations and strongly depends on the A/T content of the surrounding sequences. Finally, using a computational model, we show that the strength of this element in native 3' end sequences can explain some of their measured expression variability (R = 0.41). Together, our results emphasize the importance of efficient 3' end processing for endogenous protein levels and contribute to an improved understanding of the sequence elements involved in this process.
Subjects
3' Untranslated Regions; Gene Expression Regulation, Fungal; Yeasts/genetics; Genome, Fungal; RNA, Messenger/genetics; RNA, Messenger/metabolism; Yeasts/metabolism
ABSTRACT
Genetically identical cells exhibit large variability (noise) in gene expression, with important consequences for cellular function. Although the amount of noise decreases with and is thus partly determined by the mean expression level, the extent to which different promoter sequences can deviate away from this trend is not fully known. Here, we present a high-throughput method for measuring promoter-driven noise for thousands of designed synthetic promoters in parallel. We use it to investigate how promoters encode different noise levels and find that the noise levels of promoters with similar mean expression levels can vary more than one order of magnitude, with nucleosome-disfavoring sequences resulting in lower noise and more transcription factor binding sites resulting in higher noise. We propose a kinetic model of gene expression that takes into account the nonspecific DNA binding and one-dimensional sliding along the DNA, which occurs when transcription factors search for their target sites. We show that this assumption can improve the prediction of the mean-independent component of expression noise for our designed promoter sequences, suggesting that a transcription factor target search may affect gene expression noise. Consistent with our findings in designed promoters, we find that binding-site multiplicity in native promoters is associated with higher expression noise. Overall, our results demonstrate that small changes in promoter DNA sequence can tune noise levels in a manner that is predictable and partly decoupled from effects on the mean expression levels. These insights may assist in designing promoters with desired noise levels.
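The idea of a mean-independent noise component can be sketched as follows. This is an illustration under stated assumptions, not the authors' kinetic model: noise is taken to be the squared coefficient of variation (variance divided by squared mean), and the dependence on the mean is removed by fitting a straight line in log-log space across promoters and keeping the residuals. Both choices, and the function names `noise_cv2` and `mean_independent_noise`, are assumptions made for this example.

```python
import math
import statistics


def noise_cv2(values):
    """Expression noise of one promoter as squared coefficient of variation."""
    mean = statistics.fmean(values)
    return statistics.pvariance(values) / (mean * mean)


def mean_independent_noise(means, noises):
    """Residual log10 noise after removing a linear trend in log10 mean.

    means, noises : one mean expression level and one noise value
                    per promoter, all strictly positive.
    """
    xs = [math.log10(m) for m in means]
    ys = [math.log10(v) for v in noises]
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    # Ordinary least-squares slope and intercept of log noise on log mean.
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]
```

A positive residual marks a promoter that is noisier than others with a similar mean, and a negative residual marks one that is quieter, which is the quantity a sequence-based model would then try to predict.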
Subjects
Computational Biology/methods; DNA/metabolism; Gene Expression; Promoter Regions, Genetic; Saccharomyces cerevisiae/genetics; Binding Sites; Genes, Fungal; Linear Models; Molecular Sequence Data; Saccharomyces cerevisiae/metabolism; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/metabolism; Transcription Factors/metabolism
ABSTRACT
MOTIVATION: Complex interactions among alleles often drive differences in inherited properties including disease predisposition. Isolating the effects of these interactions requires phasing information that is difficult to measure or infer. Furthermore, prevalent sequencing technologies used in the essential first step of determining a haplotype limit the range of that step to the span of reads, namely hundreds of bases. With the advent of pseudo-long read technologies, observable partial haplotypes can span several orders of magnitude more. Yet, measuring whole-genome, single-individual haplotypes remains a challenge. A different view of whole genome measurement addresses the 3D structure of the genome, with great development of Hi-C techniques in recent years. A shortcoming of current Hi-C, however, is the difficulty in inferring information that is specific to each of a pair of homologous chromosomes. RESULTS: In this work, we develop a robust algorithmic framework that takes two measurement-derived datasets, raw Hi-C and partial short-range haplotypes, and constructs the full-genome haplotype as well as phased diploid Hi-C maps. By analyzing both data sets together we thus bridge important gaps in both technologies: from short to long haplotypes and from un-phased to phased Hi-C. We demonstrate that our method can recover ground truth haplotypes with high accuracy, using measured biological data as well as simulated data. We analyze the impact of noise, Hi-C sequencing depth and measured haplotype lengths on performance. Finally, we use the inferred 3D structure of a human genome to point at nuclear co-localization of transcription factor targets. AVAILABILITY AND IMPLEMENTATION: The implementation is available at https://github.com/YakhiniGroup/SpectraPh CONTACT: zohar.yakhini@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.