1.
bioRxiv ; 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38659854

ABSTRACT

The human genome contains millions of retrotransposons, several of which could become active due to somatic mutations having phenotypic consequences, including disease. However, it is not thoroughly understood how nucleotide changes in retrotransposons affect their jumping activity. Here, we developed a novel massively parallel jumping assay (MPJA) that can test the jumping potential of thousands of transposons en masse. We generated a nucleotide variant library of four selected Alu retrotransposons containing 165,087 different haplotypes and tested them for their jumping ability using MPJA. We found 66,821 unique jumping haplotypes, allowing us to pinpoint domains and variants vital for transposition. Mapping these variants to the Alu-RNA secondary structure revealed stem-loop features that contribute to jumping potential. Combined, our work provides a novel high-throughput assay that assesses the ability of retrotransposons to jump and identifies nucleotide changes that have the potential to reactivate them in the human genome.

3.
Nucleic Acids Res ; 52(D1): D1143-D1154, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-38183205

ABSTRACT

Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.


Subject(s)
Genetic Variation , Genome, Human , Machine Learning , Software , Nucleotides , Humans
4.
Am J Hum Genet ; 111(2): 338-349, 2024 Feb 01.
Article in English | MEDLINE | ID: mdl-38228144

ABSTRACT

Clinical exome and genome sequencing have revolutionized the understanding of human disease genetics. Yet many genes remain functionally uncharacterized, complicating the establishment of causal disease links for genetic variants. While several scoring methods have been devised to prioritize these candidate genes, these methods fall short of capturing the expression heterogeneity across cell subpopulations within tissues. Here, we introduce single-cell tissue-specific gene prioritization using machine learning (STIGMA), an approach that leverages single-cell RNA-seq (scRNA-seq) data to prioritize candidate genes associated with rare congenital diseases. STIGMA prioritizes genes by learning the temporal dynamics of gene expression across cell types during healthy organogenesis. To assess the efficacy of our framework, we applied STIGMA to mouse limb and human fetal heart scRNA-seq datasets. In a cohort of individuals with congenital limb malformation, STIGMA prioritized 469 variants in 345 genes, with UBA2 as a notable example. For congenital heart defects, we detected 34 genes harboring nonsynonymous de novo variants (nsDNVs) in two or more individuals from a set of 7,958 individuals, including the ortholog of Prdm1, which is associated with hypoplastic left ventricle and hypoplastic aortic arch. Overall, our findings demonstrate that STIGMA effectively prioritizes tissue-specific candidate genes by utilizing single-cell transcriptome data. The ability to capture the heterogeneity of gene expression across cell populations makes STIGMA a powerful tool for the discovery of disease-associated genes and facilitates the identification of causal variants underlying human genetic disorders.


Subject(s)
Heart Defects, Congenital , Transcriptome , Humans , Animals , Mice , Exome/genetics , Heart Defects, Congenital/genetics , Exome Sequencing , Machine Learning , Single-Cell Analysis/methods , Ubiquitin-Activating Enzymes/genetics
5.
NAR Genom Bioinform ; 5(4): lqad102, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38025047

ABSTRACT

Analyses of cell-free DNA (cfDNA) are increasingly being employed for various diagnostic and research applications. Many technologies aim to increase resolution, e.g. for detecting early-stage cancer or minimal residual disease. However, these efforts may be confounded by inherent base composition biases of cfDNA, specifically the over- and underrepresentation of guanine (G) and cytosine (C) sequences. Currently, there is no universally applicable tool to correct these effects on sequencing read-level data. Here, we present GCparagon, a two-stage algorithm for computing and correcting GC biases in cfDNA samples. In the initial step, length and GC base count parameters are determined. Here, our algorithm minimizes the inclusion of known problematic genomic regions, such as low-mappability regions, in its calculations. In the second step, GCparagon computes weights counterbalancing the distortion of cfDNA attributes (correction matrix). These fragment weights are added to a binary alignment map (BAM) file as alignment tags for individual reads. The GC correction matrix or the tagged BAM file can be used for downstream analyses. Parallel computing allows for a GC bias estimation in under 1 min. We demonstrate that GCparagon vastly improves the analysis of regulatory regions, which frequently show specific GC composition patterns, and will contribute to standardized cfDNA applications.
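
The two-stage idea in this abstract (count fragments per length and GC content, then derive counterbalancing weights) can be illustrated with a minimal sketch. This is not GCparagon's implementation: the function names, the toy fragments, and the choice of a plain expected/observed ratio per bin are assumptions made for illustration only.

```python
from collections import Counter

def gc_count(seq):
    """Number of G and C bases in a fragment sequence."""
    return sum(1 for base in seq.upper() if base in "GC")

def correction_matrix(observed, expected):
    """Weight per (length, GC count) bin: expected share / observed share.

    `observed` are sequenced fragments; `expected` is a hypothetical
    reference set of fragments drawn without GC bias.
    """
    obs = Counter((len(s), gc_count(s)) for s in observed)
    exp = Counter((len(s), gc_count(s)) for s in expected)
    n_obs, n_exp = sum(obs.values()), sum(exp.values())
    weights = {}
    for key, count in obs.items():
        if key in exp:  # bins without a reference value get no weight
            weights[key] = (exp[key] / n_exp) / (count / n_obs)
    return weights

def fragment_weight(seq, weights, default=1.0):
    """Look up the weight of one fragment; unknown bins stay unweighted."""
    return weights.get((len(seq), gc_count(seq)), default)

# Toy data: GC-rich fragments are overrepresented in the observed sample.
observed = ["GGCC", "GGCC", "GGCC", "ATAT"]
expected = ["GGCC", "GGCC", "ATAT", "ATAT"]
weights = correction_matrix(observed, expected)
```

In this toy example the overrepresented GC-rich bin receives a weight below 1 and the underrepresented AT-rich bin a weight above 1, which is the counterbalancing behaviour the abstract describes.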

6.
Bioinformatics ; 39(5), 2023 05 04.
Article in English | MEDLINE | ID: mdl-37084271

ABSTRACT

MOTIVATION: Missense variants are a frequent class of variation within the coding genome, and some of them cause Mendelian diseases. Despite advances in computational prediction, classifying missense variants into pathogenic or benign remains a major challenge in the context of personalized medicine. Recently, the structure of the human proteome was derived with unprecedented accuracy using the artificial intelligence system AlphaFold2. This raises the question of whether AlphaFold2 wild-type structures can improve the accuracy of computational pathogenicity prediction for missense variants. RESULTS: To address this, we first engineered a set of features for each amino acid from these structures. We then trained a random forest to distinguish between relatively common (proxy-benign) and singleton (proxy-pathogenic) missense variants from gnomAD v3.1. This yielded a novel AlphaFold2-based pathogenicity prediction score, termed AlphScore. Important feature classes used by AlphScore are solvent accessibility, amino acid network related features, features describing the physicochemical environment, and AlphaFold2's quality parameter (predicted local distance difference test). AlphScore alone showed lower performance than existing in silico scores used for missense prediction, such as CADD or REVEL. However, when AlphScore was added to those scores, the performance increased, as measured by the approximation of deep mutational scan data, as well as the prediction of expert-curated missense variants from the ClinVar database. Overall, our data indicate that the integration of AlphaFold2-predicted structures can improve pathogenicity prediction of missense variants. AVAILABILITY AND IMPLEMENTATION: AlphScore, combinations of AlphScore with existing scores, as well as variants used for training and testing are publicly available.


Subject(s)
Artificial Intelligence , Computational Biology , Humans , Virulence , Mutation, Missense , Mutation
7.
bioRxiv ; 2023 Mar 06.
Article in English | MEDLINE | ID: mdl-36945371

ABSTRACT

The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a nearly comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200 nucleotide cores function as non-cell-type-specific 'on switches' providing similar expression levels to their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines, and showcases how large-scale functional measurements can be used to dissect regulatory grammar.

8.
Nature ; 614(7948): 564-571, 2023 02.
Article in English | MEDLINE | ID: mdl-36755093

ABSTRACT

Thousands of genetic variants in protein-coding genes have been linked to disease. However, the functional impact of most variants is unknown as they occur within intrinsically disordered protein regions that have poorly defined functions [1-3]. Intrinsically disordered regions can mediate phase separation and the formation of biomolecular condensates, such as the nucleolus [4,5]. This suggests that mutations in disordered proteins may alter condensate properties and function [6-8]. Here we show that a subset of disease-associated variants in disordered regions alter phase separation, cause mispartitioning into the nucleolus and disrupt nucleolar function. We discover de novo frameshift variants in HMGB1 that cause brachyphalangy, polydactyly and tibial aplasia syndrome, a rare complex malformation syndrome. The frameshifts replace the intrinsically disordered acidic tail of HMGB1 with an arginine-rich basic tail. The mutant tail alters HMGB1 phase separation, enhances its partitioning into the nucleolus and causes nucleolar dysfunction. We built a catalogue of more than 200,000 variants in disordered carboxy-terminal tails and identified more than 600 frameshifts that create arginine-rich basic tails in transcription factors and other proteins. For 12 out of the 13 disease-associated variants tested, the mutation enhanced partitioning into the nucleolus, and several variants altered rRNA biogenesis. These data identify the cause of a rare complex syndrome and suggest that a large number of genetic variants may dysregulate nucleoli and other biomolecular condensates in humans.


Subject(s)
Cell Nucleolus , HMGB1 Protein , Humans , Arginine/genetics , Arginine/metabolism , Cell Nucleolus/genetics , Cell Nucleolus/metabolism , Cell Nucleolus/pathology , HMGB1 Protein/chemistry , HMGB1 Protein/genetics , HMGB1 Protein/metabolism , Intrinsically Disordered Proteins/chemistry , Intrinsically Disordered Proteins/genetics , Intrinsically Disordered Proteins/metabolism , Syndrome , Frameshift Mutation , Phase Transition
9.
Neuron ; 111(6): 857-873.e8, 2023 03 15.
Article in English | MEDLINE | ID: mdl-36640767

ABSTRACT

Using machine learning (ML), we interrogated the function of all human-chimpanzee variants in 2,645 human accelerated regions (HARs), finding 43% of HARs have variants with large opposing effects on chromatin state and 14% on neurodevelopmental enhancer activity. This pattern, consistent with compensatory evolution, was confirmed using massively parallel reporter assays in chimpanzee and human neural progenitor cells. The species-specific enhancer activity of HARs was accurately predicted from the presence and absence of transcription factor footprints in each species. Despite these striking cis effects, activity of a given HAR sequence was nearly identical in human and chimpanzee cells. This suggests that HARs did not evolve to compensate for changes in the trans environment but instead altered their ability to bind factors present in both species. Thus, ML prioritized variants with functional effects on human neurodevelopment and revealed an unexpected reason why HARs may have evolved so rapidly.


Subject(s)
Brain , Enhancer Elements, Genetic , Pan troglodytes , Animals , Humans , Chromatin , Machine Learning , Pan troglodytes/metabolism , Transcription Factors/genetics , Brain/growth & development
10.
BMC Bioinformatics ; 23(Suppl 2): 154, 2022 Dec 12.
Article in English | MEDLINE | ID: mdl-36510125

ABSTRACT

BACKGROUND: Cis-regulatory regions (CRRs) are non-coding regions of the DNA that finely control the spatio-temporal pattern of transcription; they are involved in a wide range of pivotal processes such as the development of specific cell-lines/tissues and the dynamic cell response to physiological stimuli. Recent studies showed that genetic variants occurring in CRRs are strongly correlated with pathogenicity or deleteriousness. Considering the central role of CRRs in the regulation of physiological and pathological conditions, the correct identification of CRRs and of their tissue-specific activity status through Machine Learning methods plays a major role in dissecting the impact of genetic variants on human diseases. Unfortunately, the problem is still open, though some promising results have already been reported by (deep) machine-learning based methods that predict active promoters and enhancers in specific tissues or cell lines by encoding epigenetic or spectral features directly extracted from DNA sequences. RESULTS: We present the experiments we performed to compare two Deep Neural Networks, a Feed-Forward Neural Network model working on epigenomic features, and a Convolutional Neural Network model working only on genomic sequence, targeted to the identification of enhancer- and promoter-activity in specific cell lines. While performing experiments to understand how the experimental setup influences the prediction performance of the methods, we particularly focused on (1) automatic model selection performed by Bayesian optimization and (2) exploring different data rebalancing setups for reducing negative unbalancing effects.
CONCLUSIONS: Results show that (1) automatic model selection by Bayesian optimization improves the quality of the learner; (2) data rebalancing considerably impacts the prediction performance of the models; test set rebalancing may provide over-optimistic results, and should therefore be cautiously applied; (3) despite working on sequence data, convolutional models obtain performance close to that of feed-forward models working on epigenomic information, which suggests that sequence data also carry informative content for CRR-activity prediction. We therefore suggest combining both models/data types in future work.
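
The caution about test set rebalancing can be made concrete with a small sketch: the training split is undersampled to equal class frequencies while the test split keeps its original class imbalance. The function name, the toy data, and the choice of random undersampling are assumptions for illustration; the abstract does not state which rebalancing procedures the authors used.

```python
import random

def undersample_majority(examples, seed=0):
    """Downsample the larger class so both classes are equally frequent.

    `examples` is a list of (features, label) pairs with label 0 or 1.
    """
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced

# Toy data set: 10 active regions (label 1) and 90 inactive ones (label 0).
data = [((i,), 1 if i < 10 else 0) for i in range(100)]
random.Random(42).shuffle(data)

# Split first, then rebalance the training portion only.
train, test = data[:80], data[80:]
train_balanced = undersample_majority(train)
# `test` is left untouched so that evaluation reflects the real imbalance.
```

Splitting before rebalancing also keeps the held-out examples from influencing which training examples are kept.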


Subject(s)
Deep Learning , Humans , Bayes Theorem , Regulatory Sequences, Nucleic Acid , Neural Networks, Computer , Machine Learning
11.
J Thromb Haemost ; 20(9): 2022-2034, 2022 09.
Article in English | MEDLINE | ID: mdl-35770352

ABSTRACT

BACKGROUND: Hemophilia A (HA) and hemophilia B (HB) are rare inherited bleeding disorders. Although causative genetic variants are clinically relevant, in 2012 only 20% of US patients had been genotyped. OBJECTIVES: My Life, Our Future (MLOF) was a multisector cross-sectional US initiative to improve our understanding of hemophilia through widespread genotyping. METHODS: Subjects and potential genetic carriers were enrolled at US hemophilia treatment centers (HTCs). Bloodworks performed genotyping and returned results to providers. Clinical data were abstracted from the American Thrombosis and Hemostasis Network dataset. Community education was provided by the National Hemophilia Foundation. RESULTS: From 2013 to 2017, 107 HTCs enrolled 11 341 subjects (68.8% male, 31.2% female) for testing for HA (n = 8976), HB (n = 2358), HA/HB (n = 3), and hemophilia not otherwise specified (n = 4). Variants were detected in most male patients (98.2% HA, 98.1% HB). 1914 unique variants were found (1482 F8, 431 F9); 744 were novel (610 F8, 134 F9). Inhibitor data were available for 6986 subjects (5583 HA; 1403 HB). In severe HA, genotypes with the highest inhibitor rates were large deletions (77/80), complex intron 22 inversions (9/17), and no variant found (7/14). In severe HB, the highest rates were large deletions (24/42). Inhibitors were reported in 27.3% of Black versus 16.2% of White patients. CONCLUSIONS: We report the findings of MLOF, the largest hemophilia genotyping project performed to date. The results support the need for comprehensive genetic approaches in hemophilia. This effort has contributed significantly towards better understanding variation in the F8 and F9 genes in hemophilia and risks of inhibitor formation.
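
The inhibitor rates in this abstract are given as counts. As a reading aid, the short sketch below converts them to percentages; the dictionary keys are labels chosen here, and only the counts come from the abstract.

```python
# (subjects with inhibitors, subjects with that genotype), as reported above
inhibitor_counts = {
    "severe HA, large deletions": (77, 80),
    "severe HA, complex intron 22 inversions": (9, 17),
    "severe HA, no variant found": (7, 14),
    "severe HB, large deletions": (24, 42),
}

# Percentage of subjects with inhibitors for each genotype group
rates = {
    group: 100 * with_inhibitor / total
    for group, (with_inhibitor, total) in inhibitor_counts.items()
}

for group, rate in rates.items():
    print(f"{group}: {rate:.2f}%")
```

The counts correspond to roughly 96%, 53%, 50%, and 57%, respectively.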


Subject(s)
Hemophilia A , Hemophilia B , Cross-Sectional Studies , Factor VIII/genetics , Female , Genotype , Hemophilia A/diagnosis , Hemophilia A/genetics , Hemophilia B/diagnosis , Hemophilia B/epidemiology , Hemophilia B/genetics , Humans , Male , United States/epidemiology
12.
Article in English | MEDLINE | ID: mdl-35483875

ABSTRACT

The increase in sequencing capacity, reduction in costs, and national and international coordinated efforts have led to the widespread introduction of next-generation sequencing (NGS) technologies in patient care. More generally, human genetics and genomic medicine are gaining importance for more and more patients. Some communities are already discussing the prospect of sequencing each individual's genome at time of birth. Together with digital health records, this shall enable individualized treatments and preventive measures, so-called precision medicine. A central step in this process is the identification of disease-causal mutations or variant combinations that make us more susceptible to diseases. Although various technological advances have improved the identification of genetic alterations, the interpretation and ranking of the identified variants remains a major challenge. Based on our knowledge of molecular processes or previously identified disease variants, we can identify potentially functional genetic variants and, using different lines of evidence, we are sometimes able to demonstrate their pathogenicity directly. However, the vast majority of variants are classified as variants of uncertain clinical significance (VUSs) with not enough experimental evidence to determine their pathogenicity. In these cases, computational methods may be used to improve the prioritization and an increasing toolbox of experimental methods is emerging that can be used to assay the molecular effects of VUSs. Here, we discuss how computational and experimental methods can be used to create catalogs of variant effects for a variety of molecular and cellular phenotypes. We discuss the prospects of integrating large-scale functional data with machine learning and clinical knowledge for the development of accurate pathogenicity predictions for clinical applications.


Subject(s)
High-Throughput Nucleotide Sequencing , Precision Medicine , Genome , Humans , Mutation
13.
Genome Res ; 32(4): 766-777, 2022 04.
Article in English | MEDLINE | ID: mdl-35197310

ABSTRACT

Although technological advances improved the identification of structural variants (SVs) in the human genome, their interpretation remains challenging. Several methods utilize individual mechanistic principles like the deletion of coding sequence or 3D genome architecture disruptions. However, a comprehensive tool using the broad spectrum of available annotations is missing. Here, we describe CADD-SV, a method to retrieve and integrate a wide set of annotations to predict the effects of SVs. Previously, supervised learning approaches were limited due to a small number and biased set of annotated pathogenic or benign SVs. We overcome this problem by using a surrogate training objective, the Combined Annotation Dependent Depletion (CADD) of functional variants. We use human- and chimpanzee-derived SVs as proxy-neutral and contrast them with matched simulated variants as proxy-deleterious, an approach that has proven powerful for short sequence variants. Our tool computes summary statistics over diverse variant annotations and uses random forest models to prioritize deleterious structural variants. The resulting CADD-SV scores correlate with known pathogenic and rare population variants. We further show that we can prioritize somatic cancer variants as well as noncoding variants known to affect gene expression. We provide a website and offline-scoring tool for easy application of CADD-SV.


Subject(s)
Genome, Human , Humans
14.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: mdl-37083939

ABSTRACT

BACKGROUND: Genome sequencing efforts for individuals with rare Mendelian disease have increased the research focus on the noncoding genome and the clinical need for methods that prioritize potentially disease causal noncoding variants. Some tools for assessment of variant pathogenicity as well as annotations are not available for the current human genome build (GRCh38), for which the adoption in databases, software, and pipelines was slow. RESULTS: Here, we present an updated version of the Regulatory Mendelian Mutation (ReMM) score, retrained on features and variants derived from the GRCh38 genome build. Like its GRCh37 version, it achieves good performance on its highly imbalanced data. To improve accessibility and provide users with a toolbox to score their variant files and look up scores in the genome, we developed a website and API for easy score lookup. CONCLUSIONS: Scores of the GRCh38 genome build are highly correlated to the prior release with a performance increase due to the better coverage of features. For prioritization of noncoding mutations in imbalanced datasets, the ReMM score performed much better than other variation scores. Prescored whole-genome files of GRCh37 and GRCh38 genome builds are cited in the article and the website; UCSC genome browser tracks, and an API are available at https://remm.bihealth.org.


Subject(s)
Genome, Human , Software , Humans , Mutation , Databases, Genetic
15.
Med Genet ; 34(4): 275-286, 2022 Dec 31.
Article in English | MEDLINE | ID: mdl-37034418

ABSTRACT

Identification of genetic variation in individual genomes is now a routine procedure in human genetic research and diagnostics. For many variants, however, insufficient evidence is available to establish a pathogenic effect, particularly for variants in non-coding regions. Furthermore, the sheer number of candidate variants renders testing in individual assays virtually impossible. While scalable approaches are being developed, the selection of methods and resources, and the application of a given framework to a particular disease or trait remain major challenges. This limits the translation of results from both genome-wide association studies and genome sequencing. Here, we discuss computational and experimental approaches available for functional annotation of non-coding variation.

16.
Genome Med ; 13(1): 31, 2021 02 22.
Article in English | MEDLINE | ID: mdl-33618777

ABSTRACT

BACKGROUND: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. METHODS: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. RESULTS: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. CONCLUSIONS: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.


Subject(s)
Deep Learning , Genetic Variation , Genome-Wide Association Study , RNA Splicing/genetics , Base Sequence , Exons/genetics , Humans , Introns/genetics
18.
PLoS One ; 15(12): e0237412, 2020.
Article in English | MEDLINE | ID: mdl-33259518

ABSTRACT

Regulatory regions, like promoters and enhancers, cover an estimated 5-15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. We train gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, and compare both learners and two approaches for negative training data. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need of hyperparameter optimization.
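
The two kinds of negative sets named in this abstract can be sketched as follows. This is an illustrative assumption, not the authors' pipeline: the shuffle below is a plain single-base permutation (the study may use other shuffling schemes), background matching is reduced to GC fraction only, and all function names and sequences are hypothetical.

```python
import random

def shuffled_negative(seq, rng):
    """Random permutation of a positive sequence.

    Keeps length and base composition, destroys motif order.
    """
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def build_shuffled_set(positives, seed=0):
    """One shuffled negative per positive sequence."""
    rng = random.Random(seed)
    return [shuffled_negative(s, rng) for s in positives]

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)

def matched_background(positive, candidates):
    """Pick the background candidate closest in GC fraction to the positive."""
    target = gc_fraction(positive)
    return min(candidates, key=lambda c: abs(gc_fraction(c) - target))

positives = ["ACGTGGCCAATT", "TTTTGGGGCCCC"]
negatives = build_shuffled_set(positives)
background = matched_background("GGCC", ["ATAT", "GCGC", "ATGC"])
```

Matching on a single property such as GC fraction is the simplest case; the abstract's remark about insufficient matching suggests that real background selection needs to control for more than this.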


Subject(s)
Regulatory Sequences, Nucleic Acid/genetics , A549 Cells , Cell Line, Tumor , Chromatin/genetics , DNA/genetics , Genome/genetics , Genomics/methods , HeLa Cells , Hep G2 Cells , Humans , K562 Cells , MCF-7 Cells , Neural Networks, Computer , Promoter Regions, Genetic/genetics , Sequence Analysis, DNA , Support Vector Machine
19.
Nat Methods ; 17(11): 1083-1091, 2020 11.
Article in English | MEDLINE | ID: mdl-33046894

ABSTRACT

Massively parallel reporter assays (MPRAs) functionally screen thousands of sequences for regulatory activity in parallel. To date, there are limited studies that systematically compare differences in MPRA design. Here, we screen a library of 2,440 candidate liver enhancers and controls for regulatory activity in HepG2 cells using nine different MPRA designs. We identify subtle but significant differences that correlate with epigenetic and sequence-level features, as well as differences in dynamic range and reproducibility. We also validate that enhancer activity is largely independent of orientation, at least for our library and designs. Finally, we assemble and test the same enhancers as 192-mers, 354-mers and 678-mers and observe sizable differences. This work provides a framework for the experimental design of high-throughput reporter assays, suggesting that the extended sequence context of tested elements, and to a lesser degree the precise assay, influence MPRA results.


Subject(s)
Gene Library , Genes, Reporter , High-Throughput Nucleotide Sequencing/methods , Regulatory Sequences, Nucleic Acid , Sequence Analysis, DNA/methods , Enhancer Elements, Genetic , Hep G2 Cells , Humans , Reproducibility of Results , Transcription Factors/genetics
20.
Nat Protoc ; 15(8): 2387-2412, 2020 08.
Article in English | MEDLINE | ID: mdl-32641802

ABSTRACT

Massively parallel reporter assays (MPRAs) can simultaneously measure the function of thousands of candidate regulatory sequences (CRSs) in a quantitative manner. In this method, CRSs are cloned upstream of a minimal promoter and reporter gene, alongside a unique barcode, and introduced into cells. If the CRS is a functional regulatory element, it will lead to the transcription of the barcode sequence, which is measured via RNA sequencing and normalized for cellular integration via DNA sequencing of the barcode. This technology has been used to test thousands of sequences and their variants for regulatory activity, to decipher the regulatory code and its evolution, and to develop genetic switches. Lentivirus-based MPRA (lentiMPRA) produces 'in-genome' readouts and enables the use of this technique in hard-to-transfect cells. Here, we provide a detailed protocol for lentiMPRA, along with a user-friendly Nextflow-based computational pipeline, MPRAflow, for quantifying CRS activity from different MPRA designs. The lentiMPRA protocol takes ~2 months, which includes sequencing turnaround time and data processing with MPRAflow.
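
The normalization described in this abstract (barcode RNA counts relative to barcode DNA counts) can be sketched in a few lines. This is not MPRAflow's code: the function name, the pseudocount, the summing of barcodes per CRS, and the toy counts are assumptions for illustration, and a real analysis would also account for sequencing depth and replicates.

```python
import math
from collections import defaultdict

def crs_activity(barcode_to_crs, dna_counts, rna_counts, pseudocount=1):
    """log2 RNA/DNA ratio per CRS, summing counts over its barcodes."""
    dna = defaultdict(int)
    rna = defaultdict(int)
    for barcode, crs in barcode_to_crs.items():
        dna[crs] += dna_counts.get(barcode, 0)
        rna[crs] += rna_counts.get(barcode, 0)
    # The pseudocount avoids division by zero for unobserved barcodes.
    return {
        crs: math.log2((rna[crs] + pseudocount) / (dna[crs] + pseudocount))
        for crs in dna
    }

# Toy example: two barcodes for crs1, one barcode for crs2.
barcode_to_crs = {"AAAC": "crs1", "AAAG": "crs1", "CCCA": "crs2"}
dna_counts = {"AAAC": 10, "AAAG": 5, "CCCA": 7}
rna_counts = {"AAAC": 40, "AAAG": 23, "CCCA": 3}

activity = crs_activity(barcode_to_crs, dna_counts, rna_counts)
```

With these toy counts, crs1 has more RNA than DNA reads (positive log2 ratio) and crs2 fewer (negative log2 ratio), mirroring the active versus inactive distinction in the text.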


Subject(s)
Lentivirus/genetics , Regulatory Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Workflow , Base Sequence