ABSTRACT
Introduction: Various sequencing-based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and by secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers. Methods: Here, machine learning models are employed to evaluate how accurately cis-regulatory elements identified by various commonly used sequencing techniques can be predicted from their underlying sequence alone, in order to distinguish cis-regulatory activity that reflects sequence content from activity that reflects secondary processes. Results and discussion: Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not.
Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset and the associated data are appropriate for training models for detecting regulatory activity from sequence alone, STARR-seq data are best for training enhancer-specific sequence models, and H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis-regulatory element prediction.
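The abstract does not say which model architecture or sequence representation was used. As a purely illustrative sketch of what "predicting from sequence alone" can mean in practice, the snippet below turns a DNA sequence into a fixed-length vector of k-mer frequencies, one common input for sequence-based classifiers. The function name `kmer_frequencies` and the choice of k are assumptions made for this example and are not taken from the study.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return k-mer frequencies for a DNA sequence in a fixed A/C/G/T order.

    Illustrative featurization only; the study's actual features are not
    described in the abstract.
    """
    seq = seq.upper()
    # All 4**k possible k-mers, in lexicographic order, so every sequence
    # maps to a vector of the same length and layout.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other symbols (for example N) are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Example: a 16-dimensional dinucleotide profile for a short sequence.
features = kmer_frequencies("ACGTACGTAC", k=2)
```

Vectors of this kind could be passed, together with labels such as "identified by STARR-seq" versus "background", to any standard classifier; comparing held-out accuracy across assays is the sort of comparison the abstract describes.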
Subjects
Drosophila melanogaster, Histones, Animals, Histones/genetics, DNA Sequence Analysis, Chromatin/genetics, Deoxyribonucleases
ABSTRACT
BACKGROUND: Anopheles cell lines are used in a variety of ways to better understand the major vectors of malaria in sub-Saharan Africa. Despite this, commonly used cell lines are not well characterized, and no tools are available for cell line identification and authentication. METHODS: Using whole-genome sequencing, the genomes of the 4a-3A and 4a-3B 'hemocyte-like' cell lines were characterized for insertions and deletions (indels) and SNP variation. Genomic locations of distinguishing sequence variation and species origin of the cell lines were also examined. Unique indels were targeted to develop a PCR-based cell line authentication assay. Mitotic chromosomes were examined to survey the cytogenetic landscape for chromosome structure and copy number in the cell lines. RESULTS: The 4a-3A and 4a-3B cell lines are female in origin and primarily of Anopheles coluzzii ancestry. Cytogenetic analysis indicates that the two cell lines are essentially diploid, with some relatively minor chromosome structural rearrangements. Whole-genome sequence was generated, and analysis indicated that the SNPs and indels that differentiate the cell lines are clustered on chromosome 2R in the regions of the 2Rb, 2Rc and 2Ru chromosomal inversions. A PCR-based authentication assay was developed to fingerprint three indels unique to each cell line. The assay distinguishes between 4a-3A and 4a-3B cells and also uniquely identifies two additional An. coluzzii cell lines tested, Ag55 and Sua4.0. The assay has the specificity to distinguish four cell lines and also has the sensitivity to detect cellular contamination within a sample of cultured cells. CONCLUSIONS: Genomic characterization of the 4a-3A and 4a-3B Anopheles cell lines was used to develop a simple diagnostic assay that can distinguish these cell lines within and across research laboratories.
A cytogenetic survey indicated that the 4a-3A and Sua4.0 cell lines carry essentially normal diploid chromosomes, which makes them amenable to CRISPR/Cas9 genome editing. The presented simple authentication assay, coupled with screening for mycoplasma, will allow validation of the integrity of experimental resources and will promote greater reproducibility of experimental results.
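The logic of an indel-fingerprint assay of this kind can be sketched as a simple lookup: each cell line is associated with a set of markers unique to it, and a sample is assigned to every line whose markers all amplify. The marker names below (`A1`, `B1`, and so on) are hypothetical placeholders, since the abstract does not name the indels, and the rule that all three markers must be present is an assumption made for illustration rather than the published scoring scheme.

```python
# Hypothetical marker names: the abstract states only that three indels
# unique to each cell line are fingerprinted, not what they are called.
FINGERPRINTS = {
    "4a-3A": {"A1", "A2", "A3"},
    "4a-3B": {"B1", "B2", "B3"},
}


def lines_detected(amplified, fingerprints=FINGERPRINTS):
    """Return the cell lines whose unique markers were all amplified.

    Assumption for this sketch: a line is called only when every one of
    its markers is present in the PCR result.
    """
    observed = set(amplified)
    return sorted(
        name for name, markers in fingerprints.items() if markers <= observed
    )


def is_contaminated(amplified, fingerprints=FINGERPRINTS):
    """Flag a sample in which more than one cell line is detected."""
    return len(lines_detected(amplified, fingerprints)) > 1
```

Under these assumptions, a sample that amplifies `A1`, `A2`, and `A3` only would be called 4a-3A, while a sample that amplifies all six markers would be flagged as a mixture of the two lines.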