ABSTRACT
Protein phase transitions (PPTs) from the soluble state to a dense liquid phase (forming droplets via liquid-liquid phase separation) or to solid aggregates (such as amyloids) play key roles in pathological processes associated with age-related diseases such as Alzheimer's disease. Several computational frameworks are capable of separately predicting the formation of droplets or amyloid aggregates based on protein sequences, yet none have tackled the prediction of both within a unified framework. Recently, large language models (LLMs) have exhibited great success in protein structure prediction; however, they have not yet been used for PPTs. Here, we fine-tune an LLM for predicting PPTs and demonstrate its usage in evaluating how sequence variants affect PPTs, an operation useful for protein design. In addition, we show its superior performance compared to suitable classical benchmarks. Due to the "black-box" nature of the LLM, we also employ a classical random forest model along with biophysical features to facilitate interpretation. Finally, focusing on Alzheimer's disease-related proteins, we demonstrate that greater aggregation is associated with reduced gene expression in Alzheimer's disease, suggesting a natural defense mechanism.
Subjects
Alzheimer Disease , Phase Transition , Alzheimer Disease/metabolism , Humans , Amyloid/metabolism , Amyloid/chemistry , Proteins/chemistry , Proteins/metabolism
ABSTRACT
Numerous statistical methods have emerged for inferring DNA motifs for transcription factors (TFs) from genomic regions. However, the process of selecting informative regions for motif inference remains understudied. Current approaches select regions with strong ChIP-seq signal for a given TF, assuming that such strong signal primarily results from specific interactions between the TF and its motif. Additionally, these selection approaches do not account for non-target motifs, i.e. motifs of other TFs; they presume that non-target motifs occur infrequently compared with the target motif and therefore interfere minimally with its identification. Leveraging extensive ChIP-seq datasets, we introduce the concept of TF signal 'crowdedness', referred to as the C-score, for each genomic region. The C-score helps highlight TF signals arising from non-specific interactions. Moreover, by considering the C-score (and adjusting for the length of genomic regions), we can effectively mitigate interference from non-target motifs. Using these tools, we find that in many instances, strong ChIP-seq signal stems mainly from non-specific interactions, and the occurrence of non-target motifs significantly impacts the accurate inference of the target motif. Prioritizing genomic regions with reduced crowdedness and short length markedly improves motif inference. This 'less-is-more' effect suggests that ChIP-seq region selection warrants more attention.
Subjects
Genomics , Nucleotide Motifs , Transcription Factors , Binding Sites , Chromatin Immunoprecipitation , Protein Binding , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
GENCODE produces high quality gene and transcript annotation for the human and mouse genomes. All GENCODE annotation is supported by experimental data and serves as a reference for genome biology and clinical genomics. The GENCODE consortium generates targeted experimental data, develops bioinformatic tools and carries out analyses that, along with externally produced data and methods, support the identification and annotation of transcript structures and the determination of their function. Here, we present an update on the annotation of human and mouse genes, including developments in the tools, data, analyses and major collaborations which underpin this progress. For example, we report the creation of a set of non-canonical ORFs identified in GENCODE transcripts, the LRGASP collaboration to assess the use of long transcriptomic data to build transcript models, the progress in collaborations with RefSeq and UniProt to increase convergence in the annotation of human and mouse protein-coding genes, the propagation of GENCODE across the human pan-genome and the development of new tools to support annotation of regulatory features by GENCODE. Our annotation is accessible via Ensembl, the UCSC Genome Browser and https://www.gencodegenes.org.
Subjects
Computational Biology , Human Genome , Humans , Animals , Mice , Molecular Sequence Annotation , Computational Biology/methods , Human Genome/genetics , Transcriptome/genetics , Gene Expression Profiling , Genetic Databases
ABSTRACT
BACKGROUND: Predicting cis-regulatory modules (CRMs) in a genome and their functional states in various cell/tissue types of the organism are two related challenging computational tasks. Most current methods attempt to simultaneously achieve both using data of multiple epigenetic marks in a cell/tissue type. Though conceptually attractive, they suffer from high false discovery rates and limited applications. To fill the gaps, we proposed a two-step strategy to first predict a map of CRMs in the genome, and then predict functional states of all the CRMs in various cell/tissue types of the organism. We have recently developed an algorithm for the first step that was able to more accurately and completely predict CRMs in a genome than existing methods by integrating numerous transcription factor ChIP-seq datasets in the organism. Here, we present machine-learning methods for the second step. RESULTS: We showed that functional states in a cell/tissue type of all the CRMs in the genome could be accurately predicted using data of only 1~4 epigenetic marks by a variety of machine-learning classifiers. Our predictions are substantially more accurate than the best achieved so far. Interestingly, a model trained on a cell/tissue type in humans can accurately predict functional states of CRMs in different cell/tissue types of humans as well as of mice, and vice versa. Therefore, the epigenetic code that defines functional states of CRMs in various cell/tissue types is universal at least in humans and mice. Moreover, we found that from tens to hundreds of thousands of CRMs were active in a human and mouse cell/tissue type, and up to 99.98% of them were reutilized in different cell/tissue types, while as few as 0.02% of them were unique to a cell/tissue type and might define that cell/tissue type. CONCLUSIONS: Our two-step approach can accurately predict functional states in any cell/tissue type of all the CRMs in the genome using data of only 1~4 epigenetic marks. Our approach is also more cost-effective than existing methods that typically use data of more epigenetic marks. Our results suggest common epigenetic rules for defining functional states of CRMs in various cell/tissue types in humans and mice.
Subjects
Genome , Transcription Factors , Algorithms , Animals , Binding Sites , Genetic Epigenesis , Gene Expression Regulation , Humans , Mice , Transcription Factors/metabolism
ABSTRACT
BACKGROUND: The mouse is probably the most important model organism for studying mammalian biology and human diseases. A better understanding of the mouse genome will help in understanding the human genome, biology and diseases. However, despite recent progress, the characterization of the regulatory sequences in the mouse genome is still far from complete, limiting its use for understanding the regulatory sequences in the human genome. RESULTS: Here, by integrating binding peaks in ~ 9,000 transcription factor (TF) ChIP-seq datasets that cover 79.9% of the mouse mappable genome using an efficient pipeline, we were able to partition these binding peak-covered genome regions into a cis-regulatory module (CRM) candidate (CRMC) set and a non-CRMC set. The CRMCs contain 912,197 putative CRMs and 38,554,729 TF binding site (TFBS) islands, covering 55.5% and 24.4% of the mappable genome, respectively. The CRMCs tend to be under strong evolutionary constraints, indicating that they are likely cis-regulatory, while the non-CRMCs are largely selectively neutral, indicating that they are unlikely to be cis-regulatory. Based on evolutionary profiles of the genome positions, we further estimated that 63.8% and 27.4% of the mouse genome might code for CRMs and TFBSs, respectively. CONCLUSIONS: Validation using experimental data suggests that at least most of the CRMCs are authentic. Thus, this unprecedentedly comprehensive map of CRMs and TFBSs can be a good resource to guide experimental studies of regulatory genomes in mice and humans.
Subjects
Human Genome , Transcriptional Regulatory Elements , Humans , Mice , Animals , Transcriptional Regulatory Elements/genetics , Binding Sites/genetics , Protein Binding , Transcription Factors/genetics , Transcription Factors/metabolism , Mammals/genetics
ABSTRACT
MOTIVATION: The availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. RESULTS: We herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online.
Subjects
Chromatin Immunoprecipitation Sequencing , Software , Algorithms , Binding Sites , Chromatin Immunoprecipitation
ABSTRACT
Accumulating evidence indicates that transcription factor (TF) binding sites, or cis-regulatory elements (CREs), and their clusters termed cis-regulatory modules (CRMs) play a more important role than do gene-coding sequences in specifying complex traits in humans, including the susceptibility to common complex diseases. To fully characterize their roles in deriving the complex traits/diseases, it is necessary to annotate all CREs and CRMs encoded in the human genome. However, the current annotations of CREs and CRMs in the human genome are still very limited and mostly coarse-grained, as they often lack detailed information on the CREs in CRMs. Here, we integrated 620 TF ChIP-seq datasets produced by the ENCODE project for 168 TFs in 79 different cell/tissue types and predicted an unprecedentedly complete map of CREs in CRMs in the human genome at single nucleotide resolution. The map includes 305 912 CRMs containing a total of 1 178 913 CREs belonging to 736 unique TF binding motifs. The predicted CREs and CRMs tend to be subject to either purifying selection or positive selection, and thus are likely to be functional. Based on the results, we also examined the status of available ChIP-seq datasets for predicting the entire regulatory genome of humans.
Subjects
Base Sequence/genetics , Human Genome/genetics , Transcriptional Regulatory Elements/genetics , Nucleic Acid Regulatory Sequences/genetics , Algorithms , Binding Sites , Tumor Cell Line , Genetic Predisposition to Disease/genetics , HeLa Cells , Humans
ABSTRACT
BACKGROUND: Although DNA sequence plays a crucial role in establishing the unique epigenome of a cell type, little is known about the sequence determinants that lead to the unique epigenomes of different cell types produced during cell differentiation. To fill this gap, we employed two types of deep convolutional neural networks (CNNs) constructed for each of differentially related cell types and for each of histone marks measured in the cells, to learn the sequence determinants of various histone modification patterns in each cell type. RESULTS: We applied our models to four differentially related human CD4+ T cell types and six histone marks measured in each cell type. The cell models can accurately predict the histone marks in each cell type, while the mark models can also accurately predict the cell types based on a single mark. Sequence motifs learned by both the cell and mark models are highly similar to known binding motifs of transcription factors known to play important roles in CD4+ T cell differentiation. Both the unique histone mark patterns in each cell type and the different patterns of the same histone mark in different cell types are determined by a set of motifs with unique combinations. Interestingly, the level of sharing of motifs learned in the different cell models reflects the lineage relationships of the cells, while the level of sharing of motifs learned in the different histone mark models reflects their functional relationships. These models can also enable the prediction of the importance of learned motifs and their interactions in determining specific histone mark patterns in the cell types. CONCLUSION: Sequence determinants of various histone modification patterns in different cell types can be revealed by comparative analysis of motifs learned in the CNN models for multiple cell types and histone marks. The learned motifs are interpretable and may provide insights into the underlying molecular mechanisms of establishing the unique epigenomes in different cell types. Thus, our results support the hypothesis that DNA sequences ultimately determine the unique epigenomes of different cell types through their interactions with transcription factors, the epigenome remodeling system and extracellular cues during cell differentiation.
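The abstract does not describe the network architecture, so the following is only a minimal sketch of the general idea behind motif learning in sequence CNNs: DNA is one-hot encoded, and a first-layer filter is slid along the sequence so that positions matching the filter score highly. The function names, the example sequence and the hand-written filter for "TGAC" are hypothetical illustrations, not the authors' models or learned motifs.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T); unknown bases become all zeros."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def scan(seq, kernel):
    """Slide a (width x 4) filter along the sequence, as a 1D convolution without padding does."""
    x = one_hot(seq)
    width = len(kernel)
    scores = []
    for start in range(len(x) - width + 1):
        score = sum(
            x[start + i][j] * kernel[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores


# Hypothetical filter that rewards the 4-mer "TGAC"; a trained CNN would learn its weights from data.
kernel = one_hot("TGAC")
scores = scan("AATGACGT", kernel)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores[best])  # position of the strongest match and its score
```

In this toy case the filter scores 4.0 at offset 2, where "TGAC" occurs, and 0.0 elsewhere. Comparing filters learned by different models, as the abstract describes, amounts to comparing such weight matrices with known transcription factor binding motifs.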
Subjects
Cell Differentiation/genetics , Deep Learning , Epigenomics , CD4-Positive T-Lymphocytes/cytology , CD4-Positive T-Lymphocytes/metabolism , Cell Lineage , Conserved Sequence , Histone Code , Humans , Nucleotide Motifs/genetics
ABSTRACT
The VISTA enhancer database is a valuable resource for evaluating predicted enhancers in humans and mice. In addition to thousands of validated positive regions (VPRs) in the human and mouse genomes, the database also contains similar numbers of validated negative regions (VNRs). It was previously shown that the VPRs are on average half as long as predicted overlapping enhancers that are highly conserved, and it was hypothesized that the VPRs may be truncated forms of long bona fide enhancers. Here, it is shown that, like the VPRs, the VNRs are also under strong evolutionary constraints and overlap predicted enhancers in the genomes. The VNRs are also on average half as long as predicted overlapping enhancers that are highly conserved. Moreover, the VNRs and the VPRs display similar cell/tissue-specific modification patterns of key epigenetic marks of active enhancers. Furthermore, the VNRs and the VPRs show similar impact score spectra in in silico mutagenesis. These highly similar properties of the VPRs and the VNRs suggest that, like the VPRs, the VNRs may also be truncated forms of long bona fide enhancers.
ABSTRACT
Precision of transcription is critical because transcriptional dysregulation is disease causing. Traditional methods of transcriptional profiling are inadequate to elucidate the full spectrum of the transcriptome, particularly for longer and less abundant mRNAs. SHANK3 is one of the most common autism causative genes. Twenty-four Shank3-mutant animal lines have been developed for autism modeling. However, their preclinical validity has been questioned due to incomplete Shank3 transcript structure. We apply an integrative approach combining cDNA-capture and long-read sequencing to profile the SHANK3 transcriptome in humans and mice. We unexpectedly discover an extremely complex SHANK3 transcriptome. Specific SHANK3 transcripts are altered in Shank3-mutant mice and postmortem brain tissues from individuals with autism spectrum disorder. The enhanced SHANK3 transcriptome significantly improves the detection rate for potential deleterious variants from genomics studies of neuropsychiatric disorders. Our findings suggest that both deterministic and stochastic transcription of the genome is associated with SHANK family genes.
Subjects
Autistic Disorder , Nerve Tissue Proteins , Animals , Nerve Tissue Proteins/genetics , Nerve Tissue Proteins/metabolism , Humans , Mice , Autistic Disorder/genetics , Genetic Transcription , Microfilament Proteins/genetics , Microfilament Proteins/metabolism , Transcriptome/genetics , Autism Spectrum Disorder/genetics , Stochastic Processes , Male
ABSTRACT
Precision of transcription is critical because transcriptional dysregulation is disease causing. Traditional methods of transcriptional profiling are inadequate to elucidate the full spectrum of the transcriptome, particularly for longer and less abundant mRNAs. SHANK3 is one of the most common autism causative genes. Twenty-four Shank3 mutant animal lines have been developed for autism modeling. However, their preclinical validity has been questioned due to incomplete Shank3 transcript structure. We applied an integrative approach combining cDNA-capture and long-read sequencing to profile the SHANK3 transcriptome in humans and mice. We unexpectedly discovered an extremely complex SHANK3 transcriptome. Specific SHANK3 transcripts were altered in Shank3 mutant mice and postmortem brain tissues from individuals with ASD. The enhanced SHANK3 transcriptome significantly improved the detection rate for potential deleterious variants from genomics studies of neuropsychiatric disorders. Our findings suggest stochastic transcription of the genome associated with SHANK family genes.
ABSTRACT
Single-cell genomics is a powerful tool for studying heterogeneous tissues such as the brain. Yet, little is understood about how genetic variants influence cell-level gene expression. Addressing this, we uniformly processed single-nuclei, multi-omics datasets into a resource comprising >2.8M nuclei from the prefrontal cortex across 388 individuals. For 28 cell types, we assessed population-level variation in expression and chromatin across gene families and drug targets. We identified >550K cell-type-specific regulatory elements and >1.4M single-cell expression-quantitative-trait loci, which we used to build cell-type regulatory and cell-to-cell communication networks. These networks manifest cellular changes in aging and neuropsychiatric disorders. We further constructed an integrative model accurately imputing single-cell expression and simulating perturbations; the model prioritized ~250 disease-risk genes and drug targets with associated cell types.
ABSTRACT
Single-cell genomics is a powerful tool for studying heterogeneous tissues such as the brain. Yet little is understood about how genetic variants influence cell-level gene expression. Addressing this, we uniformly processed single-nuclei, multiomics datasets into a resource comprising >2.8 million nuclei from the prefrontal cortex across 388 individuals. For 28 cell types, we assessed population-level variation in expression and chromatin across gene families and drug targets. We identified >550,000 cell type-specific regulatory elements and >1.4 million single-cell expression quantitative trait loci, which we used to build cell-type regulatory and cell-to-cell communication networks. These networks manifest cellular changes in aging and neuropsychiatric disorders. We further constructed an integrative model accurately imputing single-cell expression and simulating perturbations; the model prioritized ~250 disease-risk genes and drug targets with associated cell types.
Subjects
Brain , Gene Regulatory Networks , Mental Disorders , Single-Cell Analysis , Humans , Aging/genetics , Brain/metabolism , Cell Communication/genetics , Chromatin/metabolism , Chromatin/genetics , Genomics , Mental Disorders/genetics , Prefrontal Cortex/metabolism , Prefrontal Cortex/physiology , Quantitative Trait Loci
ABSTRACT
Self-transcribing active regulatory region sequencing (STARR-seq) and its variants have been widely used to characterize enhancers. However, it has been reported that up to 87% of STARR-seq peaks are located in repressive chromatin and are not functional in the tested cells. While some of the STARR-seq peaks in repressive chromatin might be active in other cell/tissue types, some others might be false positives. Meanwhile, many active enhancers may not be identified by the current STARR-seq methods. Although methods have been proposed to mitigate systematic errors caused by the use of plasmid vectors, the artifacts due to the intrinsic limitations of current STARR-seq methods are still prevalent and the underlying causes are not fully understood. Based on predicted cis-regulatory modules (CRMs) and non-CRMs in the human genome as well as predicted active CRMs and non-active CRMs in a few human cell lines/tissues with STARR-seq data available, we reveal prevalent false positives and false negatives in STARR-seq peaks generated by major variants of STARR-seq methods and possible underlying causes. Our results will help design strategies to improve STARR-seq methods and interpret the results.
ABSTRACT
Neuronal activity can evoke the hemodynamic change that gives rise to the observed functional magnetic resonance imaging (fMRI) signal. These increases are also regulated by the resting blood volume fraction (V0) associated with regional vasculature. The activation locus detected by means of the change in the blood-oxygen-level-dependent (BOLD) signal intensity thereby may deviate from the actual active site due to varied vascular density in the cortex. Furthermore, conventional detection techniques evaluate the statistical significance of the hemodynamic observations. In this sense, the significance level relies not only upon the intensity of the BOLD signal change, but also upon the spatially inhomogeneous fMRI noise distribution that complicates the expression of the results. In this paper, we propose a quantitative strategy for the calibration of activation states to address these challenging problems. The quantitative assessment is based on the estimated neuronal efficacy parameter [Formula: see text] of the hemodynamic model in a voxel-by-voxel way. It is partly immune to the inhomogeneous fMRI noise by virtue of the strength of the optimization strategy. Moreover, it is easy to incorporate regional vascular information into the activation detection procedure. By combining MR angiography images, this approach can remove large vessel contamination in fMRI signals, and provide more accurate functional localization than classical statistical techniques for clinical applications. It is also helpful to investigate the nonlinear nature of the coupling between synaptic activity and the evoked BOLD response. The proposed method might be considered as a potentially useful complement to existing statistical approaches.
Subjects
Brain Mapping , Brain/blood supply , Brain/physiology , Magnetic Resonance Imaging , Computer Simulation , Humans , Computer-Assisted Image Processing , Neurological Models , Oxygen/blood , Photic Stimulation , Time Factors
ABSTRACT
More accurate and more complete predictions of cis-regulatory modules (CRMs) and constituent transcription factor (TF) binding sites (TFBSs) in genomes can facilitate characterizing functions of regulatory sequences. Here, we developed a database, Predicted Cis-Regulatory Modules (PCRMS) (https://cci-bioinfo.uncc.edu), that stores highly accurate and unprecedentedly complete maps of predicted CRMs and TFBSs in the human and mouse genomes. The web interface allows the user to browse CRMs and TFBSs in an organism, find the closest CRMs to a gene, search CRMs around a gene and find all TFBSs of a TF. PCRMS can be a useful resource for the research community to characterize regulatory genomes. Database URL: https://cci-bioinfo.uncc.edu/.
Subjects
Transcriptional Regulatory Elements , Transcription Factors , Animals , Binding Sites , Genome/genetics , Mice , Protein Binding , Transcriptional Regulatory Elements/genetics , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
cis-regulatory modules (CRMs) formed by clusters of transcription factor (TF) binding sites (TFBSs) are as important as coding sequences in specifying phenotypes of humans. It is essential to categorize all CRMs and constituent TFBSs in the genome. In contrast to most existing methods that predict CRMs in specific cell types using epigenetic marks, we predict a largely cell type agnostic but more comprehensive map of CRMs and constituent TFBSs in the genome by integrating all available TF ChIP-seq datasets. Our method is able to partition the 77.47% of genome regions covered by the 6092 available datasets into a CRM candidate (CRMC) set (56.84%) and a non-CRMC set (43.16%). Intriguingly, the predicted CRMCs are under strong evolutionary constraints, while the non-CRMCs are largely selectively neutral, strongly suggesting that the CRMCs are likely cis-regulatory, while the non-CRMCs are not. Our predicted CRMs are under stronger evolutionary constraints than three state-of-the-art predictions (GeneHancer, EnhancerAtlas and ENCODE phase 3) and substantially outperform them for recalling VISTA enhancers and non-coding ClinVar variants. We estimated that the human genome might encode about 1.47M CRMs and 68M TFBSs, comprising about 55% and 22% of the genome, respectively; for both of which, we predicted 80%. Therefore, the cis-regulatory genome appears to be more prevalent than originally thought.
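The percentages in this abstract can be related to one another with a short calculation. The sketch below converts the CRMC and non-CRMC shares of the covered regions into fractions of the whole genome, then compares the CRMC fraction with the estimated 55% CRM content. Treating the CRMC set as a proxy for the predicted CRM positions is an assumption made here for illustration; the abstract does not state how the 80% figure was derived.

```python
# Figures quoted in the abstract.
covered = 0.7747          # fraction of the genome covered by the 6092 datasets
crmc_share = 0.5684       # share of the covered regions assigned to the CRMC set
non_crmc_share = 0.4316   # share of the covered regions assigned to the non-CRMC set

# Convert shares of the covered regions into fractions of the whole genome.
crmc_genome = covered * crmc_share
non_crmc_genome = covered * non_crmc_share

# Abstract's estimate of the total CRM content of the genome.
estimated_crm_genome = 0.55

# Assumption (not stated in the abstract): the CRMC set approximates the predicted CRM positions.
recovered = crmc_genome / estimated_crm_genome

print(f"CRMC set: {crmc_genome:.2%} of the genome")
print(f"non-CRMC set: {non_crmc_genome:.2%} of the genome")
print(f"share of the estimated CRM content covered: {recovered:.0%}")
```

Under that assumption the CRMC set spans roughly 44% of the genome and the non-CRMC set roughly 33%, and 44% out of an estimated 55% is consistent with the statement that about 80% was predicted.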
ABSTRACT
Changes in BOLD signals are sensitive to the regional blood content associated with the vasculature, which is known as V0 in hemodynamic models. In previous studies involving dynamic causal modeling (DCM) which embodies the hemodynamic model to invert the functional magnetic resonance imaging signals into neuronal activity, V0 was arbitrarily set to a physiologically plausible value to overcome the ill-posedness of the inverse problem. It is interesting to investigate how the V0 value influences DCM. In this study we addressed this issue by using both synthetic and real experiments. The results show that the ability of DCM analysis to reveal information about brain causality depends critically on the assumed V0 value used in the analysis procedure. The choice of V0 value not only directly affects the strength of system connections, but more importantly also affects the inferences about the network architecture. Our analyses speak to a possible refinement of how the hemodynamic process is parameterized (i.e., by making V0 a free parameter); however, the conditional dependencies induced by a more complex model may create more problems than they solve. Obtaining more realistic V0 information in DCM can improve the identifiability of the system and would provide more reliable inferences about the properties of brain connectivity.
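One way to see why the assumed V0 matters is the output equation of the balloon model commonly used in hemodynamic modeling, in which the BOLD signal is the product of V0 and a term that depends on normalized venous volume v and deoxyhemoglobin content q. The sketch below assumes that widely cited form; the abstract does not spell out its equations, and the default values of V0 and E0 (resting oxygen extraction fraction) are illustrative choices, not values taken from the study.

```python
def bold_signal(v, q, V0=0.04, E0=0.34):
    """Balloon-model BOLD output for normalized venous volume v and deoxyhemoglobin content q.

    V0 (resting blood volume fraction) and E0 (resting oxygen extraction fraction)
    default to illustrative values only.
    """
    k1 = 7.0 * E0
    k2 = 2.0
    k3 = 2.0 * E0 - 0.2
    return V0 * (k1 * (1.0 - q) + k2 * (1.0 - q / v) + k3 * (1.0 - v))


# Same hemodynamic state, two assumed resting blood volume fractions.
low = bold_signal(v=1.05, q=0.95, V0=0.02)
high = bold_signal(v=1.05, q=0.95, V0=0.04)
print(low, high, high / low)
```

Because V0 enters as a multiplicative gain, doubling it doubles the predicted signal for the same hemodynamic state. A model fitted with a wrong fixed V0 therefore has to compensate through other parameters, such as neuronal efficacy or connection strengths, which is in line with the dependence on V0 that the abstract reports.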