ABSTRACT
Ancestrally admixed populations are underrepresented in genetic studies of complex diseases, which are still dominated by European-descent populations. This is relevant not only from a representation standpoint but also because of admixed populations' unique features, including enrichment for rare variants, whose effect sizes are disproportionately larger than those of common polymorphisms. Furthermore, results from these populations may be generalizable to other populations. The South African Cape Coloured (SACC) population is genetically admixed and has one of the highest prevalences of fetal alcohol spectrum disorders (FASD) worldwide. We profiled its admixture and examined associations between ancestry profiles and FASD outcomes using two longitudinal birth cohorts (N=308 mothers, 280 children) designed to examine effects of prenatal alcohol exposure on development. Participants were genotyped via MEGAex array to capture common and rare variants. Rare variants were overrepresented in our SACC cohorts, with numerous polymorphisms being monomorphic in other reference populations (e.g., ~30,000 and ~221,000 variants in gnomAD European and Asian populations, respectively). The cohorts showed global African (51%; Bantu and San); European (26%; Northern/Western); South Asian (18%); and East Asian (5%; largely Southern regions) ancestries. The cohorts exhibited high rates of homozygosity (6%), with regions of homozygosity harboring more deleterious variants when lying within African local-ancestry genomic segments. Both maternal and child ancestry profiles were associated with higher FASD risk, and maternal and child ancestry-by-prenatal alcohol exposure interaction effects were seen on child cognition. Our findings indicate that the SACC population may be a valuable asset to identify novel disease-associated genetic loci for FASD and other diseases.
Subjects
Fetal Alcohol Spectrum Disorders , Humans , Fetal Alcohol Spectrum Disorders/genetics , Fetal Alcohol Spectrum Disorders/epidemiology , Female , South Africa/epidemiology , Male , Pregnancy , Black People/genetics , Adult , Child , Polymorphism, Single Nucleotide , Genetic Predisposition to Disease , White People/genetics
ABSTRACT
Genetic variants in ABCA7, an Alzheimer's disease (AD)-associated gene, elevate AD risk, yet its functional relevance to the etiology is unclear. We generated a CRISPR-Cas9-mediated abca7 knockout zebrafish to explore ABCA7's role in AD. Single-cell transcriptomics in heterozygous abca7+/- knockout combined with Aβ42 toxicity revealed that ABCA7 is required for the expression of neuropeptide Y (NPY), brain-derived neurotrophic factor (BDNF), and nerve growth factor receptor (NGFR), which are crucial for synaptic integrity, astroglial proliferation, and microglial prevalence. Impaired NPY induction decreased BDNF and synaptic density, both of which could be rescued with ectopic NPY. In induced pluripotent stem cell-derived human neurons exposed to Aβ42, ABCA7-/- suppresses NPY. Clinical data showed reduced NPY in AD correlated with elevated Braak stages, genetic variants in NPY associated with AD, and epigenetic changes in NPY, NGFR, and BDNF promoters linked to ABCA7 variants. Therefore, ABCA7-dependent NPY signaling via BDNF-NGFR maintains synaptic integrity, implicating its impairment in increased AD risk through reduced brain resilience.
Subjects
Alzheimer Disease , Brain-Derived Neurotrophic Factor , Neuropeptide Y , Signal Transduction , Zebrafish , Brain-Derived Neurotrophic Factor/metabolism , Brain-Derived Neurotrophic Factor/genetics , Animals , Alzheimer Disease/metabolism , Alzheimer Disease/genetics , Alzheimer Disease/pathology , Neuropeptide Y/metabolism , Neuropeptide Y/genetics , Humans , Synapses/metabolism , Synapses/pathology , ATP-Binding Cassette Transporters/genetics , ATP-Binding Cassette Transporters/metabolism , Receptors, Nerve Growth Factor/genetics , Receptors, Nerve Growth Factor/metabolism , Zebrafish Proteins/genetics , Zebrafish Proteins/metabolism , Neurons/metabolism , Neurons/pathology , Amyloid beta-Peptides/metabolism , Amyloid beta-Peptides/genetics
ABSTRACT
The crab-eating macaque (Macaca fascicularis) and rhesus macaque (M. mulatta) are widely studied nonhuman primates in biomedical and evolutionary research. Despite their significance, the current understanding of the complex genomic structure in macaques and the differences between species requires substantial improvement. Here, we present a complete genome assembly of a crab-eating macaque and 20 haplotype-resolved macaque assemblies to investigate the complex regions and major genomic differences between species. Segmental duplication in macaques is ~42% lower, while centromeres are ~3.7 times longer than those in humans. The characterization of ~2 Mbp fixed genetic variants and ~240 Mbp complex loci highlights potential associations with metabolic differences between the two macaque species (e.g., CYP2C76 and EHBP1L1). Additionally, hundreds of alternative splicing differences show post-transcriptional regulation divergence between these two species (e.g., PNPO). We also characterize 91 large-scale genomic differences between macaques and humans at single-base-pair resolution and highlight their impact on gene regulation in primate evolution (e.g., FOLH1 and PIEZO2). Finally, population genetics recapitulates macaque speciation and selective sweeps, highlighting the potential genetic basis of reproduction and tail phenotype differences (e.g., STAB1, SEMA3F, and HOXD13). In summary, the integrated analysis of genetic variation and population genetics in macaques greatly enhances our comprehension of lineage-specific phenotypes, adaptation, and primate evolution, thereby improving their biomedical applications in human diseases.
ABSTRACT
Ancestrally admixed populations are underrepresented in genetic studies of complex diseases, which are still dominated by European-descent populations. This is relevant not only from a representation standpoint but also because of admixed populations' unique features, including enrichment for rare variants, whose effect sizes are disproportionately larger than those of common polymorphisms. Furthermore, results from these populations may be generalizable to other populations. The South African Cape Coloured (SACC) population is genetically admixed, with one of the highest prevalences of fetal alcohol spectrum disorders (FASD) worldwide. We profiled its admixture and examined associations between ancestry profiles and FASD outcomes using two longitudinal birth cohorts (N=308 mothers, 280 children) designed to examine effects of prenatal alcohol exposure on development. Participants were genotyped via MEGA-ex array to capture common and rare variants. Rare variants were overrepresented in our SACC cohorts, with numerous polymorphisms being monomorphic in other reference populations (e.g., ~30,000 and ~221,000 variants in gnomAD European and Asian populations, respectively). The cohorts showed global African (51%; Bantu and San); European (26%; Northern/Western); South Asian (18%); and East Asian (5%; largely Southern regions) ancestries. The cohorts exhibited high rates of homozygosity (6%), with regions of homozygosity harboring more deleterious variants when lying within African local-ancestry genomic segments. Both maternal and child ancestry profiles were associated with FASD risk and altered severity of prenatal alcohol exposure-related cognitive deficits in the child. Our findings indicate that the SACC population may be a valuable asset to identify novel disease-associated genetic loci for FASD and other diseases.
ABSTRACT
Understanding gene expression variations between species is pivotal for deciphering the evolutionary diversity in phenotypes. Rhesus macaques (Macaca mulatta, MMU) and crab-eating macaques (M. fascicularis, MFA) serve as crucial nonhuman primate biomedical models with different phenotypes. To date, however, large-scale comparative transcriptomics between these two species has not been fully explored. Here, we conducted systematic comparisons utilizing newly sequenced RNA-seq data from 84 samples (41 MFA samples and 43 MMU samples) encompassing 14 common tissues. Our findings revealed a small fraction of genes (3.7%) with differential expression between the two species, as well as 36.5% of genes with tissue-specific expression in both macaques. Comparison of gene expression between macaques and humans indicated that 22.6% of orthologous genes displayed differential expression in at least two tissues. Moreover, 19.41% of genes that overlapped with macaque-specific structural variants showed differential expression between humans and macaques. Of these, the FAM220A gene exhibited elevated expression in humans compared to macaques due to lineage-specific duplication. In summary, this study presents a large-scale transcriptomic comparison between MMU and MFA and between macaques and humans. The discovery of gene expression variations not only enhances the biomedical utility of macaque models but also contributes to the wider field of primate genomics.
Subjects
Genomics , Transcriptome , Humans , Animals , Macaca mulatta/genetics , Macaca fascicularis/genetics , Gene Expression Profiling/veterinary
ABSTRACT
MOTIVATION: Transcription factor binding sites (TFBS) are regulatory elements that have significant impact on transcription regulation and cell fate determination. Canonical motifs, biological experiments, and computational methods have made it possible to discover TFBS. However, most existing in silico TFBS prediction models are solely DNA-based and are trained and used within the same biosample, so they fail to infer TFBS in experimentally unexplored biosamples. RESULTS: Here, we propose TFBS prediction by modified TransFormer (TFTF), a multimodal deep language architecture which integrates multiomics information in epigenetic studies. In comparison to existing computational techniques, TFTF has state-of-the-art accuracy and is also the first approach to accurately perform genome-wide detection of cell-type- and species-specific TFBS in experimentally unexplored biosamples. Compared to peak calling methods, TFTF consistently discovers true TFBS in a threshold-tuning-free way, with higher recall. The underlying mechanism of TFTF reveals greater attention to the targeted TF's motif region in TFBS, and general attention to the entire peak region in non-TFBS. TFTF can benefit from the integration of broader and more diverse data for improvement and can be applied to multiple epigenetic scenarios. AVAILABILITY AND IMPLEMENTATION: We provide a web server (https://tftf.ibreed.cn/) for users to use the TFTF model. Users can train the TFTF model and discover TFBS with their own data.
Subjects
Genome , Multiomics , Binding Sites , Protein Binding , Transcription Factors/metabolism , Computational Biology/methods
ABSTRACT
Alzheimer's disease (AD) remains a complex challenge characterized by cognitive decline and memory loss. Genetic variations have emerged as crucial players in the etiology of AD, raising hope for a better understanding of the disease mechanisms; yet the specific mechanisms of action of those genetic variants remain uncertain. Animal models with reminiscent disease pathology could uncover previously uncharacterized roles of these genes. Using CRISPR/Cas9 gene editing, we generated a knockout model for abca7, orthologous to human ABCA7, an established AD-risk gene. The abca7+/- zebrafish showed reduced astroglial proliferation, synaptic density, and microglial abundance in response to amyloid beta 42 (Aβ42). Single-cell transcriptomics revealed abca7-dependent neuronal and glial cellular crosstalk through neuropeptide Y (NPY) signaling. The abca7 knockout reduced the expression of npy, bdnf and ngfra, which are required for synaptic integrity and astroglial proliferation. With clinical data in humans, we showed that reduced NPY in AD correlates with elevated Braak stage, predicted a regulatory interaction between NPY and BDNF, identified genetic variants in NPY associated with AD, found segregation of variants in ABCA7, BDNF and NGFR in AD families, and discovered epigenetic changes in the promoter regions of NPY, NGFR and BDNF in humans with specific single nucleotide polymorphisms in ABCA7. These results suggest that ABCA7-dependent NPY signaling is required for synaptic integrity, the impairment of which generates a risk factor for AD through compromised brain resilience.
ABSTRACT
OBJECTIVE: The performance of popular genome-wide association study (GWAS) models has not yet been examined in a consistent manner under the scenario of genetic admixture, which introduces several challenging aspects: heterogeneity of minor allele frequency (MAF), wide spectrum of case-control ratio, varying effect sizes, etc. METHODS: We generated a cohort of synthetic individuals (N = 19 234) that simulates (i) a large sample size; (ii) two-way admixture (Native American and European ancestry) and (iii) a binary phenotype. We then benchmarked three popular GWAS tools [generalized linear mixed model association test (GMMAT), scalable and accurate implementation of generalized mixed model (SAIGE) and Tractor] by computing inflation factors and power calculations under different MAFs, case-control ratios, sample sizes and varying ancestry proportions. We also employed a cohort of Peruvians (N = 249) to further examine the performance of the tested models on (i) real genetic and phenotype data and (ii) small sample sizes. RESULTS: In the synthetic cohort, SAIGE performed better than GMMAT and Tractor in terms of type-I error rate, especially under severely unbalanced case-control ratios. In contrast, power analysis identified Tractor as the best method to pinpoint ancestry-specific causal variants, although it showed decreased power when the effect size displayed limited heterogeneity between ancestries. In the Peruvian cohort, only Tractor identified two suggestive loci (P-value $\le 1 \times 10^{-5}$) associated with Native American ancestry. DISCUSSION: The current study illustrates best practice and limitations for available GWAS tools under the scenario of genetic admixture. Incorporating local ancestry in GWAS analyses boosts power, although careful consideration of complex scenarios (small sample sizes, imbalanced case-control ratio, MAF heterogeneity) is needed.
Subjects
Benchmarking , Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Gene Frequency , Phenotype , Sample Size , Polymorphism, Single Nucleotide
ABSTRACT
Defect regulation and the construction of a heterojunction structure are effective strategies to improve the catalytic activity of catalysts. In this work, the rapid conversion of CuO to Cu2(OH)3NO3 was achieved by fixing nitrogen in air as NO3- using dielectric barrier discharge (DBD) plasma. This approach resulted in the successful synthesis of a CuO/Cu2(OH)3NO3 nanosheet heterostructure. Notably, the samples prepared using plasma are thinner and have a larger specific surface area. Importantly, oxygen vacancies are introduced, simultaneously forming heterojunction interfaces within the CuO/Cu2(OH)3NO3 structure. CuO/Cu2(OH)3NO3 prepared using plasma degraded 96% of methyl orange within 8 min in the dark. The degradation rate is 81 and 23 times that of CuO and Cu2(OH)3NO3 prepared using hydrothermal methods, respectively. The high catalytic activity is attributed to the large specific surface area, the abundance of active sites, and the synergy between oxygen vacancies and the strong heterojunction interfacial interactions, which accelerate the transfer of electrons and the production of reactive oxygen species (•O2- and •OH). The mechanism of plasma preparation was proposed on the basis of microstructure characterization and online mass spectrometry, which indicated that gas etching, gas expansion, and the repulsive force of electrons play key roles in plasma exfoliation.
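As a rough worked example of what "96% within 8 min" would imply for a rate constant: the abstract does not state the kinetic model, so pseudo-first-order kinetics is assumed here purely for illustration, with $C_0$ and $C_t$ (initial and remaining methyl orange concentration) and $k$ (apparent rate constant) as illustrative symbols.

```latex
% Assumption: pseudo-first-order decay, C_t = C_0 e^{-kt} (not stated in the abstract)
\ln\frac{C_0}{C_t} = k\,t
\quad\Longrightarrow\quad
k = \frac{1}{t}\,\ln\frac{1}{1 - 0.96}
  = \frac{\ln 25}{8\ \text{min}}
  \approx 0.40\ \text{min}^{-1}
```

Under that same assumption, the reported 81-fold and 23-fold ratios would be read as ratios of apparent rate constants, although the abstract does not say how the degradation rates were compared.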
ABSTRACT
Objective: The performance of popular genome-wide association study (GWAS) models has not yet been examined in a consistent manner under the scenario of genetic admixture, which introduces several challenging aspects such as heterogeneity of minor allele frequency (MAF), a wide spectrum of case-control ratios, and varying effect sizes. Methods: We generated a cohort of synthetic individuals (N=19,234) that simulates 1) a large sample size; 2) two-way admixture [Native American-European ancestry] and 3) a binary phenotype. We then examined the inflation factors produced by three popular GWAS tools: GMMAT, SAIGE, and Tractor. We also computed power under different MAFs, case-control ratios, and varying ancestry percentages. Then, we employed a cohort of Peruvians (N=249) to further examine the performance of the tested models on 1) real genetic data and 2) small sample sizes. Finally, we validated these findings using an independent Peruvian cohort (N=109) included in the 1000 Genomes Project (1000G). Results: In the synthetic cohort, SAIGE performed better than GMMAT and Tractor in terms of type-I error rate, especially under severely unbalanced case-control ratios. In contrast, power analysis identified Tractor as the best method to pinpoint ancestry-specific causal variants, although it showed decreased power when no adequate heterogeneity of the true effect sizes was simulated between ancestries. The real Peruvian data showed that Tractor is severely affected by small sample sizes and produced severely inflated statistics, which we replicated in the 1000G Peruvian cohort. Discussion: The current study illustrates the limitations of available GWAS tools under different scenarios of genetic admixture. We urge caution when interpreting results under complex population scenarios.
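The inflation factors mentioned above are commonly summarized as the genomic inflation factor, the ratio of the median observed 1-df chi-square statistic to its expected median under the null. The sketch below shows that standard calculation from a list of P-values; it is not the authors' pipeline, and the function name `genomic_inflation_factor` is a hypothetical helper introduced only for this illustration.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.4549),
# obtained as the squared 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation_factor(p_values):
    """Hypothetical helper: lambda_GC from two-sided association P-values."""
    chi2_stats = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError(f"P-value out of range (0, 1]: {p!r}")
        # A two-sided P-value maps to a 1-df chi-square statistic via z^2,
        # where z is the standard normal quantile at p / 2.
        z = _STD_NORMAL.inv_cdf(p / 2.0)
        chi2_stats.append(z * z)
    return median(chi2_stats) / CHI2_1DF_MEDIAN


if __name__ == "__main__":
    # Evenly spaced P-values mimic a well-calibrated null distribution.
    n = 1001
    null_like = [(i + 0.5) / n for i in range(n)]
    print(genomic_inflation_factor(null_like))
```

Values near 1 suggest well-calibrated test statistics, while values well above 1 indicate inflation of the kind the abstract reports for small samples.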
ABSTRACT
Fine-mapping is commonly used to identify putative causal variants at genome-wide significant loci. Here we propose a Bayesian model for fine-mapping that has several advantages over existing methods, including flexible specification of the prior distribution of effect sizes, joint modeling of summary statistics and functional annotations and accounting for discrepancies between summary statistics and external linkage disequilibrium in meta-analyses. Using simulations, we compare performance with commonly used fine-mapping methods and show that the proposed model has higher power and lower false discovery rate (FDR) when including functional annotations, and higher power, lower FDR and higher coverage for credible sets in meta-analyses. We further illustrate our approach by applying it to a meta-analysis of Alzheimer's disease genome-wide association studies where we prioritize putatively causal variants and genes.
Subjects
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Genome-Wide Association Study/methods , Bayes Theorem , Linkage Disequilibrium
ABSTRACT
AIMS: Insomnia is a common sleep disorder that occurs widely in the older population, especially among older women. This study aims to investigate the associations of accelerometer-measured physical activity (PA) and sedentary behavior (SB) patterns with insomnia in older Chinese women. METHODS: Cross-sectional data derived from the baseline survey of the Physical Activity and Health in Older Women Study were analyzed for 1112 older women aged 60 to 70. Insomnia was evaluated using the Athens Insomnia Scale. PA and SB patterns were measured with an accelerometer. Multivariate logistic regression was used to investigate associations of PA and SB patterns with insomnia. RESULTS: All SB variables were positively associated with insomnia, with multivariate-adjusted ORs of 1.24, 1.19 and 1.19 for a 60-min increase in total SB, 10-min-bouted SB and 30-min-bouted SB, respectively. Total and bouted light-intensity PA (LPA) were negatively associated with insomnia, with multivariate-adjusted ORs of 0.90 and 0.89 for a 30-min increase in total LPA and bouted LPA, respectively. CONCLUSION: Avoiding SB and encouraging LPA engagement may hold promise in preventing insomnia and promoting sleep in the older population. Future studies with experimental designs and follow-up periods are warranted to establish causal associations.
Subjects
Sleep Initiation and Maintenance Disorders , Humans , Female , Aged , Self Report , Sleep Initiation and Maintenance Disorders/epidemiology , Sedentary Behavior , Cross-Sectional Studies , Exercise , Accelerometry
ABSTRACT
Many studies have shown that urban workers may be more willing to accept coronavirus disease (COVID-19) vaccination than their rural counterparts. As Omicron spreads globally, COVID-19 booster vaccination has been acknowledged as the primary strategy against this variant. In this study, we identify factors related to the willingness of workers in megacities to take vaccine booster shots and the main reasons behind that willingness. The survey was conducted in megacity H in eastern China, and a total of 1227 employees from different industries were interviewed. The study examines the relationship between various characteristics (both economic and non-economic) of urban employees and their intention to accept COVID-19 booster shots. The survey results show that some characteristics, namely work organization, vaccine knowledge, and social network, affect their intention to take COVID-19 vaccine booster shots. Urban employees with a strong work organization, a high degree of vaccine knowledge, and dense social capital are more likely to receive booster injections than other employees. Therefore, work organization, vaccine knowledge, and social networks provide fundamental entry points for designing booster vaccination strategies to increase the acceptance of COVID-19 vaccines among employees in megacities.
Subjects
COVID-19 Vaccines , COVID-19 , COVID-19/epidemiology , COVID-19/prevention & control , China , Humans , Immunization, Secondary , SARS-CoV-2 , Vaccination
ABSTRACT
SCL/TAL1 interrupting locus (STIL) is a ciliary-related gene involved in regulating the cell cycle and duplication of centrioles in dividing cells. STIL has been found to be dysregulated in multiple cancers and to drive carcinogenesis. However, the molecular mechanisms and biological functions of STIL in cancers remain ambiguous. Here, we systematically analyzed the genetic alterations, molecular mechanisms, and clinical relevance of STIL across >10,000 samples representing 33 cancer types in The Cancer Genome Atlas (TCGA) dataset. We found that STIL expression is up-regulated in most cancer types compared with their adjacent normal tissues. The dysregulation of STIL expression was affected by copy number variation, mutation, and DNA methylation. High STIL expression was associated with worse outcomes and promoted the progression of cancers. Gene Ontology (GO) enrichment analysis and Gene Set Variation Analysis (GSVA) further revealed that STIL is involved in the cell cycle progression, mitotic spindle, G2M checkpoint, and E2F targets pathways across cancer types. STIL expression was negatively correlated with multiple genes taking part in ciliogenesis and was positively correlated with several genes that participate in centrosomal duplication or cilia degradation. Moreover, STIL silencing could promote primary cilia formation and inhibit cell cycle protein expression in prostate and kidney cancer cell lines. The phenotype and protein expression alterations due to STIL silencing could be reversed by IFT88 silencing in cancer cells. These results revealed that STIL could regulate the cell cycle through primary cilia in tumor cells. In summary, our results revealed the importance of STIL in cancers. Targeting STIL might be a novel therapeutic approach for cancers.
ABSTRACT
This study analyzed the comprehensive impact of renewable energy investment on carbon emissions in China. To achieve this, a nonparametric additive regression model was built. Using the STIRPAT model, we considered six influencing factors: economic growth, industrialization level, urbanization level, population aging, trade openness, and renewable energy investment. This enabled the exploration of the existence, direction, and intensity of the impact of renewable energy investment on carbon emissions. The results of the linear component of the model showed that renewable energy investment can slightly reduce carbon emissions. The results of the nonlinear component of the model showed that the impacts of renewable energy investment on carbon emissions were inconsistent at different stages of the investment. In the early stage, the renewable energy investment can increase carbon emissions. In the middle stage, the renewable energy investment begins to play a role in reducing emissions. In the later stage, renewable energy investment may be associated with increased carbon emissions again. The relationship between carbon emissions and the other five influencing factors can be represented by an inverted U-shaped curve, a U-shaped curve, or a slow rising curve. The results above provide useful references to adjust renewable energy investment and reduce carbon emissions.
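For orientation, the classical STIRPAT form and one way it could be extended with linear and nonparametric parts are sketched below. The abstract does not give the authors' exact equation, so the second expression is an assumed illustration only: $C_t$ (carbon emissions), $X_{jt}$ (the six influencing factors), $\beta_j$ (linear coefficients) and $f_j$ (smooth unknown functions) are symbols introduced here.

```latex
% Classical STIRPAT: impact I as a function of population P, affluence A, technology T
I = a\,P^{b}A^{c}T^{d}\,e
\quad\Longrightarrow\quad
\ln I = \ln a + b\ln P + c\ln A + d\ln T + \ln e

% Assumed additive extension with a linear and a nonlinear component per factor
\ln C_t = \alpha + \sum_{j=1}^{6}\beta_j \ln X_{jt}
        + \sum_{j=1}^{6} f_j\!\left(\ln X_{jt}\right) + \varepsilon_t
```

In such a form, the sign of $\beta_j$ for renewable energy investment would correspond to the "linear component" result described above, while the shape of the fitted $f_j$ would carry the stage-dependent effects.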
ABSTRACT
The rampant spread of COVID-19, an infectious disease caused by SARS-CoV-2, all over the world has led to millions of deaths and devastated social, financial and political entities around the world. Without an existing effective medical therapy, vaccines are urgently needed to avoid the spread of this disease. In this study, we propose an in silico deep learning approach for prediction and design of a multi-epitope vaccine (DeepVacPred). By combining in silico immunoinformatics and deep neural network strategies, the DeepVacPred computational framework directly predicts 26 potential vaccine subunits from the available SARS-CoV-2 spike protein sequence. We further use in silico methods to investigate the linear B-cell epitopes, cytotoxic T lymphocyte (CTL) epitopes and helper T lymphocyte (HTL) epitopes in the 26 subunit candidates and identify the best 11 of them to construct a multi-epitope vaccine for the SARS-CoV-2 virus. The human population coverage, antigenicity, allergenicity, toxicity, physicochemical properties and secondary structure of the designed vaccine are evaluated via state-of-the-art bioinformatic approaches, showing good quality of the designed vaccine. The 3D structure of the designed vaccine is predicted, refined and validated by in silico tools. Finally, we optimize and insert the codon sequence into a plasmid to ensure the cloning and expression efficiency. In conclusion, this proposed artificial intelligence (AI)-based vaccine discovery framework accelerates the vaccine design process and constructs a 694aa multi-epitope vaccine containing 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes, which is promising to fight the SARS-CoV-2 viral infection and can be further evaluated in clinical studies. Moreover, we trace the RNA mutations of SARS-CoV-2 and ensure that the designed vaccine can tackle the recent RNA mutations of the virus.
Subjects
COVID-19 Vaccines , Deep Learning , SARS-CoV-2/immunology , Spike Glycoprotein, Coronavirus/immunology , Allergens , COVID-19/prevention & control , COVID-19 Vaccines/adverse effects , COVID-19 Vaccines/chemistry , COVID-19 Vaccines/immunology , COVID-19 Vaccines/toxicity , Codon Usage , Computational Biology , Drug Design , Epitopes, B-Lymphocyte/immunology , Epitopes, T-Lymphocyte/immunology , Humans , Immunogenicity, Vaccine , Models, Molecular , Molecular Docking Simulation , Molecular Dynamics Simulation , Mutation , Protein Conformation , RNA, Viral , SARS-CoV-2/chemistry , SARS-CoV-2/genetics , Solubility , Spike Glycoprotein, Coronavirus/chemistry , Spike Glycoprotein, Coronavirus/genetics , T-Lymphocytes, Cytotoxic/immunology , T-Lymphocytes, Helper-Inducer/immunology , Vaccines, Subunit/chemistry , Vaccines, Subunit/immunology
ABSTRACT
MOTIVATION: Predicting regulatory effects of genetic variants is a challenging but important problem in functional genomics. Given the relatively low sensitivity of functional assays, and the pervasiveness of class imbalance in functional genomic data, popular statistical prediction models can sharply underestimate the probability of a regulatory effect. We describe here the presence-only model (PO-EN), a type of semi-supervised model, to predict regulatory effects of genetic variants at sequence-level resolution in a context of interest by integrating a large number of epigenetic features and massively parallel reporter assays (MPRAs). RESULTS: Using experimental data from a variety of MPRAs, we show that the presence-only model produces better calibrated predicted probabilities and has increased accuracy relative to state-of-the-art prediction models. Furthermore, we show that the predictions based on pre-trained PO-EN models are useful for prioritizing functional variants among candidate eQTLs and significant SNPs at GWAS loci. In particular, for the costimulatory locus, associated with multiple autoimmune diseases, we show evidence of a regulatory variant residing in an enhancer 24.4 kb downstream of CTLA4, with evidence from capture Hi-C of interaction with CTLA4. Furthermore, the risk allele of the regulatory variant is on the same risk-increasing haplotype as a functional coding variant in exon 1 of CTLA4, suggesting that the regulatory variant acts jointly with the coding variant, leading to increased risk of disease. AVAILABILITY: The presence-only model is implemented in the R package 'PO.EN', freely available on CRAN. A vignette with a detailed demonstration of the proposed PO-EN model can be found on GitHub at https://github.com/Iuliana-Ionita-Laza/PO.EN/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.