Search | BVS Search Portal

Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay.

Shigaki, Dustin; Adato, Orit; Adhikari, Aashish N; Dong, Shengcheng; Hawkins-Hooker, Alex; Inoue, Fumitaka; Juven-Gershon, Tamar; Kenlay, Henry; Martin, Beth; Patra, Ayoti; Penzar, Dmitry D; Schubach, Max; Xiong, Chenling; Yan, Zhongxia; Boyle, Alan P; Kreimer, Anat; Kulakovskiy, Ivan V; Reid, John; Unger, Ron; Yosef, Nir; Shendure, Jay; Ahituv, Nadav; Kircher, Martin; Beer, Michael A.

Hum Mutat ; 40(9): 1280-1291, 2019 09.

Article in English | MEDLINE | ID: mdl-31106481

ABSTRACT

The integrative analysis of high-throughput reporter assays, machine learning, and profiles of epigenomic chromatin state in a broad array of cells and tissues has the potential to significantly improve our understanding of noncoding regulatory element function and its contribution to human disease. Here, we report results from the CAGI 5 regulation saturation challenge where participants were asked to predict the impact of nucleotide substitution at every base pair within five disease-associated human enhancers and nine disease-associated promoters. A library of mutations covering all bases was generated by saturation mutagenesis and altered activity was assessed in a massively parallel reporter assay (MPRA) in relevant cell lines. Reporter expression was measured relative to plasmid DNA to determine the impact of variants. The challenge was to predict the functional effects of variants on reporter expression. Comparative analysis of the full range of submitted prediction results identifies the most successful models of transcription factor binding sites, machine learning algorithms, and ways to choose among or incorporate diverse datatypes and cell-types for training computational models. These results have the potential to improve the design of future studies on more diverse sets of regulatory elements and aid the interpretation of disease-associated genetic variation.
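As a rough illustration of the readout described above, reporter expression relative to plasmid DNA can be summarized as a log2 RNA/DNA ratio, with a variant's effect taken as the difference from the reference sequence. The sketch below assumes plain read counts and a pseudocount; the function names and the numbers are hypothetical, and the abstract does not describe the actual analysis pipeline used in the challenge.

```python
import math

def log2_activity(rna_count, dna_count, pseudocount=1.0):
    """Log2 ratio of reporter RNA reads to plasmid DNA reads (hypothetical helper)."""
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))

def variant_effect(ref_counts, alt_counts, pseudocount=1.0):
    """Difference in log2 activity between the variant and the reference sequence."""
    ref = log2_activity(*ref_counts, pseudocount=pseudocount)
    alt = log2_activity(*alt_counts, pseudocount=pseudocount)
    return alt - ref

# Made-up (RNA, DNA) read counts, for illustration only.
effect = variant_effect(ref_counts=(799, 199), alt_counts=(199, 199))
```

A negative value means the variant lowers reporter activity relative to the reference; a positive value means it raises it.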

Subjects

DNA/chemistry; Epigenomics/methods; Point Mutation; Binding Sites; Cell Line; Chromatin/genetics; DNA/metabolism; Enhancer Elements, Genetic; Genetic Predisposition to Disease; Humans; Machine Learning; Promoter Regions, Genetic; Transcription Factors/metabolism

Investigation of the Role of PUFA Metabolism in Breast Cancer Using a Rank-Based Random Forest Algorithm.

Guryleva, Mariia V; Penzar, Dmitry D; Chistyakov, Dmitry V; Mironov, Andrey A; Favorov, Alexander V; Sergeeva, Marina G.

Cancers (Basel) ; 14(19), 2022 Sep 25.

Article in English | MEDLINE | ID: mdl-36230586

ABSTRACT

Polyunsaturated fatty acid (PUFA) metabolism is currently a focus of cancer research because PUFAs function as structural components of the membrane matrix, as fuel sources for energy production, and as sources of secondary messengers, the so-called oxylipins, which are important players in inflammatory processes. Although breast cancer (BC) is the leading cause of cancer death among women worldwide, no systematic study of PUFA metabolism as a system of interrelated processes in this disease has been carried out. Here, we implemented a Boruta-based feature selection algorithm to determine the list of the most important PUFA metabolism genes altered in breast cancer tissues compared with normal tissues. A rank-based Random Forest (RF) model was built on the selected gene list (33 genes) and applied to predict the cancer phenotype to ascertain the PUFA genes involved in carcinogenesis. It showed high performance in dichotomous classification (balanced accuracy of 0.94, ROC AUC of 0.99). We also retrieved a list of important PUFA genes (46 genes) that differed between breast cancer molecular subtypes. The balanced accuracy of the classification model built on the specified genes was 0.82, while the ROC AUC for the sensitivity analysis was 0.85. Specific patterns of PUFA metabolic changes were obtained for each molecular subtype of breast cancer. These results provide evidence that (1) PUFA metabolism genes are critical for the pathogenesis of breast cancer; (2) BC subtypes differ in the expression of PUFA metabolism genes; and (3) the lists of genes selected in the models are enriched with genes involved in the metabolism of signaling lipids.
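Two ingredients named in this abstract can be sketched in a few lines. The example assumes that "rank-based" means replacing each sample's expression values by their within-sample ranks before classification, which is an interpretation rather than something the abstract states; the helper names are hypothetical. Balanced accuracy is computed in its standard sense, as the mean of per-class recall.

```python
def rank_transform(values):
    """Replace each value by its within-sample rank (1 = lowest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is less sensitive to class imbalance than plain accuracy."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Made-up expression values for one sample and made-up labels, for illustration only.
sample_ranks = rank_transform([5.0, 1.0, 3.0, 3.0])
score = balanced_accuracy(
    ["tumor", "tumor", "tumor", "normal"],
    ["tumor", "tumor", "normal", "normal"],
)
```

Working with ranks rather than raw values makes the features comparable across samples measured on different scales, which is one common motivation for rank-based models.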

Landscape of allele-specific transcription factor binding in the human genome.

Abramov, Sergey; Boytsov, Alexandr; Bykova, Daria; Penzar, Dmitry D; Yevshin, Ivan; Kolmykov, Semyon K; Fridman, Marina V; Favorov, Alexander V; Vorontsov, Ilya E; Baulin, Eugene; Kolpakov, Fedor; Makeev, Vsevolod J; Kulakovskiy, Ivan V.

Nat Commun ; 12(1): 2751, 2021 05 12.

Article in English | MEDLINE | ID: mdl-33980847

ABSTRACT

Sequence variants in gene regulatory regions alter gene expression and contribute to phenotypes of individual cells and the whole organism, including disease susceptibility and progression. Single-nucleotide variants in enhancers or promoters may affect gene transcription by altering transcription factor binding sites. Differential transcription factor binding in heterozygous genomic loci provides a natural source of information on such regulatory variants. We present a novel approach to call allele-specific transcription factor binding events at single-nucleotide variants in ChIP-Seq data, taking into account the joint contribution of aneuploidy and local copy number variation, which is estimated directly from variant calls. We have conducted a meta-analysis of more than 7 thousand ChIP-Seq experiments and assembled a database of allele-specific binding events listing more than half a million entries at nearly 270 thousand single-nucleotide polymorphisms for several hundred human transcription factors and cell types. These polymorphisms are enriched for associations with phenotypes of medical relevance and often overlap eQTLs, making them candidates for causality by linking variants with molecular mechanisms. Specifically, there is a special class of switching sites, where different transcription factors preferably bind alternative alleles, thus revealing allele-specific rewiring of molecular circuitry.
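The role of copy number in calling allele-specific binding can be illustrated with a deliberately simplified test. The sketch below is not the statistical model of the paper, which the abstract does not specify; it swaps in an exact two-sided binomial test in which the expected fraction of reference-allele reads is set by assumed allele copy numbers. All names and read counts are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def allelic_imbalance_pvalue(ref_reads, alt_reads, ref_copies=1, alt_copies=1):
    """Two-sided exact binomial test against the copy-number-adjusted expected ref fraction."""
    n = ref_reads + alt_reads
    p = ref_copies / (ref_copies + alt_copies)
    observed = binom_pmf(ref_reads, n, p)
    # Sum the probabilities of all outcomes at most as likely as the observed one.
    total = sum(
        pr for pr in (binom_pmf(k, n, p) for k in range(n + 1))
        if pr <= observed * (1 + 1e-9)
    )
    return min(1.0, total)

# Made-up read counts: 20 reference reads and 10 alternative reads at one heterozygous site.
p_assuming_balanced = allelic_imbalance_pvalue(20, 10)        # assumes 1:1 allele copies
p_copy_adjusted = allelic_imbalance_pvalue(20, 10, 2, 1)      # assumes 2:1 allele copies
```

With these made-up counts, the 2:1 read ratio looks like moderate imbalance under a 1:1 assumption but is exactly what a 2:1 copy ratio would produce, which is why ignoring aneuploidy and local copy number can lead to false allele-specific calls.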

Subjects

Alleles; Genome, Human; Regulatory Sequences, Nucleic Acid/genetics; Transcription Factors/metabolism; Chromatin/metabolism; Databases, Genetic; Gene Dosage; Gene Expression Regulation/genetics; Genome-Wide Association Study; Humans; Nucleotide Motifs; Phenotype; Polymorphism, Single Nucleotide; Protein Binding; Quantitative Trait Loci

What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants.

Penzar, Dmitry D; Zinkevich, Arsenii O; Vorontsov, Ilya E; Sitnik, Vasily V; Favorov, Alexander V; Makeev, Vsevolod J; Kulakovskiy, Ivan V.

Front Genet ; 10: 1078, 2019.

Article in English | MEDLINE | ID: mdl-31737053

ABSTRACT

Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore the computational predictions of the effects of individual single-nucleotide variants on gene transcription measured in the massively parallel reporter assays, based on the data from the recent "Regulation Saturation" Critical Assessment of Genome Interpretation challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. Particularly, training on the sequence segments located next to the validation data results in the "information leakage" caused by the local context. This information leakage allows reproducing the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even obtaining notably better-than-random predictions using irrelevant genomic regions. Validation scenarios preventing such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
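A validation scheme that avoids the local-context leakage described above holds out entire regulatory regions, so that no variant from the test region or its neighborhood appears in training. The following is a minimal sketch of such a split, assuming each variant is tagged with the region it belongs to; the region names and the function are hypothetical and are not taken from the linked repository.

```python
from collections import defaultdict

def leave_one_region_out(variants):
    """Yield (held_out_region, train_idx, test_idx), holding out one whole region per fold."""
    by_region = defaultdict(list)
    for idx, (region, _position) in enumerate(variants):
        by_region[region].append(idx)
    for region in sorted(by_region):
        test_idx = by_region[region]
        # Training indices come only from the other regions.
        train_idx = sorted(i for r, ids in by_region.items() if r != region for i in ids)
        yield region, train_idx, test_idx

# Made-up (region, position) pairs, for illustration only.
variants = [
    ("enhancer_A", 1), ("enhancer_A", 2),
    ("promoter_B", 1), ("promoter_B", 2),
    ("promoter_C", 1),
]
folds = list(leave_one_region_out(variants))
```

By contrast, a random split over individual variants would place neighboring positions from the same region in both training and test sets, which is the setting in which the abstract reports inflated prediction quality.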
