Search | BVS Aleitamento Materno

1.

Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences.

Kumar, Sushant; Warrell, Jonathan; Li, Shantao; McGillivray, Patrick D; Meyerson, William; Salichos, Leonidas; Harmanci, Arif; Martinez-Fundichely, Alexander; Chan, Calvin W Y; Nielsen, Morten Muhlig; Lochovsky, Lucas; Zhang, Yan; Li, Xiaotong; Lou, Shaoke; Pedersen, Jakob Skou; Herrmann, Carl; Getz, Gad; Khurana, Ekta; Gerstein, Mark B.

Cell ; 180(5): 915-927.e16, 2020 03 05.

Article in English | MEDLINE | ID: mdl-32084333

ABSTRACT

The dichotomous model of "drivers" and "passengers" in cancer posits that only a few mutations in a tumor strongly affect its progression, with the remaining ones being inconsequential. Here, we leveraged the comprehensive variant dataset from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) project to demonstrate that, in addition to the dichotomy of high- and low-impact variants, there is a third group of medium-impact putative passengers. Moreover, we also found that molecular impact correlates with subclonal architecture (i.e., early versus late mutations), and different signatures encode for mutations with divergent impact. Furthermore, we adapted an additive-effects model from complex-trait studies to show that the aggregated effect of putative passengers, including undetected weak drivers, provides significant additional power (~12% additive variance) for predicting cancerous phenotypes, beyond PCAWG-identified driver mutations. Finally, this framework allowed us to estimate the frequency of potential weak-driver mutations in PCAWG samples lacking any well-characterized driver alterations.

Subjects

Human Genome/genetics , Genomics/methods , Mutation/genetics , Neoplasms/genetics , DNA Mutational Analysis/methods , Disease Progression , Humans , Neoplasms/pathology , Whole Genome Sequencing

2.

The Extracellular RNA Communication Consortium: Establishing Foundational Knowledge and Technologies for Extracellular RNA Research.

Das, Saumya; Ansel, K Mark; Bitzer, Markus; Breakefield, Xandra O; Charest, Alain; Galas, David J; Gerstein, Mark B; Gupta, Mihir; Milosavljevic, Aleksandar; McManus, Michael T; Patel, Tushar; Raffai, Robert L; Rozowsky, Joel; Roth, Matthew E; Saugstad, Julie A; Van Keuren-Jensen, Kendall; Weaver, Alissa M; Laurent, Louise C.

Cell ; 177(2): 231-242, 2019 04 04.

Article in English | MEDLINE | ID: mdl-30951667

ABSTRACT

The Extracellular RNA Communication Consortium (ERCC) was launched to accelerate progress in the new field of extracellular RNA (exRNA) biology and to establish whether exRNAs and their carriers, including extracellular vesicles (EVs), can mediate intercellular communication and be utilized for clinical applications. Phase 1 of the ERCC focused on exRNA/EV biogenesis and function, discovery of exRNA biomarkers, development of exRNA/EV-based therapeutics, and construction of a robust set of reference exRNA profiles for a variety of biofluids. Here, we present progress by ERCC investigators in these areas, and we discuss collaborative projects directed at development of robust methods for EV/exRNA isolation and analysis and tools for sharing and computational analysis of exRNA profiling data.

Subjects

Cell-Free Nucleic Acids/genetics , Cell-Free Nucleic Acids/metabolism , Extracellular Vesicles/genetics , Biomarkers , Humans , Knowledge Bases , MicroRNAs/genetics , RNA/genetics

3.

Functional genomics data: privacy risk assessment and technological mitigation.

Gürsoy, Gamze; Li, Tianxiao; Liu, Susanna; Ni, Eric; Brannon, Charlotte M; Gerstein, Mark B.

Nat Rev Genet ; 23(4): 245-258, 2022 04.

Article in English | MEDLINE | ID: mdl-34759381

ABSTRACT

The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values. Here, we review these challenges and present potential solutions for mitigating privacy risks while allowing broad data dissemination and analysis.

Subjects

Genetic Privacy , Privacy , Genomics , High-Throughput Nucleotide Sequencing , Risk Assessment

4.

Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases.

Emani, Prashant S; Geradi, Maya N; Gürsoy, Gamze; Grasty, Monica R; Miranker, Andrew; Gerstein, Mark B.

Genome Res ; 2023 Dec 14.

Article in English | MEDLINE | ID: mdl-38097386

ABSTRACT

Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ~5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ~20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ~30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.

5.

Perspectives on ENCODE.

Snyder, Michael P; Gingeras, Thomas R; Moore, Jill E; Weng, Zhiping; Gerstein, Mark B; Ren, Bing; Hardison, Ross C; Stamatoyannopoulos, John A; Graveley, Brenton R; Feingold, Elise A; Pazin, Michael J; Pagan, Michael; Gilchrist, Daniel A; Hitz, Benjamin C; Cherry, J Michael; Bernstein, Bradley E; Mendenhall, Eric M; Zerbino, Daniel R; Frankish, Adam; Flicek, Paul; Myers, Richard M.

Nature ; 583(7818): 693-698, 2020 07.

Article in English | MEDLINE | ID: mdl-32728248

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.

Subjects

Genetic Databases , Genome/genetics , Genomics , Molecular Sequence Annotation , Animals , Binding Sites , Chromatin/genetics , Chromatin/metabolism , DNA Methylation , Genetic Databases/standards , Genetic Databases/trends , Gene Expression Regulation/genetics , Human Genome/genetics , Genomics/standards , Genomics/trends , Histones/metabolism , Humans , Mice , Molecular Sequence Annotation/standards , Quality Control , Nucleic Acid Regulatory Sequences/genetics , Transcription Factors/metabolism

6.

MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations.

Tang, Xiangru; Tran, Andrew; Tan, Jeffrey; Gerstein, Mark B.

Bioinformatics ; 40(Supplement_1): i357-i368, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940177

ABSTRACT

MOTIVATION: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain. RESULTS: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks. AVAILABILITY AND IMPLEMENTATION: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.

Subjects

Natural Language Processing , Deep Learning , Computational Biology/methods
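
The MolLM abstract above names contrastive learning as the supervisory signal but does not give the loss. A minimal sketch, assuming a symmetric InfoNCE objective over paired molecule and text embeddings (the form popularized by CLIP-style models); the function name `contrastive_loss`, the `temperature` value, and the plain-list embeddings are illustrative assumptions, not MolLM's actual implementation:

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(mol_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss for a batch where mol_embs[i] pairs with text_embs[i]."""
    mol = [_normalize(v) for v in mol_embs]
    txt = [_normalize(v) for v in text_embs]
    n = len(mol)
    # Similarity of every molecule with every text, sharpened by the temperature.
    logits = [[sum(a * b for a, b in zip(m, t)) / temperature for t in txt] for m in mol]

    def cross_entropy(rows):
        # The true pair sits on the diagonal, so row i's target is column i.
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_sum = top + math.log(sum(math.exp(x - top) for x in row))
            total += log_sum - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    # Average the molecule-to-text and text-to-molecule directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))
```

The loss is small when each molecule embedding is closest to its own text and grows when pairs are mismatched, which is what pulls the two modalities into a shared space.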

7.

BioCoder: a benchmark for bioinformatics code generation with large language models.

Tang, Xiangru; Qian, Bill; Gao, Rick; Chen, Jiakang; Chen, Xinyun; Gerstein, Mark B.

Bioinformatics ; 40(Supplement_1): i266-i276, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940140

ABSTRACT

SUMMARY: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%). AVAILABILITY AND IMPLEMENTATION: All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.

Subjects

Algorithms , Benchmarking , Computational Biology , Programming Languages , Software , Computational Biology/methods , Benchmarking/methods
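
The Pass@K metric cited in the BioCoder abstract above is not defined there. A minimal sketch, assuming the unbiased estimator widely used in code-generation evaluation: generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k drawn samples passes. Whether BioCoder computes the metric exactly this way is an assumption, and the function name `pass_at_k` is illustrative:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, of which c passed all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k subset
    # contains no passing sample; comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the plain pass rate c / n; a benchmark-level score is the mean of this value over all problems.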

8.

Building integrative functional maps of gene regulation.

Xu, Jinrui; Pratt, Henry E; Moore, Jill E; Gerstein, Mark B; Weng, Zhiping.

Hum Mol Genet ; 31(R1): R114-R122, 2022 10 20.

Article in English | MEDLINE | ID: mdl-36083269

ABSTRACT

Every cell in the human body inherits a copy of the same genetic information. The three billion base pairs of DNA in the human genome, and the roughly 50 000 coding and non-coding genes they contain, must thus encode all the complexity of human development and cell and tissue type diversity. Differences in gene regulation, or the modulation of gene expression, enable individual cells to interpret the genome differently to carry out their specific functions. Here we discuss recent and ongoing efforts to build gene regulatory maps, which aim to characterize the regulatory roles of all sequences in a genome. Many researchers and consortia have identified such regulatory elements using functional assays and evolutionary analyses; we discuss the results, strengths and shortcomings of their approaches. We also discuss new techniques the field can leverage and emerging challenges it will face while striving to build gene regulatory maps of ever-increasing resolution and comprehensiveness.

Subjects

Gene Expression Regulation , Nucleic Acid Regulatory Sequences , Humans , Gene Expression Regulation/genetics , Human Genome/genetics , Chromosome Mapping , DNA/genetics

9.

Expectations and blind spots for structural variation detection from long-read assemblies and short-read genome sequencing technologies.

Zhao, Xuefang; Collins, Ryan L; Lee, Wan-Ping; Weber, Alexandra M; Jun, Yukyung; Zhu, Qihui; Weisburd, Ben; Huang, Yongqing; Audano, Peter A; Wang, Harold; Walker, Mark; Lowther, Chelsea; Fu, Jack; Gerstein, Mark B; Devine, Scott E; Marschall, Tobias; Korbel, Jan O; Eichler, Evan E; Chaisson, Mark J P; Lee, Charles; Mills, Ryan E; Brand, Harrison; Talkowski, Michael E.

Am J Hum Genet ; 108(5): 919-928, 2021 05 06.

Article in English | MEDLINE | ID: mdl-33789087

ABSTRACT

Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.

Subjects

Human Genome/genetics , Genomic Structural Variation , Genomics/methods , Goals , Whole Genome Sequencing/methods , Whole Genome Sequencing/standards , DNA Copy Number Variations , Exons/genetics , Humans , Research Design , Genomic Segmental Duplications , Sequence Alignment

10.

Insights from incorporating quantum computing into drug design workflows.

Lau, Bayo; Emani, Prashant S; Chapman, Jackson; Yao, Lijing; Lam, Tarsus; Merrill, Paul; Warrell, Jonathan; Gerstein, Mark B; Lam, Hugo Y K.

Bioinformatics ; 39(1)2023 01 01.

Article in English | MEDLINE | ID: mdl-36477833

ABSTRACT

MOTIVATION: While many quantum computing (QC) methods promise theoretical advantages over classical counterparts, quantum hardware remains limited. Exploiting near-term QC in computer-aided drug design (CADD) thus requires judicious partitioning between classical and quantum calculations. RESULTS: We present HypaCADD, a hybrid classical-quantum workflow for finding ligands binding to proteins, while accounting for genetic mutations. We explicitly identify modules of our drug-design workflow currently amenable to replacement by QC: non-intuitively, we identify the mutation-impact predictor as the best candidate. HypaCADD thus combines classical docking and molecular dynamics with quantum machine learning (QML) to infer the impact of mutations. We present a case study with the coronavirus (SARS-CoV-2) protease and associated mutants. We map a classical machine-learning module onto QC, using a neural network constructed from qubit-rotation gates. We have implemented this in simulation and on two commercial quantum computers. We find that the QML models can perform on par with, if not better than, classical baselines. In summary, HypaCADD offers a successful strategy for leveraging QC for CADD. AVAILABILITY AND IMPLEMENTATION: Jupyter Notebooks with Python code are freely available for academic use on GitHub: https://www.github.com/hypahub/hypacadd_notebook. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

COVID-19 , Software , Humans , Workflow , Computing Methodologies , Quantum Theory , SARS-CoV-2 , Drug Design , Molecular Dynamics Simulation

11.

Author Correction: Functional genomics data: privacy risk assessment and technological mitigation.

Gürsoy, Gamze; Li, Tianxiao; Liu, Susanna; Ni, Eric; Brannon, Charlotte M; Gerstein, Mark B.

Nat Rev Genet ; 23(4): 259, 2022 Apr.

Article in English | MEDLINE | ID: mdl-34811555

12.

Author Correction: Perspectives on ENCODE.

Snyder, Michael P; Gingeras, Thomas R; Moore, Jill E; Weng, Zhiping; Gerstein, Mark B; Ren, Bing; Hardison, Ross C; Stamatoyannopoulos, John A; Graveley, Brenton R; Feingold, Elise A; Pazin, Michael J; Pagan, Michael; Gilchrist, Daniel A; Hitz, Benjamin C; Cherry, J Michael; Bernstein, Bradley E; Mendenhall, Eric M; Zerbino, Daniel R; Frankish, Adam; Flicek, Paul; Myers, Richard M.

Nature ; 605(7909): E4, 2022 May.

Article in English | MEDLINE | ID: mdl-35474002

13.

Tracking Distinct RNA Populations Using Efficient and Reversible Covalent Chemistry.

Duffy, Erin E; Rutenberg-Schoenberg, Michael; Stark, Catherine D; Kitchen, Robert R; Gerstein, Mark B; Simon, Matthew D.

Mol Cell ; 59(5): 858-66, 2015 Sep 03.

Article in English | MEDLINE | ID: mdl-26340425

ABSTRACT

We describe a chemical method to label and purify 4-thiouridine (s(4)U)-containing RNA. We demonstrate that methanethiosulfonate (MTS) reagents form disulfide bonds with s(4)U more efficiently than the commonly used HPDP-biotin, leading to higher yields and less biased enrichment. This increase in efficiency allowed us to use s(4)U labeling to study global microRNA (miRNA) turnover in proliferating cultured human cells without perturbing global miRNA levels or the miRNA processing machinery. This improved chemistry will enhance methods that depend on tracking different populations of RNA, such as 4-thiouridine tagging to study tissue-specific transcription and dynamic transcriptome analysis (DTA) to study RNA turnover.

Subjects

MicroRNAs/chemistry , Biotin/analogs & derivatives , Cell Proliferation , Disulfides , Gene Expression Profiling/methods , HEK293 Cells , Humans , Indicators and Reagents , Mesylates , MicroRNAs/genetics , MicroRNAs/metabolism , Organic Chemistry Phenomena , Post-Transcriptional RNA Processing , Thiouridine/chemistry

14.

Nodal modulator (NOMO) is required to sustain endoplasmic reticulum morphology.

Amaya, Catherine; Cameron, Christopher J F; Devarkar, Swapnil C; Seager, Sebastian J H; Gerstein, Mark B; Xiong, Yong; Schlieker, Christian.

J Biol Chem ; 297(2): 100937, 2021 08.

Article in English | MEDLINE | ID: mdl-34224731

ABSTRACT

The endoplasmic reticulum (ER) is a membrane-bound organelle responsible for protein folding, lipid synthesis, and calcium homeostasis. Maintenance of ER structural integrity is crucial for proper function, but much remains to be learned about the molecular players involved. To identify proteins that support the structure of the ER, we performed a proteomic screen and identified nodal modulator (NOMO), a widely conserved type I transmembrane protein of unknown function, with three nearly identical orthologs specified in the human genome. We found that overexpression of NOMO1 imposes a sheet morphology on the ER, whereas depletion of NOMO1 and its orthologs causes a collapse of ER morphology concomitant with the formation of membrane-delineated holes in the ER network positive for the lysosomal marker lysosomal-associated protein 1. In addition, the levels of key players of autophagy including microtubule-associated protein light chain 3 and autophagy cargo receptor p62/sequestosome 1 strongly increase upon NOMO depletion. In vitro reconstitution of NOMO1 revealed a "beads on a string" structure likely representing consecutive immunoglobulin-like domains. Extending NOMO1 by insertion of additional immunoglobulin folds results in a correlative increase in the ER intermembrane distance. Based on these observations and a genetic epistasis analysis including the known ER-shaping proteins Atlastin2 and Climp63, we propose a role for NOMO1 in the functional network of ER-shaping proteins.

Subjects

Endoplasmic Reticulum , Proteomics , Sequestosome-1 Protein , Autophagy , Endoplasmic Reticulum Stress , Homeostasis , Humans , Lysosomes/metabolism

15.

Leveraging protein dynamics to identify cancer mutational hotspots using 3D structures.

Kumar, Sushant; Clarke, Declan; Gerstein, Mark B.

Proc Natl Acad Sci U S A ; 116(38): 18962-18970, 2019 09 17.

Article in English | MEDLINE | ID: mdl-31462496

ABSTRACT

Large-scale exome sequencing of tumors has enabled the identification of cancer drivers using recurrence-based approaches. Some of these methods also employ 3D protein structures to identify mutational hotspots in cancer-associated genes. In determining such mutational clusters in structures, existing approaches overlook protein dynamics, despite its essential role in protein function. We present a framework to identify cancer driver genes using a dynamics-based search of mutational hotspot communities. Mutations are mapped to protein structures, which are partitioned into distinct residue communities. These communities are identified in a framework where residue-residue contact edges are weighted by correlated motions (as inferred by dynamics-based models). We then search for signals of positive selection among these residue communities to identify putative driver genes, while applying our method to the TCGA (The Cancer Genome Atlas) PanCancer Atlas missense mutation catalog. Overall, we predict 1 or more mutational hotspots within the resolved structures of proteins encoded by 434 genes. These genes were enriched among biological processes associated with tumor progression. Additionally, a comparison between our approach and existing cancer hotspot detection methods using structural data suggests that including protein dynamics significantly increases the sensitivity of driver detection.

Subjects

Computational Biology/methods , Genomics/methods , Neoplasm Proteins/chemistry , Neoplasm Proteins/genetics , Neoplasms/genetics , Genetic Databases , Exome/genetics , Humans , Mutation , Protein Conformation , Reproducibility of Results , Workflow

16.

Predicting changes in protein thermodynamic stability upon point mutation with deep 3D convolutional neural networks.

Li, Bian; Yang, Yucheng T; Capra, John A; Gerstein, Mark B.

PLoS Comput Biol ; 16(11): e1008291, 2020 11.

Article in English | MEDLINE | ID: mdl-33253214

ABSTRACT

Predicting mutation-induced changes in protein thermodynamic stability (ΔΔG) is of great interest in protein engineering, variant interpretation, and protein biophysics. We introduce ThermoNet, a deep, 3D-convolutional neural network (3D-CNN) designed for structure-based prediction of ΔΔGs upon point mutation. To leverage the image-processing power inherent in CNNs, we treat protein structures as if they were multi-channel 3D images. In particular, the inputs to ThermoNet are uniformly constructed as multi-channel voxel grids based on biophysical properties derived from raw atom coordinates. We train and evaluate ThermoNet with a curated data set that accounts for protein homology and is balanced with direct and reverse mutations; this provides a framework for addressing biases that have likely influenced many previous ΔΔG prediction methods. ThermoNet demonstrates performance comparable to the best available methods on the widely used Ssym test set. In addition, ThermoNet accurately predicts the effects of both stabilizing and destabilizing mutations, while most other methods exhibit a strong bias towards predicting destabilization. We further show that homology between Ssym and widely used training sets like S2648 and VariBench has likely led to overestimated performance in previous studies. Finally, we demonstrate the practical utility of ThermoNet in predicting the ΔΔGs for two clinically relevant proteins, p53 and myoglobin, and for pathogenic and benign missense variants from ClinVar. Overall, our results suggest that 3D-CNNs can model the complex, non-linear interactions perturbed by mutations, directly from biophysical properties of atoms.

Subjects

Three-Dimensional Imaging/methods , Computer Neural Networks , Point Mutation , Proteins/chemistry , Thermodynamics , Computational Biology , Protein Stability
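
The ThermoNet abstract above describes its inputs as multi-channel voxel grids built from atom coordinates, without giving the construction. A rough sketch of the general idea, assuming binary occupancy, a cubic box, and one channel per property label; the function name `voxelize`, the channel names, the grid size, and the resolution are all hypothetical choices, not ThermoNet's actual featurization:

```python
def voxelize(atoms, channels, center=(0.0, 0.0, 0.0), size=16, resolution=1.0):
    """Map (x, y, z, channel_name) atoms onto a len(channels) x size^3 occupancy grid."""
    half = size * resolution / 2.0
    # One size x size x size grid of zeros per property channel.
    grid = [[[[0.0] * size for _ in range(size)] for _ in range(size)] for _ in channels]
    index = {name: i for i, name in enumerate(channels)}
    for x, y, z, name in atoms:
        # Shift so the box's lower corner is the origin, then bin by voxel edge length.
        ijk = [int((p - c + half) // resolution) for p, c in zip((x, y, z), center)]
        if all(0 <= v < size for v in ijk):  # atoms outside the box are ignored
            i, j, k = ijk
            grid[index[name]][i][j][k] = 1.0
    return grid
```

Centering the box on the mutated residue would give every mutation an input of the same shape, which is the property a 3D convolutional network needs; a real pipeline would typically use an array library and smoother occupancy functions.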

17.

Dynamic RNA-protein interactions underlie the zebrafish maternal-to-zygotic transition.

Despic, Vladimir; Dejung, Mario; Gu, Mengting; Krishnan, Jayanth; Zhang, Jing; Herzel, Lydia; Straube, Korinna; Gerstein, Mark B; Butter, Falk; Neugebauer, Karla M.

Genome Res ; 27(7): 1184-1194, 2017 07.

Article in English | MEDLINE | ID: mdl-28381614

ABSTRACT

During the maternal-to-zygotic transition (MZT), transcriptionally silent embryos rely on post-transcriptional regulation of maternal mRNAs until zygotic genome activation (ZGA). RNA-binding proteins (RBPs) are important regulators of post-transcriptional RNA processing events, yet their identities and functions during developmental transitions in vertebrates remain largely unexplored. Using mRNA interactome capture, we identified 227 RBPs in zebrafish embryos before and during ZGA, hereby named the zebrafish MZT mRNA-bound proteome. This protein constellation consists of many conserved RBPs, some of which are potential stage-specific mRNA interactors that likely reflect the dynamics of RNA-protein interactions during MZT. The enrichment of numerous splicing factors like hnRNP proteins before ZGA was surprising, because maternal mRNAs were found to be fully spliced. To address potentially unique roles of these RBPs in embryogenesis, we focused on Hnrnpa1. iCLIP and subsequent mRNA reporter assays revealed a function for Hnrnpa1 in the regulation of poly(A) tail length and translation of maternal mRNAs through sequence-specific association with 3' UTRs before ZGA. Comparison of iCLIP data from two developmental stages revealed that Hnrnpa1 dissociates from maternal mRNAs at ZGA and instead regulates the nuclear processing of pri-mir-430 transcripts, which we validated experimentally. The shift from cytoplasmic to nuclear RNA targets was accompanied by a dramatic translocation of Hnrnpa1 and other pre-mRNA splicing factors to the nucleus in a transcription-dependent manner. Thus, our study identifies global changes in RNA-protein interactions during vertebrate MZT and shows that Hnrnpa1 RNA-binding activities are spatially and temporally coordinated to regulate RNA metabolism during early development.

Subjects

3' Untranslated Regions , MicroRNAs/metabolism , Zebrafish/metabolism , Zygote/metabolism , Animals , Heterogeneous Nuclear Ribonucleoprotein A1/genetics , Heterogeneous Nuclear Ribonucleoprotein A1/metabolism , MicroRNAs/genetics , Zebrafish/genetics , Zebrafish Proteins/genetics , Zebrafish Proteins/metabolism

18.

TeXP: Deconvolving the effects of pervasive and autonomous transcription of transposable elements.

Navarro, Fabio Cp; Hoops, Jacob; Bellfy, Lauren; Cerveira, Eliza; Zhu, Qihui; Zhang, Chengsheng; Lee, Charles; Gerstein, Mark B.

PLoS Comput Biol ; 15(8): e1007293, 2019 08.

Article in English | MEDLINE | ID: mdl-31425522

ABSTRACT

The Long interspersed nuclear element 1 (LINE-1) is a primary source of genetic variation in humans and other mammals. Despite its importance, LINE-1 activity remains difficult to study because of its highly repetitive nature. Here, we developed and validated a method called TeXP to gauge LINE-1 activity accurately. TeXP builds mappability signatures from LINE-1 subfamilies to deconvolve the effect of pervasive transcription from autonomous LINE-1 activity. In particular, it apportions the multiple reads aligned to the many LINE-1 instances in the genome into these two categories. Using our method, we evaluated well-established cell lines, cell-line compartments and healthy tissues and found that the vast majority (91.7%) of transcriptome reads overlapping LINE-1 derive from pervasive transcription. We validated TeXP by independently estimating the levels of LINE-1 autonomous transcription using ddPCR, finding high concordance. Next, we applied our method to comprehensively measure LINE-1 activity across healthy somatic cells, while backing out the effect of pervasive transcription. Unexpectedly, we found that LINE-1 activity is present in many normal somatic cells. This finding contrasts with earlier studies showing that LINE-1 has limited activity in healthy somatic tissues, except for neuroprogenitor cells. Interestingly, we found that the amount of LINE-1 activity was associated with the amount of cell turnover, with tissues with low cell turnover rates (e.g. the adult central nervous system) showing lower LINE-1 activity. Altogether, our results show how accounting for pervasive transcription is critical to accurately quantify the activity of highly repetitive regions of the human genome.

Subjects

DNA Transposable Elements/genetics; Long Interspersed Nucleotide Elements/genetics; Models, Genetic; Transcription, Genetic; Animals; Cell Line; Computational Biology; Genetic Techniques/statistics & numerical data; Genome, Human; Humans; Sequence Analysis, RNA/statistics & numerical data

19.

Comparative analysis of the transcriptome across distant species.

Gerstein, Mark B; Rozowsky, Joel; Yan, Koon-Kiu; Wang, Daifeng; Cheng, Chao; Brown, James B; Davis, Carrie A; Hillier, LaDeana; Sisu, Cristina; Li, Jingyi Jessica; Pei, Baikang; Harmanci, Arif O; Duff, Michael O; Djebali, Sarah; Alexander, Roger P; Alver, Burak H; Auerbach, Raymond; Bell, Kimberly; Bickel, Peter J; Boeck, Max E; Boley, Nathan P; Booth, Benjamin W; Cherbas, Lucy; Cherbas, Peter; Di, Chao; Dobin, Alex; Drenkow, Jorg; Ewing, Brent; Fang, Gang; Fastuca, Megan; Feingold, Elise A; Frankish, Adam; Gao, Guanjun; Good, Peter J; Guigó, Roderic; Hammonds, Ann; Harrow, Jen; Hoskins, Roger A; Howald, Cédric; Hu, Long; Huang, Haiyan; Hubbard, Tim J P; Huynh, Chau; Jha, Sonali; Kasper, Dionna; Kato, Masaomi; Kaufman, Thomas C; Kitchen, Robert R; Ladewig, Erik; Lagarde, Julien.

Nature ; 512(7515): 445-8, 2014 Aug 28.

Article in English | MEDLINE | ID: mdl-25164755

ABSTRACT

The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.

Subjects

Caenorhabditis elegans/genetics; Drosophila melanogaster/genetics; Gene Expression Profiling; Transcriptome/genetics; Animals; Caenorhabditis elegans/embryology; Caenorhabditis elegans/growth & development; Chromatin/genetics; Cluster Analysis; Drosophila melanogaster/growth & development; Gene Expression Regulation, Developmental/genetics; Histones/metabolism; Humans; Larva/genetics; Larva/growth & development; Models, Genetic; Molecular Sequence Annotation; Promoter Regions, Genetic/genetics; Pupa/genetics; Pupa/growth & development; RNA, Untranslated/genetics; Sequence Analysis, RNA

20.

Whole-genome analysis of papillary kidney cancer finds significant noncoding alterations.

Li, Shantao; Shuch, Brian M; Gerstein, Mark B.

PLoS Genet ; 13(3): e1006685, 2017 03.

Article in English | MEDLINE | ID: mdl-28358873

ABSTRACT

To date, studies on papillary renal-cell carcinoma (pRCC) have largely focused on coding alterations in traditional drivers, particularly the tyrosine-kinase, Met. However, for a significant fraction of tumors, researchers have been unable to determine a clear molecular etiology. To address this, we perform the first whole-genome analysis of pRCC. Elaborating on previous results on MET, we find a germline SNP (rs11762213) in this gene predicting prognosis. Surprisingly, we detect no enrichment for small structural variants disrupting MET. Next, we scrutinize noncoding mutations, discovering potentially impactful ones associated with MET. Many of these are in an intron connected to a known, oncogenic alternative-splicing event; moreover, we find methylation dysregulation nearby, leading to a cryptic promoter activation. We also notice an elevation of mutations in the long noncoding RNA NEAT1, and these mutations are associated with increased expression and unfavorable outcome. Finally, to address the origin of pRCC heterogeneity, we carry out whole-genome analyses of mutational processes. First, we investigate genome-wide mutational patterns, finding they are governed mostly by methylation-associated C-to-T transitions. We also observe significantly more mutations in open chromatin and early-replicating regions in tumors with chromatin-modifier alterations. Finally, we reconstruct cancer-evolutionary trees, which have markedly different topologies and suggested evolutionary trajectories for the different subtypes of pRCC.

Subjects

Carcinoma, Renal Cell/genetics; Kidney Neoplasms/genetics; Proto-Oncogene Proteins c-met/genetics; RNA, Long Noncoding/genetics; Carcinoma, Renal Cell/pathology; Chromatin/genetics; DNA Methylation/genetics; Female; Gene Expression Regulation, Neoplastic; Genome, Human; Humans; Introns/genetics; Kidney Neoplasms/pathology; Male; Middle Aged; Mutation; Polymorphism, Single Nucleotide
