Search | BVS Bolivia

1.

MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations.

Tang, Xiangru; Tran, Andrew; Tan, Jeffrey; Gerstein, Mark B.

Bioinformatics ; 40(Supplement_1): i357-i368, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940177

ABSTRACT

MOTIVATION: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain. RESULTS: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks. AVAILABILITY AND IMPLEMENTATION: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.
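The contrastive supervisory signal mentioned in this abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the symmetric InfoNCE form below, the temperature value, and the toy two-dimensional embeddings are all assumptions chosen for illustration and are not taken from the MolLM code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(mol_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss; pair i is (mol_embs[i], text_embs[i]).

    Assumed form of the objective, not confirmed by the abstract.
    """
    n = len(mol_embs)
    # sims[i][j]: scaled similarity of molecule i and text j.
    sims = [[cosine(m, t) / temperature for t in text_embs] for m in mol_embs]

    def cross_entropy(rows):
        # Mean of -log softmax(row)[i], with the matching pair on the diagonal.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(s - peak) for s in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*sims)]
    # Average the molecule-to-text and text-to-molecule directions.
    return 0.5 * (cross_entropy(sims) + cross_entropy(columns))


# Toy embeddings (hypothetical): aligned pairs versus swapped pairs.
molecules = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(molecules, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(molecules, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the loss is close to zero when each molecule embedding matches its own text embedding and large when the pairs are swapped. That is the behavior a contrastive objective rewards during pre-training.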

Subject(s)

Natural Language Processing , Deep Learning , Computational Biology/methods

2.

BioCoder: a benchmark for bioinformatics code generation with large language models.

Tang, Xiangru; Qian, Bill; Gao, Rick; Chen, Jiakang; Chen, Xinyun; Gerstein, Mark B.

Bioinformatics ; 40(Supplement_1): i266-i276, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940140

ABSTRACT

SUMMARY: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%). AVAILABILITY AND IMPLEMENTATION: All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
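For readers unfamiliar with the Pass@K metric cited in this abstract, the sketch below uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the tests. The abstract does not say which estimator BioCoder uses, so this form is an assumption, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for a problem, c: samples that passed, k: budget.
    Uses 1 - C(n - c, k) / C(n, k), i.e. one minus the chance that a
    random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 generated samples, 3 of which pass the tests.
single_try = pass_at_k(10, 3, 1)
five_tries = pass_at_k(10, 3, 5)
```

A benchmark-level score would then be the mean of this quantity over all problems.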

Subject(s)

Algorithms , Benchmarking , Computational Biology , Programming Languages , Software , Computational Biology/methods , Benchmarking/methods

3.

Latent evolutionary signatures: a general framework for analysing music and cultural evolution.

Warrell, Jonathan; Salichos, Leonidas; Gancz, Michael; Gerstein, Mark B.

J R Soc Interface ; 21(212): 20230647, 2024 03.

Article in English | MEDLINE | ID: mdl-38503341

ABSTRACT

Cultural processes of change bear many resemblances to biological evolution. The underlying units of non-biological evolution have, however, remained elusive, especially in the domain of music. Here, we introduce a general framework to jointly identify underlying units and their associated evolutionary processes. We model musical styles and principles of organization in dimensions such as harmony and form as following an evolutionary process. Furthermore, we propose that such processes can be identified by extracting latent evolutionary signatures from musical corpora, analogously to identifying mutational signatures in genomics. These signatures provide a latent embedding for each song or musical piece. We develop a deep generative architecture for our model, which can be viewed as a type of variational autoencoder with an evolutionary prior constraining the latent space; specifically, the embeddings for each song are tied together via an energy-based prior, which encourages songs close in evolutionary space to share similar representations. As illustration, we analyse songs from the McGill Billboard dataset. We find frequent chord transitions and formal repetition schemes and identify latent evolutionary signatures related to these features. Finally, we show that the latent evolutionary representations learned by our model outperform non-evolutionary representations in such tasks as period and genre prediction.

Subject(s)

Cultural Evolution , Music , Genomics

4.

Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases.

Emani, Prashant S; Geradi, Maya N; Gürsoy, Gamze; Grasty, Monica R; Miranker, Andrew; Gerstein, Mark B.

Genome Res ; 2023 Dec 14.

Article in English | MEDLINE | ID: mdl-38097386

ABSTRACT

Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ~5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ~20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ~30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.

5.

The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity.

Reese, Fairlie; Williams, Brian; Balderrama-Gutierrez, Gabriela; Wyman, Dana; Çelik, Muhammed Hasan; Rebboah, Elisabeth; Rezaie, Narges; Trout, Diane; Razavi-Mohseni, Milad; Jiang, Yunzhe; Borsari, Beatrice; Morabito, Samuel; Liang, Heidi Yahan; McGill, Cassandra J; Rahmanian, Sorena; Sakr, Jasmine; Jiang, Shan; Zeng, Weihua; Carvalho, Klebea; Weimer, Annika K; Dionne, Louise A; McShane, Ariel; Bedi, Karan; Elhajjajy, Shaimae I; Upchurch, Sean; Jou, Jennifer; Youngworth, Ingrid; Gabdank, Idan; Sud, Paul; Jolanki, Otto; Strattan, J Seth; Kagda, Meenakshi S; Snyder, Michael P; Hitz, Ben C; Moore, Jill E; Weng, Zhiping; Bennett, David; Reinholdt, Laura; Ljungman, Mats; Beer, Michael A; Gerstein, Mark B; Pachter, Lior; Guigó, Roderic; Wold, Barbara J; Mortazavi, Ali.

bioRxiv ; 2023 May 16.

Article in English | MEDLINE | ID: mdl-37292896

ABSTRACT

The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.

6.

exRNA-eCLIP intersection analysis reveals a map of extracellular RNA binding proteins and associated RNAs across major human biofluids and carriers.

LaPlante, Emily L; Stürchler, Alessandra; Fullem, Robert; Chen, David; Starner, Anne C; Esquivel, Emmanuel; Alsop, Eric; Jackson, Andrew R; Ghiran, Ionita; Pereira, Getulio; Rozowsky, Joel; Chang, Justin; Gerstein, Mark B; Alexander, Roger P; Roth, Matthew E; Franklin, Jeffrey L; Coffey, Robert J; Raffai, Robert L; Mansuy, Isabelle M; Stavrakis, Stavros; deMello, Andrew J; Laurent, Louise C; Wang, Yi-Ting; Tsai, Chia-Feng; Liu, Tao; Jones, Jennifer; Van Keuren-Jensen, Kendall; Van Nostrand, Eric; Mateescu, Bogdan; Milosavljevic, Aleksandar.

Cell Genom ; 3(5): 100303, 2023 May 10.

Article in English | MEDLINE | ID: mdl-37228754

ABSTRACT

Although the role of RNA binding proteins (RBPs) in extracellular RNA (exRNA) biology is well established, their exRNA cargo and distribution across biofluids are largely unknown. To address this gap, we extend the exRNA Atlas resource by mapping exRNAs carried by extracellular RBPs (exRBPs). This map was developed through an integrative analysis of ENCODE enhanced crosslinking and immunoprecipitation (eCLIP) data (150 RBPs) and human exRNA profiles (6,930 samples). Computational analysis and experimental validation identified exRBPs in plasma, serum, saliva, urine, cerebrospinal fluid, and cell-culture-conditioned medium. exRBPs carry exRNA transcripts from small non-coding RNA biotypes, including microRNA (miRNA), piRNA, tRNA, small nuclear RNA (snRNA), small nucleolar RNA (snoRNA), Y RNA, and lncRNA, as well as protein-coding mRNA fragments. Computational deconvolution of exRBP RNA cargo reveals associations of exRBPs with extracellular vesicles, lipoproteins, and ribonucleoproteins across human biofluids. Overall, we mapped the distribution of exRBPs across human biofluids, presenting a resource for the community.

7.

The association between evening social media use and delayed sleep may be causal: Suggestive evidence from 120 million Reddit timestamps.

Meyerson, William U; Fineberg, Sarah K; Andrade, Fernanda C; Corlett, Philip; Gerstein, Mark B; Hoyle, Rick H.

Sleep Med ; 107: 212-218, 2023 07.

Article in English | MEDLINE | ID: mdl-37235891

ABSTRACT

Public health officials and clinicians routinely advise social media users to avoid nighttime social media use due to the perception that this delays the onset of sleep and predisposes to the health risks of insufficient sleep. With some exceptions, the evidence behind this advice mostly derives from surveys identifying an association between self-reported social media usage and self-reported sleep patterns. In principle, these associations could alternatively be explained by users turning to social media to pass the time when they are otherwise having difficulty sleeping, or by individual differences that draw some people to frequent social media use, or by offline activities that overlap with both social media use and delayed sleep. To attempt to distinguish among these explanations, we leveraged estimated bedtimes from 44,000 Reddit users reported in a recent study and their 120 million posts to test whether the relationship between sleep and social media has properties suggestive of a causal relationship. We find that users are especially likely to be active on Reddit after their bedtime (and therefore awake) on nights that they posted to Reddit shortly before bedtime, especially if they posted multiple times or in high-engagement forums that night. Overall, this study lends additional support to the notion that there likely is some causal effect of evening social media use on delayed sleep onset.

Subject(s)

Sleep Disorders, Circadian Rhythm , Social Media , Adult , Female , Humans , Male , Young Adult , Circadian Rhythm , Prevalence , Self Report , Sleep Disorders, Circadian Rhythm/epidemiology , Time Factors

8.

Estimation of Bedtimes of Reddit Users: Integrated Analysis of Time Stamps and Surveys.

Meyerson, William U; Fineberg, Sarah K; Song, Ye Kyung; Faber, Adam; Ash, Garrett; Andrade, Fernanda C; Corlett, Philip; Gerstein, Mark B; Hoyle, Rick H.

JMIR Form Res ; 7: e38112, 2023 Jan 17.

Article in English | MEDLINE | ID: mdl-36649054

ABSTRACT

BACKGROUND: Individuals with later bedtimes have an increased risk of difficulties with mood and substances. To investigate the causes and consequences of late bedtimes and other sleep patterns, researchers are exploring social media as a data source. Pioneering studies inferred sleep patterns directly from social media data. While innovative, these efforts are variously unscalable, context dependent, confined to specific sleep parameters, or rest on untested assumptions, and none of the reviewed studies apply to the popular Reddit platform or release software to the research community. OBJECTIVE: This study builds on this prior work. We estimate the bedtimes of Reddit users from the time stamps of their posts, test inference validity against survey data, and release our model as an R package (The R Foundation). METHODS: We included 159 sufficiently active Reddit users with known time zones and known, nonanomalous bedtimes, together with the time stamps of their 2.1 million posts. The model's form was chosen by visualizing the aggregate distribution of the timing of users' posts relative to their reported bedtimes. The chosen model represents a user's frequency of Reddit posting by time of day, with a flat portion before bedtime and a quadratic depletion that begins near the user's bedtime, with parameters fitted to the data. This model estimates the bedtimes of individual Reddit users from the time stamps of their posts. Model performance is assessed through k-fold cross-validation. We then apply the model to estimate the bedtimes of 51,372 sufficiently active, nonbot Reddit users with known time zones from the time stamps of their 140 million posts. RESULTS: The Pearson correlation between expected and observed Reddit posting frequencies in our model was 0.997 on aggregate data. On average, posting starts declining 45 minutes before bedtime, reaches a nadir 4.75 hours after bedtime that is 87% lower than the daytime rate, and returns to baseline 10.25 hours after bedtime. The Pearson correlation between inferred and reported bedtimes for individual users was 0.61 (P<.001). In 90 of 159 cases (56.6%), our estimate was within 1 hour of the reported bedtime; 128 cases (80.5%) were within 2 hours. There was equivalent accuracy in hold-out sets versus training sets of k-fold cross-validation, arguing against overfitting. The model was more accurate than a random forest approach. CONCLUSIONS: We uncovered a simple, reproducible relationship between Reddit users' reported bedtimes and the time of day when high daytime posting rates transition to low nighttime posting rates. We captured this relationship in a model that estimates users' bedtimes from the time stamps of their posts. Limitations include applicability only to users who post frequently, the requirement for time zone data, and limits on generalizability. Nonetheless, it is a step forward for inferring the sleep parameters of social media users passively at scale. Our model and precomputed estimated bedtimes of 50,000 Reddit users are freely available.
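The posting-rate shape described in this abstract (a flat daytime level with a quadratic dip around bedtime) can be sketched as below. The symmetric parabola is an assumed parameterization, not the one in the released R package. It is consistent with the reported averages, since the midpoint of the dip's start (0.75 hours before bedtime) and end (10.25 hours after) is the reported nadir at 4.75 hours. The function name and defaults are hypothetical.

```python
def relative_posting_rate(hours_after_bedtime,
                          dip_start=-0.75, dip_end=10.25, max_drop=0.87):
    """Posting rate relative to the daytime baseline (baseline = 1.0).

    Assumed shape: flat outside [dip_start, dip_end]; inside, a parabola
    that touches the baseline at both ends and is max_drop lower at the
    midpoint. Times are hours relative to bedtime within one daily cycle.
    """
    t = hours_after_bedtime
    if t <= dip_start or t >= dip_end:
        return 1.0
    center = (dip_start + dip_end) / 2.0
    half_width = (dip_end - dip_start) / 2.0
    # depth is 0 at the edges of the dip and 1 at its center.
    depth = 1.0 - ((t - center) / half_width) ** 2
    return 1.0 - max_drop * depth


# Evaluate the curve at a few times relative to bedtime.
before_bed = relative_posting_rate(-2.0)   # well before bedtime: baseline
at_bedtime = relative_posting_rate(0.0)    # decline already under way
at_nadir = relative_posting_rate(4.75)     # deepest point of the dip
```

Estimating a bedtime would then mean shifting this curve along the time axis until it best fits a user's observed posting times. That fitting step is not shown here.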

9.

Insights from incorporating quantum computing into drug design workflows.

Lau, Bayo; Emani, Prashant S; Chapman, Jackson; Yao, Lijing; Lam, Tarsus; Merrill, Paul; Warrell, Jonathan; Gerstein, Mark B; Lam, Hugo Y K.

Bioinformatics ; 39(1)2023 01 01.

Article in English | MEDLINE | ID: mdl-36477833

ABSTRACT

MOTIVATION: While many quantum computing (QC) methods promise theoretical advantages over classical counterparts, quantum hardware remains limited. Exploiting near-term QC in computer-aided drug design (CADD) thus requires judicious partitioning between classical and quantum calculations. RESULTS: We present HypaCADD, a hybrid classical-quantum workflow for finding ligands binding to proteins, while accounting for genetic mutations. We explicitly identify modules of our drug-design workflow currently amenable to replacement by QC: non-intuitively, we identify the mutation-impact predictor as the best candidate. HypaCADD thus combines classical docking and molecular dynamics with quantum machine learning (QML) to infer the impact of mutations. We present a case study with the coronavirus (SARS-CoV-2) protease and associated mutants. We map a classical machine-learning module onto QC, using a neural network constructed from qubit-rotation gates. We have implemented this in simulation and on two commercial quantum computers. We find that the QML models can perform on par with, if not better than, classical baselines. In summary, HypaCADD offers a successful strategy for leveraging QC for CADD. AVAILABILITY AND IMPLEMENTATION: Jupyter Notebooks with Python code are freely available for academic use on GitHub: https://www.github.com/hypahub/hypacadd_notebook. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

COVID-19 , Software , Humans , Workflow , Computing Methodologies , Quantum Theory , SARS-CoV-2 , Drug Design , Molecular Dynamics Simulation

10.

Dynamic quality control machinery that operates across compartmental borders mediates the degradation of mammalian nuclear membrane proteins.

Tsai, Pei-Ling; Cameron, Christopher J F; Forni, Maria Fernanda; Wasko, Renee R; Naughton, Brigitte S; Horsley, Valerie; Gerstein, Mark B; Schlieker, Christian.

Cell Rep ; 41(8): 111675, 2022 11 22.

Article in English | MEDLINE | ID: mdl-36417855

ABSTRACT

Many human diseases are caused by mutations in nuclear envelope (NE) proteins. How protein homeostasis and disease etiology are interconnected at the NE is poorly understood. Specifically, the identity of local ubiquitin ligases that facilitate ubiquitin-proteasome-dependent NE protein turnover is presently unknown. Here, we employ a short-lived, Lamin B receptor disease variant as a model substrate in a genetic screen to uncover key elements of NE protein turnover. We identify the ubiquitin-conjugating enzymes (E2s) Ube2G2 and Ube2D3, the membrane-resident ubiquitin ligases (E3s) RNF5 and HRD1, and the poorly understood protein TMEM33. RNF5, but not HRD1, requires TMEM33 both for efficient biosynthesis and function. Once synthesized, RNF5 responds dynamically to increased substrate levels at the NE by departing from the endoplasmic reticulum, where HRD1 remains confined. Thus, mammalian protein quality control machinery partitions between distinct cellular compartments to address locally changing substrate loads, establishing a robust cellular quality control system.

Subject(s)

Membrane Proteins , Ubiquitin-Protein Ligases , Animals , Humans , Ubiquitin-Protein Ligases/metabolism , Membrane Proteins/metabolism , Endoplasmic Reticulum/metabolism , Ubiquitin-Conjugating Enzymes/metabolism , Ubiquitin/metabolism , Mammals/metabolism

11.

Building integrative functional maps of gene regulation.

Xu, Jinrui; Pratt, Henry E; Moore, Jill E; Gerstein, Mark B; Weng, Zhiping.

Hum Mol Genet ; 31(R1): R114-R122, 2022 10 20.

Article in English | MEDLINE | ID: mdl-36083269

ABSTRACT

Every cell in the human body inherits a copy of the same genetic information. The three billion base pairs of DNA in the human genome, and the roughly 50 000 coding and non-coding genes they contain, must thus encode all the complexity of human development and cell and tissue type diversity. Differences in gene regulation, or the modulation of gene expression, enable individual cells to interpret the genome differently to carry out their specific functions. Here we discuss recent and ongoing efforts to build gene regulatory maps, which aim to characterize the regulatory roles of all sequences in a genome. Many researchers and consortia have identified such regulatory elements using functional assays and evolutionary analyses; we discuss the results, strengths and shortcomings of their approaches. We also discuss new techniques the field can leverage and emerging challenges it will face while striving to build gene regulatory maps of ever-increasing resolution and comprehensiveness.

Subject(s)

Gene Expression Regulation , Regulatory Sequences, Nucleic Acid , Humans , Gene Expression Regulation/genetics , Genome, Human/genetics , Chromosome Mapping , DNA/genetics

12.

Phase 2 of extracellular RNA communication consortium charts next-generation approaches for extracellular RNA research.

Mateescu, Bogdan; Jones, Jennifer C; Alexander, Roger P; Alsop, Eric; An, Ji Yeong; Asghari, Mohammad; Boomgarden, Alex; Bouchareychas, Laura; Cayota, Alfonso; Chang, Hsueh-Chia; Charest, Al; Chiu, Daniel T; Coffey, Robert J; Das, Saumya; De Hoff, Peter; deMello, Andrew; D'Souza-Schorey, Crislyn; Elashoff, David; Eliato, Kiarash R; Franklin, Jeffrey L; Galas, David J; Gerstein, Mark B; Ghiran, Ionita H; Go, David B; Gould, Stephen; Grogan, Tristan R; Higginbotham, James N; Hladik, Florian; Huang, Tony Jun; Huo, Xiaoye; Hutchins, Elizabeth; Jeppesen, Dennis K; Jovanovic-Talisman, Tijana; Kim, Betty Y S; Kim, Sung; Kim, Kyoung-Mee; Kim, Yong; Kitchen, Robert R; Knouse, Vaughan; LaPlante, Emily L; Lebrilla, Carlito B; Lee, L James; Lennon, Kathleen M; Li, Guoping; Li, Feng; Li, Tieyi; Liu, Tao; Liu, Zirui; Maddox, Adam L; McCarthy, Kyle.

iScience ; 25(8): 104653, 2022 Aug 19.

Article in English | MEDLINE | ID: mdl-35958027

ABSTRACT

The extracellular RNA communication consortium (ERCC) is an NIH-funded program aiming to promote the development of new technologies, resources, and knowledge about exRNAs and their carriers. After Phase 1 (2013-2018), Phase 2 of the program (ERCC2, 2019-2023) aims to fill critical gaps in knowledge and technology to enable rigorous and reproducible methods for separation and characterization of both bulk populations of exRNA carriers and single EVs. ERCC2 investigators are also developing new bioinformatic pipelines to promote data integration through the exRNA atlas database. ERCC2 has established several Working Groups (Resource Sharing, Reagent Development, Data Analysis and Coordination, Technology Development, Nomenclature, and Scientific Outreach) to promote collaboration between ERCC2 members and the broader scientific community. We expect that ERCC2's current and future achievements will significantly improve our understanding of exRNA biology and the development of accurate and efficient exRNA-based diagnostic, prognostic, and theranostic biomarker assays.

13.

Author Correction: Perspectives on ENCODE.

Snyder, Michael P; Gingeras, Thomas R; Moore, Jill E; Weng, Zhiping; Gerstein, Mark B; Ren, Bing; Hardison, Ross C; Stamatoyannopoulos, John A; Graveley, Brenton R; Feingold, Elise A; Pazin, Michael J; Pagan, Michael; Gilchrist, Daniel A; Hitz, Benjamin C; Cherry, J Michael; Bernstein, Bradley E; Mendenhall, Eric M; Zerbino, Daniel R; Frankish, Adam; Flicek, Paul; Myers, Richard M.

Nature ; 605(7909): E4, 2022 May.

Article in English | MEDLINE | ID: mdl-35474002

14.

Functional genomics data: privacy risk assessment and technological mitigation.

Gürsoy, Gamze; Li, Tianxiao; Liu, Susanna; Ni, Eric; Brannon, Charlotte M; Gerstein, Mark B.

Nat Rev Genet ; 23(4): 245-258, 2022 04.

Article in English | MEDLINE | ID: mdl-34759381

ABSTRACT

The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values. Here, we review these challenges and present potential solutions for mitigating privacy risks while allowing broad data dissemination and analysis.

Subject(s)

Genetic Privacy , Privacy , Genomics , High-Throughput Nucleotide Sequencing , Risk Assessment

15.

Author Correction: Functional genomics data: privacy risk assessment and technological mitigation.

Gürsoy, Gamze; Li, Tianxiao; Liu, Susanna; Ni, Eric; Brannon, Charlotte M; Gerstein, Mark B.

Nat Rev Genet ; 23(4): 259, 2022 Apr.

Article in English | MEDLINE | ID: mdl-34811555

16.

Cross-platform transcriptomic profiling of the response to recombinant human erythropoietin.

Wang, Guan; Kitaoka, Traci; Crawford, Ali; Mao, Qian; Hesketh, Andrew; Guppy, Fergus M; Ash, Garrett I; Liu, Jason; Gerstein, Mark B; Pitsiladis, Yannis P.

Sci Rep ; 11(1): 21705, 2021 11 04.

Article in English | MEDLINE | ID: mdl-34737331

ABSTRACT

RNA-seq has matured and become an important tool for studying RNA biology. Here we compared two RNA-seq (MGI DNBSEQ and Illumina NextSeq 500) and two microarray platforms (GeneChip Human Transcriptome Array 2.0 and Illumina Expression BeadChip) in healthy individuals administered recombinant human erythropoietin for transcriptome-wide quantification of differential gene expression. The results show that total RNA DNB-seq generated a multitude of target genes compared to other platforms. Pathway enrichment analyses revealed genes correlate to not only erythropoiesis and oxygen transport but also a wide range of other functions, such as tissue protection and immune regulation. This study provides a knowledge base of genes relevant to EPO biology through cross-platform comparisons and validation.

Subject(s)

Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, RNA/methods , Erythropoiesis/genetics , Erythropoietin/genetics , Gene Expression/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , RNA/genetics , RNA-Seq/methods , Transcriptome/genetics

17.

Network propagation-based prioritization of long tail genes in 17 cancer types.

Mohsen, Hussein; Gunasekharan, Vignesh; Qing, Tao; Seay, Montrell; Surovtseva, Yulia; Negahban, Sahand; Szallasi, Zoltan; Pusztai, Lajos; Gerstein, Mark B.

Genome Biol ; 22(1): 287, 2021 10 07.

Article in English | MEDLINE | ID: mdl-34620211

ABSTRACT

BACKGROUND: The diversity of genomic alterations in cancer poses challenges to fully understanding the etiologies of the disease. Recent interest in infrequent mutations, in genes that reside in the "long tail" of the mutational distribution, uncovered new genes with significant implications in cancer development. The study of cancer-relevant genes often requires integrative approaches pooling together multiple types of biological data. Network propagation methods demonstrate high efficacy in achieving this integration. Yet, the majority of these methods focus their assessment on detecting known cancer genes or identifying altered subnetworks. In this paper, we introduce a network propagation approach that entirely focuses on prioritizing long tail genes with potential functional impact on cancer development. RESULTS: We identify sets of often overlooked, rarely to moderately mutated genes whose biological interactions significantly propel their mutation-frequency-based rank upwards during propagation in 17 cancer types. We call these sets "upward mobility genes" and hypothesize that their significant rank improvement indicates functional importance. We report new cancer-pathway associations based on upward mobility genes that are not previously identified using driver genes alone, validate their role in cancer cell survival in vitro using extensive genome-wide RNAi and CRISPR data repositories, and further conduct in vitro functional screenings resulting in the validation of 18 previously unreported genes. CONCLUSION: Our analysis extends the spectrum of cancer-relevant genes and identifies novel potential therapeutic targets.
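The idea of a gene's rank moving upward during propagation can be illustrated on a toy network. The abstract does not specify the propagation algorithm, so the random walk with restart below is an assumed stand-in. The gene names, mutation counts, and edges are hypothetical.

```python
def propagate(adjacency, scores, restart=0.5, iterations=100):
    """Random walk with restart over an undirected gene network.

    Iterates p <- restart * p0 + (1 - restart) * W p, where p0 is the
    normalized initial score and W splits each gene's mass evenly
    among its neighbors.
    """
    genes = list(scores)
    total = sum(scores.values())
    p0 = {g: scores[g] / total for g in genes}
    p = dict(p0)
    for _ in range(iterations):
        updated = {}
        for g in genes:
            # Mass flowing into g: each neighbor shares its mass equally.
            inflow = sum(p[nb] / len(adjacency[nb]) for nb in adjacency[g])
            updated[g] = restart * p0[g] + (1.0 - restart) * inflow
        p = updated
    return p


def ranks(values):
    """Map each gene to its rank (1 = highest value)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


# Toy data (hypothetical): D is rarely mutated but interacts with the
# three most frequently mutated genes; E has no interactions.
mutation_counts = {"A": 10, "B": 9, "C": 8, "E": 3, "D": 1}
network = {"A": ["D"], "B": ["D"], "C": ["D"], "D": ["A", "B", "C"], "E": []}

rank_before = ranks(mutation_counts)
rank_after = ranks(propagate(network, mutation_counts))
```

In this toy case gene D moves from the last rank by mutation count to the first rank after propagation. This is the kind of rank improvement the abstract refers to as upward mobility.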

Subject(s)

Genes, Neoplasm , Neoplasms/genetics , Cell Survival , Genes, Neoplasm/drug effects , Humans , Mutation , Neoplasms/metabolism , Protein Interaction Mapping

18.

Establishing a Global Standard for Wearable Devices in Sport and Exercise Medicine: Perspectives from Academic and Industry Stakeholders.

Ash, Garrett I; Stults-Kolehmainen, Matthew; Busa, Michael A; Gaffey, Allison E; Angeloudis, Konstantinos; Muniz-Pardos, Borja; Gregory, Robert; Huggins, Robert A; Redeker, Nancy S; Weinzimer, Stuart A; Grieco, Lauren A; Lyden, Kate; Megally, Esmeralda; Vogiatzis, Ioannis; Scher, LaurieAnn; Zhu, Xinxin; Baker, Julien S; Brandt, Cynthia; Businelle, Michael S; Fucito, Lisa M; Griggs, Stephanie; Jarrin, Robert; Mortazavi, Bobak J; Prioleau, Temiloluwa; Roberts, Walter; Spanakis, Elias K; Nally, Laura M; Debruyne, Andre; Bachl, Norbert; Pigozzi, Fabio; Halabchi, Farzin; Ramagole, Dimakatso A; Janse van Rensburg, Dina C; Wolfarth, Bernd; Fossati, Chiara; Rozenstoka, Sandra; Tanisawa, Kumpei; Börjesson, Mats; Casajus, José Antonio; Gonzalez-Aguero, Alex; Zelenkova, Irina; Swart, Jeroen; Gursoy, Gamze; Meyerson, William; Liu, Jason; Greenbaum, Dov; Pitsiladis, Yannis P; Gerstein, Mark B.

Sports Med ; 51(11): 2237-2250, 2021 Nov.

Article in English | MEDLINE | ID: mdl-34468950

ABSTRACT

Millions of consumer sport and fitness wearables (CSFWs) are used worldwide, and millions of datapoints are generated by each device. Moreover, these numbers are rapidly growing, and they contain a heterogeneity of devices, data types, and contexts for data collection. Companies and consumers would benefit from guiding standards on device quality and data formats. To address this growing need, we convened a virtual panel of industry and academic stakeholders, and this manuscript summarizes the outcomes of the discussion. Our objectives were to identify (1) key facilitators of and barriers to participation by CSFW manufacturers in guiding standards and (2) stakeholder priorities. The venues were the Yale Center for Biomedical Data Science Digital Health Monthly Seminar Series (62 participants) and the New England Chapter of the American College of Sports Medicine Annual Meeting (59 participants). In the discussion, stakeholders outlined both facilitators of (e.g., commercial return on investment in device quality, lucrative research partnerships, and transparent and multilevel evaluation of device quality) and barriers (e.g., competitive advantage conflict, lack of flexibility in previously developed devices) to participation in guiding standards. There was general agreement to adopt Keadle et al.'s standard pathway for testing devices (i.e., benchtop, laboratory, field-based, implementation) without consensus on the prioritization of these steps. Overall, there was enthusiasm not to add prescriptive or regulatory steps, but instead create a networking hub that connects companies to consumers and researchers for flexible guidance navigating the heterogeneity, multi-tiered development, dynamicity, and nebulousness of the CSFW field.

Subject(s)

Sports Medicine , Sports , Wearable Electronic Devices , Consensus , Exercise , Humans

19.

Nodal modulator (NOMO) is required to sustain endoplasmic reticulum morphology.

Amaya, Catherine; Cameron, Christopher J F; Devarkar, Swapnil C; Seager, Sebastian J H; Gerstein, Mark B; Xiong, Yong; Schlieker, Christian.

J Biol Chem ; 297(2): 100937, 2021 Aug.

Article in English | MEDLINE | ID: mdl-34224731

ABSTRACT

The endoplasmic reticulum (ER) is a membrane-bound organelle responsible for protein folding, lipid synthesis, and calcium homeostasis. Maintenance of ER structural integrity is crucial for proper function, but much remains to be learned about the molecular players involved. To identify proteins that support the structure of the ER, we performed a proteomic screen and identified nodal modulator (NOMO), a widely conserved type I transmembrane protein of unknown function, with three nearly identical orthologs specified in the human genome. We found that overexpression of NOMO1 imposes a sheet morphology on the ER, whereas depletion of NOMO1 and its orthologs causes a collapse of ER morphology concomitant with the formation of membrane-delineated holes in the ER network positive for the lysosomal marker lysosomal-associated protein 1. In addition, the levels of key players of autophagy including microtubule-associated protein light chain 3 and autophagy cargo receptor p62/sequestosome 1 strongly increase upon NOMO depletion. In vitro reconstitution of NOMO1 revealed a "beads on a string" structure likely representing consecutive immunoglobulin-like domains. Extending NOMO1 by insertion of additional immunoglobulin folds results in a correlative increase in the ER intermembrane distance. Based on these observations and a genetic epistasis analysis including the known ER-shaping proteins Atlastin2 and Climp63, we propose a role for NOMO1 in the functional network of ER-shaping proteins.

Subject(s)

Endoplasmic Reticulum , Proteomics , Sequestosome-1 Protein , Autophagy , Endoplasmic Reticulum Stress , Homeostasis , Humans , Lysosomes/metabolism

20.

STK11/LKB1 Loss of Function Is Associated with Global DNA Hypomethylation and S-Adenosyl-Methionine Depletion in Human Lung Adenocarcinoma.

Koenig, Michael J; Agana, Bernice A; Kaufman, Jacob M; Sharpnack, Michael F; Wang, Walter Z; Weigel, Christoph; Navarro, Fabio C P; Amann, Joseph M; Cacciato, Nicole; Arasada, Rajeswara Rao; Gerstein, Mark B; Wysocki, Vicki H; Oakes, Christopher; Carbone, David P.

Cancer Res ; 81(16): 4194-4204, 2021 Aug 15.

Article in English | MEDLINE | ID: mdl-34045189

ABSTRACT

STK11 (liver kinase B1, LKB1) is the fourth most frequently mutated gene in lung adenocarcinoma, with loss of function observed in up to 30% of all cases. Our previous work identified a 16-gene signature for LKB1 loss of function through mutational and nonmutational mechanisms. In this study, we applied this genetic signature to The Cancer Genome Atlas (TCGA) lung adenocarcinoma samples and discovered a novel association between LKB1 loss and widespread DNA demethylation. LKB1-deficient tumors showed depletion of S-adenosyl-methionine (SAM-e), which is the primary substrate for DNMT1 activity. Lower methylation following LKB1 loss involved repetitive elements (RE) and altered RE transcription, as well as decreased sensitivity to azacytidine. Demethylated CpGs were enriched for FOXA family consensus binding sites, and nuclear expression, localization, and turnover of FOXA was dependent upon LKB1. Overall, these findings demonstrate that a large number of lung adenocarcinomas exhibit global hypomethylation driven by LKB1 loss, which has implications for both epigenetic therapy and immunotherapy in these cancers. SIGNIFICANCE: Lung adenocarcinomas with LKB1 loss demonstrate global genomic hypomethylation associated with depletion of SAM-e, reduced expression of DNMT1, and increased transcription of repetitive elements.

Subject(s)

AMP-Activated Protein Kinase Kinases/physiology , Adenocarcinoma/genetics , DNA Methylation , Lung Neoplasms/genetics , S-Adenosylmethionine/metabolism , AMP-Activated Protein Kinase Kinases/genetics , Adenocarcinoma/metabolism , Cell Line , Cell Survival , Cluster Analysis , Computational Biology , CpG Islands , Databases, Genetic , Epigenesis, Genetic , Genes, ras , Humans , Lung Neoplasms/metabolism , Methionine , Mutation , Oligonucleotide Array Sequence Analysis , Proto-Oncogene Proteins p21(ras)/genetics , Repetitive Sequences, Nucleic Acid
