ABSTRACT
The dichotomous model of "drivers" and "passengers" in cancer posits that only a few mutations in a tumor strongly affect its progression, with the remaining ones being inconsequential. Here, we leveraged the comprehensive variant dataset from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) project to demonstrate that, in addition to the dichotomy of high- and low-impact variants, there is a third group of medium-impact putative passengers. Moreover, we found that molecular impact correlates with subclonal architecture (i.e., early versus late mutations), and that different signatures encode mutations with divergent impact. Furthermore, we adapted an additive-effects model from complex-trait studies to show that the aggregated effect of putative passengers, including undetected weak drivers, provides significant additional power (~12% additive variance) for predicting cancerous phenotypes, beyond PCAWG-identified driver mutations. Finally, this framework allowed us to estimate the frequency of potential weak-driver mutations in PCAWG samples lacking any well-characterized driver alterations.
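For orientation, an additive-effects model of the kind used in complex-trait studies is often written as a linear mixed model. The form below is a generic sketch, not the exact specification used in the study. All symbols are assumptions introduced here for illustration: y is the phenotype vector, X holds fixed-effect covariates (which could include known driver status), A is a sample-by-sample similarity matrix computed from putative passenger variants, and the ratio in the last line is the fraction of phenotypic variance attributed to the aggregated additive effect.

```latex
\begin{aligned}
\mathbf{y} &= X\boldsymbol{\beta} + \mathbf{g} + \boldsymbol{\varepsilon}, \\
\mathbf{g} &\sim \mathcal{N}\!\left(\mathbf{0},\, A\,\sigma_g^2\right), \qquad
\boldsymbol{\varepsilon} \sim \mathcal{N}\!\left(\mathbf{0},\, I\,\sigma_e^2\right), \\
h^2 &= \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
\end{aligned}
```

Under this reading, the reported ~12% additive variance would correspond to an estimate of the ratio in the last line of roughly 0.12.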
Subject(s)
Genome, Human/genetics , Genomics/methods , Mutation/genetics , Neoplasms/genetics , DNA Mutational Analysis/methods , Disease Progression , Humans , Neoplasms/pathology , Whole Genome Sequencing
ABSTRACT
The Extracellular RNA Communication Consortium (ERCC) was launched to accelerate progress in the new field of extracellular RNA (exRNA) biology and to establish whether exRNAs and their carriers, including extracellular vesicles (EVs), can mediate intercellular communication and be utilized for clinical applications. Phase 1 of the ERCC focused on exRNA/EV biogenesis and function, discovery of exRNA biomarkers, development of exRNA/EV-based therapeutics, and construction of a robust set of reference exRNA profiles for a variety of biofluids. Here, we present progress by ERCC investigators in these areas, and we discuss collaborative projects directed at development of robust methods for EV/exRNA isolation and analysis and tools for sharing and computational analysis of exRNA profiling data.
Subject(s)
Cell-Free Nucleic Acids/genetics , Cell-Free Nucleic Acids/metabolism , Extracellular Vesicles/genetics , Biomarkers , Humans , Knowledge Bases , MicroRNAs/genetics , RNA/genetics
ABSTRACT
The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values. Here, we review these challenges and present potential solutions for mitigating privacy risks while allowing broad data dissemination and analysis.
Subject(s)
Genetic Privacy , Privacy , Genomics , High-Throughput Nucleotide Sequencing , Risk Assessment
ABSTRACT
Protein phase transitions (PPTs) from the soluble state to a dense liquid phase (forming droplets via liquid-liquid phase separation) or to solid aggregates (such as amyloids) play key roles in pathological processes associated with age-related diseases such as Alzheimer's disease. Several computational frameworks are capable of separately predicting the formation of droplets or amyloid aggregates based on protein sequences, yet none have tackled the prediction of both within a unified framework. Recently, large language models (LLMs) have exhibited great success in protein structure prediction; however, they have not yet been used for PPTs. Here, we fine-tune an LLM for predicting PPTs and demonstrate its usage in evaluating how sequence variants affect PPTs, an operation useful for protein design. In addition, we show its superior performance compared to suitable classical benchmarks. Due to the "black-box" nature of the LLM, we also employ a classical random forest model along with biophysical features to facilitate interpretation. Finally, focusing on Alzheimer's disease-related proteins, we demonstrate that greater aggregation is associated with reduced gene expression in Alzheimer's disease, suggesting a natural defense mechanism.
Subject(s)
Alzheimer Disease , Phase Transition , Alzheimer Disease/metabolism , Humans , Amyloid/metabolism , Amyloid/chemistry , Proteins/chemistry , Proteins/metabolism
ABSTRACT
Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ~5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ~20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ~30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.
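The idea of aligning a sparse SNP set piecewise to a haplotype panel with an HMM of recombination and mutation can be illustrated with a much-simplified copying model. The sketch below is not PLIGHT's implementation: it assumes a haploid query (PLIGHT works with genotypes), a uniform switch probability standing in for recombination, and a single error probability standing in for mutation and noise. The function name and parameter values are hypothetical.

```python
import math


def viterbi_copying_path(query, panel, switch_prob=0.01, error_prob=0.001):
    """Most likely sequence of panel haplotypes "copied" by a haploid query.

    query: list of alleles (0, 1, or None for an unobserved site).
    panel: list of reference haplotypes, each a list of 0/1 alleles with the
           same length as the query.
    switch_prob and error_prob are illustrative stand-ins for recombination
    and mutation/noise rates; they are assumptions, not fitted values.
    """
    n_hap, n_sites = len(panel), len(query)
    # Staying on the same haplotype also absorbs the chance of "switching"
    # back to it, as in standard copying models.
    log_stay = math.log(1.0 - switch_prob + switch_prob / n_hap)
    log_switch = math.log(switch_prob / n_hap)

    def log_emit(h, s):
        if query[s] is None:  # missing observation carries no information
            return 0.0
        match = panel[h][s] == query[s]
        return math.log(1.0 - error_prob) if match else math.log(error_prob)

    score = [-math.log(n_hap) + log_emit(h, 0) for h in range(n_hap)]
    back = []
    for s in range(1, n_sites):
        # All off-diagonal transitions share one probability, so the best
        # source of a switch is simply the best-scoring previous state.
        best_prev = max(range(n_hap), key=lambda h: score[h])
        new_score, pointers = [], []
        for h in range(n_hap):
            stay = score[h] + log_stay
            jump = score[best_prev] + log_switch
            if stay >= jump:
                new_score.append(stay + log_emit(h, s))
                pointers.append(h)
            else:
                new_score.append(jump + log_emit(h, s))
                pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_hap), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


panel = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1]]
print(viterbi_copying_path([0, 1, 0, 1, 0, 1], panel))
```

In this toy setting, a query that matches one panel haplotype at every site is assigned to that haplotype throughout, while a query that follows one haplotype and then another is split at the point where a single switch is cheaper than a run of mismatches.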
Subject(s)
Genotype , Haplotypes , Polymorphism, Single Nucleotide , Humans , Databases, Genetic , Markov Chains , Software , Genetic Privacy , Algorithms , Sequence Alignment , Genetics, Population/methods
ABSTRACT
The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression [1]. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.
Subject(s)
Databases, Genetic , Genome/genetics , Genomics , Molecular Sequence Annotation , Animals , Binding Sites , Chromatin/genetics , Chromatin/metabolism , DNA Methylation , Databases, Genetic/standards , Databases, Genetic/trends , Gene Expression Regulation/genetics , Genome, Human/genetics , Genomics/standards , Genomics/trends , Histones/metabolism , Humans , Mice , Molecular Sequence Annotation/standards , Quality Control , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/metabolism
ABSTRACT
MOTIVATION: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain. RESULTS: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks. AVAILABILITY AND IMPLEMENTATION: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.
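Contrastive learning over paired molecule and text embeddings is commonly implemented with an InfoNCE-style loss, in which each molecule should score higher against its own text than against the other texts in the batch. The sketch below shows that generic loss in one direction (molecule to text) on plain Python lists. It is an illustration under stated assumptions, not MolLM's training code: the function names, the temperature value, and the use of cosine similarity are all choices made here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(mol_embs, text_embs, temperature=0.1):
    """Mean molecule-to-text InfoNCE loss for a batch of paired embeddings.

    mol_embs[i] and text_embs[i] are assumed to describe the same molecule;
    every other text in the batch acts as a negative for molecule i.
    """
    n = len(mol_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(mol_embs[i], text_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability assigned to the true pair.
        total += log_denom - logits[i]
    return total / n


mols = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(mols, aligned_texts), info_nce(mols, swapped_texts))
```

The loss is close to zero when each molecule embedding points in the same direction as its paired text and grows when the pairs are swapped, which is the signal that pulls matching pairs together during pre-training.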
Subject(s)
Natural Language Processing , Deep Learning , Computational Biology/methods
ABSTRACT
SUMMARY: Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%). AVAILABILITY AND IMPLEMENTATION: All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.
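Pass@K is usually computed with an unbiased estimator: for each problem, generate n samples, count the c that pass the tests, and estimate the probability that at least one of k randomly drawn samples passes. The sketch below assumes that standard estimator; the abstract does not state which variant BioCoder uses, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a pass.
        return 1.0
    # One minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


print(pass_at_k(10, 5, 1))                     # half the samples pass
print(benchmark_pass_at_k([(4, 1), (4, 4)], 2))
```

Averaging the per-problem estimates, rather than pooling all samples, keeps problems with many generated samples from dominating the benchmark score.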
Subject(s)
Algorithms , Benchmarking , Computational Biology , Programming Languages , Software , Computational Biology/methods , Benchmarking/methods
ABSTRACT
Every cell in the human body inherits a copy of the same genetic information. The three billion base pairs of DNA in the human genome, and the roughly 50 000 coding and non-coding genes they contain, must thus encode all the complexity of human development and cell and tissue type diversity. Differences in gene regulation, or the modulation of gene expression, enable individual cells to interpret the genome differently to carry out their specific functions. Here we discuss recent and ongoing efforts to build gene regulatory maps, which aim to characterize the regulatory roles of all sequences in a genome. Many researchers and consortia have identified such regulatory elements using functional assays and evolutionary analyses; we discuss the results, strengths and shortcomings of their approaches. We also discuss new techniques the field can leverage and emerging challenges it will face while striving to build gene regulatory maps of ever-increasing resolution and comprehensiveness.
Subject(s)
Gene Expression Regulation , Regulatory Sequences, Nucleic Acid , Humans , Gene Expression Regulation/genetics , Genome, Human/genetics , Chromosome Mapping , DNA/genetics
ABSTRACT
Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.
Subject(s)
Genome, Human/genetics , Genomic Structural Variation , Genomics/methods , Goals , Whole Genome Sequencing/methods , Whole Genome Sequencing/standards , DNA Copy Number Variations , Exons/genetics , Humans , Research Design , Segmental Duplications, Genomic , Sequence Alignment
ABSTRACT
MOTIVATION: While many quantum computing (QC) methods promise theoretical advantages over classical counterparts, quantum hardware remains limited. Exploiting near-term QC in computer-aided drug design (CADD) thus requires judicious partitioning between classical and quantum calculations. RESULTS: We present HypaCADD, a hybrid classical-quantum workflow for finding ligands binding to proteins, while accounting for genetic mutations. We explicitly identify modules of our drug-design workflow currently amenable to replacement by QC: non-intuitively, we identify the mutation-impact predictor as the best candidate. HypaCADD thus combines classical docking and molecular dynamics with quantum machine learning (QML) to infer the impact of mutations. We present a case study with the coronavirus (SARS-CoV-2) protease and associated mutants. We map a classical machine-learning module onto QC, using a neural network constructed from qubit-rotation gates. We have implemented this in simulation and on two commercial quantum computers. We find that the QML models can perform on par with, if not better than, classical baselines. In summary, HypaCADD offers a successful strategy for leveraging QC for CADD. AVAILABILITY AND IMPLEMENTATION: Jupyter Notebooks with Python code are freely available for academic use on GitHub: https://www.github.com/hypahub/hypacadd_notebook. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
COVID-19 , Software , Humans , Workflow , Computing Methodologies , Quantum Theory , SARS-CoV-2 , Drug Design , Molecular Dynamics Simulation
ABSTRACT
We describe a chemical method to label and purify 4-thiouridine (s(4)U)-containing RNA. We demonstrate that methanethiosulfonate (MTS) reagents form disulfide bonds with s(4)U more efficiently than the commonly used HPDP-biotin, leading to higher yields and less biased enrichment. This increase in efficiency allowed us to use s(4)U labeling to study global microRNA (miRNA) turnover in proliferating cultured human cells without perturbing global miRNA levels or the miRNA processing machinery. This improved chemistry will enhance methods that depend on tracking different populations of RNA, such as 4-thiouridine tagging to study tissue-specific transcription and dynamic transcriptome analysis (DTA) to study RNA turnover.
Subject(s)
MicroRNAs/chemistry , Biotin/analogs & derivatives , Cell Proliferation , Disulfides , Gene Expression Profiling/methods , HEK293 Cells , Humans , Indicators and Reagents , Mesylates , MicroRNAs/genetics , MicroRNAs/metabolism , Organic Chemistry Phenomena , RNA Processing, Post-Transcriptional , Thiouridine/chemistry
ABSTRACT
The endoplasmic reticulum (ER) is a membrane-bound organelle responsible for protein folding, lipid synthesis, and calcium homeostasis. Maintenance of ER structural integrity is crucial for proper function, but much remains to be learned about the molecular players involved. To identify proteins that support the structure of the ER, we performed a proteomic screen and identified nodal modulator (NOMO), a widely conserved type I transmembrane protein of unknown function, with three nearly identical orthologs specified in the human genome. We found that overexpression of NOMO1 imposes a sheet morphology on the ER, whereas depletion of NOMO1 and its orthologs causes a collapse of ER morphology concomitant with the formation of membrane-delineated holes in the ER network positive for the lysosomal marker lysosomal-associated protein 1. In addition, the levels of key players of autophagy including microtubule-associated protein light chain 3 and autophagy cargo receptor p62/sequestosome 1 strongly increase upon NOMO depletion. In vitro reconstitution of NOMO1 revealed a "beads on a string" structure likely representing consecutive immunoglobulin-like domains. Extending NOMO1 by insertion of additional immunoglobulin folds results in a correlative increase in the ER intermembrane distance. Based on these observations and a genetic epistasis analysis including the known ER-shaping proteins Atlastin2 and Climp63, we propose a role for NOMO1 in the functional network of ER-shaping proteins.
Subject(s)
Endoplasmic Reticulum , Proteomics , Sequestosome-1 Protein , Autophagy , Endoplasmic Reticulum Stress , Homeostasis , Humans , Lysosomes/metabolism
ABSTRACT
Large-scale exome sequencing of tumors has enabled the identification of cancer drivers using recurrence-based approaches. Some of these methods also employ 3D protein structures to identify mutational hotspots in cancer-associated genes. In determining such mutational clusters in structures, existing approaches overlook protein dynamics, despite its essential role in protein function. We present a framework to identify cancer driver genes using a dynamics-based search of mutational hotspot communities. Mutations are mapped to protein structures, which are partitioned into distinct residue communities. These communities are identified in a framework where residue-residue contact edges are weighted by correlated motions (as inferred by dynamics-based models). We then search for signals of positive selection among these residue communities to identify putative driver genes, while applying our method to the TCGA (The Cancer Genome Atlas) PanCancer Atlas missense mutation catalog. Overall, we predict 1 or more mutational hotspots within the resolved structures of proteins encoded by 434 genes. These genes were enriched among biological processes associated with tumor progression. Additionally, a comparison between our approach and existing cancer hotspot detection methods using structural data suggests that including protein dynamics significantly increases the sensitivity of driver detection.
Subject(s)
Computational Biology/methods , Genomics/methods , Neoplasm Proteins/chemistry , Neoplasm Proteins/genetics , Neoplasms/genetics , Databases, Genetic , Exome/genetics , Humans , Mutation , Protein Conformation , Reproducibility of Results , Workflow
ABSTRACT
Predicting mutation-induced changes in protein thermodynamic stability (ΔΔG) is of great interest in protein engineering, variant interpretation, and protein biophysics. We introduce ThermoNet, a deep, 3D-convolutional neural network (3D-CNN) designed for structure-based prediction of ΔΔGs upon point mutation. To leverage the image-processing power inherent in CNNs, we treat protein structures as if they were multi-channel 3D images. In particular, the inputs to ThermoNet are uniformly constructed as multi-channel voxel grids based on biophysical properties derived from raw atom coordinates. We train and evaluate ThermoNet with a curated data set that accounts for protein homology and is balanced with direct and reverse mutations; this provides a framework for addressing biases that have likely influenced many previous ΔΔG prediction methods. ThermoNet demonstrates performance comparable to the best available methods on the widely used Ssym test set. In addition, ThermoNet accurately predicts the effects of both stabilizing and destabilizing mutations, while most other methods exhibit a strong bias towards predicting destabilization. We further show that homology between Ssym and widely used training sets like S2648 and VariBench has likely led to overestimated performance in previous studies. Finally, we demonstrate the practical utility of ThermoNet in predicting the ΔΔGs for two clinically relevant proteins, p53 and myoglobin, and for pathogenic and benign missense variants from ClinVar. Overall, our results suggest that 3D-CNNs can model the complex, non-linear interactions perturbed by mutations, directly from biophysical properties of atoms.
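Treating a structure as a multi-channel 3D image amounts to binning atoms into a cubic grid around the mutation site, with one grid per biophysical property. The sketch below shows that binning step only. The grid size, voxel spacing, channel names, and simple occupancy counting are assumptions made here for illustration; the abstract does not specify ThermoNet's actual featurization.

```python
def voxelize(atoms, center, grid_size=16, spacing=1.0,
             channels=("hydrophobic", "aromatic")):
    """Bin atoms into per-channel occupancy grids centred on a site.

    atoms:   list of dicts with "xyz" (x, y, z) and "props" (set of channel
             names the atom contributes to). Both keys are hypothetical.
    center:  (x, y, z) of the grid centre, e.g. the mutated residue.
    Returns a dict mapping channel name to a grid_size^3 nested list.
    """
    half = grid_size * spacing / 2.0
    grid = {
        name: [[[0.0] * grid_size for _ in range(grid_size)]
               for _ in range(grid_size)]
        for name in channels
    }
    for atom in atoms:
        # Convert coordinates to voxel indices relative to the grid corner.
        idx = [int((coord - c + half) // spacing)
               for coord, c in zip(atom["xyz"], center)]
        if not all(0 <= i < grid_size for i in idx):
            continue  # atom lies outside the box around the site
        for name in atom["props"]:
            if name in grid:
                grid[name][idx[0]][idx[1]][idx[2]] += 1.0
    return grid


atoms = [
    {"xyz": (0.2, 0.1, 0.4), "props": {"hydrophobic"}},
    {"xyz": (2.5, 0.0, 0.0), "props": {"hydrophobic", "aromatic"}},
    {"xyz": (100.0, 0.0, 0.0), "props": {"aromatic"}},  # outside the box
]
grids = voxelize(atoms, center=(0.0, 0.0, 0.0))
print(grids["hydrophobic"][8][8][8], grids["aromatic"][10][8][8])
```

A 3D-CNN would then take the stacked channel grids as input in the same way a 2D CNN takes the colour channels of an image.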
Subject(s)
Imaging, Three-Dimensional/methods , Neural Networks, Computer , Point Mutation , Proteins/chemistry , Thermodynamics , Computational Biology , Protein Stability
ABSTRACT
During the maternal-to-zygotic transition (MZT), transcriptionally silent embryos rely on post-transcriptional regulation of maternal mRNAs until zygotic genome activation (ZGA). RNA-binding proteins (RBPs) are important regulators of post-transcriptional RNA processing events, yet their identities and functions during developmental transitions in vertebrates remain largely unexplored. Using mRNA interactome capture, we identified 227 RBPs in zebrafish embryos before and during ZGA, hereby named the zebrafish MZT mRNA-bound proteome. This protein constellation consists of many conserved RBPs, some of which are potential stage-specific mRNA interactors that likely reflect the dynamics of RNA-protein interactions during MZT. The enrichment of numerous splicing factors like hnRNP proteins before ZGA was surprising, because maternal mRNAs were found to be fully spliced. To address potentially unique roles of these RBPs in embryogenesis, we focused on Hnrnpa1. iCLIP and subsequent mRNA reporter assays revealed a function for Hnrnpa1 in the regulation of poly(A) tail length and translation of maternal mRNAs through sequence-specific association with 3' UTRs before ZGA. Comparison of iCLIP data from two developmental stages revealed that Hnrnpa1 dissociates from maternal mRNAs at ZGA and instead regulates the nuclear processing of pri-mir-430 transcripts, which we validated experimentally. The shift from cytoplasmic to nuclear RNA targets was accompanied by a dramatic translocation of Hnrnpa1 and other pre-mRNA splicing factors to the nucleus in a transcription-dependent manner. Thus, our study identifies global changes in RNA-protein interactions during vertebrate MZT and shows that Hnrnpa1 RNA-binding activities are spatially and temporally coordinated to regulate RNA metabolism during early development.
Subject(s)
3' Untranslated Regions , MicroRNAs/metabolism , Zebrafish/metabolism , Zygote/metabolism , Animals , Heterogeneous Nuclear Ribonucleoprotein A1/genetics , Heterogeneous Nuclear Ribonucleoprotein A1/metabolism , MicroRNAs/genetics , Zebrafish/genetics , Zebrafish Proteins/genetics , Zebrafish Proteins/metabolism
ABSTRACT
The Long interspersed nuclear element 1 (LINE-1) is a primary source of genetic variation in humans and other mammals. Despite its importance, LINE-1 activity remains difficult to study because of its highly repetitive nature. Here, we developed and validated a method called TeXP to gauge LINE-1 activity accurately. TeXP builds mappability signatures from LINE-1 subfamilies to deconvolve the effect of pervasive transcription from autonomous LINE-1 activity. In particular, it apportions the multiple reads aligned to the many LINE-1 instances in the genome into these two categories. Using our method, we evaluated well-established cell lines, cell-line compartments and healthy tissues and found that the vast majority (91.7%) of transcriptome reads overlapping LINE-1 derive from pervasive transcription. We validated TeXP by independently estimating the levels of LINE-1 autonomous transcription using ddPCR, finding high concordance. Next, we applied our method to comprehensively measure LINE-1 activity across healthy somatic cells, while backing out the effect of pervasive transcription. Unexpectedly, we found that LINE-1 activity is present in many normal somatic cells. This finding contrasts with earlier studies showing that LINE-1 has limited activity in healthy somatic tissues, except for neuroprogenitor cells. Interestingly, we found that the amount of LINE-1 activity was associated with the amount of cell turnover, with tissues with low cell turnover rates (e.g. the adult central nervous system) showing lower LINE-1 activity. Altogether, our results show how accounting for pervasive transcription is critical to accurately quantify the activity of highly repetitive regions of the human genome.
Subject(s)
DNA Transposable Elements/genetics , Long Interspersed Nucleotide Elements/genetics , Models, Genetic , Transcription, Genetic , Animals , Cell Line , Computational Biology , Genetic Techniques/statistics & numerical data , Genome, Human , Humans , Sequence Analysis, RNA/statistics & numerical data
ABSTRACT
The transcriptome is the readout of the genome. Identifying common features in it across distant species can reveal fundamental principles. To this end, the ENCODE and modENCODE consortia have generated large amounts of matched RNA-sequencing data for human, worm and fly. Uniform processing and comprehensive annotation of these data allow comparison across metazoan phyla, extending beyond earlier within-phylum transcriptome comparisons and revealing ancient, conserved features. Specifically, we discover co-expression modules shared across animals, many of which are enriched in developmental genes. Moreover, we use expression patterns to align the stages in worm and fly development and find a novel pairing between worm embryo and fly pupae, in addition to the embryo-to-embryo and larvae-to-larvae pairings. Furthermore, we find that the extent of non-canonical, non-coding transcription is similar in each organism, per base pair. Finally, we find in all three organisms that the gene-expression levels, both coding and non-coding, can be quantitatively predicted from chromatin features at the promoter using a 'universal model' based on a single set of organism-independent parameters.