Results 1 - 20 of 563
1.
Cell ; 2024 Sep 25.
Article in English | MEDLINE | ID: mdl-39353436

ABSTRACT

The capability to spatially explore RNA biology in formalin-fixed paraffin-embedded (FFPE) tissues holds transformative potential for histopathology research. Here, we present pathology-compatible deterministic barcoding in tissue (Patho-DBiT) by combining in situ polyadenylation and computational innovation for spatial whole transcriptome sequencing, tailored to probe the diverse RNA species in clinically archived FFPE samples. It permits spatial co-profiling of gene expression and RNA processing, unveiling region-specific splicing isoforms, and high-sensitivity transcriptomic mapping of clinical tumor FFPE tissues stored for 5 years. Furthermore, genome-wide single-nucleotide RNA variants can be captured to distinguish malignant subclones from non-malignant cells in human lymphomas. Patho-DBiT also maps microRNA regulatory networks and RNA splicing dynamics, decoding their roles in spatial tumorigenesis. Single-cell level Patho-DBiT dissects the spatiotemporal cellular dynamics driving tumor clonal architecture and progression. Patho-DBiT stands poised as a valuable platform to unravel rich RNA biology in FFPE tissues to aid in clinical pathology evaluation.

2.
Cell ; 183(4): 905-917.e16, 2020 Nov 12.
Article in English | MEDLINE | ID: mdl-33186529

ABSTRACT

The generation of functional genomics datasets is surging, because they provide insight into gene regulation and organismal phenotypes (e.g., genes upregulated in cancer). The intent behind functional genomics experiments is not necessarily to study genetic variants, yet they pose privacy concerns due to their use of next-generation sequencing. Moreover, there is a great incentive to broadly share raw reads for better statistical power and general research reproducibility. Thus, we need new modes of sharing beyond traditional controlled-access models. Here, we develop a data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage, enabling principled privacy-utility trade-offs. Our protocol works with traditional Illumina-based assays and newer technologies such as 10x single-cell RNA sequencing. It involves quantifying the privacy leakage in reads by statistically linking study participants to known individuals. We carried out these linkages using data from highly accurate reference genomes and more realistic environmental samples.


Subject(s)
Computer Security; Genomics; Privacy; Genome, Human; Genotype; High-Throughput Nucleotide Sequencing; Humans; Phenotype; Phylogeny; Reproducibility of Results; Sequence Analysis, RNA; Single-Cell Analysis
3.
Cell ; 180(5): 915-927.e16, 2020 Mar 5.
Article in English | MEDLINE | ID: mdl-32084333

ABSTRACT

The dichotomous model of "drivers" and "passengers" in cancer posits that only a few mutations in a tumor strongly affect its progression, with the remaining ones being inconsequential. Here, we leveraged the comprehensive variant dataset from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) project to demonstrate that, in addition to the dichotomy of high- and low-impact variants, there is a third group of medium-impact putative passengers. Moreover, we also found that molecular impact correlates with subclonal architecture (i.e., early versus late mutations), and different signatures encode for mutations with divergent impact. Furthermore, we adapted an additive-effects model from complex-trait studies to show that the aggregated effect of putative passengers, including undetected weak drivers, provides significant additional power (∼12% additive variance) for predicting cancerous phenotypes, beyond PCAWG-identified driver mutations. Finally, this framework allowed us to estimate the frequency of potential weak-driver mutations in PCAWG samples lacking any well-characterized driver alterations.


Subject(s)
Genome, Human/genetics; Genomics/methods; Mutation/genetics; Neoplasms/genetics; DNA Mutational Analysis/methods; Disease Progression; Humans; Neoplasms/pathology; Whole Genome Sequencing
4.
Cell ; 177(2): 231-242, 2019 Apr 4.
Article in English | MEDLINE | ID: mdl-30951667

ABSTRACT

The Extracellular RNA Communication Consortium (ERCC) was launched to accelerate progress in the new field of extracellular RNA (exRNA) biology and to establish whether exRNAs and their carriers, including extracellular vesicles (EVs), can mediate intercellular communication and be utilized for clinical applications. Phase 1 of the ERCC focused on exRNA/EV biogenesis and function, discovery of exRNA biomarkers, development of exRNA/EV-based therapeutics, and construction of a robust set of reference exRNA profiles for a variety of biofluids. Here, we present progress by ERCC investigators in these areas, and we discuss collaborative projects directed at development of robust methods for EV/exRNA isolation and analysis and tools for sharing and computational analysis of exRNA profiling data.


Subject(s)
Cell-Free Nucleic Acids/genetics; Cell-Free Nucleic Acids/metabolism; Extracellular Vesicles/genetics; Biomarkers; Humans; Knowledge Bases; MicroRNAs/genetics; RNA/genetics
5.
Mol Cell ; 83(12): 1983-2002.e11, 2023 Jun 15.
Article in English | MEDLINE | ID: mdl-37295433

ABSTRACT

The evolutionarily conserved minor spliceosome (MiS) is required for protein expression of ∼714 minor intron-containing genes (MIGs) crucial for cell-cycle regulation, DNA repair, and MAP-kinase signaling. We explored the role of MIGs and MiS in cancer, taking prostate cancer (PCa) as an exemplar. Both androgen receptor signaling and elevated levels of U6atac, a MiS small nuclear RNA, regulate MiS activity, which is highest in advanced metastatic PCa. siU6atac-mediated MiS inhibition in PCa in vitro model systems resulted in aberrant minor intron splicing leading to cell-cycle G1 arrest. Small interfering RNA knocking down U6atac was ∼50% more efficient in lowering tumor burden in models of advanced therapy-resistant PCa compared with standard antiandrogen therapy. In lethal PCa, siU6atac disrupted the splicing of a crucial lineage dependency factor, the RE1-silencing factor (REST). Taken together, we have nominated MiS as a vulnerability for lethal PCa and potentially other cancers.


Subject(s)
Prostatic Neoplasms, Castration-Resistant; Prostatic Neoplasms; Male; Humans; Introns/genetics; Prostatic Neoplasms/metabolism; RNA Splicing/genetics; Spliceosomes/metabolism; Signal Transduction; Receptors, Androgen/genetics; Receptors, Androgen/metabolism; Cell Line, Tumor; Prostatic Neoplasms, Castration-Resistant/genetics
6.
Cell ; 162(2): 375-390, 2015 Jul 16.
Article in English | MEDLINE | ID: mdl-26186191

ABSTRACT

Autism spectrum disorder (ASD) is a disorder of brain development. Most cases lack a clear etiology or genetic basis, and the difficulty of re-enacting human brain development has precluded understanding of ASD pathophysiology. Here we use three-dimensional neural cultures (organoids) derived from induced pluripotent stem cells (iPSCs) to investigate neurodevelopmental alterations in individuals with severe idiopathic ASD. While no known underlying genomic mutation could be identified, transcriptome and gene network analyses revealed upregulation of genes involved in cell proliferation, neuronal differentiation, and synaptic assembly. ASD-derived organoids exhibit an accelerated cell cycle and overproduction of GABAergic inhibitory neurons. Using RNA interference, we show that overexpression of the transcription factor FOXG1 is responsible for the overproduction of GABAergic neurons. Altered expression of gene network modules and FOXG1 are positively correlated with symptom severity. Our data suggest that a shift toward GABAergic neuron fate caused by FOXG1 is a developmental precursor of ASD.


Subject(s)
Child Development Disorders, Pervasive/genetics; Child Development Disorders, Pervasive/pathology; Forkhead Transcription Factors/metabolism; Nerve Tissue Proteins/metabolism; Neurogenesis; Telencephalon/embryology; Female; Gene Expression Profiling; Humans; Induced Pluripotent Stem Cells; Male; Megalencephaly/genetics; Megalencephaly/pathology; Models, Biological; Neurons/cytology; Neurons/metabolism; Organoids/pathology; Telencephalon/pathology
7.
Nature ; 613(7942): 96-102, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36517591

ABSTRACT

Expansion of a single repetitive DNA sequence, termed a tandem repeat (TR), is known to cause more than 50 diseases. However, repeat expansions are often not explored beyond neurological and neurodegenerative disorders. In some cancers, mutations accumulate in short tracts of TRs, a phenomenon termed microsatellite instability; however, larger repeat expansions have not been systematically analysed in cancer. Here we identified TR expansions in 2,622 cancer genomes spanning 29 cancer types. In seven cancer types, we found 160 recurrent repeat expansions (rREs), most of which (155/160) were subtype specific. We found that rREs were non-uniformly distributed in the genome with enrichment near candidate cis-regulatory elements, suggesting a potential role in gene regulation. One rRE, a GAAA-repeat expansion, located near a regulatory element in the first intron of UGT2B7 was detected in 34% of renal cell carcinoma samples and was validated by long-read DNA sequencing. Moreover, in preliminary experiments, treating cells that harbour this rRE with a GAAA-targeting molecule led to a dose-dependent decrease in cell proliferation. Overall, our results suggest that rREs may be an important but unexplored source of genetic variation in human cancer, and we provide a comprehensive catalogue for further study.


Subject(s)
DNA Repeat Expansion; Genome, Human; Neoplasms; Humans; Base Sequence; DNA Repeat Expansion/genetics; Genome, Human/genetics; Neoplasms/classification; Neoplasms/genetics; Neoplasms/pathology; Sequence Analysis, DNA; Gene Expression Regulation; Regulatory Elements, Transcriptional/genetics; Introns/genetics; Carcinoma, Renal Cell/genetics; Carcinoma, Renal Cell/pathology; Cell Proliferation/drug effects; Reproducibility of Results
8.
Nat Rev Genet ; 23(4): 245-258, 2022 Apr.
Article in English | MEDLINE | ID: mdl-34759381

ABSTRACT

The generation of functional genomics data by next-generation sequencing has increased greatly in the past decade. Broad sharing of these data is essential for research advancement but poses notable privacy challenges, some of which are analogous to those that occur when sharing genetic variant data. However, there are also unique privacy challenges that arise from cryptic information leakage during the processing and summarization of functional genomics data from raw reads to derived quantities, such as gene expression values. Here, we review these challenges and present potential solutions for mitigating privacy risks while allowing broad data dissemination and analysis.


Subject(s)
Genetic Privacy; Privacy; Genomics; High-Throughput Nucleotide Sequencing; Risk Assessment
9.
Nature ; 611(7936): 532-539, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36323788

ABSTRACT

Neuropsychiatric disorders classically lack defining brain pathologies, but recent work has demonstrated dysregulation at the molecular level, characterized by transcriptomic and epigenetic alterations. In autism spectrum disorder (ASD), this molecular pathology involves the upregulation of microglial, astrocyte and neural-immune genes, the downregulation of synaptic genes, and attenuation of gene-expression gradients in cortex. However, whether these changes are limited to cortical association regions or are more widespread remains unknown. To address this issue, we performed RNA-sequencing analysis of 725 brain samples spanning 11 cortical areas from 112 post-mortem samples from individuals with ASD and neurotypical controls. We find widespread transcriptomic changes across the cortex in ASD, exhibiting an anterior-to-posterior gradient, with the greatest differences in primary visual cortex, coincident with an attenuation of the typical transcriptomic differences between cortical regions. Single-nucleus RNA-sequencing and methylation profiling demonstrate that this robust molecular signature reflects changes in cell-type-specific gene expression, particularly affecting excitatory neurons and glia. Both rare and common ASD-associated genetic variation converge within a downregulated co-expression module involving synaptic signalling, and common variation alone is enriched within a module of upregulated protein chaperone genes. These results highlight widespread molecular changes across the cerebral cortex in ASD, extending beyond association cortex to broadly involve primary sensory regions.


Subject(s)
Autism Spectrum Disorder; Cerebral Cortex; Genetic Variation; Transcriptome; Humans; Autism Spectrum Disorder/genetics; Autism Spectrum Disorder/metabolism; Autism Spectrum Disorder/pathology; Cerebral Cortex/metabolism; Cerebral Cortex/pathology; Neurons/metabolism; RNA/analysis; RNA/genetics; Transcriptome/genetics; Autopsy; Sequence Analysis, RNA; Primary Visual Cortex/metabolism; Neuroglia/metabolism
10.
Cell ; 148(6): 1293-307, 2012 Mar 16.
Article in English | MEDLINE | ID: mdl-22424236

ABSTRACT

Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.


Subject(s)
Genome, Human; Genomics; Precision Medicine; Diabetes Mellitus, Type 2/genetics; Female; Gene Expression Profiling; Humans; Male; Metabolomics; Middle Aged; Mutation; Proteomics; Respiratory Syncytial Viruses/isolation & purification; Rhinovirus/isolation & purification
11.
Cell ; 148(1-2): 84-98, 2012 Jan 20.
Article in English | MEDLINE | ID: mdl-22265404

ABSTRACT

Higher-order chromosomal organization for transcription regulation is poorly understood in eukaryotes. Using genome-wide Chromatin Interaction Analysis with Paired-End-Tag sequencing (ChIA-PET), we mapped long-range chromatin interactions associated with RNA polymerase II in human cells and uncovered widespread promoter-centered intragenic, extragenic, and intergenic interactions. These interactions further aggregated into higher-order clusters, wherein proximal and distal genes were engaged through promoter-promoter interactions. Most genes with promoter-promoter interactions were active and transcribed cooperatively, and some interacting promoters could influence each other, implying combinatorial complexity of transcriptional controls. Comparative analyses of different cell lines showed that cell-specific chromatin interactions could provide structural frameworks for cell-specific transcription, and suggested significant enrichment of enhancer-promoter interactions for cell-specific functions. Furthermore, genetically identified disease-associated noncoding elements were found to be spatially engaged with corresponding genes through long-range interactions. Overall, our study provides insights into transcription regulation by three-dimensional chromatin interactions for both housekeeping and cell-specific genes in human cells.


Subject(s)
Chromatin/metabolism; Gene Expression Regulation; Promoter Regions, Genetic; RNA Polymerase II/metabolism; Transcription, Genetic; Cell Line, Tumor; Chromatin Immunoprecipitation; Enhancer Elements, Genetic; Genome-Wide Association Study; Humans
12.
Proc Natl Acad Sci U S A ; 121(33): e2320510121, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39110734

ABSTRACT

Protein phase transitions (PPTs) from the soluble state to a dense liquid phase (forming droplets via liquid-liquid phase separation) or to solid aggregates (such as amyloids) play key roles in pathological processes associated with age-related diseases such as Alzheimer's disease. Several computational frameworks are capable of separately predicting the formation of droplets or amyloid aggregates based on protein sequences, yet none have tackled the prediction of both within a unified framework. Recently, large language models (LLMs) have exhibited great success in protein structure prediction; however, they have not yet been used for PPTs. Here, we fine-tune an LLM for predicting PPTs and demonstrate its usage in evaluating how sequence variants affect PPTs, an operation useful for protein design. In addition, we show its superior performance compared to suitable classical benchmarks. Due to the "black-box" nature of the LLM, we also employ a classical random forest model along with biophysical features to facilitate interpretation. Finally, focusing on Alzheimer's disease-related proteins, we demonstrate that greater aggregation is associated with reduced gene expression in Alzheimer's disease, suggesting a natural defense mechanism.


Subject(s)
Alzheimer Disease; Phase Transition; Alzheimer Disease/metabolism; Humans; Amyloid/metabolism; Amyloid/chemistry; Proteins/chemistry; Proteins/metabolism
13.
Trends Genet ; 39(6): 442-450, 2023 Jun.
Article in English | MEDLINE | ID: mdl-36858880

ABSTRACT

Genomic studies of human disorders are often performed by distinct research communities (i.e., focused on rare diseases, common diseases, or cancer). Despite underlying differences in the mechanistic origin of different disease categories, these studies share the goal of identifying causal genomic events that are critical for the clinical manifestation of the disease phenotype. Moreover, these studies face common challenges, including understanding the complex genetic architecture of the disease, deciphering the impact of variants on multiple scales, and interpreting noncoding mutations. Here, we highlight these challenges in depth and argue that properly addressing them will require a more unified vocabulary and approach across disease communities. Toward this goal, we present a unified perspective on relating variant impact to various genomic disorders.


Subject(s)
Genome; Genomics; Humans; Mutation; Phenotype
14.
Genome Res ; 33(12): 2156-2173, 2023 Dec 27.
Article in English | MEDLINE | ID: mdl-38097386

ABSTRACT

Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ∼5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ∼20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ∼30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.


Subject(s)
Genotype; Haplotypes; Polymorphism, Single Nucleotide; Humans; Databases, Genetic; Markov Chains; Software; Genetic Privacy; Algorithms; Sequence Alignment; Genetics, Population/methods
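
The abstract above describes aligning a small, noisy SNP set to a reference haplotype database with a hidden Markov model of recombination and mutation. The following is a minimal sketch of that general idea, not the PLIGHT implementation: it assumes a haploid query, a Li-Stephens-style copying model, and hypothetical `switch` and `error` parameters standing in for recombination and mutation or genotyping noise.

```python
import numpy as np

def viterbi_copy_path(query, panel, switch=0.05, error=0.01):
    """Most likely sequence of reference haplotypes "copied" by the query.

    query:  1-D array of 0/1 alleles at L sites (illustrative haploid query).
    panel:  2-D array of shape (n_haplotypes, L) with 0/1 alleles, n >= 2.
    switch: assumed per-site probability of jumping to a different haplotype.
    error:  assumed per-site probability that an observed allele is wrong.
    """
    query = np.asarray(query)
    panel = np.asarray(panel)
    n, n_sites = panel.shape

    # Log transition matrix: stay on the same haplotype or jump uniformly.
    trans = np.full((n, n), np.log(switch / (n - 1)))
    np.fill_diagonal(trans, np.log(1.0 - switch))

    def log_emit(site):
        # Matching alleles are likely; mismatches are explained by `error`.
        match = panel[:, site] == query[site]
        return np.where(match, np.log(1.0 - error), np.log(error))

    score = np.log(1.0 / n) + log_emit(0)      # uniform prior over haplotypes
    back = np.zeros((n_sites, n), dtype=int)   # back-pointers for traceback
    for site in range(1, n_sites):
        cand = score[:, None] + trans          # cand[j, k]: from j into k
        back[site] = np.argmax(cand, axis=0)
        score = cand[back[site], np.arange(n)] + log_emit(site)

    # Trace the best path backwards from the best final state.
    path = np.empty(n_sites, dtype=int)
    path[-1] = int(np.argmax(score))
    for site in range(n_sites - 1, 0, -1):
        path[site - 1] = back[site, path[site]]
    return path
```

The returned path is a piecewise assignment of query sites to panel haplotypes; long runs on a single haplotype are the kind of local match that could link a query to a database entry. A diploid or mosaic query would need a larger state space than this sketch uses.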
15.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-39007594

ABSTRACT

Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.


Subject(s)
Artificial Intelligence; Drug Design; Proteins; Humans; Computational Biology/methods; Proteins/chemistry
16.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38493342

ABSTRACT

Dynamic compartmentalization of eukaryotic DNA into active and repressed states enables diverse transcriptional programs to arise from a single genetic blueprint, whereas its dysregulation can be strongly linked to a broad spectrum of diseases. While single-cell Hi-C experiments allow for chromosome conformation profiling across many cells, they are still expensive and not widely available for most labs. Here, we propose an alternate approach, scENCORE, to computationally reconstruct chromatin compartments from the more affordable and widely accessible single-cell epigenetic data. First, scENCORE constructs a long-range epigenetic correlation graph to mimic chromatin interaction frequencies, where nodes and edges represent genome bins and their correlations. Then, it learns the node embeddings to cluster genome regions into A/B compartments and aligns different graphs to quantify chromatin conformation changes across conditions. Benchmarking using cell-type-matched Hi-C experiments demonstrates that scENCORE can robustly reconstruct A/B compartments in a cell-type-specific manner. Furthermore, our chromatin conformation switching studies highlight substantial compartment-switching events that may introduce substantial regulatory and transcriptional changes in psychiatric disease. In summary, scENCORE allows accurate and cost-effective A/B compartment reconstruction to delineate higher-order chromatin structure heterogeneity in complex tissues.


Subject(s)
Chromatin; Chromosomes; Chromatin/genetics; DNA; Molecular Conformation; Epigenesis, Genetic
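
The two steps named in the abstract above (a bin-by-bin correlation graph, then a two-way grouping of bins) can be illustrated with a much simpler stand-in. This sketch does not reproduce scENCORE: instead of learned node embeddings it uses the sign of the leading eigenvector of the correlation matrix, a classical way to call two compartments. The function name and the input layout (cells by bins) are assumptions made for illustration.

```python
import numpy as np

def call_compartments(signal):
    """Split genome bins into two groups from a cells-by-bins signal matrix.

    signal: array of shape (n_cells, n_bins), e.g. per-cell epigenetic
            signal summed in fixed-size genome bins (assumed layout).
    Returns an integer array of 0/1 labels, one per bin.
    """
    signal = np.asarray(signal, dtype=float)

    # Pearson correlation between bins across cells; this dense matrix plays
    # the role of the weighted adjacency of the correlation graph.
    corr = np.corrcoef(signal, rowvar=False)

    # eigh returns eigenvalues in ascending order, so the last column is the
    # eigenvector with the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(corr)
    leading = eigvecs[:, -1]

    # The sign of each entry separates the bins into two groups. Deciding
    # which group is "A" and which is "B" needs external information such
    # as gene density, which this sketch does not use.
    return (leading > 0).astype(int)
```

Because an eigenvector's overall sign is arbitrary, only the partition is meaningful here, not the 0/1 values themselves.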
17.
Nature ; 583(7818): 693-698, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32728248

ABSTRACT

The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.


Subject(s)
Databases, Genetic; Genome/genetics; Genomics; Molecular Sequence Annotation; Animals; Binding Sites; Chromatin/genetics; Chromatin/metabolism; DNA Methylation; Databases, Genetic/standards; Databases, Genetic/trends; Gene Expression Regulation/genetics; Genome, Human/genetics; Genomics/standards; Genomics/trends; Histones/metabolism; Humans; Mice; Molecular Sequence Annotation/standards; Quality Control; Regulatory Sequences, Nucleic Acid/genetics; Transcription Factors/metabolism
18.
Nucleic Acids Res ; 52(4): e20, 2024 Feb 28.
Article in English | MEDLINE | ID: mdl-38214231

ABSTRACT

Numerous statistical methods have emerged for inferring DNA motifs for transcription factors (TFs) from genomic regions. However, the process of selecting informative regions for motif inference remains understudied. Current approaches select regions with strong ChIP-seq signal for a given TF, assuming that such strong signal primarily results from specific interactions between the TF and its motif. Additionally, these selection approaches do not account for non-target motifs, i.e. motifs of other TFs; they presume that these non-target motifs occur infrequently compared with the target motif, and thus assume that they interfere minimally with the identification of the target. Leveraging extensive ChIP-seq datasets, we introduced the concept of TF signal 'crowdedness', referred to as C-score, for each genomic region. The C-score helps in highlighting TF signals arising from non-specific interactions. Moreover, by considering the C-score (and adjusting for the length of genomic regions), we can effectively mitigate interference of non-target motifs. Using these tools, we find that in many instances, strong ChIP-seq signal stems mainly from non-specific interactions, and the occurrence of non-target motifs significantly impacts the accurate inference of the target motif. Prioritizing genomic regions with reduced crowdedness and short length markedly improves motif inference. This 'less-is-more' effect suggests that ChIP-seq region selection warrants more attention.


Subject(s)
Genomics; Nucleotide Motifs; Transcription Factors; Binding Sites; Chromatin Immunoprecipitation; Protein Binding; Transcription Factors/genetics; Transcription Factors/metabolism
19.
Bioinformatics ; 40(8)2024 Aug 2.
Article in English | MEDLINE | ID: mdl-39051682

ABSTRACT

MOTIVATION: Many types of networks, such as co-expression or ChIP-seq-based gene-regulatory networks, provide useful information for biomedical studies. However, they are often too full of connections and difficult to interpret, forming "indecipherable hairballs." RESULTS: To address this issue, we propose that a Bayesian network can summarize the core relationships between gene expression activities. This network, which we call the LatentDAG, is substantially simpler than conventional co-expression and ChIP-seq networks (by two orders of magnitude). It provides clearer clusters, without extraneous cross-cluster connections, and clear separators between modules. Moreover, one can find a number of clear examples showing how it bridges the connection between steps in the transcriptional regulatory network and other networks (e.g. RNA-binding protein). In conjunction with a graph neural network, the LatentDAG works better than other biological networks in a variety of tasks, including prediction of gene conservation and clustering genes. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/gersteinlab/LatentDAG.


Subject(s)
Bayes Theorem; Gene Regulatory Networks; Humans; Algorithms; Computational Biology/methods; Gene Expression Profiling/methods; Cluster Analysis
20.
Bioinformatics ; 40(Suppl 1): i357-i368, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940177

ABSTRACT

MOTIVATION: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain. RESULTS: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks. AVAILABILITY AND IMPLEMENTATION: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.


Subject(s)
Natural Language Processing; Deep Learning; Computational Biology/methods
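
The abstract above names contrastive learning as the supervisory signal that ties molecule and text representations together, without stating the exact loss. As a rough illustration only, the sketch below assumes a symmetric InfoNCE objective over a batch of paired embeddings, a common choice for cross-modal matching; the function name, the `temperature` value, and the use of plain NumPy arrays in place of encoder outputs are all assumptions, not details of MolLM.

```python
import numpy as np

def info_nce_loss(mol_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss for a batch of paired embeddings.

    mol_emb, text_emb: arrays of shape (batch, dim); row i of each array is
                       assumed to describe the same molecule.
    temperature:       assumed scaling factor for the similarity logits.
    """
    # L2-normalise so that dot products are cosine similarities.
    mol = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = mol @ txt.T / temperature          # shape (batch, batch)

    def diagonal_cross_entropy(z):
        # Row-wise log-softmax; the correct "class" for row i is column i.
        z = z - z.max(axis=1, keepdims=True)    # numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the molecule-to-text and text-to-molecule directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))
```

The loss is small when each molecule embedding is closest to its own text embedding and large when the pairing is scrambled, which is the property that pulls matched pairs together during pre-training.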