1.
RNA ; 29(5): 517-530, 2023 05.
Article in English | MEDLINE | ID: mdl-36737104

ABSTRACT

In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.


Subject(s)
Algorithms, Single-Cell Gene Expression Analysis, RNA Sequence Analysis/methods, Single-Cell Analysis/methods, Gene Expression Profiling/methods, Cluster Analysis
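
The upstream steps named in the abstract above (quality control and normalization) can be illustrated with a minimal sketch. It is not taken from the reviewed paper: the function names, the `min_genes` cutoff, and the `target_sum` of 10,000 counts per cell are assumptions chosen for illustration, and the count matrix is toy data.

```python
import math

def filter_cells(counts, min_genes=2):
    """Quality control: keep cells (rows) with at least `min_genes` detected genes."""
    return [row for row in counts if sum(1 for c in row if c > 0) >= min_genes]

def normalize_log1p(counts, target_sum=10_000):
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells with no counts are left as all zeros (assumed convention).
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in row])
    return normalized

# Toy matrix: rows are cells, columns are genes (hypothetical data).
counts = [
    [10, 0, 5, 5],
    [0, 0, 1, 0],   # only one detected gene, removed by the QC filter
    [2, 2, 0, 4],
]
kept = filter_cells(counts, min_genes=2)
log_norm = normalize_log1p(kept)
```

Dimension reduction and clustering would then operate on `log_norm`; those steps are omitted here to keep the sketch short.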
2.
Bioinformatics ; 40(6)2024 06 03.
Article in English | MEDLINE | ID: mdl-38837395

ABSTRACT

MOTIVATION: Tissue context and molecular profiling are commonly used measures in understanding normal development and disease pathology. In recent years, the development of spatial molecular profiling technologies (e.g. spatially resolved transcriptomics) has enabled the exploration of quantitative links between tissue morphology and gene expression. However, these technologies remain expensive and time-consuming, with subsequent analyses necessitating high-throughput pathological annotations. On the other hand, existing computational tools are limited to predicting only a few dozen to several hundred genes, and the majority of the methods are designed for bulk RNA-seq. RESULTS: In this context, we propose HE2Gene, the first multi-task learning-based method capable of predicting tens of thousands of spot-level gene expressions along with pathological annotations from H&E-stained images. Experimental results demonstrate that HE2Gene is comparable to state-of-the-art methods and generalizes well on an external dataset without the need for re-training. Moreover, HE2Gene preserves the annotated spatial domains and has the potential to identify biomarkers. This capability facilitates cancer diagnosis and broadens its applicability to investigate gene-disease associations. AVAILABILITY AND IMPLEMENTATION: The source code and data information have been deposited at https://github.com/Microbiods/HE2Gene.


Subject(s)
Transcriptome, Humans, Gene Expression Profiling/methods, Computational Biology/methods, Machine Learning, RNA/metabolism
3.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34791012

ABSTRACT

MOTIVATION: The rapidly growing literature contains diverse yet comprehensive biomedical knowledge, such as drug interactions, that remains hidden and waiting to be mined. However, it is difficult to extract this heterogeneous knowledge and to retrieve or even discover the latest and novel knowledge in an efficient manner. To address this problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI incorporates the language model BioBERT, which has been comprehensively pretrained on a biomedical corpus. In particular, we propose a multihead self-attention mechanism and packed BiGRU to fuse multiple semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pretrained language model, BioGPT-2, where the generated sentences are selected based on filtering rules. RESULTS: We evaluated the classification part on the 'DDIs 2013' dataset and the 'DTIs' dataset, achieving F1 scores of 0.842 and 0.720, respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified the filtered sentences against the existing ground truth. The generated sentences that are not recorded in DrugBank and the DDIs 2013 dataset demonstrated the potential of EGFI to identify novel drug relationships. AVAILABILITY: Source code is publicly available at https://github.com/Layne-Huang/EGFI.


Subject(s)
Language, Natural Language Processing, Drug Interactions, Semantics, Software
4.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34524404

ABSTRACT

The cooperativity of transcription factors (TFs) is a widespread phenomenon in the gene regulation system. However, the interaction patterns between TF binding motifs remain elusive. A recent high-throughput assay, CAP-SELEX, has identified over 600 composite DNA sites (i.e. heterodimeric motifs) bound by cooperative TF pairs. However, there are over 25 000 inferentially effective heterodimeric TFs in human cells. It is not practically feasible to validate all heterodimeric motifs due to cost and labor. We introduce DeepMotifSyn, a deep learning-based tool for synthesizing heterodimeric motifs from monomeric motif pairs. Specifically, DeepMotifSyn is composed of a heterodimeric motif generator and an evaluator. The generator is a U-Net-based neural network that can synthesize heterodimeric motifs from aligned motif pairs. The evaluator is a machine learning-based model that can score the generated heterodimeric motif candidates based on the motif sequence features. Systematic evaluations on CAP-SELEX data illustrate that DeepMotifSyn significantly outperforms the current state-of-the-art predictors. In addition, DeepMotifSyn can synthesize multiple heterodimeric motifs with different orientation and spacing settings. Such a feature can address the shortcomings of previous models. We believe DeepMotifSyn is a more practical and reliable model than current predictors on heterodimeric motif synthesis. Contact: kc.w@cityu.edu.hk.


Subject(s)
Deep Learning, Binding Sites/genetics, Humans, Nucleotide Motifs, Protein Binding, Transcription Factors/genetics, Transcription Factors/metabolism
5.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36274236

ABSTRACT

MOTIVATION: The identification of drug-target interactions (DTIs) plays a vital role in in silico drug discovery, in which the drug is the chemical molecule and the target is the protein residues in the binding pocket. Manual DTI annotation approaches remain reliable; however, it is notoriously laborious and time-consuming to test each drug-target pair exhaustively. Recently, the rapid growth of labelled DTI data has catalysed interest in high-throughput DTI prediction. Unfortunately, those methods rely heavily on features defined manually by humans, leading to errors. RESULTS: Here, we developed an end-to-end deep learning framework called CoaDTI to significantly improve the efficiency and interpretability of drug target annotation. CoaDTI incorporates the Co-attention mechanism to model the interaction information from the drug modality and protein modality. In particular, CoaDTI incorporates a transformer to learn the protein representations from raw amino acid sequences, and GraphSage to extract the molecule graph features from SMILES. Furthermore, we proposed a transfer learning strategy that encodes protein features with a pre-trained transformer to address the issue of scarce labelled data. The experimental results demonstrate that CoaDTI achieves competitive performance on three public datasets compared with state-of-the-art models. In addition, the transfer learning strategy further boosts the performance to an unprecedented level. The extended study reveals that CoaDTI can identify novel DTIs such as reactions between candidate drugs and severe acute respiratory syndrome coronavirus 2-associated proteins. The visualization of co-attention scores can illustrate the interpretability of our model for mechanistic insights. AVAILABILITY: Source code is publicly available at https://github.com/Layne-Huang/CoaDTI.


Subject(s)
COVID-19, Humans, Computer Simulation, Proteins/chemistry, Amino Acid Sequence, Drug Discovery/methods
6.
Nucleic Acids Res ; 48(10): e56, 2020 06 04.
Article in English | MEDLINE | ID: mdl-32232416

ABSTRACT

Recent advances in high-throughput single-cell RNA-seq have enabled us to measure thousands of gene expression levels at single-cell resolution. However, the transcriptomic profiles are high-dimensional and sparse in nature. To address this, a deep learning framework based on an autoencoder, termed DeepAE, is proposed to elucidate high-dimensional transcriptomic profiling data in an encode-decode manner. Comparative experiments were conducted on nine transcriptomic profiling datasets to compare DeepAE with four benchmark methods. The results demonstrate that the proposed DeepAE outperforms the benchmark methods with robust performance on uncovering the key dimensions of single-cell RNA-seq data. In addition, we comprehensively investigate the performance of DeepAE in other contexts and platforms such as mass cytometry and metabolic profiling. Gene ontology enrichment and pathology analysis are conducted to reveal the mechanisms behind the robust performance of DeepAE by uncovering its key dimensions.


Subject(s)
Deep Learning, RNA-Seq/methods, Single-Cell Analysis/methods, Animals, Data Compression, Humans, Metabolomics/methods, Mice
7.
Nucleic Acids Res ; 47(4): 1628-1636, 2019 02 28.
Article in English | MEDLINE | ID: mdl-30590725

ABSTRACT

Bound by transcription factors, DNA motifs (i.e. transcription factor binding sites) are prevalent and important for gene regulation in different tissues at different developmental stages of eukaryotes. Although considerable efforts have been made on elucidating monomeric DNA motif patterns, our knowledge of heterodimeric DNA motifs is still far from complete. Therefore, we propose to develop a computational approach to synthesize a heterodimeric DNA motif from two monomeric DNA motifs. The approach is sequentially divided into two components (Phases A and B). In Phase A, we propose to develop inference models of how two monomeric DNA motifs can be oriented and overlapped with each other at the nucleotide level. In Phase B, given the two oriented monomeric DNA motifs, we further propose to develop DNA-binding family-specific input-output hidden Markov models (IOHMMs) to synthesize a heterodimeric DNA motif. To validate the approach, we execute and cross-validate it with the 618 experimentally verified heterodimeric DNA motifs across 49 DNA-binding family combinations. We observe that our approach can even "rescue" an existing heterodimeric DNA motif pattern (i.e. HOXB2_EOMES) previously published in Nature. Lastly, we apply the proposed approach to infer previously uncharacterized heterodimeric motifs. Their motif instances are supported by DNase accessibility, gene ontology, protein-protein interactions, in vivo ChIP-seq peaks, and even structural data from PDB. A public web server is built for open accessibility and scientific impact. Its address is as follows: http://motif.cs.cityu.edu.hk/custom/MotifKirin.


Subject(s)
Computational Biology, Genomics/methods, Nucleotide Motifs/genetics, Transcription Factors/genetics, Algorithms, Binding Sites/genetics, DNA Replication/genetics, Developmental Gene Expression Regulation/genetics, Humans, Markov Chains, Transcriptional Regulatory Elements/genetics, DNA Sequence Analysis/methods, Software, Transcription Factors/chemistry
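
The search space described for Phase A in the abstract above (how two monomeric motifs can be oriented and overlapped) can be pictured with a small enumeration sketch. This is not the authors' inference model or their IOHMM: the function names, the `max_overlap` parameter, and the choice to average overlapping columns are assumptions made purely for illustration, and the two motifs are toy matrices.

```python
from itertools import product

def reverse_complement_pwm(pwm):
    """Reverse the positions and swap A<->T, C<->G (columns are ordered A, C, G, T)."""
    return [list(reversed(row)) for row in reversed(pwm)]

def merge_with_overlap(left, right, overlap):
    """Join two PWMs, averaging the `overlap` positions they share (assumed rule)."""
    split = len(left) - overlap
    shared = [
        [(x + y) / 2 for x, y in zip(l_row, r_row)]
        for l_row, r_row in zip(left[split:], right[:overlap])
    ]
    return left[:split] + shared + right[overlap:]

def candidate_arrangements(pwm_a, pwm_b, max_overlap=2):
    """Enumerate (orientation_a, orientation_b, overlap, merged_pwm) candidates."""
    options_a = [("+", pwm_a), ("-", reverse_complement_pwm(pwm_a))]
    options_b = [("+", pwm_b), ("-", reverse_complement_pwm(pwm_b))]
    limit = min(max_overlap, len(pwm_a), len(pwm_b))
    results = []
    for (sign_a, a), (sign_b, b) in product(options_a, options_b):
        for overlap in range(limit + 1):
            results.append((sign_a, sign_b, overlap, merge_with_overlap(a, b, overlap)))
    return results

# Toy monomeric motifs; each row is one position with weights for A, C, G, T.
motif_a = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # consensus ACG
motif_b = [[0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0]]  # consensus TTA
arrangements = candidate_arrangements(motif_a, motif_b)
```

A learned model would score or select among such candidates; the enumeration only shows how orientation and overlap jointly define the options.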
8.
Bioinformatics ; 34(17): i656-i663, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423072

ABSTRACT

Motivation: The prediction of off-target mutations in CRISPR-Cas9 is a hot topic due to its relevance to gene editing research. Several prediction methods have been developed; however, most of them only calculate scores based on mismatches to the guide sequence in CRISPR-Cas9. Therefore, the existing prediction methods are unable to scale and improve their performance with the rapid expansion of experimental data in CRISPR-Cas9. Moreover, the existing methods still cannot achieve sufficient precision in off-target predictions for gene editing at the clinical level. Results: To address this, we design and implement two algorithms using deep neural networks to predict off-target mutations in CRISPR-Cas9 gene editing (i.e. deep convolutional neural network and deep feedforward neural network). The models were trained and tested on the recently released off-target dataset, the CRISPOR dataset, for performance benchmarking. Another off-target dataset identified by GUIDE-seq was adopted for additional evaluation. We demonstrate that the convolutional neural network achieves the best performance on the CRISPOR dataset, yielding an average classification area under the ROC curve (AUC) of 97.2% under stratified 5-fold cross-validation. Interestingly, the deep feedforward neural network can also be competitive at the average AUC of 97.0% under the same setting. We compare the two deep neural network models with the state-of-the-art off-target prediction methods (i.e. CFD, MIT, CROP-IT, and CCTop) and three traditional machine learning models (i.e. random forest, gradient boosting trees, and logistic regression) on both datasets in terms of AUC values, demonstrating the competitive edges of the proposed algorithms. Additional analyses are conducted to investigate the underlying reasons from different perspectives. Availability and implementation: The example code is available at https://github.com/MichaelLinn/off_target_prediction. The related datasets are available at https://github.com/MichaelLinn/off_target_prediction/tree/master/data.


Subject(s)
Deep Learning, Gene Editing, Algorithms, Area Under Curve, CRISPR-Cas Systems, Logistic Models
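
The contrast drawn in the abstract above, between mismatch-based scoring and learned models, can be made concrete with a short sketch. It shows a plain mismatch count next to one possible way of turning a guide/target pair into a numeric matrix that a neural network could consume. The encoding (two concatenated one-hot vectors per position) is an assumption for illustration and is not claimed to be the encoding used in the paper; the sequences are hypothetical.

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 vectors (order A, C, G, T)."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def count_mismatches(guide, target):
    """Number of positions at which the guide and the candidate site differ."""
    if len(guide) != len(target):
        raise ValueError("guide and target must have the same length")
    return sum(1 for g, t in zip(guide.upper(), target.upper()) if g != t)

def encode_pair(guide, target):
    """Concatenate the one-hot vectors of both sequences, giving 8 values per position."""
    if len(guide) != len(target):
        raise ValueError("guide and target must have the same length")
    return [g + t for g, t in zip(one_hot(guide), one_hot(target))]

# Hypothetical 20-nt guide and candidate off-target site.
guide = "GACGTTACCGGATCAAGCTT"
target = "GACGTTACCGGTTCAAGCTA"
mismatches = count_mismatches(guide, target)
encoded = encode_pair(guide, target)
```

The mismatch count collapses the pair into a single number, whereas the encoded matrix keeps the position and identity of every base, which is the kind of input a convolutional or feedforward network can learn from.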
9.
bioRxiv ; 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39131276

ABSTRACT

Transcriptional regulation, critical for cellular differentiation and adaptation to environmental changes, involves coordinated interactions among DNA sequences, regulatory proteins, and chromatin architecture. Despite extensive data from consortia like ENCODE, understanding the dynamics of cis-regulatory elements (CREs) in gene expression remains challenging. Deep learning is a powerful tool for learning gene expression and epigenomic signals from DNA sequences, exhibiting superior performance compared to conventional machine learning approaches. However, even the most advanced deep learning-based methods may fall short in capturing the regulatory effects of distal elements such as enhancers, limiting their predictive accuracy. In addition, these methods may require significant resources to train or to adapt to newly generated data. To address these challenges, we present EPInformer, a scalable deep-learning framework for predicting gene expression by integrating promoter-enhancer interactions with their sequences, epigenomic signals, and chromatin contacts. Our model outperforms existing gene expression prediction models in rigorous cross-chromosome validation, accurately recapitulates enhancer-gene interactions validated by CRISPR perturbation experiments, and identifies crucial transcription factor motifs within regulatory sequences. EPInformer is available as open-source software at https://github.com/pinellolab/EPInformer.

10.
Comput Biol Med ; 168: 107753, 2024 01.
Article in English | MEDLINE | ID: mdl-38039889

ABSTRACT

BACKGROUND: Trans-acting factors, a group of proteins that can directly or indirectly recognize or bind to the 8-12 bp core sequence of cis-acting elements and regulate the transcription efficiency of target genes, are of special importance in transcription regulation. The progressive development of high-throughput chromatin capture technology (e.g., Hi-C) enables the identification of chromatin-interacting sequence groups where trans-acting DNA motif groups can be discovered. The difficulty of the problem lies in the combinatorial nature of DNA sequence pattern matching and its underlying sequence pattern search space. METHOD: Here, we propose to develop MotifHub for trans-acting DNA motif group discovery on grouped sequences. Specifically, the main approach is to develop probabilistic modeling for accommodating the stochastic nature of DNA motif patterns. RESULTS: Based on the modeling, we develop global sampling techniques based on EM and Gibbs sampling to address the global optimization challenge for model fitting with latent variables. The results show that our proposed approaches achieve promising performance with linear time complexities. CONCLUSION: MotifHub is a novel algorithm that identifies both DNA co-binding motif groups and trans-acting TFs. Our study paves the way for identifying hub TFs of stem cell development (OCT4 and SOX2) and determining potential therapeutic targets of prostate cancer (FOXA1 and MYC). To ensure scientific reproducibility and long-term impact, its matrix-algebra-optimized source code is released at http://bioinfo.cs.cityu.edu.hk/MotifHub.


Subject(s)
Algorithms, Software, Nucleotide Motifs/genetics, Reproducibility of Results, Chromatin/genetics
11.
bioRxiv ; 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39211178

ABSTRACT

Genome editing with RNA-guided DNA binding factors carries risk of off-target editing at homologous sequences. Genetic variants may introduce sequence changes that increase homology to a genome editing target, thereby increasing risk of off-target editing. Conventional methods to verify candidate off-targets rely on access to cells with genomic DNA carrying these sequences. However, for candidate off-targets associated with genetic variants, appropriate cells for experimental verification may not be available. Here we develop a method, Assessment By Stand-in Off-target LentiViral Ensemble with sequencing (ABSOLVE-seq), to integrate a set of candidate off-target sequences along with unique molecular identifiers (UMIs) in genomes of primary cells followed by clinically relevant gene editor delivery. Gene editing of dozens of candidate off-target sequences may be evaluated in a single experiment with high sensitivity, precision, and power. We provide an open-source pipeline to analyze sequencing data. This approach enables experimental assessment of the influence of human genetic diversity on specificity evaluation during gene editing therapy development.

12.
Adv Sci (Weinh) ; 10(33): e2303502, 2023 11.
Article in English | MEDLINE | ID: mdl-37816141

ABSTRACT

Single-cell Hi-C (scHi-C) has made it possible to analyze chromatin organization at the single-cell level. However, scHi-C experiments generate inherently sparse data, which poses a challenge for loop calling methods. The existing approach performs significance tests across the imputed dense contact maps, leading to substantial computational overhead and loss of information at the single-cell level. To overcome this limitation, a lightweight framework called scGSLoop is proposed, which sets a new paradigm for scHi-C loop calling by adapting the training and inferencing strategies of graph-based deep learning to leverage the sequence features and 1D positional information of genomic loci. With this framework, sparsity is no longer a challenge, but rather an advantage that the model leverages to achieve unprecedented computational efficiency. Compared to existing methods, scGSLoop makes more accurate predictions and is able to identify more loops that have the potential to play regulatory roles in genome functioning. Moreover, scGSLoop preserves single-cell information by identifying a distinct group of loops for each individual cell, which not only enables an understanding of the variability of chromatin looping states between cells, but also allows scGSLoop to be extended for the investigation of multi-connected hubs and their underlying mechanisms.


Subject(s)
Chromatin, Genomics, Chromatin/genetics, Genome
13.
Nat Biotechnol ; 41(3): 409-416, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36203014

ABSTRACT

Methods for in vitro DNA cleavage and molecular cloning remain unable to precisely cleave DNA directly adjacent to bases of interest. Restriction enzymes (REs) must bind specific motifs, whereas wild-type CRISPR-Cas9 or CRISPR-Cas12 nucleases require protospacer adjacent motifs (PAMs). Here we explore the utility of our previously reported near-PAMless SpCas9 variant, named SpRY, to serve as a universal DNA cleavage tool for various cloning applications. By performing SpRY DNA digests (SpRYgests) using more than 130 guide RNAs (gRNAs) sampling a wide diversity of PAMs, we discovered that SpRY is PAMless in vitro and can cleave DNA at practically any sequence, including sites refractory to cleavage with wild-type SpCas9. We illustrate the versatility and effectiveness of SpRYgests to improve the precision of several cloning workflows, including those not possible with REs or canonical CRISPR nucleases. We also optimize a rapid and simple one-pot gRNA synthesis protocol to streamline SpRYgest implementation. Together, SpRYgests can improve various DNA engineering applications that benefit from precise DNA breaks.


Subject(s)
CRISPR-Cas Systems, DNA Cleavage, CRISPR-Cas Systems/genetics, DNA/genetics, Gene Editing/methods, CRISPR-Cas Systems Guide RNA
14.
Nat Genet ; 55(1): 34-43, 2023 Jan.
Article in English | MEDLINE | ID: mdl-36522432

ABSTRACT

CRISPR gene editing holds great promise to modify DNA sequences in somatic cells to treat disease. However, standard computational and biochemical methods to predict off-target potential focus on reference genomes. We developed an efficient tool called CRISPRme that considers single-nucleotide polymorphism (SNP) and indel genetic variants to nominate and prioritize off-target sites. We tested the software with a BCL11A enhancer targeting guide RNA (gRNA) showing promise in clinical trials for sickle cell disease and β-thalassemia and found that the top candidate off-target is produced by an allele common in African-ancestry populations (MAF 4.5%) that introduces a protospacer adjacent motif (PAM) sequence. We validated that SpCas9 generates strictly allele-specific indels and pericentric inversions in CD34+ hematopoietic stem and progenitor cells (HSPCs), although high-fidelity Cas9 mitigates this off-target. This report illustrates how genetic variants should be considered as modifiers of gene editing outcomes. We expect that variant-aware off-target assessment will become integral to therapeutic genome editing evaluation and provide a powerful approach for comprehensive off-target nomination.


Subject(s)
CRISPR-Cas Systems, Gene Editing, Humans, Gene Editing/methods, CRISPR-Cas Systems/genetics, Hematopoietic Stem Cells, INDEL Mutation, CRISPR-Cas Systems Guide RNA
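
The abstract above notes that a variant allele can introduce a PAM and thereby create an off-target site that is absent from the reference genome. The sketch below illustrates that idea for the canonical SpCas9 NGG PAM only. It is not CRISPRme's algorithm: the function names, the fixed 20-nt protospacer followed by a 3-nt PAM, and the example sequence are assumptions for illustration.

```python
def has_ngg_pam(site, protospacer_len=20):
    """True if the 3 bases after the protospacer match the NGG pattern."""
    pam = site[protospacer_len:protospacer_len + 3].upper()
    return len(pam) == 3 and pam[1:] == "GG"

def apply_snp(seq, pos, alt):
    """Return a copy of `seq` with the base at 0-based `pos` replaced by `alt`."""
    return seq[:pos] + alt + seq[pos + 1:]

def pam_created_by_variant(ref_site, pos, alt, protospacer_len=20):
    """True if the reference site lacks an NGG PAM but the variant allele has one."""
    alt_site = apply_snp(ref_site, pos, alt)
    return (not has_ngg_pam(ref_site, protospacer_len)
            and has_ngg_pam(alt_site, protospacer_len))

# Hypothetical site: 20-nt protospacer followed by "TGA" (no PAM in the reference).
ref_site = "GACGTTACCGGATCAAGCTT" + "TGA"
# A hypothetical A>G SNP at the last position turns "TGA" into "TGG".
creates_pam = pam_created_by_variant(ref_site, pos=22, alt="G")
```

A reference-only search would skip `ref_site` because it has no PAM, which is why variant-aware enumeration can surface additional candidates.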
15.
iScience ; 25(12): 105535, 2022 Dec 22.
Article in English | MEDLINE | ID: mdl-36444296

ABSTRACT

Graph and image are two common representations of Hi-C cis-contact maps. Existing computational tools have only adopted Hi-C data modeled as unitary data structures but neglected the potential advantages of synergizing the information of different views. Here we propose GILoop, a dual-branch neural network that learns from both representations to identify genome-wide CTCF-mediated loops. With GILoop, we explore the combined strength of integrating the two view representations of Hi-C data and corroborate the complementary relationship between the views. In particular, the model outperforms the state-of-the-art loop calling framework and is also more robust against low-quality Hi-C libraries. We also uncover distinct preferences for matrix density by graph-based and image-based models, revealing interesting insights into Hi-C data elucidation. Finally, along with multiple transfer-learning case studies, we demonstrate that GILoop can accurately model the organizational and functional patterns of CTCF-mediated looping across different cell lines.

16.
Methods Mol Biol ; 2212: 277-289, 2021.
Article in English | MEDLINE | ID: mdl-33733362

ABSTRACT

We report a step-by-step protocol to use pysster, a TensorFlow-based package for building deep neural networks on a broad range of epistatic sequences such as DNA, RNA, or annotated secondary structure sequences. Pysster provides users with comprehensive support for developing, training, and evaluating self-defined deep neural networks on sequence data. Moreover, pysster allows users to easily visualize the resulting predictions, which is helpful for opening the "black box" of deep neural networks. Here, we describe a step-by-step application of pysster to classify RNA A-to-I editing regions and interpret the model predictions. To further demonstrate the generalizability of pysster, we used it to build and evaluate a new deep neural network on an artificial epistatic sequence dataset.


Subject(s)
Deep Learning, Genetic Epistasis, Genetic Models, RNA/genetics, Software, Base Sequence, Datasets as Topic, Humans, RNA Editing, ROC Curve, Sequence Analysis/statistics & numerical data
17.
iScience ; 15: 332-341, 2019 May 31.
Article in English | MEDLINE | ID: mdl-31103852

ABSTRACT

The early detection of cancers has the potential to save many lives. A recent attempt has been demonstrated to be successful. However, we note several critical limitations. Given the central importance and broad impact of early cancer detection, we aspire to address those limitations. We explore different supervised learning approaches for multiple cancer type detection and observe significant improvements; for instance, one of our approaches (i.e., CancerA1DE) can double the existing sensitivity from 38% to 77% for the earliest cancer detection (i.e., Stage I) at the 99% specificity level. For Stage II, it can even reach up to about 90% across multiple cancer types. In addition, CancerA1DE can also more than double the existing sensitivity from 30% to 70% for detecting breast cancers at the 99% specificity level. Data and model analyses are conducted to reveal the underlying reasons. A website is available at http://cancer.cs.cityu.edu.hk/.
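
The abstract above reports sensitivity "at the 99% specificity level". A minimal sketch of how such a figure can be computed from classifier scores follows. It is a generic illustration rather than the authors' evaluation code: the function name, the rule that a sample is called positive when its score is at or above the threshold, and the toy scores are all assumptions.

```python
def sensitivity_at_specificity(scores, labels, min_specificity=0.99):
    """Return (threshold, sensitivity, specificity) with the highest sensitivity
    among thresholds whose specificity is at least `min_specificity`.
    Labels are 1 for cancer and 0 for control."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    best = None
    # The infinite threshold calls nothing positive, so specificity 1.0 is always reachable.
    for threshold in sorted(set(scores)) + [float("inf")]:
        true_neg = sum(1 for s in negatives if s < threshold)
        true_pos = sum(1 for s in positives if s >= threshold)
        specificity = true_neg / len(negatives)
        sensitivity = true_pos / len(positives)
        if specificity >= min_specificity and (best is None or sensitivity > best[1]):
            best = (threshold, sensitivity, specificity)
    return best

# Toy scores (hypothetical): four controls followed by four cancer samples.
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
strict = sensitivity_at_specificity(scores, labels, min_specificity=1.0)
relaxed = sensitivity_at_specificity(scores, labels, min_specificity=0.75)
```

With only four controls, the specificity constraint is set to 1.0 and 0.75 in the example; a 99% level becomes meaningful once there are at least a hundred control samples.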
