Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 15 de 15
Filtrar
Mais filtros








Base de dados
Intervalo de ano de publicação
1.
Bioinformatics ; 40(6)2024 Jun 03.
Artigo em Inglês | MEDLINE | ID: mdl-38837395

RESUMO

MOTIVATION: Tissue context and molecular profiling are commonly used measures in understanding normal development and disease pathology. In recent years, the development of spatial molecular profiling technologies (e.g. spatial resolved transcriptomics) has enabled the exploration of quantitative links between tissue morphology and gene expression. However, these technologies remain expensive and time-consuming, with subsequent analyses necessitating high-throughput pathological annotations. On the other hand, existing computational tools are limited to predicting only a few dozen to several hundred genes, and the majority of the methods are designed for bulk RNA-seq. RESULTS: In this context, we propose HE2Gene, the first multi-task learning-based method capable of predicting tens of thousands of spot-level gene expressions along with pathological annotations from H&E-stained images. Experimental results demonstrate that HE2Gene is comparable to state-of-the-art methods and generalizes well on an external dataset without the need for re-training. Moreover, HE2Gene preserves the annotated spatial domains and has the potential to identify biomarkers. This capability facilitates cancer diagnosis and broadens its applicability to investigate gene-disease associations. AVAILABILITY AND IMPLEMENTATION: The source code and data information has been deposited at https://github.com/Microbiods/HE2Gene.


Assuntos
Transcriptoma , Humanos , Perfilação da Expressão Gênica/métodos , Biologia Computacional/métodos , Aprendizado de Máquina , RNA/metabolismo
2.
Comput Biol Med ; 168: 107753, 2024 01.
Artigo em Inglês | MEDLINE | ID: mdl-38039889

RESUMO

BACKGROUND: Trans-acting factors are of special importance in transcription regulation, which is a group of proteins that can directly or indirectly recognize or bind to the 8-12 bp core sequence of cis-acting elements and regulate the transcription efficiency of target genes. The progressive development in high-throughput chromatin capture technology (e.g., Hi-C) enables the identification of chromatin-interacting sequence groups where trans-acting DNA motif groups can be discovered. The problem difficulty lies in the combinatorial nature of DNA sequence pattern matching and its underlying sequence pattern search space. METHOD: Here, we propose to develop MotifHub for trans-acting DNA motif group discovery on grouped sequences. Specifically, the main approach is to develop probabilistic modeling for accommodating the stochastic nature of DNA motif patterns. RESULTS: Based on the modeling, we develop global sampling techniques based on EM and Gibbs sampling to address the global optimization challenge for model fitting with latent variables. The results reflect that our proposed approaches demonstrate promising performance with linear time complexities. CONCLUSION: MotifHub is a novel algorithm considering the identification of both DNA co-binding motif groups and trans-acting TFs. Our study paves the way for identifying hub TFs of stem cell development (OCT4 and SOX2) and determining potential therapeutic targets of prostate cancer (FOXA1 and MYC). To ensure scientific reproducibility and long-term impact, its matrix-algebra-optimized source code is released at http://bioinfo.cs.cityu.edu.hk/MotifHub.


Assuntos
Algoritmos , Software , Motivos de Nucleotídeos/genética , Reprodutibilidade dos Testes , Cromatina/genética
3.
Adv Sci (Weinh) ; 10(33): e2303502, 2023 11.
Artigo em Inglês | MEDLINE | ID: mdl-37816141

RESUMO

Single-cell Hi-C (scHi-C) has made it possible to analyze chromatin organization at the single-cell level. However, scHi-C experiments generate inherently sparse data, which poses a challenge for loop calling methods. The existing approach performs significance tests across the imputed dense contact maps, leading to substantial computational overhead and loss of information at the single-cell level. To overcome this limitation, a lightweight framework called scGSLoop is proposed, which sets a new paradigm for scHi-C loop calling by adapting the training and inferencing strategies of graph-based deep learning to leverage the sequence features and 1D positional information of genomic loci. With this framework, sparsity is no longer a challenge, but rather an advantage that the model leverages to achieve unprecedented computational efficiency. Compared to existing methods, scGSLoop makes more accurate predictions and is able to identify more loops that have the potential to play regulatory roles in genome functioning. Moreover, scGSLoop preserves single-cell information by identifying a distinct group of loops for each individual cell, which not only enables an understanding of the variability of chromatin looping states between cells, but also allows scGSLoop to be extended for the investigation of multi-connected hubs and their underlying mechanisms.


Assuntos
Cromatina , Genômica , Cromatina/genética , Genoma
4.
RNA ; 29(5): 517-530, 2023 05.
Artigo em Inglês | MEDLINE | ID: mdl-36737104

RESUMO

In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.


Assuntos
Algoritmos , Análise da Expressão Gênica de Célula Única , Análise de Sequência de RNA/métodos , Análise de Célula Única/métodos , Perfilação da Expressão Gênica/métodos , Análise por Conglomerados
5.
Nat Genet ; 55(1): 34-43, 2023 Jan.
Artigo em Inglês | MEDLINE | ID: mdl-36522432

RESUMO

CRISPR gene editing holds great promise to modify DNA sequences in somatic cells to treat disease. However, standard computational and biochemical methods to predict off-target potential focus on reference genomes. We developed an efficient tool called CRISPRme that considers single-nucleotide polymorphism (SNP) and indel genetic variants to nominate and prioritize off-target sites. We tested the software with a BCL11A enhancer targeting guide RNA (gRNA) showing promise in clinical trials for sickle cell disease and ß-thalassemia and found that the top candidate off-target is produced by an allele common in African-ancestry populations (MAF 4.5%) that introduces a protospacer adjacent motif (PAM) sequence. We validated that SpCas9 generates strictly allele-specific indels and pericentric inversions in CD34+ hematopoietic stem and progenitor cells (HSPCs), although high-fidelity Cas9 mitigates this off-target. This report illustrates how genetic variants should be considered as modifiers of gene editing outcomes. We expect that variant-aware off-target assessment will become integral to therapeutic genome editing evaluation and provide a powerful approach for comprehensive off-target nomination.


Assuntos
Sistemas CRISPR-Cas , Edição de Genes , Humanos , Edição de Genes/métodos , Sistemas CRISPR-Cas/genética , Células-Tronco Hematopoéticas , Mutação INDEL , RNA Guia de Sistemas CRISPR-Cas
6.
Nat Biotechnol ; 41(3): 409-416, 2023 Mar.
Artigo em Inglês | MEDLINE | ID: mdl-36203014

RESUMO

Methods for in vitro DNA cleavage and molecular cloning remain unable to precisely cleave DNA directly adjacent to bases of interest. Restriction enzymes (REs) must bind specific motifs, whereas wild-type CRISPR-Cas9 or CRISPR-Cas12 nucleases require protospacer adjacent motifs (PAMs). Here we explore the utility of our previously reported near-PAMless SpCas9 variant, named SpRY, to serve as a universal DNA cleavage tool for various cloning applications. By performing SpRY DNA digests (SpRYgests) using more than 130 guide RNAs (gRNAs) sampling a wide diversity of PAMs, we discovered that SpRY is PAMless in vitro and can cleave DNA at practically any sequence, including sites refractory to cleavage with wild-type SpCas9. We illustrate the versatility and effectiveness of SpRYgests to improve the precision of several cloning workflows, including those not possible with REs or canonical CRISPR nucleases. We also optimize a rapid and simple one-pot gRNA synthesis protocol to streamline SpRYgest implementation. Together, SpRYgests can improve various DNA engineering applications that benefit from precise DNA breaks.


Assuntos
Sistemas CRISPR-Cas , Clivagem do DNA , Sistemas CRISPR-Cas/genética , DNA/genética , Edição de Genes/métodos , RNA Guia de Sistemas CRISPR-Cas
7.
iScience ; 25(12): 105535, 2022 Dec 22.
Artigo em Inglês | MEDLINE | ID: mdl-36444296

RESUMO

Graph and image are two common representations of Hi-C cis-contact maps. Existing computational tools have only adopted Hi-C data modeled as unitary data structures but neglected the potential advantages of synergizing the information of different views. Here we propose GILoop, a dual-branch neural network that learns from both representations to identify genome-wide CTCF-mediated loops. With GILoop, we explore the combined strength of integrating the two view representations of Hi-C data and corroborate the complementary relationship between the views. In particular, the model outperforms the state-of-the-art loop calling framework and is also more robust against low-quality Hi-C libraries. We also uncover distinct preferences for matrix density by graph-based and image-based models, revealing interesting insights into Hi-C data elucidation. Finally, along with multiple transfer-learning case studies, we demonstrate that GILoop can accurately model the organizational and functional patterns of CTCF-mediated looping across different cell lines.

8.
Brief Bioinform ; 23(6)2022 11 19.
Artigo em Inglês | MEDLINE | ID: mdl-36274236

RESUMO

MOTIVATION: The identification of drug-target interactions (DTIs) plays a vital role for in silico drug discovery, in which the drug is the chemical molecule, and the target is the protein residues in the binding pocket. Manual DTI annotation approaches remain reliable; however, it is notoriously laborious and time-consuming to test each drug-target pair exhaustively. Recently, the rapid growth of labelled DTI data has catalysed interests in high-throughput DTI prediction. Unfortunately, those methods highly rely on the manual features denoted by human, leading to errors. RESULTS: Here, we developed an end-to-end deep learning framework called CoaDTI to significantly improve the efficiency and interpretability of drug target annotation. CoaDTI incorporates the Co-attention mechanism to model the interaction information from the drug modality and protein modality. In particular, CoaDTI incorporates transformer to learn the protein representations from raw amino acid sequences, and GraphSage to extract the molecule graph features from SMILES. Furthermore, we proposed to employ the transfer learning strategy to encode protein features by pre-trained transformer to address the issue of scarce labelled data. The experimental results demonstrate that CoaDTI achieves competitive performance on three public datasets compared with state-of-the-art models. In addition, the transfer learning strategy further boosts the performance to an unprecedented level. The extended study reveals that CoaDTI can identify novel DTIs such as reactions between candidate drugs and severe acute respiratory syndrome coronavirus 2-associated proteins. The visualization of co-attention scores can illustrate the interpretability of our model for mechanistic insights. AVAILABILITY: Source code are publicly available at https://github.com/Layne-Huang/CoaDTI.


Assuntos
COVID-19 , Humanos , Simulação por Computador , Proteínas/química , Sequência de Aminoácidos , Descoberta de Drogas/métodos
9.
Brief Bioinform ; 23(1)2022 01 17.
Artigo em Inglês | MEDLINE | ID: mdl-34524404

RESUMO

The cooperativity of transcription factors (TFs) is a widespread phenomenon in the gene regulation system. However, the interaction patterns between TF binding motifs remain elusive. The recent high-throughput assays, CAP-SELEX, have identified over 600 composite DNA sites (i.e. heterodimeric motifs) bound by cooperative TF pairs. However, there are over 25 000 inferentially effective heterodimeric TFs in the human cells. It is not practically feasible to validate all heterodimeric motifs due to cost and labor. We introduce DeepMotifSyn, a deep learning-based tool for synthesizing heterodimeric motifs from monomeric motif pairs. Specifically, DeepMotifSyn is composed of heterodimeric motif generator and evaluator. The generator is a U-Net-based neural network that can synthesize heterodimeric motifs from aligned motif pairs. The evaluator is a machine learning-based model that can score the generated heterodimeric motif candidates based on the motif sequence features. Systematic evaluations on CAP-SELEX data illustrate that DeepMotifSyn significantly outperforms the current state-of-the-art predictors. In addition, DeepMotifSyn can synthesize multiple heterodimeric motifs with different orientation and spacing settings. Such a feature can address the shortcomings of previous models. We believe DeepMotifSyn is a more practical and reliable model than current predictors on heterodimeric motif synthesis. Contact:kc.w@cityu.edu.hk.


Assuntos
Aprendizado Profundo , Sítios de Ligação/genética , Humanos , Motivos de Nucleotídeos , Ligação Proteica , Fatores de Transcrição/genética , Fatores de Transcrição/metabolismo
10.
Brief Bioinform ; 23(1)2022 01 17.
Artigo em Inglês | MEDLINE | ID: mdl-34791012

RESUMO

MOTIVATION: The rapid growth in literature accumulates diverse and yet comprehensive biomedical knowledge hidden to be mined such as drug interactions. However, it is difficult to extract the heterogeneous knowledge to retrieve or even discover the latest and novel knowledge in an efficient manner. To address such a problem, we propose EGFI for extracting and consolidating drug interactions from large-scale medical literature text data. Specifically, EGFI consists of two parts: classification and generation. In the classification part, EGFI encompasses the language model BioBERT which has been comprehensively pretrained on biomedical corpus. In particular, we propose the multihead self-attention mechanism and packed BiGRU to fuse multiple semantic information for rigorous context modeling. In the generation part, EGFI utilizes another pretrained language model BioGPT-2 where the generation sentences are selected based on filtering rules. RESULTS: We evaluated the classification part on 'DDIs 2013' dataset and 'DTIs' dataset, achieving the F1 scores of 0.842 and 0.720 respectively. Moreover, we applied the classification part to distinguish high-quality generated sentences and verified with the existing growth truth to confirm the filtered sentences. The generated sentences that are not recorded in DrugBank and DDIs 2013 dataset demonstrated the potential of EGFI to identify novel drug relationships. AVAILABILITY: Source code are publicly available at https://github.com/Layne-Huang/EGFI.


Assuntos
Idioma , Processamento de Linguagem Natural , Interações Medicamentosas , Semântica , Software
11.
Methods Mol Biol ; 2212: 277-289, 2021.
Artigo em Inglês | MEDLINE | ID: mdl-33733362

RESUMO

We report a step-by-step protocol to use pysster, a TensorFlow-based package for building deep neural networks on a broad range of epistatic sequences such as DNA, RNA, or annotated secondary structure sequences. Pysster provides users comprehensive supports for developing, training, and evaluating the self-defined deep neural networks on sequence data. Moreover, pysster allows users to easily visualize the resulting perditions, which is helpful to uncover the "black box" of deep neural networks. Here, we describe a step-by-step application of pysster to classify the RNA A-to-I editing regions and interpret the model predictions. To further demonstrate the generalizability of pysster, we utilized it to build and evaluated a new deep neural network on an artificial epistatic sequence dataset.


Assuntos
Aprendizado Profundo , Epistasia Genética , Modelos Genéticos , RNA/genética , Software , Sequência de Bases , Conjuntos de Dados como Assunto , Humanos , Edição de RNA , Curva ROC , Análise de Sequência/estatística & dados numéricos
12.
Nucleic Acids Res ; 48(10): e56, 2020 06 04.
Artigo em Inglês | MEDLINE | ID: mdl-32232416

RESUMO

Recent advances in high-throughput single-cell RNA-seq have enabled us to measure thousands of gene expression levels at single-cell resolution. However, the transcriptomic profiles are high-dimensional and sparse in nature. To address it, a deep learning framework based on auto-encoder, termed DeepAE, is proposed to elucidate high-dimensional transcriptomic profiling data in an encode-decode manner. Comparative experiments were conducted on nine transcriptomic profiling datasets to compare DeepAE with four benchmark methods. The results demonstrate that the proposed DeepAE outperforms the benchmark methods with robust performance on uncovering the key dimensions of single-cell RNA-seq data. In addition, we also investigate the performance of DeepAE in other contexts and platforms such as mass cytometry and metabolic profiling in a comprehensive manner. Gene ontology enrichment and pathology analysis are conducted to reveal the mechanisms behind the robust performance of DeepAE by uncovering its key dimensions.


Assuntos
Aprendizado Profundo , RNA-Seq/métodos , Análise de Célula Única/métodos , Animais , Compressão de Dados , Humanos , Metabolômica/métodos , Camundongos
13.
iScience ; 15: 332-341, 2019 May 31.
Artigo em Inglês | MEDLINE | ID: mdl-31103852

RESUMO

The early detection of cancers has the potential to save many lives. A recent attempt has been demonstrated successful. However, we note several critical limitations. Given the central importance and broad impact of early cancer detection, we aspire to address those limitations. We explore different supervised learning approaches for multiple cancer type detection and observe significant improvements; for instance, one of our approaches (i.e., CancerA1DE) can double the existing sensitivity from 38% to 77% for the earliest cancer detection (i.e., Stage I) at the 99% specificity level. For Stage II, it can even reach up to about 90% across multiple cancer types. In addition, CancerA1DE can also double the existing sensitivity from 30% to 70% for detecting breast cancers at the 99% specificity level. Data and model analysis are conducted to reveal the underlying reasons. A website is built at http://cancer.cs.cityu.edu.hk/.

14.
Nucleic Acids Res ; 47(4): 1628-1636, 2019 02 28.
Artigo em Inglês | MEDLINE | ID: mdl-30590725

RESUMO

Bound by transcription factors, DNA motifs (i.e. transcription factor binding sites) are prevalent and important for gene regulation in different tissues at different developmental stages of eukaryotes. Although considerable efforts have been made on elucidating monomeric DNA motif patterns, our knowledge on heterodimeric DNA motifs are still far from complete. Therefore, we propose to develop a computational approach to synthesize a heterodimeric DNA motif from two monomeric DNA motifs. The approach is sequentially divided into two components (Phases A and B). In Phase A, we propose to develop the inference models on how two DNA monomeric motifs can be oriented and overlapped with each other at nucleotide level. In Phase B, given the two monomeric DNA motifs oriented, we further propose to develop DNA-binding family-specific input-output hidden Markov models (IOHMMs) to synthesize a heterodimeric DNA motif. To validate the approach, we execute and cross-validate it with the experimentally verified 618 heterodimeric DNA motifs across 49 DNA-binding family combinations. We observe that our approach can even "rescue" the existing heterodimeric DNA motif pattern (i.e. HOXB2_EOMES) previously published on Nature. Lastly, we apply the proposed approach to infer previously uncharacterized heterodimeric motifs. Their motif instances are supported by DNase accessibility, gene ontology, protein-protein interactions, in vivo ChIP-seq peaks, and even structural data from PDB. A public web-server is built for open accessibility and scientific impact. Its address is listed as follows: http://motif.cs.cityu.edu.hk/custom/MotifKirin.


Assuntos
Biologia Computacional , Genômica/métodos , Motivos de Nucleotídeos/genética , Fatores de Transcrição/genética , Algoritmos , Sítios de Ligação/genética , Replicação do DNA/genética , Regulação da Expressão Gênica no Desenvolvimento/genética , Humanos , Cadeias de Markov , Elementos Reguladores de Transcrição/genética , Análise de Sequência de DNA/métodos , Software , Fatores de Transcrição/química
15.
Bioinformatics ; 34(17): i656-i663, 2018 09 01.
Artigo em Inglês | MEDLINE | ID: mdl-30423072

RESUMO

Motivation: The prediction of off-target mutations in CRISPR-Cas9 is a hot topic due to its relevance to gene editing research. Existing prediction methods have been developed; however, most of them just calculated scores based on mismatches to the guide sequence in CRISPR-Cas9. Therefore, the existing prediction methods are unable to scale and improve their performance with the rapid expansion of experimental data in CRISPR-Cas9. Moreover, the existing methods still cannot satisfy enough precision in off-target predictions for gene editing at the clinical level. Results: To address it, we design and implement two algorithms using deep neural networks to predict off-target mutations in CRISPR-Cas9 gene editing (i.e. deep convolutional neural network and deep feedforward neural network). The models were trained and tested on the recently released off-target dataset, CRISPOR dataset, for performance benchmark. Another off-target dataset identified by GUIDE-seq was adopted for additional evaluation. We demonstrate that convolutional neural network achieves the best performance on CRISPOR dataset, yielding an average classification area under the ROC curve (AUC) of 97.2% under stratified 5-fold cross-validation. Interestingly, the deep feedforward neural network can also be competitive at the average AUC of 97.0% under the same setting. We compare the two deep neural network models with the state-of-the-art off-target prediction methods (i.e. CFD, MIT, CROP-IT, and CCTop) and three traditional machine learning models (i.e. random forest, gradient boosting trees, and logistic regression) on both datasets in terms of AUC values, demonstrating the competitive edges of the proposed algorithms. Additional analyses are conducted to investigate the underlying reasons from different perspectives. Availability and implementation: The example code are available at https://github.com/MichaelLinn/off_target_prediction. The related datasets are available at https://github.com/MichaelLinn/off_target_prediction/tree/master/data.


Assuntos
Aprendizado Profundo , Edição de Genes , Algoritmos , Área Sob a Curva , Sistemas CRISPR-Cas , Modelos Logísticos
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA