Results 1 - 20 of 73
1.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36527428

ABSTRACT

Understanding the mechanisms of candidate drugs plays an important role in drug discovery. The activating/inhibiting mechanisms between drugs and targets are major types of drug mechanisms. Owing to the complexity of drug-target (DT) mechanisms and data scarcity, using deep learning methods to accurately predict DT activating/inhibiting mechanisms remains a considerable challenge. Here, by considering network pharmacology, we propose a multi-view deep learning model, DrugAI, which combines four modules: a graph neural network for drugs, a convolutional neural network for targets, a network embedding module for drugs and targets, and a deep neural network for predicting activating/inhibiting mechanisms between drugs and targets. Computational experiments show that DrugAI performs better than state-of-the-art methods and has good robustness and generalization. To demonstrate the reliability of the predictive results of DrugAI, bioassay experiments are conducted to validate two drugs (notopterol and alpha-asarone) predicted to activate TRPV1. Moreover, external validation based on independent literature and PubChem bioassays bears out 61 pairs of mechanism relationships between natural products and their targets predicted by DrugAI. DrugAI, for the first time, provides a powerful multi-view deep learning framework for robust prediction of DT activating/inhibiting mechanisms.


Subject(s)
Deep Learning; Algorithms; Reproducibility of Results; Neural Networks, Computer; Drug Discovery
2.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37287135

ABSTRACT

Hi-C is a widely applied chromosome conformation capture (3C)-based technique, which has produced a large number of genomic contact maps with high sequencing depths for a wide range of cell types, enabling comprehensive analyses of the relationships between biological functionalities (e.g. gene regulation and expression) and the three-dimensional genome structure. Comparative analyses play significant roles in Hi-C data studies, which are designed to make comparisons between Hi-C contact maps, thus evaluating the consistency of replicate Hi-C experiments (i.e. reproducibility measurement) and detecting statistically differential interacting regions with biological significance (i.e. differential chromatin interaction detection). However, due to the complex and hierarchical nature of Hi-C contact maps, it remains challenging to conduct systematic and reliable comparative analyses of Hi-C data. Here, we proposed sslHiC, a contrastive self-supervised representation learning framework, for precisely modeling the multi-level features of chromosome conformation and automatically producing informative feature embeddings for genomic loci and their interactions to facilitate comparative analyses of Hi-C contact maps. Comprehensive computational experiments on both simulated and real datasets demonstrated that our method consistently outperformed the state-of-the-art baseline methods in providing reliable measurements of reproducibility and detecting differential interactions with biological meanings.


Subject(s)
Chromatin; Chromosomes; Reproducibility of Results; Chromatin/genetics; Chromosomes/genetics; Genomics/methods; Supervised Machine Learning
3.
PLoS Comput Biol ; 20(4): e1011945, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38578805

ABSTRACT

Early identification of safe and efficacious disease targets is crucial to alleviating the tremendous cost of drug discovery projects. However, existing experimental methods for identifying new targets are generally labor-intensive and failure-prone. On the other hand, computational approaches, especially machine learning-based frameworks, have shown remarkable application potential in drug discovery. In this work, we propose Progeni, a novel machine learning-based framework for target identification. In addition to fully exploiting the known heterogeneous biological networks from various sources, Progeni integrates literature evidence about the relations between biological entities to construct a probabilistic knowledge graph. Graph neural networks are then employed in Progeni to learn the feature embeddings of biological entities to facilitate the identification of biologically relevant target candidates. A comprehensive evaluation of Progeni demonstrated its superior predictive power over the baseline methods on the target identification task. In addition, our extensive tests showed that Progeni exhibited high robustness to the negative effect of exposure bias, a common phenomenon in recommendation systems, and effectively identified new targets that can be strongly supported by the literature. Moreover, our wet lab experiments successfully validated the biological significance of the top target candidates predicted by Progeni for melanoma and colorectal cancer. All these results suggested that Progeni can identify biologically effective targets and thus provide a powerful and useful tool for advancing the drug discovery process.


Subject(s)
Computational Biology; Drug Discovery; Machine Learning; Neural Networks, Computer; Humans; Computational Biology/methods; Drug Discovery/methods; Algorithms; Melanoma; Probability; Colorectal Neoplasms
4.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-36070863

ABSTRACT

Computational recovery of gene regulatory networks (GRNs) has recently undergone a great shift from bulk-cell data towards algorithms targeting single-cell data. In this work, we investigate whether the widely available bulk-cell data could be leveraged to assist GRN prediction for single cells. We infer cell-type-specific GRNs from both the single-cell RNA sequencing data and the generic GRN derived from the bulk cells by constructing a weakly supervised learning framework based on the axial transformer. By conducting extensive experiments, we verify our assumption that the bulk-cell transcriptomic data are a valuable resource that can improve the prediction of single-cell GRNs. Our GRN-transformer achieves state-of-the-art prediction accuracy in comparison to existing supervised and unsupervised approaches. In addition, we show that our method can identify important transcription factors and potential regulations for Alzheimer's disease risk genes by using the predicted GRN. Availability: The implementation of GRN-transformer is available at https://github.com/HantaoShu/GRN-Transformer.


Subject(s)
Computational Biology; Gene Regulatory Networks; Algorithms; Computational Biology/methods; Transcription Factors/genetics; Transcriptome
5.
J Chem Inf Model ; 64(7): 2236-2249, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-37584270

ABSTRACT

Optimizing the activities and properties of lead compounds is an essential step in the drug discovery process. Despite recent advances in machine learning-aided drug discovery, most of the existing methods focus on making predictions for the desired objectives directly while ignoring the explanations for predictions. Although several techniques can provide interpretations for machine learning-based methods such as feature attribution, there are still gaps between these interpretations and the principles commonly adopted by medicinal chemists when designing and optimizing molecules. Here, we propose an interpretation framework, named MolSHAP, for quantitative structure-activity relationship analysis by estimating the contributions of R-groups. Instead of attributing the activities to individual input features, MolSHAP regards the R-group fragments as the basic units of interpretation, which is in accordance with the fragment-based modifications in molecule optimization. MolSHAP is a model-agnostic method that can interpret activity regression models with arbitrary input formats and model architectures. Based on the evaluations of numerous representative activity regression models on a specially designed R-group ranking task, MolSHAP achieved significantly better interpretation power compared with other methods. In addition, we developed a compound optimization algorithm based on MolSHAP and illustrated the reliability of the optimized compounds using an independent case study. These results demonstrated that MolSHAP can provide a useful tool for accurately interpreting the quantitative structure-activity relationships and rationally optimizing the compound activities in drug discovery.
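
The abstract above describes attributing activity to R-group fragments rather than to individual input features. As a rough illustration of the underlying idea only, the following sketch computes exact Shapley values for three hypothetical substitution sites under a toy activity function. The site names (`R1`, `R2`, `R3`), the function `toy_activity`, and all numbers are assumptions made for this example; this is not the MolSHAP implementation.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small set of players.

    players: list of hashable labels (here, hypothetical R-group sites).
    value:   function mapping a frozenset of players to a number.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal gain from adding p to the coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_activity(present):
    """Toy activity model (an assumption, not a trained regressor)."""
    score = 5.0  # baseline activity of the unsubstituted scaffold
    score += 1.0 if "R1" in present else 0.0
    score += 0.5 if "R2" in present else 0.0
    score += 0.2 if "R3" in present else 0.0
    if "R1" in present and "R2" in present:
        score += 0.4  # interaction bonus, split evenly by Shapley
    return score


contributions = shapley_values(["R1", "R2", "R3"], toy_activity)
```

Exact enumeration is exponential in the number of sites, so it is only practical for a handful of R-groups; the contributions always sum to the difference between the fully substituted and unsubstituted scores.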


Subject(s)
Drug Discovery; Quantitative Structure-Activity Relationship; Reproducibility of Results; Drug Discovery/methods; Algorithms; Machine Learning
6.
Proc Natl Acad Sci U S A ; 118(6)2021 02 09.
Article in English | MEDLINE | ID: mdl-33526657

ABSTRACT

RNA polymerase II (Pol II) generally pauses at certain positions along gene bodies, thereby interrupting the transcription elongation process, which is often coupled with various important biological functions, such as precursor mRNA splicing and gene expression regulation. Characterizing the transcriptional elongation dynamics can thus help us understand many essential biological processes in eukaryotic cells. However, experimentally measuring Pol II elongation rates is generally time and resource consuming. We developed PEPMAN (polymerase II elongation pausing modeling through attention-based deep neural network), a deep learning-based model that accurately predicts Pol II pausing sites based on the native elongating transcript sequencing (NET-seq) data. Through fully taking advantage of the attention mechanism, PEPMAN is able to decipher important sequence features underlying Pol II pausing. More importantly, we demonstrated that the analyses of the PEPMAN-predicted results around various types of alternative splicing sites can provide useful clues into understanding the cotranscriptional splicing events. In addition, associating the PEPMAN prediction results with different epigenetic features can help reveal important factors related to the transcription elongation process. All these results demonstrated that PEPMAN can provide a useful and effective tool for modeling transcription elongation and understanding the related biological factors from available high-throughput sequencing data.


Subject(s)
Genome, Human; Machine Learning; Models, Biological; Transcription Elongation, Genetic; Base Sequence; Binding Sites; DNA Methylation/genetics; Epigenesis, Genetic; HEK293 Cells; HeLa Cells; Histones/metabolism; Humans; Nucleotide Motifs/genetics; Protein Processing, Post-Translational; RNA Polymerase II/metabolism; RNA Splice Sites/genetics; RNA Splicing/genetics
7.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33479731

ABSTRACT

Translation elongation is a crucial phase during protein biosynthesis. In this study, we develop a novel deep reinforcement learning-based framework, named Riboexp, to model the determinants of the uneven distribution of ribosomes on mRNA transcripts during translation elongation. In particular, our model employs a policy network to perform context-dependent feature selection in the setting of ribosome density prediction. Our extensive tests demonstrated that Riboexp can significantly outperform the state-of-the-art methods in predicting ribosome density by up to 5.9% in terms of per-gene Pearson correlation coefficient on the datasets from three species. In addition, Riboexp can indicate more informative sequence features for the prediction task than other commonly used attribution methods in deep learning. In-depth analyses also revealed the meaningful biological insights generated by the Riboexp framework. Moreover, the application of Riboexp in codon optimization resulted in an increase of protein production by around 31% over the previous state-of-the-art method that models ribosome density. These results have established Riboexp as a powerful and useful computational tool in the studies of translation dynamics and protein synthesis. Availability: The data and code of this study are available on GitHub: https://github.com/Liuxg16/Riboexp. Contact: zengjy321@tsinghua.edu.cn; songsen@tsinghua.edu.cn.
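
The abstract above reports results as a per-gene Pearson correlation coefficient. A minimal sketch of how such a metric might be computed is given here, assuming it means the Pearson r between observed and predicted per-codon densities within each gene, averaged over genes. That averaging, the function names, and the toy profiles are assumptions for illustration and are not taken from the Riboexp code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def per_gene_pearson(observed, predicted):
    """Mean over genes of the within-gene Pearson r (assumed definition)."""
    scores = [pearson(observed[gene], predicted[gene]) for gene in observed]
    return sum(scores) / len(scores)


# Hypothetical per-codon ribosome densities for two genes.
observed = {"geneA": [1, 2, 3, 4], "geneB": [2, 1, 4, 3]}
predicted = {"geneA": [2, 4, 6, 8], "geneB": [3, 4, 1, 2]}

mean_r = per_gene_pearson(observed, predicted)
```

In this toy case one gene is predicted perfectly and the other is perfectly anti-correlated, so the two scores cancel; computing the correlation within each gene keeps highly expressed genes from dominating the score.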


Subject(s)
Codon/metabolism; Computational Biology; Models, Biological; Protein Biosynthesis; Ribosomes/metabolism
8.
Nature ; 547(7662): 232-235, 2017 07 12.
Article in English | MEDLINE | ID: mdl-28703188

ABSTRACT

In mammals, chromatin organization undergoes drastic reprogramming after fertilization. However, the three-dimensional structure of chromatin and its reprogramming in preimplantation development remain poorly understood. Here, by developing a low-input Hi-C (genome-wide chromosome conformation capture) approach, we examined the reprogramming of chromatin organization during early development in mice. We found that oocytes in metaphase II show homogeneous chromatin folding that lacks detectable topologically associating domains (TADs) and chromatin compartments. Strikingly, chromatin shows greatly diminished higher-order structure after fertilization. Unexpectedly, the subsequent establishment of chromatin organization is a prolonged process that extends through preimplantation development, as characterized by slow consolidation of TADs and segregation of chromatin compartments. The two sets of parental chromosomes are spatially separated from each other and display distinct compartmentalization in zygotes. Such allele separation and allelic compartmentalization can be found as late as the 8-cell stage. Finally, we show that chromatin compaction in preimplantation embryos can partially proceed in the absence of zygotic transcription and is a multi-level hierarchical process. Taken together, our data suggest that chromatin may exist in a markedly relaxed state after fertilization, followed by progressive maturation of higher-order chromatin architecture during early development.


Subject(s)
Alleles; Chromatin Assembly and Disassembly/genetics; Chromatin/chemistry; Chromatin/genetics; Chromosomes, Mammalian/chemistry; Chromosomes, Mammalian/genetics; Embryonic Development/genetics; Animals; Blastocyst/metabolism; Chromatin/metabolism; Chromosomes, Mammalian/metabolism; Female; Fertilization; Gene Expression Regulation, Developmental; Male; Mice; Transcription, Genetic; Zygote/metabolism
9.
Nucleic Acids Res ; 49(7): 3719-3734, 2021 04 19.
Article in English | MEDLINE | ID: mdl-33744973

ABSTRACT

N6-methyladenosine (m6A) is the most pervasive modification in eukaryotic mRNAs. Numerous biological processes are regulated by this critical post-transcriptional mark, such as gene expression, RNA stability, RNA structure and translation. Recently, various experimental techniques and computational methods have been developed to characterize the transcriptome-wide landscapes of m6A modification for understanding its underlying mechanisms and functions in mRNA regulation. However, the experimental techniques are generally costly and time-consuming, while the existing computational models are usually designed only for m6A site prediction in a single species and have significant limitations in accuracy, interpretability and generalizability. Here, we propose a highly interpretable computational framework, called MASS, based on a multi-task curriculum learning strategy to capture m6A features across multiple species simultaneously. Extensive computational experiments demonstrate the superior performance of MASS when compared to the state-of-the-art prediction methods. Furthermore, the contextual sequence features of m6A captured by MASS can be explained by the known critical binding motifs of the related RNA-binding proteins, which also helps elucidate the similarities and differences among m6A features across species. In addition, based on the predicted m6A profiles, we further delineate the relationships between m6A and various properties of gene regulation, including gene expression, RNA stability, translation, RNA structure and histone modification. In summary, MASS may serve as a useful tool for characterizing m6A modification and studying its regulatory code. The source code of MASS can be downloaded from https://github.com/mlcb-thu/MASS.


Subject(s)
Adenosine/analogs & derivatives; Machine Learning; RNA/chemistry; Adenosine/chemistry; Animals; Databases, Genetic; Datasets as Topic; Gene Expression Regulation; Humans; RNA-Binding Proteins; Sequence Analysis, RNA; Software; Transcriptome
10.
Bioinformatics ; 37(Suppl_1): i254-i261, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252932

ABSTRACT

MOTIVATION: The prediction of the binding between peptides and major histocompatibility complex (MHC) molecules plays an important role in neoantigen identification. Although a large number of computational methods have been developed to address this problem, they produce high false-positive rates in practical applications, since in most cases a single residue mutation may largely alter the binding affinity of a peptide to MHC, which cannot be identified by conventional deep learning methods. RESULTS: We developed a differential boundary tree-based model, named DBTpred, to address this problem. We demonstrated that DBTpred can accurately predict MHC class I binding affinity compared to the state-of-the-art deep learning methods. We also presented a parallel training algorithm to accelerate the training and inference process, which enables DBTpred to be applied to large datasets. By investigating the statistical properties of differential boundary trees and the prediction paths to test samples, we revealed that DBTpred can provide an intuitive interpretation and possible hints in detecting important residue mutations that can largely influence binding affinity. AVAILABILITY AND IMPLEMENTATION: The DBTpred package is implemented in Python and freely available at: https://github.com/fpy94/DBT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Histocompatibility Antigens Class I; Peptides; Algorithms; Histocompatibility Antigens Class I/genetics; Histocompatibility Antigens Class I/metabolism; Humans; Major Histocompatibility Complex; Peptides/metabolism; Protein Binding
11.
PLoS Comput Biol ; 17(3): e1008842, 2021 03.
Article in English | MEDLINE | ID: mdl-33770074

ABSTRACT

Translation elongation is regulated by a series of complicated mechanisms in both prokaryotes and eukaryotes. Although recent advances in ribosome profiling techniques have enabled one to capture the genome-wide ribosome footprints along transcripts at codon resolution, the regulatory codes of elongation dynamics are still not fully understood. Most of the existing computational approaches for modeling translation elongation from ribosome profiling data mainly focus on local contextual patterns, while ignoring the continuity of the elongation process and relations between ribosome densities of remote codons. Modeling the translation elongation process at the full-length coding sequence (CDS) level has not been studied to the best of our knowledge. In this paper, we developed a deep learning-based approach with a multi-input and multi-output framework, named RiboMIMO, for modeling the ribosome density distributions of full-length mRNA CDS regions. Through considering the underlying correlations in translation efficiency among neighboring and remote codons and extracting hidden features from the input full-length coding sequence, RiboMIMO can greatly outperform the state-of-the-art baseline approaches and accurately predict the ribosome density distributions along the whole mRNA CDS regions. In addition, RiboMIMO explores the contributions of individual input codons to the predictions of output ribosome densities, which thus can help reveal important biological factors influencing the translation elongation process. The analyses, based on our interpretable metric named codon impact score, not only identified several patterns consistent with the previously published literature, but also for the first time (to the best of our knowledge) revealed that the codons located at a long distance from the ribosomal A site may also be associated with the translation elongation rate. This finding of long-range impact on translation elongation velocity may shed new light on the regulatory mechanisms of protein synthesis. Overall, these results indicated that RiboMIMO can provide a useful tool for studying the regulation of translation elongation in the range of full-length CDS.


Subject(s)
Computational Biology/methods; Deep Learning; Models, Genetic; Peptide Chain Elongation, Translational/genetics; Ribosomes; Codon/genetics; Codon/metabolism; Escherichia coli/genetics; RNA, Messenger/chemistry; RNA, Messenger/genetics; RNA, Messenger/metabolism; Ribosomes/genetics; Ribosomes/metabolism; Saccharomyces cerevisiae/genetics
12.
Bioinformatics ; 36(9): 2872-2880, 2020 05 01.
Article in English | MEDLINE | ID: mdl-31950974

ABSTRACT

MOTIVATION: Quantitative structure-activity relationship (QSAR) and drug-target interaction (DTI) prediction are both commonly used in drug discovery. Collaboration among pharmaceutical institutions can lead to better performance in both QSAR and DTI prediction. However, drug-related data privacy and intellectual property issues have become a noticeable hindrance for inter-institutional collaboration in drug discovery. RESULTS: We have developed two novel algorithms under secure multiparty computation (MPC), QSARMPC and DTIMPC, which enable pharmaceutical institutions to achieve high-quality collaboration to advance drug discovery without divulging private drug-related information. QSARMPC, a neural network model under MPC, displays good scalability and performance and is feasible for privacy-preserving collaboration on large-scale QSAR prediction. DTIMPC integrates drug-related heterogeneous network data and accurately predicts novel DTIs, while keeping the drug information confidential. Under several experimental settings that reflect the situations in real drug discovery scenarios, we have demonstrated that DTIMPC achieves significant performance improvement over the baseline methods, generates novel DTI predictions with supporting evidence from the literature and shows feasible scalability to handle growing DTI data. All these results indicate that QSARMPC and DTIMPC can provide practically useful tools for advancing privacy-preserving drug discovery. AVAILABILITY AND IMPLEMENTATION: The source codes of QSARMPC and DTIMPC are available on GitHub: https://github.com/rongma6/QSARMPC_DTIMPC.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Drug Discovery; Privacy; Algorithms; Drug Development
13.
Pharmacol Res ; 173: 105752, 2021 11.
Article in English | MEDLINE | ID: mdl-34481072

ABSTRACT

Traditional Chinese medicine (TCM) formulas have been widely used in clinical practice for thousands of years. With the development of artificial intelligence, deep learning models may help doctors prescribe reasonable formulas. However, current studies of formula recommendation focus only on observable clinical symptoms and lack molecular information. Here, inspired by the theory of TCM network pharmacology, we propose an intelligent formula recommendation system based on deep learning (FordNet), which fuses phenotype and molecular information. We collected more than 20,000 electronic health records from TCM Master Li Jiren's experience from 2013 to March 2020. In the FordNet system, the features of the diagnosis description are extracted by a convolutional neural network and the features of the TCM formula are extracted by network embedding, which fuses the molecular information. A hierarchical sampling strategy for data augmentation is designed to effectively learn training samples. Based on the expanded samples, a deep neural network-based quantitative optimization model is developed for TCM formula recommendation. FordNet performs significantly better than baseline methods (hit ratio of top 10 improved by 46.9% compared with the best baseline random forest method). Moreover, the molecular information helps FordNet improve the hit ratio by 17.3% compared with the model using only macro information. Clinical evaluation shows that FordNet can learn the effective experience of the TCM Master well and obtain excellent recommendation results. Our study, for the first time, proposes an intelligent recommendation system for TCM formulas integrating phenotype and molecular information, which has the potential to improve clinical diagnosis and treatment, and to promote the shift of the TCM research pattern from "experience based, macro" to "data based, macro-micro combined" as well as the development of TCM network pharmacology.
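
The abstract above reports a "hit ratio of top 10" without defining it. One common definition from the recommender-systems literature is sketched below: the fraction of test cases whose top-k ranked list contains at least one relevant item. Whether FordNet uses exactly this definition is an assumption, and the herb names and rankings are invented for illustration.

```python
def hit_ratio_at_k(ranked_lists, relevant_sets, k=10):
    """Fraction of cases with at least one relevant item in the top k.

    ranked_lists:  one ranked list of recommended items per test case.
    relevant_sets: one set of ground-truth items per test case.
    """
    hits = 0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        if any(item in relevant for item in ranked[:k]):
            hits += 1
    return hits / len(ranked_lists)


# Hypothetical recommendations for two patient records.
ranked = [["ginseng", "licorice", "ginger"], ["peony", "angelica", "poria"]]
truth = [{"ginger"}, {"astragalus"}]

hr_top2 = hit_ratio_at_k(ranked, truth, k=2)
hr_top3 = hit_ratio_at_k(ranked, truth, k=3)
```

With these toy lists, widening the cutoff from 2 to 3 brings the first record's relevant item into range, while the second record never scores a hit.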


Subject(s)
Medicine, Chinese Traditional; Neural Networks, Computer; Humans; Network Pharmacology; Phenotype
14.
Bioinformatics ; 35(14): i284-i294, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510699

ABSTRACT

MOTIVATION: Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, the existing approaches to predicting isoform functions are far from satisfactory for at least two reasons: (i) unlike for genes, isoform-level functional annotations are scarce, and (ii) the information of isoform functions is concealed in various types of data, including isoform sequences, co-expression relationships among isoforms, etc. RESULTS: In this study, we present a novel approach, DIFFUSE (Deep learning-based prediction of IsoForm FUnctions from Sequences and Expression), to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework by first using a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences and then refining the prediction using a conditional random field (CRF) based on co-expression relationships. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train both the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE could effectively predict the functions of isoforms and genes. It achieves an average area under the receiver operating characteristic curve of 0.840 and area under the precision-recall curve of 0.581 over 4184 GO functional categories, which are significantly higher than the state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences. AVAILABILITY AND IMPLEMENTATION: https://github.com/haochenucr/DIFFUSE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning; Neural Networks, Computer; Algorithms; Alternative Splicing
15.
Bioinformatics ; 35(1): 104-111, 2019 01 01.
Article in English | MEDLINE | ID: mdl-30561548

ABSTRACT

Motivation: Accurately predicting drug-target interactions (DTIs) in silico can guide the drug discovery process and thus facilitate drug development. Computational approaches for DTI prediction that adopt the systems biology perspective generally exploit the rationale that the properties of drugs and targets can be characterized by their functional roles in biological networks. Results: Inspired by recent advances in information passing and aggregation techniques that generalize convolutional neural networks to mine large-scale graph data and greatly improve the performance of many network-related prediction tasks, we develop a new nonlinear end-to-end learning model, called NeoDTI, that integrates diverse information from heterogeneous network data and automatically learns topology-preserving representations of drugs and targets to facilitate DTI prediction. The substantial improvement in prediction performance over other state-of-the-art DTI prediction methods, as well as several novel predicted DTIs with supporting evidence from previous studies, have demonstrated the superior predictive power of NeoDTI. In addition, NeoDTI is robust against a wide range of choices of hyperparameters and is ready to integrate more drug and target related information (e.g. compound-protein binding affinity data). All these results suggest that NeoDTI can offer a powerful and robust tool for drug development and drug repositioning. Availability and implementation: The source code and data used in NeoDTI are available at: https://github.com/FangpingWan/NeoDTI. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Computer Simulation; Drug Development/methods; Software; Drug Discovery; Drug Repositioning; Protein Binding
16.
Bioinformatics ; 35(2): 219-226, 2019 01 15.
Article in English | MEDLINE | ID: mdl-30010790

ABSTRACT

Motivation: Vastly greater quantities of microbial genome data are being generated where environmental samples mix together the DNA from many different species. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. We introduce 'low-density' locality sensitive hashing to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning. Results: On public benchmarks, Opal halves the error on precision/recall (F1-score) as compared with both alignment-based and alignment-free methods for species classification. We demonstrate even more marked improvement at higher taxonomic levels, allowing for the discovery of novel lineages. Furthermore, the innovation of low-density, even-coverage hashing should itself prove an essential methodological advance as it enables the application of machine learning to other bioinformatic challenges. Availability and implementation: Full source code and datasets are available at http://opal.csail.mit.edu and https://github.com/yunwilliamyu/opal. Supplementary information: Supplementary data are available at Bioinformatics online.
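
The abstract above mentions low-density locality sensitive hashing with even coverage of k-mer positions. The sketch below illustrates the general idea only, under simplifying assumptions: each hash reads a small subset of positions, and the subsets are built by shuffling and partitioning so that every position is used exactly once. This is not Opal's Gallager-code construction, and the function names and parameters are hypothetical.

```python
import random


def even_coverage_masks(k, positions_per_hash, seed=0):
    """Split the k positions of a k-mer into disjoint, equal-sized subsets.

    Every position lands in exactly one subset, so coverage is even.
    """
    if k % positions_per_hash != 0:
        raise ValueError("k must be divisible by positions_per_hash")
    rng = random.Random(seed)
    order = list(range(k))
    rng.shuffle(order)
    return [
        sorted(order[i:i + positions_per_hash])
        for i in range(0, k, positions_per_hash)
    ]


def sparse_keys(kmer, masks):
    """One low-density hash key per mask: the letters at the masked positions."""
    return ["".join(kmer[i] for i in mask) for mask in masks]


masks = even_coverage_masks(k=12, positions_per_hash=4)
keys_a = sparse_keys("ACGTACGTACGT", masks)
keys_b = sparse_keys("ACGTACGTACGA", masks)  # differs only at the last base

# Keys that still match despite the single substitution.
shared = sum(a == b for a, b in zip(keys_a, keys_b))
```

Because the subsets are disjoint, a single substitution changes exactly one of the three keys, so similar k-mers still collide on most keys, which is the locality-sensitive property.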


Subject(s)
Algorithms; Genome, Microbial; Metagenomics; Software; Computational Biology; Sequence Analysis, DNA
17.
Bioinformatics ; 35(23): 4946-4954, 2019 12 01.
Article in English | MEDLINE | ID: mdl-31120490

ABSTRACT

MOTIVATION: Prediction of peptide binding to the major histocompatibility complex (MHC) plays a vital role in the development of therapeutic vaccines for the treatment of cancer. Algorithms with improved correlations between predicted and actual binding affinities are needed to increase precision and reduce the number of false positive predictions. RESULTS: We present ACME (Attention-based Convolutional neural networks for MHC Epitope binding prediction), a new pan-specific algorithm to accurately predict the binding affinities between peptides and MHC class I molecules, even for those new alleles that are not seen in the training data. Extensive tests have demonstrated that ACME can significantly outperform other state-of-the-art prediction methods with an increase of the Pearson correlation coefficient between predicted and measured binding affinities by up to 23 percentage points. In addition, its ability to identify strong-binding peptides has been experimentally validated. Moreover, by integrating the convolutional neural network with attention mechanism, ACME is able to extract interpretable patterns that can provide useful and detailed insights into the binding preferences between peptides and their MHC partners. All these results have demonstrated that ACME can provide a powerful and practically useful tool for the studies of peptide-MHC class I interactions. AVAILABILITY AND IMPLEMENTATION: ACME is available as an open source software at https://github.com/HYsxe/ACME. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Neural Networks (Computer) , Algorithms , Attention , Binding Sites , Computational Biology , Histocompatibility Antigens Class I , Peptides , Protein Binding
18.
Bioinformatics ; 35(10): 1660-1667, 2019 05 15.
Article in English | MEDLINE | ID: mdl-30295703

ABSTRACT

MOTIVATION: Human immunodeficiency virus type 1 (HIV-1) genome integration is closely related to clinical latency and viral rebound. In addition to human DNA sequences that directly interact with the integration machinery, the selection of HIV integration sites has also been shown to depend on the heterogeneous genomic context around a large region, which greatly hinders the prediction and mechanistic studies of HIV integration. RESULTS: We have developed an attention-based deep learning framework, named DeepHINT, to simultaneously provide accurate prediction of HIV integration sites and mechanistic explanations of the detected sites. Extensive tests on a high-density HIV integration site dataset showed that DeepHINT can outperform conventional modeling strategies by automatically learning the genomic context of HIV integration from primary DNA sequence alone or together with epigenetic information. Systematic analyses of diverse known factors of HIV integration further validated the biological relevance of the prediction results. More importantly, in-depth analyses of the attention values output by DeepHINT revealed intriguing mechanistic implications for the selection of HIV integration sites, including potential roles of several DNA-binding proteins. These results established DeepHINT as an effective and explainable deep learning framework for the prediction and mechanistic study of HIV integration. AVAILABILITY AND IMPLEMENTATION: DeepHINT is available as open-source software and can be downloaded from https://github.com/nonnerdling/DeepHINT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
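The notion of per-position attention values that can be inspected after prediction can be sketched generically as follows. This is an illustration of softmax attention pooling in general, not DeepHINT's architecture; the function names and the toy feature vectors are assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Weighted average of per-position feature vectors.

    features: one feature vector (list of floats) per sequence position.
    scores:   one raw relevance score per position (learned in a real model).
    Returns the pooled vector and the attention weights, which indicate how
    much each position contributed.
    """
    if len(features) != len(scores) or not features:
        raise ValueError("need one score per position and at least one position")
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


# Toy usage: three positions with 2-dimensional features.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = [0.1, 0.2, 3.0]
pooled, weights = attention_pool(toy_features, toy_scores)
top_position = max(range(len(weights)), key=lambda i: weights[i])
```

Ranking positions by weight, as `top_position` does here, is the kind of inspection that lets attention values point to candidate sequence regions for follow-up analysis.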


Subject(s)
HIV-1 , Attention , Deep Learning , Genomics , Humans , Software , Virus Internalization
19.
Nucleic Acids Res ; 46(2): e11, 2018 01 25.
Article in English | MEDLINE | ID: mdl-29136203

ABSTRACT

Alternative splicing plays an important role in many cellular processes of eukaryotic organisms. The exon-inclusion ratio, also known as percent spliced in, is often regarded as one of the most effective measures of alternative splicing events. The existing methods for estimating exon-inclusion ratios at the genome scale all require a reference transcriptome. In this paper, we propose an alignment-free method, FreePSI, to perform genome-wide estimation of exon-inclusion ratios from RNA-Seq data without relying on the guidance of a reference transcriptome. It uses a novel probabilistic generative model based on k-mer profiles to quantify the exon-inclusion ratios at the genome scale, and an efficient expectation-maximization algorithm based on a divide-and-conquer strategy and an ultrafast conjugate gradient projection descent method to solve the model. We compare FreePSI with the existing methods on simulated and real RNA-Seq data in terms of both accuracy and efficiency and show that it is able to achieve very good performance even though a reference transcriptome is not provided. Our results suggest that FreePSI may have important applications in performing alternative splicing analysis for organisms that do not have quality reference transcriptomes. FreePSI is implemented in C++ and freely available to the public on GitHub.
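Two of the quantities mentioned above, a k-mer profile of sequencing reads and the exon-inclusion ratio itself, can be sketched as follows. This is only a definitional illustration with hypothetical function names and toy inputs; it does not reproduce FreePSI's probabilistic model or its expectation-maximization solver.

```python
from collections import Counter


def kmer_profile(reads, k):
    """Count every length-k substring across a collection of reads."""
    if k <= 0:
        raise ValueError("k must be positive")
    profile = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            profile[read[i:i + k]] += 1
    return profile


def exon_inclusion_ratio(inclusion_abundance, exclusion_abundance):
    """Fraction of transcripts that include the exon (percent spliced in).

    Returns None when there is no evidence for either form.
    """
    total = inclusion_abundance + exclusion_abundance
    if total == 0:
        return None
    return inclusion_abundance / total


# Toy usage with made-up inputs.
profile = kmer_profile(["ACGTA", "CGTAC"], k=3)
psi = exon_inclusion_ratio(inclusion_abundance=30, exclusion_abundance=10)
```

In an alignment-free setting the abundances would have to be inferred from the k-mer counts rather than from reads mapped to annotated transcripts; here they are simply supplied as numbers.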


Subject(s)
Algorithms , Alternative Splicing , Computational Biology/methods , Exome Sequencing/methods , Exons/genetics , Gene Expression Profiling/methods , Genetic Models , Reproducibility of Results
20.
Nucleic Acids Res ; 46(8): e50, 2018 05 04.
Article in English | MEDLINE | ID: mdl-29408992

ABSTRACT

Decoding the spatial organization of chromosomes has crucial implications for studying eukaryotic gene regulation. Recently, chromosome conformation capture-based technologies, such as Hi-C, have been widely used to uncover the interaction frequencies of genomic loci in a high-throughput and genome-wide manner and to provide new insights into the folding of the three-dimensional (3D) genome structure. In this paper, we develop a novel manifold learning-based framework, called GEM (Genomic organization reconstructor based on conformational Energy and Manifold learning), to reconstruct the three-dimensional organization of chromosomes by integrating Hi-C data with biophysical feasibility. Unlike previous methods, which explicitly assume specific relationships between Hi-C interaction frequencies and spatial distances, our model directly embeds the neighboring affinities from Hi-C space into 3D Euclidean space. Extensive validations demonstrated that GEM not only greatly outperformed other state-of-the-art modeling methods but also provided physically and physiologically valid 3D representations of the organization of chromosomes. Furthermore, for the first time, we apply the modeled chromatin structures to recover long-range genomic interactions missing from the original Hi-C data.


Subject(s)
Human Chromosomes/chemistry , Human Chromosomes/genetics , Molecular Models , Algorithms , Chromatin/chemistry , Chromatin/genetics , Chromatin/ultrastructure , Chromosome Mapping/methods , Human Chromosomes/ultrastructure , Human Chromosome Pair 14/chemistry , Human Chromosome Pair 14/genetics , Human Chromosome Pair 14/ultrastructure , Computational Biology/methods , Computer Simulation , Human Genome , Genomics/methods , Humans , Three-Dimensional Imaging , Fluorescence In Situ Hybridization , Machine Learning , Molecular Conformation