Results 1 - 3 of 3
1.
Bioinformatics; 39(10), 2023 Oct 03.
Article in English | MEDLINE | ID: mdl-37815839

ABSTRACT

MOTIVATION: In recent years, pre-training with the transformer architecture has gained significant attention. While this approach has led to notable performance improvements across a variety of downstream tasks, the mechanisms by which pre-trained models influence these tasks, particularly in the context of biological data, are not yet fully elucidated.

RESULTS: In this study, focusing on pre-training on nucleotide sequences, we decompose a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model into its embedding and encoding modules to analyze what a pre-trained model learns from nucleotide sequences. Through a comparative study of non-standard pre-training at both the data and model levels, we find that a typical BERT model learns overlapping-consistent k-mer embeddings for its token representation within its embedding module. Interestingly, k-mer embeddings pre-trained on random data yield downstream performance similar to that of k-mer embeddings pre-trained on real biological sequences. We further compare the learned k-mer embeddings with other established k-mer representations in downstream tasks of sequence-based functional prediction. Our experimental results demonstrate that the dense k-mer representation learned from pre-training is a viable alternative to one-hot encoding for representing nucleotide sequences. Furthermore, integrating the pre-trained k-mer embeddings with simpler models achieves competitive performance in two typical downstream tasks.

AVAILABILITY AND IMPLEMENTATION: The source code and associated data can be accessed at https://github.com/yaozhong/bert_investigation.
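As an illustration of the k-mer representations discussed above, the following is a minimal Python sketch (not the paper's code) of overlapping k-mer tokenization with a dense embedding lookup, contrasted with one-hot encoding. The embedding table, its dimension of 128, and the choice k = 3 are assumptions for illustration; in the paper, the dense embeddings come from the embedding module of a pre-trained BERT model.

# Minimal sketch (not the paper's code): overlapping k-mer tokenization of a
# nucleotide sequence and dense embedding lookup, contrasted with one-hot encoding.
# The embedding matrix here is randomly initialized for illustration only.
import itertools
import numpy as np

K = 3  # k-mer size (assumed; DNABERT-style models typically use k = 3..6)

# Vocabulary of all 4^K possible k-mers
vocab = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=K))}

def tokenize(seq: str, k: int = K) -> list[int]:
    """Split a sequence into overlapping k-mers (stride 1) and map them to ids."""
    return [vocab[seq[i:i + k]] for i in range(len(seq) - k + 1)]

def one_hot(token_ids: list[int], vocab_size: int) -> np.ndarray:
    """Sparse baseline representation: one 4^K-dimensional indicator per k-mer."""
    out = np.zeros((len(token_ids), vocab_size))
    out[np.arange(len(token_ids)), token_ids] = 1.0
    return out

# Hypothetical dense k-mer embedding table (in practice, taken from the embedding
# module of a pre-trained BERT model); dimension 128 is arbitrary for this sketch.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 128))

ids = tokenize("ACGTACGTGGCA")
dense = embedding_table[ids]          # (n_tokens, 128) dense representation
sparse = one_hot(ids, len(vocab))     # (n_tokens, 64) one-hot representation
print(dense.shape, sparse.shape)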


Subject(s)
Software; Base Sequence
2.
Brief Bioinform; 24(5), 2023 Sep 20.
Article in English | MEDLINE | ID: mdl-37466138

ABSTRACT

Accurately identifying phage-host relationships from genome sequences is still challenging, especially for phages and hosts with few homologous sequences. In this work, focusing on identifying phage-host relationships at the species and genus levels, we propose a contrastive-learning-based approach to learn whole-genome sequence embeddings that take phage-host interactions (PHIs) into account. Contrastive learning is used to bring phages that infect the same hosts close to each other in the new representation space. Specifically, we represent whole-genome sequences with the frequency chaos game representation (FCGR) and learn latent embeddings that 'encapsulate' phage and host relationships through contrastive learning. The contrastive learning method works well on imbalanced datasets. Based on the learned embeddings, the proposed pipeline, named CL4PHI, can predict both hosts seen during training and unseen hosts. We compare our method with two recently proposed state-of-the-art learning-based methods on their benchmark datasets. The experimental results demonstrate that the proposed method improves prediction accuracy on known hosts and shows zero-shot prediction capability on unseen hosts. In terms of potential applications, the rapid pace of genome sequencing across different species has produced a vast amount of whole-genome sequencing data that require efficient computational methods for identifying phage-host interactions. The proposed approach is expected to address this need by efficiently processing whole-genome sequences of phages and prokaryotic hosts and capturing features related to phage-host relationships for genome sequence representation. It can be used to accelerate the discovery of phage-host interactions and to aid the development of phage-based therapies for infectious diseases.
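To make the genome representation step concrete, below is a minimal Python sketch (not the CL4PHI code; assumptions are noted in the comments) of computing a frequency chaos game representation as a 2^k x 2^k k-mer frequency matrix. The corner assignment and k = 6 are illustrative choices; in a pipeline of this kind, such a matrix would be fed to an encoder trained with a contrastive loss.

# Minimal FCGR sketch: count every overlapping k-mer into its chaos-game cell
# of a 2^k x 2^k grid and normalize to frequencies. Corner assignment below is
# one common convention; other orderings are equally valid.
import numpy as np

CORNER_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def fcgr(seq: str, k: int = 6) -> np.ndarray:
    """Return a (2^k, 2^k) frequency chaos game representation of seq."""
    size = 2 ** k
    grid = np.zeros((size, size), dtype=np.float64)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(b not in CORNER_BITS for b in kmer):
            continue  # skip k-mers containing ambiguous bases such as 'N'
        x = y = 0
        for base in kmer:  # build the cell index one bit per position
            bx, by = CORNER_BITS[base]
            x = (x << 1) | bx
            y = (y << 1) | by
        grid[y, x] += 1.0
    total = grid.sum()
    return grid / total if total > 0 else grid  # normalize counts to frequencies

image = fcgr("ACGTTGCA" * 1000, k=6)   # toy sequence; real input is a whole genome
print(image.shape)                      # (64, 64) single-channel "image"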


Subject(s)
Bacteriophages; Bacteriophages/genetics; Genome, Viral; Whole Genome Sequencing; Chromosome Mapping
3.
Bioinformatics; 38(18): 4264-4270, 2022 Sep 15.
Article in English | MEDLINE | ID: mdl-35920769

ABSTRACT

MOTIVATION: Bacteriophages (phages) are viruses that infect and replicate within bacteria and archaea, and they are abundant in the human body. To investigate the relationship between phages and microbial communities, the first step is identifying phages from metagenome sequences. Currently, there are two main approaches: database-based (alignment-based) methods and alignment-free methods. Database-based methods typically use a large number of sequences as references; alignment-free methods usually learn sequence features with machine learning and deep learning models.

RESULTS: We propose INHERIT, which uses a deep representation learning model to integrate database-based and alignment-free methods, combining the strengths of both. Pre-training serves as an alternative way of acquiring knowledge representations from existing databases, while the BERT-style deep learning framework retains the advantage of alignment-free methods. We compare INHERIT with four existing methods on a third-party benchmark dataset. Our experiments show that INHERIT achieves better performance, with an F1-score of 0.9932. In addition, we find that pre-training on the two types of sequences (phage and bacterial) separately helps the alignment-free deep learning model make more accurate predictions.

AVAILABILITY AND IMPLEMENTATION: The source code of INHERIT is available at https://github.com/Celestial-Bai/INHERIT.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
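As a schematic illustration of combining two separately pre-trained encoders, the following Python sketch (an assumed architecture, not the released INHERIT code) concatenates pooled representations from a phage-pre-trained encoder and a bacteria-pre-trained encoder and passes them to a binary classification head. The stand-in encoders, hidden dimension, and mean pooling are hypothetical choices for this sketch.

# Schematic sketch only: two separately pre-trained sequence encoders whose
# pooled outputs are concatenated and classified as phage vs. non-phage.
import torch
import torch.nn as nn

class DualPretrainedClassifier(nn.Module):
    def __init__(self, phage_encoder: nn.Module, bacteria_encoder: nn.Module,
                 hidden_dim: int = 768):
        super().__init__()
        self.phage_encoder = phage_encoder        # pre-trained on phage sequences
        self.bacteria_encoder = bacteria_encoder  # pre-trained on bacterial sequences
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),             # logit: phage vs. non-phage
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Each encoder is assumed to return (batch, seq_len, hidden_dim);
        # mean-pool over positions as a simple sequence summary.
        h_phage = self.phage_encoder(token_ids).mean(dim=1)
        h_bact = self.bacteria_encoder(token_ids).mean(dim=1)
        return self.head(torch.cat([h_phage, h_bact], dim=-1)).squeeze(-1)

# Toy stand-in encoders: an embedding layer in place of a full BERT encoder.
toy_phage = nn.Sequential(nn.Embedding(4096, 768))
toy_bacteria = nn.Sequential(nn.Embedding(4096, 768))
model = DualPretrainedClassifier(toy_phage, toy_bacteria)
logits = model(torch.randint(0, 4096, (2, 512)))  # batch of 2 tokenized fragments
print(logits.shape)                                # torch.Size([2])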


Subject(s)
Bacteriophages; Humans; Bacteriophages/genetics; Software; Metagenome; Machine Learning; Bacteria