Results 1 - 11 of 11
1.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36659812

ABSTRACT

Bacteriophages (or phages), which infect bacteria, have two distinct lifestyles: virulent and temperate. Predicting the lifestyle of phages helps decipher their interactions with their bacterial hosts, aiding phages' applications in fields such as phage therapy. Because experimental methods for annotating the lifestyle of phages cannot keep pace with the fast accumulation of sequenced phages, computational methods for predicting phages' lifestyles have become an attractive alternative. Despite some promising results, computational lifestyle prediction remains difficult because of the limited known annotations and the sheer number of sequenced phage contigs assembled from metagenomic data. In particular, most of the existing tools cannot precisely predict phages' lifestyles for short contigs. In this work, we develop PhaTYP (Phage TYPe prediction tool) to improve the accuracy of lifestyle prediction on short contigs. We design two different training tasks, a self-supervised task and a fine-tuning task, to overcome lifestyle prediction difficulties. We rigorously tested and compared PhaTYP with four state-of-the-art methods: DeePhage, PHACTS, PhagePred and BACPHLIP. The experimental results show that PhaTYP outperforms all these methods and achieves more stable performance on short contigs. In addition, we demonstrated the utility of PhaTYP for analyzing the phage lifestyle on human neonates' gut data. This application shows that PhaTYP is a useful means for studying phages in metagenomic data and helps extend our understanding of microbial communities.


Subject(s)
Bacteriophages , Microbiota , Infant, Newborn , Humans , Bacteriophages/genetics , Metagenomics/methods , Bacteria , Metagenome
2.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37965809

ABSTRACT

MOTIVATION: Bacteriophages (phages for short), which prey on and replicate within bacterial cells, have a significant role in modulating microbial communities and hold potential applications in treating antibiotic resistance. The advancement of high-throughput sequencing technology has contributed tremendously to the discovery of phages. However, the taxonomic classification of assembled phage contigs still faces several challenges, including high genetic diversity, lack of a stable taxonomy system and limited knowledge of phage annotations. Despite extensive efforts, existing tools have not yet achieved an optimal balance between prediction rate and accuracy. RESULTS: In this work, we develop a learning-based model named PhaGenus, which conducts genus-level taxonomic classification for phage contigs. PhaGenus utilizes a powerful Transformer model to learn the association between protein clusters and supports the classification of up to 508 genera. We tested PhaGenus on four datasets in different scenarios. The experimental results show that PhaGenus outperforms state-of-the-art methods in predicting low-similarity datasets, achieving an improvement of at least 13.7%. Additionally, PhaGenus is highly effective at identifying previously uncharacterized genera that are not represented in reference databases, with an improvement of 8.52%. The analysis of the infants' gut and GOV2.0 datasets demonstrates that PhaGenus can be used to classify more contigs with higher accuracy.


Subject(s)
Bacteriophages , Microbiota , Humans , Bacteriophages/genetics , High-Throughput Nucleotide Sequencing
3.
Nucleic Acids Res ; 51(15): e83, 2023 08 25.
Article in English | MEDLINE | ID: mdl-37427782

ABSTRACT

Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return contigs, making plasmid detection difficult. This problem is particularly grave for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from some limitations. In particular, alignment-based tools tend to miss diverged plasmids while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool, PLASMe, that capitalizes on the strengths of alignment-based and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on the protein cluster-based token set, the Transformer can learn the importance of proteins and their correlation through positional token embedding and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools.
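
The division of labor described in this abstract (alignment for closely related plasmids, a learned model for diverged ones) can be pictured as a two-stage decision rule. The sketch below is illustrative only: the function name, the identity/coverage inputs, and every threshold are assumptions made for the example, not PLASMe's actual interface or parameters.

```python
from typing import Optional, Tuple


def call_plasmid(
    identity: Optional[float],
    coverage: Optional[float],
    model_prob: float,
    min_identity: float = 0.9,   # assumed threshold, not taken from the paper
    min_coverage: float = 0.9,   # assumed threshold, not taken from the paper
    min_prob: float = 0.5,       # assumed threshold, not taken from the paper
) -> Tuple[bool, str]:
    """Return (is_plasmid, evidence) for one contig.

    identity/coverage describe the best alignment to a reference plasmid
    (None when no alignment was found); model_prob is the score of a
    learned classifier used as a fallback.
    """
    has_alignment = identity is not None and coverage is not None
    if has_alignment and identity >= min_identity and coverage >= min_coverage:
        # A strong alignment is trusted directly.
        return True, "alignment"
    # Otherwise the decision is left to the learned model.
    return model_prob >= min_prob, "model"


print(call_plasmid(0.98, 0.95, 0.10))  # strong alignment wins
print(call_plasmid(None, None, 0.80))  # no alignment, model decides
```

Checking the alignment first means the model is only consulted for contigs that have no close reference, which is where the abstract says alignment-based tools tend to fail.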


Subject(s)
Genome, Bacterial , Software , Plasmids/genetics , Metagenome , Metagenomics/methods , Sequence Analysis, DNA/methods
4.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35136930

ABSTRACT

With advances in library construction protocols and next-generation sequencing technologies, viral metagenomic sequencing has become the major source for novel virus discovery. Conducting taxonomic classification for metagenomic data is an important means to characterize the viral composition in the underlying samples. However, RNA viruses are abundant and highly diverse, jeopardizing the sensitivity of comparison-based classification methods. To improve the sensitivity of read-level taxonomic classification, we developed an RNA-dependent RNA polymerase (RdRp) gene-based read classification tool, RdRpBin. It combines an alignment-based strategy with machine learning models in order to fully exploit the sequence properties of RdRp. We tested our method and compared its performance with state-of-the-art tools on simulated and real sequencing data. RdRpBin competes favorably with all of them. In particular, when the query RNA viruses share low sequence similarity with the known viruses ($\sim 0.4$), our tool can still maintain a higher F-score than the state-of-the-art tools. The experimental results on real data also showed that RdRpBin can classify more RNA viral reads with a relatively low false-positive rate. Thus, RdRpBin can be utilized to classify novel and diverged RNA viruses.


Subject(s)
RNA Viruses , Viruses , Metagenome , Metagenomics/methods , RNA Viruses/genetics , RNA-Dependent RNA Polymerase/genetics , Viruses/genetics
5.
Brief Bioinform ; 23(4)2022 07 18.
Article in English | MEDLINE | ID: mdl-35769000

ABSTRACT

MOTIVATION: Bacteriophages are viruses infecting bacteria. Being key players in microbial communities, they can regulate the composition/function of the microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic materials from various microbiomes, has become a popular means for new phage discovery. However, accurate and comprehensive detection of phages from the metagenomic data remains difficult. High diversity/abundance and limited reference genomes pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or low precision on metagenomic data. RESULTS: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins' positions from each contig into the Transformer. The Transformer can learn the protein organization and associations using the self-attention mechanism and predict the label for test contigs. We rigorously tested our developed tool named PhaMer on multiple datasets with increasing difficulty, including quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.
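
To make the idea of a protein-cluster vocabulary concrete, the following minimal sketch turns the ordered proteins of a contig into a fixed-length list of integer tokens that a Transformer-style model could consume. The cluster names, the reserved PAD/UNK ids, and the function names are hypothetical and chosen only for illustration; the abstract does not specify how the tool builds or stores its vocabulary.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # hypothetical reserved ids for padding and unseen clusters


def build_vocab(cluster_names: List[str]) -> Dict[str, int]:
    """Map each protein-cluster name to an integer token id (ids start at 2)."""
    return {name: i + 2 for i, name in enumerate(sorted(set(cluster_names)))}


def encode_contig(proteins: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Encode the proteins of one contig, in genomic order, as token ids.

    Keeping the order lets a downstream model attach positional embeddings,
    so both composition and position are available to it.
    """
    tokens = [vocab.get(p, UNK) for p in proteins[:max_len]]
    return tokens + [PAD] * (max_len - len(tokens))


vocab = build_vocab(["PC_terminase", "PC_capsid", "PC_tail"])
encoded = encode_contig(["PC_tail", "PC_novel", "PC_capsid"], vocab, max_len=5)
print(encoded)  # [3, 1, 2, 0, 0]
```

Proteins that match no known cluster map to a shared UNK token here; that is one simple design choice among several and is not claimed to be the one used in the paper.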


Subject(s)
Bacteriophages , Microbiota , Bacteria/genetics , Bacteriophages/genetics , Metagenome , Metagenomics/methods
6.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36794927

ABSTRACT

SUMMARY: Without relying on cultivation, metagenomic sequencing has greatly accelerated novel RNA virus detection. However, it is not trivial to accurately identify RNA viral contigs from a mixture of species. The low content of RNA viruses in metagenomic data requires a highly specific detector, while new RNA viruses can exhibit high genetic diversity, posing a challenge for alignment-based tools. In this work, we developed VirBot, a simple yet effective RNA virus identification tool based on protein families and their corresponding adaptive score cutoffs. We benchmarked it against seven popular tools for virus identification on both simulated and real sequencing data. VirBot shows high specificity in metagenomic datasets and superior sensitivity in detecting novel RNA viruses. AVAILABILITY AND IMPLEMENTATION: https://github.com/GreyGuoweiChen/RNA_virus_detector. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
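
The phrase "adaptive score cutoffs" can be read as giving each protein family its own threshold instead of one global cutoff. The sketch below shows that idea in its simplest form. The family names, the cutoff values, and the rule "one passing hit is enough" are all assumptions for illustration and are not taken from VirBot.

```python
from typing import Dict, List, Tuple

# Hypothetical per-family score cutoffs; in practice each value would be
# calibrated separately for its family.
FAMILY_CUTOFFS: Dict[str, float] = {"RdRp_fam1": 50.0, "capsid_fam7": 80.0}

Hit = Tuple[str, float]  # (protein family, alignment score)


def passing_hits(hits: List[Hit], cutoffs: Dict[str, float]) -> List[Hit]:
    """Keep only hits whose score reaches the cutoff of their own family."""
    return [(fam, score) for fam, score in hits
            if fam in cutoffs and score >= cutoffs[fam]]


def is_rna_virus(hits: List[Hit], cutoffs: Dict[str, float] = FAMILY_CUTOFFS) -> bool:
    """Call a contig an RNA virus if at least one hit passes its family cutoff."""
    return len(passing_hits(hits, cutoffs)) > 0


print(is_rna_virus([("RdRp_fam1", 62.0)]))    # True: 62 >= 50
print(is_rna_virus([("capsid_fam7", 62.0)]))  # False: 62 < 80
```

The same score of 62 is accepted for one family and rejected for another, which is the point of family-specific cutoffs: permissive where a family is distinctive, strict where spurious matches are likely.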


Subject(s)
RNA Viruses , Software , RNA Viruses/genetics , Metagenome , Metagenomics , Sequence Analysis, DNA
7.
Bioinformatics ; 39(39 Suppl 1): i30-i39, 2023 06 30.
Article in English | MEDLINE | ID: mdl-37387136

ABSTRACT

MOTIVATION: As viruses that mainly infect bacteria, phages are key players across a wide range of ecosystems. Analyzing phage proteins is indispensable for understanding phages' functions and roles in microbiomes. High-throughput sequencing enables us to obtain phages in different microbiomes at low cost. However, compared to the fast accumulation of newly identified phages, phage protein classification remains difficult. In particular, a fundamental need is to annotate virion proteins, that is, structural proteins such as the major tail and baseplate. Although there are experimental methods for virion protein identification, they are too expensive or time-consuming, leaving a large number of proteins unclassified. Thus, there is a great demand to develop a computational method for fast and accurate phage virion protein (PVP) classification. RESULTS: In this work, we adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification. By encoding protein sequences into unique images using chaos game representation, we can leverage Vision Transformer to learn both local and global features from sequence "images". Our method, PhaVIP, has two main functions: classifying PVP and non-PVP sequences and annotating the types of PVP, such as capsid and tail. We tested PhaVIP on several datasets with increasing difficulty and benchmarked it against alternative tools. The experimental results show that PhaVIP has superior performance. After validating the performance of PhaVIP, we investigated two applications that can use the output of PhaVIP: phage taxonomy classification and phage host prediction. The results showed the benefit of using classified proteins over all proteins. AVAILABILITY AND IMPLEMENTATION: The web server of PhaVIP is available via: https://phage.ee.cityu.edu.hk/phavip. The source code of PhaVIP is available via: https://github.com/KennthShang/PhaVIP.
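
Chaos game representation (CGR) maps a sequence to a cloud of points by repeatedly moving halfway from the current point toward a vertex assigned to the next residue. The sketch below is one simple variant for proteins, assuming the 20 standard amino acids are placed on a regular 20-gon and the points are binned into a small count image. The vertex layout, the image size, and the function names are assumptions for illustration; the abstract does not say which CGR variant PhaVIP uses.

```python
import math
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed layout: one vertex per amino acid, evenly spaced on the unit circle.
VERTICES: Dict[str, Tuple[float, float]] = {
    aa: (math.cos(2 * math.pi * i / 20), math.sin(2 * math.pi * i / 20))
    for i, aa in enumerate(AMINO_ACIDS)
}


def cgr_points(seq: str) -> List[Tuple[float, float]]:
    """Run the chaos game: start at the origin and, for each residue,
    jump halfway toward that residue's vertex."""
    x, y = 0.0, 0.0
    points = []
    for aa in seq.upper():
        if aa not in VERTICES:
            continue  # skip ambiguous or non-standard residues
        vx, vy = VERTICES[aa]
        x, y = (x + vx) / 2, (y + vy) / 2
        points.append((x, y))
    return points


def cgr_image(seq: str, size: int = 32) -> List[List[int]]:
    """Bin the CGR points into a size x size grid of counts (a grayscale image)."""
    image = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int((x + 1) / 2 * size), size - 1)
        row = min(int((y + 1) / 2 * size), size - 1)
        image[row][col] += 1
    return image


image = cgr_image("MKTAYIAKQR", size=16)
print(sum(map(sum, image)))  # 10: one count per residue
```

Because each point depends on all preceding residues, nearby pixels tend to share recent sequence context, which is what makes an image model a plausible consumer of this encoding.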


Subject(s)
Bacteriophages , Microbiota , Virion , Amino Acid Sequence , Benchmarking
8.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37086432

ABSTRACT

MOTIVATION: As prevalent extrachromosomal replicons in many bacteria, plasmids play an essential role in their hosts' evolution and adaptation. The host range of a plasmid refers to the taxonomic range of bacteria in which it can replicate and thrive. Understanding host ranges of plasmids sheds light on studying the roles of plasmids in bacterial evolution and adaptation. Metagenomic sequencing has become a major means to obtain new plasmids and derive their hosts. However, host prediction for assembled plasmid contigs still needs to tackle several challenges: different sequence compositions and copy numbers between plasmids and the hosts, high diversity in plasmids, and limited plasmid annotations. Existing tools have not yet achieved an ideal tradeoff between sensitivity and precision on metagenomic assembled contigs. RESULTS: In this work, we construct a hierarchical classification tool named HOTSPOT, whose backbone is a phylogenetic tree of the bacterial hosts from phylum to species. By incorporating the state-of-the-art language model, Transformer, in each node's taxon classifier, the top-down tree search achieves an accurate host taxonomy prediction for the input plasmid contigs. We rigorously tested HOTSPOT on multiple datasets, including RefSeq complete plasmids, artificial contigs, simulated metagenomic data, mock metagenomic data, the Hi-C dataset, and the CAMI2 marine dataset. All experiments show that HOTSPOT outperforms other popular methods. AVAILABILITY AND IMPLEMENTATION: The source code of HOTSPOT is available via: https://github.com/Orin-beep/HOTSPOT.
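
A top-down search over a taxonomy tree with one classifier per node can be sketched as follows. The stand-in classifiers are plain functions returning fixed probabilities, and the early-stop rule based on a confidence threshold is an assumption added for illustration; neither the class name `TaxonNode` nor the threshold comes from HOTSPOT.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class TaxonNode:
    name: str
    children: List["TaxonNode"] = field(default_factory=list)
    # Hypothetical per-node classifier: features -> {child name: probability}.
    classify: Optional[Callable[[dict], Dict[str, float]]] = None


def predict_lineage(root: TaxonNode, features: dict, min_conf: float = 0.5) -> List[str]:
    """Walk from the root toward the leaves, following the most probable child
    at each node and stopping when confidence drops below min_conf."""
    lineage: List[str] = []
    node = root
    while node.children and node.classify is not None:
        probs = node.classify(features)
        if not probs:
            break
        best = max(probs, key=probs.get)
        if probs[best] < min_conf:
            break  # not confident enough to go deeper
        child = next((c for c in node.children if c.name == best), None)
        if child is None:
            break
        lineage.append(best)
        node = child
    return lineage


# Toy two-level tree with fixed-output stand-in classifiers.
family = TaxonNode(
    "Enterobacteriaceae",
    [TaxonNode("Escherichia"), TaxonNode("Salmonella")],
    classify=lambda f: {"Escherichia": 0.40, "Salmonella": 0.35},
)
root = TaxonNode("root", [family], classify=lambda f: {"Enterobacteriaceae": 0.9})

print(predict_lineage(root, {}, min_conf=0.5))  # ['Enterobacteriaceae']
print(predict_lineage(root, {}, min_conf=0.3))  # ['Enterobacteriaceae', 'Escherichia']
```

Stopping early returns a prediction at a higher rank instead of forcing a low-confidence call at a lower one, which is one way such a tree search can trade sensitivity against precision.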


Subject(s)
Metagenome , Software , Phylogeny , Plasmids/genetics , Metagenomics/methods , Bacteria/genetics
9.
BMC Bioinformatics ; 20(Suppl 23): 646, 2019 Dec 27.
Article in English | MEDLINE | ID: mdl-31881831

ABSTRACT

BACKGROUND: There are many different types of microRNAs (miRNAs) and elucidating their functions is still under intensive research. A fundamental step in functional annotation of a new miRNA is to classify it into characterized miRNA families, such as those in Rfam and miRBase. With the accumulation of annotated miRNAs, it becomes possible to use deep learning-based models to classify different types of miRNAs. In this work, we investigate several key issues associated with successful application of deep learning models for miRNA classification. First, as secondary structure conservation is a prominent feature for noncoding RNAs including miRNAs, we examine whether secondary structure-based encoding improves classification accuracy. Second, as there are many more non-miRNA sequences than miRNAs, instead of assigning a negative class for all non-miRNA sequences, we test whether using the softmax output can distinguish in-distribution and out-of-distribution samples. Finally, we investigate whether deep learning models can correctly classify sequences from small miRNA families. RESULTS: We present our trained convolutional neural network (CNN) models for classifying miRNAs using different types of feature learning and encoding methods. In the first method, we explicitly encode the predicted secondary structure in a matrix. In the second method, we use only the primary sequence information and a one-hot encoding matrix. In addition, in order to reject sequences that should not be classified into targeted miRNA families, we use a threshold derived from the softmax layer to exclude out-of-distribution sequences, which is an important feature to make this model useful for real transcriptomic data. The comparison with state-of-the-art ncRNA classification tools such as Infernal shows that our method can achieve comparable sensitivity and accuracy while being significantly faster.
CONCLUSION: Automatic feature learning in CNN can lead to better classification accuracy and sensitivity for miRNA classification and annotation. The trained models and associated code are freely available at https://github.com/HubertTang/DeepMir.
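
Rejecting out-of-distribution inputs with a softmax threshold amounts to accepting the top class only when its probability is high enough. The sketch below shows this rule with a numerically stable softmax. The threshold value, the family labels, and the function names are illustrative assumptions, not the values or interface used by the authors.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_or_reject(
    logits: Sequence[float],
    labels: Sequence[str],
    threshold: float = 0.9,  # assumed value; it would be tuned on held-out data
) -> Optional[str]:
    """Return the top label if its softmax probability reaches the threshold,
    otherwise None to mark the input as out-of-distribution."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None


families = ["mir-10", "let-7", "mir-17"]  # example family labels
print(classify_or_reject([8.0, 0.5, 0.1], families))  # mir-10
print(classify_or_reject([1.0, 0.9, 1.1], families))  # None
```

A flat probability vector, as in the second call, signals that no known family fits well, so the sequence is rejected without needing an explicit negative class.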


Subject(s)
MicroRNAs/genetics , Neural Networks, Computer , Algorithms , Base Pairing/genetics , Base Sequence , Nucleotide Motifs/genetics , Probability , RNA, Transfer/genetics
10.
Bioinform Adv ; 3(1): vbad101, 2023.
Article in English | MEDLINE | ID: mdl-37641717

ABSTRACT

Motivation: There is accumulating evidence showing the important roles of bacteriophages (phages) in regulating the structure and functions of the microbiome. However, the lack of easy-to-use, integrated phage analysis software hampers microbiome-related research from incorporating phages in the analysis. Results: In this work, we developed a web server, PhaBOX, which can comprehensively identify and analyze phage contigs in metagenomic data. It supports integrated phage analysis, including phage contig identification from the metagenomic assembly, lifestyle prediction, taxonomic classification, and host prediction. Instead of treating the algorithms as a black box, PhaBOX also supports visualization of the essential features for making predictions. The web server is designed with a user-friendly graphical interface that enables both informatics-trained and nonspecialist users to analyze phages in microbiome data with ease. Availability and implementation: The web server of PhaBOX is available via: https://phage.ee.cityu.edu.hk. The source code of PhaBOX is available at: https://github.com/KennthShang/PhaBOX.

11.
Viruses ; 15(1)2022 12 24.
Article in English | MEDLINE | ID: mdl-36680094

ABSTRACT

Viruses are the most abundant form of life on earth and play important roles in a broad range of ecosystems. Currently, two methods, whole genome shotgun metagenome (WGSM) and viral-like particle enriched metagenome (VLPM) sequencing, are widely applied to compare viruses in various environments. However, there is no critical assessment of their performance in recovering viruses and biological interpretation in comparative viral metagenomic studies. To fill this gap, we applied the two methods to investigate the stool virome in hepatocellular carcinoma (HCC) patients and healthy controls. Both WGSM and VLPM methods can capture the major diversity patterns of alpha and beta diversities and identify the altered viral profiles in the HCC stool samples compared with healthy controls. Viral signatures identified by both methods showed reductions of Faecalibacterium virus Taranis in HCC patients' stool. Ultra-deep sequencing recovered more viruses with both methods; however, 3 or 5 Gb were generally sufficient to capture the non-fragmented long viral contigs. More lytic viruses than lysogenic viruses were detected by both methods, and VLPM can detect RNA viruses. Using both methods would identify shared and specific viral signatures and would capture different parts of the total virome.


Subject(s)
Carcinoma, Hepatocellular , Liver Neoplasms , Viruses , Humans , Metagenome , Carcinoma, Hepatocellular/genetics , Virome , Ecosystem , Liver Neoplasms/genetics , Viruses/genetics , Metagenomics/methods , Genome, Viral