
1.

PhaTYP: predicting the lifestyle for bacteriophages using BERT.

Shang, Jiayu; Tang, Xubo; Sun, Yanni.

Brief Bioinform ; 24(1)2023 01 19.

Article in English | MEDLINE | ID: mdl-36659812

ABSTRACT

Bacteriophages (or phages), which infect bacteria, have two distinct lifestyles: virulent and temperate. Predicting the lifestyle of phages helps decipher their interactions with their bacterial hosts, aiding phages' applications in fields such as phage therapy. Because experimental methods for annotating the lifestyle of phages cannot keep pace with the fast accumulation of sequenced phages, computational methods for predicting phages' lifestyles have become an attractive alternative. Despite some promising results, computational lifestyle prediction remains difficult because of the limited known annotations and the sheer number of sequenced phage contigs assembled from metagenomic data. In particular, most of the existing tools cannot precisely predict phages' lifestyles for short contigs. In this work, we develop PhaTYP (Phage TYPe prediction tool) to improve the accuracy of lifestyle prediction on short contigs. We design two different training tasks, self-supervised and fine-tuning tasks, to overcome lifestyle prediction difficulties. We rigorously tested and compared PhaTYP with four state-of-the-art methods: DeePhage, PHACTS, PhagePred and BACPHLIP. The experimental results show that PhaTYP outperforms all these methods and achieves more stable performance on short contigs. In addition, we demonstrated the utility of PhaTYP for analyzing the phage lifestyle on human neonates' gut data. This application shows that PhaTYP is a useful means for studying phages in metagenomic data and helps extend our understanding of microbial communities.

Subject(s)

Bacteriophages , Microbiota , Infant, Newborn , Humans , Bacteriophages/genetics , Metagenomics/methods , Bacteria , Metagenome

2.

HMMPolish: a coding region polishing tool for TGS-sequenced RNA viruses.

Yu, Runzhou; Abdullah, Syed Muhammad Umer; Sun, Yanni.

Brief Bioinform ; 24(5)2023 09 20.

Article in English | MEDLINE | ID: mdl-37478372

ABSTRACT

Access to accurate viral genomes is important to downstream data analysis. Third-generation sequencing (TGS) has recently become a popular platform for virus sequencing because of its long read length. However, its per-base error rate, which is higher than that of next-generation sequencing, can lead to genomes with errors. Polishing tools are thus needed to correct errors either before or after sequence assembly. Despite promising results of available polishing tools, there is still room to improve the error correction performance to perform more accurate genome assembly. The errors, particularly those in coding regions, can hamper analysis such as lineage identification and variant monitoring. In this work, we developed a novel pipeline, HMMPolish, for correcting (polishing) errors in protein-coding regions of known RNA viruses. This tool can be applied to either raw TGS reads or the assembled sequences of the target virus. By utilizing profile Hidden Markov Models of protein families/domains in known viruses, HMMPolish can correct errors that are ignored by available polishers. We extensively validated HMMPolish on 34 datasets that covered four clinically important viruses, including HIV-1, influenza-A, norovirus, and severe acute respiratory syndrome coronavirus 2. These datasets contain reads with different properties, such as sequencing depth and platforms (PacBio or Nanopore). The benchmark results against popular/representative polishers show that HMMPolish competes favorably on error correction in coding regions of known RNA viruses.

Subject(s)

COVID-19 , RNA Viruses , Viruses , Humans , Sequence Analysis, DNA/methods , Genome , High-Throughput Nucleotide Sequencing/methods

3.

PhaGenus: genus-level classification of bacteriophages using a Transformer model.

Guan, Jiaojiao; Peng, Cheng; Shang, Jiayu; Tang, Xubo; Sun, Yanni.

Brief Bioinform ; 24(6)2023 09 22.

Article in English | MEDLINE | ID: mdl-37965809

ABSTRACT

MOTIVATION: Bacteriophages (phages for short), which prey on and replicate within bacterial cells, have a significant role in modulating microbial communities and hold potential applications in treating antibiotic resistance. The advancement of high-throughput sequencing technology contributes to the discovery of phages tremendously. However, the taxonomic classification of assembled phage contigs still faces several challenges, including high genetic diversity, lack of a stable taxonomy system and limited knowledge of phage annotations. Despite extensive efforts, existing tools have not yet achieved an optimal balance between prediction rate and accuracy. RESULTS: In this work, we develop a learning-based model named PhaGenus, which conducts genus-level taxonomic classification for phage contigs. PhaGenus utilizes a powerful Transformer model to learn the association between protein clusters and support the classification of up to 508 genera. We tested PhaGenus on four datasets in different scenarios. The experimental results show that PhaGenus outperforms state-of-the-art methods in predicting low-similarity datasets, achieving an improvement of at least 13.7%. Additionally, PhaGenus is highly effective at identifying previously uncharacterized genera that are not represented in reference databases, with an improvement of 8.52%. The analysis of the infants' gut and GOV2.0 dataset demonstrates that PhaGenus can be used to classify more contigs with higher accuracy.

Subject(s)

Bacteriophages , Microbiota , Humans , Bacteriophages/genetics , High-Throughput Nucleotide Sequencing

4.

Virus classification for viral genomic fragments using PhaGCN2.

Jiang, Jing-Zhe; Yuan, Wen-Guang; Shang, Jiayu; Shi, Ying-Hui; Yang, Li-Ling; Liu, Min; Zhu, Peng; Jin, Tao; Sun, Yanni; Yuan, Li-Hong.

Brief Bioinform ; 24(1)2023 01 19.

Article in English | MEDLINE | ID: mdl-36464489

ABSTRACT

Viruses are the most ubiquitous and diverse entities in the biome. Due to the rapid growth of newly identified viruses, there is an urgent need for accurate and comprehensive virus classification, particularly for novel viruses. Here, we present PhaGCN2, which can rapidly classify the taxonomy of viral sequences at the family level and supports the visualization of the associations of all families. We evaluate the performance of PhaGCN2 and compare it with the state-of-the-art virus classification tools, such as vConTACT2, CAT and VPF-Class, using the widely accepted metrics. The results show that PhaGCN2 largely improves the precision and recall of virus classification, increases the number of classifiable virus sequences in the Global Ocean Virome dataset (v2.0) by four times and classifies more than 90% of the Gut Phage Database. PhaGCN2 makes it possible to conduct high-throughput and automatic expansion of the database of the International Committee on Taxonomy of Viruses. The source code is freely available at https://github.com/KennthShang/PhaGCN2.0.

Subject(s)

Viruses , Viruses/genetics , Genome, Viral , Databases, Factual , Software , Genomics

5.

Towards more accurate microbial source tracking via non-negative matrix factorization (NMF).

Huang, Ziyi; Cai, Dehan; Sun, Yanni.

Bioinformatics ; 40(Supplement_1): i68-i78, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940128

ABSTRACT

MOTIVATION: The microbiome of a sampled habitat often consists of microbial communities from various sources, including potential contaminants. Microbial source tracking (MST) can be used to discern the contribution of each source to the observed microbiome data, thus enabling the identification and tracking of microbial communities within a sample. Therefore, MST has various applications, from monitoring microbial contamination in clinical labs to tracing the source of pollution in environmental samples. Despite promising results in MST development, there is still room for improvement, particularly for applications where precise quantification of each source's contribution is critical. RESULTS: In this study, we introduce a novel tool called SourceID-NMF towards more precise microbial source tracking. SourceID-NMF utilizes a non-negative matrix factorization (NMF) algorithm to trace the microbial sources contributing to a target sample. By leveraging the taxa abundance in both available sources and the target sample, SourceID-NMF estimates the proportion of available sources present in the target sample. To evaluate the performance of SourceID-NMF, we conducted a series of benchmarking experiments using simulated and real data. The simulated experiments mimic realistic yet challenging scenarios for identifying highly similar sources, irrelevant sources, unknown sources, low abundance sources, and noise sources. The results demonstrate the superior accuracy of SourceID-NMF over existing methods. Particularly, SourceID-NMF accurately estimated the proportion of irrelevant and unknown sources while other tools either over- or under-estimated them. In addition, the noise sources experiment also demonstrated the robustness of SourceID-NMF for MST. AVAILABILITY AND IMPLEMENTATION: SourceID-NMF is available online at https://github.com/ZiyiHuang0708/SourceID-NMF.
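The idea of estimating source proportions from taxa abundances can be illustrated with a minimal sketch. It is not the SourceID-NMF implementation: it assumes a simplified setting in which the source abundance profiles are held fixed and only the mixing weights are updated with the standard multiplicative rule for non-negative least squares. The function name `estimate_source_proportions` and the toy profiles are hypothetical.

```python
def estimate_source_proportions(sources, target, n_iter=500, eps=1e-12):
    """Estimate non-negative mixing weights w so that sum_i w[i] * sources[i] ~ target.

    sources: list of abundance profiles (one list of taxa abundances per source).
    target:  abundance profile of the target sample (same taxa order).
    Simplifying assumption: source profiles are fixed; only w is learned.
    """
    k = len(sources)
    # Gram matrix S^T S and projection S^T y, computed once.
    gram = [[sum(a * b for a, b in zip(sources[i], sources[j])) for j in range(k)]
            for i in range(k)]
    proj = [sum(a * b for a, b in zip(sources[i], target)) for i in range(k)]

    w = [1.0 / k] * k  # start from equal contributions
    for _ in range(n_iter):
        denom = [sum(gram[i][j] * w[j] for j in range(k)) for i in range(k)]
        # Multiplicative update: keeps every weight non-negative.
        w = [w[i] * proj[i] / (denom[i] + eps) for i in range(k)]
    return w


# Toy example (hypothetical data): the target is a 60/40 mix of two sources.
source_a = [0.5, 0.3, 0.2, 0.0]
source_b = [0.1, 0.1, 0.3, 0.5]
target = [0.6 * a + 0.4 * b for a, b in zip(source_a, source_b)]
weights = estimate_source_proportions([source_a, source_b], target)
print([round(x, 3) for x in weights])
```

On this noise-free toy mixture the weights should approach 0.6 and 0.4. Real data would additionally need handling of unknown sources and noise, which this sketch leaves out.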

Subject(s)

Algorithms , Microbiota , Humans

6.

PLASMe: a tool to identify PLASMid contigs from short-read assemblies using transformer.

Tang, Xubo; Shang, Jiayu; Ji, Yongxin; Sun, Yanni.

Nucleic Acids Res ; 51(15): e83, 2023 08 25.

Article in English | MEDLINE | ID: mdl-37427782

ABSTRACT

Plasmids are mobile genetic elements that carry important accessory genes. Cataloging plasmids is a fundamental step to elucidate their roles in promoting horizontal gene transfer between bacteria. Next generation sequencing (NGS) is the main source for discovering new plasmids today. However, NGS assembly programs tend to return contigs, making plasmid detection difficult. This problem is particularly grave for metagenomic assemblies, which contain short contigs of heterogeneous origins. Available tools for plasmid contig detection still suffer from some limitations. In particular, alignment-based tools tend to miss diverged plasmids while learning-based tools often have lower precision. In this work, we develop a plasmid detection tool PLASMe that capitalizes on the strength of alignment and learning-based methods. Closely related plasmids can be easily identified using the alignment component in PLASMe while diverged plasmids can be predicted using order-specific Transformer models. By encoding plasmid sequences as a language defined on the protein cluster-based token set, Transformer can learn the importance of proteins and their correlation through positional token embedding and the attention mechanism. We compared PLASMe and other tools on detecting complete plasmids, plasmid contigs, and contigs assembled from CAMI2 simulated data. PLASMe achieved the highest F1-score. After validating PLASMe on data with known labels, we also tested it on real metagenomic and plasmidome data. The examination of some commonly used marker genes shows that PLASMe exhibits more reliable performance than other tools.

Subject(s)

Genome, Bacterial , Software , Plasmids/genetics , Metagenome , Metagenomics/methods , Sequence Analysis, DNA/methods

7.

CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model.

Shang, Jiayu; Sun, Yanni.

Brief Bioinform ; 23(5)2022 09 20.

Article in English | MEDLINE | ID: mdl-35595715

ABSTRACT

Prokaryotic viruses, which infect bacteria and archaea, are key players in microbial communities. Predicting the hosts of prokaryotic viruses helps decipher the dynamic relationship between microbes. Experimental methods for host prediction cannot keep pace with the fast accumulation of sequenced phages. Thus, there is a need for computational host prediction. Despite some promising results, computational host prediction remains a challenge because of the limited known interactions and the sheer amount of sequenced phages by high-throughput sequencing technologies. The state-of-the-art methods can only achieve 43% accuracy at the species level. In this work, we formulate host prediction as link prediction in a knowledge graph that integrates multiple protein and DNA-based sequence features. Our implementation named CHERRY can be applied to predict hosts for newly discovered viruses and to identify viruses infecting targeted bacteria. We demonstrated the utility of CHERRY for both applications and compared its performance with 11 popular host prediction methods. To our best knowledge, CHERRY has the highest accuracy in identifying virus-prokaryote interactions. It outperforms all the existing methods at the species level with an accuracy increase of 37%. In addition, CHERRY's performance on short contigs is more stable than other tools.

Subject(s)

Bacteriophages , Viruses , Bacteria , Bacteriophages/genetics , DNA , Prokaryotic Cells , Viruses/genetics

8.

RdRp-based sensitive taxonomic classification of RNA viruses for metagenomic data.

Tang, Xubo; Shang, Jiayu; Sun, Yanni.

Brief Bioinform ; 23(2)2022 03 10.

Article in English | MEDLINE | ID: mdl-35136930

ABSTRACT

With advances in library construction protocols and next-generation sequencing technologies, viral metagenomic sequencing has become the major source for novel virus discovery. Conducting taxonomic classification for metagenomic data is an important means to characterize the viral composition in the underlying samples. However, RNA viruses are abundant and highly diverse, jeopardizing the sensitivity of comparison-based classification methods. To improve the sensitivity of read-level taxonomic classification, we developed an RNA-dependent RNA polymerase (RdRp) gene-based read classification tool RdRpBin. It combines alignment-based strategy with machine learning models in order to fully exploit the sequence properties of RdRp. We tested our method and compared its performance with the state-of-the-art tools on the simulated and real sequencing data. RdRpBin competes favorably with all. In particular, when the query RNA viruses share low sequence similarity with the known viruses (~0.4), our tool can still maintain a higher F-score than the state-of-the-art tools. The experimental results on real data also showed that RdRpBin can classify more RNA viral reads with a relatively low false-positive rate. Thus, RdRpBin can be utilized to classify novel and diverged RNA viruses.

Subject(s)

RNA Viruses , Viruses , Metagenome , Metagenomics/methods , RNA Viruses/genetics , RNA-Dependent RNA Polymerase/genetics , Viruses/genetics

9.

Accurate identification of bacteriophages from metagenomic data using Transformer.

Shang, Jiayu; Tang, Xubo; Guo, Ruocheng; Sun, Yanni.

Brief Bioinform ; 23(4)2022 07 18.

Article in English | MEDLINE | ID: mdl-35769000

ABSTRACT

MOTIVATION: Bacteriophages are viruses infecting bacteria. Being key players in microbial communities, they can regulate the composition/function of microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic materials from various microbiome, has become a popular means for new phage discovery. However, accurate and comprehensive detection of phages from the metagenomic data remains difficult. High diversity/abundance, and limited reference genomes pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or precision on metagenomic data. RESULTS: In this work, we adopt the state-of-the-art language model, Transformer, to conduct contextual embedding for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins' positions from each contig into the Transformer. The Transformer can learn the protein organization and associations using the self-attention mechanism and predicts the label for test contigs. We rigorously tested our developed tool named PhaMer on multiple datasets with increasing difficulty, including quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.

Subject(s)

Bacteriophages , Microbiota , Bacteria/genetics , Bacteriophages/genetics , Metagenome , Metagenomics/methods

10.

AccuVIR: an ACCUrate VIRal genome assembly tool for third-generation sequencing data.

Yu, Runzhou; Cai, Dehan; Sun, Yanni.

Bioinformatics ; 39(1)2023 01 01.

Article in English | MEDLINE | ID: mdl-36610711

ABSTRACT

MOTIVATION: RNA viruses tend to mutate constantly. While many of the variants are neutral, some can lead to higher transmissibility or virulence. Accurate assembly of complete viral genomes enables the identification of underlying variants, which are essential for studying virus evolution and elucidating the relationship between genotypes and virus properties. Recently, third-generation sequencing platforms such as Nanopore sequencers have been used for real-time virus sequencing for Ebola, Zika, coronavirus disease 2019, etc. However, their high per-base error rate prevents the accurate reconstruction of the viral genome. RESULTS: In this work, we introduce a new tool, AccuVIR, for viral genome assembly and polishing using error-prone long reads. It can better distinguish sequencing errors from true variants based on the key observation that sequencing errors can disrupt the gene structures of viruses, which usually have a high density of coding regions. Our experimental results on both simulated and real third-generation sequencing data demonstrated its superior performance on generating more accurate viral genomes than generic assembly or polish tools. AVAILABILITY AND IMPLEMENTATION: The source code and the documentation of AccuVIR are available at https://github.com/rainyrubyzhou/AccuVIR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

COVID-19 , Zika Virus Infection , Zika Virus , Humans , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Software , Genome, Viral

11.

GDmicro: classifying host disease status with GCN and deep adaptation network based on the human gut microbiome data.

Liao, Herui; Shang, Jiayu; Sun, Yanni.

Bioinformatics ; 39(12)2023 12 01.

Article in English | MEDLINE | ID: mdl-38085234

ABSTRACT

MOTIVATION: With advances in metagenomic sequencing technologies, there are accumulating studies revealing the associations between the human gut microbiome and some human diseases. These associations shed light on using gut microbiome data to distinguish case and control samples of a specific disease, which is also called host disease status classification. Importantly, using learning-based models to distinguish the disease and control samples is expected to identify important biomarkers more accurately than abundance-based statistical analysis. However, available tools have not fully addressed two challenges associated with this task: limited labeled microbiome data and decreased accuracy in cross-studies. The confounding factors, such as the diet, technical biases in sample collection/sequencing across different studies/cohorts often jeopardize the generalization of the learning model. RESULTS: To address these challenges, we develop a new tool GDmicro, which combines semi-supervised learning and domain adaptation to achieve a more generalized model using limited labeled samples. We evaluated GDmicro on human gut microbiome data from 11 cohorts covering 5 different diseases. The results show that GDmicro has better performance and robustness than state-of-the-art tools. In particular, it improves the AUC from 0.783 to 0.949 in identifying inflammatory bowel disease. Furthermore, GDmicro can identify potential biomarkers with greater accuracy than abundance-based statistical analysis methods. It also reveals the contribution of these biomarkers to the host's disease status. AVAILABILITY AND IMPLEMENTATION: https://github.com/liaoherui/GDmicro.

Subject(s)

Gastrointestinal Microbiome , Inflammatory Bowel Diseases , Microbiota , Humans , Metagenome , Biomarkers

12.

VirBot: an RNA viral contig detector for metagenomic data.

Chen, Guowei; Tang, Xubo; Shi, Mang; Sun, Yanni.

Bioinformatics ; 39(3)2023 03 01.

Article in English | MEDLINE | ID: mdl-36794927

ABSTRACT

SUMMARY: Without relying on cultivation, metagenomic sequencing greatly accelerated the novel RNA virus detection. However, it is not trivial to accurately identify RNA viral contigs from a mixture of species. The low content of RNA viruses in metagenomic data requires a highly specific detector, while new RNA viruses can exhibit high genetic diversity, posing a challenge for alignment-based tools. In this work, we developed VirBot, a simple yet effective RNA virus identification tool based on the protein families and the corresponding adaptive score cutoffs. We benchmarked it with seven popular tools for virus identification on both simulated and real sequencing data. VirBot shows its high specificity in metagenomic datasets and superior sensitivity in detecting novel RNA viruses. AVAILABILITY AND IMPLEMENTATION: https://github.com/GreyGuoweiChen/RNA_virus_detector. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

RNA Viruses , Software , RNA Viruses/genetics , Metagenome , Metagenomics , Sequence Analysis, DNA

13.

PhaVIP: Phage VIrion Protein classification based on chaos game representation and Vision Transformer.

Shang, Jiayu; Peng, Cheng; Tang, Xubo; Sun, Yanni.

Bioinformatics ; 39(39 Suppl 1): i30-i39, 2023 06 30.

Article in English | MEDLINE | ID: mdl-37387136

ABSTRACT

MOTIVATION: As viruses that mainly infect bacteria, phages are key players across a wide range of ecosystems. Analyzing phage proteins is indispensable for understanding phages' functions and roles in microbiomes. High-throughput sequencing enables us to obtain phages in different microbiomes with low cost. However, compared to the fast accumulation of newly identified phages, phage protein classification remains difficult. In particular, a fundamental need is to annotate virion proteins, the structural proteins, such as major tail, baseplate, etc. Although there are experimental methods for virion protein identification, they are too expensive or time-consuming, leaving a large number of proteins unclassified. Thus, there is a great demand to develop a computational method for fast and accurate phage virion protein (PVP) classification. RESULTS: In this work, we adapted the state-of-the-art image classification model, Vision Transformer, to conduct virion protein classification. By encoding protein sequences into unique images using chaos game representation, we can leverage Vision Transformer to learn both local and global features from sequence "images". Our method, PhaVIP, has two main functions: classifying PVP and non-PVP sequences and annotating the types of PVP, such as capsid and tail. We tested PhaVIP on several datasets with increasing difficulty and benchmarked it against alternative tools. The experimental results show that PhaVIP has superior performance. After validating the performance of PhaVIP, we investigated two applications that can use the output of PhaVIP: phage taxonomy classification and phage host prediction. The results showed the benefit of using classified proteins over all proteins. AVAILABILITY AND IMPLEMENTATION: The web server of PhaVIP is available via: https://phage.ee.cityu.edu.hk/phavip. The source code of PhaVIP is available via: https://github.com/KennthShang/PhaVIP.
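The step of turning a protein sequence into an image with chaos game representation can be sketched as follows. This is an illustrative sketch only, not PhaVIP's encoding: it assumes the 20 standard amino acids are placed on the vertices of a regular polygon on the unit circle, that each residue moves the current point halfway toward its vertex, and that visited points are counted on a square grid. The function name `cgr_image`, the `resolution` and `ratio` parameters, and the vertex ordering are all hypothetical choices.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (ordering is an assumption)


def cgr_image(sequence, resolution=16, ratio=0.5):
    """Return a resolution x resolution count grid for a protein sequence.

    Each residue pulls the current point a fraction `ratio` of the way toward
    that residue's vertex; the cell containing the new point is incremented.
    """
    n = len(AMINO_ACIDS)
    vertices = {
        aa: (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i, aa in enumerate(AMINO_ACIDS)
    }
    image = [[0] * resolution for _ in range(resolution)]
    x, y = 0.0, 0.0  # start at the centre of the polygon
    for aa in sequence.upper():
        if aa not in vertices:
            continue  # skip ambiguous or non-standard residues such as 'X'
        vx, vy = vertices[aa]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        # Map coordinates from [-1, 1] to grid indices, clamping the upper edge.
        col = min(int((x + 1) / 2 * resolution), resolution - 1)
        row = min(int((y + 1) / 2 * resolution), resolution - 1)
        image[row][col] += 1
    return image


# Toy usage with a made-up peptide.
grid = cgr_image("MKVLAAGIT")
print(sum(map(sum, grid)))  # number of residues placed on the grid
```

The resulting grid could then be fed to an image classifier; how PhaVIP actually builds and normalises its images is described in the paper, not here.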

Subject(s)

Bacteriophages , Microbiota , Virion , Amino Acid Sequence , Benchmarking

14.

HOTSPOT: hierarchical host prediction for assembled plasmid contigs with transformer.

Ji, Yongxin; Shang, Jiayu; Tang, Xubo; Sun, Yanni.

Bioinformatics ; 39(5)2023 05 04.

Article in English | MEDLINE | ID: mdl-37086432

ABSTRACT

MOTIVATION: As prevalent extrachromosomal replicons in many bacteria, plasmids play an essential role in their hosts' evolution and adaptation. The host range of a plasmid refers to the taxonomic range of bacteria in which it can replicate and thrive. Understanding host ranges of plasmids sheds light on studying the roles of plasmids in bacterial evolution and adaptation. Metagenomic sequencing has become a major means to obtain new plasmids and derive their hosts. However, host prediction for assembled plasmid contigs still needs to tackle several challenges: different sequence compositions and copy numbers between plasmids and the hosts, high diversity in plasmids, and limited plasmid annotations. Existing tools have not yet achieved an ideal tradeoff between sensitivity and precision on metagenomic assembled contigs. RESULTS: In this work, we construct a hierarchical classification tool named HOTSPOT, whose backbone is a phylogenetic tree of the bacterial hosts from phylum to species. By incorporating the state-of-the-art language model, Transformer, in each node's taxon classifier, the top-down tree search achieves an accurate host taxonomy prediction for the input plasmid contigs. We rigorously tested HOTSPOT on multiple datasets, including RefSeq complete plasmids, artificial contigs, simulated metagenomic data, mock metagenomic data, the Hi-C dataset, and the CAMI2 marine dataset. All experiments show that HOTSPOT outperforms other popular methods. AVAILABILITY AND IMPLEMENTATION: The source code of HOTSPOT is available via: https://github.com/Orin-beep/HOTSPOT.
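A top-down search over a taxonomy tree with one classifier per node can be sketched generically as below. This is not HOTSPOT's code: the per-node classifiers are stand-in callables rather than Transformer models, and the early-stopping rule based on `min_confidence` is an assumption added for illustration. The names `predict_lineage`, `tree`, and `classifiers` are hypothetical.

```python
def predict_lineage(tree, classifiers, features, root="root", min_confidence=0.5):
    """Walk the taxonomy tree from the root, picking the best child at each node.

    tree:        dict mapping a node to its list of child taxa.
    classifiers: dict mapping a node to a callable(features) -> {child: score}.
    Stops when a leaf is reached or the best score falls below min_confidence
    (the stopping rule is an assumption of this sketch).
    """
    lineage = []
    node = root
    while tree.get(node):  # continue while the node has children
        scores = classifiers[node](features)
        best_child = max(scores, key=scores.get)
        if scores[best_child] < min_confidence:
            break  # not confident enough to descend further
        lineage.append(best_child)
        node = best_child
    return lineage


# Toy tree and stub classifiers (hypothetical scores, ranks skipped for brevity).
toy_tree = {
    "root": ["Proteobacteria", "Firmicutes"],
    "Proteobacteria": ["Escherichia", "Salmonella"],
}
toy_classifiers = {
    "root": lambda f: {"Proteobacteria": 0.9, "Firmicutes": 0.1},
    "Proteobacteria": lambda f: {"Escherichia": 0.45, "Salmonella": 0.55},
}
print(predict_lineage(toy_tree, toy_classifiers, features=None))
```

Raising `min_confidence` makes the search stop at a higher rank, which trades resolution for precision in this toy setting.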

Subject(s)

Metagenome , Software , Phylogeny , Plasmids/genetics , Metagenomics/methods , Bacteria/genetics

15.

Artificial Intelligence in Meta-optics.

Chen, Mu Ku; Liu, Xiaoyuan; Sun, Yanni; Tsai, Din Ping.

Chem Rev ; 122(19): 15356-15413, 2022 10 12.

Article in English | MEDLINE | ID: mdl-35750326

ABSTRACT

Recent years have witnessed promising artificial intelligence (AI) applications in many disciplines, including optics, engineering, medicine, economics, and education. In particular, the synergy of AI and meta-optics has greatly benefited both fields. Meta-optics are advanced flat optics with novel functions and light-manipulation abilities. The optical properties can be engineered with a unique design to meet various optical demands. This review offers comprehensive coverage of meta-optics and artificial intelligence in synergy. After providing an overview of AI and meta-optics, we categorize and discuss the recent developments integrated by these two topics, namely AI for meta-optics and meta-optics for AI. The former describes how to apply AI to the research of meta-optics for design, simulation, optical information analysis, and application. The latter reports the development of the optical AI system and computation via meta-optics. This review will also provide an in-depth discussion of the challenges of this interdisciplinary field and indicate future directions. We expect that this review will inspire researchers in these fields and benefit the next generation of intelligent optical device design.

Subject(s)

Artificial Intelligence , Optics and Photonics

16.

Distinct composition and amplification dynamics of transposable elements in sacred lotus (Nelumbo nucifera Gaertn.).

Cerbin, Stefan; Ou, Shujun; Li, Yang; Sun, Yanni; Jiang, Ning.

Plant J ; 112(1): 172-192, 2022 10.

Article in English | MEDLINE | ID: mdl-35959634

ABSTRACT

Sacred lotus (Nelumbo nucifera Gaertn.) is a basal eudicot plant with a unique lifestyle, physiological features, and evolutionary characteristics. Here we report the unique profile of transposable elements (TEs) in the genome, using a manually curated repeat library. TEs account for 59% of the genome, and hAT (Ac/Ds) elements alone represent 8%, more than in any other known plant genome. About 18% of the lotus genome is comprised of Copia LTR retrotransposons, and over 25% of them are associated with non-canonical termini (non-TGCA). Such high abundance of non-canonical LTR retrotransposons has not been reported for any other organism. TEs are very abundant in genic regions, with retrotransposons enriched in introns and DNA transposons primarily in flanking regions of genes. The recent insertion of TEs in introns has led to significant intron size expansion, with a total of 200 Mb in the 28 455 genes. This is accompanied by declining TE activity in intergenic regions, suggesting distinct control efficacy of TE amplification in different genomic compartments. Despite the prevalence of TEs in genic regions, some genes are associated with fewer TEs, such as those involved in fruit ripening and stress responses. Other genes are enriched with TEs, and genes in epigenetic pathways are the most associated with TEs in introns, indicating a dynamic interaction between TEs and the host surveillance machinery. The dramatic differential abundance of TEs with genes involved in different biological processes as well as the variation of target preference of different TEs suggests the composition and activity of TEs influence the path of evolution.

Subject(s)

Nelumbo , Retroelements , DNA Transposable Elements/genetics , DNA, Intergenic , Evolution, Molecular , Genome, Plant/genetics , Nelumbo/genetics , Retroelements/genetics

17.

PRMT1 inhibition promotes ferroptosis sensitivity via ACSL1 upregulation in acute myeloid leukemia.

Zhou, Lixin; Jia, Xiaoqing; Shang, Yingying; Sun, Yanni; Liu, Zhilong; Liu, Jifeng; Jiang, Wen; Deng, Siyuan; Yao, Qi; Chen, Jieping; Li, Hui.

Mol Carcinog ; 62(8): 1119-1135, 2023 Aug.

Article in English | MEDLINE | ID: mdl-37144835

ABSTRACT

Acute myeloid leukemia (AML) is a hematological malignancy with an alarming mortality rate. The development of novel therapeutic targets or drugs for AML is urgently needed. Ferroptosis is a form of regulated cell death driven by iron-dependent lipid peroxidation. Recently, ferroptosis has emerged as a novel method for targeting cancer, including AML. Epigenetic dysregulation is a hallmark of AML, and a growing body of evidence suggests that ferroptosis is subject to epigenetic regulation. Here, we identified protein arginine methyltransferase 1 (PRMT1) as a ferroptosis regulator in AML. The type I PRMT inhibitor GSK3368715 promoted ferroptosis sensitivity in vitro and in vivo. Moreover, PRMT1-knockout cells exhibited significantly increased sensitivity to ferroptosis, suggesting that PRMT1 is the primary target of GSK3368715 in AML. Mechanistically, both GSK3368715 and PRMT1 knockout upregulated acyl-CoA synthetase long-chain family member 1 (ACSL1), which acts as a ferroptosis promoter by increasing lipid peroxidation. Knockout of ACSL1 reduced the ferroptosis sensitivity of AML cells following GSK3368715 treatment. Additionally, the GSK3368715 treatment reduced the abundance of H4R3me2a, the main histone methylation modification mediated by PRMT1, in both genome-wide and ACSL1 promoter regions. Overall, our results demonstrated a previously unknown role of the PRMT1/ACSL1 axis in ferroptosis and suggested the potential value and applications of the combination of PRMT1 inhibitor and ferroptosis inducers in AML treatment.

Subject(s)

Ferroptosis , Leukemia, Myeloid, Acute , Humans , Ferroptosis/genetics , Up-Regulation , Epigenesis, Genetic , Protein-Arginine N-Methyltransferases/genetics , Protein-Arginine N-Methyltransferases/metabolism , Enzyme Inhibitors , Leukemia, Myeloid, Acute/drug therapy , Leukemia, Myeloid, Acute/genetics , Leukemia, Myeloid, Acute/pathology , Repressor Proteins/metabolism , Coenzyme A Ligases/genetics , Coenzyme A Ligases/metabolism

18.

Reconstructing viral haplotypes using long reads.

Cai, Dehan; Sun, Yanni.

Bioinformatics ; 38(8): 2127-2134, 2022 04 12.

Article in English | MEDLINE | ID: mdl-35157018

ABSTRACT

MOTIVATION: Most RNA viruses lack strict proofreading during replication. Coupled with a high replication rate, some RNA viruses can form a virus population containing a group of genetically related but different haplotypes. Characterizing the haplotype composition in a virus population is thus important to understand viruses' evolution. Many attempts have been made to reconstruct viral haplotypes using next-generation sequencing (NGS) reads. However, the short length of NGS reads cannot cover distant single-nucleotide variants, making it difficult to reconstruct complete or near-complete haplotypes. Given the fast developments of third-generation sequencing technologies, a new opportunity has arisen for reconstructing full-length haplotypes with long reads. RESULTS: In this work, we developed a new tool, RVHaplo, to reconstruct haplotypes for known viruses from long reads. We tested it rigorously on both simulated and real viral sequencing data and compared it against other popular haplotype reconstruction tools. The results demonstrated that RVHaplo outperforms the state-of-the-art tools for viral haplotype reconstruction from long reads. In particular, RVHaplo can reconstruct the rare (1% abundance) haplotypes that other tools usually miss. AVAILABILITY AND IMPLEMENTATION: The source code and the documentation of RVHaplo are available at https://github.com/dhcai21/RVHaplo. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

RNA Viruses , Software , Haplotypes , RNA Viruses/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA , Algorithms

19.

HaploDMF: viral haplotype reconstruction from long reads via deep matrix factorization.

Cai, Dehan; Shang, Jiayu; Sun, Yanni.

Bioinformatics ; 38(24): 5360-5367, 2022 12 13.

Article in English | MEDLINE | ID: mdl-36308467

ABSTRACT

MOTIVATION: Lacking strict proofreading mechanisms, many RNA viruses can generate progeny with slightly changed genomes. Being able to characterize highly similar genomes (i.e. haplotypes) in one virus population helps study the viruses' evolution and their interactions with the host/other microbes. High-throughput sequencing data has become the major source for characterizing viral populations. However, the inherent limitation on read length by next-generation sequencing makes complete haplotype reconstruction difficult. RESULTS: In this work, we present a new tool named HaploDMF that can construct complete haplotypes using third-generation sequencing (TGS) data. HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype. Unlike existing tools whose performance can be affected by the overlap size between reads, HaploDMF is able to achieve highly robust performance on data with different coverage, haplotype number and error rates. In particular, it can generate more complete haplotypes even when the sequencing coverage drops in the middle. We benchmark HaploDMF against the state-of-the-art tools on simulated and real TGS data from different viruses. The results show that HaploDMF competes favorably against all others. AVAILABILITY AND IMPLEMENTATION: The source code and the documentation of HaploDMF are available at https://github.com/dhcai21/HaploDMF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , RNA Viruses , Haplotypes , High-Throughput Nucleotide Sequencing/methods , Software , RNA Viruses/genetics , Sequence Analysis, DNA/methods

20.

AnnoSINE: a short interspersed nuclear elements annotation tool for plant genomes.

Li, Yang; Jiang, Ning; Sun, Yanni.

Plant Physiol ; 188(2): 955-970, 2022 02 04.

Article in English | MEDLINE | ID: mdl-34792587

ABSTRACT

Short interspersed nuclear elements (SINEs) are a widespread type of small transposable element (TE). With increasing evidence for their impact on gene function and genome evolution in plants, accurate genome-scale SINE annotation becomes a fundamental step for studying the regulatory roles of SINEs and their relationship with other components in the genomes. Despite the overall promising progress made in TE annotation, SINE annotation remains a major challenge. Unlike some other TEs, SINEs are short and heterogeneous, and they usually lack well-conserved sequence or structural features. Thus, current SINE annotation tools have either low sensitivity or high false discovery rates. Given the demand and challenges, we aimed to provide a more accurate and efficient SINE annotation tool for plant genomes. The pipeline starts with maximizing the pool of SINE candidates via profile hidden Markov model-based homology search and de novo SINE search using structural features. Then, it excludes the false positives by integrating all known features of SINEs and the features of other types of TEs that can often be misannotated as SINEs. As a result, the pipeline substantially improves the tradeoff between sensitivity and accuracy, with both values close to or over 90%. We tested our tool in Arabidopsis thaliana and rice (Oryza sativa), and the results show that our tool competes favorably against existing SINE annotation tools. The simplicity and effectiveness of this tool would potentially be useful for generating more accurate SINE annotations for other plant species. The pipeline is freely available at https://github.com/yangli557/AnnoSINE.

Subject(s)

Arabidopsis/genetics , Data Curation/standards , Genome, Plant , Guidelines as Topic , Oryza/genetics , Short Interspersed Nucleotide Elements , Reproducibility of Results
