Search | Portal Regional de la BVS

1.

CAPE: a deep learning framework with Chaos-Attention net for Promoter Evolution.

Ren, Ruohan; Yu, Hongyu; Teng, Jiahao; Mao, Sihui; Bian, Zixuan; Tao, Yangtianze; Yau, Stephen S-T.

Brief Bioinform ; 25(5), 2024 Jul 25.

Article in English | MEDLINE | ID: mdl-39120645

ABSTRACT

Predicting the strength of promoters and guiding their directed evolution is a crucial task in synthetic biology. This approach significantly reduces the experimental costs in conventional promoter engineering. Previous studies employing machine learning or deep learning methods have shown some success in this task, but their outcomes were not satisfactory enough, primarily due to the neglect of evolutionary information. In this paper, we introduce the Chaos-Attention net for Promoter Evolution (CAPE) to address the limitations of existing methods. We comprehensively extract evolutionary information within promoters using merged chaos game representation and process the overall information with modified DenseNet and Transformer structures. Our model achieves state-of-the-art results on two kinds of distinct tasks related to prokaryotic promoter strength prediction. The incorporation of evolutionary information enhances the model's accuracy, with transfer learning further extending its adaptability. Furthermore, experimental results confirm CAPE's efficacy in simulating in silico directed evolution of promoters, marking a significant advancement in predictive modeling for prokaryotic promoter strength. Our paper also presents a user-friendly website for the practical implementation of in silico directed evolution on promoters. The source code implemented in this study and the instructions on accessing the website can be found in our GitHub repository https://github.com/BobYHY/CAPE.
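
The abstract names chaos game representation (CGR) as the feature extractor but does not spell it out. As a rough illustration only, the sketch below implements the plain, textbook CGR of a DNA sequence; the corner layout and the function name `cgr_points` are assumptions made here, and the "merged" variant used by CAPE is not reproduced.

```python
# Assumed corner layout (a common convention, not taken from the CAPE paper).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the chaos game trajectory of a DNA sequence as (x, y) points."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

Binning these points on a 2^k by 2^k grid gives a k-mer frequency image, which is the kind of input a convolutional network can consume.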

Subject(s)

Deep Learning , Genetic Promoter Regions , Algorithms , Molecular Evolution , Computer Simulation , Nonlinear Dynamics , Computational Biology/methods

2.

New Virus Variant Detection Based on the Optimal Natural Metric.

Yu, Hongyu; Yau, Stephen S-T.

Genes (Basel) ; 15(7), 2024 Jul 07.

Article in English | MEDLINE | ID: mdl-39062670

ABSTRACT

The highly variable SARS-CoV-2 virus responsible for the COVID-19 pandemic frequently undergoes mutations, leading to the emergence of new variants that present novel threats to public health. The determination of these variants often relies on manual definition based on local sequence characteristics, resulting in delays in their detection relative to their actual emergence. In this study, we propose an algorithm for the automatic identification of novel variants. By leveraging the optimal natural metric for viruses based on an alignment-free perspective to measure distances between sequences, we devise a hypothesis testing framework to determine whether a given viral sequence belongs to a novel variant. Our method demonstrates high accuracy, achieving nearly 100% precision in identifying new variants of SARS-CoV-2 and HIV-1 as well as in detecting novel genera in Orthocoronavirinae. This approach holds promise for timely surveillance and management of emerging viral threats in the field of public health.

Subject(s)

Algorithms , COVID-19 , HIV-1 , SARS-CoV-2 , SARS-CoV-2/genetics , Humans , COVID-19/virology , COVID-19/epidemiology , HIV-1/genetics , Mutation

3.

Current Pediatric Endoscopy Training Situation in the Asia-Pacific Region: A Collaborative Survey by the Asian Pan-Pacific Society for Pediatric Gastroenterology, Hepatology and Nutrition Endoscopy Scientific Subcommittee.

Ukarapol, Nuthapong; Tanatip, Narumon; Sharma, Ajay; Vitug-Sales, Maribel; Lopez, Robert Nicholas; Malik, Rohan; Ng, Ruey Terng; Umetsu, Shuichiro; Getsuwan, Songpon; Lui, Tak Yau Stephen; Yang, Yao-Jong; Lee, Yeoun Joo; Arai, Katsuhiro; Kim, Kyung Mo.

Pediatr Gastroenterol Hepatol Nutr ; 27(4): 258-265, 2024 Jul.

Article in English | MEDLINE | ID: mdl-39035405

ABSTRACT

Purpose: To date, there is no region-specific guideline for pediatric endoscopy training. This study aimed to illustrate the current status of pediatric endoscopy training in the Asia-Pacific region and identify opportunities for improvement. Methods: A cross-sectional survey, using a standardized electronic questionnaire, was conducted among medical schools in the Asia-Pacific region in January 2024. Results: A total of 57 medical centers in 12 countries offering formal Pediatric Gastroenterology training programs participated in this regional survey. More than 75% of the centers had an average case load of <10 cases per week for both diagnostic and therapeutic endoscopies. Only 36% of the study programs employed competency-based outcomes for program development, whereas nearly half (48%) used volume-based curricula. Foreign body retrieval, polypectomy, percutaneous endoscopic gastrostomy, and esophageal variceal hemostasis, that is, sclerotherapy or band ligation (endoscopic variceal sclerotherapy and endoscopic variceal ligation), comprised the top four priorities that the trainees should acquire in the autonomous stage (unconscious) of competence. Regarding the learning environment, only 31.5% provided formal hands-on workshops/simulation training. The direct observation of procedural skills was the most commonly used assessment method. The application of a quality assurance (QA) system in both educational and patient care (Pediatric Endoscopy Quality Improvement Network) aspects was present in only 28% and 17% of the centers, respectively. Conclusion: Compared with Western academic societies, the limited availability of cases remains a major concern. To close this gap, simulation and adult endoscopy training are essential. The implementation of reliable and valid assessment tools and QA systems can lead to significant development in future programs.

4.

The optimal metric for viral genome space.

Yu, Hongyu; Yau, Stephen S-T.

Comput Struct Biotechnol J ; 23: 2083-2096, 2024 Dec.

Article in English | MEDLINE | ID: mdl-38803517

ABSTRACT

Understanding the structural similarity between genomes is pivotal in classification and phylogenetic analysis. As the number of known genomes rockets, alignment-free methods have gained considerable attention. Among these methods, the natural vector method stands out as it represents sequences as vectors using statistical moments, enabling effective clustering based on families in biological taxonomy. However, determining an optimal metric that combines different elements in natural vectors remains challenging due to the absence of a rigorous theoretical framework for weighting different k-mers and orders. In this study, we address this challenge by transforming the determination of optimal weights into an optimization problem and resolving it through gradient-based techniques. Our experimental results underscore the substantial improvement in classification accuracy achieved by employing these optimal weights, reaching an impressive 92.73% on the testing set, surpassing other alignment-free methods. On one hand, our method offers an outstanding metric for virus classification, and on the other hand, it provides valuable insights into feature integration within alignment-free methods.
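
For orientation, the natural vector that this abstract builds on can be sketched in a few lines. The version below is the basic 12-dimensional form (per nucleotide: count, mean position, normalized second central moment); the function name `natural_vector` is chosen here for illustration, and the k-mer extension and the learned weights described in the abstract are not reproduced.

```python
def natural_vector(seq):
    """Basic 12-dimensional natural vector of a DNA sequence.

    For each nucleotide in A, C, G, T the vector holds:
    count, mean 1-based position, and the second central moment
    of the positions divided by (count * sequence length).
    """
    seq = seq.upper()
    n = len(seq)
    vec = []
    for base in "ACGT":
        positions = [i + 1 for i, b in enumerate(seq) if b == base]
        count = len(positions)
        if count == 0:
            # Absent nucleotide: all three components are set to zero.
            vec.extend([0, 0.0, 0.0])
            continue
        mean = sum(positions) / count
        d2 = sum((p - mean) ** 2 for p in positions) / (count * n)
        vec.extend([count, mean, d2])
    return vec
```

A weighted metric of the kind the abstract describes would then compare two such vectors component by component, with one weight per component; how those weights are optimized is specific to the paper.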

5.

Automated recognition of chromosome fusion using an alignment-free natural vector method.

Yu, Hongyu; Yau, Stephen S-T.

Front Genet ; 15: 1364951, 2024.

Article in English | MEDLINE | ID: mdl-38572414

ABSTRACT

Chromosomal fusion is a significant form of structural variation, but research into algorithms for its identification has been limited. Most existing methods rely on synteny analysis, which necessitates manual annotations and always involves inefficient sequence alignments. In this paper, we present a novel alignment-free algorithm for chromosomal fusion recognition. Our method transforms the problem into a series of assignment problems using natural vectors and efficiently solves them with the Kuhn-Munkres algorithm. When applied to the human/gorilla and swamp buffalo/river buffalo datasets, our algorithm successfully and efficiently identifies chromosomal fusion events. Notably, our approach offers several advantages, including higher processing speed, achieved by eliminating time-consuming alignments, and no need for manual annotations. Taking an alignment-free perspective, our algorithm initially considers entire chromosomes instead of fragments to identify chromosomal structural variations, offering substantial potential to advance research in this field.
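
The assignment problem mentioned in this abstract can be illustrated on toy data. The sketch below is not the Kuhn-Munkres algorithm: it is a brute-force search over permutations, swapped in because it is short and needs only the standard library. The function name `best_assignment` and the use of plain Euclidean distance between feature vectors are assumptions made for illustration.

```python
import math
from itertools import permutations


def best_assignment(rows, cols):
    """Match each row vector to a distinct column vector.

    Returns (matching, total_cost), where matching[i] is the index of the
    column vector assigned to rows[i] and total_cost is the summed
    Euclidean distance. Requires len(rows) <= len(cols).
    Brute force: factorial time, suitable only for very small inputs.
    """
    best_cost, best_perm = math.inf, ()
    for perm in permutations(range(len(cols)), len(rows)):
        cost = sum(math.dist(rows[i], cols[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost
```

For realistic chromosome counts, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the permutation loop while returning the same optimal matching.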

6.

Geometric analysis of SARS-CoV-2 variants.

Guan, Mengcen; Sun, Nan; Yau, Stephen S-T.

Gene ; 909: 148291, 2024 May 30.

Article in English | MEDLINE | ID: mdl-38417688

ABSTRACT

SARS-CoV-2, which causes a severe respiratory disease, has been prevalent around the world since its first discovery in 2019. As a single-stranded RNA virus, its high mutation rate makes its variants manifold and enables some of them to have high pathogenicity, such as the Omicron variant, currently the most prevalent one. Research on the relationships among these SARS-CoV-2 variants, especially on their differences, is a hot issue. In this study, we constructed a geometric space to represent all SARS-CoV-2 sequences of different variants. An alignment-free method, the natural vector method, was used to establish the genome space. The genome space of SARS-CoV-2 was constructed based on the 24-dimensional natural vector, and the appropriate metric was determined through phylogenetic analyses. Phylogenetic trees of different lineages constructed under the selected natural vector and metric coincided with the lineage naming standards, meaning that lineages with the same alphabetical prefix cluster together in the phylogenetic trees. Furthermore, the relationships between the various GISAID clades as depicted by the natural graph largely matched the description provided in the GISAID clade naming. The validity of our geometric space was demonstrated by these phylogenetic analysis results. In this research, we thus constructed a geometric space for the genomes of the novel coronavirus SARS-CoV-2, which allows us to compare the different variants. Our geometric space is valuable for resolving issues within the virus.

Subject(s)

COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Phylogeny , Mutation Rate

7.

Editorial: Omics-based novel computational methods revealing microbe-disease associations.

Yau, Stephen S-T; Wu, Qi.

Front Cell Infect Microbiol ; 13: 1305902, 2023.

Article in English | MEDLINE | ID: mdl-37965254

Subject(s)

Microbiota , Computational Biology

8.

Quantitative proteomics profiling reveals the inhibition of trastuzumab antitumor efficacy by phosphorylated RPS6 in gastric carcinoma.

Hu, Chun-Ting; Pei, Shao-Jun; Wang, Jing-Long; Zu, Li-Dong; Shen, Wei-Wei; Yuan, Lin; Gao, Feng; Jiang, Li-Ren; Yau, Stephen S-T; Fu, Guo-Hui.

Cancer Chemother Pharmacol ; 92(5): 341-355, 2023 Nov.

Article in English | MEDLINE | ID: mdl-37507485

ABSTRACT

BACKGROUND: The anti-HER2 antibody trastuzumab is a standard treatment for gastric carcinoma with HER2 overexpression, but not all patients benefit from treatment with HER2-targeted therapies due to intrinsic and acquired resistance. Thus, more precise predictors for selecting patients to receive trastuzumab therapy are urgently needed. METHODS: We applied mass spectrometry-based proteomic analysis to 38 HER2-positive gastric tumor biopsies from 19 patients pretreated with trastuzumab (responders n = 10; nonresponders, n = 9) to identify factors that may influence innate sensitivity or resistance to trastuzumab therapy and validated the results in tumor cells and patient samples. RESULTS: Statistical analyses revealed significantly lower phosphorylated ribosomal S6 (p-RPS6) levels in responders than nonresponders, and this downregulation was associated with a durable response and better overall survival after anti-HER2 therapy. High p-RPS6 levels could trigger AKT/mTOR/RPS6 signaling and inhibit trastuzumab antitumor efficacy in nonresponders. We demonstrated that RPS6 phosphorylation inhibitors in combination with trastuzumab effectively suppressed HER2-positive GC cell survival through the inhibition of the AKT/mTOR/RPS6 axis. CONCLUSIONS: Our findings provide for the first time a detailed proteomics profile of current protein alterations in patients before anti-HER2 therapy and present a novel and optimal predictor for the response to trastuzumab treatment. HER2-positive GC patients with low expression of p-RPS6 are more likely to benefit from trastuzumab therapy than those with high expression. However, those with high expression of p-RPS6 may benefit from trastuzumab in combination with RPS6 phosphorylation inhibitors.

Subject(s)

Carcinoma , Stomach Neoplasms , Humans , Trastuzumab/pharmacology , Trastuzumab/therapeutic use , Stomach Neoplasms/pathology , Proto-Oncogene Proteins c-akt , Proteomics/methods , Tumor Cell Line , TOR Serine-Threonine Kinases/metabolism , ErbB-2 Receptor/metabolism , Antineoplastic Drug Resistance

9.

Neural Projection Filter: Learning Unknown Dynamics Driven by Noisy Observations.

Tao, Yangtianze; Kang, Jiayi; Yau, Stephen Shing-Toung.

IEEE Trans Neural Netw Learn Syst ; PP, 2023 Jan 10.

Article in English | MEDLINE | ID: mdl-37018644

ABSTRACT

In this article, we propose novel neural stochastic differential equations (SDEs) driven by noisy sequential observations, called the neural projection filter (NPF), under the continuous state-space models (SSMs) framework. The contributions of this work are both theoretical and algorithmic. On the one hand, we investigate the approximation capacity of the NPF, i.e., the universal approximation theorem for the NPF. More explicitly, under some natural assumptions, we prove that the solution of the SDE driven by the semimartingale can be well approximated by the solution of the NPF. In particular, an explicit estimation bound is given. On the other hand, as an important application of this result, we develop a novel data-driven filter based on the NPF. Also, under certain conditions, we prove the convergence of the algorithm; i.e., the dynamics of the NPF converge to the target dynamics. Finally, we systematically compare the NPF with existing filters. We verify the convergence theorem in the linear case and experimentally demonstrate that the NPF outperforms existing filters in the nonlinear case in terms of robustness and efficiency. Furthermore, the NPF can handle high-dimensional systems in real time, even for the 100-D cubic sensor, while the state-of-the-art (SOTA) filter fails to do so.

10.

Generating Minimal Models of H1N1 NS1 Gene Sequences Using Alignment-Based and Alignment-Free Algorithms.

Fang, Meng; Xu, Jiawei; Sun, Nan; Yau, Stephen S-T.

Genes (Basel) ; 14(1), 2023 Jan 10.

Article in English | MEDLINE | ID: mdl-36672928

ABSTRACT

For virus classification and tracing, one idea is to generate minimal models from the gene sequences of each virus group for comparative analysis within and between classes, as well as classification and tracing of new sequences. The starting point of defining a minimal model for a group of gene sequences is to find their longest common sequence (LCS), but this is a non-deterministic polynomial-time hard (NP-hard) problem. Therefore, we applied some heuristic approaches of finding LCS, as well as some of the newer methods of treating gene sequences, including multiple sequence alignment (MSA) and k-mer natural vector (NV) encoding. To evaluate our algorithms, a five-fold cross validation classification scheme on a dataset of H1N1 virus non-structural protein 1 (NS1) gene was analyzed. The results indicate that the MSA-based algorithm has the best performance measured by classification accuracy, while the NV-based algorithm exhibits advantages in the time complexity of generating minimal models.
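
The abstract notes that finding an LCS for a whole group of sequences is NP-hard. For just two sequences, the problem has a standard dynamic-programming solution, sketched below under the assumption that "LCS" here means the longest common subsequence; the function name `lcs` is illustrative, and the paper's heuristics for many sequences are not reproduced.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of the prefixes a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

The table costs O(len(a) * len(b)) time and memory. Extending the same recurrence to N sequences makes the table N-dimensional, which is why heuristic approaches are needed for a group of gene sequences.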

Subject(s)

Influenza A Virus H1N1 Subtype , Algorithms , Sequence Alignment

11.

Patterns of care and survival of Chinese glioblastoma patients in the temozolomide era: a Hong Kong population-level analysis over a 14-year period.

Woo, Peter Y M; Yau, Stephen; Lam, Tai-Chung; Pu, Jenny K S; Li, Lai-Fung; Lui, Louisa C Y; Chan, Danny T M; Loong, Herbert H F; Lee, Michael W Y; Yeung, Rebecca; Kwok, Carol C H; Au, Siu-Kie; Tan, Tze-Ching; Kan, Amanda N C; Chan, Tony K T; Mak, Calvin H K; Mak, Henry K F; Ho, Jason M K; Cheung, Ka-Man; Tse, Teresa P K; Lau, Sarah S N; Chow, Joyce S W; El-Helali, Aya; Ng, Ho-Keung; Poon, Wai-Sang.

Neurooncol Pract ; 10(1): 50-61, 2023 Feb.

Article in English | MEDLINE | ID: mdl-36659973

ABSTRACT

Background: The aim of this study is to address the paucity of epidemiological data regarding the characteristics, treatment patterns and survival outcomes of Chinese glioblastoma patients. Methods: This was a population-level study of Hong Kong adult (>18 years) Chinese patients with newly diagnosed histologically confirmed glioblastoma between 2006 and 2019. The age standardized incidence rate (ASIR), patient-, tumor-, and treatment-related characteristics, overall survival (OS) as well as its predictors were determined. Results: One thousand and ten patients with a median follow-up of 10.0 months were reviewed. The ASIR of glioblastoma was 1.0 per 100 000 population with no significant change during the study period. The mean age was 57 ± 14 years. The median OS was 10.6 months (IQR: 5.2-18.4). Independent predictors for survival were: Karnofsky performance score >80 (adjusted OR: 0.8; 95% CI: 0.6-0.9), IDH-1 mutant (aOR: 0.7; 95% CI: 0.5-0.9) or MGMT methylated (aOR: 0.7; 95% CI: 0.5-0.8) glioblastomas, gross total resection (aOR: 0.8; 95% CI: 0.5-0.8) and temozolomide chemoradiotherapy (aOR 0.4; 95% CI: 0.3-0.6). Despite the significantly increased administration of temozolomide chemoradiotherapy from 39% (127/326) of patients in 2006-2010 to 63% (227/356) in 2015-2019 (P-value < .001), median OS did not improve (2006-2010: 10.3 months vs 2015-2019: 11.8 months) (OR: 1.1; 95% CI: 0.9-1.3). Conclusions: The incidence of glioblastoma in the Chinese general population is low. We charted the development of neuro-oncological care of glioblastoma patients in Hong Kong during the temozolomide era. Although there was an increased adoption of temozolomide chemoradiotherapy, a corresponding improvement in survival was not observed.

12.

Recurrent Neural Networks Are Universal Approximators With Stochastic Inputs.

Chen, Xiuqiong; Tao, Yangtianze; Xu, Wenjie; Yau, Stephen Shing-Toung.

IEEE Trans Neural Netw Learn Syst ; 34(10): 7992-8006, 2023 Oct.

Article in English | MEDLINE | ID: mdl-35171782

ABSTRACT

In this article, we investigate the approximation ability of recurrent neural networks (RNNs) with stochastic inputs in state space model form. More explicitly, we prove that open dynamical systems with stochastic inputs can be well-approximated by a special class of RNNs under some natural assumptions, and the asymptotic approximation error has also been delicately analyzed as time goes to infinity. In addition, as an important application of this result, we construct an RNN-based filter and prove that it can well-approximate finite dimensional filters which include Kalman filter (KF) and Benes filter as special cases. The efficiency of RNN-based filter has also been verified by two numerical experiments compared with optimal KF.

13.

In-depth investigation of the point mutation pattern of HIV-1.

Sun, Nan; Yau, Stephen S-T.

Front Cell Infect Microbiol ; 12: 1033481, 2022.

Article in English | MEDLINE | ID: mdl-36457853

ABSTRACT

Mutations may produce highly transmissible and damaging HIV variants, which increase genetic diversity and pose a challenge to vaccine development. Therefore, it is of great significance to understand how mutations drive the virulence of HIV. Based on the 11897 reliable genomes of HIV-1 retrieved from the HIV sequence Database, we analyze the 12 types of point mutation (A>C, A>G, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G) from multiple statistical perspectives for the first time. The global, geographical location, subtype, and k-mer analysis results show that A>G, G>A, C>T and T>C account for nearly 64% of all SNPs, which suggests that APOBEC-editing and ADAR-editing may play an important role in HIV-1 infectivity. Time analysis shows that most genomes with abnormal mutation numbers come from African countries. Finally, we use the natural vector method to check the changing patterns of the k-mer distribution in the genome, and find that there is an important substitution pattern between nucleotides A and G, and that the 2-mer CG may have a significant impact on viral infectivity. This paper provides insight into the single mutations of HIV-1 by using the latest data in the HIV sequence Database.
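
Tallying the 12 point-mutation types listed in this abstract is simple once two sequences are aligned position by position. The sketch below assumes equal-length, pre-aligned sequences and ignores gaps and ambiguous symbols; the function name `count_point_mutations` and the toy sequences in the usage note are hypothetical, and the paper's actual pipeline is not known from the abstract.

```python
from collections import Counter


def count_point_mutations(reference, query):
    """Count substitutions of the form 'X>Y' between two aligned sequences."""
    if len(reference) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    counts = Counter()
    for ref_base, query_base in zip(reference.upper(), query.upper()):
        # Only unambiguous nucleotides are counted; gaps and N are skipped.
        if ref_base in "ACGT" and query_base in "ACGT" and ref_base != query_base:
            counts[f"{ref_base}>{query_base}"] += 1
    return counts


def transition_share(counts):
    """Fraction of substitutions that are A>G, G>A, C>T or T>C."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    transitions = sum(counts[k] for k in ("A>G", "G>A", "C>T", "T>C"))
    return transitions / total
```

For example, `count_point_mutations("ACGTAC", "GCGCAT")` records one A>G, one T>C and one C>T substitution, so `transition_share` on that result is 1.0.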

Subject(s)

HIV-1 , HIV-1/genetics , Point Mutation , Mutation , Missense Mutation , Nucleic Acid Databases

14.

Biomolecular Topology: Modelling and Analysis.

Liu, Jian; Xia, Ke-Lin; Wu, Jie; Yau, Stephen Shing-Toung; Wei, Guo-Wei.

Acta Math Sin Engl Ser ; 38(10): 1901-1938, 2022.

Article in English | MEDLINE | ID: mdl-36407804

ABSTRACT

With the great advancement of experimental tools, a tremendous amount of biomolecular data has been generated and accumulated in various databases. The high dimensionality, structural complexity, nonlinearity, and entanglements of biomolecular data, ranging from DNA knots, RNA secondary structures, protein folding configurations, chromosomes, DNA origami, and molecular assembly to others at the macromolecular level, pose a severe challenge in their analysis and characterization. In the past few decades, mathematical concepts, models, algorithms, and tools from algebraic topology, combinatorial topology, computational topology, and topological data analysis have demonstrated great power and begun to play an essential role in tackling the biomolecular data challenge. In this work, we introduce biomolecular topology, which concerns the topological problems and models originating from biomolecular systems. More specifically, biomolecular topology encompasses topological structures, properties and relations that emerge from biomolecular structures, dynamics, interactions, and functions. We discuss the various types of biomolecular topology from structures (of proteins, DNAs, and RNAs), protein folding, and protein assembly. A brief discussion of databanks (and databases), theoretical models, and computational algorithms is presented. Further, we systematically review related topological models, including graphs, simplicial complexes, persistent homology, persistent Laplacians, de Rham-Hodge theory, Yau-Hausdorff distance, and topology-based machine learning models.

15.

Classification of Protein Sequences by a Novel Alignment-Free Method on Bacterial and Virus Families.

Guan, Mengcen; Zhao, Leqi; Yau, Stephen S-T.

Genes (Basel) ; 13(10), 2022 Sep 27.

Article in English | MEDLINE | ID: mdl-36292629

ABSTRACT

The classification of protein sequences provides valuable insights into bioinformatics. Most existing methods are based on sequence alignment algorithms, which become time-consuming as the size of the database increases. Therefore, there is a need to develop an improved method for effectively classifying protein sequences. In this paper, we propose a novel accumulated natural vector method to cluster protein sequences at a lower time cost without reducing accuracy. Our method projects each protein sequence as a point in a 250-dimensional space according to its amino acid distribution. Thus, the biological distance between any two proteins can be easily measured by the Euclidean distance between the corresponding points in the 250-dimensional space. The convex hull analysis and classification perform robustly on virus and bacteria datasets, effectively verifying our method.

Subject(s)

Amino Acids , Bacteria , Phylogeny , Amino Acid Sequence , Sequence Alignment , Amino Acids/chemistry , Bacteria/genetics

16.

An efficient numerical representation of genome sequence: natural vector with covariance component.

Sun, Nan; Zhao, Xin; Yau, Stephen S-T.

PeerJ ; 10: e13544, 2022.

Article in English | MEDLINE | ID: mdl-35729905

ABSTRACT

Background: The characterization and comparison of microbial sequences, including archaea, bacteria, viruses and fungi, are very important for understanding their evolutionary origins and population relationships. Most methods are limited by sequence length and lack generality. The purpose of this study is to propose a general characterization method, and to study the classification and phylogeny of the existing datasets. Methods: We present a new alignment-free method to represent and compare biological sequences. By adding the covariance between each pair of nucleotides, the new 18-dimensional natural vector successfully describes 24,250 genomic sequences and 95,542 DNA barcode sequences. The new numerical representation is used to study the classification and phylogenetic relationships of microbial sequences. Results: First, the classification results validate that the six-dimensional covariance vector is necessary to characterize sequences. Then, the 18-dimensional natural vector is further used to examine the similarity relationships between giant viruses and archaea, bacteria, and other viruses. The nearest distance calculation results show that the giant viruses are closer to bacteria in the distribution of the four nucleotides. The phylogenetic relationships of three representative families of giant viruses, Mimiviridae, Pandoraviridae and Marseilleviridae, are analyzed. The trees show that ten sequences of Mimiviridae are clustered with Pandoraviridae, and that Mimiviridae is closer to the root of the tree than Marseilleviridae. The newly developed alignment-free method can be computed very quickly, which provides an effective numerical representation for the sequences of microorganisms.

Subject(s)

Mimiviridae , Viruses , Humans , Phylogeny , Genome , Biological Evolution , Nucleotides/genetics , Genomics , Bacteria/genetics , Archaea/genetics , Mimiviridae/genetics

17.

kmer2vec: A Novel Method for Comparing DNA Sequences by word2vec Embedding.

Ren, Ruohan; Yin, Changchuan; Yau, Stephen S-T.

J Comput Biol ; 29(9): 1001-1021, 2022 Sep.

Article in English | MEDLINE | ID: mdl-35593919

ABSTRACT

The comparison of DNA sequences is of great significance in genomics analysis. Although the traditional multiple sequence alignment (MSA) method is popularly used for evolutionary analysis, optimally aligning k sequences becomes computationally intractable as k increases, owing to the intrinsic computational complexity of MSA. Although numerous k-mer alignment-free methods have been proposed, the existing ones may not truly capture the contextual structures of the sequences. In this study, we present a novel k-mer contextual alignment-free method (called kmer2vec), in which the sequence k-mers are semantically embedded as word2vec vectors, an essential technique in natural language processing. Consequently, the method converts each DNA/RNA sequence into a point in the high-dimensional word2vec space and compares DNA sequences in that space. Because the word2vec vectors are trained from the contextual relationships of k-mers in the genomes, the method may extract valuable structural information from the sequences and properly reflect the relationships among them. The proposed method is optimized over the word2vec training parameters and verified in the phylogenetic analysis of large whole genomes, including coronavirus and bacterial genomes. The results demonstrate the effectiveness of the method for phylogenetic tree construction and species clustering. The method runs much faster than the MSA method, and the phylogenetic relationships constructed by the kmer2vec method are more accurate than those of the conventional k-mer alignment-free method. Therefore, this approach can provide new perspectives for phylogeny and evolution and make it possible to analyze large genomes. In addition, we discuss special parameterization in the k-mer word2vec embedding construction. An effective tool for rapid SARS-CoV-2 typing can also be derived by combining kmer2vec with clustering methods.

Subject(s)

Algorithms , COVID-19 , Base Sequence , Humans , Phylogeny , SARS-CoV-2/genetics , DNA Sequence Analysis/methods

18.

Identification of HIV Rapid Mutations Using Differences in Nucleotide Distribution over Time.

Sun, Nan; Yang, Jie; Yau, Stephen S-T.

Genes (Basel) ; 13(2), 2022 Jan 19.

Article in English | MEDLINE | ID: mdl-35205215

ABSTRACT

Mutation is the driving force of species evolution, which may change the genetic information of organisms and obtain selective competitive advantages to adapt to environmental changes. It may change the structure or function of translated proteins, and cause abnormal cell operation, a variety of diseases and even cancer. Therefore, it is particularly important to identify gene regions with high mutations. Mutations will cause changes in nucleotide distribution, which can be characterized by natural vectors globally. Based on natural vectors, we propose a mathematical formula for measuring the difference in nucleotide distribution over time to investigate the mutations of human immunodeficiency virus. The studied dataset is from public databases and includes gene sequences from twenty HIV-infected patients. The results show that the mutation rate of the nine major genes or gene segment regions in the genome exhibits discrepancy during the infected period, and the Env gene has the fastest mutation rate. We deduce that the peak of virus mutation has a close temporal relationship with viral divergence and diversity. The mutation study of HIV is of great significance to clinical diagnosis and drug design.

Subject(s)

HIV Infections , HIV-1 , HIV Infections/virology , HIV-1/genetics , Humans , Mutation , Nucleotides

19.

Utilizing the codon adaptation index to evaluate the susceptibility to HIV-1 and SARS-CoV-2 related coronaviruses in possible target cells in humans.

Zhou, Haoyu; Ren, Ruohan; Yau, Stephen Shing-Toung.

Front Cell Infect Microbiol ; 12: 1085397, 2022.

Article in English | MEDLINE | ID: mdl-36760235

ABSTRACT

Comprehensive identification of possible target cells for viruses is crucial for understanding the pathological mechanism of virosis. The susceptibility of cells to viruses depends on many factors. Besides the existence of receptors at the cell surface, effective expression of viral genes is also pivotal for viral infection. The regulation of viral gene expression is a multilevel process including transcription, translational initiation and translational elongation. At the translational elongation level, the translational efficiency of viral mRNAs mainly depends on the match between their codon composition and the cellular translational machinery (usually referred to as codon adaptation). Thus, codon adaptation for viral ORFs in different cell types may be related to their susceptibility to viruses. In this study, we selected the codon adaptation index (CAI), a common codon adaptation-based indicator for assessing translational efficiency at the translational elongation level, to evaluate the susceptibility of different human cell types to two pandemic viruses (HIV-1 and SARS-CoV-2). Compared with previous studies that evaluated the infectivity of viruses based on codon adaptation, the main advantage of our study is that our analysis is refined to the cell-type level. First, we verified the positive correlation between CAI and translational efficiency, which strengthened the rationale for our research method. Then we calculated CAI for the ORFs of the two viruses in various human cell types. We found that, compared with high-expression endogenous genes, the CAIs of viral ORFs are relatively low. This phenomenon implies that the two viruses have not been well adapted to the translational regulatory machinery in human cells. We also showed that presumptive susceptibility to viruses according to CAI is usually consistent with the results of experimental research. However, there are still some exceptions. Finally, we found that the two viruses have different effects on cellular translational mechanisms. HIV-1 decouples CAI and translational efficiency of endogenous genes in host cells, and SARS-CoV-2 exhibits increased CAI for its ORFs in infected cells. Our results imply that, at least in the cases of HIV-1 and SARS-CoV-2, CAI can be regarded as an auxiliary index to assess cells' susceptibility to viruses but cannot be used as the only evidence to identify viral target cells.
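
For readers unfamiliar with the codon adaptation index, it is commonly defined as the geometric mean of per-codon relative adaptiveness values, where a codon's weight is its frequency in a reference gene set divided by the frequency of the most-used synonymous codon. The sketch below follows that textbook definition on a toy two-amino-acid table; the function names, the reference counts, and the choice to skip codons without a positive weight are assumptions made for illustration, not the study's cell-type-specific procedure.

```python
import math


def relative_adaptiveness(codon_counts, synonymous_groups):
    """Weight of each codon = its count / count of the top synonymous codon."""
    weights = {}
    for codons in synonymous_groups:
        top = max(codon_counts.get(c, 0) for c in codons)
        if top == 0:
            continue  # amino acid never seen in the reference set
        for c in codons:
            weights[c] = codon_counts.get(c, 0) / top
    return weights


def cai(orf, weights):
    """Geometric mean of the weights of the codons in an ORF.

    Simplification: codons with no positive weight are skipped
    rather than given a pseudocount.
    """
    usable = len(orf) - len(orf) % 3  # drop any trailing partial codon
    codons = [orf[i:i + 3].upper() for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    if not logs:
        return 0.0
    return math.exp(sum(logs) / len(logs))


# Hypothetical reference counts for lysine (AAA/AAG) and glutamate (GAA/GAG).
GROUPS = [["AAA", "AAG"], ["GAA", "GAG"]]
REFERENCE = {"AAA": 10, "AAG": 5, "GAA": 4, "GAG": 8}
WEIGHTS = relative_adaptiveness(REFERENCE, GROUPS)
```

With these toy weights, an ORF built only from the preferred codons (`"AAAGAG"`) scores 1, while one built from the less-used codons (`"AAGGAA"`) scores lower, which is the sense in which a low CAI signals poor adaptation to the host's codon usage.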

Subject(s)

COVID-19 , HIV-1 , Humans , SARS-CoV-2/genetics , HIV-1/genetics , COVID-19/genetics , Codon/genetics , Physiological Adaptation/genetics

20.

New Genome Sequence Detection via Natural Vector Convex Hull Method.

Zhao, Ruzhang; Pei, Shaojun; Yau, Stephen Shing-Toung.

IEEE/ACM Trans Comput Biol Bioinform ; 19(3): 1782-1793, 2022.

Article in English | MEDLINE | ID: mdl-33237867

ABSTRACT

It remains challenging to find existing but undiscovered genome sequence mutations or to predict potential genome sequence mutations based on real sequence data. Motivated by this, we develop approaches to detect new, undiscovered genome sequences. Because discovering new genome sequences through biological experiments is resource-intensive, we want to achieve the new genome sequence detection task mathematically. However, little literature tells us how to detect new, undiscovered genome sequence mutations mathematically. We form a new framework based on the natural vector convex hull method, which conducts alignment-free sequence analysis. Our two newly developed approaches, Random-permutation Algorithm with Penalty (RAP) and Random-permutation Algorithm with Penalty and COnstrained Search (RAPCOS), use the geometric properties captured by natural vectors. In our experiment, we discover a mathematically new human immunodeficiency virus (HIV) genome sequence using some real HIV genome sequences. Significantly, the proposed methods are applicable to the new genome sequence detection challenge and have many good properties, such as robustness, rapid convergence, and fast computation.

Subject(s)

Algorithms , Genome , Genome/genetics , Humans
