Results 1 - 20 of 65
1.
Comput Struct Biotechnol J ; 23: 2083-2096, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38803517

ABSTRACT

Understanding the structural similarity between genomes is pivotal in classification and phylogenetic analysis. As the number of known genomes rockets, alignment-free methods have gained considerable attention. Among these methods, the natural vector method stands out as it represents sequences as vectors using statistical moments, enabling effective clustering based on families in biological taxonomy. However, determining an optimal metric that combines different elements in natural vectors remains challenging due to the absence of a rigorous theoretical framework for weighting different k-mers and orders. In this study, we address this challenge by transforming the determination of optimal weights into an optimization problem and resolving it through gradient-based techniques. Our experimental results underscore the substantial improvement in classification accuracy achieved by employing these optimal weights, reaching an impressive 92.73% on the testing set, surpassing other alignment-free methods. On one hand, our method offers an outstanding metric for virus classification, and on the other hand, it provides valuable insights into feature integration within alignment-free methods.
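For orientation, the natural vector that this abstract builds on can be sketched in a few lines. This is a minimal sketch, assuming the commonly cited 12-dimensional form for DNA (per-nucleotide count, mean position, and normalized second central moment). The function name `natural_vector` is hypothetical, and the k-mer extension and the gradient-based weight optimization described in the abstract are not reproduced here.

```python
def natural_vector(seq):
    """12-dimensional natural vector of a DNA sequence (assumed standard form).

    Layout: [n_A, n_C, n_G, n_T, mu_A, mu_C, mu_G, mu_T, D2_A, D2_C, D2_G, D2_T]
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in "ACGT":
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # Convention assumed here: absent nucleotides contribute zeros
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments
```

Two sequences can then be compared by any vector distance between their natural vectors; which metric and weights work best is exactly the question the abstract addresses.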

2.
Front Genet ; 15: 1364951, 2024.
Article in English | MEDLINE | ID: mdl-38572414

ABSTRACT

Chromosomal fusion is a significant form of structural variation, but research into algorithms for its identification has been limited. Most existing methods rely on synteny analysis, which necessitates manual annotations and always involves inefficient sequence alignments. In this paper, we present a novel alignment-free algorithm for chromosomal fusion recognition. Our method transforms the problem into a series of assignment problems using natural vectors and efficiently solves them with the Kuhn-Munkres algorithm. When applied to the human/gorilla and swamp buffalo/river buffalo datasets, our algorithm successfully and efficiently identifies chromosomal fusion events. Notably, our approach offers several advantages, including higher processing speeds by eliminating time-consuming alignments and removing the need for manual annotations. From an alignment-free perspective, our algorithm initially considers entire chromosomes instead of fragments to identify chromosomal structural variations, offering substantial potential to advance research in this field.
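The assignment step mentioned in this abstract can be illustrated on a toy scale. The sketch below is only an assumption-laden illustration: it pairs feature vectors from two sets so that the summed Euclidean distance is minimal, and it uses exhaustive search over permutations as a stand-in for the Kuhn-Munkres algorithm (same optimum, but only practical for a handful of items). The names `euclidean` and `best_assignment` are hypothetical, and the fusion-detection logic of the paper is not reproduced.

```python
from itertools import permutations
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_assignment(vectors_a, vectors_b):
    """Return (total_cost, mapping) for the minimum-cost one-to-one pairing.

    mapping[i] is the index in vectors_b assigned to vectors_a[i].
    Brute force over all permutations: O(n!), for illustration only.
    """
    n = len(vectors_a)
    if n != len(vectors_b):
        raise ValueError("both sets must have the same size")
    cost = [[euclidean(a, b) for b in vectors_b] for a in vectors_a]
    best = None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if best is None or total < best[0]:
            best = (total, perm)
    return best
```

For realistic chromosome counts, a polynomial-time solver such as Kuhn-Munkres would replace the permutation loop; the cost matrix construction stays the same.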

3.
Gene ; 909: 148291, 2024 May 30.
Article in English | MEDLINE | ID: mdl-38417688

ABSTRACT

SARS-CoV-2, which causes a severe respiratory disease, has been prevalent around the world since its discovery in 2019. As a single-stranded RNA virus, it has a high mutation rate, which produces many variants and gives some of them high pathogenicity, such as the Omicron variant, the most prevalent variant at present. Research on the relationships among these SARS-CoV-2 variants, especially on their differences, is a topic of active interest. In this study, we constructed a geometric space to represent all SARS-CoV-2 sequences of different variants. An alignment-free method, the natural vector method, was used to establish the genome space. The genome space of SARS-CoV-2 was constructed based on the 24-dimensional natural vector, and the appropriate metric was determined through phylogenetic analyses. Phylogenetic trees of different lineages constructed under the selected natural vector and metric coincided with the lineage naming standards, meaning that lineages with the same alphabetical prefix cluster together in the phylogenetic trees. Furthermore, the relationships between the various GISAID clades as depicted by the natural graph largely matched the description provided in the GISAID clade naming. The validity of our geometric space was demonstrated by these phylogenetic analysis results. In this research, we thus constructed a geometric space for the genomes of the novel coronavirus SARS-CoV-2, which allows us to compare the different variants. Our geometric space is valuable for resolving issues concerning the virus.


Subject(s)
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Phylogeny , Mutation Rate
5.
Cancer Chemother Pharmacol ; 92(5): 341-355, 2023 11.
Article in English | MEDLINE | ID: mdl-37507485

ABSTRACT

BACKGROUND: The anti-HER2 antibody trastuzumab is a standard treatment for gastric carcinoma with HER2 overexpression, but not all patients benefit from treatment with HER2-targeted therapies due to intrinsic and acquired resistance. Thus, more precise predictors for selecting patients to receive trastuzumab therapy are urgently needed. METHODS: We applied mass spectrometry-based proteomic analysis to 38 HER2-positive gastric tumor biopsies from 19 patients pretreated with trastuzumab (responders n = 10; nonresponders, n = 9) to identify factors that may influence innate sensitivity or resistance to trastuzumab therapy and validated the results in tumor cells and patient samples. RESULTS: Statistical analyses revealed significantly lower phosphorylated ribosomal S6 (p-RPS6) levels in responders than nonresponders, and this downregulation was associated with a durable response and better overall survival after anti-HER2 therapy. High p-RPS6 levels could trigger AKT/mTOR/RPS6 signaling and inhibit trastuzumab antitumor efficacy in nonresponders. We demonstrated that RPS6 phosphorylation inhibitors in combination with trastuzumab effectively suppressed HER2-positive GC cell survival through the inhibition of the AKT/mTOR/RPS6 axis. CONCLUSIONS: Our findings provide for the first time a detailed proteomics profile of current protein alterations in patients before anti-HER2 therapy and present a novel and optimal predictor for the response to trastuzumab treatment. HER2-positive GC patients with low expression of p-RPS6 are more likely to benefit from trastuzumab therapy than those with high expression. However, those with high expression of p-RPS6 may benefit from trastuzumab in combination with RPS6 phosphorylation inhibitors.


Subject(s)
Carcinoma , Stomach Neoplasms , Humans , Trastuzumab/pharmacology , Trastuzumab/therapeutic use , Stomach Neoplasms/pathology , Proto-Oncogene Proteins c-akt , Proteomics/methods , Cell Line, Tumor , TOR Serine-Threonine Kinases/metabolism , Receptor, ErbB-2/metabolism , Drug Resistance, Neoplasm
6.
Genes (Basel) ; 14(1)2023 01 10.
Article in English | MEDLINE | ID: mdl-36672928

ABSTRACT

For virus classification and tracing, one idea is to generate minimal models from the gene sequences of each virus group for comparative analysis within and between classes, as well as classification and tracing of new sequences. The starting point of defining a minimal model for a group of gene sequences is to find their longest common sequence (LCS), but this is a non-deterministic polynomial-time hard (NP-hard) problem. Therefore, we applied some heuristic approaches of finding LCS, as well as some of the newer methods of treating gene sequences, including multiple sequence alignment (MSA) and k-mer natural vector (NV) encoding. To evaluate our algorithms, a five-fold cross validation classification scheme on a dataset of H1N1 virus non-structural protein 1 (NS1) gene was analyzed. The results indicate that the MSA-based algorithm has the best performance measured by classification accuracy, while the NV-based algorithm exhibits advantages in the time complexity of generating minimal models.
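As background for the LCS step in this abstract: for two sequences the longest common subsequence is solvable exactly by dynamic programming, and the hardness arises only when many sequences are considered at once. The sketch below shows the pairwise case plus one simple greedy heuristic (folding the pairwise result over a group). Both function names are hypothetical, and the greedy fold is an illustrative assumption, not necessarily one of the heuristics the authors used; it returns a common subsequence that is not guaranteed to be the longest.

```python
from functools import reduce


def lcs(a, b):
    """Longest common subsequence of two strings via dynamic programming."""
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack from the bottom-right corner to recover one optimal subsequence
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def greedy_common_subsequence(seqs):
    """Heuristic: fold pairwise LCS over the group (common, not always longest)."""
    return reduce(lcs, seqs)
```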


Subject(s)
Influenza A Virus, H1N1 Subtype , Algorithms , Sequence Alignment
7.
Front Cell Infect Microbiol ; 12: 1033481, 2022.
Article in English | MEDLINE | ID: mdl-36457853

ABSTRACT

Mutations may produce highly transmissible and damaging HIV variants, which increase genetic diversity and pose a challenge to vaccine development. Therefore, it is of great significance to understand how mutations drive the virulence of HIV. Based on 11897 reliable genomes of HIV-1 retrieved from the HIV sequence Database, we analyze the 12 types of point mutation (A>C, A>G, A>T, C>A, C>G, C>T, G>A, G>C, G>T, T>A, T>C, T>G) from multiple statistical perspectives for the first time. The global, geographical location, subtype, and k-mer analyses show that A>G, G>A, C>T and T>C account for nearly 64% of all SNPs, which suggests that APOBEC-editing and ADAR-editing may play an important role in HIV-1 infectivity. Time analysis shows that most genomes with abnormal mutation numbers come from African countries. Finally, we use the natural vector method to examine how the k-mer distribution changes across the genome, and find that there is an important substitution pattern between nucleotides A and G, and that the 2-mer CG may have a significant impact on viral infectivity. This paper provides insight into single mutations of HIV-1 using the latest data in the HIV sequence Database.
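The 12-way point mutation tally described in this abstract is straightforward to illustrate. The sketch below assumes the two sequences are already aligned and of equal length, which is an assumption about preprocessing that the abstract does not spell out; the function names are hypothetical.

```python
from collections import Counter

# The four transition types highlighted in the abstract
TRANSITIONS = {"A>G", "G>A", "C>T", "T>C"}


def point_mutation_spectrum(reference, sample):
    """Count each of the 12 substitution types between two aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    bases = set("ACGT")
    spectrum = Counter()
    for r, s in zip(reference.upper(), sample.upper()):
        # Skip gaps and ambiguous symbols; count only true base changes
        if r in bases and s in bases and r != s:
            spectrum[f"{r}>{s}"] += 1
    return spectrum


def transition_fraction(spectrum):
    """Share of substitutions that are A>G, G>A, C>T or T>C."""
    total = sum(spectrum.values())
    if total == 0:
        return 0.0
    return sum(spectrum[m] for m in TRANSITIONS) / total
```

Summing such spectra over many genome pairs gives the kind of aggregate proportion the abstract reports for the four transition types.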


Subject(s)
HIV-1 , HIV-1/genetics , Point Mutation , Mutation , Mutation, Missense , Databases, Nucleic Acid
8.
Genes (Basel) ; 13(10)2022 Sep 27.
Article in English | MEDLINE | ID: mdl-36292629

ABSTRACT

The classification of protein sequences provides valuable insights into bioinformatics. Most existing methods are based on sequence alignment algorithms, which become time-consuming as the size of the database increases. Therefore, there is a need to develop an improved method for effectively classifying protein sequences. In this paper, we propose a novel accumulated natural vector method to cluster protein sequences at a lower time cost without reducing accuracy. Our method projects each protein sequence as a point in a 250-dimensional space according to its amino acid distribution. Thus, the biological distance between any two proteins can be easily measured by the Euclidean distance between the corresponding points in the 250-dimensional space. The convex hull analysis and classification perform robustly on virus and bacteria datasets, effectively verifying our method.


Subject(s)
Amino Acids , Bacteria , Phylogeny , Amino Acid Sequence , Sequence Alignment , Amino Acids/chemistry , Bacteria/genetics
9.
PeerJ ; 10: e13544, 2022.
Article in English | MEDLINE | ID: mdl-35729905

ABSTRACT

Background: The characterization and comparison of microbial sequences, including archaea, bacteria, viruses and fungi, are very important for understanding their evolutionary origin and population relationships. Most methods are limited by sequence length and lack generality. The purpose of this study is to propose a general characterization method, and to study the classification and phylogeny of the existing datasets. Methods: We present a new alignment-free method to represent and compare biological sequences. By adding the covariance between each pair of nucleotides, the new 18-dimensional natural vector successfully describes 24,250 genomic sequences and 95,542 DNA barcode sequences. The new numerical representation is used to study the classification and phylogenetic relationship of microbial sequences. Results: First, the classification results validate that the six-dimensional covariance vector is necessary to characterize sequences. Then, the 18-dimensional natural vector is further used to examine the similarity relationship between giant viruses and archaea, bacteria, and other viruses. The nearest distance calculation results indicate that the giant viruses are closer to bacteria in the distribution of the four nucleotides. The phylogenetic relationships of three representative giant virus families, Mimiviridae, Pandoraviridae and Marseilleviridae, are analyzed. The trees show that ten sequences of Mimiviridae are clustered with Pandoraviridae, and Mimiviridae is closer to the root of the tree than Marseilleviridae. The newly developed alignment-free method can be computed very fast and provides an effective numerical representation for the sequences of microorganisms.


Subject(s)
Mimiviridae , Viruses , Humans , Phylogeny , Genome , Biological Evolution , Nucleotides/genetics , Genomics , Bacteria/genetics , Archaea/genetics , Mimiviridae/genetics
10.
Genes (Basel) ; 13(2)2022 01 19.
Article in English | MEDLINE | ID: mdl-35205215

ABSTRACT

Mutation is the driving force of species evolution; it may change the genetic information of organisms and confer selective competitive advantages for adapting to environmental changes. It may change the structure or function of translated proteins, and cause abnormal cell operation, a variety of diseases and even cancer. Therefore, it is particularly important to identify gene regions with high mutation rates. Mutations cause changes in nucleotide distribution, which can be characterized globally by natural vectors. Based on natural vectors, we propose a mathematical formula for measuring the difference in nucleotide distribution over time to investigate the mutations of human immunodeficiency virus. The studied dataset is from public databases and includes gene sequences from twenty HIV-infected patients. The results show that the mutation rates of the nine major genes or gene segment regions in the genome differ during the infection period, and the Env gene has the fastest mutation rate. We deduce that the peak of virus mutation has a close temporal relationship with viral divergence and diversity. The mutation study of HIV is of great significance to clinical diagnosis and drug design.


Subject(s)
HIV Infections , HIV-1 , HIV Infections/virology , HIV-1/genetics , Humans , Mutation , Nucleotides
11.
J Theor Biol ; 530: 110885, 2021 12 07.
Article in English | MEDLINE | ID: mdl-34478743

ABSTRACT

The world faces a great unforeseen challenge through the COVID-19 pandemic caused by coronavirus SARS-CoV-2. The structure and evolution of the virus genome are central to gaining insights into vaccine development, monitoring of transmission trajectories, and prevention of zoonotic infections of new coronaviruses. Of particular interest are the genomic elements known as Inverse Repeats (IRs), which maintain genome stability, regulate gene expression, and are the targets of mutations. However, little research attention has been given to IR content analysis in the SARS-CoV-2 genome. In this study, we propose a geometric analysis method and use it to investigate the distributions of IRs in SARS-CoV-2 and related coronavirus genomes. The method represents each genomic IR sequence pair as a single point and constructs the geometric shape of the genome using the IRs. Thus, the IR shape can be considered as the signature of the genome. The genomes of different coronaviruses are then compared using the constructed IR shapes. The results demonstrate that the SARS-CoV-2 genome, specifically, has an abundance of IRs, and the IRs in coronavirus genomes show an increase during evolution events.


Subject(s)
COVID-19 , SARS-CoV-2 , Genome, Viral/genetics , Genomics , Humans , Pandemics , Phylogeny
12.
Comput Struct Biotechnol J ; 19: 4226-4234, 2021.
Article in English | MEDLINE | ID: mdl-34429843

ABSTRACT

Understanding the relationships between genomic sequences is essential to the classification and characterization of living beings. The classes and characteristics of an organism can be identified in the corresponding genome space. In the genome space, the natural metric is important to describe the distribution of genomes. Therefore, the similarity of two biological sequences can be measured. Here, we report that all of the viral genomes are in 32-dimensional Euclidean space, in which the natural metric is the weighted summation of Euclidean distance of k-mer natural vectors. The classification of viral genomes in the constructed genome space further proves the convex hull principle of taxonomy, which states that convex hulls of different families are mutually disjoint. This study provides a novel geometric perspective to describe the genome sequences.
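The metric named in this abstract, a weighted summation of Euclidean distances between k-mer natural vectors, can be written down directly once the vectors exist. The sketch below assumes the k-mer natural vectors have already been computed and are supplied per value of k; the function name `weighted_natural_metric` and the example weights are hypothetical and are not the values used in the paper.

```python
import math


def weighted_natural_metric(vectors_x, vectors_y, weights):
    """Weighted sum over k of the Euclidean distance between k-mer natural vectors.

    vectors_x[k] and vectors_y[k] hold the k-mer natural vectors of two genomes;
    weights[k] is the (assumed, user-supplied) weight for that k.
    """
    total = 0.0
    for k, w in weights.items():
        # math.dist computes the Euclidean distance between two points
        total += w * math.dist(vectors_x[k], vectors_y[k])
    return total
```

With all weights nonnegative, the result inherits symmetry and the triangle inequality from the per-k Euclidean distances.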

13.
Acta Math Sci ; 41(3): 1017-1022, 2021.
Article in English | MEDLINE | ID: mdl-33897081

ABSTRACT

The severe acute respiratory syndrome COVID-19 was discovered on December 31, 2019 in China. Subsequently, many COVID-19 cases were reported in many other countries. However, some positive COVID-19 samples had been reported earlier than those officially accepted by health authorities in other countries, such as France and Italy. Thus, it is of great importance to determine the place where SARS-CoV-2 was first transmitted to humans. To this end, we analyze genomes of SARS-CoV-2 using the k-mer natural vector method and compare the similarities of global SARS-CoV-2 genomes by a new natural metric. Because it is commonly accepted that SARS-CoV-2 originated from bat coronavirus RaTG13, we only need to determine which SARS-CoV-2 genome sequence has the closest distance to bat coronavirus RaTG13 under our natural metric. From our analysis, SARS-CoV-2 most likely already existed in other countries such as France, India, the Netherlands, England and the United States before the outbreak in Wuhan, China.

14.
Front Genet ; 12: 828805, 2021.
Article in English | MEDLINE | ID: mdl-35186019

ABSTRACT

A comprehensive description of human genomes is essential for understanding human evolution and relationships between modern populations. However, most published literature focuses on local alignment comparison of several genes rather than the complete evolutionary record of individual genomes. Combining data from the 1,000 Genomes Project, we successfully reconstructed 2,504 individual genomes and propose the Divided Natural Vector method to analyze the distribution of nucleotides in the genomes. Comparisons based on autosomes, sex chromosomes and mitochondrial genomes reveal the genetic relationships between populations, and different inheritance patterns lead to different phylogenetic results. Results based on mitochondrial genomes confirm the "out-of-Africa" hypothesis and assert that humans, at least females, most likely originated in eastern Africa. The reconstructed genomes are stored on our server and can be further used for any genome-scale analysis of humans (http://yaulab.math.tsinghua.edu.cn/2022_1000genomesprojectdata/). This project provides the complete genomes of thousands of individuals and lays the groundwork for genome-level analyses of the genetic relationships between populations and the origin of humans.

15.
Sci Rep ; 10(1): 21773, 2020 12 10.
Article in English | MEDLINE | ID: mdl-33303802

ABSTRACT

Protein structure can provide insights that help biologists to predict and understand protein functions and interactions. However, the number of known protein structures has not kept pace with the number of protein sequences determined by high-throughput sequencing. Current techniques used to determine the structure of proteins are complex and require a lot of time to analyze the experimental results, especially for large protein molecules. The limitations of these methods have motivated us to create a new approach for protein structure prediction. Here we describe a new approach to predict protein structures and structure classes from amino acid sequences. Our prediction model performs well in comparison with previous methods when applied to the structural classification of two CATH datasets with more than 5000 protein domains. The average accuracy is 92.5% for structure classification, which is higher than that of previous research. We also used our model to predict four known protein structures with a single amino acid sequence, while many other existing methods could only obtain one possible structure for a given sequence. The results show that our method provides a new effective and reliable tool for protein structure prediction research.


Subject(s)
Amino Acids/chemistry , Protein Conformation , Protein Folding , Proteins/chemistry , Amino Acid Sequence , Protein Domains
16.
Comput Struct Biotechnol J ; 18: 1904-1913, 2020.
Article in English | MEDLINE | ID: mdl-32774785

ABSTRACT

Chaos Game Representation (CGR) was first proposed as an image representation method for DNA and has been extended to other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of proteins are not so elegant in geometry and the implications of the distribution of points in the CGR image are not so obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences with the CGR method. We also associate each CGR image with a vector in high-dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.
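The DNA case that this abstract takes as its starting point can be sketched briefly: each nucleotide owns a corner of the unit square, and every step moves halfway from the current point toward the corner of the next nucleotide. The corner assignment below is one common convention and is an assumption here; the function name `cgr_points` is hypothetical, and the three-dimensional dodecahedron construction for proteins is not reproduced.

```python
# One common corner convention for DNA CGR (assumed, not taken from the paper)
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory of a DNA sequence as a list of (x, y) points."""
    x, y = 0.5, 0.5  # start at the center of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway toward the corner belonging to this nucleotide
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

Because each step halves the distance to a corner, the last few nucleotides determine which sub-square a point falls in, which is why point density in the image reflects k-mer frequencies.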

17.
PeerJ ; 8: e9625, 2020.
Article in English | MEDLINE | ID: mdl-32832270

ABSTRACT

BACKGROUND: Begomoviruses are widely distributed and cause devastating diseases in many crops. According to the number of genomic components, a begomovirus is classified as either monopartite or bipartite. Both the monopartite and bipartite begomoviruses have the DNA-A component, which encodes all essential proteins for virus functions, while the bipartite begomoviruses additionally contain the DNA-B component. Satellite molecules, known as betasatellites, alphasatellites or deltasatellites, sometimes exist in the begomoviruses. The genomic components of begomoviruses are therefore complex and varied. Different genomic components have different gene structures and functions. Classifying the components of begomoviruses is important for studying the virus origin and pathogenic mechanism. METHODS: We propose a model combining the Subsequence Natural Vector (SNV) method with the Support Vector Machine (SVM) algorithm to classify the genomic components of begomoviruses and predict the genes of begomoviruses. First, the genome sequence is represented numerically as a vector by the SNV method. Then SVM is applied to the datasets to build the classification model. Finally, recursive feature elimination (RFE) is used to select essential features of the subsequence natural vectors based on the importance of features. RESULTS: In the investigation, DNA-A, DNA-B, and different satellite DNAs are selected to build the model. To evaluate our model, the homology-based method BLAST and two machine learning algorithms, Random Forest and Naive Bayes, are compared with it. According to the results, our classification model can classify DNA-A, DNA-B, and different satellites with high accuracy. In particular, we can distinguish whether a DNA-A component is from a monopartite or a bipartite begomovirus. Then, based on the results of classification, we can also predict the genes of different genomic components. According to the selected features, we find that the content of the four nucleotides in the second and tenth segments (approximately 150-350 bp and 1,450-1,650 bp) differs the most between DNA-A components of monopartite and bipartite begomoviruses, which may be related to the pre-coat protein (AV2) and the transcriptional activator protein (AC2) genes. Our results advance the understanding of the unique structures of the genomic components of begomoviruses.

18.
Int J Mol Sci ; 21(11)2020 May 29.
Article in English | MEDLINE | ID: mdl-32485813

ABSTRACT

Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.


Subject(s)
Phylogeny , Sequence Analysis, DNA/methods , Sequence Homology, Nucleic Acid , Algorithms , Genome, Bacterial , Genome, Viral , Sequence Alignment
19.
Genes (Basel) ; 11(6)2020 06 09.
Article in English | MEDLINE | ID: mdl-32526937

ABSTRACT

The severe respiratory disease COVID-19 was initially reported in Wuhan, China, in December 2019, and spread into many provinces from Wuhan. The corresponding pathogen was soon identified as a novel coronavirus named SARS-CoV-2 (formerly, 2019-nCoV). As of 2 May, 2020, over 3 million COVID-19 cases had been confirmed, and 235,290 deaths had been reported globally, and the numbers are still increasing. It is important to understand the phylogenetic relationship between SARS-CoV-2 and known coronaviruses, and to identify its hosts in order to prevent the next round of emergency outbreaks. In this study, we employ an effective alignment-free approach, the Natural Vector method, to analyze the phylogeny and classify the coronaviruses based on genomic and protein data. Our results show that SARS-CoV-2 is closely related to, but distinct from, the SARS-CoV branch. By analyzing the genetic distances from the SARS-CoV-2 strain to the coronaviruses residing in animal hosts, we establish that the most likely transmission path runs from bats to pangolins to humans.


Subject(s)
Betacoronavirus/genetics , Coronavirus Infections/transmission , Coronavirus/genetics , Models, Biological , Pneumonia, Viral/transmission , Animals , Betacoronavirus/classification , COVID-19 , Chiroptera/virology , Coronavirus/classification , Coronavirus 3C Proteases , Coronavirus Infections/virology , Cysteine Endopeptidases/chemistry , Cysteine Endopeptidases/genetics , Disease Outbreaks , Disease Reservoirs , Humans , Mammals/classification , Mammals/virology , Pandemics , Phylogeny , Pneumonia, Viral/virology , SARS-CoV-2 , Spike Glycoprotein, Coronavirus/chemistry , Spike Glycoprotein, Coronavirus/genetics , Viral Nonstructural Proteins/chemistry , Viral Nonstructural Proteins/genetics
20.
J Comput Biol ; 27(12): 1688-1698, 2020 12.
Article in English | MEDLINE | ID: mdl-32392428

ABSTRACT

Bacterial evolution is an important field of study, and biological sequences are often used to construct phylogenetic relationships. Multiple sequence alignment is very time-consuming and cannot deal with large numbers of bacterial genome sequences in a reasonable time. Hence, a new mathematical method, the joining density vector method, is proposed to cluster bacteria; it characterizes the features of the coding sequences (CDS) in a DNA sequence. Coding sequences carry the genetic information used to synthesize proteins. The correspondence between a genomic sequence and its joining density vector (JDV) is one-to-one. The JDV reflects the statistical characteristics of a genomic sequence, and large amounts of data can be analyzed using this new approach. We apply the novel method to perform phylogenetic analysis on four bacterial data sets at the genus and species levels. The phylogenetic trees show that our new method accurately describes the evolutionary relationships of bacterial coding sequences, and is faster than ClustalW and the existing alignment-free methods.


Subject(s)
Bacteria/genetics , Genome, Bacterial , Phylogeny , Enterobacteriaceae/genetics , Pseudomonas/genetics , Streptococcus/genetics