Results 1 - 20 of 68
1.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39120645

ABSTRACT

Predicting the strength of promoters and guiding their directed evolution is a crucial task in synthetic biology. This approach significantly reduces the experimental costs in conventional promoter engineering. Previous studies employing machine learning or deep learning methods have shown some success in this task, but their outcomes were not satisfactory enough, primarily due to the neglect of evolutionary information. In this paper, we introduce the Chaos-Attention net for Promoter Evolution (CAPE) to address the limitations of existing methods. We comprehensively extract evolutionary information within promoters using merged chaos game representation and process the overall information with modified DenseNet and Transformer structures. Our model achieves state-of-the-art results on two kinds of distinct tasks related to prokaryotic promoter strength prediction. The incorporation of evolutionary information enhances the model's accuracy, with transfer learning further extending its adaptability. Furthermore, experimental results confirm CAPE's efficacy in simulating in silico directed evolution of promoters, marking a significant advancement in predictive modeling for prokaryotic promoter strength. Our paper also presents a user-friendly website for the practical implementation of in silico directed evolution on promoters. The source code implemented in this study and the instructions on accessing the website can be found in our GitHub repository https://github.com/BobYHY/CAPE.


Subjects
Deep Learning; Promoter Regions, Genetic; Algorithms; Evolution, Molecular; Computer Simulation; Nonlinear Dynamics; Computational Biology/methods
2.
J Theor Biol ; 530: 110885, 2021 12 07.
Article in English | MEDLINE | ID: mdl-34478743

ABSTRACT

The world faces a great unforeseen challenge through the COVID-19 pandemic caused by the coronavirus SARS-CoV-2. The virus genome structure and evolution are front and center for gaining further insights into vaccine development, monitoring of transmission trajectories, and prevention of zoonotic infections of new coronaviruses. Of particular interest are the genomic elements called Inverse Repeats (IRs), which maintain genome stability, regulate gene expression, and are targets of mutations. However, little research attention has been given to IR content analysis in the SARS-CoV-2 genome. In this study, we propose a geometric analysis method and use it to investigate the distributions of IRs in SARS-CoV-2 and related coronavirus genomes. The method represents each genomic IR sequence pair as a single point and constructs the geometric shape of the genome using the IRs. Thus, the IR shape can be considered the signature of the genome. The genomes of different coronaviruses are then compared using the constructed IR shapes. The results demonstrate that the SARS-CoV-2 genome, specifically, has an abundance of IRs, and that the IRs in coronavirus genomes show an increase during evolution events.


Subjects
COVID-19; SARS-CoV-2; Genome, Viral/genetics; Genomics; Humans; Pandemics; Phylogeny
3.
Genomics ; 112(2): 1847-1852, 2020 03.
Article in English | MEDLINE | ID: mdl-31704313

ABSTRACT

A novel method is proposed to detect acceptor and donor splice sites using chaos game representation and an artificial neural network. To achieve high accuracy, the inputs to the neural network, i.e. the feature vector, should reflect the true nature of the DNA segments. It is therefore important to have a one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as the feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.


Subjects
Neural Networks, Computer; RNA Splice Sites; Sequence Analysis, DNA/methods; Humans; Nonlinear Dynamics; Software
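
The iterative mapping that the abstract of entry 3 calls chaos game representation can be sketched in a few lines. This is only an illustrative sketch, not the authors' code: the corner assignment in `CORNERS`, the starting point (0.5, 0.5), and the function name `cgr_points` are assumptions made here, and published CGR variants place the nucleotides on different corners.

```python
# Assumed corner layout of the unit square; other CGR variants use other layouts.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return one (x, y) point per nucleotide of seq.

    Each point is the midpoint between the previous point and the
    corner assigned to the current nucleotide.
    """
    x, y = start
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]  # raises KeyError on symbols other than A, C, G, T
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("ACGT"):
        print(point)
```

Because every step halves the distance to a corner, the last point encodes the whole sequence prefix, which is the one-to-one property the abstract relies on.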
4.
Genomics ; 111(6): 1777-1784, 2019 12.
Article in English | MEDLINE | ID: mdl-30529533

ABSTRACT

This study quantitatively validates the principle that the biological properties associated with a given genotype are determined by the distribution of amino acids. In order to visualize this central law of molecular biology, each protein was represented by a point in 250-dimensional space based on its amino acid distribution. Proteins from the same family are found to cluster together, leading to the principle that the convex hull surrounding protein points from the same family does not intersect with the convex hulls of other protein families. This principle was verified computationally for all available and reliable protein kinases and human proteins. In addition, we generated 2,328,761 figures to show that the convex hulls of different families were disjoint from each other. The classification performs well, with high and robust accuracy (95.75% and 97.5%), and the reasonable phylogenetic trees further validate our methods.


Subjects
Algorithms; Multigene Family; Phylogeny; Protein Kinases/classification; Protein Kinases/genetics; Humans
5.
Genomics ; 111(6): 1298-1305, 2019 12.
Article in English | MEDLINE | ID: mdl-30195069

ABSTRACT

Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, in which the numbers and distributions of k-mers are considered. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be performed easily without requiring evolutionary models or human intervention. However, no criterion exists for choosing a suitable k, and k greatly influences both the results and the computational complexity. In this paper, a compound k-mer natural vector is utilized to quantify each protein sequence. The results obtained from phylogenetic analysis on three protein datasets demonstrate that our new method can precisely describe the evolutionary relationships of proteins and greatly improve computational efficiency.


Subjects
Phylogeny; Sequence Analysis, Protein/methods; Influenza A virus/classification; Rhinovirus/classification; Viral Proteins/chemistry; beta-Globins/chemistry
6.
Int J Mol Sci ; 21(11)2020 May 29.
Article in English | MEDLINE | ID: mdl-32485813

ABSTRACT

Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.


Subjects
Phylogeny; Sequence Analysis, DNA/methods; Sequence Homology, Nucleic Acid; Algorithms; Genome, Bacterial; Genome, Viral; Sequence Alignment
7.
Mol Phylogenet Evol ; 141: 106633, 2019 12.
Article in English | MEDLINE | ID: mdl-31563612

ABSTRACT

Using numerical methods for genome comparison has always been of importance in bioinformatics. The Chaos Game Representation (CGR) is an effective genome sequence mapping technology, which converts genome sequences to CGR images. To each CGR image, we associate a vector called an Extended Natural Vector (ENV). The ENV is based on the distribution of intensity values. This mapping produces a one-to-one correspondence between CGR images and their ENVs. We define the distance between two DNA sequences as the distance between their associated ENVs. We cluster and classify several datasets including Influenza A viruses, Bacillus genomes, and Conoidea mitochondrial genomes to build their phylogenetic trees. Results show that our ENV combining CGR method (CGR-ENV) compares favorably in classification accuracy and efficiency against the multiple sequence alignment (MSA) method and other alignment-free methods. The research provides significant insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes.


Subjects
Algorithms; Genome; Genomics; Base Sequence; DNA/genetics; Genome, Mitochondrial; Markov Chains; Phylogeny
8.
BMC Evol Biol ; 18(1): 200, 2018 12 27.
Article in English | MEDLINE | ID: mdl-30587116

ABSTRACT

BACKGROUND: In recent years, DNA barcoding has become an important tool for biologists to identify species and understand their natural biodiversity. The complexity of barcode data makes it difficult to analyze quickly and effectively. Manual classification of this data cannot keep up with the rate of increase of available data. RESULTS: In this study, we propose a new method for DNA barcode classification based on the distribution of nucleotides within the sequence. By adding the covariance of nucleotides to the original natural vector, this augmented 18-dimensional natural vector makes good use of the available information in the DNA sequence. The accurate classification results we obtained demonstrate that this new 18-dimensional natural vector method, together with the random forest classifier algorithm, can serve as a computationally efficient identification tool for DNA barcodes. We performed phylogenetic analysis on the genus Megacollybia to validate our method. We also studied how effective our method was in determining the genetic distance within and between species in our barcoding dataset. CONCLUSIONS: The classification performs well on the fungi barcode dataset with high and robust accuracy. The reasonable phylogenetic trees we obtained further validate our methods. This method is alignment-free and does not depend on any model assumption, and it will become a powerful tool for classification and evolutionary analysis.


Subjects
DNA Barcoding, Taxonomic/methods; Fungi/classification; Fungi/genetics; Biodiversity; Phylogeny; Sequence Analysis, DNA
9.
J Theor Biol ; 456: 34-40, 2018 11 07.
Article in English | MEDLINE | ID: mdl-30059661

ABSTRACT

Comparing DNA and protein sequence groups plays an important role in biological evolutionary relationship research. Despite many methods available for sequence comparison, only a few can be used for group comparison. In this study, we propose a novel approach using convex hulls. We use statistical information contained within the sequences to represent each sequence as a point in high dimensional space. We find that the points belonging to one biological group are located in a different region of space than points belonging to other biological groups. To be more precise, the convex hull of the points from one group is disjoint from the convex hulls of points from other groups. This finding allows us to do phylogenetic analysis for groups in an efficient way. Five different theorems are presented for checking whether two convex hulls intersect or are disjoint. Test results for datasets related to HRV, HPV, Ebolavirus, PKC and protein phosphatase domains demonstrate that our method performs well and provides a new tool for studying group phylogeny. More significantly, the convex analysis presents a new way to search for sequences belonging to a biological group by examining points within the group's convex hull.


Subjects
Biological Evolution; Rhinovirus/genetics; Sequence Analysis, DNA/methods; Sequence Analysis, Protein/methods; Algorithms; Ebolavirus/genetics; Genome, Viral/genetics; Humans; Numerical Analysis, Computer-Assisted; Papillomaviridae/genetics; Phylogeny; Protein Kinase C/genetics
10.
J Theor Biol ; 427: 41-52, 2017 08 01.
Article in English | MEDLINE | ID: mdl-28587743

ABSTRACT

Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply over the past decade. Traditionally, comparison of protein sequences is carried out through multiple sequence alignment methods. However, these methods may be unsuitable for clustering protein sequences when gene rearrangements occur, such as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, polar requirement, and chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method not only utilizes the chemical properties of amino acids but also takes into account their numbers and positions. The results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.


Subjects
Proteins/chemistry; Cluster Analysis; Phylogeny
11.
Genomics ; 108(3-4): 134-142, 2016 10.
Article in English | MEDLINE | ID: mdl-27538895

ABSTRACT

Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane, which allows the DNA sequence to be depicted in the form of an image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationships. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.


Subjects
Sequence Analysis, DNA/methods; Software; Algorithms; Sequence Alignment/methods
12.
Mol Phylogenet Evol ; 98: 271-9, 2016 May.
Article in English | MEDLINE | ID: mdl-26926946

ABSTRACT

The free-living SAR11 clade is a globally abundant group of oceanic Alphaproteobacteria, with small genome sizes and rich genomic A+T content. However, the taxonomy of SAR11 has become controversial recently. Some researchers argue that the position of SAR11 is a sister group to Rickettsiales. Other researchers advocate that SAR11 is located within free-living lineages of Alphaproteobacteria. Here, we use the natural vector representation method to identify the evolutionary origin of the SAR11 clade. This alignment-free method does not depend on any model assumptions. With this approach, the correspondence between proteome sequences and their natural vectors is one-to-one. After fixing a set of proteins, each bacterium is represented by a set of vectors. The Hausdorff distance is then used to compute the dissimilarity distance between two bacteria. The phylogenetic tree can be reconstructed based on these distances. Using our method, we systematically analyze four data sets of alphaproteobacterial proteomes in order to reconstruct the phylogeny of Alphaproteobacteria. From this we can see that the phylogenetic position of the SAR11 group is within a group of other free-living lineages of Alphaproteobacteria.


Subjects
Alphaproteobacteria/classification; Aquatic Organisms/classification; Phylogeny; Alphaproteobacteria/genetics; Alphaproteobacteria/metabolism; Aquatic Organisms/genetics; Aquatic Organisms/metabolism; Bacterial Proteins/metabolism; Proteome/metabolism
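
The abstract of entry 12 compares two bacteria, each represented by a set of vectors, using the Hausdorff distance. A minimal sketch of that set distance follows, assuming Euclidean distance between individual vectors and non-empty sets; the function names are illustrative and not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def directed_hausdorff(set_a, set_b):
    """Largest distance from a point of set_a to its nearest point in set_b."""
    return max(min(euclidean(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b):
    """Symmetric Hausdorff distance between two non-empty sets of vectors."""
    return max(directed_hausdorff(set_a, set_b), directed_hausdorff(set_b, set_a))


if __name__ == "__main__":
    genome_1 = [(0.0, 0.0), (1.0, 0.0)]  # toy 2-D vectors, not real natural vectors
    genome_2 = [(0.0, 0.0)]
    print(hausdorff(genome_1, genome_2))
```

The distance does not depend on the order of the vectors and allows sets of different sizes, which is why it suits organisms described by differing numbers of proteins or segments. This brute-force version costs O(|A|·|B|) distance evaluations.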
13.
Mol Phylogenet Evol ; 99: 53-62, 2016 06.
Article in English | MEDLINE | ID: mdl-26988414

ABSTRACT

Due to vast sequence divergence among different viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. More and more attention has been paid to alignment-free methods for whole genome comparison and phylogenetic tree reconstruction. Among alignment-free methods, the recently proposed "Natural Vector (NV) representation" has successfully been used to study the phylogeny of multi-segmented viruses based on a 12-dimensional genome space derived from the nucleotide sequence structure. But the preference of proteomes over genomes for the determination of viral phylogeny was not deeply investigated. As the translated products of genes, proteins directly form the shape of viral structure and are vital for all metabolic pathways. In this study, using the NV representation of a protein sequence along with the Hausdorff distance suitable to compare point sets, we construct a 60-dimensional protein space to analyze the evolutionary relationships of 4021 viruses by whole-proteomes in the current NCBI Reference Sequence Database (RefSeq). We also take advantage of the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. The accuracy rates of our predictions such as for Baltimore II viruses are as high as 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we discover that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates using proteomes are still comparable to that of genomes.


Subjects
Viral Proteins/chemistry; Viruses/classification; Amino Acid Sequence; Databases, Protein; Genome, Viral; Phylogeny; Proteome/chemistry; Proteome/genetics; Viruses/genetics
14.
J Theor Biol ; 382: 99-110, 2015 Oct 07.
Article in English | MEDLINE | ID: mdl-26151589

ABSTRACT

DNA sequence similarity comparison is one of the major steps in computational phylogenetic studies. The sequence comparison of closely related DNA sequences and genomes is usually performed by multiple sequence alignments (MSA). While the MSA method is accurate for some types of sequences, it may produce incorrect results when DNA sequences have undergone rearrangements, as in many bacterial and viral genomes. It is also limited by its computational complexity for comparing large volumes of data. Previously, we proposed an alignment-free method that exploits the full information contents of DNA sequences by Discrete Fourier Transform (DFT), but still with some limitations. Here, we present a significantly improved method for the similarity comparison of DNA sequences by DFT. In this method, we map DNA sequences into 2-dimensional (2D) numerical sequences and then apply DFT to transform the 2D numerical sequences into the frequency domain. In the 2D mapping, the nucleotide composition of a DNA sequence is a determinant factor, and the 2D mapping reduces the nucleotide composition bias in the distance measure, thus improving the similarity measure of DNA sequences. To compare the DFT power spectra of DNA sequences with different lengths, we propose an improved even scaling algorithm to extend shorter DFT power spectra to the longest length of the underlying sequences. After the DFT power spectra are evenly scaled, the spectra have the same dimensionality in the Fourier frequency space, and the Euclidean distances between the full Fourier power spectra of the DNA sequences are used as the dissimilarity metrics. The improved DFT method, with computational performance increased by the 2D numerical representation, is applicable to DNA sequences of any length range. We assess the accuracy of the improved DFT similarity measure in hierarchical clustering of different DNA sequences, including simulated and real datasets. The method yields accurate and reliable phylogenetic trees and demonstrates that the improved DFT dissimilarity measure is an efficient and effective similarity measure of DNA sequences. Due to its high efficiency and accuracy, the proposed DFT similarity measure is successfully applied to phylogenetic analysis of individual genes and large whole bacterial genomes.


Subjects
Fourier Analysis; Genome; Models, Genetic; Phylogeny; Algorithms; Animals; Base Sequence; Cluster Analysis; Computer Simulation; Genome, Mitochondrial; Humans; Mammals/genetics; Mutation/genetics; NADH Dehydrogenase/genetics; Nucleotides/genetics
15.
J Theor Biol ; 372: 135-45, 2015 May 07.
Article in English | MEDLINE | ID: mdl-25747773

ABSTRACT

A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods.


Subjects
DNA/genetics; Sequence Analysis, DNA/methods; Algorithms; Animals; Bacteria/genetics; Cluster Analysis; Computational Biology; Coronavirus/genetics; DNA, Mitochondrial/genetics; Evolution, Molecular; Fourier Analysis; Genome; Genome, Bacterial; Humans; Image Processing, Computer-Assisted; Influenza A virus/genetics; Models, Genetic; Phylogeny; Rhinovirus/genetics; Sequence Alignment/methods
16.
Mol Phylogenet Evol ; 81: 29-36, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25172357

ABSTRACT

We have recently developed a computational approach in a vector space for genome-based virus classification. This approach, called the "Natural Vector (NV) representation", which is an alignment-free method, allows us to classify single-segmented viruses with high speed and accuracy. For multiple-segmented viruses, typically phylogenetic trees of each segment are reconstructed for discovering viral phylogeny. Consensus tree methods may be used to combine the phylogenetic trees based on different segments. However, consensus tree methods were not developed for instances where the viruses have different numbers of segments or where their segments do not match well. We propose a novel approach for comparing multiple-segmented viruses globally, even in cases where viruses contain different numbers of segments. Using our method, each virus is represented by a set of vectors in R^12. The Hausdorff distance is then used to compare different sets of vectors. Phylogenetic trees can be reconstructed based on this distance. The proposed method is used for predicting classification labels of viruses with n segments (n ⩾ 1). The correctness rates of our predictions based on cross-validation are as high as 96.5%, 95.4%, 99.7%, and 95.6% for Baltimore class, family, subfamily, and genus, respectively, which are comparable to the rates for single-segmented viruses only. Our method is not affected by the number or order of segments. We also demonstrate that the natural graphical representation based on the Hausdorff distance is more reasonable than the consensus tree for a recent public health threat, the influenza A (H7N9) viruses.


Subjects
Genome, Viral; Influenza A Virus, H7N9 Subtype/classification; Phylogeny; Sequence Analysis, DNA/methods; Genomics/methods; Influenza A Virus, H7N9 Subtype/genetics; Viruses/classification; Viruses/genetics
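
The abstract of entry 16 places each genome segment in a 12-dimensional space but does not spell out the coordinates. The sketch below assumes the commonly cited natural vector definition: for each nucleotide, its count, its mean position (1-based), and its normalized second central moment of position. The function name `natural_vector`, the component ordering, and the convention of using zeros for absent nucleotides are assumptions made for illustration.

```python
def natural_vector(seq, alphabet="ACGT"):
    """12-dimensional natural vector of a DNA sequence (assumed definition).

    Layout: [n_A, n_C, n_G, n_T, mu_A, ..., mu_T, D2_A, ..., D2_T], where
    n_k is the count of nucleotide k, mu_k its mean 1-based position, and
    D2_k = sum((position - mu_k)^2) / (n_k * n) with n the sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        positions = [i for i, c in enumerate(seq, start=1) if c == base]
        n_k = len(positions)
        if n_k == 0:
            # Assumed convention: an absent nucleotide contributes zeros.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


if __name__ == "__main__":
    print(natural_vector("ACGTA"))
```

A multi-segmented virus would then be a list of such vectors, one per segment, which can be compared with a set distance such as the Hausdorff distance named in the abstract.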
17.
J Theor Biol ; 348: 12-20, 2014 May 07.
Article in English | MEDLINE | ID: mdl-24486229

ABSTRACT

In this paper, we develop a novel method to study viral genome phylogeny. We apply Lempel-Ziv complexity to define the distance between two nucleic acid sequences. Then, based on this distance, we use the Hausdorff distance (HD) and a modified Hausdorff distance (MHD) to perform phylogenetic analysis of multi-segmented viral genomes. The results show that the MHD provides more accurate phylogenetic relationships. Our method compares all multi-segmented genomes globally and simultaneously; that is, we treat each multi-segmented viral genome as a whole in the comparative analysis. Our method is not affected by the number or order of segments, and each segment contributes to the phylogeny of the whole genomes. We have analyzed several groups of real multi-segmented genomes from different viral families. The results show that our method provides a powerful new tool for studying the classification of viral genomes and their phylogenetic relationships.


Subjects
Genome, Viral; Sequence Analysis, DNA/methods; Animals; Base Sequence; DNA, Viral/genetics; Databases, Nucleic Acid; HIV-1/classification; HIV-1/genetics; Phylogeny; Simian Immunodeficiency Virus/classification; Simian Immunodeficiency Virus/genetics
18.
J Theor Biol ; 363: 145-50, 2014 Dec 21.
Article in English | MEDLINE | ID: mdl-25158165

ABSTRACT

Based on the k-mer model for genetic sequences, a k-mer sparse matrix representation is proposed to denote the types and sites of k-mers appearing in a genetic sequence, and there exists a one-to-one relationship between a genetic sequence and its associated k-mer sparse matrix. With the singular value decomposition of the k-mer sparse matrix, the k-mer singular value vector is constructed and utilized to numerically quantify the characteristics of a genetic sequence. We investigate and evaluate the optimum value k* chosen for our k-mer sparse matrix model for genetic sequences. To show the usefulness of our k-mer sparse matrix model method, it is applied to the comparison of genetic sequences, and the results obtained fully demonstrate that our proposed method is very powerful in analyzing and determining the relationships of genetic sequences.


Subjects
Base Sequence/genetics; Computational Biology/methods; Models, Genetic; Sequence Analysis/methods
19.
J Theor Biol ; 359: 18-28, 2014 Oct 21.
Article in English | MEDLINE | ID: mdl-24911780

ABSTRACT

Multiple sequence alignment (MSA) is a prominent method for classification of DNA sequences, yet it is hampered by inherent limitations in computational complexity. Alignment-free methods have been developed over the past decade for more efficient comparison and classification of DNA sequences than MSA. However, most alignment-free methods may lose structural and functional information of DNA sequences because they are based on feature extraction. Therefore, they may not fully reflect the actual differences among DNA sequences. Alignment-free methods with information conservation are needed for more accurate comparison and classification of DNA sequences. We propose a new alignment-free similarity measure of DNA sequences using the Discrete Fourier Transform (DFT). In this method, we map DNA sequences into four binary indicator sequences and apply DFT to the indicator sequences to transform them into the frequency domain. The Euclidean distance between the full DFT power spectra of the DNA sequences is used as the similarity distance metric. To compare the DFT power spectra of DNA sequences with different lengths, we propose an even scaling method to extend shorter DFT power spectra to equal the longest length of the sequences compared. After the DFT power spectra are evenly scaled, the DNA sequences are compared in the same DFT frequency space dimensionality. We assess the accuracy of the similarity metric in hierarchical clustering using simulated DNA and virus sequences. The results demonstrate that the DFT-based method is an effective and accurate measure of DNA sequence similarity.


Subjects
Cluster Analysis; DNA/analysis; Fourier Analysis; Sequence Alignment/methods; Algorithms; Base Sequence; Computational Biology; Humans; Molecular Sequence Data; Myeloid Cell Leukemia Sequence 1 Protein/analysis; Myeloid Cell Leukemia Sequence 1 Protein/genetics; Phylogeny; Sequence Analysis, DNA; Sequence Homology, Nucleic Acid
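
The pipeline in the abstract of entry 19 (binary indicator sequences, DFT, power spectra, Euclidean distance) can be sketched as follows. This is a simplified sketch under stated assumptions: the four per-nucleotide power spectra are summed into one spectrum, a naive O(n²) DFT is used for self-containment, and the even scaling step for unequal lengths is omitted, so the distance function accepts only sequences of equal length. All function names are illustrative.

```python
import cmath
import math


def indicator(seq, base):
    """Binary indicator sequence: 1.0 where seq has `base`, else 0.0."""
    return [1.0 if c == base else 0.0 for c in seq]


def power_spectrum(x):
    """Power spectrum |X(k)|^2 of a real sequence via a naive O(n^2) DFT."""
    n = len(x)
    spectrum = []
    for k in range(n):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def dna_power_spectrum(seq):
    """Sum of the power spectra of the A, C, G and T indicator sequences."""
    seq = seq.upper()
    per_base = [power_spectrum(indicator(seq, b)) for b in "ACGT"]
    return [sum(values) for values in zip(*per_base)]


def spectral_distance(seq1, seq2):
    """Euclidean distance between power spectra of equal-length sequences."""
    if len(seq1) != len(seq2):
        # The even scaling step from the abstract is not implemented here.
        raise ValueError("sequences must have equal length in this sketch")
    p1, p2 = dna_power_spectrum(seq1), dna_power_spectrum(seq2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


if __name__ == "__main__":
    print(spectral_distance("AAAA", "ACGT"))
```

For real genomes, a fast Fourier transform would replace the naive loop; the quadratic version is kept here only so the sketch needs nothing beyond the standard library.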
20.
Genes (Basel) ; 15(7)2024 Jul 07.
Article in English | MEDLINE | ID: mdl-39062670

ABSTRACT

The highly variable SARS-CoV-2 virus responsible for the COVID-19 pandemic frequently undergoes mutations, leading to the emergence of new variants that present novel threats to public health. The determination of these variants often relies on manual definition based on local sequence characteristics, resulting in delays in their detection relative to their actual emergence. In this study, we propose an algorithm for the automatic identification of novel variants. By leveraging the optimal natural metric for viruses based on an alignment-free perspective to measure distances between sequences, we devise a hypothesis testing framework to determine whether a given viral sequence belongs to a novel variant. Our method demonstrates high accuracy, achieving nearly 100% precision in identifying new variants of SARS-CoV-2 and HIV-1 as well as in detecting novel genera in Orthocoronavirinae. This approach holds promise for timely surveillance and management of emerging viral threats in the field of public health.


Subjects
Algorithms; COVID-19; HIV-1; SARS-CoV-2; SARS-CoV-2/genetics; Humans; COVID-19/virology; COVID-19/epidemiology; HIV-1/genetics; Mutation