1.
Int J Mol Sci ; 21(11), 2020 May 29.
Article in English | MEDLINE | ID: mdl-32485813

ABSTRACT

Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.


Subjects
Phylogeny, DNA Sequence Analysis/methods, Nucleic Acid Sequence Homology, Algorithms, Bacterial Genome, Viral Genome, Sequence Alignment
2.
Mol Phylogenet Evol ; 141: 106633, 2019 12.
Article in English | MEDLINE | ID: mdl-31563612

ABSTRACT

Using numerical methods for genome comparison has always been of importance in bioinformatics. The Chaos Game Representation (CGR) is an effective genome sequence mapping technology, which converts genome sequences to CGR images. To each CGR image, we associate a vector called an Extended Natural Vector (ENV). The ENV is based on the distribution of intensity values. This mapping produces a one-to-one correspondence between CGR images and their ENVs. We define the distance between two DNA sequences as the distance between their associated ENVs. We cluster and classify several datasets including Influenza A viruses, Bacillus genomes, and Conoidea mitochondrial genomes to build their phylogenetic trees. Results show that our ENV combining CGR method (CGR-ENV) compares favorably in classification accuracy and efficiency against the multiple sequence alignment (MSA) method and other alignment-free methods. The research provides significant insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes.


Subjects
Algorithms, Genome, Genomics, Base Sequence, DNA/genetics, Mitochondrial Genome, Markov Chains, Phylogeny
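As an illustration of the Chaos Game Representation described in the abstract above, the following is a minimal sketch, not the authors' implementation. The corner assignment (A, C, G, T on the four corners of the unit square), the starting point at the center, and the names `cgr_points` and `cgr_image` are assumptions made for this example; the Extended Natural Vector computed from the image intensities is not shown.

```python
from typing import List, Tuple

# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> List[Tuple[float, float]]:
    """Map a DNA sequence to CGR points by repeated midpoint steps."""
    x, y = 0.5, 0.5  # start at the center of the square (assumption)
    points = []
    for ch in seq.upper():
        if ch not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[ch]
        # Move halfway from the current point toward the nucleotide's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_image(seq: str, resolution: int = 8) -> List[List[int]]:
    """Count CGR points per grid cell; the counts act as pixel intensities."""
    grid = [[0] * resolution for _ in range(resolution)]
    for x, y in cgr_points(seq):
        col = min(int(x * resolution), resolution - 1)
        row = min(int(y * resolution), resolution - 1)
        grid[row][col] += 1
    return grid
```

Calling `cgr_image("ACGTTGCA", resolution=4)` returns a 4 x 4 grid of counts whose entries sum to the number of valid nucleotides in the sequence.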
3.
J Theor Biol ; 427: 41-52, 2017 08 01.
Article in English | MEDLINE | ID: mdl-28587743

ABSTRACT

Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply in the past decade. Traditionally, protein sequences are compared through multiple sequence alignment methods. However, these methods may be unsuitable for clustering protein sequences when gene rearrangements occur, as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, polar requirement, and chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method uses not only the chemical properties of amino acids but also their counts and positions. The results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.


Subjects
Proteins/chemistry, Cluster Analysis, Phylogeny
4.
Mol Phylogenet Evol ; 99: 53-62, 2016 06.
Article in English | MEDLINE | ID: mdl-26988414

ABSTRACT

Due to vast sequence divergence among different viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. Increasing attention has been paid to alignment-free methods for whole-genome comparison and phylogenetic tree reconstruction. Among alignment-free methods, the recently proposed "Natural Vector (NV) representation" has successfully been used to study the phylogeny of multi-segmented viruses based on a 12-dimensional genome space derived from the nucleotide sequence structure. However, the preference of proteomes over genomes for determining viral phylogeny has not been investigated in depth. As the translated products of genes, proteins directly form the shape of viral structure and are vital for all metabolic pathways. In this study, using the NV representation of a protein sequence along with the Hausdorff distance, which is suitable for comparing point sets, we construct a 60-dimensional protein space to analyze the evolutionary relationships of 4021 viruses by whole proteomes in the current NCBI Reference Sequence Database (RefSeq). We also take advantage of the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. The accuracy rates of our predictions, for example for Baltimore II viruses, are as high as 95.9% for family labels, 95.7% for subfamily labels, and 96.5% for genus labels. Finally, we find that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates using proteomes are still comparable to those using genomes.


Subjects
Viral Proteins/chemistry, Viruses/classification, Amino Acid Sequence, Protein Databases, Viral Genome, Phylogeny, Proteome/chemistry, Proteome/genetics, Viruses/genetics
5.
Mol Phylogenet Evol ; 81: 29-36, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25172357

ABSTRACT

We have recently developed a computational approach in a vector space for genome-based virus classification. This approach, called the "Natural Vector (NV) representation", is an alignment-free method that allows us to classify single-segmented viruses with high speed and accuracy. For multiple-segmented viruses, phylogenetic trees of each segment are typically reconstructed to discover viral phylogeny. Consensus tree methods may be used to combine the phylogenetic trees based on different segments. However, consensus tree methods were not developed for instances where the viruses have different numbers of segments or where their segments do not match well. We propose a novel approach for comparing multiple-segmented viruses globally, even in cases where viruses contain different numbers of segments. Using our method, each virus is represented by a set of vectors in ℝ¹². The Hausdorff distance is then used to compare different sets of vectors. Phylogenetic trees can be reconstructed based on this distance. The proposed method is used for predicting classification labels of viruses with n segments (n ⩾ 1). The correctness rates of our predictions based on cross-validation are as high as 96.5%, 95.4%, 99.7%, and 95.6% for Baltimore class, family, subfamily, and genus, respectively, which are comparable to the rates for single-segmented viruses only. Our method is not affected by the number or order of segments. We also demonstrate that the natural graphical representation based on the Hausdorff distance is more reasonable than the consensus tree for a recent public health threat, the influenza A (H7N9) viruses.


Subjects
Viral Genome, Influenza A Virus H7N9 Subtype/classification, Phylogeny, DNA Sequence Analysis/methods, Genomics/methods, Influenza A Virus H7N9 Subtype/genetics, Viruses/classification, Viruses/genetics
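The set-to-set comparison described in the abstract above can be sketched with the standard Hausdorff distance. This is a minimal illustration under stated assumptions: each virus is a list of equal-length vectors (one per segment), Euclidean distance is used between vectors, and the function names are hypothetical. The toy example uses 2-dimensional points for readability rather than 12-dimensional natural vectors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def directed_hausdorff(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(x, y) for y in b) for x in a)


def hausdorff(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Two hypothetical viruses with different numbers of segments.
virus_a = [(0.0, 0.0), (1.0, 0.0)]
virus_b = [(0.0, 0.0)]
distance = hausdorff(virus_a, virus_b)
```

Because the distance is defined on sets, it does not depend on the order of the segments and it accepts sets of different sizes, which matches the property the abstract emphasizes.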
6.
J Theor Biol ; 348: 12-20, 2014 May 07.
Article in English | MEDLINE | ID: mdl-24486229

ABSTRACT

In this paper, we develop a novel method to study viral genome phylogeny. We apply Lempel-Ziv complexity to define the distance between two nucleic acid sequences. Based on this distance, we use the Hausdorff distance (HD) and a modified Hausdorff distance (MHD) to perform phylogenetic analysis of multi-segmented viral genomes. The results show that the MHD provides more accurate phylogenetic relationships. Our method compares all multi-segmented genomes globally and simultaneously; that is, we treat each multi-segmented viral genome as a whole in the comparative analysis. Our method is not affected by the number or order of segments, and each segment contributes to the phylogeny of the whole genome. We have analyzed several groups of real multi-segmented genomes from different viral families. The results show that our method provides a powerful new tool for studying the classification of viral genomes and their phylogenetic relationships.


Subjects
Viral Genome, DNA Sequence Analysis/methods, Animals, Base Sequence, Viral DNA/genetics, Nucleic Acid Databases, HIV-1/classification, HIV-1/genetics, Phylogeny, Simian Immunodeficiency Virus/classification, Simian Immunodeficiency Virus/genetics
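A Lempel-Ziv based sequence distance of the kind the abstract above mentions can be sketched as follows. The complexity function counts the components of the exhaustive Lempel-Ziv (1976) parsing. The abstract does not state which distance formula the authors use, so the normalization in `lz_distance` is an assumption (one common normalized form from the literature), and both function names are hypothetical.

```python
def lz_complexity(s: str) -> int:
    """Number of components in the exhaustive Lempel-Ziv (1976) parsing."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the component while it can be copied from the text seen so
        # far (the search window excludes the component's final symbol).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def lz_distance(s: str, q: str) -> float:
    """Assumed normalized distance built from complexities of concatenations."""
    c_s, c_q = lz_complexity(s), lz_complexity(q)
    numerator = max(lz_complexity(s + q) - c_s, lz_complexity(q + s) - c_q)
    return numerator / max(c_s, c_q)
```

The intuition is that appending `q` to `s` adds few new components when `q` can largely be rebuilt from pieces of `s`, so similar sequences receive a small distance. This quadratic-time version is meant for short inputs only.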
7.
Comput Struct Biotechnol J ; 19: 4226-4234, 2021.
Article in English | MEDLINE | ID: mdl-34429843

ABSTRACT

Understanding the relationships between genomic sequences is essential to the classification and characterization of living beings. The classes and characteristics of an organism can be identified in the corresponding genome space. In the genome space, the natural metric is important to describe the distribution of genomes. Therefore, the similarity of two biological sequences can be measured. Here, we report that all of the viral genomes are in 32-dimensional Euclidean space, in which the natural metric is the weighted summation of Euclidean distance of k-mer natural vectors. The classification of viral genomes in the constructed genome space further proves the convex hull principle of taxonomy, which states that convex hulls of different families are mutually disjoint. This study provides a novel geometric perspective to describe the genome sequences.

8.
Comput Struct Biotechnol J ; 18: 1904-1913, 2020.
Article in English | MEDLINE | ID: mdl-32774785

ABSTRACT

Chaos Game Representation (CGR) was first proposed as an image representation method for DNA and has since been extended to other biological macromolecules. Compared with the CGR images of DNA, where DNA sequences are converted into a series of points in the unit square, the existing CGR images of proteins are less elegant geometrically, and the implications of the distribution of points in the CGR image are less obvious. In this study, by naturally distributing the twenty amino acids on the vertices of a regular dodecahedron, we introduce a novel three-dimensional image representation of protein sequences with the CGR method. We also associate each CGR image with a vector in high-dimensional Euclidean space, called the extended natural vector (ENV), in order to analyze the information contained in the CGR images. Based on the results of protein classification and phylogenetic analysis, our method could serve as a precise method to discover biological relationships between proteins.

9.
Genes (Basel) ; 11(6), 2020 06 09.
Article in English | MEDLINE | ID: mdl-32526937

ABSTRACT

The severe respiratory disease COVID-19 was initially reported in Wuhan, China, in December 2019, and spread from Wuhan into many provinces. The corresponding pathogen was soon identified as a novel coronavirus named SARS-CoV-2 (formerly 2019-nCoV). As of 2 May 2020, over 3 million COVID-19 cases had been confirmed and 235,290 deaths had been reported globally, and the numbers are still increasing. It is important to understand the phylogenetic relationship between SARS-CoV-2 and known coronaviruses, and to identify its hosts in order to prevent the next emergency outbreak. In this study, we employ an effective alignment-free approach, the Natural Vector method, to analyze the phylogeny and classify the coronaviruses based on genomic and protein data. Our results show that SARS-CoV-2 is closely related to, but distinct from, the SARS-CoV branch. By analyzing the genetic distances from the SARS-CoV-2 strain to the coronaviruses residing in animal hosts, we establish that the most likely transmission path runs from bats to pangolins to humans.


Subjects
Betacoronavirus/genetics, Coronavirus Infections/transmission, Coronavirus/genetics, Biological Models, Viral Pneumonia/transmission, Animals, Betacoronavirus/classification, COVID-19, Chiroptera/virology, Coronavirus/classification, Coronavirus 3C Proteases, Coronavirus Infections/virology, Cysteine Endopeptidases/chemistry, Cysteine Endopeptidases/genetics, Disease Outbreaks, Disease Reservoirs, Humans, Mammals/classification, Mammals/virology, Pandemics, Phylogeny, Viral Pneumonia/virology, SARS-CoV-2, Coronavirus Spike Glycoprotein/chemistry, Coronavirus Spike Glycoprotein/genetics, Viral Nonstructural Proteins/chemistry, Viral Nonstructural Proteins/genetics
10.
PeerJ ; 8: e9625, 2020.
Article in English | MEDLINE | ID: mdl-32832270

ABSTRACT

BACKGROUND: Begomoviruses are widely distributed and cause devastating diseases in many crops. According to the number of genomic components, a begomovirus is classed as either monopartite or bipartite. Both monopartite and bipartite begomoviruses have the DNA-A component, which encodes all proteins essential for virus functions, while bipartite begomoviruses additionally contain the DNA-B component. Satellite molecules, known as betasatellites, alphasatellites, or deltasatellites, sometimes exist in begomoviruses. The genomic components of begomoviruses are therefore complex and varied. Different genomic components have different gene structures and functions. Classifying the components of begomoviruses is important for studying virus origin and pathogenic mechanisms. METHODS: We propose a model combining the Subsequence Natural Vector (SNV) method with the Support Vector Machine (SVM) algorithm to classify the genomic components of begomoviruses and predict their genes. First, the genome sequence is represented numerically as a vector by the SNV method. Then SVM is applied to the datasets to build the classification model. Finally, recursive feature elimination (RFE) is used to select essential features of the subsequence natural vectors based on feature importance. RESULTS: DNA-A, DNA-B, and different satellite DNAs are selected to build the model. To evaluate our model, we compare it with the homology-based method BLAST and two machine learning algorithms, Random Forest and Naive Bayes. According to the results, our classification model can classify DNA-A, DNA-B, and different satellites with high accuracy. In particular, we can distinguish whether a DNA-A component is from a monopartite or a bipartite begomovirus. Based on the classification results, we can also predict the genes of different genomic components. According to the selected features, we find that the content of the four nucleotides in the second and tenth segments (approximately 150-350 bp and 1,450-1,650 bp) differs most between DNA-A components of monopartite and bipartite begomoviruses, which may be related to the pre-coat protein (AV2) and the transcriptional activator protein (AC2) genes. Our results advance the understanding of the unique structures of the genomic components of begomoviruses.

11.
Infect Genet Evol ; 77: 104080, 2020 01.
Article in English | MEDLINE | ID: mdl-31683009

ABSTRACT

HIV-1 is the most common and pathogenic strain of human immunodeficiency virus and consists of many subtypes. To study the differences among HIV-1 subtypes in infection, diagnosis, and drug design, it is important to identify HIV-1 subtypes from clinical HIV-1 samples. In this work, we propose an effective numeric representation called Subsequence Natural Vector (SNV) to encode HIV-1 sequences. Using this representation, we introduce an improved linear discriminant analysis method to classify HIV-1 viruses correctly. SNV is based on the distribution of nucleotides in HIV-1 viral sequences. It not only counts the nucleotides but also describes their positions and variances. To validate our alignment-free method, 6902 complete genomes and 11,668 pol gene sequences of HIV-1 subtypes were collected from the up-to-date Los Alamos HIV database. SNV outperforms three popular methods, Kameris, Comet, and REGA, with almost 100% sensitivity and specificity and in much less time. Our subtyping algorithm performs particularly well for circulating recombinant forms (CRFs) consisting of a few sequences. Our approach can also separate unique recombinant forms (URFs) from other subtypes with 100% sensitivity and specificity. Moreover, phylogenetic trees based on the SNV representation are constructed using full-length HIV-1 genomes and pol genes, respectively, where viruses from the same subtype are clustered together correctly.


Subjects
Computational Biology/methods, HIV Infections/virology, HIV-1/classification, RNA Sequence Analysis/methods, Algorithms, Genetic Databases, Discriminant Analysis, Molecular Evolution, Genetic Variation, HIV-1/genetics, HIV-1/isolation & purification, Humans, Phylogeny, Viral RNA/genetics
12.
Comput Struct Biotechnol J ; 17: 982-994, 2019.
Article in English | MEDLINE | ID: mdl-31384399

ABSTRACT

Genome comparison is a vital research area of bioinformatics. For large-scale genome comparisons, Multiple Sequence Alignment (MSA) methods have been impractical to use due to their algorithmic complexity. In this study, we propose a novel alignment-free method based on the one-to-one correspondence between a DNA sequence and its complete central moment vector of the cumulative Fourier power and phase spectra. In addition, the covariance between the four nucleotides in the power and phase spectra is included. We use the cumulative Fourier power and phase spectra to define a 28-dimensional vector for each DNA sequence. Euclidean distances between the vectors can measure the dissimilarity between DNA sequences. We perform testing with datasets of different sizes and types, including simulated DNA sequences, exon-intron sequences, and complete genomes. The results show that our method is more accurate and efficient for hierarchical clustering than other alignment-free methods and MSA methods.
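The first step of a Fourier-based representation like the one in the abstract above can be sketched as follows: build a 0/1 indicator sequence per nucleotide, take its discrete Fourier transform, and accumulate the power spectrum. This is only a sketch under assumptions. The plain running sum in `cumulative_power_spectrum`, the direct O(N²) transform, and all function names are choices made for this example; the moment and covariance features of the 28-dimensional vector are not reproduced.

```python
import cmath
from itertools import accumulate
from typing import List


def indicator(seq: str, base: str) -> List[float]:
    """1.0 where `base` occurs in the sequence, 0.0 elsewhere."""
    return [1.0 if ch == base else 0.0 for ch in seq.upper()]


def power_spectrum(signal: List[float]) -> List[float]:
    """Squared magnitude of the discrete Fourier transform (direct O(N^2))."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def cumulative_power_spectrum(signal: List[float]) -> List[float]:
    """Running sum of the power spectrum (assumed definition)."""
    return list(accumulate(power_spectrum(signal)))
```

By Parseval's identity, the last cumulative value equals the sequence length times the number of occurrences of the nucleotide, which gives a quick sanity check on the transform.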

13.
Front Genet ; 10: 234, 2019.
Article in English | MEDLINE | ID: mdl-31024610

ABSTRACT

Classification of DNA sequences is an important issue in bioinformatics, yet most existing methods for phylogenetic analysis, including Multiple Sequence Alignment (MSA), are time-consuming and computationally expensive. Alignment-free methods are popular nowadays, but the manual intervention they require usually decreases accuracy. Also, the interactions among nucleotides are neglected in most methods. Here we propose a new Accumulated Natural Vector (ANV) method which represents each DNA sequence by a point in ℝ¹⁸. By calculating the Accumulated Indicator Functions of nucleotides, we obtain an Accumulated Natural Vector for each sequence. This new Accumulated Natural Vector not only captures the distribution of each nucleotide but also provides the covariance among nucleotides. Thus global comparison of DNA sequences or genomes can be done easily in ℝ¹⁸. Tests of ANV on datasets of different sizes and types demonstrate the accuracy and time efficiency of the proposed method.

14.
Biomed Pharmacother ; 108: 906-913, 2018 Dec.
Article in English | MEDLINE | ID: mdl-30372902

ABSTRACT

Acute lung injury (ALI) and acute respiratory distress syndrome (ARDS) are serious diseases characterized by a severe inflammatory response to lung injury and damage to microvascular permeability, frequently resulting in death. YiQiFuMai (YQFM) lyophilized injection powder is a redeveloped preparation based on the well-known traditional Chinese medicine formula Sheng-Mai-San, which is widely used in clinical practice in China, mainly for the treatment of microcirculatory disturbance-related diseases. However, there is little information about its role in ALI/ARDS. The aim of this study was to determine the protective effect of YQFM on particulate matter (PM)-induced ALI. The mice in the PM-induced group were intratracheally instilled with 50 mg/kg body weight of Standard Reference Material 1648a (SRM1648a). The mice in the YQFM group were given YQFM (three doses: 0.33, 0.67, and 1.34 g/kg) by tail vein injection 30 min after the intratracheal instillation of PM. The results showed that YQFM markedly reduced the lung pathological injury and the lung wet/dry weight ratios induced by PM. Furthermore, we found that YQFM significantly inhibited the PM-induced myeloperoxidase (MPO) activity in lung tissues, decreased the PM-induced inflammatory cytokines including interleukin-1β (IL-1β) and tumor necrosis factor-α (TNF-α), reduced nitric oxide (NO) and total protein in bronchoalveolar lavage fluids (BALF), and effectively attenuated PM-induced increases in lymphocytes in BALF. In addition, YQFM increased mammalian target of rapamycin (mTOR) phosphorylation and dramatically suppressed the PM-stimulated expression of toll-like receptor 4 (TLR4), MyD88, and the autophagy-related proteins LC3Ⅱ and Beclin 1, as well as autophagy. In conclusion, these findings indicate that YQFM had a critical anti-inflammatory effect due to its ability to regulate both the TLR4-MyD88 and mTOR-autophagy pathways, and might be a possible therapeutic agent for PM-induced ALI.


Subjects
Acute Lung Injury/drug therapy, Autophagy/drug effects, Chinese Herbal Drugs/pharmacology, Particulate Matter/pharmacology, Signal Transduction/drug effects, TOR Serine-Threonine Kinases/metabolism, Toll-Like Receptor 4/metabolism, Acute Lung Injury/chemically induced, Acute Lung Injury/metabolism, Animals, Bronchoalveolar Lavage Fluid/chemistry, China, Cytokines/metabolism, Injections/methods, Lung/drug effects, Lung/metabolism, Traditional Chinese Medicine/methods, Mice, Peroxidase/metabolism
15.
DNA Cell Biol ; 36(2): 109-116, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27977308

ABSTRACT

Zika virus (ZIKV) is a mosquito-borne flavivirus. It was first isolated in Uganda in 1947 and has been an emerging threat since 2007. However, because of the inconsistency of alignment methods, the evolution of ZIKV remains poorly understood. In this study, we first use the complete protein and an alignment-free method to build a phylogenetic tree of 87 Zika strains in which Asian, East African, and West African lineages are characterized. We also use the NS5 protein to construct the genetic relationship among 44 Zika strains. For the first time, these strains are divided into two clades: African 1 and African 2. This result suggests that ZIKV originated in Africa and then spread to Asia, the Pacific islands, and throughout the Americas. We also perform phylogenetic analysis of 53 viruses in the genus Flavivirus, to which ZIKV belongs, using complete proteins. Our conclusion is consistent with the classification by hosts and transmission vectors.


Subjects
Disease Vectors, Flavivirus/classification, Zika Virus/classification, Animals, Flavivirus/metabolism, Flavivirus/physiology, Phylogeny, Viral Proteins/metabolism, Zika Virus/metabolism, Zika Virus/physiology
16.
Evol Bioinform Online ; 13: 1176934317746667, 2017.
Article in English | MEDLINE | ID: mdl-29308007

ABSTRACT

We construct a virus database called VirusDB (http://yaulab.math.tsinghua.edu.cn/VirusDB/) and an online inquiry system for people who are interested in viral classification and prediction. The database stores all viral genomes, their corresponding natural vectors, and the classification information of the single/multiple-segmented viral reference sequences downloaded from the National Center for Biotechnology Information. The online inquiry system computes natural vectors and their distances for submitted genomes, provides an online interface for accessing and using the database for viral classification and prediction, and runs back-end processes for automatic and manual updating of database content to synchronize with GenBank. Genome data submitted in FASTA format are processed, and the prediction results, with the 5 closest neighbors and their classifications, are returned by email. Given the one-to-one correspondence between sequence and natural vector, its time efficiency, and its high accuracy, the natural vector is a significant advance over alignment methods, which makes VirusDB a useful database for further research.

17.
DNA Cell Biol ; 34(6): 418-28, 2015 Jun.
Article in English | MEDLINE | ID: mdl-25803489

ABSTRACT

According to the WHO, ebolaviruses had resulted in 8818 human deaths in West Africa as of January 2015. To better understand the evolutionary relationship of the ebolaviruses and infer virulence from that relationship, we applied the alignment-free natural vector method to classify the newest ebolaviruses. The dataset includes three new Guinea viruses as well as 99 viruses from Sierra Leone. For the viruses of the family Filoviridae, both genus label classification and species label classification achieve an accuracy rate of 100%. We represented the relationships among Filoviridae viruses by Unweighted Pair Group Method with Arithmetic Mean (UPGMA) phylogenetic trees and found that the filoviruses separate well into three genera. We performed phylogenetic analysis of the relationship among different species of Ebolavirus using their coding-complete genomes and seven viral protein genes (glycoprotein [GP], nucleoprotein [NP], VP24, VP30, VP35, VP40, and RNA polymerase [L]). The topology of the phylogenetic tree based on the viral protein VP24 is consistent with the variations in virulence of ebolaviruses. The result suggests that VP24 could be a pharmaceutical target for treating or preventing ebolavirus infection.


Subjects
Ebolavirus/classification, Ebolavirus/genetics, Molecular Evolution, Viral Genome, Marburgvirus/classification, Marburgvirus/genetics, Genetic Models, Molecular Typing, Phylogeny, DNA Sequence Analysis
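The UPGMA tree-building step named in the abstract above can be sketched in a few lines. This is a generic textbook version, not the authors' pipeline: it takes a symmetric distance matrix (for example, Euclidean distances between natural vectors), returns the topology as nested tuples without branch lengths, and the function name `upgma` is hypothetical.

```python
from typing import List, Sequence


def upgma(labels: Sequence[str], matrix: List[List[float]]):
    """Build a UPGMA tree topology as nested tuples from a distance matrix."""
    n = len(labels)
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): matrix[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(dist.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del dist[(a, b)]
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            # Size-weighted average keeps every leaf equally weighted.
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

In the toy matrix, A and B are closest and are joined first, and C attaches to that pair last.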
18.
Sci Rep ; 5: 7972, 2015 Jan 22.
Article in English | MEDLINE | ID: mdl-25609314

ABSTRACT

What kinds of amino acid sequences could possibly be protein sequences? From all existing databases that we can find, known proteins are only a small fraction of all possible combinations of amino acids. Beginning with Sanger's first detailed determination of a protein sequence in 1952, previous studies have focused on describing the structure of existing protein sequences in order to construct the protein universe. No one, however, has developed a criterion for determining whether an arbitrary amino acid sequence can be a protein. Here we show that when the collection of arbitrary amino acid sequences is viewed in an appropriate geometric context, the protein sequences cluster together. This leads to a new computational test, described here, that has proved to be remarkably accurate at determining whether an arbitrary amino acid sequence can be a protein. Moreover, if the results of this test indicate that the sequence can be a protein, and it is indeed a protein sequence, then its identity as a protein sequence is uniquely defined. We anticipate that our computational test will be useful for those who are attempting to complete the job of discovering all proteins, or constructing the protein universe.


Subjects
Proteins/chemistry, Protein Sequence Analysis/methods, Amino Acid Sequence, Amino Acids, Protein Databases
19.
PLoS One ; 9(7): e101363, 2014.
Article in English | MEDLINE | ID: mdl-25036549

ABSTRACT

Intron-containing and intronless genes have different biological properties and statistical characteristics. Here we propose a new computational method to distinguish between intron-containing and intronless gene sequences. Seven feature parameters α, β, γ, λ, θ, φ and σ based on detrended fluctuation analysis (DFA) are used, so that a 7-dimensional feature vector can be computed for any given gene sequence to be discriminated. Furthermore, a support vector machine (SVM) classifier with a Gaussian radial basis kernel function is applied to this feature space to classify the genes as intron-containing or intronless. We investigate the performance of the proposed method in comparison with other state-of-the-art algorithms on biological datasets. The experimental results show that our new method significantly improves the accuracy over those existing techniques.


Subjects
Computational Biology/methods, Introns/genetics, Support Vector Machine, Archaea/genetics, Bacteria/genetics, Eukaryota/genetics
20.
PLoS One ; 8(5): e64328, 2013.
Article in English | MEDLINE | ID: mdl-23717598

ABSTRACT

The International Committee on Taxonomy of Viruses authorizes and organizes the taxonomic classification of viruses. Thus far, the detailed classifications for all viruses are neither complete nor free from dispute. For example, the current missing label rates in GenBank are 12.1% for family labels and 30.0% for genus labels. Using the proposed Natural Vector representation, all 2,044 single-segment referenced viral genomes in GenBank can be embedded in ℝ¹². Unlike other approaches, this allows us to determine phylogenetic relations for all viruses at any level (e.g., Baltimore class, family, subfamily, genus, and species) in real time. Additionally, the proposed graphical representation for virus phylogeny provides a visualization of the distribution of viruses in ℝ¹². Unlike the commonly used tree visualization methods, which suffer from uniqueness and existence problems, our representation always exists and is unique. This approach is successfully used to predict and correct viral classification information, as well as to identify viral origins; e.g., a recent public health threat, the West Nile virus, is closer to the Japanese encephalitis antigenic complex based on our visualization. Based on cross-validation results, the accuracy rates of our predictions are as high as 98.2% for Baltimore class labels, 96.6% for family labels, 99.7% for subfamily labels, and 97.2% for genus labels.


Subjects
Viruses/classification, Viral Genes, Phylogeny, Viruses/genetics
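The Natural Vector embedding referred to in the abstract above can be illustrated with a short sketch. It assumes the commonly cited 12-dimensional form: for each nucleotide, its count, its mean position, and a normalized second central moment of its positions. The abstract does not spell out the formula, so the 1-based positions, the normalization by n_k · n, and the function names are assumptions.

```python
import math
from typing import List


def natural_vector(seq: str, alphabet: str = "ACGT") -> List[float]:
    """Return counts, mean positions, and scaled second moments (12 values)."""
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for base in alphabet:
        # 1-based positions of this nucleotide (assumed convention).
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        mu = sum(positions) / n_k if n_k else 0.0
        # Second central moment divided by n_k * n (assumed normalization).
        d2 = sum((p - mu) ** 2 for p in positions) / (n_k * n) if n_k else 0.0
        counts.append(float(n_k))
        means.append(mu)
        moments.append(d2)
    return counts + means + moments


def nv_distance(seq_a: str, seq_b: str) -> float:
    """Euclidean distance between the natural vectors of two sequences."""
    return math.dist(natural_vector(seq_a), natural_vector(seq_b))
```

Each vector is computed in a single pass per nucleotide, so comparing two genomes reduces to one distance between 12-dimensional points, with no alignment step.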