Results 1 - 20 of 92
1.
Acta Math Sin Engl Ser ; 38(10): 1901-1938, 2022.
Article in English | MEDLINE | ID: mdl-36407804

ABSTRACT

With the great advancement of experimental tools, a tremendous amount of biomolecular data has been generated and accumulated in various databases. The high dimensionality, structural complexity, nonlinearity, and entanglement of biomolecular data, ranging from DNA knots, RNA secondary structures, protein folding configurations, chromosomes, DNA origami, and molecular assembly to others at the macromolecular level, pose a severe challenge for their analysis and characterization. In the past few decades, mathematical concepts, models, algorithms, and tools from algebraic topology, combinatorial topology, computational topology, and topological data analysis have demonstrated great power and begun to play an essential role in tackling the biomolecular data challenge. In this work, we introduce biomolecular topology, which concerns the topological problems and models originating from biomolecular systems. More specifically, biomolecular topology encompasses topological structures, properties, and relations that emerge from biomolecular structures, dynamics, interactions, and functions. We discuss the various types of biomolecular topology arising from structures (of proteins, DNAs, and RNAs), protein folding, and protein assembly. A brief discussion of databanks (and databases), theoretical models, and computational algorithms is presented. Further, we systematically review related topological models, including graphs, simplicial complexes, persistent homology, persistent Laplacians, de Rham-Hodge theory, Yau-Hausdorff distance, and topology-based machine learning models.

2.
J Theor Biol ; 530: 110885, 2021 12 07.
Article in English | MEDLINE | ID: mdl-34478743

ABSTRACT

The world faces a great unforeseen challenge in the COVID-19 pandemic caused by the coronavirus SARS-CoV-2. The structure and evolution of the virus genome are front and center for gaining further insights into vaccine development, monitoring of transmission trajectories, and prevention of zoonotic infections by new coronaviruses. Of particular interest are the genomic elements known as Inverse Repeats (IRs), which maintain genome stability, regulate gene expression, and are targets of mutations. However, little research attention has been given to IR content analysis in the SARS-CoV-2 genome. In this study, we propose a geometric analysis method and use it to investigate the distributions of IRs in SARS-CoV-2 and related coronavirus genomes. The method represents each genomic IR sequence pair as a single point and constructs the geometric shape of the genome using the IRs. Thus, the IR shape can be considered the signature of the genome. The genomes of different coronaviruses are then compared using the constructed IR shapes. The results demonstrate that the SARS-CoV-2 genome, specifically, has an abundance of IRs, and that the IRs in coronavirus genomes show an increase during evolutionary events.


Subjects
COVID-19 , SARS-CoV-2 , Genome, Viral/genetics , Genomics , Humans , Pandemics , Phylogeny
3.
Genomics ; 112(2): 1847-1852, 2020 03.
Article in English | MEDLINE | ID: mdl-31704313

ABSTRACT

A novel method is proposed to detect the acceptor and donor splice sites using chaos game representation and artificial neural network. In order to achieve high accuracy, inputs to the neural network, or feature vector, shall reflect the true nature of the DNA segments. Therefore it is important to have one-to-one numerical representation, i.e. a feature vector should be able to represent the original data. Chaos game representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane in a one-to-one manner. Using CGR, a DNA sequence can be mapped to a numerical sequence that reflects the true nature of the original sequence. In this research, we propose to use CGR as feature input to a neural network to detect splice sites on the NN269 dataset. Computational experiments indicate that this approach gives good accuracy while being simpler than other methods in the literature, with only one neural network component. The code and data for our method can be accessed from this link: https://github.com/thoang3/portfolio/tree/SpliceSites_ANN_CGR.
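The iterative mapping described in this abstract can be sketched as follows. This is a minimal illustration of chaos game representation in general, not the authors' code: the corner assignment, the starting point at the centre of the unit square, and the function name `cgr_points` are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed corner layout (one common convention); the paper may use another.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq: str) -> List[Tuple[float, float]]:
    """Return one CGR point per nucleotide of seq.

    Each point is the midpoint between the previous point and the
    corner assigned to the current nucleotide.
    """
    x, y = 0.5, 0.5  # assumed start: centre of the unit square
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]  # raises KeyError on non-ACGT symbols
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points
```

Because each step halves the distance to a corner, the last point encodes the whole prefix of the sequence, which is the sense in which the mapping is one-to-one. Flattening the points gives a numeric feature vector that could be fed to a classifier.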


Subjects
Neural Networks, Computer , RNA Splice Sites , Sequence Analysis, DNA/methods , Humans , Nonlinear Dynamics , Software
4.
Br J Cancer ; 123(1): 114-125, 2020 07.
Article in English | MEDLINE | ID: mdl-32372027

ABSTRACT

BACKGROUND: Nasopharyngeal carcinoma (NPC) is an important cancer in Hong Kong. We aim to utilise liquid biopsies for serial monitoring of disseminated NPC in patients and to compare them with PET-CT imaging in the detection of minimal residual disease. METHOD: Prospective serial monitoring of liquid biopsies was performed for 21 metastatic patients. Circulating tumour cell (CTC) enrichment and characterisation were performed using a size-based microfluidics CTC chip, with enumeration by immunofluorescence staining and target-capture sequencing to determine blood mutation load. PET-CT scans were used to monitor NPC patients throughout their treatment according to EORTC guidelines. RESULTS: The longitudinal molecular analysis of CTCs by enumeration or NGS mutational profiling provides supplementary information to the plasma EBV assay on disease progression for good responders. Strikingly, post-treatment CTC analysis gave positive findings in 75% (6/8) of metastatic NPC patients showing complete response by imaging, thereby demonstrating more sensitive CTC detection of minimal residual disease. Positive baseline and post-treatment CTC findings and longitudinal change of CTCs were significantly associated with poorer progression-free survival by Kaplan-Meier analysis. CONCLUSIONS: We show the potential usefulness of serial analysis of liquid biopsy CTCs in metastatic NPC as a novel, more sensitive biomarker for minimal residual disease when compared with imaging.


Subjects
Biomarkers, Tumor/blood , Nasopharyngeal Carcinoma/blood , Neoplasm, Residual/blood , Neoplastic Cells, Circulating/metabolism , Adolescent , Adult , Aged , Female , Humans , Kaplan-Meier Estimate , Male , Middle Aged , Nasopharyngeal Carcinoma/diagnostic imaging , Nasopharyngeal Carcinoma/genetics , Nasopharyngeal Carcinoma/pathology , Neoplasm Metastasis , Neoplasm, Residual/genetics , Neoplasm, Residual/pathology , Neoplastic Cells, Circulating/pathology , Positron Emission Tomography Computed Tomography , Progression-Free Survival , Young Adult
5.
Genomics ; 111(6): 1298-1305, 2019 12.
Article in English | MEDLINE | ID: mdl-30195069

ABSTRACT

Based on the k-mer model for protein sequences, a novel k-mer natural vector method is proposed to characterize the features of k-mers in a protein sequence, taking into account both the numbers and the distributions of k-mers. It is proved that the relationship between a protein sequence and its k-mer natural vector is one-to-one. Phylogenetic analysis of protein sequences can therefore be easily performed without requiring evolutionary models or human intervention. In addition, there is no established criterion for choosing a suitable k, and k has a great influence on the results as well as on computational complexity. In this paper, a compound k-mer natural vector is utilized to quantify each protein sequence. The results obtained from phylogenetic analysis on three protein datasets demonstrate that our new method can precisely describe the evolutionary relationships of proteins and greatly improve computing efficiency.
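As a rough illustration of describing k-mers by their numbers and distributions, the sketch below records three statistics per k-mer: its count, its mean position, and a normalized second central moment of its positions. The abstract does not give the exact definition, so the normalization, the 1-based positions, and the function name `kmer_natural_vector` are assumptions borrowed from the usual natural-vector construction.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def kmer_natural_vector(seq: str, k: int) -> Dict[str, Tuple[int, float, float]]:
    """Map each k-mer in seq to (count, mean position, normalized 2nd moment).

    Positions are 1-based start indices. The second moment is divided by
    count * (number of k-mer positions), which is an assumed normalization.
    """
    positions: Dict[str, List[int]] = defaultdict(list)
    total = len(seq) - k + 1  # number of k-mer start positions
    for i in range(total):
        positions[seq[i:i + k]].append(i + 1)

    vector = {}
    for kmer, pos in positions.items():
        n = len(pos)
        mu = sum(pos) / n
        d2 = sum((p - mu) ** 2 for p in pos) / (n * total)
        vector[kmer] = (n, mu, d2)
    return vector
```

For example, `kmer_natural_vector("ABAB", 2)` describes `"AB"` (seen at positions 1 and 3) and `"BA"` (seen once at position 2). To compare sequences, the per-k-mer triples would be laid out in a fixed k-mer order, with zeros for absent k-mers.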


Subjects
Phylogeny , Sequence Analysis, Protein/methods , Influenza A virus/classification , Rhinovirus/classification , Viral Proteins/chemistry , beta-Globins/chemistry
6.
Genomics ; 111(6): 1777-1784, 2019 12.
Article in English | MEDLINE | ID: mdl-30529533

ABSTRACT

This study quantitatively validates the principle that the biological properties associated with a given genotype are determined by the distribution of amino acids. In order to visualize this central law of molecular biology, each protein was represented by a point in 250-dimensional space based on its amino acid distribution. Proteins from the same family are found to cluster together, leading to the principle that the convex hull surrounding protein points from the same family does not intersect the convex hulls of other protein families. This principle was verified computationally for all available and reliable protein kinases and human proteins. In addition, we generated 2,328,761 figures to show that the convex hulls of different families were disjoint from each other. The classification performs well, with high and robust accuracy (95.75% and 97.5%), and the reasonable phylogenetic trees obtained further validate our methods.


Subjects
Algorithms , Multigene Family , Phylogeny , Protein Kinases/classification , Protein Kinases/genetics , Humans
7.
Int J Mol Sci ; 21(11), 2020 May 29.
Article in English | MEDLINE | ID: mdl-32485813

ABSTRACT

Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.


Subjects
Phylogeny , Sequence Analysis, DNA/methods , Sequence Homology, Nucleic Acid , Algorithms , Genome, Bacterial , Genome, Viral , Sequence Alignment
8.
Mol Phylogenet Evol ; 141: 106633, 2019 12.
Article in English | MEDLINE | ID: mdl-31563612

ABSTRACT

Using numerical methods for genome comparison has always been of importance in bioinformatics. The Chaos Game Representation (CGR) is an effective genome sequence mapping technology, which converts genome sequences to CGR images. To each CGR image, we associate a vector called an Extended Natural Vector (ENV). The ENV is based on the distribution of intensity values. This mapping produces a one-to-one correspondence between CGR images and their ENVs. We define the distance between two DNA sequences as the distance between their associated ENVs. We cluster and classify several datasets including Influenza A viruses, Bacillus genomes, and Conoidea mitochondrial genomes to build their phylogenetic trees. Results show that our ENV combining CGR method (CGR-ENV) compares favorably in classification accuracy and efficiency against the multiple sequence alignment (MSA) method and other alignment-free methods. The research provides significant insights into the study of phylogeny, evolution, and efficient DNA comparison algorithms for large genomes.


Subjects
Algorithms , Genome , Genomics , Base Sequence , DNA/genetics , Genome, Mitochondrial , Markov Chains , Phylogeny
9.
Proc Natl Acad Sci U S A ; 113(40): 11283-11288, 2016 10 04.
Article in English | MEDLINE | ID: mdl-27647909

ABSTRACT

Nasopharyngeal carcinoma (NPC) is an epithelial malignancy with a unique geographical distribution. The genomic abnormalities leading to NPC pathogenesis remain unclear. In total, 135 NPC tumors were examined to characterize the mutational landscape using whole-exome sequencing and targeted resequencing. An APOBEC cytidine deaminase mutagenesis signature was revealed in the somatic mutations. Notably, multiple loss-of-function mutations were identified in several NF-κB signaling negative regulators: NFKBIA, CYLD, and TNFAIP3. Functional studies confirmed that inhibition of NFKBIA had a significant impact on NF-κB activity and NPC cell growth. The identified loss-of-function mutations in NFKBIA leading to protein truncation contributed to the altered NF-κB activity, which is critical for NPC tumorigenesis. In addition, somatic mutations were found in several cancer-relevant pathways, including cell cycle-phase transition, cell death, EBV infection, and viral carcinogenesis. These data provide an enhanced road map for understanding the molecular basis underlying NPC.


Subjects
Carcinoma/genetics , Exome Sequencing/methods , Loss of Function Mutation/genetics , NF-kappa B/metabolism , Nasopharyngeal Neoplasms/genetics , Signal Transduction/genetics , Cell Line, Tumor , Gene Knockdown Techniques , Humans , Mutation Rate , NF-KappaB Inhibitor alpha/metabolism , Nasopharyngeal Carcinoma
10.
BMC Evol Biol ; 18(1): 200, 2018 12 27.
Article in English | MEDLINE | ID: mdl-30587116

ABSTRACT

BACKGROUND: In recent years, DNA barcoding has become an important tool for biologists to identify species and understand their natural biodiversity. The complexity of barcode data makes it difficult to analyze quickly and effectively. Manual classification of this data cannot keep up with the rate of increase of available data. RESULTS: In this study, we propose a new method for DNA barcode classification based on the distribution of nucleotides within the sequence. By adding the covariance of nucleotides to the original natural vector, this augmented 18-dimensional natural vector makes good use of the available information in the DNA sequence. The accurate classification results we obtained demonstrate that this new 18-dimensional natural vector method, together with the random forest classifier algorithm, can serve as a computationally efficient identification tool for DNA barcodes. We performed phylogenetic analysis on the genus Megacollybia to validate our method. We also studied how effective our method was in determining the genetic distance within and between species in our barcoding dataset. CONCLUSIONS: The classification performs well on the fungi barcode dataset with high and robust accuracy. The reasonable phylogenetic trees we obtained further validate our methods. This method is alignment-free and does not depend on any model assumption, and it will become a powerful tool for classification and evolutionary analysis.


Subjects
DNA Barcoding, Taxonomic/methods , Fungi/classification , Fungi/genetics , Biodiversity , Phylogeny , Sequence Analysis, DNA
11.
J Theor Biol ; 456: 34-40, 2018 11 07.
Article in English | MEDLINE | ID: mdl-30059661

ABSTRACT

Comparing DNA and protein sequence groups plays an important role in biological evolutionary relationship research. Despite many methods available for sequence comparison, only a few can be used for group comparison. In this study, we propose a novel approach using convex hulls. We use statistical information contained within the sequences to represent each sequence as a point in high dimensional space. We find that the points belonging to one biological group are located in a different region of space than points belonging to other biological groups. To be more precise, the convex hull of the points from one group is disjoint from the convex hulls of points from other groups. This finding allows us to do phylogenetic analysis for groups in an efficient way. Five different theorems are presented for checking whether two convex hulls intersect or are disjoint. Test results for datasets related to HRV, HPV, Ebolavirus, PKC and protein phosphatase domains demonstrate that our method performs well and provides a new tool for studying group phylogeny. More significantly, the convex analysis presents a new way to search for sequences belonging to a biological group by examining points within the group's convex hull.


Subjects
Biological Evolution , Rhinovirus/genetics , Sequence Analysis, DNA/methods , Sequence Analysis, Protein/methods , Algorithms , Ebolavirus/genetics , Genome, Viral/genetics , Humans , Numerical Analysis, Computer-Assisted , Papillomaviridae/genetics , Phylogeny , Protein Kinase C/genetics
12.
J Theor Biol ; 427: 41-52, 2017 08 01.
Article in English | MEDLINE | ID: mdl-28587743

ABSTRACT

Classification of proteins is a crucial topic in biology. The number of protein sequences stored in databases has increased sharply in the past decade. Traditionally, comparison of protein sequences is carried out through multiple sequence alignment methods. However, these methods may be unsuitable for clustering protein sequences when gene rearrangements occur, as in viral genomes. The computation is also very time-consuming for large datasets with long genomes. In this paper, based on three important biochemical properties of amino acids (the hydropathy index, polar requirement, and chemical composition of the side chain), we propose a 24-dimensional feature vector describing the composition of amino acids in protein sequences. Our method not only utilizes the chemical properties of amino acids but also takes into account their numbers and positions. The results on beta-globin, mammal, and three virus datasets show that this new tool is fast and accurate for classifying proteins and inferring the phylogeny of organisms.


Subjects
Proteins/chemistry , Cluster Analysis , Phylogeny
13.
J Intensive Care Med ; 32(7): 444-450, 2017 Aug.
Article in English | MEDLINE | ID: mdl-27146924

ABSTRACT

PURPOSE: To report the characteristics and outcomes of patients with sepsis in the intensive care unit (ICU) with end-stage renal disease (ESRD) and acute kidney injury (AKI) compared to patients with nonkidney injury (non-KI). METHODS: Retrospective study of all patients with sepsis admitted to the ICU of a university hospital within a 12-month time period. Data were obtained from the University Health Consortium database and a chart review of the electronic medical records. RESULTS: We identified 39 cases of ESRD, 106 cases of AKI, and 103 cases of non-KI. Intensive care unit mortality was 15.4% for ESRD, 30.2% for AKI, and 13.6% for non-KI (P < .01). Hospital mortality was 20.5% for ESRD, 32.1% for AKI, and 13.6% for non-KI (P < .01). Early AKI and late AKI had an ICU mortality of 24.4% versus 50% (P < .01), hospital mortality of 26.8% versus 50% (P = .03), ICU length of stay (LOS) of 3 and 6 days (P = .04), and hospital LOS of 7 and 12.5 days (P < .01), respectively. CONCLUSION: Patients with sepsis having AKI have a higher mortality rate than those with ESRD and non-KI. Hospital and ICU mortality rates for patients with ESRD were similar to non-KI patients. Late AKI compared to early AKI had a higher mortality and longer LOS.


Subjects
Acute Kidney Injury/complications , Intensive Care Units/statistics & numerical data , Kidney Failure, Chronic/complications , Patient Outcome Assessment , Sepsis/mortality , Adult , Aged , Databases, Factual , Female , Hospital Mortality , Hospitalization/statistics & numerical data , Humans , Male , Middle Aged , Retrospective Studies , Sepsis/etiology
14.
Genomics ; 108(3-4): 134-142, 2016 10.
Article in English | MEDLINE | ID: mdl-27538895

ABSTRACT

Numerical encoding plays an important role in DNA sequence analysis via computational methods, in which numerical values are associated with corresponding symbolic characters. After numerical representation, digital signal processing methods can be exploited to analyze DNA sequences. To reflect the biological properties of the original sequence, it is vital that the representation is one-to-one. Chaos Game Representation (CGR) is an iterative mapping technique that assigns each nucleotide in a DNA sequence to a respective position on the plane, allowing the DNA sequence to be depicted in the form of an image. Using CGR, a biological sequence can be transformed one-to-one to a numerical sequence that preserves the main features of the original sequence. In this research, we propose to encode DNA sequences by considering 2D CGR coordinates as complex numbers, and apply digital signal processing methods to analyze their evolutionary relationship. Computational experiments indicate that this approach gives comparable results to the state-of-the-art multiple sequence alignment method, Clustal Omega, and is significantly faster. The MATLAB code for our method can be accessed from: www.mathworks.com/matlabcentral/fileexchange/57152.


Subjects
Sequence Analysis, DNA/methods , Software , Algorithms , Sequence Alignment/methods
15.
Mol Phylogenet Evol ; 99: 53-62, 2016 06.
Article in English | MEDLINE | ID: mdl-26988414

ABSTRACT

Due to vast sequence divergence among different viral groups, sequence alignment is not directly applicable to genome-wide comparative analysis of viruses. More and more attention has been paid to alignment-free methods for whole genome comparison and phylogenetic tree reconstruction. Among alignment-free methods, the recently proposed "Natural Vector (NV) representation" has successfully been used to study the phylogeny of multi-segmented viruses based on a 12-dimensional genome space derived from the nucleotide sequence structure. However, the preference of proteomes over genomes for the determination of viral phylogeny has not been deeply investigated. As the translated products of genes, proteins directly form the shape of viral structure and are vital for all metabolic pathways. In this study, using the NV representation of a protein sequence along with the Hausdorff distance, which is suitable for comparing point sets, we construct a 60-dimensional protein space to analyze the evolutionary relationships of 4021 viruses by whole proteomes in the current NCBI Reference Sequence Database (RefSeq). We also take advantage of the previously developed natural graphical representation to recover viral phylogeny. Our results demonstrate that the proposed method is efficient and accurate for classifying viruses. The accuracy rates of our predictions, for example for Baltimore II viruses, are as high as 95.9% for family labels, 95.7% for subfamily labels and 96.5% for genus labels. Finally, we discover that proteomes lead to better viral classification when reliable protein sequences are abundant. In other cases, the accuracy rates using proteomes are still comparable to those of genomes.


Subjects
Viral Proteins/chemistry , Viruses/classification , Amino Acid Sequence , Databases, Protein , Genome, Viral , Phylogeny , Proteome/chemistry , Proteome/genetics , Viruses/genetics
16.
Mol Phylogenet Evol ; 98: 271-9, 2016 May.
Article in English | MEDLINE | ID: mdl-26926946

ABSTRACT

The free-living SAR11 clade is a globally abundant group of oceanic Alphaproteobacteria, with small genome sizes and rich genomic A+T content. However, the taxonomy of SAR11 has become controversial recently. Some researchers argue that the position of SAR11 is a sister group to Rickettsiales. Other researchers advocate that SAR11 is located within free-living lineages of Alphaproteobacteria. Here, we use the natural vector representation method to identify the evolutionary origin of the SAR11 clade. This alignment-free method does not depend on any model assumptions. With this approach, the correspondence between proteome sequences and their natural vectors is one-to-one. After fixing a set of proteins, each bacterium is represented by a set of vectors. The Hausdorff distance is then used to compute the dissimilarity distance between two bacteria. The phylogenetic tree can be reconstructed based on these distances. Using our method, we systematically analyze four data sets of alphaproteobacterial proteomes in order to reconstruct the phylogeny of Alphaproteobacteria. From this we can see that the phylogenetic position of the SAR11 group is within a group of other free-living lineages of Alphaproteobacteria.


Subjects
Alphaproteobacteria/classification , Aquatic Organisms/classification , Phylogeny , Alphaproteobacteria/genetics , Alphaproteobacteria/metabolism , Aquatic Organisms/genetics , Aquatic Organisms/metabolism , Bacterial Proteins/metabolism , Proteome/metabolism
17.
Cancer ; 121(8): 1328-38, 2015 Apr 15.
Article in English | MEDLINE | ID: mdl-25529384

ABSTRACT

BACKGROUND: A current recommendation for locoregionally advanced nasopharyngeal carcinoma (NPC) is conventional fractionated radiotherapy with concurrent cisplatin plus adjuvant cisplatin and fluorouracil (PF). In this randomized trial, the authors evaluated the potential therapeutic benefit from changing to an induction-concurrent chemotherapy sequence, replacing fluorouracil with oral capecitabine, and/or using accelerated rather than conventional radiotherapy fractionation. METHODS: Patients with stage III through IVB, nonkeratinizing NPC were randomly allocated to 1 of 6 treatment arms. The protocol was amended in 2009 to permit confining randomization to the conventional fractionation arms. The primary endpoint was progression-free survival. Secondary endpoints included overall survival and safety. RESULTS: In total, 803 patients were accrued, and 706 patients were randomly allocated to all 6 treatment arms. Comparisons of induction PF versus adjuvant PF did not indicate a significant improvement. Unadjusted comparisons of induction cisplatin and capecitabine (PX) versus adjuvant PF indicated a favorable trend in progression-free survival for the conventional fractionation arm (P = .045); analyses that were adjusted for other significant factors and fractionation reflected a significant reduction in the hazards of disease progression (hazard ratio [HR], 0.54; 95% confidence interval [CI], 0.36-0.80) and death (HR, 0.42; 95% CI, 0.25-0.70). Unadjusted comparisons of induction sequences versus adjuvant sequences did not reach statistical significance, but adjusted comparisons indicated favorable improvements by induction sequence. Comparisons of induction PX versus induction PF revealed fewer toxicities (neutropenia and electrolyte disturbance); unadjusted comparisons of efficacy were statistically insignificant, but adjusted analyses indicated that induction PX had a lower hazard of death (HR, 0.57; 95% CI, 0.34-0.97). Changing the fractionation from conventional to accelerated did not achieve any benefit but incurred higher toxicities (acute mucositis and dehydration). CONCLUSIONS: Preliminary results indicate that the benefit of changing to an induction-concurrent sequence remains uncertain; replacing fluorouracil with oral capecitabine warrants further validation in view of its convenience, favorable toxicity profile, and favorable trends in efficacy; and accelerated fractionation is not recommended for patients with locoregionally advanced NPC who receive chemoradiotherapy.


Subjects
Chemoradiotherapy, Adjuvant/methods , Deoxycytidine/analogs & derivatives , Fluorouracil/analogs & derivatives , Fluorouracil/administration & dosage , Nasopharyngeal Neoplasms/therapy , Neoplasm Recurrence, Local/therapy , Adult , Aged , Capecitabine , Carcinoma , Deoxycytidine/administration & dosage , Deoxycytidine/adverse effects , Dose Fractionation, Radiation , Fluorouracil/adverse effects , Humans , Induction Chemotherapy , Middle Aged , Nasopharyngeal Carcinoma , Nasopharyngeal Neoplasms/pathology , Neoplasm Recurrence, Local/pathology , Survival Analysis , Treatment Outcome , Young Adult
18.
J Theor Biol ; 382: 99-110, 2015 Oct 07.
Article in English | MEDLINE | ID: mdl-26151589

ABSTRACT

DNA sequence similarity comparison is one of the major steps in computational phylogenetic studies. The sequence comparison of closely related DNA sequences and genomes is usually performed by multiple sequence alignments (MSA). While the MSA method is accurate for some types of sequences, it may produce incorrect results when DNA sequences have undergone rearrangements, as in many bacterial and viral genomes. It is also limited by its computational complexity for comparing large volumes of data. Previously, we proposed an alignment-free method that exploits the full information contents of DNA sequences by Discrete Fourier Transform (DFT), but still with some limitations. Here, we present a significantly improved method for the similarity comparison of DNA sequences by DFT. In this method, we map DNA sequences into 2-dimensional (2D) numerical sequences and then apply DFT to transform the 2D numerical sequences into the frequency domain. In the 2D mapping, the nucleotide composition of a DNA sequence is a determinant factor, and the 2D mapping reduces the nucleotide composition bias in the distance measure, thus improving the similarity measure of DNA sequences. To compare the DFT power spectra of DNA sequences with different lengths, we propose an improved even scaling algorithm to extend shorter DFT power spectra to the longest length of the underlying sequences. After the DFT power spectra are evenly scaled, the spectra have the same dimensionality in the Fourier frequency space, and the Euclidean distances between the full Fourier power spectra of the DNA sequences are then used as the dissimilarity metrics. The improved DFT method, with computational performance increased by the 2D numerical representation, is applicable to DNA sequences of any length range. We assess the accuracy of the improved DFT similarity measure in hierarchical clustering of different DNA sequences including simulated and real datasets. The method yields accurate and reliable phylogenetic trees and demonstrates that the improved DFT dissimilarity measure is an efficient and effective similarity measure of DNA sequences. Due to its high efficiency and accuracy, the proposed DFT similarity measure is successfully applied to phylogenetic analysis of individual genes and large whole bacterial genomes.


Subjects
Fourier Analysis , Genome , Models, Genetic , Phylogeny , Algorithms , Animals , Base Sequence , Cluster Analysis , Computer Simulation , Genome, Mitochondrial , Humans , Mammals/genetics , Mutation/genetics , NADH Dehydrogenase/genetics , Nucleotides/genetics
19.
J Theor Biol ; 372: 135-45, 2015 May 07.
Article in English | MEDLINE | ID: mdl-25747773

ABSTRACT

A novel clustering method is proposed to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and Discrete Fourier Transform is applied on these four sequences to attain respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but also is remarkably faster than other multiple sequence alignment and alignment-free methods.
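The first two steps named in this abstract, binary indicator sequences and their DFT power spectra, might look like the sketch below. It uses a direct O(n²) DFT from the standard library for clarity rather than an FFT, and it stops before the moment construction, whose exact form the abstract does not state. The function names are hypothetical.

```python
import cmath
from typing import Dict, List


def indicator_sequences(seq: str) -> Dict[str, List[int]]:
    """Build one 0/1 sequence per nucleotide: 1 where that base occurs."""
    s = seq.upper()
    return {base: [1 if c == base else 0 for c in s] for base in "ACGT"}


def power_spectrum(signal: List[int]) -> List[float]:
    """Return |X[k]|^2 for k = 0..n-1, using a direct (slow) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum
```

Applying `power_spectrum` to each of the four indicator sequences gives four spectra per DNA sequence. Summary statistics of those spectra, such as moments, have a fixed size whatever the sequence length, which is what lets sequences of different lengths be compared.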


Subjects
DNA/genetics , Sequence Analysis, DNA/methods , Algorithms , Animals , Bacteria/genetics , Cluster Analysis , Computational Biology , Coronavirus/genetics , DNA, Mitochondrial/genetics , Evolution, Molecular , Fourier Analysis , Genome , Genome, Bacterial , Humans , Image Processing, Computer-Assisted , Influenza A virus/genetics , Models, Genetic , Phylogeny , Rhinovirus/genetics , Sequence Alignment/methods
20.
Mol Phylogenet Evol ; 81: 29-36, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25172357

ABSTRACT

We have recently developed a computational approach in a vector space for genome-based virus classification. This approach, called the "Natural Vector (NV) representation", which is an alignment-free method, allows us to classify single-segmented viruses with high speed and accuracy. For multiple-segmented viruses, typically phylogenetic trees of each segment are reconstructed for discovering viral phylogeny. Consensus tree methods may be used to combine the phylogenetic trees based on different segments. However, consensus tree methods were not developed for instances where the viruses have different numbers of segments or where their segments do not match well. We propose a novel approach for comparing multiple-segmented viruses globally, even in cases where viruses contain different numbers of segments. Using our method, each virus is represented by a set of vectors in R^12. The Hausdorff distance is then used to compare different sets of vectors. Phylogenetic trees can be reconstructed based on this distance. The proposed method is used for predicting classification labels of viruses with n-segments (n ⩾ 1). The correctness rates of our predictions based on cross-validation are as high as 96.5%, 95.4%, 99.7%, and 95.6% for Baltimore class, family, subfamily, and genus, respectively, which are comparable to the rates for single-segmented viruses only. Our method is not affected by the number or order of segments. We also demonstrate that the natural graphical representation based on the Hausdorff distance is more reasonable than the consensus tree for a recent public health threat, the influenza A (H7N9) viruses.
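The Hausdorff distance between two sets of vectors, as used here to compare viruses with different numbers of segments, can be sketched as below. The choice of the Euclidean metric between individual vectors and the function names are assumptions for illustration; the abstract does not state which underlying metric is used.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def directed_hausdorff(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Largest distance from a point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

The two sets may have different sizes, and the result does not depend on the order in which the vectors are listed, which matches the abstract's remark that the method is unaffected by the number or order of segments. A matrix of pairwise `hausdorff` values could then be passed to any distance-based tree-building routine.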


Subjects
Genome, Viral , Influenza A Virus, H7N9 Subtype/classification , Phylogeny , Sequence Analysis, DNA/methods , Genomics/methods , Influenza A Virus, H7N9 Subtype/genetics , Viruses/classification , Viruses/genetics