1.
BMC Bioinformatics ; 25(1): 182, 2024 May 09.
Article in English | MEDLINE | ID: mdl-38724920

ABSTRACT

BACKGROUND: The prediction of drug sensitivity plays a crucial role in improving the therapeutic effect of drugs. However, testing the effectiveness of drugs is challenging due to the complex mechanism of drug reactions and the lack of interpretability in most machine learning and deep learning methods. Therefore, it is imperative to establish an interpretable model that receives various cell line and drug feature data to learn drug response mechanisms and achieve stable predictions between available datasets. RESULTS: This study proposes a new and interpretable deep learning model, DrugGene, which integrates gene expression, gene mutation, gene copy number variation of cancer cells, and chemical characteristics of anticancer drugs to predict their sensitivity. This model comprises two different branches of neural networks, where the first involves a hierarchical structure of biological subsystems that uses the biological processes of human cells to form a visual neural network (VNN) and an interpretable deep neural network for human cancer cells. DrugGene receives genotype input from the cell line and detects changes in the subsystem states. We also employ a traditional artificial neural network (ANN) to capture the chemical structural features of drugs. DrugGene generates final drug response predictions by combining VNN and ANN and integrating their outputs into a fully connected layer. The experimental results using drug sensitivity data extracted from the Cancer Drug Sensitivity Genome Database and the Cancer Treatment Response Portal v2 reveal that the proposed model is better than existing prediction methods. Therefore, our model achieves higher accuracy, learns the reaction mechanisms between anticancer drugs and cell lines from various features, and interprets the model's predicted results. 
CONCLUSIONS: Our method utilizes biological pathways to construct neural networks, which can use genotypes to monitor changes in the state of network subsystems, thereby interpreting the prediction results in the model and achieving satisfactory prediction accuracy. This will help explore new directions in cancer treatment. More available code resources can be downloaded for free from GitHub ( https://github.com/pangweixiong/DrugGene ).


Subject(s)
Antineoplastic Agents , Deep Learning , Neural Networks, Computer , Humans , Antineoplastic Agents/pharmacology , Neoplasms/drug therapy , Neoplasms/genetics , Cell Line, Tumor , DNA Copy Number Variations , Computational Biology/methods
2.
Sheng Wu Yi Xue Gong Cheng Xue Za Zhi ; 40(3): 544-551, 2023 Jun 25.
Article in Chinese | MEDLINE | ID: mdl-37380395

ABSTRACT

The synergistic effect of drug combinations can overcome acquired resistance to single-drug therapy and has great potential for the treatment of complex diseases such as cancer. In this study, to explore how interactions between different drug molecules influence the effect of anticancer drugs, we proposed a Transformer-based deep learning prediction model, SMILESynergy. First, drug text data in the simplified molecular input line entry system (SMILES) format were used to represent the drug molecules, and drug molecule isomers were generated through SMILES Enumeration for data augmentation. Then, the attention mechanism in the Transformer was used to encode and decode the augmented drug molecules, and finally, a multi-layer perceptron (MLP) was connected to obtain the synergy value of the drugs. Experimental results showed that our model had a mean squared error of 51.34 in regression analysis, an accuracy of 0.97 in classification analysis, and better predictive performance than the DeepSynergy and MulinputSynergy models. SMILESynergy offers improved predictive performance to assist researchers in rapidly screening optimal drug combinations to improve cancer treatment outcomes.


Subject(s)
Antineoplastic Agents , Electric Power Supplies , Neural Networks, Computer , Antineoplastic Agents/pharmacology
3.
BMC Bioinformatics ; 23(1): 355, 2022 Aug 24.
Article in English | MEDLINE | ID: mdl-36002797

ABSTRACT

BACKGROUND: Deciphering proportions of constitutional cell types in tumor tissues is a crucial step for the analysis of tumor heterogeneity and the prediction of response to immunotherapy. In the process of measuring cell population proportions, traditional experimental methods have been greatly hampered by the cost and extensive dropout events. At present, the public availability of large amounts of DNA methylation data makes it possible to use computational methods to predict proportions. RESULTS: In this paper, we proposed PRMeth, a method to deconvolve tumor mixtures using partially available DNA methylation data. By adopting an iteratively optimized non-negative matrix factorization framework, PRMeth took DNA methylation profiles of a portion of the cell types in the tissue mixtures (including blood and solid tumors) as input to estimate the proportions of all cell types as well as the methylation profiles of unknown cell types simultaneously. We compared PRMeth with five different methods through three benchmark datasets and the results show that PRMeth could infer the proportions of all cell types and recover the methylation profiles of unknown cell types effectively. Then, applying PRMeth to four types of tumors from The Cancer Genome Atlas (TCGA) database, we found that the immune cell proportions estimated by PRMeth were largely consistent with previous studies and met biological significance. CONCLUSIONS: Our method can circumvent the difficulty of obtaining complete DNA methylation reference data and obtain satisfactory deconvolution accuracy, which will be conducive to exploring the new directions of cancer immunotherapy. PRMeth is implemented in R and is freely available from GitHub ( https://github.com/hedingqin/PRMeth ).


Subject(s)
DNA Methylation , Neoplasms , Algorithms , Computational Biology/methods , Humans , Neoplasms/genetics
4.
Sheng Wu Gong Cheng Xue Bao ; 37(4): 1346-1359, 2021 Apr 25.
Article in Chinese | MEDLINE | ID: mdl-33973447

ABSTRACT

Different cell lines have different perturbation signals in response to specific compounds, and it is important to predict cell viability based on these perturbation signals and to uncover the drug sensitivity hidden underneath the phenotype. We developed an SAE-XGBoost cell viability prediction algorithm based on the LINCS-L1000 perturbation signal. After matching and screening three major datasets (LINCS-L1000, CTRP and Achilles), a stacked autoencoder deep neural network was used to extract the gene information. This information was combined with the RW-XGBoost algorithm to predict cell viability under drug induction, and then to complete drug sensitivity inference on the NCI60 and CCLE datasets. The model achieved good results compared to other methods, with a Pearson correlation coefficient of 0.85. It was further validated on an independent dataset, where it reached a Pearson correlation coefficient of 0.68. The results indicate that the proposed method can help discover novel and effective anti-cancer drugs for precision medicine.


Subject(s)
Antineoplastic Agents , Pharmaceutical Preparations , Algorithms , Antineoplastic Agents/pharmacology , Cell Survival
5.
BMC Bioinformatics ; 22(1): 13, 2021 Jan 06.
Article in English | MEDLINE | ID: mdl-33407085

ABSTRACT

BACKGROUND: Predicting the drug response of cancers from cellular perturbation signatures under the action of specific compounds is very important in personalized medicine. In testing drug responses in cancer, traditional experimental methods have been greatly hampered by cost and sample size. At present, the public availability of large amounts of gene expression data makes it a challenging task to use machine learning methods to predict drug sensitivity. RESULTS: In this study, we introduced the WRFEN-XGBoost cell viability prediction algorithm based on LINCS-L1000 cell signatures. We integrated the LINCS-L1000, CTRP and Achilles datasets and adopted a weighted fusion algorithm based on random forest and elastic net for key gene selection. Then the FEBPSO algorithm was introduced into the XGBoost learning algorithm to predict the cell viability induced by the drugs. The proposed method was compared with some recent methods, and our model achieved good results, with a Pearson correlation of 0.83. We also completed drug sensitivity validation on the NCI60 and CCLE datasets, which further demonstrated the effectiveness of our method. CONCLUSIONS: The results showed that our method is conducive to the elucidation of disease mechanisms and the exploration of new therapies, which could promote the progress of clinical medicine.


Subject(s)
Algorithms , Antineoplastic Agents/pharmacology , Cell Survival/drug effects , Computational Biology/methods , Cell Line, Tumor , Databases, Factual , Humans
6.
PLoS Comput Biol ; 16(11): e1008452, 2020 11.
Article in English | MEDLINE | ID: mdl-33253170

ABSTRACT

Deconvolution of heterogeneous bulk tumor samples into distinct cellular populations is an important yet challenging problem, particularly when only partial references are available. A common approach to dealing with this problem is to deconvolve the mixed signals using available references and leverage the remaining signal as a new cell component. However, as indicated in our simulation, such an approach tends to over-estimate the proportions of known cell types and fails to detect novel cell types. Here, we propose PREDE, a partial reference-based deconvolution method using an iterative non-negative matrix factorization algorithm. Our method is verified to be effective in estimating cell proportions and expression profiles of unknown cell types based on simulated datasets at a variety of parameter settings. Applying our method to TCGA tumor samples, we found that proportions of pure cancer cells better indicate different subtypes of tumor samples. We also detected several cell types for each cancer type whose proportions successfully predicted patient survival. Our method makes a significant contribution to deconvolution of heterogeneous tumor samples and could be widely applied to varieties of high throughput bulk data. PREDE is implemented in R and is freely available from GitHub (https://xiaoqizheng.github.io/PREDE).


Subject(s)
Neoplasms/pathology , Algorithms , Animals , Cell Line, Tumor , Computational Biology/methods , Gene Expression Profiling/methods , Humans , Neoplasms/classification , Neoplasms/genetics , Rats , Reproducibility of Results
7.
BMC Med Inform Decis Mak ; 20(Suppl 8): 224, 2020 09 22.
Article in English | MEDLINE | ID: mdl-32962705

ABSTRACT

BACKGROUND: Prediction of drug response based on multi-omics data is a crucial task in the research of personalized cancer therapy. RESULTS: We proposed an iterative sure independent ranking and screening (ISIRS) scheme to select drug response-associated features and applied it to the Cancer Cell Line Encyclopedia (CCLE) dataset. For each drug in CCLE, we incorporated multi-omics data including copy number alterations, mutation and gene expression and selected up to 50 features using ISIRS. Then a linear regression model based on the selected features was exploited to predict the drug response. Cross validation test shows that our prediction accuracies are higher than existing methods for most drugs. CONCLUSIONS: Our study indicates that the features selected by the marginal utility measure, which measures the conditional probability of drug responses given the feature, are helpful for drug response prediction.


Subject(s)
Antineoplastic Agents/pharmacology , Cell Line, Tumor/drug effects , Pharmaceutical Preparations , Biomarkers, Tumor , Computational Biology/methods , Humans , Neoplasms/drug therapy
8.
Sheng Wu Yi Xue Gong Cheng Xue Za Zhi ; 37(4): 676-682, 2020 Aug 25.
Article in Chinese | MEDLINE | ID: mdl-32840085

ABSTRACT

Synergistic effects of drug combinations are very important in improving drug efficacy or reducing drug toxicity. However, due to the complex mechanism of action between drugs, it is expensive to screen new drug combinations through trials. It is well known that virtual screening with computational models can effectively reduce the test cost. Recently, foreign scholars successfully predicted the synergistic value of new drug combinations on cancer cell lines using the deep learning model DeepSynergy. However, DeepSynergy is a two-stage method and uses only one kind of feature as input. In this study, we proposed a new end-to-end deep learning model, MulinputSynergy, which predicted the synergistic value of drug combinations by integrating gene expression, gene mutation, and gene copy number characteristics of cancer cells with the chemical characteristics of anticancer drugs. To address the high dimensionality of the features, we used a convolutional neural network to reduce the dimension of the gene features. Experimental results showed that the proposed model was superior to the DeepSynergy deep learning model, with the mean squared error decreasing from 197 to 176, the mean absolute error decreasing from 9.48 to 8.77, and the coefficient of determination increasing from 0.53 to 0.58. This model could learn the potential relationship between anticancer drugs and cell lines from a variety of characteristics and locate effective drug combinations quickly and accurately.


Subject(s)
Neural Networks, Computer , Antineoplastic Agents , Computational Biology , Drug Combinations , Humans , Neoplasms
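The three regression metrics reported in the abstract above (mean squared error, mean absolute error, coefficient of determination) can be computed as in the following minimal sketch. The function name `regression_metrics` and the toy values in the usage note are illustrative assumptions; the abstract does not state which exact formula for the coefficient of determination the authors used, so the standard 1 - SS_res/SS_tot definition is assumed here.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE and R^2 for two equal-length lists of numbers.

    Hypothetical helper for illustration; R^2 uses the standard
    1 - SS_res / SS_tot definition (an assumption, not from the abstract).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "mae": mae, "r2": r2}
```

For example, `regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])` has a single error of size 2, giving an MSE of 1.0 and an MAE of 0.5.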
9.
Int J Biol Sci ; 15(10): 2119-2127, 2019.
Article in English | MEDLINE | ID: mdl-31592084

ABSTRACT

With the release of the draft genome of the grass carp, research on the grass carp at the genetic level and on the molecular mechanisms of economically valuable physiological behaviors has gained great attention. In this paper, we integrated a large number of genomic, genetic and other data resources and established a web-based grass carp genomic visualization database (GCGVD). To view these data more effectively, we visualized grass carp and zebrafish gene collinearity and the genetic linkage map using the Scalable Vector Graphics (SVG) format in the browser, and genomic annotations with JBrowse. Furthermore, we carried out a preliminary study of whole-genome alternative splicing (AS) in the grass carp. The RNA-seq reads of 15 samples were aligned to the reference genome of the grass carp with the Bowtie2 software. RNA-seq reads of each sample and read density maps were also displayed in JBrowse. Additionally, we designed a universal grass carp genome annotation data model to improve retrieval speed and scalability. Compared with the previously published database GCGD, we added the visualization of further genomic annotations, conserved domains and RNA-seq reads aligned to the reference genome. GCGVD can be accessed at http://122.112.216.104.


Subject(s)
Carps/genetics , Genome/genetics , Alternative Splicing/genetics , Animals , RNA-Seq , Sequence Analysis, DNA , Software
10.
PLoS One ; 13(10): e0205155, 2018.
Article in English | MEDLINE | ID: mdl-30289891

ABSTRACT

Drug response prediction is a critical step for personalized treatment of cancer patients and ultimately leads to precision medicine. A lot of machine-learning based methods have been proposed to predict drug response from different types of genomic data. However, currently available methods could only give a "point" prediction of drug response value but fail to provide the reliability and distribution of the prediction, which are of equal interest in clinical practice. In this paper, we proposed a method based on quantile regression forest and applied it to the CCLE dataset. Through the out-of-bag validation, our method achieved much higher prediction accuracy of drug response than other available tools. The assessment of prediction reliability by prediction intervals and its significance in personalized medicine were illustrated by several examples. Functional analysis of selected drug response associated genes showed that the proposed method achieves more biologically plausible results.


Subject(s)
Algorithms , Drug Therapy/methods , Precision Medicine/methods , Cell Line, Tumor , Drug Resistance , Humans , Models, Biological , Neoplasms/drug therapy , Neoplasms/genetics , Neoplasms/metabolism , Regression Analysis , Reproducibility of Results
11.
Genes Dis ; 5(1): 43-45, 2018 Mar.
Article in English | MEDLINE | ID: mdl-30258934

ABSTRACT

The proportion of cancer cells in a tumor sample, termed tumor purity, is an intrinsic property of tumor samples and has potentially great influence on a variety of analyses, including differential methylation, subclonal deconvolution and subtype clustering. InfiniumPurify is an integrated R package for estimating and accounting for tumor purity based on DNA methylation Infinium 450k array data. InfiniumPurify has three main functions, getPurity, InfiniumDMC and InfiniumClust, which perform tumor purity inference, differential methylation analysis and tumor sample clustering, respectively, with the latter two accounting for estimated or user-provided tumor purities. The InfiniumPurify package provides a comprehensive analysis of tumor purity in cancer methylation research.

12.
Oncotarget ; 9(1): 1063-1074, 2018 Jan 02.
Article in English | MEDLINE | ID: mdl-29416677

ABSTRACT

Aging is a major risk factor for age-related diseases such as certain cancers. In this study, we developed Age Associated Gene Co-expression Identifier (AAGCI), a liquid association based method to infer age-associated gene co-expressions in thousands of biological processes and pathways across 9 human tissues. Several hundred to thousands of gene pairs were inferred to be age co-expressed across different tissues, and the genes involved are significantly enriched in functions such as immunity, ATP binding, DNA damage, and many cancer pathways. The age co-expressed genes overlap significantly with aging genes curated in the GenAge database across all 9 tissues, suggesting a tissue-wide correlation between age-associated genes and co-expressions. Interestingly, age-associated gene co-expressions are significantly different from gene co-expressions identified through correlation analysis, indicating that aging might contribute to only a small portion of gene co-expressions. Moreover, the key driver analysis identified biologically meaningful genes in important functional modules. For example, IGF1, ERBB2, TP53 and STAT5A were inferred to be key genes driving age co-expressed genes in the network module associated with the function "T cell proliferation". Finally, we prioritized a few anti-aging drugs such as metformin based on an enrichment analysis between age co-expressed genes and drug signatures from a recent study. The predicted drugs were partially validated by literature mining and can be readily used to generate hypotheses for further experimental validation.

13.
Article in English | MEDLINE | ID: mdl-26571537

ABSTRACT

Membrane transport proteins and their substrate specificities play crucial roles in a variety of cellular functions. Identifying the substrate specificities of membrane transport proteins is closely related to protein-target interaction prediction, drug design, membrane recruitment, and dysregulation analysis. However, experimental methods for this purpose are time consuming, labor intensive, and costly. Therefore, we proposed a novel method based on the support vector machine (SVM) to predict the substrate specificities of membrane transport proteins by integrating features from the position-specific score matrix (PSSM), PROFEAT, and Gene Ontology (GO). Jackknife cross-validation tests were then performed on a benchmark dataset and an independent dataset to measure the performance of the proposed method. Overall accuracies of 96.16 and 80.45 percent were obtained for the two datasets, which are higher (by 2.12 to 20.44 percent) than those of the state-of-the-art tool. Comparison results indicate that the proposed model is more reliable and efficient for accurate prediction of the substrate specificities of membrane transport proteins.


Subject(s)
Membrane Transport Proteins/chemistry , Molecular Docking Simulation/methods , Pattern Recognition, Automated/methods , Protein Interaction Mapping/methods , Sequence Analysis, Protein/methods , Support Vector Machine , Algorithms , Binding Sites , Models, Chemical , Protein Binding , Substrate Specificity
14.
Int J Mol Sci ; 17(1), 2015 Dec 24.
Article in English | MEDLINE | ID: mdl-26712737

ABSTRACT

The prior knowledge of protein structural class may offer useful clues on understanding its functionality as well as its tertiary structure. Though various significant efforts have been made to find a fast and effective computational approach to address this problem, it is still a challenging topic in the field of bioinformatics. The position-specific score matrix (PSSM) profile has been shown to provide a useful source of information for improving the prediction performance of protein structural class. However, this information has not been adequately explored. To this end, in this study, we present a feature extraction technique which is based on gapped-dipeptides composition computed directly from PSSM. Then, a careful feature selection technique is performed based on support vector machine-recursive feature elimination (SVM-RFE). These optimal features are selected to construct a final predictor. The results of jackknife tests on four working datasets show that our method obtains satisfactory prediction accuracies by extracting features solely based on PSSM and could serve as a very promising tool to predict protein structural class.


Subject(s)
Dipeptides/chemistry , Sequence Analysis, Protein/methods , Software , Protein Conformation
15.
PLoS One ; 10(5): e0127380, 2015.
Article in English | MEDLINE | ID: mdl-25992881

ABSTRACT

Predicting anticancer drug sensitivity can enhance the ability to individualize patient treatment, thus making the development of cancer therapies more effective and safe. In this paper, we present a new network flow-based method, which utilizes the topological structure of pathways, for predicting anticancer drug sensitivities. Mutations and copy number alterations of cancer-related genes are assumed to change the pathway activity, and the difference in pathway activity before and after drug treatment is used as a measure of drug response. In our model, contributions from different genetic alterations are treated as free parameters, which are optimized using the drug response data from the Cancer Genome Project (CGP). Ten-fold cross-validation on the CGP data set showed that our model achieved prediction results comparable to those of the existing elastic net model while using far fewer input features.


Subject(s)
Antineoplastic Agents/pharmacology , Computational Biology/methods , Gene Regulatory Networks/drug effects , Signal Transduction/drug effects , DNA Copy Number Variations , Genes, Neoplasm/drug effects , Humans , Models, Theoretical , Mutation , Neoplasms/drug therapy , Neoplasms/genetics
16.
PLoS One ; 10(3): e0120408, 2015.
Article in English | MEDLINE | ID: mdl-25794193

ABSTRACT

Prediction of drug response based on genomic alterations is an important task in personalized medicine research. The current elastic net model uses sure independence screening to select genomic features relevant to drug response, but it may neglect the combined effect of some marginally weak features. In this work, we applied an iterative sure independence screening scheme to select drug response relevant features from the Cancer Cell Line Encyclopedia (CCLE) dataset. For each drug in CCLE, we selected up to 40 features, including gene expression, mutations and copy number alterations of cancer-related genes; some of these are strong features despite showing weak marginal correlation with the drug response vector. Lasso regression based on the selected features showed that our prediction accuracies are higher than those of elastic net regression for most drugs.


Subject(s)
Pharmacogenetics , Precision Medicine , Algorithms , Antineoplastic Agents/pharmacology , Antineoplastic Agents/therapeutic use , Cell Line, Tumor , Datasets as Topic , Drug Resistance, Neoplasm , Humans , Models, Theoretical , Neoplasms/drug therapy , Neoplasms/genetics , Pharmacogenetics/methods , Precision Medicine/methods , Reproducibility of Results
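The screening idea in the abstract above can be illustrated with a small sketch. Plain sure independence screening ranks features by the absolute value of their marginal correlation with the response; an iterative variant re-screens the remaining features against the residual left after accounting for features already chosen, so that features with weak marginal correlation can still surface. The code below is a simplified residual-based variant written for illustration only: the function names, the univariate least-squares residual update, and the `rounds` and `keep_per_round` parameters are assumptions, not the authors' exact procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def sis(features, response, keep):
    """Keep the `keep` feature names with the largest |marginal correlation|.

    `features` maps a feature name to its list of values across samples.
    """
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], response)),
                    reverse=True)
    return ranked[:keep]


def iterative_sis(features, response, rounds, keep_per_round):
    """Simplified iterative screening: re-screen against the current residual."""
    selected = []
    residual = list(response)
    for _ in range(rounds):
        remaining = {k: v for k, v in features.items() if k not in selected}
        if not remaining:
            break
        picked = sis(remaining, residual, keep_per_round)
        selected.extend(picked)
        # Remove each picked feature's univariate least-squares fit from the residual.
        for name in picked:
            x = features[name]
            mx = sum(x) / len(x)
            mr = sum(residual) / len(residual)
            sxx = sum((a - mx) ** 2 for a in x)
            if sxx == 0:
                continue
            slope = sum((a - mx) * (r - mr) for a, r in zip(x, residual)) / sxx
            residual = [r - mr - slope * (a - mx) for a, r in zip(x, residual)]
    return selected
```

In a full pipeline the selected features would then be passed to a penalized regression such as the lasso, as the abstract describes; that step is omitted here to keep the sketch dependency-free.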
17.
J Theor Biol ; 366: 8-12, 2015 Feb 07.
Article in English | MEDLINE | ID: mdl-25463695

ABSTRACT

Knowledge of apoptosis proteins plays an important role in understanding the mechanism of programmed cell death. Obtaining information on the subcellular location of apoptosis proteins is very helpful for revealing the apoptosis mechanism and understanding the function of apoptosis proteins. Because of the cost in time and labor associated with large-scale wet-bench experiments, computational prediction of the subcellular location of apoptosis proteins has become very important, and many computational tools have been developed in recent decades. Existing methods differ in the protein sequence representation techniques and classification algorithms adopted. In this study, we first introduce a sequence encoding scheme based on tri-grams computed directly from position-specific score matrices, which incorporates the evolutionary information represented in the PSI-BLAST profile as well as sequence-order information. Then the SVM-RFE algorithm is applied for feature selection, and the reduced vectors are input to a support vector machine classifier to predict the subcellular location of apoptosis proteins. Jackknife tests on three widely used datasets show that our method provides state-of-the-art performance in comparison with other existing methods.


Subject(s)
Algorithms , Apoptosis Regulatory Proteins/metabolism , Position-Specific Scoring Matrices , Databases, Protein , Humans , Protein Transport , ROC Curve , Subcellular Fractions/metabolism , Support Vector Machine
18.
BMC Bioinformatics ; 14 Suppl 5: S5, 2013.
Article in English | MEDLINE | ID: mdl-23734762

ABSTRACT

BACKGROUND: Identification of gene-phenotype relationships is a fundamental challenge in human health. Based on the observation that genes causing the same or similar phenotypes tend to correlate with each other in the protein-protein interaction network, many network-based approaches have been proposed based on different underlying models. A recent comparative study showed that diffusion-based methods achieve state-of-the-art predictive performance. RESULTS: In this paper, a new diffusion-based method was proposed to prioritize candidate disease genes. The diffusion profile of a disease was defined as the stationary distribution of candidate genes given a random walk with restart in which similarities between phenotypes are incorporated. Candidate disease genes are then prioritized by comparing their diffusion profiles with that of the disease. Finally, the effectiveness of our method was demonstrated through leave-one-out cross-validation against control genes from artificial linkage intervals and randomly chosen genes. A comparative study showed that our method achieves improved performance compared to some classical diffusion-based methods. To further illustrate our method, we used our algorithm to predict new causative genes of 16 multifactorial diseases, including prostate cancer and Alzheimer's disease, and the top predictions were in good agreement with literature reports. CONCLUSIONS: Our study indicates that integration of multiple information sources, especially the phenotype similarity profile data, and the introduction of a global similarity measure between disease and gene diffusion profiles are helpful for prioritizing candidate disease genes. AVAILABILITY: Programs and data are available upon request.


Subject(s)
Algorithms , Disease/genetics , Genes , Protein Interaction Mapping , Alzheimer Disease/genetics , Alzheimer Disease/metabolism , Genetic Association Studies , Humans , Male , Phenotype , Prostatic Neoplasms/genetics , Prostatic Neoplasms/metabolism
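The abstract above defines a diffusion profile as the stationary distribution of a random walk with restart. The sketch below shows the generic iteration p ← (1 − r)·W·p + r·p₀ on an unweighted graph, where W spreads each node's probability evenly over its neighbors and p₀ is uniform over the seed nodes. It is a minimal illustration only: the function name, the restart value of 0.3, and the handling of isolated nodes are assumptions, and the phenotype-similarity weighting and the profile comparison step described in the abstract are not reproduced.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3,
                             tol=1e-10, max_iter=1000):
    """Approximate the stationary distribution of a random walk with restart.

    `adjacency` maps each node to a list of its neighbors; every neighbor
    must also appear as a key. `seeds` is a non-empty collection of nodes
    that the walker jumps back to with probability `restart`.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)

    for _ in range(max_iter):
        # Restart term: a fraction `restart` of the mass returns to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            neighbors = adjacency[u]
            if not neighbors:
                # Assumption: mass on an isolated node is sent back to the seeds.
                for n in nodes:
                    nxt[n] += (1.0 - restart) * p[u] * p0[n]
                continue
            share = (1.0 - restart) * p[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p
```

On a four-node path graph A-B-C-D seeded at A, the resulting probabilities decrease with distance from the seed, which is the behavior that makes such profiles usable for ranking candidates by network proximity.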
19.
Protein Pept Lett ; 19(4): 388-97, 2012 Apr.
Article in English | MEDLINE | ID: mdl-22316305

ABSTRACT

Computational prediction of protein structural class based on sequence data remains a challenging problem in current protein science. In this paper, a new feature extraction approach based on relative polypeptide composition is introduced. This approach could take into account the background distribution of a given k-mer under a Markov model of order k-2, and avoid the curse of dimensionality with the increase of k by using a T-statistic feature selection strategy. The selected features are then fed to a support vector machine to perform the prediction. To verify the performance of our method, jackknife cross-validation tests are performed on four widely used benchmark datasets. Comparison of our results with existing methods shows that our method provides satisfactory performance for structural class prediction.


Subject(s)
Amino Acids/chemistry , Computational Biology , Protein Structure, Tertiary , Proteins/chemistry , Algorithms , Databases, Protein , Neural Networks, Computer , Protein Folding , Proteins/classification , Sequence Analysis, Protein , Support Vector Machine
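The abstract above refers to the background distribution of a k-mer under a Markov model of order k-2. A common estimator for the expected count of a word w = w₁…w_k under such a model is N(w₁…w_{k−1}) · N(w₂…w_k) / N(w₂…w_{k−1}), where N(·) is the observed count of a subword in the sequence. The sketch below computes observed and expected counts with this estimator. Whether the authors used exactly this estimator, and how they combined observed and expected counts into their "relative composition" feature, is not stated in the abstract, so the function names and the estimator choice are assumptions.

```python
from collections import Counter


def count_words(seq, k):
    """Count all overlapping words of length k in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(seq, word):
    """Expected count of `word` under a Markov model of order len(word) - 2.

    Uses the estimator N(prefix) * N(suffix) / N(core), where prefix and
    suffix are the two (k-1)-mers of the word and core is the shared (k-2)-mer.
    Returns 0.0 when the core never occurs.
    """
    k = len(word)
    if k < 3:
        raise ValueError("word length must be at least 3")
    shorter = count_words(seq, k - 1)
    core_counts = count_words(seq, k - 2)
    core = core_counts[word[1:-1]]
    if core == 0:
        return 0.0
    return shorter[word[:-1]] * shorter[word[1:]] / core
```

Comparing `count_words(seq, k)[word]` with `expected_count(seq, word)` indicates whether a word is over- or under-represented relative to what its shorter subwords would predict.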
20.
Math Biosci ; 217(2): 159-66, 2009 Feb.
Article in English | MEDLINE | ID: mdl-19073197

ABSTRACT

In this paper, we propose two metrics to compare DNA and protein sequences based on a Poisson model of word occurrences. Instead of comparing the frequencies of all fixed-length words in two sequences, we consider (1) the probability of 'generating' one sequence under the Poisson model estimated from the other; (2) their different expression levels of words. Phylogenetic trees of 25 viruses including SARS-CoVs are constructed to illustrate our approach.


Subject(s)
Coronaviridae/genetics , Models, Genetic , Poisson Distribution , Animals , Base Sequence , DNA, Mitochondrial/genetics , DNA, Viral/genetics , Humans , Phylogeny
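The first idea in the abstract above, scoring how probable one sequence's word counts are under a Poisson model estimated from another sequence, can be sketched as follows. Each word's rate is estimated from its frequency in the source sequence, scaled to the number of word positions in the target, and the Poisson log-probabilities of the target's observed counts are summed. This is a hedged illustration rather than the paper's metric: the function names, the pseudocount used to avoid zero rates, and the treatment of words as independent are all assumptions not taken from the abstract.

```python
import math
from collections import Counter
from itertools import product


def word_counts(seq, k):
    """Count all overlapping words of length k in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def poisson_log_likelihood(source, target, k, alphabet="ACGT", pseudocount=0.5):
    """Log-probability of the target's word counts under Poisson rates from source.

    Words are treated as independent Poisson variables (an assumption);
    `pseudocount` keeps every rate strictly positive.
    """
    n_src = len(source) - k + 1
    n_tgt = len(target) - k + 1
    if n_src <= 0 or n_tgt <= 0:
        raise ValueError("both sequences must be at least k letters long")

    src = word_counts(source, k)
    tgt = word_counts(target, k)
    n_words = len(alphabet) ** k

    total = 0.0
    for letters in product(alphabet, repeat=k):
        word = "".join(letters)
        # Per-position rate estimated from the source, with smoothing.
        rate = (src[word] + pseudocount) / (n_src + pseudocount * n_words)
        mu = rate * n_tgt                      # expected count in the target
        c = tgt[word]                          # observed count in the target
        total += c * math.log(mu) - mu - math.lgamma(c + 1)
    return total
```

A higher (less negative) value means the target's word usage is better explained by the source; a distance for tree building would still need to be derived from such scores, for instance by symmetrizing over the two directions, which is left out here.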