Search | BVS Regional Portal

Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction.

Mathé, C; Peresetsky, A; Déhais, P; Van Montagu, M; Rouzé, P.

J Mol Biol ; 285(5): 1977-91, 1999 Feb 05.

Article in English | MEDLINE | ID: mdl-9925779

ABSTRACT

While genomic sequences are accumulating, locating the genes remains a major problem that homology searches can solve for only about half of them. Prediction methods are thus required, but they are not yet fully satisfactory. Most prediction methods implicitly assume a single model for genes. This is an oversimplification, as shown by the fact that coding sequences can be grouped into several classes in Escherichia coli and other genomes. As no classification existed for Arabidopsis thaliana, we classified genes according to the statistical features of their coding sequences. A clustering algorithm using a codon usage model was developed and applied to coding sequences from A. thaliana, E. coli, and a mixture of both. With it, Arabidopsis sequences were clustered into two classes, CU1 and CU2. The two classes differed essentially in the choice of pyrimidine base at the silent codon sites: CU2 genes often use C, whereas CU1 genes prefer T. This classification discriminated the Arabidopsis genes according to their expression level: highly expressed genes clustered in CU2, and genes expected to have lower expression, such as regulatory genes, clustered in CU1. The algorithm separated the sequences of the mixed Escherichia-Arabidopsis data set into five classes according to species, except for one class. This mixed class contained 89% Arabidopsis genes from CU1 and 11% E. coli genes, mostly horizontally transferred. Interestingly, most genes encoding organelle-targeted proteins, except the photosynthetic and photoassimilatory ones, were clustered in CU1. Tailoring the GeneMark CDS prediction algorithm to the observed coding sequence classes greatly improved its prediction quality. Similar improvement can be expected with other prediction systems.
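The C-versus-T preference described in the abstract above can be illustrated with a small Python sketch. This is not the authors' clustering algorithm: it only measures how often C is chosen over T at third codon positions, which is used here as a rough stand-in for "silent sites". The function names, the 0.5 threshold, and the rule mapping a sequence to CU1 or CU2 are all assumptions made for illustration.

```python
from collections import Counter


def pyrimidine_c_fraction(cds: str) -> float:
    """Fraction of C among codons whose third base is C or T.

    Assumption: the third codon position approximates the silent site.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    thirds = Counter(cds[i + 2] for i in range(0, usable, 3))
    total = thirds["C"] + thirds["T"]
    if total == 0:
        raise ValueError("no codon ends in C or T")
    return thirds["C"] / total


def toy_class(cds: str, threshold: float = 0.5) -> str:
    """Hypothetical rule: C-rich sequences -> 'CU2', T-rich -> 'CU1'.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return "CU2" if pyrimidine_c_fraction(cds) > threshold else "CU1"
```

For instance, `pyrimidine_c_fraction("GCCGCTGCC")` counts the third bases C, T, C. A real classification would fit a codon usage model per cluster rather than apply a fixed cutoff.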

Subjects

Arabidopsis/genetics; Codon/classification; Genes, Plant; Models, Genetic; Algorithms; Arabidopsis/classification; Cell Nucleus/genetics; Classification/methods; Exons; Gene Expression; Organelles/genetics

Deriving non-homogeneous DNA Markov chain models by cluster analysis algorithm minimizing multiple alignment entropy.

Borodovsky, M; Peresetsky, A.

Comput Chem ; 18(3): 259-67, 1994 Sep.

Article in English | MEDLINE | ID: mdl-7952897

ABSTRACT

Non-homogeneous Markov chain models can represent biologically important regions of DNA sequences. The statistical pattern described by these models is usually weak and was found primarily because of strong biological indications. A general method for extracting similar patterns is presented in this paper. The algorithm incorporates cluster analysis, multiple alignment and entropy minimization. The method was first tested using a set of DNA sequences produced by Markov chain generators. It was shown that artificial gene sequences, initially placed at random along the multiple alignment panels, are aligned according to the hidden triplet phase. Then the method was applied to real protein-coding sequences; the resulting alignment clearly indicated the triplet phase and produced the parameters of the optimal 3-periodic non-homogeneous Markov chain model. These Markov models were already employed in the GeneMark gene prediction algorithm, which is used in genome sequencing projects. The algorithm can also handle the case in which the sequences to be aligned reveal different statistical patterns, such as Escherichia coli protein-coding sequences belonging to Class II and Class III. The algorithm accepts a random mix of sequences from different classes, and is able to separate them into two groups (clusters), align each cluster separately, and define a non-homogeneous Markov chain model for each sequence cluster.
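To make the idea of a 3-periodic non-homogeneous Markov chain concrete, the following Python sketch trains one first-order transition table per codon position and then scores a sequence under each of the three possible phases. It is a simplification under stated assumptions, not the clustering, alignment and entropy-minimization procedure of the paper: training sequences are assumed to be already in frame, the chain order is fixed at one, and every function name and the pseudocount value are hypothetical.

```python
import math

BASES = "ACGT"


def train_periodic_chain(seqs, pseudocount=1.0):
    """Estimate model[k][a][b] = P(base b at codon position k | previous base a).

    Assumption: every training sequence starts at codon position 0.
    """
    counts = [
        {a: {b: pseudocount for b in BASES} for a in BASES} for _ in range(3)
    ]
    for seq in seqs:
        for i in range(1, len(seq)):
            a, b = seq[i - 1], seq[i]
            if a in BASES and b in BASES:
                counts[i % 3][a][b] += 1
    model = []
    for k in range(3):
        table = {}
        for a in BASES:
            total = sum(counts[k][a].values())
            table[a] = {b: counts[k][a][b] / total for b in BASES}
        model.append(table)
    return model


def log_likelihood(seq, model, phase):
    """Log-probability of the transitions if seq[0] is at codon position `phase`."""
    ll = 0.0
    for i in range(1, len(seq)):
        a, b = seq[i - 1], seq[i]
        if a in BASES and b in BASES:
            ll += math.log(model[(i + phase) % 3][a][b])
    return ll


def best_phase(seq, model):
    """Return the phase (0, 1 or 2) with the highest log-likelihood."""
    return max(range(3), key=lambda p: log_likelihood(seq, model, p))
```

A model trained on in-frame sequences can then be queried with `best_phase(fragment, model)` to guess where a fragment sits relative to the codon frame. Separating mixed classes, as the abstract describes, would additionally require fitting one such model per cluster.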

Subjects

DNA/chemistry; Markov Chains; Models, Chemical; Algorithms; Cluster Analysis; DNA, Bacterial/genetics; Escherichia coli/genetics; Sequence Alignment/statistics & numerical data; Thermodynamics
