Results 1 - 20 of 28
1.
Nat Commun ; 5: 3311, 2014.
Article in English | MEDLINE | ID: mdl-24548928

ABSTRACT

The subfamily of the Lemnoideae belongs to a different order than other monocotyledonous species that have been sequenced and comprises aquatic plants that grow rapidly on the water surface. Here we select Spirodela polyrhiza for whole-genome sequencing. We show that Spirodela has a genome with no signs of recent retrotranspositions but signatures of two ancient whole-genome duplications, possibly 95 million years ago (mya), older than those in Arabidopsis and rice. Its genome has only 19,623 predicted protein-coding genes, which is 28% less than the dicotyledonous Arabidopsis thaliana and 50% less than monocotyledonous rice. We propose that at least in part, the neotenous reduction of these aquatic plants is based on readjusted copy numbers of promoters and repressors of the juvenile-to-adult transition. The Spirodela genome, along with its unique biology and physiology, will stimulate new insights into environmental adaptation, ecology, evolution and plant development, and will be instrumental for future bioenergy applications.


Subject(s)
Araceae/growth & development , Araceae/genetics , Genome, Plant/genetics , Fresh Water , Molecular Sequence Data
2.
OMICS ; 7(2): 171-5, 2003.
Article in English | MEDLINE | ID: mdl-14506846

ABSTRACT

As more and more complete bacterial genome sequences become available, the genome annotation of previously sequenced genomes may quickly become outdated. This is primarily due to the discovery and functional characterization of new genes. We have reannotated the recently published genome of Shewanella oneidensis with the following results: 51 new genes have been identified, and functional annotation has been added to 97 genes, including 15 new and 82 existing ones with previously unassigned function. The identification of new genes was achieved by predicting the protein coding regions using the HMM-based program GeneMark.hmm. Subsequent comparison of the predicted gene products to the non-redundant protein database using BLAST and the COG (Clusters of Orthologous Groups) database using COGNITOR provided for the functional annotation.


Subject(s)
Bacterial Proteins/genetics , Genome, Bacterial , Shewanella/genetics , Algorithms , Bacterial Proteins/physiology , Computational Biology/methods , Genes, Bacterial/genetics , Genomics , Molecular Sequence Data , Open Reading Frames/genetics , Sequence Alignment/methods , Software
3.
Nucleic Acids Res ; 29(12): 2607-18, 2001 Jun 15.
Article in English | MEDLINE | ID: mdl-11410670

ABSTRACT

Improving the accuracy of prediction of gene starts is one of a few remaining open problems in computer prediction of prokaryotic genes. Its difficulty is caused by the absence of relatively strong sequence patterns identifying true translation initiation sites. In the current paper we show that the accuracy of gene start prediction can be improved by combining models of protein-coding and non-coding regions and models of regulatory sites near gene start within an iterative Hidden Markov model based algorithm. The new gene prediction method, called GeneMarkS, utilizes a non-supervised training procedure and can be used for a newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. The GeneMarkS implementation uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. GeneMarkS predicted precisely 83.2% of the translation starts of GenBank annotated Bacillus subtilis genes and 94.4% of translation starts in an experimentally validated set of Escherichia coli genes. We have also observed that GeneMarkS detects prokaryotic genes, in terms of identifying open reading frames containing real genes, with an accuracy matching the level of the best currently used gene detection methods. Accurate translation start prediction, in addition to the refinement of protein sequence N-terminal data, provides the benefit of precise positioning of the sequence region situated upstream to a gene start. Therefore, sequence motifs related to transcription and translation regulatory sites can be revealed and analyzed with higher precision. These motifs were shown to possess a significant variability, the functional and evolutionary connections of which are discussed.


Subject(s)
Bacillus subtilis/genetics , Codon, Initiator/genetics , Computational Biology/methods , Genes, Bacterial/genetics , Genome, Bacterial , Protein Biosynthesis/genetics , Software , Algorithms , Base Sequence , Computer Simulation , Databases as Topic , Escherichia coli/genetics , Evolution, Molecular , Genes, Archaeal/genetics , Genes, Overlapping/genetics , Genome, Archaeal , Internet , Likelihood Functions , Markov Chains , Open Reading Frames/genetics , Reproducibility of Results , Sensitivity and Specificity , Sequence Alignment , Transcription, Genetic/genetics
4.
J Mol Biol ; 309(2): 347-60, 2001 Jun 01.
Article in English | MEDLINE | ID: mdl-11371158

ABSTRACT

We mapped transcription start sites for ten unrelated protein-encoding Pyrobaculum aerophilum genes by primer extension and S1 nuclease mapping. All of the mapped transcripts start at the computationally predicted translation start codons, two of which were supported by N-terminal protein sequencing. A whole genome computational analysis of the regions from -50 to +50 nt around the predicted translation start codons revealed a clear upstream pattern matching the consensus sequence of the archaeal TATA box located unusually close to the translation starts. For genes with the TATA boxes that best matched the consensus sequence, the distance between the TATA box and the translation start codon appears to be shorter than 30 nt. Two other promoter elements distinguished were also found unusually close to the translation start codons: a transcription initiator element with significant elevation of C and T frequencies at the -1 position and a BRE element with more frequent A bases at position -29 to -32 (counting from the translation start site). We also show that one of the mapped genes is transcribed as the first gene of an operon. For a set of genes likely to be internal in operons the upstream signal extracted by computer analysis was a Shine-Dalgarno pattern matching the complementary sequence of P. aerophilum 16S rRNA. Together these results suggest that the translation of proteins encoded by single genes or genes that are first in operons in the hyperthermophilic crenarchaeon P. aerophilum proceeds mostly, if not exclusively, through leaderless transcripts. Internal genes in operons are likely to undergo translation via a mechanism that is facilitated by ribosome binding to the Shine-Dalgarno sequence.


Subject(s)
5' Untranslated Regions/genetics , Codon, Initiator/genetics , Coenzymes , RNA, Archaeal/genetics , TATA Box/genetics , Thermoproteaceae/genetics , 5' Untranslated Regions/analysis , Amino Acid Sequence , Base Sequence , Consensus Sequence/genetics , Databases as Topic , Genes, Archaeal/genetics , Genome, Archaeal , Metalloproteins/metabolism , Molecular Sequence Data , Molybdenum Cofactors , Nuclease Protection Assays , Operon/genetics , Oxidoreductases/genetics , Phosphotransferases (Alcohol Group Acceptor)/chemistry , Phosphotransferases (Alcohol Group Acceptor)/genetics , Protein Biosynthesis/genetics , Pteridines/metabolism , RNA, Archaeal/analysis , Sequence Alignment , Sequence Analysis, Protein , Single-Strand Specific DNA and RNA Endonucleases/metabolism , Superoxide Dismutase/chemistry , Superoxide Dismutase/genetics , Thermoproteaceae/enzymology , Transcription, Genetic/genetics
5.
J Cell Biol ; 153(1): 63-74, 2001 Apr 02.
Article in English | MEDLINE | ID: mdl-11285274

ABSTRACT

In the unicellular alga Chlamydomonas, two anterior flagella are positioned with 180 degrees rotational symmetry, such that the flagella beat with the effective strokes in opposite directions (Hoops, H.J., and G.B. Witman. 1983. J. Cell Biol. 97:902-908). The vfl1 mutation results in variable numbers and positioning of flagella and basal bodies (Adams, G.M.W., R.L. Wright, and J.W. Jarvik. 1985. J. Cell Biol. 100:955-964). Using a tagged allele, we cloned the VFL1 gene that encodes a protein of 128 kD with five leucine-rich repeat sequences near the NH2 terminus and a large alpha-helical coiled-coil domain at the COOH terminus. An epitope-tagged gene construct rescued the mutant phenotype and expressed a tagged protein (Vfl1p) that copurified with basal body flagellar apparatuses. Immunofluorescence experiments showed that Vfl1p localized with basal bodies and probasal bodies. Immunogold labeling localized Vfl1p inside the lumen of the basal body at the distal end. Distribution of gold particles was rotationally asymmetric, with most particles located near the doublet microtubules that face the opposite basal body. The mutant phenotype, together with the localization results, suggests that Vfl1p plays a role in establishing the correct rotational orientation of basal bodies. Vfl1p is the first reported molecular marker of the rotational asymmetry inherent to basal bodies.


Subject(s)
Algal Proteins/analysis , Protozoan Proteins/analysis , Algal Proteins/genetics , Alleles , Amino Acid Sequence , Animals , Chlamydomonas/chemistry , Chlamydomonas/genetics , Molecular Sequence Data , Protozoan Proteins/genetics
6.
Funct Integr Genomics ; 1(5): 312-22, 2001 Sep.
Article in English | MEDLINE | ID: mdl-11793250

ABSTRACT

The ibeA (ibe10) gene is an invasion determinant contributing to E. coli K1 invasion of the blood-brain barrier. This gene has been cloned and characterized from the chromosome of an invasive cerebrospinal fluid isolate of E. coli K1, strain RS218 (O18:K1:H7). In the present study, a genetic island of meningitic E. coli containing ibeA (GimA) has been identified. A 20.3-kb genomic DNA island unique to E. coli K1 strains has been cloned and sequenced from an RS218 E. coli K1 genomic DNA library. Fourteen new genes have been identified in addition to ibeA. The DNA sequence analysis indicated that the ibeA gene cluster was localized to the 98 min region and consisted of four operons, ptnIPKC, cglDTEC, gcxKRCI and ibeRAT. The G+C content (46.2%) of unique regions of the island is substantially different from that (50.8%) of the rest of the E. coli chromosome. By computer-assisted analysis of the sequences with DNA and protein databases (GenBank and PROSITE databases), the functions of the gene products could be anticipated, and were assigned to the functional categories of proteins relating to carbon source metabolism and substrate transportation. Glucose was shown to enhance E. coli penetration of human brain microvascular endothelial cells and exogenous cAMP was able to block the stimulating effect of glucose, suggesting that catabolic regulation may play a role in control of E. coli K1 invasion gene expression. Our data suggest that this genetic island may contribute to E. coli invasion of the blood-brain barrier through a carbon-source-regulated process.


Subject(s)
Bacterial Proteins/genetics , Brain/blood supply , Endothelium, Vascular/microbiology , Escherichia coli Proteins , Escherichia coli/genetics , Galactose/pharmacology , Membrane Proteins/genetics , Meningitis, Escherichia coli/microbiology , Amino Acid Sequence , Cells, Cultured , Cloning, Molecular , DNA Primers/genetics , DNA, Bacterial/genetics , Escherichia coli/pathogenicity , Genes, Bacterial , Humans , Infant, Newborn , Molecular Sequence Data , Sequence Homology, Amino Acid
7.
Nucleic Acids Res ; 27(19): 3911-20, 1999 Oct 01.
Article in English | MEDLINE | ID: mdl-10481031

ABSTRACT

Computer methods of accurate gene finding in DNA sequences require models of protein coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. Here we propose a new, heuristic method producing fairly accurate inhomogeneous Markov models of protein coding regions. The new method needs such a small amount of DNA sequence data that the model can be built 'on the fly' by a web server for any DNA sequence >400 nt. Tests on 10 complete bacterial genomes performed with the GeneMark.hmm program demonstrated the ability of the new models to detect 93.1% of annotated genes on average, while models built by traditional training predict an average of 93.9% of genes. Models built by the heuristic approach could be used to find genes in small fragments of anonymous prokaryotic genomes and in genomes of organelles, viruses, phages and plasmids, as well as in highly inhomogeneous genomes where adjustment of models to local DNA composition is needed. The heuristic method also gives an insight into the mechanism of codon usage pattern evolution.
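
To make the term "inhomogeneous Markov model of a protein coding region" concrete, here is a minimal sketch of a three-periodic, zero-order model: one nucleotide distribution per codon position, compared against a uniform background. It is only an illustration. The abstract's heuristic method derives its parameters differently (from a small amount of sequence), and the function names, the pseudocount, and the uniform background are all assumptions made here.

```python
import math
from collections import Counter

NUCLEOTIDES = "ACGT"

def train_periodic_model(coding_seqs, pseudocount=1.0):
    """Estimate P(nucleotide | codon position) from in-frame coding sequences."""
    counts = [Counter() for _ in range(3)]
    for seq in coding_seqs:
        for i, base in enumerate(seq.upper()):
            if base in NUCLEOTIDES:
                counts[i % 3][base] += 1
    model = []
    for pos_counts in counts:
        total = sum(pos_counts[b] for b in NUCLEOTIDES) + 4 * pseudocount
        model.append({b: (pos_counts[b] + pseudocount) / total for b in NUCLEOTIDES})
    return model

def coding_log_odds(seq, model):
    """Log-odds of seq (read in frame 0) under the periodic model vs. a uniform background."""
    score = 0.0
    for i, base in enumerate(seq.upper()):
        if base in NUCLEOTIDES:
            score += math.log(model[i % 3][base] / 0.25)
    return score

if __name__ == "__main__":
    # Toy training set (hypothetical sequences, not real genes).
    model = train_periodic_model(["ATGGCTGCTAAAGCTGAA", "ATGGCAGCTGCTAAAGAA"])
    print(coding_log_odds("GCTGCTAAA", model))  # fragment read in the trained frame
    print(coding_log_odds("CTGCTAAAG", model))  # same region shifted by one base
```

With this toy training set the in-frame fragment scores above zero and the shifted fragment below zero, which is the frame-dependent signal that practical gene finders exploit with higher-order models.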


Subject(s)
Genome, Bacterial , HIV-1/genetics , Human T-lymphotropic virus 1/genetics , Models, Genetic , Codon , Eukaryotic Cells , Evolution, Molecular , Genes, Bacterial , Genome, Viral , Humans , Internet , Markov Chains
8.
Bioinformatics ; 15(11): 874-86, 1999 Nov.
Article in English | MEDLINE | ID: mdl-10743554

ABSTRACT

MOTIVATION: Tightly packed prokaryotic genes frequently overlap with each other. This feature, rarely seen in eukaryotic DNA, makes detection of translation initiation sites and, therefore, exact predictions of prokaryotic genes notoriously difficult. Improving the accuracy of precise gene prediction in prokaryotic genomic DNA remains an important open problem. RESULTS: A software program implementing a new algorithm utilizing a uniform Hidden Markov Model for prokaryotic gene prediction was developed. The algorithm analyzes a given DNA sequence in each of six possible global reading frames independently. Twelve complete prokaryotic genomes were analyzed using the new tool. The accuracy of gene finding, predicting locations of protein-coding ORFs, as well as the accuracy of precise gene prediction, and detecting the whole gene including translation initiation codon were assessed by comparison with existing annotation. It was shown that in terms of gene finding, the program performs at least as well as the previously developed tools, such as GeneMark and GLIMMER. In terms of precise gene prediction the new program was shown to be more accurate, by several percentage points, than earlier developed tools, such as GeneMark.hmm, ECOPARSE and ORPHEUS. The results of testing the program indicated the possibility of systematic bias in start codon annotation in several early sequenced prokaryotic genomes. AVAILABILITY: The new gene-finding program can be accessed through the Web site: http://dixie.biology.gatech.edu/GeneMark/fbf.cgi CONTACT: mark@amber.gatech.edu.
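
The idea of treating each of the six reading frames independently can be illustrated with a plain open-reading-frame scan. This sketch is not the HMM described in the abstract; it only enumerates start-to-stop ORFs per frame. The start codon set (ATG, GTG, TTG), the `min_codons` parameter, and all function names are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed prokaryotic start codons

def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) for ORFs in one frame; end is exclusive and includes the stop codon."""
    orf_start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if orf_start is None and codon in START_CODONS:
            orf_start = i  # first start codon after the previous stop
        elif orf_start is not None and codon in STOP_CODONS:
            if (i - orf_start) // 3 >= min_codons:
                yield orf_start, i + 3
            orf_start = None

def six_frame_orfs(seq, min_codons=1):
    """Scan three frames on each strand; minus-strand coordinates refer to the reverse complement."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end))
    return results

if __name__ == "__main__":
    print(six_frame_orfs("CCATGAAATAGCC"))
```

Because each frame is scanned on its own, overlapping genes in different frames do not interfere with one another, which is the property the abstract highlights for tightly packed prokaryotic genomes.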


Subject(s)
Algorithms , Bacteria/genetics , Codon, Initiator/genetics , Genes, Bacterial/genetics , Genes, Overlapping/genetics , Sequence Analysis, DNA , Computational Biology/methods , Databases, Factual , Evaluation Studies as Topic , Genome, Bacterial , Models, Genetic , Open Reading Frames/genetics , Protein Biosynthesis , Reproducibility of Results , Sensitivity and Specificity , Software Validation
9.
Genome Res ; 8(11): 1154-71, 1998 Nov.
Article in English | MEDLINE | ID: mdl-9847079

ABSTRACT

In this report we address the problem of accurate statistical modeling of DNA sequences, either coding or noncoding, for a bacterial species whose genome (or a large portion) was sequenced but not yet characterized experimentally. Availability of these models is critical for successful solution of the genome annotation task by statistical methods of gene finding. We present the method, GeneMark-Genesis, which learns the parameters of Markov models of protein-coding and noncoding regions from anonymous bacterial genomic sequence. These models are subsequently used in the GeneMark and GeneMark.hmm gene-finding programs. Although there is basically one model of a noncoding region for a given genome, several models of protein-coding region are automatically obtained by GeneMark-Genesis. The diversity of protein-coding models reflects the diversity of oligonucleotide compositions, particularly the diversity of codon usage strategies observed in genes from one and the same genome. In the simplest and the most important case, there are just two gene models: typical and atypical. We show that the atypical model allows one to predict genes that escape identification by the typical model. Many genes predicted by the atypical model appear to be horizontally transferred genes. The early versions of GeneMark-Genesis were used for annotating the genomes of Methanococcus jannaschii and Helicobacter pylori. We report the results of accuracy testing of the full-scale version of GeneMark-Genesis on 10 completely sequenced bacterial genomes. Interestingly, the GeneMark.hmm program that employed the typical and atypical models defined by GeneMark-Genesis was able to predict 683 new atypical genes with 176 of them confirmed by similarity search.


Subject(s)
Genes, Bacterial/genetics , Genome, Bacterial , Mathematical Computing , Software , Algorithms , DNA, Bacterial , Databases, Factual , Open Reading Frames , Sensitivity and Specificity
10.
Pac Symp Biocomput ; : 279-90, 1998.
Article in English | MEDLINE | ID: mdl-9697189

ABSTRACT

Accurate prediction of the position of translation initiation (N-terminal prediction) is a difficult problem. N-terminal prediction from DNA sequence alone is ambiguous if several candidate start sites are close to each other. Protein similarity search is usually unable to indicate the true start of a gene as it would require a strong protein sequence similarity at the N-terminal portion of a protein where conservative regions are rarely situated. With the aid of the GeneMark program for gene identification, we extract DNA sequence fragments presumably containing ribosome binding sites (RBS) from unannotated complete genomic sequences. These DNA segments are aligned to generate the RBS model using the Gibbs sampling method. N-terminal prediction is then performed by using the RBS model in conjunction with the GeneMark start codon prediction to aid in determining the true N-terminal site.
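
Once fragments have been aligned, an RBS model of the kind mentioned above can be represented as a position weight matrix and slid along the region upstream of each candidate start. The sketch below assumes the alignment is already given (the Gibbs sampling step is not reproduced), uses a uniform background and a pseudocount chosen arbitrarily, and expects sequences containing only A, C, G, T. All names and the toy sites are hypothetical.

```python
import math

BASES = "ACGT"

def build_pwm(aligned_sites, pseudocount=0.5):
    """Column-wise log-odds matrix versus a uniform background."""
    width = len(aligned_sites[0])
    pwm = []
    for col in range(width):
        column = [site[col] for site in aligned_sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            b: math.log(((column.count(b) + pseudocount) / total) / 0.25)
            for b in BASES
        })
    return pwm

def best_site(upstream, pwm):
    """Return (score, offset) of the best-scoring window, or None if upstream is too short."""
    width = len(pwm)
    best = None
    for offset in range(len(upstream) - width + 1):
        window = upstream[offset:offset + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, offset)
    return best

if __name__ == "__main__":
    # Toy aligned sites resembling a Shine-Dalgarno-like motif (illustrative only).
    pwm = build_pwm(["AGGAGG", "AGGAGA", "AGGAGG", "GGGAGG"])
    print(best_site("TTTTAGGAGGTTTTT", pwm))
```

Comparing the best upstream score for each nearby candidate start codon is one simple way to break ties when several starts are close together.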


Subject(s)
DNA/chemistry , DNA/genetics , Models, Genetic , Models, Statistical , Peptide Chain Initiation, Translational , Ribosomes/metabolism , Base Sequence , Binding Sites , Cluster Analysis , Codon , Consensus Sequence , Databases, Factual , Escherichia coli/genetics , Genome , Genome, Bacterial , RNA, Ribosomal, 16S/genetics , Sequence Alignment , Sequence Homology, Nucleic Acid
12.
Nucleic Acids Res ; 26(4): 1107-15, 1998 Feb 15.
Article in English | MEDLINE | ID: mdl-9461475

ABSTRACT

The number of completely sequenced bacterial genomes has been growing fast. Computer methods for finding genes are available, yet there is a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into a naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. We also used the specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, the high gene finding accuracy was observed even in the case when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.
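
Modeling gene boundaries as transitions between hidden states can be sketched with a generic two-state Viterbi decoder. This is a toy, not the GeneMark.hmm algorithm: it has only "noncoding" and "coding" states, single-nucleotide emissions, and no explicit state durations or codon periodicity. The state names and all probabilities are invented for illustration.

```python
import math

STATES = ("noncoding", "coding")

def viterbi(seq, start_p, trans_p, emit_p):
    """Most likely hidden-state path for seq under a first-order HMM (log space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in state s at this position.
            best_prev = max(STATES, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            current[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                          + math.log(emit_p[s][base]))
            pointers[s] = best_prev
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy parameters (assumptions): noncoding is AT-rich, coding is GC-rich.
START_P = {"noncoding": 0.5, "coding": 0.5}
TRANS_P = {"noncoding": {"noncoding": 0.9, "coding": 0.1},
           "coding": {"noncoding": 0.1, "coding": 0.9}}
EMIT_P = {"noncoding": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
          "coding": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

if __name__ == "__main__":
    print(viterbi("ATATATGCGCGCGC", START_P, TRANS_P, EMIT_P))
```

In the decoded path, the position where the state changes plays the role of a predicted boundary; a realistic gene finder would replace the single-base emissions with the kind of coding and noncoding Markov models the abstract describes.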


Subject(s)
Algorithms , Genome, Bacterial , Models, Genetic , Base Sequence , DNA, Bacterial/genetics , Databases, Factual , Escherichia coli/genetics , Evaluation Studies as Topic , Genes, Bacterial , Markov Chains , Software
13.
Nature ; 388(6642): 539-47, 1997 Aug 07.
Article in English | MEDLINE | ID: mdl-9252185

ABSTRACT

Helicobacter pylori, strain 26695, has a circular genome of 1,667,867 base pairs and 1,590 predicted coding sequences. Sequence analysis indicates that H. pylori has well-developed systems for motility, for scavenging iron, and for DNA restriction and modification. Many putative adhesins, lipoproteins and other outer membrane proteins were identified, underscoring the potential complexity of host-pathogen interaction. Based on the large number of sequence-related genes encoding outer membrane proteins and the presence of homopolymeric tracts and dinucleotide repeats in coding sequences, H. pylori, like several other mucosal pathogens, probably uses recombination and slipped-strand mispairing within repeats as mechanisms for antigenic variation and adaptive evolution. Consistent with its restricted niche, H. pylori has a few regulatory networks, and a limited metabolic repertoire and biosynthetic capacity. Its survival in acid conditions depends, in part, on its ability to establish a positive inside-membrane potential in low pH.


Subject(s)
Genome, Bacterial , Helicobacter pylori/genetics , Antigenic Variation , Bacterial Adhesion , Bacterial Proteins/metabolism , Base Sequence , Biological Evolution , Cell Division , DNA Repair , DNA, Bacterial/genetics , Gene Expression Regulation, Bacterial , Helicobacter pylori/metabolism , Helicobacter pylori/pathogenicity , Hydrogen-Ion Concentration , Molecular Sequence Data , Protein Biosynthesis , Recombination, Genetic , Transcription, Genetic , Virulence
14.
DNA Seq ; 8(1-2): 17-29, 1997.
Article in English | MEDLINE | ID: mdl-9522117

ABSTRACT

The GeneMark method has proven to be an efficient gene-finding tool for the analysis of prokaryotic genomic sequence data. We have developed a procedure of deriving and utilizing several GeneMark models in order to get better gene-detection performance. Upon applying this procedure to the 1.0 Mb contiguous DNA sequence of Synechocystis sp. strain PCC6803, we were able to cluster predicted genes into distinct classes and to produce the class-specific GeneMark models reflecting statistical characteristics of each gene class. One gene class apparently includes genes of exogenous origin. Using class-specific models reduces the gene underprediction error rate to 1.7%, in comparison with the 8.1% reported in the previous study when only one GeneMark model was used.


Subject(s)
Cyanobacteria/genetics , Genes, Bacterial , Models, Genetic , Software , Bacterial Proteins/genetics , Cyanobacteria/classification , Genome, Bacterial , Multigene Family , Open Reading Frames , Photosynthesis/genetics
15.
Proc Natl Acad Sci U S A ; 93(25): 14648-53, 1996 Dec 10.
Article in English | MEDLINE | ID: mdl-8962108

ABSTRACT

cagA, a gene that codes for an immunodominant antigen, is present only in Helicobacter pylori strains that are associated with severe forms of gastroduodenal disease (type I strains). We found that the genetic locus that contains cagA (cag) is part of a 40-kb DNA insertion that likely was acquired horizontally and integrated into the chromosomal glutamate racemase gene. This pathogenicity island is flanked by direct repeats of 31 bp. In some strains, cag is split into a right segment (cagI) and a left segment (cagII) by a novel insertion sequence (IS605). In a minority of H. pylori strains, cagI and cagII are separated by an intervening chromosomal sequence. Nucleotide sequencing of the 23,508 base pairs that form the cagI region and the extreme 3' end of the cagII region reveals the presence of 19 ORFs that code for proteins predicted to be mostly membrane associated with one gene (cagE), which is similar to the toxin-secretion gene of Bordetella pertussis, ptlC, and the transport systems required for plasmid transfer, including the virB4 gene of Agrobacterium tumefaciens. Transposon inactivation of several of the cagI genes abolishes induction of IL-8 expression in gastric epithelial cell lines. Thus, we believe the cag region may encode a novel H. pylori secretion system for the export of virulence determinants.


Subject(s)
Antigens, Bacterial/genetics , Bacterial Proteins/genetics , Genes, Bacterial , Helicobacter pylori/genetics , Base Sequence , Chromosome Mapping , Evolution, Molecular , Helicobacter pylori/pathogenicity , Molecular Sequence Data , Sequence Analysis , Virulence/genetics
16.
J Mol Biol ; 262(2): 129-39, 1996 Sep 20.
Article in English | MEDLINE | ID: mdl-8831784

ABSTRACT

Five different algorithms have been applied for detecting a DNA sequence pattern hidden in 204 DNA sequences collected from the literature which are experimentally found to be involved in nucleosome formation. Each algorithm was used to perform a multiple alignment of the nucleosome DNA sequences within a 145-nt window, the size of a nucleosome core DNA. From these alignments five pairs of AA and TT dinucleotide positional frequency distributions have been computed. The frequency profiles calculated by different algorithms are rather different due to substantial noise. They, however, share several important features. Both AA and TT dinucleotide positional frequencies display periodicity with a period of 10.3 (± 0.2) bases. TT dinucleotides appear to be distributed symmetrically relative to AA dinucleotides of the same DNA strand, with the center of symmetry at the midpoint of the nucleosome core DNA. The phase shift between the AA and TT patterns is about 6 bp. Superposition of the five pairs of the AA (TT) positional frequency profiles has produced the refined pattern, with the above features well pronounced. An interesting novel feature of the pattern is an absence of central peaks in the periodical AA and TT distributions. This may indicate that the central section of nucleosome DNA, 15 bp around the dyad axis of the nucleosome, is not bent. Positional distributions of other dinucleotides were not found in this study to be as informative as the ones for AA and TT.
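
A dinucleotide positional frequency profile, and a crude check for periodicity in it, can be computed as follows. This is a sketch under simplifying assumptions: the sequences are taken as already aligned and gap-free, only integer lags are tested (so a period such as 10.3 would show up as a broad peak near lag 10), and the function names are hypothetical. The five alignment algorithms from the abstract are not reproduced.

```python
def dinucleotide_profile(aligned_seqs, dinucleotide):
    """Fraction of sequences with `dinucleotide` starting at each alignment position."""
    length = min(len(s) for s in aligned_seqs)
    profile = []
    for pos in range(length - 1):
        hits = sum(1 for s in aligned_seqs if s[pos:pos + 2].upper() == dinucleotide)
        profile.append(hits / len(aligned_seqs))
    return profile

def autocorrelation(profile, lag):
    """Mean-centered, unnormalized autocorrelation of the profile at an integer lag."""
    mean = sum(profile) / len(profile)
    centered = [x - mean for x in profile]
    return sum(centered[i] * centered[i + lag] for i in range(len(centered) - lag))

if __name__ == "__main__":
    # Toy alignment with AA placed every 10 positions (illustrative only).
    seqs = ["AACCCCCCCCAACCCCCCCC", "AACCCCCCCCAACCCCCCCC"]
    aa = dinucleotide_profile(seqs, "AA")
    for lag in (5, 10):
        print(lag, autocorrelation(aa, lag))
```

A lag with a clearly higher autocorrelation than its neighbors suggests a repeating spacing of that length in the profile; the same profile computed for TT would allow the phase shift between the two patterns to be examined.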


Subject(s)
DNA , Nucleosomes/genetics , Sequence Alignment , Algorithms , Base Sequence , Chromosome Mapping , Databases, Factual , Structure-Activity Relationship
17.
Comput Chem ; 20(1): 123-33, 1996 Mar.
Article in English | MEDLINE | ID: mdl-16749185

ABSTRACT

We have explored the performance of the GeneMark gene identification method using cross-validation over learning samples of E. coli DNA sequences. The computations gave more accurate estimations of the error rates in comparison with previous results when a sample of non-coding regions was derived from GenBank sequences with many true coding regions unannotated. The error rate components have been classified and delineated. It was shown that the method performs differently on class I, II and III genes. The most frequent errors come from misinterpreting the coding potential of the complementary sequence in the same frame. The effects of stop-codons present in alternative frames were also studied to understand better the main factors contributing to GeneMark performance.


Subject(s)
Genes/genetics , Computational Biology , Escherichia coli/genetics , Open Reading Frames/genetics , Statistics as Topic
18.
Curr Biol ; 6(3): 279-91, 1996 Mar 01.
Article in English | MEDLINE | ID: mdl-8805245

ABSTRACT

BACKGROUND: The 1.83 Megabase (Mb) sequence of the Haemophilus influenzae chromosome, the first completed genome sequence of a cellular life form, has been recently reported. Approximately 75% of the 4.7 Mb genome sequence of Escherichia coli is also available. The lifestyles of the two bacteria are very different: H. influenzae is an obligate parasite that lives in human upper respiratory mucosa and can be cultivated only on rich media, whereas E. coli is a saprophyte that can grow on minimal media. A detailed comparison of the protein products encoded by these two genomes is expected to provide valuable insights into bacterial cell physiology and genome evolution. RESULTS: We describe the results of computer analysis of the amino-acid sequences of 1703 putative proteins encoded by the complete genome of H. influenzae. We detected sequence similarity to proteins in current databases for 92% of the H. influenzae protein sequences, and at least a general functional prediction was possible for 83%. A comparison of the H. influenzae protein sequences with those of 3010 proteins encoded by the sequenced 75% of the E. coli genome revealed 1128 pairs of apparent orthologs, with an average of 59% identity. In contrast to the high similarity between orthologs, the genome organization and the functional repertoire of genes in the two bacteria were remarkably different. The smaller genome size of H. influenzae is explained, to a large extent, by a reduction in the number of paralogous genes. There was no long range colinearity between the E. coli and H. influenzae gene orders, but over 70% of the orthologous genes were found in short conserved strings, only about half of which were operons in E. coli. Superposition of the H. influenzae enzyme repertoire upon the known E. coli metabolic pathways allowed us to reconstruct similar and alternative pathways in H. influenzae and provides an explanation for the known nutritional requirements.
CONCLUSIONS: By comparing proteins encoded by the two bacterial genomes, we have shown that extensive gene shuffling and variation in the extent of gene paralogy are major trends in bacterial evolution; this comparison has also allowed us to deduce crucial aspects of the largely uncharacterized metabolism of H. influenzae.


Subject(s)
Bacterial Proteins/metabolism , Escherichia coli/genetics , Genome, Bacterial , Haemophilus influenzae/genetics , Haemophilus influenzae/metabolism , Bacterial Proteins/chemistry , Biological Evolution , Conserved Sequence , DNA, Bacterial , Molecular Sequence Data
19.
J Cell Biol ; 132(5): 835-48, 1996 Mar.
Article in English | MEDLINE | ID: mdl-8603916

ABSTRACT

Mutations in the Caenorhabditis elegans gene unc-89 result in nematodes having disorganized muscle structure in which thick filaments are not organized into A-bands, and there are no M-lines. Beginning with a partial cDNA from the C. elegans sequencing project, we have cloned and sequenced the unc-89 gene. An unc-89 allele, st515, was found to contain an 84-bp deletion and a 10-bp duplication, resulting in an in-frame stop codon within predicted unc-89 coding sequence. Analysis of the complete coding sequence for unc-89 predicts a novel 6,632 amino acid polypeptide consisting of sequence motifs which have been implicated in protein-protein interactions. UNC-89 begins with 67 residues of unique sequence, followed by SH3, dbl/CDC24, and PH domains, 7 immunoglobulin (Ig) domains, and a putative KSP-containing multiphosphorylation domain, and ends with 46 Ig domains. A polyclonal antiserum raised to a portion of unc-89 encoded sequence reacts to a twitchin-sized polypeptide from wild type, but to truncated polypeptides from st515 and from the amber allele e2338. By immunofluorescent microscopy, this antiserum localizes to the middle of A-bands, consistent with UNC-89 being a structural component of the M-line. Previous studies indicate that myofilament lattice assembly begins with positional cues laid down in the basement membrane and muscle cell membrane. We propose that the intracellular protein UNC-89 responds to these signals, localizes, and then participates in assembling an M-line.


Subject(s)
Caenorhabditis elegans Proteins , Caenorhabditis elegans/genetics , Genes, Helminth , Helminth Proteins/genetics , Muscle Development , Muscle Proteins/genetics , Amino Acid Sequence , Animals , Antibody Specificity , Base Sequence , Blotting, Western , Caenorhabditis elegans/anatomy & histology , Cell Compartmentation , Cloning, Molecular , Fluorescent Antibody Technique, Indirect , Helminth Proteins/immunology , Helminth Proteins/isolation & purification , Immunoglobulins/genetics , Molecular Sequence Data , Muscle Proteins/immunology , Muscle Proteins/isolation & purification , Muscles/ultrastructure , Mutation , Nucleic Acid Hybridization , Protein Conformation , RNA, Messenger/genetics , Sequence Analysis, DNA , Sequence Homology, Amino Acid , Signal Transduction/genetics
20.
Article in English | MEDLINE | ID: mdl-8877516

ABSTRACT

This paper is intended to bridge the gap between practical experience in using GeneMark for a rapidly widening repertoire of genomes, and the available publications that determine and compare the gene prediction accuracy of the GeneMark method for different genomes. Here we focus on the genome-specific variability of prediction error rates and their sources. DNA sequence inhomogeneity is present both in training and control sets of coding and non-coding regions. Coding region inhomogeneity, caused by differences in sequence composition between "native" and horizontally transferred genes or between genes expressed at different levels, contributes to the false negative error rate. Inhomogeneity of non-coding regions may frequently be caused by the presence of unnoticed genes and contributes to the false positive error rate. We have documented such unnoticed genes in GenBank sequences for several species. Some of the protein products of these genes have been characterized by similarity search methods. For others, which we call "pioneer genes", no significant similarity has been found at the protein sequence level although the confidence of GeneMark prediction is high. For instance, a majority of the pioneer gene predictions made for E. coli now show strong similarity to more recently characterized proteins that have been added to protein sequence databases. Another practical question is related to genomic sequence inhomogeneity at the interspecies level: if GeneMark has not been trained for a particular species, is it possible to apply models derived for phylogenetically close genomes? The answer is yes. The results of cross-species gene prediction experiments show that cross-species prediction can often be reasonably accurate.


Subject(s)
Chromosome Mapping/methods , Sequence Analysis, DNA/methods , Algorithms , Amino Acid Sequence , DNA, Bacterial/chemistry , False Negative Reactions , False Positive Reactions , Software