1.
Science ; 291(5507): 1304-51, 2001 02 16.
Article in English | MEDLINE | ID: mdl-11181995

ABSTRACT

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
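
The coverage figures quoted in this abstract follow from simple arithmetic. The sketch below uses only the rounded numbers given above to compute fold coverage and mean read length, and tiles a sequence into overlapping fixed-length segments as a stand-in for the shredding step. The function names and the 190-bp step (chosen so that 550/190 is close to 2.9) are illustrative assumptions, not details taken from the paper.

```python
def fold_coverage(total_bases, genome_size):
    """Average number of times each genome base is covered."""
    return total_bases / genome_size


def shred(sequence, segment_len=550, step=190):
    """Tile a sequence into overlapping fixed-length segments.

    With segment_len / step close to 2.9, interior bases are covered
    roughly 2.9 times; the two ends of the sequence are covered less.
    The step value is a hypothetical choice for illustration.
    """
    last_start = max(len(sequence) - segment_len, 0)
    return [sequence[i:i + segment_len] for i in range(0, last_start + 1, step)]


# Rounded figures from the abstract.
read_bases = 14.8e9
reads = 27_271_853
consensus_length = 2.91e9

print(round(fold_coverage(read_bases, consensus_length), 1))  # fold coverage
print(round(read_bases / reads))                              # mean read length in bp
```

With these rounded inputs the coverage comes out at about 5.1, in line with the quoted 5.11-fold figure, and the mean read length is a little under 550 bp.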


Subject(s)
Human genome , Human Genome Project , DNA sequence analysis , Algorithms , Animals , Chromosome banding , Chromosome mapping , Bacterial artificial chromosomes , Computational biology , Consensus sequence , CpG islands , Intergenic DNA , Factual databases , Molecular evolution , Exons , Female , Gene duplication , Genes , Genetic variation , Humans , Introns , Male , Phenotype , Physical chromosome mapping , Single nucleotide polymorphism , Proteins/genetics , Proteins/physiology , Pseudogenes , Repetitive nucleic acid sequences , Retroelements , DNA sequence analysis/methods , Species specificity
5.
Pac Symp Biocomput ; : 217-27, 1998.
Article in English | MEDLINE | ID: mdl-9697184

ABSTRACT

This paper presents a computer system for analyzing and annotating large-scale genomic sequences. The core of the system is a multiple-gene structure identification program, which predicts the most "probable" gene structures based on the given evidence, including pattern recognition, EST and protein homology information. A graphics-based user interface provides an environment which allows the user to interactively control the evidence to be used in the gene identification process. To overcome the computational bottleneck in the database similarity search used in the gene identification process, we have developed an effective way to partition a database into a set of sub-databases of "related" sequences, and reduced the search problem on a large database to a signature identification problem and a search problem on a much smaller sub-database. This reduces the number of sequences to be searched from N to O(√N) on average, and hence greatly reduces the search time, where N is the number of sequences in the original database. The system provides the user with the ability to facilitate and modify the analysis and modeling in real time.
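
The two-stage search described in this abstract can be illustrated with a toy example. The abstract does not say how the sub-databases or their signatures are built, so the sketch below makes its own assumptions: sequences are sorted and cut into chunks of about √N members, each chunk's signature is the union of its 3-mers, and similarity is plain k-mer overlap. All function names are hypothetical.

```python
import math


def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_partitions(database, k=3):
    """Split N sequences into roughly sqrt(N) sub-databases with signatures.

    Sorting and chunking is an assumed stand-in for grouping "related"
    sequences; the signature is the union of the members' k-mers.
    """
    size = max(1, math.isqrt(len(database)))
    ordered = sorted(database)
    parts = []
    for i in range(0, len(ordered), size):
        members = ordered[i:i + size]
        signature = set().union(*(kmers(s, k) for s in members))
        parts.append((signature, members))
    return parts


def search(query, parts, k=3):
    """Pick the best-matching signature, then search only that sub-database."""
    q = kmers(query, k)
    _, members = max(parts, key=lambda p: len(q & p[0]))
    best = max(members, key=lambda s: len(q & kmers(s, k)))
    comparisons = len(parts) + len(members)
    return best, comparisons


database = [a * 4 + b * 4 for a in "ACGT" for b in "ACGT"]  # 16 toy sequences
parts = build_partitions(database)
print(search("GGGGTTTT", parts))
```

For the 16 toy sequences, the query is compared with 4 signatures and then 4 members, so 8 comparisons replace a linear scan of 16, which is the N to about 2√N reduction the abstract refers to.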


Subject(s)
Base sequence , Computer graphics , DNA/chemistry , DNA/genetics , Factual databases , Genome , Genetic models , Computer simulation , Exons , Expressed sequence tags , Automated pattern recognition , Software
6.
Article in English | MEDLINE | ID: mdl-9322060

ABSTRACT

Computational methods for gene identification in genomic sequences typically have two phases: coding region prediction and gene parsing. While there are many effective methods for predicting coding regions (exons), parsing the predicted exons into proper gene structures, to a large extent, remains an unsolved problem. This paper presents an algorithm for inferring gene structures from predicted exon candidates, based on Expressed Sequence Tags (ESTs) and biological intuition/rules. The algorithm first finds all the related ESTs in the EST database (dbEST) for each predicted exon, and infers the boundaries of one or a series of genes based on the available EST information and biological rules. Then it constructs gene models within each pair of gene boundaries, that are most consistent with the EST information. By exploiting EST information and biological rules, the algorithm can (1) model complicated multiple gene structures, including embedded genes, (2) identify falsely-predicted exons and locate missed exons, and (3) make more accurate exon boundary predictions. The algorithm has been implemented and tested on long genomic sequences with a number of genes. Test results show that very accurate (predicted) gene models can be expected when related ESTs exist for the predicted exons.
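
One ingredient of this approach, using shared ESTs to decide which predicted exons belong to the same gene, can be sketched as a connected-components problem. The code below is a simplified illustration and not the published algorithm: it ignores the biological rules and the boundary refinement mentioned in the abstract, and the input format and function name are assumptions.

```python
def group_exons_by_est(exon_ests):
    """Group predicted exons into candidate genes.

    exon_ests: list of (exon_id, set of matching EST ids) in genomic order.
    Exons that share at least one EST, directly or through a chain of
    shared ESTs, end up in the same group.
    """
    parent = {exon: exon for exon, _ in exon_ests}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_exon_for_est = {}
    for exon, ests in exon_ests:
        for est in ests:
            if est in first_exon_for_est:
                parent[find(exon)] = find(first_exon_for_est[est])
            else:
                first_exon_for_est[est] = exon

    groups = {}
    for exon, _ in exon_ests:
        groups.setdefault(find(exon), []).append(exon)
    return list(groups.values())


exons = [
    ("e1", {"EST1"}),
    ("e2", {"EST1", "EST2"}),
    ("e3", {"EST2"}),
    ("e4", {"EST3"}),
    ("e5", set()),  # no EST support: stays on its own
]
print(group_exons_by_est(exons))
```

Exons without any EST match remain singletons here; in the approach the abstract describes, such exons would presumably be examined further as possible false predictions.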


Subject(s)
Algorithms , Gene expression , Genetic techniques , Human genome , DNA/genetics , Factual databases , Exons , Humans , Genetic models , Software
8.
J Comput Biol ; 3(3): 333-44, 1996.
Article in English | MEDLINE | ID: mdl-8891953

ABSTRACT

Insertion and deletion (indel) sequencing errors in DNA coding regions disrupt DNA-to-protein translation frames, and hence make most frame-sensitive coding recognition approaches fail. This paper extends the authors' previous work on indel detection and "correction" algorithms, and presents a more effective algorithm for localizing indels that appear in DNA coding regions and "correcting" the located indels by inserting or deleting DNA bases. The algorithm localizes indels by discovering changes of the preferred translation frames within presumed coding regions, and then "corrects" them to restore a consistent translation frame within each coding region. An iterative strategy is exploited to repeatedly localize and "correct" indels until no more indels can be found. Test results have shown that this improved algorithm can detect and "correct" more indels while not worsening the rate of introduction of false indels when compared to the authors' previous work.


Subject(s)
Algorithms , DNA sequence analysis/methods , DNA transposable elements , Humans , Sequence deletion
9.
Comput Appl Biosci ; 11(2): 117-24, 1995 Apr.
Article in English | MEDLINE | ID: mdl-7620982

ABSTRACT

This paper presents an algorithm for detecting and 'correcting' sequencing errors that occur in DNA coding regions. The types of sequencing errors addressed are insertions and deletions (indels) of DNA bases. The goal is to provide a capability which makes single-pass or low-redundancy sequence data more informative, reducing the need for high-redundancy sequencing for gene identification and characterization purposes. This would permit improved sequencing efficiency and reduce genome sequencing costs. The algorithm detects sequencing errors by discovering changes in the statistically preferred reading frame within a putative coding region and then inserts a number of 'neutral' bases at a perceived reading frame transition point to make the putative exon candidate frame consistent. We have implemented the algorithm as a front-end subsystem of the GRAIL DNA sequence analysis system to construct a version which is very error tolerant and also intend to use this as a testbed for further development of sequencing error-correction technology. Preliminary test results have shown the usefulness of this algorithm and also exhibited some of its weaknesses, providing possible directions for further improvement. On a test set consisting of 68 human DNA sequences with 1% randomly generated indels in coding regions, the algorithm detected and corrected 76% of the indels. The average distance between the position of an indel and the predicted one was 9.4 bases. With this subsystem in place, GRAIL correctly predicted 89% of the coding messages with 10% false messages on the 'corrected' sequences, compared to 69% correctly predicted coding messages and 11% falsely predicted messages on the 'corrupted' sequences using the standard GRAIL II method (version 1.2). (ABSTRACT TRUNCATED AT 250 WORDS)
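
The correction step described in this abstract, inserting neutral bases where the preferred reading frame changes, comes down to modular arithmetic on frame offsets. The sketch below assumes the per-window preferred frames have already been estimated by some statistical scoring, which is not shown. The function names, the use of 'N' as the neutral base, and the convention that a frame is the codon start offset modulo 3 are all illustrative assumptions.

```python
def frame_transitions(window_frames):
    """Return (index, frame_before, frame_after) wherever the preferred frame changes."""
    return [
        (i, window_frames[i - 1], window_frames[i])
        for i in range(1, len(window_frames))
        if window_frames[i] != window_frames[i - 1]
    ]


def restore_frame(seq, position, upstream_frame, downstream_frame, neutral="N"):
    """Insert neutral bases at `position` so downstream codons rejoin the upstream frame.

    Inserting m bases shifts every downstream codon offset by m, so the
    frames agree again when m = (upstream_frame - downstream_frame) mod 3.
    """
    m = (upstream_frame - downstream_frame) % 3
    return seq[:position] + neutral * m + seq[position:]


# ATG GCC AAA TTT GGG with one base of AAA deleted: downstream codons move to frame 2.
corrupted = "ATGGCCAATTTGGG"
fixed = restore_frame(corrupted, position=7, upstream_frame=0, downstream_frame=2)
print(fixed)
```

The repaired sequence reads ATG GCC ANA TTT GGG: the codon containing the error stays ambiguous, but every codon after it is back in frame, which is what a frame-sensitive coding recognizer needs.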


Subject(s)
DNA sequence analysis/standards , Software , Algorithms , Exons , Humans , Protein biosynthesis , DNA sequence analysis/methods
11.
Article in English | MEDLINE | ID: mdl-7584472

ABSTRACT

An important open problem in molecular biology is how to use computational methods to understand the structure and function of proteins given only their primary sequences. We describe and evaluate an original machine-learning approach to classifying protein sequences according to their structural folding class. Our work is novel in several respects: we use a set of protein classes that previously have not been used for classifying primary sequences, and we use a unique set of attributes to represent protein sequences to the learners. We evaluate our approach by measuring its ability to correctly classify proteins that were not in its training set. We compare our input representation to a commonly used input representation, amino acid composition, and show that our approach more accurately classifies proteins that have very limited homology to the sequences on which the systems are trained.
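
The baseline representation named in this abstract, amino acid composition, is easy to state concretely: each protein becomes a 20-element vector of residue frequencies. The sketch below shows only that baseline; the authors' own attribute set is not described in the abstract and is not reproduced here. The function name and the alphabetical one-letter ordering are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Characters outside the 20 standard residues are ignored.
    """
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counts]


vector = amino_acid_composition("AAC")
print(vector[:2])  # fractions of A and C
```

Because the vector discards residue order, two proteins with very different folds can share nearly the same composition, which is one reason a richer representation could help on sequences with limited homology.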


Subject(s)
Amino acid sequence , Protein folding , Protein secondary structure , Proteins/chemistry , Algorithms , Factual databases , Decision trees , Proteins/metabolism , Amino acid sequence homology
12.
Comput Appl Biosci ; 10(6): 613-23, 1994 Dec.
Article in English | MEDLINE | ID: mdl-7704660

ABSTRACT

This paper presents a computationally efficient algorithm, the Gene Assembly Program III (GAP III), for constructing gene models from a set of accurately-predicted 'exons'. The input to the algorithm is a set of clusters of exon candidates, generated by a new version of the GRAIL coding region recognition system. The exon candidates of a cluster differ in their presumed edges and occasionally in their reading frames. Each exon candidate has a numerical score representing its 'probability' of being an actual exon. GAP III uses a dynamic programming algorithm to construct a gene model, complete or partial, by optimizing a predefined objective function. The optimal gene models constructed by GAP III correspond very well with the structures of genes which have been determined experimentally and reported in the Genome Sequence Database (GSDB). On a test set of 137 human and mouse DNA sequences consisting of 954 true exons, GAP III constructed 137 gene models using 892 exons, among which 859 (859/954 = 90%) are true exons and 33 (33/892 = 3%) are false positive. Among the 859 true positives, 635 (74%) match the actual exons exactly, and 838 (98%) have at least one edge correct. GAP III is computationally efficient. If we use E and C to represent the total number of exon candidates in all clusters and the number of clusters, respectively, the running time of GAP III is proportional to (E x C).
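
The idea of assembling a gene model by dynamic programming over scored exon candidates can be sketched as follows. This is not GAP III itself: the abstract does not give the objective function, so the sketch simply maximizes the sum of candidate scores, takes at most one candidate per cluster, forbids overlaps, and ignores reading-frame compatibility. It is also quadratic in the number of candidates rather than proportional to E x C. The tuple layout and function name are assumptions.

```python
def build_gene_model(candidates):
    """Pick the highest-scoring chain of non-overlapping exon candidates.

    candidates: list of (cluster, start, end, score) tuples, with cluster
    indices increasing along the sequence. At most one candidate is taken
    from each cluster.
    """
    # Process candidates in order of end position so predecessors come first.
    order = sorted(range(len(candidates)), key=lambda i: candidates[i][2])
    best, prev = {}, {}
    for pos, i in enumerate(order):
        cluster, start, _, score = candidates[i]
        best[i], prev[i] = score, None
        for j in order[:pos]:
            cluster_j, _, end_j, _ = candidates[j]
            if end_j < start and cluster_j < cluster and best[j] + score > best[i]:
                best[i], prev[i] = best[j] + score, j
    if not best:
        return 0.0, []
    # Trace back from the best-scoring chain end.
    i = max(best, key=best.get)
    total, chain = best[i], []
    while i is not None:
        chain.append(candidates[i])
        i = prev[i]
    return total, chain[::-1]


candidates = [
    (0, 100, 200, 0.9), (0, 100, 210, 0.6),
    (1, 300, 400, 0.8), (1, 310, 400, 0.85),
    (2, 390, 500, 0.7),  # overlaps both cluster-1 candidates
]
print(build_gene_model(candidates))
```

In this toy input the third cluster's candidate overlaps those of the second, so the best chain takes one candidate from each of the first two clusters and drops the third.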


Subject(s)
Algorithms , Exons , Genetic models , Software , Animals , Humans , Mice , Software design
14.
Article in English | MEDLINE | ID: mdl-7584416

ABSTRACT

A new version of the GRAIL system (Uberbacher and Mural, 1991; Mural et al., 1992; Uberbacher et al., 1993), called GRAIL II, has recently been developed (Xu et al., 1994). GRAIL II is a hybrid AI system that supports a number of DNA sequence analysis tools including protein-coding region recognition, PolyA site and transcription promoter recognition, gene model construction, translation to protein, and DNA/protein database searching capabilities. This paper presents the core of GRAIL II, the coding exon recognition and gene model construction algorithms. The exon recognition algorithm recognizes coding exons by combining coding feature analysis and edge signal (acceptor/donor/translation-start sites) detection. Unlike the original GRAIL system (Uberbacher and Mural, 1991; Mural et al., 1992), this algorithm uses variable-length windows tailored to each potential exon candidate, making its performance almost exon length-independent. In this algorithm, the recognition process is divided into four steps. Initially a large number of possible coding exon candidates are generated. Then a rule-based prescreening algorithm eliminates the majority of the improbable candidates. As the kernel of the recognition algorithm, three neural networks are trained to evaluate the remaining candidates. The outputs of the neural networks are then divided into clusters of candidates, corresponding to presumed exons. The algorithm makes its final prediction by picking the best candidate from each cluster. The gene construction algorithm (Xu, Mural and Uberbacher, 1994) uses a dynamic programming approach to build gene models by using as input the clusters predicted by the exon recognition algorithm. Extensive testing has been done on these two algorithms. (ABSTRACT TRUNCATED AT 250 WORDS)
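
The last two steps of the recognition process, clustering scored candidates and picking the best one per cluster, can be sketched briefly. The candidate generation, rule-based prescreening, and neural networks are not reproduced: the scores below simply stand in for network outputs, and grouping candidates by interval overlap is an assumed clustering rule, since the abstract does not say how clusters are formed. Function names are hypothetical.

```python
def cluster_candidates(candidates):
    """Group (start, end, score) candidates whose intervals overlap."""
    clusters = []
    cluster_end = None
    for cand in sorted(candidates):  # sorted by start position
        start, end, _ = cand
        if cluster_end is not None and start <= cluster_end:
            clusters[-1].append(cand)
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([cand])
            cluster_end = end
    return clusters


def pick_best(clusters):
    """Return the highest-scoring candidate from each cluster."""
    return [max(cluster, key=lambda c: c[2]) for cluster in clusters]


scored = [
    (100, 200, 0.7), (110, 200, 0.9), (105, 190, 0.4),  # variants of one presumed exon
    (400, 480, 0.6), (410, 480, 0.5),                   # variants of another
]
print(pick_best(cluster_candidates(scored)))
```

Keeping the whole cluster rather than only its best member matters for the later gene construction step, which takes the clusters as input and may prefer a lower-scoring variant with different edges.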


Subject(s)
DNA/analysis , Exons , Algorithms , Humans , Software
15.
J Protein Chem ; 12(2): 207-13, 1993 Apr.
Article in English | MEDLINE | ID: mdl-8387794

ABSTRACT

Based on selective labeling by ATP analogues, Lys68 of the Calvin Cycle enzyme phosphoribulokinase (PRK) from spinach has been assigned to the active-site region [Miziorko et al. (1990), J. Biol. Chem. 265, 3642-3647]. The equivalent position is occupied by lysyl or arginyl residues in the PRK from both prokaryotic and eukaryotic sources, suggesting a requirement for a basic residue at this location. To examine this possibility, we have replaced Lys68 of the spinach enzyme with arginyl, glutaminyl, alanyl, or glutamyl residues by site-directed mutagenesis. All of the mutant enzymes retain substantial kinase activity; and even in the case of the radical substitution by glutamate, the Km values for ATP and ribulose 5-phosphate are not perturbed significantly. Glutamate at position-68 may destabilize tertiary structure, because the yield of this mutant protein from transformed E. coli is quite low compared to that of the other proteins in this series. Despite the active-site proximity of Lys68, our results show that this residue does not play a key role in catalysis or substrate binding.


Subject(s)
Lysine/metabolism , Phosphotransferases (Alcohol Group Acceptor) , Phosphotransferases/metabolism , Plants/enzymology , Adenosine triphosphate/chemistry , Amino acid sequence , Base sequence , Binding sites , Molecular cloning , Escherichia coli , Lysine/chemistry , Molecular sequence data , Site-directed mutagenesis , Oligonucleotides , Phosphotransferases/chemistry , Phosphotransferases/genetics , Ribulose phosphates/chemistry , Bacterial transformation
16.
J Biol Chem ; 267(12): 8452-7, 1992 Apr 25.
Article in English | MEDLINE | ID: mdl-1569095

ABSTRACT

Crystallographic studies of ribulose-1,5-bisphosphate carboxylase/oxygenase from Rhodospirillum rubrum suggest that active-site Asn111 interacts with Mg2+ and/or substrate (Lundqvist, T., and Schneider, G. (1991) J. Biol. Chem. 266, 12604-12611). To examine possible catalytic roles of Asn111, we have used site-directed mutagenesis to replace it with a glutaminyl, aspartyl, seryl, or lysyl residue. Although the mutant proteins are devoid of detectable carboxylase activity, their ability to form a quaternary complex comprised of CO2, Mg2+, and a reaction-intermediate analogue is indicative of competence in activation chemistry and substrate binding. The mutant proteins retain enolization activity, as measured by exchange of the C3 proton of ribulose bisphosphate with solvent, thereby demonstrating a preferential role of Asn111 in some later step of overall catalysis. The active sites of this homodimeric enzyme are formed by interactive domains from adjacent subunits (Larimer, F. W., Lee, E. H., Mural, R. J., Soper, T. S., and Hartman, F. C. (1987) J. Biol. Chem. 262, 15327-15329). Crystallography assigns Asn111 to the amino-terminal domain of the active site (Knight, S., Anderson, I., and Brändén, C.-I. (1990) J. Mol. Biol. 215, 113-160). The observed formation of enzymatically active heterodimers by the in vivo hybridization of an inactive position-111 mutant with inactive carboxyl-terminal domain mutants is consistent with this assignment.


Subject(s)
Asparagine/genetics , Rhodospirillum rubrum/enzymology , Ribulose-bisphosphate carboxylase/metabolism , Amino acid sequence , Base sequence , Binding sites , Polyacrylamide gel electrophoresis , Escherichia coli/genetics , Molecular sequence data , Site-directed mutagenesis , Ribulose-bisphosphate carboxylase/genetics
17.
Trends Biotechnol ; 10(1-2): 66-9, 1992.
Article in English | MEDLINE | ID: mdl-1367939

ABSTRACT

The ultimate goal of the Human Genome Project is to extract the biologically relevant information recorded in the estimated 100,000 genes encoded by the 3 x 10^9 bases of the human genome. This necessitates development of reliable computer-based methods capable of analysing and correctly identifying genes in the vast amounts of DNA-sequence data generated. Such tools may save time and labour by simplifying, for example, screening of cDNA libraries. They may also facilitate the localization of human disease genes by identifying candidate genes in promising regions of anonymous DNA sequence.


Subject(s)
Artificial intelligence , Base sequence , DNA/genetics , Factual databases , Human Genome Project , Molecular sequence data
18.
Proc Natl Acad Sci U S A ; 88(24): 11261-5, 1991 Dec 15.
Article in English | MEDLINE | ID: mdl-1763041

ABSTRACT

Genes in higher eukaryotes may span tens or hundreds of kilobases with the protein-coding regions accounting for only a few percent of the total sequence. Identifying genes within large regions of uncharacterized DNA is a difficult undertaking and is currently the focus of many research efforts. We describe a reliable computational approach for locating protein-coding portions of genes in anonymous DNA sequence. Using a concept suggested by robotic environmental sensing, our method combines a set of sensor algorithms and a neural network to localize the coding regions. Several algorithms that report local characteristics of the DNA sequence, and therefore act as sensors, are also described. In its current configuration the "coding recognition module" identifies 90% of coding exons of length 100 bases or greater with less than one false positive coding exon indicated per five coding exons indicated. This is a significantly lower false positive rate than any method of which we are aware. This module demonstrates a method with general applicability to sequence-pattern recognition problems and is available for current research efforts.
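
The architecture described in this abstract, several sensor algorithms whose outputs are combined by a neural network, can be illustrated with a deliberately small example. The two sensors below (G+C fraction and the share of reading frames free of stop codons) and the single logistic unit with hand-picked weights are assumptions chosen for readability. They are not the sensors, network, or trained parameters of the published module.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_sensor(window):
    """Fraction of G and C bases in a non-empty window."""
    return sum(base in "GC" for base in window) / len(window)


def open_frame_sensor(window):
    """Fraction of the three forward reading frames with no stop codon."""
    open_frames = 0
    for frame in range(3):
        codons = [window[i:i + 3] for i in range(frame, len(window) - 2, 3)]
        if not any(codon in STOP_CODONS for codon in codons):
            open_frames += 1
    return open_frames / 3


def coding_score(window, weights=(2.0, 3.0), bias=-2.5):
    """Combine sensor outputs with one logistic unit (illustrative weights)."""
    inputs = (gc_sensor(window), open_frame_sensor(window))
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))


print(coding_score("ATGGCCGCGGGCCTGGCC"))  # G+C rich, no stop codons in any frame
print(coding_score("ATTAATAGATTTAAATGA"))  # A+T rich, stop codons in two frames
```

In a real system the weights would be learned from annotated sequence and the score would be computed over a sliding window. The reported figure of fewer than one false positive per five indicated exons corresponds to a precision above 80%.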


Subject(s)
Chromosome mapping , Human chromosomes , DNA/genetics , Hominidae/genetics , Genetic models , Proteins/genetics , Animals , Base sequence , Factual databases , Enzymes/genetics , ras genes , Humans , Molecular sequence data
19.
J Biol Chem ; 266(16): 10694-9, 1991 Jun 05.
Article in English | MEDLINE | ID: mdl-1645355

ABSTRACT

The Calvin Cycle enzyme phosphoribulokinase is activated in higher plants by the reversible reduction of a disulfide bond, which is located at the active site. To determine the possible contribution of the two regulatory residues (Cys16 and Cys55) to catalysis, site-directed mutagenesis has been used to replace each of them in the spinach enzyme with serine or alanine. The only other cysteinyl residues of the kinase, Cys244 and Cys250, were also replaced individually by serine or alanine. A comparison of specific activities of native and mutant enzymes reveals that substitutions at positions 244 or 250 are inconsequential. The position 16 mutants retain 45-90% of the wild-type activity and display normal Km values for both ATP and ribulose 5-phosphate. In contrast, substitution at position 55 results in 85-95% loss of wild-type activity, with less than a 2-fold increase in the Km for ATP and a 4-8-fold increase in the Km for ribulose 5-phosphate. These results are consistent with moderate facilitation of catalysis by Cys55 and demonstrate that the other three cysteinyl residues do not contribute significantly either to structure or catalysis. The enhanced stability, relative to wild-type enzyme, of the Ser16 mutant protein to a sulfhydryl reagent supports earlier suggestions that Cys16 is the initial target of the oxidative deactivation process.


Subject(s)
Cysteine/genetics , Site-directed mutagenesis , Phosphotransferases (Alcohol Group Acceptor) , Phosphotransferases/genetics , Base sequence , Western blotting , DNA/genetics , Polyacrylamide gel electrophoresis , N-Ethylmaleimide/pharmacology , Kinetics , Molecular sequence data , Mutation , Phosphotransferases/antagonists and inhibitors , Plants/enzymology