1.
Science ; 287(5461): 2196-204, 2000 Mar 24.
Article in English | MEDLINE | ID: mdl-10731133

ABSTRACT

We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.


Subject(s)
Computational Biology , Drosophila melanogaster/genetics , Genome , Sequence Analysis, DNA , Algorithms , Animals , Chromatin/genetics , Contig Mapping , Euchromatin , Genes, Insect , Heterochromatin/genetics , Molecular Sequence Data , Physical Chromosome Mapping , Repetitive Sequences, Nucleic Acid , Sequence Tagged Sites
2.
Science ; 291(5507): 1304-51, 2001 02 16.
Article in English | MEDLINE | ID: mdl-11181995

ABSTRACT

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies (a whole-genome assembly and a regional chromosome assembly) were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.


Subject(s)
Genome, Human , Human Genome Project , Sequence Analysis, DNA , Algorithms , Animals , Chromosome Banding , Chromosome Mapping , Chromosomes, Artificial, Bacterial , Computational Biology , Consensus Sequence , CpG Islands , DNA, Intergenic , Databases, Factual , Evolution, Molecular , Exons , Female , Gene Duplication , Genes , Genetic Variation , Humans , Introns , Male , Phenotype , Physical Chromosome Mapping , Polymorphism, Single Nucleotide , Proteins/genetics , Proteins/physiology , Pseudogenes , Repetitive Sequences, Nucleic Acid , Retroelements , Sequence Analysis, DNA/methods , Species Specificity
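
The coverage figures quoted in the abstract of entry 2 can be checked with simple arithmetic. The sketch below is illustrative only: the variable names are made up here, the inputs are the rounded numbers printed in the abstract, and summing the two coverage values is an approximation, since the 2.9-fold public coverage applies only to the regions the public effort had sequenced.

```python
# Figures as printed (rounded) in the abstract; names are illustrative.
READS = 27_271_853               # high-quality sequence reads
TOTAL_READ_BP = 14.8e9           # total sequenced bases
GENOME_BP = 2.91e9               # euchromatic consensus length
REPORTED_CELERA_COVERAGE = 5.11  # fold coverage stated in the abstract
SHREDDED_PUBLIC_COVERAGE = 2.9   # fold coverage from shredded public data


def fold_coverage(total_bp: float, genome_bp: float) -> float:
    """Average number of times each genome base is covered by read bases."""
    return total_bp / genome_bp


# Mean read length implied by the totals (a little over 540 bp).
mean_read_length = TOTAL_READ_BP / READS

# Coverage recomputed from rounded inputs; it lands near the reported
# 5.11-fold but not exactly on it, because the inputs are rounded.
celera_coverage = fold_coverage(TOTAL_READ_BP, GENOME_BP)

# Approximate effective coverage ("eightfold" in the abstract).
combined_coverage = REPORTED_CELERA_COVERAGE + SHREDDED_PUBLIC_COVERAGE

if __name__ == "__main__":
    print(f"mean read length  ~ {mean_read_length:.0f} bp")
    print(f"Celera coverage   ~ {celera_coverage:.2f}-fold")
    print(f"combined coverage ~ {combined_coverage:.2f}-fold")
```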
3.
J Comput Biol ; 5(4): 667-80, 1998.
Article in English | MEDLINE | ID: mdl-10072083

ABSTRACT

MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95%, with a correlation coefficient of 0.78, and a sensitivity and specificity for coding bases of 83% and 79%. In addition, MORGAN identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly. This paper describes the MORGAN system, including its decision tree routines and the algorithms for site recognition, and its performance on a benchmark database of vertebrate DNA.


Subject(s)
Algorithms , DNA/genetics , Decision Trees , Genes , DNA/classification , Decision Support Techniques , Markov Chains
4.
Bioinformatics ; 16(2): 152-8, 2000 Feb.
Article in English | MEDLINE | ID: mdl-10842737

ABSTRACT

MOTIVATION: The main goal in this paper is to develop accurate probabilistic models for important functional regions in DNA sequences (e.g. splice junctions that signal the beginning and end of transcription in human DNA). These methods can subsequently be utilized to improve the performance of gene-finding systems. The models built here attempt to model long-distance dependencies between non-adjacent bases. RESULTS: An efficient modeling method is described which models biological data more accurately than a first-order Markov model without increasing the number of parameters. Intuitively, a small number of parameters helps a learning system to avoid overfitting. Several experiments with the model are presented, which show a small improvement in the average accuracy as compared with a simple Markov model. These experiments suggest that single long distance dependencies do not help the recognition problem, thus confirming several previous studies which have used more heuristic modeling techniques. AVAILABILITY: This software is available for download and as a web resource at http://www.ai.uic.edu/software. CONTACT: kasif@eecs.uic.edu


Subject(s)
Computer Simulation , DNA/analysis , Models, Statistical , Neural Networks, Computer , RNA Splicing , Bayes Theorem , Humans , Software
5.
Article in English | MEDLINE | ID: mdl-7584325

ABSTRACT

In this paper we study the performance of probabilistic networks in the context of protein sequence analysis in molecular biology. Specifically, we report the results of our initial experiments applying this framework to the problem of protein secondary structure prediction. One of the main advantages of the probabilistic approach we describe here is our ability to perform detailed experiments where we can experiment with different models. We can easily perform local substitutions (mutations) and measure (probabilistically) their effect on the global structure. Window-based methods do not support such experimentation as readily. Our method is efficient both during training and during prediction, which is important in order to be able to perform many experiments with different networks. We believe that probabilistic methods are comparable to other methods in prediction quality. In addition, the predictions generated by our methods have precise quantitative semantics which is not shared by other classification methods. Specifically, all the causal and statistical independence assumptions are made explicit in our networks thereby allowing biologists to study and experiment with different causal models in a convenient manner.


Subject(s)
Models, Molecular , Protein Structure, Secondary , Algorithms , Bayes Theorem , Decision Trees , Markov Chains , Models, Genetic , Mutation , Neural Networks, Computer , Reproducibility of Results
6.
Nucleic Acids Res ; 26(2): 544-8, 1998 Jan 15.
Article in English | MEDLINE | ID: mdl-9421513

ABSTRACT

This paper describes a new system, GLIMMER, for finding genes in microbial genomes. In a series of tests on Haemophilus influenzae, Helicobacter pylori and other complete microbial genomes, this system has proven to be very accurate at locating virtually all the genes in these sequences, outperforming previous methods. A conservative estimate based on experiments on H. pylori and H. influenzae is that the system finds >97% of all genes. GLIMMER uses interpolated Markov models (IMMs) as a framework for capturing dependencies between nearby nucleotides in a DNA sequence. An IMM-based method makes predictions based on a variable context; i.e., a variable-length oligomer in a DNA sequence. The context used by GLIMMER changes depending on the local composition of the sequence. As a result, GLIMMER is more flexible and more powerful than fixed-order Markov methods, which have previously been the primary content-based technique for finding genes in microbial DNA.


Subject(s)
DNA, Bacterial/analysis , Markov Chains , Algorithms , Base Sequence , DNA, Bacterial/chemistry , Haemophilus influenzae/genetics , Helicobacter pylori/genetics , Open Reading Frames , Sensitivity and Specificity , Sequence Alignment , Software
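
The variable-context idea described in the abstract of entry 6 can be illustrated with a toy interpolated Markov model. This is a minimal sketch, not GLIMMER's implementation: the class name, the pseudocount, and the count-based weight `total / (total + tau)` are assumptions chosen for brevity, and GLIMMER's actual weighting scheme is not reproduced here.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class InterpolatedMarkovModel:
    """Toy IMM: blends predictions from contexts of length 0..max_order."""

    def __init__(self, max_order=3, pseudo=1.0, tau=10.0):
        self.max_order = max_order
        self.pseudo = pseudo  # pseudocount so unseen bases keep nonzero probability
        self.tau = tau        # assumed smoothing constant for the blend weight
        # counts[k][context][base] = times `base` followed a length-k context
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            for i, base in enumerate(seq):
                for k in range(min(i, self.max_order) + 1):
                    self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context, base):
        """P(base | context), interpolated from short to long contexts."""
        p = None
        for k in range(min(len(context), self.max_order) + 1):
            ctx = context[len(context) - k:]
            table = self.counts[k].get(ctx, {})
            total = sum(table.values())
            p_k = (table.get(base, 0) + self.pseudo) / (
                total + self.pseudo * len(ALPHABET))
            if p is None:
                p = p_k  # order-0 estimate is the starting point
            else:
                # Frequently seen contexts get more weight than rare ones.
                lam = total / (total + self.tau)
                p = lam * p_k + (1.0 - lam) * p
        return p

    def log_score(self, seq):
        """Log-probability of a sequence under the model."""
        return sum(
            math.log(self.prob(seq[max(0, i - self.max_order):i], seq[i]))
            for i in range(len(seq)))


if __name__ == "__main__":
    model = InterpolatedMarkovModel(max_order=3)
    model.train(["ACG" * 20])
    print(model.log_score("ACGACG"), model.log_score("TTTTTT"))
```

Because a rarely observed long context contributes little weight, the estimate falls back toward shorter contexts where data are sparse, which is the flexibility the abstract contrasts with fixed-order Markov methods.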
7.
Nucleic Acids Res ; 27(23): 4636-41, 1999 Dec 01.
Article in English | MEDLINE | ID: mdl-10556321

ABSTRACT

The GLIMMER system for microbial gene identification finds approximately 97-98% of all genes in a genome when compared with published annotation. This paper reports on two new results: (i) significant technical improvements to GLIMMER that improve its accuracy still further, and (ii) a comprehensive evaluation that demonstrates that the accuracy of the system is likely to be higher than previously recognized. A significant proportion of the genes missed by the system appear to be hypothetical proteins whose existence is only supported by the predictions of other programs. When the analysis is restricted to genes that have significant homology to genes in other organisms, GLIMMER misses <1% of known genes.


Subject(s)
Genes, Bacterial , Genetic Techniques/standards , Algorithms , Markov Chains , Models, Genetic
8.
Genomics ; 59(1): 24-31, 1999 Jul 01.
Article in English | MEDLINE | ID: mdl-10395796

ABSTRACT

Computational gene finding research has emphasized the development of gene finders for bacterial and human DNA. This has left genome projects for some small eukaryotes without a system that addresses their needs. This paper reports on a new system, GlimmerM, that was developed to find genes in the malaria parasite Plasmodium falciparum. Because the gene density in P. falciparum is relatively high, the system design was based on a successful bacterial gene finder, Glimmer. The system was augmented with specially trained modules to find splice sites and was trained on all available data from the P. falciparum genome. Although a precise evaluation of its accuracy is impossible at this time, laboratory tests (using RT-PCR) on a small selection of predicted genes confirmed all of those predictions. With the rapid progress in sequencing the genome of P. falciparum, the availability of this new gene finder will greatly facilitate the annotation process.


Subject(s)
Genes, Protozoan/genetics , Markov Chains , Algorithms , Alternative Splicing , Animals , Chromosomes/genetics , Databases, Factual , Gene Expression , Genome, Protozoan , Internet , Plasmodium falciparum/genetics , Reproducibility of Results , Reverse Transcriptase Polymerase Chain Reaction , Sequence Alignment
9.
Nucleic Acids Res ; 27(11): 2369-76, 1999 Jun 01.
Article in English | MEDLINE | ID: mdl-10325427

ABSTRACT

A new system for aligning whole genome sequences is described. Using an efficient data structure called a suffix tree, the system is able to rapidly align sequences containing millions of nucleotides. Its use is demonstrated on two strains of Mycobacterium tuberculosis, on two less similar species of Mycoplasma bacteria and on two syntenic sequences from human chromosome 12 and mouse chromosome 6. In each case it found an alignment of the input sequences, using between 30 s and 2 min of computation time. From the system output, information on single nucleotide changes, translocations and homologous genes can easily be extracted. Use of the algorithm should facilitate analysis of syntenic chromosomal regions, strain-to-strain comparisons, evolutionary comparisons and genomic duplications.


Subject(s)
Algorithms , Genome, Bacterial , Mycoplasma/genetics , Sequence Alignment/methods , Animals , Base Sequence , DNA , Humans , Mice , Molecular Sequence Data
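
Whole-genome aligners of the kind described in entry 9 typically start by finding exact matches shared by the two sequences and then build the alignment around them. The sketch below illustrates that anchoring step only, and it deliberately swaps the suffix tree for a simple k-mer dictionary: it seeds on k-mers that occur exactly once in each sequence and extends them to maximal exact matches. The function names and the default `k` are assumptions, and this shortcut can miss matches whose every k-mer is repeated, which a suffix tree would not.

```python
from collections import Counter


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in `seq` to its start index."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return {kmer: i for i, kmer in enumerate(kmers) if counts[kmer] == 1}


def unique_seeded_matches(a, b, k=4):
    """Return sorted (start_a, start_b, length) exact matches between a and b.

    Each match contains a k-mer that is unique in both sequences and cannot
    be extended to the left or right without hitting a mismatch.
    """
    unique_a = unique_kmer_positions(a, k)
    unique_b = unique_kmer_positions(b, k)
    matches = set()
    for kmer, i in unique_a.items():
        j = unique_b.get(kmer)
        if j is None:
            continue
        # Extend to the left while the preceding characters agree.
        while i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            i -= 1
            j -= 1
        # Extend to the right from the new start.
        length = 0
        while (i + length < len(a) and j + length < len(b)
               and a[i + length] == b[j + length]):
            length += 1
        matches.add((i, j, length))  # the set removes duplicate extensions
    return sorted(matches)


if __name__ == "__main__":
    a = "AAAACGTGCATTTT"
    b = "CCACGTGCAGGG"
    for start_a, start_b, length in unique_seeded_matches(a, b):
        print(start_a, start_b, a[start_a:start_a + length])
```

The dictionary approach uses memory proportional to the number of k-mers and is only practical for short inputs; the suffix tree named in the abstract is what makes the same search feasible on sequences with millions of nucleotides.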
10.
Bioinformatics ; 17 Suppl 1: S132-9, 2001.
Article in English | MEDLINE | ID: mdl-11473002

ABSTRACT

Two different strategies for determining the human genome are currently being pursued: one is the "clone-by-clone" approach, employed by the publicly funded project, and the other is the "whole genome shotgun assembler" approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called compartmentalized shotgun assembly, makes use of preliminary data produced by both approaches. In this paper we describe the design, implementation and operation of the "compartmentalized shotgun assembler".


Subject(s)
Cloning, Molecular/methods , Genome, Human , Chromosomes, Artificial, Bacterial/genetics , Computational Biology , Databases, Nucleic Acid , Humans , Sequence Analysis, DNA/statistics & numerical data , Software
11.
J Bacteriol ; 184(19): 5479-90, 2002 10.
Article in English | MEDLINE | ID: mdl-12218036

ABSTRACT

Virulence and immunity are poorly understood in Mycobacterium tuberculosis. We sequenced the complete genome of the M. tuberculosis clinical strain CDC1551 and performed a whole-genome comparison with the laboratory strain H37Rv in order to identify polymorphic sequences with potential relevance to disease pathogenesis, immunity, and evolution. We found large-sequence and single-nucleotide polymorphisms in numerous genes. Polymorphic loci included a phospholipase C, a membrane lipoprotein, members of an adenylate cyclase gene family, and members of the PE/PPE gene family, some of which have been implicated in virulence or the host immune response. Several gene families, including the PE/PPE gene family, also had significantly higher synonymous and nonsynonymous substitution frequencies compared to the genome as a whole. We tested a large sample of M. tuberculosis clinical isolates for a subset of the large-sequence and single-nucleotide polymorphisms and found widespread genetic variability at many of these loci. We performed phylogenetic and epidemiological analysis to investigate the evolutionary relationships among isolates and the origins of specific polymorphic loci. A number of these polymorphisms appear to have occurred multiple times as independent events, suggesting that these changes may be under selective pressure. Together, these results demonstrate that polymorphisms among M. tuberculosis strains are more extensive than initially anticipated, and genetic variation may have an important role in disease pathogenesis and immunity.


Subject(s)
Evolution, Molecular , Genome, Bacterial , Mycobacterium tuberculosis/pathogenicity , Sequence Analysis, DNA , Tuberculosis/microbiology , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Genetic Variation , Humans , Molecular Sequence Data , Mycobacterium tuberculosis/genetics , Mycobacterium tuberculosis/immunology , Phylogeny , Polymorphism, Genetic , Polymorphism, Single Nucleotide , Sequence Alignment , Tuberculosis/immunology