Results 1 - 8 of 8
1.
Bioinformatics ; 24(24): 2818-24, 2008 Dec 15.
Article in English | MEDLINE | ID: mdl-18952627

ABSTRACT

MOTIVATION: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data. RESULTS: Celera Assembler was modified for combinations of ABI 3730 and 454 FLX reads. The revised pipeline called CABOG (Celera Assembler with the Best Overlap Graph) is robust to homopolymer run length uncertainty, high read coverage and heterogeneous read lengths. In tests on four genomes, it generated the longest contigs among all assemblers tested. It exploited the mate constraints provided by paired-end reads from either platform to build larger contigs and scaffolds, which were validated by comparison to a finished reference sequence. A low rate of contig mis-assembly was detected in some CABOG assemblies, but this was reduced in the presence of sufficient mate pair data. AVAILABILITY: The software is freely available as open-source from http://wgs-assembler.sf.net under the GNU Public License.


Subject(s)
Sequence Analysis, DNA/methods , Software , Computational Biology/methods , Genome , Genomics
2.
J Comput Biol ; 12(7): 943-51, 2005 Sep.
Article in English | MEDLINE | ID: mdl-16201914

ABSTRACT

Algorithms for exact string matching have substantial application in computational biology. Time-efficient data structures which support a variety of exact string matching queries, such as the suffix tree and the suffix array, have been applied to such problems. As sequence databases grow, more space-efficient approaches to exact matching are becoming more important. One such data structure, the compressed suffix array (CSA), based on the Burrows-Wheeler transform, has been shown to require memory which is nearly equal to the memory requirements of the original database, while supporting common sorts of query problems time efficiently. However, building a CSA from a sequence in efficient space and time is challenging. In 2002, the first space-efficient CSA construction algorithm was presented. That implementation used (1 + 2 log2 |Σ|)(1 + ε) bits per character (where ε is a small fraction). The construction algorithm ran in as much as twice that space, in O(|Σ| n log(n)) time. We have created an implementation which can also achieve these asymptotic bounds, but for small alphabets, and only uses 1/2 (1 + |Σ|)(1 + ε) bits per character, a factor of 2 less space for nucleotide alphabets. We present time and space results for the CSA construction and querying of our implementation on publicly available genome data which demonstrate the practicality of this approach.


Subject(s)
Computational Biology/methods , Genomics/methods , Models, Genetic , Animals , Computational Biology/statistics & numerical data , Computer Simulation , Genomics/statistics & numerical data , Humans
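The "factor of 2" claim in the abstract above can be checked with a few lines of arithmetic. The sketch below only evaluates the two space formulas quoted in the abstract for a four-letter nucleotide alphabet; the function names and the value chosen for epsilon are illustrative assumptions, not taken from the paper.

```python
import math


def bits_per_char_2002(alphabet_size: int, epsilon: float) -> float:
    """Space bound quoted for the 2002 implementation: (1 + 2*log2|S|) * (1 + eps)."""
    return (1 + 2 * math.log2(alphabet_size)) * (1 + epsilon)


def bits_per_char_small_alphabet(alphabet_size: int, epsilon: float) -> float:
    """Space bound quoted for the newer implementation: 1/2 * (1 + |S|) * (1 + eps)."""
    return 0.5 * (1 + alphabet_size) * (1 + epsilon)


# Assumption: a plain nucleotide alphabet {A, C, G, T} and an arbitrary small epsilon.
NUCLEOTIDE_ALPHABET_SIZE = 4
EPSILON = 0.1

old_bits = bits_per_char_2002(NUCLEOTIDE_ALPHABET_SIZE, EPSILON)
new_bits = bits_per_char_small_alphabet(NUCLEOTIDE_ALPHABET_SIZE, EPSILON)
ratio = old_bits / new_bits

print(f"2002 bound:           {old_bits:.2f} bits per character")
print(f"small-alphabet bound: {new_bits:.2f} bits per character")
print(f"ratio:                {ratio:.2f}")
```

For |Σ| = 4 the (1 + ε) factor cancels, leaving (1 + 2·2) / (0.5·5) = 2, which matches the abstract. Because the second formula is linear in |Σ| while the first is logarithmic, the advantage holds only for small alphabets, as the abstract states.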
3.
J Comput Biol ; 12(6): 762-76, 2005.
Article in English | MEDLINE | ID: mdl-16108715

ABSTRACT

Recent sequencing of the human and other mammalian genomes has brought about the necessity to align them, to identify and characterize their commonalities and differences. Programs that align whole genomes generally use a seed-and-extend technique, starting from exact or near-exact matches and selecting a reliable subset of these, called anchors, and then filling in the remaining portions between the anchors using a combination of local and global alignment algorithms, but their choices for the parameters so far have been primarily heuristic. We present a statistical framework and practical methods for selecting a set of matches that is both sensitive and specific and can constitute a reliable set of anchors for a one-to-one mapping of two genomes from which a whole-genome alignment can be built. Starting from exact matches, we introduce a novel per-base repeat annotation, the Z-score, from which noise and repeat filtering conditions are explored. Dynamic programming-based chaining algorithms are also evaluated as context-based filters. We apply the methods described here to the comparison of two progressive assemblies of the human genome, NCBI build 28 and build 34 (www.genome.ucsc.edu), and show that a significant portion of the two genomes can be found in selected exact matches, with a very limited amount of sequence duplication.


Subject(s)
Chromosome Mapping , Genome , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Amino Acid Motifs , Models, Genetic
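The anchor-selection idea in the abstract above (start from exact matches, discard repeats, keep a consistent chain) can be illustrated with a toy sketch. This is not the paper's method: repeat filtering here is a crude "occurs exactly once in each sequence" rule standing in for the Z-score, and chaining is a plain longest-increasing-subsequence pass. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def unique_kmer_matches(seq_a, seq_b, k):
    """Return sorted (pos_a, pos_b) pairs for k-mers that occur exactly once in each sequence."""
    def index(seq):
        positions = defaultdict(list)
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]].append(i)
        return positions

    index_a, index_b = index(seq_a), index(seq_b)
    return sorted(
        (pos_a[0], index_b[kmer][0])
        for kmer, pos_a in index_a.items()
        if len(pos_a) == 1 and len(index_b.get(kmer, ())) == 1
    )


def chain_colinear(matches):
    """Keep the longest subset of matches whose pos_b values increase along with pos_a.

    `matches` must be sorted by pos_a; this is the classic O(n log n)
    longest-increasing-subsequence algorithm with back-pointers.
    """
    tail_index = []   # tail_index[i]: index of the match ending the best chain of length i + 1
    tail_pos_b = []   # pos_b of that match, kept in parallel for binary search
    previous = [None] * len(matches)
    for idx, (_, pos_b) in enumerate(matches):
        j = bisect_left(tail_pos_b, pos_b)
        previous[idx] = tail_index[j - 1] if j > 0 else None
        if j == len(tail_index):
            tail_index.append(idx)
            tail_pos_b.append(pos_b)
        else:
            tail_index[j] = idx
            tail_pos_b[j] = pos_b
    chain = []
    idx = tail_index[-1] if tail_index else None
    while idx is not None:
        chain.append(matches[idx])
        idx = previous[idx]
    return chain[::-1]


matches = unique_kmer_matches("ACGTAC", "TTACGT", 3)
anchors = chain_colinear(matches)
print(matches)
print(anchors)
```

In this toy input the 3-mer TAC matches out of order relative to ACG and CGT, so chaining drops it as inconsistent with a one-to-one, colinear mapping. A real whole-genome pipeline would use a much larger k, handle both strands, and weight matches by length.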
4.
Genome Res ; 15(1): 54-66, 2005 Jan.
Article in English | MEDLINE | ID: mdl-15632090

ABSTRACT

Designing effective and accurate tools for identifying the functional and structural elements in a genome remains at the frontier of genome annotation owing to incompleteness and inaccuracy of the data, limitations in the computational models, and shifting paradigms in genomics, such as alternative splicing. We present a methodology for the automated annotation of genes and their alternatively spliced mRNA transcripts based on existing cDNA and protein sequence evidence from the same species or projected from a related species using syntenic mapping information. At the core of the method is the splice graph, a compact representation of a gene, its exons, introns, and alternatively spliced isoforms. The putative transcripts are enumerated from the graph and assigned confidence scores based on the strength of sequence evidence, and a subset of the high-scoring candidates are selected and promoted into the annotation. The method is highly selective, eliminating the unlikely candidates while retaining 98% of the high-quality mRNA evidence in well-formed transcripts, and produces annotation that is measurably more accurate than some evidence-based gene sets. The process is fast, accurate, and fully automated, and combines the traditionally distinct gene annotation and alternative splicing detection processes in a comprehensive and systematic way, thus considerably aiding in the ensuing manual curation efforts.


Subject(s)
Alternative Splicing/genetics , Genes/genetics , Software , Animals , Cats , Cattle , Chickens/genetics , Computational Biology/methods , Dogs , Evolution, Molecular , Genome , Mice , Models, Genetic , Pan troglodytes/genetics , Papio/genetics , Predictive Value of Tests , Rats , Software/standards , Swine/genetics , Synteny/genetics , Takifugu/genetics , Zebrafish/genetics
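The splice graph described in the abstract above can be pictured as a directed graph with exons as nodes and observed introns as edges, where each source-to-sink path is a putative transcript. The sketch below is a minimal illustration under that assumption. It omits the confidence scoring and candidate selection the abstract mentions, and the data layout and function names are hypothetical.

```python
from collections import defaultdict


def build_splice_graph(evidence):
    """Build exon -> {next exons} adjacency from aligned transcripts.

    `evidence` is a list of transcripts, each a list of (start, end) exon
    tuples in ascending genomic order, so the resulting graph is acyclic.
    """
    graph = defaultdict(set)
    exons = set()
    for transcript in evidence:
        exons.update(transcript)
        for upstream, downstream in zip(transcript, transcript[1:]):
            graph[upstream].add(downstream)
    return graph, exons


def enumerate_transcripts(graph, exons):
    """List every path from an exon with no incoming edge to an exon with no outgoing edge."""
    has_incoming = {d for downstream in graph.values() for d in downstream}
    sources = sorted(e for e in exons if e not in has_incoming)
    paths = []

    def walk(exon, path):
        next_exons = sorted(graph.get(exon, ()))
        if not next_exons:
            paths.append(path)
            return
        for nxt in next_exons:
            walk(nxt, path + [nxt])

    for source in sources:
        walk(source, [source])
    return paths


# Hypothetical evidence: one full-length transcript and one that skips the middle exon.
E1, E2, E3 = (100, 200), (300, 400), (500, 600)
graph, exons = build_splice_graph([[E1, E2, E3], [E1, E3]])
for path in enumerate_transcripts(graph, exons):
    print(path)
```

The number of paths can grow exponentially with the number of alternative exons, which is one reason a real pipeline scores candidates against the sequence evidence and promotes only a high-scoring subset rather than keeping every enumerated path.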
5.
Proc Natl Acad Sci U S A ; 101(7): 1916-21, 2004 Feb 17.
Article in English | MEDLINE | ID: mdl-14769938

ABSTRACT

We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860-921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.


Subject(s)
Computational Biology , Genome, Human , Human Genome Project , Computational Biology/standards , Contig Mapping/standards , Humans , RNA, Messenger/analysis , Software
6.
J Comput Biol ; 11(5): 800-11, 2004.
Article in English | MEDLINE | ID: mdl-15700403

ABSTRACT

The alignment and mapping of large genomic sequences is the focus of much recent research. However, relatively little has been done so far about testing and validating alignment methods. We introduce criteria and new tools we have developed for alignment evaluation. These tools have already proved useful in the evaluation and ranking of several methods for assembly-to-assembly mapping, which were recently used to map multiple versions of the human genome to each other (Istrail et al., 2004).


Subject(s)
Computational Biology , Proteins/genetics , Sequence Alignment , Software , Algorithms , Amino Acid Sequence , Base Sequence , Computer Simulation , Markov Chains , Mutation , Proteins/chemistry
7.
Bioinformatics ; 18 Suppl 1: S294-302, 2002.
Article in English | MEDLINE | ID: mdl-12169559

ABSTRACT

MOTIVATION: Current genomic sequence assemblers assume that the input data is derived from a single, homogeneous source. However, recent whole-genome shotgun sequencing projects have violated this assumption, resulting in input fragments covering the same region of the genome whose sequences differ due to polymorphic variation in the population. While single-nucleotide polymorphisms (SNPs) do not pose a significant problem to state-of-the-art assembly methods, these methods do not handle insertion/deletion (indel) polymorphisms of more than a few bases. RESULTS: This paper describes an efficient method for detecting sequence discrepancies due to polymorphism that avoids resorting to global use of more costly, less stringent affine sequence alignments. Instead, the algorithm uses graph-based methods to determine the small set of fragments involved in each polymorphism and performs more sophisticated alignments only among fragments in that set. Results from the incorporation of this method into the Celera Assembler are reported for the D. melanogaster, H. sapiens, and M. musculus genomes.


Subject(s)
Algorithms , Consensus Sequence/genetics , DNA Fragmentation/genetics , Gene Expression Profiling/methods , Polymorphism, Genetic/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Base Sequence , Genetic Variation , Molecular Sequence Data , Polymorphism, Restriction Fragment Length
8.
Science ; 296(5573): 1661-71, 2002 May 31.
Article in English | MEDLINE | ID: mdl-12040188

ABSTRACT

The high degree of similarity between the mouse and human genomes is demonstrated through analysis of the sequence of mouse chromosome 16 (Mmu 16), which was obtained as part of a whole-genome shotgun assembly of the mouse genome. The mouse genome is about 10% smaller than the human genome, owing to a lower repetitive DNA content. Comparison of the structure and protein-coding potential of Mmu 16 with that of the homologous segments of the human genome identifies regions of conserved synteny with human chromosomes (Hsa) 3, 8, 12, 16, 21, and 22. Gene content and order are highly conserved between Mmu 16 and the syntenic blocks of the human genome. Of the 731 predicted genes on Mmu 16, 509 align with orthologs on the corresponding portions of the human genome, 44 are likely paralogous to these genes, and 164 genes have homologs elsewhere in the human genome; there are 14 genes for which we could find no human counterpart.


Subject(s)
Chromosomes/genetics , Genome, Human , Genome , Mice, Inbred Strains/genetics , Sequence Analysis, DNA , Synteny , Animals , Base Composition , Chromosomes, Human/genetics , Computational Biology , Conserved Sequence , Databases, Nucleic Acid , Evolution, Molecular , Genes , Genetic Markers , Genomics , Humans , Mice , Mice, Inbred A/genetics , Mice, Inbred DBA/genetics , Molecular Sequence Data , Physical Chromosome Mapping , Proteins/chemistry , Proteins/genetics , Sequence Alignment , Species Specificity